# Introduction
This work comprises the execution of different text processing and analysis tasks applied to patent documents in XML format. The required task is to extract the __grant_id, patent_kind, patent_title, number_of_claims, citations_examiner_count, citations_applicant_count, inventors, claims_text__ and __abstract__ features from a given text file and store them into a tidy dataset in __CSV__ and __JSON__ format. More details for each task will be given in the following sections.

External libraries allowed: Regular Expressions & Pandas.

# Libraries used

In [1]:
import re

# Loading and examining the data

As a first step, the file Group135.txt was examined to become familiar with its structure and main features. This was done not only with Python but also with text editors such as `vim` and the text viewer feature of `PyCharm`.

The first lines of the input file are printed below.

In [2]:
# Inspecting the first lines of the input file
with open('Group135.txt', mode='r') as file_handler:
    input_file = file_handler.readlines()
for line in input_file[:15]:
    print(line, end='')

<?xml version="1.0" encoding="UTF-8"?>
<us-patent-grant lang="EN" dtd-version="v4.5 2014-04-03" file="US10357528-20190723.XML" status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20190709" date-publ="20190723">
<us-bibliographic-data-grant>
<publication-reference>
<document-id>
<country>US</country>
<doc-number>10357528</doc-number>
<kind>B2</kind>
<date>20190723</date>
</document-id>
</publication-reference>
<application-reference appl-type="utility">
<document-id>
<country>US</country>
<doc-number>15742391</doc-number>


As the first 15 lines show, the document appears to be properly formatted, which makes it possible to find and obtain the information of interest using regular expressions.
Some of the required information is already visible in the first 10 lines, such as the grant_id, a combination of the data under the `<country>` and `<doc-number>` tags, and what seems to be a code for patent_kind under the `<kind>` tag.

The previous output shows the string `<?xml version="1.0" encoding="UTF-8"?>` at the beginning of a patent grant record. Further inspection of the file using text editors confirmed that every patent grant record starts with this string. Hence, it is a good delimiter for splitting the input file into one item per patent.

To store each patent record as a separate item in a list, the following was performed:

In [3]:
# Opening the input data file
# File is read as a large string, then the '\n' character is eliminated to facilitate the use of regular expressions.
# Finally the string file is split, separating each patent grant in a single element in a list.

with open('Group135.txt', mode='r') as file_handler:
    input_file = file_handler.read().replace('\n', '')
data = input_file.split('<?xml version="1.0" encoding="UTF-8"?>')[1:]
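
The splitting step can be illustrated on a small in-memory string. This is a minimal sketch, assuming a hypothetical two-record sample in place of Group135.txt; the names `XML_DECLARATION`, `sample`, `flat` and `records` are introduced here for illustration only.

```python
# Hypothetical two-record sample standing in for the real input file
XML_DECLARATION = '<?xml version="1.0" encoding="UTF-8"?>'
sample = (
    XML_DECLARATION + '\n<us-patent-grant><kind>B2</kind>\n</us-patent-grant>\n'
    + XML_DECLARATION + '\n<us-patent-grant><kind>S1</kind>\n</us-patent-grant>\n'
)

# Remove line breaks so that '.' in later regular expressions can span a whole record
flat = sample.replace('\n', '')

# Splitting on the declaration leaves an empty string before the first record,
# which is why element 0 is discarded
records = flat.split(XML_DECLARATION)[1:]

print(len(records))
for record in records:
    print(record)
```

Because the text before the first declaration is empty, dropping element 0 leaves exactly one list item per patent grant.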

## Parsing Group135.txt File

Making use of text editors and sample output files provided, the location and xml tags for every feature of interest were identified. Then, a general approach was applied to extract the desired data:
1. Extracting a smaller section of the whole patent text where the data of interest is nested.
2. Extracting the data of interest.
3. Performing some data manipulation to get to the desired format, according to the sample files.

A smaller section of the text was extracted before getting the actual data (step 1) to prevent the regular expressions from extracting unwanted data that sits under the same tag structure in other sections of the patent grant data.
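
The benefit of step 1 can be seen on a shortened record. The sketch below assumes a hypothetical sample built from the tags shown in the file preview; it uses `.*?` after the tag name so that an opening tag without attributes is matched on its own, which is a small deviation from the patterns used in the functions of this notebook.

```python
import re

# Hypothetical, shortened record: both sections contain a <doc-number> tag
sample = (
    '<publication-reference><document-id><country>US</country>'
    '<doc-number>10357528</doc-number><kind>B2</kind></document-id>'
    '</publication-reference>'
    '<application-reference appl-type="utility"><document-id><country>US</country>'
    '<doc-number>15742391</doc-number></document-id></application-reference>'
)

doc_number_pattern = r'<doc-number>(.+?)</doc-number>'

# Searching the whole record returns numbers from both sections
all_numbers = re.findall(doc_number_pattern, sample)

# Step 1: narrow the text to the section of interest
section = re.search(
    r'<publication-reference.*?>.+?</publication-reference>', sample
).group(0)

# Step 2: extract the data from that section only
publication_numbers = re.findall(doc_number_pattern, section)

print(all_numbers)
print(publication_numbers)
```

Without the narrowing step, the application number would be picked up alongside the publication number.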

### grant_id feature
The grant_id data was found nested inside the `<publication-reference>` tag, specifically as a combination of the data under the `<country>` and `<doc-number>` tags. The general approach was therefore applied and a function `getGrantId` was defined to extract the required data. The function returns a string in which country and doc-number are concatenated.
The regular expression captures two groups: the country code, located between `<country>` and the start of its closing tag (`</co`), and the document number, located between the end of the `<doc-number>` opening tag and the start of its closing tag (`</doc`). Both groups are then joined into a single ID.

In [4]:
# Function to generate grant_id data
def getGrantId(data):
    # Extracting desired sub section of text
    section_pattern = r'<publication-reference.+?>.+?</publication-reference>' 
    text_section = re.findall(section_pattern, data)
    
    # Extracting wanted data
    grant_id_pattern = r'<country>(.+?)</co.+?number>(.+?)</doc'
    grant_id = re.findall(grant_id_pattern, ''.join(text_section))
    grant_id = ''.join(grant_id[0]) # Convert output list of tuples into string
    return str(grant_id)

### patent_title feature
This information was located under the tag `<invention-title>`. A function `getPatentTitle` was created, which returns different outputs according to the `fileType` argument. This argument allows the function to return appropriate output for writing into a CSV file or a JSON file: for CSV, `"` is added around the returned string when it contains a comma; for JSON, `/` is escaped as `\/`. The function also removes HTML tags found in the patent_title data using the regular expression `<.+?>` (the method used for finding HTML tags and special XML characters is detailed in the next section).

In [5]:
# Function to extract patent_title feature
def getPatentTitle(data, fileType = 'csv'):
    patent_title_pattern = r'<invention-title.+?>(.+?)</invention-title>'
    patent_title = re.findall(patent_title_pattern, data)
    patent_title = ''.join(patent_title[0])
    
    # Eliminating tags
    patent_title = re.sub(r'<.+?>', '', patent_title)
    
    # Formatting for different output files
    if ',' in patent_title and fileType != 'json':
        patent_title = '"' + patent_title + '"'
    elif '/' in patent_title and fileType == 'json':
        patent_title = patent_title.replace('/', '\\/')  # escape '/' as '\/' for JSON
    return patent_title
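
The three transformations applied to a title can be followed step by step on a single string. This is a minimal sketch, assuming a hypothetical title; the variable names are illustrative and not part of the notebook's functions.

```python
import re

# Hypothetical raw title containing a tag, a comma and a slash
raw_title = 'Methods, compositions and <i>in vitro</i> assays for feed/food additives'

# Remove any tag-like substring such as <i> or </i>
clean_title = re.sub(r'<.+?>', '', raw_title)

# CSV variant: wrap in double quotes only when the value contains a comma
csv_title = '"' + clean_title + '"' if ',' in clean_title else clean_title

# JSON variant: write '/' as the two characters backslash + slash
json_title = clean_title.replace('/', '\\/')

print(clean_title)
print(csv_title)
print(json_title)
```

Using `str.replace` with `'\\/'` avoids the invalid `'\/'` escape sequence in a regular string literal while producing the same characters.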

### kind feature
The kind feature data was nested inside the `<publication-reference>` section, under the tag `<kind>`. The data was a code that stands for a longer description, so a dictionary was used to translate the code into the required string. The dictionary was built from the sample input and output files provided and has the following structure:

```python
kind_dict = {"B2": "Utility Patent Grant (with a published application) issued on or after January 2, 2001.", 
             "S1": "Design Patent", 
             "E1": "Reissue Patent", 
             "B1": "Utility Patent Grant (no published application) issued on or after January 2, 2001.",
             "P2": "Plant Patent Grant (no published application) issued on or after January 2, 2001",
             "P3": "Plant Patent Grant (with a published application) issued on or after January 2, 2001"}
```
The function `getKind` was defined to make use of the dictionary and format the output as desired.
The regular expression captures the key between the `<kind>` tags, and the key is then replaced with the corresponding value from the dictionary.


In [6]:
# Function to extract kind code and translate to desired string using kind dictionary
def getKind(data, fileType = 'csv'):
    section_pattern = r'<publication-reference.+?>.+?</publication-reference>'
    text_section = re.findall(section_pattern, data)
    kind_pattern = r'<kind>(.+?)</kind>'
    kind_code = re.findall(kind_pattern, ''.join(text_section))
    kind_code = ''.join(kind_code[0])
    
    # Obtaining string-value from dictionary
    kind = kind_dict.get(kind_code)
    
    # Formatting for different output files
    if ',' in  kind and fileType != 'json':
        kind = '"' + kind + '"'
    return kind
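
As written, `getKind` relies on every kind code being present in the dictionary: `kind_dict.get` returns `None` for an unknown code, and the subsequent `',' in kind` check then raises a `TypeError`. The sketch below shows a defensive variant. The helper name `translate_kind`, the abbreviated dictionary and the `'NA'` fallback are assumptions made for illustration, not part of the original solution.

```python
import re

# Abbreviated, hypothetical subset of the kind dictionary
kind_codes = {
    "S1": "Design Patent",
    "E1": "Reissue Patent",
}

def translate_kind(record, default='NA'):
    """Return the description for the record's kind code, or `default`."""
    match = re.search(r'<kind>(.+?)</kind>', record)
    if match is None:
        return default  # no <kind> tag in the record
    return kind_codes.get(match.group(1), default)  # unknown codes fall back too

print(translate_kind('<kind>S1</kind>'))
print(translate_kind('<kind>A1</kind>'))
```

For the provided input file this fallback should never be triggered, since the dictionary was built from the codes that occur in the data.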

### number_of_claims feature
The number of claims was straightforward. It was found under the `<number-of-claims>` tag and consisted of a number that could be extracted with no further processing.
The following `getNumberOfClaims` function was defined:

In [7]:
# Function to extract number of claims
def getNumberOfClaims(data):
    num_of_claims_pattern = r'<number-of-claims>(.+?)</number-of-claims>'
    num_of_claims = re.findall(num_of_claims_pattern, data)
    num_of_claims = ''.join(num_of_claims[0])
    return num_of_claims

### inventors feature
This feature data was nested under the `<inventors>` tag, specifically under the `<last-name>` and `<first-name>` tags. The regular expression was designed to capture everything inside the opening and closing name tags. The extraction function then performs several actions to reach the desired format:
- Extracting the first and last name and combining them into one inventor;
- Separating inventors with commas;
- Wrapping the list of inventors in square brackets.

In [8]:
# Function for extracting inventors. Output is a string enclosed in square brackets
def getInventors(data, fileType = 'csv'):
    section_pattern = r'<inventors>.+?</inventors>'
    text_section = re.findall(section_pattern, data)
    inventors_pattern = r'<last-name>(.+?)</last-name><first-name>(.+?)</first-name>'
    inventors = re.findall(inventors_pattern, ''.join(text_section))
    
    # Arranging inventors data based on output sample files
    inventors = '[' + ','.join([item[1]+' '+item[0] for item in inventors]) + ']'
    
    # Formatting for different output files
    if ',' in inventors and fileType != 'json':
        inventors = '"' + inventors + '"'
    elif '/' in inventors and fileType == 'json':
        inventors = inventors.replace('/', '\\/')  # escape '/' as '\/' for JSON
    return inventors
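
The effect of the two capture groups can be checked on a short example. This sketch assumes a hypothetical `<inventors>` section with two inventors; the surrounding tag layout is simplified.

```python
import re

# Hypothetical, simplified <inventors> section
section = (
    '<inventors>'
    '<inventor><addressbook><last-name>Doe</last-name>'
    '<first-name>Jane</first-name></addressbook></inventor>'
    '<inventor><addressbook><last-name>Roe</last-name>'
    '<first-name>Richard</first-name></addressbook></inventor>'
    '</inventors>'
)

# Each match is a (last_name, first_name) tuple
pairs = re.findall(
    r'<last-name>(.+?)</last-name><first-name>(.+?)</first-name>', section
)

# Swap the order to "first last", join with commas, wrap in square brackets
inventors = '[' + ','.join(f'{first} {last}' for last, first in pairs) + ']'

print(pairs)
print(inventors)
```

Because the pattern expects `<first-name>` to follow `</last-name>` directly, it only works on input where line breaks have already been removed.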

### citations_count features
The tag `<us-references-cited>` contained all references for a particular patent. Each citation carried a category under the `<category>` tag, which indicated whether it was cited by the examiner or by the applicant. This was used to define the function `getCitations`, which searches for the pattern `cited by applicant` or `cited by examiner` and counts the matches.

In [9]:
# Function to obtain citations count. 
# Category argument allows to determine if the count is made for applicants or for examiners.
def getCitations(data, category):
    section_pattern = r'<us-references-cited>.+?</us-references-cited>'
    text_section = re.findall(section_pattern, data)
    
    # Choosing the search pattern for the requested category
    if category == 'applicant':
        cite_pattern = r'cited by applicant'
    elif category == 'examiner':
        cite_pattern = r'cited by examiner'
    else:
        raise ValueError("category must be 'applicant' or 'examiner'")
    
    # Computing number of citations
    result = len(re.findall(cite_pattern, ''.join(text_section)))
    return str(result)
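
Both counts can also be obtained from a single pass over the section. The following is a sketch under the assumption of a hypothetical, simplified citation section; it is an alternative way to view the counting, not the function used in this notebook.

```python
import re

# Hypothetical, simplified <us-references-cited> section
section = (
    '<us-references-cited>'
    '<us-citation><category>cited by applicant</category></us-citation>'
    '<us-citation><category>cited by examiner</category></us-citation>'
    '<us-citation><category>cited by applicant</category></us-citation>'
    '</us-references-cited>'
)

# Capture only the word that distinguishes the two categories
categories = re.findall(
    r'<category>cited by (applicant|examiner)</category>', section
)

counts = {name: categories.count(name) for name in ('applicant', 'examiner')}
print(counts)
```

Anchoring the pattern to the `<category>` tags guards against the phrase appearing elsewhere in the cited text.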

### claims_text feature
This feature was nested inside the tag `<claims>`. The extraction of claims data was an exception to the general procedure: once the smaller section of text was obtained, undesired data was removed (text cleaning) instead of extracting the relevant data. Undesired data consisted mainly of `<claim-text>` and `<claim-ref>` tags inside the claim text. The opening and closing `<claims>` tags were also removed, and each `</claim>` tag was replaced by a comma, as in the sample output files. Finally, any remaining XML tags were removed (the method used for finding HTML tags and special XML characters is detailed in the next section).

In [10]:
# Function to clean claims text.
def getClaims(data, fileType = 'csv'):
    section_pattern = r'<claims.+?>.+?</claims>'
    claims_section = re.findall(section_pattern, data)
    
    # Cleaning of undesired tags and unwanted text
    claim_tags_pattern = r'<claims.+?claim-text>|</?claim-text>|<claim .+?>|<claim-ref.+?>|</claim-ref>|</claims>'
    claims_section = re.sub(claim_tags_pattern, '', ''.join(claims_section))
    claims_section = [re.sub(r'</claim>', ',', ''.join(claims_section))[:-1]]
    
    # Cleaning detected xml/html tags
    claims_section = [re.sub(r'\<.+?>', '', ''.join(claims_section))]
    
    # Formatting according to sample output files
    claims_section = "[" + ''.join(claims_section) + "]"
    
    # Formatting for different output files
    if ',' in claims_section and fileType != 'json':
        claims_section = '"' + claims_section + '"'
    elif '/' in claims_section and fileType == 'json':
        claims_section = claims_section.replace('/', '\\/')  # escape '/' as '\/' for JSON
    return claims_section
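
The cleaning order matters: the `</claim>` tags have to be turned into commas before the generic tag removal, otherwise the claim boundaries are lost. The sketch below is a condensed variant of the cleaning on a hypothetical two-claim section, not the exact sequence of substitutions used in `getClaims`.

```python
import re

# Hypothetical, simplified <claims> section with a cross-reference
section = (
    '<claims id="claims">'
    '<claim id="CLM-00001" num="00001"><claim-text>1. A device comprising '
    'a sensor.</claim-text></claim>'
    '<claim id="CLM-00002" num="00002"><claim-text>2. The device of '
    '<claim-ref idref="CLM-00001">claim 1</claim-ref>, further comprising '
    'a battery.</claim-text></claim>'
    '</claims>'
)

# 1. Mark the end of every claim with a comma ('</claim>' does not match
#    '</claim-text>', '</claim-ref>' or '</claims>')
text = re.sub(r'</claim>', ',', section)

# 2. Remove every remaining tag
text = re.sub(r'<.+?>', '', text)

# 3. Drop the trailing comma and wrap the result in square brackets
claims_text = '[' + text.rstrip(',') + ']'

print(claims_text)
```

Note that commas inside a claim are indistinguishable from the separators in the result, which mirrors the format of the sample output files as described above.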

### abstract feature
Finally, the abstract was obtained. It was located under the `<abstract>` tag, inside `<p>` tags. The input file was inspected to ensure that every abstract consisted of a single paragraph. It was also found that the `<abstract>` tag was sometimes missing; in these cases, the string `'NA'` was imputed. The following `getAbstract` function was defined, which removes undesired HTML tags found in the text to yield a clean output.

In [11]:
# Function for extracting abstract text.
def getAbstract(data, fileType = 'csv'):
    section_pattern = r'<abstract.+?>.+?</abstract>'
    text_section = re.findall(section_pattern, data)
    abstract_pattern = r'<p.+?>(.+?)</p>'
    abstract = re.findall(abstract_pattern, ''.join(text_section))
    
    # Imputation of 'NA' string in cases of missing abstract tag
    if len(abstract) == 0:
        abstract = 'NA'
    else:
        abstract = ''.join(abstract[0])
    
    # Formatting for different output files
    if ',' in abstract and fileType != 'json':
        abstract = '"' + abstract + '"'
    elif '/' in abstract and fileType == 'json':
        abstract = abstract.replace('/', '\\/')  # escape '/' as '\/' for JSON
    
    # Eliminating undesired html/xml tags
    abstract = re.sub(r'<.+?>', '', abstract)
    return abstract
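
The handling of a missing abstract can be sketched with a single search. The helper name `extract_abstract` and the sample records are hypothetical; the pattern assumes, as stated above, that an abstract consists of one `<p>` paragraph.

```python
import re

def extract_abstract(record):
    """Return the cleaned abstract text, or 'NA' when no abstract is present."""
    match = re.search(r'<abstract.*?>.*?<p.*?>(.+?)</p>.*?</abstract>', record)
    if match is None:
        return 'NA'  # imputation for records without an <abstract> section
    return re.sub(r'<.+?>', '', match.group(1))  # strip inline tags such as <sub>

with_abstract = (
    '<abstract id="abstract"><p id="p-0001" num="0000">'
    'A compound of formula H<sub>2</sub>O.</p></abstract>'
)
without_abstract = '<claims id="claims"><claim>1. A device.</claim></claims>'

print(extract_abstract(with_abstract))
print(extract_abstract(without_abstract))
```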

## 3. Detecting unwanted data in functions output
### HTML/XML tags
The functions presented in the previous section were designed iteratively, based on interaction with the data and the corresponding function outputs. The first version of the functions did not handle undesired HTML/XML tags in the features containing written text (patent_title, claims, abstract, inventors). To find out which of these features contained unwanted tags, the first version of the functions was used as follows to collect the tags in lists:
```python
patent_title = []
inventors = []
claims = []
abstract = []

# Search unwanted tags in functions output and append to corresponding lists
for item in data:
    patent_title.append(''.join(re.findall(r'<.+?>', getPatentTitle(item))))
    claims.append(''.join(re.findall(r'<.+?>', getClaims(item))))
    abstract.append(''.join(re.findall(r'<.+?>', getAbstract(item))))
    inventors.append(''.join(re.findall(r'<.+?>', getInventors(item))))

# Data processing to ease data interpretation
patent_title = ''.join([c for c in patent_title if c != '']).split('>')
claims = ''.join([c for c in claims if c != '']).split('>')
abstract = ''.join([c for c in abstract if c != '']).split('>')
inventors = ''.join([c for c in inventors if c != '']).split('>')

# Print unique unwanted tags
print('patent_title:', set(patent_title))
print('claims:', set(claims))
print('abstract:', set(abstract))
print('inventors:', set(inventors))
```
The following results were obtained, indicating the presence of tag `<i>` in patent title data, and tags `<sub>`, `<i>` and `<b>` in abstract data:
```python
patent_title: {'', '</i', '<i'}
claims: {''}
abstract: {'', '<sub', '<i', '</i', '</b', '</sub', '<b'}
inventors: {''}
```
Given these results, the functions were modified to handle these cases, resulting in the final versions shown in the previous section.
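
The same kind of check can be written more compactly by capturing only the tag names. This is a sketch on hypothetical function outputs; the dictionary `outputs` and the pattern `tag_name_pattern` are assumptions introduced for illustration.

```python
import re

# Hypothetical outputs of the first version of the functions, grouped by feature
outputs = {
    'patent_title': ['Use of <i>Bacillus</i> strains'],
    'abstract': ['H<sub>2</sub>O and <b>bold</b> text', 'Plain text'],
    'inventors': ['[Jane Doe]'],
}

# Capture the tag name of opening and closing tags alike
tag_name_pattern = r'</?([A-Za-z][\w-]*)[^>]*>'

unwanted_tags = {
    feature: set(re.findall(tag_name_pattern, ' '.join(values)))
    for feature, values in outputs.items()
}

for feature, tags in unwanted_tags.items():
    print(feature, sorted(tags))
```

Requiring a letter after `<` keeps comparisons such as `<7` in running text from being reported as tags.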

### XML special characters
A similar approach was applied to detect unwanted XML special characters in the extracted data. In this case more items were found, with 39 different special characters detected. The following was performed:
```python
patent_title = []
inventors = []
claims = []
abstract = []

# Search unwanted special characters in functions output and append to corresponding lists
for item in data:
    patent_title.append(''.join(re.findall(r'&.+?;', getPatentTitle(item))))
    claims.append(''.join(re.findall(r'&.+?;', getClaims(item))))
    abstract.append(''.join(re.findall(r'&.+?;', getAbstract(item))))
    inventors.append(''.join(re.findall(r'&.+?;', getInventors(item))))

# Data processing to ease data interpretation
patent_title = ''.join([c for c in patent_title if c != '']).split(';')
claims = ''.join([c for c in claims if c != '']).split(';')
abstract = ''.join([c for c in abstract if c != '']).split(';')
inventors = ''.join([c for c in inventors if c != '']).split(';')

# Print unique unwanted xml special characters
print('patent_title:', set(patent_title))
print('claims:', set(claims))
print('abstract:', set(abstract))
print('inventors:', set(inventors))
```
The above yielded the following results:
```python
patent_title: {'', '&#x2014', '&#x2019', '&#x3b1', '&#x2018'}
claims: {'', '&#x2229', '&#x2014', '&#x2205', '&#x2260', '&#x2261', '&#xb7', '&#x3c', '&#x3c4', '&#x2062', '&#x201c', '&#x222a', '&#x3b4', '&#x3b8', '&#x2212', '&#x2003', '&#x2264', '&#x2265', '&#x2207', '&#x2550', '&#x201d', '&#x3c1', '&#x3bd', '&#xb0', '&#x3c0', '&#x2208', '&#xd7', '&#x3b3', '&#x3e', '&#x2202', '&#x394', '&#x2019', '&#x3b1', '&#x2018', '&#x3ba', '&#x212b', '&#x3bc'}
abstract: {'', '&#x2014', '&#x2018', '&#x201d', '&#x3c', '&#x2212', '&#x2019', '&#x201c', '&#x2264', '&#x3b1', '&#xb0', '&#x3bc', '&#x3b4', '&#x3b3', '&#x3e'}
inventors: {'', '&#xe9', '&#xf4', '&#xe7'}
```
A dictionary was generated to handle these special characters, based on the information at http://www.howtocreate.co.uk/sidehtmlentity.html, which lists the symbol corresponding to every special character detected. The dictionary, shown below, was later used when extracting the data and writing the output CSV file (but not the JSON file):
```python
# xml special characters dictionary
special_char_dict = {'&#x2003;':' ', '&#x2014;':'—', '&#x2018;':'‘', '&#x2019;':'’', '&#x201c;':'“',
'&#x201d;':'”','&#x2062;':'', '&#x212b;':'Å', '&#x2202;':'∂', '&#x2205;':'∅', '&#x2207;':'∇', '&#x2208;':'∈','&#x2212;':'−', '&#x2229;':'∩', '&#x222a;':'∪', '&#x2260;':'≠', '&#x2261;':'≡', '&#x2264;':'≤', '&#x2265;':'≥', '&#x2550;':'═', '&#x394;':'Δ', '&#x3b1;':'α', '&#x3b3;':'γ', '&#x3b4;':'δ', '&#x3b8;':'θ', '&#x3ba;':'κ', '&#x3bc;':'μ', '&#x3bd;':'ν', '&#x3c;':'<', '&#x3c0;':'π', '&#x3c1;':'ρ', '&#x3c4;':'τ', '&#x3e;':'>', '&#xb0;':'°', '&#xb7;':'·', '&#xd7;':'×', '&#xe7;':'ç', '&#xe9;':'é', '&#xf4;':'ô'}
```
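
All detected items have the form `&#x…;`, that is, hexadecimal character references, so they could also be decoded generically with the `re` library alone. The sketch below is an alternative to the dictionary, not the approach used in this notebook; the helper name `decode_hex_entities` is hypothetical.

```python
import re

def decode_hex_entities(text):
    """Replace every hexadecimal character reference (&#x...;) with its character."""
    return re.sub(
        r'&#x([0-9a-fA-F]+);',
        lambda match: chr(int(match.group(1), 16)),  # hex code point -> character
        text,
    )

print(decode_hex_entities('Ren&#xe9; measured 5&#xb0; &#x2264; &#x3b8;'))
```

One difference from the dictionary is worth noting: the dictionary maps `&#x2062;` to an empty string and `&#x2003;` to a regular space, whereas the generic decoding keeps the original Unicode characters (an invisible times sign and an em space).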

## 4. Extracting data / writing output files
Finally, all dictionaries and functions presented above were combined in corresponding `for loops` to extract and clean the relevant data and immediately write the output to csv and json files. The procedure is described below.

In [12]:
# Generating kind dictionary needed for translating kind codes to desired text
kind_dict = {"B2": "Utility Patent Grant (with a published application) issued on or after January 2, 2001.", 
               "S1": "Design Patent", "E1": "Reissue Patent", 
               "B1": "Utility Patent Grant (no published application) issued on or after January 2, 2001.",
              "P2": "Plant Patent Grant (no published application) issued on or after January 2, 2001",
              "P3": "Plant Patent Grant (with a published application) issued on or after January 2, 2001"}


In [13]:
# xml special character dictionary
special_char_dict = {'&#x2003;':' ', '&#x2014;':'—', '&#x2018;':'‘', '&#x2019;':'’', '&#x201c;':'“', '&#x201d;':'”', '&#x2062;':'', \
                     '&#x212b;':'Å', '&#x2202;':'∂', '&#x2205;':'∅', '&#x2207;':'∇', '&#x2208;':'∈', '&#x2212;':'−', '&#x2229;':'∩',\
                      '&#x222a;':'∪', '&#x2260;':'≠', '&#x2261;':'≡', '&#x2264;':'≤', '&#x2265;':'≥', '&#x2550;':'═', '&#x394;':'Δ',\
                      '&#x3b1;':'α', '&#x3b3;':'γ', '&#x3b4;':'δ', '&#x3b8;':'θ', '&#x3ba;':'κ', '&#x3bc;':'μ', '&#x3bd;':'ν',\
                     '&#x3c;':'<', '&#x3c0;':'π', '&#x3c1;':'ρ', '&#x3c4;':'τ', '&#x3e;':'>', '&#xb0;':'°', '&#xb7;':'·', '&#xd7;':'×',\
                      '&#xe7;':'ç', '&#xe9;':'é', '&#xf4;':'ô'}

In [14]:
#Opening output file handlers
output_csv = open('Group135.csv', mode = 'w', encoding='utf-8')
output_json = open('Group135.json', mode = 'w', encoding='utf-8')

# Adding CSV file header and JSON opening curly bracket
output_csv.write("grant_id,patent_title,kind,number_of_claims,inventors,citations_applicant_count,citations_examiner_count,claims_text,abstract\n")
output_json.write("{")

1

In [15]:
# Writing JSON file while extracting data
for i in range(len(data)):
    
    # Calling functions to get data for JSON output file
    grant_id = getGrantId(data[i])
    patent_title = getPatentTitle(data[i], fileType = 'json')
    kind = getKind(data[i], fileType = 'json')
    number_of_claims = getNumberOfClaims(data[i])
    inventors = getInventors(data[i], fileType = 'json')
    citations_applicant = getCitations(data[i], category = 'applicant')
    citations_examiner = getCitations(data[i], category = 'examiner')
    claims = getClaims(data[i], fileType = 'json')
    abstract = getAbstract(data[i], fileType = 'json')

    
    # Writing output data to file
    output_json.write('"' + grant_id + '"' + ":{" \
                      + '"patent_title"' + ":" + '"' + patent_title + '"' + "," \
                      + '"kind"' + ":" + '"' + kind + '"' + "," \
                      + '"number_of_claims"' + ":" + number_of_claims + "," \
                      + '"inventors"' + ":" + '"' + inventors + '"' + "," \
                      + '"citations_applicant_count"' + ":" + citations_applicant + "," \
                      + '"citations_examiner_count"' + ":" + citations_examiner + "," \
                      + '"claims_text"' + ":" + '"' + claims + '"' + "," \
                      + '"abstract"' + ":" + '"' + abstract + '"' \
                      + "}" \
                     )


    if i == len(data)-1:
        output_json.write("}") #closing curly bracket
    else:
        output_json.write(",") #comma separating patent grants
        
# Preparing the data for the CSV file
# A shallow copy keeps the original `data` list untouched by the replacements below
data_csv = list(data)

# Replacing all special characters for csv file
for i in range(len(data_csv)):
    for (key, value) in special_char_dict.items():
        if key in data_csv[i]:
            data_csv[i] = data_csv[i].replace(key, value)

# Writing CSV file while extracting data
for i in range(len(data_csv)):
            
    # Calling functions to get data for CSV output file
    grant_id = getGrantId(data_csv[i])
    patent_title = getPatentTitle(data_csv[i])
    kind = getKind(data_csv[i])
    number_of_claims = getNumberOfClaims(data_csv[i])
    inventors = getInventors(data_csv[i])
    citations_applicant = getCitations(data_csv[i], category = 'applicant')
    citations_examiner = getCitations(data_csv[i], category = 'examiner')
    claims = getClaims(data_csv[i])
    abstract = getAbstract(data_csv[i])
    
    # CSV output lines
    output_csv.write(grant_id + "," \
                     + patent_title + "," \
                     + kind + "," \
                     + number_of_claims + "," \
                     + inventors + "," \
                     + citations_applicant + "," \
                     + citations_examiner + "," \
                     + claims + "," \
                     + abstract \
                     )
    if i != len(data_csv)-1:
        output_csv.write("\n") #line break for csv lines except last one


#Closing file handlers
output_csv.close()
output_json.close()
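
The extraction functions wrap a value in double quotes only when it contains a comma, and they do not escape double quotes that are already part of the text. A value containing both could therefore be split incorrectly by a CSV reader. The sketch below shows one possible quoting helper that follows the common CSV convention of doubling embedded quotes; the name `csv_field` is hypothetical and the helper is not part of the original solution.

```python
def csv_field(value):
    """Quote a CSV field when needed, doubling any embedded double quotes."""
    if any(ch in value for ch in ',"\n\r'):
        return '"' + value.replace('"', '""') + '"'
    return value

print(csv_field('Design Patent'))
print(csv_field('Jane Doe,Richard Roe'))
print(csv_field('the "first" unit, and a second unit'))
```

Whether the sample output files expect this convention is an assumption that would need to be checked against them before adopting the helper.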

## 5. Summary
This assessment measured the understanding of XML, CSV and JSON file structures and the ability to extract information from XML and write it to CSV and JSON formats. The assignment can be divided into the following blocks:

- **XML parsing and data extraction**. After exploring the input file and the given samples, the boundaries (XML tags) of each feature were identified for the subsequent data extraction. The required content was accessed with regular expressions from the `re` library.
- **Dealing with HTML/XML tags and special characters**. Residual HTML tags and XML special characters were detected in the textual content. The tags were removed with a regular expression, and an external HTML entity reference was used to build a dictionary that replaces the special characters with the corresponding symbols in the CSV file.
- **Output file structure compliance**. Output files have a specific structure. For example, in the JSON file the slash `/` is written in its escaped form `\/`. Likewise, commas inside the textual value of a feature can break the structure of the CSV file. To avoid this, the textual values of the features `patent_title, kind, inventors, claims_text, abstract` are wrapped in quotation marks when they contain a comma. These manipulations were applied to the extracted values at the data processing stage.
- **Exporting data to a specific format**. Two output files, in CSV and JSON format, were created using Python file handlers. Before writing the content, both files are prepared: the CSV file needs a header and the JSON file starts with a `{` sign. Two loops then extract the required content and write it to the files. At the end, no trailing new line is written to the CSV file and the JSON file is closed with a `}` bracket. Finally, both file handlers are closed, leaving the output files ready for further reading.

## 6. References
- Python Software Foundation, 2019. *re — Regular expression operations*. Retrieved from https://docs.python.org/3/library/re.html
- IBM Knowledge Center. *Markdown for Jupyter notebooks cheatsheet*. Retrieved from https://www.ibm.com/support/knowledgecenter/en/SSGNPV_1.1.3/dsx/markd-jupyter.html