# FIT5196 Assessment 1
## Parsing Raw Text Files

<br>

**Student Name:** Akshatha Shivashankar Chindalur 
**Student ID:** 29996503

**Student Name:** Pradnya Rajendra Alchetti
**Student ID:** 29595916

**Date:** 20/08/2019

**Version:** 1.0

**Environment:** Python 3.6.3 and Jupyter notebook

**Libraries used:**
* pandas - for dataframe, included in Anaconda Python 3.6
* re - for regular expression, included in Anaconda Python 3.6

## 1. Introduction:
This assignment focuses on analyzing text data and extracting meaningful information from a semi-structured text file. The dataset provided is a 17.4 MB text file which contains information in a semi-structured XML format. It describes several grants issued for IP patent claims.
Each patent grant consists of different fields such as patent_title, citations_network, patents_kind, claims etc.

The extracted grant data is then converted to CSV and JSON format.
Stored in these formats, the data is structured and therefore easier to work with.

The assignment is divided into the following tasks:

1. Extract the data using regular expressions, convert it into CSV format and store the output in a CSV file.
2. Extract the data using regular expressions, convert it to json format and store the output in a json file.

The sections that follow walk through the steps taken to accomplish these tasks.

## 2. Importing the required libraries

In [1]:
import re
import pandas as pd

## 3. Reading the data file
The given data file, Group113.txt, is read into a variable.

In [2]:
# The text file Group113.txt is opened in read mode 
# and read into the variable raw_data

with open('./Group113.txt') as data_file:
    raw_data = data_file.read()

## 4. Parsing and data extraction
All the regular expressions required for parsing the data are defined here.

In [3]:
# Raw strings pass the backslashes to the regex engine unchanged
regex_patents = r'<\?xml version="1\.0" encoding="UTF-8"\?>\n'  # regex to split the file into patents
regex_grant = r'<country>(.*)<\/country>\n<doc-number>(.*)<\/doc-number>\n'  # regex to extract the grant ID
regex_kind = r'<kind>(.*)<\/kind>\n'  # regex to extract patent kind
regex_title = r'<invention-title id=".*">(.*)<\/invention-title>'  # regex to extract patent title
regex_claims = r'<number-of-claims>(.*)<\/number-of-claims>'  # regex to extract number of claims
regex_cited_examiners = r'<category>cited by examiner<\/category>'  # regex to count citations by examiner
regex_cited_applicant = r'<category>cited by applicant<\/category>'  # regex to count citations by applicant
regex_claims_text = r'<claim id=".*">([\s\S]*?)<\/claim>'  # regex to extract claims text
regex_abstract = r'<abstract id="abstract">\n<p id=".*">(.*)<\/p>\n<\/abstract>'  # regex to extract abstract
regex_inventors = r'<inventor sequence=".*">\n<addressbook>\n<last-name>(.*)<\/last-name>\n<first-name>(.*)<\/first-name>'  # regex to extract inventors' names

A dictionary is defined to map the **kind** of patent to its corresponding **application type**.

In [4]:
kind_dict = {'B1': 'Utility Patent Grant (no published application) issued on or after January 2, 2001.', 
        'B2': 'Utility Patent Grant (with a published application) issued on or after January 2, 2001.', 
        'S1': 'Design Patent', 
        'P3': 'Plant Patent Grant (with a published application) issued on or after January 2, 2001',
        'P2': 'Plant Patent Grant (no published application) issued on or after January 2, 2001',
        'E1': 'Reissue Patent'}

Since each patent begins with the **XML declaration** line, the patents are separated using the regular expression `<\?xml version="1\.0" encoding="UTF-8"\?>\n`

In [5]:
patents = re.split(regex_patents, raw_data)

The first element of the resulting list is empty because the text file itself begins with the **XML declaration**. This element is therefore skipped during further processing of the raw data.
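The empty leading element can be reproduced on a toy input. This is a minimal sketch, assuming a made-up two-record string (`sample` and the `<grant>` tags are illustrative only and are not taken from the data file):

```python
import re

# Toy stand-in for the raw file: two records, each introduced by the XML declaration
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n<grant>first</grant>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n<grant>second</grant>\n'
)

parts = re.split(r'<\?xml version="1\.0" encoding="UTF-8"\?>\n', sample)

# The separator sits at the very start of the text, so the first piece is empty
print(parts)
# ['', '<grant>first</grant>\n', '<grant>second</grant>\n']
```

Slicing with `parts[1:]` is one simple way to skip that empty piece.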

In [6]:
print(patents[1])

<us-patent-grant lang="EN" dtd-version="v4.5 2014-04-03" file="US10360642-20190723.XML" status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20190709" date-publ="20190723">
<us-bibliographic-data-grant>
<publication-reference>
<document-id>
<country>US</country>
<doc-number>10360642</doc-number>
<kind>B2</kind>
<date>20190723</date>
</document-id>
</publication-reference>
<application-reference appl-type="utility">
<document-id>
<country>US</country>
<doc-number>14622553</doc-number>
<date>20150213</date>
</document-id>
</application-reference>
<us-application-series-code>14</us-application-series-code>
<us-term-of-grant>
<us-term-extension>394</us-term-extension>
</us-term-of-grant>
<classifications-ipcr>
<classification-ipcr>
<ipc-version-indicator><date>20120101</date></ipc-version-indicator>
<classification-level>A</classification-level>
<section>G</section>
<class>06</class>
<subclass>Q</subclass>
<main-group>50</main-group>
<subgroup>00</subgroup>
<symbol-position

We can observe that each grant has a semi-structured XML format in which some tags are clearly defined, while other XML tags start and end with **?**. Each grant also contains a lot of unwanted information, such as image and figure tags, which can be filtered out.

The following functions extract and return the grant ID and the patent kind from each grant. This is accomplished using the following two regular expressions:
* Regex to extract the grant ID

`<country>(.*)<\/country>\n<doc-number>(.*)<\/doc-number>\n`
 
 This regular expression captures the contents of the country tag and of the doc-number tag that follows it. The two captured values are concatenated to form the grant ID.

* Regex to extract patent kind

`<kind>(.*)<\/kind>\n`
 
 This regular expression captures the contents of the kind tag. The captured code is then looked up in `kind_dict`.

In [7]:
def get_grant_id(patent):
    grant = re.search(regex_grant, patent)
    return grant.group(1) + grant.group(2)

def get_kind(patent):
    kind = re.search(regex_kind, patent)
    return kind_dict[kind.group(1)]
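The two patterns can be tried on the `document-id` blocks of the first record printed earlier. This is a sketch rather than part of the original notebook; the variable names `sample`, `grant_id` and `kind_code` are chosen here for illustration:

```python
import re

# Publication and application references copied from the first record shown above
sample = (
    '<document-id>\n'
    '<country>US</country>\n'
    '<doc-number>10360642</doc-number>\n'
    '<kind>B2</kind>\n'
    '<date>20190723</date>\n'
    '</document-id>\n'
    '<document-id>\n'
    '<country>US</country>\n'
    '<doc-number>14622553</doc-number>\n'
    '<date>20150213</date>\n'
    '</document-id>\n'
)

# re.search returns only the first match, so the publication reference wins
grant = re.search(r'<country>(.*)</country>\n<doc-number>(.*)</doc-number>\n', sample)
kind = re.search(r'<kind>(.*)</kind>\n', sample)

grant_id = grant.group(1) + grant.group(2)
kind_code = kind.group(1)
print(grant_id, kind_code)  # US10360642 B2
```

Both `document-id` blocks match the grant pattern, so the result depends on the publication reference appearing before the application reference in each record.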

Each grant has a set of inventors associated with it, and each inventor has a first name and a last name. The following function extracts all such inventors and returns them as a single bracketed, comma-separated string. This is accomplished using the regular expression `<inventor sequence=".*">\n<addressbook>\n<last-name>(.*)<\/last-name>\n<first-name>(.*)<\/first-name>` 
This expression matches text starting with the inventor sequence tag, followed by the last name and the first name, and captures only the contents grouped using `(.*)`, that is, the last name and the first name.
The function uses the `findall` function of the `re` library, which finds all occurrences matching the regex and returns a list of the matched groups.  

In [8]:
def get_inventors(patent):
    inventor_list = []
    inventors = re.findall(regex_inventors, patent)
    
    if len(inventors) == 0:
        inventor_str = 'NA'
    else:
        for inventor in inventors:
            inventor_list.append(inventor[1] + " " + inventor[0])
        inventor_str = ",".join(inventor_list)
        
    return "[" + inventor_str + "]"
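When a pattern has more than one group, `re.findall` returns one tuple per match, which is why the function above indexes `inventor[1]` and `inventor[0]`. The sketch below assumes hypothetical inventor markup in the shape the regex expects; the `designation` attribute is an assumption, while the names come from the first row of the output shown later:

```python
import re

# Hypothetical markup for two inventors (attribute values are illustrative)
sample = (
    '<inventor sequence="001" designation="us-only">\n'
    '<addressbook>\n'
    '<last-name>Greene</last-name>\n'
    '<first-name>Kevin</first-name>\n'
    '</addressbook>\n'
    '</inventor>\n'
    '<inventor sequence="002" designation="us-only">\n'
    '<addressbook>\n'
    '<last-name>Lewis</last-name>\n'
    '<first-name>Justin</first-name>\n'
    '</addressbook>\n'
    '</inventor>\n'
)

pattern = (r'<inventor sequence=".*">\n<addressbook>\n'
           r'<last-name>(.*)</last-name>\n<first-name>(.*)</first-name>')

# Each match is a (last_name, first_name) tuple, in the order of the groups
matches = re.findall(pattern, sample)
names = [f"{first} {last}" for last, first in matches]

print(matches)  # [('Greene', 'Kevin'), ('Lewis', 'Justin')]
print(names)    # ['Kevin Greene', 'Justin Lewis']
```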

Each grant has multiple claims, and each claim consists of multiple claim-text elements. The following function extracts all the claims, cleans each claim text of unwanted tags such as chemistry, insert-start and delete-start, and returns the claims of a grant as a single bracketed string.
The following regex is used to accomplish this:
* Regex to extract claims text

  `<claim id=".*">([\s\S]*?)<\/claim>`

   This regular expression matches the text between `<claim>` and `</claim>`. Here `\s` represents any whitespace character and `\S` any non-whitespace character, so `[\s\S]` matches every character, including newlines. The `?` after `*` makes the quantifier lazy: the group stops at the first closing `</claim>` tag, so each claim is matched separately.
  
The extra tags inside each claim text are found using further regular expressions and replaced with an empty string.

In [9]:
def get_claims(patent):
    claim_list = []
    claims = re.findall(regex_claims_text, patent)
    
    if len(claims) == 0:
        claims_str = 'NA'
    else:
        for claim in claims:
            # the following substitutions replace all the claim-related tags with ''
            # (lazy .*? so that two claim-ref tags on one line are removed separately)
            claim = re.sub('<claim-ref idref="CLM-.*?">', '', claim)
            claim = re.sub('</claim-ref>', '', claim)
            claim = re.sub('<claim-text>','',claim)
            claim = re.sub('<\/claim-text>','',claim)
            
            # the following substitution will replace all bold and italic tags with ''
            claim = re.sub('<b>','',claim)
            claim = re.sub('<\/b>','',claim)
            claim = re.sub('<i>', '', claim)
            claim = re.sub('<\/i>', '', claim)
            
            # the following substitution will replace insert-start, delete-start, sub, sup tags with ''
            claim = re.sub('<sup>', '', claim)
            claim = re.sub('<\/sup>', '', claim)
            claim = re.sub('<sub>', '', claim)
            claim = re.sub('<\/sub>', '', claim)
            claim = re.sub('\n','',claim)
            claim = re.sub(r'<\?insert-start id="[A-Za-z0-9_\-]+"  date="[0-9]+" \?>', '', claim)
            claim = re.sub(r'<\?insert-end id="[A-Za-z0-9_\-]+" \?>', '', claim)
            claim = re.sub(r'<\?delete-start id="[A-Za-z0-9_\-]+"  date="[0-9]+" \?>', '', claim)
            claim = re.sub(r'<\?delete-end id="[A-Za-z0-9_\-]+" \?>', '', claim)
            claim = re.sub('<chemistry [\s\S]*?><\/chemistry>', '', claim)
            claim_list.append(claim)
        claims_str = ",".join(claim_list)
        claims_str = claims_str.replace('"','\\"')
        
    return "[" + claims_str + "]"
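The effect of the lazy quantifier in the claims regex can be seen on a toy input. This is a minimal sketch; the two one-word claims are made up for illustration:

```python
import re

# Two tiny, made-up claims on separate lines
sample = '<claim id="CLM-00001">first</claim>\n<claim id="CLM-00002">second</claim>'

# Lazy: the group stops at the first closing tag, giving one match per claim
lazy = re.findall(r'<claim id=".*">([\s\S]*?)</claim>', sample)

# Greedy: the group runs on to the last closing tag and swallows both claims
greedy = re.findall(r'<claim id=".*">([\s\S]*)</claim>', sample)

print(lazy)    # ['first', 'second']
print(greedy)  # ['first</claim>\n<claim id="CLM-00002">second']
```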

The following function extracts the abstract from each grant using the following regular expression:

`<abstract id="abstract">\n<p id=".*">(.*)<\/p>\n<\/abstract>`

This regular expression matches text starting with the abstract tag and captures the contents of the paragraph inside it. If a grant has no matching abstract, the function returns 'NA'.

In [10]:
def get_abstract(patent):
    abstract = re.search(regex_abstract,patent)
    if abstract is not None:
        abstract_str = abstract.group(1)
        abstract_str = re.sub('<b>','',abstract_str)
        abstract_str = re.sub('<\/b>','',abstract_str)
        return abstract_str
    return 'NA'
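The paired substitutions for opening and closing tags could also be folded into a single pattern. This is an alternative sketch, not the approach used in this notebook; `INLINE_TAGS` and `strip_inline_tags` are hypothetical names:

```python
import re

# One alternation covers the opening and closing forms of several inline tags
INLINE_TAGS = re.compile(r'</?(?:b|i|sub|sup)>')

def strip_inline_tags(text):
    """Remove <b>, <i>, <sub> and <sup> tags, keeping the text between them."""
    return INLINE_TAGS.sub('', text)

cleaned = strip_inline_tags('A <b>bold</b> claim about H<sub>2</sub>O')
print(cleaned)  # A bold claim about H2O
```

Compiling the pattern once avoids re-parsing it for every claim or abstract, although `re` also caches recently used patterns internally.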

## 5. Generating CSV file
From each grant, all the required fields such as grant_id, patent_title, patent_kind, claim_text and abstract are extracted and stored in a list.

In [12]:
parsed_patents = []

for each in patents[1:]:
    temp = []
    temp.append(get_grant_id(each))
    temp.append(re.search(regex_title, each).group(1))
    temp.append(get_kind(each))
    temp.append(int(re.search(regex_claims, each).group(1)))
    temp.append(get_inventors(each))
    temp.append(len(re.findall(regex_cited_applicant, each)))
    temp.append(len(re.findall(regex_cited_examiners, each)))
    temp.append(get_claims(each))
    temp.append(get_abstract(each))
    parsed_patents.append(temp)

Converting the above created list to a dataframe using pandas.

In [13]:
patent_df = pd.DataFrame(parsed_patents)
patent_df.columns = ["grant_id", "patent_title", "kind", "number_of_claims", "inventors",
             "citations_applicant_count", "citations_examiner_count", "claims_text", "abstract"]
patent_df.head()

| | grant_id | patent_title | kind | number_of_claims | inventors | citations_applicant_count | citations_examiner_count | claims_text | abstract |
|---|---|---|---|---|---|---|---|---|---|
| 0 | US10360642 | Global comments for a media item | Utility Patent Grant (with a published applica... | 17 | [Kevin Greene,Justin Lewis] | 11 | 17 | [1. A method comprising:identifying a request ... | Providing global comments for a media item is ... |
| 1 | US10362191 | Photoelectric conversion device and image read... | Utility Patent Grant (with a published applica... | 8 | [Kensuke Ohara] | 4 | 4 | [1. A photoelectric conversion device comprisi... | Provided is a photoelectric conversion device ... |
| 2 | US10358859 | System and method for inhibiting automatic mov... | Utility Patent Grant (with a published applica... | 20 | [Brian K. Lickfelt,Kentaro Yoshimura] | 149 | 29 | [1. A computer-implemented method for inhibiti... | A system and method for inhibiting automatic m... |
| 3 | US10358559 | Colored organic peroxide compositions | Utility Patent Grant (with a published applica... | 15 | [Marina Despotopoulou,Scot A. Swan,Leonard H. ... | 15 | 4 | [1. A colored organic peroxide composition com... | Stable organic peroxide compositions include a... |
| 4 | US10361066 | Ion implantation apparatus | Utility Patent Grant (with a published applica... | 12 | [Haruka Sasaki,Katsushi Fujita] | 7 | 4 | [1. An ion implantation apparatus comprising:a... | An ion implantation apparatus includes an ion ... |


Transforming the extracted data into a csv file

In [14]:
patent_df.to_csv('Group113.csv', encoding='utf-8', index=False)
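Fields such as `inventors` contain commas, so they must be quoted in the CSV file to survive a round trip. The sketch below uses the standard-library `csv` module as a stand-in to show the minimal quoting that `to_csv` also applies by default; the row is abridged from the first record above:

```python
import csv
import io

# Abridged first record: only the last field contains a comma
row = ["US10360642", "Global comments for a media item", "[Kevin Greene,Justin Lewis]"]

# Write the row to an in-memory buffer instead of a file
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
line = buffer.getvalue()
print(line)
# US10360642,Global comments for a media item,"[Kevin Greene,Justin Lewis]"

# Reading the line back restores the original three fields
parsed = next(csv.reader(io.StringIO(line)))
print(parsed == row)  # True
```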

## 6. Generating JSON file

The fields extracted from each grant are stored in JSON format: each grant ID is a top-level key whose value is a JSON object containing the corresponding grant fields.

In [15]:
last = len(patents) - 1  # index of the last patent in the list

# The JSON text is assembled by hand; the with-block closes the file even if an error occurs
with open('Group113.json', 'w') as file:
    file.write('{\n')
    for i in range(1, len(patents)):
        # every grant after the first starts on a new line
        if i > 1:
            file.write('\n  "'+get_grant_id(patents[i])+'":{\n')
        else:
            file.write('  "'+get_grant_id(patents[i])+'":{\n')
        file.write('    "patent_title":'+'"'+re.search(regex_title, patents[i]).group(1)+'",\n')
        file.write('    "kind":'+'"'+get_kind(patents[i])+'",\n')
        file.write('    "number_of_claims":'+str(int(re.search(regex_claims, patents[i]).group(1)))+',\n')
        file.write('    "inventors":'+'"'+get_inventors(patents[i])+'",\n')
        file.write('    "citations_applicant_count":'+str(len(re.findall(regex_cited_applicant, patents[i])))+',\n')
        file.write('    "citations_examiner_count":'+str(len(re.findall(regex_cited_examiners, patents[i])))+',\n')
        file.write('    "claims_text":'+'"'+get_claims(patents[i])+'",\n')

        # the last grant must not be followed by a comma
        if i == last:
            file.write('    "abstract":'+'"'+get_abstract(patents[i])+'"\n  }\n')
        else:
            file.write('    "abstract":'+'"'+get_abstract(patents[i])+'"\n  },')
    file.write('}')
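Hand-written JSON breaks if a field contains a double quote or a backslash that has not been escaped. As an alternative sketch, not the approach used above, the standard-library `json` module can serialise a nested dictionary and handle the escaping itself. The record below is hypothetical: the ID, title, claim count and inventors come from the first row shown earlier, while the abstract is invented to contain characters that need escaping:

```python
import json

# Hypothetical record keyed by grant ID
records = {
    "US10360642": {
        "patent_title": "Global comments for a media item",
        "number_of_claims": 17,
        "inventors": "[Kevin Greene,Justin Lewis]",
        "abstract": 'A "quoted" phrase and a back\\slash',
    }
}

# json.dumps escapes quotes and backslashes inside string values
text = json.dumps(records, indent=2)
print(text)

# Parsing the text back yields the original dictionary
restored = json.loads(text)
print(restored == records)  # True
```

The layout produced by `indent=2` differs slightly from the hand-written file, so this would only be a drop-in replacement if the exact whitespace is not part of the required output.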

## 7. Summary

In this assignment, we learnt text preprocessing and analysis by extracting data from a semi-structured file. This was accomplished by observing the common pattern in the text, the XML declaration that separates the IP patent grants. Each grant was then further analyzed to extract meaningful fields.

The challenging part was finding the different common patterns and designing regular expressions for them. Once the regexes were designed, it became easy to extract the data and store it in the desired formats, CSV and JSON.

The steps followed were:
1. Extracted the data chunkwise, that is, as a list of separate patent grants.
2. Handled one patent grant at a time.
3. Designed regexes to extract the desired fields.
4. Used the re.search and re.findall functions to match the regexes and extract the data.
5. Used the re.sub function to replace unwanted patterns with an empty string.
6. Processed the extracted values and stored them in a data structure in order to generate the CSV file.
7. Processed the extracted values and stored them in a JSON file.