# FIT5196 Assessment 1
#### Student names and IDs: Akshay Rai Chopra (30228751) and Parul (29507960)
#### Group Number: 154

Date: 20/08/2019

Version: 2.0

Environment: Python 3.6.0 and Anaconda 4.3.0 (64-bit)

Libraries used:

* pandas 0.19.2 (for data frames, included in Anaconda Python 3.6)
* re 2.2.1 (for regular expressions, included in Anaconda Python 3.6)


## 1. Introduction
This assignment applies several text processing and analysis tasks to patent documents in XML format. A single file named `Group154.txt` holds a total of 150 patents.

This assessment covers the very first step of analyzing textual data, i.e., extracting data from semi-structured text files.

The dataset contains information about several patent grants, e.g., patent title, patent ID, citation network, abstract etc. (see sample_input.txt). The task is to extract the data and transform it into the CSV and JSON formats with the following elements:
1. grant_id: a unique ID for a patent grant consisting of alphanumeric characters.
2. patent_kind: a category to which the patent grant belongs.
3. patent_title: a title given by the inventor to the patent claim.
4. number_of_claims: an integer denoting the number of claims for a given grant.
5. citations_examiner_count: an integer denoting the number of citations made by the examiner for a given patent grant (0 if None).
6. citations_applicant_count: an integer denoting the number of citations made by the applicant for a given patent grant (0 if None).
7. inventors: a list of the patent inventors’ names ([NA] if the value is Null).
8. claims_text: a list of claim texts for the different patent claims ([NA] if the value is Null).
9. abstract: the patent abstract text (‘NA’ if the value is Null).

More details for each task will be given in the following sections.


## 2.  Import libraries 

In [1]:
import pandas as pd
import re


## 3. Examining and loading data

As a first step, the file `Group154.txt` will be loaded so its first 10 lines can be inspected.

In [3]:
# print first ten lines of the file
with open('Group154.txt','r') as infile:
    print('\n'.join([infile.readline().strip() for i in range(0, 10)]))

<?xml version="1.0" encoding="UTF-8"?>
<us-patent-grant lang="EN" dtd-version="v4.5 2014-04-03" file="US10361682-20190723.XML" status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20190709" date-publ="20190723">
<us-bibliographic-data-grant>
<publication-reference>
<document-id>
<country>US</country>
<doc-number>10361682</doc-number>
<kind>B2</kind>
<date>20190723</date>
</document-id>


We can see that the first XML document has an XML declaration `<?xml...?>` and a root tag `<us-patent-grant>`. Based on this information it's possible to properly delimit an XML document so it can be extracted individually.

A regex is defined to capture each string that starts with the XML declaration `<?xml` and ends with the closing tag `</us-patent-grant>`. The non-greedy quantifier `*?` is necessary: a greedy `*` would match from the first declaration to the last closing tag and return the whole file as a single match. The regex also uses the character class `[\s\S]` (whitespace or non-whitespace characters), which matches every character, including line breaks, between the XML declaration and the closing tag.

In [4]:

# read the whole file
with open('Group154.txt','r') as infile:
    text = infile.read() 

# matches everything between the XML declaration and the root closing tag
regex = r'<\?xml[\s\S]*?</us-patent-grant>' 
patents = re.findall(regex, text)
print(len(patents))

150
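The effect of the non-greedy quantifier can be checked on a small input. The sketch below uses a made-up two-document string (not taken from `Group154.txt`) and compares the greedy and non-greedy versions of the pattern; the variable names are chosen for illustration only.

```python
import re

# Hypothetical sample: two tiny documents concatenated like the input file
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<us-patent-grant id="a">\n<kind>B2</kind>\n</us-patent-grant>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<us-patent-grant id="b">\n<kind>S1</kind>\n</us-patent-grant>\n'
)

# Greedy: runs from the first declaration to the last closing tag
greedy = re.findall(r'<\?xml[\s\S]*</us-patent-grant>', sample)

# Non-greedy: stops at the first closing tag after each declaration
non_greedy = re.findall(r'<\?xml[\s\S]*?</us-patent-grant>', sample)

print(len(greedy))      # 1
print(len(non_greedy))  # 2
```

With the greedy pattern both documents end up in one match, which is why `*?` is needed to obtain one string per patent.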


## 4. Parsing and extracting the data 

The first task is to define regular expressions that capture the desired strings. A short regular expression cheat sheet follows:

<h4> Special Characters</h4>

\	- escape special characters

.	- matches any character

^	- matches beginning of string 

$	- matches end of string

()	- creates a capture group and indicates precedence  

<h4> Quantifiers </h4>

$*$	- 0 or more (append ? for non-greedy)  

$+$	- 1 or more (append ? for non-greedy)  

?	- 0 or 1 (append ? for non-greedy)

<h4> Special Sequences </h4>

\d	- digit

\D	- non-digit

\s	- whitespace: [ \t\n\r\f\v]

\S	- non-whitespace

\w	- alphanumeric: [0-9a-zA-Z_]

\W	- non-alphanumeric
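A few entries of the cheat sheet can be tried out directly. The snippet below is a minimal sketch on made-up strings; only the grant number is borrowed from the first lines of the file shown earlier.

```python
import re

# Hypothetical line used only to exercise the cheat sheet entries
line = 'claims: 12, id: US10361682'

digits = re.findall(r'\d+', line)        # \d with + : runs of digits
grant = re.findall(r'US\w+', line)       # \w with + : alphanumeric run after 'US'
at_start = re.findall(r'^claims', line)  # ^ anchors the match at the start

greedy_tags = re.findall(r'<.*>', '<a><b>')   # * is greedy
lazy_tags = re.findall(r'<.*?>', '<a><b>')    # *? is non-greedy

print(digits)       # ['12', '10361682']
print(grant)        # ['US10361682']
print(at_start)     # ['claims']
print(greedy_tags)  # ['<a><b>']
print(lazy_tags)    # ['<a>', '<b>']
```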




In [5]:
#defining the regular expressions.
#re.DOTALL - makes the '.' special character match any character at all, including a newline.

reg_grantid=r'<us-p.*(US\w+)'

reg_patent_title=r'<invention.*>(.*)<'

reg_no_claims=r'<nu.*claims>(\d+)<'

reg_lname=re.compile(r'<inventor .*?<last-name>(.*?)<\/',re.DOTALL)

reg_fname=re.compile(r'<inventor .*?<first-name>(.*?)<\/',re.DOTALL)

reg_applicant=r'<category>(.*t)<'

reg_examiner=r'<category>(.*r)<'

reg_claim=re.compile(r'<claim id.*?>(.*?)<\/claim>',re.DOTALL)

tags=r'<.*?>'

reg_abstract=r'<p id="p.*"0000">(.*)<\/p>'

reg_app_type=r'<ap.*"(.*)">'

reg_kind=r'<kind>(.*)<\/kind>'
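The role of `re.DOTALL` in the inventor and claim patterns can be seen on a small fragment. The sketch below assumes a made-up inventor block in which each tag sits on its own line; the names and attribute values are placeholders, not data from the file.

```python
import re

# Hypothetical inventor block, one tag per line
fragment = (
    '<inventor sequence="001" designation="us-only">\n'
    '<addressbook>\n'
    '<last-name>Doe</last-name>\n'
    '<first-name>Jane</first-name>\n'
    '</addressbook>\n'
    '</inventor>\n'
)

pattern = r'<inventor .*?<last-name>(.*?)</'

# Without DOTALL, '.' stops at the newline, so the pattern cannot reach <last-name>
without_dotall = re.findall(pattern, fragment)

# With DOTALL, '.' also matches newlines and the name is captured
with_dotall = re.findall(pattern, fragment, re.DOTALL)

print(without_dotall)  # []
print(with_dotall)     # ['Doe']
```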

The regular expressions defined above are now applied to each patent to capture the desired strings.

In [8]:
#creating an empty list (rows for the CSV file) and an empty dictionary (records for the JSON file)
data2=[]
dic={}

#the for loop passes one patent at a time and the following extractions are performed on it.
for i in patents:
    
    #storing the captured string in grant_id
    grant_id = str(''.join(re.findall(reg_grantid, i)))            
    
    #storing the captured string in patent_title and replacing each hexadecimal character reference with a single space.
    patent_title = str(''.join(re.findall(reg_patent_title, i)))   
    patent_title = re.sub(r'&#x\w*;',' ',patent_title)
    
    
    #storing the application type in app_type and the first kind code in kind_code, then mapping both to the description saved in kind
    app_type = str(''.join(re.findall(reg_app_type, i)))
    kind_code = ''.join(re.findall(reg_kind, i)[:1])
    kind = 'NA'    #assumed fallback so that an unmatched patent never reuses the kind of the previous one
    if app_type =='design':
        kind = 'Design Patent'
    elif app_type =='reissue':
        kind = 'Reissue Patent'
    elif app_type =='plant': 
        if kind_code=='P2':
            kind='Plant Patent Grant (no published application) issued on or after January 2, 2001.'
        elif kind_code=='P3':
            kind='Plant Patent Grant (with a published application) issued on or after January 2, 2001.'
    elif app_type =='utility':        
        if kind_code=='B1':
            kind='Utility Patent Grant (no published application) issued on or after January 2, 2001.'
        elif kind_code=='B2':
            kind='Utility Patent Grant (with a published application) issued on or after January 2, 2001.'
    
    #extracting the no_of_claims and converting into string
    no_of_claims = str(''.join(re.findall(reg_no_claims, i)))
    
    #pairing each first name with its last name and storing the result in inventors as a bracketed string
    fname=re.findall(reg_fname, i)
    lname=re.findall(reg_lname, i)
    inventor=[]
    for first, last in zip(fname, lname):
        inventor.append(first+" "+last)
    inventors = '['+str(','.join(inventor))+']'
    inventors = re.sub(r'&#x\w*;','',inventors)
    if inventors =='[]':
        inventors = '[NA]'
    
    
    
    #counting the number of citations made by applicant and examiner
    citations_applicant_count = len(re.findall(reg_applicant, i))
    citations_examiner_count = len(re.findall(reg_examiner, i))
    
    #extracting the claim text and removing the tags, new lines and special characters; if no claim is found, writing '[NA]' in place
    claims_text = ','.join(re.findall(reg_claim, i))
    claims_text = re.sub(tags,'',claims_text)
    claims_text = re.sub('\n','',claims_text)
    claims_text = '['+ re.sub(r'&#x\w*;','',claims_text)+']'
    if claims_text =='[]':
        claims_text = '[NA]'
        
    #extracting the abstract text and removing the tags and special characters; if the abstract is missing, writing 'NA' in place
    abstract = str(''.join(re.findall(reg_abstract, i)))
    abstract = re.sub(tags,'',abstract)
    abstract = re.sub(r'&#x\w*;',' ',abstract)
    if abstract =='':
        abstract = 'NA'
    
    #collecting the fields of the current patent in one row and appending that row to the list
    data1=[grant_id,patent_title,kind,no_of_claims,inventors,citations_applicant_count,citations_examiner_count,claims_text,abstract]
    data2.append(data1)
    
    #adding an entry with grant_id as the key and a nested dictionary of the remaining fields as the value; the field names follow the element list in the introduction.
    dic[grant_id]={'patent_title':patent_title,'patent_kind':kind,'number_of_claims':no_of_claims,'inventors':inventors,'citations_applicant_count':citations_applicant_count,'citations_examiner_count':citations_examiner_count,'claims_text':claims_text,'abstract':abstract}
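The mapping from application type and kind code to the patent kind description can also be written as a dictionary lookup. The following is only a sketch of that alternative: the function name `describe_kind`, the two table names and the `'NA'` fallback are assumptions made for illustration, while the description strings are copied from the branches in the loop above.

```python
# Application types whose description does not depend on the kind code
SIMPLE_KINDS = {
    'design': 'Design Patent',
    'reissue': 'Reissue Patent',
}

# Descriptions keyed by (application type, kind code)
CODED_KINDS = {
    ('plant', 'P2'): 'Plant Patent Grant (no published application) issued on or after January 2, 2001.',
    ('plant', 'P3'): 'Plant Patent Grant (with a published application) issued on or after January 2, 2001.',
    ('utility', 'B1'): 'Utility Patent Grant (no published application) issued on or after January 2, 2001.',
    ('utility', 'B2'): 'Utility Patent Grant (with a published application) issued on or after January 2, 2001.',
}

def describe_kind(app_type, kind_code):
    """Return the patent kind description, or 'NA' (assumed fallback) if unknown."""
    if app_type in SIMPLE_KINDS:
        return SIMPLE_KINDS[app_type]
    return CODED_KINDS.get((app_type, kind_code), 'NA')

print(describe_kind('utility', 'B2'))
print(describe_kind('design', 'S1'))
```

A lookup table keeps every description in one place and always returns a value, so a patent with an unexpected combination cannot silently inherit the kind of the previous patent.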

    


## 5. Transforming the extracted data into the CSV and JSON format

In [14]:
#creating dataframe from the list
df=pd.DataFrame(data2,columns=['grant_id','patent_title','patent_kind','number_of_claims','inventors','citations_applicant_count','citations_examiner_count','claims_text','abstract'])

#exporting the pandas dataframe into a csv file.
df.set_index('grant_id',inplace=True)
df.to_csv('154.csv')

#creating the output file and writing the string form of the dictionary to it.
with open ('154.json','w') as outfile:
    outfile.write(str(dic))
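One point worth checking is whether the written file parses as JSON. `str()` on a dictionary produces Python literal syntax with single-quoted strings, which a JSON parser rejects. The sketch below uses the standard-library `json` module, which is not among the libraries listed at the top of this notebook, so it is shown only as a possible alternative; the record is a made-up placeholder.

```python
import json

# Hypothetical one-record dictionary shaped like `dic`
sample = {'US10361682': {'patent_title': 'Example title', 'abstract': 'NA'}}

as_repr = str(sample)         # Python literal syntax, single-quoted strings
as_json = json.dumps(sample)  # JSON syntax, double-quoted strings

# The JSON text round-trips back to the same dictionary
round_trip_ok = json.loads(as_json) == sample

# The str() text is not valid JSON
try:
    json.loads(as_repr)
    repr_is_json = True
except json.JSONDecodeError:
    repr_is_json = False

print(round_trip_ok)  # True
print(repr_is_json)   # False
```

When writing to a file, `json.dump(dic, outfile)` would take the place of `outfile.write(str(dic))`.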

## 6. Summary

This assessment measured the understanding of basic text file processing techniques in the Python programming language. The main outcomes achieved while applying these techniques were:

1. **Applying regular expressions** - Using regular expressions to capture the desired information.
2. **Parsing and extracting the data** - With functions like findall() and len(), the required fields could be pulled out of the hierarchical XML without a dedicated parser. Python's join() and list comprehensions kept these steps short and readable.
3. **DataFrame manipulation** - With the pandas package, building a data frame from the list of extracted records was straightforward.
4. **Exporting data to specific formats like CSV and JSON** - The built-in DataFrame.to_csv() exported the data frame to a .csv file without additional formatting or transformation. The JSON output instead relied on native file operations like open() and write().