# FIT5196 Assessment 1
# FIT5196 Task  in Assessment 1
#### Student Name: Pattranit Chaiyabud and Viet Tai Le
#### Student ID: 30304148 and 29975336 

Date: 22/08/2019

Version: 1.0

Environment: Python 3.6.0 and Anaconda 4.3.0 (64-bit)

Libraries used:
* pandas (for dataframe, included in Anaconda Python 3.6.0) 
* re (for regular expression, included in Anaconda Python 3.6.0)


## 1. Introduction
This assignment applies text processing and analysis techniques to patent documents in XML format. The input file, `Group148.txt`, contains a total of 150 patents.

The task is to extract data from this semi-structured text file. The data set describes grants given for IP patent claims, and each grant carries information such as the patent title, patent ID, citation network and abstract. The extracted data is then written to CSV and JSON files with the following elements:
1. grant_id: a unique ID for a patent grant consisting of alphanumeric characters.
2. patent_kind: a category to which the patent grant belongs.
3. patent_title: a title given by the inventor to the patent claim.
4. number_of_claims: an integer denoting the number of claims for a given grant.
5. citations_examiner_count: an integer denoting the number of citations made by the examiner for a given patent grant (0 if None).
6. citations_applicant_count: an integer denoting the number of citations made by the applicant for a given patent grant (0 if None).
7. inventors: a list of the patent inventors’ names ([NA] if the value is Null).
8. claims_text: a list of claim texts for the different patent claims ([NA] if the value is Null).
9. abstract: the patent abstract text (‘NA’ if the value is Null).

More details for each task will be given in the following sections.

## 2.  Import libraries 

In [2]:
import pandas as pd
import re

## 3. Examining and loading data

As a first step, the file `Group148.txt` will be loaded so its first 10 lines can be inspected.

In [3]:
# print first ten lines of the file
with open('Group148.txt','r',encoding="utf-8") as infile:
    print('\n'.join([infile.readline().strip() for i in range(0, 10)]))

<?xml version="1.0" encoding="UTF-8"?>
<us-patent-grant lang="EN" dtd-version="v4.5 2014-04-03" file="US10362598-20190723.XML" status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20190709" date-publ="20190723">
<us-bibliographic-data-grant>
<publication-reference>
<document-id>
<country>US</country>
<doc-number>10362598</doc-number>
<kind>B2</kind>
<date>20190723</date>
</document-id>


We can see that each patent starts with the XML declaration `<?xml ...?>`, followed by the root tag `<us-patent-grant>`, and ends with the closing tag `</us-patent-grant>`. To get a list of all patents, we use `re.findall` with the regex `r'<\?xml[\s\S]*?</us-patent-grant>'`.
The non-greedy quantifier `*?` is necessary so that each match stops at the first closing tag instead of spanning the whole file. The regex also uses the character class `[\s\S]` (whitespace or non-whitespace characters), which matches every character, including line breaks, between the XML declaration and the closing tag.
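
The difference between the non-greedy and greedy quantifiers can be checked on a toy string. The following is a minimal sketch; the two-document `sample` string is made-up data, not taken from `Group148.txt`.

```python
import re

# Toy input: two tiny "patents" concatenated in one string (hypothetical data).
sample = (
    '<?xml version="1.0"?>\n<us-patent-grant file="A-1.XML">\n</us-patent-grant>\n'
    '<?xml version="1.0"?>\n<us-patent-grant file="B-2.XML">\n</us-patent-grant>\n'
)

# Non-greedy: each match stops at the first closing tag it meets.
lazy = re.findall(r'<\?xml[\s\S]*?</us-patent-grant>', sample)
# Greedy: the match runs on to the last closing tag in the string.
greedy = re.findall(r'<\?xml[\s\S]*</us-patent-grant>', sample)

print(len(lazy))    # one match per document
print(len(greedy))  # a single match spanning both documents
```

With the non-greedy form the list has one entry per document, which is what the loop in the next section relies on.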

In [6]:
# Read a whole file and put data in a list that contains information of each patent
with open('Group148.txt','r',encoding="utf-8") as infile:
    fileline = infile.read()    
regex = r'<\?xml[\s\S]*?</us-patent-grant>' 
patents = re.findall(regex, fileline)
# patents

## 4. Parsing Raw Text Files 
Our approach is to work out how to extract the 9 elements from a sample patent and then use a `for` loop to extract them from every patent.

To write the data to a CSV file, we store the extracted data in a data frame and use the `to_csv` method of the pandas library.

To write the data to a JSON file, we build a string in JSON format and write it to a text file.

In [4]:
#Initialize a list contains processed data of patents
data=[]
#Initialize a dictionary 
dict1={}
for i in range(len(patents)):
    sample_row=patents[i]
    #grant_id
    grant_id=re.search(r'file="(.*?)-(.*?)"\sstatus',sample_row).group(1)
    #finding title
    p_title=re.search('<invention-title.*?>(.*?)</invention-title>',sample_row).group(1)
    #number of claims
    number_of_claims=re.search(r'<number-of-claims>(\d*?)</number-of-claims>',sample_row).group(1)
    #citations by applicant,examiner
    number_cia=re.findall(r'<patcit[\s\S]*?</category>|<nplcit[\s\S]*?</category>',sample_row)
    if len(number_cia)!=0:
        # Initialize a variable for citation applicant count
        applicant_count=0
        # Initialize a variable for citation examiner count
        examiner_count=0
        for each in number_cia:
            if 'cited by applicant' in each:
                applicant_count+=1
            elif 'cited by examiner'in each:
                examiner_count+=1
    else:
        applicant_count=0
        examiner_count=0
    #inventors
    list_inventors=[]
    inventors=re.search(r'<inventors[\s\S]*?</inventors>',sample_row)
    if inventors:
        inventor_info=re.findall(r'<inventor[\s\S]*?</inventor>',inventors.group())
        if len(inventor_info) != 0:
            for each_inventor in inventor_info:     
                inventor_lname=re.search(r"<last-name>(.*?)</last-name>",each_inventor).group(1)
                inventor_fname=re.search(r"<first-name>(.*?)</first-name>",each_inventor).group(1)
                inventor_name=inventor_lname + " " +inventor_fname
                list_inventors.append(inventor_name)
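            inventors_str=re.sub(r'\'|\"','',str(list_inventors))
        else:
            #no <inventor> entries inside <inventors>: set the string that is actually written out
            inventors_str="[NA]"
    else:
        #no <inventors> block: without this, inventors_str would keep the previous patent's value
        inventors_str="[NA]"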
    #kind
    kind_grant=re.search(r'<publication-reference>([\s\S]*?)</publication-reference>',sample_row).group(1)
    kind_info= re.search(r'<kind>([\s\S]*?)</kind>',kind_grant).group(1)
    if kind_info=='B1':
        kind='Utility Patent Grant (no pre-grant publication) issued on or after January 2, 2001.'
    elif kind_info=='B2':
        kind='Utility Patent Grant (with pre-grant publication) issued on or after January 2, 2001.'
    elif kind_info=='E1':
        kind='Reissue Patent'
    elif kind_info=='P1':
        kind='Plant Patent Grant issued prior to January 2, 2001'
    elif kind_info=='P2':
        kind='Plant Patent Grant (no pre-grant publication) issued on or after January 2, 2001.'
    elif kind_info=='P3':
        kind='Plant Patent Grant (with pre-grant publication) issued on or after January 2, 2001.'
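    elif kind_info=='S1':
        kind='Design Patent'
    else:
        #unlisted kind code: fall back to 'NA' (our assumption) instead of reusing the previous patent's value
        kind='NA'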
    #claim text
    claims_info=re.findall(r'<claim\sid=".+">([\s\S]+?)<\/claims>',sample_row)
    claims_text=re.sub(r'<[^>]*>|\'|\"|\\n','',str(claims_info))
    claims_text=re.sub(r'\s{2,}','',claims_text)
    #claims_text is a string at this point, so test the findall result instead
    if not claims_info:
        claims_text="[NA]"
    #abstract
    abstract=re.search(r'<abstract[\s\S]+?<p.+?>(.+)?<\/p[\s\S]+?>',sample_row)
    if abstract:
        abstract=abstract.group(1)
        abstract=re.sub(r'<.+?>|\s{2,}','',abstract)
    else:
        abstract="NA"
    #for csv
    data_row=[grant_id,p_title,kind,number_of_claims,inventors_str,applicant_count,examiner_count,claims_text,abstract]
    data.append(data_row)
    #for json
    dict1[grant_id]={"patent_title":p_title,"kind":kind,"number_of_claims":int(number_of_claims),"inventors":inventors_str,"citations_applicant_count":applicant_count,"citations_examiner_count":examiner_count,\
                     "claims_text":claims_text,"abstract":abstract}

Each element of a patent is extracted with regular expressions, mostly through the `re.search` method.
### *Grant_Id:* 
The grant ID is located in the `<us-patent-grant ...>` tag.

For example: `<us-patent-grant lang="EN" dtd-version="v4.5 2014-04-03" file="US10362598-20190723.XML" status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20190709" date-publ="20190723">`
The ID begins after `file="` and ends before the next `-` character. Hence, we use the regex `r'file="(.*?)-(.*?)"\sstatus'`. The first group is the ID.
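
As a quick check, the regex can be applied to the example tag above. This is a small sketch; the variable names `tag`, `match` and `file_suffix` are only illustrative.

```python
import re

# The example root tag quoted above.
tag = ('<us-patent-grant lang="EN" dtd-version="v4.5 2014-04-03" '
       'file="US10362598-20190723.XML" status="PRODUCTION" id="us-patent-grant" '
       'country="US" date-produced="20190709" date-publ="20190723">')

match = re.search(r'file="(.*?)-(.*?)"\sstatus', tag)
grant_id = match.group(1)     # text between file=" and the first hyphen
file_suffix = match.group(2)  # remainder of the file name, not used further

print(grant_id)
```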

### *Patent_title:*
The title is located between the tags `<invention-title ...>` and `</invention-title>`.

For example: `<invention-title id="d2e53">Method for relaying D2D link in wireless communication system and device for performing same</invention-title>`

Similarly, the regex here is `'<invention-title.*?>(.*?)</invention-title>'`. The first group is the patent title.

### *Number of claims:*
The number of claims is located between the tags `<number-of-claims>` and `</number-of-claims>`.

For example: `<number-of-claims>18</number-of-claims>`

To extract this number, we use the `re.search` method with the regex `r'<number-of-claims>(\d*?)</number-of-claims>'` and then `.group(1)` to take the digits between the two tags.
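
Calling `.group(1)` directly raises an `AttributeError` when `re.search` finds nothing. One way to guard against that is a small wrapper, sketched below. The helper name `first_group` and its `default` argument are hypothetical and are not part of the notebook code.

```python
import re

def first_group(pattern, text, default="NA"):
    """Return group(1) of the first match, or `default` when nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else default

# The two example tags quoted above, joined into one snippet.
snippet = (
    '<invention-title id="d2e53">Method for relaying D2D link in wireless '
    'communication system and device for performing same</invention-title>\n'
    '<number-of-claims>18</number-of-claims>'
)

title = first_group(r'<invention-title.*?>(.*?)</invention-title>', snippet)
n_claims = int(first_group(r'<number-of-claims>(\d*?)</number-of-claims>', snippet, default="0"))
# A tag that is absent from the snippet falls back to the default.
missing = first_group(r'<abstract>(.*?)</abstract>', snippet)

print(title, n_claims, missing)
```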

### *Citations by applicant, examiner:*
To count the citations made by the applicant and by the examiner, we first find all citations and then determine who made each one. Each citation begins with a `<patcit ...>` or `<nplcit ...>` tag and ends at the next `</category>` tag. For example,

`<patcit num="00007">
<document-id>
<country>US</country>
<doc-number>2013/0053042</doc-number>
<kind>A1</kind>
<name>Tanikawa</name>
<date>20130200</date>
</document-id>
</patcit>
<category>cited by examiner</category>`


**OR** 


`<nplcit num="00022">
<othercit>Korean Intellectual Property Office Application No. 10-2016-7024846, Office Action dated May 23, 2017, 5 pages.</othercit>
</nplcit>
<category>cited by applicant</category>
</us-citation>
<us-citation>
<nplcit num="00023">
<othercit>Korean Intellectual Property Office Application No. 10-2016-7024846, Final Office Action dated Nov. 10, 2017, 5 pages.</othercit>
</nplcit>
<category>cited by applicant</category>`

Each citation ends with a category of either `cited by applicant` or `cited by examiner`. Our approach is therefore to use the `re.findall()` method to find all citations first. Then, for each citation, if it contains the string `cited by applicant` we count it towards `citations_applicant_count`; otherwise, if it contains the string `cited by examiner`, we count it towards `citations_examiner_count`.
The regex here is: `r'<patcit[\s\S]*?</category>|<nplcit[\s\S]*?</category>'`
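
The counting step can be sketched on a shortened version of the examples above. The `citations_xml` string is abridged for illustration and is assumed to follow the same layout as the real file.

```python
import re

# Abridged citations modelled on the examples above (illustrative data).
citations_xml = """<us-citation>
<patcit num="00007">
<document-id><country>US</country><doc-number>2013/0053042</doc-number></document-id>
</patcit>
<category>cited by examiner</category>
</us-citation>
<us-citation>
<nplcit num="00022">
<othercit>Office Action dated May 23, 2017, 5 pages.</othercit>
</nplcit>
<category>cited by applicant</category>
</us-citation>
<us-citation>
<nplcit num="00023">
<othercit>Final Office Action dated Nov. 10, 2017, 5 pages.</othercit>
</nplcit>
<category>cited by applicant</category>
</us-citation>"""

citations = re.findall(
    r'<patcit[\s\S]*?</category>|<nplcit[\s\S]*?</category>', citations_xml)

# Each boolean counts as 0 or 1, so sum() gives the number of hits.
applicant_count = sum('cited by applicant' in c for c in citations)
examiner_count = sum('cited by examiner' in c for c in citations)

print(applicant_count, examiner_count)
```

When `re.findall` returns an empty list, both sums are 0, so no separate branch is needed for patents without citations.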
### *Inventors*
All inventor information is placed between `<inventors>` and `</inventors>`. The information for each inventor begins with `<inventor ...>` and ends with `</inventor>`. For example,
`<inventor sequence="002" designation="us-only"> <addressbook> <last-name>Seo</last-name> <first-name>Hanbyul</first-name> <address> <city>Seoul</city> <country>KR</country> </address> </addressbook> </inventor>`

The last name is located between `<last-name>` and `</last-name>`, and the first name between `<first-name>` and `</first-name>`.  
First, we extract the text from `<inventors>` to `</inventors>` using the `re.search` method with the regex `r'<inventors[\s\S]*?</inventors>'`. Then, using the `re.findall` method with the regex `r'<inventor[\s\S]*?</inventor>'` on the result of the first step, we get a list of strings, each containing the first and last name of one inventor. To obtain the names, we use the `re.search` method with `r"<last-name>(.*?)</last-name>"` and `r"<first-name>(.*?)</first-name>"`. Finally, we concatenate the two (last name first) and append the result to the list of inventors.
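
The two-step extraction can be sketched as follows. The first inventor in `inventors_xml` is invented for illustration. The per-inventor pattern is also slightly tightened to `<inventor[\s>]`, which is our own variation: it stops the first match from starting at the enclosing `<inventors>` tag.

```python
import re

# Illustrative block; "Doe Jane" is made-up, "Seo Hanbyul" comes from the example above.
inventors_xml = """<inventors>
<inventor sequence="001" designation="us-only"><addressbook>
<last-name>Doe</last-name><first-name>Jane</first-name>
</addressbook></inventor>
<inventor sequence="002" designation="us-only"><addressbook>
<last-name>Seo</last-name><first-name>Hanbyul</first-name>
</addressbook></inventor>
</inventors>"""

names = []
block = re.search(r'<inventors[\s\S]*?</inventors>', inventors_xml)
if block:
    # [\s>] after "inventor" skips the plural <inventors> wrapper tag.
    for inventor in re.findall(r'<inventor[\s>][\s\S]*?</inventor>', block.group()):
        last = re.search(r'<last-name>(.*?)</last-name>', inventor).group(1)
        first = re.search(r'<first-name>(.*?)</first-name>', inventor).group(1)
        names.append(last + " " + first)

# Fall back to a placeholder when no inventor was found.
inventors = names if names else ["NA"]
print(inventors)
```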


### *Kind*:
Patent documents are categorized according to the [kind codes](https://www.uspto.gov/learning-and-resources/support-centers/electronic-business-center/kind-codes-included-uspto-patent). 
The kind information lies between the tags `<publication-reference>` and `</publication-reference>`, and the kind code itself is found between `<kind>` and `</kind>`.

for example:

`<publication-reference>
<document-id>
<country>US</country>
<doc-number>10362598</doc-number>
<kind>B2</kind>
<date>20190723</date>
</document-id>
</publication-reference>`

To extract the kind code and translate it into the right description, we use the `re.search` method with the regex `r'<publication-reference>([\s\S]*?)</publication-reference>'` and then `.group(1)` to capture the publication reference block.
We then use `re.search` with the regex `r'<kind>([\s\S]*?)</kind>'` and `.group(1)` again to capture the kind code, and map the code to the matching description as shown below.

<img src = "Capture.png" height = "600" width = "500">
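
The mapping from kind code to description can also be written as a dictionary lookup instead of a chain of `if`/`elif` branches. The sketch below reuses the descriptions from the extraction loop; the names `KIND_DESCRIPTIONS` and `describe_kind`, and the `'NA'` fallback for unknown codes, are our own assumptions.

```python
import re

# Descriptions copied from the if/elif chain in the extraction loop.
KIND_DESCRIPTIONS = {
    'B1': 'Utility Patent Grant (no pre-grant publication) issued on or after January 2, 2001.',
    'B2': 'Utility Patent Grant (with pre-grant publication) issued on or after January 2, 2001.',
    'E1': 'Reissue Patent',
    'P1': 'Plant Patent Grant issued prior to January 2, 2001',
    'P2': 'Plant Patent Grant (no pre-grant publication) issued on or after January 2, 2001.',
    'P3': 'Plant Patent Grant (with pre-grant publication) issued on or after January 2, 2001.',
    'S1': 'Design Patent',
}

def describe_kind(code):
    """Translate a kind code into its description ('NA' for unknown codes, an assumption)."""
    return KIND_DESCRIPTIONS.get(code, 'NA')

# The example publication reference quoted above.
publication = """<publication-reference>
<document-id>
<country>US</country>
<doc-number>10362598</doc-number>
<kind>B2</kind>
<date>20190723</date>
</document-id>
</publication-reference>"""

reference = re.search(
    r'<publication-reference>([\s\S]*?)</publication-reference>', publication).group(1)
kind_code = re.search(r'<kind>([\s\S]*?)</kind>', reference).group(1)
kind = describe_kind(kind_code)
print(kind_code, kind)
```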


### *Claim text:*
The claims information is located between the tags `<claim-text>` and `</claim-text>`.
A patent has one or more claims, and each claim can contain one or more `<claim-text>` elements.
We start by extracting all of the claims text with `re.findall`, capturing everything that begins after the first `<claim id="CLM-00001" num="00001">` tag and ends at `</claims>`. We then use `re.sub` to remove unnecessary characters, for instance tags in angle brackets, quotes, newlines and repeated whitespace.

for example: 

`<claim id="CLM-00001" num="00001">
<claim-text>The ornamental design for a weighted golf club grip, as shown and described.</claim-text>
</claim>
</claims>`

The regex that we use to capture this is `<claim\sid=".+">([\s\S]+?)<\/claims>`, and to clean the result we use `re.sub` with `r'<[^>]*>|\'|\"|\\n'` and `r'\s{2,}'`.

If there is no claims text, `re.findall` returns an empty list `[]`. We therefore add a check so that an empty result is replaced with an `NA` placeholder.
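
An alternative to capturing all claims as one string is to extract each `<claim>` element separately, which yields one list entry per claim. The following is a sketch of that variation, not the approach used in the loop above. The `claims_xml` data and the helper `clean_claim` are made up for illustration.

```python
import re

# Made-up claims block with a nested <claim-text> and a <claim-ref>.
claims_xml = """<claims id="claims">
<claim id="CLM-00001" num="00001">
<claim-text>1. A device comprising:
<claim-text>a sensor; and</claim-text>
<claim-text>a processor.</claim-text>
</claim-text>
</claim>
<claim id="CLM-00002" num="00002">
<claim-text>2. The device of <claim-ref idref="CLM-00001">claim 1</claim-ref>, wherein the sensor is optical.</claim-text>
</claim>
</claims>"""

def clean_claim(raw):
    """Remove tags, then collapse every run of whitespace into one space."""
    text = re.sub(r'<[^>]*>', '', raw)
    return re.sub(r'\s+', ' ', text).strip()

# "<claim\s" matches "<claim id=...>" but not "<claims ...>" or "<claim-text>".
raw_claims = re.findall(r'<claim\s[^>]*>([\s\S]*?)</claim>', claims_xml)
claims = [clean_claim(c) for c in raw_claims]

# Test the list itself for emptiness before falling back to the placeholder.
claims_text = claims if claims else ["NA"]
print(claims_text)
```

Collapsing whitespace to a single space, rather than deleting it, keeps words on adjacent lines from being joined together.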
  

### *Abstract:*
The abstract of the patent is located between the tags `<abstract id="abstract">` and `</abstract>`.
To extract it, we use `re.search` with the regex `r'<abstract[\s\S]+?<p.+?>(.+)?<\/p[\s\S]+?>'` to match the abstract paragraph, and then remove the inline tags and repeated whitespace with `re.sub` and `r'<.+?>|\s{2,}'`.

for example:

`<abstract id="abstract">\n<p id="p-0001" num="0000">This invention discloses a novel system and method for distributing electronic ticketing to mobile devices such that the ticket stored on the device is checked for its integrity from tampering and the device periodically reports on ticket usage with a central server.</p>\n</abstract>`

In the case that no abstract is found, we set it to `"NA"`
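
A variation that captures everything between the abstract tags, and so also covers abstracts with several paragraphs, could look like the sketch below. The pattern and the shortened `doc` string (including its `<b>` tag) are our own illustration, not the regex used in the loop above.

```python
import re

# Shortened abstract with one inline tag (illustrative data).
doc = ('<abstract id="abstract">\n<p id="p-0001" num="0000">This invention discloses a novel '
       'system for <b>electronic</b> ticketing.</p>\n</abstract>')

match = re.search(r'<abstract[^>]*>([\s\S]*?)</abstract>', doc)
if match:
    text = re.sub(r'<[^>]*>', '', match.group(1))      # drop <p>, <b> and other tags
    abstract = re.sub(r'\s+', ' ', text).strip()       # normalise whitespace
else:
    abstract = "NA"

# A document without an abstract falls back to "NA".
no_abstract = re.search(r'<abstract[^>]*>([\s\S]*?)</abstract>', '<claims></claims>')
fallback = no_abstract.group(1) if no_abstract else "NA"

print(abstract)
```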


## 5. Transform data into a CSV file

In [5]:
#insert the list of data into dataframe; column names follow the order of the values in data_row
df=pd.DataFrame(data,columns=["grant_id","patent_title","kind",'number_of_claims',"inventors",'citations_applicant_count',"citations_examiner_count","claims_text","abstract"])
#convert dataframe into .csv file
df.to_csv("Group148.csv",index=False)

To write the extracted data to a `.csv` file, we first load the list of rows into a data frame and then call the pandas method `.to_csv` on it.

## 6. Transform data into a JSON file

In [6]:
# remove all unnecessary white spaces
str1=str(dict1).replace(": ",":").replace(", \'",",\'")
# change all single quote to double quote
replacement=[("{\'","{\""),("\'}","\"}"),("\':","\":"),(":\'",":\""),("\',","\","),(",\'",",\"")]
for old,new in replacement:
    # plain string replacement: the patterns are literal text, not regular expressions
    str1=str1.replace(old,new)
# write a txt file with json format
with open('Group148.json','w',encoding='utf-8') as outfile:
    outfile.write(str1)

The first step is to remove unnecessary whitespace, so that the output is compact: no space after the `:` between a key and its value, and no space between items. JSON does not require this, but it keeps the file small.


Secondly, in a JSON file every key and string value is enclosed in double quotes, not single quotes. Hence, the next step is to change the leading and trailing single quote of every key and value to a double quote.
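
Quote replacement on the string form of a dictionary can break when a value itself contains a quote character. A hand-written serializer that escapes such characters is one possible alternative, sketched below. The function `to_json` and the sample `record` are hypothetical. The `json` module is imported only to check the result, not to produce it, and control characters other than newline, carriage return and tab are not handled.

```python
import json  # used only to verify the hand-built text, not to create it

def to_json(value):
    """Serialise str, bool, int, list and dict values to compact JSON text."""
    if isinstance(value, str):
        escaped = (value.replace('\\', '\\\\')   # backslash first, so later escapes survive
                        .replace('"', '\\"')
                        .replace('\n', '\\n')
                        .replace('\r', '\\r')
                        .replace('\t', '\\t'))
        return '"' + escaped + '"'
    if isinstance(value, bool):                   # bool before int: bool is a subclass of int
        return 'true' if value else 'false'
    if isinstance(value, int):
        return str(value)
    if isinstance(value, list):
        return '[' + ','.join(to_json(v) for v in value) + ']'
    if isinstance(value, dict):
        items = (to_json(str(k)) + ':' + to_json(v) for k, v in value.items())
        return '{' + ','.join(items) + '}'
    raise TypeError('unsupported type: ' + type(value).__name__)

# Hypothetical record whose title contains double quotes.
record = {"US1": {"patent_title": 'A "smart" grip', "number_of_claims": 2}}
text = to_json(record)
parsed = json.loads(text)  # parses only if the text is valid JSON

print(text)
```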

## 7. Summary
This assessment measured the understanding of basic text file processing techniques in the Python programming language. The main outcomes achieved while applying these techniques were:

- **XML parsing and data extraction**. The built-in `re` module, with functions such as `findall()`, `search()` and `sub()`, allowed us to locate each data pattern after only a few inspections and to remove unnecessary characters cleanly.
- **Data frame manipulation**. With the `pandas` package, loading the list into a data frame is straightforward, and naming and ordering the columns is simple.
- **Exporting data to a specific format**. Built-in functions such as `DataFrame.to_csv()` export data frames to a `.csv` file without additional formatting or transformation. Exporting data to a `.json` file without using the `json` package is more challenging: we load the data into a dictionary, rearrange its string form to follow the JSON format and write it directly to the output file.

## 8. References

- Regular Expression HOWTO — Python 3.7.4 documentation. (2019). Retrieved 22 August 2019, from https://docs.python.org/3/howto/regex.html
- Efficient way of matching and replacing multiple strings in python 3?. (2019). Retrieved 22 August 2019, from https://stackoverflow.com/questions/44528197/efficient-way-of-matching-and-replacing-multiple-strings-in-python-3?fbclid=IwAR2RI5myx3NKrRoB_ZhUL45yAQc8sZ54Yp_MPZZw5MefcYLMtv15jBgnqo4
- pandas.DataFrame.to_csv — pandas 0.25.0 documentation. (2019). Retrieved 22 August 2019, from https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
- Dib, F. (2019). Online regex tester and debugger: PHP, PCRE, Python, Golang and JavaScript. Retrieved 22 August 2019, from https://regex101.com
- JSON Formatter & Validator. (2019). Retrieved 22 August 2019, from https://jsonformatter.curiousconcept.com/?fbclid=IwAR3uOy87IPaE0oZWQeFHScBOOBMn8kwYNiTShTqVIKpan1jmOjLI5UzeiD8
