# FIT5196 Assessment 1
#### Student Name:
#### Student ID: 

Date: 02/04/2017

Version: 2.0

Environment: Python 3.6.0 and Anaconda 4.3.0 (64-bit)

Libraries used:
* xml.etree.ElementTree (for parsing XML doc, included in Anaconda Python 3.6)
* pandas 0.19.2 (for data frame, included in Anaconda Python 3.6) 
* re 2.2.1 (for regular expression, included in Anaconda Python 3.6) 
* nltk 3.2.2 (Natural Language Toolkit, included in Anaconda Python 3.6)
* nltk.collocations (for finding bigrams, included in Anaconda Python 3.6)
* nltk.tokenize (for tokenization, included in Anaconda Python 3.6)
* nltk.corpus (for stop words, not included in Anaconda, `nltk.download('stopwords')` provided)


## 1. Introduction
This assignment comprises the execution of different text processing and analysis tasks applied to patent documents in XML format. There are a total of 2500 patents in one 158 MB file named `patents.xml`. The required tasks are the following:

1. Extract the IPC code for each patent and store the list into a `.txt` file.
2. Extract all the citations for each patent and store the list into a `.txt` file.
3. Analyse the abstracts for each patent and generate a vocabulary count vector. Store the results into a `.txt` file.

More details for each task will be given in the following sections.

## 2. Import libraries

In [1]:
import pandas as pd
import re

## 3. Examining and loading data

As a first step, the file `Group006.txt` will be loaded so its first 10 lines can be inspected.

In [2]:
# print first ten lines of the file
with open('./Group006.txt','r') as infile:
    print('\n'.join([infile.readline().strip() for i in range(0, 10)]))

<?xml version="1.0" encoding="UTF-8"?>
<us-patent-grant lang="EN" dtd-version="v4.5 2014-04-03" file="US10359188-20190723.XML" status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20190709" date-publ="20190723">
<us-bibliographic-data-grant>
<publication-reference>
<document-id>
<country>US</country>
<doc-number>10359188</doc-number>
<kind>B1</kind>
<date>20190723</date>
</document-id>


In [3]:
# matches everything between the XML declaration and the root closing tag
with open('./Group006.txt','r') as infile:
    text = infile.read()
regex = r'<\?xml[\s\S]*?</us-patent-grant>' 
patents = re.findall(regex, text)
print(patents[0])

<?xml version="1.0" encoding="UTF-8"?>
<us-patent-grant lang="EN" dtd-version="v4.5 2014-04-03" file="US10359188-20190723.XML" status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20190709" date-publ="20190723">
<us-bibliographic-data-grant>
<publication-reference>
<document-id>
<country>US</country>
<doc-number>10359188</doc-number>
<kind>B1</kind>
<date>20190723</date>
</document-id>
</publication-reference>
<application-reference appl-type="utility">
<document-id>
<country>US</country>
<doc-number>16011150</doc-number>
<date>20180618</date>
</document-id>
</application-reference>
<us-application-series-code>16</us-application-series-code>
<classifications-ipcr>
<classification-ipcr>
<ipc-version-indicator><date>20060101</date></ipc-version-indicator>
<classification-level>A</classification-level>
<section>A</section>
<class>63</class>
<subclass>B</subclass>
<main-group>45</main-group>
<subgroup>00</subgroup>
<symbol-position>F</symbol-position>
<classification-value>
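
The split above relies on the lazy quantifier `*?`, so each match stops at the first closing root tag instead of running to the end of the file. A minimal sketch of the difference, using two made-up records (the `sample` string and its contents are hypothetical, not taken from `Group006.txt`):

```python
import re

# Two toy records concatenated the way the input file concatenates patents.
# The contents are invented for illustration only.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<us-patent-grant file="US00000001-20190101.XML">\n<a>first</a>\n</us-patent-grant>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<us-patent-grant file="US00000002-20190101.XML">\n<a>second</a>\n</us-patent-grant>\n'
)

# Lazy: stop at the first closing root tag, so one match per record.
lazy = re.findall(r'<\?xml[\s\S]*?</us-patent-grant>', sample)
# Greedy: run to the last closing root tag, so everything collapses into one match.
greedy = re.findall(r'<\?xml[\s\S]*</us-patent-grant>', sample)

print(len(lazy))    # 2
print(len(greedy))  # 1
```

With the greedy form, `patents` would hold a single string containing every record, and the per-patent loops later in the notebook would run only once.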

## 4. Parsing the given text file

Find the grant-id for each record

In [5]:
# capture the part of the file attribute that precedes the 8-digit date
regex = r'file="([^"]*)-\d{8}\.XML"'
grant_id = re.findall(regex,text)
#grant_id

Find the kind of each patent

In [6]:
# [^"]* keeps the capture inside the attribute's quotes
regex = r'appl-type="([^"]*)">'
patent_kind = re.findall(regex, text)
#patent_kind

Find the patent_title

In [7]:
# \w already includes digits, so \d is not needed in the id pattern
regex = r'<invention-title id="\w{5,6}">(.*)</invention-title>'
patent_title = re.findall(regex, text)
#patent_title

Find the number_of_claims

In [8]:
regex = r'<number-of-claims>(.*)</number-of-claims>'
number_of_claims = re.findall(regex, text)
#number_of_claims
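
The header lists `xml.etree.ElementTree` under "Libraries used", although the cells here rely on regular expressions only. As a cross-check for the single-valued fields, one record could also be parsed as XML. The following is a sketch under assumptions: the `record` string is a hypothetical, heavily shortened patent, and the real records may need extra handling before they parse cleanly.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened record that reuses the tag and attribute names
# targeted by the regular expressions above.
record = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<us-patent-grant file="US00000001-20190101.XML">'
    '<us-bibliographic-data-grant>'
    '<application-reference appl-type="utility"/>'
    '<invention-title id="d2e43">Toy title</invention-title>'
    '<number-of-claims>3</number-of-claims>'
    '</us-bibliographic-data-grant>'
    '</us-patent-grant>'
)

# Parse from bytes so the encoding declaration is honoured by the parser.
root = ET.fromstring(record.encode('utf-8'))

parsed = {
    'grant_id': root.get('file').split('-')[0],
    'patent_kind': root.find('.//application-reference').get('appl-type'),
    # itertext() also collects text inside nested markup, if a title has any
    'patent_title': ''.join(root.find('.//invention-title').itertext()),
    'number_of_claims': int(root.find('.//number-of-claims').text),
}
print(parsed)
```

Comparing such a dictionary with the regex results for a handful of records is a cheap way to spot patterns that capture too much or too little.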

Find the citations_examiner_count and citations_applicant_count

In [9]:
citations_examiner_count = []
citations_applicant_count = []
citation_regex = r'<us-citation>[\s\S]*?</us-citation>'
category_regex = r'cited by (.*)<'
for patent in patents:
    citations = re.findall(citation_regex, patent)
    applicant = []
    examiner = []
    for citation in citations:
        # who cited this reference (avoids shadowing the built-in `type`)
        cited_by = re.search(category_regex, citation).group(1)
        if cited_by == 'applicant':
            applicant.append(citation)
        else:
            # every category other than 'applicant' is counted as examiner
            examiner.append(citation)
    citations_examiner_count.append(len(examiner))
    citations_applicant_count.append(len(applicant))
#citations_examiner_count
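
The loop above puts every citation that is not marked `applicant` into the examiner bucket. A possible alternative is to tally each category separately with `collections.Counter`. This is a sketch only: the `<category>` wrapper and the sample values are assumptions, and it is not known whether `Group006.txt` contains categories other than the two counted here.

```python
import re
from collections import Counter

# Hypothetical citation block; the <category> tag name is an assumption.
patent = (
    '<us-citation><category>cited by applicant</category></us-citation>\n'
    '<us-citation><category>cited by examiner</category></us-citation>\n'
    '<us-citation><category>cited by applicant</category></us-citation>\n'
)

# [^<]+ stops at the next tag, so multi-word categories stay intact.
counts = Counter(re.findall(r'cited by ([^<]+)<', patent))

print(counts['examiner'], counts['applicant'])  # 1 2
print(counts['third party'])                    # 0, missing keys default to zero
```

Because a `Counter` returns zero for absent keys, patents without any citations need no special case.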

Find the names of the inventors

In [10]:
inventors = []
for patent in patents:
    # isolate the <inventors> block, then pull the names out of it
    inventor_block = re.findall(r'<inventors>[\s\S]*?</inventors>', patent)[0]
    last_name = re.findall(r'<last-name>(.*)</last-name>', inventor_block)
    first_name = re.findall(r'<first-name>(.*)</first-name>', inventor_block)
    # names are joined as last name + first name, with no separator
    inventor = [last + first for last, first in zip(last_name, first_name)]
    inventors.append(inventor)
#inventors
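
Zipping all last names with all first names works only when every inventor has both fields. If one entry lacked a first name, the pairs would shift and the final name would be dropped. A sketch of a per-inventor variant follows; the `block` string, the missing-first-name case and the `field` helper are all hypothetical, and the names are still joined without a separator to match the output shown in this notebook.

```python
import re

# Hypothetical <inventors> block; the second entry has no <first-name>.
block = (
    '<inventors>'
    '<inventor><last-name>Doe</last-name><first-name>Jane</first-name></inventor>'
    '<inventor><last-name>Acme Labs</last-name></inventor>'
    '<inventor><last-name>Roe</last-name><first-name>Richard</first-name></inventor>'
    '</inventors>'
)

def field(tag, text):
    """Return the text of the first <tag> element in text, or '' if absent."""
    match = re.search(r'<{0}>(.*?)</{0}>'.format(tag), text)
    return match.group(1) if match else ''

# Block-wide zip, as in the cell above: names shift after the missing field.
zipped = [last + first for last, first in zip(
    re.findall(r'<last-name>(.*?)</last-name>', block),
    re.findall(r'<first-name>(.*?)</first-name>', block))]

# Per-inventor extraction: each name is built from its own <inventor> element.
per_inventor = [
    field('last-name', person) + field('first-name', person)
    for person in re.findall(r'<inventor\b[^>]*>[\s\S]*?</inventor>', block)
]

print(zipped)        # ['DoeJane', 'Acme LabsRichard']
print(per_inventor)  # ['DoeJane', 'Acme Labs', 'RoeRichard']
```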

Find the claim_text of patents

In [12]:
claim_text = []
for i in patents:
    regex = r"<claim-text>[\s\S]*</claim>"
    claim_text1 = re.findall(regex, i)
    claim_text2 = re.sub('(<.*?>)','',claim_text1[0])
    claim_text3 = re.sub(r'(\n)','',claim_text2)
    claim_text.append(claim_text3)

#len(claim_text)

Find the abstract of patents

In [13]:
abstract = []
for patent in patents:
    regex = r'<abstract id="abstract">[\s\S]*</abstract>'
    abstract1 = re.findall(regex, patent)
    if len(abstract1) == 0:
        abstract3 = 'NA'
    else:
        abstract2 = re.sub('<.*?>','',abstract1[0])
        abstract3 = re.sub(r'(\n)','',abstract2)
    abstract.append(abstract3)
#abstract
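
Stripping tags leaves any XML character entities (for example `&amp;`) in the text. Whether the abstracts in this file contain such entities is an assumption; if they do, `html.unescape` from the standard library could be applied after the two substitutions. A small sketch on a made-up abstract:

```python
import html
import re

# Hypothetical abstract with nested markup and one escaped ampersand.
raw = ('<abstract id="abstract">\n'
       '<p id="p-0001" num="0000">A device &amp; method for <i>testing</i>.</p>\n'
       '</abstract>')

no_tags = re.sub(r'<.*?>', '', raw)      # drop every tag
one_line = no_tags.replace('\n', '')     # drop line breaks
clean = html.unescape(one_line)          # turn &amp; back into &

print(clean)  # A device & method for testing.
```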

In [12]:
raw_data = {'grant_id': grant_id,
            'patent_kind': patent_kind,
            'patent_title': patent_title,
            'number_of_claims': number_of_claims,
            'citations_examiner_count': citations_examiner_count,
            'citations_applicant_count': citations_applicant_count,
            'inventors': inventors,
            'claim_text': claim_text,
            'abstract': abstract
            }
df = pd.DataFrame(raw_data, columns=['grant_id', 'patent_kind', 'patent_title', 'number_of_claims',
                                     'citations_examiner_count', 'citations_applicant_count',
                                     'inventors', 'claim_text', 'abstract'])
df

| | grant_id | patent_kind | patent_title | number_of_claims | citations_examiner_count | citations_applicant_count | inventors | claim_text | abstract |
|---|---|---|---|---|---|---|---|---|---|
| 0 | US10359188 | utility | LED lights for deep ocean use | 1 | 0 | 8 | [OlssonMark S., SimmonsJon E., SteinerAaron J.... | 1. A submersible LED light for deep ocean use,... | An underwater LED light for use in high ambien... |
| 1 | US10361898 | utility | Complexity reduction for OFDM signal transmiss... | 22 | 9 | 13 | [JiaMing, MaJianglei] | 1. A method for wireless communications, the m... | An OFDM signal may include a first edge band s... |
| 2 | US10357939 | utility | High performance light weight carbon fiber fab... | 11 | 3 | 0 | [DhakateSanjay Ragnath, ChaudharyAnisha, Gupta... | 1. A carbon fiber fabric-carbon nanofiber epox... | The present disclosure relates to the developm... |
| 3 | US10361741 | utility | Mobile device enclosure system | 20 | 7 | 34 | [LiuWarren] | 1. A mobile device enclosure system comprising... | A mobile device enclosure system is an apparat... |
| 4 | US10361864 | utility | Enabling a secure OEM platform feature in a co... | 15 | 8 | 0 | [GlendinningDuncan] | 1. A platform feature licensing device (PFLD) ... | A platform feature licensing module (e.g., a U... |
| 5 | US10359775 | utility | Managing electricity usage for an appliance | 18 | 4 | 21 | [AryaVijay, GanuTanuja Hrishikesh, HusainSaifu... | 1. A method comprising:prior to an initial per... | One embodiment provides a method including: pr... |
| 6 | US10359973 | utility | Image forming apparatus adjusting image formin... | 5 | 5 | 1 | [ItoAya] | 1. An image forming apparatus, comprising:a pl... | An image forming apparatus includes a selectio... |
| 7 | US10361865 | utility | Signature method and system | 20 | 11 | 8 | [HibshooshEliphaz, KipnisAviad, MosheNir, Shal... | 1. A method for digitally signing blocks of da... | In one embodiment, a method, system, and appar... |
| 8 | US10362050 | utility | System and methods for scalably identifying an... | 22 | 16 | 9 | [BorohovskiMichael, BraunAinsley K, SedatBenja... | 1. A method comprising:receiving information a... | A security auditing computer system efficientl... |
| 9 | US10357290 | utility | Apparatus and method for connecting surgical rods | 16 | 8 | 0 | [TorresEller] | 1. An apparatus for connecting surgical rods, ... | An exemplary apparatus for connecting surgical... |
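
Several columns come from `re.findall` over the whole file, while others are built one patent at a time. If a pattern misses a record, the lists end up with different lengths and `pd.DataFrame` raises a `ValueError`. A check of the following kind, run before building the frame, would point at the offending column. The `length_report` helper and the toy data are hypothetical.

```python
def length_report(columns):
    """Return {column name: number of values} for a dict of lists."""
    return {name: len(values) for name, values in columns.items()}

# Toy stand-in for raw_data; one title is missing on purpose.
toy_columns = {
    'grant_id': ['US00000001', 'US00000002'],
    'patent_title': ['Toy title'],
    'abstract': ['NA', 'NA'],
}

report = length_report(toy_columns)
mismatched = len(set(report.values())) > 1

print(report)
print(mismatched)  # True, because patent_title is one entry short
```

In the notebook this could be called as `length_report(raw_data)` and compared against `len(patents)`.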


In [None]:
df.to_csv(r')

In [15]:
dataDict = {}
# iterate over every parsed record instead of a hard-coded count
for i in range(len(grant_id)):
    dataDict[grant_id[i]] = {'patent_kind': patent_kind[i],
                             'patent_title': patent_title[i],
                             'number_of_claims': number_of_claims[i],
                             'citations_examiner_count': citations_examiner_count[i],
                             'citations_applicant_count': citations_applicant_count[i],
                             'inventors': inventors[i],
                             'claim_text': claim_text[i],
                             'abstract': abstract[i]
                             }

In [19]:
dataDict

{'US10359188': {'patent_kind': 'utility',
  'patent_title': 'LED lights for deep ocean use',
  'number_of_claims': '1',
  'citations_examiner_count': 0,
  'citations_applicant_count': 8,
  'inventors': ['OlssonMark S.',
   'SimmonsJon E.',
   'SteinerAaron J.',
   'Sanderson, IVJohn R.'],
  'claim_text': '1. A submersible LED light for deep ocean use, comprising:a pressure and leak resistant housing structured to withstand ambient exterior water pressures corresponding to liquid depths of approximately 1000 meters or more;a transparent pressure bearing window positioned in the forward end of the housing and extending across an aperture therein;an MCPCB including an LED driver circuit and a plurality of LEDs disposed within the housing adjacent to the aperture so as to pass light through the aperture and the transparent pressure bearing window; anda multilayer stack of spacers of a high compressive strength material comprising one or more of a PEEK plastic, ULTEM resin, ceramic, and met

In [16]:
import json

def save_dict_to_file(dic):
    # json.dump writes valid JSON; str(dic) would write a Python repr with single quotes
    with open('dict.json', 'w') as f:
        json.dump(dic, f)

In [17]:
save_dict_to_file(dataDict)
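
For a file named `dict.json` to be readable by JSON tools, its contents have to be JSON text, not the `str()` form of a dictionary. The sketch below shows the difference in memory on a made-up entry (the `toy` dictionary is hypothetical):

```python
import json

# Hypothetical entry shaped like one item of dataDict.
toy = {'US00000001': {'patent_title': 'Toy title', 'inventors': ['DoeJane']}}

as_repr = str(toy)         # Python repr, uses single quotes
as_json = json.dumps(toy)  # JSON text, uses double quotes

try:
    json.loads(as_repr)
    repr_is_json = True
except ValueError:  # json.JSONDecodeError is a subclass of ValueError
    repr_is_json = False

print(repr_is_json)                 # False
print(json.loads(as_json) == toy)   # True
```

Loading the saved file back with `json.load` and comparing it to `dataDict` is a quick round-trip check after saving.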