# FIT5196 Data wrangling S2 2017

## Assessment 1: Text Preprocessing

### Introduction
In this assignment, I use data wrangling techniques to preprocess text data provided in XML format, covering 2,500 patents, and convert it into a format suitable for reference. The given data includes the publication reference, application reference, abstract, description, claims, citations, etc.

The text preprocessing in this assignment focuses on collating patent information, including the International Patent Classification (IPC) code and the abstract, as well as the patents' citation network.

* Student: CHAO KAI HSU
* Student ID: 28397371 

The assignment consists of 4 tasks, written in separate ipynb files in Python 2 using Jupyter Notebook:

1. Extract the hierarchical IPC code
2. Extract the citation network
3. Identify the number of patent citations
4. Extract and preprocess abstracts

-----------------

## This file belongs to task 2.

# Extract the citation network

This task extracts all the references for each patent, in other words the citations of each of the 2,500 patents in the XML file, and stores them in the output file citations.txt in the following format:
* citing_patent_id:cited_patent_id,cited_patent_id,....
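
As a minimal sketch of this output format, the helper below builds one line from a citing patent ID and its list of cited IDs. The function name `format_citation_line` is hypothetical and is not part of the notebook; the IDs are taken from the first patent in the results shown later.

```python
def format_citation_line(citing_id, cited_ids):
    """Return one output line: citing_patent_id:cited_patent_id,cited_patent_id,..."""
    return citing_id + ":" + ",".join(cited_ids)


# IDs of the first patent and the three patents it cites.
example_line = format_citation_line("PP021722", ["PP17672", "PP18482", "PP18483"])
# example_line is "PP021722:PP17672,PP18482,PP18483"
```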

## Step 1. Import Libraries 

In [1]:
from bs4 import BeautifulSoup

## Step 2. Use BeautifulSoup to extract information from the XML file.

When I first tried to open the XML file and parse it in the Jupyter notebook using BeautifulSoup with lxml’s XML parser, the parser could not parse all the data in the file and returned error messages. After examining the original XML file in an editor, I found that it is not a well-formed XML file: it is instead 2,500 individual XML documents concatenated into one file. As a result, lxml’s XML parser successfully parses only the first section, the only part it recognizes as a well-formed XML document.

One way to solve this problem is to open the file directly without parsing it, split it into its individual documents, and save each patent to a list separately before using any parser. Another is to use Python’s html.parser with BeautifulSoup, which parses this kind of non-well-formed XML file without error. I am going to use Python’s html.parser to parse and extract data from the XML file.
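
The first option (splitting the file before parsing) is not the approach used in this notebook, but it could be sketched roughly as follows. The sketch assumes that every individual document starts with an `<?xml ...?>` declaration, and it uses the standard library's `xml.etree.ElementTree` rather than lxml. The helper name `split_xml_documents` and the two-document sample are hypothetical; the real file also contains many more tags, and possibly entities that ElementTree would reject.

```python
import re
import xml.etree.ElementTree as ET


def split_xml_documents(raw_text):
    """Split text holding several concatenated XML documents into a list of
    single-document strings. Assumes each document starts with '<?xml'."""
    starts = [match.start() for match in re.finditer(r"<\?xml", raw_text)]
    ends = starts[1:] + [len(raw_text)]
    return [raw_text[start:end].strip() for start, end in zip(starts, ends)]


# Hypothetical two-document sample standing in for the real patents file.
sample = (
    '<?xml version="1.0"?>\n<patent><doc-number>A1</doc-number></patent>\n'
    '<?xml version="1.0"?>\n<patent><doc-number>B2</doc-number></patent>\n'
)

documents = split_xml_documents(sample)
# Each piece is now a well-formed document that an XML parser accepts.
roots = [ET.fromstring(document) for document in documents]
doc_numbers = [root.find("doc-number").text for root in roots]
```

Keeping one parsed tree per patent also keeps each patent's citations tied to that patent, instead of relying on list positions.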

In [2]:
# Open the file in a with block so it is closed once parsing is done.
with open("./patents.xml") as xml_file:
    soup = BeautifulSoup(xml_file, "html.parser")

## Step 3. Examine the hierarchy of the XML file and obtain the information we need.

By examining the hierarchy of the XML file, we find that the information we would like to extract is stored in the following tags:

 * for citing_patent_id:

        publication-reference > doc-number

   The citing patent ID is located under the tags publication-reference > doc-number.

 * for cited_patent_id:

        references-cited > citation > doc-number
                           citation > doc-number
                               .
                               .
                               .
                           citation > doc-number

   The cited patent IDs for a citing patent are located under the tags references-cited > citation > doc-number, and a citing patent may have more than one cited patent.
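
To make the tag hierarchy concrete, here is a small sketch that walks the same paths on a hypothetical, heavily simplified patent record. It uses the standard library's `xml.etree.ElementTree` rather than the BeautifulSoup calls used in this notebook. The fragment is an assumption for illustration only, and the real records contain many more tags between these levels.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified record; only the tags discussed above are kept.
fragment = """
<patent>
  <publication-reference><doc-number>07891019</doc-number></publication-reference>
  <references-cited>
    <citation><doc-number>4355632</doc-number></citation>
    <citation><doc-number>4702235</doc-number></citation>
  </references-cited>
</patent>
"""

root = ET.fromstring(fragment)

# publication-reference > doc-number gives the citing patent ID.
citing_id = root.find("publication-reference").find(".//doc-number").text

# Every citation > doc-number gives one cited patent ID.
cited_ids = [citation.find(".//doc-number").text for citation in root.iter("citation")]
```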
        
###### 3.1 Get the citing patent IDs and store them in a list: citing_patent_id

In [3]:
publication_reference_tags = soup.find_all("publication-reference")
citing_patent_id = [item.find("doc-number").string for item in publication_reference_tags] 
citing_patent_id

[u'PP021722',
 u'RE042159',
 u'RE042170',
 u'07891018',
 u'07891019',
 u'07891020',
 u'07891021',
 u'07891023',
 u'07891025',
 u'07891026',
 u'07891027',
 u'07891029',
 u'07891030',
 u'07891032',
 u'07891033',
 u'07891034',
 u'07891036',
 u'07891037',
 u'07891038',
 u'07891039',
 u'07891041',
 u'07891044',
 u'07891053',
 u'07891055',
 u'07891056',
 u'07891057',
 u'07891058',
 u'07891059',
 u'07891060',
 u'07891063',
 u'07891067',
 u'07891070',
 u'07891071',
 u'07891076',
 u'07891078',
 u'07891082',
 u'07891083',
 u'07891084',
 u'07891086',
 u'07891087',
 u'07891097',
 u'07891098',
 u'07891104',
 u'07891107',
 u'07891111',
 u'07891114',
 u'07891115',
 u'07891116',
 u'07891117',
 u'07891118',
 u'07891121',
 u'07891123',
 u'07891129',
 u'07891133',
 u'07891136',
 u'07891139',
 u'07891140',
 u'07891141',
 u'07891146',
 u'07891148',
 u'07891152',
 u'07891158',
 u'07891159',
 u'07891160',
 u'07891161',
 u'07891162',
 u'07891163',
 u'07891165',
 u'07891166',
 u'07891167',
 u'07891169',
 u'078

###### 3.2 Find the cited patent IDs and store them in a list: cited_patents

In [4]:
# Each <references-cited> block holds the citations of one patent.
references_cited_tags = soup.find_all("references-cited")

# For each block, collect the doc-number of every <citation> it contains.
cited_patents = []
for references in references_cited_tags:
    cited_ids = [citation.find("doc-number").string
                 for citation in references.find_all("citation")]
    cited_patents.append(cited_ids)
cited_patents

[[u'PP17672', u'PP18482', u'PP18483'],
 [u'4954776',
  u'4956606',
  u'5015948',
  u'5115193',
  u'5180978',
  u'5332966',
  u'5332996',
  u'5351003',
  u'5381090',
  u'5521496',
  u'5914593'],
 [u'3988719',
  u'4206996',
  u'4803623',
  u'4905098',
  u'5012281',
  u'5161222',
  u'5172244',
  u'5253152',
  u'5263153',
  u'5270775',
  u'5301262',
  u'5341363',
  u'5355490',
  u'5410754',
  u'5537626',
  u'5559958',
  u'5574859',
  u'5580177',
  u'5611046',
  u'5647056',
  u'5828864'],
 [u'4561124',
  u'4831666',
  u'4920577',
  u'5105473',
  u'5134726',
  u'D338281',
  u'5611081',
  u'5729832',
  u'5845333',
  u'6115838',
  u'6332224',
  u'6805957',
  u'7089598'],
 [u'4355632',
  u'4702235',
  u'5032705',
  u'5148002',
  u'5603648',
  u'6439942',
  u'6757916',
  u'6910229'],
 [u'4599609',
  u'4734072',
  u'4843014',
  u'5061636',
  u'5493730',
  u'5635909',
  u'6080690',
  u'6267232',
  u'6388422',
  u'6767509',
  u'2003/0214408',
  u'2004/0009729',
  u'197 49 862',
  u'101 55 935',
  u

## Step 4. Data formatting for the collected data and output to file.

Two lists have now been generated: a patent list containing the IDs of all 2,500 patents, and an information list containing the patents cited by each of them.
Each citing patent in the patent list sits at the same position as the list of patents it cites in the information list.
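
Because the pairing relies purely on position, it may be worth checking that both lists have the same length before combining them; a patent without a references-cited block, for instance, would shift every later entry. The check below is a minimal sketch with a hypothetical helper name, `pair_by_position`, and is not part of the original notebook.

```python
def pair_by_position(citing_ids, cited_lists):
    """Pair each citing ID with the cited list at the same index.
    Raises ValueError if the two lists cannot be aligned one-to-one."""
    if len(citing_ids) != len(cited_lists):
        raise ValueError(
            "length mismatch: %d citing IDs vs %d cited lists"
            % (len(citing_ids), len(cited_lists))
        )
    return list(zip(citing_ids, cited_lists))


# First two entries from the lists shown above, with the second list shortened.
pairs = pair_by_position(
    ["PP021722", "RE042159"],
    [["PP17672", "PP18482", "PP18483"], ["4954776", "4956606"]],
)
```
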
###### 4.1 Combine both lists into one list, so that each item in the new list stores the citing patent's ID and the list of its cited patent IDs.

In [5]:
citation_network_scheme = []
for i in range (len(citing_patent_id)):
    citation_network_scheme.append([citing_patent_id[i], cited_patents[i]])
citation_network_scheme

[[u'PP021722', [u'PP17672', u'PP18482', u'PP18483']],
 [u'RE042159',
  [u'4954776',
   u'4956606',
   u'5015948',
   u'5115193',
   u'5180978',
   u'5332966',
   u'5332996',
   u'5351003',
   u'5381090',
   u'5521496',
   u'5914593']],
 [u'RE042170',
  [u'3988719',
   u'4206996',
   u'4803623',
   u'4905098',
   u'5012281',
   u'5161222',
   u'5172244',
   u'5253152',
   u'5263153',
   u'5270775',
   u'5301262',
   u'5341363',
   u'5355490',
   u'5410754',
   u'5537626',
   u'5559958',
   u'5574859',
   u'5580177',
   u'5611046',
   u'5647056',
   u'5828864']],
 [u'07891018',
  [u'4561124',
   u'4831666',
   u'4920577',
   u'5105473',
   u'5134726',
   u'D338281',
   u'5611081',
   u'5729832',
   u'5845333',
   u'6115838',
   u'6332224',
   u'6805957',
   u'7089598']],
 [u'07891019',
  [u'4355632',
   u'4702235',
   u'5032705',
   u'5148002',
   u'5603648',
   u'6439942',
   u'6757916',
   u'6910229']],
 [u'07891020',
  [u'4599609',
   u'4734072',
   u'4843014',
   u'5061636',
   u'549

###### 4.2 Once the final list is set up, sort it by citing patent ID and write each item to the file line by line in the output format the task requires:
* citing_patent_id:cited_patent_id,cited_patent_id,....

In [6]:
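# Sort by citing patent ID (string comparison on the first element of each item).
citation_network_scheme.sort()

# Write one line per citing patent; the with block closes the file afterwards.
with open("citations.txt", "w") as output_file:
    for item in citation_network_scheme:
        line = item[0] + ":" + ",".join(item[1]) + "\n"
        output_file.write(line)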
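
As a possible sanity check on the written file, the sketch below reads lines in this format back into a dictionary. The helper `read_citations` is hypothetical, and the demonstration writes its own small temporary file rather than touching citations.txt.

```python
import os
import tempfile


def read_citations(path):
    """Parse lines of the form citing_id:cited_id,cited_id,... into a dict
    mapping each citing ID to its list of cited IDs."""
    network = {}
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            citing_id, _, cited_part = line.partition(":")
            network[citing_id] = cited_part.split(",") if cited_part else []
    return network


# Demonstration on a small temporary file with two shortened example lines.
demo_path = os.path.join(tempfile.mkdtemp(), "citations_demo.txt")
with open(demo_path, "w") as handle:
    handle.write("07891019:4355632,4702235\nPP021722:PP17672,PP18482,PP18483\n")

network = read_citations(demo_path)
```

Comparing `len(network)` with the number of patents is one quick way to confirm that no citing patent was lost or duplicated.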

### End of Task 2.