# FIT5196 Data wrangling S2 2017

## Assessment 1: Text Preprocessing

### Introduction
In this assignment, I use data wrangling techniques to preprocess text data provided in XML format, which contains 2,500 patents, and to convert the data into a format suitable for later reference. For each patent, the given data includes the publication reference, application reference, abstract, description, claims, citations, etc.

The text preprocessing for this assignment focuses on collating patent information, including the International Patent Classification (IPC) code and the abstract, as well as the patents' citation network.

* Student: CHAO KAI HSU
* Student ID: 28397371 

This text preprocessing assignment consists of 4 tasks, written in separate ipynb files in Python 2 using Jupyter Notebook:

1. Extract the hierarchical IPC code
2. Extract the citation network
3. Identify the number of patent citations
4. Extract and preprocess abstracts
-----------------

## This file covers task 3.

# Identify the number of patent citations

The goal is to identify the number of times each patent has been cited according to the data in the XML file, and to store the counts in the output file "cited.txt" in the following format:

 * cited_patent_id: number of times it is cited

## Step 1. Import Libraries

In [1]:
from bs4 import BeautifulSoup
from collections import Counter

## Step 2. Use BeautifulSoup to extract information from the XML file.

When I first tried to open the XML file and parse it in the Jupyter notebook using BeautifulSoup with lxml’s XML parser, the parser could not parse all the data in the file and returned some error information. After examining the original XML file in an editor, I found that the file is not a well-formed XML document; instead, it contains 2,500 individual XML documents one after another. As a result, lxml’s XML parser successfully parses only the first section, because that is the only part it recognizes as a well-formed XML structure.

One way to solve this problem is to open the file directly without parsing it, split its contents into correctly structured documents, and store each patent separately in a list before using any parser. Another is to use Python’s html.parser with BeautifulSoup, which can parse this non-well-formed XML file without error. I am going to use Python’s html.parser to parse and extract data from the XML file.

In [2]:
with open("./patents.xml") as xml_file:
    soup = BeautifulSoup(xml_file, "html.parser")
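
The first option described above, splitting the file into individual documents before parsing, could look roughly like the sketch below. It assumes that every patent in the file starts with its own `<?xml ` declaration, which is not confirmed here. `split_xml_documents` is a hypothetical helper name, and the `sample` string is a made-up stand-in for the contents of patents.xml. The standard library's `xml.etree.ElementTree` is used only so that the sketch runs on its own.

```python
import xml.etree.ElementTree as ET


def split_xml_documents(text, marker="<?xml "):
    """Split text holding several concatenated XML documents into a list.

    Assumption: each document starts with its own XML declaration.
    """
    pieces = text.split(marker)
    # pieces[0] is whatever precedes the first declaration (normally empty),
    # so it is skipped; the marker is put back in front of every other piece.
    return [marker + piece.rstrip() for piece in pieces[1:]]


# Made-up stand-in for the real file: two tiny documents back to back.
sample = (
    '<?xml version="1.0"?>\n<patent><doc-number>1</doc-number></patent>\n'
    '<?xml version="1.0"?>\n<patent><doc-number>2</doc-number></patent>\n'
)

documents = split_xml_documents(sample)
# Each chunk is now a well-formed document and can be parsed separately.
roots = [ET.fromstring(doc) for doc in documents]
doc_numbers = [root.find("doc-number").text for root in roots]
```

Real patent files may need extra handling (for example DOCTYPE declarations or entities), so this is only a starting point for that alternative.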

## Step 3. Examine the hierarchy of the XML file and obtain the information we need.

By examining the hierarchy of the XML file, we note that the information we want to extract is stored in the following tags:

 * for cited_patent_id:
 
                     
        references-cited > citation > doc-number 
                           citation > doc-number 
                               .
                               .
                               .
                           citation > doc-number 
       (The cited patent ID for a citing patent is located under the tags references-cited > citation > doc-number,
       and a citing patent may have more than one cited patent.)
                         
By collecting all cited patents from the parsed data, we obtain a list of patent IDs in which each entry represents one citation of that patent by a particular citing patent. By counting the entries, we can then obtain a list of all cited patents together with the number of times each patent is cited.

###### Find each cited_patent_id and store it in a list: cited_patents
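
To make the tag hierarchy concrete, here is a minimal sketch on a made-up fragment with two citing patents. It uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup so that it runs on its own; the patent IDs and the exact nesting in `fragment` are illustrative assumptions, not taken from patents.xml.

```python
import xml.etree.ElementTree as ET

# Invented fragment: two citing patents, three citations in total.
fragment = """
<patents>
  <patent>
    <references-cited>
      <citation><doc-number>1111111</doc-number></citation>
      <citation><doc-number>2222222</doc-number></citation>
    </references-cited>
  </patent>
  <patent>
    <references-cited>
      <citation><doc-number>1111111</doc-number></citation>
    </references-cited>
  </patent>
</patents>
""".strip()

root = ET.fromstring(fragment)

cited = []
# Walk references-cited > citation > doc-number for every citing patent.
for references in root.iter("references-cited"):
    for citation in references.iter("citation"):
        # ".//doc-number" also matches a doc-number nested deeper in the citation.
        cited.append(citation.find(".//doc-number").text)
```

Patent 1111111 appears twice in `cited` because two different patents cite it, which is exactly what the counting step relies on.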

In [3]:
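# Collect every <references-cited> block in the parsed file
references_cited_tags = soup.find_all("references-cited")

# For each block, gather the <citation> tags inside it
citation_tags = []
for item in references_cited_tags:
    citation_tags.append(item.find_all("citation"))

# Flatten: take the <doc-number> text of every citation
cited_patents = []
for citations in citation_tags:
    for item in citations:
        cited_patents.append(item.find("doc-number").string)

cited_patents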

[u'PP17672',
 u'PP18482',
 u'PP18483',
 u'4954776',
 u'4956606',
 u'5015948',
 u'5115193',
 u'5180978',
 u'5332966',
 u'5332996',
 u'5351003',
 u'5381090',
 u'5521496',
 u'5914593',
 u'3988719',
 u'4206996',
 u'4803623',
 u'4905098',
 u'5012281',
 u'5161222',
 u'5172244',
 u'5253152',
 u'5263153',
 u'5270775',
 u'5301262',
 u'5341363',
 u'5355490',
 u'5410754',
 u'5537626',
 u'5559958',
 u'5574859',
 u'5580177',
 u'5611046',
 u'5647056',
 u'5828864',
 u'4561124',
 u'4831666',
 u'4920577',
 u'5105473',
 u'5134726',
 u'D338281',
 u'5611081',
 u'5729832',
 u'5845333',
 u'6115838',
 u'6332224',
 u'6805957',
 u'7089598',
 u'4355632',
 u'4702235',
 u'5032705',
 u'5148002',
 u'5603648',
 u'6439942',
 u'6757916',
 u'6910229',
 u'4599609',
 u'4734072',
 u'4843014',
 u'5061636',
 u'5493730',
 u'5635909',
 u'6080690',
 u'6267232',
 u'6388422',
 u'6767509',
 u'2003/0214408',
 u'2004/0009729',
 u'197 49 862',
 u'101 55 935',
 u'203 08 642',
 u'103 11 185',
 u'103 50 869',
 u'103 57 193',
 u'WO 00/6

## Step 4. Count how many times each patent is cited in the obtained list, then write the results to a file.

A list containing all cited patents has now been generated.

###### 4.1 Use the Counter class from the collections module to obtain, for every cited patent, the number of times it is cited.

In [4]:
# Count how many times each patent ID appears in the list
cited_patents_count = Counter(cited_patents)

# Convert the counts to a list of (patent_id, count) tuples
cited_scheme = list(cited_patents_count.items())
cited_scheme

[(u'2001/0040064', 1),
 (u'6508656', 1),
 (u'7083454', 1),
 (u'2553225', 1),
 (u'5047630', 1),
 (u'6442434', 1),
 (u'2562095', 1),
 (u'2009/0028258', 1),
 (u'1246688', 2),
 (u'5835116', 1),
 (u'2002/0077193', 2),
 (u'4341235', 1),
 (u'5126607', 1),
 (u'6422408', 1),
 (u'2002/0021296', 1),
 (u'2007/0210929', 1),
 (u'4077158', 1),
 (u'6350939', 1),
 (u'6515473', 1),
 (u'4570628', 1),
 (u'200720003525.6', 1),
 (u'4708312', 1),
 (u'6229418', 1),
 (u'4005186', 1),
 (u'2004/0241451', 1),
 (u'6115508', 1),
 (u'3994805', 1),
 (u'5167921', 1),
 (u'6960289', 3),
 (u'7193223', 1),
 (u'5500046', 1),
 (u'2004/0109459', 1),
 (u'7082505', 1),
 (u'10 2004 032 005', 1),
 (u'4006982', 1),
 (u'2007/0129759', 1),
 (u'4676249', 1),
 (u'6723692', 1),
 (u'6683784', 1),
 (u'WO2005/084546', 3),
 (u'7248460', 1),
 (u'6766668', 1),
 (u'4960642', 1),
 (u'1074832', 2),
 (u'19970001293', 1),
 (u'1762515', 1),
 (u'6425841', 2),
 (u'19970001294', 1),
 (u'5752152', 1),
 (u'6491177', 1),
 (u'5129200', 2),
 (u'102004286
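
As a small illustration of what Counter does in this step, the sketch below counts a short made-up list of patent IDs. The IDs and the `sample_*` names are invented for the example and do not come from the data set.

```python
from collections import Counter

# Invented list: one entry per citation, as in cited_patents.
sample_cited = ["1111111", "2222222", "1111111", "3333333", "1111111"]

# Counter maps each distinct ID to the number of times it appears.
sample_count = Counter(sample_cited)

# (patent_id, count) tuples, sorted by patent ID.
sample_scheme = sorted(sample_count.items())

# most_common(1) returns the most frequently cited ID with its count.
top = sample_count.most_common(1)
```

Here `sample_scheme` has the same shape as `cited_scheme`, and `most_common` is an optional extra for spotting heavily cited patents.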

###### 4.2 Once the final list is built, sort the patents by cited patent ID and write each item to the file line by line in the output format the task requires:
* cited_patent_id: number of times it is cited

In [5]:
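# Sort by cited patent ID (IDs are strings, so the order is lexicographic)
cited_scheme.sort()

# Write one "cited_patent_id:count" line per patent; the file is closed automatically
with open("cited.txt", "w") as output_file:
    for patent_id, count in cited_scheme:
        output_file.write(patent_id + ":" + str(count) + "\n")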
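
A quick way to check the output format is to read the file back and split each line at its last colon. The sketch below does this on a small temporary file written in the same `id:count` layout, so it does not touch the real cited.txt. `parse_cited_line` is a hypothetical helper, and the count paired with each ID in `rows` is invented for the example.

```python
import os
import tempfile


def parse_cited_line(line):
    """Turn a 'patent_id:count' line into a (patent_id, count) tuple."""
    # Split at the last colon so that any colon inside an ID is preserved.
    patent_id, count = line.rstrip("\n").rsplit(":", 1)
    return patent_id, int(count)


rows = [("2003/0214408", 1), ("6960289", 3)]

# Write the rows to a temporary file in the same layout as cited.txt.
handle, path = tempfile.mkstemp(suffix=".txt")
os.close(handle)
with open(path, "w") as output_file:
    for patent_id, count in rows:
        output_file.write(patent_id + ":" + str(count) + "\n")

# Read the file back and parse every line.
with open(path) as input_file:
    parsed = [parse_cited_line(line) for line in input_file]
os.remove(path)
```

If `parsed` equals `rows`, the written lines round-trip cleanly, including IDs that contain slashes or spaces.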

### Task 3 end.