# FIT5196 Data wrangling S2 2017

## Assessment 1: Text Preprocessing

### Introduction
In this assignment, I use data wrangling techniques to preprocess text data provided in XML format, which contains 2,500 patents, and to convert the data into a proper format for information reference. The given data includes information such as the publication reference, application reference, abstract, description, claims and citations.

The text preprocessing in this assignment focuses on collating patent information, including the International Patent Classification (IPC) code and the abstract, as well as the patents' citation network.

* Student: CHAO KAI HSU
* Student ID: 28397371 

This Text Preprocessing assignment consists of 4 tasks, written in separate ipynb files in Python 2 using Jupyter Notebook:

1. Extract the hierarchical IPC code
2. Extract the citation network
3. Identify the number of patent citations
4. Extract and preprocess abstracts
-----------------

## This file belongs to Task 1.

# Extract the hierarchical IPC code

The goal is to extract the hierarchical IPC codes for all 2,500 patents in the XML file and store them in the output file "classification.txt" in the following format:
                
* patent’s_ID:Section,Class,Subclass,Main_group,Subgroup

## Step 1. Import Libraries

In [1]:
from bs4 import BeautifulSoup

## Step 2. Use BeautifulSoup to extract information from the XML file.

When I first tried to open the XML file and parse the data in the Jupyter notebook using BeautifulSoup with lxml’s XML parser, the parser could not parse all the data in the file and returned some error information. After examining the original XML file in an editor, I found that the file is not well-formed XML: it instead contains 2,500 individual XML documents concatenated together. As a result, lxml’s XML parser can successfully parse only the first document, because that is the only part it recognizes as a well-formed XML structure.

One way to solve this problem is to open the file directly without parsing it, edit the original content into the correct structure, and save each patent to a list separately before using any parser. Another is to use Python’s html.parser with BeautifulSoup, which can parse the non-well-formed XML file without error. I am going to use Python’s html.parser to parse and extract data from the XML file.

In [2]:
# Open the file in a with-block so the handle is closed after parsing
with open("./patents.xml") as xml_file:
    soup = BeautifulSoup(xml_file, "html.parser")
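
The first option above, splitting the file into separate documents before parsing, might look like the following minimal sketch. It is not the approach used in this notebook. It assumes every patent record begins with its own `<?xml ` declaration; the helper name `split_concatenated_xml`, the `us-patent-grant` root tag and the sample content are made up for illustration, and the standard library's xml.etree.ElementTree stands in for the parser.

```python
import xml.etree.ElementTree as ET


def split_concatenated_xml(text, marker="<?xml "):
    """Split text holding several XML documents into one string per document.

    Assumption: each document starts with its own XML declaration.
    """
    chunks = text.split(marker)
    # Re-attach the marker that split() removed and skip empty pieces.
    return [marker + chunk.strip() for chunk in chunks if chunk.strip()]


# Tiny made-up sample standing in for the real patents file.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<us-patent-grant><doc-number>00000001</doc-number></us-patent-grant>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<us-patent-grant><doc-number>00000002</doc-number></us-patent-grant>\n'
)

documents = split_concatenated_xml(sample)

# Each piece is now well-formed on its own and can be parsed separately.
roots = [ET.fromstring(doc.encode("utf-8")) for doc in documents]
doc_numbers = [root.find("doc-number").text for root in roots]
```

Each element of `documents` could then be handed to a strict XML parser one at a time, at the cost of an extra pass over the file.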

## Step 3. Examine the hierarchy of the XML file and obtain the information we need.

By examining the hierarchy of the XML file, we find that the information to extract is stored under the following tags:

 * for the patent's ID (the ID of the citing patent):

        publication-reference > doc-number

 * for the patent's Section, Class, Subclass, Main_group and Subgroup (all under the tag classification-ipcr):

        classification-ipcr > section
        classification-ipcr > class
        classification-ipcr > subclass
        classification-ipcr > main-group
        classification-ipcr > subgroup
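
To make the tag paths above concrete, here is a small sketch that walks a single simplified record. The fragment is made up for illustration: the wrapper tags `patent` and `document-id` are placeholders, and real records hold many more elements. It uses the standard library's xml.etree.ElementTree rather than BeautifulSoup so that it runs without extra packages.

```python
import xml.etree.ElementTree as ET

# Made-up, heavily simplified record; not copied from the real file.
sample = """
<patent>
  <publication-reference>
    <document-id><doc-number>PP021722</doc-number></document-id>
  </publication-reference>
  <classification-ipcr>
    <section>A</section>
    <class>01</class>
    <subclass>H</subclass>
    <main-group>5</main-group>
    <subgroup>00</subgroup>
  </classification-ipcr>
</patent>
"""

root = ET.fromstring(sample)

# publication-reference > doc-number (at any depth below publication-reference)
patent_id = root.find("publication-reference").find(".//doc-number").text

# classification-ipcr > section, class, subclass, main-group, subgroup
ipcr = root.find("classification-ipcr")
levels = ["section", "class", "subclass", "main-group", "subgroup"]
codes = [ipcr.find(level).text for level in levels]

print("%s:%s" % (patent_id, ",".join(codes)))  # prints PP021722:A,01,H,5,00
```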
        
###### 3.1 Get each patent's ID and store the IDs in a list: patents_id

In [3]:
publication_reference_tags = soup.find_all("publication-reference")
patents_id = [item.find("doc-number").string for item in publication_reference_tags] 
patents_id

[u'PP021722',
 u'RE042159',
 u'RE042170',
 u'07891018',
 u'07891019',
 u'07891020',
 u'07891021',
 u'07891023',
 u'07891025',
 u'07891026',
 u'07891027',
 u'07891029',
 u'07891030',
 u'07891032',
 u'07891033',
 u'07891034',
 u'07891036',
 u'07891037',
 u'07891038',
 u'07891039',
 u'07891041',
 u'07891044',
 u'07891053',
 u'07891055',
 u'07891056',
 u'07891057',
 u'07891058',
 u'07891059',
 u'07891060',
 u'07891063',
 u'07891067',
 u'07891070',
 u'07891071',
 u'07891076',
 u'07891078',
 u'07891082',
 u'07891083',
 u'07891084',
 u'07891086',
 u'07891087',
 u'07891097',
 u'07891098',
 u'07891104',
 u'07891107',
 u'07891111',
 u'07891114',
 u'07891115',
 u'07891116',
 u'07891117',
 u'07891118',
 u'07891121',
 u'07891123',
 u'07891129',
 u'07891133',
 u'07891136',
 u'07891139',
 u'07891140',
 u'07891141',
 u'07891146',
 u'07891148',
 u'07891152',
 u'07891158',
 u'07891159',
 u'07891160',
 u'07891161',
 u'07891162',
 u'07891163',
 u'07891165',
 u'07891166',
 u'07891167',
 u'07891169',
 ...

###### 3.2 Find the "section", "class", "subclass", "main-group" and "subgroup" of each patent and store them in a list: patents_info

In [4]:
tags_list = ["section","class","subclass","main-group","subgroup"]

patents_info = []
classification_ipcr_tags = soup.find_all("classification-ipcr")
for item in classification_ipcr_tags:
    temp = []
    for tag in tags_list:
        temp.append(item.find(tag).string)
    patents_info.append(temp)      
    
patents_info

[[u'A', u'01', u'H', u'5', u'00'],
 [u'G', u'01', u'B', u'7', u'14'],
 [u'G', u'06', u'F', u'11', u'00'],
 [u'A', u'41', u'D', u'13', u'00'],
 [u'A', u'41', u'D', u'13', u'00'],
 [u'A', u'41', u'D', u'13', u'00'],
 [u'A', u'62', u'B', u'17', u'00'],
 [u'A', u'41', u'F', u'19', u'00'],
 [u'A', u'61', u'F', u'9', u'02'],
 [u'A', u'41', u'D', u'13', u'00'],
 [u'E', u'03', u'D', u'9', u'00'],
 [u'A', u'61', u'G', u'9', u'00'],
 [u'A', u'47', u'K', u'11', u'06'],
 [u'A', u'47', u'G', u'9', u'00'],
 [u'B', u'68', u'G', u'5', u'00'],
 [u'A', u'47', u'D', u'5', u'00'],
 [u'A', u'47', u'L', u'11', u'283'],
 [u'G', u'11', u'B', u'23', u'50'],
 [u'B', u'08', u'B', u'9', u'04'],
 [u'A', u'47', u'L', u'13', u'142'],
 [u'A', u'47', u'L', u'13', u'26'],
 [u'B', u'60', u'S', u'1', u'40'],
 [u'A', u'47', u'B', u'95', u'02'],
 [u'E', u'05', u'C', u'17', u'64'],
 [u'E', u'05', u'D', u'11', u'06'],
 [u'E', u'05', u'D', u'5', u'00'],
 [u'F', u'16', u'G', u'11', u'00'],
 [u'F', u'16', u'G', u'11', u'14'],
 ...

## Step 4. Data formatting for the collected data and output to file.

Two lists have now been generated: a patent list that contains the IDs of all 2,500 patents, and an information list that contains each patent's "section", "class", "subclass", "main-group" and "subgroup". Each patent occupies the same position in both lists.
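
Pairing by position is only safe if both lists have the same length and order, which in turn assumes that every patent contributes exactly one publication-reference and one classification-ipcr tag. A minimal guard for that assumption could look like the sketch below; the helper name `pair_by_position` is hypothetical, and the sample values are the first three entries from the outputs above.

```python
def pair_by_position(ids, infos):
    """Pair each ID with the classification at the same index, or fail loudly."""
    if len(ids) != len(infos):
        raise ValueError(
            "length mismatch: %d IDs but %d classifications" % (len(ids), len(infos))
        )
    return [[patent_id, info] for patent_id, info in zip(ids, infos)]


# First three entries taken from the outputs shown earlier.
ids = ["PP021722", "RE042159", "RE042170"]
infos = [
    ["A", "01", "H", "5", "00"],
    ["G", "01", "B", "7", "14"],
    ["G", "06", "F", "11", "00"],
]

pairs = pair_by_position(ids, infos)
```

If a patent carried several classification-ipcr tags, the lengths would differ and the check would raise instead of silently shifting every later classification onto the wrong patent.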

###### 4.1 Combine both lists into one list so that each item in the new list stores a patent's ID and its classification data, as shown below.

In [5]:
# Pair each patent ID with the classification at the same position
classification_scheme = [[patent_id, info]
                         for patent_id, info in zip(patents_id, patents_info)]
classification_scheme

[[u'PP021722', [u'A', u'01', u'H', u'5', u'00']],
 [u'RE042159', [u'G', u'01', u'B', u'7', u'14']],
 [u'RE042170', [u'G', u'06', u'F', u'11', u'00']],
 [u'07891018', [u'A', u'41', u'D', u'13', u'00']],
 [u'07891019', [u'A', u'41', u'D', u'13', u'00']],
 [u'07891020', [u'A', u'41', u'D', u'13', u'00']],
 [u'07891021', [u'A', u'62', u'B', u'17', u'00']],
 [u'07891023', [u'A', u'41', u'F', u'19', u'00']],
 [u'07891025', [u'A', u'61', u'F', u'9', u'02']],
 [u'07891026', [u'A', u'41', u'D', u'13', u'00']],
 [u'07891027', [u'E', u'03', u'D', u'9', u'00']],
 [u'07891029', [u'A', u'61', u'G', u'9', u'00']],
 [u'07891030', [u'A', u'47', u'K', u'11', u'06']],
 [u'07891032', [u'A', u'47', u'G', u'9', u'00']],
 [u'07891033', [u'B', u'68', u'G', u'5', u'00']],
 [u'07891034', [u'A', u'47', u'D', u'5', u'00']],
 [u'07891036', [u'A', u'47', u'L', u'11', u'283']],
 [u'07891037', [u'G', u'11', u'B', u'23', u'50']],
 [u'07891038', [u'B', u'08', u'B', u'9', u'04']],
 ...

###### 4.2 Once the final list has been set up, sort the patents in the list with the sort function and write each item to the file line by line in the output format the task requires:
* patent’s_ID:Section,Class,Subclass,Main_group,Subgroup

In [6]:
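# Sort in place by patent ID (the first element of each item)
classification_scheme.sort()

# Write one line per patent; the with-block closes the file automatically
with open("classification.txt", "w") as output_file:
    for item in classification_scheme:
        line = item[0] + ":" + ",".join(item[1]) + "\n"
        output_file.write(line)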
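
As a quick illustration of the sorting and line format, the sketch below runs on three records taken from the outputs above. The helper name `format_record` is hypothetical and is not part of the notebook's code.

```python
def format_record(patent_id, codes):
    """Build one output line: patent_ID:Section,Class,Subclass,Main_group,Subgroup."""
    return patent_id + ":" + ",".join(codes)


# Three records from the outputs shown earlier, deliberately out of order.
records = [
    ["RE042159", ["G", "01", "B", "7", "14"]],
    ["07891018", ["A", "41", "D", "13", "00"]],
    ["PP021722", ["A", "01", "H", "5", "00"]],
]

# Lists compare element by element, so this orders the records by patent ID.
records.sort()
lines = [format_record(patent_id, codes) for patent_id, codes in records]
```

Because the IDs are compared as strings, IDs that start with a digit come before those that start with "PP" or "RE".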

### Task 1 end.