# Identify the number of patent citations

The goal is to count how many times each patent is cited according to the data in the XML file, and to store the results in the output file "cited.txt" in the following format:

 * cited_patent_id: number of times it is cited

## Step 1. Import Libraries

In [1]:
from bs4 import BeautifulSoup
from collections import Counter

## Step 2. Use BeautifulSoup to extract information from the XML file.

The first attempt to open the XML file and parse it in the Jupyter notebook used BeautifulSoup with lxml’s XML parser. The parser could not process all the data in the file and returned error information. Examining the original file in an editor showed that it is not a well-formed XML file: it contains 2,500 individual XML documents, one after another. lxml’s XML parser therefore successfully parses only the first section, because that is the only part it recognizes as a well-formed XML structure.

There are two ways to solve this problem. One is to open the file directly without parsing it, edit the original content into the correct structure, and save each patent to a list separately before using any parser. The other is to use Python’s html.parser with BeautifulSoup, which can parse the file without error even though it is not well-formed XML. I am going to use Python’s html.parser to parse the file and extract the data.
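The approach of splitting the file into separate documents before parsing is not the route taken in this notebook, but it could look roughly like the sketch below. It assumes that every embedded document starts with an `<?xml` declaration, which has not been verified against the real file. The function name `split_concatenated_xml`, the `<patent>` root tag and the sample string are made up for illustration, and the standard library's `xml.etree.ElementTree` stands in for lxml.

```python
import xml.etree.ElementTree as ET


def split_concatenated_xml(text, marker="<?xml"):
    """Split text holding several XML documents into one string per document.

    Assumption: each document begins with an XML declaration (the marker).
    """
    documents = []
    for part in text.split(marker):
        if part.strip():
            # split() removes the marker, so put it back in front
            documents.append(marker + part)
    return documents


# Hypothetical two-document sample, not taken from patents.xml
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<patent><doc-number>1</doc-number></patent>\n"
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<patent><doc-number>2</doc-number></patent>\n"
)

documents = split_concatenated_xml(sample)

# Each piece is now well-formed on its own, so a strict XML parser accepts it
roots = [ET.fromstring(doc.encode("utf-8")) for doc in documents]
doc_numbers = [root.find("doc-number").text for root in roots]
print(doc_numbers)  # ['1', '2']
```

Keeping one parsed tree per patent would also make it easy to tell which citing patent each citation belongs to, which the single-soup approach does not need for this task.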

In [2]:
with open("./patents.xml") as xml_file:
    soup = BeautifulSoup(xml_file, "html.parser")

## Step 3. Examine the hierarchy of the XML file and obtain the information we need.

By examining the hierarchy of the XML file, we note that the information we want to extract is stored in the following tags:

 * for cited_patent_id:

        references-cited > citation > doc-number
                           citation > doc-number
                               .
                               .
                               .
                           citation > doc-number

   The cited patent ID for a citing patent is located under the tags references-cited > citation > doc-number, and a citing patent may cite more than one patent.
                         
Collecting every cited patent from the parsed data gives a list of patent IDs in which each entry represents one citation by a particular citing patent. Counting how often each ID occurs in that list then gives, for every cited patent, the number of times it is cited.
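The tag path described above can be tried out on a tiny hand-written sample. The sketch below is not the BeautifulSoup code used in this notebook: it uses the standard library's `html.parser.HTMLParser` directly (written for Python 3) so that it runs without extra packages. The class name `CitationCollector`, the `<patent>` root tag, the `doc-number` placed outside a citation, and all IDs are hypothetical.

```python
from html.parser import HTMLParser


class CitationCollector(HTMLParser):
    """Collect the text of every <doc-number> that sits inside a <citation>."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tags
        self.cited = []      # one entry per citation found

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the most recent matching open tag, if there is one
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        in_doc_number = bool(self.open_tags) and self.open_tags[-1] == "doc-number"
        if in_doc_number and "citation" in self.open_tags and data.strip():
            self.cited.append(data.strip())


# Hypothetical sample: two XML documents in one string, as in the real file
sample = """<?xml version="1.0" encoding="UTF-8"?>
<patent>
  <doc-number>X1</doc-number>
  <references-cited>
    <citation><doc-number>A</doc-number></citation>
    <citation><doc-number>B</doc-number></citation>
  </references-cited>
</patent>
<?xml version="1.0" encoding="UTF-8"?>
<patent>
  <doc-number>X2</doc-number>
  <references-cited>
    <citation><doc-number>A</doc-number></citation>
  </references-cited>
</patent>
"""

collector = CitationCollector()
collector.feed(sample)
collector.close()
print(collector.cited)  # ['A', 'B', 'A']
```

The `doc-number` tags that are not inside a `citation` (X1 and X2 here) are skipped, which is why the extraction follows the full path rather than searching for `doc-number` alone. The second XML declaration does not stop the lenient parser, and patent A appears twice because two different patents cite it.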

###### Find the cited_patent_id values and store them in a list: cited_patents

In [3]:
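# Every <references-cited> block belongs to one citing patent
references_cited_tags = soup.find_all("references-cited")

# Collect the <citation> tags of each block (one list per citing patent)
citation_tags = []
for item in references_cited_tags:
    citation_tags.append(item.find_all("citation"))

# The cited patent ID is the text of the <doc-number> tag inside each citation
cited_patents = []
for citations in citation_tags:
    for item in citations:
        cited_patents.append(item.find("doc-number").string)

cited_patents[0:5]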

[u'PP17672', u'PP18482', u'PP18483', u'4954776', u'4956606']

## Step 4. Count how many times each patent is cited and write the result to the output file.

A list containing all cited patents has now been generated.

###### 4.1 Use the Counter class from the collections package to obtain, for every cited patent, the number of times it is cited.

In [4]:
# Counter maps each patent ID to the number of times it occurs in the list
cited_patents_count = Counter(cited_patents)

# Turn the counts into a list of (patent ID, count) pairs
cited_scheme = list(cited_patents_count.items())
cited_scheme[0:5]

[(u'2001/0040064', 1),
 (u'6508656', 1),
 (u'7083454', 1),
 (u'2553225', 1),
 (u'5047630', 1)]
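To see on a small scale what Counter does with such a list, here is a minimal sketch. The IDs are hypothetical single letters, not values from patents.xml.

```python
from collections import Counter

# Hypothetical flat list: one entry per citation
sample_cited = ["A", "B", "A", "C", "A", "B"]
sample_count = Counter(sample_cited)

pairs = sorted(sample_count.items())    # (ID, count) pairs ordered by ID
top_two = sample_count.most_common(2)   # the two most cited IDs
missing = sample_count["Z"]             # an ID that never occurs

print(pairs)    # [('A', 3), ('B', 2), ('C', 1)]
print(top_two)  # [('A', 3), ('B', 2)]
print(missing)  # 0
```

A Counter returns 0 for an ID that was never seen instead of raising a KeyError, so looking up an uncited patent is safe.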

###### 4.2 Once the final list is set up, sort it by cited patent ID and write the items to the file line by line in the required output format:
* cited_patent_id: number of times it is cited

In [5]:
# Sort the (patent ID, count) pairs by patent ID
cited_scheme.sort()

# Write one "cited_patent_id:count" line per patent
with open("cited.txt", "w") as output_file:
    for item in cited_scheme:
        line = item[0] + ":" + str(item[1]) + "\n"
        output_file.write(line)
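Because the patent IDs are strings, the sort is lexicographic rather than numeric. The sketch below illustrates this on made-up counts and reads the file back to check the format. The helper name `write_counts`, the sample IDs and the temporary file name are assumptions for illustration only.

```python
import os
import tempfile
from collections import Counter


def write_counts(counts, path):
    """Write one 'patent_id:count' line per patent, sorted by patent ID."""
    with open(path, "w") as output_file:
        for patent_id, count in sorted(counts.items()):
            output_file.write("{}:{}\n".format(patent_id, count))


# Hypothetical counts, chosen to show how string IDs are ordered
sample_counts = Counter({"PP0002": 1, "900": 2, "1000": 3})

path = os.path.join(tempfile.mkdtemp(), "cited_sample.txt")
write_counts(sample_counts, path)

with open(path) as sample_file:
    lines = sample_file.read().splitlines()

print(lines)  # ['1000:3', '900:2', 'PP0002:1']
```

Here "1000" comes before "900" because strings are compared character by character, and IDs starting with letters come after those starting with digits. If a numeric order were required, the sort would need a custom key.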

### Task 3 end.