# World Intellectual Property Organization: Data Wrangling

Date: 01/04/2017

Version: 1.0

Environment: Python 3.6.0 and Jupyter notebook

Libraries used:
* nltk (for RegexpTokenizer, MWETokenizer, stopwords, collocations, probability; corpora downloaded by running `import nltk` followed by `nltk.download()`)
* re (for regular expressions, included in Anaconda Python 3.6.0)
* sklearn (for CountVectorizer, included in Anaconda Python 3.6.0)
* BeautifulSoup (xml parser, included in Anaconda Python 3.6.0)

## 1. Introduction
This notebook completes the following three tasks:
#### Task 1: 
<b>Extract the hierarchical IPC code</b>. World Intellectual Property Organization has a hierarchical classification scheme that contains Section, Class, Subclass, Main Group, and Subgroup. Extract the hierarchical IPC codes for all the patents, and store them in a file, called “classification.txt”,  in the following format:  patent’s_ID:Section,Class,Subclass,Main_group,Subgroup.

#### Task 2:
<b>Extract the citation network</b>. Each patent cites a number of existing granted patents. Extract all the references for each patent, and store them in a file, called “citations.txt”, in the following format: citing_patent_id:cited_patent_id,cited_patent_id,

#### Task 3:
<b>Extract and preprocess abstracts</b>. 
1. We are required to extract the abstracts for all the patents, then process and store them as sparse count vectors. In the output file, called “count_vectors.txt”, each row corresponds to a patent’s abstract, starting with patent_id, followed by “word_index:count” pairs.

2. Generate a vocabulary list called “vocab.txt”, in which each line has the format 'word_index:vocab'.


## 2.  Import libraries 

In [1]:
from bs4 import BeautifulSoup
import re
from time import process_time
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import MWETokenizer
from nltk.corpus import stopwords
from nltk.probability import *
from sklearn.feature_extraction.text import CountVectorizer
from nltk.collocations import *
import nltk

## 3. Examining and loading data

<b>Examine "patents.xml" to determine the content and format, then load it into Python. Explain your finds here.</b>

I examined patents.xml in two ways:
1. With a text editor
2. With the BeautifulSoup XML parser, using prettify()

Inspection with a text editor reveals that many XML documents are concatenated in the file. Each XML document has an XML header, <?xml version="1.0" encoding="utf-8"?>.

In [2]:
#Count the number of xml documents (by xml header) in the file.
header=0
with open('patents.xml','r') as file: # the file is closed automatically when the block ends
    for line in file:
        if line.startswith("<?xml "):
            header += 1
print(header)

2500


This shows that the file holds 2500 patents, each stored in its own XML document. Each document contains the patent's section, class, subclass, main group, and subgroup within their respective tags, along with a number of citations related to the patent and one abstract describing it.

In [3]:
# Use parser to inspect the patents.xml file
soup = BeautifulSoup(open('patents.xml'),'lxml-xml')
#print(soup.prettify())

The patent IDs are located in the (doc-number) tag under (us-bibliographic-data-grant) >> (publication-reference) >> (document-id). 

However, BeautifulSoup stops parsing once it reaches the next document header, as shown above. This means the concatenated XML documents must either be split into individual documents or be parsed in a way that tolerates the repeated headers, so that all of the information in the file can be extracted.
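One way to break the file down is to split the text at each XML header and parse the pieces one by one. The sketch below is only an illustration of that idea, not the approach used in the rest of this notebook: it uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup, the helper name `split_xml_documents` is hypothetical, and the two-record sample is made up and far simpler than a real patent document. It also assumes the text begins with an XML header.

```python
import xml.etree.ElementTree as ET

XML_HEADER_START = '<?xml '

def split_xml_documents(text):
    """Split a string of concatenated XML documents at each XML header."""
    chunks = text.split(XML_HEADER_START)
    # Re-attach the header start that split() removed; skip empty pieces.
    return [XML_HEADER_START + chunk for chunk in chunks if chunk.strip()]

# Made-up sample: two tiny documents joined together, as in patents.xml.
sample = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<us-patent-grant><doc-number>A1</doc-number></us-patent-grant>\n'
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<us-patent-grant><doc-number>B2</doc-number></us-patent-grant>\n'
)

documents = split_xml_documents(sample)
# Each piece is now a well-formed XML document and can be parsed on its own.
doc_numbers = [
    ET.fromstring(doc.encode("utf-8")).find("doc-number").text
    for doc in documents
]
print(doc_numbers)
```

Splitting keeps strict XML parsing for each patent, at the cost of holding the whole file in memory as one string.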

## 4. Parsing XML and Extracting all the required information 


### Task 1

To handle the concatenated XML documents, BeautifulSoup can parse the file in HTML mode, which prevents the parser from stopping after the first document. The function below loops over each document, collects the patent ID and the classification fields, and writes this information to classification.txt in the required format.

In [4]:
def parse_xml_task1(xml_document):
    """
    Parse through the xml document using beautiful soup. Use html mode to obtain a collection of documents
    param: xml_document
    return: text document according to task 1
    """
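    with open(xml_document) as xml_file:
        soup = BeautifulSoup(xml_file, 'lxml') # name the HTML-mode parser explicitly; the file is read in full here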
    root = soup.find_all('us-patent-grant')
    output_file = open("classification.txt","a")

    for document in root:
        finalStr = ""
        data = []
        pat_id = document.find('publication-reference').find('doc-number').contents[0]
        data.append(document.find('class').contents[0])
        data.append(document.find('subclass').contents[0])
        data.append(document.find('main-group').contents[0])
        data.append(document.find('subgroup').contents[0])
        string = ",".join(str(i) for i in data)
        finalStr = "{}:{}\n".format(pat_id,string)
        output_file.write(finalStr) #write into txt file

    output_file.close()

Run the code below to create the file for Task 1.

In [5]:
# create classification text file
writefile = open("classification.txt","w")
writefile.close()

start = process_time()
parse_xml_task1('patents.xml')
end = process_time()
print("The function {} took {:8.25f} seconds".format(parse_xml_task1.__name__, end-start))

The function parse_xml_task1 took 67.8750000000000000000000000 seconds


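Take a look at the first 10 lines of classification.txt. Each line holds four fields after the patent ID (class, subclass, main group, subgroup): the function above never appends the section, so the Section field required by Task 1 is missing from this output.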

In [6]:
with open("classification.txt") as file:
    head = [next(file) for x in range(10)]
print(head)

['PP021722:01,H,5,00\n', 'RE042159:01,B,7,14\n', 'RE042170:06,F,11,00\n', '07891018:41,D,13,00\n', '07891019:41,D,13,00\n', '07891020:41,D,13,00\n', '07891021:62,B,17,00\n', '07891023:41,F,19,00\n', '07891025:61,F,9,02\n', '07891026:41,D,13,00\n']
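For comparison, a line in the full Task 1 format carries five fields after the patent ID. The sketch below shows one way to build such a line, under several assumptions: it uses the standard library's `xml.etree.ElementTree` on a hypothetical, simplified record (the `ipc` wrapper element, the patent ID and the field values are invented for illustration), and the helper name `classification_line` is not part of the notebook.

```python
import xml.etree.ElementTree as ET

# Tag names of the five IPC levels, in the order Task 1 asks for.
FIELDS = ["section", "class", "subclass", "main-group", "subgroup"]

def classification_line(pat_id, ipc_element):
    """Format one 'patent_id:Section,Class,Subclass,Main_group,Subgroup' line."""
    values = [ipc_element.findtext(name, default="").strip() for name in FIELDS]
    return "{}:{}".format(pat_id, ",".join(values))

# Hypothetical simplified record; real documents nest these tags more deeply.
sample = ET.fromstring(
    "<ipc><section>A</section><class>01</class><subclass>H</subclass>"
    "<main-group>5</main-group><subgroup>00</subgroup></ipc>"
)

line = classification_line("X0000001", sample)
print(line)
```

Driving the lookup from a single list of tag names makes it harder to drop a level by accident.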


### Task 2

This function creates citation.txt as required by Task 2.

It is similar to the classification function. The only difference is that it finds all of the citations in each document and collects them before formatting and writing them to citation.txt.

In [7]:
def parse_xml_task2(xml_document):
    """
    Parse through the xml document using beautiful soup. Use html mode to obtain a collection of documents
    param: xml_document
    return: text document according to task 2
    """
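    with open(xml_document) as xml_file:
        soup = BeautifulSoup(xml_file, 'lxml') # name the HTML-mode parser explicitly; the file is read in full here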
    root = soup.find_all('us-patent-grant')
    output_file = open("citation.txt","a")

    for document in root:
        finalStr = ""
        citation_list = []
        
        pat_id = document.find('publication-reference').find('doc-number').contents[0]
        citations = document.find('us-bibliographic-data-grant').findAll('citation') 
        
        for citation in citations:
            citation_list.append(citation.find('doc-number').contents[0]) # store the cited patent's id
        
        string = ",".join(str(i) for i in citation_list) # join all citations, separated by commas
        finalStr = "{}:{}\n".format(pat_id,string)
        output_file.write(finalStr) #write into txt file

    output_file.close()
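
The collect-and-join step can be illustrated in isolation. The following is a minimal sketch, assuming a hypothetical, heavily simplified record and using the standard library's `xml.etree.ElementTree` rather than BeautifulSoup; the `grant` wrapper, the IDs and the helper name `citation_line` are invented for the example.

```python
import xml.etree.ElementTree as ET

def citation_line(pat_id, grant_element):
    """Format one 'citing_patent_id:cited_id,cited_id,...' line."""
    cited = [
        # './/doc-number' finds the first doc-number anywhere below the citation.
        citation.findtext(".//doc-number", default="").strip()
        for citation in grant_element.iter("citation")
    ]
    return "{}:{}".format(pat_id, ",".join(cited))

# Hypothetical simplified record with two citations.
sample = ET.fromstring(
    "<grant>"
    "<citation><document-id><doc-number>1111111</doc-number></document-id></citation>"
    "<citation><document-id><doc-number>D222222</doc-number></document-id></citation>"
    "</grant>"
)

line = citation_line("X0000001", sample)
print(line)
```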

Run the code below to create the file for Task 2.

In [8]:
# create citation text file
writefile = open("citation.txt","w")
writefile.close()

start = process_time()
parse_xml_task2('patents.xml')
end = process_time()
print("The function {} took {:8.25f} seconds".format(parse_xml_task2.__name__, end-start))

The function parse_xml_task2 took 70.7031250000000000000000000 seconds


Take a look at the first 10 lines of citation.txt.

In [9]:
with open("citation.txt") as file:
    head = [next(file) for x in range(10)]
print(head)

['PP021722:PP17672,PP18482,PP18483\n', 'RE042159:4954776,4956606,5015948,5115193,5180978,5332966,5332996,5351003,5381090,5521496,5914593\n', 'RE042170:3988719,4206996,4803623,4905098,5012281,5161222,5172244,5253152,5263153,5270775,5301262,5341363,5355490,5410754,5537626,5559958,5574859,5580177,5611046,5647056,5828864\n', '07891018:4561124,4831666,4920577,5105473,5134726,D338281,5611081,5729832,5845333,6115838,6332224,6805957,7089598\n', '07891019:4355632,4702235,5032705,5148002,5603648,6439942,6757916,6910229\n', '07891020:4599609,4734072,4843014,5061636,5493730,5635909,6080690,6267232,6388422,6767509,2003/0214408,2004/0009729,197 49 862,101 55 935,203 08 642,103 11 185,103 50 869,103 57 193,WO 00/62633,WO 2004/073798\n', '07891021:4507808,4627112,4864655,5010591,5031242,5165110,5410759\n', '07891023:770761,1335927,1398962,1446948,1839143,1852030,1983636,2133505,2411724,2682669,3167786,3401857,4923105,5214806,5319806,5413262,5488738,5497923,5611079,5623735,6021528,6088831,6216931,67665

## 5. Preprocessing the abstracts 


### Task 3
To obtain the abstracts, we use a method similar to that of the previous two tasks. This time, after extracting the abstract from each document, I normalize it to lower case and then produce a list of tokens, which is stored as a value in a dictionary keyed by patent ID.

I tokenized the abstracts with a regular expression that keeps alphabetic words, including words joined by one or more hyphens (e.g. walking-beam and analog-to-digital). The pattern requires at least two letters, so single-letter tokens are dropped. Numbers are excluded, since we are only interested in the vocabulary of the abstracts. Abbreviations are also not included. 

I will assume that a hyphenated word counts as a single unigram (i.e. walking-beam and analog-to-digital are unigrams).
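To see what the pattern keeps, it can be tried on a single sentence. The sketch below uses `re.findall` from the standard library with the same pattern that is passed to `RegexpTokenizer` further down; the sample sentence and the helper name `tokenize_abstract` are made up for illustration.

```python
import re

# Same pattern as the tokenizer used for Task 3: runs of letters,
# optionally joined by single hyphens, at least two letters in total.
TOKEN_PATTERN = r"[a-zA-Z]+(?:[-]?[a-zA-Z]+)+"

def tokenize_abstract(text):
    """Lower-case the text and return the tokens matched by the pattern."""
    return re.findall(TOKEN_PATTERN, text.lower())

sample = "A walking-beam drives an analog-to-digital converter at 5 V"
tokens = tokenize_abstract(sample)
print(tokens)
```

The single letters "a" and "v" and the number "5" are dropped, while both hyphenated words survive as one token each.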

As the task requires, I will generate at least 100 bigrams, then filter out stopwords, the top 20 most frequent words by document frequency, and words that appear in only one abstract.

Finally, I will produce count_vector.txt and vocab.txt.

In [10]:
def parse_xml_task3(xml_document):
    """
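    Parse through the xml document using beautiful soup. Use html mode to obtain a collection of documents
    param: xml_document
    return: None; fills the global abstract_dic with patent id -> abstract tokens for task 3
    """
    with open(xml_document) as xml_file:
        soup = BeautifulSoup(xml_file, 'lxml') # name the HTML-mode parser explicitly; the file is read in full here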
    root = soup.find_all('us-patent-grant')
    tokenizer = RegexpTokenizer(r"[a-zA-Z]+(?:[-]?[a-zA-Z]+)+", gaps=False) # keeps only words and hyphenated words
    
    for document in root:
        abstract_list = ""
        
        pat_id = document.find('publication-reference').find('doc-number').contents[0]
           
        # Concatenate the child nodes of the first <p> element in the abstract
        for abstract in document.abstract.p.contents:
            abstract_list = abstract_list + " " + abstract.lower() #normalize the abstract words into lower cases
     
        tokens = tokenizer.tokenize(abstract_list) #tokenize the abstract

        abstract_dic[pat_id] = list(tokens) #store the tokens into abstract dict

Run the code below to produce abstract_dic, which maps each patent ID to its abstract tokens.

In [11]:
abstract_dic = {} #create empty dictionary for key (patent ids) and values (abstract tokens)

start = process_time()
parse_xml_task3('patents.xml')
end = process_time()
print("The function {} took {:8.25f} seconds".format(parse_xml_task3.__name__, end-start))

The function parse_xml_task3 took 75.5937500000000000000000000 seconds


In order to construct the bigrams, I will need to obtain a list of words from all the abstracts.

In [12]:
words = []
for value in abstract_dic.values():
    for word in value:
        words.append(word)

Then I POS-tag the words and keep only those tagged as singular or mass nouns (NN) or adjectives (JJ).

In [13]:
tagged_word = []
tagged_words = []
tagged_word = nltk.pos_tag(words)
tagged_words = [word for word in tagged_word if (word[1]=='NN' or word[1]=='JJ')]

Bigrams are generated from the tagged words with BigramCollocationFinder and scored with nltk.collocations.BigramAssocMeasures().

The loop below starts by ignoring bigrams that appear fewer than 40 times in the tagged word list, and stores the top 15 bigrams by PMI score in a bigram list. On each subsequent iteration, the threshold passed to finder.apply_freq_filter() is reduced by one (to 39, and so on), and any of the top 15 bigrams by PMI score that are not already in the list are added. Bigrams made of the same word twice are skipped.

The loop stops once at least 130 bigrams have been collected. 

In [14]:
bigram = []
bigram_measures = nltk.collocations.BigramAssocMeasures()
i = 0
while len(bigram) < 130:
    finder = nltk.collocations.BigramCollocationFinder.from_words(tagged_words)
    g = 40 - i
    finder.apply_freq_filter(g) # ignore bigrams that appear fewer than g times
    store = finder.nbest(bigram_measures.pmi, 15) # top 15 bigrams based on PMI score
    for first, second in store: # each item is a (word, tag) pair
        save = (first[0], second[0]) # keep the words, drop the POS tags
        if save not in bigram and first[0] != second[0]:
            bigram.append(save)
    i += 1
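
To make the scoring step concrete, the sketch below computes pointwise mutual information by hand with the standard library. It is a simplified stand-in, not NLTK's implementation: the function name `top_bigrams_by_pmi`, the toy token list and the thresholds are all made up, and the score is taken as log2 of count(w1, w2) * N / (count(w1) * count(w2)), with N the number of tokens.

```python
import math
from collections import Counter

def top_bigrams_by_pmi(tokens, min_freq, n):
    """Return the n adjacent word pairs with the highest PMI,
    ignoring pairs seen fewer than min_freq times."""
    word_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    scores = {}
    for (w1, w2), count in bigram_counts.items():
        if count < min_freq:
            continue  # frequency filter, analogous to apply_freq_filter()
        scores[(w1, w2)] = math.log2(
            count * total / (word_counts[w1] * word_counts[w2])
        )
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Made-up token stream for illustration.
tokens = ("integrated circuit board integrated circuit design "
          "circuit board layout integrated circuit").split()
best = top_bigrams_by_pmi(tokens, min_freq=2, n=2)
print(best)
```

The frequency filter matters because PMI on its own favours rare pairs: a pair seen once between two words that each occur once gets the highest possible score.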

Scanning through the bigram list shows a few bigrams that do not make sense as collocations, so I removed them from the list. 

In [15]:
bigrams = []
discard = [('left', 'right'),('red', 'green'),('float', 'fraction'),('green', 'blue'),
           ('water-swellable', 'water-insoluble'),('embarkation', 'disembarkation'),('attributed', 'categorical')]
bigrams = [word for word in bigram if word not in discard]

In [16]:
len(bigrams)

133

Now we can add the generated bigrams to the abstract tokens by using MWETokenizer. The two words of a bigram are joined by an underscore.

In [17]:
abstract_bigrams = {}
tokenizer = MWETokenizer(bigrams, separator='_')
for key,values in abstract_dic.items():
    abstract_bigrams[key] = tokenizer.tokenize(values)
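
The effect of this step on a token list can be shown with a small standard-library stand-in. This is a sketch of the idea only, not how MWETokenizer is implemented; the helper name `merge_bigrams` and the sample tokens are hypothetical.

```python
def merge_bigrams(tokens, bigrams, separator="_"):
    """Join adjacent tokens that form a known bigram into a single token."""
    bigram_set = set(bigrams)
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in bigram_set:
            merged.append(separator.join(pair))
            i += 2  # both words are consumed by the merged token
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Made-up example tokens and bigram list.
tokens = ["the", "integrated", "circuit", "board"]
merged = merge_bigrams(tokens, [("integrated", "circuit")])
print(merged)
```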

Use the built-in stopwords from nltk to filter the abstract tokens.

In [18]:
stopwords_list = stopwords.words('english')
abstract_stop = {}
for key in abstract_bigrams.keys():
    abstract_stop[key] = [word for word in abstract_bigrams[key] if word not in stopwords_list]

Now we need to find the top 20 words by document frequency. 
First, we apply the set function to the tokens of each document in the abstract_stop dictionary.
This yields the unique words of each document, which are collected into a single list.

In [19]:
set_top20list = []
for value in abstract_stop.values():
    set_top20list.extend(list(set(value)))

Because each word now appears at most once per document, counting this list gives document frequencies, from which we take the 20 most common words.

In [20]:
fd = FreqDist(set_top20list)
top20 = fd.most_common(20) #extract words from top 20 words tuples
fd.plot(20, cumulative=True)

<Figure size 640x480 with 1 Axes>

In [21]:
top20list = list([top20[x][0] for x in range(0,20)])
print(top20list) #list of top 20 words

['includes', 'one', 'first', 'second', 'method', 'provided', 'system', 'least', 'device', 'plurality', 'portion', 'apparatus', 'surface', 'including', 'may', 'connected', 'formed', 'also', 'data', 'control']


We remove the top 20 words from abstract_stop dictionary

In [22]:
abstract_removedtop20 = {}
for key in abstract_stop.keys():
    abstract_removedtop20[key] = [word for word in abstract_stop[key] if word not in top20list]

Now we want to find the words that appear in only one abstract. Because fd holds document frequencies, fd.hapaxes() returns exactly those words. We then filter them out of the abstract_removedtop20 dictionary.
This is the final step before we generate the count vectors.

In [23]:
one_word = set(fd.hapaxes()) # words that occur in only one document's abstract; a set makes the lookups below fast
abstract_complete = {}
for key in abstract_removedtop20.keys():
    abstract_complete[key] = [word for word in abstract_removedtop20[key] if word not in one_word]  
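
The two document-frequency filters (most frequent words and single-abstract words) can be sketched together with `collections.Counter`. This is a minimal illustration on made-up data; the function names and the three toy documents are hypothetical, and `top_k` is set to 1 here rather than 20 only because the example is so small.

```python
from collections import Counter

def document_frequency(docs):
    """Count, for each word, the number of documents it appears in."""
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # set() so a word counts once per document
    return df

def filter_by_document_frequency(docs, top_k):
    """Drop the top_k words by document frequency and words found in one document."""
    df = document_frequency(docs)
    too_common = {word for word, _ in df.most_common(top_k)}
    too_rare = {word for word, count in df.items() if count == 1}
    drop = too_common | too_rare
    return {key: [w for w in tokens if w not in drop]
            for key, tokens in docs.items()}

# Made-up tokenized abstracts keyed by patent id.
docs = {
    "p1": ["valve", "valve", "housing", "spring"],
    "p2": ["valve", "sensor"],
    "p3": ["valve", "sensor", "housing", "lever"],
}
filtered = filter_by_document_frequency(docs, top_k=1)
print(filtered)
```

Here "valve" is removed as the most widespread word, and "spring" and "lever" are removed because each occurs in a single document.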

Now we generate the count vectors using CountVectorizer. 
The regular expression passed as token_pattern is the same as the tokenizer's, except that it also allows underscores, so that bigrams are included.
After constructing the matrix, we obtain 2500 rows, which correspond to the documents (patent IDs), and 5654 columns, which correspond to the vocabulary words.

In [24]:
vectorizer = CountVectorizer(analyzer = "word", token_pattern=u'(?u)[a-zA-Z]+(?:[-_]?[a-zA-Z]+)+')
data_features = vectorizer.fit_transform([' '.join(value) for value in abstract_complete.values()])
print(data_features.shape)

(2500, 5654)


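We can generate the vocabulary list with get_feature_names(), which returns all 5654 vocabulary terms in column order. (In scikit-learn 1.0 and later, the equivalent method is get_feature_names_out().)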

In [25]:
vocab = vectorizer.get_feature_names()

Produce vocab.txt. In the matrix, the vocabulary terms are arranged alphabetically, with the first column at index 0, so I only need to number the vocabulary list in ascending order. Please refer to [1] regarding the code.

In [26]:
with open("vocab.txt", "w") as output_file:
    for i, word in enumerate(vocab): # i is the column index of word
        output_file.write(str(i) + ":" + word + "\n") # index:word format

To construct the count vectors, each word in a document has to be mapped to its index in vocab.txt. This is done by looking each word up in the vocabulary list. The index and word-count pairs for a document are collected, formatted, and written to count_vector.txt. Please refer to [2] regarding a portion of the code.

In [27]:
vocab = list(vocab)
vocab_index = {word: idx for idx, word in enumerate(vocab)} # word -> index lookup, faster than vocab.index()

with open("count_vector.txt", 'w') as output_file: # make a new count_vector text file
    for key, value in abstract_complete.items():
        indices = [vocab_index[word] for word in value] # index the words in the document according to the vocab list
        # produce word_index:count pairs for this document
        store = ["{}:{}".format(k, v) for k, v in FreqDist(indices).items()]
        finalStr = "{},{}\n".format(key, ",".join(store))
        output_file.write(finalStr)
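
The link between the vocabulary file and the count vectors is easier to see on a tiny example. The sketch below is a standard-library stand-in for the CountVectorizer and FreqDist steps above, under stated assumptions: the function names and the two toy documents are made up, the vocabulary is simply the sorted set of tokens, and the pairs are written in ascending index order, which is a choice made for this example.

```python
from collections import Counter

def build_vocab(docs):
    """Return the alphabetically sorted vocabulary; list position = word index."""
    return sorted({word for tokens in docs.values() for word in tokens})

def count_vector_lines(docs, vocab):
    """Format each document as 'patent_id,word_index:count,...'."""
    index = {word: i for i, word in enumerate(vocab)}
    lines = []
    for pat_id, tokens in docs.items():
        counts = Counter(index[word] for word in tokens)
        pairs = ",".join("{}:{}".format(i, c) for i, c in sorted(counts.items()))
        lines.append("{},{}".format(pat_id, pairs))
    return lines

# Made-up filtered abstracts keyed by patent id.
docs = {"p1": ["valve", "spring", "valve"], "p2": ["lever", "valve"]}
vocab = build_vocab(docs)
lines = count_vector_lines(docs, vocab)
print(vocab)
print(lines)
```

Only words that actually occur in a document get a pair, which is what makes the representation sparse.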

## 6. Summary
* Task 1. Produces classification.txt
* Task 2. Produces citation.txt
* Task 3. Produces vocab.txt and count_vector.txt