# FIT5196 Assessment 1
#### Student Name: Mark Joseph Jose
#### Student ID: 28066049

Date: 23/03/2017

Version: 1.0

Environment: Python 2.7.11 and Jupyter notebook

Libraries used: 
* BeautifulSoup
* nltk
* itertools (to help iterating dictionaries)
* CountVectorizer (to create a count vector)


## 1. Introduction

This assessment is about extracting information from the XML file of patents and generating the count vectors of each patent's abstracts. 

To extract information from the file, I used BeautifulSoup, which makes the retrieval of data easy to do. One of the earliest problems I've encountered is that the XML is not made up of a single parent, but it starts with a collection of sibling nodes. Fortunately, BeautifulSoup's default parser ("lxml") parses the file without any problems, so I went on with it.

To generate the count vector from each abstract, I've used the concepts discussed in the lectures 3 and 4, mainly about using NLTK libraries to tokenize and manipulate these tokens.


In [1]:
!python --version

Python 2.7.13 :: Anaconda 4.3.0 (64-bit)


## 2.  Import libraries 

In [2]:
# Code to import libraries as you need in this assessment, e.g.,
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.probability import *
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer 
from nltk.tokenize import MWETokenizer
from itertools import chain
from sklearn.feature_extraction.text import CountVectorizer
import nltk.data


## 3. Examining and loading data

The 2500 patents are stored in an xml file, and each patent is formatted with the following structure:

``` xml
    <us-patent-grant>
        <publication-reference>
            <doc-number></doc-number>
        </publication-reference>
        
        <!-- more nodes here -->
        
        <classifications-ipcr>
            <classification-ipcr>
                <section> </section>
                <class> </class>
                <subclass> </subclass>
                <maingroup> </maingroup>
                <subgroup> </subgroup>
            </classification-ipcr>
        </classifications-ipcr>
        
        <!-- more nodes here -->
        
        <abstract>
        </abstract>
        
        <!-- more nodes here -->
    </us-patent-grant>
         ...
```

It was easy to determine which of the tags pertain to the patent id, classification, etc. thanks to the sample output files given. I just had to follow the pattern in the file while mining the information.

Here I used BeautifulSoup's default parser to get all the nodes.

In [3]:
infile = open("patents.xml", "r")
contents = infile.read()
soup = BeautifulSoup(contents, "lxml")

Sanity check - did we find all the patents?

In [4]:
len(soup.find_all('us-patent-grant'))

2500

Looks like we've got all 2500 of them. **Now let's start mining the data!**

## 4. Parsing XML and Extracting all the required information 

This section is made up of the first two tasks in the assessment:
1. Extracting the *classification* - where we find a patent's section, class, subclass, main group, and sub group
2. Extracting the *citation network* - where we find the respective patent ids that a patent has cited

In [5]:
# Initialize lists and dictionary to store data
patent_ids = []
classifications = []
citations = {}

First, I have defined a method to write the text files.

In [6]:
def write_file(file_name, list_name):
    file = open(file_name, "w")
    for item in list_name:
      file.write("%s\n" % item.encode("utf-8"))
    file.close()

### 4.1 Extract all the classifications

For this I used BeautifulSoup's **find_all()** and **find()** methods. I was expecting some problems like missing nodes or invalid data, but fortunately all the required patent data are complete so the extraction was pretty straightforward.

I would have liked to use XPath to retrieve the text but sadly BeautifulSoup does not support this. Nevertheless, this library is the easiest one to use so I sticked with it.

In [7]:
index = 0

# extract classifications
# for each <us-patent-grant>, find <publication-reference>
for upg in soup.find_all('us-patent-grant'):
    # extract the <doc-number> inside the <publication-reference>
    patent_id = upg.find('publication-reference').find('doc-number').text
    if(patent_id != None):
        # create a list of patent ids. we will need it later.
        patent_ids.append(patent_id)
        # extract the classifications under <classification-ipcr>
        classification = upg.find('classifications-ipcr').find_all('classification-ipcr')

        # extract section
        section_text = classification[0].find('section').text
        # extract class
        class_text = classification[0].find('class').text
        # extract subclass
        subclass_text = classification[0].find('subclass').text
        # extract main group
        maingroup_text = classification[0].find('main-group').text
        # extract subgroup
        subgroup_text = classification[0].find('subgroup').text
                
        # append everything into the prescribed format
        classifications.append(patent_ids[index] + ':' 
                                   + section_text + ',' 
                                   + class_text + ',' 
                                   + subclass_text + ',' 
                                   + maingroup_text + ',' 
                                   + subgroup_text) 

        index = index + 1

**Check.** Did we get the correct data and format?

In [8]:
classifications[:5]

[u'PP021722:A,01,H,5,00',
 u'RE042159:G,01,B,7,14',
 u'RE042170:G,06,F,11,00',
 u'07891018:A,41,D,13,00',
 u'07891019:A,41,D,13,00']

Good. Now put everything into the text file.

In [9]:
write_file("classification.txt", classifications)

### 4.2 Extract the patent citation network

Continuing from the last task, I used BeautifulSoup's **find_all()** and **find()** to crawl the file and extract each of the patent's citation network.

All of a patent's citations are stored under the *references-cited* node, and we can get each citation's patent id under each *citation* node.


In [10]:
# extract citations
citations_strings = []
index = 0
for upg in soup.find_all('us-patent-grant'):
    citations = upg.find('references-cited').find_all('citation')
    citations_list = []
    
    # add each citation to the citation list
    for citation in citations:
        citations_list.append(citation.find('doc-number').get_text())

    # format into the required way
    # note: here I re-used the patent_ids list that I made earlier
    citations_strings.append(patent_ids[index] +':' + '%s' % ','.join(citations_list)) 
    index = index + 1

**Check.** Have I got the correct data and format?

In [11]:
citations_strings[:5]

[u'PP021722:PP17672,PP18482,PP18483',
 u'RE042159:4954776,4956606,5015948,5115193,5180978,5332966,5332996,5351003,5381090,5521496,5914593',
 u'RE042170:3988719,4206996,4803623,4905098,5012281,5161222,5172244,5253152,5263153,5270775,5301262,5341363,5355490,5410754,5537626,5559958,5574859,5580177,5611046,5647056,5828864',
 u'07891018:4561124,4831666,4920577,5105473,5134726,D338281,5611081,5729832,5845333,6115838,6332224,6805957,7089598',
 u'07891019:4355632,4702235,5032705,5148002,5603648,6439942,6757916,6910229']

Great. Now put everything into the text file.

In [12]:
write_file("citations.txt", citations_strings)

## 5. Preprocessing the abstracts 

*In this section, you should write Python code to pre-process the abstracts, convert them into the required format, and store the processed data. We suggest that you refer to the lecture and tutorial materials and the references provided in those materials.*

This section utilises a lot from Python's NLTK library to derive the abstracts' count vectors. It is made up of the following steps:
1. extracting the abstracts from the file
2. tokenizing the abstracts
3. creating an initial unigram vocabulary
4. extracting meaningful bigrams to enrich the vocabulary
5. filtering it to remove irrelevant words like stopwords, most frequent words, and hapaxes 
6. creating the final vocabulary
7. creating the count vector

### 5.1 Extract the abstracts from the file
Extract the abstracts per patent and store them into the dictionary. Note that this is where I do **case-normalization** as I turn the abstracts into lower case.

In [13]:
# create a dictionary of abstracts, where the patent id is the key
abstracts = {}
index = 0
for upg in soup.find_all('us-patent-grant'):
    abstracts[patent_ids[index]] = upg.find('abstract').find('p').text.lower()
    index = index + 1

Check if we have the correct abstracts. Define a method that prints our dictionary.

In [14]:
def print_dictionary(dictionary, length):
    index = 0
    for key, value in dictionary.iteritems():
        print key + '\t' + str(value)
        if index == length-1:
            break
        index = index + 1

Seems like we got the abstracts all right.

In [15]:
print_dictionary(abstracts, 3)

07910771	the invention relates to a method for producing acrylic acid in one step by an oxydehydration reaction of glycerol in the presence of molecular oxygen. the reaction preferably carried out in gaseous phase in the presence of a suitable catalyst.
07910479	a method for manufacturing a photodiode array includes providing a semiconductor substrate having first and second main surfaces opposite to each other. the semiconductor substrate has a first layer of a first conductivity proximate the first main surface and a second layer of a second conductivity proximate the second main surface. a via is formed in the substrate which extends to a first depth position relative to the first main surface. the via has a first aspect ratio. generally simultaneously with forming the via, an isolation trench is formed in the substrate spaced apart from the via which extends to a second depth position relative to the first main surface. the isolation trench has a second aspect ratio different from 

### 5.2 Tokenize each patent abstract
Now it's time to tokenize the abstracts. For this I used **nltk's built in tokenizer** but also included an **isalpha()** check to ensure all of the characters in each token is an alphabet.

In [16]:
#tokenize the abstracts
tokenized_abstracts = {} 
for key,value in abstracts.iteritems():
    tokenized_abstracts[key] = [word for word in nltk.word_tokenize(value) if word.isalpha()] 

In [17]:
print_dictionary(tokenized_abstracts, 3)

07910771	[u'the', u'invention', u'relates', u'to', u'a', u'method', u'for', u'producing', u'acrylic', u'acid', u'in', u'one', u'step', u'by', u'an', u'oxydehydration', u'reaction', u'of', u'glycerol', u'in', u'the', u'presence', u'of', u'molecular', u'oxygen', u'the', u'reaction', u'preferably', u'carried', u'out', u'in', u'gaseous', u'phase', u'in', u'the', u'presence', u'of', u'a', u'suitable', u'catalyst']
07909794	[u'an', u'autoinflating', u'catheter', u'and', u'balloon', u'assembly', u'having', u'an', u'autoregulating', u'structure', u'to', u'prevent', u'overinflation', u'of', u'the', u'balloon', u'as', u'a', u'result', u'of', u'variable', u'fluid', u'flow', u'rates', u'through', u'the', u'catheter', u'lumen', u'a', u'elastomeric', u'balloon', u'is', u'provided', u'on', u'the', u'distal', u'end', u'of', u'the', u'catheter', u'body', u'and', u'the', u'assembly', u'is', u'constructed', u'so', u'that', u'at', u'least', u'a', u'portion', u'of', u'the', u'fluid', u'flow', u'through', u

### 5.3 Create initial vocabulary

In creating the initial vocabulary I've decided to create a method that generates *meaningful bigrams* first. I based it from this  <a href="http://www.nltk.org/howto/collocations.html">tutorial</a>. 

To ensure a diverse set of bigrams, I've used both PMI and the tokens' POS tags. **NLTK's built-in BigramCollocationFinder** allows users to generate bigrams from tokens. Afterwards, to ensure that I'll get meaningful bigrams, I filtered them further **based on their POS tags**.

Bigrams which more likely to be meaningful have patterns like

Pattern|Example
--|--
JJ+NN|secure strap
NN+NN|crown portion
VBG+NN|measuring cup
VBN+NN|installed state


As a precursor let's retrieve the **stop words** first. Filtering bigrams without stop words will help us find more meaningful ones.

In [18]:
# generate stopwords from nltk's built-in stopwords list
stop_words = stopwords.words('english')

In [19]:
list(stop_words)[:15]

[u'i',
 u'me',
 u'my',
 u'myself',
 u'we',
 u'our',
 u'ours',
 u'ourselves',
 u'you',
 u'your',
 u'yours',
 u'yourself',
 u'yourselves',
 u'he',
 u'him']

#### Meaningful Bigram Finder
Here is my defined method for generating meaningful bigrams. First thing to do is try to generate the POS tags of the words, so the method must accept a list of sentences that will aid in giving context for word tagging. 

After tagging the words, remove unnecessary stopwords which we don't need in bigram finding. Only then we can find the bigrams.

Then, filter the bigrams based on the combination of their POS tags as described above. Only bigrams that fit into the given pattern will be deemed meaningful.

In [20]:
def generate_meaningful_bigrams(sentences_list, length):
    
    tagged_bigrams = []
    bigrams = []
    
    for sentences in sentences_list:
        for sentence in sentences:
            # tokenize sentence
            uni_sent = [word for word in nltk.word_tokenize(sentence) if word.isalpha()]
            
            # tag each word in the sentence
            tagged_sent = nltk.tag.pos_tag(uni_sent)
            
            # remove stop words so bigrams won't have stop words
            stopped_tagged_sent = [x for x in tagged_sent if x[0] not in stop_words] 
            
            # find bigrams based on their tags
            bigram_measures = nltk.collocations.BigramAssocMeasures()
            finder = nltk.collocations.BigramCollocationFinder.from_words(stopped_tagged_sent)
            tagged_bigrams.extend(finder.nbest(bigram_measures.pmi, 5))
    
    # get unique bigrams
    unique_bigrams = list(set(tagged_bigrams))
            
    # filter bigrams by POS tags to determine which of them is meaningful
    # meaningful bigrams are usually in the pattern of JJ + NN, VBN + NN, VBG + NN and so on
    filtered_bigrams = [bigram for bigram in unique_bigrams 
                        if bigram[0][1] in ['JJ','NN','VBN','VBG','NNP','NNS'] 
                        and bigram[1][1] in ['NN','NNP','NNS']]
    
    # remove POS tags from the filtered bigrams
    bigram_tuples = []
    for bigram in filtered_bigrams:
        bigram_tuples.append((bigram[0][0], bigram[1][0]))
        
    # pass the whole list of filtered bigrams if less than the requested length
    # otherwise only pass the requested number of bigrams
    return (bigram_tuples if (len(bigrams) < length) else bigram_tuples[:length])

Below, I've retrieved the unique set of unigrams in the abstracts then added the meaningful bigrams generated from the method **generate_meaningful_bigrams()**. 

This will be used to re-tokenize abstracts and produce multiword tokens.

In [21]:
# create vocabulary that we will use to generate multiword tokens
words = list(chain.from_iterable(tokenized_abstracts.values()))
vocab = list(set(words))

# generate bigrams
# first generate a list of all sentences from the abstracts
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
sentences_list = [sent_detector.tokenize(value.strip()) for value in abstracts.values()]

# create 100 meaningful bigrams from these sentences
bigrams = generate_meaningful_bigrams(sentences_list, 100)

# extend the vocabulary with the generated bigrams
vocab.extend(bigrams)

Let's check the generated bigrams. Do they make sense?

In [22]:
bigrams[:30]

[(u'thickness', u'sensors'),
 (u'crown', u'portion'),
 (u'light', u'switch'),
 (u'front', u'backpack'),
 (u'electrode', u'layer'),
 (u'mounted', u'underside'),
 (u'measuring', u'plaque'),
 (u'second', u'encoding'),
 (u'impedance', u'transmission'),
 (u'data', u'record'),
 (u'ratio', u'npr'),
 (u'echo', u'cancellation'),
 (u'piezoelectric', u'transducers'),
 (u'installed', u'state'),
 (u'front', u'surface'),
 (u'memory', u'response'),
 (u'ring', u'segments'),
 (u'hydrogen', u'chloride'),
 (u'epitaxial', u'semiconductor'),
 (u'power', u'output'),
 (u'lip', u'portions'),
 (u'biasing', u'element'),
 (u'pressure', u'springs'),
 (u'exposing', u'subject'),
 (u'ink', u'reservoirs'),
 (u'extending', u'inboard'),
 (u'mobile', u'phone'),
 (u'flange', u'properties'),
 (u'hole', u'tree'),
 (u'acquired', u'imaging')]

**Looks like they do! :^)** Now tokenize the abstracts again using our newly formed vocabulary, this time utilising the **MWETokenizer**.

In [23]:
#use MWETokenizer and tokenize it again
mwe_tokenizer = MWETokenizer(vocab)
mwe_tokenized_abs = {}
for key,value in tokenized_abstracts.iteritems():
    mwe_tokenized_abs[key] = mwe_tokenizer.tokenize(value) 

Note that the bigrams we have selected are mostly meaningful and do not include any stop words.

Check if the MWE tokenizer worked. Did it include bigram tokens?

In [24]:
print_dictionary(mwe_tokenized_abs, 3)

07910771	[u'the', u'invention', u'relates', u'to', u'a', u'method', u'for', u'producing', u'acrylic_acid', u'in', u'one', u'step', u'by', u'an', u'oxydehydration', u'reaction', u'of', u'glycerol', u'in', u'the', u'presence', u'of', u'molecular', u'oxygen', u'the', u'reaction', u'preferably', u'carried', u'out', u'in', u'gaseous_phase', u'in', u'the', u'presence', u'of', u'a', u'suitable', u'catalyst']
07910479	[u'a', u'method', u'for', u'manufacturing', u'a', u'photodiode', u'array', u'includes', u'providing', u'a', u'semiconductor_substrate', u'having', u'first', u'and', u'second', u'main_surfaces', u'opposite', u'to', u'each', u'other', u'the', u'semiconductor_substrate', u'has', u'a', u'first_layer', u'of', u'a', u'first_conductivity', u'proximate', u'the', u'first', u'main_surface', u'and', u'a', u'second', u'layer', u'of', u'a', u'second', u'conductivity', u'proximate', u'the', u'second', u'main_surface', u'a', u'via', u'is', u'formed', u'in', u'the', u'substrate', u'which', u'ext

### 5.4 Filter tokens

Now let's remove certain tokens from our bag. Remove tokens that are:
- stop words
- in the top 20 most frequent words based on document frequency
- words that appear only in one abstract

#### Remove stop words
Remove stop words based from the NLTK's built in stopword list.

In [25]:
# remove stop words from the tokens
mwe_tokenized_abs_stop = {}
for key,value in mwe_tokenized_abs.iteritems():
    mwe_tokenized_abs_stop[key] = [word for word in value if word not in stop_words]

Check if we have removed the stop words. Are they still there?

In [26]:
print_dictionary(mwe_tokenized_abs_stop, 1)

07910771	[u'invention', u'relates', u'method', u'producing', u'acrylic_acid', u'one', u'step', u'oxydehydration', u'reaction', u'glycerol', u'presence', u'molecular', u'oxygen', u'reaction', u'preferably', u'carried', u'gaseous_phase', u'presence', u'suitable', u'catalyst']


Let's see the initial vocabulary -- how many words are there right now?

In [27]:
words = list(chain.from_iterable(mwe_tokenized_abs_stop.values()))
vocab_list = list(set(words))
len(vocab_list)

17320

Looks a lot! Now let's go and filter our tokens further

#### Remove most frequent words and hapaxes
Let's remove the most frequent words based on document frequency and hapaxes (words that appear only once) by using NLTK's **FreqDist()** method.

In [28]:
# extract a set of unique words per abstract
# this means each instance of a word means it was used in an abstract (once or multiple times--we don't care)
words = list(chain.from_iterable([set(value) for value in mwe_tokenized_abs_stop.values()]))
fd = FreqDist(words)

# get the 20 most frequent words in terms of document frequency
most_common_words = fd.most_common(20)

# get words that occur only once 
abstract_hapaxes = fd.hapaxes()

len(words)

82835

What are the most common words?

In [29]:
most_common_words

[(u'includes', 1119),
 (u'one', 820),
 (u'provided', 559),
 (u'first', 536),
 (u'least', 525),
 (u'method', 523),
 (u'second', 506),
 (u'plurality', 413),
 (u'system', 376),
 (u'may', 333),
 (u'including', 333),
 (u'device', 314),
 (u'connected', 309),
 (u'formed', 289),
 (u'also', 274),
 (u'apparatus', 271),
 (u'portion', 269),
 (u'two', 265),
 (u'wherein', 247),
 (u'comprises', 243)]

Which of the words appear only in one abstract?

In [30]:
list(abstract_hapaxes)[:20]

[u'reciprocation',
 u'rail_element',
 u'life_indicator',
 u'converter_converts',
 u'bus_device',
 u'radial_grind',
 u'radio_communication',
 u'positioning_indicia',
 u'localized',
 u'electrophotographic_device',
 u'debug_software',
 u'canes',
 u'hydrofluoric_acid',
 u'lineman_pole',
 u'arrangement_direction',
 u'computer_machine',
 u'packers',
 u'nozzle_member',
 u'aircraft_features',
 u'herbicide']

In [31]:
len(abstract_hapaxes)

10650

Now let's remove these words from our bag:

In [32]:
final_abstract_tokens = {}
for key,value in mwe_tokenized_abs_stop.iteritems():
    final_abstract_tokens[key] = [word for word in value if word not in most_common_words and word not in abstract_hapaxes]

In [33]:
print_dictionary(final_abstract_tokens, 5)

07910771	[u'invention', u'relates', u'method', u'producing', u'one', u'step', u'reaction', u'glycerol', u'presence', u'molecular', u'oxygen', u'reaction', u'preferably', u'carried', u'presence', u'suitable', u'catalyst']
07910479	[u'method', u'manufacturing', u'photodiode', u'array', u'includes', u'providing', u'semiconductor_substrate', u'first', u'second', u'opposite', u'semiconductor_substrate', u'first_layer', u'first_conductivity', u'proximate', u'first', u'main_surface', u'second', u'layer', u'second', u'conductivity', u'proximate', u'second', u'main_surface', u'via', u'formed', u'substrate', u'extends', u'first', u'relative', u'first', u'main_surface', u'via', u'first_aspect', u'ratio', u'generally', u'simultaneously', u'forming', u'via', u'formed', u'substrate', u'spaced', u'apart', u'via', u'extends', u'second', u'relative', u'first', u'main_surface', u'second', u'aspect_ratio', u'different', u'first_aspect', u'ratio']
07909645	[u'housing', u'provided', u'coaxial_cable', u'con

### 5.5 Create the final vocabulary

The final tokens list contains all the words that we will use for the vocabulary. I will now use the **set()** function to get all the unique words from the remaining tokens.

Let's see how many words I have left.

In [34]:
# create the final list of words and the vocabulary
words = list(chain.from_iterable(final_abstract_tokens.values()))
vocab_list = list(set(words))
len(vocab_list)

6670

### 5.6 Generate the sparse count vectors

This is the final step! I generated a sparse count vector using the **CountVectorizer()**. 
For each abstract, I generated the count vector, then matched the word and its word count using the **zip()** method. 

In [35]:
# Code to generate the sparse count vectors
vectorizer = CountVectorizer(analyzer = "word") 
count_vectors_strings = []
count_vectors = {}

# get the word count vector per abstract
for key,value in final_abstract_tokens.iteritems():
    data_features = vectorizer.fit_transform([' '.join(value)]).toarray()
    wordcount = []
    for word,count in zip(vectorizer.get_feature_names(), data_features[0]):
        wordcount.append(str(vocab_list.index(word))+ ":" + str(count))
    count_vectors[key] = ",".join(wordcount)

# format into a string
for key,value in count_vectors.iteritems():
    count_vectors_strings.append(key + "," + value)

Check if I got the format correctly.

In [36]:
count_vectors_strings[:5]

[u'07910771,6616:1,2080:1,4104:1,1861:1,845:1,986:1,574:1,1808:1,80:1,5387:2,2582:1,5682:2,6396:1,2866:1,3419:1',
 u'07909794,2451:2,5181:4,5418:1,4573:1,3904:1,6163:1,2273:1,651:3,6006:1,4340:1,1852:2,3358:1,435:1,5785:1,1848:1,237:1,529:1,398:1',
 u'07892023,4827:1,4645:6,2975:2,4007:1,1259:1,2281:1,3801:1,6500:1,5837:1,5115:1,5815:4,6175:1,1402:1,4308:1,3897:2,3576:2,3042:1,1634:1,4714:1,5848:1,1070:2,2626:2,2710:2,6491:2,1184:3,2435:1,39:2,221:2,2000:2,5055:2,3530:2,3985:1,2804:1,2430:2,5184:1,3974:4,2102:2,1237:1,1149:1,2997:1,4223:3,51:2,2142:1,863:1',
 u'07909645,1003:1,1842:1,2668:2,3224:5,2922:1,2293:4,3801:1,5503:2,2729:3,3181:2,1092:4,1890:2,6617:2,3081:3,5848:3,2323:1,3665:2,3031:2,1054:1,3358:2,5785:1,1633:1,4469:1',
 u'07909644,450:1,1606:1,3842:1,6524:1,2283:1,3801:3,1222:1,1227:1,2651:1,376:1,3897:5,617:1,4312:2,3874:1,6169:1,209:1,5617:1,5507:1,2037:2,6491:1,2581:1,4283:1,3530:1,395:1,5785:1,2263:1,1294:3,24:4,4885:1,3082:1,4323:1,5132:1,3284:2,5185:1,2142:1,6513:2']

Looks fine. Now write the list of count vectors into the file.

In [37]:
write_file("count_vectors.txt", count_vectors_strings)

Finally, write the corresponding vocabulary into the file.

In [38]:
index = 0
vocab_strings = []
for i in range(len(vocab_list)):
    vocab_strings.append(str(i) + ":" + vocab_list[i]) 

vocab_strings[:5]

[u'0:selection_signal',
 u'1:first_lens',
 u'2:watermarked',
 u'3:four',
 u'4:encoded_data']

In [39]:
write_file("vocab.txt", vocab_strings)

### 5.7 Check the sparse count vector

Let's see if the generated sparse count vector is correct. Here's the check done for patent id **07910771**.

Compare the tokens (alphabetically sorted) and the generated sparse count vector.

In [40]:
sorted_tokens = []
sorted_tokens.extend(final_abstract_tokens["07910771"])
sorted_tokens.sort()
print '\n'.join(str(p) for p in sorted_tokens) 
print '\n'

i = 0

# print the sorted tokens based on the number of count
for cv in count_vectors["07910771"].split(","):
    # get the index from the vocabulary
    index = int(cv[:(cv.find(":"))])
    
    # get the count
    count = int(cv[cv.find(":")+1:])
    
    print vocab_list[index] + ' - ' + str(count)

carried
catalyst
glycerol
invention
method
molecular
one
oxygen
preferably
presence
presence
producing
reaction
reaction
relates
step
suitable


carried - 1
catalyst - 1
glycerol - 1
invention - 1
method - 1
molecular - 1
one - 1
oxygen - 1
preferably - 1
presence - 2
producing - 1
reaction - 2
relates - 1
step - 1
suitable - 1


We can see that the final tokens and the count vectors match, so the generated count vector must be correct.

## 6. Summary

In summary, assessment 1 required us to extract the patent's classification, citation network, and its abstract's sparse count vectors. This is one of the challenges of text processing and data science in particular--working on a huge amount of data and wrangling it in preparation for further modeling and analysis. However, doing them is now easier thanks to these built-in tools in Python.