# FIT5196 Assessment 1
#### Student Name: Dong Lei Qian
#### Student ID: 29115566

Date: 23/08/2017

Version: 1.0

Environment: Python 2.7.11 and Jupyter notebook

Libraries used:
* pandas (for dataframe, included in Anaconda Python 2.7) 
* re (for regular expression, included in Anaconda Python 2.7) 
* numpy (for numpy array, included in Anaconda Python 2.7) 
* lxml (for xml extracting, included in Anaconda Python 2.7)
* beautifulsoup (for xml extracting)
* nltk (for Natural Language Processing such as tokenizing, removing stop words, stemming, lemmatizing and finding bigrams)
* collections (for word counting, included in Anaconda Python 2.7)

## 1. Introduction

The assignment is to read an XML file containing information on 2500 patents, extract the required information from it, and write the results to text files. The assignment consists of 4 parts:
* Extract the International Patent Classification (IPC) code for each patent and write it to the classification.txt file. The format of the file should be
    patent's_ID:Section,Class,Subclass,Main_group,Subgroup
* Extract the cited references for each patent and store them in the citations.txt file. The format should be
    citing_patent_id:cited_patent_id,cited_patent_id....
* For each referenced patent, count how many times it is referenced in the whole XML file and store the result in cited.txt. The format should be
    cited_patent_id:number of times it is cited
* Extract the abstract of each patent and build sparse count vectors from the corpus. Stop words are removed, as are words that appear in only one abstract and the 20 words with the highest document frequency. The words are stemmed and lemmatized, and at least 100 bigrams are extracted. The vectors are stored in count_vectors.txt, and the format should be
    PatentID,word index:count,word index:count....
  Finally, build a vocabulary file called vocab.txt. The format should be
    word index:word
   


## 2.  Import libraries 


In [1]:
import re
import pandas as pd
from lxml import etree
from bs4 import BeautifulSoup
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk.collocations import BigramCollocationFinder
import collections

## 3. Examining and loading data

The XML file is very large, so we try to read it with the lxml package, which performs well on large files (Lecture 3 pg 15).

In [370]:
ltree = etree.parse('patents.xml')

XMLSyntaxError: XML declaration allowed only at the start of the document, line 1051, column 6 (line 1051)

Reading the file fails. On closer inspection, the file is a concatenation of multiple XML documents, each with its own XML declaration, so it has to be reformatted before lxml can parse it.

First, let's load just the first XML document and look at what is in it.

In [104]:
xmlstring = '<?xml version="1.0" encoding="UTF-8"?>\r\n<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>\r\n'
with open('patents.xml','r') as f:
    wholefile = f.read()
firstxml = wholefile.split(xmlstring)[1]
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(firstxml, 'lxml')
print(soup.prettify())

<html>
 <body>
  <us-patent-grant country="US" date-produced="20110208" date-publ="20110222" dtd-version="v4.2 2006-08-23" file="USPP021722-20110222.XML" id="us-patent-grant" lang="EN" status="PRODUCTION">
   <us-bibliographic-data-grant>
    <publication-reference>
     <document-id>
      <country>
       US
      </country>
      <doc-number>
       PP021722
      </doc-number>
      <kind>
       P3
      </kind>
      <date>
       20110222
      </date>
     </document-id>
    </publication-reference>
    <application-reference appl-type="plant">
     <document-id>
      <country>
       US
      </country>
      <doc-number>
       12316880
      </doc-number>
      <date>
       20081216
      </date>
     </document-id>
    </application-reference>
    <us-application-series-code>
     12
    </us-application-series-code>
    <us-term-of-grant>
     <us-term-extension>
      108
     </us-term-extension>
    </us-term-of-grant>
    <classifications-ipcr>
     <classification-i

After looking at the file, we can see that the patent ID can be taken from the file attribute of the us-patent-grant tag, using the regular expression "US(.\*)-" to extract it. The IPC information is inside the classifications-ipcr tag.
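As a small illustration of the regular expression described above, here is a minimal sketch applied to the `file` attribute value shown in the output. The helper name `extract_patent_id` is hypothetical and only used for this example.

```python
import re

def extract_patent_id(file_attribute):
    """Return the text between the leading 'US' and the hyphen, or None if absent."""
    match = re.search(r'US(.*)-', file_attribute)
    return match.group(1) if match else None

patent_id = extract_patent_id('USPP021722-20110222.XML')
print(patent_id)  # PP021722
```

Note that `.*` is greedy, so the pattern relies on the attribute containing a single hyphen, which is the case in the sample shown above.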

If we remove the XML declaration and DOCTYPE lines from every document, wrap the whole content in a <body> tag as its root, and put a single copy of those lines at the beginning of the file, the result should be in the correct format to be read by lxml.

In [2]:
xmlstring = '<?xml version="1.0" encoding="UTF-8"?>\r\n<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant-v42-2006-08-23.dtd" [ ]>\r\n'
with open('patents.xml','r') as f:
    wholefile = f.read()
wholefilenoxml = wholefile.replace(xmlstring,'')
newfile = xmlstring + '<body>\r\n' + wholefilenoxml + '\r\n</body>'
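An alternative to wrapping everything in one root tag would be to split the file on the XML declaration and parse each document on its own. The sketch below shows the idea with the standard library's `xml.etree.ElementTree` on a toy string; the toy input, the second file name and the helper `split_documents` are assumptions made for illustration, not part of the assignment data.

```python
import xml.etree.ElementTree as ET

DECLARATION = '<?xml version="1.0" encoding="UTF-8"?>'

def split_documents(text, declaration=DECLARATION):
    """Split a concatenation of XML documents into one string per document."""
    return [chunk.strip() for chunk in text.split(declaration) if chunk.strip()]

# Toy input standing in for patents.xml: two tiny documents glued together.
toy = (DECLARATION + '\n<us-patent-grant file="USPP021722-20110222.XML"/>\n'
       + DECLARATION + '\n<us-patent-grant file="USRE042159-20110222.XML"/>\n')

# Each chunk is now a well-formed document with a single root element.
roots = [ET.fromstring(doc) for doc in split_documents(toy)]
files = [root.attrib['file'] for root in roots]
```

Parsing per document keeps each patent's elements together, at the cost of handling the DOCTYPE line separately in the real file.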

## 4. Parsing XML and Extracting all the required information 

Now we can read the modified string into lxml.

In [3]:
ltree = etree.fromstring(newfile)

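Now let's extract all the relevant information from the XML and store it in lists.
We build lists for level, class, subclass, group, subgroup and patentID, plus dictionaries for docnumber (citation) and abstract. Note that the level list is filled from the classification-level tag; in the USPTO grant format the IPC section letter normally sits in a separate section tag, so it is worth checking that level really holds the Section required for classification.txt.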

For patentID we use a regular expression to extract the ID from the tag.

Because there are multiple citation tags inside each patent, we build a dictionary with patentID as the key and a comma-separated string of all the citations as the value. Each abstract can also have multiple paragraphs, so we build a dictionary with patentID as the key and a list of all the paragraphs as the value. Some paragraph tags have no text, so we need to check that the text is not None.

The citation information is inside the doc-number tag. Since doc-number appears under several different tags, we need to check that this one is inside a citation tag, and we do the same for the p tags of the abstract.

Finally, we print the length of each list to make sure they are all the same.

In [4]:
level = []
classclass = []
subclass = []
group = []
subgroup = []
patentID = []
docnumberlist = {}
abstractlist = {}
for elem in ltree.iter():
    if elem.tag == 'classification-level':
        level.append(elem.text)
    elif elem.tag == 'class':
        classclass.append(elem.text)
    elif elem.tag == 'subclass':
        subclass.append(elem.text)
    elif elem.tag == 'main-group':
        group.append(elem.text)
    elif elem.tag == 'subgroup':
        subgroup.append(elem.text)
    elif elem.tag == 'us-patent-grant':
        currentID = re.findall('US(.*)-',elem.attrib['file'])[0]
        patentID.append(currentID)
    elif elem.tag == 'doc-number':
        if elem.getparent().getparent().getparent().tag == 'citation':
            if currentID in docnumberlist:
                docnumberlist[currentID] = docnumberlist[currentID] + ',' + elem.text
            else:
                docnumberlist[currentID] = elem.text
    elif elem.tag == 'p':
        if elem.getparent().tag == 'abstract':
            if elem.text is not None:
                # Collect every non-empty paragraph under this patent's ID
                abstractlist.setdefault(currentID, []).append(elem.text)
                    
print(len(level),len(classclass),len(subclass),len(group),len(subgroup),len(patentID),len(docnumberlist),len(abstractlist))

(2500, 2500, 2500, 2500, 2500, 2500, 2500, 2500)
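The parent check on doc-number can also be expressed by searching only below each citation element. The following is a minimal sketch with the standard library's `xml.etree.ElementTree` on a simplified toy document; the nesting citation > patcit > document-id > doc-number is an assumption inferred from the three `getparent()` calls above, and the toy XML omits everything else.

```python
import xml.etree.ElementTree as ET

# Toy grant: one doc-number in publication-reference, two inside citations.
toy_grant = """
<us-patent-grant file="USPP021722-20110222.XML">
  <publication-reference>
    <document-id><doc-number>PP021722</doc-number></document-id>
  </publication-reference>
  <references-cited>
    <citation><patcit><document-id><doc-number>PP17672</doc-number></document-id></patcit></citation>
    <citation><patcit><document-id><doc-number>PP18482</doc-number></document-id></patcit></citation>
  </references-cited>
</us-patent-grant>
"""

root = ET.fromstring(toy_grant.strip())
# Searching below each <citation> skips doc-number elements found elsewhere.
cited = [number.text
         for citation in root.iter('citation')
         for number in citation.iter('doc-number')]
citations_line = 'PP021722:' + ','.join(cited)
```

Here the doc-number under publication-reference is ignored, which is the same effect the parent check achieves in the loop above.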


We can put all the lists in a dataframe to have a look (Parsing XML files, Week 3 reading material).

In [5]:
dataDict = {}
dataDict['level'] = level
dataDict['class'] = classclass
dataDict['subclass'] = subclass
dataDict['group'] = group
dataDict['subgroup'] = subgroup
dataDict['cited'] = docnumberlist
dataDict['abstract'] = abstractlist
dfPatent = pd.DataFrame(dataDict, index = patentID)
dfPatent.index.name = 'patentID'
dfPatent.head()

| patentID | abstract | cited | class | group | level | subclass | subgroup |
|---|---|---|---|---|---|---|---|
| PP021722 | [A new apple tree named ‘Daligris’ is disclose... | PP17672,PP18482,PP18483 | 1 | 5 | A | H | 0 |
| RE042159 | [A sensing device includes a circuit that comp... | 4954776,4956606,5015948,5115193,5180978,533296... | 1 | 7 | A | B | 14 |
| RE042170 | [At least one peripheral processing apparatus ... | 3988719,4206996,4803623,4905098,5012281,516122... | 6 | 11 | A | F | 0 |
| 07891018 | [A knee protective device for garments compris... | 4561124,4831666,4920577,5105473,5134726,D33828... | 41 | 13 | A | D | 0 |
| 07891019 | [A heatable garment having a plurality of laye... | 4355632,4702235,5032705,5148002,5603648,643994... | 41 | 13 | A | D | 0 |


Before writing the files, a quick check of a patent whose abstract has more than one paragraph:

In [12]:
test = dfPatent[dfPatent.index == '07909891'].abstract

In [15]:
print(test.all())

['The invention relates to a dye of the general formula (I)', 'The invention for relates to a process to prepare the dye and their use.']

That looks good. To write the classification file we need patentID:level, so we create a new column with this information.


In [152]:
dfPatent['newcol'] = dfPatent.index + ':' + dfPatent['level']
dfPatent.head()

| patentID | abstract | cited | class | group | level | subclass | subgroup | newcol |
|---|---|---|---|---|---|---|---|---|
| PP021722 | [A new apple tree named ‘Daligris’ is disclose... | PP17672,PP18482,PP18483 | 1 | 5 | A | H | 0 | PP021722:A |
| RE042159 | [A sensing device includes a circuit that comp... | 4954776,4956606,5015948,5115193,5180978,533296... | 1 | 7 | A | B | 14 | RE042159:A |
| RE042170 | [At least one peripheral processing apparatus ... | 3988719,4206996,4803623,4905098,5012281,516122... | 6 | 11 | A | F | 0 | RE042170:A |
| 07891018 | [A knee protective device for garments compris... | 4561124,4831666,4920577,5105473,5134726,D33828... | 41 | 13 | A | D | 0 | 07891018:A |
| 07891019 | [A heatable garment having a plurality of laye... | 4355632,4702235,5032705,5148002,5603648,643994... | 41 | 13 | A | D | 0 | 07891019:A |


Now we can write this to the classification file

In [153]:
dfPatent.to_csv('classification.txt',header = False,index = False,columns = ['newcol','class','subclass','group','subgroup'])

Now write the citations file. We use : as the separator because the cited column already contains commas.

In [154]:
dfPatent.to_csv('citations.txt',sep = ':',header = False, columns = ['cited'])

To find out how many times each reference is cited, we build a dataframe with one row per cited reference. We do this by splitting each entry of docnumberlist on ',', attaching the citing patentID to each cited reference, and finally concatenating the per-patent dataframes together.

In [6]:
frames = []
for pid in patentID:
    # One row per cited reference, tagged with the citing patent's ID
    frames.append(pd.DataFrame({'cited': docnumberlist[pid].split(','), 'patentID': pid}))
# Concatenate once at the end instead of growing the dataframe inside the loop
citeddf = pd.concat(frames)

citeddf.head()

| | cited | patentID |
|---|---|---|
| 0 | PP17672 | PP021722 |
| 1 | PP18482 | PP021722 |
| 2 | PP18483 | PP021722 |
| 0 | 4954776 | RE042159 |
| 1 | 4956606 | RE042159 |


Using groupby, count how many times each cited reference is cited.

In [8]:
cited = citeddf.groupby('cited').count()
cited

| cited | patentID |
|---|---|
| 0 010 995 | 1 |
| 0 012 075 | 1 |
| 0 027 549 | 1 |
| 0 038 176 | 1 |
| 0 041 022 | 1 |
| 0 058 325 | 1 |
| 0 094649 | 1 |
| 0 106 026 | 1 |
| 0 123 237 | 1 |
| 0 128 015 | 1 |


This looks good, we can write to file

In [10]:
cited.to_csv('cited.txt',sep = ':',header = False)
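The same count can be obtained without a dataframe. The sketch below uses `collections.Counter` on a toy stand-in for docnumberlist; the patent IDs `P1`..`P3` and cited IDs `A`..`C` are made up for illustration.

```python
import collections

# Toy stand-in for docnumberlist: citing patent ID -> comma-separated cited IDs.
toy_citations = {
    'P1': 'A,B,C',
    'P2': 'B,C',
    'P3': 'C',
}

# Count every cited ID across all citing patents.
cited_counts = collections.Counter(
    cited_id
    for cited_string in toy_citations.values()
    for cited_id in cited_string.split(',')
)
# Format as 'cited_patent_id:number of times it is cited', sorted by ID.
lines = [cited_id + ':' + str(count) for cited_id, count in sorted(cited_counts.items())]
```

Like the groupby version, this counts one occurrence per row, so a reference listed twice by the same patent would be counted twice.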

## 5. Preprocessing the abstracts 

For this part we tokenize the abstracts. The regular expression allows -, ' and . in the middle of a word but not at the end, so that the full stop at the end of a sentence is not included. The idea is taken from the Fundamentals of Text Pre-processing (week 3 reading material), combining the expressions (\w+\.?)+ and \w+(?:[-']\w+)?. Words with - in the middle are similar to bigrams and might have a different meaning from the single words, so I keep them as one token. Similarly, words containing ' or . are treated as one token.

In [114]:
# \w+ already matches a run of word characters, so the nested (?:\w+)+ is not needed
tokenizer = RegexpTokenizer(r"\w+(?:[-'.]\w+)*")

Test whether the tokenizer handles . ' and - inside words. It seems to work. The only issue is that it drops the trailing . from abbreviations such as Mr., but that is fine as long as it is consistent across all words of this form. Note also that % is removed from 100%, which is acceptable because we are only concerned with alphanumeric characters.

In [115]:
teststring = "Mr. Smith's dream is to be a data-scientist in U.S.A and predict with 100% accuracy. That's great."
test = tokenizer.tokenize(teststring)
test

['Mr',
 "Smith's",
 'dream',
 'is',
 'to',
 'be',
 'a',
 'data-scientist',
 'in',
 'U.S.A',
 'and',
 'predict',
 'with',
 '100',
 'accuracy',
 "That's",
 'great']
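The tokenizer behaviour above can be reproduced with plain `re`. This is a minimal sketch that assumes a slightly simplified pattern, `\w+(?:[-'.]\w+)*`, which I expect to match the same tokens; the helper name `tokenize` is hypothetical.

```python
import re

# A word, optionally followed by groups of (-, ' or .) plus more word characters.
TOKEN_PATTERN = re.compile(r"\w+(?:[-'.]\w+)*")

def tokenize(text):
    """Return all tokens; trailing punctuation is never part of a token."""
    return TOKEN_PATTERN.findall(text)

tokens = tokenize("Mr. Smith's dream is to be a data-scientist in U.S.A "
                  "and predict with 100% accuracy. That's great.")
```

Because the group must end in `\w+`, a `.`, `-` or `'` is kept only when another word character follows it, which is why `Mr.` becomes `Mr` while `U.S.A` stays whole.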

Now try a standard tokenizer

In [116]:
nltk.wordpunct_tokenize("Mr. Smith's dream is to be a data-scientist in U.S.A and predict with 100% accuracy. That's great.")

['Mr',
 '.',
 'Smith',
 "'",
 's',
 'dream',
 'is',
 'to',
 'be',
 'a',
 'data',
 '-',
 'scientist',
 'in',
 'U',
 '.',
 'S',
 '.',
 'A',
 'and',
 'predict',
 'with',
 '100',
 '%',
 'accuracy',
 '.',
 'That',
 "'",
 's',
 'great',
 '.']

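OK, we stick with the original tokenizer as it seems to work better.

Now we tokenize abstractlist. Note that the code below uses only the first paragraph of each abstract (abstractlist[pid][0]), so any later paragraphs are not tokenized.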

In [117]:
tokenized = []
for pid in patentID:
    tokenized.append(tokenizer.tokenize(abstractlist[pid][0]))

Just check that worked

In [118]:
tokenized[0]

[u'A',
 u'new',
 u'apple',
 u'tree',
 u'named',
 u'Daligris',
 u'is',
 u'disclosed',
 u'The',
 u'fruit',
 u'of',
 u'the',
 u'new',
 u'variety',
 u'is',
 u'particularly',
 u'notable',
 u'for',
 u'its',
 u'eating',
 u'quality',
 u'and',
 u'distinctive',
 u'flavor',
 u'and',
 u'appearance',
 u'The',
 u'fruit',
 u'is',
 u'very',
 u'sweet',
 u'and',
 u'has',
 u'a',
 u'pronounced',
 u'aniseed',
 u'flavor',
 u'and',
 u'takes',
 u'on',
 u'a',
 u'distinctive',
 u'red',
 u'orange',
 u'coloration',
 u'as',
 u'it',
 u'ripens',
 u'on',
 u'the',
 u'tree']

Now let's remove stop words, because these words are very common and do not contribute to the meaning of the text (the Fundamentals of Text Pre-processing, 2.3).

In [119]:
stopwords_list = set([stopword.encode('utf-8') for stopword in stopwords.words('english')])
stopwords_list

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 'd',
 'did',
 'didn',
 'do',
 'does',
 'doesn',
 'doing',
 'don',
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 'has',
 'hasn',
 'have',
 'haven',
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 'it',
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 'more',
 'most',
 'mustn',
 'my',
 'myself',
 'needn',
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 're',
 's',
 'same',
 'shan',
 'she',
 'should',
 'shouldn',
 'so',
 'some',
 'such',
 't',
 'than',
 'that',
 'the',
 'their',
 'theirs',
 'them',
 

Now change all words to lower case and remove stop words

In [120]:
# Lower-case every token and keep it only if it is not a stop word
filtered_list = [[word.lower() for word in para if word.lower() not in stopwords_list] for para in tokenized]
filtered_list

[[u'new',
  u'apple',
  u'tree',
  u'named',
  u'daligris',
  u'disclosed',
  u'fruit',
  u'new',
  u'variety',
  u'particularly',
  u'notable',
  u'eating',
  u'quality',
  u'distinctive',
  u'flavor',
  u'appearance',
  u'fruit',
  u'sweet',
  u'pronounced',
  u'aniseed',
  u'flavor',
  u'takes',
  u'distinctive',
  u'red',
  u'orange',
  u'coloration',
  u'ripens',
  u'tree'],
 ['sensing',
  'device',
  'includes',
  'circuit',
  'compensates',
  'time',
  'spatial',
  'changes',
  'temperature',
  'circuit',
  'includes',
  'elements',
  'correct',
  'variation',
  'permeability',
  'highly',
  'permeable',
  'core',
  'differential',
  'variable',
  'reluctance',
  'transducer',
  'temperature',
  'changes',
  'circuit',
  'also',
  'provides',
  'correction',
  'temperature',
  'gradients',
  'across',
  'coils',
  'transducer'],
 ['least',
  'one',
  'peripheral',
  'processing',
  'apparatus',
  'least',
  'one',
  'information',
  'processing',
  'apparatus',
  'interconnected

The next step is to run a stemmer, because we have a lot of related words, such as disclosed and disclose or eating and eat, that share the same meaning. By running a stemmer we can reduce the number of types in our corpus (the Fundamentals of Text Pre-processing, 2.4).

First let's check the number of distinct words in our corpus

In [125]:
len(set([w for para in filtered_list for w in para ]))

11257

Now run the stemmer

In [127]:
stemmer = SnowballStemmer('english')
stemmed = [[stemmer.stem(w) for w in para] for para in filtered_list]
stemmed

[[u'new',
  u'appl',
  u'tree',
  u'name',
  u'daligri',
  u'disclos',
  u'fruit',
  u'new',
  u'varieti',
  u'particular',
  u'notabl',
  u'eat',
  u'qualiti',
  u'distinct',
  u'flavor',
  u'appear',
  u'fruit',
  u'sweet',
  u'pronounc',
  u'anise',
  u'flavor',
  u'take',
  u'distinct',
  u'red',
  u'orang',
  u'color',
  u'ripen',
  u'tree'],
 [u'sens',
  u'devic',
  u'includ',
  u'circuit',
  u'compens',
  u'time',
  u'spatial',
  u'chang',
  u'temperatur',
  u'circuit',
  u'includ',
  u'element',
  u'correct',
  u'variat',
  u'permeabl',
  u'high',
  u'permeabl',
  u'core',
  u'differenti',
  u'variabl',
  u'reluct',
  u'transduc',
  u'temperatur',
  u'chang',
  u'circuit',
  u'also',
  u'provid',
  u'correct',
  u'temperatur',
  u'gradient',
  u'across',
  u'coil',
  u'transduc'],
 [u'least',
  u'one',
  u'peripher',
  u'process',
  u'apparatus',
  u'least',
  u'one',
  u'inform',
  u'process',
  u'apparatus',
  u'interconnect',
  u'network',
  u'includ',
  u'storag',
  u'mean'

Check number of distinct words now

In [129]:
len(set([w for para in stemmed for w in para ]))

6976

We have reduced the number of types from 11257 to 6976. However, looking at the results, apple now appears as appl. Maybe we should try a lemmatizer instead to see if we get a better result (the Fundamentals of Text Pre-processing, 2.4).

In [131]:
lemmatizer = WordNetLemmatizer()
lemmatized = [[lemmatizer.lemmatize(w) for w in para] for para in filtered_list]

In [132]:
lemmatized

[[u'new',
  u'apple',
  u'tree',
  u'named',
  u'daligris',
  u'disclosed',
  u'fruit',
  u'new',
  u'variety',
  u'particularly',
  u'notable',
  u'eating',
  u'quality',
  u'distinctive',
  u'flavor',
  u'appearance',
  u'fruit',
  u'sweet',
  u'pronounced',
  u'aniseed',
  u'flavor',
  u'take',
  u'distinctive',
  u'red',
  u'orange',
  u'coloration',
  u'ripens',
  u'tree'],
 ['sensing',
  'device',
  'includes',
  'circuit',
  'compensates',
  'time',
  'spatial',
  u'change',
  'temperature',
  'circuit',
  'includes',
  u'element',
  'correct',
  'variation',
  'permeability',
  'highly',
  'permeable',
  'core',
  'differential',
  'variable',
  'reluctance',
  'transducer',
  'temperature',
  u'change',
  'circuit',
  'also',
  'provides',
  'correction',
  'temperature',
  u'gradient',
  'across',
  u'coil',
  'transducer'],
 ['least',
  'one',
  'peripheral',
  'processing',
  'apparatus',
  'least',
  'one',
  'information',
  'processing',
  'apparatus',
  'interconnected'

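It seems the lemmatization hasn't worked very well, as words like named and particularly haven't been changed at all. This is expected to some extent, because WordNetLemmatizer treats every word as a noun unless a part-of-speech tag is passed in, so verbs and adverbs are left as they are. Let's check the number of distinct words now.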

In [133]:
len(set([w for para in lemmatized for w in para ]))

10064

The number of distinct words has not dropped much, so we still need the stemmer. We apply the lemmatizer on top of the stemmed words.

In [134]:
lemmatized = [[lemmatizer.lemmatize(w) for w in para] for para in stemmed]

Now let's find the 20 words with the highest document frequency using the collections package (Pre-processing tweets, week 4 reading material). We use set to get the unique words in each abstract, so the total count of a word is its document frequency (Exploring Pre-Processed text and Generating Features, week 4 reading material). As these words are common across many documents, they add little value in distinguishing the documents, so we need to remove them.

In [136]:
wordlist = [word for para in lemmatized for word in set(para)]
word_counter = collections.Counter(wordlist)
most_freq = word_counter.most_common(20)
most_freq

[(u'includ', 1419),
 (u'provid', 907),
 (u'one', 825),
 (u'first', 739),
 (u'method', 666),
 (u'second', 652),
 (u'devic', 638),
 (u'system', 580),
 (u'use', 574),
 (u'form', 529),
 (u'least', 526),
 (u'connect', 499),
 (u'compris', 456),
 (u'portion', 435),
 (u'control', 428),
 (u'plural', 426),
 (u'receiv', 413),
 (u'posit', 402),
 (u'surfac', 397),
 (u'oper', 376)]
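To make the role of `set` concrete, here is a minimal sketch of the document-frequency count on three toy documents. The toy words are illustrative only.

```python
import collections

toy_docs = [
    ['circuit', 'includ', 'circuit', 'coil'],
    ['method', 'includ', 'coil'],
    ['includ', 'tree'],
]

# set(doc) counts each word once per document, giving document frequency
# rather than total term frequency.
document_frequency = collections.Counter(word for doc in toy_docs for word in set(doc))
top_two = document_frequency.most_common(2)
```

Here circuit occurs twice in the first document but has a document frequency of 1, while includ appears in all three documents and ranks first.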

Next, let's get the words that appear in only one document. These words are only useful for identifying that one particular document and not useful in general, and they add a lot of extra features and unnecessary complexity to our model. We need to remove those too.

In [137]:
not_common = [word for word in word_counter if word_counter[word] == 1]
not_common

[u'polypeptid',
 u'h.264',
 u'hinge-mount',
 u'iron-typ',
 u'frequency-select',
 u'crossbar',
 u'sputter',
 u'content-empti',
 u'fur',
 u'wooden',
 u'crbt',
 u'watered-ash',
 u'neuro-stimul',
 u'crotch',
 u'message-driven',
 u'scraper',
 u'budget',
 u'half-duplex',
 u'semicircular',
 u'correlat',
 u'off-stat',
 u'herebi',
 u'herb',
 u'chink',
 u'militari',
 u'co-rot',
 u'conting',
 u'climber',
 u'short-liv',
 u'topographi',
 u'nitrat',
 u'fragment-fre',
 u'extrapol',
 u'ips-typ',
 u'mass-select',
 u'non-ink',
 u'aut',
 u'subsurfac',
 u'multi-channel',
 u'hung-up',
 u'notic',
 u'accid',
 u'non-return',
 u'antifog',
 u'locker',
 u'household',
 u're-align',
 u'titanium',
 u'unassign',
 u'drywel',
 u'51.0',
 u'umid',
 u'current-driven',
 u'hom',
 u'hop',
 u'perspect',
 u'hvpw',
 u'anti-block',
 u'ductil',
 u'modest',
 u'beauti',
 u'scandium',
 u'non-hybrid',
 u'alias',
 u'wing',
 u'labor-reduc',
 u'foodstuff',
 u'auditori',
 u'liftabl',
 u'undercut',
 u'point-of-car',
 u'enrich',
 u'multi-

Join the two lists together. From most_freq we only want the words, not their frequencies.

In [138]:
remove_words = set([word[0] for word in most_freq] + not_common)
remove_words

{u'h.264',
 u'previously-mad',
 u'untrust',
 u'4-mask',
 u'polypeptid',
 u'watered-ash',
 u'hinge-mount',
 u'ion-driven',
 u'call-handl',
 u'half-duplex',
 u'iron-typ',
 u'oldest',
 u'wireless-feedback',
 u'water-suppli',
 u'disappear',
 u'crumbl',
 u'swab',
 u'sub-scan',
 u'palladium',
 u'frequency-select',
 u'time-vari',
 u'unidirectionally-ori',
 u'multi-cor',
 u'chondrocyt',
 u'aut',
 u'over-the-road',
 u'crossbar',
 u'sputter',
 u'ldscr',
 u'furrow',
 u'device-depend',
 u'merchant',
 u'milit',
 u'vermin',
 u'sub-area',
 u'ductil',
 u'spallat',
 u'quantif',
 u'content-empti',
 u'govern',
 u'first-in-first-out',
 u'series-connect',
 u'quantit',
 u'evers',
 u'ultra-high',
 u'bernoulli',
 u'wooden',
 u'crbt',
 u'subject-specif',
 u'upload',
 u'tm-polar',
 u'open-back',
 u'neuro-stimul',
 u'unblock',
 u'unmanag',
 u'crotch',
 u'dual-pawl',
 u'pre-stretch',
 u'fold-ov',
 u'influent',
 u'run-on',
 u'electronically-control',
 u'far-far',
 u'message-driven',
 u'fuel-oil',
 u'beam-combin',


Now we can remove those from our corpus

In [139]:
# Keep only the words that are not in the removal set
lemmatized = [[word for word in para if word not in remove_words] for para in lemmatized]
lemmatized

[[u'new',
  u'tree',
  u'name',
  u'disclos',
  u'fruit',
  u'new',
  u'varieti',
  u'particular',
  u'notabl',
  u'eat',
  u'qualiti',
  u'distinct',
  u'flavor',
  u'appear',
  u'fruit',
  u'flavor',
  u'take',
  u'distinct',
  u'red',
  u'color',
  u'tree'],
 [u'sen',
  u'circuit',
  u'compens',
  u'time',
  u'spatial',
  u'chang',
  u'temperatur',
  u'circuit',
  u'element',
  u'correct',
  u'variat',
  u'permeabl',
  u'high',
  u'permeabl',
  u'core',
  u'differenti',
  u'variabl',
  u'transduc',
  u'temperatur',
  u'chang',
  u'circuit',
  u'also',
  u'correct',
  u'temperatur',
  u'gradient',
  u'across',
  u'coil',
  u'transduc'],
 [u'peripher',
  u'process',
  u'apparatus',
  u'inform',
  u'process',
  u'apparatus',
  u'interconnect',
  u'network',
  u'storag',
  u'mean',
  u'store',
  u'inform',
  u'inform',
  u'process',
  u'apparatus',
  u'peripher',
  u'apparatus',
  u'network',
  u'inform',
  u'store',
  u'storag',
  u'mean',
  u'transfer',
  u'network',
  u'inform',
  u'
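The two removal rules can be combined into one small function. The following is a sketch under the assumption that the corpus is a list of token lists; `filter_corpus` and the toy documents are hypothetical names and data used only for illustration.

```python
import collections

def filter_corpus(docs, top_k):
    """Drop the top_k highest-document-frequency words and words seen in only one document."""
    df = collections.Counter(word for doc in docs for word in set(doc))
    too_common = set(word for word, _ in df.most_common(top_k))
    too_rare = set(word for word, count in df.items() if count == 1)
    remove = too_common | too_rare
    return [[word for word in doc if word not in remove] for doc in docs]

toy_docs = [
    ['circuit', 'includ', 'circuit', 'coil'],
    ['method', 'includ', 'coil'],
    ['includ', 'tree'],
]
filtered = filter_corpus(toy_docs, top_k=1)
```

In this toy run the third document ends up empty, which is a case the code that writes count_vectors.txt needs to handle.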

The next step is to extract bigrams from our corpus. A bigram can express a meaning that differs from that of its single words, so bigrams need to be treated as separate features from the single words (Exploring Pre-Processed text and Generating Features, week 4 reading material).

We keep only bigrams that occur at least 10 times, so that we get meaningful bigrams rather than pairs that are adjacent by coincidence, and then pick the 200 highest-scoring bigrams by PMI (pointwise mutual information), a measure of association used in information theory and statistics (Wikipedia). Because the finder runs over the flattened word list, a pair can straddle the boundary between two abstracts, but the frequency filter makes such accidental pairs unlikely to survive.

In [140]:
wordlist = [word for para in lemmatized for word in para]
finder = BigramCollocationFinder.from_words(wordlist)
finder.apply_freq_filter(10)
bigram_measures = nltk.collocations.BigramAssocMeasures()
bigrams = finder.nbest(bigram_measures.pmi,200)
bigrams

[(u'hydrophob', u'drug'),
 (u'golf', u'club'),
 (u'rfid', u'reader'),
 (u'planetari', u'gear'),
 (u'fire', u'suppress'),
 (u'acceler', u'pedal'),
 (u'class-d', u'amplifi'),
 (u'refract', u'index'),
 (u'left', u'right'),
 (u'resourc', u'alloc'),
 (u'order', u'price'),
 (u'hard', u'mask'),
 (u'color', u'gamut'),
 (u'call', u'parti'),
 (u'promot', u'code'),
 (u'identif', u'tag'),
 (u'cutter', u'blade'),
 (u'esd', u'protect'),
 (u'articul', u'paper'),
 (u'ink', u'jet'),
 ('p', u'type'),
 (u'mo', u'transistor'),
 (u'perman', u'magnet'),
 (u'heat', u'sink'),
 (u'electrostat', u'discharg'),
 (u'intern', u'combust'),
 (u'dope', u'well'),
 (u'hardwar', u'node'),
 (u'emit', u'diod'),
 (u'liquid', u'crystal'),
 (u'space', u'apart'),
 (u'electromagnet', u'radiat'),
 (u'elast', u'band'),
 (u'describ', u'herein'),
 (u'encrypt', u'key'),
 (u'combust', u'engin'),
 (u'club', u'head'),
 (u'coaxial', u'cabl'),
 (u'wind', u'turbin'),
 (u'heat', u'exchang'),
 (u'crown', u'head'),
 (u'laser', u'beam'),
 (u'
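For reference, PMI compares how often a pair occurs with how often it would occur if the two words were independent: PMI(x, y) = log2( P(x, y) / (P(x) P(y)) ). The sketch below estimates it from raw counts on a toy word list; the function `pmi` is a hypothetical helper, and I am assuming, without having checked NLTK's source here, that its score uses a comparable count-based estimate.

```python
import collections
import math

def pmi(pair, words):
    """PMI of an adjacent word pair, estimated from counts in a flat word list."""
    total = len(words)
    word_counts = collections.Counter(words)
    pair_counts = collections.Counter(zip(words, words[1:]))
    joint = pair_counts[pair]
    if joint == 0:
        return float('-inf')
    first, second = pair
    # log2( P(x, y) / (P(x) * P(y)) ) with probabilities estimated as count / total
    return math.log(float(joint) * total / (word_counts[first] * word_counts[second]), 2)

toy_words = ['golf', 'club', 'head', 'golf', 'club', 'new', 'tree', 'new', 'head']
golf_club = pmi(('golf', 'club'), toy_words)
new_head = pmi(('new', 'head'), toy_words)
```

In the toy list golf is always followed by club, so that pair scores higher than new head, which occurs only once even though both words occur twice. PMI tends to favour rare pairs, which is one reason to apply a frequency filter first.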

Now, for each abstract, check every pair of adjacent words against the bigram list and replace the two single words with the bigram

In [141]:
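bigram_set = set(bigrams)  # set membership avoids scanning the list for every pair
for para in lemmatized:
    for j in range(0,len(para)-1):
        temp_tuple = (para[j], para[j+1])
        if temp_tuple in bigram_set:
            # Merge the pair into one token and mark the second word for removal
            para[j] = para[j] + '_' + para[j+1]
            para[j+1] = None

# Drop the placeholders left behind by merged pairs
lemmatized = [[word for word in para if word is not None] for para in lemmatized]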
lemmatized

[[u'new',
  u'tree',
  u'name',
  u'disclos',
  u'fruit',
  u'new',
  u'varieti',
  u'particular',
  u'notabl',
  u'eat',
  u'qualiti',
  u'distinct',
  u'flavor',
  u'appear',
  u'fruit',
  u'flavor',
  u'take',
  u'distinct',
  u'red',
  u'color',
  u'tree'],
 [u'sen',
  u'circuit',
  u'compens',
  u'time',
  u'spatial',
  u'chang',
  u'temperatur',
  u'circuit',
  u'element',
  u'correct',
  u'variat',
  u'permeabl',
  u'high',
  u'permeabl',
  u'core',
  u'differenti',
  u'variabl',
  u'transduc',
  u'temperatur',
  u'chang',
  u'circuit',
  u'also',
  u'correct',
  u'temperatur',
  u'gradient',
  u'across',
  u'coil',
  u'transduc'],
 [u'peripher',
  u'process',
  u'apparatus',
  u'inform',
  u'process',
  u'apparatus',
  u'interconnect',
  u'network',
  u'storag',
  u'mean',
  u'store',
  u'inform',
  u'inform',
  u'process',
  u'apparatus',
  u'peripher',
  u'apparatus',
  u'network',
  u'inform',
  u'store',
  u'storag',
  u'mean',
  u'transfer',
  u'network',
  u'inform',
  u'
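The replacement step can also be written as a function that builds a new list instead of marking removed words with a placeholder. This is a sketch intended to give the same result as the loop above; `merge_bigrams` and the toy bigram set are hypothetical.

```python
def merge_bigrams(words, bigram_set):
    """Join each adjacent pair found in bigram_set into one 'first_second' token."""
    merged = []
    i = 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in bigram_set:
            merged.append(words[i] + '_' + words[i + 1])
            i += 2  # skip the second word so pairs never overlap
        else:
            merged.append(words[i])
            i += 1
    return merged

toy_bigrams = {('golf', 'club'), ('club', 'head')}
result = merge_bigrams(['new', 'golf', 'club', 'head'], toy_bigrams)
```

When two candidate bigrams overlap, as with golf club and club head here, the leftmost pair wins and the remaining word stays a single token.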

This looks good. Let's build the vocabulary by removing all duplicated words. We use set, as it gives us the unique words.

In [142]:
# Unique words across all abstracts; a word's position in this list is its index
worddict = list(set([word for para in lemmatized for word in para]))
worddict

[u'bit_line',
 u'circuitri',
 u'orthogon',
 u'interchang',
 u'four',
 u'sleev',
 u'sleep',
 u'volumetr',
 u'concaten',
 u'whose',
 u'coupleabl',
 u'accur',
 u'cabriolet',
 u'descript',
 u'piezoelectr_element',
 u'deviat',
 u'swap',
 u'under',
 u'non-rotat',
 u'swivel',
 u'everi',
 u'risk',
 u'downstream',
 u'void',
 u'rise',
 u'voic',
 u'distort',
 u'pigment',
 u'lumin',
 u'quantiz',
 u'microbi',
 u'jack',
 u'upstream',
 u'affect',
 u'hitch',
 u'voip',
 u'disturb',
 u'prize',
 u'pinch',
 u'look-up',
 u'prefix',
 u'\xbc',
 u'correl',
 u'matric',
 u'verif',
 u'recov',
 u'build-up',
 u't-shape',
 u'apertur',
 u'erron',
 u'abil',
 u'speci',
 u'fabric_panel',
 u'direct',
 u'batch',
 u'nail',
 u'dish',
 u'molecular',
 u'manur',
 u'aggreg',
 u'in-phas',
 u'tether',
 'n',
 u'illumin',
 u'even',
 u'aim',
 u'near',
 u'sourc_drain',
 u'bipolar',
 u'polyest',
 u'pin',
 u'conduct',
 u'new',
 u'symmetr',
 u'topolog',
 u'metadata',
 u'ongo',
 u'elimin',
 u'gimbal',
 u'mem',
 u'elast',
 u'aberr',
 u'p

The position of a word in the list serves as its index, which is what we write to the vocab file.

In [143]:
worddict.index('new')

72

In [144]:
with open('vocab.txt','w') as f:
    # enumerate yields each word with its position, avoiding a list search per word
    for index, word in enumerate(worddict):
        f.write(str(index) + ':' + word.encode('utf-8') + '\n')

Next we build the word vectors used to write the vector file. For each abstract we build a dictionary with the word index as the key and the number of times the word occurs in that abstract as the value, then collect these dictionaries in a list.

In [145]:
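# Map each word to its position once, instead of calling worddict.index for every token
wordindexes = {word: i for i, word in enumerate(worddict)}
word_vector = []

for para in lemmatized:
    word_vectordict = {}
    for word in para:
        wordindex = wordindexes[word]
        # Increment the count, starting from 0 the first time the word is seen
        word_vectordict[wordindex] = word_vectordict.get(wordindex, 0) + 1
    word_vector.append(word_vectordict)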

Check the length of our vector list. It should be 2500, the number of patents.

In [148]:
len(word_vector)

2500

Finally write the count_vectors file

In [149]:
with open('count_vectors.txt','w') as f:
    for i in range(0,len(patentID)):
        # Build the 'word index:count' pairs for this patent
        pairs = [str(word) + ':' + str(word_vector[i][word]) for word in word_vector[i]]
        # One complete line per patent, with no trailing comma even if the vector is empty
        f.write(','.join([patentID[i]] + pairs) + '\n')
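As a compact illustration of the count_vectors.txt line format, the sketch below builds one line from a token list and a word-to-index mapping. The helper `count_vector_line`, the toy vocabulary and the choice to sort pairs by word index are assumptions for this example, not requirements taken from the assignment.

```python
import collections

def count_vector_line(patent_id, words, word_index):
    """Format one line as 'PatentID,word index:count,...' sorted by word index."""
    counts = collections.Counter(word_index[word] for word in words)
    pairs = [str(index) + ':' + str(count) for index, count in sorted(counts.items())]
    return ','.join([patent_id] + pairs)

toy_vocab = ['new', 'tree', 'fruit']
toy_index = dict((word, i) for i, word in enumerate(toy_vocab))
line = count_vector_line('PP021722', ['new', 'tree', 'fruit', 'new', 'tree'], toy_index)
empty_line = count_vector_line('RE042159', [], toy_index)
```

Joining the ID and the pairs in a single `','.join` means an abstract with no remaining words still produces a well-formed line containing only the patent ID.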

## 6. Summary

The text processing techniques used in this assessment are not perfect. In fact, there is no single uniform way of doing text processing, simply because human languages are so complicated, and there are a lot of factors to consider when doing NLP. For example, how do we tokenize words: do we include special characters in them or discard all punctuation? How do we extract bigrams, given that words which happen to be collocated multiple times do not necessarily make a meaningful bigram? When we stem, do we allow words to be cut in half? These are all very interesting questions to think about. It is particularly challenging to automate these tasks because the rules are unclear, and at the moment we still require human knowledge to perform them.

## 7. References
* Lecture notes from Lecture 3 (http://moodle.vle.monash.edu/mod/resource/view.php?id=4221618)
* Parsing XML files (http://moodle.vle.monash.edu/mod/resource/view.php?id=4221615)
* The fundamentals of text processing (http://moodle.vle.monash.edu/mod/resource/view.php?id=4221616)
* Preprocess Tweets (http://moodle.vle.monash.edu/mod/resource/view.php?id=4221625)
* Exploring Pre-Processed text and Generating Features (http://moodle.vle.monash.edu/mod/resource/view.php?id=4221624)
* Pointwise Mutual Information (https://en.wikipedia.org/wiki/Pointwise_mutual_information)