## Extract and preprocess abstracts

Abstracts provide important information that can help in learning the citation network. The main goal of task 4 is therefore to extract the abstracts of all patents in the XML file, which contains 2,500 patents in total.
The abstracts are stored as sparse count vectors in an output file called “count_vectors.txt”. Each line starts with the patent_id of a patent, followed by “word_index:count” pairs. A second file, “vocab.txt”, lists the whole vocabulary together with the index that appears as word_index in “count_vectors.txt”, so the two files can be cross-referenced. In “vocab.txt”, each line holds one vocabulary entry: the word_index followed by the word.

## Step 1. Import Libraries we need for the task

In [1]:
from __future__ import division   # __future__ imports must come before all other statements
from bs4 import BeautifulSoup
import nltk
from nltk.tokenize import sent_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer
from nltk.stem import LancasterStemmer
from nltk.stem import SnowballStemmer
from nltk.stem import WordNetLemmatizer
from itertools import chain
from nltk.tokenize import MWETokenizer
from nltk.collocations import *
from nltk.probability import *

## Step 2. Use BeautifulSoup to extract information from the XML file.

##### 2.1 Parse the XML file with BeautifulSoup, using the same parsing method as in task 1.

In [2]:
with open("./patents.xml") as f:   # the file is closed once it has been parsed
    soup = BeautifulSoup(f, "html.parser")

By examining the hierarchy of the XML file, we noted that the information we want to extract is stored in the following tags:

 * for patent's ID:  
 
        publication-reference > doc-number 
        (the IDs of the 2,500 patents are located under the tags publication-reference > doc-number)
 
 * for patent's abstracts: 
              
        abstract > p 
                 > p
                  .
                  .
                 > p
        (a patent's abstract tag may contain more than one p tag of abstract text)
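
As a minimal sketch of this tag hierarchy, the snippet below walks a tiny hand-written XML sample with the standard library's xml.etree.ElementTree instead of BeautifulSoup. The sample document and the element names us-patent-grant and document-id are assumptions made for illustration; only publication-reference, doc-number, abstract and p come from the hierarchy described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the patent file; the real file holds 2,500 patents.
sample = """
<patents>
  <us-patent-grant>
    <publication-reference>
      <document-id><doc-number>PP021722</doc-number></document-id>
    </publication-reference>
    <abstract><p>A new apple tree is disclosed.</p><p>The fruit is sweet.</p></abstract>
  </us-patent-grant>
</patents>
"""

root = ET.fromstring(sample.strip())
ids, abstracts = [], []
for grant in root.iter("us-patent-grant"):
    # publication-reference > doc-number holds the patent ID
    ids.append(grant.find("publication-reference").find(".//doc-number").text)
    # abstract > p (one or more) holds the abstract text
    paragraphs = [p.text for p in grant.find("abstract").iter("p")]
    abstracts.append(" ".join(paragraphs))

print(ids)
print(abstracts)
```

Joining the p tags with a space mirrors how multi-paragraph abstracts are merged into one string per patent.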
        
##### 2.2 Get the patent IDs and collect them in a list: patents_id

In [3]:
publication_reference_tags = soup.find_all("publication-reference")
patents_id = [item.find("doc-number").string for item in publication_reference_tags] 
patents_id[0:5]

[u'PP021722', u'RE042159', u'RE042170', u'07891018', u'07891019']

##### 2.3 Get each patent's abstract and collect them in a list, one item per patent: ab

In [4]:
ab_tags = soup.find_all("abstract")
p_tags = [item.find_all("p") for item in ab_tags] 

ab = []
for l in p_tags:
    # take the text of every <p> tag of one abstract and convert Unicode to ASCII
    temp = [tag.get_text().encode('ascii', 'ignore').decode('ascii') for tag in l]
    # join the paragraphs into one string (a single paragraph is kept unchanged)
    ab.append(" ".join(temp))

ab[0:5]

[u'A new apple tree named Daligris is disclosed. The fruit of the new variety is particularly notable for its eating quality and distinctive flavor and appearance. The fruit is very sweet and has a pronounced aniseed flavor, and takes on a distinctive red orange coloration as it ripens on the tree.',
 u'A sensing device includes a circuit that compensates for time and spatial changes in temperature. The circuit includes elements to correct for variation in permeability of a highly permeable core of a differential variable reluctance transducer as temperature changes. The circuit also provides correction for temperature gradients across coils of the transducer.',
 u'At least one peripheral processing apparatus and at least one information processing apparatus, interconnected through a network, include a storage means for storing control information by which the information processing apparatus controls the peripheral apparatus through the network. The control information stored in the s

## Step 3. Text pre-processing for patent's abstract.

Now that the abstracts of all 2,500 patents have been extracted, we come to the main stage: text pre-processing.
Its purpose is to turn the raw abstracts into a structured format, so that the output files give the next stage of analysis useful information.

### We use the following steps to pre-process the patents' abstracts:

#### 3.1 Tokenization (sentences > words) and Case normalization
    * sentence tokenization
    * word tokenization (regular expressions paired with RegexpTokenizer split each sentence into tokens)
    * convert all tokens to lower case
#### 3.2 Stemming and Lemmatization
    * try several stemmers and a lemmatizer to reduce the different forms of a word to a common base form, which reduces the complexity of the vocabulary. After comparing the results of each stemmer and the lemmatizer, I decided to use WordNetLemmatizer, as it returns more suitable base forms, while the three stemmers produce some weird word transformations and are therefore not adopted in this text pre-processing process.
#### 3.3 Producing meaningful bigram collocation
    * before removing stop words, use BigramAssocMeasures to score candidate bigrams. After adjusting some filter settings to produce a better set of bigrams, take the 300 best bigrams. Then filter out bigrams that are not meaningful (they contain a stopword) or not logical (the same word twice) to get a final list of 123 meaningful bigrams.
#### 3.4 Removing Stop words
    * once the bigrams are obtained, remove the stop words, which are extremely common words that carry little lexical content. They are usually referred to as function words in linguistics and tell us little about the meaning of the text.
#### 3.5 Tokenization (tokens updated to bigrams)
    * use MWETokenizer to replace the matching unigram pairs in each abstract's token list with the selected bigrams
#### 3.6 Removing the Most and Less Frequent Words
    * as the task requires, the top-20 most frequent words based on word’s document frequency and the words appearing in only one abstract are removed from the token lists before the output files are generated.
-----------------------------------

#### Start:
#### 3.1 Tokenization (sentences > words ) and Case normalization:
###### 3.1-a  sentence tokenization for each abstract

In [5]:
# from nltk.tokenize import sent_tokenize
ab_sent = [sent_tokenize(string) for string in ab]  
ab_sent[0:5]

[[u'A new apple tree named Daligris is disclosed.',
  u'The fruit of the new variety is particularly notable for its eating quality and distinctive flavor and appearance.',
  u'The fruit is very sweet and has a pronounced aniseed flavor, and takes on a distinctive red orange coloration as it ripens on the tree.'],
 [u'A sensing device includes a circuit that compensates for time and spatial changes in temperature.',
  u'The circuit includes elements to correct for variation in permeability of a highly permeable core of a differential variable reluctance transducer as temperature changes.',
  u'The circuit also provides correction for temperature gradients across coils of the transducer.'],
 [u'At least one peripheral processing apparatus and at least one information processing apparatus, interconnected through a network, include a storage means for storing control information by which the information processing apparatus controls the peripheral apparatus through the network.',
  u'The 

###### 3.1-b  word tokenization for each sentence and case normalization for each token when tokenized

Before tokenization, we first look at the contents of the abstracts and decide how to tokenize them, so that we get the first raw tokens for the next pre-processing steps.

The abstracts are extracted to learn about the patent network, that is, the relationships between patents. For this purpose words are important, while figures such as numbers written in digits are not.
So the tokenization strategy is to keep the words and drop all figures in the raw text.
In addition:

a) words joined by hyphens are treated as one word:
* high-torque
* case-packer
* case-blank
* Metal-Oxide-Silicon

b) words joined by a slash are treated as multiple words:
* insert/standard
* picture/booklet

c) words with an apostrophe are split at the apostrophe to obtain the "Subject" word; the characters after the apostrophe become a separate short token (such as s or ll),
   as contractions and possessives carry no extra meaning for the patent citation network task. The "Subject" word itself is still removed later if it is a stopword:
* Contractions: she'll -> she
* Possessives: cat's -> cat
###### The cases a) to c) above are handled by the regular expression: (?:[A-Za-z]+[\-]?[A-Za-z]+[\-]?)+

d) abbreviations are treated as one word:
* U.S.A
* B.B.C
* i.e.
* e.g.
###### The case d) above is handled by the regular expression: (?:[A-Za-z]+[\.]?){2,}
###### The remaining non-digit words are handled by the regular expression: (?:[A-Za-z])+
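
To check the rules a) to d) on a concrete sentence, the sketch below combines the three expressions listed above into one pattern and applies it with the standard library's re module. It is a stand-in for RegexpTokenizer under the assumption that all groups stay non-capturing, so findall returns whole matches; the sample sentence and the helper name tokenize are made up for illustration.

```python
import re

# Pattern assembled from the three expressions described above.
TOKEN_PATTERN = re.compile(
    r"(?:[A-Za-z]+[\-]?[A-Za-z]+[\-]?)+"   # a) to c): words, optionally joined by hyphens
    r"|(?:[A-Za-z]+[\.]?){2,}"             # d): abbreviations such as e.g.
    r"|(?:[A-Za-z])+"                      # any remaining run of letters
)

def tokenize(text):
    """Return the lower-cased tokens of `text`; digits and punctuation are dropped."""
    return [token.lower() for token in TOKEN_PATTERN.findall(text)]

sample = "The cat's high-torque insert/standard unit, e.g. U.S.A 2500."
tokens = tokenize(sample)
print(tokens)
```

On this sample the hyphenated word and the abbreviations stay whole, the slash splits its two words, the number disappears, and the possessive yields cat plus a separate s token.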

In [6]:
# from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"(?:[A-Za-z]+[\/\-]?[A-Za-z]+[\/\-]?)+|(?:[A-Za-z]+[\.]?){2,}|(?:[A-Za-z])+")   # note: this pattern also accepts "/" as a joiner, so slash-joined words stay one token
patents_tokens = []                                                                       
for l in ab_sent:
    temp = []
    for sent in l:
        tokens = tokenizer.tokenize(sent) 
        tokens_lower = [token.lower() for token in tokens]  # lower it 
        temp.extend(tokens_lower) 
    patents_tokens.append(temp)

patents_tokens[0:2]

[[u'a',
  u'new',
  u'apple',
  u'tree',
  u'named',
  u'daligris',
  u'is',
  u'disclosed',
  u'the',
  u'fruit',
  u'of',
  u'the',
  u'new',
  u'variety',
  u'is',
  u'particularly',
  u'notable',
  u'for',
  u'its',
  u'eating',
  u'quality',
  u'and',
  u'distinctive',
  u'flavor',
  u'and',
  u'appearance',
  u'the',
  u'fruit',
  u'is',
  u'very',
  u'sweet',
  u'and',
  u'has',
  u'a',
  u'pronounced',
  u'aniseed',
  u'flavor',
  u'and',
  u'takes',
  u'on',
  u'a',
  u'distinctive',
  u'red',
  u'orange',
  u'coloration',
  u'as',
  u'it',
  u'ripens',
  u'on',
  u'the',
  u'tree'],
 [u'a',
  u'sensing',
  u'device',
  u'includes',
  u'a',
  u'circuit',
  u'that',
  u'compensates',
  u'for',
  u'time',
  u'and',
  u'spatial',
  u'changes',
  u'in',
  u'temperature',
  u'the',
  u'circuit',
  u'includes',
  u'elements',
  u'to',
  u'correct',
  u'for',
  u'variation',
  u'in',
  u'permeability',
  u'of',
  u'a',
  u'highly',
  u'permeable',
  u'core',
  u'of',
  u'a',
  u'diff

###### 3.1-c Get the tokens' statistics

In [7]:
# from itertools import chain
words = list(chain.from_iterable(patents_tokens))
voc = set(words)
lexical_diversity = len(words)/len(voc)
print "Vocabulary size: ",len(voc)
print "Total number of tokens: ", len(words)
print "Lexical diversity: ", lexical_diversity

Vocabulary size:  11360
Total number of tokens:  281176
Lexical diversity:  24.7514084507
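
The ratio printed as lexical diversity is the total number of tokens divided by the vocabulary size, so it reads as the average number of times each distinct word occurs (281176 / 11360 ≈ 24.75 here). Because the same three numbers are printed again after later steps, a small helper could compute them in one place. The sketch below is one hypothetical way to do that; the name token_stats is not part of the original notebook.

```python
from itertools import chain

def token_stats(token_lists):
    """Return (vocabulary size, total tokens, tokens per vocabulary entry)."""
    words = list(chain.from_iterable(token_lists))
    vocabulary = set(words)
    # float() keeps true division on Python 2 as well as Python 3
    ratio = len(words) / float(len(vocabulary))
    return len(vocabulary), len(words), ratio

# Two toy "abstracts": 6 tokens, 3 distinct words
stats = token_stats([["a", "new", "tree"], ["a", "tree", "a"]])
print(stats)
```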


#### 3.2 Stemming and Lemmatization:
After word tokenization, we try stemming and lemmatization to see whether either procedure helps to reduce words to a common base form and so reduce the complexity of the vocabulary.

###### 3.2-a try PorterStemmer:

In [8]:
# from nltk.stem import PorterStemmer             
stemmer = PorterStemmer()
porter_stem = []
for l in patents_tokens:
    temp = ['{0} -> {1}'.format(w, stemmer.stem(w)) for w in l]
    porter_stem.append(temp)
porter_stem[0:2]

[['a -> a',
  'new -> new',
  'apple -> appl',
  'tree -> tree',
  'named -> name',
  'daligris -> daligri',
  'is -> is',
  'disclosed -> disclos',
  'the -> the',
  'fruit -> fruit',
  'of -> of',
  'the -> the',
  'new -> new',
  'variety -> varieti',
  'is -> is',
  'particularly -> particularli',
  'notable -> notabl',
  'for -> for',
  'its -> it',
  'eating -> eat',
  'quality -> qualiti',
  'and -> and',
  'distinctive -> distinct',
  'flavor -> flavor',
  'and -> and',
  'appearance -> appear',
  'the -> the',
  'fruit -> fruit',
  'is -> is',
  'very -> veri',
  'sweet -> sweet',
  'and -> and',
  'has -> ha',
  'a -> a',
  'pronounced -> pronounc',
  'aniseed -> anise',
  'flavor -> flavor',
  'and -> and',
  'takes -> take',
  'on -> on',
  'a -> a',
  'distinctive -> distinct',
  'red -> red',
  'orange -> orang',
  'coloration -> color',
  'as -> as',
  'it -> it',
  'ripens -> ripen',
  'on -> on',
  'the -> the',
  'tree -> tree'],
 ['a -> a',
  'sensing -> sens',
  'de

###### 3.2-b try LancasterStemmer:

In [9]:
# from nltk.stem import LancasterStemmer                        
stemmer = LancasterStemmer()
lancaster_stem = []
for l in patents_tokens:
    temp = ['{0} -> {1}'.format(w, stemmer.stem(w)) for w in l]
    lancaster_stem.append(temp)
lancaster_stem[0:2]

[['a -> a',
  'new -> new',
  'apple -> appl',
  'tree -> tre',
  'named -> nam',
  'daligris -> daligr',
  'is -> is',
  'disclosed -> disclos',
  'the -> the',
  'fruit -> fruit',
  'of -> of',
  'the -> the',
  'new -> new',
  'variety -> vary',
  'is -> is',
  'particularly -> particul',
  'notable -> not',
  'for -> for',
  'its -> it',
  'eating -> eat',
  'quality -> qual',
  'and -> and',
  'distinctive -> distinct',
  'flavor -> flav',
  'and -> and',
  'appearance -> appear',
  'the -> the',
  'fruit -> fruit',
  'is -> is',
  'very -> very',
  'sweet -> sweet',
  'and -> and',
  'has -> has',
  'a -> a',
  'pronounced -> pronount',
  'aniseed -> anisee',
  'flavor -> flav',
  'and -> and',
  'takes -> tak',
  'on -> on',
  'a -> a',
  'distinctive -> distinct',
  'red -> red',
  'orange -> orang',
  'coloration -> col',
  'as -> as',
  'it -> it',
  'ripens -> rip',
  'on -> on',
  'the -> the',
  'tree -> tre'],
 ['a -> a',
  'sensing -> sens',
  'device -> dev',
  'include

###### 3.2-c try SnowballStemmer:

In [10]:
# from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer('english')
snowball_stem = []
for l in patents_tokens:
    temp = ['{0} -> {1}'.format(w, stemmer.stem(w)) for w in l]
    snowball_stem.append(temp)
snowball_stem[0:2]

[['a -> a',
  'new -> new',
  'apple -> appl',
  'tree -> tree',
  'named -> name',
  'daligris -> daligri',
  'is -> is',
  'disclosed -> disclos',
  'the -> the',
  'fruit -> fruit',
  'of -> of',
  'the -> the',
  'new -> new',
  'variety -> varieti',
  'is -> is',
  'particularly -> particular',
  'notable -> notabl',
  'for -> for',
  'its -> it',
  'eating -> eat',
  'quality -> qualiti',
  'and -> and',
  'distinctive -> distinct',
  'flavor -> flavor',
  'and -> and',
  'appearance -> appear',
  'the -> the',
  'fruit -> fruit',
  'is -> is',
  'very -> veri',
  'sweet -> sweet',
  'and -> and',
  'has -> has',
  'a -> a',
  'pronounced -> pronounc',
  'aniseed -> anise',
  'flavor -> flavor',
  'and -> and',
  'takes -> take',
  'on -> on',
  'a -> a',
  'distinctive -> distinct',
  'red -> red',
  'orange -> orang',
  'coloration -> color',
  'as -> as',
  'it -> it',
  'ripens -> ripen',
  'on -> on',
  'the -> the',
  'tree -> tree'],
 ['a -> a',
  'sensing -> sens',
  'dev

###### 3.2-d try WordNetLemmatizer:

In [12]:
# Get pos tags to all tokens
# import nltk
patents_tokens_pos = []
for l in patents_tokens:
    temp = nltk.pos_tag(l)
    patents_tokens_pos.append (temp)

# Function to map the Treebank POS tags to the tags that WordNetLemmatizer recognizes during lemmatization
from nltk.corpus import wordnet
def get_wordnet_pos(treebank_tag):

    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
    
#Do Lemmatization
# from nltk.stem import WordNetLemmatizer       
lemmatizer = WordNetLemmatizer()
wordNet_lemma = []
for l in patents_tokens_pos:
    temp = ['{0} -> {1}'.format(w[0], lemmatizer.lemmatize(w[0], get_wordnet_pos(w[1]))) for w in l]
    wordNet_lemma.append(temp)
wordNet_lemma[0:2]

[['a -> a',
  'new -> new',
  'apple -> apple',
  'tree -> tree',
  'named -> name',
  'daligris -> daligris',
  'is -> be',
  'disclosed -> disclose',
  'the -> the',
  'fruit -> fruit',
  'of -> of',
  'the -> the',
  'new -> new',
  'variety -> variety',
  'is -> be',
  'particularly -> particularly',
  'notable -> notable',
  'for -> for',
  'its -> it',
  'eating -> eating',
  'quality -> quality',
  'and -> and',
  'distinctive -> distinctive',
  'flavor -> flavor',
  'and -> and',
  'appearance -> appearance',
  'the -> the',
  'fruit -> fruit',
  'is -> be',
  'very -> very',
  'sweet -> sweet',
  'and -> and',
  'has -> have',
  'a -> a',
  'pronounced -> pronounce',
  'aniseed -> aniseed',
  'flavor -> flavor',
  'and -> and',
  'takes -> take',
  'on -> on',
  'a -> a',
  'distinctive -> distinctive',
  'red -> red',
  'orange -> orange',
  'coloration -> coloration',
  'as -> a',
  'it -> it',
  'ripens -> ripen',
  'on -> on',
  'the -> the',
  'tree -> tree'],
 ['a -> a',

###### 3.2-e 
###### Comparing the results of all stemmers and the lemmatizer, I found that WordNetLemmatizer does a better job of normalizing words, for example:
* named -> name
* pronounced -> pronounce
* means -> mean
* receives -> receive

###### while all three stemmers return some weird transformations of the words, for example:
* changes -> chang
* permeability -> permeabl
* variable -> variabl,
* reluctance -> reluct,
* transducer -> transduc, and so on. 

###### which does not help to bring words to a readable common base form that reduces the complexity of the vocabulary.
###### So I chose to lemmatize with WordNetLemmatizer and not to use stemming.

In [13]:
patents_tokens_lemma = []

for l in patents_tokens_pos:
    temp = [lemmatizer.lemmatize(w[0], get_wordnet_pos(w[1])) for w in l]
    patents_tokens_lemma.append(temp)
    
patents_tokens_lemma[0:2]

[[u'a',
  u'new',
  u'apple',
  u'tree',
  u'name',
  u'daligris',
  u'be',
  u'disclose',
  u'the',
  u'fruit',
  u'of',
  u'the',
  u'new',
  u'variety',
  u'be',
  u'particularly',
  u'notable',
  u'for',
  u'it',
  u'eating',
  u'quality',
  u'and',
  u'distinctive',
  u'flavor',
  u'and',
  u'appearance',
  u'the',
  u'fruit',
  u'be',
  u'very',
  u'sweet',
  u'and',
  u'have',
  u'a',
  u'pronounce',
  u'aniseed',
  u'flavor',
  u'and',
  u'take',
  u'on',
  u'a',
  u'distinctive',
  u'red',
  u'orange',
  u'coloration',
  u'a',
  u'it',
  u'ripen',
  u'on',
  u'the',
  u'tree'],
 [u'a',
  u'sensing',
  u'device',
  u'include',
  u'a',
  u'circuit',
  u'that',
  u'compensate',
  u'for',
  u'time',
  u'and',
  u'spatial',
  u'change',
  u'in',
  u'temperature',
  u'the',
  u'circuit',
  u'include',
  u'element',
  u'to',
  u'correct',
  u'for',
  u'variation',
  u'in',
  u'permeability',
  u'of',
  u'a',
  u'highly',
  u'permeable',
  u'core',
  u'of',
  u'a',
  u'differential',


###### 3.2-f Get the tokens' statistics after lemmatization

In [14]:
# from itertools import chain
words_1 = list(chain.from_iterable(patents_tokens_lemma))
voc_1 = set(words_1)
lexical_diversity_1 = len(words_1)/len(voc_1)
print "Vocabulary size: ",len(voc_1)
print "Total number of tokens: ", len(words_1)
print "Lexical diversity: ", lexical_diversity_1

Vocabulary size:  9474
Total number of tokens:  281176
Lexical diversity:  29.6786995989


###### 3.2-g Have a quick look at the word frequency distribution: the most common words mostly belong to the stopwords

In [15]:
fd = FreqDist(words_1)            
fd.most_common(20)

[(u'the', 25766),
 (u'a', 19750),
 (u'of', 10129),
 (u'and', 8793),
 (u'be', 8096),
 (u'to', 7481),
 (u'in', 4134),
 (u'an', 4046),
 (u'for', 2986),
 (u'first', 2658),
 (u'include', 2325),
 (u'with', 2143),
 (u'second', 2110),
 (u'have', 1940),
 (u'that', 1845),
 (u'at', 1748),
 (u'on', 1731),
 (u'from', 1674),
 (u'by', 1581),
 (u'one', 1539)]

###### 3.2-h Exclude the stopwords and look at the word frequency distribution again:
The word "include", which appears 2325 times, is now the most common word. However, as a verb it does not
provide any meaningful information for the text analysis of the patent citation network either. So I will remove it by treating it as a stopword.

In [17]:
stopwords = []
with open('./stopwords_en.txt') as f:
    stopwords = f.read().splitlines()
stopwordsSet = set(stopwords)

words_WithOutStopwords = [item for item in words_1 if item not in stopwordsSet ] 
fd_1 = FreqDist(words_WithOutStopwords)            
fd_1.most_common(20)

[(u'include', 2325),
 (u'device', 1461),
 (u'portion', 1346),
 (u'provide', 1273),
 (u'signal', 1119),
 (u'system', 1088),
 (u'form', 1025),
 (u'surface', 997),
 (u'data', 972),
 (u'method', 959),
 (u'control', 929),
 (u'member', 889),
 (u'unit', 886),
 (u'layer', 867),
 (u'position', 811),
 (u'plurality', 805),
 (u'end', 740),
 (u'circuit', 704),
 (u'connect', 695),
 (u'base', 676)]

###### 3.2-i Add the word "include" to the stopword list.

In [18]:
stopwordsSet.add("include") 

#### 3.3 Producing meaningful bigram collocation

I have obtained the abstracts' tokens from the tokenization process and reduced them to a common base form by lemmatization.
In step 3.3, I use collocation measures to find meaningful bigrams in this unigram token list.
###### 3.3-a Look for suitable bigrams:

In [20]:
# import nltk
# from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words_1, window_size = 10)  
finder.apply_freq_filter(4)   # drop bigrams that appear fewer than 4 times
finder.apply_word_filter(lambda w: len(w) < 4)   # drop bigrams containing a word shorter than 4 characters
bigram = finder.nbest(bigram_measures.likelihood_ratio, 300) 
bigram[0:10]

[(u'first', u'second'),
 (u'present', u'invention'),
 (u'layer', u'layer'),
 (u'signal', u'signal'),
 (u'portion', u'portion'),
 (u'memory', u'cell'),
 (u'data', u'data'),
 (u'member', u'member'),
 (u'image', u'image'),
 (u'region', u'region')]

###### 3.3-b 
###### It returns some good, suitable bigrams that can be used to update the token lists, such as:
* memory cell
* semiconductor substrate
* laser beam
* circuit board
* wind turbine

###### However, it also returns bigrams that are not meaningful, because they contain a stopword or pair a word with itself, such as:
* first second
* such that
* great than
* data data
* valve valve
* heat heat
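
A likely reason for pairs such as data data is the window_size = 10 setting: with a window larger than 2, BigramCollocationFinder.from_words counts a word together with each of the words that follow it inside the window, not only with its direct neighbour. The sketch below imitates that counting with the standard library only, to show how a wider window produces same-word pairs; windowed_pairs is a hypothetical helper, not an NLTK function, and the token list is made up.

```python
from collections import Counter

def windowed_pairs(tokens, window_size):
    """Count ordered pairs (w1, w2) where w2 follows w1 within the window."""
    counts = Counter()
    for i, w1 in enumerate(tokens):
        # a window of size n pairs w1 with the n - 1 tokens after it
        for w2 in tokens[i + 1:i + window_size]:
            counts[(w1, w2)] += 1
    return counts

tokens = "the first layer cover the second layer".split()
adjacent = windowed_pairs(tokens, 2)    # direct neighbours only
wide = windowed_pairs(tokens, 10)       # same window size as in the cell above

print(adjacent[("layer", "layer")])
print(wide[("layer", "layer")])
```

With direct neighbours only, the pair (layer, layer) never occurs in this sample, while the wide window counts it once.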

###### Remove all unwanted bigrams and retain only the meaningful ones.

In [21]:
# remove bigrams that contain a stopword, as they are not meaningful
bigram_2 = [item for item in bigram if item[0] not in stopwordsSet and item[1] not in stopwordsSet] 

# remove bigrams that pair a word with itself, as they are not meaningful
bigram_fn = [item for item in bigram_2 if item[0] != item[1]] 

print len(bigram_fn)
bigram_fn[0:10]

123


[(u'present', u'invention'),
 (u'memory', u'cell'),
 (u'light', u'emit'),
 (u'light', u'source'),
 (u'semiconductor', u'substrate'),
 (u'invention', u'relate'),
 (u'power', u'supply'),
 (u'liquid', u'crystal'),
 (u'integrate', u'circuit'),
 (u'electrically', u'connect')]

###### 3.3-c Build a vocabulary list that includes the 123 newly added bigrams:

In [22]:
voc_with_bigram = list(voc_1)
voc_with_bigram += bigram_fn   # append the bigram tuples without rebinding the name bigram
voc_with_bigram[0:10]

[u'four',
 u'prefix',
 u'circuitry',
 u'hanging',
 u'backend-tier',
 u'semi-hermetic',
 u'localized',
 u'hingedly',
 u'electricity',
 u'originality']

#### 3.4 Removing Stop words

###### Tokens that belong to the stopwords are removed from the lemmatized token lists

In [23]:
patents_tokens_lemma_mvStopwords = []
for l in patents_tokens_lemma:
    temp = [token for token in l if token not in stopwordsSet]
    patents_tokens_lemma_mvStopwords.append(temp)

#### 3.5 Tokenization (tokens updated to bigrams)
###### Use MWETokenizer to merge the word pairs selected in the bigram list into single bigram tokens in the token lists

In [25]:
# from nltk.tokenize import MWETokenizer
mwe_tokenizer = MWETokenizer(bigram_fn)   # pass only the bigram tuples; a plain string would be read as a sequence of characters
patents_tokens_lemma_mvStopwords_2 =[mwe_tokenizer.tokenize(tokens) for tokens in patents_tokens_lemma_mvStopwords] 
patents_tokens_lemma_mvStopwords_2[0:2]

[[u'apple',
  u'tree',
  u'daligris',
  u'disclose',
  u'fruit',
  u'variety',
  u'notable',
  u'eating',
  u'quality',
  u'distinctive',
  u'flavor',
  u'appearance',
  u'fruit',
  u'sweet',
  u'pronounce',
  u'aniseed',
  u'flavor',
  u'distinctive',
  u'red',
  u'orange',
  u'coloration',
  u'ripen',
  u'tree'],
 [u'sensing',
  u'device',
  u'circuit',
  u'compensate',
  u'time',
  u'spatial',
  u'change',
  u'temperature',
  u'circuit',
  u'element',
  u'correct',
  u'variation',
  u'permeability',
  u'highly',
  u'permeable',
  u'core',
  u'differential',
  u'variable',
  u'reluctance',
  u'transducer',
  u'temperature',
  u'change',
  u'circuit',
  u'provide',
  u'correction',
  u'temperature',
  u'gradient',
  u'coil',
  u'transducer']]
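
To make the effect of this step concrete, the sketch below merges adjacent word pairs in plain Python. It is a simplified stand-in for MWETokenizer, assuming the "_" separator that appears in merged tokens such as transit_tier; merge_bigrams and the sample tokens are illustrative only.

```python
def merge_bigrams(tokens, bigrams, separator="_"):
    """Replace each adjacent pair found in `bigrams` by one joined token."""
    bigram_set = set(bigrams)
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigram_set:
            merged.append(tokens[i] + separator + tokens[i + 1])
            i += 2          # both words are consumed by the bigram
        else:
            merged.append(tokens[i])
            i += 1
    return merged

example = merge_bigrams(["memory", "cell", "store", "data"], [("memory", "cell")])
print(example)
```

Only directly adjacent words are merged, so a pair that merely co-occurs within a wider window leaves the token list unchanged.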

#### 3.6 Removing the Most and Less Frequent Words
 * top-20 most frequent words based on word’s document frequency
 * words only appearing in one abstract as required
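
Document frequency counts each word once per abstract, however often it occurs inside that abstract. As a minimal sketch under that definition, the snippet below derives both removal lists from a document-frequency table built with the standard library; the helper name document_frequency and the toy documents are assumptions for illustration.

```python
from collections import Counter

def document_frequency(docs):
    """Count, for every word, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))   # set() so a word counts once per document
    return df

docs = [["tree", "fruit", "fruit"], ["circuit", "tree"], ["circuit", "coil"]]
df = document_frequency(docs)

top_words = [w for w, _ in df.most_common(2)]            # most widespread words
single_doc_words = sorted(w for w, n in df.items() if n == 1)

print(top_words)
print(single_doc_words)
```

Here fruit occurs twice overall but in one document only, so it lands in the single-document list rather than among the most frequent words.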

###### 3.6-a Get the list of the top-20 most frequent words (the cell below ranks words by their total number of occurrences across all abstracts)

In [26]:
# from itertools import chain
words_2 = list(chain.from_iterable(patents_tokens_lemma_mvStopwords_2))
fd_2 = FreqDist(words_2)            
top_20_frequent_words= [ w[0] for w in fd_2.most_common(20) ]
top_20_frequent_words[0:10]

[u'portion',
 u'provide',
 u'device',
 u'system',
 u'form',
 u'surface',
 u'data',
 u'signal',
 u'member',
 u'position']

###### 3.6-b Get the list of words appearing in only one abstract

In [28]:
patents_tokens_lemma_mvStopwords_2_set = list(chain.from_iterable([set(value) for value in patents_tokens_lemma_mvStopwords_2]))
fd_3 = FreqDist(patents_tokens_lemma_mvStopwords_2_set)
word_only_appear_in_one = [ w for w in fd_3.hapaxes()]
word_only_appear_in_one[0:10]

[u'transit_tier',
 u'backend-tier',
 u'semi-hermetic',
 u'localized',
 u'originality',
 u'image-side',
 u'crossbar',
 u'sputter',
 u'herbicide',
 u'collate']

###### 3.6-c Remove from the patents' token lists all words in the two removal lists "top_20_frequent_words" and "word_only_appear_in_one"

In [29]:
top_20_frequent_words = set(top_20_frequent_words)
word_only_appear_in_one = set(word_only_appear_in_one)

patents_tokens_fn = []
for l in patents_tokens_lemma_mvStopwords_2 :
    temp = [token for token in l if token not in top_20_frequent_words and token not in word_only_appear_in_one]
    patents_tokens_fn.append (temp)  
    
patents_tokens_fn[0:2]

[[u'tree',
  u'disclose',
  u'fruit',
  u'variety',
  u'eating',
  u'quality',
  u'flavor',
  u'appearance',
  u'fruit',
  u'flavor',
  u'red',
  u'coloration',
  u'tree'],
 [u'sensing',
  u'circuit',
  u'compensate',
  u'time',
  u'spatial',
  u'change',
  u'temperature',
  u'circuit',
  u'element',
  u'correct',
  u'variation',
  u'highly',
  u'permeable',
  u'core',
  u'differential',
  u'variable',
  u'transducer',
  u'temperature',
  u'change',
  u'circuit',
  u'correction',
  u'temperature',
  u'gradient',
  u'coil',
  u'transducer']]

###### 3.6-d Get the tokens' statistics after all removals

In [30]:
# from itertools import chain
words_fn = list(chain.from_iterable(patents_tokens_fn))
voc_fn = set(words_fn)
lexical_diversity_fn = len(words_fn)/len(voc_fn)
print "Vocabulary size: ",len(voc_fn)
print "Total number of tokens: ", len(words_fn)
print "Lexical diversity: ", lexical_diversity_fn

Vocabulary size:  4436
Total number of tokens:  116868
Lexical diversity:  26.3453561767


## Step 4. Data formatting for the collected data and output to file.

##### 4.1 The final token lists now contain only the words we want to retain. Combine each token list with its patent ID in a dictionary:

In [33]:
patent_ID_tokens = { patents_id[i]: patents_tokens_fn[i] for i in range (0,len(patents_tokens_fn))}
#len(patent_ID_tokens)            

#####  4.2 Generate the sparse count vectors and write them to the file “count_vectors.txt”, where each row corresponds to a patent’s abstract, starting with patent_id, followed by “word_index:count” pairs.

In [27]:
voc_fn = sorted(voc_fn)
word_index = {w: i for i, w in enumerate(voc_fn)}   # word -> index, avoids a linear search per token

with open("count_vectors.txt", "w") as output_file:   # the file is closed automatically
    for key in patent_ID_tokens.keys():
        output_file.write(key+":")
        d_idx = [word_index[w] for w in patent_ID_tokens[key]]
        for k, v in FreqDist(d_idx).iteritems():
            output_file.write("{}:{} ".format(k,v))
        output_file.write('\n')
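
As an illustration of the resulting line format, the sketch below builds one such line for a toy abstract with the standard library's Counter. The helper name format_count_vector and the three-word vocabulary are hypothetical; unlike the cell above, the pairs are sorted by index and no trailing space is written, which is a choice made here for readability.

```python
from collections import Counter

def format_count_vector(patent_id, tokens, word_index):
    """Build one output line: patent_id, then 'word_index:count' pairs."""
    counts = Counter(word_index[w] for w in tokens)
    pairs = " ".join("{}:{}".format(i, c) for i, c in sorted(counts.items()))
    return "{}:{}".format(patent_id, pairs)

vocab = sorted({"tree", "fruit", "flavor"})          # flavor=0, fruit=1, tree=2
word_index = {w: i for i, w in enumerate(vocab)}
line = format_count_vector("PP021722", ["tree", "fruit", "fruit", "tree", "flavor"], word_index)
print(line)
```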

#####  4.3 Write the vocabulary list to the file “vocab.txt”, where each row corresponds to one vocabulary entry, starting with word_index followed by the word:

In [28]:
with open("vocab.txt", "w") as output_file:   # the file is closed automatically
    for i in range (0, len(voc_fn)):
        output_file.write(str(i) + " : " + voc_fn[i] + "\n")
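
For the cross-reference mentioned at the start, a reader of the two files needs to map each word_index back to its word. The sketch below parses lines in the "index : word" layout written above into a dictionary; parse_vocab_lines is a hypothetical helper and the three sample lines are made up.

```python
def parse_vocab_lines(lines):
    """Turn 'index : word' lines into a {index: word} dictionary."""
    index_to_word = {}
    for line in lines:
        # split on the first " : " only, in case the word itself contains one
        index, word = line.rstrip("\n").split(" : ", 1)
        index_to_word[int(index)] = word
    return index_to_word

lookup = parse_vocab_lines(["0 : flavor\n", "1 : fruit\n", "2 : tree\n"])
print(lookup[2])
```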

## Conclusion after text preprocessing task

After finishing the text preprocessing task, we have not only extracted the data we need, but also removed words that are not useful by applying several text preprocessing strategies and techniques. The final token list is smaller, with normalized case and word forms. The final tokens stored in the output files can then be used by a data analyst to perform analysis and draw information from the pre-processed data.

The output below shows the token and vocabulary counts at the different stages. The total number of tokens in the document was reduced from 281176 to 116868, while the vocabulary size was reduced from 11360 to 4436.

In [29]:
print "Original tokens from abstracts: "
print "Vocabulary size: ",len(voc)
print "Total number of tokens: ", len(words)
print "Lexical diversity: ", lexical_diversity , "\n"

print "Tokens after Lemmatization: "
print "Vocabulary size: ",len(voc_1)
print "Total number of tokens: ", len(words_1)
print "Lexical diversity: ", lexical_diversity_1, "\n"

print "Final tokens after text pre-processing: "
print "Vocabulary size: ",len(voc_fn)
print "Total number of tokens: ", len(words_fn)
print "Lexical diversity: ", lexical_diversity_fn

Original tokens from abstracts: 
Vocabulary size:  11360
Total number of tokens:  281176
Lexical diversity:  24.7514084507 

Tokens after Lemmatization: 
Vocabulary size:  9474
Total number of tokens:  281176
Lexical diversity:  29.6786995989 

Final tokens after text pre-processing: 
Vocabulary size:  4436
Total number of tokens:  116868
Lexical diversity:  26.3453561767


### Task 4 end.