# FIT5196 Data Wrangling - S2 2016
 
## Assessment 4 - Text Preprocessing
 
Filename: Patents_data_text_preprocessing.ipynb
 
Author: Lynn Miller
 
Date: 16-Oct-2016
 
Version: 1.0
 
Language: Python 2.7.12 and Jupyter notebook Anaconda 2 version 4.1.1 

Libraries used:
- `bs4 (version 4.4.1)`: beautifulSoup - for xml parsing (Richardson, 2015)
- `numpy (version 1.11.1)`: for numpy arrays (Walt, Colbert, & Varoquaux, 2011)
- `os`: listdir - read directory contents
- `nltk (version 3.2.1)`: Natural Language ToolKit (Bird, Loper, & Klein, 2009)
- `__future__`: division - ensures float returned from integer division operations
- `itertools`: chain - merge multiple dictionary values into one list
- `sklearn (version 0.17.1)`: vectorizer functions to create count and tf-idf vectors (Pedregosa et al., 2011) 

Input files:
- All file in directory patents/100/ - patents data

Output files:
- section_labels.txt - The patent section labels for each patent
- binary_vectors.txt - Binary feature vector - includes stopwords
- count_vectors.txt - Count feature vector - includes stopwords
- tf_idf_vectors.txt - TF-IDF weighted feature vector - includes stopwords
- binary_vectors_2.txt - Binary feature vector - excludes stopwords
- count_vectors_2.txt - Count feature vector - excludes stopwords
- tf_idf_vectors_2.txt - TF-IDF weighted feature vector - excludes stopwords
 
## Introduction

This code is developed to pre-process the text in the patent files:
 
**Task 1: Parsing XML Files -** read each patent file in the patents/100 sub-directories and processes each patent file to:
- Extract the patent IDs
- Extract the IPC section labels and generate the section_labels.txt file
- Extract the abstract, description and claims text fields and combine into a single document for each patent

**Task 2: Tokenise text and generate vocabulary -** the steps for this task are:
- Write a tokeniser regular expression and tokenise the patent documents
- Generate bigram and trigram collocations
- Retokenise the patent documents using the bigrams and trigrams

**Task 3: Generate and save feature vectors -** the steps for task 3 are:
- Generate binary, count and tf-idf weighted feature vectors
- Save the feature vectors in sparse format
- Remove stop words from the tokenised documents
- Re-generate and re-save the generated festure vectors

**Task 4: Run the SVM classifier**

The results of running the SVM classifier on all the feature vector files are summarised in the final cell of the notebook.

## Resources Used

- Online Python documentation - (Python Software Foundation, 2016) 
- Online NLTK documentation - (NLTK Project, 2014)
- Online ScikitLearn documentation - (Scikit Learn, 2016)
- Online BeautifulSoup documentation - (Richardson, 2015)
- Module 6 lecture and tutorial notebooks for code examples - (Du, 2016)

Import required libraries

In [1]:
from bs4 import BeautifulSoup
from os import listdir
import numpy as np

import nltk
from nltk.tokenize.regexp import RegexpTokenizer
from nltk.probability import *
from nltk.util import ngrams
from nltk.tokenize import MWETokenizer

from __future__ import division
from itertools import chain

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

## Parse XML Files

- Read each file in each sub-directory of the patents/100 directory and format into an XML tree using BeautifulSoup
- From each file extract:
 - the document id (doc-number in the publication-reference tag)
 - the patent section label (section in the classifications-ipcr tag)
 - the text components - abstract, description and claim
- The document id and section label for each document are written to the "section_labels.txt" file
- The abstract, description and claim for each patent are concatenated and stored in a dictionary indexed by the document id

In [2]:
# path to patent files
patentPath = "patents/100/"

# File for patent section labels
labelFile = "section_labels.txt"
labelFH = open(labelFile,"w")

# Document dictionary 
text = {}

# For each directory ...
for patentDir in listdir(patentPath):
    # ... and file ...
    for patentFile in listdir(patentPath + patentDir):
        # Read the file
        patentFH = open(patentPath + patentDir + "/" + patentFile)
        patentData = patentFH.read()
        patentFH.close()
        # Format into an XML tree
        xmldata = BeautifulSoup(patentData, "lxml")
        # Extract the document ID
        prTag = xmldata.find("publication-reference")
        patentNum = prTag.find("doc-number").decode_contents()
        # Extract the section label
        classTag = xmldata.find("classifications-ipcr")
        sectionTag = classTag.find("section")
        labelFH.write(patentNum + "," + sectionTag.decode_contents() + "\n")
        # Extract the text components    
        abstract = xmldata.find("abstract").get_text()
        description = xmldata.find("description").get_text()
        claim = xmldata.find("claim").get_text()
        # Add document to the dictionary
        text[patentNum] = abstract + description + claim
# close the section labels file        
labelFH.close()        

# Check a document
text["07640887"]

u'\nThe present invention provides a surface wave excitation plasma generator in which surface wave excitation plasma is efficiently generated. A surface wave excitation plasma generator including an annular waveguide and a dielectric tube is provided. The annular waveguide 2 includes an inlet 2a for introducing a microwave M, an end plate 2b for reflecting the microwave M introduced and propagating within the waveguide, and a bottom plate 2c on which slot antennas 2d are formed at a predetermined interval. When the wavelength of the microwave M in the waveguide is \u03bbg, the length from position b through positions c, d, e to position f on the bottom plate 2c, i.e. the circumferential length (\u03c0\xd7D1) of the annular waveguide, is set as 2 \u03bbg, and the positions b, c, d, e, f are spaced apart at an interval of \u03bbg/2. Since the slot antennas 2d are arranged at two positions c and e, the interval between these two slot antennas 2d is equal to the wavelength \u03bbg in the 

## Tokenise Text and Generate Vocabulary

### Tokenise the Text

#### Define a regular expression tokeniser

The tokeniser is built to recognise the following tokens based on common patterns seen in the patent text:
- Dates, which are specified in Mmm. dd, yyyy format (e.g. Oct. 16, 2016) with optional punctuation
- References to figures (e.g. FIG. 2, FIGS. 1-6, FIGS. 2 and 3)
- References to pages (e.g. Page 2, pages 1-6, pages 2 and 3)
- Strings of numbers separated by white space (e.g. 123 456 789)
- Alpha-numberic strings with embedded or trailing "-.,/='" characters. A trailing "." is treated as a full-stop and not included in the token if it is followed by white space and a capital letter. This isn't 100% accurate, but seems to do a good enough job.

In [3]:
# Define the tokeniser
tokenizer = RegexpTokenizer(r"""(?x)
    (?:[A-Z][a-z]{2}\.?\s[0-9]{1,2},\s[0-9]{4}) |               # Dates formatted as Mmm. dd, yyyy
    (?:FIGS?.\s+\d+\a*(?:[\-˜]\d+\a*)?(?:\sand\s\d+\a*)?) |     # FIG. n, FIGS. n-m, FIGS. n and m 
    (?:[Pp]ages?\s+\d+\a*(?:[\-˜]\d+\a*)?(?:\sand\s\d+\a*)?) |  # Page n, Pages n-m, Pages n and m 
    (?:\d+(?:\s\d+)+) |                                         # Numbers separated by white space
    (?:\w+(?:[-\.,/=\']\w+)*(?:\.(?:!\s[A-Z]))?)                # Alpha-numeric strings and embedded -.,/=' characters.
        # Include a trailing "." - unless it is followed by whitespace and a capital letter (assume this is a full stop). 
    """)

# Check the tokeniser results on a document
print tokenizer.tokenize(text["07640887"])

[u'The', u'present', u'invention', u'provides', u'a', u'surface', u'wave', u'excitation', u'plasma', u'generator', u'in', u'which', u'surface', u'wave', u'excitation', u'plasma', u'is', u'efficiently', u'generated', u'A', u'surface', u'wave', u'excitation', u'plasma', u'generator', u'including', u'an', u'annular', u'waveguide', u'and', u'a', u'dielectric', u'tube', u'is', u'provided', u'The', u'annular', u'waveguide', u'2', u'includes', u'an', u'inlet', u'2a', u'for', u'introducing', u'a', u'microwave', u'M', u'an', u'end', u'plate', u'2b', u'for', u'reflecting', u'the', u'microwave', u'M', u'introduced', u'and', u'propagating', u'within', u'the', u'waveguide', u'and', u'a', u'bottom', u'plate', u'2c', u'on', u'which', u'slot', u'antennas', u'2d', u'are', u'formed', u'at', u'a', u'predetermined', u'interval', u'When', u'the', u'wavelength', u'of', u'the', u'microwave', u'M', u'in', u'the', u'waveguide', u'is', u'\u03bbg', u'the', u'length', u'from', u'position', u'b', u'through', u'pos

Define the tokenizer function

In [4]:
def tokenizeRawData(patentNum):
    """
        This function tokenizes the raw patent text by:
        1. Creating a token list using the tokeniser developed above
        2. Filtering out tokens with non-alphanumeric characters or only contain digits
        3. Converting tokens to lower case
        
        Parameters:
            patentNum - the dictionary key of the patent document to be processed
        
        Returns a tuple containing:
            patentNum - (unchanged)
            tokenisedPatent - the retokenised patent document
    """
    # tokenize the patent document
    tokenList = tokenizer.tokenize(text[patentNum]) 
    # exclude tokens with non-alphanumeric characters or only contain digits, convert tokens to lower case
    tokenisedPatent = [word.lower() for word in tokenList if (word.isalnum() & (not word.isdigit()))]
    # return the tokenised patent document
    return (patentNum, tokenisedPatent) 

#### Tokenize the patent documents

In [5]:
# Tokenise all the patent documents
tokenisedText = dict(tokenizeRawData(patentNum) for patentNum in text.iterkeys())

# Check a document
print tokenisedText["07640887"]

[u'the', u'present', u'invention', u'provides', u'a', u'surface', u'wave', u'excitation', u'plasma', u'generator', u'in', u'which', u'surface', u'wave', u'excitation', u'plasma', u'is', u'efficiently', u'generated', u'a', u'surface', u'wave', u'excitation', u'plasma', u'generator', u'including', u'an', u'annular', u'waveguide', u'and', u'a', u'dielectric', u'tube', u'is', u'provided', u'the', u'annular', u'waveguide', u'includes', u'an', u'inlet', u'2a', u'for', u'introducing', u'a', u'microwave', u'm', u'an', u'end', u'plate', u'2b', u'for', u'reflecting', u'the', u'microwave', u'm', u'introduced', u'and', u'propagating', u'within', u'the', u'waveguide', u'and', u'a', u'bottom', u'plate', u'2c', u'on', u'which', u'slot', u'antennas', u'2d', u'are', u'formed', u'at', u'a', u'predetermined', u'interval', u'when', u'the', u'wavelength', u'of', u'the', u'microwave', u'm', u'in', u'the', u'waveguide', u'is', u'\u03bbg', u'the', u'length', u'from', u'position', u'b', u'through', u'positions

#### Generate the vocabulary and display vocabulary statistics

Define a function to display vocabulary statistics

In [6]:
def checkVocabStats(patentDict):
    """
        This function prints vocabulary statistics from the provided word list
        using the code provided in the module 6 lecture notebooks (Du, 2016).
                
        Parameters:
            patentDict - the dictionary of tokenised patent documents
        
        Returns:
            The contenated list of words/tokens from all patent documents
    """
    words = list(chain.from_iterable(patentDict.values()))
    vocab = set(words)
    lexical_diversity = len(words)/len(vocab)

    print "Vocabulary size: ",len(vocab)
    print "Total number of tokens: ", len(words)
    print "Lexical diversity: ", lexical_diversity
    print "Total number of articles:", len(patentDict)
    lens = [len(value) for value in patentDict.values()]
    print "Average document length:", np.mean(lens)
    print "Maximum document length:", np.max(lens)
    print "Minimum document length:", np.min(lens)
    print "Standard deviation of document length:", np.std(lens)
    
    return (words)

Display the vocabulary statistics after tokenising the documents

In [7]:
words = checkVocabStats(tokenisedText)

Vocabulary size:  33427
Total number of tokens:  3800603
Lexical diversity:  113.698596943
Total number of articles: 800
Average document length: 4750.75375
Maximum document length: 58050
Minimum document length: 238
Standard deviation of document length: 4294.71059131


### Generate Bigram and Trigram Collocations

The nltk collocations methods are used to generate bigram and trigram collocations. To avoid the issue of generating bigram collocations that are also part of trigram collocations, the trigram collocations are generated first, then bigram collocations.

#### Trigram Collocations

Generate the trigram collocations and filter out meaningless trigrams. Filtering on the following criteria removed most meaningless trigrams:
- occur less than five times
- have repeated words. If all three words are the same, it is probably not meaningful. If two words are the same, it is probably better to process the single word and one occurrence of the repeated word as a bigram.
- contain non-alphabetic characters

Then extract the 100 best trigram collocations based on the pmi measure.

In [8]:
# Get the trigram measures
trigram_measures = nltk.collocations.TrigramAssocMeasures()

# Create the trigram collocation finder
finder = nltk.collocations.TrigramCollocationFinder.from_words(words)

# Filter out trigram collocations that occur less than five times
finder.apply_freq_filter(5)

# Filter out trigram collocations with repeated words
finder.apply_ngram_filter(lambda w1, w2, w3: ((w1 == w2) | (w1 == w3) | (w2 == w3)))

# filter out trigram collocations containing non-alpha characters
finder.apply_ngram_filter(lambda w1, w2, w3: (not (w1.isalpha() & w2.isalpha() & w3.isalpha())))

# Extract the 100 best trigram collocations
trigrams = finder.nbest(trigram_measures.pmi, 100)

# Display the trigram collocations
trigrams

[(u'ddatp', u'ddgtp', u'ddctp'),
 (u'riken', u'wako', u'saitama'),
 (u'boosters', u'bleaches', u'alkalinity'),
 (u'hill', u'englewood', u'cliffs'),
 (u'prentice', u'hill', u'englewood'),
 (u'etsi', u'scp', u'ts'),
 (u'perborate', u'monohydrate', u'tetraacetyl'),
 (u'mesocarbon', u'microbeads', u'mcmb'),
 (u'dactylopius', u'coccus', u'costa'),
 (u'graphitized', u'mesocarbon', u'microbeads'),
 (u'higashi', u'tsukuba', u'ibaraki'),
 (u'toray', u'dow', u'corning'),
 (u'los', u'angeles', u'calif'),
 (u'mgo', u'cao', u'sro'),
 (u'coccus', u'costa', u'insect'),
 (u'van', u'der', u'waals'),
 (u'john', u'wiley', u'sons'),
 (u'curr', u'opin', u'biotechnol'),
 (u'nicotinamide', u'adenine', u'dinucleotide'),
 (u'crit', u'rev', u'biochem'),
 (u'ed', u'prentice', u'hill'),
 (u'wilmington', u'del', u'usa'),
 (u'naoh', u'borax', u'naf'),
 (u'vac', u'sci', u'technol'),
 (u'stz', u'genbank', u'accession'),
 (u'rec', u'flag', u'frec'),
 (u'proc', u'natl', u'acad'),
 (u'natl', u'acad', u'sci'),
 (u'clin',

Most of these trigrams look good. Some look rather meaningless but investigation shows they make sense:
- 'ddatp', 'ddgtp' and 'ddctp' are molecules associated with DNA (Wikipedia, 2014)
- 'ti', 'nb' and 'zr' refers to an alloy (Liu, Meng, Guo, & Zhao, 2013)

Create the trigram tokeniser and re-tokenise the word list to include the trigram collocations. The resulting word list will be used to find the bigram collocations.

In [9]:
# Create the trigram tokenizer
tri_tokenizer = MWETokenizer(trigrams)

# Re-tokenise the word list
triTokens = tri_tokenizer.tokenize(words)

# Check the results
trigramSet = {token for token in triTokens if "_" in token}
print "Number of distinct trigram tokens:", len(trigramSet)
print trigramSet

Number of distinct trigram tokens: 82
set([u'mgo_cao_sro', u'microbiological_interception_enhancing', u'weakly_cemented_sediments', u'scanning_calorimeter_dsc', u'occluded_blocked_clogged', u'lactis_subsp_cremoris', u'minimally_invasive_surgery', u'methyl_glycine_diacetic', u'wilmington_del_usa', u'oxides_perlite_talc', u'clin_endocrinol_metab', u'odor_neutralizers_polymeric', u'brighteners_hydrotropes_suds', u'modifiers_thickeners_abrasives', u'riken_wako_saitama', u'nicotinamide_adenine_dinucleotide', u'toray_dow_corning', u'federally_sponsored_research', u'ice_cream_yogurt', u'am_chem_soc', u'alkoxy_aldehyde_carboxyl', u'mitsui_takeda_chemicals', u'stz_genbank_accession', u'orthosis_afo_pedorthic', u'spectrometry_photoelectron_spectroscopy', u'etsi_scp_ts', u'internet_address_http', u'photoelectron_spectroscopy_xps', u'polyphosphate_na_tartrate', u'regarding_federally_sponsored', u'engl_j_med', u'granite_limestone_stainless', u'higashi_tsukuba_ibaraki', u'penta_prism_beamsplitter', 

The resulting token list actually contains only 82 of the 100 trigrams. This can happen if phrases of four or more words are split into two overlapping trigrams as only one trigram will be used. For example, the trigram list contains "proc natl acad" and "natl acad sci". So the phrase "proc natl acad sci" will be re-tokenized as "proc_natl_acad" and "sci" and does not use the trigram "natl acad sci".

#### Bigram Collocations

Generate the bigram collocations and filter out meaningless bigrams. Filtering on the following criteria removed most meaningless bigrams:
- occur less than five times
- consist of a repeated word
- contain non-alphabetic characters

Then extract the 200 best bigram collocations based on the pmi measure.

In [10]:
# Get the bigram measures
bigram_measures = nltk.collocations.BigramAssocMeasures()

# Create the bigram collocation finder
finder = nltk.collocations.BigramCollocationFinder.from_words(triTokens)

# Filter out trigram collocations that occur less than five times
finder.apply_freq_filter(5)

# Filter out trigram collocations with repeated words
finder.apply_ngram_filter(lambda w1, w2: (w1 == w2))

# filter out bigram collocations containing non-alpha characters
finder.apply_ngram_filter(lambda w1, w2: (not (w1.isalpha() & w2.isalpha())))

# Extract the 100 best bigram collocations
bigrams = finder.nbest(bigram_measures.pmi, 200)

# Display the results
bigrams

[(u'adenine', u'guanine'),
 (u'prolate', u'spheroid'),
 (u'du', u'pont'),
 (u'englewood', u'cliffs'),
 (u'quinolines', u'acridines'),
 (u'intracardiac', u'echocardiography'),
 (u'overload', u'safeguard'),
 (u'autonomous', u'switchover'),
 (u'myocardial', u'infarction'),
 (u'reductive', u'amination'),
 (u'thiobacillus', u'ferrooxidans'),
 (u'san', u'juan'),
 (u'ewing', u'sarcoma'),
 (u'borax', u'naf'),
 (u'dehydrase', u'dh'),
 (u'harvest', u'leftovers'),
 (u'heptyl', u'oleate'),
 (u'microfiche', u'appendix'),
 (u'unmelted', u'unsintered'),
 (u'gondola', u'stanchion'),
 (u'carbonic', u'anhydrase'),
 (u'ectopic', u'pregnancy'),
 (u'squeaking', u'noises'),
 (u'carboxymethyl', u'inulin'),
 (u'houston', u'tex'),
 (u'digestive', u'tract'),
 (u'amersham', u'biosciences'),
 (u'shewanella', u'japonica'),
 (u'shewanella', u'olleyana'),
 (u'flatly', u'conjoined'),
 (u'unguiform', u'clasping'),
 (u'ethidium', u'bromide'),
 (u'hail', u'storm'),
 (u'super', u'cbl'),
 (u'flying', u'stones'),
 (u'golf'

Most of these bigrams look good. A few look odd but could be terms or acronyms specific to the patent subject area.

Create the bigram tokenised and re-tokenise the word list to check the results.

In [11]:
# Create the bigram tokenizer
bi_tokenizer = MWETokenizer(bigrams)

# Re-tokenise the word list and check the results
mweTokens = bi_tokenizer.tokenize(triTokens)
mweSet = {token for token in mweTokens if "_" in token}
print "Number of distinct multi-word tokens:", len(mweSet)
print mweSet

Number of distinct multi-word tokens: 281
set([u'alkoxylates_suds', u'microbiological_interception_enhancing', u'estimator_algorithm', u'knickknack_pouch', u'higashi_tsukuba_ibaraki', u'osb_lvl', u'adenine_guanine', u'modifiers_thickeners_abrasives', u'borax_naf', u'nicotinamide_adenine_dinucleotide', u'shikimate_dehydrogenase', u'occluded_blocked_clogged', u'super_cbl', u'scanning_calorimeter_dsc', u'korean_intellectual', u'law_enforcement', u'silicates_ceramics_zeolites', u'drinking_straw', u'shadow_masks', u'unchanged_discolored', u'towable_sealcoating', u'reductive_amination', u'ef_mul', u'barbecue_grills', u'ad_xenograft', u'gobo_editor', u'graphitized_mcmb', u'bacillus_gibsonii', u'redifferentiating_dedifferentiated_chondrocytes', u'multimedia_reporter', u'dorus_ps', u'acp_acyltransferase_mat', u'tetraacetyl_ethylene_diamine', u'toy_replica', u'grassland_science_tsukuba', u'nat_genet', u'anal_biochem', u'oz_gal', u'curr_opin_biotechnol', u'nucleoside_triphosphates', u'pourable_mo

The resulting token list includes 199 bigrams (plus 82 trigrams), so all bigrams but one will be used in the final token set. 

#### Retokenise Patent Documents using the Bigrams and Trigrams

Define a function to retokenise the patent documents using the bigrams and trigrams

In [12]:
def retokenise(patentNum):
    """
        This function re-tokenises a patent document to convert the trigram and bigram collocations to single tokens.
        Trigram collocations are processed first, then bigram collocations.
        
        Parameters:
            patentNum - the dictionary key of the patent document to be processed
        
        Returns a tuple containing:
            patentNum - (unchanged)
            tokenisedPatent - the retokenised patent document
    """
    this_patent = tokenisedText[patentNum] 
    tokenisedPatent = bi_tokenizer.tokenize(tri_tokenizer.tokenize(this_patent)) 
    return (patentNum, tokenisedPatent) 

Retokenise the tokens to convert trigrams and bigrams to single tokens

In [13]:
# call retokenise for each patent
finalTokens = dict(retokenise(patentNum) for patentNum in tokenisedText.iterkeys())

# check the results
print finalTokens["07640887"]

[u'the', u'present', u'invention', u'provides', u'a', u'surface', u'wave', u'excitation', u'plasma', u'generator', u'in', u'which', u'surface', u'wave', u'excitation', u'plasma', u'is', u'efficiently', u'generated', u'a', u'surface', u'wave', u'excitation', u'plasma', u'generator', u'including', u'an', u'annular', u'waveguide', u'and', u'a', u'dielectric', u'tube', u'is', u'provided', u'the', u'annular', u'waveguide', u'includes', u'an', u'inlet', u'2a', u'for', u'introducing', u'a', u'microwave', u'm', u'an', u'end', u'plate', u'2b', u'for', u'reflecting', u'the', u'microwave', u'm', u'introduced', u'and', u'propagating', u'within', u'the', u'waveguide', u'and', u'a', u'bottom', u'plate', u'2c', u'on', u'which', u'slot', u'antennas', u'2d', u'are', u'formed', u'at', u'a', u'predetermined', u'interval', u'when', u'the', u'wavelength', u'of', u'the', u'microwave', u'm', u'in', u'the', u'waveguide', u'is', u'\u03bbg', u'the', u'length', u'from', u'position', u'b', u'through', u'positions

Display the vocabulary statistics after re-tokenising the documents

In [14]:
# Convert to a word list and display the vocabulary stats
words = checkVocabStats(finalTokens)

Vocabulary size:  33623
Total number of tokens:  3797190
Lexical diversity:  112.934300925
Total number of articles: 800
Average document length: 4746.4875
Maximum document length: 58005
Minimum document length: 238
Standard deviation of document length: 4287.91401206


The vocabulary size has increased by 196 but 281 new bigram and trigram tokens were created, so a few of the original tokens only occur in a bigram or trigram.

## Generate and Save Feature Vectors

### Run 1 - Include Stop Words

#### Generate and save Count and Binary Feature Vectors

Generate the count feature vectors

In [15]:
# Get the count vectoriser
vectorizer = CountVectorizer(analyzer = "word") 

# Vectorise the patent documents
dataFeatures = vectorizer.fit_transform([" ".join(value) for value in finalTokens.values()])
print "Dimensions of dataFeatures:", dataFeatures.shape

# Extract the vocabulary
vocab = vectorizer.get_feature_names()
vocabSize = dataFeatures.shape[1]

print "Words omitted from vectorizer vocab:", set(words) - set(vocab)

Dimensions of dataFeatures: (800, 33564)
Words omitted from vectorizer vocab: set([u'\u025b', u'\xbd', u'\u2146', u'\u03b1', u'\u03b3', u'\u03b2', u'\u03b5', u'\u03b4', u'\u03b7', u'\u03b8', u'\u03bb', u'\u03ba', u'\u03bd', u'\xbc', u'\xbe', u'\u03c1', u'\u03c0', u'\u03c3', u'\u03c4', u'\u2147', u'\u03c6', u'\u03c9', u'\u2153', u'\u2155', u'\u2154', u'x', u'\u2159', u'\xe5', u'\u215b', u'\u215d', u'\u215c', u'z', u'\u215e', u'a', u'c', u'b', u'e', u'd', u'g', u'f', u'i', u'h', u'k', u'j', u'm', u'l', u'o', u'n', u'q', u'p', u's', u'r', u'u', u't', u'w', u'v', u'y', u'\u03bc', u'\u03be'])


The vectorizer has omitted single character tokens from the vocabulary. 

Save the count and binary feature vectors to files in sparse format - records are in "`patent_id,token_index,value`" format and a record is only generated if the value is non-zero.
- For the binary feature vector, each record has the value 1 to indicate the token appears in the document.
- For the count feature vector, the value is the number of times the token occurs in the document.

In [16]:
# Open the binary and count feature vector files
binaryFile = open("./binary_vectors.txt", "w")
countFile = open("./count_vectors.txt", "w")

# Process each patent document and output records for each word in the vocabulary that appears in the document
for doc, rec in zip(finalTokens.keys(), dataFeatures.toarray()):     
    for wordIndex, count in zip(range(vocabSize), rec):
        if count > 0:
            # Output a binary feature record - value is 1
            binaryFile.write(doc + "," + str(wordIndex) + ",1\n")
            # Output a count feature record - value is the word count
            countFile.write(doc + "," + str(wordIndex) + "," + str(count) + "\n")
            
# Close the feature vector files
binaryFile.close()
countFile.close()

#### Generate and save TF-IDF Weighted Feature Vectors

Generate the TF-IDF weighted feature vectors

In [17]:
# Get the tf-idf vectoriser
tfidf = TfidfVectorizer(analyzer = "word")

# Vectorise the patent documents
dataFeatures = tfidf.fit_transform([" ".join(value) for value in finalTokens.values()])
print "Dimensions of dataFeatures:", dataFeatures.shape

# Extract the vocabulary and vocab size
vocab = tfidf.get_feature_names()
vocabSize = dataFeatures.shape[1]

# Check the results - display the weight vector for the first document
for word, weight in zip(vocab, dataFeatures.toarray()[0]):
    if weight > 0:
        print word, ":", weight

Dimensions of dataFeatures: (800, 33564)
able : 0.00260182231466
about : 0.00349016303366
accommodated : 0.00831404077876
according : 0.00319355360518
achieved : 0.00460681837407
acted : 0.00549974726189
acts : 0.0130932412314
addition : 0.00170367826561
additional : 0.00194029643054
additionally : 0.00244008181937
adjusted : 0.00267506295669
advantageously : 0.00321439062826
advantages : 0.00190811359563
after : 0.00320223687257
again : 0.00943460775299
against : 0.0206939291927
aid : 0.00362793891864
all : 0.00305505965263
along : 0.00490273213502
already : 0.00313158509122
also : 0.00240279048484
alternative : 0.00478818808285
although : 0.00189095184541
an : 0.0273459433677
and : 0.066474665631
angles : 0.00348417106108
any : 0.00457041350267
apart : 0.00284499269108
aperture : 0.0425661583018
applied : 0.00394080379612
arc : 0.00415702038938
are : 0.0219862773987
arm : 0.00606027986288
around : 0.00208963650975
arranged : 0.0127033593359
art : 0.0013190624899
as : 0.0231144315005


Save the TF-IDF weighted vectors to a file in sparse format - records are in "`patent_id,token_index,weight`" format and a record is only generated if the tf-idf weight is non-zero.

In [18]:
# Open the tf-idf feature vector files
tfidfFile = open("./tf_idf_vectors.txt", "w")

# Process each patent document and output a record for each word in the document with a non-zero weight
for doc, rec in zip(finalTokens.keys(), dataFeatures.toarray()):     
    for wordIndex, weight in zip(range(vocabSize), rec):
        if weight > 0:
            tfidfFile.write(doc + "," + str(wordIndex) + "," + str(weight) + "\n")
            
# Close the feature vector file
tfidfFile.close()

### Run 2 - Exclude Stop Words

Two categories of stop words have been excluded from the patent document vectors:
- standard stop words as defined in the the "`stopwords_en.txt`" file
- tokens that very commonly or very rarely occur in the patent documents

#### Standard Stop words

Generate a stopword list from the provided stopword file and exclude these stopwords from the tokenised documents

In [19]:
# Read the stopword file into a list
stopwordList = []
with open("./stopwords_en.txt") as f:
    stopwordList = f.read().splitlines()
print stopwordList

# Convert the stopword list to a set for faster processing
stopwordSet = set(stopwordList)

# Filter out the stopwords from the tokenised patents
for patentNum in finalTokens.keys():
    finalTokens[patentNum] = [w for w in finalTokens[patentNum] if w not in stopwordSet]

['a', "a's", 'able', 'about', 'above', 'according', 'accordingly', 'across', 'actually', 'after', 'afterwards', 'again', 'against', "ain't", 'all', 'allow', 'allows', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'an', 'and', 'another', 'any', 'anybody', 'anyhow', 'anyone', 'anything', 'anyway', 'anyways', 'anywhere', 'apart', 'appear', 'appreciate', 'appropriate', 'are', "aren't", 'around', 'as', 'aside', 'ask', 'asking', 'associated', 'at', 'available', 'away', 'awfully', 'b', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being', 'believe', 'below', 'beside', 'besides', 'best', 'better', 'between', 'beyond', 'both', 'brief', 'but', 'by', 'c', "c'mon", "c's", 'came', 'can', "can't", 'cannot', 'cant', 'cause', 'causes', 'certain', 'certainly', 'changes', 'clearly', 'co', 'com', 'come', 'comes', 'concerning', 'consequently', 'consider', 'considering', 'contain', 'containing', 'conta

#### Patent Data Specific Stop Words

Exclude other tokens that are common or rare in the patent documents. These definitions of common and rare tokens have been used:
- Common tokens are tokens that occur in over 200 documents. These tokens must occur in documents in at least three sections, so are not that useful. Testing with different cut-off points showed that 200 documents was the best (to the nearest 100).
- Rare tokens are tokens that only occur once. Tokens that only occur once are of very little benefit for document analysis. There are about 10,000 of these words, so removing them greatly reduces the dimensionality of the data. 

Find the tokens that occur in over 200 documents

In [20]:
# Create a list from the sets of tokens in each document
wordDocs = list(chain.from_iterable([set(value) for value in finalTokens.values()]))

# Get the frequency of each token - the number of documents containing the token
fdDocs = FreqDist(wordDocs)

# Check the results - display the tokens that occur in over 200 documents
print [w for w in fdDocs.most_common(500) if w[1] > 200]

[(u'comprising', 767), (u'invention', 766), (u'background', 747), (u'description', 722), (u'present', 720), (u'relates', 705), (u'embodiment', 668), (u'summary', 665), (u'reference', 664), (u'shown', 663), (u'drawings', 663), (u'provide', 651), (u'art', 648), (u'field', 631), (u'provided', 626), (u'application', 616), (u'detailed', 613), (u'made', 612), (u'form', 595), (u'includes', 594), (u'view', 582), (u'embodiments', 575), (u'surface', 557), (u'claims', 551), (u'end', 544), (u'formed', 543), (u'including', 541), (u'side', 541), (u'portion', 522), (u'preferred', 517), (u'order', 506), (u'comprises', 489), (u'include', 487), (u'scope', 486), (u'part', 485), (u'material', 485), (u'time', 484), (u'method', 483), (u'generally', 481), (u'position', 475), (u'related', 474), (u'thereof', 469), (u'number', 465), (u'lower', 464), (u'means', 460), (u'addition', 455), (u'direction', 449), (u'high', 448), (u'connected', 448), (u'substantially', 446), (u'filed', 444), (u'patent', 443), (u'incorp

Find the tokens that only occur once and create a set combining the common and rare words, then remove all these tokens from the tokenised patent documents. 

In [21]:
# Convert tokenised patents to a word list
words = list(chain.from_iterable(finalTokens.values()))

# Generate a frequency distribution of the words
fd = FreqDist(words)

# Extract the set of words that appear in over 200 documents and those that only occur once
filterSet = {w[0] for w in fdDocs.most_common(500) if w[1] > 200}.union(set(fd.hapaxes()))

# Filter out these common and rare words from the tokenised patents
for patentNum in finalTokens.keys():
    finalTokens[patentNum] = [w for w in finalTokens[patentNum] if w not in filterSet]

# check the results
print finalTokens["07640887"]

[u'wave', u'excitation', u'plasma', u'generator', u'wave', u'excitation', u'plasma', u'efficiently', u'wave', u'excitation', u'plasma', u'generator', u'annular', u'waveguide', u'dielectric', u'tube', u'annular', u'waveguide', u'inlet', u'2a', u'introducing', u'microwave', u'2b', u'reflecting', u'microwave', u'introduced', u'propagating', u'waveguide', u'2c', u'slot', u'antennas', u'2d', u'interval', u'wavelength', u'microwave', u'waveguide', u'\u03bbg', u'2c', u'circumferential', u'\u03c0', u'd1', u'annular', u'waveguide', u'\u03bbg', u'spaced', u'interval', u'slot', u'antennas', u'2d', u'interval', u'slot', u'antennas', u'2d', u'equal', u'wavelength', u'\u03bbg', u'waveguide', u'benefit', u'japanese', u'serial', u'japanese', u'wave', u'excitation', u'plasma', u'generator', u'generating', u'wave', u'excitation', u'plasma', u'introducing', u'microwave', u'microwave', u'generator', u'wave', u'excitation', u'plasma', u'processing', u'performing', u'chemical', u'vapor', u'deposition', u'cv

In [22]:
# Convert tokenised patents to a word list and display the vocabulary statistics
words = checkVocabStats(finalTokens)

Vocabulary size:  22662
Total number of tokens:  1238478
Lexical diversity:  54.649986762
Total number of articles: 800
Average document length: 1548.0975
Maximum document length: 20578
Minimum document length: 102
Standard deviation of document length: 1534.10535753


Removing the stop words has reduced the vocabulary size by about a third, but reduced the total number of tokens to about 33% of the number in the original tokenised documents.

#### Re-generate the feature vectors and save to files

Generate and save count and binary feature vectors

In [23]:
# Vectorise the patent documents
dataFeatures = vectorizer.fit_transform([" ".join(value) for value in finalTokens.values()])

# Extract the vocabulary size
#vocab = vectorizer.get_feature_names()
vocabSize = dataFeatures.shape[1]

# Open the binary and count feature vector files
binaryFile = open("./binary_vectors_2.txt", "w")
countFile = open("./count_vectors_2.txt", "w")

# Process each patent document and output records for each word in the vocabulary that appears in the document
for doc, rec in zip(finalTokens.keys(), dataFeatures.toarray()):     
    for wordIndex, count in zip(range(vocabSize), rec):
        if count > 0:
            # Output a binary feature record - value is 1
            binaryFile.write(doc + "," + str(wordIndex) + ",1\n")
            # Output a count feature record - value is the word count
            countFile.write(doc + "," + str(wordIndex) + "," + str(count) + "\n")
            
# Close the feature vector files
binaryFile.close()
countFile.close()

Generate and save the TF-IDF weighted feature vectors

In [24]:
# Vectorise the patent documents
dataFeatures = tfidf.fit_transform([" ".join(value) for value in finalTokens.values()])

# Extract the vocabulary size
vocabSize = dataFeatures.shape[1]

# Open the tf-idf feature vector files
tfidfFile = open("./tf_idf_vectors_2.txt", "w")

# Process each patent document and output a record for each word in the document with a non-zero weight
for doc, rec in zip(finalTokens.keys(), dataFeatures.toarray()):     
    for wordIndex, weight in zip(range(vocabSize), rec):
        if weight > 0:
            tfidfFile.write(doc + "," + str(wordIndex) + "," + str(weight) + "\n")
            
# Close the feature vector file
tfidfFile.close()

## SVM Classifier Results

SVM Classifier results - AUC for each feature vector:

Feature Vector | with stop words | generic stop words removed | all stop words removed
---------------|-----------------|----------------------------|-----------------------
Binary Vector  | 0.64            | 0.65                       | 0.69
Count Vector   | 0.57            | 0.57                       | 0.61
TF-IDF Vector  | 0.74            | 0.74                       | 0.74

The TF-IDF weighted feature vector out-performed the other feature vectors by quite a large margin.

Removing only the stop words from the supplied stopword file made a slight improvement to the binary feature vector classification, but no difference to the other classifications.

Removing stop words specific to the patent documents (both common and rare words), as well as the supplied stopword list improved both the binary and count feature vectors classification results, but made no difference to the TF-IDF feature vector classification result. The best improvement was obtained by removing words that occurred in more than 200 documents (so all removed words must occur in documents from three or more sections).

The submitted code removes both the stop words in the stopword file and those specific to the patent documents (both common and rare words).

## References

<p style="margin-left:.5in;text-indent:-.5in">Bird, S., Loper, E., & Klein, E. (2009). Natural Language Processing with Python: O'Reilly Media Inc.</p>

<p style="margin-left:.5in;text-indent:-.5in">Du, L. (2016). 6. Text Data Preprocessing. Melbourne: Monash University. Retrieved from https://www.alexandriarepository.org/syllabus/preprocessing-text-data/. </p>

<p style="margin-left:.5in;text-indent:-.5in">Liu, Q., Meng, Q., Guo, S., & Zhao, X. (2013). α′ Type Ti–Nb–Zr alloys with ultra-low Young's modulus and high strength. Progress in Natural Science: Materials International, 23(6), 562-565. Retrieved from http://www.sciencedirect.com/science/article/pii/S1002007113001548 doi:http://dx.doi.org/10.1016/j.pnsc.2013.11.005</p>

<p style="margin-left:.5in;text-indent:-.5in">NLTK Project. (2014). NLTK 3.0 documentation.   Retrieved from http://www.nltk.org/</p>

<p style="margin-left:.5in;text-indent:-.5in">Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., . . . Duchesnay, É. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825-2830. </p>

<p style="margin-left:.5in;text-indent:-.5in">Python Software Foundation. (2016). Python 2.7.12 Documentation.   Retrieved from https://docs.python.org/2/index.html</p>

<p style="margin-left:.5in;text-indent:-.5in">Richardson, L. (2015). Beautiful Soup (Version 4.4.1) [Python Library]. Retrieved from https://www.crummy.com/software/BeautifulSoup/</p>

<p style="margin-left:.5in;text-indent:-.5in">Scikit Learn. (2016). Scikit-Learn Machine Learning in Python.   Retrieved from http://scikit-learn.org/stable/</p>

<p style="margin-left:.5in;text-indent:-.5in">Walt, S. v. d., Colbert, S. C., & Varoquaux, G. (2011). The NumPy Array: A Structure for Efficient Numerical Computation. Computing in Science & Engineering, 13, 22-30. doi:10.1109/MCSE.2011.37</p>

<p style="margin-left:.5in;text-indent:-.5in">Wikipedia. (2014). Dideoxynucleotide.   Retrieved from https://en.wikipedia.org/wiki/Dideoxynucleotide</p>