### City University of New York | IS620 | Web Analytics

# Project 4: Bayesian Sense Tag Classifier with Python and NLTK

***

**We will use the Senseval corpus to build a sense tag classifier.
We start by reading the instances for a particular ambiguous word (in our case 'hard'), each of which is labeled with the sense in which the word is used.**

In [90]:
from nltk.corpus import senseval

import nltk
import random

instances = senseval.instances('hard.pos')
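size = int(0.9 * len(instances))
# First 90% for training, last 10% for testing. These two lists are not used
# below; the actual split is made later, on the shuffled feature set.
train, test = instances[:size], instances[size:]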

**We have loaded the instances in the 'instances' variable. We will now display three instances (indexes 1 to 3).
Notice that each instance has a 'position' attribute that gives the index of the target word ('hard' in our case) within the context, which is a list of (word, tag) pairs.
We will use the position to create a feature set composed of the words and tags immediately before and after the target word.**

In [91]:
instances[1:4]

[SensevalInstance(word=u'hard-a', position=10, context=[('clever', 'NNP'), ('white', 'NNP'), ('house', 'NNP'), ('``', '``'), ('spin', 'VB'), ('doctors', 'NNS'), ("''", "''"), ('are', 'VBP'), ('having', 'VBG'), ('a', 'DT'), ('hard', 'JJ'), ('time', 'NN'), ('helping', 'VBG'), ('president', 'NNP'), ('bush', 'NNP'), ('explain', 'VB'), ('away', 'RB'), ('the', 'DT'), ('economic', 'JJ'), ('bashing', 'NN'), ('that', 'IN'), ('low-and', 'JJ'), ('middle-income', 'JJ'), ('workers', 'NNS'), ('are', 'VBP'), ('taking', 'VBG'), ('these', 'DT'), ('days', 'NNS'), ('.', '.')], senses=('HARD1',)),
 SensevalInstance(word=u'hard-a', position=3, context=[('i', 'PRP'), ('find', 'VBP'), ('it', 'PRP'), ('hard', 'JJ'), ('to', 'TO'), ('believe', 'VB'), ('that', 'IN'), ('the', 'DT'), ('sacramento', 'NNP'), ('river', 'NNP'), ('will', 'MD'), ('ever', 'RB'), ('be', 'VB'), ('quite', 'RB'), ('the', 'DT'), ('same', 'JJ'), (',', ','), ('although', 'IN'), ('i', 'PRP'), ('certainly', 'RB'), ('wish', 'VBP'), ('that', 'IN'),

**Following is an example of getting the sense label. We will use this during creation of the feature set.**

In [92]:
instances[0].senses[0]

'HARD1'
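
**To make the role of 'position' concrete, here is a minimal sketch that indexes into a context list. 'ToyInstance' is a hypothetical stand-in for the corpus instance class, used so the example runs without the corpus; its values are copied from the first instance displayed above, with the context shortened.**

```python
from collections import namedtuple

# Hypothetical stand-in for the Senseval instance class (an assumption made
# so that the example runs without downloading the corpus).
ToyInstance = namedtuple('ToyInstance', ['word', 'position', 'context', 'senses'])

toy = ToyInstance(
    word='hard-a',
    position=10,
    context=[('clever', 'NNP'), ('white', 'NNP'), ('house', 'NNP'), ('``', '``'),
             ('spin', 'VB'), ('doctors', 'NNS'), ("''", "''"), ('are', 'VBP'),
             ('having', 'VBG'), ('a', 'DT'), ('hard', 'JJ'), ('time', 'NN'),
             ('helping', 'VBG')],
    senses=('HARD1',),
)

# 'position' is an index into 'context', so the neighbours are one step away.
targetWord, targetTag = toy.context[toy.position]
previousWord, previousTag = toy.context[toy.position - 1]
nextWord, nextTag = toy.context[toy.position + 1]

print('%s | %s | %s | %s' % (previousWord, targetWord, nextWord, toy.senses[0]))
```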

**We will now define our feature set. As noted before, we will use the 'position' attribute of the instance to find the word and tag immediately before and immediately after the target word. If the position is zero, there is no previous word, so the previous word and tag are hard-coded to 'START' (start of sentence). The next word and tag are used in both cases.**

In [93]:
def getFeatures(instance):
    featureList = dict()
    position = instance.position
    context = instance.context
    # If the target word is not the first word, use the word and tag just
    # before it; otherwise use the 'START' placeholder.
    if position > 0:
        featureList['previousWord'] = context[position - 1][0]
        featureList['previousTag'] = context[position - 1][1]
    else:
        featureList['previousWord'] = 'START'
        featureList['previousTag'] = 'START'
    # The word and tag just after the target word are used in both cases.
    featureList['nextWord'] = context[position + 1][0]
    featureList['nextTag'] = context[position + 1][1]
    return featureList
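
**The function above assumes that the target word is never the last word of its context. The following is a sketch of a variant that also pads the end of the context. The name 'getFeaturesWithBoundaries', the 'END' placeholder and the 'ToyInstance' stand-in are all assumptions made for illustration; they are not part of the project.**

```python
from collections import namedtuple

# Hypothetical stand-in carrying only the two attributes the function reads.
ToyInstance = namedtuple('ToyInstance', ['position', 'context'])

def getFeaturesWithBoundaries(instance):
    """Illustrative variant of getFeatures that pads both ends of the context."""
    context = instance.context
    position = instance.position
    # Use placeholder pairs whenever a neighbour does not exist.
    if position > 0:
        previousWord, previousTag = context[position - 1]
    else:
        previousWord, previousTag = 'START', 'START'
    if position + 1 < len(context):
        nextWord, nextTag = context[position + 1]
    else:
        nextWord, nextTag = 'END', 'END'
    return {
        'previousWord': previousWord,
        'previousTag': previousTag,
        'nextWord': nextWord,
        'nextTag': nextTag,
    }

# Made-up context in which the target word comes last.
lastWordInstance = ToyInstance(position=2,
                               context=[('it', 'PRP'), ('was', 'VBD'), ('hard', 'JJ')])
lastWordFeatures = getFeaturesWithBoundaries(lastWordInstance)
print(lastWordFeatures)
```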

**We will now build our dataset with features by calling the 'getFeatures' function defined above for each instance object. The sense will be fetched through the 'senses' attribute of the instance. Only instances labeled with exactly one sense are kept.**

In [94]:
dataSet = []
for instance in instances:
    if len(instance.senses) == 1:
        featuresForInstance = getFeatures(instance)
        senseForInstance = instance.senses[0]
        dataSet.append((featuresForInstance, senseForInstance))

**We will now randomly shuffle the data set.**

In [95]:
random.shuffle(dataSet)
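
**Because the shuffle above is not seeded, the split and the accuracies reported later can differ from run to run. The following is a minimal sketch of a repeatable shuffle, assuming a fixed seed is acceptable. The seed value 42 and the toy list are arbitrary choices for illustration.**

```python
import random

toyDataSet = list(range(10))

# A dedicated generator with a fixed seed produces the same order every time.
firstRun = list(toyDataSet)
random.Random(42).shuffle(firstRun)

secondRun = list(toyDataSet)
random.Random(42).shuffle(secondRun)

print(firstRun == secondRun)  # True
```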

**We will print out the size of the data set.**

In [96]:
len(dataSet)

4333

**Following are the first 10 entries from the data set.**

In [97]:
dataSet[0:10]

[({'nextTag': 'TO',
   'nextWord': 'to',
   'previousTag': 'VBP',
   'previousWord': 'are'},
  'HARD1'),
 ({'nextTag': 'PRP',
   'nextWord': 'it',
   'previousTag': 'WRB',
   'previousWord': 'how'},
  'HARD1'),
 ({'nextTag': 'NN',
   'nextWord': 'day',
   'previousTag': 'DT',
   'previousWord': 'the'},
  'HARD1'),
 ({'nextTag': 'NN',
   'nextWord': 'question',
   'previousTag': 'DT',
   'previousWord': 'a'},
  'HARD1'),
 ({'nextTag': 'NN',
   'nextWord': 'look',
   'previousTag': ',',
   'previousWord': ','},
  'HARD2'),
 ({'nextTag': 'NNS',
   'nextWord': 'shoes',
   'previousTag': 'PRP$',
   'previousWord': 'their'},
  'HARD3'),
 ({'nextTag': 'NN',
   'nextWord': 'work',
   'previousTag': 'PRP$',
   'previousWord': 'their'},
  'HARD2'),
 ({'nextTag': 'NN',
   'nextWord': 'thinking',
   'previousTag': 'IN',
   'previousWord': 'of'},
  'HARD2'),
 ({'nextTag': 'TO',
   'nextWord': 'to',
   'previousTag': 'RB',
   'previousWord': 'sometimes'},
  'HARD1'),
 ({'nextTag': 'TO',
   'nextWord

**We will now split the data set into training, dev and test sets. We will use 350 entries for the dev and test set each, and use the remaining for the training set.**

In [98]:
trainingSetStartLocation = 700 # Training set runs from index 700 to the end of the list
devSetSize = 350 # Dev set covers indexes 0 to 349
trainingSet = dataSet[trainingSetStartLocation:]
devSet = dataSet[:devSetSize]
testSet = dataSet[devSetSize:trainingSetStartLocation] # Test set covers indexes 350 to 699

**We now print out the sizes of the various data sets.**

In [99]:
print('The Training set size is : ' + str(len(trainingSet)))
print('The Dev set size is : ' + str(len(devSet)))
print('The Test set size is : ' + str(len(testSet)))

The Training set size is : 3633
The Dev set size is : 350
The Test set size is : 350
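
**The same three-way split can be sketched as a small helper. The function name 'splitDataSet' and the placeholder list are assumptions for illustration; only the sizes (4333 entries, 350 for dev, 350 for test) come from the outputs above.**

```python
def splitDataSet(dataSet, devSetSize, testSetSize):
    """Return (trainingSet, devSet, testSet) cut from consecutive slices."""
    devSet = dataSet[:devSetSize]
    testSet = dataSet[devSetSize:devSetSize + testSetSize]
    trainingSet = dataSet[devSetSize + testSetSize:]
    return trainingSet, devSet, testSet

# Placeholder list with the same length as the real data set.
placeholderDataSet = list(range(4333))
toyTraining, toyDev, toyTest = splitDataSet(placeholderDataSet, 350, 350)

print('%d %d %d' % (len(toyTraining), len(toyDev), len(toyTest)))  # 3633 350 350
```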


**We now create the Naive Bayes classifier and train it with the training set.**

In [100]:
classifier = nltk.NaiveBayesClassifier.train(trainingSet)
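
**To illustrate the idea behind a Naive Bayes classifier, here is a minimal pure-Python sketch: it scores each label by its prior probability multiplied by the probability of each feature value given that label, and picks the highest score. The function names, the toy training data and the add-one smoothing are assumptions chosen for simplicity; this is not claimed to match the estimator that NLTK uses internally.**

```python
import math
from collections import Counter, defaultdict

def trainNaiveBayes(labeledFeatures):
    """Count labels and feature values per label."""
    labelCounts = Counter()
    valueCounts = defaultdict(Counter)  # (label, feature name) -> value counts
    seenValues = defaultdict(set)       # feature name -> all observed values
    for features, label in labeledFeatures:
        labelCounts[label] += 1
        for name, value in features.items():
            valueCounts[(label, name)][value] += 1
            seenValues[name].add(value)
    return labelCounts, valueCounts, seenValues

def classifyNaiveBayes(model, features):
    """Return the label with the highest log-score for the given features."""
    labelCounts, valueCounts, seenValues = model
    total = sum(labelCounts.values())
    bestLabel, bestScore = None, float('-inf')
    for label, count in labelCounts.items():
        # Log of the prior P(label).
        score = math.log(count) - math.log(total)
        for name, value in features.items():
            # Add-one smoothing; the extra +1 in the denominator reserves
            # probability mass for values never seen in training.
            numerator = valueCounts[(label, name)][value] + 1
            denominator = count + len(seenValues[name]) + 1
            score += math.log(numerator) - math.log(denominator)
        if score > bestScore:
            bestLabel, bestScore = label, score
    return bestLabel

# Made-up training examples in the same (features, sense) format as the data set.
toyTrainingSet = [
    ({'nextWord': 'to', 'nextTag': 'TO'}, 'HARD1'),
    ({'nextWord': 'time', 'nextTag': 'NN'}, 'HARD1'),
    ({'nextWord': 'work', 'nextTag': 'NN'}, 'HARD2'),
    ({'nextWord': 'shoes', 'nextTag': 'NNS'}, 'HARD3'),
]
toyModel = trainNaiveBayes(toyTrainingSet)
print(classifyNaiveBayes(toyModel, {'nextWord': 'to', 'nextTag': 'TO'}))
```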

**We will now run the classifier on the Dev set and print its classification accuracy.**

In [101]:
devSetAccuracy = nltk.classify.accuracy(classifier, devSet)
print ("The Dev set classification accuracy is : ", devSetAccuracy)

('The Dev set classification accuracy is : ', 0.8885714285714286)
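
**Accuracy here is the fraction of labeled entries for which the predicted sense equals the true sense, so 0.8885714285714286 on 350 entries corresponds to 311 correct predictions. The following sketch computes the same measure by hand. The 'toyClassify' rule and the toy dev set are made up for illustration and stand in for the trained classifier.**

```python
def accuracy(classify, labeledFeatures):
    """Fraction of (features, label) pairs that 'classify' labels correctly."""
    correct = sum(1 for features, label in labeledFeatures
                  if classify(features) == label)
    return correct / float(len(labeledFeatures))

# Hypothetical rule standing in for a trained classifier.
def toyClassify(features):
    return 'HARD1' if features['nextTag'] == 'TO' else 'HARD2'

toyDevSet = [
    ({'nextTag': 'TO'}, 'HARD1'),
    ({'nextTag': 'NN'}, 'HARD2'),
    ({'nextTag': 'NNS'}, 'HARD3'),
    ({'nextTag': 'TO'}, 'HARD1'),
]

toyAccuracy = accuracy(toyClassify, toyDevSet)
print(toyAccuracy)  # 0.75, since 3 of the 4 toy entries are labeled correctly
```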


**We will now run the classifier on the Test set and print its classification accuracy.**

In [102]:
testSetAccuracy = nltk.classify.accuracy(classifier, testSet)
print ("The Test set classification accuracy is : ", testSetAccuracy)

('The Test set classification accuracy is : ', 0.8942857142857142)
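
**An accuracy figure is easier to judge next to a baseline that always predicts the most frequent sense. The project does not report such a baseline; the following is only a sketch of how one could be computed. The labels are the nine complete entries from the data set sample printed earlier, so the resulting number describes that small sample and not the full test set.**

```python
from collections import Counter

# Sense labels of the nine complete entries shown in the data set sample.
sampleLabels = ['HARD1', 'HARD1', 'HARD1', 'HARD1', 'HARD2',
                'HARD3', 'HARD2', 'HARD2', 'HARD1']

# The baseline always predicts the most frequent label.
majorityLabel, majorityCount = Counter(sampleLabels).most_common(1)[0]
baselineAccuracy = majorityCount / float(len(sampleLabels))

print('%s %d of %d' % (majorityLabel, majorityCount, len(sampleLabels)))  # HARD1 5 of 9
```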
