# Multinomial Naive Bayes Model

We now build the Multinomial Naive Bayes model. First, we import the modules we need.

In [37]:
import numpy as np
import re
import gensim.utils
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in scikit-learn 0.20
from collections import defaultdict
from sklearn.naive_bayes import MultinomialNB

## Loading Data

Now we create a function that takes the filename for the word we want to train on and returns a dictionary. The keys of this dictionary are the different meanings of the word. The value for each meaning is a list of the sentences in which the word carries that meaning.

In [38]:
def load_data(filename):
    """Map each meaning (sense tag) to the list of sentences that use it."""
    d = defaultdict(list)
    with open('new_train/{}'.format(filename)) as f:
        for line in f:
            m = re.search(r"<s>(.*)</s>", line)            # sentence body
            m2 = re.search(r'<tag "(.*)">(.*)</>', line)   # meaning and target word
            if m and m2:
                meaning = m2.group(1)
                sentence = m.group(1)
                # Strip leftover markup and the tagged target word itself.
                for markup in ('<s>', '</s>', '<p>', '</p>', '<@>', m2.group(0)):
                    sentence = sentence.replace(markup, '')
                d[meaning].append(sentence)
    return d

In [39]:
d = load_data('line.cor')
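To make the structure of the returned dictionary concrete, the sketch below applies the same two regular expressions to a single line held in memory. The sample line is hypothetical: it is an assumption about what an entry in `line.cor` looks like, inferred from the patterns in `load_data`, and the real corpus may differ.

```python
import re
from collections import defaultdict

# Hypothetical line, written to match the patterns used in load_data.
sample = '<p> <s> The company expanded its <tag "product">line</> of printers . </s> </p>'

d_demo = defaultdict(list)
m = re.search(r"<s>(.*)</s>", sample)            # sentence body
m2 = re.search(r'<tag "(.*)">(.*)</>', sample)   # meaning and target word
if m and m2:
    # Remove the tagged target word so only the context remains.
    sentence = m.group(1).replace(m2.group(0), '')
    d_demo[m2.group(1)].append(sentence)

print(dict(d_demo))
```

The single key is the meaning taken from the tag, and its value is a one-element list holding the sentence with the tagged word removed.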

Let's print each meaning together with the number of examples it has:

In [40]:
for key in d.keys():
    print key, len(d[key])

cord 373
division 374
product 2217
text 404
phone 429
formation 349


Next, we extract the labels into a NumPy array, with one label per sentence:

In [None]:
def set_labels(d):
    ys = []
    for i in d.keys():
        yi = np.repeat(str(i), len(d[i]))
        ys.append(yi)
    return np.concatenate(ys)

In [None]:
y = set_labels(d)

In [None]:
print y

With this dictionary, we can also build a flat list that contains only the text documents (without labels).

In [41]:
train_documents = [sentence for value in d.values() for sentence in value]

In [42]:
len(train_documents)

4146
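Because `set_labels` and the list comprehension both walk the same dictionary, the labels and documents line up as long as the dictionary is not modified in between. A minimal sketch of an alternative that builds both lists in one pass, so the alignment is explicit, is shown below. The helper name `docs_and_labels` and the toy dictionary are illustrative assumptions, not part of the original notebook.

```python
def docs_and_labels(sense_to_sentences):
    """Return parallel lists: one document and one label per sentence."""
    documents, labels = [], []
    for sense, sentences in sense_to_sentences.items():
        for sentence in sentences:
            documents.append(sentence)
            labels.append(sense)
    return documents, labels

# Toy data standing in for the dictionary returned by load_data.
toy = {
    'phone': ['the line went dead'],
    'text': ['a line of verse', 'read the next line'],
}
docs, labels = docs_and_labels(toy)
for doc, label in zip(docs, labels):
    print('{}: {}'.format(label, doc))
```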

## Lemmatization

To reduce the words in our text to their base forms, we create a function that lemmatizes any input documents.

In [43]:
def lem(documents):
    lem_documents = []
    for doc in documents:
        no_tag_words = [w[:-3] for w in gensim.utils.lemmatize(doc)]
        lem_documents.append(' '.join(no_tag_words)) 
    return lem_documents
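The `w[:-3]` slice relies on `gensim.utils.lemmatize` returning each token with a slash and a two-letter part-of-speech tag appended, such as `argue/VB`. The sketch below mimics that step on a hand-written token list. The tokens are illustrative rather than actual gensim output, and the real function may return byte strings depending on the Python version.

```python
# Illustrative tokens in the "lemma/TAG" shape; not actual gensim output.
tagged = ['company/NN', 'argue/VB', 'foreman/NN', 'tell/VB', 'worker/NN']

# Dropping the last three characters removes the "/" and the two-letter tag.
no_tag_words = [w[:-3] for w in tagged]

# Splitting on the last slash gives the same result without assuming tag length.
no_tag_words_split = [w.rsplit('/', 1)[0] for w in tagged]

print(' '.join(no_tag_words))
```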

Let's test this function on the `train_documents` we created before. Note that the lemmatization is not perfect, and some words are converted to strange forms.

In [44]:
lem_train_documents = lem(train_documents)

In [45]:
lem_train_documents[0]

'company argue foreman needn have tell worker not move plank lifeline be tie come common sense commission note however dellovade hadn instruct employee secure lifeline didn heed federal inspector earlier suggestion company install special safety frame structure be build'

## TF-IDF Vectorizing

Multinomial Naive Bayes works only on numeric matrices, so we use a TF-IDF vectorizer to convert our text into a document-by-feature matrix. We test the function on the lemmatized documents created above.

In [46]:
def Tfidf(documents):
    vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1,3)).fit(documents)
    vectors = vectorizer.transform(documents)
    return vectors, vectorizer
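As a rough illustration of what the vectorizer computes with its default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), the sketch below derives TF-IDF weights by hand for a toy, already tokenized corpus. It covers unigrams only and skips stop-word removal; the corpus and the helper `tfidf` are assumptions made for illustration, not the library's code.

```python
import math
from collections import Counter

# Toy corpus of pre-tokenized documents (assumed data).
docs = [['line', 'phone', 'call'], ['line', 'text', 'read'], ['phone', 'call', 'call']]
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Raw term counts times smoothed idf, then scaled to unit L2 length."""
    tf = Counter(doc)
    weights = {t: tf[t] * (math.log((1.0 + n_docs) / (1.0 + df[t])) + 1.0) for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

for doc in docs:
    print(tfidf(doc))
```

Terms that occur in fewer documents, such as `text` here, receive a larger weight than terms shared across documents, such as `line`.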

In [47]:
vectors, vectorizer = Tfidf(lem_train_documents)

In [48]:
vectorizer

TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 3), norm=u'l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [49]:
vectors

<4146x164961 sparse matrix of type '<type 'numpy.float64'>'
	with 257442 stored elements in Compressed Sparse Row format>
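One reason the matrix has so many more columns than rows is that `ngram_range=(1, 3)` turns every two- and three-word sequence into its own feature, in addition to the single words. A rough sketch of that expansion on an already tokenized sentence follows. The helper `word_ngrams` is a hypothetical stand-in, not the library's internal function.

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all contiguous word sequences of length min_n..max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

tokens = ['company', 'install', 'safety', 'frame']
grams = word_ngrams(tokens)
print(grams)
print(len(grams))
```

Four tokens already yield nine features, and most longer sequences appear in only one document, which is consistent with the very sparse matrix reported above.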

## Multinomial NB

We are ready to fit the Multinomial Naive Bayes classifier to our data.

In [55]:
NB = MultinomialNB(alpha=0.12)
NB.fit(vectors, y)

MultinomialNB(alpha=0.12, class_prior=None, fit_prior=True)
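The `alpha` argument controls additive smoothing of the per-class feature probabilities. As a rough illustration of what the classifier estimates, the sketch below computes smoothed log-probabilities and a prediction by hand for toy data. The numbers, class names, and helper names are made up for illustration; scikit-learn's implementation works on sparse matrices and differs in detail.

```python
import math

alpha = 0.12
# Summed feature weights per class over three features (assumed toy data).
class_feature_totals = {
    'phone': [4.0, 0.0, 1.0],
    'text': [0.0, 3.0, 2.0],
}
class_doc_counts = {'phone': 2, 'text': 3}
n_features = 3
n_docs = sum(class_doc_counts.values())

def log_likelihoods(totals):
    """log P(feature | class) with additive smoothing."""
    denom = sum(totals) + alpha * n_features
    return [math.log((t + alpha) / denom) for t in totals]

def predict(x):
    """Pick the class maximizing log prior + sum_i x_i * log P(feature_i | class)."""
    scores = {}
    for c, totals in class_feature_totals.items():
        log_prior = math.log(class_doc_counts[c] / float(n_docs))
        scores[c] = log_prior + sum(xi * lp for xi, lp in zip(x, log_likelihoods(totals)))
    return max(scores, key=scores.get)

print(predict([1.0, 0.0, 0.0]))
print(predict([0.0, 1.0, 0.5]))
```

Without smoothing, a feature never seen with a class would have probability zero and rule that class out entirely; a small `alpha` keeps such features possible but unlikely.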

## Predicting one Sentence

In [1]:
# Function that applies NB model to predict the meaning of the word in a sentence
def predict_NB(sentence, vectorizer, nb):
    vector = vectorizer.transform([sentence]) # use the vectorizer from training
    return nb.predict(vector)

## Evaluating Predictions on Data

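Accuracy on the training set says little about how the model generalizes, but we can still check how well our Multinomial Naive Bayes model fits the data it was trained on. Note that `scoring` passes the raw sentences to the vectorizer, whereas the model was trained on the lemmatized ones.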

In [59]:
def scoring():
    for key in d.keys():
        score = 0
        for sent in d[key]:
            if predict_NB(sent, vectorizer, NB) == key:
                score += 1
        print "{} score: ".format(key), score, '/', len(d[key])

In [60]:
scoring()

cord score:  351 / 373
division score:  361 / 374
product score:  2217 / 2217
text score:  394 / 404
phone score:  410 / 429
formation score:  324 / 349


Since we are predicting on the training set, the model performs very well, as expected.
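The per-meaning counts printed by `scoring()` are per-class recall on the training set. The snippet below turns the figures reported above into rates and an overall training accuracy. It only reuses the numbers from that output and trains nothing.

```python
# (correct, total) pairs copied from the scoring() output above.
results = {
    'cord': (351, 373),
    'division': (361, 374),
    'product': (2217, 2217),
    'text': (394, 404),
    'phone': (410, 429),
    'formation': (324, 349),
}

for meaning, (correct, total) in sorted(results.items()):
    print('{}: {:.3f}'.format(meaning, correct / float(total)))

total_correct = sum(c for c, _ in results.values())
total_examples = sum(t for _, t in results.values())
print('overall: {:.3f}'.format(total_correct / float(total_examples)))
```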

## Cross validation

Finally, we cross-validate the model.

In [61]:
cross_val_score(NB, vectors, y, scoring='accuracy')

array([ 0.72904624,  0.73227207,  0.71521739])
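Since `y` is grouped by meaning, how the folds are drawn matters: a split that simply cut the data into consecutive thirds would leave some meanings out of the training folds. For classifiers, `cross_val_score` stratifies the folds by label by default. A simplified sketch of stratified fold assignment is given below. The function `stratified_folds` is a hypothetical illustration and does not reproduce scikit-learn's exact assignment.

```python
from collections import defaultdict

def stratified_folds(labels, n_folds=3):
    """Assign each sample a fold so every label is spread evenly across folds."""
    fold_of = [None] * len(labels)
    seen = defaultdict(int)  # samples of each label assigned so far
    for i, label in enumerate(labels):
        fold_of[i] = seen[label] % n_folds
        seen[label] += 1
    return fold_of

# Toy labels grouped by class, like y in this notebook.
labels = ['cord'] * 6 + ['phone'] * 3
folds = stratified_folds(labels)
for k in range(3):
    held_out = [labels[i] for i, f in enumerate(folds) if f == k]
    print('fold {}: {}'.format(k, held_out))
```

Each held-out fold keeps the same class proportions as the full label list, so every meaning is represented in both the training and the test portion of each split.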

In [62]:
cross_val_score(NB, vectors, y, scoring='precision')


array([ 0.73745631,  0.7519013 ,  0.72692749])

cross_val_score(NB, vectors, y, scoring='')