**To use Naive Bayes in sklearn, we need a feature matrix. In this case, each distinct word that appears in any of the reviews becomes a feature.**

**So if there are 2000 reviews (1000 positive and 1000 negative) and the entire corpus has about 47,000 distinct words, then our feature matrix will have 2000 rows and about 47,000 columns. `CountVectorizer` returns it as a SciPy sparse matrix rather than a dense numpy array.**
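
To make the shape of the feature matrix concrete, here is a minimal sketch on a made-up three-document corpus. The names `toy_docs`, `toy_vectorizer` and `toy_matrix` are illustrative and not part of the notebook.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical three-document corpus, used only for illustration.
toy_docs = [
    "a great great film",
    "a dull film",
    "great acting and a great plot",
]

toy_vectorizer = CountVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# Rows are documents, columns are distinct words.
# The default tokenizer drops one-letter tokens such as "a".
print(issparse(toy_matrix))               # the result is a sparse matrix
print(toy_matrix.shape)                   # (documents, distinct words)
print(sorted(toy_vectorizer.vocabulary_)) # column order is alphabetical
print(toy_matrix.toarray())               # dense view, fine for a toy corpus
```

Each entry counts how often a word occurs in a document, so the first row has a 2 in the column for "great".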

In [25]:
import glob
import os
from collections import defaultdict
import re
import numpy as np
from sklearn.model_selection import train_test_split

In [26]:
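def processFile(filename):
    # Read one review; the context manager closes the file afterwards.
    with open(filename, 'r') as f:
        content = f.read()
    # Drop every character outside A-z, space and newline.
    # Note: the range A-z also spans [ \ ] ^ _ and `, so those survive too.
    content = re.sub(r'[^A-z \n]', '', content)
    return content.split()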
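
The character class `[^A-z \n]` used in `processFile` keeps a few punctuation characters, because the ASCII range `A-z` includes the symbols that sit between `Z` and `a`. The sketch below compares it with a stricter `A-Za-z` class on a made-up sample string. The stricter pattern is an assumption about the intent, not what the notebook ran.

```python
import re

# Hypothetical sample text, not taken from the corpus.
sample = "It's a 10/10 film [sic] _really_ good^\n"

# Pattern as used in processFile: the range A-z also covers [ \ ] ^ _ `
loose = re.sub(r'[^A-z \n]', '', sample)
# Stricter variant (an assumption about the intent): ASCII letters only.
strict = re.sub(r'[^A-Za-z \n]', '', sample)

print(loose.split())   # brackets, underscores and the caret survive
print(strict.split())  # only letters survive
```

Switching to the stricter class would change the tokens slightly, so the feature counts reported in this notebook would probably no longer match exactly.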

**We read through all the documents and build a list of word lists, one per review:**

In [27]:
path1 = 'review_polarity/txt_sentoken/pos'
path2 = 'review_polarity/txt_sentoken/neg'
content = []
for filename in glob.glob(os.path.join(path1, '*.txt')):
    content.append(processFile(filename))
for filename in glob.glob(os.path.join(path2, '*.txt')):
    content.append(processFile(filename))

In [28]:
# join each word list back into a single string, giving a list of texts
alter = [' '.join(c) for c in content]

In [29]:
from sklearn.feature_extraction.text import CountVectorizer

In [30]:
v = CountVectorizer()
features = v.fit_transform(alter)

In [31]:
features.shape

(2000, 47008)

**We set the labels: 1 for positive reviews and 0 for negative reviews:**

In [32]:
label = [1]*1000 + [0]*1000

**As we have often done before, we split the data into training and test sets:**

In [33]:
X_train, X_test, Y_train, Y_test = train_test_split(features,label)
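
By default `train_test_split` holds out 25% of the rows and shuffles without regard to class, which is why the class supports in the reports in this notebook are not always 250/250. A minimal sketch of a repeatable, class-balanced split follows. The arrays `toy_features` and `toy_label` are hypothetical stand-ins for the real data, and `stratify` and `random_state` are options the notebook itself does not set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real feature matrix: 2000 rows, 1 column.
toy_features = np.arange(2000).reshape(-1, 1)
toy_label = [1] * 1000 + [0] * 1000

# test_size defaults to 0.25, so 500 of the 2000 rows are held out.
# stratify keeps the class ratio equal in both parts;
# random_state makes the shuffle repeatable.
Xtr, Xte, ytr, yte = train_test_split(
    toy_features, toy_label, stratify=toy_label, random_state=0
)

print(len(ytr), len(yte))  # sizes of the training and test parts
print(sum(yte))            # number of positive examples in the test part
```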

**We choose Multinomial Naive Bayes:**

In [34]:
from sklearn import naive_bayes
from sklearn.metrics import accuracy_score, classification_report

model = naive_bayes.MultinomialNB()
model.fit(X_train, Y_train)

# print() with a single argument runs under both Python 2 and Python 3
print("Accuracy: %.3f" % accuracy_score(Y_test, model.predict(X_test)))
print(classification_report(Y_test, model.predict(X_test)))

Accuracy: 0.804
             precision    recall  f1-score   support

          0       0.80      0.81      0.80       246
          1       0.81      0.80      0.81       254

avg / total       0.80      0.80      0.80       500
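
To show what Multinomial Naive Bayes computes from word counts, the sketch below reproduces `predict_proba` by hand on a tiny made-up count matrix. It assumes the default `alpha=1.0` (Laplace smoothing) and class priors estimated from the data. The arrays `X`, `y` and `x_new` are illustrative only.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word counts: 4 documents x 3 words.
X = np.array([[2, 1, 0],
              [1, 0, 1],
              [0, 2, 3],
              [0, 1, 2]])
y = np.array([1, 1, 0, 0])

clf = MultinomialNB()  # default alpha=1.0, i.e. Laplace smoothing
clf.fit(X, y)

# Log prior: fraction of training documents in each class.
log_prior = np.log(np.array([np.mean(y == c) for c in clf.classes_]))
# Smoothed per-class word counts, normalised to word probabilities.
counts = np.array([X[y == c].sum(axis=0) for c in clf.classes_]) + 1.0
log_likelihood = np.log(counts / counts.sum(axis=1, keepdims=True))

# Joint log score of a new document, then normalise to probabilities.
x_new = np.array([[1, 1, 0]])
joint = log_prior + x_new.dot(log_likelihood.T)
manual = np.exp(joint - joint.max())
manual /= manual.sum()

print(manual)
print(clf.predict_proba(x_new))  # should agree with the manual result
```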



**Next we add bigrams as features by setting `ngram_range=(1, 2)`:**

In [11]:
v = CountVectorizer(ngram_range=(1, 2))
features = v.fit_transform(alter)

In [12]:
features.shape

(2000, 550699)
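
The jump in the number of columns comes from every pair of adjacent words becoming its own feature. A small sketch on a made-up sentence shows which features `ngram_range=(1, 2)` produces. The name `bigram_vectorizer` is illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Unigrams and bigrams from one hypothetical sentence.
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
bigram_vectorizer.fit(["not very good movie"])

# Four single words plus three adjacent word pairs.
print(sorted(bigram_vectorizer.vocabulary_))
```

Pairs such as "not very" let the classifier pick up some local context that single words miss.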

In [13]:
X_train, X_test, Y_train, Y_test = train_test_split(features,label)

In [14]:
model = naive_bayes.MultinomialNB()
model.fit(X_train, Y_train)

print("Accuracy: %.3f" % accuracy_score(Y_test, model.predict(X_test)))
print(classification_report(Y_test, model.predict(X_test)))

Accuracy: 0.840
             precision    recall  f1-score   support

          0       0.86      0.82      0.84       250
          1       0.82      0.86      0.84       250

avg / total       0.84      0.84      0.84       500



**Let's try other classifiers:**

In [15]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

**Linear SVC gets similar results (0.834 versus 0.840 accuracy on the same split) but is much slower:**

In [16]:
model = LinearSVC()
model.fit(X_train, Y_train)

print("Accuracy: %.3f" % accuracy_score(Y_test, model.predict(X_test)))
print(classification_report(Y_test, model.predict(X_test)))

Accuracy: 0.834
             precision    recall  f1-score   support

          0       0.82      0.86      0.84       250
          1       0.85      0.81      0.83       250

avg / total       0.83      0.83      0.83       500



**`SVC` with its default RBF kernel was interrupted before it finished:**

In [17]:
model = SVC()
model.fit(X_train, Y_train)

print("Accuracy: %.3f" % accuracy_score(Y_test, model.predict(X_test)))
print(classification_report(Y_test, model.predict(X_test)))

KeyboardInterrupt: 

**Random Forest doesn't do that well:**

In [18]:
model = RandomForestClassifier()
model.fit(X_train, Y_train)

print("Accuracy: %.3f" % accuracy_score(Y_test, model.predict(X_test)))
print(classification_report(Y_test, model.predict(X_test)))

Accuracy: 0.654
             precision    recall  f1-score   support

          0       0.62      0.78      0.69       250
          1       0.71      0.53      0.60       250

avg / total       0.66      0.65      0.65       500
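
Each accuracy above comes from a single random split, so small differences between classifiers may be noise. One way to compare them more evenly is k-fold cross-validation, sketched below with `cross_val_score`. The corpus `toy_texts` and `toy_labels` is made up for illustration; in the notebook the equivalent inputs would be `features` and `label`.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Hypothetical miniature corpus, repeated so every fold has both classes.
toy_texts = ["great film", "wonderful acting", "loved the plot", "superb and moving",
             "dull film", "terrible acting", "hated the plot", "boring and slow"] * 5
toy_labels = ([1] * 4 + [0] * 4) * 5

toy_X = CountVectorizer().fit_transform(toy_texts)

candidates = {
    "MultinomialNB": MultinomialNB(),
    "LinearSVC": LinearSVC(),
    "RandomForest": RandomForestClassifier(n_estimators=10, random_state=0),
}

# Mean accuracy over 5 stratified folds for each classifier.
mean_accuracy = {}
for name, clf in candidates.items():
    scores = cross_val_score(clf, toy_X, toy_labels, cv=5, scoring="accuracy")
    mean_accuracy[name] = scores.mean()
    print("%s: %.3f" % (name, mean_accuracy[name]))
```

The scores on this toy corpus mean nothing by themselves; the point is only that every classifier is evaluated on the same folds.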



**We can also classify an arbitrary piece of text, such as this review excerpt:**

In [35]:
r = 'There is much to ponder in “Fast & Furious 6,” beginning with the title. The first film, in 2001, was called “The Fast and the Furious,” but the going has been so rough and so raw, over the years, that at some point the definite articles dropped off. I prefer the stripped-down version, and can’t help wishing that the principle had been applied more freely in the past: “Bad & Beautiful,” “Good, Bad & Ugly,” “Remains of Day.” Spruce though the new name may be, however, is it true? Does a lifetime of road use not teach us that the fast are very rarely furious? Anyone who has tried a German Autobahn, where there is no official speed limit, merely an “advisory” one, will know that the smile on the face of the Mercedes owner, as he passes you in a whispering blur, is identical to that of the Dalai Lama. It is those of us in gridlock, waiting for red to turn green half a mile away, who know the true meaning of rage.'

In [36]:
a = v.transform([r])

In [37]:
model.predict(a)

array([1])

In [38]:
model.predict_proba(a)

array([[  1.63315401e-04,   9.99836685e-01]])

In [39]:
a = v.transform(['fantastic'])

In [40]:
model.predict_proba(a)

array([[ 0.14211433,  0.85788567]])
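
The two columns returned by `predict_proba` follow the order of the model's `classes_`, so with labels 0 and 1 the second column is the probability of a positive review. The sketch below chains the vectorizer and the classifier with `make_pipeline` so that raw text can be passed directly. This is an alternative arrangement, not what the notebook does, and the training snippets are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training snippets: 1 = positive, 0 = negative.
train_texts = ["fantastic film", "fantastic acting", "awful film", "awful plot"]
train_labels = [1, 1, 0, 0]

# The pipeline vectorizes the text and then applies Naive Bayes.
text_clf = make_pipeline(CountVectorizer(), MultinomialNB())
text_clf.fit(train_texts, train_labels)

proba = text_clf.predict_proba(["fantastic"])

# Columns of proba follow the classifier's classes_ attribute.
print(text_clf.named_steps["multinomialnb"].classes_)
print(proba)
```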