**To use Naive Bayes in sklearn, we need a feature matrix. In this case, each word that appears anywhere in the training set becomes a feature.**

**So if there are 1800 reviews (900 positive and 900 negative) and the entire corpus has 45142 distinct words, then our feature matrix will be a numpy array of 1800 rows and 45142 columns.**

**We could build it on our own or use CountVectorizer, which returns the same counts as a SciPy sparse matrix rather than a dense array.**
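
To make "build it on our own" concrete, here is a minimal sketch of a hand-built count matrix. The two-document corpus and the names `docs`, `vocab`, and `matrix` are made up for illustration; the real notebook works on the full review corpus.

```python
import numpy as np

# Hypothetical toy corpus standing in for the tokenized reviews
docs = [["good", "movie", "good"], ["bad", "movie"]]

# Map each distinct word to a column index (sorted for a stable order)
vocab = {word: j for j, word in enumerate(sorted({w for d in docs for w in d}))}

# One row per document, one column per distinct word
matrix = np.zeros((len(docs), len(vocab)), dtype=int)
for i, doc in enumerate(docs):
    for word in doc:
        matrix[i, vocab[word]] += 1

print(vocab)   # {'bad': 0, 'good': 1, 'movie': 2}
print(matrix)  # rows: [0 2 1] and [1 0 1]
```

With tens of thousands of columns, most entries of such a dense array would be zero, which is one reason to prefer a sparse representation.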

In [None]:
import glob
import os
from collections import defaultdict
import re
import numpy as np
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split

In [None]:
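def processFile(filename):
    # Read one review; the with-block closes the file when done
    with open(filename, 'r') as f:
        content = f.read()
    # Keep only letters, spaces, and newlines.
    # 'A-Za-z' rather than 'A-z', which would also keep [ \ ] ^ _ and `
    content = re.sub('[^A-Za-z \n]', '', content)
    return content.split()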
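
One detail worth checking in a cleanup pattern like this: the range `A-z` is not the same as `A-Za-z`. It covers ASCII codes 65 through 122, which also includes `[`, `\`, `]`, `^`, `_`, and the backtick. The sketch below compares the two on a made-up sample string (`sample`, `loose`, and `strict` are illustrative names).

```python
import re

# Hypothetical sample text containing brackets, an underscore, and digits
sample = "A [truly] great_film, 10/10!"

# 'A-z' lets brackets and underscores through
loose = re.sub('[^A-z \n]', '', sample).split()

# 'A-Za-z' keeps letters only
strict = re.sub('[^A-Za-z \n]', '', sample).split()

print(loose)   # ['A', '[truly]', 'great_film']
print(strict)  # ['A', 'truly', 'greatfilm']
```

Note that deleting punctuation outright can glue words together (`great_film` becomes `greatfilm`); replacing it with a space is an alternative if that matters.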

**We read through all the documents and build a list of lists of words:**

In [None]:
path1 = '/Users/jb/Desktop/review_polarity/txt_sentoken/pos'
path2 = '/Users/jb/Desktop/review_polarity/txt_sentoken/neg'
content = []
for filename in glob.glob(os.path.join(path1, '*.txt')):
    content.append(processFile(filename))
for filename in glob.glob(os.path.join(path2, '*.txt')):
    content.append(processFile(filename))

In [None]:
# join each list of words back into one string, giving a list of texts
alter = [' '.join(c) for c in content]

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
features = CountVectorizer().fit_transform(alter)

In [None]:
features.shape
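
As a quick sanity check on what `fit_transform` returns, the sketch below runs `CountVectorizer` on a made-up two-document corpus (`toy_texts` and `toy_features` are illustrative names). It shows that the result is a SciPy sparse matrix with one row per document and one column per distinct word.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not the review data
toy_texts = ["good movie good", "bad movie"]
toy_features = CountVectorizer().fit_transform(toy_texts)

print(issparse(toy_features))   # True
print(toy_features.shape)       # (2, 3)
print(toy_features.nnz)         # 4 stored non-zero counts
print(toy_features.toarray())   # dense view: [[0 2 1], [1 0 1]]
```

Calling `.toarray()` is fine at this size, but on the full corpus the dense version would be mostly zeros and much larger in memory.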

**We set the labels, 1 for positive reviews and 0 for negative reviews:**

In [None]:
label = [1]*1000 + [0]*1000
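
The hard-coded counts assume each folder holds exactly 1000 files and that positive reviews were read first. A small sketch of deriving the labels from the actual counts instead; `make_labels` is a hypothetical helper, not part of the notebook.

```python
def make_labels(n_pos, n_neg):
    """Return 1 for each positive review followed by 0 for each negative one."""
    return [1] * n_pos + [0] * n_neg

# Demo with made-up counts
labels_demo = make_labels(3, 2)
print(labels_demo)  # [1, 1, 1, 0, 0]
```

In the notebook one could pass the number of files matched in each folder, so the labels stay in step with the documents that were actually read.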

**As we have often done before, we split the data into training and test sets:**

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(features,label)
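
By default `train_test_split` shuffles the data and holds out 25% for testing, so each run gives a different split. The sketch below, on made-up arrays (`X_demo`, `y_demo`), shows two optional arguments that may help here: `random_state` for a reproducible split and `stratify` to keep the class balance the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 20 samples, 10 per class
X_demo = np.arange(40).reshape(20, 2)
y_demo = [1] * 10 + [0] * 10

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_te.shape)  # (15, 2) (5, 2)
```

The same call accepts the sparse `features` matrix directly.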

**We choose Multinomial Naive Bayes:**

In [None]:
from sklearn import naive_bayes
from sklearn.metrics import accuracy_score, classification_report

model = naive_bayes.MultinomialNB()
model.fit(X_train, Y_train)

print("Accuracy: %.3f"% accuracy_score(Y_test, model.predict(X_test)))
print(classification_report(Y_test, model.predict(X_test)))
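
To see what the model computes, here is a sketch of Multinomial Naive Bayes done by hand on a made-up count matrix and checked against sklearn. It assumes Laplace smoothing with `alpha = 1.0` (the sklearn default); `X_toy`, `y_toy`, and `predict_by_hand` are illustrative names.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word counts: 4 documents, 3 words
X_toy = np.array([[2, 1, 0],
                  [1, 2, 0],
                  [0, 1, 2],
                  [0, 0, 3]])
y_toy = np.array([1, 1, 0, 0])
alpha = 1.0  # smoothing strength (assumed; matches sklearn's default)

classes = np.unique(y_toy)
# log P(class): fraction of documents in each class
log_prior = np.log(np.array([(y_toy == c).mean() for c in classes]))
# Total count of each word within each class
counts = np.array([X_toy[y_toy == c].sum(axis=0) for c in classes])
# log P(word | class) with additive smoothing
log_likelihood = np.log(
    (counts + alpha) / (counts.sum(axis=1, keepdims=True) + alpha * X_toy.shape[1])
)

def predict_by_hand(X):
    # Score = log prior + sum over words of count * log likelihood
    scores = X @ log_likelihood.T + log_prior
    return classes[np.argmax(scores, axis=1)]

x_new = np.array([[1, 1, 0], [0, 0, 2]])
sk_model = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)

print(predict_by_hand(x_new))   # [1 0]
print(sk_model.predict(x_new))  # [1 0]
```

The smoothing term keeps a word that never appeared in one class from driving that class's probability to zero.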

In [None]:
# Count both single words (unigrams) and adjacent word pairs (bigrams)
features = CountVectorizer(ngram_range=(1, 2)).fit_transform(alter)

In [None]:
features.shape
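
A small sketch of what `ngram_range=(1, 2)` adds, using a made-up two-document corpus (`toy_docs`, `vec`, and `names` are illustrative names). Each adjacent pair of words becomes its own column alongside the single words, which is why the number of columns grows so much.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not the review data
toy_docs = ["not good", "good movie"]

vec = CountVectorizer(ngram_range=(1, 2))
X_bigram = vec.fit_transform(toy_docs)

# vocabulary_ maps each n-gram to its column; sorting gives column order
names = sorted(vec.vocabulary_)
print(names)           # ['good', 'good movie', 'movie', 'not', 'not good']
print(X_bigram.shape)  # (2, 5)
```

With bigrams, "not good" is a feature in its own right rather than just the separate words "not" and "good".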

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(features,label)

In [None]:
model = naive_bayes.MultinomialNB()
model.fit(X_train, Y_train)

print("Accuracy: %.3f" % accuracy_score(Y_test, model.predict(X_test)))
print(classification_report(Y_test, model.predict(X_test)))

**Let's try other classifiers:**

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

**Linear SVC gets the same results but is much slower:**

In [None]:
model = LinearSVC()
model.fit(X_train, Y_train)

print("Accuracy: %.3f" % accuracy_score(Y_test, model.predict(X_test)))
print(classification_report(Y_test, model.predict(X_test)))

**Random Forest doesn't do that well:**

In [None]:
model = RandomForestClassifier()
model.fit(X_train, Y_train)

print("Accuracy: %.3f" % accuracy_score(Y_test, model.predict(X_test)))
print(classification_report(Y_test, model.predict(X_test)))
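
Since the same fit-and-score steps repeat for every classifier, they can be wrapped in a loop that also records fit time. The sketch below is one possible way to do that; `compare_models` is a hypothetical helper, and the data is random synthetic counts, so its accuracies and timings say nothing about the review corpus.

```python
import time

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC


def compare_models(models, X_train, X_test, y_train, y_test):
    """Fit each model; return {name: (test accuracy, fit time in seconds)}."""
    results = {}
    for name, model in models.items():
        start = time.perf_counter()
        model.fit(X_train, y_train)
        fit_seconds = time.perf_counter() - start
        acc = accuracy_score(y_test, model.predict(X_test))
        results[name] = (acc, fit_seconds)
    return results


# Synthetic non-negative counts standing in for the real feature matrix
rng = np.random.default_rng(0)
X_syn = rng.poisson(1.0, size=(200, 50))
y_syn = (X_syn[:, 0] > X_syn[:, 1]).astype(int)
parts = train_test_split(X_syn, y_syn, random_state=0)

models = {
    "MultinomialNB": MultinomialNB(),
    "LinearSVC": LinearSVC(),
    "RandomForest": RandomForestClassifier(random_state=0),
}
results = compare_models(models, *parts)
for name, (acc, secs) in results.items():
    print("%-14s accuracy=%.3f fit_time=%.3fs" % (name, acc, secs))
```

In the notebook, the same helper could be called with `X_train, X_test, Y_train, Y_test` from the split above.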