# Classifing text with NLTK (natural language processing) and ScikitLearn (machine learning) 

Examples of classifiers from NLTK and ScikitLearn with default parameters. One run of a single training/testing set setup.

NLTK and other imports (see nb_txt_classifier for getting the corpus and training set ready)

In [49]:
import nltk
import random
from nltk.corpus import movie_reviews

ScikitLearn imports

In [50]:
from nltk.classify.scikitlearn import SklearnClassifier  ## wrapper for scikitlearn in nltk
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier # Stochastic Gradiant Descent
from sklearn.svm import SVC, LinearSVC, NuSVC

### Get feature sets from the corpus

In [51]:
documents = [(list(movie_reviews.words(fileid)), category)
            for category in movie_reviews.categories()
            for fileid in movie_reviews.fileids(category)]

# Select most common words in all reviews (positive and negative)¶
all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())
    
# Convert all_words from a list to a frequency distribution.
all_words = nltk.FreqDist(all_words)
all_words.most_common(15)

# Select only top commonly used words.
word_features = [wordFreq[0] for wordFreq in all_words.most_common(3000)]

Define training features (whether corpus words belong to the most frequent words list)

In [52]:
def find_features(document):
    words = set(document)  ## pick only unique words in the review
    features = {}
    for w in words:
        features[w] = (w in word_features)
    return features

feature_sets = [(find_features(review_words),category) for review_words, category in documents]

## Classification

Do a common train and test set setup for all the classifiers below.

In [53]:
# Randomise items in the feature set to select test and training sets randomly.
random.shuffle(feature_sets)

# Select training and test sets
training_set = feature_sets[:1900]
testing_set = feature_sets[1900:]

### Classifier 1: Original NLTK Naive Bayes

In [54]:
%%timeit -r1 -n1
classifier = nltk.NaiveBayesClassifier.train(training_set)
classifier_accuracy = nltk.classify.accuracy(classifier, testing_set)
print("Classifier accuracy: {}".format(classifier_accuracy))
classifier.show_most_informative_features(15)

Classifier accuracy: 0.7
Most Informative Features
               ludicrous = False             neg : pos    =     14.3 : 1.0
             outstanding = True              pos : neg    =     13.8 : 1.0
                  avoids = False             pos : neg    =     12.3 : 1.0
              astounding = False             pos : neg    =     11.6 : 1.0
                    slip = False             pos : neg    =     11.6 : 1.0
               insulting = False             neg : pos    =     11.0 : 1.0
                headache = False             neg : pos    =     11.0 : 1.0
                    3000 = False             neg : pos    =     10.4 : 1.0
                  hudson = False             neg : pos    =     10.4 : 1.0
                  hatred = False             pos : neg    =     10.3 : 1.0
                thematic = False             pos : neg    =     10.3 : 1.0
                   sucks = False             neg : pos    =     10.2 : 1.0
              incoherent = False             neg 

### Classifier 2: Multinomial Naive Bayes from ScikitLearn

In [55]:
%%timeit -r1 -n1

from sklearn.naive_bayes import MultinomialNB

mnb_classifier = SklearnClassifier(MultinomialNB())
mnb_classifier.train(training_set)
classifier_accuracy = nltk.classify.accuracy(mnb_classifier, testing_set)
print("Classifier accuracy: {}".format(classifier_accuracy))
#mnb_classifier.show_most_informative_features(15)

Classifier accuracy: 0.86
613 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


### Classifier 3: Gaussian Naive Bayes from ScikitLearn

In [56]:
#Doesn't work, need to look it up
#gnb_classifier = SklearnClassifier(GaussianNB())
#gnb_classifier.train(training_set)
#classifier_accuracy = nltk.classify.accuracy(gnb_classifier, testing_set)
#print("Classifier accuracy: {}".format(classifier_accuracy))
##gnb_classifier.show_most_informative_features(15)

### Classifier 4: Bernoulli Naive Bayes from ScikitLearn

In [57]:
%%timeit -r1 -n1

from sklearn.naive_bayes import BernoulliNB

bnb_classifier = SklearnClassifier(BernoulliNB())
bnb_classifier.train(training_set)
classifier_accuracy = nltk.classify.accuracy(bnb_classifier, testing_set)
print("Classifier accuracy: {}".format(classifier_accuracy))
#bnb_classifier.show_most_informative_features(15)

Classifier accuracy: 0.83
596 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


### Classifier 5: Logistic Regression from ScikitLearn

In [58]:
%%timeit -r1 -n1

from sklearn.linear_model import LogisticRegression

classifier = SklearnClassifier(LogisticRegression())
classifier.train(training_set)
classifier_accuracy = nltk.classify.accuracy(bnb_classifier, testing_set)
print("Classifier accuracy: {}".format(classifier_accuracy))
#bnb_classifier.show_most_informative_features(15)

Classifier accuracy: 0.89
754 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


### Classifier 6: Stochastic Gradient Descent from ScikitLearn

In [59]:
%%timeit -r1 -n1

from sklearn.linear_model import SGDClassifier
classifier = SklearnClassifier(SGDClassifier())
classifier.train(training_set)
classifier_accuracy = nltk.classify.accuracy(classifier, testing_set)
print("Classifier accuracy: {}".format(classifier_accuracy))
#classifier.show_most_informative_features(15)


Classifier accuracy: 0.78
615 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)




### Classifier 7: SVM from ScikitLearn

In [60]:
%%timeit -r1 -n1

from sklearn.svm import SVC

classifier = SklearnClassifier(SVC())
classifier.train(training_set)
classifier_accuracy = nltk.classify.accuracy(classifier, testing_set)
print("Classifier accuracy: {}".format(classifier_accuracy))
#classifier.show_most_informative_features(15)

Classifier accuracy: 0.48
7.3 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


### Classifier 8: Linear SVM from ScikitLearn

In [61]:
%%timeit -r1 -n1

from sklearn.svm import LinearSVC

classifier = SklearnClassifier(LinearSVC())
classifier.train(training_set)
classifier_accuracy = nltk.classify.accuracy(classifier, testing_set)
print("Classifier accuracy: {}".format(classifier_accuracy))
#classifier.show_most_informative_features(15)

Classifier accuracy: 0.82
673 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


### Classifier 9: SVM with numeric parameter from ScikitLearn

You can specify the number of support vectors

In [62]:
%%timeit -r1 -n1

from sklearn.svm import NuSVC

classifier = SklearnClassifier(NuSVC())
classifier.train(training_set)
classifier_accuracy = nltk.classify.accuracy(classifier, testing_set)
print("Classifier accuracy: {}".format(classifier_accuracy))
#classifier.show_most_informative_features(15)

Classifier accuracy: 0.86
5.89 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
