# Classifying Text (e.g., newsgroups)

We'll step through a reasonably complete ML workflow and compare the accuracy of a few of the ML algorithms we've discussed in class on the [20 newsgroups](http://qwone.com/~jason/20Newsgroups) data set, using [nltk](http://nltk.org/) and [scikit-learn](http://scikit-learn.org/).

By default, nltk only includes a small sample of the 20 newsgroups data, so for this demo you'll need to download the complete collection of texts from [here](http://people.csail.mit.edu/jrennie/20Newsgroups/20news-18828.tar.gz).
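If you'd rather unpack the archive from Python than from the shell, something like the following should work. This is a minimal sketch using only the standard library; the helper name `unpack_archive` and the archive location are assumptions, not part of the original demo.

```python
import tarfile
from pathlib import Path


def unpack_archive(archive_path, dest_dir="."):
    """Extract a .tar.gz archive into dest_dir; return its top-level entry names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        # Newer Python versions offer extraction filters that reject unsafe paths.
        if hasattr(tarfile, "data_filter"):
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
        top_level = sorted({name.split("/")[0] for name in tar.getnames()})
    return top_level


# Hypothetical usage, assuming the archive was saved in the current directory:
# unpack_archive("20news-18828.tar.gz")
```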

Shamelessly adapted from:
http://nbviewer.ipython.org/urls/dl.dropboxusercontent.com/u/4864294/ML/20%2520newsgroups.ipynb

## Import data

Once you've downloaded and uncompressed the collection, you should have a folder called '20news-18828'. We can now set up nltk's corpus tools to give us easy access to the text. The code below assumes the folder lives in `/Users/jbloom/Downloads`; adjust the path to match your machine.

In [None]:
import nltk
newsgroups = \
  nltk.corpus.PlaintextCorpusReader('/Users/jbloom/Downloads/20news-18828', '.*/[0-9]+', encoding='latin1')

An nltk corpus can be viewed as a collection of files:

In [None]:
ids = newsgroups.fileids()
print(len(ids))
ids[::4000]

In [None]:
!cat /Users/jbloom/Downloads/20news-18828/sci.electronics/53771

We'll take the list of file ids, randomly shuffle it, and divide it into training and test sections.  And, to make the demo go faster, we'll only use a small sample of the available text.

In [None]:
import random

random.seed(0)

random.shuffle(ids)
ids = ids[:5000]
size = len(ids)

testSet = ids[:int(size*0.1)]
trainSet = ids[int(size*0.1):]

print(len(trainSet), len(testSet))
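The same shuffle-sample-split recipe can be wrapped in a small function. The sketch below is one possible way to do it; `split_ids` and the toy file ids are made up for illustration. It uses a local `random.Random` instance so the split is reproducible without touching the global random state, and it copies the input list rather than reordering it in place.

```python
import random


def split_ids(all_ids, sample_size, test_fraction=0.1, seed=0):
    """Shuffle ids, keep a sample, and return (train_ids, test_ids)."""
    rng = random.Random(seed)   # local generator; global random state is untouched
    shuffled = list(all_ids)    # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    sample = shuffled[:sample_size]
    n_test = int(len(sample) * test_fraction)
    return sample[n_test:], sample[:n_test]


# Toy ids in the same "group/number" shape as the corpus file ids.
toy_ids = ["group%d/%d" % (i % 4, i) for i in range(100)]
train, test = split_ids(toy_ids, sample_size=50)
print(len(train), len(test))  # 45 5
```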

## Extract features

Now that we've got our data, we need to convert our newsgroup texts into features that can be used by machine learning algorithms. For this example, we'll use lexical features: each word is a feature, and its value is the number of times that word occurs in a text.



In [None]:
from collections import defaultdict

def features(text):
    """Convert a post (an iterable of word tokens) into a dictionary of word counts"""
    counts = defaultdict(int)
    for word in text:
        # Keep purely alphabetic tokens only, and fold case
        if word.isalpha():
            counts[word.lower()] += 1
    return counts

print(features(newsgroups.words(fileids=trainSet[0])))
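To see what these word-count features look like without loading the corpus, here is a small self-contained sketch on a hand-written token list. It uses `collections.Counter` instead of a `defaultdict`; the function name `count_features` and the toy tokens are made up for illustration.

```python
from collections import Counter


def count_features(tokens):
    """Count lowercased alphabetic tokens; punctuation and numbers are dropped."""
    return Counter(tok.lower() for tok in tokens if tok.isalpha())


tokens = ["The", "cat", "saw", "the", "other", "cat", ",", "twice", "2"]
counts = count_features(tokens)
print(counts["the"], counts["cat"], counts["twice"])  # 2 2 1
print("," in counts, "2" in counts)                   # False False
```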

We'll also extract the class names for training instances.  For the class, we'll just use the name of the newsgroup that the text was taken from, and we can get that from the first part of the fileid.

In [None]:
def getclass(fileid):
    """Get class name from fileid"""
    return fileid.split('/')[0]

print(getclass(trainSet[0]))

Finally, we'll apply these functions to all the posts in the dataset (this may take a while!)

In [None]:
%time trainData = [(features(newsgroups.words(fileids=f)),getclass(f)) for f in trainSet]

In [None]:
%time testData = [(features(newsgroups.words(fileids=f)),getclass(f)) for f in testSet]

In [None]:
trainData[0]

In [None]:
len(trainData[0][0].keys())

## Baseline

Now that we've got all our posts converted into features and classes, we can try building some classifiers.  First, we'll establish a baseline score: how accurate is a classifier that assigns the most frequent class to every instance?

In [None]:
c = nltk.FreqDist(item[1] for item in trainData)
# FreqDist keys are not guaranteed to be ordered by frequency, so ask for the max explicitly
default = c.max()
print(c)

In [None]:
list(c.keys())

In [None]:
sum(label == default for _, label in testData) / float(len(testData))
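The baseline logic is simple enough to check on toy data. The following is a minimal sketch using only the standard library; `majority_baseline` and the label lists are invented for illustration and are not drawn from the real corpus.

```python
from collections import Counter


def majority_baseline(train_labels, test_labels):
    """Return the most frequent training label and its accuracy on the test labels."""
    majority, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(label == majority for label in test_labels)
    return majority, hits / len(test_labels)


train_labels = ["sci.space"] * 5 + ["rec.autos"] * 3 + ["talk.politics.misc"] * 2
test_labels = ["sci.space", "rec.autos", "sci.space", "talk.politics.misc"]
print(majority_baseline(train_labels, test_labels))  # ('sci.space', 0.5)
```

With 20 roughly balanced newsgroups, we'd expect this kind of baseline to land somewhere near 1 in 20, which is the number the real classifiers need to beat.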

## Naive Bayes

Next, we can try a few variations on the Naive Bayes classifier.  Nltk includes a simple function for training Naive Bayes classifiers which is very easy to use, though slow and not very accurate.

In [None]:
nb = nltk.NaiveBayesClassifier.train(trainData)

Nltk also provides easy tools for evaluating classifiers.

In [None]:
print(nltk.classify.accuracy(nb, testData))

Nltk also has functions that allow us to call other machine learning libraries, including scikit-learn, using wrapper classes.

In [None]:
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import BernoulliNB

bernoulli = SklearnClassifier(BernoulliNB())
bernoulli.train(trainData)

print(nltk.classify.accuracy(bernoulli, testData))

In [None]:
from sklearn.naive_bayes import MultinomialNB

multi = SklearnClassifier(MultinomialNB())
multi.train(trainData)

print(nltk.classify.accuracy(multi, testData))
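For comparison, here is roughly what the same kind of model looks like when scikit-learn is used directly, without the nltk wrapper. As far as we can tell, the wrapper does something similar internally (turning feature dictionaries into a sparse matrix), but treat that as an assumption rather than a description of nltk's code. The toy training data below is made up.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy (features, label) pairs in the same shape as trainData.
toy_train = [
    ({"rocket": 2, "orbit": 1}, "sci.space"),
    ({"engine": 2, "car": 1}, "rec.autos"),
    ({"orbit": 2, "launch": 1}, "sci.space"),
    ({"car": 2, "tires": 1}, "rec.autos"),
]

# Map each feature dictionary to a row of a sparse count matrix.
vectorizer = DictVectorizer()
X = vectorizer.fit_transform([feats for feats, _ in toy_train])
y = [label for _, label in toy_train]

model = MultinomialNB().fit(X, y)

# New posts must go through the same fitted vectorizer before prediction.
pred = model.predict(vectorizer.transform([{"orbit": 1, "rocket": 1}]))
print(pred[0])  # sci.space
```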

Scikit-learn also includes tools for feature selection and error analysis. Here we keep only the 1000 features with the highest chi-squared scores before training the classifier.

In [None]:
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline

pipeline = Pipeline([('chi2', SelectKBest(chi2, k=1000)),
                     ('nb', MultinomialNB())])

pmulti = SklearnClassifier(pipeline)
pmulti.train(trainData)

print(nltk.classify.accuracy(pmulti, testData))

## Support Vector Machines

We can use any of the learning algorithms implemented by scikit-learn in the same way: [decision trees](http://scikit-learn.org/stable/modules/tree.html#classification), [knn](http://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors-classification), maxent (aka [logistic regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression)), [adaboost](http://scikit-learn.org/stable/modules/ensemble.html#adaboost), [linear](http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn-svm-linearsvc) and [non-linear](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn-svm-svc) SVMs, etc.

In [None]:
from sklearn.svm import LinearSVC

svm = SklearnClassifier(LinearSVC())
svm.train(trainData)

print(nltk.classify.accuracy(svm, testData))

In [None]:
results = svm.classify_many(item[0] for item in testData)
results[0], testData[0][1]

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
# Compute confusion matrix
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

cmm = confusion_matrix([x[1] for x in testData], results)

print(cmm)
cmm = np.array(cmm, dtype=float)  # the np.float alias was removed in NumPy 1.24
print(cmm.shape)

f,ax = plt.subplots()

# Show confusion matrix inline in the notebook
ax.imshow(cmm,interpolation='nearest')
ax.set_title('Confusion matrix')
ax.set_ylabel('True label')
ax.set_xlabel('Predicted label')
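A 20x20 image can be hard to read, so it sometimes helps to pull a few numbers out of the confusion counts directly. The sketch below uses only the standard library; the helper names and the toy labels are made up for illustration.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true, predicted) label pair occurs."""
    return Counter(zip(true_labels, predicted_labels))


def per_class_recall(true_labels, predicted_labels):
    """Fraction of each true class that was predicted correctly."""
    pairs = confusion_counts(true_labels, predicted_labels)
    totals = Counter(true_labels)
    return {label: pairs[(label, label)] / totals[label] for label in sorted(totals)}


def top_confusions(true_labels, predicted_labels, n=3):
    """Most frequent (true, predicted) pairs where the prediction was wrong."""
    pairs = confusion_counts(true_labels, predicted_labels)
    errors = [(pair, count) for pair, count in pairs.items() if pair[0] != pair[1]]
    return sorted(errors, key=lambda item: (-item[1], item[0]))[:n]


true_toy = ["sci.space", "sci.space", "sci.space", "rec.autos", "rec.autos", "sci.med"]
pred_toy = ["sci.space", "rec.autos", "sci.space", "rec.autos", "rec.autos", "sci.space"]
print(per_class_recall(true_toy, pred_toy))
print(top_confusions(true_toy, pred_toy))
```

In the notebook, the same helpers could be called as `per_class_recall([x[1] for x in testData], results)`.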


## Word2Vec

A neural embedding approach to representing word meaning (learned from the contexts in which words appear), and by extension the meaning of paragraphs and documents. The main Python module for this is gensim. See https://radimrehurek.com/gensim/models/word2vec.html

- Talk on word2vec https://www.slideshare.net/ChristopherMoody3/word2vec-lda-and-introducing-a-new-hybrid-algorithm-lda2vec-57135994

- https://www.kernix.com/blog/similarity-measure-of-textual-documents_p12
- https://github.com/sdimi/average-word2vec/blob/master/notebook.ipynb
- Doc2vec on newsgroups: https://github.com/skillachie/nlpArea51/tree/master/doc2vec
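One simple way to go from word vectors to document vectors, as in the averaging approach linked above, is to take the mean of the vectors of the words in a document and compare documents by cosine similarity. The sketch below illustrates the idea with tiny hand-made 3-dimensional vectors; these numbers are invented for illustration and do not come from a trained word2vec model.

```python
import math

# Made-up toy embeddings; a real model would supply vectors with 100+ dimensions.
toy_vectors = {
    "rocket": [0.9, 0.1, 0.0],
    "orbit": [0.8, 0.2, 0.1],
    "engine": [0.3, 0.9, 0.1],
    "car": [0.1, 0.9, 0.2],
}


def document_vector(tokens, vectors):
    """Average the vectors of known tokens; return None if no token is known."""
    known = [vectors[tok] for tok in tokens if tok in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


space_doc = document_vector(["rocket", "orbit", "the"], toy_vectors)  # "the" is skipped
car_doc = document_vector(["engine", "car"], toy_vectors)
query = document_vector(["orbit"], toy_vectors)
print(cosine(query, space_doc) > cosine(query, car_doc))  # True
```

The resulting document vectors could then be fed to any of the scikit-learn classifiers above in place of the word-count features.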