# Natural Language Processing Lab

In this lab we will further explore scikit-learn's and NLTK's text-processing capabilities. We will use the 20 Newsgroups dataset, which scikit-learn provides through `fetch_20newsgroups`.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
from sklearn.datasets import fetch_20newsgroups

In [3]:

categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]

data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=('headers', 'footers', 'quotes'))

data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=('headers', 'footers', 'quotes'))



## 1. Data inspection

We have downloaded a few newsgroup categories and removed headers, footers and quotes.

Let's inspect them.

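1. What data type is `data_train`?
> sklearn.datasets.base.Bunch
- Is it like a list, like a dictionary, or something else?
> It behaves like a dictionary: a `Bunch` is a dict whose keys can also be read as attributes (`data_train.data`, `data_train.target`).
- How many data points does it contain?
> 2034 (see the cell below)
- Inspect the first data point. What does it look like?
> A blurb of text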

In [7]:
len(data_train.data)


2034
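
To make the "dictionary with attribute access" behaviour concrete, here is a minimal stand-in. `MiniBunch` is a hypothetical class written only for illustration, not scikit-learn's `Bunch`, and the two sample posts are made up.

```python
class MiniBunch(dict):
    """Hypothetical stand-in: a dict whose keys can also be read as attributes."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so dict methods still work.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)


# Made-up data using the same field names that data_train exposes.
toy = MiniBunch(data=["first post", "second post"], target=[0, 1])

print(sorted(toy.keys()))       # dictionary-style access to the field names
print(toy.data is toy["data"])  # attribute and key access return the same object
print(len(toy.data))            # number of data points
```

The same three checks (`keys()`, attribute access, `len`) are a quick way to answer the inspection questions on the real `data_train`.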

## 2. Bag of Words model

Let's train a model using a simple count vectorizer.

1. Initialize a standard CountVectorizer and fit it on the training data
- How big is the feature dictionary?
- Repeat, eliminating English stop words
- Is the dictionary smaller?
- Transform the training data using the trained vectorizer
- What are the 20 most common words in the whole corpus?
- What are the 20 most common words in each of the 4 classes?
- Evaluate the performance of a Logistic Regression on the features extracted by the CountVectorizer
    - You will have to transform the test set too. Be careful to use the trained vectorizer, without re-fitting it
- Try the following 3 modifications:
    - restrict `max_features`
    - change `max_df` and `min_df`
    - use a fixed vocabulary of size 80 combining the 20 most common words per group found earlier
- For each of the above, print a confusion matrix and investigate what gets mixed
> Answer: not surprisingly, if we reduce the feature space we lose accuracy (0.745 with the full vocabulary versus 0.619 with 100 features)
- Print out the number of features for each model
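
Before running the real vectorizer, it can help to see what a bag-of-words model does on a corpus small enough to check by hand. The following is a simplified sketch in plain Python 3, not scikit-learn's implementation: the three documents and the four-word stop list are made up, and the regular expression imitates CountVectorizer's default token pattern (lowercased runs of two or more word characters).

```python
import re
from collections import Counter

# Made-up documents and a tiny stop-word list, for illustration only.
docs = [
    "The shuttle is in orbit around the earth",
    "The image is a GIF image file",
    "God is in the bible",
]
stop_words = {"the", "is", "in", "a"}


def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())


def bag_of_words(documents, stop=frozenset()):
    # One Counter per document: token -> number of occurrences
    return [Counter(t for t in tokenize(d) if t not in stop) for d in documents]


full = bag_of_words(docs)
filtered = bag_of_words(docs, stop_words)

# The vocabulary is the set of distinct tokens across all documents.
vocab_full = sorted(set().union(*full))
vocab_filtered = sorted(set().union(*filtered))
print(len(vocab_full), len(vocab_filtered))

# Corpus-wide totals give the most common words.
totals = sum(filtered, Counter())
print(totals.most_common(1))
```

Removing stop words shrinks the vocabulary only by the size of the stop list, which is why the real dictionary below is "barely smaller".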

In [52]:
from sklearn.feature_extraction.text import CountVectorizer

# Default vectorizer: every token of two or more characters becomes a feature
v = CountVectorizer()
x_vect = v.fit_transform(data_train.data)
print("the dict is %s" % str(x_vect.shape))

# Same again, dropping scikit-learn's built-in English stop words
v = CountVectorizer(stop_words='english')
x_vect = v.fit_transform(data_train.data)
print("the dict is %s with stopwords" % str(x_vect.shape))
print("It is barely smaller")

the dict is (2034, 26879)
the dict is (2034, 26576) with stopwords
It is barely smaller


In [59]:
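# max_features=20 keeps the 20 terms with the highest total count in the corpus
top_20 = CountVectorizer(stop_words='english', max_features=20)
top_20.fit(data_train.data)
# vocabulary_ maps each word to its column index, not to its count
print(top_20.vocabulary_)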

{u'know': 10, u'don': 2, u'say': 14, u'just': 9, u'like': 11, u'edu': 3, u'people': 13, u'time': 17, u'data': 0, u'image': 7, u'jesus': 8, u'space': 15, u'graphics': 6, u'does': 1, u'think': 16, u'nasa': 12, u'good': 5, u'use': 18, u'god': 4, u'way': 19}


In [80]:
# Pair each document with its class label so we can filter by class
X = pd.DataFrame(data_train.data, columns=['data'])
y = pd.DataFrame(data_train.target, columns=['class'])
stuff = pd.concat([X, y], axis=1)
top_20_byclass = {}

for i in range(4):
    # Fit a fresh vectorizer per class; vocabulary_ maps word -> column index
    vect = CountVectorizer(stop_words='english', max_features=20)
    vect.fit(stuff[stuff['class'] == i]['data'])
    top_20_byclass[i] = vect.vocabulary_

print("top 20 by class")
print(top_20_byclass)

top 20 by class
{0: {u'people': 12, u'time': 17, u'argument': 0, u'say': 15, u'religion': 13, u'atheists': 2, u'don': 6, u'jesus': 8, u'does': 5, u'way': 19, u'true': 18, u'atheism': 1, u'said': 14, u'just': 9, u'think': 16, u'bible': 4, u'like': 11, u'god': 7, u'believe': 3, u'know': 10}, 1: {u'format': 7, u'data': 2, u'image': 11, u'gif': 9, u'ftp': 8, u'graphics': 10, u'does': 3, u'software': 18, u'available': 0, u'use': 19, u'pub': 17, u'like': 15, u'images': 12, u'file': 5, u'edu': 4, u'jpeg': 13, u'color': 1, u'program': 16, u'files': 6, u'know': 14}, 2: {u'people': 12, u'time': 17, u'data': 0, u'just': 3, u'year': 19, u'space': 16, u'launch': 4, u'orbit': 11, u'new': 10, u'don': 1, u'lunar': 6, u'shuttle': 15, u'like': 5, u'earth': 2, u'satellite': 14, u'moon': 8, u'program': 13, u'mission': 7, u'nasa': 9, u'use': 18}, 3: {u'life': 11, u'people': 13, u'time': 18, u'say': 16, u'jesus': 8, u'does': 4, u'way': 19, u'think': 17, u'don': 5, u'said': 15, u'just': 9, u'did': 3, u'bible
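
The per-class lists overlap: classes 0 and 3 both contain words such as `jesus` and `bible`. The sketch below shows the same grouping idea in plain Python 3 on made-up (text, class) pairs, and then lists the words that appear in more than one class's top list. It is an illustration of the counting logic only, not a replacement for the vectorizer.

```python
from collections import Counter, defaultdict

# Made-up (text, class) pairs standing in for data_train.data and data_train.target.
samples = [
    ("god bible god", 0),
    ("image file image gif", 1),
    ("orbit launch orbit nasa", 2),
    ("bible jesus bible jesus god", 3),
]

# Accumulate word counts separately for each class.
counts_by_class = defaultdict(Counter)
for text, label in samples:
    counts_by_class[label].update(text.split())

# Keep the two most frequent words of each class.
top_by_class = {
    label: [word for word, _ in counts.most_common(2)]
    for label, counts in counts_by_class.items()
}
print(top_by_class)

# Words present in more than one class's top list carry little class signal.
seen = Counter(w for words in top_by_class.values() for w in words)
shared = sorted(w for w, n in seen.items() if n > 1)
print(shared)
```

Shared words are also the reason that merging four top-20 lists into one vocabulary gives fewer than 80 distinct entries.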

In [125]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
logit = LogisticRegression()

def vectorizer(v):
    """Fit vectorizer v on the training set, then train and evaluate logit."""
    X_vect = v.fit_transform(data_train.data)
    y_train = data_train.target

    # Only transform the test set: re-fitting would change the vocabulary
    X_test_vect = v.transform(data_test.data)
    y_test = data_test.target

    logit.fit(X_vect, y_train)
    y_pred = logit.predict(X_test_vect)
    # Rows are true classes, columns are predicted classes
    cnf = confusion_matrix(y_test, y_pred)
    labels = ['pred_0', 'pred_1', 'pred_2', 'pred_3']
    labels2 = ['0', '1', '2', '3']
    print(pd.DataFrame(cnf, columns=labels, index=labels2))
    print("acc score")
    print(logit.score(X_test_vect, y_test))
    print("number of words")
    print(len(v.vocabulary_))

v = CountVectorizer(stop_words='english')
print("Logit stuff")
vectorizer(v)

Logit stuff
   pred_0  pred_1  pred_2  pred_3
0     187      16      46      70
1      13     345      28       3
2      22      23     333      16
3      67      14      27     143
acc score
0.745011086475
number of words
26576
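
To "investigate what gets mixed", the confusion matrix can be read numerically. The sketch below (Python 3, standard library only) copies the matrix printed above and derives the accuracy, the per-class recall, and the largest off-diagonal cell. The variable names are my own.

```python
# Confusion matrix copied from the output above (rows: true class, columns: predicted).
cnf = [
    [187, 16, 46, 70],
    [13, 345, 28, 3],
    [22, 23, 333, 16],
    [67, 14, 27, 143],
]
n = len(cnf)

# Accuracy is the diagonal (correct predictions) divided by the total count.
total = sum(sum(row) for row in cnf)
correct = sum(cnf[i][i] for i in range(n))
accuracy = correct / total  # true division (Python 3)
print("accuracy: %.6f" % accuracy)

# Recall per class: share of each true class that was predicted correctly.
recall = [cnf[i][i] / sum(cnf[i]) for i in range(n)]
for i, r in enumerate(recall):
    print("class %d recall: %.3f" % (i, r))

# The largest off-diagonal cell is the most frequent confusion.
worst = max(
    ((i, j) for i in range(n) for j in range(n) if i != j),
    key=lambda ij: cnf[ij[0]][ij[1]],
)
print("most common confusion: true %d predicted as %d" % worst)
```

The accuracy reproduces the `acc score` printed by the cell, and the most frequent confusion is between classes 0 and 3, the two groups whose top-word lists overlap.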


In [126]:
# Keep only the 100 most frequent terms
v = CountVectorizer(stop_words='english', max_features=100)
print("Acc score Logit, only 100 features")
vectorizer(v)

Acc score Logit, only 100 features
   pred_0  pred_1  pred_2  pred_3
0     161      28      56      74
1      19     302      59       9
2      38      34     298      24
3     107      19      48      77
acc score
0.619364375462
number of words
100


In [127]:
# A float max_df is a share of documents: drop terms found in more than 0.2% of them
v = CountVectorizer(stop_words='english', max_df=.002)
print("Acc score Logit, max_df = .002")
vectorizer(v)

Acc score Logit, max_df = .002
   pred_0  pred_1  pred_2  pred_3
0     137     106      38      38
1      11     341      32       5
2      32     137     217       8
3      56      91      32      72
acc score
0.566888396157
number of words
21260


In [128]:
# A float min_df is a share of documents: keep only terms found in at least 20% of them
v = CountVectorizer(stop_words='english', min_df=.2)
print("Acc score Logit, min_df = .2")
vectorizer(v)

Acc score Logit, min_df = .2
   pred_0  pred_1  pred_2  pred_3
0      42     205      70       2
1      23     264     101       1
2      31     257     105       1
3      25     180      44       2
acc score
0.305247597931
number of words
4
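
The two runs above show how aggressive document-frequency cut-offs can be: `min_df = .2` leaves only 4 words. As a rough sketch of the filtering rule in plain Python 3, the function below keeps a term when the share of documents containing it lies between `min_df` and `max_df`. The helper name `filter_by_df` and the toy documents are made up, and the sketch handles only float (proportion) thresholds, not the integer counts that scikit-learn also accepts.

```python
# Made-up tokenized documents, for illustration only.
docs = [
    ["space", "nasa", "launch"],
    ["space", "orbit"],
    ["space", "image", "gif"],
    ["space", "nasa", "image"],
    ["god", "space"],
]


def filter_by_df(documents, min_df=0.0, max_df=1.0):
    """Keep terms whose document-frequency share lies in [min_df, max_df]."""
    n_docs = len(documents)
    df = {}
    for doc in documents:
        for term in set(doc):  # count each term once per document
            df[term] = df.get(term, 0) + 1
    # n / n_docs is true division (Python 3)
    return sorted(t for t, n in df.items() if min_df <= n / n_docs <= max_df)


print(filter_by_df(docs))               # no filtering: all 7 terms
print(filter_by_df(docs, max_df=0.5))   # drops "space", found in every document
print(filter_by_df(docs, min_df=0.3))   # drops terms found in a single document
```

A low `max_df` removes the frequent words, and a high `min_df` removes the rare ones. Either extreme discards most of the vocabulary.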


In [129]:
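# Merge the four per-class top-20 lists (iterating a dict yields its keys, the words)
fixed_80 = []
for x in top_20_byclass:
    fixed_80 += top_20_byclass[x]

# Words shared between classes collapse to one entry, so fewer than 80 remain
v = CountVectorizer(stop_words='english', vocabulary=list(set(fixed_80)))
print("Acc score Logit with fixed vocabulary (at most 80 words)")
vectorizer(v)

Acc score Logit with fixed vocabulary (at most 80 words)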
   pred_0  pred_1  pred_2  pred_3
0     160      67      33      59
1      25     314      44       6
2      37      88     247      22
3      90      70      15      76
acc score
0.589061345159
number of words
54


## 3. Hashing and TF-IDF

Let's see if Hashing or TF-IDF improves the accuracy.

1. Initialize a HashingVectorizer and repeat the test with no restriction on the number of features
- does the score improve with respect to the count vectorizer?
    - can you change any of the default parameters to improve it?
- print out the number of features for this model
- Initialize a TF-IDF Vectorizer and repeat the analysis above
- can you improve on your best score above?
    - can you change any of the default parameters to improve it?
- print out the number of features for this model
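
A hashing vectorizer skips the vocabulary altogether: each token is hashed straight to a column index, so the number of features is fixed in advance. The sketch below illustrates that idea in plain Python 3. It uses `zlib.crc32` as a stand-in hash and a hypothetical helper `hash_vectorize`; it leaves out details of the real `HashingVectorizer`, such as its own hash function and its normalization.

```python
import zlib

N_FEATURES = 16  # tiny on purpose; HashingVectorizer's default is 2 ** 20


def hash_vectorize(tokens, n_features=N_FEATURES):
    """Map tokens to a fixed-length count vector without storing a vocabulary."""
    vec = [0] * n_features
    for tok in tokens:
        # crc32 is a stand-in hash; it is stable across runs, unlike built-in hash()
        idx = zlib.crc32(tok.encode("utf-8")) % n_features
        vec[idx] += 1
    return vec


train_vec = hash_vectorize("nasa launch space space".split())
# A word never seen before still lands in a bucket; no fitting step is needed.
test_vec = hash_vectorize("shuttle space".split())

print(len(train_vec), sum(train_vec))
print(len(test_vec), sum(test_vec))
```

The price of the fixed size is that different words can collide in the same bucket, and that the mapping cannot be inverted to recover the words.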

In [136]:
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer

def func(v):
    """Same as vectorizer(), but reports n_features: a hashing vectorizer has no vocabulary_."""
    X_vect = v.fit_transform(data_train.data)
    y_train = data_train.target

    X_test_vect = v.transform(data_test.data)
    y_test = data_test.target

    logit.fit(X_vect, y_train)
    y_pred = logit.predict(X_test_vect)
    # Rows are true classes, columns are predicted classes
    cnf = confusion_matrix(y_test, y_pred)
    labels = ['pred_0', 'pred_1', 'pred_2', 'pred_3']
    labels2 = ['0', '1', '2', '3']
    print(pd.DataFrame(cnf, columns=labels, index=labels2))

    print("acc score")
    print(logit.score(X_test_vect, y_test))
    print("number of words")
    # n_features is the number of hash buckets, not the number of distinct words
    print(v.n_features)

print("Hashing")
hv = HashingVectorizer(stop_words='english')
func(hv)
print("did not improve")
print("I don't see any parameters that would improve it")

Hashing
   pred_0  pred_1  pred_2  pred_3
0     197      15      65      42
1       9     347      32       1
2      21      23     350       0
3      86      18      44     103
acc score
0.736881005174
number of words
1048576
did not improve
I don't see any parameters that would improve it


In [139]:
print("TFIDF")
tv = TfidfVectorizer(stop_words='english')
vectorizer(tv)
print("it is about .003 better than the CountVectorizer score")

TFIDF
   pred_0  pred_1  pred_2  pred_3
0     198      15      65      41
1       8     351      29       1
2      17      21     356       0
3      82      16      46     107
acc score
0.747967479675
number of words
26576
it is about .003 better than the CountVectorizer score


In [150]:
print("TFIDF")
# Drop terms that appear in more than 20% of the documents
tv = TfidfVectorizer(stop_words='english', max_df=.2)
vectorizer(tv)
print("changing max_df to .2 improves score")

TFIDF
   pred_0  pred_1  pred_2  pred_3
0     194      14      66      45
1       7     351      30       1
2      13      22     359       0
3      79      16      46     110
acc score
0.749445676275
number of words
26572
changing max_df to .2 improves score
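
For intuition about what TF-IDF changes relative to raw counts, here is a small worked sketch in plain Python 3. It uses the smoothed formula that scikit-learn documents for its default settings, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization. The three toy documents are made up, and the code illustrates the arithmetic only.

```python
import math
from collections import Counter

# Made-up tokenized documents, for illustration only.
docs = [
    ["space", "nasa", "space"],
    ["space", "image"],
    ["space", "god"],
]

n_docs = len(docs)
# Document frequency: in how many documents does each term occur?
df = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}


def tfidf(doc):
    """Return the L2-normalized tf-idf weights of one tokenized document."""
    counts = Counter(doc)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


print(idf)
print(tfidf(docs[1]))
```

In the second document both words occur once, yet `image` receives a higher weight than `space` because `space` appears in every document and therefore says little about any one of them.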


## 4. Classifier comparison

Of all the vectorizers tested above, choose one that has a reasonable performance with a manageable number of features and compare the performance of these models:

- KNN
- Logistic Regression
- Decision Trees
- Support Vector Machine
- Random Forest
- Extra Trees

In order to speed up the calculation it's better to vectorize the data only once and then compare the models.
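
One way to organise such a comparison is to keep the models in a dictionary and loop over them, reusing the same feature matrices each time. The sketch below shows the pattern in plain Python 3 with two hypothetical toy classifiers (`MajorityClassifier` and `KeywordClassifier`) and made-up feature rows. They only imitate the `fit`/`score` interface that scikit-learn classifiers provide, so real estimators could be dropped into the same loop.

```python
from collections import Counter


class MajorityClassifier:
    """Hypothetical baseline: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_] * len(X)

    def score(self, X, y):
        # Accuracy: share of predictions that match the true labels
        preds = self.predict(X)
        return sum(p == t for p, t in zip(preds, y)) / len(y)


class KeywordClassifier(MajorityClassifier):
    """Hypothetical toy model: predicts 1 when feature 0 is non-zero, else 0."""

    def fit(self, X, y):
        return self

    def predict(self, X):
        return [1 if row[0] > 0 else 0 for row in X]


# Made-up feature rows, vectorized once and shared by every model.
X_train, y_train = [[1, 0], [0, 2], [0, 1]], [1, 0, 0]
X_test, y_test = [[2, 0], [0, 1]], [1, 0]

models = {"majority": MajorityClassifier(), "keyword": KeywordClassifier()}
scores = {}
for name, model in models.items():
    scores[name] = model.fit(X_train, y_train).score(X_test, y_test)
    print(name, scores[name])
```

Collecting the scores in one dictionary also makes it easy to sort or plot them afterwards.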

In [170]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Vectorize once with TF-IDF and reuse the same matrices for every model
v = TfidfVectorizer(stop_words='english')
X_vect = v.fit_transform(data_train.data)
y_train = data_train.target

X_test_vect = v.transform(data_test.data)
y_test = data_test.target

def vectorizer2(model):
    """Train model on the shared TF-IDF features and print its test results."""
    model.fit(X_vect, y_train)
    y_pred = model.predict(X_test_vect)
    # Rows are true classes, columns are predicted classes
    cnf = confusion_matrix(y_test, y_pred)
    labels = ['pred_0', 'pred_1', 'pred_2', 'pred_3']
    labels2 = ['0', '1', '2', '3']
    print(pd.DataFrame(cnf, columns=labels, index=labels2))
    print("acc score")
    print(model.score(X_test_vect, y_test))
    

In [171]:
print("Logistic Regression")
model = LogisticRegression()
vectorizer2(model)

Logistic Regression
   pred_0  pred_1  pred_2  pred_3
0     198      15      65      41
1       8     351      29       1
2      17      21     356       0
3      82      16      46     107
acc score
0.747967479675


In [172]:
print("DecisionTreeClassifier")
model = DecisionTreeClassifier()
vectorizer2(model)

DecisionTreeClassifier
   pred_0  pred_1  pred_2  pred_3
0     146      55      42      76
1      21     315      36      17
2      33      75     261      25
3      84      53      16      98
acc score
0.606060606061


In [173]:
print("RandomForestClassifier")
model = RandomForestClassifier()
vectorizer2(model)

RandomForestClassifier
   pred_0  pred_1  pred_2  pred_3
0     196      33      35      55
1      25     325      32       7
2      44      63     275      12
3     127      31      20      73
acc score
0.642276422764


In [174]:
print("ExtraTreesClassifier")
model = ExtraTreesClassifier()
vectorizer2(model)

ExtraTreesClassifier
   pred_0  pred_1  pred_2  pred_3
0     195      34      45      45
1      21     345      20       3
2      33      63     293       5
3      92      34      40      85
acc score
0.678492239468


In [175]:
print("KNeighborsClassifier")
model = KNeighborsClassifier()
vectorizer2(model)

KNeighborsClassifier
   pred_0  pred_1  pred_2  pred_3
0     115      95      55      54
1     140     117      64      68
2     111     137      80      66
3      66      87      42      56
acc score
0.271988174427


In [176]:
print("SVC")
model = SVC()
vectorizer2(model)

SVC
   pred_0  pred_1  pred_2  pred_3
0       0       0     319       0
1       0       0     389       0
2       0       0     394       0
3       0       0     251       0
acc score
0.291204730229
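
The SVC matrix above has a single non-zero column: the model predicted class 2 for every test document. The short check below (Python 3, standard library only) copies that matrix and confirms that the reported accuracy is simply the share of class 2 in the test set, so this untuned `SVC()` is no better than always guessing the largest class. The variable names are my own.

```python
# Confusion matrix copied from the SVC output above (rows: true, columns: predicted).
cnf = [
    [0, 0, 319, 0],
    [0, 0, 389, 0],
    [0, 0, 394, 0],
    [0, 0, 251, 0],
]

n_classes = len(cnf)
# Column sums: how often each class was predicted.
predicted_per_class = [sum(row[j] for row in cnf) for j in range(n_classes)]
# Row sums: how many test documents truly belong to each class.
true_per_class = [sum(row) for row in cnf]
total = sum(true_per_class)

print("predictions per class:", predicted_per_class)

# A model that always predicts one class scores exactly that class's share of the test set.
only_class = predicted_per_class.index(total)
baseline = true_per_class[only_class] / total  # true division (Python 3)
print("always-class-%d accuracy: %.6f" % (only_class, baseline))
```

The default kernel and regularization settings were not tuned in this lab. Trying a linear kernel or adjusting `C` and `gamma` would be a natural next step, though that is a suggestion and has not been tested here.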


## Bonus: Other classifiers

Adapt the code from [this example](http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html#example-text-document-classification-20newsgroups-py) to compare all the suggested classifiers and to display the final plot.

## Bonus: NLTK

NLTK is a vast library. Can you find some interesting bits to share with classmates?
Start here: http://www.nltk.org/