# Natural Language Processing Lab

In this lab we will further explore the text-processing capabilities of scikit-learn and NLTK. We will use the 20 Newsgroups dataset, which is provided by scikit-learn.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
from sklearn.datasets import fetch_20newsgroups

In [3]:

categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]

data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=('headers', 'footers', 'quotes'))

data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=('headers', 'footers', 'quotes'))

## 1. Data inspection

We have downloaded a few newsgroup categories and removed headers, footers and quotes.

Let's inspect them.

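1. What data type is `data_train`?
> `sklearn.datasets.base.Bunch` (exposed as `sklearn.utils.Bunch` in recent scikit-learn releases)
- Is it like a list? Or like a dictionary? Or something else?
> Like a dictionary: a `Bunch` is a `dict` subclass whose keys can also be read as attributes.
- How many data points does it contain?
- Inspect the first data point. What does it look like?
> A blurb of text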
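
The inspection questions above can be explored along these lines. This is a minimal sketch that builds a small stand-in `Bunch` by hand (the `toy` object and its two documents are made up for illustration) so that it runs without downloading the dataset; the same calls apply to `data_train`.

```python
from sklearn.utils import Bunch

# Hypothetical stand-in for the object returned by fetch_20newsgroups
toy = Bunch(
    data=["First post about rockets.", "Second post about rendering."],
    target=[1, 0],
    target_names=["comp.graphics", "sci.space"],
)

print(type(toy).__name__)      # class name of the container
print(isinstance(toy, dict))   # Bunch is a dict subclass
print(sorted(toy.keys()))      # fields available in the container
print(len(toy.data))           # number of documents
print(toy["data"][0])          # key access ...
print(toy.data[0])             # ... and attribute access give the same text
```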

## 2. Bag of Words model

Let's train a model using a simple count vectorizer

1. Initialize a standard CountVectorizer and fit the training data
- how big is the feature dictionary
- repeat eliminating english stop words
- is the dictionary smaller?
- transform the training data using the trained vectorizer
- what are the 20 words that are most common in the whole corpus?
- what are the 20 most common words in each of the 4 classes?
- evaluate the performance of a Logistic Regression on the features extracted by the CountVectorizer
    - you will have to transform the test set too. Be careful to use the trained vectorizer, without re-fitting it
- try the following three modifications:
    - restrict the max_features
    - change max_df and min_df
    - use a fixed vocabulary of size 80 combining the 20 most common words per group found earlier
- for each of the above print a confusion matrix and investigate what gets mixed
> Answer: not surprisingly, if we reduce the feature space we lose accuracy
- print out the number of features for each model
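
As a rough illustration of the steps listed above, the sketch below works on a four-document toy corpus with two made-up classes rather than on the newsgroup data. The helper `top_words` and all variable names are hypothetical. It compares vocabulary sizes with and without stop words, finds the most common words per class, and builds a fixed vocabulary from them.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the rocket reached the orbit",
    "the orbit of the moon",
    "the image uses a texture file",
    "save the image in a file format",
]
labels = np.array([0, 0, 1, 1])  # 0 = space, 1 = graphics (toy classes)

# Vocabulary size with and without English stop words
plain = CountVectorizer().fit(docs)
no_stop = CountVectorizer(stop_words="english").fit(docs)
print(len(plain.vocabulary_), len(no_stop.vocabulary_))

def top_words(vectorizer, texts, n):
    """Return the n most frequent vocabulary terms in texts."""
    counts = np.asarray(vectorizer.transform(texts).sum(axis=0)).ravel()
    terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
    order = np.argsort(-counts, kind="stable")[:n]
    return [terms[i] for i in order]

# Most common words in each class
per_class = {
    int(c): top_words(no_stop, [d for d, y in zip(docs, labels) if y == c], 2)
    for c in np.unique(labels)
}
print(per_class)

# Fixed vocabulary made of the per-class top words
fixed_vocab = sorted({w for words in per_class.values() for w in words})
fixed = CountVectorizer(vocabulary=fixed_vocab)
X_fixed = fixed.fit_transform(docs)
print(X_fixed.shape)
```

On the real data the same pattern would use 20 words per class, giving at most 80 terms, since words shared between classes are counted once.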

In [5]:
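from sklearn.feature_extraction.text import CountVectorizer

# Fit both vectorizers on the training data only
cvec1 = CountVectorizer()                      # full vocabulary
cvec1.fit(data_train['data'])
cvec = CountVectorizer(stop_words='english')   # English stop words removed
cvec.fit(data_train['data'])

# Vocabulary sizes without and with stop-word removal
# (get_feature_names_out needs scikit-learn >= 1.0; older releases use get_feature_names)
print(len(cvec1.get_feature_names_out()),
      len(cvec.get_feature_names_out()))

# Dense document-term matrix: fine for about 2,000 documents, memory-hungry for larger corpora
X_train = pd.DataFrame(cvec1.transform(data_train['data']).toarray(),
                       columns=cvec1.get_feature_names_out())

# Show the 20 most frequent words of the first document
X_train.transpose().sort_values(0, ascending=False).head(20).transpose()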


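|   | the | file | to | in | prj | is | 3ds | you | save | are | format | anyone | that | texture | information | orientation | does | if | from | model |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 6 | 4 | 3 | 3 | 3 | 3 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1 | 1 |
| 1 | 1 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 11 | 0 | 14 | 2 | 0 | 2 | 0 | 5 | 0 | 0 | 0 | 1 | 9 | 0 | 0 | 0 | 0 | 1 | 2 | 0 |
| 4 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 5 | 4 | 0 | 3 | 2 | 0 | 1 | 0 | 2 | 0 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 2 | 0 | 1 | 2 | 0 | 2 | 0 | 1 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 1 | 2 | 0 |
| 7 | 10 | 0 | 3 | 1 | 0 | 3 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 1 | 0 |
| 8 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 1 | 0 | 3 | 3 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |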


In [6]:
# Note: 'target' here is the count column for the token "target", not the class labels
X_train['target'].value_counts()

0    2022
1      10
2       2
Name: target, dtype: int64

In [7]:
word_counts = X_train.sum(axis=0)
word_counts.sort_values(ascending = False).head(20)

the     19159
to      10000
of       9931
and      8233
is       6551
in       5764
that     5602
it       4589
for      3950
you      3765
this     2833
be       2792
on       2687
are      2683
not      2608
as       2412
have     2340
or       2104
with     2049
but      1861
dtype: int64

In [9]:

# Class labels come from the Bunch objects, not from the word-count DataFrame
y_train = data_train['target']
y_test = data_test['target']

# Transform the test set with the vectorizer fitted earlier (no re-fitting),
# reusing the training columns so both matrices line up
X_test = pd.DataFrame(cvec1.transform(data_test['data']).toarray(),
                      columns=X_train.columns)

X_test.transpose().sort_values(0, ascending=False).head(20).transpose()

In [None]:
from sklearn import linear_model

# Fit a logistic regression on the training counts and report test-set accuracy.
# max_iter is raised from its default of 100 to help the solver converge on raw counts.
logis = linear_model.LogisticRegression(max_iter=1000)
logis.fit(X_train, y_train)
logis.score(X_test, y_test)
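
The evaluation step, including the confusion matrix asked for in the task list, might look like the sketch below. It uses a tiny two-class toy corpus (all documents and labels are made up) so that it runs on its own. The point it illustrates is that the vectorizer is fitted on the training documents and only applied to the test documents.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

train_docs = [
    "the rocket reached orbit around the moon",
    "nasa launched a new satellite into orbit",
    "the shuttle mission to the space station",
    "render the image with a texture map",
    "save the polygon mesh in a graphics file",
    "the image format stores each pixel color",
]
train_y = [0, 0, 0, 1, 1, 1]   # 0 = space, 1 = graphics (toy labels)
test_docs = ["the satellite left orbit", "open the image file"]
test_y = [0, 1]

vec = CountVectorizer(stop_words="english")
X_tr = vec.fit_transform(train_docs)   # vocabulary learned on training data only
X_te = vec.transform(test_docs)        # same vocabulary reused, no re-fitting

clf = LogisticRegression(max_iter=1000).fit(X_tr, train_y)
pred = clf.predict(X_te)

print("features:", X_tr.shape[1])
print("accuracy:", accuracy_score(test_y, pred))
# Rows are true classes, columns are predicted classes
cm = confusion_matrix(test_y, pred, labels=[0, 1])
print(cm)
```

Off-diagonal entries of the matrix show which classes get mixed up; on the newsgroup data the matrix would be 4 x 4.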


## 3. Hashing and TF-IDF

Let's see if Hashing or TF-IDF improves the accuracy.

1. Initialize a HashingVectorizer and repeat the test with no restriction on the number of features
- does the score improve with respect to the count vectorizer?
    - can you change any of the default parameters to improve it?
- print out the number of features for this model
- Initialize a TF-IDF Vectorizer and repeat the analysis above
- can you improve on your best score above?
    - can you change any of the default parameters to improve it?
- print out the number of features for this model
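
The two vectorizers in this section differ in how they define the feature space, which the following sketch tries to make visible on a made-up three-document corpus. The reduced `n_features=2**10` is an arbitrary choice for the example, not a recommended setting.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer

docs = [
    "the rocket reached orbit around the moon",
    "render the image with a texture map",
    "the image format stores each pixel color",
]

# HashingVectorizer is stateless: tokens are hashed straight to column indices,
# so the number of features is fixed up front and no vocabulary is stored
hasher = HashingVectorizer(stop_words="english", n_features=2**10)
X_hash = hasher.transform(docs)

# TfidfVectorizer learns a vocabulary, so its feature count depends on the data
tfidf = TfidfVectorizer(stop_words="english")
X_tfidf = tfidf.fit_transform(docs)

print("default hashing features:", HashingVectorizer().n_features)
print("hashing matrix:", X_hash.shape)
print("tf-idf matrix:", X_tfidf.shape)
```

Both transformers return sparse matrices, which scikit-learn classifiers such as `LogisticRegression` accept directly, so there is no need to convert them to dense arrays.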

In [None]:
from sklearn.feature_extraction.text import HashingVectorizer

# HashingVectorizer is stateless: it stores no vocabulary, so fit() learns nothing
hvec = HashingVectorizer(stop_words='english')
hvec.fit(data_train['data'])

# Keep the result sparse: with the default n_features=2**20 (1048576 columns),
# a dense copy of about 2,000 documents would need many gigabytes of memory
X_train_hash = hvec.transform(data_train['data'])
print(X_train_hash.shape)

# There are no feature names to list; the feature count is fixed by n_features
print(len(cvec1.vocabulary_), hvec.n_features)

In [None]:
# Class labels come from the Bunch objects
y_train = data_train['target']
y_test = data_test['target']

# Reuse the same hashing vectorizer for the test set and keep the matrix sparse
X_test_hash = hvec.transform(data_test['data'])

X_test_hash.shape


## 4. Classifier comparison

Of all the vectorizers tested above, choose one that has a reasonable performance with a manageable number of features and compare the performance of these models:

- KNN
- Logistic Regression
- Decision Trees
- Support Vector Machine
- Random Forest
- Extra Trees

In order to speed up the calculation it's better to vectorize the data only once and then compare the models.
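
One possible shape for this comparison is sketched below: the documents are vectorized once, and every model is then fitted on the same matrices. The toy corpus, the choice of `TfidfVectorizer`, and the hyperparameters are all assumptions made for the example. The scores on such a tiny dataset carry no meaning.

```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

train_docs = [
    "the rocket reached orbit around the moon",
    "nasa launched a new satellite into orbit",
    "the shuttle mission to the space station",
    "render the image with a texture map",
    "save the polygon mesh in a graphics file",
    "the image format stores each pixel color",
]
train_y = [0, 0, 0, 1, 1, 1]   # 0 = space, 1 = graphics (toy labels)
test_docs = ["the satellite left orbit", "open the image file"]
test_y = [0, 1]

# Vectorize once, then reuse the same matrices for every model
vec = TfidfVectorizer(stop_words="english")
X_tr = vec.fit_transform(train_docs)
X_te = vec.transform(test_docs)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(kernel="linear"),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Extra Trees": ExtraTreesClassifier(n_estimators=50, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, train_y)
    scores[name] = model.score(X_te, test_y)   # accuracy on the held-out documents

for name, score in scores.items():
    print(f"{name:20s} {score:.2f}")
```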

## Bonus: Other classifiers

Adapt the code from [this example](http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html#example-text-document-classification-20newsgroups-py) to compare all the classifiers suggested there and to display the final plot.

## Bonus: NLTK

NLTK is a vast library. Can you find some interesting bits to share with classmates?
Start here: http://www.nltk.org/
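
As one possible starting point, the sketch below combines a few NLTK pieces that, as far as I know, work without downloading any corpora: a regular-expression tokenizer, the Porter stemmer, a frequency distribution, and n-grams. The sample sentence is made up.

```python
from nltk import FreqDist
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.util import ngrams

text = "The rockets were launching while the rocket engineers watched the launch."

# Split on runs of word characters and lowercase the tokens
tokenizer = RegexpTokenizer(r"\w+")
tokens = [t.lower() for t in tokenizer.tokenize(text)]

# Reduce inflected forms to a common stem
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]

# Count the stems and list adjacent token pairs
freq = FreqDist(stems)
bigrams = list(ngrams(tokens, 2))

print(stems)
print(freq.most_common(3))
print(bigrams[:3])
```

Stemming before counting merges forms such as "rockets" and "rocket" into one feature, which can shrink the vocabulary produced by a vectorizer.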