# Natural Language Processing Lab

In this lab we will further explore the text-processing capabilities of scikit-learn and NLTK. We will use the 20 Newsgroups dataset, which is provided by scikit-learn.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
from sklearn.datasets import fetch_20newsgroups

In [3]:

categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]

data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=('headers', 'footers', 'quotes'))

data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=('headers', 'footers', 'quotes'))



## 1. Data inspection

We have downloaded a few newsgroup categories and removed headers, footers and quotes.

Let's inspect them.

1. What data type is `data_train`?
> sklearn.datasets.base.Bunch
- Is it like a list, a dictionary, or something else?
> A dictionary (a `Bunch` is a dict subclass whose keys can also be read as attributes)
- How many data points does it contain?
> 2034
- Inspect the first data point. What does it look like?
> A blurb of text
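
The answers above can be checked without downloading anything. The following is a minimal sketch, assuming a hand-built `Bunch` (`toy`, a hypothetical stand-in for `data_train`) and Python 3 style `print()` calls rather than the Python 2 statements used in the notebook cells.

```python
from sklearn.utils import Bunch

# Hypothetical stand-in for data_train, built by hand so the sketch
# runs without fetching the real dataset.
toy = Bunch(data=["first post", "second post"],
            target=[0, 1],
            target_names=["comp.graphics", "sci.space"])

# A Bunch is a dict subclass, so the usual dict operations work...
is_dict = isinstance(toy, dict)
keys = sorted(toy.keys())

# ...and every key is also readable as an attribute.
same_object = toy["data"] is toy.data

n_points = len(toy.data)      # number of documents
first_point = toy.data[0]     # a plain string of text
print(is_dict, keys, same_object, n_points, first_point)
```

The same calls (`keys()`, `len(...)`, indexing into `data`) are the ones used on `data_train` in the cells that follow.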

In [7]:
print data_train.keys()

['description', 'DESCR', 'filenames', 'target_names', 'data', 'target']


In [8]:
type(data_train) ## a fancier dictionary

sklearn.datasets.base.Bunch

In [10]:
len(data_train['data'])

2034

In [11]:
data_train['data'][0]

u"Hi,\n\nI've noticed that if you only save a model (with all your mapping planes\npositioned carefully) to a .3DS file that when you reload it after restarting\n3DS, they are given a default position and orientation.  But if you save\nto a .PRJ file their positions/orientation are preserved.  Does anyone\nknow why this information is not stored in the .3DS file?  Nothing is\nexplicitly said in the manual about saving texture rules in the .PRJ file. \nI'd like to be able to read the texture rule information, does anyone have \nthe format for the .PRJ file?\n\nIs the .CEL file format available from somewhere?\n\nRych"

## 2. Bag of Words model

Let's train a model using a simple count vectorizer.

1. Initialize a standard CountVectorizer and fit it on the training data
- how big is the feature dictionary?
- repeat, eliminating English stop words
- is the dictionary smaller?
- transform the training data using the trained vectorizer
- what are the 20 words that are most common in the whole corpus?
- what are the 20 most common words in each of the 4 classes?
- evaluate the performance of a Logistic Regression on the features extracted by the CountVectorizer
    - you will have to transform the test set too. Be careful to use the trained vectorizer, without re-fitting it
- try the following 3 modifications:
    - restrict `max_features`
    - change `max_df` and `min_df`
    - use a fixed vocabulary of size 80 combining the 20 most common words per group found earlier
- for each of the above, print a confusion matrix and investigate what gets mixed
> Answer: not surprisingly, if we reduce the feature space we lose accuracy
- print out the number of features for each model
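
Several of the steps above (vocabulary size, stop-word removal, transforming test data without re-fitting, and a fixed vocabulary) can be sketched on a toy corpus. The documents and variable names below are made up for illustration and are not part of the lab data; the code is written for Python 3.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the newsgroup posts.
train_docs = ["the rocket reached orbit",
              "the image was rendered in the graphics program",
              "god and the church"]
test_docs = ["the rocket image telescope"]

# Fit once with the default settings and once without English stop words.
plain = CountVectorizer().fit(train_docs)
no_stop = CountVectorizer(stop_words='english').fit(train_docs)
size_plain = len(plain.vocabulary_)
size_no_stop = len(no_stop.vocabulary_)
print(size_plain, size_no_stop)

# Transform the test documents with the already fitted vectorizer.
# Calling fit or fit_transform here would rebuild the vocabulary, so only
# transform is used; words never seen in training ("telescope") are ignored.
X_test = no_stop.transform(test_docs)

# A fixed vocabulary skips fitting altogether: columns follow the given order.
fixed = CountVectorizer(vocabulary=['rocket', 'image', 'god'])
X_fixed = fixed.transform(test_docs)
print(X_fixed.toarray())
```

For the lab, the fixed vocabulary would be the 80 words collected from the per-class top-20 lists rather than the three words used here.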

In [50]:
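# Characters stripped from every document before vectorizing.
# NOTE: each one is replaced with an empty string, so words separated only by
# a newline or tab get fused (e.g. "planes\npositioned" -> "planespositioned");
# replacing with ' ' instead would keep them apart.
bad_chars = ["\n", "\t", "_"]

def remove_bad_chars(document):
    for char in bad_chars:
        document = document.replace(char, '')
    return document

# In Python 2, map() returns a list, which later cells rely on.
clean_data = lambda data_dict: map(remove_bad_chars, data_dict['data'])
clean_data_train = clean_data(data_train)
clean_data_test = clean_data(data_test)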

In [57]:
print data_train['data'][0]

Hi,

I've noticed that if you only save a model (with all your mapping planes
positioned carefully) to a .3DS file that when you reload it after restarting
3DS, they are given a default position and orientation.  But if you save
to a .PRJ file their positions/orientation are preserved.  Does anyone
know why this information is not stored in the .3DS file?  Nothing is
explicitly said in the manual about saving texture rules in the .PRJ file. 
I'd like to be able to read the texture rule information, does anyone have 
the format for the .PRJ file?

Is the .CEL file format available from somewhere?

Rych


In [102]:
from sklearn.feature_extraction.text import CountVectorizer
doc2vec = CountVectorizer(ngram_range=(1, 2), stop_words='english', min_df=0.005, max_df=0.95)
doc2vec.fit(clean_data_train)

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=0.95, max_features=None, min_df=0.005,
        ngram_range=(1, 2), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [71]:
for max_freq in [0.75, 0.8, 0.85, 0.9, .95]:
    for min_freq in [0.0001, 0.001, 0.005, 0.01, 0.05, 0.1]:
        doc2vec_tmp = CountVectorizer(ngram_range=(1, 2), stop_words='english', min_df=min_freq, max_df=max_freq)
        doc2vec_tmp.fit(clean_data_train)
        print "Min DF:{}, Max DF: {}, Vocabulary Size:{}".format(min_freq, max_freq, len(doc2vec_tmp.vocabulary_)) 

Min DF:0.0001, Max DF: 0.75, Vocabulary Size:205421
Min DF:0.001, Max DF: 0.75, Vocabulary Size:11235
Min DF:0.005, Max DF: 0.75, Vocabulary Size:2514
Min DF:0.01, Max DF: 0.75, Vocabulary Size:1176
Min DF:0.05, Max DF: 0.75, Vocabulary Size:95
Min DF:0.1, Max DF: 0.75, Vocabulary Size:15
Min DF:0.0001, Max DF: 0.8, Vocabulary Size:205421
Min DF:0.001, Max DF: 0.8, Vocabulary Size:11235
Min DF:0.005, Max DF: 0.8, Vocabulary Size:2514
Min DF:0.01, Max DF: 0.8, Vocabulary Size:1176
Min DF:0.05, Max DF: 0.8, Vocabulary Size:95
Min DF:0.1, Max DF: 0.8, Vocabulary Size:15
Min DF:0.0001, Max DF: 0.85, Vocabulary Size:205421
Min DF:0.001, Max DF: 0.85, Vocabulary Size:11235
Min DF:0.005, Max DF: 0.85, Vocabulary Size:2514
Min DF:0.01, Max DF: 0.85, Vocabulary Size:1176
Min DF:0.05, Max DF: 0.85, Vocabulary Size:95
Min DF:0.1, Max DF: 0.85, Vocabulary Size:15
Min DF:0.0001, Max DF: 0.9, Vocabulary Size:205421
Min DF:0.001, Max DF: 0.9, Vocabulary Size:11235
Min DF:0.005, Max DF: 0.9, Vocabular
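
In the grid above the vocabulary shrinks from 205421 to 15 terms as `min_df` rises from 0.0001 to 0.1, while the recorded sizes are identical for the `max_df` values shown, which suggests that no remaining term falls between those upper thresholds. How the two parameters prune terms can be seen on a small example. This is a sketch on a made-up four-document corpus (the `vocab` helper is hypothetical), written for Python 3.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus. Document frequencies:
# "space" appears in 3 of 4 docs, "image" in 2 of 4, every other term in 1.
docs = ["space shuttle launch",
        "space station orbit",
        "space image",
        "image file format"]

def vocab(**kwargs):
    """Return the sorted vocabulary learned with the given settings."""
    return sorted(CountVectorizer(**kwargs).fit(docs).vocabulary_)

all_terms = vocab()                    # no pruning
at_least_two_docs = vocab(min_df=2)    # int: absolute number of documents
at_least_half = vocab(min_df=0.5)      # float: proportion of documents
at_most_half = vocab(max_df=0.5)       # drops terms in more than half the docs

print(len(all_terms), at_least_two_docs, at_least_half, at_most_half)
```

An integer threshold is a document count and a float is a fraction of the corpus, so `min_df=2` and `min_df=0.5` select the same terms on four documents.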

In [103]:
df = pd.DataFrame(doc2vec.transform(clean_data_test + clean_data_train).todense(), columns=doc2vec.get_feature_names())

In [104]:
df.sum(axis=0).sort_values(ascending=False)

space            1355
don              1213
god              1207
like             1137
image            1135
people           1121
just             1052
know              978
does              953
edu               950
think             914
time              857
graphics          828
use               773
jpeg              725
data              720
good              695
say               669
way               638
file              622
available         605
jesus             593
program           587
images            576
make              567
new               565
point             548
software          528
nasa              527
believe           522
                 ... 
apologize          16
los                16
93 04              16
accepting          16
successfully       16
exception          15
honor              15
extract            15
01 14              15
04 01              15
fellow             15
ordinary           15
attempting         15
theworld           15
managed   

In [105]:
df.head()

   00  000  01  01 14  04  04 01  10  100  1000  11  ...  yeah  year  years  years ago  yes  yesterday  york  young  zero  zip
0   0    0   0      0   0      0   0    0     0   0  ...     0     0      0          0    0          0     0      0     0    0
1   0    0   0      0   0      0   0    0     0   0  ...     0     0      0          0    0          0     0      0     0    0
2   0    0   0      0   0      0   0    0     0   0  ...     0     0      0          0    0          0     0      0     0    0
3   0    0   0      0   0      0   0    0     0   0  ...     0     0      0          0    0          0     0      0     0    0
4   0    0   0      0   0      0   0    0     0   0  ...     0     0      0          0    0          0     0      0     0    0


In [106]:
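# df rows are test docs followed by train docs, so targets are stacked in that order (the recorded per-class output predates this fix)
df['target'] = np.hstack((data_test.target, data_train.target))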

In [111]:
df_target_sums = df.groupby('target').sum()

In [123]:
for label in df_target_sums.index:
    print label, data_train.target_names[label], ":"
    print df_target_sums.loc[label].sort_values(ascending=False)[:20]
    print '---'

0 alt.atheism :
god          356
space        295
don          239
image        235
think        232
like         232
jpeg         230
people       213
does         210
know         203
just         202
data         193
use          170
time         168
jesus        168
nasa         163
available    162
say          155
lord         154
edu          151
Name: 0, dtype: int64
---
1 comp.graphics :
space        463
edu          360
image        350
don          350
graphics     344
people       327
like         320
god          307
just         299
know         278
time         277
does         249
data         244
jpeg         242
good         240
use          239
available    237
file         230
think        228
ftp          198
Name: 1, dtype: int64
---
2 sci.space :
image     457
don       401
like      363
god       351
people    345
does      326
just      321
know      296
space     296
think     263
edu       237
time      234
jpeg      231
use       229
way       205
say       
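
The per-class counts can also be computed directly on the sparse matrix, without building a dense DataFrame. The sketch below uses a made-up corpus and a hypothetical helper `top_terms`; it assumes the label array is in the same row order as the matrix, which is what makes the per-class sums meaningful. It is written for Python 3.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus with one label per document, in the same order.
docs = ["god god god church",
        "space orbit space",
        "church faith",
        "orbit launch space"]
labels = np.array([0, 1, 0, 1])

vec = CountVectorizer()
X = vec.fit_transform(docs)   # sparse document-term matrix

# Column index -> term, recovered from the fitted vocabulary mapping.
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

def top_terms(label, k=2):
    """Return the k most frequent (term, count) pairs for one class."""
    rows = X[labels == label]                        # rows of that class only
    counts = np.asarray(rows.sum(axis=0)).ravel()    # total count per term
    order = np.argsort(-counts, kind='stable')[:k]   # largest counts first
    return [(terms[i], int(counts[i])) for i in order]

print(top_terms(0))
print(top_terms(1))
```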

In [141]:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import classification_report

def fit_logit(data, min_df, ngram_range=(1, 2)):
    ## fit count vectorizer
    doc2vec = CountVectorizer(ngram_range=ngram_range, stop_words='english', min_df=min_df, max_df=0.95)
    docs_train, target_train, docs_test, target_test = data
    X_train = doc2vec.fit_transform(docs_train)
    X_test = doc2vec.transform(docs_test)
    
    ## fit a cross-validated logistic regression (L2, i.e. ridge-style, penalty by default)
    logit = LogisticRegressionCV()
    logit.fit(X_train, target_train)
    
    ## print metrics 
    test_predictions = logit.predict(X_test)
    print classification_report(target_test, test_predictions, target_names = data_train.target_names)
    print logit.score(X_test, target_test)
    return logit 

In [None]:
## input data 

data = clean_data_train, data_train.target, clean_data_test, data_test.target
min_freqs = [0.0001, 0.001, 0.005, 0.01, 0.05, 0.1]
ngram_ranges = [(1, 1), (1, 2), (1, 3)]
for ngram_range in ngram_ranges:
    for freq in min_freqs:
        print 'Minimum freq hp:{}, ngram range: {}'.format(freq, ngram_range)
        fit_logit(data, freq, ngram_range=ngram_range)
# Best Hyperparameters 
#         Minimum freq hp:0.0001, ngram range: (1, 2)


In [143]:
best_logit_count_vect = fit_logit(data, 0.001, ngram_range=(1, 2))

                    precision    recall  f1-score   support

       alt.atheism       0.62      0.56      0.59       319
     comp.graphics       0.84      0.86      0.85       389
         sci.space       0.74      0.83      0.79       394
talk.religion.misc       0.59      0.53      0.55       251

       avg / total       0.71      0.72      0.72      1353

0.720620842572
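
The report above shows that alt.atheism and talk.religion.misc have the lowest f1-scores (0.59 and 0.55), but it does not show which classes are mistaken for which. A confusion matrix does. The sketch below uses made-up label vectors (`y_true`, `y_pred`) purely to show how to read the result; it is not output from the lab models.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for three classes.
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 2, 0, 1, 1, 2, 0, 2]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Off-diagonal entries are the mix-ups: cm[0, 2] counts class-0
# documents that were predicted as class 2.
mixed_0_as_2 = int(cm[0, 2])
mixed_2_as_0 = int(cm[2, 0])
```

In the notebook the two arguments would be the test targets and the predictions of the fitted model on the transformed test set.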


## 3. Hashing and TF-IDF

Let's see if Hashing or TF-IDF improves the accuracy.

1. Initialize a HashingVectorizer and repeat the test with no restriction on the number of features
- does the score improve with respect to the count vectorizer?
    - can you change any of the default parameters to improve it?
- print out the number of features for this model
- Initialize a TF-IDF Vectorizer and repeat the analysis above
- can you improve on your best score above?
    - can you change any of the default parameters to improve it?
- print out the number of features for this model
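
The cells below only cover TF-IDF, so here is a minimal sketch of how the two vectorizers differ in use. It assumes a made-up three-document corpus and a reduced `n_features` chosen only for illustration; it is written for Python 3.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer

# Hypothetical toy corpus.
docs = ["the shuttle reached orbit",
        "render the image file",
        "orbit image"]

# HashingVectorizer is stateless: it maps tokens to column indices with a
# hash function, so there is nothing to fit and no vocabulary to inspect.
# The number of features is whatever n_features says (2**20 by default).
hasher = HashingVectorizer(n_features=2**10)
X_hash = hasher.transform(docs)
default_hash_features = HashingVectorizer().n_features

# TfidfVectorizer learns a vocabulary and inverse document frequencies,
# so it must be fitted; its number of features is the vocabulary size.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)

print(X_hash.shape, default_hash_features, X_tfidf.shape)
```

Because the hashing approach keeps no vocabulary, the column-to-word mapping used earlier to list the most common terms is not available for it.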

In [144]:
from sklearn.feature_extraction.text import TfidfVectorizer

def fit_logit(data, min_df, ngram_range=(1, 2), vectorizer_type='tfidf'):
    ## build the requested vectorizer (TF-IDF by default, otherwise raw counts)
    docs_train, target_train, docs_test, target_test = data
    if vectorizer_type == 'tfidf':
        doc2vec = TfidfVectorizer(ngram_range=ngram_range, stop_words='english', min_df=min_df, max_df=0.95)
    else:
        doc2vec = CountVectorizer(ngram_range=ngram_range, stop_words='english', min_df=min_df, max_df=0.95)

    X_train = doc2vec.fit_transform(docs_train)
    X_test = doc2vec.transform(docs_test)
    
    ## fit a cross-validated logistic regression (L2, i.e. ridge-style, penalty by default)
    logit = LogisticRegressionCV()
    logit.fit(X_train, target_train)
    
    ## print metrics 
    test_predictions = logit.predict(X_test)
    print classification_report(target_test, test_predictions, target_names = data_train.target_names)
    print logit.score(X_test, target_test)
    return logit

In [154]:
## constructor inputs: data, min_df, max_df, ngram_range, vectorizer_type
## attributes set later: doc2vec, logit, test_predictions, accuracy
## methods: vectorize the documents, fit the logistic regression, predict and report
class LogisticRegressionText(object):
    
    def __init__(self, 
                 data,
                 min_df = 0.001, 
                 max_df = 0.95,
                 ngram_range = (1, 2), 
                 vectorizer_type = 'tfidf'):
        
        self.data = data 
        self.min_df = min_df
        self.max_df = max_df
        self.ngram_range = ngram_range
        self.vectorizer_type = vectorizer_type
        self.docs_train, self.target_train, self.docs_test, self.target_test = data
    
    def vectorize_docs(self):
        if self.vectorizer_type == 'tfidf':
            self.doc2vec = TfidfVectorizer(ngram_range=self.ngram_range, 
                                           min_df=self.min_df, 
                                           max_df=self.max_df,
                                           stop_words='english')
        else:
            self.doc2vec = CountVectorizer(ngram_range=self.ngram_range, 
                                           min_df=self.min_df, 
                                           max_df=self.max_df,
                                           stop_words='english')
        
        # fit on the training documents only, then reuse the fitted vectorizer on the test set
        self.X_train = self.doc2vec.fit_transform(self.docs_train)
        self.X_test = self.doc2vec.transform(self.docs_test)
        
    def fit_logit(self):
        self.vectorize_docs()
        self.logit = LogisticRegressionCV()
        self.logit.fit(self.X_train, self.target_train)
    
    def predict_logit(self):
        # note: this re-vectorizes and re-fits the model on every call
        self.fit_logit()
        self.test_predictions = self.logit.predict(self.X_test)
        self.accuracy =  self.logit.score(self.X_test, self.target_test)
        print classification_report(self.target_test, 
                                    self.test_predictions, 
                                    target_names = data_train.target_names)


In [155]:
sample_logit = LogisticRegressionText(data)
sample_logit.predict_logit()

                    precision    recall  f1-score   support

       alt.atheism       0.66      0.58      0.62       319
     comp.graphics       0.82      0.90      0.86       389
         sci.space       0.81      0.82      0.81       394
talk.religion.misc       0.61      0.60      0.61       251

       avg / total       0.74      0.75      0.74      1353



In [156]:
sample_logit.doc2vec.get_feature_names()

[u'00',
 u'00 pounds',
 u'000',
 u'000 000',
 u'000 feet',
 u'000 years',
 u'01',
 u'01 14',
 u'02',
 u'03',
 u'04',
 u'04 01',
 u'04 17computer',
 u'04 21',
 u'05',
 u'06',
 u'0674',
 u'07',
 u'08',
 u'09',
 u'10',
 u'10 00',
 u'10 000',
 u'10 10',
 u'10 15',
 u'10 20',
 u'10 30',
 u'10 clicks',
 u'10 minutes',
 u'10 years',
 u'100',
 u'100 000',
 u'100 years',
 u'1000',
 u'101',
 u'101010',
 u'101010 binary',
 u'102',
 u'102 18',
 u'1024x768',
 u'1030',
 u'104',
 u'105',
 u'107',
 u'109',
 u'11',
 u'11 nineteenth',
 u'110',
 u'111',
 u'112',
 u'113',
 u'115',
 u'1150',
 u'12',
 u'12 13',
 u'120',
 u'1200',
 u'1200 2400',
 u'121',
 u'125',
 u'128',
 u'128 102',
 u'128 149',
 u'128 214',
 u'129',
 u'129 92',
 u'13',
 u'13 38',
 u'130',
 u'130 11',
 u'130 167',
 u'131',
 u'133',
 u'1333',
 u'134',
 u'135',
 u'136',
 u'137',
 u'138',
 u'13h',
 u'14',
 u'14 1993',
 u'14 39',
 u'140',
 u'1400',
 u'141',
 u'142',
 u'144',
 u'145',
 u'146',
 u'147',
 u'149',
 u'15',
 u'15 000',
 u'15 16',
 u

In [145]:
fit_logit(data, min_df=0.001)

                    precision    recall  f1-score   support

       alt.atheism       0.66      0.58      0.62       319
     comp.graphics       0.82      0.90      0.86       389
         sci.space       0.81      0.82      0.81       394
talk.religion.misc       0.61      0.60      0.61       251

       avg / total       0.74      0.75      0.74      1353

0.745011086475


LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
           refit=True, scoring=None, solver='lbfgs', tol=0.0001, verbose=0)

## 4. Classifier comparison

Of all the vectorizers tested above, choose one that performs reasonably well with a manageable number of features, and compare the performance of these models:

- KNN
- Logistic Regression
- Decision Trees
- Support Vector Machine
- Random Forest
- Extra Trees

In order to speed up the calculation it's better to vectorize the data only once and then compare the models.
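
One way to set this up is sketched below: the documents are vectorized a single time and the resulting matrices are reused for every model. The toy corpus, the `models` dictionary and the hyperparameters are assumptions made for illustration, not tuned choices; with the lab data, the training and test matrices would come from the vectorizer chosen above. The code is written for Python 3.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

# Hypothetical toy corpus: class 0 is space-like, class 1 is graphics-like.
train_docs = ["the shuttle reached orbit", "nasa launched a new satellite",
              "the rocket engine failed on launch", "orbit around the moon",
              "render the image with a shader", "the jpeg file format compresses images",
              "graphics card renders polygons", "convert the image file to gif"]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]
test_docs = ["satellite in orbit", "image file viewer"]
test_y = [0, 1]

# Vectorize once, outside the loop.
vec = TfidfVectorizer()
X_train = vec.fit_transform(train_docs)
X_test = vec.transform(test_docs)

models = {
    'KNN': KNeighborsClassifier(n_neighbors=3),
    'Logistic Regression': LogisticRegression(),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'SVM': SVC(kernel='linear'),
    'Random Forest': RandomForestClassifier(n_estimators=50, random_state=0),
    'Extra Trees': ExtraTreesClassifier(n_estimators=50, random_state=0),
}

# Fit every model on the same matrices and record its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, train_y)
    scores[name] = model.score(X_test, test_y)

for name in sorted(scores):
    print(name, scores[name])
```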

## Bonus: Other classifiers

Adapt the code from [this example](http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html#example-text-document-classification-20newsgroups-py) to compare across all the classifiers suggested and to display the final plot.

## Bonus: NLTK

NLTK is a vast library. Can you find some interesting bits to share with classmates?
Start here: http://www.nltk.org/