# Natural Language Processing Lab

In this lab we will further explore the text-processing capabilities of scikit-learn and NLTK. We will use the 20 Newsgroups dataset, which is provided by scikit-learn.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
from sklearn.datasets import fetch_20newsgroups

In [84]:

categories = [
    'alt.atheism',
    'talk.religion.misc',
    'comp.graphics',
    'sci.space',
]

data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=('headers', 'footers', 'quotes'))

data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=('headers', 'footers', 'quotes'))

## 1. Data inspection

We have downloaded a few newsgroup categories and removed headers, footers and quotes.

Because this is a scikit-learn dataset, it comes with pre-split training and test sets (note that we were able to request `'train'` and `'test'` through the `subset` argument).

Let's inspect them.

1. What data type is `data_train`?
   > `sklearn.datasets.base.Bunch`
2. Is it like a list, a dictionary, or something else?
   > A dictionary: a `Bunch` is a `dict` subclass whose keys can also be read as attributes.
3. How many data points does it contain?
   > 2034
4. Inspect the first data point. What does it look like?
   > A short block of free text
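
To make the "dictionary with attribute access" answer concrete, here is a minimal sketch of how such an object can behave. `MiniBunch` and the toy values are hypothetical stand-ins written for illustration; this is not scikit-learn's own implementation.

```python
class MiniBunch(dict):
    """Hypothetical stand-in for a Bunch: a dict whose keys are also attributes."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so dict methods
        # such as .keys() keep working as usual.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)


# Toy values, not the real dataset.
toy = MiniBunch(
    data=["first post", "second post"],
    target=[0, 1],
    target_names=["comp.graphics", "sci.space"],
)

print(sorted(toy.keys()))       # ['data', 'target', 'target_names']
print(toy["data"] is toy.data)  # True: both spellings reach the same list
print(len(toy.data))            # 2
```

Asking for a key that does not exist, such as `toy.columns`, raises `AttributeError`, which is why DataFrame-style access does not work on this kind of object.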

In [4]:
type(data_train)

sklearn.datasets.base.Bunch

In [5]:
data_train.keys()

['description', 'DESCR', 'filenames', 'target_names', 'data', 'target']

In [6]:
len(data_train['data'])

2034

In [7]:
len(data_train['target'])

2034

In [8]:
data_train['data'][0]

u"Hi,\n\nI've noticed that if you only save a model (with all your mapping planes\npositioned carefully) to a .3DS file that when you reload it after restarting\n3DS, they are given a default position and orientation.  But if you save\nto a .PRJ file their positions/orientation are preserved.  Does anyone\nknow why this information is not stored in the .3DS file?  Nothing is\nexplicitly said in the manual about saving texture rules in the .PRJ file. \nI'd like to be able to read the texture rule information, does anyone have \nthe format for the .PRJ file?\n\nIs the .CEL file format available from somewhere?\n\nRych"

## 2. Bag of Words model

Let's train a model using a simple count vectorizer.

1. Initialize a standard `CountVectorizer` and fit it on the training data.
2. How big is the feature dictionary?
3. Repeat, this time eliminating English stop words.
4. Is the dictionary smaller?
5. Transform the training data using the fitted vectorizer.
6. Evaluate the performance of a logistic regression on the features extracted by the `CountVectorizer`.
    - You will have to transform the test set too. Be careful to use the already-fitted vectorizer, without re-fitting it.
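
The fit/transform split in the steps above can be sketched in plain Python. This is a simplified illustration, not scikit-learn's code: the helper names and toy documents are made up, and the only detail borrowed from `CountVectorizer` is its default token pattern (words of two or more characters, lowercased).

```python
import re
from collections import Counter

# Default CountVectorizer token pattern: runs of 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())


def fit_vocabulary(docs):
    """Map every distinct training token to a column index (sorted order)."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def transform(docs, vocabulary):
    """Count tokens per document, using only columns learned during fit."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok, n in Counter(tokenize(doc)).items():
            if tok in vocabulary:  # tokens unseen during fit are dropped
                row[vocabulary[tok]] = n
        rows.append(row)
    return rows


train_docs = ["the rocket launch", "the image format"]
test_docs = ["a new rocket rocket image", "orbit"]

vocab = fit_vocabulary(train_docs)           # learned from training data only
X_train_toy = transform(train_docs, vocab)
X_test_toy = transform(test_docs, vocab)     # same columns, no re-fitting

print(vocab)        # {'format': 0, 'image': 1, 'launch': 2, 'rocket': 3, 'the': 4}
print(X_test_toy)   # [[0, 1, 0, 2, 0], [0, 0, 0, 0, 0]]
```

Because the test rows reuse the training vocabulary, they have the same number of columns as the training rows, which is what a downstream classifier expects.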

BONUS:
- Try a couple of modifications:
    - restrict `max_features`
    - change `max_df` and `min_df`
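
As a rough guide to what these three parameters do, the sketch below prunes a vocabulary by document frequency. It assumes `min_df` is given as an absolute document count and `max_df` as a proportion of documents, and keeps the `max_features` most frequent terms. The function name, tie-breaking rule, and toy documents are illustrative assumptions, not scikit-learn internals.

```python
from collections import Counter


def prune_vocabulary(tokenized_docs, min_df=1, max_df=1.0, max_features=None):
    """Return the sorted list of terms that survive the frequency filters."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()   # number of documents containing each term
    term_freq = Counter()  # total occurrences of each term
    for tokens in tokenized_docs:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    max_count = max_df * n_docs  # max_df interpreted as a proportion
    kept = [t for t in doc_freq if min_df <= doc_freq[t] <= max_count]
    if max_features is not None:
        # Keep the most frequent terms; ties broken alphabetically (assumption).
        kept = sorted(kept, key=lambda t: (-term_freq[t], t))[:max_features]
    return sorted(kept)


toy_docs = [
    ["space", "nasa", "the"],
    ["space", "orbit", "the"],
    ["image", "pixel", "the"],
    ["image", "space", "the"],
]

print(prune_vocabulary(toy_docs, min_df=2))               # ['image', 'space', 'the']
print(prune_vocabulary(toy_docs, min_df=2, max_df=0.75))  # ['image', 'space']
print(prune_vocabulary(toy_docs, max_features=2))         # ['space', 'the']
```

Raising `min_df` removes rare terms, lowering `max_df` removes terms that appear in nearly every document, and `max_features` caps the vocabulary size outright.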

In [20]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

cvec = CountVectorizer()
cvec.fit(data_train['data'])

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [21]:
## 2. How big is the feature dictionary?
len(cvec.get_feature_names())

26879

In [22]:
## 3. repeat eliminating english stop words
cvec = CountVectorizer(stop_words='english')
cvec.fit(data_train['data'])

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [23]:
# 4. is the dictionary smaller?
features = cvec.get_feature_names()
len(features)

## yes there are about 300 fewer features

26576

In [24]:
# 5. transform the training data using the trained vectorizer

X_train  = pd.DataFrame(cvec.transform(data_train['data']).todense(),columns=cvec.get_feature_names())
y_train = data_train['target']

In [25]:
print('X_train shape: {}'.format(X_train.shape))
X_train

X_train shape: (2034, 26576)




Unnamed: 0,00,000,0000,00000,000000,000005102000,000062david42,0001,000100255pixel,00041032,...,zurich,zurvanism,zus,zvi,zwaartepunten,zwak,zwakke,zware,zwarte,zyxel
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [15]:
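# Most frequent terms. The index below shows column positions, not words,
# which suggests this cell was run before X_train had named columns.
word_counts = X_train.sum(axis=0)
word_counts.sort_values(ascending=False).head(20)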

22343    1061
18046     793
11291     745
8591      730
14706     682
13905     675
8554      600
14154     592
23927     584
24067     546
12642     534
8956      501
25108     468
11326     449
7506      444
16536     419
11444     414
13708     411
21217     409
25833     387
dtype: int64

In [26]:
X_test  = pd.DataFrame(cvec.transform(data_test['data']).todense(),columns=cvec.get_feature_names())
y_test = data_test['target']

In [27]:
from sklearn.linear_model import LogisticRegression

# Fit on the training counts, then report accuracy on the held-out test set
lr = LogisticRegression()
lr.fit(X_train, y_train)
lr.score(X_test, y_test)

0.74501108647450109

## 3. Hashing and TF-IDF

Let's see if Hashing or TF-IDF improves the accuracy.

1. Initialize a `HashingVectorizer` and repeat the test with no restriction on the number of features.
2. Does the score improve with respect to the count vectorizer?
3. Print out the number of features for this model.
4. Initialize a `TfidfVectorizer` and repeat the analysis above.
5. Print out the number of features for this model.
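
The idea behind hashing can be sketched without any library: each token is hashed straight to a column index, so no vocabulary is learned or stored. The sketch below uses MD5 from the standard library purely for reproducibility; scikit-learn's `HashingVectorizer` uses a different hash function (MurmurHash3), 2**20 columns by default, and additional sign and normalization handling that is omitted here. The function name and the tiny `n_features` are illustrative assumptions.

```python
import hashlib
import re

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def hashed_counts(text, n_features=16):
    """Return a fixed-length count vector; nothing is fitted or remembered."""
    row = [0] * n_features
    for tok in TOKEN_PATTERN.findall(text.lower()):
        digest = hashlib.md5(tok.encode("utf-8")).hexdigest()
        row[int(digest, 16) % n_features] += 1  # column chosen by the hash
    return row


train_row = hashed_counts("the rocket reached orbit")
test_row = hashed_counts("a brand new word about orbit")

print(len(train_row), len(test_row))  # 16 16
print(sum(train_row), sum(test_row))  # 4 5
```

Two consequences follow. Train and test documents always land in the same column space, so there is no fitted state to reuse. The price is that different tokens can collide in one column, and columns cannot be mapped back to words.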

BONUS:
- Change the parameters of either (or both!) models to improve your score.
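
When tuning the TF-IDF model it helps to know what the weights are. The following sketch assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row, which matches the documented `TfidfVectorizer` defaults. The function name and toy documents are made up, and `sublinear_tf` here simply replaces a raw count `tf` with `1 + ln(tf)`.

```python
import math
from collections import Counter


def tfidf_rows(tokenized_docs, sublinear_tf=False):
    """Return (vocabulary, rows) with L2-normalized TF-IDF weights."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))
    vocab = sorted(doc_freq)
    # Smoothed inverse document frequency.
    idf = {t: math.log((1.0 + n_docs) / (1.0 + doc_freq[t])) + 1.0 for t in vocab}

    rows = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        weights = []
        for t in vocab:
            tf = counts[t]
            if sublinear_tf and tf > 0:
                tf = 1.0 + math.log(tf)  # dampen repeated terms
            weights.append(tf * idf[t])
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows


toy_docs = [["space", "space", "the"], ["image", "the"]]
vocab, rows = tfidf_rows(toy_docs)
_, sub_rows = tfidf_rows(toy_docs, sublinear_tf=True)

space, the = vocab.index("space"), vocab.index("the")
print(rows[0][space] > rows[0][the])       # True: rarer term gets more weight
print(sub_rows[0][space] < rows[0][space]) # True: sublinear_tf dampens repeats
```

Terms that occur in every document (here `the`) get the minimum idf of 1, so they contribute less than terms concentrated in a few documents.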

In [31]:
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer

# NOTE: `spam` and `ham` are not defined in any cell shown in this notebook,
# so this cell cannot be re-run as written; the error it produced follows.

# Count vectorizer: show the ten largest counts for one document
df = pd.DataFrame(cvec.transform([spam]).todense(),
                  columns=cvec.get_feature_names())
df.transpose().sort_values(0, ascending=False).head(10).transpose()

# Hashing vectorizer (your turn)
hvec = HashingVectorizer()

df = pd.DataFrame(hvec.transform([spam]).todense())
df.transpose().sort_values(0, ascending=False).head(10).transpose()

# TF-IDF vectorizer
tvec = TfidfVectorizer(stop_words='english')
tvec.fit([spam, ham])

df = pd.DataFrame(tvec.transform([spam, ham]).todense(),
                  columns=tvec.get_feature_names(),
                  index=['spam', 'ham'])


ValueError: could not convert string to float: Hi,  I've noticed that if you only save a model (with all your mapping planes positioned carefully) to a .3DS file that when you reload it after restarting 3DS, they are given a default position and o

In [67]:
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score

# Hashing vectorizer (your turn)
hvec = HashingVectorizer(stop_words='english')
log = LogisticRegression()

y_train = data_train['target']
y_test = data_test['target']

# HashingVectorizer is stateless, so there is nothing to fit;
# transform() alone maps train and test into the same feature space.
hvec_train = hvec.transform(data_train['data'])
log.fit(hvec_train, y_train)

hvec_test = hvec.transform(data_test['data'])
predictions = log.predict(hvec_test)
score = accuracy_score(y_test, predictions)
print('Accuracy score: {}'.format(score))
print('# of Features: {}'.format(hvec.n_features))

Accuracy score: 0.736881005174
# of Features: 1048576


In [86]:
len(list(data_train.columns)), len(list(data_test.columns))

AttributeError: columns

In [90]:
len(data_train.data), len(data_test.data)
tvec_train

<2034x26576 sparse matrix of type '<type 'numpy.float64'>'
	with 133634 stored elements in Compressed Sparse Row format>

In [93]:
log = LogisticRegression()

tvec = TfidfVectorizer(stop_words='english')
tvec_train = tvec.fit_transform(data_train['data'])
model = log.fit(tvec_train, y_train)

# Use transform(), not fit_transform(), on the test set. Re-fitting builds a
# new vocabulary from the test documents (21240 terms instead of 26576) and
# raises "ValueError: X has 21240 features per sample; expecting 26576".
tvec_test = tvec.transform(data_test['data'])
predictions = model.predict(tvec_test)

score = accuracy_score(y_test, predictions)
print('Accuracy score: {}'.format(score))
print('# of Features: {}'.format(len(tvec.get_feature_names())))

In [None]:
from sklearn.pipeline import make_pipeline

# A pipeline chains a vectorizer and a model into a single estimator.
# Calling model.fit() fits the vectorizer and then the classifier; the fitted
# steps are stored on `model`, so model.predict() reuses them to transform
# new text without re-fitting.
# Note: `non_negative` was removed in later scikit-learn releases
# (`alternate_sign=False` is the replacement).
model = make_pipeline(HashingVectorizer(stop_words='english',
                                        non_negative=True,
                                        n_features=2**16),
                      LogisticRegression(),
                      )
model.fit(data_train['data'], y_train)
y_pred = model.predict(data_test['data'])
print(accuracy_score(y_test, y_pred))
print("Number of features: {}".format(2**16))

In [None]:
model = make_pipeline(TfidfVectorizer(stop_words='english',
                                      sublinear_tf=True,
                                      max_df=0.5,
                                      max_features=1000),
                      LogisticRegression(),
                      )
model.fit(data_train['data'], y_train)
y_pred = model.predict(data_test['data'])
print(accuracy_score(y_test, y_pred))
# The fitted vectorizer is the first step of the pipeline
print("Number of features: {}".format(len(model.steps[0][1].get_feature_names())))
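
To show why a pipeline avoids the re-fitting mistake seen earlier, here is a stripped-down imitation in plain Python. `TinyPipeline`, `VocabCounter`, and `NearestRow` are hypothetical toy classes written for this sketch; they are not scikit-learn classes and only mimic the fit/transform/predict calling pattern.

```python
class VocabCounter(object):
    """Toy transformer: learns a vocabulary, then counts words per document."""

    def fit_transform(self, docs):
        self.vocab_ = sorted({w for d in docs for w in d.split()})
        return self.transform(docs)

    def transform(self, docs):
        return [[d.split().count(w) for w in self.vocab_] for d in docs]


class NearestRow(object):
    """Toy classifier: predicts the label of the most overlapping training row."""

    def fit(self, rows, labels):
        self.rows_, self.labels_ = rows, labels
        return self

    def predict(self, rows):
        def overlap(a, b):
            return sum(x * y for x, y in zip(a, b))

        indices = range(len(self.rows_))
        return [self.labels_[max(indices, key=lambda i: overlap(r, self.rows_[i]))]
                for r in rows]


class TinyPipeline(object):
    """Hypothetical two-step pipeline: one transformer, one estimator."""

    def __init__(self, transformer, estimator):
        self.transformer = transformer
        self.estimator = estimator

    def fit(self, X, y):
        # Training: the transformer learns its state, then the estimator is fit.
        self.estimator.fit(self.transformer.fit_transform(X), y)
        return self

    def predict(self, X):
        # Prediction: reuse the learned state; nothing is re-fitted here.
        return self.estimator.predict(self.transformer.transform(X))


toy_model = TinyPipeline(VocabCounter(), NearestRow())
toy_model.fit(["rocket orbit launch", "pixel image format"],
              ["sci.space", "comp.graphics"])
toy_pred = toy_model.predict(["new image format", "orbit insertion burn"])
print(toy_pred)  # ['comp.graphics', 'sci.space']
```

Because `predict` only ever calls `transform`, the test documents are always mapped into the columns learned during training, so the feature-count mismatch cannot occur.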