<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Natural Language Processing Lab

_Authors: Dave Yerrington (SF)_

---

In this lab, we'll explore scikit-learn and NLTK's capabilities for processing text even further. We'll use the 20 newsgroups data set, which is provided by scikit-learn.

In [1]:
# Standard data science imports:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.linear_model import LogisticRegression

# import vectorizer, tokenizer, stemmer
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfVectorizer
from nltk.tokenize import RegexpTokenizer
from nltk.stem.snowball import SnowballStemmer

# import bs4
from bs4 import BeautifulSoup

# others
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

In [2]:
# Getting the scikit-learn data set:
from sklearn.datasets import fetch_20newsgroups

### 1. Use the `fetch_20newsgroups` function to download a training and testing set.

The "20 Newsgroups" dataset is described [here](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html).

For this lab let's choose 4 categories to analyze.  The full list is given below.


```python
['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']
```

Note that the solution code will use these categories:
- `alt.atheism`
- `talk.religion.misc`
- `comp.graphics`
- `sci.space`

Also remove the headers, footers, and quotes using the `remove` keyword argument of the function.

In [3]:
#Extracting Information from the Data's Dictionary format 

categories = ['alt.atheism','talk.religion.misc','comp.graphics','sci.space']  # Fill in whatever categories you want to use!!

# Setting our training data
data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=('headers', 'footers', 'quotes'))
# Setting our testing data
data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=('headers', 'footers', 'quotes'))

**Question:** What does the `shuffle` argument do?  Why are we setting a `random_state`?

# A:

`shuffle=True` randomly reorders the documents before they are returned. Shuffling matters for methods (which we might use later on) that assume the samples are independent and identically distributed (i.i.d.), such as stochastic gradient descent.

We set a `random_state` to seed that shuffle, so the rows come back in the same order every time the code is run. This reproducibility makes it easier to troubleshoot our model.
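
To see why a fixed seed gives a repeatable order, here is a minimal sketch using Python's standard `random` module rather than scikit-learn itself. The `docs` list and the `shuffled` helper are made up for illustration.

```python
import random

def shuffled(items, seed):
    """Return a shuffled copy of items, using a private generator seeded with seed."""
    rng = random.Random(seed)
    result = list(items)   # copy so the original order is untouched
    rng.shuffle(result)
    return result

docs = ["doc_a", "doc_b", "doc_c", "doc_d", "doc_e"]  # hypothetical documents

first_run = shuffled(docs, seed=42)
second_run = shuffled(docs, seed=42)

print(first_run == second_run)            # same seed, same order
print(sorted(first_run) == sorted(docs))  # shuffling only reorders, nothing is lost
```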

### 2) Inspect the data.

We've downloaded a few `newsgroups` categories and removed their headers, footers, and quotes.

Because this is a scikit-learn data set, it comes with pre-split training and testing sets (note: we were able to call "train" and "test" in subset).

Let's inspect them.

1) What data type is `data_train`?
- Is it a list? A dictionary? What else?
- How many data points does it contain?
- Inspect the first data point. What does it look like?

In [4]:
data_train; #long printout suppressed, but this line would show the entire data structure
            #and I've seen that it's a dictionary-like object holding lists and arrays

In [5]:
data_train.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [6]:
data_train['data']; #long printout suppressed, but this line would show what ...['data'] looks like

In [7]:
print(data_train['filenames'])
data_train['filenames'].dtype

['C:\\Users\\Dell\\scikit_learn_data\\20news_home\\20news-bydate-train\\comp.graphics\\38816'
 'C:\\Users\\Dell\\scikit_learn_data\\20news_home\\20news-bydate-train\\talk.religion.misc\\83741'
 'C:\\Users\\Dell\\scikit_learn_data\\20news_home\\20news-bydate-train\\sci.space\\61092'
 ...
 'C:\\Users\\Dell\\scikit_learn_data\\20news_home\\20news-bydate-train\\comp.graphics\\38737'
 'C:\\Users\\Dell\\scikit_learn_data\\20news_home\\20news-bydate-train\\alt.atheism\\53237'
 'C:\\Users\\Dell\\scikit_learn_data\\20news_home\\20news-bydate-train\\comp.graphics\\38269']


dtype('<U94')

In [8]:
data_train['target_names']

['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']

In [9]:
print(data_train['target'])

[1 3 2 ... 1 0 1]


In [10]:
data_train['DESCR'][:100]

'.. _20newsgroups_dataset:\n\nThe 20 newsgroups text dataset\n------------------------------\n\nThe 20 new'

# A:

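1. As seen above, `data_train` is a dictionary-like object (it prints with {} and exposes `.keys()`; scikit-learn returns it as a `Bunch`). It holds a list of strings (`data`), a list of category names (`target_names`), NumPy arrays of file paths and integer labels (`filenames`, `target`), and a description string (`DESCR`).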
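
A small sketch of how a scikit-learn `Bunch` behaves, using a hand-built stand-in rather than the real dataset (the `toy` object and its values are invented for illustration):

```python
from sklearn.utils import Bunch

# A hand-built stand-in for what fetch_20newsgroups returns (toy values, not real data)
toy = Bunch(data=["first post", "second post"], target=[1, 0])

print(isinstance(toy, dict))     # Bunch subclasses dict
print(toy["data"] is toy.data)   # key access and attribute access reach the same object
print(sorted(toy.keys()))
```

This is why both `data_train['data']` and `data_train.data` work in this notebook.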

In [11]:
print('Qty of data        :',len(data_train['data']))
print('Qty of filenames   :',len(data_train['filenames']))
print('Qty of target_names:',len(data_train['target_names']))    #only 4 because we only selected 4 categories here
print('Qty of target      :',len(data_train['target']))
print('Qty of DESCR       :',len(data_train['DESCR']))           #>2034 because this is just a string of characters

Qty of data        : 2034
Qty of filenames   : 2034
Qty of target_names: 4
Qty of target      : 2034
Qty of DESCR       : 9535


# A:

2. As seen above, data_train has 2034 data points.

In [12]:
print('1st data    :',data_train['data'][0])
print('\n1st filename:',data_train['filenames'][0])
print('\n1st target  :',data_train['target'][0])

1st data    : Hi,

I've noticed that if you only save a model (with all your mapping planes
positioned carefully) to a .3DS file that when you reload it after restarting
3DS, they are given a default position and orientation.  But if you save
to a .PRJ file their positions/orientation are preserved.  Does anyone
know why this information is not stored in the .3DS file?  Nothing is
explicitly said in the manual about saving texture rules in the .PRJ file. 
I'd like to be able to read the texture rule information, does anyone have 
the format for the .PRJ file?

Is the .CEL file format available from somewhere?

Rych

1st filename: C:\Users\Dell\scikit_learn_data\20news_home\20news-bydate-train\comp.graphics\38816

1st target  : 1


# A:

3. The 1st data point is from the 'comp.graphics' category: its target is 1, and `target_names[1]` (the 2nd entry, since indexing starts at 0) is 'comp.graphics'. It reads like a question posted to a forum.
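
The lookup from integer target to category name can be sketched with the values already printed above (the variable names `first_targets` and `first_labels` are just for illustration):

```python
# Category names in the order printed by data_train['target_names'] above
target_names = ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']

# The first three labels printed by data_train['target'] above
first_targets = [1, 3, 2]

# Each target is a 0-based index into target_names
first_labels = [target_names[t] for t in first_targets]
print(first_labels)
```

The resulting names agree with the category folders in the first three file paths shown earlier.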

### 3) Create a bag-of-words model.

Let's train a model using a simple count vectorizer.

1) Initialize a standard CountVectorizer and fit the training data.
- How big is the feature dictionary?
- Eliminate English stop words.
- Is the dictionary smaller?
- Transform the training data using the trained vectorizer.
- Evaluate the performance of a logistic regression on the features extracted by the CountVectorizer.
    - You will have to transform the `test_set`, too. Be careful to use the trained vectorizer without refitting it.

**Bonus**
- Try a couple of modifications:
    - Restrict the `max_features`.
    - Change the `max_df` and `min_df`.

I will do it without functions (because that's meant for the bonus at the end of the lab)...

In [13]:
# process data_train words first...

# instantiate empty list to hold tokenized words, and those with stopwords removed/stemmed/joined
train_words, train_words_nostop, train_words_nostop_stem, train_words_nostop_stem_join=[],[],[],[]

# instantiate tokenizer, then tokenize the lower case of text
tokenizer = RegexpTokenizer(r'\w+')    #only extract words, not spaces/punctuations (raw string avoids an invalid-escape warning)
for i in range(len(data_train['data'])):
    train_words.append(tokenizer.tokenize(data_train['data'][i].lower()))

# remove stopwords, compile into a list. Or, don't do this here, instead let CountVectorizer do it later on
# stops = set(stopwords.words('english')) #runs faster if its a set
stops = []
for i in range(len(train_words)):
    train_words_nostop.append([w for w in train_words[i] if w not in stops])

# instantiate snowballstemmer, then stem
s_stemmer = SnowballStemmer('english')
for i in range(len(train_words_nostop)):
    train_words_nostop_stem.append([s_stemmer.stem(j) for j in train_words_nostop[i]])

# Join the words back into one string separated by space, and return the result.
for i in range(len(train_words_nostop_stem)):
    train_words_nostop_stem_join.append(" ".join(train_words_nostop_stem[i]))

In [14]:
# process data_test words too...

# instantiate empty list to hold tokenized words, and those with stopwords removed/stemmed/joined
test_words, test_words_nostop, test_words_nostop_stem, test_words_nostop_stem_join=[],[],[],[]

# instantiate tokenizer, then tokenize the lower case of text
tokenizer = RegexpTokenizer(r'\w+')    #only extract words, not spaces/punctuations (raw string avoids an invalid-escape warning)
for i in range(len(data_test['data'])):
    test_words.append(tokenizer.tokenize(data_test['data'][i].lower()))

# remove stopwords, compile into a list. Or, don't do this here, instead let CountVectorizer do it later on
# stops = set(stopwords.words('english')) #runs faster if its a set
stops = []
for i in range(len(test_words)):
    test_words_nostop.append([w for w in test_words[i] if w not in stops])

# instantiate snowballstemmer, then stem
s_stemmer = SnowballStemmer('english')
for i in range(len(test_words_nostop)):
    test_words_nostop_stem.append([s_stemmer.stem(j) for j in test_words_nostop[i]])

# Join the words back into one string separated by space, and return the result.
for i in range(len(test_words_nostop_stem)):
    test_words_nostop_stem_join.append(" ".join(test_words_nostop_stem[i]))


In [15]:
# initiate vectorizer
vectorizer = CountVectorizer(analyzer='word',
#                              stop_words='english', #define what are our stopwords (from the 'english' list)
#                              max_df=0.7,           #ignores words that appear in more than 70% of documents (an int such as 6 would mean more than 6 documents)
#                              min_df=0.1,           #ignores words that appear in fewer than 10% of documents (an int such as 6 would mean fewer than 6 documents)
#                              max_features=5000       #limits the number of features/words in the vectorizer
                            )   

In [16]:
# fit transform from train, then restrict test data to just those features
train_data_features = vectorizer.fit_transform(train_words_nostop_stem_join)
test_data_features = vectorizer.transform(test_words_nostop_stem_join)

In [17]:
# see no. of features for train and test data
print(train_data_features.shape)
print(test_data_features.shape)

(2034, 19619)
(1353, 19619)


Without removing stopwords, we have a huge number (19619) of features. Now, let's remove stopwords.

In [18]:
# initiate vectorizer
vectorizer = CountVectorizer(analyzer='word',
                             stop_words='english', #define what are our stopwords (from the 'english' list)
#                              max_df=0.7,           #ignores words that appear in more than 70% of documents (an int such as 6 would mean more than 6 documents)
#                              min_df=0.1,           #ignores words that appear in fewer than 10% of documents (an int such as 6 would mean fewer than 6 documents)
#                              max_features=5000       #limits the number of features/words in the vectorizer
                            )   

In [19]:
# fit transform from train, then restrict test data to just those features
train_data_features = vectorizer.fit_transform(train_words_nostop_stem_join)
test_data_features = vectorizer.transform(test_words_nostop_stem_join)  #transform only! not fit

In [20]:
# see no. of features for train and test data
print(train_data_features.shape)
print(test_data_features.shape)

(2034, 19400)
(1353, 19400)


After removing stopwords the feature dictionary is smaller, but only slightly (19400 vs. 19619 features). One likely reason is that the text was stemmed before it reached the vectorizer, so stemmed forms such as "onli" no longer match entries in the stop-word list. Removing stopwords inside CountVectorizer is also less preferred than removing them earlier with NLTK's list, because scikit-learn's documentation notes known issues with its built-in 'english' list.
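
The effect of `stop_words='english'`, and the way a stemmed token can slip past the stop-word list, can be checked on a toy corpus. This is a minimal sketch; the two sentences are made up, and "onli" is used as an example of what a stemmer may produce from "only".

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

docs = ["the cat sat on the mat", "the dog sat on the log"]  # toy corpus

plain = CountVectorizer().fit(docs)
no_stops = CountVectorizer(stop_words='english').fit(docs)

print(sorted(plain.vocabulary_))     # includes 'the' and 'on'
print(sorted(no_stops.vocabulary_))  # stop words are gone

# A stemmed form is a different string, so it is not matched by the list
print('only' in ENGLISH_STOP_WORDS, 'onli' in ENGLISH_STOP_WORDS)
```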

Now let's pass them through a logreg model, to see how well the test data performs, through learning from the train data. 

In [21]:
# set y's for train, test datasets 
y_train = data_train['target']
y_test = data_test['target']

In [22]:
# initiate logreg model
logreg = LogisticRegression()

# train on train data, then score it
logreg.fit(train_data_features, y_train)
print(logreg.score(train_data_features, y_train))

# now, score and evaluate model on test data
print(logreg.score(test_data_features, y_test))



0.9768928220255654
0.7361419068736141


Accuracy was very high (98%) for training data, but reduced for test data (74%). A gap this large suggests the model is overfitting the training set.
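
The fit-on-train, transform-only-on-test discipline used above can also be handled by scikit-learn's `Pipeline`, which the imports at the top already bring in. A minimal sketch on a toy corpus (the documents and labels are invented for illustration, not taken from the newsgroups data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

train_docs = ["rocket orbit moon", "launch rocket space",
              "pixel image render", "image graphics pixel"]
train_labels = [0, 0, 1, 1]          # 0 = space-like, 1 = graphics-like
test_docs = ["rocket moon", "pixel render"]

pipe = Pipeline([
    ('vec', CountVectorizer()),      # fitted on the training text only
    ('clf', LogisticRegression()),
])
pipe.fit(train_docs, train_labels)

# predict() calls the vectorizer's transform (never fit) on the new text
predictions = pipe.predict(test_docs)
print(predictions)
```

Bundling the steps this way makes it harder to refit the vectorizer on test data by accident.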

Now let's fiddle with the max_df, min_df, max_features arguments in CountVectorizer.

In [23]:
# initializing 3 different types of vectorizers (varying max_df / min_df / max_features), with 2 values each. So 6 vectorizers

vectorizer_maxdf7 = CountVectorizer(analyzer='word',
                             stop_words='english',
                             max_df=0.7,           #ignores words that appear in more than 70% of documents
                            )

vectorizer_maxdf2 = CountVectorizer(analyzer='word',
                             stop_words='english',
                             max_df=0.2,           #ignores words that appear in more than 20% of documents
                            )

vectorizer_mindf1 = CountVectorizer(analyzer='word',
                             stop_words='english',
                             min_df=0.1,           #ignores words that appear in fewer than 10% of documents
                            )

vectorizer_mindf2 = CountVectorizer(analyzer='word',
                             stop_words='english',
                             min_df=0.2,           #ignores words that appear in fewer than 20% of documents
                            )

vectorizer_maxfeatures9000 = CountVectorizer(analyzer='word',
                             stop_words='english',
                             max_features=9000       #limits the number of features/words in the vectorizer  
                            )

vectorizer_maxfeatures1000 = CountVectorizer(analyzer='word',
                             stop_words='english',
                             max_features=1000       #limits the number of features/words in the vectorizer  
                            )

In [24]:
# varying maxdf

# maxdf = 0.7
# fit transform with new vectorizer
train_data_features = vectorizer_maxdf7.fit_transform(train_words_nostop_stem_join)
test_data_features = vectorizer_maxdf7.transform(test_words_nostop_stem_join)

# fit logreg model using these reduced features, and score for train and test data 
logreg.fit(train_data_features, y_train)
print('Train score with maxdf 0.7:',logreg.score(train_data_features, y_train))
print('Test score with maxdf  0.7:',logreg.score(test_data_features, y_test))

# maxdf = 0.2
# fit transform with new vectorizer
train_data_features = vectorizer_maxdf2.fit_transform(train_words_nostop_stem_join)
test_data_features = vectorizer_maxdf2.transform(test_words_nostop_stem_join)

# fit logreg model using these reduced features, and score for train and test data 
logreg.fit(train_data_features, y_train)
print('Train score with maxdf 0.2:',logreg.score(train_data_features, y_train))
print('Test score with maxdf  0.2:',logreg.score(test_data_features, y_test))

Train score with maxdf 0.7: 0.9768928220255654
Test score with maxdf  0.7: 0.7361419068736141
Train score with maxdf 0.2: 0.9768928220255654
Test score with maxdf  0.2: 0.7442719881744272


Setting a lower max_df (ignoring words that appear in more than 20% of the training documents) improves test accuracy slightly (74.4% vs. 73.6%), while max_df=0.7 changes nothing compared with the baseline. Dropping very common words probably removes terms shared across categories, leaving words that discriminate between them better.

In [25]:
# varying mindf

# mindf = 0.1
# fit transform with new vectorizer
train_data_features = vectorizer_mindf1.fit_transform(train_words_nostop_stem_join)
test_data_features = vectorizer_mindf1.transform(test_words_nostop_stem_join)

# fit logreg model using these reduced features, and score for train and test data 
logreg.fit(train_data_features, y_train)
print('Train score with mindf 0.1:',logreg.score(train_data_features, y_train))
print('Test score with mindf  0.1:',logreg.score(test_data_features, y_test))

# mindf = 0.2
# fit transform with new vectorizer
train_data_features = vectorizer_mindf2.fit_transform(train_words_nostop_stem_join)
test_data_features = vectorizer_mindf2.transform(test_words_nostop_stem_join)

# fit logreg model using these reduced features, and score for train and test data 
logreg.fit(train_data_features, y_train)
print('Train score with mindf 0.2:',logreg.score(train_data_features, y_train))
print('Test score with mindf  0.2:',logreg.score(test_data_features, y_test))

Train score with mindf 0.1: 0.567354965585054
Test score with mindf  0.1: 0.516629711751663
Train score with mindf 0.2: 0.37659783677482794
Test score with mindf  0.2: 0.3614190687361419


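Setting min_df this high (ignoring words that appear in fewer than 10% or 20% of the training documents) sharply worsens the accuracy: the test score drops from 74% at baseline to 52% with min_df=0.1 and to 36% with min_df=0.2. Most of the words that distinguish the categories are probably rare, so these thresholds discard nearly all of the useful features.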
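
How the fractional `max_df` and `min_df` thresholds prune a vocabulary can be sketched on a toy corpus of four made-up documents, where "space" appears in every document, "rocket" and "graphics" in two each, and the rest in one:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["space rocket launch",
        "space rocket orbit",
        "space graphics image",
        "space graphics pixel"]   # toy corpus of 4 documents

# max_df=0.5: drop words found in more than 50% of documents (more than 2 of 4)
high_cut = CountVectorizer(max_df=0.5).fit(docs)

# min_df=0.5: drop words found in fewer than 50% of documents (fewer than 2 of 4)
low_cut = CountVectorizer(min_df=0.5).fit(docs)

kept_by_max_df = sorted(high_cut.vocabulary_)
kept_by_min_df = sorted(low_cut.vocabulary_)
print(kept_by_max_df)   # everything except 'space'
print(kept_by_min_df)   # only the words seen in at least 2 documents
```

Note how `min_df` removes the rare, document-specific words, which are often the ones that separate one topic from another.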

In [26]:
# varying maxfeatures

# maxfeatures = 9000
# fit transform with new vectorizer
train_data_features = vectorizer_maxfeatures9000.fit_transform(train_words_nostop_stem_join)
test_data_features = vectorizer_maxfeatures9000.transform(test_words_nostop_stem_join)

# fit logreg model using these reduced features, and score for train and test data 
logreg.fit(train_data_features, y_train)
print('Train score with maxfeatures 9000:',logreg.score(train_data_features, y_train))
print('Test score with maxfeatures  9000:',logreg.score(test_data_features, y_test))

# maxfeatures = 1000
# fit transform with new vectorizer
train_data_features = vectorizer_maxfeatures1000.fit_transform(train_words_nostop_stem_join)
test_data_features = vectorizer_maxfeatures1000.transform(test_words_nostop_stem_join)

# fit logreg model using these reduced features, and score for train and test data 
logreg.fit(train_data_features, y_train)
print('Train score with maxfeatures 1000:',logreg.score(train_data_features, y_train))
print('Test score with maxfeatures  1000:',logreg.score(test_data_features, y_test))

Train score with maxfeatures 9000: 0.9759095378564405
Test score with maxfeatures  9000: 0.738359201773836
Train score with maxfeatures 1000: 0.9488692232055064
Test score with maxfeatures  1000: 0.7095343680709535


Setting a low max_features (keeping only the most frequent words across the training corpus) worsens the accuracy: the test score falls from 73.8% with 9000 features to 71.0% with 1000. This is quite likely because fewer features reduce the model's ability to tell the categories apart.
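
A minimal sketch of what `max_features` keeps, on a made-up two-document corpus where the total counts are space 3, rocket 2, moon 1:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["space space rocket", "space rocket moon"]  # corpus counts: space 3, rocket 2, moon 1

# Keep only the 2 terms with the highest total count across the corpus
top_two = CountVectorizer(max_features=2).fit(docs)
kept = sorted(top_two.vocabulary_)
print(kept)   # 'moon', the least frequent term, is dropped
```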

### 4) Test Out Hashing and TF-IDF.

Let's see if hashing or TF-IDF improves our accuracy.

1) Initialize a HashingVectorizer and repeat the test with no restriction on the number of features.
- Does the score improve with respect to the CountVectorizer?
- Print out the number of features for this model.
- Initialize a TF-IDF vectorizer and repeat the analysis above.
- Print out the number of features for this model.

**Bonus**
- Change the parameters of either (or both) models to improve your score.

In [27]:
# A:
# instantiate hvec
hvec = HashingVectorizer()
# HashingVectorizer is stateless (no vocabulary is learned), so fit_transform and transform apply the same hashing
train_data_features = hvec.fit_transform(train_words_nostop_stem_join)
test_data_features = hvec.transform(test_words_nostop_stem_join)
print('No. of features',train_data_features.shape[1])

# train on train data, then score it
logreg.fit(train_data_features, y_train)
print('Train score:',logreg.score(train_data_features, y_train))

# now, score and evaluate model on test data
print('Test score :',logreg.score(test_data_features, y_test))

No. of features 1048576
Train score: 0.8456243854473943
Test score : 0.6836659275683666


In [28]:
# A:
# instantiate tfidf
tvec = TfidfVectorizer()
# fit transform from train, then restrict test data to just those features
train_data_features = tvec.fit_transform(train_words_nostop_stem_join)
test_data_features = tvec.transform(test_words_nostop_stem_join)
print('No. of features',train_data_features.shape[1])

# train on train data, then score it
logreg.fit(train_data_features, y_train)
print('Train score:',logreg.score(train_data_features, y_train))

# now, score and evaluate model on test data
print('Test score :',logreg.score(test_data_features, y_test))

No. of features 19619
Train score: 0.9336283185840708
Test score : 0.7420546932742055
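
To make the TF-IDF weighting concrete, here is a small sketch of the inverse document frequency that `TfidfVectorizer` computes with its default settings (`smooth_idf=True`), namely ln((1 + n) / (1 + df)) + 1. The toy corpus and the names `tvec_toy`, `idf` and `expected_rare` are illustrative only.

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["space rocket", "space graphics"]   # toy corpus: 'space' is in both documents

tvec_toy = TfidfVectorizer().fit(docs)

# idf_ is ordered by column index, so look each term up through vocabulary_
idf = {term: tvec_toy.idf_[idx] for term, idx in tvec_toy.vocabulary_.items()}

n_docs = len(docs)
expected_rare = math.log((1 + n_docs) / (1 + 1)) + 1   # a term found in 1 document

print(idf['space'])                   # a term in every document gets the minimum weight
print(idf['rocket'], expected_rare)   # a rarer term gets a larger weight
```

Words that occur everywhere are down-weighted relative to words concentrated in a few documents, which is one plausible reason TF-IDF edges out raw counts here.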


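The HashingVectorizer reports far more features than TF-IDF (1048576 vs. 19619), but that number is the fixed size of its hash space (2**20 by default), not a learned vocabulary. Compared with the CountVectorizer baseline (73.6% test accuracy), hashing with default settings performs worse (68.4%) and TF-IDF is marginally better (74.2%).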
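
The feature count of a HashingVectorizer is set by its `n_features` argument rather than by the data, which a short sketch on a made-up corpus can confirm:

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["space rocket launch", "graphics image pixel"]  # toy corpus

default_hv = HashingVectorizer()             # default hash space: 2**20 columns
small_hv = HashingVectorizer(n_features=16)  # a deliberately tiny hash space

# No fitting is needed: transform hashes each token straight to a column index
default_shape = default_hv.transform(docs).shape
small_shape = small_hv.transform(docs).shape
print(default_shape, small_shape)
```

A smaller hash space saves memory but raises the chance that different words collide in the same column.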

### 5. [Bonus] Robust Text Preprocessing

Your mission, should you choose to accept it, is to write a preprocessing function for all of your text.  This function should

- convert all text to lowercase,
- remove punctuation,
- stem or lemmatize each word of the text,
- remove stopwords.

The function should receive one string of text and return the processed text.

Once you have built your function, use it to process your train and test data, then fit a Logistic Regression model to see how it performs.

##### My Approach:
1. tokenize (simultaneously lower-casing, removing punctuation)
2. remove stopwords
3. stem
4. join

## Import libraries

In [29]:
import nltk
nltk.download('all', quiet=True)  #set quiet=True to suppress long printouts
from nltk.stem.snowball import SnowballStemmer

from nltk.tokenize import RegexpTokenizer

!pip install regex
import regex as re

from nltk.corpus import stopwords



In [30]:
# instantiate tokenizer
tokenizer = RegexpTokenizer(r'\w+')    #only extract words, not spaces/punctuations (raw string avoids an invalid-escape warning)

# test if our tokenizer works
words = tokenizer.tokenize(data_train['data'][0].lower()) #tokenize the lower case of the text
print(words)    

['hi', 'i', 've', 'noticed', 'that', 'if', 'you', 'only', 'save', 'a', 'model', 'with', 'all', 'your', 'mapping', 'planes', 'positioned', 'carefully', 'to', 'a', '3ds', 'file', 'that', 'when', 'you', 'reload', 'it', 'after', 'restarting', '3ds', 'they', 'are', 'given', 'a', 'default', 'position', 'and', 'orientation', 'but', 'if', 'you', 'save', 'to', 'a', 'prj', 'file', 'their', 'positions', 'orientation', 'are', 'preserved', 'does', 'anyone', 'know', 'why', 'this', 'information', 'is', 'not', 'stored', 'in', 'the', '3ds', 'file', 'nothing', 'is', 'explicitly', 'said', 'in', 'the', 'manual', 'about', 'saving', 'texture', 'rules', 'in', 'the', 'prj', 'file', 'i', 'd', 'like', 'to', 'be', 'able', 'to', 'read', 'the', 'texture', 'rule', 'information', 'does', 'anyone', 'have', 'the', 'format', 'for', 'the', 'prj', 'file', 'is', 'the', 'cel', 'file', 'format', 'available', 'from', 'somewhere', 'rych']


In [31]:
# remove stopwords. This is the preferred approach to removing stopwords - doing it before, outside of 
# the vectorizer function.
stops = set(stopwords.words('english'))  # build the set once; membership tests are faster on a set
print('Original no. of words:',len(words))
words = [w for w in words if w not in stops]
print('New no. of words     :',len(words))

Original no. of words: 109
New no. of words     : 54


In [32]:
# instantiate snowballstemmer
s_stemmer = SnowballStemmer('english')

# stem our words
words = [s_stemmer.stem(i) for i in words]
words

['hi',
 'notic',
 'save',
 'model',
 'map',
 'plane',
 'posit',
 'care',
 '3ds',
 'file',
 'reload',
 'restart',
 '3ds',
 'given',
 'default',
 'posit',
 'orient',
 'save',
 'prj',
 'file',
 'posit',
 'orient',
 'preserv',
 'anyon',
 'know',
 'inform',
 'store',
 '3ds',
 'file',
 'noth',
 'explicit',
 'said',
 'manual',
 'save',
 'textur',
 'rule',
 'prj',
 'file',
 'like',
 'abl',
 'read',
 'textur',
 'rule',
 'inform',
 'anyon',
 'format',
 'prj',
 'file',
 'cel',
 'file',
 'format',
 'avail',
 'somewher',
 'rych']

## Pack all into a function

In [33]:
def clean_words(data):
    # instantiate tokenizer
    tokenizer = RegexpTokenizer(r'\w+')    #only extract words, not spaces/punctuations
    
    # instantiate snowballstemmer
    s_stemmer = SnowballStemmer('english')
    
    # build the stopword set once (membership tests are faster on a set)
    stops = set(stopwords.words('english'))
    
    #tokenize the lower case of the text (a flat list of words, not a list of lists)
    words = tokenizer.tokenize(data.lower())
    
    # remove stopwords 
    words_nostop = [w for w in words if w not in stops]

    # stemming
    words_nostop_stem = [s_stemmer.stem(i) for i in words_nostop]
    
    # Join the words back into one string separated by space, and return the result.
    return " ".join(words_nostop_stem)
    
clean_words(data_train.data[0])

'hi notic save model map plane posit care 3ds file reload restart 3ds given default posit orient save prj file posit orient preserv anyon know inform store 3ds file noth explicit said manual save textur rule prj file like abl read textur rule inform anyon format prj file cel file format avail somewher rych'