<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Natural Language Processing Lab

_Authors: Dave Yerrington (SF)_

---

In this lab, we'll explore scikit-learn and NLTK's capabilities for processing text even further. We'll use the 20 newsgroups data set, which is provided by scikit-learn.

In [24]:
# Standard data science imports:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [25]:
# Getting the scikit-learn data set:
from sklearn.datasets import fetch_20newsgroups

### 1) Use the `fetch_20newsgroups` function to download a training and testing set.

The "20 Newsgroups" dataset is described [here](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html).

For this lab let's choose 4 categories to analyze.  The full list is given below.


```python
['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']
```

Note that the solution code will use these categories:
- `alt.atheism`
- `talk.religion.misc`
- `comp.graphics`
- `sci.space`

Also remove the headers, footers, and quotes using the `remove` keyword argument of the function.

In [26]:
# Choose the categories, then load the training and testing subsets.

categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']  # Fill in whatever categories you want to use!!

# Setting our training data
data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=('headers', 'footers', 'quotes'))
# Setting our testing data
data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=('headers', 'footers', 'quotes'))

**Question:** What does the `shuffle` argument do?  Why are we setting a `random_state`?

In [27]:
# A: 
#
# Setting `shuffle=True` returns the documents in a random order rather than
# in the order they were loaded.
# `random_state` seeds that shuffle, so the same order is reproduced on every run.
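
To illustrate the answer above, here is a minimal sketch of a seeded shuffle. It does not call `fetch_20newsgroups`; the helper `shuffled_indices` is a hypothetical name introduced here, and the use of NumPy's `RandomState` is an assumption about how a seeded shuffle can be reproduced, not a copy of scikit-learn's internals.

```python
import numpy as np


def shuffled_indices(n_samples, seed):
    """Return a permutation of range(n_samples) driven by a fixed seed."""
    rng = np.random.RandomState(seed)  # same seed -> same sequence of random numbers
    indices = np.arange(n_samples)
    rng.shuffle(indices)               # shuffles in place
    return indices


first = shuffled_indices(10, seed=42)
second = shuffled_indices(10, seed=42)

print(first)
print(second)
print(np.array_equal(first, second))  # identical order on both runs
```

Because both calls use the same seed, the two permutations match, which is what makes a shuffled train/test load reproducible between notebook runs.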

### 2) Inspect the data.

We've downloaded a few `newsgroups` categories and removed their headers, footers, and quotes.

Because this is a scikit-learn data set, it comes with pre-split training and testing sets, which is why we could request `subset='train'` and `subset='test'` above.

Let's inspect them.

1) What data type is `data_train`?
- Is it a list? A dictionary? What else?
- How many data points does it contain?
- Inspect the first data point. What does it look like?

A: 

The data type is `sklearn.utils.Bunch`.

It is a dictionary-like object with the following keys: `'data'`, `'filenames'`, `'target_names'`, `'target'`, `'DESCR'`, `'description'`.

`data_train.data` is a list of strings, each one a newsgroup post (they read much like emails).

`data_train.filenames` is a NumPy array.

There are 2,034 data points.

The first data point is the body of a newsgroup post, which looks like an email.
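
The inspection steps can be sketched on a small hand-built `Bunch`, which avoids downloading the data set. The three documents and their labels below are made up for illustration; only the `Bunch` class and the key names `data`, `target`, and `target_names` come from scikit-learn.

```python
import numpy as np
from sklearn.utils import Bunch

# Toy stand-in for the object returned by fetch_20newsgroups (invented posts).
toy = Bunch(
    data=[
        "The shuttle launch was delayed.",
        "Which GPU renders fastest?",
        "Orbit insertion succeeded.",
    ],
    target=np.array([1, 0, 1]),
    target_names=["comp.graphics", "sci.space"],
)

print(type(toy))                         # the Bunch class
print(isinstance(toy, dict))             # Bunch is a dict subclass
print(sorted(toy.keys()))                # available keys
print(len(toy.data))                     # number of data points
print(toy.data[0])                       # first document
print(toy.target_names[toy.target[0]])   # category name of the first document
```

The same calls should work on `data_train`, where keys can be read either as attributes (`data_train.data`) or as dictionary entries (`data_train['data']`).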

### 3) Create a bag-of-words model.

Let's train a model using a simple count vectorizer.

1) Initialize a standard CountVectorizer and fit the training data.
- How big is the feature dictionary?
- Eliminate English stop words.
- Is the dictionary smaller?
- Transform the training data using the trained vectorizer.
- Evaluate the performance of a logistic regression on the features extracted by the CountVectorizer.
    - You will have to transform the test set (`data_test`), too. Be careful to use the trained vectorizer without refitting it.

**Bonus**
- Try a couple of modifications:
    - Restrict the `max_features`.
    - Change the `max_df` and `min_df`.
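
The workflow in the steps above can be sketched on a toy corpus before running it on the real data. The documents and variable names here are invented for illustration; the point is that the vectorizer is fitted once on the training text and then only `transform` is called on the test text.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents standing in for data_train.data and data_test.data.
train_docs = [
    "the rocket reached orbit",
    "the image is rendered on the screen",
    "god and the universe",
]
test_docs = ["the rocket image"]

plain = CountVectorizer().fit(train_docs)
no_stop = CountVectorizer(stop_words="english").fit(train_docs)

# Removing stop words should shrink the learned vocabulary.
print(len(plain.vocabulary_), len(no_stop.vocabulary_))

# Fit on training text only; reuse the fitted vectorizer for the test text.
X_train_toy = no_stop.transform(train_docs)
X_test_toy = no_stop.transform(test_docs)
print(X_train_toy.shape, X_test_toy.shape)  # same number of columns

# Bonus idea: cap the vocabulary size with max_features.
capped = CountVectorizer(stop_words="english", max_features=5).fit(train_docs)
print(len(capped.vocabulary_))
```

Calling `fit` again on the test documents would build a different vocabulary, so the test columns would no longer line up with the features the classifier was trained on.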

In [28]:
# A:
#
# The feature dictionary contains 26,879 elements
# Removing the English stop words reduces the dictionary to 26,576 elements.

In [29]:
from sklearn.feature_extraction.text import CountVectorizer

In [30]:
cv = CountVectorizer(stop_words='english')
cv.fit(data_train.data)

X_train = cv.transform(data_train.data)
y_train = data_train.target

In [31]:
len(cv.vocabulary_)  # vocabulary size; avoids get_feature_names(), which newer scikit-learn releases removed
X_train.shape

(2034, 26576)

In [32]:
X_test = cv.transform(data_test.data)
y_test = data_test.target

In [33]:
from sklearn.linear_model import LogisticRegression

In [34]:
clf = LogisticRegression()

In [35]:
clf.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [36]:
clf.score(X_test, y_test)

0.7450110864745011

In [37]:
from sklearn.dummy import DummyClassifier
# Baseline: random guesses drawn according to the training class frequencies.
dummy = DummyClassifier(strategy='stratified')
dummy.fit(X_train, y_train)
dummy.score(X_test, y_test)

0.29563932002956395
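
The dummy score gives the logistic regression accuracy something to be compared against. As a sketch of a deterministic alternative, the block below uses `strategy="most_frequent"`, which always predicts the most common training label. The labels are invented, and the all-zero feature matrices are placeholders, since this strategy ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Invented labels; class 1 is the most common one in the training set.
y_train_toy = np.array([0, 1, 1, 2, 1, 3])
y_test_toy = np.array([1, 0, 1, 2])

# Placeholder features: the dummy classifier never looks at them.
X_train_placeholder = np.zeros((len(y_train_toy), 1))
X_test_placeholder = np.zeros((len(y_test_toy), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_placeholder, y_train_toy)

baseline_accuracy = baseline.score(X_test_placeholder, y_test_toy)
print(baseline_accuracy)  # share of test labels equal to the majority class
```

A model is only useful if it clearly beats a baseline like this one; with four categories of similar size, a baseline near 0.25 to 0.30 is expected.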

### 4) Test Out Hashing and TF-IDF.

Let's see if hashing or TF-IDF improves our accuracy.

1) Initialize a HashingVectorizer and repeat the test with no restriction on the number of features.
- Does the score improve with respect to the CountVectorizer?
- Print out the number of features for this model.
- Initialize a TF-IDF vectorizer and repeat the analysis above.
- Print out the number of features for this model.

**Bonus**
- Change the parameters of either (or both) models to improve your score.

In [None]:
# A:
#
# The accuracy scores are about the same:
#
#     CountVectorizer:   0.745
#     HashingVectorizer: 0.737
#     TF-IDF:            0.748

In [66]:
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer
hv = HashingVectorizer(stop_words='english')
tfidf = TfidfVectorizer(stop_words='english')

#### Hashing Vectorizer

In [75]:
X_train_hv = hv.fit_transform(data_train.data)
print(f'number of features: {X_train_hv.shape[1]}')

number of features: 1048576


In [68]:
X_test_hv = hv.transform(data_test.data)

In [69]:
clf.fit(X_train_hv, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [70]:
clf.score(X_test_hv, y_test)

0.7368810051736882
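
The feature count of 1048576 printed above equals `2 ** 20`, the default `n_features` of `HashingVectorizer`: the vectorizer stores no vocabulary and instead hashes each token to a column index. The sketch below shows that idea in plain Python. It is a simplification under stated assumptions: it uses MD5 as a stand-in hash, whereas scikit-learn uses MurmurHash3, and it skips the sign alternation and normalization that `HashingVectorizer` applies by default. The function name `hashed_counts` is hypothetical.

```python
import hashlib
import re


def hashed_counts(text, n_features=16):
    """Count tokens into a fixed number of columns chosen by a hash."""
    counts = [0] * n_features
    # Same token rule as scikit-learn's default: words of 2+ characters.
    for token in re.findall(r"\b\w\w+\b", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        column = int(digest, 16) % n_features  # hash -> column index
        counts[column] += 1
    return counts


vec = hashed_counts("the rocket reached orbit, the rocket landed")
print(vec)
print(sum(vec))   # every token lands in exactly one column
print(2 ** 20)    # 1048576, matching the feature count printed above
```

With few columns, different tokens can collide in the same column; a large `n_features` makes such collisions rare, at the cost of losing the ability to map columns back to words.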

#### TF-IDF Vectorizer

In [76]:
X_train_tfidf = tfidf.fit_transform(data_train.data)
print(f'number of features: {X_train_tfidf.shape[1]}')

number of features: 26576


In [72]:
X_test_tfidf = tfidf.transform(data_test.data)

In [73]:
clf.fit(X_train_tfidf, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [74]:
clf.score(X_test_tfidf, y_test)

0.7479674796747967
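
To see what TF-IDF changes relative to raw counts, the sketch below recomputes the weights by hand on a made-up corpus and compares them with `TfidfVectorizer`. It assumes the default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), under which the inverse document frequency is `ln((1 + n) / (1 + df)) + 1` and each row is scaled to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Invented documents for illustration.
docs = ["rocket orbit rocket", "orbit image", "image render image"]

counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]

# Document frequency: in how many documents each term appears.
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (scikit-learn's default form).
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale each row to unit L2 norm.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, reference))
```

Terms that occur in many documents get a smaller weight, which is why the vocabulary size stays at 26576 while the feature values change.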

### 5) [Bonus] Robust Text Preprocessing

Your mission, should you choose to accept it, is to write a preprocessing function for all of your text. This function should:

- convert all text to lowercase,
- remove punctuation,
- stem or lemmatize each word of the text,
- remove stopwords.

The function should receive one string of text and return the processed text.

Once you have built your function, use it to process your train and test data, then fit a Logistic Regression model to see how it performs.
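
One possible starting point is sketched below using only the standard library. It is not a full solution: the stop word set is a deliberately tiny, hypothetical list, and `crude_stem` is a rough suffix stripper standing in for a real stemmer such as NLTK's `PorterStemmer`. Stop words are removed before stemming so that stemmed forms do not fail to match the list.

```python
import string

# Hypothetical, deliberately tiny stop word list; NLTK and scikit-learn ship fuller ones.
STOP_WORDS = {"a", "an", "and", "are", "is", "in", "of", "the", "to", "was", "were"}

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def crude_stem(word):
    """Strip a few common suffixes; a rough stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, drop punctuation, remove stop words, and stem each token."""
    tokens = text.lower().translate(PUNCT_TABLE).split()
    kept = [crude_stem(tok) for tok in tokens if tok not in STOP_WORDS]
    return " ".join(kept)


print(preprocess("The rockets were launching, and the crowd cheered!"))
# rocket launch crowd cheer
```

A function with this signature could be applied to each document with a list comprehension, or passed to `CountVectorizer` through its `preprocessor` argument, before fitting the logistic regression as in the earlier sections.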