# Natural Language Processing I Lab

In this lab we will further explore scikit-learn's capabilities for processing text. We will use the 20 Newsgroups dataset, which scikit-learn can download for us. You can learn more about this data [here](http://qwone.com/~jason/20Newsgroups/). We will use four categories: baseball, computer graphics, science (medicine) and science (space).

**Goal:** Use the text of these newsgroup posts to predict which newsgroup each post came from. You will build a logistic regression model in `sklearn`.

In [5]:
# Standard Data Science Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer, CountVectorizer

# Getting that SKLearn Dataset
from sklearn.datasets import fetch_20newsgroups

%matplotlib inline

In [6]:
categories = [
    'rec.sport.baseball',
    'comp.graphics',
    'sci.med',
    'sci.space',
]

data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=('headers', 'footers', 'quotes'))

data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=('headers', 'footers', 'quotes'))

## 1. Data inspection

We have downloaded a few newsgroup categories and removed headers, footers and quotes.

Because this is a scikit-learn dataset, it comes with pre-split train and test sets (note that we selected them above with `subset='train'` and `subset='test'`).

In [7]:
type(data_train)

sklearn.utils.Bunch

Let's inspect `data_train`.

- What data type is `data_train`?
- Is it structured like a list or dictionary?

Inspect `data_train['data']`.
- How many data points does it contain?
- Describe the first data point.

Similarly, you should inspect `data_train['target']`.

In [8]:
data_train['data'][0]
len(data_train['target'])

2368

In [9]:
data_train['data'][9]
data_train['target']

array([1, 2, 2, ..., 0, 2, 3])
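A quick tally of the label array helps with the class-balance question asked later in the lab. The sketch below uses a short made-up list (`toy_target`, a hypothetical stand-in for `data_train['target']`), so the counts are illustrative only.

```python
from collections import Counter

# Hypothetical stand-in for data_train['target']; the real array has 2368 labels.
toy_target = [1, 2, 2, 0, 3, 0, 2, 3, 1, 0, 1, 3]

counts = Counter(toy_target)
n = len(toy_target)

# Share of each class; roughly equal shares suggest accuracy is a fair summary metric.
shares = {label: counts[label] / n for label in sorted(counts)}
for label, share in shares.items():
    print(label, counts[label], f"{share:.2f}")
```

On the lab data, the same tally should work with `Counter(data_train['target'].tolist())`.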

## 2. Bag-of-Words Model

A "bag-of-words" model ignores punctuation and word order and represents each document only by how many times each word appears in it.
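To make the idea concrete, here is a minimal pure-Python sketch. The helper `bag_of_words` is hypothetical (it is not part of scikit-learn); its token pattern mirrors the default `token_pattern` that scikit-learn's vectorizers report, which keeps tokens of two or more word characters.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, keep tokens of 2+ word characters, and count them."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(tokens)

a = bag_of_words("The Rangers beat the Orioles.")
b = bag_of_words("the orioles beat the rangers")

print(a)
print(a == b)  # word order and punctuation do not affect the counts
```

The two sentences say opposite things, yet they produce identical bags. That is the information a bag-of-words model gives up.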

Let's train a model using a simple CountVectorizer.

1. Initialize a standard CountVectorizer and fit the training data.
    - How big is the feature dictionary?
    - Eliminate English stop words.
    - Check the size of the feature dictionary. Is it smaller?
2. Fit a Logistic Regression model.
    - Given that there are more than two classes, [check the documentation](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) to see which multiclass scheme is fit by default. Depending on your scikit-learn version and solver, it is either [one-vs-rest](http://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html) or multinomial.
    - Do we have a major concern with unbalanced classes here?
    - Transform the training data using the trained vectorizer. (You'll create `X_train` and `y_train` here.)
    - Transform the test data using the trained vectorizer. **Be careful to use the trained vectorizer without re-fitting it.** (You'll create `X_test` and `y_test` here.)
        - (This is similar to when you use `.fit()` and `.predict()` in models. You fit on the training data, then predict with the testing data; you don't re-fit the model on the testing data!)
    - Fit your model!
    - Evaluate the performance of your Logistic Regression model on the features extracted by the CountVectorizer.
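The fit-on-train, transform-on-test steps listed above can be sketched on a toy corpus. The documents and labels here are made up for illustration; only the workflow carries over to the newsgroup data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy documents and labels (hypothetical), standing in for the newsgroup posts.
train_docs = [
    "the pitcher threw a fastball",
    "home run in the ninth inning",
    "the shuttle reached orbit",
    "nasa launched a new satellite",
]
y_train = [0, 0, 1, 1]
test_docs = ["the pitcher hit a home run", "satellite in orbit around mars"]

cvec = CountVectorizer(stop_words="english")
X_train = cvec.fit_transform(train_docs)  # learn the vocabulary on training text only
X_test = cvec.transform(test_docs)        # reuse that vocabulary; no re-fitting

lr = LogisticRegression()
lr.fit(X_train, y_train)
preds = lr.predict(X_test)

print(X_train.shape, X_test.shape)  # both matrices have the same number of columns
print("mars" in cvec.vocabulary_)   # words seen only in the test text are dropped
```

Because the vocabulary is frozen after fitting, the test matrix lines up column-for-column with the training matrix, which is what the model expects.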

### BONUS
- Try some modifications:
    - restrict `max_features`
    - change `max_df` and `min_df`
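As a starting point for these modifications, the sketch below shows how each parameter changes the vocabulary on a four-document toy corpus (made up for illustration). A float `max_df` is read as a proportion of documents, and an integer `min_df` as an absolute document count.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "space shuttle launch",
    "space station orbit",
    "space probe data",
    "baseball game tonight",
]

full = CountVectorizer().fit(docs)
top3 = CountVectorizer(max_features=3).fit(docs)   # keep the 3 most frequent terms
no_common = CountVectorizer(max_df=0.5).fit(docs)  # drop terms in more than 50% of docs
no_rare = CountVectorizer(min_df=2).fit(docs)      # drop terms in fewer than 2 docs

print(len(full.vocabulary_), len(top3.vocabulary_))
print(sorted(no_rare.vocabulary_))
print("space" in no_common.vocabulary_)
```

Here "space" appears in three of four documents, so it is the only term that survives `min_df=2` and the only one removed by `max_df=0.5`.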

In [12]:
df = pd.DataFrame(data_train['data']).T
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2358,2359,2360,2361,2362,2363,2364,2365,2366,2367
0,The O's just lost to the Rangers a few minutes...,\n\n\tAsk the practitioner whether he uses the...,\n\nSjogren's syndrome has been known to induc...,"\nHi Janet,\n\nSounds exactly like mine. Same...","Me, too... RBI are a worthless stat. Of course...",someone wrote in expressing concern about gett...,I am currently looking for a 3D graphics libra...,BoSox 3 Royals 1\n\nWP: Clemens (1-0)\nLP:...,\n\n\n Try graPHIGS from IBM... It is an exc...,\n\n\nI recall that the issue is that fat on t...,...,Any more news on Steve's status since he lost ...,\n\n\nGood thing i stuck in a couple of questi...,I had allergy shots for about four years start...,"Forwarded from Neal Ausman, Galileo Mission Di...","\nHey Valentine, I don't see Boston with any w...",\nHi there\nI'm suffering from Sarcoidosis at ...,There is a nice little tool in Lucid emacs. It...,I have the need for displaying 2 1/2 D surface...,Subject: options before back surgery for protr...,Does anyone know ifthe STS-56 email press kit ...


In [7]:
cvec = CountVectorizer(stop_words='english')
cvec.fit(df.iloc[0])
cvecdata = cvec.transform(df.iloc[0])

In [8]:
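# Vocabulary size: 29314 with stop words kept, 29013 with English stop words removed.
# On scikit-learn older than 1.0, use cvec.get_feature_names() instead.
new_df = pd.DataFrame(cvecdata.todense(), columns=cvec.get_feature_names_out())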

In [9]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
model = lr.fit(new_df, data_train['target'])
model.score(new_df, data_train['target'])

0.97804054054054057

In [10]:
from sklearn.metrics import classification_report, confusion_matrix

In [11]:
preds = model.predict(new_df)

In [12]:
print(classification_report(data_train['target'],preds))

             precision    recall  f1-score   support

          0       1.00      0.97      0.98       584
          1       0.92      1.00      0.96       597
          2       1.00      0.97      0.98       594
          3       1.00      0.97      0.99       593

avg / total       0.98      0.98      0.98      2368



In [13]:
# Now evaluate on the test data

In [14]:
df_test = pd.DataFrame(data_test['data']).T

In [15]:
cvecdata_test = cvec.transform(df_test.iloc[0])

In [16]:
# On scikit-learn older than 1.0, use cvec.get_feature_names() instead.
new_df_2 = pd.DataFrame(cvecdata_test.todense(), columns=cvec.get_feature_names_out())

In [17]:
test_predictions = model.predict(new_df_2)

In [18]:
print(classification_report(data_test['target'], test_predictions))

             precision    recall  f1-score   support

          0       0.86      0.86      0.86       389
          1       0.79      0.94      0.86       397
          2       0.89      0.78      0.83       396
          3       0.83      0.78      0.80       394

avg / total       0.84      0.84      0.84      1576
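The report above gives per-class precision and recall, while a confusion matrix shows which classes are mistaken for which. Here is a small sketch with made-up labels, using the `confusion_matrix` and `accuracy_score` functions imported at the top of the notebook.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels and predictions for three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
acc = accuracy_score(y_true, y_pred)   # fraction of predictions that match

print(cm)
print(acc)
```

On the lab data, the equivalent call would be `confusion_matrix(data_test['target'], test_predictions)`.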



In [19]:
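# Here's some code that might provide an interesting look at the data.

common_words = []

for i in range(4):
    # Select the training rows whose label is i, then total each word's count.
    # Note: the output shown for this cell came from the mask `new_df_2 == i`
    # (the test-set DataFrame), so it does not reflect per-class totals.
    word_count = new_df[data_train['target'] == i].sum(axis=0)

    print(data_train['target_names'][i], "most common words")

    cw = word_count.sort_values(ascending=False).head(10)

    print(cw)

    common_words.extend(cw.index)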

comp.graphics most common words
space    810.0
edu      594.0
data     422.0
like     397.0
don      363.0
just     360.0
time     352.0
year     348.0
nasa     347.0
image    340.0
dtype: float64
rec.sport.baseball most common words
like     74.0
don      70.0
use      67.0
know     64.0
just     53.0
think    45.0
time     39.0
year     36.0
does     35.0
good     33.0
dtype: float64
sci.med most common words
like      20.0
time      19.0
don       19.0
just      13.0
files     12.0
year      10.0
run        9.0
people     9.0
know       7.0
good       7.0
dtype: float64
sci.space most common words
space       16.0
edu         16.0
like         6.0
use          5.0
time         5.0
don          5.0
think        5.0
computer     5.0
know         4.0
just         4.0
dtype: float64


In [20]:
#new_df[data_train['target'] == 1].sum(axis=0).sort_values(ascending=False)

## 3. Hashing and TF-IDF

Let's see if Hashing or TF-IDF improves the accuracy.

1. Initialize a HashingVectorizer and repeat the analysis above, leaving `n_features` at its default.
    - Does the score improve with respect to the CountVectorizer?
2. Initialize a TF-IDF Vectorizer and repeat the analysis above.
    - Does the score improve with respect to the CountVectorizer?

In [None]:
df = pd.DataFrame(data_train['data']).T

In [21]:
hvec = HashingVectorizer(stop_words='english')
hvec.fit(df.iloc[0])

HashingVectorizer(alternate_sign=True, analyzer='word', binary=False,
         decode_error='strict', dtype=<class 'numpy.float64'>,
         encoding='utf-8', input='content', lowercase=True,
         n_features=1048576, ngram_range=(1, 1), non_negative=False,
         norm='l2', preprocessor=None, stop_words='english',
         strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
         tokenizer=None)
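The `n_features=1048576` in the output above is the fixed number of columns (2**20) that tokens are hashed into. The pure-Python sketch below illustrates the hashing trick under simplifying assumptions: the helper `hashed_counts` is hypothetical, `md5` stands in for the MurmurHash3 function scikit-learn uses, and the sign alternation and L2 normalization (`alternate_sign`, `norm`) are left out.

```python
import hashlib
import re

def hashed_counts(text, n_features=16):
    """Map each token to a column by hashing; no vocabulary is stored."""
    vec = [0] * n_features
    for token in re.findall(r"\b\w\w+\b", text.lower()):
        # md5 is a stand-in here; scikit-learn uses MurmurHash3.
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % n_features] += 1
    return vec

v1 = hashed_counts("space shuttle space station")
v2 = hashed_counts("space shuttle space station")

print(v1)
print(sum(v1), v1 == v2)  # total count equals the number of tokens; hashing is deterministic
```

Because columns are assigned by the hash alone, nothing is learned from the data, and different tokens can collide in the same column. A large `n_features` keeps such collisions rare.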

In [22]:
df = pd.DataFrame(hvec.transform(df.iloc[0]).todense())
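The cell above turns a matrix with 1,048,576 columns into a dense array, which for 2368 rows of `float64` values needs roughly 20 GB of memory. `LogisticRegression` also accepts SciPy sparse input, so one alternative is to skip the dense DataFrame entirely. A minimal sketch on made-up documents:

```python
from scipy import sparse
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

# Toy documents and labels (hypothetical).
docs = [
    "the pitcher threw a fastball",
    "home run in the ninth inning",
    "the shuttle reached orbit",
    "nasa launched a new satellite",
]
labels = [0, 0, 1, 1]

hvec = HashingVectorizer(stop_words="english")
X = hvec.fit_transform(docs)  # stateless: fit learns nothing; X is a sparse matrix

# Bytes a dense float64 copy of the lab's 2368-row training matrix would need.
dense_bytes = 2368 * X.shape[1] * 8

model = LogisticRegression().fit(X, labels)  # sparse input is accepted directly
print(sparse.issparse(X), X.shape, dense_bytes)
```

The sparse matrix stores only the nonzero entries, so its memory use grows with the number of distinct tokens per document rather than with `n_features`.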

In [23]:
lr = LogisticRegression()

In [24]:
model = lr.fit(df, data_train['target'])

In [25]:
preds = model.predict(df)

In [26]:
print(classification_report(data_train['target'], preds))

             precision    recall  f1-score   support

          0       0.97      0.96      0.96       584
          1       0.91      0.99      0.95       597
          2       1.00      0.96      0.98       594
          3       0.99      0.95      0.97       593

avg / total       0.97      0.96      0.96      2368



In [27]:
df_2 = pd.DataFrame(hvec.transform(df_test.iloc[0]).todense())

In [28]:
preds_2 = model.predict(df_2)

In [29]:
print(classification_report(data_test['target'], preds_2))

             precision    recall  f1-score   support

          0       0.87      0.87      0.87       389
          1       0.80      0.92      0.85       397
          2       0.86      0.80      0.83       396
          3       0.85      0.79      0.82       394

avg / total       0.85      0.84      0.84      1576



In [31]:
df = pd.DataFrame(data_train['data']).T

In [32]:
tvec = TfidfVectorizer(stop_words='english')

In [33]:
tvec.fit(df.iloc[0])

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)
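With the settings shown above (`use_idf=True`, `smooth_idf=True`, `norm='l2'`), each raw count is multiplied by a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and each row is then scaled to unit length. The pure-Python sketch below follows that formula on a made-up corpus; the helper names `idf` and `tfidf` are hypothetical, and the sketch is meant to illustrate the weighting rather than reproduce scikit-learn's implementation.

```python
import math
from collections import Counter

# Toy tokenized corpus (hypothetical).
docs = [
    ["space", "shuttle", "space"],
    ["space", "station"],
    ["baseball", "game"],
]
n_docs = len(docs)

# Document frequency: number of documents containing each term.
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency.
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(doc):
    weights = {t: c * idf(t) for t, c in Counter(doc).items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))  # L2 norm of the row
    return {t: w / norm for t, w in weights.items()}

row = tfidf(docs[0])
print(row)
print(idf("space") < idf("baseball"))  # terms found in more documents get a smaller IDF
```

Compared with raw counts, this weighting shrinks the influence of words that appear in many documents, which may explain part of the difference between the TF-IDF and CountVectorizer scores.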

In [34]:
df = pd.DataFrame(tvec.transform(df.iloc[0]).todense())

In [35]:
lr = LogisticRegression()
model = lr.fit(df, data_train['target'])

In [36]:
preds_tvec = model.predict(df) 

In [38]:
print(classification_report(data_train['target'], preds_tvec))

             precision    recall  f1-score   support

          0       0.98      0.97      0.97       584
          1       0.92      1.00      0.96       597
          2       1.00      0.97      0.98       594
          3       1.00      0.96      0.98       593

avg / total       0.98      0.97      0.97      2368



In [39]:
df_2 = pd.DataFrame(tvec.transform(df_test.iloc[0]).todense())

In [40]:
preds_test = model.predict(df_2)

In [41]:
print(classification_report(data_test['target'], preds_test))

             precision    recall  f1-score   support

          0       0.90      0.90      0.90       389
          1       0.85      0.93      0.89       397
          2       0.91      0.87      0.89       396
          3       0.89      0.83      0.86       394

avg / total       0.89      0.88      0.88      1576



### BONUS
- Build a model comparing `rec.sport.baseball` to `sci.med`. Evaluate the model.
- Build a separate model comparing `sci.med` to `sci.space`. Evaluate the model.
- Compare the two model evaluations. Is it easier for one model to differentiate the two sources than another model? Why is that?
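One possible shape for these two-class models is a pipeline that bundles the vectorizer and the classifier, using the `make_pipeline` helper imported at the top of the notebook. The sketch below uses made-up documents (0 = baseball, 1 = medicine) and is only meant to show the structure.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy two-class corpus (hypothetical): 0 = baseball, 1 = medicine.
train_docs = [
    "the pitcher threw a fastball in the ninth inning",
    "a home run won the game for the team",
    "the doctor prescribed antibiotics for the infection",
    "allergy shots reduced the patient's symptoms",
]
y_train = [0, 0, 1, 1]
test_docs = ["the team lost the game", "the patient saw a doctor"]

pipe = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
pipe.fit(train_docs, y_train)    # the vectorizer is fit on the training text only
preds = pipe.predict(test_docs)  # test text is transformed, then classified

print(preds)
```

For the real comparison, the same `fetch_20newsgroups` call used at the top of the lab can be repeated with a two-item `categories` list, once per pair of newsgroups.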