
<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# NLP II: `CountVectorizer` and `TfidfVectorizer`

_Authors: Dave Yerrington (SF), Justin Pounders (ATL), Riley Dallas (ATX), Matt Brems (DC)_

---

<img src="https://snag.gy/uvESGH.jpg" alt="drawing" width="800"/>

# $$
\begin{eqnarray*}
\textbf{Fun Fact:  } \text{Word Clouds} &\neq& \text{Data Science}
\end{eqnarray*}
$$

[If you want to generate a word cloud in the shape of something for art only, check here.](https://medium.com/hackernoon/what-real-fake-news-says-about-obamas-presidency-4bf42be71ff1)

## Learning Objectives
---

*After this lesson, you will be able to:*

- Extract features from unstructured text with `sklearn`.
- Describe how `CountVectorizer` and `TfidfVectorizer` work.
- Implement `CountVectorizer` in a spam classification model.


In [1]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

# Import CountVectorizer and TfidfVectorizer from feature_extraction.text.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

## Introduction to Text Feature Extraction

The models we've learned, like linear regression, logistic regression, and k-nearest neighbors, take in an `X` and a `y` variable.
- `X` is a matrix/dataframe of real numbers.
- `y` is a vector/series of real numbers.

Text data (also called natural language data) is not already organized as a matrix or vector of real numbers. We say that this data is **unstructured**.

> This lesson will focus on how to transform our unstructured text data into a numeric `X` matrix.

## Spam Classification Model

One of the most common applications of NLP is predicting "spam"/"ham," or "spam"/"not spam."

Can we predict real vs. promotional texts just based on what is written?

> This data set was taken from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).

In [2]:
# Read in data.
df = pd.read_csv('../datasets/SMSSpamCollection',
                 sep='\t',
                 names=['label', 'message'])

# Check out first five rows.
df.head()

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


## Basic terminology

---

Virtually all NLP uses this base terminology:

- A collection of text is a **document**. 
    - You can think of a document as a row in your feature matrix.
- A collection of documents is a **corpus**. 
    - You can think of your full dataframe as the corpus.

<details><summary>In this specific example, what is a document?</summary>
    
- Each text message in our data set is one document. 
- There are 5,572 documents in our corpus.
</details>

## Model prep
---

Convert ham/spam into binary labels:
- 0 for ham
- 1 for spam

In [3]:
df['label'] = df['label'].map({'ham': 0, 'spam': 1})
df.head()

Unnamed: 0,label,message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


Let's set up our data for modeling:
- `X` will be the `message` column. **NOTE**: `CountVectorizer` requires a vector, so make sure you set `X` to be a `pandas` Series, **not** a DataFrame.
- `y` will be the `label` column

In [4]:
X = df['message']
y = df['label']

In [5]:
# Check the class balance of the target (0 = ham, 1 = spam).
y.value_counts()

0    4825
1     747
Name: label, dtype: int64

In [6]:
# Split the data into the training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.33,
                                                    stratify=y,
                                                    random_state=42)

## Demo: Scikit-Learn `CountVectorizer`
---

`sklearn` offers a `CountVectorizer` class with many configurable options. We'll start with a default CountVectorizer, then get into the various hyperparameters of the class.

<details><summary>Remind me: what is a hyperparameter?</summary>

- A hyperparameter is a setting that affects our model but that the model cannot learn from the data. We have to choose it before fitting!
- Examples of hyperparameters include:
    - the value of $k$ and the distance metric in $k$-nearest neighbors,
    - our regularization constants $\alpha$ or $C$ in linear and logistic regression.
</details>

In [7]:
# Instantiate a CountVectorizer.
cvec = CountVectorizer()

In [8]:
# Fit the vectorizer on our corpus.
cvec.fit(X_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [9]:
# Transform the corpus.
X_train = cvec.transform(X_train)

In [10]:
# Convert X_train into a DataFrame.
# Note: in scikit-learn versions before 1.0, use cvec.get_feature_names() instead.
X_train_df = pd.DataFrame(X_train.toarray(),
                          columns=cvec.get_feature_names_out())
X_train_df

Unnamed: 0,00,000,000pes,008704050406,0089,0121,01223585334,02,0207,02072069400,...,zed,zeros,zhong,zindgi,zoe,zogtorius,zyada,èn,ú1,〨ud
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [11]:
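# Transform the test set using the vocabulary learned from the training set.
# Note: in scikit-learn versions before 1.0, use cvec.get_feature_names() instead.
X_test = cvec.transform(X_test)
X_test_df = pd.DataFrame(X_test.toarray(),
                         columns=cvec.get_feature_names_out())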
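
Notice that we `fit` on the training set only and then `transform` both sets. As a rough sketch of what that implies, the toy example below (the two training messages and one test message are made up for illustration) shows that a word the vectorizer never saw during `fit` is simply ignored at `transform` time.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy messages, used only for illustration.
train_docs = ["free prize now", "call me now"]
test_docs = ["free pizza now"]

toy_cvec = CountVectorizer()

# Learn the vocabulary from the training documents and count their tokens.
train_counts = toy_cvec.fit_transform(train_docs).toarray()

# Reuse that vocabulary on the test document; "pizza" was never seen, so it is dropped.
test_counts = toy_cvec.transform(test_docs).toarray()

# vocabulary_ maps each token to its column index; sorting the keys gives column order.
toy_vocab = sorted(toy_cvec.vocabulary_)

print(toy_vocab)
print(test_counts)
```

The test matrix has exactly the same columns as the training matrix, which is what a downstream model needs.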

## Stop Words

---

Some words are so common that they may not provide legitimate information about the $Y$ variable we're trying to predict. This may or may not be the case!

In [12]:
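# Let's look at sklearn's stop words.
# (The old sklearn.feature_extraction.stop_words module is no longer available
# in recent scikit-learn releases; this import works in old and new versions.)
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

print(ENGLISH_STOP_WORDS)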

frozenset({'above', 'twenty', 'these', 'go', 'nevertheless', 'somehow', 'give', 'sincere', 'wherever', 'next', 'please', 'somewhere', 'around', 'could', 'fifteen', 'our', 'their', 'why', 'per', 'below', 'noone', 'both', 'it', 'beforehand', 'else', 'hereby', 'thereby', 'you', 'might', 'together', 'ours', 'is', 'every', 'yours', 'well', 'hundred', 'anywhere', 'at', 'ever', 'whoever', 'whenever', 'upon', 'that', 'inc', 'seems', 'her', 'myself', 'since', 'among', 'call', 'do', 'whom', 'eleven', 'amount', 'few', 'an', 'however', 'now', 'was', 'hers', 'cry', 'done', 'except', 'in', 'toward', 'until', 'under', 'something', 'wherein', 'couldnt', 'from', 'much', 'everything', 'interest', 'some', 'has', 'see', 'them', 'have', 'my', 'for', 'side', 'keep', 'another', 'anyone', 'even', 'hence', 'otherwise', 'seeming', 'un', 'first', 'should', 'than', 'other', 'thus', 'show', 'afterwards', 'mostly', 'bill', 'co', 'ten', 'here', 'hasnt', 'almost', 'nothing', 'third', 'everyone', 'thick', 'twelve', 'w

`CountVectorizer` gives you the option to eliminate stop words from your corpus.

```python
cvec = CountVectorizer(stop_words='english')
```

You can optionally pass your own list of stop words that you'd like to remove.
```python
cvec = CountVectorizer(stop_words=['list', 'of', 'words', 'to', 'stop'])
```
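
To see the effect, here is a small sketch on a made-up two-message corpus (the messages and variable names are illustrative, not part of the spam data). It fits one vectorizer with the defaults and one with `stop_words='english'`, then compares the vocabularies.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-message corpus, used only for illustration.
toy_corpus = [
    "You have won a free prize",
    "Are you free for lunch today",
]

# Default vectorizer: keeps every token of two or more characters.
cvec_all = CountVectorizer()
cvec_all.fit(toy_corpus)

# Same corpus, with sklearn's built-in English stop words removed.
cvec_stop = CountVectorizer(stop_words='english')
cvec_stop.fit(toy_corpus)

vocab_all = sorted(cvec_all.vocabulary_)
vocab_stop = sorted(cvec_stop.vocabulary_)

print(vocab_all)
print(vocab_stop)

# Tokens that were dropped because they are stop words.
print(sorted(set(vocab_all) - set(vocab_stop)))
```

Words like "you" and "for" disappear, while content words like "free" and "prize" remain.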

## Vocabulary size

---
One downside to `CountVectorizer` is that its vocabulary (`cvec.get_feature_names_out()`, or `cvec.get_feature_names()` in older versions of scikit-learn) can get really large. To mitigate this problem, you can set `max_features` to keep only the $N$ most frequent vocabulary words in the corpus.

```python
cvec = CountVectorizer(max_features=1000) # Only the top 1,000 words from the entire corpus will be saved
```

We can also tell `CountVectorizer` to only keep words whose document frequency (the number or share of documents they appear in) falls within a certain range.

For example, if we only want `CountVectorizer` to add words that occur in **at least** two documents, we can do the following:

```python
cvec = CountVectorizer(min_df=2) # A word must occur in at least two documents from the corpus
```

Conversely, we can set an upper threshold with `max_df`:

```python
cvec = CountVectorizer(max_df=.98) # Ignore words that occur in > 98% of the documents from the corpus
```

<details><summary>Why might we want to control these hyperparameters?</summary>
    
- If we have too many features, our models may take a **very** long time to fit.
- Control for overfitting/underfitting.
- Words in 99% of documents or words occurring in only one document might not be very informative.
</details>
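
As a minimal sketch of how these three options change the vocabulary, the example below uses a made-up four-message corpus and a small helper, `vocab`, that is purely illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical four-message corpus, used only for illustration.
toy_corpus = [
    "free prize call now",
    "free entry call now",
    "call me later tonight",
    "please call back",
]

def vocab(**kwargs):
    """Fit a CountVectorizer with the given options and return its sorted vocabulary."""
    cvec = CountVectorizer(**kwargs)
    cvec.fit(toy_corpus)
    return sorted(cvec.vocabulary_)

vocab_default = vocab()             # every token in the corpus
vocab_min_df = vocab(min_df=2)      # tokens found in at least 2 documents
vocab_max_df = vocab(max_df=0.75)   # tokens found in at most 75% of documents
vocab_top1 = vocab(max_features=1)  # the single most frequent token

print(len(vocab_default))
print(vocab_min_df)
print(vocab_max_df)
print(vocab_top1)
```

Here "call" appears in every message, so `max_df=0.75` removes it, while `min_df=2` removes every word that shows up in only one message.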

## Bag of Words/Word Counting

---

When we have unstructured text data, there is a lot of information in that text data.

When we force unstructured text data to follow a "spreadsheet" or "dataframe" structure, we might lose some of that information.

For example, `CountVectorizer` creates a column for each token and counts the number of occurrences of each token in each document.

In [13]:
X_train_df.head()

Unnamed: 0,00,000,000pes,008704050406,0089,0121,01223585334,02,0207,02072069400,...,zed,zeros,zhong,zindgi,zoe,zogtorius,zyada,èn,ú1,〨ud
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Notice that the order of the words in the original document no longer matters!
- The **bag-of-words model** is a simplified representation of the raw data. 
- In this model, a document is represented as the bag/multiset of its words.

Bag-of-words representations discard grammar, order, and structure in the text but track occurrences.

<details><summary>What might be some of the advantages of the bag-of-words approach?</summary>

- Efficient to store.
- Efficient to model.
- Keeps a decent amount of information.
</details>

<details><summary>What might be some of the disadvantages of the bag-of-words approach?</summary>

- Since bag-of-words models discard grammar, order, structure, and context, we lose a decent amount of information.
- Phrases like "not bad" or "not good" won't be interpreted properly.
</details>
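
The loss of word order is easy to demonstrate. In the sketch below (three made-up sentences, for illustration only), the first two documents mean very different things but end up with identical count vectors.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents: the first two contain the same words in a different order.
docs = ["dog bites man", "man bites dog", "dog bites dog"]

bow_cvec = CountVectorizer()
counts = bow_cvec.fit_transform(docs).toarray()

# Columns are ordered alphabetically by token.
bow_vocab = sorted(bow_cvec.vocabulary_)

print(bow_vocab)
print(counts)

# True when the first two rows are identical.
same_vector = (counts[0] == counts[1]).all()
print(same_vector)
```

The third document differs only because its word counts differ, not because of where the words appear.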

## N-Gram Range
---

`CountVectorizer` has the ability to capture $n$-word phrases, also called $n$-grams. Consider the following:

> The quick brown fox jumped over the lazy dog.

In the example sentence, the 2-grams (aka bi-grams) are:
- 'the quick'
- 'quick brown'
- 'brown fox'
- 'fox jumped'
- 'jumped over'
- 'over the'
- 'the lazy'
- 'lazy dog'

The `ngram_range` determines what $n$-grams should be considered as features.

```python
cvec = CountVectorizer(ngram_range=(1,2)) # Captures every single word and every 2-gram
```

<details><summary>How many 3-grams (tri-grams) would be generated from the phrase "The Cat In The Hat"?</summary>

- Three 3-grams.
    - "The Cat In"
    - "Cat In The"
    - "In The Hat"
</details>

<details><summary>Why might we want to change <code>ngram_range</code> to something other than (1,1)?</summary>

- We can work with multi-word phrases like "not good" or "very hot."
</details>
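
We can check the $n$-grams listed above with the vectorizer's own tokenizing function. This is a small sketch: `build_analyzer()` returns the callable that `CountVectorizer` applies to each raw document, and the variable names are illustrative. Note that the default settings lowercase the text.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentence = "The quick brown fox jumped over the lazy dog."

# ngram_range=(2, 2) asks for 2-grams only.
bigram_analyzer = CountVectorizer(ngram_range=(2, 2)).build_analyzer()
bigrams = bigram_analyzer(sentence)

# ngram_range=(3, 3) asks for 3-grams only.
trigram_analyzer = CountVectorizer(ngram_range=(3, 3)).build_analyzer()
trigrams = trigram_analyzer("The Cat In The Hat")

print(bigrams)
print(trigrams)
```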

## Modeling

---

We may want to test lots of different values of hyperparameters in our CountVectorizer.

<details><summary>What two things will we use to do this?</summary>
    
- GridSearch
- Pipelines
</details>

In [14]:
# Redefine training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.33,
                                                    stratify=y,
                                                    random_state=42)

## Baseline accuracy
---

We need to calculate baseline accuracy in order to tell if our model is outperforming the null model (predicting the majority class).

In [15]:
y_test.value_counts(normalize=True)

0    0.865688
1    0.134312
Name: label, dtype: float64
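
Another way to get a baseline is to score an explicit null model. The sketch below is an illustration only: it rebuilds the labels from the class counts printed by `y.value_counts()` earlier (4,825 ham and 747 spam) instead of using the test split, so its result is close to, but not the same as, the proportion shown above. The placeholder feature matrix is an assumption needed only because `fit` expects an `X`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels rebuilt from the class counts reported earlier: 4825 ham (0), 747 spam (1).
y_all = np.array([0] * 4825 + [1] * 747)

# The dummy model ignores the features, so a column of zeros is enough.
X_placeholder = np.zeros((len(y_all), 1))

# Always predict the most frequent class.
null_model = DummyClassifier(strategy='most_frequent')
null_model.fit(X_placeholder, y_all)

baseline_accuracy = null_model.score(X_placeholder, y_all)
print(baseline_accuracy)
```

Any model worth keeping should beat this number, since we can reach it by predicting "ham" for every message.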

In [16]:
# Let's set it up with two stages:
# 1. An instance of CountVectorizer (transformer)
# 2. A LogisticRegression instance (estimator)

pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('lr', LogisticRegression())
])

## `GridSearchCV`
---

At this point, you could use your `pipe` object as a model:

```python
from sklearn.model_selection import cross_val_score

# Evaluate how your model will perform on unseen data
cross_val_score(pipe, X_train, y_train, cv=3).mean()

# Fit your model
pipe.fit(X_train, y_train)

# Training score
pipe.score(X_train, y_train)

# Test score
pipe.score(X_test, y_test)
```

Since we want to tune over the `CountVectorizer`, we'll load our `pipe` object into `GridSearchCV`.

In [17]:
# Search over the following values of hyperparameters:
# Maximum number of features fit: 2500, 3000, 3500
# Minimum number of documents a token must appear in: 2, 3
# Maximum share of documents a token may appear in: 90%, 95%
# n-gram range: individual tokens only, or individual tokens and bigrams.

pipe_params = {
    'cvec__max_features': [2500, 3000, 3500],
    'cvec__min_df': [2, 3],
    'cvec__max_df': [.9, .95],
    'cvec__ngram_range': [(1,1), (1,2)]
}

In [18]:
# Instantiate GridSearchCV.

gs = GridSearchCV(pipe, # what object are we optimizing?
                  param_grid=pipe_params, # which parameter values are we searching?
                  cv=3) # 3-fold cross-validation.

In [19]:
# Fit GridSearch to training data.
gs.fit(X_train, y_train)



GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('cvec', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'cvec__max_features': [2500, 3000, 3500], 'cvec__min_df': [2, 3], 'cvec__max_df': [0.9, 0.95], 'cvec__ngram_range': [(1, 1), (1, 2)]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [20]:
# What's the best score?
print(gs.best_score_)

0.981516206804179


In [21]:
# Save best model as gs_model.

gs_model = gs.best_estimator_

In [22]:
# Score model on training set.
gs_model.score(X_train, y_train)

0.996785427270292

In [23]:
# Score model on testing set.
gs_model.score(X_test, y_test)

0.9820554649265906
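
After a grid search, it is usually worth checking which combination of hyperparameters won. The sketch below repeats the same pipeline-plus-grid-search pattern on a tiny made-up corpus so it can run on its own. The messages, labels, and the smaller grid are all illustrative assumptions, and the scores it produces say nothing about the real spam data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus (1 = spam, 0 = ham), used only for illustration.
toy_messages = [
    "win a free prize now", "free entry win cash", "claim your free prize",
    "urgent win cash now", "free cash prize claim", "win free entry today",
    "see you at lunch", "call me later tonight", "are you home yet",
    "lunch at noon today", "home late tonight", "see you later",
]
toy_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

toy_pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('lr', LogisticRegression()),
])

# Step name + double underscore + parameter name, as in pipe_params above.
toy_params = {
    'cvec__min_df': [1, 2],
    'cvec__ngram_range': [(1, 1), (1, 2)],
}

toy_gs = GridSearchCV(toy_pipe, param_grid=toy_params, cv=3)
toy_gs.fit(toy_messages, toy_labels)

# The winning combination and the number of combinations that were tried.
print(toy_gs.best_params_)
print(len(toy_gs.cv_results_['params']))
```

On the real search above, `gs.best_params_` gives the same kind of dictionary for the fitted `gs` object.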

## Term Frequency-Inverse Document Frequency (TF-IDF)

---

<details><summary>Which type of word do you think may be most useful in modeling?</summary>

- Words that are common in certain documents but rare in other documents tend to be more informative than words that are common in all documents or rare in all documents.
</details>

A TF-IDF score tells us which words are most differentiating between documents. Words that occur often in one document but don't occur in many documents contain a great deal of predictive power.

- The TF-IDF score is a statistical measure used to evaluate how important a word is to a document relative to all other documents.

Variations of the TF-IDF weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.

<details><summary>(BONUS) Let's see how it's calculated.</summary>
    
Term frequency (`tf`) is the frequency of a certain term in a document:

$$
\mathrm{tf}(t,d) = \frac{N_\text{term}}{N_\text{terms in Document}}
$$

where

- $N_\text{term}$ is the number of times a term/word $t$ appears in document $d$
- $N_\text{terms in Document}$ is the number of terms/words in document $d$

Inverse document frequency (`idf`) measures how rare a term is across the corpus. It is the logarithm of the total number of documents divided by the number of documents that contain the term:

$$
\mathrm{idf}(t, D) = \log\frac{N_\text{Documents}}{N_\text{Documents that contain term}}
$$

where

- $N_\text{Documents}$ is the number of documents in the corpus $D$
- $N_\text{Documents that contain term}$ is the number of documents in $D$ that contain term/word $t$

TF-IDF is then calculated as:

$$
\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)
$$

</details>
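
To make the formulas concrete, here is a minimal plain-Python sketch that follows the definitions above term by term. The three tokenized documents and the function names are made up for illustration. Keep in mind that `TfidfVectorizer` uses a related but not identical weighting by default (raw counts for `tf`, a smoothed `idf`, and L2 normalization of each row), so its numbers will differ from these.

```python
import math

# Hypothetical three-document corpus, already tokenized, used only for illustration.
corpus = [
    ["free", "prize", "free", "call"],
    ["call", "me", "later"],
    ["call", "back", "later"],
]

def tf(term, document):
    """Share of the document's tokens that are `term`."""
    return document.count(term) / len(document)

def idf(term, documents):
    """Log of (number of documents / number of documents containing `term`).

    Assumes `term` appears in at least one document; otherwise this divides by zero.
    """
    n_containing = sum(1 for document in documents if term in document)
    return math.log(len(documents) / n_containing)

def tfidf(term, document, documents):
    """Product of term frequency and inverse document frequency."""
    return tf(term, document) * idf(term, documents)

# "free" is frequent in the first document and absent from the others.
score_free = tfidf("free", corpus[0], corpus)

# "call" appears in every document, so its idf (and its score) is zero.
score_call = tfidf("call", corpus[0], corpus)

print(score_free)
print(score_call)
```

The word that distinguishes the first document gets the high score, and the word shared by every document gets no weight at all.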

<a id='tfidf-vec'></a>
## Practice Using the `TfidfVectorizer`

---

### Why Use TF-IDF?
- Common words are penalized.
- Rare words have more influence.

Scikit-learn provides a TF-IDF vectorizer that works similarly to the other vectorizers we've covered. Notice that we can also eliminate stop words to improve our analysis.

As you did above, import and initialize the `TfidfVectorizer`, then fit the spam and ham data.

In [24]:
# Instantiate the transformer.
tvec = TfidfVectorizer()

In [25]:
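# Use a new name so we don't overwrite the original `df`.
# Note: in scikit-learn versions before 1.0, use tvec.get_feature_names() instead.
tfidf_df = pd.DataFrame(tvec.fit_transform(X_train).toarray(),
                        columns=tvec.get_feature_names_out())
tfidf_df.head()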

Unnamed: 0,00,000,000pes,008704050406,0089,0121,01223585334,02,0207,02072069400,...,zed,zeros,zhong,zindgi,zoe,zogtorius,zyada,èn,ú1,〨ud
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [26]:
X_train = tvec.fit_transform(X_train)

X_test = tvec.transform(X_test)

In [27]:
# Instantiate logistic regression.
lr = LogisticRegression()

# Fit logistic regression.
lr.fit(X_train, y_train)

# Evaluate logistic regression.
print(f'Training Score: {lr.score(X_train, y_train)}')
print(f'Testing Score: {lr.score(X_test, y_test)}')

Training Score: 0.9726761317974819
Testing Score: 0.9711799891245242


## Interview Question

## (BONUS) How is the information from vectorizers stored efficiently?

When you CountVectorize the text messages, you get 8,713 features and have 5,572 rows... this is 48,548,836 entries. That's a lot of data to store in a dataframe!

<details><summary>How many of these values are zero?</summary>

- 48,474,667
- About 99.85% of all values are zero!
</details>

Instead of storing all those zeroes, `sklearn` automatically stores these as a sparse matrix. It saves **a lot** of space.

In [29]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.33,
                                                    stratify=y,
                                                    random_state=42)

cvec = CountVectorizer()

X_train = cvec.fit_transform(X_train)

print(type(X_train))
print(X_train[0])

<class 'scipy.sparse.csr.csr_matrix'>
  (0, 3155)	1
  (0, 2888)	1
  (0, 853)	1
  (0, 4368)	1
  (0, 4462)	1
  (0, 3977)	1
  (0, 6754)	1
  (0, 3407)	1
  (0, 6885)	1
