# This iiiiiis Jeopardyyy!

> [Source of Dataset](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/)

In [1]:
# Install packages.
# !pip3 install pandas regex scikit-learn --upgrade

In [2]:
# Import libraries.
import pandas as pd
import regex as re

import time

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

### Agenda

1. Import and explore our data using `pandas`.
2. Clean our text data.
3. Vectorize our text data.
4. Fit model.
5. Evaluate performance.

### Import and explore our data using `pandas`.

In [3]:
# Read our Jeopardy data in.
jeopardy = pd.read_csv('./jeopardy.csv')

In [4]:
# Examine the first five rows.
jeopardy.head()

Unnamed: 0,show_number,air_date,round,category,value,question,answer
0,4680,12/31/04,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,12/31/04,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,12/31/04,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,12/31/04,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,12/31/04,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [5]:
# Convert air_date to a datetime column.
jeopardy['air_date'] = pd.to_datetime(jeopardy['air_date'])

In [6]:
# What is the latest air_date?
max(jeopardy['air_date'])

Timestamp('2012-01-27 00:00:00')

In [7]:
# What is the earliest air_date?
min(jeopardy['air_date'])

Timestamp('1984-09-10 00:00:00')

In [8]:
# What is the breakdown of questions by round?
jeopardy['round'].value_counts()

Jeopardy!           107384
Double Jeopardy!    105912
Final Jeopardy!       3631
Tiebreaker               3
Name: round, dtype: int64

In [9]:
# What is the breakdown of questions by category?
jeopardy['category'].value_counts()

BEFORE & AFTER              547
SCIENCE                     519
LITERATURE                  496
AMERICAN HISTORY            418
POTPOURRI                   401
                           ... 
LITERATURE OF THE 1800s       1
FICTIONAL PEOPLE              1
19th CENTURY POLITICIANS      1
WOMEN IN POEMS                1
SEMIANNUAL PUBLICATIONS       1
Name: category, Length: 27983, dtype: int64

In [10]:
# How can I filter my dataframe to see the 
# questions asked in a Tiebreaker round?
jeopardy[jeopardy['round'] == 'Tiebreaker']

Unnamed: 0,show_number,air_date,round,category,value,question,answer
12305,5332,2007-11-13,Tiebreaker,CHILD'S PLAY,,A Longfellow poem & a Lillian Hellman play abo...,The Children's Hour
184710,2941,1997-05-19,Tiebreaker,THE AMERICAN REVOLUTION,,"On Nov. 15, 1777 Congress adopted this constit...",the Articles of Confederation
198973,4150,2002-09-20,Tiebreaker,LITERARY CHARACTERS,,"Hogwarts headmaster, he's considered by many t...",Professor Dumbledore


### Clean our text data.

In [11]:
# Examine zero-th question.
jeopardy['question'][0]

"For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory"

In [12]:
# Examine first question.
jeopardy['question'][1]

'No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves'

In [13]:
# Examine second question.
jeopardy['question'][2]

'The city of Yuma in this state has a record average of 4,055 hours of sunshine each year'

This data is particularly clean. But our "toolbox" for cleaning text data includes many things of which we may want to be aware:
- Convert all words to lower case.
- Tokenize text.
- Remove HTML artifacts.
- Remove punctuation.
- Lemmatize/stem.

#### Convert words to lower case.

Python is case-sensitive (as are most languages!), but we often want to make apples-to-apples comparisons.

In [14]:
# Examine zero-th question.
jeopardy['question'][0]

"For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory"

Depending on how we create columns of data, `For` and `for` may be treated as two different tokens, even though they are the same word!

In [15]:
# Print string in lower case.
jeopardy['question'][0].lower()

"for the last 8 years of his life, galileo was under house arrest for espousing this man's theory"

In [16]:
# Create lower_case variable.
lower_case = jeopardy['question'][0].lower()
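
To make the `For`/`for` issue concrete, here is a minimal sketch that counts tokens before and after lower-casing. It uses only the standard library and a hard-coded copy of the zero-th question, so it does not depend on the `jeopardy` DataFrame; the variable names are illustrative.

```python
from collections import Counter

question = ("For the last 8 years of his life, Galileo was under house "
            "arrest for espousing this man's theory")

# Without lower-casing, 'For' and 'for' are counted as two different tokens.
raw_counts = Counter(question.split())
print(raw_counts['For'], raw_counts['for'])

# After lower-casing, both occurrences are counted under one token.
lower_counts = Counter(question.lower().split())
print(lower_counts['for'])
```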

#### Tokenize text.

When we **tokenize** text data, we split a string into smaller strings based on some pattern.

In [17]:
# Use .split() to tokenize our text data.
lower_case.split()

['for',
 'the',
 'last',
 '8',
 'years',
 'of',
 'his',
 'life,',
 'galileo',
 'was',
 'under',
 'house',
 'arrest',
 'for',
 'espousing',
 'this',
 "man's",
 'theory']

In [18]:
# Define tokens.
tokens = lower_case.split()

**Discussion 1**
<details><summary>Why do you think tokenizing might be beneficial?</summary>

- Tokenizing allows us to iterate through each individual word or phrase in our list. 
- We can then clean each token separately/individually.
- We can also count up tokens meeting a certain condition, which enables us to better understand our data.
</details>

One can use regular expressions to detect specific patterns, in case we want to do something more specific than just splitting on one specific character string, like spaces.
> The `nltk` library includes the `RegexpTokenizer` class if you want to tokenize based on a regular-expression pattern. Below, we'll see some examples of patterns we can identify!
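
As a rough sketch of pattern-based tokenizing, the block below uses `re.findall` from the standard library `re` module (the notebook imports the third-party `regex` package under the same name; the two patterns here should behave the same in either). The patterns are illustrative choices, not the only reasonable ones.

```python
import re

lower_case = ("for the last 8 years of his life, galileo was under house "
              "arrest for espousing this man's theory")

# \w+ matches runs of letters, digits, and underscores, so the comma in
# "life," is dropped and "man's" is split into "man" and "s".
word_tokens = re.findall(r"\w+", lower_case)
print(word_tokens)

# Allowing an optional apostrophe suffix keeps "man's" as a single token.
apostrophe_tokens = re.findall(r"\w+(?:'\w+)?", lower_case)
print(apostrophe_tokens)
```

Which pattern is preferable depends on whether possessives and contractions should count as their own vocabulary entries.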

In [19]:
# Use re.findall to find all tokens with a numeric digit.

for token in tokens:
    print(re.findall(r'\d+', token), token)

[] for
[] the
[] last
['8'] 8
[] years
[] of
[] his
[] life,
[] galileo
[] was
[] under
[] house
[] arrest
[] for
[] espousing
[] this
[] man's
[] theory


In [20]:
# Use re.findall to split tokens up containing punctuation.

for token in tokens:
    print(re.findall(r'\w+|\$[\d\.]+|\S+', token), token)

['for'] for
['the'] the
['last'] last
['8'] 8
['years'] years
['of'] of
['his'] his
['life', ','] life,
['galileo'] galileo
['was'] was
['under'] under
['house'] house
['arrest'] arrest
['for'] for
['espousing'] espousing
['this'] this
['man', "'s"] man's
['theory'] theory


In [21]:
# Use re.findall to select words beginning with a capital letter.
# We shouldn't see any, since we've forced tokens to be lower case!

for token in tokens:
    print(re.findall(r'[A-Z]\w+', token), token)

[] for
[] the
[] last
[] 8
[] years
[] of
[] his
[] life,
[] galileo
[] was
[] under
[] house
[] arrest
[] for
[] espousing
[] this
[] man's
[] theory


In [22]:
# Use re.findall to select words beginning with a capital letter.
# If we look at the original jeopardy['question'][0].split()
# before applying .lower(), then we should see something.

for token in jeopardy['question'][0].split():
    print(re.findall(r'[A-Z]\w+', token), token)

['For'] For
[] the
[] last
[] 8
[] years
[] of
[] his
[] life,
['Galileo'] Galileo
[] was
[] under
[] house
[] arrest
[] for
[] espousing
[] this
[] man's
[] theory


In [23]:
# Use re.findall to get non-letters and non-numbers.
for token in tokens:
    print(re.findall('[^a-zA-Z0-9]', token), token)

[] for
[] the
[] last
[] 8
[] years
[] of
[] his
[','] life,
[] galileo
[] was
[] under
[] house
[] arrest
[] for
[] espousing
[] this
["'"] man's
[] theory


#### Remove HTML artifacts.

It is unlikely that any of these Jeopardy questions contain HTML artifacts that we'd want to discard, so we won't apply that to this data. 

However, lots of data will contain HTML! If you're scraping or downloading information, HTML code snippets like `<br>` or `\` will likely show up. These will likely cause problems with your analysis -- your tokenizer (and your vectorizer, which we'll discuss later) may not understand what to do when `<br>` is encountered.

> The `bs4` library includes a `BeautifulSoup` object and `.get_text()` method if you want to remove code snippets!
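
If installing `bs4` is not an option, the standard library can do a rough version of the same job. The sketch below is a stand-in for `BeautifulSoup(...).get_text()`, not a replacement for it: `TextExtractor` and `strip_html` are hypothetical names, and the example string is made up rather than taken from the Jeopardy data.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text between tags, ignoring the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text; entities like &amp; are already decoded.
        self.chunks.append(data)

def strip_html(raw_html):
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()
    # Join the text runs with spaces, then squeeze repeated whitespace.
    return " ".join(" ".join(parser.chunks).split())

example = "This <b>housewares</b> store<br>was named for its &amp; packaging"
print(strip_html(example))
```

Because text runs are joined with a space, markup in the middle of a word (for example `<b>house</b>wares`) would be split into two tokens; a real HTML library handles such cases more carefully.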

#### Lemmatize/stem words.

Some words may have similar meaning, but be spelled differently.

In [24]:
# Examine zero-th question.
jeopardy['question'][0]

# Year / years
# Espouse / espousing
# Theory / theories

"For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory"

If we want to combine these words together, we might use a lemmatizer or stemmer. By "combine," I mean, "map all uses of the word `years` to `year`, map all uses of the word `espousing` to `espouse`, and so on."

In [25]:
# Examine second question.
jeopardy['question'][2]

# State (noun) / state (verb)
# Average / mean
# Record (adjective) / record (noun) / record (verb)

'The city of Yuma in this state has a record average of 4,055 hours of sunshine each year'

Lemmatizers and stemmers are not perfect. They're inexact!
- Lemmatizers tend to be "gentler." They change fewer words, but may have more false negatives. (Words that should be changed, but aren't.)
- Stemmers tend to be "cruder." They change more words, but may have more false positives. (Words that should not be changed, but are changed anyway.)

> The `nltk` library contains lemmatizers and stemmers. I've used the `WordNetLemmatizer()` and the `PorterStemmer()`.
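
To illustrate why stemming is inexact, here is a deliberately crude suffix-stripping stemmer. This is a toy written for this example, not the Porter algorithm and not anything from `nltk`; the suffix list and the minimum stem length of three characters are arbitrary assumptions.

```python
# Suffixes are checked in order, so longer ones come first.
SUFFIXES = ("ing", "ies", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for word in ["years", "espousing", "theory", "theories", "news"]:
    print(word, "->", toy_stem(word))
```

Even this tiny rule set shows both kinds of error: `theory` and `theories` end up with different stems (a false negative), while `news` is wrongly reduced to `new` (a false positive).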

#### Put into one function and clean Jeopardy data.

In [26]:
def jeopardy_clean(input_text):
    # The input is a single string (one question), and 
    # the output is a single string (a cleaned question).
    
    # 1. Remove any punctuation.
    letters_numbers = re.sub("[^a-zA-Z0-9]", " ", input_text)
    
    # 2. Convert to lower case. 
    lower_case = letters_numbers.lower()
    
    # 3. split into individual words.
    words = lower_case.split()
    
    # If you wanted to add more in here, like:
    # removing HTML artifacts,
    # lemmatizing/stemming,
    # removing stopwords,  <-- we'll use scikit-learn to do this later
    # then you could do that here!
    
    # 4. Join the words back into one string separated by space, 
    # and return the result.
    return(" ".join(words))

In [27]:
# Initialize an empty list for our cleaned Jeopardy data.
clean_data = []

# How many questions do we have?
total_questions = len(jeopardy['question'])

print("Cleaning the Jeopardy questions...")

# Instantiate counter.
count = 0

# For every question in our data...
for question in jeopardy['question']:
    
    # Clean question, then append to clean_data.
    clean_data.append(jeopardy_clean(question))
    
    # If the index is divisible by 10,000, print a message.
    if (count + 1) % 10_000 == 0:
        print(f'Review {count + 1} of {total_questions}.')
    
    count += 1

# Let us know when we're done and how many questions are cleaned.
print(f'Done! Cleaned all {total_questions} questions.')

Cleaning the Jeopardy questions...
Review 10000 of 216930.
Review 20000 of 216930.
Review 30000 of 216930.
Review 40000 of 216930.
Review 50000 of 216930.
Review 60000 of 216930.
Review 70000 of 216930.
Review 80000 of 216930.
Review 90000 of 216930.
Review 100000 of 216930.
Review 110000 of 216930.
Review 120000 of 216930.
Review 130000 of 216930.
Review 140000 of 216930.
Review 150000 of 216930.
Review 160000 of 216930.
Review 170000 of 216930.
Review 180000 of 216930.
Review 190000 of 216930.
Review 200000 of 216930.
Review 210000 of 216930.
Done! Cleaned all 216930 questions.
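
If the progress messages are not needed, the same cleaning can be written as a list comprehension. The sketch below repeats the steps of `jeopardy_clean` in condensed form and runs it on two hard-coded questions so that it is self-contained; the `questions` list is a stand-in for `jeopardy['question']`.

```python
import re

def jeopardy_clean(input_text):
    # Replace anything that is not a letter or digit with a space,
    # lower-case the result, and squeeze runs of whitespace.
    letters_numbers = re.sub("[^a-zA-Z0-9]", " ", input_text)
    return " ".join(letters_numbers.lower().split())

questions = [
    "For the last 8 years of his life, Galileo was under house arrest "
    "for espousing this man's theory",
    "The city of Yuma in this state has a record average of 4,055 hours "
    "of sunshine each year",
]

# One cleaned string per question, in the original order.
clean_data = [jeopardy_clean(question) for question in questions]
print(clean_data)
```

With the full DataFrame, `jeopardy['question'].apply(jeopardy_clean)` should give the same strings as a pandas Series.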


In [28]:
# Examine first ten questions.
clean_data[:10]

['for the last 8 years of his life galileo was under house arrest for espousing this man s theory',
 'no 2 1912 olympian football star at carlisle indian school 6 mlb seasons with the reds giants braves',
 'the city of yuma in this state has a record average of 4 055 hours of sunshine each year',
 'in 1963 live on the art linkletter show this company served its billionth burger',
 'signer of the dec of indep framer of the constitution of mass second president of the united states',
 'in the title of an aesop fable this insect shared billing with a grasshopper',
 'built in 312 b c to link rome the south of italy it s still in use today',
 'no 8 30 steals for the birmingham barons 2 306 steals for the bulls',
 'in the winter of 1971 72 a record 1 122 inches of snow fell at rainier paradise ranger station in this state',
 'this housewares store was named for the packaging its merchandise came in was first displayed on']
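
Notice that the cleaning splits `man's` into `man s` and `4,055` into `4 055`. Whether that matters depends on the model, but if we wanted to avoid it, one possible variant is sketched below. `jeopardy_clean_v2` is a hypothetical name, and the two extra rules are assumptions about what "better" means here, not part of the original function.

```python
import re

def jeopardy_clean_v2(input_text):
    # Remove commas that sit between two digits: "4,055" -> "4055".
    text = re.sub(r"(?<=\d),(?=\d)", "", input_text)
    # Drop apostrophes instead of turning them into spaces: "man's" -> "mans".
    text = text.replace("'", "")
    # Replace every remaining non-alphanumeric character with a space.
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    # Lower-case and squeeze whitespace.
    return " ".join(text.lower().split())

print(jeopardy_clean_v2("a record average of 4,055 hours of sunshine"))
print(jeopardy_clean_v2("espousing this man's theory"))
```

The stray `s` is less harmful than it looks, because the default token pattern of scikit-learn's vectorizers only keeps tokens of two or more characters.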

### Vectorize our text data.

We have our cleaned data! Now we can start preparing our model.

In machine learning:
- `X` is a matrix/dataframe of real numbers.
- `y` is a vector/series of real numbers.

Our $Y$ variable will be, "Was this question asked during the 'Double Jeopardy!' round?"
- Often, we might want a more _interesting_ $Y$ variable, like predicting whether or not the clue is from the `BEFORE & AFTER` category. However, it's tough to define categories that don't fall prey to [imbalanced classes](https://blog.roboflow.com/handling-unbalanced-classes/), which make machine learning more difficult. Today, we're going to pick a slightly less interesting but easier problem to solve so that we can focus our attention on natural language processing!

In [29]:
# Create our y variable.
y = pd.Series([1 if item == 'Double Jeopardy!' else 0 for item in jeopardy['round']])

In [30]:
# Confirm we created our variable correctly.
y.value_counts()

0    111018
1    105912
dtype: int64
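
These counts also give us a baseline to compare models against: a "model" that always predicts the majority class. The short sketch below hard-codes the two counts from the output above, so it runs without the data; `class_counts` is just an illustrative name.

```python
# Class counts copied from y.value_counts() above.
class_counts = {0: 111_018, 1: 105_912}

total = sum(class_counts.values())

# Always predicting the most common class is right this often.
baseline_accuracy = max(class_counts.values()) / total
print(f'Baseline accuracy: {baseline_accuracy:.4f}')
```

Any accuracy we report later is only interesting to the extent that it beats this number.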

In [31]:
# Create our X variable.
X = pd.DataFrame(clean_data, columns=['question'])

In [32]:
# Split the data into the training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X['question'],
                                                    y,
                                                    test_size=0.2, # 80% training / 20% test.
                                                    stratify=y,    # keep distribution of Y same across train/test
                                                    random_state=42)

Text data (natural language data) is often not already organized as a matrix or vector of real numbers. For example, above, we have a list of cleaned questions.

**Vectorizing our text data** describes one fairly simple process for converting a set of text data into a matrix of real numbers.

There are two basic, commonly used types of vectorizer: `CountVectorizer` and `TfidfVectorizer`.

#### CountVectorizer

<img src="./images/countvectorizer.png" alt="drawing" width="700"/>

[Source.](https://towardsdatascience.com/nlp-learning-series-part-2-conventional-methods-for-text-classification-40f2839dd061)

In [33]:
# Instantiate CountVectorizer.

cvec = CountVectorizer()

In [34]:
# Fit the vectorizer on our corpus. (corpus = collection of documents)

cvec.fit(X_train)

CountVectorizer()

In [35]:
# Transform the corpus.
X_train = cvec.transform(X_train)

In [36]:
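# Convert X_train into a DataFrame.
# Note: .toarray() creates a dense matrix, which can use a lot of memory.

X_train_df = pd.DataFrame(X_train.toarray(),
                          columns=cvec.get_feature_names_out())  # get_feature_names() in scikit-learn < 1.0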
X_train_df

Unnamed: 0,00,000,0000,0003,000529,000th,001,002,0025,004,...,zygomatic,zygote,zygotes,zymase,zynga,zyplast,zyuganov,zyzzyx,zz,zzzz
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
173539,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
173540,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
173541,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
173542,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [37]:
# Transform test data.
X_test = cvec.transform(X_test)

Our tokens are now stored as a bag-of-words. This is a simplified way of looking at and storing our data.
- Bag-of-words representations discard grammar, order, and structure in the text but track occurrences.
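
To see what "discarding order" means in practice, here is a small pure-Python sketch of a bag-of-words count matrix. It is not scikit-learn's implementation: `bag_of_words` is a hypothetical helper, and the token pattern (words of two or more characters) is an assumption chosen to resemble `CountVectorizer`'s default.

```python
import re
from collections import Counter

# Keep tokens made of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def bag_of_words(docs):
    """Return (vocabulary, rows), where each row holds per-token counts."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in docs]
    # One column per unique token, in alphabetical order.
    vocab = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[token] for token in vocab])
    return vocab, rows

docs = ["the dog bit the man", "the man bit the dog"]
vocab, rows = bag_of_words(docs)
print(vocab)
print(rows)
```

The two sentences mean very different things, yet they produce identical rows, because only the counts survive.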

**Discussion 2**
<details><summary>What might be some of the advantages of using this bag-of-words approach when modeling?</summary>

- Efficient to store.
- Efficient to model.
- Keeps a decent amount of information.
</details>

**Discussion 3**
<details><summary>What might be some of the disadvantages of using this bag-of-words approach when modeling?</summary>

- Since bag-of-words models discard grammar, order, structure, and context, we lose a decent amount of information.
- Phrases like "not bad" or "not good" won't be interpreted properly.
</details>

In [38]:
# Instantiate logistic regression model.

lr = LogisticRegression(solver = 'liblinear')

In [39]:
# Fit logistic regression model.

lr.fit(X_train, y_train)

LogisticRegression(solver='liblinear')

In [40]:
# Get model accuracy on training data.

print(f'Model Accuracy on Training Data: {lr.score(X_train, y_train)}')

Model Accuracy on Training Data: 0.7641001705619325


In [41]:
# Get model accuracy on testing data.

print(f'Model Accuracy on Testing Data: {lr.score(X_test, y_test)}')


Model Accuracy on Testing Data: 0.5725579680081132


#### TfidfVectorizer

<img src="./images/tfidfvectorizer.png" alt="drawing" width="900"/>

[Source.](https://towardsdatascience.com/nlp-learning-series-part-2-conventional-methods-for-text-classification-40f2839dd061)
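
The weighting can also be worked through by hand. The sketch below computes TF-IDF for one document in pure Python, using the smoothed formula $\text{idf}(t) = \ln\frac{1 + n}{1 + \text{df}(t)} + 1$ followed by L2 normalization, which, as far as I know, matches the documented defaults of `TfidfVectorizer`. The three tiny documents are made up for illustration.

```python
import math
from collections import Counter

# Three tiny, already-tokenized stand-in documents (not rows from the data).
docs = [
    ['this', 'state', 'is', 'sunny'],
    ['this', 'state', 'is', 'snowy'],
    ['this', 'insect', 'hops'],
]
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    term_freq = Counter(doc)
    # Term count times the smoothed idf: ln((1 + n) / (1 + df)) + 1.
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in term_freq.items()
    }
    # Scale the row to unit Euclidean (L2) length.
    norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}

weights = tfidf(docs[0])
for term, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f'{term:>6}: {weight:.3f}')
```

`this` appears in every document and gets the smallest weight, while `sunny` appears in only one and gets the largest.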

In [42]:
# Split the data into the training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X['question'],
                                                    y,
                                                    test_size=0.2, # 80% training / 20% test
                                                    stratify=y,    # keep distribution of Y same across train/test
                                                    random_state=42)

# Instantiate TfidfVectorizer.
tvec = TfidfVectorizer()

# Fit the vectorizer and transform the training data.
X_train = tvec.fit_transform(X_train)

# Transform the testing data.
X_test = tvec.transform(X_test)

In [43]:
# Instantiate logistic regression model.
lr = LogisticRegression(solver = 'liblinear')

# Fit logistic regression model.
lr.fit(X_train, y_train)

# Get model accuracy on training data.
print(f'Model Accuracy on Training Data: {lr.score(X_train, y_train)}')

# Get model accuracy on testing data.
print(f'Model Accuracy on Testing Data: {lr.score(X_test, y_test)}')

Model Accuracy on Training Data: 0.7151097128105841
Model Accuracy on Testing Data: 0.584335960909049


### Hyperparameters

---

**Discussion 4**
<details><summary>What is a hyperparameter?</summary>

- A hyperparameter is a setting that we choose before fitting. It affects our model, but our model cannot learn it from our data!
- Examples of hyperparameters include:
    - the value of $k$ and the distance metric in $k$-nearest neighbors,
    - our regularization constants $\alpha$ or $C$ in linear and logistic regression.
</details>

There are many different hyperparameters of `CountVectorizer` that can affect the fit of our model!
- `stop_words`
- `max_features`, `max_df`, `min_df`
- `ngram_range`


#### `stop_words`

Some words are so common that they may not provide legitimate information about the $Y$ variable we're trying to predict.

In [44]:
# Let's look at sklearn's stopwords.
print(CountVectorizer(stop_words = 'english').get_stop_words())

frozenset({'behind', 'via', 'rather', 'it', 'last', 'third', 'sixty', 'was', 'nine', 'myself', 'in', 'than', 'formerly', 'least', 'someone', 'our', 'twelve', 'part', 'otherwise', 'may', 'made', 'along', 'everything', 'must', 'eleven', 'top', 'always', 'became', 'others', 'nevertheless', 'am', 'yourself', 'sincere', 'first', 'whatever', 'mill', 'hundred', 'former', 'thus', 'while', 'whereupon', 'none', 'twenty', 'interest', 'keep', 'latter', 'five', 'go', 'can', 'even', 'herself', 'seeming', 'whose', 'whither', 'several', 'some', 'somewhere', 'cant', 'forty', 'however', 'here', 'together', 'call', 're', 'nobody', 'so', 'yours', 'bottom', 'by', 'see', 'through', 'side', 'every', 'beyond', 'perhaps', 'himself', 'ltd', 'whole', 'hence', 'any', 'inc', 'namely', 'thereupon', 'eg', 'per', 'fifty', 'too', 'what', 'either', 'much', 'that', 'de', 'themselves', 'these', 'ourselves', 'con', 'i', 'somehow', 'well', 'co', 'such', 'only', 'above', 'get', 'if', 'they', 'without', 'cry', 'about', 'amon

`CountVectorizer` gives you the option to eliminate stopwords from your corpus when instantiating your vectorizer.

```python
cvec = CountVectorizer(stop_words='english')
```

You can optionally pass your own list of stopwords that you'd like to remove.
```python
cvec = CountVectorizer(stop_words=['list', 'of', 'words', 'to', 'stop'])
```

#### Vocabulary size

---
One downside to `CountVectorizer` and `TfidfVectorizer` is that the vocabulary (`cvec.get_feature_names_out()` in recent scikit-learn versions, `cvec.get_feature_names()` in older ones) can get really large. We're creating one column for every unique token in our corpus of data!

There are three hyperparameters to help you control this.

1. You can set `max_features` to only include the $N$ most frequent vocabulary words in the corpus.

```python
cvec = CountVectorizer(max_features=1_000) # Only the top 1,000 words from the entire corpus will be saved
```

2. You can tell `CountVectorizer` to only consider words that occur in **at least** some number of documents.

```python
cvec = CountVectorizer(min_df=2) # A word must occur in at least two documents from the corpus
```

3. Conversely, you can tell `CountVectorizer` to only consider words that occur in **at most** some percentage of documents.

```python
cvec = CountVectorizer(max_df=.98) # Ignore words that occur in > 98% of the documents from the corpus
```

Both `max_df` and `min_df` can accept either an integer or a float.
- An integer tells us the number of documents.
- A float between 0.0 and 1.0 tells us the proportion of documents.
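
The integer-versus-float rule can be sketched in a few lines of pure Python. `filter_vocab` below is a hypothetical helper that mimics how I understand `min_df` and `max_df` to behave (terms strictly below the lower bound or strictly above the upper bound are dropped); it is not scikit-learn code, and the documents are shortened stand-ins.

```python
from collections import Counter

def filter_vocab(tokenized_docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df]."""
    n_docs = len(tokenized_docs)
    # Count each term once per document it appears in.
    doc_freq = Counter(t for doc in tokenized_docs for t in set(doc))
    # Integers are absolute document counts; floats are proportions.
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(t for t, df in doc_freq.items() if low <= df <= high)

docs = [
    ['the', 'city', 'of', 'yuma'],
    ['the', 'title', 'of', 'an', 'aesop', 'fable'],
    ['the', 'city', 'of', 'rome'],
]

print(filter_vocab(docs, min_df=2))              # in at least 2 documents
print(filter_vocab(docs, min_df=2, max_df=0.9))  # ...and in at most 90% of them
```

Here `the` and `of` occur in every document, so the `max_df=0.9` cut removes them and leaves only `city`.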

**Discussion 5**
<details><summary>Why might we want to control these vocabulary size hyperparameters?</summary>
    
- If we have too many features, our models may take a **very** long time to fit.
- Control for overfitting/underfitting.
- Words in 99% of documents or words occurring in only one document might not be very informative.
</details>

#### `ngram_range`

---

`CountVectorizer` has the ability to capture $n$-word phrases, also called $n$-grams. Consider the following:

> The quick brown fox jumps over the lazy dog.

In the example sentence, the 2-grams are:
- 'The quick'
- 'quick brown'
- 'brown fox'
- 'fox jumps'
- 'jumps over'
- 'over the'
- 'the lazy'
- 'lazy dog.'

The `ngram_range` determines what $n$-grams should be considered as features.

```python
cvec = CountVectorizer(ngram_range=(1,2)) # Captures every 1-gram and every 2-gram
```

**Discussion 6**
<details><summary>How many 3-grams would be generated from the phrase "the quick brown fox jumped over the lazy dog"?</summary>

- Seven 3-grams.
    - 'The quick brown'
    - 'quick brown fox'
    - 'brown fox jumped'
    - 'fox jumped over'
    - 'jumped over the'
    - 'over the lazy'
    - 'the lazy dog.'
</details>
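
The counts above are easy to check with a sliding window. The `ngrams` helper below is a hypothetical pure-Python sketch, not what `CountVectorizer` runs internally; a sentence with $w$ words has $w - n + 1$ $n$-grams.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox jumped over the lazy dog".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(len(bigrams), bigrams)
print(len(trigrams), trigrams)
```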

**Discussion 7**
<details><summary>Why might we want to change ngram_range to something other than (1,1)?</summary>

- We can work with multi-word phrases like "not good" or "very hot."
</details>

## Modeling

---

We may want to test lots of different values of hyperparameters in our `CountVectorizer` and `TfidfVectorizer`.

In [45]:
# Redefine training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X['question'],
                                                    y,
                                                    test_size=0.2,
                                                    stratify=y,
                                                    random_state=42)

**Discussion 8**
<details><summary>What is GridSearch?</summary>
    
- GridSearch allows us to try different values of different hyperparameters, measure our model's performance on each one, and return the best model.
</details>

In [46]:
# Let's set a pipeline up with two stages:
# 1. CountVectorizer (transformer)
# 2. LogisticRegression (estimator)

pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('lr', LogisticRegression(solver = 'liblinear'))
])

A **pipeline** allows us to GridSearch over both:
- one or more transformers (like our CountVectorizer)
- an estimator (a model, like our logistic regression model)

In [47]:
# Search over the following values of hyperparameters:
# Maximum number of features fit: 1,000 and 2,000.
# Check (individual tokens) and also check (individual tokens and 2-grams).
# Check removing English stopwords and not removing any stopwords.

pipe_params = {
    'cvec__max_features': [1_000, 2_000],
    'cvec__ngram_range': [(1,1), (1,2)],
    'cvec__stop_words': ['english', None]
}

In [48]:
# Instantiate GridSearchCV.

gs_cvec = GridSearchCV(pipe, # what object are we optimizing?
                       param_grid=pipe_params, # which parameter values are we searching?
                       cv=5) # 5-fold cross-validation.

**Discussion 9**
<details><summary>How many models are we fitting here?</summary>

- 2 max_features
- 2 ngram_range
- 2 stop_words
- 5-fold CV
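- 2 * 2 * 2 * 5 = 40 model fits, plus one final refit of the best combination on the full training set (`refit=True` is the default)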
</details>
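
The arithmetic can be checked by enumerating the grid ourselves. The sketch below uses `itertools.product` from the standard library on a copy of `pipe_params`; it only lists the combinations and does not fit anything, and `cv_folds` is an illustrative name for the `cv=5` setting.

```python
from itertools import product

pipe_params = {
    'cvec__max_features': [1_000, 2_000],
    'cvec__ngram_range': [(1, 1), (1, 2)],
    'cvec__stop_words': ['english', None],
}
cv_folds = 5

# Every combination of one value per hyperparameter.
combinations = [
    dict(zip(pipe_params, values))
    for values in product(*pipe_params.values())
]

for combo in combinations:
    print(combo)

# Each combination is fit once per cross-validation fold.
cv_fits = len(combinations) * cv_folds
print(f'{len(combinations)} combinations x {cv_folds} folds = {cv_fits} fits')
```

Adding one more value to any list grows the number of fits multiplicatively, which is why grids are usually kept small at first.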

In [49]:
# Start stopwatch.
t0 = time.time()

# Fit GridSearch to training data.
gs_cvec.fit(X_train, y_train)

print(time.time() - t0)

248.04471611976624


In [50]:
# Evaluate accuracy on training data.
gs_cvec.score(X_train, y_train)

0.6033340248006269

In [51]:
# Evaluate accuracy on testing data.
gs_cvec.score(X_test, y_test)

0.5849352325634998

In [52]:
# What model performed best?
gs_cvec.best_params_

{'cvec__max_features': 2000,
 'cvec__ngram_range': (1, 1),
 'cvec__stop_words': None}

### All at once, now, with `TfidfVectorizer`!

In [53]:
# Redefine training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X['question'],
                                                    y,
                                                    test_size=0.2,
                                                    stratify=y,
                                                    random_state=42)

# Let's set a pipeline up with two stages:
# 1. TfidfVectorizer (transformer)
# 2. LogisticRegression (estimator)

pipe = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('lr', LogisticRegression(solver = 'liblinear'))
])

# Search over the following values of hyperparameters:
# Maximum number of features fit: 1,000 and 2,000.
# Check (individual tokens) and also check (individual tokens and 2-grams).
# Check removing English stopwords and not removing any stopwords.

pipe_params = {
    'tvec__max_features': [1_000, 2_000],
    'tvec__ngram_range': [(1,1), (1,2)],
    'tvec__stop_words': ['english', None]
}

# Instantiate GridSearchCV.
gs_tvec = GridSearchCV(pipe, # what object are we optimizing?
                       param_grid=pipe_params, # which parameter values are we searching?
                       cv=5) # 5-fold cross-validation.


# Fit GridSearch to training data.
gs_tvec.fit(X_train, y_train)

# Evaluate accuracy on training data.
print(f'Model Accuracy on Training Data: {gs_tvec.score(X_train, y_train)}.')

# Evaluate accuracy on testing data.
print(f'Model Accuracy on Testing data: {gs_tvec.score(X_test, y_test)}.')

# The logistic regression model with what
# hyperparameters performed best? (As measured
# by accuracy.)
print(f'Best parameter values: {gs_tvec.best_params_}')

Model Accuracy on Training Data: 0.6042444567371963.
Model Accuracy on Testing data: 0.5866869497072789.
Best parameter values: {'tvec__max_features': 2000, 'tvec__ngram_range': (1, 1), 'tvec__stop_words': None}
