# This iiiiiis Jeopardyyy!

> [Source of Dataset](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/)

In [None]:
# Install packages.
# !pip3 install pandas regex scikit-learn --upgrade

In [None]:
# Import libraries.
import pandas as pd
import regex as re

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

### Agenda

1. Import and explore our data using `pandas`.
2. Clean our text data.
3. Vectorize our text data.
4. Fit model.
5. Evaluate performance.

### Import and explore our data using `pandas`.

In [None]:
# Read our Jeopardy data in.
jeopardy = pd.read_csv('./jeopardy.csv')

In [None]:
# Examine the first five rows.
jeopardy.head()

In [None]:
# Convert air_date to a datetime column.
jeopardy['air_date'] = pd.to_datetime(jeopardy['air_date'])

In [None]:
# What is the latest air_date?
max(jeopardy['air_date'])

In [None]:
# What is the earliest air_date?
min(jeopardy['air_date'])

In [None]:
# What is the breakdown of questions by round?
jeopardy['round'].value_counts()

In [None]:
# What is the breakdown of questions by category?
# (Assumes the column is named 'category'.)
jeopardy['category'].value_counts()

In [None]:
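# How can I filter my dataframe to see all
# questions in a Tiebreaker round?
# (Assumes the round is labeled 'Tiebreaker' in the data.)
jeopardy[jeopardy['round'] == 'Tiebreaker']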

### Clean our text data.

In [None]:
# Examine zero-th question.
jeopardy['question'][0]

In [None]:
# Examine first question.
jeopardy['question'][1]

In [None]:
# Examine second question.
jeopardy['question'][2]

This data is particularly clean. But our "toolbox" for cleaning text data includes many things of which we may want to be aware:
- Convert all words to lower case.
- Tokenize text.
- Remove HTML artifacts.
- Remove punctuation.
- Lemmatize/stem.

#### Convert words to lower case.

As Python is case-sensitive (as are most languages!), we often want to make apples-to-apples comparisons.

In [None]:
# Examine zero-th question.
jeopardy['question'][0]

Depending on how we create columns of data, the `For` and `for` may be interpreted differently, even though these are the same word!
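
A minimal sketch of why this matters when we count tokens. The sentence below is made up for illustration (it is not from the Jeopardy data), and the variable names are ours:

```python
from collections import Counter

# Hypothetical sentence, not taken from the dataset.
sample = "For the last time, this one is for the win"

# Counting raw tokens treats 'For' and 'for' as different words.
raw_counts = Counter(sample.split())

# Lower-casing first merges them into a single token.
lower_counts = Counter(sample.lower().split())

print(raw_counts['For'], raw_counts['for'])   # counted separately
print(lower_counts['for'])                    # counted together
```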

In [None]:
# Print string in lower case.
jeopardy['question'][0].lower()

In [None]:
# Create lower_case variable.
lower_case = jeopardy['question'][0].lower()

#### Tokenize text.

When we **tokenize** text data, we split a string into smaller strings based on some pattern.

In [None]:
# Use .split() to tokenize our text data.
jeopardy['question'][0].lower().split()

In [None]:
# Define tokens.
tokens = jeopardy['question'][0].lower().split()

<details><summary>Why do you think tokenizing might be beneficial?</summary>

- It allows us to iterate through each token in our list to clean separately/individually.
</details>

We can use regular expressions to detect specific patterns, in case we want to do something more specific than just splitting on one particular character string, like spaces.
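
As a rough illustration of the difference, the sketch below tokenizes a made-up clue two ways. It uses the standard library's `re` module (the `findall` call works the same way as in the `regex` package imported above), and the pattern is just one possible choice, not a recommended tokenizer:

```python
import re

# Hypothetical clue text, used only for illustration.
text = "In 1492, Columbus sailed for $2.50 a day!"

# Whitespace splitting keeps punctuation attached to words.
split_tokens = text.split()

# A pattern-based tokenizer: dollar amounts first, then runs of word
# characters, then any other non-space character as its own token.
pattern = r"\$[\d.]+|\w+|[^\w\s]"
regex_tokens = re.findall(pattern, text)

print(split_tokens)
print(regex_tokens)
```
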
> The `nltk` library includes the `RegexpTokenizer()` function if you want to tokenize based on this specific pattern. Below, we'll see some examples of patterns we can identify!

In [None]:
# Use re.findall to find all tokens with a numeric digit.

for token in tokens:
    print(re.findall(r'\d+', token), token)

In [None]:
# Use re.findall to split tokens up containing punctuation.

for token in tokens:
    print(re.findall(r'\w+|\$[\d\.]+|\S+', token), token)

In [None]:
# Use re.findall to select words beginning with a capital letter.

for token in tokens:
    print(re.findall(r'[A-Z]\w+', token), token)

In [None]:
# Use re.findall to select words beginning with a capital letter,
# this time splitting the original question text directly.

for token in jeopardy['question'][0].split():
    print(re.findall(r'[A-Z]\w+', token), token)

In [None]:
# Use re.findall to get non-letters and non-numbers.
for token in tokens:
    print(re.findall(r'[^a-zA-Z0-9]', token), token)

#### Remove HTML artifacts.

It is unlikely that any of these Jeopardy questions contain HTML artifacts that we'd want to discard, so we won't do that here. However, HTML code snippets (like `<br>` or `\`) might be interpreted incorrectly by a vectorizer.
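
For a sense of what removing markup involves, here is a minimal sketch that uses only the standard library's `html.parser` as a rough stand-in for a dedicated HTML library. The `TextExtractor` class, the `strip_html` helper, and the sample string are all hypothetical names made up for this example:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text found between tags."""
    def __init__(self):
        super().__init__()  # character references like &amp; are decoded by default
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_html(raw):
    parser = TextExtractor()
    parser.feed(raw)
    parser.close()
    # Join the text pieces, then collapse the extra whitespace left by removed tags.
    return " ".join(" ".join(parser.chunks).split())

# Hypothetical snippet; the Jeopardy questions may not contain any markup.
raw = 'This <i>"Great"</i> lake<br>borders Ohio &amp; Michigan'
print(strip_html(raw))
```
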
> The `bs4` library includes a `BeautifulSoup` object and `.get_text()` method if you want to remove code snippets!

#### Lemmatize/stem words.

Some words share a meaning but appear in different forms (singular vs. plural, or different verb tenses), so they are spelled differently.

In [None]:
# Examine zero-th question.
jeopardy['question'][0]

In [None]:
# Examine second question.
jeopardy['question'][2]

If we want to combine these words together, we might use a lemmatizer or stemmer. Lemmatizers and stemmers are not perfect. They're inexact!
- Lemmatizers tend to be "gentler." They change fewer words, but may have more false negatives. (Words that should be changed, but aren't.)
- Stemmers tend to be "cruder." They change more words, but may have more false positives. (Words that should not be changed, but are changed anyway.)

> The `nltk` library contains lemmatizers and stemmers. I've used the `WordNetLemmatizer()` and the `PorterStemmer()`.

#### Put into one function and clean Jeopardy data.

In [None]:
def jeopardy_clean(input_text):
    # The input is a single string (one question), and 
    # the output is a single string (a cleaned question).
    
    # 1. Remove any punctuation.
    letters_numbers = re.sub("[^a-zA-Z0-9]", " ", input_text)
    
    # 2. Convert to lower case. 
    lower_case = letters_numbers.lower()
    
    # 3. split into individual words.
    words = lower_case.split()
    
    # If you wanted to add more in here, like:
    # removing HTML artifact,
    # lemmatizing/stemming,
    # removing stopwords,
    # then you could do that here!
    
    # 4. Join the words back into one string separated by space, 
    # and return the result.
    return " ".join(words)

In [None]:
# Initialize an empty list for our cleaned Jeopardy data.
clean_jeopardy = []

# How many questions do we have?
total_questions = len(jeopardy['question'])

print("Cleaning the Jeopardy questions...")

# Instantiate counter.
count = 0

# For every question in our data...
for question in jeopardy['question']:
    
    # Clean question, then append to clean_jeopardy.
    clean_jeopardy.append(jeopardy_clean(question))
    
    # If the index is divisible by 10,000, print a message.
    if (count + 1) % 10_000 == 0:
        print(f'Question {count + 1} of {total_questions}.')
    
    count += 1

# All questions have been cleaned.
print(f'Done! Completed all {total_questions} questions.')
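
The same four cleaning steps could also be applied to a whole column at once with pandas string methods. This is only a sketch on two made-up strings standing in for `jeopardy['question']`; it is an alternative to the loop above, not a required change:

```python
import pandas as pd

# Hypothetical stand-in for jeopardy['question'].
questions = pd.Series([
    "It's the 2nd-largest U.S. state!",
    "Who wrote <Hamlet>?",
])

clean = (
    questions
    .str.replace(r"[^a-zA-Z0-9]", " ", regex=True)  # 1. punctuation -> spaces
    .str.lower()                                    # 2. lower case
    .str.split()                                    # 3. split on whitespace
    .str.join(" ")                                  # 4. re-join with single spaces
)

print(clean.tolist())
```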

In [None]:
# Examine first ten questions.
clean_jeopardy[:10]

### Vectorize our text data.

We have our cleaned data! Now we can start preparing our model.

In machine learning:
- `X` is a matrix/dataframe of real numbers.
- `y` is a vector/series of real numbers.

Our $Y$ variable will be, "Was this question asked during the 'Double Jeopardy!' round?"

In [None]:
# Create our y variable.
y = pd.Series([1 if item == 'Double Jeopardy!' else 0 for item in jeopardy['round']])

In [None]:
# Confirm we created our variable correctly.
y.value_counts()

In [None]:
jeopardy['round'].value_counts()

In [None]:
# Create our X variable.
X = pd.DataFrame(clean_jeopardy, columns=['question'])

In [None]:
# Split the data into the training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X['question'],
                                                    y,
                                                    test_size=0.2, # 20% of our data is test.
                                                    stratify=y,    # keep distribution of y same 
                                                    random_state=42)

Text data (natural language data) is often not already organized as a matrix or vector of real numbers. **Vectorizing our text data** describes one common process for converting a set of text data into a matrix of real numbers.

There are two basic, common types of vectorizer: `CountVectorizer` and `TfidfVectorizer`.

#### CountVectorizer

<img src="./images/countvectorizer.png" alt="drawing" width="700"/>

[Source.](https://towardsdatascience.com/nlp-learning-series-part-2-conventional-methods-for-text-classification-40f2839dd061)

In [None]:
# Instantiate CountVectorizer.

cvec = CountVectorizer()

In [None]:
# Fit the vectorizer on our corpus.

cvec.fit(X_train)

In [None]:
# Transform the corpus.

X_train = cvec.transform(X_train)

In [None]:
# Convert X_train into a DataFrame.

X_train_df = pd.DataFrame(X_train.toarray(),
                          columns=cvec.get_feature_names_out())
X_train_df

In [None]:
# Transform test data.
X_test = cvec.transform(X_test)

Our tokens are now stored as a bag-of-words. This is a simplified way of looking at and storing our data.
- Bag-of-words representations discard grammar, order, and structure in the text but track occurrences.

<details><summary>What might be some of the advantages of using this bag-of-words approach when modeling?</summary>

- Efficient to store.
- Efficient to model.
- Keeps a decent amount of information.
</details>

<details><summary>What might be some of the disadvantages of using this bag-of-words approach when modeling?</summary>

- Since bag-of-words models discard grammar, order, structure, and context, we lose a decent amount of information.
- Phrases like "not bad" or "not good" won't be interpreted properly.
</details>

In [None]:
# Instantiate logistic regression model.

lr = LogisticRegression(solver = 'liblinear')

In [None]:
# Fit logistic regression model.

lr.fit(X_train, y_train)

In [None]:
# Get model accuracy on training data.

lr.score(X_train, y_train)

In [None]:
# Get model accuracy on testing data.

lr.score(X_test, y_test)

#### TfidfVectorizer

<img src="./images/tfidfvectorizer.png" alt="drawing" width="900"/>

[Source.](https://towardsdatascience.com/nlp-learning-series-part-2-conventional-methods-for-text-classification-40f2839dd061)

In [None]:
# Split the data into the training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X['question'],
                                                    y,
                                                    test_size=0.2, # 20% of our data is test.
                                                    stratify=y,    # keep distribution of y same 
                                                    random_state=42)

# Instantiate TfidfVectorizer.
tvec = TfidfVectorizer()

# Fit the vectorizer and transform the training data.
X_train = tvec.fit_transform(X_train)

# Transform the testing data.
X_test = tvec.transform(X_test)

In [None]:
# Instantiate logistic regression model.
lr = LogisticRegression(solver = 'liblinear')

# Fit logistic regression model.
lr.fit(X_train, y_train)

# Get model accuracy on training data.
print(lr.score(X_train, y_train))

# Get model accuracy on testing data.
print(lr.score(X_test, y_test))

### Hyperparameters

---


<details><summary>What is a hyperparameter?</summary>

- A hyperparameter is a built-in option that affects our model but that our model cannot learn from our data!
- Examples of hyperparameters include:
    - the value of $k$ and the distance metric in $k$-nearest neighbors,
    - our regularization constants $\alpha$ or $C$ in linear and logistic regression.
</details>

There are many different hyperparameters of `CountVectorizer` that can affect the fit of our model!
- `stop_words`
- `max_features`, `max_df`, `min_df`
- `ngram_range`


#### `stop_words`

Some words are so common that they may not provide legitimate information about the $Y$ variable we're trying to predict.

In [None]:
# Let's look at sklearn's stopwords.
print(CountVectorizer(stop_words = 'english').get_stop_words())

`CountVectorizer` gives you the option to eliminate stopwords from your corpus when instantiating your vectorizer.

```python
cvec = CountVectorizer(stop_words='english')
```

You can optionally pass your own list of stopwords that you'd like to remove.
```python
cvec = CountVectorizer(stop_words=['list', 'of', 'words', 'to', 'stop'])
```
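
As a quick check of the effect, this sketch fits two vectorizers on a made-up two-sentence corpus and compares their vocabularies (the corpus and variable names are illustrative only):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus.
corpus = ["the capital of france is paris",
          "this is the largest of the great lakes"]

with_stops = CountVectorizer().fit(corpus)
without_stops = CountVectorizer(stop_words='english').fit(corpus)

print(sorted(with_stops.vocabulary_))     # includes 'the', 'of', 'is', ...
print(sorted(without_stops.vocabulary_))  # common English words removed
```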

#### Vocabulary size

---
One downside to `CountVectorizer` and `TfidfVectorizer` is that the vocabulary (`cvec.get_feature_names_out()`) can get really large. We're creating one column for every unique token in our corpus of data!

There are three hyperparameters to help you control this.

1. You can set `max_features` to only include the $N$ most frequent vocabulary words in the corpus.

```python
cvec = CountVectorizer(max_features=1_000) # Only the top 1,000 words from the entire corpus will be saved
```

2. You can tell `CountVectorizer` to only consider words that occur in **at least** some number of documents.

```python
cvec = CountVectorizer(min_df=2) # A word must occur in at least two documents from the corpus
```

3. Conversely, you can tell `CountVectorizer` to only consider words that occur in **at most** some percentage of documents.

```python
cvec = CountVectorizer(max_df=.98) # Ignore words that occur in > 98% of the documents from the corpus
```

Both `max_df` and `min_df` can accept either an integer or a float.
- An integer tells us the absolute number of documents.
- A float (between 0.0 and 1.0) tells us the proportion of documents.
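
The sketch below shows how each of the three options trims the vocabulary of a made-up four-document corpus. The `vocab` helper is a hypothetical convenience function written for this example:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus: 'the' is in 4 documents, 'cat' in 3, 'ran' in 2,
# and every other token in 1.
corpus = [
    "the cat sat",
    "the cat ran",
    "the cat slept",
    "the dog ran",
]

def vocab(**kwargs):
    """Fit a CountVectorizer with the given options and return its sorted vocabulary."""
    return sorted(CountVectorizer(**kwargs).fit(corpus).vocabulary_)

print(vocab())                  # every token
print(vocab(max_features=2))    # the two most frequent tokens
print(vocab(min_df=2))          # tokens in at least two documents
print(vocab(max_df=0.75))       # tokens in at most 75% of documents
```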

<details><summary>Why might we want to control these vocabulary size hyperparameters?</summary>
    
- If we have too many features, our models may take a **very** long time to fit.
- Control for overfitting/underfitting.
- Words in 99% of documents or words occurring in only one document might not be very informative.
</details>

#### `ngram_range`

---

`CountVectorizer` has the ability to capture $n$-word phrases, also called $n$-grams. Consider the following:

> The quick brown fox jumps over the lazy dog.

In the example sentence, the 2-grams are:
- 'the quick'
- 'quick brown'
- 'brown fox'
- 'fox jumps'
- 'jumps over'
- 'over the'
- 'the lazy'
- 'lazy dog'

The `ngram_range` determines what $n$-grams should be considered as features.

```python
cvec = CountVectorizer(ngram_range=(1,2)) # Captures every 1-gram and every 2-gram
```

<details><summary>How many 3-grams would be generated from the phrase "the quick brown fox jumps over the lazy dog?"</summary>

- Seven 3-grams.
    - 'the quick brown'
    - 'quick brown fox'
    - 'brown fox jumps'
    - 'fox jumps over'
    - 'jumps over the'
    - 'over the lazy'
    - 'the lazy dog'
</details>

<details><summary>Why might we want to change ngram_range to something other than (1,1)?</summary>

- We can work with multi-word phrases like "not good" or "very hot."
</details>

## Modeling

---

We may want to test lots of different values of hyperparameters in our `CountVectorizer` and `TfidfVectorizer`.

In [None]:
# Redefine training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X['question'],
                                                    y,
                                                    test_size=0.2,
                                                    stratify=y,
                                                    random_state=42)

<details><summary>What is GridSearch?</summary>
    
- GridSearch allows us to try different values of different hyperparameters, measure our model's performance on each one, and return the best model.
</details>

In [None]:
# Let's set a pipeline up with two stages:
# 1. CountVectorizer (transformer)
# 2. LogisticRegression (estimator)

pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('lr', LogisticRegression(solver = 'liblinear'))
])

In [None]:
# Search over the following values of hyperparameters:
# Maximum number of features fit: 1,000 and 2,000.
# Check (individual tokens) and also check (individual tokens and 2-grams).
# Check removing English stopwords and not removing any stopwords.

pipe_params = {
    'cvec__max_features': [1_000, 2_000],
    'cvec__ngram_range': [(1,1), (1,2)],
    'cvec__stop_words': ['english', None]
}
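
The keys in this dictionary follow the pattern `<step name>__<parameter name>`, with a double underscore. A short sketch of where those names come from, using a separate `demo_pipe` object (a name made up here so that it does not overwrite `pipe`):

```python
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

demo_pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('lr', LogisticRegression(solver='liblinear'))
])

# Every tunable option of the 'cvec' step is exposed as 'cvec__<parameter>'.
cvec_keys = sorted(k for k in demo_pipe.get_params() if k.startswith('cvec__'))
print(cvec_keys)

# set_params accepts the same names; grid search sets parameters this way.
demo_pipe.set_params(cvec__max_features=1_000, cvec__ngram_range=(1, 2))
print(demo_pipe.named_steps['cvec'])
```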

In [None]:
# Instantiate GridSearchCV.

gs_cvec = GridSearchCV(pipe, # what object are we optimizing?
                       param_grid=pipe_params, # what parameter values are we searching?
                       cv=5) # 5-fold cross-validation.

<details><summary>How many models are we fitting here?</summary>

- 2 max_features
- 2 ngram_range
- 2 stop_words
- 5-fold CV
- 2 * 2 * 2 * 5 = 40 models
</details>

In [None]:
import time

t0 = time.time()

# Fit GridSearch to training data.
gs_cvec.fit(X_train, y_train)

print(time.time() - t0)
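
The number of fits being timed here can be counted from the grid itself. This sketch uses scikit-learn's `ParameterGrid` on a copy of the grid (named `demo_params` here so it does not overwrite `pipe_params`):

```python
from sklearn.model_selection import ParameterGrid

demo_params = {
    'cvec__max_features': [1_000, 2_000],
    'cvec__ngram_range': [(1, 1), (1, 2)],
    'cvec__stop_words': ['english', None]
}

n_candidates = len(ParameterGrid(demo_params))   # distinct hyperparameter combinations
n_folds = 5
n_cv_fits = n_candidates * n_folds               # one fit per combination per fold

print(n_candidates, n_cv_fits)
```

With the default `refit=True`, `GridSearchCV` also fits the best combination one more time on the full training set, so the total is one more than the cross-validation count.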

In [None]:
# Evaluate accuracy on training data.
gs_cvec.score(X_train, y_train)

In [None]:
# Evaluate accuracy on testing data.
gs_cvec.score(X_test, y_test)

In [None]:
# What model performed best?
gs_cvec.best_params_

### All at once, now, with `TfidfVectorizer`!

In [None]:
# Redefine training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X['question'],
                                                    y,
                                                    test_size=0.2,
                                                    stratify=y,
                                                    random_state=42)

# Let's set a pipeline up with two stages:
# 1. TfidfVectorizer (transformer)
# 2. LogisticRegression (estimator)

pipe = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('lr', LogisticRegression(solver = 'liblinear'))
])

# Search over the following values of hyperparameters:
# Maximum number of features fit: 1,000 and 2,000.
# Check (individual tokens) and also check (individual tokens and 2-grams).
# Check removing English stopwords and not removing any stopwords.

pipe_params = {
    'tvec__max_features': [1_000, 2_000],
    'tvec__ngram_range': [(1,1), (1,2)],
    'tvec__stop_words': ['english', None]
}

# Instantiate GridSearchCV.
gs_tvec = GridSearchCV(pipe, # what object are we optimizing?
                       param_grid=pipe_params, # what parameter values are we searching?
                       cv=5) # 5-fold cross-validation.


# Fit GridSearch to training data.
gs_tvec.fit(X_train, y_train)

# Evaluate accuracy on training data.
print(f'Accuracy score on training data: {gs_tvec.score(X_train, y_train)}.')

# Evaluate accuracy on testing data.
print(f'Accuracy score on testing data: {gs_tvec.score(X_test, y_test)}.')

# What model performed best?
print(f'Best parameter values: {gs_tvec.best_params_}')