# CPSC 330 - Applied Machine Learning 

## Homework 4: Logistic regression, hyperparameter optimization 
### Associated lectures: [Lectures 7, 8](https://ubc-cs.github.io/cpsc330/README.html) 

## Imports 

In [1]:
from hashlib import sha1
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

plt.rcParams["font.size"] = 16

from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    GridSearchCV,
    cross_val_score,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.tree import DecisionTreeClassifier

## Instructions
<hr>
rubric={points:6}

Follow the [homework submission instructions](https://github.com/UBC-CS/cpsc330/blob/master/docs/homework_instructions.md). 

**You may work with a partner on this homework and submit your assignment as a group.** Below are some instructions on working as a group.  
- The maximum group size is 2. 
- Use group work as an opportunity to collaborate and learn new things from each other. 
- Be respectful to each other and make sure you understand all the concepts in the assignment well. 
- It's your responsibility to make sure that the assignment is submitted by one of the group members before the deadline. 
- If the .ipynb file is too big and doesn't render on Gradescope, also upload a pdf or html in addition to the .ipynb.
- You can find the instructions on how to do group submission on Gradescope [here](https://help.gradescope.com/article/m5qz2xsnjy-student-add-group-members).
- Note that we are using the autograder for some of the coding questions so that you can get instant feedback. When you submit your homework on Gradescope, it will run your code on an AWS server. It's possible that all your tests pass locally but some test cases fail when you submit your work on Gradescope. Wait for your submission to run there and examine the autograder results. If it fails, you are either submitting the wrong file or doing something unexpected somewhere in the code. If you cannot figure it out, make use of office hours and tutorials. It's important that you start early on the assignments so that we can help you with such issues. Also, remember that passing the tests is not the only goal. Make sure you understand exactly what you are doing in each question and why you are doing it.

_Note: The assignments will get gradually more open-ended as we progress through the course. In many cases, there won't be a single correct solution. Sometimes you will have to make your own choices and your own decisions (for example, on what parameter values to use when they are not explicitly provided in the instructions). Use your own judgment in such cases and justify your choices, if necessary._

<br><br><br><br>

<div class="alert alert-warning">

Solution_1
    
</div>

## Exercise 1: Implementing `DummyClassifier` 
<hr>
rubric={points:25}

In this course (unlike CPSC 340) you will generally **not** be asked to implement machine learning algorithms (like logistic regression) from scratch. However, this exercise is an exception: you will implement the simplest possible classifier, `DummyClassifier`.

As a reminder, `DummyClassifier` is meant as a baseline and is generally the worst possible "model" you could "fit" to a dataset. All it does is predict the most popular class in the training set. So if there are more 0s than 1s it predicts 0 every time, and if there are more 1s than 0s it predicts 1 every time. For `predict_proba` it looks at the class frequencies in the training set, so if the training set has 30% 0s and 70% 1s it predicts `[0.3, 0.7]` every time. Thus, `fit` only looks at `y` (not `X`).
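
To make the behaviour described above concrete, here is a small sketch using scikit-learn's own `DummyClassifier(strategy="prior")`. The toy arrays `X_toy` and `y_toy` are made up for illustration (3 zeros and 7 ones) and are not part of the assignment data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy data (made up for illustration): 3 zeros and 7 ones.
X_toy = np.zeros((10, 2))            # the features are ignored by DummyClassifier
y_toy = np.array([0] * 3 + [1] * 7)

dc = DummyClassifier(strategy="prior")
dc.fit(X_toy, y_toy)

toy_pred = dc.predict(X_toy)         # the most frequent class, repeated for every row
toy_proba = dc.predict_proba(X_toy)  # the training class frequencies, repeated for every row

print(toy_pred[:3])
print(toy_proba[0])
```

Every row gets the same prediction and the same probability vector, regardless of what `X_toy` contains.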

Below you will find starter code for a class called `MyDummyClassifier`, which has methods `fit()`, `predict()`, `predict_proba()` and `score()`. Your task is to fill in those four functions. To get you started, I have given you a `return` statement in each case that returns the correct data type: 
- `fit` can return nothing, 
- `predict` returns an array whose size is the number of examples, 
- `predict_proba` returns an array whose size is the number of examples x 2, and 
- `score` returns a number.

The next code block has some tests you can use to assess whether your code is working. 

I suggest starting with `fit` and `predict`, and making sure those are working before moving on to `predict_proba`. For `predict_proba`, you should return the frequency of each class in the training data, which is the behaviour of `DummyClassifier(strategy='prior')`. Your `score` function should call your `predict` function. Again, you can compare with `DummyClassifier` using the code below.

To simplify this question, you can assume **binary classification**, and furthermore that these classes are **encoded as 0 and 1**. In other words, you can assume that `y` contains only 0s and 1s. The real `DummyClassifier` works when you have more than two classes, and also works if the target values are encoded differently, for example as "cat", "dog", "mouse", etc.
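
For comparison with the binary simplification, the following is a rough sketch of how the same majority-class idea could extend to any number of classes and arbitrary labels. The class name `GeneralDummyClassifier` and the pet labels are hypothetical; this is not scikit-learn's implementation, only an illustration built on `np.unique`.

```python
import numpy as np


class GeneralDummyClassifier:
    """Hypothetical generalization: any number of classes, any label encoding."""

    def fit(self, X, y):
        # np.unique returns the sorted distinct labels and how often each occurs.
        self.classes_, counts = np.unique(y, return_counts=True)
        self.class_prior_ = counts / counts.sum()
        return self

    def predict(self, X):
        # np.argmax returns the first maximum, so ties go to the first sorted label.
        majority = self.classes_[np.argmax(self.class_prior_)]
        return np.full(len(X), majority)

    def predict_proba(self, X):
        # One row of class frequencies per example, columns ordered like classes_.
        return np.tile(self.class_prior_, (len(X), 1))

    def score(self, X, y):
        return np.mean(self.predict(X) == np.asarray(y))


# Made-up labels for illustration.
y_pets = np.array(["cat", "dog", "dog", "mouse", "dog"])
X_pets = np.zeros((len(y_pets), 1))

gdc = GeneralDummyClassifier().fit(X_pets, y_pets)
pets_pred = gdc.predict(X_pets)
pets_proba = gdc.predict_proba(X_pets)
pets_score = gdc.score(X_pets, y_pets)
```

The only change from the binary case is that the class frequencies are stored as a vector (one entry per label) rather than a single probability.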

In [2]:
class MyDummyClassifier:
    """
    A baseline classifier that predicts the most common class.
    The predicted probabilities come from the relative frequencies
    of the classes in the training data.

    This implementation only works when y only contains 0s and 1s.
    """

    def fit(self, X, y):
        # BEGIN SOLUTION
        self.prob_1 = np.mean(y == 1)
        # END SOLUTION
        return None  # Replace with your code

    def predict(self, X):
        # BEGIN SOLUTION
        if self.prob_1 > 0.5:  # on an exact tie predict 0, matching sklearn's argmax over class priors
            return np.ones(X.shape[0])
        else:
            return np.zeros(X.shape[0])
        # END SOLUTION
        return np.zeros(X.shape[0])  # Replace with your code

    def predict_proba(self, X):
        # BEGIN SOLUTION
        probs = np.zeros((X.shape[0], 2))
        probs[:, 0] = 1 - self.prob_1
        probs[:, 1] = self.prob_1
        return probs
        # END SOLUTION
        return np.zeros((X.shape[0], 2))  # Replace with your code

    def score(self, X, y):
        # BEGIN SOLUTION
        return np.mean(self.predict(X) == y)
        # END SOLUTION
        return 0.0  # Replace with your code
    

In [3]:
from sklearn.dummy import DummyClassifier
n_train = 21
d = 4
np.random.seed(111)
X_train_dummy = np.random.randn(n_train, d) 
y_train_dummy = np.random.randint(2, size=n_train)
my_dc = MyDummyClassifier()
my_dc.fit(X_train_dummy, y_train_dummy)
my_dc.predict(X_train_dummy)
my_dc.predict_proba(X_train_dummy)
my_dc.score(X_train_dummy, y_train_dummy)

0.5238095238095238

In [4]:
sk_dc = DummyClassifier(strategy="prior")
sk_dc.fit(X_train_dummy, y_train_dummy)
sk_dc.predict(X_train_dummy)
sk_dc.predict_proba(X_train_dummy)
sk_dc.score(X_train_dummy, y_train_dummy)

0.5238095238095238

The following cells are automated tests for the solution. Each one runs against your `MyDummyClassifier` and is expected to fail if the implementation is incorrect.


In [5]:
""" # BEGIN TEST CONFIG
ok_format: false
success_message: Good job!
""" # END TEST CONFIG


def test_it(MyDummyClassifier):
    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    from sklearn.dummy import DummyClassifier
    n_train = 101
    n_valid = 21
    d = 5
    for _ in range(100):
        X_train_dummy = np.random.randn(n_train, d) 
        X_valid_dummy = np.random.randn(n_valid, d)
        y_train_dummy = np.random.randint(2, size=n_train)
        y_valid_dummy = np.random.randint(2, size=n_valid)
        my_dc = MyDummyClassifier()
        sk_dc = DummyClassifier(strategy="prior")
        _ = my_dc.fit(X_train_dummy, y_train_dummy)
        _ = sk_dc.fit(X_train_dummy, y_train_dummy)
        assert np.array_equal(my_dc.predict(X_train_dummy), sk_dc.predict(X_train_dummy)), "predict is wrong on a random training set"
        assert np.array_equal(my_dc.predict(X_valid_dummy), sk_dc.predict(X_valid_dummy)), "predict is wrong on a random validation set"

test_it(MyDummyClassifier)

In [6]:
""" # BEGIN TEST CONFIG
ok_format: false
success_message: Good job!
""" # END TEST CONFIG


def test_it(MyDummyClassifier):
    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    from sklearn.dummy import DummyClassifier
    n_train = 101
    n_valid = 21
    d = 5
    for _ in range(100):
        X_train_dummy = np.random.randn(n_train, d) 
        X_valid_dummy = np.random.randn(n_valid, d)
        y_train_dummy = np.random.randint(2, size=n_train)
        y_valid_dummy = np.random.randint(2, size=n_valid)
        my_dc = MyDummyClassifier()
        sk_dc = DummyClassifier(strategy="prior")
        _ = my_dc.fit(X_train_dummy, y_train_dummy)
        _ = sk_dc.fit(X_train_dummy, y_train_dummy)
        assert np.allclose(my_dc.predict_proba(X_train_dummy), sk_dc.predict_proba(X_train_dummy)), "predict_proba is wrong on a random training set"
        assert np.allclose(my_dc.predict_proba(X_valid_dummy), sk_dc.predict_proba(X_valid_dummy)), "predict_proba is wrong on a random validation set"

test_it(MyDummyClassifier)

In [7]:
""" # BEGIN TEST CONFIG
ok_format: false
success_message: Good job!
""" # END TEST CONFIG


def test_it(MyDummyClassifier):
    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    from sklearn.dummy import DummyClassifier
    n_train = 101
    n_valid = 21
    d = 5
    for _ in range(100):
        X_train_dummy = np.random.randn(n_train, d) 
        X_valid_dummy = np.random.randn(n_valid, d)
        y_train_dummy = np.random.randint(2, size=n_train)
        y_valid_dummy = np.random.randint(2, size=n_valid)
        my_dc = MyDummyClassifier()
        sk_dc = DummyClassifier(strategy="prior")
        _ = my_dc.fit(X_train_dummy, y_train_dummy)
        _ = sk_dc.fit(X_train_dummy, y_train_dummy)
        assert np.isclose(my_dc.score(X_train_dummy, y_train_dummy), sk_dc.score(X_train_dummy, y_train_dummy)), "the score is wrong on a random training set"
        assert np.isclose(my_dc.score(X_valid_dummy, y_valid_dummy), sk_dc.score(X_valid_dummy, y_valid_dummy)), "the score is wrong on a random validation set"

test_it(MyDummyClassifier)

<br><br><br><br>

## Exercise 2: Trump Tweets
<hr>

For the rest of this assignment we'll be working with a [dataset of Donald Trump's tweets](https://www.kaggle.com/austinreese/trump-tweets) as of June 2020. You should start by downloading the dataset. Unzip it and move the file `realdonaldtrump.csv` under the data directory in this folder. As usual, please do not submit the dataset when you submit the assignment. 

In [8]:
tweets_df = pd.read_csv("realdonaldtrump.csv", index_col=0)
tweets_df.head()

| id | link | content | date | retweets | favorites | mentions | hashtags |
|---|---|---|---|---|---|---|---|
| 1698308935 | https://twitter.com/realDonaldTrump/status/169... | Be sure to tune in and watch Donald Trump on L... | 2009-05-04 13:54:25 | 510 | 917 | | |
| 1701461182 | https://twitter.com/realDonaldTrump/status/170... | Donald Trump will be appearing on The View tom... | 2009-05-04 20:00:10 | 34 | 267 | | |
| 1737479987 | https://twitter.com/realDonaldTrump/status/173... | Donald Trump reads Top Ten Financial Tips on L... | 2009-05-08 08:38:08 | 13 | 19 | | |
| 1741160716 | https://twitter.com/realDonaldTrump/status/174... | New Blog Post: Celebrity Apprentice Finale and... | 2009-05-08 15:40:15 | 11 | 26 | | |
| 1773561338 | https://twitter.com/realDonaldTrump/status/177... | "My persona will never be that of a wallflower... | 2009-05-12 09:07:28 | 1375 | 1945 | | |


In [9]:
tweets_df.shape

(43352, 7)

We will be trying to predict whether a tweet will go "viral", defined as having more than 10,000 retweets:

In [10]:
y = tweets_df["retweets"] > 10_000

To make predictions, we'll be using only the content (text) of the tweet. 

In [11]:
X = tweets_df["content"]

For the purpose of this assignment, you can ignore all the other columns in the original dataset.

<br><br>

<br><br>

#### 2(a) ordering the steps
rubric={points:8}

Let's start by building a model using `CountVectorizer` and `LogisticRegression`. The code required to do this has been provided below, but in the wrong order. 

- Rearrange the lines of code to correctly fit the model and compute the cross-validation score. 
- Add a short comment to each block to describe what the code is doing.

<div class="alert alert-warning">

Solution_2.1
    
</div>

In [12]:
# # YOUR COMMENT HERE
# countvec = CountVectorizer(stop_words="english")

# # YOUR COMMENT HERE
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.6, random_state=123)

# # YOUR COMMENT HERE
# cross_val_results = pd.DataFrame(
#     cross_validate(pipe, X_train, y_train, return_train_score=True)
# )

# # YOUR COMMENT HERE
# pipe = make_pipeline(countvec, lr)

# # YOUR COMMENT HERE
# cross_val_results.mean()

# # YOUR COMMENT HERE
# lr = LogisticRegression(max_iter=1000, random_state=123)

In [13]:
# BEGIN SOLUTION
# 1. Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.6, random_state=123)

# 2. Create the vectorizer object
countvec = CountVectorizer(stop_words="english")

# 3. Create the log reg object (could switch with Step 2)
lr = LogisticRegression(max_iter=1000, random_state=123)

# 4. Create the Pipeline
pipe = make_pipeline(countvec, lr)

# 5. Compute the cross-validation scores across folds
cross_val_results = pd.DataFrame(
    cross_validate(pipe, X_train, y_train, return_train_score=True)
)

# 6. Average cross-validation scores
cross_val_results.mean()
# END SOLUTION

fit_time       0.620654
score_time     0.070398
test_score     0.895098
train_score    0.976644
dtype: float64

#### 2(b) Cross-validation fold sub-scores
rubric={points:3}

Above we averaged the scores from the 5 folds of cross-validation. 

- Print out the 5 individual scores. 
    - (Reminder: `sklearn` calls them `"test_score"` but they are really (cross-)validation scores.)
- Are the 5 scores close to each other or spread far apart? 
  - (This is a bit subjective, answer to the best of your ability.)
- How does the size of this dataset (number of rows) compare to the cities dataset we have been using in class? How does this relate to the different sub-scores from the 5 folds?

In [14]:
cross_val_results

| | fit_time | score_time | test_score | train_score |
|---|---|---|---|---|
| 0 | 0.672414 | 0.069167 | 0.891292 | 0.978085 |
| 1 | 0.672031 | 0.07099 | 0.900231 | 0.976788 |
| 2 | 0.606207 | 0.074941 | 0.894752 | 0.976644 |
| 3 | 0.563848 | 0.071249 | 0.889273 | 0.978013 |
| 4 | 0.588771 | 0.065645 | 0.899942 | 0.973688 |


Q: Are the 5 scores close to each other or spread far apart?

In [15]:

cross_val_results['test_score'].std()

0.00495845725229569

They are pretty close.

Q: How does the size of this dataset (number of rows) compare to the cities dataset we have been using in class? How does this relate to the different sub-scores from the 5 folds?

There are `43352` samples in this dataset, which is more than in almost all the datasets we have explored so far. Because the training set is larger, each fold is larger too, so there is less variability across folds. 

In more technical terms, because the train and validation splits are large, each fold is a better representation of the underlying population, so the scores are comparable across folds. 
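
One way to see the link between dataset size and fold-to-fold variability is to compare the spread of cross-validation scores on a small and a large dataset. The sketch below uses synthetic data from `make_classification` rather than the tweets, and the helper `fold_score_std` is hypothetical; the exact numbers depend on the random seed, so treat it as an illustration, not a proof.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def fold_score_std(n_samples, seed=0):
    """Standard deviation of the 5 fold scores on a synthetic dataset."""
    X_syn, y_syn = make_classification(
        n_samples=n_samples, n_features=10, random_state=seed
    )
    scores = cross_val_score(LogisticRegression(max_iter=1000), X_syn, y_syn, cv=5)
    return scores.std()


# Compare a dataset of a few hundred rows with one of tens of thousands.
for n in [200, 20_000]:
    print(n, fold_score_std(n))
```

With only 200 rows each validation fold holds 40 examples, so a handful of different predictions can move the fold score noticeably; with 20,000 rows each fold holds 4,000 examples.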

<br><br>

#### 2(c) baseline
rubric={points:3}

By the way, are these scores any good? 

- Run `DummyClassifier` (or `MyDummyClassifier`!) on this dataset.
- Compare the `DummyClassifier` score to what you got from logistic regression above. Does logistic regression seem to be doing anything useful?
- Is it necessary to use `CountVectorizer` here? Briefly explain.

<div class="alert alert-warning">

Solution_2.2
    
</div>

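- The cross-validation scores for the `DummyClassifier` are around 74%, which reflects the fact that about 26% of the tweets in the training set are viral and 74% are non-viral. 
- The logistic regression is getting around 90% accuracy, so that is clearly an improvement over 74%.
- `CountVectorizer` is not necessary for the baseline: `DummyClassifier` only looks at `y` and ignores `X`, so it gives the same score with or without the text features.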

In [16]:
cross_val_results["test_score"] # SOLUTION

0    0.891292
1    0.900231
2    0.894752
3    0.889273
4    0.899942
Name: test_score, dtype: float64

I would say they are quite close together, e.g. max-min =

In [17]:
cross_val_results["test_score"].max() - cross_val_results["test_score"].min() # SOLUTION

0.010957324106113053

is quite small. This is likely due to the fairly large dataset:

In [18]:
len(X_train) # SOLUTION

17340

It is much bigger than the cities dataset, which has only 200 rows. Generally, larger datasets will give us more reliable validation results because the randomness of the splits plays less of a role.

In [19]:
assert not dummy_cv_score is None, 'Are you using the provided variable?'
assert sha1(str(np.round(dummy_cv_score, 3)).encode('utf-8')).hexdigest() == '838befe9ffa0a0d530805ba36c2e4890d4148ba1', "Your mean CV score doesn't look correct." 

NameError: name 'dummy_cv_score' is not defined
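
The `NameError` above appears because no cell defines `dummy_cv_score`. A minimal sketch of a cell that would define it follows, assuming the variable is meant to hold the mean cross-validation accuracy of a `DummyClassifier` on the training split. The helper name `mean_dummy_cv_score` and the stand-in data are hypothetical, and whether the resulting value matches the hash in the assertion has not been checked here.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score


def mean_dummy_cv_score(X, y):
    """Mean 5-fold cross-validation accuracy of a most-frequent-class baseline."""
    dummy = DummyClassifier(strategy="most_frequent")
    return cross_val_score(dummy, X, y).mean()


# Stand-in data so the sketch runs on its own (made up: 75 negatives, 25 positives).
X_demo = np.array(["some tweet text"] * 100)  # raw text is fine: the baseline ignores X
y_demo = np.array([False] * 75 + [True] * 25)
demo_score = mean_dummy_cv_score(X_demo, y_demo)
print(demo_score)

# In the notebook, the corresponding call would be (assumption):
# dummy_cv_score = mean_dummy_cv_score(X_train, y_train)
```

Note that the raw text can be passed directly, without `CountVectorizer`, because the baseline never inspects the features.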

<br><br>

<div class="alert alert-warning">

Solution_2.3
    
</div>

<br><br>

#### 2(d) probability scores
rubric={points:5}

Here we train a logistic regression classifier on the entire training set: 

(Note: this is relying on the `pipe` variable from 2(a) - you'll need to redefine it if you overwrote that variable in between.)

In [None]:
pipe.fit(X_train, y_train);

**Your tasks:**

1. Using this model, find the tweet in the **test set** with the highest predicted probability of being viral. Store the tweet and the associated probability in the variables `tweet` and `prob`, respectively. 

> Reminder: you are free to reuse/adapt code from lecture. Please add in a small attribution, e.g. "From Lecture 7".

<div class="alert alert-warning">

Solution_2.4
    
</div>

In [None]:
tweet = None
prob = None

# BEGIN SOLUTION
probs = pipe.predict_proba(X_test)[:, 1]  # predicted probability of "viral" for each test tweet
most_positive_ind = np.argmax(probs)
prob = probs[most_positive_ind]
tweet = X_test.iloc[most_positive_ind]
tweet
# END SOLUTION

In [None]:
assert not tweet is None, "Are you using the correct variable to store the tweet?"
assert not prob is None, "Are you using the correct variable to store the probability?"
assert sha1(str(np.round(prob, 4)).encode('utf-8')).hexdigest() == 'e8dc057d3346e56aed7cf252185dbe1fa6454411', "Incorrect probability."
assert sha1(str(tweet).encode('utf8')).hexdigest() == 'fe39d5cbae2b335b4fde5486a8df189e41392043', "Incorrect tweet text."

<br><br>

#### 2(e) coefficients
rubric={points:4}

We can extract the `CountVectorizer` and `LogisticRegression` objects from the `make_pipeline` object as follows:


In [None]:
vec_from_pipe = pipe.named_steps["countvectorizer"]
lr_from_pipe = pipe.named_steps["logisticregression"]

**Your tasks:**

Using the components extracted above, get the five words with the highest coefficients and the five words with the smallest coefficients. Store them as lists in the variables `top_5_words` and `bottom_5_words`, respectively. 

<div class="alert alert-warning">

Solution_2.5
    
</div>

In [None]:
top_5_words = None
bottom_5_words = None

# BEGIN SOLUTION
words_weights_df = pd.DataFrame(
    data=lr_from_pipe.coef_.ravel(),
    index=vec_from_pipe.get_feature_names_out(),
    columns=["Coefficient"],
)
sorted_words_weights_df = words_weights_df.sort_values(by="Coefficient", ascending=False)
top_5_words = sorted_words_weights_df.index[0:5].tolist()
bottom_5_words = sorted_words_weights_df.index[-5:].tolist()
# END SOLUTION

In [None]:
assert not top_5_words is None, "Are you using the correct variable?"
assert not bottom_5_words is None, "Are you using the correct variable?"
assert len(top_5_words) == 5, "Are you getting the top 5 words?"
assert len(bottom_5_words) == 5, "Are you getting the bottom 5 words?"
assert sha1("".join(sorted(top_5_words)).encode('utf-8')).hexdigest() == '439a61ad61c34c9069de01e2daa3a5bee63ee273', 'incorrect top 5 words'
assert sha1("".join(sorted(bottom_5_words)).encode('utf-8')).hexdigest() == '5e2af1c788307cca183334510cf6bbee0208cab2', 'incorrect bottom 5 words'

<br><br>

#### 2(f) Running a cross-validation fold without sklearn tools 
rubric={points:8}

Sklearn provides a lot of useful tools like `make_pipeline` and `cross_validate`, which are awesome. But with these fancy tools it's also easy to lose track of what is actually happening under the hood. 

**Your tasks:**

1. Compute logistic regression's validation score on the first fold, that is, train on 80% and validate on 20% of the training data (`X_train`), without using sklearn's `Pipeline`, `cross_validate`, or `cross_val_score`. Store the score of the fold in a variable called `fold_score`. Recall that `cross_validate` in `sklearn` does not shuffle the data by default.    

You should start with the following `CountVectorizer` and `LogisticRegression` objects, as well as `X_train` and `y_train` (which you should further split with `train_test_split` and `shuffle=False`):

In [None]:
countvec = CountVectorizer(stop_words="english")
lr = LogisticRegression(max_iter=1000, random_state=123)

> Meta-comment: you might be wondering why we're going into "implementation" here if this course is about _applied_ ML. In CPSC 340, we would go all the way down into `LogisticRegression` and understand how `fit` works, line by line. Here we're not going into that at all, but I still think this type of question (and Exercise 1) is a useful middle ground. I do want you to know what is going on in `Pipeline` and in `cross_validate` even if we don't cover the details of `fit`. To get into logistic regression's `fit` requires a bunch of math; here, we're keeping it more conceptual and avoiding all those prerequisites.
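
To illustrate what a pipeline does under the hood, the sketch below compares a manual "fit the vectorizer on the training part, reuse it on the validation part" workflow with the equivalent pipeline on a fixed, unshuffled split. The toy corpus and all variable names are made up for illustration and are unrelated to the tweets data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up toy corpus; label 1 marks texts that mention "win".
texts = ["big win tonight", "sad news today", "we win again", "rain all day",
         "huge win", "quiet morning", "win win win", "nothing new",
         "another win", "slow traffic"]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Unshuffled split: the last 20% of the rows become the validation part.
txt_tr, txt_va, lab_tr, lab_va = train_test_split(
    texts, labels, test_size=0.2, shuffle=False
)

# Manual version: learn the vocabulary on the training part only, then reuse it.
vec = CountVectorizer()
lr_manual = LogisticRegression(max_iter=1000)
lr_manual.fit(vec.fit_transform(txt_tr), lab_tr)
manual_score = lr_manual.score(vec.transform(txt_va), lab_va)

# Pipeline version: the same two steps happen inside fit() and score().
toy_pipe = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
toy_pipe.fit(txt_tr, lab_tr)
pipe_score = toy_pipe.score(txt_va, lab_va)

print(manual_score, pipe_score)
```

Both routes call `fit_transform` on the training part and only `transform` on the validation part, which is why the two scores agree.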

<div class="alert alert-warning">

Solution_2.6
    
</div>

In [None]:
fold_score = None

# BEGIN SOLUTION
X_train_fold, X_valid_fold, y_train_fold, y_valid_fold = train_test_split(
    X_train, y_train, test_size=0.2, shuffle=False
)

X_train_fold_vec = countvec.fit_transform(X_train_fold)
X_valid_fold_vec = countvec.transform(X_valid_fold)

lr.fit(X_train_fold_vec, y_train_fold)
fold_score = lr.score(X_valid_fold_vec, y_valid_fold)
# END SOLUTION

In [None]:
assert not fold_score is None, "Are you using the correct variable?"
assert sha1(str(np.round(fold_score, 4)).encode('utf8')).hexdigest() == 'd267e2ea9b801d9f7ac454714a7fe6320075c255', "The fold_score doesn't look right."

<br><br><br><br>

## Exercise 3: Hyperparameter optimization
<hr>

### 3.1 Optimizing `max_features` of `CountVectorizer`
rubric={points:4}

The following code varies the `max_features` hyperparameter of `CountVectorizer` and makes a plot (with the x-axis on a log scale) that shows train/cross-validation scores vs. `max_features`. It also prints the results. 

**Your tasks:**
- Based on the plot/output, what value of `max_features` seems best? Briefly explain.

> The code may take a minute or two to run. You can uncomment the `print` statement if you want to see it show the progress.

In [None]:
train_scores = []
cv_scores = []

max_features = [10, 100, 1000, 10_000, 100_000]

for mf in max_features:
    #     print(mf)    
    pipe = make_pipeline(CountVectorizer(stop_words="english", max_features=mf), LogisticRegression(max_iter=1000))
    cv_results = cross_validate(pipe, X_train, y_train, return_train_score=True)
    train_scores.append(cv_results["train_score"].mean())
    cv_scores.append(cv_results["test_score"].mean())

plt.semilogx(max_features, train_scores, label="train")
plt.semilogx(max_features, cv_scores, label="valid")
plt.legend()
plt.xlabel("max_features")
plt.ylabel("accuracy");

In [None]:
pd.DataFrame({"max_features": max_features, "train": train_scores, "cv": cv_scores})

<div class="alert alert-warning">

Solution_3.1
    
</div>

In terms of cross-validation score, it looks like the best is `max_features=100_000`. In this case that means using all the words, since the total number of words is less than 100,000:

In [None]:
len(CountVectorizer(stop_words="english").fit(X_train, y_train).get_feature_names_out()) # SOLUTION

<br><br>

### 3.2 Optimizing `C` of `LogisticRegression`
rubric={points:6}

The following code varies the `C` hyperparameter of `LogisticRegression` and makes a plot (with the x-axis on a log scale) that shows train/cross-validation scores vs. `C`. 

**Your tasks:**

- Based on the plot, what value of `C` seems best? Briefly explain. 

> The code may take a minute or two to run. You can uncomment the `print` statement if you want to see it show the progress.

In [None]:
train_scores = []
cv_scores = []

C_vals = 10.0 ** np.arange(-1.5, 2, 0.5)

for C in C_vals:
    #     print(C)
    pipe = make_pipeline(CountVectorizer(stop_words="english"), LogisticRegression(max_iter=1000, C=C))    
    cv_results = cross_validate(pipe, X_train, y_train, return_train_score=True)

    train_scores.append(cv_results["train_score"].mean())
    cv_scores.append(cv_results["test_score"].mean())

plt.semilogx(C_vals, train_scores, label="train")
plt.semilogx(C_vals, cv_scores, label="valid")
plt.legend()
plt.xlabel("C")
plt.ylabel("accuracy");

In [None]:
pd.DataFrame({"C": C_vals, "train": train_scores, "cv": cv_scores})

<div class="alert alert-warning">

Solution_3.2
    
</div>

It looks like the best value of `C` is 1, as it gives us the best cross-validation scores. That said, $C\approx 0.3$ gives a very similar cross-validation score.

<br><br>

### 3.3 Hyperparameter optimization 
rubric={points:10}

Start with the pipeline `pipe` below.

**Your tasks:**
- Create a `GridSearchCV` object named `grid_search` to jointly optimize `max_features` of `CountVectorizer` and `C` of `LogisticRegression` across all the combinations of values we tried above. 
- What are the best values of `max_features` and `C` according to your grid search? 
- Store them in variables `best_max_features` and `best_C`, respectively.  
- Store the best score returned by the grid search in a variable called `best_score`. 

> The code might be a bit slow here. Setting `n_jobs=-1` should speed it up if you have a multi-core processor.
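
When tuning a pipeline, the keys of the parameter grid have to follow the `<step name>__<parameter name>` convention. The sketch below shows one way to look up those names for a pipeline built with `make_pipeline`; the variable names `demo_pipe`, `step_names` and `tunable` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

demo_pipe = make_pipeline(
    CountVectorizer(stop_words="english"), LogisticRegression(max_iter=1000)
)

# make_pipeline names each step after its lowercased class name.
step_names = list(demo_pipe.named_steps)

# Nested parameters are addressed as "<step name>__<parameter name>".
tunable = [
    name for name in demo_pipe.get_params()
    if name.endswith(("__max_features", "__C"))
]

print(step_names)
print(tunable)
```

These are the strings a `param_grid` dictionary would use as keys when the pipeline is passed to `GridSearchCV`.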

In [None]:
pipe = make_pipeline(CountVectorizer(stop_words="english"), LogisticRegression(max_iter=1000, random_state=123))

<div class="alert alert-warning">

Solution_3.3
    
</div>

In [None]:
grid_search = None 
best_max_features = None
best_C = None
best_score = None

# BEGIN SOLUTION

param_grid = {
    "countvectorizer__max_features": max_features,
    "logisticregression__C": C_vals,
}
grid_search = GridSearchCV(pipe, param_grid, verbose=1, n_jobs=-1)
grid_search.fit(X_train, y_train);
best_max_features = grid_search.best_params_['countvectorizer__max_features']
best_C = grid_search.best_params_['logisticregression__C']
best_score = grid_search.best_score_
best_score
# END SOLUTION

In [None]:
assert not best_max_features is None, "Are you using the correct variable?"
assert not best_C is None, "Are you using the correct variable?"
assert type(grid_search.get_params()['estimator']) == Pipeline, "Are you passing a pipeline to the GridSearch?"
assert len(grid_search.get_params()['param_grid']['countvectorizer__max_features']) == 5, "Are you using the max_features values from 3.1?"
assert len(grid_search.get_params()['param_grid']['logisticregression__C']) == 7, "Are you using the C values from 3.2?"

In [None]:
assert sha1(str(best_max_features).encode('utf-8')).hexdigest() == '8a12a315082a345f1a9d3ad14b214cd36d310cf8', "Best max feature doesn't seem correct."

In [None]:
assert sha1(str(best_C).encode('utf-8')).hexdigest() == '98507b2da2f54e9a46bad9a285b92199e8e04200', "Best C doesn't seem correct."

In [None]:
assert sha1(str(np.round(best_score,4)).encode('utf-8')).hexdigest() == '567b99788b50aef4a1ebe4338c49910c4b5363fc', "Best score doesn't seem correct."

<br><br>

### 3.4 Discussion 
rubric={points:4}

- Do the best values of hyperparameters found by Grid Search agree with what you found in 3.1 and 3.2? 
- Generally speaking, _should_ these values agree with what you found in parts  3.1 and 3.2? Why or why not? 

<div class="alert alert-warning">

Solution_3.4
    
</div>

They don't agree. In general there is no reason they need to agree - by jointly optimizing the hyperparameters you might find something better. 
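
A tiny worked example can show why tuning one hyperparameter at a time need not agree with a joint search. The score table below is entirely made up for illustration (it is not output from this notebook): rows are two settings of a hypothetical hyperparameter A, columns are two settings of a hypothetical hyperparameter B.

```python
import numpy as np

# Hypothetical validation scores, made up for illustration only.
# Rows: settings of hyperparameter A; columns: settings of hyperparameter B.
scores = np.array([[0.80, 0.85],
                   [0.83, 0.82]])
default_a, default_b = 0, 0  # the "other" hyperparameter stays at its default

# One-at-a-time search: vary A with B fixed, then vary B with A fixed.
best_a_alone = int(np.argmax(scores[:, default_b]))
best_b_alone = int(np.argmax(scores[default_a, :]))

# Joint (grid) search: consider every combination.
best_a_joint, best_b_joint = map(
    int, np.unravel_index(np.argmax(scores), scores.shape)
)

print("separately:", (best_a_alone, best_b_alone), scores[best_a_alone, best_b_alone])
print("jointly:   ", (best_a_joint, best_b_joint), scores[best_a_joint, best_b_joint])
```

In this made-up table, combining the two separately chosen values lands on a cell that scores lower than the jointly best cell, because the best value of one hyperparameter depends on the value of the other.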

<br><br>

### 3.5 Test score
rubric={points:2}

- Evaluate your final model on the test set. Store the test accuracy in the variable called `test_score`.

<div class="alert alert-warning">

Solution_3.5
    
</div>

In [None]:
test_score = None

# BEGIN SOLUTION
test_score = grid_search.score(X_test, y_test)
test_score
# END SOLUTION

In [None]:
assert not test_score is None, "Are you storing the score in the provided variable?"
assert sha1(str(np.round(test_score, 4)).encode('utf-8')).hexdigest() == 'dad0d8c6709453d08f527227df265bb497d5067e', "The test score doesn't look correct."

<br><br>

### 3.6 Discussion of Test Score
rubric={points:4}

- How does your test accuracy compare to your validation accuracy? 
- If they are different: do you think this is because you "overfitted on the validation set", or simply random luck?

<div class="alert alert-warning">

Solution_3.6
    
</div>

The test score is very close to the cross-validation score, which means our validation score was a good estimate of our test score and we have not "overfitted" on the validation set during hyperparameter optimization.

<br><br><br><br>

## Exercise 4: Very short answer questions
<hr>
rubric={points:8}

Each question is worth 2 points. Max 2 sentences per answer.

1. What is the problem with calling `fit_transform` on your test data with `CountVectorizer`? 
2. If you could only access one of `predict` or `predict_proba`, which one would you choose? Briefly explain.
3. What are two advantages of `RandomizedSearchCV` over `GridSearchCV`?
4. Why is it important to follow the Golden Rule? If you violate it, will that give you a worse classifier?

<div class="alert alert-warning">

Solution_4
    
</div>

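1. Calling `fit_transform` on the test data learns a new vocabulary from the test set, so the columns no longer match the features the model was trained on. You need to apply the same transformation to the train and test data by calling `transform` with the vectorizer fitted on the training data, otherwise the results will not make sense.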
2. `predict_proba`. From here you can get the output of `predict`, but not the other way around.
3. (1) you can directly choose the number of experiments, (2) avoids the "irrelevant hyperparameter" issue where experiments are wasted (see lecture). 
4. Violating the rule would result in overfitting the model on the test data and over-estimating the performance of the model. But, when the model is finally deployed, its performance would be lower than expected.

<br><br><br><br>

## Submission instructions 

**PLEASE READ:** When you are ready to submit your assignment do the following:

1. Run all cells in your notebook to make sure there are no errors by doing `Kernel -> Restart Kernel and Clear All Outputs` and then `Run -> Run All Cells`. 
2. Notebooks with cell execution numbers out of order or not starting from “1” will have marks deducted. Notebooks without the output displayed may not be graded at all (because we need to see the output in order to grade your work).
3. Upload the assignment using Gradescope's drag and drop tool. Check out this [Gradescope Student Guide](https://lthub.ubc.ca/guides/gradescope-student-guide/) if you need help with Gradescope submission. 

Congratulations on finishing the homework! 

![](img/eva-well-done.png)

