# **Setup**
 
Reset the Python environment to clear it of any previously loaded variables, functions, or libraries. Then, import the libraries needed to complete the code Professor Melnikov presented in the video.

<hr style="border-top: 2px solid #606366; background: transparent;">

# **Review**
 
## **Load Libraries and Corpus**

<span style="color:black">In this notebook, you will automate the search for logistic regression hyperparameters in order to improve the accuracy of your model on a test set. The [`GridSearchCV()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) object from the [scikit-learn](https://scikit-learn.org/stable/index.html) (SKL) library automates much of the tedious manual labor, but it cannot search the full space of hyperparameters, so you need to specify which values to try. The model is evaluated with cross-validation for every hyperparameter combination, and all validation results are packaged into a Python dictionary, which can easily be converted to a dataframe, sorted, and color-coded for further assessment.

<span style="color:black">Begin by defining a [cosine similarity](https://scikit-learn.org/stable/modules/metrics.html#cosine-similarity) metric using [SciPy's](https://docs.scipy.org/doc/scipy/index.html) [cosine distance](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cosine.html#scipy.spatial.distance.cosine) function and loading male and female names as lowercased strings. Recall that cosine distance is defined as one minus the cosine similarity of the argument vectors.

In [None]:
%reset -f
from IPython.core.interactiveshell import InteractiveShell as IS
IS.ast_node_interactivity = "all"
import pandas as pd, numpy as np, nltk, matplotlib.pyplot as plt, plotly.express as px
from scipy.spatial.distance import cosine  # a cosine distance, not similarity
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

np.set_printoptions(linewidth=10000, precision=2, edgeitems=20, suppress=True)  # display format for numpy arrays
pd.set_option('display.max_rows', 12, 'display.max_columns', 100, 'display.max_colwidth', 100, 'display.precision', 2)  # display format for pandas dataframes (full option names avoid ambiguous matches)

CosSim = lambda u,v: (1 - cosine(u, v))  # compute cosine similarity between two word vectors

_ = nltk.download(['names'], quiet=True)
LsM = [name.strip().lower() for name in nltk.corpus.names.words('male.txt')]
LsF = [name.strip().lower() for name in nltk.corpus.names.words('female.txt')]
print(f'{len(LsM)} male names:  ', LsM[:8])
print(f'{len(LsF)} female names:', LsF[:8])
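
The relationship between cosine distance and cosine similarity can be checked by hand. The following is a minimal, dependency-free sketch; the helper names `cosine_similarity` and `cosine_distance` are illustrative only and are not part of SciPy or of the notebook code.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors (illustrative helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """One minus the cosine similarity (illustrative helper)."""
    return 1.0 - cosine_similarity(u, v)

u, v = [1.0, 0.0], [1.0, 1.0]   # two vectors 45 degrees apart
sim = cosine_similarity(u, v)   # equals 1/sqrt(2)
dist = cosine_distance(u, v)
print(f'similarity={sim:.3f}, distance={dist:.3f}, sum={sim + dist:.3f}')
```

Note that this sketch does not guard against zero-length vectors, for which the cosine is undefined.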

## **Load Word2vec Model**

<span style="color:black">One difference from the code in the video is that you will not use FastText because it is slow to load (approximately two minutes) and consumes large amounts of memory (the smallest model takes about 3 GB). Instead, you will build a simple approximation of a FastText model from a pretrained model stored in word2vec format, `glove-wiki-gigaword-50.gz`. Recall that this model has a vocabulary of 400K lowercased words, each with a 50-dimensional numeric vector.

In [None]:
from gensim.models import KeyedVectors
%time wv = KeyedVectors.load_word2vec_format('glove-wiki-gigaword-50.gz')  # ~20 seconds

## **Approximate FastText Model**

<span style="color:black">As you recall, FastText is trained on character n-grams of words, so that when an out-of-vocabulary word appears, many of its substrings can still be in the vocabulary. The final vector is the centroid (i.e., mean) of the subword vectors that were found. 
    
<span style="color:black">Short of training FastText, you can reuse the pretrained word2vec model in the same fashion. You first need to parse a word into its subwords with the `MakeNGrams()` function below.

In [None]:
from nltk.util import ngrams

def MakeNGrams(w:'word'='hello'):
    '''Partitions the word w into all possible subwords and returns a list of these ngrams'''
    s = []
    for n in range(1, len(w)+1):
        for ng in ngrams(w, n):            
            s.append(''.join(ng))
    return s

print('n-grams from "":', MakeNGrams(''))   # we should always test our functions on all reasonable edge cases
print('n-grams from "a":', MakeNGrams('a'))
print('n-grams from "test":', MakeNGrams('test'))

<span style="color:black">In the next function `ftX()`, you will try to find a `wv` vector for each n-gram of the word `w` and then compute a centroid, or a mean vector, from whichever vectors you identify.

In [None]:
def ftX(w:'word'='hello'):
    '''Compute centroid for n-grams from the word w or return a zero vector if no subwords are found in the wv model'''
    vecs = np.array([wv[s] for s in MakeNGrams(w) if s in wv])
    v = vecs.mean(axis=0) if len(vecs) else np.zeros(50)
    return v

print('If no substring is found in wv, zeros are returned:', ftX('CAPITALS ARE NOT IN WORDTOVEC')[:10])
print('Empty strings are mapped to zero vector:', ftX('')[:10])
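
The centroid logic can be traced on a tiny example. The sketch below assumes a hypothetical three-word, 3-dimensional vocabulary `toy_wv` in place of the real 50-dimensional `wv` model; `subwords` and `toy_centroid` are illustrative names.

```python
import numpy as np

# Hypothetical 3-dimensional "vocabulary" standing in for the wv model
toy_wv = {
    'a':  np.array([1.0, 0.0, 0.0]),
    'b':  np.array([0.0, 1.0, 0.0]),
    'ab': np.array([1.0, 1.0, 1.0]),
}

def subwords(word):
    """All contiguous substrings of word (illustrative helper)."""
    return [word[i:i + n] for n in range(1, len(word) + 1) for i in range(len(word) - n + 1)]

def toy_centroid(word, dim=3):
    """Mean of the vectors of all subwords found in toy_wv, or zeros if none are found."""
    vecs = [toy_wv[s] for s in subwords(word) if s in toy_wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

print('ab :', toy_centroid('ab'))    # averages the vectors for 'a', 'b', and 'ab'
print('abc:', toy_centroid('abc'))   # 'c', 'bc', 'abc' are unknown, so only 'a', 'b', 'ab' contribute
print('zz :', toy_centroid('zz'))    # nothing is found, so a zero vector is returned
```

An out-of-vocabulary word such as `'abc'` still receives a vector because some of its subwords are known, which is the behavior `ftX` relies on.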

You can compare vectors produced directly by the `wv` model and by your `ftX` model. If `ftX` works well in capturing the semantic meaning from subwords, both vectors for the same name will be similar. Similarity can be measured with the cosine similarity metric. The cell below shows the two vectors for one name and the cosine similarity between them.

In [None]:
print('ftX vector for "abbey"', ftX('abbey')[:10])
print(' wv vector for "abbey"', wv['abbey'][:10])
print(f'Similarity between wv and ftX vectors of "abbey": {CosSim(ftX("abbey"), wv["abbey"]):.3f}')

<span style="color:black">This is only one example, selected by chance. If you do this comparison for all suitable vectors, what would be the distribution of cosine similarities? You hope it is shifted to the right of zero, which shows that most cosine similarities are positive. 

In [None]:
%time sims = [CosSim(ftX(w), wv[w]) for w in LsF+LsM if w in wv]  # cosine similarity can only be done for names in wv

<span style="color:black">The histogram of cosine similarity frequencies shows that the mean of similarities is 0.144. The distribution is symmetric, without notable outliers or abnormal (i.e. fat/thin) tails. So, **on average**, the word vectors produced with `ftX` are reasonable substitutes for vectors which are missing in `wv`.

In [None]:
print(f'mean of all similarities: {np.mean(sims):.3f}')
f = px.histogram(pd.DataFrame(sims, columns=['CS']).sort_values('CS'))
f = f.update_layout(height=200, margin=dict(t=0, b=0, l=0, r=0))
f = f.add_scatter(x=[np.mean(sims)], y=[0], name='mean')  # Add a central point in the plot
f.show()

## **Balance Out Two Classes**

<span style="color:black">Now, proceed with building a logistic regression classification model based on the numeric features from the `ftX` model. Balance your two sets of names, so that an accuracy metric can be used. Recall that an accuracy score is the fraction of correctly predicted genders. In a balanced problem, a random draw will have a 50% success rate. Thus, if accuracy exceeds 50%, you can interpret it as successful. This is not the case in imbalanced class problems, where greater imbalance makes an accuracy metric less suitable and requires more specific measures of performance, such as precision, recall, F1 score, AUC, and many others.

In [None]:
# Balance observations in two classes. So, a random draw has 50% chance of being from either class.
np.random.seed(0)
LsF = sorted(np.random.choice(LsF, size=len(LsM), replace=False))  # randomly drop excess names
df = pd.DataFrame(LsF + LsM, columns=['Name']).set_index('Name')   # create empty dataframe with names in the index
df['Y'] = [1] * len(LsF) + [0] * len(LsM)                          # add labels: 1=female, 0=male
print(f'{len(LsM)} male names:  ', LsM[:8])
print(f'{len(LsF)} female names:', LsF[:8])
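
To see why accuracy is only a meaningful yardstick for balanced classes, consider a classifier that always predicts the most frequent label. The sketch below uses made-up label counts and a hypothetical helper, `majority_baseline_accuracy`.

```python
def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent label (hypothetical helper)."""
    most_common = max(set(labels), key=labels.count)
    return sum(1 for y in labels if y == most_common) / len(labels)

balanced = [0] * 50 + [1] * 50      # made-up balanced labels
imbalanced = [0] * 90 + [1] * 10    # made-up imbalanced labels

acc_balanced = majority_baseline_accuracy(balanced)
acc_imbalanced = majority_baseline_accuracy(imbalanced)
print(f'balanced baseline:   {acc_balanced:.2f}')
print(f'imbalanced baseline: {acc_imbalanced:.2f}')
```

With a 90/10 split, this uninformative classifier already scores far above 50%, so a high accuracy alone would say little about the model.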

## **Create Two Features**

<span style="color:black">Now create two features: a cosine similarity of the given name to the word `"feminine"` and a cosine similarity to the word `"thomas"`. As you can guess, you can create many other similar features as a cosine similarity between the given name and some gender-specific word. Notably, adding more features will not necessarily improve the model, but these few might. 

In [None]:
# Generate features with gender-specific words 
for sRef in ['feminine', 'thomas']: 
    df[f'CS2_{sRef}'] = [CosSim(ftX(w), ftX(sRef)) for w in df.index] # add a feature with a cosine similarity for each name
df.T.round(2)           # print transposed (flipped about its diagonal) dataframe

## **Train and Validate a Logistic Regression With Two Features**

<span style="color:black">Split your input features and output labels into training and validation sets, train the model on the training set, and evaluate the model on the validation set.

In [None]:
tX, vX, tY, vY = train_test_split(df.drop('Y', axis=1), df.Y, test_size=0.2, random_state=0)
lr = LogisticRegression()
print(f'Test accuracy: {lr.fit(tX, tY).score(vX, vY):.3f}')

<span style="color:black">The test performance of your model is ~60%, which is better than 50% (for a random classifier), but is below test scores you observed earlier with other models, such as Naive Bayes, random forest, and boosting.

## **Visualize Names in 2D Feature Space**

<span style="color:black">Since every name now has two coordinates (i.e., numeric features), you can plot each name as a point, with male names in blue and female names in pink. Also plot the decision boundary line produced by logistic regression.

In [None]:
import plotly.graph_objects as go     # import graph object from plotly library
vColors = np.array(["blue", "pink"])[df.Y]                      # marker color per observation: blue=male (Y=0), pink=female (Y=1)
sClrM, sClrF = 'rgba(100,180,255,0.2)', 'rgba(251,176,64,0.2)'  # background RGB colors and transparency
sLabX, sLabY = df.drop('Y', axis=1).columns[:2]
sTtl = 'Female and male names with cosine similarity features in 2D'

goS = go.Scatter(x=df[sLabX], y=df[sLabY], mode='markers', 
    marker=dict(size=2, line=dict(width=1, color=vColors), color=vColors), text=df.index)
layout = go.Layout( title=sTtl, hovermode='closest', xaxis=dict(title=sLabX), yaxis=dict(title=sLabY))

b0, (b1, b2) = lr.intercept_[0], lr.coef_.T[:2]  # retrieve 3 model parameters
c, m = (-b0/b2)[0], (-b1/b2)[0]                  # compute intercept and slope of the decision boundary line
xL, xR = df[sLabX].min(), df[sLabX].max()        # horizontal interval left and right points
yL, yR = df[sLabY].min(), df[sLabY].max()        # vertical interval bottom & top points
yBL, yBR = m*xL+c, m*xR+c                        # boundary line left & right points

f = go.Figure(layout=layout, layout_yaxis_range=[yL, yR], layout_xaxis_range=[xL, xR])
f = f.add_trace(go.Scatter(x=[xL,xL,xR,xR], y=[yL,yBL,yBR,yL], fill='toself', mode='none', fillcolor=sClrM, name='Female name class')) # fill below
f = f.add_trace(go.Scatter(x=[xL,xL,xR,xR], y=[yR,yBL,yBR,yR], fill='toself', mode='none', fillcolor=sClrF, name='Male name class'))   # fill above
f = f.add_trace(go.Scatter(x=[xL,xR], y=[yBL,yBR], mode='lines', name='Decision boundary in 2D')) # add decision boundary
f = f.add_trace(goS)    # add points last so that point labels remain interactive
f.show()

<span style="color:black">These groups strongly overlap, and one significant improvement would be to use more powerful features, which would separate the two sets of points visually. You could also try more numeric features, even though visualizations are most effective for just two dimensions. It seems the pink points are on the right of the decision boundary line, while blue points extend to the left. So, the line split does make sense, since you want it to best separate the two classes.
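
The intercept and slope of the boundary line in the plotting code follow from setting the linear score to zero: b0 + b1·x + b2·y = 0 gives y = (-b1/b2)·x + (-b0/b2). A minimal numeric check, using hypothetical coefficient values rather than the fitted ones:

```python
import math

def sigmoid(z):
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

b0, b1, b2 = 0.5, -2.0, 4.0    # hypothetical intercept and two coefficients
c, m = -b0 / b2, -b1 / b2      # intercept and slope of the boundary line

probs = []
for x in (-1.0, 0.0, 2.0):
    y = m * x + c                                   # a point exactly on the boundary
    probs.append(sigmoid(b0 + b1 * x + b2 * y))     # predicted probability at that point
print(probs)

p_above = sigmoid(b0 + b1 * 0.0 + b2 * 1.0)         # a point above the line at x=0
print(p_above)
```

Every point on the line has a predicted probability of one half. Which side of the line belongs to which class depends on the sign of `b2`.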

## **<span style="color:black">Search for Better Hyperparameters</span>**

<span style="color:black">To see all hyperparameters of the logistic regression, print these with the `get_params()` method.

In [None]:
pd.DataFrame(lr.get_params().items(), columns=['param','value']).set_index('param').T

<span style="color:black">Now, to improve the performance of the model with the current features, try different hyperparameters using the [`GridSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) object. The input to this object is a "grid" of parameters, and every point of this "grid" is one set of hyperparameter values. Below you specify a search over different values of `C` (the inverse of the regularization strength, so smaller values mean stronger regularization) and different optimizers. You will train and validate on two folds, meaning that the training data is split into two equal halves, the model is trained on one half and validated on the other, and then the halves are swapped. Because every hyperparameter combination is trained twice, the search takes about twice as long as a single fit per combination. The best set of hyperparameters (the one with the highest mean validation accuracy) is displayed.

In [None]:
from sklearn.model_selection import GridSearchCV

lr1 = LogisticRegression(penalty='l2', random_state=0)
C = np.logspace(-1, 1, 3)             # .1, 1, 10 inverse regularization strength
solver = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']  # different optimizers
DGrid = dict(C=C, solver=solver)      # grid of parameters
gs = GridSearchCV(lr1, DGrid, cv=2)   # 2-fold cross-validation on the training set
gs.fit(tX, tY)

lr1.set_params(**gs.best_params_)
print(f'Cross-validation (CV) accuracy:\t{gs.best_score_:.3f}')
print(f'Test accuracy (held out set): \t{lr1.fit(tX,tY).score(vX,vY):.3f}')
print(f'Best hyper-parameters: \t\t{gs.best_params_}')

<span style="color:black">The best cross-validation accuracy is 58.3%, which is lower than the 60.5% test accuracy. Still, you would trust cross-validation more, because every training observation is used for validation once, which makes the score more indicative of your model's true test performance.
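
The size of the search above can be worked out without fitting anything. This sketch enumerates the same grid with `itertools.product`; the variable names are illustrative, and the extra final fit assumes `GridSearchCV`'s default `refit=True`.

```python
from itertools import product

C_values = [10.0 ** e for e in (-1, 0, 1)]                     # 0.1, 1, 10
solvers = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
n_folds = 2

# Every grid point is one combination of hyperparameter values
grid_points = [dict(C=C, solver=s) for C, s in product(C_values, solvers)]
n_cv_fits = len(grid_points) * n_folds                         # one fit per grid point per fold

print(f'{len(grid_points)} grid points x {n_folds} folds = {n_cv_fits} cross-validation fits')
print('first grid point:', grid_points[0])
```

With the default `refit=True`, one more fit on the whole training set follows, using the best grid point. Adding a third hyperparameter with k values multiplies the number of fits by k, which is why the grid has to be chosen deliberately.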

<span style="color:black">As before, you can display all hyperparameters of the grid search object.

In [None]:
pd.DataFrame(gs.get_params().items(), columns=['param','value']).set_index('param').T

## **Display All Grid Search Results**

<span style="color:black">Finally, you can view all evaluations done by `GridSearchCV` and the corresponding accuracy scores. These appear in the rows whose names end in `test_score` (for example, `split0_test_score` and `split1_test_score`), one per cross-validation fold. Summary statistics for these scores (`mean_test_score` and `std_test_score`) are also shown. The bottom row ranks all model evaluations, with rank 1 indicating the best set of hyperparameters.

In [None]:
dfgs = pd.DataFrame(gs.cv_results_)
dfgs.T

<span style="color:black">Go one step further by color-coding all values. Also sort the columns by rank for easier navigation around the table. Non-numeric values are moved into a dataframe [MultiIndex](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.html), so that background color gradients can be applied to the remaining numeric values.

In [None]:
dfgs = pd.DataFrame(gs.cv_results_)  # convert a dictionary to a dataframe
dfgs.params = dfgs.params.apply(str) # convert params dictionary to string values (to be used in index)
dfgs = dfgs.reset_index().set_index(['index', 'params', 'param_solver']).apply(pd.to_numeric) # set multi-index
dfgs.sort_values('rank_test_score').T.style.background_gradient(cmap='coolwarm', axis=1).format('{:.3f}') # sort, color, and show 3 decimals (set_precision was removed in newer pandas)
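
The structure of `cv_results_` can also be explored on a small synthetic problem that does not need the word vectors. The sketch below assumes a made-up dataset from `make_classification` and a reduced grid; it checks how the per-fold scores, the mean score, and the rank relate to `best_params_` and `best_score_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic two-feature binary problem (an assumption, not the names data)
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

grid = {'C': [0.1, 1.0, 10.0], 'solver': ['lbfgs', 'liblinear']}
search = GridSearchCV(LogisticRegression(random_state=0), grid, cv=2).fit(X, y)

res = search.cv_results_
best = int(np.argmin(res['rank_test_score']))      # position of the rank-1 grid point
fold_mean = (res['split0_test_score'] + res['split1_test_score']) / 2

print('number of grid points:', len(res['params']))
print('rank-1 parameters:    ', res['params'][best])
print('rank-1 mean accuracy: ', round(float(res['mean_test_score'][best]), 3))
```

The mean test score of each grid point is the average of its per-fold scores, and the rank-1 entry corresponds to `best_params_` and `best_score_`.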

<hr style="border-top: 2px solid #606366; background: transparent;">

# **Optional Practice**
 
Now, equipped with these concepts and tools, you will tackle a few related tasks.
 
First you will try to find a "better" pair of features, where "better" means higher test accuracy on exactly the same split of your observations. Keeping the split reproducible removes the dependence of the test accuracy on which observations land in the training and validation sets as you try different modeling experiments. If the split were not identical, you would not know whether a change in test accuracy was caused by the new feature or just by a different split. The same rationale of reproducibility is behind seeding all random number generators, including the one in the `LogisticRegression()` object.
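
The effect of seeding can be illustrated without any modeling library. This is a minimal sketch of a seeded split built on the standard `random` module; `seeded_split` is a hypothetical helper and is not how `train_test_split` is implemented.

```python
import random

def seeded_split(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of items with a fixed seed, then cut off the test part (hypothetical helper)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

names = [f'name{i}' for i in range(10)]             # placeholder observations
train_a, test_a = seeded_split(names, seed=0)
train_b, test_b = seeded_split(names, seed=0)       # same seed: same split
train_c, test_c = seeded_split(names, seed=1)       # a different seed may give a different split

print('seed 0:      ', test_a)
print('seed 0 again:', test_b)
print('seed 1:      ', test_c)
```

Two experiments that share a seed are evaluated on identical validation observations, so any difference in their scores can be attributed to the model or the features.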

As you work through these tasks, check your answers by running your code in the *#check solution here* cell, to see if you’ve gotten the correct result. If you get stuck on a task, click the See **solution** drop-down to view the answer.

## Task 1
 
Find the first feature that performs better than `CS2_feminine`, which you used above. While you could try every single word in the English vocabulary, this would be tremendously expensive in terms of time and computational resources. You could limit your search for a suitable word to just the word2vec model's vocabulary of 400K words. How long would this take? With the current virtual environment it takes about 2 seconds to build cosine similarities and train/test with a single word. That's 800K seconds, or a little over nine days. In general, you have multiple options, one of which is a parallel search on multiple processors. However, most words in the word2vec vocabulary are unlikely to be gender specific or to correlate with male or female names. Start by drawing a few dozen words that relate to `'feminine'`, which is already a decent benchmark. 
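
The cost of the exhaustive search is quick to work out. The sketch below takes the figures quoted above (a 400K-word vocabulary and about 2 seconds per word); the worker count is a hypothetical value, and perfect parallel speedup is an idealized assumption.

```python
vocab_size = 400_000          # words in the word2vec vocabulary
seconds_per_word = 2          # approximate time to build the feature and train/test once
seconds_per_day = 60 * 60 * 24

total_seconds = vocab_size * seconds_per_word
total_days = total_seconds / seconds_per_day

n_workers = 8                 # hypothetical number of parallel processes
days_parallel = total_days / n_workers   # idealized: assumes perfect speedup

print(f'{total_seconds:,} seconds is about {total_days:.1f} days sequentially')
print(f'about {days_parallel:.1f} days on {n_workers} workers')
```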
 
Use the appropriate method of the `wv` object to find the 35 words most similar to `'feminine'`. Save the list of these words to `LsWords`. (Later you can try more words.)

<b>Hint:</b> Review the documentation for the <code>wv.most_similar</code> method.

In [None]:
# check solution here

<font color=#606366>
    <details><summary><font color=#b31b1b>▶ </font>See <b>solution</b>.</summary>
<pre>
LsWords = [n for n, _ in wv.most_similar('feminine', topn=35)]
print('"feminine" neighbors:', LsWords)
</pre>
</details> 
</font>
<hr>

## Task 2
 
Copy only the column `Y` of the dataframe `df` to `df1` using the `copy()` method, which ensures that a real copy is made rather than a reference to a subset of columns in `df`. Now, build as many single-feature logistic regression experiments as you have words in `LsWords`.
 
1. For each word `W` in `LsWords` :
    1. compute cosine similarity between each name in `df1` and `W`, as we did above, and save these similarities to `df1` into a column `CS2_Word1`. That is, you will overwrite this column with each new value of `W`. You should have a dataframe with just two columns: `Y` and `CS2_Word1` ready for modeling.
    1. Use 0-seeded `train_test_split` to split `df1` into a 20% validation set and an 80% training set. This results in DataFrames of input features, `tX`, `vX`, and Series of the corresponding output labels, `tY`, `vY` (similar to the code above).
    1. Build 0-seeded `LogisticRegression()`, then fit it on `tX` and `tY` and score it on `vX`, `vY`
    1. Save the word `W` and the test score of the model using `W` to a list of tuples `hist`
1. Now you can sort `hist` by decreasing accuracy. The top word is our best candidate out of all that were tried.
 
Finally, re-compute cosine similarities for the word that produced the best test accuracy and save those cosine similarities into the column `CS2_Word1`. This is your first feature. Here are the top three rows from the resulting `df1` dataframe:
 
 
|Name|Y|CS2_Word1|
|-|-|-|
|abagail|1|0.85|
|abbe|1|0.74|
|abbey|1|0.80|
 
The top performing word should result in **0.60 test accuracy**, which is better than the two features (combined) you had above!

<b>Hint:</b> This is just a repeated model building in a loop. Keep the train and test split inside the loop as well. Check code above for reference.


In [None]:
# check solution here

<font color=#606366>
    <details><summary><font color=crimson>▶ </font>See <b>solution</b>.</summary>
<pre>
hist = []
df1 = df[['Y']].copy()
 
for i, W in enumerate(LsWords):
    df1[f'CS2_Word1'] = [CosSim(ftX(w), ftX(W)) for w in df.index] # add a feature with a cosine similarity for each name
    tX, vX, tY, vY = train_test_split(df1.drop('Y', axis=1), df1.Y, test_size=0.2, random_state=0)
    hist += [[W, LogisticRegression(random_state=0, n_jobs=-1).fit(tX, tY).score(vX, vY)]]
    print(i, hist[-1])
    
pd.DataFrame(hist).sort_values(1, ascending=False).T  # decreasing accuracy, best word first
df1[f'CS2_Word1'] = [CosSim(ftX(w), ftX('sensibility')) for w in df.index]
</pre>
</details> 
</font>
<hr>

## Task 3
 
Now, you will look for the second feature in a similar way, but keeping the first feature fixed in the model. As you can guess, you can keep adding new features as long as they improve the test accuracy. Features that degrade the model accuracy are not of interest to you. 
 
1. Similar to Task 1, find 35 words that are most similar to `"thomas"` and save them to the list of strings `LsWords`. 
1. Similar to Task 2, make a copy of `df1` named `df2`, but keep both columns, the label `Y` and the feature `CS2_Word1`. 
1. Build 35 logistic regression models (one for each word `W` in `LsWords`), where each model is trained/tested on two (not one) features, `CS2_Word1` and `CS2_Word2`, where the latter is being replaced with cosine similarities computed between the word `W` and the corresponding vectors of names in `df2.index`. Keep the train/test split and logistic regression seeded with zeros.
 
Here are the top 3 rows of the `df2` dataframe.
 
|Name|Y|CS2_Word1|CS2_Word2|
|-|-|-|-|
|abagail|1|0.85|0.79|
|abbe|1|0.74|0.80|
|abbey|1|0.80|0.79|
 
Your second feature is a vector of cosine similarities that resulted in the best test accuracy of logistic regression built on two features. You should observe the **test accuracy of 0.63**, which is even better than the single-feature model you built above in Tasks 1 & 2.

<b>Hint:</b> This is similar to the review code above and Tasks 1 and 2, but the first feature is now part of every model. The second feature keeps changing with every word in the list <code>LsWords</code>.
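
The one-feature-at-a-time search used across Tasks 2 and 3 is a form of greedy forward selection. The following sketch shows the general loop on synthetic data; the candidate feature names and the data are made up for illustration and do not come from the names corpus.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 2, size=n)                   # synthetic binary labels

# Hypothetical candidate features: two carry signal, two are pure noise
candidates = {
    'signal_a': y + rng.normal(0, 1.0, n),
    'signal_b': y + rng.normal(0, 1.5, n),
    'noise_a': rng.normal(0, 1.0, n),
    'noise_b': rng.normal(0, 1.0, n),
}

def holdout_accuracy(names):
    """Fit on a fixed 80/20 split and return validation accuracy for the given features."""
    X = np.column_stack([candidates[k] for k in names])
    tX, vX, tY, vY = train_test_split(X, y, test_size=0.2, random_state=0)
    return LogisticRegression(random_state=0).fit(tX, tY).score(vX, vY)

selected, best_acc = [], 0.0
while True:
    remaining = [k for k in candidates if k not in selected]
    scores = {k: holdout_accuracy(selected + [k]) for k in remaining}
    if not scores:
        break                                    # every candidate is already selected
    k_best = max(scores, key=scores.get)
    if scores[k_best] <= best_acc:
        break                                    # no remaining feature improves accuracy
    selected.append(k_best)
    best_acc = scores[k_best]
    print(f'added {k_best}: accuracy {best_acc:.3f}')
```

Because the same validation set is used both to choose features and to report accuracy, the final score tends to be optimistic. A separate held-out set or cross-validation would give a less biased estimate.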


In [None]:
# check solution here

<font color=#606366>
    <details><summary><font color=#b31b1b>▶ </font>See <b>solution</b>.</summary>
<pre>
LsWords = [n for n, _ in wv.most_similar('thomas', topn=35)]
print('"thomas" neighbors:', LsWords)
hist = []
df2 = df1.copy()

for i, word in enumerate(LsWords):
    df2[f'CS2_Word2'] = [CosSim(ftX(w), ftX(word)) for w in df.index] # add a feature with a cosine similarity for each name
    tX, vX, tY, vY = train_test_split(df2.drop('Y', axis=1),df2.Y, test_size=0.2, random_state=0)
    hist += [[word, LogisticRegression(random_state=0, n_jobs=-1).fit(tX, tY).score(vX, vY)]]
    print(i, hist[-1])

pd.DataFrame(hist).sort_values(1, ascending=False)  # decreasing accuracy, best word first
df2[f'CS2_Word2'] = [CosSim(ftX(w), ftX('wright')) for w in df.index]
</pre>
</details> 
</font>
<hr>

## Task 4

Now, keep these features and try to improve your model by searching for "better" hyperparameter values. You can find a full list of these in the definition of scikit-learn's [logistic regression model](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). You can start by providing a larger or different set of values for `C`, the inverse of the regularization strength.

<b>Hint:</b> See grid search code above. Use <code>df2</code>. You can try searching for optimal values of additional hyperparameters of <a href=https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html><code>LogisticRegression()</code></a> model.


In [None]:
# check solution here

<font color=#606366>
    <details><summary><font color=#b31b1b>▶ </font>See <b>solution</b>.</summary>
<pre>
tX2, vX2, tY2, vY2 = train_test_split(df2.drop('Y', axis=1), df2.Y, test_size=0.2, random_state=0)
lr1 = LogisticRegression(penalty='l2', random_state=0)
C = np.logspace(-1, 1, 3)              # .1, 1, 10 inverse regularization strength
solver = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']  # different optimizers
DGrid = dict(C=C, solver=solver)       # grid of parameters
gs2 = GridSearchCV(lr1, DGrid, cv=2)   # 2-fold cross-validation on the training set
gs2.fit(tX2, tY2)
lr1.set_params(**gs2.best_params_)     # use the results of the new search (gs2), not the earlier gs
print(f'Cross-validation (CV) accuracy:\t{gs2.best_score_:.3f}')
print(f'Test accuracy (held out set): \t{lr1.fit(tX2,tY2).score(vX2,vY2):.3f}')
print(f'Best hyper-parameters: \t\t{gs2.best_params_}')
</pre>Notice the improved test accuracy of 0.637 (versus 0.591 in the results of the hyperparameter search above). In this fashion you can continue improving your model with a better set of engineered input features and a better set of hyperparameters. "Better" is defined as higher test accuracy.
</details>
</font>
<hr>