# **Setup**
 
Reset the Python environment to clear it of any previously loaded variables, functions, or libraries. Then, import the libraries needed to complete the code Professor Melnikov presented in the video.

In [None]:
%reset -f
from IPython.core.interactiveshell import InteractiveShell as IS
IS.ast_node_interactivity = "all"    # allows multiple outputs from a cell

<hr style="border-top: 2px solid #606366; background: transparent;">

# **Review**

<span style="color:black">In this notebook, you will rebuild the previous logistic regression model that uses the cosine similarity feature. The focus is on making and interpreting predictions from the model.</span>

<span style="color:black">As before, you will load the appropriate packages and the pretrained word vectors, `glove-wiki-gigaword-50.gz`, which are GloVe embeddings stored in word2vec format. Note that you will use this word2vec-format model instead of FastText because it loads about 10x faster and is sufficient for this demonstration. Then define a cosine similarity function as one minus the cosine distance.</span>

In [None]:
import pandas as pd, numpy as np, nltk, seaborn as sns, matplotlib.pyplot as plt
import plotly.express as px, plotly.graph_objects as go
from gensim.models import KeyedVectors
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from scipy.spatial.distance import cosine  # a cosine distance, not similarity
CosSim = lambda x, y: 1 - cosine(x, y)     # convert distance to similarity

# Dictionary-like object. key=word (string), value=trained embedding coefficients (array of numbers)
%time wv = KeyedVectors.load_word2vec_format('glove-wiki-gigaword-50.gz')
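
The relationship between cosine distance and cosine similarity can be checked on tiny vectors without loading the embedding model. The sketch below is a standard-library illustration only; the helper names `cosine_similarity` and `cosine_distance` are made up for this example and are not part of the notebook's code.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def cosine_distance(x, y):
    """Cosine distance, defined as 1 minus the cosine similarity."""
    return 1 - cosine_similarity(x, y)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # perpendicular vectors
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])        # opposite directions
print(same_direction, orthogonal, opposite)
```

Parallel vectors score close to 1, perpendicular vectors close to 0, and opposite vectors close to -1, regardless of vector length.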

## **Balanced Classification Problem**

Begin by loading the male and female names into lists and dropping names that are not found in the word2vec vocabulary. Then downsample the majority class, i.e., the class with the larger count of observations (or names), so that both classes end up the same size. A procedure that results in an equal count of observations from each class produces a **balanced classification problem**, where observations from the two classes are equally likely (absent any information from the predictors). Thus, an accuracy higher than 50% indicates that the model performs better than random chance.

In [None]:
_ = nltk.download(['names'], quiet=True)
LsM = [name.strip().lower() for name in nltk.corpus.names.words('male.txt')]
LsF = [name.strip().lower() for name in nltk.corpus.names.words('female.txt')]

# Balance observations in two classes. So, a random draw has 50% chance of being from either class.
np.random.seed(0)
LsF = sorted(np.random.choice(LsF, size=len(LsM), replace=False))
LsM2, LsF2 = [s for s in LsM if s in wv], [s for s in LsF if s in wv]
LsM2 = np.random.choice(LsM2, size=min(len(LsM2), len(LsF2)), replace=False).tolist()
LsF2 = np.random.choice(LsF2, size=min(len(LsM2), len(LsF2)), replace=False).tolist()
print(f'{len(LsM2)} male names:  ', LsM2[:20])
print(f'{len(LsF2)} female names:', LsF2[:20])
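
The balancing step can be seen in isolation on two short lists. This is a minimal sketch using the standard `random` module rather than NumPy; the function name `balance` and the toy name lists are assumptions made for illustration.

```python
import random

def balance(class_a, class_b, seed=0):
    """Downsample both lists (without replacement) to the smaller list's size."""
    rng = random.Random(seed)            # seeded for reproducibility
    n = min(len(class_a), len(class_b))  # size of the minority class
    return rng.sample(class_a, n), rng.sample(class_b, n)

males = ['adam', 'brian', 'carl']                     # toy minority class
females = ['anna', 'beth', 'cora', 'dana', 'erin']    # toy majority class
m_bal, f_bal = balance(males, females)
print(len(m_bal), len(f_bal))
```

After balancing, a name drawn at random is equally likely to come from either list, which is what makes 50% the baseline accuracy.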

## **Functional Programming**

Next, rewrite the `CosSim2Female()` function in a [functional](https://docs.python.org/3/howto/functional.html) style so that it returns a function. Doing this makes it easier to add features to the dataframe because you can use one functional for various query strings. Test the functional below.

In [None]:
# cosine similarity to query function, i.e. it returns a function which can be further evaluated
CS2Q = lambda sQuery='man': lambda sName: CosSim(wv[sName], wv[sQuery]) if sName in wv else 0 
df = pd.DataFrame(LsF2 + LsM2, columns=['Name'])
df['Y'] = [1] * len(LsF2) + [0] * len(LsM2)   # create numeric labels for names
df['CS2F'] = df.Name.apply(CS2Q('feminine'))   # for each name compute cosine similarity to female query word
df = df.sort_values('CS2F', ascending=False).set_index('Name')
df.T.round(2)

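As expected, the `'man'` query string gives a higher cosine similarity to the male name `'david'` than to the female name `'kathy'`. You can confirm this by comparing `CS2Q('man')('david')` with `CS2Q('man')('kathy')`.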
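
The function-returning-a-function pattern behind `CS2Q` can be sketched with a plain dictionary in place of the embedding model. The 2-D vectors below are invented for illustration and are not real embedding values; `make_scorer` and `toy_wv` are hypothetical names.

```python
import math

# Hypothetical 2-D "embeddings", invented for illustration only.
toy_wv = {
    'man':   [1.0, 0.1],
    'david': [0.9, 0.2],
    'kathy': [0.2, 0.9],
}

def cos_sim(x, y):
    """Cosine similarity of two 2-D vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.hypot(*x) * math.hypot(*y))

def make_scorer(query, vectors):
    """Return a function that scores any name against the fixed query word."""
    query_vec = vectors[query]
    def score(name):
        # Names missing from the vocabulary get a similarity of 0.
        return cos_sim(vectors[name], query_vec) if name in vectors else 0.0
    return score

sim_to_man = make_scorer('man', toy_wv)   # one scorer per query word
scores = {name: sim_to_man(name) for name in ['david', 'kathy', 'zzz']}
print(scores)
```

Because the query word is captured once, the returned scorer can be passed directly to something like `Series.apply()`, and a new feature only needs a new query string.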

## **Train Logistic Regression**

Now you can split the dataset into train and validation samples, train the logistic regression model, and evaluate its validation accuracy.

In [None]:
tX, vX, tY, vY = train_test_split(df.drop('Y', axis=1), df.Y, test_size=0.2, random_state=0)
lr = LogisticRegression(random_state=0)   # create a model and (always) seed random number generator
lr.fit(tX, tY)                            # fit a model to compute model parameters
print(f'Accuracy = fraction of correct predictions: {lr.score(vX, vY):.3f}') # report out of sample accuracy

Package all results into a dataframe and add an `isAccurate` column that flags the rows where the prediction matches the observed class.

In [None]:
dfvXY = vX.copy()
dfvXY['P[Y=0(male)|CS2F]'] = lr.predict_proba(vX)[:,0]
dfvXY['P[Y=1(female)|CS2F]'] = lr.predict_proba(vX)[:,1]
dfvXY['pY'] = pY = lr.predict(vX)
dfvXY['vY'] = vY
dfvXY['isAccurate'] = dfvXY.pY == dfvXY.vY
dfvXY
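
The two probability columns and the predicted class are tied together by the logistic function. The sketch below assumes a single input feature and uses made-up coefficients `B0` and `B1`; in the fitted model, the corresponding values live in `lr.intercept_` and `lr.coef_`.

```python
import math

def predict_proba_one(x, intercept, coef):
    """Return (P[Y=0|x], P[Y=1|x]) for a one-feature logistic model."""
    z = intercept + coef * x            # linear score
    p1 = 1.0 / (1.0 + math.exp(-z))     # logistic (sigmoid) function
    return 1.0 - p1, p1

def predict_one(x, intercept, coef):
    """Predict class 1 when its probability exceeds 0.5, else class 0."""
    _, p1 = predict_proba_one(x, intercept, coef)
    return 1 if p1 > 0.5 else 0

B0, B1 = -1.0, 4.0   # hypothetical coefficients, not the fitted values
for x in (0.0, 0.2, 0.5):
    p0, p1 = predict_proba_one(x, B0, B1)
    print(f'x={x:.2f}  P[Y=0]={p0:.3f}  P[Y=1]={p1:.3f}  pY={predict_one(x, B0, B1)}')
```

The two probabilities always sum to 1, so the predicted class is simply whichever column holds the larger value.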

## **Count Outputs**

Count the correctly and incorrectly classified outputs by tallying each combination of observed class `vY` and predicted class `pY`. An output is correctly classified when the predicted value matches the observed value, whether that value is 0 or 1 (these are the rows where the `isAccurate` column is `True`); otherwise, it is misclassified.

In [None]:
# name the count column explicitly; its default name differs across pandas versions
dfvXY[['vY', 'pY']].value_counts().reset_index(name='Counts')

The first two rows show correct classification counts, which add up to 517, and the bottom rows indicate the misclassified names, which sum up to 309. Overall, the accuracy is $517/(309+517)\approx 0.626$ which is higher than the baseline accuracy of 50%.
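
The same bookkeeping can be reproduced on a handful of labels with `collections.Counter`. The label lists below are toy values chosen for illustration, not the notebook's validation data.

```python
from collections import Counter

true_y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]   # toy observed classes
pred_y = [0, 0, 1, 0, 1, 1, 0, 1, 0, 1]   # toy predicted classes

# Tally every (observed, predicted) combination.
pair_counts = Counter(zip(true_y, pred_y))

# Matching pairs are correct; mismatched pairs are misclassified.
correct = sum(n for (t, p), n in pair_counts.items() if t == p)
wrong = sum(n for (t, p), n in pair_counts.items() if t != p)
accuracy = correct / (correct + wrong)
print(dict(pair_counts), accuracy)
```

Accuracy is the share of pairs whose two entries agree, which is the same ratio computed in the text from the correct and misclassified counts.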

## **Confusion Matrix**

Now you are ready to create a [confusion matrix](https://scikit-learn.org/stable/modules/model_evaluation.html#confusion-matrix) from this dataframe. The confusion matrix places all correctly predicted counts on its diagonal and all incorrect predictions off its diagonal.

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_true=vY, y_pred=pY, labels=[0,1])
cm

Without referring to the scikit-learn documentation, it is hard to tell whether the rows in this matrix correspond to observed classes and the columns to predicted classes. To improve the readability of the confusion matrix, label the dimensions and plot the matrix as an annotated heatmap using the seaborn library.

In [None]:
import seaborn as sns
MsTxt = np.char.array([['TN','FP'], ['FN','TP']]) + '='
MsTxt = MsTxt + cm.astype('str') + '; ' + (cm/cm.sum() * 100).round(1).astype('str') + '%'

plt.rcParams['figure.figsize'] = [10, 2]
ax = sns.heatmap(cm, annot=MsTxt, cbar=False, cmap='coolwarm', fmt='', annot_kws={"fontsize":20});
ax.set_title('Confusion Matrix: counts and % of total count');
ax.set(xlabel='Predicted labels', ylabel='True labels');

From the matrix, you can see that 17.6% of all validation names are false positives, i.e., actual negatives (male names) that the model reports as positives. Recall that you defined the female gender as 1, which is now the "positive" class. The male gender label is 0, the "negative" class.
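
The four cells named in the heatmap can also be tallied by hand, which makes the layout explicit: rows are true labels and columns are predicted labels, following the scikit-learn convention. The helper `confusion_counts` and the toy labels are assumptions for illustration only.

```python
def confusion_counts(y_true, y_pred, labels=(0, 1)):
    """Build a matrix with rows = true labels and columns = predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

toy_true = [0, 0, 0, 1, 1, 1, 1]   # toy observed classes
toy_pred = [0, 1, 0, 1, 0, 1, 1]   # toy predicted classes
toy_cm = confusion_counts(toy_true, toy_pred)

# With labels ordered [0, 1], the layout is [[TN, FP], [FN, TP]].
(tn, fp), (fn, tp) = toy_cm
total = tn + fp + fn + tp
print(toy_cm, f'FP share of all observations: {fp / total:.1%}')
```

Dividing any cell by the grand total gives the percent-of-total figure shown in the heatmap annotations.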

scikit-learn has a method, [`ConfusionMatrixDisplay.from_estimator()`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html), that does most of this work in one line. It replaces the older `plot_confusion_matrix()` function, which was deprecated in scikit-learn 1.0 and removed in 1.2.

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
plt.rcParams['figure.figsize'] = [10, 5]
disp = ConfusionMatrixDisplay.from_estimator(lr, vX, vY, display_labels=[0,1], cmap=plt.cm.Blues, normalize='all');
disp.ax_.set_title('Confusion Matrix');

<hr style="border-top: 2px solid #606366; background: transparent;">

# **Optional Practice**
Now, equipped with these concepts and tools, you will tackle a few related tasks.

As you work through these tasks, check your answers by running your code in the *#check solution here* cell, to see if you’ve gotten the correct result. If you get stuck on a task, click the See **solution** drop-down to view the answer.

## Task 1
 
Copy the `df` dataframe into `df1` and add a column `CS2lady`, which uses the `CS2Q()` function to compute the cosine similarity between the given name (in each row) and the target word `lady`.

<b>Hint:</b> The code above should guide you through these steps.

In [None]:
# check solution here


<font color=#606366>
    <details><summary><font color=#b31b1b>▶ </font>See <b>solution</b>.</summary>
<pre>
df1 = df.copy()
df1['CS2lady'] = df.reset_index().Name.apply(CS2Q('lady')).values
df1.T
</pre>
</details> 
</font>
<hr>

## Task 2

Split the `df1` rows into train and test inputs and outputs, and then train a logistic regression model with two input features, `CS2F` and `CS2lady`. Compute the out-of-sample (i.e., test) accuracy.

<b>Hint:</b> Use the code above as your guide.

In [None]:
# check solution here


<font color=#606366>
    <details><summary><font color=#b31b1b>▶ </font>See <b>solution</b>.</summary>
<pre>
tX, vX, tY, vY = train_test_split(df1.drop('Y', axis=1), df1.Y, test_size=0.2, random_state=0)
lr = LogisticRegression(random_state=0).fit(tX, tY)   # create a model; compute model parameters
print(f'Accuracy = fraction of correct predictions: {lr.score(vX, vY):.3f}') # report out of sample accuracy
</pre>
</details> 
</font>
<hr>

## Task 3
 
Build a confusion matrix for the new model. Which true class shows the greatest improvement?

<b>Hint:</b> Use the code above as your guide.

In [None]:
# check solution here

<font color=#606366>
    <details><summary><font color=#b31b1b>▶ </font>See <b>solution</b>.</summary>
<pre>
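from sklearn.metrics import ConfusionMatrixDisplay   # plot_confusion_matrix() was removed in scikit-learn 1.2
plt.rcParams['figure.figsize'] = [10, 5]
disp = ConfusionMatrixDisplay.from_estimator(lr, vX, vY, display_labels=[0,1], cmap=plt.cm.Blues, normalize='all');
disp.ax_.set_title('Confusion Matrix');</pre>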
 
The true female gender class (lower row of the matrix) improved the most. Its correct classifications rose from 31% to 34% of all test names, while correct classifications for the male gender class increased by only 1 percentage point.
    </details> 
</font>
<hr>

## Task 4

Try adding more cosine similarity features based on words that have a strong gender bias, such as certain gender-associated professions, names, and gender-specific products and services. Add these one by one and keep only those features that improve your accuracy.

This task is open-ended and does not have a single specific solution.
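
One way to organize the "add one by one, keep only what helps" loop is a greedy forward pass. The sketch below is only an outline under stated assumptions: `forward_select` is a hypothetical helper, the feature names `CS2nurse` and `CS2table` are invented, and `toy_score` is a stand-in with made-up numbers. In practice, the scoring function would fit a logistic regression on the chosen columns and return its validation accuracy.

```python
def forward_select(candidates, score_fn, base_features):
    """Greedy pass: keep a candidate feature only if it raises the score."""
    kept = list(base_features)
    best = score_fn(kept)
    for feature in candidates:
        trial = score_fn(kept + [feature])
        if trial > best:            # keep only strict improvements
            kept.append(feature)
            best = trial
    return kept, best

# Hypothetical per-feature contributions, standing in for real model accuracy.
TOY_GAIN = {'CS2F': 0.62, 'CS2lady': 0.03, 'CS2nurse': 0.01, 'CS2table': -0.02}

def toy_score(features):
    """Stand-in for 'fit the model and return validation accuracy'."""
    return round(sum(TOY_GAIN[f] for f in features), 4)

kept, best = forward_select(['CS2lady', 'CS2table', 'CS2nurse'], toy_score, ['CS2F'])
print(kept, best)
```

Note that a greedy pass depends on the order in which candidates are tried, and small accuracy gains on a single validation split may not hold up on new data.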

In [None]:
# check solution here