# **Setup**
 
Reset the Python environment to clear it of any previously loaded variables, functions, or libraries. Then, import the libraries needed to complete the code Professor Melnikov presented in the video.

In [None]:
%reset -f
from IPython.core.interactiveshell import InteractiveShell as IS
IS.ast_node_interactivity = "all"    # allows multiple outputs from a cell

<hr style="border-top: 2px solid #606366; background: transparent;">

# **Review**

<span style="color:black">This notebook reviews feature engineering and model training and evaluation on cosine similarity features. You have done this before, and past videos and Jupyter notebooks explain it in more detail. The notebook then moves on to metrics computed from the confusion matrix.
    
<span style="color:black">Begin by defining a cosine similarity function and loading the Word2Vec model.


In [None]:
import pandas as pd, numpy as np, nltk, seaborn as sns, matplotlib.pyplot as plt
import plotly.express as px, plotly.graph_objects as go
from gensim.models import KeyedVectors
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
CosSim = lambda x, y: x @ y / (x @ x)**0.5 / (y @ y)**0.5  # our own implementation of cosine similarity

# Dictionary-like object. key=word (string), value=trained embedding coefficients (array of numbers)
%time wv = KeyedVectors.load_word2vec_format('glove-wiki-gigaword-50.gz')  # ~20 seconds
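
As a quick sanity check on the `CosSim` formula above, here is a minimal dependency-free sketch. The function `cos_sim` and the toy vectors are illustrative assumptions, not part of the notebook. It shows that parallel vectors score about 1, orthogonal vectors score 0, and opposite vectors score about -1.

```python
import math

def cos_sim(x, y):
    """Cosine similarity of two equal-length numeric sequences."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Toy vectors (hypothetical, chosen only to illustrate the three extremes)
same_direction = cos_sim([1.0, 2.0], [2.0, 4.0])    # parallel vectors
orthogonal = cos_sim([1.0, 0.0], [0.0, 3.0])        # perpendicular vectors
opposite = cos_sim([1.0, 2.0], [-1.0, -2.0])        # anti-parallel vectors
print(same_direction, orthogonal, opposite)
```

Cosine similarity ignores vector length, so only the direction of an embedding matters here.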

<span style="color:black">Next, load names and drop those that are not in the Word2Vec model. Balance the labels in the dataset by downsampling the longer list of names to the size of the shorter list. 

In [None]:
_ = nltk.download(['names'], quiet=True)
LsM = [name.strip().lower() for name in nltk.corpus.names.words('male.txt')]
LsF = [name.strip().lower() for name in nltk.corpus.names.words('female.txt')]

# Balance observations in two classes. So, a random draw has 50% chance of being from either class.
np.random.seed(0)
LsF = sorted(np.random.choice(LsF, size=len(LsM), replace=False))
LsM2, LsF2 = [s for s in LsM if s in wv], [s for s in LsF if s in wv]
LsM2 = np.random.choice(LsM2, size=min(len(LsM2), len(LsF2)), replace=False).tolist()
LsF2 = np.random.choice(LsF2, size=min(len(LsM2), len(LsF2)), replace=False).tolist()
print(f'{len(LsM2)} male names:  ', LsM2[:8])
print(f'{len(LsF2)} female names:', LsF2[:8])
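
The balancing step above can be sketched on toy data. The lists and variable names below are hypothetical, and `random.sample` from the standard library stands in for `np.random.choice(..., replace=False)`. Both classes are cut down to the size of the smaller one.

```python
import random

# Toy stand-ins for the two name lists (hypothetical data)
majority = ['anna', 'beth', 'cara', 'dana', 'emma', 'faye']
minority = ['adam', 'ben', 'carl']

rng = random.Random(0)                    # seeded so the draw is reproducible
n = min(len(majority), len(minority))     # size of the smaller class
majority_bal = rng.sample(majority, n)    # sample without replacement
minority_bal = rng.sample(minority, n)    # keeps all items, in shuffled order
print(len(majority_bal), len(minority_bal))
```

With equal class sizes, a model that always predicts one class scores 50% accuracy, which makes accuracy easier to interpret.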

## **Cosine Similarity**

<span style="color:black">Now, build the cosine similarity feature that measures the relation of each name to the word "feminine."

In [None]:
# cosine similarity to query function, i.e. it returns a function which can be further evaluated
CS2Q = lambda sQuery='man': lambda sName: CosSim(wv[sName], wv[sQuery]) if sName in wv else 0 
df = pd.DataFrame(LsF2 + LsM2, columns=['Name'])
df['Y'] = [1] * len(LsF2) + [0] * len(LsM2)   # create numeric labels for names
df['CS2F'] = df.Name.apply(CS2Q('feminine'))   # for each name compute cosine similarity to the query word 'feminine'
df = df.sort_values('CS2F', ascending=False).set_index('Name')
df.T.round(2)
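
`CS2Q` is a function that returns another function, which is why it can be passed straight to `apply`. The sketch below shows the same pattern with named functions and a toy two-dimensional embedding table. `toy_wv`, `make_similarity_to`, and the vectors are illustrative assumptions, not the notebook's data.

```python
import math

# Hypothetical 2-D "embeddings" standing in for the real word vectors
toy_wv = {'feminine': [1.0, 0.0], 'mary': [0.8, 0.6], 'john': [0.0, 1.0]}

def cos_sim(x, y):
    """Cosine similarity of two equal-length numeric sequences."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def make_similarity_to(query, vectors):
    """Return a function that scores any word against the fixed query word."""
    query_vec = vectors[query]
    def score(word):
        # Words missing from the vocabulary get a neutral score of 0
        return cos_sim(vectors[word], query_vec) if word in vectors else 0.0
    return score

sim_to_feminine = make_similarity_to('feminine', toy_wv)
scores = {word: sim_to_feminine(word) for word in ['mary', 'john', 'zzz']}
print(scores)
```

Fixing the query once and reusing the returned function keeps the feature-building code short when several query words are tried.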

## **Logistic Regression Model**

<span style="color:black">Build a logistic regression model on the single input feature and the binary output variable.

In [None]:
tX, vX, tY, vY = train_test_split(df.drop('Y', axis=1), df.Y, test_size=0.2, random_state=0)
lr = LogisticRegression(random_state=0).fit(tX, tY)    # create a model; fit a model to compute model parameters
print(f'Accuracy = fraction of correct predictions: {lr.score(vX, vY):.3f}') # report out of sample accuracy
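
With a single input feature, a fitted logistic regression reduces to a sigmoid applied to `intercept + coef * x`. The sketch below uses made-up coefficient values (they are assumptions, not the values the model above learns) to show how a cosine similarity maps to a probability and where the 0.5 decision boundary falls.

```python
import math

def predict_proba(x, intercept, coef):
    """Probability of class 1 under a one-feature logistic model."""
    return 1.0 / (1.0 + math.exp(-(intercept + coef * x)))

# Hypothetical parameters for illustration only (not fitted values)
intercept, coef = -2.0, 10.0

# The predicted probability equals 0.5 where intercept + coef * x = 0
boundary = -intercept / coef

p_low = predict_proba(0.0, intercept, coef)     # feature below the boundary
p_high = predict_proba(0.5, intercept, coef)    # feature above the boundary
print(boundary, p_low, p_high)
```

In scikit-learn, the corresponding fitted values are available as `lr.intercept_` and `lr.coef_`.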

## **Confusion Matrix**

<span style="color:black">Construct a simple confusion matrix with the correct classifications on the diagonal and misclassified examples off the diagonal.

In [None]:
pY = lr.predict(vX)
cm = confusion_matrix(y_true=vY, y_pred=pY, labels=[0,1])
cm
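
To make the layout of the matrix concrete, the following sketch counts the four cells by hand on a small made-up set of labels (the data are hypothetical). With labels ordered `[0, 1]`, rows are true classes and columns are predicted classes, so the cells read `[[TN, FP], [FN, TP]]`.

```python
# Hypothetical true and predicted labels (0 = negative, 1 = positive)
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

# toy_cm[true_label][predicted_label] counts observations in each cell
toy_cm = [[0, 0], [0, 0]]
for t, p in zip(y_true, y_pred):
    toy_cm[t][p] += 1

(tn, fp), (fn, tp) = toy_cm
print(toy_cm)
```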

<span style="color:black">You will now manually calculate the metrics discussed in the video. 
    
<span style="color:black"><b>Note:</b> scikit-learn has convenient functions to compute these metrics individually or all together in a single report. 

In [None]:
(TN, FP), (FN, TP) = cm
nTot = cm.sum()
nCorrect = TP + TN   # Total correct predictions (pos or neg); diagonal elements
nTrueNeg = TN + FP   # Total actual negative (male) cases; top row
nTruePos = FN + TP   # Total actual positive (female) cases; bottom row
nPredNeg = TN + FN   # Total predicted negative (male) cases; 1st column
nPredPos = TP + FP   # Total predicted positive (female) cases; 2nd column

nAcc = nCorrect / nTot
nFPR = FP / nTrueNeg # False pos rate, 1-specificity, Type-I error rate
nTPR = TP / nTruePos # True pos rate, recall, sensitivity, 1 - Type-II error rate
nPPV = TP / nPredPos # Pos predictive value, precision, 1-false discovery rate
nNPV = TN / nPredNeg # Neg predictive value

print(f'Acc = {nAcc:.3F}')  # Fraction of correct predictions
print(f'FPR = {nFPR:.3F}')  # Fraction of actual negatives incorrectly predicted positive
print(f'TPR = {nTPR:.3F}')  # Fraction of actual positives that were correctly predicted
print(f'PPV = {nPPV:.3F}')  # Fraction of predicted positives that are correct
print(f'F1  = {2 * nPPV * nTPR/ (nPPV + nTPR):.3F}') # Harmonic mean of precision (PPV) and recall (TPR)
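
A worked example with round numbers can help check the formulas. The counts below are invented for illustration and are not the output of the model above.

```python
# Hypothetical confusion-matrix counts, laid out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 50, 10, 5, 35

total = tn + fp + fn + tp
acc = (tp + tn) / total           # fraction of correct predictions
fpr = fp / (tn + fp)              # false positive rate = 1 - specificity
tpr = tp / (fn + tp)              # true positive rate = recall
ppv = tp / (tp + fp)              # positive predictive value = precision
npv = tn / (tn + fn)              # negative predictive value
f1 = 2 * ppv * tpr / (ppv + tpr)  # harmonic mean of precision and recall

for name, value in [('Acc', acc), ('FPR', fpr), ('TPR', tpr),
                    ('PPV', ppv), ('NPV', npv), ('F1', f1)]:
    print(f'{name} = {value:.3f}')
```

Each rate divides a cell by a row total (FPR, TPR) or a column total (PPV, NPV), so checking the denominator is a quick way to catch a mislabeled metric.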


## **Build a Receiver Operating Characteristic**

<span style="color:black"> Next, you will build a Receiver Operating Characteristic ([ROC](https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html?highlight=eceiver%20operating%20characteristic)) curve.

<div id="blank_space" style="padding-top:20px">
    <details>
        <summary>
            <div id="button" style="color:white;background-color:#de2424;padding:10px;border:3px solid #B31B1B;border-radius:30px;width:140px;text-align:center;float:left;margin-top:-15px"> 
                <b>ROC Curve → </b>
            </div>
        </summary>
        <div id="button_info" style="padding:20px;background-color:#eee;border:3px solid #aaa;border-radius:30px;margin-left:25px;">
            <p style="padding:15px 2px 2px 2px">
                There are various versions of the <a href="https://scikit-learn.org/stable/modules/model_evaluation.html#roc-metrics">ROC</a> curve, but a popular version plots the true positive rate (TPR) against the false positive rate (FPR). For every value of your probability threshold, plot a single point with coordinates (FPR, TPR). There can be as many thresholds as there are distinct predicted probabilities in the validation sample. These points are connected with step functions to produce the orange curve. The dashed line shows the expected performance of a random model (one that assigns labels to observations at random). The random model is your benchmark. </p>
        </div> 
    </details>
</div>

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score

pY1 = lr.predict_proba(vX)[:,1]   # probability of class 1 only
AUC = roc_auc_score(y_true=vY, y_score=pY1)
fpr, tpr, thresholds = roc_curve(y_true=vY, y_score=pY1)

plt.plot([0,1], [0,1], linestyle='--', label='Random guess model');
plt.plot(fpr, tpr, marker='', label='Logistic regression model');
plt.xlabel('False Positive Rate (FPR)');
plt.ylabel('True Positive Rate (TPR)');
plt.text(0, 0.8, f'AUC={AUC:.2f}', fontsize=15);
plt.title('Receiver Operating Characteristic (ROC) Curve');
plt.legend();
plt.grid();
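
The threshold sweep behind the curve can be sketched without any libraries. The helper names `roc_points` and `trapezoid_auc` and the four toy observations are assumptions for illustration. This is not how scikit-learn implements `roc_curve`, only the same idea on a small scale.

```python
# Hypothetical labels and predicted probabilities of class 1
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

def roc_points(y_true, y_score):
    """Return (FPR, TPR) points, one per distinct threshold, from strict to lenient."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for thr in sorted(set(y_score), reverse=True):
        tp = sum(1 for t, s in zip(y_true, y_score) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, y_score) if s >= thr and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def trapezoid_auc(points):
    """Area under the piecewise-linear curve through the (FPR, TPR) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

points = roc_points(y_true, y_score)
auc = trapezoid_auc(points)
print(points, auc)
```

Lowering the threshold can only add predicted positives, so both FPR and TPR move monotonically from (0, 0) to (1, 1).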

<hr style="border-top: 2px solid #606366; background: transparent;">

# **Optional Practice**
Now, equipped with these concepts and tools, you will tackle a few related tasks.

As you work through these tasks, check your answers by running your code in the *#check solution here* cell, to see if you’ve gotten the correct result. If you get stuck on a task, click the See **solution** drop-down to view the answer.

## Task 1
 
Copy the `df` dataframe into `df1` and add a column `CS2lady`. This column uses the `CS2Q()` function to compute the cosine similarity between the given name (in each row) and the target word `lady`. Similarly, add columns `CS2woman`, `CS2female`, `CS2guy`, `CS2girl`, and `CS2robert`.

<b>Hint:</b> The code above should guide you through these steps.


In [None]:
# check solution here


<font color=#606366>
    <details><summary><font color=#b31b1b>▶ </font>See <b>solution</b>.</summary>
<pre>
df1 = df.copy()
df1['CS2lady'] = df.reset_index().Name.apply(CS2Q('lady')).values
df1['CS2woman'] = df.reset_index().Name.apply(CS2Q('woman')).values
df1['CS2female'] = df.reset_index().Name.apply(CS2Q('female')).values
df1['CS2guy'] = df.reset_index().Name.apply(CS2Q('guy')).values
df1['CS2girl'] = df.reset_index().Name.apply(CS2Q('girl')).values
df1['CS2robert'] = df.reset_index().Name.apply(CS2Q('robert')).values
df1.T
</pre>
</details> 
</font>
<hr>

## Task 2

Build a logistic regression model based on the features you created. Then compute the confusion matrix.

<b>Hint:</b> See the code above or in the previous Jupyter Notebook.

In [None]:
# check solution here


<font color=#606366>
    <details><summary><font color=#b31b1b>▶ </font>See <b>solution</b>.</summary>
<pre>
tX, vX, tY, vY = train_test_split(df1.drop('Y', axis=1), df1.Y, test_size=0.2, random_state=0)
lr1 = LogisticRegression(random_state=0).fit(tX, tY)   # create a model; compute model parameters
print(f'Accuracy = fraction of correct predictions: {lr1.score(vX, vY):.3f}') # report out of sample accuracy
pY = lr1.predict(vX)
cm = confusion_matrix(y_true=vY, y_pred=pY, labels=[0,1])
cm
</pre>
</details> 
</font>
<hr>

## Task 3

Compute AUC and build a ROC curve for the model you built.

<b>Hint:</b> See code above.

In [None]:
# check solution here


<font color=#606366>
    <details><summary><font color=#b31b1b>▶ </font>See <b>solution</b>.</summary>
<pre>
pY1 = lr1.predict_proba(vX)[:,1]   # probability of class 1 only
AUC = roc_auc_score(y_true=vY, y_score=pY1)
fpr, tpr, thresholds = roc_curve(y_true=vY, y_score=pY1)

plt.plot([0,1], [0,1], linestyle='--', label='Random guess model')
plt.plot(fpr, tpr, marker='', label='Logistic regression model')
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.text(0, 0.8, f'AUC={AUC:.2f}', fontsize=15)
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend()
plt.grid()
</pre>
</details> 
</font>
<hr>

## Task 4

Try to further improve the model. You can also explore different hyperparameters of the [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) model.

This is an open-ended task and has no single specific solution.

In [None]:
# your solution goes here