# **Setup**
 
Reset the Python environment to clear it of any previously loaded variables, functions, or libraries. Then, import the libraries needed to complete the code Professor Melnikov presented in the video.

In [None]:
%reset -f
from IPython.core.interactiveshell import InteractiveShell as IS
IS.ast_node_interactivity = "all"    # allows multiple outputs from a cell

<hr style="border-top: 2px solid #606366; background: transparent;">

# **Review**


## **Part 1: Build a Logistic Regression Name Classifier**

<span style="color:black">In this notebook, you will build a name classifier using a logistic regression model with a single input feature. This feature will be the cosine similarity between the embedding vector of the name and the embedding vector of the query word "feminine". The idea is that female names tend to have a high cosine similarity with the query word and male names tend to have a low one. A threshold on this similarity then separates the two classes, and logistic regression automates the search for that threshold by fitting it to labeled names.
    
After building the model, you will focus on making and interpreting predictions from the trained model.

<span style="color:black"><b>Note:</b> Instead of using a large FastText model, you use the familiar word2vec-format model (`glove-wiki-gigaword-50`), which loads ~10x faster. The cost is that you lose FastText's ability to handle out-of-vocabulary words, so some names may not have a vector. You will quantify this loss of observations below.

In [None]:
import pandas as pd, numpy as np, nltk, seaborn as sns, matplotlib.pyplot as plt
import plotly.express as px, plotly.graph_objects as go
from gensim.models import KeyedVectors
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Dictionary-like object. key=word (string), value=trained embedding coefficients (array of numbers)
%time wv = KeyedVectors.load_word2vec_format('glove-wiki-gigaword-50.gz')

<span style="color:black">Load names from the NLTK corpora and lowercase all names. Recall that the `glove-wiki-gigaword-50.gz` word2vec model has a limited vocabulary that contains only lowercase words. Since all names are title-cased, capitalization does not help distinguish them, and lowercasing them lets you retrieve their vectors from the word2vec model.

In [None]:
_ = nltk.download(['names'], quiet=True)
LsM = [name.strip().lower() for name in nltk.corpus.names.words('male.txt')]
LsF = [name.strip().lower() for name in nltk.corpus.names.words('female.txt')]

# Balance observations in two classes. So, a random draw has 50% chance of being from either class.
np.random.seed(0)
LsF = sorted(np.random.choice(LsF, size=len(LsM), replace=False))
print(f'{len(LsM)} male names:  ', LsM[:20])
print(f'{len(LsF)} female names:', LsF[:20])


### **Quantify Names Not in the Vocabulary**

<span style="color:black">You will now quantify the fraction of names that are not in the model's vocabulary. To do so, for every female name in the list, check if it is in the word2vec vocabulary and save a `True` or `False` value to the resulting list. Since `True` values evaluate to 1 and `False` values evaluate to 0, you can easily calculate the fraction of `True` values by dividing the sum by its count. Repeat this procedure for male names.

In [None]:
LbF, LbM = [s in wv for s in LsF], [s in wv for s in LsM] # list of booleans
print(f'Fraction of female names found in word2vec model: {sum(LbF)/len(LbF):.4f}')
print(f'Fraction of male names found in word2vec model: {sum(LbM)/len(LbM):.4f}')
print('In vocabulary: ', [s for s, b in zip(LsF, LbF) if b][:10])          # Female names found in model vocabulary
print('Out of vocabulary: ', [s for s, b in zip(LsF, LbF) if not b][:10])  # Female names not found in model vocabulary

<span style="color:black">Notice that ~30% of female names and ~8% of male names are out of vocabulary (OOV). You can either throw away the OOV names, identify new features, or use a different word embedding model.
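
The boolean-counting trick described above can be illustrated on a tiny example. This is only a sketch: `toy_vocab` and `toy_names` are made-up stand-ins for the real vocabulary and name lists, chosen so the result is easy to check by hand.

```python
# Hypothetical stand-ins for the model vocabulary and a list of names.
toy_vocab = {"anna", "david", "maria"}
toy_names = ["anna", "zyx", "david", "qwerty"]

in_vocab = [name in toy_vocab for name in toy_names]   # list of booleans
coverage = sum(in_vocab) / len(in_vocab)               # True counts as 1, False as 0
oov_names = [name for name, found in zip(toy_names, in_vocab) if not found]

print(f"In-vocabulary fraction: {coverage:.2f}")
print("OOV names:", oov_names)
```

Two of the four toy names are in the vocabulary, so the in-vocabulary fraction is 0.50 and the OOV fraction is its complement.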


### **Cosine Similarity Metric**

<span style="color:black">Next, create a cosine similarity metric using cosine distance. Before using this metric, you need to convert the query word into a 50-dimensional vector. 
    
<span style="color:black"><b>Note:</b> Although you can use all 50 values as features for the model (which would likely produce better results), for the sake of simplicity, you will use a single input feature that can be visualized in a 2D plot.
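
For reference, cosine similarity can also be computed directly from its definition: the dot product of two vectors divided by the product of their lengths. The sketch below uses a hypothetical helper `cos_sim` and hand-picked 2D vectors rather than real embeddings.

```python
import numpy as np

def cos_sim(x, y):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

a = np.array([1.0, 0.0])
print(cos_sim(a, [2.0, 0.0]))    # same direction: similarity 1
print(cos_sim(a, [0.0, 3.0]))    # orthogonal vectors: similarity 0
print(cos_sim(a, [-1.0, 0.0]))   # opposite direction: similarity -1
```

The value depends only on the angle between the vectors, not on their lengths, which is why it ranges from -1 to 1.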

In [None]:
from scipy.spatial.distance import cosine  # a cosine distance, not similarity
CosSim = lambda x, y: 1 - cosine(x, y)     # convert distance to similarity

sQuery = 'feminine'   # a query concept or word
vFemale = wv[sQuery]
print(f'Vector for "{sQuery}", size {len(vFemale)}:', list(vFemale.round(2)))

<span style="color:black">Next, join female and male names into a DataFrame. Convert each to a numeric vector, and then compute its cosine similarity with the query vector. In this DataFrame, for each name, there is a `Y` output label, indicating gender (1 for female and 0 for male), and a `CS2F` value, indicating its cosine similarity with "feminine". You can use `sort_values()` to order each name in the dataframe by its `CS2F` score.

In [None]:
CosSim2Female = lambda sName: CosSim(wv[sName], vFemale) if sName in wv else 0 # cosine similarity to query

df = pd.DataFrame(LsF + LsM, columns=['Name'])
df['Y'] = [1] * len(LsF) + [0] * len(LsM)   # create numeric labels for names
df['CS2F'] = df.Name.apply(CosSim2Female)   # for each name compute cosine similarity to female query word
df = df.sort_values('CS2F', ascending=False).set_index('Name')
df.T.round(2)

<span style="color:black"> Observe that female names tend to have higher cosine similarities with the vector of "feminine", while male names have lower cosine similarities.


### **Decision Boundary Line**

<span style="color:black">You will now plot each name using its cosine similarity with the query word's vector and its label as coordinates in 2D space. 

In [None]:
f = px.scatter(df.reset_index(), x='CS2F', y='Y', hover_name='Name')
f = f.update_traces(marker=dict(size=6, line=dict(width=.1)), marker_symbol='line-ns')
f = f.update_layout(margin=dict(l=0, r=0, t=30, b=0), height=200, title='Training observations: output vs input')
f.add_trace(go.Scatter(x=[-0.4, 0.4], y=[0.2, .8], mode="text", text=["males", "females"], showlegend=False))

<span style="color:black">Notice the slight left shift in the distribution of male names. You can hover your mouse pointer over these markers to view each name. OOV names are stacked at cosine similarity (CS2F) 0, so you do not see them individually. The female names to the right are closer to the query word and the female names to the left are farther from the query word (i.e., from its vector representation).

<span style="color:black">You can approximate a vertical decision boundary line that splits the names into two regions, and use this boundary to classify new names. Such a line would likely sit slightly below 0, but it is difficult to tell with so many points clustered together. If you are off by even .001, your "manual" classifier might lose several percentage points of accuracy. Thus, a better approach is to train a logistic regression model to find the optimal decision boundary automatically.
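
To get a feel for how sensitive a hand-picked threshold can be, here is a minimal sketch on made-up data. The arrays `cs` and `y` and the helper `accuracy_at` are hypothetical and are not taken from the names dataset.

```python
import numpy as np

# Hypothetical toy data: one feature value per name and its label (1 = female, 0 = male).
cs = np.array([-0.30, -0.10, -0.05, 0.00, 0.02, 0.10, 0.25, 0.40])
y = np.array([0, 0, 1, 0, 1, 0, 1, 1])

def accuracy_at(threshold):
    """Accuracy of the rule 'predict female when the feature exceeds the threshold'."""
    return float(np.mean((cs > threshold).astype(int) == y))

for t in (-0.06, -0.04, 0.01, 0.15):
    print(f"threshold={t:+.2f}  accuracy={accuracy_at(t):.3f}")
```

Moving the threshold from -0.06 to -0.04 flips a single observation and drops the accuracy from 0.750 to 0.625. With thousands of tightly clustered names, a small shift can flip many observations at once.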

<span style="color:black">To begin, split the observations into training and validation sets for training and evaluating the logistic regression model, respectively. Always seed random number generators (with an arbitrary value) for reproducibility.

In [None]:
tX, vX, tY, vY = train_test_split(df.drop('Y', axis=1), df.Y, test_size=0.2, random_state=0)
print('tX:', tX.shape, ',\t tY:', tY.shape)
print('vX:', vX.shape, ',\t vY:', vY.shape)

<span style="color:black">Now, instantiate a logistic regression model with its default (hyper-) parameters. The only parameter that you will set is the random number generator seed to ensure reproducible results.

In [None]:
lr = LogisticRegression(random_state=0)
lr


### **Train the Model**

<span style="color:black">You are now ready to fit the model on the training inputs and outputs and then evaluate it on the validation inputs and outputs.

In [None]:
lr.fit(tX, tY)
print(f'Accuracy = fraction of correct predictions: {lr.score(vX, vY):.3f}')

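<span style="color:black">The resulting accuracy score is 64.6%, which is a reasonable start for a first model with a single input feature. Keep in mind that OOV names were assigned a cosine similarity of zero: about 30% of the female names and 8% of the male names. Although the two classes contain the same number of names, these zero-valued inputs are not evenly split between them. This raises some questions that you will explore later.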
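
To see what fitting estimates, the sketch below fits an intercept and a slope to made-up data with plain gradient descent on the mean log loss. This is an illustration under stated assumptions, not what `LogisticRegression` does internally: scikit-learn's default adds L2 regularization and uses a different solver, and the data, step size, and iteration count here are all hypothetical.

```python
import numpy as np

# Hypothetical toy data: a single feature and binary labels.
x = np.array([-0.3, -0.2, -0.1, 0.1, 0.2, 0.3])
y = np.array([0, 0, 0, 1, 1, 1])

b0, b1 = 0.0, 0.0   # intercept and slope of the log-odds line
step = 1.0          # hypothetical learning rate
for _ in range(2000):
    p = 1 / (1 + np.exp(-(b0 + b1 * x)))   # predicted P(y=1 | x)
    b0 -= step * np.mean(p - y)            # gradient of the mean log loss w.r.t. b0
    b1 -= step * np.mean((p - y) * x)      # gradient of the mean log loss w.r.t. b1

pred = (1 / (1 + np.exp(-(b0 + b1 * x))) > 0.5).astype(int)
print(f"b0={b0:.3f}, b1={b1:.3f}, training accuracy={np.mean(pred == y):.2f}")
```

The slope comes out positive because larger feature values go with label 1, and the intercept stays near zero because the toy data is symmetric around zero.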

<span style="color:black">Print the intercept and slope of the underlying linear model, which describes the log-odds of the female class as a function of `CS2F`.

In [None]:
print(f'learnt β₀={lr.intercept_}, β₁={lr.coef_}')

### **Design a Nonlinear Helper Function**

<span style="color:black">Finally, you will define a nonlinear helper function and use this function to help plot the logistic curve and its decision boundaries (dashed lines) in both the probability and cosine similarity domains. The former can be used to classify name vectors in probability space, and the latter can be used to do the same in cosine similarity space, i.e. horizontal axis.

In [None]:
Sigmoid = lambda x: 1 / (1 + np.exp(-x))   # equivalently:  exp(x) / (1 + exp(x))
Logistic = lambda x, b0, b1: Sigmoid(b0 + b1 * x)
Logit = lambda p: np.log(p / (1 - p))   # inverse sigmoid; p is a probability in (0, 1)
InverseLogistic = lambda p, b0, b1: (Logit(p) - b0) / b1

<span style="color:black">Suppose you are given the name "Kit", which is traditionally a male's name. To predict the gender, you first need to find its embedding vector and then calculate the vector's cosine similarity to the vector of the query word ("feminine"), CS2F. If this CS2F>-0.0240, you classify "Kit" as a female's name. The CS2F for "Kit" is 0.3 so it would be incorrectly classified as a female's name. That is okay because the model is not perfect and it can be improved.
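
The link between the probability cutoff and the cosine similarity cutoff can be checked by hand. The sketch below uses hypothetical coefficients `b0` and `b1` (not the fitted values) and a hypothetical helper `boundary`, and relies only on the standard library.

```python
import math

def sigmoid(z):
    """Map a log-odds value to a probability."""
    return 1 / (1 + math.exp(-z))

def logit(p):
    """Inverse of the sigmoid: map a probability in (0, 1) to log-odds."""
    return math.log(p / (1 - p))

b0, b1 = 0.1, 4.0   # hypothetical coefficients, not the values learnt above

def boundary(p_cut):
    """Feature value at which the predicted probability equals p_cut."""
    return (logit(p_cut) - b0) / b1

cs_cut = boundary(0.5)   # logit(0.5) is 0, so this reduces to -b0 / b1
print(f"cutoff in feature space: {cs_cut:.4f}")
print(f"probability at the cutoff: {sigmoid(b0 + b1 * cs_cut):.4f}")
print(f"probability at CS2F=0.3: {sigmoid(b0 + b1 * 0.3):.4f}")
```

With a cutoff of 0.5 in probability space, the boundary in feature space is simply `-b0 / b1`. Raising the probability cutoff moves the boundary to the right when the slope is positive.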

In [None]:
plt.rcParams['figure.figsize'] = [20, 4]   # plot wider figures
ax = df.plot.scatter('CS2F', 'Y', grid=True, marker=r'|', alpha=.5, lw=1);

x = np.linspace(df['CS2F'].min(), df['CS2F'].max(), 100)

b0, b1 = lr.intercept_[0], lr.coef_[0][0]

DXY = {'CS2F': x, f'y | β₀= {b0:.2f}, β₁= {b1:.2f}': Logistic(x, b0=b0, b1=b1)}
pd.DataFrame(DXY).set_index('CS2F').plot(grid=True, ax=ax, color='green', lw=4);

p_cut = 0.5
cs_cut = InverseLogistic(p_cut, b0, b1)
plt.axhline(y=p_cut, color='r', linestyle='--');
plt.axvline(x=cs_cut, color='r', linestyle='--');
ax.text(cs_cut, p_cut, f'Decision boundaries:\n  in feature space: cosine similarity={cs_cut:.4f}\n  in probability space={p_cut}', size=15, color='r', verticalalignment='top');
ax.set_title('Fitted logistic function');
plt.xlabel(f'Cosine similarity to "{sQuery}" vector');

## **Part 2: Make Predictions From the Trained Model**
<span style="color:black">With this decision boundary, you can now make predictions for the gender of the names. You will start with making predictions using the cosine similarity value. Choose two cosine similarities for this demonstration.

In [None]:
# Trained model predicts gender for the given cosine similarity
x_cs = -0.03
print('x=', x_cs, ', prob=', np.round(lr.predict_proba([[x_cs]]), 4), ', predicted class=', lr.predict([[x_cs]])) 
x_cs = 0.03
print('x=', x_cs, ', prob=', np.round(lr.predict_proba([[x_cs]]), 4), ', predicted class=', lr.predict([[x_cs]]))

<span style="color:black">The first input, `x_cs=-0.03`, results in a male prediction since the probability of class 0 is 0.5066 (slightly higher than 50%). The second, `x_cs=0.03`, results in a female prediction since the probability of class 1 is 0.5592.
    
<span style="color:black">Now, you will make predictions using a name by first converting it into a cosine similarity value. Then, you can proceed with the same decision making process based on the previously identified threshold.

In [None]:
sName = 'Ann'
x0 = [CosSim2Female(sName.lower())]   # observation vector; lowercase the name to match the model's vocabulary
pY = lr.predict([x0])[0]         # predicted label y is 0 or 1
print('Cosine similarity to female:', np.round(x0, 4), ', \nPredicted label/gender:', pY, '/', ['male', 'female'][pY])

<span style="color:black">For the given name `'Ann'`, the model predicts a female class.

<span style="color:black">Consider all of the observations in the validation set. You can feed these to the model, which decides automatically whether each cosine similarity lies to the left or to the right of the decision boundary. This may seem trivial with a single feature, but it quickly becomes complex when multiple input features are involved.

In [None]:
vX.T

The `predict_proba()` method takes all the cosine similarities and returns the probability of class 0 (male) in the first column and the probability of class 1 (female) in the second column.
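
The two-column layout can be reproduced by hand for a one-feature logistic model. This is a sketch with hypothetical coefficients `b0` and `b1` and made-up inputs; it mirrors the layout described above without calling the fitted model.

```python
import numpy as np

b0, b1 = 0.1, 4.0                    # hypothetical coefficients, not the fitted ones
cs = np.array([-0.03, 0.03, 0.30])   # example cosine similarities

p_female = 1 / (1 + np.exp(-(b0 + b1 * cs)))        # P(Y=1 | CS2F)
proba = np.column_stack([1 - p_female, p_female])   # columns: class 0 (male), class 1 (female)
labels = proba.argmax(axis=1)                       # most probable class in each row

print(proba.round(4))
print(labels)
```

Each row sums to 1, and with two classes, picking the larger column is the same as comparing the second column with 0.5.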

In [None]:
lr.predict_proba(vX)   # probabilities for class 0(male) and class 1(female)

Here are the first few predictions and you can match them with the names to evaluate whether they are correct.

In [None]:
lr.predict(vX)[:30]    # class predictions

<span style="color:black">You can package all the results into a dataframe and add an `isAccurate` column that is `True` for the rows where the prediction matches the observed class.

In [None]:
dfvXY = vX.copy()
dfvXY['P[Y=0(male)|CS2F]'] = lr.predict_proba(vX)[:,0]
dfvXY['P[Y=1(female)|CS2F]'] = lr.predict_proba(vX)[:,1]
dfvXY['pY'] = pY = lr.predict(vX)
dfvXY['vY'] = vY
dfvXY['isAccurate'] = dfvXY.pY == dfvXY.vY
dfvXY

<span style="color:black">Using this dataframe, you can count the correct (`True`) and incorrect (`False`) predictions.

In [None]:
dfvXY['isAccurate'].value_counts() # counts of correct (True) and incorrect (False) predictions

<span style="color:black"> Compute the fraction of correctly classified cases.

In [None]:
dfvXY['isAccurate'].value_counts() / len(dfvXY)   # fraction of True values = accuracy score

<span style="color:black">The fraction for `True` is the same as the accuracy reported by the logistic regression model's `score()` method above.
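
The same calculation can be written in a few lines on made-up arrays. The values of `predicted` and `observed` below are hypothetical; the per-class breakdown is an optional extra that shows how well each class is recovered.

```python
import numpy as np

predicted = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # hypothetical predicted labels
observed = np.array([1, 0, 0, 1, 0, 1, 1, 0])    # hypothetical observed labels

is_accurate = predicted == observed   # True where the prediction matches
accuracy = is_accurate.mean()         # mean of booleans = fraction of True values

# Fraction of correct predictions among the observations of each class.
acc_female = is_accurate[observed == 1].mean()
acc_male = is_accurate[observed == 0].mean()

print(f"accuracy={accuracy:.3f}, female={acc_female:.3f}, male={acc_male:.3f}")
```

Six of the eight toy predictions match, so the overall accuracy is 0.750.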

# **Optional Practice, Part 1: Improving the Model**

In this section, you will improve the logistic regression name classifier model in a few ways.

As you work through these tasks, check your answers by running your code in the *#check solution here* cell, to see if you’ve gotten the correct result. If you get stuck on a task, click the See **solution** drop-down to view the answer.


## Task 1

Remove OOV names from the `LsM` and `LsF` lists and name the results `LsM2` and `LsF2`, respectively. These lists now have different lengths, so accuracy is biased towards the longer list. There are different metrics to deal with this, but here you will trim the longer list to the length of the shorter one. Shuffle the list before removing its elements; otherwise, trimming the ordered names might introduce some alphabetical bias into the model. You should end up with two lists of names, each with 2063 elements, giving you two perfectly balanced classes.

<b>Hint:</b> Look up [`random.shuffle()`](https://docs.python.org/3/library/random.html#random.shuffle) or [`np.random.choice()`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html) (make sure to not replace elements if sampling from a list).


In [None]:
# check solution here

<font color=#606366>
    <details><summary><font color=#b31b1b>▶ </font>See <b>solution</b>.</summary>
<pre>
LsM2, LsF2 = [s for s in LsM if s in wv], [s for s in LsF if s in wv]   # keep in-vocabulary names only
n = min(len(LsM2), len(LsF2))                                            # size of the smaller class
LsM2 = np.random.choice(LsM2, size=n, replace=False)                     # random sample without replacement
LsF2 = np.random.choice(LsF2, size=n, replace=False)
print(f'{len(LsM2)} male names:  ', LsM2[:20])
print(f'{len(LsF2)} female names:', LsF2[:20])
</pre>
</details> 
</font>
<hr>

## Task 2

Create a new dataframe `df2` similar to `df`, but only with the names left in the `LsM2` and `LsF2` lists. The feature `CS2F` is computed exactly as above. Fit a logistic regression model `lr2` (with its default hyperparameters and `random_state=0`) to a training sample derived from the lists in Task 1. Then test it on the held-out validation sample. What is the accuracy score of the new model?

<b>Hint:</b> This is very similar to the model fitting/validation steps above.

In [None]:
# check solution here

<font color=#606366>
    <details><summary><font color=#b31b1b>▶ </font>See <b>solution</b>.</summary>
<pre>
df2 = pd.DataFrame(np.r_[LsF2, LsM2], columns=['Name'])
df2['Y'] = [1] * len(LsF2) + [0] * len(LsM2)   # create numeric labels for names
df2['CS2F'] = df2.Name.apply(CosSim2Female)   # cosine similarity to female query word
df2 = df2.sort_values('CS2F', ascending=False).set_index('Name')
df2.T.round(2)
tX2, vX2, tY2, vY2 = train_test_split(df2.drop('Y', axis=1), df2.Y, test_size=0.2, random_state=0)
lr2 = LogisticRegression(random_state=0)
print(f'Model accuracy = fraction of correct predictions: {lr2.fit(tX2, tY2).score(vX2, vY2):.3f}')
</pre>
</details> 
</font>
<hr>

## Task 3

Add two more features to the logistic regression model, now named `lr3`. These will be two new columns in your dataframe. One feature is the cosine similarity of each name to the word `'man'` and the other is the cosine similarity of each name to `'woman'`. Alternatively, you can create and add your own features. It might help to compute cosine similarities to strongly gender-identifying words. What is the out-of-sample (i.e., validation) accuracy for your model? Has it improved?

<b>Hint:</b> If you want to continue using the <code>apply()</code> method of a dataframe, then you can define two more functions like <code>CosSim2Female()</code>, which are called on each name. Alternatively, you can create a <a href="https://docs.python.org/3/howto/functional.html">functional</a>, i.e. a function that returns a function. There are other ways to compute cosine similarities for each name.


In [None]:
# check solution here

<font color=#606366>
    <details><summary><font color=#b31b1b>▶ </font>See <b>solution</b>.</summary>
<pre>
# cosine similarity to query function, i.e. it returns a function which can be further evaluated
CS2Q = lambda sQuery='man': lambda sName: CosSim(wv[sName], wv[sQuery]) if sName in wv else 0 
CS2Q('man')('david'), CS2Q('man')('kathy')         # cosine similarities to "man"
CS2Q('woman')('david'), CS2Q('woman')('jennifer')  # cosine similarities to "woman"

df3 = df2.copy()
df3['CS2man'] = df3.reset_index().Name.apply(CS2Q('man')).values   # cosine similarity to the query word "man"
df3['CS2woman'] = df3.reset_index().Name.apply(CS2Q('woman')).values   # cosine similarity to the query word "woman"
df3.T.round(2)
tX3, vX3, tY3, vY3 = train_test_split(df3.drop('Y', axis=1), df3.Y, test_size=0.2, random_state=0)
lr3 = LogisticRegression()
print(f'Model accuracy = fraction of correct predictions: {lr3.fit(tX3, tY3).score(vX3, vY3):.3f}')</pre>

The model's validation accuracy increased to 76%!
</details> 
</font>
<hr>

<hr style="border-top: 2px solid #606366; background: transparent;">

# **Practice, Part 2: Making Inferences From the Model**

Now, equipped with concepts and tools, let's try to tackle a few related tasks.

In this exercise, you will practice using the trained logistic regression model `lr` to make inferences for the following names:

        LsNames = 'Astrid,Maja,Alice,Olivia,Vera,Ella,Wilma,Alma,Lilly,Ebba'
        
As you work through these tasks, check your answers by running your code in the *#check solution here* cell, to see if you’ve gotten the correct result. If you get stuck on a task, click the See **solution** drop-down to view the answer.

## Task 4

For each name in `LsNames`, determine the probability that it is a female name. Construct the same cosine similarity feature as above.

<b>Hint:</b> Don't forget that the word2vec model has a lowercase vocabulary, but the given names are title-cased. Also, try <code>CS2Q</code> to compute the cosine similarities to the word 'feminine'.

In [None]:
# check solution here

<font color=#606366>
    <details><summary><font color=carnelian>▶ </font>See <b>solution</b>.</summary>
<pre>
LsNames = 'Astrid,Maja,Alice,Olivia,Vera,Ella,Wilma,Alma,Lilly,Ebba'
LsF3 = LsNames.lower().split(',')
vX3 = pd.DataFrame([CS2Q('feminine')(sName) for sName in LsF3], index=LsF3, columns=['CS2F'])   # same feature name as in training
pY3 = lr.predict_proba(vX3)[:,1]
pY3
</pre>
</details> 
</font>
<hr>

## Task 5
 
Now use these probabilities to identify labels (`'male'` or `'female'`) for the given names.

<b>Hint:</b> You need to map all probabilities of a female name to the label `'female'` and the remaining probabilities to the label `'male'`.

In [None]:
# check solution here

<font color=#606366>
    <details><summary><font color=carnelian>▶ </font>See <b>solution</b>.</summary>
<pre>
LsLabels = np.array(['male','female'])[(pY3>0.5)*1]
df = pd.DataFrame(np.c_[LsLabels], index=LsF3)
highlight_cells = lambda x: 'background-color: ' + x.map({'female': 'pink', 'male': 'lightblue'})
df.T.style.apply(highlight_cells)
</pre>
</details> 
</font>
<hr>

## Task 6

Compute the accuracy score for the given list. How well did the model do?

<b>Hint:</b> This is just a fraction of correct predictions. We already know that all names are female names.

In [None]:
# check solution here

<font color=#606366>
    <details><summary><font color=carnelian>▶ </font>See <b>solution</b>.</summary>
<pre>
print(f'Accuracy is {sum(pY3>0.5)/len(pY3)}')
</pre>
</details> 
</font>
<hr>