# **Part Three of the Course Project**
In this part of the course project, you will train and evaluate your models using the Area Under the ROC Curve metric.
<hr style="border-top: 2px solid #606366; background: transparent;">


# **Setup**
 
Reset the Python environment to clear it of any previously loaded variables, functions, or libraries. Then, import the libraries needed to complete this part of the course project. 

In [None]:
%reset -f
from IPython.core.interactiveshell import InteractiveShell as IS
IS.ast_node_interactivity = "all"    # allows multiple outputs from a cell
import pandas as pd, numpy as np, seaborn as sns, matplotlib.pyplot as plt, nltk
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from collections import Counter
from numpy.testing import assert_equal as eq, assert_almost_equal as aeq
from gensim.models import KeyedVectors
from sklearn.metrics import roc_curve, roc_auc_score
import unittest
from colorunittest import run_unittest
CosSim = lambda x, y: x @ y / (x @ x)**0.5 / (y @ y)**0.5  # our own implementation of cosine similarity
pd.set_option('max_colwidth', 100, 'display.max_rows', 4)

_ = nltk.download(['names'], quiet=True)
LsM = nltk.corpus.names.words('male.txt')   # list of strings: male names
LsF = nltk.corpus.names.words('female.txt') # list of strings: female names
LsF = [n for n in LsF if n not in LsM]      # remove 365 female names that match male names
print(f'{len(LsM)} male names:  ', LsM[:8])
print(f'{len(LsF)} female names:', LsF[:8])

# Overview

The classes are somewhat imbalanced: after the overlapping names are removed, about 61% of the names are female and about 39% are male. A naive model that labels every name as female would therefore be correct about 61% of the time without learning anything. Consequently, accuracy alone is not enough, and you should use a confusion matrix and related metrics to assess the quality of your model more informatively.
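The effect of imbalance on accuracy can be checked in a few lines. The sketch below is a minimal illustration on hypothetical labels (a 60/40 split chosen only for convenience, not the course data): a "model" that always predicts the majority class reaches an accuracy equal to the majority share.

```python
import numpy as np

# Hypothetical labels: 1 = female, 0 = male (a 60/40 split chosen only for illustration)
y_true = np.array([1] * 6 + [0] * 4)

# A "model" that ignores its input and always predicts the majority class
y_pred = np.ones_like(y_true)

accuracy = (y_pred == y_true).mean()   # share of correct predictions
majority_share = y_true.mean()         # share of class 1 in the data

print(f'accuracy = {accuracy:.2f}, majority share = {majority_share:.2f}')
```

Any real model should be judged against this baseline rather than against 50%.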
 
In this project, you will engineer features from 50 word2vec coefficients. Then you will build a logistic regression model and evaluate its quality with metrics derived from a confusion matrix. One caveat is that not all names (even when lowercased) are in the word2vec model. So, you will build a function that retrieves character-level vectors and aggregates them into a word-level vector, in the hope that the vowel and consonant information carried by a word's characters is still useful to the model. This way, you do not need to drop words that are missing from the word2vec vocabulary.


Do not sort or reorder observations unless instructed to do so. The tests assume that the original order of observations is preserved.

In [None]:
rng = np.random.RandomState(0)  # seed random number generator with a number 0 (for reproducibility)
# LsF = sorted(list(rng.choice(LsF, size=len(LsM), replace=False)))       # shorten the list of female names
df = pd.DataFrame(dict(Name=LsF + LsM, Y=[1]*len(LsF) + [0]*len(LsM)) ) # assign labels: 1=female, 0=male
df.Name = df.Name.str.lower()   # convert all names to lower case
df.set_index('Name', inplace=True)
df.T                # display names (as column names) and their labels (0=male, 1=female)

Next, load the word2vec model, which maps a vocabulary of 400K words to 50-dimensional vectors.

In [None]:
%time wv = KeyedVectors.load_word2vec_format('glove-wiki-gigaword-50.gz')  # ~20 seconds to load this model
wv['abagail']    # retrieve a word vector for the name abagail

# Task 1. A Word Vector From Characters
 
Create a `CharVec()` function that takes a word (as a string of characters) and generates a word vector, which is the centroid of all character-level vectors found in the model `wv`.
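To make the idea of a centroid concrete, here is a minimal sketch on a toy example. The dictionary `toy_vectors`, the 3-dimensional vectors, and the helper `toy_centroid` are all hypothetical stand-ins; the real task works with the 50-dimensional vectors in `wv`.

```python
import numpy as np

# Hypothetical 3-dim character vectors standing in for a real embedding model
toy_vectors = {
    'a': np.array([1.0, 0.0, 2.0]),
    'b': np.array([3.0, 2.0, 0.0]),
}

def toy_centroid(word, vectors, dim=3):
    """Average the vectors of the characters of `word` that exist in `vectors`."""
    found = [vectors[ch] for ch in word if ch in vectors]
    if not found:                   # no known characters: fall back to zeros
        return np.zeros(dim)
    return np.mean(found, axis=0)   # element-wise mean across characters

print(toy_centroid('ab', toy_vectors))    # mean of the 'a' and 'b' vectors
print(toy_centroid('a-b', toy_vectors))   # '-' is unknown and is skipped
print(toy_centroid('XYZ', toy_vectors))   # nothing found, so a zero vector
```

Note that a repeated character contributes once per occurrence, so it pulls the centroid toward its own vector.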

In [None]:
def CharVec(w:'string'='abagael', wv:'word2vec model'=wv) -> np.zeros(50):
    '''Takes a word w and a word2vec model wv.
    For each character of w, retrieves its vector if the character is found in wv.
    All 50-dim character vectors are then averaged with np.mean()
    to produce a 50-dim word vector. Characters not found in wv are ignored.
    Returns: 50-dim word vector. If no characters are found in wv, a 50-dim zero vector is returned.'''
    # YOUR CODE HERE
    raise NotImplementedError()
    return word_vector  # numpy array of length 50

In [None]:
# TEST & AUTOGRADE CELL
@run_unittest
class test_task_1(unittest.TestCase):
    def test_00(self): aeq(CharVec('a')[:5], [.217, .465, -0.468, .101, 1.013], decimal=3)
    def test_01(self): aeq(CharVec('a').sum(), 1.6782844, decimal=3)
    def test_02(self): eq(CharVec('A'), np.zeros(50))
    def test_03(self): aeq(CharVec('abc')[:5], [-.191, .426, .049, .636, .89], decimal=3)
    def test_04(self): aeq(CharVec('abc').sum(), 8.53262, decimal=3)
    def test_05(self): aeq(CharVec('abagail').sum(), 4.7599854, decimal=3)
    def test_06(self): aeq(CharVec('Abbe').sum(), 8.714625, decimal=3)
    def test_07(self): aeq(CharVec('marco-pollo').sum(), 4.661618, decimal=3)

# Task 2. A Word Vector For Any Word

Next, create a `W2V()` function that returns the word vector for a word if it is found in `wv`. Otherwise, it calls the `CharVec()` function to build a new vector.
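The lookup-with-fallback pattern described here can be sketched with a plain dictionary. The names `toy_vocab`, `toy_fallback`, and `toy_lookup` are hypothetical and exist only for this illustration; the model `wv` loaded earlier supports the same `in` test and `[]` lookup.

```python
import numpy as np

# Hypothetical word-level vectors (2-dim for brevity)
toy_vocab = {'abc': np.array([0.5, 0.5])}

def toy_fallback(word):
    """Hypothetical stand-in for a character-based fallback; returns zeros here."""
    return np.zeros(2)

def toy_lookup(word, vocab, fallback):
    """Return the stored vector for a known word, else build one with `fallback`."""
    return vocab[word] if word in vocab else fallback(word)

print(toy_lookup('abc', toy_vocab, toy_fallback))   # known word: stored vector
print(toy_lookup('zzz', toy_vocab, toy_fallback))   # unknown word: fallback vector
```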

In [None]:
def W2V(w:'string'='abagail', wv:'word2vec model'=wv) -> np.zeros(50):
    '''W2V takes a word w and a word2vec model wv. 
    If w is in the model wv, its vector is returned. 
    Otherwise, we call CharVec on w to build a new vector from its characters.
    Returns: 50-dim word vector'''
    # YOUR CODE HERE
    raise NotImplementedError()
    return word_vector    # numpy array of length 50

In [None]:
# TEST & AUTOGRADE CELL
@run_unittest
class test_task_2(unittest.TestCase):
    def test_00(self): aeq(W2V('a')[:5], [.217, .465, -0.468, .101, 1.013], decimal=3)
    def test_01(self): aeq(W2V('a').sum(), 1.6782844, decimal=3)
    def test_02(self): eq(W2V('A'), np.zeros(50))
    def test_03(self): aeq(W2V('abc')[:5], [.12305, .1083, .40415, 1.0219, .085337], decimal=3)
    def test_04(self): aeq(W2V('abc').sum(), 2.0375853, decimal=3)
    def test_05(self): aeq(W2V('abagail').sum(), 2.756581, decimal=3)
    def test_06(self): aeq(W2V('Abbe').sum(), 8.714625, decimal=3)
    def test_07(self): aeq(W2V('marco-pollo').sum(), 4.661618, decimal=3)

# Practice With Pandas: DataFrame() and concat()

If you are already familiar with these Pandas functions, you can skip to Task 3.

The Pandas `DataFrame()` function allows you to convert properly structured data in other formats into a Pandas DataFrame. For example, create a simple list of lists:

In [None]:
prac_lists = [[1, 2, 3],[2, 3, 4],[3, 4, 5],[4, 5, 6]]
prac_lists

This data set is structured such that it is equivalent to a table with four rows and three columns. Pandas can easily convert this into a DataFrame as follows:

In [None]:
prac_df_nums = pd.DataFrame(prac_lists)
prac_df_nums

Note that the `DataFrame()` function automatically generated a new index (numbered 0 through 3) and column labels (numbered 0 through 2). 

Next, subset the `df` DataFrame so that it has the same number of rows as `prac_df_nums`:

In [None]:
prac_df_names = df[:4]
prac_df_names

Note that this DataFrame uses the **Name** column as its index (i.e., there are no numbers to the left of the **Name** column) and the other column has been given a non-default label as well.

In [None]:
prac_df_names.index

You can replace the default index during DataFrame creation by specifying the values you want to use. These values do not need to be part of the data being converted to a DataFrame. For example:

In [None]:
prac_df_nums = pd.DataFrame(prac_lists, index = prac_df_names.index)
prac_df_nums

You can combine the data in these DataFrames using the Pandas `concat()` function. Please note that the order of the DataFrames listed in the function matters:

In [None]:
prac_df_combined = pd.concat([prac_df_names,prac_df_nums])
prac_df_combined

Note that this did combine the two DataFrames, but not exactly as desired. This is because the default behavior of the `concat()` function is to stack DataFrames vertically (along axis 0), one on top of the other. Pandas adds both rows and columns to make this work and fills the cells that have no data with `NaN`. If you want to stack horizontally, you need to specify `axis = 1` as part of your code:

In [None]:
prac_df_combined = pd.concat([prac_df_names,prac_df_nums], axis = 1)
prac_df_combined

In [None]:
prac_df_combined.shape

Feel free to experiment with these small sample DataFrames and familiarize yourself with these functions before attempting the next task.

# Task 3. Add Word Vector Features

As a first step, copy the `df` DataFrame to another DataFrame named `df1` to preserve the original. Then convert each name in `df1.index` to its word vector using the `W2V()` function and collect these vectors in their own DataFrame. Finally, combine this DataFrame of coefficients with `df1`, keeping the name `df1` and preserving **Name** as the index. The resulting DataFrame should contain the original names and labels plus 50 numeric values saved in columns named 0 to 49. Here is a small example of the resulting `df1`:

|Name|Y|0|1|2|3|
|-|-|-|-|-|-|
|abagael|1|.074134|.662556|.229403|.64297|
|abagail|1|-.537140|.313840|-.677850|-.53706|

**Note: The line of code at the bottom of the task block should print output that matches what you see above. In addition, this and all other remaining tasks in this exercise can be written as a script cell rather than a function.**
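The same build-then-concatenate pattern is sketched below on toy data. Everything here is an assumption made for illustration: `toy_df`, the two names, and the hypothetical featurizer `toy_vec()` (which returns 3 made-up numbers per name instead of a 50-dimensional word vector).

```python
import numpy as np
import pandas as pd

# Hypothetical labels indexed by name, mirroring the layout of `df`
toy_df = pd.DataFrame({'Y': [1, 0]}, index=pd.Index(['ann', 'bob'], name='Name'))

def toy_vec(name):
    """Hypothetical featurizer: 3 made-up numbers per name (not a real embedding)."""
    return np.array([len(name), name.count('a'), name.count('b')], dtype=float)

# One row of features per name, sharing the index of toy_df
toy_features = pd.DataFrame([toy_vec(n) for n in toy_df.index], index=toy_df.index)

# Horizontal concat aligns rows on the shared index; feature columns are labeled 0, 1, 2
toy_df1 = pd.concat([toy_df, toy_features], axis=1)
print(toy_df1)
```

Passing the existing index to `pd.DataFrame()` is what keeps the rows aligned; without it, the features would get a default 0, 1, ... index and `concat()` would not match them to the names.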

In [None]:
# YOUR CODE HERE
raise NotImplementedError()
df1.iloc[:2,:5]

In [None]:
# TEST & AUTOGRADE CELL
@run_unittest
class test_task_3(unittest.TestCase):
    def test_00(self): aeq(df1.T['abagael'][:5], [1,.074,.663,.229,.643], decimal=3)
    def test_01(self): aeq(df1.T['abagael'].sum(), 5.760784632526338, decimal=3)
    def test_02(self): aeq(df1.T['abagail'].sum(), 3.7565809320658445, decimal=3)
    def test_03(self): aeq(df1[0].sum(), -706.9891949769153, decimal=3)
    def test_04(self): aeq(df1.query('Y==1').sum().sum(), 11999.755633788649, decimal=3)
    def test_05(self): aeq(df1.sum().sum(), 10487.074652241812, decimal=3)

# Task 4. Train and Validate Logistic Regression Classifier
 
Use [`train_test_split()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to split the Pandas DataFrame `df1` (without the `Y` column) and the Pandas Series `Y` (a column of the `df1` DataFrame) into the following objects:
 
1. `tX` = a Pandas DataFrame with columns 0-49, the training input features
1. `vX` = a Pandas DataFrame with columns 0-49, the validation input features
1. `tY` = a Pandas Series, the column `Y`, containing training labels for the corresponding rows in `tX`
1. `vY` = a Pandas Series, the column `Y`, containing validation labels for the corresponding rows in `vX`
 
Then proceed with model fitting and evaluation:
 
1. Create a [`LogisticRegression()`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) object, `lr`. 
1. Fit it to `tX,tY` using the [`lr.fit()`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.fit) method. 
1. Compute the model accuracy with the [`lr.score()`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.score) method, passing the appropriate features and labels for each set.
 
If done correctly, your model should score about 84% validation accuracy and 84% train accuracy.
 
To ensure reproducibility of your model results, leave all function arguments at their default values, except:
 
1. Set `random_state` to 0 for both `train_test_split()` and `LogisticRegression()`.
1. Use a `test_size` of 0.2 for the split. That is, 20% of the observations are allocated to the validation sets, `vX,vY`, and 80% are allocated to the train sets, `tX,tY`.
 
Hint: See previous course videos and Jupyter notebooks for examples.
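For reference, the split-fit-score workflow looks roughly like the sketch below. It runs on synthetic data generated only for this illustration (200 rows, 5 features), and all variable names (`toy_rng`, `X_tr`, `clf`, and so on) are hypothetical, so the accuracies it prints say nothing about the names data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): 200 rows, 5 features,
# with the label driven mostly by the first feature
toy_rng = np.random.RandomState(0)
X = toy_rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * toy_rng.normal(size=200) > 0).astype(int)

# 80/20 split with a fixed seed so the same rows land in each set on every run
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(random_state=0).fit(X_tr, y_tr)

# score() returns the mean accuracy on the data it is given
print(f'train accuracy:      {clf.score(X_tr, y_tr):.3f}')
print(f'validation accuracy: {clf.score(X_va, y_va):.3f}')
```

Comparing the two accuracies is a quick check for overfitting: a train score far above the validation score suggests the model has memorized the training rows.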

In [None]:
# YOUR CODE HERE
raise NotImplementedError()
pd.DataFrame(lr.get_params(deep=True).items()).set_index(0).T  # print model hyperparameters as a dataframe

In [None]:
# TEST & AUTOGRADE CELL
@run_unittest
class test_task_4(unittest.TestCase):
    def test_00(self): eq(type(tX), pd.DataFrame)
    def test_01(self): eq(type(vX), pd.DataFrame)
    def test_02(self): eq(type(tY), pd.Series)
    def test_03(self): eq(type(vY), pd.Series)
    def test_04(self): eq(tX.shape, (6063, 50))
    def test_05(self): eq(vX.shape, (1516, 50))
    def test_06(self): eq(tY.shape, (6063, ))
    def test_07(self): eq(vY.shape, (1516, ))
    def test_08(self): eq(tX.index[:5].tolist(), ['rahul', 'elisabet', 'giuseppe', 'selie', 'jean-pierre']) # check the ordering of rows
    def test_09(self): aeq((lr.score(vX, vY), lr.score(tX, tY)), (0.837730870712401, 0.838693715982187), decimal=3)
    def test_10(self): eq(lr.get_params()['random_state'], 0)

# Task 5. Build Predictions and AUC
 
The accuracy computed above can be misleading because the two classes are not balanced (i.e., they do not have equal counts of observations). A more reliable metric in such cases is the Area Under the ROC Curve ([AUC](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html)), which you examined earlier in the video and Jupyter Notebook (JN). In the cell below, compute `pY`, a NumPy array of predicted probabilities of class=1 (i.e., the name is that of a female) for the validation observations, and `AUC`, the area under the ROC curve for these predictions.
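One way to build intuition for AUC is to read it as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes that quantity directly on four hypothetical observations; the labels and scores are made up for illustration, and on this input `roc_auc_score(toy_y, toy_p)` should return the same value.

```python
import numpy as np

# Hypothetical true labels and predicted probabilities of class 1
toy_y = np.array([0, 0, 1, 1])
toy_p = np.array([0.10, 0.40, 0.35, 0.80])

pos = toy_p[toy_y == 1]   # scores of the positive examples
neg = toy_p[toy_y == 0]   # scores of the negative examples

# Compare every positive with every negative; ties count as half a win
wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
toy_auc = wins / (len(pos) * len(neg))

print(toy_auc)   # 3 of the 4 positive-negative pairs are ranked correctly: 0.75
```

Because AUC depends only on how the scores rank the observations, it does not change with the class proportions or with the threshold used to turn probabilities into labels.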

In [None]:
# YOUR CODE HERE
raise NotImplementedError()
pY, AUC

In [None]:
# TEST & AUTOGRADE CELL
@run_unittest
class test_task_5(unittest.TestCase):
    def test_00(self): eq(type(pY), np.ndarray)
    def test_01(self): eq(pY.shape, (1516,))
    def test_02(self): eq(tX.index[:5].tolist(), ['rahul', 'elisabet', 'giuseppe', 'selie', 'jean-pierre']) # check the ordering of rows
    def test_03(self): aeq(pY[:5], [0.04161141, 0.53286882, 0.97318801, 0.95680585, 0.74057672], decimal=3)
    def test_04(self): aeq(pY.sum(), 937.4494763770035, decimal=3)
    def test_05(self): aeq(AUC, 0.8720775675213581, decimal=3)

The predicted probabilities (and the resulting predicted labels) can then be compared to the corresponding true labels to compute the confusion matrix and various aggregate measures of quality, which can help in model improvement.
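As a minimal sketch of that comparison, the code below thresholds hypothetical probabilities at 0.5 and counts the four outcomes of a binary confusion matrix. The data, the 0.5 threshold, and the variable names are assumptions for illustration; the matrix is laid out with true labels in rows and predicted labels in columns.

```python
import numpy as np

# Hypothetical true labels and predicted probabilities of class 1
toy_true = np.array([1, 0, 1, 1, 0, 0])
toy_prob = np.array([0.9, 0.2, 0.4, 0.7, 0.6, 0.1])

toy_pred = (toy_prob >= 0.5).astype(int)   # predicted labels at a 0.5 threshold

tp = int(((toy_pred == 1) & (toy_true == 1)).sum())   # true positives
fn = int(((toy_pred == 0) & (toy_true == 1)).sum())   # false negatives
fp = int(((toy_pred == 1) & (toy_true == 0)).sum())   # false positives
tn = int(((toy_pred == 0) & (toy_true == 0)).sum())   # true negatives

toy_confusion = np.array([[tn, fp],
                          [fn, tp]])   # rows: true 0/1, columns: predicted 0/1
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(toy_confusion)
print(f'precision = {precision:.3f}, recall = {recall:.3f}')
```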