# **Part Two of the Course Project**
 
In this part of the course project, you will train models for binary classification on a dataset of names and explore various features of the dataset to improve your model.
<hr style="border-top: 2px solid #606366; background: transparent;">


# **Setup**
 
Reset the Python environment to clear it of any previously loaded variables, functions, or libraries. Then, import the libraries needed to complete this part of the course project. 

In [None]:
%reset -f
from IPython.core.interactiveshell import InteractiveShell as IS
IS.ast_node_interactivity = "all"    # allows multiple outputs from a cell
import pandas as pd, numpy as np, seaborn as sns, matplotlib.pyplot as plt, nltk
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from collections import Counter
from numpy.testing import assert_equal as eq, assert_almost_equal as aeq
pd.set_option('max_colwidth', 100, 'display.max_rows', 10)
import unittest
from colorunittest import run_unittest

_ = nltk.download(['names'], quiet=True)
LsM = nltk.corpus.names.words('male.txt')   # list of strings: male names
LsF = nltk.corpus.names.words('female.txt') # list of strings: female names
print(f'{len(LsM)} male names:  ', LsM[:8])
print(f'{len(LsF)} female names:', LsF[:8])

In [None]:
# Balance the two classes so that a random draw has a 50% chance of coming from either class.
rng = np.random.RandomState(0)  # seed the random number generator with 0 (for reproducibility)
LsF = sorted(list(rng.choice(LsF, size=len(LsM), replace=False)))       # shorten the list of female names
df = pd.DataFrame(dict(Name=LsF + LsM, Y=[1]*len(LsF) + [0]*len(LsM)) ) # assign labels: 1=female, 0=male
df.Name = df.Name.str.lower()   # convert all names to lower case
df.set_index('Name', inplace=True)
df.T   # display names (as column names) and their labels (0=male, 1=female)

To predict the class of a name, we hypothesize numeric attributes of the name that may be indicative of gender. In the video and earlier exercises, we tried counting the letters in a name. Now, let's count the vowels in the hope that this helps identify the class of a given name. Since the two classes have equal counts in the `df` dataframe, a model that uses none of the name attributes would be 50% accurate. We are looking for attributes that help us outperform this naive 50% benchmark on the validation (test) sample, which the model has not seen during training and so cannot have memorized.
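
As a minimal sketch of the 50% benchmark described above, the snippet below uses made-up balanced labels (not the course data) and a hypothetical "model" that ignores the name entirely and always predicts class 1.

```python
import numpy as np

# Hypothetical balanced labels: 50 observations of each class (1=female, 0=male).
labels = np.array([1] * 50 + [0] * 50)

# A "model" that uses no name attributes: it always predicts class 1.
always_one = np.ones_like(labels)

# Accuracy is the share of predictions that match the labels.
baseline_accuracy = (always_one == labels).mean()
print(baseline_accuracy)  # 0.5
```

Any feature worth keeping should push validation accuracy above this value.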

# Task 1. Add numeric feature: a count of vowels

In the cell below, use the `copy()` method to copy `df` to a new dataframe `df1`, then add a column `Vowels` to `df1`. It should contain the count of vowel letters (`'aeiouy'`) in the corresponding name. Here is an example of the updated `df1` for the top two rows:
        
|Name|Y|Vowels|
|-|-|-|
|abagail|1|4|
|abbe|1|2|
 
FYI: `'y'` (like `'w'`) is sometimes classified as a consonant, but we'll treat it as a vowel, since it represents a vowel sound in names such as mary, daryl, and corey. You can later experiment to see which categorization of these letters improves your model.
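
One possible way to count vowels is sketched below on a tiny hand-made dataframe. The names `toy` and `toy1` and the row for `daryl` are illustrative assumptions, not part of the assignment, and other approaches (for example, pandas string methods) work equally well.

```python
import pandas as pd

# Hypothetical three-row stand-in for df, indexed by lower-case name.
toy = pd.DataFrame(
    {"Y": [1, 1, 0]},
    index=pd.Index(["abagail", "abbe", "daryl"], name="Name"),
)

# Work on a copy so that the original dataframe is left unchanged.
toy1 = toy.copy()

# Count the characters of each name that belong to the vowel set 'aeiouy'.
toy1["Vowels"] = [sum(ch in "aeiouy" for ch in name) for name in toy1.index]

print(toy1)  # Vowels column: abagail=4, abbe=2, daryl=2
```

Because the column is added to the copy, `toy` still has only the `Y` column afterwards.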

In [None]:
# YOUR CODE HERE
raise NotImplementedError()
df1.T

In [None]:
# TEST & AUTOGRADE CELL
@run_unittest

class test_task_1(unittest.TestCase):
    def test_00(self): 
        assert 'Vowels' in df1
    def test_01(self):    
        eq(df1.shape, (5886,2))  # df1 should contain only columns Y and Vowels
    def test_02(self):
        eq(df1.query("Name=='abagail'").iloc[0,:].values, (1,4))
    def test_03(self):
        eq(df1.query("Name=='abbe'").iloc[0,:].values, (1,2))
    def test_04(self):
        eq(df1.sum().values, [ 2943, 15296])


# Task 2. Train and validate logistic regression with Vowels feature
 
Use [`train_test_split()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to split the pandas dataframe `df1` (without the `Y` column) and the pandas series `Y` (a column of `df1`) into the following objects:
 
1. `tX` = a pandas dataframe with a column `Vowels`, a training input feature
1. `vX` = a pandas dataframe with a column `Vowels`, a validation input feature
1. `tY` = a pandas series, a column `Y`, containing training labels for the corresponding rows in `tX`
1. `vY` = a pandas series, a column `Y`, containing validation labels for the corresponding rows in `vX`
 
Then proceed with model fitting and evaluation:
 
1. create a [`LogisticRegression()`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) object, `lr`. 
1. fit it to `tX,tY` using the [`lr.fit()`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.fit) method. 
1. compute the model accuracy with the [`lr.score()`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.score) method, passing the appropriate inputs and labels.
 
If done correctly, your model should score about 61% validation accuracy and 63% training accuracy.
 
To ensure reproducibility of your model results, leave all function arguments at their default values, except:
 
1. set `random_state` to 0 for both functions.
1. use a `test_size` of 0.2 for the split. That is, 20% of the observations are allocated to the validation sets, `vX,vY`, and 80% to the training sets, `tX,tY`.
 
Hint: see previous course videos and Jupyter notebooks for examples.
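
The split, fit, and score steps above can be sketched on synthetic data as follows. The dataframe `toy`, its randomly generated labels, and the names `X_train`, `X_val`, `y_train`, `y_val` (which play the roles of `tX`, `vX`, `tY`, `vY`) are assumptions made for illustration, so the printed accuracies will not match the values expected for the course data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for df1: one numeric feature and a noisy binary label.
rng = np.random.RandomState(0)
toy = pd.DataFrame({"Vowels": rng.randint(1, 6, size=100)})
toy["Y"] = (toy["Vowels"] + rng.normal(0, 1.5, size=100) > 3).astype(int)

# Hold out 20% of the rows for validation; random_state makes the split repeatable.
X_train, X_val, y_train, y_val = train_test_split(
    toy.drop(columns="Y"), toy["Y"], test_size=0.2, random_state=0
)

# Fit on the training rows only, then score both samples.
model = LogisticRegression(random_state=0).fit(X_train, y_train)
train_acc = model.score(X_train, y_train)
val_acc = model.score(X_val, y_val)
print(f"training accuracy:   {train_acc:.3f}")
print(f"validation accuracy: {val_acc:.3f}")
```

Note that `train_test_split()` returns the two input splits first and the two label splits second, which is why the unpacking order matters.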

In [None]:
# YOUR CODE HERE
raise NotImplementedError()
pd.DataFrame(lr.get_params(deep=True).items()).set_index(0).T  # print model hyperparameters as a dataframe

In [None]:
# TEST & AUTOGRADE CELL
@run_unittest

class test_task_2(unittest.TestCase):
    def test_00(self): 
        eq(tX.shape, (4708, 1))
    def test_01(self):    
        eq(vX.shape, (1178, 1))
    def test_02(self):
        eq(tY.shape, (4708,))
    def test_03(self):
        eq(vY.shape, (1178,))
    def test_04(self):
        eq(tX.sum().values, 12249)
    def test_05(self):
        eq(vX.sum().values, 3047)
    def test_06(self):
        eq(tY.sum(), 2360)
    def test_07(self):
        eq(vY.sum(), 583)
    def test_08(self):
        eq(lr.get_params()['random_state'], 0)
    def test_09(self):
        aeq(lr.score(vX, vY), 0.6061120543293718, 2)  # validation accuracy; compare to 2 decimal places
    def test_10(self):
        aeq(lr.score(tX, tY), 0.6274426508071368, 2)  # training accuracy


# Task 3. Add numeric feature: a count of consonants

If the vowel count was helpful, the model might also benefit from the count of consonants in each name. This is just a hypothesis with no guarantees, but we can add this numeric feature, then train and validate the model to evaluate whether the feature is useful. Copy `df1` to a new dataframe `df2` and add a `Consonants` column to `df2` (which will then contain two features) with the count of the letters `'bcdfghjklmnpqrstvwxz'` in each name.

Here is an example of the top two rows of `df2`:

|Name|Y|Vowels|Consonants|
|-|-|-|-|
|abagail|1|4|3|
|abbe|1|2|2|
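
A minimal sketch of the consonant count on a hand-made dataframe is shown below. The names `toy` and `toy2` and the row for `daryl` are illustrative assumptions. It uses the pandas string method `str.count()` with a regular-expression character class, which is only one of several ways to do this.

```python
import pandas as pd

# Hypothetical stand-in for df1, already holding the Vowels feature.
toy = pd.DataFrame(
    {"Y": [1, 1, 0], "Vowels": [4, 2, 2]},
    index=pd.Index(["abagail", "abbe", "daryl"], name="Name"),
)

toy2 = toy.copy()

# Count the characters of each name that match the consonant character class.
toy2["Consonants"] = toy2.index.str.count("[bcdfghjklmnpqrstvwxz]").to_numpy()

# Sanity check for names made up only of the letters a-z:
# vowels plus consonants should add up to the name length.
name_lengths = toy2.index.str.len().to_numpy()
all_letters_counted = (toy2["Vowels"] + toy2["Consonants"] == name_lengths).all()

print(toy2)  # Consonants column: abagail=3, abbe=2, daryl=3
print(all_letters_counted)  # True
```

The length check is a quick way to confirm that the vowel and consonant sets together cover every letter, since `'y'` appears in the vowel set and not in the consonant set.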

In [None]:
# YOUR CODE HERE
raise NotImplementedError()
df2.T

In [None]:
# TEST & AUTOGRADE CELL
@run_unittest

class test_task_3(unittest.TestCase):
    def test_00(self):    
        assert 'Vowels' in df2
    def test_01(self):
        assert 'Consonants' in df2
    def test_02(self):
        eq(df2.shape, (5886, 3))    # df2 should contain only columns Y, Vowels, and Consonants
    def test_03(self):
        eq(df2.query("Name=='abagail'").iloc[0,:].values, (1,4,3))
    def test_04(self):
        eq(df2.query("Name=='abbe'").iloc[0,:].values, (1,2,2))
    def test_05(self):
        eq(df2.sum().values, [ 2943, 15296, 20066])
        

# Task 4. Train and validate logistic regression with Consonants feature
 
As before, split the new dataframe into `tX, vX, tY, vY`, then create a `LogisticRegression()` object, `lr`, fit it on the training inputs/outputs, and validate it on the validation inputs/outputs. Use the same function arguments as in Task 2.
 
If done correctly, adding the consonant counts should improve the model slightly, to a validation accuracy of about 62% and a training accuracy of about 63%. Recall that we only care about validation accuracy, i.e., the model's performance on observations it has not used in training.
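
To judge whether an added feature helps, it is convenient to score the one-feature and two-feature models on the same split. The sketch below does this on synthetic data. The helper `fit_and_score()`, the dataframe `toy`, and its generated labels are hypothetical, so the printed numbers say nothing about the names dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def fit_and_score(X, y):
    """Split 80/20, fit logistic regression, return (training, validation) accuracy."""
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=0
    )
    model = LogisticRegression(random_state=0).fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_val, y_val)


# Synthetic data: the label depends on both features plus noise.
rng = np.random.RandomState(0)
toy = pd.DataFrame({
    "Vowels": rng.randint(1, 6, size=200),
    "Consonants": rng.randint(1, 8, size=200),
})
signal = toy["Vowels"] - 0.5 * toy["Consonants"] + rng.normal(0, 1, size=200)
toy["Y"] = (signal > 1).astype(int)

# The same random_state and row count give the same split for both feature sets.
results = {}
for cols in (["Vowels"], ["Vowels", "Consonants"]):
    results[tuple(cols)] = fit_and_score(toy[cols], toy["Y"])
    train_acc, val_acc = results[tuple(cols)]
    print(f"{cols}: training={train_acc:.3f}, validation={val_acc:.3f}")
```

Comparing the validation column across the two rows shows the incremental value of the second feature on rows the model has not been trained on.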

In [None]:
# YOUR CODE HERE
raise NotImplementedError()
lr.score(vX, vY)

In [None]:
# TEST & AUTOGRADE CELL
@run_unittest

class test_task_4(unittest.TestCase):
    def test_00(self):    
        eq(tX.shape, (4708, 2))
    def test_01(self):
        eq(vX.shape, (1178, 2))
    def test_02(self):
        eq(tY.shape, (4708,))
    def test_03(self):
        eq(vY.shape, (1178,))
    def test_04(self):
        eq(tX.sum().values, [12249, 15970])
    def test_05(self):
        eq(vX.sum().values, [3047, 4096])
    def test_06(self):
        eq(tY.sum(), 2360)
    def test_07(self):
        eq(vY.sum(), 583)
    def test_08(self):
        eq(lr.get_params()['random_state'], 0)
    def test_09(self):
        aeq(lr.score(vX, vY), 0.6230899830220713, 2)  # validation accuracy; compare to 2 decimal places
    def test_10(self):
        aeq(lr.score(tX, tY), 0.6285046728971962, 2)  # training accuracy


# Practice
 
As additional practice, try adding more features to continue improving the model's predictive performance. For example, you can find the most frequent first letter among male names and among female names. Suppose it turns out that male names in our sample `LsM` often start with the letter `'m'` and female names in the sample `LsF` often start with the letter `'f'`. We could add two separate features, such as `IsM` and `IsF`, each equal to 1 if the name starts with the given letter and 0 otherwise. Alternatively, we could add just one feature with a value of -1 for the letter `'m'`, 1 for the letter `'f'`, and 0 for no match.
 
You might be surprised to find that this feature does not improve the validation accuracy of our model. However, if you focus on the last character of the male and female names, the model improves considerably.
 
While each of these hypotheses makes sense, the model improves only by the **incremental** value of each additional feature. So, while the first-character indicator may be valuable on its own, it does not appear to improve the model built on `df2` with the two features you added earlier.
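
The two encodings of a first-letter indicator described above can be sketched as follows. The four names and the letters `'m'` and `'f'` are the hypothetical example from the text, not results computed from `LsM` and `LsF`.

```python
import pandas as pd

# Hypothetical lower-case names used only to illustrate the encodings.
names = pd.Index(["mark", "fiona", "anna", "mike"], name="Name")
toy = pd.DataFrame(index=names)

first = names.str[0]  # first character of each name

# Option 1: two separate 0/1 indicator features.
toy["IsM"] = (first == "m").astype(int)
toy["IsF"] = (first == "f").astype(int)

# Option 2: a single feature with -1 for 'm', 1 for 'f', and 0 for no match.
toy["FirstChar"] = toy["IsF"] - toy["IsM"]

print(toy)  # FirstChar column: mark=-1, fiona=1, anna=0, mike=-1
```

With two indicators, the model can learn a separate weight for each letter. The single combined feature forces the two letters to have effects of equal size and opposite sign.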

In [None]:
# your solution here

<font color=#606366>
    <details><summary><font color=crimson>▶ </font>See <b>solution</b>.</summary>
        <b>Hint:</b>
        <details><summary><font color=crimson>ᐅ </font>One more click for a <b>solution</b>.</summary>
            <pre>
df3 = df2.copy()
sECF = Counter([s[0] for s in LsF]).most_common(1)[0][0]  # most common first letter among female names
sECM = Counter([s[0] for s in LsM]).most_common(1)[0][0]  # most common first letter among male names
sECF, sECM  # display the most common first letters (female, male)
df3['FirstChar'] = df3.reset_index().Name.apply(lambda s: 1 if s[0]=='m' else -1 if s[0]=='s' else 0).values  # 1 for 'm', -1 for 's', 0 otherwise
df3 = df2.copy()  # start again from df2 (this discards FirstChar)
Counter([s[-1] for s in LsF]).most_common(1)  # most common last letter among female names
Counter([s[-1] for s in LsM]).most_common(1)  # most common last letter among male names
df3['LastChar'] = df3.reset_index().Name.apply(lambda s: 1 if s[-1]=='n' else -1 if s[-1]=='a' else 0).values  # 1 for 'n', -1 for 'a', 0 otherwise
            </pre>
        </details>
    </details> 
</font>
 <hr>