# Introduction

Ah, baby names. When the time comes, we want to give our children the best name we can. On occasion, we can guess a person’s ethnicity simply from their name. This can also lead to discrimination, especially when applying for jobs.

Today, resumes submitted online go through a system that determines how good a candidate you are. __Now, how accurately can a computer determine someone’s ethnicity?__

Note that this dataset limits our scope to first names in New York City.

## Retrieving and reorganizing our data

The first step in this analysis is to retrieve the data and reorganize it using pandas.

In [1]:
import requests
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
         
theData = pd.read_csv(
            "Most_Popular_Baby_Names_by_Sex_and_Mother_s_Ethnic_Group__New_York_City.csv"
            )
len(theData)

13962

In [2]:
# A glimpse of the table
theData.head()

Unnamed: 0,BRTH_YR,GNDR,ETHCTY,NM,CNT,RNK
0,2011,FEMALE,HISPANIC,GERALDINE,13,75
1,2011,FEMALE,HISPANIC,GIA,21,67
2,2011,FEMALE,HISPANIC,GIANNA,49,42
3,2011,FEMALE,HISPANIC,GISELLE,38,51
4,2011,FEMALE,HISPANIC,GRACE,36,53


To perform this analysis, we will use two variables to train our model: 

- The gender 
- The name 

The output will be the ethnicity.

## Wrangling the Data

Upon inspecting the data, we notice a few things we need to change before we begin training our model.

First, we have a count column (`CNT`) that records how many times each row occurred. We should unroll the table so that every occurrence of a name gets its own row.

In [3]:
newData = {column:[] for column in list(theData.columns)[:4]}

for index, row in theData.iterrows():
    for count in range(row['CNT']):
        newData["BRTH_YR"].append(row["BRTH_YR"])
        newData["GNDR"].append(row["GNDR"])
        newData["ETHCTY"].append(row["ETHCTY"])
        newData["NM"].append(row["NM"])
theData = pd.DataFrame(newData)
theData.head()

Unnamed: 0,BRTH_YR,ETHCTY,GNDR,NM
0,2011,HISPANIC,FEMALE,GERALDINE
1,2011,HISPANIC,FEMALE,GERALDINE
2,2011,HISPANIC,FEMALE,GERALDINE
3,2011,HISPANIC,FEMALE,GERALDINE
4,2011,HISPANIC,FEMALE,GERALDINE
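The row-by-row loop above can be slow on a table that expands to several hundred thousand rows. As a rough sketch of an alternative, pandas can repeat row labels directly with `Index.repeat`. The `toy` frame below is a made-up stand-in (its names and counts are illustrative, not taken from the dataset), and this is only one possible way to do the unrolling.

```python
import pandas as pd

# Toy stand-in for the original table (values are illustrative only)
toy = pd.DataFrame({
    "BRTH_YR": [2011, 2011],
    "GNDR": ["FEMALE", "MALE"],
    "ETHCTY": ["HISPANIC", "HISPANIC"],
    "NM": ["GIA", "LIAM"],
    "CNT": [3, 2],
})

# Repeat each row label CNT times, then select those rows without the CNT column
unrolled = (
    toy.loc[toy.index.repeat(toy["CNT"]), ["BRTH_YR", "GNDR", "ETHCTY", "NM"]]
       .reset_index(drop=True)
)
print(len(unrolled))  # 5 rows: 3 for GIA, 2 for LIAM
```

The result has the same shape as the loop's output: one row per occurrence of a name.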


Next, there appears to be some redundancy among the ethnicity labels.

In [4]:
ethnicCount = theData['ETHCTY'].value_counts()
ethnicCount

HISPANIC                      167237
WHITE NON HISPANIC            154067
BLACK NON HISPANIC             62265
ASIAN AND PACIFIC ISLANDER     51379
WHITE NON HISP                 26675
ASIAN AND PACI                 10300
BLACK NON HISP                 10208
Name: ETHCTY, dtype: int64

It appears that the last three labels are truncated versions of existing categories. We’ll need to merge them.

In [5]:
theData.loc[theData["ETHCTY"] == "WHITE NON HISP", "ETHCTY"] = "WHITE NON HISPANIC"
theData.loc[theData["ETHCTY"] == "ASIAN AND PACI", "ETHCTY"] = "ASIAN AND PACIFIC ISLANDER"
theData.loc[theData["ETHCTY"] == "BLACK NON HISP", "ETHCTY"] = "BLACK NON HISPANIC"

ethnicCount = theData['ETHCTY'].value_counts()
ethnicCount

WHITE NON HISPANIC            180742
HISPANIC                      167237
BLACK NON HISPANIC             72473
ASIAN AND PACIFIC ISLANDER     61679
Name: ETHCTY, dtype: int64
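The three assignments above could also be written as a single lookup. A minimal sketch, assuming the only problem is the three truncated labels seen in the counts; the `labels` series is a small made-up sample, not the real column:

```python
import pandas as pd

# Small made-up sample of the ETHCTY column
labels = pd.Series([
    "WHITE NON HISP", "HISPANIC", "ASIAN AND PACI",
    "BLACK NON HISP", "WHITE NON HISPANIC",
])

# Map each truncated label to its full form; other values are left unchanged
mapping = {
    "WHITE NON HISP": "WHITE NON HISPANIC",
    "ASIAN AND PACI": "ASIAN AND PACIFIC ISLANDER",
    "BLACK NON HISP": "BLACK NON HISPANIC",
}
cleaned = labels.replace(mapping)
print(cleaned.value_counts())
```

Keeping the corrections in one dictionary makes it easier to add another label later if a new truncation shows up.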

As for names, the data contains a mix of upper and lower case letters, so we’ll convert all names to uppercase.

In [6]:
# This is for converting all names to uppercase 
theData['NM'] = theData['NM'].str.upper()

## Training

To find out how accurate such a guess can be, we’re going to use Naive Bayes. Naive Bayes assumes that the pieces of evidence (here, gender and name) are independent of each other given the ethnicity.

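We’ll score each ethnicity as follows (the right-hand side is proportional to the probability, since we drop a normalizing term that is the same for every ethnicity):

$$P(\text{ethnicity} \mid \text{gender}, \text{name}) \propto P(\text{ethnicity}) \cdot P(\text{gender} \mid \text{ethnicity}) \cdot P(\text{name} \mid \text{ethnicity})$$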

But first, we’ll need to calculate all of the probabilities defined.

In [7]:
# Generate the ethnicity table
# Iterate over (label, count) pairs rather than integer positions
ethnicProb = {
    ethnicity: count/len(theData)
        for ethnicity, count in ethnicCount.items()
    }
ethnicTable = pd.DataFrame(ethnicProb,index=[0])
ethnicTable

Unnamed: 0,ASIAN AND PACIFIC ISLANDER,BLACK NON HISPANIC,HISPANIC,WHITE NON HISPANIC
0,0.12793,0.150318,0.34687,0.374882


In [8]:
# Generate the gender given ethnicity table 
genderProb = []
for ethnicity in ethnicCount.keys():
    ethData = theData[theData['ETHCTY'] == ethnicity]
    genderCount = ethData['GNDR'].value_counts()
    # Iterate over (label, count) pairs; nunique() on the counts would
    # skip a gender if two counts happened to be equal
    temp = {gender: count/len(ethData)
                for gender, count in genderCount.items()}
    genderProb.append(temp)
genderTable = pd.DataFrame(genderProb,index=ethnicCount.keys())
genderTable

Unnamed: 0,FEMALE,MALE
WHITE NON HISPANIC,0.457608,0.542392
HISPANIC,0.423076,0.576924
BLACK NON HISPANIC,0.419467,0.580533
ASIAN AND PACIFIC ISLANDER,0.425104,0.574896


In [9]:
# Generate the name given ethnicity table
nameProb = [{
    name:0 for name in theData['NM'].value_counts().keys()}
        for j in theData['ETHCTY'].value_counts()] 
for i in range(theData['ETHCTY'].nunique()):
    ethnicity = ethnicCount.keys()[i]
    ethData = theData[theData['ETHCTY'] == ethnicity]
    nameCount = ethData['NM'].value_counts()
    # Iterate over (name, count) pairs rather than integer positions
    for name, count in nameCount.items():
        nameProb[i][name] = count/len(ethData)
nameTable = pd.DataFrame(nameProb,index=ethnicCount.keys())
nameTable

Unnamed: 0,AAHIL,AALIYAH,AARAV,AARON,AARYA,AAYAN,ABBY,ABDIEL,ABDOUL,ABDOULAYE,...,ZELDA,ZENDAYA,ZEV,ZION,ZISSY,ZOE,ZOEY,ZOYA,ZURI,ZYAIRE
WHITE NON HISPANIC,0.0,0.0,0.0,0.002954,0.0,0.0,0.0,0.0,0.0,0.0,...,0.000127,0.0,0.001848,0.0,0.000979,0.003303,0.00088,0.0,0.0,0.0
HISPANIC,0.0,0.002577,0.0,0.004694,0.0,0.0,0.000239,0.000353,0.0,0.0,...,0.0,0.0,0.0,0.000478,0.0,0.003008,0.001268,0.0,0.0,0.0
BLACK NON HISPANIC,0.0,0.006499,0.0,0.004705,0.0,0.0,0.0,0.0,0.001173,0.001366,...,0.0,0.000179,0.0,0.003808,0.0,0.003808,0.002842,0.0,0.0012,0.000773
ASIAN AND PACIFIC ISLANDER,0.000227,0.0,0.001411,0.006469,0.000162,0.000584,0.000859,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.003486,0.001102,0.000259,0.0,0.0
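A conditional table like the one above can also be produced in one call. The following is a sketch of one alternative using `pd.crosstab` with `normalize="index"`; the `toy` frame is made up for illustration and is far smaller than the real data.

```python
import pandas as pd

# Made-up unrolled data: one row per occurrence of a name
toy = pd.DataFrame({
    "ETHCTY": ["HISPANIC", "HISPANIC", "HISPANIC",
               "WHITE NON HISPANIC", "WHITE NON HISPANIC"],
    "NM": ["GIA", "GIA", "ZOE", "ZOE", "ZEV"],
})

# Rows are ethnicities, columns are names; normalize="index" divides each
# row by its total, so every row is P(name | ethnicity) and sums to 1
name_given_eth = pd.crosstab(toy["ETHCTY"], toy["NM"], normalize="index")
print(name_given_eth)
```

Names never seen for an ethnicity come out as 0.0, matching the zeros in the table built by hand.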


## Predicting

Now, we’re going to predict the ethnicity given the data. We’ll be using a function to calculate the predicted ethnicity.

In [10]:
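def predict(gender, name):
    bestRes = 0
    bestEthnicity = ''
    for ethnicity in ethnicTable:
        # Score = P(ethnicity) * P(gender|ethnicity) * P(name|ethnicity)
        res = float(ethnicTable.loc[0, ethnicity]) * \
              float(genderTable.loc[ethnicity, gender]) * \
              float(nameTable.loc[ethnicity, name])
        # Keep the highest-scoring ethnicity (ties go to the later one)
        if bestRes <= res:
            bestRes = res
            bestEthnicity = ethnicity
    return bestEthnicity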
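To make the arithmetic concrete, here is a small worked example for a female named ZOE. The numbers are copied from the three tables above; the variable names and the final normalization step are additions for illustration and are not part of the notebook's `predict` function.

```python
# Values copied from the ethnicity, gender, and name tables above
prior = {
    "WHITE NON HISPANIC": 0.374882,
    "HISPANIC": 0.34687,
    "BLACK NON HISPANIC": 0.150318,
    "ASIAN AND PACIFIC ISLANDER": 0.12793,
}
p_female = {
    "WHITE NON HISPANIC": 0.457608,
    "HISPANIC": 0.423076,
    "BLACK NON HISPANIC": 0.419467,
    "ASIAN AND PACIFIC ISLANDER": 0.425104,
}
p_zoe = {
    "WHITE NON HISPANIC": 0.003303,
    "HISPANIC": 0.003008,
    "BLACK NON HISPANIC": 0.003808,
    "ASIAN AND PACIFIC ISLANDER": 0.003486,
}

# Unnormalized score for each ethnicity: prior * P(gender|eth) * P(name|eth)
scores = {eth: prior[eth] * p_female[eth] * p_zoe[eth] for eth in prior}

# Dividing by the total turns the scores into probabilities that sum to 1
total = sum(scores.values())
posterior = {eth: score / total for eth, score in scores.items()}

best = max(scores, key=scores.get)
print(best)
for eth, p in posterior.items():
    print(f"{eth}: {p:.3f}")
```

Normalizing does not change which ethnicity wins, but it shows how confident the model is in its pick.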

In [11]:
acc = 0
for index, row in theData.iterrows():
    if predict(row['GNDR'],row['NM']) == row['ETHCTY']:
        acc += 1
print(acc / len(theData))

0.6429061811001574


So, Naive Bayes correctly determines a person’s ethnicity 64% of the time. Keep in mind that this accuracy is measured on the same data the probability tables were built from. Under the independence assumption, then, the model is really poor at determining ethnicity.
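One way to put that figure in context is to compare it with always guessing the most common ethnicity. The sketch below does this with the class proportions from the ethnicity table above. The idea of a "baseline" is an addition here, not something the original analysis computes.

```python
# Class proportions copied from the ethnicity table above
priors = {
    "ASIAN AND PACIFIC ISLANDER": 0.12793,
    "BLACK NON HISPANIC": 0.150318,
    "HISPANIC": 0.34687,
    "WHITE NON HISPANIC": 0.374882,
}

# Always guessing the most common class is right this fraction of the time
baseline_label = max(priors, key=priors.get)
baseline_accuracy = priors[baseline_label]

# Accuracy printed by the evaluation loop above
model_accuracy = 0.6429061811001574

print(baseline_label, baseline_accuracy)
print(model_accuracy - baseline_accuracy)
```

The gap between the two numbers is the part of the accuracy that actually comes from looking at the name and gender.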

Now, how much better would the model perform if we let the name depend on the gender?

## Training with dependence

We’ll determine the new probability as follows:

$$P(\text{ethnicity} \mid \text{gender}, \text{name}) \propto P(\text{ethnicity}) \cdot P(\text{gender} \mid \text{ethnicity}) \cdot P(\text{name} \mid \text{gender}, \text{ethnicity})$$

We only need to calculate the name tables for each gender.

In [12]:
# Generate the name given gender and ethnicity tables (one table per gender)
nameProb = {}
for gender in genderTable:
    tempData = theData[theData['GNDR'] == gender]
    # Start every name at 0.0 for every ethnicity
    nameProbGender = [{name:0.0 for name in tempData['NM'].value_counts().keys()}
                        for j in tempData['ETHCTY'].value_counts()]
    nameProbGender = pd.DataFrame(nameProbGender,index=ethnicCount.keys())
    for ethnicity in tempData['ETHCTY'].value_counts().keys():
        ethData = tempData[tempData['ETHCTY'] == ethnicity]
        nameCount = ethData['NM'].value_counts()
        for name, count in nameCount.items():
            # .loc avoids chained assignment, which may write to a copy
            nameProbGender.loc[ethnicity, name] = count/len(ethData)
    nameProb[gender] = nameProbGender
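The same conditional frequencies can be sketched more compactly with a grouped `value_counts`. This is an alternative formulation, not the notebook's method, and the `toy` frame is made up for illustration.

```python
import pandas as pd

# Made-up unrolled data: one row per occurrence of a name
toy = pd.DataFrame({
    "GNDR":   ["FEMALE", "FEMALE", "FEMALE", "MALE", "MALE"],
    "ETHCTY": ["HISPANIC", "HISPANIC", "HISPANIC", "HISPANIC", "HISPANIC"],
    "NM":     ["GIA", "GIA", "ZOE", "ZION", "ZION"],
})

# Relative frequency of each name within every (gender, ethnicity) group,
# i.e. P(name | gender, ethnicity)
name_given_gender_eth = (
    toy.groupby(["GNDR", "ETHCTY"])["NM"].value_counts(normalize=True)
)
print(name_given_gender_eth.loc[("FEMALE", "HISPANIC", "GIA")])
```

Unlike the tables built in the loop, this result only contains combinations that actually occur, so a lookup for an unseen (gender, ethnicity, name) combination would need to fall back to 0.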

Since the name probability now depends on gender, we’ll need a new prediction function.

In [13]:
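def predictDep(gender, name):
    bestRes = 0
    bestEthnicity = ''
    for ethnicity in ethnicTable:
        # Score = P(ethnicity) * P(gender|ethnicity) * P(name|gender, ethnicity)
        res = float(ethnicTable.loc[0, ethnicity]) * \
              float(genderTable.loc[ethnicity, gender]) * \
              float(nameProb[gender].loc[ethnicity, name])
        # Keep the highest-scoring ethnicity (ties go to the later one)
        if bestRes <= res:
            bestRes = res
            bestEthnicity = ethnicity
    return bestEthnicity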

In [14]:
acc = 0
for index, row in theData.iterrows():
    if predictDep(row['GNDR'],row['NM']) == row['ETHCTY']:
        acc += 1
print(acc / len(theData))

0.6452644613186043


After modeling this dependence, the accuracy barely improved: from about 64.3% to about 64.5%.