We want to predict the most likely gender and ethnicity for a given first name in the given data set. Having done a quick bit of research online I found that the best approach would be some kind of LSTM deep learning model. This has the advantage of learning from sequences as it's a kind of RNN. Some of the code is adapted from that contained in the following reference (why re-invent the wheel?!):

Deep Learning Gender from name - RNN LSTMs (P R Deepakbabu Github repository):

https://github.com/prdeepakbabu/Python/blob/master/Deep%20learning%20gender/Deep%20Learning%20(RNN%20-%20LSTMs)%20Predict%20Gender%20from%20Name.ipynb

So let's pre-process our data.

In [4]:
from __future__ import print_function

from sklearn.preprocessing import OneHotEncoder
from keras.layers.core import Dense, Activation, Dropout
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Embedding
from keras.layers import LSTM
from keras.datasets import imdb
import pandas as pd
import numpy as np
import os
import csv

In [2]:
#parameters
maxlen = 30
labels = 2

In [5]:
#Remove column headers:

reader = csv.reader(open('clean_names.csv' , 'rb'))
    
f=csv.writer(open('clean_names_tidy.csv' , 'wb'))
    
for line in reader:
    if "gender" not in line:
        f.writerow(line)

In [35]:
input = pd.read_csv('clean_names_tidy.csv',header=None)
input.columns = ['first_name','gender','last_name','race']
input['namelen']= [len(str(i)) for i in input['first_name']]
#print (input)
input1 = input[(input['namelen'] >= 2) ]
print (input1)

      first_name gender            last_name      race  namelen
0        shirley      f                adams  hispanic        7
1            ana      f               alonso  hispanic        3
2         miriam      f               alonzo  hispanic        6
3         ivette      f              alvarez  hispanic        6
4          saray      f               amador  hispanic        5
5         niurka      f              batista  hispanic        6
6          maria      f           betancourt  hispanic        5
7       merienne      f                blake  hispanic        8
8      rosalinda      f                 boyd  hispanic        9
9       migdalia      f              braconi  hispanic        8
10        marina      f                bueno  hispanic        6
11        carmen      f               burgos  hispanic        6
12         zaida      f              cabrera  hispanic        5
13        lurvin      f             calderon  hispanic        6
14       melissa      f             cald

In [7]:
input1.groupby('gender')['first_name'].count()

gender
f     4816
m    60639
Name: first_name, dtype: int64

In [8]:
names = input['first_name']
gender = input['gender']
vocab = set(' '.join([str(i) for i in names]))
vocab.add('END')
len_vocab = len(vocab)

In [9]:
print(vocab)
print("vocab length is ",len_vocab)
print ("length of input is ",len(input1))

set([' ', '-', '.', '0', 'END', 'a', 'c', 'b', 'e', 'd', 'g', 'f', 'i', 'h', 'k', 'j', 'm', 'l', 'o', 'n', 'q', 'p', 's', 'r', 'u', 't', 'w', 'v', 'y', 'x', 'z'])
vocab length is  31
length of input is  65456


In [10]:
char_index = dict((c, i) for i, c in enumerate(vocab))

In [11]:
print(char_index)

{' ': 0, '-': 1, '.': 2, '0': 3, 'END': 4, 'a': 5, 'c': 6, 'b': 7, 'e': 8, 'd': 9, 'g': 10, 'f': 11, 'i': 12, 'h': 13, 'k': 14, 'j': 15, 'm': 16, 'l': 17, 'o': 18, 'n': 19, 'q': 20, 'p': 21, 's': 22, 'r': 23, 'u': 24, 't': 25, 'w': 26, 'v': 27, 'y': 28, 'x': 29, 'z': 30}


In [12]:
#train test split
msk = np.random.rand(len(input1)) < 0.8
train = input1[msk]
test = input1[~msk]

In [13]:
#take input upto max and truncate rest
#encode to vector space(one hot encoding)
#pad 'END' to shorter sequences
print (train.first_name)
train_X = []
trunc_train_name = [str(i)[0:30] for i in train.first_name]
for i in trunc_train_name:
    tmp = [char_index[j] for j in str(i)]
    for k in range(0,maxlen - len(str(i))):
        tmp.append(char_index["END"])
    train_X.append(tmp)

0          shirley
1              ana
2           miriam
3           ivette
4            saray
5           niurka
6            maria
7         merienne
9         migdalia
11          carmen
12           zaida
13          lurvin
14         melissa
15         mileyka
16        kimberly
17        ivelisse
18             ana
19          emilia
20          connie
23       elizabeth
24            ines
25          djerid
26          ivette
27          gloria
28           linda
29          josefa
30        graciela
31             ada
32        michelle
33          angela
           ...    
65460      michael
65461      michael
65463      richard
65464        roger
65465       stacey
65466      timothy
65467         todd
65468       victor
65469      zachary
65471      kenneth
65472        brent
65473      timothy
65474       george
65475         john
65477      michael
65478      william
65479       walter
65480      richard
65481      kenneth
65482        roger
65483        danny
65484       

In [14]:
np.asarray(train_X).shape

(52154, 30)

In [15]:
def set_flag(i):
    tmp = np.zeros(39);
    tmp[i] = 1
    return(tmp)

In [16]:
set_flag(3)

array([0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0.])

In [17]:
#take input upto max and truncate rest
#encode to vector space(one hot encoding)
#add 'END' to shorter sequences
#also convert each index to one-hot encoding
train_X = []
train_Y = []
trunc_train_name = [str(i)[0:maxlen] for i in train.first_name]
for i in trunc_train_name:
    tmp = [set_flag(char_index[j]) for j in str(i)]
    for k in range(0,maxlen - len(str(i))):
        tmp.append(set_flag(char_index["END"]))
    train_X.append(tmp)
for i in train.gender:
    if i == 'm':
        train_Y.append([1,0])
    else:
        train_Y.append([0,1])

In [18]:
np.asarray(train_X).shape

(52154, 30, 39)

In [19]:
np.asarray(train_Y).shape

(52154, 2)

In [20]:
#build the model: 2 stacked LSTM

print('Build model...')
model = Sequential()
model.add(LSTM(512, return_sequences=True, input_shape=(maxlen,39)))
model.add(Dropout(0.2))
model.add(LSTM(512, return_sequences=False))
model.add(Dropout(0.2))
model.add(Dense(2))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['accuracy'])

Build model...


In [21]:
test_X = []
test_Y = []
trunc_test_name = [str(i)[0:maxlen] for i in test.first_name]
for i in trunc_test_name:
    tmp = [set_flag(char_index[j]) for j in str(i)]
    for k in range(0,maxlen - len(str(i))):
        tmp.append(set_flag(char_index["END"]))
    test_X.append(tmp)
for i in test.gender:
    if i == 'm':
        test_Y.append([1,0])
    else:
        test_Y.append([0,1])

In [22]:
print(np.asarray(test_X).shape)
print(np.asarray(test_Y).shape)
print (maxlen)

(13302, 30, 39)
(13302, 2)
30


In [23]:
batch_size=1000
model.fit(np.array(train_X), np.array(train_Y),batch_size=batch_size,nb_epoch=3,validation_data=(np.array(test_X), np.array(test_Y)))

  


Train on 52154 samples, validate on 13302 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x1a2ee31d50>

In [24]:
score, acc = model.evaluate(np.array(test_X), np.array(test_Y))
print('Test score:', score)
print('Test accuracy:', acc)

Test score: 0.21567851997079607
Test accuracy: 0.9243722748458878


92% is pretty good for 3 epochs. Let's run a few more:

In [57]:
#input more epochs here Peter

batch_size=1000
model.fit(np.array(train_X), np.array(train_Y),batch_size=batch_size,nb_epoch=15,validation_data=(np.array(test_X), np.array(test_Y)))

  after removing the cwd from sys.path.


Train on 52154 samples, validate on 13302 samples
Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x1a16841ed0>

In [58]:
score, acc = model.evaluate(np.array(test_X), np.array(test_Y))
print('Test score:', score)
print('Test accuracy:', acc)

Test score: 0.08758377855515645
Test accuracy: 0.9743647571793715


In [59]:
request = raw_input("Enter the name whose gender you want to know: ")
print('The name whose gender you want to know is:', request)

Enter the name whose gender you want to know: melissa
The name whose gender you want to know is: melissa


In [60]:
name=[]
name.append(request)
print (name)
X=[]
trunc_name = [i[0:maxlen] for i in name]
for i in trunc_name:
    tmp = [set_flag(char_index[j]) for j in str(i)]
    for k in range(0,maxlen - len(str(i))):
        tmp.append(set_flag(char_index["END"]))
    X.append(tmp)
pred=model.predict(np.asarray(X))
print (pred)
if pred[0,0] > 0.5:
    print ("Male")
else:
    print ("Female")

['melissa']
[[0.01798168 0.98201835]]
Female


In [61]:
#save model and data
model.save_weights('gender_model_18_epochs',overwrite=True)
train.to_csv("train_split_18_epochs.csv")
test.to_csv("test_split_18_epochs.csv")

Now let's do the ethnicity! Similar process:

In [36]:
input1.groupby('race')['first_name'].count()

race
b               1
black       35042
hispanic     4381
whit            1
white       26030
Name: first_name, dtype: int64

In [84]:
#tidy data

input1['race'] = [w.replace('whit', 'white') for w in input1['race']]
input1['race'] = [w.replace('b', 'black') for w in input1['race']]

input1.groupby('race')['first_name'].count()

0        hispanic
1        hispanic
2        hispanic
3        hispanic
4        hispanic
5        hispanic
6        hispanic
7        hispanic
8        hispanic
9        hispanic
10       hispanic
11       hispanic
12       hispanic
13       hispanic
14       hispanic
15       hispanic
16       hispanic
17       hispanic
18       hispanic
19       hispanic
20       hispanic
21       hispanic
22       hispanic
23       hispanic
24       hispanic
25       hispanic
26       hispanic
27       hispanic
28       hispanic
29       hispanic
           ...   
65466       white
65467       white
65468       white
65469       white
65470       white
65471       white
65472       white
65473       white
65474       white
65475       white
65476       white
65477       white
65478       white
65479       white
65480       white
65481       white
65482       white
65483       white
65484       white
65485       white
65486       white
65487       white
65488       white
65489       white
65490     

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


race
black       35043
hispanic     4381
white       26031
Name: first_name, dtype: int64

In [85]:
names = input1['first_name']
race = input1['race']
vocab = set(' '.join([str(i) for i in names]))
vocab.add('END')
len_vocab = len(vocab)

In [86]:
print(vocab)
print("vocab length is ",len_vocab)
print ("length of input is ",len(input1))

set([' ', '-', '.', '0', 'END', 'a', 'c', 'b', 'e', 'd', 'g', 'f', 'i', 'h', 'k', 'j', 'm', 'l', 'o', 'n', 'q', 'p', 's', 'r', 'u', 't', 'w', 'v', 'y', 'x', 'z'])
vocab length is  31
length of input is  65456


In [87]:
char_index = dict((c, i) for i, c in enumerate(vocab))
print (char_index)

{' ': 0, '-': 1, '.': 2, '0': 3, 'END': 4, 'a': 5, 'c': 6, 'b': 7, 'e': 8, 'd': 9, 'g': 10, 'f': 11, 'i': 12, 'h': 13, 'k': 14, 'j': 15, 'm': 16, 'l': 17, 'o': 18, 'n': 19, 'q': 20, 'p': 21, 's': 22, 'r': 23, 'u': 24, 't': 25, 'w': 26, 'v': 27, 'y': 28, 'x': 29, 'z': 30}


In [88]:
#train test split
msk = np.random.rand(len(input1)) < 0.8
train = input1[msk]
test = input1[~msk]

In [89]:
#take input upto max and truncate rest
#encode to vector space(one hot encoding)
#padd 'END' to shorter sequences
print (train.first_name)
train_X = []
trunc_train_name = [str(i)[0:30] for i in train.first_name]
for i in trunc_train_name:
    tmp = [char_index[j] for j in str(i)]
    for k in range(0,maxlen - len(str(i))):
        tmp.append(char_index["END"])
    train_X.append(tmp)

1              ana
2           miriam
3           ivette
4            saray
5           niurka
6            maria
7         merienne
8        rosalinda
9         migdalia
10          marina
11          carmen
14         melissa
17        ivelisse
18             ana
19          emilia
20          connie
22        samantha
23       elizabeth
24            ines
26          ivette
27          gloria
28           linda
29          josefa
30        graciela
31             ada
34        jennifer
35         mariely
36          glenny
37           yanet
39         johanna
           ...    
65462      richard
65463      richard
65464        roger
65465       stacey
65466      timothy
65467         todd
65468       victor
65469      zachary
65470        lukas
65471      kenneth
65472        brent
65473      timothy
65474       george
65476      freeman
65477      michael
65478      william
65479       walter
65480      richard
65481      kenneth
65482        roger
65483        danny
65484       

In [90]:
np.asarray(train_X).shape

(52408, 30)

In [93]:
#take input upto max and truncate rest
#encode to vector space(one hot encoding)
#add 'END' to shorter sequences
#also convert each index to one-hot encoding
train_X = []
train_Y = []
trunc_train_name = [str(i)[0:maxlen] for i in train.first_name]
for i in trunc_train_name:
    tmp = [set_flag(char_index[j]) for j in str(i)]
    for k in range(0,maxlen - len(str(i))):
        tmp.append(set_flag(char_index["END"]))
    train_X.append(tmp)
for i in train.race:
    if i == 'white':
        train_Y.append([1,0,0])
    elif i == 'black':
        train_Y.append([0,1,0])
    elif i== 'hispanic':
        train_Y.append([0,0,1])

In [94]:
#build the model: 2 stacked LSTM

print('Build model...')
model = Sequential()
model.add(LSTM(512, return_sequences=True, input_shape=(maxlen,39)))
model.add(Dropout(0.2))
model.add(LSTM(512, return_sequences=False))
model.add(Dropout(0.2))
model.add(Dense(3))     #change to 3 (from 2) as we now have 3 outputs, white, black or hispanic
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['accuracy'])

Build model...


In [95]:
test_X = []
test_Y = []
trunc_test_name = [str(i)[0:maxlen] for i in test.first_name]
for i in trunc_test_name:
    tmp = [set_flag(char_index[j]) for j in str(i)]
    for k in range(0,maxlen - len(str(i))):
        tmp.append(set_flag(char_index["END"]))
    test_X.append(tmp)
for i in test.race:
    if i == 'white':
        test_Y.append([1,0,0])
    elif i == 'black':
        test_Y.append([0,1,0])
    elif i == 'hispanic':
        test_Y.append([0,0,1])

In [96]:
print(np.asarray(test_X).shape)
print(np.asarray(test_Y).shape)
print (maxlen)

(13048, 30, 39)
(13048, 3)
30


In [98]:
batch_size=1000
model.fit(np.array(train_X), np.array(train_Y),batch_size=batch_size,nb_epoch=18,validation_data=(np.array(test_X), np.array(test_Y)))

  


Train on 52408 samples, validate on 13048 samples
Epoch 1/18
Epoch 2/18
Epoch 3/18
Epoch 4/18
Epoch 5/18
Epoch 6/18
Epoch 7/18
Epoch 8/18
Epoch 9/18
Epoch 10/18
Epoch 11/18
Epoch 12/18
Epoch 13/18
Epoch 14/18
Epoch 15/18
Epoch 16/18
Epoch 17/18
Epoch 18/18


<keras.callbacks.History at 0x1a30b97b10>

In [119]:
score, acc = model.evaluate(np.array(test_X), np.array(test_Y))
print('Test score:', score)
print('Test accuracy:', acc)

Test score: 0.7097495917086949
Test accuracy: 0.6463059472716125


In [134]:
#save model and data
model.save_weights('race_model_18_epochs_try2',overwrite=True)
train.to_csv("train_split_18_epochs.csv")
test.to_csv("test_split_18_epochs.csv")

In [117]:
request = raw_input("Enter the name whose race you want to know: ")
print('The name whose race you want to know is:', request)

Enter the name whose race you want to know: tony
The name whose race you want to know is: tony


In [118]:
name=[]
name.append(request)
print (name)
X=[]
trunc_name = [i[0:maxlen] for i in name]
for i in trunc_name:
    tmp = [set_flag(char_index[j]) for j in str(i)]
    for k in range(0,maxlen - len(str(i))):
        tmp.append(set_flag(char_index["END"]))
    X.append(tmp)
pred=model.predict(np.asarray(X))
print (pred)
if pred[0,0] > 0.3333:    #note: this if loop is not right, see below how I've corrected it for the combined result
    print ("white")
elif pred[0,1] > 0.3333:
    print ("black")
elif pred[0,2] > 0.3333:
    print ("hispanic")

['tony']
[[0.30031165 0.6664009  0.03328747]]
black


So our accuracy is 65% which isn't bad but not as good as for gender. We could now refine the architecture of our model to see if we can improve our accuracy. But this seems okay for now. Let's combine our two results:

In [122]:
import keras

In [123]:
model_ethnicity = keras.models.clone_model(model)

In [124]:
print('Build model...')
model_gender = Sequential()
model_gender.add(LSTM(512, return_sequences=True, input_shape=(maxlen,39)))
model_gender.add(Dropout(0.2))
model_gender.add(LSTM(512, return_sequences=False))
model_gender.add(Dropout(0.2))
model_gender.add(Dense(2))     #change to 2 (from 3) as we now have 2 outputs, m or f
model_gender.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['accuracy'])

Build model...


In [129]:
print(model_ethnicity.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_3 (LSTM)                (None, 30, 512)           1130496   
_________________________________________________________________
dropout_3 (Dropout)          (None, 30, 512)           0         
_________________________________________________________________
lstm_4 (LSTM)                (None, 512)               2099200   
_________________________________________________________________
dropout_4 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 3)                 1539      
_________________________________________________________________
activation_2 (Activation)    (None, 3)                 0         
Total params: 3,231,235
Trainable params: 3,231,235
Non-trainable params: 0
_________________________________________________________________


In [130]:
print(model_gender.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_5 (LSTM)                (None, 30, 512)           1130496   
_________________________________________________________________
dropout_5 (Dropout)          (None, 30, 512)           0         
_________________________________________________________________
lstm_6 (LSTM)                (None, 512)               2099200   
_________________________________________________________________
dropout_6 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 2)                 1026      
_________________________________________________________________
activation_3 (Activation)    (None, 2)                 0         
Total params: 3,230,722
Trainable params: 3,230,722
Non-trainable params: 0
_________________________________________________________________


In [145]:
model_gender.save_weights('gender_model_18_epochs',overwrite=True)

In [146]:
model_gender.load_weights('gender_model_18_epochs', by_name=False)

In [147]:
model_ethnicity.load_weights('race_model_18_epochs', by_name=False)

So now we're ready to ask for user input! Let's give it a go:

In [174]:
request = raw_input("Enter the name whose ethnicity and gender you want to know: ")
print('The name whose race you want to know is:', request)

Enter the name whose ethnicity and gender you want to know: daniel
The name whose race you want to know is: daniel


In [175]:
name=[]
name.append(request)
print (name)
X=[]
trunc_name = [i[0:maxlen] for i in name]
for i in trunc_name:
    tmp = [set_flag(char_index[j]) for j in str(i)]
    for k in range(0,maxlen - len(str(i))):
        tmp.append(set_flag(char_index["END"]))
    X.append(tmp)
pred_ethnicity=model_ethnicity.predict(np.asarray(X))
print (pred_ethnicity)
if pred_ethnicity[0,0] > pred_ethnicity[0,1] and pred_ethnicity[0,0] > pred_ethnicity[0,2]:
    print ("White")
elif pred_ethnicity[0,1] > pred_ethnicity[0,0] and pred_ethnicity[0,1] > pred_ethnicity[0,2]:
    print ("Black")
elif pred_ethnicity[0,2] > pred_ethnicity[0,0] and pred_ethnicity[0,2] > pred_ethnicity[0,1]:
    print ("Hispanic")
    
pred_gender=model_gender.predict(np.asarray(X))
print (pred_gender)
if pred_gender[0,0] > 0.5:
    print ("Male")
else:
    print ("Female")

['daniel']
[[0.7436051  0.15517457 0.10122035]]
white
[[0.988057   0.01194294]]
Male


Have run a few tests and seems to work quite well. As I mentioned above I could now look at adapting the architecture for the ethnicity model to see if I can get the accuracy higher. But it seems to give good results as it is so let's leave it for now (can discuss this when we meet if you like).