# Chinese Name Gender Predictor
James Chan 2018

In [6]:
from xpinyin import Pinyin
import pandas as pd
from keras.layers import LSTM, Dense
from keras.models import Sequential
import numpy as np

## Overview
The Chinese writing system is made up of individual characters famous for their elegant strokes and detailed complexity.  While most people recognize a Chinese character when they see one, few know about the romanized spelling system known as Pinyin, which spells out the Mandarin pronunciation of Chinese characters in the Latin alphabet.  

"The pinyin system was developed in the 1950s by many linguists, including Zhou Youguang, based on earlier form romanizations of Chinese. It was published by the Chinese government in 1958 and revised several times".[1]

![](files/img/bruce.png)
<div align="center">Figure 1. Bruce Lee's Chinese Name in Pinyin</div>

The distinction between female and male names is more subtle in Chinese than in typical American English names.  For example, John and Mary almost always belong to a male and a female respectively, whereas the gender of a name like Xiao-Ming is practically a coin toss. However, in most cases there are subtle patterns that lean towards one gender more than the other.

In this project, I use deep learning to train a gender predictor for Chinese names using ~9800 Chinese name samples [2].  I hope that the gender predictor will perform at least as well as a human.  

#### Table of Contents
1. Framing the Problem
2. Data Processing
3. Train the Model
4. Evaluate the Model
5. Conclusion

## 1. Framing the Problem
We first strip the last name from each of our ~9800 Chinese names, because last names have negligible correlation (if any) with a person's gender.  Once the last name is removed, the Pinyin representation of each given name is simply a sequence of letters.  By converting each letter into a one-hot vector and assigning it a timestep, our training examples are ready to be fed into an LSTM model.  In the example below, t denotes each timestep and T denotes the total number of timesteps.

![](files/img/feed.png)
<div align="center">Figure 2. Data Processing for Deep Learning (LSTM)</div>

It is worth mentioning that this scheme is a many-to-one mapping: the input is a sequence with a variable number of timesteps, and the output is a single binary target, gender.
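
The encoding in Figure 2 can be sketched in a few lines of NumPy. This is an illustration only: the alphabet ordering (`ALPHABET`) and the helper name `one_hot_name` are assumptions made for the example, not the mapping built later in this notebook.

```python
import numpy as np

# Assumed vocabulary for illustration: 26 lowercase letters plus the dash
# that separates syllables in the Pinyin output.
ALPHABET = "abcdefghijklmnopqrstuvwxyz-"
CHAR_TO_INDEX = {char: i for i, char in enumerate(ALPHABET)}

def one_hot_name(name):
    """Return a (T, 27) array with one one-hot row per character (timestep)."""
    encoded = np.zeros((len(name), len(ALPHABET)), dtype=int)
    for t, char in enumerate(name):
        encoded[t, CHAR_TO_INDEX[char]] = 1
    return encoded

example = one_hot_name("xiao-long")
print(example.shape)           # (T, number of features)
print(example.argmax(axis=1))  # index of the active feature at each timestep
```

Each name produces a different T, while the model emits a single value for the whole sequence, which is what makes the mapping many-to-one.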

## 2. Data Processing
Here we process the raw data per Figure 2.

In [7]:
#the conversion from Chinese characters to their Pinyin equivalent is handled by the xpinyin utility below [3]
p = Pinyin()
df = pd.read_csv('ChineseNames.csv')

In [8]:
#change headers to english
df.columns = ['Name', 'Gender']

In [9]:
#strip last name (assumed to be the first character of the full name)
def strip_last_name(name):
    return name[1:]

def convert_to_pinyin(name):
    return p.get_pinyin(name)

def process_name(name):
    name = strip_last_name(name)
    name = convert_to_pinyin(name)
    return name
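
`strip_last_name` drops exactly one character, which assumes a single-character surname. Chinese also has a small number of two-character (compound) surnames such as 欧阳 and 司马. Whether the dataset contains any is not checked here; the sketch below shows one way they could be handled, using a hypothetical and deliberately incomplete `COMPOUND_SURNAMES` set and a hypothetical helper name.

```python
# Hypothetical, incomplete set of two-character surnames (an assumption
# for illustration; a real list would be longer).
COMPOUND_SURNAMES = {"欧阳", "司马", "诸葛", "上官"}

def strip_last_name_compound(name):
    """Drop a two-character surname when one is recognized, else one character."""
    if len(name) > 2 and name[:2] in COMPOUND_SURNAMES:
        return name[2:]
    return name[1:]

print(strip_last_name_compound("左婉怡"))    # single-character surname
print(strip_last_name_compound("欧阳娜娜"))  # compound surname
```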

In [10]:
#show the data before and after processing
print(df.tail(5))
df['Name'] = df['Name'].apply(process_name)
print(df.tail(5))

     Name Gender
9792  左婉怡      F
9793  左烜晅      F
9794  左雨晴      F
9795   左越      F
9796  左子烨      F
           Name Gender
9792     wan-yi      F
9793  xuan-xuan      F
9794    yu-qing      F
9795        yue      F
9796      zi-ye      F


The first five rows are the raw data.  The last five rows are the Pinyin equivalents with the last name stripped.  This looks correct.  Next we build a dictionary by scanning all the characters that appear.  If everything goes according to plan, there should be only 27 characters: 26 letters and a dash.

In [11]:
#get all unique chars
characters = {}
for i, name in enumerate(df['Name']):
    for char in name:
        if char in characters:
            characters[char] += 1
        else:
            characters[char] = 1

In [14]:
#show number of unique characters
len(characters)

27

27 is the correct number as discussed above.
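
The same character tally can be written with `collections.Counter`, and sorting the keys gives an alphabetically ordered index mapping. A minimal sketch on a few processed names taken from the output above (the variable names here are illustrative, not from the notebook):

```python
from collections import Counter

# A few processed names taken from the output above.
sample_names = ["wan-yi", "xuan-xuan", "yu-qing", "yue", "zi-ye"]

# Count every character across all names in one pass.
char_counts = Counter(char for name in sample_names for char in name)

# Sorting the keys yields a deterministic mapping, with "-" first because
# it sorts before the lowercase letters.
char_to_index = {char: i for i, char in enumerate(sorted(char_counts))}

print(len(char_counts))
print(char_to_index)
```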

In [16]:
#find the longest name
maxlen = df['Name'].str.len().max()
maxlen

16

We use the length of the longest name, 16 in this case, as the number of timesteps in our time series.  We then convert our data into sequences of one-hot vectors per our dictionary.

In [17]:
#create the mapping.  it would be nice if the mapping were alphabetically ordered, but the order doesn't matter mathematically.
idx = 0
for c in characters.keys():
    characters[c] = idx
    idx += 1

In [18]:
#create training examples of dimension (example #, timestep, # of features (length of one-hot vector))
num_examples = df.shape[0]
time_steps = maxlen
num_features = len(characters)
def char_to_vec(char):
    vector = np.zeros((1, num_features), dtype=int)
    idx = characters[char]
    vector[0,idx] = 1
    return vector

def name_to_vec(name):
    example = np.zeros((maxlen, num_features), dtype = int)
    for i in range(maxlen):
        if i < len(name):
            char = name[i]
            vector = char_to_vec(char)
            example[i,:] = vector[0,:]
        else:
            example[i,:] = np.zeros((num_features),dtype=int)
    return example

In [19]:
#convert all examples to vector format
X = np.empty((df.shape[0], maxlen, num_features))
for i, name in enumerate(df['Name']):
    example = name_to_vec(name)
    X[i,:,:] = example

In [20]:
#encode the target as integers: male = 1, female = 0
Y = df['Gender'].values
Y[Y == 'M'] = 1
Y[Y == 'F'] = 0

In [23]:
print(X.shape)
print(Y.shape)

(9797, 16, 27)
(9797,)


These dimensions look correct.  We have about 9800 name entries, each example has 16 timesteps, and each one-hot vector has 27 positions.
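
The per-character loops above are easy to follow; the same tensor can also be built with NumPy indexing. A sketch under stated assumptions: `encode_names` and the sorted toy vocabulary are hypothetical names introduced here, and padding rows are left as all zeros, as in `name_to_vec`.

```python
import numpy as np

def encode_names(names, char_to_index, max_len):
    """Build an (examples, max_len, features) one-hot tensor, zero-padded."""
    identity = np.eye(len(char_to_index), dtype=int)
    tensor = np.zeros((len(names), max_len, len(char_to_index)), dtype=int)
    for row, name in enumerate(names):
        indices = [char_to_index[char] for char in name[:max_len]]
        # Rows of the identity matrix are exactly the one-hot vectors.
        tensor[row, :len(indices), :] = identity[indices]
    return tensor

toy_names = ["yue", "zi-ye"]
toy_vocab = {char: i for i, char in enumerate(sorted(set("".join(toy_names))))}
X_toy = encode_names(toy_names, toy_vocab, max_len=6)
print(X_toy.shape)
```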

## 3. Train the Model
The parameters below were determined via manual tuning.  

In [30]:
model = Sequential()
model.add(LSTM(64, input_shape=(maxlen, num_features), return_sequences=False, dropout=0.2, recurrent_dropout=0.2))
for _ in range(8):
    model.add(Dense(64, activation='tanh'))
model.add(Dense(1, activation='tanh'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
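
One design point worth noting: binary cross-entropy treats the model output as a probability, so it is usually paired with a sigmoid output, which stays strictly between 0 and 1. The output layer above uses `tanh`, which ranges from -1 to 1 and can go negative. The NumPy sketch below only illustrates the two ranges; it is not part of the trained model, and the sample pre-activation values are made up.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, p):
    """Mean binary cross-entropy for probabilities p in (0, 1)."""
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

# Made-up pre-activation values for four examples and their labels.
z = np.array([-2.0, -0.5, 0.5, 2.0])
y = np.array([0, 0, 1, 1])

print(np.tanh(z))   # includes negative values, which are not valid probabilities
print(sigmoid(z))   # all values lie strictly between 0 and 1
print(binary_crossentropy(y, sigmoid(z)))
```

In Keras this would correspond to `Dense(1, activation='sigmoid')` for the last layer. That swap is only a suggestion; the results reported in this notebook come from the model as defined above.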

In [34]:
model.fit(X,Y, batch_size=64, epochs=10, verbose=True)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x206b3323eb8>

## 4. Evaluate the Model
With just 10 epochs of training, the model reached an accuracy in the low 60% range, which doesn't seem that great; increasing the number of epochs did not noticeably improve it.  The low 60% range was about as high as I could get with my model, which is within reasonable expectation.  To further evaluate the model, we will try some real-life examples. This step is divided into 3 parts:
1. Gender Prediction of Famous People
2. Gender Prediction of My Grade School Classmates
3. Gender Prediction of My Immediate Family Members
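
The hand-picked lists in these three parts are small. A complementary check, not performed in this notebook, would be to hold out part of the dataset before training and score the model on it. A minimal NumPy sketch, where `train_test_split_arrays`, the 20% test fraction, and the random seed are all assumptions chosen for illustration:

```python
import numpy as np

def train_test_split_arrays(X, Y, test_fraction=0.2, seed=0):
    """Shuffle the examples and split them into training and held-out parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(len(X) * test_fraction)
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], Y[train_idx], Y[test_idx]

# Placeholder arrays with the same shape as the notebook's X and Y.
X_demo = np.zeros((9797, 16, 27), dtype=np.int8)
Y_demo = np.zeros(9797, dtype=int)
X_train, X_held, Y_train, Y_held = train_test_split_arrays(X_demo, Y_demo)
print(X_train.shape, X_held.shape)
```

Shuffling before the split matters here because the dataset appears to be sorted by last name.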

### 4.1 Gender Prediction of Famous People [82% Accuracy]

In [35]:
#let's test on some real life examples:
X_test = np.empty((0, maxlen, num_features), dtype=int)
Y_test = np.empty((0), dtype=int)
people = ['long', #jackie chan
          'xiao-long', #bruce lee
          'bing-bing', #fan bingbing
          'zi-yi', #zhang ziyi
          'zi-dan', #donnie yen
          'lian-jie', #jet li
          'yu-ling', #lucy liu
          'en-mei', #amy tan
          'ming-na', #ming-na wen
          'ze-dong', #chairman mao
         'jin-ping'] #chairman xi
Y_test = np.array([1,1,0,0,1,1,0,0,0,1,1])
for name in people:
    array = name_to_vec(name)
    X_test = np.concatenate((X_test, np.expand_dims(array, 0)), 0)

In [36]:
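#predict: threshold the model output at 0.5 (same result as the predict_classes helper, which newer Keras releases removed)
predictions = (model.predict(X_test) > 0.5).astype(int)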
for i, prediction in enumerate(predictions):
    if prediction != Y_test[i]:
        print('mis-classified: ', people[i])
print()
print('accuracy: ', sum(predictions[:,0] == Y_test)/Y_test.shape[0])

mis-classified:  bing-bing
mis-classified:  lian-jie

accuracy:  0.8181818181818182


The result is more impressive than I expected after just 10 epochs of training: 9 of the 11 names (82%) were classified correctly, which is very high for Chinese names.  I would have expected Bing Bing to be classified as female, since names ending in -ing are more commonly female than male in my experience.  I am, however, not at all surprised that Jet Li was mis-classified, since Lian tends to be more on the feminine side in my experience.

### 4.2 Gender Prediction of My Grade School Classmates [82% Accuracy]

In [37]:
#some of the names that come to mind from grade school
X_test = np.empty((0, maxlen, num_features), dtype=int)
Y_test = np.empty((0), dtype=int)
people = ['ting-pei',
          'xin',
          'jin-hao',
          'zhe-an',
          'yi-cheng',
          'zi-jun',
          'zhi-hao',
          'wei-han',
          'guan-yu',
          'xiu-qi',
          'jun-de']
Y_test = np.array([0,0,1,1,1,1,1,1,0,1,1])
for name in people:
    array = name_to_vec(name)
    X_test = np.concatenate((X_test, np.expand_dims(array, 0)), 0)

In [38]:
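#predict: threshold the model output at 0.5 (same result as the predict_classes helper, which newer Keras releases removed)
predictions = (model.predict(X_test) > 0.5).astype(int)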
for i, prediction in enumerate(predictions):
    if prediction != Y_test[i]:
        print('mis-classified: ', people[i])
print()
print('accuracy: ', sum(predictions[:,0] == Y_test)/Y_test.shape[0])

mis-classified:  ting-pei
mis-classified:  xiu-qi

accuracy:  0.8181818181818182


Again we have 82% accuracy, which is awesome!  I was surprised that ting-pei was classified as male, because ting by itself would be classified as female and pei is more commonly a feminine name based on my experience. I knew xiu-qi was going to be mis-classified right off the bat, and it was. I am a bit upset that the model DID NOT mis-classify zi-jun, because most humans would likely have mis-classified this one.
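
The two evaluation cells above repeat the same comparison logic. It could be factored into a small helper; the sketch below uses the hypothetical name `report_accuracy` and takes predictions as a plain 0/1 array, so it runs without a trained model.

```python
import numpy as np

def report_accuracy(names, predictions, labels):
    """Print misclassified names and return the fraction predicted correctly."""
    predictions = np.asarray(predictions).reshape(-1)  # accept (n,) or (n, 1)
    labels = np.asarray(labels)
    for name, predicted, actual in zip(names, predictions, labels):
        if predicted != actual:
            print('mis-classified: ', name)
    return float(np.mean(predictions == labels))

# Stand-in predictions for illustration (not model output).
demo_names = ['ting-pei', 'xin', 'jin-hao']
demo_labels = [0, 0, 1]
demo_predictions = [[1], [0], [1]]
print('accuracy: ', report_accuracy(demo_names, demo_predictions, demo_labels))
```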

### 4.3 Gender Prediction of My Immediate Family Members [80% Accuracy]
The model also achieved 80% accuracy on 15 of my close family members, whose names are withheld for privacy.  3 of them were mis-classified, and 2 of those 3 have names that tend to be mis-classified even by humans.

## 5. Conclusion
I am extremely pleased with the result.  80% accuracy means 4 out of 5 names are classified correctly.  In my experience, this is about right: while most people have names that follow the statistical norm for their gender, a fair number of people deviate from it.  
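
With test lists this small, the uncertainty around each accuracy figure is large. As a rough illustration, a Wilson score interval for 9 correct out of 11 can be computed as follows. The choice of the Wilson interval and the 95% level (`z = 1.96`) are assumptions of this sketch, not part of the original analysis.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Approximate confidence interval for a proportion (Wilson score)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width

low, high = wilson_interval(9, 11)
print(round(low, 2), round(high, 2))
```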

If you are familiar with Chinese names, please help me by taking the randomly generated quiz below and let me know whether you did better than my A.I. Gender Predictor! Please share your results with jchan70@gatech.edu.

### Application
Advertising and recommendation systems are applications that can benefit greatly from a gender predictor.  When the gender of a user is unavailable but the nationality is identified as Chinese, we may predict the gender of the user with fairly high certainty.  Products such as make-up and purses are statistically favored by females, whereas items such as boxing gloves are statistically less attractive to females.  

### Randomly Generated Quiz

In [33]:
quiz = df.sample(20)
quiz['Name']

4740          wei-jie
9490              mei
4550             qian
372          tian-hui
8803    liu-meng-ying
5891           jia-yi
361          hao-yang
7663             yuan
1099            shuai
5304               yi
5407         dan-yang
604             cheng
6595         zhuo-wan
1589          shu-ren
2451          yi-long
4863          zi-jian
9398          yu-ting
1625         xiao-fei
940         ying-feng
2349        shao-qian
Name: Name, dtype: object

### Quiz Answers (Spoiler Alert)

In [34]:
quiz['Gender']

4740    1
9490    0
4550    1
372     1
8803    0
5891    0
361     1
7663    0
1099    1
5304    0
5407    0
604     1
6595    0
1589    1
2451    1
4863    1
9398    0
1625    1
940     1
2349    1
Name: Gender, dtype: object

### References
1. Wikipedia, "Pinyin": https://en.wikipedia.org/w/index.php?title=Pinyin&oldid=856531498
2. Dataset, "9800 Chinese Names with Gender": https://www.researchgate.net/publication/269630594_9800_Chinese_Names_with_Gender
3. Chinese to Pinyin translator, xpinyin: https://github.com/lxneng/xpinyin
