## Lesson 12 - Gender Prediction Based on Name




### Table of Contents
* [Gender Prediction Based on English Name](#EnglishName)
* [Gender Prediction Based on Chinese Name](#ChineseName)


<a id="EnglishName"></a>
## Gender Prediction Based on English Name

Detecting patterns is a central part of Natural Language Processing. Words ending in "-ed" tend to be past tense verbs. Frequent use of "will" is indicative of news text. These observable patterns (word structure and word frequency) happen to correlate with particular aspects of meaning, such as tense and topic. But how did we know where to start looking, and which aspects of form to associate with which aspects of meaning?

The goal of this chapter is to answer the following questions:

* How can we identify particular features of language data that are salient for classifying it?
* How can we construct models of language that can be used to perform language processing tasks automatically?
* What can we learn about language from these models?

Along the way we will study some important machine learning techniques, including decision trees, naive Bayes' classifiers, and maximum entropy classifiers. We will gloss over the mathematical and statistical underpinnings of these techniques, focusing instead on how and when to use them (see the Further Readings section for more technical background). Before looking at these methods, we first need to appreciate the broad scope of this topic.

### Supervised Classification

Classification is the task of choosing the correct class label for a given input. In basic classification tasks, each input is considered in isolation from all other inputs, and the set of labels is defined in advance. Some examples of classification tasks are:

* Deciding whether an email is spam or not.
* Deciding what the topic of a news article is, from a fixed list of topic areas such as "sports," "technology," and "politics."
* Deciding whether a given occurrence of the word "bank" is used to refer to a river bank, a financial institution, the act of tilting to the side, or the act of depositing something in a financial institution.

The basic classification task has a number of interesting variants. For example, in multi-label classification, each instance may be assigned multiple labels; in open-class classification, the set of labels is not defined in advance; and in sequence classification, a list of inputs is jointly classified.

A classifier is called supervised if it is built based on training corpora containing the correct label for each input. The framework used by supervised classification is shown in the figure below.
<img src="images/nltk-supervised-classification.png"><br>
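
The framework in the figure can be sketched in a few lines of plain Python. This is a minimal illustration, not the NLTK classifier used later in this lesson: the toy names, the `last_letter_feature` extractor, and the `train` / `classify` helpers are all hypothetical, and the "model" simply remembers the majority label seen for each feature value.

```python
from collections import Counter, defaultdict

# Hypothetical labeled training corpus: (input, correct label) pairs.
train_corpus = [
    ("Anna", "F"), ("Maria", "F"), ("Julia", "F"),
    ("John", "M"), ("Kevin", "M"), ("Brian", "M"),
]

def last_letter_feature(name):
    """Feature extractor: reduce a raw input to the feature the model sees."""
    return name[-1].lower()

def train(corpus):
    """Training: count labels per feature value and keep the majority label."""
    counts = defaultdict(Counter)
    for name, label in corpus:
        counts[last_letter_feature(name)][label] += 1
    return {feature: c.most_common(1)[0][0] for feature, c in counts.items()}

def classify(model, name, default="M"):
    """Prediction: apply the same feature extractor, then look up the label."""
    return model.get(last_letter_feature(name), default)

model = train(train_corpus)
print(classify(model, "Sophia"))  # last letter 'a' was only seen with label F
print(classify(model, "Ethan"))   # last letter 'n' was only seen with label M
```

The important point is that the same feature extractor is applied both during training and during prediction.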

### Prepare training data

<img src="images/English_name_dataset.png">

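The name data used for training has three columns: name, gender, and weight.<br>
Below we use the Naive Bayes Classifier from NLTK to predict gender from this name data. First, load the English name data.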

In [None]:
import os
import random
from zipfile import ZipFile
from nltk import NaiveBayesClassifier, classify

# English Name gender guess
gender_map = {'M': 0, 'F': 1}

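# Read eng_names.zip and return a dict() mapping each name to its [male, female] counts
def load_names(zip_file='data/name/English/eng_names.zip'):
    if not os.path.isfile(zip_file):
        # Raise an error instead of calling exit(), so the caller sees what went wrong
        raise FileNotFoundError('{} is missing.'.format(zip_file))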

    names = dict()
    unzip = ZipFile(zip_file, 'r')
    files = unzip.namelist()

    for file in files:
        file = unzip.open(file, 'r').read().decode('utf-8')
        rows = [row.strip().split(',') for row in file.split('\n') if len(row) > 1]
        for row in rows:
            if not len(row) == 3:
                continue
            name = row[0].upper()
            gender = gender_map[row[1].upper()]
            count = int(row[2])
            # adding frequency in names dict based on gender
            if name not in names:
                names[name] = [0, 0]
            names[name][gender] += count
    return names

# Extract the features (attributes / inputs / predictor variables) from a name string
def extract_feature(name: str):
    """
    :param name: string
    :return: dict of feature values
    """
    name = name.upper()
    return {
        'last_1': name[-1],
        'last_2': name[-2:],
        'last_3': name[-3:],
        'last_is_vowel': (name[-1] in 'AEIOUY')
    }

# Compute the probability of each gender for a name, using:
# male_probability = total_male_count / (total_male_count + total_female_count)
def get_probability_distribution(name_tuple):
    """
    :param name_tuple: (name, male_freq_count, female_freq_count)
    :return: male, female probability
    """
    male_counts = name_tuple[1]
    female_counts = name_tuple[2]
    male_prob = male_counts / (male_counts + female_counts)
    # Keep the probabilities away from exactly 0 and 1
    if male_prob == 1.0:
        male_prob = 0.99
    elif male_prob == 0.00:
        male_prob = 0.01
    female_prob = 1.0 - male_prob
    return male_prob, female_prob

def validate_data_set(feature_set: list):
    """
    Print the shape of the feature matrix as a sanity check.
    :param feature_set: list of (features, gender) pairs
    :return: None
    """
    import pandas as pd

    data_list = []
    for feature_value, gender in feature_set:
        data_list.append({**feature_value, 'gender': gender})

    df = pd.DataFrame(data_list)
    print('Feature matrix shape - ', df.shape)
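
To see what `get_probability_distribution` does with a name that is used by only one gender, the sketch below repeats the same arithmetic on two made-up tuples. The counts are hypothetical and not taken from the dataset.

```python
def get_probability_distribution(name_tuple):
    """Same arithmetic as above, for (name, male_count, female_count)."""
    male_counts, female_counts = name_tuple[1], name_tuple[2]
    male_prob = male_counts / (male_counts + female_counts)
    # Only the exact extremes are adjusted; everything else passes through.
    if male_prob == 1.0:
        male_prob = 0.99
    elif male_prob == 0.0:
        male_prob = 0.01
    return male_prob, 1.0 - male_prob

mixed = get_probability_distribution(("ALEX", 30, 10))
male_only = get_probability_distribution(("DAVID", 50, 0))
print(mixed)      # male share is 30 / 40
print(male_only)  # moved away from exactly (1.0, 0.0)
```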

In [None]:
def split_names(names: dict):
    """
    Convert each entry into a (name, male_freq_count, female_freq_count) tuple.
    :param names: dict() containing names and their frequency counts
    :return: names tuple (male_names, female_names)
    """
    if not names:
        raise ValueError('names dict is empty.')

    male_names = list()
    female_names = list()

    for name in names.keys():
        counts = names[name]
        # converting into tuple (name, male_freq_count, female_freq_count)
        male_counts, female_counts = counts[0], counts[1]
        data = (name, male_counts, female_counts)
        if male_counts == female_counts:
            continue
        if male_counts > female_counts:
            male_names.append(data)
        else:
            female_names.append(data)

    names = (male_names, female_names)
    total_males_names = len(male_names)
    total_females_names = len(female_names)
    total_names = total_females_names + total_males_names
    print('Total names - {} \n Total males names - '
          '{} \n Total female names - {}'.format(total_names, total_males_names,
                                                 total_females_names))
    return names

In [None]:
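# Prepare the feature matrix [X] and the label vector [y] for the supervised learning model
def prepare_data_set():
    """
    :return: shuffled list of (features, gender) pairs
    """
    feature_set = list()
    male_names, female_names = split_names(load_names())
    names = {'M': male_names, 'F': female_names}

    for gender in names.keys():
        for name in names[gender]:
            features = extract_feature(name[0])
            # Note: m_prob / f_prob come from the same counts that decide the label,
            # and extract_feature() does not produce them at prediction time,
            # so the accuracy printed by train_and_test() may be optimistic.
            male_prob, female_prob = get_probability_distribution(name)
            features['m_prob'] = male_prob
            features['f_prob'] = female_prob
            feature_set.append((features, gender))
    random.shuffle(feature_set)
    return feature_set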

In [None]:
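# Split into training and test sets, fit the model, and report its accuracy
def train_and_test(train_percent=0.80):
    """
    :param train_percent: fraction of the data used for training (the rest is the test set)
    :return: trained classifier
    """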
    feature_set = prepare_data_set()
    validate_data_set(feature_set)
    random.shuffle(feature_set)
    total = len(feature_set)
    cut_point = int(total * train_percent)
    # splitting Dataset into train and test
    train_set = feature_set[:cut_point]
    test_set = feature_set[cut_point:]

    # fitting feature matrix to the model
    classifier = NaiveBayesClassifier.train(train_set)
    print('{} Accuracy- {}'.format('Naive Bayes', classify.accuracy(classifier, test_set)))
    print('Most informative features')
    informative_features = classifier.most_informative_features(n=5)
    for feature in informative_features:
        print("\t {} = {} ".format(*feature))
    return classifier
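
The split inside `train_and_test` is a plain hold-out split: shuffle, then cut the list at `train_percent`. Below is a minimal sketch of that step on placeholder data. The `holdout_split` helper and the fixed seed are additions for illustration (the lesson's code shuffles without a seed, so its numbers vary between runs).

```python
import random

def holdout_split(items, train_percent=0.80, seed=0):
    """Shuffle a copy of items and cut it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded only for reproducibility
    cut_point = int(len(shuffled) * train_percent)
    return shuffled[:cut_point], shuffled[cut_point:]

train_set, test_set = holdout_split(range(10))
print(len(train_set), len(test_set))
```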

In [None]:
model_classifier = train_and_test()

In [None]:
# Predict the gender for a given name
def gender_guess(eng_name):
    result = model_classifier.classify(extract_feature(eng_name))
    if gender_map[result.upper()]:
        result = 'Female'
    else:
        result = 'Male'
    return result

In [None]:
res = gender_guess("David")
print(res)

In [None]:
res = gender_guess("Emily")
print(res)

In [None]:
### extract_feature
# last_is_vowel is False
extract_feature("David")

In [None]:
# last_is_vowel is True
extract_feature("Emily")

### Save English name guesser model

In [None]:
import pickle
with open('model/NaiveBayes/lesson11/eng_name_gender_classifier.pkl', 'wb') as handle:
    pickle.dump(model_classifier, handle, protocol=pickle.HIGHEST_PROTOCOL)

### Load English name guesser model

In [1]:
import pickle
with open('model/NaiveBayes/lesson11/eng_name_gender_classifier.pkl', 'rb') as handle:
    eng_name_gender_model = pickle.load(handle)

In [2]:
gender_map = {'M': 0, 'F': 1}

def extract_feature(name: str):
    name = name.upper()
    return {
        'last_1': name[-1],
        'last_2': name[-2:],
        'last_3': name[-3:],
        'last_is_vowel': (name[-1] in 'AEIOUY')
    }

def eng_name_gender_guess(eng_name_gender_model, eng_name):
    result = eng_name_gender_model.classify(extract_feature(eng_name))
    if gender_map[result.upper()]:
        result = 'Female'
    else:
        result = 'Male'
    return result

In [3]:
eng_name_gender_guess(eng_name_gender_model, 'David')

'Male'

In [4]:
eng_name_gender_guess(eng_name_gender_model, 'Emily')

'Female'

### Homework
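- The model predicts from a first name. What about names that include a middle name?
- Try it with your own English name.
- Write code that extracts the first names of English names from Chinese and English articles, and predict their gender.
- What application scenarios can you think of?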

<a id="ChineseName"></a>
## Gender Prediction Based on Chinese Name


### Prepare training data

<img src="images/Chinese_name_dataset.png">

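The name data used for training has three columns: id, Chinese given name (without the surname), and a gender flag, where 0 is female and 1 is male.
Below we build a naive Bayes classifier by hand (rather than using NLTK) to predict gender from the name data. First, load the Chinese name data.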

In [5]:
import pandas as pd
import math
from collections import defaultdict

# Chinese name gender guess: 0 is female (女生), 1 is male (男生)
gender_map = {0: '女生', 1: '男生'}

# Load the training set, the test set, and the sample submission
train = pd.read_csv('data/name/Chinese/train.csv')
test = pd.read_csv('data/name/Chinese/test.csv')
submit = pd.read_csv('data/name/Chinese/sample_submit.csv')

In [6]:
train.head()

| | id | name | gender |
|---|---|---|---|
| 0 | 1 | 閎家 | 1 |
| 1 | 2 | 玉瓔 | 0 |
| 2 | 3 | 於鄴 | 1 |
| 3 | 4 | 越英 | 0 |
| 4 | 5 | 蘊萱 | 0 |


In [40]:
test.head()

| | id | name | pred |
|---|---|---|---|
| 0 | 0 | 辰君 | 0 |
| 1 | 1 | 佳遙 | 0 |
| 2 | 2 | 淼劍 | 1 |
| 3 | 3 | 浩苳 | 1 |
| 4 | 4 | 儷妍 | 0 |


In [41]:
submit.head()

| | id | gender |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 0 |
| 2 | 2 | 1 |
| 3 | 3 | 1 |
| 4 | 4 | 0 |


In [7]:
# Split the data into female and male parts
names_female = train[train['gender'] == 0]
names_male = train[train['gender'] == 1]

# totals holds the number of female and male names in the training set
totals = {'f': len(names_female),
          'm': len(names_male)}

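### Compute how often each character appears among all female (male) names

This step amounts to computing

$$P(X_i \mid \text{female}) \quad \text{and} \quad P(X_i \mid \text{male})$$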

In [8]:
frequency_list_f = defaultdict(int)
for name in names_female['name']:
    for char in name:
        frequency_list_f[char] += 1. / totals['f']

frequency_list_m = defaultdict(int)
for name in names_male['name']:
    for char in name:
        frequency_list_m[char] += 1. / totals['m']
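
The same computation on a toy list of names makes these frequencies easy to check by hand. The names below are made up for illustration, and `char_frequencies` is a hypothetical helper rather than part of the lesson's code.

```python
from collections import defaultdict

def char_frequencies(names):
    """Occurrences of each character divided by the number of names."""
    freq = defaultdict(float)
    for name in names:
        for char in name:
            freq[char] += 1.0 / len(names)
    return freq

toy_female_names = ["麗娟", "淑娟", "雅婷", "怡君"]
freq_f = char_frequencies(toy_female_names)
print(freq_f["娟"])  # appears in 2 of the 4 names
print(freq_f["婷"])  # appears in 1 of the 4 names
```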

In [9]:
print(frequency_list_f['娟'])

0.004144009000562503


In [10]:
print(frequency_list_m['鋼'])

0.00031498425078746044


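### Naive Bayes and Laplace Smoothing

The two examples above show that:<br>
P(name contains 娟 | female) = 0.004144<br>
P(name contains 鋼 | male) = 0.000315<br>
Because the prediction set may contain Chinese characters that never appear in the training set, the naive Bayes model runs into the problem shown below, where a feature x does not exist in the dataset at all.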

<img src="images/bayes_problems.jpg">

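We therefore need to apply Laplace smoothing to the frequencies. What is Laplace smoothing, and how is it done?

Laplace smoothing, also called add-one smoothing, is a commonly used smoothing method. Smoothing methods exist to solve the problem of zero probabilities.

Laplace's solution is as follows.

For a random variable z that takes values in {1, 2, 3, ..., k}, given the observations {z(1), z(2), ..., z(m)} from m trials, the maximum likelihood estimate is computed as:

<img src="images/laplace_smoothing_0.jpg">

After applying Laplace smoothing:

<img src="images/laplace_smoothing_1.jpg">

That is, add 1 to the numerator and add the number of values the variable can take to the denominator.
Applied to the naive Bayes problem, the Laplace-smoothed estimate becomes: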

<img src="images/laplace_smoothing_ok.jpg">
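
For readers who cannot see the images, the usual textbook form of the two estimates is written out here. This is the standard statement of add-one smoothing, offered as an assumption about what the figures show rather than a transcription of them. The symbol $\phi_j$ denotes the estimated probability that $z = j$, and $1\{\cdot\}$ is the indicator function.

$$\phi_j = \frac{\sum_{i=1}^{m} 1\{z^{(i)} = j\}}{m} \quad \text{(maximum likelihood)}$$

$$\phi_j = \frac{\sum_{i=1}^{m} 1\{z^{(i)} = j\} + 1}{m + k} \quad \text{(Laplace smoothing)}$$

With smoothing, a value that was never observed gets probability $1 / (m + k)$ instead of 0, and the $k$ estimates still sum to 1.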

In [13]:
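def LaplaceSmooth(char, frequency_list, total, alpha=1.0):
    # Recover the raw count from the stored relative frequency.
    # .get() avoids inserting unseen characters into the defaultdict,
    # which would otherwise change distinct_chars on later calls.
    count = frequency_list.get(char, 0) * total
    distinct_chars = len(frequency_list)
    freq_smooth = (count + alpha) / (total + distinct_chars * alpha)
    return freq_smooth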
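
A quick numeric check of the formula in `LaplaceSmooth`, using made-up numbers: with 1,000 names, 200 distinct characters, and alpha = 1, a character never seen in training gets a small nonzero probability instead of 0. The sketch restates the formula with plain counts (`laplace_smooth` is a hypothetical helper) so that it runs on its own.

```python
def laplace_smooth(count, total, distinct_chars, alpha=1.0):
    """(count + alpha) / (total + distinct_chars * alpha), as in LaplaceSmooth."""
    return (count + alpha) / (total + distinct_chars * alpha)

unseen = laplace_smooth(count=0, total=1000, distinct_chars=200)
seen = laplace_smooth(count=40, total=1000, distinct_chars=200)
print(unseen)  # 1 / 1200
print(seen)    # 41 / 1200
```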

In [14]:
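# Base score per class: the log prior plus the log-probability that
# every character in that class's vocabulary is absent from the name.
# train['gender'].mean() is the share of male names (gender == 1).
base_f = math.log(1 - train['gender'].mean())
base_f += sum([math.log(1 - frequency_list_f[char]) for char in frequency_list_f])

base_m = math.log(train['gender'].mean())
base_m += sum([math.log(1 - frequency_list_m[char]) for char in frequency_list_m])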

bases = {'f': base_f, 'm': base_m}

$$\log{ P(X_i=1 \mid Y )  } - \log{ P(X_i=0 \mid Y) }$$
For each character that appears in the name, we compute this quantity with the following function:

In [17]:
def GetLogProb(char, frequency_list, total):
    freq_smooth = LaplaceSmooth(char, frequency_list, total)
    return math.log(freq_smooth) - math.log(1 - freq_smooth)
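
The functions here work with logarithms rather than raw probabilities. One practical reason, sketched below with made-up numbers, is that multiplying many small probabilities underflows to zero in floating point, while adding their logarithms stays well within range.

```python
import math

probs = [0.01] * 200  # hypothetical per-character probabilities

product = 1.0
for p in probs:
    product *= p  # 0.01 ** 200 is far below the smallest positive float

log_sum = sum(math.log(p) for p in probs)

print(product)  # underflows to 0.0
print(log_sum)  # 200 * log(0.01), an ordinary negative float
```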

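For example, for a name that contains only character 2, the prediction is

$$y_{pred} = \arg\max_y \; P(Y=y) \prod_{i=1}^{n} P(X_i=0 \mid Y=y) \, \frac{P(X_2=1 \mid Y=y)}{P(X_2=0 \mid Y=y)}$$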
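
The decomposition behind this formula can be checked numerically. In a Bernoulli naive Bayes model, each known character contributes $P(X_i=1 \mid Y)$ if it is in the name and $P(X_i=0 \mid Y)$ otherwise. Starting from the "all characters absent" score and adding $\log P(X_i=1 \mid Y) - \log P(X_i=0 \mid Y)$ for each character that is present gives the same total. The sketch uses made-up probabilities and placeholder characters.

```python
import math

prior = 0.5                                     # hypothetical P(Y = y)
p_present = {"a": 0.20, "b": 0.05, "c": 0.10}   # hypothetical P(X_i = 1 | Y = y)
name = ["b"]                                    # characters present in the name

# Direct Bernoulli likelihood: every known character is either present or absent.
direct = math.log(prior)
for char, p in p_present.items():
    direct += math.log(p) if char in name else math.log(1 - p)

# Base-plus-correction form: assume all absent, then fix up the present ones.
base = math.log(prior) + sum(math.log(1 - p) for p in p_present.values())
corrected = base + sum(
    math.log(p_present[c]) - math.log(1 - p_present[c]) for c in name
)

print(direct, corrected)  # the two scores agree up to rounding
```

The base term depends only on the class, so it can be computed once, which is what the `bases` dictionary stores.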

In [19]:
def ComputeLogProb(name, bases, totals, frequency_list_m, frequency_list_f):
    logprob_m = bases['m']
    logprob_f = bases['f']
    for char in name:
        logprob_m += GetLogProb(char, frequency_list_m, totals['m'])
        logprob_f += GetLogProb(char, frequency_list_f, totals['f'])
    return {'male': logprob_m, 'female': logprob_f}

def GetGender(LogProbs):
    # True (== 1) means male and False (== 0) means female, matching gender_map
    return LogProbs['male'] > LogProbs['female']

def pred_name(name):
    LogProbs = ComputeLogProb(name, bases, totals, frequency_list_m, frequency_list_f)
    gender = GetGender(LogProbs)
    result = gender_map[gender]
    return result

result = []
for name in test['name']:
    LogProbs = ComputeLogProb(name, bases, totals, frequency_list_m, frequency_list_f)
    gender = GetGender(LogProbs)
    result.append(int(gender))

submit['gender'] = result
submit.to_csv('data/name/Chinese/my_NB_prediction.csv', index=False)

In [20]:
test['pred'] = result
test.head(5)

| | id | name | pred |
|---|---|---|---|
| 0 | 0 | 辰君 | 0 |
| 1 | 1 | 佳遙 | 0 |
| 2 | 2 | 淼劍 | 1 |
| 3 | 3 | 浩苳 | 1 |
| 4 | 4 | 儷妍 | 0 |


In [21]:
gdr = pred_name("芊霈")
print(gdr)

女生


In [22]:
gdr = pred_name("紘霆")
print(gdr)

男生


### Save Chinese name guesser model

In [25]:
model_dict = {}
model_dict["totals"] = totals
model_dict["bases"] = bases
model_dict["frequency_list_m"] = frequency_list_m
model_dict["frequency_list_f"] = frequency_list_f

import pickle
with open('model/NaiveBayes/lesson11/cht_name_gender_classifier.pkl', 'wb') as handle:
    pickle.dump(model_dict, handle, protocol=pickle.HIGHEST_PROTOCOL)
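
Because the frequency tables are `defaultdict` objects, it helps to know that they keep their default factory through a pickle round trip. Below is a small in-memory sketch, using `io.BytesIO` instead of a file and a toy table rather than the real model.

```python
import io
import pickle
from collections import defaultdict

table = defaultdict(int)
table["娟"] = 0.004  # toy value, not the real frequency

buffer = io.BytesIO()
pickle.dump({"frequency_list_f": table}, buffer, protocol=pickle.HIGHEST_PROTOCOL)
buffer.seek(0)
restored = pickle.load(buffer)["frequency_list_f"]

print(type(restored).__name__)  # still a defaultdict
print(restored["鋼"])           # missing keys still default to 0
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.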

### Load Chinese name guesser model

In [36]:
import math
import pickle
with open('model/NaiveBayes/lesson11/cht_name_gender_classifier.pkl', 'rb') as handle:
    cht_name_gender_model = pickle.load(handle)

In [37]:
totals = cht_name_gender_model['totals']
bases = cht_name_gender_model['bases']
frequency_list_m = cht_name_gender_model['frequency_list_m']
frequency_list_f = cht_name_gender_model['frequency_list_f']

gender_map = {0: '女生', 1: '男生'}

def GetGender(LogProbs):
    return LogProbs['male'] > LogProbs['female']

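def LaplaceSmooth(char, frequency_list, total, alpha=1.0):
    # Recover the raw count from the stored relative frequency.
    # .get() avoids inserting unseen characters into the defaultdict,
    # which would otherwise change distinct_chars on later calls.
    count = frequency_list.get(char, 0) * total
    distinct_chars = len(frequency_list)
    freq_smooth = (count + alpha) / (total + distinct_chars * alpha)
    return freq_smooth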

def GetLogProb(char, frequency_list, total):
    freq_smooth = LaplaceSmooth(char, frequency_list, total)
    return math.log(freq_smooth) - math.log(1 - freq_smooth)

def ComputeLogProb(name, bases, totals, frequency_list_m, frequency_list_f):
    logprob_m = bases['m']
    logprob_f = bases['f']
    for char in name:
        logprob_m += GetLogProb(char, frequency_list_m, totals['m'])
        logprob_f += GetLogProb(char, frequency_list_f, totals['f'])
    return {'male': logprob_m, 'female': logprob_f}

def cht_name_gender_guess(name):
    LogProbs = ComputeLogProb(name, bases, totals, frequency_list_m, frequency_list_f)
    gender = GetGender(LogProbs)
    result = gender_map[gender]
    return result

In [38]:
cht_name_gender_guess('謝霆峰')

'男生'

In [39]:
cht_name_gender_guess('蔡依琳')

'女生'

### Homework
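- The model predicts from the given name only. How well does it work if the surname is included?
- Try it with your own Chinese name.
- Write code that extracts the first names of English names from Chinese and English articles, and predict their gender.
- Where could Chinese name gender prediction be applied?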