### Gender Checker by Name

##### Introduction and Preprocessing

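This section follows [斗大的熊猫](http://blog.topspeedsnail.com/archives/10833) and predicts gender from a person's name.

The code is based on the [Kaggle baby name competition](https://github.com/tensorflow/models/tree/master/namignizer) model and on the code from the Sentiment_With_CNN.ipynb chapter.

Dataset used: [names](https://pan.baidu.com/s/1hsHTEU4)

The dataset file looks like this: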
```
安镶怡,女
饶黎明,男
段焙曦,男
苗芯萌,男
覃慧藐,女
芦玥微,女
```

In [1]:
import tensorflow as tf
import numpy as np
import cPickle
import codecs

from sentiment140 import Sentiment140

In [2]:
tf.__version__

'0.9.0'

##### Part I. Exploring the Dataset

In [3]:
datafile = 'data/name.csv'

def parse_file(fname):
    x, y = [], []
    with codecs.open(fname, 'r', 'utf-8') as f:
        for line in f:
            parts = line.strip().split(',')
            if len(parts) == 2:
                x.append(parts[0])
                y.append([0, 1] if parts[1] == u'男' else [1, 0])
    return x, y

train_x, train_y = parse_file(datafile)
print "number of training set: {}".format(len(train_x))
print "number of male: {}".format(len([y for y in train_y if y == [0, 1]]))
print "number of female: {}".format(len([y for y in train_y if y == [1, 0]]))

number of training set: 351789
number of male: 206543
number of female: 145246


In [4]:
""" Find the length of the longest name, because every name will be padded to this length """
max_name_length = max([len(name) for name in train_x])
print(u"最长名字的字符数: ", max_name_length)

(u'\u6700\u957f\u540d\u5b57\u7684\u5b57\u7b26\u6570: ', 3)


In [5]:
""" Build the vocabulary, assigning an index to every character """
from collections import defaultdict as dd
counter = len(train_x)
vocab = dd(int)
for name in train_x:
    for w in name:
        vocab[w] += 1

# Sort by frequency, highest first, and put a space in front (index 0). The space is used for padding
vocab_list = [' '] + sorted(vocab, key=vocab.get, reverse=True)
print("vocabulary length: {}".format(len(vocab_list)))

vocab_dict = dict([(x, y) for (y, x) in enumerate(vocab_list)])
""" Convert each name into a vector of indices; why it is encoded this way is explained in detail below """
train_x_vec = []
for name in train_x:
    vec = []
    for w in name:
        vec.append(vocab_dict.get(w))
    # padding
    vec = vec + [0] * (max_name_length - len(vec))
    train_x_vec.append(vec)

for i in range(5):
    print train_x_vec[i]

vocabulary length: 6018
[141, 3614, 2203]
[85, 526, 485]
[72, 86, 510]
[130, 758, 69]
[171, 296, 545]
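
To make the encoding above concrete, here is a minimal sketch of the same steps on a toy list of names. The names are made up for illustration, and the alphabetical tie-break in the sort is an addition that only keeps the toy result deterministic; the notebook itself sorts by frequency alone.

```python
from collections import Counter

# Toy data, made up for illustration only (not taken from the real dataset)
names = ["abc", "ab", "bcd"]

# Count how often each character occurs across all names
counts = Counter(ch for name in names for ch in name)

# Index 0 is reserved for the padding character; the remaining characters are
# sorted by descending frequency (ties broken alphabetically, an assumption
# added here so that the result is stable)
vocab_list = [" "] + sorted(counts, key=lambda ch: (-counts[ch], ch))
vocab_dict = {ch: idx for idx, ch in enumerate(vocab_list)}

max_len = max(len(name) for name in names)

def encode(name):
    """Map each character to its index, then right-pad with 0 up to max_len."""
    vec = [vocab_dict[ch] for ch in name]
    return vec + [0] * (max_len - len(vec))

encoded = [encode(name) for name in names]
print(vocab_list)
print(encoded)
```

Every encoded name has length `max_len`, and its entries are vocabulary indices rather than one-hot vectors, which is the form the embedding layer consumes.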


In [6]:
input_size = max_name_length    # explained below: input_size is the padded length, not the vocabulary size; vec is not one-hot
num_classes = 2
print input_size
batch_size = 64    # batch size; used below to slice the training set

3



##### Part II. Layer definitions

An embedding is used later, so let's first get familiar with the tf.nn.embedding_lookup function
```
>>> sess = tf.InteractiveSession()
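>>> w = tf.random_uniform([5, 2], -1.0, 1.0)       # assume a vocabulary of 5 words, each embedded into 2 dimensions, initialized uniformly at random in [-1, 1)
>>> w = sess.run(w)
>>> x = [[1,1,0,0,0], [1,0,0,0,0], [0,0,1,1,1]]     # x is 3 x 5: 3 samples, 5 = vocabulary size; 1 means the word is present in the sample, 0 means it is absent (my initial reading, corrected further below)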
>>> embedded_chars = tf.nn.embedding_lookup(w, x)

>>> w
array([[ 0.82283998,  0.21245265],              # w is indeed random; each row is the 2-dimensional embedding vector of the corresponding vocabulary word
       [-0.737818  , -0.59785843],
       [-0.24692678, -0.69566345],
       [ 0.85945463, -0.21308041],
       [ 0.28053808,  0.88169646]], dtype=float32)
>>> x
[[1, 1, 0, 0, 0], [1, 0, 0, 0, 0], [0, 0, 1, 1, 1]]
>>> sess.run(embedded_chars)
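array([[[-0.737818  , -0.59785843],             # note that embedded_chars is 3 x 5 x 2, which I read as sample count x vocabulary size x embedding dimension
        [-0.737818  , -0.59785843],
        [ 0.82283998,  0.21245265],           # take embedded_chars[0] as an example, i.e. the lookup result for the first sample
        [ 0.82283998,  0.21245265],           # the first sample is [1,1,0,0,0], i.e. it contains the first 2 vocabulary words but not the last 3
        [ 0.82283998,  0.21245265]],           # I expected embedded_chars[0] to be the embedding vectors of the first two words followed by 3 [0,0] rows
                                    # that is not what happens: each 1 is replaced by w[1] and each 0 by w[0]
       [[-0.737818  , -0.59785843],           # this raises a problem: w[2:], i.e. every row of w from the third onward, is never used!
        [ 0.82283998,  0.21245265],           
        [ 0.82283998,  0.21245265],           # I leave this question open for now and come back to it later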
        [ 0.82283998,  0.21245265],
        [ 0.82283998,  0.21245265]],

       [[ 0.82283998,  0.21245265],
        [ 0.82283998,  0.21245265],
        [-0.737818  , -0.59785843],
        [-0.737818  , -0.59785843],
        [-0.737818  , -0.59785843]]], dtype=float32)
```

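##### Updated Understanding !!!

The part above is analysis I wrote earlier while working on Sentiment Analysis. By now I understand what was going on: the x above was wrong!

That analysis assumed x is a vector with as many dimensions as the vocabulary has words, where each element says whether the word at that vocabulary position is present: 1 if it is, 0 otherwise.

That is not how it works. x is indeed a vector, but its length equals the length of the sentence it represents, and each element is the vocabulary index of the word at that position in the sentence. The elements of x are therefore not limited to 0 and 1; they are the indices of the words that actually occur in the sentence.

Finally, the embedding requires the x of every sentence in a batch to have the same length, so padding is needed.

So:

- The data passed to the embedding layer has shape batch_size x padded_sentence_length, i.e. batch_size sentences that have already been padded to the same length
- The embedding layer's weight W has shape vocabulary_size x embedding_size, i.e. one vector of length embedding_size for each word in the vocabulary
- The output of the embedding layer has shape batch_size x padded_sentence_length x embedding_size, i.e. batch_size sentences in which every word has been changed from its vocabulary index to that word's embedding vector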
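
The three shapes listed above can be checked with a small NumPy sketch. NumPy integer-array indexing is used here as a stand-in for `tf.nn.embedding_lookup`, which for a single weight matrix likewise gathers rows by index. The sizes and index values are made up for illustration.

```python
import numpy as np

vocabulary_size, embedding_size = 6, 2   # made-up sizes for illustration
rng = np.random.RandomState(0)

# One row of length embedding_size per vocabulary entry;
# row 0 belongs to the padding symbol
W = rng.uniform(-1.0, 1.0, size=(vocabulary_size, embedding_size))

# A batch of 2 "sentences", already padded with 0 to length 3.
# Each entry is a vocabulary index, not a 0/1 presence flag.
batch = np.array([[4, 2, 5],
                  [3, 1, 0]])

# Integer-array indexing picks row W[i] for every index i in the batch
embedded = W[batch]

print(batch.shape)     # batch_size x padded_sentence_length
print(W.shape)         # vocabulary_size x embedding_size
print(embedded.shape)  # batch_size x padded_sentence_length x embedding_size
```

This also explains the earlier puzzle: an x made only of 0s and 1s can only ever select rows `W[0]` and `W[1]`, so the remaining rows are never touched.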

In [7]:
def neural_network(data, vocabulary_size, embedding_size=128, num_filters=128, dropout_keep_prob=0.5):
    """
    data is a batch of X with shape sample_count x padded_sentence_length; the latter is called input_size
    """
    # Embedding layer: turns every index in the input batch into a vector
    with tf.name_scope("embedding"):
        # W is uniformly random in [-1, 1), with shape vocabulary_size x embedding_size
        W = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
        # As worked out in the section above, embedded_chars has shape sample_count x input_size x embedding_size
        embedded_chars = tf.nn.embedding_lookup(W, data)
        # Append one more dimension at the end, giving sample_count x input_size x embedding_size x 1
        # tf.nn.conv2d expects sample_count x image_height x image_width x num_channels; here num_channels is 1
        # so input_size x embedding_size plays the role of an image's height x width
        embedded_chars_expanded = tf.expand_dims(embedded_chars, -1)

    # Next come the CNN layers. Unlike the sequential CNN used earlier for images, these run in parallel:
    # the output of the embedding layer is fed into 3 convolution branches at the same time
    # The 3 branches use 3 filter sizes; each filter slides in one dimension only (it can be seen as an n x 1
    # two-dimensional filter), and each branch has num_filters filters
    filter_sizes = [1, 2, 3]
    pooled_outputs = []
    for i, filter_size in enumerate(filter_sizes):
        with tf.name_scope("conv-maxpool-{}-{}".format(i, filter_size)):
            # Prepare the weights, starting with their shape
            # Each input sample is input_size x embedding_size, so each filter is filter_size x embedding_size
            # The filter spans the full embedding_size dimension, so it only slides along input_size
            # Each branch has num_filters filters, and the input has a single channel
            filter_shape = [filter_size, embedding_size, 1, num_filters]
            W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1))
            b = tf.Variable(tf.constant(0.1, shape=[num_filters]))
            # Stride 1 with VALID padding: conv is sample_size x (input_size - filter_size + 1) x 1 x num_filters
            conv = tf.nn.conv2d(embedded_chars_expanded, W, strides=[1, 1, 1, 1], padding="VALID")
            h = tf.nn.relu(tf.nn.bias_add(conv, b))
            # The pooling window is input_size - filter_size + 1 long, so it covers every position of conv
            # and leaves a single value per filter: pooled is sample_size x 1 x 1 x num_filters
            pooled = tf.nn.max_pool(h, ksize=[1, input_size - filter_size + 1, 1, 1], strides=[1,1,1,1], padding="VALID")
            pooled_outputs.append(pooled)
    
    # pooled_outputs is a list of pooled tensors, each of shape sample_size x 1 x 1 x num_filters
    num_filters_total = num_filters * len(filter_sizes)
    # Concatenate along dimension 3; the result is sample_size x 1 x 1 x num_filters_total
    h_pool = tf.concat(3, pooled_outputs)
    # Flatten while keeping the last dimension: the result is 2-D, (sample_size * 1 * 1) x num_filters_total
    h_pool_flat = tf.reshape(h_pool, [-1, num_filters_total])
    
    # Dropout
    with tf.name_scope("dropout"):
        # Shape is still sample_size x num_filters_total
        h_drop = tf.nn.dropout(h_pool_flat, dropout_keep_prob)

    # Fully connected output layer
    with tf.name_scope("output"):
        # W = tf.get_variable("W", shape=[num_filters_total, num_classes], initializer=tf.contrib.layers.xavier_initializer())
        W = tf.Variable(tf.truncated_normal([num_filters_total, num_classes], stddev=0.5))
        b = tf.Variable(tf.constant(0.1, shape=[num_classes]))
        # Shape is sample_size x num_classes
        output = tf.nn.xw_plus_b(h_drop, W, b)
        return output
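
The shape bookkeeping in the convolution and pooling steps can be traced with a NumPy sketch for a single embedded name. This is an illustration of the arithmetic only, not the TensorFlow implementation; the sizes are made up, and `conv_relu_maxpool` is a hypothetical helper introduced just for this example.

```python
import numpy as np

input_size, embedding_size, num_filters = 3, 4, 5   # small made-up sizes
rng = np.random.RandomState(0)
sample = rng.randn(input_size, embedding_size)      # one embedded name

def conv_relu_maxpool(sample, filter_size):
    """VALID convolution along the sequence axis, ReLU, then max over all positions."""
    filters = rng.randn(filter_size, embedding_size, num_filters)
    bias = np.full(num_filters, 0.1)
    # With VALID padding and stride 1 the filter fits in this many positions
    positions = sample.shape[0] - filter_size + 1
    conv = np.empty((positions, num_filters))
    for p in range(positions):
        window = sample[p:p + filter_size]           # filter_size x embedding_size
        # Sum over both window axes, giving one number per filter
        conv[p] = np.tensordot(window, filters, axes=([0, 1], [0, 1])) + bias
    h = np.maximum(conv, 0.0)                        # ReLU
    return conv.shape, h.max(axis=0)                 # pooled has length num_filters

pooled_outputs = []
for filter_size in [1, 2, 3]:
    conv_shape, pooled = conv_relu_maxpool(sample, filter_size)
    print(filter_size, conv_shape, pooled.shape)
    pooled_outputs.append(pooled)

# Concatenating the three branches gives num_filters * 3 features per name
features = np.concatenate(pooled_outputs)
print(features.shape)
```

With `input_size = 3`, the three branches produce 3, 2 and 1 positions respectively, and max pooling reduces each of them to a single value per filter.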

##### Part III. Training

In [9]:
X = tf.placeholder(tf.int32, [None, input_size])
Y = tf.placeholder(tf.float32, [None, num_classes])
dropout_keep_prob = tf.placeholder(tf.float32)

In [12]:
total_epochs = 11
save_interval = 5

output = neural_network(X, len(vocab_list), dropout_keep_prob=dropout_keep_prob)
optimizer = tf.train.AdamOptimizer(1e-3)
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(output, Y))
# optimizer = optimizer.minimize(loss)
grads_and_vars = optimizer.compute_gradients(loss)
train_op = optimizer.apply_gradients(grads_and_vars)   
    
num_batch = len(train_x_vec) // batch_size
    
# saver = tf.train.Saver(tf.trainable_variables())
saver = tf.train.Saver(tf.global_variables())
sess = tf.Session()
sess.run(tf.initialize_all_variables())
for e in range(total_epochs):
    for i in range(num_batch):
        batch_x = train_x_vec[i * batch_size: (i + 1) * batch_size]
        batch_y = train_y[i * batch_size: (i + 1) * batch_size]
        _, loss_ = sess.run([train_op, loss], feed_dict={X:batch_x, Y:batch_y, dropout_keep_prob:0.5})
        if i % 100 == 0:
            print("epoch {} - batch {} - loss: {}".format(e, i, loss_))
            
    if e % save_interval == 0:
        saver.save(sess, 'name2sex.model', global_step=e)

epoch 0 - batch 0 - loss: 4.71256637573
epoch 0 - batch 100 - loss: 3.49745106697
epoch 0 - batch 200 - loss: 3.74005746841
epoch 0 - batch 300 - loss: 2.46446561813
epoch 0 - batch 400 - loss: 2.21196365356
epoch 0 - batch 500 - loss: 0.994374752045
epoch 0 - batch 600 - loss: 1.73235440254
epoch 0 - batch 700 - loss: 2.10601425171
epoch 0 - batch 800 - loss: 1.15178155899
epoch 0 - batch 900 - loss: 1.29807972908
epoch 0 - batch 1000 - loss: 0.872881770134
epoch 0 - batch 1100 - loss: 1.08145427704
epoch 0 - batch 1200 - loss: 1.27475547791
epoch 0 - batch 1300 - loss: 1.11164855957
epoch 0 - batch 1400 - loss: 0.794882953167
epoch 0 - batch 1500 - loss: 0.654997527599
epoch 0 - batch 1600 - loss: 0.783994793892
epoch 0 - batch 1700 - loss: 0.78123819828
epoch 0 - batch 1800 - loss: 0.477663934231
epoch 0 - batch 1900 - loss: 0.454263865948
epoch 0 - batch 2000 - loss: 0.395449876785
epoch 0 - batch 2100 - loss: 0.461769223213
epoch 0 - batch 2200 - loss: 0.387417137623
epoch 0 - ba
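
For reference when reading the loss values, here is a NumPy sketch of the quantity that `tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(...))` computes: the batch mean of the cross-entropy between the one-hot labels and the softmax of the logits. The logits are made up for illustration, and `softmax_cross_entropy` is a hypothetical helper, not a TensorFlow function.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean over the batch of -sum(labels * log(softmax(logits)))."""
    # Subtract the row maximum first for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-(labels * log_probs).sum(axis=1).mean())

# Made-up logits for two names; labels use the notebook's encoding:
# [1, 0] = female, [0, 1] = male
logits = np.array([[2.0, 0.0],
                   [0.0, 0.0]])
labels = np.array([[1.0, 0.0],
                   [0.0, 1.0]])

loss = softmax_cross_entropy(logits, labels)
print(loss)
```

A model that outputs equal logits for both classes has a loss of ln 2 (about 0.693) per sample, which is a useful reference point: values well above it mean the model is confidently wrong on part of the batch.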

In [18]:
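def create_name_vec(name):
    vec = []
    for w in name:
        # Characters not seen during training fall back to index 0, the padding index
        vec.append(vocab_dict.get(w, 0))
    # Pad with 0 up to max_name_length
    vec = vec + [0] * (max_name_length - len(vec))
    # Wrap in a list so the result is a batch containing a single sample
    x = [vec]
    return x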

In [20]:
x = create_name_vec(u'白富美')
pred = tf.argmax(output, 1)
res = sess.run(pred, {X:x, dropout_keep_prob:1.0})
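# Note: `name` is not assigned in this cell, so the name printed below is a stale value
# left over from an earlier cell; the predicted label is the one for 白富美
print(name, 'female' if res[0] == 0 else 'male')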

(u'\u9648\u5eb7\u4f73', 'female')


In [22]:
x = create_name_vec(u'杨博宇')
pred = tf.argmax(output, 1)
res = sess.run(pred, {X:x, dropout_keep_prob:1.0})
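# Note: `name` is not assigned in this cell, so the name printed below is a stale value
# left over from an earlier cell; the predicted label is the one for 杨博宇
print(name, 'female' if res[0] == 0 else 'male')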

(u'\u9648\u5eb7\u4f73', 'male')


In [23]:
x = create_name_vec(u'秦香莲')
pred = tf.argmax(output, 1)
res = sess.run(pred, {X:x, dropout_keep_prob:1.0})
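# Note: `name` is not assigned in this cell, so the name printed below is a stale value
# left over from an earlier cell; the predicted label is the one for 秦香莲
print(name, 'female' if res[0] == 0 else 'male')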

(u'\u9648\u5eb7\u4f73', 'female')


In [24]:
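def predict_gender(name):
    x = create_name_vec(name)

    # Build the network in a fresh graph: calling neural_network() again in the
    # training graph would create a second, untrained set of variables that the
    # checkpoint does not contain.
    # Assumption: the checkpoint was saved from a graph built once from a clean
    # state, so the variable names created here match the saved ones.
    with tf.Graph().as_default():
        # Feed the name through a placeholder, as in training
        X = tf.placeholder(tf.int32, [None, input_size])
        # Dropout is disabled at prediction time (keep probability 1.0)
        output = neural_network(X, len(vocab_list), dropout_keep_prob=1.0)
        pred = tf.argmax(output, 1)
        # saver = tf.train.Saver(tf.trainable_variables())
        saver = tf.train.Saver(tf.global_variables())
        session = tf.Session()
        session.run(tf.initialize_all_variables())
        ckpt = tf.train.get_checkpoint_state('.')
        if ckpt is not None:
            print(ckpt.model_checkpoint_path)
            saver.restore(session, ckpt.model_checkpoint_path)
        else:
            print("Failed to find model")

        res = session.run(pred, {X: x})
    print(name, 'female' if res[0] == 0 else 'male')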
        