The task is to classify names as male or female using different approaches, including a Naive Bayes classifier and neural networks. The data is split across two files, each containing a list of names. The task was completed with https://github.com/Bravo111

In [1]:
import tensorflow
import pandas as pd
import keras

Using TensorFlow backend.


## Data preprocessing

In [2]:
df_f = pd.read_csv('data/female.txt', names=['name']).name
df_m = pd.read_csv('data/male.txt', names=['name']).name

print("\nLength of 'female':", len(df_f), '\n')


print('#'*27, '\nFemale:\n')
print(df_f.head(7))
print('#'*27)
print("\n\nLength of 'male':", len(df_m), '\n')
print('#'*27, '\nMale:\n')
print(df_m.head(7))
print('#'*27)


Length of 'female': 5001 

########################### 
Female:

0    Abagael
1    Abagail
2       Abbe
3      Abbey
4       Abbi
5      Abbie
6       Abby
Name: name, dtype: object
###########################


Length of 'male': 2943 

########################### 
Male:

0     Aamir
1     Aaron
2     Abbey
3     Abbie
4     Abbot
5    Abbott
6      Abby
Name: name, dtype: object
###########################


### Delete ambiguous names that are both male and female

In [3]:
# Delete intersections:
# use ~ to keep only the rows of df_f that are not in df_m (and vice versa)
df_f_clean = df_f[~df_f.isin(df_m)].reset_index(drop=True)
df_m_clean = df_m[~df_m.isin(df_f)].reset_index(drop=True)

print("\n New length of 'female':", len(df_f_clean))
print("\n New length of 'male':", len(df_m_clean))


 New length of 'female': 4636

 New length of 'male': 2578


### Store data in a single dataframe

In [4]:
df_f_clean = pd.DataFrame(df_f_clean)
df_f_clean['sex'] = 1
df_m_clean = pd.DataFrame(df_m_clean)
df_m_clean['sex'] = 0

df = pd.concat([df_f_clean, df_m_clean])
df = df.sort_values('name').reset_index(drop=True)
df.head(7)

      name  sex
0    Aamir    0
1    Aaron    0
2  Abagael    1
3  Abagail    1
4     Abbe    1
5     Abbi    1
6    Abbot    0


### Test data is taken as roughly 20% of the names for each letter of the alphabet

In [5]:
import numpy as np

# fix random seed for reproducibility
np.random.seed(1789)

train = pd.DataFrame()
test = pd.DataFrame()

letters = 'A B C D E F G H I J K L M N O P Q R S T U V W X Y Z'.split()

for letter in letters:
    names_starts_with_letter = df[df.name.str.startswith(letter)]
    # Random mask: each name goes to train with probability 0.8
    msk = np.random.rand(len(names_starts_with_letter)) < 0.80
    train = pd.concat([train, names_starts_with_letter[msk]])
    test = pd.concat([test, names_starts_with_letter[~msk]])

train = train.reset_index(drop=True)
test = test.reset_index(drop=True)

print('\nTest percentage:', (len(test)/len(df)))


Test percentage: 0.19725533684502355
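
The loop above sends each name to the test set with probability 0.2, so the test share is only approximately 20% (0.197 overall here). Below is a minimal sketch of a variant that takes a fixed share of every letter group instead. The function `split_by_first_letter`, its parameters and the sample names are illustrative assumptions, not part of the original notebook, and the sketch works on plain lists rather than on the DataFrame.

```python
import random
from collections import defaultdict

def split_by_first_letter(names, test_fraction=0.2, seed=1789):
    """Split names into (train, test), taking a fixed share of each letter group."""
    rng = random.Random(seed)          # local generator, so the split is reproducible
    groups = defaultdict(list)
    for name in names:
        groups[name[0].upper()].append(name)

    train, test = [], []
    for letter in sorted(groups):
        group = groups[letter][:]      # copy before shuffling
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Hypothetical sample: five names per letter, so one name per letter goes to test.
names = ['Aamir', 'Aaron', 'Abbe', 'Abbi', 'Abbot',
         'Bea', 'Ben', 'Bob', 'Bill', 'Burt']
train_names, test_names = split_by_first_letter(names)
print(len(train_names), len(test_names))  # 8 2
```

With very small letter groups the rounded count can be zero, so some letters may contribute no test names at all.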


### Symbolic n-grams for each name

In [6]:
from nltk.util import ngrams
from time import time
from tqdm import tqdm

def symbol_ngram(test, train, n=2):
    #################################### Test
    test_n = pd.DataFrame()
    t0 = time()

    for i in tqdm(range(len(test['name']))):
        name = test['name'][i]
        
        string_ngrams = ngrams(name, n=n)
        sex = int(test[test['name'] == name]['sex'])
        
        for ngram in string_ngrams: 
            ngram_sex = pd.DataFrame([[''.join(ngram), sex]], columns=['n={}'.format(n), 'sex'])
            test_n = pd.concat([test_n, ngram_sex])
    
    test_n = test_n.reset_index(drop=True)
    
    t_test = time() - t0
    print("\nPreprocessing time (test): %0.1fm" % (t_test/60))
    print('\n', '#'*12, '\n', test_n.head(7))
    print('#'*12)
    #################################### Test
    
    ############################# Train
    train_n = pd.DataFrame()
    t1 = time()
    
    for i in tqdm(range(len(train['name']))):
        name = train['name'][i]
        
        string_ngrams = ngrams(name, n=n)
        sex = int(train[train['name'] == name]['sex'])
        
        for ngram in string_ngrams: 
            ngram_sex = pd.DataFrame([[''.join(ngram), sex]], columns=['n={}'.format(n), 'sex'])
            train_n = pd.concat([train_n, ngram_sex])

    train_n = train_n.reset_index(drop=True)
    
    t_train = time() - t1
    print("\nPreprocessing time (train): %0.1fm" % (t_train/60))
    print('\n', '#'*12, '\n', train_n.head(7))
    print('#'*12)
    ############################# Train
    
    print("\n\nTotal preprocessing time: %0.1fm" % ((t_test + t_train)/60))
    
    return test_n, train_n
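
Calling `pd.concat` inside the loop copies the accumulated frame on every iteration, which is likely why preprocessing takes minutes. Below is a minimal sketch of the same n-gram extraction that collects plain tuples first. The helper names `char_ngrams` and `ngram_rows` are assumptions made for illustration, and string slicing is used in place of `nltk.util.ngrams`.

```python
def char_ngrams(name, n):
    """Return all contiguous character n-grams of `name`, in order."""
    # A name shorter than n yields an empty list.
    return [name[i:i + n] for i in range(len(name) - n + 1)]

def ngram_rows(names, labels, n):
    """Pair every n-gram of every name with that name's label."""
    return [(gram, label)
            for name, label in zip(names, labels)
            for gram in char_ngrams(name, n)]

rows = ngram_rows(['Aamir', 'Abbe'], [0, 1], n=2)
print(rows)
# [('Aa', 0), ('am', 0), ('mi', 0), ('ir', 0), ('Ab', 1), ('bb', 1), ('be', 1)]
```

The resulting list can be turned into a frame in one step, for example `pd.DataFrame(rows, columns=['n=2', 'sex'])`, which avoids the repeated concatenation.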

### Bigrams

In [7]:
test_n2, train_n2 = symbol_ngram(test, train, 2)

100%|█████████████████████████████████████████████████████████████████████████████| 1423/1423 [00:09<00:00, 142.61it/s]



Preprocessing time (test): 0.2m

 ############ 
   n=2  sex
0  Ab    1
1  bb    1
2  be    1
3  Ab    1
4  bb    1
5  bi    1
6  Ab    0
############


100%|█████████████████████████████████████████████████████████████████████████████| 5791/5791 [00:51<00:00, 112.50it/s]



Preprocessing time (train): 0.9m

 ############ 
   n=2  sex
0  Aa    0
1  am    0
2  mi    0
3  ir    0
4  Aa    0
5  ar    0
6  ro    0
############


Total preprocessing time: 1.0m


### Trigrams

In [9]:
test_n3, train_n3 = symbol_ngram(test, train, 3)

100%|██████████████████████████████████████████████████████████████████████████████| 1423/1423 [00:19<00:00, 73.39it/s]



Preprocessing time (test): 0.3m

 ############ 
    n=3  sex
0  Abb    1
1  bbe    1
2  Abb    1
3  bbi    1
4  Abb    0
5  bbo    0
6  bot    0
############


100%|██████████████████████████████████████████████████████████████████████████████| 5791/5791 [01:22<00:00, 70.18it/s]



Preprocessing time (train): 1.4m

 ############ 
    n=3  sex
0  Aam    0
1  ami    0
2  mir    0
3  Aar    0
4  aro    0
5  ron    0
6  Aba    1
############


Total preprocessing time: 1.7m


### Four-grams

In [10]:
test_n4, train_n4 = symbol_ngram(test, train, 4)

100%|██████████████████████████████████████████████████████████████████████████████| 1423/1423 [00:19<00:00, 71.86it/s]



Preprocessing time (test): 0.3m

 ############ 
     n=4  sex
0  Abbe    1
1  Abbi    1
2  Abbo    0
3  bbot    0
4  Adah    1
5  Adel    1
6  dela    1
############


100%|██████████████████████████████████████████████████████████████████████████████| 5791/5791 [01:32<00:00, 62.84it/s]



Preprocessing time (train): 1.5m

 ############ 
     n=4  sex
0  Aami    0
1  amir    0
2  Aaro    0
3  aron    0
4  Abag    1
5  baga    1
6  agae    1
############


Total preprocessing time: 1.9m


## Naive Bayes for names classification

### F-score basic theory:

Accuracy alone can hide poor performance on one of the classes, so it is conventional to employ a different set of measures, based on the number of items in each of the following four categories:

* True positives are relevant items that we correctly identified as relevant.
* True negatives are irrelevant items that we correctly identified as irrelevant.
* False positives (or Type I errors) are irrelevant items that we incorrectly identified as relevant.
* False negatives (or Type II errors) are relevant items that we incorrectly identified as irrelevant.

Given these four numbers, we can define the following metrics:

* Precision, which indicates how many of the items that we identified were relevant, is TP/(TP+FP).
* Recall, which indicates how many of the relevant items we identified, is TP/(TP+FN).
* The F-Measure (or F-Score), which combines the precision and recall to give a single score, is defined to be the harmonic mean of the precision and recall: (2 × Precision × Recall) / (Precision + Recall).
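
The three formulas above can be sketched as a small helper. The function name `precision_recall_f` and the counts passed to it are hypothetical and chosen only to illustrate the arithmetic.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-measure) computed from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical counts: 80 true positives, 20 false positives, 40 false negatives.
p, r, f = precision_recall_f(tp=80, fp=20, fn=40)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.8 0.667 0.727
```

Because the harmonic mean is pulled toward the smaller of the two values, the F-measure (0.727) sits closer to the recall than a simple average would.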

In [102]:
from nltk import NaiveBayesClassifier
from nltk.classify import accuracy
from nltk.metrics import f_measure
import collections

def classication_and_results(test_ngram, train_ngram):
    # Feature extractor function.
    def gender_features(word):
        return {'ngram': word}

    # Extract features
    train_featuresets = [(gender_features(n), sex) for index, (n, sex) in train_ngram.iterrows()]
    test_featuresets = [(gender_features(n), sex) for index, (n, sex) in test_ngram.iterrows()]

    # Train model
    classifier = NaiveBayesClassifier.train(train_featuresets)
    
    # Check the accuracy
    print('\nAccuracy of Naive Bayes:', accuracy(classifier, test_featuresets))
    
    # F-Measure
    refsets = collections.defaultdict(set)
    testsets = collections.defaultdict(set)
    for i, (feats, label) in enumerate(test_featuresets):
        refsets[label].add(i)
        observed = classifier.classify(feats)
        testsets[observed].add(i)


    print('\nFemale F-measure:', f_measure(refsets[1], testsets[1]))
    print('  Male F-measure:', f_measure(refsets[0], testsets[0]))
    return classifier

### Bigrams

In [103]:
classifier_n2 = classication_and_results(test_n2, train_n2)


Accuracy of Naive Bayes: 0.681749206787143

Female F-measure: 0.780724265754206
  Male F-measure: 0.4199145084234347


### Trigrams

In [104]:
classifier_n3 = classication_and_results(test_n3, train_n3)


Accuracy of Naive Bayes: 0.7366975626501888

Female F-measure: 0.8142857142857143
  Male F-measure: 0.5477594339622641


### Four-grams

In [105]:
classifier_n4 = classication_and_results(test_n4, train_n4)


Accuracy of Naive Bayes: 0.7507943713118475

Female F-measure: 0.8289186662511686
  Male F-measure: 0.5413533834586466


### Five-grams

In [109]:
test_n5, train_n5 = symbol_ngram(test, train, 5)

100%|██████████| 1423/1423 [00:07<00:00, 191.05it/s]


Preprocessing time (test): 0.1m

 ############ 
      n=5  sex
0  Abbot    0
1  Adela    1
2  delai    1
3  elaid    1
4  laide    1
5  Adeli    1
6  delin    1
############


100%|██████████| 5791/5791 [00:35<00:00, 161.78it/s]


Preprocessing time (train): 0.6m

 ############ 
      n=5  sex
0  Aamir    0
1  Aaron    0
2  Abaga    1
3  bagae    1
4  agael    1
5  Abaga    1
6  bagai    1
############


Total preprocessing time: 0.7m





### Conclusion:
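The F-score for male names is much lower than for female names at every 'n' (0.42, 0.55 and 0.54 versus 0.78, 0.81 and 0.83 for n = 2, 3, 4). It improves from bigrams to trigrams and then drops slightly for four-grams; the five-gram data was generated but not classified. The assumption is that male names are shorter and have fewer characters than female names, so they yield fewer n-grams as 'n' grows. The results may also be related to the initial data sizes of the female and male name lists (5001 vs 2943). Note that these scores are computed per n-gram, not per whole name, because each row of the test set is a single n-gram.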
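
The classifier above labels individual n-grams (each row of `test_n2`, `test_n3` or `test_n4` is one n-gram), so its scores do not directly say how often a whole name is classified correctly. One possible way to obtain a name-level label is a majority vote over the n-gram predictions, sketched below. This is an assumption made for illustration, not the method used in this notebook; `classify_name`, the tie-breaking rule, the `default` label and the toy classifier are all hypothetical.

```python
from collections import Counter

def char_ngrams(name, n):
    """Return all contiguous character n-grams of `name`, in order."""
    return [name[i:i + n] for i in range(len(name) - n + 1)]

def classify_name(name, classify_ngram, n=3, default=1):
    """Label a whole name by majority vote over its n-gram predictions.

    `classify_ngram` is any callable that maps an n-gram string to a label.
    `default` is returned when the name is shorter than n (assumed here to
    be 1, the majority class in the data).
    """
    grams = char_ngrams(name, n)
    if not grams:
        return default
    votes = Counter(classify_ngram(gram) for gram in grams)
    # On a tie, the label that was encountered first wins.
    return votes.most_common(1)[0][0]

# Toy stand-in for a trained classifier: n-grams ending in 'a' count as female.
def toy_classifier(gram):
    return 1 if gram.endswith('a') else 0

print(classify_name('Adela', toy_classifier, n=3))  # votes 0, 0, 1 -> 0
print(classify_name('Al', toy_classifier, n=3))     # too short -> default 1
```

With the NLTK classifier trained above, the callable could be something like `lambda g: classifier_n3.classify({'ngram': g})`, which reuses the same feature layout as `gender_features`.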

## LSTM 

### Preprocessing

In [6]:
test.name = test.name.str.lower()
train.name = train.name.str.lower()

chars = set(  "".join(list(test.name)) + "".join(list(train.name))  )
print(chars)

{'a', 's', 'v', 'z', ' ', 'i', 'w', 'g', 'k', 'd', 'r', "'", 'q', 'o', 'f', 'm', 'x', 'b', 'u', 'c', 'l', 'n', 'e', 't', 'p', '-', 'y', 'h', 'j'}


In [7]:
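# Custom sort: keep letters first (in sorted order) and move other symbols to the end
def custom_sort(sorted_list):
    # Build two new lists instead of removing items while iterating,
    # which would skip the element that follows each removed one
    letters = [char for char in sorted_list if char.isalpha()]
    not_letters = [char for char in sorted_list if not char.isalpha()]
    return letters + not_letters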

In [8]:
# Dictionaries of characters with indices
chars = custom_sort(sorted(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

In [9]:
maxlen = len(max(list(train.name) + list(test.name), key=len))
X_train = np.zeros((train.shape[0] , maxlen, len(chars) ))
y_train = np.zeros((train.shape[0] , 2 ))
X_test = np.zeros((test.shape[0] , maxlen, len(chars) ))
y_test = np.zeros((test.shape[0] , 2 ))

In [10]:
# Vectorization of train and test data.
# Both frames were re-indexed with reset_index(drop=True), so position i is the row index.

for i, (name, sex) in enumerate(zip(train.name, train.sex)):
    for t, char in enumerate(name):
        X_train[i, t, char_indices[char]] = 1
    # One-hot label: column 0 = male (sex == 0), column 1 = female (sex == 1)
    y_train[i, int(sex)] = 1


for i, (name, sex) in enumerate(zip(test.name, test.sex)):
    for t, char in enumerate(name):
        X_test[i, t, char_indices[char]] = 1
    y_test[i, int(sex)] = 1
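
The encoding step can also be sketched as a standalone function. Each name becomes a `maxlen` by `len(chars)` matrix with a single 1 per character position, and unused positions stay all zeros. The function name `one_hot_names` and the tiny vocabulary are assumptions made for illustration.

```python
import numpy as np

def one_hot_names(names, char_indices, maxlen):
    """Encode names as a (len(names), maxlen, vocabulary size) one-hot array."""
    X = np.zeros((len(names), maxlen, len(char_indices)), dtype=np.float32)
    for i, name in enumerate(names):
        for t, char in enumerate(name):
            X[i, t, char_indices[char]] = 1.0
    return X

# Hypothetical two-name dataset with the vocabulary {'a', 'b', 'e'}.
sample_names = ['abbe', 'bea']
sample_chars = sorted(set(''.join(sample_names)))
sample_indices = {c: i for i, c in enumerate(sample_chars)}

X_sample = one_hot_names(sample_names, sample_indices, maxlen=4)
print(X_sample.shape)  # (2, 4, 3)
print(X_sample[1])     # 'bea': three one-hot rows followed by one all-zero padding row
```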

### NN construction

In [11]:
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.recurrent import LSTM
from sklearn.metrics import f1_score, accuracy_score

### 512 nodes, 0.2 dropout, 32 batch size

In [64]:
model = Sequential()
model.add(LSTM(512, return_sequences=True, input_shape=(maxlen, len(chars))))
model.add(Dropout(0.2))
model.add(LSTM(512, return_sequences=False))
model.add(Dropout(0.2))
model.add(Dense(2))
model.add(Activation('softmax'))
model.compile(loss='binary_crossentropy', optimizer='rmsprop')
model.fit(X_train, y_train, batch_size=32)
model.save_weights('my_model_weights.h5')

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [31]:
predicted = model.predict_classes(X_test)
print("Accuracy is ", accuracy_score(test.sex, predicted))
print("F score is ", f1_score(test.sex, predicted))

Accuracy is  0.645115952214
F score is  0.776053215078


### 128 nodes, 0.4 dropout, 32 batch size

In [76]:
model = Sequential()
model.add(LSTM(128, return_sequences=True, input_shape=(maxlen, len(chars))))
model.add(Dropout(0.4))
model.add(LSTM(128, return_sequences=False))
model.add(Dropout(0.4))
model.add(Dense(2))
model.add(Activation('softmax'))
model.compile(loss='binary_crossentropy', optimizer='rmsprop')
model.fit(X_train, y_train, batch_size=32)
model.save_weights('128_04_128_04.h5')

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [26]:
predicted = model.predict_classes(X_test)
print()
print("Accuracy is ", accuracy_score(test.sex, predicted))
print("F score is ", f1_score(test.sex, predicted))

Accuracy is  0.815179198876
F score is  0.861797162375


### 256 nodes, 0.4 dropout, 16 batch size

In [32]:
model = Sequential()
model.add(LSTM(256, return_sequences=True, input_shape=(maxlen, len(chars))))
model.add(Dropout(0.4))
model.add(LSTM(256, return_sequences=False))
model.add(Dropout(0.4))
model.add(Dense(2))
model.add(Activation('softmax'))
model.compile(loss='binary_crossentropy', optimizer='rmsprop')
model.fit(X_train, y_train, batch_size=16)
model.save_weights('256_04_256_04_batch16.h5')
predicted = model.predict_classes(X_test)
print()
print("Accuracy is ", accuracy_score(test.sex, predicted))
print("F score is ", f1_score(test.sex, predicted))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10

Accuracy is  0.801124385102
F score is  0.845439650464


### 128 nodes, 0.4 dropout, 16 batch size

In [None]:
model = Sequential()
model.add(LSTM(128, return_sequences=True, input_shape=(maxlen, len(chars))))
model.add(Dropout(0.4))
model.add(LSTM(128, return_sequences=False))
model.add(Dropout(0.4))
model.add(Dense(2))
model.add(Activation('softmax'))
model.compile(loss='binary_crossentropy', optimizer='rmsprop')
model.fit(X_train, y_train, batch_size=16)
model.save_weights('128_04_128_04_batch16.h5')
predicted = model.predict_classes(X_test)
print()
print("Accuracy is ", accuracy_score(test.sex, predicted))
print("F score is ", f1_score(test.sex, predicted))

Epoch 1/10
Epoch 2/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10

Accuracy is  0.831342234715
F score is  0.865470852018


### 64 nodes, 0.4 dropout, 32 batch size

In [12]:
model = Sequential()
model.add(LSTM(64, return_sequences=True, input_shape=(maxlen, len(chars))))
model.add(Dropout(0.4))
model.add(LSTM(64, return_sequences=False))
model.add(Dropout(0.4))
model.add(Dense(2))
model.add(Activation('softmax'))
model.compile(loss='binary_crossentropy', optimizer='rmsprop')
model.fit(X_train, y_train, batch_size=32)
model.save_weights('64_04_64_04.h5')
predicted = model.predict_classes(X_test)
print()
print("Accuracy is ", accuracy_score(test.sex, predicted))
print("F score is ", f1_score(test.sex, predicted))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10

Accuracy is  0.801124385102
F score is  0.832643406268


### 128 nodes, 0.6 dropout, 32 batch size

In [13]:
model = Sequential()
model.add(LSTM(128, return_sequences=True, input_shape=(maxlen, len(chars))))
model.add(Dropout(0.6))
model.add(LSTM(128, return_sequences=False))
model.add(Dropout(0.6))
model.add(Dense(2))
model.add(Activation('softmax'))
model.compile(loss='binary_crossentropy', optimizer='rmsprop')
model.fit(X_train, y_train, batch_size=32)
model.save_weights('128_06_128_06.h5')
predicted = model.predict_classes(X_test)
print()
print("Accuracy is ", accuracy_score(test.sex, predicted))
print("F score is ", f1_score(test.sex, predicted))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10

Accuracy is  0.801124385102
F score is  0.852834113365


## Conclusion

The best result on both metrics was shown by the NN with 128 nodes per layer, dropout 0.4 and batch size 16. The worst result was shown by the NN with 512 nodes and dropout 0.2. The final results, ordered from best to worst, are shown below:

128 nodes, 0.4 dropout, 16 batch size
- Accuracy is  0.831342234715
- F score is  0.865470852018

128 nodes, 0.4 dropout, 32 batch size
- Accuracy is  0.815179198876
- F score is  0.861797162375

128 nodes, 0.6 dropout, 32 batch size
- Accuracy is  0.801124385102
- F score is  0.852834113365

256 nodes, 0.4 dropout, 16 batch size
- Accuracy is  0.801124385102
- F score is  0.845439650464

64 nodes, 0.4 dropout, 32 batch size
- Accuracy is  0.801124385102
- F score is  0.832643406268

512 nodes, 0.2 dropout, 32 batch size
- Accuracy is  0.645115952214
- F score is  0.776053215078

Optimal parameters for the NN are 128 nodes and 0.4 dropout. In these runs, both increasing and decreasing the number of nodes per layer lowered accuracy and F-score, and raising dropout from 0.4 to 0.6 did the same. Training time also changes roughly linearly with the number of nodes. The reason might be that with higher dropout or a smaller number of nodes the NN is underfitting: either too many nodes are ignored during regularisation, or there are not enough nodes to build solid relationships. Conversely, with more nodes and lower dropout the NN is overfitting and cannot generalise well.
<br><br>
Overall, when comparing the Naive Bayes classifier and the LSTM NN, the LSTM wins (best accuracy 0.83 versus 0.75). This may be caused by the ability of RNNs to remember previously processed elements of a sequence, while the Bayes classifier only uses the posterior probability of each individual n-gram. A very important difference between the two is that the Naive Bayes classifier assumes the features are independent of each other given the class, while an LSTM can show good results even when they are correlated.