# Training a Model
---

### Problem

We need to classify short strings of at most 50 characters into two classes: *is a person name* and *is not a person name*.

For a traditional Machine Learning approach, one of the biggest problems is how to represent a short string of 2 or 3 words as a vector. A *Bag of Words* approach would just create huge sparse vectors, which is not efficient. Additionally, some names might be unique, and if we tried to use every word in the corpus we might end up with thousands of features per vector.

Instead, we can treat this problem with a *Deep Learning* approach. We do not use word embeddings, because a rare name or word might not have a vector of its own, and word embeddings would be too much for a sequence of only two or three words.

### Approach

Our dataset has strings of at most 50 characters. Knowing this, we can build a recurrent neural network with LSTM cells that takes each string as a sequence of characters, and we assign an id between 1 and 96 to the character at each position.

The ASCII codes from *32* (*"Space"*) to *126* ("~") cover all the Latin characters used in the English language, plus punctuation and digits. Code 127 is the non-printable DEL control character.

We set an offset of 31 so the indexes start from 1 and not 32.
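The offset arithmetic can be checked in a few lines. This is only an illustrative sketch: the constant names are my own, and it assumes the printable range ends at "~" (ASCII code 126).

```python
OFFSET = 31             # from the text: shifts ASCII 32 (space) down to id 1
FIRST, LAST = 32, 126   # assumed printable ASCII range: space .. "~"

# Ids assigned to the printable range after applying the offset.
printable_ids = [code - OFFSET for code in range(FIRST, LAST + 1)]

print(ord(" ") - OFFSET)    # id of the space character
print(ord("~") - OFFSET)    # id of "~", the last printable character
print(len(printable_ids))   # how many ids the printable range uses
```

With this layout, 0 stays free for padding and 96 can act as a catch-all id for everything outside the printable range.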

For simplicity we use **Keras** as the Deep Learning framework, with **TensorFlow** as the backend.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout
from keras.layers import Embedding

Using TensorFlow backend.


We load the 6 million labeled samples into a *Pandas DataFrame*.

In [2]:
data = pd.read_csv('full_names.csv', index_col=False)

#### Encoding Strings

To encode each string as numbers, we take the ASCII decimal code of each character and shift it down (subtract) by 31.

For more information the ASCII table can be useful: [ASCII Table](http://www.rapidtables.com/code/text/ascii-table.htm)

**Example**

CHAR|J|o|h|n|(space)|S|m|i|t|h
---|---|---|---|---|---|---|---|---|---|---
CODE|43|80|73|79|1|52|78|74|85|73

Special characters, such as accented letters or language-specific characters, are all assigned 96. Note also that space is assigned 1.

In [3]:
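def encode_string(s):
    """Map a string to a list of character ids (1-95 printable ASCII, 96 rare)."""
    encoded = []
    for c in s:
        idx = ord(c)
        if 32 <= idx <= 126:
            # Printable ASCII: space (32) becomes 1 and "~" (126) becomes 95
            encoded.append(idx - 31)
        elif idx > 126:
            # Rare characters like accented letters and specific language characters
            encoded.append(96)
        # Control characters (codes below 32) are skipped
    return encoded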

In [4]:
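def decode_vector(v):
    """Turn a vector of character ids back into a string, stopping at the first padding value."""
    decoded = []
    for idx in v:
        if 0 < idx < 96:
            decoded.append(chr(idx + 31))
        elif idx >= 96:
            # Rare characters cannot be recovered, so they are shown as '*'
            decoded.append('*')
        else:
            # 0 is padding: the string has ended
            break
    return "".join(decoded)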
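A quick round trip shows what the two helpers preserve and what they lose. To keep the snippet runnable on its own, `encode` and `decode` below are condensed restatements of `encode_string` and `decode_vector`; the shortened names and the sample string are mine.

```python
def encode(s):
    # Printable ASCII -> 1..95, anything above 126 -> 96, control characters dropped.
    return [ord(c) - 31 if ord(c) <= 126 else 96 for c in s if ord(c) >= 32]

def decode(v):
    # Stops at the first padding value (0) and writes '*' for rare characters.
    out = []
    for idx in v:
        if 0 < idx < 96:
            out.append(chr(idx + 31))
        elif idx >= 96:
            out.append("*")
        else:
            break
    return "".join(out)

ids = encode("Jos\u00e9 Smith")          # "\u00e9" is the accented letter e
padded = ids + [0] * (50 - len(ids))     # zero padding up to 50 steps
print(ids)
print(decode(padded))
```

The accented letter comes back as `*`, so the encoding is lossy for anything outside printable ASCII, while the trailing padding is ignored when decoding.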

We create indexes for all 6 million samples and then shuffle them so we can randomize the dataset.

In [5]:
idxs = np.arange(data.shape[0])

In [6]:
np.random.shuffle(idxs)
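If the shuffle needs to be reproducible between runs, one option is a seeded NumPy generator. This is a sketch of an alternative, not what the notebook ran; the seed value and the tiny `n_samples` are arbitrary stand-ins.

```python
import numpy as np

n_samples = 10                          # stand-in for data.shape[0]
rng = np.random.default_rng(seed=42)    # the seed value is an arbitrary choice
idxs = rng.permutation(n_samples)       # shuffled copy of 0 .. n_samples - 1

print(idxs)
```

Using the same seed again yields the same ordering, which makes the later train/test split repeatable as well.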

The strings have different lengths (at most 50 characters), so we use the Keras helper function *pad_sequences* to turn every vector into a 50-step sequence. With `padding='post'` it simply adds zeros at the end of each sequence.

In [7]:
encoded = [encode_string(str(s)) for s in data['string'].values[idxs]]
strings = pad_sequences(encoded, maxlen=50, dtype=np.int32, padding='post')
labels = data['is_person_name'].values[idxs]
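To make the padding step concrete, here is a pure-Python sketch of what `pad_sequences` does with `padding='post'`. The helper `pad_post` is hypothetical and only mimics the behaviour for a single sequence, including the Keras default of truncating over-long sequences from the front (`truncating='pre'`).

```python
def pad_post(seq, maxlen, value=0):
    # Keep the last `maxlen` items (mirrors truncating='pre'), then pad on the right.
    kept = list(seq)[-maxlen:]
    return kept + [value] * (maxlen - len(kept))

short = pad_post([43, 80, 73, 79], maxlen=6)            # "John", shorter than maxlen
too_long = pad_post([1, 2, 3, 4, 5, 6, 7, 8], maxlen=6)  # longer than maxlen

print(short)
print(too_long)
```

Since the dataset contains no string longer than 50 characters, the truncation branch should never trigger here, but it is worth knowing that the default drops the beginning of a sequence rather than the end.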

We split the data into 85% train and 15% test sets. I could have used a more typical 30% or 40% for the test set, but 15% is already 900,000 strings.

In [8]:
X_train, X_test, y_train, y_test = train_test_split(strings, labels, test_size=0.15)
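A possible variation, shown on toy data: passing `stratify` keeps the class ratio the same in both sets, and `random_state` makes the split repeatable. Neither argument is used in the notebook, so treat this as an optional sketch; the toy arrays and the seed are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(100, 1)   # toy features standing in for `strings`
y = np.array([0, 1] * 50)            # toy labels with a 50/50 class balance

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())      # share of positives in each set
```

Stratifying matters most when one class is much rarer than the other; with a balanced dataset a plain random split of this size gives nearly the same result.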

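I tried different hidden sizes and this one worked the best.

`EMBEDDING_SIZE` is set to 97 because the indexes run from 0 to 96; it is the vocabulary size (the `input_dim`) of the Embedding layer. The layer maps each index in the sequence to a learned dense vector, which is equivalent to multiplying a one-hot vector of length 97 by a trainable weight matrix.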

In [9]:
EMBEDDING_SIZE = 97 # All accepted characters (0 Padding, 1-95 Common ASCII and 96 Rare Chars)
HIDDEN_SIZE = 256
INPUT_LENGTH = 50
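What an embedding layer computes can be sketched with plain NumPy: it is a row lookup in a weight matrix, which gives the same result as multiplying one-hot vectors by that matrix. The random weights and the small `EMBED_DIM` below are stand-ins for illustration, not values from the trained model.

```python
import numpy as np

VOCAB_SIZE = 97   # ids 0..96, as in the text
EMBED_DIM = 4     # kept small for display; the model uses 256

rng = np.random.default_rng(0)
weights = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))  # stands in for trained weights

ids = np.array([43, 80, 73, 79, 0, 0])   # "John" followed by two padding steps

looked_up = weights[ids]                 # row lookup, shape (6, EMBED_DIM)
one_hot = np.eye(VOCAB_SIZE)[ids]        # explicit one-hot rows, shape (6, 97)
via_matmul = one_hot @ weights           # same vectors, computed the slow way

print(looked_up.shape)
print(np.allclose(looked_up, via_matmul))
```

The lookup form is what makes the layer cheap: no one-hot tensor is ever materialized.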

Keras lets me add an Embedding layer that takes a 2D tensor of 256 x 50 (batch size x sequence length) and outputs a tensor of 256 x 50 x 256 (batch size x sequence length x embedding dimension, which here equals `HIDDEN_SIZE`) for each batch. A recurrent dropout of 0.3 is added to the LSTM layer.

Then, before the fully connected layer, we add another dropout of 0.3. (I tried 0.5 before and 0.3 worked the best.)

The fully connected layer has a single unit with a sigmoid activation. I didn't use softmax because, for binary classification, one sigmoid output is enough: it is mathematically equivalent to a two-class softmax.

Then we compute the loss with binary cross-entropy and optimize with RMSprop.
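The output activation and the loss can be written out in a few lines of plain Python. This sketch uses my own function names; it checks that a sigmoid over one logit matches a two-class softmax whose second logit is fixed at 0, and evaluates binary cross-entropy for a confident correct and a confident wrong prediction.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax2(z_pos, z_neg):
    # Probability of the positive class under a two-class softmax.
    m = max(z_pos, z_neg)                 # subtract the max for numerical stability
    e_pos, e_neg = math.exp(z_pos - m), math.exp(z_neg - m)
    return e_pos / (e_pos + e_neg)

def binary_cross_entropy(y_true, p, eps=1e-7):
    p = min(max(p, eps), 1.0 - eps)       # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

z = 1.5
print(sigmoid(z), softmax2(z, 0.0))       # the two probabilities agree
print(binary_cross_entropy(1, 0.9))       # confident and correct: small loss
print(binary_cross_entropy(1, 0.1))       # confident and wrong: large loss
```

The loss grows quickly as the predicted probability moves away from the true label, which is what pushes the sigmoid output toward 0 or 1 during training.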

In [10]:
model = Sequential()
model.add(Embedding(EMBEDDING_SIZE, HIDDEN_SIZE, input_length=INPUT_LENGTH))
model.add(LSTM(HIDDEN_SIZE, input_shape=(INPUT_LENGTH, HIDDEN_SIZE), recurrent_dropout=0.3))
model.add(Dropout(0.3))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
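As a sanity check on the size of this network, the number of trainable parameters can be worked out by hand. The arithmetic below assumes the standard LSTM layout (four gates, each with an input kernel, a recurrent kernel and a bias); I have not compared it against `model.summary()` for this notebook.

```python
VOCAB_SIZE = 97     # EMBEDDING_SIZE in the notebook
HIDDEN_SIZE = 256   # embedding dimension and LSTM units

# Embedding: one HIDDEN_SIZE-dimensional vector per id.
embedding_params = VOCAB_SIZE * HIDDEN_SIZE

# LSTM: 4 gates x (input kernel + recurrent kernel + bias).
lstm_input_dim = HIDDEN_SIZE   # the LSTM reads the embedding output
lstm_params = 4 * (HIDDEN_SIZE * (lstm_input_dim + HIDDEN_SIZE) + HIDDEN_SIZE)

# Dense: one weight per LSTM unit plus one bias.
dense_params = HIDDEN_SIZE * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Almost all of the parameters sit in the LSTM, so `HIDDEN_SIZE` is the main knob for trading capacity against training time.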

Because the full train dataset has around 5 million samples, training on a local CPU is quite slow. 

I set up a *Google Compute Engine* instance with a *Tesla K80* GPU to train the whole network. I trained for only 5 epochs, after which the model seemed to converge and stop improving.
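A rough count of gradient updates gives a sense of the scale of this training run. The figures assume exactly 6,000,000 rows (the text only says "6 million"), the 85/15 split, and a batch size of 256.

```python
import math

n_total = 6_000_000                 # assumption: the text says "6 million"
n_test = n_total * 15 // 100        # 15% held out for testing
n_train = n_total - n_test

BATCH_SIZE = 256
EPOCHS = 5

steps_per_epoch = math.ceil(n_train / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
print(n_train, steps_per_epoch, total_steps)
```

Each step runs a 50-step LSTM over 256 sequences, which is why a GPU makes such a difference here.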

In [11]:
model.fit(X_train, y_train, batch_size=256, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f3f30dab2e8>

Then we evaluate the model on the test dataset, where it scores **96%** accuracy.

In [12]:
scores = model.evaluate(X_test, y_test, batch_size=512)
print("Test loss:{} - acc:{}".format(scores[0], scores[1]))
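Accuracy alone can hide how the errors are distributed between the two classes. As an optional follow-up, precision and recall can be computed from the predicted probabilities. The helper and the toy numbers below are illustrative only and are not results from this model.

```python
def binary_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, precision and recall for the positive class."""
    tp = fp = tn = fn = 0
    for truth, prob in zip(y_true, y_prob):
        pred = 1 if prob >= threshold else 0
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Made-up labels and probabilities, only to exercise the function.
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_prob = [0.9, 0.8, 0.3, 0.2, 0.1, 0.6, 0.05, 0.4]
metrics = binary_metrics(toy_true, toy_prob)
print(metrics)
```

With the real model, `y_prob` would come from `model.predict(X_test)` flattened to one dimension.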



Lastly, we save the model for later use.

In [13]:
model.save('models/model.h5')
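For later use, new strings have to go through the same encoding and padding before prediction. The sketch below shows one way to wrap that; `predict_is_name`, `encode_and_pad` and the dummy model are hypothetical names of mine. The dummy model only exists so the snippet runs without the trained weights, and the real loading call is left in comments.

```python
import numpy as np

def encode_and_pad(s, maxlen=50):
    # Same scheme as the notebook: printable ASCII -> 1..95, rare -> 96, 0 = padding.
    ids = [ord(c) - 31 if ord(c) <= 126 else 96 for c in s if ord(c) >= 32]
    ids = ids[-maxlen:]
    return ids + [0] * (maxlen - len(ids))

def predict_is_name(model, strings, threshold=0.5):
    batch = np.array([encode_and_pad(s) for s in strings], dtype=np.int32)
    probs = np.asarray(model.predict(batch)).reshape(-1)
    return [(s, float(p), bool(p >= threshold)) for s, p in zip(strings, probs)]

# With the saved model (requires Keras; not executed here):
#   from keras.models import load_model
#   model = load_model('models/model.h5')
#   predict_is_name(model, ["John Smith"])

class DummyModel:
    """Hypothetical stand-in that returns a constant probability."""
    def predict(self, batch):
        return np.full((len(batch), 1), 0.5)

results = predict_is_name(DummyModel(), ["John Smith"])
print(results)
```

Keeping the encoding next to the prediction call avoids the common mistake of feeding the model input that was prepared differently from the training data.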