In [1]:
import pandas as pd # provide sql-like data manipulation tools. very handy.
pd.options.mode.chained_assignment = None
from os import getcwd
import numpy as np 
MAX_NB_WORDS=40000 #defines the size of our vocabulary
MAX_SEQUENCE_LENGTH=50
VALIDATION_SPLIT=0.2 #fraction of the examples held out for testing
EMBEDDING_DIM=200

## Setting up the Training Data

We need all of our labels to be a simple 1 (offensive) or 0 (not offensive). Given that our dataset has a text label for hate speech and "offensive but not hate" speech, we need to map these text labels to integer ones using the Pandas library. #pandasIsYourDataScienceBestFriend


In [2]:
df = pd.read_csv(getcwd() + '/data/twitter-hate-speech.csv', encoding = "ISO-8859-1")
texts = df[:]['tweet_text']
labels = df[:]['does_this_tweet_contain_hate_speech']
labels = labels.map({'The tweet is not offensive': 0, 
                     'The tweet uses offensive language but not hate speech': 1, 
                     'The tweet contains hate speech': 1})
print("Have", len(labels), "training tweets")

Have 14509 training tweets


## Creating our Tokenizer

A tokenizer will turn each text into a sequence of integers, with each integer being the index of a token in our dictionary.

This tokenizer dictionary becomes our vocabulary, and we need to cap its size (`MAX_NB_WORDS`) because English speakers have an uncanny ability to make up low-frequency words. We ignore those rare words to keep the computation manageable.

We use `fit_on_texts` to build a dictionary from the training texts, while `texts_to_sequences` replaces each word in a text with its corresponding integer value.

We also pad our sequences to make every text sequence the same length. We cap these at 50, which is pretty high given the length of a tweet, but the neural net architecture I "borrowed" below was built for something much larger, so we're doing some padding.
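
To make these steps concrete, here is a minimal plain-Python sketch of roughly what a tokenizer and padding do. It is a simplified stand-in, not the Keras implementation: the helper names (`build_word_index`, `to_sequence`, `pad_left`) and the toy texts are made up for illustration, and the real `Tokenizer` also strips punctuation.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer rank, 1 = most frequent (0 is kept for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # sort by descending count; ties are broken alphabetically to stay deterministic
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_sequence(text, word_index, max_words):
    """Replace each known word with its index, dropping words ranked max_words or higher."""
    return [word_index[w] for w in text.lower().split()
            if w in word_index and word_index[w] < max_words]

def pad_left(seq, maxlen):
    """Left-pad with zeros (or keep only the last maxlen items) to reach length maxlen."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

toy_texts = ["you are great", "you are you"]   # made-up examples
toy_index = build_word_index(toy_texts)        # "you" is most frequent, so it gets index 1
toy_seq = to_sequence("you are great", toy_index, max_words=3)
toy_padded = pad_left(toy_seq, maxlen=5)
print(toy_index, toy_seq, toy_padded)
```

Note how the rarest word ("great") falls outside the vocabulary cap and simply disappears from the sequence, which is the same trade-off `MAX_NB_WORDS` makes on the real data.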

Our labels are going to be a 0 or 1. We only have 1 output, but in a lot of cases you might have several categories, in which case the output will be an array (vector) with a 1 in the position for a specific category and 0 everywhere else. We would call that "one-hot" encoding.
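
As a quick illustration of one-hot encoding, here is a small sketch in plain Python. The three category names and the `one_hot` helper are hypothetical and only serve the example; this notebook itself sticks to a single 0/1 output.

```python
def one_hot(label, num_classes):
    """Return a list of zeros with a single 1 at the position of the given class."""
    vec = [0] * num_classes
    vec[label] = 1
    return vec

# hypothetical three-way labelling, not used elsewhere in this notebook
categories = ["not offensive", "offensive", "hate speech"]
encoded = [one_hot(i, len(categories)) for i in range(len(categories))]
for name, vec in zip(categories, encoded):
    print(name, vec)
```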

Note that we're saving a `word_index`; we'll need that later when we create our embedding layer.

In [3]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

from keras import backend as K
K.set_image_dim_ordering('tf')

tokenizer = Tokenizer(num_words=MAX_NB_WORDS)  # `nb_words` is the old Keras 1 name for this argument
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

Using TensorFlow backend.

Found 28821 unique tokens.
Shape of data tensor: (14509, 50)
Shape of label tensor: (14509,)


## Create the Pickle File

For sanity's sake, we print one text, its sequence of word indexes, and its padded sequence to demonstrate.

From here, we want to dump our dictionary for later use. It becomes part of our model, as we will need it to encode new text we want to perform predictions on. 

In [4]:
import pickle

print(texts[125])
print(sequences[125])
print(data[125])

with open('model/tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

@OfficialScarbs @JPizzleFIFA @ItzEmmo omg i said him and charlie look alike
[9586, 9587, 9588, 586, 4, 141, 70, 9, 4384, 105, 495]
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0 9586 9587 9588
  586    4  141   70    9 4384  105  495]
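
Later on, for example in a separate prediction script, the pickle can be read back with `pickle.load`. The sketch below round-trips a small stand-in dictionary through a temporary file; the dictionary contents and the file name are made up, and in this notebook the pickled object would be the fitted `tokenizer` stored at `model/tokenizer.pickle`.

```python
import os
import pickle
import tempfile

# stand-in for the fitted tokenizer (assumption: any picklable object behaves the same way)
word_index_demo = {"you": 1, "are": 2}

path = os.path.join(tempfile.mkdtemp(), "tokenizer_demo.pickle")
with open(path, "wb") as handle:
    pickle.dump(word_index_demo, handle, protocol=pickle.HIGHEST_PROTOCOL)

# reading it back gives an equal object
with open(path, "rb") as handle:
    restored = pickle.load(handle)

print(restored)
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code.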


## Train/Test Sets

We actually want to split our data into `x_train` (training texts), `y_train` (training labels), `x_test` (test texts), `y_test` (test labels). 

The training examples will be used by our algorithm to train the neural network, but we need to test at every step of the way and make sure we aren't "overfitting".

> *Overfitting: "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably" -[Oxford Dictionaries](https://www.lexico.com/en/definition/overfitting)*

In [5]:
# split the data into a training set and a validation set
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
nb_validation_samples = int(VALIDATION_SPLIT * data.shape[0])

x_train = data[:-nb_validation_samples]
y_train = labels[:-nb_validation_samples]
x_test = data[-nb_validation_samples:]
y_test = labels[-nb_validation_samples:]
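
Because `np.random.shuffle` is not seeded here, every run produces a different split. The sketch below reproduces the same slicing logic in plain Python with a fixed seed so the split sizes can be checked; the `split_indices` helper and the seed value are assumptions for illustration, not part of the notebook.

```python
import random

def split_indices(n_samples, validation_split, seed=0):
    """Shuffle the indices 0..n_samples-1 and hold out the last fraction for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(validation_split * n_samples)
    if n_val == 0:
        # nothing to hold out; avoids the indices[:-0] pitfall, which would be empty
        return indices, []
    return indices[:-n_val], indices[-n_val:]

# 14509 tweets with a 0.2 split, as in this notebook
train_idx, test_idx = split_indices(14509, 0.2)
print(len(train_idx), len(test_idx))
```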


## Fits Like a GloVe

Now that we have our padded sequences, we need to load pre-trained GloVe word vectors, which turn our words into high-dimensional vectors that capture how English words relate to each other. We load the 200-dimensional dataset into a dictionary called `embeddings_index`, which maps words to arrays of 200 numbers. (It takes a minute to load the data.)

[GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/projects/glove/)
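
The point of these vectors is that related words end up pointing in similar directions. A rough sketch of how that can be measured with cosine similarity follows; the 2-dimensional vectors are invented for the example (real GloVe vectors here have 200 dimensions) and `cosine_similarity` is a helper written just for this illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# made-up toy embeddings, NOT real GloVe values
toy_embeddings = {
    "school":  [1.0, 0.0],
    "college": [0.9, 0.1],
    "pizza":   [0.0, 1.0],
}

sim_related = cosine_similarity(toy_embeddings["school"], toy_embeddings["college"])
sim_unrelated = cosine_similarity(toy_embeddings["school"], toy_embeddings["pizza"])
print(sim_related, sim_unrelated)
```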

In [6]:
embeddings_index = {}
# each line holds a token followed by the 200 components of its vector
with open('data/glove.twitter.27B.200d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))
print(embeddings_index['school'])


## Embedding Layer

The Embedding Layer is the first layer of our "deep learning" network. It takes positive integer indexes and turns them into dense vectors based on our GloVe dictionary. Using the `word_index` from our Tokenizer, this is the last step to get our data ready before training.

In [14]:
from keras.layers import Embedding

embedding_matrix = np.zeros((len(word_index) + 1, EMBEDDING_DIM))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)
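
Conceptually, a frozen embedding layer is just a row lookup into `embedding_matrix`. Here is a minimal sketch of that idea using plain Python lists in place of NumPy arrays; the toy word index, the 3-dimensional vectors, and the variable names are all made up for illustration.

```python
EMBED_DIM = 3  # toy size; the notebook uses 200

# made-up stand-ins for word_index and embeddings_index
toy_word_index = {"you": 1, "are": 2, "zzzz": 3}
toy_vectors = {"you": [0.1, 0.2, 0.3], "are": [0.4, 0.5, 0.6]}  # "zzzz" has no vector

# row 0 is reserved for padding, hence len(word_index) + 1 rows
matrix = [[0.0] * EMBED_DIM for _ in range(len(toy_word_index) + 1)]
for word, i in toy_word_index.items():
    vec = toy_vectors.get(word)
    if vec is not None:
        matrix[i] = vec  # words without a pre-trained vector keep their all-zero row

# "embedding" a padded sequence means replacing each index with its row
padded_sequence = [0, 1, 2, 3]
embedded = [matrix[i] for i in padded_sequence]
for row in embedded:
    print(row)
```

Both the padding index and the unknown word map to all-zero rows, so they contribute nothing to the convolutions that follow.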

## Training Our Model

The actual architecture of our model is an Embedding layer, a 1D convolutional layer, a max pooling layer, a second 1D convolutional layer, a global max pooling layer, and a fully connected (Dense) layer connected to the single output.

This is the real "building" of the model, but honestly I just picked it up from some example on the internet and removed some layers because our tweet window size is so small. We can play with the size of each layer (hyperparameter tuning) to get better results, but this can take a lot of experimentation.

We run batches of 128 text examples each iteration for speed, iterating over all examples 5 times (number of epochs). This whole process will set the weights on our neural network.

With each iteration we can see the accuracy increase until it starts to level off, a common trend when training networks. This model will train really quickly on a CPU, but for large datasets that involve much larger inputs like images, you'll definitely want to run on a GPU.
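
The batch and epoch settings translate into a concrete number of weight updates. The arithmetic below is a small sketch that assumes the 11608 training examples produced by the split in this notebook and assumes the final, smaller batch of each epoch is also processed.

```python
import math

n_train = 11608     # training examples after the 80/20 split
batch_size = 128
epochs = 5

batches_per_epoch = math.ceil(n_train / batch_size)           # the last batch is partial
last_batch_size = n_train - (batches_per_epoch - 1) * batch_size
total_updates = batches_per_epoch * epochs

print(batches_per_epoch, last_batch_size, total_updates)
```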

In [16]:
from keras.layers import Dense, Input, GlobalMaxPooling1D
from keras.layers import Conv1D, MaxPooling1D
from keras.models import Model

print('Training model.')

# train a 1D convnet with global maxpooling
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
print(embedded_sequences.shape)
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = GlobalMaxPooling1D()(x)
x = Dense(128, activation='relu')(x)
preds = Dense(1, activation='relu')(x)

model = Model(sequence_input, preds)
model.compile(loss='mean_squared_error',
              optimizer='rmsprop',
              metrics=['acc'])

model.fit(x_train, y_train,
          batch_size=128,
          epochs=5,
          validation_data=(x_test, y_test))

print(model.summary())

Training model.
(?, 50, 200)


Train on 11608 samples, validate on 2901 samples
Epoch 1/5
  128/11608 [..............................] - ETA: 32s - loss: 0.3144 - acc: 0.4375
  384/11608 [..............................] - ETA: 12s - loss: 1.6420 - acc: 0.3568
  640/11608 [>.............................] - ETA: 8s - loss: 1.1155 - acc: 0.4406
  896/11608 [=>............................] - ETA: 7s - loss: 0.8639 - acc: 0.4933
 1152/11608 [=>............................] - ETA: 6s - loss: 0.7289 - acc: 0.5200
 1408/11608 [==>...........................] - ETA: 5s - loss: 0.6427 - acc: 0.5426
 1664/11608 [===>..........................] - ETA: 5s - loss: 0.5756 - acc: 0.5721
 1920/11608 [===>..........................] - ETA: 4s - loss: 0.5229 - acc: 0.5938
 2176/11608 [====>.........................] - ETA: 4s - loss: 0.4823 - acc: 0.6085
 2432/11608 [=====>........................] - ETA: 4s - loss: 0.4522 - acc: 0.6180
 2688/11608 [=====>........................] - ETA: 4s - loss: 0.4429 - acc: 0.6109

Epoch 2/5
  128/11608 [..............................] - ETA: 3s - loss: 0.0632 - acc: 0.9375
  256/11608 [..............................] - ETA: 4s - loss: 0.0747 - acc: 0.9141
  384/11608 [..............................] - ETA: 4s - loss: 0.0652 - acc: 0.9245
  640/11608 [>.............................] - ETA: 3s - loss: 0.0743 - acc: 0.9125
  896/11608 [=>............................] - ETA: 3s - loss: 0.1295 - acc: 0.8225
 1152/11608 [=>............................] - ETA: 3s - loss: 0.1187 - acc: 0.8403
 1408/11608 [==>...........................] - ETA: 3s - loss: 0.1125 - acc: 0.8516
 1664/11608 [===>..........................] - ETA: 3s - loss: 0.1081 - acc: 0.8636
 1920/11608 [===>..........................] - ETA: 3s - loss: 0.1071 - acc: 0.8646
 2176/11608 [====>.........................] - ETA: 3s - loss: 0.1047 - acc: 0.8667
 2432/11608 [=====>........................] - ETA: 3s - loss: 0.1026 - acc: 0.8684
 2688/11608 [=====>........................] - ETA: 3s - loss: 0.1012 - acc: 0.8705

Epoch 3/5
  128/11608 [..............................] - ETA: 3s - loss: 0.0482 - acc: 0.9453
  384/11608 [..............................] - ETA: 4s - loss: 0.0548 - acc: 0.9375
  640/11608 [>.............................] - ETA: 3s - loss: 0.0630 - acc: 0.9203
  896/11608 [=>............................] - ETA: 3s - loss: 0.0636 - acc: 0.9230
 1152/11608 [=>............................] - ETA: 3s - loss: 0.0629 - acc: 0.9262
 1408/11608 [==>...........................] - ETA: 3s - loss: 0.0669 - acc: 0.9240
 1664/11608 [===>..........................] - ETA: 3s - loss: 0.0692 - acc: 0.9213
 1920/11608 [===>..........................] - ETA: 3s - loss: 0.0682 - acc: 0.9208
 2176/11608 [====>.........................] - ETA: 3s - loss: 0.0670 - acc: 0.9219
 2432/11608 [=====>........................] - ETA: 3s - loss: 0.0676 - acc: 0.9215
 2688/11608 [=====>........................] - ETA: 3s - loss: 0.0681 - acc: 0.9208

Epoch 4/5
  128/11608 [..............................] - ETA: 3s - loss: 0.0637 - acc: 0.9297
  384/11608 [..............................] - ETA: 3s - loss: 0.0501 - acc: 0.9479
  640/11608 [>.............................] - ETA: 3s - loss: 0.0460 - acc: 0.9531
  896/11608 [=>............................] - ETA: 3s - loss: 0.0430 - acc: 0.9542
 1152/11608 [=>............................] - ETA: 3s - loss: 0.0436 - acc: 0.9531
 1408/11608 [==>...........................] - ETA: 3s - loss: 0.0458 - acc: 0.9524
 1664/11608 [===>..........................] - ETA: 3s - loss: 0.0465 - acc: 0.9507
 1920/11608 [===>..........................] - ETA: 3s - loss: 0.0470 - acc: 0.9500
 2176/11608 [====>.........................] - ETA: 3s - loss: 0.0456 - acc: 0.9508
 2432/11608 [=====>........................] - ETA: 3s - loss: 0.0476 - acc: 0.9478
 2688/11608 [=====>........................] - ETA: 2s - loss: 0.0526 - acc: 0.9412

Epoch 5/5
  128/11608 [..............................] - ETA: 3s - loss: 0.0373 - acc: 0.9531
  384/11608 [..............................] - ETA: 3s - loss: 0.0537 - acc: 0.9427
  640/11608 [>.............................] - ETA: 3s - loss: 0.0701 - acc: 0.9187
  768/11608 [>.............................] - ETA: 3s - loss: 0.0668 - acc: 0.9219
 1024/11608 [=>............................] - ETA: 3s - loss: 0.0556 - acc: 0.9375
 1280/11608 [==>...........................] - ETA: 3s - loss: 0.0508 - acc: 0.9430
 1536/11608 [==>...........................] - ETA: 3s - loss: 0.0487 - acc: 0.9447
 1792/11608 [===>..........................] - ETA: 3s - loss: 0.0441 - acc: 0.9515
 2048/11608 [====>.........................] - ETA: 3s - loss: 0.0411 - acc: 0.9551
 2304/11608 [====>.........................] - ETA: 3s - loss: 0.0389 - acc: 0.9570
 2560/11608 [=====>........................] - ETA: 3s - loss: 0.0390 - acc: 0.9578

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         (None, 50)                0         
_________________________________________________________________
embedding_2 (Embedding)      (None, 50, 200)           5764400   
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 46, 128)           128128    
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 9, 128)            0         
_________________________________________________________________
conv1d_4 (Conv1D)            (None, 5, 128)            82048     
_________________________________________________________________
global_max_pooling1d_2 (Glob (None, 128)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 128)               16512     
__________
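
The output shapes and parameter counts in the summary can be checked by hand. The sketch below redoes the arithmetic, assuming the Keras defaults used in the model cell (no padding and stride 1 for `Conv1D`, pool stride equal to the pool size for `MaxPooling1D`); the helper functions are written for this illustration only.

```python
def conv1d_out_len(n_steps, kernel_size):
    """Output length of an unpadded, stride-1 convolution."""
    return n_steps - kernel_size + 1

def pool_out_len(n_steps, pool_size):
    """Output length of max pooling whose stride equals the pool size."""
    return n_steps // pool_size

def conv1d_params(kernel_size, in_channels, filters):
    """One weight per (kernel position, input channel, filter) plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

def dense_params(n_in, n_out):
    """One weight per input/output pair plus one bias per output."""
    return n_in * n_out + n_out

vocab_rows = 28821 + 1                     # unique tokens + the padding row
embedding_params = vocab_rows * 200        # one 200-d vector per row

len_conv1 = conv1d_out_len(50, 5)          # after the first Conv1D
len_pool = pool_out_len(len_conv1, 5)      # after MaxPooling1D(5)
len_conv2 = conv1d_out_len(len_pool, 5)    # after the second Conv1D

print(len_conv1, len_pool, len_conv2)
print(embedding_params, conv1d_params(5, 200, 128),
      conv1d_params(5, 128, 128), dense_params(128, 128))
```

The sequence shrinks from 50 steps to 5 before global max pooling, which is why deeper stacks of convolution and pooling would not fit a window this short.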

## Trained Model

We now have a trained model and can run predictions on it! Let's try it out with an example text.

In [19]:
sequences = tokenizer.texts_to_sequences(['I love talking to you about machine learning'])
padded_input = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

predictions = model.predict(padded_input)
print("Message is", predictions[0], "offensive")

Message is [0.8330656] offensive
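
The printed value is a raw score rather than a 0/1 label. One simple way to turn it into the labels used for training is to apply a cutoff. The 0.5 threshold and the `to_label` helper below are assumptions for illustration, not part of the original notebook; note also that because the output layer uses `relu`, the score is not guaranteed to stay below 1.

```python
def to_label(score, threshold=0.5):
    """Return 1 (offensive) when the score reaches the threshold, else 0 (not offensive)."""
    return 1 if score >= threshold else 0

score = 0.8330656   # the value printed by the prediction cell
label = to_label(score)
print("label:", label)
```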


## Export the Model

Our last step is to export the model's structure as a JSON file and its weights as an h5 file. These, along with the pickled tokenizer dictionary, constitute our model.

In [20]:
model_json = model.to_json()
with open("model/model.json", "w") as json_file:
    json_file.write(model_json)
model.save_weights("model/model.h5")
print("Saved model to disk")


Saved model to disk
