Bidirectional LSTM as implemented in Jeremy Howard's [kernel](https://www.kaggle.com/jhoward/improved-lstm-baseline-glove-dropout/notebook)
 - To do: try out the concat pooling method
 - To do: implement in PyTorch
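
For reference on the first to-do item: concat pooling, as described in the ULMFiT work from fastai, concatenates the last time step of the recurrent output with the max-pooled and mean-pooled outputs over time, rather than using max pooling alone. Below is a minimal NumPy sketch of that operation, assuming an output array of shape `(batch, time, features)`. The function name `concat_pool` is hypothetical and this is not code from the kernel.

```python
import numpy as np

def concat_pool(seq_out):
    """Concatenate last-step, max-pooled and mean-pooled features.

    seq_out: array of shape (batch, time, features), e.g. the output of a
    recurrent layer that returns the full sequence.
    Returns an array of shape (batch, 3 * features).
    """
    last = seq_out[:, -1, :]          # features at the final time step
    max_pool = seq_out.max(axis=1)    # element-wise max over time
    mean_pool = seq_out.mean(axis=1)  # element-wise mean over time
    return np.concatenate([last, max_pool, mean_pool], axis=1)

# toy input: batch of 2 sequences, 4 time steps, 3 features
demo = np.arange(24, dtype="float32").reshape(2, 4, 3)
pooled = concat_pool(demo)
print(pooled.shape)
```

With the model used below, this would replace the single max-pooling step and triple the width of the vector fed to the dense layers.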

In [3]:
import sys, os, re, csv, codecs, numpy as np, pandas as pd
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM,Embedding,Dropout,Activation, Bidirectional,GlobalMaxPool1D
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers
import keras.backend as K

In [6]:
print(K.backend())

tensorflow


Download the GloVe pretrained embeddings from https://nlp.stanford.edu/projects/glove/, specifically the `glove.6B.zip` file, which contains the `glove.6B.50d.txt` file used below.

In [10]:
PATH = '/home/odenigborig/Data/kaggle/toxic_comment'
train_file = os.path.join(PATH,'train.csv')
test_file = os.path.join(PATH,'test.csv')
embedding_file = os.path.join(PATH,'glove.6B.50d.txt')

In [11]:
embed_size = 50   #dimension of each word vector
max_words = 20000 #vocabulary size: keep only the most frequent words
max_len = 100     #sequence length: number of words to use from each comment

Load text data and replace missing values.

In [16]:
train = pd.read_csv(train_file)
test = pd.read_csv(test_file)

list_sentences_train = train['comment_text'].fillna('_na_').values
list_sentences_test = test['comment_text'].fillna('_na_').values
list_classes = list(train.columns[2:])
y = train[list_classes].values

In [18]:
y.shape

(159571, 6)

Preprocess the texts by tokenizing them, mapping each token to its word index, and padding or truncating every sequence to the same length (`max_len`).

In [19]:
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(list_sentences_train)
list_tokenizer_train = tokenizer.texts_to_sequences(list_sentences_train)
list_tokenizer_test = tokenizer.texts_to_sequences(list_sentences_test)
X_train = pad_sequences(list_tokenizer_train,maxlen=max_len)
X_test = pad_sequences(list_tokenizer_test,maxlen=max_len)

In [21]:
X_train.shape,X_test.shape

((159571, 100), (153164, 100))
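
To make the preprocessing concrete, the sketch below mimics the two steps on a toy corpus in plain Python and NumPy rather than Keras. Words are indexed by descending frequency starting at 1 (0 is reserved for padding), words whose index is `num_words` or higher are dropped, and each sequence is left-padded with zeros or truncated from the front, which matches the `pad_sequences` defaults (`padding='pre'`, `truncating='pre'`). The helper names `build_word_index`, `to_sequence` and `pad_pre` are hypothetical, and splitting on whitespace is a simplification of what `Tokenizer` does.

```python
from collections import Counter
import numpy as np

def build_word_index(texts):
    """Map each word to a 1-based index, most frequent word first."""
    counts = Counter(w for t in texts for w in t.lower().split())
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

def to_sequence(text, word_index, num_words):
    """Convert a text to word indices, dropping words with index >= num_words."""
    idxs = (word_index.get(w) for w in text.lower().split())
    return [i for i in idxs if i is not None and i < num_words]

def pad_pre(seq, maxlen):
    """Left-pad with zeros, or keep only the last maxlen indices."""
    seq = seq[-maxlen:]
    return np.array([0] * (maxlen - len(seq)) + seq, dtype="int32")

demo_texts = ["the cat sat", "the cat sat on the mat"]
demo_index = build_word_index(demo_texts)
rows = [pad_pre(to_sequence(t, demo_index, num_words=5), maxlen=4) for t in demo_texts]
X_demo = np.stack(rows)
print(demo_index)
print(X_demo)
```

The short comment gains a leading zero, while the long one loses its first word to truncation and its rarest word ("mat") to the vocabulary limit.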

Read the GloVe word vectors into a dictionary mapping each word to its vector.

In [22]:
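def get_coefs(word,*arr): return word,np.asarray(arr,dtype='float32')
#open with a context manager so the file handle is closed after reading
with open(embedding_file,encoding='utf-8') as f:
    embeddings_index = dict(get_coefs(*o.strip().split()) for o in f)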
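
Each line of the GloVe text file holds a word followed by its vector components, separated by spaces. The sketch below parses two made-up lines in that layout from an in-memory buffer, so it runs without the download. The values and the helper name `parse_glove` are illustrative assumptions, not part of the real file or the kernel.

```python
import io
import numpy as np

# two made-up lines in the GloVe text layout: a word, then its vector components
fake_glove = io.StringIO("cat 0.1 -0.2 0.3\ndog 0.4 0.5 -0.6\n")

def parse_glove(handle):
    """Return {word: float32 vector} from an open GloVe-style text file."""
    index = {}
    for line in handle:
        word, *values = line.rstrip().split(" ")
        index[word] = np.asarray(values, dtype="float32")
    return index

demo_vectors = parse_glove(fake_glove)
print(demo_vectors["cat"])
```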

Use the above vectors to create the embedding matrix, with random initialization for words not in GloVe. Draw the random values from a normal distribution with the same mean and standard deviation as the GloVe embeddings.

In [25]:
all_embeddings = np.stack(list(embeddings_index.values())) #np.stack needs a sequence, not a dict view
emb_mean,emb_stdev = all_embeddings.mean(),all_embeddings.std()
emb_mean,emb_stdev

(0.020940464, 0.64410418)

In [28]:
word_index = tokenizer.word_index
nb_words = min(max_words,len(word_index))
embedding_matrix = np.random.normal(emb_mean,emb_stdev,(nb_words,embed_size)) #randomly initialize embedding matrix

#assign GloVe vectors to words found in GloVe; all other words keep their random vectors
for word,i in word_index.items():
    if i >= nb_words: continue #skip words whose index falls outside the embedding matrix
    embedding_vector = embeddings_index.get(word) #retrieve GloVe vector, None if the word is absent
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector  #overwrite the random row with the GloVe vector
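
The same idea on a toy vocabulary, as a sketch: rows start as random draws with the mean and standard deviation of the pretrained vectors, and rows for known words are then overwritten. The 3-dimensional vectors, the word list and all `toy_*` names are made up for illustration. Row 0 is kept for the padding index, since Keras word indices start at 1.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the demo is repeatable

# hypothetical 3-dimensional "pretrained" vectors
pretrained = {"cat": np.array([1.0, 2.0, 3.0]), "dog": np.array([4.0, 5.0, 6.0])}
toy_word_index = {"cat": 1, "bird": 2, "dog": 3}  # 1-based; "bird" has no pretrained vector

stacked = np.stack(list(pretrained.values()))
toy_mean, toy_std = stacked.mean(), stacked.std()

# one row per index 0..3; every row starts out random
toy_matrix = rng.normal(toy_mean, toy_std, (len(toy_word_index) + 1, 3))
for word, i in toy_word_index.items():
    vec = pretrained.get(word)
    if vec is not None:
        toy_matrix[i] = vec  # known word: overwrite the random row

print(toy_matrix)
```

Rows 1 and 3 end up holding the pretrained vectors, while rows 0 and 2 keep their random initialization.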
        

Model: a bidirectional LSTM followed by global max pooling and two fully connected layers. Dropout is added because the model starts overfitting after 2 epochs.

In [33]:
inp = Input(shape=(max_len,))
x = Embedding(max_words,embed_size,weights=[embedding_matrix])(inp)
x = Bidirectional(LSTM(50,return_sequences=True,dropout=0.1,recurrent_dropout=0.1))(x)
x = GlobalMaxPool1D()(x)
x = Dense(50,activation='relu')(x)
x = Dropout(0.1)(x)
x = Dense(6,activation='sigmoid')(x)
model = Model(inp,x)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_4 (InputLayer)         (None, 100)               0         
_________________________________________________________________
embedding_4 (Embedding)      (None, 100, 50)           1000000   
_________________________________________________________________
bidirectional_3 (Bidirection (None, 100, 100)          40400     
_________________________________________________________________
global_max_pooling1d_2 (Glob (None, 100)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 50)                5050      
_________________________________________________________________
dropout_2 (Dropout)          (None, 50)                0         
_________________________________________________________________
dense_4 (Dense)              (None, 6)                 306       
Total para
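
As a sanity check, the per-layer parameter counts in the summary can be reproduced by hand. The sketch below uses the standard LSTM count (four gates, each with input weights, recurrent weights and a bias); the variable names are mine, and the sizes are the ones set earlier in the notebook.

```python
vocab, embed_dim, lstm_units, dense_units, n_classes = 20000, 50, 50, 50, 6

embedding = vocab * embed_dim
# one LSTM direction: 4 gates, each with input weights, recurrent weights and a bias
lstm_one_dir = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
bilstm = 2 * lstm_one_dir  # forward and backward directions
# the pooled vector is 2 * lstm_units wide (both directions concatenated)
dense_hidden = (2 * lstm_units) * dense_units + dense_units
dense_out = dense_units * n_classes + n_classes

param_counts = {
    "embedding": embedding,
    "bidirectional": bilstm,
    "dense_hidden": dense_hidden,
    "dense_out": dense_out,
}
total_params = sum(param_counts.values())
print(param_counts, total_params)
```

The embedding layer accounts for almost all of the parameters, and the pooling and dropout layers contribute none.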

In [34]:
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model.fit(X_train,y,batch_size=32,epochs=2,validation_split=0.1)

Train on 143613 samples, validate on 15958 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f130903ffd0>
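
The train/validation counts in the log follow from `validation_split=0.1`. Keras holds out the last fraction of the samples, before any shuffling. The sketch below assumes the split point is computed by truncating `n * (1 - validation_split)`, which is consistent with the counts printed above.

```python
n_samples, validation_split = 159571, 0.1

# assumed rule: truncate to an integer split point; the tail is held out
split_at = int(n_samples * (1.0 - validation_split))
n_train, n_val = split_at, n_samples - split_at
print(n_train, n_val)
```

Because the held-out rows are simply the tail of `X_train`, any ordering in the training file carries over to the validation set.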

In [None]:
from IPython.display import FileLink

df_submission = pd.read_csv(os.path.join(PATH,'sample_submission.csv'))
kag_preds = model.predict(X_test,batch_size=1024,verbose=1)
df_submission[df_submission.columns[1:]] = kag_preds

In [38]:
df_submission.head()

Unnamed: 0,id,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,00001cee341fdb12,0.995608,0.3975916,0.969701,0.1437341,0.903331,0.24651
1,0000247867823ef7,0.000266,9.189914e-07,7.7e-05,1.845149e-07,8e-06,3e-06
2,00013b17ad220c46,0.001102,3.386773e-06,0.000198,2.323211e-06,3.1e-05,1.1e-05
3,00017563c3f7919a,0.001666,2.777298e-06,0.000285,1.380096e-06,8.6e-05,7e-06
4,00017695ad8997eb,0.005767,1.857494e-05,0.000905,1.936324e-05,0.00022,5.1e-05


In [39]:
fname = 'submission_lstm_baseline.csv'
df_submission.to_csv(fname, index=False)
FileLink(fname)

The above model scored 0.9759.
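
Assuming this score is the competition metric, the mean column-wise ROC AUC (the note itself does not say which metric it is), a comparable number could be estimated locally on held-out data. The sketch below computes it in plain NumPy on made-up labels and scores. The function names are hypothetical, and the pairwise comparison is meant for small examples only, since it builds a positives-by-negatives array.

```python
import numpy as np

def roc_auc(y_true, y_score):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half. Memory grows with n_pos * n_neg, so use small inputs.
    """
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

def mean_columnwise_auc(y_true, y_score):
    """Average the per-class AUC over all label columns."""
    n_cols = y_true.shape[1]
    return float(np.mean([roc_auc(y_true[:, j], y_score[:, j]) for j in range(n_cols)]))

# made-up labels and scores for 4 comments and 2 classes
y_true_demo = np.array([[1, 0], [0, 0], [1, 1], [0, 1]])
y_score_demo = np.array([[0.9, 0.2], [0.1, 0.6], [0.4, 0.7], [0.5, 0.8]])
auc_demo = mean_columnwise_auc(y_true_demo, y_score_demo)
print(auc_demo)
```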