This is a simple notebook to get you started with stacking Keras's take on fastText and a classic bag-of-words (BOW) scikit-learn model. The "stacking" here is a plain average of the two models' predictions.

Based on: https://www.kaggle.com/sterby/fasttext-like-baseline-with-keras-lb-0-257 , https://www.kaggle.com/yekenot/toxic-regression

In [None]:
import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
from keras.preprocessing import sequence
from keras.models import Model
from keras.layers import Input, Dense, Embedding, GlobalAveragePooling1D, Dropout, SpatialDropout1D
from keras.preprocessing.text import Tokenizer

# Load the data

In [None]:
train_df = pd.read_csv("../input/train.csv")
test_df = pd.read_csv("../input/test.csv")

In [None]:
df = pd.concat([train_df['comment_text'], test_df['comment_text']], axis=0).fillna("BLANK")  # concatenate train and test text so the vectorizer is fit on both ("cheating")

In [None]:
train_df.head()

## A little EDA: Is this multiclass or multilabel? 
* Looks like it can be multilabel :(
* Might be reversible with: https://stackoverflow.com/questions/44464280/mapping-one-hot-encoded-target-values-to-proper-label-names
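
Because a row can carry several labels at once, an `argmax`-style reversal of one-hot targets would drop information. A minimal sketch of mapping each multi-hot row back to its label names instead, using a made-up label matrix (the `rows` values and the `to_label_names` helper are illustrative, not part of this notebook):

```python
# Label columns as they appear in the training data.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def to_label_names(row, labels=LABELS):
    """Return the names of every label set to 1 in a multi-hot row."""
    return [name for name, flag in zip(labels, row) if flag == 1]

# Hypothetical multi-hot rows, for illustration only.
rows = [
    [0, 0, 0, 0, 0, 0],  # clean comment: no labels
    [1, 0, 1, 0, 1, 0],  # three labels at once
    [1, 0, 0, 0, 0, 0],  # a single label
]

names = [to_label_names(r) for r in rows]
n_multilabel = sum(len(n) > 1 for n in names)
print(names)
print("rows with more than one label:", n_multilabel)
```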

In [None]:
(train_df.iloc[:, 2:].sum(axis=1) > 1).sum()

In [None]:
print(train_df.comment_text.str.len().describe())

In [None]:
print(train_df.comment_text.str.split().str.len().describe())

In [None]:
print(test_df.comment_text.str.split().str.len().describe())

* It looks like we have fewer than a hundred words, and a few hundred characters, per comment.
* This will help us choose our max length, and it suggests that the comments contain many short words.
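
One way to turn these length statistics into a concrete max length is to pick a high percentile of the per-comment word counts, so that most comments fit without truncation. A rough sketch on made-up comments (the `comments` list and the 90th-percentile cutoff are assumptions for illustration, not the values used in this notebook):

```python
import numpy as np

# Hypothetical comments standing in for train_df.comment_text.
comments = [
    "thanks for the edit",
    "please stop",
    "this is a much longer comment that goes on for quite a few more words than the others",
    "ok",
    "I disagree with this change and explained why on the talk page",
]

# Whitespace token count per comment, mirroring .str.split().str.len().
word_counts = np.array([len(c.split()) for c in comments])

# Choose a max length that covers most comments (90th percentile here).
maxlen_candidate = int(np.percentile(word_counts, 90))

# Share of comments that fit without truncation at that length.
coverage = (word_counts <= maxlen_candidate).mean()
print("word counts:", word_counts)
print("candidate maxlen:", maxlen_candidate)
print("share of comments that fit:", coverage)
```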

In [None]:
X_train = train_df["comment_text"].fillna("BLANK").values
y_train = train_df[["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]].values
X_test = test_df["comment_text"].fillna("BLANK").values

In [None]:
i = 0
print("Comment: {}".format(X_train[i]))
print("Label: {}".format(y_train[i]))

# Use a simple fastText-like model

In [None]:
# Set parameters:
max_features = 95000
maxlen = 84
batch_size = 32
embedding_dims = 60 #50
epochs = 3

In [None]:
print('Tokenizing data...')
tok = Tokenizer(num_words=max_features)
tok.fit_on_texts(list(X_train) + list(X_test))
x_train = tok.texts_to_sequences(X_train)
x_test = tok.texts_to_sequences(X_test)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')
print('Average train sequence length: {}'.format(np.mean(list(map(len, x_train)), dtype=int)))
print('Average test sequence length: {}'.format(np.mean(list(map(len, x_test)), dtype=int)))

In [None]:
print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)
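
Worth knowing: with its default settings, `pad_sequences` pads and truncates at the front of each sequence, so long comments keep their last `maxlen` tokens. A plain-Python sketch of that default behaviour (the `pad_pre` helper is an illustrative stand-in, not a Keras function):

```python
def pad_pre(seqs, maxlen, value=0):
    """Mimic pad_sequences defaults: padding='pre', truncating='pre'."""
    padded = []
    for seq in seqs:
        kept = list(seq)[-maxlen:]  # long sequences keep their last maxlen tokens
        padded.append([value] * (maxlen - len(kept)) + kept)  # short ones get leading fill
    return padded

# Hypothetical token-id sequences of different lengths.
seqs = [[5, 8], [1, 2, 3, 4, 5, 6], [7, 7, 7, 7]]
result = pad_pre(seqs, maxlen=4)
for row in result:
    print(row)
```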

In [None]:
print('Build model...')
comment_input = Input((maxlen,))

# we start off with an embedding layer
comment_emb = Embedding(max_features, embedding_dims, input_length=maxlen)(comment_input)
# We see that we overfit straight away, so dropout may be useful
drp = SpatialDropout1D(0.1)(comment_emb)
# we add a GlobalAveragePooling1D, which will average the embeddings
# of all words in the document
main = GlobalAveragePooling1D()(drp)

# We project onto a six-unit output layer (one unit per label) and squash each
# unit with a sigmoid, since a comment can carry several labels at once:
output = Dense(6, activation='sigmoid')(main)

model = Model(inputs=comment_input, outputs=output)

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In [None]:
hist = model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.1)
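
To make the architecture concrete, here is a NumPy sketch of what an embedding-average model of this kind computes at prediction time: look up one vector per token, average over the sequence, then apply a dense layer with one sigmoid per label. The weights are random and the sizes are tiny, purely for illustration; dropout is left out because it is inactive at inference.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny illustrative sizes, not the notebook's settings.
vocab_size, embedding_dims, maxlen, n_labels = 50, 4, 6, 6

# Randomly initialised parameters standing in for trained weights.
embedding = rng.normal(size=(vocab_size, embedding_dims))
dense_w = rng.normal(size=(embedding_dims, n_labels))
dense_b = np.zeros(n_labels)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(token_ids):
    """token_ids: int array of shape (batch, maxlen), already padded."""
    vectors = embedding[token_ids]              # (batch, maxlen, embedding_dims)
    pooled = vectors.mean(axis=1)               # global average pooling over the sequence
    return sigmoid(pooled @ dense_w + dense_b)  # (batch, n_labels)

batch = rng.integers(0, vocab_size, size=(2, maxlen))
probs = predict(batch)
print(probs.shape)
print(probs.round(3))
```

Because each label has its own sigmoid, the six probabilities in a row are independent and need not sum to 1.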

In [None]:
# print('Build model...')
# comment_input = Input((maxlen,))

# # we start off with an  embedding layer
# comment_emb = Embedding(max_features, embedding_dims, input_length=maxlen)(comment_input)
# # We see that we overfit straight away, so dropout may be useful
# drp = Dropout(0.15)(comment_emb)
# # we add a GlobalAveragePooling1D, which will average the embeddings
# # of all words in the document
# main = GlobalAveragePooling1D()(drp)

# drp2 =  Dropout(0.25)(main)
# # We project onto a single unit output layer, and squash it with a sigmoid:
# output = Dense(6, activation='softmax')(drp2)

# model2 = Model(inputs=comment_input, outputs=output)

# model2.compile(loss='categorical_crossentropy',
#               optimizer='adam',
#               metrics=['accuracy'])

# hist2 = model2.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.1)

#### Playing with dropout doesn't move the fastText needle (results are the same with or without dropout). Not very surprising, as it's just a linear embedding.
* Final output is still "loss: 0.2863 - acc: 0.9890 - val_loss: 0.3014 - val_acc: 0.9892"

* Note that spatial dropout has a much bigger effect! 
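
One plausible reason for the difference: element-wise dropout zeroes individual entries, and averaging over the sequence afterwards largely washes that noise out, whereas spatial dropout removes whole embedding channels for a sample, which survive the pooling as exact zeros. A NumPy sketch of the two masks on a toy tensor (the shapes and the drop rate are assumptions chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, steps, channels, rate = 1, 200, 8, 0.25

x = np.ones((batch, steps, channels))  # toy "embedded comment"

# Element-wise dropout: an independent mask entry per (step, channel).
elem_mask = rng.random((batch, steps, channels)) >= rate
elem_out = x * elem_mask / (1 - rate)

# Spatial dropout: one mask entry per channel, shared across all steps.
chan_mask = rng.random((batch, 1, channels)) >= rate
spatial_out = x * chan_mask / (1 - rate)

# Effect after global average pooling over the step axis.
elem_pooled = elem_out.mean(axis=1)
spatial_pooled = spatial_out.mean(axis=1)
print("element-wise, pooled:", elem_pooled.round(2))
print("spatial, pooled:     ", spatial_pooled.round(2))
```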

## Ensemble!
* Let's add the output from another model

In [None]:
nrow_train = train_df.shape[0]

vectorizer = CountVectorizer(stop_words='english', min_df=3, max_df=0.97, max_features=40000)
data = vectorizer.fit_transform(df)

col = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

lr_preds = np.zeros((test_df.shape[0], len(col)))

X_train = data[:nrow_train]
X_test = data[nrow_train:]

for i, j in enumerate(col):
    print('fit '+j)
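    # dual=True is only supported by the liblinear solver, so request it explicitly
    lr_model = LogisticRegression(C=0.1, dual=True, solver='liblinear')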
    lr_model.fit(X_train, train_df[j])
    lr_preds[:,i] = lr_model.predict_proba(X_test)[:,1]
print("done")

## Quick sanity check: compare our predicted outputs

In [None]:
for i, j in enumerate(col):
    print(j,lr_preds[:,i].mean())
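
A natural reference point for this check is the share of positive rows per label in the training data: mean predicted probabilities far from those rates would hint at a problem. A small sketch with made-up numbers (both arrays are hypothetical; in this notebook they would come from `train_df[col]` and `lr_preds`):

```python
import numpy as np

col = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# Hypothetical training labels (rows x labels) and test-set predictions.
y_true = np.array([
    [1, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
])
preds = np.array([
    [0.9, 0.1, 0.6, 0.0, 0.5, 0.1],
    [0.1, 0.0, 0.1, 0.0, 0.1, 0.0],
    [0.6, 0.0, 0.2, 0.0, 0.3, 0.0],
])

base_rate = y_true.mean(axis=0)  # share of positives per label in train
mean_pred = preds.mean(axis=0)   # average predicted probability per label

for name, rate, pred in zip(col, base_rate, mean_pred):
    print(f"{name:14s} train rate={rate:.3f}  mean pred={pred:.3f}")
```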

In [None]:
# Get predictions from our keras/fasttext model
ft_pred = model.predict(x_test)

# Submit

In [None]:
# get mean of both models' predictions
y_pred = (lr_preds + ft_pred) / 2.0


In [None]:
submission = pd.read_csv("../input/sample_submission.csv")

In [None]:
submission[["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]] = y_pred

In [None]:
submission.head()

In [None]:
submission.to_csv("submission_fasttext_1.csv", index=False)