# Preliminaries
I know, I know... I come up with some pretty terrible names for my kernels.

So, as you can imagine, I will be using the [GloVe](https://nlp.stanford.edu/projects/glove/) word embedding. This is just one of many different word embeddings typically used. Other popular choices include Word2Vec and [fastText](https://fasttext.cc/).

So, I will be passing everything through a convolutional neural network (CNN) first. CNNs serve an extremely useful purpose: they compress the input, which puts far less strain on the later layers of our neural network. Although most of us first became familiar with CNNs through image recognition, they are extremely valuable for natural language processing as well. See, for instance, [Yoon Kim's paper](https://arxiv.org/pdf/1408.5882.pdf) on the subject.
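
To make the "compression" idea concrete, here is a minimal sketch of a 1D convolution followed by max pooling, written in plain Python rather than Keras. Everything in it is made up for illustration (the helper names `conv1d` and `max_pool1d`, the toy sequence, the kernel weights); a real text CNN slides learned filters over embedding vectors, not over single numbers.

```python
# Toy 1D convolution + max pooling over a sequence of scalar features.
# Hypothetical helpers for illustration only; Keras provides Conv1D and
# MaxPooling1D layers that do this over embedding vectors with learned weights.

def conv1d(seq, kernel):
    """'Valid' 1D convolution with stride 1, followed by a ReLU."""
    k = len(kernel)
    out = []
    for start in range(len(seq) - k + 1):
        window = seq[start:start + k]
        activation = sum(w * x for w, x in zip(kernel, window))
        out.append(max(0.0, activation))
    return out

def max_pool1d(seq, pool_size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(seq[i:i + pool_size])
            for i in range(0, len(seq) - pool_size + 1, pool_size)]

sequence = [0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 0.5, 0.0]
kernel = [1.0, -1.0, 1.0]

features = conv1d(sequence, kernel)
pooled = max_pool1d(features, pool_size=2)
print(len(sequence), len(features), len(pooled))  # 10 8 4
```

Ten inputs shrink to eight features and then to four pooled values, which is the kind of shortening that takes strain off whatever layers come next.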

Next, we will be passing through a GRU. GRUs are a variation on the recurrent neural network (RNN) architecture, and they are pretty standard players in text classification at this point. Some fantastic resources for understanding them are [Ian Goodfellow et al.'s 'Deep Learning'](https://www.deeplearningbook.org/), Alex Graves's 'Supervised Sequence Labelling with Recurrent Neural Networks' ([Goodreads](https://www.goodreads.com/book/show/14642424-supervised-sequence-labelling-with-recurrent-neural-networks?ac=1&from_search=true)), and [this paper](https://arxiv.org/pdf/1308.0850.pdf) by Alex Graves.
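
If you have never looked inside a GRU, a single time step is small enough to write out by hand. The NumPy sketch below is only an illustration under a few assumptions: biases are left out, the weights are random rather than trained, the names (`gru_step`, `params`) are mine, and the sign convention for the update gate is one of two in common use (some references and libraries swap the roles of `z` and `1 - z`).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, params):
    """One GRU time step (biases omitted to keep the sketch short)."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(Wz @ x + Uz @ h_prev)             # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev)             # reset gate
    h_cand = np.tanh(Wh @ x + Uh @ (r * h_prev))  # candidate state
    # Blend the old state with the candidate, weighted by the update gate.
    return (1.0 - z) * h_prev + z * h_cand

rng = np.random.default_rng(0)
input_dim, hidden_dim, steps = 4, 3, 5

# Shapes alternate: (hidden, input) for W*, (hidden, hidden) for U*.
params = [rng.normal(scale=0.5, size=(hidden_dim, d))
          for d in (input_dim, hidden_dim) * 3]

h = np.zeros(hidden_dim)
for x in rng.normal(size=(steps, input_dim)):  # a random 5-step "sentence"
    h = gru_step(x, h, params)

print(h.shape)  # (3,)
```

The gates are what let the unit decide, step by step, how much of the old state to keep and how much to overwrite, and that is the part a plain RNN lacks.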

Lastly, we are passing everything through a densely connected neural network layer. This provides a nice catch-all for classifying the information gathered by our RNN.

A shoutout to a couple of other kernels that were of some help in making this one:
* [A look at different embeddings](https://www.kaggle.com/sudalairajkumar/a-look-at-different-embeddings)
* [LSTM is all you need! Well, maybe embeddings also.](https://www.kaggle.com/mihaskalic/lstm-is-all-you-need-well-maybe-embeddings-also)

In [None]:
# SciPy stack
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split # To split our training set into training/validation sets.
from sklearn import metrics

# We are building a sequential neural network.
from keras.models import Sequential

# Used to process text data.
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# What layers will be involved with our neural network.
from keras.layers import Dense, LSTM, Embedding, Dropout, Activation, CuDNNGRU, CuDNNLSTM, Conv1D, MaxPooling1D, Bidirectional, GlobalMaxPool1D

# Loading Data into Memory
With the important libraries imported, we need to load in our training data first.

In [None]:
df = pd.read_csv("../input/train.csv")

In [None]:
df_train, df_val = train_test_split(df, test_size=0.1, random_state=42)

In [None]:
# Some config values
embed_size = 300 # how big is each word vector (glove.840B.300d is 300-dimensional)
max_features = 50000 # how many unique words to use (i.e. num rows in embedding matrix)
maxlen = 200 # max number of words in a question to use

# fill up the missing values
x_train = df_train["question_text"].fillna("_na_").values
x_val = df_val["question_text"].fillna("_na_").values

# Tokenize the sentences
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(x_train))
x_train = tokenizer.texts_to_sequences(x_train)
x_val = tokenizer.texts_to_sequences(x_val)

# Pad the sentences 
x_train = pad_sequences(x_train, maxlen=maxlen)
x_val = pad_sequences(x_val, maxlen=maxlen)

# Get the target values
y_train = df_train['target'].values
y_val = df_val['target'].values
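
If it is not obvious what the tokenizer and padding steps actually do to a question, this plain-Python sketch mimics them on three toy sentences. It is a simplification, not the Keras implementation: the helper names are hypothetical, it splits on whitespace only (Keras also strips punctuation), and it breaks frequency ties alphabetically. The parts it does try to mirror are the 1-based frequency ranking, dropping words whose index is not below `num_words`, and padding and truncating at the front of the sequence.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; index 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def to_sequences(texts, word_index, num_words):
    """Map words to indices, dropping unknown or too-rare words."""
    return [[word_index[w] for w in text.lower().split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

def pad_pre(sequences, maxlen):
    """Left-pad with zeros and keep only the last `maxlen` tokens."""
    return [[0] * (maxlen - len(s[-maxlen:])) + s[-maxlen:] for s in sequences]

toy_texts = ["why is the sky blue", "is the sky the limit", "why"]
toy_index = fit_word_index(toy_texts)
toy_seqs = to_sequences(toy_texts, toy_index, num_words=5)
toy_padded = pad_pre(toy_seqs, maxlen=3)

print(toy_index)   # {'the': 1, 'is': 2, 'sky': 3, 'why': 4, 'blue': 5, 'limit': 6}
print(toy_seqs)    # [[4, 2, 1, 3], [2, 1, 3, 1], [4]]
print(toy_padded)  # [[2, 1, 3], [1, 3, 1], [0, 0, 4]]
```

Note that "blue" and "limit" silently disappear because their indices are not below `num_words`, and long questions lose their first words rather than their last.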

# GloVe Embedding
Next, we need to load the GloVe embedding into memory. The file containing all the weights is already provided. This may take a while to load, as the entire file is just over 2.0GB.

In [None]:
EMBEDDING_FILE = '../input/embeddings/glove.840B.300d/glove.840B.300d.txt'
def get_coefs(word,*arr): 
    return word, np.asarray(arr, dtype='float32')
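# Read the file inside a with-block so the handle is closed afterwards.
with open(EMBEDDING_FILE) as f:
    embeddings_index = dict(get_coefs(*o.split(" ")) for o in f)

# np.stack needs a sequence, so materialise the dict view as a list first.
all_embs = np.stack(list(embeddings_index.values()))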
emb_mean,emb_std = all_embs.mean(), all_embs.std()
embed_size = all_embs.shape[1]

word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index) + 1) # +1 because word_index starts at 1; row 0 is padding
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
for word, i in word_index.items():
    if i >= nb_words: 
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: 
        embedding_matrix[i] = embedding_vector
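
The same recipe is easier to follow on a toy scale. In the sketch below the two "GloVe" lines, the three-word vocabulary and all the `toy_` names are invented for illustration; the point is just to show that known words get their pretrained vector while unknown words keep a random row drawn with the same mean and standard deviation.

```python
import numpy as np

# Hypothetical stand-ins for the GloVe file and the tokenizer's word_index.
glove_lines = [
    "cat 0.1 0.2 0.3",
    "dog 0.4 0.5 0.6",
]
toy_word_index = {"cat": 1, "dog": 2, "zyzzyva": 3}  # 1-based, like Keras
toy_max_features = 4

def parse_line(line):
    """Split 'word v1 v2 ...' into the word and a float32 vector."""
    word, *values = line.rstrip().split(" ")
    return word, np.asarray(values, dtype="float32")

toy_embeddings = dict(parse_line(line) for line in glove_lines)
vectors = np.stack(list(toy_embeddings.values()))
dim = vectors.shape[1]

# Row 0 belongs to the padding index, so we need (largest index + 1) rows.
n_rows = min(toy_max_features, len(toy_word_index) + 1)
rng = np.random.default_rng(0)
toy_matrix = rng.normal(vectors.mean(), vectors.std(), (n_rows, dim))

for word, i in toy_word_index.items():
    if i >= n_rows:
        continue
    vector = toy_embeddings.get(word)
    if vector is not None:  # words missing from GloVe keep their random row
        toy_matrix[i] = vector

print(toy_matrix.shape)  # (4, 3)
```

Rows 1 and 2 now hold the "cat" and "dog" vectors, while rows 0 and 3 are still random.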

# Constructing and Training our Model

In [None]:
model = Sequential()
model.add(Embedding(embedding_matrix.shape[0], # vocabulary size, matching the pretrained weights
                    embed_size, 
                    weights=[embedding_matrix]))
model.add(Bidirectional(CuDNNGRU(64, return_sequences=True)))
model.add(GlobalMaxPool1D())
model.add(Dropout(0.1))
model.add(Dense(16, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['acc'])

In [None]:
model.summary()

In [None]:
history = model.fit(x_train, y_train, batch_size=512, epochs=2, validation_data=(x_val, y_val))

Now, usually I find that plotting both training and validation loss and accuracy is very useful for getting a high-level view of training progress, as well as of whatever overfitting may be going on. In this case, I found overfitting really started to take hold after two epochs. The graph only becomes useful with around 5+ epochs, so the plotting code is left commented out.
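
Rather than eyeballing where overfitting starts, you can apply the usual early-stopping rule to the validation losses. The sketch below is plain Python with made-up loss values and a hypothetical helper name; it is not what this kernel runs. Keras ships an `EarlyStopping` callback that implements the same idea during training.

```python
def early_stopping_epoch(val_losses, patience=1):
    """Return (best_epoch, stop_epoch), both 1-based.

    Training stops once the validation loss has failed to improve
    for `patience` epochs in a row.
    """
    best, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                return best, epoch
    return best, len(val_losses)

# Hypothetical validation losses, one per epoch.
toy_val_loss = [0.120, 0.108, 0.111, 0.119, 0.131]

print(early_stopping_epoch(toy_val_loss, patience=2))  # (2, 4)
```

With these numbers the best epoch is the second one, and a patience of two would stop training after the fourth.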

In [None]:
'''
import matplotlib.pyplot as plt

def plot_scores(history):
    acc = history.history['acc']
    val_acc = history.history['val_acc']

    loss = history.history['loss']
    val_loss = history.history['val_loss']

    epochs = range(1, len(loss) + 1)

    plt.figure(figsize=(20, 10))

    plt.subplot(121)
    plt.plot(epochs, acc, 'bo', label='Training accuracy')
    plt.plot(epochs, val_acc, 'b', label='Validation accuracy')
    plt.title("Training and validation accuracy")
    plt.legend()

    plt.subplot(122)
    plt.plot(epochs, loss, 'bo', label='Training loss')
    plt.plot(epochs, val_loss, 'b', label='Validation loss')
    plt.title("Training and validation loss")
    plt.legend()
    plt.show()
    
plot_scores(history)
'''

# Testing

In [None]:
df_test = pd.read_csv("../input/test.csv")
x_test = df_test["question_text"].fillna("_na_").values

x_test = tokenizer.texts_to_sequences(x_test)

x_test = pad_sequences(x_test, maxlen=maxlen)

In [None]:
y_test = model.predict([x_test], batch_size=1024, verbose=1)
y_test = (y_test > 0.5).astype(int).ravel() # flatten the (n, 1) predictions to shape (n,)
df_test = pd.DataFrame({"qid": df_test["qid"].values})
df_test['prediction'] = y_test
df_test.to_csv("submission.csv", index=False)
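
The 0.5 cut-off above is just the default choice. Assuming the score you care about is F1 (the kernel imports `sklearn.metrics` but never says which metric it targets), a common trick is to try several thresholds on the validation predictions and keep the best one. The sketch below does this in plain Python on invented labels and probabilities; the helper names are hypothetical, and in practice `metrics.f1_score` would replace the hand-written version.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def f1_at(y_true, probs, threshold):
    return f1_score(y_true, [int(p > threshold) for p in probs])

def best_threshold(y_true, probs, candidates):
    """Return (threshold, f1) for the candidate with the highest F1."""
    best = max(candidates, key=lambda t: f1_at(y_true, probs, t))
    return best, f1_at(y_true, probs, best)

# Hypothetical validation labels and predicted probabilities.
toy_true = [0, 0, 0, 0, 1, 1, 0, 1]
toy_probs = [0.05, 0.12, 0.22, 0.31, 0.42, 0.45, 0.62, 0.91]
candidates = [i / 10 for i in range(1, 10)]

threshold, score = best_threshold(toy_true, toy_probs, candidates)
print(threshold, round(score, 3))                  # 0.4 0.857
print(round(f1_at(toy_true, toy_probs, 0.5), 3))   # 0.4
```

On this toy data a threshold of 0.4 more than doubles the F1 you would get at 0.5. The threshold should be picked on the validation set, never on the test set.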