In this notebook you will learn how to use Keras to implement a simple single-layer (shallow network) model using handcrafted features as input.
Follow along, make sure you understand the code, and complete the exercises and answer the questions when prompted.

In [None]:
import warnings
warnings.filterwarnings("ignore")
import os; os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 
import numpy as np
np.random.seed(42)
%reload_ext autoreload
%autoreload 2
%matplotlib inline
import matplotlib.pyplot as plt
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import pandas as pd
import keras_models
from scipy.sparse import hstack

### Load dataset
At this point we assume that you've read the README.md file and that you have downloaded the dataset.
Move the files train.csv, test.csv and test_labels.csv inside the folder portoai-nlp.

It's important to follow some mandatory deep learning rules when developing models:
 - We split the data in (at least) 3 parts: training, validation and testing
 - While we train our models on the training data, we use the validation split to see how they are performing
 - We **do not** look at the results in the test set. You have to think of the test set as 'what I will encounter in the real-world' and therefore you should only run your model on it when you are done training and confident with your score in the validation set
 
Remember that the model used in 'real-world' will probably not have seen that data before, and therefore we **do not** take any information from the testing set to the training set, or else we will be cheating. During development, we make the assumption that the validation set will be similar to the testing set, and that improvements on validation will be a positive thing.

Make sure you understand the importance of this train/val/test split before going any further.
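
As a rough illustration of the three-way split described above, here is a minimal sketch on made-up data. The `toy` dataframe and the 60/20/20 proportions are assumptions chosen for the example; in this notebook the test set already lives in its own file, so only the train/validation split is done by hand.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data: 100 hypothetical comments with a binary label
toy = pd.DataFrame({
    "comment_text": [f"comment {i}" for i in range(100)],
    "toxic": np.arange(100) % 2,
})

# First hold out 20% as the test set, then split the remainder into train/val
train_val, test_part = train_test_split(toy, test_size=0.2, random_state=42)
train_part, val_part = train_test_split(train_val, test_size=0.25, random_state=42)

print(len(train_part), len(val_part), len(test_part))
```

The three parts do not share any rows, which is what keeps the validation and test scores honest.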


In [None]:
# Read data, fill missing values with a single space ' '
N_ROWS = 30000
dataset = pd.read_csv('train.csv', nrows=N_ROWS).fillna(' ')
LABELS = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

We are using a subset of the whole dataset. 

**At the end of running the exercises**, feel free to try with a bigger `N_ROWS` variable to see the difference in results. You have a maximum of 159571 rows in the dataset.

### Having a look at the data
We can see that we have 8 columns in the dataset: the row identifier (id), comment_text and the other 6 labels we want to predict (toxic, severe_toxic, obscene, threat, insult and identity_hate). 

Note that we want to predict all these labels using only the comment_text.

Note also that a comment can have anywhere from 0 to 6 toxicity labels. We could train a different classifier for each label, but we will instead use models that predict all of them at the same time. This lets the model exploit relationships between labels: a 'severe_toxic' comment might as well be a 'toxic' comment, right? The model will try to learn these patterns.
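
To make the multi-label idea concrete, the sketch below counts labels per comment on a tiny hand-made table. The rows are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

LABELS = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# Hypothetical rows, made up for illustration only
toy = pd.DataFrame(
    [[0, 0, 0, 0, 0, 0],   # clean comment: no labels
     [1, 0, 1, 0, 1, 0],   # toxic, obscene and insult at the same time
     [1, 1, 1, 1, 1, 1]],  # all six labels
    columns=LABELS,
)

# Number of labels attached to each comment
labels_per_comment = toy[LABELS].sum(axis=1)
print(labels_per_comment.tolist())

# Fraction of 'severe_toxic' comments that are also 'toxic'
co_occurrence = toy.loc[toy['severe_toxic'] == 1, 'toxic'].mean()
print(co_occurrence)
```

Running the same two lines on `dataset` should show how often labels co-occur in the real data.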

In [None]:
print("Dataset shape:", dataset.shape)

In [None]:
dataset.head(10)

### Train and validation split
We will randomly assign 80% of the dataset to train and the remaining 20% to validation. Note that the test set is already in another file, `test.csv`, which will not be used during model development.

In [None]:
msk = np.random.rand(len(dataset)) < 0.8
train = dataset[msk]
val = dataset[~msk]

In [None]:
print("Train size:", train.shape[0], ", Val size:", val.shape[0])

In [None]:
# Grab the comment text from the train and validation dataframes
train_text = train['comment_text']
val_text = val['comment_text']

# Bag of words features

We have to transform our text into something that we can feed a machine model with, which is numbers!

We will start with a simple count of words using [sklearn's CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). This will simply convert each comment text to a **fixed length** vector with the dimension of the vocabulary (vocabulary is the set of all words that appear in the training set).

You can see in [sklearn's example](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) how this is done. 

For example, for the comment 'I love NLP', the resulting vector will be filled with zeros, except for the positions corresponding to the words 'love' and 'nlp', which will each hold a 1 because they appear once in that string (with its default settings, CountVectorizer lowercases the text and ignores single-character tokens such as 'I'). It's nothing more than a word count. Bag of words!
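
Here is a small sketch of that word count on a made-up three-sentence corpus, using CountVectorizer with its default settings (which lowercase the text and skip single-character tokens such as 'I'). The corpus is an assumption for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ["I love NLP", "NLP is fun", "I love love deep learning"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_corpus)

# Vocabulary words ordered by their column index in the count matrix
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocab)
print(toy_counts.toarray())
```

Every row has the same length (the vocabulary size), and a word that appears twice in a comment, like 'love' in the last sentence, gets a 2.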

Since our vocabulary contains a lot of words, we will reduce the vector dimension by passing MAX_LENGTH as the `max_features` argument, so that the CountVectorizer only considers the MAX_LENGTH most frequent words in the training set when building each vector. Once again, feel free to play with these parameters at the end of the assignment.
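
The effect of `max_features` can be seen on a toy corpus. In this sketch (made-up sentences, with a limit of 2 instead of MAX_LENGTH) only the two most frequent words survive:

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ["spam spam spam eggs", "spam eggs ham", "spam bacon"]

# Keep only the 2 most frequent words across the whole corpus
small_vectorizer = CountVectorizer(max_features=2)
small_counts = small_vectorizer.fit_transform(toy_corpus)

kept_words = sorted(small_vectorizer.vocabulary_)
print(kept_words, small_counts.shape)
```

The rarer words ('ham', 'bacon') are simply dropped, so each comment becomes a vector of length 2.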

In [None]:
MAX_LENGTH = 1000
word_vectorizer = CountVectorizer(stop_words='english', max_features=MAX_LENGTH)
word_vectorizer.fit(train_text)

train_word_features = word_vectorizer.transform(train_text)
val_word_features = word_vectorizer.transform(val_text)

y_train = train[LABELS].values 
y_val = val[LABELS].values

print(train_word_features.shape, val_word_features.shape)

In [None]:
# check our input features
train_word_features[:5, :].toarray()

### Feed a basic linear model with the features
Check the function `get_shallow_features_model` under `keras_models.py` to see how to build a simple linear model in Keras.

Can you explain this line in here `output = Dense(6, activation='sigmoid')(input)`? Understand this before you continue any further.
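
If you want to check your understanding, one way to read that line is as a matrix multiplication followed by an element-wise sigmoid. The NumPy sketch below mimics that computation with random weights and toy sizes. It is an approximation for intuition, not the Keras implementation, and the names `W`, `b` and `sigmoid` are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_labels = 4, 6                  # toy sizes; the notebook uses 1000 features
x = rng.random((2, n_features))              # batch of 2 feature vectors
W = rng.normal(size=(n_features, n_labels))  # one weight column per label
b = np.zeros(n_labels)                       # one bias per label

def sigmoid(z):
    """Squash any real number into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

# One independent probability per label and per comment
probs = sigmoid(x @ W + b)
print(probs.shape)
```

Because each output goes through its own sigmoid, the six probabilities do not have to sum to 1, which is what allows a comment to carry several labels at once.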

In [None]:
# Construct model
bow_model = keras_models.get_shallow_features_model(number_feats=train_word_features.shape[1])
bow_model.summary()

Looking at the output above you can see that we have just constructed a simple shallow network with a single layer with an output shape of 6.

We will now train the model with our examples from the train set for 10 epochs, which means that we will feed the network the same data 10 times and, at the end of each loop, we will measure how well it performs on our validation set.

While running the code below you can see the accuracy (val_acc) and [ROC-AUC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) (val_auc) being printed at the end of each training loop/epoch.
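
To get a feel for the two metrics, here is a small sketch with made-up scores, using scikit-learn's `roc_auc_score`. The `auc` metric defined in `keras_models.py` may be computed differently, so treat this only as an illustration of the idea.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Four hypothetical comments for a single label: true values and predicted scores
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# ROC-AUC: chance that a random positive is scored above a random negative
auc_value = roc_auc_score(y_true, y_score)
print(auc_value)

# Accuracy can look good on imbalanced labels even for a useless predictor
imbalanced = np.array([0] * 9 + [1])
always_clean = np.zeros(10)
accuracy = (imbalanced == always_clean).mean()
print(accuracy)
```

A predictor that always answers "not toxic" reaches 90% accuracy on the second toy example, which is why ROC-AUC is worth watching alongside accuracy when labels are rare.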

Feel free to increase the number of epochs and see the difference in score.

In [None]:
# Check https://keras.io/models/model/#fit for more options on arguments to pass to the fit function
history = bow_model.fit(train_word_features, y_train, batch_size=128, epochs=10, validation_data=(val_word_features, y_val))

In [None]:
epochs = [x for x in range(len(history.history['auc']))]
trace1 = {"y": history.history['auc'], "x": epochs, "name": "train_auc", "type": "scatter"}
trace2 = {"y": history.history['val_auc'], "x": epochs, "name": "val_auc", "type": "scatter"}
trace3 = {"y": history.history['loss'], "x": epochs, "name": "train_loss", "type": "scatter"}
trace4 = {"y": history.history['val_loss'], "x": epochs, "name": "val_loss", "type": "scatter"}

data = [trace1, trace2, trace3, trace4]
layout = {"title": "BoW model"}

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='bow-model')

In the plot above you can see the evolution of loss and ROC-AUC in the train and val set.

While training neural networks, it's very useful to look closely at these plots to see if we are overfitting or even underfitting.

For overfitting cases, the training loss continues to decrease while the validation score does not increase (and the val loss does not decrease either). 
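
The same check can be done numerically. The sketch below uses invented loss curves (not results from this notebook) shaped like `history.history`, and looks for the epoch where the validation loss stops improving.

```python
import numpy as np

# Made-up loss curves, not results from this notebook
toy_history = {
    "loss":     [0.60, 0.45, 0.35, 0.28, 0.22, 0.18],
    "val_loss": [0.62, 0.50, 0.44, 0.43, 0.46, 0.51],
}

# Epoch (0-indexed) with the lowest validation loss
best_epoch = int(np.argmin(toy_history["val_loss"]))

# Gap between validation and training loss; a growing gap hints at overfitting
gap = np.array(toy_history["val_loss"]) - np.array(toy_history["loss"])
print(best_epoch, gap.round(2).tolist())
```

In this toy example the training loss keeps falling after epoch 3 while the validation loss rises again, which is the overfitting pattern described above.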

# TF-IDF features

We now ask you to construct a different kind of input to the model, using [sklearn's TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) that constructs a vector of TF-IDF features for each comment text.

These TF-IDF features work a bit differently from the classic Bag of words: they give more weight to words that appear in fewer documents (training examples).
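
The "rarer words weigh more" idea can be inspected through the `idf_` attribute of a fitted TfidfVectorizer. The sketch below uses a made-up corpus and default settings, and only looks at the learned weights; it does not build the features asked for in the exercise.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = ["the cat sat", "the dog sat", "the bird flew"]

toy_tfidf = TfidfVectorizer()
toy_tfidf.fit(toy_corpus)

# Map each vocabulary word to its inverse document frequency weight
idf_by_word = {word: toy_tfidf.idf_[idx] for word, idx in toy_tfidf.vocabulary_.items()}
for word in ("the", "sat", "cat"):
    print(word, round(float(idf_by_word[word]), 3))
```

'the' appears in every document and gets the smallest weight, 'sat' appears in two documents, and 'cat' appears in only one and gets the largest weight of the three.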

In [None]:
MAX_LENGTH = 1000
raise NotImplementedError("Implement TF-IDF features")

After implementing the TF-IDF features, we will run our linear model on them.

You will probably notice that training takes longer to reach results as good as the BoW model's. Try increasing the number of epochs to see the difference.

One main reason the BoW model converged faster is that we did not normalize our input to the network. That topic is out of scope for this tutorial, but feel free to read more about it in links like [this one](https://medium.com/@urvashilluniya/why-data-normalization-is-necessary-for-machine-learning-models-681b65a05029).
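
As a small illustration of the difference in input scale, the sketch below compares row norms of raw counts and TF-IDF vectors on a made-up corpus. With default settings, TfidfVectorizer scales every row to unit L2 norm, while raw counts grow with comment length and word repetition. This only shows the scale difference; it does not measure convergence speed.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_corpus = ["spam spam spam spam eggs", "spam eggs", "ham"]

count_rows = CountVectorizer().fit_transform(toy_corpus).toarray()
tfidf_rows = TfidfVectorizer().fit_transform(toy_corpus).toarray()

# Euclidean length of each comment's feature vector
count_norms = np.linalg.norm(count_rows, axis=1)
tfidf_norms = np.linalg.norm(tfidf_rows, axis=1)
print(count_norms.round(2), tfidf_norms.round(2))
```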









**Solution to implement TF-IDF features:**
```
word_vectorizer = TfidfVectorizer(max_features=MAX_LENGTH)
word_vectorizer.fit(train_text)

train_word_features = word_vectorizer.transform(train_text)
val_word_features = word_vectorizer.transform(val_text)

y_train = train[LABELS] 
y_val = val[LABELS]

print(train_word_features.shape, val_word_features.shape)
```

In [None]:
tfidf_model = keras_models.get_shallow_features_model(number_feats=train_word_features.shape[1])
tfidf_model.summary()

In [None]:
# https://keras.io/models/model/#fit
history = tfidf_model.fit(train_word_features, y_train, batch_size=128, epochs=10, validation_data=(val_word_features, y_val))

In [None]:
epochs = [x for x in range(len(history.history['auc']))]
trace1 = {"y": history.history['auc'], "x": epochs, "name": "train_auc", "type": "scatter"}
trace2 = {"y": history.history['val_auc'], "x": epochs, "name": "val_auc", "type": "scatter"}
trace3 = {"y": history.history['loss'], "x": epochs, "name": "train_loss", "type": "scatter"}
trace4 = {"y": history.history['val_loss'], "x": epochs, "name": "val_loss", "type": "scatter"}

data = [trace1, trace2, trace3, trace4]
layout = {"title": "TF-IDF model"}

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='tfidf-model')

# Implement your own features
We now challenge you to implement your own features. 

Feel free to play with parameters of CountVectorizer, TfidfVectorizer and to add other features.

For example, adding a 'text length' feature:
```
# Character length of each comment, appended as one extra feature column
text_len_feat = train_text.str.len().to_numpy().reshape(-1, 1)
train_word_features = np.hstack((train_word_features.toarray(), text_len_feat))
text_len_feat = val_text.str.len().to_numpy().reshape(-1, 1)
val_word_features = np.hstack((val_word_features.toarray(), text_len_feat))
```
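
The example above converts the sparse word features to a dense matrix before stacking. If memory becomes a concern with a larger vocabulary, a possible alternative is to stay sparse with `scipy.sparse.hstack`, which is already imported at the top of the notebook. The sketch below uses made-up toy data, and whether your Keras version accepts the resulting sparse matrix directly in `fit` is an assumption you should verify.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack

# Toy stand-ins for the comment text and its word-count features
toy_text = pd.Series(["short", "a much longer comment"])
toy_word_features = csr_matrix(np.array([[1, 0, 2], [0, 3, 0]]))

# Character length of each comment as a sparse column vector
text_len = csr_matrix(toy_text.str.len().to_numpy().reshape(-1, 1))

# Append the new column without converting the word features to dense
toy_features = hstack([toy_word_features, text_len]).tocsr()
print(toy_features.shape)
print(toy_features.toarray())
```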

Try to beat the previous models with new input features and run your model on the final Kaggle test set!

In [None]:
# Load final test set
test = pd.read_csv('test.csv').fillna(' ')
raise NotImplementedError("Implement your own features")

In [None]:
your_model = keras_models.get_shallow_features_model(number_feats=train_word_features.shape[1])
history = your_model.fit(train_word_features, y_train, batch_size=128, epochs=10, validation_data=(val_word_features, y_val))

In [None]:
epochs = [x for x in range(len(history.history['auc']))]
trace1 = {"y": history.history['auc'], "x": epochs, "name": "train_auc", "type": "scatter"}
trace2 = {"y": history.history['val_auc'], "x": epochs, "name": "val_auc", "type": "scatter"}
trace3 = {"y": history.history['loss'], "x": epochs, "name": "train_loss", "type": "scatter"}
trace4 = {"y": history.history['val_loss'], "x": epochs, "name": "val_loss", "type": "scatter"}

data = [trace1, trace2, trace3, trace4]
layout = {"title": "Your model"}

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='your-model')

In [None]:
# Get score in Kaggle's test set
keras_models.get_score_on_kaggle_test_set(test_word_features, your_model)

# First deep model

As a first assignment, we will ask you to follow the example of the shallow network from `keras_models.get_shallow_features_model` and implement a **deep** network now, making it a multilayer perceptron.

Go to keras_models.py and implement the `get_MLP_features_model` function.

Look at the image below to get an idea of how many changes you need to make to the previous network to get your first deep network running!

![this](https://qph.fs.quoracdn.net/main-qimg-8a19e73bffab9a7f6eab55fd5b47c00a.webp)

After implementing it, try running the models again through the deep network to see if you get better results!

**(Possible) Solution:**

```
def get_MLP_features_model(number_feats):
    # Inputs
    input = Input(shape=[number_feats], name="input_features")

    x = Dense(256, activation='relu')(input)  # we can tweak the number of neurons and choose another activation
    x = Dropout(0.2)(x)  # we can tweak the dropout value

    # output
    output = Dense(6, activation='sigmoid')(x)

    # model
    model = Model([input], output)
    model.compile(loss="binary_crossentropy",
                  optimizer='adam', metrics=['accuracy', auc])
    return model
```