# TensorFlow Deep Model

Ultimately, we want to build the model in TensorFlow: it gives us the most flexibility, and we can build on some of the modules we wrote for SQuAD.

We'll proceed as follows:
0. Set-up.
1. Read in dataset.
2. Convert dataset into format for RNN.
3. Construct vocabulary for RNN.
4. Fit TF-RNN classifier.

In [67]:
# 0. Some initial set-up.
from collections import Counter
import numpy as np
import os
import pandas as pd
import random
from tf_rnn_classifier import TfRNNClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
import tensorflow as tf
import sst
import utils

In [68]:
data_dir = "./data/"

## Read in data

In [69]:
train = pd.read_csv(data_dir + "train.csv").fillna(' ')
test = pd.read_csv(data_dir + "test.csv").fillna(' ')
test_labels = pd.read_csv(data_dir + "test_labels.csv")

In [70]:
label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
train_text = train['comment_text']
train_labels = train[label_cols]

## Formatting dataset for RNN

To feed the dataset to the RNN, we convert it into a list of lists: the outer list holds the training examples, and each inner list holds the tokens of one example.

In [71]:
def build_rnn_dataset(dataset):
    label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
    text = dataset['comment_text']
    labels = dataset[label_cols]

    X_rnn = []
    for comment in text:
        # Each training example is a list of whitespace-delimited tokens.
        X_rnn.append(comment.split())

    Y_rnn = []
    for _, row in labels.iterrows():
        # Since this is a multilabel problem, each y is a list of 0/1 values,
        # one for each of the six classes.
        Y_rnn.append(list(row))

    return X_rnn, Y_rnn

In [72]:
X_rnn_train, Y_rnn_train = build_rnn_dataset(train)

In [73]:
X_rnn_train[51][:20]

['GET',
 'FUCKED',
 'UP.',
 'GET',
 'FUCKEEED',
 'UP.',
 'GOT',
 'A',
 'DRINK',
 'THAT',
 'YOU',
 'CANT',
 'PUT',
 'DOWN???/',
 'GET',
 'FUCK',
 'UP',
 'GET',
 'FUCKED',
 'UP.']

In [74]:
Y_rnn_train[51]

[1, 0, 1, 0, 0, 0]
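
The sample above shows that whitespace splitting leaves punctuation attached to tokens (`UP.`, `DOWN???/`) and keeps case distinctions, both of which inflate the vocabulary. As a minimal sketch of one alternative (not what this notebook uses), a regex tokenizer could lowercase the text and split punctuation into separate tokens. The function name `simple_tokenize` and the regex are illustrative assumptions.

```python
import re

# Hypothetical helper, not part of the notebook's pipeline.
# Matches either a run of word characters (optionally with internal
# apostrophes, as in "can't") or a single non-space symbol.
_TOKEN_RE = re.compile(r"\w+(?:'\w+)*|[^\w\s]")

def simple_tokenize(comment):
    """Lowercase a comment and split it into word and punctuation tokens."""
    return _TOKEN_RE.findall(comment.lower())

tokens = simple_tokenize("GET UP. You can't PUT it DOWN???/")
print(tokens)
```

Swapping this in for `comment.split()` would change the vocabulary and the token counts, so any downstream numbers would need to be recomputed.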

## Get vocab for RNN

In [75]:
full_train_vocab = sst.get_vocab(X_rnn_train)

In [76]:
print("sst_full_train_vocab has {:,} items".format(len(full_train_vocab)))

sst_full_train_vocab has 532,300 items


In [11]:
train_vocab = sst.get_vocab(X_rnn_train)

In [12]:
print("sst_full_train_vocab has {:,} items".format(len(train_vocab)))

sst_full_train_vocab has 532,300 items
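
A vocabulary of 532,300 items is large relative to the 3,000 examples used for training below. As a hedged sketch, here is one way a vocabulary could be capped at the most frequent tokens, with a catch-all unknown token standing in for everything else. The names `build_vocab`, `n_words`, and `"$UNK"` are illustrative; whether `sst.get_vocab` behaves this way is an assumption to check against its source.

```python
from collections import Counter

UNK = "$UNK"  # assumed name for the unknown-token placeholder

def build_vocab(X, n_words=None):
    """Return the set of the n_words most frequent tokens in X, plus UNK.

    X is a list of token lists; n_words=None keeps every token.
    """
    counts = Counter(token for example in X for token in example)
    vocab = {token for token, _ in counts.most_common(n_words)}
    vocab.add(UNK)
    return vocab

toy_X = [["a", "b", "a"], ["a", "c", "b"]]
small_vocab = build_vocab(toy_X, n_words=2)
print(sorted(small_vocab))
```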


## Train RNN 

In [100]:
num_train = 3000

In [101]:
tf_rnn = TfRNNClassifier(
    train_vocab,
    embed_dim=50,
    hidden_dim=50,
    max_length=50,
    hidden_activation=tf.nn.tanh,
    cell_class=tf.nn.rnn_cell.LSTMCell,
    train_embedding=True,
    max_iter=50,
    eta=0.1)
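
With `max_length=50`, each comment has to be mapped to a fixed-length sequence of vocabulary indices before it reaches the network. The sketch below shows one common way to do that: look up each token, fall back to an unknown-token index, truncate, and pad. It illustrates the general technique and is not a description of `TfRNNClassifier`'s internals; `encode_example`, the `"$UNK"` and `"<pad>"` tokens, keeping the first `max_length` tokens, and padding with index 0 are all assumptions made for the example.

```python
def encode_example(tokens, index, max_length, unk="$UNK", pad_id=0):
    """Map tokens to ids, truncating or right-padding to max_length.

    Returns the id list and the number of real (unpadded) tokens.
    """
    unk_id = index[unk]
    ids = [index.get(tok, unk_id) for tok in tokens[:max_length]]
    length = len(ids)
    ids = ids + [pad_id] * (max_length - length)
    return ids, length

# Toy vocabulary index (hypothetical): id 0 is reserved for padding.
toy_index = {"<pad>": 0, "$UNK": 1, "get": 2, "up": 3}
ids, length = encode_example(["get", "up", "now"], toy_index, max_length=5)
print(ids, length)
```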

In [102]:
_ = tf_rnn.fit(X_rnn_train[:num_train], Y_rnn_train[:num_train])

Iteration 50: loss: 0.0047960305819287945

In [103]:
tf_rnn_predictions = tf_rnn.predict(X_rnn_train[:num_train])
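
The predictions above are for the same 3,000 examples the model was fit on, so metrics computed from them describe how well the model fits its training data rather than how it generalizes. Below is a minimal sketch of holding out part of the data for evaluation, using only the standard library. `split_train_dev`, `dev_fraction`, and the seed are hypothetical names and values, not part of the notebook.

```python
import random

def split_train_dev(X, Y, dev_fraction=0.2, seed=42):
    """Shuffle example indices and split (X, Y) into train and dev portions."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_dev = int(len(indices) * dev_fraction)
    dev_idx, train_idx = indices[:n_dev], indices[n_dev:]
    X_train = [X[i] for i in train_idx]
    Y_train = [Y[i] for i in train_idx]
    X_dev = [X[i] for i in dev_idx]
    Y_dev = [Y[i] for i in dev_idx]
    return X_train, Y_train, X_dev, Y_dev

toy_X = [["tok%d" % i] for i in range(10)]
toy_Y = [[i % 2] for i in range(10)]
X_tr, Y_tr, X_dev, Y_dev = split_train_dev(toy_X, toy_Y)
print(len(X_tr), len(X_dev))
```

With the notebook's data, one would fit on the train portion and call `tf_rnn.predict` on the dev portion before evaluating.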

In [108]:
def evaluate(labels, preds):
    """
    labels: list of shape (num_ex, num_class)
    preds: np.array of shape (num_ex, num_class) (can be probabilistic)
    """
    labels = np.array(labels)

    # Class-wise precision/recall/F1, computed on predictions rounded to 0/1.
    # Use the `preds` argument here rather than the global `tf_rnn_predictions`.
    label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
    for i in range(len(label_cols)):
        print("CLASS: %s" % label_cols[i])
        print(classification_report(labels[:, i], np.round(preds[:, i])))
        print()

    # ROC-AUC, macro-averaged over the six classes.
    roc_auc = roc_auc_score(labels, preds)
    print("macro-averaged ROC-AUC score: %f" % roc_auc)
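
`np.round` turns probabilities into hard 0/1 labels with an implicit cutoff at 0.5 (and it rounds a value of exactly 0.5 down to 0). With classes as imbalanced as these, it can be useful to make the threshold explicit and tunable. A small sketch in plain Python follows; `binarize` and `precision_recall` are hypothetical helpers, not sklearn functions.

```python
def binarize(probs, threshold=0.5):
    """Convert a list of probabilities to 0/1 labels at an explicit threshold."""
    return [1 if p >= threshold else 0 for p in probs]

def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

y_true = [1, 0, 1, 0]
probs = [0.9, 0.4, 0.3, 0.6]
default_scores = precision_recall(y_true, binarize(probs))
low_threshold_scores = precision_recall(y_true, binarize(probs, threshold=0.25))
print(default_scores, low_threshold_scores)
```

Lowering the threshold trades precision for recall, which may matter for rare classes such as `threat`.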
    

In [109]:
evaluate(Y_rnn_train[:num_train], tf_rnn_predictions)

CLASS: toxic
             precision    recall  f1-score   support

          0       1.00      1.00      1.00      2693
          1       1.00      1.00      1.00       307

avg / total       1.00      1.00      1.00      3000


CLASS: severe_toxic
             precision    recall  f1-score   support

          0       1.00      1.00      1.00      2962
          1       1.00      1.00      1.00        38

avg / total       1.00      1.00      1.00      3000


CLASS: obscene
             precision    recall  f1-score   support

          0       1.00      1.00      1.00      2833
          1       1.00      0.99      1.00       167

avg / total       1.00      1.00      1.00      3000


CLASS: threat
             precision    recall  f1-score   support

          0       1.00      1.00      1.00      2986
          1       1.00      1.00      1.00        14

avg / total       1.00      1.00      1.00      3000


CLASS: insult
             precision    recall  f1-score   support

      
