# TensorFlow Deep Model

Ultimately we want to build the model in TensorFlow, since it gives us the most flexibility and lets us reuse some of the modules we built for SQuAD.

We'll proceed as follows:
0. Set-up.
1. Read in dataset.
2. Convert dataset into format for RNN.
3. Construct vocabulary for RNN.
4. Fit TF-RNN classifier.

In [17]:
# 0. Some initial set-up.
from collections import Counter
import numpy as np
import os
import pandas as pd
import random
from tf_rnn_classifier import TfRNNClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
import tensorflow as tf
import sst
import utils

In [2]:
data_dir = "./data/"

## Read in data

In [3]:
train = pd.read_csv(data_dir + "train.csv").fillna(' ')
test = pd.read_csv(data_dir + "test.csv").fillna(' ')
test_labels = pd.read_csv(data_dir + "test_labels.csv")

In [4]:
label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
train_text = train['comment_text']
train_labels = train[label_cols]

## Formatting dataset for RNN

The RNN expects the dataset as a list of lists: the outer list holds one entry per training example, and each inner list holds the tokens of that example.
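As a minimal sketch of this target format, the toy comments below (invented for illustration, not taken from the dataset) are tokenized with `str.split()`, the same whitespace tokenization used in this notebook:

```python
# Toy comments, invented for illustration only.
toy_comments = ["You are great", "stop   it now!"]

# Outer list: one entry per example. Inner list: that example's tokens.
# str.split() with no arguments splits on any run of whitespace.
toy_X = [comment.split() for comment in toy_comments]

print(toy_X)
```

Note that punctuation stays attached to its token (`'now!'`), since nothing but whitespace is treated as a separator.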

In [5]:
def build_rnn_dataset(dataset):
    label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
    text = dataset['comment_text']
    labels = dataset[label_cols]

    # Each training example is a list of whitespace-separated tokens.
    X_rnn = [comment.split() for comment in text]

    # Since this is a multilabel problem, each y is a list of 0/1 flags,
    # one for each of the six classes. Converting the whole frame at once
    # avoids reusing the name `labels` as a loop variable.
    Y_rnn = labels.values.tolist()

    return X_rnn, Y_rnn

In [6]:
X_rnn_train, Y_rnn_train = build_rnn_dataset(train)

In [7]:
X_rnn_train[51][:20]

['GET',
 'FUCKED',
 'UP.',
 'GET',
 'FUCKEEED',
 'UP.',
 'GOT',
 'A',
 'DRINK',
 'THAT',
 'YOU',
 'CANT',
 'PUT',
 'DOWN???/',
 'GET',
 'FUCK',
 'UP',
 'GET',
 'FUCKED',
 'UP.']

In [8]:
Y_rnn_train[51]

[1, 0, 1, 0, 0, 0]
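
Each position in a label vector corresponds to one entry of `label_cols`, in order. A small sketch for reading such a vector back as label names; `active_labels` is a hypothetical helper introduced here for illustration only:

```python
label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

def active_labels(y, names=label_cols):
    """Hypothetical helper: return the label names whose flag is 1."""
    return [name for name, flag in zip(names, y) if flag == 1]

print(active_labels([1, 0, 1, 0, 0, 0]))  # ['toxic', 'obscene']
```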

## Get vocab for RNN

In [9]:
full_train_vocab = sst.get_vocab(X_rnn_train)

In [10]:
print("sst_full_train_vocab has {:,} items".format(len(full_train_vocab)))

sst_full_train_vocab has 532,300 items


In [11]:
train_vocab = sst.get_vocab(X_rnn_train)

In [12]:
print("sst_full_train_vocab has {:,} items".format(len(train_vocab)))

sst_full_train_vocab has 532,300 items
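
A vocabulary of this size is partly a consequence of whitespace tokenization, which keeps case and attached punctuation, so `'UP.'` and `'UP'` count as different tokens. One common way to shrink a vocabulary is to keep only the most frequent tokens and reserve a single entry for everything else. The sketch below illustrates that idea with `collections.Counter`; `build_vocab`, its `n_words` parameter, and the `$UNK` token are assumptions made for this example and are not meant to describe what `sst.get_vocab` does:

```python
from collections import Counter

def build_vocab(X, n_words=None, unk="$UNK"):
    """Hypothetical helper: keep the n_words most frequent tokens, plus an unknown token."""
    counts = Counter(token for example in X for token in example)
    # most_common(None) returns every token, ordered by descending count.
    kept = [token for token, _ in counts.most_common(n_words)]
    return sorted(set(kept)) + [unk]

toy_X = [["get", "up"], ["get", "down"], ["get", "up", "now"]]
print(build_vocab(toy_X, n_words=2))
```

With `n_words=None` every token is kept, so the cap is something to tune against model size and coverage.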


## Train RNN 

In [56]:
num_train = 10000

In [57]:
tf_rnn = TfRNNClassifier(
    train_vocab,
    embed_dim=50,
    hidden_dim=50,
    max_length=50,
    hidden_activation=tf.nn.tanh,
    cell_class=tf.nn.rnn_cell.LSTMCell,
    train_embedding=True,
    max_iter=50,
    eta=0.1)
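
With `max_length=50`, each comment presumably contributes at most 50 tokens to the RNN. The sketch below shows one common way to turn a token list into a fixed-length sequence of vocabulary indices: truncate, map unseen tokens to an unknown id, and pad. It is an assumption about typical preprocessing rather than a description of the `TfRNNClassifier` internals; `encode_example` and `$UNK` are hypothetical, and whether the first or last tokens are kept is an implementation choice.

```python
def encode_example(tokens, vocab, max_length, unk="$UNK"):
    """Hypothetical helper: map tokens to vocab indices at a fixed length."""
    index = {token: i for i, token in enumerate(vocab)}
    unk_id = index[unk]
    # Keep only the first max_length tokens; send unseen tokens to the unknown id.
    ids = [index.get(token, unk_id) for token in tokens[:max_length]]
    # Record the true length, since RNN code usually needs it to ignore padding.
    length = len(ids)
    # Pad short examples (here with the unknown id, for simplicity).
    ids += [unk_id] * (max_length - length)
    return ids, length

toy_vocab = ["get", "up", "$UNK"]
print(encode_example(["get", "up", "now", "please"], toy_vocab, max_length=3))
print(encode_example(["up"], toy_vocab, max_length=3))
```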

In [58]:
_ = tf_rnn.fit(X_rnn_train, Y_rnn_train)

Iteration 50: loss: 7.5037748329341415

In [61]:
tf_rnn_predictions = tf_rnn.predict(X_rnn_train)

In [62]:
roc_auc_score(np.array(Y_rnn_train), tf_rnn_predictions)

0.9807663846490634
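
This score is computed on the same examples the model was fit on, so it likely overstates performance on unseen comments. A minimal sketch of a hold-out split using only the standard library follows; `holdout_split`, the 20% dev fraction, and the seed are illustrative assumptions, not part of the original pipeline:

```python
import random

def holdout_split(X, Y, dev_fraction=0.2, seed=0):
    """Hypothetical helper: shuffle example indices and split into train/dev portions."""
    indices = list(range(len(X)))
    # Use a local RNG so the global random state is left untouched.
    random.Random(seed).shuffle(indices)
    n_dev = int(len(indices) * dev_fraction)
    dev_idx, train_idx = indices[:n_dev], indices[n_dev:]

    def pick(data, idx):
        return [data[i] for i in idx]

    return pick(X, train_idx), pick(Y, train_idx), pick(X, dev_idx), pick(Y, dev_idx)

toy_X = [["a"], ["b"], ["c"], ["d"], ["e"]]
toy_Y = [[0], [1], [0], [1], [0]]
X_tr, Y_tr, X_dev, Y_dev = holdout_split(toy_X, toy_Y)
print(len(X_tr), len(X_dev))
```

The model would then be fit on the training portion and scored on the dev portion. Note that `roc_auc_score` needs both classes present in every label column of the dev portion, which matters for rare labels such as `threat`.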

In [16]:
for i in range(tf_rnn_predictions.shape[0]):
    if max(Y_rnn_train[i]) > 0:
        print(X_rnn_train[i])
        print(tf_rnn_predictions[i])
        print(Y_rnn_train[i])

['COCKSUCKER', 'BEFORE', 'YOU', 'PISS', 'AROUND', 'ON', 'MY', 'WORK']
[9.9502230e-01 9.8707169e-01 9.8670584e-01 1.7316957e-04 9.9997211e-01
 4.7941774e-02]
[1, 1, 1, 0, 1, 0]
[9.90504324e-01 5.61857771e-04 1.02275452e-02 2.11670040e-03
 8.71731155e-03 1.14875955e-04]
[1, 0, 0, 0, 0, 0]
['Bye!', "Don't", 'look,', 'come', 'or', 'think', 'of', 'comming', 'back!', 'Tosser.']
[9.8992831e-01 7.7116280e-04 4.5293197e-03 1.0671234e-03 5.4652090e-03
 6.5570472e-05]
[1, 0, 0, 0, 0, 0]
[0.9999355  0.01459724 0.9940404  0.04043012 0.9999658  0.9424291 ]
[1, 0, 1, 0, 1, 1]
['FUCK', 'YOUR', 'FILTHY', 'MOTHER', 'IN', 'THE', 'ASS,', 'DRY!']
[0.9999552  0.00968371 0.9859411  0.00518027 0.9995454  0.00363055]
[1, 0, 1, 0, 1, 0]
["I'm", 'Sorry', "I'm", 'sorry', 'I', 'screwed', 'around', 'with', 'someones', 'talk', 'page.', 'It', 'was', 'very', 'bad', 'to', 'do.', 'I', 'know', 'how', 'having', 'the', 'templates', 'on', 'their', 'talk', 'page', 'helps', 'you', 'assert', 'your', 'dominance', 'over', 'them.