# Testing Models

Using the binary classification model trained earlier, we predict toxic versus non-toxic labels for an unseen dataset. The dataset is built from YouTube comments and is labeled with the following categories:
- toxic
- obscene
- threat
- identity_hate 

The comments must first be pre-processed and tokenized before they are passed to the best model that we trained. Only the `toxic` column is used as the target in this notebook.
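
Tokenization only helps if the word-to-index mapping matches the one the model saw during training. The sketch below is a simplified, pure-Python stand-in for a frequency-ranked vocabulary (similar in spirit to how the Keras `Tokenizer` numbers words, though not its actual implementation). The helper `build_vocab` and the two toy corpora are hypothetical; the point is only that fitting on different text can assign different indices to the same word.

```python
from collections import Counter

def build_vocab(texts):
    """Map each word to an index by descending frequency (1 = most frequent)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by count (descending), breaking ties alphabetically for determinism
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i + 1 for i, word in enumerate(ranked)}

# Hypothetical toy corpora standing in for training and test comments
train_texts = ["good video good content", "bad comment"]
test_texts = ["bad bad bad video", "good comment"]

train_vocab = build_vocab(train_texts)
test_vocab = build_vocab(test_texts)

# The same word gets a different index depending on which corpus was fitted
print("good:", train_vocab["good"], "vs", test_vocab["good"])
print("bad: ", train_vocab["bad"], "vs", test_vocab["bad"])
```

In practice this suggests saving the tokenizer fitted on the training data and reusing it at test time, so that each index still points at the embedding row learned for that word.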

In [1]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
from torchtext.vocab import GloVe
from itertools import combinations
import torch
import os
import time
from sklearn.metrics import classification_report

In [2]:
# Read in dataset 

data_path = '/Users/irsaashraf/Desktop/UChicago/Spring_23/Advanced ML/Project/Irsa_project/youtube_comments_for_testing.csv'
df = pd.read_csv(data_path)
df.head()

Unnamed: 0,comment_text,toxic,threat,obscene,identity_hate
0,If only people would just take a step back and...,0,0,0,0
1,Law enforcement is not trained to shoot to app...,1,0,0,0
2,\nDont you reckon them 'black lives matter' ba...,1,0,1,0
3,There are a very large number of people who do...,0,0,0,0
4,"The Arab dude is absolutely right, he should h...",0,0,0,0


In [3]:
if torch.cuda.is_available():       
    device = torch.device("cuda")
    print(f'There are {torch.cuda.device_count()} GPU(s) available.')
    print('Device name:', torch.cuda.get_device_name(0))

else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

No GPU available, using the CPU instead.


## Pre-processing

In [5]:
MAX_SENT_LENGTH = 200
BATCH_SIZE = 16
# EMBEDDING_DIM = 300

In [6]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


2023-05-20 18:50:42.042612: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


In [10]:
comments = df['comment_text'].values
y = df['toxic'].values
comments

array(["If only people would just take a step back and not make this case about them, because it wasn't about anyone except the two people in that situation.\xa0 To lump yourself into this mess and take matters into your own hands makes these kinds of protests selfish and without rational thought and investigation.\xa0 The guy in this video is heavily emotional and hyped up and wants to be heard, and when he gets heard he just presses more and more.\xa0 He was never out to have a reasonable discussion.\xa0 Kudos to the Smerconish for keeping level the whole time and letting Masri make himself out to be a fool.\xa0 How dare he and those that tore that city down in protest make this about themselves and to dishonor the entire incident with their own hate.\xa0 By the way, since when did police brutality become an epidemic?\xa0 I wish everyone would just stop pretending like they were there and they knew EXACTLY what was going on, because there's no measurable amount of people that honestl

In [11]:
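NUM_WORDS = 10000

# Use the same sequence length as the model input (200 tokens)
MAX_LEN = 200

# NOTE: this fits a new tokenizer on the test comments, so its word indices
# need not match the vocabulary the saved model was trained with. Loading the
# tokenizer fitted at training time would keep the indices consistent.
tokenizer = Tokenizer(num_words=NUM_WORDS)
tokenizer.fit_on_texts(comments)
# Convert each comment to word indices, then pad (or truncate) to MAX_LEN
comments = tokenizer.texts_to_sequences(comments)
comments = pad_sequences(comments, padding='post', maxlen=MAX_LEN)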
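
To make the padding step concrete, here is a minimal pure-Python sketch of what `pad_sequences(..., padding='post', maxlen=MAX_LEN)` does under the Keras defaults: short sequences get zeros appended, and long ones are truncated from the front, because `truncating` defaults to `'pre'`. The helper `pad_post` is hypothetical and only mirrors that behavior for illustration.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad at the end and truncate from the start (padding='post', truncating='pre')."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                   # keep only the last `maxlen` tokens
        seq = seq + [value] * (maxlen - len(seq))   # append padding up to `maxlen`
        padded.append(seq)
    return padded

result = pad_post([[5, 3], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(result)
# → [[5, 3, 0, 0], [3, 4, 5, 6]]
```

With `MAX_LEN = 200`, a comment longer than 200 tokens therefore keeps its last 200 tokens unless `truncating='post'` is passed.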


### Load the saved model

In [14]:
from tensorflow.keras.models import load_model

model = load_model('/Users/irsaashraf/Desktop/UChicago/Spring_23/Advanced ML/Project/Irsa_project/model1_50epochs.h5')
model.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, 200, 300)          16449300  
                                                                 
 conv1d_3 (Conv1D)           (None, 196, 64)           96064     
                                                                 
 global_max_pooling1d_3 (Glo  (None, 64)               0         
 balMaxPooling1D)                                                
                                                                 
 dense_6 (Dense)             (None, 64)                4160      
                                                                 
 dropout_3 (Dropout)         (None, 64)                0         
                                                                 
 dense_7 (Dense)             (None, 1)                 65        
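
The parameter counts in the summary can be cross-checked by hand. The sketch below assumes a Conv1D kernel size of 5 (inferred from the length dropping from 200 to 196, which is what 'valid' padding with stride 1 would give) and a bias term in every layer. The vocabulary size is back-computed from the embedding count, not stated in the summary.

```python
EMBEDDING_DIM = 300
SEQ_LEN = 200
FILTERS = 64

# Embedding: one 300-dimensional row per vocabulary entry
embedding_params = 16_449_300
vocab_size = embedding_params // EMBEDDING_DIM

# Conv1D: output length 196 = 200 - kernel_size + 1  ->  kernel_size = 5 (assumed)
kernel_size = SEQ_LEN - 196 + 1
conv_params = kernel_size * EMBEDDING_DIM * FILTERS + FILTERS

# Dense layers: weights plus one bias per output unit
dense_hidden_params = 64 * 64 + 64
dense_output_params = 64 * 1 + 1

print(vocab_size, conv_params, dense_hidden_params, dense_output_params)
# → 54831 96064 4160 65
```

The implied embedding vocabulary of 54,831 rows is much larger than the `NUM_WORDS = 10000` cap used when tokenizing the test comments.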
                                                      



### Make predictions and check metrics

In [15]:
# Score each comment, then threshold at 0.5 to get binary toxic / not-toxic labels
y_pred = model.predict(comments)
y_pred = (y_pred > 0.5).astype(int)
print(classification_report(y, y_pred))

              precision    recall  f1-score   support

           0       0.56      0.40      0.47       538
           1       0.48      0.63      0.54       462

    accuracy                           0.51      1000
   macro avg       0.52      0.52      0.51      1000
weighted avg       0.52      0.51      0.50      1000
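
To put these numbers in context, the sketch below recomputes the per-class F1 scores from the reported precision and recall, and compares the accuracy against a majority-class baseline. All inputs are copied from the report above; because they are already rounded to two decimals, the recomputed values are approximate.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values copied from the classification report
support = {0: 538, 1: 462}
precision = {0: 0.56, 1: 0.48}
recall = {0: 0.40, 1: 0.63}

total = sum(support.values())
f1_scores = {c: f1(precision[c], recall[c]) for c in support}
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

# Accuracy of a trivial classifier that always predicts the larger class
majority_baseline = max(support.values()) / total

print({c: round(v, 2) for c, v in f1_scores.items()})
print("macro F1:", round(macro_f1, 2))
print("majority-class baseline accuracy:", round(majority_baseline, 3))
```

Always predicting class 0 would score 0.538 accuracy, so the reported accuracy of 0.51 is below the majority-class baseline on this test set.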

