# **Social Media Toxic Comments Text Classifier**

**Project Description:**        
> To give social media users the power to filter and hide cyberbullying comments. In this project, we hope to demonstrate the use of NLP and Machine Learning to identify the comments which are cyberbullying / toxic in nature so that they can be filtered from social media platforms. 

As a proof of concept, we use different word embeddings and machine learning models to classify social media comments as toxic or non-toxic, then compare the models to identify the best one for the task.

**Contributors:**
*   Bansal Priyakankshi
*   Ee Jing Shi Jolynn
*   Lim Kok Leong
*   Lim Yu Bin
*   Saroop Chand Audeshwar Raj Adityaraj

**Last Updated On:** 06 Feb 2023

# **Table of Contents**


1.  Importing Cleaned & Processed Data (for immediate modelling)
2.  Modelling
3.  Testing Models on YouTube Comments
4.  Data Exploration, Cleaning, Processing & Exporting of Files (Not required to Run)
5.  Data Preparation for Visualisation (Not required to Run)
6.  Web-scraping for YouTube Comments Dataset (Not required to Run)





#  **Upload Datasets into MyDrive**

Please upload the following datasets to your Google Drive (MyDrive), then mount the drive by running the code below.

(Note: If you put the datasets in a specific folder in MyDrive, change the working directory (`dir`) in the code below. The datasets should appear in the output after running the code.)


*   X_train.pickle
*   X_val.pickle
*   y_train.pickle
*   y_val.pickle
*   toxicity_parsed_dataset.csv
*   ytcomments.csv
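
Once the drive is mounted, it can help to confirm that every file in the list above is actually present in the chosen folder. The sketch below is only an illustration: `missing_files` is a hypothetical helper that is not part of the original notebook, and the folder you pass in is whatever you set as `dir`.

```python
from pathlib import Path

# File names taken from the list above.
REQUIRED_FILES = [
    "X_train.pickle",
    "X_val.pickle",
    "y_train.pickle",
    "y_val.pickle",
    "toxicity_parsed_dataset.csv",
    "ytcomments.csv",
]

def missing_files(folder, required=REQUIRED_FILES):
    """Return the required file names that are not present in `folder`."""
    folder = Path(folder)
    return [name for name in required if not (folder / name).is_file()]
```

For example, `missing_files("/content/drive/My Drive/Data")` should return an empty list once all six files have been uploaded.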



In [None]:
# Mount your Google Drive

import os
from google.colab import drive
drive.mount('/content/drive')
dir = "/content/drive/My Drive/Data"
!ls "/content/drive/My Drive/Data"

Mounted at /content/drive
batch_10_vectorised.pickle  batch_6_vectorised.pickle
batch_1_vectorised.pickle   batch_7_vectorised.pickle
batch_2_vectorised.pickle   batch_8_vectorised.pickle
batch_3_vectorised.pickle   batch_9_vectorised.pickle
batch_4_vectorised.pickle   toxicity_parsed_dataset.csv
batch_5_vectorised.pickle   ytcomments.csv


# **1. Importing Cleaned & Processed Data (for immediate modelling)**

## Install & Import Libraries

In [None]:
# Check GPU
!nvidia-smi

Mon May 31 04:10:16 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   65C    P8    11W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/d5/43/cfe4ee779bbd6a678ac6a97c5a5cdeb03c35f9eaebbb9720b036680f9a2d/transformers-4.6.1-py3-none-any.whl (2.2MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/d4/e2/df3543e8ffdab68f5acc73f613de9c2b155ac47f162e725dcac87c521c11/tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3MB)
Collecting huggingface-hub==0.0.8
  Downloading https://files.pythonhosted.org/packages/a1/88/7b1e45720ecf59c6c6737ff332f41c955963090a18e72acbcbeac6b25e86/huggingface_hub-0.0.8-py3-none-any.whl
Collecting sacremoses
  Downloading https://files.pythonhosted.org/

In [None]:
import pandas as pd
import numpy as np
import csv
import logging
from numpy import random
import gensim
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
import re
from bs4 import BeautifulSoup
from typing import Any, Dict, List, Callable, Optional, Tuple, Union
import json
import torch
import transformers
from transformers import BertModel, BertTokenizer, DistilBertModel, DistilBertTokenizer
from torch.utils.data import Dataset, DataLoader
from torch import optim, nn
from sklearn import metrics as sk_metrics
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn import svm
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.utils.multiclass import unique_labels
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding, Input
from keras.layers import Conv1D, MaxPooling1D, GlobalMaxPool1D
import tensorflow as tf
from tensorflow.keras.preprocessing import text, sequence
from tensorflow.keras.layers import Dense, Dropout, Activation
from tensorflow.keras.layers import Conv1D, GlobalMaxPooling1D, MaxPooling1D
import keras
import xgboost
import dill
import os
import pickle
import joblib
from google.colab import files
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
%matplotlib inline

# Suppress scientific notation
pd.options.display.float_format = '{:.2f}'.format

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


ImportError: ignored

## Define Functions

In [None]:
def split_random(train: float, val: float, test: float) -> str:
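    # Compare with a tolerance: sums such as 0.6 + 0.3 + 0.1 are not exactly 1.0 in floating point
    if not np.isclose(train + val + test, 1.0):
        raise ValueError("train + val + test must equal 1")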
    rand_num = np.random.rand()
    
    if rand_num  <= train:
        return "train"
    elif rand_num <= train + val:
        return "val"
    else:
        return "test"
    
class BertTransformer(BaseEstimator, TransformerMixin):
    def __init__(
        self,
        bert_tokenizer,
        bert_model,
        max_length: int = 60,
        embedding_func: Optional[Callable[[torch.Tensor], torch.Tensor]] = None,
    ):
        self.tokenizer = bert_tokenizer
        self.model = bert_model
        self.model.eval()
        self.max_length = max_length
        self.embedding_func = embedding_func

        if self.embedding_func is None:
            self.embedding_func = lambda x: x[0][:, 0, :].squeeze()

    def _tokenize(self, text: str) -> Tuple[torch.Tensor, torch.Tensor]:
        # Tokenize the text with the provided tokenizer
        tokenized_text = self.tokenizer.encode_plus(
            text, add_special_tokens=True, max_length=self.max_length,truncation=True
        )["input_ids"]
        
        # padding
        padded_text = tokenized_text + [0]*(self.max_length-len(tokenized_text))

        # Create an attention mask telling BERT to use all words and ignore padded values
        attention_mask = np.where(np.array(padded_text) != 0, 1, 0)

        # bert takes in a batch so we need to unsqueeze the rows
        return (
            torch.tensor(padded_text).unsqueeze(0),
            torch.tensor(attention_mask).unsqueeze(0),
        )

    def _tokenize_and_predict(self, text: str) -> torch.Tensor:
        tokenized, attention_mask = self._tokenize(text)

        embeddings = self.model(tokenized, attention_mask)
        return self.embedding_func(embeddings)

    def transform(self, text: List[str]):
        if isinstance(text, pd.Series):
            text = text.tolist()

        with torch.no_grad():
            return torch.stack([self._tokenize_and_predict(string) for string in text])

    def fit(self, X, y=None):
        """No fitting necessary so we just return ourselves"""
        return self
    
def calculate_classification_metrics(
    y_true: np.array,
    y_pred: np.array,
    average: Optional[str] = None,
    return_df: bool = True,
) -> Union[Dict[str, float], pd.DataFrame]:
    """Computes f1, precision, recall, kappa, accuracy, and support

    Args:
        y_true: The true labels
        y_pred: The predicted labels
        average: How to average multiclass results
        return_df: If True, return a DataFrame; otherwise return a dictionary

    Returns:
        Either a dataframe of the performance metrics or a single dictionary
    """
    labels = unique_labels(y_true, y_pred)

    # get results
    precision, recall, f_score, support = sk_metrics.precision_recall_fscore_support(
        y_true, y_pred, labels=labels, average=average
    )

    kappa = sk_metrics.cohen_kappa_score(y_true, y_pred, labels=labels)
    accuracy = sk_metrics.accuracy_score(y_true, y_pred)

    # create a pandas DataFrame
    if return_df:
        results = pd.DataFrame(
            {
                "class": labels,
                "f_score": f_score,
                "precision": precision,
                "recall": recall,
                "support": support,
                "kappa": kappa,
                "accuracy": accuracy,
            }
        )
    else:
        results = {
            "f1": f_score,
            "precision": precision,
            "recall": recall,
            "kappa": kappa,
            "accuracy": accuracy,
        }

    return results

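def preparation(data, col_name = 'Text'):
  # Copy after de-duplicating so the assignment below does not write to a slice of the caller's DataFrame
  data = data.drop_duplicates(subset=[col_name]).copy()
  # Strip hyperlinks
  data[col_name] = data[col_name].replace(r'(?P<url>https?://[^\s]+)', '', regex=True)
  # Drop rows with missing text
  data = data.dropna(subset=[col_name])
  return data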

def save_pkl_pickle(path, vectorizer, vectorizer_filename):
    # Serialise the vectorizer to <path>/<vectorizer_filename>.pickle and close the file afterwards
    with open(path + "/" + vectorizer_filename + ".pickle", "wb") as f:
        pickle.dump(vectorizer, f)
    print("====done saving into pickle using Pickle!====")
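
`BertTransformer._tokenize` pads each token list to `max_length` with zeros and marks the non-zero positions in the attention mask. A minimal sketch of that step in plain Python follows. `pad_and_mask` is a hypothetical helper written for this illustration; 101 and 102 are the `[CLS]` and `[SEP]` ids in `bert-base-uncased`, and the other ids are purely illustrative.

```python
def pad_and_mask(token_ids, max_length):
    """Right-pad `token_ids` with 0 and build the matching attention mask."""
    padded = token_ids + [0] * (max_length - len(token_ids))
    # 1 marks a real token, 0 marks padding
    mask = [1 if tok != 0 else 0 for tok in padded]
    return padded, mask

padded, mask = pad_and_mask([101, 7592, 2088, 102], max_length=8)
print(padded)  # [101, 7592, 2088, 102, 0, 0, 0, 0]
print(mask)    # [1, 1, 1, 1, 0, 0, 0, 0]
```

The default `embedding_func` then keeps only the hidden state at position 0 (the `[CLS]` token), so each comment is represented by a single vector.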

In [None]:
figure_8_classes = [0,1]

## Importing Datasets

*   X holds the BERT embeddings of the "Text" column in the pre-processed Toxicity Dataset
*   Y is the "oh_label" column in the pre-processed Toxicity Dataset
*   df is the original Toxicity Dataset
*   yt is the YouTube Comments Dataset

(Note: If you put the datasets in a specific folder in MyDrive, change the working directory in the code below. The dataset shapes should appear in the output after running the code.)



In [None]:
# Load X_train, X_val, y_train, y_val data
X_train = pickle.load(open(dir + "/X_train.pickle",'rb'))
X_val = pickle.load(open(dir + "/X_val.pickle",'rb'))
y_train = pickle.load(open(dir + "/y_train.pickle",'rb'))
y_val = pickle.load(open(dir + "/y_val.pickle",'rb'))
df = pd.read_csv(dir + '/toxicity_parsed_dataset.csv')
yt = pd.read_csv(dir + '/ytcomments.csv')


display(X_train.shape)
display(X_val.shape)
display(y_train.shape)
display(y_val.shape)
display(df.shape)
display(yt.shape)

torch.Size([95632, 768])

torch.Size([63756, 768])

(95632,)

(63756,)

(159686, 5)

(488, 3)
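
The shapes above imply roughly a 60/40 split between training and validation rows. A quick arithmetic check, using only the row counts printed above (the variable names are ours):

```python
n_train, n_val = 95632, 63756   # rows reported for X_train and X_val above
n_total = n_train + n_val

train_share = n_train / n_total
val_share = n_val / n_total
print(n_total)                                      # 159388
print(round(train_share, 3), round(val_share, 3))   # 0.6 0.4
```

The total is slightly below the 159686 rows of `df`, presumably because the embeddings were computed after duplicates were removed during cleaning.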

# **2. Modelling**


*   With BERT Embedding (Our Team's Models)
*   Perspective Replica with GloVe Embedding (Baseline)



## Model 1: BERT + Logistic Regression (logreg)

In [None]:
# Model 1: Logistic Regression (logreg)

my_tags = ["non-toxic", "toxic"]

logreg = LogisticRegression(n_jobs=1, solver = "liblinear", C=100)
logreg.fit(X_train, y_train)

print("==== Train Results ====")
log_pred_train = logreg.predict(X_train)
print("                     ")
print(classification_report(y_train, log_pred_train,target_names=my_tags))
print("                            ")
print("==== Train Confusion Matrix ====")
log_cmtx_train = pd.DataFrame(
    confusion_matrix(y_train, log_pred_train, labels = [0,1]), 
    index=['Actual: Non-Toxic', 'Actual: Toxic'], 
    columns=['Predicted: Non-Toxic', 'Predicted: Toxic'])
display(log_cmtx_train)
print('Accuracy %s' % accuracy_score(log_pred_train, y_train))
print("AUC: " , metrics.roc_auc_score(y_train, log_pred_train))


print("                            ")
print("                            ")

print("==== Validation Results ====")
log_pred = logreg.predict(X_val)
print("                     ")
print(classification_report(y_val, log_pred,target_names=my_tags))
print("                            ")
print("==== Validation Confusion Matrix ====")
log_cmtx = pd.DataFrame(
    confusion_matrix(y_val, log_pred, labels = [0,1]), 
    index=['Actual: Non-Toxic', 'Actual: Toxic'], 
    columns=['Predicted: Non-Toxic', 'Predicted: Toxic'])
display(log_cmtx)
print('Accuracy %s' % accuracy_score(log_pred, y_val))
print("AUC: " , metrics.roc_auc_score(y_val, log_pred))


==== Train Results ====
                     
              precision    recall  f1-score   support

   non-toxic       0.97      0.99      0.98     86410
       toxic       0.84      0.68      0.75      9222

    accuracy                           0.96     95632
   macro avg       0.90      0.83      0.86     95632
weighted avg       0.95      0.96      0.95     95632

                            
==== Train Confusion Matrix ====


| | Predicted: Non-Toxic | Predicted: Toxic |
|---|---|---|
| Actual: Non-Toxic | 85245 | 1165 |
| Actual: Toxic | 2992 | 6230 |


Accuracy 0.9565312865986281
AUC:  0.8310381056695834
                            
                            
==== Validation Results ====
                     
              precision    recall  f1-score   support

   non-toxic       0.97      0.98      0.97     57631
       toxic       0.82      0.66      0.73      6125

    accuracy                           0.95     63756
   macro avg       0.89      0.82      0.85     63756
weighted avg       0.95      0.95      0.95     63756

                            
==== Validation Confusion Matrix ====


| | Predicted: Non-Toxic | Predicted: Toxic |
|---|---|---|
| Actual: Non-Toxic | 56725 | 906 |
| Actual: Toxic | 2054 | 4071 |


Accuracy 0.953572997051258
AUC:  0.8244661776771925
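
Note that `roc_auc_score` is called here with the hard 0/1 predictions rather than with scores or probabilities. With binary predictions the ROC curve has a single operating point, so the value reduces to the mean of the true-positive and true-negative rates (balanced accuracy). The sketch below reproduces the validation figures from the confusion matrix above; the variable names are ours.

```python
# Validation confusion matrix for Model 1, copied from the output above.
tn, fp = 56725, 906
fn, tp = 2054, 4071

tpr = tp / (tp + fn)          # recall on the toxic class
tnr = tn / (tn + fp)          # recall on the non-toxic class
auc_from_labels = (tpr + tnr) / 2
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(auc_from_labels, 4))  # 0.8245
print(round(accuracy, 4))         # 0.9536
```

This matters when comparing models: the LSTM and CNN figures later in this notebook pass predicted probabilities to `roc_auc_score`, so they are not directly comparable with this number. One option for a threshold-independent AUC would be to pass `logreg.predict_proba(X_val)[:, 1]` instead of `log_pred`; that was not done in the recorded run, so no such value is reported here.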


## Model 2: BERT + Linear SVM - SGDClassifier (sgd) 

In [None]:
# Model 2: Linear SVM - SGDClassifier (sgd) 
my_tags = ["non-toxic", "toxic"]

# SGD Model(Linear SVM)
sgd = Pipeline([
                ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=0.0001, random_state=0)),
               ])
sgd.fit(X_train, y_train)

print("==== Train Results ====")
sgd_pred_train = sgd.predict(X_train)
print("                            ")
print(classification_report(y_train, sgd_pred_train,target_names=my_tags))
print("                            ")
print("==== Train Confusion Matrix ====")
sgd_cmtx_train = pd.DataFrame(
    confusion_matrix(y_train, sgd_pred_train, labels = [0,1]), 
    index=['Actual: Non-Toxic', 'Actual: Toxic'], 
    columns=['Predicted: Non-Toxic', 'Predicted: Toxic'])
display(sgd_cmtx_train)
print('Accuracy: %.3f' % accuracy_score(sgd_pred_train, y_train))
print("AUC: " , metrics.roc_auc_score(y_train, sgd_pred_train))



print("                            ")
print("                            ")

print("==== Validation Results ====")
sgd_pred = sgd.predict(X_val)
print("                            ")
print(classification_report(y_val, sgd_pred,target_names=my_tags))
print("                            ")
print("==== Validation Confusion Matrix ====")
sgd_cmtx = pd.DataFrame(
    confusion_matrix(y_val, sgd_pred, labels = [0,1]), 
    index=['Actual: Non-Toxic', 'Actual: Toxic'], 
    columns=['Predicted: Non-Toxic', 'Predicted: Toxic'])
display(sgd_cmtx)
print('Accuracy: %.3f' % accuracy_score(sgd_pred, y_val))
print("AUC: " , metrics.roc_auc_score(y_val, sgd_pred))


==== Train Results ====
                            
              precision    recall  f1-score   support

   non-toxic       0.97      0.98      0.97     86410
       toxic       0.79      0.71      0.75      9222

    accuracy                           0.95     95632
   macro avg       0.88      0.85      0.86     95632
weighted avg       0.95      0.95      0.95     95632

                            
==== Train Confusion Matrix ====


| | Predicted: Non-Toxic | Predicted: Toxic |
|---|---|---|
| Actual: Non-Toxic | 84708 | 1702 |
| Actual: Toxic | 2663 | 6559 |


Accuracy: 0.954
AUC:  0.8457686056430923
                            
                            
==== Validation Results ====
                            
              precision    recall  f1-score   support

   non-toxic       0.97      0.98      0.97     57631
       toxic       0.77      0.70      0.74      6125

    accuracy                           0.95     63756
   macro avg       0.87      0.84      0.86     63756
weighted avg       0.95      0.95      0.95     63756

                            
==== Validation Confusion Matrix ====


| | Predicted: Non-Toxic | Predicted: Toxic |
|---|---|---|
| Actual: Non-Toxic | 56369 | 1262 |
| Actual: Toxic | 1818 | 4307 |


Accuracy: 0.952
AUC:  0.8406428682975681


## Model 3: BERT + XGBoost (xg)

In [None]:
# Model 3: XGBoost (xg)
my_tags = ["non-toxic", "toxic"]

xg = xgboost.XGBClassifier(max_depth = 5)
xg.fit(X_train,y_train)

print("==== Train Results ====")
xg_pred_train = xg.predict(X_train)
print("                            ")
print(classification_report(y_train, xg_pred_train,target_names=my_tags))
print("                            ")
print("==== Train Confusion Matrix ====")
xg_cmtx_train = pd.DataFrame(
    confusion_matrix(y_train, xg_pred_train, labels = [0,1]), 
    index=['Actual: Non-Toxic', 'Actual: Toxic'], 
    columns=['Predicted: Non-Toxic', 'Predicted: Toxic'])
display(xg_cmtx_train)
print('Accuracy: %.3f' % accuracy_score(xg_pred_train, y_train))
print("AUC: " , metrics.roc_auc_score(y_train, xg_pred_train))



print("                            ")
print("                            ")

print("==== Validation Results ====")
xg_pred = xg.predict(X_val)
print("                            ")
print(classification_report(y_val, xg_pred,target_names=my_tags))
print("                            ")
print("==== Validation Confusion Matrix ====")
xg_cmtx = pd.DataFrame(
    confusion_matrix(y_val, xg_pred, labels = [0,1]), 
    index=['Actual: Non-Toxic', 'Actual: Toxic'], 
    columns=['Predicted: Non-Toxic', 'Predicted: Toxic'])
display(xg_cmtx)
print('Accuracy: %.3f' % accuracy_score(xg_pred, y_val))
print("AUC: " , metrics.roc_auc_score(y_val, xg_pred))


==== Train Results ====
                            
              precision    recall  f1-score   support

   non-toxic       0.97      0.99      0.98     86410
       toxic       0.89      0.67      0.77      9222

    accuracy                           0.96     95632
   macro avg       0.93      0.83      0.87     95632
weighted avg       0.96      0.96      0.96     95632

                            
==== Train Confusion Matrix ====


| | Predicted: Non-Toxic | Predicted: Toxic |
|---|---|---|
| Actual: Non-Toxic | 85618 | 792 |
| Actual: Toxic | 3008 | 6214 |


Accuracy: 0.960
AUC:  0.8323289298965096
                            
                            
==== Validation Results ====
                            
              precision    recall  f1-score   support

   non-toxic       0.96      0.99      0.97     57631
       toxic       0.82      0.60      0.69      6125

    accuracy                           0.95     63756
   macro avg       0.89      0.79      0.83     63756
weighted avg       0.95      0.95      0.95     63756

                            
==== Validation Confusion Matrix ====


| | Predicted: Non-Toxic | Predicted: Toxic |
|---|---|---|
| Actual: Non-Toxic | 56820 | 811 |
| Actual: Toxic | 2450 | 3675 |


Accuracy: 0.949
AUC:  0.7929638562579168


## Model 4: BERT + RNN - LSTM (lstm)
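
The cell below calls `reshape(-1, 768, 1)`, which turns each 768-dimensional embedding into a sequence of 768 time steps with one feature each. This is the `(batch, timesteps, features)` layout that Keras recurrent layers expect. A small NumPy sketch with random data shows the effect; the array is a stand-in, not the real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(4, 768))   # stand-in for 4 embedding vectors

as_sequence = fake_embeddings.reshape(-1, 768, 1)
print(as_sequence.shape)   # (4, 768, 1)

# The values are unchanged; only the layout differs.
same_values = np.array_equal(as_sequence[:, :, 0], fake_embeddings)
print(same_values)         # True
```

With this layout the LSTM steps through the embedding dimensions one at a time. Treating the whole vector as a single time step, `reshape(-1, 1, 768)`, would be an alternative layout; the notebook does not evaluate it.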

In [None]:
# Model 4: RNN - LSTM (lstm)

# Reshape input data to fit LSTM model input
X_train = X_train.reshape(-1, 768, 1)
X_val = X_val.reshape(-1, 768, 1)


# Creating a callback to reduce learning rate over time and stopping early
reduce_lr = keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=2, verbose=1, mode='auto', cooldown=0, min_lr=0.00001)
early = keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0, patience=5, verbose=1, mode='auto',restore_best_weights = True)

lstm = Sequential()
lstm.add(LSTM(units = 64, dropout = 0.2,return_sequences=True))
lstm.add(LSTM(units = 64, dropout = 0.2))
lstm.add(Dense(units = 1, activation = 'sigmoid'))

lstm.compile(loss = "binary_crossentropy", optimizer = "adam", metrics = ["AUC", "accuracy"])
lstm.fit(np.array(X_train), np.array(y_train), batch_size = 32, epochs = 8,
         validation_data = (np.array(X_val), np.array(y_val)),
         callbacks = [reduce_lr, early])

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<tensorflow.python.keras.callbacks.History at 0x7ff22606a3d0>

In [None]:
# Model 4: RNN - LSTM (lstm)
X_train = X_train.reshape(-1, 768, 1)
X_val = X_val.reshape(-1, 768, 1)

my_tags = ["non-toxic", "toxic"]

print("==== Train Results ====")
lstm_pred_train = lstm.predict(np.array(X_train))
lstm_predict_train = (lstm_pred_train > 0.5)
lstm_predict_train = lstm_predict_train*1 #convert to 0,1 instead of True False
print("                            ")
print(classification_report(np.array(y_train), lstm_predict_train,target_names=my_tags))
print("                            ")
print("==== Train Confusion Matrix ====")

lstm_cmtx_train = pd.DataFrame(confusion_matrix(y_train, lstm_predict_train, labels = [0,1]),
                                index=['Actual: Non-Toxic', 'Actual: Toxic'], 
                                columns=['Predicted: Non-Toxic', 'Predicted: Toxic'])
display(lstm_cmtx_train)
lstm_scores_train = lstm.evaluate(np.array(X_train),np.array(y_train),batch_size = 32)
print("Accuracy: %.3f%%" % (lstm_scores_train[2]*100))
print("AUC: " , metrics.roc_auc_score(np.array(y_train), lstm_pred_train))



print("                            ")
print("                            ")
print("==== Validation Results ====")
lstm_pred = lstm.predict(np.array(X_val))
lstm_predict = (lstm_pred > 0.5)
lstm_predict = lstm_predict*1 #convert to 0,1 instead of True False
print("                            ")
print(classification_report(np.array(y_val), lstm_predict, target_names=my_tags))
print("                            ")
print("==== Validation Confusion Matrix ====")

lstm_cmtx = pd.DataFrame(confusion_matrix(np.array(y_val), lstm_predict, labels = [0,1]),
                                index=['Actual: Non-Toxic', 'Actual: Toxic'], 
                                columns=['Predicted: Non-Toxic', 'Predicted: Toxic'])
display(lstm_cmtx)
lstm_scores = lstm.evaluate(np.array(X_val), np.array(y_val),batch_size = 32)
print("Accuracy: %.3f%%" % (lstm_scores[2]*100))
print("AUC: " , metrics.roc_auc_score(np.array(y_val), lstm_pred))


==== Train Results ====
                            
              precision    recall  f1-score   support

   non-toxic       0.93      1.00      0.96     86410
       toxic       0.92      0.29      0.44      9222

    accuracy                           0.93     95632
   macro avg       0.92      0.64      0.70     95632
weighted avg       0.93      0.93      0.91     95632

                            
==== Train Confusion Matrix ====


| | Predicted: Non-Toxic | Predicted: Toxic |
|---|---|---|
| Actual: Non-Toxic | 86168 | 242 |
| Actual: Toxic | 6538 | 2684 |


Accuracy: 92.910%
AUC:  0.9236359740225614
                            
                            
==== Validation Results ====
                            
              precision    recall  f1-score   support

   non-toxic       0.93      1.00      0.96     57631
       toxic       0.92      0.29      0.44      6125

    accuracy                           0.93     63756
   macro avg       0.92      0.64      0.70     63756
weighted avg       0.93      0.93      0.91     63756

                            
==== Validation Confusion Matrix ====


| | Predicted: Non-Toxic | Predicted: Toxic |
|---|---|---|
| Actual: Non-Toxic | 57470 | 161 |
| Actual: Toxic | 4358 | 1767 |


Accuracy: 92.912%
AUC:  0.9217474226420801
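
The LSTM's AUC (about 0.92) is high while its toxic-class recall at the fixed 0.5 cut-off is only 0.29, which suggests that the threshold, rather than the ranking of comments, is what limits recall. The sketch below shows how precision and recall move with the threshold. The labels and scores are invented for illustration, and `precision_recall_at` is a hypothetical helper, not part of the notebook.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall of the positive class for `scores > threshold`."""
    preds = [1 if s > threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented example: 4 toxic comments (label 1) and 6 non-toxic ones (label 0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.3, 0.1, 0.1, 0.05, 0.05, 0.02]

print(precision_recall_at(y_true, scores, 0.5))   # (1.0, 0.25)
print(precision_recall_at(y_true, scores, 0.25))  # (0.75, 0.75)
```

Any threshold picked this way should be chosen on validation data and then confirmed on data the model has not seen.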


## Model 5: BERT + CNN (cnn)
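
The CNN defined in the next cell stacks three `Conv1D` layers (kernel size 5) with two `MaxPooling1D` layers (pool size 5) in between. The sketch below traces how the sequence length shrinks through those layers. It assumes the Keras defaults that the cell does not override: `padding='valid'` and stride 1 for `Conv1D`, and a stride equal to `pool_size` for `MaxPooling1D`. The helper names are ours.

```python
def conv1d_len(length, kernel_size):
    """Output length of a stride-1 Conv1D with 'valid' padding."""
    return length - kernel_size + 1

def maxpool1d_len(length, pool_size):
    """Output length of MaxPooling1D with stride == pool_size and 'valid' padding."""
    return (length - pool_size) // pool_size + 1

length = 768                        # one position per embedding dimension
length = conv1d_len(length, 5)      # 764
length = maxpool1d_len(length, 5)   # 152
length = conv1d_len(length, 5)      # 148
length = maxpool1d_len(length, 5)   # 29
length = conv1d_len(length, 5)      # 25
print(length)  # 25
```

`GlobalMaxPool1D` then takes the maximum over those 25 positions for each of the 128 filters, giving one 128-dimensional vector per comment for the dense layers.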

In [None]:
# Model 5: CNN (cnn)
X_train = X_train.reshape(-1, 768, 1)
X_val = X_val.reshape(-1, 768, 1)

# Creating a callback to reduce learning rate over time and stopping early
reduce_lr = keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=2, verbose=1, mode='auto', cooldown=0, min_lr=0.00001)
early = keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0, patience=5, verbose=1, mode='auto',restore_best_weights = True)

cnn = Sequential()
cnn.add(Conv1D(filters = 128, kernel_size = 5, activation = "relu"))
cnn.add(MaxPooling1D(pool_size = 5))
cnn.add(Conv1D(filters = 128, kernel_size = 5, activation = "relu"))
cnn.add(MaxPooling1D(pool_size = 5))
cnn.add(Conv1D(filters = 128, kernel_size = 5, activation = "relu"))
cnn.add(GlobalMaxPool1D())
cnn.add(Dense(units = 128, activation = 'relu'))
cnn.add(Dense(units = 1, activation = 'sigmoid'))

cnn.compile(loss = "binary_crossentropy", optimizer = "adam", metrics = ["AUC", "accuracy"])
cnn.fit(np.array(X_train), np.array(y_train), batch_size = 32, epochs = 14,
         validation_data = (np.array(X_val), np.array(y_val)),
         callbacks = [reduce_lr, early])

Epoch 1/14
Epoch 2/14
Epoch 3/14
Epoch 4/14
Epoch 5/14
Epoch 6/14

Epoch 00006: ReduceLROnPlateau reducing learning rate to 0.0005000000237487257.
Epoch 7/14
Epoch 8/14
Epoch 9/14
Epoch 10/14
Epoch 11/14
Epoch 12/14

Epoch 00012: ReduceLROnPlateau reducing learning rate to 0.0002500000118743628.
Epoch 13/14
Epoch 14/14

Epoch 00014: ReduceLROnPlateau reducing learning rate to 0.0001250000059371814.


<tensorflow.python.keras.callbacks.History at 0x7ff13ef42d90>

In [None]:
# Model 5: CNN (cnn)

X_train = X_train.reshape(-1, 768, 1)
X_val = X_val.reshape(-1, 768, 1)

my_tags = ["non-toxic", "toxic"]

print("==== Train Results ====")
cnn_pred_train = cnn.predict(np.array(X_train))
cnn_predict_train = (cnn_pred_train > 0.5)
cnn_predict_train = cnn_predict_train*1 #convert to 0,1 instead of True False
print("                            ")
print(classification_report(np.array(y_train), cnn_predict_train,target_names=my_tags))
print("                            ")
print("==== Train Confusion Matrix ====")

cnn_cmtx_train = pd.DataFrame(confusion_matrix(y_train, cnn_predict_train, labels = [0,1]),
                                index=['Actual: Non-Toxic', 'Actual: Toxic'], 
                                columns=['Predicted: Non-Toxic', 'Predicted: Toxic'])
display(cnn_cmtx_train)
cnn_scores_train = cnn.evaluate(np.array(X_train),np.array(y_train),batch_size = 32)
print("Accuracy: %.3f%%" % (cnn_scores_train[2]*100))
print("AUC: " , metrics.roc_auc_score(np.array(y_train), cnn_pred_train))



print("                            ")
print("                            ")
print("==== Validation Results ====")
cnn_pred = cnn.predict(np.array(X_val))
cnn_predict = (cnn_pred > 0.5)
cnn_predict = cnn_predict*1 #convert to 0,1 instead of True False
print("                            ")
print(classification_report(np.array(y_val), cnn_predict, target_names=my_tags))
print("                            ")
print("==== Validation Confusion Matrix ====")

cnn_cmtx = pd.DataFrame(confusion_matrix(np.array(y_val),cnn_predict, labels = [0,1]),
                                index=['Actual: Non-Toxic', 'Actual: Toxic'], 
                                columns=['Predicted: Non-Toxic', 'Predicted: Toxic'])
display(cnn_cmtx)
cnn_scores = cnn.evaluate(np.array(X_val), np.array(y_val),batch_size = 32)
print("Accuracy: %.3f%%" % (cnn_scores[2]*100))
print("AUC: " , metrics.roc_auc_score(np.array(y_val), cnn_pred))


==== Train Results ====
                            
              precision    recall  f1-score   support

   non-toxic       0.96      0.99      0.98     86410
       toxic       0.86      0.66      0.74      9222

    accuracy                           0.96     95632
   macro avg       0.91      0.82      0.86     95632
weighted avg       0.95      0.96      0.95     95632

                            
==== Train Confusion Matrix ====


| | Predicted: Non-Toxic | Predicted: Toxic |
|---|---|---|
| Actual: Non-Toxic | 85418 | 992 |
| Actual: Toxic | 3178 | 6044 |


Accuracy: 95.640%
AUC:  0.97376231033647
                            
                            
==== Validation Results ====
                            
              precision    recall  f1-score   support

   non-toxic       0.96      0.99      0.97     57631
       toxic       0.83      0.63      0.71      6125

    accuracy                           0.95     63756
   macro avg       0.89      0.81      0.84     63756
weighted avg       0.95      0.95      0.95     63756

                            
==== Validation Confusion Matrix ====


| | Predicted: Non-Toxic | Predicted: Toxic |
|---|---|---|
| Actual: Non-Toxic | 56822 | 809 |
| Actual: Toxic | 2275 | 3850 |


Accuracy: 95.163%
AUC:  0.9617544639205302


## Model 6 (Baseline): GloVe + CNN (glove_cnn)
Replica of Perspective API

In [None]:
# First 5 Rows
df = pd.read_csv(dir + '/toxicity_parsed_dataset.csv')
display(df.head())

# Remove columns
df=df.drop(['index','ed_label_1','ed_label_0'],axis=1)
display(df.head())

# Ratio of Toxic:Non-Toxic Labels 
# 0:    144324
# 1:    15362
# Total: 159686
display(df['oh_label'].value_counts(dropna = False))

# Drop NaN
df = df.dropna()
display(df['oh_label'].value_counts(dropna = False))
df_BERT = df.copy()

# Remove duplicates, remove hyperlinks, drop NA
df_BERT=preparation(df_BERT)

#change all to lower case
df_BERT['Text'] = df_BERT['Text'].apply(lambda x: " ".join(x.lower() for x in x.split()))

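#replace special characters (regex=True is required in recent pandas, where str.replace defaults to literal matching)
df_BERT['Text'] = df_BERT['Text'].str.replace(r'[^\w\s]', '', regex=True)

#remove numbers from text
df_BERT['Text'] = df_BERT['Text'].str.replace(r'\d+', '', regex=True)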

#remove stopwords

nltk.download('stopwords')
stop = stopwords.words('english')

df_BERT['Text'] = df_BERT['Text'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
display(df_BERT.head())
display(df_BERT.shape)

| | index | Text | ed_label_0 | ed_label_1 | oh_label |
|---|---|---|---|---|---|
| 0 | 0 | This: :One can make an analogy in mathematical... | 0.9 | 0.1 | 0 |
| 1 | 1 | ` :Clarification for you (and Zundark's righ... | 1.0 | 0.0 | 0 |
| 2 | 2 | Elected or Electoral? JHK | 1.0 | 0.0 | 0 |
| 3 | 3 | `This is such a fun entry. Devotchka I once... | 1.0 | 0.0 | 0 |
| 4 | 4 | Please relate the ozone hole to increases in c... | 0.8 | 0.2 | 0 |


| | Text | oh_label |
|---|---|---|
| 0 | This: :One can make an analogy in mathematical... | 0 |
| 1 | ` :Clarification for you (and Zundark's righ... | 0 |
| 2 | Elected or Electoral? JHK | 0 |
| 3 | `This is such a fun entry. Devotchka I once... | 0 |
| 4 | Please relate the ozone hole to increases in c... | 0 |


0    144324
1     15362
Name: oh_label, dtype: int64

0    144324
1     15362
Name: oh_label, dtype: int64

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


| | Text | oh_label |
|---|---|---|
| 0 | one make analogy mathematical terms envisionin... | 0 |
| 1 | clarification zundarks right checked wikipedia... | 0 |
| 2 | elected electoral jhk | 0 |
| 3 | fun entry devotchka coworker korea couldnt tel... | 0 |
| 4 | please relate ozone hole increases cancer prov... | 0 |


(159388, 2)
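
The cleaning steps in the cell above (strip hyperlinks, lowercase, remove punctuation and digits, drop stop words) can be followed on a single string. The sketch below uses only the standard library. `clean_text` is a hypothetical helper, and the small stop-word set is a stand-in for the NLTK English list used in the notebook.

```python
import re

# Tiny stand-in for nltk's English stop-word list (assumption for the example).
STOP_WORDS = {"this", "is", "a", "the", "in", "to", "and"}

def clean_text(text, stop_words=STOP_WORDS):
    """Drop URLs, lowercase, strip punctuation and digits, then remove stop words."""
    text = re.sub(r"https?://\S+", "", text)   # strip hyperlinks
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)        # strip punctuation
    text = re.sub(r"\d+", "", text)            # strip digits
    return " ".join(w for w in text.split() if w not in stop_words)

cleaned = clean_text("This is a TEST, see https://example.com in 2021!")
print(cleaned)  # test see
```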

In [None]:
# Set the embedding parameters
embedding_dim = 100   # how big is each word vector   
max_features = 40000  # how many unique words to use (i.e. number of rows in the embedding matrix)
max_text_length= 200  # max number of words in a comment to use

x_tokenizer=text.Tokenizer(max_features)
x_tokenizer.fit_on_texts(list(df_BERT["Text"]))
x_tokenized=x_tokenizer.texts_to_sequences(df_BERT["Text"])
x_train_val=sequence.pad_sequences(x_tokenized,maxlen=max_text_length)

glove_cnn_X_train, glove_cnn_X_val, glove_cnn_y_train, glove_cnn_y_val=train_test_split(x_train_val, df_BERT["oh_label"],test_size=0.4,random_state=0)
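
The padding step above relies on the defaults of `pad_sequences`, which pad and truncate at the front of each sequence. The following is a minimal pure-Python sketch of that behaviour, not the Keras implementation; the helper name `pad_pre` is hypothetical and `maxlen` is assumed to be at least 1.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad or left-truncate a list of token ids to exactly `maxlen` items."""
    seq = list(seq)[-maxlen:]                      # keep only the last `maxlen` tokens
    return [value] * (maxlen - len(seq)) + seq     # fill the front with the pad value

short_comment = [5, 8, 2]            # 3 token ids
long_comment = list(range(1, 8))     # 7 token ids

print(pad_pre(short_comment, 5))     # [0, 0, 5, 8, 2]
print(pad_pre(long_comment, 5))      # [3, 4, 5, 6, 7]
```

With `max_text_length = 200`, comments longer than 200 tokens keep only their last 200 tokens, and shorter ones are filled with leading zeros.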

In [None]:
# Download GloVe Embeddings

!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove*.zip

--2021-04-04 12:06:44--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2021-04-04 12:06:44--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2021-04-04 12:06:44--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


20

In [None]:
# Prepare Embedding Matrix with Pre-trained GloVe Embeddings

embedding_index=dict()
# Each line of the GloVe file is: <word> <100 space-separated floats>
with open('glove.6B.100d.txt', encoding = "utf8") as f:
    for line in f:
        values=line.split()
        word=values[0]
        coefs=np.asarray(values[1:],dtype='float32')
        embedding_index[word]=coefs

print(f'Found {len(embedding_index)} word vectors')

Found 400000 word vectors


In [None]:
embedding_matrix=np.zeros((max_features,embedding_dim))
for word,index in x_tokenizer.word_index.items():
    if index>max_features-1:
        break
    else:
        embedding_vector=embedding_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[index]=embedding_vector
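
As a rough illustration of the two cells above, the sketch below parses GloVe-style lines and fills an embedding matrix, then reports how many vocabulary words actually received a pre-trained vector. The three-dimensional vectors, the tiny `word_index` and the coverage check are assumptions made for the example; they are not part of the notebook.

```python
import io

# Two GloVe-style lines: a word followed by its vector (3-d here instead of 100-d)
glove_file = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")

embedding_index = {}
for line in glove_file:
    word, *coefs = line.split()
    embedding_index[word] = [float(c) for c in coefs]

# Hypothetical tokenizer vocabulary (index 0 is reserved for padding)
word_index = {"the": 1, "cat": 2, "zundark": 3}
max_features, embedding_dim = 4, 3

embedding_matrix = [[0.0] * embedding_dim for _ in range(max_features)]
hits = 0
for word, index in word_index.items():
    if index < max_features and word in embedding_index:
        embedding_matrix[index] = embedding_index[word]
        hits += 1

coverage = hits / len(word_index)
print(f"{hits} of {len(word_index)} words have a GloVe vector ({coverage:.0%})")
```

Words without a pre-trained vector keep an all-zero row, so a low coverage figure would mean much of the vocabulary carries no signal into the frozen embedding layer.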

In [None]:
# Model 6: Perspective Replica Glove-CNN (glove_cnn)

reduce_lr = keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=2, verbose=1, mode='auto', cooldown=0, min_lr=0.00001)
early = keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0, patience=5, verbose=1, mode='auto',restore_best_weights = True)

filters=250
kernel_size=3
hidden_dims=250

glove_cnn =Sequential()
glove_cnn.add(Embedding(max_features,
                   embedding_dim,
                   embeddings_initializer=tf.keras.initializers.Constant(
                   embedding_matrix),
                   trainable=False))
glove_cnn.add(Dropout(0.2))

glove_cnn.add(Conv1D(filters,
                kernel_size,
                padding='valid'))
glove_cnn.add(MaxPooling1D())
glove_cnn.add(Conv1D(filters,
                5,
                padding='valid',
                activation='relu'))
glove_cnn.add(GlobalMaxPooling1D())
glove_cnn.add(Dense(hidden_dims,activation='relu'))
glove_cnn.add(Dropout(0.2))
glove_cnn.add(Dense(1,activation='sigmoid'))
#glove_cnn.summary()

glove_cnn.compile(loss='binary_crossentropy',
             optimizer='adam',
             metrics=['AUC', 'accuracy'])

batch_size=32
epochs=20
glove_cnn.fit(np.array(glove_cnn_X_train),np.array(glove_cnn_y_train),
         batch_size=batch_size,
         epochs=epochs,
         validation_data=(np.array(glove_cnn_X_val),np.array(glove_cnn_y_val)), callbacks = [reduce_lr, early])

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20

Epoch 00006: ReduceLROnPlateau reducing learning rate to 0.0005000000237487257.
Epoch 7/20
Epoch 8/20

Epoch 00008: ReduceLROnPlateau reducing learning rate to 0.0002500000118743628.
Epoch 9/20
Restoring model weights from the end of the best epoch.
Epoch 00009: early stopping


<tensorflow.python.keras.callbacks.History at 0x7f3070246410>
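
The two `ReduceLROnPlateau` messages in the log are consistent with halving a starting learning rate of 0.001 (assumed here to be the Keras default for `adam`). A small sketch of that arithmetic, using the `factor` and `min_lr` values from the callback; the helper name is hypothetical:

```python
def lr_after_reductions(initial_lr, factor, min_lr, n_reductions):
    """Learning rate after `n_reductions` plateau events, floored at `min_lr`."""
    lr = initial_lr
    for _ in range(n_reductions):
        lr = max(lr * factor, min_lr)
    return lr

for k in range(0, 8):
    print(k, lr_after_reductions(0.001, 0.5, 0.00001, k))
```

Under these assumptions the floor of `0.00001` is reached after seven reductions, well beyond the two that occurred before early stopping.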

In [None]:
# Model 6: Perspective Replica Glove-CNN (glove_cnn)

my_tags = ["non-toxic", "toxic"]

print("==== Train Results ====")
glove_cnn_pred_train = glove_cnn.predict(np.array(glove_cnn_X_train))
glove_cnn_predict_train = (glove_cnn_pred_train > 0.5)
glove_cnn_predict_train = glove_cnn_predict_train*1 #convert to 0,1 instead of True False
print("                            ")
print(classification_report(np.array(glove_cnn_y_train), glove_cnn_predict_train,target_names=my_tags))
print("                            ")
print("==== Train Confusion Matrix ====")

glove_cnn_cmtx_train = pd.DataFrame(confusion_matrix(glove_cnn_y_train, glove_cnn_predict_train, labels = [0,1]),
                                index=['Actual: Non-Toxic', 'Actual: Toxic'], 
                                columns=['Predicted: Non-Toxic', 'Predicted: Toxic'])
display(glove_cnn_cmtx_train)
glove_cnn_scores_train = glove_cnn.evaluate(np.array(glove_cnn_X_train),np.array(glove_cnn_y_train),batch_size = batch_size)
print("Accuracy: %.3f%%" % (glove_cnn_scores_train[2]*100))
print("AUC: " , metrics.roc_auc_score(np.array(glove_cnn_y_train), glove_cnn_pred_train))



print("                            ")
print("                            ")
print("==== Validation Results ====")
glove_cnn_pred = glove_cnn.predict(np.array(glove_cnn_X_val))
glove_cnn_predict = (glove_cnn_pred > 0.5)
glove_cnn_predict = glove_cnn_predict*1 #convert to 0,1 instead of True False
print("                            ")
print(classification_report(np.array(glove_cnn_y_val), glove_cnn_predict, target_names=my_tags))
print("                            ")
print("==== Validation Confusion Matrix ====")

glove_cnn_cmtx = pd.DataFrame(confusion_matrix(np.array(glove_cnn_y_val),glove_cnn_predict, labels = [0,1]),
                                index=['Actual: Non-Toxic', 'Actual: Toxic'], 
                                columns=['Predicted: Non-Toxic', 'Predicted: Toxic'])
display(glove_cnn_cmtx)
glove_cnn_scores = glove_cnn.evaluate(np.array(glove_cnn_X_val), np.array(glove_cnn_y_val),batch_size = batch_size)
print("Accuracy: %.3f%%" % (glove_cnn_scores[2]*100))
print("AUC: " , metrics.roc_auc_score(np.array(glove_cnn_y_val), glove_cnn_pred))


==== Train Results ====
                            
              precision    recall  f1-score   support

   non-toxic       0.97      0.99      0.98     86410
       toxic       0.90      0.71      0.80      9222

    accuracy                           0.97     95632
   macro avg       0.94      0.85      0.89     95632
weighted avg       0.96      0.97      0.96     95632

                            
==== Train Confusion Matrix ====


Unnamed: 0,Predicted: Non-Toxic,Predicted: Toxic
Actual: Non-Toxic,85711,699
Actual: Toxic,2637,6585


Accuracy: 96.512%
AUC:  0.9836505890235812
                            
                            
==== Validation Results ====
                            
              precision    recall  f1-score   support

   non-toxic       0.96      0.99      0.97     57631
       toxic       0.84      0.64      0.73      6125

    accuracy                           0.95     63756
   macro avg       0.90      0.81      0.85     63756
weighted avg       0.95      0.95      0.95     63756

                            
==== Validation Confusion Matrix ====


Unnamed: 0,Predicted: Non-Toxic,Predicted: Toxic
Actual: Non-Toxic,56857,774
Actual: Toxic,2191,3934


Accuracy: 95.349%
AUC:  0.9611409449067059
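
The figures in the validation report can be re-derived from the confusion matrix alone. The sketch below does this for the toxic class using the four counts shown above; it is a sanity check in plain Python rather than a replacement for `classification_report`.

```python
# Validation confusion matrix for glove_cnn (rows = actual, columns = predicted)
tn, fp = 56857, 774
fn, tp = 2191, 3934

precision = tp / (tp + fp)                   # share of predicted-toxic that is toxic
recall = tp / (tp + fn)                      # share of toxic comments that are caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.5f}")
```

The high accuracy is driven mostly by the large non-toxic class; recall on the toxic class shows that roughly a third of toxic comments are missed.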


In [None]:
glove_cnn_scores_train

[0.0927647203207016, 0.9835066795349121, 0.9651162624359131]

# **3. Testing Models on Web-Scraped YouTube Comments**



## Clean & Vectorise Data

In [None]:
# Data: yt
yt = pd.read_csv(dir + '/ytcomments.csv')

# First 5 Rows
display(yt.head())

# Remove columns
yt.drop('Unnamed: 0',axis='columns', inplace=True)
display(yt.head())

# Ratio of Toxic:Non-Toxic Labels 
# 0:    373
# 1:    115
# Total: 488
display(yt['cb_label'].value_counts(dropna = False))


# Remove duplicates, remove hyperlinks, drop NA
yt = preparation(yt, 'comment')

#change all to lower case
yt['comment'] = yt['comment'].apply(lambda x: " ".join(x.lower() for x in x.split()))

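#replace special characters (regex=True is required on pandas >= 2.0, where the default is a literal match)
yt['comment'] = yt['comment'].str.replace(r'[^\w\s]', '', regex=True)

#remove numbers from text
yt['comment'] = yt['comment'].str.replace(r'\d+', '', regex=True)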

#yt for GloVe embedding
#yt_BERT for BERT embedding
yt_BERT = yt.copy()

#remove stopwords from yt (GloVe embedding) only
nltk.download('stopwords')
stop = stopwords.words('english')

yt['comment'] = yt['comment'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
display(yt.head())
display(yt.shape)

Unnamed: 0.1,Unnamed: 0,comment,cb_label
0,2,It must take alot of courage to stand there an...,0
1,3,Husband : Wat do u want to eat?\nWife : I got ...,0
2,4,I cannot tahan that cocky look. How did we end...,1
3,5,Pritam is like the gf that’s asking CCS “Can I...,0
4,6,I’m sorry I honestly can’t take someone seriou...,1


Unnamed: 0,comment,cb_label
0,It must take alot of courage to stand there an...,0
1,Husband : Wat do u want to eat?\nWife : I got ...,0
2,I cannot tahan that cocky look. How did we end...,1
3,Pritam is like the gf that’s asking CCS “Can I...,0
4,I’m sorry I honestly can’t take someone seriou...,1


0    373
1    115
Name: cb_label, dtype: int64

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,comment,cb_label
0,must take alot courage stand speak room people...,0
1,husband wat u want eat wife got nothing hide h...,0
2,cannot tahan cocky look end lousy personalitie...,1
3,pritam like gf thats asking ccs see phone ccs ...,0
4,im sorry honestly cant take someone seriously ...,1


(488, 2)

In [None]:
# Check the max number of words in comments

length = yt_BERT['comment'].str.split().str.len().max()
print("The maximum number of words in yt_BERT comments is : " +  str(length)) 

length = yt['comment'].str.split().str.len().max()
print("The maximum number of words in yt comments is : " +  str(length)) 

The maximum number of words in yt_BERT comments is : 180
The maximum number of words in yt comments is : 109


## BERT + Logistic Regression on YouTube Data

In [None]:
# BERT Embedding for yt_BERT
# Load DistilBERT embeddings (a smaller version of BERT with comparable accuracy)
# Set Max_Seq_Length = 180

dbt = BertTransformer(DistilBertTokenizer.from_pretrained("distilbert-base-uncased"),
                      DistilBertModel.from_pretrained("distilbert-base-uncased"),
                      embedding_func=lambda x: x[0][:, 0, :].squeeze(),max_length = 180)

yt_BERT_vectorised = dbt.fit_transform(yt_BERT["comment"])



In [None]:
# Classifying using our best proposed Model 1: BERT Log Reg on Youtube data

my_tags = ["non-toxic", "toxic"]

print("==== Results ====")
log_pred_yt = logreg.predict(yt_BERT_vectorised)
print("                     ")
print(classification_report(np.array(yt_BERT["cb_label"]), log_pred_yt,target_names=my_tags))
print("                            ")
print("==== Confusion Matrix ====")
log_cmtx_yt = pd.DataFrame(
    confusion_matrix(np.array(yt_BERT["cb_label"]), log_pred_yt, labels = [0,1]), 
    index=['Actual: Non-Toxic', 'Actual: Toxic'], 
    columns=['Predicted: Non-Toxic', 'Predicted: Toxic'])
display(log_cmtx_yt)
print('Accuracy %s' % accuracy_score(log_pred_yt, np.array(yt_BERT["cb_label"])))
# Note: this AUC is computed from the hard 0/1 predictions, not from predicted probabilities
print("AUC: " , metrics.roc_auc_score(np.array(yt_BERT["cb_label"]), log_pred_yt))

==== Results ====
                     
              precision    recall  f1-score   support

   non-toxic       0.79      0.97      0.87       373
       toxic       0.59      0.14      0.23       115

    accuracy                           0.77       488
   macro avg       0.69      0.55      0.55       488
weighted avg       0.74      0.77      0.72       488

                            
==== Confusion Matrix ====


Unnamed: 0,Predicted: Non-Toxic,Predicted: Toxic
Actual: Non-Toxic,362,11
Actual: Toxic,99,16


Accuracy 0.7745901639344263
AUC:  0.5548199090803124
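
Because the AUC above is computed from hard 0/1 predictions, it reduces to the mean of the true-positive rate and the true-negative rate (balanced accuracy). The sketch below checks this with the counts from the confusion matrix; `pairwise_auc` is a hypothetical helper that implements the rank-based definition of AUC in plain Python.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (toxic, non-toxic) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Counts from the YouTube confusion matrix for BERT + Logistic Regression
tn, fp, fn, tp = 362, 11, 99, 16
y_true = [0] * (tn + fp) + [1] * (fn + tp)
y_hard = [0] * tn + [1] * fp + [0] * fn + [1] * tp

auc_hard = pairwise_auc(y_true, y_hard)
balanced_accuracy = 0.5 * (tp / (tp + fn) + tn / (tn + fp))
print(auc_hard, balanced_accuracy)
```

Passing scores such as the positive-class column of `logreg.predict_proba(...)` to `roc_auc_score` would instead measure how well the model ranks comments across all thresholds; that value is not reported in this notebook.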


## GloVe + CNN on YouTube Data

In [None]:
# Set the embedding parameters
embedding_dim = 100   # how big is each word vector   
max_features = 40000  # how many unique words to use (i.e num rows in embedding vector)
max_text_length= 109  # max number of words in a comment to use

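# Caveat: fitting a new Tokenizer here assigns word indices by frequency in the
# YouTube comments, so they no longer match the indices glove_cnn was trained on
# (its frozen Embedding layer still holds the training embedding matrix).
# Reusing the tokenizer fitted on df_BERT["Text"] would keep the indices aligned.
x_tokenizer=text.Tokenizer(max_features)
x_tokenizer.fit_on_texts(list(yt["comment"]))
x_tokenized=x_tokenizer.texts_to_sequences(yt["comment"])
x_train_val=sequence.pad_sequences(x_tokenized,maxlen=max_text_length)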

In [None]:
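# Note: glove_cnn's Embedding weights were fixed when the model was built, so this
# rebuilt matrix does not change what the trained model looks up.
embedding_matrix=np.zeros((max_features,embedding_dim))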
for word,index in x_tokenizer.word_index.items():
    if index>max_features-1:
        break
    else:
        embedding_vector=embedding_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[index]=embedding_vector
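
A frequency-ranked vocabulary depends on the corpus it is fitted on, which matters when a trained model is applied to new text. The sketch below mimics that ranking in plain Python with made-up sentences; `fit_word_index` and `encode` are hypothetical stand-ins, not the Keras `Tokenizer`.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(w for t in texts for w in t.split())
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])   # stable for ties
    return {word: i + 1 for i, (word, _) in enumerate(ranked)}

def encode(text, word_index):
    """Map words to ids, dropping words that are not in the vocabulary."""
    return [word_index[w] for w in text.split() if w in word_index]

train_texts = ["please stop vandalism", "please stop", "please read policy"]
yt_texts = ["cocky look", "cocky people please"]

train_index = fit_word_index(train_texts)
yt_index = fit_word_index(yt_texts)

print(train_index["please"], yt_index["please"])       # same word, different ids
print(encode("cocky people please", yt_index))         # ids from the re-fitted vocabulary
print(encode("cocky people please", train_index))      # ids the trained model expects
```

In this toy case id 1 means "please" in the training vocabulary but "cocky" in the re-fitted one, so a model trained on the first mapping would look up the wrong vector for the second.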

In [None]:
# Classifying using Model 6: Perspective Replica Glove-CNN (glove_cnn) on Youtube data

my_tags = ["non-toxic", "toxic"]

print("==== Results ====")
glove_cnn_pred_yt = glove_cnn.predict(np.array(x_train_val))
glove_cnn_predict_yt = (glove_cnn_pred_yt > 0.5)
glove_cnn_predict_yt = glove_cnn_predict_yt*1 #convert to 0,1 instead of True False
print("                            ")
print(classification_report(np.array(yt["cb_label"]), glove_cnn_predict_yt,target_names=my_tags))
print("                            ")
print("==== Confusion Matrix ====")

glove_cnn_cmtx_yt = pd.DataFrame(confusion_matrix(yt["cb_label"], glove_cnn_predict_yt, labels = [0,1]),
                                index=['Actual: Non-Toxic', 'Actual: Toxic'], 
                                columns=['Predicted: Non-Toxic', 'Predicted: Toxic'])
display(glove_cnn_cmtx_yt)
glove_cnn_scores_yt = glove_cnn.evaluate(np.array(x_train_val),np.array(yt["cb_label"]),batch_size = batch_size)
print("Accuracy: %.3f%%" % (glove_cnn_scores_yt[2]*100))
print("AUC: " , metrics.roc_auc_score(np.array(yt["cb_label"]), glove_cnn_pred_yt))

==== Results ====
                            
              precision    recall  f1-score   support

   non-toxic       0.76      0.97      0.85       373
       toxic       0.25      0.03      0.06       115

    accuracy                           0.75       488
   macro avg       0.51      0.50      0.46       488
weighted avg       0.64      0.75      0.67       488

                            
==== Confusion Matrix ====


Unnamed: 0,Predicted: Non-Toxic,Predicted: Toxic
Actual: Non-Toxic,361,12
Actual: Toxic,111,4


Accuracy: 74.795%
AUC:  0.4753934024944632


# **4. Data Exploration, Cleaning, Processing & Exporting of Files** (Not Required to Run)

## Data Exploration

In [None]:
# Load dataset
#df = pd.read_csv(dir + '/toxicity_parsed_dataset.csv')

# First 5 Rows
display(df.head())

# Remove columns
df=df.drop(['index','ed_label_1','ed_label_0'],axis=1)
display(df.head())

# Ratio of Toxic:Non-Toxic Labels 
# 0:    144324
# 1:    15362
# Total: 159686
display(df['oh_label'].value_counts(dropna = False))

# Drop NaN
df = df.dropna()
display(df['oh_label'].value_counts(dropna = False))

Unnamed: 0,index,Text,ed_label_0,ed_label_1,oh_label
0,0,This: :One can make an analogy in mathematical...,0.9,0.1,0
1,1,` :Clarification for you (and Zundark's righ...,1.0,0.0,0
2,2,Elected or Electoral? JHK,1.0,0.0,0
3,3,`This is such a fun entry. Devotchka I once...,1.0,0.0,0
4,4,Please relate the ozone hole to increases in c...,0.8,0.2,0


Unnamed: 0,Text,oh_label
0,This: :One can make an analogy in mathematical...,0
1,` :Clarification for you (and Zundark's righ...,0
2,Elected or Electoral? JHK,0
3,`This is such a fun entry. Devotchka I once...,0
4,Please relate the ozone hole to increases in c...,0


0    144324
1     15362
Name: oh_label, dtype: int64

0    144324
1     15362
Name: oh_label, dtype: int64

## Data Cleaning

In [None]:
# Remove duplicates, remove hyperlinks, drop NA
df=preparation(df)

#change all to lower case
df['Text'] = df['Text'].apply(lambda x: " ".join(x.lower() for x in x.split()))

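#replace special characters (regex=True is required on pandas >= 2.0, where the default is a literal match)
df['Text'] = df['Text'].str.replace(r'[^\w\s]', '', regex=True)

#remove numbers from text
df['Text'] = df['Text'].str.replace(r'\d+', '', regex=True)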

df_BERT = df.copy()

#For GloVe Embedding, remove stopwords

#nltk.download('stopwords')
#stop = stopwords.words('english')

#df['Text'] = df['Text'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
#display(df.head())
#display(df.shape)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


## Data Processing & Exporting - BERT Embedding in Batches

In [None]:
# Split data into chunks so that embedding fits within the available RAM

batch_list = np.array_split(df_BERT, 10)
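
The batch shapes printed further down (eight batches of 15939 rows and two of 15938) follow from how an uneven split distributes the remainder. A short sketch of that arithmetic in plain Python, with a hypothetical helper name:

```python
def split_sizes(n_rows, n_chunks):
    """Chunk sizes for a near-even split: the first `n_rows % n_chunks` chunks get one extra row."""
    base, extra = divmod(n_rows, n_chunks)
    return [base + 1] * extra + [base] * (n_chunks - extra)

sizes = split_sizes(159388, 10)
print(sizes)
print(sum(sizes))
```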

In [None]:
count = 0
for i in batch_list:
  count += 1
  print("====" + str(count) + "====")
  print("Shape of Batch:" + str(i.shape))
  display(i.head(1))
  print("Rows with NA:")
  display(i[i.isna().any(axis=1)])
  # Download batched data into 10 diff. csv
  #i.to_csv('Batch '+ str(count) + '.csv',index=False)
  #files.download('Batch '+ str(count) + '.csv')


====1====
Shape of Batch:(15939, 2)


Unnamed: 0,Text,oh_label
0,this one can make an analogy in mathematical t...,0


Rows with NA:


Unnamed: 0,Text,oh_label


====2====
Shape of Batch:(15939, 2)


Unnamed: 0,Text,oh_label
15975,mainland asia includes the lower basin of ch...,0


Rows with NA:


Unnamed: 0,Text,oh_label


====3====
Shape of Batch:(15939, 2)


Unnamed: 0,Text,oh_label
31950,now that it is properly licensed i can put it ...,0


Rows with NA:


Unnamed: 0,Text,oh_label


====4====
Shape of Batch:(15939, 2)


Unnamed: 0,Text,oh_label
47936,september please stop if you continue to ...,0


Rows with NA:


Unnamed: 0,Text,oh_label


====5====
Shape of Batch:(15939, 2)


Unnamed: 0,Text,oh_label
63913,whiteway and southdown whiteway and southdow...,0


Rows with NA:


Unnamed: 0,Text,oh_label


====6====
Shape of Batch:(15939, 2)


Unnamed: 0,Text,oh_label
79883,now you need to put an explantion of the tag o...,0


Rows with NA:


Unnamed: 0,Text,oh_label


====7====
Shape of Batch:(15939, 2)


Unnamed: 0,Text,oh_label
95856,respect has to be earned the opposite of the ...,0


Rows with NA:


Unnamed: 0,Text,oh_label


====8====
Shape of Batch:(15939, 2)


Unnamed: 0,Text,oh_label
111819,i am the gayest person on earth from boing sai...,1


Rows with NA:


Unnamed: 0,Text,oh_label


====9====
Shape of Batch:(15938, 2)


Unnamed: 0,Text,oh_label
127772,yeah just a bunch more whiney suck garbage fr...,1


Rows with NA:


Unnamed: 0,Text,oh_label


====10====
Shape of Batch:(15938, 2)


Unnamed: 0,Text,oh_label
143722,sockpuppets of techastrax hi i would like to...,0


Rows with NA:


Unnamed: 0,Text,oh_label


In [None]:
# Load DistilBERT embeddings (a smaller version of BERT with comparable accuracy)
# Set Max_Seq_Length = 200

dbt = BertTransformer(DistilBertTokenizer.from_pretrained("distilbert-base-uncased"),
                      DistilBertModel.from_pretrained("distilbert-base-uncased"),
                      embedding_func=lambda x: x[0][:, 0, :].squeeze(),max_length = 200)

In [None]:
batch_list_name = ["batch_1", "batch_2", "batch_3", "batch_4", "batch_5", "batch_6", "batch_7", "batch_8", "batch_9", "batch_10"]
path = "/content"

In [None]:
from google.colab import files
count = 0

for i in batch_list:
  batch_vectorised = dbt.fit_transform(i["Text"])
  save_pkl_pickle(path, batch_vectorised, batch_list_name[count] + "_vectorised")
  #files.download(batch_list_name[count] + '_vectorised.pickle')
  count +=1

====done saving into pickle using Pickle!====



## Import Batch Processed Data

In [None]:
# Load dataset
df = pd.read_csv(dir + '/toxicity_parsed_dataset.csv')

# First 5 Rows
display(df.head())

# Remove columns
df=df.drop(['index','ed_label_1','ed_label_0'],axis=1)
display(df.head())

# Ratio of Toxic:Non-Toxic Labels 
# 0:    144324
# 1:    15362
# Total: 159686
display(df['oh_label'].value_counts(dropna = False))

# Drop NaN
df = df.dropna()
display(df['oh_label'].value_counts(dropna = False))
df_BERT = df.copy()

df_BERT=preparation(df_BERT)

#change all to lower case
df_BERT['Text'] = df_BERT['Text'].apply(lambda x: " ".join(x.lower() for x in x.split()))

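#replace special characters (regex=True is required on pandas >= 2.0, where the default is a literal match)
df_BERT['Text'] = df_BERT['Text'].str.replace(r'[^\w\s]', '', regex=True)

#remove numbers from text
df_BERT['Text'] = df_BERT['Text'].str.replace(r'\d+', '', regex=True)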

#remove stopwords

nltk.download('stopwords')
stop = stopwords.words('english')

df_BERT['Text'] = df_BERT['Text'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))

display(df_BERT.shape)

Unnamed: 0,index,Text,ed_label_0,ed_label_1,oh_label
0,0,This: :One can make an analogy in mathematical...,0.9,0.1,0
1,1,` :Clarification for you (and Zundark's righ...,1.0,0.0,0
2,2,Elected or Electoral? JHK,1.0,0.0,0
3,3,`This is such a fun entry. Devotchka I once...,1.0,0.0,0
4,4,Please relate the ozone hole to increases in c...,0.8,0.2,0


Unnamed: 0,Text,oh_label
0,This: :One can make an analogy in mathematical...,0
1,` :Clarification for you (and Zundark's righ...,0
2,Elected or Electoral? JHK,0
3,`This is such a fun entry. Devotchka I once...,0
4,Please relate the ozone hole to increases in c...,0


0    144324
1     15362
Name: oh_label, dtype: int64

0    144324
1     15362
Name: oh_label, dtype: int64

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


(159388, 2)

In [None]:
# Getting Y data
Y = df_BERT["oh_label"]
Y.head()
Y.shape

(159388,)

In [None]:
# Load batch data
#path = "/content" 
infile = open(dir + "/batch_1_vectorised.pickle",'rb')
batch_1_vectorised = pickle.load(infile)
batch_1_vectorised.shape

torch.Size([15939, 768])

In [None]:
# Load batch data 
infile = open(dir + "/batch_2_vectorised.pickle",'rb')
batch_2_vectorised = pickle.load(infile)
batch_2_vectorised.shape

torch.Size([15939, 768])

In [None]:
# Load batch data 
infile = open(dir + "/batch_3_vectorised.pickle",'rb')
batch_3_vectorised = pickle.load(infile)
batch_3_vectorised.shape

torch.Size([15939, 768])

In [None]:
# Load batch data 
infile = open(dir + "/batch_4_vectorised.pickle",'rb')
batch_4_vectorised = pickle.load(infile)
batch_4_vectorised.shape

torch.Size([15939, 768])

In [None]:
# Load batch data 
infile = open(dir + "/batch_5_vectorised.pickle",'rb')
batch_5_vectorised = pickle.load(infile)
batch_5_vectorised.shape

torch.Size([15939, 768])

In [None]:
# Load batch data 
infile = open(dir + "/batch_6_vectorised.pickle",'rb')
batch_6_vectorised = pickle.load(infile)
batch_6_vectorised.shape

torch.Size([15939, 768])

In [None]:
# Load batch data 
infile = open(dir + "/batch_7_vectorised.pickle",'rb')
batch_7_vectorised = pickle.load(infile)
batch_7_vectorised.shape

torch.Size([15939, 768])

In [None]:
# Load batch data 
infile = open(dir + "/batch_8_vectorised.pickle",'rb')
batch_8_vectorised = pickle.load(infile)
batch_8_vectorised.shape

torch.Size([15939, 768])

In [None]:
# Load batch data 
infile = open(dir + "/batch_9_vectorised.pickle",'rb')
batch_9_vectorised = pickle.load(infile)
batch_9_vectorised.shape

torch.Size([15938, 768])

In [None]:
# Load batch data 
infile = open(dir + "/batch_10_vectorised.pickle",'rb')
batch_10_vectorised = pickle.load(infile)
batch_10_vectorised.shape

torch.Size([15938, 768])

In [None]:
# Getting X_BERT_vectorised data
X_vectorised = torch.cat([batch_1_vectorised, batch_2_vectorised, 
                          batch_3_vectorised, batch_4_vectorised, 
                          batch_5_vectorised, batch_6_vectorised, 
                          batch_7_vectorised, batch_8_vectorised, 
                          batch_9_vectorised, batch_10_vectorised, ], 0)

X_vectorised.shape

torch.Size([159388, 768])
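
The ten near-identical loading cells above could be collapsed into one loop. The sketch below shows the idea with small stand-in pickles written to a temporary folder; `load_batches` is a hypothetical helper, and plain lists are used in place of the real tensors so that the example runs without `torch`.

```python
import os
import pickle
import tempfile

def load_batches(folder, n_batches):
    """Load batch_1 ... batch_n pickles in order, closing each file after reading."""
    batches = []
    for i in range(1, n_batches + 1):
        path = os.path.join(folder, f"batch_{i}_vectorised.pickle")
        with open(path, "rb") as infile:
            batches.append(pickle.load(infile))
    return batches

# Demo with three tiny stand-in batches
with tempfile.TemporaryDirectory() as tmp:
    for i in range(1, 4):
        with open(os.path.join(tmp, f"batch_{i}_vectorised.pickle"), "wb") as out:
            pickle.dump([[float(i)] * 2], out)
    demo = load_batches(tmp, 3)

print(demo)
```

In the notebook the returned list would be passed to `torch.cat(batches, 0)`; loading in numeric order matters because the rows must stay aligned with `Y`.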

## Processing Imported Data

In [None]:
# Train-test split X_BERT_vectorised and Y
X_train, X_val, y_train, y_val = train_test_split(X_vectorised, Y, random_state=0, test_size=0.4)

In [None]:
display(X_train.shape)
display(X_val.shape)
display(y_train.shape)
display(y_val.shape)

torch.Size([95632, 768])

torch.Size([63756, 768])

(95632,)

(63756,)
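
The split sizes shown above can be checked by hand. The sketch assumes the usual rounding for a fractional `test_size` (the validation share is rounded up and the training set takes the remainder); the helper name is hypothetical.

```python
import math

def holdout_sizes(n_samples, test_size):
    """Row counts for a fractional hold-out split, rounding the held-out part up."""
    n_val = math.ceil(test_size * n_samples)
    return n_samples - n_val, n_val

n_train, n_val = holdout_sizes(159388, 0.4)
print(n_train, n_val)
```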

In [None]:
# Saving X_train, X_val, y_train, y_val 
# Note: X is BERT word embeddings of text data in toxicity dataset, Y is label
path = '/content'
save_pkl_pickle(path, X_train, "X_train")
save_pkl_pickle(path, X_val, "X_val")
save_pkl_pickle(path, y_train, "y_train")
save_pkl_pickle(path, y_val, "y_val")
files.download('X_train.pickle')
files.download('X_val.pickle')
files.download('y_train.pickle')
files.download('y_val.pickle')

====done saving into pickle using Pickle!====
====done saving into pickle using Pickle!====
====done saving into pickle using Pickle!====
====done saving into pickle using Pickle!====



# **5. Data Preparation for Visualisation** (Not Required to Run)

## Preparing Data for Tableau

In [None]:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import decomposition, ensemble
import xgboost, numpy, textblob, string
from textblob import TextBlob
from keras.preprocessing import text, sequence
from keras import layers, models, optimizers

In [None]:
data = pd.read_csv('drive/My Drive/Data/toxicity_parsed_dataset.csv')
data=data.drop(['index','ed_label_0','ed_label_1'],axis=1)

#find the word count and length before cleaning the data
data['B4review_len'] = data['Text'].astype(str).apply(len)

#find out the number of words in each of the records
data['B4word_count'] =data['Text'].apply(lambda x: len(str(x).split()))

data=preparation(data)

## Data Cleaning on Toxicity Dataset

In [None]:
#change all to lower case
data['Text'] = data['Text'].apply(lambda x: " ".join(x.lower() for x in x.split()))

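#replace special characters (regex=True is required on pandas >= 2.0, where the default is a literal match)
data['Text'] = data['Text'].str.replace(r'[^\w\s]', '', regex=True)

#remove numbers from text
data['Text'] = data['Text'].str.replace(r'\d+', '', regex=True)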

#remove stopwords
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop = stopwords.words('english')

data['Text'] = data['Text'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
data.head()

#after cleaning the data what is the length 
#find out the length of the comments
data['cleanreview_len'] = data['Text'].astype(str).apply(len)

#find out the number of words in each of the records
data['cleanword_count'] =data['Text'].apply(lambda x: len(str(x).split()))

from google.colab import files
data.to_csv('preparedtoxicdata.csv') 
files.download('preparedtoxicdata.csv')

## Data Preparation for N-Gram Analysis

In [None]:
data0=data[data['oh_label']==0]
data0
#data.groupby(['oh_label']).count()

In [None]:
data1=data[data['oh_label']==1]
data1

In [None]:
#to find out the words that are frequently used AFTER REMOVING STOP WORDS
def get_top_n_words(corpus, n=None):
    vec = CountVectorizer().fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_words(data['Text'], 500)
#for word, freq in common_words:
#    print(word, freq)
common_toxicwords=get_top_n_words(data1['Text'],500)
common_nontoxicwords=get_top_n_words(data0['Text'],500)
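
The unigram, bigram and trigram helpers in this section differ only in the n-gram size, so a single function with an `n` parameter can cover all three. The sketch below uses `collections.Counter` and whitespace splitting instead of `CountVectorizer`, so counts can differ slightly from the notebook's (for example, `CountVectorizer` ignores one-character tokens by default). The function name and sample sentences are made up for the example.

```python
from collections import Counter

def top_ngrams(corpus, n=1, k=None):
    """Return the k most frequent word n-grams as (ngram, count) pairs."""
    counts = Counter()
    for doc in corpus:
        tokens = doc.split()
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts.most_common(k)

docs = ["lol lol lol", "criteria speedy deletion", "speedy deletion lol"]

print(top_ngrams(docs, n=1, k=1))
print(top_ngrams(docs, n=2))
print(top_ngrams(docs, n=3))
```

The resulting list of pairs can be passed to `pd.DataFrame` in the same way as the output of the existing helpers.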

In [None]:
common_words

In [None]:
common_nontoxicwords=pd.DataFrame(data=common_nontoxicwords)
common_nontoxicwords.columns=['nontoxic single words','Nontoxic Freq']

In [None]:
common_toxicwords
common_toxicwords=pd.DataFrame(data=common_toxicwords)
common_toxicwords.columns=['toxic singular words','Freq1']

In [None]:
common_singularwords=pd.DataFrame(data=common_words)
common_singularwords.columns=["singular_words","Frequency1"]

In [None]:
common_singularwords

In [None]:
def get_top_n_bigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 2)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]
common_bigrams = get_top_n_bigram(data['Text'], 500)
common_toxicbigrams=get_top_n_bigram(data1['Text'],500)
common_nontoxicbigram=get_top_n_bigram(data0['Text'],500)
#for word, freq in common_words:
#    print(word, freq)

In [None]:
common_nontoxicbigram=pd.DataFrame(data=common_nontoxicbigram)
common_nontoxicbigram.columns=['nontoxic bigram','Nontoxic Freq2']

In [None]:
common_toxicbigrams
common_toxicbigrams=pd.DataFrame(data=common_toxicbigrams)
common_toxicbigrams.columns=['toxic bigrams','Freq2']

In [None]:
common_bigrams=pd.DataFrame(data=common_bigrams)
common_bigrams.columns=["common_bigrams","Frequency2"]

In [None]:
common_bigrams

In [None]:
def get_top_n_trigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(3, 3)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]
common_trigrams = get_top_n_trigram(data['Text'], 500)
common_toxictrigrams=get_top_n_trigram(data1['Text'],500)
common_nontoxictrigrams=get_top_n_trigram(data0['Text'],500)
#for word, freq in common_words:
#    print(word, freq)

In [None]:
common_nontoxictrigram=pd.DataFrame(data=common_nontoxictrigrams)
common_nontoxictrigram.columns=['nontoxic trigram','Nontoxic Freq3']
common_nontoxictrigram

Unnamed: 0,nontoxic trigram,Nontoxic Freq3
0,criteria speedy deletion,1427
1,lol lol lol,1264
2,five pillars wikipedia,1011
3,four tildes automatically,1006
4,fish fish fish,998
...,...,...
495,currently doesnt specify,151
496,doesnt specify created,151
497,release gfdl believe,151
498,tagged find list,151


In [None]:
# Convert the list of (trigram, count) pairs into a labelled DataFrame
common_toxictrigrams=pd.DataFrame(data=common_toxictrigrams)
common_toxictrigrams.columns=['toxic trigrams','Freq3']

In [None]:
common_trigrams=pd.DataFrame(data=common_trigrams)
common_trigrams.columns=["common_trigrams","Frequency3"]

In [None]:
common_trigrams

In [None]:
ngram_analysis=pd.concat([common_singularwords, common_bigrams,common_trigrams], axis=1)
ngram_analysis

In [None]:
from google.colab import files
ngram_analysis.to_csv('updated_ngram_analysis.csv') 
files.download('updated_ngram_analysis.csv')


In [None]:
nontoxicngram_analysis=pd.concat([common_nontoxicwords,common_nontoxicbigram,common_nontoxictrigram],axis=1)
nontoxicngram_analysis

In [None]:
from google.colab import files
nontoxicngram_analysis.to_csv('nontoxicngram_analysis.csv') 
files.download('nontoxicngram_analysis.csv')


In [None]:
toxicngram_analysis=pd.concat([common_toxicwords,common_toxicbigrams,common_toxictrigrams],axis=1)
toxicngram_analysis

In [None]:
from google.colab import files
toxicngram_analysis.to_csv('toxicngram_analysis.csv') 
files.download('toxicngram_analysis.csv')

## Attempt at finding emojis in text data, no results

In [None]:
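import emoji
import regex

def split_count(text):
    """Return the emoji grapheme clusters found in text."""
    emoji_list = []
    # \X matches one extended grapheme cluster, so multi-code-point emojis stay whole
    graphemes = regex.findall(r'\X', text)
    for grapheme in graphemes:
        # emoji.UNICODE_EMOJI was removed in emoji 2.0; EMOJI_DATA is the current lookup table
        if any(char in emoji.EMOJI_DATA for char in grapheme):
            emoji_list.append(grapheme)
    return emoji_list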
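
The helper above is defined but never applied to any text. As a quick dependency-free cross-check, the sketch below flags characters that fall inside a handful of common Unicode emoji blocks. This is an approximation under that assumption and not the `emoji` package's full table. `EMOJI_RANGES`, `find_emoji_chars` and the sample strings are hypothetical.

```python
# Assumption: a few common emoji blocks are enough for a rough check (not exhaustive)
EMOJI_RANGES = [
    (0x1F300, 0x1F5FF),  # Miscellaneous Symbols and Pictographs
    (0x1F600, 0x1F64F),  # Emoticons
    (0x1F680, 0x1F6FF),  # Transport and Map Symbols
    (0x1F900, 0x1F9FF),  # Supplemental Symbols and Pictographs
    (0x2600, 0x26FF),    # Miscellaneous Symbols
    (0x2700, 0x27BF),    # Dingbats
]

def find_emoji_chars(text):
    """Return the characters of text whose code point lies in EMOJI_RANGES."""
    return [ch for ch in text
            if any(low <= ord(ch) <= high for low, high in EMOJI_RANGES)]

# Sample strings for illustration only
samples = ["great video 😂👍", "no emoji here"]
found = [find_emoji_chars(sample) for sample in samples]
print(found)
```

Applied to a column such as `data['Text']`, a function like this would show whether any emoji characters are present in the text at all.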

# **5. Web-scraping of YouTube Comments Dataset** (Not required to Run)

In [None]:
import time
from selenium import webdriver
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

In [None]:
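data = []

# Selenium 4 takes the driver path through a Service object;
# the positional path argument no longer works in recent 4.x releases
from selenium.webdriver.chrome.service import Service
browser = webdriver.Chrome(service=Service("chromedriver"))
wait = WebDriverWait(browser, 15)
browser.get("https://www.youtube.com/watch?v=DSCCnr6yPCU")

# Press END repeatedly so that YouTube lazy-loads more comments
for _ in range(200):
    wait.until(EC.visibility_of_element_located((By.TAG_NAME, "body"))).send_keys(Keys.END)
    time.sleep(1.5)

# Collect the text of every element matching the #content selector
for comment in wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#content"))):
    data.append(comment.text)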
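
The loop above always scrolls 200 times, even if the comments finish loading much earlier. A possible alternative, sketched below, stops once the page height no longer grows between scrolls. It relies only on the driver's `execute_script` method; the function name `scroll_until_stable` and its parameters are hypothetical, and a slow connection could make it stop too early if `pause` is shorter than the time the next batch of comments takes to load.

```python
import time

def scroll_until_stable(driver, pause=1.5, max_rounds=200):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scroll rounds performed.
    """
    height_script = "return document.documentElement.scrollHeight"
    last_height = driver.execute_script(height_script)
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        # Jump to the current bottom of the page to trigger lazy loading
        driver.execute_script(
            "window.scrollTo(0, document.documentElement.scrollHeight);"
        )
        time.sleep(pause)
        new_height = driver.execute_script(height_script)
        if new_height == last_height:
            break  # nothing new was loaded during this round
        last_height = new_height
    return rounds
```

With the notebook's `browser` object, this would be called as `scroll_until_stable(browser)` in place of the fixed-count loop.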

In [None]:
data

['SG\nSKIP NAVIGATION\nSIGN IN\n0:00 / 3:01\nMediacorp is a Singaporean public broadcast service. Wikipedia\nChan Chun Sing, Pritam Singh spar in Parliament over Singapore\'s foreign worker policy\n361,731 views•Jan 7, 2020\nLIKE\nDISLIKE\nSHARE\nSAVE\nTODAYonline\n83.8K subscribers\nSUBSCRIBE\nSingapore\'s oft-debated foreign worker policy sparked an exchange between Trade and Industry Minister Chan Chun Sing and Workers\' Party chief Pritam Singh in Parliament on Monday (Jan 6). Mr Singh repeatedly asked for a breakdown of the number of new jobs that went to Singaporeans, permanent residents (PRs) and foreigners.\nRead story here: \nSHOW MORE\n921 Comments\nSORT BY\nAdd a public comment...\nMzbros\n8 months ago\nIt must take alot of courage to stand there and speak against a room of people which are against your existence in that room.\n616\nREPLY\nView 10 replies\nShadoStorm\n8 months ago\nHusband : Wat do u want to eat?\nWife : I got nothing to hide.\nHusband : Wat do u want to e

In [None]:
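import pandas as pd
df = pd.DataFrame(data,columns=['comment'])
# Drop the first two scraped elements and the last one;
# element 0 is the whole-page text shown above
df[2:-1]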

Unnamed: 0,comment
2,It must take alot of courage to stand there an...
3,Husband : Wat do u want to eat?\nWife : I got ...
4,I cannot tahan that cocky look. How did we end...
5,Pritam is like the gf that’s asking CCS “Can I...
6,I’m sorry I honestly can’t take someone seriou...
...,...
485,Many PR's are born here and still holding resi...
486,"I think PR is ok la, u go Australia , u go Can..."
487,"WP is worthless with their so called ""checks""...."
488,"Evil intention of the WP. Good job, Mr Chan !"


In [None]:
# files.download needs the Colab helper module
from google.colab import files
df[2:-1].to_csv("ytcomments.csv")
files.download("ytcomments.csv")