TOXICITY CLASSIFICATION WITH REDUCED
UNINTENDED BIAS

Group Members
1. Agnes Sharan Sahaya Raj Helan - asr647
2. Jairam Venkitachalam - jv1589
3. Srishti Bhargava - sb7261 (Member responsible for uploading submissions)

INTRODUCTION <br>
Discussion on online platforms can be difficult. A constant fear of abuse and harassment keeps
people from expressing their opinions, which in turn leaves platforms unable to foster an
environment for stimulating conversations. This form of cyberbullying also has adverse
effects on the psychology of participants. Such a scenario bodes well neither for the
participants nor for the platform hosting the conversations. There have been various attempts
at building models that classify comments into different classes depending on
their toxicity, with the aim of making online discussions more productive and respectful.
However, these models have a history of flagging as toxic even comments meant to foster
fruitful discussion about ostracised communities, because of word associations learned from the
training data, and so misclassify these comments. We aim to build a machine learning model
that identifies toxic comments while reducing the misclassification of
comments caused by this unintended bias.

Methodology : Algorithm and Models <br>
Three modelling algorithms were used to solve the problem. They are: <br>
1. Logistic Regression <br>
This first model was chosen for its wide use in classification problems and because it predicts the probability of a data point belonging to each class. The target value in the dataset is a continuous probability, so we binarise it at 0.5: comments with a target $\geq 0.5$ are labelled toxic and the rest non-toxic. The predicted probability is thresholded at 0.5 in the same way to decide the class of a comment. <br>
	
2. Random Forests <br>
Decision trees predict the class of a data point using a restricted subset of the features. Because the word embeddings and document-term matrices used to represent the data are very large, restricting the features in this way helps limit overfitting while still picking out the most prevalent features. <br>
However, one drawback of individual decision trees is their tendency to overfit the training data. To overcome this we needed an ensemble model that keeps the advantages of decision trees while guarding against overfitting, and chose Random Forests. A random forest builds many decision trees from the training data and, at prediction time, outputs the mode of the classes predicted by the individual trees. <br>

3. Gradient Boosted Machines <br>
Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function. <br>
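
The stage-wise idea can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not LightGBM's algorithm: it uses squared loss (so the negative gradient is simply the residual), a single numeric feature, one-split stumps as weak learners, and hypothetical names (`fit_stump`, `boost`) and data. <br>

```python
# Toy stage-wise gradient boosting for squared loss with one-split "stumps".
# All names and data are illustrative; this is not LightGBM's implementation.

def fit_stump(xs, residuals):
    """Pick the threshold on a 1-D feature whose two means best fit the residuals."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest value so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def boost(xs, ys, n_stages=20, learning_rate=0.5):
    """Add one stump per stage, each fitted to the current residuals."""
    preds = [sum(ys) / len(ys)] * len(xs)  # stage 0: predict the mean
    history = []
    for _ in range(n_stages):
        # For squared loss the negative gradient is the residual y - F(x).
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return history


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 0.0, 0.2, 0.9, 1.0, 0.8]
mse_per_stage = boost(xs, ys)
print(mse_per_stage[0], mse_per_stage[-1])
```

Each stage only corrects what the previous stages got wrong, which is why the training error recorded in `mse_per_stage` never increases from one stage to the next. <br>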


Experiments:

We begin by training our baseline models. The code below builds the machine learning models.

In [None]:
import pandas as pd
from tqdm import tqdm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from scipy.sparse import hstack
import lightgbm as lgb
import string
from sklearn.metrics import accuracy_score
import re
import sklearn
from nltk.stem import PorterStemmer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt
# sklearn.feature_extraction.stop_words was removed in scikit-learn 0.24;
# the text module exposes the same ENGLISH_STOP_WORDS set.
from sklearn.feature_extraction import text as stop_words
from sklearn.ensemble import RandomForestClassifier

In [None]:
def read_data():
    """
    We simply read the dataset and return a Pandas Dataframe
    """
    train=pd.read_csv("train.csv")
    return train

In [None]:
def parse(data):
    """
    We aim to remove special characters and punctuations from our text field within this function
    """
    punct = "/-'?!.,#$%\'()*+-/:;<=>@[\\]^_`{|}~`" + '""“”’' + '∞θ÷α•à−β∅³π‘₹´°£€\\×™√²—–&'
    def clean_special_chars(text, punct):
        for p in punct:
            text = text.replace(p, ' ')
        return text
    
    data = data.astype(str).apply(lambda x: clean_special_chars(x, punct))
    return data
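
A quick check of what the cleaning step does to a single comment. The sketch below repeats the inner helper with a shortened punctuation set (an assumption made for readability; the notebook's list is longer) and a made-up sample sentence.

```python
# Shortened punctuation set for illustration only.
punct = "/-'?!.,#$%()*+:;<=>@[]^_`{|}~"

def clean_special_chars(text, punct):
    # Replace each listed character with a space so neighbouring words stay separated.
    for p in punct:
        text = text.replace(p, ' ')
    return text

sample = "Don't shout!!! It's rude, isn't it?"
cleaned = clean_special_chars(sample, punct)
print(repr(cleaned))

# Collapse the runs of spaces that the replacement leaves behind.
collapsed = " ".join(cleaned.split())
print(collapsed)
```

Note that apostrophes are replaced too, so contractions split into two tokens ("Don" and "t"). With a `token_pattern` of `\w{1,}` the single-letter fragments survive tokenisation unless the stop word list removes them.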

In [None]:
def preprocess_tfidf():
    """
    The function preprocesses and generates representation for TF-IDF representation.
    It returns X_train, y_train which are the dataset along with the labels for training, 
    and X_test, y_test which are the dataset along with the labels for testing.
    """
    data = read_data()

    train=data[:1500000]
    test=data[1500000:]
    print("Beginning Preprocessing")
    word_vectorizer = TfidfVectorizer(sublinear_tf=True, strip_accents='unicode', analyzer='word',
    token_pattern=r'\w{1,}', stop_words='english', ngram_range=(1, 1), max_features=10000)
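    # Clean only the comment text; fitting on the whole DataFrame would iterate over its column names
    train_text = parse(train['comment_text'])
    test_text = parse(test['comment_text'])
    print("Fitting the vectorizer")
    # Fit on the training split only, then apply the same vocabulary to both splits
    word_vectorizer.fit(train_text)

    X_train = word_vectorizer.transform(train_text)
    X_test = word_vectorizer.transform(test_text)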
    y_train = np.where(train['target'] >= 0.5, 1, 0)
    y_test = np.where(test['target'] >= 0.5, 1, 0)
    
    return X_train, y_train, X_test, y_test
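
To make the TF-IDF weighting concrete, here is a minimal pure-Python sketch of the arithmetic behind `sublinear_tf=True`, assuming scikit-learn's default smoothed IDF and L2 normalisation. The three comments are made up, tokenisation is a plain whitespace split, and stop words are not removed, so the numbers will not match the notebook's vectorizer.

```python
import math
from collections import Counter

# Made-up comments for illustration.
docs = [
    "you are an idiot",
    "thank you for the thoughtful reply",
    "the reply was thoughtful",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many comments each term appears.
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    counts = Counter(tokens)
    weights = {}
    for term, tf in counts.items():
        sublinear_tf = 1.0 + math.log(tf)                     # sublinear_tf=True
        idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0   # smoothed idf
        weights[term] = sublinear_tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))    # L2 normalisation
    return {term: w / norm for term, w in weights.items()}

vectors = [tfidf(tokens) for tokens in tokenized]
print(vectors[0])
```

In the first comment, "idiot" appears in one document and "you" in two, so "idiot" receives the larger weight. This is the property that lets rare, informative words stand out from common ones.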

In [None]:
def preprocess_bow():
    """
    The function preprocesses and generates representation for Bag of Words representation.
    It returns X_train, y_train which are the dataset along with the labels for training, 
    and X_test, y_test which are the dataset along with the labels for testing.
    """
    data = read_data()

    train=data[:1500000]
    test=data[1500000:]
    print("Beginning Preprocessing")
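    # stop_words='english' selects scikit-learn's built-in English stop word list
    count_vectorizer = CountVectorizer(strip_accents='unicode', stop_words='english',
                                 analyzer='word', ngram_range=(1, 1), token_pattern=r'\w{1,}')
    # Clean only the comment text; fitting on the whole DataFrame would iterate over its column names
    train_text = parse(train['comment_text'])
    test_text = parse(test['comment_text'])
    print("Fitting the vectorizer")
    # Fit on the training split only, then apply the same vocabulary to both splits
    count_vectorizer.fit(train_text)

    X_train = count_vectorizer.transform(train_text)
    X_test = count_vectorizer.transform(test_text)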
    y_train = np.where(train['target'] >= 0.5, 1, 0)
    y_test = np.where(test['target'] >= 0.5, 1, 0)
    
    return X_train, y_train, X_test, y_test
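
The bag-of-words representation is simply a table of term counts. The sketch below builds one by hand for two made-up comments, with a tiny stand-in stop word list (an assumption; scikit-learn's English list is much longer and would drop other words as well).

```python
from collections import Counter

# Made-up comments and a tiny stand-in stop word list, for illustration only.
docs = ["you are rude", "you are kind and you are fair"]
stop_words = {"are", "and"}

# Tokenise on whitespace and drop stop words.
tokenized = [[w for w in d.split() if w not in stop_words] for d in docs]

# The vocabulary fixes the column order of the document-term matrix.
vocabulary = sorted({w for tokens in tokenized for w in tokens})

# One row per comment, one column per vocabulary term, each cell a raw count.
matrix = [[Counter(tokens)[term] for term in vocabulary] for tokens in tokenized]

print(vocabulary)
for row in matrix:
    print(row)
```

Unlike TF-IDF, the counts are not rescaled, so a word repeated within a comment contributes proportionally more.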

In [None]:
def plot_graph(x_axis, y_axis1, yaxis2, xlabel, ylabel, title):
    """
    Skeleton code for plotting graphs
    """
    plt.plot(x_axis, y_axis1, color='blue', label='training')
    plt.plot(x_axis, yaxis2, color='orange', label='validation')
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.legend()
    plt.show()

def plot_graph_single_val(x_axis, y_axis1, xlabel, ylabel, title):
    """
    Skeleton code for plotting graphs
    """
    plt.plot(x_axis, y_axis1, color='blue', label='training')
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.legend()
    plt.show()

In [None]:
def logistic_regression(X_train, y_train, X_test, y_test):
    """
    This function performs Logistic Regression and requires as parameters the training and testing dataset
    """
    lr = LogisticRegression()
    lr.fit(X_train, y_train)
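    # predict() thresholds the predicted probability at 0.5
    y_pred_train = lr.predict(X_train)
    y_pred_test = lr.predict(X_test)
    accuracy_train = accuracy_score(y_train, y_pred_train)
    print("Training accuracy: {}".format(accuracy_train))
    accuracy_test = accuracy_score(y_test, y_pred_test)
    print("Testing accuracy: {}".format(accuracy_test))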
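
For intuition, the decision rule of a fitted binary logistic regression can be written out by hand: a weighted sum of the features is passed through the sigmoid, and the resulting probability is thresholded at 0.5. The weights, bias and feature vectors below are hypothetical, not values learned from the dataset.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, features, threshold=0.5):
    """Return (probability of the toxic class, hard 0/1 label)."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    probability = sigmoid(score)
    return probability, int(probability >= threshold)

# Hypothetical parameters for a two-feature model.
weights = [2.0, -1.0]
bias = -0.5

p_toxic, label_toxic = predict(weights, bias, [1.0, 0.0])   # score = 1.5
p_clean, label_clean = predict(weights, bias, [0.0, 1.0])   # score = -1.5
print(p_toxic, label_toxic)
print(p_clean, label_clean)
```

A positive score gives a probability above 0.5 and therefore the toxic label, and a negative score gives the non-toxic label.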

For the tree-based models, experimentation involves varying parameters such as the maximum tree depth (`max_depth`) and the minimum number of samples per leaf (`min_samples_leaf`).

In [None]:
def decision_tree(X_train, y_train, X_test, y_test):
    """
    Sweeps max_depth and min_samples_leaf for a Decision Tree classifier and plots
    training and validation accuracy for each value
    """

    max_depth = np.arange(30) + 1
    min_samples_leaf = np.arange(50) + 1
    train_depth_acc = []
    valid_depth_acc = []
    train_leaf_acc = []
    valid_leaf_acc = []

    for depth in max_depth:
        clf = DecisionTreeClassifier(max_depth=depth)
        clf.fit(X_train, y_train)
        train_depth_acc.append(clf.score(X_train, y_train))
        valid_depth_acc.append(clf.score(X_test, y_test))

    best_idx = int(np.argmax(valid_depth_acc))
    print("Best validation accuracy: {}".format(valid_depth_acc[best_idx]))
    print("Best max_depth: {}".format(max_depth[best_idx]))
    
    plot_graph(max_depth, train_depth_acc, valid_depth_acc, 'max_depth', 'accuracy', 'Maximum depth vs Accuracy')

    for min_leaf in min_samples_leaf:
        clf = DecisionTreeClassifier(min_samples_leaf=min_leaf)
        clf.fit(X_train, y_train)
        train_leaf_acc.append(clf.score(X_train, y_train))
        valid_leaf_acc.append(clf.score(X_test, y_test))

    print("Best validation accuracy: {}".format(max(valid_leaf_acc)))

    plot_graph(min_samples_leaf, train_leaf_acc, valid_leaf_acc, 'min_leaf', 'accuracy', 'Minimum samples per leaf vs Accuracy')
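
To see why `min_samples_leaf` acts as a regulariser, the sketch below searches for the best single split on a toy one-feature dataset using Gini impurity, rejecting any split that would leave fewer than `min_samples_leaf` points on either side. The helper names and data are hypothetical, and the search is a simplification of what a real decision tree does at each node.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    p = sum(labels) / len(labels)
    return 1.0 - p * p - (1.0 - p) * (1.0 - p)

def best_split(xs, ys, min_samples_leaf):
    """Return (threshold, weighted impurity) of the best allowed split, or None."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        # Reject splits that would create a leaf smaller than the minimum.
        if len(left) < min_samples_leaf or len(right) < min_samples_leaf:
            continue
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if best is None or impurity < best[1]:
            best = (t, impurity)
    return best

# Toy data: a single positive example at x = 1.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 0, 0, 0, 0, 0]

print(best_split(xs, ys, min_samples_leaf=1))
print(best_split(xs, ys, min_samples_leaf=2))
```

With a minimum of one sample per leaf the tree can isolate the lone positive example in its own leaf. Raising the minimum to two forbids that split, so the tree must settle for a less pure but more general one.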

In [None]:
def random_forest(X_train, y_train, X_test, y_test):
    """
    Sweeps n_estimators for a Random Forest, plots training and validation accuracy,
    then trains a forest with fixed parameters
    """
    n_estimators = np.arange(50) + 1
    train_est_acc = []
    valid_est_acc = []

    for est in n_estimators:
        clf = RandomForestClassifier(n_estimators=est, bootstrap=False, random_state=42)
        clf.fit(X_train, y_train)
        train_est_acc.append(clf.score(X_train, y_train))
        valid_est_acc.append(clf.score(X_test, y_test))

    plot_graph(n_estimators, train_est_acc, valid_est_acc, 'n_estimators', 'accuracy', 'Number of estimators vs Accuracy')

    clf = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_leaf=1)
    clf.fit(X_train, y_train)
    train_acc = clf.score(X_train, y_train)
    test_acc = clf.score(X_test, y_test)
    print("Random Forest training accuracy (n_estimators=10): {}".format(train_acc))
    print("Random Forest testing accuracy (n_estimators=10): {}".format(test_acc))
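
The "mode of all output classes" rule described for Random Forests can be sketched as a plain majority vote. The votes below are made up, and `forest_predict` is a hypothetical helper. Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes, so this illustrates the idea rather than the library's exact behaviour.

```python
from collections import Counter

def forest_predict(tree_votes):
    """Return the most common class among the trees' votes for one sample."""
    return Counter(tree_votes).most_common(1)[0][0]

# Each inner list holds the class predicted by five trees for one comment.
votes_per_comment = [
    [1, 0, 1, 1, 0],   # three of five trees say toxic
    [0, 0, 1, 0, 0],   # four of five trees say non-toxic
]

predictions = [forest_predict(votes) for votes in votes_per_comment]
print(predictions)
```

Because each tree sees the data differently, individual mistakes tend to be outvoted, which is what makes the ensemble less prone to overfitting than a single deep tree.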

In [None]:
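def gbm_model(X_train, y_train, X_test, y_test):
    """
    Trains a LightGBM classifier over a range of num_leaves values,
    holding out part of the training data for validation
    """
    # Hold out 30% of the training data for validation (train_test_split shuffles by default)
    X_train, X_validation, y_train, y_validation = train_test_split(
        X_train, y_train, test_size=0.3, random_state=80745)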
    
    lgb_train_data = lgb.Dataset(X_train, y_train)
    lgb_validation_data = lgb.Dataset(X_validation, y_validation, reference=lgb_train_data)
    
    num_leaves = list(range(2, 62))
    num_trees = 150

    training_acc = []
    validation_acc = []

    # Sweep num_leaves with the number of trees fixed at num_trees
    for leaf in num_leaves:
        lgb_params = {
            'objective': 'binary',
            'metric': 'binary_logloss',
            'num_leaves': leaf,
            'num_trees': num_trees,
        }

        model = lgb.train(params=lgb_params, train_set=lgb_train_data, valid_sets=[lgb_validation_data])
        # predict() returns probabilities; round() thresholds them at 0.5
        training_acc.append(accuracy_score(y_train, model.predict(X_train).round()))
        validation_acc.append(accuracy_score(y_validation, model.predict(X_validation).round()))

    # Keep the num_leaves value with the highest accuracy on the held-out validation split
    best_leaves = num_leaves[int(np.argmax(validation_acc))]

    # Retrain with the selected parameters
    lgb_params = {
        'objective': 'binary',
        'metric': 'binary_logloss',
        'num_leaves': best_leaves,
        'num_trees': num_trees,
    }

    model = lgb.train(params=lgb_params, train_set=lgb_train_data, valid_sets=[lgb_validation_data])
    accuracy_train = accuracy_score(y_train, model.predict(X_train).round())
    print("Training accuracy: {}".format(accuracy_train))

    accuracy_test = accuracy_score(y_test, model.predict(X_test).round())
    print("Testing accuracy: {}".format(accuracy_test))
    plot_graph(num_leaves, training_acc, validation_acc, 'Num Leaves', 'Accuracy', 'Number of Leaves vs Accuracy')

In [None]:
X_train, y_train, X_test, y_test = preprocess_tfidf()

In [None]:
logistic_regression(X_train, y_train, X_test, y_test)

In [None]:
decision_tree(X_train, y_train, X_test, y_test)

In [None]:
gbm_model(X_train, y_train, X_test, y_test)