# Reddit Comments Classification Competition

## Authors
* Artem Ploujnikov
* Jean-Pierre Thach

## Overview
This notebook contains the final solution: a weighted ensemble of the following machine learning models:

* A Multilayer Perceptron / Fully Connected Neural Network trained on features extracted with Google's pre-trained Universal Sentence Encoder model
* Multinomial Naive Bayes (trained on TFIDF-weighted Bag of Words features combined with additional engineered features)
* A Linear Support Vector Machine

## Running Instructions
This notebook was extracted from a Kaggle kernel. To reproduce the results, follow the steps described below:

- Navigate to the competition in Kaggle.
- Create a new Python Notebook kernel in the competition. This will automatically attach the Reddit dataset.
- Attach the Swear Words dataset to the kernel: https://www.kaggle.com/highflyingbird/swear-words
- Turn on the Internet. This is required to download the pretrained Universal Sentence Encoder model and install the `textstat` package. The notebook does not rely on any external data.
- Turn on the GPU. Without GPU acceleration, training the Multilayer Perceptron model and extracting feature vectors from the Universal Sentence Encoder will be significantly slower.
- Click on Commit to run the notebook top-to-bottom, or run it interactively, and download `predictions.csv`.

Running this notebook outside of Kaggle is possible but requires adjustments: file paths would need to be changed, and dependencies would need to be installed. Running it on configurations that are significantly different from that of the Kaggle kernel (e.g. older versions of Python) might not be possible.

## Not Included
This notebook only includes the models that were used to produce the final submission. The notebook does not include:

* The Exploratory Data Analysis
* Any models and techniques that were not selected for the final submission (but may appear in the report)

## Dependencies
Running this notebook requires the following libraries:
* `numpy`
* `pandas`
* `scikit-learn`
* `tensorflow` (version 2, run in compatibility mode with version 1)
* `nltk`
* `textstat`

All libraries except `textstat` are preinstalled on Kaggle.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 
!pip install textstat

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
import re
import csv
from tqdm.auto import tqdm
import tensorflow.keras.layers as L
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras import Model, Sequential
from tensorflow.keras import regularizers
from tensorflow.keras.backend import one_hot, clear_session
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer as KerasTokenizer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.ensemble import VotingClassifier
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline, FeatureUnion
import textstat
import matplotlib.pyplot as plt
import math

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords

STOPWORDS = list(set(stopwords.words('english')))

The `submit_predictions` flag should be set to `True` when generating final submissions. It can be set to `False` during experimentation.

In [None]:
submit_predictions = True

**Note**: Kaggle comes with TensorFlow 2 preinstalled by default. At the time of writing, many TensorFlow Hub models, including the Universal Sentence Encoder, were not TensorFlow 2-compatible, so the notebook runs TensorFlow in version 1 compatibility mode.

In [None]:
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior() # Needed for TensorFlow Hub

### Dataset Loading

In [None]:
datasets = ['train', 'test']
data_train = np.load("/kaggle/input/ift3395-ift6390-reddit-comments/data_train.pkl", allow_pickle=True)
data_test = np.load("/kaggle/input/ift3395-ift6390-reddit-comments/data_test.pkl", allow_pickle=True)

In [None]:
def to_dataframe(data):
    if len(data) == 2:
        comment, label = data
        result = pd.DataFrame({'comment': comment, 'label': label})
    else:
        result = pd.DataFrame({'comment': data})
    return result

In [None]:
def read_word_list(file_name):
    """
    This function reads a list of words and returns it as a set.

    Parameter:
        file_name: path to a text file with one word per line

    Returns a set of the words
    """
    with open(file_name) as word_list_file:
        return set(word.strip() for word in word_list_file)

## Feature Engineering

Calling `to_dataframe` produces a Pandas DataFrame with the following columns:
* `comment`: The raw comment text
* `label`: The string label

### Basic Features

Running `enhance_tokenization` adds the following features to the dataframe:
* `words`: the comment converted to a list of words using `word_tokenize` from `NLTK`
* `sentences`: the comment converted to a list of sentences using `sent_tokenize` from `NLTK`
* `length`: the length, in characters
* `word_count`: the number of words in the comment
* `sentence_count`: the number of sentences in the comment
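
As a rough illustration of the features listed above, here is a minimal sketch for a single comment. The regular expressions are simplified stand-ins (an assumption) for NLTK's `word_tokenize` and `sent_tokenize`, so the token boundaries may differ from what the notebook produces; `basic_features` is a hypothetical helper name.

```python
import re

def basic_features(comment):
    """Compute the basic length and count features for one comment."""
    # Stand-in tokenizers (assumption): the notebook uses NLTK's word_tokenize
    # and sent_tokenize, which handle many more edge cases than these regexes.
    words = re.findall(r"\w+|[^\w\s]", comment)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", comment.strip()) if s]
    return {
        "words": words,
        "sentences": sentences,
        "length": len(comment),            # length in characters
        "word_count": len(words),          # punctuation marks count as tokens
        "sentence_count": len(sentences),
    }

features = basic_features("Nice save! Who was the goalie?")
print(features["length"], features["word_count"], features["sentence_count"])
```

Note that punctuation marks are counted as tokens here, which is also how `word_tokenize` generally behaves.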

### Swear Word Analysis
Running `enhance_bad_words` adds the following features:
* `has_bad_words`: a Boolean value indicating whether the comment contains any profanity
* `bad_word_count`: the number of distinct swear words in the comment (a swear word that is repeated is counted once)
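
The counting rule can be sketched in a few lines. The word list and tokens below are hypothetical; the notebook reads its list from `badwords.txt`. Because the tokens are collected into a set before the intersection, a repeated word contributes only once to the count.

```python
def count_bad_words(words, bad_words):
    """Count the distinct tokens that appear in the bad-word list."""
    # Lower-case and de-duplicate the tokens, then intersect with the list.
    return len({word.lower() for word in words} & bad_words)

# Hypothetical word list and tokenized comment, for illustration only.
bad_words = {"darn", "heck"}
tokens = ["Darn", "it", ",", "darn", "that", "heck", "of", "a", "game"]

bad_word_count = count_bad_words(tokens, bad_words)
has_bad_words = bad_word_count > 0
print(bad_word_count, has_bad_words)
```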

### Readability Analysis
Running `enhance_readability` adds the following features:
* `flesch_reading_ease`: the Flesch reading ease metric
* `difficult_words`: the number of difficult words

The Exploratory Data Analysis included additional metrics. Only metrics selected for machine learning are being used here.
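
For reference, the standard Flesch reading ease formula can be written out as below. This is a sketch of the published formula only: `textstat` derives the word, sentence and syllable counts from the text itself, so its values may differ from a hand computation. The counts passed in are hypothetical.

```python
def flesch_reading_ease(total_words, total_sentences, total_syllables):
    """Standard Flesch reading ease formula; higher scores mean easier text."""
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

# Hypothetical counts: 100 words, 5 sentences, 150 syllables.
score = flesch_reading_ease(total_words=100, total_sentences=5, total_syllables=150)
print(round(score, 3))
```

The score is unbounded below, so very long sentences or long words can yield negative values.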

In [None]:
def enhance_tokenization(df):
    """
    This function enhances the dataframe with the length of comments,
    the tokenized words, the tokenized sentences, the word count and
    the sentence count.

    Parameter:
        df:  dataframe

    Returns the enhanced dataframe
    """
    dataframe = df
    dataframe['length'] = df.comment.str.len()
    # Pass the progress-bar label as `desc`; tqdm.pandas expects keyword arguments
    tqdm.pandas(desc='Tokenizing Words')
    dataframe['words'] = df.comment.progress_apply(word_tokenize)
    tqdm.pandas(desc='Tokenizing Sentences')
    dataframe['sentences'] = df.comment.progress_apply(sent_tokenize)
    dataframe['word_count'] = df.words.apply(len)
    dataframe['sentence_count'] = df.sentences.apply(len)
    return dataframe

BAD_WORD_LIST = '/kaggle/input/swear-words/badwords.txt'
def enhance_bad_words(df):
    """
    This function enhances the dataframe with the count of
    bad words and a boolean indicating the appearance of bad word.

    Parameter:
        df: dataframe

    Returns the enhanced dataframe
    """
    dataframe = df
    bad_words = read_word_list(BAD_WORD_LIST)
    dataframe['bad_word_count'] = df.words.apply(lambda words: len(set(word.lower() for word in words) & bad_words))
    dataframe['has_bad_words'] = df.bad_word_count > 0
    return dataframe

readability_stats = [
    ('flesch_reading_ease', 'Flesch Reading Ease'),
    ('difficult_words', 'Difficult Words')]

def enhance_readability(df):
    """
    This function enhances the dataframe with the readability metrics
    listed in `readability_stats` (the Flesch reading ease and the
    number of difficult words), computed with textstat.

    Parameter:
        df: dataframe

    Returns the enhanced dataframe
    """
    dataframe = df
    for item in readability_stats:
        key, label = item
        #tqdm.pandas(desc=label)
        stat = getattr(textstat, key)
        dataframe[key] = df.comment.apply(stat)
    return dataframe

## Training/Validation Split

The split procedure below yields the following dataframes:

* `train_df`: the training set (90% of the training data provided)
* `val_df`: the validation set (10% of the training data provided)
* `test_df`: the true test set (used for submissions only)
* `train_val_df`: the entire training data set (used for final submissions only)
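
The 90/10 hold-out split can be sketched in plain Python as follows. This is an illustration, not the notebook's code: the notebook calls `train_test_split` without a fixed seed, whereas the `seed` parameter and the `holdout_split` helper here are assumptions added for reproducibility.

```python
import math
import random

def holdout_split(rows, val_fraction=0.1, seed=0):
    """Shuffle row indices and hold out a fraction of them for validation."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    val_size = math.ceil(len(rows) * val_fraction)
    val_idx, train_idx = indices[:val_size], indices[val_size:]
    return [rows[i] for i in train_idx], [rows[i] for i in val_idx]

# Hypothetical stand-in for the rows of train_val_df.
rows = [f"comment {i}" for i in range(20)]
train_rows, val_rows = holdout_split(rows, val_fraction=0.1)
print(len(train_rows), len(val_rows))
```

Since the validation rows are drawn from the full training data, any model fitted on the full training data has already seen them.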

In [None]:
VAL_FRACTION = 0.1
train_val_df, test_df = (to_dataframe(data) 
                     for data in [data_train, data_test])

train_val_df = enhance_tokenization(train_val_df)
train_val_df = enhance_bad_words(train_val_df)
train_val_df = enhance_readability(train_val_df)

test_df = enhance_tokenization(test_df)
test_df = enhance_bad_words(test_df)
test_df = enhance_readability(test_df)

train_df, val_df = train_test_split(train_val_df, test_size=VAL_FRACTION)

## Utilities
* `NumberSelector`: A transformer that selects a numeric feature out of a dataframe (used in sklearn pipelining)
* `TextSelector`: A transformer that selects a textual feature out of a dataframe (used in sklearn pipelining)
* `vectorize_labels`: Converts a list of labels (or any other collection) to an Nx1 `numpy` array
* `to_dense`: Converts sparse matrices to `numpy` arrays if necessary

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

class NumberSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[[self.key]]

class TextSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]

In [None]:
def vectorize_labels(labels):
    return np.array(labels)[:, np.newaxis]

def to_dense(*args):
    return [item.todense() if hasattr(item, 'todense') else item for item in args]

## The Universal Sentence Encoder Model

This model extracts feature vectors from Google's pre-trained Universal Sentence Encoder and trains a simple fully-connected neural network with 2 hidden layers on them:

* Inputs: Same as Universal Sentence Encoder outputs
* Layers:
  * Fully Connected (Dense), 1024 units, ReLU activation
  * Fully Connected (Dense), 1024 units, ReLU activation
  * Dropout, rate 0.1
  * Fully Connected (Dense), 20 units, Softmax activation
* Training
  * Optimizer: Adam
  * Loss: Categorical Cross-Entropy
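
To make the output layer and the loss concrete, here is a small pure-Python sketch of softmax and categorical cross-entropy for a single example. It is illustrative only: Keras computes both in batched, numerically clipped form, and the three-class logits below are hypothetical.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(one_hot, probs):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)

# Hypothetical logits for a 3-class problem; the true class is the first one.
probs = softmax([2.0, 1.0, 0.0])
loss = categorical_cross_entropy([1, 0, 0], probs)
print([round(p, 3) for p in probs], round(loss, 3))
```

The loss shrinks toward zero as the probability assigned to the true class approaches 1.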

In [None]:
class USEModel(BaseEstimator, ClassifierMixin):
    def __init__(self, classes, use=None):
        if not use:
            use = hub.Module('https://tfhub.dev/google/universal-sentence-encoder/2')
        self.classes = classes
        self.class_count = len(classes)
        self.use = use
        self.one_hot_encoder = OneHotEncoder()
        self.one_hot_encoder.fit(vectorize_labels(classes))
        
    def preprocess(self, data):
        X, y = data
        X_emb = sess.run(self.use(X.tolist()))
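        # vectorize_labels builds the Nx1 array; indexing a pandas Series with
        # [:, np.newaxis] is not supported by recent pandas versions
        y_onehot = self.one_hot_encoder.transform(vectorize_labels(y)).todense()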
        return X_emb, y_onehot
    
    def fit(self, x, y):
        self.train_data = self.preprocess((x.comment, y))
        self.idxmap = np.squeeze(
            np.array([np.where(self.one_hot_encoder.categories_[0] == class_label)
                      for class_label in self.classes]))        
        self.build_model(self.train_data[0].shape[1])
        self.train()
        
    def build_model(self, input_shape):
        inputs = layer = L.Input(shape=input_shape, dtype=float)
        layer = L.Dense(1024, activation='relu')(layer)
        layer = L.Dense(1024, activation='relu')(layer)
        layer = L.Dropout(.1)(layer)
        # Connect the output layer to the hidden stack (not directly to the inputs)
        layer = L.Dense(self.class_count, activation='softmax')(layer)
        self.model = Model(inputs=inputs, outputs=layer)
        self.model.compile(
            loss='categorical_crossentropy',
            optimizer='adam',
            metrics=['acc'])        

    def train(self, epochs=40):
        print("Training")
        X_train, y_train = self.train_data
        history = self.model.fit(X_train, y_train, validation_data=self.preprocess((val_df.comment, val_df.label)), epochs=epochs)
        print(history.history.keys())
        plt.plot(history.history['loss'])
        plt.plot(history.history['val_loss'])
        plt.title('model loss')
        plt.ylabel('loss')
        plt.xlabel('epoch')
        plt.legend(['train', 'validation'], loc='upper left')
        plt.show()

    def predict_proba(self, X):
        X_emb = sess.run(self.use(X.comment.tolist()))
        probs = self.model.predict(X_emb)
        return probs[:, self.idxmap]

    def predict(self, x):
        probs = self.predict_proba(x)
        idx = np.argmax(probs, axis=1)
        return self.classes[idx]    

## Base Class for Simple Linear Models

This is not a complete model but a reusable base class. It builds a feature pipeline that combines TFIDF-weighted Bag of Words features with the hand-engineered length, swear word and readability features.
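
The idea of combining the two kinds of features can be sketched without `sklearn`: each numeric column is min-max scaled to the range [0, 1] and appended to the TF-IDF row of the same comment. The values below are hypothetical, and `min_max_scale` is an illustrative helper rather than the notebook's `MinMaxScaler`.

```python
def min_max_scale(values):
    """Rescale a numeric column to [0, 1]; a constant column maps to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical TF-IDF rows and hand-engineered columns for two comments.
tfidf_rows = [[0.0, 0.7, 0.7], [1.0, 0.0, 0.0]]
lengths = [120, 40]
reading_ease = [-10.5, 80.0]  # Flesch scores can be negative

combined = [
    row + [length, ease]
    for row, length, ease in zip(
        tfidf_rows, min_max_scale(lengths), min_max_scale(reading_ease))
]
print(combined)
```

Scaling to a non-negative range matters for Multinomial Naive Bayes, which requires non-negative feature values.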

In [None]:
class SimpleBaseModel(BaseEstimator, ClassifierMixin):
    def __init__(self, classes):
        self.classes = np.array(classes)
        self.tfidf = TfidfVectorizer(stop_words=STOPWORDS, strip_accents='unicode')
        
    def fit(self, x, y):

        p_comment = Pipeline([
                ('selector', TextSelector(key='comment')),
                ('tfidf', TfidfVectorizer(stop_words=STOPWORDS, strip_accents='unicode'))])
        p_length =  Pipeline([
                ('selector', NumberSelector(key='length')),
                ('standard', MinMaxScaler())])
        p_word_count =  Pipeline([
                ('selector', NumberSelector(key='word_count')),
                ('standard', MinMaxScaler())])
        p_sentence_count =  Pipeline([
                ('selector', NumberSelector(key='sentence_count')),
                ('standard', MinMaxScaler())])
        p_flesch_reading_ease =  Pipeline([
                ('selector', NumberSelector(key='flesch_reading_ease')),
                ('standard', MinMaxScaler())])
        p_difficult_words =  Pipeline([
                ('selector', NumberSelector(key='difficult_words')),
                ('standard', MinMaxScaler())])
        p_bad_word_count =  Pipeline([
                ('selector', NumberSelector(key='bad_word_count')),
                ('standard', MinMaxScaler())])
        
        print("Transforming comments")
        p_comment.fit_transform(x)
        print("Transforming length")
        p_length.fit_transform(x)
        print("Transforming word counts")
        p_word_count.fit_transform(x)
        print("Transforming sentence counts")
        p_sentence_count.fit_transform(x)
        print("Transforming reading ease")
        p_flesch_reading_ease.fit_transform(x)
        print("Transforming difficult words")
        p_difficult_words.fit_transform(x)
        print("Transforming bad words")
        p_bad_word_count.fit_transform(x)

        all_features = FeatureUnion([('comment', p_comment),
                                    ('length', p_length),
                                    ('word_count', p_word_count),
                                    ('sentence_count', p_sentence_count),
                                    ('flesch_reading_ease', p_flesch_reading_ease),
                                    ('difficult_words', p_difficult_words),
                                    ('bad_word_count', p_bad_word_count),
                                    ])
        self.p_features = Pipeline([('all_features', all_features)])
        self.p_features.fit_transform(x)
       
        self.classifier = Pipeline([
            ('all_features', self.p_features),
            ('classifier', self.get_classifier())
        ])
        self.classifier.fit(x, y)
        self.idxmap = np.squeeze(
            np.array([np.where(self.classifier.classes_ == class_label)
                      for class_label in self.classes]))

    def predict_proba(self, x):
        probabilities = self.classifier.predict_proba(x)
        return probabilities[:, self.idxmap]
    
    def predict(self, x):
        probs = self.predict_proba(x)
        idx = np.argmax(probs, axis=1)
        return self.classes[idx]

    def get_classifier(self):
        # Subclasses must override this method and return an sklearn classifier
        raise NotImplementedError

## TF-IDF Naive Bayes Model

A simple Multinomial Naive Bayes baseline model using the features available in the `SimpleBaseModel` class above.
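
For intuition about the `alpha` smoothing parameter, here is a compact sketch of Multinomial Naive Bayes with additive smoothing. It works on raw word counts from a hypothetical toy corpus, whereas the notebook feeds TFIDF-weighted and scaled features to `MultinomialNB`; the helper names are assumptions.

```python
import math
from collections import Counter

def fit_multinomial_nb(documents, labels, alpha):
    """Estimate log priors and smoothed log likelihoods per class."""
    vocabulary = sorted({word for doc in documents for word in doc})
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(labels)):
        class_docs = [doc for doc, label in zip(documents, labels) if label == c]
        log_prior[c] = math.log(len(class_docs) / len(documents))
        counts = Counter(word for doc in class_docs for word in doc)
        # Additive smoothing: alpha is added to every word count
        total = sum(counts.values()) + alpha * len(vocabulary)
        log_likelihood[c] = {
            word: math.log((counts[word] + alpha) / total) for word in vocabulary}
    return log_prior, log_likelihood

def predict_nb(model, document):
    """Return the class with the highest posterior log score."""
    log_prior, log_likelihood = model
    scores = {
        c: log_prior[c] + sum(
            log_likelihood[c][w] for w in document if w in log_likelihood[c])
        for c in log_prior}
    return max(scores, key=scores.get)

# Hypothetical toy corpus with two classes.
docs = [["goal", "team"], ["goal", "match"], ["gpu", "driver"], ["gpu", "code"]]
labels = ["soccer", "soccer", "tech", "tech"]
model = fit_multinomial_nb(docs, labels, alpha=0.2783577600782345)
prediction = predict_nb(model, ["goal", "gpu", "goal"])
print(prediction)
```

Without smoothing (`alpha=0`), a word never seen in a class would have zero probability and its logarithm would be undefined.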

In [None]:
class TfidfNaiveBayesModel(SimpleBaseModel):
    def __init__(self, classes, alpha=0.2783577600782345):
        super().__init__(classes)
        self.alpha = alpha
        
    def get_classifier(self):
        return MultinomialNB(alpha=self.alpha)

## Linear Model

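A simple linear model based on the `sklearn` `SGDClassifier`, using the features available in the `SimpleBaseModel` class above. The underlying classifier implements both logistic regression and linear support vector machines, depending on the chosen loss. This implementation uses a linear support vector machine trained with the `modified_huber` loss, a smoothed variant of the hinge loss that also allows `SGDClassifier` to provide `predict_proba`, which soft voting requires.

Given below are the hyperparameter values that were used for this model:
* `loss='modified_huber'`
* `penalty='l2'`
* `alpha=1e-3`
* `random_state=42`
* `max_iter=400`
* `tol=None`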
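
The ensemble set-up further down passes `loss='modified_huber'` to this model. As a sketch, the per-example modified Huber loss for a binary label `y` in {-1, +1} and a raw decision score can be written as follows; the function name is hypothetical, and `SGDClassifier` applies the loss one-vs-rest for multiclass problems.

```python
def modified_huber_loss(y, score):
    """Modified Huber loss for a label y in {-1, +1} and a raw decision score."""
    margin = y * score
    if margin >= -1:
        # Squared hinge region: zero loss once the margin reaches 1
        return max(0.0, 1.0 - margin) ** 2
    # Linear region: badly misclassified points are penalized linearly
    return -4.0 * margin

for score in (2.0, 0.5, -1.0, -2.0):
    print(score, modified_huber_loss(1, score))
```

The linear tail makes the loss more tolerant of outliers than a purely squared hinge loss.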

In [None]:
class LinearModel(SimpleBaseModel):
    def __init__(self, classes, **hparams):
        super().__init__(classes)        
        self.hparams = hparams
    
    def get_classifier(self):
        return SGDClassifier(**self.hparams)

## Soft Voting Ensembling Meta-Classifier

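This meta-classifier combines the results of multiple classifiers via soft voting, optionally weighting each classifier according to its performance on a random sample of the training set.

$$c_{pred} = \arg\max_c \frac{1}{n_c} \sum_{i=1}^{n_c} \alpha_i P_i(y=c)$$

Here $n_c$ is the number of classifiers, $P_i(y=c)$ is the probability that classifier $i$ assigns to class $c$, and $\alpha_i$ is the weight of classifier $i$.

When `weighted==True`, with $\epsilon_i$ denoting the error rate of classifier $i$ on the sample:
$$
\alpha_i = \frac{1}{2}\ln \frac{1 - \epsilon_i} {\epsilon_i}
$$

When `weighted==False`:

$$
\forall i, \alpha_i = 1
$$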

Limitations:
* `predict_proba` is not normalized when classifiers are weighted.
* The weighting procedure is not suitable for classifiers achieving an error rate exceeding 50%.
  * The formula was adapted from AdaBoost. The limitation can be addressed by revising it for multiclass classification; however, this limitation does not affect this particular classification problem given the observed error rates.
* Adding classifiers that overfit significantly will hurt performance when weighting is used, because the weights are computed from the training-set error.
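
A small worked example may help with the weighting scheme and its limitations. The error rates and class probabilities below are hypothetical, and the helper names are illustrative rather than the notebook's API.

```python
import math

def classifier_weight(error_rate):
    """AdaBoost-style weight; negative once the error rate exceeds 50%."""
    return 0.5 * math.log((1 - error_rate) / error_rate)

def soft_vote(probabilities, alphas):
    """Average the weighted class probabilities and pick the best class."""
    n_classifiers = len(probabilities)
    n_classes = len(probabilities[0])
    scores = [
        sum(a * p[c] for a, p in zip(alphas, probabilities)) / n_classifiers
        for c in range(n_classes)]
    return scores.index(max(scores)), scores

# Hypothetical error rates and class probabilities for three classifiers.
error_rates = [0.2, 0.4, 0.45]
alphas = [classifier_weight(e) for e in error_rates]
probabilities = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.3, 0.4, 0.3]]

predicted, scores = soft_vote(probabilities, alphas)
unweighted, _ = soft_vote(probabilities, [1.0, 1.0, 1.0])
print(predicted, unweighted, round(sum(scores), 3))
```

In this example the weighted vote follows the most accurate classifier and picks class 0, while the unweighted vote picks class 1. The weighted scores do not sum to 1, which illustrates the normalization limitation noted above.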

In [None]:
class SimpleVotingClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, estimators, classes, weighted=True, sample=.25):
        self.estimators = estimators
        self.classes = classes
        self.weighted = weighted
        self.sample = sample
        
    def fit(self, x, y):
        for estimator in self.estimators:
            estimator.fit(x, y)
        self.alphas = self.get_alpha_vector(x, y)[:, np.newaxis, np.newaxis]            

    def get_sample(self, x, y):
        data_size = len(x)
        sample_size = math.ceil(data_size * self.sample)
        idx = np.random.choice(np.arange(data_size), sample_size, replace=False)
        return self.subset(x, idx), self.subset(y, idx)
    
    def subset(self, data, idx):
        return data.iloc[idx] if hasattr(data, 'iloc') else data[idx]
                    
    def get_alpha(self, estimator, x, y):
        sample_x, sample_y = self.get_sample(x, y)
        predictions = estimator.predict(sample_x)
        error_rate = np.sum(sample_y != predictions) / len(sample_y)
        return .5 * np.log((1 - error_rate) / error_rate)
    
    def get_alpha_vector(self, x, y):
        if self.weighted:
            alpha = np.array([self.get_alpha(estimator, x, y)
                         for estimator in self.estimators])
        else:
            alpha = np.ones(len(self.estimators))
        return alpha
            
    def predict_proba(self, x):
        probs = np.array(
            [estimator.predict_proba(x) for estimator in self.estimators])
        probs = np.sum(self.alphas * probs, axis=0) / len(self.estimators)
        return probs
    
    def predict(self, x):
        probs = self.predict_proba(x)
        idx = np.argmax(probs, axis=1)
        return self.classes[idx]        

### Universal Sentence Encoder Initialization

In [None]:
sess = tf.InteractiveSession()
use = hub.Module('https://tfhub.dev/google/universal-sentence-encoder/2')
sess.run(tf.global_variables_initializer())
sess.run(tf.tables_initializer())

### Ensemble Model Set-Up
* Set up a weighted classifier with the following models:
  * MLP with Universal Sentence Encoder features
  * Naive Bayes with TFIDF + readability + swear words + length
  * Linear SVM with TFIDF + readability + swear words + length

In [None]:
classes = train_val_df.label.unique()
use_model = USEModel(classes, use)
naive_model = TfidfNaiveBayesModel(classes)
linear_model = LinearModel(classes, loss='modified_huber', penalty='l2',alpha=1e-3, random_state=42, max_iter=400, tol=None, verbose=1)
classifiers = [use_model, linear_model, naive_model]
voting = SimpleVotingClassifier(estimators=classifiers, classes=classes)

### Experimentation Mode

Train the model on the training set and display the validation accuracy. This mode was used while developing the models and evaluating different candidates.

When `submit_predictions` is set to `False`, the test data set is not used at all, and no submission is generated.

In [None]:
if not submit_predictions:
    voting.fit(train_df, train_df.label)

In [None]:
if not submit_predictions:
    predictions = voting.predict(val_df)
    accuracy = np.sum(predictions == val_df.label) / len(val_df)
    print(f"Accuracy: {accuracy}")

### Submission Mode
* Train the model on the entire training set (`train_val_df`)
* Compute a "validation" set accuracy as a sanity check. This is not a true validation accuracy because the validation set has been used in training; rather, it is a training accuracy computed on a sample. It is useful only for detecting potential problems with the implementation.
* Output `predictions.csv`

In [None]:
if submit_predictions:
    voting.fit(train_val_df, train_val_df.label)
    sanity_check_predictions = voting.predict(val_df)
    accuracy = np.sum(sanity_check_predictions == val_df.label) / len(val_df)
    print(f"Sanity check accuracy: {accuracy} (not a true validation accuracy)")
    print(train_val_df.shape)
    print(test_df.shape)
    predictions = voting.predict(test_df)
    with open("predictions.csv", 'w', newline='') as f:
        wr = csv.writer(f)
        wr.writerow(["Id", "Category"])
        for i, prediction in enumerate(predictions):
            wr.writerow((i,prediction))  
            print('{0},{1}'.format(i,prediction)) 

### Submission Download
The `FileLink` widget makes it possible to download the submission file when the notebook is run interactively.

In [22]:
from IPython.display import FileLink
FileLink('predictions.csv')