# Detecting Sarcasm in Reddit Comments – BERT

**Team 4:** Nanda H Krishna, Rubini U and Vikram Reddy

**Checklist:**
1. [x] EDA and Pre-processing
2. [x] TF-IDF (Random Forest, Gradient Boosting, Gaussian Naïve Bayes, Multi-Layer Perceptron, Neural Network)
    - [x] TF-IDF on Pre-processed Text
    - [x] TF-IDF on Raw Text
    - [x] Effect of using 2-grams
    - [x] Effect of using PCA
    - [x] Ensembling models
    - [x] Model Interpretability
3. [x] BERT Embeddings

## Importing Modules

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
import numpy as np
import pandas as pd
import sentence_transformers
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import shuffle

In [None]:
random_state = 42

## Loading Dataset

First, we remove all rows containing NaNs. We then restrict ourselves to a random sample of 125,000 instances because of limited compute power.

In [None]:
df = pd.read_csv('sarcasm/dataset.csv')

In [None]:
df['label'].value_counts()

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.dropna(inplace=True)
df = df.sample(n=125000, random_state=random_state)
df.reset_index(inplace=True, drop=True)

In [None]:
df.shape

## Splitting Data

In [None]:
# Seed the shuffle so that the train/test split below is reproducible
df = shuffle(df, random_state=random_state).reset_index(drop=True)

In [None]:
author_le = LabelEncoder()
df['author'] = author_le.fit_transform(df['author'])
sub_le = LabelEncoder()
df['subreddit'] = sub_le.fit_transform(df['subreddit'])

In [None]:
split = int(df.shape[0] * 0.8)
df_train = df.iloc[:split, :].reset_index(drop=True)
df_test = df.iloc[split:, :].reset_index(drop=True)
del df

In [None]:
print(df_train.shape, df_test.shape)

In [None]:
df_train.head()

In [None]:
df_test.head()
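
The manual slice above keeps the first 80% of the shuffled rows for training. As a minimal sketch of an alternative, assuming a stratified split is acceptable for this task, scikit-learn's `train_test_split` (already imported above) can shuffle, split and preserve the label ratio in one call. The toy DataFrame and the `toy_*` names are made up purely for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real dataset (hypothetical values, illustration only).
toy_df = pd.DataFrame({
    'comment': [f'comment {i}' for i in range(10)],
    'label': [0, 1] * 5,
})

# 80/20 split; stratify keeps the 0/1 label ratio the same in both parts.
toy_train, toy_test = train_test_split(
    toy_df,
    test_size=0.2,
    random_state=42,
    stratify=toy_df['label'],
)
toy_train = toy_train.reset_index(drop=True)
toy_test = toy_test.reset_index(drop=True)

print(toy_train.shape, toy_test.shape)
```

Stratifying matters most when the classes are imbalanced; with a roughly balanced label column a plain shuffled split behaves similarly.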

## BERT Sentence Embeddings

We train a Random Forest classifier on BERT sentence embeddings of the comments.

In [None]:
bert_model = sentence_transformers.SentenceTransformer('bert-base-nli-mean-tokens')

In [None]:
# Pass a plain list of strings to encode() rather than a pandas Series
train_comment_embeddings = bert_model.encode(df_train['comment'].tolist())

In [None]:
train_comment_embeddings[0].shape
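
The model name `bert-base-nli-mean-tokens` refers to mean pooling: the sentence embedding is the average of BERT's token embeddings, with padding positions ignored. The sketch below illustrates that pooling step with NumPy on made-up token vectors. The function name `mean_pool` and the toy shapes are assumptions for illustration, not the library's internals; the real sentence vectors from this model have 768 dimensions.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, skipping padded positions."""
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# Two toy "sentences", 3 token slots each, 4-dimensional token vectors.
tokens = np.array([
    [[1.0, 2.0, 3.0, 4.0], [3.0, 2.0, 1.0, 0.0], [9.0, 9.0, 9.0, 9.0]],
    [[2.0, 2.0, 2.0, 2.0], [4.0, 4.0, 4.0, 4.0], [6.0, 6.0, 6.0, 6.0]],
])
# The last slot of the first sentence is padding and must be ignored.
mask = np.array([[1, 1, 0], [1, 1, 1]])

sentence_vectors = mean_pool(tokens, mask)
print(sentence_vectors)
```

Because every sentence is reduced to one fixed-length vector, the result can be fed directly to a classical model such as a Random Forest.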

### Using only comment embeddings

In [None]:
# Fix the seed so that repeated runs build the same forest
model = RandomForestClassifier(n_estimators=50, random_state=random_state, verbose=1, n_jobs=-1)

In [None]:
model.fit(train_comment_embeddings, np.array(df_train['label']))

In [None]:
# Pass a plain list of strings to encode() rather than a pandas Series
test_comment_embeddings = bert_model.encode(df_test['comment'].tolist())

In [None]:
pred = model.predict(test_comment_embeddings)

In [None]:
print(classification_report(np.array(df_test['label']), pred))

### Adding the subreddit and author as features

In [None]:
# Append the encoded subreddit to each embedding, building a new list of vectors
train_comment_embeddings = [
    np.append(emb, sub) for emb, sub in zip(train_comment_embeddings, df_train['subreddit'])
]

In [None]:
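# Retrain on comment embeddings + subreddit, with a fixed seed for reproducibility
model = RandomForestClassifier(n_estimators=50, random_state=random_state, verbose=1, n_jobs=-1)
model.fit(train_comment_embeddings, np.array(df_train['label']))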

In [None]:
# Apply the same subreddit feature to the test embeddings
test_comment_embeddings = [
    np.append(emb, sub) for emb, sub in zip(test_comment_embeddings, df_test['subreddit'])
]

In [None]:
pred = model.predict(test_comment_embeddings)

In [None]:
print(classification_report(np.array(df_test['label']), pred))

In [None]:
# Append the encoded author to each training vector
train_comment_embeddings = [
    np.append(emb, author) for emb, author in zip(train_comment_embeddings, df_train['author'])
]

In [None]:
# Retrain on comment embeddings + subreddit + author, with a fixed seed
model = RandomForestClassifier(n_estimators=50, random_state=random_state, verbose=1, n_jobs=-1)
model.fit(train_comment_embeddings, np.array(df_train['label']))

In [None]:
# Append the encoded author to each test vector
test_comment_embeddings = [
    np.append(emb, author) for emb, author in zip(test_comment_embeddings, df_test['author'])
]

In [None]:
pred = model.predict(test_comment_embeddings)

In [None]:
print(classification_report(np.array(df_test['label']), pred))
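
The cells above append one value per row to each embedding. As a sketch of the same idea in vectorised form, assuming the embeddings are held in a 2-D NumPy array, `np.column_stack` adds the encoded `subreddit` and `author` columns in one step. All values below are toy data, and the variable names are hypothetical.

```python
import numpy as np

# Toy embeddings: 3 comments, 4 dimensions each (real ones are much wider).
embeddings = np.arange(12, dtype=np.float32).reshape(3, 4)
# Integer codes of the kind LabelEncoder produces (hypothetical values).
subreddit_codes = np.array([7, 2, 7])
author_codes = np.array([101, 55, 3])

# Each 1-D array becomes one extra column on the right.
features = np.column_stack([embeddings, subreddit_codes, author_codes])
print(features.shape)
```

Note that the integer codes carry no meaningful order, so the forest can only use them through threshold splits on arbitrary IDs.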

### Adding parent comment embeddings as features

In [None]:
# Pass a plain list of strings to encode() rather than a pandas Series
train_parent_embeddings = bert_model.encode(df_train['parent_comment'].tolist())

In [None]:
# Concatenate each comment vector with its parent comment's embedding
train_comment_embeddings = [
    np.append(emb, parent) for emb, parent in zip(train_comment_embeddings, train_parent_embeddings)
]

In [None]:
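# Retrain with the parent comment embeddings included, with a fixed seed
model = RandomForestClassifier(n_estimators=50, random_state=random_state, verbose=1, n_jobs=-1)
model.fit(train_comment_embeddings, np.array(df_train['label']))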

In [None]:
# Pass a plain list of strings to encode() rather than a pandas Series
test_parent_embeddings = bert_model.encode(df_test['parent_comment'].tolist())

In [None]:
# Concatenate each test comment vector with its parent comment's embedding
test_comment_embeddings = [
    np.append(emb, parent) for emb, parent in zip(test_comment_embeddings, test_parent_embeddings)
]

In [None]:
pred = model.predict(test_comment_embeddings)

In [None]:
print(classification_report(np.array(df_test['label']), pred))

With this amount of data, we achieve an F1-score of about 0.64 on the held-out test set.
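
For reference, the F1-score reported by `classification_report` is the harmonic mean of precision and recall. The sketch below works it out by hand for the positive class on made-up labels, assuming 1 marks a sarcastic comment. The arrays are illustrative only and unrelated to the results above.

```python
import numpy as np

# Toy ground truth and predictions (assumed: 1 = sarcastic, 0 = not sarcastic).
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```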