# Flair Prediction for r/AmItheAsshole

This notebook applies machine learning classification models to predict the flairs of posts in r/AmItheAsshole (r/AITA) from their CountVectorized text content.

Session setup is done below:

In [2]:
# Setup - Run only once per Kernel App
%conda install openjdk -y

# install PySpark
%pip install pyspark==3.4.0

# install spark-nlp
%pip install spark-nlp==5.1.3

# restart kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")


In [1]:
import sagemaker
sess = sagemaker.Session()
bucket = sess.default_bucket()

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


In [3]:
# standard library
import json
import os
import random
import re

# plotting and local analysis
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Spark NLP (wildcard imports come first so the explicit imports below take precedence)
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.pretrained import PretrainedPipeline

# PySpark
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when
from pyspark.ml import Pipeline, Model
from pyspark.ml.feature import OneHotEncoder, StringIndexer, IndexToString, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

In [4]:
# Import pyspark and build Spark session
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Spark NLP")\
    .master("local[*]")\
    .config("spark.driver.memory","32G")\
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.kryoserializer.buffer.max", "2000M")\
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.2")\
    .config("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.ContainerCredentialsProvider")\
    .getOrCreate()



:: loading settings :: url = jar:file:/opt/conda/lib/python3.10/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
org.apache.hadoop#hadoop-aws added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-dbe30aa1-52f9-4885-acfa-9a41f9701cc3;1.0
	confs: [default]
	found org.apache.hadoop#hadoop-aws;3.2.2 in central
	found com.amazonaws#aws-java-sdk-bundle;1.11.563 in central
:: resolution report :: resolve 398ms :: artifacts dl 23ms
	:: modules in use:
	com.amazonaws#aws-java-sdk-bundle;1.11.563 from central in [default]
	org.apache.hadoop#hadoop-aws;3.2.2 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   2   |   0   |   0   |   0   ||   2   |   0   |
	---------------------------------------------

In [5]:
print(f"Spark version: {spark.version}")
print(f"sparknlp version: {sparknlp.version()}")

Spark version: 3.4.0
sparknlp version: 5.1.3


## Reading in the Data

Below, the CountVectorized text submissions are read in, then filtered to keep only r/AITA posts that were not deleted or removed and that carry one of the 4 primary flairs (i.e., the labels we are trying to predict).

In [6]:
%%time
# Read in data from project bucket
bucket = "project17-bucket-alex"
directory = "matt-submissions-cv"

s3_path = f"s3a://{bucket}/{directory}"
# Parquet files carry their own schema, so no CSV-style `header` option is needed
submissions_cv = spark.read.parquet(s3_path)
# Here we subset the submissions to only include posts from r/AmItheAsshole for the subsequent analysis
raw_aita = submissions_cv.filter(F.col('subreddit') == "AmItheAsshole")

# filter submissions to remove deleted/removed posts
aita = raw_aita.filter((F.col('selftext') != '[removed]') & (F.col('selftext') != '[deleted]' ))

# Filter submissions to only include posts tagged with the 4 primary flairs
acceptable_flairs = ['Everyone Sucks', 'Not the A-hole', 'No A-holes here', 'Asshole']
df_flairs = aita.where(F.col('link_flair_text').isin(acceptable_flairs))
df_flairs.select("subreddit", "author", "title", "selftext", "created_utc", "num_comments", "link_flair_text").show()
print(f"shape of the subsetted submissions dataframe of appropriately flaired posts is {df_flairs.count():,}x{len(df_flairs.columns)}")

23/11/28 17:46:16 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
23/11/28 17:46:24 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                

+-------------+--------------------+--------------------+--------------------+-------------------+------------+---------------+
|    subreddit|              author|               title|            selftext|        created_utc|num_comments|link_flair_text|
+-------------+--------------------+--------------------+--------------------+-------------------+------------+---------------+
|AmItheAsshole|         Squeakitout|AITA for being pi...|my boyfriendmand ...|2022-03-18 00:34:50|          24| Everyone Sucks|
|AmItheAsshole| Foreign_Quarter8959|WIBTA if I don't ...|so if have been d...|2022-03-18 00:35:50|          28|No A-holes here|
|AmItheAsshole|         100000nopes|AITA for giving a...|i moved into a qu...|2022-03-18 00:38:16|          19| Not the A-hole|
|AmItheAsshole|      MonkeyBeBoolin|AITA for leaving ...|tldr i live with ...|2022-03-18 00:39:29|         107| Not the A-hole|
|AmItheAsshole|Potential-Persimmon3|AITA for wanting ...|for background i’...|2022-03-18 00:44:07|      



shape of the subsetted submissions dataframe of appropriately flaired posts is 110,386x570
CPU times: user 46.8 ms, sys: 5.29 ms, total: 52.1 ms
Wall time: 32.8 s


                                                                                

Next, we look at a small sample of the words in the vocabulary that will be used to predict the flairs. 

In [7]:
# extract vocabulary from dataframe
# (loop variable named `c` so it does not read like the imported pyspark `col` function)
word_cols = [c for c in df_flairs.columns if 'word_' in c]
vocabulary = [word.replace('word_', '') for word in word_cols]

# print the first ten vocabulary words
print(f"First ten vocabulary words: {', '.join(vocabulary[:10])}")

First ten vocabulary words: like, feel, want, know, time, tell, get, im, think, friend
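
To make the feature representation concrete, here is a minimal sketch of how a post could be turned into one count per vocabulary word. The tokenization used upstream is not shown in this notebook, so plain whitespace splitting is an assumption, and `count_vector` and `example_post` are hypothetical names introduced only for illustration.

```python
from collections import Counter

# The first ten vocabulary words printed above; the real feature set is much larger.
vocabulary = ["like", "feel", "want", "know", "time", "tell", "get", "im", "think", "friend"]


def count_vector(text, vocab):
    """Return one count per vocabulary word, in vocabulary order."""
    # Assumption: lowercase + whitespace tokenization stands in for the real preprocessing.
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]


# Hypothetical post text, not taken from the dataset.
example_post = "i feel like my friend does not know what i want and i feel bad"
vector = count_vector(example_post, vocabulary)

for word, count in zip(vocabulary, vector):
    print(f"word_{word}: {count}")
```

Each `word_*` column in `df_flairs` plays the role of one position in such a vector, which is why the columns can be handed directly to a `VectorAssembler` later on.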


From this vocabulary and the corresponding word columns, we can build a SparkML pipeline and employ a multi-class classification model to predict the flair associated with each post. The dataset is unbalanced, with the flair Not the A-hole overrepresented, so we add a weight column that weights each class inversely to its frequency, so that the minority flairs are not ignored during training.

In [8]:
# Find weights of classes depending on their prevalence in the original dataset
class_counts = df_flairs.groupBy("link_flair_text").count().collect()
total_count = df_flairs.count()
class_weights = {row["link_flair_text"]: total_count / (row["count"] * len(class_counts)) for row in class_counts}

# Add a new column for class weights
df_flairs = df_flairs.withColumn("class_weight", when(col('link_flair_text') == 'Not the A-hole', class_weights['Not the A-hole'])
                                 .when(col('link_flair_text') == 'Everyone Sucks', class_weights['Everyone Sucks'])
                                 .when(col('link_flair_text') == 'Asshole', class_weights['Asshole'])
                                 .otherwise(class_weights['No A-holes here']))


# Add a new column to the existing dataframe with class weights
#for label, weight in class_weights.items():
#    df_flairs = df_flairs.withColumn("classWeight", F.when(F.col("link_flair_text") == label, weight).otherwise(F.col("classWeight")))
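
The weighting formula in the cell above, `total / (count * n_classes)`, can be checked in plain Python. This is only a sketch: the counts below are the per-flair row totals of the test-set confusion matrix reported later in this notebook, used as stand-ins because the full-dataset counts are not printed here, and `balanced_weights` is a hypothetical helper name.

```python
def balanced_weights(counts):
    """Weight each class by total / (class_count * number_of_classes)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (count * n_classes) for label, count in counts.items()}


# Stand-in counts (test-split row totals), not the full dataset counts.
flair_counts = {
    "Not the A-hole": 17188,
    "Asshole": 3079,
    "No A-holes here": 1058,
    "Everyone Sucks": 884,
}
weights = balanced_weights(flair_counts)

for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")

# With these weights every class contributes the same total weight to training.
weighted_mass = {label: flair_counts[label] * weights[label] for label in flair_counts}
```

Under this scheme the overrepresented flair gets a weight below 1 and the rare flairs get weights above 1, while the summed weight over all rows stays equal to the number of rows.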

                                                                                

In [9]:
# 80/20 train/test split with a fixed seed
train_data, test_data = df_flairs.randomSplit([0.8, 0.2], seed = 24)

In [14]:
%%time
# index the flair strings as numeric labels
stringIndexer_flair = StringIndexer(inputCol = "link_flair_text", outputCol = "flair_idx")
# fit on the training split so these labels match the indexer that is refit inside the pipeline
flair_labels = stringIndexer_flair.fit(train_data).labels
# create a vector assembler with the appropriate input variables
vectorAssembler_features = VectorAssembler(
    inputCols = word_cols, 
    outputCol = 'input_features')
# create the random forest classification model
rf_classifier = RandomForestClassifier(
    labelCol = 'flair_idx',
    featuresCol = 'input_features',
    numTrees = 50,
    weightCol = "class_weight")
# create a label converter to bring the numeric predictions back to string labels
labelConverter = IndexToString(
    inputCol = 'prediction', 
    outputCol = 'predicted_flair', 
    labels = flair_labels)
# create the pipeline with appropriate stages
pipeline_model = Pipeline(
    stages = [stringIndexer_flair,
              vectorAssembler_features, 
              rf_classifier, labelConverter])



CPU times: user 18.1 ms, sys: 9.18 ms, total: 27.3 ms
Wall time: 13.5 s
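
The round trip between flair strings and numeric labels performed by `StringIndexer` and `IndexToString` can be mimicked in plain Python. This sketch assumes `StringIndexer`'s default ordering (most frequent label gets index 0, with ties broken alphabetically, as I understand Spark 3.x to behave); `frequency_desc_labels` and the toy flair list are hypothetical and not drawn from the dataset.

```python
from collections import Counter


def frequency_desc_labels(values):
    """Order labels by descending frequency, breaking ties alphabetically."""
    counts = Counter(values)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [label for label, _ in ordered]


# Hypothetical toy sample of flairs.
flairs = ["Not the A-hole"] * 5 + ["Asshole"] * 2 + ["No A-holes here", "Everyone Sucks"]

labels = frequency_desc_labels(flairs)
# String -> index, as the indexer stage would do.
to_index = {label: float(i) for i, label in enumerate(labels)}
indexed = [to_index[flair] for flair in flairs]
# Index -> string, as the label converter would do with the same label list.
round_trip = [labels[int(idx)] for idx in indexed]

print(labels)
```

The conversion back to strings is only correct if the label list handed to the converter comes from the same ordering the indexer used, which is why both should be derived from the same data.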


                                                                                

In [15]:
%%time
# fit the model
model = pipeline_model.fit(train_data)
                                                                                
# transform the data by applying the model
train_predictions = model.transform(train_data)
predictions = model.transform(test_data)

                                                                                

CPU times: user 208 ms, sys: 48.2 ms, total: 256 ms
Wall time: 3min 8s


Below are the calculations of metrics for the training data.

In [21]:
%%time
evaluator = MulticlassClassificationEvaluator(labelCol = 'flair_idx',
                                              predictionCol = 'prediction',
                                              metricName = 'accuracy')
train_accuracy = evaluator.evaluate(train_predictions)

evaluator = MulticlassClassificationEvaluator(labelCol = 'flair_idx',
                                              predictionCol = 'prediction',
                                              metricName = 'f1')
train_f1 = evaluator.evaluate(train_predictions)

evaluator = MulticlassClassificationEvaluator(labelCol = 'flair_idx',
                                              predictionCol = 'prediction',
                                              metricName = 'weightedPrecision')
train_precision = evaluator.evaluate(train_predictions)

evaluator = MulticlassClassificationEvaluator(labelCol = 'flair_idx',
                                              predictionCol = 'prediction',
                                              metricName = 'weightedRecall')
train_recall = evaluator.evaluate(train_predictions)



CPU times: user 85.3 ms, sys: 29 ms, total: 114 ms
Wall time: 2min 30s


                                                                                

In [22]:
print("Training Accuracy:"+str(train_accuracy))
print("Training F1:"+str(train_f1))
print("Training Weighted Precision:"+str(train_precision))
print("Training Weighted Recall:"+str(train_recall))

NameError: name 'train_accuracy' is not defined

With our model constructed and fitted, we can now see how accurately it classifies the testing subset of the r/AITA posts.

In [16]:
%%time
evaluator = MulticlassClassificationEvaluator(labelCol = 'flair_idx',
                                              predictionCol = 'prediction',
                                              metricName = 'accuracy')
test_accuracy = evaluator.evaluate(predictions)

evaluator = MulticlassClassificationEvaluator(labelCol = 'flair_idx',
                                              predictionCol = 'prediction',
                                              metricName = 'f1')
test_f1 = evaluator.evaluate(predictions)

evaluator = MulticlassClassificationEvaluator(labelCol = 'flair_idx',
                                              predictionCol = 'prediction',
                                              metricName = 'weightedPrecision')
test_precision = evaluator.evaluate(predictions)

evaluator = MulticlassClassificationEvaluator(labelCol = 'flair_idx',
                                              predictionCol = 'prediction',
                                              metricName = 'weightedRecall')
test_recall = evaluator.evaluate(predictions)



CPU times: user 94.8 ms, sys: 11.5 ms, total: 106 ms
Wall time: 2min 31s


                                                                                

In [17]:
print("Testing Accuracy:"+str(test_accuracy))
print("Testing F1:"+str(test_f1))
print("Testing Weighted Precision:"+str(test_precision))
print("Testing Weighted Recall:"+str(test_recall))

Testing Accuracy:0.24400018010716376
Testing F1:0.3117948574507945
Testing Weighted Precision:0.679297818646778
Testing Weighted Recall:0.24400018010716376


In [18]:
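# collect both columns in one pass so true and predicted flairs stay aligned row by row
flair_rows = predictions.select("link_flair_text", "predicted_flair").collect()
flair_orig = [row["link_flair_text"] for row in flair_rows]
flair_pred = [row["predicted_flair"] for row in flair_rows]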

                                                                                

In [19]:
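cm_labels = ['Asshole', 'Everyone Sucks', 'No A-holes here', 'Not the A-hole']
# rows are the true flairs, columns are the predicted flairs; pin the label order explicitly
cm = confusion_matrix(flair_orig, flair_pred, labels = cm_labels)
print("Confusion Matrix:")
print(pd.DataFrame(cm, columns = cm_labels, index = cm_labels))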

Confusion Matrix:
                 Asshole  Everyone Sucks  No A-holes here  Not the A-hole
Asshole              320             883             1411             465
Everyone Sucks        96             422              227             139
No A-holes here       79             162              655             162
Not the A-hole      1422            5105             6639            4022


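From the above metrics and confusion matrix (rows are the true flairs, columns the predicted flairs), we can see that this Random Forest model does a poor job of predicting flairs from the text content of the posts once class frequency is accounted for. Test accuracy is 24.4%, roughly what uniform guessing among four classes would achieve (25%). The weighted precision (0.68) looks good next to the F1 (0.31), accuracy, and recall scores, but it is dominated by the majority class: when the model predicts Not the A-hole it is usually right (4,022 of 4,788 predictions), yet it assigns that flair to only 4,022 of the 17,188 test posts that actually carry it.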
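
The reported test metrics can be cross-checked against the confusion matrix printed above. The sketch below copies that matrix by hand and recomputes accuracy, per-flair precision and recall, and the support-weighted averages with NumPy; it assumes scikit-learn's convention that rows are true labels and columns are predictions, and the variable names are illustrative.

```python
import numpy as np

labels = ["Asshole", "Everyone Sucks", "No A-holes here", "Not the A-hole"]
# Copied from the confusion matrix above: rows = true flair, columns = predicted flair.
cm = np.array([
    [320, 883, 1411, 465],
    [96, 422, 227, 139],
    [79, 162, 655, 162],
    [1422, 5105, 6639, 4022],
])

support = cm.sum(axis=1)          # number of test posts per true flair
predicted_totals = cm.sum(axis=0) # number of test posts per predicted flair
correct = np.diag(cm)

accuracy = correct.sum() / cm.sum()
precision = correct / predicted_totals
recall = correct / support
weighted_precision = (precision * support).sum() / support.sum()
weighted_recall = (recall * support).sum() / support.sum()
# Accuracy of always predicting the most common flair, as a reference point.
majority_baseline = support.max() / support.sum()

for name, p, r in zip(labels, precision, recall):
    print(f"{name:>16}: precision={p:.3f} recall={r:.3f}")
print(f"accuracy={accuracy:.4f} weighted precision={weighted_precision:.4f}")
print(f"majority-class baseline accuracy={majority_baseline:.4f}")
```

Weighted recall equals accuracy by construction, which is why the two reported values are identical. The per-flair breakdown also shows where the weighted precision comes from, and the majority-class baseline gives a reference point for judging the accuracy figure.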

Below, we attempt another classification model, this time using the number of comments as a predictor.