# Flair Prediction/Sentiment Natural Language Processing

In this notebook, text posts from r/AmItheAsshole (r/AITA) are processed with several NLP techniques to isolate important word frequencies and patterns. Sentiment analysis is then applied to look for patterns and trends in the text posts across different flairs. The analysis covers only r/AITA posts tagged with one of the four "primary" flairs: Not the A-hole (NTA), Asshole (YTA), No A-holes here (NAH), and Everyone Sucks (ESH).

First, we read in the data from the project S3 bucket and subset it to valid posts only, i.e., posts whose body has not been removed or deleted and that are assigned one of the four primary flairs.

In [2]:
# Setup - Run only once per Kernel App
%conda install openjdk -y

# install PySpark
%pip install pyspark==3.2.0

# install spark-nlp
%pip install spark-nlp==5.1.3

# restart kernel
from IPython.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

Collecting package metadata (current_repodata.json): done
Solving environment: / 
The environment is inconsistent, please check the package plan carefully
The following packages are causing the inconsistency:

  - defaults/linux-64::anaconda-client==1.7.2=py37_0
  - defaults/noarch::anaconda-project==0.8.4=py_0
  - defaults/linux-64::bokeh==1.4.0=py37_0
  - defaults/noarch::dask==2.11.0=py_0
  - defaults/linux-64::distributed==2.11.0=py37_0
  - defaults/linux-64::spyder==4.0.1=py37_0
  - defaults/linux-64::watchdog==0.10.2=py37_0
failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: \ 
The environment is inconsistent, please check the package plan carefully
The following packages are causing the inconsistency:

  - defaults/linux-64::anaconda-client==1.7.2=py37_0
  - defaults/noarch::anaconda-project

In [3]:
# Import pyspark and build Spark session
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("PySparkApp")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.2")
    .config(
        "fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.ContainerCredentialsProvider",
    )
    .getOrCreate()
)

print(spark.version)

3.2.0


In [4]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import re
import os
import json
import random
import pyspark.sql.functions as F
from sparknlp.base import *
from pyspark.ml import Pipeline
from sparknlp.annotator import *
from pyspark.sql import SparkSession
from sparknlp.pretrained import PretrainedPipeline

In [7]:
%%time

# Read in data from project bucket
bucket = "project17-bucket-alex"
#output_prefix_data = "project_2022"

# List of 12 directories each containing 1 month of data
directories = ["project_2022_"+str(i)+"/submissions" for i in range(1,13)]

# Iterate through 12 directories and merge each monthly data set to create one big data set
df = None
for directory in directories:
    s3_path = f"s3a://{bucket}/{directory}"
    # Parquet files carry their own schema, so no CSV-style header option is needed
    month_df = spark.read.parquet(s3_path)
    
    if df is None:
        df = month_df
    else:
        df = df.union(month_df)

submissions = df

# Here we subset the submissions to only include posts from r/AmItheAsshole for the subsequent analysis
raw_aita = submissions.filter(F.col('subreddit') == "AmItheAsshole")

# filter submissions to remove deleted/removed posts
aita = raw_aita.filter((F.col('selftext') != '[removed]') & (F.col('selftext') != '[deleted]' ))

# Filter submissions to only include posts tagged with the 4 primary flairs
acceptable_flairs = ['Everyone Sucks', 'Not the A-hole', 'No A-holes here', 'Asshole']
df_flairs = aita.where(F.col('link_flair_text').isin(acceptable_flairs))
df_flairs.select("subreddit", "author", "title", "selftext", "created_utc", "num_comments", "link_flair_text").show()
print(f"shape of the subsetted submissions dataframe of appropriately flaired posts is {df_flairs.count():,}x{len(df_flairs.columns)}")

# cache flair df for later use
df_flairs.cache()

+-------------+--------------------+--------------------+--------------------+-------------------+------------+---------------+
|    subreddit|              author|               title|            selftext|        created_utc|num_comments|link_flair_text|
+-------------+--------------------+--------------------+--------------------+-------------------+------------+---------------+
|AmItheAsshole|       geosunsetmoth|AITA for refusing...|I (NB 19) am auti...|2022-01-22 18:15:30|         425| Not the A-hole|
|AmItheAsshole|           pezewuziz|AITA for charging...|So for a bit of b...|2022-01-22 18:28:39|          56| Everyone Sucks|
|AmItheAsshole|            twilipig|AITA for refusing...|Just a bit of bac...|2022-01-22 18:52:52|          52| Not the A-hole|
|AmItheAsshole|              joreia|AITA for not want...|So I’m (32F) a mo...|2022-01-13 02:17:23|          29| Not the A-hole|
|AmItheAsshole|       BazilbeeChuck|AITA for offering...|\nI (29M) have a ...|2022-01-13 02:32:24|      

DataFrame[adserver_click_url: string, adserver_imp_pixel: string, archived: boolean, author: string, author_cakeday: boolean, author_flair_css_class: string, author_flair_text: string, author_id: string, brand_safe: boolean, contest_mode: boolean, created_utc: timestamp, crosspost_parent: string, crosspost_parent_list: array<struct<approved_at_utc:string,approved_by:string,archived:boolean,author:string,author_flair_css_class:string,author_flair_text:string,banned_at_utc:string,banned_by:string,brand_safe:boolean,can_gild:boolean,can_mod_post:boolean,clicked:boolean,contest_mode:boolean,created:double,created_utc:double,distinguished:string,domain:string,downs:bigint,edited:boolean,gilded:bigint,hidden:boolean,hide_score:boolean,id:string,is_crosspostable:boolean,is_reddit_media_domain:boolean,is_self:boolean,is_video:boolean,likes:string,link_flair_css_class:string,link_flair_text:string,locked:boolean,media:string,mod_reports:array<string>,name:string,num_comments:bigint,num_crosspos
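
The month-by-month union loop above could also be expressed as a single multi-path read, since `DataFrameReader.parquet` accepts several paths. The following is a minimal sketch rather than the notebook's own method: `monthly_paths` is a hypothetical helper, and the approach assumes the twelve monthly directories share a compatible schema.

```python
def monthly_paths(bucket, year=2022, months=range(1, 13)):
    """Build the s3a:// paths for the monthly submission directories.

    Hypothetical helper: the layout mirrors the directories used in the
    cell above (project_<year>_<month>/submissions).
    """
    return [f"s3a://{bucket}/project_{year}_{m}/submissions" for m in months]


paths = monthly_paths("project17-bucket-alex")

# With an active session, all months could then be read in one call
# (assumes the monthly schemas are compatible):
# submissions = spark.read.parquet(*paths)
```

Reading all paths in one call lets Spark plan a single scan instead of chaining eleven unions, which keeps the query plan shorter.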

Next, we build an NLP pipeline to process the data and apply a pretrained sentiment model:

In [8]:
MODEL_NAME = "sentimentdl_use_twitter"

# Wrap the raw post body in a Spark NLP document annotation
documentAssembler = (
    DocumentAssembler()
    .setInputCol("selftext")
    .setOutputCol("document")
)

# Embed each document with the Universal Sentence Encoder
use = (
    UniversalSentenceEncoder.pretrained(name="tfhub_use", lang="en")
    .setInputCols(["document"])
    .setOutputCol("sentence_embeddings")
)

# Classify sentiment from the sentence embeddings
sentimentdl = (
    SentimentDLModel.pretrained(name=MODEL_NAME, lang="en")
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("sentiment")
)

nlpPipeline = Pipeline(stages=[documentAssembler, use, sentimentdl])

TypeError: 'JavaPackage' object is not callable
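
A `'JavaPackage' object is not callable` error raised by a Spark NLP annotator usually means the Spark NLP JAR is not on the session's classpath; the session built in `In [3]` only requests `hadoop-aws`. One possible fix, sketched below, is to add the Spark NLP Maven coordinate to `spark.jars.packages` and rebuild the session after a kernel restart. This assumes the Scala 2.12 build of Spark NLP 5.1.3 matches the pip-installed package; `build_packages` and `build_session` are hypothetical helper names.

```python
def build_packages(spark_nlp_version="5.1.3", scala_version="2.12",
                   hadoop_aws_version="3.2.2"):
    """Return the comma-separated Maven coordinates for spark.jars.packages."""
    return ",".join([
        f"org.apache.hadoop:hadoop-aws:{hadoop_aws_version}",
        # Assumption: the JAR version matches the pip-installed spark-nlp
        f"com.johnsnowlabs.nlp:spark-nlp_{scala_version}:{spark_nlp_version}",
    ])


def build_session(app_name="PySparkApp"):
    """Create a session with both hadoop-aws and Spark NLP on the classpath."""
    from pyspark.sql import SparkSession  # imported lazily; requires pyspark

    return (
        SparkSession.builder.appName(app_name)
        .config("spark.jars.packages", build_packages())
        # Mirrors the serializer settings that sparknlp.start() applies
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .config("spark.kryoserializer.buffer.max", "2000M")
        .config(
            "fs.s3a.aws.credentials.provider",
            "com.amazonaws.auth.ContainerCredentialsProvider",
        )
        .getOrCreate()
    )


print(build_packages())
```

Because `getOrCreate()` returns any session that is already running, the new package list only takes effect in a fresh kernel: call `spark = build_session()` before any other Spark code runs.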

In [None]:
# Fit model
pipelineModel = nlpPipeline.fit(df_flairs)
results = pipelineModel.transform(df_flairs)
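
Once `results` is available, one way to compare sentiment across flairs is to pull the predicted label out of the `sentiment` annotation column and tabulate it per flair. The sketch below is an assumption about how that step could look, not code from the original analysis: `flair_sentiment_pairs` and `sentiment_share_by_flair` are hypothetical helpers, and the toy pairs are invented solely to show the output shape.

```python
from collections import Counter, defaultdict


def flair_sentiment_pairs(results):
    """Return a DataFrame of (link_flair_text, sentiment) rows.

    Assumes `results` has the `sentiment` annotation column produced by the
    pipeline above; `sentiment.result` is the array of predicted labels.
    """
    import pyspark.sql.functions as F  # imported lazily; requires pyspark

    return results.select(
        "link_flair_text",
        F.explode("sentiment.result").alias("sentiment"),
    )


def sentiment_share_by_flair(pairs):
    """Compute, for each flair, the share of posts per sentiment label."""
    counts = defaultdict(Counter)
    for flair, label in pairs:
        counts[flair][label] += 1
    return {
        flair: {label: n / sum(c.values()) for label, n in c.items()}
        for flair, c in counts.items()
    }


# Toy input purely to illustrate the output shape (not real results)
toy_pairs = [
    ("Asshole", "negative"),
    ("Asshole", "positive"),
    ("Not the A-hole", "positive"),
]
shares = sentiment_share_by_flair(toy_pairs)
```

On the full data set it is likely cheaper to aggregate in Spark first, for example with `groupBy("link_flair_text", "sentiment").count()` on the output of `flair_sentiment_pairs`, and only collect the small summary table to the driver.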