# Flair Prediction/Sentiment Natural Language Processing

In this notebook, text posts from r/AmItheAsshole (r/AITA) are processed using various NLP techniques to isolate important word frequencies and patterns. Additionally, sentiment analysis is applied to these data to find interesting patterns/trends within the text posts across different flairs. This analysis only concerns r/AITA posts that are tagged with of the four "primary" flairs, being Not the A-hole (NTA), Asshole (YTA), No A-holes here (NAH), and Everybody Sucks (ESH).

First, we must read in the data from the project S3 bucket and subset it accordingly to only valid (i.e., remove text posts with no body and area also assigned one of the 4 primary flairs).

In [2]:
# Setup - Run only once per Kernel App
%conda install openjdk -y

# install PySpark
%pip install pyspark==3.2.0

# install spark-nlp
%pip install spark-nlp==5.1.3

# restart kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

Collecting package metadata (current_repodata.json): done
Solving environment: - 
The environment is inconsistent, please check the package plan carefully
The following packages are causing the inconsistency:

  - defaults/linux-64::anaconda-client==1.7.2=py37_0
  - defaults/noarch::anaconda-project==0.8.4=py_0
  - defaults/linux-64::bokeh==1.4.0=py37_0
  - defaults/noarch::dask==2.11.0=py_0
  - defaults/linux-64::distributed==2.11.0=py37_0
  - defaults/linux-64::spyder==4.0.1=py37_0
  - defaults/linux-64::watchdog==0.10.2=py37_0
failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: | 
The environment is inconsistent, please check the package plan carefully
The following packages are causing the inconsistency:

  - defaults/linux-64::anaconda-client==1.7.2=py37_0
  - defaults/noarch::anaconda-project

In [3]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import re
import os
import json
import random
import pyspark.sql.functions as F
from sparknlp.base import *
from pyspark.ml import Pipeline
from sparknlp.annotator import *
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from sparknlp.pretrained import PretrainedPipeline

In [5]:
# Import pyspark and build Spark session
from pyspark.sql import SparkSession

# Import pyspark and build Spark session
spark = SparkSession.builder \
    .appName("Spark NLP")\
    .master("local[*]")\
    .config("spark.driver.memory","16G")\
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.kryoserializer.buffer.max", "2000M")\
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3")\
    .getOrCreate()

print(f"Spark version: {spark.version}")
#print(f"SparkNLP version: {sparknlp.version}")

Spark version: 3.2.0


Now we establish an NLP pipeline below to process the data and apply a pretrained sentiment model below:

In [6]:
import sagemaker
session = sagemaker.Session()
bucket = "project17-bucket-alex"
!wget -qO- https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-5.1.3.jar | aws s3 cp - s3://{bucket}/lab8/spark-nlp-assembly-5.1.3.jar
!aws s3 ls s3://{bucket}/lab8/spark-nlp-assembly-5.1.3.jar

2023-11-18 05:32:14  708534094 spark-nlp-assembly-5.1.3.jar


In [11]:
%%writefile ./flair-process.py

import os
import sys
import logging
import argparse

# Import pyspark and build Spark session
from pyspark.sql.functions import *
from pyspark.sql.types import (
    DoubleType,
    IntegerType,
    StringType,
    StructField,
    StructType,
)

import json
import sparknlp
import numpy as np
import pandas as pd
from sparknlp.base import *
from pyspark.ml import Pipeline
from sparknlp.annotator import *
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from sparknlp.pretrained import PretrainedPipeline

logging.basicConfig(format='%(asctime)s,%(levelname)s,%(module)s,%(filename)s,%(lineno)d,%(message)s', level=logging.DEBUG)
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(sys.stdout))

def main():
    parser = argparse.ArgumentParser(description="app inputs and outputs")
    parser.add_argument("--s3_dataset_path", type=str, help="Path of dataset in S3")
    parser.add_argument("--s3_output_bucket", type=str, help="s3 output bucket")
    parser.add_argument("--s3_output_key_prefix", type=str, help="s3 output key prefix")
    args = parser.parse_args()
    logger.info(f"args={args}")
    
    spark = SparkSession.builder \
    .appName("Spark NLP")\
    .config("spark.driver.memory","16G")\
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.kryoserializer.buffer.max", "2000M")\
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.3")\
    .getOrCreate()
    
    logger.info(f"Spark version: {spark.version}")
    logger.info(f"sparknlp version: {sparknlp.version()}")
    
    # This is needed to save RDDs which is the only way to write nested Dataframes into CSV format
    sc = spark.sparkContext
    sc._jsc.hadoopConfiguration().set(
        "mapred.output.committer.class", "org.apache.hadoop.mapred.FileOutputCommitter"
    )

    # Defining the schema corresponding to the input data. The input data does not contain the headers
    schema = StructType(
        [
            StructField("ID", StringType(), True),
            StructField("Content", StringType(), True),
            StructField("Summary", StringType(), True),
            StructField("Dataset", StringType(), True)
        ]
    )
    
    # Downloading the data from S3 into a Dataframe
    logger.info(f"going to read {args.s3_dataset_path}")
    #df = spark.read.parquet(args.s3_dataset_path, header=True, schema=schema)
    #df = df.repartition(64)
    
        # Read in data from project bucket
    #bucket = "project17-bucket-alex"
    #output_prefix_data = "project_2022"

    # List of 12 directories each containing 1 month of data
    directories = ["project_2022_"+str(i)+"/submissions" for i in range(1,13)]

    # Iterate through 12 directories and merge each monthly data set to create one big data set
    df = None
    for directory in directories:
        s3_path = f"s3a://{bucket}/{directory}"
        month_df = spark.read.parquet(s3_path, header = True)

        if df is None:
            df = month_df
        else:
            df = df.union(month_df)
    logger.info(f"finished reading files...")
    
    submissions = df

    # Here we subset the submissions to only include posts from r/AmItheAsshole for the subsequent analysis
    raw_aita = submissions.filter(F.col('subreddit') == "AmItheAsshole")

    # filter submissions to remove deleted/removed posts
    aita = raw_aita.filter((F.col('selftext') != '[removed]') & (F.col('selftext') != '[deleted]' ))

    # Filter submissions to only include posts tagged with the 4 primary flairs
    acceptable_flairs = ['Everyone Sucks', 'Not the A-hole', 'No A-holes here', 'Asshole']
    df_flairs = aita.where(F.col('link_flair_text').isin(acceptable_flairs))
    #df_flairs.select("subreddit", "author", "title", "selftext", "created_utc", "num_comments", "link_flair_text").show()
    #print(f"shape of the subsetted submissions dataframe of appropriately flaired posts is {df_flairs.count():,}x{len(df_flairs.columns)}")
    df = df_flairs

    
    # get count
    row_count = df.count()
    # create a temp rdd and save to s3
    line = [f"count={row_count}"]
    logger.info(line)
    l = [('count', row_count)]
    tmp_df = spark.createDataFrame(l)
    s3_path = "s3://" + os.path.join(args.s3_output_bucket, args.s3_output_key_prefix, "count")
    logger.info(f"going to save count to {s3_path}")
    # we want to write to a single file so coalesce
    tmp_df.coalesce(1).write.format('csv').option('header', 'false').mode("overwrite").save(s3_path)
    
    #df = df\
    #.withColumn('politics', F.col("Content").rlike("""(?i)politics|(?i)political|(?i)senate|(?i)democrats|(?i)republicans|(?i)government|(?i)president|(?i)prime minister|(?i)congress"""))\
    #.withColumn('sports', F.col("Content").rlike("""(?i)sport|(?i)ball|(?i)coach|(?i)goal|(?i)baseball|(?i)football|(?i)basketball"""))\
    #.withColumn('arts', F.col("Content").rlike("""(?i)art|(?i)painting|(?i)artist|(?i)museum|(?i)photography|(?i)sculpture"""))\
    #.withColumn('history', F.col("Content").rlike("""(?i)history|(?i)historical|(?i)ancient|(?i)archaeology|(?i)heritage|(?i)fossil""")).persist()
    
    #categories = ['politics', 'arts', 'sports', 'history']
    #for c in categories:
    #    df_soln = df.groupBy(c).count() #.toPandas().to_dict(orient='records')        
    #    s3_path = "s3://" + os.path.join(args.s3_output_bucket, args.s3_output_key_prefix, c)
    #    logger.info(f"going to save dataframe to {s3_path}")
    #    # we want to write to a single file so coalesce
    #    df_soln.coalesce(1).write.format('csv').option('header', 'false').mode("overwrite").save(s3_path)

    # sentiment analysis
    MODEL_NAME = 'sentimentdl_use_twitter'
    logger.info(f"setting up an nlp pipeline with model={MODEL_NAME}")
    documentAssembler = DocumentAssembler()\
    .setInputCol("selftext")\
    .setOutputCol("document")
    
    use = UniversalSentenceEncoder.pretrained(name="tfhub_use", lang="en")\
     .setInputCols(["document"])\
     .setOutputCol("sentence_embeddings")

    sentimentdl = SentimentDLModel.pretrained(name=MODEL_NAME, lang="en")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("sentiment")

    nlp_pipeline = Pipeline(
      stages = [
          documentAssembler,
          use,
          sentimentdl
      ])
    logger.info(f"going to fit and transform pipeline on dataframe")
    pipeline_model = nlp_pipeline.fit(df)
    results = pipeline_model.transform(df)
    logger.info(f"done with fit and transform pipeline on dataframe")
    
    results=results.withColumn('sentiment', F.explode(results.sentiment.result))
    final_data=results.select("subreddit", "author", "title", "selftext", "created_utc", "num_comments", "link_flair_text",'sentiment')
    final_data.persist()
    #final_data.show()
    cols = ['link_flair_text', 'sentiment']
    #logger.info(f"going to run a group by and count on columns={cols}")
    sum_counts = final_data.groupBy(cols).count()
    logger.info(f"going to convert sum_counts to dict")
    df_sent_baseline = sum_counts #.toPandas().to_dict(orient='records')
    logger.info(df_sent_baseline)
    s3_path = "s3://" + os.path.join(args.s3_output_bucket, args.s3_output_key_prefix, "sentiment_baseline")
    logger.info(f"going to save dataframe to {s3_path}")
    # we want to write to a single file so coalesce
    df_sent_baseline.coalesce(1).write.format('csv').option('header', 'false').mode("overwrite").save(s3_path)
    logger.info("all done")
    
if __name__ == "__main__":
    main()

Overwriting ./flair-process.py


In [12]:
%%time
import boto3
import sagemaker
from sagemaker.spark.processing import PySparkProcessor

account_id = boto3.client('sts').get_caller_identity()['Account']

# Setup the PySpark processor to run the job. Note the instance type and instance count parameters. SageMaker will create these many instances of this type for the spark job.
role = sagemaker.get_execution_role()
spark_processor = PySparkProcessor(
    base_job_name="sm-spark-project17",
    image_uri=f"{account_id}.dkr.ecr.us-east-1.amazonaws.com/sagemaker-spark:latest",
    role=role,
    instance_count=6,
    instance_type="ml.m5.xlarge",
    max_runtime_in_seconds=3600,
)

# s3 paths
session = sagemaker.Session()
bucket = "project17-bucket-alex"
s3_dataset_path = f"s3://{bucket}"
print(f"account_id={account_id}, s3_dataset_path={s3_dataset_path}")
output_prefix_data = f"flairs/data"
output_prefix_logs = f"flairs/spark_logs"


# run the job now, the arguments array is provided as command line to the Python script (Spark code in this case).
spark_processor.run(
    submit_app="./flair-process.py",
    submit_jars=[f"s3://{bucket}/lab8/spark-nlp-assembly-5.1.3.jar"],
    arguments=[
        "--s3_dataset_path",
        s3_dataset_path,
        "--s3_output_bucket",
        bucket,
        "--s3_output_key_prefix",
        output_prefix_data,
    ],
    spark_event_logs_s3_uri="s3://{}/{}/spark_event_logs".format(bucket, output_prefix_logs),
    logs=False,
)

INFO:sagemaker:Creating processing-job with name sm-spark-project17-2023-11-18-05-43-52-744


account_id=862339729993, s3_dataset_path=s3://project17-bucket-alex
.................................................................................*

UnexpectedStatusException: Error for Processing job sm-spark-project17-2023-11-18-05-43-52-744: Failed. Reason: AlgorithmError: See job logs for more information