## Setup

Here, we install the JDK with conda (which also sets the proper paths), install PySpark, and restart the kernel.

In [None]:
# Setup - Run only once per Kernel App
%conda install openjdk -y

# install PySpark
%pip install pyspark==3.3.0

# restart kernel
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")
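
After the restart, it can help to confirm that the install actually put a `java` executable on the PATH, since PySpark needs one to launch the JVM. A minimal sketch using only the standard library; the helper name `java_on_path` is hypothetical:

```python
import shutil


def java_on_path(executable="java"):
    """Return the full path to the Java launcher, or None if it is not on PATH."""
    return shutil.which(executable)


java_path = java_on_path()
if java_path is None:
    print("java not found on PATH; re-run the conda install cell")
else:
    print(f"java found at {java_path}")
```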

## Start the Spark Session

Here, we start the `spark` session.

In [None]:
# Import pyspark and build Spark session
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("PySparkApp")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.2")
    .config(
        "fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.ContainerCredentialsProvider",
    )
    .getOrCreate()
)

In [4]:
print(spark.version)

3.3.0


## Test Reading in Data from Shared Bucket

Here, we try to read a file that Alex created in our shared bucket, to confirm that group members can read files owned by someone else.

```
%%time
t = spark.read.text(
    "s3a://project17-bucket-alex/eda_ideas.txt"
)
t.show()
```

## Test Writing Data into Shared Bucket

Here, we try to write a small data frame into the shared bucket.

```
%%time

data = [{"Category": 'A', "ID": 1, "Value": 121.44, "Truth": True},
        {"Category": 'B', "ID": 2, "Value": 300.01, "Truth": False},
        {"Category": 'C', "ID": 3, "Value": 10.99, "Truth": None},
        {"Category": 'E', "ID": 4, "Value": 33.87, "Truth": True}
        ]

df = spark.createDataFrame(data)
df.show()
```

```
df.write.csv(
    "s3a://project17-bucket-alex/matt-test-csv.csv"
)
```

## Read in the Data

Now that the shared bucket is configured properly, we can read our filtered project data.

### Comments - Read the Data (ONE MONTH)

```
%%time
comments = spark.read.parquet(
    's3a://project17-bucket-alex/project_jan2021/comments/*.parquet'
)
```

### Comments - Read the Data (FULL)

In [5]:
%%time
# Read in data from project bucket
bucket = "project17-bucket-alex"
#output_prefix_data = "project_2022"

# List of 12 directories each containing 1 month of data
directories = ["project_2022_" + str(i) + "/comments" for i in range(1, 13)]

# Iterate through 12 directories and merge each monthly data set to create one big data set
comments = None
for directory in directories:
    s3_path = f"s3a://{bucket}/{directory}"
    month_df = spark.read.parquet(s3_path)

    if comments is None:
        comments = month_df
    else:
        comments = comments.union(month_df)

23/11/25 21:05:08 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
CPU times: user 10.2 ms, sys: 18.4 ms, total: 28.5 ms
Wall time: 13.8 s
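
The loop above builds one S3 path per month and unions the monthly frames. As an alternative sketch, the same twelve paths can be built up front and passed to a single read, since `DataFrameReader.parquet` accepts multiple paths. The helper names `monthly_paths` and `read_all_months` are hypothetical, and the directory layout is the one assumed in the cell above:

```python
def monthly_paths(bucket, kind, months=range(1, 13)):
    """Build one s3a path per monthly directory, e.g. project_2022_1/comments."""
    return [f"s3a://{bucket}/project_2022_{m}/{kind}" for m in months]


def read_all_months(spark, bucket, kind):
    """Read every monthly directory in one call instead of a union loop.

    Assumes `spark` is an active SparkSession.
    """
    return spark.read.parquet(*monthly_paths(bucket, kind))


paths = monthly_paths("project17-bucket-alex", "comments")
print(len(paths))   # 12
print(paths[0])     # s3a://project17-bucket-alex/project_2022_1/comments
```

If the monthly schemas differ, a multi-path read and a positional `union` may not behave identically, so the loop remains the reference version here.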


### Comments - View the Data

In [6]:
comments.select(['subreddit', 'author', 'body', 'parent_id', 'link_id', 'id', 'created_utc']).show(10)

+-----------------+--------------------+--------------------+----------+---------+-------+-------------------+
|        subreddit|              author|                body| parent_id|  link_id|     id|        created_utc|
+-----------------+--------------------+--------------------+----------+---------+-------+-------------------+
|    AmItheAsshole|         beckydragon|                 NTA| t3_rz9uu3|t3_rz9uu3|hs0rusg|2022-01-10 04:49:57|
|    AmItheAsshole|        Cactus_chuck|NTA. My partners ...| t3_s0baev|t3_s0baev|hs0rusr|2022-01-10 04:49:57|
|    AmItheAsshole|   Red-belliedOrator|INFO\n\nIn genera...| t3_s0a5hn|t3_s0a5hn|hs0rut9|2022-01-10 04:49:57|
|NoStupidQuestions|  SoMuchForLongevity|You couldn't heat...| t3_s0b5be|t3_s0b5be|hs0rutc|2022-01-10 04:49:57|
|NoStupidQuestions|          MMmason651|it wouldn't taste...| t3_s0axsd|t3_s0axsd|hs0rutm|2022-01-10 04:49:57|
|           AskMen|          redditfu76|  Play with my boobs| t3_s0bc1r|t3_s0bc1r|hs0ruva|2022-01-10 04:49:58|
|

### Comments - Print the Shape of the Data

In [9]:
comments.count(), len(comments.columns)

(76503363, 21)

### Comments - Print the Schema

In [10]:
comments.printSchema()

root
 |-- author: string (nullable = true)
 |-- author_cakeday: boolean (nullable = true)
 |-- author_flair_css_class: string (nullable = true)
 |-- author_flair_text: string (nullable = true)
 |-- body: string (nullable = true)
 |-- can_gild: boolean (nullable = true)
 |-- controversiality: long (nullable = true)
 |-- created_utc: timestamp (nullable = true)
 |-- distinguished: string (nullable = true)
 |-- edited: string (nullable = true)
 |-- gilded: long (nullable = true)
 |-- id: string (nullable = true)
 |-- is_submitter: boolean (nullable = true)
 |-- link_id: string (nullable = true)
 |-- parent_id: string (nullable = true)
 |-- permalink: string (nullable = true)
 |-- retrieved_on: timestamp (nullable = true)
 |-- score: long (nullable = true)
 |-- stickied: boolean (nullable = true)
 |-- subreddit: string (nullable = true)
 |-- subreddit_id: string (nullable = true)



### Submissions - Read the Data (ONE MONTH)

```
%%time
submissions = spark.read.parquet(
    's3a://project17-bucket-alex/project_jan2021/submissions/*.parquet'
)
```

### Submissions - Read the Data (FULL)

In [7]:
%%time
# Read in data from project bucket
bucket = "project17-bucket-alex"
#output_prefix_data = "project_2022"

# List of 12 directories each containing 1 month of data
directories = ["project_2022_" + str(i) + "/submissions" for i in range(1, 13)]

# Iterate through 12 directories and merge each monthly data set to create one big data set
submissions = None
for directory in directories:
    s3_path = f"s3a://{bucket}/{directory}"
    month_df = spark.read.parquet(s3_path)

    if submissions is None:
        submissions = month_df
    else:
        submissions = submissions.union(month_df)

23/11/25 21:05:32 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
CPU times: user 19.7 ms, sys: 3.94 ms, total: 23.6 ms
Wall time: 6.86 s


### Submissions - View the Data

In [8]:
submissions.select(['subreddit', 'author', 'title', 'selftext', 'created_utc', 'num_comments']).show(10)

+-----------------+-------------------+--------------------+--------------------+-------------------+------------+
|        subreddit|             author|               title|            selftext|        created_utc|num_comments|
+-----------------+-------------------+--------------------+--------------------+-------------------+------------+
|NoStupidQuestions|          [deleted]|Who do you call w...|           [deleted]|2022-01-22 18:14:03|           4|
|    AmItheAsshole|          [deleted]|AITA for blowing ...|           [removed]|2022-01-22 18:14:04|           7|
|    AmItheAsshole|       go_awaythrow|AITA if I cut my ...|           [removed]|2022-01-22 18:14:12|           1|
|NoStupidQuestions|          [deleted]|   [deleted by user]|           [removed]|2022-01-22 18:14:16|           1|
|           AskMen|          [deleted]|Do men actually l...|           [removed]|2022-01-22 18:14:21|           1|
|         antiwork|        Vivid_Steel|For Those of You ...|In most states in...

### Submissions - Print the Shape of the Data

In [13]:
submissions.count(), len(submissions.columns)

(3444283, 68)

### Submissions - Print the Schema

In [14]:
submissions.printSchema()

root
 |-- adserver_click_url: string (nullable = true)
 |-- adserver_imp_pixel: string (nullable = true)
 |-- archived: boolean (nullable = true)
 |-- author: string (nullable = true)
 |-- author_cakeday: boolean (nullable = true)
 |-- author_flair_css_class: string (nullable = true)
 |-- author_flair_text: string (nullable = true)
 |-- author_id: string (nullable = true)
 |-- brand_safe: boolean (nullable = true)
 |-- contest_mode: boolean (nullable = true)
 |-- created_utc: timestamp (nullable = true)
 |-- crosspost_parent: string (nullable = true)
 |-- crosspost_parent_list: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- approved_at_utc: string (nullable = true)
 |    |    |-- approved_by: string (nullable = true)
 |    |    |-- archived: boolean (nullable = true)
 |    |    |-- author: string (nullable = true)
 |    |    |-- author_flair_css_class: string (nullable = true)
 |    |    |-- author_flair_text: string (nullable = true)
 |    |    

## Explore the Data

### Comments - Value Counts by Subreddit

Here, we see how many comments we obtained across our twelve subreddits.

In [15]:
comment_counts = comments.groupBy('subreddit').count().orderBy('count', ascending = False).cache()
comment_counts.show()



+-------------------+--------+
|          subreddit|   count|
+-------------------+--------+
|      AmItheAsshole|25323210|
|           antiwork|10393160|
|relationship_advice| 9524526|
|  NoStupidQuestions| 7102304|
|             AskMen| 6733187|
|     TrueOffMyChest| 5808636|
|   unpopularopinion| 5197347|
|           AskWomen| 2275029|
|               tifu| 1699742|
|  explainlikeimfive| 1529311|
|       OutOfTheLoop|  517638|
|       socialskills|  399273|
+-------------------+--------+



In [None]:
# save the comment counts to CSV
comment_counts.write.csv('comment_counts.csv')
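
Note that Spark's `write.csv` produces a directory named `comment_counts.csv` holding one or more `part-*` files rather than a single CSV file. A minimal sketch, assuming the output landed on the local filesystem and was written without a header as in the cell above, that stitches the part files back into rows; the helper name `read_spark_csv_dir` is hypothetical:

```python
import csv
import glob
import os


def read_spark_csv_dir(directory):
    """Collect the rows from every part file in a Spark CSV output directory."""
    rows = []
    # Spark names its data files part-*; marker files such as _SUCCESS are skipped.
    for part in sorted(glob.glob(os.path.join(directory, "part-*"))):
        with open(part, newline="") as handle:
            rows.extend(csv.reader(handle))
    return rows
```

Calling `read_spark_csv_dir("comment_counts.csv")` would then return a list of `[subreddit, count]` string pairs.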

### Comments - Dummy Variable for Removed/Deleted/Empty Comments

In [11]:
from pyspark.sql.functions import col

invalid_comments = ['[deleted]', '[removed]', '']
comment_valid_dummy = comments.withColumn('valid', ~col('body').isin(invalid_comments))
comment_valid_dummy.select(['body', 'valid']).show()

+--------------------+-----+
|                body|valid|
+--------------------+-----+
|                 NTA| true|
|NTA. My partners ...| true|
|INFO\n\nIn genera...| true|
|You couldn't heat...| true|
|it wouldn't taste...| true|
|  Play with my boobs| true|
|This happened to ...| true|
|100% NTA! If I we...| true|
|Doesn't justify y...| true|
|                Meow| true|
|Avocados are so f...| true|
|                   ^| true|
|How can we do bet...| true|
|I never liked Elo...| true|
|I thought it was ...| true|
|YTA\n\nI can't un...| true|
|NTA, and your par...| true|
|It’s usually a sl...| true|
|NTA. I think you'...| true|
|Get yourself wet,...| true|
+--------------------+-----+
only showing top 20 rows
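
The `valid` flag simply means "the body is not one of the placeholder strings". The same rule in plain Python, as a sketch for checking individual values by hand; the function name `is_valid_body` is hypothetical, and returning `None` for a missing body mirrors how Spark's `isin` yields null for null input:

```python
INVALID_BODIES = {"[deleted]", "[removed]", ""}


def is_valid_body(body):
    """Mirror ~col('body').isin(invalid_comments) for a single value."""
    if body is None:
        # Spark propagates null through isin and negation, so valid is null too.
        return None
    return body not in INVALID_BODIES


for sample in ["NTA", "[deleted]", "[removed]", "", None]:
    print(repr(sample), is_valid_body(sample))
```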



### Comments - Value Counts by Subreddit and Removed/Deleted/Empty Comments

In [12]:
# group by subreddit and validity
comment_counts_valid = comment_valid_dummy.groupBy(['subreddit', 'valid']).count().orderBy('subreddit', ascending = True).cache()
comment_counts_valid.show()



+-------------------+-----+--------+
|          subreddit|valid|   count|
+-------------------+-----+--------+
|      AmItheAsshole| true|23966604|
|      AmItheAsshole|false| 1356606|
|             AskMen| true| 6128483|
|             AskMen|false|  604704|
|           AskWomen| true| 1983040|
|           AskWomen|false|  291989|
|  NoStupidQuestions|false|  567936|
|  NoStupidQuestions| true| 6534368|
|       OutOfTheLoop|false|  110012|
|       OutOfTheLoop| true|  407626|
|     TrueOffMyChest|false|  531525|
|     TrueOffMyChest| true| 5277111|
|           antiwork|false|  774101|
|           antiwork| true| 9619059|
|  explainlikeimfive| true| 1335550|
|  explainlikeimfive|false|  193761|
|relationship_advice| true| 8619495|
|relationship_advice|false|  905031|
|       socialskills|false|   29814|
|       socialskills| true|  369459|
+-------------------+-----+--------+
only showing top 20 rows



In [13]:
# convert the validity boolean to integer before converting to pandas
comment_counts_valid = comment_counts_valid.withColumn('valid', col('valid').cast('integer'))
comment_counts_valid.show()

+-------------------+-----+--------+
|          subreddit|valid|   count|
+-------------------+-----+--------+
|      AmItheAsshole|    1|23966604|
|      AmItheAsshole|    0| 1356606|
|             AskMen|    1| 6128483|
|             AskMen|    0|  604704|
|           AskWomen|    1| 1983040|
|           AskWomen|    0|  291989|
|  NoStupidQuestions|    0|  567936|
|  NoStupidQuestions|    1| 6534368|
|       OutOfTheLoop|    0|  110012|
|       OutOfTheLoop|    1|  407626|
|     TrueOffMyChest|    0|  531525|
|     TrueOffMyChest|    1| 5277111|
|           antiwork|    0|  774101|
|           antiwork|    1| 9619059|
|  explainlikeimfive|    1| 1335550|
|  explainlikeimfive|    0|  193761|
|relationship_advice|    1| 8619495|
|relationship_advice|    0|  905031|
|       socialskills|    0|   29814|
|       socialskills|    1|  369459|
+-------------------+-----+--------+
only showing top 20 rows



In [14]:
# save the comment dummy counts to CSV
comment_counts_valid.toPandas().to_csv('../../data/eda-data/comment_counts_valid_new.csv', index = False)
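
One quantity the saved counts support is the share of removed, deleted, or empty comments per subreddit. A small sketch using two subreddits' counts copied from the output above; the dictionary layout and the name `invalid_share` are assumptions for illustration:

```python
# (valid, invalid) comment counts copied from the table above
counts = {
    "AmItheAsshole": (23966604, 1356606),
    "AskMen": (6128483, 604704),
}


def invalid_share(valid, invalid):
    """Fraction of comments whose body is [deleted], [removed], or empty."""
    return invalid / (valid + invalid)


for subreddit, (valid, invalid) in counts.items():
    print(f"{subreddit}: {invalid_share(valid, invalid):.1%} invalid")
```

As a consistency check, valid plus invalid matches the per-subreddit totals in the earlier comment counts table: 23,966,604 + 1,356,606 = 25,323,210 for AmItheAsshole.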

### Submissions - Value Counts by Subreddit

Here, we see how many submissions we obtained across our twelve subreddits.

In [15]:
submission_counts = submissions.groupBy('subreddit').count().orderBy('count', ascending = False).cache()
submission_counts.show()


In [None]:
# save the submission counts to CSV
submission_counts.write.option('header', True).csv('submission_counts.csv')

### Submissions - Dummy Variable for Removed/Deleted/Empty Submissions

In [15]:
from pyspark.sql.functions import col

invalid_submissions = ['[deleted]', '[removed]', '']
submission_valid_dummy = submissions.withColumn('valid', ~col('selftext').isin(invalid_submissions))
submission_valid_dummy.select(['selftext', 'valid']).show()

+--------------------+-----+
|            selftext|valid|
+--------------------+-----+
|           [deleted]|false|
|           [removed]|false|
|           [removed]|false|
|           [removed]|false|
|           [removed]|false|
|In most states in...| true|
|           [removed]|false|
|           [removed]|false|
|                    |false|
|                    |false|
|           [removed]|false|
|                    |false|
|I like waking up ...| true|
|I (NB 19) am auti...| true|
|           [deleted]|false|
|           [deleted]|false|
|I don’t know why,...| true|
|I've been called ...| true|
|But I’m still in ...| true|
|           [removed]|false|
+--------------------+-----+
only showing top 20 rows



### Submissions - Value Counts by Subreddit and Removed/Deleted/Empty Submissions

In [16]:
# group by subreddit and validity
submission_counts_valid = submission_valid_dummy.groupBy(['subreddit', 'valid']).count().orderBy('subreddit', ascending = True).cache()
submission_counts_valid.show()



+-------------------+-----+------+
|          subreddit|valid| count|
+-------------------+-----+------+
|      AmItheAsshole| true|115659|
|      AmItheAsshole|false|405851|
|             AskMen| true| 18240|
|             AskMen|false|230847|
|           AskWomen| true|  2717|
|           AskWomen|false|156313|
|  NoStupidQuestions|false|361097|
|  NoStupidQuestions| true|234253|
|       OutOfTheLoop|false| 22782|
|       OutOfTheLoop| true|  3054|
|     TrueOffMyChest|false|164706|
|     TrueOffMyChest| true|125159|
|           antiwork|false|175513|
|           antiwork| true| 76647|
|  explainlikeimfive| true| 15002|
|  explainlikeimfive|false| 88903|
|relationship_advice| true|311882|
|relationship_advice|false|571074|
|       socialskills|false| 27452|
|       socialskills| true| 23005|
+-------------------+-----+------+
only showing top 20 rows



In [17]:
# convert the validity boolean to integer before converting to pandas
submission_counts_valid = submission_counts_valid.withColumn('valid', col('valid').cast('integer'))

In [19]:
# save the submission dummy counts to CSV
submission_counts_valid.toPandas().to_csv('../../data/eda-data/submission_counts_valid_new.csv', index = False)

### Comments - Value Counts by Subreddit and Month

In [20]:
from pyspark.sql.functions import month

comment_months = comments.withColumn('month_dt', month('created_utc'))

comment_month_counts = comment_months.groupBy(['subreddit', 'month_dt']).count().orderBy('subreddit', ascending = True).cache()
comment_month_counts.show()



+-------------+--------+-------+
|    subreddit|month_dt|  count|
+-------------+--------+-------+
|AmItheAsshole|       4|2231839|
|AmItheAsshole|       1|2123706|
|AmItheAsshole|       3|2257500|
|AmItheAsshole|      11|1943515|
|AmItheAsshole|       5|2121000|
|AmItheAsshole|       6|2024229|
|AmItheAsshole|       9|1999811|
|AmItheAsshole|      12|2172879|
|AmItheAsshole|       8|2180602|
|AmItheAsshole|      10|1978071|
|AmItheAsshole|       2|2029035|
|AmItheAsshole|       7|2261023|
|       AskMen|       2| 563639|
|       AskMen|       8| 512162|
|       AskMen|       1| 650579|
|       AskMen|       3| 726630|
|       AskMen|      12| 462515|
|       AskMen|       6| 571611|
|       AskMen|       4| 748925|
|       AskMen|      11| 466075|
+-------------+--------+-------+
only showing top 20 rows
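
`month('created_utc')` reduces each timestamp to its month number (1 to 12), and the group-by then counts rows per subreddit and month. The same idea on a handful of rows in plain Python, as a sketch; the sample rows reuse values shown in the comments preview, and `month_counts` is a hypothetical name:

```python
from collections import Counter
from datetime import datetime

# (subreddit, created_utc) pairs taken from the comments preview above
rows = [
    ("AmItheAsshole", "2022-01-10 04:49:57"),
    ("AmItheAsshole", "2022-01-10 04:49:57"),
    ("NoStupidQuestions", "2022-01-10 04:49:57"),
    ("AskMen", "2022-01-10 04:49:58"),
]


def month_counts(pairs):
    """Count rows per (subreddit, month number)."""
    counter = Counter()
    for subreddit, created in pairs:
        month = datetime.strptime(created, "%Y-%m-%d %H:%M:%S").month
        counter[(subreddit, month)] += 1
    return counter


# Sorting on both keys gives a stable, readable order.
for key, n in sorted(month_counts(rows).items()):
    print(key, n)
```

In the Spark version, ordering by both columns with `orderBy('subreddit', 'month_dt')` would likewise list the months in sequence within each subreddit.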



In [21]:
# save the comment month counts to CSV
comment_month_counts.toPandas().to_csv('../../data/eda-data/comment_month_counts_new.csv', index = False)

### Submissions - Value Counts by Subreddit and Month

In [22]:
from pyspark.sql.functions import month

submission_months = submissions.withColumn('month_dt', month('created_utc'))

submission_month_counts = submission_months.groupBy(['subreddit', 'month_dt']).count().orderBy('subreddit', ascending = True).cache()
submission_month_counts.show()

+-------------+--------+-----+
|    subreddit|month_dt|count|
+-------------+--------+-----+
|AmItheAsshole|       4|43613|
|AmItheAsshole|       1|35682|
|AmItheAsshole|       3|40763|
|AmItheAsshole|      11|38754|
|AmItheAsshole|       5|46383|
|AmItheAsshole|       6|48001|
|AmItheAsshole|       9|43479|
|AmItheAsshole|      12|42396|
|AmItheAsshole|       8|52238|
|AmItheAsshole|      10|42332|
|AmItheAsshole|       2|34398|
|AmItheAsshole|       7|53471|
|       AskMen|       2|19590|
|       AskMen|       8|20967|
|       AskMen|       1|21540|
|       AskMen|       3|22569|
|       AskMen|      12|18658|
|       AskMen|       6|21032|
|       AskMen|       4|24680|
|       AskMen|      11|18648|
+-------------+--------+-----+
only showing top 20 rows



In [23]:
# save the submission month counts to CSV
submission_month_counts.toPandas().to_csv('../../data/eda-data/submission_month_counts_new.csv', index = False)