# Milestone 1: Frame your analysis and EDA




## 1. Project Topics

### Exploratory 1

#### Business Goals

Determine if multimedia (videos, images) in a post affects user interaction.

#### Technical Proposals

Compare the distribution of comment counts for each type of post using box plots, then perform hypothesis tests to check whether the differences are statistically significant.
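
One way to run the hypothesis test described above is a permutation test on the difference in mean comment counts. The sketch below is only an illustration under stated assumptions: the two lists are made-up comment counts, not values from the dataset, and `permutation_test` is a hypothetical helper name. In the notebook itself the two groups would come from `num_comments` split by post type.

```python
import random
from statistics import mean


def permutation_test(a, b, n_permutations=2000, seed=42):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference mean(a) - mean(b) and an estimated p-value.
    """
    rng = random.Random(seed)
    observed = mean(a) - mean(b)
    pooled = list(a) + list(b)
    n_a = len(a)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign group labels at random and recompute the difference
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Made-up comment counts, for illustration only
media_posts = [12, 30, 7, 45, 22, 18, 60, 9]
text_posts = [8, 5, 15, 3, 11, 6, 9, 4]

diff, p = permutation_test(media_posts, text_posts)
print(f"difference in mean comments: {diff:.2f}, p-value: {p:.4f}")
```

A permutation test makes no normality assumption, which may suit comment counts because they tend to be heavily right-skewed. A t-test or a rank-based test would be reasonable alternatives.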




### Exploratory 2

#### Business Goals

Determine the correlation between the number of comments and the score of a post.

#### Technical Proposals

For each selected subreddit, calculate the correlation between post score and number of comments. Perform hypothesis tests for statistical significance.
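
The per-subreddit correlation could be sketched as follows. This is a minimal pure-Python illustration: the `posts` rows are made up, and `pearson` is a hypothetical helper. On the Spark DataFrames used later in this notebook, `DataFrame.stat.corr("score", "num_comments")` on each subreddit's subset would likely be the more practical route.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient, or None when it is undefined."""
    n = len(xs)
    if n < 2 or n != len(ys):
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # a constant column has no defined correlation
    return cov / sqrt(var_x * var_y)


# Made-up (subreddit, score, num_comments) rows, for illustration only
posts = [
    ("AskMen", 2, 8), ("AskMen", 8, 20), ("AskMen", 4, 11), ("AskMen", 1, 5),
    ("tifu", 10, 3), ("tifu", 3, 9), ("tifu", 7, 6), ("tifu", 1, 12),
]

# Collect the score and comment-count columns separately for each subreddit
grouped = defaultdict(lambda: ([], []))
for subreddit, score, num_comments in posts:
    grouped[subreddit][0].append(score)
    grouped[subreddit][1].append(num_comments)

correlations = {name: pearson(scores, counts) for name, (scores, counts) in grouped.items()}
for name, r in correlations.items():
    print(f"{name}: r = {r:.3f}")
```

Returning `None` for constant columns avoids a division by zero in small subreddits where every sampled post has the same score.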


### Exploratory 3

#### Business Goals

Determine the times of the day when posts typically receive the most engagement.

#### Technical Proposals

Plot comment counts over time, aggregated by hour of day, to identify peak engagement periods.
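
Before plotting, the comments need to be bucketed by hour. The sketch below shows one way to do that in plain Python. The first six timestamps are taken from the `created_utc` values shown in the comments preview later in this notebook; the last one is made up for contrast. `comments_per_hour` is a hypothetical helper name.

```python
from collections import Counter
from datetime import datetime


def comments_per_hour(timestamps):
    """Return a 24-element list with the number of comments in each hour of the day."""
    counts = Counter(ts.hour for ts in timestamps)
    return [counts.get(hour, 0) for hour in range(24)]


timestamps = [
    datetime(2021, 1, 28, 15, 18, 53),
    datetime(2021, 1, 28, 15, 19, 23),
    datetime(2021, 1, 28, 15, 19, 59),
    datetime(2021, 1, 28, 15, 22, 42),
    datetime(2021, 1, 28, 15, 26, 24),
    datetime(2021, 1, 28, 15, 26, 33),
    datetime(2021, 1, 29, 3, 5, 0),  # made-up value, for illustration only
]

hourly = comments_per_hour(timestamps)
for hour, n in enumerate(hourly):
    if n:
        print(f"{hour:02d}:00 UTC -> {n} comments")
```

Note that `created_utc` is in UTC, so the peak hour reflects UTC time rather than the local time of the users posting.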

## 2. EDA

### Bucket checks

In [2]:
!aws s3 ls


2023-08-29 23:43:16 sagemaker-studio-692960231031-wo7kgoszj2g
2023-08-29 23:50:01 sagemaker-us-east-1-692960231031
2023-08-30 00:34:21 vad49
2023-09-16 16:02:10 vad49-labdata


In [65]:
#!aws s3 ls s3://vad49/project_lowercase_test/
!aws s3 ls s3://project17-bucket-alex/project_jan2021/

#!aws s3 cp s3://project17-bucket-alex/eda_ideas.txt -

                           PRE comments/
                           PRE submissions/


### Setup

In [91]:
from IPython.core.display import HTML
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import col, length, isnan, when, count
import pandas as pd
pd.set_option('display.max_colwidth', None)  # None means unlimited
pd.set_option('display.width', None)
pd.set_option('display.max_columns', None)


In [None]:
# Setup - Run only once per Kernel App
%conda install openjdk -y

# install PySpark
%pip install pyspark==3.2.0 s3fs pyarrow

# restart kernel
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

In [None]:
# Import pyspark and build Spark session

spark = (
    SparkSession.builder.appName("PySparkApp")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.2")
    .config(
        "fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.ContainerCredentialsProvider",
    )
    .getOrCreate()
)

print(spark.version)

### Bring in submissions and comments data

In [66]:
%%time
s3_path_submissions = "s3a://project17-bucket-alex/project_jan2021//submissions"
print(f"reading submissions from {s3_path_submissions}")

# Parquet files carry their own schema, so no header option is needed
submissions = spark.read.parquet(s3_path_submissions)


reading submissions from s3a://project17-bucket-alex/project_jan2021//submissions
CPU times: user 416 µs, sys: 4.13 ms, total: 4.55 ms
Wall time: 567 ms


In [67]:
%%time
s3_path_comments = "s3a://project17-bucket-alex/project_jan2021//comments"
print(f"reading comments from {s3_path_comments}")

# Parquet files carry their own schema, so no header option is needed
comments = spark.read.parquet(s3_path_comments)


reading comments from s3a://project17-bucket-alex/project_jan2021//comments
CPU times: user 4.03 ms, sys: 1.93 ms, total: 5.95 ms
Wall time: 515 ms


In [68]:
submissions_small = submissions.sample(withReplacement=False, fraction=0.01, seed=42)
comments_small = comments.sample(withReplacement=False, fraction=0.01, seed=42)


In [69]:
# choose which DataFrames the rest of the notebook works on

use_small = True  # to easily swap between the small and full dfs
submissions_active = submissions_small if use_small else submissions
comments_active = comments_small if use_small else comments

### 2.1 Report on the basic info about your dataset. What are the interesting columns? What is the schema? How many rows do you have? etc. etc.


In [70]:
print(f"shape of the submissions dataframe is {submissions_active.count():,}x{len(submissions_active.columns)}")
print(f"shape of the comments dataframe is {comments_active.count():,}x{len(comments_active.columns)}")


                                                                                

shape of the submissions dataframe is 2,628x68
shape of the comments dataframe is 41,245x21


                                                                                

Submissions

In [71]:
submissions_active.printSchema()

root
 |-- adserver_click_url: string (nullable = true)
 |-- adserver_imp_pixel: string (nullable = true)
 |-- archived: boolean (nullable = true)
 |-- author: string (nullable = true)
 |-- author_cakeday: boolean (nullable = true)
 |-- author_flair_css_class: string (nullable = true)
 |-- author_flair_text: string (nullable = true)
 |-- author_id: string (nullable = true)
 |-- brand_safe: boolean (nullable = true)
 |-- contest_mode: boolean (nullable = true)
 |-- created_utc: timestamp (nullable = true)
 |-- crosspost_parent: string (nullable = true)
 |-- crosspost_parent_list: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- approved_at_utc: string (nullable = true)
 |    |    |-- approved_by: string (nullable = true)
 |    |    |-- archived: boolean (nullable = true)
 |    |    |-- author: string (nullable = true)
 |    |    |-- author_flair_css_class: string (nullable = true)
 |    |    |-- author_flair_text: string (nullable = true)
 |    |    

In [92]:
# display a subset of columns
display(submissions_active.select("subreddit", "author", "title", "selftext", "created_utc", "num_comments").limit(20).toPandas())

                                                                                

Unnamed: 0,subreddit,author,title,selftext,created_utc,num_comments
0,NoStupidQuestions,techsavvynerd91,How do you prevent your glasses from falling off during a fight?,"For those that wear glasses and have managed to keep them on during a fight, how did you do that? How did they not come off within the first 5 seconds?",2021-01-02 00:11:05,8
1,relationship_advice,irtriated,Boyfriend too loud,"My (33F) boyfriend (37M) and I live in a 2-br apartment. Together a long time. \nIt seems like every night we have the same argument.\nFor example right now I'm in the living room, trying to watch tv. He is in the bedroom playing video games with friends. \nliving room and bedroom are far enough from each other that if someone watches TV you can't hear it in the bedroom.\nProblem is, he speaks so loudly that I can hear his every last word and exclamation. He also insists on wearing over the ear headphones that exacerbates the problem - in them he doesn't hear how loud he is. (but he's very loud even without the headphones).\nWhen I ask him to keep it down because it's irritating and I can""t enjoy my show, he says to get over it.\nI have tried to not mind it but it drives me crazy, it's like forcing me to pay attention to two things at once. I can't seem to tune it out. My request probably irritates him because he thinks I'm acting a princess and should get over it. I feel like he could me more considerate and on board to figure something out for both of us, but he really won't even entertain it. I really don't know how to get through to him.",2021-01-23 05:40:24,2
2,relationship_advice,RegretSubstantial365,Lonely in a relationship,"I've been with this guy for about 10 months, he is super busy all of the time with multiple businesses. He is always on the phone or physically working. Which working for a living is great but it's overkill. During covid, his ex took him to court for custody stuff and it really stressed him out. They finally had the final hearing and all of it supposed to be done with. I've mentioned several times that we really don't see each other anymore, barely talk on the phone and I'm getting bitter about it. I try to tell him how I feel and he gets aggravated because he doesn't want to talk about that. I was really upset this weekend because he chose not to hang out and go work on a house he is fixing for his friend. I expected at least one day. I was really upset and talked to him about it and after about 45 minutes, he said okay I think 45 minutes is enough time to talk. 🤷🏻‍♀️ he tells me he loves me and has thought about us living Together. But he doesn't act that way. It hurt me so bad. I get my feelings hurt easily so that's why I'm trying to get feedback. I feel like I can't talk to him about anything, that I'm always a bother and I'm asking him to do things he clearly doesn't want to do. He says he doesn't want to break up and that things are fine with him . But I'm kinda miserable because he seems like a part time boyfriend. I keep hoping he will do things differently but it's not happening. My friends keep telling me to wait and give him more time. I've cried tonight just thinking about letting him go because only one of us is getting what we want essentially. I feel really comfy with him and when we are together, it's great. But I look a lot towards the future and I don't see him wanting anything like that. Any ideas or help on what I should do or am I being over emotional?",2021-01-25 03:56:10,7
3,relationship_advice,[deleted],How would you tell your bestfriend she’s dating a guy who has red flags without diacouraging her?,"My bestfriend went to a date yesterday with a brazilian guy yesterday. She went to his house and saw a knife under his bed. He was living alone in this apartment. She knew it was for his safety and protection but she said the dude was joking around “You know I’m a dangerous guy. I can kill you”. He put away the knife back in the kitchen. They had a good night. She said she was really having fun with him. The only thing was whenever she said she’ll go home, the dude would sit on her lap. He’s really heavy she said so she couldn’t go anywhere. \n\nI adviced my bestfriend if the dude still have a some “meh” traits on the next few dates. I think she should stop seeing her just give him another chance cause I could see she had fun with that guy. I don’t want to discouraged her. \n\nShe said she was kind scared with his “kill you” joke but she would still continue seeing him that’s why I said gave it another chance. Did I really do the right thing?",2021-01-23 16:19:24,5
4,NoStupidQuestions,CallMeSpoofy,Is it possible to get bleach out of clothes?,In my case it’s not one big stain but little splashes,2021-01-23 16:38:59,6
5,unpopularopinion,SweetCuddlyFeline,A grilled cheese sandwich (or a cheese toasty as our UK cousins call it) is just any kind of cheese and bread and NOTHING ELSE,"A grilled cheese sandwich is just cheese and bread. Not cheese and ham, not cheese and chicken, not cheese and turkey, not cheese and veggies. Once you add something other than cheese it’s called a MELT. A turkey and cheese melt, a ham and cheese melt, etc.",2021-01-23 16:42:15,15
6,relationship_advice,ivegotquestions3,Need advice on getting married before the ceremony for health insurance,I just got engaged in December and my fiancé and I are planning on legally getting married before the actual wedding. I’ve always wanted a NYE wedding and because of covid we decided next year is too soon and we’re going to wait until 2022. \n\nMy health insurance sucks though and I’m on a medicine that isn’t even covered so in total I spend over $600 a month on the insurance and medicine. If I were on his insurance it would be about $175 a month for the insurance and medicine. And it would include vision and dental which I don’t currently have. \n\nMy parents are aware of the idea and are completely on board with it. The idea has also been brought up to his parents and they did not seem to have a problem with it. Everyone knows that would we save so much money by doing so and that there are a lot of other things that the money could be spent on. I should also mention that the other benefit of getting married is that I deeply love this man and am thankful everyday that he’s with me. It isn’t just about the health insurance! \n\nWhat we’re unsure about is if we should have a small ceremony and include our immediate family? Or if we just go down to the courthouse and not tell anyone? We don’t want to make a huge deal about it and have it take away from the wedding with all of our loved ones in 2022. I know it’s just a piece of paper but I do feel like it has a lot of significance behind and I don’t want to regret how we go about it.\n\nHas anyone done this and/or have any advice?,2021-01-10 15:27:47,4
7,unpopularopinion,Thesaurus123,People who drink black coffee are just as bad as people who drink IPA.,They both love to tell people the bitter taste of their beverage is better than the literal thousands of better tasting options available to them. They will also ridicule the people who like to drink their beverage with some additive that makes it not taste like a butthole.,2021-01-08 23:04:20,17
8,relationship_advice,rickyrudd7,Need help understanding this one girl...,"Hi all! I have been talking with girl for quite awhile now, but haven’t been on an official date yet. We matched online and we started talking, we exchanged phone numbers, social media accounts etc. I have talked to her over the phone multiple times and for hours. I have enjoyed talking to her and I think she does too. I can tell she has a strong personality and doesn’t want to reveal herself. I have actually asked her out couple times, but something came up each time. One time she was out of town, another time she was over to her friend’s house. She didn’t lie to me cause I have her Instagram account and I can confirm that in both cases she was telling the truth. Actually there was another time that I asked her out, and that time she told me that she had problems with her car, so I offered myself to go pick her up, but she didn’t reply to me that day so I assumed she didn’t want to.\n\nAt this point I started to ignore her.. but she always came back to me. She watches every single story I post on Instagram, and most of the time she replies too. I didn’t talk to her for about ten days last time around, and boom she sent me a message saying that we should go out for an hike. I told her that I had to confirm to her since it was around the holidays ( New Years week) and don’t know if I had time for that particular day. Once I got back to her and told her that I was actually free that day and I could go out with her, she saw the message and didn’t reply tough... Couple days ago, once again she sees my story on Instagram and replies to me.. I don’t know if she is that insecure or she likes to be chased or what. She also told me before this last episode that she hopes to see me in 2021... what do you do of her ? She likes me ? She is playing games ? She is just insecure ? 
To be honest if we were not in a pandemic, I probably don’t think I would have waited all this time on her.",2021-01-08 23:06:33,4
9,antiwork,[deleted],When does it end?,"I'm over here thinking about how expensive it is just to live alone....like holy fucking shit! Basic needs feels impossible to achieve. Maybe I'm just lazy, or maybe I'm just depressed...but man I'm lucky to live with family, but still I hate being stuck with no indepence. I wish I could just be free with my own car, house, money, etc. Don't get wrong, i work only two days a week early mornings, and I love it!\n\nbut I dread full-time work! Seriously where does it end!? ....And just trying to make a career is tiresome. It's funny how we're trapped in this boring endless dystopia where everything is just grey, boring....how can anyone like this!?",2021-01-03 00:12:47,6


In [73]:
submissions_active.groupby('subreddit').count().show()





+-------------------+-----+
|          subreddit|count|
+-------------------+-----+
|     TrueOffMyChest|  117|
|   unpopularopinion|  416|
|           antiwork|   22|
|       socialskills|   42|
|             AskMen|  140|
|      AmItheAsshole|  325|
|relationship_advice|  665|
|  explainlikeimfive|  162|
|       OutOfTheLoop|   58|
|               tifu|   64|
|  NoStupidQuestions|  509|
|           AskWomen|  108|
+-------------------+-----+



                                                                                

Comments

In [74]:
comments_active.printSchema()

root
 |-- author: string (nullable = true)
 |-- author_cakeday: boolean (nullable = true)
 |-- author_flair_css_class: string (nullable = true)
 |-- author_flair_text: string (nullable = true)
 |-- body: string (nullable = true)
 |-- can_gild: boolean (nullable = true)
 |-- controversiality: long (nullable = true)
 |-- created_utc: timestamp (nullable = true)
 |-- distinguished: string (nullable = true)
 |-- edited: string (nullable = true)
 |-- gilded: long (nullable = true)
 |-- id: string (nullable = true)
 |-- is_submitter: boolean (nullable = true)
 |-- link_id: string (nullable = true)
 |-- parent_id: string (nullable = true)
 |-- permalink: string (nullable = true)
 |-- retrieved_on: timestamp (nullable = true)
 |-- score: long (nullable = true)
 |-- stickied: boolean (nullable = true)
 |-- subreddit: string (nullable = true)
 |-- subreddit_id: string (nullable = true)



In [75]:
# display a subset of columns
comments_active.select("subreddit", "author", "body", "parent_id", "link_id", "id", "created_utc").show()

+-------------------+-----------------+--------------------+----------+---------+-------+-------------------+
|          subreddit|           author|                body| parent_id|  link_id|     id|        created_utc|
+-------------------+-----------------+--------------------+----------+---------+-------+-------------------+
|   unpopularopinion| stonedironworker|Every man you’ve ...|t1_gl3ljhf|t3_l6okyk|gl3msfm|2021-01-28 15:18:53|
|             AskMen|        [deleted]|That’s what I’m s...|t1_gl3i3ry|t3_l6wmz9|gl3mwdx|2021-01-28 15:19:23|
|   unpopularopinion|     slimezillaaa|Sounds like more ...| t3_l6okyk|t3_l6okyk|gl3n144|2021-01-28 15:19:59|
|   unpopularopinion|chanaandeler_bong|Why does that mat...|t1_gl3n5rb|t3_l6okyk|gl3nmln|2021-01-28 15:22:42|
|       OutOfTheLoop|    AutoModerator|**PLEASE READ ALL...| t3_l6zjdh|t3_l6zjdh|gl3ofzn|2021-01-28 15:26:24|
|      AmItheAsshole|     jenettabrown|So your dad is ba...| t3_l6vkwf|t3_l6vkwf|gl3oh5p|2021-01-28 15:26:33|
|   unpopu

### 2.2 Conduct basic data quality checks! Make sure there are no missing values, check the length of the comments, and remove rows of data that might be corrupted. Even if you think all your data is perfect, you still need to demonstrate that with your analysis.
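
A comment-length check can be sketched in plain Python as below. This is a sketch under stated assumptions: the rows are made up, `is_usable_comment` is a hypothetical helper, and treating the placeholder bodies `[deleted]` and `[removed]` as unusable is an assumption rather than a rule taken from the assignment. In PySpark the same rule could likely be expressed with `length(col("body"))` inside a `filter`.

```python
# Bodies that Reddit leaves behind when a comment is deleted or removed
PLACEHOLDER_BODIES = {"[deleted]", "[removed]"}


def is_usable_comment(body, min_length=1):
    """Return True when the body is present, non-empty, and not a placeholder."""
    if body is None:
        return False
    text = body.strip()
    return len(text) >= min_length and text not in PLACEHOLDER_BODIES


# Made-up rows, for illustration only
comments = [
    {"id": "a1", "body": "Sounds like more of a personal problem."},
    {"id": "a2", "body": "[removed]"},
    {"id": "a3", "body": "   "},
    {"id": "a4", "body": None},
    {"id": "a5", "body": "ok"},
]

kept = [row for row in comments if is_usable_comment(row["body"])]
lengths = {row["id"]: len(row["body"].strip()) for row in kept}
print(f"kept {len(kept)} of {len(comments)} comments: {lengths}")
```

Raising `min_length` would also discard very short comments, which may or may not be desirable depending on the analysis.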



In [76]:

def check_and_remove_missing(df: DataFrame, threshold: int = 100) -> DataFrame:
    """Print per-column null counts, drop columns with more than `threshold` nulls, and return the result."""

    def count_missing(frame: DataFrame) -> dict:
        # Count null values in every column with a single aggregation
        missing_values = frame.select([count(when(col(c).isNull(), c)).alias(c) for c in frame.columns])
        return missing_values.collect()[0].asDict()

    # Show the missing values count for each column
    missing_values_collected = count_missing(df)
    print("Missing values in each column:")
    for column, missing_count in missing_values_collected.items():
        print(f"{column}: {missing_count}")

    # Identify columns with missing values above threshold
    columns_to_drop = [column for column, missing_count in missing_values_collected.items() if missing_count > threshold]

    # Drop the identified columns from the dataframe
    df = df.drop(*columns_to_drop)

    # Recalculate missing values for the updated DataFrame
    missing_values_collected = count_missing(df)

    # Print updated missing values count
    print("Missing values after column removal:")
    for column, missing_count in missing_values_collected.items():
        print(f"{column}: {missing_count}")

    # Report the new shape
    print(f"Shape after column removal: ({df.count()}, {len(df.columns)})")

    # Return the cleaned DataFrame so the caller can reassign it
    return df
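
The default `threshold=100` is an absolute count, so it is much stricter on the 41,245-row comments sample than on the 2,628-row submissions sample. A fraction-based rule is one alternative. The sketch below is only an illustration: `columns_to_drop_by_fraction` and `max_missing_fraction` are hypothetical names, and the counts are a subset of the comments output recorded in this notebook.

```python
def columns_to_drop_by_fraction(missing_counts, total_rows, max_missing_fraction=0.5):
    """Return the columns whose share of missing values exceeds the given fraction."""
    if total_rows <= 0:
        return []
    return [
        column
        for column, missing in missing_counts.items()
        if missing / total_rows > max_missing_fraction
    ]


# Subset of the missing-value counts reported for the comments sample
comments_missing = {
    "author": 0,
    "author_cakeday": 41104,
    "author_flair_css_class": 40417,
    "author_flair_text": 35282,
    "body": 0,
    "distinguished": 40003,
}

to_drop = columns_to_drop_by_fraction(comments_missing, total_rows=41245)
print(to_drop)
```

With a fraction the same cutoff keeps its meaning when switching between the sampled and full DataFrames.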



In [77]:
submissions_active = check_and_remove_missing(submissions_active)

# Print the updated DataFrame shape
print(f"Shape: ({submissions_active.count()}, {len(submissions_active.columns)})")

# Show the first 5 rows of the updated DataFrame
submissions_active.show(5)

                                                                                

Missing values in each column:
adserver_click_url: 2628
adserver_imp_pixel: 2628
archived: 0
author: 0
author_cakeday: 2623
author_flair_css_class: 2608
author_flair_text: 2586
author_id: 2628
brand_safe: 2628
contest_mode: 0
created_utc: 0
crosspost_parent: 2626
crosspost_parent_list: 2626
disable_comments: 2628
distinguished: 2627
domain: 0
domain_override: 2628
edited: 0
embed_type: 2628
embed_url: 2628
gilded: 0
hidden: 0
hide_score: 0
href_url: 2628
id: 0
imp_pixel: 2628
is_crosspostable: 0
is_reddit_media_domain: 0
is_self: 0
is_video: 0
link_flair_css_class: 2095
link_flair_text: 2019
locked: 0
media: 2626
media_embed: 0
mobile_ad_url: 2628
num_comments: 0
num_crossposts: 0
original_link: 2628
over_18: 0
parent_whitelist_status: 0
permalink: 0
pinned: 0
post_hint: 2609
preview: 2609
promoted: 2628
promoted_by: 2628
promoted_display_name: 2628
promoted_url: 2628
retrieved_on: 2628
score: 0
secure_media: 2626
secure_media_embed: 0
selftext: 0
spoiler: 0
stickied: 0
subreddit: 0
su

                                                                                

Missing values after column removal:
archived: 0
author: 0
contest_mode: 0
created_utc: 0
domain: 0
edited: 0
gilded: 0
hidden: 0
hide_score: 0
id: 0
is_crosspostable: 0
is_reddit_media_domain: 0
is_self: 0
is_video: 0
locked: 0
media_embed: 0
num_comments: 0
num_crossposts: 0
over_18: 0
parent_whitelist_status: 0
permalink: 0
pinned: 0
score: 0
secure_media_embed: 0
selftext: 0
spoiler: 0
stickied: 0
subreddit: 0
subreddit_id: 0
thumbnail: 0
title: 0
url: 0
whitelist_status: 0


                                                                                

Shape after column removal: (2628, 33)
+--------+--------------------+------------+-------------------+--------------------+------+------+------+----------+------+----------------+----------------------+-------+--------+------+--------------------+------------+--------------+-------+-----------------------+--------------------+------+-----+--------------------+--------------------+-------+--------+-------------------+------------+---------+--------------------+--------------------+----------------+
|archived|              author|contest_mode|        created_utc|              domain|edited|gilded|hidden|hide_score|    id|is_crosspostable|is_reddit_media_domain|is_self|is_video|locked|         media_embed|num_comments|num_crossposts|over_18|parent_whitelist_status|           permalink|pinned|score|  secure_media_embed|            selftext|spoiler|stickied|          subreddit|subreddit_id|thumbnail|               title|                 url|whitelist_status|
+--------+--------------------+

                                                                                

Shape: (2628, 33)
+--------+--------------------+------------+-------------------+--------------------+------+------+------+----------+------+----------------+----------------------+-------+--------+------+--------------------+------------+--------------+-------+-----------------------+--------------------+------+-----+--------------------+--------------------+-------+--------+-------------------+------------+---------+--------------------+--------------------+----------------+
|archived|              author|contest_mode|        created_utc|              domain|edited|gilded|hidden|hide_score|    id|is_crosspostable|is_reddit_media_domain|is_self|is_video|locked|         media_embed|num_comments|num_crossposts|over_18|parent_whitelist_status|           permalink|pinned|score|  secure_media_embed|            selftext|spoiler|stickied|          subreddit|subreddit_id|thumbnail|               title|                 url|whitelist_status|
+--------+--------------------+------------+--------

In [88]:
submissions_active.printSchema()

root
 |-- archived: boolean (nullable = true)
 |-- author: string (nullable = true)
 |-- contest_mode: boolean (nullable = true)
 |-- created_utc: timestamp (nullable = true)
 |-- domain: string (nullable = true)
 |-- edited: string (nullable = true)
 |-- gilded: long (nullable = true)
 |-- hidden: boolean (nullable = true)
 |-- hide_score: boolean (nullable = true)
 |-- id: string (nullable = true)
 |-- is_crosspostable: boolean (nullable = true)
 |-- is_reddit_media_domain: boolean (nullable = true)
 |-- is_self: boolean (nullable = true)
 |-- is_video: boolean (nullable = true)
 |-- locked: boolean (nullable = true)
 |-- media_embed: struct (nullable = true)
 |    |-- content: string (nullable = true)
 |    |-- height: long (nullable = true)
 |    |-- scrolling: boolean (nullable = true)
 |    |-- width: long (nullable = true)
 |-- num_comments: long (nullable = true)
 |-- num_crossposts: long (nullable = true)
 |-- over_18: boolean (nullable = true)
 |-- parent_whitelist_status: st

In [89]:
# Cast boolean columns to integers (true -> 1, false -> 0)
bool_cols = [
    "archived",
    "contest_mode",
    "hidden",
    "hide_score",
    "is_crosspostable",
    "is_reddit_media_domain",
    "is_self",
    "is_video",
    "locked",
    "over_18",
    "pinned",
    "spoiler",
    "stickied",
]

for c in bool_cols:
    submissions_active = submissions_active.withColumn(c, col(c).cast("integer"))

    
    

In [94]:
display(submissions_active.limit(5).toPandas())

Unnamed: 0,archived,author,contest_mode,created_utc,domain,edited,gilded,hidden,hide_score,id,is_crosspostable,is_reddit_media_domain,is_self,is_video,locked,media_embed,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,score,secure_media_embed,selftext,spoiler,stickied,subreddit,subreddit_id,thumbnail,title,url,whitelist_status
0,0,techsavvynerd91,0,2021-01-02 00:11:05,self.NoStupidQuestions,false,0,0,0,komhp2,1,0,1,0,0,"(None, None, None, None)",8,0,0,all_ads,/r/NoStupidQuestions/comments/komhp2/how_do_you_prevent_your_glasses_from_falling_off/,0,2,"(None, None, None, None, None)","For those that wear glasses and have managed to keep them on during a fight, how did you do that? How did they not come off within the first 5 seconds?",0,0,NoStupidQuestions,t5_2w844,self,How do you prevent your glasses from falling off during a fight?,https://www.reddit.com/r/NoStupidQuestions/comments/komhp2/how_do_you_prevent_your_glasses_from_falling_off/,all_ads
1,0,irtriated,0,2021-01-23 05:40:24,self.relationship_advice,false,0,0,0,l3684b,1,0,1,0,0,"(None, None, None, None)",2,0,0,all_ads,/r/relationship_advice/comments/l3684b/boyfriend_too_loud/,0,8,"(None, None, None, None, None)","My (33F) boyfriend (37M) and I live in a 2-br apartment. Together a long time. \nIt seems like every night we have the same argument.\nFor example right now I'm in the living room, trying to watch tv. He is in the bedroom playing video games with friends. \nliving room and bedroom are far enough from each other that if someone watches TV you can't hear it in the bedroom.\nProblem is, he speaks so loudly that I can hear his every last word and exclamation. He also insists on wearing over the ear headphones that exacerbates the problem - in them he doesn't hear how loud he is. (but he's very loud even without the headphones).\nWhen I ask him to keep it down because it's irritating and I can""t enjoy my show, he says to get over it.\nI have tried to not mind it but it drives me crazy, it's like forcing me to pay attention to two things at once. I can't seem to tune it out. My request probably irritates him because he thinks I'm acting a princess and should get over it. I feel like he could me more considerate and on board to figure something out for both of us, but he really won't even entertain it. I really don't know how to get through to him.",0,0,relationship_advice,t5_2r0cn,self,Boyfriend too loud,https://www.reddit.com/r/relationship_advice/comments/l3684b/boyfriend_too_loud/,all_ads
2,0,RegretSubstantial365,0,2021-01-25 03:56:10,self.relationship_advice,false,0,0,0,l4fdip,1,0,1,0,0,"(None, None, None, None)",7,0,0,all_ads,/r/relationship_advice/comments/l4fdip/lonely_in_a_relationship/,0,2,"(None, None, None, None, None)","I've been with this guy for about 10 months, he is super busy all of the time with multiple businesses. He is always on the phone or physically working. Which working for a living is great but it's overkill. During covid, his ex took him to court for custody stuff and it really stressed him out. They finally had the final hearing and all of it supposed to be done with. I've mentioned several times that we really don't see each other anymore, barely talk on the phone and I'm getting bitter about it. I try to tell him how I feel and he gets aggravated because he doesn't want to talk about that. I was really upset this weekend because he chose not to hang out and go work on a house he is fixing for his friend. I expected at least one day. I was really upset and talked to him about it and after about 45 minutes, he said okay I think 45 minutes is enough time to talk. 🤷🏻‍♀️ he tells me he loves me and has thought about us living Together. But he doesn't act that way. It hurt me so bad. I get my feelings hurt easily so that's why I'm trying to get feedback. I feel like I can't talk to him about anything, that I'm always a bother and I'm asking him to do things he clearly doesn't want to do. He says he doesn't want to break up and that things are fine with him . But I'm kinda miserable because he seems like a part time boyfriend. I keep hoping he will do things differently but it's not happening. My friends keep telling me to wait and give him more time. I've cried tonight just thinking about letting him go because only one of us is getting what we want essentially. I feel really comfy with him and when we are together, it's great. But I look a lot towards the future and I don't see him wanting anything like that. 
Any ideas or help on what I should do or am I being over emotional?",0,0,relationship_advice,t5_2r0cn,self,Lonely in a relationship,https://www.reddit.com/r/relationship_advice/comments/l4fdip/lonely_in_a_relationship/,all_ads
3,0,[deleted],0,2021-01-23 16:19:24,self.relationship_advice,1.611419127E9,0,0,0,l3f16q,1,0,1,0,0,"(None, None, None, None)",5,0,0,all_ads,/r/relationship_advice/comments/l3f16q/how_would_you_tell_your_bestfriend_shes_dating_a/,0,4,"(None, None, None, None, None)","My bestfriend went to a date yesterday with a brazilian guy yesterday. She went to his house and saw a knife under his bed. He was living alone in this apartment. She knew it was for his safety and protection but she said the dude was joking around “You know I’m a dangerous guy. I can kill you”. He put away the knife back in the kitchen. They had a good night. She said she was really having fun with him. The only thing was whenever she said she’ll go home, the dude would sit on her lap. He’s really heavy she said so she couldn’t go anywhere. \n\nI adviced my bestfriend if the dude still have a some “meh” traits on the next few dates. I think she should stop seeing her just give him another chance cause I could see she had fun with that guy. I don’t want to discouraged her. \n\nShe said she was kind scared with his “kill you” joke but she would still continue seeing him that’s why I said gave it another chance. Did I really do the right thing?",0,0,relationship_advice,t5_2r0cn,self,How would you tell your bestfriend she’s dating a guy who has red flags without diacouraging her?,https://www.reddit.com/r/relationship_advice/comments/l3f16q/how_would_you_tell_your_bestfriend_shes_dating_a/,all_ads
4,0,CallMeSpoofy,0,2021-01-23 16:38:59,self.NoStupidQuestions,false,0,0,0,l3ff0a,1,0,1,0,0,"(None, None, None, None)",6,0,0,all_ads,/r/NoStupidQuestions/comments/l3ff0a/is_it_possible_to_get_bleach_out_of_clothes/,0,1,"(None, None, None, None, None)",In my case it’s not one big stain but little splashes,0,0,NoStupidQuestions,t5_2w844,self,Is it possible to get bleach out of clothes?,https://www.reddit.com/r/NoStupidQuestions/comments/l3ff0a/is_it_possible_to_get_bleach_out_of_clothes/,all_ads


In [78]:
comments_active = check_and_remove_missing(comments_active)

# Print the updated DataFrame shape
print(f"Shape: ({comments_active.count()}, {len(comments_active.columns)})")

# Show the first 5 rows of the updated DataFrame
comments_active.show(5)

                                                                                

Missing values in each column:
author: 0
author_cakeday: 41104
author_flair_css_class: 40417
author_flair_text: 35282
body: 0
can_gild: 0
controversiality: 0
created_utc: 0
distinguished: 40003
edited: 0
gilded: 0
id: 0
is_submitter: 0
link_id: 0
parent_id: 0
permalink: 0
retrieved_on: 0
score: 0
stickied: 0
subreddit: 0
subreddit_id: 0


                                                                                

Missing values after column removal:
author: 0
body: 0
can_gild: 0
controversiality: 0
created_utc: 0
edited: 0
gilded: 0
id: 0
is_submitter: 0
link_id: 0
parent_id: 0
permalink: 0
retrieved_on: 0
score: 0
stickied: 0
subreddit: 0
subreddit_id: 0


                                                                                

Shape after column removal: (41245, 17)
+-----------------+--------------------+--------+----------------+-------------------+----------+------+-------+------------+---------+----------+--------------------+-------------------+-----+--------+----------------+------------+
|           author|                body|can_gild|controversiality|        created_utc|    edited|gilded|     id|is_submitter|  link_id| parent_id|           permalink|       retrieved_on|score|stickied|       subreddit|subreddit_id|
+-----------------+--------------------+--------+----------------+-------------------+----------+------+-------+------------+---------+----------+--------------------+-------------------+-----+--------+----------------+------------+
| stonedironworker|Every man you’ve ...|    true|               0|2021-01-28 15:18:53|     false|     0|gl3msfm|       false|t3_l6okyk|t1_gl3ljhf|/r/unpopularopini...|2021-05-29 06:07:54|   41|   false|unpopularopinion|    t5_2tk0s|
|        [deleted]|That’s wh

                                                                                

Shape: (41245, 17)
+-----------------+--------------------+--------+----------------+-------------------+----------+------+-------+------------+---------+----------+--------------------+-------------------+-----+--------+----------------+------------+
|           author|                body|can_gild|controversiality|        created_utc|    edited|gilded|     id|is_submitter|  link_id| parent_id|           permalink|       retrieved_on|score|stickied|       subreddit|subreddit_id|
+-----------------+--------------------+--------+----------------+-------------------+----------+------+-------+------------+---------+----------+--------------------+-------------------+-----+--------+----------------+------------+
| stonedironworker|Every man you’ve ...|    true|               0|2021-01-28 15:18:53|     false|     0|gl3msfm|       false|t3_l6okyk|t1_gl3ljhf|/r/unpopularopini...|2021-05-29 06:07:54|   41|   false|unpopularopinion|    t5_2tk0s|
|        [deleted]|That’s what I’m s...|   false|

In [79]:


display(submissions_active.select("subreddit", "author", "title", "selftext", "created_utc", "num_comments").limit(20).toPandas())

Unnamed: 0,subreddit,author,title,selftext,created_utc,num_comments
0,NoStupidQuestions,techsavvynerd91,How do you prevent your glasses from falling off during a fight?,"For those that wear glasses and have managed to keep them on during a fight, how did you do that? How did they not come off within the first 5 seconds?",2021-01-02 00:11:05,8
1,relationship_advice,irtriated,Boyfriend too loud,"My (33F) boyfriend (37M) and I live in a 2-br apartment. Together a long time. \nIt seems like every night we have the same argument.\nFor example right now I'm in the living room, trying to watch tv. He is in the bedroom playing video games with friends. \nliving room and bedroom are far enough from each other that if someone watches TV you can't hear it in the bedroom.\nProblem is, he speaks so loudly that I can hear his every last word and exclamation. He also insists on wearing over the ear headphones that exacerbates the problem - in them he doesn't hear how loud he is. (but he's very loud even without the headphones).\nWhen I ask him to keep it down because it's irritating and I can""t enjoy my show, he says to get over it.\nI have tried to not mind it but it drives me crazy, it's like forcing me to pay attention to two things at once. I can't seem to tune it out. My request probably irritates him because he thinks I'm acting a princess and should get over it. I feel like he could me more considerate and on board to figure something out for both of us, but he really won't even entertain it. I really don't know how to get through to him.",2021-01-23 05:40:24,2
2,AskWomen,[deleted],How do I stop feeling like a pathetic loser when I’m the only single person out of all my friends?,[removed],2021-01-23 05:54:21,1
3,AskMen,[deleted],Should I make a move?,[removed],2021-01-08 15:16:04,0
4,relationship_advice,RegretSubstantial365,Lonely in a relationship,"I've been with this guy for about 10 months, he is super busy all of the time with multiple businesses. He is always on the phone or physically working. Which working for a living is great but it's overkill. During covid, his ex took him to court for custody stuff and it really stressed him out. They finally had the final hearing and all of it supposed to be done with. I've mentioned several times that we really don't see each other anymore, barely talk on the phone and I'm getting bitter about it. I try to tell him how I feel and he gets aggravated because he doesn't want to talk about that. I was really upset this weekend because he chose not to hang out and go work on a house he is fixing for his friend. I expected at least one day. I was really upset and talked to him about it and after about 45 minutes, he said okay I think 45 minutes is enough time to talk. 🤷🏻‍♀️ he tells me he loves me and has thought about us living Together. But he doesn't act that way. It hurt me so bad. I get my feelings hurt easily so that's why I'm trying to get feedback. I feel like I can't talk to him about anything, that I'm always a bother and I'm asking him to do things he clearly doesn't want to do. He says he doesn't want to break up and that things are fine with him . But I'm kinda miserable because he seems like a part time boyfriend. I keep hoping he will do things differently but it's not happening. My friends keep telling me to wait and give him more time. I've cried tonight just thinking about letting him go because only one of us is getting what we want essentially. I feel really comfy with him and when we are together, it's great. But I look a lot towards the future and I don't see him wanting anything like that. Any ideas or help on what I should do or am I being over emotional?",2021-01-25 03:56:10,7
5,NoStupidQuestions,[deleted],Is it weird that I like watching all these ‘iceberg’ videos on YouTube?,[deleted],2021-01-25 03:58:33,0
6,NoStupidQuestions,[deleted],ight is there any intense workouts/stretches/ anything sweat - inducing that I can do lying down besides masturbation?,[deleted],2021-01-25 04:00:50,3
7,TrueOffMyChest,[deleted],Society should change attitudes about feminizing males. It can be traumatic,[deleted],2021-01-25 04:19:38,3
8,NoStupidQuestions,[deleted],"Can a person survive from eating nothing but pussy? If not, what nutrients are missing?",[removed],2021-01-25 04:28:45,3
9,AskWomen,[deleted],Women- what to do when when you're stressed out and your partner doesn't have the answers?,[removed],2021-01-25 04:34:01,1


Submissions without a body should obviously go. That includes submissions whose self text is deleted, removed, or empty. We can keep submissions where the author is empty.

In [80]:
from pyspark.sql.functions import col

def clean_submissions(df: DataFrame) -> DataFrame:
    """
    Removes submissions from a DataFrame where the 'selftext' column is null, '[removed]', '[deleted]', or an empty string.

    Parameters:
    df (DataFrame): The PySpark DataFrame to clean.

    Returns:
    DataFrame: The cleaned PySpark DataFrame.
    """
    
    # Define the conditions a row must satisfy to be kept:
    # selftext is present and is neither a removal placeholder nor an empty string
    conditions = col('selftext').isNotNull() & ~col('selftext').isin("[removed]", "[deleted]", "")

    # Apply the filter
    cleaned_df = df.filter(conditions)
    
    return cleaned_df

# Usage:
submissions_active = clean_submissions(submissions_active)
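
Before relying on the filter, it can help to see why rows are dropped. The following is a minimal plain-Python sketch, not part of the Spark pipeline: `removal_reason` and `tally_removals` are hypothetical helper names, and the sample list is a made-up mix of one real `selftext` value from the preview above and the placeholder cases. It mirrors the keep/drop rule of `clean_submissions` on a small list of strings.

```python
from collections import Counter
from typing import Iterable, Optional

# Placeholder strings that appear in place of a missing post body
PLACEHOLDERS = {"[removed]", "[deleted]"}


def removal_reason(selftext: Optional[str]) -> Optional[str]:
    """Return why a row would be dropped, or None if it would be kept."""
    if selftext is None:
        return "null"
    if selftext == "":
        return "empty"
    if selftext in PLACEHOLDERS:
        return selftext
    return None


def tally_removals(values: Iterable[Optional[str]]) -> Counter:
    """Count dropped rows by reason; kept rows are counted under 'kept'."""
    return Counter(removal_reason(value) or "kept" for value in values)


# Illustrative sample only (assumption): one real body plus each drop case
sample = [
    "In my case it’s not one big stain but little splashes",
    "[removed]",
    "[deleted]",
    "[removed]",
    "",
    None,
]
print(tally_removals(sample))
```

A tally like this, computed on a small collected sample, gives a quick sense of how much data the cleaning step discards and for which reason.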



In [87]:


# After cleaning, to display 20 entries for the specific columns:
display(submissions_active.select("subreddit", "author", "title", "selftext", "created_utc", "num_comments", "media_embed", "over_18").limit(20).toPandas())

                                                                                

Unnamed: 0,subreddit,author,title,selftext,created_utc,num_comments,media_embed,over_18
0,NoStupidQuestions,techsavvynerd91,How do you prevent your glasses from falling off during a fight?,"For those that wear glasses and have managed to keep them on during a fight, how did you do that? How did they not come off within the first 5 seconds?",2021-01-02 00:11:05,8,"(None, None, None, None)",0
1,relationship_advice,irtriated,Boyfriend too loud,"My (33F) boyfriend (37M) and I live in a 2-br apartment. Together a long time. \nIt seems like every night we have the same argument.\nFor example right now I'm in the living room, trying to watch tv. He is in the bedroom playing video games with friends. \nliving room and bedroom are far enough from each other that if someone watches TV you can't hear it in the bedroom.\nProblem is, he speaks so loudly that I can hear his every last word and exclamation. He also insists on wearing over the ear headphones that exacerbates the problem - in them he doesn't hear how loud he is. (but he's very loud even without the headphones).\nWhen I ask him to keep it down because it's irritating and I can""t enjoy my show, he says to get over it.\nI have tried to not mind it but it drives me crazy, it's like forcing me to pay attention to two things at once. I can't seem to tune it out. My request probably irritates him because he thinks I'm acting a princess and should get over it. I feel like he could me more considerate and on board to figure something out for both of us, but he really won't even entertain it. I really don't know how to get through to him.",2021-01-23 05:40:24,2,"(None, None, None, None)",0
2,relationship_advice,RegretSubstantial365,Lonely in a relationship,"I've been with this guy for about 10 months, he is super busy all of the time with multiple businesses. He is always on the phone or physically working. Which working for a living is great but it's overkill. During covid, his ex took him to court for custody stuff and it really stressed him out. They finally had the final hearing and all of it supposed to be done with. I've mentioned several times that we really don't see each other anymore, barely talk on the phone and I'm getting bitter about it. I try to tell him how I feel and he gets aggravated because he doesn't want to talk about that. I was really upset this weekend because he chose not to hang out and go work on a house he is fixing for his friend. I expected at least one day. I was really upset and talked to him about it and after about 45 minutes, he said okay I think 45 minutes is enough time to talk. 🤷🏻‍♀️ he tells me he loves me and has thought about us living Together. But he doesn't act that way. It hurt me so bad. I get my feelings hurt easily so that's why I'm trying to get feedback. I feel like I can't talk to him about anything, that I'm always a bother and I'm asking him to do things he clearly doesn't want to do. He says he doesn't want to break up and that things are fine with him . But I'm kinda miserable because he seems like a part time boyfriend. I keep hoping he will do things differently but it's not happening. My friends keep telling me to wait and give him more time. I've cried tonight just thinking about letting him go because only one of us is getting what we want essentially. I feel really comfy with him and when we are together, it's great. But I look a lot towards the future and I don't see him wanting anything like that. Any ideas or help on what I should do or am I being over emotional?",2021-01-25 03:56:10,7,"(None, None, None, None)",0
3,relationship_advice,[deleted],How would you tell your bestfriend she’s dating a guy who has red flags without diacouraging her?,"My bestfriend went to a date yesterday with a brazilian guy yesterday. She went to his house and saw a knife under his bed. He was living alone in this apartment. She knew it was for his safety and protection but she said the dude was joking around “You know I’m a dangerous guy. I can kill you”. He put away the knife back in the kitchen. They had a good night. She said she was really having fun with him. The only thing was whenever she said she’ll go home, the dude would sit on her lap. He’s really heavy she said so she couldn’t go anywhere. \n\nI adviced my bestfriend if the dude still have a some “meh” traits on the next few dates. I think she should stop seeing her just give him another chance cause I could see she had fun with that guy. I don’t want to discouraged her. \n\nShe said she was kind scared with his “kill you” joke but she would still continue seeing him that’s why I said gave it another chance. Did I really do the right thing?",2021-01-23 16:19:24,5,"(None, None, None, None)",0
4,NoStupidQuestions,CallMeSpoofy,Is it possible to get bleach out of clothes?,In my case it’s not one big stain but little splashes,2021-01-23 16:38:59,6,"(None, None, None, None)",0
5,unpopularopinion,SweetCuddlyFeline,A grilled cheese sandwich (or a cheese toasty as our UK cousins call it) is just any kind of cheese and bread and NOTHING ELSE,"A grilled cheese sandwich is just cheese and bread. Not cheese and ham, not cheese and chicken, not cheese and turkey, not cheese and veggies. Once you add something other than cheese it’s called a MELT. A turkey and cheese melt, a ham and cheese melt, etc.",2021-01-23 16:42:15,15,"(None, None, None, None)",0
6,relationship_advice,ivegotquestions3,Need advice on getting married before the ceremony for health insurance,I just got engaged in December and my fiancé and I are planning on legally getting married before the actual wedding. I’ve always wanted a NYE wedding and because of covid we decided next year is too soon and we’re going to wait until 2022. \n\nMy health insurance sucks though and I’m on a medicine that isn’t even covered so in total I spend over $600 a month on the insurance and medicine. If I were on his insurance it would be about $175 a month for the insurance and medicine. And it would include vision and dental which I don’t currently have. \n\nMy parents are aware of the idea and are completely on board with it. The idea has also been brought up to his parents and they did not seem to have a problem with it. Everyone knows that would we save so much money by doing so and that there are a lot of other things that the money could be spent on. I should also mention that the other benefit of getting married is that I deeply love this man and am thankful everyday that he’s with me. It isn’t just about the health insurance! \n\nWhat we’re unsure about is if we should have a small ceremony and include our immediate family? Or if we just go down to the courthouse and not tell anyone? We don’t want to make a huge deal about it and have it take away from the wedding with all of our loved ones in 2022. I know it’s just a piece of paper but I do feel like it has a lot of significance behind and I don’t want to regret how we go about it.\n\nHas anyone done this and/or have any advice?,2021-01-10 15:27:47,4,"(None, None, None, None)",0
7,unpopularopinion,Thesaurus123,People who drink black coffee are just as bad as people who drink IPA.,They both love to tell people the bitter taste of their beverage is better than the literal thousands of better tasting options available to them. They will also ridicule the people who like to drink their beverage with some additive that makes it not taste like a butthole.,2021-01-08 23:04:20,17,"(None, None, None, None)",0
8,relationship_advice,rickyrudd7,Need help understanding this one girl...,"Hi all! I have been talking with girl for quite awhile now, but haven’t been on an official date yet. We matched online and we started talking, we exchanged phone numbers, social media accounts etc. I have talked to her over the phone multiple times and for hours. I have enjoyed talking to her and I think she does too. I can tell she has a strong personality and doesn’t want to reveal herself. I have actually asked her out couple times, but something came up each time. One time she was out of town, another time she was over to her friend’s house. She didn’t lie to me cause I have her Instagram account and I can confirm that in both cases she was telling the truth. Actually there was another time that I asked her out, and that time she told me that she had problems with her car, so I offered myself to go pick her up, but she didn’t reply to me that day so I assumed she didn’t want to.\n\nAt this point I started to ignore her.. but she always came back to me. She watches every single story I post on Instagram, and most of the time she replies too. I didn’t talk to her for about ten days last time around, and boom she sent me a message saying that we should go out for an hike. I told her that I had to confirm to her since it was around the holidays ( New Years week) and don’t know if I had time for that particular day. Once I got back to her and told her that I was actually free that day and I could go out with her, she saw the message and didn’t reply tough... Couple days ago, once again she sees my story on Instagram and replies to me.. I don’t know if she is that insecure or she likes to be chased or what. She also told me before this last episode that she hopes to see me in 2021... what do you do of her ? She likes me ? She is playing games ? She is just insecure ? 
To be honest if we were not in a pandemic, I probably don’t think I would have waited all this time on her.",2021-01-08 23:06:33,4,"(None, None, None, None)",0
9,antiwork,[deleted],When does it end?,"I'm over here thinking about how expensive it is just to live alone....like holy fucking shit! Basic needs feels impossible to achieve. Maybe I'm just lazy, or maybe I'm just depressed...but man I'm lucky to live with family, but still I hate being stuck with no indepence. I wish I could just be free with my own car, house, money, etc. Don't get wrong, i work only two days a week early mornings, and I love it!\n\nbut I dread full-time work! Seriously where does it end!? ....And just trying to make a career is tiresome. It's funny how we're trapped in this boring endless dystopia where everything is just grey, boring....how can anyone like this!?",2021-01-03 00:12:47,6,"(None, None, None, None)",0


Now on to the comments.

In [82]:
display(comments_active.select("subreddit", "author", "body", "parent_id", "link_id", "id", "created_utc").limit(20).toPandas())

Unnamed: 0,subreddit,author,body,parent_id,link_id,id,created_utc
0,unpopularopinion,stonedironworker,"Every man you’ve ever slept with has watched porn, i don’t give a fuck what they tell you",t1_gl3ljhf,t3_l6okyk,gl3msfm,2021-01-28 15:18:53
1,AskMen,[deleted],"That’s what I’m saying man. Where are the men who are getting stopped by women going “wow you’re handsome” LMAO NOT IN THIS FUCKIN LIFETIME\n\n(I gotta edit this in now) I was walking into the gym and 2 girls in a Honda pulled up rolled the window down and said “Sorry we were just staring at you” How do you reply to that ? All I could think about is how this situation would look like if I got in the car with a guy friend, pull up on a stranger (girl) say sorry we were just staring at you like WHAT THE FUUUUUCK ?!?!?",t1_gl3i3ry,t3_l6wmz9,gl3mwdx,2021-01-28 15:19:23
2,unpopularopinion,slimezillaaa,Sounds like more of a insecurity rather than a porn problem,t3_l6okyk,t3_l6okyk,gl3n144,2021-01-28 15:19:59
3,unpopularopinion,chanaandeler_bong,"Why does that matter. In either form you aren’t interacting with the person, fictional or “real.”",t1_gl3n5rb,t3_l6okyk,gl3nmln,2021-01-28 15:22:42
4,OutOfTheLoop,AutoModerator,"**PLEASE READ ALL OF THIS BEFORE MESSAGING US:**\n\nThank you for your submission, but it has been removed due to lack of context. \n\n**Please see Rule 2** and repost with a **URL or a screenshot** (you can host it on imgur.com) **in your submission body** so that the users can better know what you're talking about and increase your chances for getting an answer. **Feel free to repost once you find a URL.** (Please remember to include the full URL: URL shorteners don't count.) Thanks!\n\n*I am a bot, and this action was performed automatically. Please [contact the moderators of this subreddit](/message/compose/?to=/r/OutOfTheLoop) if you have any questions or concerns.*",t3_l6zjdh,t3_l6zjdh,gl3ofzn,2021-01-28 15:26:24
5,AmItheAsshole,jenettabrown,So your dad is basically saying that your mom could have hidden you fired and your kids could have been taken away but because that DIDN'T happen you shouldn't be mad at her?...I completely get why your brother went NC.\n\nNTA,t3_l6vkwf,t3_l6vkwf,gl3oh5p,2021-01-28 15:26:33
6,unpopularopinion,[deleted],[deleted],t1_gl3n332,t3_l6ze3h,gl3ohtf,2021-01-28 15:26:38
7,relationship_advice,Shay-Bird,"No, and id say that im husky. Most of my weight is muscle.",t1_ghwj3xf,t3_kpdx7d,ghwlums,2021-01-03 05:41:21
8,AskMen,filmcowlel,Your statistically wrong. No wonder your boyfriend watches yiff,t1_ghuzm8g,t3_kozhbc,ghwlxq5,2021-01-03 05:42:02
9,antiwork,[deleted],[deleted],t1_ghwk58x,t3_kpdqli,ghwlzhj,2021-01-03 05:42:24


Let's do the same for the body of the comments.

In [83]:
def clean_comments(df: DataFrame) -> DataFrame:
    """
    Removes comments from a DataFrame where the 'body' column is null, '[removed]', '[deleted]', or an empty string.

    Parameters:
    df (DataFrame): The PySpark DataFrame to clean.

    Returns:
    DataFrame: The cleaned PySpark DataFrame.
    """
    
    # Define the conditions a row must satisfy to be kept:
    # body is present and is neither a removal placeholder nor an empty string
    conditions = col('body').isNotNull() & ~col('body').isin("[removed]", "[deleted]", "")

    # Apply the filter
    cleaned_df = df.filter(conditions)
    
    return cleaned_df

# Usage:
comments_active = clean_comments(comments_active)



In [100]:
comments_active.printSchema()

root
 |-- author: string (nullable = true)
 |-- body: string (nullable = true)
 |-- can_gild: boolean (nullable = true)
 |-- controversiality: long (nullable = true)
 |-- created_utc: timestamp (nullable = true)
 |-- edited: string (nullable = true)
 |-- gilded: long (nullable = true)
 |-- id: string (nullable = true)
 |-- is_submitter: boolean (nullable = true)
 |-- link_id: string (nullable = true)
 |-- parent_id: string (nullable = true)
 |-- permalink: string (nullable = true)
 |-- retrieved_on: timestamp (nullable = true)
 |-- score: long (nullable = true)
 |-- stickied: boolean (nullable = true)
 |-- subreddit: string (nullable = true)
 |-- subreddit_id: string (nullable = true)



In [110]:
# Cast the boolean flag columns to integers (true -> 1, false -> 0)
for flag_column in ["can_gild", "stickied", "is_submitter"]:
    comments_active = comments_active.withColumn(flag_column, col(flag_column).cast("integer"))



In [113]:
display(comments_active.limit(5).toPandas())

Unnamed: 0,author,body,can_gild,controversiality,created_utc,edited,gilded,id,is_submitter,link_id,parent_id,permalink,retrieved_on,score,stickied,subreddit,subreddit_id
0,stonedironworker,"Every man you’ve ever slept with has watched porn, i don’t give a fuck what they tell you",1,0,2021-01-28 15:18:53,false,0,gl3msfm,0,t3_l6okyk,t1_gl3ljhf,/r/unpopularopinion/comments/l6okyk/i_dont_want_to_be_in_a_relationship_with_someone/gl3msfm/,2021-05-29 06:07:54,41,0,unpopularopinion,t5_2tk0s
1,[deleted],"That’s what I’m saying man. Where are the men who are getting stopped by women going “wow you’re handsome” LMAO NOT IN THIS FUCKIN LIFETIME\n\n(I gotta edit this in now) I was walking into the gym and 2 girls in a Honda pulled up rolled the window down and said “Sorry we were just staring at you” How do you reply to that ? All I could think about is how this situation would look like if I got in the car with a guy friend, pull up on a stranger (girl) say sorry we were just staring at you like WHAT THE FUUUUUCK ?!?!?",0,0,2021-01-28 15:19:23,1611870280,0,gl3mwdx,0,t3_l6wmz9,t1_gl3i3ry,/r/AskMen/comments/l6wmz9/men_of_reddit_whats_the_first_thing_you_think_if/gl3mwdx/,2021-05-29 06:08:43,101,0,AskMen,t5_2s30g
2,slimezillaaa,Sounds like more of a insecurity rather than a porn problem,1,0,2021-01-28 15:19:59,false,0,gl3n144,0,t3_l6okyk,t3_l6okyk,/r/unpopularopinion/comments/l6okyk/i_dont_want_to_be_in_a_relationship_with_someone/gl3n144/,2021-05-29 06:10:08,1,0,unpopularopinion,t5_2tk0s
3,chanaandeler_bong,"Why does that matter. In either form you aren’t interacting with the person, fictional or “real.”",1,1,2021-01-28 15:22:42,false,0,gl3nmln,0,t3_l6okyk,t1_gl3n5rb,/r/unpopularopinion/comments/l6okyk/i_dont_want_to_be_in_a_relationship_with_someone/gl3nmln/,2021-05-29 06:14:35,0,0,unpopularopinion,t5_2tk0s
4,AutoModerator,"**PLEASE READ ALL OF THIS BEFORE MESSAGING US:**\n\nThank you for your submission, but it has been removed due to lack of context. \n\n**Please see Rule 2** and repost with a **URL or a screenshot** (you can host it on imgur.com) **in your submission body** so that the users can better know what you're talking about and increase your chances for getting an answer. **Feel free to repost once you find a URL.** (Please remember to include the full URL: URL shorteners don't count.) Thanks!\n\n*I am a bot, and this action was performed automatically. Please [contact the moderators of this subreddit](/message/compose/?to=/r/OutOfTheLoop) if you have any questions or concerns.*",1,0,2021-01-28 15:26:24,false,0,gl3ofzn,0,t3_l6zjdh,t3_l6zjdh,/r/OutOfTheLoop/comments/l6zjdh/what_is_actually_going_on_with_wallstreetbets_and/gl3ofzn/,2021-05-29 06:21:05,1,0,OutOfTheLoop,t5_2xinb


### 2.3 Produce at least 5 interesting graphs about your dataset. Think about the dimensions that are interesting for your Reddit data! There are millions of choices. Make sure your graphs are connected to your business questions.



### 2.4 Produce at least 3 interesting summary tables about your dataset. You can decide how to split up your data into categories, time slices, etc. There are infinite ways you can make summary statistics. Be unique, creative, and interesting!
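
As a starting point for one such table, here is a minimal sketch that summarizes `num_comments` per subreddit. It uses plain Python on the ten `(subreddit, num_comments)` pairs visible in the cleaned submissions preview above, so the numbers describe only that preview, not the full dataset. The function name `summarize_by_subreddit` is hypothetical; on the full data the same grouping would presumably be expressed with a Spark `groupBy("subreddit")` aggregation.

```python
from collections import defaultdict
from statistics import mean, median

# (subreddit, num_comments) pairs copied from the ten-row preview above
rows = [
    ("NoStupidQuestions", 8),
    ("relationship_advice", 2),
    ("relationship_advice", 7),
    ("relationship_advice", 5),
    ("NoStupidQuestions", 6),
    ("unpopularopinion", 15),
    ("relationship_advice", 4),
    ("unpopularopinion", 17),
    ("relationship_advice", 4),
    ("antiwork", 6),
]


def summarize_by_subreddit(pairs):
    """Group comment counts by subreddit and compute basic statistics."""
    grouped = defaultdict(list)
    for subreddit, num_comments in pairs:
        grouped[subreddit].append(num_comments)
    return {
        subreddit: {
            "posts": len(counts),
            "mean_comments": mean(counts),
            "median_comments": median(counts),
            "max_comments": max(counts),
        }
        for subreddit, counts in sorted(grouped.items())
    }


for subreddit, stats in summarize_by_subreddit(rows).items():
    print(subreddit, stats)
```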



### 2.5 Use data transformations to make AT LEAST 3 new variables that are relevant to your business questions. We cannot be more specific because this depends on your project and what you want to explore!
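
One candidate set of new variables, tied to the question of when posts receive the most engagement, is derived from `created_utc`. The sketch below is an assumption about what those variables could look like, not the final choice: `time_features` is a hypothetical helper, and it parses the timestamp format shown in the previews above using only the standard library.

```python
from datetime import datetime


def time_features(created_utc: str) -> dict:
    """Derive three time-based variables from a 'YYYY-MM-DD HH:MM:SS' timestamp."""
    ts = datetime.strptime(created_utc, "%Y-%m-%d %H:%M:%S")
    return {
        "hour_of_day": ts.hour,           # 0-23, in UTC
        "day_of_week": ts.weekday(),      # Monday = 0, Sunday = 6
        "is_weekend": ts.weekday() >= 5,  # Saturday or Sunday
    }


# Timestamps taken from the comment and submission previews above
print(time_features("2021-01-28 15:18:53"))
print(time_features("2021-01-23 16:19:24"))
```

Note that the hour is in UTC, so any interpretation in terms of local time of day would need a time zone assumption.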



### 2.6 Implement regex searches for specific keywords of interest to produce dummy variables and then make statistics that are related to your business questions. Note that you DO NOT have to do textual cleaning of the data at this point. The next assignment on NLP will focus on the textual cleaning and analysis aspect.
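
A minimal sketch of the idea, using Python's `re` module on three text snippets taken from the previews above. The keyword groups and the names `KEYWORD_PATTERNS`, `keyword_dummies`, and `share_flagged` are assumptions for illustration; the actual keywords should follow from the business questions.

```python
import re

# Hypothetical keyword groups (assumption): adjust to the business questions
KEYWORD_PATTERNS = {
    "mentions_covid": re.compile(r"\b(covid|pandemic)\b", re.IGNORECASE),
    "mentions_work": re.compile(r"\b(work|job|career)\b", re.IGNORECASE),
}


def keyword_dummies(text: str) -> dict:
    """Return one 0/1 dummy variable per keyword group."""
    return {
        name: int(bool(pattern.search(text)))
        for name, pattern in KEYWORD_PATTERNS.items()
    }


def share_flagged(texts, name: str) -> float:
    """Fraction of texts for which the given dummy variable equals 1."""
    flags = [keyword_dummies(text)[name] for text in texts]
    return sum(flags) / len(flags)


# Snippets copied from the submission previews above
texts = [
    "During covid, his ex took him to court for custody stuff",
    "And just trying to make a career is tiresome.",
    "In my case it’s not one big stain but little splashes",
]
for text in texts:
    print(keyword_dummies(text))
print(share_flagged(texts, "mentions_covid"))
```

The word boundaries (`\b`) mean that `work` does not match `working`; whether that is desirable depends on the question being asked.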



### 2.7 Find some type of external data to join onto your Reddit data. Don’t know what to pick? Consider a time-related dataset. Stock prices, game details over time, active users on a platform, sports scores, covid cases, etc., etc. While you may not need to join this external data with your entire dataset, you must have at least one analysis that connects to external data. You do not have to join the external data and analyze it yet, just find it.



### If you are planning to make any custom datasets that are derived from your Reddit data, make them now. These datasets might be graph-focused, or maybe they are time series focused, it is completely up to you!