# Milestone 1: Frame your analysis and EDA

## 1. Project Topics

### Exploratory 1

#### Business Goals

Determine if multimedia (videos, images) in a post affects user interaction.

#### Technical Proposals

Compare the distribution of comment counts for each type of post using box plots, and compare the group means. Perform hypothesis tests to check whether the differences are statistically significant.
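
One way to test for a difference between groups is a permutation test on the difference in mean comment counts. The sketch below is a minimal pure-Python illustration, not the method the project has committed to; the function name `permutation_test` and the toy comment counts are hypothetical.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign group labels at random and recompute the statistic
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical toy data: num_comments for posts with and without multimedia
media_posts = [12, 30, 25, 41, 18, 27, 35, 22]
text_posts = [5, 9, 14, 7, 11, 6, 10, 8]

p_value = permutation_test(media_posts, text_posts, n_permutations=2_000)
print(f"p-value: {p_value:.4f}")
```

A permutation test makes no normality assumption, which may suit comment counts since they tend to be heavily skewed. On the full dataset the groups would first need to be aggregated or sampled from the Spark DataFrame.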




### Exploratory 2

#### Business Goals

Determine how strongly the number of comments on a post correlates with its score.

#### Technical Proposals

Calculate the correlation between post score and number of comments within each of several selected subreddits. Perform hypothesis tests for statistical significance.
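
As an illustration of the calculation, the sketch below computes a Pearson correlation and an approximate two-sided p-value using Fisher's z-transform. It is a pure-Python sketch on hypothetical toy values; the helper names `pearson_r` and `fisher_z_p_value` are not from the project, and the approximation assumes more than three observations and a correlation strictly between -1 and 1.

```python
import math
from statistics import NormalDist, mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fisher_z_p_value(r, n):
    """Approximate two-sided p-value for H0: rho = 0 (needs n > 3, |r| < 1)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical toy data for one subreddit
scores = [1, 4, 10, 25, 3, 60, 8, 15, 2, 40]
num_comments = [0, 2, 7, 12, 1, 33, 5, 9, 3, 18]

r = pearson_r(scores, num_comments)
p = fisher_z_p_value(r, len(scores))
print(f"r = {r:.3f}, approximate p-value = {p:.2g}")
```

On the Spark DataFrame itself, `submissions.stat.corr("score", "num_comments")` returns the Pearson coefficient directly, assuming `score` is among the submissions columns, which the truncated schema output does not confirm.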


### Exploratory 3

#### Business Goals

Determine the times of the day when posts typically receive the most engagement.

#### Technical Proposals

Plot comment activity by hour of day to identify when posts receive the most engagement.
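
Before plotting, the timestamps need to be reduced to an hour-of-day bucket. The sketch below shows that aggregation in pure Python; the function name `mean_comments_by_hour` is hypothetical, the first three rows reuse values from the submissions preview, and the last row is made up for illustration.

```python
from collections import defaultdict
from datetime import datetime


def mean_comments_by_hour(posts):
    """Average num_comments per UTC hour from (created_utc, num_comments) pairs."""
    totals = defaultdict(lambda: [0, 0])  # hour -> [sum of comments, post count]
    for created_utc, num_comments in posts:
        hour = datetime.strptime(created_utc, "%Y-%m-%d %H:%M:%S").hour
        totals[hour][0] += num_comments
        totals[hour][1] += 1
    return {hour: s / n for hour, (s, n) in sorted(totals.items())}


posts = [
    ("2021-01-01 23:34:26", 5),
    ("2021-01-01 23:34:34", 1),
    ("2021-01-01 23:34:53", 1),
    ("2021-01-28 15:16:41", 4),  # hypothetical row
]

by_hour = mean_comments_by_hour(posts)
for hour, avg in by_hour.items():
    print(f"{hour:02d}:00 UTC  {avg:.2f} comments per post")
```

With Spark, the same bucket can be derived with `pyspark.sql.functions.hour` applied to `created_utc`, followed by a group-by. Note that `created_utc` is in UTC, so the hours may need shifting before they are interpreted as local time.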

## 2. EDA

### Bucket checks

In [3]:
!aws s3 ls


2023-08-29 23:43:16 sagemaker-studio-692960231031-wo7kgoszj2g
2023-08-29 23:50:01 sagemaker-us-east-1-692960231031
2023-08-30 00:34:21 vad49
2023-09-16 16:02:10 vad49-labdata


In [9]:
!aws s3 ls s3://vad49/project_lowercase_test/

                           PRE comments/
                           PRE submissions/


### Setup

In [None]:
# Setup - Run only once per Kernel App
%conda install openjdk -y

# install PySpark
%pip install pyspark==3.2.0 s3fs pyarrow

# restart kernel so the newly installed packages are picked up
from IPython.core.display import HTML
HTML("<script>Jupyter.notebook.kernel.restart()</script>")

In [None]:
# Import after the install step; importing pyspark before it is installed fails
from pyspark.sql import SparkSession

In [10]:
# Build the Spark session with S3 access through the hadoop-aws package

spark = (
    SparkSession.builder.appName("PySparkApp")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.2")
    .config(
        "fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.ContainerCredentialsProvider",
    )
    .getOrCreate()
)

print(spark.version)

3.2.0


### Bring in submissions and comments data

In [15]:
%%time
s3_path_submissions = "s3a://vad49/project_lowercase_test/submissions"
print(f"reading submissions from {s3_path_submissions}")

# Parquet files carry their own schema, so no header option is needed
submissions = spark.read.parquet(s3_path_submissions)


reading submissions from s3a://vad49/project_lowercase_test/submissions


23/11/07 18:50:55 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.

shape of the submissions dataframe is 251,492x68
CPU times: user 9.87 ms, sys: 8.14 ms, total: 18 ms
Wall time: 18.4 s

In [12]:
%%time
s3_path_comments = "s3a://vad49/project_lowercase_test/comments"
print(f"reading comments from {s3_path_comments}")

# Parquet files carry their own schema, so no header option is needed
comments = spark.read.parquet(s3_path_comments)


reading comments from s3a://vad49/project_lowercase_test/comments


23/11/07 18:48:36 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties

shape of the comments dataframe is 4,122,561x21
CPU times: user 34.6 ms, sys: 711 µs, total: 35.4 ms
Wall time: 1min 2s

### Report on the basic info about your dataset. What are the interesting columns? What is the schema? How many rows do you have? etc. etc.


In [None]:
print(f"shape of the submissions dataframe is {submissions.count():,}x{len(submissions.columns)}")
print(f"shape of the comments dataframe is {comments.count():,}x{len(comments.columns)}")


In [16]:
submissions.printSchema()

root
 |-- adserver_click_url: string (nullable = true)
 |-- adserver_imp_pixel: string (nullable = true)
 |-- archived: boolean (nullable = true)
 |-- author: string (nullable = true)
 |-- author_cakeday: boolean (nullable = true)
 |-- author_flair_css_class: string (nullable = true)
 |-- author_flair_text: string (nullable = true)
 |-- author_id: string (nullable = true)
 |-- brand_safe: boolean (nullable = true)
 |-- contest_mode: boolean (nullable = true)
 |-- created_utc: timestamp (nullable = true)
 |-- crosspost_parent: string (nullable = true)
 |-- crosspost_parent_list: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- approved_at_utc: string (nullable = true)
 |    |    |-- approved_by: string (nullable = true)
 |    |    |-- archived: boolean (nullable = true)
 |    |    |-- author: string (nullable = true)
 |    |    |-- author_flair_css_class: string (nullable = true)
 |    |    |-- author_flair_text: string (nullable = true)
 |    |    

In [17]:
# display a subset of columns
submissions.select("subreddit", "author", "title", "selftext", "created_utc", "num_comments").show()

+-------------------+-------------------+--------------------+--------------------+-------------------+------------+
|          subreddit|             author|               title|            selftext|        created_utc|num_comments|
+-------------------+-------------------+--------------------+--------------------+-------------------+------------+
|relationship_advice|          [deleted]|I feel like my pa...|           [deleted]|2021-01-01 23:34:26|           5|
|      AmItheAsshole|        WiseSeason0|AITA for asking m...|           [removed]|2021-01-01 23:34:34|           1|
|     TrueOffMyChest|    throwaway268828|I'm really tired ...|Hey everyone. The...|2021-01-01 23:34:53|           1|
|             AskMen|          [deleted]|Ditched by a guy ...|           [removed]|2021-01-01 23:34:56|           0|
|   unpopularopinion|          [deleted]|Marijuana is the ...|           [removed]|2021-01-01 23:34:58|           2|
|   unpopularopinion|          [deleted]|If the roads are ...|  

In [18]:
submissions.groupby('subreddit').count().show()

+-------------------+-----+
|          subreddit|count|
+-------------------+-----+
|     TrueOffMyChest|12086|
|   unpopularopinion|42196|
|           antiwork| 2388|
|       socialskills| 3575|
|             AskMen|14127|
|      AmItheAsshole|32084|
|relationship_advice|60643|
|  explainlikeimfive|14905|
|       OutOfTheLoop| 5865|
|               tifu| 5654|
|  NoStupidQuestions|48014|
|           AskWomen| 9955|
+-------------------+-----+

In [13]:
comments.printSchema()

root
 |-- author: string (nullable = true)
 |-- author_cakeday: boolean (nullable = true)
 |-- author_flair_css_class: string (nullable = true)
 |-- author_flair_text: string (nullable = true)
 |-- body: string (nullable = true)
 |-- can_gild: boolean (nullable = true)
 |-- controversiality: long (nullable = true)
 |-- created_utc: timestamp (nullable = true)
 |-- distinguished: string (nullable = true)
 |-- edited: string (nullable = true)
 |-- gilded: long (nullable = true)
 |-- id: string (nullable = true)
 |-- is_submitter: boolean (nullable = true)
 |-- link_id: string (nullable = true)
 |-- parent_id: string (nullable = true)
 |-- permalink: string (nullable = true)
 |-- retrieved_on: timestamp (nullable = true)
 |-- score: long (nullable = true)
 |-- stickied: boolean (nullable = true)
 |-- subreddit: string (nullable = true)
 |-- subreddit_id: string (nullable = true)



In [14]:
# display a subset of columns
comments.select("subreddit", "author", "body", "parent_id", "link_id", "id", "created_utc").show()

+-------------------+--------------------+--------------------+----------+---------+-------+-------------------+
|          subreddit|              author|                body| parent_id|  link_id|     id|        created_utc|
+-------------------+--------------------+--------------------+----------+---------+-------+-------------------+
|             AskMen|  AnaphoricReference|Going out of his ...| t3_l6ksq0|t3_l6ksq0|gl3matc|2021-01-28 15:16:41|
|   unpopularopinion|         ajsmith0429|My gf doesn't lik...| t3_l6okyk|t3_l6okyk|gl3mau1|2021-01-28 15:16:41|
|      AmItheAsshole|             laser50|A rather large po...| t3_l6u2dm|t3_l6u2dm|gl3mau5|2021-01-28 15:16:41|
|      AmItheAsshole|         BertTheNerd|An obvious NTA , ...| t3_l6vkwf|t3_l6vkwf|gl3mavc|2021-01-28 15:16:41|
|   unpopularopinion|TiaMaria-AndLucozade|Should be put dow...|t1_gl3m029|t3_l1e1a5|gl3mavt|2021-01-28 15:16:41|
|      AmItheAsshole|     throwRA_ShareIt|It's her choice, ...|t1_gl3lx60|t3_l6vpgn|gl3mb82|2021

### Conduct basic data quality checks! Make sure there are no missing values, check the length of the comments, and remove rows of data that might be corrupted. Even if you think all your data is perfect, you still need to demonstrate that with your analysis.
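
As a starting point, the checks named above (missing values, text length, suspicious rows) can be sketched on a handful of rows. This is a pure-Python illustration only: `quality_report` and `PLACEHOLDERS` are hypothetical names, and the sample rows are made up in the shape of the comments table. The `[deleted]` and `[removed]` markers are counted separately because they appear in the previews as non-null values that still carry no content.

```python
PLACEHOLDERS = {"[deleted]", "[removed]"}


def quality_report(rows, text_field="body"):
    """Count missing values per column, placeholder texts, and text lengths."""
    missing = {}
    placeholder_rows = 0
    lengths = []
    for row in rows:
        for column, value in row.items():
            if value is None:
                missing[column] = missing.get(column, 0) + 1
        text = row.get(text_field)
        if text in PLACEHOLDERS:
            placeholder_rows += 1
        elif text:
            lengths.append(len(text))
    return {
        "rows": len(rows),
        "missing": missing,
        "placeholder_rows": placeholder_rows,
        "min_length": min(lengths, default=0),
        "max_length": max(lengths, default=0),
    }


# Hypothetical rows shaped like the comments table
sample = [
    {"author": "laser50", "body": "A rather large po...", "score": 3},
    {"author": None, "body": "[removed]", "score": 1},
    {"author": "BertTheNerd", "body": None, "score": None},
]

report = quality_report(sample)
print(report)
```

On the full DataFrames the same counts would be computed with Spark aggregations (for example `Column.isNull()` combined with a sum) rather than by iterating over rows in Python.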



### Produce at least 5 interesting graphs about your dataset. Think about the dimensions that are interesting for your Reddit data! There are millions of choices. Make sure your graphs are connected to your business questions.



### Produce at least 3 interesting summary tables about your dataset. You can decide how to split up your data into categories, time slices, etc. There are infinite ways you can make summary statistics. Be unique, creative, and interesting!



### Use data transformations to make AT LEAST 3 new variables that are relevant to your business questions. We cannot be more specific because this depends on your project and what you want to explore!



### Implement regex searches for specific keywords of interest to produce dummy variables and then make statistics that are related to your business questions. Note, that you DO NOT have to do textual cleaning of the data at this point. The next assignment on NLP will focus on the textual cleaning and analysis aspect.
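
A minimal sketch of keyword-based dummy variables is shown below, tied to the multimedia question in Exploratory 1. The keyword lists, the column names `mentions_video` and `mentions_image`, and the helper `keyword_dummies` are all assumptions chosen for illustration, not a final keyword set.

```python
import re

# Hypothetical keyword patterns; word boundaries avoid matching inside longer words
KEYWORD_PATTERNS = {
    "mentions_video": re.compile(r"\b(video|youtube|clip)s?\b", re.IGNORECASE),
    "mentions_image": re.compile(r"\b(image|photo|picture|pic)s?\b", re.IGNORECASE),
}


def keyword_dummies(text):
    """Return a 0/1 flag per keyword group; missing text yields all zeros."""
    text = text or ""
    return {
        name: int(bool(pattern.search(text)))
        for name, pattern in KEYWORD_PATTERNS.items()
    }


examples = [
    "Watch this VIDEO before you judge",
    "I posted two pics of the receipt",
    "No media here, just a question",
    None,
]

for example in examples:
    print(example, "->", keyword_dummies(example))
```

In Spark, the equivalent flag can be built with `Column.rlike` and a cast to integer. Java's regex engine uses inline flags such as `(?i)` for case-insensitive matching instead of `re.IGNORECASE`.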



### Find some type of external data to join onto your Reddit data. Don’t know what to pick? Consider a time-related dataset. Stock prices, game details over time, active users on a platform, sports scores, covid cases, etc., etc. While you may not need to join this external data with your entire dataset, you must have at least one analysis that connects to external data. You do not have to join the external data and analyze it yet, just find it.



### If you are planning to make any custom datasets that are derived from your Reddit data, make them now. These datasets might be graph-focused, or maybe they are time series focused, it is completely up to you!