# EDA Question 3

For this portion of the project, we will attempt to answer the third question proposed in our original Business Goals:

**Business Goal 3:** Explore how the subreddit of the 2022 NCAA champion, the Kansas Jayhawks, varied over the course of the season. Identify any trends in post frequency or comment length that may correlate with on-court successes or failures. \
**Technical Proposal:** Use trend analysis of comment counts and comment lengths to gauge excitement over the course of the NCAA season. Correlate any major trends with outside data sets, including Jayhawks wins and losses or injuries to key players. Analyze any additional activity trends over the 2022 season before entering the NLP phase of the project.

As you will see throughout this exploration, the analysis began with only the r/jayhawks subreddit, but we expanded it to include the r/tarheels subreddit as well. This initial look at how these subreddits behave will feed into the other analyses that follow.
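As a rough illustration of the Technical Proposal, the sketch below compares the average daily comment count on win days with the average on loss days. It is not the project's pipeline: the function name `mean_count_by_outcome`, the dictionary inputs, and every value shown are hypothetical placeholders, and plain Python stands in for the Spark and pandas objects used later in the notebook.

```python
from statistics import mean


def mean_count_by_outcome(daily_counts, game_results):
    """Average the daily comment count separately for each game outcome.

    daily_counts maps a day label to a comment count.
    game_results maps a day label to an outcome such as "W" or "L".
    Days without a game, or without a count, are ignored.
    """
    grouped = {}
    for day, outcome in game_results.items():
        if day in daily_counts:
            grouped.setdefault(outcome, []).append(daily_counts[day])
    return {outcome: mean(values) for outcome, values in grouped.items()}


# Hypothetical placeholder values, for illustration only.
# These are NOT real r/jayhawks counts or real game results.
daily_counts = {"day_a": 120, "day_b": 80, "day_c": 200, "day_d": 60, "day_e": 15}
game_results = {"day_a": "W", "day_b": "L", "day_c": "W", "day_d": "L"}

summary = mean_count_by_outcome(daily_counts, game_results)
print(summary)
```

A difference in means like this is only a first look. It does not account for the opponent, the stakes of the game, or the time of the season.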

## Configuring and Cleaning the Data

This process will be the same across our EDA questions.

In [None]:
spark

In [None]:
workspace_default_storage_account = "aml6214727930"
workspace_default_container = "azureml-blobstore-6653633b-3460-4381-9199-d9e0f368353c"

workspace_wasbs_base_url = (
    f"wasbs://{workspace_default_container}@{workspace_default_storage_account}.blob.core.windows.net/"
)

In [None]:
#datastore = 'azureml://datastores/workspaceblobstore/paths'
comments_path = "/basketball_comments_sep.parquet"
submissions_path = "/basketball_submissions_sep.parquet"

comments = spark.read.parquet(f"{workspace_wasbs_base_url}{comments_path}")
submissions = spark.read.parquet(f"{workspace_wasbs_base_url}{submissions_path}")

In [None]:
submissions = submissions.cache()
comments = comments.cache()

In [None]:
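# Keep only submissions with real body text (not empty, deleted, or removed) and comments with a non-empty body.
submissions = submissions.filter((submissions.selftext != "") & (submissions.selftext != "[deleted]") & (submissions.selftext != "[removed]"))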
comments = comments.filter(comments.body != "")

## Showing Frequency of Comments and Submissions in r/jayhawks

In order to analyze these subreddits, we'll begin by diving into the comment and submission frequency for r/jayhawks.

In [None]:
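# Import only the functions used in this notebook; a wildcard import would shadow builtins such as sum, min, and max.
from pyspark.sql.functions import col, dayofmonth, month, year, lower, udf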
jayhawk_comments = (
    comments
    .filter(col("subreddit") == "jayhawks")
    .withColumn("day", dayofmonth(col("created_utc")))
    .withColumn("month", month(col("created_utc")))
    .withColumn("year", year(col("created_utc")))
    .groupBy("year", "month", "day")
    .count()
)
jayhawk_comments_df = jayhawk_comments.toPandas()

In [None]:
import matplotlib.pyplot as plt
import pandas as pd

jayhawk_comments_df['date'] = pd.to_datetime(jayhawk_comments_df[['year', 'month', 'day']])
jayhawk_comments_df = jayhawk_comments_df.sort_values(by='date') 
plt.figure(figsize=(8, 6))
plt.plot(jayhawk_comments_df['date'], jayhawk_comments_df['count'], marker='.', linestyle='-')
plt.xlabel('Date')
plt.ylabel('Count')
plt.title('Comment Count by Day (r/jayhawks)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

This shows us the comment frequency for r/jayhawks. 

*insert analysis here*
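One caveat when reading a daily line plot: `groupBy(...).count()` returns rows only for days that have at least one comment, so quiet days are missing rather than zero and the line is drawn straight across them. The sketch below shows one way to fill those gaps. The helper name `fill_missing_days` and the example values are hypothetical, and plain Python is used to keep it self-contained. In the notebook itself, a pandas reindex over a full date range would be the more natural route.

```python
from datetime import date, timedelta


def fill_missing_days(daily_counts):
    """Return (day, count) pairs for every calendar day between the first
    and last observed day, using 0 where no rows were recorded."""
    if not daily_counts:
        return []
    start, end = min(daily_counts), max(daily_counts)
    n_days = (end - start).days + 1
    filled = []
    for offset in range(n_days):
        day = start + timedelta(days=offset)
        filled.append((day, daily_counts.get(day, 0)))
    return filled


# Hypothetical counts for illustration only (not taken from the dataset).
example_counts = {date(2022, 4, 2): 40, date(2022, 4, 4): 95}
filled = fill_missing_days(example_counts)
for day, count in filled:
    print(day.isoformat(), count)
```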

Let's plot the daily count for submissions as well and see if we get a similar result.

In [None]:
jayhawk_submissions = (
    submissions
    .filter(col("subreddit") == "jayhawks")
    .withColumn("day", dayofmonth(col("created_utc")))
    .withColumn("month", month(col("created_utc")))
    .withColumn("year", year(col("created_utc")))
    .groupBy("year", "month", "day")
    .count()
)
jayhawk_submissions_df = jayhawk_submissions.toPandas()

In [None]:
jayhawk_submissions_df['date'] = pd.to_datetime(jayhawk_submissions_df[['year', 'month', 'day']])
jayhawk_submissions_df = jayhawk_submissions_df.sort_values(by='date') 
plt.figure(figsize=(8, 6))
plt.plot(jayhawk_submissions_df['date'], jayhawk_submissions_df['count'], marker='.', linestyle='-')
plt.xlabel('Date')
plt.ylabel('Count')
plt.title('Submission Count by Day (r/jayhawks)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Okay! Now we've completed the initial stages of our exploratory data analysis for r/jayhawks.

*insert analysis here*

Next, we repeat the same analysis for r/tarheels.

## Showing Frequency of Comments and Submissions in r/tarheels

As with r/jayhawks, we'll begin with the comment and submission frequency for r/tarheels.

In [None]:
tarheel_comments = (
    comments
    .filter(col("subreddit") == "tarheels")
    .withColumn("day", dayofmonth(col("created_utc")))
    .withColumn("month", month(col("created_utc")))
    .withColumn("year", year(col("created_utc")))
    .groupBy("year", "month", "day")
    .count()
)
tarheel_comments_df = tarheel_comments.toPandas()

In [None]:
tarheel_comments_df['date'] = pd.to_datetime(tarheel_comments_df[['year', 'month', 'day']])
tarheel_comments_df = tarheel_comments_df.sort_values(by='date') 
plt.figure(figsize=(8, 6))
plt.plot(tarheel_comments_df['date'], tarheel_comments_df['count'], marker='.', linestyle='-')
plt.xlabel('Date')
plt.ylabel('Count')
plt.title('Comment Count by Day (r/tarheels)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

That's a good start! It is interesting to compare this with the earlier r/jayhawks analysis.

*insert analysis here*

We'll do the same thing for submissions.

In [None]:
tarheel_submissions = (
    submissions
    .filter(col("subreddit") == "tarheels")
    .withColumn("day", dayofmonth(col("created_utc")))
    .withColumn("month", month(col("created_utc")))
    .withColumn("year", year(col("created_utc")))
    .groupBy("year", "month", "day")
    .count()
)
tarheel_submissions_df = tarheel_submissions.toPandas()

In [None]:
tarheel_submissions_df['date'] = pd.to_datetime(tarheel_submissions_df[['year', 'month', 'day']])
tarheel_submissions_df = tarheel_submissions_df.sort_values(by='date') 
plt.figure(figsize=(8, 6))
plt.plot(tarheel_submissions_df['date'], tarheel_submissions_df['count'], marker='.', linestyle='-')
plt.xlabel('Date')
plt.ylabel('Count')
plt.title('Submission Count by Day (r/tarheels)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Okay! That was great as well. Here is how that compares to the r/jayhawks subreddit:

*insert analysis*
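One way to set up a comparison between the two subreddits is to line up their daily counts on a shared set of dates, so that each day has a value for both. The sketch below assumes each series is a dictionary keyed by ISO date strings. The function name `align_daily_counts` and the sample values are hypothetical and are not drawn from the data.

```python
def align_daily_counts(counts_a, counts_b):
    """Return rows of (day, count_a, count_b) over the union of days,
    using 0 for a subreddit that has no rows on a given day."""
    days = sorted(set(counts_a) | set(counts_b))  # ISO date strings sort chronologically
    return [(day, counts_a.get(day, 0), counts_b.get(day, 0)) for day in days]


# Hypothetical values for illustration only (not real subreddit counts).
jayhawks_example = {"2022-04-01": 12, "2022-04-02": 30}
tarheels_example = {"2022-04-02": 25, "2022-04-03": 8}

aligned = align_daily_counts(jayhawks_example, tarheels_example)
for row in aligned:
    print(row)
```

With the series aligned, the two lines can be drawn on one set of axes, or divided by each subreddit's total to compare the shape of activity rather than raw volume.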

## Extra Analysis: Do Other Teams Appear on r/tarheels?

To take this analysis one step further, we wanted to find out whether other teams are mentioned on the r/tarheels subreddit. We chose r/tarheels because it has more mentions, which gives us a larger sample for detecting them. To align with our exploratory analysis in EDA Question 4, we selected a few major teams to search for. Our analysis is below.

In [None]:
other_team_keywords = {
    'houston': ['houston', 'cougars', 'uh'],
    'kansas': ['kansas', 'jayhawks', 'ku', 'rockchalk'],
    'villanova': ['villanova', 'wildcats', 'nova'],
    'duke': ['duke', 'blue devils'],
    'arkansas': ['arkansas', 'razorbacks', 'hogs', 'u of a'],
    'saint peters': ["saint peter's", 'peacocks', 'saint peters', "st. peter's", 'st peters', 'spu'],
    'miami': ['miami', 'hurricanes', 'um']
}

In [None]:
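# Rebuild from the Spark DataFrames: the pandas *_df objects above hold only daily counts and have no text columns.
tarheel_comments_df = comments.filter(col("subreddit") == "tarheels").withColumn('body', lower(col('body')))
tarheel_submissions_df = submissions.filter(col("subreddit") == "tarheels")\
                         .withColumn('title', lower(col('title')))\
                         .withColumn('selftext', lower(col('selftext')))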

In [None]:
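from pyspark.sql.types import BooleanType

def keyword_present(text, keywords):
    # True if any keyword occurs as a substring of text; None or empty text counts as no match.
    if text:
        return any(keyword in text for keyword in keywords)
    return False

keyword_udf = udf(keyword_present, BooleanType())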
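Substring matching is simple, but short keywords such as `'uh'`, `'ku'`, and `'um'` also occur inside ordinary words, which can inflate the mention counts. The sketch below shows a whole-word alternative built on Python's `re` module. The names `build_keyword_pattern` and `keyword_present_whole_word` are hypothetical and are not part of the original notebook, and the sample sentences are made up for illustration.

```python
import re


def build_keyword_pattern(keywords):
    """Compile one regex that matches any keyword as a whole word or phrase."""
    # Longer keywords come first so multi-word phrases are tried before shorter ones.
    ordered = sorted(keywords, key=len, reverse=True)
    alternatives = "|".join(re.escape(keyword) for keyword in ordered)
    # The lookarounds require that no word character touches the match on either side.
    return re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)")


def keyword_present_whole_word(text, pattern):
    """Return True if the lowercased text contains any keyword as a whole word."""
    return bool(text) and pattern.search(text) is not None


miami_pattern = build_keyword_pattern(["miami", "hurricanes", "um"])

# Made-up sentences for illustration only.
print("um" in "the number of turnovers")                                       # substring check
print(keyword_present_whole_word("the number of turnovers", miami_pattern))   # whole-word check
print(keyword_present_whole_word("um looked good tonight", miami_pattern))
```

Whole-word matching still cannot tell a team abbreviation from a filler word like "um", so very short keywords may be better dropped or reviewed by hand.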