# Data Cleaning Notebook

This notebook cleans and merges stock-related Reddit data and general Reddit posts.
The process is divided into the following sections:

1. Data Imports and Setup
2. Data Cleaning for Posts and Comments
3. Merging and Exporting the Final DataFrame


In [108]:
import pandas as pd
import numpy as np

###  1. Data Imports and Setup

In [109]:
# Posts
# Subreddit -- Wall Street Bets Posts
ws_stock_post = pd.read_csv('./reddit_stock/wallstreetbets/wallstreetbets_stocks_posts.csv') 
# Subreddit -- Stocks Posts
stocks_post = pd.read_csv('./reddit_stock/stocks/stocks_stocks_posts.csv')
# General Posts across subreddits
posts = pd.read_csv('./reddit_general/all_posts.csv')

# Comments
# Subreddit -- Wall Street Bets comments
wall_street_comments = pd.read_csv('./reddit_stock/wallstreetbets/wallstreetbets_stocks_comments.csv')
# Subreddit -- Stocks comments
stocks_comments = pd.read_csv('./reddit_stock/stocks/stocks_stocks_comments.csv')
# General comments across subreddits
comments = pd.read_csv('./reddit_general/all_comments.csv')

###  2. Data Cleaning for Posts and Comments

In [110]:
# Drop the rows that have no value in the 'search_term' column
ws_stock_post = ws_stock_post.dropna(subset=['search_term'])
stocks_post = stocks_post.dropna(subset=['search_term'])
posts = posts.dropna(subset=['search_term'])

In [111]:
# Making sure the 'score' column is numeric
wall_street_comments['score'] = pd.to_numeric(wall_street_comments['score'])
stocks_comments['score'] = pd.to_numeric(stocks_comments['score'])
comments['score'] = pd.to_numeric(comments['score'])
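
If any 'score' value cannot be parsed, `pd.to_numeric` raises an error by default. As a minimal sketch on made-up toy data (the `toy_comments` frame and its values are assumptions, not taken from the real files), `errors="coerce"` converts such values to NaN so they can be counted or dropped:

```python
import pandas as pd

# Hypothetical toy data: one score arrived as a non-numeric string.
toy_comments = pd.DataFrame(
    {"post_id": ["a", "a", "b"], "score": ["12", "n/a", "3"]}
)

# errors="coerce" turns unparseable values into NaN instead of raising.
toy_comments["score"] = pd.to_numeric(toy_comments["score"], errors="coerce")

# Count how many scores could not be parsed.
n_bad = int(toy_comments["score"].isna().sum())
```

Whether coercing is appropriate depends on the data; raising early, as the cell above does, is the stricter choice.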

For each comments DataFrame, I group the rows by the 'post_id' column.
Within each group, I sort the rows by 'score' in descending order and keep the top 10.
This way, each post keeps its highest-scoring comments, which adds context to our analysis later on.

In [112]:
# Group by 'post_id', sort by 'score' within each group, and get the head (top 10)
wall_street_comments = wall_street_comments.groupby('post_id').apply(lambda x: x.sort_values(by='score', ascending=False).head(10)).reset_index(drop=True)

# Group by 'post_id', sort by 'score' within each group, and get the head (top 10)
stocks_comments = stocks_comments.groupby('post_id').apply(lambda x: x.sort_values(by='score', ascending=False).head(10)).reset_index(drop=True)

# Group by 'post_id', sort by 'score' within each group, and get the head (top 10)
comments = comments.groupby('post_id').apply(lambda x: x.sort_values(by='score', ascending=False).head(10)).reset_index(drop=True)
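
The same top-N selection can also be sketched without `groupby(...).apply(...)`: sort once, then take the head of each group. The helper name `top_n_per_post` and the toy data below are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy comments: three for post "a", two for post "b".
toy = pd.DataFrame(
    {
        "post_id": ["a", "a", "a", "b", "b"],
        "score": [5, 9, 1, 2, 7],
        "body": ["c1", "c2", "c3", "c4", "c5"],
    }
)


def top_n_per_post(df, n=10):
    """Keep the n highest-scoring rows for each post_id."""
    return (
        df.sort_values("score", ascending=False)  # best comments first
        .groupby("post_id")
        .head(n)  # first n rows of each group, in sorted order
        .reset_index(drop=True)
    )


top2 = top_n_per_post(toy, n=2)
```

The rows come back ordered by score across all posts rather than grouped by post, but each post still keeps only its top comments, in descending score order.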

#### Each post_id has multiple comments. To keep them together, I collect each post's comments into a list so that every post_id ends up on a single row.

In [113]:

wall_street_comments = wall_street_comments.groupby('post_id')['body'].apply(list).reset_index()

stocks_comments = stocks_comments.groupby('post_id')['body'].apply(list).reset_index()

comments = comments.groupby('post_id')['body'].apply(list).reset_index()
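
As a small illustration of what this aggregation produces, here is a sketch on made-up toy data (the `toy` frame is an assumption, not the real dataset):

```python
import pandas as pd

# Hypothetical toy comments: two for post "a", one for post "b".
toy = pd.DataFrame(
    {"post_id": ["a", "a", "b"], "body": ["first", "second", "only"]}
)

# One row per post_id; 'body' now holds a Python list of that post's comments.
grouped = toy.groupby("post_id")["body"].apply(list).reset_index()

comments_for_a = grouped.loc[grouped["post_id"] == "a", "body"].iloc[0]
```

The order inside each list follows the row order of the input, so sorting by 'score' beforehand keeps the highest-scoring comment first.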

#### Let's now merge the posts and comments together

In [114]:
wallstreetbets = pd.merge(ws_stock_post, wall_street_comments, on='post_id', how='left')

stocks = pd.merge(stocks_post, stocks_comments, on='post_id', how='left')

general = pd.merge(posts, comments, on='post_id', how='left')
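
With `how='left'`, every post is kept, and posts without any matching comments get NaN in 'body'. A minimal sketch on toy data (the frames below are assumptions) uses the `indicator` option of `pd.merge` to count those posts:

```python
import pandas as pd

# Hypothetical toy data: post "b" has no comments.
toy_posts = pd.DataFrame({"post_id": ["a", "b"], "title": ["t1", "t2"]})
toy_comments = pd.DataFrame({"post_id": ["a"], "body": [["c1", "c2"]]})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left merge.
merged = pd.merge(
    toy_posts, toy_comments, on="post_id", how="left", indicator=True
)

posts_without_comments = int((merged["_merge"] == "left_only").sum())
```

The '_merge' column is only a diagnostic here and would be dropped before further processing.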

#### Making sure the datetime is well structured and the dates are uniform (removing hours, minutes, and seconds)

In [115]:
# Convert 'created_utc' to datetime and keep only the date
wallstreetbets['created_utc'] = pd.to_datetime(wallstreetbets['created_utc']).dt.date

stocks['created_utc'] = pd.to_datetime(stocks['created_utc']).dt.date

general['created_utc'] = pd.to_datetime(general['created_utc']).dt.date
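
One thing worth checking is how 'created_utc' is stored. If it holds Unix epoch seconds (an assumption; the CSV files may already contain date strings), `pd.to_datetime` needs `unit='s'`, because plain integers are otherwise read as nanoseconds. A sketch on toy values:

```python
import pandas as pd

# Hypothetical toy data: timestamps stored as Unix epoch seconds.
toy = pd.DataFrame({"created_utc": [1700000000, 1700086400]})

# Without a unit, integers are interpreted as nanoseconds since 1970-01-01.
naive_dates = pd.to_datetime(toy["created_utc"]).dt.date

# With unit="s", the values are read as seconds and give the intended dates.
toy["created_date"] = pd.to_datetime(
    toy["created_utc"], unit="s", utc=True
).dt.date
```

If the column already contains strings such as '2023-11-14 22:13:20', the call used in the cell above is sufficient.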

#### Dropping the columns that are not necessary for our analysis

In [116]:
columns_to_drop = ['is_self', 'permalink','url','category','flair']

wallstreetbets = wallstreetbets.drop(columns=columns_to_drop)

stocks = stocks.drop(columns=columns_to_drop)

# The general dataframe doesn't have the column 'category'
columns_to_drop.remove('category')
general = general.drop(columns=columns_to_drop)
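
An alternative that avoids editing the list for a DataFrame that lacks a column is `errors='ignore'`, which skips labels that are not present. A sketch on a made-up frame (`toy_general` is an assumption):

```python
import pandas as pd

columns_to_drop = ["is_self", "permalink", "url", "category", "flair"]

# Hypothetical toy frame without a 'category' column.
toy_general = pd.DataFrame(
    {
        "post_id": ["a"],
        "is_self": [True],
        "permalink": ["/r/x"],
        "url": ["u"],
        "flair": [None],
    }
)

# errors="ignore" drops the columns that exist and skips the missing one.
trimmed = toy_general.drop(columns=columns_to_drop, errors="ignore")
```

The trade-off is that a misspelled column name is also ignored silently, so the explicit `remove` used above is the stricter option.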


#### Renaming the columns to more descriptive names

In [117]:
names = {'selftext':'post','body':'comments'}
wallstreetbets = wallstreetbets.rename(columns=names)
stocks = stocks.rename(columns=names)
general = general.rename(columns=names)

###  3. Merging and Exporting the Final DataFrame

In [None]:
main_df = pd.concat([wallstreetbets, stocks, general], ignore_index=True)

main_df.to_csv('./data/cleaned_stock.csv')
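
Because the 'comments' column holds Python lists, it is written to the CSV as text and comes back as strings when the file is read again. A minimal sketch of the round trip, using an in-memory buffer and toy data instead of the real file (the helper `parse_comments` is hypothetical):

```python
import ast
import io

import pandas as pd

# Hypothetical toy frame: post "b" has no comments (NaN after the left merge).
toy = pd.DataFrame(
    {"post_id": ["a", "b"], "comments": [["c1", "c2"], float("nan")]}
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)


def parse_comments(value):
    """Turn the stored text back into a list; missing values become []."""
    return ast.literal_eval(value) if isinstance(value, str) else []


restored["comments"] = restored["comments"].apply(parse_comments)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of parsing.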