# Data Cleaning Notebook

### This notebook cleans and merges stock-related data and general Reddit posts.
### The process is divided into the following sections:
###  1. Data Imports and Setup
###  2. Data Cleaning for Stock Posts and Comments
###  3. Merging and Exporting the Final DataFrame


In [1]:
import pandas as pd
import numpy as np

## 1. Data Imports and Setup for Stock Data

In [6]:
# Load general and stock-specific posts, dropping rows that have no category
posts = pd.read_csv('./data/reddit/general_posts.csv').dropna(subset=['category'])
stock_posts = pd.read_csv('./data/reddit/posts_stock_specific.csv').dropna(subset=['category'])
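As a small illustration of what `dropna(subset=...)` does here, the sketch below uses a made-up three-row frame (the `toy` data is hypothetical, not taken from the Reddit files). Only rows with a missing `category` are dropped; missing values in other columns such as `flair` are left alone.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the posts CSV.
toy = pd.DataFrame({
    "post_id": ["a", "b", "c"],
    "category": ["general", np.nan, "stock"],
    "flair": [np.nan, "News", np.nan],
})

# Drop only the rows whose 'category' is missing.
kept = toy.dropna(subset=["category"])

kept_ids = kept["post_id"].tolist()
missing_flair = int(kept["flair"].isna().sum())
```

Row `b` is removed because its category is missing, while rows `a` and `c` survive even though their `flair` is empty.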

# Loading comments and stock specific comments
comments = pd.read_csv('./data/reddit/general_comments.csv')
stock_comments = pd.read_csv('./data/reddit/comments_stock_specific.csv')

### Making sure that the scores are stored as numbers and not strings. I then sort the comments by score within each post_id group.

I only keep the top 15 comments for each post.

In [None]:
comments['score'] = pd.to_numeric(comments['score'])
stock_comments['score'] = pd.to_numeric(stock_comments['score'])
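By default `pd.to_numeric` raises an error on values it cannot parse. If the raw CSV might contain malformed scores (an assumption, I have not seen any in this data), passing `errors="coerce"` turns them into `NaN` instead. A minimal sketch on a hypothetical sample:

```python
import pandas as pd

# Hypothetical sample: scores read as strings, one of them malformed.
sample = pd.DataFrame({
    "post_id": ["a", "a", "b"],
    "score": ["12", "3", "n/a"],
})

# errors="coerce" converts unparseable values to NaN instead of raising.
sample["score"] = pd.to_numeric(sample["score"], errors="coerce")

n_bad = int(sample["score"].isna().sum())
```

Counting the `NaN` values afterwards (`n_bad`) gives a quick check of how many rows failed to convert.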

In [11]:
# Group by 'post_id', sort by 'score' within each group, and keep the top 15
stock_comments = stock_comments.groupby('post_id').apply(lambda x: x.sort_values(by='score', ascending=False).head(15)).reset_index(drop=True)

comments = comments.groupby('post_id').apply(lambda x: x.sort_values(by='score', ascending=False).head(15)).reset_index(drop=True)
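The same top-N selection can also be written without `groupby(...).apply(...)`, by sorting once and then taking the head of each group. This is an alternative sketch, not the method used in the cell above; the helper name `top_n_per_post` and the toy data are hypothetical.

```python
import pandas as pd

def top_n_per_post(df, n=15):
    """Return the n highest-scoring rows for each post_id."""
    ranked = df.sort_values("score", ascending=False)
    # GroupBy.head keeps the first n rows of each group in the sorted order.
    return ranked.groupby("post_id").head(n).reset_index(drop=True)

# Hypothetical toy comments.
toy = pd.DataFrame({
    "post_id": ["a", "a", "a", "b"],
    "score": [1, 5, 3, 2],
    "body": ["x", "y", "z", "w"],
})

top2 = top_n_per_post(toy, n=2)
```

Unlike the `apply` version, the rows come back ordered by score across all posts rather than grouped by `post_id`. That makes no difference if the next step groups by `post_id` again.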



#### I have multiple comments per post_id and want to keep them together, so I collect the comment bodies for each post_id into a single list.

In [12]:
stock_comments = stock_comments.groupby('post_id')['body'].apply(list).reset_index()
comments = comments.groupby('post_id')['body'].apply(list).reset_index()
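To make the shape of the result concrete, here is a small sketch on hypothetical toy data. After the aggregation there is one row per `post_id`, and `body` holds a Python list of that post's comment texts.

```python
import pandas as pd

# Hypothetical toy comments: two for post 'a', one for post 'b'.
toy = pd.DataFrame({
    "post_id": ["a", "a", "b"],
    "body": ["first", "second", "only"],
})

# Collect every comment body for a post into one list.
grouped = toy.groupby("post_id")["body"].apply(list).reset_index()

bodies_a = grouped.loc[0, "body"]
bodies_b = grouped.loc[1, "body"]
```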

#### Let's now merge the posts and comments together

In [None]:
stocks = pd.merge(stock_posts, stock_comments, on='post_id', how='left')
general = pd.merge(posts, comments, on='post_id', how='left')
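With `how='left'`, every post is kept, and posts that have no comments get `NaN` in `body`. If a list is preferred everywhere, the missing entries can be replaced with empty lists. The sketch below uses hypothetical toy frames; whether to fill the gaps at all is a design choice the notebook does not make.

```python
import pandas as pd

# Hypothetical toy data: post 'b' has no comments.
toy_posts = pd.DataFrame({"post_id": ["a", "b"], "title": ["t1", "t2"]})
toy_comments = pd.DataFrame({"post_id": ["a"], "body": [["c1", "c2"]]})

merged = pd.merge(toy_posts, toy_comments, on="post_id", how="left")

# Replace the NaN left by the join with an empty list.
merged["body"] = merged["body"].apply(lambda v: v if isinstance(v, list) else [])
```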

#### Making sure the datetime is well structured and uniform by keeping only the date (dropping the time of day)

In [14]:
# Convert 'created_utc' to datetime and keep only the date
stocks['created_utc'] = pd.to_datetime(stocks['created_utc']).dt.date
general['created_utc'] = pd.to_datetime(general['created_utc']).dt.date
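The cell above works when `created_utc` is stored as a date-time string. If a source file instead held Unix epoch seconds (an assumption, since Reddit's API reports `created_utc` that way but the CSVs here may already be converted), `pd.to_datetime` would need `unit="s"`. A sketch of both cases on made-up values:

```python
import pandas as pd

# Case 1: ISO-style strings parse directly.
iso = pd.Series(["2024-08-15 14:03:22"])
iso_dates = pd.to_datetime(iso).dt.date

# Case 2: epoch seconds need unit="s"; without it, integers are read as nanoseconds.
epoch = pd.Series([1723730602])
epoch_dates = pd.to_datetime(epoch, unit="s", utc=True).dt.date

same_day = iso_dates.iloc[0] == epoch_dates.iloc[0]
```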

In [15]:
general.head()

Unnamed: 0,post_id,title,selftext,score,upvote_ratio,created_utc,num_comments,author,permalink,url,is_self,flair,subreddit,category,body
0,1esvxig,CNBC: Harris to propose federal ban on 'corpor...,,35612,0.87,2024-08-15,2768,BothZookeepergame612,/r/Economics/comments/1esvxig/cnbc_harris_to_p...,https://www.cnbc.com/2024/08/15/harris-corpora...,False,,Economics,general,[The headline sounds way different than the pr...
1,1f2eubo,Should the world's richest 1% - who gained $42...,,18133,0.91,2024-08-27,2815,Impressive-Ad1944,/r/Economics/comments/1f2eubo/should_the_world...,https://www.business-standard.com/world-news/w...,False,,Economics,general,"[[removed], While the Walton family is one of ..."
2,1ef14i6,Boomers' iron grip on $76 trillion of wealth p...,,13372,0.91,2024-07-29,773,GetRichQuickSchemer_,/r/Economics/comments/1ef14i6/boomers_iron_gri...,https://creditnews.com/economy/boomers-iron-gr...,False,News,Economics,general,[Having a large IRA/401K is understandable. \n...
3,1cbzoay,Nate Silver: Go to a state school. The Ivy Lea...,,12639,0.92,2024-04-24,1432,jivatman,/r/Economics/comments/1cbzoay/nate_silver_go_t...,https://www.natesilver.net/p/go-to-a-state-school,False,,Economics,general,[The point of the ivies isn't the quality of t...
4,1cz1a2v,Some Americans live in a parallel economy wher...,,10761,0.85,2024-05-23,3178,mafco,/r/Economics/comments/1cz1a2v/some_americans_l...,https://finance.yahoo.com/news/some-americans-...,False,News,Economics,general,[The Great Bifurcation occurred right around C...


#### Renaming the columns to more descriptive names

In [16]:
names = {'selftext':'post','body':'comments'}
stocks = stocks.rename(columns=names)
general = general.rename(columns=names)

###  3. Merging and Exporting the Final DataFrame

In [17]:
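# Stack the stock-specific and general frames into one DataFrame with a fresh index
main_df = pd.concat([stocks, general], ignore_index=True)

# Export the cleaned data; the DataFrame index is written as an extra unnamed column
main_df.to_csv('./data/cleaned_stock.csv')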
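Two things are worth knowing when reading this file back. First, CSV has no list type, so the `comments` lists are written as their string representation. Second, writing the index produces the kind of `Unnamed: 0` column visible in the `general.head()` output above. The sketch below shows one way to handle both, using an in-memory buffer and hypothetical toy data instead of the real file.

```python
import ast
import io

import pandas as pd

# Hypothetical toy frame with a list-valued 'comments' column.
toy = pd.DataFrame({"post_id": ["a"], "comments": [["c1", "c2"]]})

# index=False avoids an extra 'Unnamed: 0' column on reload.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

restored = pd.read_csv(buffer)
raw_type = type(restored.loc[0, "comments"])  # the list comes back as a string

# Parse the string representation back into a real list; NaN becomes an empty list.
restored["comments"] = restored["comments"].apply(
    lambda v: ast.literal_eval(v) if isinstance(v, str) else []
)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of round trip.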