Data Exploration
=======
This section shows the raw data structure and the cleaning and vectorization process. The raw data are crawled from *Reddit.com* with the help of `praw` and `psaw`.

In [1]:
import os
import pandas as pd

os.chdir('D:\\github\\master_thesis_2022')
cwd = os.getcwd()

-----
### Data Generating
As shown in the table below, `RdtData` records eight features of each Reddit submission: `id`, `url`, `title`, `score`, `num_comments`, `created_utc`, `selftext` and `top_comments`.

* `id` is the 6-character unique code of every submission; a submission can be reached at <u>*https://redd.it/* + *id*</u>. For example, <u>https://redd.it/tu2men</u> leads to the submission with `id` *tu2men*.
* `url` is the original url of every submission.
* `title` is the original title of every submission, which can be deleted by the user or removed by the moderators.
* `score` is simply the number of upvotes minus the number of downvotes.
* `num_comments` refers to the number of comments of every submission.
* `created_utc` is the creation time of every submission, stored as a Unix timestamp in Coordinated Universal Time (UTC).
* `selftext` is the body of every submission. It can be text, an image or empty.
* `top_comments` is the concatenation of the first few comments when sorted by the option *top*. Note that comments by spam users, i.e. those who always post unrelated content such as community regulations and ads, are removed in advance.
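
As a small illustration of two of these fields, the sketch below builds the short link from an `id` and converts a `created_utc` value into a UTC calendar date. The helper names `short_link` and `utc_day` are hypothetical and are not part of `RdtData`; the input values are taken from the table in this section.

```python
from datetime import datetime, timezone


def short_link(submission_id: str) -> str:
    """Build the redd.it short link for a submission id."""
    return f"https://redd.it/{submission_id}"


def utc_day(created_utc: float) -> str:
    """Convert a Unix timestamp (in seconds) to a UTC date string."""
    moment = datetime.fromtimestamp(created_utc, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d")


print(short_link("tu2men"))    # https://redd.it/tu2men
print(utc_day(1648850000.0))   # 2022-04-01
```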

In [2]:
df_rdt = pd.read_csv(os.path.join(cwd, 'data\\interim\\df_rdt.csv'), encoding='utf-8-sig')
df_rdt.head()

Unnamed: 0,id,url,title,score,num_comments,created_utc,selftext,top_comments
0,tu2rj7,https://www.reddit.com/r/wallstreetbets/commen...,r/place,1,2,1648850000.0,[removed],
1,tu2mtj,https://i.redd.it/yp2zfi16ozq81.jpg,Diamond hands in Madison WI,1,0,1648850000.0,,
2,tu2mmh,https://www.reddit.com/r/wallstreetbets/commen...,GameStop's Board of Directors wants to dilute ...,0,12,1648850000.0,The below is an excerpt from GameStop's recent...,1 Billion shares for a mall/strip-mall brick-a...
3,tu2men,https://www.reddit.com/r/wallstreetbets/commen...,ok..OK... what happened today?,0,24,1648850000.0,[removed],"dude its a momentum trade, theres no real sens..."
4,tu2do4,https://i.redd.it/xtiyaju5mzq81.jpg,Just a couple of my positions… ($SPY & $NIO hi...,1,2,1648849000.0,,


With the help of `FinData`, the *GME* historical prices and the *S&P500* index are collected at a daily level and later merged with the Reddit data `df_rdt` by date.
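
The merge by date could look roughly like the sketch below. It is not the `FinData` implementation: the function name `merge_by_date`, the use of left joins, and the small stand-in frames are assumptions made for illustration, with values copied from the tables in this section.

```python
import pandas as pd

# Small stand-ins for df_rdt, df_gme and df_sp500 (illustrative only).
rdt = pd.DataFrame({
    "id": ["tu2rj7", "tu2do4"],
    "created_utc": [1648850000.0, 1648849000.0],
})
gme = pd.DataFrame({
    "Date": ["2022-04-01", "2022-03-31"],
    "Close": [165.0, 166.58],
})
sp500 = pd.DataFrame({"DATE": ["2022-04-01"], "sp500": [4545.86]})


def merge_by_date(rdt, gme, sp500):
    """Attach daily market data to each submission (assumed left joins)."""
    out = rdt.copy()
    # Reduce the Unix timestamp to a UTC date string such as '2022-04-01'.
    out["created_utc"] = pd.to_datetime(
        out["created_utc"], unit="s", utc=True
    ).dt.strftime("%Y-%m-%d")
    # The date columns are named differently in each source file.
    out = out.merge(sp500, how="left", left_on="created_utc", right_on="DATE")
    out = out.drop(columns="DATE")
    out = out.merge(gme, how="left", left_on="created_utc", right_on="Date")
    out = out.drop(columns="Date")
    return out


merged = merge_by_date(rdt, gme, sp500)
print(merged)
```

With left joins, every submission is kept; submissions posted on non-trading days would get missing values in the market columns and would need separate handling.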

In [3]:
df_gme = pd.read_csv(os.path.join(cwd, 'data\\interim\\df_gme.csv'), encoding='utf-8-sig')
df_gme.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume
0,2022-04-01,188.9,189.7688,155.26,165.0,13189563
1,2022-03-31,163.1,175.745,158.51,166.58,11242228
2,2022-03-30,175.0,183.3369,165.0,166.85,9169186
3,2022-03-29,188.24,199.41,163.0,179.9,18011489
4,2022-03-28,151.98,190.84,151.545,189.59,16316524


In [4]:
df_sp500 = pd.read_csv(os.path.join(cwd, 'data\\interim\\df_sp500.csv'), encoding='utf-8-sig')
df_sp500.head()

Unnamed: 0,DATE,sp500
0,2022-03-01,4306.26
1,2022-03-02,4386.54
2,2022-03-03,4363.49
3,2022-03-04,4328.87
4,2022-03-07,4201.09


------
### Data Cleaning
The data cleaning process follows several steps.
1. Deleted or removed content is collected as *[deleted]* or *[removed]* instead of `None`, so these placeholders are replaced with `None`.
2. A submission is dropped if all three of its key features, i.e. `title`, `selftext` and `top_comments`, are empty, as it carries no usable text.
3. Newline characters, i.e. `\n`, are replaced with a space.
4. URLs and images in the text are removed, as they contain no information for NLP.
5. Submission times are converted to human-readable dates at the day level.
6. Emojis are either removed or replaced with their English representation.
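
The steps above can be sketched with the standard library alone, as shown below. This is a minimal sketch under stated assumptions, not the cleaning code used for this project: all function names are hypothetical, inputs are assumed to be `str` or `None`, only plain URLs are stripped in step 4, the emoji ranges are a rough approximation, and emoji names come from the Unicode character names.

```python
import re
import unicodedata
from datetime import datetime, timezone

TEXT_KEYS = ("title", "selftext", "top_comments")
PLACEHOLDERS = {"[deleted]", "[removed]"}
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Rough emoji ranges (assumption): pictographs, emoticons, symbols, dingbats.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def emoji_to_text(match):
    """Step 6: replace an emoji with its lower-cased Unicode name."""
    name = unicodedata.name(match.group(0), "")
    return f" {name.lower()} " if name else " "


def clean_text(value, keep_emoji_names=True):
    """Apply steps 1, 3, 4 and 6 to one text field (str or None)."""
    if value is None or value.strip() in PLACEHOLDERS:
        return None                                   # step 1
    value = value.replace("\n", " ")                  # step 3
    value = URL_RE.sub(" ", value)                    # step 4 (URLs only)
    value = EMOJI_RE.sub(emoji_to_text if keep_emoji_names else " ", value)
    value = re.sub(r"\s+", " ", value).strip()
    return value or None


def clean_submission(row):
    """Clean one submission dict; return None if it should be dropped."""
    cleaned = dict(row)
    for key in TEXT_KEYS:
        cleaned[key] = clean_text(row.get(key))
    if all(cleaned[key] is None for key in TEXT_KEYS):
        return None                                   # step 2
    moment = datetime.fromtimestamp(row["created_utc"], tz=timezone.utc)
    cleaned["created_utc"] = moment.strftime("%Y-%m-%d")  # step 5
    return cleaned


print(clean_text("to the moon 🚀\nsee https://redd.it/tu2men"))
# to the moon rocket see
```

On a `DataFrame`, the same functions could be applied column by column, after converting missing values to `None`.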

In [5]:
df_raw = pd.read_csv(os.path.join(cwd, 'data\\raw\\df_raw.csv'), encoding='utf-8-sig')
df_raw.head()

Unnamed: 0.1,Unnamed: 0,id,url,title,score,num_comments,created_utc,selftext,top_comments,sp500,Open,High,Low,Close,Volume
0,0,tu2rj7,https://www.reddit.com/r/wallstreetbets/commen...,r/place,1,2,2022-04-01,,,4545.86,188.9,189.7688,155.26,165.0,13189563
1,1,tu2mtj,https://i.redd.it/yp2zfi16ozq81.jpg,Diamond hands in Madison WI,1,0,2022-04-01,,,4545.86,188.9,189.7688,155.26,165.0,13189563
2,2,tu2mmh,https://www.reddit.com/r/wallstreetbets/commen...,GameStop's Board of Directors wants to dilute ...,0,12,2022-04-01,The below is an excerpt from GameStop's recent...,1 Billion shares for a mall/strip-mall brick-a...,4545.86,188.9,189.7688,155.26,165.0,13189563
3,3,tu2men,https://www.reddit.com/r/wallstreetbets/commen...,ok..OK... what happened today?,0,24,2022-04-01,,"dude its a momentum trade, theres no real sens...",4545.86,188.9,189.7688,155.26,165.0,13189563
4,4,tu2do4,https://i.redd.it/xtiyaju5mzq81.jpg,Just a couple of my positions… ($SPY & $NIO hi...,1,2,2022-04-01,,,4545.86,188.9,189.7688,155.26,165.0,13189563
