Data Exploration
=======
This section shows the structure of the raw data and the cleaning and vectorization process. The raw data were crawled from *Reddit.com* with the help of `praw` and `psaw`.
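
As a rough illustration of how one crawled submission might map onto the columns used in this section, the sketch below flattens a submission-like object into a single row. The helper `submission_to_row`, its `n_top` parameter and the stand-in object are hypothetical and not part of the original crawler. The attribute names simply follow the column names used here, and ranking comments by `score` locally is only an approximation of Reddit's *top* sort.

```python
from types import SimpleNamespace


def submission_to_row(submission, n_top=3):
    """Flatten one submission-like object into a dict of column values (hypothetical helper)."""
    # Keep only objects that carry a comment body; placeholders without one are skipped.
    comments = [c for c in submission.comments if getattr(c, "body", None)]
    # Assumption: approximate the "top" sort by ranking on comment score.
    comments.sort(key=lambda c: c.score, reverse=True)
    return {
        "id": submission.id,
        "url": submission.url,
        "title": submission.title,
        "score": submission.score,
        "num_comments": submission.num_comments,
        "created_utc": submission.created_utc,
        "selftext": submission.selftext,
        # Join the first few comments; an empty result is stored as None.
        "top_comments": " ".join(c.body for c in comments[:n_top]) or None,
    }


# Stand-in object with made-up values, so the sketch runs without Reddit credentials.
fake = SimpleNamespace(
    id="abc123",
    url="https://example.com/post",
    title="Example title",
    score=1,
    num_comments=2,
    created_utc=1648850000.0,
    selftext="Example body",
    comments=[
        SimpleNamespace(body="low", score=1),
        SimpleNamespace(body="high", score=5),
        SimpleNamespace(score=0),  # no body, mimics a placeholder entry
    ],
)
row = submission_to_row(fake)
print(row["top_comments"])
```

In a real crawl the same function would presumably receive objects returned by the Reddit API wrapper instead of the stand-in.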

In [1]:
import os
import pandas as pd

os.chdir('D:\\github\\master_thesis_2022')
cwd = os.getcwd()

As shown in the following table, the fields `id`, `url`, `title`, `score`, `num_comments`, `created_utc`, `selftext` and `top_comments` are recorded for each submission.

* `id` is the unique 6-character code of a submission. A submission can be reached at <u>*https://redd.it/* + *id*</u>. For example, <u>https://redd.it/tu2men</u> leads to the submission with `id` *tu2men*.
* `url` is the original URL of the submission.
* `title` is the original title of the submission. A submission can be deleted by its author or removed by the subreddit moderators.
* `score` is simply the number of upvotes minus the number of downvotes.
* `num_comments` refers to the number of comments of every submission.
* `created_utc` is the creation time of the submission, stored as a Unix timestamp (seconds since the epoch, in Coordinated Universal Time).
* `selftext` is the body of every submission. It can be text, image or empty.
* `top_comments` is the concatenation of the first few comments when sorted by the *top* option. Note that comments from spam accounts, i.e. accounts that always post unrelated content such as community regulations and ads, are removed in advance.
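
The `id` and `created_utc` fields can be made more readable with a few lines of pandas. The following is a minimal sketch, assuming a small hand-built frame in place of the real data. The column names `short_link` and `created_at` are introduced here for illustration only.

```python
import pandas as pd

# Two illustrative rows that mimic the raw table; only the two relevant columns are kept.
sample = pd.DataFrame(
    {
        "id": ["tu2men", "tu2do4"],
        "created_utc": [1648850000.0, 1648849000.0],
    }
)

# Short link: the fixed prefix followed by the submission id.
sample["short_link"] = "https://redd.it/" + sample["id"]

# created_utc holds seconds since the Unix epoch, so unit="s" tells pandas how to read it.
sample["created_at"] = pd.to_datetime(sample["created_utc"], unit="s", utc=True)

print(sample[["short_link", "created_at"]])
```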

In [2]:
raw = pd.read_csv(os.path.join(cwd, 'data', 'interim', 'df_raw.csv'), encoding='utf-8-sig')
raw.head()

Unnamed: 0,id,url,title,score,num_comments,created_utc,selftext,top_comments
0,tu2rj7,https://www.reddit.com/r/wallstreetbets/commen...,r/place,1,2,1648850000.0,[removed],
1,tu2mtj,https://i.redd.it/yp2zfi16ozq81.jpg,Diamond hands in Madison WI,1,0,1648850000.0,,
2,tu2mmh,https://www.reddit.com/r/wallstreetbets/commen...,GameStop's Board of Directors wants to dilute ...,0,12,1648850000.0,The below is an excerpt from GameStop's recent...,1 Billion shares for a mall/strip-mall brick-a...
3,tu2men,https://www.reddit.com/r/wallstreetbets/commen...,ok..OK... what happened today?,0,24,1648850000.0,[removed],"dude its a momentum trade, theres no real sens..."
4,tu2do4,https://i.redd.it/xtiyaju5mzq81.jpg,Just a couple of my positions… ($SPY & $NIO hi...,1,2,1648849000.0,,


The data cleaning process follows several steps.
1. Deleted or removed content is collected as the string *[deleted]* or *[removed]* rather than `None`, so these placeholders are replaced with `None`.
2. A submission is useless if all three key cells, i.e. `title`, `selftext` and `top_comments`, are empty, so such submissions are dropped.
3. Newline characters, i.e. `\n`, are replaced with a space.
4. URLs and images in the text are removed, as they carry no information for NLP.
5. Submission times are converted to a human-readable date at day resolution, such as *04-01-22*.
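
The steps above can be sketched roughly as follows. This is not the original cleaning script: the function `clean_submissions`, the URL pattern and the sample rows are assumptions made for illustration, image handling is reduced to stripping URLs, and the `%m-%d-%y` date format is inferred from the cleaned table.

```python
import pandas as pd

TEXT_COLS = ["title", "selftext", "top_comments"]
PLACEHOLDERS = ["[deleted]", "[removed]"]


def clean_submissions(df):
    """Apply the five cleaning steps to a copy of df (hypothetical helper)."""
    out = df.copy()

    # Step 1: turn the placeholder strings into missing values.
    for col in TEXT_COLS:
        out[col] = out[col].mask(out[col].isin(PLACEHOLDERS))

    # Step 2: drop submissions whose three text cells are all missing.
    out = out.dropna(subset=TEXT_COLS, how="all")

    for col in TEXT_COLS:
        # Step 3: replace newline characters with a space.
        out[col] = out[col].str.replace("\n", " ", regex=False)
        # Step 4: strip URLs (assumed pattern), then trim leftover whitespace.
        out[col] = out[col].str.replace(r"https?://\S+", "", regex=True).str.strip()

    # Step 5: convert the Unix timestamp to a day-level date string.
    out["created_utc"] = pd.to_datetime(
        out["created_utc"], unit="s", utc=True
    ).dt.strftime("%m-%d-%y")
    return out


# Illustrative rows only; the second one has no usable text and should be dropped.
raw_sample = pd.DataFrame(
    {
        "title": ["ok..OK... what happened today?", "[deleted]", "Read this"],
        "selftext": ["[removed]", "[removed]", "line one\nline two https://example.com/page"],
        "top_comments": ["dude its a momentum trade", None, None],
        "created_utc": [1648850000.0, 1648850000.0, 1648849000.0],
    }
)
clean_sample = clean_submissions(raw_sample)
print(clean_sample)
```

Working on a copy keeps the raw frame intact, so the raw and cleaned versions can still be compared side by side.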

In [3]:
clean = pd.read_csv(os.path.join(cwd, 'data', 'interim', 'df_clean.csv'), encoding='utf-8-sig')
clean.head()

Unnamed: 0,id,url,title,score,num_comments,created_utc,selftext,top_comments
0,tu2rj7,https://www.reddit.com/r/wallstreetbets/commen...,r/place,1,2,04-01-22,,
1,tu2mtj,https://i.redd.it/yp2zfi16ozq81.jpg,Diamond hands in Madison WI,1,0,04-01-22,,
2,tu2mmh,https://www.reddit.com/r/wallstreetbets/commen...,GameStop's Board of Directors wants to dilute ...,0,12,04-01-22,The below is an excerpt from GameStop's recent...,1 Billion shares for a mall/strip-mall brick-a...
3,tu2men,https://www.reddit.com/r/wallstreetbets/commen...,ok..OK... what happened today?,0,24,04-01-22,,"dude its a momentum trade, theres no real sens..."
4,tu2do4,https://i.redd.it/xtiyaju5mzq81.jpg,Just a couple of my positions… ($SPY & $NIO hi...,1,2,04-01-22,,
