# Reddit Trading Strategy




## Deliverables

* `documentation.ipynb` ... This notebook contains information about the delivered files and how to use and maintain them.

* `reddit_stock_analysis.ipynb` ... This notebook provides an in-depth evaluation of the `r/wallstreetbets` subreddit, including a detailed sentiment analysis.

* `reddit_data_fetch.ipynb` ... This notebook provides code that automatically downloads reddit data (i.e., posts) and stores it such that it can be used as input data for any further processing.

* `data_wsb/` ... This folder contains a collection of all reddit posts from the `r/wallstreetbets` subreddit between December 1st, 2020 and May 1st, 2021.





## Input Data

* `data_wsb/`

  This folder contains the reddit posts used as input for the processing.
  The data is stored as human-readable CSV files with the following columns:

    * `timestamp` (type: Datetime) ... The point in time that the post was submitted.
    * `id` (type: String) ... The unique ID of the post.
    * `title` (type: String) ... The title of the post.
    * `body` (type: String, optional) ... The text of the post.
    * `num_comments` (type: Integer) ... The number of comments that the post had received at the time the dataset was assembled.
    * `score` (type: Integer) ... The reddit score (upvotes/downvotes) that the post had received at the time the dataset was assembled.

  Example:
    
  > timestamp,id,title,body,num_comments,score  
  > 2021-01-28 00:43:01,l6jb22,WE'RE BACK! LETS BOOM BOYS! 🚀 🚀,"We need to stay strong. Do not let this destory us, we are all in it together.",94,479  
  > 2021-03-05 00:09:05,lxz8mq,We just like the stonk 🚀💎🤲,,5,17  
  > ... 

  Currently, the dataset is split by day, i.e., each file contains only a single day's worth of posts.
  This was done for data management purposes only, so it is possible to split the input data differently or to even provide just a single file.
  The only restriction is that file names must start with `wsb_posts__` and end with `.csv`; otherwise the file won't be recognized as an input file.
  Furthermore, all files need to start with a header row and follow the column structure described above.

* `wordlist.csv` 

  This CSV file contains a list of English words that is used for the stock ticker cleaning.
  The data is stored in a single column:

  * `words` (type: String)

  Example:

  > words  
  > A  
  > a  
  > aa  
  > aal  
  > aalii  
  > ...  
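
Both input files can be read with Python's standard `csv` module. The following is a minimal sketch, not the notebook's actual loading code: the function names `load_posts` and `load_wordlist` are hypothetical, and the timestamp format is an assumption inferred from the example rows above.

```python
import csv
from datetime import datetime
from pathlib import Path

# Assumption: timestamps look like "2021-01-28 00:43:01", as in the example rows.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def load_posts(data_dir):
    """Read every wsb_posts__*.csv file in data_dir into a list of dicts."""
    posts = []
    # Only files that follow the documented naming rule are picked up.
    for path in sorted(Path(data_dir).glob("wsb_posts__*.csv")):
        with open(path, newline="", encoding="utf-8") as handle:
            # DictReader uses the header row as column names.
            for row in csv.DictReader(handle):
                posts.append({
                    "timestamp": datetime.strptime(row["timestamp"], TIMESTAMP_FORMAT),
                    "id": row["id"],
                    "title": row["title"],
                    "body": row["body"] or None,  # the body column is optional
                    "num_comments": int(row["num_comments"]),
                    "score": int(row["score"]),
                })
    return posts


def load_wordlist(path):
    """Read the single `words` column of the word list into a set."""
    with open(path, newline="", encoding="utf-8") as handle:
        return {row["words"] for row in csv.DictReader(handle)}
```

Because the files are matched by name pattern only, the same function works whether the posts are split by day or stored in a single file.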

  



## Usage


The `reddit_stock_analysis.ipynb` notebook provides an in-depth analysis of posts from the r/wallstreetbets forum in order to derive potential trading implications.

- The first part analyzes the posts in terms of the average number of comments per post (per day and stock) as well as the respective sentiment. This is mainly achieved by composing a new variable called the feature score.

- The second part relates the derived feature score to actual stock movements and makes it possible to compare them in terms of absolute percentage changes.

In this way, predictions of future stock movements can be examined and potentially used to beat the market and achieve significant returns.

## Maintenance

### Updating the input data:

In order to generate new input data that contains the latest reddit posts, follow the steps below:

1. Open up `reddit_data_fetch.ipynb`
2. Configure the subreddit and update the start and end date, as described in the notebook
3. Execute the notebook and wait until the latest data has been downloaded

### Fine-tuning the Sentiment Analysis:

In order to improve the sentiment analysis capabilities over time, the custom vocabulary can be extended and/or adjusted.
To do this, follow the steps below:

1. Open up `reddit_stock_analysis.ipynb`
2. Go to section **3.1. Custumize Vocabulary**
3. Update the existing word-sentiment pairs or add new ones. Sentiment values are denoted as floating point numbers between -1 (very negative sentiment) and 1 (very positive sentiment).
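
The following sketch illustrates what such an update could look like. It assumes the custom vocabulary is a plain dictionary that maps words to sentiment values; the actual structure in the notebook may differ. The names `custom_vocabulary` and `set_sentiment` and the example words are hypothetical.

```python
def set_sentiment(vocabulary, word, sentiment):
    """Add or update a word-sentiment pair, enforcing the [-1, 1] range."""
    if not -1.0 <= sentiment <= 1.0:
        raise ValueError(f"sentiment for {word!r} must be in [-1, 1], got {sentiment}")
    vocabulary[word] = float(sentiment)
    return vocabulary


# Hypothetical starting vocabulary; the real pairs live in the notebook.
custom_vocabulary = {"moon": 0.8, "bagholder": -0.6}

set_sentiment(custom_vocabulary, "moon", 0.9)     # update an existing pair
set_sentiment(custom_vocabulary, "tendies", 0.7)  # add a new pair
```

Validating the range at the point of update keeps out-of-range values from silently skewing the mean sentiment later on.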

### Changing the Aggregation Time Period

Currently, the analysis looks at day-to-day changes in the interest in stocks and in their prices.
It is possible to adjust this time period (e.g., to hour-by-hour aggregation) by following the steps below:

1. Open up `reddit_stock_analysis.ipynb`
2. Go to section **4. Aggregate Features per Stock per Day**
3. Instead of converting the posts' timestamps to `%Y-%m-%d` (i.e., removing the hour, minutes and seconds information), choose a different timestamp format
4. Go to section **5. Fetch Stock Prices**
5. Instead of using the daily close prices delivered by the Yahoo Finance API for further processing, choose the price data suitable for the timestamp format used in step 3
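
The effect of the timestamp format in step 3 can be sketched as follows. This is a simplified illustration rather than the notebook's code: it groups by time period only (the notebook also groups by stock), and the names `DAILY`, `HOURLY`, and `mean_comments_per_period` as well as the sample values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

DAILY = "%Y-%m-%d"      # drops hours, minutes, and seconds
HOURLY = "%Y-%m-%d %H"  # keeps the hour, drops minutes and seconds


def mean_comments_per_period(posts, period_format=DAILY):
    """Average the comment counts of posts that share a formatted timestamp."""
    buckets = defaultdict(list)
    for timestamp, num_comments in posts:
        # Posts whose timestamps format to the same string land in one bucket.
        buckets[timestamp.strftime(period_format)].append(num_comments)
    return {period: sum(values) / len(values) for period, values in buckets.items()}


# Made-up (timestamp, num_comments) pairs for illustration.
posts = [
    (datetime(2021, 1, 28, 0, 43, 1), 94),
    (datetime(2021, 1, 28, 0, 59, 30), 6),
    (datetime(2021, 1, 28, 13, 5, 0), 20),
]

daily = mean_comments_per_period(posts, DAILY)
hourly = mean_comments_per_period(posts, HOURLY)
```

With the daily format all three posts fall into one bucket, while the hourly format splits them into two. The price data chosen in step 5 needs to use the same granularity so that both series can be joined on the formatted timestamp.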

### Refining the Feature Score:

Currently, the Feature Score is calculated as follows:

> $\text{FeatureScore}(day, stock) = \log\bigl(0.1 + \text{MeanCommentsPerPost}(day, stock) \cdot (1 + \text{MeanSentiment}(day, stock))\bigr)$

This formula can, however, be adapted to fit one's specific needs.
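
As a starting point for such adaptations, the formula can be written as a small Python function. This is a sketch under two assumptions: `log` is taken to be the natural logarithm (the formula does not state the base), and the function name `feature_score` is hypothetical.

```python
import math


def feature_score(mean_comments_per_post, mean_sentiment):
    """Compute log(0.1 + MeanCommentsPerPost * (1 + MeanSentiment)).

    Assumption: natural logarithm; swap in math.log10 if base 10 is intended.
    """
    return math.log(0.1 + mean_comments_per_post * (1.0 + mean_sentiment))
```

As long as the mean sentiment stays within -1 and 1 and the comment count is non-negative, the argument of the logarithm is at least 0.1, so the score is always defined.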

