# Cleaning




Columns are the same for each dataset, so one script can clean them all. (Test on one dataset, then loop over the others.)

**WARNING** This script takes around 5-10 minutes to run on an MSOE school computer.
You should also probably close all your other windows (or at least Teams). **WARNING**

### Imports and setup

In [1]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import os
import seaborn as sns
import sys
import demoji
import nltk 
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
nltk.download('stopwords')


sys.path.append('../')



pd.set_option('display.max_rows', 250)
pd.set_option('display.max_columns', 250)

MIN_POSTS_PER_DAY = 5

data_path = os.path.join('combined_files.csv')


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/benfouch/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Before the data can be read into a dataframe, the csv file itself needs some preprocessing

Preprocessing is done in clean.py, which takes the broken csv files and repairs them into correct records. The text of posts contained commas, which was breaking the csv files.
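
clean.py itself is not shown here. As a minimal sketch of the underlying problem, and without assuming anything about how clean.py actually repairs the files: a field that contains commas only survives a round trip if it is quoted, which Python's `csv` module does automatically when writing. The sample row below is made up for illustration.

```python
import csv
import io

# Hypothetical record whose title contains commas (not taken from the dataset)
row = ["2021-01-01 00:02:06", "abc123", "Buy GME, AMC, and BB?"]

# csv.writer quotes any field that contains the delimiter
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
line = buffer.getvalue()

# A naive split sees too many fields; csv.reader recovers the original three
naive_fields = line.strip().split(",")
parsed_fields = next(csv.reader(io.StringIO(line)))
```

Files written this way can be read by `pd.read_csv` without the post text spilling into neighbouring columns.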


## Some more data cleaning still needs to be done

In [2]:
df = pd.read_csv(data_path)

In [3]:
df.head(1)

Unnamed: 0,created,id,author,retrieved,edited,pinned,archived,locked,removed,deleted,is_self,is_video,is_original_content,title,link_flair_text,upvote_ratio,score,gilded,total_awards_received,num_comments,num_crossposts,selftext,thumbnail,shortlink
0,2021-01-01 00:02:06,ko124i,[deleted],2021-02-02 21:52:13,1970-01-01 00:00:00,0,0,0,1,1,1,0,0,3k - 170k since March (Also buy LIT!!),Gain,1.0,34,0,1,14,0,[deleted],default,https://redd.it/ko124i


In [4]:
df.columns[17:20]

Index(['gilded', 'total_awards_received', 'num_comments'], dtype='object')

### Converting types

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1482039 entries, 0 to 1482038
Data columns (total 24 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   created                1482039 non-null  object 
 1   id                     1482039 non-null  object 
 2   author                 1482039 non-null  object 
 3   retrieved              1482039 non-null  object 
 4   edited                 1482039 non-null  object 
 5   pinned                 1482039 non-null  int64  
 6   archived               1482039 non-null  int64  
 7   locked                 1482039 non-null  int64  
 8   removed                1482039 non-null  int64  
 9   deleted                1482039 non-null  int64  
 10  is_self                1482039 non-null  int64  
 11  is_video               1482039 non-null  int64  
 12  is_original_content    1482039 non-null  int64  
 13  title                  1482035 non-null  object 
 14  link_flair_text   

In [6]:


# date time columns (the sample rows above have no fractional seconds, so the format has no %f)
for col in ['created', 'retrieved', 'edited']:
    df[col] = pd.to_datetime(df[col], format='%Y-%m-%d %H:%M:%S')

# boolean / categorical variables
bool_cols = ['pinned', 'archived', 'locked', 'removed', 'deleted',
             'is_self', 'is_video', 'is_original_content']
df[bool_cols] = df[bool_cols].astype('bool')

# int types
int_cols = ['score', 'gilded', 'total_awards_received', 'num_comments', 'num_crossposts']
df[int_cols] = df[int_cols].astype('int')



Columns:

| Index | Feature               | Type     | Description                                                    |
|-------|-----------------------|----------|----------------------------------------------------------------|
| 0     | created               | datetime | Time the submission was created                                |
| 1     | id                    | string   | The id of the submission                                       |
| 2     | author                | string   | The redditor's username                                        |
| 3     | retrieved             | datetime | Time the submission was retrieved                              |
| 4     | edited                | datetime | Time the submission was modified                               |
| 5     | pinned                | boolean  | Whether or not the submission is pinned                        |
| 6     | archived              | boolean  | Whether or not the submission is archived                      |
| 7     | locked                | boolean  | Whether or not the submission is locked                        |
| 8     | removed               | boolean  | Whether or not the submission is removed                       |
| 9     | deleted               | boolean  | Whether or not the submission is user deleted                  |
| 10    | is_self               | boolean  | Whether or not the submission is a text post                   |
| 11    | is_video              | boolean  | Whether or not the submission is a video                       |
| 12    | is_original_content   | boolean  | Whether or not the submission has been set as original content |
| 13    | title                 | string   | Title of the submission                                        |
| 14    | link_flair_text       | string   | Submission link flair's text content                           |
| 15    | upvote_ratio          | double   | Fraction of all votes on the submission that are upvotes       |
| 16    | score                 | integer  | Net score (upvotes minus downvotes)                            |
| 17    | gilded                | integer  | number of gilded awards                                        |
| 18    | total_awards_received | integer  | number of awards on the submission                             |
| 19    | num_comments          | integer  | number of comments on the submission                           |
| 20    | num_crossposts        | integer  | number of crossposts on the submission                         |
| 21    | selftext              | string   | submission selftext on text posts                              |
| 22    | thumbnail             | string   | submission thumbnail on image posts                            |
| 23    | shortlink             | string   | submission short url                                           |

### Cleaning functions for Title


Things that need to be cleaned from "Title":
- New lines
- Emojis (convert or remove?)
- Spam messages (possibly only take posts that have a certain number of upvotes)
- links (need to remove entire record if link is only thing)
- videos (same as link)
- A lot of records do not talk about a specific stock. (Remove them?)

Columns that can be removed for sure:
- id
- shortlink  
- thumbnail
- retrieved
- edited 
- pinned  
- archived  
- locked  
- removed (if removed is true should we discard the record?)
- deleted (same as removed)
- is_self   
- is_video (use as flag to remove records?)
- gilded   

Maybe keep:   (general stats about the post)
- score
- upvote_ratio
- num_comments

Keep:     
- created
- title + selftext




Can we combine these stats into a single score? The score could be a weighted average of upvote ratio, number of comments, etc.
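
A minimal sketch of what such a combined score could look like, assuming each statistic is first min-max scaled to the 0-1 range so that no single column dominates. The weights and the helper name `engagement_score` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical weights, chosen only for illustration; they sum to 1
WEIGHTS = {"upvote_ratio": 0.5, "score": 0.3, "num_comments": 0.2}

def engagement_score(frame: pd.DataFrame) -> pd.Series:
    """Weighted average of min-max scaled post statistics."""
    total = pd.Series(0.0, index=frame.index)
    for column, weight in WEIGHTS.items():
        values = frame[column].astype(float)
        spread = values.max() - values.min()
        # A constant column carries no information, so it contributes 0
        scaled = (values - values.min()) / spread if spread > 0 else values * 0.0
        total += weight * scaled
    return total

# Toy rows, not real data
posts = pd.DataFrame({
    "upvote_ratio": [0.4, 0.9, 1.0],
    "score": [0, 50, 100],
    "num_comments": [10, 0, 20],
})
posts["engagement"] = engagement_score(posts)
```

Because the weights sum to 1 and every scaled column lies in [0, 1], the resulting score also stays in [0, 1].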

In [7]:
df['selftext'].value_counts()[0:4]

[removed]

Many self-text posts had their text removed or deleted after posting, which leaves only a `[removed]` or `[deleted]` placeholder. There are also some additional messages that seem to be from bots, which we could probably remove as well. I suggest replacing `[deleted]` and `[removed]` with an empty string. I think we should also combine this column with the title column so we only have one column with text.
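
The suggestion above can be sketched as follows, on made-up rows. This is one possible way to do it rather than what the notebook ends up doing further down.

```python
import pandas as pd

# Toy rows, not real data
posts = pd.DataFrame({
    "title": ["What would make GME shorts win?", "Hedging your portfolio", "AMC will be back."],
    "selftext": ["[removed]", "Just out of curiosity", None],
})

# Treat Reddit's placeholders and missing values as empty text
placeholders = ["[removed]", "[deleted]"]
selftext = posts["selftext"].fillna("").replace(placeholders, "")

# Join title and body, trimming the trailing space left by an empty body
posts["text"] = (posts["title"] + " " + selftext).str.strip()
```

Posts without usable body text then fall back to the title alone instead of being dropped.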

In [8]:
df[(df['is_self'] == 1) & (df['removed'] == 1)]

Unnamed: 0,created,id,author,retrieved,edited,pinned,archived,locked,removed,deleted,is_self,is_video,is_original_content,title,link_flair_text,upvote_ratio,score,gilded,total_awards_received,num_comments,num_crossposts,selftext,thumbnail,shortlink
0,2021-01-01 00:02:06,ko124i,[deleted],2021-02-02 21:52:13,1970-01-01,False,False,False,True,True,True,False,False,3k - 170k since March (Also buy LIT!!),Gain,1.00,34,0,1,14,0,[deleted],default,https://redd.it/ko124i
7,2021-01-01 00:13:41,ko190a,[deleted],2021-02-03 21:12:56,1970-01-01,False,False,False,True,True,True,False,False,TSXV ROVR OTCQB ROVMF could be getting ready t...,General Discussion,1.00,1,0,0,0,0,[deleted],default,https://redd.it/ko190a
9,2021-01-01 00:18:03,ko1bnp,dluther93,2021-02-02 21:52:13,1970-01-01,False,False,False,True,False,True,False,False,What would make GME shorts win?,Discussion,1.00,1,0,0,0,0,[removed],default,https://redd.it/ko1bnp
13,2021-01-01 00:18:57,ko1c6o,[deleted],2021-02-03 21:17:46,1970-01-01,False,False,False,True,True,True,False,False,Stocks for beginners: How do you know which st...,,0.55,1,0,0,14,0,[deleted],default,https://redd.it/ko1c6o
14,2021-01-01 00:22:31,ko1eca,iOinkedU,2021-02-02 21:52:13,1970-01-01,False,False,False,True,False,True,False,True,What Really Happened September 3rd 2020,Meme,1.00,1,0,0,0,0,[removed],default,https://redd.it/ko1eca
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1482028,2021-12-31 23:46:45,rt6fuz,[deleted],2022-01-01 03:29:07,1970-01-01,False,False,False,True,False,True,False,False,I'm about to start making $120K salary. Should...,Auto,1.00,1,0,0,1,0,[removed],default,https://redd.it/rt6fuz
1482029,2021-12-31 23:48:27,rt6gxc,peachezandsteam,2022-01-01 03:29:52,1970-01-01,False,False,False,True,False,True,False,False,Is the notable congresswoman's GOOG calls a gu...,Discussion,0.66,1,0,0,0,0,[removed],https://b.thumbs.redditmedia.com/pp0YjoMYmhccq...,https://redd.it/rt6gxc
1482035,2021-12-31 23:55:49,rt6lul,coyote_of_the_month,2022-01-01 03:29:07,1970-01-01,False,False,False,True,False,True,False,False,Company was unable to process additional 403(b...,R5: Legal,1.00,1,0,0,3,0,[removed],self,https://redd.it/rt6lul
1482036,2021-12-31 23:55:51,rt6lv6,[deleted],2022-01-01 03:56:58,1970-01-01,False,False,False,True,False,True,False,False,Winner or loser? Only time will tell. 2021 end...,Discussion,1.00,1,0,0,1,0,[removed],default,https://redd.it/rt6lv6


In [9]:
df['thumbnail'].value_counts()

default                                                                             849052
self                                                                                430566
image                                                                               113541
nsfw                                                                                  3317
spoiler                                                                               1925
                                                                                     ...  
https://a.thumbs.redditmedia.com/m8QV6nOfMondOzYlaoRpWQ2qjYjz0SB5DezQixtPfM8.jpg         1
https://a.thumbs.redditmedia.com/UvYbtr6hGZ3WFrk32-3oT-hJ18H50aStwX8RF08zKn8.jpg         1
https://a.thumbs.redditmedia.com/wL3ECjUia5J3zuQ3dVSnqtdMUag0o2sie4tMQjXXRu8.jpg         1
https://b.thumbs.redditmedia.com/3b2alO76OukxzyuDfWZzxld-1X7UqPsuvjqslCtvfwU.jpg         1
https://b.thumbs.redditmedia.com/TtUVXN1XpoXXuzY85bJMNo1451L4fTOqYqKailX9M-c.jpg         1

In [10]:
df.columns

Index(['created', 'id', 'author', 'retrieved', 'edited', 'pinned', 'archived',
       'locked', 'removed', 'deleted', 'is_self', 'is_video',
       'is_original_content', 'title', 'link_flair_text', 'upvote_ratio',
       'score', 'gilded', 'total_awards_received', 'num_comments',
       'num_crossposts', 'selftext', 'thumbnail', 'shortlink'],
      dtype='object')

### We just want the text for the most part

Experimenting with upvote_ratio and score as well.


In [11]:
df = df[['created','removed', 'deleted', 'is_self','title', 'upvote_ratio', 'score', 'gilded', 'total_awards_received', 'num_comments','selftext']]

pseudocode:
```
if removed or deleted:
    take date, title
elif is_self and type(selftext) is string:
    take date, title + selftext
else:
    take date, title
```

In [12]:
# keep self posts that were not removed or deleted and whose selftext is a string;
# .copy() makes an independent frame, which avoids SettingWithCopyWarning below
keep = df['is_self'] & ~(df['removed'] | df['deleted']) & df['selftext'].apply(lambda x: isinstance(x, str))
df_extracted = df.loc[keep].copy()
df_extracted['text'] = df_extracted['title'] + ' ' + df_extracted['selftext']
df_extracted = df_extracted[['created', 'text', 'upvote_ratio', 'score', 'gilded', 'total_awards_received', 'num_comments']]


In [13]:
df_extracted.head()

Unnamed: 0,created,text,upvote_ratio,score,gilded,total_awards_received,num_comments
3,2021-01-01 00:05:17,Advice for someone who's never dealt with stoc...,0.4,0,0,0,9
6,2021-01-01 00:13:13,So /r/stocks what was your 2020 investment rat...,0.63,4,0,0,50
8,2021-01-01 00:15:38,WSBVoteBot Log for Jan 01 2021 Every time a ne...,0.5,0,0,0,19
12,2021-01-01 00:18:40,Hedging your portfolio Just out of curiosity ...,0.6,2,0,0,4
17,2021-01-01 00:24:04,BNGO Bear Case (Serious) I'm actually quite sk...,0.74,42,0,0,99


In [14]:
df_extracted.shape

(305773, 7)

In [15]:

df_extracted['text'].head(20)


3     Advice for someone who's never dealt with stoc...
6     So /r/stocks what was your 2020 investment rat...
8     WSBVoteBot Log for Jan 01 2021 Every time a ne...
12    Hedging your portfolio Just out of curiosity  ...
17    BNGO Bear Case (Serious) I'm actually quite sk...
18    Daily Executions- December 31 2020 Hi Everyone...
31    Built two Google Sheets templates with automat...
35    $GAXY Youtuber London Investor will interview ...
41    Thoughts on Old School Value's stock tracking ...
50    GME is the Rockets 🚀🚀🚀🚀 Gamestop colors: Red  ...
51    Western Digital (WDC) rose 11.83% today. Anybo...
52    ARK invest selling $TSLA Ark invest ETF’s ARKW...
58    Recent IPO Chindata (CD). Looks promising. Wha...
61    AMC will be back. AMC had a rough year just li...
68    Am I dumb for keeping 60% of my portfolio in A...
69    PLTR - Public Service Announcment Listen up my...
72    Senseonics $SENS Jumped in on sens at .80  it’...
73    'There is still a painful void:' Greenwich

# Cleaning the text of each remaining post

### Removing emojis

In [16]:
# drops every non-ASCII character: emojis go, but so do curly quotes and accented letters
df_extracted['text'] = df_extracted['text'].str.encode('ascii', 'ignore').str.decode('ascii')
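
The ASCII round trip removes more than emojis. If only emojis should go, one alternative is a regex over common emoji code-point ranges. This is a sketch: the ranges below are an assumption and are not exhaustive, and `strip_emoji` is a hypothetical helper name.

```python
import re

# Assumed, non-exhaustive emoji ranges: pictographs, emoticons and transport
# symbols (U+1F300-U+1FAFF) plus miscellaneous symbols and dingbats (U+2600-U+27BF)
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]+")

def strip_emoji(text: str) -> str:
    """Remove characters in the assumed emoji ranges, keep everything else."""
    return EMOJI_PATTERN.sub("", text)

# Two rocket emojis (U+1F680) and a curly apostrophe (U+2019)
sample = "GME is the Rockets \U0001F680\U0001F680 Ark invest ETF\u2019s"

ascii_only = sample.encode("ascii", "ignore").decode("ascii")
emoji_only = strip_emoji(sample)
```

On a column this would be applied with `df_extracted['text'].map(strip_emoji)`. Since punctuation is stripped later anyway, the difference mainly matters for accented letters.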

### Removing links

In [18]:
# df_extracted['text'][10]

In [19]:
import re

#regex from chatgpt seems to work
url_pattern = re.compile(r"(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,})")
# regex=True is needed for a compiled pattern on recent pandas versions
df_extracted['text'] = df_extracted['text'].str.replace(url_pattern, '', regex=True)

### Remove Reddit mentions (Maybe)

In [20]:
# user and subreddit names can also contain underscores (and hyphens for users)
df_extracted['text'] = df_extracted['text'].str.replace(re.compile(r"/[ur]/[A-Za-z0-9_-]+"), '', regex=True)

### Remove Duplicate texts (likely from bots)

In [21]:
before = df_extracted.shape[0]
print(f'Shape before drop duplicates: {df_extracted.shape}')
df_extracted = df_extracted.drop_duplicates(subset=["text"], keep=False)
print(f'Shape after drop duplicates: {df_extracted.shape}')
print(f'Records lost: {before - df_extracted.shape[0]}')

Shape before drop duplicates: (305773, 7)
Shape after drop duplicates: (302587, 7)
Records lost: 3186
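
Note that `keep=False` drops every copy of a repeated text, not just the extra ones. A small sketch on made-up rows shows the difference from `keep="first"`:

```python
import pandas as pd

# Toy rows, not real data
posts = pd.DataFrame({"text": ["buy now", "buy now", "GME earnings thread"]})

# keep=False drops every copy of a repeated text (what the cell above does)
drop_all = posts.drop_duplicates(subset=["text"], keep=False)

# keep="first" would retain one copy of each repeated text
keep_one = posts.drop_duplicates(subset=["text"], keep="first")
```

Dropping all copies fits the bot-spam assumption; `keep="first"` would be the choice if a repeated text might still be a legitimate post.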


In [22]:
df_extracted['text']

3          Advice for someone who's never dealt with stoc...
6          So  what was your 2020 investment rate of retu...
8          WSBVoteBot Log for Jan 01 2021 Every time a ne...
12         Hedging your portfolio Just out of curiosity  ...
17         BNGO Bear Case (Serious) I'm actually quite sk...
                                 ...                        
1482003    Why do people open up multiple positions of th...
1482010    Five penny stocks to put on your watchlist in ...
1482012    Any suggestion on what to do with an employer ...
1482030    Best Keeper credit cards? What are the best ke...
1482034    im a teen and i want to start investing in ind...
Name: text, Length: 302587, dtype: object

### Extracting what each post is about (ticker information)


Need to add more tickers

In [23]:
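import json

# stonks.json maps each ticker symbol to a list of names to search for
with open('stonks.json', 'r') as f:
    ticker_dict = json.load(f)


def find_ticker(text):
    """Return a space-separated string of the tickers mentioned in text, or NaN if none."""
    mentioned = []
    for ticker, names in ticker_dict.items():
        # plain substring match: a short name also matches inside longer words
        if any(name in text for name in names):
            mentioned.append(ticker)
    return " ".join(mentioned) if mentioned else np.nan

df_extracted['mentioned'] = df_extracted['text'].apply(find_ticker)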
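
Because `name in text` is a substring test, a one-letter ticker such as `T` can match inside any word that contains a capital T. A possible refinement, sketched here with a made-up stand-in for stonks.json, is to match names only as whole tokens. The helper names are hypothetical.

```python
import re

# Hypothetical stand-in for the contents of stonks.json
ticker_names = {"T": ["T", "AT&T"], "TSLA": ["TSLA", "Tesla"], "GME": ["GME", "GameStop"]}

def compile_matchers(names_by_ticker):
    """Build one regex per ticker that matches any of its names as a whole token."""
    matchers = {}
    for ticker, names in names_by_ticker.items():
        alternatives = "|".join(re.escape(name) for name in names)
        # (?<!\w) and (?!\w) reject matches embedded in a longer word
        matchers[ticker] = re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)")
    return matchers

def find_tickers(text, matchers):
    """Return the tickers whose names occur in text as whole tokens."""
    return [ticker for ticker, pattern in matchers.items() if pattern.search(text)]

matchers = compile_matchers(ticker_names)
sample = "ARK invest selling TSLA"

# Substring matching (as in the cell above) versus whole-token matching
substring_hits = [t for t, names in ticker_names.items() if any(n in sample for n in names)]
boundary_hits = find_tickers(sample, matchers)
```

Whole-token matching trades some recall (for example, names glued to other characters) for fewer false positives on short tickers.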


In [25]:
# df_extracted['mentioned'].value_counts()
# df_extracted[df_extracted['mentioned'] == 'MSFT TSLA AAPL GOOGL NVDA WFC']['text'][912309]

In [26]:
df_extracted['mentioned'].info()

<class 'pandas.core.series.Series'>
Int64Index: 302587 entries, 3 to 1482034
Series name: mentioned
Non-Null Count  Dtype 
--------------  ----- 
80256 non-null  object
dtypes: object(1)
memory usage: 4.6+ MB


In [27]:
df_extracted.to_csv('just_dates_and_text.csv', index=False)

In [28]:
df_extracted.head(100)

Unnamed: 0,created,text,upvote_ratio,score,gilded,total_awards_received,num_comments,mentioned
3,2021-01-01 00:05:17,Advice for someone who's never dealt with stoc...,0.4,0,0,0,9,
6,2021-01-01 00:13:13,So what was your 2020 investment rate of retu...,0.63,4,0,0,50,
8,2021-01-01 00:15:38,WSBVoteBot Log for Jan 01 2021 Every time a ne...,0.5,0,0,0,19,GME
12,2021-01-01 00:18:40,Hedging your portfolio Just out of curiosity ...,0.6,2,0,0,4,
17,2021-01-01 00:24:04,BNGO Bear Case (Serious) I'm actually quite sk...,0.74,42,0,0,99,
18,2021-01-01 00:24:14,Daily Executions- December 31 2020 Hi Everyone...,0.91,9,0,0,9,
31,2021-01-01 00:37:45,Built two Google Sheets templates with automat...,0.96,50,0,1,4,GOOGL
35,2021-01-01 00:42:36,$GAXY Youtuber London Investor will interview ...,0.97,31,0,0,7,
41,2021-01-01 00:47:19,Thoughts on Old School Value's stock tracking ...,0.5,0,0,0,1,
50,2021-01-01 00:56:35,GME is the Rockets Gamestop colors: Red Whit...,0.82,57,0,1,10,GME


### Removing punctuation

In [29]:
df_extracted['text'] = df_extracted['text'].replace(r'[^\w\s]', '', regex=True)


### Removing capitalization

In [30]:
df_extracted['text'] = df_extracted['text'].str.lower()



### Stop word removal

In [31]:
stop_word_list = set(stopwords.words('english'))


df_extracted['text'] = df_extracted['text'].map(lambda x : " ".join(w for w in x.split() if w not in stop_word_list))
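
One thing to be aware of is the order of the steps: punctuation is stripped before stop words are removed, so contractions in the stop-word list (such as "don't") no longer match once the apostrophe is gone. A sketch with a small made-up stop-word set (the notebook uses NLTK's English list) shows the effect:

```python
import re

# Small hypothetical stop-word set; the notebook uses NLTK's English list
STOP_WORDS = {"i", "the", "to", "don't", "it's"}

def strip_punctuation(text):
    """Remove everything that is not a word character or whitespace."""
    return re.sub(r"[^\w\s]", "", text)

def drop_stop_words(text):
    """Remove whitespace-separated tokens that appear in STOP_WORDS."""
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

raw = "i don't want to sell the stock"

# Notebook order: punctuation first, so "don't" becomes "dont" and survives
punct_first = drop_stop_words(strip_punctuation(raw))

# Alternative order: stop words first, then punctuation
stops_first = strip_punctuation(drop_stop_words(raw))
```

Whether the leftover tokens matter depends on the downstream model; swapping the two cells would be the simplest way to avoid them.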

### Lemmatization (instead of stemming)

The following cell takes a while to run (under 5 min)

In [32]:
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize




nltk.download('wordnet')
nltk.download('punkt')
lemmatizer = WordNetLemmatizer()
df_extracted['text'] = df_extracted['text'].apply(lambda x: " ".join(lemmatizer.lemmatize(word) for word in word_tokenize(x)))

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/benfouch/nltk_data...
[nltk_data] Downloading package punkt to /Users/benfouch/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [33]:
df_extracted['text']

3          advice someone who never dealt stock reposted ...
6          2020 investment rate return also give rough se...
8          wsbvotebot log jan 01 2021 every time new subm...
12         hedging portfolio curiosity anyone make move h...
17         bngo bear case serious im actually quite skept...
                                 ...                        
1482003    people open multiple position pair whats benef...
1482010    five penny stock put watchlist 2022 llnw limel...
1482012    suggestion employer pay hi young man working s...
1482030    best keeper credit card best keeper card know ...
1482034    im teen want start investing index stock sp500...
Name: text, Length: 302587, dtype: object

In [34]:
df_extracted.head(10)

Unnamed: 0,created,text,upvote_ratio,score,gilded,total_awards_received,num_comments,mentioned
3,2021-01-01 00:05:17,advice someone who never dealt stock reposted ...,0.4,0,0,0,9,
6,2021-01-01 00:13:13,2020 investment rate return also give rough se...,0.63,4,0,0,50,
8,2021-01-01 00:15:38,wsbvotebot log jan 01 2021 every time new subm...,0.5,0,0,0,19,GME
12,2021-01-01 00:18:40,hedging portfolio curiosity anyone make move h...,0.6,2,0,0,4,
17,2021-01-01 00:24:04,bngo bear case serious im actually quite skept...,0.74,42,0,0,99,
18,2021-01-01 00:24:14,daily execution december 31 2020 hi everyone i...,0.91,9,0,0,9,
31,2021-01-01 00:37:45,built two google sheet template automatic data...,0.96,50,0,1,4,GOOGL
35,2021-01-01 00:42:36,gaxy youtuber london investor interview coo ga...,0.97,31,0,0,7,
41,2021-01-01 00:47:19,thought old school value stock tracking spread...,0.5,0,0,0,1,
50,2021-01-01 00:56:35,gme rocket gamestop color red white blackhoust...,0.82,57,0,1,10,GME


spectral analysis

Pseudocode to combine datasets
```
split dataframe into one dataframe per day, where a day runs from 4 pm the previous day until 4 pm that day
get rid of weekend posts

for each day:
    for each record in day:
        for each company in stocks:
            if company in mentioned column:
                add post to that company and that day
remove company/day groups with fewer than the minimum number of posts (MIN_POSTS_PER_DAY)
```

### Splitting into 365 dataframes (one per day)


In [35]:
# shift timestamps forward 8 hours so that 4 pm becomes midnight;
# each "day" then runs from 4 pm the previous day until 4 pm
df_extracted['new_date'] = df_extracted['created'] + pd.Timedelta(hours=8)
groups = df_extracted.groupby(df_extracted['new_date'].dt.date)

day_columns = ['created', 'new_date', 'text', 'upvote_ratio', 'score', 'gilded', 'total_awards_received', 'num_comments', 'mentioned']
day_frames = [group[day_columns] for _, group in groups]
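
To see why adding 8 hours produces the 4 pm cutoff, here is a small sketch on made-up timestamps. It assumes `created` is expressed in the same time zone as the intended 4 pm cutoff, which the data itself does not state.

```python
import pandas as pd

# Toy timestamps, not real data
created = pd.Series(pd.to_datetime([
    "2021-01-04 15:59:59",  # just before 4 pm
    "2021-01-04 16:00:00",  # exactly 4 pm
    "2021-01-05 09:30:00",  # next morning
]))

# Adding 8 hours moves 16:00 to midnight, so the calendar date rolls over at 4 pm
assigned_day = (created + pd.Timedelta(hours=8)).dt.date.astype(str).tolist()
```

The post at 15:59:59 stays on January 4, while the 16:00:00 post is already counted toward January 5.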


In [36]:
day_frames[0].head()

Unnamed: 0,created,new_date,text,upvote_ratio,score,gilded,total_awards_received,num_comments,mentioned
3,2021-01-01 00:05:17,2021-01-01 08:05:17,advice someone who never dealt stock reposted ...,0.4,0,0,0,9,
6,2021-01-01 00:13:13,2021-01-01 08:13:13,2020 investment rate return also give rough se...,0.63,4,0,0,50,
8,2021-01-01 00:15:38,2021-01-01 08:15:38,wsbvotebot log jan 01 2021 every time new subm...,0.5,0,0,0,19,GME
12,2021-01-01 00:18:40,2021-01-01 08:18:40,hedging portfolio curiosity anyone make move h...,0.6,2,0,0,4,
17,2021-01-01 00:24:04,2021-01-01 08:24:04,bngo bear case serious im actually quite skept...,0.74,42,0,0,99,


In [37]:
day_frames_no_nan = []
for x in day_frames:
    day_frames_no_nan.append(x.dropna())

In [38]:
day_frames_no_nan[0].head()

Unnamed: 0,created,new_date,text,upvote_ratio,score,gilded,total_awards_received,num_comments,mentioned
8,2021-01-01 00:15:38,2021-01-01 08:15:38,wsbvotebot log jan 01 2021 every time new subm...,0.5,0,0,0,19,GME
31,2021-01-01 00:37:45,2021-01-01 08:37:45,built two google sheet template automatic data...,0.96,50,0,1,4,GOOGL
50,2021-01-01 00:56:35,2021-01-01 08:56:35,gme rocket gamestop color red white blackhoust...,0.82,57,0,1,10,GME
52,2021-01-01 00:59:31,2021-01-01 08:59:31,ark invest selling tsla ark invest etf arkw ar...,0.4,0,0,0,18,TSLA T
61,2021-01-01 01:08:50,2021-01-01 09:08:50,amc back amc rough year like everyone else exc...,0.56,2,0,0,18,GME AMC


In [39]:
day_frames_split = []

for day_df in day_frames_no_nan:  # day_df avoids overwriting the global df
    # rows where more than one ticker is mentioned
    multi = day_df['mentioned'].str.contains(' ')
    split_df = day_df[multi].copy()
    # split the space-separated tickers into a list
    split_df['mentioned'] = split_df['mentioned'].str.split(' ')
    # expand to one row per ticker
    split_df = split_df.explode('mentioned')
    # recombine with the single-ticker rows and restore chronological order
    single_df = day_df[~multi]
    combined = pd.concat([single_df, split_df], sort=False)
    combined = combined.sort_values(by='created').reset_index(drop=True)
    day_frames_split.append(combined)
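
The same result can probably be reached without separating single-ticker and multi-ticker rows, since `explode` leaves one-element lists as single rows. A sketch on made-up data:

```python
import pandas as pd

# Toy rows, not real data
day = pd.DataFrame({
    "created": pd.to_datetime(["2021-01-04 16:04:25", "2021-01-04 16:07:43"]),
    "text": ["taug 15 prob rise", "niklf niclv emergence"],
    "mentioned": ["WMT T", "GME"],
})

# Split every row's tickers into a list, then emit one row per ticker;
# single-ticker rows become one-element lists and pass through unchanged
exploded = (
    day.assign(mentioned=day["mentioned"].str.split(" "))
       .explode("mentioned")
       .sort_values("created", kind="stable")
       .reset_index(drop=True)
)
```

The stable sort keeps tickers from the same post in their original order.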
    



In [40]:
day_frames_split[4].head()

Unnamed: 0,created,new_date,text,upvote_ratio,score,gilded,total_awards_received,num_comments,mentioned
0,2021-01-04 16:04:25,2021-01-05 00:04:25,taug 15 prob rise dollar end month announce wa...,0.94,28,0,0,21,WMT
1,2021-01-04 16:04:25,2021-01-05 00:04:25,taug 15 prob rise dollar end month announce wa...,0.94,28,0,0,21,T
2,2021-01-04 16:07:43,2021-01-05 00:07:43,niklf niclv emergence ev created boom lithium ...,0.73,5,0,0,1,GME
3,2021-01-04 16:18:13,2021-01-05 00:18:13,5 reason take tsla mar ok listen fellow autist...,0.76,31,0,0,41,TSLA
4,2021-01-04 16:19:50,2021-01-05 00:19:50,icln rated poorly morningstar rarely use reddi...,0.86,56,0,0,56,GME


### Aggregating everything together for each day

In [41]:
concat_days = []


for day_df in day_frames_split:
    by_ticker = day_df.groupby("mentioned")

    # all post text for each ticker, joined into one string
    grouped = by_ticker["text"].apply(" ".join).reset_index()

    # number of posts per ticker; naming the column explicitly works across pandas versions
    mentioned_counts = by_ticker.size().reset_index(name="mentioned_count")
    grouped = grouped.merge(mentioned_counts, on="mentioned", how="left")

    group_2 = by_ticker.agg({
        "upvote_ratio": "mean",  # average the 'upvote_ratio' column
        "score": "mean",  # average the 'score' column
        "gilded": "mean",
        "total_awards_received": "mean",
        "num_comments": "mean",
        "new_date": "min",  # take the minimum value of the 'new_date' column
    })

    # # exclude groups that have fewer than MIN_POSTS_PER_DAY rows
    # grouped = grouped.query("mentioned_count >= @MIN_POSTS_PER_DAY")

    grouped = grouped.merge(group_2, on="mentioned", how="left")
    concat_days.append(grouped)
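
As a sketch of how the per-ticker summary and the minimum-post filter (commented out in the cell above) could be written in one pass with named aggregation. The data and the threshold of 2 are made up for the example; the notebook sets `MIN_POSTS_PER_DAY = 5`.

```python
import pandas as pd

MIN_POSTS_PER_DAY = 2  # toy threshold for this example only

# Toy rows, not real data
day = pd.DataFrame({
    "mentioned": ["GME", "GME", "AAPL"],
    "text": ["gme rocket", "gme squeeze", "apple earnings"],
    "score": [10, 30, 5],
})

# Named aggregation: each keyword becomes an output column
summary = (
    day.groupby("mentioned")
       .agg(
           text=("text", " ".join),
           mentioned_count=("text", "size"),
           score=("score", "mean"),
       )
       .reset_index()
)

# Keep only tickers with enough posts on this day
kept = summary[summary["mentioned_count"] >= MIN_POSTS_PER_DAY]
```

Plain boolean indexing is used for the filter here; with `DataFrame.query`, the Python variable would need an `@` prefix.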

In [42]:
concat_days[6]

Unnamed: 0,mentioned,text,mentioned_count,upvote_ratio,score,gilded,total_awards_received,num_comments,new_date
0,AAPL,america v china leading stock prediction movin...,11,0.734545,99.454545,0.090909,1.909091,50.363636,2021-01-07 01:40:48
1,BA,watchlist 172021 alny great small channel fant...,4,0.8125,3.25,0.0,0.0,7.75,2021-01-07 09:26:43
2,BABA,america v china leading stock prediction movin...,13,0.886923,117.384615,0.076923,0.615385,43.846154,2021-01-07 01:40:48
3,BB,bbi dd personal forecast check recommend peopl...,3,0.92,228.333333,0.0,4.666667,114.0,2021-01-07 00:22:33
4,CMCSA,nikola stock forecast nikola stock rise ash co...,1,0.18,0.0,0.0,0.0,18.0,2021-01-07 05:50:37
5,DIS,q long short 2021 hi everyone got question reg...,6,0.843333,38.166667,0.0,0.166667,18.333333,2021-01-07 03:58:53
6,F,watchlist 172021 alny great small channel fant...,6,0.808333,134.0,0.0,2.333333,68.0,2021-01-07 09:26:43
7,GME,thanks tsla 1092 gain 2020 pandemic started li...,18,0.869444,304.277778,0.166667,3.833333,2811.388889,2021-01-07 00:50:20
8,GOOGL,america v china leading stock prediction movin...,4,0.8825,42.25,0.0,0.25,73.75,2021-01-07 01:40:48
9,INTC,bb king blast past legendary comeback bb kingb...,3,0.66,208.666667,0.0,4.666667,120.666667,2021-01-07 11:40:22


This is the most beautiful dataframe ever. It contains the original Reddit posts, condensed into 365 dataframes, one per day of posts. Each dataframe is grouped by company. For each company it holds the combined text of all posts about that company, the number of posts that mention it (`mentioned_count`), the averages of `upvote_ratio`, `score`, `gilded`, `total_awards_received` and `num_comments`, and the earliest `new_date` among those posts.

In [43]:
# total number of (day, ticker) rows; avoids shadowing the built-in sum
total_rows = sum(len(day_df) for day_df in concat_days)
total_rows

9275

**TODO**
- Remove weekends
- Remove times (not needed)
- could add more aggregate functions
- add ticker to end (match day and mentioned)
- Could go sentiment analysis way
- Or could go word2vec way
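
For the "match day and mentioned" item, one possible approach is to reshape the wide ticker table (one column per ticker) into long form and merge on both keys. This is a sketch on made-up numbers; `ticker_long` and `ticker_value` are hypothetical names, and what the values in the ticker file represent is not assumed here.

```python
import pandas as pd

# Toy stand-ins for the wide ticker table and the per-day, per-ticker post table
ticker_wide = pd.DataFrame(
    {"GME": [-0.09, 0.02], "AAPL": [-0.03, 0.01]},
    index=pd.to_datetime(["2021-01-04", "2021-01-05"]).rename("Date"),
)
posts = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-04", "2021-01-04", "2021-01-05"]),
    "mentioned": ["AAPL", "GME", "GME"],
    "mentioned_count": [18, 1, 7],
})

# Reshape to long form: one row per (Date, ticker) with its value
ticker_long = ticker_wide.reset_index().melt(
    id_vars="Date", var_name="mentioned", value_name="ticker_value"
)

# Each post group picks up only the value for its own ticker and date
matched = posts.merge(ticker_long, on=["Date", "mentioned"], how="left")
```

Days without ticker data (weekends, holidays) would end up with a missing `ticker_value`, which also gives a handle for the "Remove weekends" item.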




### Remove times

In [44]:
def remove_time(x):
    # keep the date and reset the time of day to midnight
    return pd.Timestamp(x).normalize()

In [45]:
for day_df in concat_days:  # day_df avoids overwriting the global df
    day_df['new_date'] = day_df['new_date'].apply(remove_time)

In [46]:
concat_days[0].head(100)

Unnamed: 0,mentioned,text,mentioned_count,upvote_ratio,score,gilded,total_awards_received,num_comments,new_date
0,AAPL,top stock pick 2021 stock etf investment looki...,4,0.8975,98.0,0.0,0.0,64.5,2021-01-01
1,AMC,amc back amc rough year like everyone else exc...,1,0.56,2.0,0.0,0.0,18.0,2021-01-01
2,BABA,let reflect performance 2020 happy new year fo...,1,0.42,0.0,0.0,0.0,3.0,2021-01-01
3,F,tsla 2021 trade plan happy new year wsb here p...,1,0.89,79.0,0.0,0.0,49.0,2021-01-01
4,GME,wsbvotebot log jan 01 2021 every time new subm...,11,0.777273,93.181818,0.0,1.363636,42.090909,2021-01-01
5,GOOGL,built two google sheet template automatic data...,2,0.815,26.5,0.0,0.5,9.5,2021-01-01
6,INTC,teladoc health buy 2021 hi happy new year hope...,2,0.66,4.0,0.0,0.0,18.5,2021-01-01
7,META,teladoc health buy 2021 hi happy new year hope...,2,0.66,4.0,0.0,0.0,18.5,2021-01-01
8,MSFT,salesforce crm going next sap got two goal one...,1,0.57,11.0,0.0,0.0,51.0,2021-01-01
9,PFE,advise bntx bought 200 share bntx average pric...,1,0.83,7.0,0.0,0.0,14.0,2021-01-01


In [47]:
# concatenate all days in a single call instead of growing the frame in a loop
full_df = pd.concat(concat_days)
full_df.rename(columns={'new_date':'Date'}, inplace=True)

full_df.set_index('Date', inplace=True)
full_df.head()

Date,mentioned,text,mentioned_count,upvote_ratio,score,gilded,total_awards_received,num_comments
2021-01-01,AAPL,top stock pick 2021 stock etf investment looki...,4,0.8975,98.0,0.0,0.0,64.5
2021-01-01,AMC,amc back amc rough year like everyone else exc...,1,0.56,2.0,0.0,0.0,18.0
2021-01-01,BABA,let reflect performance 2020 happy new year fo...,1,0.42,0.0,0.0,0.0,3.0
2021-01-01,F,tsla 2021 trade plan happy new year wsb here p...,1,0.89,79.0,0.0,0.0,49.0
2021-01-01,GME,wsbvotebot log jan 01 2021 every time new subm...,11,0.777273,93.181818,0.0,1.363636,42.090909


### Matching tickers to mentioned and date

In [48]:
ticker_df = pd.read_csv('ticker_data.csv')


ticker_df['Date'] =  pd.to_datetime(ticker_df['Date'], format='%Y-%m-%d')
ticker_df['Date'] = ticker_df['Date'].apply(remove_time)
ticker_df.set_index('Date', inplace=True)


ticker_df.head()

match_df = pd.merge(left=ticker_df, right=full_df, how='left', on='Date')

match_df.head()

Date,MSFT,TSLA,GME,AMC,BB,NOK,BABA,AAPL,GOOGL,DIS,SNAP,SPOT,NVDA,F,BA,META,MCD,V,WMT,JNJ,JPM,T,VZ,PG,MRK,KO,PFE,XOM,GE,WFC,CSCO,INTC,CMCSA,PEP,mentioned,text,mentioned_count,upvote_ratio,score,gilded,total_awards_received,num_comments
2021-01-04,-0.02175,0.01433,-0.092105,-0.086364,-0.01791,-0.025063,0.00596,-0.030782,-0.019244,-0.025129,-0.016852,-0.020226,0.000706,-0.032917,-0.034667,-0.021253,-0.019908,-0.011305,0.015454,-0.004706,-0.012784,0.001701,-0.001866,-0.013175,-0.012683,-0.027824,-0.001627,0.001206,-0.038568,-0.020449,-0.007899,-0.00441,-0.033856,-0.018638,AAPL,weekend iv report ticker low iv cheap premium ...,18,0.771111,163.5,0.333333,4.166667,79.277778
2021-01-04,-0.02175,0.01433,-0.092105,-0.086364,-0.01791,-0.025063,0.00596,-0.030782,-0.019244,-0.025129,-0.016852,-0.020226,0.000706,-0.032917,-0.034667,-0.021253,-0.019908,-0.011305,0.015454,-0.004706,-0.012784,0.001701,-0.001866,-0.013175,-0.012683,-0.027824,-0.001627,0.001206,-0.038568,-0.020449,-0.007899,-0.00441,-0.033856,-0.018638,AMC,crazy buy amc right thinking going long amc lo...,1,0.44,0.0,0.0,0.0,18.0
2021-01-04,-0.02175,0.01433,-0.092105,-0.086364,-0.01791,-0.025063,0.00596,-0.030782,-0.019244,-0.025129,-0.016852,-0.020226,0.000706,-0.032917,-0.034667,-0.021253,-0.019908,-0.011305,0.015454,-0.004706,-0.012784,0.001701,-0.001866,-0.013175,-0.012683,-0.027824,-0.001627,0.001206,-0.038568,-0.020449,-0.007899,-0.00441,-0.033856,-0.018638,BA,weekend iv report ticker low iv cheap premium ...,7,0.765714,97.428571,0.0,0.428571,37.142857
2021-01-04,-0.02175,0.01433,-0.092105,-0.086364,-0.01791,-0.025063,0.00596,-0.030782,-0.019244,-0.025129,-0.016852,-0.020226,0.000706,-0.032917,-0.034667,-0.021253,-0.019908,-0.011305,0.015454,-0.004706,-0.012784,0.001701,-0.001866,-0.013175,-0.012683,-0.027824,-0.001627,0.001206,-0.038568,-0.020449,-0.007899,-0.00441,-0.033856,-0.018638,BABA,weekend iv report ticker low iv cheap premium ...,9,0.737778,58.111111,0.0,0.222222,31.777778
2021-01-04,-0.02175,0.01433,-0.092105,-0.086364,-0.01791,-0.025063,0.00596,-0.030782,-0.019244,-0.025129,-0.016852,-0.020226,0.000706,-0.032917,-0.034667,-0.021253,-0.019908,-0.011305,0.015454,-0.004706,-0.012784,0.001701,-0.001866,-0.013175,-0.012683,-0.027824,-0.001627,0.001206,-0.038568,-0.020449,-0.007899,-0.00441,-0.033856,-0.018638,BB,guess company due diligence without revealing ...,1,0.96,417.0,1.0,2.0,51.0
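
Because the merge above is a left join from `ticker_df`, dates that have posts but no row in the ticker data appear to drop out (2021-01-01 shows up in `full_df` but not in the merged head), and ticker dates without any posts would come through with empty post columns. A minimal sketch on made-up toy frames (the names `prices` and `posts` and all values are illustrative assumptions, not taken from the dataset) uses the `indicator=True` option of `pd.merge` to make this visible:

```python
import pandas as pd

# Toy stand-ins for ticker_df and full_df; values are made up for illustration
prices = pd.DataFrame(
    {"Date": ["2021-01-04", "2021-01-05"], "GME": [-0.09, 0.01]}
).set_index("Date")
posts = pd.DataFrame(
    {
        "Date": ["2021-01-01", "2021-01-04", "2021-01-04"],
        "mentioned": ["GME", "GME", "AMC"],
        "mentioned_count": [11, 3, 1],
    }
).set_index("Date")

# indicator=True adds a `_merge` column: 'both' or 'left_only' for a left join
merged = pd.merge(left=prices, right=posts, how="left", on="Date", indicator=True)

# Dates in the price data that had no posts at all
no_posts = merged[merged["_merge"] == "left_only"]
print(merged)
print(no_posts)
```

In this toy case the 2021-01-01 post is dropped because there is no price row for that date, and 2021-01-05 is kept with missing post columns.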


### Filtering only for the target company


In [49]:
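# The first 34 columns of match_df are the per-ticker columns from ticker_df
companies = match_df.columns[0:34]

def get_target(row):
    """Return the value from the column of the ticker this row mentions, or None."""
    ticker = row['mentioned']
    if ticker in companies:
        return row[ticker]
    return None


match_df['target'] = match_df.apply(get_target, axis=1)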

In [50]:
match_df.drop(columns=companies, inplace=True)


In [51]:
match_df.head()

Date,mentioned,text,mentioned_count,upvote_ratio,score,gilded,total_awards_received,num_comments,target
2021-01-04,AAPL,weekend iv report ticker low iv cheap premium ...,18,0.771111,163.5,0.333333,4.166667,79.277778,-0.030782
2021-01-04,AMC,crazy buy amc right thinking going long amc lo...,1,0.44,0.0,0.0,0.0,18.0,-0.086364
2021-01-04,BA,weekend iv report ticker low iv cheap premium ...,7,0.765714,97.428571,0.0,0.428571,37.142857,-0.034667
2021-01-04,BABA,weekend iv report ticker low iv cheap premium ...,9,0.737778,58.111111,0.0,0.222222,31.777778,0.00596
2021-01-04,BB,guess company due diligence without revealing ...,1,0.96,417.0,1.0,2.0,51.0,-0.01791
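
The row-by-row `apply` used to build `target` can be slow on a large frame. One possible vectorized alternative, sketched here on a made-up toy frame (the values and the `ZZZ` ticker are illustrative assumptions), looks up each row's mentioned ticker by column position with NumPy indexing:

```python
import numpy as np
import pandas as pd

# Toy frame: two ticker columns plus the ticker each row mentions
toy = pd.DataFrame({
    "AAPL": [-0.03, 0.01, 0.02],
    "AMC": [-0.08, 0.05, 0.00],
    "mentioned": ["AAPL", "AMC", "ZZZ"],  # "ZZZ" has no matching column
})
companies = ["AAPL", "AMC"]

# Column position of each row's mentioned ticker, or -1 when there is no match
col_pos = pd.Index(companies).get_indexer(toy["mentioned"])

# Pick one value per row; rows with -1 pick an arbitrary column and are masked below
values = toy[companies].to_numpy()
picked = values[np.arange(len(toy)), col_pos]
toy["target"] = np.where(col_pos >= 0, picked, np.nan)

print(toy[["mentioned", "target"]])
```

Rows whose ticker has no column end up with `NaN`, which is also what the `None` returned by `get_target` becomes in a float column.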


In [52]:
match_df.to_csv('nice_combined_data.csv')

Resources on how to build word embeddings:


https://www.youtube.com/watch?v=ZogxNcyqVqE&ab_channel=TheAIUniversity


https://www.guru99.com/word-embedding-word2vec.html


http://web.stanford.edu/class/cs224n/

https://www.youtube.com/playlist?list=PLoROMvodv4rOSH4v6133s9LFPRHjEmbmJ

### Word Embeddings - Word2Vec




Training the word2vec model

### Figuring out which companies we can predict for

Since it would be unfair to ask a model to predict the price of a stock that is not mentioned in the data it is given, we need to restrict which companies we predict for.   

If we ask the model to predict the price of Tesla one hour ahead based on the Reddit comments from the previous 5 hours, Tesla needs to be mentioned in those 5 hours.     

I think a reasonable threshold is at least 5 posts mentioning the company in the last 5 hours for a window to be included as a training example.     

To decrease the likelihood of the target company not being mentioned, we can increase the time window.
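
A minimal sketch of that threshold check, assuming a frame with one row per post, a `created` timestamp, and a `mentioned` ticker column. The helper names `count_mentions` and `is_valid_example`, the constants, and the toy data are hypothetical and only illustrate the idea:

```python
import pandas as pd

MIN_POSTS = 5      # hypothetical threshold: minimum posts mentioning the ticker
WINDOW_HOURS = 5   # hypothetical look-back window in hours

def count_mentions(posts, ticker, predict_time, window_hours=WINDOW_HOURS):
    """Count posts mentioning `ticker` in the `window_hours` before `predict_time`."""
    start = predict_time - pd.Timedelta(hours=window_hours)
    in_window = (posts["created"] >= start) & (posts["created"] < predict_time)
    return int((in_window & (posts["mentioned"] == ticker)).sum())

def is_valid_example(posts, ticker, predict_time,
                     window_hours=WINDOW_HOURS, min_posts=MIN_POSTS):
    """True if the ticker is mentioned often enough to be a training example."""
    return count_mentions(posts, ticker, predict_time, window_hours) >= min_posts

# Toy data: six hourly TSLA posts starting at 08:00, plus a single GME post
posts = pd.DataFrame({
    "created": pd.to_datetime([
        "2021-01-04 08:00", "2021-01-04 09:00", "2021-01-04 10:00",
        "2021-01-04 11:00", "2021-01-04 12:00", "2021-01-04 13:00",
        "2021-01-04 12:30",
    ]),
    "mentioned": ["TSLA"] * 6 + ["GME"],
})
t = pd.Timestamp("2021-01-04 14:00")

print(is_valid_example(posts, "TSLA", t))                  # 5-hour window
print(is_valid_example(posts, "TSLA", t, window_hours=3))  # narrower window
print(is_valid_example(posts, "GME", t))
```

With this toy data TSLA passes with the 5-hour window but fails with a 3-hour one, which is the trade-off described above: a wider window makes it more likely that the target company clears the threshold.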