## Extract, Transform, Load 
This notebook connects to the Reddit API, extracts data, and stores it automatically. It also uses the Python library yfinance to gather Yahoo Finance stock data.

The goal is to pull stock data through the yfinance library, extract post content from Reddit, automatically transform and clean the data, and append it to a MongoDB database (via pymongo).

Ultimately, this process has the potential to be automated.

In [215]:
# Import dependencies
import yfinance as yf
import pandas as pd
import matplotlib.pyplot as plt
import pymongo
import requests
import praw
from datetime import date, timedelta
from config import KEY, CLIENT_ID, PW
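
The `config` module imported above is not shown in this notebook. As a minimal sketch of one way it could be written, assuming the credentials are kept in environment variables (the helper `read_setting` and the variable names `REDDIT_CLIENT_ID`, `REDDIT_SECRET`, and `REDDIT_PASSWORD` are hypothetical):

```python
# config.py (hypothetical layout; the real module is not shown in this notebook)
import os


def read_setting(name, default=""):
    """Return an environment variable, or `default` when it is unset."""
    return os.environ.get(name, default)


# Names expected by `from config import KEY, CLIENT_ID, PW`.
# The environment variable names below are assumptions for illustration.
CLIENT_ID = read_setting("REDDIT_CLIENT_ID")
KEY = read_setting("REDDIT_SECRET")
PW = read_setting("REDDIT_PASSWORD")
```

Keeping secrets out of the notebook this way means the notebook itself can be shared without exposing them.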

In [216]:
# Create variables for API credentials
client_id = CLIENT_ID
client_k = KEY
usr_agent = 'etlAPP'
username = 'joechancey11'
pw = PW

In [217]:
# Build an authenticated PRAW Reddit instance from the credentials above
def reddit_request():
    reddit = praw.Reddit(client_id=client_id, client_secret=client_k, user_agent=usr_agent, username=username, password=pw)
    return reddit

In [218]:
# Create the Reddit instance used for all later queries
reddit = reddit_request()

In [219]:
# Choose our subreddit - Can be swapped
subreddit = reddit.subreddit("wallstreetbets")

In [220]:
# # Skip this cell. It is a sample search for inspecting the keys the Reddit API returns; PRAW exposes them as attributes, so it is not needed.
# first_search = subreddit.search("GME", limit=5, sort='top')
# # Commented out because the response is long. Uncomment to view the keys.
# [vars(x) for x in first_search]

In [221]:
# Create an empty DataFrame to add our data
df = pd.DataFrame(columns=['Title', 'Date', 'Upvote Ratio', 'Total Comments'])

In [222]:
# Query the Reddit API for submissions that mention GME.
# Rows are collected in a list because DataFrame.append was removed in pandas 2.0.
rows = [{'Title': s.title, 'Date': s.created_utc, 'Upvote Ratio': s.upvote_ratio, 'Total Comments': s.num_comments}
        for s in subreddit.search("GME", limit=50)]
df = pd.DataFrame(rows, columns=df.columns)
df

| | Title | Date | Upvote Ratio | Total Comments |
|---|---|---|---|---|
| 0 | Daily Popular Tickers Thread for September 16,... | 1631790000.0 | 0.93 | 12391 |
| 1 | Daily Popular Tickers Thread for September 15,... | 1631707000.0 | 0.92 | 7229 |
| 2 | I just quit my job so that I could roll over m... | 1630590000.0 | 0.82 | 2079 |
| 3 | Today is the day. Over 2M in my favorite stock... | 1631101000.0 | 0.89 | 1347 |
| 4 | Daily Popular Tickers Thread for September 20,... | 1632132000.0 | 0.92 | 2139 |
| 5 | GME GANG IS BACK | 1629831000.0 | 0.85 | 1526 |
| 6 | Daily Popular Tickers Thread for September 21,... | 1632218000.0 | 0.92 | 1780 |
| 7 | My GME gain from Tuesday. Went all in with my ... | 1629889000.0 | 0.85 | 1445 |
| 8 | I made a lot of money on GME and quit my job, ... | 1630343000.0 | 0.77 | 2961 |
| 9 | Daily Popular Tickers Thread for September 22,... | 1632305000.0 | 0.91 | 1449 |


In [223]:
# Keep only the rows whose title contains "GME"
df = df[df["Title"].str.contains("GME")].copy()
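
The title filter is case-sensitive and would fail on a missing title, because `str.contains` returns a missing value for such rows. A minimal sketch of a more forgiving variant, using hypothetical sample titles rather than live Reddit data, might look like this:

```python
import pandas as pd

# Hypothetical sample titles, including a lowercase mention and a missing value
titles = pd.DataFrame({"Title": ["GME GANG IS BACK", "gme to the moon", "AMC earnings", None]})

# case=False also matches "gme"; na=False treats missing titles as non-matches
mask = titles["Title"].str.contains("GME", case=False, na=False)
filtered = titles[mask].reset_index(drop=True)
print(filtered["Title"].tolist())
```

Whether lowercase mentions should count is a judgment call for the analysis; the notebook's own filter keeps only the exact uppercase ticker.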

In [224]:
# Convert the Unix timestamps (seconds) to dates, dropping the time of day
df['Date'] = pd.to_datetime(df['Date'], unit='s').dt.normalize()
df

| | Title | Date | Upvote Ratio | Total Comments |
|---|---|---|---|---|
| 0 | Daily Popular Tickers Thread for September 16,... | 2021-09-16 | 0.93 | 12391 |
| 1 | Daily Popular Tickers Thread for September 15,... | 2021-09-15 | 0.92 | 7229 |
| 2 | I just quit my job so that I could roll over m... | 2021-09-02 | 0.82 | 2079 |
| 3 | Today is the day. Over 2M in my favorite stock... | 2021-09-08 | 0.89 | 1347 |
| 4 | Daily Popular Tickers Thread for September 20,... | 2021-09-20 | 0.92 | 2139 |
| 5 | GME GANG IS BACK | 2021-08-24 | 0.85 | 1526 |
| 6 | Daily Popular Tickers Thread for September 21,... | 2021-09-21 | 0.92 | 1780 |
| 7 | My GME gain from Tuesday. Went all in with my ... | 2021-08-25 | 0.85 | 1445 |
| 8 | I made a lot of money on GME and quit my job, ... | 2021-08-30 | 0.77 | 2961 |
| 9 | Daily Popular Tickers Thread for September 22,... | 2021-09-22 | 0.91 | 1449 |
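
Reddit reports `created_utc` as seconds since the Unix epoch, and `.dt.normalize()` then drops the time of day. As a small standard-library illustration of the same conversion for a single value (the helper name `utc_date` is hypothetical and not part of the notebook):

```python
from datetime import datetime, timezone


def utc_date(created_utc):
    """Convert a Unix timestamp in seconds to a UTC calendar date."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc).date()


# First row of the search results shown earlier
print(utc_date(1631790000.0))  # 2021-09-16
```

Passing an explicit UTC timezone keeps the result independent of the machine's local timezone.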


## Yahoo Finance Data

In [225]:
# Create a yfinance Ticker object for GME
gme = yf.Ticker("GME")
# Uncomment line below if you'd like to confirm ticker data
# gme.info

In [227]:
# Pull daily price history for the past year
current_date = date.today()
year_ago = current_date - timedelta(days=365)
hist = gme.history(start=year_ago, end=current_date)
hist

| Date | Open | High | Low | Close | Volume | Dividends | Stock Splits |
|---|---|---|---|---|---|---|---|
| 2020-09-23 | 10.600000 | 10.860000 | 9.920000 | 10.040000 | 10651200 | 0 | 0 |
| 2020-09-24 | 9.710000 | 9.810000 | 9.010000 | 9.140000 | 7938800 | 0 | 0 |
| 2020-09-25 | 9.190000 | 10.180000 | 9.100000 | 10.020000 | 7515200 | 0 | 0 |
| 2020-09-28 | 10.160000 | 10.260000 | 9.550000 | 10.090000 | 6764300 | 0 | 0 |
| 2020-09-29 | 10.000000 | 10.650000 | 9.930000 | 10.350000 | 5237600 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2021-09-16 | 202.330002 | 216.550003 | 201.149994 | 206.369995 | 3058200 | 0 | 0 |
| 2021-09-17 | 208.020004 | 212.490005 | 200.779999 | 204.970001 | 3945900 | 0 | 0 |
| 2021-09-20 | 200.000000 | 202.850006 | 184.550003 | 192.199997 | 3941800 | 0 | 0 |
| 2021-09-21 | 199.360001 | 199.360001 | 186.000000 | 189.949997 | 2633800 | 0 | 0 |
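
With both datasets extracted, the load step described in the introduction could be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's actual loader: the helper `load_documents` is hypothetical, and it accepts any object that offers an `insert_many` method, as a pymongo collection does.

```python
def load_documents(collection, documents):
    """Insert documents into a collection-like object; return how many were inserted."""
    documents = list(documents)
    if not documents:
        # pymongo's insert_many rejects an empty list, so skip the call
        return 0
    result = collection.insert_many(documents)
    return len(result.inserted_ids)


# Possible usage (connection string, database, and collection names are assumptions):
# client = pymongo.MongoClient("mongodb://localhost:27017")
# load_documents(client["etl_db"]["reddit_posts"], df.to_dict("records"))
```

For `hist`, the dates live in the index, so calling `hist.reset_index()` before `to_dict("records")` would be needed to keep the date as a field in each document.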
