# Data Acquisition from Reddit

Go to <a href=#bookmark>bookmark</a>

### 2019-06-08 - Goal - Develop End-to-End Data Flow, at least at small scale.
# OR BUST

![](https://images.unsplash.com/photo-1515255384510-23e8b6a6ca3c?ixlib=rb-1.2.1&auto=format&fit=crop&w=1489&q=80)

---

## Libraries

In [1]:
# Install libs on this computer:
# !pip install praw
# !pip install pymongo
# !pip install psycopg2

In [2]:
import os             # file system stuff
import json           # digest json
import praw           # reddit API
import pandas as pd   # Dataframes
import pymongo        # MongoDB
import numpy as np    # math and arrays

from time import time # To time stuff

#DATA STORAGE
from sqlalchemy import create_engine # SQL helper
import psycopg2 as psql #PostgreSQL DBs

In [3]:
import helper     # Custom helper functions
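The `helper` module is not included in this notebook. A minimal sketch of what `get_keys` might look like, assuming it simply parses a JSON file of credentials into a dict (the real helper may differ):

```python
import json


def get_keys(path):
    """Read a JSON file of credentials and return it as a dict.

    Assumption: this mirrors what helper.get_keys does; the real helper
    is not shown in this notebook.
    """
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Under this assumption, the Reddit secret file would be a JSON object containing the keys used later in the notebook: `client_id`, `api_key`, `username`, and `password`.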

---

## 1A Load Reddit keys

Define the path to the JSON file that holds the Reddit API credentials.

In [5]:
# Define path to secret

secret_path = os.path.join(os.environ['HOME'], '.secret', 'reddit.json')
#secret_path = os.path.join(os.environ['HOME'], 'mia/.secret', 'reddit_api.json')

secret_path

'C:\\Users\\werlindo\\.secret\\reddit.json'

#### Define path to AWS PostgreSQL secret

In [83]:
# Define path to secret

secret_path_aws = os.path.join(os.environ['HOME'], '.secret', 
                           'aws_ps_flatiron.json')
secret_path_aws

'C:\\Users\\werlindo\\.secret\\aws_ps_flatiron.json'

## 1B Load AWS-PostgreSQL DB keys

#### Load keys

In [84]:
aws_keys = helper.get_keys(secret_path_aws)
user = aws_keys['user']
ps = aws_keys['password']
host = aws_keys['host']
db = aws_keys['db_name']

In [85]:
aws_ps_engine = ('postgresql://' + user + ':' + ps + '@' + host + '/' + db)
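Plain string concatenation produces a broken URL if the password contains characters such as `@`, `/`, or `:`. A minimal sketch of one way around this, percent-encoding the credentials with the standard library; `build_pg_url` is a hypothetical helper and the credentials shown are placeholders:

```python
from urllib.parse import quote_plus


def build_pg_url(user, password, host, db_name, port=5432):
    """Assemble a PostgreSQL URL, percent-encoding the user and password.

    build_pg_url is a hypothetical helper, not part of this notebook.
    """
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{db_name}"
    )


# Placeholder credentials, for illustration only
print(build_pg_url("me", "p@ss/word", "example.com", "mydb"))
```

The resulting string could be passed to `create_engine` in place of the concatenated one.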

### Use SQLAlchemy to create PSQL engine

In [86]:
# dialect+driver://username:password@host:port/database
sql_alc_engine = create_engine(aws_ps_engine)

## 2 Load keys, Create Reddit Instance

In [6]:
keys = helper.get_keys(secret_path)

In [7]:
reddit = praw.Reddit(client_id=keys['client_id'] 
                     ,client_secret=keys['api_key']
                     ,username=keys['username']
                     ,password=keys['password']
                     ,user_agent='reddit_research accessAPI:v0.0.1 (by /u/FlatDubs)')

## 3 Obtain a Subreddit Instance(s) from your Reddit Instance

In [23]:
#politics = reddit.subreddit('politics')

#### Instantiate subreddit

In [24]:
got = reddit.subreddit('gameofthrones')  # Start with r/gameofthrones; once the flow works for one subreddit, it can be duplicated for others

## 4 Get subreddit submissions; save to dataframe

#### Initialize parameters for this submissions pull

In [79]:
# persons = """
#             "bran" OR 'brandon stark' OR 'jon snow' OR 'jon' 
#                         OR 'khaleesi' OR 'dany' OR 'daenerys' OR 'danyris'
#          """

persons = '"doran" OR "davos"'

results_lim = 1000

#### Execute Search

In [82]:
got_search = got.search(persons,
                        sort='comments',
                        limit=results_lim,
                        time_filter='week')

# Count # of results
# num_results = sum(1 for s in got_search)
# print('Returned {} results.'.format(num_results))

# Compile submission into list
title = [] 
num_comments = []
upvote_ratio = []
sub_id = []
i=0

for submission in got_search:
    i+=1
    title.append(submission.title)
    num_comments.append(submission.num_comments)
    upvote_ratio.append(submission.upvote_ratio)
    sub_id.append(submission.id)
#     body.append(?) #look at this later! is it comment[0]? 
    if i%100 == 0:
        print(f'{i} submissions completed')

df_got = pd.DataFrame(
    {'title': title,
     'num_comments': num_comments,
     'upvote_ratio': upvote_ratio,
     'id':sub_id
    })

df_got

| | title | num_comments | upvote_ratio | id |
|---|---|---|---|---|
| 0 | [Spoilers] Winds of Winter was a perfect episode | 158 | 0.74 | bvwunn |
| 1 | [SPOILERS] The Great Council of 305 AC | 48 | 0.64 | bxir03 |
| 2 | [No Spoilers] I found this horde of A Song of ... | 2 | 0.62 | bwg9dd |
| 3 | [SPOILERS] Modified scenario 6 episodes of sea... | 1 | 0.18 | bwtqj1 |
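
The four parallel lists in the search cell have to stay in step with one another. An alternative sketch collects one dict per submission, so each row is built in a single place. `submission_to_record` is a hypothetical helper, and `SimpleNamespace` objects stand in for PRAW submissions here:

```python
from types import SimpleNamespace


def submission_to_record(submission):
    """Pick out the fields this notebook keeps from a submission-like object."""
    return {
        "title": submission.title,
        "num_comments": submission.num_comments,
        "upvote_ratio": submission.upvote_ratio,
        "id": submission.id,
    }


# Stand-in objects; in the notebook these would come from got.search(...)
fake_results = [
    SimpleNamespace(title="Post A", num_comments=3, upvote_ratio=0.9, id="aaa111"),
    SimpleNamespace(title="Post B", num_comments=0, upvote_ratio=0.5, id="bbb222"),
]

records = [submission_to_record(s) for s in fake_results]
```

Passing `records` to `pd.DataFrame` should give the same four columns as the list-based version.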


#### Now loop through each sub and grab its comments

In [211]:
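from time import sleep                             # pause between retries
from prawcore.exceptions import PrawcoreException  # assumption: catch prawcore's base exception; narrow as needed

# List to hold all the comments dfs
comm_dfs = []

for index, row in df_got.iterrows():
#     print(row['id'])
    submission = reddit.submission(id=row['id'])

    # Instantiate lists to hold comments data
    comment_body = []
    comment_id = []
    sub_id = []

    # Replace "load more comments" placeholders with real comments; retry on request errors
    while True:
        try:
            submission.comments.replace_more()
            break
        except PrawcoreException:
            print('Handling replace_more exception')
            sleep(1)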
    
    # Loop through comments and put into list
    for comment in submission.comments.list():
    #     print(comment.body)
    #     print(comment.id)
        comment_id.append(comment.id)
        comment_body.append(comment.body)
        sub_id.append(row['id'])

    # create df from lists
    this_df = pd.DataFrame({
        'comment': comment_body,
        'comment_id':comment_id,
        'sub_id':sub_id
    })
    
    # Add this sub's comments df to list of dfs
    comm_dfs.append(this_df)
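
The `while True` retry loop above never gives up if the request keeps failing. A minimal sketch of a bounded retry, assuming a fixed delay between attempts is acceptable; `retry_call` and its parameters are hypothetical names, not part of PRAW:

```python
from time import sleep


def retry_call(func, exceptions=(Exception,), attempts=5, delay=1.0):
    """Call func() until it succeeds or `attempts` tries have failed.

    Re-raises the last exception once the attempts are exhausted.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise
            print(f"Attempt {attempt} failed; retrying in {delay}s")
            sleep(delay)
```

In the loop above this could be used as `retry_call(submission.comments.replace_more)`, ideally with `exceptions` narrowed to the errors the API client actually raises.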


#### Put all the comments into common df

In [221]:
df_got_comm = pd.concat(comm_dfs, axis=0).reset_index(drop=True)

## 5 Save dataframes' contents to PS DB

#### Use `DataFrame.to_sql` to write each dataframe to the PostgreSQL database via the SQLAlchemy engine.

In [222]:
df_got.to_sql('got_subs', con=sql_alc_engine, if_exists='append')

In [223]:
df_got_comm.to_sql('got_comms', con=sql_alc_engine, if_exists='append')

![](https://images.unsplash.com/photo-1553058296-61093581de13?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1351&q=80)

## 6 Check that the table was created or appended to

In [224]:
# Setup PSQL connection
conn = psql.connect(
    database=db,
    user=user,
    password=ps,
    host=host,
    port='5432'
)

In [225]:
# QUERY TO GET LIST OF TABLES
# query = """
#     SELECT * FROM pg_catalog.pg_tables
#     WHERE schemaname = 'public';
# """

In [226]:
# Instantiate cursor
cur = conn.cursor()

In [228]:
# Set up query
query = """
    SELECT * FROM got_comms;
"""

In [229]:
# Execute the query
cur.execute(query)

In [148]:
# conn.rollback()

In [230]:
# Check results
df_clone = pd.DataFrame(cur.fetchall())
df_clone.columns = [col.name for col in cur.description]
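
The column names above come from `cur.description`, which the Python DB-API defines for any cursor after a query. A small sketch of the same pattern, using an in-memory SQLite database as a stand-in for the PostgreSQL connection (an assumption made so the example runs without a server; the table contents are made up):

```python
import sqlite3

# In-memory SQLite stands in for the PostgreSQL connection (assumption:
# both drivers follow the Python DB-API, so the cursor usage is the same).
conn_demo = sqlite3.connect(":memory:")
cur_demo = conn_demo.cursor()
cur_demo.execute("CREATE TABLE got_comms (comment TEXT, comment_id TEXT, sub_id TEXT)")
cur_demo.execute(
    "INSERT INTO got_comms VALUES (?, ?, ?)",
    ("example comment", "c1", "s1"),
)

cur_demo.execute("SELECT * FROM got_comms")
# The first item of each description entry is the column name
columns = [col[0] for col in cur_demo.description]
rows = [dict(zip(columns, row)) for row in cur_demo.fetchall()]
conn_demo.close()
```

Note that `sqlite3` uses `?` as its parameter placeholder, whereas `psycopg2` uses `%s`.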

In [231]:
conn.commit()

In [232]:
df_clone

| | index | comment | comment_id | sub_id |
|---|---|---|---|---|
| 0 | 0 | That & battle of the bastards was an amazing o... | ept2erq | bvwunn |
| 1 | 1 | The cut from baby Jon to serious Jon is probab... | ept8q8e | bvwunn |
| 2 | 2 | What kind of ruins a lot if this episode for m... | epufz4u | bvwunn |
| 3 | 3 | They are good scenes by themselves, but the pr... | ept469e | bvwunn |
| 4 | 4 | While the episode got me excited it really was... | eptaxge | bvwunn |
| 5 | 5 | You are correct. The season 6 finale was amazi... | eptfuuv | bvwunn |
| 6 | 6 | No matter how many times I watch it, I’ll alwa... | eptydki | bvwunn |
| 7 | 7 | Light of the Seven was really what made me fal... | eptqc8g | bvwunn |
| 8 | 8 | It will always be my favorite episode of all t... | eptwzpe | bvwunn |
| 9 | 9 | I dislike Winds of Winter, got the same proble... | eptho50 | bvwunn |


In [137]:
conn.close()