# <s>Exploring Reddit API</s>  Dev Data Flow

Go to <a href=#bookmark>bookmark</a>

### 2019-06-08 - Goal - Develop End-to-End Data Flow, at least at small scale.
# OR BUST

![](https://images.unsplash.com/photo-1515255384510-23e8b6a6ca3c?ixlib=rb-1.2.1&auto=format&fit=crop&w=1489&q=80)

---

## Libraries

In [69]:
# Install libs on this computer:
# !pip install praw
# !pip install pymongo
# !pip install psycopg2

In [70]:
import os             # file system stuff
import json           # digest json
import praw           # reddit API
import pandas as pd   # Dataframes
import pymongo        # MongoDB
import numpy as np    # math and arrays

from time import time # To time stuff

#DATA STORAGE
from sqlalchemy import create_engine # SQL helper
import psycopg2 as psql #PostgreSQL DBs

In [71]:
import helper     # Custom helper functions

---

## 1 Load Reddit keys

First, point to the JSON file that holds the Reddit API credentials.

In [72]:
# Define path to secret

secret_path = os.path.join(os.environ['HOME'], '.secret', 'reddit.json')
#secret_path = os.path.join(os.environ['HOME'], 'mia/.secret', 'reddit_api.json')

secret_path

'C:\\Users\\werlindo\\.secret\\reddit.json'

## 2 Load keys, Create Reddit Instance

In [73]:
keys = helper.get_keys(secret_path)
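
`helper` is a custom module, so its source is not visible at this point in the notebook. Below is a minimal sketch of what such a loader might look like, assuming it simply parses the JSON file (as the `get_keys` definition further down does). The `load_keys` name and the `required` check are illustrative additions, not part of the original helper.

```python
import json

# Key names taken from the praw.Reddit(...) call in this notebook
REDDIT_KEYS = ('client_id', 'api_key', 'username', 'password')

def load_keys(path, required=REDDIT_KEYS):
    """Read a JSON secrets file and fail early if an expected key is missing."""
    with open(path) as f:
        keys = json.load(f)
    missing = [k for k in required if k not in keys]
    if missing:
        raise KeyError(f'missing keys in {path}: {missing}')
    return keys
```

Failing early like this would surface a mistyped key name before the `praw.Reddit(...)` call rather than inside it.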

In [74]:
reddit = praw.Reddit(client_id=keys['client_id'] 
                     ,client_secret=keys['api_key']
                     ,username=keys['username']
                     ,password=keys['password']
                     ,user_agent='reddit_research accessAPI:v0.0.1 (by /u/FlatDubs)')

## 3 Obtain a Subreddit Instance(s) from your Reddit Instance

In [76]:
got = reddit.subreddit('gameofthrones') # Start with r/gameofthrones; once the flow works for one subreddit, it can be duplicated for others

## 4 Investigate how long it takes to get a submission

In [77]:
time_log = []

Let's time how long it takes to pull submissions. Idea:  
- Loop 5 times.
- Pull at most 10 submissions per loop.
- Log the elapsed time for each loop.

In [78]:
num_loops = 5
results_lim = 10

for _ in range(num_loops):
    # Get start time
    start = time()

    # Create search generator
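    # NOTE: Python's `or` returns its first truthy operand, so this chain
    # evaluates to just 'bran'; that is the only term sent to Reddit.
    got_search = got.search('bran' or 'brandon stark' 
                            or 'jon snow' or 'jon' #will reddit authors be included in results?
                            or 'khaleesi' or 'dany' or 'daenerys' or 'danyris', 
                            sort='comments',
                            limit=results_lim,
                            time_filter='month')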

    # Compile submission into list
    title = []
    time_created = []
    num_upvotes = []
    num_comments = []
    upvote_ratio = []
    link_flair = []
    redditor = []
    sub_id = []
    i=0

    # for each submission     
    for submission in got_search:
        i+=1
        title.append(submission.title)
        time_created.append(submission.created_utc)
        num_upvotes.append(submission.score)
        num_comments.append(submission.num_comments)
        upvote_ratio.append(submission.upvote_ratio)
        link_flair.append(submission.link_flair_text)
        redditor.append(submission.author)
        sub_id.append(submission.id)
    #     body.append(?) #look at this later! is it comment[0]? 
        if i%5 == 0:
            print(f'{i} submissions completed')


    df = pd.DataFrame(
        {'title': title,
         'time': time_created,
         'num_comments': num_comments,
         'num_upvotes': num_upvotes,
         'upvote_ratio': upvote_ratio,
         'link_flair': link_flair,
         'redditor': redditor,
         'id':sub_id
        })
    df

    end = time()

    print("That took {:.4f} seconds!".format(end-start))

    time_log.append(round(end-start,2))

5 submissions completed
10 submissions completed
That took 31.9189 seconds!
5 submissions completed
10 submissions completed
That took 38.2708 seconds!
5 submissions completed
10 submissions completed
That took 38.6107 seconds!
5 submissions completed
10 submissions completed
That took 36.2504 seconds!
5 submissions completed
10 submissions completed
That took 37.6551 seconds!
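
One thing to note about the search call above: in Python, `'bran' or 'brandon stark' or ...` evaluates to the first truthy operand, so only `'bran'` reaches Reddit. Here is a minimal sketch of building a single query string instead. It assumes Reddit search accepts `OR` between terms and double quotes around phrases. `build_or_query` is a hypothetical helper, not part of PRAW.

```python
def build_or_query(terms):
    """Join search terms with OR, wrapping multi-word phrases in double quotes."""
    quoted = [f'"{t}"' if ' ' in t else t for t in terms]
    return ' OR '.join(quoted)

# Terms copied from the search call in this notebook
got_terms = ['bran', 'brandon stark', 'jon snow', 'jon',
             'khaleesi', 'dany', 'daenerys', 'danyris']

query = build_or_query(got_terms)
print(query)

# For comparison, the `or` chain collapses to its first operand
print('bran' or 'brandon stark')  # prints: bran
```

The resulting string could then be passed as the first argument to `got.search(...)` in place of the `or` chain.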


In [79]:
ave = np.mean(time_log)/results_lim
print("It takes about {:.3f} for each sub".format(ave))

It takes about 3.654 for each sub


Ok, so it takes roughly 3 to 4 seconds per sub (about 3.65 on average here). Good to know. SLOW.

In [80]:
num_subs = 100
mins = ave * num_subs / 60 
print("If we want {} submissions then it will probably take {} minutes."
      .format(num_subs, round(mins,1)))

If we want 100 submissions then it will probably take 6.1 minutes.
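
The arithmetic behind these estimates can be reproduced from the logged loop times. This is a small sketch using the five timings printed above. The `estimate_minutes` helper is hypothetical and assumes the cost per submission stays constant as the pull gets larger, which may not hold.

```python
from statistics import mean

# Per-loop wall times printed above, rounded to 2 decimals as in time_log
time_log = [31.92, 38.27, 38.61, 36.25, 37.66]
results_lim = 10  # submissions pulled per loop

# Seconds per submission for each loop, and their average
per_sub = [t / results_lim for t in time_log]
ave = mean(per_sub)

def estimate_minutes(num_subs, secs_per_sub):
    """Linear extrapolation: assumes cost per submission stays constant."""
    return num_subs * secs_per_sub / 60

print(f'fastest {min(per_sub):.2f}s, slowest {max(per_sub):.2f}s, mean {ave:.3f}s')
for n in (100, 1000):
    print(f'{n} submissions: about {estimate_minutes(n, ave):.1f} minutes')
```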


In [81]:
df.shape

(10, 8)

## 5 Let's figure out how to save the dataframe

Let's run a small testing set.

In [82]:
results_lim = 1

# Create search generator
# NOTE: `or` returns its first truthy operand, so this chain evaluates to just "bran".
got_search = got.search("bran" or "brandon stark"
                        or "jon snow" or "jon" 
                        or "khaleesi" or "dany" or "daenerys" or "danyris", 
                        sort='comments',
                        limit=results_lim,
                        time_filter='month')

# Compile submission into list
title = []
time_created = []
num_upvotes = []
num_comments = []
upvote_ratio = []
link_flair = []
redditor = []
sub_id = []
i=0

# for each submission     
for submission in got_search:
    i+=1
    title.append(submission.title)
    time_created.append(submission.created_utc)
    num_upvotes.append(submission.score)
    num_comments.append(submission.num_comments)
    upvote_ratio.append(submission.upvote_ratio)
    link_flair.append(submission.link_flair_text)
    redditor.append(submission.author)
    sub_id.append(submission.id)
#     body.append(?) #look at this later! is it comment[0]? 
    if i%5 == 0:
        print(f'{i} submissions completed')


df = pd.DataFrame(
    {'title': title,
     'time_created': time_created,
     'num_comments': num_comments,
     'num_upvotes': num_upvotes,
     'upvote_ratio': upvote_ratio,
     'link_flair': link_flair,
     'redditor': redditor,
     'id':sub_id
    })
df


| | title | time_created | num_comments | num_upvotes | upvote_ratio | link_flair | redditor | id |
|---|---|---|---|---|---|---|---|---|
| 0 | [SPOILERS] History repeats itself, the show en... | 1558320000.0 | 2145 | 16951 | 0.87 | Spoilers | Cryptonite323 | bqpke9 |


##### Coerce `redditor` to string: `submission.author` is a PRAW `Redditor` object, and I know from prior testing this column creates issues downstream

In [10]:
df.redditor = df.redditor.astype(str)

In [11]:
df

| | title | time_created | num_comments | num_upvotes | upvote_ratio | link_flair | redditor | id |
|---|---|---|---|---|---|---|---|---|
| 0 | [SPOILERS] History repeats itself, the show en... | 1558320000.0 | 2145 | 16956 | 0.87 | Spoilers | Cryptonite323 | bqpke9 |
| 1 | [Spoilers] Post-Episode Survey Results - S8E4 ... | 1557405000.0 | 1937 | 1476 | 0.96 | Sticky | BWPhoenix | bmj8ne |
| 2 | [Spoilers] Post-Episode Survey Results - S8E6 ... | 1558616000.0 | 1894 | 1243 | 0.96 | Sticky | BWPhoenix | bs2jl7 |


### a. Investigate AWS Data Storage

Now that we have our data, let's store it in a PostgreSQL db on AWS so we don't have to keep rebuilding it.

### b. Get AWS creds for our DB

In [41]:
# Define path to secret

secret_path_aws = os.path.join(os.environ['HOME'], '.secret', 
                           'aws_ps_flatiron.json')
secret_path_aws

'C:\\Users\\werlindo\\.secret\\aws_ps_flatiron.json'

### c. Load keys

In [44]:
aws_keys = helper.get_keys(secret_path_aws)
user = aws_keys['user']
ps = aws_keys['password']
host = aws_keys['host']
db = aws_keys['db_name']

In [45]:
aws_ps_engine = ('postgresql://' + user + ':' + ps + '@' + host + '/' + db)

### d. Use SQLAlchemy to create PSQL engine

In [46]:
# dialect+driver://username:password@host:port/database
sql_alc_engine = create_engine(aws_ps_engine)
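
The connection string above is built by plain concatenation, which can break if the password contains characters such as `@` or `/`. Below is a hedged sketch of percent-encoding the credentials with the standard library first. `make_pg_url` and the sample values are hypothetical.

```python
from urllib.parse import quote_plus

def make_pg_url(user, password, host, db_name, port=5432):
    """Build a postgresql:// URL with percent-encoded credentials."""
    return (f'postgresql://{quote_plus(user)}:{quote_plus(password)}'
            f'@{host}:{port}/{db_name}')

# Made-up values for illustration only
print(make_pg_url('some_user', 'p@ss/word', 'example-host', 'reddit'))
```

The returned string could be handed to `create_engine(...)` in the same way as `aws_ps_engine`.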

### e. Use `pandas.to_sql` to write the dataframe to the PostgreSQL database, using the SQLAlchemy engine.
    

In [47]:
df.to_sql('got_subs', con=sql_alc_engine, if_exists='append')

### f. Check that the table was created, or can be appended.

In [50]:
# Setup PSQL connection
conn = psql.connect(
    database=db,
    user=user,
    password=ps,
    host=host,
    port='5432'
)

In [51]:
# Set up query
# query = """
#     SELECT * FROM pg_catalog.pg_tables
#     WHERE schemaname = 'public';
# """

# Set up query
query = """
    SELECT * FROM got_subs
"""

# # Set up query
# query = """
#     DROP TABLE got_subs3;
# """

In [52]:
# Instantiate cursor
cur = conn.cursor()

In [53]:
# Execute the query
cur.execute(query)

In [54]:
#conn.rollback()

In [55]:
# Check results
df_clone = pd.DataFrame(cur.fetchall())
df_clone.columns = [col.name for col in cur.description]

In [56]:
conn.commit()

In [57]:
df_clone

| | index | title | time_created | num_comments | num_upvotes | upvote_ratio | link_flair | redditor | id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | [SPOILERS] History repeats itself, the show en... | 1558320000.0 | 2145 | 16956 | 0.87 | Spoilers | Cryptonite323 | bqpke9 |
| 1 | 1 | [Spoilers] Post-Episode Survey Results - S8E4 ... | 1557405000.0 | 1937 | 1476 | 0.96 | Sticky | BWPhoenix | bmj8ne |
| 2 | 2 | [Spoilers] Post-Episode Survey Results - S8E6 ... | 1558616000.0 | 1894 | 1243 | 0.96 | Sticky | BWPhoenix | bs2jl7 |
| 3 | 0 | [SPOILERS] History repeats itself, the show en... | 1558320000.0 | 2145 | 16956 | 0.87 | Spoilers | Cryptonite323 | bqpke9 |
| 4 | 1 | [Spoilers] Post-Episode Survey Results - S8E4 ... | 1557405000.0 | 1937 | 1476 | 0.96 | Sticky | BWPhoenix | bmj8ne |
| 5 | 2 | [Spoilers] Post-Episode Survey Results - S8E6 ... | 1558616000.0 | 1894 | 1243 | 0.96 | Sticky | BWPhoenix | bs2jl7 |
| 6 | 0 | [SPOILERS] History repeats itself, the show en... | 1558320000.0 | 2145 | 16956 | 0.87 | Spoilers | Cryptonite323 | bqpke9 |
| 7 | 1 | [Spoilers] Post-Episode Survey Results - S8E4 ... | 1557405000.0 | 1937 | 1476 | 0.96 | Sticky | BWPhoenix | bmj8ne |
| 8 | 2 | [Spoilers] Post-Episode Survey Results - S8E6 ... | 1558616000.0 | 1894 | 1243 | 0.96 | Sticky | BWPhoenix | bs2jl7 |


In [58]:
df_clone = df_clone.drop_duplicates()

In [59]:
df_clone

| | index | title | time_created | num_comments | num_upvotes | upvote_ratio | link_flair | redditor | id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | [SPOILERS] History repeats itself, the show en... | 1558320000.0 | 2145 | 16956 | 0.87 | Spoilers | Cryptonite323 | bqpke9 |
| 1 | 1 | [Spoilers] Post-Episode Survey Results - S8E4 ... | 1557405000.0 | 1937 | 1476 | 0.96 | Sticky | BWPhoenix | bmj8ne |
| 2 | 2 | [Spoilers] Post-Episode Survey Results - S8E6 ... | 1558616000.0 | 1894 | 1243 | 0.96 | Sticky | BWPhoenix | bs2jl7 |


2019.06.08 (WM) 
- Can get DF uploaded to AWS Postgres DB.
- Can simply append to table.
- Can pull back down into a dataframe.
- Can handle dupes on the pandas side (or on the SQL side, for that matter, which might be easier/cleaner depending on how much data we eventually have).
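
The option of handling dupes on the SQL side can be sketched as follows. This uses the standard library's `sqlite3` as a stand-in for the AWS PostgreSQL database (an assumption, made so the example runs anywhere), with the submission `id` as primary key so that re-appending the same rows is a no-op. In PostgreSQL the comparable statement would be `INSERT ... ON CONFLICT DO NOTHING`.

```python
import sqlite3

# (id, num_comments, upvote_ratio) for the three submissions shown above
rows = [
    ('bqpke9', 2145, 0.87),
    ('bmj8ne', 1937, 0.96),
    ('bs2jl7', 1894, 0.96),
]

# In-memory SQLite database as a stand-in for the real Postgres instance
conn = sqlite3.connect(':memory:')
conn.execute("""
    CREATE TABLE got_subs (
        id TEXT PRIMARY KEY,
        num_comments INTEGER,
        upvote_ratio REAL
    )
""")

def append_rows(conn, rows):
    """Insert rows, silently skipping ids that are already stored."""
    conn.executemany('INSERT OR IGNORE INTO got_subs VALUES (?, ?, ?)', rows)
    conn.commit()

# Appending the same batch twice should leave only one copy of each row
append_rows(conn, rows)
append_rows(conn, rows)
count = conn.execute('SELECT COUNT(*) FROM got_subs').fetchone()[0]
print(count)
```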

Left off here <a name='bookmark' />

![](https://images.unsplash.com/photo-1534224563519-fea04849cadf?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1350&q=80)

![](https://images.unsplash.com/photo-1553058296-61093581de13?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1351&q=80)

In [None]:
# NOTE: `or` returns its first truthy operand, so this chain evaluates to just 'kamala'.
politics = reddit.subreddit('politics')
dems_search = politics.search('kamala' or 'senator harris' or 'K. Harris' or 
                              'biden' or 
                              'mayor pete' or 'buttigidg' or 'buttigieg' or 'bootijedge', 
                              sort='comments',
                              limit=5)

# Compile submission into list
title = []
time_created = []  # not `time`, which would shadow `from time import time`
num_upvotes = []
num_comments = []
upvote_ratio = []
link_flair = []
redditor = []
sub_id=[]
i=0

for submission in dems_search:
    i+=1
    title.append(submission.title)
    time_created.append(submission.created_utc)
    num_upvotes.append(submission.score)
    num_comments.append(submission.num_comments)
    upvote_ratio.append(submission.upvote_ratio)
    link_flair.append(submission.link_flair_text)
    redditor.append(submission.author)
    sub_id.append(submission.id)
    if i%5 == 0:
        print(f'{i} submissions completed')

df_dems = pd.DataFrame(
    {'title': title,
     'time_created': time_created,
     'num_comments': num_comments,
     'num_upvotes': num_upvotes,
     'upvote_ratio': upvote_ratio,
     'link_flair': link_flair,
     'redditor': redditor,
     'id':sub_id
    })
df_dems

---
What if we try to get just the submission essentials? 


In [None]:
# NOTE: `or` returns its first truthy operand, so this chain evaluates to just 'bran'.
got_search = got.search('bran' or 'brandon stark' 
                        or 'jon snow' or 'jon' #will reddit authors be included in results?
                        or 'khaleesi' or 'dany' or 'daenerys' or 'danyris', 
                        sort='comments',
                        limit=5)

# Compile submission into list
title = [] 
num_comments = []
upvote_ratio = []
sub_id = []
i=0

for submission in got_search:
    i+=1
    title.append(submission.title)
    num_comments.append(submission.num_comments)
    upvote_ratio.append(submission.upvote_ratio)
    sub_id.append(submission.id)
#     body.append(?) #look at this later! is it comment[0]? 
    if i%100 == 0:
        print(f'{i} submissions completed')

df_got = pd.DataFrame(
    {'title': title,
     'num_comments': num_comments,
     'upvote_ratio': upvote_ratio,
     'id':sub_id
    })
df_got

In [None]:
df

# Dems Search

In [85]:
politics = reddit.subreddit('politics')

In [86]:
# NOTE: `or` returns its first truthy operand, so this chain evaluates to just 'kamala'.
dem_search = politics.search('kamala' or 'senator harris' or 'K. Harris' or 
                              'biden' or 
                              'mayor pete' or 'buttigidg' or 'buttigieg' or 'bootijedge', 
                              sort='comments',
                              limit=5)

# Compile submission into list
title = [] 
num_comments = []
upvote_ratio = []
sub_id = []
i=0

for submission in dem_search:
    i+=1
    title.append(submission.title)
    num_comments.append(submission.num_comments)
    upvote_ratio.append(submission.upvote_ratio)
    sub_id.append(submission.id)
#     body.append(?) #look at this later! is it comment[0]? 
    if i%100 == 0:
        print(f'{i} submissions completed')

df_dem = pd.DataFrame(
    {'title': title,
     'num_comments': num_comments,
     'upvote_ratio': upvote_ratio,
     'id':sub_id
    })
df_dem

| | title | num_comments | upvote_ratio | id |
|---|---|---|---|---|
| 0 | Megathread: AG Willam Barr releases his top li... | 45584 | 0.88 | b50gkr |
| 1 | Megathread: President Trump delivers remarks o... | 32332 | 0.82 | 6tx8h7 |
| 2 | Megathread: Likely Explosive Devices Addressed... | 21363 | 0.9 | 9rlm9p |
| 3 | Megathread: President Trump announces a deal t... | 12928 | 0.88 | ajsubi |
| 4 | [Megathread] President Trump’s Address on Bord... | 9083 | 0.91 | ae2e7b |


## How about we try to get all the comments?

Let's try the example from the PRAW docs first.

In [88]:
# Here's a thread (is that even the right term?)
# Example GoT thread for later:
# https://www.reddit.com/r/gameofthrones/comments/bqa2qd/spoilers_live_episode_discussion_season_8_episode/
# For now, pull the top r/politics result (id b50gkr) from df_dem
submission = reddit.submission(id=df_dem['id'][0])

In [90]:
# Instantiate list to hold comments
test_comments = []
comments_dicts = []

submission.comments.replace_more(limit=5)
for comment in submission.comments.list()[:100]:
#     print(comment.body)
    # List of comments, as strings
    test_comments.append(comment.body)

    # List of comments (dicts)
    comments_dicts.append({
        'comment': comment.body
    })
    

In [91]:
# Check 
test_comments[1]

'[https://twitter.com/RepJerryNadler/status/1109913142933573632](https://twitter.com/RepJerryNadler/status/1109913142933573632)\n\n"In light of the very concerning discrepancies and final decision making at the Justice Department following the Special Counsel report, where Mueller did not exonerate the President, we will be calling Attorney General Barr in to testify before [~~@~~**HouseJudiciary**](https://twitter.com/HouseJudiciary) in the near future."'
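
To make the flattening step concrete: as I understand the PRAW docs, `submission.comments.list()` walks the comment forest breadth-first, after `replace_more` has dealt with the "load more comments" placeholders. Below is a small sketch of that traversal over plain dicts. The `body`/`replies` structure is a made-up stand-in, not PRAW's actual objects.

```python
from collections import deque

def flatten_comments(top_level):
    """Breadth-first walk over a nested comment tree, returning comment bodies."""
    queue = deque(top_level)
    bodies = []
    while queue:
        comment = queue.popleft()
        bodies.append(comment['body'])
        # Replies go to the back of the queue, so siblings come out first
        queue.extend(comment.get('replies', []))
    return bodies

# Hypothetical two-branch thread for illustration
thread = [
    {'body': 'top 1', 'replies': [
        {'body': 'reply 1a', 'replies': [{'body': 'reply 1a-i'}]},
    ]},
    {'body': 'top 2', 'replies': [{'body': 'reply 2a'}]},
]

flat = flatten_comments(thread)
print(flat)
```

With this ordering, slicing the first 100 entries (as the cell above does) favours top-level comments over deep replies.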

In [None]:
# Put it in a dataframe, as POC
pol_df = pd.DataFrame(test_comments, columns=['comment'])

pol_df.head()

In [None]:
pol_df['comment'].str.contains('joyann')
#case sensitive
#should write function that attributes comment to person 
#forward slashes in links seem to operate as spaces 
#make all comments all lowercase to simplify attributing phase
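
The notes in the cell above (lowercase everything, then attribute each comment to a person) could be sketched like this. `ALIASES` and `attribute_comment` are hypothetical names; the alias lists reuse the search terms from this notebook, and plain substring matching is an assumption that will produce some false positives.

```python
# Canonical name -> lowercase aliases, taken from the search terms used earlier
ALIASES = {
    'harris': ['kamala', 'senator harris', 'k. harris'],
    'biden': ['biden'],
    'buttigieg': ['mayor pete', 'buttigidg', 'buttigieg', 'bootijedge'],
}

def attribute_comment(text, aliases=ALIASES):
    """Return the sorted list of people whose aliases appear in the comment."""
    lowered = text.lower()  # lowercase once so matching is case-insensitive
    return sorted(
        name for name, keys in aliases.items()
        if any(key in lowered for key in keys)
    )

print(attribute_comment('KAMALA and Mayor Pete debated'))
print(attribute_comment('no candidates mentioned'))
```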

## How about some Vader Sentiment Action?

In [None]:
from vaderSentiment import vaderSentiment

In [None]:
analyser = vaderSentiment.SentimentIntensityAnalyzer()

In [None]:
for comment in test_comments:
    print(comment)
    print(analyser.polarity_scores(comment))

### Let's save it to MongoDB Atlas!

In [None]:
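# Set up connection string
# Read credentials from the environment rather than hard-coding them.
# MONGO_USER and MONGO_PW are assumed variable names; set them before running.
mongo_user = os.environ['MONGO_USER']
mongo_pw = os.environ['MONGO_PW']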

In [None]:
# Instantiate client
client = pymongo.MongoClient("mongodb+srv://" + mongo_user + ":" 
                         + mongo_pw 
                         + "@dsaf-oy1s0.mongodb.net/test?retryWrites=true")


In [None]:
#cli = pymongo.MongoClient('mongodb+srv://<user>:<password>@dsaf-oy1s0.mongodb.net/test?retryWrites=true')

In [None]:
db = client['got']
coll = db['s8e6']

In [None]:
coll.delete_many({})

In [None]:
coll.insert_many(comments_dicts)

In [None]:
# Look at DB names
cur = client.list_databases()

for item in cur:
    print(item)

In [None]:
# Look at everything in our collection!
cur = coll.find({})

for item in cur:
    print(item)

# THINGS TO FIGURE OUT

- Extract data back out from MongoDB
- ~~Use MongoDB Atlas?~~
- Build Corpus from Mongo'd data
- Sentiment Analysis from Corpus

---

## Below here is island of old lame code

In [None]:
client.list_database_names()  # replaces the older database_names()

In [None]:
cur = client.list_databases()

In [None]:
for item in cur:
    print(item)

In [None]:
# Mongo Prep
mc = pymongo.MongoClient(host='localhost', port=27017)
db = mc['got']
coll = db['test_collection']

In [None]:
dbee = client['got']
collee = dbee['reddit_test']

In [None]:
topics

Try inserting into collection.

In [None]:
collee.insert_many(topics)


client = pymongo.MongoClient("mongodb://USER:PASSWORD@ABC-cluster-shard-00-00-XYZ.mongodb.net:27017" + 
                            ",ABC-cluster-shard-00-01-XYZ.mongodb.net:27017," +
                            "ABC-cluster-shard-00-02-XYZ.mongodb.net:27017/" + 
                            "DATABASE?ssl=true&replicaSet=ABC-cluster-shard-0&authSource=admin")

---


In [None]:
# Alex's code
# Load secret keys from credentials.json
import json
url = 'https://www.reddit.com/'
with open('/Users/<Your CPUs User>/.secrets/credentials.json') as f:
    params = json.load(f)

In [None]:
def get_keys(path):
    with open(path) as f:
        return json.load(f)

### Can we put this in a MongoDB?

Instantiate MongoDB

In [None]:
# Mongo Prep
mc = pymongo.MongoClient(host='localhost', port=27017)
db = mc['got']
coll = db['test_collection']

In [None]:
# Initialize
i = 0
topics = []

for submission in subreddit.top(limit=50):
    i+=1
    topics.append({
                   'title': submission.title
                    ,'time': submission.created_utc
                    ,'num_upvotes': submission.score
                    ,'num_comments': submission.num_comments
                    ,'upvote_ratio': submission.upvote_ratio
                    ,'link_flair': submission.link_flair_text
#                     ,'redditor': submission.author
                    ,'body': submission.selftext
                 })
#    topics_dict['title'].append(submission.title)
#     time.append(submission.created_utc)
#     num_upvotes.append(submission.score)
#     num_comments.append(submission.num_comments)
#     upvote_ratio.append(submission.upvote_ratio)
#     link_flair.append(submission.link_flair_text)
#     redditor.append(submission.author)
#     body.append(submission.selftext)

    if i%5 == 0:
        print(f'{i} submissions completed')

In [None]:
topics

Try inserting into collection.

In [None]:
coll.insert_many(topics)

Yay! it worked.

### Can we put this in a MongoDB...in the cloud!?

In [None]:
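# Set up connection string
# Read credentials from the environment rather than hard-coding them.
# MONGO_USER and MONGO_PW are assumed variable names; set them before running.
mongo_user = os.environ['MONGO_USER']
mongo_pw = os.environ['MONGO_PW']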

In [None]:
#cli = pymongo.MongoClient('mongodb+srv://<user>:<password>@dsaf-oy1s0.mongodb.net/test?retryWrites=true')

In [None]:
# Initialize
i = 0
topics = []

for submission in subreddit.top(limit=50):
    i+=1
    topics.append({
                   'title': submission.title
                    ,'time': submission.created_utc
                    ,'num_upvotes': submission.score
                    ,'num_comments': submission.num_comments
                    ,'upvote_ratio': submission.upvote_ratio
                    ,'link_flair': submission.link_flair_text
#                     ,'redditor': submission.author
                    ,'body': submission.selftext
                 })
#    topics_dict['title'].append(submission.title)
#     time.append(submission.created_utc)
#     num_upvotes.append(submission.score)
#     num_comments.append(submission.num_comments)
#     upvote_ratio.append(submission.upvote_ratio)
#     link_flair.append(submission.link_flair_text)
#     redditor.append(submission.author)
#     body.append(submission.selftext)

    if i%5 == 0:
        print(f'{i} submissions completed')

In [None]:
# Instantiate client
client = pymongo.MongoClient("mongodb+srv://" + mongo_user + ":" 
                         + mongo_pw 
                         + "@dsaf-oy1s0.mongodb.net/test?retryWrites=true")


In [None]:
db = client['got']
coll = db['test_collection']

In [None]:
coll.delete_many({})

In [None]:
coll.insert_many(topics)

In [None]:
# Look at DB names
cur = client.list_databases()

for item in cur:
    print(item)

In [None]:
# Look at everything in our collection!
cur = coll.find({})

for item in cur:
    print(item)

---
## What if we just dump the entire submission into a dataframe?

In [None]:
def serialize(post):
    """Helper function for converting PRAW objects to python dictionary.

    Source: https://www.reddit.com/r/redditdev/comments/90bdr4/subreddit_sentiment_analysis/
    posted by f_k_a_g_n
    """
    result = {}
    for k, v in post.__dict__.items():
        if k.startswith('_'):
            continue
        if k in {'author', 'subreddit'}:
            result[k] = str(v)
            continue
        if v is None:
            continue
        result[k] = v
    return result

In [None]:
submissions = subreddit.top(limit=10)

# load into pandas
subs = pd.DataFrame(serialize(post) for post in submissions)

# change the `created_utc` column to a datetime object
subs['created_utc'] = pd.to_datetime(subs.created_utc, unit='s')
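
For a single value, the same epoch-to-datetime conversion can be checked with the standard library alone. This is a small sketch; `to_utc` is a hypothetical helper, and the sample value is the `time_created` shown for submission `bqpke9` earlier, which pandas may have rounded for display.

```python
from datetime import datetime, timezone

def to_utc(epoch_seconds):
    """Convert a Unix timestamp (seconds) to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

# time_created as displayed for submission bqpke9 (possibly rounded by pandas)
print(to_utc(1558320000.0).isoformat())
```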

In [None]:
subs.head()

Works, but I don't know if I like it.

How many docs in this here coll?

In [None]:
coll.count_documents({})

## Nice to have? Don't need.

In [35]:
import ipywidgets as widgets

In [36]:
widgets.Checkbox(
    value=False,
    description='Check me',
    disabled=False
)

Checkbox(value=False, description='Check me')