# Data Acquisition from Reddit

Go to <a href=#bookmark>bookmark</a>

### 2019-06-08 - Goal - Develop End-to-End Data Flow, at least at small scale.
# OR BUST

![](https://images.unsplash.com/photo-1515255384510-23e8b6a6ca3c?ixlib=rb-1.2.1&auto=format&fit=crop&w=1489&q=80)

---

## Libraries

In [2]:
# Install libs on this computer:
# !pip install praw
# !pip install pymongo
# !pip install psycopg2
# !pip install sqlalchemy
# !pip install vaderSentiment

In [3]:
import os             # file system stuff
import json           # digest json
import praw           # reddit API
import pandas as pd   # Dataframes
import pymongo        # MongoDB
import numpy as np    # math and arrays

import time           # To time stuff

#DATA STORAGE
from sqlalchemy import create_engine # SQL helper
import psycopg2 as psql #PostgreSQL DBs

In [4]:
import helper     # Custom helper functions

In [5]:
import reddit_data # Custom reddit scraper

---

## 1A Load Reddit keys

Step 3: Create your first Authorized Reddit Instance

In [6]:
# Define path to secret

secret_path = os.path.join(os.environ['HOME'], '.secret', 'reddit.json')
#secret_path = os.path.join(os.environ['HOME'], 'mia/.secret', 'reddit_api.json')

secret_path

'/Users/tjjj/.secret/reddit.json'

#### Save submissions to DB

In [11]:
# Define path to secret

secret_path_aws = os.path.join(os.environ['HOME'], 'mia', '.secret', 
                           'aws_ps_flatiron.json')
secret_path_aws

'/Users/tjjj/mia/.secret/aws_ps_flatiron.json'

## 1B Load AWS-PostgreSQL DB keys

#### Load keys

In [12]:
aws_keys = helper.get_keys(secret_path_aws)
user = aws_keys['user']
ps = aws_keys['password']
host = aws_keys['host']
db = aws_keys['db_name']
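
`helper.get_keys` is a custom function whose source is not shown in this notebook. A minimal sketch of what such a loader might look like, assuming the secret file is a flat JSON object with the keys used above (`user`, `password`, `host`, `db_name`):

```python
import json


def get_keys(path):
    """Load a JSON credentials file and return its contents as a dict.

    Hypothetical stand-in for helper.get_keys; the real helper may differ.
    """
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

Keeping credentials in a file outside the repository, as the `.secret` directory does here, keeps them out of the notebook and out of version control.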

In [13]:
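# Build the connection URL. Despite the name, this is a plain string, not an engine yet.
# Note: a password containing characters such as '@' or '/' must be URL-encoded first.
aws_ps_engine = ('postgresql://' + user + ':' + ps + '@' + host + '/' + db)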

### Use SQLAlchemy to create PSQL engine

In [14]:
# dialect+driver://username:password@host:port/database
sql_alch_engine = create_engine(aws_ps_engine)
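
Plain string concatenation breaks if the password contains characters such as `@` or `/`. A hedged sketch of a safer way to build the same URL using only the standard library; `build_pg_url` is a hypothetical helper, not part of this project, and the port default of 5432 is an assumption:

```python
from urllib.parse import quote_plus


def build_pg_url(user, password, host, db, port=5432):
    """Return a postgresql:// URL with user and password percent-encoded."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{db}"
    )


# Placeholder values, not real credentials
print(build_pg_url("user", "p@ss/word", "example.host", "mydb"))
# postgresql://user:p%40ss%2Fword@example.host:5432/mydb
```

The returned string can be passed to `create_engine` in the same way as the concatenated one.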

## 2 Load keys, Create Reddit Instance

In [10]:
keys = helper.get_keys(secret_path)

In [11]:
reddit = praw.Reddit(client_id=keys['client_id'] 
                     ,client_secret=keys['api_key']
                     ,username=keys['username']
                     ,password=keys['password']
                     ,user_agent='reddit_research accessAPI:v0.0.1 (by /u/FlatDubs)')

---

## 3 Initialize parameters for this submissions pull

https://ballotpedia.org/Presidential_candidates,_2020

https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_characters

In [45]:
subreddit_nm = 'gameofthrones'

# query = """
#         "qyburn" OR "yara"
        
#         """

query = "harry strickland"

results_lim = 1

nm_subs_tbl = 'got_subs'

nm_comms_tbl = 'got_comms'
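
The multi-name queries used in earlier pulls all follow the same `"a" OR "b"` pattern. A small sketch of building that string from a list rather than typing it by hand; `build_or_query` is a hypothetical helper, not part of `reddit_data`:

```python
def build_or_query(terms):
    """Quote each non-empty term and join the terms with OR."""
    cleaned = [t.strip() for t in terms if t.strip()]
    return " OR ".join(f'"{t}"' for t in cleaned)


print(build_or_query(["qyburn", "yara"]))
# "qyburn" OR "yara"
```

Quoting every term consistently also avoids the mixed single and double quotes that crept into some hand-written queries.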

---

## 4 Get subreddit submissions and their comments

In [47]:
reddit_data.get_subred_subs_coms(praw_reddit=reddit
                    ,sql_alch_engine=sql_alch_engine
                    ,subreddit_nm=subreddit_nm
                    ,query=query
                    ,results_lim=results_lim
                    ,nm_subs_tbl=nm_subs_tbl      # table names come from the parameters cell
                    ,nm_comms_tbl=nm_comms_tbl
                    ,)

Starting: Sun Jun  9 21:45:35 2019

Retrieved submissions.
Retrieved comments.
Writing to got_subs
Writing to got_comms

Finished: Sun Jun  9 21:45:37 2019

It took 0.04 minutes to complete.
There were 1 submissions added.
There were 26 comments added.
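
The source of `reddit_data.get_subred_subs_coms` is not shown here. As a rough illustration of the retrieval half only, the sketch below uses standard PRAW calls (`subreddit(...).search`, `comments.replace_more`, `comments.list`). The function name and the dictionary keys are assumptions, apart from `comment`, which matches the column queried later in this notebook:

```python
def fetch_subs_and_comments(reddit, subreddit_nm, query, results_lim):
    """Search a subreddit and return (submission rows, comment rows).

    `reddit` is expected to be an authorized praw.Reddit instance.
    This is a sketch, not the project's actual scraper.
    """
    subs, comms = [], []
    for sub in reddit.subreddit(subreddit_nm).search(query, limit=results_lim):
        subs.append({"submission_id": sub.id, "title": sub.title})
        # Discard unresolved "load more comments" stubs instead of fetching them
        sub.comments.replace_more(limit=0)
        # comments.list() flattens the comment tree into a single list
        for comment in sub.comments.list():
            comms.append({
                "submission_id": sub.id,
                "comment_id": comment.id,
                "comment": comment.body,
            })
    return subs, comms
```

With `limit=0`, `replace_more` trades completeness for speed; fetching every collapsed thread would take additional API requests per submission.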


---

That should be all there is to getting data now.
- Note that results are only appended, so duplicates will eventually need to be removed on either the SQL side or the Python side.
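
One way to handle the duplicates on the Python side is to keep only the first row seen for each unique ID before writing. A minimal sketch, assuming each row is a dict and that a `comment_id` field exists (a hypothetical column name; the actual table schema is not shown here):

```python
def drop_duplicate_rows(rows, key):
    """Return rows keeping the first occurrence of each key value, in order."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


rows = [
    {"comment_id": "a1", "comment": "first"},
    {"comment_id": "b2", "comment": "second"},
    {"comment_id": "a1", "comment": "first"},  # duplicate from a repeated pull
]
print(len(drop_duplicate_rows(rows, "comment_id")))  # 2
```

For data already loaded into a DataFrame, `DataFrame.drop_duplicates(subset=[...])` does the same job.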

---

# Prior Runs

## Game of Thrones

persons = """
"doran" OR "davos"
"""

persons = """
            "bran" OR 'brandon stark' OR 'jon snow' OR 'jon' 
                         OR 'khaleesi' OR 'dany' OR 'daenerys' OR 'danyris'
          """
          
It took 14.21 minutes to complete.
There were 249 submissions added.
There were 11272 comments added.

persons = """
            "cersei" OR 'tyrion' OR 'sansa' OR 'arya' 
                        OR 'stannis' OR 'varys' OR 'jamie' OR 'brienne'
"""

It took 92.47 minutes to complete.  
There were 246 submissions added.  
There were 65,896 comments added.

persons = """
            "samwell" OR "jorah" OR "theon" OR "hound" OR "littlefinger" 
          """

It took 30.70 minutes to complete.  
There were 246 submissions added.  
There were 30,374 comments added.  

persons = """
            "joffrey" OR "sandor" OR "mountain" OR "gregor" OR "baelish" 
          """  
          
It took 24.55 minutes to complete.  
There were 249 submissions added.  
There were 23,146 comments added.

persons = """
            "robb" OR "drogo" OR "melisandre" OR "bronn" OR "gilly" OR
            "ramsey" OR "missandei" OR "gendry" OR "grey worm"
          """
          
It took 23.15 minutes to complete.  
There were 249 submissions added.  
There were 28,920 comments added.  

persons = """
            "ned" OR "eddard" OR "catelyn" OR "bronn" OR "torumund" OR
            "robert" OR "tommen" OR "viserys" OR "margaery"
          """
          
It took 21.79 minutes to complete.  
There were 250 submissions added.  
There were 23,731 comments added.     

## Democratic Candidates

persons = """
            "kamala" OR "senator harris" OR "K. Harris" OR "biden" OR 
            "mayor pete" OR "buttigidg" OR "buttigieg" OR "bootijedge"
        """
        
It took 80.46 minutes to complete.  
There were 249 submissions added.  
There were 61,298 comments added.

persons = """
            "gillibrand" OR "hickenlooper" OR "klobuchar" OR "warren" OR
            "booker" OR "inslee" OR "castro" OR "gabbard" OR "sanders" 
        """

It took 122.33 minutes to complete.  
There were 250 submissions added.  
There were 95,034 comments added.  

persons = """
            "de blasio" OR "bullock" OR "gravel" OR "messam"  
        """  
        
It took 7.84 minutes to complete.  
There were 91 submissions added.  
There were 2,486 comments added.

persons = """
            "o'rourke"  
        """
        
It took 17.92 minutes to complete.  
There were 87 submissions added.  
There were 7,150 comments added.

persons = """
            "bennet" OR "delaney" OR "moulton" OR "swalwell" OR "williamson"
            OR "yang"
        """  
        
It took 18.63 minutes to complete.  
There were 97 submissions added.  
There were 6,907 comments added.

### Bookmark! <a name='bookmark' />

![](https://images.unsplash.com/photo-1534224563519-fea04849cadf?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1350&q=80)

![](https://images.unsplash.com/photo-1553058296-61093581de13?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1351&q=80)

### f. Check that the table was created, or can be appended.

In [16]:
# Setup PSQL connection
conn = psql.connect(
    database=db,
    user=user,
    password=ps,
    host=host,
    port='5432'
)

In [17]:
#QUERY TO GET LIST OF TABLES
query = """
    SELECT * FROM pg_catalog.pg_tables
    WHERE schemaname = 'public';
"""

In [18]:
# Instantiate cursor
cur = conn.cursor()

In [33]:
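# Row-count query, kept under its own name so it does not overwrite the
# table-listing query above (run it with cur.execute(count_query))
count_query = """
    SELECT count(*) ct FROM got_comms;
"""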

In [19]:
# Execute the query
cur.execute(query)

In [46]:
# Only needed after a failed statement: resets the aborted transaction
conn.rollback()

In [21]:
# Check results
df_clone = pd.DataFrame(cur.fetchall())
df_clone.columns = [col.name for col in cur.description]

In [22]:
conn.commit()

In [23]:
df_clone

| | schemaname | tablename | tableowner | tablespace | hasindexes | hasrules | hastriggers | rowsecurity |
|---|---|---|---|---|---|---|---|---|
| 0 | public | dems_subs | dsaf | | True | False | False | False |
| 1 | public | dems_comms | dsaf | | True | False | False | False |
| 2 | public | test_table | dsaf | | True | False | False | False |
| 3 | public | pitches_test | dsaf | | True | False | False | False |
| 4 | public | got_subs | dsaf | | True | False | False | False |
| 5 | public | got_comms | dsaf | | True | False | False | False |
| 6 | public | pitches | dsaf | | True | False | False | False |
| 7 | public | got_subs_old | dsaf | | True | False | False | False |


In [39]:
conn.close()

In [52]:
# The psycopg2 connection was closed above, so read through the SQLAlchemy engine.
dems_comments = """
    SELECT DISTINCT comment 
    FROM dems_comms 
"""
df_dems = pd.read_sql(dems_comments, sql_alch_engine)

In [55]:
df_dems.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165221 entries, 0 to 165220
Data columns (total 1 columns):
comment    165221 non-null object
dtypes: object(1)
memory usage: 1.3+ MB


In [56]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
# Score only the first 10 comments as a quick sanity check
sentiment_list = [analyzer.polarity_scores(comment)
                  for comment in df_dems['comment'][:10]]

In [57]:
sentiment_list

[{'neg': 0.0, 'neu': 0.0, 'pos': 0.0, 'compound': 0.0},
 {'neg': 0.0, 'neu': 0.0, 'pos': 0.0, 'compound': 0.0},
 {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0},
 {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0},
 {'neg': 0.066, 'neu': 0.842, 'pos': 0.092, 'compound': 0.2153},
 {'neg': 0.347, 'neu': 0.565, 'pos': 0.089, 'compound': -0.6369},
 {'neg': 0.042, 'neu': 0.888, 'pos': 0.07, 'compound': 0.7347},
 {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0},
 {'neg': 0.244, 'neu': 0.64, 'pos': 0.116, 'compound': -0.296},
 {'neg': 0.0, 'neu': 0.667, 'pos': 0.333, 'compound': 0.25}]
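
The `compound` score is usually the field to work with downstream. A small sketch that turns it into a label using the ±0.05 cutoffs conventionally suggested for VADER; `label_sentiment` is a hypothetical helper, and the threshold is an adjustable assumption rather than something tuned on this data:

```python
def label_sentiment(compound, threshold=0.05):
    """Map a VADER compound score to 'positive', 'negative', or 'neutral'."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Compound scores copied from the ten results above
compounds = [0.0, 0.0, 0.0, 0.0, 0.2153, -0.6369, 0.7347, 0.0, -0.296, 0.25]
labels = [label_sentiment(c) for c in compounds]
print(labels.count("positive"), labels.count("negative"), labels.count("neutral"))
# 3 2 5
```

The first two results score 0.0 on every field, including `neu`, which suggests those comments contained no scorable text and may be worth filtering out before aggregating.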