# Data Acquisition from Reddit

Go to <a href=#bookmark>bookmark (I'm clickable!)</a>

### 2019-06-08 - Goal - Develop End-to-End Data Flow, at least at small scale.
# OR BUST

![](https://images.unsplash.com/photo-1515255384510-23e8b6a6ca3c?ixlib=rb-1.2.1&auto=format&fit=crop&w=1489&q=80)

---

## Libraries

In [1]:
# Install libs on this computer:
# !pip install praw
# !pip install pymongo
# !pip install psycopg2

In [2]:
import os             # file system stuff
import json           # digest json
import praw           # reddit API
import pandas as pd   # Dataframes
import pymongo        # MongoDB
import numpy as np    # math and arrays

import time           # To time stuff

#DATA STORAGE
from sqlalchemy import create_engine # SQL helper
import psycopg2 as psql #PostgreSQL DBs

from pandas import json_normalize  # pandas >= 1.0; older versions: pandas.io.json.json_normalize

from vaderSentiment import vaderSentiment

In [3]:
import helper     # Custom helper functions

In [4]:
import reddit_data # Custom reddit scraper

---

## 1B Load AWS-PostgreSQL DB keys

#### Save submissions to DB

In [5]:
# Define path to secret

# secret_path_aws = os.path.join(os.environ['HOME'], 'mia', '.secret', 
#                            'aws_ps_flatiron.json')
secret_path_aws = os.path.join(os.environ['HOME'], '.secret', 
                           'aws_ps_flatiron.json')
secret_path_aws

'/Users/werlindo/.secret/aws_ps_flatiron.json'

#### Load keys

In [6]:
aws_keys = helper.get_keys(secret_path_aws)
user = aws_keys['user']
ps = aws_keys['password']
host = aws_keys['host']
db = aws_keys['db_name']
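
`helper.get_keys` is a custom function whose source is not shown in this notebook. A minimal sketch of what such a loader might look like, assuming the secret file is a flat JSON object holding the `user`, `password`, `host` and `db_name` keys read above. The name `load_keys` and the required-key check are illustrative, not the actual helper:

```python
import json
from pathlib import Path

# Keys read from the secret file in the cell above.
REQUIRED_KEYS = ("user", "password", "host", "db_name")


def load_keys(path):
    """Hypothetical stand-in for helper.get_keys: parse a JSON secrets file."""
    keys = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [k for k in REQUIRED_KEYS if k not in keys]
    if missing:
        # Fail early with a clear message instead of a KeyError later on.
        raise KeyError(f"secret file is missing keys: {missing}")
    return keys
```

Checking for the expected keys up front means a malformed secret file fails at load time rather than midway through the connection setup.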

In [7]:
aws_ps_engine = ('postgresql://' + user + ':' + ps + '@' + host + '/' + db)
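
Plain string concatenation yields a broken URL if the password contains characters such as `@`, `/` or `:`. A sketch of one way to guard against that with the standard library only. `build_pg_url` is a hypothetical helper name, the credentials are made up, and the default port 5432 is the one passed to `psql.connect` in this notebook:

```python
from urllib.parse import quote_plus


def build_pg_url(user, password, host, db_name, port=5432):
    """Build a postgresql:// URL with the credentials percent-encoded."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{db_name}"
    )


# Example with made-up credentials:
url = build_pg_url("me", "p@ss/word", "example.host", "mydb")
print(url)  # postgresql://me:p%40ss%2Fword@example.host:5432/mydb
```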

### Use SQLAlchemy to create PSQL engine

In [8]:
# dialect+driver://username:password@host:port/database
sql_alch_engine = create_engine(aws_ps_engine)

### f. Check that the table was created, or can be appended.

In [9]:
# Setup PSQL connection
conn = psql.connect(
    database=db,
    user=user,
    password=ps,
    host=host,
    port='5432'
)

In [20]:
#QUERY TO GET LIST OF TABLES
query = """
    SELECT * FROM pg_catalog.pg_tables
    WHERE schemaname = 'public';
"""

In [11]:
# Instantiate cursor
cur = conn.cursor()

In [25]:
# Set up query
query = """
    SELECT count(*) ct FROM comms_sntmnt_190612;
"""

In [26]:
# Execute the query
cur.execute(query)

# End the read-only transaction; the rows are already buffered client-side
conn.rollback()

In [27]:
# Check results
df_clone = pd.DataFrame(cur.fetchall())
df_clone.columns = [col.name for col in cur.description]

In [28]:
conn.commit()

In [29]:
df_clone

Unnamed: 0,ct
0,305899
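
The execute / `fetchall` / rename-columns pattern recurs throughout this notebook. A sketch of wrapping it in one function, demonstrated against an in-memory SQLite database so that it runs without AWS credentials. `run_query` is a hypothetical name, and the toy table only borrows the `comms_sntmnt_190612` name. The first element of each `cursor.description` entry is the column name in DB-API drivers, which is the value psycopg2 also exposes as `col.name`:

```python
import sqlite3


def run_query(conn, sql):
    """Run a SELECT and return (column_names, rows)."""
    cur = conn.cursor()
    try:
        cur.execute(sql)
        columns = [col[0] for col in cur.description]
        rows = cur.fetchall()
    finally:
        cur.close()
    return columns, rows


# Stand-in database for the demo (three made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comms_sntmnt_190612 (comment TEXT)")
conn.executemany(
    "INSERT INTO comms_sntmnt_190612 VALUES (?)", [("a",), ("b",), ("c",)]
)

columns, rows = run_query(conn, "SELECT count(*) AS ct FROM comms_sntmnt_190612")
print(columns, rows)  # ['ct'] [(3,)]
conn.close()
```

The returned pair can be passed straight to `pd.DataFrame(rows, columns=columns)`.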


In [39]:
conn.close()

In [52]:
got_comments = """
    SELECT DISTINCT comment 
    FROM dems_comms 
"""
cur.execute(got_comments)
df_dems = pd.DataFrame(cur.fetchall())
df_dems.columns = [col.name for col in cur.description]

In [55]:
df_dems.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165221 entries, 0 to 165220
Data columns (total 1 columns):
comment    165221 non-null object
dtypes: object(1)
memory usage: 1.3+ MB


![](https://images.unsplash.com/photo-1489533119213-66a5cd877091?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1951&q=80)

### Bookmark! <a name='bookmark' />

# Make combined comments df

In [106]:
# Setup PSQL connection
conn = psql.connect(
    database=db,
    user=user,
    password=ps,
    host=host,
    port='5432'
)

In [107]:
# Instantiate cursor
cur = conn.cursor()

In [93]:
# Drop the old sentiment table
drop_table = """
    DROP TABLE comms_sntmnt_2;
"""

# DDL returns no rows, so execute and commit without fetching
cur.execute(drop_table)
conn.commit()
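
Statements such as `DROP TABLE` return no result set, so calling `fetchall()` after them raises psycopg2's `ProgrammingError: no results to fetch`; they only need to be executed and committed. A sketch of a guard that fetches only when the cursor reports columns, again using in-memory SQLite as a stand-in. `execute_any` is a hypothetical name:

```python
import sqlite3


def execute_any(conn, sql):
    """Execute sql; return rows for queries, None for statements without results."""
    cur = conn.cursor()
    cur.execute(sql)
    if cur.description is None:  # DDL/DML: nothing to fetch
        conn.commit()
        return None
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
execute_any(conn, "CREATE TABLE comms_sntmnt_2 (comment TEXT)")
dropped = execute_any(conn, "DROP TABLE IF EXISTS comms_sntmnt_2")
tables = execute_any(conn, "SELECT name FROM sqlite_master WHERE type = 'table'")
print(dropped, tables)  # None []
conn.close()
```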

In [64]:
# Pull combined comments
got_comments = """
    SELECT DISTINCT comment, 'got' as domain 
    FROM got_comms 
    UNION ALL
    SELECT DISTINCT comment, 'dems' as domain 
    FROM dems_comms 
    ;   
"""

# 'Cast' results to dataframe
cur.execute(got_comments)
df_comms = pd.DataFrame(cur.fetchall())
df_comms.columns = [col.name for col in cur.description]

In [65]:
df_comms.head()

Unnamed: 0,comment,domain
0,,got
1,^,got
2,=),got
3,‍,got
4,¯\\\_(ツ)\_/¯,got


Check how many comments per domain

In [66]:
df_comms['domain'].value_counts()

dems    178320
got     127579
Name: domain, dtype: int64
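
`DISTINCT` deduplicates within each branch only, and `UNION ALL` keeps every row from both branches, so a comment that appears in both tables survives once per domain. A toy illustration in SQLite, reusing the table names from the query above with made-up rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE got_comms (comment TEXT);
    CREATE TABLE dems_comms (comment TEXT);
    INSERT INTO got_comms VALUES ('lol'), ('lol'), ('winter');
    INSERT INTO dems_comms VALUES ('lol'), ('vote');
    """
)

rows = conn.execute(
    """
    SELECT DISTINCT comment, 'got' AS domain FROM got_comms
    UNION ALL
    SELECT DISTINCT comment, 'dems' AS domain FROM dems_comms
    """
).fetchall()

# Count rows per domain, like value_counts() on the 'domain' column.
counts = {}
for _comment, domain in rows:
    counts[domain] = counts.get(domain, 0) + 1
print(counts)
conn.close()
```

Here `'lol'` is collapsed to one row within `got_comms` but still appears under both domains, which is why the per-domain counts sum to the total row count.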

## Perform sentiment analysis with VADER

In [67]:
analyzer = vaderSentiment.SentimentIntensityAnalyzer()
sentiment_list = []
for comment in df_comms['comment']:
    sentiment_list.append(analyzer.polarity_scores(comment))
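
The loop above passes each comment straight to the analyzer. If the column can contain SQL `NULL`s (which arrive as `None`), it may be safer to substitute an empty string before scoring. A sketch with the scorer passed in as a parameter: `score_comments` and the toy `fake_scorer` are hypothetical, and in this notebook the scorer would be `analyzer.polarity_scores`:

```python
def score_comments(comments, scorer):
    """Apply scorer to each comment, treating non-string values as empty text."""
    return [scorer(c if isinstance(c, str) else "") for c in comments]


def fake_scorer(text):
    """Toy stand-in that mimics only the shape of a VADER result."""
    return {"neg": 0.0, "neu": 1.0 if text else 0.0, "pos": 0.0, "compound": 0.0}


scores = score_comments(["=)", None, ""], fake_scorer)
print(len(scores))  # 3
```

Because the output list keeps one entry per input, it still lines up row for row with the comments DataFrame.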

In [68]:
len(sentiment_list)

305899

Double-check that the list has as many entries as the `comments` DataFrame has rows. The difference should be 0.

In [69]:
len(sentiment_list) - df_comms.shape[0]

0

Great. Now cast to dataframe.

In [70]:
df_sentiment = json_normalize(sentiment_list)

In [71]:
df_sentiment.head()

Unnamed: 0,compound,neg,neu,pos
0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0
2,0.4939,0.0,0.0,1.0
3,0.0,0.0,0.0,0.0
4,0.0,0.0,1.0,0.0
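
The `compound` column is a normalized sum of word valences, squashed into the range [-1, 1]. A sketch of that normalization as it appears in the vaderSentiment source (x / sqrt(x² + α) with α = 15), plus the ±0.05 cut-offs its documentation suggests for labeling. The raw valence of 2.2 used below is an assumption, chosen because it reproduces the 0.4939 shown for `=)` in row 2; `label` is a hypothetical helper name:

```python
import math


def normalize(score, alpha=15):
    """Map an unbounded valence sum into [-1, 1]."""
    norm = score / math.sqrt(score * score + alpha)
    return max(-1.0, min(1.0, norm))


def label(compound, threshold=0.05):
    """Bucket a compound score with the commonly used +/-0.05 thresholds."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


print(round(normalize(2.2), 4))  # 0.4939
print(label(0.4939), label(0.0), label(-0.5267))  # positive neutral negative
```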


Combine with the 'comments' dataframe.

In [72]:
df_comms_sent = pd.concat([df_comms, df_sentiment], axis=1)

Check shape, etc.

In [81]:
df_comms_sent.tail(10)

Unnamed: 0,comment,domain,compound,neg,neu,pos
305889,> Bahahaha. You haven't a clue what anarchism ...,dems,-0.5267,0.104,0.896,0.0
305890,"""Bret Baier, had been keen to point out that ...",dems,0.3612,0.0,0.912,0.088
305891,">Cory Booker, for example, is assailed thus: ...",dems,0.8311,0.03,0.755,0.215
305892,"> ""Everyone should be deeply troubled by the v...",dems,-0.9468,0.227,0.758,0.015
305893,« I can just see a compilation of all the weir...,dems,-0.5574,0.204,0.796,0.0
305894,> I mean everyone assumed she didn’t like DeV...,dems,0.0772,0.067,0.856,0.076
305895,""" It seems to me to be equally plain that no b...",dems,0.7096,0.021,0.894,0.086
305896,"« Pretty please daddy putin, leave Ukraine alo...",dems,0.5106,0.274,0.256,0.47
305897,> she is single-handedly elevating the already...,dems,0.3804,0.064,0.815,0.121
305898,(ﾉ °益°)ﾉ 彡 ˙ɔ˙u˙p\n\nInslee FTW. [r/inslee2020...,dems,0.8506,0.0,0.35,0.65


Let's make the comments all lowercase. This happens after scoring because VADER treats capitalization as emphasis.

In [82]:
df_comms_sent['comment'] = df_comms_sent['comment'].str.lower()

Check.

In [83]:
df_comms_sent.tail(10)

Unnamed: 0,comment,domain,compound,neg,neu,pos
305889,> bahahaha. you haven't a clue what anarchism ...,dems,-0.5267,0.104,0.896,0.0
305890,"""bret baier, had been keen to point out that ...",dems,0.3612,0.0,0.912,0.088
305891,">cory booker, for example, is assailed thus: ...",dems,0.8311,0.03,0.755,0.215
305892,"> ""everyone should be deeply troubled by the v...",dems,-0.9468,0.227,0.758,0.015
305893,« i can just see a compilation of all the weir...,dems,-0.5574,0.204,0.796,0.0
305894,> i mean everyone assumed she didn’t like dev...,dems,0.0772,0.067,0.856,0.076
305895,""" it seems to me to be equally plain that no b...",dems,0.7096,0.021,0.894,0.086
305896,"« pretty please daddy putin, leave ukraine alo...",dems,0.5106,0.274,0.256,0.47
305897,> she is single-handedly elevating the already...,dems,0.3804,0.064,0.815,0.121
305898,(ﾉ °益°)ﾉ 彡 ˙ɔ˙u˙p\n\ninslee ftw. [r/inslee2020...,dems,0.8506,0.0,0.35,0.65


Good!

In [84]:
df_comms_sent.shape

(305899, 6)

## Since it takes a bit to get to this point, how about we save this DF to AWS?

In [88]:
import datetime

In [95]:
print(datetime.datetime.now())

2019-06-12 08:30:42.341165


started 8:19am

df_comms_sent.to_sql('comms_sntmnt_190612', con=sql_alch_engine, if_exists='append')
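
The notes above track the upload time by hand. A sketch of a small context manager that measures a step automatically; `timed` is a hypothetical helper, and the `time.sleep` call is only a placeholder for the `to_sql` upload:

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print how long the enclosed block took; optionally record it in results."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {elapsed:.2f} s")


timings = {}
with timed("upload", timings):
    time.sleep(0.01)  # placeholder for the df_comms_sent.to_sql(...) call
```

Note that `if_exists='append'` adds the rows again on every re-run, so running the upload cell twice doubles the table. `to_sql` also accepts `chunksize` and `method='multi'`, which may shorten the upload for a table of this size.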

### Check!

In [38]:
query = """
        select * from comms_sntmnt_190612 limit 1000;
        """

# 'Cast' results to dataframe
cur.execute(query)
df_check = pd.DataFrame(cur.fetchall())
df_check.columns = [col.name for col in cur.description]

In [39]:
conn.commit()

In [40]:
df_check.tail()

Unnamed: 0,index,comment,domain,compound,neg,neu,pos
995,1751,"after the fight was clearly lost, yeah. after ...",got,-0.5994,0.371,0.422,0.207
996,1752,after the final episode it doesn’t seem as bad...,got,-0.25,0.184,0.684,0.132
997,1753,"after the first dive bomb sure, but roughly 10...",got,0.5588,0.081,0.795,0.124
998,1754,after the first few scorpion crews got torched...,got,-0.8957,0.307,0.643,0.05
999,1755,"after the game, the king and the pawn go into ...",got,0.0,0.0,1.0,0.0


In [41]:
df_check.head()

Unnamed: 0,index,comment,domain,compound,neg,neu,pos
0,770,"6 and 7 were okay, certainly better than 5, bu...",got,0.6486,0.144,0.499,0.358
1,771,6b is the most satisfying sketching pencil! go...,got,0.5963,0.0,0.721,0.279
2,772,6. dany acts so so happy to reward gendry afte...,got,0.873,0.075,0.661,0.265
3,773,"6 episodes, it was obvious as fuck things will...",got,-0.5423,0.226,0.774,0.0
4,774,6 inches is 15.24 cm,got,0.0,0.0,1.0,0.0
