# Case Study, Part 2: Scraping Reddit with PRAW

Although Pushshift is a wonderful resource when it comes to scraping Reddit data, it's not infallible. In some cases, important data will be missing from the Pushshift API, and you'll need to supplement the Pushshift data with the metadata available through Reddit's official API. 

Luckily, we can accomplish this using [PRAW](https://praw.readthedocs.io/en/latest/), the Python Reddit API Wrapper. This chapter walks through the steps needed to supplement Pushshift data using PRAW.

## Setup

In [1]:
pip install praw

Note: you may need to restart the kernel to use updated packages.


In [2]:
import praw

In [3]:
import requests
import pandas as pd
import json
import csv
import time

## Creating a Reddit App

In order to use PRAW, you'll need to register your own application on Reddit. To do *that*, you'll first need a Reddit account.


Once you've created an account on Reddit, you can navigate to the [developed applications](https://www.reddit.com/prefs/apps) page from Reddit preferences. Here, you'll see a button prompting you to "create app." Click it, and you should see the following: 

![create application](https://i.snipboard.io/zKZ3vq.jpg)

Make sure you're creating a **script** app, as this is what we'll need in order to make requests with PRAW. Feel free to name and describe the app as you see fit, then click the button at the bottom to create your app. 

For additional guidance on how to develop your own Reddit application, see [here](https://github.com/reddit-archive/reddit/wiki/OAuth2-Quick-Start-Example#first-steps).

## Obtaining a Reddit Instance

Now that we've created an application on Reddit, we can obtain Reddit instances using PRAW. While it's possible to create two separate types of instance -- read-only or authorized -- for the sake of this chapter we'll focus on obtaining a read-only instance. 

This is where the script application we just created becomes important. To obtain a read-only Reddit instance, we'll need to provide our `client_id`, our `client_secret`, and our `user_agent`.

*Note*: Treat the client secret like a password. Avoid committing it to version control or leaving it in a notebook that you share or publish.

In [5]:
reddit = praw.Reddit(client_id="my client id",
                     client_secret="my client secret",
                     user_agent="my user agent")
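
One way to keep credentials out of the notebook itself is to read them from environment variables. The following is a minimal sketch, not something PRAW requires: the variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, and `REDDIT_USER_AGENT` and the helper `load_reddit_credentials` are our own hypothetical choices.

```python
import os

# Hypothetical environment variable names; PRAW does not mandate these.
ENV_KEYS = {
    "client_id": "REDDIT_CLIENT_ID",
    "client_secret": "REDDIT_CLIENT_SECRET",
    "user_agent": "REDDIT_USER_AGENT",
}

def load_reddit_credentials(env=None):
    """Return keyword arguments for praw.Reddit, read from the environment."""
    env = os.environ if env is None else env
    missing = [name for name in ENV_KEYS.values() if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {arg: env[name] for arg, name in ENV_KEYS.items()}

# Usage (the variables must be set before the notebook kernel starts):
# reddit = praw.Reddit(**load_reddit_credentials())
```

Failing early with a list of the missing variables makes a misconfigured environment easier to diagnose than an authentication error later on.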

With the information from our script application, we can confirm that the instance is read-only by printing `reddit.read_only`. As an example, the cell below also prints the titles of up to 20 of the current "hot" submissions from [r/MakeupRehab](https://www.reddit.com/r/MakeupRehab/):

In [7]:
print(reddit.read_only)
for submission in reddit.subreddit("MakeupRehab").hot(limit=20):
    print(submission.title)

True
Megathread: COVID-19 / Coronavirus Resource and Discussion
Topic Tuesday - 'What is Your Favourite Lip Product in Your Collection' ''- May 26, 2020
Makeup Rehabbers.. and at what point did you feel fully rehabbed? What happens when you end a no buy? & Other questions..
Use Your Perfume Samples!
Tempted to get rid of most of my makeup. What should I do?
Constantly obsessively feeling like my collection is incomplete due to changing needs/preferences.
Limited Edition does not mean you have to buy it.
Not makeup, hope this is ok.
Talk me out of a purchase?
Ways to use up products I had bought "for fun" and now regret?
2 months of buying NO makeup!!!
30 days WITHOUT buying any makeup/skincare!
Feel like I'm forcing myself to shop at Sephora because I have a gift card
MuR Daily Chat - May 26, 2020
*May.25th: Daily check in thread for No Buy May (NBM)*
Talk me out of buying Urban Decay Brow Blade
Does your makeup expire?
makeup companies and my little pony—how they sell us the same prod
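
Titles are only one of the attributes a submission carries. As a rough sketch of how several fields could be gathered at once, the hypothetical helper `submissions_to_rows` below reads the `id`, `title`, `score`, and `created_utc` attributes from each submission. The stand-in objects and their values are invented so the example runs without network access.

```python
from types import SimpleNamespace

def submissions_to_rows(submissions):
    """Collect a few fields from PRAW-style submission objects into dicts."""
    rows = []
    for submission in submissions:
        rows.append({
            "id": submission.id,
            "title": submission.title,
            "score": submission.score,
            "created_utc": submission.created_utc,
        })
    return rows

# Stand-in objects (made-up values) so the sketch runs offline.
fake_posts = [
    SimpleNamespace(id="abc123", title="Example post", score=10, created_utc=1590451200),
]
print(submissions_to_rows(fake_posts))
```

With a real instance, the same helper could be called as `submissions_to_rows(reddit.subreddit("MakeupRehab").hot(limit=20))`, and the resulting list of dicts can be passed directly to `pd.DataFrame`.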

## Filling in gaps in the Pushshift API

While Pushshift is a great entryway into scraping data from Reddit, you'll occasionally run into some notable gaps in available data. These gaps in the Pushshift data tend to turn up following a [large wave of subreddit quarantines in September 2018](https://www.newsweek.com/reddit-quarantine-subs-toxic-controversial-moderators-1144663).


Using the API's [Reddit Search](https://redditsearch.io/) referenced at the end of Part 1, we can scope out any potential holes in a subreddit's data before going through the process of scraping that data ourselves.

As an example, let's look at the data for one of the subreddits quarantined in September 2018, r/CringeAnarchy:

![redditsearch](https://i.snipboard.io/7qI8eF.jpg)

There's a fairly large chunk of data missing following the quarantine. Data availability only picks back up at the beginning of 2019.

Let's see if we're able to get more complete data from r/CringeAnarchy by combining what is available through the Pushshift API with what PRAW can retrieve. The next section of this chapter goes step by step through how we can accomplish this, focusing first on scraping subreddit submissions.

### Step 1: Collecting Post ids from Pushshift

As we did in the first part of the case study, we'll set the first and last UTC timestamps between which to scrape data.

In [212]:
first_utc = 1420070400 # 2015-01-01 00:00:00 UTC, start of the scraping window
last_utc = 1556409600 # 2019-04-28 00:00:00 UTC, after the subreddit was banned in April 2019
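
Epoch values like these are easy to mistype, so it can help to derive them from calendar dates. This is a small sketch using only the standard library; the helper names `to_epoch` and `from_epoch` are our own.

```python
from datetime import datetime, timezone

def to_epoch(year, month, day):
    """Return the Unix timestamp for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())

def from_epoch(ts):
    """Return a timezone-aware UTC datetime for a Unix timestamp."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)

print(to_epoch(2015, 1, 1))                # 1420070400
print(to_epoch(2019, 4, 28))               # 1556409600
print(from_epoch(1556409600).isoformat())  # 2019-04-28T00:00:00+00:00
```

Passing `tzinfo=timezone.utc` explicitly matters here: a naive `datetime` would be interpreted in the local time zone and could shift the window by several hours.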

In [169]:
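def ScrapeSubmissions(after, before):
    # Request up to 1000 r/CringeAnarchy submissions created between `after` and `before`, oldest first
    url = 'https://api.pushshift.io/reddit/submission/search/?sort_type=created_utc&sort=asc&subreddit=CringeAnarchy&after='+ str(after) +"&before="+str(before)+"&size=1000"
    r = requests.get(url)
    data = json.loads(r.text)  # raises ValueError if the response is not valid JSON
    return data['data']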
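
Query strings assembled by hand are easy to get subtly wrong, for example by dropping a `=` or `&`. As an alternative sketch, the hypothetical helper `build_search_url` below lets the standard library's `urlencode` assemble the same parameters used in the function above.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.pushshift.io/reddit/submission/search/"

def build_search_url(subreddit, after, before, size=1000):
    """Build the search URL from a dict of query parameters."""
    params = {
        "subreddit": subreddit,
        "after": after,
        "before": before,
        "size": size,
        "sort_type": "created_utc",
        "sort": "asc",
    }
    return BASE_URL + "?" + urlencode(params)

print(build_search_url("CringeAnarchy", 1420070400, 1556409600))
```

`requests.get` also accepts a `params=` dictionary directly, which achieves the same result without building the URL yourself.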


We'll need to set up a few empty lists to collect timestamps and post IDs. The `score` list is created here as well, although the loop below does not fill it.

In [170]:
timestamps = list()
post_ids = list()
score = list()

In [171]:
after = first_utc
while int(after) < last_utc:
    try:
        data = ScrapeSubmissions(after,last_utc)
        if not data:
            break  # no more posts in the window, so stop instead of looping forever
        for post in data:
            tmp_time = post['created_utc']
            tmp_id = post['id']
            timestamps.append(tmp_time)
            post_ids.append(tmp_id)
        after = timestamps[-1]  # resume from the newest post collected so far
        print([str(len(post_ids)) + " posts collected so far."])
        print(tmp_time)
        time.sleep(0.1)  # brief pause between requests
        if int(after) > 1556226432:
            break
    except ValueError:  # includes simplejson.decoder.JSONDecodeError
        print ('Decoding JSON has failed')


['1000 posts collected so far.']
1423856435
['2000 posts collected so far.']
1426708648
['3000 posts collected so far.']
1429408503
['4000 posts collected so far.']
1431579050
['5000 posts collected so far.']
1433438217
['6000 posts collected so far.']
1435174323
['7000 posts collected so far.']
1436709841
['8000 posts collected so far.']
1438097104
['9000 posts collected so far.']
1439394909
['10000 posts collected so far.']
1440542293
['11000 posts collected so far.']
1441584881
['12000 posts collected so far.']
1442777685
['13000 posts collected so far.']
1443836862
['14000 posts collected so far.']
1444938335
['15000 posts collected so far.']
1445856871
['16000 posts collected so far.']
1446797684
['17000 posts collected so far.']
1447729468
['18000 posts collected so far.']
1448604477
['19000 posts collected so far.']
1449457521
['20000 posts collected so far.']
1450249316
['21000 posts collected so far.']
1451081393
['22000 posts collected so far.']
1451855501
['23000 posts colle

ChunkedEncodingError: ('Connection broken: OSError("(54, \'ECONNRESET\')")', OSError("(54, 'ECONNRESET')"))

Looks like we've hit a snag with a `ChunkedEncodingError`. Our `try`/`except` block only catches `ValueError`, so this connection error stopped the loop. It is likely a one-off error related to a connection issue, so it's not too much to worry about.
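
For longer scrapes, one option is to retry a failed request a few times before giving up. The helper below, `with_retries`, is a hypothetical sketch of that idea and is not part of `requests` or PRAW. A simulated flaky function stands in for the network call so the example runs offline.

```python
import time

def with_retries(func, retry_on, attempts=3, delay=1.0):
    """Call func(); on an exception listed in retry_on, wait and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on as err:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            print(f"Attempt {attempt} failed ({err!r}); retrying in {delay}s")
            time.sleep(delay)

# Stand-in for a flaky network call: fails twice, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated dropped connection")
    return ["page of posts"]

print(with_retries(flaky, (ConnectionError,), attempts=5, delay=0))
```

In the scraping loop, the same wrapper could be applied as `with_retries(lambda: ScrapeSubmissions(after, last_utc), (requests.exceptions.ChunkedEncodingError, requests.exceptions.ConnectionError))`.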

Fortunately, it's not too much of an issue to pick up where we left off:

In [164]:
from datetime import datetime
%matplotlib inline

In [179]:
new_utc = 1520897062 # New start UTC, March 12th 2018

Now we can run our `while` loop again, this time with the new specified starting UTC.

In [180]:
after = new_utc
while int(after) < last_utc:
    try:
        data = ScrapeSubmissions(after,last_utc)
        if not data:
            break  # no more posts in the window, so stop instead of looping forever
        for post in data:
            tmp_time = post['created_utc']
            tmp_id = post['id']
            timestamps.append(tmp_time)
            post_ids.append(tmp_id)
        after = timestamps[-1]  # resume from the newest post collected so far
        print([str(len(post_ids)) + " posts collected so far."])
        print(tmp_time)
        time.sleep(0.1)  # brief pause between requests
        if int(after) > 1556226432:
            break
    except ValueError:  # includes simplejson.decoder.JSONDecodeError
        print ('Decoding JSON has failed')


['162000 posts collected so far.']
1521215770
['163000 posts collected so far.']
1521546633
['164000 posts collected so far.']
1521894062
['165000 posts collected so far.']
1522179053
['166000 posts collected so far.']
1522513991
['167000 posts collected so far.']
1522813021
['168000 posts collected so far.']
1523128092
['169000 posts collected so far.']
1523450293
['170000 posts collected so far.']
1523759577
['171000 posts collected so far.']
1524068065
['172000 posts collected so far.']
1524406860
['173000 posts collected so far.']
1524694760
['174000 posts collected so far.']
1525035146
['175000 posts collected so far.']
1525378330
['176000 posts collected so far.']
1525729515
['177000 posts collected so far.']
1526057177
['178000 posts collected so far.']
1526385057
['179000 posts collected so far.']
1526664165
['180000 posts collected so far.']
1526940636
['181000 posts collected so far.']
1527198769
['182000 posts collected so far.']
1527478434
['183000 posts collected so far.']
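
The two loops above share the same pagination logic: fetch a page, record its posts, then move `after` to the last timestamp seen. One way to avoid repeating it is a generator. The sketch below uses a hypothetical `paginate` helper and an in-memory stand-in for the API, assuming `after` is treated as an exclusive bound.

```python
def paginate(fetch_page, start, end):
    """Yield posts oldest-first, advancing `after` to the last timestamp seen."""
    after = start
    while after < end:
        page = fetch_page(after, end)
        if not page:
            break  # nothing newer in the window; avoids looping forever
        for post in page:
            yield post
        after = page[-1]["created_utc"]

# Stand-in fetcher over made-up in-memory data so the sketch runs offline.
FAKE_POSTS = [{"id": str(i), "created_utc": 100 + i} for i in range(1, 8)]

def fake_fetch(after, before, size=3):
    # Assumption: `after` and `before` are exclusive bounds.
    matches = [p for p in FAKE_POSTS if after < p["created_utc"] < before]
    return matches[:size]

collected = list(paginate(fake_fetch, 100, 200))
print(len(collected))  # 7
```

With the real API, `ScrapeSubmissions` could be passed as `fetch_page`, and restarting after an error only requires calling `paginate` again with a new `start`.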

Now we want to save the scraped post IDs and timestamps to a CSV file:

In [181]:
d = {'id':post_ids, 'timestamp':timestamps}
df = pd.DataFrame(d)
df.to_csv("post_ids.csv", index=False)

Let's use pandas' `to_datetime` to convert our UTC timestamps to human-readable datetime data:

In [215]:
df['created_time'] = pd.to_datetime(df['timestamp'], unit='s')

In [218]:
df['created_time'].describe()

count                  233778
unique                 233329
top       2017-07-08 23:59:18
freq                        3
first     2015-01-01 00:12:00
last      2019-04-25 21:07:13
Name: created_time, dtype: object