# Case Study, Part 2: Scraping Reddit with PRAW

Although Pushshift is a wonderful resource when it comes to scraping Reddit data, it's not infallible. In some cases, important data will be missing from the Pushshift API, and you'll need to supplement the Pushshift data with the metadata available through Reddit's official API. 

Luckily, we can accomplish this using the [PRAW](https://praw.readthedocs.io/en/latest/) Reddit API Wrapper. This chapter will go through the steps necessary to supplement Pushshift data using PRAW.

## Setup

In [2]:
pip install praw

Note: you may need to restart the kernel to use updated packages.


In [3]:
import praw

In [4]:
import requests
import pandas as pd
import json
import csv
import time

## Creating a Reddit App

In order to use PRAW, you'll need to develop your own application on Reddit. In order to do *that*, you'll need to create a Reddit account. 


Once you've created an account on Reddit, you can navigate to the [developed applications](https://www.reddit.com/prefs/apps) page from Reddit preferences. Here, you'll see a button prompting you to "create app." Click it, and you should see the following: 

![create application](https://i.snipboard.io/zKZ3vq.jpg)

Make sure you're creating a **script** app, as this is what we'll need in order to make requests with PRAW. Feel free to name and describe the app as you see fit, then click the button at the bottom to create your app. 

For additional guidance on how to develop your own Reddit application, see [here](https://github.com/reddit-archive/reddit/wiki/OAuth2-Quick-Start-Example#first-steps).

## Obtaining a Reddit Instance

Now that we've created an application on Reddit, we can obtain Reddit instances using PRAW. While it's possible to create two separate types of instance -- read-only or authorized -- for the sake of this chapter we'll focus on obtaining a read-only instance. 

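This is where the script application we just created becomes important. To obtain a read-only Reddit instance, we'll need to provide our `client_id`, our `client_secret`, and our `user_agent`. The first two come from the app we just created on Reddit, while the `user_agent` is a short descriptive string that you choose yourself.

*Note*: Treat the `client_secret` like a password. Avoid publishing it in a shared notebook or public repository, and replace the placeholder strings below with your own values only in your private copy.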

In [5]:
reddit = praw.Reddit(client_id="my client id",
                     client_secret="my client secret",
                     user_agent="my user agent")
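
One way to keep these values out of the notebook itself is to read them from environment variables. The sketch below is only an illustration of that idea, not part of the original walkthrough: the variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, and `REDDIT_USER_AGENT` and both helper functions are hypothetical, and it assumes you have already exported those variables in the shell that launches Jupyter.

```python
import os

# Hypothetical environment variable names (an assumption for this sketch);
# use whatever names you actually exported in your shell.
ENV_KEYS = {
    "client_id": "REDDIT_CLIENT_ID",
    "client_secret": "REDDIT_CLIENT_SECRET",
    "user_agent": "REDDIT_USER_AGENT",
}


def credentials_from_env(environ=None):
    """Collect the three praw.Reddit keyword arguments from environment variables."""
    environ = os.environ if environ is None else environ
    # Fail early, with a readable message, if any value is missing or empty.
    missing = [var for var in ENV_KEYS.values() if not environ.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {kwarg: environ[var] for kwarg, var in ENV_KEYS.items()}


def make_read_only_reddit(environ=None):
    """Build a read-only Reddit instance without hard-coding any secrets."""
    import praw  # imported lazily so credentials_from_env works without PRAW

    return praw.Reddit(**credentials_from_env(environ))
```

With the variables set, `reddit = make_read_only_reddit()` would stand in for the cell above, and the secret never appears in the saved notebook.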

With the information from our script application in place, we can confirm that our instance is read-only by printing `reddit.read_only`, and then start requesting data. As an example, the code below requests 20 of the "hottest" submissions from [r/MakeupRehab](https://www.reddit.com/r/MakeupRehab/) and prints their titles: 

In [7]:
print(reddit.read_only)
for submission in reddit.subreddit("MakeupRehab").hot(limit=20):
    print(submission.title)

True
Megathread: COVID-19 / Coronavirus Resource and Discussion
Topic Tuesday - 'What is Your Favourite Lip Product in Your Collection' ''- May 26, 2020
Makeup Rehabbers.. and at what point did you feel fully rehabbed? What happens when you end a no buy? & Other questions..
Use Your Perfume Samples!
Tempted to get rid of most of my makeup. What should I do?
Constantly obsessively feeling like my collection is incomplete due to changing needs/preferences.
Limited Edition does not mean you have to buy it.
Not makeup, hope this is ok.
Talk me out of a purchase?
Ways to use up products I had bought "for fun" and now regret?
2 months of buying NO makeup!!!
30 days WITHOUT buying any makeup/skincare!
Feel like I'm forcing myself to shop at Sephora because I have a gift card
MuR Daily Chat - May 26, 2020
*May.25th: Daily check in thread for No Buy May (NBM)*
Talk me out of buying Urban Decay Brow Blade
Does your makeup expire?
makeup companies and my little pony—how they sell us the same prod
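
Titles alone are rarely enough for analysis, so it can help to collect a few fields per submission into a table. The following is a minimal sketch under some assumptions: the helper names `submission_to_row` and `hot_submissions_frame` are hypothetical, and the selection of fields (`id`, `title`, `score`, `num_comments`, `created_utc`) is just one reasonable choice among the attributes PRAW exposes on a submission.

```python
def submission_to_row(submission):
    """Copy a handful of submission attributes into a plain dictionary."""
    return {
        "id": submission.id,
        "title": submission.title,
        "score": submission.score,
        "num_comments": submission.num_comments,
        "created_utc": submission.created_utc,  # Unix timestamp (seconds)
    }


def hot_submissions_frame(reddit, subreddit_name, limit=20):
    """Return the current 'hot' submissions of a subreddit as a DataFrame."""
    import pandas as pd  # imported lazily; only needed for the final table

    hot = reddit.subreddit(subreddit_name).hot(limit=limit)
    return pd.DataFrame([submission_to_row(s) for s in hot])
```

Calling `hot_submissions_frame(reddit, "MakeupRehab")` with the instance created earlier should give one row per submission, which can then be saved or merged like any other pandas table.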

## Filling in Gaps in the Pushshift API

While Pushshift is a great entryway into scraping data from Reddit, you'll occasionally run into some notable gaps in the available data. These gaps tend to appear after a [large wave of subreddit quarantines in September 2018](https://www.newsweek.com/reddit-quarantine-subs-toxic-controversial-moderators-1144663).


Using the Pushshift API's [Reddit Search](https://redditsearch.io/) tool referenced at the end of Part 1, we can scope out any potential holes in a subreddit's data before going through the process of scraping that data ourselves. 

As an example, let's look at the data for one of the subreddits quarantined in September 2018, r/CringeAnarchy:

![redditsearch](https://i.snipboard.io/7qI8eF.jpg)

There's a fairly large chunk of data missing following the quarantine. Data availability only picks back up at the beginning of 2019. 
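
When Pushshift does return a submission but some of its metadata is missing or stale, one option is to look the submission up again through PRAW by its ID. The sketch below assumes you already have a list of submission IDs (for example, from a Pushshift query in Part 1); the helper names `to_fullnames` and `fetch_current_metadata` are hypothetical. It relies on PRAW's `reddit.info(fullnames=...)` method and on Reddit's `t3_` prefix for submissions. Note that this cannot recover posts Pushshift never indexed, since it starts from IDs you already know.

```python
def to_fullnames(submission_ids):
    """Prefix bare submission IDs with 't3_', Reddit's type prefix for submissions."""
    return [sid if sid.startswith("t3_") else "t3_" + sid for sid in submission_ids]


def fetch_current_metadata(reddit, submission_ids):
    """Look up submissions by ID and return selected fields keyed by bare ID."""
    metadata = {}
    # reddit.info yields one object per fullname that Reddit can still resolve.
    for submission in reddit.info(fullnames=to_fullnames(submission_ids)):
        metadata[submission.id] = {
            "title": submission.title,
            "score": submission.score,
            "num_comments": submission.num_comments,
        }
    return metadata
```

The returned dictionary can then be joined back onto the Pushshift records by ID, keeping the Pushshift value wherever PRAW returns nothing for that submission.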