# Case Study, Part 2: Scraping Reddit with PRAW

Although Pushshift is a wonderful resource when it comes to scraping Reddit data, it's not infallible. In some cases, important data will be missing from the Pushshift API, and you'll need to supplement the Pushshift data with the metadata available through Reddit's official API. 

Luckily, we can accomplish this using the [PRAW](https://praw.readthedocs.io/en/latest/) Reddit API Wrapper. This chapter will go through the steps necessary to supplement Pushshift data using PRAW.

## Setup

In [None]:
pip install praw



In [2]:
import praw

In [3]:
import requests
import pandas as pd
import json
import csv
import time

## Creating a Reddit App

To use PRAW, you'll need to register your own application on Reddit, and to do *that*, you'll need a Reddit account.


Once you've created an account on Reddit, you can navigate to the [developed applications](https://www.reddit.com/prefs/apps) page from Reddit preferences. Here, you'll see a button prompting you to "create app." Click it, and you should see the following: 

![create application](https://i.snipboard.io/zKZ3vq.jpg)

Make sure you're creating a **script** app, as this is what we'll need in order to make requests with PRAW. Feel free to name and describe the app as you see fit, then click the button at the bottom to create your app. 

For additional guidance on how to develop your own Reddit application, see [here](https://github.com/reddit-archive/reddit/wiki/OAuth2-Quick-Start-Example#first-steps).

## Obtaining a Reddit Instance

Now that we've created an application on Reddit, we can obtain Reddit instances using PRAW. While it's possible to create two separate types of instance -- read-only or authorized -- for the sake of this chapter we'll focus on obtaining a read-only instance. 

This is where the script application we just created becomes important. We'll need to provide our `client_id`, our `client_secret`, and our `user_agent` in order to obtain a read-only Reddit instance.

*Note*: Treat these values, especially the client secret, as confidential. Share them only as far as you need to in order to access the data.

In [5]:
reddit = praw.Reddit(client_id="my client id",
                     client_secret="my client secret",
                     user_agent="my user agent")
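
One way to keep these values out of the notebook itself is to read them from environment variables. The sketch below is only an illustration: the variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, and `REDDIT_USER_AGENT`, and the helper `load_reddit_credentials`, are hypothetical and not part of PRAW.

```python
import os


def load_reddit_credentials(env=None):
    """Collect the three praw.Reddit keyword arguments from environment variables.

    The variable names used here are hypothetical; substitute whatever
    names you actually exported in your shell.
    """
    env = os.environ if env is None else env
    names = {
        "client_id": "REDDIT_CLIENT_ID",
        "client_secret": "REDDIT_CLIENT_SECRET",
        "user_agent": "REDDIT_USER_AGENT",
    }
    # Fail early, and name every variable that is unset or empty.
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in names.items()}


# Usage (assumes the variables are set and praw is imported):
# reddit = praw.Reddit(**load_reddit_credentials())
```

With this approach, the notebook can be shared without exposing the client secret, since the values live only in the environment where the notebook is run.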

With the credentials from our script application in place, printing `reddit.read_only` should return `True`, confirming that we have a read-only Reddit instance. As with Pushshift, we'll have to determine whether we'd like to look at data for submissions or comments.


As an example, below are the 3 "hottest" submissions from [r/MakeupRehab](https://www.reddit.com/r/MakeupRehab/), along with submission authors, titles, scores, and body text: 

In [20]:
print(reddit.read_only)
for submission in reddit.subreddit("MakeupRehab").hot(limit=3):
    print(submission.author)
    print(submission.title)
    print(submission.score)  
    print(submission.selftext)

True
toyaqueen
Megathread: COVID-19 / Coronavirus Resource and Discussion
43
Hi everyone,

r/DISCLAIMER: This is not to be taken as a post for health information or precautions on the novel coronavirus pandemic. PLEASE keep up-to-date via your local governments and health representatives.

As we all know by now, the novel coronavirus is impacting daily life all over the globe. Here in MUR, you may have already seen some of our fellow members post about struggles directly related to the virus - having to move suddenly and unexpectedly, job losses, trying to deal with the stress, seeing which companies are being shady and more. With that in mind, feel free to use this thread as the hub for any conversations you want to have on this topic.

We’ve decided to create a megathread that will be updated with resources from other subs (the crosspost rule will be lifted for the purposes of this thread only), relevant to the needs and interests of the community to use during this trying time for m
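
Printing is fine for a quick look, but when supplementing Pushshift data it may be more convenient to collect the same four fields into rows. The following is a minimal sketch, assuming any object that exposes `author`, `title`, `score`, and `selftext` attributes, as the submissions printed above do. The helper names `submission_to_row` and `rows_to_csv` are hypothetical.

```python
import csv
import io

FIELDS = ["author", "title", "score", "selftext"]


def submission_to_row(submission):
    """Copy the four printed attributes into a plain dict."""
    author = submission.author
    return {
        # Assumption: author may be None (e.g. for a deleted account),
        # so only convert it to a string when it is present.
        "author": str(author) if author is not None else None,
        "title": submission.title,
        "score": submission.score,
        "selftext": submission.selftext,
    }


def rows_to_csv(rows):
    """Serialize a list of row dicts to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Usage with the read-only instance created earlier:
# rows = [submission_to_row(s)
#         for s in reddit.subreddit("MakeupRehab").hot(limit=3)]
# print(rows_to_csv(rows))
```

Keeping the rows as plain dictionaries means they can also be passed straight to `pd.DataFrame(rows)` if a DataFrame is preferred over CSV text.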

**NOTE**: PRAW will not let you retrieve data from quarantined or banned subreddits. Attempting to do so will result in 403 and 404 errors, respectively.

## Filling in gaps in the Pushshift API

While Pushshift is a great entryway into scraping data from Reddit, you'll occasionally run into some notable gaps in available data.

Using the API's [Reddit Search](https://redditsearch.io/) tool, referenced at the end of Part 1, we can scope out any potential holes in subreddit data before going through the process of scraping that data ourselves.

As an example, let's look at [r/AmItheAsshole](https://www.reddit.com/r/AmItheAsshole/):

![r/AmItheAsshole](https://i.snipboard.io/BwiqKY.jpg)

The data visualizations available through Pushshift stand in pretty stark contrast to r/AmItheAsshole's creation in June 2013. 