# Reddit Data Collection and Analysis with PSAW

To collect Reddit data, we're going to use the [Pushshift API](https://www.reddit.com/r/pushshift/comments/bcxguf/new_to_pushshift_read_this_faq/), specifically a Python wrapper for the Pushshift API called [PSAW](https://github.com/dmarx/psaw) (PushShift API Wrapper). Why are we using the Pushshift API instead of the official Reddit API, and PSAW instead of Pushshift itself?

Well, as Pushshift's creator Jason Baumgartner and his co-authors describe it in their [published paper](https://arxiv.org/pdf/2001.08435.pdf), "Pushshift makes it much easier for researchers to query and retrieve historical Reddit data, provides extended functionality by providing fulltext search against comments and submissions, and has larger single query limits." PSAW, meanwhile, makes it easier to work with Pushshift and provides better documentation.

## Install PSAW

To use PSAW, we first need to install it.

In [None]:
!pip install psaw

Then we will import pandas, which we will use later to work with the collected data, and change two of pandas' default display settings so that DataFrames show wider columns (up to 200 characters) and up to 50 columns.

In [None]:
import pandas as pd
pd.options.display.max_colwidth = 200
pd.options.display.max_columns = 50

Next we will import the PushshiftAPI from psaw and initialize it.

In [None]:
from psaw import PushshiftAPI

# Initialize PushShift
api = PushshiftAPI()

## Collect Reddit Posts (By Subreddit)

To collect Reddit posts, we will use `api.search_submissions()` and then establish the parameters of our query, such as which subreddit we want to search in and what threshold of upvote score we want to set.

Below we set up a search for posts in the subreddit "AmITheAsshole" that have an upvote score greater than 2,000.

In [None]:
api_request_generator = api.search_submissions(subreddit='AmITheAsshole',
                                               score = ">2000")

Once this generator is set up, we can use it to collect Reddit posts. The code below is a list comprehension that loops through the generator and extracts relevant data for each matching Reddit post. It then turns that list into a Pandas DataFrame.

In [None]:
aita_submissions = pd.DataFrame([submission.d_ for submission in api_request_generator])

*The cell above should take a while to run. It's searching through Reddit's entire history. It's ok if you periodically get errors while it's running.*
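To see what the list comprehension does without waiting on the network, here is a minimal sketch that swaps the real generator for a small stand-in. `FakeSubmission` and `fake_generator` are hypothetical names invented for illustration, and the posts are made up. The only thing carried over from the code above is the pattern of reading each result's fields from its `.d_` attribute, which is assumed here to be a dictionary.

```python
import pandas as pd
from collections import namedtuple

# Hypothetical stand-in for a search result: an object whose `d_` attribute
# holds the post's fields as a dictionary (an assumption for this sketch).
FakeSubmission = namedtuple("FakeSubmission", ["d_"])

def fake_generator():
    """Yield made-up posts one at a time, the way a generator does."""
    yield FakeSubmission({"title": "First made-up post", "score": 2500})
    yield FakeSubmission({"title": "Second made-up post", "score": 3100})
    yield FakeSubmission({"title": "Third made-up post", "score": 2050})

# Same pattern as the cell above: one dictionary per post, one row per dictionary.
example_df = pd.DataFrame([submission.d_ for submission in fake_generator()])
print(example_df)
```

A Python generator can only be consumed once, so to rerun the real collection you would need to recreate `api_request_generator` first.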

In [None]:
aita_submissions

To get a quick peek at the data, we can look at 10 random rows, showing only the "title" and "score" (upvote score) columns.

In [None]:
aita_submissions[['title', 'score']].sample(10)

## Pick Another Subreddit

Now it's your turn to collect Reddit posts from a different subreddit. Pick a new subreddit and insert it below, then run the cells below to collect Reddit data.

*If you need help picking a subreddit, check the [right side bar of the Reddit home page](https://www.reddit.com/) or the [top growing communities](https://www.reddit.com/subreddits/leaderboard/).* 

In [None]:
api_request_generator = api.search_submissions(subreddit='#Your Choice Here',
                                               score = ">2000")

Once this generator is set up, we can use it to collect Reddit posts. The code below is a list comprehension that loops through the generator and extracts relevant data for each matching Reddit post. It then turns that list into a Pandas DataFrame.

In [None]:
reddit_df = pd.DataFrame([submission.d_ for submission in api_request_generator])

*The cell above should take a while to run. It's searching through Reddit's entire history. It's ok if you periodically get errors while it's running.*

Sort the data to examine the top 10 posts in this subreddit with the highest upvote scores. Only examine the columns "title," "score," and "selftext."

In [None]:
# Your code here
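If you get stuck, one possible approach is sketched below on a small made-up DataFrame. `toy_df` and its contents are invented for illustration; the sketch assumes your collected data has "title", "score", and "selftext" columns, as the exercise implies. It relies on pandas' `sort_values` and `head`.

```python
import pandas as pd

# Made-up data standing in for the collected posts.
toy_df = pd.DataFrame({
    "title": ["Post A", "Post B", "Post C", "Post D"],
    "score": [2400, 5100, 3300, 2900],
    "selftext": ["Text A", "Text B", "Text C", "Text D"],
    "num_comments": [10, 42, 7, 19],
})

# Keep only the three columns of interest, sort from highest to lowest score,
# and take the first 10 rows (fewer if the DataFrame is shorter).
top_posts = (
    toy_df[["title", "score", "selftext"]]
    .sort_values(by="score", ascending=False)
    .head(10)
)
print(top_posts)
```

Replacing `toy_df` with `reddit_df` should apply the same steps to your own data.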

## Collect Reddit Posts (By Keyword)

What are the top subreddits for a given keyword? To search by a keyword, you can add `q=` and insert a query phrase. Pick a query phrase and search for it across Reddit by running the cells below.

In [None]:
api_request_generator = api.search_submissions(q='#Your Choice Here', score = ">2000")

Once this generator is set up, we can use it to collect Reddit posts. The code below is a list comprehension that loops through the generator and extracts relevant data for each matching Reddit post. It then turns that list into a Pandas DataFrame.

In [None]:
reddit_df = pd.DataFrame([submission.d_ for submission in api_request_generator])

*The cell above should take a while to run. It's searching through Reddit's entire history. It's ok if you periodically get errors while it's running.*

Now make a plot of the top 5 subreddits for the keyword that you chose.

In [None]:
# Your code here
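As a hint, the counting step that comes before the plot could look like the sketch below. The data is made up, and the sketch assumes the collected DataFrame has a "subreddit" column naming the community each post came from; check `reddit_df.columns` to confirm before relying on it.

```python
import pandas as pd

# Made-up data: one row per post, with the subreddit it was posted in.
toy_df = pd.DataFrame({
    "subreddit": (
        ["AskReddit"] * 6 + ["news"] * 5 + ["science"] * 4
        + ["books"] * 3 + ["movies"] * 2 + ["history"] * 1
    )
})

# Count posts per subreddit (sorted from most to fewest) and keep the top 5.
top_subreddits = toy_df["subreddit"].value_counts().head(5)
print(top_subreddits)
```

Calling `top_subreddits.plot(kind='bar')` on the result would draw a bar chart, provided matplotlib is installed.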




