# Scraping Submissions

In [None]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [None]:
# The Pushshift API is an "endpoint" API, i.e. we connect to a URL
# to request specific data. The parameters passed at the end of the URL specify
# what data gets requested.
nfl_url = 'https://api.pushshift.io/reddit/search/submission/?subreddit=nfl&fields=title,selftext&limit=100'

# the base url comes in 2 varieties:
## submissions
base_post_url = 'https://api.pushshift.io/reddit/search/submission/'
## comments
base_comment_url = 'https://api.pushshift.io/reddit/search/comment/'
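
To make the "parameters at the end of the link" idea concrete, here is a small sketch that splits `nfl_url` into its endpoint and its query parameters using only the standard library. Nothing is sent over the network; the URL is the one defined in the cell above.

```python
from urllib.parse import urlsplit, parse_qs

nfl_url = 'https://api.pushshift.io/reddit/search/submission/?subreddit=nfl&fields=title,selftext&limit=100'

parts = urlsplit(nfl_url)

# everything before the "?" identifies the endpoint
endpoint = parts.scheme + '://' + parts.netloc + parts.path

# everything after the "?" is the list of parameters (each value comes back as a list)
query_params = parse_qs(parts.query)

print(endpoint)
print(query_params)
```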

In [None]:
req = requests.get(
                base_post_url,
                params = {'subreddit':'datascience', # specifies subreddit to search
                          'fields':'title',
                          'limit':'100' # maximum number of posts to return
                         }
                )


req.status_code, req.url

In [None]:
dsjson = req.json()['data']

In [None]:
dsjson[0]

In [None]:
# print each title followed by a few blank lines
for post in dsjson:
    print("Title: " + post['title'])
    print("\n\n")
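
Since each element of `dsjson` is a dictionary, the list can be handed straight to pandas. The sketch below uses a small hand-written stand-in (`sample_posts`, a hypothetical name with made-up titles) instead of a live response, so it runs without calling the API.

```python
import pandas as pd

# stand-in for req.json()['data']: a list of dicts, one per post (made-up titles)
sample_posts = [
    {'title': 'First example post'},
    {'title': 'Second example post'},
    {'title': 'Third example post'},
]

# each dict becomes a row, each key becomes a column
posts_df = pd.DataFrame(sample_posts)

print(posts_df.shape)
print(posts_df['title'].tolist())
```

With a real response, `pd.DataFrame(dsjson)` should behave the same way.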


<br>

*remark:* ```Pushshift.io``` uses comma-separated values to pass multiple values to a single parameter. For example, say we want both the title and the self-text content of a post. Then the URL we would need to request is:

```desired_url = 'https://api.pushshift.io/reddit/search/submission/?subreddit=datascience&fields=title,selftext&limit=100'```

Notice that the URL contains ```fields=title,selftext```, where the comma ```,``` separates multiple values.

However, if we try to pass multiple values this way through ```requests.get()```, notice what happens:

In [None]:
attempt = requests.get(
                base_post_url,
                params = {'subreddit':'datascience',
                          'fields':'title,selftext', # try to pass a comma
                          'limit':'100'
                         }
                )


attempt.status_code, attempt.url

In [None]:
dsattempt = attempt.json()['data']

In [None]:
dsattempt[0]

- We don't get anything back! This is because ```requests.get()``` percent-encodes reserved characters in parameter values (following RFC 3986), so the comma ```,``` becomes ```%2C```, which we can see in the URL that gets returned.
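
The encoding step can be reproduced without making a request. The sketch below uses `urllib.parse.urlencode` from the standard library as a stand-in for what `requests` does to the `params` dictionary; treat it as an illustration of percent-encoding rather than a trace of the library's internals.

```python
from urllib.parse import urlencode, unquote

params = {'subreddit': 'datascience',
          'fields': 'title,selftext',
          'limit': '100'}

# by default every reserved character in a value is percent-encoded
encoded = urlencode(params)
print(encoded)

# decoding shows that %2C is just the escaped form of the comma
print(unquote(encoded))
```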

<br> 

So how do we get around this? We could try passing ```title``` and ```selftext``` as separate values via a tuple:

In [None]:
attempt2 = requests.get(
                base_post_url,
                params = {'subreddit':'datascience',
                          'fields':('title','selftext'), # pass the fields as a tuple instead of one comma-separated string
                          'limit':'100'
                         }
                )


attempt2.status_code, attempt2.url

In [None]:
dsattempt2 = attempt2.json()['data']

In [None]:
dsattempt2[0]

- This looks OK. Let's check out the rest of the posts it grabbed:

In [None]:
# print each title and its self-text, separated by blank lines
for post in dsattempt2:
    print("Title: " + post['title'])
    print("\n")
    print(post['selftext'])
    print("\n\n\n")
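
To see why the tuple behaves differently, it helps to compare the query string each approach produces. The sketch below again uses `urllib.parse.urlencode` as a stand-in for the encoding `requests` performs (with `doseq=True` so that a tuple is expanded into repeated parameters). The `safe=','` variant is an alternative that was not used above and is shown only for comparison.

```python
from urllib.parse import urlencode

tuple_params = {'subreddit': 'datascience',
                'fields': ('title', 'selftext'),
                'limit': '100'}

# doseq=True expands the tuple into one "fields=..." pair per value
repeated = urlencode(tuple_params, doseq=True)
print(repeated)

comma_params = {'subreddit': 'datascience',
                'fields': 'title,selftext',
                'limit': '100'}

# safe=',' tells urlencode to leave the comma unescaped
literal_comma = urlencode(comma_params, safe=',')
print(literal_comma)
```

If `requests` is given a string for `params` it should use it as written, so `requests.get(base_post_url, params=literal_comma)` is one possible way to keep the comma; treat that as an assumption to verify against the API.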

In [None]:
dsjson

---

# Pushshift.io Search Parameters

## Submission


| Parameter | Description | Default | Accepted Values|
|----------|-------------|---------|----------------|
| ids | get posts by their ids | N/A | comma-separated base36 ids |
| q | Search term; will search ALL possible fields| N/A | string; quoted string for phrases|
| q:not | Exclude search term | N/A | string; quoted string for phrases|
| title | searches title for term | N/A | string; quoted string for phrases|
| title:not | exclude terms for title field | N/A | string; quoted string for phrases|
| selftext | searches through body of the post | N/A | string; quoted string for phrases|
| selftext:not | excludes terms from body of the post | N/A | string; quoted string for phrases|
| size | Number of results to return | 25 | integer <= 500 |
| fields | return only specific fields (comma-separated) | All Fields | string or comma-separated string |
| sort | sort results in a specific order | "desc" | "asc" or "desc" |
| sort_type | sort by a specific attribute | "created_utc" (time created) | "score", "num_comments", "created_utc" |
| aggs | return aggregation summary | N/A | \[ "author", "link_id", "created_utc", "subreddit"\]|
| author | author/creator of content | N/A | string; quoted string for phrases|
| subreddit | search specific subreddit | N/A | string; quoted string for phrases|
| after | return results after this date | N/A | Epoch value or integer + "s,m,h,d" (e.g. 30d for 30 days)|
| before | return results before this date | N/A | Epoch value or integer + "s,m,h,d" (e.g. 30d for 30 days)|
| score | restrict search based on score (upvotes) | N/A | integer, or > x or < x (e.g. score=>100 or score=<25)|
| num_comments | restrict based on number of comments | N/A | integer, or > x or < x (e.g. num_comments=>100 or num_comments=<25)|
| over_18 | restrict to nsfw or sfw | both allowed | "true" or "false" |
| is_video | restrict to video content | both allowed | "true" or "false" |
| locked | return locked or unlocked threads | both allowed | "true" or "false"|
| stickied | return stickied or unstickied content | both allowed | "true" or "false"|
| spoiler | exclude or include spoilers only | both allowed | "true" or "false"|
| contest_mode | exclude or include contest mode submissions | both allowed | "true" or "false"|
| frequency | used with the aggs parameter when set to created_utc | N/A | "second", "minute", "hour", "day"|
| metadata | display metadata about the query | false | "true" or "false" |
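
As a sketch of how the parameters in the table combine, the helper below assembles a submission search URL. `build_submission_query` is a hypothetical helper, not part of Pushshift or `requests`, and the search term and filter values are made up. The only rule it enforces is the `size` limit listed in the table.

```python
from urllib.parse import urlencode

base_post_url = 'https://api.pushshift.io/reddit/search/submission/'

def build_submission_query(subreddit, size=25, **filters):
    """Return a full submission search URL (hypothetical helper)."""
    # the table lists 500 as the largest accepted size
    if size > 500:
        raise ValueError('size must be <= 500')
    params = {'subreddit': subreddit, 'size': size}
    params.update(filters)
    # keep commas literal so comma-separated values such as fields survive
    return base_post_url + '?' + urlencode(params, safe=',')

url = build_submission_query(
    'datascience',
    size=50,
    q='pandas',              # made-up search term
    after='30d',             # only posts from the last 30 days
    sort_type='score',
    sort='desc',
    fields='title,selftext',
)
print(url)
```

Building the URL separately from sending it makes it easy to inspect exactly what would be requested before any call is made.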


--- 

## List of Endpoints

| Endpoint | Description | Status |
|---|---|---|
| /reddit/search/comment | search reddit comments | active |
| /reddit/search/submission | search reddit submissions | active |
| /reddit/submission/comment_ids/{base36-submission-id} | retrieve comments for a submission object | active |
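
The third endpoint takes the submission id as part of the path rather than as a query parameter. A minimal sketch of building that URL follows; `comment_ids_url` and the id `'abc123'` are hypothetical and used only for illustration.

```python
API_ROOT = 'https://api.pushshift.io'

# endpoint paths copied from the table above
ENDPOINTS = {
    'comment_search': '/reddit/search/comment',
    'submission_search': '/reddit/search/submission',
    'comment_ids': '/reddit/submission/comment_ids/{submission_id}',
}

def comment_ids_url(submission_id):
    """Build the URL listing comment ids for one submission (hypothetical helper)."""
    path = ENDPOINTS['comment_ids'].format(submission_id=submission_id)
    return API_ROOT + path

print(comment_ids_url('abc123'))
```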