# Scraping Submissions

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [3]:
# Pushshift is an "endpoint" API, i.e. we request data by connecting to a URL.
# The query parameters appended to the end of the URL specify
# which data gets returned.
nfl_url = 'https://api.pushshift.io/reddit/search/submission/?subreddit=nfl&fields=title,selftext&limit=100'

# the base url comes in 2 varieties:
## submissions
base_post_url = 'https://api.pushshift.io/reddit/search/submission/'

## comments
base_comment_url = 'https://api.pushshift.io/reddit/search/comment/'

In [4]:
req = requests.get(
                base_post_url,
                params = {'subreddit':'datascience', # subreddit to search
                          'fields':'title', # only return the title of each post
                          'limit':'100' # maximum number of posts to return
                         }
                )


req.status_code, req.url

(200,
 'https://api.pushshift.io/reddit/search/submission/?subreddit=datascience&fields=title&limit=100')

In [5]:
dsjson = req.json()['data']

In [6]:
dsjson[0]

{'title': 'Text analytics -- anybody do reading levels?'}

In [7]:
for post in dsjson:
    print("Title: " + post['title'])
    print(" ")
    print(" ")
    print(" ")

Title: Text analytics -- anybody do reading levels?
 
 
 
Title: Suggestions on Courses in Masters Program?
 
 
 
Title: Is on-call going to become a responsibility for data science?
 
 
 
Title: Like the term 'Big Data' which terms in data science are 'outdated'?
 
 
 
Title: Best path to learn Data Analytics
 
 
 
Title: Resources to learn time-series forecasting
 
 
 
Title: How to use ARIMA to forecast for multi step in future ?
 
 
 
Title: Data Science Internships
 
 
 
Title: Help me solve this conundrum.
 
 
 
Title: Data Council - Austin
 
 
 
Title: Choosing the higher education route over bootcamps but unsure of what subject to study at BSc and MSc level.
 
 
 
Title: Why it’s Super Hard to be an ML Researcher or Developer?
 
 
 
Title: Why it’s Super Hard to be an ML Researcher or Developer?
 
 
 
Title: Need help.
 
 
 
Title: Fiducial Probability
 
 
 
Title: I've been a remote Data Analyst for one year. Going into office first time next week. Tell me some cool BI/Data Sc


<br>

*Remark:* ```Pushshift.io``` uses comma-separated values to pass multiple values to a single parameter. For example, say we want both the title and the self-text of each post. The URL we need to request is then:

```desired_url = 'https://api.pushshift.io/reddit/search/submission/?subreddit=datascience&fields=title,selftext&limit=100'```

Notice that the URL contains ```fields=title,selftext```, where the comma ```,``` separates the values.

However, if we try to pass multiple values this way through ```requests.get()```, notice what happens:

In [8]:
attempt = requests.get(
                base_post_url,
                params = {'subreddit':'datascience',
                          'fields':'title,selftext', # try to pass a comma
                          'limit':'100'
                         }
                )


attempt.status_code, attempt.url

(200,
 'https://api.pushshift.io/reddit/search/submission/?subreddit=datascience&fields=title%2Cselftext&limit=100')

In [9]:
dsattempt = attempt.json()['data']

In [10]:
dsattempt[0]

{}

- We don't get anything back! This is because ```requests.get()``` percent-encodes the parameter values (following the RFC 3986 standard), so the comma ```,``` is sent as ```%2C```, as we can see in the URL that gets returned.
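
The encoding step can be reproduced offline with the standard library. The sketch below uses `urllib.parse.urlencode` as a stand-in for what `requests` does to a `params` dictionary; that equivalence is an assumption, although it produces the same query string as the URL printed above.

```python
from urllib.parse import urlencode, unquote

params = {'subreddit': 'datascience', 'fields': 'title,selftext', 'limit': '100'}

# urlencode percent-encodes reserved characters in the values,
# so the comma in 'title,selftext' becomes %2C
encoded = urlencode(params)
print(encoded)  # subreddit=datascience&fields=title%2Cselftext&limit=100

# Decoding the query string recovers the original comma
print(unquote(encoded))  # subreddit=datascience&fields=title,selftext&limit=100
```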

<br> 

So how do we get around this? We could try passing ```title``` and ```selftext``` as separate values via a tuple:

In [11]:
attempt2 = requests.get(
                base_post_url,
                params = {'subreddit':'datascience',
                          'fields':('title','selftext'), # pass the fields as a tuple instead of a comma-separated string
                          'limit':'100'
                         }
                )


attempt2.status_code, attempt2.url

(200,
 'https://api.pushshift.io/reddit/search/submission/?subreddit=datascience&fields=title&fields=selftext&limit=100')

In [12]:
dsattempt2 = attempt2.json()['data']

In [13]:
dsattempt2[0]

{'selftext': '&amp;#x200B;\n\nI\'ve been doing text analytics for awhile, and it just occurred to me -- why aren\'t people doing reading levels (e.g., "This text is at an 8-th grade level").    I\'ve seen this done other places, but haven\'t come across documentation with my normal tools (R, Python, SAS) for text analytics packages that will return this, similar to the way you can do sentiment analysis.\n\nAnybody got any pointers?    R packages would be my preferred idiom, but I can use other tools.',
 'title': 'Text analytics -- anybody do reading levels?'}

- This looks OK. Let's check out the rest of the posts it grabbed:

In [14]:
for post in dsattempt2:
    print("Title: " + post['title'])
    print(" ")
    print(" ")
    print(post['selftext'])
    print(" ")
    print(" ")
    print(" ")
    print(" ")

Title: Text analytics -- anybody do reading levels?
 
 
&amp;#x200B;

I've been doing text analytics for awhile, and it just occurred to me -- why aren't people doing reading levels (e.g., "This text is at an 8-th grade level").    I've seen this done other places, but haven't come across documentation with my normal tools (R, Python, SAS) for text analytics packages that will return this, similar to the way you can do sentiment analysis.

Anybody got any pointers?    R packages would be my preferred idiom, but I can use other tools.
 
 
 
 
Title: Suggestions on Courses in Masters Program?
 
 
I was recently accepted into a Masters program and I have agreed to attend. Which of the following Majors and possibly other electives on the below list would you recommend for me to gain the most value from the program? (I give a quick bio of myself below, so "value" can be judged more appropriately)

 [Master of Data Science, UC, 2022 Spring: Class Descriptions - Northwestern University](https
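
The first post's `selftext` begins with `&amp;#x200B;`, which looks like a doubly escaped zero-width space. Below is a minimal sketch of one way to tidy such text with the standard library's `html.unescape`. The helper name `clean_selftext` is hypothetical, and unescaping twice is an assumption based on this single sample; text that is meant to contain a literal `&amp;` would be altered by it.

```python
import html

def clean_selftext(text):
    """Hypothetical helper: undo HTML escaping and drop zero-width spaces."""
    # Unescape twice: '&amp;#x200B;' -> '&#x200B;' -> '\u200b'
    unescaped = html.unescape(html.unescape(text))
    # Remove the zero-width space and surrounding whitespace
    return unescaped.replace('\u200b', '').strip()

sample = "&amp;#x200B;\n\nI've been doing text analytics for awhile"
print(clean_selftext(sample))
```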

In [15]:
dsjson

[{'title': 'Text analytics -- anybody do reading levels?'},
 {'title': 'Suggestions on Courses in Masters Program?'},
 {'title': 'Is on-call going to become a responsibility for data science?'},
 {'title': "Like the term 'Big Data' which terms in data science are 'outdated'?"},
 {'title': 'Best path to learn Data Analytics'},
 {'title': 'Resources to learn time-series forecasting'},
 {'title': 'How to use ARIMA to forecast for multi step in future ?'},
 {'title': 'Data Science Internships'},
 {'title': 'Help me solve this conundrum.'},
 {'title': 'Data Council - Austin'},
 {'title': 'Choosing the higher education route over bootcamps but unsure of what subject to study at BSc and MSc level.'},
 {'title': 'Why it’s Super Hard to be an ML Researcher or Developer?'},
 {'title': 'Why it’s Super Hard to be an ML Researcher or Developer?'},
 {'title': 'Need help.'},
 {'title': 'Fiducial Probability'},
 {'title': "I've been a remote Data Analyst for one year. Going into office first time next

---

# Pushshift.io Search Parameters

## Submission


| Parameter | Description | Default | Accepted Values|
|----------|-------------|---------|----------------|
| ids | get posts by their ids | N/A | comma-separated base36 ids |
| q | search term; searches ALL possible fields | N/A | string; quoted string for phrases|
| q:not | Exclude search term | N/A | string; quoted string for phrases|
| title | searches title for term | N/A | string; quoted string for phrases|
| title:not | exclude terms for title field | N/A | string; quoted string for phrases|
| selftext | searches through body of the post | N/A | string; quoted string for phrases|
| selftext:not | excludes terms from body of the post | N/A | string; quoted string for phrases|
| size | Number of results to return | 25 | integer <= 500 |
| fields | only return specific fields (comma separated) | All Fields | string or comma-separated string |
| sort | sort results in a specific order | "desc" | "asc" or "desc" |
| sort_type | sort by a specific attribute | "created_utc" (time created) | "score", "num_comments", "created_utc" |
| aggs | return aggregation summary | N/A | \[ "author", "link_id", "created_utc", "subreddit"\]|
| author | author/creator of content | N/A | string; quoted string for phrases|
| subreddit | search specific subreddit | N/A | string; quoted string for phrases|
| after | return results after this date | N/A | Epoch value or integer + "s,m,h,d" (e.g. 30d for 30 days)|
| before | return results before this date | N/A | Epoch value or integer + "s,m,h,d" (e.g. 30d for 30 days)|
| score | restrict search based on score (upvotes) | N/A | integer or > x or < x (e.g. score=>100 or score=<25)|
| num_comments | restrict based on number of comments | N/A | integer or > x or < x (e.g. num_comments=>100 or num_comments=<25)|
| over_18 | restrict to nsfw or sfw | both allowed | "true" or "false" |
| is_video | restrict to video content | both allowed | "true" or "false" |
| locked | return locked or unlocked threads | both allowed | "true" or "false"|
| stickied | return stickied or unstickied content | both allowed | "true" or "false"|
| spoiler | exclude or include spoilers only | both allowed | "true" or "false"|
| contest_mode | exclude or include contest mode submissions | both allowed | "true" or "false"|
| frequency | used with the aggs parameter when set to created_utc | N/A | "second", "minute", "hour", "day"|
| metadata | display metadata about the query | false | "true" or "false" |
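
As a rough illustration of how these parameters combine, the sketch below builds a submission search URL with the standard library. The helper `build_search_url` is hypothetical. It joins list values with commas and passes `safe=','` to `urlencode` so the commas stay literal, matching the comma-separated form the table describes. Whether the API accepts every combination shown here has not been checked.

```python
from urllib.parse import urlencode

BASE_POST_URL = 'https://api.pushshift.io/reddit/search/submission/'

def build_search_url(base_url, **params):
    """Hypothetical helper: build a query URL, keeping commas unencoded."""
    flat = {}
    for key, value in params.items():
        # Lists and tuples become one comma-separated value
        if isinstance(value, (list, tuple)):
            value = ','.join(str(v) for v in value)
        flat[key] = value
    # safe=',' stops urlencode from turning "," into "%2C"
    return base_url + '?' + urlencode(flat, safe=',')

url = build_search_url(
    BASE_POST_URL,
    subreddit='datascience',
    fields=['title', 'selftext'],
    size=100,
    after='30d',   # posts from the last 30 days
    sort='asc',
)
print(url)
```

The resulting string can be handed to `requests.get(url)` directly, with no `params` argument.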


--- 

## List of Endpoints

| Endpoint | Description | Status |
|---|---|---|
| /reddit/search/comment | search reddit comments | active |
| /reddit/search/submission | search reddit submissions | active |
| /reddit/submission/comment_ids/{base36-submission-id} | retrieve comments for a submission object | active |
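
The third endpoint takes the submission id as part of the path rather than as a query parameter. A small sketch of building that URL is shown below; the helper name `comment_ids_url` and the example id `abc123` are made up for illustration, and the check that ids are lowercase base36 is an assumption.

```python
import re

API_ROOT = 'https://api.pushshift.io'

def comment_ids_url(submission_id):
    """Hypothetical helper: URL listing the comment ids of one submission."""
    sid = submission_id.lower()
    # base36 ids use only the digits 0-9 and the letters a-z
    if not re.fullmatch(r'[0-9a-z]+', sid):
        raise ValueError(f'not a base36 id: {submission_id!r}')
    return f'{API_ROOT}/reddit/submission/comment_ids/{sid}'

print(comment_ids_url('abc123'))  # 'abc123' is a made-up id
```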