# Data Collection from Reddit using the Pushshift API

<a href=https://github.com/pushshift/api>Here</a> is a general introduction to Pushshift, which you can use to get Reddit data. 

<h1>Creating URLs as Queries</h1>

As you can see on the Pushshift GitHub page, queries take the form of URLs. This URL returns comments mentioning 'science':

https://api.pushshift.io/reddit/search/comment/?q=science

Notice that it specifies the query term like this: "?q=science"

You can run these queries directly in your browser (try it!).
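
The same query string can also be assembled in Python rather than typed by hand. Below is a minimal sketch using the standard library's `urllib.parse.urlencode`. The helper name `build_query_url` is hypothetical, and only the `q` parameter shown above is assumed.

```python
from urllib.parse import urlencode

BASE_URL = 'https://api.pushshift.io/reddit/search/comment/'

def build_query_url(**params):
    """Append the given parameters to the comment-search endpoint as a query string."""
    return BASE_URL + '?' + urlencode(params)

url = build_query_url(q='science')
print(url)  # https://api.pushshift.io/reddit/search/comment/?q=science
```

Building the URL from keyword arguments keeps the endpoint and the query terms separate, so changing the search term does not require editing the string by hand.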






<h2>Question 0</h2>

How many comments get returned by the above query?


25

<h2>Question 1</h2>

Give the URL for the following query:


Recent comments mentioning "Støjberg"

https://api.pushshift.io/reddit/search/comment/?q=St%C3%B8jberg
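
The `%C3%B8` in this URL is the percent-encoded form of "ø": non-ASCII characters are written as their UTF-8 bytes, each prefixed with `%`. As a small sketch, the standard library's `urllib.parse.quote` produces the same encoding. The variable names here are illustrative.

```python
from urllib.parse import quote

term = 'Støjberg'
encoded = quote(term)  # percent-encodes the UTF-8 bytes of non-ASCII characters
url = 'https://api.pushshift.io/reddit/search/comment/?q=' + encoded
print(url)  # https://api.pushshift.io/reddit/search/comment/?q=St%C3%B8jberg
```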

<h2>Question 2</h2>

What subreddits do these comments come from?

"newsdk", "denmark", "politics", and "scandinavia"

<h2>Question 3</h2>

Give the URL for this query: 

recent comments mentioning "Gamestop".

https://api.pushshift.io/reddit/search/comment/?q=Gamestop

<h2>Question 4</h2>

What subreddits do these comments come from?

"GME", "SNDL", "deadcells", "wallstreetbets", "PokemonSwordAndShield", "Destiny", "SPACs", "NintendoSwitch", "WallStreetWin", "WallStreetbetsELITE", "PokemonTCG", "RedditSessions", and "wallstreetbetsOGs"
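
Reading subreddit names off the browser output by hand is error-prone, so here is a sketch of how the list could be collected programmatically. It assumes the response's `data` field is a list of dicts that each carry a `subreddit` key, as the Python section later in this notebook shows. The function name `unique_subreddits` and the sample records are hypothetical.

```python
def unique_subreddits(comments):
    """Return the sorted, de-duplicated subreddit names from a list of comment dicts."""
    return sorted({c['subreddit'] for c in comments})

# Hypothetical stand-in for the 'data' list of a response
sample = [
    {'subreddit': 'wallstreetbets', 'body': '...'},
    {'subreddit': 'GME', 'body': '...'},
    {'subreddit': 'wallstreetbets', 'body': '...'},
]
print(unique_subreddits(sample))  # ['GME', 'wallstreetbets']
```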

<h1>Accessing Pushshift in Python</h1>

To access Pushshift in Python, we can use <a href=https://requests.readthedocs.io/en/master/>requests</a>. You can use the <b>get</b> method to send a URL string, instead of typing it into the browser. 
<p>
    Pushshift returns <a href=https://www.json.org/json-en.html>JSON</a>, which <b>json.loads</b> parses into a Python <a href=https://www.w3schools.com/python/python_dictionaries.asp>dict</a>.
    Below is an example.

In [2]:
import requests 
import json 

In [7]:
url = 'https://api.pushshift.io/reddit/search/comment/?q=science'
r = requests.get(url)
data = json.loads(r.text)
d = data['data']
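
A slightly more defensive variant of the cell above, offered as a sketch rather than the required approach: passing `params` lets requests build and encode the query string, `raise_for_status` surfaces HTTP errors, and `r.json()` parses the body directly. The function name `fetch_comments` and the `timeout` value are illustrative choices, and the `get` argument exists only so the function can be exercised without network access.

```python
import requests

API_URL = 'https://api.pushshift.io/reddit/search/comment/'

def fetch_comments(query, get=requests.get, **extra_params):
    """Return the list of comment dicts for a search term."""
    params = {'q': query, **extra_params}
    r = get(API_URL, params=params, timeout=30)
    r.raise_for_status()     # raise on 4xx/5xx instead of parsing an error page
    return r.json()['data']  # same result as json.loads(r.text)['data']

# d = fetch_comments('science')  # uncomment to query the live API
```

Extra keyword arguments are forwarded as query parameters, so `fetch_comments('science', size=100)` would request up to 100 results.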


<h2>Question 5</h2>

Loop Through Query Results
<p>

The variable <b>d</b> is a list of <b>dict</b>s. You can examine each one using the <b>keys</b> method. Print the keys of the first element of <b>d</b>.

In [9]:
d[0].keys()

dict_keys(['all_awardings', 'associated_award', 'author', 'author_flair_background_color', 'author_flair_css_class', 'author_flair_richtext', 'author_flair_template_id', 'author_flair_text', 'author_flair_text_color', 'author_flair_type', 'author_fullname', 'author_patreon_flair', 'author_premium', 'awarders', 'body', 'collapsed_because_crowd_control', 'comment_type', 'created_utc', 'gildings', 'id', 'is_submitter', 'link_id', 'locked', 'no_follow', 'parent_id', 'permalink', 'retrieved_on', 'score', 'send_replies', 'stickied', 'subreddit', 'subreddit_id', 'top_awarded_type', 'total_awards_received', 'treatment_tags'])

<h2>Question 6</h2>

Loop Through Query Results
<p>

Use a loop to print the author and body of each element of <b>d</b>


In [14]:
# Dict values are accessed by key (d[0]['body']), not by attribute (d[0].body)
for comment in d:
    print(comment['author'], comment['body'])

<h2>Question 7</h2>

Create a new query in Python, getting the 100 most recent comments mentioning "Gamestop".


In [15]:
url = 'https://api.pushshift.io/reddit/search/comment/?q=Gamestop&size=100'
r = requests.get(url)
data = json.loads(r.text)
d = data['data']

<h2>Question 8</h2>

Convert the variable <b>d</b>, a list of dicts, to a Pandas dataframe, using <a href=https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.from_dict.html>from_dict</a>



In [17]:
import pandas as pd
df = pd.DataFrame.from_dict(d)
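
Each comment carries several dozen fields, so the resulting dataframe is wide. As a sketch, you might keep only the columns you plan to analyse. The records below are hypothetical stand-ins for the contents of `d`, and the column selection is just one possible choice.

```python
import pandas as pd

# Hypothetical records standing in for the list of dicts in d
records = [
    {'author': 'user_a', 'subreddit': 'GME', 'body': 'first comment', 'score': 3},
    {'author': 'user_b', 'subreddit': 'investing', 'body': 'second comment', 'score': 1},
]

df_small = pd.DataFrame(records)                      # the constructor also accepts a list of dicts
df_small = df_small[['author', 'subreddit', 'body']]  # keep only the columns of interest
print(df_small.shape)  # (2, 3)
```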

<h2>Question 9</h2>

How many different subreddits occur in your data? Use the <a href=https://kanoki.org/2020/03/09/how-to-use-pandas-count-and-value_counts/>value_counts</a> method to see how many times each subreddit occurs. (You need to apply value_counts to the subreddit column.)




In [21]:
df['subreddit'].value_counts()

wallstreetbets          17
GME                     14
WallStreetWin           12
Wallstreetbetsnew        9
PokemonTCG               4
investing                3
PKMNTCGDeals             2
mauerstrassenwetten      2
amcstock                 2
NintendoSwitch           2
lotrmemes                1
WANDAVISION              1
NEO                      1
nfl                      1
argentina                1
PoliticalHumor           1
wallstreetbets2          1
Persona5                 1
soccer                   1
conspiracy               1
gaming                   1
PS5                      1
sadcringe                1
PS4                      1
todayilearned            1
gme_meltdown             1
ethfinance               1
smallstreetbets          1
CryptoCurrency           1
AdamCurtis               1
eos                      1
Bullion                  1
PewdiepieSubmissions     1
PokemonHome              1
funny                    1
dogecoin                 1
jovemnerd                1
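
The output of `value_counts` has one row per subreddit, so the number of different subreddits is its length. A sketch on a small hypothetical column (standing in for `df['subreddit']`) shows that `nunique` gives the same number directly.

```python
import pandas as pd

# Hypothetical sample standing in for df['subreddit']
subreddits = pd.Series(['GME', 'wallstreetbets', 'GME', 'investing'], name='subreddit')

counts = subreddits.value_counts()  # occurrences per subreddit, most frequent first
n_distinct = subreddits.nunique()   # number of different subreddits
print(n_distinct, len(counts))      # 3 3
```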