In [1]:
# Import libraries
import requests
import pandas as pd
import time

We will use the PushShift API to scrape Reddit posts.

In [3]:
url = "https://api.pushshift.io/reddit/search/submission"

In [5]:
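# `params` is not defined in the cells shown; this minimal query is an assumption.
res = requests.get(url, params={'subreddit': 'news', 'size': 100})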

A status code of 200 means the request was successful.

In [6]:
res.status_code

200

The function 'scrape_posts' requests 100 posts at a time (to stay within PushShift's limits) from the subreddit of your choice, drops the posts that have not received enough attention (measured by comment count), and adds each remaining post's title and label to a DataFrame. Once the specified number of posts has been collected, the function returns the combined DataFrame. Filtering out posts that have not received attention is important because PushShift gathers *every* post, even the ones immediately deleted by moderators.

In [23]:
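def scrape_posts(sub, is_consp, post_count, min_attn):
    """Collect posts from `sub` with at least `min_attn` comments, labeled `is_consp`."""
    df_list = []
    params = {
        'subreddit': sub,
        'size': 100,                        # posts per request
        'before': str(round(time.time()))   # start from the current time
    }
    posts_scraped = 0
    while posts_scraped <= post_count:
        res = requests.get(url, params=params)
        res.raise_for_status()  # stop on a failed request instead of parsing an error body
        data = res.json()['data']
        if not data:
            break  # no older posts left; avoids looping forever

        df = pd.DataFrame({'post': [post['title'] for post in data],
                           'time': [post['created_utc'] for post in data],
                           'comments': [post['num_comments'] for post in data],
                           'is_conspiracy': is_consp})

        # Page backwards: the next request returns posts older than this batch.
        params['before'] = int(df['time'].min())

        # Keep only posts with enough comments.
        df = df[df['comments'] >= min_attn]

        posts_scraped += df.shape[0]
        df_list.append(df)

    # Combine the batches, keep only title and label, and drop repeated titles.
    df_full = pd.concat(df_list, ignore_index=True)
    df_full.drop(columns=['time', 'comments'], inplace=True)
    df_full.drop_duplicates(subset='post', inplace=True, ignore_index=True)
    return df_full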
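
The paging logic in 'scrape_posts' can be tried without network access. The sketch below is only an illustration: `FAKE_POSTS`, `fake_fetch` and `paginate` are hypothetical names that stand in for the API and the loop, and the toy data is invented. It shows how moving the `before` cursor to the oldest timestamp in each batch walks backwards through the posts, and how the comment filter is applied per batch.

```python
# Hypothetical in-memory stand-in for the API: 250 fake posts, newest first.
FAKE_POSTS = [
    {"title": f"post {i}", "created_utc": 1_000_000 - i, "num_comments": i % 20}
    for i in range(250)
]


def fake_fetch(before, size=100):
    """Return up to `size` posts strictly older than `before`, newest first."""
    older = [p for p in FAKE_POSTS if p["created_utc"] < before]
    return older[:size]


def paginate(fetch, start, target, min_attn):
    """Collect posts with >= min_attn comments until `target` is reached or data runs out."""
    kept = []
    before = start
    while len(kept) < target:
        batch = fetch(before)
        if not batch:
            break  # nothing older is available
        # Move the cursor to the oldest post seen so far.
        before = min(p["created_utc"] for p in batch)
        kept.extend(p for p in batch if p["num_comments"] >= min_attn)
    return kept


posts = paginate(fake_fetch, start=1_000_001, target=60, min_attn=10)
print(len(posts))
```

With this toy data the loop stops at 100 posts even though only 60 were requested, because each filtered batch is added whole. The same effect would let the real function return somewhat more rows than the requested count.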

Let's gather about 1500 'r/news' posts that have each received at least 10 comments into a DataFrame.

In [26]:
news_df = scrape_posts('news',0,1500,10)

In [28]:
news_df.shape

(1447, 2)
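
The shape above is smaller than the 1500 posts requested. One plausible reason is that 'scrape_posts' drops duplicate titles only after the loop has finished counting, so reposts reduce the final row count. The toy frames below are invented for illustration and show that effect in isolation.

```python
import pandas as pd

# Two hypothetical batches that share the titles "A" and "C".
batches = [
    pd.DataFrame({"post": ["A", "B", "C"], "is_conspiracy": 0}),
    pd.DataFrame({"post": ["C", "D", "A"], "is_conspiracy": 0}),
]

combined = pd.concat(batches, ignore_index=True)

# Same call as in scrape_posts, without inplace so both frames can be compared.
deduped = combined.drop_duplicates(subset="post", ignore_index=True)

print(combined.shape, deduped.shape)
```

Six rows were counted during collection, but only four unique titles survive.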

Let's gather about 1500 'r/conspiracy' posts that have each received at least 10 comments into a DataFrame.

In [30]:
consp_df = scrape_posts('conspiracy',1,1500,10)

In [31]:
consp_df.shape

(1529, 2)

Let's concatenate these DataFrames and export the result as a CSV file.

In [32]:
df = pd.concat([news_df,consp_df],ignore_index=True)
df.shape

(2976, 2)

In [33]:
df.to_csv("../data/raw.csv",index=False)
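
After exporting, it can be worth reading the file back to confirm the columns and the label balance. The sketch below is a minimal round-trip check on an invented toy frame written to a temporary directory. The names `toy`, `loaded` and `counts` are hypothetical, and the real file would be read from "../data/raw.csv" instead.

```python
import os
import tempfile

import pandas as pd

# Invented stand-in for the combined DataFrame.
toy = pd.DataFrame({"post": ["a", "b", "c"], "is_conspiracy": [0, 0, 1]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "raw.csv")
    toy.to_csv(path, index=False)   # same export call as above
    loaded = pd.read_csv(path)      # read it back while the file still exists

# Count rows per label to check the class balance.
counts = {int(k): int(v) for k, v in loaded["is_conspiracy"].value_counts().items()}
print(loaded.shape, counts)
```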