## Reddit Web Scraping

In this workbook, we scrape two subreddits, r/AskHistorians and r/askscience, for use in our modelling process.

* [Groundwork](#sec_1)
* [Functions](#sec_2)
* [Web Scraping](#sec_3)
* [Data Cleaning](#sec_4)
* [Export Data](#sec_5)

## <a name="sec_1"></a>Groundwork

In [41]:
# We import the necessary libraries
import requests
import numpy as np  # needed for np.nan in create_df
import pandas as pd
import time
import random

In [64]:
# We set the user configs (custom User-agent sent with every request)
headers = {'User-agent': 'greg 0.2'}

## <a name="sec_2"></a>Functions

In [76]:
# Template loop used to scrape a subreddit (copied and adapted for each subreddit below)
posts = []  # REMEMBER TO CHANGE THIS
after = None
for _ in range(50):
    if after is None:
        params = {}
    else:
        params = {'after': after}
    url = 'https://www.reddit.com/r/CongratsLikeImFive/.json'  # CHANGE THIS URL
    res = requests.get(url, params=params, headers=headers)
    if res.status_code == 200:
        x = res.json()
        posts.extend(x['data']['children'])  # CHANGE POSTS
        after = x['data']['after']  # token for the next page of results
        if after is None:  # no more pages; stop rather than restart from the first page
            break
    else:
        print(res.status_code)
        break
    time.sleep(1)  # pause 1 second between requests to avoid hammering Reddit's servers
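
The same loop is repeated for each subreddit, so it could also be wrapped in a function. Below is a minimal sketch, not the notebook's actual code: `paginate` and `make_reddit_fetcher` are hypothetical names, and the page fetcher is passed in as an argument so that the pagination logic can be tried without network access.

```python
import time


def paginate(fetch_page, max_pages=50, delay=1.0):
    """Collect posts by following the 'after' token page by page.

    fetch_page(after) is a caller-supplied function (an assumption of this
    sketch) that returns the parsed listing JSON, or None on failure.
    """
    posts = []
    after = None
    for _ in range(max_pages):
        listing = fetch_page(after)
        if listing is None:
            break
        posts.extend(listing['data']['children'])
        after = listing['data']['after']
        if after is None:  # last page reached
            break
        time.sleep(delay)  # pause between requests
    return posts


def make_reddit_fetcher(subreddit, headers):
    """Build a fetch_page function for one subreddit (hypothetical helper)."""
    import requests  # imported here so paginate() can be used without it

    url = f'https://www.reddit.com/r/{subreddit}/.json'

    def fetch_page(after):
        params = {} if after is None else {'after': after}
        res = requests.get(url, params=params, headers=headers)
        if res.status_code != 200:
            print(res.status_code)
            return None
        return res.json()

    return fetch_page
```

With this in place, a call such as `paginate(make_reddit_fetcher('AskHistorians', headers))` would stand in for one copy of the loop.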


In [58]:
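def create_df(posts, df):
    # Copy each field listed in the global `cols` from every post into row i of df
    for i, p in enumerate(posts):
        p_keys = p['data'].keys()

        for col in cols:
            if col in p_keys:
                df.at[i, col] = p['data'][col]
            else:
                # Field missing from this post: store NaN and report it
                df.at[i, col] = np.nan
                print(i, col, 'nan')
    return df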
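
Filling a dataframe cell by cell with `df.at` works, but it can be slow for many rows. As an alternative sketch (not the method used in this notebook), the rows can be built as plain dictionaries first. `extract_rows` is a hypothetical helper name, the sample posts are made up, and missing fields are filled with `None` rather than `np.nan`.

```python
def extract_rows(posts, cols):
    """Return one dict per post containing only the requested columns."""
    rows = []
    for p in posts:
        data = p['data']
        # dict.get returns None when a field is absent from the post
        rows.append({col: data.get(col) for col in cols})
    return rows


# Small made-up sample in the same shape as Reddit's listing children
sample_posts = [
    {'data': {'subreddit': 'askscience', 'name': 't3_x1', 'title': 'A', 'score': 3}},
    {'data': {'subreddit': 'askscience', 'name': 't3_x2', 'title': 'B',
              'selftext': 'body', 'score': 7}},
]
sample_rows = extract_rows(sample_posts, ['name', 'selftext', 'score'])
print(sample_rows[0])
```

The resulting list can then be passed to `pd.DataFrame(rows, columns=cols)` to build the dataframe in one step.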

## <a name="sec_3"></a>Web Scraping

In [87]:
# Scrape the first subreddit (r/AskHistorians)
hist = []
after = None
for _ in range(50):
    if after is None:
        params = {}
    else:
        params = {'after': after}
    url = 'https://www.reddit.com/r/AskHistorians/.json'
    res = requests.get(url, params=params, headers=headers)
    if res.status_code == 200:
        x = res.json()
        hist.extend(x['data']['children'])
        after = x['data']['after']  # token for the next page of results
        if after is None:  # no more pages; stop rather than restart from the first page
            break
    else:
        print(res.status_code)
        break
    time.sleep(1)  # pause 1 second between requests to avoid hammering Reddit's servers

In [84]:
# Scrape the second subreddit (r/askscience)
science = []
after = None
for _ in range(50):
    if after is None:
        params = {}
    else:
        params = {'after': after}
    url = 'https://www.reddit.com/r/askscience/.json'
    res = requests.get(url, params=params, headers=headers)
    if res.status_code == 200:
        x = res.json()
        science.extend(x['data']['children'])
        after = x['data']['after']  # token for the next page of results
        if after is None:  # no more pages; stop rather than restart from the first page
            break
    else:
        print(res.status_code)
        break
    time.sleep(1)  # pause 1 second between requests to avoid hammering Reddit's servers

In [88]:
# Check the number of unique posts in each subreddit
print(len(set([p['data']['name'] for p in hist])))
print(len(set([p['data']['name'] for p in science])))

999
1000
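
To see how many duplicates were scraped, the unique count can be compared with the total number of collected posts. A small sketch on made-up post data; `duplicate_names` is a hypothetical helper and not part of the notebook.

```python
from collections import Counter


def duplicate_names(posts):
    """Return {name: count} for post names that occur more than once."""
    counts = Counter(p['data']['name'] for p in posts)
    return {name: n for name, n in counts.items() if n > 1}


# Made-up sample: t3_a appears twice
sample = [{'data': {'name': 't3_a'}},
          {'data': {'name': 't3_b'}},
          {'data': {'name': 't3_a'}}]

# Total posts versus unique post names
print(len(sample), len({p['data']['name'] for p in sample}))
print(duplicate_names(sample))
```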


## <a name="sec_4"></a>Data Cleaning

In [107]:
# We look at what data we need
hist[0]['data'].keys()

dict_keys(['approved_at_utc', 'subreddit', 'selftext', 'author_fullname', 'saved', 'mod_reason_title', 'gilded', 'clicked', 'title', 'link_flair_richtext', 'subreddit_name_prefixed', 'collections', 'hidden', 'pwls', 'link_flair_css_class', 'downs', 'top_awarded_type', 'hide_score', 'name', 'quarantine', 'link_flair_text_color', 'upvote_ratio', 'author_flair_background_color', 'subreddit_type', 'ups', 'total_awards_received', 'media_embed', 'author_flair_template_id', 'is_original_content', 'user_reports', 'secure_media', 'is_reddit_media_domain', 'is_meta', 'category', 'secure_media_embed', 'link_flair_text', 'can_mod_post', 'score', 'approved_by', 'is_created_from_ads_ui', 'author_premium', 'thumbnail', 'edited', 'author_flair_css_class', 'author_flair_richtext', 'gildings', 'content_categories', 'is_self', 'mod_note', 'created', 'link_flair_type', 'wls', 'removed_by_category', 'banned_by', 'author_flair_type', 'domain', 'allow_live_comments', 'selftext_html', 'likes', 'suggested_sort

In [119]:
# We list the columns we might want to keep
cols = ['subreddit', 'name', 'title', 'selftext', 'score']

In [120]:
# We create our dataframes with the columns

df_hist = pd.DataFrame(columns = cols)
df_science = pd.DataFrame(columns = cols)

In [121]:
# We then extract data to our dataframes
df_hist = create_df(hist, df_hist)
df_science = create_df(science, df_science)

In [122]:
# We look at our dataframe
df_hist.head()

| | subreddit | name | title | selftext | score |
|---|---|---|---|---|---|
| 0 | AskHistorians | t3_nyvt8e | Sunday Digest \| Interesting &amp; Overlooked P... | [Previous](https://www.reddit.com/r/AskHistori... | 28 |
| 1 | AskHistorians | t3_nvvd8w | Short Answers to Simple Questions \| June 09, 2021 | [Previous weeks!](https://www.reddit.com/r/Ask... | 29 |
| 2 | AskHistorians | t3_o0dn7y | In fairy tales, there is a popular trope of be... | | 3061 |
| 3 | AskHistorians | t3_o0w9hv | In the Middle Ages, were merchants allowed to ... | The author describes a wool merchant chopping ... | 40 |
| 4 | AskHistorians | t3_o0ts67 | Did English nobility get involved in the lives... | I’m watching Downton Abby for the first time. ... | 60 |


In [123]:
df_science.head()

| | subreddit | name | title | selftext | score |
|---|---|---|---|---|---|
| 0 | askscience | t3_l4yi0i | AskScience Panel of Scientists XXIV | **Please read this entire post carefully and f... | 267 |
| 1 | askscience | t3_o0bncs | AskScience AMA Series: We have 60+ years of ex... | "We" are part of [REN21](https://www.ren21.net... | 748 |
| 2 | askscience | t3_o0i9s0 | How deep can water be before the water at the ... | Let's assume the water is pure H20 (and not se... | 4230 |
| 3 | askscience | t3_nzy81l | If Hailey’s comet loses ice to form its tail, ... | | 5201 |
| 4 | askscience | t3_o0k1fs | Why are liquids almost always shiny/reflective? | I am painting some furniture and just started ... | 109 |


In [124]:
# We drop duplicates
df_hist.drop_duplicates(subset='name', inplace=True)
df_science.drop_duplicates(subset='name', inplace=True)

In [125]:
# We look at the distinct posts we have
print(df_hist.shape)
print(df_science.shape)

(999, 5)
(1000, 5)


In [126]:
# Check for nans

df_hist.isnull().sum()

subreddit    0
name         0
title        0
selftext     0
score        0
dtype: int64

In [127]:
df_science.isnull().sum()

subreddit    0
name         0
title        0
selftext     0
score        0
dtype: int64
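
Note that `isnull()` only counts missing values such as `NaN` or `None`, so a post whose `selftext` is an empty string is not counted here. A hedged sketch on made-up data showing how empty strings could be counted separately; `count_blank` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd


def count_blank(series):
    """Count non-missing entries that are empty or whitespace-only strings."""
    text = series.dropna().astype(str)
    return int(text.str.strip().eq('').sum())


# Made-up sample: two of the three selftext values are blank
sample = pd.DataFrame({'title': ['A', 'B', 'C'],
                       'selftext': ['some text', '', '   ']})

print(sample['selftext'].isnull().sum())  # missing values only
print(count_blank(sample['selftext']))    # blank strings
```

Applied to the dataframes above, this would be `count_blank(df_hist['selftext'])` and `count_blank(df_science['selftext'])`.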

## <a name="sec_5"></a>Export Data

In [130]:
# Export data
df_hist.to_csv('./data/askhistorians.csv')
df_science.to_csv('./data/askscience.csv')