<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Web APIs & NLP  
## PART 1/2: Data Scraping

-------

## 0. Data Scraping

In [2]:
import requests
import pandas as pd
import time
from datetime import datetime

In [3]:
url = 'https://api.pushshift.io/reddit/search/submission'

In [4]:
params = { "subreddit":"zelda",
         "size":100}

In [4]:
res = requests.get(url,params)

In [5]:
res.status_code

200

### 0.1 Identifying parameters to retrieve Subreddit posts by timeframe

I am targeting at least 1,500 submissions per subreddit. Because Pushshift returns at most 100 submissions per request, I need at least 15 requests per subreddit, each one picking up in time where the previous one stopped:

In [6]:
data = res.json()
posts = data['data']

newest = posts[0]['created_utc']
oldest = posts[-1]['created_utc']
print(f"newest = {newest}, oldest = {oldest}")

# 'created_utc' is a Unix timestamp; it can be converted at https://www.epochconverter.com/
# Results come back newest first, so posts[0] is the most recent submission
# and posts[-1] is the oldest one in this batch.

# newest = 1658293974 -> Wednesday, 20 July 2022 05:12:54 UTC
# oldest = 1658159355 -> Monday, 18 July 2022 15:49:15 UTC

# So the next request should set "before" to the oldest timestamp to page further back

newest = 1658293974, oldest = 1658159355


In [7]:
len(posts)

100
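
The paging idea described above can be sketched as follows. This is only an illustration under stated assumptions: `collect_submissions` and `fake_fetch` are hypothetical names, the fetcher is a stand-in that returns made-up posts instead of calling Pushshift, and `before` is assumed to mean "strictly older than this timestamp".

```python
def collect_submissions(fetch, subreddit, pages=15, size=100):
    """Page backwards through a subreddit, newest first.

    `fetch` is any callable that takes a params dict and returns a list of
    post dicts, each with a 'created_utc' key (hypothetical interface).
    """
    params = {"subreddit": subreddit, "size": size}
    posts = []
    for _ in range(pages):
        batch = fetch(params)
        if not batch:  # nothing older is left
            break
        posts.extend(batch)
        # Next request: only posts older than the oldest one seen so far.
        params = {**params, "before": batch[-1]["created_utc"]}
    return posts


def fake_fetch(params):
    """Stand-in for the API: 250 made-up posts with descending timestamps."""
    newest = 1_000_000
    all_posts = [{"created_utc": newest - i, "title": f"post {i}"} for i in range(250)]
    before = params.get("before", newest + 1)
    older = [p for p in all_posts if p["created_utc"] < before]
    return older[: params["size"]]


posts = collect_submissions(fake_fetch, "zelda", pages=15, size=100)
print(len(posts))
```

With a real API, `fetch` could wrap `requests.get(url, params=params).json()['data']` plus a pause between calls; the loop itself would stay the same.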

### 0.2 Creating function(s) to retrieve and save multiple subreddits based on a pre-defined list of subreddits

In [6]:
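def list_creator(subreddit,count):
    # Build the names (df1..df<count>) and CSV paths for one subreddit's batches.
    df_list = [f'df{i+1}' for i in range(count)]
    df_files = [f'../dataset/{subreddit}/{df}.csv' for df in df_list]
    # Results are exposed as function attributes and read back by the main loop.
    list_creator.df_list = df_list
    list_creator.df_files = df_files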
    

In [7]:
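def scraper(subreddit,n):
    # Relies on the notebook-level globals `url`, `params` and `df_list`.
    df_files = [f'../dataset/{subreddit}/{df}.csv' for df in df_list]
    res = requests.get(url,params)
    data = res.json()
    posts = data['data']
    # Oldest timestamp in this batch; the caller passes it as "before" in the next request.
    latest = posts[-1]['created_utc']
    scraper.latest = latest
    time.sleep(3)  # pause between requests
    # Keep only the columns needed for NLP and write batch n to its own CSV
    # (the ../dataset/{subreddit}/ folder must already exist).
    globals()[df_list[n]] = pd.DataFrame(posts)
    globals()[df_list[n]] = globals()[df_list[n]].loc[:,['subreddit','selftext','title']]
    globals()[df_list[n]].to_csv(df_files[n],index=False)  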

In [8]:
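def file_save(df_list):
    # Despite the name, this reloads each CSV that scraper() already wrote,
    # which confirms the files exist and refreshes the global DataFrames.
    for n,df in enumerate(df_list):
        print(f"{df} saved as {df_files[n]}")
        globals()[df] = pd.read_csv(df_files[n])
    print("All individual files saved")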


### 0.3 Function to combine individual outputs from each request into one consolidated file

In [9]:
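def file_combine(subreddit,df_list):
    # Stack the per-request DataFrames and write one CSV per subreddit.
    # The index is written as the first column; reader() loads it back with index_col=[0].
    concat_list = [globals()[df] for df in df_list]
    globals()[subreddit] = pd.concat(concat_list,ignore_index=True)
    globals()[subreddit].to_csv(f"../dataset/{subreddit}.csv")
    print(f"{len(df_list)} files combined as '../dataset/{subreddit}.csv'")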
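
The functions above pass DataFrames around through `globals()`. As a possible alternative, not the approach used in this notebook, the batches could be kept in an ordinary dict and combined from there. The sketch below assumes pandas is installed; `combine_frames` is a hypothetical helper and the rows are made-up placeholders, not Reddit data.

```python
import pandas as pd

COLUMNS = ["subreddit", "selftext", "title"]


def combine_frames(frames):
    """Concatenate per-request DataFrames, keeping only the columns used later."""
    return pd.concat([frame.loc[:, COLUMNS] for frame in frames], ignore_index=True)


# Two tiny stand-in batches, keyed the same way as df1, df2, ... above.
batches = {
    "df1": pd.DataFrame(
        {"subreddit": ["demo"], "selftext": ["some text"], "title": ["first"], "score": [1]}
    ),
    "df2": pd.DataFrame(
        {"subreddit": ["demo"], "selftext": [None], "title": ["second"], "score": [2]}
    ),
}

combined = combine_frames(batches.values())
print(combined.shape)
print(combined.isnull().sum())
```

Keeping the frames in a dict makes each function's inputs explicit, so the cells can be re-run in any order without depending on names left over in the kernel.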

In [13]:
subr_list = ["zelda","adidas","Nike","wiiu","crocs","StardewValley","harvestmoon","NintendoSwitch"]
params = { "subreddit":"zelda",
         "size":100,"before":0}

for subreddit in subr_list:
    print(subreddit)
    print("---------")
    params["subreddit"] = subreddit
    del params["before"]  # drop the previous cursor so paging restarts from the newest post
    print(params)
    
    # scraper() sends the request itself, using the notebook-level `params`
    
    list_creator(subreddit,15)

    df_list = list_creator.df_list
    df_files = list_creator.df_files 

    scraper(subreddit,0)

    for n in range(len(df_list)-1):
        params["before"] = scraper.latest
        x = n+1
        print(f"number of files saved:{x}")
        print(datetime.fromtimestamp(scraper.latest).strftime('%d-%m-%y'))
        scraper(subreddit,x)

    file_save(df_list)

    file_combine(subreddit,df_list)

zelda
---------
{'subreddit': 'zelda', 'size': 100}
number of files saved:1
18-07-22
number of files saved:2
16-07-22
number of files saved:3
14-07-22
number of files saved:4
12-07-22
number of files saved:5
09-07-22
number of files saved:6
07-07-22
number of files saved:7
05-07-22
number of files saved:8
03-07-22
number of files saved:9
01-07-22
number of files saved:10
29-06-22
number of files saved:11
26-06-22
number of files saved:12
24-06-22
number of files saved:13
21-06-22
number of files saved:14
18-06-22
df1 saved as ../dataset/zelda/df1.csv
df2 saved as ../dataset/zelda/df2.csv
df3 saved as ../dataset/zelda/df3.csv
df4 saved as ../dataset/zelda/df4.csv
df5 saved as ../dataset/zelda/df5.csv
df6 saved as ../dataset/zelda/df6.csv
df7 saved as ../dataset/zelda/df7.csv
df8 saved as ../dataset/zelda/df8.csv
df9 saved as ../dataset/zelda/df9.csv
df10 saved as ../dataset/zelda/df10.csv
df11 saved as ../dataset/zelda/df11.csv
df12 saved as ../dataset/zelda/df12.csv
df13 saved as ../da
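
The dates printed in the loop come from `datetime.fromtimestamp`, which converts to the local timezone of the machine running the notebook. A minimal sketch of a timezone-explicit version is shown below; `utc_label` is a hypothetical helper name, and the timestamp is the oldest `created_utc` value from the sample request in section 0.1.

```python
from datetime import datetime, timezone


def utc_label(created_utc):
    """Format a Unix timestamp as a UTC date-time string."""
    moment = datetime.fromtimestamp(created_utc, tz=timezone.utc)
    return moment.strftime("%d-%m-%y %H:%M:%S UTC")


print(utc_label(1658159355))  # 18-07-22 15:49:15 UTC
```

Passing `tz=timezone.utc` keeps the labels identical no matter where the notebook is run.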

In [30]:
subr_list = ["keto","Paleo","Chiropractic","physiotherapy"]
params = { "subreddit":"zelda",
         "size":100,"before":0}

for subreddit in subr_list:
    print(subreddit)
    print("---------")
    params["subreddit"] = subreddit
    del params["before"]  # drop the previous cursor so paging restarts from the newest post
    print(params)
    
    # scraper() sends the request itself, using the notebook-level `params`
    
    list_creator(subreddit,15)

    df_list = list_creator.df_list
    df_files = list_creator.df_files 

    scraper(subreddit,0)

    for n in range(len(df_list)-1):
        params["before"] = scraper.latest
        x = n+1
        print(f"number of files saved:{x}")
        print(datetime.fromtimestamp(scraper.latest).strftime('%d-%m-%y'))
        scraper(subreddit,x)

    file_save(df_list)

    file_combine(subreddit,df_list)

keto
---------
{'subreddit': 'keto', 'size': 100}
number of files saved:1
19-07-22
number of files saved:2
17-07-22
number of files saved:3
15-07-22
number of files saved:4
14-07-22
number of files saved:5
11-07-22
number of files saved:6
09-07-22
number of files saved:7
07-07-22
number of files saved:8
05-07-22
number of files saved:9
03-07-22
number of files saved:10
01-07-22
number of files saved:11
29-06-22
number of files saved:12
27-06-22
number of files saved:13
24-06-22
number of files saved:14
23-06-22
df1 saved as ../dataset/keto/df1.csv
df2 saved as ../dataset/keto/df2.csv
df3 saved as ../dataset/keto/df3.csv
df4 saved as ../dataset/keto/df4.csv
df5 saved as ../dataset/keto/df5.csv
df6 saved as ../dataset/keto/df6.csv
df7 saved as ../dataset/keto/df7.csv
df8 saved as ../dataset/keto/df8.csv
df9 saved as ../dataset/keto/df9.csv
df10 saved as ../dataset/keto/df10.csv
df11 saved as ../dataset/keto/df11.csv
df12 saved as ../dataset/keto/df12.csv
df13 saved as ../dataset/keto/df1

### 0.4 Preliminary Analysis of Retrieved Subreddit Data

In [28]:
def reader(subreddit):
    # Load the combined CSV; column 0 is the index written by file_combine().
    x = pd.read_csv(f"../dataset/{subreddit}.csv",index_col = [0])
    print(subreddit)
    print("-----")
    print(x.shape)
    print(x.isnull().sum())
    # Also expose the DataFrame as a global named after the subreddit.
    globals()[subreddit] = x
    return x

In [31]:
consol_df_list = [reader(subreddit) for subreddit in ["zelda","adidas","Nike","wiiu","crocs","StardewValley","harvestmoon","NintendoSwitch","keto","Paleo","Chiropractic","physiotherapy"]]


zelda
-----
(1499, 3)
subreddit       0
selftext     1020
title           0
dtype: int64
adidas
-----
(1500, 3)
subreddit       0
selftext     1085
title           0
dtype: int64
Nike
-----
(1498, 3)
subreddit       0
selftext     1133
title           0
dtype: int64
wiiu
-----
(1500, 3)
subreddit      0
selftext     584
title          0
dtype: int64
crocs
-----
(1499, 3)
subreddit       0
selftext     1075
title           0
dtype: int64
StardewValley
-----
(1500, 3)
subreddit      0
selftext     836
title          0
dtype: int64
harvestmoon
-----
(1500, 3)
subreddit      0
selftext     699
title          0
dtype: int64
NintendoSwitch
-----
(1500, 3)
subreddit      0
selftext     415
title          0
dtype: int64
keto
-----
(1500, 3)
subreddit    0
selftext     8
title        0
dtype: int64
Paleo
-----
(1498, 3)
subreddit      0
selftext     623
title          0
dtype: int64
Chiropractic
-----
(1500, 3)
subreddit      0
selftext     484
title          0
dtype: int64
physiotherapy
-----


## Data Selection Strategy

1. Using Pushshift's API, I collected submissions from multiple subreddits, focusing on the "title" and "selftext" fields.
   - I excluded comments on posts for this scraping exercise. This is intended to reduce noise, as comments often veer off-topic.
2. Criteria for selection:
   - At least 1,500 observations per subreddit, so that enough data remains after cleaning (i.e. at least 1,000 observations per subreddit post-cleaning).
   - Minimal null values in the fields retrieved. The target is fewer than 500 null cells in *each* chosen subreddit.
   - Submissions in the subreddit must be text-heavy rather than image-heavy, so that enough text remains for NLP. That rules out subreddits dominated by images such as food photos, shoes, memes, in-game screenshots, etc.
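
The two numeric criteria can be checked against the `reader()` output from section 0.4. The sketch below is only an illustration: the counts are copied from that output, physiotherapy is left out because its output is cut off in this document, and the names `summary`, `MIN_ROWS`, `MAX_NULLS` and `meets_thresholds` are hypothetical. The third criterion (text-heavy rather than image-heavy) is qualitative and is not modelled here.

```python
# (rows, null selftext count) per subreddit, copied from the reader() output.
summary = {
    "zelda": (1499, 1020),
    "adidas": (1500, 1085),
    "Nike": (1498, 1133),
    "wiiu": (1500, 584),
    "crocs": (1499, 1075),
    "StardewValley": (1500, 836),
    "harvestmoon": (1500, 699),
    "NintendoSwitch": (1500, 415),
    "keto": (1500, 8),
    "Paleo": (1498, 623),
    "Chiropractic": (1500, 484),
}

MIN_ROWS = 1500   # at least 1,500 observations
MAX_NULLS = 500   # fewer than 500 null cells


def meets_thresholds(rows, nulls):
    """Apply the two numeric selection criteria."""
    return rows >= MIN_ROWS and nulls < MAX_NULLS


passing = [name for name, (rows, nulls) in summary.items() if meets_thresholds(rows, nulls)]
print(passing)  # ['NintendoSwitch', 'keto', 'Chiropractic']
```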


Based on the above criteria, I have shortlisted the **"Physiotherapy"** and **"Chiropractic"** subreddits as the focus of this Project. 