# Project: Using Reddit's API for Predicting Comments
### Author: Kihoon Sohn

### Table of Contents

- **Notebook 1 - Data Fetching (current)**: scrape Reddit's `json` listings and unpack them into a dataframe
- Notebook 2 - Data Cleansing: exploratory data analysis and feature engineering
- Notebook 3 - Data Modeling: build a predictive model

**Disclaimer**: Due to GitHub's file size restriction, the `/dataset/` folder and other large files are excluded via `.gitignore`. The notebook therefore might not be fully reproducible.

In [1]:
# import libraries for the notebook

import requests
import json
import pandas as pd
import numpy as np
import time
import datetime

### 1a: Fetch `json` from Reddit.com (Hot posts)

In [2]:
# Send a GET request to Reddit's hot-posts listing
# `limit=100` fetches 100 posts per request

URL = "http://www.reddit.com/hot.json?limit=100"
res = requests.get(URL, headers={'User-agent': 'KH'})
data = res.json()

In [3]:
# Check the number of posts returned by the first fetch
print(len(data['data']['children']))

100


In [4]:
# Build a list of post dictionaries from the first 100 fetched posts, then a dataframe.
reddit = [child['data'] for child in data['data']['children']]
reddit = pd.DataFrame(reddit)

# Record the fetch time (UTC) in the dataframe via `.utcnow()`
fetch_time = pd.Timestamp.utcnow()
reddit['fetched time'] = fetch_time
reddit.head(2)

Unnamed: 0,approved_at_utc,approved_by,archived,author,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,...,thumbnail_width,title,ups,url,user_reports,view_count,visited,whitelist_status,wls,fetched time
0,,,False,driop,,,[],,,,...,140.0,Fantastic Beasts: The Crimes of Grindelwald - ...,10574,https://www.youtube.com/watch?v=vvFybpmyB9E,[],,False,all_ads,6.0,2018-07-21 20:25:59.036374+00:00
1,,,False,ugoindownsaka1,,,[],,,,...,140.0,Rockstar gave away a 'key to the city' to peop...,38598,https://i.redd.it/5i7py1fdlbb11.jpg,[],,False,all_ads,6.0,2018-07-21 20:25:59.036374+00:00
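To make the unpacking step easier to follow, here is a minimal sketch on a hand-written payload. The `id`, `title`, and `num_comments` values, the `after` token, and the `kind` fields are made up for illustration; only the `data` → `children` → `data` nesting and the `after` key are taken from the cells above.

```python
import pandas as pd

# Hypothetical, heavily trimmed payload; a real listing carries many more fields per post.
sample = {
    "kind": "Listing",
    "data": {
        "after": "t3_abc123",
        "children": [
            {"kind": "t3", "data": {"id": "aaa111", "title": "First post", "num_comments": 12}},
            {"kind": "t3", "data": {"id": "bbb222", "title": "Second post", "num_comments": 3}},
        ],
    },
}

# Each child wraps one post; its inner "data" dict becomes one dataframe row.
rows = [child["data"] for child in sample["data"]["children"]]
sample_df = pd.DataFrame(rows)

# One timestamp per request, so every row from the same page shares it.
sample_df["fetched time"] = pd.Timestamp.now(tz="UTC")

print(sample_df.shape)          # (2, 4)
print(sample["data"]["after"])  # t3_abc123
```

The `after` value is what the next request needs in order to continue from where this page stopped.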


### 1b: Iterate the web scraping

In [5]:
# Let's scrape an additional 400 posts (500 in total)
# Print some of the URLs to check that the code is working and live.

post_name = data['data']['after']

for i in range(4): 
    try:
        URL = "http://www.reddit.com/hot.json?limit=100&after="+ post_name
        res = requests.get(URL, headers={'User-agent': 'KH'})
        if res.status_code == 200:
            post = res.json()
            fetch_time = pd.Timestamp.utcnow()
            post_name = post['data']['after']
            df = [child['data'] for child in post['data']['children']]
            df = pd.DataFrame(df)
            # Stamp only this page's rows; assigning on `reddit` would overwrite earlier fetch times
            df['fetched time'] = fetch_time
            reddit = pd.concat([reddit, df], ignore_index=True)
            time.sleep(2)
            if i % 2 == 0:
                print("{}th url: ".format(i), URL)
        else:
            print(res.status_code)
            break
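    except TypeError:
        # `post_name` is None once the listing has no further page, so the URL concatenation fails
        print("No more posts left to be fetched. Try it later!")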
        break

print(reddit.shape)
print(reddit['id'].nunique())
reddit.head(2)

0th url:  http://www.reddit.com/hot.json?limit=100&after=t3_90p1cp
2th url:  http://www.reddit.com/hot.json?limit=100&after=t3_90qabk
(500, 98)
500


Unnamed: 0,approved_at_utc,approved_by,archived,author,author_cakeday,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,...,thumbnail_height,thumbnail_width,title,ups,url,user_reports,view_count,visited,whitelist_status,wls
0,,,False,driop,,,,[],,,...,105.0,140.0,Fantastic Beasts: The Crimes of Grindelwald - ...,10574,https://www.youtube.com/watch?v=vvFybpmyB9E,[],,False,all_ads,6.0
1,,,False,ugoindownsaka1,,,,[],,,...,129.0,140.0,Rockstar gave away a 'key to the city' to peop...,38598,https://i.redd.it/5i7py1fdlbb11.jpg,[],,False,all_ads,6.0
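The pagination loop above could also be factored into a small helper. The following is only a sketch, not the notebook's actual code: `fetch_pages` and `get_page` are hypothetical names, and it assumes that `after` comes back as `None` on the last page. The page source is passed in as a callable so the logic can be tried without touching the network.

```python
import time
import pandas as pd


def fetch_pages(get_page, after, max_pages, pause=2.0):
    """Collect listing pages until max_pages is reached or `after` runs out.

    get_page is a hypothetical callable: it takes an `after` token and
    returns the parsed listing JSON (a dict) for that page.
    """
    frames = []
    for _ in range(max_pages):
        if after is None:  # assumed signal that no further page exists
            break
        page = get_page(after)
        frame = pd.DataFrame([child["data"] for child in page["data"]["children"]])
        frame["fetched time"] = pd.Timestamp.now(tz="UTC")  # stamp this page only
        frames.append(frame)
        after = page["data"]["after"]
        time.sleep(pause)  # be polite between requests
    combined = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
    return combined, after


# Two made-up pages standing in for real responses.
fake_pages = {
    "t3_p1": {"data": {"after": "t3_p2", "children": [{"data": {"id": "a"}}, {"data": {"id": "b"}}]}},
    "t3_p2": {"data": {"after": None, "children": [{"data": {"id": "c"}}]}},
}

posts, next_after = fetch_pages(fake_pages.__getitem__, "t3_p1", max_pages=5, pause=0)
print(posts["id"].tolist(), next_after)  # ['a', 'b', 'c'] None
```

In the notebook's setting, `get_page` would wrap the `requests.get(...)` call and return `res.json()`. Checking `after` explicitly also avoids relying on an exception to end the loop.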


In [6]:
# Let's scrape additional posts, up to 10,000 in total.
# Also, increase the sleep time to 3 seconds to minimize duplicated posts.

for i in range(0,95): 
    try:
        URL = "http://www.reddit.com/hot.json?limit=100&after="+ post_name
        res = requests.get(URL, headers={'User-agent': 'KH'})
        if res.status_code == 200:
            post = res.json()
            fetch_time = pd.Timestamp.utcnow()
            post_name = post['data']['after']
            df = [child['data'] for child in post['data']['children']]
            df = pd.DataFrame(df)
            # Stamp only this page's rows; assigning on `reddit` would overwrite earlier fetch times
            df['fetched time'] = fetch_time
            reddit = pd.concat([reddit, df], ignore_index=True)
            time.sleep(3)
            if i % 20 == 0:
                print("{}th url: ".format(i), URL)
        else:
            print(res.status_code)
            break
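    except TypeError:
        # `post_name` is None once the listing has no further page, so the URL concatenation fails
        print("No more posts left to be fetched. Try it later!")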
        break

print(reddit.shape)
print(reddit['id'].nunique())
reddit.head(2)

0th url:  http://www.reddit.com/hot.json?limit=100&after=t3_90qta7
20th url:  http://www.reddit.com/hot.json?limit=100&after=t3_90q6os
40th url:  http://www.reddit.com/hot.json?limit=100&after=t3_90pryq
No more posts left to be fetched. Try it later!
(5999, 99)
5891


Unnamed: 0,approved_at_utc,approved_by,archived,author,author_cakeday,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,...,thumbnail_height,thumbnail_width,title,ups,url,user_reports,view_count,visited,whitelist_status,wls
0,,,False,driop,,,,[],,,...,105.0,140.0,Fantastic Beasts: The Crimes of Grindelwald - ...,10574,https://www.youtube.com/watch?v=vvFybpmyB9E,[],,False,all_ads,6.0
1,,,False,ugoindownsaka1,,,,[],,,...,129.0,140.0,Rockstar gave away a 'key to the city' to peop...,38598,https://i.redd.it/5i7py1fdlbb11.jpg,[],,False,all_ads,6.0
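The output above shows 5999 rows but only 5891 unique `id` values, so 108 rows are repeats of posts that were already fetched on an earlier page. One possible way to handle this, sketched on a made-up miniature frame rather than the real data, is `drop_duplicates` on the `id` column:

```python
import pandas as pd

# Hypothetical miniature of the scraped frame: post "b" was returned on two pages.
toy = pd.DataFrame({
    "id": ["a", "b", "c", "b"],
    "num_comments": [10, 4, 7, 5],
})

print(toy.shape[0], toy["id"].nunique())  # 4 3

# Keep the first sighting of each post and renumber the index.
deduped = toy.drop_duplicates(subset="id", keep="first").reset_index(drop=True)
print(deduped)
```

Whether to keep the first or the last sighting is a judgment call: `keep="last"` would retain the most recent `num_comments` for a post that appeared twice.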


### 1c: Save it to CSV

In [7]:
# Select a few features of interest into a smaller, crafted dataframe.
red_df = reddit[['id', 'title', 'subreddit', 'num_comments', 'created_utc', 'fetched time']]
print(red_df.shape)

# Courtesy of Harsha: insert a timestamp into the CSV file name
current_time = time.strftime(" %d-%m-%Y (%H%M")

# Save original and crafted data in csv
reddit.to_csv('./dataset/hotposts_original_'+current_time+'hrs).csv')
red_df.to_csv('./dataset/hotposts_crafted_'+current_time+'hrs).csv')

(5999, 6)
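As an alternative to concatenating strings for the file name, the timestamped path could be built with `pathlib`. This is a sketch under assumptions: `stamped_csv_path` is a hypothetical helper, and the `YYYY-MM-DD_HHMM` pattern is a different naming scheme from the one used in the cell above, chosen because it sorts chronologically and avoids spaces and parentheses.

```python
from datetime import datetime, timezone
from pathlib import Path


def stamped_csv_path(folder, stem, when=None):
    """Return folder/stem_YYYY-MM-DD_HHMM.csv for the given (or current) UTC time."""
    when = when or datetime.now(timezone.utc)
    return Path(folder) / f"{stem}_{when:%Y-%m-%d_%H%M}.csv"


# Illustrative timestamp only; omit `when` to use the current time.
example = stamped_csv_path(
    "dataset", "hotposts_crafted", datetime(2018, 7, 21, 20, 25, tzinfo=timezone.utc)
)
print(example.as_posix())  # dataset/hotposts_crafted_2018-07-21_2025.csv
```

The resulting path can be passed straight to `DataFrame.to_csv`, which accepts path objects as well as strings.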
