## Reddit Scraper Utility

The main driver of this scraper is praw (Python Reddit API Wrapper), the Python connector to the Reddit API.

PIL, the Python Imaging Library, is used to process the image data collected from Reddit.

In [1]:
import praw
import os
import sys
import pandas as pd
from PIL import Image
import urllib.request
import secret
from datetime import datetime

After importing the necessary dependencies, I connect to Reddit by calling praw.Reddit() with my API keys, which are read from the local `secret` module.

In [2]:
reddit = praw.Reddit(client_id=secret.client_id, client_secret=secret.client_secret, user_agent=secret.user_agent)
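The `secret` module itself is not shown in this notebook. As one possible alternative, here is a minimal sketch that reads the same three values from environment variables. The variable names (`REDDIT_CLIENT_ID` and so on) and the helper `load_reddit_credentials` are hypothetical, not part of praw or of this project.

```python
import os


def load_reddit_credentials(env=os.environ):
    """Collect the three values praw.Reddit() is given in the cell above.

    The environment variable names are assumptions made for this sketch.
    """
    names = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "REDDIT_USER_AGENT")
    missing = [name for name in names if not env.get(name)]
    if missing:
        # Fail early with a readable message instead of a confusing API error.
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        "user_agent": env["REDDIT_USER_AGENT"],
    }


# Demo with placeholder values so the sketch runs without real keys.
demo_env = {
    "REDDIT_CLIENT_ID": "my-client-id",
    "REDDIT_CLIENT_SECRET": "my-client-secret",
    "REDDIT_USER_AGENT": "sneaker-scraper by u/example",
}
credentials = load_reddit_credentials(demo_env)
```

With real values set, the result could be passed on as `praw.Reddit(**load_reddit_credentials())`, which keeps the keys out of the repository.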

For the subreddit of interest (`sub_reddit`), the data we want to collect for each post are the title, score, id, subreddit, link URL, number of comments, selftext (stored as `body`), and the timestamp at which the post was submitted.

In [4]:
number_to_scrape=700
sub_reddit = 'Sneakers'
posts = []

sneakers_subreddit = reddit.subreddit(sub_reddit)

# collect one row per post from the subreddit's "hot" listing
for post in sneakers_subreddit.hot(limit=number_to_scrape):
    posts.append([  post.title, 
                    post.score, 
                    post.id, 
                    post.subreddit, 
                    post.url, 
                    post.num_comments,  
                    post.selftext,   # text of the post, stored in the 'body' column
                    post.created
                ])

posts = pd.DataFrame(posts,columns=['title', 'score', 'id', 'subreddit', 'url', 'num_comments', 'body', 'created'])

posts.head(5)

Unnamed: 0,title,score,id,subreddit,url,num_comments,body,created
0,New Releases Thread 5/11 - 5/17,18,ghi2qv,Sneakers,https://www.reddit.com/r/Sneakers/comments/ghi...,385,"Please post all your questions, pics and comme...",1589206000.0
1,Due to PUPular demand here are the culprits. S...,2832,gi4lq8,Sneakers,https://i.redd.it/auw3v9rto9y41.jpg,128,,1589289000.0
2,"Missed out on Royal toes this weekend, so I pu...",1411,ghzpfa,Sneakers,https://i.redd.it/fppna4lw58y41.jpg,71,,1589270000.0
3,Linen 3M,112,gibp4m,Sneakers,https://i.redd.it/1nsd1g72bcy41.jpg,11,,1589321000.0
4,Much better!!! Just feels right now!,471,gi4dhy,Sneakers,https://i.redd.it/ch01e1dsl9y41.jpg,39,,1589288000.0
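To make the mapping between a post and a table row explicit, here is a minimal sketch that runs offline. `submission_to_row` and `fake_post` are hypothetical names introduced for illustration. The stand-in object only mimics the attributes the loop above reads, and its values are copied from row 3 of the table.

```python
from types import SimpleNamespace

# Column names as they appear in the output table above.
COLUMNS = ["title", "score", "id", "subreddit", "url", "num_comments", "body", "created"]


def submission_to_row(post):
    """Map one submission-like object to a list ordered like COLUMNS."""
    return [
        post.title,
        post.score,
        post.id,
        str(post.subreddit),  # keep the plain subreddit name
        post.url,
        post.num_comments,
        post.selftext,        # stored under the 'body' column
        post.created,
    ]


# Hypothetical stand-in for a praw submission (values taken from row 3 above).
fake_post = SimpleNamespace(
    title="Linen 3M",
    score=112,
    id="gibp4m",
    subreddit="Sneakers",
    url="https://i.redd.it/1nsd1g72bcy41.jpg",
    num_comments=11,
    selftext="",
    created=1589321000.0,
)

row = submission_to_row(fake_post)
record = dict(zip(COLUMNS, row))
```

Keeping the column list and the row builder side by side makes it harder for the two to drift apart, for example when a field such as `selftext` is added to one but not the other.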


In [5]:
posts.dtypes

title            object
score             int64
id               object
subreddit        object
url              object
num_comments      int64
body             object
created         float64
dtype: object

In [15]:
# convert unix time to a readable datetime format
posts['created'] = pd.to_datetime(posts['created'], unit='s')
posts['created']

0     2020-05-11 14:10:56
1     2020-05-12 13:09:00
2     2020-05-12 08:01:08
3     2020-05-12 21:57:12
4     2020-05-12 12:51:57
              ...        
695   2020-05-10 16:58:30
696   2020-05-10 06:01:51
697   2020-05-10 08:28:30
698   2020-05-10 18:28:33
699   2020-05-10 13:03:34
Name: created, Length: 700, dtype: datetime64[ns]

The `created` column holds UNIX time (seconds since 1970-01-01), so it has to be converted to a more readable datetime format, as done in the cell above.
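For a single value, the same conversion can be sketched with the standard library alone. `unix_to_utc` is a hypothetical helper name. The sample is the `created` value of the first row as displayed in the table, which appears to be rounded, so the result is a few minutes off the 14:10:56 shown in the pandas output.

```python
from datetime import datetime, timezone


def unix_to_utc(seconds):
    """Convert seconds since the UNIX epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


sample = 1589206000.0  # first row's 'created' value as displayed (rounded)
converted = unix_to_utc(sample)
readable = converted.strftime("%Y-%m-%d %H:%M:%S")
print(readable)  # 2020-05-11 14:06:40
```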

In [6]:
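# iterate through the queried posts and download each linked image into raw_data/
# (files get a .png name regardless of the source format; PIL detects the real format from the file contents)

for i in range(1, len(posts['url'])):  # row 0 is skipped; in the sample above it is a text thread, not an image
    try:
        urllib.request.urlretrieve(posts['url'][i], "raw_data/0000"+str(i)+".png")
    except Exception as err:
        # report the failed download and move on to the next post
        print("could not download", posts['url'][i], err)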
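Not every post links to an image, and `"0000"+str(i)` yields names of varying width (`00001`, `000010`, ...). Here is a minimal sketch of two hypothetical helpers, `is_image_url` and `target_path`, that filter by file extension and build zero-padded names. The extension list and the five-digit width are assumptions, and the first sample URL is a made-up text-post link.

```python
import os
from urllib.parse import urlparse

# Extensions treated as images in this sketch (an assumption, extend as needed).
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def url_extension(url):
    """Return the lowercase file extension of the URL path, or '' if none."""
    return os.path.splitext(urlparse(url).path)[1].lower()


def is_image_url(url):
    """True when the URL path ends in one of IMAGE_EXTENSIONS."""
    return url_extension(url) in IMAGE_EXTENSIONS


def target_path(index, url, folder="raw_data"):
    """Build a zero-padded file name that keeps the source extension."""
    return os.path.join(folder, f"{index:05d}{url_extension(url)}")


urls = [
    "https://www.reddit.com/r/Sneakers/comments/example/",  # hypothetical text post
    "https://i.redd.it/auw3v9rto9y41.jpg",                  # image link from the table
]
# Keep the original row index so file names still map back to the DataFrame.
downloads = [(i, target_path(i, u)) for i, u in enumerate(urls) if is_image_url(u)]
```

Each `(index, path)` pair could then be handed to `urllib.request.urlretrieve(posts['url'][index], path)` in place of the hard-coded file name.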

In [7]:
# resize all of the downloaded images to 150x150 so the downstream convolutional neural network receives uniformly sized inputs

# Resources used:
# https://stackoverflow.com/questions/21517879/python-pil-resize-all-images-in-a-folder
# https://kishstats.com/python/2018/04/27/bulk-image-resizing-python.html
# https://stackoverflow.com/questions/22282760/filenotfounderror-errno-2-no-such-file-or-directory

path = "/Users/Oliver/GDrive/Data_Scientist_Career/Projects/Hype/raw_data"

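for image in os.listdir(path):
    try:
        current_img = Image.open(os.path.join(path, image))
        f, e = os.path.splitext(image)
        # Image.LANCZOS is the same filter as Image.ANTIALIAS, which was removed in Pillow 10
        resize_img = current_img.resize((150,150), Image.LANCZOS)
        # quality is a JPEG option and has no effect on PNG output, so it is not passed
        resize_img.save('resized_data/' + f +'-resized.png', 'PNG')
    except Exception:
        # skip files that PIL cannot read as images
        pass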
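The resize step can be checked in isolation on an in-memory image, without touching the folders above. This is a minimal sketch. `resize_image` and `TARGET_SIZE` are hypothetical names, and the red test image stands in for a downloaded photo.

```python
from PIL import Image

TARGET_SIZE = (150, 150)  # same size the notebook uses


def resize_image(img, size=TARGET_SIZE):
    """Return a resized copy of img using the Lanczos filter."""
    return img.resize(size, Image.LANCZOS)


# Synthetic 400x300 image used in place of a real download.
sample = Image.new("RGB", (400, 300), color=(200, 30, 30))
resized = resize_image(sample)
print(sample.size, "->", resized.size)
```

Note that resizing a 400x300 image to a fixed 150x150 does not preserve the aspect ratio, so non-square photos are stretched. Whether that matters depends on the downstream model.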