# Project 3 - Using Beautiful Soup, Reddit API and PushShift API to scrape SUBREDDITS

In this portion of the Project, we'll use [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), [Reddit API](https://www.reddit.com/dev/api/) and [Push Shift API](https://github.com/pushshift/api) to scrape the [ProCreate](https://www.reddit.com/r/ProCreate/) and [Adobe Illustrator](https://www.reddit.com/r/AdobeIllustrator/) subreddits.

We would discuss the pros and cons of the 3 different scraping mechanisms and our reasoning for picking PushShift API.

In [1]:
# Imports
import pandas as pd
import re
import requests
import time
import json
from time import sleep

In [2]:
# Import libaries
# imports
import pandas as pd
import requests
from bs4 import BeautifulSoup

## Contents

- [1.0 Using Beautiful Soup to Scrape](#using)
- [2.0 Utilize Reddits API to Scrape](#reddit)
- [3.0 Using PushShift API to Scrape](#push)

## 1.0 Using Beautiful Soup to Scrape<a name="using"></a>

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

Source: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [3]:
# Create Beautiful SOUP function to scrape url page of REDDIT
def scraper_dictionary (url, subreddit):
    
    sodos = [] 
    
    dictlist = ['post', 'votes', 'comments','sub_reddit'] # essentially my header
    
    headers = {'User-Agent': 'Mozilla/5.0'}
    
    res = requests.get(url, headers=headers)
    
    soup = BeautifulSoup(res.content, 'html.parser')
    
    for i in soup.find_all('div',{'data-promoted':'false'}): # Ensures adverts are ignored
        
        empty = []
        
        # Posts
        y=i.find('a',{'data-event-action':'title'})
        empty.append(y.text)
    
        # Votes
        try:
            y=i.find('div',{'class':'score unvoted'})
            empty.append(int(y.text))
        except:
            continue
            
        # Comments
        y = i.find('a',{'data-event-action':'comments'})
        empty.append(y.text)
        
        empty.append(subreddit) # Subreddit
        
        sodos.append(empty)
        
    emptydict = []

    for i in sodos:

        emptydict.append(dict(zip(dictlist,i))) # This converts list of lists to dictionary 
# https://betterprogramming.pub/10-ways-to-convert-lists-to-dictionaries-in-python-d2c728d2aeb8

    return emptydict

In [4]:
# Create For Loop to loop through every 25 posts up to 1000 posts

final_dataframe = []

for i in range(0,1000,25):
     
    final_dataframe.extend(scraper_dictionary (f'https://old.reddit.com/r/ProCreate/?count={i}&after=t3_lrzhf4','Procreate'))
    
    sleep(11)

In [5]:
procreate = pd.DataFrame(final_dataframe)

> # 25
is the Maximum Number of Posts Beautiful Soup could scrape automatically due to Reddit's dynamic links which ensures each page link cannot be determined with a simple logic. 

In [6]:
procreate
# This did not work as Reddit utilizes random numbers to throw off webscraping page by page.
# So it just returned the same result 40 times.

Unnamed: 0,post,votes,comments,sub_reddit
0,Inumaki Toge’s first word || Jujutsu Kaisen an...,2,comment,Procreate
1,Help - Procreate freezes/stutters frequently o...,2,4 comments,Procreate
2,Random access memories in honor of Daft Punk. ...,2,3 comments,Procreate
3,i’ve really been enjoying this digital painter...,567,26 comments,Procreate
4,Favorite free brushes?,1,comment,Procreate
...,...,...,...,...
995,Finally feeling a bit more comfortable with Pr...,4,2 comments,Procreate
996,Close up portrait💜,9,comment,Procreate
997,Wanna go camping ?,3,3 comments,Procreate
998,if you delete the app is there still no way to...,1,2 comments,Procreate


# 2.0 Utilize Reddits API to Scrape<a name="reddit"></a>

An API to allow developers to build great products powered by Reddit because the developer community is integral to the success of the Reddit platform. The API is also to protect Reddit users’ privacy and security regardless of how they choose to consume Reddit content.

Source: https://www.reddit.com/wiki/api

In [7]:
import praw # This is a REDDIT Api created to Scrape Reddit pages officially

In [8]:
reddit = praw.Reddit(client_id='dD37vmEd4uZG0Q',client_secret='PEzHd6SVMQtz5FR1IvSuJGKANdgXSw',user_agent='project_3')
# I enter my reddit credentials

### Procreate New Posts
> This returns the latest 990 to 1000 posts in ProCreate subreddit

In [9]:
pc_new_posts = []

procreate = reddit.subreddit('ProCreate')

for pc_post in procreate.new(limit = 1000):
    
    pc_new_posts.append([pc_post.title, pc_post.score, pc_post.id, pc_post.subreddit, pc_post.url, pc_post.num_comments, pc_post.selftext, pc_post.created])
    
pc_new_posts = pd.DataFrame(pc_new_posts,columns=['title', 'score', 'id', 'subreddit', 'url', 'num_comments', 'body', 'created'])

In [10]:
len(pc_new_posts)

996

### Illustrator New Posts

> This returns the latest 990 to 1000 posts in Adobe Illustrator subreddit

In [11]:
ai_new_posts = []

ai = reddit.subreddit('AdobeIllustrator')

for post in ai.new(limit = 1000):
    
    ai_new_posts.append([post.title, post.score, post.id, post.subreddit, post.url, post.num_comments, post.selftext, post.created])
    
ai_new_posts = pd.DataFrame(ai_new_posts,columns=['title', 'score', 'id', 'subreddit', 'url', 'num_comments', 'body', 'created'])

In [12]:
len(ai_new_posts)

993

### Procreate Top Posts

> This returns the top 990 to 1000 posts in ProCreate subreddit

In [13]:
pc_top_posts = []

procreate = reddit.subreddit('ProCreate')

for pc_post in procreate.top(limit = 1000):
    
    pc_top_posts.append([pc_post.title, pc_post.score, pc_post.id, pc_post.subreddit, pc_post.url, pc_post.num_comments, pc_post.selftext, pc_post.created])
    
pc_top_posts = pd.DataFrame(pc_top_posts,columns=['title', 'score', 'id', 'subreddit', 'url', 'num_comments', 'body', 'created'])

In [14]:
len(pc_top_posts)

1000

### Adobe Illustrator Top Posts

> This returns the top 990 to 1000 posts in Adobe Illustrator subreddit

In [15]:
ai_top_posts = []

ai = reddit.subreddit('AdobeIllustrator')

for post in ai.top(limit = 1000):
    
    ai_top_posts.append([post.title, post.score, post.id, post.subreddit, post.url, post.num_comments, post.selftext, post.created])
    
ai_top_posts = pd.DataFrame(ai_top_posts,columns=['title', 'score', 'id', 'subreddit', 'url', 'num_comments', 'body', 'created'])

In [16]:
len(ai_top_posts)

999

### Procreate Hot Posts

> This returns the hottest 990 to 1000 posts in ProCreate subreddit

In [17]:
pc_hot_posts = []

procreate = reddit.subreddit('ProCreate')

for pc_post in procreate.top(limit = 1000):
    
    pc_hot_posts.append([pc_post.title, pc_post.score, pc_post.id, pc_post.subreddit, pc_post.url, pc_post.num_comments, pc_post.selftext, pc_post.created])
    
pc_hot_posts = pd.DataFrame(pc_hot_posts,columns=['title', 'score', 'id', 'subreddit', 'url', 'num_comments', 'body', 'created'])

In [18]:
len(pc_hot_posts)

1000

### Ilustrator Hot Posts

> This returns the top 990 to 1000 posts in Aodbe Illustrator subreddit

In [19]:
ai_hot_posts = []

ai = reddit.subreddit('AdobeIllustrator')

for post in ai.top(limit = 2000):
    
    ai_hot_posts.append([post.title, post.score, post.id, post.subreddit, post.url, post.num_comments, post.selftext, post.created])
    
ai_hot_posts = pd.DataFrame(ai_hot_posts,columns=['title', 'score', 'id', 'subreddit', 'url', 'num_comments', 'body', 'created'])

In [20]:
len(ai_hot_posts)

999

In [21]:
ai_hot_posts.head() # Inspect Posts

Unnamed: 0,title,score,id,subreddit,url,num_comments,body,created
0,I made an illustration during the COVID-19 loc...,2034,gwfbpd,AdobeIllustrator,https://i.redd.it/x38ytci5lv251.jpg,77,,1591297000.0
1,I did some linework first time in many years,1636,e89e61,AdobeIllustrator,https://i.redd.it/i3l3tsnwql341.png,32,,1575924000.0
2,Spongebob & Patrick Minimal portrait,1579,eza5r1,AdobeIllustrator,https://i.redd.it/iy28ekav54f41.jpg,60,,1580941000.0
3,Drew a Casio Baby-G.,1443,ibdnt0,AdobeIllustrator,https://i.redd.it/vrnv2jd19kh51.jpg,24,,1597698000.0
4,The designers at Snapchat be like,1441,cq7ydh,AdobeIllustrator,https://i.redd.it/rosbdsir8eg31.jpg,66,,1565808000.0


> # 1000
is the Maximum Number of Posts the Reddit API could scrape automatically due to Reddit's limit on its API. It was by far the simplest of the 3 scraping mechanisms to use but it did not give me enough data. My target was at least 5,000

In [22]:
ai_top_posts.head() # Inspect Posts

Unnamed: 0,title,score,id,subreddit,url,num_comments,body,created
0,I made an illustration during the COVID-19 loc...,2030,gwfbpd,AdobeIllustrator,https://i.redd.it/x38ytci5lv251.jpg,77,,1591297000.0
1,I did some linework first time in many years,1633,e89e61,AdobeIllustrator,https://i.redd.it/i3l3tsnwql341.png,32,,1575924000.0
2,Spongebob & Patrick Minimal portrait,1578,eza5r1,AdobeIllustrator,https://i.redd.it/iy28ekav54f41.jpg,60,,1580941000.0
3,Drew a Casio Baby-G.,1451,ibdnt0,AdobeIllustrator,https://i.redd.it/vrnv2jd19kh51.jpg,24,,1597698000.0
4,The designers at Snapchat be like,1446,cq7ydh,AdobeIllustrator,https://i.redd.it/rosbdsir8eg31.jpg,66,,1565808000.0


In [23]:
ai_new_posts.head() # Inspect Posts

Unnamed: 0,title,score,id,subreddit,url,num_comments,body,created
0,"I made this wallpaper for mobile devices, any ...",5,lvozjc,AdobeIllustrator,https://i.redd.it/fmh5wkkybik61.jpg,1,,1614673000.0
1,Best Technique to create similar characters? l...,1,lvmyv0,AdobeIllustrator,https://i.redd.it/3xrijcrowhk61.png,0,,1614668000.0
2,Beginner question here: how do I edit what thi...,1,lvlfys,AdobeIllustrator,https://i.redd.it/v4rjnx7clhk61.jpg,3,,1614664000.0
3,Is it possible to align to a selected Key Obje...,1,lvl9uy,AdobeIllustrator,https://www.reddit.com/r/AdobeIllustrator/comm...,0,Hey everyone! I use the align to key object fe...,1614663000.0
4,How do I convert line to shape without ending ...,1,lvhiq9,AdobeIllustrator,https://www.reddit.com/r/AdobeIllustrator/comm...,2,The image describes it best. I want to draw a ...,1614655000.0


## 3.0 Using PushShift API to Scrape<a name="push"></a>

The pushshift.io Reddit API was designed and created by the /r/datasets mod team to help provide enhanced functionality and search capabilities for searching Reddit comments and submissions. The project lead, /u/stuck_in_the_matrix, is the maintainer of the Reddit comment and submissions archives located at https://files.pushshift.io.

This RESTful API gives full functionality for searching Reddit data and also includes the capability of creating powerful data aggregations. With this API, you can quickly find the data that you are interested in and find fascinating correlations.

Source: https://github.com/pushshift/api

In [24]:
def subreddit_scrape (subreddit,start_date):
    
    url = 'https://api.pushshift.io/reddit/submission/search/' # My base link
    
    empty = [] # Empty string to tag on the Scraped Data
    
    while (start_date < 1614314325): # Set an Epoch Time Limit

        params = {
            'subreddit': subreddit,
            'size': 100,
            'after': start_date,
            'sort_type' : 'created_utc',
            'fields' : ['subreddit','title','selftext','created_utc','score','num_comments','subreddit_subscribers','full_link']
        } # Define Parameters to scrape from Reddit

        res = requests.get (url, params) # Combine parameters with URL

        dataseries = res.json() # Extract .json data from URL page
        
        empty.extend(dataseries['data']) # Add Data to empty list
        
        sleep(3) # Sleep for 3 seconds so we don't get flagged by Reddit
        
        if len(dataseries['data']) == 100: # This is to ensure there is no error 
        #occurs if less than 100 posts returned
        
            start_date = dataseries['data'][99]['created_utc']
            # Tag on the oldest date to our list and continue iterating
            
        else:
            
            break
        
    df = pd.DataFrame(empty) # Convert lists to dataframe

    return df

# https://pushshift.io/api-parameters/

In [25]:
procreate = subreddit_scrape ('ProCreate',1546306316) # Start Jan 2019

> Export PushShift API dataset to our Data Folder

In [26]:
# Write the DataFrame you created to a csv called 'procreate_finals.csv'
procreate.to_csv('/Users/macbook/Google Drive/0. Ofilispeaks Business (Mac and Cloud)/9. Data Science/0. Python/General Assembly Training/submissions/Projects/Project 3/data/0_Scraped_Data/procreate_final.csv', index=False)
print('Submission CSV is ready!')

Submission CSV is ready!


> # 28,000!
Is the maximum number of posts we were able to scrape with the PushShift API and this was for the ProCreate Reddit. It took the longest to execute of the three, but the reward for effort was worth it.

In [27]:
procreate.columns # Inspect Columns

Index(['created_utc', 'full_link', 'num_comments', 'score', 'selftext',
       'subreddit', 'subreddit_subscribers', 'title'],
      dtype='object')

In [28]:
procreate.shape # Inspect Shape

(28000, 8)

In [29]:
procreate[procreate['num_comments'] <= 2].shape # Check posts with comments less than 2

(18999, 8)

In [30]:
procreate[procreate['selftext'] == '[deleted]'].shape # Check posts with deleted self text

(474, 8)

In [31]:
adobeillustrator = subreddit_scrape('AdobeIllustrator',1546306316) # Start Jan 2019

> Export PushShift API dataset to our Data Folder

In [32]:
# Write the DataFrame you created to a csv called 'adobeillustrators.csv'
adobeillustrator.to_csv('/Users/macbook/Google Drive/0. Ofilispeaks Business (Mac and Cloud)/9. Data Science/0. Python/General Assembly Training/submissions/Projects/Project 3/data/0_Scraped_Data/adobeillustrator.csv', index=False)
print('Submission CSV is ready!')

Submission CSV is ready!


In [33]:
adobeillustrator.shape # Inspect Shape

(19882, 8)

In [34]:
adobeillustrator[adobeillustrator['num_comments'] <= 2].shape

(11930, 8)

In [35]:
adobeillustrator[adobeillustrator['selftext'] == '[deleted]'].shape

(314, 8)

“I am feeling 😊 today"