# Setup

Tutorial for reddit scraping: https://www.geeksforgeeks.org/scraping-reddit-using-python/

In [1]:
import praw
import pandas as pd
from praw.models import MoreComments
# from tqdm.notebook import tqdm
from tqdm import tqdm


# ids for scraping (from christians setup)
client_id = 'Ut5UgaAMOEWBELtYRWnw0g'
client_secret = '5xGs1w6mav5Ke685afpG28Q8nfusmg'
user_agent = 'polarity search'

# Scraping

First we initialize a read-only instance. A read-only instance can only scrape publicly available information and cannot upvote or otherwise interact like users can.

In [2]:
# Read-only instance
reddit_read_only = praw.Reddit(client_id=client_id,         # your client id
                               client_secret=client_secret,      # your client secret
                               user_agent=user_agent)        # your user agent

## Getting comments on a specific post

This code scrapes over the comments of a specified post. It looks only at the top-level comments (none of the replies to comments).

In [3]:
def scrape_post(url, n=100):
    '''given a url:
     - scrapes n number of top-level comments
     - comment author (username) and comment body ()
     - outputs a pandas dataframe'''

    # Creating a submission object
    submission = reddit_read_only.submission(url=url)
    
    # should get all top level comments on the post
    #if all_comments==True:
    #    submission.comments.replace_more(limit=None)

    post_authors = []
    post_comments = []

    # specifying how many times we should "load more comments"
    limit = round(n/20) # n/20 ensures we hit load more enough times to scrape the number of comments we want 
    submission.comments.replace_more(limit=limit, threshold=0)

    count = 0 # tracking how many comments have been scraped
    for comment in submission.comments: # iterates only over top level comments
        # stops counting when n amount of comments have been scraped
        if count == n:
            break

        if type(comment) == MoreComments:
            continue

        post_authors.append(comment.author)
        post_comments.append(comment.body)

        count += 1

    post_dict = {'author': post_authors, 'comment': post_comments}
    post_df = pd.DataFrame(post_dict)
    
    return post_df

### Examples of how scraping a post works

In [4]:
# scraping a post without much content - takes <1 second
# there are 12 top level comments
df = scrape_post("https://www.reddit.com/r/MaraudersGame/comments/ylxsq4/marauders_be_like/")

print(df.shape)
df.head()

(12, 2)


Unnamed: 0,author,comment
0,Lozsta,Why is there not a toggle to turn that off. I ...
1,OpossumHades,...that destroyed ÖRTH
2,JEClockwork,For 70 years we have long lived in the shadows...
3,l3lNova,Ok but real talk that movie was wack
4,sw4mpy_1,Well no more!!!!


In [5]:
# scraping a decently sized post (function default scrapes 100 comments) - takes ~10 seconds
# df = scrape_post("https://www.reddit.com/r/politics/comments/1092xhl/the_american_public_no_longer_believes_the/")

# print(df.shape)
# df.head()

In [6]:
# # scraping 200 comments from a decently sized post (post has 3.8k comments) - takes ~15 seconds
# df = scrape_post("https://www.reddit.com/r/politics/comments/1092xhl/the_american_public_no_longer_believes_the/", n=200)

# print(df.shape)
# df.head()

In [7]:
# aidan: idk what below line was so i just commented it out
# praw.models.reddit.submission.Submission?

## Getting top month posts on specified subreddit
This code grabs the top 100 posts of the past month and saves various information on them into a dictionary

In [8]:
def scrape_top_month(subreddit, ppsr=100):
    # specifying subreddit
    subreddit = reddit_read_only.subreddit(subreddit)

    # Specifying to look at top posts of the current month
    posts = subreddit.top("month", limit=ppsr)

    # Initializing dictionary to save post data to
    posts_dict = {"Title": [], "Post Text": [],
                  "ID": [], "Score": [],
                  "Total Comments": [], "Post URL": [], 'Post_author' : []
                  }

    # Loop for saving post details
    for post in posts:
        # print(post)
        # Title of each post
        posts_dict["Title"].append(post.title)

        # Text inside a post
        posts_dict["Post Text"].append(post.selftext)

        # Unique ID of each post
        posts_dict["ID"].append(post.id)

        # The score of a post
        posts_dict["Score"].append(post.score)

        # Total number of comments inside the post
        posts_dict["Total Comments"].append(post.num_comments)

        # Author of the post
        posts_dict['Post_author'].append(post.author)

        # URL of each post
        # print('https://www.reddit.com'+f'{post.permalink}')
        posts_dict["Post URL"].append('https://www.reddit.com'+f'{post.permalink}')
        
    return posts_dict

### Examples of how scraping top month posts works

In [9]:
dict_ = scrape_top_month('politics')

Call this function with 'time_filter' as a keyword argument.
  posts = subreddit.top("month", limit=ppsr)


In [10]:
# post samples
print(dict_['Title'][0])
print(dict_['Post Text'][0])
print(dict_['ID'][0])
print(dict_['Score'][0])
print(dict_['Total Comments'][0])
print(dict_['Post URL'][0])
print(len(dict_['Title']))
print(dict_['Post_author'][0])

Ocasio-Cortez calls for Thomas impeachment after report of undisclosed gifts from GOP donor

12dna0j
103880
3411
https://www.reddit.com/r/politics/comments/12dna0j/ocasiocortez_calls_for_thomas_impeachment_after/
100
Gato1980


In [11]:
# dict_ = scrape_top_month('politics', ppsr=150)

In [12]:
# # post samples
# print(dict_['Title'][0])
# print(dict_['Post Text'][0])
# print(dict_['ID'][0])
# print(dict_['Score'][0])
# print(dict_['Total Comments'][0])
# print(dict_['Post URL'][0])
# print(len(dict_['Title']))
# print(dict_['Post_author'][0])

## Getting comments on top monthly posts on multiple subreddits

Data we are keeping and why:
 - post_title: to embed and become node values of our users
 - post_url: in case we ever want to visit the post for any inspection reasons
 - comment_author: to keep track of who left the comment
 - comment_text: to analyse the sentiment of the comment towards the post
 - post_author: to track who made the post

In [13]:
def scrape_multiple_save(subreddits, ppsr=100, n=100, save=False, destination=''):
    '''scrapes and saves subreddit comments to csv files
     - subreddits: list of subreddit name strings, fx Politics
     - ppsr: number of posts per subbreddit to scrape
     - n: number of comments per post to scrape'''
    
    if save==False:
        print('WARNING: save=True has not been specified!')

    print(f'Scraping {ppsr} posts per subreddit and {n} comments per post')
    
    # looping through subreddits
    for subreddit in subreddits:
        print(f'Scraping r/{subreddit}...')
        
        # initialize dictionary for saving all comments and post info
        sub_dict = {'post_title': [],
                    # 'post_text': [],
                    # 'post_id': [],
                    # 'post_score': [],
                    # 'post_total_comments': [],
                    'post_url': [],
                    'comment_author': [],
                    'comment_text': [], 
                    'post_author' : []}
        
        posts_dict = scrape_top_month(subreddit, ppsr) # getting top of the month post info
        
        # looping through posts
        for idx, url in tqdm(enumerate(posts_dict['Post URL']),):
            
            # df for comments on the post
            comment_df = scrape_post(url, n=n)
            
            # looping through comments on post and appending all comment info to sub_dict
            for row_idx, row in comment_df.iterrows():
                sub_dict['post_title'].append(posts_dict['Title'][idx])
                # sub_dict['post_text'].append(posts_dict['Post Text'][idx])
                # sub_dict['post_id'].append(posts_dict['ID'][idx])
                # sub_dict['post_score'].append(posts_dict['Score'][idx])
                # sub_dict['post_total_comments'].append(posts_dict['Total Comments'][idx])
                sub_dict['post_url'].append(posts_dict['Post URL'][idx])
                sub_dict['comment_author'].append(row['author'])
                sub_dict['comment_text'].append(row['comment'])
                sub_dict['post_author'].append(posts_dict['Post_author'][idx])
            
        # changing sub_dict to pandas dataframe
        global sub_df
        sub_df = pd.DataFrame.from_dict(sub_dict)

        # saving to csv
        if save==True:
            sub_df.to_csv(f'{destination}{subreddit}.csv', index=False)
        
    print('Done!')

    if save==True:
        return None
    else:
        return sub_df

### How to use function

Below line shows how to scrape r/politics, scraping 200 posts and 100 comments from each post.

It won't save the dataframe unless save=True is specified, so that it can be tested without overwriting anything.

It takes ~ 30-40 minutes to run as it's scraping 200*100 = 20,000 comments. It won't end up actually being 20,000 comments however, probably because 200 posts down there will be less than 100 top-level comments.

In [14]:
# takes ~ 40 minutes to run (it's scraping 200*100 = 20,000 comments - though it wont be that many)
#df_test = scrape_multiple_save(['politics'], ppsr=200, n=100)

In [15]:
#print(df_test.shape)
#df_test.head()

In [16]:
#testy = df_test.copy()

### Running scrapes - ONLY RUN IF WILLING TO LEAVE RUNNING FOR HOURS

Make sure to specify destination folder before running. Example destination: destination='../data/28feb/scrapes/'

Also make sure that the folder destination exists before running, I'm not 100% sure if it is necessary, but it's better to be sure that the dataframe is saveable instead of running the code for an hour and getting a "destination doesn't exist" error...

In [17]:
# ~65 mins to run, finds 33k comments
#scrape_multiple_save(['politics'], ppsr=200, n=200, save=True, destination='../data/19march/scrapes/')

In [18]:
#df_politics = pd.read_csv('../data/19march/scrapes/politics.csv')
#df_politics.shape

In [19]:
# ~38 mins to run, finds 28k comments
#scrape_multiple_save(['gaming'], ppsr=200, n=200, save=True, destination='../data/19march/scrapes/')

In [20]:
#df_gaming = pd.read_csv('../data/19march/scrapes/gaming.csv')
#df_gaming.shape

In [21]:
# ~28 mins to run, finds 28k comments
#scrape_multiple_save(['EscapefromTarkov'], ppsr=200, n=200, save=True, destination='../data/19march/scrapes/')

In [22]:
#df_tarkov = pd.read_csv('../data/19march/scrapes/EscapefromTarkov.csv')
#df_tarkov.shape

In [23]:
# ~7 mins to run, ~6k comments
#scrape_multiple_save(['HuntShowdown'], ppsr=200, n=200, save=True, destination='../data/19march/scrapes/')

In [24]:
#df_HuntShowdown = pd.read_csv('../data/19march/scrapes/HuntShowdown.csv')
#df_HuntShowdown.shape

# Scraping even larger datasets (Gonna leave pc running for a long time)
-Chris 

In [25]:
#scrape_multiple_save(['politics'], ppsr=1000, n=200, save=True, destination='../data/23march_chur/scrape/')

In [26]:
#scrape_multiple_save(['gaming'], ppsr=1000, n=200, save=True, destination='../data/23march_chur/scrape/')

In [28]:
scrape_multiple_save(['FIFA'], ppsr=1000, n=200, save=True, destination='../data/date_folders/april_18/scrape/')

Scraping 1000 posts per subreddit and 200 comments per post
Scraping r/FIFA...


Call this function with 'time_filter' as a keyword argument.
  posts = subreddit.top("month", limit=ppsr)
999it [36:47,  2.21s/it]


Done!


In [29]:
scrape_multiple_save(['CallOfDuty'], ppsr=1000, n=200, save=True, destination='../data/date_folders/april_18/scrape/')

Scraping 1000 posts per subreddit and 200 comments per post
Scraping r/CallOfDuty...


Call this function with 'time_filter' as a keyword argument.
  posts = subreddit.top("month", limit=ppsr)
397it [09:46,  1.48s/it]

Done!



