# Setup

Tutorial for Reddit scraping: https://www.geeksforgeeks.org/scraping-reddit-using-python/

In [1]:
import praw
import pandas as pd
from praw.models import MoreComments
# from tqdm.notebook import tqdm
from tqdm import tqdm


# ids for scraping (from Christian's setup)
client_id = 'Ut5UgaAMOEWBELtYRWnw0g'
client_secret = '5xGs1w6mav5Ke685afpG28Q8nfusmg'
user_agent = 'polarity search'
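
Hardcoding the client id and secret in a notebook makes them easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the credentials are exported as environment variables. The variable names `REDDIT_CLIENT_ID` and `REDDIT_CLIENT_SECRET` and the helper `load_credentials` are hypothetical and not part of PRAW.

```python
import os


def load_credentials(env=None):
    """Return (client_id, client_secret) read from environment variables.

    The variable names are an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    required = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET")
    # Collect every missing or empty variable so the error names all of them.
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return env["REDDIT_CLIENT_ID"], env["REDDIT_CLIENT_SECRET"]
```

With the variables set, `client_id, client_secret = load_credentials()` could stand in for the literal assignments above.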

# Scraping

First we initialize a read-only instance. A read-only instance can only scrape publicly available information and cannot upvote or otherwise interact like users can.

In [2]:
# Read-only instance
reddit_read_only = praw.Reddit(client_id=client_id,         # your client id
                               client_secret=client_secret,      # your client secret
                               user_agent=user_agent)        # your user agent

Version 7.6.1 of praw is outdated. Version 7.7.0 was released Saturday February 25, 2023.


## Getting comments on a specific post

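This code scrapes the comments of a specified post. It looks only at top-level comments (none of the replies). By default it returns only the comments loaded with the first request (112 in one run); the rest sit behind `MoreComments` placeholders, which the function skips unless `all_comments=True` is passed.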

In [3]:
def scrape_post(url, all_comments=False):
    # Creating a submission object
    submission = reddit_read_only.submission(url=url)
    
    # Expand every MoreComments placeholder so no comments are left unloaded
    if all_comments:
        submission.comments.replace_more(limit=None)

    post_authors = []
    post_comments = []

    for comment in submission.comments:
        # Skip any placeholders that were not expanded
        if isinstance(comment, MoreComments):
            continue

        post_authors.append(comment.author)
        post_comments.append(comment.body)

    post_dict = {'author': post_authors, 'comment': post_comments}
    post_df = pd.DataFrame(post_dict)
    
    return post_df
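
`scrape_post` reads only top-level comments. If replies are wanted as well, the comment tree has to be walked. Below is a minimal sketch of a depth-first walk, assuming each comment exposes `body` and `replies` the way PRAW's `Comment` does. `FakeComment` is a stand-in so the sketch runs without network access, and `iter_comments` is a hypothetical helper. PRAW itself also offers `submission.comments.list()` for a flattened view.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class FakeComment:
    """Stand-in for a PRAW comment (hypothetical, for illustration only)."""
    body: str
    replies: List["FakeComment"] = field(default_factory=list)


def iter_comments(comments, depth=0) -> Iterator[Tuple[int, str]]:
    """Yield (depth, body) for every comment, each parent before its replies."""
    for comment in comments:
        if not hasattr(comment, "body"):
            # Placeholders such as MoreComments carry no body; skip them.
            continue
        yield depth, comment.body
        yield from iter_comments(getattr(comment, "replies", []), depth + 1)


thread = [
    FakeComment("top 1", [FakeComment("reply 1a", [FakeComment("reply 1a-i")])]),
    FakeComment("top 2"),
]
flat = list(iter_comments(thread))
```

Keeping the depth alongside the text makes it possible to filter back down to top-level comments later (`depth == 0`) without scraping again.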

In [5]:
df = scrape_post("https://www.reddit.com/r/MaraudersGame/comments/ylxsq4/marauders_be_like/")

In [6]:
praw.models.reddit.submission.Submission?

Init signature:
praw.models.reddit.submission.Submission(
    reddit: 'praw.Reddit',
    id: Optional[str] = None,
    url: Optional[str] = None,
    _data: Optional[Dict[str, Any]] = None,
)
Docstring:
A class for submissions to Reddit.

.. include:: ../../typical_attributes.rst

Attribute                  Description
``author``                 Provides an instance of

In [7]:
print(df.shape)
df.head()

(12, 2)


| | author | comment |
|---|---|---|
| 0 | Lozsta | Why is there not a toggle to turn that off. I ... |
| 1 | OpossumHades | ...that destroyed ÖRTH |
| 2 | JEClockwork | For 70 years we have long lived in the shadows... |
| 3 | l3lNova | Ok but real talk that movie was wack |
| 4 | sw4mpy_1 | Well no more!!!! |


In [42]:
df = scrape_post("https://www.reddit.com/r/politics/comments/1092xhl/the_american_public_no_longer_believes_the/")

In [43]:
print(df.shape)
df.head()

(109, 2)


| | author | comment |
|---|---|---|
| 0 | AutoModerator | \nAs a reminder, this subreddit [is for civil ... |
| 1 | romacopia | Because there's no reason to think they're imp... |
| 2 | downwardspiralstairs | Oh, were we supposed to believe that it's impa... |
| 3 | 2FalseSteps | They're not wrong. |
| 4 | SmackEh | When the wife of a Supreme Court Justice atten... |


## Getting top month posts on specified subreddit

This code grabs the top `ppsr` posts of the past month (100 by default) and saves each post's title, text, ID, score, comment count, URL, and author into a dictionary of lists.

In [4]:
def scrape_top_month(subreddit, ppsr=100):
    # specifying subreddit
    subreddit = reddit_read_only.subreddit(subreddit)

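    # Top posts of the past month. PRAW's deprecation warning (visible in the
    # output below) asks for the keyword form: subreddit.top(time_filter="month", limit=ppsr)
    posts = subreddit.top("month", limit=ppsr)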

    # Initializing dictionary to save post data to
    posts_dict = {"Title": [], "Post Text": [],
                  "ID": [], "Score": [],
                  "Total Comments": [], "Post URL": [], 'Post_author' : []
                  }

    # Loop for saving post details
    for post in posts:
        # print(post)
        # Title of each post
        posts_dict["Title"].append(post.title)

        # Text inside a post
        posts_dict["Post Text"].append(post.selftext)

        # Unique ID of each post
        posts_dict["ID"].append(post.id)

        # The score of a post
        posts_dict["Score"].append(post.score)

        # Total number of comments inside the post
        posts_dict["Total Comments"].append(post.num_comments)

        # print(post.author)
        # Author of the post
        posts_dict['Post_author'].append(post.author)

        # URL of each post
        # print('https://www.reddit.com'+f'{post.permalink}')
        posts_dict["Post URL"].append(f'https://www.reddit.com{post.permalink}')
        
    return posts_dict
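
The dictionary returned above is column-oriented: each key maps to a list, and position `i` across all lists describes one post. Here is a small sketch of turning that layout into one record per post, using only the standard library. `to_records` is a hypothetical helper, and the sample values are made up for illustration.

```python
def to_records(columns):
    """Convert a dict of equal-length lists into a list of per-row dicts."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("columns have different lengths")
    keys = list(columns)
    # zip(*...) pairs up the i-th entry of every column into one row.
    return [dict(zip(keys, row)) for row in zip(*columns.values())]


# Illustrative values only, not scraped data.
sample = {"Title": ["example title A", "example title B"], "Score": [10, 7]}
records = to_records(sample)
```

The length check guards against a column being skipped for one post, which would otherwise silently misalign every later row.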

In [5]:
dict_ = scrape_top_month('politics')

Call this function with 'time_filter' as a keyword argument.
  posts = subreddit.top("month", limit=ppsr)


In [26]:
# post samples
print(dict_['Title'][0])
print(dict_['Post Text'][0])
print(dict_['ID'][0])
print(dict_['Score'][0])
print(dict_['Total Comments'][0])
print(dict_['Post URL'][0])
print(len(dict_['Title']))
print(dict_['Post_author'][0])

Bernie Sanders says it's time for a four-day work week

118jfd5
94379
4280
https://www.reddit.com/r/politics/comments/118jfd5/bernie_sanders_says_its_time_for_a_fourday_work/
100
Picture-unrelated


In [27]:
dict_ = scrape_top_month('politics', ppsr=150)

Call this function with 'time_filter' as a keyword argument.
  posts = subreddit.top("month", limit=ppsr)


In [28]:
# post samples
print(dict_['Title'][0])
print(dict_['Post Text'][0])
print(dict_['ID'][0])
print(dict_['Score'][0])
print(dict_['Total Comments'][0])
print(dict_['Post URL'][0])
print(len(dict_['Title']))
print(dict_['Post_author'])

Bernie Sanders says it's time for a four-day work week

118jfd5
94371
4280
https://www.reddit.com/r/politics/comments/118jfd5/bernie_sanders_says_its_time_for_a_fourday_work/
150
[Redditor(name='Picture-unrelated'), Redditor(name='hopopo'), Redditor(name='newnemo'), Redditor(name='Picture-unrelated'), Redditor(name='GDPisnotsustainable'), Redditor(name='mdj1359'), Redditor(name='theindependentonline'), Redditor(name='CapitalCourse'), Redditor(name='Ozymandias_a'), Redditor(name='Beckles28nz'), Redditor(name='Hot-Bint'), Redditor(name='Gari_305'), Redditor(name='CapitalCourse'), Redditor(name='slaysia'), Redditor(name='LuvKrahft'), Redditor(name='cool_name52'), Redditor(name='AreYouPurple'), Redditor(name='Picture-unrelated'), Redditor(name='LudovicoSpecs'), Redditor(name='semaphore-1842'), Redditor(name='HauntingJackfruit'), Redditor(name='jonfla'), Redditor(name='bildo72'), Redditor(name='joyfullypresent'), Redditor(name='boregon'), Redditor(name='slaysia'), Redditor(name='LieutJimDan
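
The list above holds `Redditor` objects rather than plain strings, and PRAW reports the author as `None` when the account has been deleted. Below is a minimal sketch of normalising authors to names before saving, assuming each author object exposes a `name` attribute as `Redditor` does. `author_name` is a hypothetical helper, and `SimpleNamespace` stands in for `Redditor` so the sketch runs offline.

```python
from types import SimpleNamespace


def author_name(author, missing="[deleted]"):
    """Return the author's name, or a placeholder when the author is None."""
    return missing if author is None else author.name


# SimpleNamespace objects stand in for praw Redditor instances.
authors = [SimpleNamespace(name="Picture-unrelated"), None]
names = [author_name(a) for a in authors]
```

Storing plain strings also keeps the resulting CSV independent of how PRAW chooses to print its objects.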

## Getting comments on top monthly posts on multiple subreddits

In [12]:
def scrape_multiple_save(subreddits, ppsr=100, all_comments=False):
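    '''Scrapes top-level comments on the top monthly posts of each subreddit.
       The frame for the last subreddit processed is exposed as the global
       sub_df; the to_csv call below is commented out, so nothing is written
       to disk by this function. (The original naming convention,
       SUBREDDIT_POSTID.csv / SUBREDDIT_POSTID_INFO.txt, is not implemented here.)'''
    
    
    if not all_comments:
        print(f'Scraping {ppsr} posts per subreddit and ~100 comments per post')
    else:
        print(f'Scraping {ppsr} posts per subreddit and all comments per post')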
    
    # looping through subreddits
    for subreddit in subreddits:
        print(f'Scraping r/{subreddit}...')
        
        # initialize dictionary for saving all comments and post info
        sub_dict = {'post_title': [],
                    'post_text': [],
                    'post_id': [],
                    'post_score': [],
                    'post_total_comments': [],
                    'post_url': [],
                    'comment_author': [],
                    'comment_text': [], 
                    'post_author' : []}
        
        posts_dict = scrape_top_month(subreddit, ppsr) # getting top of the month post info
        
        # looping through posts
        for idx, url in tqdm(enumerate(posts_dict['Post URL'])):
            
            # df for comments on the post
            comment_df = scrape_post(url, all_comments=all_comments)
            
            # looping through comments on post and appending all comment info to sub_dict
            for row_idx, row in comment_df.iterrows():
                sub_dict['post_title'].append(posts_dict['Title'][idx])
                sub_dict['post_text'].append(posts_dict['Post Text'][idx])
                sub_dict['post_id'].append(posts_dict['ID'][idx])
                sub_dict['post_score'].append(posts_dict['Score'][idx])
                sub_dict['post_total_comments'].append(posts_dict['Total Comments'][idx])
                sub_dict['post_url'].append(posts_dict['Post URL'][idx])
                sub_dict['comment_author'].append(row['author'])
                sub_dict['comment_text'].append(row['comment'])
                sub_dict['post_author'].append(posts_dict['Post_author'][idx])
            
        # changing sub_dict to pandas dataframe
        global sub_df
        sub_df = pd.DataFrame.from_dict(sub_dict)

        # saving to csv
        #sub_df.to_csv(f'../data/28feb/scrapes/{subreddit}.csv', index=False)
        
    print('Done!')
    return None

In [13]:
scrape_multiple_save(['politics'])

Scraping 100 posts per subreddit and ~100 comments per post
Scraping r/politics...


Call this function with 'time_filter' as a keyword argument.
  posts = subreddit.top("month", limit=ppsr)
100it [10:04,  6.04s/it]

Done!





In [14]:
sub_df

| | post_title | post_text | post_id | post_score | post_total_comments | post_url | comment_author | comment_text | post_author |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Bernie Sanders says it's time for a four-day w... | | 118jfd5 | 94375 | 4280 | https://www.reddit.com/r/politics/comments/118... | AutoModerator | \nAs a reminder, this subreddit [is for civil ... | Picture-unrelated |
| 1 | Bernie Sanders says it's time for a four-day w... | | 118jfd5 | 94375 | 4280 | https://www.reddit.com/r/politics/comments/118... | AgentM44 | Life-changing. Switched to 4-day week about 4 ... | Picture-unrelated |
| 2 | Bernie Sanders says it's time for a four-day w... | | 118jfd5 | 94375 | 4280 | https://www.reddit.com/r/politics/comments/118... | Picture-unrelated | >> This isn't the first time a four-day work w... | Picture-unrelated |
| 3 | Bernie Sanders says it's time for a four-day w... | | 118jfd5 | 94375 | 4280 | https://www.reddit.com/r/politics/comments/118... | ContentSeal | Ill leave my job for any that has 4 day work w... | Picture-unrelated |
| 4 | Bernie Sanders says it's time for a four-day w... | | 118jfd5 | 94375 | 4280 | https://www.reddit.com/r/politics/comments/118... | ViennettaLurker | I've been seeing and hearing so much about 4 d... | Picture-unrelated |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 13545 | Democrat who nearly unseated Boebert launches ... | | 1128xx4 | 13054 | 316 | https://www.reddit.com/r/politics/comments/112... | spunkypudding | "name is Adam Frisch not Nancy Pelosi" | newnemo |
| 13546 | Democrat who nearly unseated Boebert launches ... | | 1128xx4 | 13054 | 316 | https://www.reddit.com/r/politics/comments/112... | UConnUser92 | 10 bucks to Adam Frisch coming in from NY! | newnemo |
| 13547 | Democrat who nearly unseated Boebert launches ... | | 1128xx4 | 13054 | 316 | https://www.reddit.com/r/politics/comments/112... | BickNickerson | Time to send Frisch another campaign donation. | newnemo |
| 13548 | Democrat who nearly unseated Boebert launches ... | | 1128xx4 | 13054 | 316 | https://www.reddit.com/r/politics/comments/112... | Ivorcomment | All Frisch needs is to find four hundred and s... | newnemo |


In [15]:
sub_df.to_csv('../data/28feb/scrapes/politics.csv', index=False)
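
The cell above writes the frame under a hardcoded file name. When several subreddits are scraped, the name could instead be derived from the subreddit. Here is a sketch of that idea using the standard `csv` module so it runs without pandas. `save_rows` and its `out_dir` argument are hypothetical, and the rows are assumed to be dictionaries sharing the same keys.

```python
import csv
from pathlib import Path


def save_rows(rows, subreddit, out_dir):
    """Write a list of dict rows to <out_dir>/<subreddit>.csv and return the path."""
    if not rows:
        raise ValueError("no rows to save")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{subreddit}.csv"
    # newline="" lets the csv module control line endings itself.
    with path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    return path
```

With pandas, the equivalent inside the loop would be the commented-out `sub_df.to_csv(f'../data/28feb/scrapes/{subreddit}.csv', index=False)` line from `scrape_multiple_save`.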