<a href="https://colab.research.google.com/github/jacomyma/mapping-controversies/blob/main/notebooks/Scrape_Reddit_posts_%26_comments.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🍯 Scrape Reddit posts & comments

# Import required libraries

In [None]:
!pip install praw

import requests
import json
import re
import time
import csv
import praw
import datetime as dt
import pandas as pd

# Retrieve submission IDs
In this cell you retrieve the submission IDs from a given subreddit. You can specify the subreddit and the query at the bottom of the cell, where the function is called.

The output of this cell is a file called loop.csv, which you can find in the file folder in the sidebar. loop.csv contains all submission IDs matching your query.

**If you want to use several different queries, save a copy of loop.csv after each run and combine the copies afterward! The cell opens loop.csv in append mode, so a new query should add its IDs to the existing list, but as a precaution make sure to back up loop.csv between runs.**
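
Combining the saved copies could look like the sketch below. It is only an illustration: the helper name `combine_id_files` is made up for this example, and it assumes each file holds one ID per line with no header row, as loop.csv does.

```python
import csv

def combine_id_files(paths, out_path):
    """Merge several one-column ID files into one, dropping duplicates."""
    seen = set()
    combined = []
    for path in paths:
        with open(path, newline="") as f:
            for row in csv.reader(f):
                # Skip blank lines and IDs that were already collected
                if row and row[0] not in seen:
                    seen.add(row[0])
                    combined.append(row[0])
    # Write the merged list back out, one ID per line
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f, lineterminator="\n")
        for submission_id in combined:
            writer.writerow([submission_id])
    return combined
```

For example, `combine_id_files(["loop_query1.csv", "loop_query2.csv"], "loop.csv")` would merge two backups (hypothetical file names) into a fresh loop.csv while keeping the order in which IDs first appear.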

In [None]:
PUSHSHIFT_REDDIT_URL = "https://api.pullpush.io/reddit"

def fetchObjects(**kwargs):
    # Default parameters for the API query
    params = {
        "sort_type":"created_utc",
        "sort":"asc",
        "size": 1000,
        }

    # Add additional parameters based on function arguments
    for key,value in kwargs.items():
        params[key] = value

    # Print API query parameters
    print(params)

    # Set the object type based on function input
    # The type can be "comment" or "submission", default is "comment"
    object_type = "comment"
    if 'type' in kwargs and kwargs['type'].lower() == "submission":
        object_type = "submission"

    # Perform an API request
    r = requests.get(PUSHSHIFT_REDDIT_URL + "/" + object_type + "/search/", params=params, timeout=30)

    # If the request succeeded, return the results sorted by ID
    if r.status_code == 200:
        response = json.loads(r.text)
        data = response['data']
        sorted_data_by_id = sorted(data, key=lambda x: int(x['id'],36))
        return sorted_data_by_id

    # Otherwise return an empty list so the caller's loop ends cleanly
    print("Request failed with status code", r.status_code)
    return []

def extract_reddit_data(**kwargs):
    # Specify the start timestamp (Unix time in seconds)
    max_created_utc = 1672563620  # 2023-01-01 09:00:20 UTC - start with this
    max_id = 0

    # Open a file for CSV output; append mode keeps IDs from earlier runs
    with open('loop.csv','a') as f1:
        writer=csv.writer(f1, delimiter=',',lineterminator='\n',)

        # Keep requesting batches until one contains nothing new
        while True:
            nothing_processed = True
            # Fetch the next batch; fall back to an empty list if nothing is returned
            objects = fetchObjects(**kwargs,after=max_created_utc) or []

            # Loop over the returned data, sorted by ID
            for obj in objects:
                obj_id = int(obj['id'],36)
                if obj_id > max_id:
                    nothing_processed = False
                    created_utc = obj['created_utc']
                    max_id = obj_id
                    if created_utc > max_created_utc: max_created_utc = created_utc
                    # Write the object's ID to the opened file
                    row = [obj['id']]
                    writer.writerow(row)

            # Exit if nothing new was processed
            if nothing_processed: return
            # Step back one second so objects sharing the last timestamp are not skipped
            max_created_utc -= 1

            # Sleep a little before the next request
            time.sleep(1)

# Start program by calling function with:
# 1) Subreddit specified
# 2) The type of data required (comment or submission)
# 3) A query - if provided an empty string it will just extract
extract_reddit_data(subreddit="teachers",type="submission", q='artificial intelligence')
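
The start timestamp in the cell above is hard-coded as a Unix time in seconds. If you want to start from a different date, a small helper along these lines could compute the value for you. This is a sketch: the name `to_unix_timestamp` is made up for illustration, and it assumes you give the date in UTC.

```python
import datetime as dt

def to_unix_timestamp(year, month, day, hour=0, minute=0, second=0):
    """Return the Unix timestamp (whole seconds) for a moment given in UTC."""
    moment = dt.datetime(year, month, day, hour, minute, second,
                         tzinfo=dt.timezone.utc)
    return int(moment.timestamp())

# The hard-coded start value in the cell above corresponds to this moment
start = to_unix_timestamp(2023, 1, 1, 9, 0, 20)
print(start)  # 1672563620
```

You would then paste the printed number in place of the current value of `max_created_utc`.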

# Loading the list of IDs
Here we load loop.csv from the previous cell. If you want to manually curate a list of submissions, you can load your own file here instead. Save the curated list as a CSV with a single column of IDs. Note that the cell below names the column id itself (`names=["id"]`) and treats every line as data, so a header row in your file would be counted as one more ID.

After loading the file, the cell shows the total number of IDs in it.

In [None]:
data = pd.read_csv("loop.csv",
                   sep=',',
                   names=["id"])


len(data)

# Retrieving the content and comments of a Reddit submission
In this cell the actual content is retrieved based on the ID. From the submission we retrieve: the title, the upvote score, the ID, the URL, the number of comments, the UNIX timestamp, the text content, and the username.

In addition, the first-level comments of the submission are retrieved. It is possible to add a recursive loop to this, but I haven't had the time. Experiment on your own.

You might also have to create your own Reddit account and app if the account I have supplied becomes blocked. You can find information on this online.

First you need to decide on a limit for how many times it should load in more comments. This mainly matters when a post has a lot of comments. A limit of 0 loads in more comments zero times, a limit of 2 loads in more comments up to two times, and a limit of None keeps loading until all comments have been retrieved.
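
As a starting point for the recursive loop mentioned above, here is a minimal sketch. It assumes every comment exposes its children through a `replies` attribute, which is how PRAW comments behave once all "load more" placeholders have been resolved. The function name `walk_comments` and the stand-in tree are made up for illustration, so the example runs without a Reddit connection.

```python
from types import SimpleNamespace

def walk_comments(comments, depth=0):
    """Yield (depth, comment) for every comment in a tree, parents first."""
    for comment in comments:
        yield depth, comment
        # Each comment is assumed to expose its children as `replies`
        yield from walk_comments(comment.replies, depth + 1)

# Stand-in tree for illustration only; with PRAW you would pass submission.comments
reply = SimpleNamespace(id="c2", body="a reply", replies=[])
top = SimpleNamespace(id="c1", body="a top-level comment", replies=[reply])
other = SimpleNamespace(id="c3", body="another top-level comment", replies=[])

for depth, comment in walk_comments([top, other]):
    # Indent each comment ID by its depth in the thread
    print("  " * depth + comment.id)
```

If you do not need the depth, PRAW also offers `submission.comments.list()`, which returns all comments of the thread as one flat list.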

In [None]:
more_limit = None

reddit = praw.Reddit(client_id="hhxzoVnl0x0jlA",                      # client id
                     client_secret="wPJjQSLNIeMbtK3xilkrHdW3ftToCw",  # client secret
                     user_agent="scraper",                            # user agent name
                     username = "digitalscienceboi",                  # reddit username
                     password = "htHejLGDFLYVrX6",                    # reddit password
                     check_for_async=False)

post_dict = { "title":[],"score":[],"id":[], "url":[], "comms_num":[],"created":[],"body":[],"author":[]}
comment_dict = {"author":[],"score":[],"id":[],"created":[],"link_id":[],"body":[]}

# Loop over every submission ID loaded from loop.csv
for x in data['id']:
    submission = praw.models.Submission(reddit, id=x)
    post_dict["title"].append(submission.title)
    post_dict["score"].append(submission.score)
    post_dict["id"].append(submission.id)
    post_dict["url"].append(submission.url)
    post_dict["comms_num"].append(submission.num_comments)
    post_dict["created"].append(submission.created)
    post_dict["body"].append(submission.selftext)
    post_dict["author"].append(submission.author)
    comments = submission.comments
    # Resolve "load more comments" placeholders according to more_limit
    comments.replace_more(limit=more_limit)
    # Iterating the comment forest yields first-level comments only
    for comment in comments:
        comment_dict["author"].append(comment.author)
        comment_dict["score"].append(comment.score)
        comment_dict["id"].append(comment.id)
        comment_dict["created"].append(comment.created)
        comment_dict["link_id"].append(comment.link_id)
        comment_dict["body"].append(comment.body)


# Converting UNIX timestamp to datetime
In this cell the UNIX timestamp will be converted to a proper datetime (YYYY-MM-DD HH:MM:SS) that is more usable.
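
Note that `dt.datetime.fromtimestamp(created)` without a timezone returns the local time of the machine running the notebook. If you prefer timestamps that are explicitly in UTC, a variant like the following could be used instead. This is a sketch, and the name `get_date_utc` is made up for illustration.

```python
import datetime as dt

def get_date_utc(created):
    """Convert a Unix timestamp to a timezone-aware UTC datetime."""
    return dt.datetime.fromtimestamp(created, tz=dt.timezone.utc)

# Format the result as YYYY-MM-DD HH:MM:SS
print(get_date_utc(1672563620).strftime("%Y-%m-%d %H:%M:%S"))  # 2023-01-01 09:00:20
```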

This cell also outputs two files: one for submissions (submissions.csv) and one for comments (comments.csv). These can be merged if you want to build a semantic network.
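
To merge the two files you need a shared key. The `link_id` of a comment is the ID of its submission with the type prefix `t3_` in front, so stripping that prefix gives a value that matches the `id` column of submissions.csv. The sketch below shows the idea on made-up rows; the helper names are hypothetical and not part of the notebook.

```python
def submission_id_from_link_id(link_id):
    """Strip the 't3_' type prefix that Reddit puts in front of submission IDs."""
    return link_id[3:] if link_id.startswith("t3_") else link_id

def group_comments_by_submission(comment_rows):
    """Map each submission ID to the list of its comment rows."""
    grouped = {}
    for row in comment_rows:
        key = submission_id_from_link_id(row["link_id"])
        grouped.setdefault(key, []).append(row)
    return grouped

# Made-up rows shaped like comments.csv, for illustration only
rows = [
    {"id": "c1", "link_id": "t3_abc123", "body": "first"},
    {"id": "c2", "link_id": "t3_abc123", "body": "second"},
    {"id": "c3", "link_id": "t3_xyz789", "body": "third"},
]
grouped = group_comments_by_submission(rows)
print({key: len(value) for key, value in grouped.items()})
```

With pandas, the same stripped value could be stored as a new column of the comments table and used as the key in `DataFrame.merge`.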

In [None]:
def get_date(created):
    return dt.datetime.fromtimestamp(created)

post_out = pd.DataFrame(post_dict)
comment_out = pd.DataFrame(comment_dict)

post_timestamp = post_out['created'].apply(get_date)
comment_timestamp = comment_out['created'].apply(get_date)

post_out = post_out.assign(timestamp = post_timestamp)
comment_out = comment_out.assign(timestamp = comment_timestamp)

post_out.to_csv('submissions.csv')
comment_out.to_csv('comments.csv')