<a href="https://colab.research.google.com/github/linesn/reddit_analysis/blob/main/Reddit_Search_NL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Gathering Reddit political posts on Biden
*Nick Lines*

# Introduction

This is an adaptation of a notebook provided by the Discovery Lab. License details for my use follow.

This notebook lets the user download a corpus of recent Reddit posts and comments from a set of reference and political subreddits.

License
-------------
Developed by the Discovery Lab, Applied Intelligence Group, Accenture Federal Systems.

```
Copyright (c) 2020 Accenture Federal Systems.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
```


Purpose
-------------
This notebook collects posts and comments, along with their associated metadata, from the [Reddit](https://www.reddit.com/) social media platform. It uses the Reddit API through [PRAW](https://praw.readthedocs.io/en/latest/) and requires credentials.


Input
-------------
**Required Parameters**

- _**client\_id**_ (string) - A unique client id provided by Reddit.

- _**client\_secret**_ (string) - Secret associated with client id provided by Reddit.

- _**user\_agent**_ (string) - A unique user agent provided by Reddit.

- _**search\_terms**_ (array of strings) - Search terms.

- _**subreddits**_ (array of strings) - Names of subreddits to search from.


**Optional Parameters**

- _**post\_limit**_ (integer, default: 100, maximum: 1000) - Maximum number of posts to download.
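
As a rough illustration of how these parameters fit together, the sketch below bundles them into a small container and clamps `post_limit` to the documented maximum of 1000. `CollectionConfig` is a hypothetical name used only for this example, and the credential strings are placeholders; the notebook itself passes these values around as plain variables.

```python
from dataclasses import dataclass, field
from typing import List

MAX_POST_LIMIT = 1000  # documented maximum for post_limit


@dataclass
class CollectionConfig:
    """Hypothetical container for the notebook's input parameters."""
    client_id: str
    client_secret: str
    user_agent: str
    search_terms: List[str] = field(default_factory=list)
    subreddits: List[str] = field(default_factory=list)
    post_limit: int = 100  # documented default

    def __post_init__(self):
        # Keep post_limit within 1..MAX_POST_LIMIT.
        self.post_limit = max(1, min(int(self.post_limit), MAX_POST_LIMIT))


config = CollectionConfig(
    client_id="my-client-id",          # placeholder, not a real credential
    client_secret="my-client-secret",  # placeholder, not a real credential
    user_agent="my-user-agent",
    search_terms="biden".split(),
    subreddits="politics news".split(),
    post_limit=5000,                   # above the maximum, so it is clamped
)
print(config.post_limit)
```

Splitting the space-separated strings up front mirrors how the notebook later turns user input into lists of search terms and subreddits.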


Output
-------------
The outputs are two `CSV` files named `REDDIT_POSTS_{{DATETIME}}.csv` and `REDDIT_COMMENTS_{{DATETIME}}.csv`, where `{{DATETIME}}` is the approximate date/time of the data collection. These `CSV` files are saved in the `{{HOME}}/data/raw/Reddit` folder, where `{{HOME}}` is the project installation folder.
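
The naming scheme can be sketched as follows. This is a minimal example that assumes the same timestamp format the notebook uses when it writes the files (`%Y-%m-%dT%H-%M-%S%z`); `output_paths` is a hypothetical helper, and the fixed example time and directory are illustrative only.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path


def output_paths(raw_dir, when, prefix=""):
    """Return (posts_path, comments_path) for a collection run at `when`."""
    stamp = when.strftime("%Y-%m-%dT%H-%M-%S%z")
    raw_dir = Path(raw_dir)
    return (
        raw_dir / f"{prefix}REDDIT_POSTS_{stamp}.csv",
        raw_dir / f"{prefix}REDDIT_COMMENTS_{stamp}.csv",
    )


# Fixed example timestamp at UTC-5 so the result is reproducible.
example_time = datetime(2021, 1, 20, 12, 0, 0,
                        tzinfo=timezone(timedelta(hours=-5)))
posts_path, comments_path = output_paths("data/raw/Reddit", example_time)
print(posts_path.name)     # REDDIT_POSTS_2021-01-20T12-00-00-0500.csv
print(comments_path.name)  # REDDIT_COMMENTS_2021-01-20T12-00-00-0500.csv
```

Unlike this sketch, the notebook stamps the two files separately, so their timestamps can differ by the time it takes to download the comments.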

The columns of the output file `REDDIT_POSTS_{{DATETIME}}.csv` are as follows.

- _**author**_ (string) -  Provides an instance of `Redditor`.

- _**clicked**_ (binary) -  Whether or not the submission has been clicked by the client. 

- _**comments**_ (array of strings) -  Provides an instance of `CommentForest`. 

- _**created_utc**_ (datetime) - Time the submission was created, represented in Unix Time. 

- _**distinguished**_ (binary) - Whether or not the submission is distinguished. 

- _**edited**_ (binary) - Whether or not the submission has been edited. 

- _**id**_ (string) - ID of the submission. 

- _**is\_original_content**_ (binary) - Whether or not the submission has been set as original content. 

- _**is\_self**_ (binary) - Whether or not the submission is a selfpost (text-only). 

- _**link\_flair_template\_id**_ (string) - The link flair’s ID, or None if not flaired. 

- _**link\_flair\_text**_ (text) - The link flair’s text content, or None if not flaired. 

- _**locked**_ (binary) - Whether or not the submission has been locked. 

- _**name**_ (string) - Fullname of the submission. 

- _**num\_comments**_ (integer) - The number of comments on the submission. 

- _**over\_18**_ (binary) - Whether or not the submission has been marked as NSFW. 

- _**permalink**_ (string) - A permalink for the submission. 

- _**poll\_data**_ (object) - A PollData object representing the data of this submission, if it is a poll submission. 

- _**score**_ (integer) - The number of upvotes for the submission. 

- _**selftext**_ (text) - The submission’s selftext, or an empty string if it is a link post. 

- _**spoiler**_ (binary) - Whether or not the submission has been marked as a spoiler. 

- _**stickied**_ (binary) - Whether or not the submission is stickied. 

- _**subreddit**_ (string) - Provides an instance of Subreddit. 

- _**title**_ (text) - The title of the submission. 

- _**upvote\_ratio**_ (double) - The percentage of upvotes from all votes on the submission. 

- _**url**_ (string) - The URL the submission links to, or the permalink if a selfpost. 
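
Because `created_utc` is stored as Unix time (seconds since 1970-01-01 UTC), it usually needs converting before it is readable. A minimal sketch using only the standard library is shown here; `utc_from_unix` is a hypothetical helper and the timestamp is an arbitrary example value.

```python
from datetime import datetime, timezone


def utc_from_unix(seconds):
    """Convert a Unix timestamp (in seconds) to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


created = utc_from_unix(1611144000)
print(created.isoformat())  # 2021-01-20T12:00:00+00:00
```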


The columns of the output file `REDDIT_COMMENTS_{{DATETIME}}.csv` are as follows.

- _**author**_ (string) - Provides an instance of Redditor. 

- _**body**_ (text) -  The body of the comment, as Markdown.

- _**body\_html**_ (text) - The body of the comment, as HTML.

- _**created\_utc**_ (datetime) - Time the comment was created, represented in Unix Time. 

- _**distinguished**_ (binary) - Whether or not the comment is distinguished. 

- _**edited**_ (binary) - Whether or not the comment has been edited. 

- _**id**_ (string) - ID of the comment. 

- _**is\_submitter**_ (binary) - Whether or not the comment author is also the author of the submission. 

- _**link\_id**_ (string) - The submission ID that the comment belongs to. 

- _**parent\_id**_ (string) - The ID of the parent comment (prefixed with t1\_). If it is a top-level comment, this returns the submission ID instead (prefixed with t3\_). 

- _**permalink**_ (string) - A permalink for the comment. Comment objects from the inbox have a context attribute instead. 

- _**replies**_ (integer) - Provides an instance of CommentForest. 

- _**score**_ (integer) - The number of upvotes for the comment. 

- _**stickied**_ (binary) - Whether or not the comment is stickied. 
 
- _**submission**_ (string) - Provides an instance of the submission that the comment belongs to. 

- _**subreddit**_ (string) - Provides an instance of the subreddit that the comment belongs to. 

- _**subreddit\_id**_ (string) - The subreddit ID that the comment belongs to. 
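
The `t1_` and `t3_` prefixes described for `parent_id` make it possible to tell top-level comments from replies without another API call. The sketch below illustrates the idea; `split_fullname` and `is_top_level` are hypothetical helpers, and the IDs are made-up examples.

```python
# Prefixes described in the column list above.
KIND_BY_PREFIX = {"t1": "comment", "t3": "submission"}


def split_fullname(fullname):
    """Split a value such as 't3_abc123' into (kind, base_id)."""
    prefix, _, base_id = fullname.partition("_")
    if prefix not in KIND_BY_PREFIX or not base_id:
        raise ValueError(f"unrecognised fullname: {fullname!r}")
    return KIND_BY_PREFIX[prefix], base_id


def is_top_level(parent_id):
    """A comment is top-level when its parent is the submission itself."""
    return split_fullname(parent_id)[0] == "submission"


print(split_fullname("t1_gk2x9z"))  # ('comment', 'gk2x9z')
print(is_top_level("t3_l1abcd"))    # True
```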


# Setup

<p> The imports, function and class definitions, global variables, and system-dependent configuration are in this section. </p>

<p> The system-dependent configuration should be carefully reviewed and configured for each system (e.g., Linux vs. Windows, or the path of an external program) since the playbook will most likely fail without proper configuration. </p>

## Imports

In [None]:
# for scraping
try:
  from selenium import webdriver
  from selenium.common.exceptions import StaleElementReferenceException
  from selenium.webdriver.common.keys import Keys
  from selenium.webdriver.chrome.options import Options
  from selenium.webdriver.support.ui import WebDriverWait
except ImportError:
  !pip install selenium
  from selenium import webdriver
  from selenium.common.exceptions import StaleElementReferenceException
  from selenium.webdriver.common.keys import Keys
  from selenium.webdriver.chrome.options import Options
  from selenium.webdriver.support.ui import WebDriverWait

If you use synchronous PRAW, you may be flooded with warnings recommending asynchronous PRAW. We therefore silence the warnings here (not something you should usually do).
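
If silencing every warning for the whole session feels too broad, one alternative is to limit the suppression to a single call. The sketch below shows that pattern with the standard `warnings` module; `noisy_call` is a stand-in for any function that warns and `call_quietly` is a hypothetical helper. Whether a given library reports through `warnings` at all is an assumption to check, since some use `logging` instead.

```python
import warnings


def noisy_call():
    """Stand-in for a library call that emits a warning."""
    warnings.warn("consider the asynchronous client", UserWarning)
    return "ok"


def call_quietly(func):
    """Run func with warnings ignored, restoring the filters afterwards."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        return func()


result = call_quietly(noisy_call)
print(result)  # ok
```

Outside the `with` block the previous warning filters apply again, so unrelated warnings elsewhere in the notebook stay visible.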

In [None]:
# For Reddit
try:
  import praw
  from praw import Reddit
  from praw.models import MoreComments
except ImportError:
  !pip install praw
  import praw
  from praw import Reddit
  from praw.models import MoreComments  
import warnings
warnings.filterwarnings("ignore")

In [None]:
"""This cell imports necessary Python modules and performs initial configuration
"""

# Data manipulation libraries
import json
import pandas as pd 
import csv


# Visualization and Interaction
# import matplotlib.pyplot as plt
# plt.style.use('ggplot')
from IPython.display import set_matplotlib_formats, display, clear_output, HTML
set_matplotlib_formats('retina')
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot 
init_notebook_mode(connected=True)
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
from ipywidgets import VBox, HBox, Button, HTML, Label


# Computation libraries 
import numpy as np
import re
import random


# Graph analysis
# import networkx as nx
# import community


# System related
import io
import platform
from pathlib import Path
import os
from getpass import getpass
# from joblib import Parallel, delayed


# Datetime libraries
from datetime import datetime
import time
from pytz import timezone

# Scraping libraries

from bs4 import BeautifulSoup

# Logging
import logging 
logging.basicConfig(level=logging.INFO)

## Parameters

In [None]:
"""This cell defines global variables and parameters used throughout the playbook
"""

# Set this to True if you want to watch Selenium scrape pages
# WATCH_SCRAPING = True

# Set this to True if you want to use incognito mode
# USE_INCOGNITO = True

# Number of posts 
post_limit = 1000

# Directory the data is written to (set in System-dependent Configuration below)
#RAW_DATA_DIRECTORY = Path("../data/raw/Reddit/")

# Setup logging level
LOGGING_LEVEL = logging.INFO 
logging.basicConfig(level=LOGGING_LEVEL)

## Functions and Classes

In [None]:
"""This cell defines functions and classes used throughout the playbook
"""

def __init__(self, client_id, client_secret, user_agent, password):
    self.client_id = client_id
    self.client_secret = client_secret
    self.user_agent = user_agent
    self.password = password


def token(client_id, client_secret, user_agent):
    reddit = Reddit(client_id=client_id,
                         client_secret=client_secret,
                         user_agent=user_agent,
                         check_for_async=False,
                    )
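    # Note: constructing the client does not contact Reddit, so this check
    # always passes; invalid credentials only surface on the first request.
    if (reddit != False):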
        print("Successful token")
    else:
        print("Failed token")
    return reddit


def search_reddit(reddit, search_term, sort_type, time_limit, post_limit):
    """
    GRAB REDDIT POSTS BY SEARCH TERM
    search_term = any boolean search #https://www.reddit.com/dev/api/
    sort_type = 'relevance', 'hot', 'top', 'new', 'comments'
    time_limit = 'all', 'hour', 'day', 'week', 'month', 'year'
    post_limit = 1000 maximum
    """

    posts = []
    subreddit = reddit.subreddit("all")
    for post in subreddit.search(search_term, sort=sort_type, time_filter=time_limit, limit=post_limit):
        posts.append([post.subreddit, post.id, post.title, post.selftext, post.author, post.url, post.permalink,
                      post.num_comments, post.created, post.score, post.distinguished, post.is_original_content,
                      post.upvote_ratio, post.link_flair_text])
    posts = pd.DataFrame(posts,
                         columns=['subreddit', 'post_id', 'title', 'post_body', 'post_author', 'url', 'post_permalink',
                                  'num_comments', 'post_created', 'post_score', 'post_distinguished',
                                  'original_content', 'upvote_ratio', 'flair_text'])
    posts['post_created'] = pd.to_datetime(posts['post_created'], unit='s')
    posts['scrape_time'] = datetime.now()
    posts[['subreddit', 'post_id', 'title', 'post_author',
           'post_body', 'url', 'post_permalink', 'flair_text']] = posts[['subreddit', 'post_id', 'title', 'post_author',
                                                                         'post_body', 'url', 'post_permalink',
                                                                         'flair_text']].astype(str)
    return posts


def get_subreddit(reddit, sub, sort_type, time_limit, post_limit):
    '''
    GRAB REDDIT POSTS BY SUBREDDIT
    sub = subreddits
    sort_type = 'hot', 'top', 'new', 'gilded', 'rising', 'controversial'
    time_limit = 'all', 'hour', 'day', 'week', 'month', 'year'
    post_limit = 1000 maximum
    '''
    subreddit = reddit.subreddit(sub)
    posts = []
    # Route each sort type to its matching PRAW listing. Only 'top' and
    # 'controversial' accept a time filter.
    if sort_type == "top":
        listing = subreddit.top(time_filter=time_limit, limit=post_limit)
    elif sort_type == "controversial":
        listing = subreddit.controversial(time_filter=time_limit, limit=post_limit)
    elif sort_type == "hot":
        listing = subreddit.hot(limit=post_limit)
    elif sort_type == "rising":
        listing = subreddit.rising(limit=post_limit)
    elif sort_type == "new":
        listing = subreddit.new(limit=post_limit)
    else:
        # Unsupported sort types (e.g. 'gilded') return no posts.
        listing = []

    for post in listing:
        posts.append([post.subreddit, post.id, post.title, post.selftext, post.author, post.url, post.permalink,
                      post.num_comments, post.created, post.score, post.distinguished, post.is_original_content,
                      post.upvote_ratio, post.link_flair_text])

    posts = pd.DataFrame(posts,
                         columns=['subreddit', 'post_id', 'title', 'post_body', 'post_author', 'url', 'post_permalink',
                                  'num_comments', 'post_created', 'post_score', 'post_distinguished',
                                  'original_content', 'upvote_ratio', 'flair_text'])
    posts['post_created'] = pd.to_datetime(posts['post_created'], unit='s')
    posts['scrape_time'] = datetime.now()
    posts[['subreddit', 'post_id', 'title', 'post_author',
           'post_body', 'url', 'post_permalink', 'flair_text', ]] = posts[
        ['subreddit', 'post_id', 'title', 'post_author',
         'post_body', 'url', 'post_permalink', 'flair_text']].astype(str)
    return posts


def get_reddit_comments(reddit, post_id):
    submission = reddit.submission(id=post_id)
    comment = []
    for top_level in submission.comments:
        if isinstance(top_level, MoreComments):
            continue
        comment.append([top_level.subreddit, top_level.submission, top_level.id, top_level.parent_id, top_level.author,
                        top_level.permalink, top_level.body, top_level.created, top_level.score,
                        top_level.distinguished])
    comments = pd.DataFrame(comment, columns=['subreddit', 'post_id', 'comment_id', 'parent_id', 'comment_author',
                                              'comment_permalink', 'comment_body', 'comment_created', 'comment_score',
                                              'comment_distinguished'])
    comments['comment_created'] = pd.to_datetime(comments['comment_created'], unit='s')
    comments['scrape_time'] = datetime.now()
    comments[['subreddit', 'post_id', 'comment_id', 'parent_id',
              'comment_author', 'comment_permalink', 'comment_body']] = comments[
        ['subreddit', 'post_id', 'comment_id', 'parent_id',
         'comment_author', 'comment_permalink', 'comment_body']].astype(str)
    return comments

## System-dependent Configuration

In [None]:
"""This cell defines system-dependent configuration such as those different in Linux vs. Windows
"""
if 'COLAB_GPU' in os.environ: # a hacky way of determining if you are in colab.
  print("Notebook is running in colab")
  from google.colab import drive
  drive.mount("/content/drive")
  OUTPUT_DIR = "./drive/My Drive/Data/raw/"
  RAW_DATA_DIRECTORY = Path("./drive/My Drive/Data/raw/Reddit/")
  
else:
  # Get the system information from the OS
  PLATFORM_SYSTEM = platform.system()

  # Darwin is macOS
  if PLATFORM_SYSTEM == "Darwin":
      EXECUTABLE_PATH = Path("../dependencies/chromedriver")
  elif PLATFORM_SYSTEM == "Windows":
      EXECUTABLE_PATH = Path("../dependencies/chromedriver.exe")
  else:
      logging.critical("Chromedriver not found or Chromedriver is outdated...")
      exit()
  RAW_DATA_DIRECTORY = Path("../Data/raw/Reddit/")

os.makedirs(RAW_DATA_DIRECTORY, exist_ok=True)
tz = timezone('US/Eastern')    

Notebook is running in colab
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
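
The configuration cell detects Colab through the `COLAB_GPU` environment variable, which its own comment calls hacky. Another check sometimes used is to look for the `google.colab` package among the modules that are already imported. The sketch below assumes Colab preloads that package, which is worth verifying in your own runtime; `running_in_colab` is a hypothetical helper.

```python
import sys


def running_in_colab(modules=None):
    """Return True if the google.colab package appears to be loaded."""
    modules = sys.modules if modules is None else modules
    return "google.colab" in modules


print(running_in_colab())
```

The optional `modules` argument exists only so the check can be exercised outside Colab.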


# Collect Data

## Collect Reddit Submissions and Comments

In [None]:
"""This cell retrieves Reddit posts and their top-level comments for the
given search terms and subreddits.
"""

def collect(input_subreddits=None, input_search_terms=None, filename=""):
    # Credentials (create client_id, client_secret, user_agent by following 
    # https://praw.readthedocs.io/en/latest/getting_started/quick_start.html)
    #client_id = getpass("Enter client_id: ") 
    client_id = "zZ50cezaVccRCA"
    client_secret = getpass("Enter client secret: ")
    #client_secret = "" # if you feel comfortable including it
    #user_agent = input("Enter user agent: ")
    user_agent = "PostMasterGeneral"

    ''' Designate input parameters for functions
    search terms = key terms to search ALL of reddit
    subreddits = subreddits to collect
    search_sort_type = 'relevance', 'hot', 'top', 'new', 'comments'
    sub_sort_type = 'hot', 'top', 'new', 'gilded', 'rising', 'controversial'
    time_limit = 'all', 'hour', 'day', 'week', 'month', 'year'
    post_limit = 10
    '''
    post_limit = 1000 # maximum of 1000
    sub_sort_type = 'new'  # , 'top', 'new', 'gilded', 'rising', 'controversial'
    search_sort_type = 'new'  # , 'hot', 'top', 'new', 'comments', 'relevance'
    time_limit = 'year'  # , 'hour', 'day', 'week', 'month', 'year'
    
    # search_terms = ["covid19", "coronavirus"]
    # subreddits = ["CoronavirusMemes", "Coronavirus", "CoronavirusUS"]
    if input_search_terms is None:
      input_search_terms = input("Enter search terms (separated by spaces): ")
    search_terms = input_search_terms.split()
    if input_subreddits is None:
      input_subreddits = input("Enter subreddits (separated by spaces): ")
    subreddits = input_subreddits.split()
    
    ''' Collect posts & corresponding comments
    '''
    # Create client
    r = token(client_id, client_secret, user_agent)
    list_posts_df = []
    try:
        for query in search_terms:
            post_df = search_reddit(r, query, search_sort_type, time_limit, post_limit)
            print("Grabbed", len(post_df), "posts with search term:", query)
            list_posts_df.append(post_df)
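    except Exception as exc:
        # Report the failure instead of silently discarding it.
        logging.warning("Search stopped early: %s", exc)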

    try:
        for sub in subreddits:
            post_df = get_subreddit(r, sub, sub_sort_type, time_limit, post_limit)
            print("Grabbed", len(post_df), "posts from subreddit:", sub)
            list_posts_df.append(post_df)
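    except Exception as exc:
        # Report the failure instead of silently discarding it.
        logging.warning("Subreddit collection stopped early: %s", exc)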
    if not list_posts_df:
      print("no posts found!")
      return
    new_posts = pd.concat(list_posts_df)
    print("Number of posts: ", new_posts.shape[0])

    # File output for posts
    filename_csv = filename + "REDDIT_POSTS_" + datetime.now(tz=tz).strftime("%Y-%m-%dT%H-%M-%S%z") + ".csv"
    new_posts.to_csv(str(RAW_DATA_DIRECTORY / filename_csv), index=False)
    print("Exported posts to CSV")

    # Get comments
    post_ids = new_posts['post_id'].to_list()
    list_comment_dfs = []
    i = 0
    for post in post_ids:
        try:
            comment_df = get_reddit_comments(r, post)
            list_comment_dfs.append(comment_df)
            # print(i, "Grabbed", len(comment_df), "comments from post id:", post)
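        except Exception as exc:
            # Skip this post but record why its comments were not collected.
            logging.warning("Could not fetch comments for post %s: %s", post, exc)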
        i += 1
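    # pd.concat raises ValueError on an empty list, so stop early if nothing was collected.
    if not list_comment_dfs:
        print("no comments found!")
        return
    post_comments = pd.concat(list_comment_dfs)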
    print("Number of total comments: ", post_comments.shape[0])

    # File output for comment
    com_filename_csv = filename + "REDDIT_COMMENTS_" + datetime.now(tz=tz).strftime("%Y-%m-%dT%H-%M-%S%z") + ".csv"
    post_comments.to_csv(str(RAW_DATA_DIRECTORY / com_filename_csv), index=False)
    print("Exported comments to CSV")

In [None]:
collect(input_subreddits="highschool Highschool_Help teenagers HighschoolTheater", filename="highschool")
    
client_id = "zZ50cezaVccRCA"
client_secret = ""  # redacted: never store a client secret in a notebook; enter it via getpass
user_agent = "PostMasterGeneral"
searchterms = "biden"
reddits = "politics uspolitics PoliticalDiscussion geopolitics freethought changemyview news worldnews government politics2"
reddits = "highschool Highschool_Help teenagers HighschoolTheater"

Enter client secret: ··········
Enter search terms (separated by spaces): 
Successful token
Grabbed 999 posts from subreddit: highschool
Grabbed 260 posts from subreddit: Highschool_Help
Grabbed 994 posts from subreddit: teenagers
Grabbed 145 posts from subreddit: HighschoolTheater
Number of posts:  2398
Exported posts to CSV
Number of total comments:  7352
Exported comments to CSV


In [None]:
collect(input_subreddits="politics uspolitics PoliticalDiscussion geopolitics freethought changemyview news worldnews government politics2", filename="politics")

Enter client secret: ··········
Enter search terms (separated by spaces): 
Successful token
Grabbed 940 posts from subreddit: politics
Grabbed 999 posts from subreddit: uspolitics
Grabbed 965 posts from subreddit: PoliticalDiscussion
Grabbed 801 posts from subreddit: geopolitics
Grabbed 956 posts from subreddit: freethought
Grabbed 985 posts from subreddit: changemyview
Grabbed 159 posts from subreddit: news
Grabbed 294 posts from subreddit: worldnews
Grabbed 504 posts from subreddit: government
Grabbed 973 posts from subreddit: politics2
Number of posts:  7576
Exported posts to CSV




Number of total comments:  81081
Exported comments to CSV


In [None]:
collect(input_subreddits="GradSchool PhD PhDStress LawSchool", filename="grad")

Enter client secret: ··········
Enter search terms (separated by spaces): 
Successful token
Grabbed 981 posts from subreddit: GradSchool
Grabbed 980 posts from subreddit: PhD
Grabbed 212 posts from subreddit: PhDStress
Grabbed 999 posts from subreddit: LawSchool
Number of posts:  3172
Exported posts to CSV
Number of total comments:  14879
Exported comments to CSV
