# Scraping r/scarystories With PRAW

### Table of Contents 
- [Set Up](#set-up)
- [Determine Available Attributes to Scrape](#determine-available-attributes-to-scrape)
- [Write Functions For Pulling Submissions and Comments](#write-functions-for-pulling-submissions-and-comments)
- [Case Study 1: Pull 10 Hot r/scarystories Submissions](#case-study-1-pull-10-hot-rscarystories-submissions)
- [Case Study 2: Pull Submissions By Keywords](#case-study-2-pull-submissions-by-keywords)
- [Case Study 3: Delete Duplicate Entries in CSV](#case-study-3-delete-duplicate-entries-in-csv)
- [Case Study 4: Pull Posts Within a Specific Time Range](#case-study-4-pull-posts-within-a-specific-time-range)

### Relevant Resources Used: 
- [PRAW Docs](https://praw.readthedocs.io/en/stable/tutorials/comments.html)
- [Sentiment Analysis Using HuggingFace](https://huggingface.co/blog/sentiment-analysis-python)
- [Cultural Analytics with Python](https://melaniewalsh.github.io/Intro-Cultural-Analytics/04-Data-Collection/14-Reddit-Data.html)

### Set Up

In [16]:

# Set up Pandas

import pandas as pd
from datetime import datetime
import csv
import praw

pd.set_option('max_colwidth', 500)


# Set up PRAW with athentication


reddit = praw.Reddit(
    client_id = "QErgebb-REIyaM6wsoQ-Nw",
    client_secret = "PPuRFlKap6UQ4D5f9tYi_pvY68ePkA",
    username = "Ok_Scientist2546",
    password = "EZ8y@'ctT!f4A%L",
    user_agent = "Praw-test"
)

### Determine Available Attributes to Scrape

In [8]:
# Determine Available Attributes of a Submission object
import pprint

sub_by_id = reddit.submission("14828yd")
print(sub_by_id.title)  
pprint.pprint(vars(sub_by_id))

# Determine Available Attributes of a Comment object
comment = list(sub_by_id.comments)[0]
print(comment.body)  
pprint.pprint(vars(comment))


### Write Functions For Pulling Submissions and Comments

In [38]:
### FUNCTIONS ###

from datetime import datetime, timezone

def pull_submissions(num_subs: int, sub_name: str, sort: str, times=[], keywords=[]):
    """
    Gets key details about num_subs number of submissions on a particular subreddit sub_name. 


    Inputs:
        - num_subs [int]: the number of submissions to pull
        - sub_name [str]: subreddit name without the r/, i.e., "scarystories"
        - sort [str]: the way to sort the subreddit, i.e. by "controversial," 
            "hot," "new," "rising," or "top". Note that rising doesn't necessarily have as 
            many submissions as specified in num_subs.
        - keywords [List[str]]: list of keywords to filter by. By default an empty list. At least 1 
            keyword must appear once in the submission text for the submission to be returned
        - times [List[str]]: the bounds of the time range you want to pull from. [past_time, recent_time]
            both in utc form. By default, will pull from every year up to now.

    
    Returns:
        [List[Dict[10 items]]]: a list of dictionaries, one for each submission in the specified subreddit 
    """

    subreddit = reddit.subreddit(sub_name)
    res = []
    num = 0

    SORTED_SUBMISSIONS = {"hot": subreddit.hot(), 
                          "controversial": subreddit.controversial(), 
                          "gilded": subreddit.gilded(),
                          "top": subreddit.top(),
                          "new": subreddit.new(),
                          "rising": subreddit.rising()
                          }
    
    if times:
        past, recent = times
    else:
        past = datetime(2005, 6, 23).replace(tzinfo=timezone.utc).timestamp()
        recent = datetime.now(timezone.utc).timestamp()

    for submission in SORTED_SUBMISSIONS[sort]:
        if submission.created_utc >= past and submission.created_utc < recent:
            if keywords == [] or key_words_in_text(keywords, submission.selftext):
                if num >= num_subs:
                    return res
                num += 1
                print("Copying submission info...")
                story = {}
                story["title"] = submission.title
                story["submission_id"] = submission.id
                story["score"] = submission.score
                story["url"] = submission.url
                story["author"] = "Deleted" if submission.author is None else submission.author.name
                story["text"] = (submission.selftext.replace("’", "'").
                                replace("…", "...").replace("\n", " ").replace("“", "\"").
                                replace("”", "\""))
                story["subreddit"] = submission.subreddit
                story["num_comments"] = submission.num_comments
                story["date_created"] = datetime.fromtimestamp(submission.created_utc)
                res.append(story)


def key_words_in_text(keywords, text):
    """
    Checks if any of the keywords are in the text.

    Inputs:
        keywords [List[str]]: a list of key words to check
        text [str]: string text to check for words

    Returns: True if any of the keywords are in the text, False otherwise.
    """
    for word in keywords: 
        if word in text.lower(): 
            return True
    return False

def pull_comments(subreddit_id: str, amount: str="all"):
    """
    Pull all or top level comments from a certain reddit submission.

    Inputs:
        subreddit_id [str]: the subreddit id of subreddit you want to pull from
        amount [str]: how many comments to pull, all comments or only top level comments. 
            By default, this variable has value "all"


    Returns: 
        [List[Dict[8 items]]]: a list of comments from a single submission with the comment details
    """

    sub_by_id = reddit.submission(subreddit_id)

    # Select top level comments or all comments 
    sub_by_id.comments.replace_more(limit=None)
    if amount == "top_level":
        comments = []
        for top_level_comment in sub_by_id.comments:
            comments.append(top_level_comment) 
    else:
        comments = sub_by_id.comments.list()

    # Return List of dictionaries with comment details
    res = []
    for comment in comments:
        new_comment = {}
        new_comment["text"] = (comment.body.replace("’", "'").
                             replace("…", "...").replace("\n", " ").replace("“", "\"").
                            replace("”", "\""))
        # Text needs to contain the keyword to be returned!
        new_comment["author"] = "Deleted" if comment.author is None else comment.author.name
        new_comment["score"] = comment.score
        new_comment['comment_id'] = comment.id
        new_comment["is_op"] = comment.is_submitter
        new_comment["submission_id"] = comment._submission.id
        new_comment["subreddit"] = comment.subreddit_name_prefixed
        new_comment["subreddit_id"] = comment.subreddit_id
        res.append(new_comment)
    return res


def write_to_csv(obj, file, mode="a"):
     """
     Writes info from a List[Dict[items]] object into a csv file.

     Inputs:
        mode [str]: "w" for write or "a" for append. By default, "a" for append.
        obj [List[Dict[items]]]: the object that contains the info to write
        file [str]: csv file name

     Returns:
        Nothing
     """
     fieldnames = obj[0].keys()
     with open(file, mode, newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if mode == "w":
            writer.writeheader()
        writer.writerows(obj)

### Case Study 1: Pull 10 Hot r/scarystories Submissions

In [4]:
hot_ten_stories = pull_submissions(num_subs=10, sub_name="scarystories", sort="hot")
write_to_csv(hot_ten_stories, "submissions.csv", "w")
stories_df = pd.read_csv("submissions.csv", delimiter=',', encoding='utf-8')

### Case Study 2: Pull Submissions By Keywords

In [25]:
subs_50 = pull_submissions(num_subs=50, sub_name="scarystories", keywords=["scary", "stab", "kill"])
write_to_csv(subs_50, 'submissions.csv', "w")

### Case Study 3: Delete Duplicate Entries in CSV

In [45]:
# Pull first 50
subs_50 = pull_submissions(num_subs=50, sub_name="scarystories", sort="hot", keywords=["scary", "stab", "kill"])
write_to_csv(subs_50, 'submissions.csv', "w")
df = pd.read_csv("submissions.csv", delimiter=',', encoding='utf-8')
df

# Pull another 50
other_50 = pull_submissions(num_subs=50, sort="new", sub_name="scarystories")
write_to_csv(other_50, 'submissions.csv')

# Delete duplicate entries
temp = pd.read_csv('submissions.csv')
pd.concat([temp, df]).drop_duplicates().to_csv('submissions.csv')


### Case Study 4: Pull Posts Within a Specific Time Range

In [40]:
# Only pull between a certain time range

from datetime import timezone

current_time = datetime.now(timezone.utc).timestamp()
one_year_ago = datetime(2022, 1, 1).replace(tzinfo=timezone.utc).timestamp()
eight_years_ago = datetime(2015, 1, 1).replace(tzinfo=timezone.utc).timestamp()

# Pull 20 submissions from between eight years ago and one year ago
after_twentyfifteen = pull_submissions(num_subs=20, sub_name="scarystories", sort="hot", times=[eight_years_ago, one_year_ago])
print(after_twentyfifteen)

# this won't work with PRAW!

None


### Case Study 5: Sentiment Analysis

In [11]:
from transformers import pipeline

comment = "https://www.reddit.com/r/sysadmin/comments/11egyx5/comment/jahuiqz/"
subreddit = reddit.subreddit("sysadmin")
submission = reddit.submission("11egyx5")
text1 = "The entire history of Firepower could easily be viewed as a graduate school case study in how clueless executives can correctly identify a product gap, spend the right money on the right technology to close that gap, and piss all of the value away through mismanagement anyway."
text2 = "Cisco knew PIX was beyond it's usable service-life, and original ASA-OS was weak. Cisco knew they needed a new Layer-7-aware Firewall solution to compete with Palo & Fortinet and CheckPoint. In 2013 Cisco pulled out the BIG checkbook and bought the OG, mac-daddy, gold-standard L7 aware security solution: Snort."
data = []
data.append(text1)
data.append(text2)


specific_model = pipeline(model="finiteautomata/bertweet-base-sentiment-analysis")
sentiments = specific_model(data)
for i, senti in enumerate(sentiments):
    print(f"{i}: {senti}")

All model checkpoint layers were used when initializing TFRobertaForSequenceClassification.

All the layers of TFRobertaForSequenceClassification were initialized from the model checkpoint at finiteautomata/bertweet-base-sentiment-analysis.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForSequenceClassification for predictions without further training.
emoji is not installed, thus not converting emoticons or emojis into text. Install emoji: pip3 install emoji==0.6.0


0: {'label': 'NEG', 'score': 0.9753668904304504}
1: {'label': 'NEU', 'score': 0.6773768067359924}


### Things to Consider

- TODO: filter pull_comments() by keywords too
- Might need to go through csv manually to make sure that keyword is not accidental -- probably less lilely if keyword is "Cisco", but more likely with something like "stab"
- Duplicates might not include same post with slightly different stats - need to sort out manually if so
