# Scraping r/scarystories With PRAW

### Table of Contents 
- [Set Up](#set-up)
- [Determine Available Attributes to Scrape](#determine-available-attributes-to-scrape)
- [Write Functions For Pulling Submissions and Comments](#write-functions-for-pulling-submissions-and-comments)
- [Case Study 1: Pull 10 Hot r/scarystories Submissions](#case-study-1-pull-10-hot-rscarystories-submissions)
- [Case Study 2: Pull Submissions By Keywords](#case-study-2-pull-submissions-by-keywords)
- [Case Study 3: Delete Duplicate Entries in CSV](#case-study-3-delete-duplicate-entries-in-csv)
- [Case Study 4: Sentiment Analysis](#case-study-4-sentiment-analysis)
- [Things to Consider](#things-to-consider)

### Relevant Resources Used: 
- [PRAW Docs](https://praw.readthedocs.io/en/stable/tutorials/comments.html)
- [Sentiment Analysis Using HuggingFace](https://huggingface.co/blog/sentiment-analysis-python)
- [Cultural Analytics with Python](https://melaniewalsh.github.io/Intro-Cultural-Analytics/04-Data-Collection/14-Reddit-Data.html)

### Set Up

In [16]:

# Set up Pandas

import pandas as pd
from datetime import datetime
import csv
import praw
from typing import List, Dict

pd.set_option('max_colwidth', 500)


# Set up PRAW with authentication


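# Read credentials from environment variables rather than hard-coding secrets.
# The variable names below are placeholders; use whatever names you export.
import os

reddit = praw.Reddit(
    client_id=os.environ["REDDIT_CLIENT_ID"],
    client_secret=os.environ["REDDIT_CLIENT_SECRET"],
    username=os.environ["REDDIT_USERNAME"],
    password=os.environ["REDDIT_PASSWORD"],
    user_agent="Praw-test",
)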

### Determine Available Attributes to Scrape

In [8]:
# Determine Available Attributes of a Submission object
import pprint

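sub_by_id = reddit.submission("14828yd")
# PRAW objects load lazily: accessing .title triggers the fetch, so vars() below is fully populated
print(sub_by_id.title)
pprint.pprint(vars(sub_by_id))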

# Determine Available Attributes of a Comment object
comment = list(sub_by_id.comments)[0]
print(comment.body)  
pprint.pprint(vars(comment))
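
The `vars()` dumps above mix public fields with private bookkeeping attributes (names that start with an underscore). A small helper can narrow the output to the names worth scraping. This is only a sketch: `public_attributes` is a hypothetical helper, not part of PRAW, and `FakeSubmission` is a stand-in so the example runs without a Reddit connection.

```python
from typing import Any, Dict


def public_attributes(obj: Any) -> Dict[str, Any]:
    """Return the instance attributes of obj whose names do not start with '_'."""
    return {name: value for name, value in vars(obj).items() if not name.startswith("_")}


class FakeSubmission:
    """Stand-in for a fetched object (assumption: real PRAW objects expose many more fields)."""

    def __init__(self) -> None:
        self.title = "Under The Stairs"
        self.score = 14
        self._fetched = True  # private bookkeeping attribute, filtered out below


attrs = public_attributes(FakeSubmission())
print(sorted(attrs))
```

With a real connection, the same helper could be called as `public_attributes(sub_by_id)` after the submission has been fetched.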


### Write Functions For Pulling Submissions and Comments

In [17]:
### FUNCTIONS ###

from datetime import datetime, timezone

def pull_submissions(num_subs: int, sub_name: str, sort: str, keywords:List[str]=[]):
    """
    Gets key details about up to num_subs submissions on a particular subreddit sub_name.

    Inputs:
        - num_subs [int]: the maximum number of submissions to pull
        - sub_name [str]: subreddit name without the r/, e.g., "scarystories"
        - sort [str]: the way to sort the subreddit, i.e., by "controversial",
            "gilded", "hot", "new", "rising", or "top". Note that rising doesn't
            necessarily have as many submissions as specified in num_subs.
        - keywords [List[str]]: list of keywords to filter by. By default an empty list. At least 1
            keyword must appear once in the submission text for the submission to be returned

    Returns:
        [List[Dict[9 items]]]: a list of dictionaries, one for each submission in the specified subreddit
    """

    subreddit = reddit.subreddit(sub_name)
    res = []
    num = 0

    SORTED_SUBMISSIONS = {"hot": subreddit.hot(), 
                          "controversial": subreddit.controversial(), 
                          "gilded": subreddit.gilded(),
                          "top": subreddit.top(),
                          "new": subreddit.new(),
                          "rising": subreddit.rising()
                          }

    for submission in SORTED_SUBMISSIONS[sort]:
        if keywords == [] or key_words_in_text(keywords, submission.selftext):
            if num >= num_subs:
                break
            num += 1
            story = {}
            story["title"] = submission.title
            story["submission_id"] = submission.id
            story["score"] = submission.score
            story["url"] = submission.url
            story["author"] = "Deleted" if submission.author is None else submission.author.name
            story["text"] = (submission.selftext.replace("’", "'").
                            replace("…", "...").replace("\n", " ").replace("“", "\"").
                            replace("”", "\""))
            story["subreddit"] = submission.subreddit.display_name
            story["num_comments"] = submission.num_comments
            story["date_created"] = datetime.fromtimestamp(submission.created_utc)
            res.append(story)
    return res


def key_words_in_text(keywords: List[str], text: str) -> bool:
    """
    Checks if any of the keywords are in the text, ignoring case.

    Inputs:
        keywords [List[str]]: a list of key words to check
        text [str]: string text to check for words

    Returns: True if any of the keywords are in the text, False otherwise.
    """
    # Lowercase both sides so that a keyword like "Cisco" can still match
    lowered = text.lower()
    for word in keywords:
        if word.lower() in lowered:
            return True
    return False

def pull_comments(subreddit_id: str, amount: str="all", keywords:List[str]=[]):
    """
    Pull all or only top-level comments from a certain reddit submission.

    Inputs:
        - subreddit_id [str]: despite the parameter name, the ID of the submission
            to pull comments from, e.g., "su5rpm"
        - amount [str]: how many comments to pull. Pass "top_level" for only top-level
            comments; any other value pulls all comments. By default, "all"
        - keywords [List[str]]: list of keywords to filter by. By default an empty list. At least 1
            keyword must appear once in the comment text for the comment to be returned

    Returns:
        [List[Dict[8 items]]]: a list of comments from a single submission with the comment details
    """

    sub_by_id = reddit.submission(subreddit_id)

    # Select top level comments or all comments 
    sub_by_id.comments.replace_more(limit=None)
    if amount == "top_level":
        comments = list(sub_by_id.comments)
    else:
        comments = sub_by_id.comments.list()

    # Return List of dictionaries with comment details
    res = []
    for comment in comments:
        # When keywords are given, the text needs to contain one of them to be returned
        if keywords == [] or key_words_in_text(keywords, comment.body):
            new_comment = {}
            new_comment["text"] = (comment.body.replace("’", "'").
                                replace("…", "...").replace("\n", " ").replace("“", "\"").
                                replace("”", "\""))
            new_comment["author"] = "Deleted" if comment.author is None else comment.author.name
            new_comment["score"] = comment.score
            new_comment['comment_id'] = comment.id
            new_comment["is_op"] = comment.is_submitter
            # Use the public submission attribute rather than the private _submission
            new_comment["submission_id"] = comment.submission.id
            new_comment["subreddit"] = comment.subreddit_name_prefixed
            new_comment["subreddit_id"] = comment.subreddit_id
            res.append(new_comment)
    return res


def write_to_csv(obj, file, mode="a"):
    """
    Writes info from a List[Dict[items]] object into a csv file.

    Inputs:
        obj [List[Dict[items]]]: the object that contains the info to write
        file [str]: csv file name
        mode [str]: "w" for write or "a" for append. By default, "a" for append.

    Returns:
        Nothing
    """
    if not obj:
        # Nothing to write; obj[0] below would raise an IndexError
        return
    fieldnames = obj[0].keys()
    with open(file, mode, newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        # The header is only written in "w" mode, so appending assumes the file already has one
        if mode == "w":
            writer.writeheader()
        writer.writerows(obj)
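
Because `key_words_in_text` checks for substrings, a keyword such as "stab" also matches unrelated words like "establish". One possible alternative is whole-word matching with a regular expression. The sketch below is not a drop-in replacement: `whole_words_in_text` is a hypothetical name, and whole-word matching also stops "stab" from matching inflected forms such as "stabbed".

```python
import re
from typing import List


def whole_words_in_text(keywords: List[str], text: str) -> bool:
    """Return True if any keyword appears in text as a whole word, ignoring case."""
    for word in keywords:
        # \b anchors the match at word boundaries; re.escape guards special characters
        if re.search(rf"\b{re.escape(word)}\b", text, flags=re.IGNORECASE):
            return True
    return False


print(whole_words_in_text(["stab"], "We need to establish a plan."))
print(whole_words_in_text(["stab"], "He tried to stab the shadow."))
print(whole_words_in_text(["Cisco"], "cisco bought Snort in 2013."))
```

Whether substring or whole-word matching is the better fit depends on how much manual review of the results is acceptable.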

### Case Study 1: Pull 10 Hot r/scarystories Submissions

In [28]:
hot_ten_stories = pull_submissions(num_subs=10, sub_name="scarystories", sort="hot")
write_to_csv(hot_ten_stories, "submissions.csv", "w")
stories_df = pd.read_csv("submissions.csv", delimiter=',', encoding='utf-8')
stories_df

Unnamed: 0,title,submission_id,score,url,author,text,subreddit,num_comments,date_created
0,Weird story my mom told me while drunk!,14askf5,14,https://www.reddit.com/r/scarystories/comments/14askf5/weird_story_my_mom_told_me_while_drunk/,ImpossiblePart561,"Not to long ago I got drunk with my mom and she told me a story from when I was a child... She started by asking me if I remembered someone named bob which I replied ""no"" and then proceeded to tell me that bob was the name I gave the shadows in my closet that scared me every night and talked to me, but that's not the scariest part! She then told me that hearing those words scared her because when she was growing up in that house, she stayed in the same room and saw shadows coming from the s...",scarystories,1,2023-06-16 05:49:25
1,she chased me out of my sleep,14ayzez,2,https://www.reddit.com/r/scarystories/comments/14ayzez/she_chased_me_out_of_my_sleep/,creedchurch,"It all started in a sunday night when I was asleep. I remember to have a strange dream about me and some guys that I don't know in a cemetery, we were running and laughing of someone. We seemed to be good friends with each other. It was a dark night, no moon at the sky, no stars, only black. One of these guys in my dream joked about we break into a tomb that we found, it seemed a bad ideia, and we discussed for a minute about it, but the guy that had the idea seemed confident and persistent ...",scarystories,0,2023-06-16 10:55:36
2,Under The Stairs,14alfew,14,https://www.reddit.com/r/scarystories/comments/14alfew/under_the_stairs/,adarngoodread,"Under The [Stairs](https://youtu.be/iRU6CjtXO_c)\r \tFor those who don't know, the spandrel is the little room underneath the stairs. The bottom of my stairs had a landing. Under that landing was a little crevice. Just big enough for a small child, or two to crawl under the stairs. My younger brother and I used to hide from our father as often as we could. We would wait for him to pass out from one of his violent, drunken stupors. \r \r It happened one day, when my mother and brother w...",scarystories,3,2023-06-15 23:07:14
3,Scariest thing,14avsju,2,https://www.reddit.com/r/scarystories/comments/14avsju/scariest_thing/,Actual-Drummer-628,what is the scariest thing happened to you that you can call worst nightmare,scarystories,0,2023-06-16 08:38:57
4,The Sequoia Incident,14asok8,3,/r/nosleep/comments/149ykw8/the_sequoia_incident/,Accomplished-Day6294,,scarystories,0,2023-06-16 05:56:05
5,Don't stay up too late,14aki5e,9,https://www.reddit.com/r/scarystories/comments/14aki5e/dont_stay_up_too_late/,daveromannhorror,"I have a problem, and it has made my life a living hell due to my current issue. For some nights, I like to stay up late to watch movies like The Shining and Final Destination. I also like to read ghost stories on the internet. But I don't understand how I hear slamming at the front door. I also hear a man begging for me to allow him inside. And once he is done crying in the dead of night, I also hear a little girl singing an old lullaby. I take a peek through the window and she's always fif...",scarystories,0,2023-06-15 22:20:35
6,Scary thing that may have saved my life…,14ay1qq,1,https://www.reddit.com/r/scarystories/comments/14ay1qq/scary_thing_that_may_have_saved_my_life/,Cursdzz,It was about 4 years ago when I was visiting my grandmother in Germany. She gave me food but the problem was that the food was probably the reason I felt very sick. I was feeling uncomfortable and like I was about to throw up so I decided to sleep a bit in the guest room. Right before I was sleeping I suddenly got waken up by an unfamiliar voice that waked me up by talking soft and calm. I was scared because I didn‘t know this voice but when I waked the voice disappeard. Now the scary thing ...,scarystories,0,2023-06-16 10:17:19
7,Being alone in a manor,14axz4f,1,https://www.reddit.com/r/scarystories/comments/14axz4f/being_alone_in_a_manor/,Bd-wong,https://www.youtube.com/watch?v=UL2BuesuqCY&t=10s,scarystories,0,2023-06-16 10:14:20
8,False Awakening?? Loop Dream??,14aom79,3,https://www.reddit.com/r/scarystories/comments/14aom79/false_awakening_loop_dream/,Cautious_Bicycle1718,"When I was younger I had a dream that left me terrified , I am still scared to this day. When I was younger, I had a dream where I would wake up, go downstairs, see my mom cooking breakfast, then get shot by a random man that appeared behind me. But then I would end up waking up in the same spot, but the same event occured and this endless loop kept occuring until I ended up grabbing his gun since I was aware that he was there. Then I woke up and cautiously walked downstairs to my mom who w...",scarystories,3,2023-06-16 01:59:09
9,Disapearing Bucket Man,14anx44,4,https://www.reddit.com/r/scarystories/comments/14anx44/disapearing_bucket_man/,Zandoms42,"One night around 1am i was walking up and down the dirt road my house is on. It's like an L shape, and i was coming towards my house which is at the base of the tall part of the L. As i was coming around the corner there was another person. They were completely dark because of a street light behind them, they were broad and carrying something like a bucket in their right hand. We were coming towards each other so my face was illuminated. They must have been confused because i was starring th...",scarystories,0,2023-06-16 01:19:17


### Case Study 2: Pull Submissions By Keywords

In [29]:
subs_50 = pull_submissions(num_subs=50, sub_name="scarystories", keywords=["scary", "stab", "kill"], sort="hot")
write_to_csv(subs_50, 'submissions.csv', "w")

# Note that this doesn't necessarily return 50 submissions: the listing is filtered by keyword
# as it is pulled, so fewer than 50 can come back (26 in this run)
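
The interaction between the keyword filter and the requested count can be sketched on plain dictionaries, with no Reddit calls. The result size is bounded both by the limit and by how many items in the listing actually match. The names `matching` and `take_matching` and the sample posts are hypothetical.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def matching(posts: Iterable[Dict[str, str]], keywords: List[str]) -> Iterator[Dict[str, str]]:
    """Lazily yield posts whose text contains at least one keyword (case-insensitive)."""
    for post in posts:
        text = post["text"].lower()
        if not keywords or any(word.lower() in text for word in keywords):
            yield post


def take_matching(posts: Iterable[Dict[str, str]], keywords: List[str], limit: int) -> List[Dict[str, str]]:
    """Collect at most limit matching posts; fewer if the listing runs out first."""
    return list(islice(matching(posts, keywords), limit))


# Made-up sample listing standing in for a subreddit listing
listing = [
    {"submission_id": "a1", "text": "A scary night in the woods"},
    {"submission_id": "a2", "text": "My cat is cute"},
    {"submission_id": "a3", "text": "They tried to kill the lights"},
    {"submission_id": "a4", "text": "Nothing happened"},
]

found = take_matching(listing, ["scary", "stab", "kill"], limit=50)
print(len(found))  # 2: only two of the four sample posts match
```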

### Case Study 3: Delete Duplicate Entries in CSV

In [45]:
# Pull first 50
subs_50 = pull_submissions(num_subs=50, sub_name="scarystories", sort="hot", keywords=["scary", "stab", "kill"])
write_to_csv(subs_50, 'submissions.csv', "w")
df = pd.read_csv("submissions.csv", delimiter=',', encoding='utf-8')
df

# Pull another 50
other_50 = pull_submissions(num_subs=50, sort="new", sub_name="scarystories")
write_to_csv(other_50, 'submissions.csv')

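# Delete duplicate entries. The file already holds both pulls, so re-reading it is enough;
# index=False avoids writing the row index as an extra "Unnamed: 0" column
temp = pd.read_csv('submissions.csv')
temp.drop_duplicates().to_csv('submissions.csv', index=False)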
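
`drop_duplicates()` only removes rows that are identical in every column. If duplicates should instead be recognised by `submission_id`, one option is to deduplicate the pulled dictionaries before writing them. This is a minimal sketch, assuming each dictionary carries a `submission_id` key as in `pull_submissions`. `dedupe_by_id` is a hypothetical helper, and the IDs and scores are made-up sample data.

```python
from typing import Any, Dict, List


def dedupe_by_id(rows: List[Dict[str, Any]], key: str = "submission_id") -> List[Dict[str, Any]]:
    """Keep one row per key value; a later row replaces an earlier one with the same key."""
    latest: Dict[Any, Dict[str, Any]] = {}
    for row in rows:
        latest[row[key]] = row  # dicts keep the position of the first insertion of each key
    return list(latest.values())


hot = [{"submission_id": "aaa", "score": 14}, {"submission_id": "bbb", "score": 3}]
new = [{"submission_id": "aaa", "score": 17}, {"submission_id": "ccc", "score": 2}]

combined = dedupe_by_id(hot + new)
print([row["submission_id"] for row in combined])
print(combined[0]["score"])
```

Keeping the later row means the most recently pulled stats win, which may or may not be the desired policy.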


### Case Study 4: Sentiment Analysis

In [60]:
from transformers import pipeline

comment = "https://www.reddit.com/r/sysadmin/comments/11egyx5/comment/jahuiqz/"
subreddit = reddit.subreddit("sysadmin")
submission = reddit.submission("11egyx5")
text1 = "The entire history of Firepower could easily be viewed as a graduate school case study in how clueless executives can correctly identify a product gap, spend the right money on the right technology to close that gap, and piss all of the value away through mismanagement anyway."
text2 = "Cisco knew PIX was beyond it's usable service-life, and original ASA-OS was weak. Cisco knew they needed a new Layer-7-aware Firewall solution to compete with Palo & Fortinet and CheckPoint. In 2013 Cisco pulled out the BIG checkbook and bought the OG, mac-daddy, gold-standard L7 aware security solution: Snort."
data = [text1, text2]


specific_model = pipeline(model="finiteautomata/bertweet-base-sentiment-analysis")
sentiments = specific_model(data)
for i, senti in enumerate(sentiments):
    print(f"{i}: {senti}")

All model checkpoint layers were used when initializing TFRobertaForSequenceClassification.

All the layers of TFRobertaForSequenceClassification were initialized from the model checkpoint at finiteautomata/bertweet-base-sentiment-analysis.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForSequenceClassification for predictions without further training.


0: {'label': 'NEG', 'score': 0.9753668904304504}
1: {'label': 'NEU', 'score': 0.6773768067359924}


In [62]:
from transformers import pipeline

def get_scores(data):
    """
    Scores the sentiment of each pulled comment.

    Inputs:
        data [List[Dict[items]]]: a list of pulled comments with relevant details

    Returns a list of dictionaries containing the text, signed sentiment score, author,
    subreddit, and comment ID of each comment shorter than 1000 characters.
    """
    # Keep whole comments (not just their text) so the metadata stays aligned with the scores
    short_comments = [comment for comment in data if len(comment["text"]) < 1000]
    paragraphs = [comment["text"] for comment in short_comments]

    scores = []
    sentiment_pipeline = pipeline("sentiment-analysis")
    sentiments = sentiment_pipeline(paragraphs)
    for comment, senti in zip(short_comments, sentiments):
        new_score = {}
        new_score["text"] = comment["text"]
        # Negative labels get a negative score, neutral labels score 0
        new_score["score"] = -senti["score"] if senti["label"] == "NEGATIVE" else 0 if senti["label"] == "NEUTRAL" else senti["score"]
        new_score["author"] = comment["author"]
        new_score["subreddit"] = comment["subreddit"]
        new_score["comment_id"] = comment["comment_id"]
        scores.append(new_score)
    return scores


# "top_level" is the value pull_comments checks for; any other string pulls all comments
pulled = pull_comments(subreddit_id="su5rpm", amount="top_level")
scores = get_scores(pulled[10:])
r_security_df = pd.DataFrame(scores)


No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.
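
The two models used in this section do not share a label set: the `finiteautomata/bertweet-base-sentiment-analysis` output above shows `NEG` and `NEU`, while `get_scores` checks for `NEGATIVE` and `NEUTRAL`. A small lookup table can fold both vocabularies into one signed score. This is a sketch only: `signed_score` is a hypothetical helper, and the `POS` and `POSITIVE` entries are assumed counterparts of the labels shown.

```python
from typing import Dict, Union

# Sign applied to the model confidence for each label (label names partly assumed)
LABEL_SIGN = {"NEG": -1, "NEGATIVE": -1, "NEU": 0, "NEUTRAL": 0, "POS": 1, "POSITIVE": 1}


def signed_score(result: Dict[str, Union[str, float]]) -> float:
    """Map a pipeline result like {'label': 'NEG', 'score': 0.97} to a value in [-1, 1]."""
    label = str(result["label"]).upper()
    if label not in LABEL_SIGN:
        raise ValueError(f"Unknown sentiment label: {label}")
    return LABEL_SIGN[label] * float(result["score"])


print(signed_score({"label": "NEG", "score": 0.9753668904304504}))
print(signed_score({"label": "NEU", "score": 0.6773768067359924}))
```

Raising on an unknown label makes a vocabulary mismatch visible instead of silently treating it as positive.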


### Things to Consider

- Might need to go through the CSV manually to make sure that a keyword match is not accidental -- probably less likely if the keyword is "Cisco", but more likely with something like "stab"
- Duplicate removal might miss the same post pulled twice with slightly different stats -- need to sort these out manually if so
- PRAW cannot pull submissions from a specific date range
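
For the last point, one possible workaround is to filter by date after pulling. The sketch below works on plain dictionaries and assumes each row keeps the raw `created_utc` Unix timestamp that PRAW exposes on submissions. `in_time_range` is a hypothetical helper with made-up sample rows, and it can only narrow down what a listing has already returned.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List


def in_time_range(rows: List[Dict[str, Any]], start: datetime, end: datetime) -> List[Dict[str, Any]]:
    """Keep rows whose created_utc falls in the half-open interval [start, end)."""
    kept = []
    for row in rows:
        # Build a timezone-aware UTC datetime so the comparison is unambiguous
        created = datetime.fromtimestamp(row["created_utc"], tz=timezone.utc)
        if start <= created < end:
            kept.append(row)
    return kept


rows = [
    {"submission_id": "aaa", "created_utc": datetime(2023, 6, 15, 23, 0, tzinfo=timezone.utc).timestamp()},
    {"submission_id": "bbb", "created_utc": datetime(2023, 6, 16, 5, 0, tzinfo=timezone.utc).timestamp()},
    {"submission_id": "ccc", "created_utc": datetime(2023, 6, 17, 1, 0, tzinfo=timezone.utc).timestamp()},
]

start = datetime(2023, 6, 16, tzinfo=timezone.utc)
end = datetime(2023, 6, 17, tzinfo=timezone.utc)
print([row["submission_id"] for row in in_time_range(rows, start, end)])
```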
