# Pulling posts from reddit
- posts
    - fetch previously-collected posts
    - fetch earlier posts
    - fetch more recent posts
    - extract data from posts
    - combine new data with older data
    - check for duplicates
- look into lack of new posts in "space" subreddit
- save updated post data to csv
- initial data pull
- notes & other exploratory stuff


In [1]:
import praw
import pandas as pd 
import numpy as np 

import reddit_utilities
import modeling_reporting

In [2]:
# instantiate reddit instance
reddit = praw.Reddit()
# you're now connected & authenticated!

In [3]:
subreddits = ["AskScienceFiction", "space", "askscience"]

# Posts
## Fetch previously-collected posts

In [4]:
old_data = pd.read_csv("data/raw_posts.csv")
# old_data.tail()

### Quick check on the old data to make sure it's in good shape

In [5]:
old_data.shape

(2686, 6)

In [6]:
old_data["subreddit"].value_counts(normalize=True)

subreddit
AskScienceFiction    0.406925
space                0.325763
askscience           0.267312
Name: proportion, dtype: float64

## Fetch earlier posts
- this never brought in any additional posts (but I just kept trying it anyway)

In [7]:
earlier_posts = reddit_utilities.get_earlier_posts(old_data, reddit, subreddits)
posts_dict = reddit_utilities.extract_posts(earlier_posts)

In [8]:
len(posts_dict)

0

## Fetch more recent posts

In [9]:
later_posts = reddit_utilities.get_more_recent_posts(old_data, reddit, subreddits)

## Extract data from posts

In [10]:
posts_dict = reddit_utilities.extract_posts(later_posts)
len(posts_dict)

126

In [11]:
new_data = pd.DataFrame(posts_dict)
new_data.head()

|   | title | selftext | subreddit | created_utc | name | type |
|---|---|---|---|---|---|---|
| 0 | [MCU] I died in the snap and resurrected after... |  | AskScienceFiction | 1706455000.0 | t3_1ad4uh1 | post |
| 1 | [Wreck It Ralph] Has the arcade never experien... | I find it hard to believe that in all the year... | AskScienceFiction | 1706454000.0 | t3_1ad4fnk | post |
| 2 | [Bleach] Is Ichigo's mother somewhere in Soul ... |  | AskScienceFiction | 1706453000.0 | t3_1ad43v7 | post |
| 3 | [Frontlines series book 3 Angles of attack spe... | This 3rd book was pretty annoying.\n\n1) They ... | AskScienceFiction | 1706452000.0 | t3_1ad3si6 | post |
| 4 | [Code Geass] If they had mechs in this timelin... | Wouldn't it make sense to have fully functiona... | AskScienceFiction | 1706448000.0 | t3_1ad2m64 | post |


- checking `subreddit` column for shenanigans

In [12]:
new_data["subreddit"].unique()  
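# initially, these labels were plain strings that matched the subreddit name
# freshly pulled rows hold a `Subreddit(display_name='{subreddit name}')` object instead, most likely because
# `post.subreddit` is a praw object, while the older rows were read back from csv, which stores only the text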

array([Subreddit(display_name='AskScienceFiction'),
       Subreddit(display_name='askscience')], dtype=object)

In [13]:
old_data["subreddit"].unique()

array(['askscience', 'space', 'AskScienceFiction'], dtype=object)

In [14]:
# ¯\_(ツ)_/¯  changing all the subreddit labels to be the same format
for subreddit in subreddits:
    new_data["subreddit"] = np.where(new_data["subreddit"] == subreddit, subreddit, new_data["subreddit"])
new_data["subreddit"].unique()

array(['AskScienceFiction', 'askscience'], dtype=object)
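
A possibly simpler way to get the same result is to convert every label to text in one pass. This is only a sketch: `FakeSubreddit` is a hypothetical stand-in so the example runs without a Reddit connection, and it assumes that calling `str()` on the real object gives the display name.

```python
import pandas as pd


class FakeSubreddit:
    """Hypothetical stand-in for a praw Subreddit; only str() matters here."""

    def __init__(self, display_name):
        self.display_name = display_name

    def __str__(self):
        return self.display_name


# a mix of object labels (fresh pull) and plain strings (rows loaded from csv)
demo = pd.DataFrame(
    {"subreddit": [FakeSubreddit("space"), "askscience", FakeSubreddit("space")]}
)

# convert every label to plain text, whatever it started as
demo["subreddit"] = demo["subreddit"].map(str)
labels = sorted(demo["subreddit"].unique())
```

This avoids looping over the subreddit list, and it keeps working if another subreddit is added later.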

## Combine new data with older data

In [15]:
updated_df = pd.concat([old_data, new_data])
updated_df.shape

(2812, 6)

In [16]:
updated_df["subreddit"].unique()

array(['askscience', 'space', 'AskScienceFiction'], dtype=object)

## Check for duplicates
- the `name` field (reddit's "fullname", e.g. `t3_1ad4uh1`) identifies an individual post
- there are definitely duplicates from one subreddit to another but there should not be dupes within a given subreddit
- might decide to let the duplicates stay if there's an imbalance among the subreddits and the dupes are in the under-represented group
    - 1/21/24 added 6 duplicates to the "askscience" subreddit
    - 1/22 added 2 more duplicates to "askscience"
    - 1/26 added 2 more to "askscience"
- `name` for comments is always `None` so any duplicates that appear will only be in posts
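
The duplicate check described above could be sketched roughly like this. `count_duplicate_names` is a hypothetical helper written for illustration, not the actual `modeling_reporting.find_duplicates`. It assumes a duplicate means a row whose `name` repeats an earlier row in the same subreddit, and that rows without a `name` (comments) should be ignored.

```python
import pandas as pd


def count_duplicate_names(df):
    """Count rows per subreddit whose `name` repeats an earlier row (hypothetical helper)."""
    posts = df.dropna(subset=["name"])  # comments carry no `name`, so skip them
    repeats = posts[posts.duplicated(subset=["subreddit", "name"], keep="first")]
    return repeats["subreddit"].value_counts().to_dict()


demo = pd.DataFrame(
    {
        "subreddit": ["askscience", "askscience", "askscience", "space", "space"],
        "name": ["t3_a1", "t3_a1", None, "t3_b1", "t3_b2"],
    }
)
dupe_counts = count_duplicate_names(demo)
```

Subreddits with no repeated names are simply absent from the returned dictionary.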

In [17]:
modeling_reporting.find_duplicates(updated_df)

{'askscience': 10}

### Check `updated_df` for any issues before saving it

In [18]:
updated_df["subreddit"].value_counts()

subreddit
AskScienceFiction    1193
space                 875
askscience            744
Name: count, dtype: int64

In [19]:
updated_df.isna().sum()

title             0
selftext       1107
subreddit         0
created_utc       0
name              0
type              0
dtype: int64

In [20]:
# a lot of the posts, especially in the "space" subreddit, are just a title and a URL, without any `selftext`
modeling_reporting.find_null_selftext(updated_df)

{'askscience': 115, 'space': 720, 'AskScienceFiction': 272}
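
A per-subreddit count like the one above can be sketched with a boolean mask and a group-by. `count_missing_selftext` is a hypothetical helper, not the actual `modeling_reporting.find_null_selftext`. It also treats an empty string as missing, on the assumption that a freshly pulled link post has `""` for `selftext`, which only turns into NaN after a round trip through csv.

```python
import pandas as pd


def count_missing_selftext(df):
    """Count rows per subreddit with no usable `selftext` (hypothetical helper)."""
    missing = df["selftext"].isna() | (df["selftext"] == "")
    # summing booleans counts the True values in each group
    return missing.groupby(df["subreddit"]).sum().to_dict()


demo = pd.DataFrame(
    {
        "subreddit": ["space", "space", "askscience"],
        "selftext": ["", None, "some body text"],
    }
)
missing_counts = count_missing_selftext(demo)
```

Unlike a filter followed by `value_counts`, this version also reports subreddits where nothing is missing.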

----
## Look into lack of new posts in "space" subreddit
- I noticed I wasn't getting any new posts from the "space" subreddit in the last few days, even though it looked like there were new posts when I looked at the site
- So I removed the "before" param and fetched what I could get & will remove duplicates manually

In [21]:
# initially used this to get the latest post but it didn't return any new posts so left it out of the call
latest_space = max(old_data.loc[old_data["subreddit"] == "space"]["name"])

space = reddit.subreddit("space").new(limit=None)  # this will definitely have some overlap with data I already have
space_dict = reddit_utilities.extract_posts([space])
len(space_dict)

851

In [24]:
old_space = updated_df.loc[updated_df["subreddit"] == "space"]
all_space = pd.concat([old_space, pd.DataFrame.from_dict(space_dict)])
all_space["subreddit"] = "space"  # fixing that subreddit label thing
all_space.shape

(1726, 6)

In [25]:
modeling_reporting.find_duplicates(all_space)

{'space': 769}

- I definitely have to get rid of the duplicates but this says only 769 of the 1,726 rows were duplicated. I don't know why using the "before" or "after" param didn't pull in data from this subreddit but leaving it out DID get more data. Weird but good for my project.

In [27]:
all_space = all_space.drop_duplicates(subset=["name"])
all_space.shape

(957, 6)
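
The concat-then-drop pattern used here is repeated for each subreddit, so it could be wrapped in one function. This is a sketch under assumptions: `merge_fresh_rows` is a hypothetical name, and it deliberately leaves rows with a missing `name` alone, because pandas treats missing values as equal to each other when looking for duplicates, so `drop_duplicates(subset=["name"])` would keep only the first comment row.

```python
import pandas as pd


def merge_fresh_rows(stored, fresh_rows, label):
    """Append freshly pulled rows to the stored ones and drop repeated posts (hypothetical helper)."""
    combined = pd.concat([stored, pd.DataFrame(fresh_rows)], ignore_index=True)
    combined["subreddit"] = label  # same label format for old and new rows
    # a row is a repeat only if it has a name and that name was already seen
    is_repeat = combined["name"].notna() & combined.duplicated(subset=["name"], keep="first")
    return combined[~is_repeat]


stored = pd.DataFrame(
    {"title": ["a", "b", "reply 1", "reply 2"], "name": ["t3_a", "t3_b", None, None]}
)
fresh = [{"title": "b", "name": "t3_b"}, {"title": "c", "name": "t3_c"}]
merged = merge_fresh_rows(stored, fresh, "space")
```

With `keep="first"` the stored copy of a post wins over the freshly pulled one.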

- obviously going to check the other subreddits to see if I can get more out of them too

In [30]:
new_science = reddit.subreddit("askscience").new(limit=None)
science_dict = reddit_utilities.extract_posts([new_science])
print(len(science_dict))

old_science = updated_df.loc[updated_df["subreddit"] == "askscience"]
all_science = pd.concat([old_science, pd.DataFrame.from_dict(science_dict)])
all_science["subreddit"] = "askscience"  # fixing that subreddit label thing
all_science.shape

665


(1409, 6)

In [31]:
print(f"duplicates to get rid of: {modeling_reporting.find_duplicates(all_science)}")
all_science = all_science.drop_duplicates(subset=["name"])
all_science.shape

duplicates to get rid of: {'askscience': 672}


(737, 6)

In [34]:
new_scifi = reddit.subreddit("AskScienceFiction").new(limit=None)
scifi_dict = reddit_utilities.extract_posts([new_scifi])
print(len(scifi_dict))

old_scifi = updated_df.loc[updated_df["subreddit"] == "AskScienceFiction"]
all_scifi = pd.concat([old_scifi, pd.DataFrame.from_dict(scifi_dict)])
all_scifi["subreddit"] = "AskScienceFiction"  # fixing that subreddit label thing
all_scifi.shape

996


(2189, 6)

In [35]:
print(f"duplicates to get rid of: {modeling_reporting.find_duplicates(all_scifi)}")
all_scifi = all_scifi.drop_duplicates(subset=["name"])
all_scifi.shape

duplicates to get rid of: {'AskScienceFiction': 989}


(1200, 6)

In [36]:
old_data["subreddit"].value_counts()

subreddit
AskScienceFiction    1093
space                 875
askscience            718
Name: count, dtype: int64

In [38]:
updated_df["subreddit"].value_counts()

subreddit
AskScienceFiction    1193
space                 875
askscience            744
Name: count, dtype: int64

In [37]:
really_updated_df = pd.concat([all_science, all_space, all_scifi])
really_updated_df["subreddit"].value_counts()

subreddit
AskScienceFiction    1200
space                 957
askscience            737
Name: count, dtype: int64

- the increase in "AskScienceFiction" (1193 to 1200) is probably just because it's been a few minutes since I pulled that data & there have been a few more posts to that subreddit in that time
- the decrease in "askscience" (744 to 737) is because I already had 10 duplicates in that subreddit and I've just gotten rid of all duplicates, so I got three new posts and removed 10 dupes I already had, net -7 posts
- still don't know why the "space" subreddit wasn't getting updates with the `before` param but it's nice to have 82 more posts (875 to 957)
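
The net changes can be double-checked with a little arithmetic on the two sets of counts printed above. The variable names are just for illustration.

```python
# counts from `updated_df` and `really_updated_df`, copied from the outputs above
before = {"AskScienceFiction": 1193, "space": 875, "askscience": 744}
after = {"AskScienceFiction": 1200, "space": 957, "askscience": 737}

net_change = {name: after[name] - before[name] for name in before}

# "askscience" lost its 10 old duplicates, so the truly new posts are net change + 10
new_askscience_posts = net_change["askscience"] + 10
```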

## Save updated post data to csv

In [39]:
really_updated_df.to_csv("data/raw_posts.csv", index=False)

----
## Initial data pull
- putting this at the end because I only used this code the first time I fetched from each subreddit
- also kind of a restart when I decided to use comments as well as just posts

In [None]:
subreddits = ["AskScienceFiction", "space", "askscience"]

listing_gens = []

for subreddit in subreddits:
    listing_gens.append(reddit.subreddit(subreddit).new(limit=None))

In [None]:
listing_gens

In [None]:
# this isn't exactly my initial data pull, from which I only extracted posts
#  this is the first batch to extract both posts & comments
#  commented out the post extraction part of the function the first time through to just get comments
posts = reddit_utilities.extract_posts(listing_gens)

In [None]:
len(posts)

In [None]:
start_df = pd.DataFrame(posts)
start_df.head()

In [None]:
start_df.shape

----
## Notes & other exploratory stuff
- most of this is also represented above but in this section I'm keeping some notes and some of my trial-and-error processing, which may be useful to my future self

### Getting comments out of posts
- wanted to increase the number of documents in my dataset
- basics from the [praw tutorial](https://praw.readthedocs.io/en/stable/tutorials/comments.html)
- also got some assistance from Benjamin Wolff with the code to fetch the comments

In [None]:
# fetch a bunch of posts, not limited to those since we last checked
science_posts = reddit.subreddit("askscience").new(limit=10)

comments = []

for post in science_posts:
    post.comments.replace_more(limit=None)
    for comment in post.comments.list()[:2]:
        comments.append({
            "title": post.title,
            "selftext": comment.body,
            "created_utc": post.created_utc,
            "subreddit": "askscience",
            "name": None,  # using the post's `name` to identify duplicates so not using that here
            "type": "comment"
        })

In [None]:
len(comments)

### Starter code
- mostly from Eric's praw walkthrough on 1/18

In [None]:
space = "space"
space_posts = reddit.subreddit(space).new(limit=None)

science = "askscience"
science_posts = reddit.subreddit(science).new(limit=None)  # set limit to None to get the max
type(science_posts)  # have to iterate through this "ListingGenerator" to get at the stuff, can't access by index with []

- that generator behaves like a queue: once you've iterated through it, it's empty

- now that you've printed the stuff, `science_posts` is empty
- `post` is still available though, the last one in the queue
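
The same behavior can be seen with a plain Python generator. `fake_listing` is a hypothetical stand-in for the listing returned by `.new()`, so the sketch runs without a Reddit connection.

```python
def fake_listing():
    """Hypothetical stand-in for a ListingGenerator: yields three post titles."""
    yield from ["first", "second", "third"]


listing = fake_listing()

seen = []
for title in listing:
    seen.append(title)

# the generator is now used up, so a second pass finds nothing
leftover = list(listing)

# the loop variable still holds the last item that was yielded
last_title = title

# to go over the posts more than once, copy them into a list up front
kept = list(fake_listing())
```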

In [None]:
posts_list = []

In [None]:
for post in science_posts:
    posts_list.append({
        "title": post.title,
        "selftext": post.selftext,
        "subreddit": str(post.subreddit),  # store the display name as text rather than the praw Subreddit object
        "created_utc": post.created_utc,
        "name": post.name
    })

    # check out praw docs for what's available: https://praw.readthedocs.io/en/stable/code_overview/models/submission.html#praw.models.Submission

### Add params to get posts before/after what you already have

In [None]:
# see the docs on ListingGenerator, which say you can pass a `params` dictionary
# also see reddit api docs to know what can go in your `params`: https://www.reddit.com/dev/api/
science_posts = reddit.subreddit(science).new(limit=10, params={"after": "t3_197iuy3"})  # this is the last "name" in our current data
# the listing is like a stack, so "after" is lower in the stack (older) & "before" is higher in the stack (newer)
# if you want comments, check out praw's tutorial: https://praw.readthedocs.io/en/stable/tutorials/comments.html
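
The stack picture for `after` and `before` can be modeled with a plain list. This is a toy model only: `page` is a hypothetical function that mimics how I understand the two params on a newest-first listing, and it makes no network calls.

```python
def page(listing, after=None, before=None, limit=10):
    """Toy model of paging through a newest-first list of post names."""
    if after is not None:
        start = listing.index(after) + 1  # items further down the stack (older)
        return listing[start:start + limit]
    if before is not None:
        end = listing.index(before)  # items higher up the stack (newer)
        return listing[max(0, end - limit):end]
    return listing[:limit]


newest_first = ["t3_e", "t3_d", "t3_c", "t3_b", "t3_a"]

older = page(newest_first, after="t3_c")
newer = page(newest_first, before="t3_c")
```

So the oldest `name` on hand goes in `after` to reach further back, and the newest one goes in `before` to pick up posts made since the last pull.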

- playing with the `name` field, before I tried just sorting it as it is...which worked just fine

In [None]:
# from the reddit docs, the prefix in `name` (e.g. `t3_`) is the type of thing; `t3` means a post (link)
#  the rest is the post's unique ID, written as a base-36 int
df["name_base36"] = df["name"].str[3:]

In [None]:
# can I convert that to an int which I could then sort?
df["name_base36"] = [int(number, 36) for number in df["name_base36"]]
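
Sorting the raw `name` strings works as long as every ID has the same number of characters. A small sketch of where text order and numeric order disagree, using two made-up IDs next to the real `t3_197iuy3` from earlier; `name_to_int` is a hypothetical helper.

```python
def name_to_int(name):
    """Turn a name like 't3_197iuy3' into the integer value of its base-36 ID."""
    _prefix, _, base36_id = name.partition("_")
    return int(base36_id, 36)


# "t3_zzzzzz" and "t3_1000000" are made-up names of different lengths
names = ["t3_zzzzzz", "t3_1000000", "t3_197iuy3"]

by_text = sorted(names)  # compares character by character
by_value = sorted(names, key=name_to_int)  # compares the decoded numbers
```

The six-character ID sorts last as text but first by value, so the numeric key is the safer choice if the data ever spans IDs of different lengths.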