In [47]:
import pandas as pd
import requests
import time
from datetime import date
import os

I developed the function below in a separate notebook; please let me know if you'd like to see it. One caveat: some posts never become rows in the dataframe. They count toward the 'limit' (in this case 100) but are not saved. After talking to Alanna about it, it seemed clear that the problem was with the data and not the function, and that it wasn't worth the time it would take to figure out which posts were affected and why. I chose to work with the data that I got.

In [36]:
def build_reddit_df(subreddit, desired_size, endpoint = '/reddit/search/submission', limit = 100):
    
    '''
    NOTE: assign the return value of this function to a variable to keep the dataframe outside the function.
    
    This function is designed to use pushshift API to build a dataframe of specified size 
    filled with data from the specified subreddit. It starts with the most recent post and works backwards.

    subreddit = the subreddit you'd like to scrape
    
    desired_size = the total number of posts you'd like to have in the dataframe. The function will hit the minimum size 
        above that value that the 'limit' value allows. In other words, it may go over this value up to the amount of the limit.
    
    endpoint = your desired endpoint. Defaults to '/reddit/search/submission' for submissions (main posts);
        use '/reddit/search/comment' for comments. Both were valid at the time of the writing of this function (6/22/2022)

    limit = the limit for number of posts that can be pulled at once. The default is 100, the maximum
        allowed at the time of the writing of this function (6/22/2022)
    '''
    
    url = 'https://api.pushshift.io'+endpoint
    
    counter = 0
    
    fncdf = pd.DataFrame() #establish with certainty that the new dataframe name is empty.
    
    for i in range(2):
   
        if len(fncdf) == 0:
            params = {
                'subreddit': subreddit,
                'size': limit,
                'filter': ['title', 'selftext', 'subreddit', 'created_utc'] #katie pointed out this parameter to me to save cleaning later.
            }
            res = requests.get(url, params)
            if res.status_code == 200:
                data = res.json()
                posts = data['data']
                fncdf = pd.DataFrame(posts)
                counter += 1
            else:
                print(f'ERROR: status code not 200. Failure occurred on loop number {counter+1}')

        else: # after the df has been established.
            while len(fncdf) < desired_size:
                params = {
                    'subreddit': subreddit,
                    'size': limit,
                    'before': fncdf.iloc[-1]['created_utc'],
                    'filter': ['title', 'selftext', 'subreddit', 'created_utc']
                }
                res = requests.get(url, params)

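                if res.status_code == 200:
                    data = res.json()
                    posts = data['data']
                    if not posts: # no older posts came back; stop rather than loop forever
                        break
                    newdf = pd.DataFrame(posts)
                    fncdf = pd.concat([fncdf, newdf], ignore_index = True)
                    counter +=1
                    time.sleep(3) #alanna suggested adding this

                else:
                    print(f'ERROR: status code not 200. Failure occurred on loop number {counter+1}')
                    break # return what was collected so far instead of retrying forever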
    
    return fncdf
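
The paging logic of the function (start with the newest posts, then use the oldest `created_utc` seen so far as the `before` cursor) can be tried without network access. The following is only a minimal sketch under stated assumptions: `FAKE_POSTS`, `fake_fetch`, and `collect` are hypothetical names invented for illustration, and `fake_fetch` merely imitates an API that returns posts newest first. It is not the pushshift API itself.

```python
import pandas as pd

# Hypothetical stand-in data: 250 fake posts with strictly descending timestamps.
FAKE_POSTS = [{'title': f'post {n}', 'created_utc': 1_000_000 - n} for n in range(250)]

def fake_fetch(size, before=None):
    """Return up to `size` fake posts older than `before`, newest first."""
    older = [p for p in FAKE_POSTS if before is None or p['created_utc'] < before]
    return older[:size]

def collect(desired_size, limit=100):
    """Page backwards through the fake posts until `desired_size` is reached."""
    frames = []
    total = 0
    before = None  # no cursor on the first request
    while total < desired_size:
        page = fake_fetch(limit, before)
        if not page:  # nothing older is left, so stop instead of looping forever
            break
        frames.append(pd.DataFrame(page))
        total += len(page)
        before = page[-1]['created_utc']  # oldest post seen so far
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

small = collect(150)        # overshoots to the next multiple of the limit
everything = collect(1000)  # asks for more than exists and stops early
print(len(small), len(everything))
```

In this toy setup, asking for 150 posts with a limit of 100 yields 200 rows, which matches the docstring's note that the result may exceed the desired size by up to one page.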

In [111]:
startrek = build_reddit_df('startrek', 3500)

I'm backing this dataset up because what gets pulled into the initial dataframe depends on when the function is run. Without a copy, I couldn't go back and explore the separate sets of posts without pulling and cleaning again.

In [113]:
startrekbackup = startrek.copy()

I used [this stackoverflow answer](https://stackoverflow.com/a/50885228) to guide my work on eliminating duplicates. My rationale is that duplicate submissions would only encourage the model to overfit. A second answer showed me [a way to use .drop_duplicates](https://stackoverflow.com/a/58311003) that preserves specific values that are duplicated. I explain why below.

While my initial instinct was to delete "[removed]" and blank posts, I was curious to see that the 'starwars' subreddit seems to have far more [removed] posts. I found [this post](https://www.reddit.com/r/NoStupidQuestions/comments/b3czg1/what_does_removed_mean/) that indicated that "[removed]" means that a moderator has taken down the post. It appears that '[removed]' may help indicate whether a post is a Star Wars or Star Trek post, simply because a higher percentage of Star Wars posts are removed. While I ultimately intend to use 'removed' as a stop word and/or remove those lines from the dataframe, I'm opting to leave those posts in for now so I can explore them further. I'd also like to leave the remaining titles in the dataframe for now and intend to examine those, too.
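
One way to compare the share of '[removed]' posts between subreddits is a grouped mean over a boolean mask. This is only a sketch: the toy rows below are invented for illustration and do not reflect the real counts in either subreddit.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'subreddit': ['startrek'] * 4 + ['StarWars'] * 4,
    'selftext': ['[removed]', 'some text', '', 'more text',
                 '[removed]', '[removed]', '[removed]', 'some text'],
})

# True where the post body was taken down by a moderator.
is_removed = toy['selftext'] == '[removed]'

# The mean of a boolean column is the fraction of True values per group.
removed_share = is_removed.groupby(toy['subreddit']).mean()
print(removed_share)
```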

I considered carefully whether or not to remove duplicate titles. The argument for keeping them is that, at least on the StarWars subreddit, [reposting is explicitly forbidden](https://www.reddit.com/r/StarWars/wiki/rules#wiki_read_and_follow_reddiquette), so there's potentially a relationship between repetition and removal. However, as I'm interested in exploring the language used in the titles of removed posts, and particularly word counts, I'd rather lose the potential to explore patterns of repetition in favor of not overweighting the words appearing in the titles. On the day that I drew my data, there were 53 repeated Star Trek titles and 83 repeated Star Wars titles. These represent a relatively small number of data points.

I also became curious to see if the blank 'selftext' rows reflected posts that consisted more or less solely of the title. That appears to be the case, so I'm going to keep those in the dataframe as well. In addition to being able to use the titles, I'll be curious to see if there are discrepancies in how many posts of that type the users of the two subreddits create.

I'm going to pull 3500 posts each from Star Trek and Star Wars to ensure that I have at least 1000 of each with text in their 'selftext', in case I decide to remove the blank and '[removed]' posts in analysis.
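
The idea of dropping duplicate 'selftext' values while preserving the placeholder values that are expected to repeat can be sketched with a boolean mask. This is an alternative formulation on made-up toy data, not the exact code used in the cells that follow. Unlike a concat-based approach, it keeps the original row order.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'title': ['a', 'b', 'c', 'd', 'e', 'f'],
    'selftext': ['[removed]', '[removed]', '', '', 'same text', 'same text'],
})

# Values that are allowed to repeat.
placeholders = ['[removed]', '[deleted]', '']
is_placeholder = toy['selftext'].isin(placeholders)

# Keep a row if it is a placeholder row, or if it is the first
# occurrence of its 'selftext' value.
keep = is_placeholder | ~toy.duplicated(['selftext'], keep='first')
deduped = toy[keep]
print(deduped)
```

Here only row 'f' is dropped, because it repeats real text; the repeated '[removed]' and blank rows survive.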

In [114]:
len(startrek[startrek.duplicated(['title'])])

53

In [115]:
print(f'Initial Shape: {startrek.shape}')
print('='*20)
print(f'Initial Top 5 Value Counts: {startrek["selftext"].value_counts().head()}')
startrek = pd.concat([startrek[startrek['selftext']=='[removed]'],
                     startrek[startrek['selftext']==''],
                     startrek[startrek['selftext']=='[deleted]'],
                     startrek[(startrek['selftext'] != '[removed]') & (startrek['selftext'] != '') & (startrek['selftext'] != '[deleted]')]\
                      .drop_duplicates(["selftext"], keep = 'first')])
startrek = startrek.drop_duplicates(['title'], keep = 'first')
print('')
print('='*20)
print('')
print(f'Final Shape: {startrek.shape}')
print('='*20)
print(f'Final Top 5 Value Counts: {startrek["selftext"].value_counts().head()}')

Initial Shape: (3595, 4)
Initial Top 5 Value Counts: [removed]

In [116]:
len(startrek[startrek.duplicated(['title'])])

0

The check below confirms that no 'selftext' value is repeated other than '[removed]', '[deleted]', empty strings, and nulls, which I'll deal with later. Just checking for duplicates is not enough because those four values are expected to repeat.

In [131]:
startrek[(startrek.duplicated(['selftext']))]

Unnamed: 0,created_utc,selftext,subreddit,title
5,1656254308,[removed],startrek,On the Gorn and language
8,1656248567,[removed],startrek,What are some good things that can be said abo...
11,1656238740,[removed],startrek,A Lord of the Rings reference in SNW 1x08
12,1656238132,[removed],startrek,The sword props used in SNW 1x08 are replicas ...
13,1656237020,[removed],startrek,TNG vs the world?
...,...,...,...,...
3582,1650549682,,startrek,"The Ready Room: ""Mercy"" (Jeri Ryan and Santiag..."
3583,1650546302,,startrek,What the hell is Rios doing in Star Trek Picar...
3585,1650538210,,startrek,Star Trek: Picard Is Garbage
3593,1650508756,,startrek,The Realism of Science Fiction in Old Trek vs ...


In [132]:
stonlytext = startrek[(startrek['selftext'] != '[removed]') & (startrek['selftext'].notnull()) & (startrek['selftext'] != '') & (startrek['selftext'] != '[deleted]')]

In [129]:
stonlytext[stonlytext.duplicated(['selftext'])]

Unnamed: 0,created_utc,selftext,subreddit,title


In [110]:
starwars = build_reddit_df('starwars', 3500)
starwars.head()

Unnamed: 0,created_utc,selftext,subreddit,title
0,1656264455,,StarWars,One of the most heartwarming moments I’ve ever...
1,1656264338,Lego Star Wars: The Video Game is a Lego game ...,StarWars,Lego Star Wars: The Video Game was released be...
2,1656264179,Writing was horrible constant nostalgia pander...,StarWars,My rant on obiwan show
3,1656263912,My friends love to piss me off and say its lik...,StarWars,Is it AT-AT(aht-aht) or is it A T-A T (a t- a t)
4,1656263896,Maul. We know he’s around mixing it up with in...,StarWars,Obi-wan Kenobi Season 2 - Must have Cameo


I'm backing this dataset up because what gets pulled into the initial dataframe depends on when the function is run. Without a copy, I couldn't go back and explore the separate sets of posts without pulling and cleaning again.

In [112]:
starwarsbackup = starwars.copy()

In [130]:
len(starwars[starwars.duplicated(['title'])])

83

In [133]:
print(f'Initial Shape: {starwars.shape}')
print('='*20)
print(f'Initial Top 5 Value Counts: {starwars["selftext"].value_counts().head()}')
starwars = pd.concat([starwars[starwars['selftext']=='[removed]'],
                     starwars[starwars['selftext']==''],
                     starwars[starwars['selftext']=='[deleted]'],
                     starwars[(starwars['selftext'] != '[removed]') & (starwars['selftext'] != '') & (starwars['selftext'] != '[deleted]')]\
                      .drop_duplicates(["selftext"], keep = 'first')])
starwars = starwars.drop_duplicates(['title'])
print('')
print('='*20)
print('')
print(f'Final Shape: {starwars.shape}')
print('='*20)
print(f'Final Top 5 Value Counts: {starwars["selftext"].value_counts().head()}')

Initial Shape: (3500, 4)
Initial Top 5 Value Counts:

In [134]:
len(starwars[starwars.duplicated(['title'])])

0

Checking to make sure the duplicate 'selftext' has been removed.

In [137]:
swonlytext = starwars[(starwars['selftext'] != '[removed]') & (starwars['selftext'].notnull()) & (starwars['selftext'] != '') & (starwars['selftext'] != '[deleted]')]

In [138]:
swonlytext[swonlytext.duplicated(['selftext'])]

Unnamed: 0,created_utc,selftext,subreddit,title


In [139]:
df = pd.concat([startrek, starwars])
print(f'Star Trek Shape: {startrek.shape}')
print(f'Star Wars Shape: {starwars.shape}')
print(f'Combined Shape: {df.shape}')

Star Trek Shape: (3537, 4)
Star Wars Shape: (3409, 4)
Combined Shape: (6946, 4)


In [140]:
3537+3409

6946

Because the function always pulls the newest posts from the given subreddit, I've written the following to save the data to a csv marked with the date and to prevent the file from being overwritten if this cell is run more than once in a day. This seems particularly important for preserving the actual data that was used for my analysis.

[This site](https://www.geeksforgeeks.org/python-datetime-module/) showed me how to call the date. I remembered we checked if a directory existed with `os` during the Excel Lab (2.01), but I needed [this site](https://www.pythontutorial.net/python-basics/python-check-if-file-exists/) to understand what to call to check if the file existed.
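
The date-stamped, no-overwrite naming scheme could also be wrapped in a small helper that picks the next free filename automatically. This is a hedged sketch: `dated_csv_path` is a hypothetical helper name, and the demonstration uses a temporary directory rather than the `data/` folder.

```python
import os
import tempfile
from datetime import date

def dated_csv_path(directory, stem='data'):
    """Return a date-stamped CSV path in `directory` that does not exist yet."""
    base = os.path.join(directory, f'{stem}{date.today()}')
    candidate = f'{base}.csv'
    suffix = 1
    # If today's file already exists, try -1, -2, ... until a free name is found.
    while os.path.exists(candidate):
        candidate = f'{base}-{suffix}.csv'
        suffix += 1
    return candidate

# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    first = dated_csv_path(tmp)
    open(first, 'w').close()       # simulate saving the first file
    second = dated_csv_path(tmp)   # the helper now proposes a '-1' name
    print(os.path.basename(first), os.path.basename(second))
```

With a dataframe in hand, the returned path could be passed straight to `df.to_csv(path, index=False)`.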

In [141]:
df.dtypes

created_utc     int64
selftext       object
subreddit      object
title          object
dtype: object

In [98]:
if os.path.exists(f'data/data{date.today()}.csv'):
    print('ERROR: This filename exists. Please choose a different filename. FILE WAS NOT SAVED.')
else:
    df.to_csv(f'data/data{date.today()}.csv', index = False)

In [143]:
# below is a line that can be uncommented and used to save a second csv on the same date.
# It's set to write data{TODAY'SDATE}-1.csv.

# df.to_csv(f'data/data{date.today()}-1.csv', index = False)