In [18]:
import pandas as pd
import functions as fun
from datetime import date
import os

In [19]:
startrek = fun.build_reddit_df('startrek', 3500)

I'm backing up this dataset because what gets pulled into the initial dataframe depends on when the pull runs. Without a copy, I couldn't return to this exact set of posts and explore it without pulling and cleaning again.

In [20]:
startrekbackup = startrek.copy()

In [21]:
startrek = startrekbackup.copy()

# Initial Cleaning
I used [this stackoverflow answer](https://stackoverflow.com/a/50885228) to guide my work on eliminating duplicates. My rationale is that duplicate submissions would only overfit the model. Another answer showed me [a way to use .drop_duplicates](https://stackoverflow.com/a/58311003) that preserves specific values even when they are duplicated. See why, below.
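A minimal sketch of that idea follows: drop repeated 'selftext' values while keeping every row that holds a placeholder or a null. This is not the body of `fun.clean_subreddit_df`, which isn't shown in this notebook; `PLACEHOLDERS`, `drop_duplicate_selftext`, and the toy dataframe are hypothetical names made up for illustration.

```python
import pandas as pd

# Assumed placeholder values; chosen to match the ones filtered later in this notebook.
PLACEHOLDERS = ['[removed]', '[deleted]', '']


def drop_duplicate_selftext(df):
    """Drop repeated 'selftext' rows but keep all placeholder and null rows."""
    is_placeholder = df['selftext'].isna() | df['selftext'].isin(PLACEHOLDERS)
    # True for every repeat after the first occurrence of a value
    is_repeat = df.duplicated(subset='selftext', keep='first')
    return df[is_placeholder | ~is_repeat]


# Toy data (hypothetical), not the scraped posts
toy = pd.DataFrame({
    'title': ['a', 'b', 'c', 'd', 'e'],
    'selftext': ['[removed]', '[removed]', 'same text', 'same text', None],
})
deduped = drop_duplicate_selftext(toy)
```

Here both '[removed]' rows and the null row survive, while the second copy of 'same text' is dropped.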

While my initial instinct was to delete "[removed]" and blank posts, I was curious to see that the 'starwars' subreddit seems to have far more [removed] posts. I found [this post](https://www.reddit.com/r/NoStupidQuestions/comments/b3czg1/what_does_removed_mean/) that indicated that "[removed]" means that a moderator has taken down the post. It appears that the level of '[removed]' may help indicate if a post is a Star Wars or Star Trek post simply because a higher percentage of them are removed. While I ultimately intend to use 'removed' as a stop word and/or remove those lines from the dataframe, I'm opting to leave those posts in for now so I can explore them further. I'd also like to be able to leave the data of the residual titles in the dataframe for now and intend to examine those, too.
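To compare how often each subreddit's posts are marked '[removed]', one option is to compute the share per subreddit. The sketch below uses a small made-up dataframe (`toy` and `removed_rate` are hypothetical names), so the numbers say nothing about the real data.

```python
import pandas as pd

# Toy data (hypothetical), standing in for the combined posts
toy = pd.DataFrame({
    'subreddit': ['StarWars'] * 4 + ['startrek'] * 4,
    'selftext': ['[removed]', '[removed]', 'text', '',
                 '[removed]', 'text', 'text', 'text'],
})

# Boolean flag per post, then the mean of that flag within each subreddit
removed_rate = toy['selftext'].eq('[removed]').groupby(toy['subreddit']).mean()
print(removed_rate)
```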

I considered carefully whether or not to remove duplicate titles. The argument for keeping them is that, at least on the StarWars subreddit, [reposting is explicitly forbidden](https://www.reddit.com/r/StarWars/wiki/rules#wiki_read_and_follow_reddiquette), so there's potentially a relationship between repetition and removal. However, as I'm interested in exploring the language used in the titles of removed posts, and particularly word counts, I'd rather lose the potential to explore patterns of repetition in favor of not overweighting the words appearing in the titles. On the day that I drew my data, there were 53 repeated Star Trek titles and 83 repeated Star Wars titles. These represent a relatively small number of data points.
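Counting and then dropping repeated titles could look roughly like this. It is a sketch on made-up data, not the cleaning function itself; `toy`, `n_repeats`, and `deduped` are hypothetical names.

```python
import pandas as pd

# Toy data (hypothetical)
toy = pd.DataFrame({
    'title': ['Best captain?', 'Best captain?', 'Rewatch thread', 'Best captain?'],
    'selftext': ['', '[removed]', 'Starting tonight', ''],
})

# Number of rows whose title already appeared earlier in the dataframe
n_repeats = toy.duplicated(subset='title').sum()
print(f'Number of duplicate titles: {n_repeats}')

# Keep only the first occurrence of each title
deduped = toy.drop_duplicates(subset='title', keep='first')
```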

I also became curious to see if the blank 'selftext' rows reflected what appeared to be posts that consisted more-or-less solely of the title. That appears to be the case, so I'm going to keep those in the dataframe, as well. In addition to being able to use the titles, I'll be curious to see if there are discrepancies in how many posts of that type the two subreddits generate.
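One way to look for such discrepancies is to label each post by what its 'selftext' holds and cross-tabulate by subreddit. The sketch below assumes that blank or null 'selftext' means a title-only post, as described above; `label_selftext` and the toy data are hypothetical.

```python
import pandas as pd


def label_selftext(value):
    """Label a 'selftext' value as title_only, removed, deleted, or text."""
    if pd.isna(value) or value == '':
        return 'title_only'  # assumption: blank or null means a title-only post
    if value in ('[removed]', '[deleted]'):
        return value.strip('[]')
    return 'text'


# Toy data (hypothetical)
toy = pd.DataFrame({
    'subreddit': ['StarWars'] * 4 + ['startrek'] * 4,
    'selftext': ['', None, '[removed]', 'Some text',
                 'Some text here', '', '[deleted]', 'More text'],
})

# Rows are subreddits, columns are post types, cells are counts
counts = pd.crosstab(toy['subreddit'], toy['selftext'].map(label_selftext))
print(counts)
```

Passing `normalize='index'` to `pd.crosstab` would give within-subreddit shares instead of raw counts.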

I'm going to pull 3500 of both Star Trek and Star Wars posts to ensure that I have at least 1000 of each that have text in their 'selftext', in case I decide to remove the blank and '[removed]' posts in analysis.

In [22]:
startrek = fun.clean_subreddit_df(startrek)

Initial number of duplicate titles: 5
*************************
Initial Shape: (500, 4)
Initial Top 5 Value Counts: [removed]                                                                                                                                                                237
                                                                                                                                                                          86
https://imgur.com/a/M4OrgDc                                                                                                                                                2
Do they have a ship name yet? What should it be? I like Spel.\n\nAlso, their kiss scene in ep 7 was fire, absolutely loved it!                                             1
I’m watching some of the old episodes of everything here and there and I’ll be damned if I don’t love both their characters and the insight or the badasses they are.      1
Name: selftext, dty

## Verifying What's Still Duplicated - Star Trek Data
The below is to confirm that no 'selftext' value is repeated other than '[removed]', '[deleted]', empty strings, and nulls, which I'll deal with later. Just checking for duplicates is not enough because those four values repeat by design.

In [23]:
# keep only rows whose 'selftext' holds real text (no placeholders, blanks, or nulls)
stonlytext = startrek[
    (startrek['selftext'] != '[removed]')
    & (startrek['selftext'].notnull())
    & (startrek['selftext'] != '')
    & (startrek['selftext'] != '[deleted]')
]

In [24]:
stonlytext[stonlytext.duplicated(['selftext'])]

Unnamed: 0,created_utc,selftext,subreddit,title


In [25]:
starwars = fun.build_reddit_df('starwars', 3500)
starwars.head()

Unnamed: 0,created_utc,selftext,subreddit,title
0,1656610303,,StarWars,DALLE: Luke and Vader sitting together
1,1656610127,In Palpatine's speech to the Senate where he d...,StarWars,"""The attempt on my life"""
2,1656610059,,StarWars,Book of Boba VHS
3,1656609635,"Personally for me it’s Crimson Dawn, I just fi...",StarWars,What’s your favorite part of Star Wars?
4,1656609630,"(This is just a thought I had, genuinely sorry...",StarWars,Anyone else think the Jedi order might just be...


I'm backing up this dataset because what gets pulled into the initial dataframe depends on when the pull runs. Without a copy, I couldn't return to this exact set of posts and explore it without pulling and cleaning again.

In [26]:
starwarsbackup = starwars.copy()

In [27]:
starwars = fun.clean_subreddit_df(starwars)

Initial number of duplicate titles: 10
*************************
Initial Shape: (500, 4)
Initial Top 5 Value Counts:                  293
[removed]

## Verifying What's Still Duplicated - Star Wars Data
The below is to confirm that no 'selftext' value is repeated other than '[removed]', '[deleted]', empty strings, and nulls, which I'll deal with later. Just checking for duplicates is not enough because those four values repeat by design.

In [28]:
# keep only rows whose 'selftext' holds real text (no placeholders, blanks, or nulls)
swonlytext = starwars[
    (starwars['selftext'] != '[removed]')
    & (starwars['selftext'].notnull())
    & (starwars['selftext'] != '')
    & (starwars['selftext'] != '[deleted]')
]

In [29]:
swonlytext[swonlytext.duplicated(['selftext'])]

Unnamed: 0,created_utc,selftext,subreddit,title


In [30]:
df = pd.concat([startrek, starwars])
print(f'Star Trek Shape: {startrek.shape}')
print(f'Star Wars Shape: {starwars.shape}')
print(f'Combined Shape: {df.shape}')

Star Trek Shape: (494, 4)
Star Wars Shape: (490, 4)
Combined Shape: (984, 4)


In [31]:
# verifying that the concatenation adds up correctly, as determined by shape

startrek.shape[0] + starwars.shape[0] == df.shape[0]

True
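The shape check confirms the row count but says nothing about the index: by default `pd.concat` keeps each input's original index labels, so labels can repeat in the combined dataframe. The sketch below shows this on two tiny made-up dataframes (`left` and `right` are hypothetical) and how `ignore_index=True` renumbers the rows. Whether renumbering is wanted here is a judgment call.

```python
import pandas as pd

# Toy data (hypothetical)
left = pd.DataFrame({'title': ['a', 'b']})
right = pd.DataFrame({'title': ['c', 'd']})

stacked = pd.concat([left, right])                        # index labels repeat
renumbered = pd.concat([left, right], ignore_index=True)  # fresh 0..n-1 index

print(list(stacked.index))
print(list(renumbered.index))
```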

In [32]:
# confirming datatypes are as expected

df.dtypes

created_utc     int64
selftext       object
subreddit      object
title          object
dtype: object

# Exporting Data to CSV
Because the function always pulls the newest posts from the given subreddits, I've written the following to write the data to a CSV marked with the date and to prevent the file from being overwritten if this cell is run more than once in a day. This seems particularly important for preserving the actual data that was used for my analysis.

[This site](https://www.geeksforgeeks.org/python-datetime-module/) showed me how to call the date. I remembered we checked if a directory existed with `os` during the Excel Lab (2.01), but I needed [this site](https://www.pythontutorial.net/python-basics/python-check-if-file-exists/) to understand what to call to check if the file existed.

In [33]:
# refuse to overwrite a file that was already saved today
if os.path.exists(f'data/data{date.today()}.csv'):
    print('ERROR: This filename exists. Please choose a different filename. FILE WAS NOT SAVED.')
else:
    df.to_csv(f'data/data{date.today()}.csv', index=False)

In [143]:
# below is a line that can be uncommented and used to save a second CSV on the same date.
# It's set to create data{TODAY'SDATE}-1.csv.

# df.to_csv(f'data/data{date.today()}-1.csv', index = False)
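The manual "-1" suffix above could also be chosen automatically. The sketch below is one possible approach, not what this notebook actually ran; `next_available_path` is a hypothetical helper that follows the same `data{date}` and `data{date}-1` naming and keeps counting upward until it finds an unused filename.

```python
import os
from datetime import date


def next_available_path(directory, today=None):
    """Return the first unused path among data{date}.csv, data{date}-1.csv, ..."""
    today = today or date.today()
    base = os.path.join(directory, f'data{today}')
    candidate = f'{base}.csv'
    suffix = 1
    # Keep incrementing the suffix until the filename is free
    while os.path.exists(candidate):
        candidate = f'{base}-{suffix}.csv'
        suffix += 1
    return candidate
```

It could then be used as `df.to_csv(next_available_path('data'), index=False)`, assuming the `data` directory already exists.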