# Project 3: Web APIs & NLP

In week four we learned about a few different classifiers. In week five we're learning about web scraping, APIs, and Natural Language Processing (NLP). This project will put those skills to the test.

For project 3, your goal is two-fold:
1. Using [PRAW](https://praw.readthedocs.io/en/stable/index.html), you'll collect posts from two subreddits of your choosing.
2. You'll then use NLP to train a classifier to predict which subreddit a given post came from. This is a binary classification problem.

## Part 1: Data Collection

For this project, you will be using [PRAW](https://praw.readthedocs.io/en/stable/index.html) to collect posts from two different subreddits. The two subreddits used here are 'UnresolvedMysteries' and 'FanTheories'.

### 1. Install and Import PRAW and pandas

In [1]:
# Import packages

#!pip install praw
import praw

import pandas as pd

### 2. Initialize PRAW

In [2]:
# Credentials are replaced with dummy values here
# The real credentials are stored outside this notebook

reddit = praw.Reddit(
    client_id='dummy',
    client_secret='dummy',
    user_agent='dummy',
    username='dummy',
    password='dummy'
)
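One way to keep real credentials out of the notebook is to read them from environment variables. The sketch below is a minimal illustration, not the approach actually used here: the helper name `load_reddit_credentials` and the `REDDIT_*` variable names are hypothetical, and only the keyword names match the `praw.Reddit(...)` call above.

```python
import os


def load_reddit_credentials(env=os.environ):
    """Collect the keyword arguments for praw.Reddit from environment variables."""
    # Hypothetical variable names; rename them to match your own setup.
    keys = {
        "client_id": "REDDIT_CLIENT_ID",
        "client_secret": "REDDIT_CLIENT_SECRET",
        "user_agent": "REDDIT_USER_AGENT",
        "username": "REDDIT_USERNAME",
        "password": "REDDIT_PASSWORD",
    }
    # Fail early with a readable message if anything is missing.
    missing = [var for var in keys.values() if var not in env]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {arg: env[var] for arg, var in keys.items()}


# Usage (requires the variables to be set and praw to be installed):
# reddit = praw.Reddit(**load_reddit_credentials())
```

Passing the mapping in as an argument makes the helper easy to check with a plain dictionary before pointing it at the real environment.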

### 3. Read in Subreddits

In [3]:
# Reading in subreddit UnresolvedMysteries
subreddit1 = reddit.subreddit('UnresolvedMysteries')

# Pulling the new, hot, and top post listings for this subreddit
# Setting the limit to 1000 per listing since we need at least 1000 posts
posts1_new = subreddit1.new(limit=1000)
posts1_hot = subreddit1.hot(limit=1000)
posts1_top = subreddit1.top(limit=1000)

In [4]:
# Reading in subreddit FanTheories
subreddit2 = reddit.subreddit('FanTheories')

# Pulling the new, hot, and top post listings for this subreddit
# Setting the limit to 1000 per listing since we need at least 1000 posts
posts2_new = subreddit2.new(limit=1000)
posts2_hot = subreddit2.hot(limit=1000)
posts2_top = subreddit2.top(limit=1000)

In [5]:
# Looping through new posts and creating dataframe with needed columns
data = []
for post in posts1_new:
    data.append([post.created_utc, post.title, post.selftext, post.subreddit, post.score, post.num_comments])

# Turn into a dataframe
mysteries_new = pd.DataFrame(data, columns = ['created_utc', 'title', 'self_text', 'subreddit', 'score', 'num_comments'])
mysteries_new.head()

|  | created_utc | title | self_text | subreddit | score | num_comments |
|---|---|---|---|---|---|---|
| 0 | 1709228000.0 | A ten year old goes out to run errands and nev... | Ten-year-old Mary Ann Verdecchia was excited o... | UnresolvedMysteries | 346 | 39 |
| 1 | 1709215000.0 | A Shadow Over Leigh: The Unsolved Murder of 14... | **Background** \nLisa Hession was a 14 year o... | UnresolvedMysteries | 217 | 23 |
| 2 | 1709175000.0 | Tera Tracy | Saw this story online and thought it was worth... | UnresolvedMysteries | 67 | 10 |
| 3 | 1709069000.0 | The Boy and the Bike: the Disappearance of Dav... | I know this has been covered a few times, but ... | UnresolvedMysteries | 348 | 54 |
| 4 | 1709053000.0 | A body of a young man is found near a parkway,... | Hello everyone! Thank you for all your comment... | UnresolvedMysteries | 231 | 35 |
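The `created_utc` column holds Unix timestamps in seconds. If readable dates are needed later, a conversion along these lines should work. This is a sketch on two values copied from the output above as displayed, and the variable names are illustrative only.

```python
import pandas as pd

# Two timestamps as shown in the output above (seconds since the Unix epoch)
created = pd.Series([1709228000.0, 1709215000.0], name='created_utc')

# unit='s' tells pandas the numbers are seconds; utc=True keeps the result timezone-aware
created_dt = pd.to_datetime(created, unit='s', utc=True)
print(created_dt)
```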


In [6]:
# Below is a helper function I started writing but did not have time to use

# def subreddit(mysubreddit):
#     subreddit = reddit.subreddit(mysubreddit)
#     post = subreddit.top(limit=1000)
#     data = []
#
#     for row in post:
#         data.append([row.created_utc, row.title, row.selftext, row.subreddit])
#     # Build the dataframe once, after the loop has collected every row
#     df = pd.DataFrame(data, columns = ['created_utc', 'title', 'self_text', 'subreddit'])
#     return df

In [7]:
# subreddit('UnresolvedMysteries')
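The six near-identical loops in this notebook could be folded into one helper. The sketch below is one possible shape for it, not the code that produced the results shown here: `posts_to_frame` and `COLUMNS` are hypothetical names, and the `str(post.subreddit)` call assumes that converting a PRAW subreddit object to a string gives its display name. The stand-in objects at the bottom only mimic the attributes used above so the function can be tried without API access.

```python
from types import SimpleNamespace

import pandas as pd

COLUMNS = ['created_utc', 'title', 'self_text', 'subreddit', 'score', 'num_comments']


def posts_to_frame(posts):
    """Turn any iterable of post-like objects into a dataframe with the columns used above."""
    rows = [
        [post.created_utc, post.title, post.selftext,
         str(post.subreddit), post.score, post.num_comments]
        for post in posts
    ]
    # Building the frame once, after the loop, also handles an empty listing cleanly
    return pd.DataFrame(rows, columns=COLUMNS)


# Stand-in for a PRAW listing, with made-up values
sample_posts = [
    SimpleNamespace(created_utc=1709228000.0, title='Example title',
                    selftext='Example body', subreddit='UnresolvedMysteries',
                    score=1, num_comments=0),
]
sample_df = posts_to_frame(sample_posts)
print(sample_df.shape)
```

With a real connection, the same function could be called as `posts_to_frame(subreddit1.new(limit=1000))`, and likewise for the hot and top listings.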

In [8]:
# Looking at how many rows of data we got
mysteries_new.shape

(931, 6)

In [9]:
# Looping through hot posts and creating dataframe with needed columns
data = []
for post in posts1_hot:
    data.append([post.created_utc, post.title, post.selftext, post.subreddit, post.score, post.num_comments])

# Turn into a dataframe
mysteries_hot = pd.DataFrame(data, columns = ['created_utc', 'title', 'self_text', 'subreddit', 'score', 'num_comments'])
mysteries_hot.head()

|  | created_utc | title | self_text | subreddit | score | num_comments |
|---|---|---|---|---|---|---|
| 0 | 1708949000.0 | Meta Monday! - February 26, 2024 Talk about an... | This is a weekly thread for off topic discussi... | UnresolvedMysteries | 15 | 4 |
| 1 | 1706616000.0 | What are you listening to, watching, or readin... | This is a weekly thread for media recommendati... | UnresolvedMysteries | 27 | 57 |
| 2 | 1709228000.0 | A ten year old goes out to run errands and nev... | Ten-year-old Mary Ann Verdecchia was excited o... | UnresolvedMysteries | 344 | 39 |
| 3 | 1709215000.0 | A Shadow Over Leigh: The Unsolved Murder of 14... | **Background** \nLisa Hession was a 14 year o... | UnresolvedMysteries | 214 | 23 |
| 4 | 1709175000.0 | Tera Tracy | Saw this story online and thought it was worth... | UnresolvedMysteries | 68 | 10 |


In [10]:
# Looking at how many rows of data we got
mysteries_hot.shape

(459, 6)

In [11]:
# Looping through top posts and creating dataframe with needed columns
data = []
for post in posts1_top:
    data.append([post.created_utc, post.title, post.selftext, post.subreddit, post.score, post.num_comments])

# Turn into a dataframe
mysteries_top = pd.DataFrame(data, columns = ['created_utc', 'title', 'self_text', 'subreddit', 'score', 'num_comments'])
mysteries_top.head()

|  | created_utc | title | self_text | subreddit | score | num_comments |
|---|---|---|---|---|---|---|
| 0 | 1607713000.0 | FBI confirms that the Zodiac Killer’s “340 Cyp... | The Zodiac Killer is an unidentified serial ki... | UnresolvedMysteries | 62426 | 2844 |
| 1 | 1579803000.0 | The mystery surrounding Britney Spears | I know this isn’t the typical content usually ... | UnresolvedMysteries | 30001 | 1703 |
| 2 | 1613172000.0 | Why I stopped watching the Elisa Lam documentary | Right, I'm sure I'm gonna get some flack for t... | UnresolvedMysteries | 26696 | 3417 |
| 3 | 1524647000.0 | East Area Rapist/Original Night Stalker OFFICI... | It's real, the EAR/ONS has been **officially c... | UnresolvedMysteries | 25025 | 5137 |
| 4 | 1653236000.0 | 8 months ago, the Sandy Hook shooter Adam Lanz... | https://www.reddit.com/r/masskillers/comments/... | UnresolvedMysteries | 24934 | 1351 |


In [12]:
# Looking at how many rows of data we got
mysteries_top.shape

(997, 6)

In [13]:
# Combining new, hot and top together to make one dataframe for mysteries
frames = [mysteries_new, mysteries_hot, mysteries_top]
unresolved_mysteries = pd.concat(frames)

In [14]:
# Looking at how many total rows of data we have
unresolved_mysteries.shape

(2387, 6)

In [15]:
# Removing exact duplicate rows
# Note: all six columns must match, so a post fetched twice with a changed score is kept
unresolved_mysteries = unresolved_mysteries.drop_duplicates()

In [16]:
# Checking again for how many rows we have
unresolved_mysteries.shape

(2324, 6)
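The same post can appear in more than one listing with a slightly different score: 'Tera Tracy' shows a score of 67 in the new listing and 68 in the hot listing above. A plain `drop_duplicates()` compares every column, so it treats those two rows as different. The sketch below uses toy data to show the difference when deduplicating on the text columns only. Using `title` and `self_text` as the key is an assumption. If a post ID column had been collected, that would likely be a more robust key.

```python
import pandas as pd

# Toy data: the first two rows are the same post fetched at different times
posts = pd.DataFrame({
    'title': ['Tera Tracy', 'Tera Tracy', 'Another post'],
    'self_text': ['Saw this story online', 'Saw this story online', 'Different body'],
    'score': [67, 68, 5],
})

# Compares all columns, so the changed score keeps both copies
exact = posts.drop_duplicates()

# Compares only the text columns, so the repeated post is dropped
by_text = posts.drop_duplicates(subset=['title', 'self_text'], keep='first')

print(len(exact), len(by_text))
```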

In [17]:
# Looping through new posts and creating dataframe with needed columns
data = []
for post in posts2_new:
    data.append([post.created_utc, post.title, post.selftext, post.subreddit, post.score, post.num_comments])

# Turn into a dataframe
fantheories_new = pd.DataFrame(data, columns = ['created_utc', 'title', 'self_text', 'subreddit', 'score', 'num_comments'])
fantheories_new.head()

|  | created_utc | title | self_text | subreddit | score | num_comments |
|---|---|---|---|---|---|---|
| 0 | 1709327000.0 | (Girl, Interrupted) Lisa and Susanna are in lo... | Girl Interrupted is my favorite movie ever and... | FanTheories | 0 | 0 |
| 1 | 1709310000.0 | [The Butterfly Effect 3: Revelations] What Hap... | An older movie with a story that is left open ... | FanTheories | 4 | 2 |
| 2 | 1709308000.0 | [Ratatouille] Gusteau is Disney himself | Just like most Pixar movies in the Lasseter er... | FanTheories | 44 | 5 |
| 3 | 1709261000.0 | The Hateful Eight: Warren didn't know Sandy's ... | Sandy Smithers: "he came up this mountain look... | FanTheories | 72 | 16 |
| 4 | 1709234000.0 | (Wallace and gromit) Wallace is abusive to gro... | throughout the series, Wallace continually pat... | FanTheories | 0 | 8 |


In [18]:
# Looking at how many rows of data we got 
fantheories_new.shape

(867, 6)

In [19]:
# Looping through hot posts and creating dataframe with needed columns
data = []
for post in posts2_hot:
    data.append([post.created_utc, post.title, post.selftext, post.subreddit, post.score, post.num_comments])

# Turn into a dataframe
fantheories_hot = pd.DataFrame(data, columns = ['created_utc', 'title', 'self_text', 'subreddit', 'score', 'num_comments'])
fantheories_hot.head()

|  | created_utc | title | self_text | subreddit | score | num_comments |
|---|---|---|---|---|---|---|
| 0 | 1634130000.0 | Welcome to r/FanTheories! Please read this pos... | Recently, the moderation team has noticed an u... | FanTheories | 352 | 0 |
| 1 | 1708465000.0 | Reminder: All fan theories must be in-universe... | Recently, it came to the attention of the r/fa... | FanTheories | 109 | 26 |
| 2 | 1709308000.0 | [Ratatouille] Gusteau is Disney himself | Just like most Pixar movies in the Lasseter er... | FanTheories | 43 | 5 |
| 3 | 1709261000.0 | The Hateful Eight: Warren didn't know Sandy's ... | Sandy Smithers: "he came up this mountain look... | FanTheories | 72 | 16 |
| 4 | 1709310000.0 | [The Butterfly Effect 3: Revelations] What Hap... | An older movie with a story that is left open ... | FanTheories | 5 | 2 |


In [20]:
# Looking at how many rows of data we got
fantheories_hot.shape

(874, 6)

In [21]:
# Looping through top posts and creating dataframe with needed columns
data = []
for post in posts2_top:
    data.append([post.created_utc, post.title, post.selftext, post.subreddit, post.score, post.num_comments])

# Turn into a dataframe
fantheories_top = pd.DataFrame(data, columns = ['created_utc', 'title', 'self_text', 'subreddit', 'score', 'num_comments'])
fantheories_top.head()

|  | created_utc | title | self_text | subreddit | score | num_comments |
|---|---|---|---|---|---|---|
| 0 | 1531675000.0 | [SPOILERS] Infinity War: "...you never once us... | The title quote comes from Thanos speaking to ... | FanTheories | 13542 | 819 |
| 1 | 1481235000.0 | The entire movie of Aladdin was simply the ful... | Something that always bothered and confused me... | FanTheories | 13246 | 686 |
| 2 | 1555263000.0 | Why Steve Rogers was able to resist Thanos. | I'm referring to at 0:33 in this video: [http... | FanTheories | 12128 | 598 |
| 3 | 1542909000.0 | [Harry Potter] [Spoilers] Ron Weasley used the... | **WARNING: SPOILERS ARE ALL OVER THIS THEORY L... | FanTheories | 11322 | 864 |
| 4 | 1482197000.0 | Predator (1987): The alien tries each man's ma... | It's a well-worn idea that _Predator_ is a fil... | FanTheories | 11010 | 298 |


In [22]:
# Looking at how many rows of data we got
fantheories_top.shape

(998, 6)

In [23]:
# Combining new, hot and top together to make one dataframe for fantheories
frames = [fantheories_new, fantheories_hot, fantheories_top]
fan_theories = pd.concat(frames)

In [24]:
# Looking at how many total rows of data we have
fan_theories.shape

(2739, 6)

In [25]:
# Removing exact duplicate rows
# Note: all six columns must match, so a post fetched twice with a changed score is kept
fan_theories = fan_theories.drop_duplicates()

In [26]:
# Checking again for how many rows we have
fan_theories.shape

(2461, 6)

In [27]:
# Combining unresolved_mysteries and fan_theories together to get one dataframe
reddits = [unresolved_mysteries, fan_theories]
redditdata1 = pd.concat(reddits)
redditdata1.head()

|  | created_utc | title | self_text | subreddit | score | num_comments |
|---|---|---|---|---|---|---|
| 0 | 1709228000.0 | A ten year old goes out to run errands and nev... | Ten-year-old Mary Ann Verdecchia was excited o... | UnresolvedMysteries | 346 | 39 |
| 1 | 1709215000.0 | A Shadow Over Leigh: The Unsolved Murder of 14... | **Background** \nLisa Hession was a 14 year o... | UnresolvedMysteries | 217 | 23 |
| 2 | 1709175000.0 | Tera Tracy | Saw this story online and thought it was worth... | UnresolvedMysteries | 67 | 10 |
| 3 | 1709069000.0 | The Boy and the Bike: the Disappearance of Dav... | I know this has been covered a few times, but ... | UnresolvedMysteries | 348 | 54 |
| 4 | 1709053000.0 | A body of a young man is found near a parkway,... | Hello everyone! Thank you for all your comment... | UnresolvedMysteries | 231 | 35 |


In [28]:
# Checking how many total rows we have
redditdata1.shape

(4785, 6)
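Since this is a binary classification problem, it helps to know the class balance before modeling. The sketch below rebuilds only the label column from the two row counts reported above (2324 and 2461) rather than from the real dataframe. The share of the larger class is the accuracy a model would get by always guessing that class, which makes it a natural baseline. The variable names are illustrative.

```python
import pandas as pd

# Label column rebuilt from the row counts shown above
labels = pd.Series(
    ['UnresolvedMysteries'] * 2324 + ['FanTheories'] * 2461,
    name='subreddit',
)

# Proportion of posts from each subreddit, largest first
balance = labels.value_counts(normalize=True)

# Accuracy of always predicting the majority class
baseline = balance.max()

print(balance.round(3))
print(round(baseline, 3))
```

On the real data, the same check would be `redditdata1['subreddit'].astype(str).value_counts(normalize=True)`, assuming the column converts cleanly to strings.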

In [29]:
# Writing the data out to be cleaned in the next step
redditdata1.to_csv('./data/reddit_data.csv')
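By default `to_csv` also writes the dataframe index, which comes back as an `Unnamed: 0` column when the file is read again. After `pd.concat`, that index also repeats, because each source frame starts at 0. The sketch below shows the difference on a toy frame, using an in-memory buffer instead of a file. Whether to drop the index depends on what the cleaning step expects, so treat `index=False` as an option rather than a required change.

```python
import io

import pandas as pd

# Toy frame built with concat, so the index repeats (0, 1, 0)
df = pd.concat([pd.DataFrame({'title': ['a', 'b']}), pd.DataFrame({'title': ['c']})])

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_default = list(pd.read_csv(with_index).columns)

# index=False leaves the index out of the file
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = list(pd.read_csv(without_index).columns)

print(cols_default, cols_no_index)
```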