### Project 3: NLP Model to Predict 2 Subreddits: Game of Thrones & Lord of the Rings

#### By: Melissa McMillan

#### Executive Summary

For Project 3, our mission was to take two similar, overlapping subreddits from Reddit and run a few Natural Language Processing (NLP) models to determine which posts came from which subreddit. For narrative context, I have a "friend" who is the main moderator of r/GameofThrones, and she consistently spends time manually removing Lord of the Rings posts from her subreddit. So, I thought I would try to help her by creating a model that can distinguish Game of Thrones-related posts from those that belong on the r/LordoftheRings (r/LOTR) subreddit instead. My friend could then use the model to help filter the Lord of the Rings posts out of r/GameofThrones.

To gather data, I used Pushshift's Reddit API to scrape 5,100 posts (titles and descriptions) from each subreddit and put them into a dataframe. I did some cleaning and EDA on each and ultimately decided not to use the post descriptions in my model because there were so many null values across the two datasets. Once I combined my datasets into one large dataframe, I did further EDA and cleaning and explored using a RegEx tokenizer to filter out some of the text data. Notebook 01 contains the data collection phase of my project, and Notebook 02 contains the EDA and cleaning phase.

After the EDA and cleaning phase, I began building models. I tested both Multinomial Naive Bayes and Random Forest classifiers to explore which would work best with this dataset. I ran about 20 models across the two types, and those can be found in Notebooks 03, 04, and 05. Both types performed similarly overall, but the best model was a Random Forest Classifier because it had the highest Recall/Sensitivity score and the fewest false negatives. Those were the two key metrics I chose for helping my friend because I thought it would be better for her to let in a small number of Lord of the Rings posts than to discard legitimate Game of Thrones posts.

After running many models and tuning parameters, I found that I was able to create a good model that can help my friend with ~93% accuracy and ~9-10% of the predictions coming up as false negatives. This model was a Random Forest Classifier that utilized a Tfidf Vectorizer and WordNetLemmatizer. This best model can be found in Notebook 05.  

#### Problem Statement

I have a friend who is the main moderator of r/GameofThrones, and she consistently spends time manually removing posts about the Lord of the Rings films from her subreddit. So, I thought I would try to help her by creating a model that can distinguish Game of Thrones-related posts from those that belong on the r/LordoftheRings (r/LOTR) subreddit instead. My friend could then use the model to help filter the Lord of the Rings posts out of r/GameofThrones. The goal of my project is to develop the classification model that best balances a high accuracy score, a high Sensitivity/Recall score, and as few false negative predictions as possible.
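
To make the target metrics concrete, here is a minimal sketch of how accuracy, recall (sensitivity), and the false negative rate fall out of confusion-matrix counts. It assumes Game of Thrones is treated as the positive class, so a false negative is a Game of Thrones post wrongly flagged as Lord of the Rings. The function name and the counts are hypothetical and are not results from this project.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (accuracy, recall, false negative rate) from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total          # share of all predictions that are correct
    recall = tp / (tp + fn)               # share of actual positives that were caught
    false_negative_rate = fn / (tp + fn)  # share of actual positives that were missed
    return accuracy, recall, false_negative_rate


# Hypothetical counts, for illustration only
accuracy, recall, fnr = classification_metrics(tp=90, fp=5, tn=95, fn=10)
print(f"accuracy={accuracy:.3f}, recall={recall:.3f}, fnr={fnr:.3f}")
```

Recall and the false negative rate always sum to 1, so raising one lowers the other.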

In [2]:
#Here I will be using Pushshift's API to webscrape the two Subreddits' posts

In [1]:
import requests
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [63]:
api_url = 'https://api.pushshift.io/reddit/search/submission'

In [142]:
params = {
    'subreddit' : 'gameofthrones',
    'size' : 500  # 500 requested, but each pull below returned 100 posts
}

In [143]:
res = requests.get(api_url, params)

In [144]:
res.status_code

200

In [145]:
got_data = res.json()

In [146]:
#To get a list of the first 100 posts:
first_100 = got_data['data']

In [147]:
len(first_100)

100

In [148]:
len(first_100[0])

63

In [149]:
len(first_100[10])

63

In [150]:
len(first_100[-1])

59
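
The differing lengths above (63 fields versus 59) show that not every post carries the same set of keys. Here is a small sketch for finding which keys a given record lacks. It uses hypothetical toy records rather than real Pushshift output, and `missing_keys` is an illustrative helper, not part of this notebook.

```python
def missing_keys(records, index):
    """Return the sorted keys found in some record but absent from records[index]."""
    all_keys = set()
    for record in records:
        all_keys.update(record.keys())
    return sorted(all_keys - set(records[index].keys()))


# Toy records standing in for posts (hypothetical fields and values)
toy_posts = [
    {'title': 'a', 'selftext': '', 'post_hint': 'image'},
    {'title': 'b', 'selftext': 'text'},
]
missing = missing_keys(toy_posts, -1)
print(missing)
```

When a list of such records is passed to `pd.DataFrame`, keys that are absent from a record become missing values in that row.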

In [151]:
df = pd.DataFrame(first_100)

In [152]:
df[['subreddit', 'selftext', 'title']].head()

Unnamed: 0,subreddit,selftext,title
0,gameofthrones,,Some genius explains how the Trump riots is ‘G...
1,gameofthrones,,"Jagjeet Sandhu Age, Career, Personal Life- Bio..."
2,gameofthrones,,"Kaagaz: plot, cast, review. Kaagaz is a 2021 I..."
3,gameofthrones,,"Urvi Singh Age, Career, Personal Life- Biograp..."
4,gameofthrones,,"Abhishek Banerjee: Wiki, age, birthday &amp; f..."


In [153]:
first_100[-1].get('created_utc')

1610435895
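
The `created_utc` value looks like a Unix timestamp in seconds (an assumption based on the field name and the magnitude of the value). Here is a minimal sketch for turning it into a readable UTC datetime, which makes it easier to sanity-check the `before` cutoff used in the next pull. `utc_to_datetime` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def utc_to_datetime(created_utc):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc)


oldest = utc_to_datetime(1610435895)
print(oldest.isoformat())
```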

In [154]:
#to get more posts than the first 100:
params2 = {
    'subreddit' : 'gameofthrones',
    'size' : 500,
    'before' : 1610435895
}

In [155]:
res2 = requests.get(api_url, params2)

In [156]:
got_data2 = res2.json()

In [157]:
#To get a list of the second 100 posts:
second_100 = got_data2['data']

In [158]:
len(second_100)

100

In [159]:
second_100[-1].get('created_utc')

1610383108

In [160]:
df2 = pd.DataFrame(second_100)
df2[['subreddit', 'selftext', 'title']].head()

Unnamed: 0,subreddit,selftext,title
0,gameofthrones,,I was playing Mario Rabbids and this just came...
1,gameofthrones,,"[NO SPOILERS] When GRRM finishes the books, we..."
2,gameofthrones,,"When GRRM finishes the books, we need an anima..."
3,gameofthrones,We need a good Thrones video game on playstati...,[NO SPOILERS] Thrones video game
4,gameofthrones,[removed],[NO SPOILERS]


In [161]:
len(second_100)

100

In [163]:
master_got_list = []

In [164]:
for element in first_100:
    master_got_list.append(element)
len(master_got_list)

100

In [165]:
for element in second_100:
    master_got_list.append(element)
len(master_got_list)

200

In [None]:
#Now I need to figure out how to automate this step
#I will write a function that iterates through each pull of 100 posts, then adds those to a master list.
#Once I have a master list of posts, I will turn that into a dataframe. 
#Note: I will then have to do the same for comments (?)
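
One way to sketch the automation described above is a loop that keeps requesting posts older than the last one collected and stops when the target is reached or nothing more comes back. This is a hedged illustration, not the function used in this notebook. `collect_posts`, `make_fake_fetcher`, `pause_seconds`, and `max_pulls` are hypothetical names, and a fake in-memory fetcher stands in for the API so the example runs without network access.

```python
import time


def collect_posts(fetch_page, target_count, pause_seconds=0.0, max_pulls=1000):
    """Page backwards through posts until target_count is reached or data runs out.

    fetch_page(before) must return a list of post dicts older than `before`
    (or the newest posts when `before` is None).
    """
    posts = []
    before = None
    for _ in range(max_pulls):
        if len(posts) >= target_count:
            break
        page = fetch_page(before)
        if not page:
            break  # nothing older is available, so stop instead of looping forever
        posts.extend(page)
        before = page[-1]['created_utc']  # oldest post so far becomes the next cutoff
        time.sleep(pause_seconds)         # optional pause between requests
    return posts


def make_fake_fetcher(total, page_size):
    """Build a stand-in for the API: `total` fake posts with descending timestamps."""
    fake = [{'created_utc': 1_000_000 - i, 'title': f'post {i}'} for i in range(total)]

    def fetch_page(before):
        older = [p for p in fake if before is None or p['created_utc'] < before]
        return older[:page_size]

    return fetch_page


demo_posts = collect_posts(make_fake_fetcher(total=250, page_size=100), target_count=1000)
print(len(demo_posts))
```

With the real API, `fetch_page` would wrap the `requests.get(api_url, params)` call used above and add the `'before'` key only when a cutoff is available.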

In [40]:
#If I'm going to loop by using the before time, I need to be able to locate that inside each pull
#I need to use the 'created_utc' entry for the last element in the list
df.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_patreon_flair',
       'author_premium', 'awarders', 'can_mod_post', 'contest_mode',
       'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_crosspostable', 'is_meta', 'is_original_content',
       'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video',
       'link_flair_background_color', 'link_flair_richtext',
       'link_flair_text_color', 'link_flair_type', 'locked', 'media_only',
       'no_follow', 'num_comments', 'num_crossposts', 'over_18',
       'parent_whitelist_status', 'permalink', 'pinned', 'post_hint',
       'preview', 'pwls', 'removed_by_category', 'retrieved_on', 'score',
       'selftext', 'send_replies', 'spoiler', 'stickied', 'subreddit',
       'subreddit_id', 'subreddit_subscribers', 'subreddit_type', 'thumbnail',
       'thumbnail_height', 

In [172]:
def get_posts(url, master_post_list):
    """Pull the next 100 r/gameofthrones posts older than the last post in
    master_post_list and append them to that list in place."""
    
    #use the oldest post collected so far as the 'before' cutoff
    try:
        utc_time = master_post_list[-1].get('created_utc')
    except AttributeError:
        #fall back if the last element is a list of posts rather than a single post
        utc_time = master_post_list[-1][0].get('created_utc')
    
    params = {
        'subreddit' : 'gameofthrones',
        'size' : 100,
        'before' : utc_time}
    
    #now go get the posts
    res_gen = requests.get(url, params)
    got_data_gen = res_gen.json()
    temp_list = got_data_gen['data']
    
    #now add the new data to the list that was passed in
    master_post_list.extend(temp_list)
   
    return 'Your pull is complete.'

In [167]:
len(master_got_list)

200

In [168]:
get_posts(api_url, master_got_list)

'The length of [{\'all_awardings\': [], \'allow_live_comments\': False, \'author\': \'Jacqui_heggen\', \'author_flair_css_class\': None, \'author_flair_richtext\': [], \'author_flair_text\': None, \'author_flair_type\': \'text\', \'author_fullname\': \'t2_9ohz4ut3\', \'author_patreon_flair\': False, \'author_premium\': False, \'awarders\': [], \'can_mod_post\': False, \'contest_mode\': False, \'created_utc\': 1610499124, \'domain\': \'thebrag.com\', \'full_link\': \'https://www.reddit.com/r/gameofthrones/comments/kw55r9/some_genius_explains_how_the_trump_riots_is_game/\', \'gildings\': {}, \'id\': \'kw55r9\', \'is_crosspostable\': False, \'is_meta\': False, \'is_original_content\': False, \'is_reddit_media_domain\': False, \'is_robot_indexable\': False, \'is_self\': False, \'is_video\': False, \'link_flair_background_color\': \'\', \'link_flair_richtext\': [], \'link_flair_text_color\': \'dark\', \'link_flair_type\': \'text\', \'locked\': False, \'media_only\': False, \'no_follow\': Tr

In [169]:
len(master_got_list)

300

In [170]:
master_got_list[-1].get('created_utc')

1610299380

In [173]:
get_posts(api_url, master_got_list)

'Your pull is complete.'

In [174]:
len(master_got_list)

400

In [175]:
master_got_list[-1].get('created_utc')

1610229125

In [176]:
while len(master_got_list) < 2001:
    get_posts(api_url, master_got_list)

In [177]:
len(master_got_list)

2100

In [178]:
while len(master_got_list) < 5001:
    get_posts(api_url, master_got_list)

In [179]:
len(master_got_list)

5100

In [181]:
got_df = pd.DataFrame(master_got_list)
got_df.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,author_flair_text_color,link_flair_css_class,link_flair_text,banned_by,poll_data,media_metadata,author_cakeday,is_gallery,gallery_data,edited
0,[],False,Jacqui_heggen,,[],,text,t2_9ohz4ut3,False,False,...,,,,,,,,,,
1,[],False,lnhax_com,,[],,text,t2_1uzw3jfi,False,False,...,,,,,,,,,,
2,[],False,lnhax_com,,[],,text,t2_1uzw3jfi,False,False,...,,,,,,,,,,
3,[],False,lnhaxcom,,[],,text,t2_4gka8z1a,False,False,...,,,,,,,,,,
4,[],False,lnhaxcom,,[],,text,t2_4gka8z1a,False,False,...,,,,,,,,,,


In [182]:
got_df.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_patreon_flair',
       'author_premium', 'awarders', 'can_mod_post', 'contest_mode',
       'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_crosspostable', 'is_meta', 'is_original_content',
       'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video',
       'link_flair_background_color', 'link_flair_richtext',
       'link_flair_text_color', 'link_flair_type', 'locked', 'media_only',
       'no_follow', 'num_comments', 'num_crossposts', 'over_18',
       'parent_whitelist_status', 'permalink', 'pinned', 'post_hint',
       'preview', 'pwls', 'removed_by_category', 'retrieved_on', 'score',
       'selftext', 'send_replies', 'spoiler', 'stickied', 'subreddit',
       'subreddit_id', 'subreddit_subscribers', 'subreddit_type', 'thumbnail',
       'thumbnail_height', 

In [184]:
got_df[['subreddit', 'selftext', 'title']].tail()

Unnamed: 0,subreddit,selftext,title
5095,gameofthrones,,BIGG BOSS : KARISHMA TANNA Karishma Tanna is a...
5096,gameofthrones,,"NISHA DHAUNDIYAL, The yogini woman: Roadies Re..."
5097,gameofthrones,,[NO SPOILERS] my sisters made me a Daenerys th...
5098,gameofthrones,,My sister made me this Daenerys Christmas orna...
5099,gameofthrones,,My sister made me a Daenerys themed Christmas ...


In [186]:
#Now I will save the dataframe as a csv to use in my model
got_df_final = got_df[['subreddit', 'selftext', 'title']]
got_df_final.head()

Unnamed: 0,subreddit,selftext,title
0,gameofthrones,,Some genius explains how the Trump riots is ‘G...
1,gameofthrones,,"Jagjeet Sandhu Age, Career, Personal Life- Bio..."
2,gameofthrones,,"Kaagaz: plot, cast, review. Kaagaz is a 2021 I..."
3,gameofthrones,,"Urvi Singh Age, Career, Personal Life- Biograp..."
4,gameofthrones,,"Abhishek Banerjee: Wiki, age, birthday &amp; f..."


In [187]:
pwd

'/Users/melissamcmillan/Documents/Python_Stuff/GA_DSI_Course/Submissions/projects/project-3-delivery'

In [188]:
got_df_final.to_csv('./data/got_5100_posts.csv')

Now I want to repeat the same workflow for Lord of the Rings.

In [189]:
#the api_url is the same as before, but I need to change the params
lotr_params = {
    'subreddit' : 'lotr',
    'size' : 500
}

In [190]:
res = requests.get(api_url, lotr_params)

In [191]:
res.status_code

200

In [192]:
lotr_data = res.json()

In [193]:
first_lotr_100 = lotr_data['data']

In [194]:
len(first_lotr_100)

100

In [195]:
master_lotr_list = []

In [196]:
for element in first_lotr_100:
    master_lotr_list.append(element)
len(master_lotr_list)

100

In [197]:
def get_lotr_posts(url, master_post_list):
    """Pull the next 100 r/lotr posts older than the last post in
    master_post_list and append them to that list in place."""
    
    #use the oldest post collected so far as the 'before' cutoff
    try:
        utc_time = master_post_list[-1].get('created_utc')
    except AttributeError:
        #fall back if the last element is a list of posts rather than a single post
        utc_time = master_post_list[-1][0].get('created_utc')
    
    params = {
        'subreddit' : 'lotr',
        'size' : 100,
        'before' : utc_time}
    
    #now go get the posts
    res_gen = requests.get(url, params)
    lotr_data_gen = res_gen.json()
    temp_list = lotr_data_gen['data']
    
    #now add the new data to the list that was passed in
    master_post_list.extend(temp_list)
   
    return 'Your pull is complete.'

In [199]:
get_lotr_posts(api_url, master_lotr_list)

'Your pull is complete.'

In [203]:
len(master_lotr_list)

4200

In [204]:
while len(master_lotr_list) < 5001:
    get_lotr_posts(api_url, master_lotr_list)

In [205]:
len(master_lotr_list)

5100

In [206]:
lotr_df = pd.DataFrame(master_lotr_list)
lotr_df.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,media,media_embed,secure_media,secure_media_embed,author_cakeday,link_flair_css_class,link_flair_text,edited,gilded,banned_by
0,[],False,crasherg15,,[],,text,t2_2jsih7su,False,False,...,,,,,,,,,,
1,[],False,doubavitch,,[],,text,t2_726a4ats,False,False,...,,,,,,,,,,
2,[],False,Durendal_et_Joyeuse,user,[],Gandalf the White,text,t2_caq68,False,False,...,,,,,,,,,,
3,[],False,TA-acount,,[],,text,t2_6fxgs308,False,False,...,,,,,,,,,,
4,[],False,QualFoiBinha,,[],,text,t2_6z9ugs5c,False,False,...,,,,,,,,,,


In [207]:
#Now I will save the dataframe as a csv to use in my model
lotr_df_final = lotr_df[['subreddit', 'selftext', 'title']]
lotr_df_final.head()

Unnamed: 0,subreddit,selftext,title
0,lotr,,Start my Tolkien journey off right
1,lotr,Hi guys!\n\nThis might seem like an unusual re...,Looking for posters
2,lotr,,The OneRing.net says they have confirmed the o...
3,lotr,"As the title says, there was supposed to be a ...",Do you think the Lord of the Rings Musical wil...
4,lotr,,I made this drawing of my babies


In [208]:
lotr_df_final.to_csv('./data/lotr_5100_posts.csv')

Data are exported and ready for cleaning and EDA. Please proceed to Notebook 02.