# GA DSI 26: Project 3 - Reddit Subreddits Classification
***

## Notebook organisation
- **Notebook 1: Introduction, Web scraping and Data Acquisition (current notebook)**
- Notebook 2: EDA and Pre-Processing
- Notebook 3: Model Preparation, Tuning, Insights and Conclusion


## Introduction
***
Reddit is a social news aggregation, web content rating, and discussion website. It was founded in 2005 by Steve Huffman, Alexis Ohanian and Aaron Swartz. Reddit was ranked as the 9th most popular social media app in the US and has over 430 million monthly active users. Subreddits are groups on Reddit dedicated to a specific topic, where people can discuss and share their opinions with others in that subreddit. There are currently over 100,000 active communities within Reddit that cover various topics and subjects [(source)](https://backlinko.com/reddit-users).

Users can submit posts or comments in a subreddit to start or join a discussion. In addition, users can rate posts by upvoting or downvoting them.

## Problem Statement
***

I am an employee at a computer building company in Singapore. The company receives many requests from customers regarding new computer builds and after-sales support for their PCs. The volume of requests has increased, and more staff are needed to sort these requests into PC building queries and after-sales support queries. Hence, the company wants to build a classification model to sort future incoming requests into these two categories.

As the company does not archive past queries, Reddit posts from two relevant subreddits will be used in place of real customer enquiries to train and test the model.

The subreddits identified are [r/buildapc](https://www.reddit.com/r/buildapc/) and [r/techsupport](https://www.reddit.com/r/techsupport/). Both subreddits are extremely popular, with 4.8 million members on [r/buildapc](https://www.reddit.com/r/buildapc/) and 1.6 million members on [r/techsupport](https://www.reddit.com/r/techsupport/). Both offer answers, support and advice on technology and computers in response to questions posed by users.
- [r/buildapc](https://www.reddit.com/r/buildapc/) focuses on questions and advice regarding hardware choices for building a custom desktop computer.

- [r/techsupport](https://www.reddit.com/r/techsupport/) focuses on answering questions on technology. Most questions are computer related; a few concern consoles and mobile phones, but these are rare as other subreddits focus specifically on support for those products.


## Executive Summary
***
Of the 6 models tested, the best-performing model for classifying incoming queries into PC building queries or after-sales support queries is the TfidfVectorizer with Multinomial Naive Bayes. The model correctly classifies 86.9% of the posts. There is still room for improvement to cater for the remaining 13.1% of posts that are wrongly classified.

The company can deploy this model as an automated first step to classify incoming queries without human oversight, as it classifies queries correctly at a high rate. Subsequently, when staff review a query and find it misclassified, they can transfer it to the correct department and flag it for further analysis to improve the model.




## Scraping Data from Subreddits
***

After identifying the subreddits, the next step is to scrape them and obtain the posts to build a dataset for training and testing the model.

In [1]:
# import libraries

import requests
import pandas as pd

### Define a custom function for scraping data from a subreddit

In [2]:
def scrap_data(subreddit, n_batches=10, batch_size=100):
    """Scrape up to n_batches * batch_size (default 1000) posts from the subreddit specified."""
    url = 'https://api.pushshift.io/reddit/search/submission'
    params = {'subreddit': subreddit, 'size': batch_size}
    frames = []

    for _ in range(n_batches):
        res = requests.get(url, params=params)
        res.raise_for_status()  # stop on an HTTP error instead of parsing a bad response
        posts = res.json()['data']
        if not posts:
            # no older posts left to fetch
            break

        batch = pd.DataFrame(posts)
        frames.append(batch)

        # page backwards in time: the next request asks for posts
        # created before the last (oldest) post of this batch
        params['before'] = int(batch['created_utc'].iloc[-1])

    if not frames:
        return pd.DataFrame()

    # combine all batches and renumber the rows from 0
    return pd.concat(frames, ignore_index=True)
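
The paging pattern used by `scrap_data` (request a batch, then ask for posts created before the oldest one already received) can be sketched independently of the network. In this sketch, `paginate` and `fetch_page` are hypothetical names standing in for the loop and the HTTP call, and `fake_fetch` is a made-up data source used only to exercise the loop offline.

```python
def paginate(fetch_page, n_batches=10, batch_size=100):
    """Collect posts batch by batch, walking backwards in time.

    fetch_page(before, size) is a hypothetical callable that returns a list
    of post dicts (newest first), each with a 'created_utc' key.
    """
    posts = []
    before = None
    for _ in range(n_batches):
        batch = fetch_page(before, batch_size)
        if not batch:
            break  # nothing older is available
        posts.extend(batch)
        before = batch[-1]['created_utc']  # oldest post in this batch
    return posts


def fake_fetch(before, size):
    """Made-up data source: posts with timestamps 2499, 2498, ..., 0."""
    newest = 2500 if before is None else before
    oldest = max(newest - 1 - size, -1)
    return [{'created_utc': t} for t in range(newest - 1, oldest, -1)]


posts = paginate(fake_fetch)
print(len(posts))
```

Stopping on an empty batch means a subreddit with fewer posts than requested returns what is available rather than raising an error.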

### Scraping data from both subreddits

I will be using the function above to scrape data from the subreddits and to export a copy to a csv file. The data will then be cleaned before being re-exported to an updated csv file.

The data in this notebook was scraped from the subreddits on 13th January 2022.

In [17]:
# run the function to get data from r/buildapc subreddit
buildapc = scrap_data('buildapc')

In [18]:
# run the function to get data from r/techsupport subreddit
techsupport = scrap_data('techsupport')
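
The export of a raw copy is not shown in the notebook cells. A minimal sketch of how it could be done, assuming a hypothetical helper `save_raw`, a `_raw` file-name suffix, and an output folder of your choosing (none of these appear in the original notebook):

```python
from pathlib import Path

import pandas as pd


def save_raw(df, name, out_dir):
    """Write an untouched copy of a scraped dataframe to <out_dir>/<name>_raw.csv."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    path = out_dir / f'{name}_raw.csv'
    df.to_csv(path, index=False)
    return path
```

For example, `save_raw(buildapc, 'buildapc', '../datasets')` would keep a copy of the scrape so that the cleaning steps can be re-run without calling the API again.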

## Data Cleaning 

### For r/buildapc subreddit

In [19]:
# check the shape of the final dataframe to ensure there are 1000 entries
buildapc.shape

(1000, 67)

In [20]:
buildapc.head()

| | all_awardings | allow_live_comments | author | author_flair_css_class | author_flair_richtext | author_flair_text | author_flair_type | author_fullname | author_is_blocked | author_patreon_flair | ... | upvote_ratio | url | whitelist_status | wls | link_flair_template_id | link_flair_text | post_hint | preview | removed_by_category | author_cakeday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [] | False | Loud_Ad_5985 | | [] | | text | t2_6hfuzcim | False | False | ... | 1.0 | https://www.reddit.com/r/buildapc/comments/s2w... | all_ads | 6 | | | | | | |
| 1 | [] | False | Zenivoo | | [] | | text | t2_gg5faco | False | False | ... | 1.0 | https://www.reddit.com/r/buildapc/comments/s2w... | all_ads | 6 | 7338e9ba-5cc3-11e3-9815-12313b0ae6f4 | Build Help | self | {'enabled': False, 'images': [{'id': 'dbtQ9A34... | | |
| 2 | [] | False | craigmorris78 | | [] | | text | t2_b2w30 | False | False | ... | 1.0 | https://www.reddit.com/r/buildapc/comments/s2w... | all_ads | 6 | 7338e9ba-5cc3-11e3-9815-12313b0ae6f4 | Build Help | | | | |
| 3 | [] | False | tempacc777 | | [] | | text | t2_5vhzga6j | False | False | ... | 1.0 | https://www.reddit.com/r/buildapc/comments/s2w... | all_ads | 6 | | | | | | |
| 4 | [] | False | Y0da_on_crack | | [] | | text | t2_4giszggq | False | False | ... | 0.99 | https://www.reddit.com/r/buildapc/comments/s2w... | all_ads | 6 | | | | | | |


In [21]:
# get the columns names of the dataframe
buildapc.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_is_blocked',
       'author_patreon_flair', 'author_premium', 'awarders', 'can_mod_post',
       'contest_mode', 'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_created_from_ads_ui', 'is_crosspostable', 'is_meta',
       'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable',
       'is_self', 'is_video', 'link_flair_background_color',
       'link_flair_css_class', 'link_flair_richtext', 'link_flair_text_color',
       'link_flair_type', 'locked', 'media_only', 'no_follow', 'num_comments',
       'num_crossposts', 'over_18', 'parent_whitelist_status', 'permalink',
       'pinned', 'pwls', 'retrieved_on', 'score', 'selftext', 'send_replies',
       'spoiler', 'stickied', 'subreddit', 'subreddit_id',
       'subreddit_subscribers', 'subreddit_type', 'suggested_sort'

The columns of interest for building the classification model are `'subreddit'`, `'title'` and `'selftext'`. We will check for null values and duplicates in these columns before exporting the data to a csv file.
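
A null check can also be restricted to just these three columns. The sketch below uses a made-up toy dataframe (not the scraped data) to show the idea; note that an empty string is not counted as null.

```python
import pandas as pd

# Toy stand-in for the scraped dataframe (made-up rows, for illustration only)
toy = pd.DataFrame({
    'subreddit': ['buildapc', 'buildapc', 'buildapc'],
    'title': ['First build', 'GPU advice', None],
    'selftext': ['Parts list inside', '', '[removed]'],
    'author_flair_text': [None, None, None],
})

cols_of_interest = ['subreddit', 'title', 'selftext']

# Null counts restricted to the columns used by the model
null_counts = toy[cols_of_interest].isnull().sum()
print(null_counts)
```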

In [22]:
# checking for null values if they exist in the columns of interest

buildapc.isnull().sum().sort_values().tail(10)

is_original_content          0
link_flair_css_class         4
link_flair_text            365
link_flair_template_id     369
preview                    916
post_hint                  916
removed_by_category        984
author_cakeday             997
author_flair_css_class    1000
author_flair_text         1000
dtype: int64

In [23]:
# check for duplicates in the title column

buildapc['title'].value_counts(ascending = False).head(10)

My screen became whitish all of a sudden.                                  3
MY PC BUILD JOURNEY                                                        2
Help                                                                       2
Cpu                                                                        2
custom pc                                                                  2
Can you review my build?                                                   2
CPU Upgrade                                                                2
quick format                                                               2
How to use gas sensors with Arduino - Arduino tutorial - MQ2 gas sensor    2
2nd Build, gaming PC 1500€ budget. am I choosing right?                    1
Name: title, dtype: int64

These rows are likely to be double posts or reposts, hence we will be dropping them.
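
Before dropping, it can help to view the duplicated rows side by side, since a generic title such as "Help" could in principle be shared by unrelated posts. A small sketch on made-up rows, using `duplicated(keep=False)` so that every member of a duplicate group is shown:

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'title': ['Help', 'CPU Upgrade', 'Help', 'First build'],
    'selftext': ['No display after boot', 'Which CPU?', 'Fans spin, no POST', 'Parts list'],
})

# keep=False marks every row in a duplicate group, not just the later repeats
dupes = toy[toy['title'].duplicated(keep=False)]
print(dupes)
```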

In [24]:
# check for duplicates in the selftext column

buildapc['selftext'].value_counts(ascending = False).head(5)


Several duplicates are observed in the `'selftext'` column. We will inspect the rows where the selftext is empty or `[removed]` before deciding whether to remove them.

In [25]:
# check the rows with empty value in the selftext column

buildapc.loc[(buildapc['selftext'].isin([''])), ['title', 'selftext']]

| | title | selftext |
|---|---|---|
| 55 | How am i supposed to know what pc parts are co... | |
| 95 | I have a 256gb M.2 SSD and a 128gb SATA SSD. W... | |
| 99 | Hi everyone, I’ve built a pc over a year ago b... | |
| 105 | what 3060 ti to buy zotac GAMING twin edge oc ... | |
| 151 | should I go with the Asus z690-p or the gigaby... | |
| 157 | how do you install a mother board with a IO sh... | |
| 170 | Would a Ryzen 3 5300g bottle neck a 1660 super? | |
| 185 | need help or answer will be ok to swap a 3600 ... | |
| 233 | Hi, Im building my first PC atm but i'm stuck ... | |
| 319 | Is a rx 6600 for 520$ a good price | |


In [26]:
# slice out the rows with empty field in selftext column so that these rows would not be dropped during drop duplicates
empty = buildapc.loc[(buildapc['selftext'].isin(['']))]

As can be seen, these rows have a valid title; the original poster simply did not include any text in the `'selftext'` field. Hence, we shall keep these rows and not drop them.

In [27]:
# check the rows with [removed] value in the selftext column

buildapc.loc[(buildapc['selftext'].isin(['[removed]'])), ['title', 'selftext']]

| | title | selftext |
|---|---|---|
| 61 | How to use gas sensors with Arduino - Arduino ... | [removed] |
| 73 | How to use gas sensors with Arduino - Arduino ... | [removed] |
| 271 | Intel Xeon E3-1270 can work with a H61MLB? | [removed] |
| 358 | verizon📞 customer service.📞1866.517.1058📞. pho... | [removed] |
| 360 | 💻verizon. customer service💻.1866.517.1058.💻 ph... | [removed] |
| 361 | verizon customer+1866.517.1058 service phone n... | [removed] |
| 364 | "verizon wireless (866]-5[171058]customer serv... | [removed] |
| 367 | 🗞🗞verizon wireless (866]-5[17🗞1058]🗞customer s... | [removed] |
| 369 | verizon wireless (866]-5[17-1058]customer serv... | [removed] |
| 571 | Need help building a ₹50000 pc. | [removed] |


As can be seen, the rows with `[removed]` in the selftext mostly have irrelevant titles: spam or topics unrelated to the r/buildapc subreddit that have been removed by the moderators. Hence, we shall be dropping these rows.

In [28]:
# dropping duplicated rows in 'title' column
buildapc.drop_duplicates(subset = ['title'], inplace =  True)

In [29]:
# dropping duplicated rows in 'selftext' column
buildapc.drop_duplicates(subset = ['selftext'], inplace =  True)

In [30]:
buildapc.shape

(947, 67)

In [31]:
# concatenate dataframe with the rows that have an empty string in 'selftext column'
buildapc = pd.concat([buildapc, empty])
buildapc.shape

(976, 67)
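
Note that `drop_duplicates` keeps the first row of each duplicate group by default, so one `[removed]` row and one empty-`selftext` row survive the de-duplication above, and concatenating `empty` afterwards adds that surviving empty row a second time. A possible alternative is sketched below, assuming the goal is to drop every `[removed]` post, de-duplicate titles, and de-duplicate only non-empty `selftext` values. `clean_posts` is a hypothetical helper, not part of the original notebook, and applying it would give a different row count from the one shown above.

```python
import pandas as pd


def clean_posts(df):
    """Hypothetical helper: drop removed posts and reposts, keep title-only posts."""
    # 1. Drop every post whose body was removed
    df = df[df['selftext'] != '[removed]']
    # 2. Drop reposts that share a title (the first occurrence is kept)
    df = df.drop_duplicates(subset=['title'])
    # 3. De-duplicate bodies, but never treat empty bodies as duplicates of each other
    is_empty = df['selftext'] == ''
    repeated_body = df['selftext'].duplicated() & ~is_empty
    return df[~repeated_body].reset_index(drop=True)


# Made-up rows for illustration only
toy = pd.DataFrame({
    'title': ['Build A', 'Build B', 'Spam 1', 'Spam 2', 'Build A', 'Build C', 'Build D'],
    'selftext': ['', '', '[removed]', '[removed]', 'repost', 'same text', 'same text'],
})
cleaned = clean_posts(toy)
print(cleaned)
```

With this approach there is no need to slice out the empty rows beforehand and concatenate them back.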

In [35]:
# creating a final dataframe with only the 'subreddit', 'title' and 'selftext' columns

buildapc = buildapc[['subreddit', 'title', 'selftext']]
buildapc.head()

| | subreddit | title | selftext |
|---|---|---|---|
| 0 | buildapc | 3080 or 6900xt? | So I'm gonna be building a gaming PC for 4k ga... |
| 1 | buildapc | Upgrading to a 1440p144Hz setup, would like so... | \*\*What is your intended use for this build? Th... |
| 2 | buildapc | Help fine tune my build before it's final. | My 2500k/980 build is on it's knees and about ... |
| 3 | buildapc | Need help picking out a KVM or other solution ... | As the tittle says, im looking for a KVM switc... |
| 4 | buildapc | Ryzen 5 3600 + GT 710 or Ryzen 3 3200G | Hello! So unfortunately my old GPU died last w... |


In [36]:
buildapc.shape

(976, 3)

### For r/techsupport subreddit

In [37]:
# check the shape of the final dataframe to ensure there are 1000 entries
techsupport.shape

(1000, 66)

In [38]:
techsupport.head()

| | all_awardings | allow_live_comments | author | author_flair_css_class | author_flair_richtext | author_flair_text | author_flair_type | author_fullname | author_is_blocked | author_patreon_flair | ... | total_awards_received | treatment_tags | upvote_ratio | url | whitelist_status | wls | removed_by_category | post_hint | preview | author_cakeday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [] | False | Ok_Professional_9434 | | [] | | text | t2_h1bb1pjk | False | False | ... | 0 | [] | 1.0 | https://www.reddit.com/r/techsupport/comments/... | all_ads | 6 | | | | |
| 1 | [] | False | ravenderealistic | | [] | | text | t2_5p686rrb | False | False | ... | 0 | [] | 1.0 | https://www.reddit.com/r/techsupport/comments/... | all_ads | 6 | | | | |
| 2 | [] | False | 1eeveefan | | [] | | text | t2_b5f3zxlu | False | False | ... | 0 | [] | 1.0 | https://www.reddit.com/r/techsupport/comments/... | all_ads | 6 | | | | |
| 3 | [] | False | Po_gU | | [] | | text | t2_6cczit51 | False | False | ... | 0 | [] | 1.0 | https://www.reddit.com/r/techsupport/comments/... | all_ads | 6 | | | | |
| 4 | [] | False | user-0100 | | [] | | text | t2_f99ol851 | False | False | ... | 0 | [] | 1.0 | https://www.reddit.com/r/techsupport/comments/... | all_ads | 6 | | | | |


In [39]:
# get the columns names of the dataframe
techsupport.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_is_blocked',
       'author_patreon_flair', 'author_premium', 'awarders', 'can_mod_post',
       'contest_mode', 'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_created_from_ads_ui', 'is_crosspostable', 'is_meta',
       'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable',
       'is_self', 'is_video', 'link_flair_background_color',
       'link_flair_css_class', 'link_flair_richtext', 'link_flair_template_id',
       'link_flair_text', 'link_flair_text_color', 'link_flair_type', 'locked',
       'media_only', 'no_follow', 'num_comments', 'num_crossposts', 'over_18',
       'parent_whitelist_status', 'permalink', 'pinned', 'pwls',
       'retrieved_on', 'score', 'selftext', 'send_replies', 'spoiler',
       'stickied', 'subreddit', 'subreddit_id', 'subreddit_sub

The columns of interest for building the classification model are `'subreddit'`, `'title'` and `'selftext'`. We will check for null values and duplicates in these columns before exporting the data to a csv file.

In [40]:
# checking for null values if they exist in the columns of interest

techsupport.isnull().sum().sort_values().tail(10)

link_flair_css_class         0
link_flair_richtext          0
gildings                     0
author                       0
post_hint                  911
preview                    911
removed_by_category        936
author_cakeday             997
author_flair_text         1000
author_flair_css_class    1000
dtype: int64

In [41]:
# check for duplicates in the title column

techsupport['title'].value_counts(ascending = False).head(10)

GPU not recognized                                                                                                                                                            3
Pc on but not displaying                                                                                                                                                      2
Help                                                                                                                                                                          2
Tried erasing/reinstalling a clean OS for my iMac 2012 27", might have bricked it?                                                                                            2
I was playing sims 4 and tabbed out to look for cc and when I clicked a link it took me to some weird Asian nsfw site I clicked out pretty fast so I should be fine right?    2
How do I install windows from the bios                                                                                  

These rows are likely to be double posts or reposts, hence we will be dropping them.

In [42]:
# check for duplicates in the selftext column

techsupport['selftext'].value_counts(ascending = False).head(5)

[removed]                                                                                                                                                                                                                                                                                                                                                                                                     64
i can hear my pc that my fans are on i can olse see the rgb inside of the pc keyboard rgb also on but no display and when i press caps on my keyboard it does not show that i turned it on so i think thats also not working if you can help i would really appreciate it\n\n(i am not good with pc stuff so try and make it as easy as possible)                                                              2
My computer is making a shutdown sound while it is in use. This happens periodically, but nothing happens; the system continues to operate normally; temperatures remain stable and within acceptable 

Several duplicates are observed in the `'selftext'` column, most of them `[removed]`. We will inspect the rows with `[removed]` in the selftext before deciding whether to remove them.

In [44]:
# check the rows with [removed] value in the selftext column

techsupport.loc[(techsupport['selftext'].isin(['[removed]'])), ['title', 'selftext']]

| | title | selftext |
|---|---|---|
| 21 | Lost access to a gmail account from when I was... | [removed] |
| 30 | Random but frequent BSOD | [removed] |
| 48 | 3080 Ti - very very very low FPS on older game... | [removed] |
| 52 | Windows 10 Enterprise 21H1 - PAGE_FAULT_IN_NON... | [removed] |
| 67 | Task manager shows GPU at 80° c but afterburne... | [removed] |
| ... | ... | ... |
| 901 | [Help] My phone's (S9+, android 9) battery jus... | [removed] |
| 903 | Cannot run game installer due to 'compatibilit... | [removed] |
| 941 | Found an old laptop but forgot the password, a... | [removed] |
| 975 | Plz help.. Windows update looks like it’s goin... | [removed] |


As can be seen, the rows with `[removed]` in the selftext have irrelevant titles, mainly topics unrelated to the r/techsupport subreddit, and have been removed by the moderators. Hence, we shall be dropping these rows.

In [45]:
# dropping duplicated rows in 'title' column
techsupport.drop_duplicates(subset = ['title'], inplace =  True)

In [46]:
# dropping duplicated rows in 'selftext' column
techsupport.drop_duplicates(subset = ['selftext'], inplace =  True)

In [47]:
techsupport.shape

(933, 66)

In [48]:
# creating a final dataframe with only the 'subreddit', 'title' and 'selftext' columns

techsupport = techsupport[['subreddit', 'title', 'selftext']]
techsupport.head()

| | subreddit | title | selftext |
|---|---|---|---|
| 0 | techsupport | Win 11 dwm.exe is using way too much vram | I have been on Win 11 since the beta, in Octob... |
| 1 | techsupport | Toshiba Qosmio X70B10T graphics card not detec... | I recently had this laptop formated and I just... |
| 2 | techsupport | How to find original MD5 Hash | Hello, I am trying to determine if a file I ha... |
| 3 | techsupport | Is it worth upgrading to Windows 11 for mostly... | I recently got the notif that I can upgrade to... |
| 4 | techsupport | Deleted google tv remote, can't control tv | I was having trouble with my chromecast google... |


In [49]:
# check the shape of the final dataframe after cleaning
techsupport.shape

(933, 3)

## Exporting the data
***
Both dataframes are exported to csv files for use in the next notebook.

In [50]:
buildapc.to_csv('../datasets/buildapc.csv', index = False)

In [51]:
techsupport.to_csv('../datasets/techsupport.csv', index = False)
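
One thing to watch when these files are read back: `to_csv` writes an empty string as an empty field, and `read_csv` parses empty fields as `NaN` by default, so title-only posts come back with a missing `selftext`. A small sketch of the effect and one way to handle it, using an in-memory buffer and made-up rows instead of the real files:

```python
import io

import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'subreddit': ['buildapc', 'buildapc'],
    'title': ['Is a rx 6600 a good buy?', 'First build'],
    'selftext': ['', 'Parts list inside'],
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default behaviour: the empty selftext is read back as NaN
buffer.seek(0)
default_read = pd.read_csv(buffer)
print(default_read['selftext'].isnull().sum())

# One option: fill the missing bodies with empty strings after loading
restored = default_read.fillna({'selftext': ''})
print(restored['selftext'].tolist())
```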