# Project 3: Subreddit Classification

Project notebook organisation:<br>
**1 - Webscraping and Data Acquisition** (current notebook)<br>
[2 - Preprocessing of data](./2_preprocessing.ipynb)<br>
[3 - Exploratory data analysis](./3_eda.ipynb)<br>
[4 - Model tuning and insights](./4_modelling_and_tuning.ipynb)<br>
<br>
<br>

In [7]:
import time, warnings
import pandas as pd
import numpy as np

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

%matplotlib inline

## Introduction and problem statement

Reddit is a popular social news, content, and discussion website where posts are organised by subject into user-created 'subreddits'. Members submit content (such as images, text, and links) to subreddits, which other members can then vote and comment on, creating an internet community of sorts around specific themes. 

In this project, I examined posts from two subreddits - [**r/Androidquestions**](https://www.reddit.com/r/AndroidQuestions/) (Fig 1) and [**r/iphonehelp**](https://www.reddit.com/r/iphonehelp/) (Fig 2). 
The great handset wars of the 1990s and 2000s saw many a brand rise and fall. Out of the chaos of that period, two incumbent flavours of smartphone now exist: the iPhone, Apple's proprietary handset, and Android, which is better described as an operating system spanning many brands.
When users have issues with their handsets, one avenue for asking about and resolving problems is the web, and Reddit is one of the touch points where users can post their questions.

The goal of this project is therefore to build a model that can discern whether a problem, as described online by users, is iPhone-related or Android-related by looking at what they talk about on their subreddits. 
To answer this question, a word-frequency-based classification model is developed to predict which subreddit a given post belongs to. To identify a production model, a variety of preliminary models are tested and evaluated on their accuracy scores (i.e. the proportion of predictions they get right).
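As a rough illustration of the two ideas in this paragraph, the sketch below counts word frequencies in a piece of text and computes an accuracy score. It uses only the standard library; the helper names `word_counts` and `accuracy` and the simple regular-expression tokeniser are assumptions made for this example, not the preprocessing used in the later notebooks.

```python
import re
from collections import Counter

def word_counts(text):
    """Lower-case the text and count how often each word occurs."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def accuracy(y_true, y_pred):
    """Proportion of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# toy example with made-up labels
counts = word_counts("My phone, my problem")
score = accuracy(["android", "iphone", "iphone", "android"],
                 ["android", "iphone", "android", "android"])
print(counts["my"], score)
```

A classifier built on such counts only sees which words appear and how often, not their order.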

<img src='./images/randroid.png' width = 700 align = center>
<center><font size=2 color='grey'>(Fig 1. The frontpage of r/androidquestions as of 11pm, 24 June 2020.)</font></center>
<img src='./images/riphone.png' width = 700 align = center>
<center><font size=2 color='grey'>(Fig 2. The frontpage of r/iphonehelp as of 11pm, 24 June 2020.)</font></center>

While the goal of this project is to classify posts into subreddits, such classifier models have much wider applications, for example the automatic sorting of customer problems into different categories (to be forwarded to different departments) in government ministries or even handphone service shops.

Due to the scale of this project, it is split into four sequential Jupyter notebooks: webscraping and data acquisition, preprocessing, EDA and model tuning and insights. This is the webscraping and data acquisition notebook.

## Executive Summary

Reddit is a popular social news, content, and discussion website where posts are organised by subject into user-created 'subreddits'. Members submit content (such as images, text, and links) to subreddits, which other members can then vote and comment on, creating an internet community of sorts around specific themes. 
In this project, I examined posts from two subreddits - [**r/Androidquestions**](https://www.reddit.com/r/AndroidQuestions/) and [**r/iphonehelp**](https://www.reddit.com/r/iphonehelp/).  

The goal of this project is therefore to build a model that can discern whether a problem, as described online by users, is iPhone-related or Android-related by looking at what they talk about on their subreddits. 

To answer this question, a word-frequency-based classification model is developed to predict which subreddit a given post belongs to. To identify a production model, a variety of preliminary models are tested and evaluated on their accuracy scores (i.e. the proportion of predictions they get right). The final production model was a multinomial naive Bayes classifier that makes predictions based on title content, text and comments with an accuracy of 83%. This shows that the posts in r/Androidquestions and r/iphonehelp are fairly different, but still share a good number of similarities in their issues. 
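For intuition, a multinomial naive Bayes classifier of the kind named above can be sketched in plain Python. This is a toy illustration on made-up, pre-tokenised posts, not the production model: the function names `train_nb` and `predict_nb` and the Laplace smoothing parameter `alpha` are assumptions for this example.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Collect per-class document counts and per-class word counts."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab, alpha

def predict_nb(model, tokens):
    """Return the class with the highest log posterior for the tokens."""
    class_counts, word_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][w] + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# made-up training posts, already split into words
docs = [["android", "root", "rom"], ["android", "launcher", "sim"],
        ["ios", "icloud", "sim"], ["iphone", "screen", "ios"]]
labels = ["android", "android", "iphone", "iphone"]
model = train_nb(docs, labels)
print(predict_nb(model, ["rom", "android"]), predict_nb(model, ["ios", "screen"]))
```

Shared words such as 'sim' carry little weight here because they are equally likely under both classes, which is one way to see why overlapping vocabulary leads to misclassifications.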

The queries in r/Androidquestions appear to relate mostly to software issues and tweaks to the Android 10 operating system rather than to hardware. In contrast, most issues on r/iphonehelp relate to hardware problems such as accidentally dropping phones in water or replacing cracked screens. It is perhaps unsurprising, then, that posts meant for either subreddit are easy to tell apart. Keywords such as 'Android' and 'ios', which are native to the different operating systems, further help to discriminate between the posts.

Despite the differences between the two phone ecosystems, they still share some issues, which likely explains the model's misclassifications. The phrases that overlap between the top 50 meaningful phrases of each subreddit ('sim card', 'factory reset', 'recovery mode', 'lock screen', 'power button', 'old phone') give an indication of the common issues that most likely span both subreddits.
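The overlap check described here can be sketched as a set intersection between the most frequent two-word phrases of each corpus. The sketch below is a simplified assumption of how this might be done: `top_bigrams` is a hypothetical helper, the example posts are made up, and no stop-word filtering is applied, so it does not reproduce the 'meaningful phrases' selection used in the analysis.

```python
from collections import Counter

def top_bigrams(docs, n=50):
    """Return the set of the n most frequent two-word phrases across docs."""
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        counts.update(zip(tokens, tokens[1:]))  # adjacent word pairs
    return {" ".join(pair) for pair, _ in counts.most_common(n)}

# made-up posts standing in for the two subreddits
android_docs = ["how do i factory reset my pixel",
                "sim card not detected after factory reset"]
iphone_docs = ["sim card tray stuck",
               "factory reset from recovery mode"]

shared = top_bigrams(android_docs) & top_bigrams(iphone_docs)
print(sorted(shared))
```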

To further improve model accuracy, a bigger corpus that incorporates a bigger vocabulary on the different systems is needed. As the modelling results show, models trained on title information alone tended to be less accurate than those trained on text and comment data, which tended to be longer and hence contained more words. The best model on the validation set incorporated *4,162* data points which were a combination of title, comment and text data. In contrast, hyperparameter optimisation, though time-consuming, only achieved very modest accuracy gains.
It can hence be said that the hypothesis of 'throw more data at the model' to improve accuracy scores holds true here. The model does not discriminate between title, text and comments, but merely looks at the vocabulary of words within an entire subreddit post. I hence posit that if this model were deployed for real-life use, the substantial increase in queries and descriptions of problems over time would improve accuracy scores. 

To move the project forward (i.e. to improve accuracy scores) I recommend the following:
1. Feed all 'text'-related information as a single feature into the model
1. Deploy the model and put it in use so as to 'crowd-source' a larger corpus of words as queries come in.
1. Use other sources of data such as other subreddits and other forums to increase the corpus of words.
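The first recommendation can be sketched as follows: each post's title, body and comments are joined into one string that serves as a single text feature. This is a minimal standard-library sketch under assumptions: `combine_text` is a hypothetical helper, records are plain dictionaries, and the key names (`id`, `title`, `text`, `post_id`, `comment_text`) mirror the columns produced by the scraping function later in this notebook.

```python
from collections import defaultdict

def combine_text(posts, comments):
    """Join each post's title, body and comments into one string per post."""
    comments_by_post = defaultdict(list)
    for c in comments:
        comments_by_post[c["post_id"]].append(c["comment_text"])
    combined = {}
    for p in posts:
        parts = [p["title"], p["text"], *comments_by_post[p["id"]]]
        # skip empty fields, e.g. posts with no body text
        combined[p["id"]] = " ".join(part for part in parts if part)
    return combined

# made-up records for illustration
posts = [{"id": "a1", "title": "Phone stuck", "text": ""},
         {"id": "b2", "title": "Cracked screen", "text": "Dropped it"}]
comments = [{"post_id": "a1", "comment_text": "Try recovery mode"}]
combined = combine_text(posts, comments)
print(combined)
```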

As mentioned previously, although the goal of this project is to classify subreddits, such a classification model can also be applied elsewhere, such as to automate front end systems for topic matching and routing of queries to the right troubleshooting teams, recommending possible solutions as part of a larger software system, and the ever-useful spam filtering.


### Contents

1. [Scraping training data](#Scraping-training-data)
 1. [Method 1: Scraping using Reddit API](#Method-1:-Scraping-using-Reddit-API)
 1. [Method 2: Scraping using PRAW](#Method-2:-Scraping-using-PRAW)
1. [Scraping test data](#Scraping-test-data)

## Scraping training data

### Method 1: Scraping using Reddit API

The Reddit API allows remote interaction with Reddit, including downloading posts from subreddits (with a cap of 1000 posts due to the way posts are stored). Each request yields 25 posts, so a for loop has to run the requisite number of times (40 here) to scrape the required number of posts. The API can be queried directly by appending .json to the end of the subreddit URL. This method requires a custom User-agent and a time.sleep() call after scraping each page of data, so that the stream of requests is not mistaken for a DDoS attack and disconnected. After attempting this method, I also realised that getting more information (e.g. number of upvotes, number of comments) than the basic post title, selftext and author requires querying beyond the basic .json listing. I hence explored other options for scraping the data using PRAW, as shown in method 2. 
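The paging logic described above can be isolated into two small helpers, shown here without any network calls. This is a sketch under assumptions: `listing_url` and `parse_page` are hypothetical names, the payload at the bottom is made up to mirror the shape of a listing response, and `limit` is included as the listing query parameter that controls page size (25 by default).

```python
from urllib.parse import urlencode

BASE = "https://www.reddit.com/r/{sub}/.json"

def listing_url(subreddit, after=None, limit=25):
    """Build the URL for one page of a subreddit listing."""
    params = {"limit": limit}
    if after:
        # 'after' is the tag of the last post on the previous page
        params["after"] = after
    return BASE.format(sub=subreddit) + "?" + urlencode(params)

def parse_page(payload):
    """Return (posts, after) from one decoded listing response."""
    data = payload["data"]
    return [child["data"] for child in data["children"]], data["after"]

# made-up payload with the same nesting as a listing response
payload = {"data": {"children": [{"data": {"title": "example post"}}],
                    "after": "t3_abc"}}
page_posts, after = parse_page(payload)
print(listing_url("AndroidQuestions", after=after))
```

When `after` comes back as `None`, the listing is exhausted and the loop can stop.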

In [10]:
import requests, time

posts = []
after = None

# default subreddit listing (insert 'new/' before '.json' for posts sorted by new)
url = 'https://www.reddit.com/r/Androidquestions/.json'

# 40 requests x 25 posts per request = 1000 posts (the listing cap)
for i in range(40):
    if after is None:
        current_url = url
    else:
        current_url = url + '?after=' + after

    # send request to url with a custom User-agent
    res = requests.get(current_url, headers={'User-agent': 'Leonard 1.0'})
    
    # stop on any non-OK response
    if res.status_code != 200:
        print('Request failed with status code', res.status_code)
        break
    
    # get posts and add to [posts]
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
    
    # get tag of last post on the page; None means there are no more pages
    after = current_dict['data']['after']
    if after is None:
        break
    
    # pause between requests so as not to flood the server
    sleep_duration = 1
    time.sleep(sleep_duration)

df = pd.DataFrame(posts)
# df.to_csv('../data/android.csv', index = False)

# check whether all posts are added to df
assert df.shape[0] == len(posts)

# print number of posts saved
print(f'a total of {len(posts)} posts were downloaded.')

df.head()

a total of 993 posts were downloaded.


Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,link_flair_css_class,downs,thumbnail_height,top_awarded_type,hide_score,name,quarantine,link_flair_text_color,upvote_ratio,author_flair_background_color,subreddit_type,ups,total_awards_received,media_embed,thumbnail_width,author_flair_template_id,is_original_content,user_reports,secure_media,is_reddit_media_domain,is_meta,category,secure_media_embed,link_flair_text,can_mod_post,score,approved_by,author_premium,thumbnail,edited,author_flair_css_class,author_flair_richtext,gildings,content_categories,is_self,mod_note,created,link_flair_type,wls,removed_by_category,banned_by,author_flair_type,domain,allow_live_comments,selftext_html,likes,suggested_sort,banned_at_utc,view_count,archived,no_follow,is_crosspostable,pinned,over_18,all_awardings,awarders,media_only,can_gild,spoiler,locked,author_flair_text,treatment_tags,visited,removed_by,num_reports,distinguished,subreddit_id,mod_reason_by,removal_reason,link_flair_background_color,id,is_robot_indexable,report_reasons,author,discussion_type,num_comments,send_replies,whitelist_status,contest_mode,mod_reports,author_patreon_flair,author_flair_text_color,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,post_hint,preview,link_flair_template_id,crosspost_parent_list,crosspost_parent,author_cakeday
0,,AndroidQuestions,The following FAQ is condensed from the [/r/an...,t2_6e2q6,False,,0,False,Frequently Asked Questions (FAQ),[],r/AndroidQuestions,False,6,,0,,,False,t3_2xz6x2,False,dark,0.88,,public,64,0,{},,,False,[],,False,False,,{},,False,64,,False,self,1.43243e+09,,[],{},,True,,1425551000.0,text,6,,,text,self.AndroidQuestions,True,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,qa,,,True,False,False,False,False,[],[],False,False,False,False,,[],False,,,moderator,t5_2rtri,,,,2xz6x2,True,,IAmAN00bie,,30,True,all_ads,False,[],False,,/r/AndroidQuestions/comments/2xz6x2/frequently...,all_ads,True,https://www.reddit.com/r/AndroidQuestions/comm...,73913,1425522000.0,1,,False,,,,,,
1,,AndroidQuestions,Hello everyone!\n\nWe now have a Discord serve...,t2_16yo8q,False,,0,False,We Have a New Discord Server!,[],r/AndroidQuestions,False,6,,0,,,False,t3_7iqnh2,False,dark,0.84,,public,23,0,{},,,False,[],,False,False,,{},Other,False,23,,False,self,False,,[],{},,True,,1512895000.0,text,6,,,text,self.AndroidQuestions,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,qa,,,True,False,False,False,False,[],[],False,False,False,False,,[],False,,,moderator,t5_2rtri,,,,7iqnh2,True,,RootCheckM8,,2,True,all_ads,False,[],False,,/r/AndroidQuestions/comments/7iqnh2/we_have_a_...,all_ads,True,https://www.reddit.com/r/AndroidQuestions/comm...,73913,1512866000.0,0,,False,self,{'images': [{'source': {'url': 'https://extern...,770c7e4c-6009-11e7-87b4-0e0924aa4c7a,,,
2,,AndroidQuestions,,t2_6iph3glg,False,,0,False,is it ok to use battery saver on all the time?,[],r/AndroidQuestions,False,6,,0,,,False,t3_hc0k6r,False,dark,0.91,,public,36,0,{},,,False,[],,False,False,,{},Other,False,36,,False,self,False,,[],{},,True,,1592602000.0,text,6,,,text,self.AndroidQuestions,False,,,qa,,,False,False,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2rtri,,,,hc0k6r,True,,widwidjajajw,,8,True,all_ads,False,[],False,,/r/AndroidQuestions/comments/hc0k6r/is_it_ok_t...,all_ads,False,https://www.reddit.com/r/AndroidQuestions/comm...,73913,1592573000.0,0,,False,,,770c7e4c-6009-11e7-87b4-0e0924aa4c7a,,,
3,,AndroidQuestions,I don't know if this is the place to ask this ...,t2_405ovj7e,False,,0,False,S20 ultra (exynos) or 1plus 8 pro?,[],r/AndroidQuestions,False,6,,0,,,False,t3_hcaiv7,False,dark,1.0,,public,3,0,{},,,False,[],,False,False,,{},,False,3,,False,self,False,,[],{},,True,,1592635000.0,text,6,,,text,self.AndroidQuestions,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,qa,,,False,True,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2rtri,,,,hcaiv7,True,,Arl_121,,4,True,all_ads,False,[],False,,/r/AndroidQuestions/comments/hcaiv7/s20_ultra_...,all_ads,False,https://www.reddit.com/r/AndroidQuestions/comm...,73913,1592607000.0,0,,False,,,,,,
4,,AndroidQuestions,"Full disclosure, I am a current iPhone 7 Plus ...",t2_6y43w7aq,False,,0,False,what is the 2020 experience with samsung andro...,[],r/AndroidQuestions,False,6,,0,,,True,t3_hcc15t,False,dark,1.0,,public,2,0,{},,,False,[],,False,False,,{},,False,2,,False,self,False,,[],{},,True,,1592641000.0,text,6,,,text,self.AndroidQuestions,False,"&lt;!-- SC_OFF --&gt;&lt;div class=""md""&gt;&lt...",,qa,,,False,True,False,False,False,[],[],False,False,False,False,,[],False,,,,t5_2rtri,,,,hcc15t,True,,SamPatchsPetBear,,3,True,all_ads,False,[],False,,/r/AndroidQuestions/comments/hcc15t/what_is_th...,all_ads,False,https://www.reddit.com/r/AndroidQuestions/comm...,73913,1592612000.0,0,,False,,,,,,


In [11]:
df.title

0                       Frequently Asked Questions (FAQ)
1                          We Have a New Discord Server!
2         is it ok to use battery saver on all the time?
3                     S20 ultra (exynos) or 1plus 8 pro?
4      what is the 2020 experience with samsung andro...
                             ...                        
988                      Android Phone Stuck in Fastboot
989                           Sideloading June OTA issue
990    My secondary phone keeps turning on its WiFi, ...
991                       Memoji on Whatsapp for Android
992       Playback pausing issue with stock music player
Name: title, Length: 993, dtype: object

In [12]:
df.selftext

0      The following FAQ is condensed from the [/r/an...
1      Hello everyone!\n\nWe now have a Discord serve...
2                                                       
3      I don't know if this is the place to ask this ...
4      Full disclosure, I am a current iPhone 7 Plus ...
                             ...                        
988    **I have an LG Stylo 4. This is cross-posted, ...
989    Hello all!\n\nSo I have a Pixel 2 XL on the Ma...
990    On my secondary phone (which I use only rarely...
991    As an answer on an archived [post](https://www...
992    So on my old phone a few months ago I noticed ...
Name: selftext, Length: 993, dtype: object

### Method 2: Scraping using PRAW

In order to get more usable data out of the webscrape, I used the [Python Reddit API Wrapper (PRAW)](https://praw.readthedocs.io/en/v3.6.0/), which wraps the Reddit API in a Python library. 

Using this approach, I was able to easily pull information on upvotes and comments. The same limit of 1000 posts per listing applies, and this method is generally slower (12 to 14 minutes per run to scrape both subreddits). Using PRAW, I collected the following from each subreddit:

- post title
- post text (body)
- post ID
- distinguished posts (i.e. whether or not it is a moderator post)
- post score (i.e. number of upvotes)
- post upvote ratio (i.e. number of upvotes divided by the total number of votes)
- post date
- all comments on each post (including replies to other comments) and their respective:
    - comment text
    - distinguished comments
    - comment scores
    - parent post ID

In [5]:
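import os
import praw

# create a PRAW Reddit instance using OAuth credentials registered on the old Reddit website.
# credentials are read from environment variables rather than hard-coded in the notebook;
# the variable names REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET are placeholders
reddit = praw.Reddit(client_id=os.environ['REDDIT_CLIENT_ID'],
                     client_secret=os.environ['REDDIT_CLIENT_SECRET'],
                     user_agent='The rain in Spain')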

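# custom scraping function
def scrape_subreddit(subreddit, postlimit=None):
    """Scrape posts (newest first) and their comments from a subreddit.

    Returns two DataFrames: one of posts and one of comments.
    postlimit=None requests as many posts as the listing allows.
    """
    subreddit = reddit.subreddit(subreddit)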

    post_title = []
    post_text = []
    post_id = []
    post_dist = []
    post_score = []
    post_upvoteratio = []
    post_date = []
    comment_text = []
    comment_dist = []
    comment_score = []
    comment_parentpost_id = []

    # collect from posts sorted by new
    for submission in subreddit.new(limit = postlimit):
        # collect information on post
        post_title.append(submission.title)
        post_text.append(submission.selftext)
        post_id.append(submission.id)
        post_dist.append(submission.distinguished)
        post_score.append(submission.score)
        post_upvoteratio.append(submission.upvote_ratio)
        post_date.append(submission.created_utc)

        # collect all comments on each post: expand every 'load more comments'
        # placeholder, then iterate over the flattened comment tree
        submission.comments.replace_more(limit = None)
        for comment in submission.comments.list():
            comment_text.append(comment.body)
            comment_dist.append(comment.distinguished)
            comment_score.append(comment.score)
            comment_parentpost_id.append(submission.id)
 
    # create a df out of the posts
    df_post = pd.DataFrame({'title': post_title,
                            'id': post_id,
                            'date_created': post_date,
                            'text': post_text,
                            'distinguished': post_dist,
                            'score': post_score,
                            'upvote_ratio': post_upvoteratio})
    df_post['date_created'] = pd.to_datetime(df_post['date_created'], unit = 's')
    
    # put comments into a df
    df_comments = pd.DataFrame({'post_id': comment_parentpost_id,
                                'comment_text': comment_text,
                                'comment_distinguished': comment_dist,
                                'comment_score': comment_score})
    
    return df_post, df_comments

In [20]:
%%time
# scrape from subreddits
android_posts, android_comments = scrape_subreddit('AndroidQuestions')
iphone_posts, iphone_comments = scrape_subreddit('iphonehelp')

CPU times: user 9.01 s, sys: 538 ms, total: 9.55 s
Wall time: 12min 58s


In [23]:
android_posts.to_csv('./data/android_posts.csv')
android_comments.to_csv('./data/android_comments.csv')
iphone_posts.to_csv('./data/iphone_posts.csv')
iphone_comments.to_csv('./data/iphone_comments.csv')

## Scraping test data

The subreddits were scraped again for test data, as 5 days had elapsed since the previous webscrape (20 Jun - 25 Jun 2020). 

In [8]:
%%time
# scrape from subreddits
android_posts_test, android_comments_test = scrape_subreddit('AndroidQuestions')
iphone_posts_test, iphone_comments_test = scrape_subreddit('iphonehelp')

CPU times: user 6.27 s, sys: 460 ms, total: 6.73 s
Wall time: 13min 49s


### Export test data

In [10]:
android_posts_test.to_csv('./data/android_posts_test.csv')
android_comments_test.to_csv('./data/android_comments_test.csv')
iphone_posts_test.to_csv('./data/iphone_posts_test.csv')
iphone_comments_test.to_csv('./data/iphone_comments_test.csv')
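
Because both scrapes pull the newest posts available, the second scrape may return some posts that were already collected in the first one. The sketch below shows one possible way to keep only unseen posts in the test set. It is an assumption rather than a step performed in this notebook: `drop_seen` is a hypothetical helper, and the records are made-up dictionaries whose `id` key mirrors the `id` column of the scraped posts.

```python
def drop_seen(test_posts, train_ids):
    """Keep only the test posts whose id was not in the training scrape."""
    seen = set(train_ids)
    return [p for p in test_posts if p["id"] not in seen]

# made-up records for illustration
train_posts = [{"id": "a1", "title": "Phone stuck"},
               {"id": "b2", "title": "Cracked screen"}]
test_posts = [{"id": "b2", "title": "Cracked screen"},
              {"id": "c3", "title": "Battery drains fast"}]

fresh = drop_seen(test_posts, [p["id"] for p in train_posts])
print([p["id"] for p in fresh])
```

The same filter could be applied to comments using their parent post ID.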