# Introduction 

*r/pcgamingtechsupport* and *r/techsupport* are two well-frequented tech subreddits on reddit.com, where users often go to seek advice on their tech-related problems.

*r/techsupport* is a highly popular subreddit, intended for general software or hardware issues. A wide range of problems is routinely posted here, involving routers, Wi-Fi, monitors, motherboards, audio issues, etc.

*r/pcgamingtechsupport* is a more specialized subreddit, intended for PC users to post any game-related issues that they may have. These could include low in-game FPS, lag spikes, games failing to launch, etc.

As of 04 Dec 2020, *r/techsupport* boasts 1,248,988 subscribers while *r/pcgamingtechsupport* is far smaller, comprising 34,050 subscribers.

# Problem Statement

Our aim is to build a classification model that assigns each post to one of the two subreddits, and to uncover the top words/themes associated with each subreddit.

Given that *r/pcgamingtechsupport* is essentially a more specialized version of *r/techsupport*, we would want to direct PC gaming users there instead, so they can obtain more personalized advice or faster responses from other PC gamers. At the same time, this allows for decluttering of *r/techsupport*, which would aid its moderators in better managing the rest of the posts.

The classification model will be scored on its accuracy, i.e. the proportion of posts that it was able to classify correctly into the two subreddits.

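$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$

Here $TP$, $TN$, $FP$ and $FN$ are the counts of true positives, true negatives, false positives and false negatives, with one of the two subreddits treated as the positive class.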
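As a quick illustration of the accuracy formula, here is a minimal sketch in plain Python. The helper names `confusion_counts` and `accuracy`, the toy labels, and the choice of *r/pcgamingtechsupport* as the positive class (label 1) are all assumptions made for this example; they are not taken from the model built in this notebook.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred, positive=1):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("cannot compute accuracy on an empty sample")
    return (tp + tn) / total


# hypothetical labels: 1 = r/pcgamingtechsupport, 0 = r/techsupport
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_counts(y_true, y_pred))  # (TP, TN, FP, FN)
print(accuracy(y_true, y_pred))
```

In this toy sample six of the eight posts are classified correctly, so the accuracy is 0.75. Because accuracy is symmetric in the two classes, the choice of positive class does not change the score.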

# Imports and Options

In [None]:
# import relevant libraries and modules

import requests
import pandas as pd
import time
import random

# Data Scraping

The code below scrapes the posts from our selected subreddits and writes them to .csv files for further processing (`0.csv` for *r/techsupport* and `1.csv` for *r/pcgamingtechsupport*).

In [None]:
# specify urls of the apis we are going to use

url1 = 'https://www.reddit.com/r/techsupport.json'
url2 = 'https://www.reddit.com/r/pcgamingtechsupport.json'

urls = [url1, url2]

In [None]:
# below code builds the pandas DataFrames from the jsons of the respective subreddits

for index, url in enumerate(urls):
    
    posts = []
    after = None

    # request 40 pages (about 25 posts each by default) to obtain roughly 1000 posts
    for a in range(40):
        if after is None:
            current_url = url
        else:
            current_url = url + '?after=' + after
        print(current_url)
        res = requests.get(current_url, headers={'User-agent': 'Pony Inc 1.0'})

        if res.status_code != 200:
            print('Status error', res.status_code)
            break

        # extend the running list with this page's posts and keep the pagination token
        current_dict = res.json()
        current_posts = [p['data'] for p in current_dict['data']['children']]
        posts.extend(current_posts)
        after = current_dict['data']['after']

        # append this page's posts to the existing .csv on every iteration,
        # so that progress is saved in case the code breaks halfway
        if a > 0:
            prev_posts = pd.read_csv(str(index)+'.csv')
            current_df = pd.DataFrame(current_posts)
            new_df = pd.concat([prev_posts, current_df])
            new_df.to_csv(str(index)+'.csv', index = False)
        
        # first time the loop is run, output to a .csv file 
        else:
            pd.DataFrame(posts).to_csv(str(index)+'.csv', index = False)

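        # stop once the listing is exhausted; with no 'after' token the next
        # request would start again from the first page and duplicate posts
        if after is None:
            break

        # generate a random sleep duration to look more 'natural'
        sleep_duration = random.randint(2,45)
        print(sleep_duration)
        time.sleep(sleep_duration)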
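The pagination used by the scraper works by passing the `after` token from one response as a query parameter of the next request. The sketch below isolates that step in a small helper. The name `build_listing_url` is hypothetical, and the `limit` parameter is an assumption that is not used in the scraping loop: Reddit's listing endpoints are documented to accept a `limit` of up to 100 posts per request, which would reduce the number of requests needed.

```python
from urllib.parse import urlencode


def build_listing_url(base_url, after=None, limit=100):
    """Build the URL for one page of a subreddit listing.

    base_url: the .json endpoint of the subreddit
    after:    pagination token from the previous response (None for page 1)
    limit:    posts per page (assumed here; the scraping loop uses the default)
    """
    params = {"limit": limit}
    if after is not None:
        params["after"] = after
    return base_url + "?" + urlencode(params)


base = "https://www.reddit.com/r/techsupport.json"
first_page = build_listing_url(base)
next_page = build_listing_url(base, after="t3_abc123")  # made-up token

print(first_page)
print(next_page)
```

Building the query string with `urlencode` avoids hand-concatenating `?` and `&`, which becomes error-prone as soon as more than one parameter is involved.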