<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3 - Classifying Xbox and Playstation Subreddits

## Content

The project is split into 3 separate workbooks for organisation:

* **Webscraping**
* **Data Cleaning, EDA and Feature Selection**
* **Model Selection**

# Background

Reddit has been a centralized source of gaming news, opinions and debates for consoles and their games. In particular, the Xbox and PlayStation consoles, discussed on [r/xbox](https://www.reddit.com/r/xbox/) and [r/playstation](https://www.reddit.com/r/playstation/), are the 2 most popular and are constantly compared with each other.

As a consumer, buying a console based on which one is the most popular may seem like a poor choice to make.

But, thanks to the psychological concept of informational social influence, it is one many gamers make. What this means is that many consumers will look towards their peers online to choose the ‘right’ console because of a desire to be correct when there is no obvious choice.

# Problem Statement

The moderators of the xbox and playstation subreddits understand that, for consumers to make a properly informed decision, their subreddits should ideally contain only posts related to their own console.

However, posts submitted to the wrong subreddit by their authors, whether intentionally or by mistake, confuse readers about the subject of each subreddit.

As such, the moderators would like to identify posts in their subreddit that are about the rival console. Hence we have been tasked to build and select the best classification model to correctly predict the subreddit a given post belongs to, by identifying the keywords associated with each subreddit.

# Data Collection

* [api.pushshift.io](https://api.pushshift.io/reddit/submission/search): We will be using this API to scrape posts from the 2 subreddits
* [`xbox.csv`](./datasets/xbox.csv): This contains the most recent 2200 xbox subreddit posts
* [`playstation.csv`](./datasets/playstation.csv): This contains the most recent 2200 playstation subreddit posts

# Outside Research

* According to a [comment](https://www.reddit.com/r/xboxone/comments/iwour0/unpopular_opinion_we_need_to_unify_all_xbox/) made on the xbox subreddit, there are many subreddits for each generation of console (r/XboxSeriesX, r/SeriesXbox, r/XboxOne, r/PS4, r/PS5 etc.) instead of one subreddit per console type where the whole community can engage. A consumer looking to buy a console will probably explore beyond r/xbox and r/playstation and visit the more specific subreddits on next-gen consoles. Ideally, scraping these subreddits as well would improve our model.

* The subreddits contain posts by moderators and bots that inform users of the ethics and rules of the subreddit. These posts are repetitive and do not contain any relevant information for building our model.

* If a post is marked NSFW (Not Safe For Work), its text is hidden by default. This could lead to us scraping an empty post.

* Reddit includes an API that allows us to send a request and receive a JSON response containing information about a post, but there are issues with this. One is that Reddit throttles requests to once every few seconds, which we will deal with by putting a sleep statement in our code.
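
Pausing between requests handles normal throttling. As a complement, the sketch below shows one way a throttled request could be retried with an increasing pause. The helper `get_with_retry`, its parameters, and the fake responses are all hypothetical and are not part of this project's code; HTTP status 429 ("Too Many Requests") is used here only as an example of a throttled response.

```python
import time

def get_with_retry(fetch, max_attempts=3, pause=2.0, sleep=time.sleep):
    """Call `fetch()` until it returns status 200 or the attempts run out.

    `fetch` is any zero-argument callable returning (status_code, payload).
    `sleep` is injectable so the example can run without actually waiting.
    """
    for attempt in range(1, max_attempts + 1):
        status, payload = fetch()
        if status == 200:
            return payload
        if attempt < max_attempts:
            sleep(pause * attempt)  # wait longer after each failed attempt
    raise RuntimeError(
        f"request failed after {max_attempts} attempts (last status {status})"
    )

# Hypothetical fetcher that is throttled twice before succeeding.
responses = iter([(429, None), (429, None), (200, {"data": []})])
waits = []  # records the pauses instead of sleeping
result = get_with_retry(lambda: next(responses), sleep=waits.append)
print(result, waits)
```

In real use, `fetch` would wrap the HTTP call and `sleep` would be left at its default of `time.sleep`.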

# Data Dictionary

Each dataset contains the most recent 2200 posts from the xbox or playstation subreddit. The table below shows the variables of interest.


|Variables|Dataset|Type|Description|  
|:--|:-:|:-:|:--|  
|author|unstructured|object|Redditor who posted the submission|  
|title|unstructured|object|Title of the post|  
|self_text|unstructured|object|Includes all text inside the post|  
|subreddit|structured|object|Which subreddit the post was classified under|
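
As a rough illustration of how these columns could be used to screen out the moderator, bot and empty posts mentioned under Outside Research, here is a minimal sketch. The sample rows, the `BOT_AUTHORS` and `EMPTY_MARKERS` sets, and the helper `drop_unusable_posts` are assumptions made for this example. Column names follow the table above, so adjust them if the scraped files name the body column differently. Whether posts with a title but no body should be dropped is a judgment call left to the cleaning step.

```python
import pandas as pd

# Hypothetical sample rows; real rows would come from the scraped CSV files.
posts = pd.DataFrame({
    "author": ["AutoModerator", "gamer_one", "gamer_two", "gamer_three"],
    "title": ["Subreddit rules", "Best controller?", "Console comparison", "New release"],
    "self_text": ["Please read the rules.", "Looking for advice.", None, "[removed]"],
    "subreddit": ["xbox", "xbox", "playstation", "playstation"],
})

# Assumed list of automated accounts; extend it after inspecting the data.
BOT_AUTHORS = {"AutoModerator"}
# Placeholder strings that stand in for missing body text.
EMPTY_MARKERS = {"", "[removed]", "[deleted]"}

def drop_unusable_posts(df):
    """Keep only rows from regular users that still have body text."""
    is_bot = df["author"].isin(BOT_AUTHORS)
    body = df["self_text"].fillna("").str.strip()
    is_empty = body.isin(EMPTY_MARKERS)
    return df[~is_bot & ~is_empty].reset_index(drop=True)

clean = drop_unusable_posts(posts)
print(clean["author"].tolist())
```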


# Import Libraries

In [9]:
import requests
import pandas as pd
import time
import re

# Webscraping from Reddit

A function was created to scrape posts from Reddit because the API only allows us to collect 100 posts per request. We want enough posts for modelling (ideally 1000 per subreddit), but we will be conservative and collect 2200 from each subreddit to account for posts that have missing values or are removed during EDA (for example, bot- or moderator-generated posts, which won't be useful to our model).
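
The numbers above translate into a request count as follows. This is only a small worked calculation; the constant names are made up for the example.

```python
import math

PAGE_SIZE = 100          # posts returned per request
TARGET_FOR_MODEL = 1000  # posts wanted per subreddit after cleaning
POSTS_TO_SCRAPE = 2200   # posts collected per subreddit to leave a margin

# Number of requests needed per subreddit.
loops = math.ceil(POSTS_TO_SCRAPE / PAGE_SIZE)

# Share of scraped posts that can be discarded while still meeting the target.
max_loss = 1 - TARGET_FOR_MODEL / POSTS_TO_SCRAPE

print(loops)              # 22
print(f"{max_loss:.0%}")  # 55%
```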

Essentially, this is the workflow of the function, which collects 100 posts per loop:
* request data from the URL
* parse the JSON response
* convert the posts into a dataframe and store it in a list
* use the oldest post time as the `before` parameter for the next request
* concatenate the dataframes
* write the combined dataframe to a csv file
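
The paging idea, where each request asks for posts older than the oldest one seen so far, can be tried offline before touching the real API. The sketch below uses a hypothetical in-memory stand-in (`FAKE_POSTS`, `fake_search`) instead of a network call. It assumes that `before` means "strictly older than", which should be checked against the real API's behaviour.

```python
# Hypothetical stand-in for the API: 250 posts with increasing timestamps.
FAKE_POSTS = [{"id": i, "created_utc": 1_600_000_000 + i} for i in range(250)]

def fake_search(before, size=100):
    """Return up to `size` posts strictly older than `before`, newest first."""
    older = [p for p in FAKE_POSTS if p["created_utc"] < before]
    older.sort(key=lambda p: p["created_utc"], reverse=True)
    return older[:size]

def collect(loops, start_before):
    """Page backwards in time, moving the `before` cursor after each page."""
    collected = []
    before = start_before
    for _ in range(loops):
        page = fake_search(before)
        if not page:
            break  # nothing older is left
        collected.extend(page)
        before = min(p["created_utc"] for p in page)
    return collected

posts = collect(loops=5, start_before=1_600_000_000 + 10_000)
print(len(posts))  # 250
```

Only three of the five loops return data here, so stopping on an empty page avoids wasted requests.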

In [8]:
# Create a function to get posts from a subreddit
def get_posts(subreddit, loops):
    """Scrape `loops` pages of up to 100 posts each and save them to a CSV file."""
    url = 'https://api.pushshift.io/reddit/submission/search'
    dfs = []
    params = {
        'subreddit': subreddit,
        'size': 100,
        'before': round(time.time())  # start from now and page backwards in time
        }

    for i in range(loops):
        # Request data
        res = requests.get(url, params=params)
        print(f'res {i+1} code: ', res.status_code)
        if res.status_code != 200:
            break  # stop on a failed request and keep what was collected so far
        posts = res.json()['data']
        if not posts:
            break  # no older posts left to collect
        post_df = pd.DataFrame(posts)
        dfs.append(post_df)

        # Get oldest post time and use as before parameter in next request
        params['before'] = int(post_df['created_utc'].min())

        # Suspend execution for 1 second
        time.sleep(1)

    if not dfs:
        print('No posts collected')
        return

    # Combine all pages and write them to a single CSV file
    reddit_posts = pd.concat(dfs, ignore_index=True)
    reddit_posts.to_csv('./datasets/' + subreddit + 'test.csv', index=False)

In [6]:
# Call function to get xbox subreddit posts
get_posts('xbox', 22)

res 1 code:  200
res 2 code:  200
res 3 code:  200
res 4 code:  200
res 5 code:  200
res 6 code:  200
res 7 code:  200
res 8 code:  200
res 9 code:  200
res 10 code:  200
res 11 code:  200
res 12 code:  200
res 13 code:  200
res 14 code:  200
res 15 code:  200
res 16 code:  200
res 17 code:  200
res 18 code:  200
res 19 code:  200
res 20 code:  200
res 21 code:  200
res 22 code:  200


In [7]:
# Call function to get playstation subreddit posts
get_posts('playstation', 22)

res 1 code:  200
res 2 code:  200
res 3 code:  200
res 4 code:  200
res 5 code:  200
res 6 code:  200
res 7 code:  200
res 8 code:  200
res 9 code:  200
res 10 code:  200
res 11 code:  200
res 12 code:  200
res 13 code:  200
res 14 code:  200
res 15 code:  200
res 16 code:  200
res 17 code:  200
res 18 code:  200
res 19 code:  200
res 20 code:  200
res 21 code:  200
res 22 code:  200
