# Introduction and Data Scraping
### Notebook 1 of 4

# Introduction

## Executive Summary

In this project, we assist a new gaming store in setting up an online forum for its local community by building a classification model that tags each post as either 'PS5' or 'Xbox'. To train the model, we use the latest posts from the respective console subreddits on Reddit. The store also needs insights from the console gaming community to decide the focus of its marketing campaign.

From our research, PS5 seems to have a more popular and active community than Xbox. The trending topics are similar for both consoles: their subscription services (PS Plus and Game Pass), the top games, and the controllers. Sentiment and emotion analysis of the Reddit posts suggests that games, in general, garner the most positive sentiment, so games could be a focus for the marketing campaign.

Because the target is a binary variable (PS5 or Xbox), we treat the tagging task as binary classification. After evaluating various models, our top three were Logistic Regression, LightGBM and Random Forest. Logistic Regression was chosen as the best model as it gave the best accuracy score.

## Problem Statement

A gaming store has been newly set up. The owners wish to build up its business activities and increase its physical and online presence.
To start off, they would like to find out which topics are trending for the major consoles (PS5 and Xbox). At the same time, they are exploring the idea of adding a forum to their website for discussion among the gaming community, as well as an e-commerce section that includes product reviews.
We are engaged to develop a classification model that predicts which category a post belongs to. This will help with forum moderation and upkeep, while accurate tagging and section allocation will make the store’s online space easier for users to navigate.
We are also tasked with identifying recent topics of interest and the community’s sentiments towards them. This will give the owners an indicative area of focus for their marketing campaign.

To address this problem, our goals are to:
- Identify the hottest topics from the PS5 and Xbox Series X subreddits
- Gauge the community's sentiments and emotions, both in general and towards specific topics/products
- Develop a classification model to distinguish PS5 posts from Xbox posts
    - Our objective is to classify posts that do not contain the words 'PS5' or 'Xbox' (or similar words), as posts containing them can be tagged/allocated automatically
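The idea of automatically tagging posts that name a console outright can be illustrated with a small keyword check. This is only a sketch: the keyword pattern and the `mentions_console` helper are hypothetical and are not taken from the project's later notebooks.

```python
import re

# Hypothetical keyword pattern (an assumption, not the project's actual list)
CONSOLE_PATTERN = re.compile(
    r"\b(ps\s?5s?|playstation|xbox|xsx|series\s?[xs])\b",
    re.IGNORECASE,
)

def mentions_console(text):
    """Return True if the text names a console explicitly."""
    return CONSOLE_PATTERN.search(text) is not None

titles = [
    "My ps5 doesn't work",
    "Slow download speeds?",
    "For anyone with a XSX and an Alienware AW2521h",
    "disc not working.",
]

# Posts without explicit console words are the ones left for the model
needs_model = [t for t in titles if not mentions_console(t)]
print(needs_model)
```

Posts that fail the check are the harder cases that the classification model is meant to handle.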

### Key Questions

- Which community is more active?
- What are the trending topics for each community?
- Which products should we focus our marketing on?
- Regarding top topics, what are the community’s sentiments and emotions towards them?
- What is the best model to classify posts?


### Process

- Data Collection
- Data Cleaning and Exploration
- Pre-processing
- Modelling
- Model Evaluation
- Sentiments and Emotions Analysis


# Data Collection

To achieve our objective, we will extract the last 15,000 posts from each of the PS5 and Xbox Series X subreddits for analysis.

- Source: Pushshift Reddit API
- Subreddits: PS5 (PlayStation 5) and Xbox Series X
- Time frame: posts created before 24 Jun 2022, 0000hr (UTC)
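The cut-off is passed to the API as a Unix timestamp in seconds. As a quick check, the snippet below derives that value from the calendar date using only the standard library; the variable names are illustrative.

```python
from datetime import datetime, timezone

# Cut-off: 24 Jun 2022, 00:00 UTC
cutoff = datetime(2022, 6, 24, tzinfo=timezone.utc)

# Unix timestamp in seconds, as used for the 'before' parameter
cutoff_epoch = int(cutoff.timestamp())
print(cutoff_epoch)
```

The result matches the `date='1656028800'` argument used in the extraction calls.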

In [1]:
# Import Libraries
import requests
import pandas as pd
import time
import random

In [2]:
url_subs = "https://api.pushshift.io/reddit/search/submission"

## Webscraping

### PS5

In [3]:
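# Define function for extraction of data
def get_bfore_posts(url, subreddit, date, runs=150):
    """Page backwards through a subreddit, collecting posts created before `date`."""
    params = {'subreddit': subreddit, 'size': 100, 'before': date}
    reddit_subs = []
    for i in range(runs):
        res = requests.get(url, params=params)
        if res.status_code != 200:
            # A failed request still counts as a run, so the total may fall short of runs * 100
            print(f"batch {i} failed with status code {res.status_code}")
        else:
            batch = res.json()['data']
            if not batch:
                # No older posts remain, so stop early
                break
            reddit_subs += batch
            # Move the cursor to the oldest post collected so far
            params['before'] = reddit_subs[-1]['created_utc']
            print(f"batch {i} completed")
        # Pause between requests to avoid overloading the API
        time.sleep(random.randint(10, 20))
    return reddit_subs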
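The paging logic in the function above can be tried offline. The sketch below replaces the API with a hypothetical in-memory `fake_fetch` function, so `paginate_before`, `fake_fetch` and `FAKE_POSTS` are all illustrative names, not part of Pushshift or of this project. It assumes the `before` filter is exclusive.

```python
def paginate_before(fetch_page, before, page_size, max_pages):
    """Collect pages of posts, moving the 'before' cursor back each time."""
    collected = []
    for _ in range(max_pages):
        page = fetch_page(before=before, size=page_size)
        if not page:
            break  # nothing older is left
        collected.extend(page)
        # The oldest post of this page becomes the next cursor
        before = page[-1]["created_utc"]
    return collected

# Hypothetical stand-in data: 25 posts, newest first
FAKE_POSTS = [{"id": n, "created_utc": 1000 - n} for n in range(25)]

def fake_fetch(before, size):
    """Return up to `size` posts strictly older than `before`."""
    older = [p for p in FAKE_POSTS if p["created_utc"] < before]
    return older[:size]

posts = paginate_before(fake_fetch, before=1001, page_size=10, max_pages=5)
print(len(posts), posts[0]["created_utc"], posts[-1]["created_utc"])
```

Under that assumption, posts sharing the exact timestamp of the cursor would be skipped when paging this way.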

In [4]:
ps5_df = get_bfore_posts(url_subs, "PS5", date='1656028800')

batch 0 completed
batch 1 completed
batch 2 completed
...

In [5]:
len(ps5_df)

14986

14,986 of the 15,000 requested PS5 posts were extracted. Since nearly all of the posts were retrieved, we will use this PS5 dataset for the analysis.

In [6]:
ps5_df = pd.DataFrame(ps5_df)

In [7]:
ps5_df['created_utc']

0        1656026890
1        1656026393
2        1656024401
3        1656023983
4        1656023471
            ...    
14981    1646727952
14982    1646726903
14983    1646726605
14984    1646724175
14985    1646723536
Name: created_utc, Length: 14986, dtype: int64

Our PS5 dataset covers posts from 8 March 2022 up to the cut-off of 24 June 2022.
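The date range can be read off the first and last `created_utc` values shown in the output above. A minimal check using the standard library, with `to_utc` as an illustrative helper name:

```python
from datetime import datetime, timezone

def to_utc(epoch_seconds):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

newest = to_utc(1656026890)  # first row of ps5_df['created_utc']
oldest = to_utc(1646723536)  # last row of ps5_df['created_utc']
print(oldest.isoformat(), newest.isoformat())
```

In UTC, the newest post falls late on 23 June 2022, just before the 24 June cut-off.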

In [8]:
ps5_df[['subreddit', 'selftext', 'title']]

| | subreddit | selftext | title |
|---|---|---|---|
| 0 | PS5 | [removed] | How did yall get your ps5s? |
| 1 | PS5 | | PS5 Horizon Bundle still in stock on PS Direct |
| 2 | PS5 | [removed] | Ps5 doesn’t work |
| 3 | PS5 | I know that 1.4 only supports 1080p and up to ... | VRR support for HDMI 1.4... |
| 4 | PS5 | [removed] | Slow download speeds? |
| ... | ... | ... | ... |
| 14981 | PS5 | | PS5 Now Has A Console Exclusive Reviewing Wors... |
| 14982 | PS5 | I checked my internet speed and it’s around 2.... | Updates and downloads just stop. I |
| 14983 | PS5 | | Evo 2022 Lineup Reveal will be tomorrow |
| 14984 | PS5 | So after I turned it off last night, it didn't... | My ps5 doesn't work |


### Xbox Series X

In [9]:
xbox_df = get_bfore_posts(url_subs, "XboxSeriesX", date='1656028800')

batch 0 completed
batch 1 completed
batch 2 completed
...

In [10]:
len(xbox_df)

14996

14,996 of the 15,000 requested Xbox posts were extracted. Since nearly all of the posts were retrieved, we will use this Xbox dataset for the analysis.

In [11]:
xbox_df = pd.DataFrame(xbox_df)

In [12]:
xbox_df['created_utc']

0        1656028695
1        1656028592
2        1656028539
3        1656027959
4        1656027813
            ...    
14991    1643670976
14992    1643670696
14993    1643670577
14994    1643670316
14995    1643668667
Name: created_utc, Length: 14996, dtype: int64

Our Xbox dataset covers posts from 31 January 2022 up to the cut-off of 24 June 2022.
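Because both datasets hold roughly the same number of posts but span different periods, a rough posting rate can be worked out from the counts and timestamps shown above. The sketch assumes that submissions per day is a reasonable proxy for community activity; `posts_per_day` is an illustrative helper.

```python
SECONDS_PER_DAY = 86_400

def posts_per_day(n_posts, newest_utc, oldest_utc):
    """Average number of posts per day over the span covered by the dataset."""
    span_days = (newest_utc - oldest_utc) / SECONDS_PER_DAY
    return n_posts / span_days

# Counts and first/last created_utc values taken from the outputs above
ps5_rate = posts_per_day(14986, 1656026890, 1646723536)
xbox_rate = posts_per_day(14996, 1656028695, 1643668667)
print(f"PS5: {ps5_rate:.1f} posts/day, Xbox: {xbox_rate:.1f} posts/day")
```

On these figures, the PS5 subreddit accumulates a similar number of posts in a noticeably shorter period, which is consistent with it being the more active community.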

In [13]:
xbox_df[['subreddit', 'selftext', 'title']]

| | subreddit | selftext | title |
|---|---|---|---|
| 0 | XboxSeriesX | | anyone know how to fix this? |
| 1 | XboxSeriesX | I just purchased the physical Xbox One version... | Resident Evil Disc |
| 2 | XboxSeriesX | | Xbox has released a teaser for their fictional... |
| 3 | XboxSeriesX | | !!! |
| 4 | XboxSeriesX | Tomorrow I'll be receiving my brand new 4k TV... | I'm finally getting a 4k TV what do I install? |
| ... | ... | ... | ... |
| 14991 | XboxSeriesX | this is for my halo infinite disc which I boug... | disc not working. |
| 14992 | XboxSeriesX | | I’m looking into buying the Series X, but I do... |
| 14993 | XboxSeriesX | the hdmi cord got ripped out and took the port... | My dogs tipped my Series X over and im probabl... |
| 14994 | XboxSeriesX | Does the monitor support VRR on the XSX? | For anyone with a XSX and an Alienware AW2521h |


## Export Data

In [14]:
ps5_df.to_csv("data/ps5.csv")

In [15]:
xbox_df.to_csv("data/xbox.csv")
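Two practical points about the export: `to_csv` does not create missing folders, and it writes the DataFrame index as an unnamed first column by default. The sketch below uses a small made-up DataFrame and a temporary directory (both assumptions for illustration) to show one way of handling each point.

```python
import os
import tempfile

import pandas as pd

# Small made-up sample standing in for the scraped data
sample_df = pd.DataFrame({
    "subreddit": ["PS5", "PS5"],
    "title": ["Slow download speeds?", "My ps5 doesn't work"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = os.path.join(tmp_dir, "data")
    # Create the target folder first; to_csv raises an error if it is missing
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, "ps5.csv")
    sample_df.to_csv(out_path)
    # index_col=0 reads the saved index back instead of an 'Unnamed: 0' column
    reloaded = pd.read_csv(out_path, index_col=0)

print(reloaded.columns.tolist())
```

Reading the files back with `index_col=0` in later notebooks avoids carrying an extra `Unnamed: 0` column.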

## Data Collection Summary

To achieve our objective, we set out to extract the last 15,000 posts from each of the PS5 and Xbox Series X subreddits for analysis.
The posts were collected through the Pushshift Reddit API from the PS5 (PlayStation 5) and Xbox Series X subreddits.
The time frame used: posts created before 24 Jun 2022, 0000hr (UTC).

14,986 of the 15,000 requested PS5 posts were extracted.
Our PS5 dataset covers posts from 8 March 2022 up to the cut-off of 24 June 2022.
14,996 of the 15,000 requested Xbox posts were extracted.
Our Xbox dataset covers posts from 31 January 2022 up to the cut-off of 24 June 2022.
