<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Web APIs & NLP

---

**Notebook 1: Data Collection and Export**<br>
Notebook 2: Cleaning, Preprocessing and EDA<br>
Notebook 3: Model Selection

---

## 1. Problem Statement

A newly established game store wants to add a discussion space to its website in order to increase site traffic.

They have hired us to develop a classification model that accurately predicts which category a discussion thread belongs to. This would reduce the man-hours spent classifying threads manually and streamline thread creation for users.

They would also like to learn more about the popularity of major consoles, products and games so that they can refine their marketing strategy.

---

## 2. Data Collection

In [1]:
# Imports
import requests
import pandas as pd
import time
from random import randint

In [2]:
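# Defining a function to extract data
def extract(subreddit, runs=1):
    """Collect up to runs * 100 submissions, paging backwards in time."""
    url = 'https://api.pushshift.io/reddit/search/submission'
    # 'before' is an epoch timestamp in seconds (1656028800 = 24 June 2022, 00:00 UTC)
    params = {'subreddit': subreddit, 'size': 100, 'before': 1656028800}
    posts = []
    for _ in range(runs):
        res = requests.get(url, params=params)
        if res.status_code != 200:
            print(f"Error! Status code: {res.status_code}")
        else:
            batch = res.json()['data']
            if not batch:  # No older posts left, so stop paging
                break
            posts += batch
            # Move the cursor to the oldest post collected so far
            params['before'] = posts[-1]['created_utc']
        time.sleep(randint(1, 5))  # Random pause between requests, including after errors
        print(len(posts))  # To track rate of extraction
    return pd.DataFrame(posts)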
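
The `before` cursor that `extract` relies on can be tried offline. The following is a minimal sketch, not the notebook's actual request code: `FAKE_POSTS`, `fake_fetch` and `paginate` are hypothetical names, and the in-memory list only stands in for the API, assuming it returns posts newest first and strictly older than `before`. The sketch also stops early when a batch comes back empty.

```python
# Hypothetical in-memory stand-in for the API: 250 posts, newest first.
FAKE_POSTS = [{'id': i, 'created_utc': 1_000_000 - i} for i in range(250)]

def fake_fetch(before, size=100):
    """Return up to `size` posts strictly older than `before`, newest first."""
    older = [p for p in FAKE_POSTS if p['created_utc'] < before]
    return older[:size]

def paginate(fetch, before, runs):
    """Walk backwards in time, moving the cursor to the oldest post seen."""
    posts = []
    for _ in range(runs):
        batch = fetch(before)
        if not batch:  # nothing older left, stop early
            break
        posts += batch
        before = batch[-1]['created_utc']
    return posts

posts = paginate(fake_fetch, before=1_000_001, runs=5)
print(len(posts))
```

Because each request only asks for posts older than the last one collected, no post is fetched twice, even though five runs were requested and only three were needed.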

In [3]:
# Retrieving posts from PS5 subreddit
ps5 = extract(subreddit='PS5', runs=150)

100
199
299
...
14786
14886
14986


In [4]:
# First glimpse at PS5 dataset
print(ps5.shape)
ps5.head()

(14986, 86)


| | all_awardings | allow_live_comments | author | author_flair_css_class | author_flair_richtext | author_flair_text | author_flair_type | author_fullname | author_is_blocked | author_patreon_flair | ... | crosspost_parent_list | distinguished | media_metadata | banned_by | suggested_sort | discussion_type | call_to_action | category | edited | collections |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [] | False | lowlifectc | | [] | | text | t2_8ts292c3 | False | False | ... | | | | | | | | | | |
| 1 | [] | False | willdearborn- | | [] | | text | t2_8b1hlted | False | False | ... | | | | | | | | | | |
| 2 | [] | False | stvxv | | [] | | text | t2_mfebh2g1 | False | False | ... | | | | | | | | | | |
| 3 | [] | False | darkexistor | | [] | | text | t2_5z5018j7 | False | False | ... | | | | | | | | | | |
| 4 | [] | False | ItsNaws | | [] | | text | t2_mwboj532 | False | False | ... | | | | | | | | | | |


In [5]:
ps5.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14986 entries, 0 to 14985
Data columns (total 86 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   all_awardings                  14986 non-null  object 
 1   allow_live_comments            14986 non-null  bool   
 2   author                         14986 non-null  object 
 3   author_flair_css_class         9 non-null      object 
 4   author_flair_richtext          14915 non-null  object 
 5   author_flair_text              321 non-null    object 
 6   author_flair_type              14915 non-null  object 
 7   author_fullname                14915 non-null  object 
 8   author_is_blocked              14986 non-null  bool   
 9   author_patreon_flair           14915 non-null  object 
 10  author_premium                 14915 non-null  object 
 11  awarders                       14986 non-null  object 
 12  can_mod_post                   14986 non-null 

In [6]:
# Checking date and time of earliest post extracted from 'PS5' subreddit
ps5['created_utc'].tail(1)

14985    1646723536
Name: created_utc, dtype: int64

The timestamp `1646723536` converts to Tuesday, March 8, 2022, 3:12:16 PM (GMT+8), so the PS5 dataset reaches back to early March 2022.
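
The conversion can be reproduced in Python rather than with an external converter. This is a small sketch using only the standard library; `GMT8` and `epoch_to_gmt8` are hypothetical names introduced here, and the fixed UTC+8 offset is an assumption taken from the "GMT+8" label above.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC+8 offset (assumed from the "GMT+8" label in this notebook)
GMT8 = timezone(timedelta(hours=8))

def epoch_to_gmt8(ts):
    """Convert epoch seconds to a timezone-aware datetime in UTC+8."""
    return datetime.fromtimestamp(ts, tz=GMT8)

stamp = epoch_to_gmt8(1646723536)
print(stamp.isoformat())  # 2022-03-08T15:12:16+08:00
```

The same helper could be applied to the last value of any `created_utc` column to check how far back a dataset reaches.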

In [7]:
# Extracting posts from XboxSeriesX subreddit
xbox = extract(subreddit='XboxSeriesX', runs=150)

100
200
300
...
14797
14896
14996


In [8]:
# First glimpse at Xbox dataset
print(xbox.shape)
xbox.head()

(14996, 92)


| | all_awardings | allow_live_comments | author | author_flair_css_class | author_flair_richtext | author_flair_text | author_flair_type | author_fullname | author_is_blocked | author_patreon_flair | ... | collections | poll_data | distinguished | discussion_type | tournament_data | call_to_action | category | edited | crosspost_parent | crosspost_parent_list |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [] | False | OchoaJuan2004 | | [] | | text | t2_55m20ag7 | False | False | ... | | | | | | | | | | |
| 1 | [] | False | Aggravating-Credit95 | | [] | | text | t2_cflg5v8x | False | False | ... | | | | | | | | | | |
| 2 | [] | False | timmy-failure | | [] | | text | t2_2n88e3am | False | False | ... | | | | | | | | | | |
| 3 | [] | False | lilgingabredd | | [] | | text | t2_6wkbgmtx | False | False | ... | | | | | | | | | | |
| 4 | [] | False | Mocti_54 | | [] | | text | t2_58wvgb04 | False | False | ... | | | | | | | | | | |
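
The two datasets have different column counts (86 for PS5, 92 for Xbox), so it may help to see which columns are not shared before combining them. The sketch below is one possible way to check this. `column_diff` is a hypothetical helper, and the column lists are made-up toy examples rather than the real differences, which are not fully visible in the truncated output above.

```python
def column_diff(left_cols, right_cols):
    """Return (only_in_left, only_in_right) as sorted lists of column names."""
    left, right = set(left_cols), set(right_cols)
    return sorted(left - right), sorted(right - left)

# Toy column lists standing in for ps5.columns and xbox.columns
toy_ps5_cols = ['title', 'selftext', 'banned_by']
toy_xbox_cols = ['title', 'selftext', 'poll_data', 'tournament_data']

only_ps5, only_xbox = column_diff(toy_ps5_cols, toy_xbox_cols)
print(only_ps5)
print(only_xbox)
```

In this notebook the call would be `column_diff(ps5.columns, xbox.columns)`.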


In [9]:
xbox.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14996 entries, 0 to 14995
Data columns (total 92 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   all_awardings                  14996 non-null  object 
 1   allow_live_comments            14996 non-null  bool   
 2   author                         14996 non-null  object 
 3   author_flair_css_class         0 non-null      object 
 4   author_flair_richtext          14941 non-null  object 
 5   author_flair_text              1565 non-null   object 
 6   author_flair_type              14941 non-null  object 
 7   author_fullname                14941 non-null  object 
 8   author_is_blocked              14996 non-null  bool   
 9   author_patreon_flair           14941 non-null  object 
 10  author_premium                 14941 non-null  object 
 11  awarders                       14996 non-null  object 
 12  can_mod_post                   14996 non-null 

In [10]:
# Checking date and time of earliest post extracted from 'XboxSeriesX' subreddit
xbox['created_utc'].tail(1)

14995    1643668667
Name: created_utc, dtype: int64

The timestamp `1643668667` converts to Tuesday, February 1, 2022, 6:37:47 AM (GMT+8), so the Xbox dataset reaches back to the start of February 2022.

In [13]:
# Exporting entire PS5 dataset
ps5.to_csv('./datasets/ps5_full.csv', index=False)

In [14]:
# Exporting entire Xbox dataset
xbox.to_csv('./datasets/xbox_full.csv', index=False)
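
Both exports assume that the `./datasets` folder already exists, since `to_csv` does not create missing folders. A minimal sketch of a guard for this is shown below. `prepare_export_path` is a hypothetical helper, not part of pandas, and the temporary directory is only there to keep the demonstration from touching the real project folder.

```python
import tempfile
from pathlib import Path

def prepare_export_path(path):
    """Create the parent folder of `path` if needed and return it as a Path."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    return target

# Demonstration inside a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    out = prepare_export_path(Path(tmp) / 'datasets' / 'ps5_full.csv')
    created = out.parent.is_dir()

print(created)
```

In this notebook the export could then be written as `ps5.to_csv(prepare_export_path('./datasets/ps5_full.csv'), index=False)`.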

---