# Project 3: Web Application Programming Interface (API) & Natural Language Processing (NLP)

### Part 1

----

## Background

We are a team of data analysts at a leading tech magazine that strives to be a trusted source for the latest technology trends and consumer electronics brand reviews.

To help the publication team prepare a big year-end story, “The Brutal War for Brand Dominance between 2 consumer electronic giants - Samsung and Apple”, the Editor-in-chief has asked us to develop a text classifier that the technology journalists can use to get a quick sense of the online chatter about the two top tech brands, instead of reading through the forums manually.

The text classifier will be built using supervised learning techniques. It should distinguish whether an online post is about Samsung or Apple and surface meaningful insights on the most talked-about topics, which could give the technology journalists inspiration for their article.

### Problem Statement

Create a text classifier, using supervised learning techniques, that determines whether a Reddit post belongs to the subreddit r/samsung or r/apple (i.e. a binary classification problem), and derive meaningful insights on the most talked-about topics relating to Samsung and Apple.


### Data Used

* [`df_apple.csv`](../data/df_apple.csv): Posts scraped from [`r/apple`](https://www.reddit.com/r/apple)
* [`df_samsung.csv`](../data/df_samsung.csv): Posts scraped from [`r/samsung`](https://www.reddit.com/r/samsung)

### Data Dictionary 

The key features used in the project are listed below:

|Feature|Type|Dataset|Description|
|:---|:---:|:---:|:---|
|author|*object*|ap_clean/ss_clean|creator of the post| 
|num_com|*int*|ap_clean/ss_clean|number of comments in the post|
|selftext|*object*|ap_clean/ss_clean|body text of the post| 
|title|*object*|ap_clean/ss_clean|title of the post| 
|date|*object*|ap_clean/ss_clean|date in string format| 
|datetime|*datetime*|ap_clean/ss_clean|datetime of post| 
|title_len|*int*|ap_clean/ss_clean|length of title| 
|selftext_len|*int*|ap_clean/ss_clean|length of selftext| 
|is_samsung|*int*|ap_clean/ss_clean|mapped as 1 for r/Samsung and 0 for r/Apple| 
|text|*object*|ap_clean/ss_clean|combination of processed title and selftext| 

### Contents

**Part 1** (This workbook):

1. [Data Scraping](#1.Data-Scraping)

[**Part 2**](../code/2_data_cleaning_eda.ipynb):

2. Data Cleaning
3. Preprocessing
4. Exploratory Data Analysis

[**Part 3**](../code/3_modelling.ipynb):

5. Model Selection
6. Conclusion and Recommendations

## 1.Data Scraping

Data was scraped from the two subreddits (r/Apple and r/Samsung) using Pushshift's API, querying the submission search endpoint once per subreddit:

https://api.pushshift.io/reddit/search/submission?subreddit=samsung

https://api.pushshift.io/reddit/search/submission?subreddit=apple
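Each endpoint above is the same base URL with a different query string. As a minimal sketch (the helper name `build_query_url` is hypothetical and not part of the notebook), the full URLs can be assembled from a parameter dict with the standard library, which is roughly what `requests` does when it is handed a `params` dict:

```python
from urllib.parse import urlencode

# Base endpoint taken from the notebook; the helper below is illustrative only.
BASE_URL = 'https://api.pushshift.io/reddit/search/submission'

def build_query_url(subreddit, size=None):
    """Return the endpoint URL with its query string for one subreddit."""
    params = {'subreddit': subreddit}
    if size is not None:
        params['size'] = size
    return f'{BASE_URL}?{urlencode(params)}'

samsung_url = build_query_url('samsung')
apple_url = build_query_url('apple', size=100)
print(samsung_url)
print(apple_url)
```

Printing the URL before sending a request is a cheap way to confirm that the subreddit name in the query matches the one intended.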


In [1]:
# Import libraries 
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', 100)

# Data scraping 
import requests

### 1.1.Samsung Subreddits

In [2]:
# Import requests (as above)

In [3]:
# Create url for API call
url = 'https://api.pushshift.io/reddit/search/submission'

In [4]:
# Set parameters of pull
params_samsung = {
    'subreddit' : 'samsung',
    'size' : 100
}

In [5]:
# Submit request
res_samsung = requests.get(url, params_samsung)

In [6]:
# Request response code
res_samsung.status_code

200
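A status code of 200 means the request succeeded. Because the scraping later in this notebook issues hundreds of requests, it may be worth retrying when a call does not return 200. The sketch below is an assumption about how that could be done, not what the notebook does: `fetch_with_retry` and its `fetch` argument are hypothetical, and the fake responses stand in for real API calls so the example runs offline.

```python
import time

def fetch_with_retry(fetch, max_tries=3, wait_seconds=0.0):
    """Call fetch() until it reports HTTP 200 or max_tries is exhausted.

    fetch is any zero-argument callable returning (status_code, payload).
    """
    last_status = None
    for attempt in range(1, max_tries + 1):
        last_status, payload = fetch()
        if last_status == 200:
            return payload
        if attempt < max_tries:
            time.sleep(wait_seconds)  # pause before the next attempt
    raise RuntimeError(
        f'request failed after {max_tries} tries (last status {last_status})'
    )

# Stand-in for a real call: fails once with 429, then succeeds.
responses = iter([(429, None), (200, {'data': []})])
result = fetch_with_retry(lambda: next(responses))
print(result)
```

In the notebook, `fetch` could wrap the `requests.get` call and return its `status_code` alongside the parsed JSON when the call succeeds.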

In [7]:
# Convert to JSON and check type
samsung_json = res_samsung.json()
type(samsung_json)

dict

In [8]:
# Take a look at the 'samsung_json' dict
samsung_json

{'data': [{'all_awardings': [],
   'allow_live_comments': False,
   'author': 'goldaffe58',
   'author_flair_css_class': None,
   'author_flair_richtext': [],
   'author_flair_text': None,
   'author_flair_type': 'text',
   'author_fullname': 't2_4yjy8lol',
   'author_is_blocked': False,
   'author_patreon_flair': False,
   'author_premium': False,
   'awarders': [],
   'can_mod_post': False,
   'contest_mode': False,
   'created_utc': 1635525338,
   'domain': 'self.samsung',
   'full_link': 'https://www.reddit.com/r/samsung/comments/qigt3u/smart_watch_3_and_health_monitor_not_working/',
   'gildings': {},
   'id': 'qigt3u',
   'is_created_from_ads_ui': False,
   'is_crosspostable': True,
   'is_meta': False,
   'is_original_content': False,
   'is_reddit_media_domain': False,
   'is_robot_indexable': True,
   'is_self': True,
   'is_video': False,
   'link_flair_background_color': '#646d73',
   'link_flair_richtext': [],
   'link_flair_template_id': '1a9f93c4-ddaf-11eb-8b21-0e91e42b02

In [9]:
# The posts we want are in the list stored under the 'data' key of the 'samsung_json' dict
df_samsung = pd.DataFrame(samsung_json['data'])

In [10]:
# Check first 2 rows of 'df_samsung'
df_samsung.head(2)

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_created_from_ads_ui,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_richtext,link_flair_template_id,link_flair_text,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,pwls,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,thumbnail,title,total_awards_received,treatment_tags,upvote_ratio,url,whitelist_status,wls,author_flair_background_color,author_flair_template_id,author_flair_text_color,removed_by_category,link_flair_css_class,crosspost_parent,crosspost_parent_list,url_overridden_by_dest,post_hint,preview,suggested_sort,thumbnail_height,thumbnail_width
0,[],False,goldaffe58,,[],,text,t2_4yjy8lol,False,False,False,[],False,False,1635525338,self.samsung,https://www.reddit.com/r/samsung/comments/qigt3u/smart_watch_3_and_health_monitor_not_working/,{},qigt3u,False,True,False,False,False,True,True,False,#646d73,[],1a9f93c4-ddaf-11eb-8b21-0e91e42b024f,Help,light,text,False,False,False,0,0,False,all_ads,/r/samsung/comments/qigt3u/smart_watch_3_and_health_monitor_not_working/,False,6,1635525349,1,Hey guys. I bought my mom a new phone. A Samsung because she has galaxy watch 3. But it seems li...,True,False,False,samsung,t5_2rkar,222267,public,self,Smart watch 3 and health monitor not working.,0,[],1.0,https://www.reddit.com/r/samsung/comments/qigt3u/smart_watch_3_and_health_monitor_not_working/,all_ads,6,,,,,,,,,,,,,
1,[],False,MileHigh96,s21 series,[],Galaxy s21,text,t2_ixma9y3,False,False,False,[],False,False,1635524629,self.samsung,https://www.reddit.com/r/samsung/comments/qigk1e/devices_compatible_with_samsung_health/,{},qigk1e,False,False,False,False,False,False,True,False,#646d73,[],1a9f93c4-ddaf-11eb-8b21-0e91e42b024f,Help,light,text,False,False,True,1,0,False,all_ads,/r/samsung/comments/qigk1e/devices_compatible_with_samsung_health/,False,6,1635524640,1,[removed],True,False,False,samsung,t5_2rkar,222267,public,self,Devices compatible with Samsung Health,0,[],1.0,https://www.reddit.com/r/samsung/comments/qigk1e/devices_compatible_with_samsung_health/,all_ads,6,#646d73,0ddbc86c-d6ff-11eb-8ef6-0eeb7674c55b,light,moderator,,,,,,,,,


In [11]:
# Check list of column names for 'df_samsung'
df_samsung_cols = list(df_samsung.columns.values)
df_samsung_cols

['all_awardings',
 'allow_live_comments',
 'author',
 'author_flair_css_class',
 'author_flair_richtext',
 'author_flair_text',
 'author_flair_type',
 'author_fullname',
 'author_is_blocked',
 'author_patreon_flair',
 'author_premium',
 'awarders',
 'can_mod_post',
 'contest_mode',
 'created_utc',
 'domain',
 'full_link',
 'gildings',
 'id',
 'is_created_from_ads_ui',
 'is_crosspostable',
 'is_meta',
 'is_original_content',
 'is_reddit_media_domain',
 'is_robot_indexable',
 'is_self',
 'is_video',
 'link_flair_background_color',
 'link_flair_richtext',
 'link_flair_template_id',
 'link_flair_text',
 'link_flair_text_color',
 'link_flair_type',
 'locked',
 'media_only',
 'no_follow',
 'num_comments',
 'num_crossposts',
 'over_18',
 'parent_whitelist_status',
 'permalink',
 'pinned',
 'pwls',
 'retrieved_on',
 'score',
 'selftext',
 'send_replies',
 'spoiler',
 'stickied',
 'subreddit',
 'subreddit_id',
 'subreddit_subscribers',
 'subreddit_type',
 'thumbnail',
 'title',
 'total_awards_r

In [12]:
# Get the 'created_utc' epoch timestamp of the first row (the most recent post)
samsung_start_utc = df_samsung.head(1).iloc[0]['created_utc']
samsung_start_utc 

1635525338
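`created_utc` is a Unix epoch timestamp in seconds. A short sketch of converting the value shown above into a readable UTC datetime, using only the standard library:

```python
from datetime import datetime, timezone

created_utc = 1635525338  # value shown in the cell output above

# Interpret the integer as seconds since 1970-01-01 00:00:00 UTC
post_time = datetime.fromtimestamp(created_utc, tz=timezone.utc)
print(post_time.isoformat())  # 2021-10-29T16:35:38+00:00
```

Passing `tz=timezone.utc` avoids the result silently depending on the local time zone of the machine running the notebook.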

In [13]:
# Multi-scraping loop for subreddit
params_samsung = {
    'subreddit' : 'samsung',
    'size': 100,
    'selftext:not' : '[removed]'  # exclude posts whose body matches '[removed]'
}
frames_to_concat = []
frame_count = 0
while frame_count < 100:
    res_samsung = requests.get(url, params=params_samsung)
    samsung_json = res_samsung.json()
    frame = pd.DataFrame(samsung_json['data'])
    frames_to_concat.append(frame)
    frame_count += 1
    try:
        # Page backwards: the next request asks for posts older than the last one seen
        params_samsung['before'] = frame.tail(1).iloc[0]['created_utc']
    except IndexError:
        # An empty page has no last row, so there is nothing older left to fetch
        break
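The loop above pages backwards through the subreddit by feeding the oldest `created_utc` of each page into the `before` parameter of the next request. The same pattern is sketched below in plain Python with an in-memory stand-in for the API so it can be run offline; `scrape_pages`, `fake_fetch` and `FAKE_POSTS` are hypothetical names used only for illustration.

```python
def scrape_pages(fetch_page, max_pages):
    """Collect posts page by page, walking backwards in time.

    fetch_page(before) returns a list of post dicts older than `before`
    (or the newest posts when before is None), newest first.
    """
    posts = []
    before = None
    for _ in range(max_pages):
        page = fetch_page(before)
        if not page:  # empty page: nothing older is left
            break
        posts.extend(page)
        before = page[-1]['created_utc']  # oldest post seen so far
    return posts

# Tiny in-memory stand-in for the API: 5 fake posts, served 2 per page.
FAKE_POSTS = [{'id': i, 'created_utc': 1000 - i} for i in range(5)]

def fake_fetch(before, size=2):
    older = [p for p in FAKE_POSTS
             if before is None or p['created_utc'] < before]
    return older[:size]

collected = scrape_pages(fake_fetch, max_pages=10)
print([p['id'] for p in collected])  # [0, 1, 2, 3, 4]
```

One caveat of cursoring on a timestamp: if `before` is treated as exclusive, as in this stand-in, posts that share the exact same `created_utc` as the last post of a page could be skipped.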

In [14]:
# Collect frames in dataframe
df_samsung = pd.concat(frames_to_concat, ignore_index=True)

In [15]:
# Check shape of dataframe
df_samsung.shape

(9994, 89)

In [16]:
# Save dataframe
df_samsung.to_csv('../data/df_samsung.csv', index = False)

### 1.2.Apple Subreddits

In [17]:
# Set parameters of pull
params_apple = {
    'subreddit' : 'apple',
    'size' : 100
}

In [18]:
# Submit request
res_apple = requests.get(url, params_apple)

In [19]:
# Request response code
res_apple.status_code

200

In [20]:
# Convert to JSON and check type
apple_json = res_apple.json()
type(apple_json)

dict

In [21]:
# Take a look at the 'apple_json' dict
apple_json

{'data': [{'all_awardings': [],
   'allow_live_comments': False,
   'author': 'Inner_Finding_3659',
   'author_flair_css_class': None,
   'author_flair_richtext': [],
   'author_flair_text': None,
   'author_flair_type': 'text',
   'author_fullname': 't2_7wv0synz',
   'author_is_blocked': False,
   'author_patreon_flair': False,
   'author_premium': False,
   'awarders': [],
   'can_mod_post': False,
   'contest_mode': False,
   'created_utc': 1635525774,
   'domain': 'self.apple',
   'full_link': 'https://www.reddit.com/r/apple/comments/qigymt/2nd_year_college_student_in_need_of_help/',
   'gildings': {},
   'id': 'qigymt',
   'is_created_from_ads_ui': False,
   'is_crosspostable': False,
   'is_meta': False,
   'is_original_content': False,
   'is_reddit_media_domain': False,
   'is_robot_indexable': False,
   'is_self': True,
   'is_video': False,
   'link_flair_background_color': '#eac2ba',
   'link_flair_richtext': [{'e': 'text', 't': 'iPad'}],
   'link_flair_template_id': 'd1c5f9

In [22]:
# The posts we want are in the list stored under the 'data' key of the 'apple_json' dict
df_apple = pd.DataFrame(apple_json['data'])

In [23]:
# Check first 2 rows of 'df_apple'
df_apple.head(2)

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_created_from_ads_ui,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_richtext,link_flair_template_id,link_flair_text,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,pwls,removed_by_category,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,thumbnail,title,total_awards_received,treatment_tags,upvote_ratio,url,whitelist_status,wls,link_flair_css_class,post_hint,preview,thumbnail_height,thumbnail_width,url_overridden_by_dest,media,media_embed,secure_media,secure_media_embed,suggested_sort,author_cakeday
0,[],False,Inner_Finding_3659,,[],,text,t2_7wv0synz,False,False,False,[],False,False,1635525774,self.apple,https://www.reddit.com/r/apple/comments/qigymt/2nd_year_college_student_in_need_of_help/,{},qigymt,False,False,False,False,False,False,True,False,#eac2ba,"[{'e': 'text', 't': 'iPad'}]",d1c5f976-5701-11e9-a2bd-0e424fabf6d2,iPad,dark,richtext,False,False,True,0,0,False,all_ads,/r/apple/comments/qigymt/2nd_year_college_student_in_need_of_help/,False,6,automod_filtered,1635525786,1,[removed],True,False,False,apple,t5_2qh1f,2824284,public,self,2nd year college student in need of help,0,[],1.0,https://www.reddit.com/r/apple/comments/qigymt/2nd_year_college_student_in_need_of_help/,all_ads,6,,,,,,,,,,,,
1,[],False,tr0picana,,[],,text,t2_3n1vn,False,False,False,[],False,False,1635525164,self.apple,https://www.reddit.com/r/apple/comments/qigqw3/those_with_14_macbooks_how_many_lines_of_code_do/,{},qigqw3,False,False,False,False,False,False,True,False,#fbd58c,"[{'e': 'text', 't': 'Mac'}]",cb64da0c-5701-11e9-afcb-0ef51e89245a,Mac,dark,richtext,False,False,True,0,0,False,all_ads,/r/apple/comments/qigqw3/those_with_14_macbooks_how_many_lines_of_code_do/,False,6,automod_filtered,1635525175,1,[removed],True,False,False,apple,t5_2qh1f,2824269,public,self,"Those with 14"" MacBooks, how many lines of code do you see in full screen VS Code using a font s...",0,[],1.0,https://www.reddit.com/r/apple/comments/qigqw3/those_with_14_macbooks_how_many_lines_of_code_do/,all_ads,6,,,,,,,,,,,,


In [24]:
# Check list of column names for 'df_apple'
df_apple_cols = list(df_apple.columns.values)
df_apple_cols

['all_awardings',
 'allow_live_comments',
 'author',
 'author_flair_css_class',
 'author_flair_richtext',
 'author_flair_text',
 'author_flair_type',
 'author_fullname',
 'author_is_blocked',
 'author_patreon_flair',
 'author_premium',
 'awarders',
 'can_mod_post',
 'contest_mode',
 'created_utc',
 'domain',
 'full_link',
 'gildings',
 'id',
 'is_created_from_ads_ui',
 'is_crosspostable',
 'is_meta',
 'is_original_content',
 'is_reddit_media_domain',
 'is_robot_indexable',
 'is_self',
 'is_video',
 'link_flair_background_color',
 'link_flair_richtext',
 'link_flair_template_id',
 'link_flair_text',
 'link_flair_text_color',
 'link_flair_type',
 'locked',
 'media_only',
 'no_follow',
 'num_comments',
 'num_crossposts',
 'over_18',
 'parent_whitelist_status',
 'permalink',
 'pinned',
 'pwls',
 'removed_by_category',
 'retrieved_on',
 'score',
 'selftext',
 'send_replies',
 'spoiler',
 'stickied',
 'subreddit',
 'subreddit_id',
 'subreddit_subscribers',
 'subreddit_type',
 'thumbnail',
 '

In [25]:
# Get the 'created_utc' epoch timestamp of the first row (the most recent post)
apple_start_utc = df_apple.head(1).iloc[0]['created_utc']
apple_start_utc 

1635525774

In [26]:
# Multi-scraping loop for subreddit
params_apple = {
    'subreddit' : 'apple',
    'size': 100,
    'selftext:not' : '[removed]'  # exclude posts whose body matches '[removed]'
}
frames_to_concat = []
frame_count = 0
while frame_count < 400:
    res_apple = requests.get(url, params=params_apple)
    apple_json = res_apple.json()
    frame = pd.DataFrame(apple_json['data'])
    frames_to_concat.append(frame)
    frame_count += 1
    try:
        # Page backwards: the next request asks for posts older than the last one seen
        params_apple['before'] = frame.tail(1).iloc[0]['created_utc']
    except IndexError:
        # An empty page has no last row, so there is nothing older left to fetch
        break

In [27]:
# Collect frames in dataframe
df_apple = pd.concat(frames_to_concat, ignore_index=True)

In [28]:
# Check shape of dataframe
df_apple.shape

(39772, 95)
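The two raw datasets differ considerably in size (9,994 Samsung rows versus 39,772 Apple rows). As a quick arithmetic sketch, using only the shapes printed in this notebook and before any cleaning, the class shares work out as follows:

```python
n_samsung = 9994   # rows in df_samsung (shape output earlier in this notebook)
n_apple = 39772    # rows in df_apple (shape output above)

total = n_samsung + n_apple
samsung_share = n_samsung / total
apple_share = n_apple / total

print(total)                    # 49766
print(round(samsung_share, 3))  # 0.201
print(round(apple_share, 3))    # 0.799
```

If these proportions carried through to the modelling data unchanged, a classifier that always predicted r/apple would already be right about 80% of the time, so the raw imbalance is worth keeping in mind in the later parts.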

In [29]:
# Save dataframe
df_apple.to_csv('../data/df_apple.csv', index = False)