<a id='top'></a>
# Reddit API and Classification

## Executive Summary
---

Reddit is a website comprising user-generated content (photos, videos, links, and text-based posts) and discussions of this content, in what is essentially a bulletin board system. As of 2018, there were approximately 330 million Reddit users, called "redditors". The site's content is divided into categories or communities known on-site as "subreddits", of which more than 138,000 are active.

These subreddits are governed by moderators who set and enforce community-specific rules, remove posts and comments that violate these rules, and generally work to keep discussions in their subreddit on topic.



## Problem Statement

As Reddit's popularity has grown, the Apple subreddit has garnered approximately 1.8 million members. With that growth has come an increase in people who use the subreddit for advertisements unrelated to the community, and in "competitor supporters" who post unrelated content. Competitor-related posts increase especially during seasons when competitors are promoting new product launches. Despite strict rules prohibiting such activities, the moderators spend considerable time and effort filtering out and removing unrelated content.

Community members have also grown unhappy, as they cannot browse the subreddit without running into at least one or two unrelated posts, such as advertisements or off-topic threads, that the moderators have not yet removed.

As a newly promoted moderator of the subreddit "r/Apple - the unofficial Apple community", I was tasked by the existing moderators with creating a classifier model that can accurately identify whether a post belongs to the Apple subreddit or is unrelated to the community as a whole.

Since the existing moderators have found that a high proportion of the unrelated content belongs to the "competitor subreddit", Android, I have decided to create a binary classification model based on the titles of posts from the apple and Android subreddits.

## Data Collection & Wrangling
---

We will be using Reddit's JSON API to collect posts (i.e. threads) from the two subreddits:<br>
- r/Android
- r/apple

At the end of this section, we will have one dataframe per subreddit, each containing the posts collected from that subreddit.

### Library Imports

In [3]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

from bs4 import BeautifulSoup
import requests
import time
import random

### Get requests
To access the subreddits, we begin by using the requests library to send HTTP GET requests. Each request returns a response object containing all of the response data.

In [4]:
#defining url to access
url_android = 'https://www.reddit.com/r/Android/.json'
url_apple = 'https://www.reddit.com/r/apple/.json'

In [5]:
#sending get requests to the defined urls
res_android = requests.get(url_android, headers={'User-agent': 'Pony Inc 1.0'})
res_apple = requests.get(url_apple, headers={'User-agent': 'Tony Inc 2.0'})

In [6]:
#checking status_code for android subreddit
#the HTTP 200 OK success status response code indicates that the request has succeeded
res_android.status_code

200

In [7]:
#checking status_code for apple subreddit
#the HTTP 200 OK success status response code indicates that the request has succeeded
res_apple.status_code

200
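The status checks above can be made a little more readable with a small helper. The sketch below is only an illustration: `describe_status` and `is_success` are hypothetical names (they are not part of `requests`), and the code relies solely on the standard library's `http.HTTPStatus` to turn a numeric code into a label.

```python
from http import HTTPStatus


def describe_status(status_code):
    """Return a label such as '200 OK' for a numeric HTTP status code."""
    try:
        status = HTTPStatus(status_code)
    except ValueError:
        # Code is not one of the statuses known to the standard library.
        return f"{status_code} (unknown status)"
    return f"{status.value} {status.phrase}"


def is_success(status_code):
    """True for any 2xx status code, which signals a successful request."""
    return 200 <= status_code < 300


print(describe_status(200), is_success(200))
print(describe_status(429), is_success(429))
```

With the responses above, `describe_status(res_apple.status_code)` would give `'200 OK'`.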

#### Let's begin with the JSON object from the apple subreddit.
Once parsed with `.json()`, the JSON object behaves like a Python dictionary of key-value pairs.

In [8]:
dict_apple = res_apple.json()
dict_apple

{'kind': 'Listing',
 'data': {'modhash': '',
  'dist': 27,
  'children': [{'kind': 't3',
    'data': {'approved_at_utc': None,
     'subreddit': 'apple',
     'selftext': '\n\nWelcome to the daily Tech Support thread for /r/Apple. \n\nHave a question you need answered? Ask away! Please remember to adhere to our rules, which can be found in the sidebar. On mobile? [Here is a screenshot with our rules](https://i.imgur.com/yekEMCO).\n\nJoin our Discord and IRC chat rooms for support:\n\n[Discord](https://discord.gg/apple)\n\n[IRC](https://kiwiirc.com/client/irc.snoonet.org/apple?nick=CHANGE_ME)\n\n**Note: Comments are sorted by /new for your convenience**\n\nHere is an [archive](https://www.reddit.com/r/apple/search?q=title%3A%22Daily+Tech+Support+Thread%22+author%3A%22AutoModerator%22&amp;restrict_sr=on&amp;sort=new&amp;t=all) of all previous "Tech Support" threads. This is best viewed on a browser. If on mobile, type on the searchbar [title:"Daily Tech Support Thread" author:"AutoModera

#### Exploring the JSON object

In [9]:
dict_apple.keys()

dict_keys(['kind', 'data'])

In [10]:
dict_apple['kind']

'Listing'

In [11]:
dict_apple['data'].keys()

dict_keys(['modhash', 'dist', 'children', 'after', 'before'])
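To make the nesting easier to see, here is a minimal sketch that parses a tiny listing with the same shape as the keys shown above. The sample text is hand-written and made up for illustration (it is not real Reddit data), and the variable names are assumptions.

```python
import json

# Hand-written, made-up sample that mimics the shape of a Reddit listing.
sample_text = """
{
  "kind": "Listing",
  "data": {
    "after": "t3_example",
    "before": null,
    "children": [
      {"kind": "t3", "data": {"subreddit": "apple", "title": "First sample title"}},
      {"kind": "t3", "data": {"subreddit": "apple", "title": "Second sample title"}}
    ]
  }
}
"""

listing = json.loads(sample_text)

# Top level: 'kind' and 'data'.
top_level_keys = sorted(listing.keys())

# Each post sits in listing['data']['children'][i]['data'].
titles = [child["data"]["title"] for child in listing["data"]["children"]]

# Token that points at the next page of results.
next_token = listing["data"]["after"]

print(top_level_keys)
print(titles)
print(next_token)
```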

#### Subreddit posts "location"

In [12]:
#the subreddit posts are nested in the key: children
dict_apple['data']['children'][0]

{'kind': 't3',
 'data': {'approved_at_utc': None,
  'subreddit': 'apple',
  'selftext': '\n\nWelcome to the daily Tech Support thread for /r/Apple. \n\nHave a question you need answered? Ask away! Please remember to adhere to our rules, which can be found in the sidebar. On mobile? [Here is a screenshot with our rules](https://i.imgur.com/yekEMCO).\n\nJoin our Discord and IRC chat rooms for support:\n\n[Discord](https://discord.gg/apple)\n\n[IRC](https://kiwiirc.com/client/irc.snoonet.org/apple?nick=CHANGE_ME)\n\n**Note: Comments are sorted by /new for your convenience**\n\nHere is an [archive](https://www.reddit.com/r/apple/search?q=title%3A%22Daily+Tech+Support+Thread%22+author%3A%22AutoModerator%22&amp;restrict_sr=on&amp;sort=new&amp;t=all) of all previous "Tech Support" threads. This is best viewed on a browser. If on mobile, type on the searchbar [title:"Daily Tech Support Thread" author:"AutoModerator"] (without the brackets, and including the quotation marks around the title and

In [13]:
#key to access the next posts
dict_apple['data']['after']

't3_j2aiow'

In [14]:
dict_apple['data']['children'][0].keys()

dict_keys(['kind', 'data'])

In [15]:
dict_apple['data']['children'][0]['kind']

't3'

In [16]:
dict_apple['data']['children'][0]['data'].keys()

dict_keys(['approved_at_utc', 'subreddit', 'selftext', 'author_fullname', 'saved', 'mod_reason_title', 'gilded', 'clicked', 'title', 'link_flair_richtext', 'subreddit_name_prefixed', 'hidden', 'pwls', 'link_flair_css_class', 'downs', 'thumbnail_height', 'top_awarded_type', 'hide_score', 'name', 'quarantine', 'link_flair_text_color', 'upvote_ratio', 'author_flair_background_color', 'subreddit_type', 'ups', 'total_awards_received', 'media_embed', 'thumbnail_width', 'author_flair_template_id', 'is_original_content', 'user_reports', 'secure_media', 'is_reddit_media_domain', 'is_meta', 'category', 'secure_media_embed', 'link_flair_text', 'can_mod_post', 'score', 'approved_by', 'author_premium', 'thumbnail', 'edited', 'author_flair_css_class', 'author_flair_richtext', 'gildings', 'post_hint', 'content_categories', 'is_self', 'mod_note', 'created', 'link_flair_type', 'wls', 'removed_by_category', 'banned_by', 'author_flair_type', 'domain', 'allow_live_comments', 'selftext_html', 'likes', 'sug

In [17]:
#Subreddit of post
dict_apple['data']['children'][0]['data']['subreddit']

'apple'

In [18]:
#Title of first post in the subreddit
dict_apple['data']['children'][0]['data']['title']

'Daily Tech Support Thread - [September 30]'

In [19]:
#Title of second post in the subreddit
dict_apple['data']['children'][1]['data']['title']

'[Meta] Changes coming to the subreddit re: Self Promo Saturday'

In [20]:
#First post content
dict_apple['data']['children'][0]['data']['selftext']

'\n\nWelcome to the daily Tech Support thread for /r/Apple. \n\nHave a question you need answered? Ask away! Please remember to adhere to our rules, which can be found in the sidebar. On mobile? [Here is a screenshot with our rules](https://i.imgur.com/yekEMCO).\n\nJoin our Discord and IRC chat rooms for support:\n\n[Discord](https://discord.gg/apple)\n\n[IRC](https://kiwiirc.com/client/irc.snoonet.org/apple?nick=CHANGE_ME)\n\n**Note: Comments are sorted by /new for your convenience**\n\nHere is an [archive](https://www.reddit.com/r/apple/search?q=title%3A%22Daily+Tech+Support+Thread%22+author%3A%22AutoModerator%22&amp;restrict_sr=on&amp;sort=new&amp;t=all) of all previous "Tech Support" threads. This is best viewed on a browser. If on mobile, type on the searchbar [title:"Daily Tech Support Thread" author:"AutoModerator"] (without the brackets, and including the quotation marks around the title and author.)'

In [21]:
posts_apple = [p['data'] for p in dict_apple['data']['children']]

In [22]:
pd.DataFrame(posts_apple)

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,link_flair_template_id,url_overridden_by_dest
0,,apple,\n\nWelcome to the daily Tech Support thread f...,t2_6l4z3,False,,0,False,Daily Tech Support Thread - [September 30],"[{'e': 'text', 't': 'Official Megathread'}]",...,all_ads,True,https://www.reddit.com/r/apple/comments/j2nxhx...,1807195,1601479000.0,0,,False,,
1,,apple,I guess this will be rappleOS 20.9.1?\n\n_____...,t2_ofg2i,False,,0,False,[Meta] Changes coming to the subreddit re: Sel...,"[{'e': 'text', 't': 'Mod Post'}]",...,all_ads,True,https://www.reddit.com/r/apple/comments/j2sivn...,1807195,1601493000.0,0,,False,a413d8bc-9c29-11e6-8369-0e5f34746a7c,
2,,apple,,t2_oa1us,False,,0,False,Apple TV is coming to Xbox consoles,[],...,all_ads,False,https://www.windowscentral.com/apple-tv-coming...,1807195,1601483000.0,1,,False,,https://www.windowscentral.com/apple-tv-coming...
3,,apple,,t2_oa1us,False,,0,False,Apple TV app in the works for PlayStation too,[],...,all_ads,False,https://twitter.com/9to5mac/status/13113600669...,1807195,1601488000.0,0,"{'type': 'twitter.com', 'oembed': {'provider_u...",False,,https://twitter.com/9to5mac/status/13113600669...
4,,apple,,t2_6090pcx7,False,,0,False,"Mark Gurman: ""Apple marketing materials for th...","[{'e': 'text', 't': 'iPad'}]",...,all_ads,False,https://twitter.com/i/web/status/1311414895380...,1807195,1601518000.0,0,,False,d1c5f976-5701-11e9-a2bd-0e424fabf6d2,https://twitter.com/i/web/status/1311414895380...
5,,apple,,t2_84m5b46k,False,,0,False,Apple Officially Retires Beats Updater Utility...,"[{'e': 'text', 't': 'Beats'}]",...,all_ads,False,https://www.macrumors.com/2020/09/30/apple-ret...,1807195,1601461000.0,0,,False,5b946706-f58f-11e9-9e03-0e0c17497b6a,https://www.macrumors.com/2020/09/30/apple-ret...
6,,apple,,t2_11fdtw,False,,0,False,Apple Suggests Restoring iPhone and Apple Watc...,"[{'e': 'text', 't': 'Apple Watch'}]",...,all_ads,False,https://www.macrumors.com/2020/09/30/apple-wat...,1807195,1601520000.0,0,,False,d7ae9226-5701-11e9-9865-0ee1117c687e,https://www.macrumors.com/2020/09/30/apple-wat...
7,,apple,,t2_2uwit82z,False,,0,False,Apple Card Gains Yearly Spending Activity Opti...,"[{'e': 'text', 't': 'Apple Card'}]",...,all_ads,False,https://www.macrumors.com/2020/09/30/apple-car...,1807195,1601475000.0,0,,False,57518830-5702-11e9-9527-0e0cf4d0bed4,https://www.macrumors.com/2020/09/30/apple-car...
8,,apple,,t2_396tj,False,,0,False,Big Tech Faces Ban From Favoring Own Services ...,"[{'e': 'text', 't': 'Discussion'}]",...,all_ads,False,https://www.bloomberg.com/news/articles/2020-0...,1807195,1601488000.0,1,,False,86b258de-5702-11e9-98ce-0eebcac587ec,https://www.bloomberg.com/news/articles/2020-0...
9,,apple,,t2_gg5le,False,,0,False,Apple Formally Adopts Human Rights Policy in t...,"[{'e': 'text', 't': 'Discussion'}]",...,all_ads,False,https://www.cpomagazine.com/data-privacy/apple...,1807195,1601476000.0,0,,False,86b258de-5702-11e9-98ce-0eebcac587ec,https://www.cpomagazine.com/data-privacy/apple...
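Each post dictionary carries many keys, while the classifier described in the problem statement only needs a handful of them. The sketch below assumes a hypothetical helper, `select_fields`, and a made-up pair of sample posts. The field names (`subreddit`, `title`, `selftext`, `name`) are taken from the key listing shown earlier.

```python
def select_fields(posts, fields=("subreddit", "title", "selftext", "name")):
    """Keep only the chosen keys from each post dictionary.

    Missing keys are filled with None so every row has the same columns.
    """
    return [{field: post.get(field) for field in fields} for post in posts]


# Made-up posts for illustration; real posts have many more keys.
sample_posts = [
    {"name": "t3_aaa", "subreddit": "apple", "title": "Sample title one",
     "selftext": "", "ups": 10},
    {"name": "t3_bbb", "subreddit": "apple", "title": "Sample title two",
     "ups": 3},
]

trimmed = select_fields(sample_posts)
for row in trimmed:
    print(row)
```

The result is still a list of dictionaries, so it can be passed to `pd.DataFrame` in the same way as `posts_apple`.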


In [23]:
url_apple + '?after=' + dict_apple['data']['after']

'https://www.reddit.com/r/apple/.json?after=t3_j2aiow'
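String concatenation works here because the token contains only URL-safe characters. As a slightly more defensive sketch, the query string can be built with the standard library's `urllib.parse.urlencode`. The function name `next_page_url` is an assumption made for this example.

```python
from urllib.parse import urlencode


def next_page_url(base_url, after=None):
    """Return the URL of the next page, or the base URL when there is no token."""
    if after is None:
        return base_url
    # urlencode escapes any characters that are not safe in a query string.
    return base_url + "?" + urlencode({"after": after})


print(next_page_url("https://www.reddit.com/r/apple/.json"))
print(next_page_url("https://www.reddit.com/r/apple/.json", "t3_j2aiow"))
```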

#### Obtaining posts

Each request returns about 25 posts, and the maximum number of available posts is reached after approximately 30 requests. We tried increasing the number of requests beyond 30, but the printed URLs showed that the same URLs start repeating after approximately the 30th request, so any further requests only return duplicate posts.

<br>


Below, we define a function that runs a "request loop" to obtain posts from the subreddit at the given url. Based on the observation above, we set the number of requests to 40 (10 more than the roughly 30 needed, to catch any additional unique posts). Any duplicate posts will be removed in a later section.

In [24]:
#Function defined to obtain posts from subreddits
def obtain_posts(url, file_path):
    '''
    Send 40 GET requests to a subreddit url using Reddit's JSON API and export the
    collected subreddit posts into a comma-separated values (csv) file.
    
    Parameters
    ----------
    url : str
        string containing the URL to which the get request is sent
    file_path: str, path object
        destination file path for saved csv output containing subreddit posts
    
    '''
    posts = []
    after = None
    #send 40 requests, following the 'after' token from one page to the next
    for n in range(40):
        if after is None:
            current_url = url
        else:
            current_url = url + '?after=' + after
        #print url to track request
        print(current_url)
        res = requests.get(current_url, headers={'User-agent': 'Pony Inc 1.0'})

        if res.status_code != 200:
            print('Status error', res.status_code)
            break

        current_dict = res.json()
        current_posts = [p['data'] for p in current_dict['data']['children']]
        posts.extend(current_posts)
        after = current_dict['data']['after']
        
        # generate a random sleep duration to look more 'natural'
        sleep_duration = random.randint(2,6)
        print(sleep_duration)
        time.sleep(sleep_duration)
    pd.DataFrame(posts).to_csv(file_path, index = False)
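As an alternative to removing duplicates afterwards, the loop could stop as soon as there is no next page and skip posts it has already seen. The sketch below is not the function used to produce the outputs in this notebook. It assumes the API signals the last page by returning `after` as `None`, and it takes a hypothetical `fetch_page` callable so that it can be tried offline against made-up pages.

```python
def collect_posts(fetch_page, max_requests=40):
    """Follow 'after' tokens until the listing ends or max_requests is reached.

    fetch_page(after) must return a listing dictionary shaped like Reddit's JSON.
    Posts are de-duplicated on their 'name' field.
    """
    posts = []
    seen_names = set()
    after = None
    for _ in range(max_requests):
        listing = fetch_page(after)
        for child in listing["data"]["children"]:
            post = child["data"]
            if post["name"] not in seen_names:
                seen_names.add(post["name"])
                posts.append(post)
        after = listing["data"]["after"]
        if after is None:
            # Assumption: no token means there are no further pages.
            break
    return posts


# Made-up pages for illustration; 't3_b' deliberately appears twice.
FAKE_PAGES = {
    None: {"data": {"children": [{"data": {"name": "t3_a", "title": "A"}},
                                 {"data": {"name": "t3_b", "title": "B"}}],
                    "after": "t3_b"}},
    "t3_b": {"data": {"children": [{"data": {"name": "t3_b", "title": "B"}},
                                   {"data": {"name": "t3_c", "title": "C"}}],
                      "after": None}},
}


def fake_fetch_page(after):
    """Stand-in for a real request; looks the page up in FAKE_PAGES."""
    return FAKE_PAGES[after]


collected = collect_posts(fake_fetch_page)
print([post["name"] for post in collected])
```

In real use, `fetch_page` would wrap the `requests.get` call and the random sleep from the function above.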

In [25]:
%%time
# Let's obtain posts from the android subreddit and save the output into android_posts.csv
obtain_posts(url_android, '../datasets/android_posts.csv')

https://www.reddit.com/r/Android/.json
5
https://www.reddit.com/r/Android/.json?after=t3_j2pvck
5
https://www.reddit.com/r/Android/.json?after=t3_j1oahj
6
https://www.reddit.com/r/Android/.json?after=t3_j04ato
6
https://www.reddit.com/r/Android/.json?after=t3_iyysc1
6
https://www.reddit.com/r/Android/.json?after=t3_ixszud
3
https://www.reddit.com/r/Android/.json?after=t3_ixdvsd
4
https://www.reddit.com/r/Android/.json?after=t3_iwn89s
4
https://www.reddit.com/r/Android/.json?after=t3_ivm5uq
5
https://www.reddit.com/r/Android/.json?after=t3_iuffvg
3
https://www.reddit.com/r/Android/.json?after=t3_itfsx0
6
https://www.reddit.com/r/Android/.json?after=t3_isv51c
5
https://www.reddit.com/r/Android/.json?after=t3_irwf64
6
https://www.reddit.com/r/Android/.json?after=t3_iqoyfi
2
https://www.reddit.com/r/Android/.json?after=t3_iphepi
5
https://www.reddit.com/r/Android/.json?after=t3_ioxv11
5
https://www.reddit.com/r/Android/.json?after=t3_io6oae
6
https://www.reddit.com/r/Android/.json?after=t3

In [26]:
%%time
# Let's obtain posts from the apple subreddit and save the output into apple_posts.csv
obtain_posts(url_apple, '../datasets/apple_posts.csv')

https://www.reddit.com/r/apple/.json
6
https://www.reddit.com/r/apple/.json?after=t3_j2aiow
6
https://www.reddit.com/r/apple/.json?after=t3_j1imfm
3
https://www.reddit.com/r/apple/.json?after=t3_j1dtaf
3
https://www.reddit.com/r/apple/.json?after=t3_j0e3qx
6
https://www.reddit.com/r/apple/.json?after=t3_j0abvd
4
https://www.reddit.com/r/apple/.json?after=t3_j0dih5
5
https://www.reddit.com/r/apple/.json?after=t3_j07qgx
4
https://www.reddit.com/r/apple/.json?after=t3_izqbzz
5
https://www.reddit.com/r/apple/.json?after=t3_iybsyi
5
https://www.reddit.com/r/apple/.json?after=t3_izcdcw
3
https://www.reddit.com/r/apple/.json?after=t3_ixrhkm
5
https://www.reddit.com/r/apple/.json?after=t3_ixjqkc
4
https://www.reddit.com/r/apple/.json?after=t3_ivwmiu
3
https://www.reddit.com/r/apple/.json?after=t3_ivr1i1
3
https://www.reddit.com/r/apple/.json?after=t3_ivrshn
3
https://www.reddit.com/r/apple/.json?after=t3_iwcul9
4
https://www.reddit.com/r/apple/.json?after=t3_ivwirt
3
https://www.reddit.com/r/a

### Data collected
We have collected close to 1,000 rows from each of the Android and apple subreddits (985 and 989 respectively, as the shapes below show), before any duplicates are removed.

In [27]:
#check android posts
df_android = pd.read_csv('../datasets/android_posts.csv')
df_android.shape

(985, 112)

In [28]:
df_android.head().T

Unnamed: 0,0,1,2,3,4
approved_at_utc,,,,,
subreddit,Android,Android,Android,Android,Android
selftext,"Note 1. Join us at /r/MoronicMondayAndroid, a ...","&gt;Separately, Brussels wants large platforms...",,,
author_fullname,t2_6l4z3,t2_533dzk3z,t2_75e6g,t2_167ibb,t2_48kr10
saved,False,False,False,False,False
...,...,...,...,...,...
event_start,,,,,
event_end,,,,,
event_is_live,,,,,
link_flair_template_id,,,,,


In [29]:
#check apple posts
df_apple = pd.read_csv('../datasets/apple_posts.csv')
df_apple.shape

(989, 109)

<div style="text-align: right">
    <div class="right"> >>> <b>Next: </b>
        <a href="./02_data_cleaning_and_eda.ipynb">Data Cleaning and Exploratory Data Analysis</a>
    </div>
    </div>

[Go to top](#top)

---