# Project 3: Web API & Classification

## Problem Statement

Using Reddit's API, I will collect posts from two subreddits:

- Computing
- Data Science

I will then use NLP to train a classifier that predicts which subreddit a given post came from.

## Executive Summary

I tried two different models on the posts I collected:

- Naive Bayes
- Logistic Regression

The Naive Bayes classifier was found to be more accurate than Logistic Regression at classifying titles into subreddits.

### Contents:
- [Scraping Reddit for Data](#Scraping-Reddit-for-Data)
- [Data Cleaning](#Data-Cleaning)
- [Combining Data](#Combining-Data)
- [Combine our Cleaning into One Function](#Combine-our-Cleaning-into-One-Function)
- [EDA](#EDA)
- [Create Feature Matrix and Target](#Create-Feature-Matrix-and-Target)
- [Naive Bayes classifier](#Naive-Bayes-classifier)
- [Logistic Regression classifier](#Logistic-Regression-classifier)
- [Conclusions and Recommendations](#Conclusions-and-Recommendations)

In [1]:
# Standard import

import requests
import pandas as pd
import time
import random

## Scraping Reddit for Data

### 1. Handling of Computing subreddit

In [2]:
url1 = 'https://www.reddit.com/r/computing.json'

In [3]:
res1 = requests.get(url1)

In [4]:
res1.status_code

429

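A 429 status code means "Too Many Requests": Reddit tends to throttle requests that arrive with the default `requests` user agent. Sending a custom `User-agent` header gets a normal response.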

In [5]:
res1 = requests.get(url1, headers={'User-agent': 'Computing Inc 1.0'})

In [6]:
res1.status_code

200
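The request-and-check step above could be wrapped in a small helper. This is only a sketch: `fetch_listing` and its `get` parameter are hypothetical names introduced for illustration, and the helper simply raises on any status other than 200.

```python
def fetch_listing(url, user_agent, get=None):
    """Return the parsed JSON listing at `url`, or raise on a non-200 status.

    `get` is any callable with the same interface as `requests.get`;
    it is injectable so the helper can be exercised without network access.
    """
    if get is None:
        import requests  # imported lazily so a stub can be passed instead
        get = requests.get

    res = get(url, headers={'User-agent': user_agent})
    if res.status_code != 200:
        raise RuntimeError(f'Request to {url} failed with status {res.status_code}')
    return res.json()
```

With this in place, the cells above would reduce to something like `reddit_dict1 = fetch_listing(url1, 'Computing Inc 1.0')`.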

In [7]:
reddit_dict1 = res1.json()

In [8]:
reddit_dict1

{'kind': 'Listing',
 'data': {'modhash': '',
  'dist': 25,
  'children': [{'kind': 't3',
    'data': {'approved_at_utc': None,
     'subreddit': 'computing',
     'selftext': 'My PC started acting weird about a week ago. It would switch to a black screen and would pretty much shut off, meaning I couldn’t do anything, even though it had the indicators that it was still on. I thought I had fixed the problem by turning the max overheat level higher, to 87 degrees. Keep in mind that before this, my computer had never shut off due to overheating. Today I turned it on then left to make food. When I came back to do stuff it had apparently done the same thing that it did before, where it would go to a black screen. I shut it off and turn it back on again, black screen immediately. Did it again same thing, did it again, computer wouldn’t turn on. No fans or light on power switch, only light on keyboard. I went through and cleaned the dust, made sure everything was in their sockets properly, did

In [9]:
reddit_dict1['data']['children'][0].keys()

dict_keys(['kind', 'data'])

In [10]:
# This cell gives the class label, i.e. the target.

reddit_dict1['data']['children'][0]['data']['subreddit']

'computing'

In [11]:
# This is the title of the first post.

reddit_dict1['data']['children'][0]['data']['title']

'PC Not Turning On'

In [13]:
posts1 = [p['data'] for p in reddit_dict1['data']['children']]
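Each child in the listing wraps one post under its `data` key, which is why the comprehension above unpacks `p['data']`. As a sketch, the three fields used later in this notebook (`subreddit`, `title`, `selftext`) could also be pulled out directly. `extract_fields` is a hypothetical helper, and `sample_listing` is a heavily trimmed stand-in for the real response.

```python
def extract_fields(listing, fields=('subreddit', 'title', 'selftext')):
    """Return one dict per post, keeping only the requested fields."""
    return [
        {field: child['data'].get(field) for field in fields}
        for child in listing['data']['children']
    ]

# Trimmed stand-in for a real listing response (illustrative only).
sample_listing = {
    'kind': 'Listing',
    'data': {
        'children': [
            {'kind': 't3',
             'data': {'subreddit': 'computing',
                      'title': 'PC Not Turning On',
                      'selftext': 'My PC started acting weird about a week ago.',
                      'name': 't3_k63m62'}},
        ],
    },
}

rows = extract_fields(sample_listing)
print(rows[0]['title'])
```

Using `.get` means a post that lacks one of the fields yields `None` for it instead of raising a `KeyError`.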

In [14]:
pd.DataFrame(posts1)

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,post_hint,url_overridden_by_dest,preview,media_metadata
0,,computing,My PC started acting weird about a week ago. I...,t2_1jgvl4g2,False,,0,False,PC Not Turning On,[],...,https://www.reddit.com/r/computing/comments/k6...,24098,1607022000.0,0,,False,,,,
1,,computing,,t2_8bysr4r1,False,,0,False,Join us every Saturdays for a free awesome liv...,[],...,https://i.redd.it/y7d4hnf61s261.png,24098,1606916000.0,0,,False,image,https://i.redd.it/y7d4hnf61s261.png,{'images': [{'source': {'url': 'https://previe...,
2,,computing,If AX contain 1000101000110001 then what is th...,t2_616lursp,False,,0,False,"Hey, can you help me answer this ?",[],...,https://www.reddit.com/r/computing/comments/k5...,24098,1606944000.0,0,,False,,,,
3,,computing,Who’s not getting flooded with ‘personalized’ ...,t2_92a3odo1,False,,0,False,Secure Data Collaboration: Exclusive Sneak-Peak!,[],...,https://www.reddit.com/r/computing/comments/k4...,24098,1606833000.0,0,,False,,,,
4,,computing,"I currently have two 21.5"" monitors side by si...",t2_albamq,False,,0,False,Choosing a Monitor,[],...,https://www.reddit.com/r/computing/comments/k4...,24098,1606810000.0,0,,False,,,,
5,,computing,"Hello, I need help on some CMD commands. I am ...",t2_81iuuimw,False,,0,False,Need help to commando to CMD,[],...,https://www.reddit.com/r/computing/comments/k4...,24098,1606821000.0,0,,False,,,,
6,,computing,I have a dedicated desktop setup in one room; ...,t2_7o33qg17,False,,0,False,Computer vs. Laptop ..,[],...,https://www.reddit.com/r/computing/comments/k3...,24098,1606689000.0,0,,False,,,,
7,,computing,I hope to get an rtx 3070 soon but a lot of th...,t2_4e991fn2,False,,0,False,Is G-sync important ?,[],...,https://www.reddit.com/r/computing/comments/k3...,24098,1606670000.0,0,,False,,,,
8,,computing,Cell phones have a lot of hype around them and...,t2_8jsbbu7c,False,,0,False,Are cellphones no fun?,[],...,https://www.reddit.com/r/computing/comments/k2...,24098,1606587000.0,0,,False,,,,
9,,computing,,t2_4lmwp,False,,0,False,How to create a gemlog (Gemini log) with gssg,[],...,https://portal.mozz.us/gemini/rocketnine.space...,24098,1606531000.0,0,,False,,https://portal.mozz.us/gemini/rocketnine.space...,,


In [15]:
pd.DataFrame(posts1).to_csv('../data/computing.csv')

In [16]:
pd.DataFrame(posts1)['name']

0     t3_k63m62
1     t3_k59bdr
2     t3_k5iqsf
3     t3_k4lxo7
4     t3_k4gqww
5     t3_k4izt3
6     t3_k3iult
7     t3_k3d145
8     t3_k2sl3y
9     t3_k2f7az
10    t3_k2c6yv
11    t3_k25e84
12    t3_k1b885
13    t3_k1600x
14    t3_k0nnpt
15    t3_jzo4qg
16    t3_jzv0hl
17    t3_jznc0j
18    t3_jzr4bb
19    t3_jzldl1
20    t3_jyg93u
21    t3_jyo0oy
22    t3_jxxng6
23    t3_jy0175
24    t3_jxu26z
Name: name, dtype: object
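The `name` values above are the posts' fullnames. The listing's `after` field holds the fullname of the last post on the page (`t3_jxu26z` here), and passing it back as the `after` query parameter requests the next page. A small sketch of building that URL, where `build_page_url` is a hypothetical helper name:

```python
from urllib.parse import urlencode

def build_page_url(base_url, after=None):
    """Return `base_url`, adding an `after` query parameter when a token is given."""
    if after is None:
        return base_url
    return base_url + '?' + urlencode({'after': after})

print(build_page_url('https://www.reddit.com/r/computing.json'))
print(build_page_url('https://www.reddit.com/r/computing.json', 't3_jxu26z'))
```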

In [17]:
posts = []
after = None

for a in range(27):
    # The first request has no `after` token; later requests continue from the last post seen.
    if after is None:
        current_url = url1
    else:
        current_url = url1 + '?after=' + after
    print(current_url)
    res1 = requests.get(current_url, headers={'User-agent': 'Computing Inc 1.0'})
    
    if res1.status_code != 200:
        print('Status error', res1.status_code)
        break
    
    current_dict = res1.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
    after = current_dict['data']['after']

    # Append this page to the CSV; the first page overwrites any existing file.
    if a > 0:
        prev_posts = pd.read_csv('../data/computing.csv')
        current_df = pd.DataFrame(current_posts)
        new_df = pd.concat([prev_posts, current_df])
        new_df.to_csv('../data/computing.csv', index=False)
    else:
        pd.DataFrame(posts).to_csv('../data/computing.csv', index=False)

    # Stop once Reddit reports no further pages, rather than restarting from page one.
    if after is None:
        break

    # Sleep for a random duration so the requests look more 'natural'.
    sleep_duration = random.randint(2, 6)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/computing.json
2
https://www.reddit.com/r/computing.json?after=t3_jxu26z
5
https://www.reddit.com/r/computing.json?after=t3_jtk341
5
https://www.reddit.com/r/computing.json?after=t3_jj8m1g
2
https://www.reddit.com/r/computing.json?after=t3_j9vbsk
3
https://www.reddit.com/r/computing.json?after=t3_j34zzg
2
https://www.reddit.com/r/computing.json?after=t3_iu62jr
2
https://www.reddit.com/r/computing.json?after=t3_ilusdl
4
https://www.reddit.com/r/computing.json?after=t3_ick0il
4
https://www.reddit.com/r/computing.json?after=t3_i4x264
3
https://www.reddit.com/r/computing.json?after=t3_hxen0c
2
https://www.reddit.com/r/computing.json?after=t3_ho54k8
4
https://www.reddit.com/r/computing.json?after=t3_hfqbld
3
https://www.reddit.com/r/computing.json?after=t3_h0wb1c
6
https://www.reddit.com/r/computing.json?after=t3_gt8yaw
5
https://www.reddit.com/r/computing.json?after=t3_gn88yy
5
https://www.reddit.com/r/computing.json?after=t3_gge8a3
5
https://www.reddit.com/r/compu
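
The loop above re-reads and rewrites the CSV on every page. One alternative, sketched under the assumption that all pages fit comfortably in memory, is to collect the posts first, skip any repeats by their `name`, and write the file once at the end. `collect_posts` and `fetch_page` are hypothetical names: `fetch_page(after)` stands for any function that returns the parsed listing for a given `after` token (and would also be the place for the sleep between requests).

```python
def collect_posts(fetch_page, max_pages):
    """Follow `after` tokens for up to `max_pages` pages and return unique posts."""
    posts, seen, after = [], set(), None
    for _ in range(max_pages):
        listing = fetch_page(after)
        for child in listing['data']['children']:
            post = child['data']
            if post['name'] not in seen:  # skip posts already collected
                seen.add(post['name'])
                posts.append(post)
        after = listing['data']['after']
        if after is None:  # no further pages to request
            break
    return posts
```

The result could then be saved in one step, for example with `pd.DataFrame(collect_posts(fetch_page, 27)).to_csv('../data/computing.csv', index=False)`.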

In [18]:
df1 = pd.read_csv('../data/computing.csv')

### 2. Handling of Data Science subreddit

In [19]:
url2 = 'https://www.reddit.com/r/datascience.json'

In [20]:
res2 = requests.get(url2)

In [21]:
res2.status_code

429

As before, the 429 status code means "Too Many Requests"; sending a custom `User-agent` header gets a normal response.

In [22]:
res2 = requests.get(url2, headers={'User-agent': 'Data Science Inc 1.0'})

In [23]:
res2.status_code

200

In [24]:
reddit_dict2 = res2.json()

In [25]:
reddit_dict2

{'kind': 'Listing',
 'data': {'modhash': '',
  'dist': 26,
  'children': [{'kind': 't3',
    'data': {'approved_at_utc': None,
     'subreddit': 'datascience',
     'selftext': "Welcome to this week's entering &amp; transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include:\n\n* Learning resources (e.g. books, tutorials, videos)\n* Traditional education (e.g. schools, degrees, electives)\n* Alternative education (e.g. online courses, bootcamps)\n* Job search questions (e.g. resumes, applying, career prospects)\n* Elementary questions (e.g. where to start, what next)\n\nWhile you wait for answers from the community, check out the [FAQ](https://www.reddit.com/r/datascience/wiki/frequently-asked-questions) and [Resources](Resources) pages on our wiki. You can also search for answers in [past weekly threads](https://www.reddit.com/r/datascience/search?q=weekly%20thread&amp;restrict_sr=1&amp;sort=new)

In [26]:
reddit_dict2['data']['children'][0].keys()

dict_keys(['kind', 'data'])

In [27]:
# This cell gives the class label, i.e. the target.

reddit_dict2['data']['children'][0]['data']['subreddit']

'datascience'

In [28]:
# This is the title of the first post.

reddit_dict2['data']['children'][0]['data']['title']

'Weekly Entering &amp; Transitioning Thread | 29 Nov 2020 - 06 Dec 2020'

In [29]:
posts2 = [p['data'] for p in reddit_dict2['data']['children']]

In [30]:
pd.DataFrame(posts2)

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,link_flair_template_id,poll_data,post_hint,preview
0,,datascience,Welcome to this week's entering &amp; transiti...,t2_4l4cxw07,False,,0,False,Weekly Entering &amp; Transitioning Thread | 2...,[],...,https://www.reddit.com/r/datascience/comments/...,340541,1606651000.0,0,,False,,,,
1,,datascience,,t2_886htd2x,False,,0,False,Are there any people who started off with data...,[],...,https://www.reddit.com/r/datascience/comments/...,340541,1607006000.0,1,,False,a6ee6fa0-d780-11e7-b6d0-0e0bd8823a7e,,,
2,,datascience,My company hired a three man team of data scie...,t2_140c97,False,,0,False,Data Science is not an easy and quick job that...,[],...,https://www.reddit.com/r/datascience/comments/...,340541,1607010000.0,0,,False,481ee318-d77d-11e7-a4a3-0e8624d7129a,,,
3,,datascience,I'm a sophomore in college considering a caree...,t2_3aed59pj,False,,0,False,Summer programs for undergrads?,[],...,https://www.reddit.com/r/datascience/comments/...,340541,1607062000.0,0,,False,a6ee6fa0-d780-11e7-b6d0-0e0bd8823a7e,,,
4,,datascience,Hi everyone! I'm an undergraduate junior Appli...,t2_opauw,False,,0,False,Undergraduate Data Science internships,[],...,https://www.reddit.com/r/datascience/comments/...,340541,1607069000.0,0,,False,71803d7a-469d-11e9-890b-0e5d959976c8,,,
5,,datascience,\nI’m having major doubts about my career and ...,t2_6dhdycki,False,,0,False,Is working as an analyst or DS in sustainabili...,[],...,https://www.reddit.com/r/datascience/comments/...,340541,1607041000.0,0,,False,a6ee6fa0-d780-11e7-b6d0-0e0bd8823a7e,,,
6,,datascience,Anyone aware of any studies around how people ...,t2_3x6ik,False,,0,False,"At your company, do the business teams who col...",[],...,https://www.reddit.com/r/datascience/comments/...,340541,1607052000.0,0,,False,4fad7108-d77d-11e7-b0c6-0ee69f155af2,,,
7,,datascience,I currently work as a statistical modeling ana...,t2_39fgc1dn,False,,0,False,Online Masters in Statistics vs Data Science/C...,[],...,https://www.reddit.com/r/datascience/comments/...,340541,1607062000.0,0,,False,99f9652a-d780-11e7-b558-0e52cdd59ace,,,
8,,datascience,I'm planning to go deeper into big data - sinc...,t2_3z6gqvrh,False,,0,False,Performance measure for big data (on laptop),[],...,https://www.reddit.com/r/datascience/comments/...,340541,1607047000.0,0,,False,4fad7108-d77d-11e7-b0c6-0ee69f155af2,,,
9,,datascience,Hey all! I'd love to get recommendations on so...,t2_48ou76y2,False,,0,False,Best Project Based Data Science Tutorial,[],...,https://www.reddit.com/r/datascience/comments/...,340541,1607033000.0,0,,False,937a6f50-d780-11e7-826d-0ed1beddcc82,,,


In [31]:
pd.DataFrame(posts2).to_csv('../data/datascience.csv')

In [32]:
pd.DataFrame(posts2)['name']

0     t3_k38a6g
1     t3_k5y56t
2     t3_k5zc3o
3     t3_k6f99q
4     t3_k6gpfn
5     t3_k69yqq
6     t3_k6co45
7     t3_k6f42w
8     t3_k6b9zx
9     t3_k67epw
10    t3_k6g8nm
11    t3_k69dh0
12    t3_k6csxc
13    t3_k6c7du
14    t3_k5z5p7
15    t3_k6a9rq
16    t3_k695x2
17    t3_k6952l
18    t3_k68iii
19    t3_k59xar
20    t3_k5m5mb
21    t3_k63l45
22    t3_k61xnx
23    t3_k5yu6y
24    t3_k5e815
25    t3_k5fhga
Name: name, dtype: object

In [33]:
posts = []
after = None

for a in range(23):
    # The first request has no `after` token; later requests continue from the last post seen.
    if after is None:
        current_url = url2
    else:
        current_url = url2 + '?after=' + after
    print(current_url)
    res2 = requests.get(current_url, headers={'User-agent': 'Data Science Inc 1.0'})
    
    if res2.status_code != 200:
        print('Status error', res2.status_code)
        break
    
    current_dict = res2.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    posts.extend(current_posts)
    after = current_dict['data']['after']

    # Append this page to the CSV; the first page overwrites any existing file.
    if a > 0:
        prev_posts = pd.read_csv('../data/datascience.csv')
        current_df = pd.DataFrame(current_posts)
        new_df = pd.concat([prev_posts, current_df])
        new_df.to_csv('../data/datascience.csv', index=False)
    else:
        pd.DataFrame(posts).to_csv('../data/datascience.csv', index=False)

    # Stop once Reddit reports no further pages, rather than restarting from page one.
    if after is None:
        break

    # Sleep for a random duration so the requests look more 'natural'.
    sleep_duration = random.randint(2, 6)
    print(sleep_duration)
    time.sleep(sleep_duration)

https://www.reddit.com/r/datascience.json
6
https://www.reddit.com/r/datascience.json?after=t3_k5fhga
4
https://www.reddit.com/r/datascience.json?after=t3_k2dw8v
5
https://www.reddit.com/r/datascience.json?after=t3_k1fuvf
5
https://www.reddit.com/r/datascience.json?after=t3_jyidr5
6
https://www.reddit.com/r/datascience.json?after=t3_jwsdhi
2
https://www.reddit.com/r/datascience.json?after=t3_jujktx
5
https://www.reddit.com/r/datascience.json?after=t3_js3zli
5
https://www.reddit.com/r/datascience.json?after=t3_jqu8xg
3
https://www.reddit.com/r/datascience.json?after=t3_jokvfz
2
https://www.reddit.com/r/datascience.json?after=t3_jo5u2t
3
https://www.reddit.com/r/datascience.json?after=t3_jnj6n9
4
https://www.reddit.com/r/datascience.json?after=t3_jn1rwg
2
https://www.reddit.com/r/datascience.json?after=t3_jl4ox4
6
https://www.reddit.com/r/datascience.json?after=t3_jjzlbd
5
https://www.reddit.com/r/datascience.json?after=t3_jj3ynx
4
https://www.reddit.com/r/datascience.json?after=t3_jhjvl

In [34]:
df2 = pd.read_csv('../data/datascience.csv')

## Data Cleaning

### Getting the 3 columns that we need

In [35]:
df1_cln = df1[['subreddit' , 'title' , 'selftext']]

In [36]:
df2_cln = df2[['subreddit' , 'title' , 'selftext']]

### 1. Handling of Computing Data

In [37]:
# Check the shape of the dataframe before dropping any data.
df1_cln.shape

(675, 3)

In [38]:
# Reassign instead of using inplace=True on a slice, which triggers SettingWithCopyWarning.
df1_cln = df1_cln.drop_duplicates(subset=['selftext'])


In [39]:
# Reassign instead of using inplace=True on a slice, which triggers SettingWithCopyWarning.
df1_cln = df1_cln.dropna(subset=['selftext'])


In [41]:
# Check the shape of the dataframe after dropping data.
df1_cln.shape

(515, 3)

### 2. Handling of Data Science Data

In [42]:
# Check the shape of the dataframe before dropping any data.
df2_cln.shape

(576, 3)

In [43]:
# Reassign instead of using inplace=True on a slice, which triggers SettingWithCopyWarning.
df2_cln = df2_cln.drop_duplicates(subset=['selftext'])


In [44]:
# Reassign instead of using inplace=True on a slice, which triggers SettingWithCopyWarning.
df2_cln = df2_cln.dropna(subset=['selftext'])


In [46]:
# Check the shape of the dataframe after dropping data.
df2_cln.shape

(511, 3)
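
Both subreddits go through the same three steps: select the columns, drop rows whose `selftext` duplicates an earlier row, and drop rows with missing `selftext`. A sketch of one helper that does this on an explicit copy, which keeps pandas from raising `SettingWithCopyWarning`. `clean_posts` is a hypothetical name and the small frame below is made-up sample data, not the scraped posts.

```python
import pandas as pd

def clean_posts(df, columns=('subreddit', 'title', 'selftext')):
    """Keep `columns`, then drop duplicate and missing `selftext` rows."""
    cleaned = df[list(columns)].copy()  # explicit copy, not a view of `df`
    cleaned = cleaned.drop_duplicates(subset=['selftext'])
    cleaned = cleaned.dropna(subset=['selftext'])
    return cleaned

# Made-up sample rows, for illustration only.
raw = pd.DataFrame({
    'subreddit': ['computing'] * 4,
    'title': ['a', 'b', 'c', 'd'],
    'selftext': ['same text', 'same text', None, 'other text'],
    'score': [1, 2, 3, 4],
})
print(clean_posts(raw).shape)
```

Applied to this notebook, it would be called as `df1_cln = clean_posts(df1)` and `df2_cln = clean_posts(df2)`; the input frames are left unchanged.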

### Save dataset after cleaning

In [47]:
df1_cln.to_csv('../data/computing.csv')

In [48]:
df2_cln.to_csv('../data/datascience.csv')

## Continue to Notebook 2: Classification

I have created two DataFrames containing `title`, `selftext` and our target `subreddit`, and saved them as CSV files. Classification continues in Notebook 2.