#  Project 3: Web APIs & Classification (Subreddits)

## Problem Statement
<br> We are the good folks at the AskScience subreddit, bringing you up-and-coming and interesting science facts and news, fresh out of the oven, for your reading pleasure.
<br>With the many crossovers between fields nowadays, topics that sound related but are not actually relevant to our cause have been popping up in our subreddit feed. More prevalent still are troll posts and jokes that mock science or have nothing to do with it.
<br>Our job now is to build an ML model that correctly detects posts that don't belong, so that they can be filtered out of our feed.

## Executive Summary
Reddit is an American social news aggregation, web content rating, and discussion website. Members submit content to the site, such as links, text posts, and images, which are then upvoted or downvoted by the community.
Posts are organized by subject into user-created boards called "subreddits", which cover a variety of topics including news, science, movies, video games, music, books, fitness, food, and image-sharing.

As it is an open community, anyone can post whatever they want, whenever and wherever they want. More often than not, random posts unrelated to the subreddit will appear. The immediate solution is for moderators to spot and remove them, but how efficient can this process be?
There is only so much a moderator can catch before the volume of posts becomes too overwhelming to handle manually.

Thus arrives the need for automation: filtering out the irrelevant posts and keeping the subreddit on topic.
Our job now is to build an ML model that correctly detects posts that don't belong, so that they can be filtered out of our feed.

### Contents:
- [Importing the Relevant Libraries](#Importing-the-Relevant-Libraries)
- [Anything and everything about Subreddit 1](#Anything-and-everything-about-Subreddit-1)
- [Cleaning the Data](#Cleaning-the-Data)
- [Function to scrape content from Askscience](#Function-to-scrape-content-from-Askscience)
- [Anything and everything about Subreddit 2](#Anything-and-everything-about-Subreddit-2)
- [Function to scrape content from Jokes](#Function-to-scrape-content-from-Jokes)


## Importing the Relevant Libraries

In [33]:
import requests
from bs4 import BeautifulSoup
import numpy as np
import scipy.stats as stats
import seaborn as sns
import pandas as pd
import random
import time

sns.set_style('whitegrid')

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

## Anything and everything about Subreddit 1

In [34]:
url1='https://www.reddit.com/r/askscience.json'
request1=requests.get(url1,headers={'User-agent': 'hahaha123'})

In [35]:
request1.status_code  # status 200, so we are good to go

200

In [36]:
#lets work with subreddit1 first
artscience_dict=request1.json()

In [37]:
print(artscience_dict)

{'kind': 'Listing', 'data': {'modhash': '', 'dist': 27, 'children': [{'kind': 't3', 'data': {'approved_at_utc': None, 'subreddit': 'askscience', 'selftext': "**Please read this entire post carefully and format your application appropriately.**\n\nThis post is for new panelist recruitment! The previous one is [here](https://www.reddit.com/r/askscience/comments/amj68a/askscience_panel_of_scientists_xx/).\n\nThe panel is an informal group of redditors who are **either professional scientists or those in training to become so**. All panelists have at least a graduate-level familiarity within their declared field of expertise and answer questions from related areas of study. A panelist's expertise is summarized in a color-coded AskScience flair.\n\nMembership in the panel comes with access to a panelist subreddit. It is a place for panelists to interact with each other, voice concerns to the moderators, and where the moderators make announcements to the whole panel. It's a good place to net

In [38]:
#lets see what we are dealing with, what are the keys in the dict
artscience_dict.keys()


dict_keys(['kind', 'data'])

In [39]:
artscience_dict['kind']

'Listing'

In [40]:
artscience_dict['data']  #can further expand this 


{'modhash': '',
 'dist': 27,
 'children': [{'kind': 't3',
   'data': {'approved_at_utc': None,
    'subreddit': 'askscience',
    'selftext': "**Please read this entire post carefully and format your application appropriately.**\n\nThis post is for new panelist recruitment! The previous one is [here](https://www.reddit.com/r/askscience/comments/amj68a/askscience_panel_of_scientists_xx/).\n\nThe panel is an informal group of redditors who are **either professional scientists or those in training to become so**. All panelists have at least a graduate-level familiarity within their declared field of expertise and answer questions from related areas of study. A panelist's expertise is summarized in a color-coded AskScience flair.\n\nMembership in the panel comes with access to a panelist subreddit. It is a place for panelists to interact with each other, voice concerns to the moderators, and where the moderators make announcements to the whole panel. It's a good place to network with peopl

In [41]:
artscience_dict['data'].keys()

dict_keys(['modhash', 'dist', 'children', 'after', 'before'])

In [42]:
artscience_dict['data']['children']  #the children key can be further split into the things that we might want


[{'kind': 't3',
  'data': {'approved_at_utc': None,
   'subreddit': 'askscience',
   'selftext': "**Please read this entire post carefully and format your application appropriately.**\n\nThis post is for new panelist recruitment! The previous one is [here](https://www.reddit.com/r/askscience/comments/amj68a/askscience_panel_of_scientists_xx/).\n\nThe panel is an informal group of redditors who are **either professional scientists or those in training to become so**. All panelists have at least a graduate-level familiarity within their declared field of expertise and answer questions from related areas of study. A panelist's expertise is summarized in a color-coded AskScience flair.\n\nMembership in the panel comes with access to a panelist subreddit. It is a place for panelists to interact with each other, voice concerns to the moderators, and where the moderators make announcements to the whole panel. It's a good place to network with people who share your interests!\n\n---\n\n**You a

In [43]:
len(artscience_dict['data']['children']) #we have 27 indexes to go through

27

In [44]:
#lets see what we got in one index first
artscience_dict['data']['children'][0] #another dictionary !! 

{'kind': 't3',
 'data': {'approved_at_utc': None,
  'subreddit': 'askscience',
  'selftext': "**Please read this entire post carefully and format your application appropriately.**\n\nThis post is for new panelist recruitment! The previous one is [here](https://www.reddit.com/r/askscience/comments/amj68a/askscience_panel_of_scientists_xx/).\n\nThe panel is an informal group of redditors who are **either professional scientists or those in training to become so**. All panelists have at least a graduate-level familiarity within their declared field of expertise and answer questions from related areas of study. A panelist's expertise is summarized in a color-coded AskScience flair.\n\nMembership in the panel comes with access to a panelist subreddit. It is a place for panelists to interact with each other, voice concerns to the moderators, and where the moderators make announcements to the whole panel. It's a good place to network with people who share your interests!\n\n---\n\n**You are e

In [45]:
artscience_dict['data']['children'][0].keys()

dict_keys(['kind', 'data'])

In [46]:
artscience_dict['data']['children'][0]['kind']

't3'

In [47]:
artscience_dict['data']['children'][0]['data'] #one last dictionary !!

{'approved_at_utc': None,
 'subreddit': 'askscience',
 'selftext': "**Please read this entire post carefully and format your application appropriately.**\n\nThis post is for new panelist recruitment! The previous one is [here](https://www.reddit.com/r/askscience/comments/amj68a/askscience_panel_of_scientists_xx/).\n\nThe panel is an informal group of redditors who are **either professional scientists or those in training to become so**. All panelists have at least a graduate-level familiarity within their declared field of expertise and answer questions from related areas of study. A panelist's expertise is summarized in a color-coded AskScience flair.\n\nMembership in the panel comes with access to a panelist subreddit. It is a place for panelists to interact with each other, voice concerns to the moderators, and where the moderators make announcements to the whole panel. It's a good place to network with people who share your interests!\n\n---\n\n**You are eligible to join the panel 

In [48]:
artscience_dict['data']['children'][0]['data'].keys()

dict_keys(['approved_at_utc', 'subreddit', 'selftext', 'author_fullname', 'saved', 'mod_reason_title', 'gilded', 'clicked', 'title', 'link_flair_richtext', 'subreddit_name_prefixed', 'hidden', 'pwls', 'link_flair_css_class', 'downs', 'hide_score', 'name', 'quarantine', 'link_flair_text_color', 'author_flair_background_color', 'subreddit_type', 'ups', 'total_awards_received', 'media_embed', 'author_flair_template_id', 'is_original_content', 'user_reports', 'secure_media', 'is_reddit_media_domain', 'is_meta', 'category', 'secure_media_embed', 'link_flair_text', 'can_mod_post', 'score', 'approved_by', 'thumbnail', 'edited', 'author_flair_css_class', 'steward_reports', 'author_flair_richtext', 'gildings', 'content_categories', 'is_self', 'mod_note', 'created', 'link_flair_type', 'wls', 'banned_by', 'author_flair_type', 'domain', 'allow_live_comments', 'selftext_html', 'likes', 'suggested_sort', 'banned_at_utc', 'view_count', 'archived', 'no_follow', 'is_crosspostable', 'pinned', 'over_18',
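The cells above peel the response apart one layer at a time. As a rough sketch of the same exploration done in one pass, the helper below walks a nested dict and lists the key paths it finds. `key_paths` and `sample_listing` are hypothetical names introduced only for illustration, and the sample is a tiny hand-made stand-in rather than real Reddit data.

```python
def key_paths(obj, max_depth=3, prefix=""):
    """Return the key paths of a nested dict/list, going at most max_depth dict levels deep."""
    paths = []
    if max_depth == 0:
        return paths
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}['{key}']"
            paths.append(path)
            paths.extend(key_paths(value, max_depth - 1, path))
    elif isinstance(obj, list) and obj:
        # Assumption: the list is homogeneous, so only its first element is inspected.
        paths.extend(key_paths(obj[0], max_depth, f"{prefix}[0]"))
    return paths


# Hand-made stand-in with the same nesting as the listing explored above.
sample_listing = {
    "kind": "Listing",
    "data": {
        "after": "t3_xyz",
        "children": [{"kind": "t3", "data": {"title": "hypothetical post"}}],
    },
}

for path in key_paths(sample_listing, max_depth=4):
    print(path)
```

Called on the real `artscience_dict`, a helper like this should show at a glance that the post fields live under `['data']['children'][i]['data']`.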

In [49]:
artscience_dict['data']['children'][0]['data']['subreddit'] #subreddit this post comes from ..DUH!

'askscience'

In [50]:
artscience_dict['data']['children'][0]['data']['title'] #title of the first post we see

'AskScience Panel of Scientists XXI'

In [51]:
artscience_dict['data']['children'][0]['data']['selftext']

"**Please read this entire post carefully and format your application appropriately.**\n\nThis post is for new panelist recruitment! The previous one is [here](https://www.reddit.com/r/askscience/comments/amj68a/askscience_panel_of_scientists_xx/).\n\nThe panel is an informal group of redditors who are **either professional scientists or those in training to become so**. All panelists have at least a graduate-level familiarity within their declared field of expertise and answer questions from related areas of study. A panelist's expertise is summarized in a color-coded AskScience flair.\n\nMembership in the panel comes with access to a panelist subreddit. It is a place for panelists to interact with each other, voice concerns to the moderators, and where the moderators make announcements to the whole panel. It's a good place to network with people who share your interests!\n\n---\n\n**You are eligible to join the panel if you:**\n\n* Are studying for at least an MSc. or equivalent degr

In [52]:
#now that we know how to peel through each dict, we need to repeat this for all 27 posts:
#artscience_dict['data']['children'][i]['data'] and everything related..

posts=[p['data']for p in artscience_dict['data']['children']]
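The list comprehension above keeps every field of every post. Since only a handful of fields are used later, a possible variation is to keep just those while flattening. This is a sketch, not the notebook's method: `extract_posts` and the sample data are hypothetical, and the default field list is an assumption based on the columns used further down.

```python
def extract_posts(listing, fields=("name", "subreddit", "title", "selftext")):
    """Flatten a Reddit-style listing dict into a list of small per-post dicts."""
    posts = []
    for child in listing["data"]["children"]:
        data = child["data"]
        # .get() returns None for a missing field instead of raising KeyError.
        posts.append({field: data.get(field) for field in fields})
    return posts


# Hand-made sample with the same nesting as the real response.
sample = {
    "kind": "Listing",
    "data": {
        "after": None,
        "children": [
            {"kind": "t3", "data": {"name": "t3_a", "subreddit": "askscience",
                                    "title": "Q1", "selftext": "", "ups": 10}},
            {"kind": "t3", "data": {"name": "t3_b", "subreddit": "askscience",
                                    "title": "Q2"}},
        ],
    },
}

small_posts = extract_posts(sample)
print(small_posts)
```

The resulting list can be passed to `pd.DataFrame` exactly like `posts`.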

In [53]:
pd.DataFrame(posts)

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,num_comments,is_video,link_flair_template_id
0,,askscience,**Please read this entire post carefully and f...,t2_ec1ey,False,,1,False,AskScience Panel of Scientists XXI,[],...,/r/askscience/comments/cflsy3/askscience_panel...,all_ads,True,https://www.reddit.com/r/askscience/comments/c...,18325684,1563629000.0,0,79,False,
1,,askscience,"Since 2015, using NASA hardware, scientists an...",t2_ec1ey,False,,0,False,AskScience AMA Series: We are experts on NASA'...,[],...,/r/askscience/comments/e1x03w/askscience_ama_s...,all_ads,True,https://www.reddit.com/r/askscience/comments/e...,18325684,1574770000.0,2,518,False,
2,,askscience,,t2_17c6o6,False,,0,False,Do ants that get lost(accidentally get on my b...,[],...,/r/askscience/comments/e4x3ya/do_ants_that_get...,all_ads,False,https://www.reddit.com/r/askscience/comments/e...,18325684,1575282000.0,0,59,False,431d9b62-8971-11e1-abc8-12313d18ad57
3,,askscience,,t2_s04sg,False,,0,False,What part of your brain gets activated when yo...,[],...,/r/askscience/comments/e4ljhw/what_part_of_you...,all_ads,False,https://www.reddit.com/r/askscience/comments/e...,18325684,1575226000.0,2,121,False,3f105c74-dfa7-11e3-99f2-12313b0b31f5
4,,askscience,Wouldn't a pointed bow cut through the water b...,t2_3nhpz9lr,False,,0,False,Why aren't the bows of submarines pointy??,[],...,/r/askscience/comments/e4pn8p/why_arent_the_bo...,all_ads,False,https://www.reddit.com/r/askscience/comments/e...,18325684,1575242000.0,0,115,False,5d6320a8-dfa7-11e3-a65e-12313d18e464
5,,askscience,I am always confused be centrifugal and centri...,t2_3catvkhj,False,,0,False,Do you weigh less at the equator because of ce...,[],...,/r/askscience/comments/e4ffc7/do_you_weigh_les...,all_ads,False,https://www.reddit.com/r/askscience/comments/e...,18325684,1575195000.0,3,562,False,e8738d5c-8970-11e1-9266-12313d2c1af1
6,,askscience,"If the universe is infused with dark matter, w...",t2_fwbr9,False,,0,False,Could there possibly be black holes that forme...,[],...,/r/askscience/comments/e4snp5/could_there_poss...,all_ads,False,https://www.reddit.com/r/askscience/comments/e...,18325684,1575256000.0,0,10,False,26929b46-8971-11e1-aa3a-12313d096aae
7,,askscience,"When I, for example, hold one arm straight to ...",t2_rowg1fb,False,,0,False,Does the brain send signals consistently to ke...,[],...,/r/askscience/comments/e4ls5e/does_the_brain_s...,all_ads,False,https://www.reddit.com/r/askscience/comments/e...,18325684,1575227000.0,1,30,False,3f105c74-dfa7-11e3-99f2-12313b0b31f5
8,,askscience,"&amp;#x200B;\n\nFor the Earth and the Moon, th...",t2_10vppx,False,,0,False,How do axes of orbit for planetary bodies and ...,[],...,/r/askscience/comments/e4ytiy/how_do_axes_of_o...,all_ads,False,https://www.reddit.com/r/askscience/comments/e...,18325684,1575292000.0,0,2,False,26929b46-8971-11e1-aa3a-12313d096aae
9,,askscience,"For example, why is my memory so bad when it c...",t2_lccopuu,False,,0,False,Is remembering a dream the same mechanism as r...,[],...,/r/askscience/comments/e4m05u/is_remembering_a...,all_ads,False,https://www.reddit.com/r/askscience/comments/e...,18325684,1575228000.0,0,8,False,3f105c74-dfa7-11e3-99f2-12313b0b31f5


In [54]:
pd.DataFrame(posts).to_csv('posts.csv')

In [55]:
artscience_dict['data']['after']  #name of the last post on this page; it is the cursor used to request the next page

't3_e4kzrg'

In [56]:
#This is the new URL that gives you the next 25 posts.
url1 + '?after=' + artscience_dict['data']['after']

'https://www.reddit.com/r/askscience.json?after=t3_e4kzrg'
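String concatenation works for a single parameter. As a hedged alternative, the query string can be built with `urllib.parse.urlencode`, which also handles escaping. `page_url` is a hypothetical helper; the `limit` parameter is an assumption about Reddit's listing API (a page-size option) and is not used anywhere in this notebook.

```python
from urllib.parse import urlencode


def page_url(base_url, after=None, limit=None):
    """Build the URL for one page of a listing, omitting parameters that are not set."""
    params = {}
    if after is not None:
        params["after"] = after
    if limit is not None:
        params["limit"] = limit  # assumed page-size parameter, not used in the notebook
    if not params:
        return base_url
    return base_url + "?" + urlencode(params)


base = "https://www.reddit.com/r/askscience.json"
print(page_url(base))
print(page_url(base, after="t3_e4kzrg"))
```

With only `after` set, this reproduces the URL shown in the cell above.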

In [57]:
pd.DataFrame(posts)['name']

0     t3_cflsy3
1     t3_e1x03w
2     t3_e4x3ya
3     t3_e4ljhw
4     t3_e4pn8p
5     t3_e4ffc7
6     t3_e4snp5
7     t3_e4ls5e
8     t3_e4ytiy
9     t3_e4m05u
10    t3_e4olgh
11    t3_e4wmr9
12    t3_e4psvd
13    t3_e43cbk
14    t3_e4il8a
15    t3_e3wjvc
16    t3_e4poq4
17    t3_e4kogm
18    t3_e4ojr4
19    t3_e4m3fo
20    t3_e4qwjq
21    t3_e4qrqr
22    t3_e4qc0m
23    t3_e4oxjv
24    t3_e4ko50
25    t3_e4fm6p
26    t3_e4kzrg
Name: name, dtype: object

## Function to scrape content from Askscience 

In [72]:
artscience_posts = []
after = None

for a in range(25):
    if after is None:
        current_url = url1
    else:
        current_url = url1 + '?after=' + after
    print(current_url)
    res = requests.get(current_url, headers={'User-agent': 'sandshoes 1.0'})
    
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
    
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    artscience_posts.extend(current_posts)
    after = current_dict['data']['after']
    
    # no 'after' token means this was the last page; stop instead of refetching page 1
    if after is None:
        break
    
    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,60)
    print(sleep_duration)
    time.sleep(sleep_duration)


https://www.reddit.com/r/askscience.json
32
https://www.reddit.com/r/askscience.json?after=t3_e4kzrg
38
https://www.reddit.com/r/askscience.json?after=t3_e3js1b
50
https://www.reddit.com/r/askscience.json?after=t3_e2y7to
22
https://www.reddit.com/r/askscience.json?after=t3_e27t0u
41
https://www.reddit.com/r/askscience.json?after=t3_e1w8kr
31
https://www.reddit.com/r/askscience.json?after=t3_e1gesn
46
https://www.reddit.com/r/askscience.json?after=t3_e0lgg4
20
https://www.reddit.com/r/askscience.json?after=t3_dzxl9t
32
https://www.reddit.com/r/askscience.json?after=t3_dytpwl
23
https://www.reddit.com/r/askscience.json?after=t3_dz0f8f
59
https://www.reddit.com/r/askscience.json?after=t3_dydt5x
52
https://www.reddit.com/r/askscience.json?after=t3_dxdnnp
59
https://www.reddit.com/r/askscience.json?after=t3_dxjgyg
7
https://www.reddit.com/r/askscience.json?after=t3_dwonq0
40
https://www.reddit.com/r/askscience.json?after=t3_dwfx0r
59
https://www.reddit.com/r/askscience.json?after=t3_dw7tkl


In [73]:
len(artscience_posts)

627

In [74]:
pd.DataFrame(artscience_posts).to_csv('artscience.csv', index = False)

In [75]:
df1=pd.DataFrame(artscience_posts)

In [76]:
len(set([x['name'] for x in artscience_posts]))

# uniques

627
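The same loop is repeated for each subreddit. One possible way to fold it into a reusable function is sketched below. `scrape_listing` and `fetch_page` are hypothetical names: the page fetcher is passed in as a callable so the paging and de-duplication logic can be tried offline with fake pages, as done here.

```python
def scrape_listing(fetch_page, max_pages=25):
    """Collect posts page by page, following the 'after' cursor.

    fetch_page(after) must return a listing dict shaped like Reddit's JSON.
    Posts are de-duplicated on their 'name' field.
    """
    posts, seen, after = [], set(), None
    for _ in range(max_pages):
        listing = fetch_page(after)
        for child in listing["data"]["children"]:
            post = child["data"]
            if post["name"] not in seen:
                seen.add(post["name"])
                posts.append(post)
        after = listing["data"]["after"]
        if after is None:  # last page reached
            break
    return posts


# Fake two-page listing, with one post deliberately repeated across pages.
fake_pages = {
    None: {"data": {"after": "t3_b",
                    "children": [{"data": {"name": "t3_a"}}, {"data": {"name": "t3_b"}}]}},
    "t3_b": {"data": {"after": None,
                      "children": [{"data": {"name": "t3_b"}}, {"data": {"name": "t3_c"}}]}},
}


def fake_fetch(after):
    return fake_pages[after]


scraped = scrape_listing(fake_fetch)
print([p["name"] for p in scraped])
```

In real use, `fetch_page` would wrap the `requests.get(...)` call, the status check, and the random `time.sleep(...)` from the loop above.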

## Anything and everything about Subreddit 2

In [59]:
url2='https://www.reddit.com/r/Jokes.json'
request2=requests.get(url2,headers={'User-agent': 'proper12 1.0'})

In [60]:
request2.status_code  # status 200, so we are good to go

200

In [61]:
jokes_dict=request2.json()

## Function to scrape content from Jokes

In [67]:
jokes_posts = []
after1 = None
count=0  # total number of seconds spent sleeping
for a in range(25):
    if after1 is None:
        current_url1 = url2
    else:
        current_url1 = url2 + '?after=' + after1
    print(current_url1)
    res = requests.get(current_url1, headers={'User-agent': 'proper12 1.0'})
    
    if res.status_code != 200:
        print('Status error', res.status_code)
        break
    
    current_dict = res.json()
    current_posts = [p['data'] for p in current_dict['data']['children']]
    jokes_posts.extend(current_posts)
    after1 = current_dict['data']['after']
    
    # no 'after' token means this was the last page; stop instead of refetching page 1
    if after1 is None:
        break
    
    # generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(2,60)
    print(sleep_duration)
    time.sleep(sleep_duration)
    count+=sleep_duration

https://www.reddit.com/r/Jokes.json
54
https://www.reddit.com/r/Jokes.json?after=t3_e4xcyj
42
https://www.reddit.com/r/Jokes.json?after=t3_e4mzn9
3
https://www.reddit.com/r/Jokes.json?after=t3_e4pqlc
33
https://www.reddit.com/r/Jokes.json?after=t3_e4xzh5
27
https://www.reddit.com/r/Jokes.json?after=t3_e502p8
30
https://www.reddit.com/r/Jokes.json?after=t3_e4yt2r
12
https://www.reddit.com/r/Jokes.json?after=t3_e4h954
19
https://www.reddit.com/r/Jokes.json?after=t3_e4vjwv
40
https://www.reddit.com/r/Jokes.json?after=t3_e4izr8
27
https://www.reddit.com/r/Jokes.json?after=t3_e4th4u
28
https://www.reddit.com/r/Jokes.json?after=t3_e4e89k
42
https://www.reddit.com/r/Jokes.json?after=t3_e4tkw4
53
https://www.reddit.com/r/Jokes.json?after=t3_e4joa1
20
https://www.reddit.com/r/Jokes.json?after=t3_e3khv0
24
https://www.reddit.com/r/Jokes.json?after=t3_e4e61l
38
https://www.reddit.com/r/Jokes.json?after=t3_e3xeuh
37
https://www.reddit.com/r/Jokes.json?after=t3_e4963m
41
https://www.reddit.com/r/Jo

In [82]:
pd.DataFrame(jokes_posts).to_csv('jokes.csv', index = False)

In [83]:
df2=pd.DataFrame(jokes_posts)

In [84]:
df2.shape

(626, 99)

In [85]:
len(jokes_posts)

626

In [86]:
len(set([x['name'] for x in jokes_posts]))

# uniques

626

In [1]:
#to do while waiting for the scrape to finish:
#keep only subreddit, selftext (the post body) and title from both subreddits
#map subreddit to binary labels: 1 for askscience, 0 for jokes

In [87]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 627 entries, 0 to 626
Data columns (total 100 columns):
approved_at_utc                  0 non-null object
subreddit                        627 non-null object
selftext                         627 non-null object
author_fullname                  623 non-null object
saved                            627 non-null bool
mod_reason_title                 0 non-null object
gilded                           627 non-null int64
clicked                          627 non-null bool
title                            627 non-null object
link_flair_richtext              627 non-null object
subreddit_name_prefixed          627 non-null object
hidden                           627 non-null bool
pwls                             627 non-null int64
link_flair_css_class             622 non-null object
downs                            627 non-null int64
hide_score                       627 non-null bool
name                             627 non-null object
quaranti

In [88]:
df1=df1[['subreddit','selftext','title']].copy()  # .copy() avoids SettingWithCopyWarning on the next line
df1['target']=1

In [89]:
df2=df2[['subreddit','selftext','title']].copy()  # .copy() avoids SettingWithCopyWarning on the next line
df2['target']=0

In [90]:
combined=pd.concat([df1,df2], ignore_index=True)  # renumber rows so index labels are not duplicated

In [91]:
combined.head()

Unnamed: 0,subreddit,selftext,title,target
0,askscience,**Please read this entire post carefully and f...,AskScience Panel of Scientists XXI,1
1,askscience,"Since 2015, using NASA hardware, scientists an...",AskScience AMA Series: We are experts on NASA'...,1
2,askscience,,Do ants that get lost(accidentally get on my b...,1
3,askscience,,What part of your brain gets activated when yo...,1
4,askscience,Wouldn't a pointed bow cut through the water b...,Why aren't the bows of submarines pointy??,1


In [92]:
combined.shape

(1253, 4)
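With 627 AskScience posts and 626 Jokes posts, the two classes are almost perfectly balanced. As a small worked check using only the counts reported above, the share of the larger class gives the accuracy of a trivial "always predict the majority class" rule, which is a common yardstick for any classifier trained on this data.

```python
from collections import Counter

# Class counts taken from the outputs above: target 1 = askscience, 0 = jokes.
class_counts = Counter({1: 627, 0: 626})

total = sum(class_counts.values())
majority_label, majority_count = class_counts.most_common(1)[0]
baseline_accuracy = majority_count / total

print(total, majority_label, round(baseline_accuracy, 4))  # prints: 1253 1 0.5004
```

Any model that does not clearly beat roughly 50% accuracy is therefore doing no better than guessing.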

In [93]:
combined=combined.rename(columns={"selftext": "text"})

In [94]:
combined.to_csv('combined.csv', index = False)
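One thing to watch when this file is loaded again: several posts have an empty `text` field, and by default `pd.read_csv` reads empty fields back as `NaN` rather than as empty strings. The sketch below reproduces this with a tiny hand-made frame and an in-memory buffer instead of `combined.csv`; the sample rows are invented for illustration.

```python
import io

import pandas as pd

# Tiny stand-in for the combined frame; the first row has an empty text body.
sample = pd.DataFrame({
    "subreddit": ["askscience", "Jokes"],
    "text": ["", "hypothetical punchline"],
    "title": ["hypothetical question?", "hypothetical setup"],
    "target": [1, 0],
})

# Round-trip through CSV in memory rather than on disk.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

n_missing = int(reloaded["text"].isna().sum())
print("missing text after reload:", n_missing)

# Restore empty strings so later text processing does not trip over NaN.
reloaded["text"] = reloaded["text"].fillna("")
```

Filling with an empty string is one option; another is to pass `keep_default_na=False` to `pd.read_csv`, at the cost of also disabling NaN detection in the other columns.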