# Project 3: Web APIs & Classification

## Problem Statement

Using Reddit's API, we collect posts from two subreddits, r/worldnews and r/todayilearned, and then use NLP to train a classifier that predicts which subreddit a given post came from.
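The collection step can be sketched as follows. This is a minimal illustration, not the scraper actually used for this project (that code is not shown here). It assumes Reddit's public JSON listing endpoint (`https://www.reddit.com/r/<subreddit>.json`), which returns posts under `data.children` and a pagination token under `data.after`. The function names and the `User-Agent` string are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Reddit's public JSON listing endpoint; the project's real scraper is not shown.
BASE_URL = "https://www.reddit.com/r/{subreddit}.json"


def build_url(subreddit, after=None, limit=25):
    """Build the listing URL for one page of posts."""
    params = {"limit": limit}
    if after:
        params["after"] = after  # pagination token returned by the previous page
    return BASE_URL.format(subreddit=subreddit) + "?" + urllib.parse.urlencode(params)


def extract_posts(payload):
    """Return (list of post dicts, next 'after' token) from one listing response."""
    listing = payload["data"]
    posts = [child["data"] for child in listing["children"]]
    return posts, listing.get("after")


def fetch_page(subreddit, after=None, user_agent="subreddit-classifier-demo/0.1"):
    """Download one page of posts. Makes a network request, so call it explicitly."""
    request = urllib.request.Request(
        build_url(subreddit, after),
        headers={"User-Agent": user_agent},  # hypothetical identifier for this sketch
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return extract_posts(json.load(response))
```

Calling `fetch_page("worldnews")` repeatedly, passing the returned token back in as `after`, walks through successive pages. The resulting list of dicts can be handed to `pd.DataFrame` and saved as a CSV.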

## Executive Summary

### Contents:
- [Scraping reddit for data](#Scraping-reddit-for-data)
- [2018 Data Import and Cleaning](#2018-Data-Import-and-Cleaning)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)
- [Data Visualization](#Visualize-the-data)
- [Descriptive and Inferential Statistics](#Descriptive-and-Inferential-Statistics)
- [Outside Research](#Outside-Research)
- [Conclusions and Recommendations](#Conclusions-and-Recommendations)

In [1]:
import pandas as pd

%matplotlib inline

## Import data

In [2]:
# Forward slashes work on Windows, macOS, and Linux alike
import_path = '../datasets/worldnews.csv'
data = pd.read_csv(import_path)

## Explore and clean data

In [3]:
data.head()

Unnamed: 0,approved_at_utc,subreddit,selftext,author_fullname,saved,mod_reason_title,gilded,clicked,title,link_flair_richtext,...,url,subreddit_subscribers,created_utc,num_crossposts,media,is_video,crosspost_parent_list,link_flair_template_id,crosspost_parent,author_cakeday
0,,worldnews,,t2_4p0y4poi,False,,0,False,Greta Thunberg arrives in Lisbon today for UNc...,[],...,https://www.cbsnews.com/news/greta-thunberg-ar...,22598319,1575409000.0,1,,False,,,,
1,,worldnews,,t2_2yqt,False,,0,False,"'So If You're Poor, You're Dead'? Watch These ...",[],...,https://www.commondreams.org/news/2019/12/03/s...,22598319,1575393000.0,4,,False,,,,
2,,worldnews,,t2_4g3lx,False,,0,False,France's president just fact-checked Trump in ...,[],...,https://www.cnn.com/politics/live-news/nato-su...,22598319,1575388000.0,1,,False,,,,
3,,worldnews,,t2_2nmahwux,False,,1,False,US Navy ‘invited’ to go to Taiwan and ‘have fu...,[],...,https://www.scmp.com/news/china/military/artic...,22598319,1575393000.0,1,,False,,,,
4,,worldnews,,t2_4wcibogo,False,,2,False,"Trump says ""I don't know Prince Andrew"" but ph...",[],...,https://www.businessinsider.com/trump-claims-d...,22598319,1575384000.0,5,,False,,,,


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 795 entries, 0 to 794
Columns: 104 entries, approved_at_utc to author_cakeday
dtypes: bool(29), float64(29), int64(10), object(36)
memory usage: 488.5+ KB


In [5]:
import diagnostics as dia

sebas = dia.Sebastian()

In [6]:
sebas.get_nulls(data)

{'approved_at_utc': 795,
 'selftext': 795,
 'mod_reason_title': 795,
 'link_flair_css_class': 556,
 'author_flair_background_color': 795,
 'author_flair_template_id': 795,
 'secure_media': 795,
 'category': 795,
 'link_flair_text': 556,
 'approved_by': 795,
 'thumbnail': 795,
 'author_flair_css_class': 794,
 'content_categories': 795,
 'mod_note': 795,
 'removed_by_category': 795,
 'banned_by': 795,
 'selftext_html': 795,
 'likes': 795,
 'suggested_sort': 795,
 'banned_at_utc': 795,
 'view_count': 795,
 'author_flair_text': 794,
 'removed_by': 795,
 'num_reports': 795,
 'distinguished': 795,
 'mod_reason_by': 795,
 'removal_reason': 795,
 'link_flair_background_color': 795,
 'report_reasons': 795,
 'discussion_type': 795,
 'author_flair_text_color': 794,
 'media': 795,
 'crosspost_parent_list': 774,
 'link_flair_template_id': 791,
 'crosspost_parent': 774,
 'author_cakeday': 794}
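The `diagnostics` module and its `Sebastian` class are not shown in this notebook. Judging only from the output above, `get_nulls` appears to return a dictionary of null counts for the columns that contain at least one null. A plain-pandas sketch of that assumed behavior (the standalone function below is hypothetical, not the helper's actual source):

```python
import pandas as pd


def get_nulls(df):
    """Return {column: null count} for every column with at least one null."""
    counts = df.isnull().sum()
    return {col: int(n) for col, n in counts.items() if n > 0}


# Toy frame for illustration only; not project data
toy = pd.DataFrame({"a": [1, None, 3], "b": ["x", "y", "z"], "c": [None, None, None]})
null_counts = get_nulls(toy)
```

Casting each count to `int` keeps the dictionary free of NumPy scalar types, which makes it easier to print or serialize.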

In [7]:
sebas.drop_null_cols(data, null_size=500, inplace=True)

Unnamed: 0,subreddit,author_fullname,saved,gilded,clicked,title,link_flair_richtext,subreddit_name_prefixed,hidden,pwls,...,mod_reports,author_patreon_flair,permalink,parent_whitelist_status,stickied,url,subreddit_subscribers,created_utc,num_crossposts,is_video
0,worldnews,t2_4p0y4poi,False,0,False,Greta Thunberg arrives in Lisbon today for UNc...,[],r/worldnews,False,6,...,[],False,/r/worldnews/comments/e5o2la/greta_thunberg_ar...,all_ads,False,https://www.cbsnews.com/news/greta-thunberg-ar...,22598319,1.575409e+09,1,False
1,worldnews,t2_2yqt,False,0,False,"'So If You're Poor, You're Dead'? Watch These ...",[],r/worldnews,False,6,...,[],False,/r/worldnews/comments/e5k8w6/so_if_youre_poor_...,all_ads,False,https://www.commondreams.org/news/2019/12/03/s...,22598319,1.575393e+09,4,False
2,worldnews,t2_4g3lx,False,0,False,France's president just fact-checked Trump in ...,[],r/worldnews,False,6,...,[],False,/r/worldnews/comments/e5itnr/frances_president...,all_ads,False,https://www.cnn.com/politics/live-news/nato-su...,22598319,1.575388e+09,1,False
3,worldnews,t2_2nmahwux,False,1,False,US Navy ‘invited’ to go to Taiwan and ‘have fu...,[],r/worldnews,False,6,...,[],False,/r/worldnews/comments/e5k65a/us_navy_invited_t...,all_ads,False,https://www.scmp.com/news/china/military/artic...,22598319,1.575393e+09,1,False
4,worldnews,t2_4wcibogo,False,2,False,"Trump says ""I don't know Prince Andrew"" but ph...",[],r/worldnews,False,6,...,[],False,/r/worldnews/comments/e5i160/trump_says_i_dont...,all_ads,False,https://www.businessinsider.com/trump-claims-d...,22598319,1.575384e+09,5,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
790,worldnews,t2_nyc3b,False,0,False,"As Troubles Grow, Mexicans Keep the Faith With...",[],r/worldnews,False,6,...,[],False,/r/worldnews/comments/e4vgkv/as_troubles_grow_...,all_ads,False,https://www.nytimes.com/2019/12/01/world/ameri...,22598319,1.575271e+09,0,False
791,worldnews,t2_4x03o0iu,False,0,False,At least 21 killed in shootout between Mexican...,[],r/worldnews,False,6,...,[],False,/r/worldnews/comments/e4npl2/at_least_21_kille...,all_ads,False,https://www.upi.com/Top_News/World-News/2019/1...,22598319,1.575234e+09,0,False
792,worldnews,t2_1s32j2qz,False,0,False,‘All eyes on the Kingdom’ as Saudi Arabia take...,[],r/worldnews,False,6,...,[],False,/r/worldnews/comments/e4wbj6/all_eyes_on_the_k...,all_ads,False,https://www.arabnews.com/node/1592561/saudi-ar...,22598319,1.575276e+09,0,False
793,worldnews,t2_4q4rbmz7,False,0,False,Nasa astronaut snaps stunning view of London a...,[],r/worldnews,False,6,...,[],False,/r/worldnews/comments/e4zz66/nasa_astronaut_sn...,all_ads,False,https://www.thesun.co.uk/tech/10449524/nasa-as...,22598319,1.575298e+09,0,False


In [11]:
sebas.get_nulls(data)

{'link_flair_css_class': 556, 'link_flair_text': 556}
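Dropping sparse columns can also be done with pandas alone. The sketch below is an assumption-laden stand-in, not the source of `drop_null_cols`: it removes every column whose null count exceeds a `max_nulls` cut-off (a hypothetical parameter name). The exact meaning of the helper's `null_size` argument is not shown, and two columns with 556 nulls remain after the `null_size=500` call above, so the helper's rule is evidently not a simple "more than 500 nulls" test.

```python
import pandas as pd


def drop_null_cols(df, max_nulls):
    """Return a copy of df without the columns that have more than max_nulls nulls."""
    null_counts = df.isnull().sum()
    to_drop = null_counts[null_counts > max_nulls].index
    return df.drop(columns=to_drop)


# Toy frame for illustration only; not project data
toy = pd.DataFrame({"a": [1, None, 3], "b": ["x", "y", "z"], "c": [None, None, None]})
trimmed = drop_null_cols(toy, max_nulls=1)  # keeps "a" (1 null) and "b" (0 nulls)
```

Returning a new frame instead of mutating in place leaves the original data available for comparison.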

In [9]:
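# Equivalent check without the helper.
# Series.iteritems() was removed in pandas 2.0, so .items() is used instead.
# for k, v in data.isnull().sum().items():
#     if v != 0:
#         print(k, v)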

In [12]:
data.columns

Index(['subreddit', 'author_fullname', 'saved', 'gilded', 'clicked', 'title',
       'link_flair_richtext', 'subreddit_name_prefixed', 'hidden', 'pwls',
       'link_flair_css_class', 'downs', 'hide_score', 'name', 'quarantine',
       'link_flair_text_color', 'subreddit_type', 'ups',
       'total_awards_received', 'media_embed', 'is_original_content',
       'user_reports', 'is_reddit_media_domain', 'is_meta',
       'secure_media_embed', 'link_flair_text', 'can_mod_post', 'score',
       'author_premium', 'edited', 'steward_reports', 'author_flair_richtext',
       'gildings', 'is_self', 'created', 'link_flair_type', 'wls',
       'author_flair_type', 'domain', 'allow_live_comments', 'archived',
       'no_follow', 'is_crosspostable', 'pinned', 'over_18', 'all_awardings',
       'awarders', 'media_only', 'can_gild', 'spoiler', 'locked', 'visited',
       'subreddit_id', 'id', 'is_robot_indexable', 'author', 'num_comments',
       'send_replies', 'whitelist_status', 'contest_mode', '