# Reddit API and Classification

**Data Cleaning and EDA**
- Are missing values imputed/handled appropriately?
- Are distributions examined and described?
- Are outliers identified and addressed?
- Are appropriate summary statistics provided?
- Are steps taken during data cleaning and EDA framed appropriately?
- Does the student address whether or not they are likely to be able to answer their problem statement with the provided data given what they've discovered during EDA?

**Previous:** [Data Collection](./01_data_collection.ipynb)

## Data Cleaning and Exploratory Data Analysis
In the previous section, we collected posts from two subreddits into dataframes:
- r/Android
- r/apple

In this section, we clean the collected data and then carry out some exploratory analysis on the cleaned data.

#### Library Imports

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, classification_report

from bs4 import BeautifulSoup
import regex as re
import requests
import time
import random

#### Data imports

In [2]:
df_android = pd.read_csv('../datasets/android_posts.csv')
df_apple = pd.read_csv('../datasets/apple_posts.csv')

### Exploring the dataframes

In [3]:
# Preview the first 5 rows (transposed so every column is visible)
pd.set_option('display.max_rows', None)
df_android.head().T

Unnamed: 0,0,1,2,3,4
approved_at_utc,,,,,
subreddit,Android,Android,Android,Android,Android
selftext,Note 1. [Check out our apps wiki](https://www....,,,,
author_fullname,t2_6l4z3,t2_1xqjsw6h,t2_3qp7grch,t2_2fxjvwp2,t2_17k4v5
saved,False,False,False,False,False
mod_reason_title,,,,,
gilded,0,0,0,0,0
clicked,False,False,False,False,False
title,Saturday APPreciation (Sep 26 2020) - Your wee...,"Android 11 got rid of the 4GB limit on videos,...",Google Pixel 4a Smartphone Audio Review,NewPipe tests new Unified Player UI with seaml...,[Erica Griffin] Surface Duo: Above the Fold (D...
link_flair_richtext,[],[],[],[],[]


In [4]:
# Preview the first 5 rows (transposed so every column is visible)
df_apple.head().T

Unnamed: 0,0,1,2,3,4
approved_at_utc,,,,,
subreddit,apple,apple,apple,apple,apple
selftext,\n\nWelcome to the daily Tech Support thread f...,"## Hello, /r/Apple, and welcome to Wallpaper W...",,,
author_fullname,t2_6l4z3,t2_6l4z3,t2_jp69e,t2_hqdhy,t2_418s4esu
saved,False,False,False,False,False
mod_reason_title,,,,,
gilded,0,0,1,0,0
clicked,False,False,False,False,False
title,Daily Tech Support Thread - [September 26],Wallpaper Wednesday - [September 23],"Hey, I made a 3D minesweeper game that's free ...",We just released a new content update for 1sla...,"I helped make a hand-painted, full length, act..."
link_flair_richtext,"[{'e': 'text', 't': 'Official Megathread'}]","[{'e': 'text', 't': 'Official Megathread'}]","[{'e': 'text', 't': 'Promo Saturday'}]","[{'e': 'text', 't': 'Promo Saturday'}]","[{'e': 'text', 't': 'Promo Saturday'}]"



In [5]:
#preview shape of dataframes
print(f"Android dataframe has {df_android.shape[0]} rows, {df_android.shape[1]} columns")
print(f"Apple dataframe has {df_apple.shape[0]} rows, {df_apple.shape[1]} columns")

Android dataframe has 730 rows, 109 columns
Apple dataframe has 752 rows, 108 columns


In [6]:
# Columns present in the Android dataframe but not in the Apple dataframe
[col for col in df_android.columns if col not in df_apple.columns]

['author_cakeday']
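
The comprehension above only checks one direction (columns in `df_android` that are missing from `df_apple`). A minimal sketch of a two-way check follows. It uses small stand-in frames whose contents are made up for illustration, not the real data:

```python
import pandas as pd

# Stand-in frames for illustration only; the real ones are df_android and df_apple.
left = pd.DataFrame({"title": ["a"], "score": [1], "author_cakeday": [True]})
right = pd.DataFrame({"title": ["b"], "score": [2]})

# Columns present in one frame but not the other, in each direction.
only_in_left = left.columns.difference(right.columns).tolist()
only_in_right = right.columns.difference(left.columns).tolist()

print(only_in_left)   # ['author_cakeday']
print(only_in_right)  # []
```

If both lists come back empty, the two frames share the same set of columns.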

A Reddit "Cake Day" is the yearly anniversary of the day a user (Redditor) created their account. <br>
If `author_cakeday` is True, the post was made on its author's cake day. <br>
Reddit marks such posts with a cake icon beside the author's name. <br>
It has been claimed that other Redditors upvote more generously on an author's cake day. Since the column appears in only one of the two dataframes, we inspect its values and then drop it.

In [7]:
df_android['author_cakeday'].value_counts()

True    2
Name: author_cakeday, dtype: int64

In [8]:
#drop 'author_cakeday' column
df_android.drop(columns='author_cakeday', inplace=True)

In [9]:
#check column is dropped
print(f"Android dataframe now has {df_android.shape[0]} rows, {df_android.shape[1]} columns")

Android dataframe now has 730 rows, 108 columns


#### Filter out columns that are not useful
**Note 1** <br>
Many columns will not help our modelling later on, because they hold at most one distinct value across all posts.

The function defined below prints the number of unique values in each column of the input dataframe, and displays the value counts for columns with more than 1 but fewer than 10 unique values.

This gives us an overview of the kind of values in each column and whether they carry useful information. Value counts for columns with 10 or more unique values are not displayed, as the output would be overcrowded; we look into those columns individually below.

**Note 2**<br>
We will retain the subreddit column since it is our target variable.
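
The grouping described above can also be sketched with `DataFrame.nunique()`, returning the three groups instead of appending to lists defined outside the function. The helper name `bucket_columns`, its parameters, and the toy frame are assumptions for illustration and are not part of this notebook:

```python
import pandas as pd

def bucket_columns(df, target="subreddit", many=10):
    """Group columns by their number of distinct non-null values (hypothetical helper)."""
    # Leave the target column out so it is never marked for dropping.
    counts = df.drop(columns=[target], errors="ignore").nunique()
    return {
        "constant": counts[counts <= 1].index.tolist(),  # 0 or 1 distinct values
        "few": counts[(counts > 1) & (counts < many)].index.tolist(),
        "many": counts[counts >= many].index.tolist(),
    }

# Toy data, made up for illustration.
toy = pd.DataFrame({
    "subreddit": ["apple"] * 12,
    "saved": [False] * 12,
    "gilded": [0, 0, 1, 0, 0, 0, 3, 0, 0, 0, 0, 0],
    "title": [f"post {i}" for i in range(12)],
})
buckets = bucket_columns(toy)
print(buckets)
```

Columns that are entirely null count as having 0 distinct values and so land in the `constant` group, which matches how `approved_at_utc` is reported by the function below.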

In [10]:
cols_not_unique = []     # 0 or 1 unique values: candidates to drop
cols_nottoo_unique = []  # 2 to 9 unique values
cols_too_unique = []     # 10 or more unique values

def view_item_values(df, cols_not_unique, cols_nottoo_unique, cols_too_unique):
    """Print unique-value counts per column and sort column names into the three lists."""
    for col in df.columns:
        n_unique = df[col].nunique()
        print("---")
        if col == 'subreddit':
            # Target variable: report it, but never mark it for dropping
            print(f"Column: '{col}' only has {n_unique} unique values")
        elif n_unique <= 1:
            cols_not_unique.append(col)
            print(f"Column: '{col}' only has {n_unique} unique values")
        elif n_unique >= 10:
            # Too many values to list; these columns are inspected individually
            print(f"Column: '{col}' has {n_unique} unique values")
            cols_too_unique.append(col)
        else:
            print(f"Column: '{col}' has {n_unique} unique values as follows: ")
            print(df[col].value_counts())
            cols_nottoo_unique.append(col)
    print("\n")
    print(f"Columns not unique to be dropped: {cols_not_unique}.")
    print("\n")
    print(f"Columns not too unique: {cols_nottoo_unique}.")
    print("\n")
    print(f"Columns too unique: {cols_too_unique}.")

In [11]:
view_item_values(df_apple, cols_not_unique , cols_nottoo_unique, cols_too_unique)

---
Column: 'approved_at_utc' only has 0 unique values
---
Column: 'subreddit' only has 1 unique values
---
Column: 'selftext' has 152 unique values
---
Column: 'author_fullname' has 461 unique values
---
Column: 'saved' only has 1 unique values
---
Column: 'mod_reason_title' only has 0 unique values
---
Column: 'gilded' has 3 unique values as follows: 
0    749
1      2
3      1
Name: gilded, dtype: int64
---
Column: 'clicked' only has 1 unique values
---
Column: 'title' has 752 unique values
---
Column: 'link_flair_richtext' has 41 unique values
---
Column: 'subreddit_name_prefixed' only has 1 unique values
---
Column: 'hidden' only has 1 unique values
---
Column: 'pwls' only has 1 unique values
---
Column: 'link_flair_css_class' has 27 unique values
---
Column: 'downs' only has 1 unique values
---
Column: 'thumbnail_height' has 48 unique values
---
Column: 'top_awarded_type' only has 0 unique values
---
Column: 'hide_score' has 2 unique values as follows: 
False    750
True       2


In [12]:
# Drop the columns with at most one unique value (identified from df_apple) from both dataframes
df_android.drop(columns=cols_not_unique, inplace=True)
df_apple.drop(columns=cols_not_unique, inplace=True)

In [13]:
#new shape of dataframes
print(f"Android dataframe now has {df_android.shape[0]} rows, {df_android.shape[1]} columns")
print(f"Apple dataframe now has {df_apple.shape[0]} rows, {df_apple.shape[1]} columns")

Android dataframe now has 730 rows, 52 columns
Apple dataframe now has 752 rows, 52 columns
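
The list of columns dropped above was built from `df_apple` alone, so a column that is constant in r/apple but varies in r/Android would be dropped from both dataframes. A more cautious variant, sketched here on made-up toy frames with a hypothetical helper `constant_columns`, drops only the columns that are constant in both:

```python
import pandas as pd

def constant_columns(df, keep=("subreddit",)):
    """Columns with at most one distinct non-null value, excluding `keep` (hypothetical helper)."""
    counts = df.nunique()
    return {col for col in counts[counts <= 1].index if col not in keep}

# Toy frames, made up for illustration.
a = pd.DataFrame({"subreddit": ["apple"] * 3,
                  "saved": [False] * 3,
                  "pinned": [False] * 3})
b = pd.DataFrame({"subreddit": ["Android"] * 3,
                  "saved": [False] * 3,
                  "pinned": [False, True, False]})

# Only drop what is constant in both frames.
shared_constant = sorted(constant_columns(a) & constant_columns(b))
a_clean = a.drop(columns=shared_constant)
b_clean = b.drop(columns=shared_constant)
print(shared_constant)  # ['saved']
```

Here `pinned` survives because it varies in one of the two frames, even though it is constant in the other.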


In [16]:
df_android.head().T

Unnamed: 0,0,1,2,3,4
subreddit,Android,Android,Android,Android,Android
selftext,Note 1. [Check out our apps wiki](https://www....,,,,
author_fullname,t2_6l4z3,t2_1xqjsw6h,t2_3qp7grch,t2_2fxjvwp2,t2_17k4v5
gilded,0,0,0,0,0
title,Saturday APPreciation (Sep 26 2020) - Your wee...,"Android 11 got rid of the 4GB limit on videos,...",Google Pixel 4a Smartphone Audio Review,NewPipe tests new Unified Player UI with seaml...,[Erica Griffin] Surface Duo: Above the Fold (D...
link_flair_richtext,[],[],[],[],[]
link_flair_css_class,,removed,,,
thumbnail_height,,78,140,93,105
hide_score,False,False,False,False,False
name,t3_j0489i,t3_j07q6c,t3_j0fsoc,t3_izwqvf,t3_j09aot


In [15]:
df_apple.head()

Unnamed: 0,subreddit,selftext,author_fullname,gilded,title,link_flair_richtext,link_flair_css_class,thumbnail_height,hide_score,name,...,send_replies,permalink,stickied,url,subreddit_subscribers,created_utc,num_crossposts,media,url_overridden_by_dest,link_flair_template_id
0,apple,\n\nWelcome to the daily Tech Support thread f...,t2_6l4z3,0,Daily Tech Support Thread - [September 26],"[{'e': 'text', 't': 'Official Megathread'}]",megathread,,False,t3_j07nhs,...,False,/r/apple/comments/j07nhs/daily_tech_support_th...,True,https://www.reddit.com/r/apple/comments/j07nhs...,1798906,1601133000.0,0,,,
1,apple,"## Hello, /r/Apple, and welcome to Wallpaper W...",t2_6l4z3,0,Wallpaper Wednesday - [September 23],"[{'e': 'text', 't': 'Official Megathread'}]",megathread,,False,t3_iy9tar,...,False,/r/apple/comments/iy9tar/wallpaper_wednesday_s...,True,https://www.reddit.com/r/apple/comments/iy9tar...,1798906,1600867000.0,0,,,
2,apple,,t2_jp69e,1,"Hey, I made a 3D minesweeper game that's free ...","[{'e': 'text', 't': 'Promo Saturday'}]",promo,73.0,False,t3_j04k3k,...,True,/r/apple/comments/j04k3k/hey_i_made_a_3d_mines...,False,https://apps.apple.com/us/app/id1529127991,1798906,1601120000.0,0,,https://apps.apple.com/us/app/id1529127991,854c34e2-5702-11e9-bf73-0e73ef6cdf98
3,apple,,t2_hqdhy,0,We just released a new content update for 1sla...,"[{'e': 'text', 't': 'Promo Saturday'}]",promo,73.0,False,t3_j0fq3m,...,True,/r/apple/comments/j0fq3m/we_just_released_a_ne...,False,https://apps.apple.com/us/app/1sland/id1492928510,1798906,1601161000.0,0,,https://apps.apple.com/us/app/1sland/id1492928510,854c34e2-5702-11e9-bf73-0e73ef6cdf98
4,apple,,t2_418s4esu,0,"I helped make a hand-painted, full length, act...","[{'e': 'text', 't': 'Promo Saturday'}]",promo,73.0,False,t3_j0amsh,...,True,/r/apple/comments/j0amsh/i_helped_make_a_handp...,False,https://apps.apple.com/us/app/echoes-of-aeons/...,1798906,1601143000.0,0,,https://apps.apple.com/us/app/echoes-of-aeons/...,854c34e2-5702-11e9-bf73-0e73ef6cdf98


In [10]:
pd.set_option('display.max_info_columns', 110)
# pandas >= 1.2 uses `show_counts`; older versions call this argument `null_counts`
df_android.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 730 entries, 0 to 729
Data columns (total 109 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   approved_at_utc                0 non-null      float64
 1   subreddit                      730 non-null    object 
 2   selftext                       188 non-null    object 
 3   author_fullname                723 non-null    object 
 4   saved                          730 non-null    bool   
 5   mod_reason_title               0 non-null      float64
 6   gilded                         730 non-null    int64  
 7   clicked                        730 non-null    bool   
 8   title                          730 non-null    object 
 9   link_flair_richtext            730 non-null    object 
 10  subreddit_name_prefixed        730 non-null    object 
 11  hidden                         730 non-null    bool   
 12  pwls                           730 non-null    in

In [11]:
# pandas >= 1.2 uses `show_counts`; older versions call this argument `null_counts`
df_apple.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 752 entries, 0 to 751
Data columns (total 108 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   approved_at_utc                0 non-null      float64
 1   subreddit                      752 non-null    object 
 2   selftext                       181 non-null    object 
 3   author_fullname                750 non-null    object 
 4   saved                          752 non-null    bool   
 5   mod_reason_title               0 non-null      float64
 6   gilded                         752 non-null    int64  
 7   clicked                        752 non-null    bool   
 8   title                          752 non-null    object 
 9   link_flair_richtext            752 non-null    object 
 10  subreddit_name_prefixed        752 non-null    object 
 11  hidden                         752 non-null    bool   
 12  pwls                           752 non-null    in
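
The non-null counts above show that `selftext` is populated for only a minority of posts (188 of 730 in r/Android, 181 of 752 in r/apple), while `title` is always present. One possible way to quantify and handle this is sketched below on made-up rows. Filling with an empty string and the combined `text` column are assumptions for illustration, not steps taken in this notebook:

```python
import pandas as pd

# Made-up rows standing in for the real posts.
posts = pd.DataFrame({
    "title": ["Pixel 4a review", "Daily thread", "iOS 14 tips"],
    "selftext": [None, "Welcome to the thread", None],
})

# Share of missing values per column, highest first.
missing_share = posts.isna().mean().sort_values(ascending=False)
print(missing_share)

# Treat a missing body as empty text so the title can still be used on its own.
posts["selftext"] = posts["selftext"].fillna("")
posts["text"] = (posts["title"] + " " + posts["selftext"]).str.strip()
print(posts["text"].tolist())
```

Keeping link-only posts and relying on their titles avoids discarding most of the rows, at the cost of shorter text per post.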

In [35]:
df_android['is_reddit_media_domain'].value_counts()

False    730
Name: is_reddit_media_domain, dtype: int64

In [None]:
df_android['edited'].value_counts()

In [None]:
# Partial list of column names (pasted array output), wrapped so the cell parses
np.array(['ups',
       'total_awards_received', 'media_embed', 'thumbnail_width',
       'author_flair_template_id', 'is_original_content', 'user_reports',
       'secure_media', 'is_reddit_media_domain', 'is_meta', 'category',
       'secure_media_embed', 'link_flair_text', 'can_mod_post', 'score',
       'approved_by', 'author_premium', 'thumbnail', 'edited',
       'author_flair_css_class', 'author_flair_richtext', 'gildings',
       'post_hint', 'content_categories', 'is_self', 'mod_note',
       'created', 'link_flair_type', 'wls', 'removed_by_category',
       'banned_by', 'author_flair_type', 'domain', 'allow_live_comments',
       'selftext_html', 'likes', 'suggested_sort', 'banned_at_utc',
       'view_count', 'archived', 'no_follow', 'is_crosspostable',
       'pinned', 'over_18', 'preview', 'all_awardings', 'awarders',
       'media_only', 'can_gild', 'spoiler', 'locked', 'author_flair_text',
       'treatment_tags', 'visited', 'removed_by', 'num_reports',
       'distinguished', 'subreddit_id', 'mod_reason_by', 'removal_reason',
       'link_flair_background_color', 'id', 'is_robot_indexable',
       'report_reasons', 'author', 'discussion_type', 'num_comments',
       'send_replies', 'whitelist_status', 'contest_mode', 'mod_reports',
       'author_patreon_flair', 'author_flair_text_color', 'permalink',
       'parent_whitelist_status', 'stickied', 'url',
       'subreddit_subscribers', 'created_utc', 'num_crossposts', 'media',
       'is_video', 'url_overridden_by_dest', 'link_flair_template_id',
       'author_cakeday'], dtype=object)

**Next:** [Preprocessing and Modeling](./03_preprocessing_and_modeling.ipynb)