# Reddit API and Classification

**Data Cleaning and EDA**
- Are missing values imputed/handled appropriately?
- Are distributions examined and described?
- Are outliers identified and addressed?
- Are appropriate summary statistics provided?
- Are steps taken during data cleaning and EDA framed appropriately?
- Does the student address whether or not they are likely to be able to answer their problem statement with the provided data given what they've discovered during EDA?

**Previous:** [Data Collection](./01_data_collection.ipynb)

## Data Cleaning and Exploratory Data Analysis
In the previous section, we collected posts from two subreddits into dataframes:
- r/Android
- r/apple

In this section, we will clean the collected data and then carry out some exploratory analysis on the cleaned data.

#### Library Imports

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, classification_report

from bs4 import BeautifulSoup
import regex as re
import requests
import time
import random

#### Data imports

In [2]:
df_android = pd.read_csv('../datasets/android_posts.csv')
df_apple = pd.read_csv('../datasets/apple_posts.csv')

### Exploring the dataframes

In [3]:
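# show all rows so the transposed preview is not truncated
pd.set_option('display.max_rows', None)
# preview the first 5 rows, transposed so each column reads as a row
df_android.head().T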

Unnamed: 0,0,1,2,3,4
approved_at_utc,,,,,
subreddit,Android,Android,Android,Android,Android
selftext,Note 1. [Check out our apps wiki](https://www....,,,,
author_fullname,t2_6l4z3,t2_1xqjsw6h,t2_3qp7grch,t2_2fxjvwp2,t2_17k4v5
saved,False,False,False,False,False
mod_reason_title,,,,,
gilded,0,0,0,0,0
clicked,False,False,False,False,False
title,Saturday APPreciation (Sep 26 2020) - Your wee...,"Android 11 got rid of the 4GB limit on videos,...",Google Pixel 4a Smartphone Audio Review,NewPipe tests new Unified Player UI with seaml...,[Erica Griffin] Surface Duo: Above the Fold (D...
link_flair_richtext,[],[],[],[],[]


In [4]:
# preview the first 5 rows, transposed so each column reads as a row
df_apple.head().T

Unnamed: 0,0,1,2,3,4
approved_at_utc,,,,,
subreddit,apple,apple,apple,apple,apple
selftext,\n\nWelcome to the daily Tech Support thread f...,"## Hello, /r/Apple, and welcome to Wallpaper W...",,,
author_fullname,t2_6l4z3,t2_6l4z3,t2_jp69e,t2_hqdhy,t2_418s4esu
saved,False,False,False,False,False
mod_reason_title,,,,,
gilded,0,0,1,0,0
clicked,False,False,False,False,False
title,Daily Tech Support Thread - [September 26],Wallpaper Wednesday - [September 23],"Hey, I made a 3D minesweeper game that's free ...",We just released a new content update for 1sla...,"I helped make a hand-painted, full length, act..."
link_flair_richtext,"[{'e': 'text', 't': 'Official Megathread'}]","[{'e': 'text', 't': 'Official Megathread'}]","[{'e': 'text', 't': 'Promo Saturday'}]","[{'e': 'text', 't': 'Promo Saturday'}]","[{'e': 'text', 't': 'Promo Saturday'}]"


For this project, our goal is to use Natural Language Processing (NLP) to train a classifier that predicts which subreddit a given post came from. We therefore identified the following columns as essential for our model:
- Significant text data in 'selftext' and 'title'
- Target variable in 'subreddit'

We will select these columns from both dataframes and combine them into a new dataframe.

In [18]:
cols = ['subreddit','title','selftext']

In [19]:
df = pd.concat([df_android[cols], df_apple[cols]], axis=0, join='outer')

In [27]:
#save df checkpoint
df.to_csv('../datasets/combined_df.csv')

In [26]:
df.iloc[729:731]

Unnamed: 0,subreddit,title,selftext
729,Android,U.S. Government Contractor Embedded Software i...,
0,apple,Daily Tech Support Thread - [September 26],\n\nWelcome to the daily Tech Support thread f...
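
The slice above shows the row labels jumping from 729 back to 0, because `pd.concat` keeps each frame's original index by default. The following is a minimal sketch, using small stand-in frames with made-up rows rather than the real data, of how `ignore_index=True` would give the combined dataframe one continuous index:

```python
import pandas as pd

# Toy stand-ins for df_android[cols] and df_apple[cols] (hypothetical rows)
android = pd.DataFrame({'subreddit': ['Android', 'Android'],
                        'title': ['post a', 'post b']})
apple = pd.DataFrame({'subreddit': ['apple', 'apple'],
                      'title': ['post c', 'post d']})

# Default behaviour: each frame keeps its own labels, so 0 and 1 repeat
stacked = pd.concat([android, apple], axis=0)
print(stacked.index.tolist())

# ignore_index=True relabels the rows 0..n-1
renumbered = pd.concat([android, apple], axis=0, ignore_index=True)
print(renumbered.index.tolist())
```

With a unique index, label-based lookups such as `.loc[0]` return a single row instead of one row per source frame.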


In [29]:
# without parentheses this displays the bound method (and the Series), not the unique values
df['title'].unique

<bound method Series.unique of 0      Saturday APPreciation (Sep 26 2020) - Your wee...
1      Android 11 got rid of the 4GB limit on videos,...
2                Google Pixel 4a Smartphone Audio Review
3      NewPipe tests new Unified Player UI with seaml...
4      [Erica Griffin] Surface Duo: Above the Fold (D...
5      OneNote on Android is an unfortunately broken ...
6      How to use Samloader to download updates for y...
7      Some Pixel 4 owners are experiencing rapid bat...
8      Pixel 2 camera curse continues — and it's spre...
9                                  LG Wing in for review
10                          TCL 10L Review - Too Budget?
11     Google Messages 6.7 prepares to let you automa...
12     [DEV] DirectChat replicates Android 11 bubbles...
13     Google to Increase Push for Apps to Give Cut o...
14          Brief impressions of the Galaxy Fold (gen 1)
15     CopperheadOS Android 11 booting on a Pixel 4a ...
16     Samsung Galaxy Buds Live Review: Tasty design ...
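
The output above starts with `<bound method Series.unique of ...`, which means the method was referenced but not called, so the full Series is displayed rather than its distinct values. As a minimal sketch on a toy Series (the three titles are a stand-in, not the real column), calling the method and counting repeats might look like this:

```python
import pandas as pd

# Toy titles (stand-in data) with one repeated entry
titles = pd.Series(['LG Wing in for review',
                    'TCL 10L Review - Too Budget?',
                    'LG Wing in for review'], name='title')

unique_titles = titles.unique()        # array of distinct values, in order of appearance
n_unique = titles.nunique()            # number of distinct values
n_repeats = titles.duplicated().sum()  # rows that repeat an earlier title

print(unique_titles)
print(n_unique, n_repeats)
```

Comparing `nunique()` with the number of rows is a quick way to check whether the same post was collected more than once.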


In [10]:
pd.set_option('display.max_info_columns', 110)
# show_counts replaces the deprecated null_counts argument (pandas >= 1.2)
df_android.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 730 entries, 0 to 729
Data columns (total 109 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   approved_at_utc                0 non-null      float64
 1   subreddit                      730 non-null    object 
 2   selftext                       188 non-null    object 
 3   author_fullname                723 non-null    object 
 4   saved                          730 non-null    bool   
 5   mod_reason_title               0 non-null      float64
 6   gilded                         730 non-null    int64  
 7   clicked                        730 non-null    bool   
 8   title                          730 non-null    object 
 9   link_flair_richtext            730 non-null    object 
 10  subreddit_name_prefixed        730 non-null    object 
 11  hidden                         730 non-null    bool   
 12  pwls                           730 non-null    in
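
The `info()` output above shows that `selftext` has only 188 non-null values out of 730 r/Android posts, while `title` is always present. One possible way to handle this, sketched here on a toy frame with made-up rows, is to treat a missing body as an empty string and join it with the title. The combined column name `text` is an assumption for illustration, not something defined in this notebook:

```python
import pandas as pd

# Toy frame (hypothetical rows): the first post has no body text, so selftext is missing
posts = pd.DataFrame({
    'subreddit': ['Android', 'apple'],
    'title': ['LG Wing in for review', 'Wallpaper Wednesday'],
    'selftext': [None, 'Share your wallpapers here'],
})

# Count missing values per column
missing = posts.isnull().sum()
print(missing)

# Treat a missing body as an empty string, then join title and body
posts['selftext'] = posts['selftext'].fillna('')
posts['text'] = (posts['title'] + ' ' + posts['selftext']).str.strip()
print(posts['text'].tolist())
```

Filling with an empty string keeps every post in the dataset, whereas dropping rows with missing `selftext` would discard most of the r/Android posts.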

**Next:** [Preprocessing and Modeling](./03_preprocessing_and_modeling.ipynb)