# Reddit API and Classification

**Data Cleaning and EDA**
- Are missing values imputed/handled appropriately?
- Are distributions examined and described?
- Are outliers identified and addressed?
- Are appropriate summary statistics provided?
- Are steps taken during data cleaning and EDA framed appropriately?
- Does the student address whether or not they are likely to be able to answer their problem statement with the provided data given what they've discovered during EDA?

**Previous:** [Data Collection](./01_data_collection.ipynb)

## Data Cleaning and Exploratory Data Analysis
In the previous section, we collected posts from two subreddits into dataframes:
- r/Android
- r/apple

In this section, we clean the collected data and then carry out some analysis on the cleaned data.

#### Library Imports

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, classification_report
from nltk.corpus import stopwords # Import the stop word list

from bs4 import BeautifulSoup
import regex as re
import requests
import time
import random

#### Data imports

In [2]:
df_android = pd.read_csv('../datasets/android_posts.csv')
df_apple = pd.read_csv('../datasets/apple_posts.csv')

### Exploring the dataframes

In [3]:
#preview the first 5 rows of data (transposed)
pd.set_option('display.max_rows', None)
df_android.head().T

Unnamed: 0,0,1,2,3,4
approved_at_utc,,,,,
subreddit,Android,Android,Android,Android,Android
selftext,"Note 1. Join our IRC, and Telegram chat-rooms!...",,,,
author_fullname,t2_6l4z3,t2_q4p0j,t2_31mkizvx,t2_gernm,t2_cc9vk
saved,False,False,False,False,False
mod_reason_title,,,,,
gilded,0,0,0,0,0
clicked,False,False,False,False,False
title,Sunday Rant/Rage (Sep 27 2020) - Your weekly c...,Google Maps is getting dedicated car mode UI,Tasker lets you intercept Samsung S Pen gestur...,22% off nearly everything in European Google S...,The new Galaxy S20 FE: $100 off at Amazon and ...
link_flair_richtext,[],[],[],[],[]


In [4]:
#preview the first 5 rows of data (transposed)
df_apple.head().T

Unnamed: 0,0,1,2,3,4
approved_at_utc,,,,,
subreddit,apple,apple,apple,apple,apple
selftext,\n\nWelcome to the daily Tech Support thread f...,"\n\nHello /r/Apple, and welcome to ""Shortcuts ...",,,
author_fullname,t2_6l4z3,t2_6l4z3,t2_78w5nxue,t2_th4cg,t2_15c325
saved,False,False,False,False,False
mod_reason_title,,,,,
gilded,0,0,0,0,4
clicked,False,False,False,False,False
title,Daily Tech Support Thread - [September 27],Shortcuts Sunday - [September 27],iOS 14: 'Phoenix 2' Space Shooter Delivers Pla...,iOS 14: How to stop your AirPods automatically...,There was no NASA Astronomy Picture of the Day...
link_flair_richtext,"[{'e': 'text', 't': 'Official Megathread'}]","[{'e': 'text', 't': 'Official Megathread'}]","[{'e': 'text', 't': 'iOS'}]","[{'e': 'text', 't': 'AirPods'}]","[{'e': 'text', 't': 'Promo Saturday'}]"


### Filtered dataframe
For this project, our goal is to use NLP (Natural Language Processing) to train a classifier that predicts which subreddit a given post came from. We therefore identified the following columns as crucial for our model:
- Significant text data in 'title'
- Target variable in 'subreddit'

We will select the columns above and create a new dataframe from them.

In [5]:
cols = ['subreddit','title']

In [6]:
df = pd.concat([df_android[cols], df_apple[cols]], axis=0, join='outer', ignore_index=True)

### Data Cleaning

In [7]:
#preview dataframe
df.head()

Unnamed: 0,subreddit,title
0,Android,Sunday Rant/Rage (Sep 27 2020) - Your weekly c...
1,Android,Google Maps is getting dedicated car mode UI
2,Android,Tasker lets you intercept Samsung S Pen gestur...
3,Android,22% off nearly everything in European Google S...
4,Android,The new Galaxy S20 FE: $100 off at Amazon and ...


In [8]:
#check datatypes, null values
#we have noted no null values in our data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1968 entries, 0 to 1967
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  1968 non-null   object
 1   title      1968 non-null   object
dtypes: object(2)
memory usage: 30.9+ KB
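
`df.info()` only counts `NaN` values, so an empty or whitespace-only title would still be reported as non-null. A minimal sketch of an additional check, using a small made-up dataframe (`toy`) in place of the real `df`:

```python
import pandas as pd

# Toy stand-in for the combined dataframe (values are made up for illustration).
toy = pd.DataFrame({
    "subreddit": ["Android", "apple", "apple"],
    "title": ["Google Maps is getting dedicated car mode UI", "   ", ""],
})

# Missing values as pandas sees them (NaN/None only).
null_counts = toy.isnull().sum()

# Titles that are present but blank once surrounding whitespace is stripped.
blank_titles = int(toy["title"].str.strip().eq("").sum())

print(null_counts)
print(f"Blank titles: {blank_titles}")
```

Here `isnull()` reports no missing titles even though two of the three titles carry no text.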


In [9]:
#save df checkpoint
df.to_csv('../datasets/combined_df.csv')

In [10]:
df['subreddit'].value_counts()

apple      986
Android    982
Name: subreddit, dtype: int64

In [11]:
#check shape of dataframe
df.shape

(1968, 2)

In [15]:
#check unique titles in each dataframe
print(f"Android dataframe has {df_android['title'].nunique()} unique titles.")
print(f"Apple dataframe has {df_apple['title'].nunique()} unique titles.")


Android dataframe has 729 unique titles.
Apple dataframe has 809 unique titles.


In [16]:
#crosscheck with unique titles in combined dataframe
df['title'].nunique()

1538

In [17]:
#drop duplicated rows
df.drop_duplicates(keep='first',inplace=True)

In [18]:
# check that duplicates are dropped
df.shape

(1538, 2)

In [19]:
#check number of unique titles again
df['title'].nunique()

1538
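
Because `drop_duplicates` is called without `subset`, a row is dropped only when both `subreddit` and `title` match. The row count equals the number of unique titles here, so no title is shared between the two subreddits, but the distinction matters in general. A sketch of the difference on made-up data:

```python
import pandas as pd

# Made-up rows: "same post" appears twice in Android and once in apple.
toy = pd.DataFrame({
    "subreddit": ["Android", "Android", "apple", "apple"],
    "title": ["same post", "same post", "same post", "other post"],
})

# Full-row comparison: only the repeated (Android, "same post") row is dropped.
by_row = toy.drop_duplicates(keep="first")

# Title-only comparison: the apple copy of "same post" is dropped as well.
by_title = toy.drop_duplicates(subset=["title"], keep="first")

print(len(by_row), len(by_title))
```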

### Convert target variable (Android/apple) to binary labels
- 0 for Android
- 1 for apple

In [20]:
df['subreddit'] = df['subreddit'].map({'Android': 0, 'apple': 1})
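
`Series.map` with a dictionary returns `NaN` for any value that is not a key, so a misspelled or unexpected subreddit name would silently become a missing label. A small sketch of a guard, on made-up data:

```python
import pandas as pd

# Made-up labels; the last value is deliberately mis-cased.
labels = pd.Series(["Android", "apple", "android"])

mapped = labels.map({"Android": 0, "apple": 1})

# Any NaN in `mapped` marks a value that the dictionary did not cover.
unmapped = labels[mapped.isna()].unique().tolist()
print(unmapped)
```

An empty list would mean every label was mapped.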

In [21]:
#check labels
df.head()

Unnamed: 0,subreddit,title
0,0,Sunday Rant/Rage (Sep 27 2020) - Your weekly c...
1,0,Google Maps is getting dedicated car mode UI
2,0,Tasker lets you intercept Samsung S Pen gestur...
3,0,22% off nearly everything in European Google S...
4,0,The new Galaxy S20 FE: $100 off at Amazon and ...


Let's proceed to clean the text data using the function below. <br><br>
Since our classifier works on words, we will remove all non-letter characters and convert the text to lowercase so that the same words are grouped together regardless of case. <br>

Stop words are also removed in the function below.

In [22]:
import regex as re

def raw_to_words(raw_text):
    # Function to convert raw text to a string of words
    # The input is a single string (a raw title text), and 
    # the output is a single string (a preprocessed title text)
    
    # 1. Remove non-letters.
    letters_only = re.sub("[^a-zA-Z]", " ", raw_text)
    
    # 2. Convert to lower case, split into individual words.
    words = letters_only.lower().split()
    
    # 3. In Python, searching a set is much faster than searching
    # a list, so convert the stop words to a set.
    stops = set(stopwords.words('english'))
    
    # 4. Remove stop words.
    meaningful_words = [w for w in words if w not in stops]
    
    # 5. Join the words back into one string separated by spaces,
    # and return the result.
    return " ".join(meaningful_words)
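
To illustrate what each step does to a title, here is a self-contained sketch of the same steps. It assumes a tiny hand-picked stop word set (`DEMO_STOPS`, hypothetical) instead of NLTK's English list so that it runs without downloading the corpus, and it uses the standard-library `re` module. The function name `clean_title` is likewise illustrative. Results with the full NLTK list may differ.

```python
import re

# Hypothetical, deliberately small stop word set for illustration only.
DEMO_STOPS = {"is", "the", "at", "and", "off", "in", "you", "your"}

def clean_title(raw_text, stops=DEMO_STOPS):
    """Keep letters only, lowercase, split, drop stop words, re-join."""
    letters_only = re.sub("[^a-zA-Z]", " ", raw_text)
    words = letters_only.lower().split()
    return " ".join(w for w in words if w not in stops)

print(clean_title("Google Maps is getting dedicated car mode UI"))
print(repr(clean_title("22% off!")))
```

Digits and punctuation are replaced by spaces, so a title made up only of numbers, symbols, and stop words is reduced to an empty string.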

In [23]:
#apply the function elementwise to the series,
#so that every title in the dataframe is cleaned
df['title'] = df['title'].map(raw_to_words)
df['title']

0            sunday rant rage sep weekly complaint thread
1               google maps getting dedicated car mode ui
2       tasker lets intercept samsung pen gestures wha...
3                nearly everything european google stores
4                           new galaxy fe amazon best buy
5       u antitrust investigation google coming head n...
6             new budget lenovo p tablet leaks snapdragon
7       pro tip enable nearby share android devices qu...
8       android got rid gb limit videos google camera ...
9                                 galaxy fe gorilla glass
10                     suface duo vs lg v different think
11                     pixel pre orders arriving early uk
12                   google pixel smartphone audio review
13      erica griffin surface duo fold duo vs galaxy z...
14        onenote android unfortunately broken experience
15      newpipe tests new unified player ui seamless f...
16      use samloader download updates samsung galaxy ...
17      pixel 

In [24]:
#check for unbalanced classes

df['subreddit'].value_counts(normalize=True)

1    0.526008
0    0.473992
Name: subreddit, dtype: float64
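
The class proportions also give the baseline accuracy: a classifier that always predicts the majority class would be right about 52.6% of the time, so any model needs to beat that figure. A sketch, rebuilding the label column from the per-subreddit counts reported above (729 Android, 809 apple):

```python
import pandas as pd

# Label column rebuilt from the counts reported in this notebook.
y = pd.Series([0] * 729 + [1] * 809, name="subreddit")

# Share of each class; the largest share is the majority-class baseline.
proportions = y.value_counts(normalize=True)
baseline_accuracy = proportions.max()

print(f"Baseline accuracy: {baseline_accuracy:.4f}")
```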

### Export cleaned dataframe

In [26]:
#Export to csv.
df.to_csv('../datasets/cleaned_df.csv', index=False)
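
One thing to watch when reloading this file: if cleaning reduced any title to an empty string, `pd.read_csv` will by default read that field back as `NaN`. A sketch of the effect using an in-memory buffer and made-up rows, with `keep_default_na=False` shown as one possible way to preserve the empty strings:

```python
import io
import pandas as pd

# Made-up cleaned rows; the second title is empty after cleaning.
toy = pd.DataFrame({"subreddit": [0, 1], "title": ["google maps car mode", ""]})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default parsing turns the empty title into NaN.
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# keep_default_na=False keeps it as an empty string instead.
buffer.seek(0)
reloaded_keep = pd.read_csv(buffer, keep_default_na=False)

print(reloaded["title"].isna().sum(), reloaded_keep["title"].isna().sum())
```

Dropping such rows before export is another option; which one fits depends on how the next notebook vectorizes the text.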

**Next:** [Preprocessing and Modeling](./03_preprocessing_and_modeling.ipynb)