# Reddit API and Classification

**Data Cleaning and EDA**
- Are missing values imputed/handled appropriately?
- Are distributions examined and described?
- Are outliers identified and addressed?
- Are appropriate summary statistics provided?
- Are steps taken during data cleaning and EDA framed appropriately?
- Does the student address whether or not they are likely to be able to answer their problem statement with the provided data given what they've discovered during EDA?

**Previous:** [Data Collection](./01_data_collection.ipynb)

## Data Cleaning and Exploratory Data Analysis
In the previous section, we collected posts from two subreddits into dataframes:
- r/Android
- r/apple

In this section, we clean the collected data and then carry out some exploratory analysis on the cleaned data.

#### Library Imports

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, classification_report

from bs4 import BeautifulSoup
import regex as re
import requests
import time
import random

#### Data imports

In [2]:
df_android = pd.read_csv('../datasets/android_posts.csv')
df_apple = pd.read_csv('../datasets/apple_posts.csv')

### Exploring the dataframes

In [3]:
# preview the first 5 rows (transposed so each column is shown as a row)
pd.set_option('display.max_rows', None)
df_android.head().T

Unnamed: 0,0,1,2,3,4
approved_at_utc,,,,,
subreddit,Android,Android,Android,Android,Android
selftext,"Note 1. Join our IRC, and Telegram chat-rooms!...",,,,
author_fullname,t2_6l4z3,t2_q4p0j,t2_31mkizvx,t2_gernm,t2_cc9vk
saved,False,False,False,False,False
mod_reason_title,,,,,
gilded,0,0,0,0,0
clicked,False,False,False,False,False
title,Sunday Rant/Rage (Sep 27 2020) - Your weekly c...,Google Maps is getting dedicated car mode UI,Tasker lets you intercept Samsung S Pen gestur...,22% off nearly everything in European Google S...,The new Galaxy S20 FE: $100 off at Amazon and ...
link_flair_richtext,[],[],[],[],[]


In [4]:
# preview the first 5 rows (transposed so each column is shown as a row)
df_apple.head().T

Unnamed: 0,0,1,2,3,4
approved_at_utc,,,,,
subreddit,apple,apple,apple,apple,apple
selftext,\n\nWelcome to the daily Tech Support thread f...,"\n\nHello /r/Apple, and welcome to ""Shortcuts ...",,,
author_fullname,t2_6l4z3,t2_6l4z3,t2_78w5nxue,t2_th4cg,t2_15c325
saved,False,False,False,False,False
mod_reason_title,,,,,
gilded,0,0,0,0,4
clicked,False,False,False,False,False
title,Daily Tech Support Thread - [September 27],Shortcuts Sunday - [September 27],iOS 14: 'Phoenix 2' Space Shooter Delivers Pla...,iOS 14: How to stop your AirPods automatically...,There was no NASA Astronomy Picture of the Day...
link_flair_richtext,"[{'e': 'text', 't': 'Official Megathread'}]","[{'e': 'text', 't': 'Official Megathread'}]","[{'e': 'text', 't': 'iOS'}]","[{'e': 'text', 't': 'AirPods'}]","[{'e': 'text', 't': 'Promo Saturday'}]"


For this project, our goal is to use NLP (Natural Language Processing) to train a classifier that predicts which subreddit a given post came from. We therefore consider the following columns crucial for our model:
- Significant text data in 'selftext' and 'title'
- Target variable in 'subreddit'

We select these columns from each dataframe and combine them into a new dataframe.

In [5]:
cols = ['subreddit','title','selftext']

In [14]:
df_android[cols]

Unnamed: 0,subreddit,title,selftext
0,Android,Sunday Rant/Rage (Sep 27 2020) - Your weekly c...,"Note 1. Join our IRC, and Telegram chat-rooms!..."
1,Android,Google Maps is getting dedicated car mode UI,
2,Android,Tasker lets you intercept Samsung S Pen gestur...,
3,Android,22% off nearly everything in European Google S...,
4,Android,The new Galaxy S20 FE: $100 off at Amazon and ...,
5,Android,U.S. antitrust investigation of Google is comi...,
6,Android,New budget Lenovo P11 tablet leaks with Snapdr...,
7,Android,[Pro Tip] Enable Nearby Share on all your andr...,If your device is running Android 6.0+ then yo...
8,Android,"Android 11 got rid of the 4GB limit on videos,...",
9,Android,Galaxy S20FE Gorilla Glass 3,I feel like almost no one is talking about the...


In [15]:
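# Stack the two subreddits row-wise. The original row labels are kept here,
# so each index label appears once per subreddit in the combined dataframe.
df = pd.concat([df_android[cols], df_apple[cols]], axis=0, join='outer')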

In [16]:
#save df checkpoint
df.to_csv('../datasets/combined_df.csv')

In [17]:
df.iloc[729:731]

Unnamed: 0,subreddit,title,selftext
729,Android,Wireless Charging Is a Disaster Waiting to Happen,
730,Android,Samsung Galaxy M31s Review: Rehashing a succes...,


In [18]:
df.shape

(1968, 3)

In [19]:
df['title'].nunique()

1538

In [20]:
df['selftext'].nunique()

342
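
The unique counts above are well below the row count, which suggests repeated titles and many posts with no `selftext`. A minimal sketch of how this could be quantified is shown here, using a small hypothetical dataframe (`toy`) in place of the real combined data; the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the combined dataframe (not the real scraped data).
toy = pd.DataFrame({
    "subreddit": ["Android", "Android", "apple", "apple"],
    "title": ["Post A", "Post B", "Post C", "Post C"],
    "selftext": ["Body text", np.nan, np.nan, "Another body"],
})

# Number of missing values in each column.
missing_counts = toy.isna().sum()

# Share of posts that have no selftext at all.
missing_share = toy["selftext"].isna().mean()

# Number of titles that repeat an earlier title.
duplicate_titles = toy["title"].duplicated().sum()

# Class balance of the target variable.
class_counts = toy["subreddit"].value_counts()

print(missing_counts, missing_share, duplicate_titles, class_counts, sep="\n")
```

Run against the real `df`, the same calls would show how much of the text lives in `title` alone and whether the two classes are roughly balanced.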

In [64]:
df.head()

Unnamed: 0,subreddit,title,selftext
0,Android,Sunday Rant/Rage (Sep 27 2020) - Your weekly c...,"Note 1. Join our IRC, and Telegram chat-rooms!..."
1,Android,Google Maps is getting dedicated car mode UI,
2,Android,Tasker lets you intercept Samsung S Pen gestur...,
3,Android,22% off nearly everything in European Google S...,
4,Android,The new Galaxy S20 FE: $100 off at Amazon and ...,


In [67]:
df['title'][0].values

array(['Sunday Rant/Rage (Sep 27 2020) - Your weekly complaint thread!',
       'Daily Tech Support Thread - [September 27]'], dtype=object)
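
Label `0` returns two titles, one from each subreddit, because the combined dataframe kept the row labels of both source dataframes. The sketch below reproduces this with two small hypothetical dataframes (`android`, `apple`) and shows one possible remedy, `ignore_index=True`, which is an assumption about what the analysis wants rather than what the notebook does.

```python
import pandas as pd

# Hypothetical miniature versions of the two per-subreddit dataframes.
android = pd.DataFrame({"subreddit": ["Android", "Android"], "title": ["A0", "A1"]})
apple = pd.DataFrame({"subreddit": ["apple", "apple"], "title": ["B0", "B1"]})

# Default behaviour: labels 0 and 1 are kept, so each label occurs twice.
stacked = pd.concat([android, apple], axis=0)
print(stacked["title"].loc[0].values)  # two titles share label 0

# ignore_index=True assigns a fresh 0..n-1 index, so every label is unique.
combined = pd.concat([android, apple], axis=0, ignore_index=True)
print(combined["title"].loc[0])        # a single string
```

An equivalent option is to call `reset_index(drop=True)` on the already combined dataframe.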

In [68]:
df['title'][1].values

array(['Google Maps is getting dedicated car mode UI',
       'Shortcuts Sunday - [September 27]'], dtype=object)

In [66]:
df['selftext'][1]

1                                                  NaN
1    \n\nHello /r/Apple, and welcome to "Shortcuts ...
Name: selftext, dtype: object

In [60]:
(df['title'].values+df['selftext'])[1].values

array([nan,
       'Shortcuts Sunday - [September 27]\n\nHello /r/Apple, and welcome to "Shortcuts Sunday". \n\nThe "Shortcuts Sunday" thread is your place to share your [Shortcuts](https://support.apple.com/guide/shortcuts/welcome/ios) with the /r/Apple community. \n\nTo share your Shortcut:\n\n\n1. Open the Shortcuts app\n2. [Tap the "..." button next to the shortcut you will be sharing](https://i.imgur.com/vAcQddT.jpg)\n3. [Tap the "Share" icon](https://i.imgur.com/B8wcD2B.jpg)\n4. Finally, [tap the "Copy iCloud link" icon](https://i.imgur.com/x8KeXJZ.jpg) and paste it to your comment. \n\nWhen sharing your shortcuts, please add a brief description of what your shortcut does and how it is useful. Bonus points if you can include a screen record of the shortcut in action. \n\nDon\'t forget to visit the /r/shortcuts subreddit for more information, guides, and shortcuts!'],
      dtype=object)

In [54]:
" ".join(df['title'][1].values.tolist())

'Google Maps is getting dedicated car mode UI Shortcuts Sunday - [September 27]'

In [55]:

values = df['selftext'][1].values.tolist()
try:
    " ".join(values)
except TypeError:
    # str.join raises TypeError because NaN is a float, not a str.
    # `x == np.nan` is always False, so NaN has to be detected with pd.isna() instead.
    print(values)

[nan, '\n\nHello /r/Apple, and welcome to "Shortcuts Sunday". \n\nThe "Shortcuts Sunday" thread is your place to share your [Shortcuts](https://support.apple.com/guide/shortcuts/welcome/ios) with the /r/Apple community. \n\nTo share your Shortcut:\n\n\n1. Open the Shortcuts app\n2. [Tap the "..." button next to the shortcut you will be sharing](https://i.imgur.com/vAcQddT.jpg)\n3. [Tap the "Share" icon](https://i.imgur.com/B8wcD2B.jpg)\n4. Finally, [tap the "Copy iCloud link" icon](https://i.imgur.com/x8KeXJZ.jpg) and paste it to your comment. \n\nWhen sharing your shortcuts, please add a brief description of what your shortcut does and how it is useful. Bonus points if you can include a screen record of the shortcut in action. \n\nDon\'t forget to visit the /r/shortcuts subreddit for more information, guides, and shortcuts!']


In [None]:
df['title']

In [57]:
len(df['title'][0])

2

In [52]:
" ".join(df['title'][0].values.tolist())

'Sunday Rant/Rage (Sep 27 2020) - Your weekly complaint thread! Daily Tech Support Thread - [September 27]'

In [53]:
" ".join(df['selftext'][0].values.tolist())

'Note 1. Join our IRC, and Telegram chat-rooms! [Please see our wiki for instructions.](https://www.reddit.com/r/Android/wiki/index#wiki_.2Fr.2Fandroid_chat_rooms)\n\nThis weekly Sunday thread is for you to let off some steam and speak out about whatever complaint you might have about:  \n\n* Your device.  \n\n* Your carrier.  \n\n* Your device\'s manufacturer.  \n\n* An app  \n\n* Any other company\n\n***  \n\n**Rules**  \n\n1) Please do not target any individuals or try to name/shame any individual. If you hate Google/Samsung/HTC etc. for one thing that is fine, but do not be rude to an individual app developer.\n\n2) If you have a suggestion to solve another user\'s issue, please leave a comment but be sure it\'s constructive! We do not want any flame-wars.  \n\n3) Be respectful of other\'s opinions. Even if you feel that somebody is "wrong" you don\'t have to go out of your way to prove them wrong. Disagree politely, and move on. \n\nWelcome to the daily Tech Support thread for /r/

In [65]:
(df['selftext'][1].values.tolist())

[nan,
 '\n\nHello /r/Apple, and welcome to "Shortcuts Sunday". \n\nThe "Shortcuts Sunday" thread is your place to share your [Shortcuts](https://support.apple.com/guide/shortcuts/welcome/ios) with the /r/Apple community. \n\nTo share your Shortcut:\n\n\n1. Open the Shortcuts app\n2. [Tap the "..." button next to the shortcut you will be sharing](https://i.imgur.com/vAcQddT.jpg)\n3. [Tap the "Share" icon](https://i.imgur.com/B8wcD2B.jpg)\n4. Finally, [tap the "Copy iCloud link" icon](https://i.imgur.com/x8KeXJZ.jpg) and paste it to your comment. \n\nWhen sharing your shortcuts, please add a brief description of what your shortcut does and how it is useful. Bonus points if you can include a screen record of the shortcut in action. \n\nDon\'t forget to visit the /r/shortcuts subreddit for more information, guides, and shortcuts!']

In [75]:
" ".join(df['title'][0].values.tolist()) + " ".join(df['selftext'][0].values.tolist())

'Saturday APPreciation (Sep 26 2020) - Your weekly app recommendation/request thread! Daily Tech Support Thread - [September 26]Note 1. [Check out our apps wiki](https://www.reddit.com/r/Android/wiki/index#wiki_apps) for previous threads and apps curated by the reddit Android community!  \n\n[Download the official /r/Android App Store based on our wiki!](https://github.com/d4rken/reddit-android-appstore/releases)\n\nNote 2. Join us at **/r/MoronicMondayAndroid**, a sub serving as a repository for our retired weekly threads. Just pick any thread and Ctrl-F your way to wisdom! \n\nNote 2. Join our IRC, and Telegram chat-rooms! [Please see our wiki for instructions.](https://www.reddit.com/r/Android/wiki/index#wiki_.2Fr.2Fandroid_chat_rooms)\n\n\n***  \n\nThis weekly Saturday thread is for:  \n* App promotion,  \n* App praise/sharing  \n\n***  \n\n**Rules:**  \n\n1) If you are a developer, you may promote your own app ONLY under the bolded, distinguished moderator comment. Users: if you t

In [76]:
" ".join(df['title'][1].values.tolist()) + " ".join(df['selftext'][1].values.tolist())

TypeError: sequence item 0: expected str instance, float found

In [62]:
df['title'][0].values + df['selftext'][0].values

array(['Saturday APPreciation (Sep 26 2020) - Your weekly app recommendation/request thread!Note 1. [Check out our apps wiki](https://www.reddit.com/r/Android/wiki/index#wiki_apps) for previous threads and apps curated by the reddit Android community!  \n\n[Download the official /r/Android App Store based on our wiki!](https://github.com/d4rken/reddit-android-appstore/releases)\n\nNote 2. Join us at **/r/MoronicMondayAndroid**, a sub serving as a repository for our retired weekly threads. Just pick any thread and Ctrl-F your way to wisdom! \n\nNote 2. Join our IRC, and Telegram chat-rooms! [Please see our wiki for instructions.](https://www.reddit.com/r/Android/wiki/index#wiki_.2Fr.2Fandroid_chat_rooms)\n\n\n***  \n\nThis weekly Saturday thread is for:  \n* App promotion,  \n* App praise/sharing  \n\n***  \n\n**Rules:**  \n\n1) If you are a developer, you may promote your own app ONLY under the bolded, distinguished moderator comment. Users: if you think someone is trying to bypass thi

In [61]:
df['title'][1].values + df['selftext'][1].values

TypeError: can only concatenate str (not "float") to str
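
Both errors above come from posts whose `selftext` is NaN, which is a float and cannot be joined to a string. A minimal sketch of one way to combine the two text columns is to fill missing values with an empty string first. The dataframe `toy` and the column name `text` are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row example: one post without selftext, one with.
toy = pd.DataFrame({
    "title": ["Link post title", "Text post title"],
    "selftext": [np.nan, "Body of the text post"],
})

# Plain `+` on the two Series propagates NaN: a missing selftext
# makes the whole combined value missing, so the title is lost too.
naive = toy["title"] + toy["selftext"]

# Filling NaN with an empty string first keeps the title for such posts.
# The space separator stops the title running into the body text.
toy["text"] = (toy["title"].fillna("") + " " + toy["selftext"].fillna("")).str.strip()

print(naive.tolist())
print(toy["text"].tolist())
```

Whether an empty string is the right fill value depends on the model; for a bag-of-words vectorizer it simply contributes no tokens.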

In [51]:
df['selftext'][0].values

array(['Note 1. [Check out our apps wiki](https://www.reddit.com/r/Android/wiki/index#wiki_apps) for previous threads and apps curated by the reddit Android community!  \n\n[Download the official /r/Android App Store based on our wiki!](https://github.com/d4rken/reddit-android-appstore/releases)\n\nNote 2. Join us at **/r/MoronicMondayAndroid**, a sub serving as a repository for our retired weekly threads. Just pick any thread and Ctrl-F your way to wisdom! \n\nNote 2. Join our IRC, and Telegram chat-rooms! [Please see our wiki for instructions.](https://www.reddit.com/r/Android/wiki/index#wiki_.2Fr.2Fandroid_chat_rooms)\n\n\n***  \n\nThis weekly Saturday thread is for:  \n* App promotion,  \n* App praise/sharing  \n\n***  \n\n**Rules:**  \n\n1) If you are a developer, you may promote your own app ONLY under the bolded, distinguished moderator comment. Users: if you think someone is trying to bypass this rule by promoting their app in the general thread, click the report button so we c

In [52]:
df['title + selftext'] = pd.concat([df['title'] , df['selftext']], ignore_index=True)

In [53]:
df.shape

(1482, 4)

In [54]:
df['title + selftext'][0].values

array(['Saturday APPreciation (Sep 26 2020) - Your weekly app recommendation/request thread!',
       'Saturday APPreciation (Sep 26 2020) - Your weekly app recommendation/request thread!'],
      dtype=object)
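
The new column contains only titles here because `pd.concat` stacks the two Series end to end rather than joining them row by row, and column assignment then aligns the result on index labels. The sketch below illustrates the difference on a hypothetical two-row dataframe (`toy`); `Series.str.cat` is shown as one row-wise alternative, not as the method the notebook uses.

```python
import pandas as pd

# Hypothetical example with no missing values, to isolate the concat behaviour.
toy = pd.DataFrame({
    "title": ["T0", "T1"],
    "selftext": ["S0", "S1"],
})

# pd.concat on two Series stacks them: the result has 4 rows, not 2.
stacked = pd.concat([toy["title"], toy["selftext"]], ignore_index=True)

# Assignment aligns on index labels, so only labels 0 and 1 are used
# and the new column ends up as a copy of the titles.
toy["stacked"] = stacked

# Series.str.cat joins the two columns element by element instead.
toy["joined"] = toy["title"].str.cat(toy["selftext"], sep=" ")

print(toy)
```

Note that `str.cat` also returns NaN for a row when either value is missing unless `na_rep` is given, so missing `selftext` still needs handling.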

In [38]:
df['title + selftext'][0].values

array(['Saturday APPreciation (Sep 26 2020) - Your weekly app recommendation/request thread!',
       'Daily Tech Support Thread - [September 26]'], dtype=object)

In [19]:
df.head()

Unnamed: 0,subreddit,title,selftext,title + selftext
0,Android,Saturday APPreciation (Sep 26 2020) - Your wee...,Note 1. [Check out our apps wiki](https://www....,Saturday APPreciation (Sep 26 2020) - Your wee...
1,Android,"Android 11 got rid of the 4GB limit on videos,...",,
2,Android,Google Pixel 4a Smartphone Audio Review,,
3,Android,NewPipe tests new Unified Player UI with seaml...,,
4,Android,[Erica Griffin] Surface Duo: Above the Fold (D...,,


**Next:** [Preprocessing and Modeling](./03_preprocessing_and_modeling.ipynb)