



### Contents:
- [Importing the Relevant Libraries](#Importing-the-Relevant-Libraries)
- [Loading the Data](#Loading-the-Data)
- [EDA](#EDA)


## Importing the Relevant Libraries

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from wordcloud import WordCloud, STOPWORDS

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split, GridSearchCV

import warnings
warnings.simplefilter(action='ignore')

%matplotlib inline

## Loading the Data

In [3]:
reddit_posts_df = pd.read_csv('./datasets/combined.csv').drop(columns='subreddit')

In [3]:
reddit_posts_df.shape

(1251, 3)

In [4]:
reddit_posts_df

Unnamed: 0,text,title,target
0,"Since 2015, using NASA hardware, scientists an...",AskScience AMA Series: We are experts on NASA'...,1
1,,Do ants that get lost(accidentally get on my b...,1
2,,What part of your brain gets activated when yo...,1
3,Wouldn't a pointed bow cut through the water b...,Why aren't the bows of submarines pointy??,1
4,I am always confused be centrifugal and centri...,Do you weigh less at the equator because of ce...,1
...,...,...,...
1246,\n\nMy grandfather was always playing pranks ...,My grandfather was always playing pranks on pe...,0
1247,"Nazi Officer: ""Sir, we are mining too many use...",I did Nazi that coming,0
1248,Because of Gallium and Yttrium.,[OC] Why was the gay scientist gay?,0
1249,"Children of Vegans, children of Anti-Vaxxers. ...",The three big that never get old:,0


## EDA

In [5]:
# Remove duplicate posts, keeping the first occurrence
# (null values are checked in a later cell)
reddit_posts_df.drop_duplicates(keep='first', inplace=True)

In [6]:
reddit_posts_df.shape  # shape is unchanged, so no duplicates were found

(1251, 3)
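
Comparing shapes before and after works, but counting the duplicates directly is more explicit. A minimal sketch on a small made-up frame (`demo_df` and its rows are illustrative assumptions, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy frame standing in for reddit_posts_df
demo_df = pd.DataFrame({
    'text': ['a', 'a', 'b'],
    'title': ['t1', 't1', 't2'],
    'target': [1, 1, 0],
})

# duplicated() flags every repeat of an earlier row; sum() counts the flags
n_duplicates = int(demo_df.duplicated(keep='first').sum())

# Same call as in the notebook, but returning a new frame instead of inplace
deduped_df = demo_df.drop_duplicates(keep='first')
print(n_duplicates, deduped_df.shape)
```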

In [7]:
reddit_posts_df.isnull().sum()
# 202 posts have no text. Dropping them loses the information in their titles,
# but a further check showed that all 202 blank texts come from askscience.
# Plan: drop them, then top the class back up with the same amount of random samples from askscience.

text      202
title       0
target      0
dtype: int64
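
The check mentioned in the comments, that the blank texts all come from askscience, is not shown in the notebook. One way it could be done is to count missing texts per class. The sketch below uses a made-up frame (`demo_df` and its values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: target 1 = askscience, target 0 = jokes
demo_df = pd.DataFrame({
    'text': ['some body', np.nan, np.nan, 'a joke'],
    'title': ['q1', 'q2', 'q3', 'j1'],
    'target': [1, 1, 1, 0],
})

# isna() gives a boolean Series; grouping it by target and summing
# counts the missing texts in each class
missing_by_target = demo_df['text'].isna().groupby(demo_df['target']).sum()
print(missing_by_target)
```

If every missing text belongs to one class, the other class shows a count of zero.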

In [8]:
reddit_posts_df=reddit_posts_df.dropna()

In [9]:
reddit_posts_df.isnull().sum() 

text      0
title     0
target    0
dtype: int64

In [10]:
reddit_posts_df.target.value_counts()  # check whether both subreddits have a balanced number of posts
# this also confirms that the 202 dropped rows came from the askscience subreddit (target 1)


0    625
1    424
Name: target, dtype: int64

In [11]:
testing=reddit_posts_df[reddit_posts_df['target']==1]
testing

Unnamed: 0,text,title,target
0,"Since 2015, using NASA hardware, scientists an...",AskScience AMA Series: We are experts on NASA'...,1
3,Wouldn't a pointed bow cut through the water b...,Why aren't the bows of submarines pointy??,1
4,I am always confused be centrifugal and centri...,Do you weigh less at the equator because of ce...,1
5,"If the universe is infused with dark matter, w...",Could there possibly be black holes that forme...,1
6,"When I, for example, hold one arm straight to ...",Does the brain send signals consistently to ke...,1
...,...,...,...
620,I shouldn’t have to wait a whole hour when my ...,Why hasn’t a cooling equivalent of the microwa...,1
621,Isn't temperature related to the average kinet...,How is the temperature of the interstellar gas...,1
622,I'm trying to visualize their trajectories in ...,"Where is Voyager 2 in relation to Voyager 1, a...",1
623,After managing to stabilize and sustain the pl...,How do you convert a more million C degree pla...,1


In [12]:
testing.shape  # 424 askscience posts remain; drawing 201 more (with replacement) matches the 625 jokes posts

(424, 3)

In [13]:
# pandas' sample() does not use Python's random.seed(), so pass random_state for reproducibility
bs_sample1 = testing.sample(n=201, replace=True, random_state=42)

In [14]:
bs_sample1

Unnamed: 0,text,title,target
312,"Not sure if this got ask before, but with all ...",Why is no electric car producer considering a ...,1
428,Autoimmune diseases are caused by your immune ...,If Rheumatoid arthritis (or really any autoimm...,1
340,I’m currently traveling the Philippians with s...,What did I see in the sky last night?,1
588,I'm just not sure why it's down to me to unplu...,If experts can tell me I shouldn't leave my ph...,1
17,I'm making a social media loosely themed aroun...,Is there a classification system for biologica...,1
...,...,...,...
536,I tried to google an answer of a fairly popula...,How do services that delete your personal data...,1
376,I heard somewhere that at low energies superst...,Has there been any progress of pinning down th...,1
255,Mostly asking about mammals other than primate...,What can animals do when hair/fur gets in thei...,1
89,How does CRISPR technology target one specific...,How does CRISPR target specific genes?,1


In [15]:
df3=pd.concat([reddit_posts_df,bs_sample1],ignore_index=True)

In [16]:
df3

Unnamed: 0,text,title,target
0,"Since 2015, using NASA hardware, scientists an...",AskScience AMA Series: We are experts on NASA'...,1
1,Wouldn't a pointed bow cut through the water b...,Why aren't the bows of submarines pointy??,1
2,I am always confused be centrifugal and centri...,Do you weigh less at the equator because of ce...,1
3,"If the universe is infused with dark matter, w...",Could there possibly be black holes that forme...,1
4,"When I, for example, hold one arm straight to ...",Does the brain send signals consistently to ke...,1
...,...,...,...
1245,I tried to google an answer of a fairly popula...,How do services that delete your personal data...,1
1246,I heard somewhere that at low energies superst...,Has there been any progress of pinning down th...,1
1247,Mostly asking about mammals other than primate...,What can animals do when hair/fur gets in thei...,1
1248,How does CRISPR technology target one specific...,How does CRISPR target specific genes?,1


In [17]:
df3.target.value_counts()  # that's our class imbalance accounted for :)

1    625
0    625
Name: target, dtype: int64
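
The resampling steps above can be wrapped in one helper. This is only a sketch: `oversample_minority` is a hypothetical name, it assumes a binary `target` column, and the toy frame is made up. It relies on `random_state`, which is what makes pandas' `sample` reproducible.

```python
import pandas as pd

def oversample_minority(df, target_col='target', random_state=42):
    """Bootstrap rows of the smaller class until both classes have equal counts."""
    counts = df[target_col].value_counts()
    minority_label = counts.idxmin()
    n_missing = int(counts.max() - counts.min())
    if n_missing == 0:
        # Already balanced: nothing to add
        return df.reset_index(drop=True)
    # Draw the shortfall from the minority class, with replacement
    extra = df[df[target_col] == minority_label].sample(
        n=n_missing, replace=True, random_state=random_state
    )
    return pd.concat([df, extra], ignore_index=True)

# Hypothetical toy frame: five rows of class 0, two rows of class 1
demo_df = pd.DataFrame({'title': list('abcdefg'),
                        'target': [0, 0, 0, 0, 0, 1, 1]})
balanced_df = oversample_minority(demo_df)
print(balanced_df['target'].value_counts())
```

One design point to keep in mind: if the balanced data is later split into train and test sets, copies of the same post can land on both sides, so resampling only the training portion after the split is a common alternative.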

In [20]:
#lowercase everything first
df3['text']=df3['text'].str.lower() 
df3['title']=df3['title'].str.lower()

In [23]:
df3.head()

Unnamed: 0,text,title,target
0,"since 2015, using nasa hardware, scientists an...",askscience ama series: we are experts on nasa'...,1
1,wouldn't a pointed bow cut through the water b...,why aren't the bows of submarines pointy??,1
2,i am always confused be centrifugal and centri...,do you weigh less at the equator because of ce...,1
3,"if the universe is infused with dark matter, w...",could there possibly be black holes that forme...,1
4,"when i, for example, hold one arm straight to ...",does the brain send signals consistently to ke...,1


In [27]:
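df3['text'] = df3['text'].str.replace(r'[^\w\s]+', '', regex=True)
df3['title'] = df3['title'].str.replace(r'[^\w\s]+', '', regex=True)

# [^\w\s] matches any single character that is neither a word character (letter, digit, underscore) nor whitespace, i.e. punctuation
# The + quantifier matches one or more such characters in a row, as many as possible (greedy)
# regex=True is passed explicitly: from pandas 2.0 onwards, str.replace treats the pattern as a literal string by default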
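
To see what the pattern does, here it is applied to a single string with Python's built-in `re` module (used only for illustration; the notebook itself goes through pandas' `.str.replace`). The sample string is one of the lowercased titles shown earlier, and `PUNCTUATION` is a name introduced just for this sketch.

```python
import re

# Same pattern as in the cell above: runs of characters that are
# neither word characters nor whitespace
PUNCTUATION = re.compile(r'[^\w\s]+')

title = "why aren't the bows of submarines pointy??"
cleaned = PUNCTUATION.sub('', title)
print(cleaned)  # why arent the bows of submarines pointy
```

Note that the underscore counts as a word character, so this pattern leaves underscores in place.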

In [28]:
df3.head()

Unnamed: 0,text,title,target
0,since 2015 using nasa hardware scientists and ...,askscience ama series we are experts on nasas ...,1
1,wouldnt a pointed bow cut through the water be...,why arent the bows of submarines pointy,1
2,i am always confused be centrifugal and centri...,do you weigh less at the equator because of ce...,1
3,if the universe is infused with dark matter wo...,could there possibly be black holes that forme...,1
4,when i for example hold one arm straight to th...,does the brain send signals consistently to ke...,1


In [30]:
df3.isnull().sum()

text      0
title     0
target    0
dtype: int64

In [32]:
df3.to_csv('datatobemodelled.csv', index = False)

In [33]:
df3.shape

(1250, 3)