## Data Cleaning: Reddit Project

In [1]:
#imports for data cleaning and eda
import pandas as pd
import numpy as np
import re

## Read in & Clean Biology Submissions csv

In [2]:
#read in csv for biology submissions
bio_sub = pd.read_csv('datasets/bio-submissions.csv')

In [3]:
#view the info of the biology csv
#view the amount of null values
bio_sub.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10072 entries, 0 to 10071
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  10072 non-null  object
 1   selftext   4790 non-null   object
 2   title      10072 non-null  object
dtypes: object(3)
memory usage: 236.2+ KB


In [4]:
#check for correct read in
bio_sub.head(2)

Unnamed: 0,subreddit,selftext,title
0,biology,[removed],Mendelian inheritance
1,biology,,Diary of a Biologist: (Sarcoramphus papa) Chap...


In [5]:
#the columns of the biology dataframe
bio_sub.columns

Index(['subreddit', 'selftext', 'title'], dtype='object')

In [6]:
#show the values of biology subreddit that are null
bio_sub.isna().sum()

subreddit       0
selftext     5282
title           0
dtype: int64

In [7]:
#original shape of the dataframe
bio_sub.shape

(10072, 3)

In [8]:
#replace the '[removed]' and '[deleted]' placeholder values with NaN
bio_sub.replace('[removed]',np.nan,inplace=True)
bio_sub.replace('[deleted]',np.nan,inplace=True)

In [9]:
#drop any rows that have no selftext (post body) under the title
bio_sub.dropna(axis=0,subset=['selftext'],inplace=True)

In [10]:
#number of observations after dropping null text rows
bio_sub.shape

(4789, 3)

In [11]:
#tiny glance at individual post selftexts
bio_sub['selftext']

8        Here's the images: https://i.redd.it/tp5d22frp...
14       Hello! I am looking for some advice/ideas to f...
15       radiating , lasts 1 second from side of lower ...
16       Hi guys, I just had my first class on fluoresc...
22       I need to match up solution tests with details...
                               ...                        
10060    Is there anybody that believes in the Endosymb...
10061    This made me think about it: https://www.pop.o...
10063    https://frankreport.com/2020/05/28/dna-damage-...
10066     Hello there, I'm looking for good books on th...
10069     \n\nBiology is the study of life. This is a s...
Name: selftext, Length: 4789, dtype: object

In [12]:
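#remove links from the text -- keep the other text from that row
#raw string avoids the invalid '\S' escape; '\.' matches a literal dot; regex=True is required in newer pandas
bio_sub['selftext'] = bio_sub['selftext'].str.replace(r'http\S+|www\.\S+', '', case=False, regex=True)
bio_sub['title'] = bio_sub['title'].str.replace(r'http\S+|www\.\S+', '', case=False, regex=True)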

In [13]:
#observing how many selftext values contain the special character \n
bio_sub[bio_sub['selftext'].str.contains('\n')]

Unnamed: 0,subreddit,selftext,title
8,biology,Here's the images: \n\nThey're the midslices b...,Why does it seem like my venous sinuses are ve...
14,biology,Hello! I am looking for some advice/ideas to f...,BS in Biology to a Masters in what??
27,biology,I'm still a high school student and am beginni...,I have a question about biology and why am I'm...
41,biology,Do genes that encode proteins used into e.g. t...,Why didn't we evolve to lose unused genes from...
50,biology,Question as stated in the title. I‘m really in...,What in your opinion is the most prestigious/ ...
...,...,...,...
10033,biology,&amp;#x200B;\n\n[RNA as a CUBE](,DNA in a CUBE?
10035,biology,"Hey guys, \n\nI graduated last summer with a B...",Can I go back to Graduate School if I drop out...
10058,biology,"RSV has killed 800,000 to 1 million people in ...",I'm crying right now. Hundreds of thousands of...
10066,biology,"Hello there, I'm looking for good books on th...",Books on hymenoptera


In [14]:
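#remove the \n special characters from the selftext column
#assign the result back instead of calling an inplace replace on a column selection
bio_sub['selftext'] = bio_sub['selftext'].str.replace('\n', '', regex=False)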

In [15]:
#shows that the replace method worked correctly
bio_sub[bio_sub['selftext'].str.contains('\n')]

Unnamed: 0,subreddit,selftext,title


In [16]:
#reset the index count for dataframe
bio_sub.reset_index(drop=True,inplace=True)

In [17]:
#observing the first 5 rows of the biology dataframe
bio_sub.head()

Unnamed: 0,subreddit,selftext,title
0,biology,Here's the images: They're the midslices but t...,Why does it seem like my venous sinuses are ve...
1,biology,Hello! I am looking for some advice/ideas to f...,BS in Biology to a Masters in what??
2,biology,"radiating , lasts 1 second from side of lower ...",shooting pain from abdomen to neck?
3,biology,"Hi guys, I just had my first class on fluoresc...",Any website for microscopy videos?
4,biology,I need to match up solution tests with details...,Help me please


In [18]:
#cleaned bio dataframe shape
bio_sub.shape

(4789, 3)

In [19]:
#the number of rows in the biology subreddit whose selftext contains 'biology' (case-sensitive)
bio_sub[bio_sub['selftext'].str.contains('biology')].shape

(868, 3)

In [20]:
#the number of rows in the biology subreddit whose selftext contains the substring 'bio' (case-sensitive)
bio_sub[bio_sub['selftext'].str.contains('bio')].shape

(1273, 3)

#### Containing the Names of the Subreddits

Above, 868 posts contain the subreddit name 'biology' in their selftext, and 1,273 contain the substring 'bio' (a count that also covers 'biology' and 'biochem'). Both checks are case-sensitive and were run before lowercasing. The names of the subreddits and variations such as 'bio' and 'biochem' might have high predictive power within any models created. As I only want to use terminology from these branches of science to predict the subreddits, the names of the subreddits and their variations will be dropped.
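The counts above measure rows, not total mentions, and they are case-sensitive. The sketch below illustrates the difference on a toy Series; the three example strings are invented stand-ins for `bio_sub['selftext']`, not rows from the dataset.

```python
import re
import pandas as pd

# Invented example posts standing in for bio_sub['selftext'].
posts = pd.Series([
    "Biology is the study of life. Biology is fun.",
    "I want to major in biology.",
    "How do enzymes fold?",
])

# Rows that mention the term at least once (this is what str.contains measures).
rows_case_sensitive = posts.str.contains("biology").sum()
rows_any_case = posts.str.contains("biology", case=False).sum()

# Total occurrences across all rows; a single row can contribute more than once.
total_mentions = posts.str.count("biology", flags=re.IGNORECASE).sum()

print(rows_case_sensitive, rows_any_case, total_mentions)
```

Running the case-insensitive variants on the real column would likely give higher numbers than the 868 and 1,273 reported above, since capitalised mentions are currently missed.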

## Read in & Clean Biochemistry Submissions csv

In [21]:
#read in biochemistry submissions csv
biochem_sub = pd.read_csv('datasets/biochem-submissions.csv')

In [22]:
#check for correct read in
biochem_sub.head(2)

Unnamed: 0,subreddit,selftext,title
0,Biochemistry,Is there a database for protein 3D structures?...,Protein 3D structure database?
1,Biochemistry,Warning: This is entirely not my field so any...,Hacking plants for natural product production


In [23]:
#check the info of the biochem csv
#shows there are no null values present and all columns are objects (as expected)
biochem_sub.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6229 entries, 0 to 6228
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  6229 non-null   object
 1   selftext   6229 non-null   object
 2   title      6229 non-null   object
dtypes: object(3)
memory usage: 146.1+ KB


In [24]:
#values of the biochemistry dataframe columns
biochem_sub.columns

Index(['subreddit', 'selftext', 'title'], dtype='object')

In [25]:
#initial data frame shape pre-cleaning
biochem_sub.shape

(6229, 3)

In [26]:
#in order to observe all of the rows for eda
pd.options.display.max_rows = 88
biochem_sub.isna().sum()

subreddit    0
selftext     0
title        0
dtype: int64

In [27]:
#drop null rows of selftext
biochem_sub.dropna(axis=0,subset=['selftext'],inplace=True)

In [28]:
#dataframe shape after dropping nulls from selftext
biochem_sub.shape

(6229, 3)

In [29]:
#seeing if rows contain different special characters
biochem_sub[biochem_sub['selftext'].str.contains('\n')].head(1)

Unnamed: 0,subreddit,selftext,title
1,Biochemistry,Warning: This is entirely not my field so any...,Hacking plants for natural product production


In [30]:
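#get rid of \n special characters from selftext
#assign the result back instead of calling an inplace replace on a column selection
biochem_sub['selftext'] = biochem_sub['selftext'].str.replace('\n', '', regex=False)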

In [31]:
#check that replace occurred correctly
biochem_sub[biochem_sub['selftext'].str.contains('\n')]

Unnamed: 0,subreddit,selftext,title


In [32]:
#remove links from the text -- keep the other text from that row
#raw string avoids the invalid '\S' escape; '\.' matches a literal dot; regex=True is required in newer pandas
biochem_sub['selftext'] = biochem_sub['selftext'].str.replace(r'http\S+|www\.\S+', '', case=False, regex=True)
biochem_sub['title'] = biochem_sub['title'].str.replace(r'http\S+|www\.\S+', '', case=False, regex=True)

In [33]:
#checking title for https webpage links
biochem_sub[biochem_sub['title'].str.contains('https')]

Unnamed: 0,subreddit,selftext,title


In [34]:
#reset the index (assign the result back, otherwise the reset is discarded)
biochem_sub = biochem_sub.reset_index(drop=True)
biochem_sub

Unnamed: 0,subreddit,selftext,title
0,Biochemistry,Is there a database for protein 3D structures?...,Protein 3D structure database?
1,Biochemistry,Warning: This is entirely not my field so any...,Hacking plants for natural product production
2,Biochemistry,I have just completed my masters in Biochemist...,Just completed my masters I need urgent help f...
3,Biochemistry,Pretty much the title. I’m aware capping helps...,Can someone explain to me what helix capping i...
4,Biochemistry,I have my end of term assessments coming up in...,Desperate for Past-Exam Questions
...,...,...,...
6224,Biochemistry,"This will probably be flagged, but i'm really ...",PNAS article-stuck behind paywall--HELP
6225,Biochemistry,I have a question about g-coupled receptors. ...,Cell Phys Receptor Question
6226,Biochemistry,I've been reading about the [centriole]( I und...,What causes centrioles and microtubules to cop...
6227,Biochemistry,So far I've mostly looked for stuff in my home...,Wanting to take a year off between undergrad a...


In [35]:
#final shape of the cleaned biochem dataset
biochem_sub.shape

(6229, 3)
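The biology and biochemistry sections apply the same cleaning steps, so they could be folded into one function. The following is a minimal sketch under that assumption: `clean_submissions` and `URL_PATTERN` are hypothetical names that do not appear in the notebook, and the demo rows are invented.

```python
import numpy as np
import pandas as pd

# Same link pattern as the cells above, written as a raw string.
URL_PATTERN = r'http\S+|www\.\S+'

def clean_submissions(df):
    """Hypothetical helper mirroring the cleaning cells above; returns a new DataFrame."""
    # Treat Reddit's placeholders as missing values.
    out = df.replace(['[removed]', '[deleted]'], np.nan)
    # Strip links from both text columns (missing values pass through untouched).
    for col in ['selftext', 'title']:
        out[col] = out[col].str.replace(URL_PATTERN, '', case=False, regex=True)
    # Remove newline characters from the post body.
    out['selftext'] = out['selftext'].str.replace('\n', '', regex=False)
    # Drop posts without a body and renumber the rows.
    return out.dropna(subset=['selftext']).reset_index(drop=True)

# Invented demo rows, only to show the behaviour.
raw = pd.DataFrame({
    'subreddit': ['biology'] * 4,
    'selftext': ['[removed]', None, 'See https://example.com/x\n\nfor details', 'plain text'],
    'title': ['a', 'b', 'Paper at www.example.org', 'd'],
})
cleaned = clean_submissions(raw)
print(cleaned)
```

Calling `clean_submissions(pd.read_csv(...))` once per subreddit file would keep the two cleaning paths from drifting apart, for example in the order of the link and newline steps.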

In [36]:
#the number of rows in the biochem subreddit that contain the word biochemistry
biochem_sub[biochem_sub['selftext'].str.contains('biochemistry')].shape

(1156, 3)

In [37]:
#the number of rows in the biochem subreddit whose selftext contains the substring 'biochem' (case-sensitive)
biochem_sub[biochem_sub['selftext'].str.contains('biochem')].shape

(1721, 3)

#### Containing the Names of the Subreddits

Above, 1,156 posts contain the subreddit name 'biochemistry' in their selftext, and 1,721 contain the substring 'biochem' (a count that includes the 'biochemistry' posts). Both checks are case-sensitive and were run before lowercasing. The names of the subreddits and variations such as 'bio' and 'biochem' might have high predictive power within any models created. As I only want to use terminology from these branches of science to predict the subreddits, the names of the subreddits and their variations will be dropped.

## Merge Cleaned Submission DataFrames

In [38]:
#concat the dataframes on top of each other
#columns are the same in each dataframe
final_sub = pd.concat([bio_sub,biochem_sub],axis=0)

In [39]:
#final dataframe for submissions
#correct num of rows should be 4789 + 6229 = 11018
#correct num of cols should be 3
final_sub.shape

(11018, 3)

In [43]:
#info of the final dataframe
final_sub.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11018 entries, 0 to 6228
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  11018 non-null  object
 1   selftext   11018 non-null  object
 2   title      11018 non-null  object
dtypes: object(3)
memory usage: 344.3+ KB


In [40]:
#make all of the text in the columns lowercase
final_sub['selftext'] = final_sub['selftext'].str.lower()
final_sub['title'] = final_sub['title'].str.lower()

In [41]:
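#remove punctuation from text columns with regex
#regex=True is passed explicitly; newer pandas treats the pattern as a literal string by default
final_sub['selftext'] = final_sub['selftext'].str.replace(r'[^\w\s]', '', regex=True)
final_sub['title'] = final_sub['title'].str.replace(r'[^\w\s]', '', regex=True)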

In [44]:
#brief check to make sure replace worked
final_sub[final_sub['selftext'].str.contains('biology')]

Unnamed: 0,subreddit,selftext,title
1,biology,hello i am looking for some adviceideas to fur...,bs in biology to a masters in what
8,biology,im still a high school student and am beginnin...,i have a question about biology and why am im ...
9,biology,i want to major in biology im perfectly okay w...,what would you tell a high school senior that ...
21,biology,what are the differences with campbell biology...,campbell biology canadian edition
24,biology,hi im an 18 year old firstyear college student...,people with experience in biology please help
...,...,...,...
6186,Biochemistry,i recently graduated from a state uni with a d...,recent biochemmol biology graduate resume crit...
6196,Biochemistry,hi everyonei am a graduate student proposing a...,techniques for studying peroxisomal enzyme int...
6205,Biochemistry,hello everyone im not sure if this is the righ...,resources for beginners on proteins and antiox...
6209,Biochemistry,hi rbiochemistryi am really sorry to bother yo...,acs biochem exam help


#### Containing the Names of the Subreddits

Above, there are 1,561 instances of the subreddit name 'biology' within the selftext alone. The names of the subreddits and variations such as 'bio' and 'biochem' might have high predictive power within any models created. As I only want to use terminology from these branches of science to predict the subreddits, the names of the subreddits and their variations will be dropped.
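One way the subreddit names could be dropped is a word-boundary regex applied to the lowercased, punctuation-free text. This is only a sketch: the token list `SUBREDDIT_TOKENS` and the helper `strip_subreddit_names` are assumptions for illustration, and the same effect could instead be achieved with a custom stop-word list at vectorization time.

```python
import re
import pandas as pd

# Assumed list of tokens to drop; extend it with whatever variations matter.
SUBREDDIT_TOKENS = ['biochemistry', 'biochem', 'biology', 'bio']

# \b word boundaries leave longer words such as 'antibiotic' intact.
name_pattern = r'\b(?:' + '|'.join(map(re.escape, SUBREDDIT_TOKENS)) + r')\b'

def strip_subreddit_names(text):
    """Hypothetical helper: remove standalone subreddit-name tokens from a Series."""
    stripped = text.str.replace(name_pattern, ' ', regex=True)
    # Collapse the gaps left behind by the removed tokens.
    return stripped.str.replace(r'\s+', ' ', regex=True).str.strip()

demo = pd.Series([
    'i want to major in biology',
    'biochem exam and antibiotic dosing',
    'hi rbiochemistry',
])
result = strip_subreddit_names(demo)
print(result.tolist())
```

Fused tokens such as 'rbiochemistry' (what 'r/biochemistry' becomes after punctuation removal) are not caught by the word-boundary pattern and would need their own entry in the token list.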

In [45]:
#how many instances of a found spam post?
final_sub[final_sub['selftext'].str.contains('ai dungeon')]

Unnamed: 0,subreddit,selftext,title
2950,biology,ann vast improvement sexy i tried retraining ...,ais from ai dungeon 2 to sexy to funny and one...
2957,biology,ann vast improvement sexy i tried retraining ...,ais from ai dungeon 2 to sexy to funny and one...
2966,biology,ann vast improvement sexy i tried retraining ...,ais from ai dungeon 2 to sexy to funny and one...
3284,biology,ann vast improvement sexy i tried retraining ...,ais from ai dungeon 2 to sexy to funny and one...
3343,biology,ann vast improvement sexy i tried retraining ...,ais from ai dungeon 2 to sexy to funny and one...
3744,biology,ann vast improvement sexy i tried retraining ...,ais from ai dungeon 2 to sexy to funny and one...
3933,biology,ann vast improvement sexy i tried retraining ...,ais from ai dungeon 2 to sexy to funny and one...
3975,biology,ann vast improvement sexy i tried retraining ...,ais from ai dungeon 2 to sexy to funny and one...
4068,biology,ann vast improvement sexy i tried retraining ...,ais from ai dungeon 2 to sexy to funny and one...
4342,biology,ann vast improvement sexy i tried retraining ...,ais from ai dungeon 2 to sexy to funny and one...


#### Noise

In the cell above there are 11 spam posts from the biology subreddit which have gone unmoderated. These will be dropped, as they are insignificant to the data science question and introduce noise to the machine learning algorithms.

In [46]:
#removing spam posts from final dataframe (~ negates the boolean mask)
final_sub = final_sub[~final_sub['selftext'].str.contains('ai dungeon')]
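The filter above targets one known spam phrase. A more general check, sketched here on invented rows rather than the real data, is to look for post bodies that repeat verbatim, since the spam posts shown earlier all share the same selftext.

```python
import pandas as pd

# Toy stand-in for final_sub; the column names match the notebook, the rows are invented.
posts = pd.DataFrame({
    'subreddit': ['biology', 'biology', 'biology', 'Biochemistry'],
    'selftext': ['same spam body', 'how do enzymes fold',
                 'same spam body', 'what is helix capping'],
})

# Count how often each body appears; bodies repeated many times are spam candidates.
body_counts = posts['selftext'].value_counts()
repeated = body_counts[body_counts > 1]

# keep=False flags every copy of a repeated body, not only the later ones.
dup_mask = posts['selftext'].duplicated(keep=False)
deduped = posts[~dup_mask].reset_index(drop=True)

print(repeated)
print(deduped)
```

Repeated bodies are only candidates: short legitimate posts can also coincide, so the list in `repeated` is worth reading before dropping anything.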

In [48]:
#value counts of the final dataframe
final_sub['subreddit'].value_counts(normalize=True)

Biochemistry    0.565913
biology         0.434087
Name: subreddit, dtype: float64

#### Final Subreddit Counts

Within the final rows of data for the final_sub dataframe, in which selftext and title each have their own column, there are more r/Biochemistry posts than r/biology posts. In total, 56.6% of the posts are from r/Biochemistry and 43.4% are from r/biology. This will frame the baseline model.
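To make the baseline concrete: a model that always predicts the majority class is right as often as that class occurs. The sketch below uses the proportions printed by `value_counts(normalize=True)` above; the variable names are illustrative.

```python
import pandas as pd

# Class shares as reported in the notebook output above.
class_share = pd.Series({'Biochemistry': 0.565913, 'biology': 0.434087})

# Always predicting the most common subreddit yields that class's share as accuracy.
majority_class = class_share.idxmax()
baseline_accuracy = class_share.max()

print(f"baseline: always predict {majority_class} -> accuracy {baseline_accuracy:.1%}")
```

Any classifier trained on this data should therefore beat roughly 56.6% accuracy to be considered useful.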

In [49]:
#read final_sub dataframe to csv
final_sub.to_csv('datasets/cleaned-submission.csv')

## Combine Title and Selftext into a Single DataFrame for Combination Modeling

In [50]:
#final data frame for submissions
final_sub.head(2)

Unnamed: 0,subreddit,selftext,title
0,biology,heres the images theyre the midslices but the ...,why does it seem like my venous sinuses are ve...
1,biology,hello i am looking for some adviceideas to fur...,bs in biology to a masters in what


In [52]:
#getting the info of the final_sub dataframe
final_sub.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11007 entries, 0 to 6228
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  11007 non-null  object
 1   selftext   11007 non-null  object
 2   title      11007 non-null  object
dtypes: object(3)
memory usage: 344.0+ KB


In [53]:
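#create a dataframe of subreddit and selftext only, renaming selftext to 'text'
#rename returns a new frame, which avoids assigning column names on a slice
selftxt = final_sub[['subreddit','selftext']].rename(columns={'selftext': 'text'})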

In [54]:
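#create a dataframe of subreddit and title only, renaming title to 'text'
#rename returns a new frame, which avoids assigning column names on a slice
titletxt = final_sub[['subreddit','title']].rename(columns={'title': 'text'})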

In [55]:
#create dataframe of title text and selftext stacked into one 'text' column
self_title_combined = pd.concat([selftxt,titletxt],ignore_index=True)

In [56]:
#checking the final shape of the combined dataframe
self_title_combined.shape

(22014, 2)

In [58]:
#getting the value counts of the subreddits
self_title_combined['subreddit'].value_counts(normalize=True)

Biochemistry    0.565913
biology         0.434087
Name: subreddit, dtype: float64

#### Final Subreddit Counts

In the self_title_combined dataframe, selftext and title are stacked into a single text column, so each post contributes two rows (22,014 in total) and the class proportions are unchanged: 56.6% of the rows are from r/Biochemistry and 43.4% are from r/biology. This will frame the baseline model for the combined data.

In [59]:
#save this dataframe into the datasets folder
self_title_combined.to_csv('datasets/combined_title_self.csv')

#### Discussion of the Data

Observing the data, it looks like there might be a considerable amount of noise for the algorithms to work around during modeling. A considerable number of posts within the subreddits ask for advice on career paths or on which universities to study at. A title taken from a post above reads "bs in biology to a masters in what"; this is not the scientific terminology I am looking for during modeling. While I think I will be able to get a model above the baseline with the number of desired posts, I think accuracy will be hindered by the noise of the generic lifestyle posts.
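A rough way to gauge how much of this noise is present is to flag posts by keyword. The sketch below is an assumption-heavy illustration: `ADVICE_KEYWORDS` is a guessed list rather than a validated one, and the three titles are lowercased, punctuation-free versions of titles shown earlier in the notebook.

```python
import pandas as pd

# Hypothetical keyword list; a guess at what marks a career/advice post.
ADVICE_KEYWORDS = ['career', 'major', 'masters', 'degree', 'university', 'resume']
advice_pattern = r'\b(?:' + '|'.join(ADVICE_KEYWORDS) + r')\b'

# Titles taken from the outputs above, in their cleaned form.
titles = pd.Series([
    'bs in biology to a masters in what',
    'any website for microscopy videos',
    'protein 3d structure database',
])

# Boolean flag per title, and the share of titles flagged as advice posts.
is_advice = titles.str.contains(advice_pattern, regex=True)
advice_share = is_advice.mean()

print(is_advice.tolist(), round(advice_share, 3))
```

Applied to the full text column, the flagged share would give a first estimate of how many generic lifestyle posts the models have to contend with, though keyword matching will both miss some and mislabel others.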