# Cleaning the raw dataframe

In this notebook, we'll import the raw dataframe we created and do some cleaning on it before moving to EDA.

## Import libraries

In [1]:
import os
import pandas as pd

## Loading the data

In [2]:
filepath = '../data/raw_1587672956.csv'

In [3]:
# Need to set low_memory=False because of inconsistent datatypes
df = pd.read_csv(filepath, low_memory=False)
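
With `low_memory=False`, pandas infers each column's dtype from the whole file instead of chunk by chunk. A column can still hold values of several Python types afterwards. The sketch below is one way to list such columns. The `toy` frame and the `mixed_type_columns` helper are hypothetical names made up for illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for the raw data
toy = pd.DataFrame({
    'gilded': [0, '1', None],   # ints and strings mixed together
    'score': [1, 2, 3],         # consistently integers
})

def mixed_type_columns(frame):
    """Return the names of columns whose non-null values span more than one Python type."""
    mixed = []
    for col in frame.columns:
        n_types = frame[col].dropna().map(type).nunique()
        if n_types > 1:
            mixed.append(col)
    return mixed

print(mixed_type_columns(toy))
```

Running the same helper on `df` would show which of the 86 raw columns are affected, which can help decide whether an explicit `dtype` argument is worth adding.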

In [4]:
df.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,gilded,poll_data,crosspost_parent,crosspost_parent_list,distinguished,steward_reports,removed_by,updated_utc,og_description,og_title
0,[],False,friendlyhello123,,[],,text,t2_66w0klyz,False,False,...,,,,,,,,,,
1,[],False,jason24119949422,,[],,text,t2_5riob2jy,False,False,...,,,,,,,,,,
2,[],False,diwiwi,,[],,text,t2_rsn6o6d,False,False,...,,,,,,,,,,
3,[],False,kofrase94,,[],,text,t2_6a5knvep,False,False,...,,,,,,,,,,
4,[],False,Clear-Plastic,,[],,text,t2_6adu0qmq,False,False,...,,,,,,,,,,


In [5]:
df.shape

(50000, 86)

## Data cleaning

### Too many features

The Pushshift API returns far more information than is actually useful for our purposes. The first thing we'll do is trim down the number of features in our dataframe.

In [6]:
# Create new dataframe out of only potential columns of interest
features = ['subreddit', 
            'subreddit_subscribers', 
            'title', 
            'selftext', 
            'score', 
            'num_comments', 
            'author', 
            'created_utc']
df = df[features].copy()

In [7]:
df.head()

,subreddit,subreddit_subscribers,title,selftext,score,num_comments,author,created_utc
0,Christianity,233423,Christian and wondering if God provides what w...,"Please forgive me for sounding so flippant, bu...",1,0,friendlyhello123,1587672956
1,Christianity,233423,will video games lead to hell,,1,0,jason24119949422,1587672676
2,Christianity,233421,Should we stop telling people to read the Bibl...,,1,0,diwiwi,1587672362
3,Christianity,233420,New to this app but I feel called to try to sp...,,1,0,kofrase94,1587672281
4,Christianity,233423,What do you think is often ignored in the bibl...,Like genealogies for example... both Rahab and...,1,0,Clear-Plastic,1587671298
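
An alternative to trimming after loading is to ask `read_csv` for only the columns of interest through its `usecols` parameter, which avoids parsing the unused columns at all. The sketch below uses a small in-memory CSV with made-up rows in place of the real file; `raw_csv`, `wanted` and `trimmed` are illustrative names.

```python
import io
import pandas as pd

# Made-up stand-in for the raw CSV file
raw_csv = io.StringIO(
    "author,score,title,gilded\n"
    "alice,1,Hello,0\n"
    "bob,2,World,1\n"
)

wanted = ['title', 'score', 'author']

# usecols selects columns but keeps the file's column order,
# so reindex afterwards to get the order listed in `wanted`
trimmed = pd.read_csv(raw_csv, usecols=wanted)[wanted]

print(trimmed.columns.tolist())
```

Loading everything first, as this notebook does, has the advantage that the full set of columns can be inspected before deciding what to keep.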


### Rows without a valid author

More than a few of the `author` entries in our dataframe are marked as `[deleted]`. This typically indicates that the original poster has deleted their account. Since we have so many observations, we won't hesitate to drop the rows with `author` marked as `[deleted]`.

In [8]:
df[df['author'] == '[deleted]']

,subreddit,subreddit_subscribers,title,selftext,score,num_comments,author,created_utc
228,Christianity,233299,Church,[deleted],1,0,[deleted],1587594293
232,Christianity,233297,Survey question for christians who belong to a...,[deleted],1,2,[deleted],1587592663
243,Christianity,233293,How long should it taking me to read a chapter...,[deleted],1,0,[deleted],1587590441
246,Christianity,233292,/u/ManWhoCommentsOK,[deleted],1,0,[deleted],1587589606
268,Christianity,233287,"Rlly interesting, rapture dream!",[deleted],0,0,[deleted],1587585226
...,...,...,...,...,...,...,...,...
47698,Catholicism,74857,How To Pray The Traditional Rosary,[deleted],1,0,[deleted],1570137584
48457,Catholicism,74337,Did anyone see Bishop Barron's AMA today?,,1,2,[deleted],1569546741
48827,Catholicism,73966,What are y'alls thoughts on this?,[deleted],1,0,[deleted],1569280755
49053,Catholicism,73795,A Catholic Poem on the Topic Of Nostalgia,[deleted],1,0,[deleted],1569101273


In [9]:
# Drop rows with deleted authors
no_author_indices = df[df['author'] == '[deleted]'].index
df.drop(index=no_author_indices, inplace=True)

### Rows without valid selftext

Note that we also have a fair number of posts with the `selftext` column marked as `[removed]`. This indicates a post that was removed by a moderator of the subreddit, typically for violating community rules. Substantially more observations fall into this category than into the previous one. However, since these posts were ultimately deemed inappropriate for the community, we won't hesitate to drop them either.

In [10]:
df[df['selftext'] == '[removed]']

,subreddit,subreddit_subscribers,title,selftext,score,num_comments,author,created_utc
80,Christianity,233386,Good Morning,[removed],1,0,mobileshalom,1587649184
87,Christianity,233385,Interesting 'social experiment',[removed],1,0,jazzlike_waterdog,1587646291
136,Christianity,233344,2020,[removed],1,2,kvngsias,1587621633
192,Christianity,233317,Atheists: what are you doing to get yourselves...,[removed],1,0,FeistyResearcher7,1587604629
229,Christianity,233299,Did I accidentally pray to the Devil? (OCD),[removed],1,10,Christian_Boy42,1587593530
...,...,...,...,...,...,...,...,...
49891,Catholicism,73237,How specific do you have to be when confessing...,[removed],5,15,catholicwarrior57,1568351037
49903,Catholicism,73231,The Parish Narcissist,[removed],0,4,Shnimzel,1568340851
49907,Catholicism,73228,Any good resources on the history of the church?,[removed],0,0,Red_Baron_Cath,1568335550
49921,Catholicism,73216,Is it a good idea to quit masturbation for a f...,[removed],0,2,datpoosniffer445,1568323889


In [11]:
# Drop rows with removed selftext
removed = df[df['selftext'] == '[removed]'].index
df.drop(index=removed, inplace=True)
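
The two filters above can also be written as boolean masks, which makes it easy to count how many rows each rule removes before anything is dropped. This is a minimal sketch on a made-up frame (`toy`, `is_deleted`, `is_removed` and `kept` are illustrative names), not a replacement for the cells above.

```python
import pandas as pd

# Made-up rows covering each case: valid, deleted author, removed text, missing text
toy = pd.DataFrame({
    'author': ['alice', '[deleted]', 'bob', 'carol'],
    'selftext': ['hello', '[deleted]', '[removed]', None],
})

is_deleted = toy['author'] == '[deleted]'
is_removed = toy['selftext'] == '[removed]'

# Keep only the rows that trip neither rule
kept = toy[~(is_deleted | is_removed)]

print(is_deleted.sum(), is_removed.sum(), len(kept))
```

A missing `selftext` compares as not equal to `'[removed]'`, so rows with no body text pass through this filter untouched.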

### Reposts

We've accidentally grabbed more than a few "reposts" in our dataframe. A repost is a post with exactly the same content as one previously posted to the community. Reposts would bias any model we create by giving it the impression that certain sentiments are more common than they actually are, so we'll delete them.

In [12]:
df['selftext'].value_counts()[:5]

\nPlease post your prayer requests in this weekly thread, giving enough detail to be helpful. If you have been remembering someone or something in your prayers, you may also note that here.\n\nWe also ask that you also take time to remember [our beloved departed](http://www.reddit.com/r/TelaIgne/comments/30fxz4/rtelaigne_roll_of_the_faithful_departed/). Their final purification and ultimate union with God is included in our weekly intentions here. Consider adding your departed loved ones to that roll.\n\n---\n\n### Tela Igne\n\n/r/Catholicism is home to **Tela Igne**, a group of redditors devoted to praying for the intentions left here every week. To learn more about the group and for information on how to join, please see *[Tela Igne—Rules and Regulations for Membership](http://www.reddit.com/r/TelaIgne/comments/2bpp19/tela_igne_rules_and_regs_for_membership/).*                                                                                                                             

In [13]:
df['title'].value_counts()[:5]

Question              48
Prayer request        37
Please pray for me    31
Help                  27
I need help           17
Name: title, dtype: int64

In [14]:
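# Drop posts that repeat a selftext already seen in the same subreddit
# (pandas treats missing values as equal here, so posts with no selftext
# are reduced to one per subreddit as well)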
df.drop_duplicates(subset=['subreddit', 'selftext'], inplace=True)

In [15]:
# Drop posts that repeat a title already seen in the same subreddit
df.drop_duplicates(subset=['subreddit', 'title'], inplace=True)

In [16]:
# How many observations are we left with?
df.shape

(28737, 8)

We've lost a significant number of observations by removing reposts, but we still have a dataframe with almost 30,000 rows.
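
One detail worth checking when deduplicating on `selftext`: pandas treats missing values as equal to each other when looking for duplicates, so title-only posts (those with no `selftext`) are collapsed to a single row per subreddit. Whether that is desirable depends on the analysis. The sketch below uses made-up rows to show the effect, along with one possible variant that only treats posts with actual text as reposts. All variable names here are illustrative.

```python
import pandas as pd

# Made-up rows: two title-only posts and two posts sharing the same text
toy = pd.DataFrame({
    'subreddit': ['A', 'A', 'A', 'A'],
    'title': ['t1', 't2', 't3', 't4'],
    'selftext': [None, None, 'same', 'same'],
})

# Plain drop_duplicates: the two missing selftexts count as duplicates of each other
naive = toy.drop_duplicates(subset=['subreddit', 'selftext'])

# Variant: flag a row as a repost only when it actually has text
has_text = toy['selftext'].notna()
is_repost = has_text & toy.duplicated(subset=['subreddit', 'selftext'])
careful = toy[~is_repost]

print(len(naive), len(careful))
```

In the variant, both title-only posts survive and only the second copy of the repeated text is dropped.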

### Null values

Finally, we have a few null values in the dataset. We can safely drop these considering the sheer volume of posts we have.

In [17]:
df.isnull().sum()

subreddit                0
subreddit_subscribers    0
title                    0
selftext                 2
score                    0
num_comments             0
author                   0
created_utc              0
dtype: int64

In [18]:
df.dropna(inplace=True)

In [19]:
df.shape

(28735, 8)

## Exporting clean data

In [20]:
# Make sure the output directory exists before writing to it
os.makedirs('../data', exist_ok=True)

In [21]:
# Build the output path in a new variable so re-running this cell
# doesn't mangle the path
clean_filepath = filepath.replace('../data/raw', '../data/clean')
df.to_csv(clean_filepath, index=False)

Now that we have a clean dataframe to work with, we can move on to performing EDA in a separate notebook.