#### Step 2: Cleaning

In [1]:
import pandas as pd

To begin cleaning, I'll look at the data at a high level to see how many null values there are and whether any columns are stored with an unexpected data type.

In [2]:
data = pd.read_csv('../data/posts.csv')

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   subreddit    5000 non-null   object
 1   title        5000 non-null   object
 2   selftext     3026 non-null   object
 3   created_utc  5000 non-null   int64 
dtypes: int64(1), object(3)
memory usage: 156.4+ KB


Only about 60% of the samples (3,026 of 5,000) have selftext, but otherwise the data is intact. The missing selftext won't matter anyway, since I've decided to look only at post titles.
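As a quick cross-check on the non-null counts, the share of missing values per column can also be computed directly. This is a minimal sketch on a made-up toy frame (the `toy` name and its values are hypothetical, not taken from posts.csv):

```python
import pandas as pd

# Toy frame standing in for the scraped posts (values are made up).
toy = pd.DataFrame({
    'title': ['a', 'b', 'c', 'd', 'e'],
    'selftext': ['text', None, 'more text', None, 'text'],
})

# isna() flags missing cells; the column mean is the fraction that is missing.
null_share = toy.isna().mean()
print(null_share)
```

Running the same expression on `data` should give a missing share of roughly 0.4 for selftext, consistent with the counts reported by `info()`.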

The selftext is still useful, however, for identifying posts that have been removed. Because removed posts may be off-topic posts taken down by moderators, duplicates, or otherwise irrelevant, I will exclude them from the analysis.

In [4]:
data = data[data['selftext'] != '[removed]']
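One detail worth confirming is that this filter keeps the posts with no selftext at all, since a missing value compared with `!=` evaluates to True. Below is a small sketch on a hypothetical three-row frame (the rows are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one with text, one with missing selftext, one removed.
toy = pd.DataFrame({
    'title': ['kept: has text', 'kept: no text', 'dropped'],
    'selftext': ['some text', np.nan, '[removed]'],
})

# NaN != '[removed]' evaluates to True, so posts without selftext survive.
kept = toy[toy['selftext'] != '[removed]']

# Counting the matches first shows how many rows the filter will drop.
removed_count = (toy['selftext'] == '[removed]').sum()
print(len(kept), removed_count)
```

Counting the matches before filtering is an easy way to see how much of the later drop in sample size comes from removed posts rather than duplicates.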

With the removed posts filtered out, I can drop the remaining columns so that only the features used in the model are left: the target ('subreddit') and the post title.

In [5]:
data = data.drop(columns=['selftext', 'created_utc'])

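Now that only subreddits and titles remain, I want to drop any duplicate post titles still in the set, as I don't want something like a recurring thread to dominate the predictions. Note that `drop_duplicates()` with no arguments compares whole rows, so a title is dropped only when it repeats within the same subreddit.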

In [6]:
data = data.drop_duplicates()
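To make the duplicate rule concrete, here is a sketch comparing the default row-level behavior with deduplicating on the title alone. The toy rows are made up, and `subset='title'` is shown only as an alternative, not as what this notebook does:

```python
import pandas as pd

# Hypothetical rows: the same thread title appears twice in one
# subreddit and once in the other.
toy = pd.DataFrame({
    'subreddit': ['Coffee', 'Coffee', 'tea', 'tea'],
    'title': ['Daily Question Thread', 'Daily Question Thread',
              'Daily Question Thread', 'What is in your cup?'],
})

# Default: a row is a duplicate only if every column matches.
by_row = toy.drop_duplicates()

# subset='title': each title is kept once, whichever row came first.
by_title = toy.drop_duplicates(subset='title')

print(len(by_row), len(by_title))
```

Keeping the row-level default means a title shared by both subreddits stays in both classes, which seems reasonable here since each copy still carries its own label.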

With our basic cleaning out of the way, we can check to see how many samples are still left:

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4607 entries, 0 to 4999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  4607 non-null   object
 1   title      4607 non-null   object
dtypes: object(2)
memory usage: 108.0+ KB


In total, we dropped 393 of the original 5,000 posts, or just under 8% of the sample. This should still leave us with enough posts to build a reasonable model, as long as our class split didn't become too skewed.

In [8]:
data['subreddit'].value_counts()

Coffee    2316
tea       2291
Name: subreddit, dtype: int64

It looks like r/tea lost a few more posts than r/Coffee did, but the two classes are still nearly equal (2,316 vs. 2,291, roughly a 50/50 split), so class imbalance shouldn't be an issue.
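The split can also be expressed as proportions. The sketch below rebuilds the two counts by hand from the output above (the `counts` Series is constructed manually for illustration); on the real frame, `data['subreddit'].value_counts(normalize=True)` should give the same shares directly:

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({'Coffee': 2316, 'tea': 2291})

# Dividing by the total turns the counts into class shares.
shares = counts / counts.sum()
print(shares.round(3))
```

Since both shares sit within about half a percentage point of 0.5, a model that always guesses one class would score only about 50%, which makes for a convenient baseline.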

With cleaning complete, I'll save the new dataframe for use in EDA and modeling:

In [9]:
data.to_csv('../data/cleaned_posts.csv', index=False)