## Cleaning the data

In this step we clean the data. Cleaning takes many forms, and usually starts with looking at the raw data to identify problems.

We check for missing values, since we don't expect any, and drop any comment that has no text. We also create a new measure for the length of each comment, counted in characters.

If we had a more complicated question, this step would involve more complex processing, such as building networks from comment trees, running sentiment analysis, or classifying comments.

In [1]:
import pandas as pd
import pprint

raw_data = '../data/raw_data.csv'
cleaned_csv = '../data/processed_data.csv'

In [2]:
df = pd.read_csv(raw_data)
# Count missing values in each column
df.isna().sum()


author                       0
body                         1
created_utc                  0
id                           0
link_id                      0
parent_id                    0
depth                        0
score                        0
score_hidden                 0
upvotes                      0
downvotes                    0
subreddit                    0
submission_id                0
submission_title             0
submission_created_utc       0
submission_author            0
submission_num_comments      0
submission_score             0
submission_body            584
submission_url               0
dtype: int64

One comment has no `body`, so we drop that row. `submission_body` is missing for 584 rows, but that column is not kept in the cleaned data, so we leave it alone.

In [3]:
# Drop comments with a missing body
df = df.dropna(subset=['body'])
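
A quick sanity check after a filter like this can confirm that only the expected rows were removed. The sketch below uses a made-up toy frame (`toy` and its values are illustrative, not the real data) and assumes the same rule as above: drop rows whose `body` is missing.

```python
import pandas as pd

# Toy stand-in for the raw data (hypothetical values, not from raw_data.csv)
toy = pd.DataFrame({'body': ['first', None, 'third'], 'score': [1, 2, 3]})

before = len(toy)
toy = toy.dropna(subset=['body'])   # same rule as the cell above
dropped = before - len(toy)

# No missing bodies should remain after the filter
assert toy['body'].isna().sum() == 0
print(f'Dropped {dropped} row(s) with a missing body')
```

On the real data, `dropped` should match the count reported for `body` in the missing-value table.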

In [4]:
# Keep only the columns we need; .copy() avoids a SettingWithCopyWarning below
df = df.loc[:, ['body', 'depth', 'score', 'subreddit', 'upvotes', 'downvotes', 'created_utc']].copy()
# Create a new column for the length of the comment, in characters
df['comment_length'] = df['body'].astype(str).str.len()
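
Note that `comment_length` counts characters. If a word count is wanted as well, one possible sketch is shown below. It assumes that "words" means whitespace-separated tokens, and it uses a made-up toy frame; `word_count` is a hypothetical column name that is not part of the cleaned data.

```python
import pandas as pd

# Toy comments standing in for the `body` column (made-up data)
toy = pd.DataFrame({'body': ['hello world', 'one', 'a  b   c', '']})

# Character length, as computed in the cell above
toy['comment_length'] = toy['body'].astype(str).str.len()

# Word count: split on runs of whitespace, then count the pieces
toy['word_count'] = toy['body'].astype(str).str.split().str.len()

print(toy)
```

Splitting on whitespace treats punctuation as part of a word and counts URLs as single words, which may or may not suit the question being asked.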



In [5]:
# Write out the cleaned data
df.to_csv(cleaned_csv, index=False)