# Cleaning the Data

In [1]:
import pandas as pd
import numpy as np
import re

In [2]:
df = pd.read_csv('./data/clean.csv', index_col='Unnamed: 0')

Check whether any missing values remain.

In [3]:
df.isna().sum()

author          0
anxiety         0
title           0
selftext        0
created_utc     0
retrieved_on    0
url             0
pinned          0
media_only      0
adhd            0
dtype: int64

Most posts with missing values have only a title pointing to a picture, a link, or something similar. For my analytical purposes, dropping them is fine.

In [19]:
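# .copy() makes df2 an independent frame, so the column assignments below
# do not raise SettingWithCopyWarning
df2 = df[df.selftext.notna()].copy()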

Speaking of links, I am also going to get rid of any URLs in the bodies of the posts.

In [20]:
# Raw string avoids invalid-escape warnings; 'https?://' then any run of non-whitespace
df2['selftext'] = df2.selftext.map(lambda x: re.sub(r'https?://\S*', ' ', x))
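To sanity-check the URL pattern outside the DataFrame, here is a minimal sketch on a made-up string. The sample text and the name `url_pattern` are my own illustrations, not part of the dataset or the notebook.

```python
import re

# Hypothetical sample text, not taken from the dataset.
sample = "See https://example.com/page?id=1 and http://test.org for details."

# 'http', an optional 's', '://', then any run of non-whitespace characters.
url_pattern = re.compile(r'https?://\S*')

# Each URL is swapped for a single space, so the surrounding words stay separated.
cleaned = url_pattern.sub(' ', sample)
print(cleaned.split())
```

Because the pattern consumes everything up to the next whitespace, punctuation glued to the end of a URL (a trailing period or closing parenthesis) is removed along with it.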

Getting rid of all mentions of the words ADHD and anxiety, ignoring case.

In [82]:
# The optional trailing 's' also catches plurals; re.I makes the match case-insensitive
df2['selftext'] = df2.selftext.map(lambda x: re.sub(r'(ADHD|anxiety)s?', ' ', x, flags=re.I))

Getting rid of subreddit references (anything of the form r/name or /r/name that follows whitespace).

In [90]:
df2['selftext'] = df2.selftext.map(lambda x: re.sub(r'\s/?r/\S+', ' ', x))
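A quick sketch of what this pattern does and does not catch, using made-up strings. The samples and the name `subreddit_pattern` are assumptions for illustration only.

```python
import re

# Whitespace, an optional leading slash, 'r/', then the subreddit name.
subreddit_pattern = re.compile(r'\s/?r/\S+')

# Hypothetical sample text, not taken from the dataset.
mid_text = "Posted this on r/productivity and /r/AskReddit yesterday."
at_start = "r/productivity helped me a lot."

cleaned_mid = subreddit_pattern.sub(' ', mid_text)
cleaned_start = subreddit_pattern.sub(' ', at_start)

print(cleaned_mid.split())
print(cleaned_start)
```

Since the pattern requires a whitespace character before the reference, a subreddit name at the very start of a post is left in place. If that matters, `(^|\s)` in place of the leading `\s` would be one way to cover it.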

Getting rid of newline characters

In [85]:
df2['selftext'] = df2.selftext.map(lambda x: re.sub('\n', ' ', x))

Getting rid of posts that were removed (usually for violating subreddit posting rules). 

In [119]:
df2.drop(df2[df2.selftext.str.contains(r'\[removed\]')].index, inplace=True)

Getting rid of the weird "&amp;#x200B;" string. "&#x200B;" is the HTML character reference for a zero-width space, and here its ampersand appears to have been escaped a second time as "&amp;".

In [87]:
df2['selftext'] = df2.selftext.map(lambda x: re.sub('&amp;#x200B;', ' ', x))
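To see where this string might come from, here is a small sketch using the standard library's `html.unescape` on a made-up string. It is an illustration of the double escaping, not the approach used in this notebook, and the sample text is hypothetical.

```python
import html

# Hypothetical sample text, not taken from the dataset.
raw = "First paragraph.&amp;#x200B;Second paragraph."

# First pass: '&amp;' becomes '&', which leaves the reference '&#x200B;'.
once = html.unescape(raw)

# Second pass: '&#x200B;' becomes the zero-width space character U+200B.
twice = html.unescape(once)

# The zero-width space is invisible, so swap it for a normal space.
cleaned = twice.replace('\u200b', ' ')
print(cleaned)
```

The direct substitution in the cell above reaches the same result for this one string. Unescaping would also decode other entities in the text, which may or may not be wanted.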

Getting rid of apostrophes

In [131]:
df2['selftext'] = df2.selftext.map(lambda x: re.sub("'", '', x))

Save to CSV, then move on to EDA in another notebook.

In [132]:
df2.to_csv('./data/clean.csv')
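For reuse, the text substitutions above could be bundled into one function. This is a minimal sketch under my own assumptions: the name `clean_selftext`, the constant names, and the sample string are hypothetical, and the step order simply follows the order of the notes in this notebook. Dropping `[removed]` posts is a row filter rather than a text substitution, so it is not included.

```python
import re

URL_RE = re.compile(r'https?://\S*')
KEYWORD_RE = re.compile(r'(ADHD|anxiety)s?', flags=re.IGNORECASE)
SUBREDDIT_RE = re.compile(r'\s/?r/\S+')


def clean_selftext(text):
    """Apply the notebook's text substitutions to one post body."""
    text = URL_RE.sub(' ', text)              # URLs
    text = KEYWORD_RE.sub(' ', text)          # the words ADHD and anxiety
    text = SUBREDDIT_RE.sub(' ', text)        # r/name and /r/name references
    text = text.replace('\n', ' ')            # newline characters
    text = text.replace('&amp;#x200B;', ' ')  # escaped zero-width space
    text = text.replace("'", '')              # apostrophes
    return text


# Hypothetical sample text, not taken from the dataset.
sample = "I can't focus.\nMy ADHD is bad, see https://example.com or r/productivity &amp;#x200B;"
cleaned = clean_selftext(sample)
print(cleaned.split())
```

With a DataFrame like the one here, this could be applied as `df2['selftext'] = df2.selftext.map(clean_selftext)`.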