# Deep Cleaning Text Data

In [2]:
import pandas as pd
import numpy as np
import re

In [3]:
merged_df = pd.read_csv('../data/merged.csv', index_col='Unnamed: 0')

Checking if any missing values remain.

In [4]:
merged_df.isna().sum()

author            0
Anxiety           0
title             0
selftext        727
created_utc       0
retrieved_on      0
url               0
pinned            0
media_only        0
ADHD              0
dtype: int64

After checking those missing values, I found that most are posts with only a title attached to a picture, a link, or something similar. For my analytical purposes, dropping them is fine.

In [5]:
merged_df.dropna(inplace=True)
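
Since selftext is the only column with missing values, a bare dropna does the right thing here. As a minimal sketch on a made-up toy frame (toy_df and its contents are hypothetical, not the real data), the drop can be restricted to that column so that a null appearing in some other column would not silently remove rows:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data) with the same kind of gap as merged_df
toy_df = pd.DataFrame({
    'title': ['a picture', 'a question', 'a link'],
    'selftext': [np.nan, 'some body text', np.nan],
    'url': ['u1', None, 'u3'],
})

# Drop only the rows whose post body is missing
toy_clean = toy_df.dropna(subset=['selftext'])

# The surviving row still has a missing url, which a bare dropna() would remove
print(toy_clean['title'].tolist())
```

On the real data the equivalent call would be merged_df.dropna(subset=['selftext'], inplace=True).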

#### Now begins the cleaning of the actual text contained in the posts. For NLP we need only words, so all special characters, links, etc. need to be removed.

Here I get rid of any URLs that are in the bodies of the posts using Regular Expressions (Regex).

In [6]:
merged_df.selftext = merged_df.selftext.map(lambda x: re.sub(r'https?://\S*', ' ', x))
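
Before running the URL pattern over the whole column, it can be tried on a sample string. The sketch below is only an illustration: strip_urls and the sample text are hypothetical names and data, and the pattern is written as a raw string so that the backslash reaches the regex engine unchanged.

```python
import re

# Matches "http://" or "https://" followed by any run of non-whitespace
URL_PATTERN = re.compile(r'https?://\S*')

def strip_urls(text):
    """Replace every http/https URL in text with a single space."""
    return URL_PATTERN.sub(' ', text)

sample = 'See https://example.com/page?id=1 and http://foo.bar for details'
cleaned = strip_urls(sample)

# Collapse the leftover spaces just for display
print(' '.join(cleaned.split()))  # See and for details
```

One thing the sample does not show: in a Markdown link such as [text](https://x.com) the closing parenthesis is part of the non-whitespace run, so it is removed together with the URL.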

Because my end goal is a predictive model for which subreddit a post comes from, I need to remove every mention of the words ADHD and Anxiety, ignoring case. Leaving them in would make the task too easy: the model would score much better, but mostly by spotting the subreddit's own name, so it would not generate as many insights.

In [7]:
merged_df.selftext = merged_df.selftext.map(lambda x: re.sub('(ADHD|anxiety)[s]?', ' ', x, flags=re.I))
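
To see what this pattern does and does not catch, here is a small sketch on made-up sentences. strip_labels is a hypothetical helper, and its pattern is meant to be equivalent to the one in the cell above.

```python
import re

# "ADHD" or "anxiety", optionally followed by "s", in any letter case
LABEL_PATTERN = re.compile(r'(?:ADHD|anxiety)s?', flags=re.IGNORECASE)

def strip_labels(text):
    """Replace each mention of the two subreddit names with a space."""
    return LABEL_PATTERN.sub(' ', text)

removed = strip_labels('My adhd and Anxiety flare up')
untouched = strip_labels('anxious about my anxieties')

print(' '.join(removed.split()))  # My and flare up
print(untouched)                  # anxious about my anxieties
```

Related forms such as "anxious" and "anxieties" are left in place because neither contains the exact string "anxiety". Whether those should also be removed depends on how strict the leakage rule needs to be.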

Getting rid of references to other subreddits, such as r/SomeSubreddit or /r/SomeSubreddit, together with the whitespace character in front of them.

In [8]:
merged_df.selftext = merged_df.selftext.map(lambda x: re.sub(r'\s/?r/\S+', ' ', x))
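
The pattern above only matches a reference that follows a whitespace character and has at least one character after the slash. That leaves two cases untouched: a reference at the very start of a post, and the bare "r/" that remains once the earlier step has removed ADHD or Anxiety from r/ADHD or r/Anxiety. The sketch below shows one possible variant that covers both. It is an alternative under those assumptions, not the pattern used in this notebook, and strip_subreddit_refs is a hypothetical name.

```python
import re

# Start of string or whitespace, optional leading slash, "r/",
# then zero or more non-whitespace characters
SUBREDDIT_PATTERN = re.compile(r'(?:^|\s)/?r/\S*')

def strip_subreddit_refs(text):
    """Replace subreddit references, including a bare "r/", with a space."""
    return SUBREDDIT_PATTERN.sub(' ', text)

at_start = strip_subreddit_refs('r/GetStudying helped, so did /r/productivity today')
leftover = strip_subreddit_refs('see r/  for more')

print(' '.join(at_start.split()))  # helped, so did today
print(' '.join(leftover.split()))  # see for more
```

Running the subreddit step before the ADHD/Anxiety step would be another way to avoid the bare "r/" leftovers.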

Getting rid of newline characters.

In [9]:
merged_df.selftext = merged_df.selftext.map(lambda x: re.sub('\n', ' ', x))
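
Replacing only '\n' leaves tabs in place, along with the runs of spaces that the earlier substitutions introduce. If a single space between words is wanted, one option is to collapse every run of whitespace in one pass. This is a sketch of that option, with collapse_whitespace as a hypothetical helper name:

```python
import re

def collapse_whitespace(text):
    """Turn any run of spaces, tabs, or line breaks into one space."""
    return re.sub(r'\s+', ' ', text).strip()

print(collapse_whitespace('first line\n\nsecond\tline  '))  # first line second line
```

Most tokenizers split on whitespace anyway, so this mainly matters for readability when inspecting posts by eye.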

Getting rid of posts that were removed (usually for violating subreddit posting rules). 

In [10]:
merged_df.drop(merged_df[merged_df.selftext.str.contains(r'\[removed\]')].index, inplace=True)
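
Dropping by index label removes every row that carries a matching label. That is fine when the labels are unique, but it would remove extra rows if the merged frame has repeated labels. Whether merged.csv has them is not shown here, so treat this as a precaution. The sketch below uses a made-up toy frame (toy_df is hypothetical) to compare the label-based drop with a boolean mask:

```python
import pandas as pd

# Toy frame (hypothetical data) in which the label 0 appears twice
toy_df = pd.DataFrame(
    {'selftext': ['[removed]', 'a real post', 'another real post']},
    index=[0, 0, 1],
)

# regex=False treats the brackets literally, so no escaping is needed
is_removed = toy_df.selftext.str.contains('[removed]', regex=False)

# Dropping by label removes both rows labelled 0
by_label = toy_df.drop(toy_df[is_removed].index)

# Filtering with the mask removes only the matching row
by_mask = toy_df[~is_removed]

print(len(by_label), len(by_mask))  # 1 2
```

A quick merged_df.index.is_unique check would show whether this matters for the real data.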

Getting rid of the weird "&amp;#x200B;" string. It looks like the HTML entity for a zero-width space ("&#x200B;") whose leading ampersand has itself been escaped as "&amp;", so the entity was never decoded into the actual character.

In [11]:
merged_df.selftext = merged_df.selftext.map(lambda x: re.sub('&amp;#x200B;', ' ', x))
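
If the string really is a double-escaped entity, the standard library can decode it instead of matching it literally. This is a sketch under that assumption, with strip_zero_width_space as a hypothetical helper name. Note that it also decodes any other entities in the text (for example "&amp;lt;" becomes "<"), which may or may not be wanted.

```python
import html

def strip_zero_width_space(text):
    """Decode doubly escaped HTML entities, then replace U+200B with a space."""
    # First pass: "&amp;#x200B;" -> "&#x200B;"; second pass: -> "\u200b"
    decoded = html.unescape(html.unescape(text))
    return decoded.replace('\u200b', ' ')

print(strip_zero_width_space('intro&amp;#x200B;body'))  # intro body
```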

Getting rid of apostrophes.

In [12]:
merged_df.selftext = merged_df.selftext.map(lambda x: re.sub("'", '', x))
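
The step above removes the straight apostrophe only. Posts typed on phones often contain the curly apostrophe (U+2019) instead. Whether this dataset contains any is an assumption that has not been checked here. A sketch that removes both, with strip_apostrophes as a hypothetical helper name:

```python
# Translation table that deletes the straight and the curly apostrophe
APOSTROPHES = str.maketrans('', '', "'\u2019")

def strip_apostrophes(text):
    """Delete straight and curly apostrophes so contractions stay one token."""
    return text.translate(APOSTROPHES)

print(strip_apostrophes("don't and can\u2019t"))  # dont and cant
```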

Finally, I am ready to save this cleaned dataframe to a CSV and then conduct Exploratory Data Analysis (EDA) in a new notebook.

In [13]:
merged_df.to_csv('../data/clean.csv')