## Introduction to feature engineering ##

**In this notebook, I show some examples of the feature engineering I did on the text.** I tried several approaches: filtering for texts that mention anxiety and/or depression, adding keywords (such as those associated with symptoms of either), and excluding keywords (such as excluding depression from texts about anxiety).

I settled on the following method: for the anxiety and depression categories, take texts that mention keywords associated with each (e.g., the words anxiety or depression themselves, a symptom or two, and a mention of therapy). For the "other" category, I used four unrelated subreddits and took texts that did not mention either depression or anxiety.

There are a few things I would like to do in the future.
* Dive deeper into feature engineering. Not all texts are alike: some may detail an episode of depression, some may talk solely about a therapist, and others may celebrate getting over depression. I would likely explore topic modeling methods to look for these features (e.g., Latent Semantic Analysis, Latent Dirichlet Allocation).

* I might restrict posts to those with a word count above some threshold. Posts can be quite short or quite long. Longer posts may contain more detail and thus be better suited for classification than shorter posts (text features are generally sparse, so short texts would be very sparse).
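
**A minimal sketch of the word-count restriction mentioned above.** This is not something I have run on the Reddit data. The helper name `filter_by_word_count`, the threshold of 5 words, and the toy posts are all assumptions for illustration; words are counted by splitting on whitespace, and missing text counts as zero words.

```python
import pandas as pd


def filter_by_word_count(df, min_words, text_col='selftext'):
    """Keep rows whose text has at least `min_words` whitespace-separated words.

    Hypothetical helper, not part of the original pipeline.
    """
    # Treat missing text as an empty string, i.e. zero words.
    word_counts = df[text_col].fillna('').str.split().str.len()
    return df[word_counts >= min_words].reset_index(drop=True)


# Toy posts for illustration only (not taken from the dataset).
posts = pd.DataFrame({
    'selftext': [
        'Help.',
        'I have felt anxious for weeks and cannot sleep.',
        None,
    ]
})

# The threshold of 5 is arbitrary; it would need tuning on the real data.
long_posts = filter_by_word_count(posts, min_words=5)
print(long_posts)
```

A suitable threshold would have to be chosen by looking at the actual distribution of post lengths.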

In [1]:
import pandas as pd
import nltk
import numpy as np

**Import all the CSVs, which were originally taken from the Google BigQuery database.**

In [2]:
anxiety = pd.read_csv('reddit_anxiety.csv')
depression = pd.read_csv('reddit_depression.csv')
news = pd.read_csv('reddit_news.csv')
cute = pd.read_csv('reddit_cute.csv')
funny = pd.read_csv('reddit_funny.csv')
med = pd.read_csv('reddit_medicine.csv')

**Keep only the 'anxiety' posts whose text contains specific keywords.**

In [None]:
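# Keep posts whose text contains any keyword fragment (case-sensitive regex match).
# na=False treats posts with missing text as non-matching.
anxiety = anxiety[anxiety['selftext'].str.contains('anx|pani|ther', na=False)]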

**Do the same for depression.**

In [None]:
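# Same approach with depression-related keyword fragments.
# na=False treats posts with missing text as non-matching.
depression = depression[depression['selftext'].str.contains('depres|lon|ther', na=False)]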
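
**A sketch of how these keyword patterns behave.** By default, `str.contains` treats its pattern as a regular expression and matches it anywhere in the text, case-sensitively. The snippet below mimics that with Python's `re` module on a few made-up sentences. The sentences and the `mentions` helper are illustrative assumptions, not part of the dataset or the pipeline. It suggests that short fragments such as 'ther' and 'lon' also match unrelated words like 'there' and 'long', and that a capitalized 'Anxiety' is missed, so tighter fragments or `case=False` may be worth considering.

```python
import re


def mentions(pattern, text):
    """Return True if the regex `pattern` matches anywhere in `text` (hypothetical helper)."""
    return re.search(pattern, text) is not None


# Patterns copied from the filtering cells above.
anxiety_pattern = 'anx|pani|ther'
depression_pattern = 'depres|lon|ther'

# Made-up sentences for illustration only.
examples = [
    'I started therapy last month',   # intended match via 'ther'
    'Is there another option?',       # unintended match via 'ther'
    'Anxiety keeps me up at night',   # capital A, so 'anx' does not match
    'It has been a long week',        # unintended match via 'lon'
]

# Print whether each pattern matches each sentence.
for text in examples:
    print(mentions(anxiety_pattern, text), mentions(depression_pattern, text), text)
```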

**Put the four unrelated subreddit dataframes into a list and concatenate them into a single 'other' dataframe.**

In [None]:
frames = [news, cute, funny, med]

In [None]:
other = pd.concat(frames)
other = other.reset_index(drop=True)

**Keep only the 'other' posts whose text has no mention of anxiety or depression.**

In [None]:
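# Keep posts that mention neither keyword fragment.
# na=True marks missing text as a match, so the ~ also drops posts with no text.
other = other[~other['selftext'].str.contains('anx|depres', na=True)]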

**Write all the data to csvs for later importing and modeling.**

In [None]:
depression.to_csv('reddit_depress_2.csv')

In [None]:
anxiety.to_csv('reddit_anxiety_2.csv')

In [None]:
other.to_csv('other.csv')