**Content Disclaimer**: This dataset contains real post titles scraped from the r/depression subreddit, and some of the titles contain language that is crude, offensive, or not safe for work. The full dataset is available as `depression.csv`, `preprocessed_thoughts.csv`, `thoughts.csv`, and `token_df.csv`. I did not provide a sanitized version of the dataset because the original wording is important for the analysis and for understanding the model. Please note that the model, the dataset, and the techniques used are not perfect. If you have any concerns about working with this dataset, reading my analysis, or the topic in general, you can skip this content entirely or click [here](http://iamsorry.com/).

# Data Wrangling and Preprocessing Notebook

# We will get data from `r/depression` and `r/Showerthoughts`

In [81]:
import p3_tools as p3t

import numpy as np
import pandas as pd

from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

import re

# Data Wrangling

This cell collects 10,000 posts from each of my selected subreddits.

In [2]:
# %%time
# shower_thoughts = p3t.get_posts('showerthoughts',10_000)
# depression = p3t.get_posts('depression', 10_000)

Wall time: 13min 47s


This cell saves them to CSV files. Each file should contain only the subreddit a post belongs to, its self text, and its title.

In [3]:
# Save the dataframes
# shower_thoughts.to_csv('saved_data/shower_thoughts.csv',index=False)
# depression.to_csv('saved_data/depression.csv',index=False)

This uses a concatenation function that combines the two dataframes and resets the index if need be. The combined dataframe is then saved to a CSV file as well.

In [9]:
# df = p3t.concat_df([shower_thoughts, depression])
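
The source of `p3t.concat_df` is not shown in this notebook, so the following is only a minimal sketch of what such a helper might look like, assuming it wraps `pd.concat` with `ignore_index=True`. The function name `concat_frames` and the toy dataframes are hypothetical.

```python
import pandas as pd


def concat_frames(frames):
    """Hypothetical stand-in for p3t.concat_df: stack dataframes and renumber rows."""
    # ignore_index=True discards the original row labels and assigns 0..n-1
    return pd.concat(frames, ignore_index=True)


# Toy data standing in for the two scraped dataframes
a = pd.DataFrame({'subreddit': ['Showerthoughts'] * 2, 'title': ['a', 'b']})
b = pd.DataFrame({'subreddit': ['depression'] * 2, 'title': ['c', 'd']})

combined = concat_frames([a, b])
print(combined.index.tolist())
```

Without `ignore_index=True`, each dataframe would keep its own 0-based index and the combined frame would contain duplicate row labels.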

In [12]:
# save big dataframe into csv
# df.to_csv('saved_data/thoughts.csv',index=False)

After all that, the data can be loaded from the CSV files using `pandas`.

In [2]:
# import the data
shower_thoughts = pd.read_csv('saved_data/shower_thoughts.csv')
depression = pd.read_csv('saved_data/depression.csv')
df = pd.read_csv('saved_data/thoughts.csv')

In [3]:
shower_thoughts.head()

Unnamed: 0,subreddit,selftext,title
0,Showerthoughts,[removed],Vampires and skeletons are just different type...
1,Showerthoughts,,Life is actually a telltale game because the c...
2,Showerthoughts,[removed],Wouldn’t a botched circumcision just be a peni...
3,Showerthoughts,,You might of made a decision that saved your l...
4,Showerthoughts,[removed],"If you have ever opened the toilet water tank,..."


In [4]:
depression.head()

Unnamed: 0,subreddit,selftext,title
0,depression,About a yr ago I was deep in a deep dark place...,Finally catching my breath.
1,depression,"Hey, i´ve posted a few times here about how my...","""Friends"""
2,depression,I fail college so I'm the failed one I guess n...,"Yes, im a failure of a child but at least if y..."
3,depression,#YoungPeople RoundTable’s second part is out! ...,Being Young With BiPolar
4,depression,I remember spending quality time with my littl...,Love is the only cure for my depression


In [5]:
df.head()

Unnamed: 0,subreddit,selftext,title
0,Showerthoughts,[removed],Vampires and skeletons are just different type...
1,Showerthoughts,,Life is actually a telltale game because the c...
2,Showerthoughts,[removed],Wouldn’t a botched circumcision just be a peni...
3,Showerthoughts,,You might of made a decision that saved your l...
4,Showerthoughts,[removed],"If you have ever opened the toilet water tank,..."


# "Token Processing"

The first thing to do after grabbing the posts and saving them to CSV files is to convert the `subreddit` column into a binary category.
```python
# Intended mapping: depression -> 1, Showerthoughts -> 0
# (preview only; the cell below performs the assignment)
df['subreddit'].map({'depression': 1, 'Showerthoughts': 0})
```

In [6]:
df['subreddit'] = np.where(df['subreddit']=='depression',1,0)
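
As a small illustration of the encoding step, here is a sketch on a toy dataframe (the `toy` frame and the `label` column names are made up for the example). It also shows an equivalent boolean-cast form.

```python
import numpy as np
import pandas as pd

# Toy data; the real notebook applies this to the full `subreddit` column
toy = pd.DataFrame({'subreddit': ['depression', 'Showerthoughts', 'depression']})

# 1 where the subreddit is 'depression', 0 for everything else
toy['label'] = np.where(toy['subreddit'] == 'depression', 1, 0)

# Equivalent: cast the boolean comparison to integers
toy['label_alt'] = (toy['subreddit'] == 'depression').astype(int)

print(toy)
```

Note that `np.where` sends every value that is not exactly `'depression'` to 0, so a misspelled or unexpected subreddit name would silently be labelled 0.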

In [7]:
df.sample(3)

Unnamed: 0,subreddit,selftext,title
6415,0,[removed],Do scary people get scared?
15297,1,I was in a relationship of 7 months. I gave ev...,Relationship ended I(22M) feel like nobody is ...
16482,1,"Guys, I am very desperate to know this: this s...",Very important!!!


Next, we will make a column that splits each title into separate words. During this process, we will also pick words that exist in `nltk.corpus.words.words()`. This may exclude names that are not in that corpus, but that should not be an issue since those words are unlikely to play a large role in distinguishing one post from another.

The following regular-expression tokenizers can help tokenize the sentences in each row under `df['title']`. The cells below use only the first one, `tokenizer`.
```python
tokenizer = RegexpTokenizer(r'\w+')
tokenizer_2 = RegexpTokenizer(r'\s+', gaps=True)
tokenizer_3 = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
```

In [8]:
tokenizer = RegexpTokenizer(r'\w+')

In [63]:
tokenizer.tokenize('I\'m always the last choice.')

['I', 'm', 'always', 'the', 'last', 'choice']
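
To see how the three patterns differ, here is a rough sketch that applies them with the standard-library `re` module instead of `RegexpTokenizer`. This is an approximation for illustration: it assumes that a non-gap pattern behaves like `re.findall` and that a `gaps=True` pattern behaves like `re.split` with empty strings dropped.

```python
import re

sentence = "I'm always the last choice."

# Pattern 1: runs of word characters; punctuation is dropped and "I'm" is split
word_tokens = re.findall(r'\w+', sentence)

# Pattern 2: split on whitespace (the gaps); punctuation stays attached to words
gap_tokens = [t for t in re.split(r'\s+', sentence) if t]

# Pattern 3: words, dollar amounts, or any other run of non-space characters
mixed_tokens = re.findall(r'\w+|\$[\d\.]+|\S+', sentence)

print(word_tokens)   # ['I', 'm', 'always', 'the', 'last', 'choice']
print(gap_tokens)    # ["I'm", 'always', 'the', 'last', 'choice.']
print(mixed_tokens)  # ['I', "'m", 'always', 'the', 'last', 'choice', '.']
```

The first pattern gives the cleanest word list for counting, at the cost of splitting contractions into two tokens.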

In [9]:
df['word_count'] = [len(tokenizer.tokenize(sentence)) for sentence in df['title']]

In [10]:
# create tokens for each title
df['tokens'] = [tokenizer.tokenize(sentence) for sentence in df['title']]

In [11]:
df.sample(3)

Unnamed: 0,subreddit,selftext,title,word_count,tokens
19213,1,i’m 16m and want to cut myself. i want to feel...,i need help,3,"[i, need, help]"
4478,0,[removed],"If I could end racism, have world peace, or wi...",18,"[If, I, could, end, racism, have, world, peace..."
18738,1,Who knew that they could come back? Lol,I'm having suicidal thoughts again,6,"[I, m, having, suicidal, thoughts, again]"


I will also add a character count. This includes spaces and punctuation. We will treat each character the way Twitter does: everything counts.

In [12]:
df['char_count'] = [len(sentence) for sentence in df['title']]

In [13]:
df.sample(3)

Unnamed: 0,subreddit,selftext,title,word_count,tokens,char_count
9472,0,[removed],Thanks to atomic weapons humanity is closest y...,14,"[Thanks, to, atomic, weapons, humanity, is, cl...",87
16201,1,So this year I’ve finally decided to look into...,What is depression?,3,"[What, is, depression]",19
1058,0,,We’re the only animals that’ll hold our breath...,13,"[We, re, the, only, animals, that, ll, hold, o...",61


Next, I want to add columns that contain lemmatized and Porter-stemmed versions of each post title. Both are built from the `tokens` column created above.

In [15]:
lemmatizer = WordNetLemmatizer()

In [18]:
df['lem_tokens'] = [[lemmatizer.lemmatize(token.lower()) for token in row] for row in df['tokens']]

In [19]:
df.sample(3)

Unnamed: 0,subreddit,selftext,title,word_count,tokens,char_count,lem_tokens
9046,0,[removed],Stress,1,[Stress],6,[stress]
4197,0,[removed],When an olympic weightlifter goes to the olymp...,14,"[When, an, olympic, weightlifter, goes, to, th...",88,"[when, an, olympic, weightlifter, go, to, the,..."
14886,1,Earlier I felt like I was on top of the world....,My head feels fucked up,5,"[My, head, feels, fucked, up]",23,"[my, head, feel, fucked, up]"


I created a PorterStemmer column as well for completeness.

In [23]:
p_stemmer = PorterStemmer()

In [24]:
df['pstem_words'] = [[p_stemmer.stem(token.lower()) for token in row] for row in df['tokens']]

In [25]:
df.sample(3)

Unnamed: 0,subreddit,selftext,title,word_count,tokens,char_count,lem_tokens,pstem_words
8644,0,,"If you expect the unexpected, the outcome will...",21,"[If, you, expect, the, unexpected, the, outcom...",129,"[if, you, expect, the, unexpected, the, outcom...","[if, you, expect, the, unexpect, the, outcom, ..."
11841,1,"I have had depression for over a decade, and e...",For those with depression that is under contro...,13,"[For, those, with, depression, that, is, under...",78,"[for, those, with, depression, that, is, under...","[for, those, with, depress, that, is, under, c..."
5163,0,,"If flowering plants did porn, most of it will ...",15,"[If, flowering, plants, did, porn, most, of, i...",87,"[if, flowering, plant, did, porn, most, of, it...","[if, flower, plant, did, porn, most, of, it, w..."


I also wanted to show the set of stop words found in each post title. This is also for completeness.

In [48]:
stop_words = set(stopwords.words('english'))  # build the lookup set once, not once per token
df['stopwords'] = [{token.lower() for token in row if token.lower() in stop_words} for row in df['tokens']]
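
The idea behind the `stopwords` column can be sketched without NLTK. The tiny stop-word list and the helper name `stopwords_in` below are assumptions made for illustration; they are not NLTK's English list or part of this notebook.

```python
# Tiny illustrative stop-word list (an assumption, NOT NLTK's English list)
STOP_WORDS = frozenset({'i', 'the', 'is', 'a', 'to', 'my'})


def stopwords_in(tokens, stop_words=STOP_WORDS):
    """Return the set of distinct stop words (lowercased) found in a token list."""
    return {token.lower() for token in tokens if token.lower() in stop_words}


found = stopwords_in(['The', 'head', 'is', 'my', 'head'])
none_found = stopwords_in(['Death', 'Eyes'])

print(found)
print(none_found)
```

Because the result is a set, each stop word appears at most once per title, and a title with no stop words yields an empty set.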

In [50]:
df.sample(3)

Unnamed: 0,subreddit,selftext,title,word_count,tokens,char_count,lem_tokens,pstem_words,stopwords
17685,1,I did not intend to write this-\n\n\-\n\nI am ...,Unexpectedly Personal,2,"[Unexpectedly, Personal]",21,"[unexpectedly, personal]","[unexpectedli, person]",{}
14802,1,\nA story about Mary Ruth\n\nShe was losing he...,Death Eyes,2,"[Death, Eyes]",10,"[death, eye]","[death, eye]",{}
5614,0,,One unrealistic thing about the bible is peopl...,39,"[One, unrealistic, thing, about, the, bible, i...",223,"[one, unrealistic, thing, about, the, bible, i...","[one, unrealist, thing, about, the, bibl, is, ...","{to, being, for, from, in, is, are, s, about, ..."


In [51]:
# saved this processed dataframe
# df.to_csv('saved_data/token_df.csv')

# Proper Preprocessing

Let's start over with a clean dataframe loaded from `thoughts.csv`.

In [84]:
# thoughts "is" the same as df before token processing
thoughts = pd.read_csv('saved_data/thoughts.csv')

In [85]:
thoughts.head()

Unnamed: 0,subreddit,selftext,title
0,Showerthoughts,[removed],Vampires and skeletons are just different type...
1,Showerthoughts,,Life is actually a telltale game because the c...
2,Showerthoughts,[removed],Wouldn’t a botched circumcision just be a peni...
3,Showerthoughts,,You might of made a decision that saved your l...
4,Showerthoughts,[removed],"If you have ever opened the toilet water tank,..."


In [86]:
thoughts.shape

(20000, 3)

As before, the first step is to convert the `subreddit` column into a binary category, this time on `thoughts`.
```python
# Intended mapping: depression -> 1, Showerthoughts -> 0
# (preview only; the cell below performs the assignment)
thoughts['subreddit'].map({'depression': 1, 'Showerthoughts': 0})
```

In [87]:
thoughts['subreddit'] = np.where(thoughts['subreddit']=='depression',1,0)

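Next, find the word and character counts by creating `word_count` and `char_count`. Here `word_count` splits on whitespace, so a contraction such as `I'm` counts as one word, unlike with the `\w+` tokenizer used earlier.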

In [88]:
thoughts['word_count'] = [len(sent.split()) for sent in thoughts['title']]
thoughts['char_count'] = [len(sent) for sent in thoughts['title']]
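
The two ways of counting words used in this notebook can disagree. Here is a small sketch of the difference, using `re.findall(r'\w+', ...)` as an assumed stand-in for the `\w+` `RegexpTokenizer`:

```python
import re

title = "I'm always the last choice."

# \w+ tokenization splits the contraction: I, m, always, the, last, choice
regex_count = len(re.findall(r'\w+', title))

# Whitespace splitting keeps "I'm" and "choice." as single words
split_count = len(title.split())

# Character count includes spaces and punctuation
char_count = len(title)

print(regex_count, split_count, char_count)  # 6 5 27
```

The gap between the two word counts grows with the number of contractions and hyphenated words in a title.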

In [89]:
thoughts.sample(3)

Unnamed: 0,subreddit,selftext,title,word_count,char_count
6448,0,,breathing is agreeing to keep living on,7,39
8022,0,[removed],Proper use of the word “infamous”,6,33
9113,0,,Blood bending is so common they have laws agai...,24,138


I will drop the `selftext` column since it does not help me: too many entries are placeholders such as '[removed]' or '[deleted]', or are empty.

In the depression subreddit there is a lot of self text, while most posts on the shower thoughts subreddit have barely any at all. So, for the sake of our analysis, the column is not needed.

In [90]:
thoughts.drop(columns='selftext',inplace=True)
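
One way to check the imbalance described above before dropping the column is to measure, per subreddit, the share of posts whose self text is missing or a placeholder. The sketch below runs on made-up toy data, not the real dataset, and keeps the subreddit names as strings for readability.

```python
import pandas as pd

# Toy data (hypothetical); the real dataframe has 20,000 rows
toy = pd.DataFrame({
    'subreddit': ['Showerthoughts', 'Showerthoughts', 'Showerthoughts',
                  'depression', 'depression'],
    'selftext': ['[removed]', None, '[removed]', 'some text', 'more text'],
})

placeholders = {'[removed]', '[deleted]'}

# True where the self text is missing or only a placeholder
unusable = toy['selftext'].isna() | toy['selftext'].isin(placeholders)

# Mean of a boolean column is the fraction of True values in each group
share = unusable.groupby(toy['subreddit']).mean()
print(share)
```

If the share is close to 1 for one class and close to 0 for the other, the column would mostly encode which subreddit a post came from rather than anything about its content.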

### Saved the preprocessed dataframe to do EDA in another notebook

In [91]:
# thoughts.to_csv('saved_data/preprocessed_thoughts.csv', index=False)