# Creating a Stress Detection Tool using Data From Subreddits: Data Wrangling

In this notebook I use the Reddit API to collect comments from three subreddits that generally contain positive, uplifting posts. I then combine them with the Dreaddit dataset, which contains text from predominantly 'negative' subreddits and has already been labeled and scored for sentiment analysis. The combined dataset will be used to build a stress detection tool that can score new comments for signs of stress.

#### Import necessary libraries

In [1]:
import pandas as pd
import numpy as np

#reddit crawler
import praw
from praw.models import MoreComments

#preprocessing
import string

#for saving
import pickle

#### Set up client

In [2]:
r = praw.Reddit(user_agent = '',
                client_id = '',
                client_secret = '',
                check_for_async=False)
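
The client above is shown with empty credential strings. One way to keep real credentials out of the notebook is to read them from environment variables. The sketch below is only an illustration: the variable names (`REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, `REDDIT_USER_AGENT`) and the helper `load_reddit_credentials` are hypothetical and not part of the original notebook.

```python
import os

# Hypothetical environment variable names; use whatever your setup defines.
_ENV_KEYS = {
    "client_id": "REDDIT_CLIENT_ID",
    "client_secret": "REDDIT_CLIENT_SECRET",
    "user_agent": "REDDIT_USER_AGENT",
}


def load_reddit_credentials(environ=os.environ):
    """Return keyword arguments for praw.Reddit, read from the environment."""
    missing = [name for name in _ENV_KEYS.values() if not environ.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {arg: environ[name] for arg, name in _ENV_KEYS.items()}


# Usage, assuming praw is installed and the variables are set:
# r = praw.Reddit(check_for_async=False, **load_reddit_credentials())
```

Failing early with the list of missing variables makes a misconfigured environment easier to spot than an authentication error raised later by the API.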

#### Create the list of subreddits that I want to get data from
* The mental health related subreddits are already covered by the Dreaddit dataset
* I have chosen three positivity-based subreddits to balance it out

In [3]:
sr_list = ['affirmations', 'happy', 'goodnews']

#### Get comments from the hot posts in each subreddit
* The first for loop gets up to 100 hot posts from each subreddit through the API
* `replace_more` expands the collapsed "load more comments" placeholders so that only real comments remain
* The second for loop gets the comments from each of those posts
* Each comment is appended to a list, which is then converted to a dataframe

In [4]:
posts = []

for sr in sr_list:
    subreddit = r.subreddit(sr)
    
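    # Up to 100 posts from the subreddit's "hot" listing
    for post in subreddit.hot(limit=100):
        # Replace up to 100 MoreComments placeholders with the comments they hide
        post.comments.replace_more(limit=100)
        for comment in post.comments.list():
            # Store the subreddit name as a plain string rather than a PRAW object
            posts.append([str(post.subreddit), comment.body])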
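
Scraped comment threads can include comments whose body is only the placeholder `[deleted]` or `[removed]`, which carry no usable text. The notebook does not filter these out; the sketch below shows one optional way to do so. The helper name `is_usable_comment` is hypothetical.

```python
# Placeholder bodies Reddit shows for deleted or moderator-removed comments.
PLACEHOLDER_BODIES = {"[deleted]", "[removed]"}


def is_usable_comment(body):
    """Return True if a comment body contains real text worth keeping."""
    if body is None:
        return False
    stripped = body.strip()
    return bool(stripped) and stripped not in PLACEHOLDER_BODIES


# In the loop above, this check would wrap the append
# (an optional addition, not part of the original notebook):
# if is_usable_comment(comment.body):
#     posts.append([str(post.subreddit), comment.body])
```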

In [5]:
posts = pd.DataFrame(posts)

#### Checking out the data
* .shape to make sure I have enough data
* .head and .tail to make sure everything looks right

In [6]:
posts.shape

(1588, 2)

In [7]:
posts.head(5)

Unnamed: 0,0,1
0,happy,Welcome to /r/happy where we support people in...
1,happy,"You are good people, Mike."
2,happy,"Wanna come to Knight Lake near Waupaca, Wiscon..."
3,happy,"“But, I thought the old lady dropped it into t..."
4,happy,Yay! Thank you for being kind and returning it!


In [8]:
posts.tail(5)

Unnamed: 0,0,1
1583,goodnews,This is nice. Sadly only hear of when youngste...
1584,goodnews,Happy birthday! I'm so proud of you for being ...
1585,goodnews,Happy Birthday 🎂!!
1586,goodnews,Glad you are still here. Always remember that ...
1587,goodnews,There’s so much awesome stuff waiting for you....


#### Fixing the column names

In [9]:
posts.columns = ['subreddit', 'text']

In [10]:
posts.head(5)

Unnamed: 0,subreddit,text
0,happy,Welcome to /r/happy where we support people in...
1,happy,"You are good people, Mike."
2,happy,"Wanna come to Knight Lake near Waupaca, Wiscon..."
3,happy,"“But, I thought the old lady dropped it into t..."
4,happy,Yay! Thank you for being kind and returning it!


#### Some preprocessing
* I want two versions of the text available, one with capitalization and punctuation and one without, so I am adding the preprocessed text to a new column

In [11]:
def remove_punctuations(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text

posts['text_preproc'] = posts['text'].apply(remove_punctuations)

In [12]:
posts['text_preproc'] = posts['text_preproc'].str.lower()
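
`string.punctuation` only contains ASCII punctuation, which is why typographic characters such as “ and ’ are still present in the `text_preproc` previews further down. If those should be removed as well, a translation table is one option. This is a sketch, not the notebook's method: `EXTRA_PUNCTUATION` is an assumed, non-exhaustive list and `normalize_text` is a hypothetical helper.

```python
import string

# ASCII punctuation plus a few typographic characters (assumed, non-exhaustive).
EXTRA_PUNCTUATION = "“”‘’…"
_PUNCT_TABLE = str.maketrans("", "", string.punctuation + EXTRA_PUNCTUATION)


def normalize_text(text):
    """Lowercase the text and delete ASCII and typographic punctuation."""
    return text.translate(_PUNCT_TABLE).lower()


print(normalize_text("“But, I thought…” There’s more!"))
# prints: but i thought theres more
```

`str.translate` removes every listed character in a single pass, instead of calling `replace` once per punctuation character. It could be applied with `posts['text'].apply(normalize_text)`.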

#### Adding a label
* Now I am adding a label column and filling it with 0 for 'no stress' (this will make more sense when joined with the Dreaddit dataset)

In [13]:
posts['label'] = 0

#### Checking things out one more time

In [14]:
posts.head(5)

Unnamed: 0,subreddit,text,text_preproc,label
0,happy,Welcome to /r/happy where we support people in...,welcome to rhappy where we support people in t...,0
1,happy,"You are good people, Mike.",you are good people mike,0
2,happy,"Wanna come to Knight Lake near Waupaca, Wiscon...",wanna come to knight lake near waupaca wiscons...,0
3,happy,"“But, I thought the old lady dropped it into t...",“but i thought the old lady dropped it into th...,0
4,happy,Yay! Thank you for being kind and returning it!,yay thank you for being kind and returning it,0


In [15]:
posts.tail(5)

Unnamed: 0,subreddit,text,text_preproc,label
1583,goodnews,This is nice. Sadly only hear of when youngste...,this is nice sadly only hear of when youngster...,0
1584,goodnews,Happy birthday! I'm so proud of you for being ...,happy birthday im so proud of you for being ab...,0
1585,goodnews,Happy Birthday 🎂!!,happy birthday 🎂,0
1586,goodnews,Glad you are still here. Always remember that ...,glad you are still here always remember that t...,0
1587,goodnews,There’s so much awesome stuff waiting for you....,there’s so much awesome stuff waiting for you ...,0


#### Preparing the Dreaddit dataset
* First import the data

In [16]:
df1 = pd.read_csv('dreaddit-test.csv')
df2 = pd.read_csv('dreaddit-train.csv')

#### The dataset is split into test and train; I don't need that split right now, so I will rejoin them

In [17]:
# pd.concat already returns a DataFrame, so no extra conversion is needed
df3 = pd.concat([df1, df2])
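
By default `pd.concat` keeps each input's original index labels, so labels can repeat in the combined frame. This does not matter here, because the index is reset after the final join, but it explains why the index shown by `tail` is smaller than the row count. A small sketch with made-up stand-in frames (not Dreaddit data):

```python
import pandas as pd

# Two tiny stand-in frames (made-up rows, not Dreaddit data).
test_part = pd.DataFrame({"text": ["a", "b"], "label": [0, 1]})
train_part = pd.DataFrame({"text": ["c", "d", "e"], "label": [1, 0, 1]})

kept = pd.concat([test_part, train_part])
renumbered = pd.concat([test_part, train_part], ignore_index=True)

print(kept.index.tolist())        # [0, 1, 0, 1, 2]: original labels repeat
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4]: a fresh 0..n-1 range
```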

#### Viewing what we have

In [18]:
df3.head(5)

Unnamed: 0,id,subreddit,post_id,sentence_range,text,label,confidence,social_timestamp,social_karma,syntax_ari,...,lex_dal_min_pleasantness,lex_dal_min_activation,lex_dal_min_imagery,lex_dal_avg_activation,lex_dal_avg_imagery,lex_dal_avg_pleasantness,social_upvote_ratio,social_num_comments,syntax_fk_grade,sentiment
0,896,relationships,7nu7as,"[50, 55]","Its like that, if you want or not.“ ME: I have...",0,0.8,1514981000.0,22,-1.238793,...,1.0,1.2,1.0,1.65864,1.32245,1.80264,0.63,62,-0.148707,0.0
1,19059,anxiety,680i6d,"(5, 10)",I man the front desk and my title is HR Custom...,0,1.0,1493348000.0,5,7.684583,...,1.4,1.125,1.0,1.69133,1.6918,1.97249,1.0,2,7.398222,-0.065909
2,7977,ptsd,8eeu1t,"(5, 10)",We'd be saving so much money with this new hou...,1,1.0,1524517000.0,10,2.360408,...,1.1429,1.0,1.0,1.70974,1.52985,1.86108,1.0,8,3.149288,-0.036818
3,1214,ptsd,8d28vu,"[2, 7]","My ex used to shoot back with ""Do you want me ...",1,0.5,1524018000.0,5,5.997,...,1.0,1.3,1.0,1.72615,1.52,1.84909,1.0,7,6.606,-0.066667
4,1965,relationships,7r1e85,"[23, 28]",I haven’t said anything to him yet because I’m...,0,0.8,1516200000.0,138,4.649418,...,1.125,1.1429,1.0,1.75642,1.43582,1.91725,0.84,70,4.801869,0.141667


In [19]:
df3.tail(5)

Unnamed: 0,id,subreddit,post_id,sentence_range,text,label,confidence,social_timestamp,social_karma,syntax_ari,...,lex_dal_min_pleasantness,lex_dal_min_activation,lex_dal_min_imagery,lex_dal_avg_activation,lex_dal_avg_imagery,lex_dal_avg_pleasantness,social_upvote_ratio,social_num_comments,syntax_fk_grade,sentiment
2833,1713,relationships,7oee1t,"[35, 40]","* Her, a week ago: Precious, how are you? (I i...",0,1.0,1515187000.0,13,-1.369333,...,1.4,1.0,1.0,1.71133,1.45301,2.00304,0.84,16,0.254444,0.552066
2834,1133,ptsd,9p4ung,"[20, 25]",I don't have the ability to cope with it anymo...,1,1.0,1539827000.0,33,9.425478,...,1.0,1.0,1.0,1.65003,1.56842,1.81527,0.96,6,8.640664,-0.22037
2835,10442,anxiety,9nam6l,"(5, 10)",In case this is the first time you're reading ...,0,1.0,1539269000.0,2,11.060675,...,1.125,1.125,1.0,1.79768,1.49074,1.92286,1.0,1,9.951524,0.045455
2836,1834,almosthomeless,5y53ya,"[5, 10]",Do you find this normal? They have a good rela...,0,0.571429,1488938000.0,4,2.421912,...,1.1111,1.1429,1.0,1.71642,1.57627,1.89972,0.75,7,4.036765,0.159722
2837,961,ptsd,5y25cl,"[0, 5]",I was talking to my mom this morning and she s...,1,0.571429,1488910000.0,2,0.835254,...,1.0,1.0,1.0,1.68891,1.44615,1.89707,0.76,2,2.412,0.016667


In [20]:
df3.columns

Index(['id', 'subreddit', 'post_id', 'sentence_range', 'text', 'label',
       'confidence', 'social_timestamp', 'social_karma', 'syntax_ari',
       ...
       'lex_dal_min_pleasantness', 'lex_dal_min_activation',
       'lex_dal_min_imagery', 'lex_dal_avg_activation', 'lex_dal_avg_imagery',
       'lex_dal_avg_pleasantness', 'social_upvote_ratio',
       'social_num_comments', 'syntax_fk_grade', 'sentiment'],
      dtype='object', length=116)

#### As with the data pulled from the API, I want a column of unprocessed text and a column of preprocessed text

In [21]:
# remove_punctuations was defined earlier in this notebook, so it is reused here
df3['text_preproc'] = df3['text'].apply(remove_punctuations)

In [22]:
df3['text_preproc'] = df3['text_preproc'].str.lower()

In [23]:
df3.head(5)

Unnamed: 0,id,subreddit,post_id,sentence_range,text,label,confidence,social_timestamp,social_karma,syntax_ari,...,lex_dal_min_activation,lex_dal_min_imagery,lex_dal_avg_activation,lex_dal_avg_imagery,lex_dal_avg_pleasantness,social_upvote_ratio,social_num_comments,syntax_fk_grade,sentiment,text_preproc
0,896,relationships,7nu7as,"[50, 55]","Its like that, if you want or not.“ ME: I have...",0,0.8,1514981000.0,22,-1.238793,...,1.2,1.0,1.65864,1.32245,1.80264,0.63,62,-0.148707,0.0,its like that if you want or not“ me i have no...
1,19059,anxiety,680i6d,"(5, 10)",I man the front desk and my title is HR Custom...,0,1.0,1493348000.0,5,7.684583,...,1.125,1.0,1.69133,1.6918,1.97249,1.0,2,7.398222,-0.065909,i man the front desk and my title is hr custom...
2,7977,ptsd,8eeu1t,"(5, 10)",We'd be saving so much money with this new hou...,1,1.0,1524517000.0,10,2.360408,...,1.0,1.0,1.70974,1.52985,1.86108,1.0,8,3.149288,-0.036818,wed be saving so much money with this new hous...
3,1214,ptsd,8d28vu,"[2, 7]","My ex used to shoot back with ""Do you want me ...",1,0.5,1524018000.0,5,5.997,...,1.3,1.0,1.72615,1.52,1.84909,1.0,7,6.606,-0.066667,my ex used to shoot back with do you want me t...
4,1965,relationships,7r1e85,"[23, 28]",I haven’t said anything to him yet because I’m...,0,0.8,1516200000.0,138,4.649418,...,1.1429,1.0,1.75642,1.43582,1.91725,0.84,70,4.801869,0.141667,i haven’t said anything to him yet because i’m...


#### Here I am fixing the columns and getting rid of what I don't want

In [24]:
df3 = df3[['subreddit', 'text', 'text_preproc', 'label']]

In [25]:
df3.head(5)

Unnamed: 0,subreddit,text,text_preproc,label
0,relationships,"Its like that, if you want or not.“ ME: I have...",its like that if you want or not“ me i have no...,0
1,anxiety,I man the front desk and my title is HR Custom...,i man the front desk and my title is hr custom...,0
2,ptsd,We'd be saving so much money with this new hou...,wed be saving so much money with this new hous...,1
3,ptsd,"My ex used to shoot back with ""Do you want me ...",my ex used to shoot back with do you want me t...,1
4,relationships,I haven’t said anything to him yet because I’m...,i haven’t said anything to him yet because i’m...,0


In [26]:
df3.shape

(3553, 4)

#### Joining the datasets
* Now that the datasets look similar, it's time to concatenate them

In [27]:
# ignore_index=True renumbers the rows 0..n-1 in one step
df = pd.concat([posts, df3], ignore_index=True)

#### The final step in the data wrangling process is to add a readable stress label, where a `label` of 1 maps to 'stress' and 0 maps to 'no stress'

In [28]:
df['stress_label'] = np.where(df['label'] == 1, 'stress', 'no stress')
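
Every scraped comment gets the label 0, so the combined dataset has more 'no stress' rows than Dreaddit has on its own. A quick look at the class shares can be useful before modeling. The sketch below uses made-up labels as a stand-in for the `label` column; it does not report figures from the actual dataset.

```python
import numpy as np
import pandas as pd

# Made-up labels standing in for the combined frame's `label` column.
toy = pd.DataFrame({"label": [0, 0, 0, 1, 1]})
toy["stress_label"] = np.where(toy["label"] == 1, "stress", "no stress")

# Share of each class; normalize=True returns fractions instead of counts.
balance = toy["stress_label"].value_counts(normalize=True)
print(balance)
```

On the real data the same check would be `df['stress_label'].value_counts(normalize=True)`.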

In [29]:
df.head(5)

Unnamed: 0,subreddit,text,text_preproc,label,stress_label
0,happy,Welcome to /r/happy where we support people in...,welcome to rhappy where we support people in t...,0,no stress
1,happy,"You are good people, Mike.",you are good people mike,0,no stress
2,happy,"Wanna come to Knight Lake near Waupaca, Wiscon...",wanna come to knight lake near waupaca wiscons...,0,no stress
3,happy,"“But, I thought the old lady dropped it into t...",“but i thought the old lady dropped it into th...,0,no stress
4,happy,Yay! Thank you for being kind and returning it!,yay thank you for being kind and returning it,0,no stress


In [30]:
df.tail(5)

Unnamed: 0,subreddit,text,text_preproc,label,stress_label
5136,relationships,"* Her, a week ago: Precious, how are you? (I i...",her a week ago precious how are you i ignored...,0,no stress
5137,ptsd,I don't have the ability to cope with it anymo...,i dont have the ability to cope with it anymor...,1,stress
5138,anxiety,In case this is the first time you're reading ...,in case this is the first time youre reading t...,0,no stress
5139,almosthomeless,Do you find this normal? They have a good rela...,do you find this normal they have a good relat...,0,no stress
5140,ptsd,I was talking to my mom this morning and she s...,i was talking to my mom this morning and she s...,1,stress


#### Save to pickle

In [31]:
df.to_pickle('df.pickle')
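
A later notebook can load the file back with `pd.read_pickle`. The sketch below round-trips a toy frame that has the same columns as the final dataset (the rows are made up) through a temporary directory, to check that values and dtypes survive.

```python
import os
import tempfile

import pandas as pd

# Toy frame with the same columns as the final dataset (rows are made up).
toy = pd.DataFrame({
    "subreddit": ["happy", "ptsd"],
    "text": ["Yay!", "I can't cope."],
    "text_preproc": ["yay", "i cant cope"],
    "label": [0, 1],
    "stress_label": ["no stress", "stress"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "df.pickle")
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# The restored frame should match the original in values and dtypes.
round_trip_ok = restored.equals(toy)
print(round_trip_ok)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.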