# Overview - Data clean-up and Sentiment Analysis

This notebook loads dataframes from CSV files, each representing data from a different subreddit. It cleans them and assigns intent labels ("VENT" or "ADVICE"). For general subreddits (e.g. r/Advice, r/Vent), all posts are assigned the appropriate intent label. For mental health subreddits, a list of relevant "flairs" (e.g. "help", "rant") is compiled manually and intent labels are assigned accordingly. Finally, sentiment scores are assigned to both the title and body of the posts. The dataframes are then merged and saved in two groups: general and mental health.



## VADER Sentiment Scores for Title and Body of Posts
This initial sentiment analysis uses NLTK's VADER, a lexicon- and rule-based analyzer tuned for social media text. A drawback of this approach is that it assigns a single score to an entire post, which may be less effective on multi-sentence and multi-paragraph posts (even a single paragraph can express both positive and negative sentiment). An approach to test in the future is entity-level sentiment analysis.

Here's a simple function to apply to the dataframes once clean.

### compound_ss(row, col) : 
Apply row-wise to dataframe to compute sentiment score of specified column. Body of post = "selftext", title of post = "title" 

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

In [86]:
def compound_ss(row,col):
    if isinstance(row[col], str):
        scores = sid.polarity_scores(row[col])
        return scores['compound']
    
    #row at col is empty, return itself
    else:
        return row[col]
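
To see how the fallback branch behaves, here is a small sketch that mirrors compound_ss but takes the scorer as a parameter, so it runs without the VADER lexicon. The names compound_score and toy_scorer are hypothetical and exist only for this illustration; in the notebook the scorer is sid.polarity_scores.

```python
import math

def compound_score(row, col, scorer):
    """Mirror of compound_ss with the scorer passed in explicitly."""
    value = row[col]
    if isinstance(value, str):
        return scorer(value)['compound']
    # Non-string values (e.g. NaN for a missing body) are returned unchanged.
    return value

def toy_scorer(text):
    # Hypothetical stand-in for sid.polarity_scores; this is NOT the VADER algorithm.
    return {'compound': -1.0 if 'hate' in text.lower() else 0.0}

# Plain dicts stand in for dataframe rows, since the function only uses row[col].
rows = [
    {'title': 'I hate Mondays', 'selftext': float('nan')},
    {'title': 'Question about therapy', 'selftext': 'Some body text'},
]
title_scores = [compound_score(r, 'title', toy_scorer) for r in rows]
body_scores = [compound_score(r, 'selftext', toy_scorer) for r in rows]
```

A missing body therefore stays NaN in the score column, which is consistent with body_ss and selftext having the same non-null count in the MH.info() output later in the notebook.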


In [249]:
import pandas as pd
import numpy as np

### Load Mental Health data

Processing the mental health data first then repeating the steps for the general data below.

In [91]:
#read BPD df
BPD = pd.read_csv('BPD_01_01_2016_04_19_2020.csv')

In [88]:
#LOAD
CPTSD = pd.read_csv('CPTSD_01_01_2016_04_19_2020.csv')
depression = pd.read_csv('depression_help_04_19_2020.csv')
anxiety = pd.read_csv ('anxiety_04_19_2020.csv')

Manually inspect post flairs to compose the advice and vent flair lists.

In [87]:
BPD_flairs= BPD.link_flair_text.value_counts().index
print(BPD_flairs)

Index(['Seeking Support', 'Venting', 'DAE', 'Questions', 'Questions/Advice',
       'Input', 'Perspective Needed', 'Fuck My Life', 'CW: Suicide',
       'Positivity', 'Urgent: Coping Skills Needed', 'Other',
       'Loved Ones/Friends', 'CW: Multiple', 'Person w/o BPD',
       '#ThatBPDfeelWhen', 'CW: Self Harm', 'Medicine',
       'Affirmations/Victories', 'Articles/Information', 'Progress Post',
       'Affirmations', 'Small Triumph', 'Quiet Borderline', 'TW: Suicide',
       'Oops, I did it... again', 'Research', 'CW: Substance Abuse',
       'Letter/Note', 'DBT Question', 'TW: Self Harm', 'Therapy Rant',
       'Lesson Learned', 'CW: Abuse', 'Baby Borderline',
       'It's Not the End of the World', 'Success Story',
       'CW: Eating Disorders', 'CW: Sexual Assault',
       'New Coping Skill Achievement Unlocked!', 'Justified Anger? Vote 1-10',
       'TW: Multiple', 'Discord Server', 'Relationships', 'Radical Acceptance',
       'Insight', 'Veteran Borderline', 'TW: Abuse', 'TW: 

Load other datasets and manually add to flair lists.

In [89]:
CPTSD_flairs = CPTSD.link_flair_text.value_counts().index
depression_flairs = depression.link_flair_text.value_counts().index
anxiety_flairs = anxiety.link_flair_text.value_counts().index

print(CPTSD_flairs)
print(depression_flairs)
print(anxiety_flairs)

Index(['DAE (Does Anyone Else?)', 'CPTSD Vent / Rant',
       'Request: Emotional Support', 'CPTSD Breakthrough Moment',
       'Request Advice: CPTSD Survivors Same Background', 'Trauma Story',
       'Symptom: Dissociation ', 'Symptom: Anxiety', 'Symptom: Flashbacks',
       'Request Support: Theraputic Resources Specific to OP',
       'Symptom: Nightmares / Insomnia', 'Symptom: Emotional Dysregulation',
       'Resource: Theraputic', 'Symptom: Avoidance',
       'Symptom: Self Harm', 'CPTSD Academic / Theory',
       'Resource: Academic / Theory', 'Symptom: Self Deprecation',
       'Request Support: Academic / Theory Resources',
       'Request Advice: CPTSD Survivors Different Baground from OP',
      dtype='object')
Index(['REQUESTING ADVICE', 'REQUESTING SUPPORT', 'RANT', 'Add a Flair!',
       'STORY', 'PROVIDING SUPPORT', 'PROVIDING ADVICE', 'MOTIVATION',
       'INSPIRATION', 'OTHER', 'ANNOUNCEMENT', 'Add a flair!', 'Question',
       'Help?', 'OTHER RANT ', 'META', 'Thanks!

### Compiled list of Venting vs. Advice flairs

In [256]:
vent_flairs = ['Venting', '#ThatBPDfeelWhen', 'Therapy Rant', 'CPTSD Vent / Rant','RANT', 'OTHER RANT ', 'VENT']

advice_flairs = ['Seeking Support', 'Questions', 'Questions/Advice','Input', 'Perspective Needed',
                 'Urgent: Coping Skills Needed', 'Advice', 'Request: Emotional Support', 
                 'Request Advice: CPTSD Survivors Same Background', 'Request Support: Theraputic Resources Specific to OP',
                 'Request Support: Academic / Theory Resources','Request Advice: CPTSD Survivors Different Baground from OP',
                 'REQUESTING ADVICE', 'REQUESTING SUPPORT','Question','Help?', 'help', 'REQUESTING SUPPORT AND ADVICE',
                 'PLEASE HELP!', 'I Need Help', 'help me', 'QUESTION', 'HALP', 'REQUESTING ADVICE AND SUPPORT', 'URGENT',
                 'Requesting Support and Advice', 'Advice Needed', 'Needs A Hug/Support', 'Advice', 'Question', 
                 '(Trigger Warning) Advice Needed']

Applying functions to dataframes:
axis=1 applies the function row by row instead of the default, column by column.

In [92]:
BPD['title_ss'] = BPD.apply(compound_ss, args=['title'],axis=1)
BPD['body_ss'] = BPD.apply(compound_ss,args=['selftext'],axis=1)
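
Because the scoring function only reads one column, the same values can also be obtained by mapping over that column directly. The sketch below compares the two styles on a toy dataframe. score_text and score_row are hypothetical stand-ins (a word count, not a sentiment score) so the example runs without NLTK.

```python
import pandas as pd

def score_text(value):
    # Hypothetical stand-in for a sentiment score: the number of words.
    # Non-strings (missing values) are passed through unchanged.
    return float(len(value.split())) if isinstance(value, str) else value

def score_row(row, col):
    # Same calling convention as compound_ss(row, col).
    return score_text(row[col])

toy = pd.DataFrame({'title': ['a b', 'c'], 'selftext': ['x y z', float('nan')]})

# Row-wise: the function receives each row as a Series (axis=1).
row_wise = toy.apply(score_row, args=('selftext',), axis=1)

# Column-wise: the function receives each cell of a single column.
col_wise = toy['selftext'].map(score_text)
```

Both produce one score per post. The column-wise form avoids building a Series for every row, which may matter on dataframes of this size, though that has not been measured here.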


In [93]:
BPD.head()

Unnamed: 0.1,Unnamed: 0,author,created_utc,id,link_flair_text,num_comments,score,selftext,subreddit,title,url,created,d_,title_ss,body_ss
0,0,[deleted],1454296789,43mkdn,Seeking Support,2,1,[deleted],BPD,Fight with a coworker was publicly disrespectf...,https://www.reddit.com/r/BPD/comments/43mkdn/f...,1454311189.0,"{'author': '[deleted]', 'created_utc': 1454296...",-0.7922,0.0
1,1,SharpAtTheEdge,1454296727,43mk8p,Seeking Support,5,4,I just seem to fuck up every relationship I've...,BPD,I hate BPD so much.,https://www.reddit.com/r/BPD/comments/43mk8p/i...,1454311127.0,"{'author': 'SharpAtTheEdge', 'created_utc': 14...",-0.5719,0.8097
2,2,skyandbuildings,1454296592,43mjwl,Seeking Support,6,2,I think about killing myself all the time. As ...,BPD,I can't get the thought of suicide out of my h...,https://www.reddit.com/r/BPD/comments/43mjwl/i...,1454310992.0,"{'author': 'skyandbuildings', 'created_utc': 1...",-0.6705,-0.8059
3,3,The_JollyGreenGiant,1454296570,43mjul,Questions,12,5,Now I've realized that I've entered into a rel...,BPD,I thought I was fine over the past few weeks. ...,https://www.reddit.com/r/BPD/comments/43mjul/i...,1454310970.0,"{'author': 'The_JollyGreenGiant', 'created_utc...",0.2023,0.3735
4,4,justanotherikealamp,1454290090,43m3q5,Venting,4,14,I also have complex ptsd. I've been trapped in...,BPD,Is there anyone who could talk (or type) with ...,https://www.reddit.com/r/BPD/comments/43m3q5/i...,1454304490.0,"{'author': 'justanotherikealamp', 'created_utc...",0.0,-0.8907


In [94]:
CPTSD['title_ss'] = CPTSD.apply(compound_ss, args=['title'],axis=1)
CPTSD['body_ss'] = CPTSD.apply(compound_ss,args=['selftext'],axis=1)
depression['title_ss'] = depression.apply(compound_ss, args=['title'],axis=1)
depression['body_ss'] = depression.apply(compound_ss,args=['selftext'],axis=1)
anxiety['title_ss'] = anxiety.apply(compound_ss, args=['title'],axis=1)
anxiety['body_ss'] = anxiety.apply(compound_ss,args=['selftext'],axis=1)
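
Since the same two lines are repeated for every dataframe, a small helper can keep each score tied to the dataframe it is computed from. This is only a sketch: add_sentiment_columns and length_score are hypothetical names, and length_score is a stand-in for compound_ss so the example runs without NLTK.

```python
import pandas as pd

def add_sentiment_columns(df, score_fn):
    """Add title_ss and body_ss to df in place.

    score_fn must accept (row, col), like compound_ss.
    """
    df['title_ss'] = df.apply(score_fn, args=('title',), axis=1)
    df['body_ss'] = df.apply(score_fn, args=('selftext',), axis=1)
    return df

def length_score(row, col):
    # Hypothetical stand-in for compound_ss: string length, NaN passed through.
    value = row[col]
    return float(len(value)) if isinstance(value, str) else value

frames = {
    'first': pd.DataFrame({'title': ['ab'], 'selftext': ['abcd']}),
    'second': pd.DataFrame({'title': ['abc'], 'selftext': [float('nan')]}),
}
for frame in frames.values():
    add_sentiment_columns(frame, length_score)
```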


### Join Mental Health dataframes into single MH dataframe

In [260]:
#mental health dataframes merged
MH = pd.concat([BPD,CPTSD,depression, anxiety], ignore_index=True)
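
A short illustration of what ignore_index=True changes, using two toy dataframes (the values are made up for the example). Without it, each input keeps its own row labels, so labels repeat in the result.

```python
import pandas as pd

a = pd.DataFrame({'subreddit': ['BPD', 'BPD'], 'title': ['t1', 't2']})
b = pd.DataFrame({'subreddit': ['CPTSD'], 'title': ['t3']})

kept = pd.concat([a, b])                           # index labels 0, 1, 0
renumbered = pd.concat([a, b], ignore_index=True)  # index labels 0, 1, 2

# With duplicate labels, selecting label 0 returns two rows instead of one.
rows_for_label_0 = kept.loc[[0]]
```

This matters for the later cleaning cells, which drop rows by index label: with repeated labels, dropping one label would remove a row from every source dataframe that shares it.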

In [261]:
#drop Unnamed: 0 col
MH = MH.drop(['Unnamed: 0'], axis=1)
MH.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 179455 entries, 0 to 179454
Data columns (total 14 columns):
author             179455 non-null object
created_utc        179455 non-null int64
id                 179455 non-null object
link_flair_text    179444 non-null object
num_comments       179455 non-null int64
score              179455 non-null int64
selftext           170903 non-null object
subreddit          179455 non-null object
title              179455 non-null object
url                179455 non-null object
created            179455 non-null object
d_                 179334 non-null object
title_ss           179455 non-null float64
body_ss            170903 non-null float64
dtypes: float64(2), int64(3), object(9)
memory usage: 19.2+ MB


### Applying Intent Labels : ADVICE or VENT

In [257]:
def label_intent(row):
    flair = row['link_flair_text']
    if flair in vent_flairs:
        return 'VENT'
    elif flair in advice_flairs:
        return 'ADVICE'
    else:
        return 'OTHER'
    
#intent column mapping link_flair_text to binary intent (advice vs vent)

#remove all others


In [262]:
MH['intent'] = MH.apply(label_intent, axis=1)
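
The same labeling can be sketched without a row-wise apply, using Series.isin and numpy.select. This is an alternative technique, not what the notebook runs. The two flair lists are shortened, illustrative versions of the full lists above, and the posts are toy data.

```python
import numpy as np
import pandas as pd

# Shortened, illustrative lists; the notebook's full lists are defined earlier.
vent_flairs_demo = ['Venting', 'RANT']
advice_flairs_demo = ['Seeking Support', 'Questions']

posts = pd.DataFrame({'link_flair_text': ['Venting', 'Questions', 'Positivity', None]})

flair = posts['link_flair_text']
# Conditions are checked in order; rows matching neither get the default.
posts['intent'] = np.select(
    [flair.isin(vent_flairs_demo), flair.isin(advice_flairs_demo)],
    ['VENT', 'ADVICE'],
    default='OTHER',
)
```

Posts with a missing flair match neither list and fall through to 'OTHER', the same outcome as in label_intent.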

In [263]:
MH.head()

Unnamed: 0,author,created_utc,id,link_flair_text,num_comments,score,selftext,subreddit,title,url,created,d_,title_ss,body_ss,intent
0,[deleted],1454296789,43mkdn,Seeking Support,2,1,[deleted],BPD,Fight with a coworker was publicly disrespectf...,https://www.reddit.com/r/BPD/comments/43mkdn/f...,1454311189.0,"{'author': '[deleted]', 'created_utc': 1454296...",-0.7922,0.0,ADVICE
1,SharpAtTheEdge,1454296727,43mk8p,Seeking Support,5,4,I just seem to fuck up every relationship I've...,BPD,I hate BPD so much.,https://www.reddit.com/r/BPD/comments/43mk8p/i...,1454311127.0,"{'author': 'SharpAtTheEdge', 'created_utc': 14...",-0.5719,0.8097,ADVICE
2,skyandbuildings,1454296592,43mjwl,Seeking Support,6,2,I think about killing myself all the time. As ...,BPD,I can't get the thought of suicide out of my h...,https://www.reddit.com/r/BPD/comments/43mjwl/i...,1454310992.0,"{'author': 'skyandbuildings', 'created_utc': 1...",-0.6705,-0.8059,ADVICE
3,The_JollyGreenGiant,1454296570,43mjul,Questions,12,5,Now I've realized that I've entered into a rel...,BPD,I thought I was fine over the past few weeks. ...,https://www.reddit.com/r/BPD/comments/43mjul/i...,1454310970.0,"{'author': 'The_JollyGreenGiant', 'created_utc...",0.2023,0.3735,ADVICE
4,justanotherikealamp,1454290090,43m3q5,Venting,4,14,I also have complex ptsd. I've been trapped in...,BPD,Is there anyone who could talk (or type) with ...,https://www.reddit.com/r/BPD/comments/43m3q5/i...,1454304490.0,"{'author': 'justanotherikealamp', 'created_utc...",0.0,-0.8907,VENT


### Cleaning MH dataframe

During data retrieval, a handful of posts had a missing body of text, which resulted in bad formatting for the entire row. These rows are easily identified: the subreddit value is not in the list of valid subreddits, or a column contains text that bled over from another column.

We also want to remove posts with an intent other than asking for advice or venting, as we won't be using them in the next analyses.

**Note**: in some rows of the resulting dataframe, the body consists only of "[removed]" or "[deleted]". These rows will need to be excluded from sentiment analysis of the body. We do not remove them because their titles are intact.
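
One way the exclusion described in the note could look, sketched on toy data. PLACEHOLDERS and has_real_body are hypothetical names introduced for this example.

```python
import pandas as pd

PLACEHOLDERS = ['[removed]', '[deleted]']  # markers named in the note above

posts = pd.DataFrame({
    'title': ['t1', 't2', 't3', 't4'],
    'selftext': ['real text', '[deleted]', None, '[removed]'],
    'body_ss': [0.5, 0.0, None, 0.0],
})

# True only where a body is present and is not a placeholder.
has_real_body = posts['selftext'].notna() & ~posts['selftext'].isin(PLACEHOLDERS)

# All rows stay in the dataframe (titles remain usable);
# body scores are read only where has_real_body is True.
body_scores = posts.loc[has_real_body, 'body_ss']
```

In the head() output above, a "[deleted]" body receives a body_ss of 0.0, which would otherwise be indistinguishable from a genuinely neutral post.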

In [129]:
#drop all rows where subreddit label messed up 
valid_subreddits = ['Anxiety','BPD','CPTSD','depression_help']


In [308]:
#keep only venting or advice posts
delete_idxs = MH.loc[MH['intent']=='OTHER'].index
MH.drop(delete_idxs,inplace=True)

#delete rows where selftext == subreddit name
delete_more = MH.loc[MH['selftext'].isin(valid_subreddits)].index
MH.drop(delete_more, inplace=True)

MH.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 85721 entries, 0 to 179454
Data columns (total 15 columns):
author             85721 non-null object
created_utc        85721 non-null int64
id                 85721 non-null object
link_flair_text    85721 non-null object
num_comments       85721 non-null int64
score              85721 non-null int64
selftext           83466 non-null object
subreddit          85721 non-null object
title              85721 non-null object
url                85721 non-null object
created            85721 non-null object
d_                 85721 non-null object
title_ss           85721 non-null float64
body_ss            83466 non-null float64
intent             85721 non-null object
dtypes: float64(2), int64(3), object(10)
memory usage: 10.5+ MB
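
The same clean-up can be sketched with boolean masks instead of index-based drops. The example below also checks the subreddit column against valid_subreddits, as described in the text above. The toy rows are an assumption about what a malformed row might look like.

```python
import pandas as pd

valid_subreddits = ['Anxiety', 'BPD', 'CPTSD', 'depression_help']

posts = pd.DataFrame({
    'subreddit': ['BPD', 'some spilled text', 'CPTSD', 'Anxiety'],
    'selftext': ['body', 'body', 'CPTSD', 'body'],
    'intent': ['VENT', 'ADVICE', 'ADVICE', 'OTHER'],
})

# A row is well formed if its subreddit is valid and its body is not a subreddit name.
well_formed = (
    posts['subreddit'].isin(valid_subreddits)
    & ~posts['selftext'].isin(valid_subreddits)
)
wanted_intent = posts['intent'].isin(['ADVICE', 'VENT'])

cleaned = posts[well_formed & wanted_intent]
```

Filtering with a mask builds a new dataframe, so re-running the cell gives the same result each time.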


## Load & Prepare General Data

In [370]:
advice = pd.read_csv('advice_01_01_2016_04_19_2020.csv')
needadv = pd.read_csv('needadvice_01_01_2016_04_19_2020.csv')
venting = pd.read_csv('venting_01_01_2016_04_19_2020.csv')
vent = pd.read_csv('vent_01_01_2016_04_19_2020.csv')
rant = pd.read_csv('rant_01_01_2016_04_19_2020.csv')

In [379]:
advice_all = pd.concat([advice, needadv], ignore_index=True, sort=False)
vent_all = pd.concat([venting,vent,rant],ignore_index=True, sort=False)

### Assign intent

In [381]:
advice_all['intent']= 'ADVICE'
vent_all['intent'] = 'VENT'

### Add sentiment analysis scores for body and title

In [None]:
#sentiment analysis scores
advice_all['title_ss'] = advice_all.apply(compound_ss, args=['title'],axis=1)

In [387]:
advice_all['body_ss'] = advice_all.apply(compound_ss,args=['selftext'],axis=1)



In [389]:
# score vent_all titles from vent_all itself (not advice_all)
vent_all['title_ss'] = vent_all.apply(compound_ss, args=['title'],axis=1)


In [391]:
# score vent_all bodies from vent_all itself (not advice_all)
vent_all['body_ss'] = vent_all.apply(compound_ss,args=['selftext'],axis=1)

### Clean up data

- Move all mental health related posts to MH df
- Remove posts marked not asking for advice from advice_all

In [406]:
#move all mental health related posts to MH, delete from advice_all
mh_rows = advice_all.loc[advice_all['link_flair_text']=='Mental Health']
# pd.concat replaces DataFrame.append, which newer pandas versions no longer provide
MH = pd.concat([MH, mh_rows], ignore_index=True, sort=True)

In [414]:
mh_rows.index

Int64Index([342786, 358465, 410703, 758465, 758469, 758475, 758502, 758538,
            758545, 758556,
            ...
            784034, 784046, 784054, 784059, 784067, 784071, 784074, 784080,
            784081, 784093],
           dtype='int64', length=1917)

In [415]:
advice_all.drop(mh_rows.index, inplace=True)

In [408]:
MH.head()

Unnamed: 0.1,Unnamed: 0,author,body_ss,created,created_utc,d_,id,intent,link_flair_text,num_comments,score,selftext,subreddit,title,title_ss,url
0,,[deleted],0.0,1454311189.0,1454296789,"{'author': '[deleted]', 'created_utc': 1454296...",43mkdn,ADVICE,Seeking Support,2,1,[deleted],BPD,Fight with a coworker was publicly disrespectf...,-0.7922,https://www.reddit.com/r/BPD/comments/43mkdn/f...
1,,SharpAtTheEdge,0.8097,1454311127.0,1454296727,"{'author': 'SharpAtTheEdge', 'created_utc': 14...",43mk8p,ADVICE,Seeking Support,5,4,I just seem to fuck up every relationship I've...,BPD,I hate BPD so much.,-0.5719,https://www.reddit.com/r/BPD/comments/43mk8p/i...
2,,skyandbuildings,-0.8059,1454310992.0,1454296592,"{'author': 'skyandbuildings', 'created_utc': 1...",43mjwl,ADVICE,Seeking Support,6,2,I think about killing myself all the time. As ...,BPD,I can't get the thought of suicide out of my h...,-0.6705,https://www.reddit.com/r/BPD/comments/43mjwl/i...
3,,The_JollyGreenGiant,0.3735,1454310970.0,1454296570,"{'author': 'The_JollyGreenGiant', 'created_utc...",43mjul,ADVICE,Questions,12,5,Now I've realized that I've entered into a rel...,BPD,I thought I was fine over the past few weeks. ...,0.2023,https://www.reddit.com/r/BPD/comments/43mjul/i...
4,,justanotherikealamp,-0.8907,1454304490.0,1454290090,"{'author': 'justanotherikealamp', 'created_utc...",43m3q5,VENT,Venting,4,14,I also have complex ptsd. I've been trapped in...,BPD,Is there anyone who could talk (or type) with ...,0.0,https://www.reddit.com/r/BPD/comments/43m3q5/i...


In [413]:
list(advice_all.link_flair_text.value_counts().reset_index()['index'])

['Personal',
 'Relationships',
 'Serious',
 'Work',
 'Other',
 'School',
 'Family',
 'Technology',
 'Life Decisions',
 'Mental Health',
 'Advice Received',
 'Career',
 'Friendships',
 'COVID-19',
 'Fitness',
 'Interpersonal',
 'Education',
 'Medical',
 'Motivation',
 'Finance',
 'Housing',
 'Moving',
 'Family Loss',
 'Travel',
 'Not Asking for Advice',
 'Pet Loss',
 'Trolling/Mean/Violent',
 'Favors',
 'Relationship',
 'OTHER',
 'Not asking for advice',
 'General Advice',
 'Health',
 'JOB',
 'FRIENDSHIPS',
 'LIFE DECISIONS',
 'Friendship',
 'Social',
 'Life',
 'Friends',
 'MEDICAL',
 'Reddit',
 'NSFW',
 'Dating',
 'Family Issues',
 'Love',
 'advice',
 'help',
 'relationship',
 'Avoided topics',
 'Personal Information',
 'EDUCATION',
 'Pets',
 'No medical advice',
 'FINANCE',
 'Neighbors',
 'TRAVEL',
 'Advice',
 'Roommate',
 'Feelings',
 'Cars',
 'Sleep',
 'Driving',
 'Advertising',
 'Other ',
 'Roommates',
 'Fashion',
 'Help',
 'Clothing',
 'Music',
 'Parents',
 'family',
 'Bullying',


Some of the general advice posts were also tagged for content. Through manual inspection, I identified a list of flairs that need to be removed, such as "not asking for advice". I also identified posts tagged with mental health subject matter, which will be moved to the overall mental health dataframe.

In [417]:
remove_flairs = ['Not Asking for Advice', 'Not asking for advice', 'General Advice Not Asking for Advice', 
                 'Advertising Not Asking for Advice', 'Spam / Irrelevant / Not asking for advice','Mental',
                'Rant', 'ADHD regulated meds', 'Not asking for advice Avoided topics', 'Personal Information Not Asking for Advice',
                'Family, Depression, Serious']


more_mental_flairs = ['emotional and mental weakness', 'Mental', 'Mental Health ', 'tw!! suicide', 'Anxiety',
                   'Health/Mental health/', 'Depression/Suicide', 'Suicide', 'friendships&amp;mental health',
                   'Mental Health + Friendships', 'Emotion health', 'Depression', 'Trauma, suicide, mental illness, drug use',
                   'Family and methal health ', 'mental health', 'Mental Disorders', 'Life decision, Mental Health and Moving',
                   'Emotional']

In [428]:
more_mh_rows = advice_all.loc[advice_all['link_flair_text'].isin(more_mental_flairs)]


In [425]:
remove_idxs = advice_all.loc[advice_all['link_flair_text'].isin(remove_flairs)].index

In [420]:
# pd.concat replaces DataFrame.append, which newer pandas versions no longer provide
MH = pd.concat([MH, more_mh_rows], ignore_index=True, sort=True)

In [424]:
#drop mh_rows from advice_all df

advice_all.drop(more_mh_rows.index,inplace=True)


KeyError: '[301920 319634 325562 332287 357309 358672 379054 404321 411322 759818\n 764392 767618 768602 769610 773769 775308 775949 777214 779496] not found in axis'
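
DataFrame.drop raises a KeyError like the one above when some of the labels are no longer in the index, for example when a drop cell is run a second time. Whether that is what happened here is an assumption. The toy sketch below reproduces the behavior and shows the errors='ignore' option.

```python
import pandas as pd

posts = pd.DataFrame({'link_flair_text': ['Personal', 'Depression', 'Work']})
to_move = posts.index[posts['link_flair_text'] == 'Depression']

posts.drop(to_move, inplace=True)          # first run: removes label 1

try:
    posts.drop(to_move, inplace=True)      # second run: label 1 is already gone
    second_run_failed = False
except KeyError:
    second_run_failed = True

# errors='ignore' skips labels that are no longer present.
posts.drop(to_move, inplace=True, errors='ignore')
```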

In [427]:
advice_all.drop(remove_idxs, inplace=True)

### How many posts in MH dataset seek ADVICE vs. VENT?

It seems that post authors in these communities are more than twice as likely to seek advice or input over venting. Our dataset is imbalanced towards advice-seeking posts.

In [336]:
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
output_notebook()

In [441]:
counts = MH.intent.value_counts().reset_index()
p = figure(x_range=counts['index'], title='Post Intent Counts', plot_height=400, tools='hover', 
           tooltips=[("total posts", "@top")])
p.vbar(x=counts['index'],top=counts['intent'], width=0.6)
p.y_range.start=0
show(p)
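
The imbalance can also be expressed as numbers alongside the plot. A minimal sketch on toy data (the values below are made up and are not the notebook's counts):

```python
import pandas as pd

intent = pd.Series(['ADVICE', 'ADVICE', 'ADVICE', 'VENT'], name='intent')  # toy data

counts = intent.value_counts()                 # absolute counts per label
shares = intent.value_counts(normalize=True)   # fraction of posts per label
ratio = counts['ADVICE'] / counts['VENT']      # advice posts per vent post
```

Applied to MH.intent, the ratio would show directly whether advice posts outnumber vent posts by more than two to one.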

### How many posts in general dataset seek ADVICE vs. VENT?


Similar to the activity in mental health subreddits, the general advice subreddits see a much higher volume of posts than the general vent subreddits.

In [453]:
labels = ['ADVICE', 'VENT']
counts= [len(advice_all), len(vent_all)]
p = figure(x_range=labels, title='General Post Intent Counts', plot_height=400, tools='hover', 
           tooltips=[("total posts", "@top")])
p.vbar(x=labels,top=counts, width=0.6)
p.y_range.start=0
show(p)

### Save new dataframes

In [435]:
#save dataframes

MH.drop('Unnamed: 0', axis=1,inplace=True)

In [436]:
MH.to_csv('MH_ss.csv')

In [438]:
advice_all.drop('Unnamed: 0', axis=1, inplace=True)
advice_all.to_csv('advice_ss.csv')

In [440]:
vent_all.drop('Unnamed: 0', axis=1, inplace=True)
vent_all.to_csv('vent_ss.csv')
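
The 'Unnamed: 0' columns dropped above come from to_csv writing the index as an unnamed first column by default, which read_csv then loads as a regular column. A small round trip on toy data shows the effect and the index=False option:

```python
import io
import pandas as pd

df = pd.DataFrame({'title': ['t1', 't2'], 'title_ss': [0.1, -0.2]})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)
```

Saving with index=False would leave the reloaded files free of an extra 'Unnamed: 0' column in later notebooks.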