# Characteristic Terms in Reddit Posts

This notebook takes a random sample of posts from the collected Reddit intention corpus and produces Scattertext plots of characteristic terms. Characteristic terms are those that are more frequent in this corpus than in a standard English-language corpus (loaded from spaCy). Characteristic terms are further identified by two category columns: intent (Seeking advice or Venting) and group (Mental Health or General).

In [1]:
import pandas as pd
import numpy as np
import scattertext as st
import spacy
from pprint import pprint
import pickle

Load data and models

In [2]:
#load data
MH = pd.read_csv('../data/MH_ss.csv',index_col='id', low_memory=False)
advice = pd.read_csv('../data/advice_ss.csv',index_col='id', low_memory=False)
vent = pd.read_csv('../data/vent_ss.csv',index_col='id', low_memory=False)
gen = pd.concat([advice,vent],sort=True)

In [20]:
# spaCy English model
# Note: the 'en' shortcut works on spaCy 2.x; spaCy 3+ requires a full model name such as 'en_core_web_sm'
nlp = spacy.load('en')

Because building Scattertext corpora is computationally expensive, we randomly sample 5,000 vent and 5,000 advice posts from each group (mental health and general).

In [7]:
MH_body = MH.loc[~MH['selftext'].isin(['[removed]','[deleted]'])]
vent_samp = MH_body.loc[MH_body['intent']=='VENT'].sample(5000)
adv_samp = MH_body.loc[MH_body['intent']=='ADVICE'].sample(5000)
MH_samp = pd.concat([vent_samp,adv_samp])


gen_body = gen.loc[~gen['selftext'].isin(['[removed]','[deleted]'])]
vent_samp = gen_body.loc[gen_body['intent']=='VENT'].sample(5000)
adv_samp = gen_body.loc[gen_body['intent']=='ADVICE'].sample(5000)
gen_samp = pd.concat([vent_samp,adv_samp])
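
The sampling above draws a different subset on every run, so the plots and term lists will vary slightly between executions. A minimal sketch of a reproducible variant follows, assuming the same `selftext` and `intent` columns. The helper name `sample_by_intent`, the `seed` argument, and the toy data are hypothetical and are not part of the original notebook.

```python
import pandas as pd

def sample_by_intent(df, n_per_intent, seed=0):
    """Drop removed/deleted posts, then draw n_per_intent rows per intent label."""
    kept = df.loc[~df['selftext'].isin(['[removed]', '[deleted]'])]
    parts = [
        # random_state makes the draw repeatable across runs
        kept.loc[kept['intent'] == label].sample(n_per_intent, random_state=seed)
        for label in ('VENT', 'ADVICE')
    ]
    return pd.concat(parts)

# Toy data for illustration only (not taken from the Reddit corpus)
toy = pd.DataFrame({
    'selftext': ['a', '[removed]', 'b', 'c', 'd', '[deleted]', 'e', 'f'],
    'intent': ['VENT', 'VENT', 'VENT', 'VENT',
               'ADVICE', 'ADVICE', 'ADVICE', 'ADVICE'],
})
toy_samp = sample_by_intent(toy, n_per_intent=2, seed=42)
```

In the notebook this would be called as, for example, `sample_by_intent(MH, 5000)`; fixing the seed trades a little randomness for plots that can be regenerated exactly.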

### Characteristic Terms in General Posts

To view the interactive version, please visit <a href='https://www.cs.columbia.edu/~nishi/General-Intent-Term-Visualization_10000.html'>this link</a>, where you can enter your own terms, click on the points to view their scores, and see excerpts of the posts in which they appear.

**Note**: The static page takes a few minutes to load but runs smoothly after that.

The F-score of each term in the corpus is computed as the harmonic mean of P(term | category) and P(category | term).
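
As a rough illustration of that harmonic mean, the sketch below computes an unscaled F-score from raw counts. The function name `f_score` and the counts are hypothetical. Scattertext's scaled F-score also rescales both quantities before combining them (see the Scattertext README), so this will not reproduce the library's values.

```python
def f_score(term_in_cat, term_total, cat_total):
    """Harmonic mean of P(term | category) and P(category | term)."""
    # Share of the category's tokens that are this term
    p_term_given_cat = term_in_cat / cat_total
    # Share of the term's occurrences that fall in this category
    p_cat_given_term = term_in_cat / term_total
    if p_term_given_cat + p_cat_given_term == 0:
        return 0.0
    return (2 * p_term_given_cat * p_cat_given_term
            / (p_term_given_cat + p_cat_given_term))

# Hypothetical counts: the term appears 100 times overall, 80 of them in a
# category that contains 1,000 tokens in total.
score = f_score(term_in_cat=80, term_total=100, cat_total=1000)
```

Because the harmonic mean is dominated by the smaller of its two inputs, a term scores highly only when it is both frequent within the category and concentrated in it.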

The y-axis shows the Advice F-score, while the x-axis shows the Vent F-score. Each point on the Scattertext plot represents one term's Advice F-score against its Vent F-score.

This means that the upper-left corner of the plot shows terms highly associated with advice-seeking posts only, while the lower-right corner shows terms highly associated with venting posts only. Terms in the upper-right corner are common to both.

It makes sense that advice-seeking posts tend to include terms that explicitly request advice ("any advice", "please help") and express gratitude for future responses ("appreciated", "in advance"). More interesting is the use of greetings in advice-seeking posts ("hello", "hi"), which indicates that advice-seeking posts tend to address an audience while venting posts do not. Venting appears to be associated with more profanity and negative words ("trash", "scream"). Note that many vents concern President Trump.


<img src='img/gen_intent.png'>

### Generating the Scattertext Plot:

In [42]:
#turning dfs into Scattertext Corpus 
gen_corpus = st.CorpusFromPandas(gen_samp,
                                category_col='intent',
                                text_col='selftext',
                                nlp=nlp
                               ).build()



In [44]:
html = st.produce_scattertext_explorer(gen_corpus,
          category='ADVICE',
          category_name='Advice',
          not_category_name='Vent',
          minimum_term_frequency=30,                                                                    
          width_in_pixels=1000)

open("General-Intent-Term-Visualization_10000.html", 'wb').write(html.encode('utf-8'))

12784539

### Characteristic Terms in Mental Health Posts

To view the interactive version, please visit <a href='https://www.cs.columbia.edu/~nishi/MH-Term-Visualization_10000.html'>this link</a>, where you can enter your own terms, click on the points to view their scores, and see excerpts of the posts in which they appear.

**Note**: The static page takes a few minutes to load but runs smoothly after that.

This plot again shows characteristic terms for advice-seeking vs. venting posts, this time restricted to posts relating to mental health.

Again, the y-axis shows the Advice F-score, while the x-axis shows the Vent F-score. Each point on the Scattertext plot represents one term's Advice F-score against its Vent F-score. This means that the upper-left corner of the plot shows terms highly associated with advice-seeking posts only, while the lower-right corner shows terms highly associated with venting posts only. Terms in the upper-right corner are common to both.

Some trends from the general posts recur in this corpus, such as greetings and advance gratitude in advice-seeking posts and profanity in venting posts. However, negative terms are more prevalent throughout, as are, unsurprisingly, terms for mental disorders and medications ("panic attack", "anxiety", "prozac"). The points spread out less as terms become more characteristic, which suggests that the two intents share much of their vocabulary here.

<img src='img/mh_intent.png'>

### Generating the Scattertext Plot:

In [39]:
# characteristic terms using scattertext
# for more than 2 categories, make n one-vs-rest plots (each category against the other n-1)

#turning dfs into Scattertext Corpus 
MH_corpus = st.CorpusFromPandas(MH_samp,
                                category_col='intent',
                                text_col='selftext',
                                nlp=nlp
                               ).build()



In [27]:
#characteristic terms that differentiate corpus from general English
print(list(MH_corpus.get_scaled_f_scores_vs_background().index[:10]))



['bpd', 'idk', 'anxiety', 'reddit', 'anxious', 'gon', 'texted', 'instagram', 'cptsd', 'texting']


In [28]:
MH_term_freq_df = MH_corpus.get_term_freq_df()
MH_term_freq_df['Advice Score']= MH_corpus.get_scaled_f_scores('ADVICE')
MH_term_freq_df['Vent Score']= MH_corpus.get_scaled_f_scores('VENT')

pprint(list(MH_term_freq_df.sort_values(by='Advice Score',
                                        ascending=False).index[:10]))

pprint(list(MH_term_freq_df.sort_values(by='Vent Score',
                                        ascending=False).index[:10]))

['advice',
 'any advice',
 'panic',
 'depression',
 'any',
 'help',
 'do i',
 'anxiety',
 'how to',
 'have been']
['fucking',
 'fuck',
 'wish',
 'tired',
 'hate',
 'i hate',
 'shit',
 'enough',
 'everyone',
 'i wish']


In [40]:
html = st.produce_scattertext_explorer(MH_corpus,
          category='ADVICE',
          category_name='advice',
          not_category_name='vent',
          minimum_term_frequency=20,                             
          width_in_pixels=1000)

open("MH-Term-Visualization_10000.html", 'wb').write(html.encode('utf-8'))

13989244

### Intent Terms vs. Mental Health Terms

To view the interactive version, please visit <a href='https://www.cs.columbia.edu/~nishi/Intent_Group.html'>this link</a>, where you can enter your own terms, click on the points to view their scores, and see excerpts of the posts in which they appear.

**Note**: The static page takes a few minutes to load but runs smoothly after that.

This plot shows <a href="https://github.com/JasonKessler/scattertext/blob/master/README.md#understanding-scaled-f-score">Scaled F-scores</a> by intent category over both general and mental health posts to visualize how characteristic terms change based on intent and group (general, mental health). 

The y-axis represents how characteristic a term is of advice-seeking posts, while the x-axis represents how characteristic it is of mental health-oriented posts. As a result, upper-left = advice-seeking general posts, lower-left = venting general posts, upper-right = advice-seeking mental health posts, lower-right = venting mental health posts.
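
The quadrant reading can be written out as a small lookup. This is only a sketch: the function name `quadrant` and the `midpoint` of 0.5 are assumptions for illustration (they treat both scores as lying in [0, 1] with the plot split at its center) and are not taken from the notebook or from Scattertext.

```python
def quadrant(health_score, advice_score, midpoint=0.5):
    """Name the plot quadrant for a term's (x, y) = (health, advice) scores."""
    # y-axis: higher means more characteristic of advice-seeking posts
    intent = 'advice-seeking' if advice_score >= midpoint else 'venting'
    # x-axis: higher means more characteristic of mental health posts
    group = 'mental health' if health_score >= midpoint else 'general'
    return f'{intent} {group} posts'

# Hypothetical scores for illustration only
upper_right = quadrant(health_score=0.9, advice_score=0.9)
lower_left = quadrant(health_score=0.1, advice_score=0.1)
```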

This plot clarifies how a classifier might leverage terms to classify intent and group. General advice posts seem to discuss daily issues, such as education and family members. Mental health advice posts heavily discuss symptoms such as suicide and anxiety, but also treatment ("dbt", "sessions"). General vent posts contain a lot of profanity, while mental health vent posts use more words that are specific to mental health ("breakdown", "emotions", "brain").

<img src='img/intent_vs_group.png'>

### Generating the Scattertext plot with custom coordinates

Use custom coordinates to plot the intent scaled f-score vs. mental health scaled f-score:

In [9]:
gen_samp['group'] = 'GENERAL'
MH_samp['group'] = 'MENTAL HEALTH'
merged_samp = pd.concat([gen_samp,MH_samp],sort=True)

In [10]:
merged_samp.info()

<class 'pandas.core.frame.DataFrame'>
Index: 20000 entries, c2d6y1 to ahdd6w
Data columns (total 16 columns):
Unnamed: 0         20000 non-null int64
author             20000 non-null object
body_ss            19166 non-null float64
created            20000 non-null float64
created_utc        20000 non-null int64
d_                 9936 non-null object
group              20000 non-null object
intent             20000 non-null object
link_flair_text    10772 non-null object
num_comments       20000 non-null int64
score              20000 non-null int64
selftext           19221 non-null object
subreddit          20000 non-null object
title              20000 non-null object
title_ss           20000 non-null float64
url                20000 non-null object
dtypes: float64(3), int64(4), object(9)
memory usage: 2.6+ MB


In [46]:
merged_corpus_intent = st.CorpusFromPandas(merged_samp,
                                category_col='intent',
                                text_col='selftext',
                                nlp=nlp
                               ).build()

In [47]:
merged_corpus_group = st.CorpusFromPandas(merged_samp,
                                category_col='group',
                                text_col='selftext',
                                nlp=nlp
                               ).build()

In [48]:
#get f-scores
advice_scores = merged_corpus_intent.get_scaled_f_scores('ADVICE')
health_scores = merged_corpus_group.get_scaled_f_scores('MENTAL HEALTH')

In [64]:
html = st.produce_scattertext_explorer(merged_corpus_intent,
                                       category='ADVICE',
                                       category_name='Advice',
                                       not_category_name='Venting',
                                       minimum_term_frequency=30,
                                       pmi_filter_thresold=4,
                                       width_in_pixels=1000,
                                       scores=advice_scores,
                                       sort_by_dist=False,
                                       x_coords=health_scores,
                                       y_coords=advice_scores,
                                       show_characteristic=False,
                                       metadata=(merged_corpus_intent.get_df()['group']+' ('+merged_corpus_intent.get_df()['subreddit']  + ')'),
                                       x_label='More Mental-Health related',
                                       y_label='More Advice-Seeking')
file_name = 'Intent_Group.html'
open(file_name, 'wb').write(html.encode('utf-8'))

25624332

### Pickle corpus objects

In [66]:
with open('corpus_intent_10k.pickle', 'wb') as f:
    pickle.dump(merged_corpus_intent, f)
    
with open('corpus_group_10k.pickle', 'wb') as f:
    pickle.dump(merged_corpus_group, f)


In [67]:
with open('corpus_MH_5k.pickle', 'wb') as f:
    pickle.dump(MH_corpus, f)
    
with open('corpus_gen_5k.pickle', 'wb') as f:
    pickle.dump(gen_corpus, f)
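
For reference, a pickled object is read back with `pickle.load` on a file opened in binary mode. The sketch below round-trips a plain dictionary as a stand-in for a corpus object; the helper name `save_and_reload` and the file name are hypothetical. Unpickling a real corpus requires `scattertext` to be importable in the loading session, ideally in the same version used to write the file.

```python
import os
import pickle
import tempfile

def save_and_reload(obj, path):
    """Pickle obj to path, then read it back and return the restored copy."""
    with open(path, 'wb') as f:
        pickle.dump(obj, f)
    with open(path, 'rb') as f:
        return pickle.load(f)

# Stand-in object for illustration only (not a Scattertext corpus)
stand_in = {'categories': ['ADVICE', 'VENT'], 'n_docs': 10000}
with tempfile.TemporaryDirectory() as tmp:
    restored = save_and_reload(stand_in, os.path.join(tmp, 'corpus_stand_in.pickle'))
```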


## Up Next: Meaningful Phrase Associations, Topics, t-SNE