# Do Models of Mental Health Based on Social Media Generalize?

This notebook gives an overview of the code and some of the data. The best way to become familiar with the codebase is to read the source code for the `fit` and `predict` methods of the `MentalHealthClassifier`. Alternatively, see the README.md in the root of this repository to understand how datasets were acquired, processed, explored, and modeled.

Questions? Contact Keith Harrigian at kharrigian@jhu.edu.

## Notebook Setup

We recommend running this code within a conda environment (Python >= 3.6, no GPU required). Most things should work out of the box after running `pip install -e .` in the root of this repository. You may also need to procure data resources (see the README for more information). Some packages (e.g. `nltk`, `demoji`) may prompt for additional downloads on first import.
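
As a quick sanity check before running the cells, something like the following can confirm the Python version and report which third-party packages are not yet installed. This is only a sketch: `check_environment` and the `REQUIRED` tuple are illustrative names, not part of `mhlib`, and the package list is simply the set of external packages this notebook imports or mentions.

```python
import importlib.util
import sys

# External packages imported or mentioned in this notebook (illustrative list;
# extend as needed for your setup).
REQUIRED = ("joblib", "numpy", "pandas", "sklearn", "matplotlib", "nltk", "demoji")


def check_environment(required=REQUIRED, min_python=(3, 6)):
    """Return (python_ok, missing_packages) without importing the packages."""
    python_ok = sys.version_info >= min_python
    # find_spec returns None when a top-level package cannot be located
    missing = [name for name in required if importlib.util.find_spec(name) is None]
    return python_ok, missing


python_ok, missing = check_environment()
print("Python version OK:", python_ok)
print("Missing packages:", missing or "none")
```

Anything reported as missing should be resolved by the `pip install -e .` step described above.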

In [None]:
## Standard Library
import os
import sys
from glob import glob
from datetime import datetime
from pprint import PrettyPrinter

## External Libraries
import joblib
import numpy as np
import pandas as pd
from sklearn import metrics
import matplotlib.pyplot as plt

## Local Module
from mhlib.util.logging import initialize_logger
from mhlib.preprocess.tokenizer import Tokenizer
from mhlib.preprocess.preprocess import tokenizer
from mhlib.model import (data_loaders,
                         train,
                         model)

In [None]:
## Globals
MENTAL_HEALTH_DIR = "INSERT_ABSOLUTE_PATH_TO_CODE_REPOSITORY_HERE"
LOGGER = initialize_logger()
PRINTER = PrettyPrinter()

## Datasets

We provide support for interacting with the following datasets from Reddit and Twitter. Please note that appropriate data usage agreements are required to access the data. You will also need to be added to the "ouat" security group on the CLSP grid. Beside each official dataset name is the reference name used throughout the `mhlib` codebase. Datasets that are housed in the mental health data repository but not listed below require additional credentials.

### Twitter

1. **CLPsych 2015 Shared Task** [*clpsych*, *clpsych_deduped*]: Annotated via self-disclosure + manual annotation. 844 users (control group is age and gender matched). Other disorders: PTSD.
2. **Multi-Task Learning** [*multitask*]: Annotated via self-disclosure + manual annotation. 2800 users (control group is age and gender matched). Other disorders: anxiety, depression, suicide attempt, suicidal ideation, eating disorder, panic disorder, schizophrenia, borderline personality disorder, bipolar disorder, PTSD.
3. **(1)** and **(2)** combined [*merged*]: Includes all users in the CLPsych 2015 shared task dataset and Multi-Task Learning dataset, accounting for users duplicated across datasets.

### Reddit

1. **Topic Restricted Text** [*wolohan*]: Individuals who submitted an original post in r/Depression are labeled as depressed. Individuals who submitted an original post in r/AskReddit are labeled as control (as long as they weren't already in the depression sample). 7,016 control and 6,853 depression.
2. **RSDD** [*rsdd*]: Annotated via self-disclosures + manual annotation. 107,274 control and 9,210 depression.
3. **SMHD** [*smhd*]: Annotated via self-disclosures + manual annotation. 279,561 control and 7,847 depression. Contains users from RSDD. Other disorders: ADHD, anxiety, autism, bipolar disorder, OCD, PTSD, schizophrenia, eating disorders.
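
The Reddit datasets differ widely in class balance, which is worth keeping in mind when comparing models across them. The short calculation below uses only the control and depression counts listed above; `depression_rate` is an illustrative helper, not part of `mhlib`.

```python
# (control, depression) user counts as listed above for the Reddit datasets.
REDDIT_COUNTS = {
    "wolohan": (7016, 6853),
    "rsdd": (107274, 9210),
    "smhd": (279561, 7847),
}


def depression_rate(control, depression):
    """Fraction of users in a dataset that are labeled as depressed."""
    return depression / (control + depression)


for name, (control, depression) in REDDIT_COUNTS.items():
    rate = depression_rate(control, depression)
    ratio = control / depression
    print(f"{name}: {rate:.1%} depression ({ratio:.1f} controls per depressed user)")
```

The Topic Restricted Text dataset is close to balanced, while RSDD and SMHD are heavily skewed toward the control class.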

In [None]:
## View Raw Data Locations
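raw_data_dirs = [os.path.join(MENTAL_HEALTH_DIR, "data", "raw", platform)
                 for platform in ["twitter", "reddit"]]
for r in raw_data_dirs:
    ## Trailing separator in the pattern restricts matches to directories
    LOGGER.info("\n".join([i for i in glob(os.path.join(r, "*", "")) if os.path.isdir(i)]))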

In [None]:
## Load Labels for a Dataset
clpsych_labels = train.load_dataset_metadata("clpsych",
                                             "depression",
                                             random_state=42)

## Show Distribution
LOGGER.info("Class Distribution")
for lbl, count in clpsych_labels["depression"].value_counts().items():
    LOGGER.info(f"\t* {lbl}: {count}")

## Show Sample Labels
clpsych_labels.head()

In [None]:
## Optionally, Resample Dataset Balance
clpsych_labels_rebalanced = train._rebalance(clpsych_labels,
                                              "depression",
                                              target_class_ratio=[1, 5],
                                              random_seed=42)
LOGGER.info("Rebalanced Class Distribution")
for lbl, count in clpsych_labels_rebalanced["depression"].value_counts().items():
    LOGGER.info(f"\t* {lbl}: {count}")

## Optionally, Downsample Dataset
clpsych_labels_downsampled = train._downsample(clpsych_labels,
                                               downsample_size=100,
                                               random_seed=42)
LOGGER.info("Downsampled Class Distribution")
for lbl, count in clpsych_labels_downsampled["depression"].value_counts().items():
    LOGGER.info(f"\t* {lbl}: {count}")
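
To illustrate what rebalancing to a target class ratio can look like, here is a minimal sketch that downsamples classes without replacement. It is not the implementation of `train._rebalance`: the function name `rebalance_sketch`, the dictionary-based inputs, and the explicit label-to-ratio mapping are all assumptions made for this example, and how `target_class_ratio=[1, 5]` maps onto classes in `mhlib` is not shown here.

```python
import random
from collections import Counter, defaultdict


def rebalance_sketch(labels, target_ratio, seed=42):
    """Downsample so that class sizes follow target_ratio (never upsamples).

    labels: dict mapping user id -> class label
    target_ratio: dict mapping class label -> integer relative size
    """
    by_class = defaultdict(list)
    for user, label in sorted(labels.items()):
        by_class[label].append(user)
    # Largest whole number of "ratio units" that every class can supply
    unit = min(len(by_class[c]) // r for c, r in target_ratio.items())
    rng = random.Random(seed)
    kept = {}
    for c, r in target_ratio.items():
        for user in rng.sample(by_class[c], unit * r):
            kept[user] = c
    return kept


# Toy labels: 10 depression users and 100 control users
toy_labels = {f"user_dep_{i}": "depression" for i in range(10)}
toy_labels.update({f"user_ctl_{i}": "control" for i in range(100)})
kept = rebalance_sketch(toy_labels, {"depression": 1, "control": 5})
print(Counter(kept.values()))
```

With these toy inputs, all 10 depression users are kept and 50 of the 100 control users are sampled, giving a 1:5 ratio.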

## Pre-processed Data

Raw datasets have been pre-processed and stored in a consistent format (a list of JSON dictionaries). When possible, pre-processed Reddit data includes the subreddit each post comes from. To see how these files were generated, see `scripts/preprocess/` for thorough instructions.

We provide a helper class, `LoadProcessedData`, to load the pre-processed data files and optionally apply additional layers of preprocessing (e.g. filtering out negation tokens or emojis, excluding posts containing certain terms).

In [None]:
## Initialize Data Loader for Preprocessed Data
loader = data_loaders.LoadProcessedData(filter_negate=True,
                                        filter_upper=True,
                                        filter_punctuation=False,
                                        filter_numeric=True,
                                        filter_user_mentions=True,
                                        filter_url=True,
                                        filter_retweet=True,
                                        filter_stopwords=False,
                                        keep_pronouns=True,
                                        preserve_case=False,
                                        filter_empty=True,
                                        emoji_handling=None,
                                        strip_hashtag=False,
                                        max_tokens_per_document=None,
                                        max_documents_per_user=10,
                                        filter_mh_subreddits=None,
                                        filter_mh_terms="smhd",
                                        keep_retweets=True,
                                        random_state=42)

In [None]:
## Load Pre-Processed Data
sample_file = os.path.abspath(clpsych_labels.iloc[0]["source"])
LOGGER.info(f"Loading File: {sample_file}\n\n")
user_data = loader.load_user_data(sample_file,
                                  min_date=datetime(2013, 1, 1),
                                  max_date=datetime(2013, 12, 1),
                                  n_samples=20,
                                  randomized=True)
LOGGER.info("Post 1 of {}:\n".format(len(user_data)))
LOGGER.info(PRINTER.pformat(user_data[0]))
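
Two of the post-level options used when initializing the loader are excluding posts that contain certain terms and capping the number of documents per user. The sketch below shows the general idea on a toy list of JSON dictionaries. It is not how `LoadProcessedData` is implemented: `filter_posts` and its parameters are hypothetical, the example posts are made up, and only the `"text"` field is taken from the data format shown in this notebook.

```python
import json
import random


def filter_posts(posts, exclude_terms=(), max_documents=None, seed=42):
    """Drop posts mentioning any excluded term, then cap the number of posts."""
    terms = [t.lower() for t in exclude_terms]
    # Crude case-insensitive substring match; a real filter would likely match tokens
    kept = [p for p in posts
            if not any(t in p.get("text", "").lower() for t in terms)]
    if max_documents is not None and len(kept) > max_documents:
        kept = random.Random(seed).sample(kept, max_documents)
    return kept


# A user's file is a list of JSON dictionaries; here one is parsed from a string.
raw = ('[{"text": "I was diagnosed with depression"}, '
       '{"text": "Great game last night"}, '
       '{"text": "Coffee time"}]')
posts = json.loads(raw)
filtered = filter_posts(posts, exclude_terms=["depression", "diagnosed"], max_documents=10)
print([p["text"] for p in filtered])
```

Filtering out posts with explicit mental health terms matters for generalization experiments because it prevents a model from simply learning the self-disclosure phrases used during annotation.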

In [None]:
## Re-tokenizing Using Standard Tokenizer
text = user_data[0].get("text")
tokens = tokenizer.tokenize(text)
LOGGER.info("{}:\n\n{}".format(text, tokens))

In [None]:
## Re-tokenizing With Non-default Tokenizer Parameters
new_tokenizer = Tokenizer(stopwords=["else","this","for"],
                          keep_case=False,
                          negate_handling=True,
                          negate_token=False,
                          upper_flag=False,
                          keep_punctuation=False,
                          keep_numbers=False,
                          expand_contractions=True,
                          keep_user_mentions=False,
                          keep_pronouns=True,
                          keep_url=True,
                          keep_hashtags=False,
                          keep_retweets=False,
                          emoji_handling=None,
                          strip_hashtag=True)
new_tokens = new_tokenizer.tokenize(text)
LOGGER.info(new_tokens)

## Modeling

Our modeling infrastructure currently supports a classical two-step approach: 1) construct hand-crafted features from a bag-of-words representation of a user's documents, and 2) feed this representation into an estimator (e.g. logistic regression). We have abstracted much of the process into a single class, `MentalHealthClassifier`.

This class should be initialized with parameters that control vocabulary construction, feature selection, preprocessing, and model fitting. Optionally, models can be fit with a parallelized grid search that cycles through chosen hyperparameters. Ongoing work looks to incorporate unsupervised domain-adaptation methods (e.g. importance weighting, feature subspace mapping), though their effectiveness has yet to be proven.

To see a comprehensive set of examples of how this class is used, please examine code in `scripts/model/` and `scripts/experiment/`. Note that many of these scripts ingest configurations set in the `configurations/` directory.
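
For readers who want a feel for the two-step approach described above before using `MentalHealthClassifier`, the following is a rough scikit-learn analogue: TF-IDF features over a bag-of-words representation, fed into logistic regression. It is a sketch only. The toy documents and labels are made up for illustration, and the real class adds vocabulary construction, LIWC features, and feature selection that are not shown here. The estimator settings mirror the `model_kwargs` used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in data: one string per user (all of that user's posts joined).
user_docs = [
    "i feel so tired and hopeless lately",
    "cant sleep again feeling empty and alone",
    "great hike today with friends",
    "new recipe turned out delicious",
]
labels = [1, 1, 0, 0]  # 1 = depression, 0 = control (toy labels, not real data)

# Step 1: bag-of-words features (TF-IDF, L2-normalized); Step 2: estimator
pipeline = make_pipeline(
    TfidfVectorizer(norm="l2"),
    LogisticRegression(C=100, solver="lbfgs", max_iter=1000),
)
pipeline.fit(user_docs, labels)

# Predicted probability of the positive class for an unseen toy document
probs = pipeline.predict_proba(["feeling hopeless and alone"])[:, 1]
print(probs)
```

Hyperparameters such as `C` could be tuned by wrapping a pipeline like this in scikit-learn's `GridSearchCV`, which is one way to picture the grid search option mentioned above.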


In [None]:
## Construct Train/Test Splits
np.random.seed(42)
train_ind = np.random.choice(clpsych_labels.index,
                             int(clpsych_labels.shape[0]*.8),
                             replace=False)
train_labels = clpsych_labels.loc[train_ind].set_index("source")["depression"].to_dict()
target_labels = clpsych_labels.loc[~clpsych_labels.index.isin(train_ind)].set_index("source")["depression"].to_dict()

## Distributions
train_dist = pd.Series(train_labels).value_counts()
target_dist = pd.Series(target_labels).value_counts()
LOGGER.info("Training Distribution:\n{}".format(PRINTER.pformat(train_dist)))
LOGGER.info("\nTarget Distribution:\n{}".format(PRINTER.pformat(target_dist)))

In [None]:
## Initialize MentalHealthClassifier Class
mhmod = model.MentalHealthClassifier(target_disorder="depression",
                                     model="logistic",
                                     model_kwargs={
                                         "C": 100,
                                         "solver": "lbfgs",
                                         "max_iter": 1000,
                                     },
                                     vocab_kwargs={
                                         "filter_negate": False,
                                         "filter_upper": False,
                                         "filter_punctuation": False,
                                         "filter_numeric": False,
                                         "filter_user_mentions": False,
                                         "filter_url": False,
                                         "filter_retweet": False,
                                         "filter_stopwords": False,
                                         "keep_pronouns": True,
                                         "preserve_case": False,
                                         "emoji_handling": None,
                                         "filter_hashtag": False,
                                         "strip_hashtag": False,
                                         "max_vocab_size": 100000,
                                         "min_token_freq": 10,
                                         "max_token_freq": None,
                                         "ngrams": (1, 1),
                                         "max_tokens_per_document": None,
                                         "max_documents_per_user": None,
                                         "binarize_counter": True,
                                         "filter_mh_subreddits": None,
                                         "filter_mh_terms": None,
                                         "keep_retweets": True,
                                         "external_vocab": [],
                                         "external_only": False,
                                         "random_state": 42,
                                     },
                                     preprocessing_kwargs={
                                         "feature_flags": {
                                             "tfidf": True,
                                             "liwc": True,
                                             "glove": False,
                                             "lda": False,
                                         },
                                         "feature_kwargs": {
                                             "tfidf": {"norm": "l2"},
                                             "liwc": {"norm": "matched"},
                                             "glove": {"dim": 200},
                                             "lda": {"n_components": 30},
                                         },
                                     },
                                     feature_selector="pmi",
                                     feature_selection_kwargs={
                                         "min_support": 10,
                                         "top_k": 10000,
                                     },
                                     min_date="2011-01-01",
                                     max_date="2013-12-01",
                                     randomized=False,
                                     vocab_chunksize=50,
                                     jobs=4,
                                     random_state=42)

In [None]:
## Fit Model
mhmod, train_preds = mhmod.fit(train_files=sorted(train_labels.keys()),
                               label_dict=train_labels,
                               return_training_preds=True)

In [None]:
## Get Test Predictions
test_preds = mhmod.predict(sorted(target_labels.keys()),
                           min_date=None,
                           max_date=None,
                           n_samples=None,
                           randomized=False,
                           drop_null=True)

In [None]:
## Isolate Non-null Samples
train_nn = [t for t in train_labels if t in train_preds]
target_nn = [t for t in target_labels if t in test_preds]

## Format Predictions + Ground Truth
y_train_true = [int(train_labels[f]!="control") for f in sorted(train_nn)]
y_train_pred = [train_preds[f] for f in sorted(train_nn)]
y_test_true = [int(target_labels[f]!="control") for f in sorted(target_nn)]
y_test_pred = [test_preds[f] for f in sorted(target_nn)]


## Plot Class Separation
fig, ax = plt.subplots(1, 2, figsize=(10,5.8))
for g, (group, group_true, group_pred) in enumerate(zip(["Training","Test"],
                                              [y_train_true,y_test_true],
                                              [y_train_pred,y_test_pred])):
    yt = np.array(group_true)
    yp = np.array(group_pred)
    auc = metrics.roc_auc_score(yt, yp)
    for lbl in [0, 1]:
        ax[g].hist(yp[yt==lbl],
                   color=f"C{lbl}",
                   bins=np.linspace(0,1,21),
                   alpha=0.5,
                   label="Control" if lbl == 0 else "Depression")
    ax[g].set_xlabel("Probability")
    ax[g].legend(loc="best",title="True Label")
    ax[g].set_ylabel("Sample Frequency")
    ax[g].set_title("{} AUC={:.3f}".format(group, auc), loc="left", fontweight="bold")
fig.tight_layout()
plt.show()
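
The AUC shown in each panel title has a simple interpretation: it is the probability that a randomly chosen positive (depression) user receives a higher predicted score than a randomly chosen control user, with ties counted as half. The pure-Python sketch below computes it that way. `roc_auc_by_pairs` is an illustrative helper rather than part of `mhlib` or scikit-learn, and its quadratic cost makes it suitable only for small examples.

```python
def roc_auc_by_pairs(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    # A pair counts 1 if the positive scores higher, 0.5 on a tie, 0 otherwise
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_by_pairs(y_true, y_score))  # 3 of 4 pairs ranked correctly -> 0.75
```

An AUC of 0.5 therefore corresponds to chance-level ranking, which is a useful reference point when comparing the training and test panels.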

In [None]:
## Extract Feature Coefficients
coefs_ = pd.Series(index=mhmod.get_feature_names(),
                   data=mhmod.model.coef_[0])
top_coefs_ = pd.concat([coefs_.nlargest(20), coefs_.nsmallest(20)]).sort_values()

## Look at Top Coefficients
fig, ax = plt.subplots(figsize=(10,5.8))
_ = top_coefs_.plot.barh(ax=ax, color="C0", alpha=.8)
_ = ax.axvline(0, color="black", linestyle="--", alpha=.8)
_ = ax.set_xlabel("Coefficient")
fig.tight_layout()

In [None]:
## Look at Results of Feature Selection Procedure
selector = mhmod.selector._selector
pmi = selector.get_pmi().sort_values(1, ascending=False)
pmi.head(40)
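
For reference, pointwise mutual information between a token and a class label is commonly defined as `log(P(token, label) / (P(token) * P(label)))`. The sketch below computes it from binary token presence per user. It illustrates the general technique only and may differ from what the `mhlib` selector does internally: `pmi_scores` is a hypothetical helper, and treating `min_support` as the minimum number of users a token must appear for is an assumption.

```python
import math
from collections import Counter


def pmi_scores(user_tokens, user_labels, min_support=1):
    """Compute PMI for every (token, label) pair that co-occurs at least once.

    user_tokens: list of token collections, one per user (presence only)
    user_labels: list of class labels aligned with user_tokens
    min_support: assumed meaning, minimum number of users containing the token
    """
    n_users = len(user_labels)
    label_counts = Counter(user_labels)
    token_counts = Counter()
    joint_counts = Counter()
    for tokens, label in zip(user_tokens, user_labels):
        for tok in set(tokens):
            token_counts[tok] += 1
            joint_counts[tok, label] += 1
    scores = {}
    for (tok, label), joint in joint_counts.items():
        if token_counts[tok] < min_support:
            continue
        p_joint = joint / n_users
        p_tok = token_counts[tok] / n_users
        p_label = label_counts[label] / n_users
        scores[tok, label] = math.log(p_joint / (p_tok * p_label))
    return scores


# Toy example: "the" appears for every user, "sad" only for label-1 users
toy_tokens = [{"sad", "the"}, {"sad", "the"}, {"happy", "the"}, {"the"}]
toy_labels = [1, 1, 0, 0]
scores = pmi_scores(toy_tokens, toy_labels)
print(scores["the", 1], scores["sad", 1])
```

In the toy example, a token that appears for every user has a PMI of zero with either class, while a token that appears only for label-1 users has a PMI of `log(2)` with that class.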

In [None]:
## Save Model
LOGGER.info("Caching Model")
_ = mhmod.dump("clpsych.model.joblib",
               compress=5)

In [None]:
## Load a Cached Model
cached_model = joblib.load("clpsych.model.joblib")