# Project problem and hypothesis

**Problem and Impact**: Many individuals with clinical depression do not have the desire or drive to seek out treatment for their disorder, despite the fact that it significantly negatively impacts their quality of life, relationships, and career. Likewise, it is particularly difficult to recruit depressed patients to participate in clinical trials for new treatments, because they are seldom motivated to seek out opportunities to participate. A tool that can analyze writing from individuals on the internet and estimate their probability of being depressed would be a first step toward identifying people who would benefit from screening for clinical depression, particularly in areas with low resources and low access to healthcare, and toward refining methods for identifying and advertising clinical trial opportunities to qualified patients.

**Specific Aim**: To investigate whether an unsupervised bag-of-words machine learning approach can be used to cluster free text blog posts using either completely unsupervised clustering, or a small lexicon of words (n=12) associated with depression based on previously published work. Separately, investigate whether characteristic patterns of personal pronoun usage that are present in the speech of people with depression can be used to cluster blog posts into depressed and not depressed. 

**Data**: All previously published work in this domain area (see below) has made extensive use of expert raters to establish a "ground truth" set of depressed and non-depressed text samples. Since this project does not have the time or resources to create an expertly labeled set of text, I will approximate this by extracting approximately 400 blog posts by self-identified depressed individuals from the Depression category of the website The Mighty (https://themighty.com/category/mental-illness/depression/). To create a control, putatively "non-depressed" text sample, I will use an equal number of blog posts by runners and marathoners, collated from several collections of "best" running blogs, such as http://newfitnessgadgets.com/best-running-blogs. One could choose many sources of presumably non-depressed individuals; however, I chose runners based on the evidence from animal models that aerobic exercise, particularly running, upregulates neurogenesis in the hippocampus of the brain, which is critical for mood and anxiety, and because my research group is currently studying this phenomenon in humans. The blog text sample is expected to include more female writers than male writers, and is biased in that these individuals have chosen to write about their lives and share these stories publicly, which may be uncharacteristic of individuals with depression in the general population.

**Variables**: Since age, gender, and other demographic information will not be available for many of these writers, the data will only include the blog text, a unique writer ID number I will assign, and a binary variable indicating if it was a depression blog writer (1) or a runner blog writer (0). My model will predict the binary variable "depressed".

**Outcome**: depressed (binary outcome: 1=depressed, 0=not depressed); recall, precision, and F-measure for benchmarking

**Hypothesis**: Previous attempts to identify depressed individuals have been based on extensive (up to 10,000 word) custom lexicons and have implied that depression is too complicated for unsupervised analysis. I hypothesize that an unsupervised approach, while faster and less resource intensive, will likely perform at a lower level than mixed models based on large, hand-crafted depression-specific lexicons. 

Furthermore, I hypothesize that word usage, either the unsupervised clustering or the lexicon from previously published work, will predict depression status more accurately than personal pronoun use. The characteristic pattern of personal pronoun use has been demonstrated mainly in spoken language, not written language, so the pattern seen in depressed patients may not extend to blog posts.

**Future directions**: If this shows interesting results, instead of classifying extreme depression blogs vs. runner blogs, it would be interesting to construct a model based on text ranked by experts using a continuous scale, such as a 1-10 scale with 1 indicating no depression and 10 indicating severe depression. This would change this from a classification problem to a regression problem. 

# Datasets

### Outline of potential methods and models

0) Examine and explore data, get a feel for the average number of posts and words per individual, and decide whether to exclude any outliers on that basis

1) Use **spaCy for data preprocessing**, including identifying tokens, tagging, and parsing
    
- Also possibly use sklearn for counting and/or weighting/normalizing token frequency

2) Use **sklearn feature extraction module** (http://scikit-learn.org/stable/modules/feature_extraction.html) for sparse vectorization, custom stopwords, and **unsupervised bag-of-words** approach to classify depression status of blog posts using k-means (see: http://scikit-learn.org/stable/auto_examples/text/document_clustering.html#sphx-glr-auto-examples-text-document-clustering-py)
    
- Figure out how to use both the built-in English stopwords and custom ones for this dataset. One blog source is focused on talking about depression, so medical jargon should be excluded, and one source is talking about running, so running jargon should be excluded. We are interested in identifying patterns in the remaining writing.
    
- Use the "vocabulary" parameter to designate a very brief **depression lexicon** based on Neuman <i>et al.</i> (see below) as well as using "max_features" to find the unbiased top frequency tokens

3) Separately use the "vocabulary" parameter to explore the use of **personal pronouns** based on Rude <i>et al.</i> 2004 in the blog posts (possibly on the entire corpus to measure agreement of depression word usage vs. depression personal pronoun usage, or on a top and bottom percentage of the posts based on depression vocabulary use)

4) **Outcome measure**: binary score for "depressed" or "not depressed"

5) **Model comparison and benchmarking** For comparison to previously published models, and depression words vs. depression personal pronouns: recall and precision, and the F measure
    
- The F-measure is the harmonic mean of precision and recall (the balanced F1 score), calculated as 
<center>F = (2 x precision x recall) / (precision + recall) (ref: Tung and Lu, 2016)</center>
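The clustering and benchmarking steps in the outline above could be sketched roughly as follows. This is a minimal illustration, not the final pipeline: the posts, labels, and the `custom_stop_words` set are invented toy values, and `align_clusters` and `f_measure` are hypothetical helper names. Because k-means cluster IDs are arbitrary, the sketch maps them onto the 0/1 labels before scoring.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.metrics import precision_recall_fscore_support

# Toy posts and labels, invented for illustration only
# (1 = depression site, 0 = running site).
posts = [
    "i feel hopeless and tired and i cannot sleep at night",
    "everything feels dark and heavy and i feel hopeless",
    "i am so tired of feeling dark and heavy every night",
    "great long run this morning, felt strong on the hills",
    "new shoes for the marathon and a strong tempo run today",
    "ran the hills this morning and felt great after the long run",
]
labels = [1, 1, 1, 0, 0, 0]

# Assumption: a hand-picked set of topic jargon to exclude from each source.
custom_stop_words = {"depression", "depressed", "therapist",
                     "marathon", "run", "ran", "shoes"}
stop_words = list(ENGLISH_STOP_WORDS.union(custom_stop_words))

# Sparse TF-IDF bag-of-words features with built-in plus custom stop words.
vectorizer = TfidfVectorizer(stop_words=stop_words)
X = vectorizer.fit_transform(posts)

# Unsupervised clustering into two groups.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
clusters = kmeans.fit_predict(X)


def align_clusters(clusters, labels):
    """Flip cluster IDs if that gives better agreement with the 0/1 labels."""
    agreement = sum(int(c == y) for c, y in zip(clusters, labels))
    if agreement < len(labels) / 2:
        return [int(1 - c) for c in clusters]
    return [int(c) for c in clusters]


def f_measure(precision, recall):
    """F = (2 x precision x recall) / (precision + recall)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


predicted = align_clusters(clusters, labels)
precision, recall, f1, _ = precision_recall_fscore_support(
    labels, predicted, average="binary", zero_division=0
)
print(f"precision={precision:.2f} recall={recall:.2f} F={f1:.2f}")
```

On a real corpus the labels would only be used at this scoring step, never during clustering, so the approach stays unsupervised.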

### Explanation of available data

The data I plan to use will be collected from publicly available blog posts on a medical site (themighty.com) and from blogs culled from blogrolls for runners and marathoners. 


| variable | information | data type |
|------|------|------|
|   unique ID#  | unique ID # assigned per author | integer |
|   post  | blog post text | string |
|   depressed  | source of blog (depression site=1; running site=0) | boolean |

The depression site has approximately 400 blog posts under depression, and an equal number of running blog posts will be collected. 100 posts of each type will be randomly selected to be held back as the testing set and not used until the model is built. 
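One way to hold back 100 posts of each type is a stratified split. The sketch below uses placeholder text in place of the scraped posts and assumes exactly 400 posts per class; `random_state=42` is an arbitrary seed chosen only for reproducibility.

```python
from sklearn.model_selection import train_test_split

# Placeholder corpus standing in for the scraped posts (assumed 400 per class).
n_per_class = 400
posts = [f"placeholder post {i}" for i in range(2 * n_per_class)]
labels = [1] * n_per_class + [0] * n_per_class  # 1 = depression site, 0 = running site

# Stratifying on the label keeps the 200-post test set balanced (100 per class).
train_posts, test_posts, train_labels, test_labels = train_test_split(
    posts, labels, test_size=200, stratify=labels, random_state=42
)

print(len(train_posts), len(test_posts), sum(test_labels))
```

If some authors contribute several posts, splitting by author rather than by post (for example with scikit-learn's `GroupShuffleSplit`) may be preferable, so that one person's writing does not appear in both sets.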

# Domain knowledge

**Assumptions**: This project includes some major assumptions. First, that individuals who post about depression on a website devoted to discussing medical problems are representative of the depressed population. This is almost certainly not true, but as a first attempt, it must be used as a freely and publicly available source of writing samples from self-identified depressed individuals. Second, we assume that runners and marathoners are an adequate comparison group and are largely not depressed. Although animal models clearly show reversal of depression-like symptoms after free running with effect sizes bigger than any other treatment, this is yet to be replicated in humans despite many anecdotal reports of "runner's high" and the antidepressant effects of running. However, runners' blogs will be used in this project as a source of *mostly* non-depressed writing samples, with the understanding that some runners may be depressed, and some self-identified depressed individuals may not meet clinical thresholds for Major Depressive Disorder. 

**Experience**: I received independent funding this year to lead a small-scale clinical trial of the impact of exercise on the function of the hippocampus (the area of the brain sensitive to exercise-increased neurogenesis). Although I won't be finishing the project due to leaving my academic institution, enthusiasm for non-pharmacological treatments for depression is high and the trial will be carried on by others after I leave. Recruitment of qualified study volunteers is a huge challenge for this type of trial, and if this machine learning project could identify characteristics common to the writing of legitimately depressed individuals, better targeted ads (for example, on Facebook) could be deployed to recruit participants. 

### <i>Similar projects</i>

**There are three main published studies that I found that attempt to make some classification or diagnosis of depression using free text samples written by individuals. A common feature of all three studies is the use of an extensive custom lexicon of words concerning depression, which was typically constructed via a mix of automatic and manual methods, and in each case likely required many hours of manual labor and review by domain experts (clinicians). I do not have the time or resources to use a hand-crafted lexicon approach in this project, but the projects below are still interesting in terms of how they label or establish the "truth" in their text samples, where their samples are drawn from, and how they score the performance of their models.**


* Analyzing depression tendency of web posts using an event-driven depression tendency warning model, Chiaming Tung and Wenhsiang Lu, Artificial Intelligence in Medicine, 2016; https://www.ncbi.nlm.nih.gov/pubmed/?term=26616420
    * This study created two methods for analyzing depression content in Chinese language web posts. The first method was designed to extract negative event terms. The second created an "event driven depression tendency warning" model to predict the depression tendency of blog post authors based on their posted work. 
    
    This study extended the authors' earlier work, where they manually created a negative event lexicon for depression. (Unfortunately, the lexicon is only published in a master's thesis and is for the Chinese language, so it cannot be used in the present project.) Here, the authors extracted terms related to the description of criteria for depression diagnosis given in the Diagnostic and Statistical Manual of Mental Disorders 4, but collapsed the criteria from nine categories to four categories: negative events, negative emotions, symptoms, and negative thoughts, each with their own lexicon. Events were automatically extracted from web posts using these manually created lexicons, and then a score was calculated based on these extracted events. 
    
    The data consisted of 18,000 Chinese web posts between 2004 and 2011 from the top 30 posters on the discussion forum PTT. Advertisements were excluded. 724 posts were used as training/test data. A Chinese language specific word segmentation and part of speech tagging tool was used to segment and tag the posts for parts of speech. Each post was also annotated as "true" or "false" for depression tendency by three masters students. 
    
    By splitting their analysis into the four lexicon categories, they find that negative emotions are "usually a result of" a negative event and are "associated with" symptoms and negative thoughts. Therefore they propose an event-driven depression tendency warning model to predict which bloggers will become depressed. The outcome was a single binary score reflecting depression tendency. The lexicons were significantly smaller than in the studies described below, with 221 negative event terms, 1316 negative emotion terms, 58 symptom terms, and 31 negative thinking terms, all confirmed by psychologist raters. The authors report that the automatic event extraction had a low accuracy (0.549) even when combining features to get the best performance. They also analyzed which part of speech patterns were most effective at predicting depression, and found that the pattern VC + Nh (example given: "Deny myself") had the highest accuracy of 0.558.
    
    The authors discuss applying precision, recall, and F measure as performance measures for their binary outcome. When compared to other possible methods for predicting depression in text, the authors' current event-based prediction performed the best in recall, but not in precision. However, the authors observe that compared to the model that has the highest precision, their method increased recall rate by increasing the size of the lexicon at the expense of precision. 
    
    
* Screening internet forum participants for depression symptoms by assembling and enhancing multiple NLP methods, Christian Karmen <i>et al.</i>, Computer Methods and Programs in Biomedicine, 2015; https://www.ncbi.nlm.nih.gov/pubmed/?term=25891366
     * The goal of this study was to create a method to analyze free text from internet postings (including blogs, forums, and chat rooms) to inform people that they should be further screened for possible clinical depression. They based their method conceptually on the steps a clinician would take to diagnose a patient in the clinic. Step 1 was to "identify symptoms", in which they created a lexicon file with depression symptom terms and generated as many first and second degree synonyms as possible. Step 2 was to "identify symptom frequency" by creating a separate lexicon with synonyms for time markers - never, sometimes, often, always. Step 3 was to "sum the frequencies of symptoms" via NLP methods focusing on symptoms, frequencies, personal pronouns, and negation. Step 4 was to "use thresholds to determine severity" of depression by calculating a depression score. 
     
     The NLP processing consisted of 5 steps. 1) preprocessing - removal of formatting characters, html, and quotes of previous postings; 2) boundary detection - the Stanford Parser was used to detect sentence boundaries and a custom algorithm was used to identify phrases; 3) matching - the phrases were matched with the symptom and frequency lexicons; 4) pronouns and negation - the parts of speech tags identified by the Stanford Parser were used to detect negative words and phrases and indicate inversion of the phrase; 5) scoring - the four previous factors were used to calculate a score, where pronouns and negation are simple sums and frequencies and symptoms are weighted sums. 
     
     The free text was drawn from forum postings from the site PsychoBabble (moderated by one of the authors) and the subforum on grief. There were about 2,100 posts in the sample, reduced to 1,304 that were between 20-200 words. The grief subforum was chosen for its hypothesized high level of depression. Two mental health clinicians rated a subset of the posts and although scores were collapsed to simply indicate "depressed" or "not depressed", they still only achieved "fair" interrater reliability on classification of the posts. 
     
     For the summed score algorithm, the authors claim the precision is 1.0 and recall is 0.81. Other measures, a sentence ratio and word ratio, had lower scores. (Sentence ratio: precision 0.80, recall 0.73; word ratio: precision 0.72, recall 0.72.) To test the performance of their model, they used the expert raters as the gold standard, and took the most extreme rated samples at each end (not depressed or very depressed) and discarded any samples with middle ratings or ratings where experts disagreed. 
     
     One feature of this particular approach noted by the authors in the discussion is the relatively large lexicon that they produced via a combination of automatic and manual methods - over 10,000 words.   
     
     
* Proactive screening for depression through metaphorical and automatic text analysis, Yair Neuman <i>et al.</i>, Artificial Intelligence in Medicine, 2012; https://www.ncbi.nlm.nih.gov/pubmed/?term=22771201
    * This study used a large sample of text from the internet to create examples of "metaphorical relations in which depression is embedded" and "extracts the conceptual domains describing it". This information was then used to construct a lexicon for depression that could be used to evaluate either the "level of depression" in texts, or "whether the text is dealing with depression as a topic". When tested on a blog corpus, this study achieved an 84.2% correct classification rate (p<0.001) whether a blog post "includes signs of depression". Compared to human experts, the algorithm achieved an average of 78% precision and 76% recall. 
    
    Unfortunately, in the actual paper, due to "space limitations" the actual algorithm is not described in detail, and the authors focus on "proving empirical evidence for the system's ability to screen for depression in texts." As far as what they do discuss, their system consists of "thousands of lines of code" (C#) that perform the function of searching for metaphors that involve a particular concept, where it appears in websites as "X is like *", where X is the concept and * is a wildcard. For depression, they searched the first 20,000 results from Bing for "Depression is like *" and manually reviewed them to exclude cases like economic depression. The resulting phrases were parsed with the Stanford parser. They also manually identified related concept statements. For example, a page that said "depression is like a big cage" included the statement "You are locked inside a cage where you feel no happiness", so they additionally extracted the concepts "locked" and "no happiness". Their final lexicon included 1723 phrases that dealt with depression, with the top being dark, disease, pain, quicksand, black hole, cancer, box, emotional, life, death, black, and cloud. 
   
   In testing their algorithm on the 2004 Blogger blog corpus (proposed for use in my current project), they identified 83 posts that were "exhibiting signs of depression" by searching for the root "depress-" and manually confirming the authors attested to being depressed. They also "manually identified 100 posts in which no sign of depression was evident" to use as a control group. On this sample of 183 **posts**, they used binary logistic regression with depression as the outcome and the score from their algorithm as the independent variable. They also claim to calculate the sensitivity and specificity by using the mean of their algorithm's score (apparently across the whole depressed and non-depressed sample, although this is not specified) and categorizing the individual posts as above average score or below average score, and testing this against their label (depressed or non-depressed). They "found that its sensitivity was 0.69 and its specificity 0.97." They tested their lexicon against that produced by an online Latent Semantic Analysis hand-reviewed to remove economic concepts and achieved similar results, with a higher sensitivity for their own method (.69 vs. .59). They also examine whether the use of the first-person pronoun "I" (based on Rude <i>et al.</i>, 2004) as an additional variable improves prediction, and find that when using this additional information, their method achieved a 90.7% correct classification rate. 
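
The mean-threshold evaluation described for Neuman <i>et al.</i> can be illustrated with a small sketch. This is my reading of their description, not their code: the scores and labels below are made up, the threshold is assumed to be the mean score across the whole sample, and `sensitivity_specificity` is a hypothetical helper name.

```python
def sensitivity_specificity(scores, labels):
    """Classify each post as depressed (1) if its score is above the sample mean,
    then compare against the labels (1 = depressed, 0 = not depressed)."""
    threshold = sum(scores) / len(scores)
    predicted = [1 if s > threshold else 0 for s in scores]

    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    fn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
    tn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 0)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)

    sensitivity = tp / (tp + fn)  # share of depressed posts that were caught
    specificity = tn / (tn + fp)  # share of non-depressed posts correctly cleared
    return sensitivity, specificity


# Invented toy scores, for illustration only.
scores = [0.9, 0.8, 0.2, 0.1, 0.3, 0.7]
labels = [1, 1, 1, 0, 0, 0]
sens, spec = sensitivity_specificity(scores, labels)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Sensitivity is the same quantity as recall, so results reported this way can be compared directly with the recall figures from the other studies; specificity has no direct counterpart in the precision/recall framing.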

# Project concerns

My concerns for this project fall into two categories: data handling and usefulness of outcome measures. 

For data handling, since I only recently decided to scrape/collect my own data, I need to figure out the most efficient way to do this. I may restrict the data set to bloggers who only have one post, only have multiple posts, only have posts over 100 words, or some other criterion I discover during data exploration. 
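A possible shape for these inclusion criteria is sketched below. The records are invented, `author_id` is a hypothetical field name for the unique ID, and the thresholds are placeholders to be revisited after data exploration.

```python
from collections import Counter

# Toy records, invented for illustration; each "post" is just repeated filler text.
records = [
    {"author_id": 1, "post": "word " * 150},
    {"author_id": 1, "post": "word " * 40},
    {"author_id": 2, "post": "word " * 300},
    {"author_id": 3, "post": "word " * 99},
]


def word_count(text):
    """Count whitespace-separated tokens in a post."""
    return len(text.split())


def filter_records(records, min_words=100, min_posts_per_author=1):
    """Keep posts with more than min_words words, then keep only authors
    who still have at least min_posts_per_author such posts."""
    long_enough = [r for r in records if word_count(r["post"]) > min_words]
    posts_per_author = Counter(r["author_id"] for r in long_enough)
    return [
        r for r in long_enough
        if posts_per_author[r["author_id"]] >= min_posts_per_author
    ]


kept = filter_records(records, min_words=100)
print([(r["author_id"], word_count(r["post"])) for r in kept])
```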

For usefulness of outcome measures, all previously published studies attempting to categorize web posts as depressed or non-depressed relied on hand-crafted lexicons/vocabularies, sometimes as large as 10,000 words. These models still didn't perform all that impressively. Accurately inferring emotional state from text in this specific context, and generally, is still a relatively hard problem, so imperfect performance is expected. However, by using an unsupervised bag-of-words approach based on an extremely small lexicon (top 12 words from Neuman <i>et al.</i>), which is the only available published English depression lexicon, I expect my model will perform much worse than the published ones, because it cannot increase recall by increasing the size of the lexicon. Furthermore, the analysis of the use of personal pronouns in depressed patients is relatively well-replicated in the literature, but only in contexts other than blog posts (conversations, storytelling, therapy sessions). It is unknown whether it will apply to written blogs. 

I wish I had access to a large dataset of blogs labeled by a large number of psychiatrists/psychologists for depression content along a continuous scale. This would allow for exploration of language features of non-depressed writers vs. depressed writers with varying levels of depression. This would be prohibitively expensive in terms of time and money, though, even for a full scale academic study.  

If the model performs well, it may be used for social good (such as finding people who could benefit from screening for clinical depression) or for unethical purposes (denying someone a job, or claiming to have diagnosed them without any medical training or contact with the "patient"). Diagnosis without contact with the subject, for example as in cases where contemporary psychiatrists "diagnose" historical figures or current celebrities, typically gains a lot of attention in the popular press, but is strongly frowned upon by psychiatrists in general because of the ethics of diagnosing without examining the "patient". This ethical problem would be magnified if any person with the model or formula to "diagnose" depression believed they could completely accurately measure a living person's depression without even meeting with them.

# Outcomes

**Goals and criteria**: My goal is to collect and prepare the blog posts and their labels (depressed; not depressed) and essentially do three analyses: unbiased clustering to find keywords that group them (excluding medical terms and running terms), a short lexicon-based clustering using published words associated with depression, and personal pronoun use. I will set aside a testing set and build the models with the training and validation sets. Outcome will be performance in correctly labeling text from the testing set, recall and precision. I will compare my recall and precision with other published work identifying depression in web text. 
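The lexicon-based and pronoun-based analyses could both start from scikit-learn's `CountVectorizer` with a fixed `vocabulary`. The sketch below is illustrative only: the lexicon is the 12 top terms listed for Neuman <i>et al.</i> above, the pronoun list is my assumption about which first-person forms to count (the source only names "I"), and the two posts are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Top 12 lexicon terms reported for Neuman et al. (see Similar projects).
depression_lexicon = [
    "dark", "disease", "pain", "quicksand", "black hole", "cancer",
    "box", "emotional", "life", "death", "black", "cloud",
]
# Assumption: first-person singular forms to count as the pronoun feature.
first_person_pronouns = ["i", "me", "my", "myself", "mine"]

# ngram_range=(1, 2) is needed so the two-word term "black hole" can match.
lexicon_vectorizer = CountVectorizer(vocabulary=depression_lexicon, ngram_range=(1, 2))
# The default token pattern drops one-letter tokens, which would lose "i".
pronoun_vectorizer = CountVectorizer(
    vocabulary=first_person_pronouns, token_pattern=r"(?u)\b\w+\b"
)

# Invented example posts.
posts = [
    "I feel like I am stuck in a black hole and my life is dark",
    "Great tempo run today with the club",
]

lexicon_counts = lexicon_vectorizer.fit_transform(posts).toarray()
pronoun_counts = pronoun_vectorizer.fit_transform(posts).toarray()

# Normalize pronoun counts by post length so long posts are not favored.
pronoun_rates = [
    counts.sum() / len(post.split()) for counts, post in zip(pronoun_counts, posts)
]
print(lexicon_counts.sum(axis=1), pronoun_rates)
```

Note that "black hole" also increments the count for "black", so overlapping lexicon terms are double counted in this sketch. Either count matrix could be passed to k-means in place of the full bag-of-words matrix.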

I expect my models will perform more poorly than published ones; however, they may not for many reasons, including technical ones (different data used as input) or actual ones (perhaps unsupervised clustering will work better than the supervised methods published so far). 

I expect word usage to be the most important feature, compared to pronoun usage (see Hypothesis in the first section above).

My model cannot be too complicated, because I have limited information to put in, unless I choose to engineer a large number of additional features based on language in the blog posts, such as part of speech use patterns and frequency, or information outside of the blog posts such as the author's posting frequency. If I had additional features such as age, gender, and writing over time, I could develop a more complicated model. 

I am hoping for recall/precision results similar to published work to call the model a success. However, even if this model performs more poorly than supervised methods, it will be an interesting result because that implies that written indicators of depression may be too subtle to be detected through current automatic methods. 

This project is on a short timeline, but depending on results and interest from my colleagues in the future, I may explore whether any of them would be willing to spend time developing an expert-labeled data set for further analysis of unsupervised methods for identifying potential depression (or other mental health disorder) clinical trial participants through social media/web writing samples. 