# Project: Mental health in Switzerland

In this notebook, we describe the multiple steps of our preprocessing pipeline and the work done on the Twitter dataset to answer our research question.

### Overview:
0. **Data Retrieval:** how we access the dataset
1. **Dataset Selection and Analysis:** a first look at the dataset, exploring potential issues
2. **Dataset Cleaning:** cleaning based on the results found in _Part 1_
3. **NLP methods:** applying NLP methods to our data to retrieve relevant data
4. **Machine Learning:** using Machine Learning to further clean our dataset
5. **Analysis:** trying to find meaning in the final dataset we retrieve
6. **Conclusion**

We quickly import the libraries to be used later:

In [None]:
import numpy as np
import pandas as pd
import pickle as pkl
import pyspark as ps
import matplotlib.pyplot as plt

#Part 1
import json
from pandas import json_normalize  # top-level import (pandas >= 1.0); the pandas.io.json path is deprecated

#Part 2
import unicodedata

#Part 3
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

## 0. Datasets Retrieval

We did not retrieve both datasets using the same method, first because they differ greatly in size, but also because they were not provided on the same platform.

Our first dataset (_**twitter-swisscom**_) was retrieved from a .zip file. Thus, we quickly had access to the entire dataset, allowing us to get an overview of all the tweets when analyzing it (cf. _Part 1_ below).

Our second dataset (_**Spinn3r**_), however, was retrieved from the cluster. We first tried to retrieve the whole dataset but quickly realized it would be impossible to do so (it was very large and took a long time to download). Thus, we only extracted the first JSON file to perform our analysis. To do this, we used the following commands:

```bash
cluster$ hadoop fs -getmerge /datasets/swiss-tweet/harvest3r_twitter_data_01-01_0.json /buffer/example.json
local$ scp -r gaspar@iccluster060.iccluster.epfl.ch:/buffer/example.json <local-path>
```

## 1. Dataset Selection & Analysis

We were provided two separate datasets containing Swiss tweets. They were formatted differently and contained different fields. On top of this, while the first dataset contains data collected over multiple years, the other only covers a span of 10 months.

The goal of our first analysis was thus to decide which dataset would be more useful but also to familiarize ourselves with the chosen dataset in order to know if we need to adapt our research questions or enrich the dataset with external information in order to perform our analysis.

After the analysis (presented below), we decided to use **dataset 2** for our project. While dataset 1 contains more precise location information (longitude and latitude), dataset 2 contains a sentiment analysis field as well as a language field. As trying to categorize the language of each tweet in the first dataset was quite expensive in terms of computation – we needed to deal with the network latency of our API requests – and a lot of preprocessing was necessary to get it to work, dataset 2 was clearly better suited for our project.

### Dataset 1 (twitter-swisscom)

The dataset comes with a *txt schema*, giving us an idea of the meaning of each column in the *tsv file* containing the tweets. We were also given a sample file to get an overview of the data, but we also obtained the complete set of tweets (5GB) in a _.zip_ file.

The dataset contains the following useful columns:

- **userId** : id identifying the user.
- **createdAt** : time the tweet was posted.
- **text** : content of the tweet.
- **placeLatitude** : latitude of the tweet.
- **placeLongitude** : longitude of the tweet.
- **sourceName** : username.
- **sourceUrl** : URL of the tweet.
- **followersCount** : number of followers.
- **friendsCount** : number of accounts the user follows.
- **statusesCount** : number of statuses of user.

The sample dataset contains a lot of NaN values: each column contains at least 1% NaN values.
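
The share of missing values per column can be measured in a few lines. The block below is a minimal sketch that uses only the standard library; the three-row sample and the helper name `missing_share` are hypothetical and only mimic the tab-separated layout. On a DataFrame that is already loaded, `df.isna().mean()` gives the same information.

```python
import csv
import io

# Hypothetical three-row sample in the same tab-separated layout;
# empty fields stand for missing values
sample_tsv = (
    "userId\ttext\tplaceLatitude\n"
    "1\thello\t46.5\n"
    "2\t\t\n"
    "3\tgruezi\t\n"
)

def missing_share(tsv_text):
    """Return the fraction of empty fields for every column of a TSV string."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    return {
        column: sum(1 for row in rows if not row[column]) / len(rows)
        for column in rows[0]
    }

shares = missing_share(sample_tsv)
```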

The complete analysis and code can be found in the [Basic Exploration dataset 1 notebook](Basic%20Exploration%20Dataset%201.ipynb)

### Dataset 2 (from Spinn3r)

#### Metadata analysis

This dataset has an elaborate description of each field available on the [spinn3r website](http://docs.spinn3r.com/?Example#content-schema). Given the amount of data present in the cluster, we only look at one day to perform our first analysis (we will later show how we scale our operations using Spark).

Unlike the previous dataset, this dataset is given in JSON format (with nested elements). As we could not find how to extract all the data directly using the *read_json* function provided, we use a JSON normalizer (provided in the *Pandas* library).

_Note: We will later see that Spark deals better with nested JSON._
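
To illustrate what normalizing a nested record means, here is a minimal sketch written in plain Python. The `flatten` helper and the sample record are hypothetical; the record is only loosely modelled on the `_source.` prefix that appears in the column names, and this is not how the Pandas normalizer is implemented.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into a single level with dot-separated keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, extending the key prefix
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat

# Hypothetical record, loosely modelled on the '_source.' prefix of the columns
record = {"_id": "42", "_source": {"main": "hello", "lang": "en"}}
flat = flatten(record)
# Drop the '_source.' prefix, as done for the DataFrame columns
flat = {key.replace("_source.", ""): value for key, value in flat.items()}
```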

The fields found in this dataset are:

In [None]:
EXAMPLE_PATH = 'swiss-tweet/example.json'
TRUE_PATH = 'reduced_tweets.json'

with open(EXAMPLE_PATH) as data_file:    
    data = json.load(data_file)

twitter_df = json_normalize(data)
#rename columns for convenience
twitter_df.columns = [column.replace('_source.','') for column in twitter_df.columns]
twitter_df.columns

Out of all these columns, we find the ones below to be the most useful for our analysis:
- **main**: contains the content of the tweet.
- **published**: gives the time on which the content was posted.
- **source_spam_probability**: probability of tweet being spam.
- **source_location**: location of the tweet.
- **tags**: tags associated with the tweet (provided by Spinn3r).
- **lang**: language of the tweet.
- **sentiment**: sentiment score of the tweet -POSITIVE, NEGATIVE, NEUTRAL-.
- **author_gender**: gender of the author -MALE, FEMALE, UNKNOWN-.
- **source_followers**: followers of the user who tweeted.
- **source_following**: number of people the user follows.

We consider these fields to be the most useful as:

- **Main** stands at the center of our analysis. We plan to perform NLP methods in order to identify relevant tweets and use the content as well to identify related words.
- **Published** can be used to map the tweets over the duration of the year and look for seasonal changes.
- **Source_location** can be used to look at the geographical distribution of the tweets.
- **Lang** will be used to filter out unwanted languages, which we need to do in order to perform our NLP tasks.
- **Author_gender** will be used to identify the gender and look at the differences between both genders.

In [None]:
columns = ['main', 'published', 'source_spam_probability', 'source_location', 'tags', 'lang', 'sentiment',
                   'author_gender', 'source_followers', 'source_following']
twitter_df = twitter_df[columns]

#### General Distribution

While this example isn't representative (especially given that it contains the tweets posted on January 1st), it can still give us insights on other fields. We assume that roughly the same categories of users were active on that day, meaning we can draw conclusions on the distribution of language and gender.

In [None]:
plt.figure(figsize=[30, 20])
lang_counts = twitter_df['lang'].value_counts()
# Pull the smaller slices outwards so that their labels stay readable
explode = [0 if i < 6 else 0.1 if i < 10 else 0.2 for i in range(len(lang_counts))]
lang_counts.plot(kind='pie', explode=explode)
# Label the legend with the unique languages (in slice order), not one entry per tweet
plt.legend(labels=lang_counts.index)
plt.axis('equal')
plt.show()

We see that English, French and German are the most frequent languages. This is good as those are the languages we plan on using.

In [None]:
plt.figure(figsize=[5, 5])
twitter_df['author_gender'].value_counts().plot(kind='pie')
plt.axis('equal')
plt.show()

We see that most accounts do not contain information on the user's gender (meaning there is no way we can have an unbiased set). However, we have a significant number of profiles where the gender is documented meaning we could use these to look at the differences in mental distress between the genders.

In [None]:
plt.figure(figsize=[10, 5])
explode = (0, 0.2, 0.4)
twitter_df['sentiment'].value_counts().plot(kind='pie', explode=explode)
plt.axis('equal')
plt.show()

Once again, we see that the vast majority of tweets are labeled as neutral while only a very small number are labeled as negative. Given our subject, we will look at both the **'NEUTRAL'** and **'NEGATIVE'** tweets.

_Note: we make the assumption that there are no false positives, meaning that a tweet showing signs of mental distress will not be labeled as **'POSITIVE'**. This means we can safely exclude these tweets from further analysis._

In [None]:
plt.figure(figsize=[8, 5])
twitter_df['source_spam_probability'].value_counts().plot(kind='pie')
plt.axis('equal')
plt.show()

Looking at the spam probability of the tweet set, we see that not a single tweet was labeled as spam. This makes us question the accuracy of the labeling as the set of tweets on that day most certainly contains spam. However, we will still use it as we assume the chance of having false positives to be very low.

In [None]:
plt.figure(figsize=[20, 10])
location_counts = twitter_df[twitter_df.lang.isin(['de', 'fr', 'en'])]['source_location'].value_counts()
location_counts.plot(kind='pie')
# Label the legend with the unique locations (in slice order), not one entry per tweet
plt.legend(labels=location_counts.index)
plt.axis('equal')
plt.show()

As the location seems to be language dependent, we only care about tweets written in the 3 languages we need. We see that:
- A lot of locations only differ in language (e.g. Switzerland and Schweiz)
- The names of the locations are not always given in the languages we are interested in (e.g. สวิตเซอร์แลนด์)
- A vast majority of the dataset is just located in 'Switzerland'
- As opposed to dataset 1, all tweets are located in Switzerland
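
One possible way to merge locations that only differ in language is a lookup table. This is a minimal sketch, not something applied in this notebook: `COUNTRY_ALIASES` and `normalize_location` are hypothetical names, and a real table would have to be built from the values actually present in **source_location**.

```python
# Hypothetical lookup table; a real one would be built from the values seen in the data
COUNTRY_ALIASES = {
    "switzerland": "Switzerland",
    "schweiz": "Switzerland",
    "suisse": "Switzerland",
    "svizzera": "Switzerland",
    "สวิตเซอร์แลนด์": "Switzerland",
}

def normalize_location(location):
    """Map known spellings of a location to one canonical name; keep unknown values."""
    if location is None:
        return None
    stripped = location.strip()
    return COUNTRY_ALIASES.get(stripped.lower(), stripped)
```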

In [None]:
twitter_df.count()

Looking at the missing values in the dataset, we see that for most tweets, the tags are missing. This may indicate that the tags won't be useful for our analysis, but this does not influence our research at this stage.

We also note that for some tweets, the sentiment is missing. As noted on Spinn3r, this may be due to the fact that some tweets do not contain enough linguistic information. As we will filter out such tweets, the remaining set should contain sentiment. Even if it doesn't, this field is not central to our analysis; it is merely used to help us filter our dataset.

In [None]:
plt.figure(figsize=[6, 5])
twitter_df[~(twitter_df.sentiment.isin(['POSITIVE', 'NEGATIVE', 'NEUTRAL']))]['lang'].value_counts().plot(kind='pie')
plt.axis('equal')
plt.show()

To be sure this won't be an issue, we quickly look at all the uncategorized tweets and see that the language is unknown as well, meaning they will all be removed anyway.

#### Potential issues

While this set of tweets is not representative, we can still use it to find potential issues we might have with the tweets' content:

In [None]:
pd.set_option('display.max_colwidth', None)  # None shows the full text (-1 is deprecated)
twitter_df.sample(n=10)['main']

Looking at the sample, we immediately see that the tweets containing links are not relevant to our research question (they are mostly news or ads).

In [None]:
twitter_df.main[twitter_df.main.map(lambda x: 'http://' in x)].head(10)

Looking more intently at the tweets containing links, we can make the assumption that the relationship between URLs and spam is a general rule (at any time of the year).

We now perform a vastly simplified version of the dictionary matching we will later use to retrieve relevant tweets, and analyze the results.

Here we look at the occurrence of 'suicide' in the tweets:

In [None]:
pd.set_option('display.max_colwidth', 100)
twitter_df[twitter_df['main'].map(lambda x: 'suicide' in x) ]['main']

By performing a simplistic dictionary matching using the occurrences of the word *'suicide'* in our tweet set, we see that a lot of these tweets contain news. This further supports our choice to remove tweets with URLs in order to get better results.

In [None]:
twitter_df[twitter_df['main'].map(lambda x: 'therapie' in x) ]['main']

Looking at the word *'therapy'* (in German), we can confirm once again the issue there is with URLs.

In [None]:
twitter_df[twitter_df['main'].map(lambda x: 'RT ' in x) ]['main'].head()

Finally, we look at the retweets to get an idea on how useful they could be. From what we see, and what others have seen [8], they are not.

We can now apply the knowledge gained from this analysis to start cleaning our data.

## 2. Dataset Cleaning

Now that we have done some data exploration, we have to clean our dataset to be able to use it correctly for our research.

First, we use Pandas locally on a small subset of the tweets to explore different cleaning methods and make sure the functions we chose work as expected. After proving that our concept works, we use Spark to scale up our operations and be able to perform them on the 320 files provided on the cluster.

_Note: The Spark version of this code can be found in the [run.py](run.py) file._

### 2.1 Unnesting the JSON

As mentioned in _Part 1_, the provided dataset uses a nested JSON format. Thus, to be able to work with it, we need to unnest it using the _normalize_ function.

_Note: we normalize the **data** object that was loaded in **Part 1**._

In [None]:
cleaned = json_normalize(data)
cleaned.columns = [column.replace('_source.','') for column in cleaned.columns]

### 2.2 Column Selection

Once we have a functional Dataframe, we choose the useful columns determined above.

In [None]:
cleaned = cleaned[['main', 'published', 'source_spam_probability', 'source_location', 'tags', 'lang', 'sentiment',
                   'author_gender', 'source_followers', 'source_following']]

As mentioned in _Part 1_, we chose these columns for the possible insights they can give us. Some of the questions we want to answer in our research are:
- _Are men or women more prone to depression?_
- _Is someone with more followers happier?_

### 2.3 Language Filtering

The dataset provided contains tweets in a lot of languages. However, as our project is focused on the Swiss population, we initially wanted to keep only tweets written in the main official Swiss languages (i.e. French, German and Italian).

However, two hurdles pushed us to slightly adapt this choice:
- Nobody in the group speaks Italian
- Most of the tweets are written in English (this is due to the large English-speaking community in Switzerland, but also to the fact that English is a de facto choice when communicating through the Internet)

Moreover, we should not forget the existence of Swiss-German dialects, which also complicate the task at hand.

To work our way through these problems, we decided to keep 3 languages: English, French and German. Note that we mean "German" in an extended manner as we also include Swiss-German dialects in the lot (this will be taken into account when creating our dictionaries as some words slightly change between the two).

In [None]:
lang_mask = ~cleaned.lang.isin(['de', 'en', 'fr'])
cleaned.drop(cleaned[lang_mask].index, inplace=True)
cleaned.reset_index(drop=True, inplace=True)

### 2.4 Sentiment Analysis

As we use a dataset with sentiment labeling, we use this opportunity to drop all tweets marked as **POSITIVE** given the nature of our subject (we are looking for signs of mental distress). 

In [None]:
sent_mask = (cleaned['sentiment'] == 'POSITIVE')
cleaned.drop(cleaned[sent_mask].index, inplace=True)
cleaned.reset_index(drop=True, inplace=True)

As this labeling is largely based on emojis, we expect some false positives in the **POSITIVE** category. However, it is too difficult to consider the sarcastic use of the _:)_ emoji (text analysis is not refined enough yet to detect such subtleties). Thus, we knowingly choose this method as we are only interested in having an overview of tweets showcasing signs of mental illnesses and not listing all of them.

### 2.5 Spam Removal

Spam is not relevant to any of our objectives. Thus, we decide to drop all tweets whose spam probability (given in the **source_spam_probability** field) is 0.5 or higher. Once again, we decide to trust the system that treated the data.

In [None]:
spam_mask = (cleaned['source_spam_probability'] >= 0.5)
cleaned.drop(cleaned[spam_mask].index, inplace=True)
cleaned.reset_index(drop=True, inplace=True)

_Note: our choice to have a 50% threshold follows the maximum likelihood rule and was considered to be a sufficiently robust value._

### 2.6 Time Format Encoding

To be able to use the time in our analysis (especially to find seasonal patterns in depression), we need the dates to be properly formatted.

In [None]:
cleaned['published'] = pd.to_datetime(cleaned['published'])
cleaned['published'].head()
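
Once the dates are parsed, looking for seasonal patterns mostly comes down to counting tweets per period. The block below is a minimal sketch in plain Python; `tweets_per_month` is a hypothetical helper and the timestamp format is an assumption that should be checked against the real **published** values. With the datetime column above, `cleaned['published'].dt.month` exposes the month directly.

```python
from collections import Counter
from datetime import datetime

def tweets_per_month(timestamps, fmt="%Y-%m-%dT%H:%M:%SZ"):
    """Count tweets per (year, month) from timestamp strings in the given format."""
    counts = Counter()
    for stamp in timestamps:
        parsed = datetime.strptime(stamp, fmt)
        counts[(parsed.year, parsed.month)] += 1
    return counts

# Hypothetical timestamps; the real format of 'published' should be checked first
example = ["2016-01-01T00:05:12Z", "2016-01-01T13:40:00Z", "2016-02-14T09:00:00Z"]
monthly = tweets_per_month(example)
```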

### 2.7 Text Treatment

Now that we only have relevant tweets (following a mild cleaning), we need to work on the main subject of our analysis: the content of the tweets itself. This prepares _Part 3_ of our research (Natural Language Processing), where we process the text and look for patterns.

The first step of our treatment is to transform all the characters to lowercase to be able to compare tweets easily without being bothered with case-sensitive searches. Following the same idea, we get rid of all the special characters complicating our task.

In [None]:
data_cleaning = lambda data: unicodedata.normalize('NFD', data.lower()).encode('ascii', 'ignore').decode('utf-8')

In [None]:
cleaned['main'] = cleaned['main'].map(data_cleaning)
cleaned['main'].head()
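
To make the effect of this normalization concrete, here is a small standalone sketch of the same idea (the function name `strip_accents_lower` and the example words are ours). NFD splits an accented letter into its base letter plus a combining mark, and the ASCII encoding then drops the mark.

```python
import unicodedata

def strip_accents_lower(text):
    """Lowercase, decompose accented characters (NFD) and drop everything outside ASCII."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return decomposed.encode("ascii", "ignore").decode("ascii")

examples = ["Thérapie", "Müdigkeit", "Straße"]
cleaned_examples = [strip_accents_lower(word) for word in examples]
# 'ß' has no decomposition, so it is removed entirely: 'Straße' becomes 'strae'
```

Note that umlauts become their base vowel (not *ue*, *ae*, *oe*) and that *ß* disappears, so dictionary entries need to go through the same function for the matching to stay consistent.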

As we have seen in _Part 1_, URLs are highly linked to spam (except for URLs linking to pictures). As image processing is not in the scope of this project, we remove all "pic.twitter" URLs. After that, we take out all tweets containing a URL. We also remove retweets, as they were not found to be useful in _Part 1_.

In [None]:
# Strip picture links first, as described above
cleaned['main'] = cleaned['main'].str.replace(r"pic\.twitter\S+", '', regex=True)
# Then drop every tweet that still contains a URL
url_mask = cleaned['main'].str.contains(r"www\S+|http\S+")
cleaned.drop(cleaned[url_mask].index, inplace=True)
cleaned.reset_index(drop=True, inplace=True)
cleaned['main'].head()
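
The cell above does not show the retweet removal itself. A minimal sketch of one way to do it is given below, assuming that retweets start with `rt` once the text has been lowercased (`RETWEET_PATTERN` and `drop_retweets` are hypothetical names, not part of our pipeline).

```python
import re

# After lowercasing, classic retweets start with "rt @user"; this prefix is an assumption
RETWEET_PATTERN = re.compile(r"^\s*rt\b")

def drop_retweets(texts):
    """Keep only the texts that do not look like retweets."""
    return [text for text in texts if not RETWEET_PATTERN.search(text)]

tweets = [
    "rt @someone: so tired of everything",
    "i feel so alone tonight",
    "art is my therapy",
]
kept = drop_retweets(tweets)
```

The word boundary `\b` avoids dropping tweets that merely start with a word beginning with "rt".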

We also remove mentions in order to drop the usernames, which are meaningless for our analysis:

In [None]:
cleaned['main'] = cleaned['main'].str.replace(r"@\S+", '', regex=True)

Finally, we remove all non-alphanumeric characters, as they do not carry additional information and might get in the way of the text analysis.

In [None]:
cleaned['main'] = cleaned['main'].str.replace(r'[^\w\s]', '', regex=True)
cleaned.main.head()

After performing all these cleaning steps, the size of the set is significantly reduced (the new file only weighs 2.4MB instead of the 55MB of the original file).

## 3. NLP methods

Using Natural Language Processing (NLP) is necessary given the nature of our dataset: tweets. Following the steps used in previous courses and in the papers we read, we came up with the following pipeline in order to process the tweets (and dictionary). Unlike the previous part, we only used local functions instead of a scaled up version using Spark.

_Note: we use nltk, a go-to Python NLP library, which was very useful for us as it offers operations in the various languages we are working on._

### 3.1. Tokenizing

The first important step in NLP is to split every tweet into a list of words so that they can be treated easily. This is easily done using the _'split'_ method provided in Python.

In [None]:
cleaned['tweets'] = cleaned['main'] #keep tweets for later
cleaned['main'] = cleaned['main'].str.split()
cleaned.head()

### 3.2 Stop words Removal

To have relevant tweets, it is necessary to remove useless words which would interfere with our analysis. We assumed stop words were the only important words to remove (as we already dealt with special characters and URLs).

In [None]:
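def remove_stops(language, df, *col):
    """Remove the stop words of `language` from token lists, in place.

    `df` is either a DataFrame (pass the name of the token column in `col`;
    only rows whose 'lang' matches are touched) or a Series of token lists.
    """
    lang = 'de' if language == 'german' else language[:2]
    stop_set = set(stopwords.words(language))

    # Select the token lists to process
    token_lists = df.loc[df['lang'] == lang, col[0]] if col else df
    for tokens in token_lists:
        # Slice assignment mutates the list stored in the DataFrame/Series,
        # so the caller sees the change without reassigning anything
        tokens[:] = [word for word in tokens if word not in stop_set]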

In [None]:
remove_stops('english', cleaned, 'main')
remove_stops('french', cleaned, 'main')
remove_stops('german', cleaned, 'main')
cleaned.head()

### 3.3 Stemming

The final step of our NLP treatment pipeline is stemming. The idea is that words appearing in multiple forms (such as *connect*, *connected*, *connecting*, …) should only be considered once, using their stem (e.g. "connect").

In [None]:
def stem_words(language, df, *col):
    """Stem token lists in place with the Snowball stemmer of `language`.

    `df` is either a DataFrame (pass the name of the token column in `col`;
    only rows whose 'lang' matches are touched) or a Series of token lists.
    """
    lang = 'de' if language == 'german' else language[:2]
    stemmer = SnowballStemmer(language)

    # Select the token lists to process
    token_lists = df.loc[df['lang'] == lang, col[0]] if col else df
    for tokens in token_lists:
        # Stem word by word and overwrite the stored list in place
        tokens[:] = [stemmer.stem(word) for word in tokens]

In [None]:
stem_words('english', cleaned, 'main')
stem_words('french', cleaned, 'main')
stem_words('german', cleaned, 'main')
cleaned.head()

### 3.4 Dictionary processing

As we applied all of these methods to our dataset, it is only natural that we do the same for the dictionary we will use. The first step of this part explains how we built the dictionary (and cleaned it as we did before) while the second part focuses on applying the previous NLP methods to the dictionary.

####  Building the dictionary

The first step of building our dictionary was doing research on previous dictionaries that were used for similar problems, such as those in [2] and [3]. Starting from this, we built our own dictionary by expanding the examples (as we expanded our subject to take into account multiple mental disorders instead of simply determining tweets of users diagnosed with clinical illnesses). As we treat tweets in multiple languages, we also translated all the terms and tried adding words that specifically target mood (or eating) disorders in French and German.

_Note: you can take a look at this dictionary by clicking on the 4th reference (at the end of the file)._

In [None]:
DICT_PATH = "dictionary.csv"
dictionaries = pd.read_csv(DICT_PATH)
dictionaries.head()

In [None]:
en_dict = dictionaries['english'].dropna()
fr_dict = dictionaries['french'].dropna()
de_dict = pd.concat([dictionaries['german'].dropna(), dictionaries['swiss_german'].dropna()])

After fetching our 3 dictionaries, we clean them as we did for the tweets in _Part 2_. However, as we created them ourselves, we only have to lowercase them and treat the special characters.

In [None]:
en_dict = en_dict.apply(data_cleaning).str.split()
fr_dict = fr_dict.apply(data_cleaning).str.split()
de_dict = de_dict.apply(data_cleaning).str.split()

#### Processing the dictionary

Once we have our (clean) data, we simply run all the previous NLP methods on our dictionaries. 

_Note: As all methods were explained above, we will not dwell on each step._

In [None]:
remove_stops('english', en_dict)
remove_stops('french', fr_dict)
remove_stops('german', de_dict)

In [None]:
stem_words('english', en_dict)
stem_words('french', fr_dict)
stem_words('german', de_dict)

To show the soundness of our method, we display the head of each dictionary:

In [None]:
en_dict.head()

In [None]:
fr_dict.head()

In [None]:
de_dict.head()

### 3.5 Processing the Data

In this part, we cross the dataset with our dictionaries to retrieve the tweets exhibiting mental distress. This is the first (less naive) step of our analysis before running the Machine Learning algorithms on our data.

We first transform the lists back to string sentences, for efficiency and code readability:

In [None]:
cleaned.main = cleaned.main.map(lambda x:  ' '.join(x))
de_dict = de_dict.map(lambda x:  ' '.join(x))
en_dict = en_dict.map(lambda x:  ' '.join(x))
fr_dict = fr_dict.map(lambda x:  ' '.join(x))

We now check, for each tweet, whether it contains at least one dictionary entry:

In [None]:
def check_dict(tweet, dict_):
    """Return True if at least one (non-empty) dictionary entry occurs in the tweet."""
    # Skip empty entries: '' is a substring of every tweet and would match everything
    return any(w and w in tweet for w in dict_)
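
Because `check_dict` tests for substrings, a short entry can also match inside a longer, unrelated word. The block below is a minimal sketch of a stricter variant that only accepts whole words or whole phrases; `matches_whole_words` is a hypothetical helper, not the function used for the results in this notebook.

```python
def matches_whole_words(tweet, entries):
    """Return True if any non-empty entry occurs in the tweet as a whole word or phrase."""
    # Normalize whitespace and pad with spaces so that every word is space-delimited
    padded = f" {' '.join(tweet.split())} "
    return any(entry and f" {entry} " in padded for entry in entries)
```

For example, the entry `sad` matches "i feel so sad today" but no longer matches "the ambassador spoke", which the substring test would accept.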

This gives us the following results for each language.


For English:

In [None]:
english_filtered = cleaned[cleaned.lang == 'en']['tweets']\
[cleaned[cleaned.lang == 'en']['main'].map(lambda x: check_dict(x, en_dict))]
english_filtered.head()

In [None]:
english_filtered.count()

We can clearly see sadness, loneliness or fear in these tweets, but some of them do not actually express distress.
We also note that the number of tweets has been greatly reduced by this step.

Now we look at the French tweets and again find similar results:

In [None]:
french_filtered = cleaned[cleaned.lang == 'fr']['tweets']\
[cleaned[cleaned.lang == 'fr']['main'].map(lambda x: check_dict(x, fr_dict))]
french_filtered.head()

In [None]:
french_filtered.count()

Finally, for German we get the following results:

In [None]:
german_filtered = cleaned[cleaned.lang == 'de']['tweets']\
[cleaned[cleaned.lang == 'de']['main'].map(lambda x: check_dict(x, de_dict))]
german_filtered.head()

In [None]:
german_filtered.count()

Not only do all of these tweets reference the person, but they also talk about sadness or disappointment.
The remaining number is quite small compared to the initial 4000 German tweets in the dataset, but it is a number we might expect.

Still, these results could be better, which is why we decide to introduce a step 4 into our analysis:

## 4. ML (to be done in milestone 3)

The final part of our data processing involves training a supervised Machine Learning classifier to get better results, as misclassifications are still somewhat frequent. Similar projects routinely use this [2] [3] [8].

All these methods are implemented in scikit-learn and have an equivalent in Spark in case our computers cannot handle the size of the data, allowing us to perform these steps on the cluster.

In order to do this, we go through the following steps:

### 4.1. Labeling the tweets

We need to label a subset of tweets as 'mentally distressed' or not. Ideally, we would label around 6000 tweets per language [7], but due to limited time and manpower, we will limit ourselves to 1000 tweets each.
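
Drawing the tweets to label at random, with a fixed seed, keeps the labeled subset reproducible. This is a minimal sketch under our own assumptions: `sample_for_labeling` is a hypothetical helper, and the tweets are assumed to be grouped in one list per language.

```python
import random

def sample_for_labeling(tweets_by_lang, n_per_lang=1000, seed=0):
    """Draw up to n_per_lang tweets per language, reproducibly, for manual labeling."""
    rng = random.Random(seed)
    return {
        # Never ask for more tweets than a language actually has
        lang: rng.sample(tweets, min(n_per_lang, len(tweets)))
        for lang, tweets in tweets_by_lang.items()
    }
```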

### 4.2 Constructing TF-IDF features

Once we have the labeled set, we transform it into TF-IDF features for the ML algorithm.
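
As a reminder of what these features are, the sketch below computes the textbook weight $\text{tf}(t, d) \cdot \log(N / \text{df}(t))$ in plain Python on three hypothetical stemmed tweets. It is only an illustration: scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its numbers will differ.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Compute textbook TF-IDF weights for a list of tokenized documents."""
    n_docs = len(documents)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            # Relative term frequency times inverse document frequency
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical stemmed tweets
docs = [["feel", "sad", "alon"], ["feel", "happi"], ["sad", "sad", "tire"]]
weights = tf_idf(docs)
```

A term that appears in every document gets a weight of zero, which is why very common words contribute little to the classification.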

### 4.3 Train SVM classifier

According to [7], SVM along with TF-IDF tends to perform very well, especially with limited training set size (which is ideal in our case). We perform a binary classification on the labeled dataset.

### 4.4 Relabel the training set

Using the previously attained classifier, we relabel the set and discard tweets that are labeled as 'undistressed'.

## 5. Final Data Analysis

In the final analysis, we try to analyze and visualize the results in order to answer our research questions.

### 5.1 General analysis

We compare the set found using Machine Learning to the general cleaned set and the overall set.

### 5.2 Finding related indicators of mental health issues using LDA

As discussed in [8], LDA is one of the best ways to find relevant related words in tweets. Again, we plan on using the scikit-learn implementation to perform this.

### 5.3 Looking at gender differences

### 5.4 Tweet frequency and type over time

### 5.5 Comparing our results to census data

This will be approximate since we could not find a dataset containing this information.

### 5.6 Comparing the languages

## 6. Conclusion

# References and Bibliography

[1] [Example dataprocessing pipeline](http://nbviewer.jupyter.org/gist/mizvol/eb24770ac3d5d598463f972e2a669f03)

[2] [Dissertation containing a first dict](https://www.rand.org/content/dam/rand/pubs/rgs_dissertations/RGSD300/RGSD391/RAND_RGSD391.pdf)

[3] [Second thesis containing dict](https://getd.libs.uga.edu/pdfs/kale_sayali_s_201512_ms.pdf)

[4] [Our own dictionary](https://docs.google.com/spreadsheets/d/1WwI9crZk36pcTOQ1g_5dumMd11OlkpFRNHsEvpkwLMk/edit?usp=sharing)

[5] [Stemming with Spark](https://github.com/master/spark-stemming)

[6] [ML methods in Spark](https://spark.apache.org/docs/2.1.0/ml-features.html)

[7] [Best ways to do Text Classification](http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/)

[8] [Public health paper using LDA](https://www.aaai.org/ocs/index.php/ICWSM/ICWSM11/paper/viewFile/2880/3264)