# Project: Mental health in Switzerland

In this notebook we describe our pipeline and the work done on the Twitter dataset to answer our research question(s).

Overview:
0. **Data retrieval:** how we access the dataset
1. **Dataset selection and analysis:** a first look at the dataset, exploring potential issues
2. **Dataset cleaning:** cleaning based on the results found in 1
3. **NLP methods:** applying NLP methods to our data to retrieve relevant data
4. **Machine Learning:** using machine learning to further clean the data
5. **Analysis:** analysis performed on our cleaned dataset
6. **Conclusion**


An implementation of the data cleaning on Spark can be found in run.py.

We quickly import the libraries to be used later:

In [1]:
import numpy as np
import pandas as pd
import pickle as pkl
import pyspark as ps
import matplotlib.pyplot as plt

#Part 1
import json
from pandas import json_normalize  # top-level location since pandas 1.0

#Part 2
import unicodedata

#Part 3
from nltk.corpus import stopwords #Part 3
from nltk.stem.snowball import SnowballStemmer

## 0. Datasets retrieval

We retrieved the two datasets with different methods, both because they differ greatly in size and because they were not provided on the same platform.

Our first dataset (_**twitter-swisscom**_) was retrieved from a .zip file. We therefore had quick access to the entire dataset, which gave us an overview of all the tweets when analyzing it (cf. _Part 1_ below).

Our second dataset, however, was retrieved from the cluster. We first tried to retrieve the whole dataset but quickly realized this was impractical (it is very large and took a long time to download). We therefore extracted only the first JSON file to perform our analysis, using the following commands:

```bash
cluster$ hadoop fs -getmerge /datasets/swiss-tweet/harvest3r_twitter_data_01-01_0.json /buffer/example.json
local$ scp -r gaspar@iccluster060.iccluster.epfl.ch:/buffer/example.json <local-path>
```

## 1. Dataset selection & analysis

The goal of this first analysis is to familiarize ourselves with the dataset in order to know if we need to adapt our research questions or enrich the dataset with external information in order to perform our analysis.

We were provided with two separate datasets containing Swiss tweets. They were formatted differently and contained different fields. While one contained data collected over multiple years, the other only covers a span of 10 months.
We performed an analysis on both in order to decide which one should be used in our project. After this analysis, we decided to use **dataset 2** for our project.

While dataset 1 contains more precise location information, in the form of longitude and latitude, dataset 2 contains a sentiment analysis field as well as a language field.
Categorizing the language of each tweet in dataset 1 was quite expensive, because of the network latency of API requests, and required a lot of preprocessing, so the presence of this field in dataset 2 gives it a clear advantage.

We now provide a quick overview of dataset 1.

### Dataset 1 (twitter-swisscom)

The dataset comes with a *txt schema*, giving us an idea of what each column in the *tsv file* containing the tweets represents. A sample file was given, but we obtained the complete set of tweets (5 GB) via a .zip.

The dataset contains the following useful columns:

- **userId** : id identifying the user.
- **createdAt** : time the tweet was posted on.
- **text** : content of tweet.
- **placeLatitude** : latitude of tweet.
- **placeLongitude** : longitude of tweet.
- **sourceName** : username.
- **sourceUrl** : URL of tweet.
- **followersCount** : number of followers.
- **friendsCount** : number of accounts the user follows.
- **statusesCount** : number of statuses of user.

The sample dataset contains a lot of NaN values: at least 1% of the values in every column are NaN.
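
The share of missing values per column can be measured directly in Pandas. The following is a minimal sketch on a made-up frame (the values are hypothetical and only reuse some of the column names listed above):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the sample file; the real values differ.
sample = pd.DataFrame({
    'userId': [1, 2, 3, 4],
    'text': ['hello', None, 'gruezi', 'salut'],
    'placeLatitude': [46.5, np.nan, np.nan, 47.4],
})

# isna() marks missing cells; the column-wise mean is the missing fraction.
nan_share = sample.isna().mean()
print(nan_share)
```

Multiplying `nan_share` by 100 gives the percentage of NaN values per column.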

The complete analysis and code can be found in the [Basic Exploration dataset 1 notebook](Basic%20Exploration%20Dataset%201.ipynb)

### Dataset 2 (from Spinn3r)

This dataset has an elaborate description of each field available on the [Spinn3r website](http://docs.spinn3r.com/?Example#content-schema).
Unlike the previous dataset, this dataset is given in JSON format.

To deal with the amount of data present on the cluster, we look at one day to perform our first analysis and then show how to scale up.

The dataset is formatted as nested JSON, which we could not extract directly using the read JSON function provided. We thus use the JSON normalizer contained in the Pandas library to flatten it. We will later see that Spark deals better with nested JSON.
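
To illustrate what the normalizer does, here is a small sketch on two made-up records. The records are hypothetical and only mimic the `_source` nesting; we also assume pandas 1.0 or later, where the function is available as `pd.json_normalize`:

```python
import pandas as pd

# Hypothetical records mimicking the nested layout; the real files have many more fields.
records = [
    {'_id': 'a1', '_source': {'main': 'Bonne année!', 'lang': 'fr'}},
    {'_id': 'b2', '_source': {'main': 'Happy new year', 'lang': 'en'}},
]

# Nested keys are flattened into dotted column names such as '_source.main'.
flat = pd.json_normalize(records)
flat.columns = [column.replace('_source.', '') for column in flat.columns]
print(flat.columns.tolist())
```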

The fields found in this dataset are:

In [2]:
EXAMPLE_PATH = 'swiss-tweet/example.json'

with open(EXAMPLE_PATH) as data_file:    
    data = json.load(data_file)

twitter_df = json_normalize(data)
#rename columns for convenience
twitter_df.columns = [ column.replace('_source.','') for column in twitter_df.columns]
twitter_df.columns

Index(['_id', '_index', '_score', 'author_avatar_img', 'author_gender',
       'author_link', 'author_name', 'bucket', 'canonical', 'date_found',
       'domain', 'hashcode', 'index_method', 'lang', 'links', 'main',
       'main_checksum', 'main_format', 'main_length', 'mentions', 'permalink',
       'published', 'resource', 'sentiment', 'sequence', 'sequence_range',
       'site', 'source_content_checksum', 'source_content_length',
       'source_created', 'source_date_found', 'source_description',
       'source_favicon_height', 'source_favicon_width', 'source_favorites',
       'source_followers', 'source_following', 'source_handle',
       'source_hashcode', 'source_http_status', 'source_image_height',
       'source_image_src', 'source_image_width', 'source_last_posted',
       'source_last_published', 'source_last_updated', 'source_likes',
       'source_link', 'source_location', 'source_parsed_posts',
       'source_parsed_posts_max', 'source_profiles', 'source_publisher_type',


Out of these columns, the ones we can use are:
- **main**: contains the content of the tweet.
- **published**: gives the time at which the content was posted.
- **source_spam_probability**: probability of the tweet being spam.
- **source_location**: location of the tweet.
- **tags**: tags associated with the tweet, as provided by Spinn3r.
- **lang**: language of the tweet.
- **sentiment**: sentiment label of the tweet (POSITIVE, NEGATIVE or NEUTRAL).
- **author_gender**: gender of the author (MALE, FEMALE or UNKNOWN).
- **source_followers**: number of followers of the user who tweeted.
- **source_following**: number of people the user follows.


We quickly discuss the usage of the most important fields:

- **Main** stands at the center of our analysis: we plan to apply NLP methods to identify relevant tweets, and to use the content to identify related words.
- **Published** can be used to map the tweets over the duration of the year and look at seasonal changes.
- **Source_location** can be used to look at the geographical distribution of the tweets.
- **Lang** will be used to filter out unwanted languages, which we need to do in order to perform the NLP tasks.
- **Author_gender** will be used to identify the gender and look at the differences between genders.

In [3]:
#columns we care about
columns = ['main', 'published', 'source_spam_probability', 'source_location', 'tags', 'lang', 'sentiment',
                   'author_gender', 'source_followers', 'source_following']
twitter_df = twitter_df[columns]

We now look at general distributions in this dataset. While this example isn't representative when it comes to the tweets (especially since it contains tweets from the 1st of January), it can still give us insights into the other fields.

We assume that roughly the same categories of users were active on that day, so we can draw conclusions on the distribution of language and gender.

The language distribution is the following:

In [4]:
twitter_df['lang'].value_counts()

en     7350
fr     4421
de     4174
pt     2139
es     1463
und     893
it      266
in      125
pl      124
tr      124
ar      116
ja      111
ht       84
nl       71
tl       60
et       43
da       39
sv       38
zh       30
no       25
ko       25
fi       15
lt       14
ru        9
hi        7
is        7
sl        5
hu        5
lv        4
ta        3
bg        2
th        1
vi        1
el        1
Name: lang, dtype: int64

We see that English, French and German are most frequent. This is good as those are the languages we plan on using.

We now look at the distribution of gender in the dataset:

In [5]:
twitter_df['author_gender'].value_counts()

UNKNOWN    14424
FEMALE      4044
MALE        3327
Name: author_gender, dtype: int64

We see that most accounts do not seem to contain this information. Still, many do, so we could use those to look at differences between genders. This would not give us an unbiased sample, however, as users who declare their gender on Twitter may differ from those who choose not to.

We now look at the sentiment column, to see how the tweets were labeled.

In [6]:
twitter_df['sentiment'].value_counts()

NEUTRAL     20357
POSITIVE      482
NEGATIVE       63
Name: sentiment, dtype: int64

We see that the vast majority of tweets were labeled as neutral, and only a very small number are labeled as negative. We will look at both neutral and negatively labeled tweets.

Under the assumption that the positives are not false positives, a tweet showing signs of mental distress will not be labeled as POSITIVE, hence we can safely exclude these tweets from further analysis.

Looking at the spam probability, we see that not a single tweet was labeled as spam. This puts into question the accuracy of the labeling, as the set of tweets on that day most certainly contains spam. We will still use it, as we assume the chance of false positives is low, so we lose nothing by using it.

In [7]:
twitter_df['source_spam_probability'].value_counts()

0.0    21795
Name: source_spam_probability, dtype: int64

We now examine the locations provided by the dataset:

In [8]:
#we only look at the locations for the languages we care about, as location seems to be language dependent
twitter_df[twitter_df.lang.isin(['de', 'fr', 'en'])]['source_location'].value_counts()

Switzerland           6613
Schweiz               1977
Suisse                1762
Genève                1094
Zürich                 961
Geneva                 676
Zurich                 544
Basel                  433
Bern                   363
Lausanne               287
CH                     154
Swiss                  124
Lugano                  93
St. Gallen              88
Geneve                  67
Schaffhausen            64
Fribourg                59
Svizzera                50
Luzern                  44
Winterthur              42
Baden                   37
Chur                    26
Lenzburg                23
Waldenburg              21
Interlaken              21
Biel                    19
Marly                   19
Genf                    19
Sankt Gallen            18
Neuchâtel               14
                      ... 
Visp                     2
Frauenfeld               2
Bâle                     2
Horgen                   2
Glarus                   2
Hinwil                   2
H

We see that:
- a lot of locations are the same place written in different languages, such as Switzerland and Schweiz
- the names of the locations are not just in the languages we are interested in (see สวิตเซอร์แลนด์)
- a vast majority of the dataset is just labeled as 'Switzerland'
- as opposed to dataset 1, however, they are all located in Switzerland
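
One way to merge these spelling variants would be a lookup table. The sketch below is only illustrative: the mapping and the helper `normalize_location` are hypothetical, built from a few variants visible in the output above, and a complete version would need many more entries.

```python
# Hypothetical mapping built from variants visible in the output above;
# a full version would need many more entries.
CANONICAL_LOCATION = {
    'switzerland': 'Switzerland', 'schweiz': 'Switzerland', 'suisse': 'Switzerland',
    'svizzera': 'Switzerland', 'swiss': 'Switzerland', 'ch': 'Switzerland',
    'genève': 'Geneva', 'geneve': 'Geneva', 'geneva': 'Geneva', 'genf': 'Geneva',
    'zürich': 'Zurich', 'zurich': 'Zurich',
    'st. gallen': 'St. Gallen', 'sankt gallen': 'St. Gallen',
    'basel': 'Basel', 'bâle': 'Basel',
}

def normalize_location(raw):
    """Map a raw location string to a canonical name; unknown values pass through."""
    cleaned_raw = raw.strip()
    return CANONICAL_LOCATION.get(cleaned_raw.lower(), cleaned_raw)

print(normalize_location('Schweiz'))
print(normalize_location('Lausanne'))
```

On the DataFrame this could be applied with `twitter_df['source_location'].map(normalize_location)`.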

We now look at missing values in the dataset:

In [9]:
twitter_df.count() #number of non-null values per column (missing = 21795 - count)

main                       21795
published                  21795
source_spam_probability    21795
source_location            21795
tags                        3865
lang                       21795
sentiment                  20902
author_gender              21795
source_followers           21795
source_following           21795
dtype: int64

We see that the tags are missing for most tweets. This may indicate that the tags won't be useful for analysis, but it does not influence our research at this stage.
We also note that the sentiment is missing for some tweets. As noted on Spinn3r, this may be due to the tweets not containing enough linguistic information. As we will filter out such tweets ourselves, the remaining set should contain sentiment. Even if not, this field is not central to our analysis.

To be sure this won't be an issue, we quickly look at these tweets and note that the language is unknown for all of the uncategorized tweets as well, so they will all be removed.

In [18]:
twitter_df[~(twitter_df.sentiment.isin(['POSITIVE', 'NEGATIVE', 'NEUTRAL']))]['lang'].value_counts()

und    893
Name: lang, dtype: int64

#### Looking at the Tweets

While this set of tweets is not representative, we can still use it to find potential issues we might have with the tweet content:

In [10]:
pd.set_option('display.max_colwidth', None) #None disables truncation; -1 is deprecated since pandas 1.0
twitter_df.sample(n=10)['main']

3582     Chez Mohawk's Cycles on a toujours une place pour la Velosophe Beer! Merci. http://fb.me/6Qi6lrkSZ                                                                                                                                                                                                                                   
7053     @StopHatinRussia ich habe dieses Jahr zu jeder Mitternacht in jeder Weltregion einen Tweet rausgelassen… morgen früh kommt Amerika dran…                                                                                                                                                                                             
4169     attention la kaïra des bacs à sable est parmi nous                                                                                                                                                                                                                                                                               

We immediately see that the tweets containing links are not relevant to our research question, as they are mostly news or ads. We assume that this would be the case at any time of the year.

We look at the tweets containing links and confirm:

In [11]:
twitter_df.main[twitter_df.main.map(lambda x: 'http://' in x)].head(10)

6     Alpinultras. Sello, circuito y estudio médico para potenciar las travesías de ultrafondo de al… http://wp.me/p9GIl-b3O  vía @CarrerasMontana 
7     5 people unfollowed me today tracked by http://unfollowspy.com                                                                               
9     Wir wünschen euch von Herzen ein gesundes 2016! http://fb.me/7SfcGRpCP                                                                       
17    Aktuellste Stellenangebote http://tinyurl.com/bugpgs8                                                                                        
23    Naa geierst Du wieder und willst wichsen..Na gut aber Du musst 2 mal spritzen.Du musst… http://dlvr.it/D8jjgL  pic.twitter.com/LBV2QcdDIF    
25    AZEALIA BANKS - 212 FT. LAZY JAY http://fb.me/3GqunbPRp                                                                                      
26    Alles Gute fürs Neue Jahr! Sehen Sie eine Videobotschaft des CEO #Bringhen Group: https://vimeo.com/148510

We now perform a vastly simplified version of the dictionary matching we will later use to get relevant tweets, and analyze the results.

Here we look at the occurrence of 'suicide' in the tweets:

In [12]:
pd.set_option('display.max_colwidth', 100)
twitter_df[twitter_df['main'].map(lambda x: 'suicide' in x) ]['main'] #news instead of personal reference
#removing news would be good
#we also see that we should not remove pic.twit

1731     Suicide bombing kills 1 at Kabul restaurant - http://news.bridgeward.com/suicide-bombing-kills-1...
3624                                              IS suicide attack planned in Munich http://bit.ly/1Owgpls 
5228     Afghanistan: attentat suicide dans un restaurant français de Kaboul - Europe1 http://dlvr.it/D8q...
6896     Ian Murdock's last night alive #debian http://sanfrancisco.cbslocal.com/2015/12/31/prominent-pro...
9479     Ipad Music Making Daily is out! http://paper.li/suicidesurfer77/1315555100?edition_id=bb78ee60-b...
10145     L'Etat islamique soupçonné d'avoir planifié un attentat suicide en Allemagne http://rss.ch/662786 
10446                           Afghanistan: attentat suicide dans le centre de Kaboul http://rss.ch/662843 
11096    Kaboul : les talibans revendiquent l'attentat suicide de ce 1er janvier contre un restaurant fra...
11377     L'Etat islamique soupçonné d'avoir planifié un attentat suicide en Allemagne http://rss.ch/662783 
14771              

We see that a lot of these tweets contain news; we should remove the links in order to get a better result.

Here we look at therapy (in German); again, this confirms the issue with links.

In [13]:
twitter_df[twitter_df['main'].map(lambda x: 'therapie' in x) ]['main'] #ads instead of personal reference
#all contain links: a reason to remove links

11567    #Arbeit #Jobs #CH Chefarzt Psychiatrie/Psychotherapie 100 (w/m): Herisau, Appenzell Ausserrhoden...
12508    Aktuelle Firmensuche "Physiotherapie Praxis Bücheli" Ostermundigen (BE) #Gesundheitsberatung #Su...
13610    Natürlich Gesund: Mistel und Myrrhe in der Phytotherapie http://pure-natur.blogspot.ch/2014/06/t...
13720    Aktuelle Suchabfrage "Kunsttherapie" auf @Help_ch #Kunsttherapie #Suchportal #Schweiz #Suche htt...
Name: main, dtype: object
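
The plain substring test used above is case-sensitive and also fires inside compound words such as "Physiotherapie". As a possible refinement, a whole-word, case-insensitive match can be sketched with the standard `re` module; the helper `mentions_term` is hypothetical and not part of our pipeline:

```python
import re

def mentions_term(text, term):
    """Return True if `term` occurs as a whole word in `text`, ignoring case."""
    pattern = r'\b' + re.escape(term) + r'\b'
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

print(mentions_term('Neue Therapie hilft mir sehr', 'therapie'))
print(mentions_term('Physiotherapie Praxis in Bern', 'therapie'))
```

Whether compounds such as "Psychotherapie" should count as matches is a design choice for the dictionary; whole-word matching simply makes that choice explicit.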

Finally, we look at the tweets that are retweets, to get an idea of whether they could be useful.
From what we see, and from what others [8] have seen, they are not.

In [None]:
twitter_df[twitter_df['main'].map(lambda x: 'RT ' in x) ]['main'].head()
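
Note that testing for `'RT '` anywhere in the text also matches words that merely end in "RT". A stricter check, sketched below, assumes that retweets start with the conventional `RT @user` prefix; this is a convention rather than a guarantee, and the helper `is_retweet` and the example handle are hypothetical:

```python
import re

# Assumption: classic retweets begin with "RT @user".
RETWEET_PATTERN = re.compile(r'^\s*RT @\w+')

def is_retweet(text):
    """Return True if the tweet text starts with the 'RT @user' prefix."""
    return RETWEET_PATTERN.match(text) is not None

print(is_retweet('RT @someuser: Frohes neues Jahr'))
print(is_retweet('SMART choice for 2016'))
```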

We now apply the knowledge gained from this analysis and start to clean the data.

## 2. Dataset Cleaning

Now that we have done some data exploration, we have to clean the data to be able to use it correctly for the rest of the project.

First, we use Pandas locally on a small subset of the dataset, to make sure all functions work as expected and to test them out. Later on, we use Spark to perform the same operations on a bigger scale.
The Spark version of this code can be found in [run.py](run.py).

### 2.1 Unnesting the JSON

As mentioned in 1, the dataset is given in nested JSON format; we thus have to unnest it.

In [19]:
cleaned = json_normalize(data)
cleaned.columns = [column.replace('_source.','') for column in cleaned.columns]

### 2.2 Column Selection

Once we have a DataFrame, we select the columns we found to be useful for the rest of the project, as described in 1.

In [None]:
#Only keeping the necessary columns
cleaned = cleaned[['main', 'published', 'source_spam_probability', 'source_location', 'tags', 'lang', 'sentiment',
                   'author_gender', 'source_followers', 'source_following']]

Some of them will bring us interesting insights. For example: are men or women more prone to depression? Is someone with more followers happy? There are quite a few interesting questions that can thus be asked and later on answered with the information we choose to keep.

### 2.3 Language Filtering

Seeing as our project is mainly focused on Switzerland, we will make sure we only keep the tweets that are in languages spoken by the majority of Switzerland, which would mean French, German and Italian. Unfortunately, as no one in the group speaks Italian, we decided to forfeit the language in favor of English. There are two reasons for this: first, a lot of people use English on the Internet, as it is a global language; second, there is a large English-speaking community in Switzerland.

Additionally, note that Swiss German is counted as German, even if some of the words are a little more peculiar. This will be taken into account in our dictionaries later on.

In [None]:
lang_mask = ~cleaned.lang.isin(['de', 'en', 'fr']) #isin expects a list-like, not separate arguments
cleaned.drop(cleaned[lang_mask].index, inplace=True)
cleaned.reset_index(drop=True, inplace=True)

### 2.4 Sentiment Analysis

Seeing as we were provided with a dataset that includes a sentiment analysis, we use it to drop all tweets that have a **positive** result. Since we want to find depression and other mental illnesses, we are mostly searching for negative or neutral tweets.

The sentiment analysis may have a few false positives (for example, a smiling face could be used sarcastically and thus make the result of the analysis positive), but we decide to use the data as it is, since we are interested in an overview.

In [None]:
sent_mask = (cleaned['sentiment'] == 'POSITIVE')
cleaned.drop(cleaned[sent_mask].index, inplace=True)
cleaned.reset_index(drop=True, inplace=True)

### 2.5 Spam Removal

Spam is not relevant to any of our objectives, so we drop all tweets with a value of 0.5 or higher in the **source_spam_probability** column. Once again, we decide to trust the algorithm giving us this data.

We chose a 50% threshold as it follows the maximum likelihood rule: a tweet is treated as spam if it is at least as likely to be spam as not.

In [None]:
spam_mask = (cleaned['source_spam_probability'] >= 0.5)
cleaned.drop(cleaned[spam_mask].index, inplace=True)
cleaned.reset_index(drop=True, inplace=True)
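
The three filters of sections 2.3 to 2.5 can also be expressed as a single boolean mask. This is a sketch rather than the code we ran: the function `filter_tweets` and the demo rows are hypothetical, while the column names and thresholds are the ones used above.

```python
import pandas as pd

def filter_tweets(df, languages=('de', 'en', 'fr'), spam_threshold=0.5):
    """Keep tweets in the given languages that are neither POSITIVE nor likely spam."""
    keep = (
        df['lang'].isin(languages)
        & (df['sentiment'] != 'POSITIVE')
        & ~(df['source_spam_probability'] >= spam_threshold)
    )
    return df[keep].reset_index(drop=True)

# Hypothetical rows for illustration only.
demo = pd.DataFrame({
    'lang': ['de', 'pt', 'en', 'fr'],
    'sentiment': ['NEUTRAL', 'NEGATIVE', 'POSITIVE', 'NEGATIVE'],
    'source_spam_probability': [0.0, 0.0, 0.0, 0.9],
})
filtered = filter_tweets(demo)
print(filtered)
```

Building one mask avoids repeated in-place drops and index resets, and returns a new frame instead of mutating the input.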

### 2.6 Time Format Encoding

To be able to use the time in some analysis (to find seasonal patterns in depression for example), we also need the dates to be properly formatted. Thus, we decided to reformat all the dates in the column **published** to be certain to be able to use them later on.

In [None]:
cleaned['published'] = pd.to_datetime(cleaned['published'])
cleaned['published'].head(10)
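
Once the column holds proper datetimes, seasonal counts become a simple group-by. The sketch below uses made-up timestamps; the ISO 8601 layout is an assumption about the **published** field, not something verified here.

```python
import pandas as pd

# Hypothetical timestamps; the ISO 8601 layout is an assumption.
demo = pd.DataFrame({
    'published': ['2016-01-01T00:00:12Z', '2016-01-15T08:30:00Z', '2016-07-04T21:05:43Z'],
})
demo['published'] = pd.to_datetime(demo['published'])

# Number of tweets per calendar month, via the .dt accessor.
tweets_per_month = demo.groupby(demo['published'].dt.month).size()
print(tweets_per_month)
```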

### 2.7 Text Treatment

Now that only the tweets that interest us remain, we need to do a little bit of cleaning in the **main** column to be able to process the text and find patterns or any other information we might seek.

First, we put everything in lowercase so that we can compare words without problems caused by the same letter appearing in uppercase and lowercase. In the same step, we strip accents and other non-ASCII characters.

In [None]:
cleaned['main'] = cleaned['main'].astype(str).str.lower().\
                    apply(lambda tweet: unicodedata.normalize('NFD', tweet).\
                    encode('ascii', 'ignore').decode('utf-8'))
cleaned['main'].head()
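
To make the effect of this normalization concrete, the same steps can be applied to single strings. The helper name `to_ascii_lower` is hypothetical; the logic mirrors the cell above.

```python
import unicodedata

def to_ascii_lower(text):
    """Lowercase `text`, decompose accented letters (NFD) and drop non-ASCII characters."""
    decomposed = unicodedata.normalize('NFD', text.lower())
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(to_ascii_lower('Genève'))   # geneve
print(to_ascii_lower('Zürich'))   # zurich
print(to_ascii_lower('Straße'))   # strae
```

Accented letters keep their base letter, but characters without a decomposition, such as "ß", are dropped entirely. The dictionary terms need the same treatment to stay comparable.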

Then, we remove all URLs and images. As we do not have any image processing, the pictures are not useful. As for the links, some information might be relevant (for example the title of an article if it is to be found in the URL) but most of it isn't.

We also remove all non-alphanumeric characters, as they will not give more information and might get in the way of the text analysis.

In [None]:
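#regex=True must be explicit since pandas 2.0; raw strings keep the escapes intact
cleaned['main'] = cleaned['main'].str.replace(r"www\S+", '', regex=True).str.replace(r"http\S+", '', regex=True).\
                    str.replace(r"pic\.twitter\S+", '', regex=True).str.replace(r'[^\w\s]', '', regex=True)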
cleaned['main'].head()
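
The replacements above can be tried on a single string with the standard `re` module. The helper `strip_links_and_punctuation` and the example tweet are hypothetical; the patterns mirror the ones used on the **main** column and are applied in the same order.

```python
import re

# Patterns mirroring the cell above, applied in the same order.
PATTERNS = [r'www\S+', r'http\S+', r'pic\.twitter\S+', r'[^\w\s]']

def strip_links_and_punctuation(text):
    """Remove URLs, Twitter picture links and non-alphanumeric characters."""
    for pattern in PATTERNS:
        text = re.sub(pattern, '', text)
    return text

example = 'frohes neues jahr! http://fb.me/abc pic.twitter.com/xyz'
stripped = strip_links_and_punctuation(example)
print(stripped.split())
```

The removed links leave extra whitespace behind, which is harmless here because the later tokenization splits on any run of whitespace.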

## 3. NLP methods

Using Natural Language Processing (NLP) is necessary given the nature of our dataset: tweets. Following the steps used in previous courses and in the papers we read, we came up with the following pipeline in order to process the tweets (and dictionary). Unlike the previous part, we only used local functions instead of a scaled up version using Spark.

_Note: we use nltk, a go-to Python NLP library, which was very useful for us as it offers operations in the various languages we are working with._

In [None]:
def process_words(language):
    lang = language[:2]
    if language == 'german':
        lang = 'de'
    
    stemmer = SnowballStemmer(language)
    lang_set = stopwords.words(language)
        
    cleaned.loc[cleaned['lang'] == lang, 'main'] = cleaned.loc[cleaned['lang'] == lang, 'main'].str.split().\
        apply(lambda tweet: [word for word in tweet if word not in lang_set]).\
        apply(lambda tweet: [stemmer.stem(word) for word in tweet])

### 3.1. Tokenizing

The first important step in NLP is to split every tweet into a list of words so that they can easily be processed. This is easily done using the _'split'_ method provided in Python.

In [None]:
cleaned['main'] = cleaned['main'].str.split()
cleaned.head()

### 3.2 Stop words Removal

To keep only relevant words, we remove words that carry little meaning and would interfere with our analysis. We assume stop words are the only words left to remove (as we already dealt with special characters and URLs).

In [None]:
def remove_stops(language):
    lang = language[:2]
    if language == 'german':
        lang = 'de'
    
    lang_set = stopwords.words(language)
        
    cleaned.loc[cleaned['lang'] == lang, 'main'] = cleaned.loc[cleaned['lang'] == lang, 'main'].\
        apply(lambda tweet: [word for word in tweet if word not in lang_set])

In [None]:
remove_stops('english')
remove_stops('french')
remove_stops('german')
cleaned.head()
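
The filtering itself is independent of nltk and can be sketched on a plain token list. The stop-word set below is a tiny hypothetical stand-in for `stopwords.words(language)`, and `drop_stop_words` is an illustrative helper:

```python
# Hypothetical, tiny stop-word set standing in for nltk's list.
DEMO_STOP_WORDS = frozenset({'i', 'am', 'the', 'a', 'so', 'of'})

def drop_stop_words(tokens, stop_words=DEMO_STOP_WORDS):
    """Return the tokens that are not in `stop_words`, preserving order."""
    return [token for token in tokens if token not in stop_words]

print(drop_stop_words('i am so tired of the winter'.split()))
```

Since `stopwords.words(language)` returns a list, wrapping it in `set(...)` as done here would make each membership test constant-time, which may matter on the full dataset.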

### 3.3 Stemming

The final step of our NLP treatment pipeline is stemming. The idea is that words appearing in multiple inflected forms (such as connect, connected, connecting, …) should be counted only once, under their common stem (e.g. "connect").

In [None]:
from nltk.stem.snowball import SnowballStemmer

def stem_words(language):
    lang = language[:2]
    if language == 'german':
        lang = 'de'
    
    stemmer = SnowballStemmer(language)
        
    cleaned.loc[cleaned['lang'] == lang, 'main'] = cleaned.loc[cleaned['lang'] == lang, 'main'].\
        apply(lambda tweet: [stemmer.stem(word) for word in tweet])

In [None]:
stem_words('english')
stem_words('french')
stem_words('german')
cleaned.head()

### 3.4 Dictionary processing

As we applied all of these methods to our dataset, it is only natural that we do the same for the dictionary we will use. The first step of this part explains how we built the dictionary (and cleaned it as we did before) while the second part focuses on applying the previous NLP methods to the dictionary.

####  Building the Dictionary

The first step of building our dictionary was doing research on previous dictionaries that were used for similar problems. Starting from these, we built our own dictionary by expanding the examples (as we broadened our subject to take into account multiple mental disorders instead of simply identifying tweets of users diagnosed with clinical depression). As we treat tweets in multiple languages, we also translated all the terms and tried adding words that specifically target mood (or eating) disorders in French and German.

_Note: you can take a look at this dictionary by clicking on the 4th reference (at the end of the file)._

#### Processing the Dictionary

### 3.5 Processing the Data

## 4. Machine Learning

### 4.1 Labeling the Tweets

### 4.2 Constructing TF-IDF Features

### 4.3 Training an SVM Classifier

### 4.4 Relabeling the Training Set

## 5. Final Data Analysis

LDA to find similarities

## 6. Conclusions

# References and bibliography

[1] [Example dataprocessing pipeline](http://nbviewer.jupyter.org/gist/mizvol/eb24770ac3d5d598463f972e2a669f03)

[2] [Dissertation containing a first dict](https://www.rand.org/content/dam/rand/pubs/rgs_dissertations/RGSD300/RGSD391/RAND_RGSD391.pdf)

[3] [Second thesis containing dict](https://getd.libs.uga.edu/pdfs/kale_sayali_s_201512_ms.pdf)

[4] [Our own dictionary](https://docs.google.com/spreadsheets/d/1WwI9crZk36pcTOQ1g_5dumMd11OlkpFRNHsEvpkwLMk/edit?usp=sharing)

[5] [Stemming with Spark](https://github.com/master/spark-stemming)

[6] [ML methods in Spark](https://spark.apache.org/docs/2.1.0/ml-features.html)

[7] [Best ways to do Text Classification](http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/)

[8] [Public health paper using LDA](https://www.aaai.org/ocs/index.php/ICWSM/ICWSM11/paper/viewFile/2880/3264)