# Project: Mental health in Switzerland

In this notebook, we describe the multiple steps of our preprocessing pipeline and the work done on the Twitter dataset to answer our research question.

### Overview:

<ol start="0">
  <li><b>Data Collection:</b> how we accessed the datasets</li>
  <li><b>Datasets Description:</b> a first look at the datasets</li>
  <li><b>Data Treatment:</b> cleaning based on the results found in <i>Part 1</i> and applying NLP methods to our data</li>
  <li><b>Dictionary Construction:</b> creation of the dictionary we are going to use on the datasets</li>
  <li><b>Topic Modeling and Analysis of First Dataset:</b> analysis of the smaller Spinn3r set as a proof of concept</li>
  <li><b>Analysis of the Second Dataset:</b> applying what we learned on the smaller dataset to the twitter-swisscom dataset</li>
  <li><b>Final Findings:</b> trying to find meaning in the final dataset we retrieve</li>
  <li><b>Conclusion</b></li>
</ol>

We quickly import the libraries to be used later:

In [None]:
import numpy as np
import pandas as pd
import pickle as pkl
import pyspark as ps
import matplotlib.pyplot as plt

#Part 1
import json
from pandas import json_normalize  # top-level import since Pandas 1.0 (pandas.io.json is deprecated)

#Part 2
import re
import unicodedata

#Part 3
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize, TweetTokenizer

## 0. Data Collection

We did not retrieve both datasets using the same method, as they differ greatly in size and were not provided on the same platform.

The _**twitter-swisscom**_ dataset was retrieved from a .zip file. We thus had quick access to the entire dataset, which gave us an overview of all the tweets when analyzing it (cf. _Part 1_ below).

The _**Spinn3r**_ dataset, however, was retrieved from the cluster. We first tried to retrieve the whole dataset but quickly realized this was not feasible (it is very large and took too long to download). Thus, we only extracted the first JSON file to perform our analysis. To do this, we used the following commands:

```bash
cluster$ hadoop fs -getmerge /datasets/swiss-tweet/harvest3r_twitter_data_01-01_0.json /buffer/example.json
local$ scp -r gaspar@iccluster060.iccluster.epfl.ch:/buffer/example.json <local-path>
```

## 1. Datasets Description

We were provided with two separate datasets containing Swiss tweets. They were formatted differently and contained different fields. On top of this, twitter-swisscom (referred to as the second dataset from here on) contains data collected over multiple years, while Spinn3r (referred to as the first dataset) only covers a span of 10 months.

The goal of our first analysis was thus to decide which dataset would be more useful, but also to familiarize ourselves with the chosen dataset in order to know whether we need to adapt our research questions or enrich the dataset with external information.

After the analysis (presented below), we decided to use **both datasets** for our project. While twitter-swisscom contains more precise location information (longitude and latitude), the Spinn3r dataset contains a sentiment analysis field as well as a language field, and is much smaller. Thus, we decided to develop our functions and methods on the smaller dataset (Spinn3r) as a proof of concept, before moving to a larger scale and using the twitter-swisscom dataset to further our analysis.

### First Dataset (Spinn3r)

#### Metadata analysis

Each field of this dataset is described in detail on the [spinn3r website](http://docs.spinn3r.com/?Example#content-schema). Given the amount of data present in the cluster, we only look at one day to perform our first analysis (we will later show how we scale our operations using Spark).

Unlike the twitter-swisscom dataset, this dataset is given in JSON format (with nested elements). As we could not find how to extract all the data directly using the *read_json* function provided, we use a JSON normalizer (provided in the *Pandas* library).

_Note: We will later see that Spark deals better with nested JSON._
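
To make the flattening step concrete, here is a minimal plain-Python sketch of what a JSON normalizer does conceptually: nested keys are joined with a dot, which is why the columns carry a `_source.` prefix before we rename them. The `flatten` helper and the sample record are hypothetical illustrations; they are not part of Pandas or of the dataset.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dictionaries into a single level with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the dotted prefix along
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hypothetical record mimicking the nesting under '_source'
sample = {"_id": "1", "_source": {"main": "hello", "lang": "en"}}
flat = flatten(sample)

# Drop the '_source.' prefix, as done for the DataFrame columns
renamed = {key.replace("_source.", ""): value for key, value in flat.items()}
print(renamed)
```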

The fields found in this dataset are:

In [None]:
EXAMPLE_PATH = 'swiss-tweet/example.json'
TRUE_PATH = 'reduced_tweets.json'

with open(EXAMPLE_PATH) as data_file:    
    data = json.load(data_file)

twitter_df = json_normalize(data)
#rename columns for convenience
twitter_df.columns = [column.replace('_source.','') for column in twitter_df.columns]
twitter_df.columns

Out of all these columns, we find the ones below to be the most useful for our analysis:
- **main**: contains the content of the tweet.
- **published**: gives the time at which the content was posted.
- **source_spam_probability**: probability of tweet being spam.
- **source_location**: location of the tweet.
- **tags**: tags associated with the tweet (provided by Spinn3r).
- **lang**: language of the tweet.
- **sentiment**: sentiment score of the tweet -POSITIVE, NEGATIVE, NEUTRAL-.
- **author_gender**: gender of the author -MALE, FEMALE, UNKNOWN-.
- **source_followers**: followers of the user who tweeted.
- **source_following**: number of people the user follows.

We consider these fields to be the most useful as:

- **Main** stands at the center of our analysis. We plan to apply NLP methods in order to identify relevant tweets, and to use the content to identify related words.
- **Published** can be used to map the tweets over the duration of the year and look for seasonal changes.
- **Source_location** can be used to look at the geographical distribution of the tweets.
- **Lang** will be used to filter out unwanted languages, which we need to do in order to perform our NLP tasks.
- **Author_gender** will be used to identify the gender and look at the differences between both genders.
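
As an illustration of the seasonal view mentioned for **Published**, the sketch below counts tweets per month using only the standard library. The timestamp format and the sample values are assumptions made for this example; the real field may need a different format string.

```python
from collections import Counter
from datetime import datetime

# Hypothetical timestamps; the actual 'published' format is an assumption
published = ["2016-01-01T00:00:12Z", "2016-01-15T13:45:00Z", "2016-07-04T08:30:59Z"]


def month_of(timestamp):
    """Return the month (1-12) of an ISO-like UTC timestamp."""
    return datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ").month


# Number of tweets observed in each month
tweets_per_month = Counter(month_of(ts) for ts in published)
print(sorted(tweets_per_month.items()))
```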

In [None]:
columns = ['main', 'published', 'source_spam_probability', 'source_location', 'tags', 'lang', 'sentiment',
                   'author_gender', 'source_followers', 'source_following']
twitter_df = twitter_df[columns]

#### General Distribution

While this example isn't representative (especially given that it contains the tweets posted on January 1st), it can still give us insights into other fields. We assume that roughly the same categories of users were active on that day, meaning we can draw conclusions on the distribution of language and gender.

In [None]:
plt.figure(figsize=[30, 20])
lang_counts = twitter_df['lang'].value_counts()
# Pull the rarer languages away from the centre; built from the data so its length always matches
explode = [0 if i < 6 else 0.1 if i < 10 else 0.2 for i in range(len(lang_counts))]
lang_counts.plot(kind='pie', explode=explode, cmap=plt.cm.terrain)
plt.axis('equal')
plt.show()

We see that English, French and German are the most frequent languages. This is good as those are the languages we plan on using.

In [None]:
plt.figure(figsize=[5, 5])
twitter_df['author_gender'].value_counts().plot(kind='pie')
plt.axis('equal')
plt.show()

We see that most accounts do not contain information on the user's gender (meaning there is no way we can have an unbiased set). However, we have a significant number of profiles where the gender is documented meaning we could use these to look at the differences in mental distress between the genders.

In [None]:
plt.figure(figsize=[10, 5])
explode = (0, 0.2, 0.4)
twitter_df['sentiment'].value_counts().plot(kind='pie', explode=explode)
plt.axis('equal')
plt.show()

Once again, we see that the vast majority of tweets are labeled as neutral while only a very small number are labeled as negative. Given our subject, we will look at both the **'NEUTRAL'** and **'NEGATIVE'** tweets.

_Note: we make the assumption that there are no false positives, meaning that a tweet showing signs of mental distress will not be labeled as **'POSITIVE'**. This means we can safely exclude these tweets from further analysis._

In [None]:
plt.figure(figsize=[8, 5])
twitter_df['source_spam_probability'].value_counts().plot(kind='pie')
plt.axis('equal')
plt.show()

Looking at the spam probability of the tweet set, we see that not a single tweet was labeled as spam. This makes us question the accuracy of the labeling as the set of tweets on that day most certainly contains spam. However, we will still use it as we assume the chance of having false positives to be very low.

In [None]:
plt.figure(figsize=[20, 10])
# Count locations once so that the legend labels match the pie slices
location_counts = twitter_df[twitter_df.lang.isin(['de', 'fr', 'en'])]['source_location'].value_counts()
location_counts.plot(kind='pie')
plt.legend(labels=location_counts.index)
plt.axis('equal')
plt.show()

As the location seems to be language dependent, we only care about tweets written in the 3 languages we need. We see that:
- A lot of locations only differ in language (e.g. Switzerland and Schweiz)
- The names of the locations are not always given in the languages we are interested in (e.g. สวิตเซอร์แลนด์)
- A vast majority of the dataset is just located in 'Switzerland'
- As opposed to the twitter-swisscom dataset, all tweets are located in Switzerland

In [None]:
twitter_df.count()

Looking at the missing values in the dataset, we see that for most tweets, the tags are missing. This may indicate that the tags won't be useful for our analysis, but this does not influence our research at this stage.

We also note that for some tweets, the sentiment is missing. As noted by Spinn3r, this may be due to the fact that some tweets do not contain enough linguistic information. As we will filter out such tweets, the remaining set should contain sentiment. Even if it doesn't, this field is not central to our analysis; it is merely used to help us filter our dataset.

In [None]:
plt.figure(figsize=[6, 5])
twitter_df[~(twitter_df.sentiment.isin(['POSITIVE', 'NEGATIVE', 'NEUTRAL']))]['lang'].value_counts().plot(kind='pie')
plt.axis('equal')
plt.show()

To be sure this won't be an issue, we quickly look at all the uncategorized tweets and see that their language is unknown as well, meaning they will all be removed anyway.

#### Potential issues

While this set of tweets is not representative, we can still use it to find potential issues we might have with the tweets' content:

In [None]:
pd.set_option('display.max_colwidth', None)  # None disables truncation (-1 is deprecated)
twitter_df.sample(n=10)['main']

Looking at the sample, we immediately see that the tweets containing links are not relevant to our research question (they are mostly news or ads).

In [None]:
twitter_df.main[twitter_df.main.map(lambda x: 'http://' in x)].head(10)

Looking more intently at the tweets containing links, we can make the assumption that the relationship between URLs and spam is a general rule (at any time of the year).
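
Note that the test above only looks for the literal `http://`, so `https://` or bare `www.` links would slip through. A slightly broader check could look like the following sketch; the pattern and the `has_url` name are illustrative assumptions, not the filter applied later in the pipeline.

```python
import re

# Matches http://, https:// and bare www. links (illustrative pattern)
URL_PATTERN = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)


def has_url(text):
    """Return True if the text contains something that looks like a link."""
    return URL_PATTERN.search(text) is not None


for text in ["breaking news https://t.co/abc", "i feel alone tonight"]:
    print(text, "->", has_url(text))
```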

We now perform a vastly simplified version of the dictionary matching we will use later to get relevant tweets, and analyze the results.

Here we look at the occurrences of 'suicide' in the tweets:

In [None]:
pd.set_option('display.max_colwidth', 100)
twitter_df[twitter_df['main'].map(lambda x: 'suicide' in x) ]['main']

By performing a simplistic dictionary matching using the occurrences of the word *'suicide'* in our tweet set, we see that a lot of these tweets contain news. This reinforces our choice to remove tweets with URLs in order to get better results.

In [None]:
twitter_df[twitter_df['main'].map(lambda x: 'therapie' in x) ]['main']

Looking at the word *'therapy'* (in German), we can confirm once again the issue there is with URLs.

In [None]:
twitter_df[twitter_df['main'].map(lambda x: 'RT ' in x) ]['main'].head()

Finally, we look at the retweets to get an idea on how useful they could be. From what we see, and what others have seen [8], they are not.

### Second Dataset (twitter-swisscom)

The dataset comes with a *txt schema*, giving us an idea of the meaning of each column in the *tsv file* containing the tweets. We were given a sample file to get an overview of the data, and we also obtained the complete set of tweets (5GB) in a _.zip_ file.

The dataset contains the following useful columns:

- **userId** : id identifying the user.
- **createdAt** : time the tweet was posted.
- **text** : content of the tweet.
- **placeLatitude** : latitude of the tweet.
- **placeLongitude** : longitude of the tweet.
- **sourceName** : name of the client used to send the tweet (Android, website, and so on).
- **sourceUrl** : URL of the client used to send the tweet.
- **followersCount** : number of followers.
- **friendsCount** : number of accounts the user follows.
- **statusesCount** : number of statuses of user.

The sample dataset contains a lot of NaN values: every column contains at least 1% of NaN values.

The complete analysis and code can be found in the [Basic Exploration dataset 1 notebook](Basic%20Exploration%20Dataset%201.ipynb)

## 2. Data Treatment

Now that we have done some data exploration, we have to clean our dataset to be able to use it correctly for our research.

First, we use Pandas locally on a small subset of the tweets to explore different cleaning methods and make sure the functions we chose work as expected. After proving that our concept works, we use Spark to scale up our operations and be able to perform them on all the files provided on the cluster.

### First Dataset 

#### Importing the data

This part follows exactly what was done during Data Exploration: normalizing the provided _JSON_ file and keeping only the columns we are interested in. The final step in importing the data is formatting the publication date of the tweets for convenience.

In [None]:
cleaned = json_normalize(data)
cleaned.columns = [column.replace('_source.','') for column in cleaned.columns]

In [None]:
cleaned = cleaned[['main', 'published', 'source_spam_probability', 'source_location', 'tags', 'lang', 'sentiment',
                   'author_gender', 'source_followers', 'source_following']]

In [None]:
cleaned['published'] = pd.to_datetime(cleaned['published'])
cleaned['published'].head()

#### Tweet filtering

Given the context of our research, many tweets are irrelevant.
- First, we would like to get rid of any tweets with a spam probability of 0.5 or more (an arbitrary value we found sound).
- Moreover, we only want to keep tweets marked as **NEUTRAL** or **NEGATIVE** as we are conducting a study on mental distress in tweets.
- Finally, as we are focused on the Swiss population, we only keep tweets in the official languages (except for Italian as no one in our group speaks the language, meaning we won't be able to extract useful information) and in English (as most tweets in the dataset, and more generally on Twitter, are written in English).

In [None]:
spam_mask = (cleaned['source_spam_probability'] >= 0.5)
cleaned.drop(cleaned[spam_mask].index, inplace=True)
cleaned.reset_index(drop=True, inplace=True)

In [None]:
sent_mask = (cleaned['sentiment'] == 'POSITIVE')
cleaned.drop(cleaned[sent_mask].index, inplace=True)
cleaned.reset_index(drop=True, inplace=True)

In [None]:
lang_mask = ~cleaned.lang.isin(['de', 'en', 'fr'])
cleaned.drop(cleaned[lang_mask].index, inplace=True)
cleaned.reset_index(drop=True, inplace=True)
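
The three filtering rules can also be read as a single keep/drop decision per tweet. The sketch below expresses them as one predicate on a plain dictionary, which can be handy for testing the rules in isolation. The `keep_tweet` name and the sample records are hypothetical, and treating a missing spam probability as non-spam is an assumption of this sketch.

```python
KEPT_LANGUAGES = {"de", "en", "fr"}


def keep_tweet(tweet):
    """Return True if the tweet passes the spam, sentiment and language rules."""
    spam = tweet.get("source_spam_probability")
    if spam is not None and spam >= 0.5:
        return False  # likely spam
    if tweet.get("sentiment") == "POSITIVE":
        return False  # positive tweets are out of scope
    return tweet.get("lang") in KEPT_LANGUAGES


# Hypothetical records, one per rule
samples = [
    {"source_spam_probability": 0.1, "sentiment": "NEGATIVE", "lang": "fr"},
    {"source_spam_probability": 0.9, "sentiment": "NEUTRAL", "lang": "en"},
    {"source_spam_probability": 0.0, "sentiment": "POSITIVE", "lang": "de"},
    {"source_spam_probability": 0.0, "sentiment": "NEUTRAL", "lang": "it"},
]
kept = [tweet for tweet in samples if keep_tweet(tweet)]
print(len(kept), "of", len(samples), "tweets kept")
```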

#### Text treatment

Now that we only have relevant tweets (following a mild cleaning), we need to work on the main subject of our analysis: the content of the tweets itself. This is necessary to ease the _3rd Part_ of our research (processing the text to find patterns using NLP, _Natural Language Processing_).

The first step of our treatment is to lowercase all the characters and strip accents (this allows comparing the tweets without case-sensitive or accent-sensitive searches).

In [None]:
data_cleaning = lambda data: unicodedata.normalize('NFD', data.lower()).encode('ascii', 'ignore').decode('utf-8')

In [None]:
cleaned['main'] = cleaned['main'].map(data_cleaning)
cleaned['main'].head()
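
To see what this normalization does, here is a small standalone sketch of the same idea. NFD decomposition splits an accented letter into a base letter plus a combining mark, and the ASCII round-trip then drops the mark. One caveat worth knowing: characters without such a decomposition, like the German `ß`, are dropped entirely. The function name `normalize_text` is only used for this illustration.

```python
import unicodedata


def normalize_text(text):
    """Lowercase the text and strip accents via NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Combining marks (and any other non-ASCII character) are discarded here
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(normalize_text("Thérapie à Zürich"))  # therapie a zurich
print(normalize_text("Straße"))             # strae
```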

As we have seen in _Part 1_, URLs are highly linked to spam (except for URLs linking pictures to the tweet). As image processing is not in the scope of this project, we remove all "pic.twitter" URLs and mentions of the format "@user". After that, we remove all tweets containing a URL, as well as retweets, as they do not give us any information. The local Pandas version below only handles picture links and URLs; mentions and retweets are handled in the Spark version.

In [None]:
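# Flag tweets containing a URL (raw strings avoid invalid-escape warnings)
url_mask = cleaned['main'].str.contains(r"www\S+|http\S+", regex=True)
# Strip links to attached pictures; recent Pandas versions need regex=True to be explicit
cleaned['main'] = cleaned['main'].str.replace(r"pic\.twitter\S+", '', regex=True)
cleaned.drop(cleaned[url_mask].index, inplace=True)
cleaned.reset_index(drop=True, inplace=True)
cleaned['main'].head()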

In [None]:
alphanum_filter = lambda data: re.sub(r'[^\w\s]', '', data)

In [None]:
cleaned['main'] = cleaned['main'].map(alphanum_filter)
cleaned.main.head()
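
The filter keeps word characters and whitespace and removes everything else. A quick sketch of its effect on a made-up tweet shows that hashtags lose their `#` and contractions are glued together. The `strip_punctuation` name is only used for this illustration.

```python
import re


def strip_punctuation(text):
    """Remove every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", text)


print(repr(strip_punctuation("i'm #sad... :(")))  # 'im sad '
```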

#### Spark version

After performing all these cleaning steps, the size of the set is **significantly reduced** (the new file only weighs **2.4MB** instead of the **55MB** of the original file, a reduction of roughly **23x**). This is particularly helpful as the whole dataset represents around **30GB**.

_Note:_ the code was defined as _**Raw NBConvert**_ as it is not meant to be run in the notebook but on the cluster. The code is only here as information for the reader.

We start by importing the necessary elements and defining Spark environment variables.

In [None]:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import regexp_replace, col, lower, explode

sc = SparkContext()
sqlContext = SQLContext(sc)
df = sqlContext.read.json('/datasets/swiss-tweet/')

We then import our data.

In [None]:
#remove nesting
df = df.select('_source.*')

#list and keep relevant columns
columns = ['main', 'author_user_id','published', 'source_spam_probability', 'source_location', 'tags', 'lang', 'sentiment',
                   'author_gender', 'source_followers', 'source_following', 'geo_point']
df = df[columns]

We filter tweets using the same method.

In [None]:
df = df.filter(df.source_spam_probability < 0.5) #remove spam
df = df.filter(df.sentiment != 'POSITIVE') #remove positive sentiment tweets
df = df.filter(df.lang.isin('en', 'de', 'fr'))  #gets tweets with right language

Text treatment is a little longer as it is more thorough in Spark.

In [None]:
#lower the characters
df = df.withColumn('main', lower(df.main))

#replace picture urls and mentions
df = df.withColumn('main', regexp_replace(col('main'), r'pic\.twitter\S+', ' '))
df = df.withColumn('main', regexp_replace(col('main'), r'@\S+', ' '))

#remove tweets mentioning websites and retweets
df = df.where(~df.main.like("%http%"))
df = df.where(~df.main.like("%.com%"))
df = df.where(~df.main.like("%.ch%"))
df = df.where(~df.main.like("%www%"))
df = df.where(~df.main.rlike(r'\brt\b'))  #match 'rt' as a whole word only, so that words such as 'heart' are kept
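
Since the text is lowercased before this step, retweets are recognized by the token `rt`. How that token is matched matters: a plain substring test also fires on ordinary words containing those two letters, some of which are relevant to our topic. The plain-Python sketch below contrasts a substring test with a word-boundary pattern; `is_retweet` and the example sentences are hypothetical.

```python
import re

# 'rt' as a standalone word, not as part of a longer word
RT_PATTERN = re.compile(r"\brt\b")


def is_retweet(text):
    """Return True if the lowercased text contains 'rt' as a whole word."""
    return RT_PATTERN.search(text) is not None


examples = ["rt  great news everyone", "my heart hurts today"]
for text in examples:
    print(text, "| substring:", "rt" in text, "| whole word:", is_retweet(text))
```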

Finally, we store the data in the _JSON_ we will be using.

In [None]:
df.write.json('reduced_tweets.json')
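
When Spark writes JSON, the target path is a directory of part files, each holding one JSON object per line, rather than a single JSON document. A minimal sketch for reading such output back without Spark follows; the `read_spark_json` helper is hypothetical and assumes uncompressed `part-*` files.

```python
import glob
import json
import os


def read_spark_json(directory):
    """Collect the records from every part file in a Spark JSON output directory."""
    records = []
    for path in sorted(glob.glob(os.path.join(directory, "part-*"))):
        with open(path, encoding="utf-8") as part_file:
            for line in part_file:
                if line.strip():  # skip blank lines
                    records.append(json.loads(line))
    return records
```

It would be called as `read_spark_json('reduced_tweets.json')`, where the argument is the output directory written above.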

### Second Dataset 

## 3. Dictionary Construction

###  Building the dictionary

The first step of building our dictionary was researching previous dictionaries that were used for similar problems, such as those in [2], [3]. Starting from this, we built our own English dictionary _($4^{th}$ reference in the Bibliography at the end of the file)_ and carefully translated it into French and German.

_Note_: a more complete explanation of our dictionary construction is included in our report.

In [None]:
DICT_PATH = "dictionary.csv"
dictionaries = pd.read_csv(DICT_PATH)
dictionaries.head()

In [None]:
en_dict = dictionaries['english'].dropna()
fr_dict = dictionaries['french'].dropna()
de_dict = pd.concat([dictionaries['german'].dropna(), dictionaries['swiss_german'].dropna()])

After fetching our 3 dictionaries, we clean them as we did for the tweets in _Part 2_. However, as we created them ourselves, we only have to lowercase them and treat the special characters.

In [None]:
en_dict = en_dict.apply(data_cleaning).str.split()
fr_dict = fr_dict.apply(data_cleaning).str.split()
de_dict = de_dict.apply(data_cleaning).str.split()

### Processing the dictionary

Once we have our (clean) data, we simply run all the previous NLP methods on our dictionaries. 

_Note_: Special methods were written for the dictionaries as they do not follow the same format as tweets, but they are the same in essence.

In [None]:
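# the functions are called unqualified below, so import them by name
from dict_filtering import dict_remove_stops, dict_stem_words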

In [None]:
dict_remove_stops('english', en_dict)
dict_remove_stops('french', fr_dict)
dict_remove_stops('german', de_dict)

In [None]:
dict_stem_words('english', en_dict)
dict_stem_words('french', fr_dict)
dict_stem_words('german', de_dict)

To show the soundness of our method, we display the head of each dictionary:

In [None]:
en_dict.head()

In [None]:
fr_dict.head()

In [None]:
de_dict.head()

## 4. Topic Modeling and Analysis of First Dataset

### Natural Language Processing

Using NLP _(Natural Language Processing)_ is necessary given the nature of our dataset: tweets. Following the steps used in previous courses and in the papers we read, we came up with the following pipeline in order to process the tweets (and dictionary). Unlike the previous part, we only used local functions.

_Note: we use nltk, a go-to Python NLP library, which was very interesting for us as it offers operations in the various languages we are working on._

#### Helper functions
(cf. _Next part_ below for information)

In [None]:
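def remove_stops(language, cleaned, frame):
    """Remove stop words, in place, from the token lists of tweets written in `language`."""
    # Map the nltk language name to the two-letter code used in the 'lang' column
    lang = language[:2]
    if language == 'german':
        lang = 'de'
    lang_set = set(stopwords.words(language))  # set for fast membership tests
    cleaned.loc[cleaned['lang'] == lang, frame] = cleaned.loc[cleaned['lang'] == lang, frame].\
        apply(lambda tweet: [word for word in tweet if word not in lang_set])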

In [None]:
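def stem_words(language, cleaned, frame):
    """Stem, in place, every token of the tweets written in `language`."""
    # Map the nltk language name to the two-letter code used in the 'lang' column
    lang = language[:2]
    if language == 'german':
        lang = 'de'

    stemmer = SnowballStemmer(language)

    cleaned.loc[cleaned['lang'] == lang, frame] = cleaned.loc[cleaned['lang'] == lang, frame].\
        apply(lambda tweet: [stemmer.stem(word) for word in tweet])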

#### Tweet formatting

NLP preprocessing mainly consists of bringing the data into a format that is adequate to work with, in 3 steps:
- Tokenizing: separating words
- Stop-words removal: dropping very common words that carry little meaning
- Stemming: keeping only the stem of a word (allows better comparison)
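
To show the shape of these three steps without any external data, here is a deliberately simplified, standard-library sketch. The stop-word set and the suffix list are toy assumptions standing in for the nltk stop-word corpus and the Snowball stemmer, and the regex tokenizer stands in for `TweetTokenizer`; the real pipeline uses the nltk versions.

```python
import re

TOY_STOPWORDS = {"i", "am", "so", "the", "a", "and", "of"}  # hypothetical, tiny list
TOY_SUFFIXES = ("ing", "ed", "ly", "s")                     # hypothetical, crude rules


def tokenize(text):
    """Step 1: split the lowercased text into word tokens."""
    return re.findall(r"\w+", text.lower())


def drop_stopwords(tokens):
    """Step 2: remove very common words."""
    return [token for token in tokens if token not in TOY_STOPWORDS]


def toy_stem(token):
    """Step 3: strip the first matching suffix, keeping at least 3 characters."""
    for suffix in TOY_SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token


def preprocess(text):
    return [toy_stem(token) for token in drop_stopwords(tokenize(text))]


print(preprocess("I am so tired of feeling lonely"))
```

The point of stemming for our purposes is that different surface forms collapse to the same key: in this toy version, `feeling` and `feels` both become `feel`, so a single dictionary entry can match both.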

In [None]:
tknzr = TweetTokenizer()

cleaned['tweets'] = cleaned['main'] #keep tweets for qualitative study
cleaned['main'] = cleaned['main'].map(lambda x: tknzr.tokenize(x))
cleaned.head()

In [None]:
remove_stops('english', cleaned, 'main')
remove_stops('french', cleaned, 'main')
remove_stops('german', cleaned, 'main')

In [None]:
stem_words('english', cleaned, 'main')
stem_words('french', cleaned, 'main')
stem_words('german', cleaned, 'main')

#### Data processing

In this part, we cross the dataset with our dictionaries to retrieve the tweets exhibiting mental distress. This is the first (less naive) step of our analysis before using Topic Modeling methods on our data.

In [None]:
cleaned.main = cleaned.main.map(lambda x:  ' '.join(x))
de_dict = de_dict.map(lambda x:  ' '.join(x))
en_dict = en_dict.map(lambda x:  ' '.join(x))
fr_dict = fr_dict.map(lambda x:  ' '.join(x))

We now check, for each tweet, whether it contains at least one dictionary entry.

In [None]:
def check_dict(tweet, dict_):
    """checks if at least one dict entry occurs in the tweet (substring match)"""
    return any(w in tweet for w in dict_)

This gives us the following results for each language.


For English:

In [None]:
english_filtered = cleaned[cleaned.lang == 'en']['tweets']\
[cleaned[cleaned.lang == 'en']['main'].map(lambda x: check_dict(x, en_dict))]
english_filtered.head()

In [None]:
english_filtered.count()

We can clearly see sadness, loneliness or fear in these tweets, but some of them do not actually express distress.
We also note that the number of tweets has been greatly reduced by this step.

Now we look at the French tweets and again find similar results:

In [None]:
french_filtered = cleaned[cleaned.lang == 'fr']['tweets']\
[cleaned[cleaned.lang == 'fr']['main'].map(lambda x: check_dict(x, fr_dict))]
french_filtered.head()

In [None]:
french_filtered.count()

Finally, for German we get the following results:

In [None]:
german_filtered = cleaned[cleaned.lang == 'de']['tweets']\
[cleaned[cleaned.lang == 'de']['main'].map(lambda x: check_dict(x, de_dict))]
german_filtered.head()

In [None]:
german_filtered.count()

Not only do all of these tweets refer to the person, but they also talk about sadness or disappointment.
The remaining number is quite small compared to the initial 4000 German tweets in the dataset, but it is a number we might expect.

Still, those results could be better.
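
One likely source of false positives is that `check_dict` is a substring test on the joined string, so a short dictionary entry can match inside a longer, unrelated word. A possible refinement, shown here only as a sketch and not as the method used above, is to compare whole tokens. The helper names and the example entries are hypothetical.

```python
def contains_phrase(tweet_tokens, entry_tokens):
    """Return True if entry_tokens appears as a contiguous run in tweet_tokens."""
    n = len(entry_tokens)
    if n == 0:
        return False  # an empty entry should never match
    return any(tweet_tokens[i:i + n] == entry_tokens
               for i in range(len(tweet_tokens) - n + 1))


def check_dict_tokens(tweet, dict_):
    """Token-based variant: an entry matches only on whole words."""
    tweet_tokens = tweet.split()
    return any(contains_phrase(tweet_tokens, entry.split()) for entry in dict_)


tweet = "start new diet today"
print("die" in tweet)                     # substring test matches inside 'diet'
print(check_dict_tokens(tweet, ["die"]))  # whole-token test does not
```

The trade-off is that whole-token matching relies on the tweet and the dictionary being stemmed in exactly the same way, whereas the substring test is more forgiving of small differences.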

#### Mapping between the Spinn3r and twitter-swisscom columns

Formatting is as follows: **Column in Spinn3r** -> Column in twitter-swisscom

- **author_user_id**           -> userId
- **geo_point**                -> longitude + latitude (or: placeLongitude + placeLatitude)
- **main**                     -> text
- **published**                -> createdAt
- **source_followers**         -> followersCount
- **source_following**         -> friendsCount (number of accounts the user follows)
- **source_location**          -> userLocation


#### twitter-swisscom columns having no equivalent:
- id                                -> tweet ID
- truncated                         -> meaning unclear; the column only contains NaNs
- placeId                           -> ID of the place the tweet was sent from
- inReplyTo                         -> whether the tweet is a reply
- source & sourceName, sourceUrl    -> what the user used to send the tweet (Android, website, and so on)
- userName                          -> username
- screenName                        -> name shown on tweets from username
- statusesCount                     -> number of statuses of the user

#### Spinn3r columns having no equivalent:
- **author_gender**                 -> author's gender
- **lang**                          -> language of tweet (given)
- **sentiment**                     -> tweet's sentiment analysis (given)
- **source_spam_probability**       -> spam probability of tweet (given)
- **tags**                          -> hashtags contained in the tweet (in twitter-swisscom they stay in the text)
- **tokenized**                     -> tokenized text of tweet (done by us)
- **keywords**                      -> keywords found in tweets (done by us)
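
The column correspondence listed in this section can be written down as a simple rename table, so that functions developed on the `main`/`published`-style columns can be reused on records with `text`/`createdAt`-style columns. This is only a sketch: the `harmonize` helper and the sample record are hypothetical, and **geo_point** is left out because building it from longitude and latitude takes more than a rename.

```python
# userId-style column name -> author_user_id-style column name, as listed in this section
COLUMN_RENAMES = {
    "userId": "author_user_id",
    "text": "main",
    "createdAt": "published",
    "followersCount": "source_followers",
    "friendsCount": "source_following",
    "userLocation": "source_location",
}


def harmonize(record):
    """Rename the keys of a record; keys without an equivalent are kept as-is."""
    return {COLUMN_RENAMES.get(key, key): value for key, value in record.items()}


# Hypothetical record using the userId-style names
sample = {"userId": 42, "text": "hello", "statusesCount": 10}
print(harmonize(sample))
```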

## 5. Analysis of the Second Dataset

## 6. Final Findings

## 7. Conclusion