# Project: Mental health in Switzerland


@oth: Describe getting the dataset from the cluster...

We quickly import the libraries to be used later:

In [None]:
import numpy as np
import pandas as pd
import pyspark as ps
import matplotlib.pyplot as plt

try:
    from pandas import json_normalize  # pandas >= 1.0
except ImportError:
    from pandas.io.json import json_normalize  # older pandas versions
import json

## 1. dataset selection & analysis

The goal of this first analysis is to familiarize ourselves with the dataset in order to know whether we need to adapt our research question or enrich the dataset with external information to perform our analysis.

We were provided two separate datasets containing Swiss tweets. They were formatted differently and contained different fields; one spans multiple years, while the other only covers 10 months.
We analyzed both in order to decide which one to use in our project.
After analyzing both sets, we decided to use **dataset 2** for our project.

While dataset 1 contains more precise location information in the form of longitude and latitude, dataset 2 contains a sentiment analysis field as well as a language field.

Categorizing the language of each tweet in dataset 1 was quite computationally expensive (we had to deal with the network latency of API requests) and required a lot of preprocessing, so the fact that dataset 2 already contains this field gives it a clear advantage.

We now provide a quick overview of dataset 1.

### dataset 1 (twitter-swisscom)

The dataset comes with a *txt schema*, giving us an idea of what each column in the *tsv file* containing the tweets represents. A sample file was given, but we obtained the complete set of tweets (5 GB) via a .zip.

The dataset contains the following useful columns:

- userId : id identifying user
- createdAt : time of posting tweet
- text : content of tweet
- placeLatitude : latitude of tweet
- placeLongitude : longitude of tweet
- sourceName : username
- sourceUrl : url of tweet
- followersCount : number of followers
- friendsCount : number of accounts the user follows
- statusesCount : number of statuses of user

The sample dataset contains a lot of NaN values: every column has at least 1% of its values missing.
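The share of missing values per column can be checked with a few lines of pandas. The sketch below uses a tiny made-up frame in place of the real sample file: the column names come from the schema above, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the dataset 1 sample (values are invented).
sample_df = pd.DataFrame({
    'userId': [1, 2, 3, 4],
    'text': ['hello', np.nan, 'gruezi', 'salut'],
    'placeLatitude': [46.5, np.nan, np.nan, 47.4],
})

# isna() marks missing cells; the column-wise mean is the missing fraction.
nan_percent = sample_df.isna().mean() * 100
print(nan_percent)
```

Applied to the real sample, the same expression would give the per-column percentages summarized above.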

The complete analysis and code can be found in the [Basic Exploration dataset 1 notebook](Basic%20Exploration%20Dataset%201.ipynb)

### dataset 2 (from Spinn3r)

This dataset has an elaborate description of each field available at the [spinn3r website](http://docs.spinn3r.com/?Example#content-schema).
Unlike the previous dataset, this dataset is given in json format.

To deal with the amount of data present in the cluster we look at one day to perform our first analysis and then show how to scale up.

The dataset is a nested JSON that we could not extract directly using the read_json function provided. We thus use the JSON normalizer contained in the pandas library to flatten it. We will later see that Spark deals better with nested JSON.

The fields found in this dataset are:

In [None]:
EXAMPLE_PATH = 'swiss-tweet/example.json'

with open(EXAMPLE_PATH) as data_file:    
    data = json.load(data_file)

twitter_df = json_normalize(data)
#rename columns for convenience
twitter_df.columns = [ column.replace('_source.','') for column in twitter_df.columns]
twitter_df.columns
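To make the flattening step concrete, here is a minimal sketch on hand-written records. The record layout (a `_source` object wrapping the tweet fields) is an assumption inferred from the `_source.` prefix stripped above, and the field values are invented.

```python
import pandas as pd

# Hypothetical records mimicking the nested layout (values are invented).
records = [
    {'_id': 'a1', '_source': {'main': 'Bonne année!', 'lang': 'fr'}},
    {'_id': 'b2', '_source': {'main': 'Happy new year', 'lang': 'en'}},
]

# json_normalize flattens nested objects into dotted column names
# (top-level pd.json_normalize requires pandas >= 1.0).
flat_df = pd.json_normalize(records)

# Drop the '_source.' prefix, as done for the real data.
flat_df.columns = [c.replace('_source.', '') for c in flat_df.columns]
print(list(flat_df.columns))
```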

Out of these columns, the ones we can use are:
- main: contains the content of the tweet
- published: gives the time on which the content was posted
- source_spam_probability: probability of tweet being spam
- source_location: location of tweet
- tags: tags associated with tweet, as provided by spinn3r
- lang: language of tweet
- sentiment: sentiment score of tweet -POSITIVE, NEGATIVE, NEUTRAL-
- author_gender: gender of author -MALE, FEMALE, UNKNOWN-
- source_followers: number of followers of the user who tweeted
- source_following: number of accounts the user follows

In [None]:
columns = ['main', 'published', 'source_spam_probability', 'source_location', 'tags', 'lang', 'sentiment',
           'author_gender', 'source_followers', 'source_following']
twitter_df = twitter_df[columns]

We now look at general distributions in this dataset. While this example is not representative when it comes to tweet content (especially since it contains tweets from the 1st of January), it can still give us insights into the other fields.

We assume that roughly the same categories of users were active on that day, so we can draw conclusions about the distribution of language and gender.

The language distribution is the following:

In [None]:
twitter_df['lang'].value_counts()

We see that English, French and German are the most frequent. This is good, as those are the languages we plan on using.

We now look at the distribution of gender in the dataset:

In [None]:
twitter_df['author_gender'].value_counts()

We see that most accounts do not seem to contain this information. There are still a lot that do, so we could use those to look at differences between genders. This would not give us an unbiased set, however, as the users who declare their gender on Twitter may differ from those who choose not to.

We now look at the sentiment column, to see how the tweets were labeled.

In [None]:
twitter_df['sentiment'].value_counts()

We see that the vast majority of tweets were labeled as neutral, and only a very small number are labeled as negative. We will thus look at both neutral and negatively labeled tweets.
Under the assumption that the POSITIVE labels are not false positives, a tweet showing signs of mental distress will not be labeled as POSITIVE, hence we can safely exclude these tweets from further analysis.
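Excluding the POSITIVE tweets could look like the following sketch. The frame is a made-up stand-in for `twitter_df`; only the `sentiment` labels are taken from the field description above.

```python
import pandas as pd

# Toy stand-in for twitter_df (texts are invented).
tweets = pd.DataFrame({
    'main': ['great day', 'i feel empty', 'train is late'],
    'sentiment': ['POSITIVE', 'NEGATIVE', 'NEUTRAL'],
})

# Keep only the labels we want to analyze further.
kept = tweets[tweets['sentiment'].isin(['NEUTRAL', 'NEGATIVE'])]
print(kept)
```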

Looking at the spam probability, we see that not a single tweet was labeled as spam. This calls the accuracy of the labeling into question, as the set of tweets on that day almost certainly contains spam. We will still use it: we assume the chance of false positives is low, so we lose nothing by using it.

In [None]:
twitter_df['source_spam_probability'].value_counts()
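One possible way to apply this field as a filter is sketched below. The threshold value and the choice to keep tweets with a missing probability are both assumptions made for illustration, not decisions taken in this analysis.

```python
import pandas as pd

SPAM_THRESHOLD = 0.5  # hypothetical cut-off, to be tuned on real data

# Toy stand-in for twitter_df (values are invented).
tweets = pd.DataFrame({
    'main': ['hello', 'WIN A PRIZE', 'no score here'],
    'source_spam_probability': [0.0, 0.9, None],
})

# Assumption: a missing probability is treated as "not spam".
spam_score = tweets['source_spam_probability'].fillna(0.0)
not_spam = tweets[spam_score < SPAM_THRESHOLD]
print(not_spam['main'].tolist())
```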

We now examine the locations provided by the dataset:

In [None]:
#we only look at the locations for the languages we care about, as location seems to be language dependent
twitter_df[twitter_df.lang.isin(['de', 'fr', 'en'])]['source_location'].value_counts()

We see that:
- there are a lot of locations that are the same but in a different language, such as Switzerland and Schweiz
- the names of the locations are not just in the languages we are interested in (see สวิตเซอร์แลนด์)
- a vast majority of the dataset is just labeled as 'switzerland'
- but as opposed to dataset 1, they are all located in Switzerland
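The first two observations suggest mapping location spellings onto one canonical name before counting. The sketch below does this with a small alias table; the table and the helper `canonical_location` are hypothetical and would need to be extended from the real value counts.

```python
import pandas as pd

# Hypothetical alias table: lowercase spelling -> canonical name.
LOCATION_ALIASES = {
    'switzerland': 'Switzerland',
    'schweiz': 'Switzerland',
    'suisse': 'Switzerland',
    'svizzera': 'Switzerland',
    'สวิตเซอร์แลนด์': 'Switzerland',
}

def canonical_location(raw):
    """Map a raw location string to its canonical name, if known."""
    if not isinstance(raw, str):
        return raw  # leave missing values untouched
    cleaned = raw.strip()
    return LOCATION_ALIASES.get(cleaned.lower(), cleaned)

# Invented sample of raw locations.
locations = pd.Series(['Switzerland', 'Schweiz', ' suisse ', 'Zürich', None])
normalized = locations.map(canonical_location)
print(normalized.value_counts())
```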

In [None]:
twitter_df.count()  # non-NaN values per column; compare with len(twitter_df) to get the number of NaN

#### looking at the tweets

While this set of tweets is not representative, we can still use it to find potential issues we might have with the tweet content:

In [None]:
pd.set_option('display.max_colwidth', None)  # None disables truncation (pandas >= 1.0; older versions used -1)
twitter_df.sample(n=10)['main']

We immediately see that the tweets containing links are not relevant to our research question, as they are mostly news or ads. We make the assumption that this would be the case at any time of the year.

We look at the tweets containing links and confirm:

In [None]:
# match both http:// and https:// links; na=False skips tweets with missing text
twitter_df.main[twitter_df.main.str.contains(r'https?://', na=False)].head(10)

We now perform a vastly simplified version of the dictionary matching we will perform to get relevant tweets, and analyze the results:

In [None]:
pd.set_option('display.max_colwidth', 100)
twitter_df[twitter_df['main'].map(lambda x: 'suic' in x)]['main']  # news instead of personal references
# removing news would be good
# we also see that we should remove pic.twit

In [None]:
pd.set_option('display.max_colwidth', 100)
twitter_df[twitter_df['main'].map(lambda x: 'therapie' in x)]['main']  # ads instead of personal references
# all contain links: a reason to remove links
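The two lookups above can be generalized into a single matching function that also skips tweets containing links. This is only a sketch: the keyword list is a hypothetical mini-dictionary, and treating `pic.twitter.com` as the image-link marker is an assumption.

```python
import re

# Hypothetical mini-dictionary of substrings to look for.
KEYWORDS = ['suic', 'therapie', 'depress']

# Assumption: links show up as http(s) URLs or pic.twitter.com references.
LINK_PATTERN = re.compile(r'https?://|pic\.twitter\.com', re.IGNORECASE)

def matches_dictionary(text):
    """Return True if a link-free tweet contains any dictionary keyword."""
    if LINK_PATTERN.search(text):
        return False  # tweets with links were mostly news or ads
    lowered = text.lower()
    return any(keyword in lowered for keyword in KEYWORDS)

# Invented example tweets.
tweets = [
    'Suicide rate report http://news.example/x',
    'meine Therapie beginnt morgen',
    'Skiing today',
]
relevant = [t for t in tweets if matches_dictionary(t)]
print(relevant)
```

Lowercasing before matching avoids missing capitalized forms such as "Therapie", which the case-sensitive lookups above would skip.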


## 2. dataset cleaning

explain that we began with pandas (local proof of concept) and scaled up using spark

### 2.1 unnesting the json

### 2.2 column selection

### 2.3 language filtering

### 2.4 sentiment analysis

### 2.5 spam removal

### 2.6 time format encoding

### 2.7 text treatment

lowercase, normalize (unicode)

url removal, RT removal
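The text treatment steps listed above could be combined as in the following sketch. The regular expressions, the NFKC normalization form, and the helper name `clean_text` are assumptions chosen for illustration.

```python
import re
import unicodedata

# Assumed patterns: http(s) URLs and a leading "RT @user:" marker.
URL_PATTERN = re.compile(r'https?://\S+')
RT_PATTERN = re.compile(r'^rt\s+@\w+:?\s*')

def clean_text(text):
    """Lowercase, unicode-normalize, and strip URLs and retweet markers."""
    text = unicodedata.normalize('NFKC', text).lower().strip()
    text = URL_PATTERN.sub('', text)
    text = RT_PATTERN.sub('', text)
    # Collapse the whitespace left behind by the removals.
    return ' '.join(text.split())

print(clean_text('RT @user: Bonne Année https://t.co/abc'))
```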

## 3. NLP methods

### 3.1. tokenizing

### 3.2 stop word removal

### 3.3 stemming

### 3.4 dictionary processing

#### building the dictionary

#### processing the dictionary

### 3.5 processing the data

## 4. ML 

### 4.1. labeling the tweets

### 4.2 constructing TF-IDF features

### 4.3 train SVM classifier

### 4.4 relabel training set

## 5. final data analysis

LDA to find similarities

## 6. Conclusions

# References and bibliography

[1] https://github.com/master/spark-stemming performing stemming with spark

[2] http://nbviewer.jupyter.org/gist/mizvol/eb24770ac3d5d598463f972e2a669f03 example dataprocessing pipeline

[3] https://spark.apache.org/docs/2.1.0/ml-features.html ml methods we can use with spark

[4] http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/ best ways to do text classification

[5] https://www.rand.org/content/dam/rand/pubs/rgs_dissertations/RGSD300/RGSD391/RAND_RGSD391.pdf dissertation containing dict 1

[6] https://www.aaai.org/ocs/index.php/ICWSM/ICWSM11/paper/viewFile/2880/3264 public health paper, LDA usage

[7] https://docs.google.com/spreadsheets/d/1WwI9crZk36pcTOQ1g_5dumMd11OlkpFRNHsEvpkwLMk/edit?usp=sharing our dictionary

[8] https://getd.libs.uga.edu/pdfs/kale_sayali_s_201512_ms.pdf second thesis containing dict