# Project: Mental health in Switzerland

In this notebook, we describe the multiple steps of our preprocessing pipeline and the work done on the Twitter dataset to answer our research question.

### Overview:

<ol start="0">
  <li><b>Data Collection & Preliminary Analysis:</b> how we accessed the datasets and why we use them</li>
  <li><b>Data Treatment:</b> cleaning and NLP on the first dataset</li>
  <li><b>Datasets Description for Spinn3r:</b> a first look at the Spinn3r Dataset</li>
  <li><b>Dictionary Construction:</b> creation of the dictionary we are going to use on the datasets</li>
  <li><b>Dictionary Update:</b> updating the dictionary with what we learned from the dataset</li>
  <li><b>Analysis of First Dataset:</b> analysis of the smaller Spinn3r set for a proof of concept</li>
  <li><b>Processing Pipeline on the Second Dataset:</b> apply what we learned on the previous dataset on the second one</li>
  <li><b>Analysis of the Second Dataset:</b> applying what we learned on the smaller dataset on the swisscom-twitter dataset</li>
  <li><b>Conclusion</b></li>
</ol>

We quickly import the libraries to be used later:

In [None]:
import numpy as np
import pandas as pd
import pickle as pkl
import pyspark as ps
import matplotlib.pyplot as plt
import seaborn as sns
import pprint

#Part 1
import json
from pandas.io.json import json_normalize

#Part 2
import re
import unicodedata

#Part 3
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize, TweetTokenizer
from collections import Counter

from gensim import corpora, models
import pyLDAvis.gensim

## 0. Data Collection & Preliminary Analysis

We were provided two separate datasets containing Swiss tweets. They were formatted differently and contained different fields. On top of this, one dataset (twitter-swisscom, referred to as the second dataset) contains data collected over multiple years, while the other (Spinn3r, referred to as the first dataset) only covers a span of 10 months.


The complete Twitter data for both datasets was collected from the cluster provided in the class.

An example dataset for _**twitter-swisscom**_ was retrieved from a provided .zip file.

The _**Spinn3r**_ dataset, however, did not come with a specific example dataset. As a stand-in, we extracted the first JSON file to perform our analysis. To do this, we used the following commands after connecting to the cluster:

```bash
cluster$ hadoop fs -getmerge /datasets/swiss-tweet/harvest3r_twitter_data_01-01_0.json /buffer/example.json
local$ scp -r gaspar@iccluster060.iccluster.epfl.ch:/buffer/example.json <local-path>
```

### Preliminary analysis

The goal of our preliminary analysis was to decide which dataset would be more useful but also to familiarize ourselves with the chosen dataset in order to know if we need to adapt our research questions or enrich the dataset with external information in order to perform our analysis.

After the analysis (presented below), we decided to use **both datasets** for our project. While the twitter-swisscom dataset contains more precise location information (longitude and latitude), the Spinn3r dataset contains a sentiment analysis field as well as a language field and has a lot less data. Thus, we decided to create our functions and methods using the smaller dataset (Spinn3r) as a proof of concept, before scaling up and using the twitter-swisscom dataset to further our analysis.

The Spinn3r dataset has an elaborate description of each field available on the [spinn3r website](http://docs.spinn3r.com/?Example#content-schema). Given the amount of data present in the cluster, we only look at one day to perform our first analysis (we will later show how we scale our operations using Spark).

This dataset is given in JSON format and has nested elements. As the *read_json* function does not work well with nested JSON data, we use a JSON normalizer, which is provided in the *pandas.io.json* module.

_Note: We will later see that Spark deals better with nested JSON._
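
To make the flattening step concrete, here is a minimal standard-library sketch of what a JSON normalizer does conceptually: nested keys are joined with dots, which is why column names carry a `_source.` prefix before we rename them. The `flatten` helper and the sample record are hypothetical illustrations; they are not the pandas implementation, and the record is not real Spinn3r data.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten a nested dict into a single-level dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # recurse into nested objects, carrying the key prefix along
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat

# hypothetical record shaped like one nested entry of the example file
sample = {"_id": "1", "_source": {"main": "hello", "lang": "en"}}
flat_sample = flatten(sample)

# strip the '_source.' prefix, mirroring the column renaming done on the DataFrame
renamed = {key.replace("_source.", ""): value for key, value in flat_sample.items()}
```

Lists inside a record are left untouched by this sketch; it only illustrates how nested objects become flat columns.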

The fields found in this dataset are:

In [None]:
EXAMPLE_PATH = 'data/example_data/spinn3r_example.json'

with open(EXAMPLE_PATH) as data_file:    
    data = json.load(data_file)

twitter_df = json_normalize(data)
#rename columns for convenience
twitter_df.columns = [column.replace('_source.','') for column in twitter_df.columns]
twitter_df.columns

Out of all these columns, we find the ones below to be the most useful for our analysis:
- **main**: contains the content of the tweet.
- **published**: gives the time on which the content was posted.
- **source_spam_probability**: probability of tweet being spam.
- **source_location**: location of the tweet.
- **tags**: tags associated with the tweet (provided by Spinn3r).
- **lang**: language of the tweet.
- **sentiment**: sentiment score of the tweet -POSITIVE, NEGATIVE, NEUTRAL-.
- **author_gender**: gender of the author -MALE, FEMALE, UNKNOWN-.
- **source_followers**: followers of the user who tweeted.
- **source_following**: number of people the user follows.

We consider these fields to be the most useful as:

- **Main** stands at the center of our analysis. We plan to perform NLP methods in order to identify relevant tweets and use the content as well to identify related words.
- **Published** can be used to map the tweets over the duration of the year and look for seasonal changes.
- **Source_location** can be used to look at the geographical distribution of the tweets.
- **Lang** will be used to filter out unwanted languages, which we need to do in order to perform our NLP tasks.
- **Author_gender** will be used to identify the gender and look at the differences between both genders.

_Note:_ on the cluster we find additional fields, out of which we decide to use **'author_user_id'**, as a means to uniquely identify a user, and **'geo_point'**, containing latitude and longitude, as a way to identify user location.

In [None]:
columns = ['main', 'published', 'source_spam_probability', 'source_location', 'tags', 'lang', 'sentiment',
                   'author_gender', 'source_followers', 'source_following']
twitter_df = twitter_df[columns]

#### Initial analysis of categorical data

We now give a quick look into data contained in the example dataset, to find how we need to clean it.

In [None]:
plt.figure(figsize=[8, 5])
twitter_df['source_spam_probability'].value_counts().plot(kind='pie')
plt.axis('equal')
plt.show()

Looking at the spam probability of the tweet set, we see that not a single tweet was labeled as spam. This makes us question the accuracy of the labeling as the set of tweets on that day most certainly contains spam. However, we will still use it as we assume the chance of having false positives to be very low.

In [None]:
twitter_df[twitter_df.lang.isin(['de', 'fr', 'en'])]['source_location'].value_counts().tail(20)

As the location seems to be language dependent, we only care about tweets written in the 3 languages we need. We see that:
- A lot of locations only differ in language (e.g. as Switzerland and Schweiz)
- The names of the locations are not always given in the languages we are interested in (e.g. สวิตเซอร์แลนด์)
- A vast majority of the dataset is just located in 'Switzerland'
- As opposed to dataset 1, all tweets are located in Switzerland

In [None]:
twitter_df.count()

Looking at the missing values in the dataset, we see that for most tweets, the tags are missing. This may indicate that the tags won't be useful for our analysis, but this does not influence our research at this stage.

We also note that for some tweets, the sentiment is missing. As noted on the Spinn3r website, this may be due to the fact that some tweets do not contain enough linguistic information. As we will filter out such tweets, the remaining set should contain sentiment. Even if it doesn't, this field is not central to our analysis; it is merely used to help us filter our dataset.

### Identifying potential issues

While this set of tweets is not representative, we can still use it to find potential issues we might have with the tweets' content:

In [None]:
pd.set_option('display.max_colwidth', 200)
twitter_df.sample(n=10)['main']

Looking at the sample, we immediately see that the tweets containing links are not relevant to our research question (they are mostly news or ads).

In [None]:
twitter_df.main[twitter_df.main.map(lambda x: 'http://' in x)].head(10)

Looking more intently at the tweets containing links, we can make the assumption that the relationship between URLs and spam is a general rule (at any time of the year).

We now perform a vastly simplified version of the dictionary matching we will use later to get relevant tweets, and analyze the results.

Here we look at the occurrence of 'suicide' in the tweets:

In [None]:
pd.set_option('display.max_colwidth', 100)
twitter_df[twitter_df['main'].map(lambda x: 'suicide' in x) ]['main']

By performing a simplistic dictionary matching using the occurrences of the word *'suicide'* in our tweet set, we see that a lot of these tweets contain news. This further supports our choice to remove tweets with URLs in order to get better results.

In [None]:
twitter_df[twitter_df['main'].map(lambda x: 'therapie' in x) ]['main']

Looking at the word *'therapy'* (in German), we can confirm once again the issue there is with URLs.

In [None]:
twitter_df[twitter_df['main'].map(lambda x: 'RT ' in x) ]['main'].head()

Finally, we look at the retweets to get an idea on how useful they could be. From what we see, and what others have seen [8], they are not.

## 1. Data Treatment

Now that we have done some data exploration, we have to clean our dataset to be able to use it correctly for our research.

First, we use Pandas locally on a small subset of the tweets to explore different cleaning methods and make sure the functions we chose work as expected. After proving that our concept works, we use Spark to scale up our operations and be able to perform them on all the files provided on the cluster.

### Tweet filtering

Given the context of our research, there are many useless tweets. 
- First, we would like to get rid of any tweets with a spam probability greater than 0.5 (an arbitrary value we found sound). 
- Finally, as we are focused on the Swiss population, we only keep tweets in the official languages (except for Italian as no one in our group speaks the language, meaning we won't be able to extract useful information) and in English (as most tweets in the dataset, and more generally on Twitter, are written in English)

In [None]:
spam_mask = (twitter_df['source_spam_probability'] >= 0.5)
twitter_df.drop(twitter_df[spam_mask].index, inplace=True)
twitter_df.reset_index(drop=True, inplace=True)

In [None]:
lang_mask = ~twitter_df.lang.isin(['de', 'en', 'fr'])
twitter_df.drop(twitter_df[lang_mask].index, inplace=True)
twitter_df.reset_index(drop=True, inplace=True)

### Text treatment

Now that we only have relevant tweets (following a mild cleaning), we need to work on the main subject of our analysis: the content of the tweets itself. This is necessary to ease the _3rd Part_ of our research (processing the text to find patterns using NLP, _Natural Language Processing_).

To make later temporal analysis easier we transform the column containing the dates into datetime format.

In [None]:
twitter_df['published'] = pd.to_datetime(twitter_df['published'])
twitter_df['published'].head()

The first step of our treatment is to lowercase all characters (this allows comparing the tweets without case-sensitive searches).

In [None]:
twitter_df['main'] = twitter_df['main'].str.lower()
twitter_df['main'].head()

As we have seen in the preliminary analysis, URLs are highly linked to spam (except for URLs linking pictures to the tweet). As image processing is not in the scope of this project, we remove all "pic.twitter" URLs and mentions of the format "@user". After that, we remove all tweets containing a URL and retweets as they do not give us any information.

In [None]:
# flag tweets containing a URL (raw strings avoid invalid escape warnings)
url_mask = twitter_df['main'].str.contains(r"www\S+|http\S+", regex=True)
# strip picture links; regex=True must be passed explicitly in recent pandas versions
twitter_df['main'] = twitter_df['main'].str.replace(r"pic\.twitter\S+", '', regex=True)
twitter_df.drop(twitter_df[url_mask].index, inplace=True)
twitter_df.reset_index(drop=True, inplace=True)
twitter_df['main'].head()
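
The cleaning rules described above (strip picture links and mentions, then drop tweets with URLs and retweets) can also be sketched on plain strings with the standard library. This is a minimal illustration under stated assumptions, not the code run on the cluster: the `clean_tweet` helper, the patterns and the example tweets are hypothetical, and retweets are assumed to start with `rt ` once lowercased.

```python
import re

PIC_PATTERN = re.compile(r"pic\.twitter\S+")
MENTION_PATTERN = re.compile(r"@\w+")
URL_PATTERN = re.compile(r"(?:www|http)\S+")

def clean_tweet(text):
    """Return the cleaned tweet, or None if the tweet should be dropped."""
    text = text.lower()
    # assumption: a retweet starts with 'rt ' after lowercasing
    if text.startswith("rt "):
        return None
    # remove picture links and @user mentions before checking for other URLs
    text = PIC_PATTERN.sub("", text)
    text = MENTION_PATTERN.sub("", text)
    if URL_PATTERN.search(text):
        return None
    # collapse the whitespace left behind by the removals
    return " ".join(text.split())

# hypothetical example tweets
examples = [
    "Feeling tired today @friend pic.twitter.com/abc123",
    "RT @news: breaking story",
    "Read this http://example.com/article",
]
cleaned = [clean_tweet(t) for t in examples]
```

Only the first example survives here: the second is treated as a retweet and the third contains a URL.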

In [None]:
alphanum_filter = lambda data: re.sub(r'[^\w\s]', '', data)

In [None]:
twitter_df['main'] = twitter_df['main'].map(alphanum_filter)
twitter_df['main'] = twitter_df['main'].str.replace('_', '') #not removed by previous regex
twitter_df.main.head()

### Spark version

After performing all these cleaning steps, the size of the set is **significantly reduced** (the new file only weighs **2.4MB** instead of the **55MB** of the original file, a reduction of more than **20x**). This is particularly helpful as the whole dataset represents around **30GB**.

_Note:_ the code was defined as _**Raw NBConvert**_ as it is not meant to be run in the notebook but on the cluster. The code is only here as information for the reader; its executable version can be found in the run.py file.

We start by importing the necessary elements and defining Spark environment variables.

We then import our data.

We filter tweets using the same method.

Text treatment is a little longer as it is more thorough in Spark.

Finally, we store the data in the _JSON_ we will be using.

## 2. Dataset Description for Spinn3r

We now perform a general analysis of the dataset on which we apply all the NLP methods and unsupervised clustering.

We use a pkl file that was generated using *processing_pipeline.py* for speed and convenience.

In [None]:
twitter_df = pkl.load( open( "data/spinn3r_tweets/processed_tweets.pkl", "rb" ))

### General statistics:

We look at how many values we have for each field. The total number of tweets is $3936084$. We see that less than half of all tweets have a $geo\_point$ value associated with them.

In [None]:
twitter_df.count()

### Distribution of categorical data

We now look at the distribution of categorical features in our dataset.

In [None]:
plt.figure(figsize=[5, 5])
twitter_df['lang'].value_counts().plot(kind='pie')
plt.axis('equal')
plt.show()

We see that almost half of all tweets are in English, followed by French and German. This ordering is interesting: given the Swiss population, one would expect more German than French tweets.

In [None]:
plt.figure(figsize=[5, 5])
twitter_df['author_gender'].value_counts().plot(kind='pie')
plt.axis('equal')
plt.show()

We see that most accounts do not contain information on the user's gender (meaning there is no way we can have an unbiased set). However, the dataset still contains over a third of profiles where the gender is documented, meaning we could use these to look at the differences in mental distress between the genders. Also note that male and female users are not equally represented in the dataset.

In [None]:
plt.figure(figsize=[10, 5])
twitter_df['sentiment'].value_counts().plot(kind='pie')
plt.axis('equal')
plt.show()

Looking at the sentiment, we see that about three-quarters of all tweets are labeled as being neutral.

### Distribution of temporal data

We now look at the temporal distribution of the data. We look at daily and yearly patterns.

In [None]:
#Declaring functions
def get_time_distribution(times):
    publishing_time = times.map(lambda x: x.hour).value_counts()
    publishing_time.sort_index(inplace=True)
    return publishing_time

def plot_daily(df):
    times = get_time_distribution(df['published'])
    return plt.plot(times)

In [None]:
plot_daily(twitter_df)
plt.xlabel('hour')
plt.ylabel('number of tweets')
plt.title('distribution of tweets over 24 hours')
plt.show()

We see that the number of tweets is the largest around 8pm and rapidly decreases through the night, reaching its lowest point at around 3 in the morning. We can also observe a dip in the number of tweets during lunch time.

In [None]:
yearly_distribution = twitter_df.published.map(lambda x: x.month).value_counts()

In [None]:
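# sns.tsplot is no longer available in recent seaborn releases;
# a plain line plot over the months in calendar order shows the same information
monthly_counts = yearly_distribution.sort_index()
plt.plot(monthly_counts.index, monthly_counts.values)
plt.ylabel('number of tweets')
plt.title('distribution of tweets over 10 month period')
plt.xticks(range(1, 11),['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct'])
plt.xlabel('month')
plt.show()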

Looking at the yearly distribution, we see that the dataset is not balanced. It is possible that some of this imbalance is due to the number of Twitter users, and thus tweets, rapidly increasing over the 10 month period. It may also be linked to the way Spinn3r retrieved data from the set.

### Distribution of location data

We previously noted that only about 40% of all tweets have geo-location attached. We still look at how they are distributed over all. We note that we can find some points that are not in Switzerland. We do not filter out these tweets as we take the dataset as correct.

We use the tutorial at http://andrewgaidus.com/leaflet_webmaps_python/ to create this map. We use the geodata provided in the 3rd Homework and transform it into a shapefile using an online converter.


In [None]:
#putting this here for convenience

import geopandas as gpd
from geopandas.tools import sjoin
import folium
from folium.plugins import MarkerCluster
#from folium.element import IFrame
import shapely
from shapely.geometry import Point
import pysal as ps

In [None]:
#we should probably move this into processing
def process_location_data(df):
    """preprocessed data in format as found on cluster:
    df: dataframe, locations should be in 'geo_point'
    """
    non_null = df[~df.geo_point.isnull()]
    virgule = non_null[non_null.geo_point.str.contains(',')].geo_point.str.split(pat=',', expand=True).applymap(float)
    no_virgule = non_null[~non_null.geo_point.str.contains(',')].geo_point.str.split(expand=True).applymap(float)
    return pd.concat([virgule, no_virgule])
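
The function above handles two string layouts for `geo_point`: values separated by a comma and values separated by whitespace. As a minimal sketch of the same idea on a single value, assuming the first number is the latitude and the second the longitude (consistent with how the points are built afterwards), a hypothetical `parse_geo_point` helper could look like this:

```python
def parse_geo_point(raw):
    """Parse a 'lat,lon' or 'lat lon' string into a (lat, lon) tuple of floats."""
    if raw is None:
        return None
    # accept either a comma or whitespace as the separator
    parts = raw.replace(",", " ").split()
    if len(parts) != 2:
        return None
    return float(parts[0]), float(parts[1])

# hypothetical values in both layouts, plus a missing one
points = [parse_geo_point(p) for p in ["46.52,6.63", "47.37 8.54", None]]
```

Returning `None` for missing or malformed values keeps the helper usable in a simple `map` over the column.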

In [None]:
locations = process_location_data(twitter_df)

In [None]:
#transform to geoseries; Point expects (longitude, latitude)
location_geo = gpd.GeoSeries(locations.apply(lambda z: Point(z[1], z[0]), 1),crs={'init': 'epsg:4326'})

In [None]:
#merge with dataframe
location = gpd.GeoDataFrame(twitter_df[~twitter_df.geo_point.isnull()], geometry=location_geo)

In [None]:
tracts = gpd.read_file('SHAPEFILE_LV95_LN02/swissBOUNDARIES3D_1_3_TLM_LANDESGEBIET.shp')

In [None]:
#join based on location data, we associate a region with a location
tract_counts = gpd.tools.sjoin(location, tracts).groupby('NAME').size()

In [None]:
#add the counts to the tracts data
tracts.set_index('NAME', inplace=True)
tracts['counts'] = tract_counts
tracts.reset_index(inplace=True)

In [None]:
tract_counts.head() #this is how it looks like

Using this data, we can simply create a map.

In [None]:
SWISS_COORD = [46.85, 8.23] #location of switzerland
swiss_map = folium.Map(location = SWISS_COORD, zoom_start = 8, tiles = 'cartodbpositron')

swiss_map.choropleth(tracts.to_json(), data = tracts, key_on = 'feature.properties.{}'.format('NAME'),
                columns = ['NAME', 'counts'], fill_color = 'YlOrRd', threshold_scale=[1000, 10000, 20000, 30000, 40000])
swiss_map

In [None]:
swiss_map.save('lol.html')

We see that the tweets are concentrated in urban centers. A more finely grained map would make this pattern easier to see.

### Natural Language Processing

Using NLP _(Natural Language Processing)_ is necessary given the nature of our dataset: tweets. Following the steps used in previous courses and in the papers we read, we came up with the following pipeline in order to process the tweets (and dictionary). The actual pipeline is implemented in the *processing_pipeline.py* script, as it takes about 5 minutes to run on our set of tweets. Here we explain its functionality.

_Note: we use nltk, a go-to Python NLP library, which was very interesting for us as it offers operations in the various languages we are working with._

#### Helper functions
(cf. _Next part_ below for information)

#### Tweet formatting

NLP is mainly based on treating the data to obtain an adequate format to work with following 3 steps:
- Tokenizing: separating words; we use a special TweetTokenizer to work on the tweets
- Stop-words removal
- Stemming: keeping only the radical of a word (allows better comparison)
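
The three steps above can be sketched with the standard library alone. This is only an illustration of the order of operations: the stop-word set and the suffix stripping below are deliberately tiny, hypothetical stand-ins for nltk's per-language stop-word lists and its Snowball stemmer, and they will not reproduce nltk's output.

```python
import re

# tiny hypothetical stop-word list; the real pipeline uses nltk's per-language lists
STOP_WORDS = {"i", "am", "so", "the", "a", "of", "and"}
# naive suffix list standing in for a real Snowball stemmer
SUFFIXES = ("ing", "ed", "s")

def tokenize(text):
    """Split lowercased text into word tokens."""
    return re.findall(r"\w+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that carry little meaning on their own."""
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def process(text):
    """Tokenize, remove stop words, then stem, in that order."""
    return [naive_stem(t) for t in remove_stop_words(tokenize(text))]

processed = process("I am so tired of feeling lonely")
```

Stemming is applied last so that stop words are matched in their original form.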

## 3. Dictionary Construction

###  Building the dictionary

The first step of building our dictionary was doing research on previous dictionaries that were used for similar problems. We find two dictionaries for this purpose [2], [3]. 

We translate them carefully into French and German in order to be able to perform the analysis in those two other languages as well.

We show some of the words present in the dictionaries for example purposes:

In [None]:
#dictionary 1
DICT_PATH_1 = "data/dictionaries/dict_1.csv"
dictionary_1 = pd.read_csv(DICT_PATH_1)
dictionary_1

In [None]:
#dictionary 2
DICT_PATH_2 = "data/dictionaries/dict_2.csv"
dictionary_2 = pd.read_csv(DICT_PATH_2)
dictionary_2.head(10)

We now show how we performed NLP techniques, using one dictionary as an example. We perform the actual cleaning and matching of keywords in a script (dict_filtering.py) and save the result to a pkl file, which can be found in the data/keyworded_tweets folder.

We need to perform these NLP techniques in order for the dictionary to match the processed tweets.

In [None]:
en_dict = dictionary_1['english'].dropna()
fr_dict = dictionary_1['french'].dropna()
de_dict = dictionary_1['german'].dropna()

After fetching our 3 dictionaries (one per language), we clean them as we did for the tweets in _Part 2_.

In [None]:
tknzr = TweetTokenizer()
en_dict = en_dict.map(lambda x: tknzr.tokenize(x))
fr_dict = fr_dict.map(lambda x: tknzr.tokenize(x))
de_dict = de_dict.map(lambda x: tknzr.tokenize(x))

In [None]:
print('what the dictionary looks like after tokenizing: ')
print(en_dict.head())
print(fr_dict.head())
print(de_dict.head())

### Processing the dictionary

Once we have our (clean) data, we simply run all the previous NLP methods on our dictionaries. 

_Note_: Special methods were written for the dictionaries as they do not follow the same format as tweets, but they are the same in essence.

In [None]:
def dict_cleaning(dicts):
    return dicts.astype(str).str.lower()

def dict_remove_stops(dictionary, language):
    """removes stop words from every tokenized entry
    dictionary: series of token lists
    language: nltk language name, e.g. 'english'
    """
    lang_set = set(stopwords.words(language))
    return dictionary.apply(lambda expression: [word for word in expression if word not in lang_set])

def dict_stem_words(dictionary, language):
    """stems every token of every entry
    dictionary: series of token lists
    language: nltk language name, e.g. 'english'
    """
    stemmer = SnowballStemmer(language)
    return dictionary.apply(lambda expression: [stemmer.stem(word) for word in expression])

In [None]:
fr_dict = dict_remove_stops(fr_dict, 'french')
en_dict = dict_remove_stops(en_dict, 'english')
de_dict = dict_remove_stops(de_dict, 'german')

In [None]:
en_dict = dict_stem_words(en_dict, 'english')
fr_dict =  dict_stem_words(fr_dict, 'french')
de_dict =  dict_stem_words(de_dict, 'german')

After stemming and stop word removal, the dictionary entries look as follows:

In [None]:
en_dict.head()

In [None]:
fr_dict.head()

In [None]:
de_dict.head()

### Finding keyword matches

We now look for dictionary entries in each tweet using the following algorithm, which returns every entry whose words all occur in the tweet.

In [None]:
def compare(s, t):
    """compares two lists and returns True if they contain
    the same elements, regardless of order
    s, t: lists to compare
    """
    return Counter(s) == Counter(t)

def match_dict(tweet, dict_):
    """returns the dictionary entries whose words all occur in the tweet
    tweet: tweet to find keywords in
    dict_: list of keywords (each keyword is a list of tokens)
    """
    #an entry matches if keeping only its words found in the tweet leaves it unchanged
    doc = [sublist for sublist in dict_ if compare(list(filter(lambda x: x in tweet, sublist)), sublist) ]
    return doc #empty list if nothing matched
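
To illustrate the matching rule on a small example, here is a self-contained sketch that expresses the same condition with `all`: an expression matches when every one of its tokens is found in the tweet. The `match_expressions` name, the stemmed expressions and the tweet are hypothetical, and the tweet is assumed to be a list of tokens (with a raw string, `in` would perform substring matching instead).

```python
def match_expressions(tweet_tokens, expressions):
    """Return every expression whose tokens all appear in the tweet."""
    tweet_set = set(tweet_tokens)
    return [expr for expr in expressions if all(word in tweet_set for word in expr)]

# hypothetical stemmed dictionary entries and tokenized tweet
expressions = [["feel", "alon"], ["depress"], ["panic", "attack"]]
tweet = ["i", "feel", "so", "alon", "and", "depress"]
matches = match_expressions(tweet, expressions)
partial = match_expressions(["panic"], expressions)
```

A multi-word expression such as `["panic", "attack"]` does not match a tweet that contains only one of its words, and the order of the words in the tweet does not matter.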

This gives us the following results for each language. We perform the actual algorithm in a script and save the results into a pickle file.

#### Dictionary 1

The way to get the results would be to write the following line:

We inspect the results given:

In [None]:
df = pkl.load( open( "data/keyword_tweets/small_english_keyworded_tweets.pkl", "rb" ))

In [None]:
small_english_filtered = df[(df.lang == 'en') & (df.keywords.map(lambda x: x != [] ))]
small_english_filtered[['main', 'keywords']].head(10)

In [None]:
small_english_filtered.keywords.map(lambda x: ' '.join(x[0])).value_counts()

In [None]:
print('number of matching tweets: ', small_english_filtered.main.count())

In [None]:
df = pkl.load( open( "data/keyword_tweets/small_french_keyworded_tweets.pkl", "rb" ))
small_french_filtered = df[(df.lang == 'fr') & (df.keywords.map(lambda x: x != [] ))]
print('number of matching tweets: ', small_french_filtered.main.count())

In [None]:
small_french_filtered[['main', 'keywords']].head(10)

In [None]:
small_french_filtered.keywords.map(lambda x: ' '.join(x[0])).value_counts()

In [None]:
df = pkl.load( open( "data/keyword_tweets/small_german_keyworded_tweets.pkl", "rb" ))
small_german_filtered = df[(df.lang == 'de') & (df.keywords.map(lambda x: x != [] ))]
print('number of matching tweets: ', small_german_filtered.main.count())

In [None]:
small_german_filtered[['main', 'keywords']].head(10)

In [None]:
small_german_filtered.keywords.map(lambda x: ' '.join(x[0])).value_counts()

#### Dictionary 2

In [None]:
df = pkl.load( open( "data/keyword_tweets/english_keyworded_tweets.pkl", "rb" ))

In [None]:
english_filtered = df[(df.lang == 'en') & (df.keywords.map(lambda x: x != [] ))]
english_filtered[['main', 'keywords']].head(10)

In [None]:
print('number of matching tweets: ', english_filtered.main.count())

In [None]:
english_filtered.keywords.map(lambda x: ' '.join(x[0])).value_counts()

We can clearly see sadness and loneliness or fear in these tweets, but some of them do not actually express distress.
We also note that the number of tweets has been greatly reduced through this.

Now we look at the French tweets and again find similar results:

In [None]:
df = pkl.load( open( "data/keyword_tweets/french_keyworded_tweets.pkl", "rb" ))

In [None]:
french_filtered = df[(df.lang == 'fr') & (df.keywords.map(lambda x: x != [] ))]
french_filtered[['main', 'keywords']].head(20)

In [None]:
print('number of matching tweets: ', french_filtered.main.count())

In [None]:
french_filtered.keywords.map(lambda x: ' '.join(x[0])).value_counts()

Finally, for German we get the following results:

In [None]:
df = pkl.load( open( "data/keyword_tweets/german_keyworded_tweets.pkl", "rb" ))

In [None]:
german_filtered = df[(df.lang == 'de') & (df.keywords.map(lambda x: x != [] ))]
german_filtered[['main', 'keywords']].head(20)

In [None]:
print('number of matching tweets: ', german_filtered.main.count())

In [None]:
german_filtered.keywords.map(lambda x: ' '.join(x[0])).value_counts()

Not only do all of these tweets reference the person, but they also talk about sadness or disappointment.
The remaining number is quite small compared to the initial 4000 German tweets in the dataset, but it is a number we might expect.

Still those results could be better.

### Dictionary evaluation

In this part, we cross the dataset with our dictionaries to retrieve the tweets exhibiting mental distress. This is the first (less naive) step of our analysis before using Topic Modeling methods on our data.

#### Building the LDA

Using a script we build an LDA model and save it to the models folder.
We quickly present how we create this model, and display the results using the pyLDAvis library.

To build the LDA we create a dictionary and define the 'texts', in our case, the tweets.
We pass it to gensim which implements the building of an LDA model. 

In [None]:
dictionary = corpora.Dictionary(english_filtered.tokenized)
dictionary.filter_extremes(no_below=2, no_above=.8) #remove too frequent and too infrequent words
texts = english_filtered.tokenized.tolist() #list of all tweets
corpus = [dictionary.doc2bow(text) for text in texts]
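
To clarify what the vocabulary filtering and bag-of-words conversion do, here is a minimal standard-library sketch of the same idea: tokens are kept only if they appear in at least `no_below` documents and in at most a fraction `no_above` of all documents, and each tweet then becomes a list of `(token_id, count)` pairs. The helper names and sample texts are hypothetical, and the alphabetical id assignment is an assumption of this sketch; gensim assigns ids differently and applies further options not shown here.

```python
from collections import Counter

def build_vocabulary(texts, no_below=2, no_above=0.8):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents; map them to integer ids."""
    # document frequency: count each token once per document
    doc_freq = Counter(token for text in texts for token in set(text))
    max_docs = no_above * len(texts)
    kept = sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)
    return {token: idx for idx, token in enumerate(kept)}

def to_bow(text, vocabulary):
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = Counter(vocabulary[t] for t in text if t in vocabulary)
    return sorted(counts.items())

# hypothetical tokenized tweets
texts = [
    ["feel", "sad", "today"],
    ["feel", "alon", "sad", "sad"],
    ["feel", "happi"],
]
vocabulary = build_vocabulary(texts)
bow_corpus = [to_bow(text, vocabulary) for text in texts]
```

In this toy corpus, `feel` appears in every document and is dropped as too frequent, while `today`, `alon` and `happi` appear only once and are dropped as too rare, leaving `sad` as the only vocabulary entry.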

In [None]:
ldamodel = models.LdaMulticore(corpus, id2word=dictionary, num_topics=6, workers=3, iterations=100, passes=10)
ldamodel.show_topics()

In [None]:
#creating the visualisation
vis_data = pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)
pyLDAvis.display(vis_data)

For French and German:

In [None]:
german_model = models.LdaModel.load('models/german.model')

In [None]:
german_model.show_topics()

In [None]:
french_model = models.LdaModel.load('models/french.model')

In [None]:
french_model.show_topics()

#### First sample evaluation

As you can see, we label a sample of a processed dataset. Note that this task is complex as it is very subjective (choosing tweets displaying mental health issues was done at the discretion of the person labeling). Our main conclusion from this step was the necessity to "expand" our research to tweets showing signs of mental distress in general and not only mentioning mental illnesses.

In [None]:
first_iter = pd.read_csv("data/labeled_tweets/english_labeled.csv")
first_iter.head()

**Key _(qualitative)_ takeaways :**

- Tweets were labeled "loosely" using multiple signs of mental distress, mostly sadness. Thus, tweets with the following specificities were labeled as showing mental distress: nostalgia (either for the past or simply for the end of a nice day), mention of sad activities (people who watched sad movies, listened to sad songs as this was most probably triggered by a previous feeling of sadness) or mention of being mad (either over other people like haters or, more often, mad with their families)
- Some tweets mentioned the mental distress of other people (either how they helped them or raising awareness over these issues)
- Some tweets showed desperation (a form of mental distress) over the "way the world is", these were not included.
- An unexpected insight (which we decided not to pursue further as it was too tenuous) was the link between summer and mental distress. Indeed, most tweets mentioning summer showed some sadness due to the absence of friends (from university for example).
- Some tweets only included _"motivational"_ quotes. We can suppose that these reflect bad times. Nonetheless, these were not included as our previous assumption is a little far-fetched. Moreover, this could have messed with our model (they did not include specific words of our dictionary nor specific mentions to mental distress).
- A lot of mental distress tweets are related to unilateral feelings. They are often messages specifically targeted at someone who most certainly doesn't know of their existence (for example work colleagues or college classmates).
- Most tweets talking about depression or terribly low self-esteem show _« covert-humour »_, a coping technique used to hide one's self-deprecation. These tweets are very difficult to detect as they use sarcasm (which cannot be perceived using current NLP techniques) and seldom use words from our dictionary.

**Key _(quantitative)_ takeaways :**

After the labeling, we found the necessity of both changing our current dictionary (through additions but also removals) and creating a new dictionary (which we decided to call a _negative_ dictionary, but is simply a blacklist of words we did not want to appear anymore).

In [None]:
#HELPER FUNCTION:
def tweet_search(df, word):
    return df[df['main'].map(lambda x: word in x)]['main']

In [None]:
tweet_search(first_iter, 'unfortunately')

In [None]:
tweet_search(first_iter, 'therapy')

In [None]:
tweet_search(first_iter, 'overthinking')

Even though these expressions do not appear very often, they are unique to tweets showing mental distress. Thus, we decide to include them in our new dictionary.

In [None]:
#HELPER FUNCTION:
def md_search(df, word, col):
    contains_word = df[col].map(lambda x: word in x)
    #parentheses are required: `&` binds more tightly than `==`
    md_count = df[contains_word & (df['mental'] == 1)][col].count()
    print("Tweets showing mental distress and containing the word '" + word + "': " , md_count)
    count = df[contains_word][col].count()
    print("Total appearance of the word '" + word + "': " , count)
    print("Ratio between these 2 numbers : ", md_count/count)

In [None]:
md_search(first_iter, 'hate', 'split')

In [None]:
md_search(first_iter, 'pain', 'split')

In [None]:
md_search(first_iter, 'addict', 'split')

We decided to exclude several words (_hate_, _hurt_, _pain_, _addict_ and _overdose_) but only demonstrate our method on a few of them. Looking at the numbers, we see that more than half of the tweets containing these words are false positives, which drives us to strike them from our dictionary.
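
The same ratios can be collected for several candidate words at once. The following is a minimal sketch rather than the code used for this study: `keyword_precision` and the toy DataFrame are hypothetical, and it assumes, as in `md_search`, a token-list column `split` and a 0/1 column `mental`.

```python
import pandas as pd

def keyword_precision(df, words, col="split", label="mental"):
    """Hypothetical helper: per-word counts and share of distress tweets."""
    rows = []
    for word in words:
        contains = df[col].map(lambda tokens: word in tokens)
        total = int(contains.sum())
        distress = int((contains & (df[label] == 1)).sum())
        # Avoid dividing by zero when the word never appears
        ratio = distress / total if total else float("nan")
        rows.append({"word": word, "total": total, "distress": distress, "ratio": ratio})
    return pd.DataFrame(rows).set_index("word")

# Toy data for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "split": [["i", "hate", "mondays"], ["hate", "myself"], ["pain", "au", "chocolat"]],
    "mental": [0, 1, 0],
})
precision = keyword_precision(toy, ["hate", "pain", "addict"])
print(precision)
```

A word whose ratio falls below one half produces more false positives than true positives, which is the criterion applied above.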

In [None]:
#HELPER FUNCTION:
def blacklisted_search(df, word):
    count = df[df['split'].map(lambda x: word in x)]['split'].count()
    print("Total appearance of '" + word + "': ", count)

In [None]:
blacklisted_search(first_iter, 'overw')
blacklisted_search(first_iter, 'overwatch')

In [None]:
blacklisted_search(first_iter, 'syria')
blacklisted_search(first_iter, 'assad')
blacklisted_search(first_iter, 'asylum')

In [None]:
blacklisted_search(first_iter, 'trump')
blacklisted_search(first_iter, 'clinton')
blacklisted_search(first_iter, 'america')
blacklisted_search(first_iter, 'usa')

As we can see, the words we want to strike represent major recurring themes (mostly related to the news) that have nothing to do with mental distress: none of these tweets were labeled as showing signs of mental distress. Once again, we only demonstrated the idea on a small subset of topics. The full list of words we want to blacklist is: _gov_, _government_, _syria_, _assad_, _refugees_, _asylum_, _overwatch_, _vine_, _trump_, _hillary_, _america_, _usa_ and _india_.

_Note:_ two other expressions are blacklisted, but only because our first search algorithm did not match whole words exactly: _suicidesquad_ (contains _suicide_) and _spain_ (contains _pain_).
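
The difference between the two kinds of matching mentioned in this note can be illustrated with a short sketch. Both function names are hypothetical, and the whole-word variant based on `\b` is only one possible fix, not necessarily the one used later in this notebook.

```python
import re

def contains_substring(text, word):
    # Substring test: also fires inside longer words
    return word in text

def contains_word(text, word):
    # Whole-word test: \b marks a boundary between word and non-word characters
    return re.search(r"\b" + re.escape(word) + r"\b", text) is not None

# Made-up examples, not taken from the dataset
examples = [
    "flying to spain tomorrow",
    "watching suicidesquad tonight",
    "the pain never stops",
]
for text in examples:
    print(text, "|", contains_substring(text, "pain"), contains_word(text, "pain"))
```

With whole-word matching, _spain_ and _suicidesquad_ no longer trigger _pain_ and _suicide_, so they would not need to be blacklisted explicitly.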

## 4. Dictionary Update

#### Second sample evaluation

This second analysis was done on a different dataset. Instead of labeling only tweets previously labeled as "NEUTRAL" and "NEGATIVE", we apply our dictionary to the whole dataset to obtain more relevant results (and to show that Spinn3r's labeling algorithm does not work as expected: 1.4% of the tweets we labeled as showing mental distress were labeled as "POSITIVE").
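
A share such as the 1.4% quoted above can be computed with a normalized value count. This is a minimal sketch on toy data: the column names `mental` and `sentiment` are assumed to match the labeled file, and the numbers below are made up.

```python
import pandas as pd

# Toy data for illustration only; column names `mental` and `sentiment` are assumed
toy = pd.DataFrame({
    "mental": [1, 1, 1, 1, 0, 0],
    "sentiment": ["NEGATIVE", "NEGATIVE", "NEUTRAL", "POSITIVE", "POSITIVE", "NEUTRAL"],
})

# Distribution of the provided sentiment labels among tweets we marked as mental distress
distress = toy[toy["mental"] == 1]
sentiment_share = distress["sentiment"].value_counts(normalize=True)
print(sentiment_share)
```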

In [None]:
second_iter = pd.read_csv("data/labeled_tweets/new_english_labeled.csv")
second_iter.head()

**Key _(qualitative)_ takeaways :**

Unlike the first pass on the data, this labeling does not include a thorough qualitative analysis. It only allowed us to improve our methodology and find plausible explanations for some of the phenomena we observed. One such phenomenon is the difference in emotional expression between genders. This may be due to the (very high) number of tweets by _"fangirls"_, female tweet authors mentioning their love for their favorite music bands (mostly Asian) or one of their favorite actors.

**Key _(quantitative)_ takeaways :**

In this case, we only determine words to add to our dictionary and to our _negative_ dictionary (no word in our dictionary was deemed superfluous or found to induce too many false positives).

In [None]:
tweet_search(second_iter, 'breakdown')

In [None]:
exp_count = second_iter[second_iter['main'].map(lambda x: ('confused' in x or 'lonely' in x) and 'sad' in x)]['main']
print("Total appearance of the expressions involving 'sad' : ", exp_count.count())
exp_count

We did not use our usual search function for the last demonstration because it involves two expressions that we needed to group. Once again, this is only a preview of our quantitative analysis. The whole list of expressions we want to add to our dictionary is: _panic attack_, _sleepless_, _problems falling asleep_, _lonely and sad_, _confused and sad_, _gambling_, _breakdown_, and _mental issue_.
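
The grouped search above could be generalized as follows. This is a hedged sketch: `expression_search` is a hypothetical helper that is not part of the original notebook, and the toy tweets are made up. It uses substring matching on the `main` column, like `tweet_search`.

```python
import pandas as pd

def expression_search(df, any_of, all_of, col="main"):
    """Hypothetical helper: rows containing every word in `all_of`
    and at least one word in `any_of` (substring match, as in tweet_search)."""
    mask = df[col].map(
        lambda text: all(w in text for w in all_of) and any(w in text for w in any_of)
    )
    return df[mask][col]

# Toy data for illustration only
toy = pd.DataFrame({"main": [
    "so lonely and sad tonight",
    "confused and sad about everything",
    "sad movie but great popcorn",
    "lonely planet guide to zurich",
]})
matches = expression_search(toy, any_of=["confused", "lonely"], all_of=["sad"])
print(matches)
```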

In [None]:
md_search(second_iter, 'sadly', 'main')

In [None]:
sm_count = second_iter[second_iter['main'].map(lambda x: 'sonic' in x and 'mania' in x)]['main']
print("Total appearance of the expression 'Sonic Mania' : ", sm_count.count())
sm_count

Once again, we use a different format to show a special expression. We need to delete _'Sonic Mania'_ as it contains the word _'mania'_ (which could not be excluded even with our new search algorithm). However, we can see that refining our dictionary worked, as we have far fewer words to blacklist: _asylum_, _sadly_, _not afraid_, _stress relief_ and words with the stem _"addict"_.
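
One possible way to combine the dictionary with the _negative_ dictionary is to remove blacklisted expressions from a tweet before searching for dictionary entries. The sketch below only illustrates that idea under stated assumptions: `matched_keywords` is hypothetical, both lists are small illustrative subsets, and the actual pipeline may handle blacklisted expressions differently.

```python
import re

# Illustrative subsets only; the full lists are given in the text
DICTIONARY = ["mania", "breakdown", "panic attack"]
NEGATIVE_DICTIONARY = ["sonic mania", "stress relief", "not afraid"]

def matched_keywords(text):
    """Return dictionary entries found in `text` after removing blacklisted expressions."""
    cleaned = text.lower()
    for phrase in NEGATIVE_DICTIONARY:
        # Replace each blacklisted expression with a space so it cannot match later
        cleaned = re.sub(r"\b" + re.escape(phrase) + r"\b", " ", cleaned)
    return [w for w in DICTIONARY if re.search(r"\b" + re.escape(w) + r"\b", cleaned)]

print(matched_keywords("Playing Sonic Mania all night"))
print(matched_keywords("I had a panic attack and a breakdown today"))
```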

## 5. Analysis of First Dataset

#### Mapping from Dataset 1 to Dataset 2

Formatting is as follows: **Column in Dataset 2** -> Column in Dataset 1

- **author_user_id**           -> userId
- **geo_point**                -> longitude + latitude (or: placeLongitude + placeLatitude)
- **main**                     -> text
- **published**                -> createdAt
- **source_followers**         -> followersCount
- **source_following**         -> friendsCount (the number of accounts the user follows)
- **source_location**          -> userLocation
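
The mapping above can be expressed as a column rename in pandas. This is a minimal sketch: `COLUMN_MAP` and `harmonize_columns` are hypothetical names, and **geo_point** is left out because its exact format is not specified here.

```python
import pandas as pd

# Dataset 1 column -> Dataset 2 column, following the mapping above.
# geo_point is omitted: combining longitude and latitude depends on its format.
COLUMN_MAP = {
    "userId": "author_user_id",
    "text": "main",
    "createdAt": "published",
    "followersCount": "source_followers",
    "friendsCount": "source_following",
    "userLocation": "source_location",
}

def harmonize_columns(df):
    """Hypothetical helper: keep the mapped columns that exist and rename them."""
    present = [c for c in COLUMN_MAP if c in df.columns]
    return df[present].rename(columns=COLUMN_MAP)

# Toy data for illustration only
toy = pd.DataFrame({"userId": [1], "text": ["hello"], "placeId": ["abc"]})
harmonized = harmonize_columns(toy)
print(harmonized.columns.tolist())
```

Columns without an equivalent, such as `placeId` in the toy example, are dropped by this helper.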


#### Columns of Dataset 1 with no equivalent:
- id                                -> tweet ID
- truncated                         -> meaning unclear, contains only NaNs
- placeId                           -> ID of where they are
- inReplyTo                         -> If it is a tweet reply
- source & sourceName, sourceUrl    -> what user used to send tweet (Android, website, and so on)
- userName                          -> username
- screenName                        -> name shown on tweets from username
- statusesCount                     -> number of statuses posted by the user

#### Columns of Dataset 2 with no equivalent:
- **author_gender**                 -> author's gender
- **lang**                          -> language of tweet (given)
- **sentiment**                     -> tweet's sentiment analysis (given)
- **source_spam_probability**       -> spam probability of tweet (given)
- **tags**                          -> hashtags contained in the tweet (in Dataset 1 stays in text)
- **tokenized**                     -> tokenized text of tweet (done by us)
- **keywords**                      -> keywords found in tweets (done by us)

## 6. Processing Pipeline on the Second Dataset

The dataset comes with a *txt schema* describing the meaning of each column in the *tsv file* containing the tweets. We were given a sample file to get an overview of the data, and we also obtained the complete set of tweets (5GB) in a _.zip_ file.

The dataset contains the following useful columns:

- **userId** : id identifying the user.
- **createdAt** : time the tweet was posted.
- **text** : content of the tweet.
- **placeLatitude** : latitude of the tweet.
- **placeLongitude** : longitude of the tweet.
- **sourceName** : name of the client used to send the tweet (Android, website, and so on).
- **sourceUrl** : URL of the client used to send the tweet.
- **followersCount** : number of followers.
- **friendsCount** : number of accounts the user follows.
- **statusesCount** : number of statuses of user.

The sample dataset contains a lot of NaN values: every column has at least 1% of its values missing.
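
The share of missing values per column can be checked as sketched below. The in-memory TSV is a made-up stand-in for the sample file, and only a few of the columns listed above are included.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the TSV file (illustration only)
sample_tsv = "userId\ttext\tplaceLatitude\n1\thello\t46.5\n2\t\t\n3\tgruezi\t\n"
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# Share of missing values per column, largest first
nan_share = sample.isna().mean().sort_values(ascending=False)
print(nan_share)
```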

The complete analysis and code can be found in the [Basic Exploration dataset 1 notebook](Basic%20Exploration%20Dataset%201.ipynb)

## 7. Analysis of the Second Dataset

## 8. Conclusion