# What we tweet about when we tweet about mental health

This notebook follows the methodology described in our report in order to answer our research questions.

### Overview:

<ol start="0">
  <li><b>Data Collection &amp; Preliminary Analysis:</b> description of our dataset retrieval and quick analysis of our datasets to familiarize ourselves with them</li>
  <li><b><i>Spinn3r</i> Data Treatment:</b> 
      application of our cleaning and NLP pipeline to the $1^{st}$ dataset</li>
  <li><b><i>Spinn3r</i> Dataset Description:</b> 
      a first (less naive) analysis of the dataset</li>
  <li><b>Dictionary Construction</b> (using the <b><i>Spinn3r</i></b> dataset)<b>:</b> 
      creation (and refining) of the dictionary we are going to use on both datasets</li>
  <li><b><i>Spinn3r</i> Dataset Analysis:</b> 
      first answer to our research questions using the <i>"smaller" <b>Spinn3r</b></i> set</li>
  <li><b><i>twitter-swisscom</i> Data Processing Pipeline:</b> 
      after using the <b><i>Spinn3r</i></b> as a proof of concept, we apply the cleaning, NLP and keyword-based selection methods to our $2^{nd}$ dataset</li>
  <li><b>Conclusion</b></li>
</ol>

We quickly import the libraries we use in this Notebook (and define our constant values):

In [None]:
import numpy as np
import pandas as pd
import pickle as pkl
import pyspark as ps
import seaborn as sns
import matplotlib.pyplot as plt

#Part 1
import json
from pandas.io.json import json_normalize

#Part 2
import re
import unicodedata
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize, TweetTokenizer

#Part 3
import pyLDAvis.gensim
from gensim import corpora, models

#Maps display
import folium
import shapely
import pysal as ps
import geopandas as gpd
from geopandas.tools import sjoin
from shapely.geometry import Point
from folium.plugins import MarkerCluster

In [None]:
#Charts colors
colors = plt.cm.RdYlBu(np.linspace(.2, .7, 4))

In [None]:
#EXAMPLE DATA
EXAMPLE_PATH = 'data/example_data/'
SP_EX_TWEETS = EXAMPLE_PATH + 'spinn3r_example.json'
TS_EX_TWEETS = EXAMPLE_PATH + 'twitterswisscom_example.tsv'

In [None]:
#SPINN3R DATASETS
SP_PATH = 'data/spinn3r_tweets/'
SP_PROC_TWEETS = SP_PATH + 'processed_tweets.pkl'

In [None]:
#SPINN3R KEYWORDED TWEETS SUBSETS
KW_SP_PATH = SP_PATH + 'keyword_tweets/'
SP_EN_KW = KW_SP_PATH + 'english_keyworded_tweets.pkl'
SP_FR_KW = KW_SP_PATH + 'french_keyworded_tweets.pkl'
SP_DE_KW = KW_SP_PATH + 'german_keyworded_tweets.pkl'
SP_SMALL_EN_KW = KW_SP_PATH + 'small_english_keyworded_tweets.pkl'
SP_SMALL_FR_KW = KW_SP_PATH + 'small_french_keyworded_tweets.pkl'
SP_SMALL_DE_KW = KW_SP_PATH + 'small_german_keyworded_tweets.pkl'

In [None]:
#SPINN3R LABELED TWEETS SUBSETS
LBL_SP_PATH = SP_PATH + 'labeled_tweets/'
SP_EN_LBL = LBL_SP_PATH + 'english_labeled.csv'
SP_NEW_EN_LBL = LBL_SP_PATH + 'new_english_labeled.csv'

In [None]:
#DICTIONARIES
DICT_1 = "data/dictionaries/dict_1.csv"
DICT_2 = "data/dictionaries/dict_2.0.csv"

In [None]:
#TWITTER-SWISSCOM DATASETS
TS_PATH = 'data/twitter-swisscom/' #separate name so that SP_PATH keeps pointing to the Spinn3r data
TWITTERSWISSCOM_PROCESSED_TWEETS = TS_PATH + 'dataset2_english.csv'

## 0. Data Collection & Preliminary Analysis

### 0.1. Data collection

We were provided with two separate datasets containing Swiss tweets, formatted differently and containing different fields.

Our first dataset (_**Spinn3r**_) did not have any example subset. As a stand-in, we extracted the first JSON file to perform our preliminary analysis. To do this, we used the following commands after connecting to the cluster:

```bash
cluster$ hadoop fs -getmerge /datasets/swiss-tweet/harvest3r_twitter_data_01-01_0.json /buffer/example.json
local$ scp -r gaspar@iccluster060.iccluster.epfl.ch:/buffer/example.json <local-path>
```

An example dataset as well as the complete _**twitter-swisscom**_ dataset were retrieved from a provided _'.zip'_ file.

### 0.2. Preliminary analysis

The goal of this preliminary analysis is to familiarize ourselves with our datasets in order to know if we need to adapt our research questions or enrich the dataset with external information. We will see (below) that the main difference between the two datasets is that the $1^{st}$ dataset (_**Spinn3r**_) only covers a span of 10 months while the data of the $2^{nd}$ dataset (_**twitter-swisscom**_) was collected over multiple years and contains more precise location information. Another difference between the datasets is that the $1^{st}$ contains columns specifying the _language_, the _gender of the user_, and the _predominant sentiment_ of the tweet.

After our analysis, we decided to use **both datasets**. For this reason, we use the $1^{st}$ (smaller) dataset to write our functions and methods. On top of having results, we use this as a proof of concept of our analysis pipeline to be applied to the bigger dataset.

#### 0.2.1. Dataset 1: _Spinn3r_

The [Spinn3r website](http://docs.spinn3r.com/?Example#content-schema) offers an elaborate description of each available field. Given the amount of data present in the cluster, we only look at a single file (corresponding to a day) to perform our preliminary analysis. This is a nested _JSON_ file. As the *read_json* function does not work well with nested _JSON_ data, we use a _JSON_ normalizer (provided in the _**Pandas.io**_ library).

In [None]:
with open(SP_EX_TWEETS) as data_file:    
    data = json.load(data_file)

ex_sp_df = json_normalize(data)

#rename columns for convenience
ex_sp_df.columns = [column.replace('_source.','') for column in ex_sp_df.columns]
ex_sp_df.columns

Out of all these fields, we see that the most useful columns for our analysis are:
- **main**: contains the content of the tweet (the center of our analysis).
- **published**: gives the time on which the content was posted.
- **source_location**: location of the tweet.
- **lang**: language of the tweet.
- **author_gender**: gender of the author -MALE, FEMALE, UNKNOWN-.
- **sentiment**: sentiment score of the tweet -POSITIVE, NEGATIVE, NEUTRAL- (we will later see that this field is useless).

_Note :_ On the cluster we find additional fields, out of which we decide to use **'author_user_id'**, as a means to uniquely identify a user, and **'geo_point'**, containing latitude and longitude, as a way to identify user location.

In [None]:
#defining the columns to keep
columns = ['main', 'published', 'source_spam_probability', 'source_location', 'tags', 'lang', 'sentiment',
                   'author_gender', 'source_followers', 'source_following']
ex_sp_df = ex_sp_df[columns]

**Looking at categorical data :**

In [None]:
plt.figure(figsize=[8, 5])
ex_sp_df['source_spam_probability'].value_counts().plot(kind='pie', colors=colors)
plt.title('Spam probability in tweets')
plt.axis('equal')
plt.show()

The spam probability of the tweet set seemed very promising to clean our data, but we see that not a single tweet was labeled as spam which makes us question the accuracy of the labeling.

We now look at the values contained in the **source_location** column (which seem to be language dependent):

In [None]:
location_occurences = ex_sp_df[ex_sp_df.lang.isin(['de', 'fr', 'en'])]['source_location'].value_counts()
print('Most frequent:\n' , location_occurences.head(), '\n')
print('Least frequent:\n' , location_occurences.tail())

We notice that:
- A lot of locations only differ in language (e.g. Switzerland, Schweiz and Suisse)
- The locations' names are not always written in languages we are interested in (e.g. สวิตเซอร์แลนด์, a location name in our list)
- A vast majority of the dataset is just located in 'Switzerland'
- Some of the tweets are not located in Switzerland

To deal with these different locations (and display them on a map) we use the **geo_point** field instead.

_Note :_ we choose to not deal with the last point mentioned under the assumption that we were provided with a dataset only containing tweets written in Switzerland.

In [None]:
ex_sp_df.count()

When looking at the number of values of the dataset, we see that most tags associated to tweets are missing. This may indicate that tags won't be useful for our analysis, but this does not influence our research at this stage.

We also note that the sentiment is missing for some tweets. As noted on Spinn3r, this may be due to the fact that some tweets do not contain enough linguistic information to identify the sentiment (we will filter out such tweets). However, this field is not central to our analysis; it is merely used to help us evaluate the treatment applied to our dataset.

**Identifying potential issues :**

While this set of tweets is not representative, we can still use it to find potential issues we might have with the tweets' content. We will later use this analysis to decide which values to filter out of the dataset.

We start by inspecting a random sample:

In [None]:
pd.set_option('display.max_colwidth', 140)
ex_sp_df.sample(n=10)['main']

We immediately see that tweets containing links are not relevant to our research question (they are mostly news or ads).

In [None]:
ex_sp_df.main[ex_sp_df.main.map(lambda x: 'http://' in x)].head(10)

Looking at the sample, we generalize this rule and make the assumption that a URL is a sign that a tweet is spam.

Using a very simplified keyword search, we see that some words are problematic.

In [None]:
pd.set_option('display.max_colwidth', 100)
ex_sp_df[ex_sp_df['main'].map(lambda x: 'suicide' in x) ]['main']

We notice that the occurrences of the word _**'suicide'**_ in our set are often linked to news. This also supports our choice to remove tweets with URLs in order to get better results.

In [None]:
ex_sp_df[ex_sp_df['main'].map(lambda x: 'therapie' in x) ]['main']

Looking at the word _**'therapie'**_ (_therapy_ in German), we can confirm once again the issue with URLs.

In [None]:
ex_sp_df[ex_sp_df['main'].map(lambda x: 'RT ' in x) ]['main'].head()

Finally, we look at the retweets to see if they can be useful. Based on these results (and the findings of other researchers, such as the ones presented in [8]), we conclude that they are not useful.

## 1. _Spinn3r_ Data Treatment

Now that we have familiarized ourselves with our data, we have to treat our Spinn3r dataset to be able to use it correctly. To do this, we first apply our functions locally to a small subset to make sure they work as expected. After this, we use Spark to scale up our operations and apply them to the whole dataset on the cluster.

### 1.1. Tweet filtering

- First, we would like to get rid of any tweets with a spam probability greater than $0.5$ (tweets more likely to be spam than ham). 
- As we are only focused on the Swiss population, we only keep tweets written in the country's official languages (except for Italian as no one in our group speaks the language, meaning we won't be able to extract useful information) and in English (as most tweets in the dataset, and more generally on Twitter, are written in English)

In [None]:
#removing spam
spam_mask = (ex_sp_df['source_spam_probability'] >= 0.5)
ex_sp_df.drop(ex_sp_df[spam_mask].index, inplace=True)
ex_sp_df.reset_index(drop=True, inplace=True)

In [None]:
#filtering by language
lang_mask = ~ex_sp_df.lang.isin(['de', 'en', 'fr'])
ex_sp_df.drop(ex_sp_df[lang_mask].index, inplace=True)
ex_sp_df.reset_index(drop=True, inplace=True)

### 1.2. Column formatting

To start, we format the column containing the dates to make the temporal analysis easier.

In [None]:
#convert dates to datetime format
ex_sp_df['published'] = pd.to_datetime(ex_sp_df['published'])
ex_sp_df['published'].head()

As the geo_point data is not standardised, we need to treat both types of geolocation encoding separately.

As the example dataset does not contain the 'geo_point' field, we just present the function we will apply to do this here.

In [None]:
def process_location_data(df):
    """preprocesses geo data in format as found on cluster
    df: dataframe, locations should be in 'geo_point' column
    """
    #get lines containing geopoint
    non_null = df[~df.geo_point.isnull()]
    
    #location separated by comma format: [longitude, latitude ]
    comma = non_null[non_null.geo_point.str.contains(',')].geo_point.str.split(pat=',', expand=True).applymap(float)
    
    #locations just next to each other [longitude latitude]
    no_comma = non_null[~non_null.geo_point.str.contains(',')].geo_point.str.split(expand=True).applymap(float)
    return pd.concat([comma, no_comma])
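
Since the example dataset lacks the field, the two encodings can be illustrated on a few hand-written strings. The sketch below uses plain Python rather than pandas; the helper name `parse_geo_point` and the sample coordinates are illustrative assumptions, not values taken from the cluster.

```python
def parse_geo_point(raw):
    """Parse one geo_point string into a pair of floats.

    Handles the two encodings described above: values separated by a comma
    ("a,b") or by whitespace only ("a b"). Returns None for missing values.
    """
    if raw is None:
        return None
    # split on the comma when there is one, otherwise on whitespace
    parts = raw.split(',') if ',' in raw else raw.split()
    return tuple(float(part) for part in parts)


# hypothetical sample values: one per encoding, plus a missing entry
samples = ['46.52,6.63', '47.37 8.54', None]
parsed = [parse_geo_point(s) for s in samples]
print(parsed)
```

The order of the two values in the tuple simply follows the order found in the string, so whichever convention the cluster data uses is preserved.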

### 1.3. Text treatment

Following this mild cleaning, we need to work on the main subject of our analysis: the content of the tweets.

We put all characters in lowercase (which will allow us to compare the tweets more easily, without any case-sensitivity issues).

In [None]:
ex_sp_df['main'] = ex_sp_df['main'].str.lower()
ex_sp_df['main'].head()

As we have seen in _Part 0_, URLs are highly linked to spam (except for URLs linking pictures to the tweet). Thus, we drop the tweets containing other URLs, remove all _pic.twitter.com_ URLs and delete all mentions of the format _"@user"_.

In [None]:
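#flag tweets containing 'www' or 'http'
url_mask = ex_sp_df['main'].str.contains("www") | ex_sp_df['main'].str.contains("http")
#strip picture links and @user mentions (regex=True is required in recent pandas versions)
ex_sp_df['main'] = ex_sp_df['main'].str.replace(r"pic\.twitter\S+", '', regex=True)
ex_sp_df['main'] = ex_sp_df['main'].str.replace(r"@\w+", '', regex=True)
#drop the flagged tweets
ex_sp_df.drop(ex_sp_df[url_mask].index, inplace=True)
ex_sp_df.reset_index(drop=True, inplace=True)
ex_sp_df['main'].head()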

To finish the text treatment, we delete all non-alphanumeric characters.

In [None]:
alphanum_filter = lambda data: re.sub(r'[^\w\s]', '', data)

In [None]:
ex_sp_df['main'] = ex_sp_df['main'].map(alphanum_filter)
ex_sp_df['main'] = ex_sp_df['main'].str.replace('_', '') #not removed by previous regex
ex_sp_df.main.head()
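
To make the effect of the text treatment easier to follow, here is a minimal sketch that chains the steps of this section on single strings, using only the standard library. The function name `clean_tweet` and the example tweets are hypothetical, and the order of the steps is a simplification of the pandas cells above.

```python
import re


def clean_tweet(text):
    """Return the cleaned text, or None when the tweet should be dropped."""
    text = text.lower()
    # picture links are harmless: strip them instead of dropping the tweet
    text = re.sub(r'pic\.twitter\S+', '', text)
    # any remaining URL is treated as a sign of spam
    if 'http' in text or 'www' in text:
        return None
    # remove mentions, then every character that is not a letter, digit or whitespace
    text = re.sub(r'@\w+', '', text)
    text = re.sub(r'[^\w\s]|_', '', text)
    return text.strip()


# hypothetical examples: a picture link, a news link and a mention
examples = [
    'Feeling SO tired today... pic.twitter.com/abc123',
    'Breaking news: http://example.com/story',
    "@some_user I can't sleep :(",
]
cleaned = [clean_tweet(t) for t in examples]
print(cleaned)
```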

### 1.4. Spark version

After performing all these filtering and cleaning steps, the size of the set is **significantly reduced** (the new file only weighs **2.4MB** instead of the **55MB** of the original file, i.e. a reduction by a factor of roughly **23**). This is particularly helpful as the whole dataset represents around **30GB**.

_Note:_ the code was written as _**Raw NBConvert**_ as it is not meant to be run in the notebook but on the cluster. The code is only here as information for the reader; its executable version can be found in the run.py script.

We start by importing the necessary elements and defining Spark environment variables.

We then import our data.

We filter tweets using the same method.

Text treatment is a little longer as it is more thorough in Spark.

Finally, we store the data in the _JSON_ we will be using.

### 1.5. Natural Language Processing

Using Natural Language Processing _(NLP)_ is necessary given that we work with tweets. We only perform a simple processing pipeline (tokenization, stop-word removal and stemming) using the go-to NLP Python library _**nltk**_. This is useful because we work with tweets in different languages.

_Note:_ once again, the code was written as _**Raw NBConvert**_ as it takes a long time to run on the dataset ($5+$ minutes). The code is only here as information for the reader; its executable version can be found in the processing_pipeline.py script.

#### 1.5.1. Helper functions
(cf. _Next part_ below for information)

#### 1.5.2. Tweet formatting

NLP is mainly based on treating the data to obtain an adequate format to work with. It usually follows the 3 steps described below:
- Tokenizing: separating words (we use a dedicated tweet tokenizer to work on the tweets)
- Stop-word removal: removing words with little semantic meaning
- Stemming: keeping only the stem of a word (allows better comparison)
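
The three steps can be sketched on a single tweet as follows. This is a toy, standard-library illustration only: the regular-expression tokenizer, the tiny stop-word list and the naive suffix-stripping stemmer are stand-ins we assume for the example, not the _**nltk**_ components (`TweetTokenizer`, `stopwords`, `SnowballStemmer`) imported at the top of the notebook.

```python
import re

# tiny illustrative stop-word list (assumption); nltk provides full lists per language
STOP_WORDS = {'i', 'am', 'so', 'the', 'a', 'and', 'of', 'to'}
# suffixes removed by the toy stemmer, longest first (assumption)
SUFFIXES = ('ing', 'ed', 's')


def tokenize(text):
    """Split a cleaned, lowercased tweet into word tokens."""
    return re.findall(r'\w+', text)


def remove_stops(tokens):
    """Drop the tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Strip one known suffix; a crude stand-in for a Snowball stemmer."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token


def process(text):
    """Tokenize, remove stop words, then stem what is left."""
    return [naive_stem(t) for t in remove_stops(tokenize(text))]


tokens = process('i am so depressed and tired of feeling lonely')
print(tokens)
```

A real stemmer applies language-specific rules, which is why the actual pipeline relies on a dedicated stemmer per language rather than on a fixed suffix list.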

## 2. Dataset Description for Spinn3r

To be able to provide sound answers to our research questions, we need to analyze our entire dataset to compare it to our final results.

_Note :_ we used the processing_pipeline.py script for speed and convenience. The results were saved to the processed_tweets pkl file, which can be found in the _'data/spinn3r_tweets'_ folder.

In [None]:
sp_df = pkl.load(open(SP_PROC_TWEETS, "rb" ))

### 2.1. General statistics:

In [None]:
sp_df.count()

In [None]:
sp_df.tokenized

We have a total of $3936084$ tweets but we note that less than half of them have *geo\_point* values attached to them.

### 2.2. Distribution of categorical data

In [None]:
plt.figure(figsize=[7, 5])
sp_df['lang'].value_counts().plot(kind='pie', colors=colors, autopct='%.2f%%')
plt.axis('equal')
plt.title('Distribution of language on tweets')
plt.show()

Almost half of all tweets are in English, with French and German being the next most popular languages. This is interesting because it is quite unexpected: we are looking at tweets posted in Switzerland, where German is the most widely used official language ($66\%$ of the country), with French speakers making up $22.7\%$ ([source](https://en.wikipedia.org/wiki/Languages_of_Switzerland)).

In [None]:
plt.figure(figsize=[7, 5])
sp_df['author_gender'].value_counts().plot(kind='pie', colors=colors, autopct='%.2f%%')
plt.axis('equal')
plt.title('Distribution of gender on tweets')
plt.show()

Most accounts do not contain information on the user's gender (meaning there is no way we can have an unbiased set). However, the dataset still provides the gender of over a third of profiles (even though male and female users are not equally represented), possibly allowing us to get some insights on the differences in mental distress expression between the genders.

In [None]:
plt.figure(figsize=[10, 5])
sp_df['sentiment'].value_counts().plot(kind='pie', colors=colors,autopct='%.2f%%')
plt.axis('equal')
plt.title('Distribution of sentiment analysis results on tweets')
plt.show()

Around $75\%$ of tweets are labeled as _**NEUTRAL**_, and only $8\%$ are labeled as negative. Taking these labels at face value, this could already indicate that people on Twitter do not often discuss negative topics (mental health being such a topic).

### 2.3. Distribution of temporal data

In [None]:
#Helper functions
def get_time_distribution(times):
    publishing_time = times.map(lambda x: x.hour).value_counts() / times.count()
    publishing_time.sort_index(inplace=True)
    return publishing_time

def plot_daily(df):
    times = get_time_distribution(df['published'])
    return plt.plot(times)

In [None]:
plot_daily(sp_df)
plt.xlabel('hour')
plt.ylabel('fraction of tweets') #get_time_distribution returns normalized counts
plt.title('Distribution of tweets over 24 hours')
plt.show()

We first look at the daily pattern of the data. The highest share of tweets is posted around 8pm; the number then decreases through the night, reaching its lowest point at 3am. Another low point is around lunch time.

In [None]:
#number of tweets per month, sorted in calendar order
yearly_distribution = sp_df.published.map(lambda x: x.month).value_counts().sort_index()

#plotted with matplotlib (sns.tsplot is no longer available in recent seaborn releases)
plt.plot(yearly_distribution.index, yearly_distribution.values)
plt.ylabel('number of tweets')
plt.title('Distribution of tweets over a 10-month period')
plt.xticks(range(1, 11), ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct'])
plt.xlabel('month')
plt.show()

When looking at the number of tweets posted throughout the 10 months, we see that the dataset is not balanced (a plausible explanation is the rapidly increasing number of Twitter users). This could also be linked to the way Spinn3r retrieves tweets.

### 2.4. Distribution of location data

As previously noted, only $\sim40\%$ of tweets are geolocated. However, we can still look at their overall distribution. We use [this tutorial](http://andrewgaidus.com/leaflet_webmaps_python/) (and [the geodata](https://github.com/interactivethings/swiss-maps) provided in the $3^{rd}$ homework) to create our map. We then convert our data into a shapefile using [an online converter](https://mygeodata.cloud/converter/shp-to-geojson).

_Note:_ some points are not located in Switzerland but we do not filter out the attached tweets (we suppose the dataset is correct).

We first match the shapefile formatting with our location data.

In [None]:
locations = process_location_data(sp_df)
location_geo = gpd.GeoSeries(locations.apply(lambda z: Point(z[1], z[0]), 1), crs={'init': 'epsg:4326'})
location = gpd.GeoDataFrame(sp_df[~sp_df.geo_point.isnull()], geometry=location_geo) #merged data –useful later on–

In [None]:
municipalities = gpd.read_file('data/geodata/ch-municipalities.shp') #load data
municipalities.geometry = municipalities.geometry.to_crs(location_geo.crs) #change format

Using the _**geopandas**_ library (imported as `gpd`), we find the occurrences per municipality and _'beautify'_ the data display.

In [None]:
#join between region and location
counts = gpd.tools.sjoin(location, municipalities).groupby('id').size()

In [None]:
#add counts to the regions
municipalities.set_index('id', inplace=True)
municipalities['counts'] = counts
municipalities.fillna(0, inplace=True) #if we can't find match we have 0 tweets
municipalities.reset_index(inplace=True)
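
Conceptually, the spatial join assigns each tweet location to the region that contains it, and the group-by then counts the tweets per region. The sketch below illustrates this idea in plain Python with a ray-casting point-in-polygon test on hypothetical square regions; it is not how _**geopandas**_ implements `sjoin`, and the region ids and points are made up for the example.

```python
from collections import Counter


def point_in_polygon(x, y, polygon):
    """Ray-casting test: True when (x, y) lies inside the polygon.

    polygon is a list of (x, y) vertices; points lying exactly on an edge
    are not handled specially in this sketch.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # does a horizontal ray going right from the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# three hypothetical square "municipalities", keyed by id
regions = {
    1: [(0, 0), (2, 0), (2, 2), (0, 2)],
    2: [(2, 0), (4, 0), (4, 2), (2, 2)],
    3: [(4, 0), (6, 0), (6, 2), (4, 2)],
}
# hypothetical tweet locations; the last one falls outside every region
points = [(0.5, 0.5), (1.5, 1.0), (3.0, 1.0), (9.0, 9.0)]

counts = Counter()
for px, py in points:
    for region_id, polygon in regions.items():
        if point_in_polygon(px, py, polygon):
            counts[region_id] += 1
            break

# regions without any point get 0, mirroring the fillna(0) step
counts_per_region = {region_id: counts.get(region_id, 0) for region_id in regions}
print(counts_per_region)
```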

We display using 2 layers to get more granularity (we plot the municipalities with less than 10 tweets on a different layer).

In [None]:
municipalities_0 = municipalities[municipalities.counts < 10]
municipalities_1 = municipalities[municipalities.counts >= 10]

Using this data, we create our map.

In [None]:
SWISS_COORD = [46.85, 8.23]
swiss_map = folium.Map(location = SWISS_COORD, zoom_start = 8, tiles = 'cartodbpositron')

#layer with municipalities containing more than 10 tweets
swiss_map.choropleth(municipalities_1.to_json(), data = municipalities_1,
                     key_on = 'feature.properties.{}'.format('id'),
                    columns = ['id', 'counts'], fill_color = 'PuRd',
                     threshold_scale=[11, 500, 1000, 2000, 5000, 10000],
                     legend_name='Number of tweets', fill_opacity=0.9)
#layer with municipalities containing less than 10 tweets in total
swiss_map.choropleth(municipalities_0.to_json(), data = municipalities_0, 
                     key_on = 'feature.properties.{}'.format('id'),
                     columns = ['id', 'counts'], fill_color = 'PuBu', 
                     threshold_scale=[0, 10], legend_name='Area of little to no data',
                     fill_opacity=0.1, line_opacity=0.3)

As folium maps cannot be displayed in an _'.ipynb'_ file on GitHub, and as the map is quite large, we save it to an HTML file so it can be viewed independently.

In [None]:
swiss_map.save('municipalities_map.html')

In [None]:
%%HTML
<iframe width='900' height="500" src="municipalities_map.html"></iframe>

We see that tweets are strongly concentrated in urban centers. This will have to be taken into consideration when looking at the Röstigraben and urban/rural differences.

## 3. Dictionary Construction

### 3.1. Building the dictionary

The first step in building our dictionary was research (we use the dictionaries shown in [2] and [3] as they were used for similar problems). As we work with tweets in multiple languages, we carefully translate them into French and German. We decided to add multiple appropriate synonyms in the translation for words that have many synonyms (as is the case in German).

In [None]:
dict_1 = pd.read_csv(DICT_1)
dict_1.head(10)

In [None]:
dict_2 = pd.read_csv(DICT_2)
dict_2.head(10)

We apply our cleaning pipeline to the terms in the dictionary so that they can be matched to our processed tweets.

_Note :_ we only show the pipeline application to the $1^{st}$ dictionary as an example. The actual cleaning (and matching of keywords) is done in the dict_filtering.py script and saved to a pkl file (stored in the *'data/spinn3r_tweets/keyword_tweets'* folder).

In [None]:
en_dict = dict_1['english'].dropna()
fr_dict = dict_1['french'].dropna()
de_dict = dict_1['german'].dropna()

In [None]:
tknzr = TweetTokenizer()
en_dict = en_dict.map(lambda x: tknzr.tokenize(x))
fr_dict = fr_dict.map(lambda x: tknzr.tokenize(x))
de_dict = de_dict.map(lambda x: tknzr.tokenize(x))

After cleaning and tokenizing, the dictionaries look like this:

In [None]:
en_dict.head()

In [None]:
fr_dict.head()

In [None]:
de_dict.head()

### 3.2. Processing the dictionary

Once we have our (clean) data, we simply run all the previous NLP methods on our dictionaries.

In [None]:
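def dict_remove_stops(dictionary, language):
    """removes the stop words from every expression of the dictionary
    dictionary: Series of token lists
    language: nltk language name ('english', 'french' or 'german')
    """
    lang_set = set(stopwords.words(language)) #set for fast lookups
    return dictionary.apply(lambda expression: [word for word in expression if word not in lang_set])

def dict_stem_words(dictionary, language):
    """stems every word of every expression of the dictionary
    dictionary: Series of token lists
    language: nltk language name ('english', 'french' or 'german')
    """
    stemmer = SnowballStemmer(language)
    return dictionary.apply(lambda expression: [stemmer.stem(word) for word in expression])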

In [None]:
fr_dict = dict_remove_stops(fr_dict, 'french')
en_dict = dict_remove_stops(en_dict, 'english')
de_dict = dict_remove_stops(de_dict, 'german')

In [None]:
en_dict = dict_stem_words(en_dict, 'english')
fr_dict =  dict_stem_words(fr_dict, 'french')
de_dict =  dict_stem_words(de_dict, 'german')

After stemming and stop-word removal, the dictionaries look like this:

In [None]:
en_dict.head()

In [None]:
fr_dict.head()

In [None]:
de_dict.head()

### 3.3 Finding keyword matches

We define the following helper functions to find matches:

In [None]:
def compare(s, t):
    """compares two lists and returns True when they contain
    the same elements (in any order)
    s, t: lists to compare
    """
    return Counter(s) == Counter(t)

def match_dict(tweet, dict_):
    """returns the keywords that match the tweet
    tweet: tweet to find keywords in
    dict_: list of keywords (each keyword is a list of tokens)
    """
    #a keyword matches when every one of its tokens is found in the tweet:
    #we keep the tokens present in the tweet and check that none is missing
    doc = [sublist for sublist in dict_ if compare(list(filter(lambda x: x in tweet, sublist)), sublist)]
    return doc #empty list when no keyword matches
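
As an illustration of the matching rule, the sketch below re-expresses it with `all()`: a keyword (a list of stemmed tokens) matches when each of its tokens appears in the tweet, whatever their order. The function name `match_keywords`, the keywords and the tweets are hypothetical, and we assume here that the tweet is a list of tokens (with a plain string, `in` would perform a substring test instead).

```python
def match_keywords(tweet_tokens, keywords):
    """Return every keyword whose tokens all appear in the tweet."""
    return [kw for kw in keywords if all(token in tweet_tokens for token in kw)]


# hypothetical stemmed keywords and tokenized tweets
keywords = [['depress'], ['panic', 'attack'], ['suicid']]
tweet_a = ['had', 'attack', 'panic', 'last', 'night']
tweet_b = ['panic', 'at', 'the', 'disco']

matches_a = match_keywords(tweet_a, keywords)
matches_b = match_keywords(tweet_b, keywords)
print(matches_a, matches_b)
```

The multi-word keyword matches the first tweet even though its tokens appear in a different order, while the second tweet contains only one of the two tokens and therefore yields no match.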

Now that we have our dictionaries, we filter the tweets to keep the ones in which a keyword occurs at least once. This gives us the following results for each language.

_Note:_ the actual algorithm was performed using the dict_filtering_small.py script and the results were saved to pickle files in the *'data/spinn3r_tweets/keyword_tweets'* folder.

#### 3.3.1. Dictionary 1

**English results :**

In [None]:
df = pkl.load(open(SP_SMALL_EN_KW, "rb" ))

small_english_filtered = df[(df.keywords.map(lambda x: x != [] ))]
print('number of matching tweets: ', small_english_filtered.main.count())
print('percentage of matching tweets: ', 100 * small_english_filtered.main.count() / df.main.count())

In [None]:
small_english_filtered[['main', 'keywords']].head()

We see that we get very few results using this dictionary. Less than $0.2\%$ of all tweets match at least one keyword. This is not surprising as these are quite specialized terms.
Even though some tweets indeed show signs of mental health issues, some are completely unrelated to mental health. 
This point is discussed in detail in the Dictionary Evaluation (Parts _3.4_ and _3.5_).

The fact that despite the few matches we get false positives indicates that the total number of true matching tweets is even smaller.

In [None]:
small_english_filtered.keywords.map(lambda x: ' '.join(x[0])).value_counts().plot(kind='bar')
plt.title('Keyword occurences')
plt.ylabel('number of tweets')
plt.show()

The 3 most frequent keywords are _depress_, _addict_ and _suicid_, which are used casually more often than the 5 other words we see in the chart.

**French results :**

In [None]:
df = pkl.load(open(SP_SMALL_FR_KW, "rb" ))
small_french_filtered = df[(df.keywords.map(lambda x: x != [] ))]
print('number of matching tweets: ', small_french_filtered.main.count())
print('percentage of matching tweets: ', 100 * small_french_filtered.main.count() / df.main.count())

We notice that there are 3 times fewer matching tweets than in English, which can partially be explained by the fact that the ratio of matching tweets is not the same in both languages. Looking at the percentage of matching tweets, we see that only $0.09\%$ of all French tweets contain at least one match.

In [None]:
small_french_filtered[['main', 'keywords']].head()

In [None]:
small_french_filtered.keywords.map(lambda x: ' '.join(x[0])).value_counts().plot(kind='bar')
plt.title('Keyword occurences')
plt.ylabel('number of tweets')
plt.show()

When it comes to French, the keyword _'suicid'_ is by far the most frequent, with all other words being used quite infrequently (fewer than 100 matching tweets each).

**German results :**

In [None]:
df = pkl.load(open(SP_SMALL_DE_KW, "rb" ))
small_german_filtered = df[(df.keywords.map(lambda x: x != [] ))]
print('number of matching tweets: ', small_german_filtered.main.count())
print('percentage of matching tweets: ', 100 * small_german_filtered.main.count() / df.main.count())

In German, the number of matching tweets is even lower, despite having used synonyms to compensate for the multitude of possible matching words.

In [None]:
small_german_filtered[['main', 'keywords']].head()

In [None]:
small_german_filtered.keywords.map(lambda x: ' '.join(x[0])).value_counts().plot(kind='bar')
plt.title('Keyword occurences')
plt.ylabel('number of tweets')
plt.show()

Looking at the keyword frequencies, we see that, as in English, addiction, depression and suicide (['abhang', 'depression', 'suizid', 'selbstmord']) are the most frequent keywords, with the other keywords in the dictionary having far fewer matches.

#### 3.3.2. Dictionary 2


We now look at the second dictionary, retrieved from previous research. This dictionary contains a broader range of words, and is a superset of the previous dictionary.

**English results :**

In [None]:
df = pkl.load(open(SP_EN_KW, "rb" ))
english_filtered = df[(df.keywords.map(lambda x: x != [] ))]
print('number of matching tweets: ', english_filtered.main.count())
print('percentage of matching tweets: ', 100 * english_filtered.main.count() / df.main.count())

In [None]:
english_filtered[['main', 'keywords']].head()

At first glance, we can see that this dictionary gives us many more matches, with about $1\%$ of all English tweets containing at least one keyword. We also see that the matches tend to be more about feelings in general and less about specific illnesses.

In [None]:
english_filtered.keywords.map(lambda x: ' '.join(x[0])).value_counts().plot(kind='bar')
plt.show()

Even though not all tweets express distress (just as before), we can clearly see sadness, loneliness or fear. We also notice that the number of matching tweets has increased greatly (by a factor of around 10).

In [None]:
english_filtered[english_filtered.keywords.map(lambda x: len(x) > 1)].keywords.map(lambda x: str(x)).value_counts().head(10)

**French results :**

For French, we find results similar to those for English. We have more matches, and the matches are less specific to mental illness:

In [None]:
df = pkl.load(open(SP_FR_KW, "rb" ))
french_filtered = df[(df.lang == 'fr') & (df.keywords.map(lambda x: x != [] ))]
print('number of matching tweets: ', french_filtered.main.count())
print('percentage of matching tweets: ', 100 * french_filtered.main.count() / df[df.lang == 'fr'].main.count())

In [None]:
french_filtered[['main', 'keywords']].head()

In [None]:
french_filtered.keywords.map(lambda x: ' '.join(x[0])).value_counts().plot(kind='bar')
plt.show()

Looking at the words, we see that 'kill' or 'die' is by far the most frequent, followed by words like 'sadness' and 'hate'. The results are thus similar to those in English. We also find that suicide is still comparatively frequent, even with the addition of other words.

**German results :**

German also shows an increase in matches with the larger dictionary. However, as the number of matches was small to begin with, the total number of matches is still quite small.

In [None]:
df = pkl.load(open(SP_DE_KW, "rb" ))

german_filtered = df[(df.lang == 'de') & (df.keywords.map(lambda x: x != [] ))]

print('number of matching tweets: ', german_filtered.main.count())
print('percentage of matching tweets: ', 100 * german_filtered.main.count() / df[df.lang == 'de'].main.count())

In [None]:
german_filtered[['main', 'keywords']].head()

In [None]:
german_filtered.keywords.map(lambda x: ' '.join(x[0])).value_counts().plot(kind='bar')
plt.show()

We note that the most frequent word is again 'sad', followed by expressions of loneliness and emptiness.

### 3.4. Dictionary evaluation (using our sample labeling process)

#### 3.4.0. Helper functions

In [None]:
def tweet_search(df, word):
    return df[df['main'].map(lambda x: word in x)]['main']

In [None]:
def md_search(df, word, col):
    #rows whose `col` contains the word
    has_word = df[col].map(lambda x: word in x)
    #parentheses are required: & binds more tightly than ==
    md_count = df[has_word & (df['mental'] == 1)][col].count()
    print("Tweets showing mental distress and containing the word '" + word + "': " , md_count)
    count = df[has_word][col].count()
    print("Total appearance of the word '" + word + "': " , count)
    print("Ratio between these 2 numbers : ", md_count/count)

In [None]:
def blacklisted_search(df, word):
    count = df[df['split'].map(lambda x: word in x)]['split'].count()
    print("Total appearance of '" + word + "': ", count)

After crossing the dataset with our second dictionary to retrieve tweets exhibiting _mental distress_, we labeled a sample to see if the tweets indeed show such signs. 

Note that this task is complex as it is very subjective (choosing tweets displaying mental health issues was done at the discretion of the person labeling them). Our main conclusion from this step was the necessity to "expand" our research to tweets showing signs of mental distress in general and not only mentioning mental illnesses.

In [None]:
first_iter = pd.read_csv(SP_EN_LBL)
first_iter.main.head()

#### 3.4.1. Key _(qualitative)_ takeaways

- Tweets were labeled "loosely" using multiple signs of mental distress, mostly sadness. Tweets with the following characteristics were labeled as showing mental distress: 
    - nostalgia (either for the past or simply for the end of a nice day)
    - mention of sad activities (people who watched sad movies, listened to sad songs as this was most probably triggered by a previous feeling of sadness)
    - mention of being mad (either over other people like haters or, more often, mad with their families)
 

- Some tweets mentioned the mental distress of other people (either how they helped them or raising awareness over these issues)
- Some tweets showed desperation (a form of mental distress) over the "way the world is", these were not included.

- An unexpected insight (which we decided not to pursue further as it was too tenuous) was the link between summer and mental distress. Indeed, most tweets mentioning summer showed some sadness due to the absence of friends (from university for example).
- Some tweets only included _"motivational"_ quotes. We can suppose that these reflect bad times. Nonetheless, these were not included as our previous assumption is a little far-fetched. Moreover, this could have messed with our model (they did not include specific words of our dictionary nor specific mentions to mental distress).

- A lot of mental distress tweets are related to unilateral feelings. They are often messages specifically targeted at someone who most certainly doesn't know of their existence (for example work colleagues or college classmates).

- Most tweets talking about depression or very low self-esteem show _« covert-humour »_, a coping technique used to hide one's self-deprecation. These tweets are very difficult to detect as they use sarcasm (which cannot be perceived using current NLP techniques) and seldom use words from our dictionary.

#### 3.4.2. Key _(quantitative)_ takeaways

After the labeling, we found the necessity of both changing our current dictionary (through additions but also removals) and creating a new dictionary (which we decided to call a _negative_ dictionary, but is simply a blacklist of words we did not want to appear anymore).

**Words to include in our dictionary :**

In [None]:
tweet_search(first_iter, 'unfortunately')

In [None]:
tweet_search(first_iter, 'therapy')

In [None]:
tweet_search(first_iter, 'overthinking')

Even though these expressions do not appear very often, they are unique to tweets showing mental distress. Thus, we decide to include them in our new dictionary.

**Words to remove from our dictionary :**

In [None]:
md_search(first_iter, 'hate', 'split')

In [None]:
md_search(first_iter, 'pain', 'split')

In [None]:
md_search(first_iter, 'addict', 'split')

Even though we decided to exclude many words (_hate_, _hurt_, _pain_, _addict_ and _overdose_), we demonstrate our method on only a few of them. Looking at the numbers, we see that more than half of the tweets containing these words are false positives, which drives us to strike them from our dictionary.
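
The check performed by `md_search` for one word at a time can also be run over a whole list of candidate words. The following is a minimal, dependency-free sketch of that idea; the function name `keyword_precision` and the toy tweets are assumptions made for illustration, not part of our pipeline.

```python
from collections import defaultdict

def keyword_precision(labeled, keywords):
    """Return {keyword: (matches, distressed, precision)} for labeled tweets.

    `labeled` is an iterable of (tokens, label) pairs, where label is 1 when
    the annotator marked the tweet as showing mental distress and 0 otherwise.
    Keywords that never appear are left out of the result.
    """
    matches = defaultdict(int)
    distressed = defaultdict(int)
    for tokens, label in labeled:
        token_set = set(tokens)
        for kw in keywords:
            if kw in token_set:
                matches[kw] += 1
                distressed[kw] += label
    return {kw: (matches[kw], distressed[kw], distressed[kw] / matches[kw])
            for kw in keywords if matches[kw] > 0}

# Toy data, invented for illustration only
sample = [
    (['i', 'hate', 'mondays'], 0),
    (['i', 'hate', 'myself'], 1),
    (['hate', 'this', 'traffic'], 0),
    (['therapy', 'helped', 'today'], 1),
]
stats = keyword_precision(sample, ['hate', 'therapy', 'pain'])
for kw, (n, n_md, precision) in stats.items():
    print(kw, n, n_md, round(precision, 2))
```

A word whose precision falls below one half would, following the reasoning above, be a candidate for removal.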

**Words to blacklist :**

In [None]:
blacklisted_search(first_iter, 'overw')
blacklisted_search(first_iter, 'overwatch')

In [None]:
blacklisted_search(first_iter, 'syria')
blacklisted_search(first_iter, 'assad')
blacklisted_search(first_iter, 'asylum')

In [None]:
blacklisted_search(first_iter, 'trump')
blacklisted_search(first_iter, 'clinton')
blacklisted_search(first_iter, 'america')
blacklisted_search(first_iter, 'usa')

As we can see, the words we want to strike represent major recurring themes (mostly having to do with the news) which do not have anything to do with mental distress (none of these tweets were labeled as showing signs of mental distress). Once again, we only showed our idea on a small subset of topics. The full list of words we want to blacklist is : _gov_, _government_, _syria_, _assad_, _refugees_, _asylum_, _overwatch_, _vine_, _trump_, _hillary_, _america_, _usa_ and _india_. 

_Note:_ two other expressions are blacklisted, but only because our first search algorithm matched substrings rather than whole words : _suicidesquad_ (contains _suicide_) and _spain_ (contains _pain_)
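
The substring problem described in this note can be avoided by matching keywords on word boundaries. The sketch below uses the standard `re` module; the helper names `compile_keywords` and `find_keywords` and the three example keywords are assumptions for illustration and do not reproduce our actual search script.

```python
import re

def compile_keywords(keywords):
    """Build one regex matching any keyword as a whole word (case-insensitive)."""
    # Longest keywords first, so multi-word expressions win over their parts
    ordered = sorted(keywords, key=len, reverse=True)
    alternatives = '|'.join(re.escape(kw) for kw in ordered)
    return re.compile(r'\b(?:' + alternatives + r')\b', re.IGNORECASE)

def find_keywords(text, pattern):
    """Return the lowercased keywords found in `text`, in order of appearance."""
    return [m.group(0).lower() for m in pattern.finditer(text)]

pattern = compile_keywords(['suicide', 'pain', 'panic attack'])

# Whole-word matching ignores 'SuicideSquad' and 'Spain'
print(find_keywords('Watching #SuicideSquad in Spain', pattern))
# A genuine occurrence is still found
print(find_keywords('the pain never stops', pattern))
# The plain substring test, by contrast, misfires on 'spain'
print('pain' in 'watching suicidesquad in spain')
```

With this kind of matching, entries such as _suicidesquad_ and _spain_ no longer need to be blacklisted explicitly.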

#### 3.4.3. French and German labeling

When labeling the other languages, we saw that only $\sim10\%$ of tweets containing at least one keyword from a dictionary actually showed mental distress. Improving the dictionaries would probably give better results, but as no research of this kind has been done in these languages, we could only improve our results through many iterations (and labeling steps). As this is too time-consuming and uncertain, we decided to drop both languages and concentrate on English. 

### 3.5. Dictionary evaluation (using LDA)

Once again, we only work with tweets which were crossed with our second dictionary. 

As this is applied on a larger scale than our labeling, it allows us to reveal subjects which should be avoided (as shown when constructing our _negative_ dictionary).

We confirm the results found in the labeling and add additional blacklisted words.

_Note :_ we use a script to build our LDA models and save them to the _'data/models'_ folder. We will quickly present how these models were created, but our focus will be on understanding them (and displaying our results using the _**pyLDAvis**_ library).

#### 3.5.1. Building an English LDA model

To build our LDA model, we need to create a dictionary and define 'texts' (in our case the tweets). These are then passed to gensim, which implements the actual LDA model construction. The number of topics was chosen so that the topics are as understandable as possible to a human reader. The number of passes and iterations was chosen to be as large as possible, as the corpus we are working on is relatively small.
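
To make the dictionary and bag-of-words step more concrete, here is a dependency-free sketch that approximates what gensim's `filter_extremes` and `doc2bow` do. It is not gensim's implementation: `build_vocabulary` and `to_bow` are hypothetical names, and the toy tweets are invented.

```python
from collections import Counter

def build_vocabulary(texts, no_below=2, no_above=0.8):
    """Map each retained token to an integer id.

    A token is kept when it occurs in at least `no_below` documents and in at
    most a fraction `no_above` of all documents.
    """
    # Document frequency: each token counts once per document
    doc_freq = Counter(token for text in texts for token in set(text))
    max_docs = no_above * len(texts)
    kept = sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)
    return {token: idx for idx, token in enumerate(kept)}

def to_bow(text, vocabulary):
    """Convert a token list to sorted (token_id, count) pairs, dropping unknown tokens."""
    counts = Counter(vocabulary[t] for t in text if t in vocabulary)
    return sorted(counts.items())

# Toy tokenized tweets, invented for illustration
texts = [
    ['feel', 'sad', 'today'],
    ['sad', 'lonely', 'feel'],
    ['feel', 'lonely', 'lonely'],
    ['feel', 'fine'],
    ['feel', 'sad', 'song'],
]
vocabulary = build_vocabulary(texts)
corpus = [to_bow(text, vocabulary) for text in texts]
print(vocabulary)   # 'feel' is too frequent, 'today', 'fine' and 'song' too rare
print(corpus[2])
```

The resulting list of (id, count) pairs per tweet has the same shape as the `corpus` object handed to the LDA model.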

In [None]:
#remove stopwords for more meaningful results
english_stop_rem = dict_remove_stops(english_filtered.tokenized, 'english')

In [None]:
dictionary = corpora.Dictionary(english_stop_rem)
dictionary.filter_extremes(no_below=2, no_above=.8) #removes too frequent / sparse words
texts = english_stop_rem.tolist()
corpus = [dictionary.doc2bow(text) for text in texts]

After loading our LDA model, we evaluate it using the most frequent word distributions of each topic. We can already notice distinct words in Topic 3 (centers around words such as 'syria' and 'muslim').

In [None]:
ldamodel = models.LdaModel.load('models/english.model')
ldamodel.show_topics()

In [None]:
#Creating the visualisation
vis_data = pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)
pyLDAvis.display(vis_data)

We use the $\lambda$ slider to give understandable names to the topics:

- 1 & 2 : general topics talking about everyday distress, contains specific words such as relationship, teacher, sunday, gloomy


- 3 : wars & terrorism seems to be the main topic, we see keywords such as muslim, terrorist, usa, syria, aleppo, kill, bomb


- 4 & 5: mentions of suicide squad (a movie that came out in 2016), less easy to identify an overall topic, japan and whaling related tweets seem to be in this topic as well


- 6 : mentions of play, music (specifically gnash and his song i hate u, i love u ft. olivia o'brien), this topic seems to have formed due to the many mentions of this song in tweets.

We see that common specific words in topic 3, 4, 5 and 6 should be excluded from our dictionary search.

### 3.6. Dictionary update (using a second sample labeling)

After everything that we found above, we updated our dictionary to refine our output. The following analysis follows the labeling of a new sample from the dataset evaluated using the new dictionary. We will use all these takeaways to update our dictionary one last time (without analysing it again). 

In [None]:
second_iter = pd.read_csv(SP_NEW_EN_LBL)
second_iter.head()

#### 3.6.1. Key _(qualitative)_ takeaways

Unlike the first pass on the data, this labeling does not include a thorough qualitative analysis. It only allowed us to improve our methodology and find plausible explanations for some of the phenomena we observed. One such phenomenon is the difference in emotional expression between the genders. This may be due to the (very high) number of tweets by _"fangirls"_, female tweet authors mentioning their love for their favorite music bands (mostly Asian) or one of their favorite actors.

#### 3.6.2. Key _(quantitative)_ takeaways

In this case, we only determine words to add both to our dictionary and to our _negative_ dictionary (no word in our dictionary was deemed superfluous or inducing too many false positives).

**Words to add to our dictionary :**

In [None]:
tweet_search(second_iter, 'breakdown')

In [None]:
exp_count = second_iter[second_iter['main'].map(lambda x: ('confused' in x or 'lonely' in x) and 'sad' in x)]['main']
print("Total appearance of the expressions involving 'sad' : ", exp_count.count())
exp_count

We did not use our usual search function for the last demonstration as it involves 2 expressions we found the need to group. Once again, this is only a preview of our quantitative analysis. The whole list of expressions we want to add to our dictionary is : _panic attack_, _sleepless_, _problems falling asleep_, _lonely and sad_, _confused and sad_, _gambling_, _breakdown_, and _mental issue_.

**Words to blacklist :**

In [None]:
md_search(second_iter, 'sadly', 'main')

In [None]:
sm_count = second_iter[second_iter['main'].map(lambda x: 'sonic' in x and 'mania' in x)]['main']
print("Total appearance of the expression 'Sonic Mania' : ", sm_count.count())
sm_count

Once again, we use a different format to show a special expression. We need to delete _'Sonic Mania'_ as it contains the word _'mania'_ (which couldn't be taken out even with our new search algorithm). However, we can see that refining our dictionary worked, as we have far fewer words to blacklist : _sadly_, _not afraid_, _stress relief_ and words with the root _"addict"_.
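
One possible way to handle expressions such as _'Sonic Mania'_ is to mask blacklisted phrases before looking for keywords. The sketch below illustrates this under stated assumptions: the blacklisted phrases come from the list above, but the keyword subset and the helper names are hypothetical, and this is not the logic of our final filtering script.

```python
import re

# Phrases taken from the blacklist above
BLACKLISTED_PHRASES = ['sonic mania', 'not afraid', 'stress relief']
# Illustrative keyword subset (assumption, not our full dictionary)
KEYWORDS = ['mania', 'afraid', 'stress', 'breakdown']

def whole_phrase_regex(phrases):
    """Regex matching any of the phrases as whole words, ignoring case."""
    ordered = sorted(phrases, key=len, reverse=True)
    body = '|'.join(re.escape(p) for p in ordered)
    return re.compile(r'\b(?:' + body + r')\b', re.IGNORECASE)

BLACKLIST_RE = whole_phrase_regex(BLACKLISTED_PHRASES)
KEYWORD_RE = whole_phrase_regex(KEYWORDS)

def match_keywords(text):
    """Mask blacklisted phrases, then return the keywords still present."""
    masked = BLACKLIST_RE.sub(' ', text)
    return [m.group(0).lower() for m in KEYWORD_RE.finditer(masked)]

print(match_keywords('Sonic Mania is out today'))
print(match_keywords('stress relief tips'))
print(match_keywords('so much stress, close to a breakdown'))
```

Only the last tweet keeps any keyword, since the first two match solely inside a blacklisted phrase.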

## 4. Analysis of First Dataset

We process the final dictionary using the dict_filtering_final.py script.


We now discuss the final results from our dict.

In [None]:
final_df = pd.read_pickle('data/spinn3r_tweets/keyword_tweets/final_english_keyworded_tweets.pkl')

In [None]:
final_df = final_df[~final_df.keywords.isnull()]

In [None]:
has_keyword = final_df.keywords.map(lambda x: len(x) > 0)

In [None]:
final_df[has_keyword].main.head(10)

In [None]:
final_df[has_keyword].keywords.map(lambda x: ' '.join(x[0])).value_counts().plot(kind='bar')
plt.show()

### Sentiment score

In [None]:
plt.subplot(2,2,1)
(final_df[has_keyword].sentiment.value_counts()/final_df[has_keyword].sentiment.count()).plot(kind='bar')
plt.subplot(2,2,2)
(final_df.sentiment.value_counts()/final_df.sentiment.count()).plot(kind='bar')
plt.show()

### Gender

In [None]:
plt.subplot(2,2,1)
(final_df[has_keyword].author_gender.value_counts()/final_df[has_keyword].author_gender.count()).plot(kind='bar')
plt.subplot(2,2,2)
(final_df.author_gender.value_counts()/final_df.author_gender.count()).plot(kind='bar')
plt.show()

### Temporal analysis

In [None]:
plot_daily(final_df[has_keyword])
plot_daily(final_df)
plt.xlabel('hour')
plt.ylabel('number of tweets')
plt.legend(['filtered', 'all'])
plt.title('Distribution of tweets over 24 hours')
plt.show()

In [None]:
yearly_distribution = final_df[has_keyword].published.map(lambda x: x.month).value_counts() / final_df[has_keyword].published.count()
yearly_distribution2 = final_df.published.map(lambda x: x.month).value_counts() /  final_df.published.count()

sns.tsplot(data=yearly_distribution, time=yearly_distribution.index, color='orange')
sns.tsplot(data=yearly_distribution2, time=yearly_distribution2.index)
plt.legend(['filtered', 'overall'])
plt.ylabel('relative percentage of tweets')
plt.title('Distribution of tweets over a 10-month period')
plt.xticks(range(1, 11),['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct'])
plt.xlabel('month')
plt.show()

### Location analysis

In [None]:
locations = process_location_data(final_df[has_keyword])
location_geo = gpd.GeoSeries(locations.apply(lambda z: Point(z[1], z[0]), 1), crs={'init': 'epsg:4326'})
location = gpd.GeoDataFrame(final_df[has_keyword][~final_df[has_keyword].geo_point.isnull()], geometry=location_geo) #merged data –useful later on–

In [None]:
municipalities = gpd.read_file('data/geodata/ch-municipalities.shp') #load data
municipalities.geometry = municipalities.geometry.to_crs(location_geo.crs) #change format

In [None]:
#join between region and location
counts = gpd.tools.sjoin(location, municipalities).groupby('id').size()

In [None]:
#add counts to the regions
municipalities.set_index('id', inplace=True)
municipalities['counts'] = counts
municipalities.fillna(0, inplace=True) #if we can't find match we have 0 tweets
municipalities.reset_index(inplace=True)

In [None]:
municipalities_0 = municipalities[municipalities.counts < 1]
municipalities_1 = municipalities[municipalities.counts >= 1]

In [None]:
SWISS_COORD = [46.85, 8.23]
swiss_map = folium.Map(location = SWISS_COORD, zoom_start = 8, tiles = 'cartodbpositron')

#layer with municipalities containing at least 1 tweet
swiss_map.choropleth(municipalities_1.to_json(), data = municipalities_1,
                     key_on = 'feature.properties.{}'.format('id'),
                    columns = ['id', 'counts'], fill_color = 'PuRd',
                     legend_name='Number of tweets', fill_opacity=0.9)
#layer with municipalities containing no tweets
swiss_map.choropleth(municipalities_0.to_json(), data = municipalities_0, 
                     key_on = 'feature.properties.{}'.format('id'),
                     columns = ['id', 'counts'], fill_color = 'PuBu', 
                     threshold_scale=[0, 1], legend_name='Area of little to no data',
                     fill_opacity=0.1, line_opacity=0.3)

In [None]:
swiss_map.save('dist.html')

In [None]:
%%HTML
<iframe width='900' height="500" src="dist.html"></iframe>

## 5. Processing Pipeline on the Second Dataset

#### Mapping from Dataset 1 to Dataset 2

Formatting is as follows: **Column in Dataset 2** -> Column in Dataset 1

- **author_user_id**           -> userId
- **geo_point**                -> longitude + latitude (or: placeLongitude + placeLatitude)
- **main**                     -> text
- **published**                -> createdAt
- **source_followers**         -> followersCount
- **source_following**         -> friendsCount (on Twitter, "friends" are the accounts a user follows)
- **source_location**          -> userLocation
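
The mapping above can be written down as a small renaming step. The sketch below works on one tweet stored as a plain dictionary; `harmonise` and `first_present` are hypothetical helpers, and storing `geo_point` as a (latitude, longitude) tuple is an assumption.

```python
# Right-hand column name -> bold column name, as listed above
COLUMN_MAP = {
    'userId': 'author_user_id',
    'text': 'main',
    'createdAt': 'published',
    'followersCount': 'source_followers',
    'friendsCount': 'source_following',
    'userLocation': 'source_location',
}

def first_present(record, *keys):
    """Return the first value among `keys` that is present and not None."""
    for key in keys:
        if record.get(key) is not None:
            return record[key]
    return None

def harmonise(record):
    """Rename the mapped fields of one tweet record and build `geo_point`."""
    out = {new: record.get(old) for old, new in COLUMN_MAP.items()}
    # Prefer the tweet's own coordinates, fall back to the place coordinates
    lat = first_present(record, 'latitude', 'placeLatitude')
    lon = first_present(record, 'longitude', 'placeLongitude')
    # Assumption: geo_point is stored as (latitude, longitude)
    out['geo_point'] = (lat, lon) if lat is not None and lon is not None else None
    return out

tweet = {'userId': 7, 'text': 'hello', 'createdAt': '2016-01-01 10:00:00',
         'placeLatitude': 46.52, 'placeLongitude': 6.63}
print(harmonise(tweet))
```

Fields missing from the input record simply end up as `None`, which mirrors the NaN values found in the raw data.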


#### Dataset 1 having no equivalent : 
- id                                -> tweet ID
- truncated                         -> no idea, only NaNs
- placeId                           -> ID of where they are
- inReplyTo                         -> If it is a tweet reply
- source & sourceName, sourceUrl    -> what user used to send tweet (Android, website, and so on)
- userName                          -> username
- screenName                        -> name shown on tweets from username
- statusesCount                     -> number of status of user

#### Dataset 2 having no equivalent : 
- **author_gender**                 -> author's gender
- **lang**                          -> language of tweet (given)
- **sentiment**                     -> tweet's sentiment analysis (given)
- **source_spam_probability**       -> spam probability of tweet (given)
- **tags**                          -> hashtags contained in the tweet (in Dataset 1 stays in text)
- **tokenized**                     -> tokenized text of tweet (done by us)
- **keywords**                      -> keywords found in tweets (done by us)

The dataset comes with a *txt schema*, giving us an idea of the meaning of each column in the *tsv file* containing the tweets. We were given a sample file to get an overview of the data, and we also obtained the complete set of tweets (5GB) in a _.zip_ file.

The dataset contains the following useful columns:

- **userId** : id identifying the user.
- **createdAt** : time the tweet was posted.
- **text** : content of the tweet.
- **placeLatitude** : latitude of the tweet.
- **placeLongitude** : longitude of the tweet.
- **sourceName** : username.
- **sourceUrl** : URL of the tweet.
- **followersCount** : number of followers.
- **friendsCount** : number of accounts the user follows.
- **statusesCount** : number of statuses of user.

The sample dataset contains a lot of NaN values: each column contains at least 1% NaN values.

The complete analysis and code can be found in the [Basic Exploration dataset 1 notebook](Basic%20Exploration%20Dataset%201.ipynb)

## 7. Analysis of the Second Dataset

Due to issues with the cluster, we decided to work on a local copy of the dataset and perform the necessary processing steps locally.

By removing unnecessary columns, tweets containing websites, retweets, and general gibberish (that is, sets of characters mixed with numbers), we can reduce the size of the dataset from 5.6 GB down to 1.2 GB. The code for this process can be found in the run_d1.py script. We save this dataset in a csv file for further processing.
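
The kind of filtering described above can be sketched with a few regular expressions. This is not the content of run_d1.py: the patterns, the `max_mixed_ratio` threshold and the helper names are all assumptions chosen for illustration.

```python
import re

URL_RE = re.compile(r'https?://\S+|www\.\S+', re.IGNORECASE)
RETWEET_RE = re.compile(r'^\s*RT\s+@\w+')
# A "mixed" token contains both letters and digits, e.g. 'x9f3kq'
MIXED_TOKEN_RE = re.compile(r'^(?=.*[a-z])(?=.*\d)[a-z\d]+$', re.IGNORECASE)

def is_gibberish(text, max_mixed_ratio=0.5):
    """Flag empty texts and texts made mostly of letter/digit mixtures."""
    tokens = text.split()
    if not tokens:
        return True
    mixed = sum(1 for t in tokens if MIXED_TOKEN_RE.match(t))
    return mixed / len(tokens) > max_mixed_ratio

def keep_tweet(text):
    """Keep a tweet only if it has no URL, is not a retweet and is not gibberish."""
    return not (URL_RE.search(text) or RETWEET_RE.match(text) or is_gibberish(text))

# Toy tweets, invented for illustration
tweets = [
    'feeling really low today',
    'RT @someone: feeling really low today',
    'read this https://example.com/post',
    'a8f3 kq91z x0x0',
]
kept = [t for t in tweets if keep_tweet(t)]
print(kept)
```

Using a ratio rather than rejecting any mixed token keeps ordinary tweets that happen to contain a token such as '2day'.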

As mentioned previously, this set is not annotated with the language of the tweet. We thus needed an external language classification library to label the tweets and keep only English-language tweets. For reasons of speed we use the guess_language function from guess_language-spirit. Using this function, labeling all 14 million tweets took about 2 hours.
After this we save the tweets that were labeled as English to another csv file.
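
The labeling step can be organised as a streaming filter, so that the whole file never has to sit in memory. In the sketch below, `toy_detect` is a deliberately crude stand-in for a real language detector (it is not the function we used), and `keep_language` is a hypothetical helper.

```python
import csv
import io

def keep_language(rows, detect, lang='en', text_field='text'):
    """Yield only the rows whose text is detected as `lang`."""
    for row in rows:
        if detect(row[text_field]) == lang:
            yield row

def toy_detect(text):
    """Toy stand-in for a language detector: counts common function words."""
    words = set(text.lower().split())
    scores = {
        'en': len(words & {'the', 'and', 'is', 'i', 'my', 'to'}),
        'fr': len(words & {'le', 'la', 'et', 'je', 'est', 'les'}),
        'de': len(words & {'der', 'die', 'und', 'ich', 'ist', 'das'}),
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else 'unknown'

# In-memory CSV standing in for the cleaned tweet file
raw = io.StringIO(
    'userId,text\n'
    '1,i miss my friends and the summer\n'
    '2,je suis triste et seul\n'
    '3,ich bin traurig und allein\n'
)
english_rows = list(keep_language(csv.DictReader(raw), toy_detect))
print(english_rows)
```

Because `detect` is passed in as an argument, the stand-in can be swapped for any real detector that returns a language code.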

We now give a quick analysis of this dataset:

In [None]:
twitter_df2 = pd.read_csv('data/twitter-swisscom/dataset2_english.csv')

In [None]:
twitter_df2.head(2)

We perform some simple data processing that is needed after loading the csv file.

Then we visualize the overall distribution of the dataset with functions shown in section 2.


In [None]:
twitter_df2.rename(columns={'createdAt':'published'}, inplace=True)
twitter_df2 = twitter_df2[~twitter_df2.published.isnull()]
twitter_df2['published'] = twitter_df2['published'].astype(str)

In [None]:
#remove rows with malformed timestamps
tw = twitter_df2[twitter_df2.published.map(lambda x: len(x) == 19)]

In [None]:
#transform to date time
tw.published = pd.to_datetime(tw['published'])

We now look at how many tweets we have remaining: 

In [None]:
tw.count()

We only have 1 million tweets remaining, which is quite few given the initial size of the dataset.

From the previous dataset we know that keywords are quite sparse. So despite this dataset spanning a longer period of time, we still find that only very few usable tweets remain.

In [None]:
plot_daily(tw)
plt.xlabel('hour')
plt.ylabel('number of tweets')
plt.title('Distribution of tweets over 24 hours')
plt.show()

The temporal distribution over one day matches the distribution we observed in part 2.

In [None]:
yearly_distribution = tw.published.map(lambda x: x.month).value_counts()

In [None]:
sns.tsplot(data=yearly_distribution, time=yearly_distribution.index)
plt.ylabel('number of tweets')
plt.xlabel('month')
plt.show()

We see that the overall monthly distribution is very odd. We have no explanation for this phenomenon.


We note that we can find dates as early as 2002 in the dataset. This must be an error, as Twitter first went online in 2006.
Due to the sparsity of tweets before 2013, we only look at tweets from 2013 onwards.
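
The timestamp checks can be made stricter than a length test by parsing each value against an explicit format. The sketch below assumes the layout `YYYY-MM-DD HH:MM:SS` (which is 19 characters long); both the format and the helper `parse_timestamp` are assumptions for illustration.

```python
from datetime import datetime

TIMESTAMP_FORMAT = '%Y-%m-%d %H:%M:%S'   # assumed layout, 19 characters long

def parse_timestamp(value, min_year=2013):
    """Return a datetime, or None when the value is malformed or too old."""
    try:
        parsed = datetime.strptime(value, TIMESTAMP_FORMAT)
    except (TypeError, ValueError):
        # TypeError covers missing values, ValueError covers bad strings
        return None
    return parsed if parsed.year >= min_year else None

# The third value is 19 characters long but is not a timestamp
raw_values = ['2014-03-02 18:45:10', '2002-01-01 00:00:00', 'not a timestamp!!!!', None]
parsed_values = [parse_timestamp(v) for v in raw_values]
print(parsed_values)
```

Only the first value survives: the second is older than 2013, and the last two cannot be parsed.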

In [None]:
tw = tw[tw.published.map(lambda x: int(x.year) >= 2013)]

In [None]:
yearly_distribution = tw.published.map(lambda x: x.date()).value_counts()

import matplotlib
date = matplotlib.dates.date2num(yearly_distribution.index)

In [None]:
sns.tsplot(data=yearly_distribution, time=date)
#plt.xticks([  x.year for x in matplotlib.dates.num2date(date)])
plt.show()

The plot we find is very odd. Again, we have no explanation for this.

## 8. Conclusion