# Assignment

## 1. Go to the [eurostat](http://ec.europa.eu/eurostat/data/database) website and try to find a dataset that includes the European unemployment rates at a recent date.

Use this data to build a [Choropleth map](https://en.wikipedia.org/wiki/Choropleth_map) which shows the unemployment rate in Europe at a country level. Think about [the colors you use](https://carto.com/academy/courses/intermediate-design/choose-colors-1/), how you decided to [split the intervals into data classes](http://gisgeography.com/choropleth-maps-data-classification/) or which interactions you could add in order to make the visualization intuitive and expressive. Compare Switzerland's unemployment rate to that of the rest of Europe.

In [1]:
import os
import pandas as pd
import numpy as np
import json
import folium
import topojson
import branca

In [2]:
assert(folium.__version__ == '0.5.0') # do we have the right version of folium?

We pick the `tsdec450.tsv` dataset, found here:

_Populations and social conditions_ -> _Labour markets_ -> _Employment and unemployment_ -> _Unemployment_ -> _Total unemployment rate_

This dataset contains yearly unemployment data for all countries of interest.

Note: the dataset is in Tab-Separated Values (TSV) format, which we can parse simply by setting a `'\t'` separator.

In [3]:
# Import dataset
unemployment_eu = pd.read_csv('data/tsdec450.tsv', sep='\t', index_col=0)
unemployment_eu = unemployment_eu.rename(columns=lambda x: x.strip()) # trim whitespaces in headers

# Let's see what this thing looks like
unemployment_eu

"age,unit,sex,geo\time",1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,...,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016
"TOTAL,PC_ACT,T,AT",:,:,:,:,4.3,4.2,4.7,4.7,4.7,4.2,...,4.9,4.1,5.3,4.8,4.6,4.9,5.4,5.6,5.7,6.0
"TOTAL,PC_ACT,T,BE",6.6,6.4,7.1,8.6,9.8,9.7,9.5,9.2,9.3,8.4,...,7.5,7.0,7.9,8.3,7.2,7.6,8.4,8.5,8.5,7.8
"TOTAL,PC_ACT,T,BG",:,:,:,:,:,:,:,:,:,:,...,6.9,5.6,6.8,10.3 i,11.3,12.3,13.0,11.4,9.2,7.6
"TOTAL,PC_ACT,T,CY",:,:,:,:,:,:,:,:,:,:,...,3.9,3.7,5.4,6.3,7.9,11.9,15.9,16.1,15.0,13.0
"TOTAL,PC_ACT,T,CZ",:,:,:,4.3,4.3,4.0,3.9,4.8,6.5,8.7,...,5.3,4.4,6.7,7.3,6.7,7.0,7.0,6.1,5.1,4.0
"TOTAL,PC_ACT,T,DE",:,5.5,6.6,7.8,8.4,8.2,8.9,9.6,9.4,8.6,...,8.5,7.4,7.6,7.0,5.8,5.4,5.2,5.0,4.6,4.1
"TOTAL,PC_ACT,T,DK",7.2,7.9,8.6,9.6,7.7,6.7,6.3,5.2,4.9,5.2,...,3.8,3.4,6.0,7.5,7.6,7.5,7.0,6.6,6.2,6.2
"TOTAL,PC_ACT,T,EA18",:,:,:,:,:,:,:,:,:,9.6,...,7.5,7.6,9.6,10.1,10.1,11.4,12.0,11.6,10.9,10.0
"TOTAL,PC_ACT,T,EA19",:,:,:,:,:,:,:,:,:,9.7,...,7.5,7.6,9.6,10.2,10.2,11.4,12.0,11.6,10.9,10.0
"TOTAL,PC_ACT,T,EE",:,:,:,:,:,:,:,:,:,:,...,4.6,5.5 i,13.5,16.7,12.3,10.0,8.6,7.4,6.2,6.8


As we can see, this table reports unemployment both as a percentage of the working-age population (PC_POP) and as a percentage of the active workforce (PC_ACT). We use the latter (PC_ACT) for the rest of this assignment.

For mapping purposes, we simply need to extract the rows of interest (PC_ACT) and keep only the column for the latest year (2016). Once done, we extract the two-letter country code (the last 2 letters) from the index column and we have the data we want.

In [4]:
recent_unemployment_eu = unemployment_eu[unemployment_eu.index.str.match(".*PC_ACT.*")]['2016'].reset_index()
recent_unemployment_eu.columns = ['country', 'unemployment']
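# Keep the trailing two-letter country code ([A-Z], not [A-z], which also matches punctuation)
recent_unemployment_eu['country'] = recent_unemployment_eu['country'].str.extract('([A-Z]{2})$', expand=False)
# Aggregates such as EA18/EA19 do not end in two letters, so they become NaN and are dropped
recent_unemployment_eu = recent_unemployment_eu.dropna()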
recent_unemployment_eu

Unnamed: 0,country,unemployment
0,AT,6.0
1,BE,7.8
2,BG,7.6
3,CY,13.0
4,CZ,4.0
5,DE,4.1
6,DK,6.2
9,EE,6.8
10,EL,23.6
11,ES,19.6
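
The 2016 column appears to parse as plain numbers, but other years in the raw table carry Eurostat markers: `:` for a missing value and flags such as the ` i` in `10.3 i`. If another year were needed, a cleaning step along the following lines might help. This is a minimal sketch on a tiny sample copied from the table above; the `clean_eurostat` helper is hypothetical and not part of the original notebook.

```python
import io
import pandas as pd

# Three columns for two countries, copied from the table above (same TSV layout)
sample = "geo\\time\t1990 \t2010 \t2016 \nAT\t: \t4.8 \t6.0 \nBG\t: \t10.3 i\t7.6 \n"

raw = pd.read_csv(io.StringIO(sample), sep='\t', index_col=0, dtype=str)
raw = raw.rename(columns=lambda x: x.strip())  # trim whitespace in headers

def clean_eurostat(column):
    # Keep the leading number and drop any trailing flag; ':' has no number and becomes NaN
    numbers = column.str.extract(r'^\s*(-?\d+(?:\.\d+)?)', expand=False)
    return pd.to_numeric(numbers, errors='coerce')

cleaned = raw.apply(clean_eurostat)
print(cleaned)
```

Reading everything as strings first keeps the flags visible, so they can be inspected before they are discarded.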


We now plot the unemployment rate on the map.

In [5]:
# Get the topology
with open("topojson/europe.topojson.json") as f:
    topo = json.load(f)

# Center the map and scale it (zoom = 4)
europe_coordinates = [53, 9]
topo_map = folium.Map(location=europe_coordinates, zoom_start=4)
topo_map.choropleth(topo, topojson='objects.europe',
             data=recent_unemployment_eu,
             columns=['country', 'unemployment'], 
             key_on='feature.id',
             fill_color='YlOrRd', fill_opacity=0.7, line_opacity=0.2,
             legend_name='Unemployment rate (% of the active workforce)')

# Let's take a look...
topo_map

As we can see, there are three problems with the above solution:

1. Countries which are not in the unemployment dataframe (Switzerland, Russia, etc.) are still assigned a color (light yellow). This is because choropleth assigns the lowest value in the range if there is no associated data.
2. Some countries which are in the unemployment dataframe are not assigned the right color. The reason is that they are indexed differently in topojson. This is the case in particular for the UK, which is associated with the 'GB' identifier.
3. Switzerland's unemployment rate does not appear in the data since it is not part of the EU.
 
In order for the plots to be meaningful and provide insight into the data, we need to address these issues.

In addition to the above, we need to define a meaningful color scale to "split the intervals into data classes". We get inspiration from http://gisgeography.com/choropleth-maps-data-classification/ and opt for a Standard Deviation Classification, which requires the mean and standard deviation of the data.

In [6]:
recent_unemployment_eu.describe()

Unnamed: 0,unemployment
count,33.0
mean,8.166667
std,4.433232
min,3.0
25%,5.1
50%,6.9
75%,9.7
max,23.6


This gives us the following classes (approximating with mean=8, stddev=4).

In [7]:
classes = [0,4,8,12,16,20,24]
colorscale = branca.colormap.linear.YlOrRd.to_step(index=classes)
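
The class boundaries above were written by hand. As a cross-check, the same list can be derived by stepping one standard deviation at a time away from the mean until the data range is covered. This is a minimal sketch; `stddev_breaks` is a hypothetical helper, and the inputs are the rounded statistics and the min/max from the `describe()` output.

```python
import math

def stddev_breaks(mean, std, low, high):
    """Boundaries at mean + k*std for integer k, wide enough to cover [low, high]."""
    k_min = math.floor((low - mean) / std)   # first boundary at or below the minimum
    k_max = math.ceil((high - mean) / std)   # last boundary at or above the maximum
    return [mean + k * std for k in range(k_min, k_max + 1)]

# Rounded mean and standard deviation, with min and max taken from describe()
breaks = stddev_breaks(mean=8, std=4, low=3.0, high=23.6)
print(breaks)  # [0, 4, 8, 12, 16, 20, 24]
```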

We now address the other issues described above.

The first issue cannot be corrected with `choropleth` at this point since it is not possible to set a custom color scale where non-existent values are assigned another color. However, we can use `TopoJson` with a corresponding `style_function` to build a choropleth. The custom style function allows us to not color missing values.

In [8]:
# Get the topology
with open("topojson/europe.topojson.json") as f:
    topo = json.load(f)

# Style function: color countries by class, leave countries without data transparent
def style_function(feature):
    unemployment = recent_unemployment_eu.loc[recent_unemployment_eu['country'] == feature['id'], 'unemployment']
    value = float(unemployment.iloc[0]) if len(unemployment) > 0 else None
    if value is None: # None values stay transparent
        return {
            'fillOpacity': 0.0,
            'opacity': 0.0, # Leaflet's stroke opacity option is 'opacity', not 'lineOpacity'
            'weight': 0
        }
    else: # Other values get colored as per the color scale
        return {
            'fillOpacity': 0.7,
            'opacity': 0.2,
            'fillColor': colorscale(value)
        }
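
Filtering the dataframe once per map feature works, but a dictionary lookup expresses the same idea more directly. The following is a minimal sketch of that alternative, not the notebook's method: `make_style_function` is a hypothetical helper, and the two-color `color_for` lambda is only a stand-in for the real color scale.

```python
import pandas as pd

# Two rows copied from the dataframe above
rates_df = pd.DataFrame({'country': ['AT', 'BE'], 'unemployment': [6.0, 7.8]})
rate_by_country = dict(zip(rates_df['country'], rates_df['unemployment']))

def make_style_function(rate_by_id, color_for):
    """Build a style function that looks rates up by feature id."""
    def style(feature):
        value = rate_by_id.get(feature['id'])
        if value is None:  # no data: keep the shape invisible
            return {'fillOpacity': 0.0, 'weight': 0}
        return {'fillOpacity': 0.7, 'weight': 1, 'fillColor': color_for(value)}
    return style

# Stand-in color function (assumption): dark red from 7% upwards, pale yellow below
style = make_style_function(rate_by_country, lambda v: '#bd0026' if v >= 7 else '#ffffb2')
print(style({'id': 'BE'}), style({'id': 'CH'}))
```

Passing the lookup table and the color function as arguments also makes the style function reusable for other datasets.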

We fix the second and third issues by manually editing the dataframe. We pick the unemployment value for Switzerland to be 3.5 as of Dec 2016 (as per the data collected for question 2).

In [9]:
recent_unemployment_eu.loc[recent_unemployment_eu['country'] == 'UK', ['country']] = 'GB'

In [10]:
switzerland = pd.DataFrame([['CH', 3.5]], columns=['country', 'unemployment'])
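# pd.concat works across pandas versions (DataFrame.append was removed in pandas 2.0)
recent_unemployment_eu = pd.concat([recent_unemployment_eu, switzerland], ignore_index=True)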
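
Identifier mismatches like UK/GB are easy to miss by eye. Eurostat also writes Greece as `EL`, whereas ISO 3166 uses `GR`, so it may be worth checking every code in the data against the topology. Below is a minimal sketch of such a check. `unmatched_codes` is a hypothetical helper, and `toy_topology` is a made-up stand-in that only mimics the TopoJSON layout (`objects` → object name → `geometries`, each with an `id`); the ids actually used in `europe.topojson.json` are not shown in this notebook.

```python
def unmatched_codes(data_codes, topology, object_name):
    """Return the codes present in the data but absent from the topology ids."""
    geometries = topology['objects'][object_name]['geometries']
    topo_ids = {geometry.get('id') for geometry in geometries}
    return sorted(set(data_codes) - topo_ids)

# Made-up topology fragment using ISO-style ids (assumption)
toy_topology = {
    'objects': {
        'europe': {
            'type': 'GeometryCollection',
            'geometries': [{'id': 'AT'}, {'id': 'GB'}, {'id': 'GR'}],
        }
    }
}

missing = unmatched_codes(['AT', 'UK', 'EL'], toy_topology, 'europe')
print(missing)  # ['EL', 'UK']
```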

Now we can (finally) plot it.

In [11]:
# Center the map and scale it (zoom = 4)
europe_coordinates = [53, 9]
topo_map = folium.Map(location=europe_coordinates, tiles='cartodbpositron', zoom_start=4)
# Create a topojson with the above-defined style function
folium.TopoJson(topo, 'objects.europe', style_function=style_function).add_to(topo_map)

# Let's take a look...
topo_map

As we can see, Switzerland's unemployment rate is on the low end compared to the rest of Europe. In particular, France, Spain, and Italy have much higher unemployment rates. Germany's and the UK's rates (which might change after Brexit) are also relatively low.

## 2. Go to the [amstat](https://www.amstat.ch) website to find a dataset that includes the unemployment rates in Switzerland at a recent date.

   > *HINT* Go to the `details` tab to find the raw data you need. If you do not speak French, German or Italian, think of using free translation services to navigate your way through. 

   Use this data to build another Choropleth map, this time showing the unemployment rate at the level of Swiss cantons. Again, try to make the map as expressive as possible, and comment on the trends you observe.
   
   The Swiss Confederation defines the rates you have just plotted as the number of people looking for a job divided by the size of the active population (scaled by 100). This is surely a valid choice, but as we discussed one could argue for a different categorization.

   Copy the map you have just created, but this time don't count in your statistics people who already have a job and are looking for a new one. How do your observations change? You can repeat this with different choices of categories to see how selecting different metrics can lead to different interpretations of the same data.

### Swiss Confederation Definition

We pick the dataset to include unemployment rate for each canton in December 2016. To facilitate import, we opt for Excel spreadsheets, of which we drop the first 3 rows so we only have the headers and data.

In [12]:
# Import dataset
unemployment_ch = pd.read_excel('data/ch_unemployment_1.xlsx', skiprows=[0,1,2])
unemployment_ch = unemployment_ch.drop(unemployment_ch.columns[1], axis=1)
unemployment_ch = unemployment_ch.drop(unemployment_ch.columns[4:7], axis=1)
unemployment_ch.columns = ['canton', 'unemployment_rate', 'young_unemployment_rate', 'registered_unemployed']

# Let's see what it looks like
unemployment_ch

Unnamed: 0,canton,unemployment_rate,young_unemployment_rate,registered_unemployed
0,Zurich,3.8,4.0,31570
1,Berne,3.0,3.2,16636
2,Lucerne,2.2,2.6,4883
3,Uri,1.3,1.0,242
4,Schwyz,1.9,1.6,1683
5,Obwald,1.0,1.2,222
6,Nidwald,1.3,1.2,303
7,Glaris,2.5,2.7,569
8,Zoug,2.5,2.2,1693
9,Fribourg,3.1,2.9,5038


In order to use the above dataframe on the folium map, we need the two-letter canton abbreviations. Since the dataset lists the cantons by their French names, we need a mapping from those names to the abbreviations. We can get that mapping from Wikipedia (https://fr.wikipedia.org/wiki/Canton_(Suisse)) and place it in a CSV file.

In [13]:
# Import mapping
cantons_mapping = pd.read_csv('data/cantons.csv', index_col=False)

# Let's see what this thing looks like
cantons_mapping

Unnamed: 0,abbreviation,canton
0,ZH,Zurich
1,BE,Berne
2,LU,Lucerne
3,UR,Uri
4,SZ,Schwyz
5,OW,Obwald
6,NW,Nidwald
7,GL,Glaris
8,ZG,Zoug
9,FR,Fribourg


In [14]:
# Join the two dataframes on the canton column
unemployment_ch_to_plot = unemployment_ch.merge(cantons_mapping)
unemployment_ch_to_plot

Unnamed: 0,canton,unemployment_rate,young_unemployment_rate,registered_unemployed,abbreviation
0,Zurich,3.8,4.0,31570,ZH
1,Berne,3.0,3.2,16636,BE
2,Lucerne,2.2,2.6,4883,LU
3,Uri,1.3,1.0,242,UR
4,Schwyz,1.9,1.6,1683,SZ
5,Obwald,1.0,1.2,222,OW
6,Nidwald,1.3,1.2,303,NW
7,Glaris,2.5,2.7,569,GL
8,Zoug,2.5,2.2,1693,ZG
9,Fribourg,3.1,2.9,5038,FR
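
The default inner join silently drops any canton whose name is spelled differently in the two files. A quick way to catch this is a left join with `indicator=True`, which labels the rows that found no partner. The sketch below is minimal and uses three rows from the tables above, with `Bern` deliberately misspelled in the mapping to trigger a mismatch.

```python
import pandas as pd

rates = pd.DataFrame({'canton': ['Zurich', 'Berne', 'Uri'],
                      'unemployment_rate': [3.8, 3.0, 1.3]})
# 'Bern' is deliberately spelled differently here to illustrate a failed match
mapping = pd.DataFrame({'abbreviation': ['ZH', 'BE', 'UR'],
                        'canton': ['Zurich', 'Bern', 'Uri']})

check = rates.merge(mapping, on='canton', how='left', indicator=True)
unmatched = check.loc[check['_merge'] == 'left_only', 'canton'].tolist()
print(unmatched)  # ['Berne']
```

An empty list means every canton in the unemployment data received an abbreviation.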


Let us factor out the choropleth creation into a function so we can easily reuse it later on.

In [15]:
def swiss_choropleth(df, columns, legend):
    # Get the topology
    with open("topojson/ch-cantons.topojson.json") as f:
        topo = json.load(f)

    # Center the map and scale it (zoom = 8)
    switzerland_coordinates = [46.94, 7.94] # center on Bern
    topo_map = folium.Map(location=switzerland_coordinates, zoom_start=8)
    topo_map.choropleth(topo, topojson='objects.cantons',
                 data=df,
                 columns=columns, 
                 key_on='feature.id',
                 fill_color='YlOrRd', fill_opacity=0.7, line_opacity=0.2,
                 legend_name=legend)
    
    return topo_map

Now we can plot the data as we did for the EU countries.

In [16]:
swiss_choropleth(unemployment_ch_to_plot, ['abbreviation', 'unemployment_rate'],'Unemployment rate (%)')

As we can observe, western Switzerland is a bit worse off than the rest of Switzerland. Central Switzerland has the lowest unemployment, but it is also sparsely populated. Border and/or international cantons (Geneva, Zurich, Ticino) also have higher unemployment rates.

### Other Definitions

In order to compute the unemployment rate when it is defined as $\frac{unemployed}{active\_population}$, as opposed to the Swiss government definition of $\frac{job\_seekers}{active\_population}$, we need a dataset which contains the totals of unemployed people and job seekers per canton. Given such a dataset, we can easily convert the rates in the previous dataset by multiplying them by $\frac{unemployed}{job\_seekers}$.
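
The conversion factor follows from cancelling the job-seeker count. Written out, assuming both rates share the same denominator (the active population, as in the assignment statement):

$$
\frac{unemployed}{active\_population} = \frac{job\_seekers}{active\_population} \times \frac{unemployed}{job\_seekers}
$$

If every unemployed person is also counted as a job seeker, the factor $\frac{unemployed}{job\_seekers}$ is at most 1, so the converted rates can only stay equal or decrease.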

In [17]:
# Import dataset
unemployment_totals_ch = pd.read_excel('data/ch_unemployment_2.xlsx', skiprows=[0,1,2])
unemployment_totals_ch = unemployment_totals_ch.drop(unemployment_totals_ch.columns[1], axis=1)
unemployment_totals_ch.columns = ['canton', 'unemployed', 'job_seekers']

# Merge dataset with previous (join on canton)
df = unemployment_totals_ch.merge(unemployment_ch)
df['new_unemployment_rate'] = df['unemployment_rate'] * df['unemployed'] / df['job_seekers']
new_unemployment_ch_to_plot = df.merge(cantons_mapping)[['abbreviation', 'new_unemployment_rate']]

# Let's see what it looks like
new_unemployment_ch_to_plot

Unnamed: 0,abbreviation,new_unemployment_rate
0,ZH,3.108652
1,BE,2.306604
2,LU,1.364833
3,UR,0.772973
4,SZ,1.27297
5,OW,0.618384
6,NW,0.772353
7,GL,1.661799
8,ZG,1.504621
9,FR,1.731655


Let us plot this new rate...

In [38]:
swiss_choropleth(new_unemployment_ch_to_plot, ['abbreviation', 'new_unemployment_rate'], 'Unemployment rate (%)')

We observe that the relative differences between cantons barely change in this map. While the rates are lower overall (as expected), it is interesting to see that some cantons seem to have a large number of people trying to change jobs. This is the case, for instance, in Graubünden and Fribourg.

## 3. Use the [amstat](https://www.amstat.ch) website again to find a dataset that includes the unemployment rates in Switzerland at recent date, this time making a distinction between *Swiss* and *foreign* workers.

   The State Secretariat for Economic Affairs (SECO) releases [a monthly report](https://www.seco.admin.ch/seco/fr/home/Arbeit/Arbeitslosenversicherung/arbeitslosenzahlen.html) on the state of the employment market. In the latest report (September 2017), it is noted that there is a discrepancy between the unemployment rates for *foreign* (`5.1%`) and *Swiss* (`2.2%`) workers.

   Show the difference in unemployment rates between the two categories in each canton on a Choropleth map (*hint* The easy way is to show two separate maps, but can you think of something better?). Where are the differences most visible? Why do you think that is?

   Now let's refine the analysis by adding the differences between age groups. As you may have guessed it is nearly impossible to plot so many variables on a map. Make a bar plot, which is a better suited visualization tool for this type of multivariate data.

In [27]:
# Import dataset
unemployment_foreign_ch = pd.read_excel('data/ch_unemployment_3.xlsx', skiprows=[0,1,2])
unemployment_foreign_ch = unemployment_foreign_ch.drop(unemployment_foreign_ch.columns[2], axis=1)
unemployment_foreign_ch.columns = ['canton', 'nationality', 'unemployment_rate','registered_unemployed']
unemployment_foreign_ch

Unnamed: 0,canton,nationality,unemployment_rate,registered_unemployed
0,Zurich,Etrangers,5.3,12111
1,Zurich,Suisses,2.5,15114
2,Berne,Etrangers,5.5,4900
3,Berne,Suisses,1.8,8758
4,Lucerne,Etrangers,3.9,1593
5,Lucerne,Suisses,1.3,2292
6,Uri,Etrangers,2.1,53
7,Uri,Suisses,0.4,59
8,Schwyz,Etrangers,3.4,617
9,Schwyz,Suisses,1.2,838


In [28]:
# Keep Swiss workers only; drop=True discards the old index instead of turning it into a column
unemployment_ch_only = unemployment_foreign_ch.query("nationality == 'Suisses'").reset_index(drop=True)
unemployment_ch_only = unemployment_ch_only.merge(cantons_mapping)
unemployment_ch_only

Unnamed: 0,canton,nationality,unemployment_rate,registered_unemployed,abbreviation
0,Zurich,Suisses,2.5,15114,ZH
1,Berne,Suisses,1.8,8758,BE
2,Lucerne,Suisses,1.3,2292,LU
3,Uri,Suisses,0.4,59,UR
4,Schwyz,Suisses,1.2,838,SZ
5,Obwald,Suisses,0.5,85,OW
6,Nidwald,Suisses,0.7,154,NW
7,Glaris,Suisses,1.4,232,GL
8,Zoug,Suisses,1.7,826,ZG
9,Fribourg,Suisses,2.0,2578,FR


In [29]:
# Keep foreign workers only; drop=True discards the old index instead of turning it into a column
unemployment_foreign_only = unemployment_foreign_ch.query("nationality == 'Etrangers'").reset_index(drop=True)
unemployment_foreign_only = unemployment_foreign_only.merge(cantons_mapping)
unemployment_foreign_only

Unnamed: 0,canton,nationality,unemployment_rate,registered_unemployed,abbreviation
0,Zurich,Etrangers,5.3,12111,ZH
1,Berne,Etrangers,5.5,4900,BE
2,Lucerne,Etrangers,3.9,1593,LU
3,Uri,Etrangers,2.1,53,UR
4,Schwyz,Etrangers,3.4,617,SZ
5,Obwald,Etrangers,2.2,68,OW
6,Nidwald,Etrangers,2.9,94,NW
7,Glaris,Etrangers,3.4,184,GL
8,Zoug,Etrangers,3.9,717,ZG
9,Fribourg,Etrangers,5.0,1888,FR


In [30]:
swiss_choropleth(unemployment_ch_only, ['abbreviation', 'unemployment_rate'],'Unemployment rate (%)')

In [31]:
swiss_choropleth(unemployment_foreign_only, ['abbreviation', 'unemployment_rate'],'Unemployment rate (%)')

TODO: explain why these two maps cannot be compared directly (each map uses its own color scale).
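
Each of the two maps above gets its own color scale, so the same color stands for different rates on each map. One way to make them comparable is to compute a single set of class boundaries covering both series. This is a minimal sketch using the first five cantons from the table above; `shared_breaks` is a hypothetical helper and equal-width classes are an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Rates for Zurich, Berne, Lucerne, Uri and Schwyz, copied from the table above
swiss = pd.Series([2.5, 1.8, 1.3, 0.4, 1.2])
foreign = pd.Series([5.3, 5.5, 3.9, 2.1, 3.4])

def shared_breaks(*series, n_classes=5):
    """Equal-width class boundaries spanning the range of all series combined."""
    low = min(s.min() for s in series)
    high = max(s.max() for s in series)
    return [float(b) for b in np.linspace(np.floor(low), np.ceil(high), n_classes + 1)]

breaks = shared_breaks(swiss, foreign)
print(breaks)
```

Assuming the folium 0.5.0 signature, the resulting list could then be passed to `choropleth` through its `threshold_scale` argument for both maps.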

In [34]:
unemployment_foreign_swiss = unemployment_foreign_only.copy()[['abbreviation']]
unemployment_foreign_swiss['rel_unemployment_rate'] = unemployment_foreign_only['unemployment_rate'] / unemployment_ch_only['unemployment_rate']
unemployment_foreign_swiss

swiss_choropleth(unemployment_foreign_swiss, ['abbreviation', 'rel_unemployment_rate'], 'Relative unemployment rate')
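
The division above relies on both dataframes listing the cantons in the same row order. Joining on the abbreviation removes that assumption. The sketch below is minimal, uses three cantons from the tables above, and shuffles one side on purpose; the `_foreign` and `_swiss` suffixes are arbitrary names.

```python
import pandas as pd

foreign = pd.DataFrame({'abbreviation': ['ZH', 'BE', 'UR'],
                        'unemployment_rate': [5.3, 5.5, 2.1]})
# Same cantons in a different row order
swiss = pd.DataFrame({'abbreviation': ['UR', 'ZH', 'BE'],
                      'unemployment_rate': [0.4, 2.5, 1.8]})

both = foreign.merge(swiss, on='abbreviation', suffixes=('_foreign', '_swiss'))
both['rel_unemployment_rate'] = (both['unemployment_rate_foreign']
                                 / both['unemployment_rate_swiss'])
print(both[['abbreviation', 'rel_unemployment_rate']])
```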

In [35]:
# Import dataset
unemployment_age_nationality = pd.read_excel('data/ch_unemployment_4.xlsx', skiprows=[0,1,2])
unemployment_age_nationality = unemployment_age_nationality.drop(unemployment_age_nationality.columns[4], axis=1)
unemployment_age_nationality.columns = ['canton', 'nationality', 'age_class','age','registered_jobseekers']
unemployment_age_nationality.replace(np.nan, '')

Unnamed: 0,canton,nationality,age_class,age,registered_jobseekers
0,Zurich,Etrangers,1,15-24 ans,1015
1,Zurich,Etrangers,2,25-49 ans,8846
2,Zurich,Etrangers,3,50 ans et plus,2250
3,Zurich,Etrangers,Total,,12111
4,Zurich,Suisses,1,15-24 ans,2405
5,Zurich,Suisses,2,25-49 ans,8207
6,Zurich,Suisses,3,50 ans et plus,4502
7,Zurich,Suisses,Total,,15114
8,Zurich,Total,,,27225
9,Berne,Etrangers,1,15-24 ans,597


In [36]:
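# Per-canton totals of foreign job seekers
foreign_totals = unemployment_age_nationality.query("age_class == 'Total' and nationality == 'Etrangers'").reset_index(drop=True)
foreign_totals = foreign_totals.drop('age', axis=1)
foreign_totals['rate'] = unemployment_foreign_only['unemployment_rate']
# The rate is a count divided by a reference population (times 100), so count / rate recovers that population
foreign_totals['num_foreigners_per_canton'] = round(foreign_totals['registered_jobseekers'] / (foreign_totals['rate'] / 100))
foreign_totals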

Unnamed: 0,canton,nationality,age_class,registered_jobseekers,rate,num_foreigners_per_canton
0,Zurich,Etrangers,Total,12111,5.3,228509.0
1,Berne,Etrangers,Total,4900,5.5,89091.0
2,Lucerne,Etrangers,Total,1593,3.9,40846.0
3,Uri,Etrangers,Total,53,2.1,2524.0
4,Schwyz,Etrangers,Total,617,3.4,18147.0
5,Obwald,Etrangers,Total,68,2.2,3091.0
6,Nidwald,Etrangers,Total,94,2.9,3241.0
7,Glaris,Etrangers,Total,184,3.4,5412.0
8,Zoug,Etrangers,Total,717,3.9,18385.0
9,Fribourg,Etrangers,Total,1888,5.0,37760.0
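
One possible starting point for the bar plot requested in the question is to pivot the age-group rows so that each nationality becomes a column. The sketch below is minimal and uses only the Zurich rows shown above. It works on registered job seeker counts, not rates, and the `shares` normalization is an assumption about what would be worth plotting.

```python
import pandas as pd

# Zurich rows copied from the table above (totals left out)
rows = [
    ('Zurich', 'Etrangers', '15-24 ans', 1015),
    ('Zurich', 'Etrangers', '25-49 ans', 8846),
    ('Zurich', 'Etrangers', '50 ans et plus', 2250),
    ('Zurich', 'Suisses', '15-24 ans', 2405),
    ('Zurich', 'Suisses', '25-49 ans', 8207),
    ('Zurich', 'Suisses', '50 ans et plus', 4502),
]
zurich = pd.DataFrame(rows, columns=['canton', 'nationality', 'age', 'registered_jobseekers'])

# One row per age group, one column per nationality
pivot = zurich.pivot_table(index='age', columns='nationality',
                           values='registered_jobseekers', aggfunc='sum')
# Share of each age group within its nationality, so the two groups are comparable
shares = pivot.div(pivot.sum(axis=0), axis=1)
print(shares.round(3))
```

With matplotlib installed, `shares.plot.bar()` would draw the grouped bars.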


## 4. *BONUS*: using the map you have just built, and the geographical information contained in it, could you give a *rough estimate* of the difference in unemployment rates between the areas divided by the [Röstigraben](https://en.wikipedia.org/wiki/R%C3%B6stigraben)?