# Homework 2
#### Pierre-Antoine Desplaces, Anaïs Ladoy, Lou Richard

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
import seaborn as sns
import folium

In [None]:
def bars(title, data, xlab, ylab):
    sns.set_style('darkgrid')
    fig, ax = plt.subplots(figsize = (15,8))
    ax.set_title(title, fontsize=15, fontweight='bold')
    sns.barplot(x=xlab, y=ylab, data=data, saturation=0.7, errcolor='.7')
    plt.xticks(rotation=90)
    plt.show()

## <b> <font color='purple'>Question 1</font> </b>
<b>Go to the eurostat website and try to find a dataset that includes the european unemployment rates at a recent date.

Use this data to build a Choropleth map which shows the unemployment rate in Europe at a country level. Think about the colors you use, how you decided to split the intervals into data classes or which interactions you could add in order to make the visualization intuitive and expressive. Compare Switzerland's unemployment rate to that of the rest of Europe.</b>

We found the unemployment information under Population and social conditions > Labour Market > Employment and unemployment. We chose to work with the "Unemployment rates by sex, age and educational attainment level" dataset, found in the LFS - detailed annual survey results.

In [None]:
# Load the file
raw_euro = pd.read_csv('lfsa_urgaed_1_Data.csv')

Looking at the data, we see that it contains a lot of unnecessary information. Since we only need the total unemployment rate, regardless of education level or gender, we keep only the 'Total' rows of the SEX column and the 'All ISCED 2011 levels' rows of the ISCED11 column.

In [None]:
# Remove unnecessary rows
raw_euro = raw_euro[raw_euro.SEX == 'Total']
raw_euro = raw_euro[raw_euro.ISCED11 == 'All ISCED 2011 levels ']

We now create the dataframe we will use, keeping only the relevant columns GEO and Value. We then convert the values to float.

In [None]:
# Create dataframe with only the relevant columns
df_euro = raw_euro[['GEO', 'Value']].copy()
df_euro.columns = ('Country', 'Unemployment (%)')
# Adjust the index
df_euro.set_index((df_euro.index/15).astype(int), inplace=True)
# Convert the values to float
df_euro['Unemployment (%)'] = df_euro['Unemployment (%)'].astype(float)

We now load the json we will use to display our map.

In [None]:
# Load the json data of the map
euro_geo_path = 'topojson/europe.topojson.json'
with open(euro_geo_path) as f:  # close the file once it has been read
    euro_geo = json.load(f)
# First element of the json
euro_geo['objects']['europe']['geometries'][0]

When we look at the json, we see that we will have to use the country IDs to match the values from our dataframe to the geometries of the json. We thus create a dictionary mapping the country names to their IDs. We also change the name of Germany and Macedonia in our dataframe in order to match them with the json data. We then add a new column in our dataframe containing the IDs for each country.

In [None]:
# Clean the name of Germany and Macedonia
df_euro.Country = df_euro.Country.replace({'Germany (until 1990 former territory of the FRG)': 'Germany'})
df_euro.Country = df_euro.Country.replace({'Former Yugoslav Republic of Macedonia, the': 'The former Yugoslav Republic of Macedonia'})
# Create a dictionary to convert country names to country IDs
dict_country_id = dict(map(lambda x : (x['properties']['NAME'],x['id']),euro_geo['objects']['europe']['geometries']))
# Adding a new column with country IDs
df_euro['Country ID'] = df_euro.Country.map(lambda x : dict_country_id[x])
# Resulting dataframe
df_euro.head()
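
Before mapping names to IDs, it can help to list the names that have no counterpart in the map file, since those are the ones that need renaming. The following is a minimal sketch on a hypothetical three-country extract: the `geometries` list, the IDs and the `renames` dictionary are illustrative assumptions, not the contents of the real TopoJSON file.

```python
# Hypothetical miniature of the TopoJSON geometries and of the Eurostat names.
geometries = [
    {"id": "DE", "properties": {"NAME": "Germany"}},
    {"id": "FR", "properties": {"NAME": "France"}},
    {"id": "CH", "properties": {"NAME": "Switzerland"}},
]
eurostat_names = [
    "Germany (until 1990 former territory of the FRG)",
    "France",
    "Switzerland",
]

# Same name -> ID dictionary as in the notebook, written as a comprehension.
name_to_id = {g["properties"]["NAME"]: g["id"] for g in geometries}

# Names present in the data but absent from the map: these need renaming.
unmatched = sorted(set(eurostat_names) - set(name_to_id))
print(unmatched)

# After renaming, every name resolves to an ID without raising a KeyError.
renames = {"Germany (until 1990 former territory of the FRG)": "Germany"}
cleaned = [renames.get(name, name) for name in eurostat_names]
ids = [name_to_id[name] for name in cleaned]
print(ids)
```

Running the set difference first makes the required renames explicit instead of discovering them one `KeyError` at a time.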

To get a better idea of the distribution of the rates, we first display the values in a bar chart.

In [None]:
bars("Unemployment in Europe", df_euro, 'Country ID', 'Unemployment (%)')

We now create our map. To define the `threshold_scale`, we use the mean and the standard deviation of our values: the class boundaries are placed at the minimum, at one standard deviation below the mean, at the mean, at one and two standard deviations above the mean, and at the maximum.

In [None]:
# Construct the threshold 
mean = df_euro['Unemployment (%)'].mean()
std = df_euro['Unemployment (%)'].std()
min_rate = df_euro['Unemployment (%)'].min()
max_rate = df_euro['Unemployment (%)'].max()

def dist_std(x) :
    return mean + x*std

threshold = [min_rate,dist_std(-1),dist_std(0),dist_std(1),dist_std(2),max_rate]

# Display the visualization map
m_euro = folium.Map([54,15], zoom_start=4)
m_euro.choropleth(
    geo_data = euro_geo,
    data=df_euro,
    columns=['Country ID', 'Unemployment (%)'],
    key_on='feature.id',
    topojson='objects.europe',
    fill_color = 'OrRd', fill_opacity=0.5, line_opacity=0.2,
    legend_name="Unemployment in Europe (in % of active population)",
    threshold_scale = threshold
    )
m_euro
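
The threshold construction above assumes that every mean ± k·std boundary falls between the minimum and the maximum. With a skewed distribution this may not hold (for example, mean − std can fall below the minimum), and a color scale generally expects its breakpoints in increasing order. The sketch below, on made-up rates, keeps only the boundaries that lie strictly inside the data range; `build_thresholds` is a hypothetical helper, not part of folium.

```python
import statistics

# Made-up unemployment rates (in %), with one extreme value.
rates = [3.0, 4.5, 5.0, 6.0, 7.5, 9.0, 11.0, 23.5]

def build_thresholds(values, offsets=(-1, 0, 1, 2)):
    """Class boundaries at min, mean + k*std (k in offsets), and max."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation, like pandas' .std()
    lo, hi = min(values), max(values)
    inner = [mean + k * std for k in offsets]
    # Drop boundaries outside the data range so the list stays increasing.
    inner = [t for t in inner if lo < t < hi]
    return [lo] + inner + [hi]

thresholds = build_thresholds(rates)
print(thresholds)
```

With these values, the boundary at one standard deviation below the mean falls under the minimum and is dropped, so the scale has four classes instead of five.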

The countries with no color are those for which Eurostat provides no unemployment information.

Looking at the map, three countries stand out with the highest unemployment rates: Spain, Macedonia and Greece. The other countries with a high unemployment rate are mostly Mediterranean: France, Italy and Turkey.

Switzerland belongs to the group of countries with a fairly low unemployment rate.

## <font color='purple'>Question 2</font> 
<b>Go to the amstat website to find a dataset that includes the unemployment rates in Switzerland at a recent date.

HINT Go to the details tab to find the raw data you need. If you do not speak French, German or Italian, think of using free translation services to navigate your way through.
Use this data to build another Choropleth map, this time showing the unemployment rate at the level of swiss cantons. Again, try to make the map as expressive as possible, and comment on the trends you observe.

The Swiss Confederation defines the rates you have just plotted as the number of people looking for a job divided by the size of the active population (scaled by 100). This is surely a valid choice, but as we discussed one could argue for a different categorization.

Copy the map you have just created, but this time don't count in your statistics people who already have a job and are looking for a new one. How do your observations change ? You can repeat this with different choices of categories to see how selecting different metrics can lead to different interpretations of the same data.</b>

The unemployment rates of the Swiss cantons can be downloaded from the amstat website (https://www.amstat.ch/v2/index.jsp), where several options can be specified (period, geographic level, economic or social attributes, ...).  
We chose to download the latest available data, the **unemployment rates for September 2017 at canton level**.


We chose to download five variables that characterize unemployment in Switzerland: 
- *Unemployment rate (Taux de chômage)*: Registered unemployed / Active population [%]
- *Youth unemployment rate (Taux de chômage des jeunes)*: Registered unemployed aged 15 to 24 / Active population (15-24 y/o) [%]
- *Registered unemployed (Chômeurs inscrits)*: Registered unemployed people who are looking for a job
- *Registered job seekers that are not unemployed*: Registered job seekers who already have a job and are looking for a new one
- *Job seekers (Demandeurs d'emplois)*: Registered unemployed + Registered job seekers that are not unemployed

The goal of this exercise is to show how the choice of metric used to define unemployment can lead to very different interpretations.  
We will first plot a choropleth map of the unemployment rate as defined by the Swiss Confederation, at canton level. Then we will plot a second choropleth map that covers all job seekers (both the unemployed and those who already have a job and are looking for a new one).  
Finally, we will zoom in on youth unemployment, which can serve as an indicator of a country's economic situation and is often treated as a priority by the state.

In [None]:
# Load the amstat data
amstat_unemp_rate=pd.read_csv('amstat_taux_chomage.csv',sep=',',skipfooter=1,thousands="'",encoding='utf-16',engine='python')
# Remove the first column
amstat_unemp_rate.drop(amstat_unemp_rate.columns[0],axis=1,inplace=True)

In [None]:
amstat_unemp_rate.dtypes

In order to plot the second choropleth map (a rate that also includes the job seekers who already have a job but are looking for a new one), we need to compute this new rate and add it to our dataframe.  
The rate is defined as follows:

*Unemployment rate (all job seekers)* : Job seekers / Active population

The active population per canton is not provided on the amstat website, so we recover it from the unemployment rate (Registered unemployed / Active population) and use it to compute this metric.

In [None]:
amstat_unemp_rate["Taux de demandeurs d'emplois"]=amstat_unemp_rate["Demandeurs d'emploi"]*amstat_unemp_rate['Taux de chômage']/amstat_unemp_rate['Chômeurs inscrits']
amstat_unemp_rate.sort_values("Taux de demandeurs d'emplois")
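
The one-line computation above follows from the definition of the published rate: since rate = unemployed / active × 100, the active population is unemployed × 100 / rate, and therefore job seekers / active × 100 = job seekers × rate / unemployed. The sketch below checks this on made-up figures; `job_seeker_rate` is a hypothetical helper written for illustration.

```python
def job_seeker_rate(job_seekers, unemployed, unemployment_rate):
    """Job seekers as a percentage of the active population."""
    # Active population recovered from the published unemployment rate (in %).
    active_population = unemployed * 100 / unemployment_rate
    return job_seekers / active_population * 100

# Made-up figures for a single canton.
active = 200_000
unemployed = 8_000
job_seekers = 12_000

published_rate = unemployed / active * 100  # 4 % of the active population
derived = job_seeker_rate(job_seekers, unemployed, published_rate)
shortcut = job_seekers * published_rate / unemployed  # formula used in the notebook

print(round(derived, 2), round(shortcut, 2))
```

Note that if the published rate is rounded (for example to one decimal), the derived rate inherits that rounding error, which matters most for cantons with low rates.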

The boundaries of the Swiss cantons are provided in a TopoJSON file, which will be used as an overlay to create our choropleth map.   


We need to associate the parameters we want to visualize (rates at canton level) with the geographic entities present in the TopoJSON file (canton IDs). If we inspect the TopoJSON file, we see that the object IDs are the canton codes (ZH, BE, ...), and this information is not present in our dataframe. Thus, the first step is to extract this information from the JSON file and add it as a new column to our data.

In [None]:
# Load the TopoJSON file
with open('./topojson/ch-cantons.topojson.json') as f:  # close the file once it has been read
    swiss_cantons=json.load(f)
# Create a list with the corresponding code (id) for each canton
cantons_id=[(i['properties']['name'],i['id']) for i in swiss_cantons['objects']['cantons']['geometries']]

In [None]:
cantons_id=pd.DataFrame(cantons_id,columns=['Canton','Code'])
cantons_id.head()

We notice that the canton names in our dataframe differ from those in the TopoJSON file, where each name is written in the official language(s) of the canton. 
Fortunately, the cantons appear in the same order in both our dataframe and the TopoJSON file, so we can match the canton codes by index.

In [None]:
# Add a new column with the canton codes (matched by row position)
amstat_unemp_rate = amstat_unemp_rate.merge(cantons_id[['Code']],left_index=True,right_index=True)
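
Matching by position only works if both tables have the same length and the same ordering, so it is worth checking this explicitly. The sketch below uses hypothetical three-row tables (the names, rates and order are made up) and prints both name columns side by side for a visual check.

```python
import pandas as pd

# Hypothetical miniature tables; names, values and order are illustrative only.
rates = pd.DataFrame({
    "Canton": ["Zurich", "Berne", "Lucerne"],
    "Taux de chômage": [3.3, 2.4, 1.7],
})
codes = pd.DataFrame({
    "Canton": ["Zürich", "Bern / Berne", "Luzern"],
    "Code": ["ZH", "BE", "LU"],
})

# Positional matching is only meaningful if both tables have the same length.
assert len(rates) == len(codes), "row counts differ, cannot match by position"

# Assign by position (not by index label) so a filtered index cannot misalign rows.
rates = rates.reset_index(drop=True)
rates["Code"] = codes["Code"].to_numpy()

# Side-by-side view to check by eye that each pair is the same canton.
check = pd.DataFrame({
    "data name": rates["Canton"],
    "topojson name": codes["Canton"].to_numpy(),
    "code": rates["Code"],
})
print(check)
```

An explicit name-to-code dictionary would be more robust than relying on the ordering, at the cost of writing out the 26 canton names.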

The first choropleth map shows the unemployment rate in the Swiss cantons for September 2017, according to the definition of the Swiss Confederation.  

Another parameter that strongly influences the interpretation of our map is the way we classify the data. The classification method (equal interval, quantile, natural breaks, ...) defines the color intervals used in the visualization, so it is important to look at the distribution of the values we plot in order to choose the best method.

In [None]:
sns.boxplot(data=amstat_unemp_rate[['Taux de chômage','Taux de chômage des jeunes',"Taux de demandeurs d'emplois"]],orient='h',palette="Set2",whis=1.5)
plt.show()

The distributions of the three parameters are more or less symmetric, and we notice one outlier for the unemployment rate (Taux de chômage).

The goal of this exercise is to see how the choice of metric used to quantify unemployment can lead to different interpretations. It is therefore better to use the same classification method for the three parameters we plot, in order to avoid an additional "bias" in the interpretation of the visualization.  

That is why we do not choose the natural breaks classification, which finds an "optimal" classification for one particular dataset. 
An equal interval classification seems an interesting choice since the distributions are not very skewed (no class will be overrepresented). A quantile classification puts approximately the same number of cantons in every class, which does not seem a good choice here since outliers could go unnoticed.
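
The difference between the two classification methods can be illustrated on a small made-up sample with one extreme value. This is a sketch only: the values and the number of classes are assumptions chosen for the example, not the amstat data.

```python
import numpy as np

# Made-up rates (in %) with one extreme value at the top.
values = np.array([1.0, 1.2, 1.5, 2.0, 2.2, 2.5, 3.0, 3.1, 3.6, 5.2])
n_classes = 5

# Equal interval: classes of identical width between the minimum and the maximum.
equal_breaks = np.linspace(values.min(), values.max(), n_classes + 1)

# Quantile: breaks chosen so each class holds roughly the same number of values.
quantile_breaks = np.quantile(values, np.linspace(0, 1, n_classes + 1))

# Number of observations falling in each class.
equal_counts, _ = np.histogram(values, bins=equal_breaks)
quantile_counts, _ = np.histogram(values, bins=quantile_breaks)

print("equal interval:", equal_breaks.round(2), equal_counts)
print("quantile:      ", quantile_breaks.round(2), quantile_counts)
```

With equal intervals the extreme value sits alone in the top class, whereas the quantile breaks put it in the same class as a much lower value, which is how an outlier can go unnoticed.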

#### Choropleth map of the unemployment rate as defined by the Swiss Confederation

In [None]:
# Create a folium map centered on the geographical center of Switzerland
amstat_unemp1_map=folium.Map(location=[46.900000, 8.226667],tiles='cartodbpositron',zoom_start=8)

# Add an overlay to the map (TopoJSON associated with the variable of interest)
amstat_unemp1_map.choropleth(
geo_data=swiss_cantons,
data=amstat_unemp_rate,
columns=['Code', 'Taux de chômage'],
key_on='feature.id',
topojson='objects.cantons',
fill_color='YlOrRd', fill_opacity=0.7, line_opacity=0.8, line_color='white',
legend_name="Unemployment rate [%] in September 2017",
)

# Save the map in an HTML file
amstat_unemp1_map.save('Swiss_unemployment_rate_1.html')
amstat_unemp1_map

Although Geneva (5.2%) is the only outlier, the canton of Neuchâtel also has an extreme value (5.1%), so classifying the data into 6 bins makes sense: it highlights the cantons with extreme unemployment rates without overrepresenting any class.

#### Choropleth map of the unemployment rate considering all job seekers

In [None]:
amstat_unemp2_map=folium.Map(location=[46.900000, 8.226667],tiles='cartodbpositron',zoom_start=8)
amstat_unemp2_map.choropleth(
geo_data=swiss_cantons,
data=amstat_unemp_rate,
columns=['Code', "Taux de demandeurs d'emplois"],
key_on='feature.id',
topojson='objects.cantons',
fill_color='YlOrRd', fill_opacity=0.7, line_opacity=0.8,line_color='white',
legend_name="Rate of all job seekers, including those who already have a job [%] in September 2017"
)
# Save the map in an HTML file
amstat_unemp2_map.save('Swiss_unemployment_rate_2.html')
amstat_unemp2_map

COMMENT

#### Choropleth map of the youth unemployment rate

In [None]:
amstat_unemp_youth_map=folium.Map(location=[46.900000, 8.226667],tiles='cartodbpositron',zoom_start=8)
amstat_unemp_youth_map.choropleth(
geo_data=swiss_cantons,
data=amstat_unemp_rate,
columns=['Code', "Taux de chômage des jeunes"],
key_on='feature.id',
topojson='objects.cantons',
fill_color='YlOrRd', fill_opacity=0.7, line_opacity=0.8,line_color='white',
legend_name="Youth (15-24 y/o) unemployment rate [%] in September 2017"
)
# Save the map in an HTML file
amstat_unemp_youth_map.save('Swiss_unemployment_rate_youth.html')  # separate file, so the second map is not overwritten
amstat_unemp_youth_map

## <font color='purple'>Question 3</font> 
<b>Use the amstat website again to find a dataset that includes the unemployment rates in Switzerland at recent date, this time making a distinction between Swiss and foreign workers.

The Economic Secretary (SECO) releases a monthly report on the state of the employment market. In the latest report (September 2017), it is noted that there is a discrepancy between the unemployment rates for foreign (5.1%) and Swiss (2.2%) workers.

Show the difference in unemployment rates between the two categories in each canton on a Choropleth map (hint The easy way is to show two separate maps, but can you think of something better ?). Where are the differences most visible ? Why do you think that is ?

Now let's refine the analysis by adding the differences between age groups. As you may have guessed it is nearly impossible to plot so many variables on a map. Make a bar plot, which is a better suited visualization tool for this type of multivariate data.</b>

In [None]:
foreign_vs_swiss = pd.read_csv('amstat_foreign_vs_swiss.csv',sep=',',skipfooter=1,thousands="'",encoding='utf-16',engine='python')
foreign_vs_swiss

In [None]:
# Split by nationality, reset the index, then attach the canton codes by row position
foreign_rate = foreign_vs_swiss[foreign_vs_swiss["Nationalité"] == "Etrangers"].reset_index(drop=True)
foreign_rate = foreign_rate.merge(cantons_id[['Code']],left_index=True,right_index=True)
swiss_rate = foreign_vs_swiss[foreign_vs_swiss["Nationalité"] == "Suisses"].reset_index(drop=True)
swiss_rate = swiss_rate.merge(cantons_id[['Code']],left_index=True,right_index=True)

In [None]:
foreign_map=folium.Map(location=[46.801111, 8.226667],tiles='cartodbpositron',zoom_start=8)
foreign_map.choropleth(
geo_data=swiss_cantons,
data=foreign_rate,
columns=['Code', 'Taux de chômage'],
key_on='feature.id',
topojson='objects.cantons',
fill_color='BuPu',
legend_name="Unemployment rates of people of foreign nationality (Nb of registered unemployed/Nb of active persons in %)"
)
foreign_map

In [None]:
swiss_map=folium.Map(location=[46.801111, 8.226667],tiles='cartodbpositron',zoom_start=8)
swiss_map.choropleth(
geo_data=swiss_cantons,
data=swiss_rate,
columns=['Code', 'Taux de chômage'],
key_on='feature.id',
topojson='objects.cantons',
fill_color='BuPu',
legend_name="Unemployment rates of people of Swiss nationality (Nb of registered unemployed/Nb of active persons in %)"
)
swiss_map
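
One possible alternative to two separate maps is to plot a single derived variable per canton, such as the difference or the ratio between the two rates. The sketch below shows how such a column could be computed. It assumes two tables shaped like `foreign_rate` and `swiss_rate` (a `Code` column and a `Taux de chômage` column), and all values are made up.

```python
import pandas as pd

# Made-up rates (in %) for three cantons, shaped like foreign_rate and swiss_rate.
foreign = pd.DataFrame({"Code": ["ZH", "GE", "UR"],
                        "Taux de chômage": [5.5, 7.0, 2.5]})
swiss = pd.DataFrame({"Code": ["ZH", "GE", "UR"],
                      "Taux de chômage": [2.5, 4.0, 0.5]})

# Join on the canton code so each row carries both rates.
gap = foreign.merge(swiss, on="Code", suffixes=("_foreign", "_swiss"))

# Absolute gap in percentage points, and relative gap as a ratio.
gap["Difference"] = gap["Taux de chômage_foreign"] - gap["Taux de chômage_swiss"]
gap["Ratio"] = gap["Taux de chômage_foreign"] / gap["Taux de chômage_swiss"]

print(gap[["Code", "Difference", "Ratio"]])
```

The two measures can rank cantons differently: in this made-up sample the last canton has the smallest difference but the largest ratio, so the choice between them is itself an interpretation choice. Either column could then be passed to the choropleth call in place of `Taux de chômage`.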

## <font color='purple'>Question 4</font> 
<b>BONUS: using the map you have just built, and the geographical information contained in it, could you give a rough estimate of the difference in unemployment rates between the areas divided by the Röstigraben?</b>