# Building a choropleth map

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import time
import glob
import matplotlib.pyplot as plt
import folium
import geocoder

%load_ext autoreload
%autoreload 2

## Loading & wrangling the data 

In [2]:
data_path = 'data/P3_GrantExport.xlsx'
grant_data = pd.read_excel(data_path)

We start by selecting the features we are interested in. We obviously need the amount granted to each project. We also keep the University (from which we will retrieve the canton) and the Institution: for some projects one of the two is missing, but we are fairly confident we can retrieve the canton from whichever is present. Finally, we keep the funding instrument under which a project was funded, without yet knowing whether it will be useful.



In [3]:
grant_data = grant_data[["Approved Amount", "University", "Institution","Funding Instrument Hierarchy"]]
grant_data.count()

Approved Amount                 63967
University                      50988
Institution                     58831
Funding Instrument Hierarchy    62915
dtype: int64

So far we have not removed any entries, but a closer look at the data confirmed that we should. We start by removing entries whose Approved Amount is not a number, since we cannot use them in this study.

In [4]:
grant_data = grant_data[grant_data['Approved Amount'].apply(lambda x: str(x).isdigit())]
grant_data.count()

Approved Amount                 52663
University                      50487
Institution                     48205
Funding Instrument Hierarchy    51620
dtype: int64
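
One caveat of the `str.isdigit` test is that it accepts only values whose string form is made of digits alone, so an amount stored as a float or written with a decimal point is rejected too. As a minimal sketch on made-up values (the placeholder string and the amounts are hypothetical, not taken from the export), `pd.to_numeric` with `errors="coerce"` is one possible alternative:

```python
import pandas as pd

# Toy stand-in for the "Approved Amount" column (all values are made up).
toy = pd.DataFrame({"Approved Amount": [11619, "41022.50", "not available", 79732.0]})

# str.isdigit() is True only for strings consisting of digits alone.
digits_only = toy["Approved Amount"].apply(lambda x: str(x).isdigit())

# to_numeric converts what it can and turns the rest into NaN instead of raising.
as_number = pd.to_numeric(toy["Approved Amount"], errors="coerce")

print(digits_only.tolist())
print(as_number.notna().tolist())
```

Whether the stricter or the looser filter is appropriate depends on how amounts are actually stored in the spreadsheet, which is worth checking before switching.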

As we can see, this already removed about 11K entries, but we can do better. We will use Google's API to link a university or institution name to a canton, and so far we are confident that either of the two is enough to find it. This means, however, that we cannot handle entries that are missing both values.

In [5]:
grant_data = grant_data.dropna(subset=["Institution", "University"], how="all")
grant_data.count()

Approved Amount                 51843
University                      50487
Institution                     48205
Funding Instrument Hierarchy    50800
dtype: int64
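
To make the effect of `how="all"` concrete, here is a small sketch on a toy frame (the names in it are invented): a row is dropped only when both columns listed in `subset` are missing, whereas `how="any"` would drop every row that lacks either one.

```python
import numpy as np
import pandas as pd

# Toy frame: row 0 lacks Institution, row 1 lacks University, row 2 lacks both.
toy = pd.DataFrame({
    "University": ["Some University - BE", np.nan, np.nan],
    "Institution": [np.nan, "Some Institute", np.nan],
})

# how="all" drops a row only if every column in subset is NaN.
kept = toy.dropna(subset=["Institution", "University"], how="all")

# how="any" would drop a row as soon as one of them is NaN.
strict = toy.dropna(subset=["Institution", "University"], how="any")

print(kept.index.tolist(), len(strict))
```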

This removed another 820 entries. We also noticed a special value in the University field, Nicht zuteilbar - NA (German for "not assignable"), for which we won't be able to retrieve the canton.

In [6]:
grant_data = grant_data[grant_data["University"] != "Nicht zuteilbar - NA"]
grant_data.count()

Approved Amount                 49252
University                      47896
Institution                     48025
Funding Instrument Hierarchy    48231
dtype: int64

We observe that for many entries the University field ends with an acronym. A logical first step before turning to an API is to extract this acronym and, when possible, associate it with a canton (sometimes it is already the canton's abbreviation). We will then only have to use the API on the remaining, undetermined entries.

In [7]:
grant_data["short"] = grant_data["University"].apply(lambda x : str(x).split("- ")[-1])
grant_data["short"].value_counts()


ZH                           6704
GE                           6346
ETHZ                         6093
BE                           5422
BS                           4685
EPFL                         4370
LA                           4062
FR                           2041
NE                           1580
NPO                          1463
nan                          1356
PSI                           535
FP                            490
SG                            424
USI                           338
EAWAG                         330
HES-SO                        270
ZFH                           256
EMPA                          236
FHNW                          221
WSL                           220
LU                            211
IHEID                         194
BFH                           136
AGS                           135
SUPSI                         134
FMI                            83
ASPIT                          81
IDIAP                          81
HSLU          
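
The extraction in cell In [7] can be checked on a few strings. The sketch below uses illustrative names rather than rows from the export. It also shows two side effects: a missing University becomes the literal string `nan` (which is where the `nan` row in the counts above comes from), and an acronym containing a hyphen without a following space, such as HES-SO, survives intact.

```python
def short_name(university):
    # Same expression as in cell In [7]: keep whatever follows the last "- ".
    return str(university).split("- ")[-1]

# Illustrative inputs; float("nan") stands in for a missing University value.
examples = [
    "Université de Genève - GE",
    "Some School - HES-SO",
    float("nan"),
]
print([short_name(u) for u in examples])
```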

With this trick we can already sort a large portion of the dataframe. We now create a "Canton" feature, filled with the value of "short" whenever it can be associated with a canton (EPFL and ETHZ are assigned to VD and ZH because, come on) and left blank otherwise. To get the list of canton acronyms, we use bash:


grep -rnw ch-cantons.topojson.json -e '"id":' | awk '{print $3}' > cantons.csv

We can then compare the list of acronyms Folium will know with the list of acronyms we have.
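
As a hedged alternative to the grep/awk pipeline, the ids could also be read with Python's `json` module. The sketch assumes the standard TopoJSON layout, where each entry under `objects` carries a `geometries` list whose items may have an `id`. The object name `cantons` and the tiny inline sample are assumptions for illustration; the layout of the real `ch-cantons.topojson.json` has not been checked here.

```python
import json

def canton_ids(topology):
    """Collect every geometry "id" found under topology["objects"]."""
    ids = []
    for obj in topology.get("objects", {}).values():
        for geometry in obj.get("geometries", []):
            if "id" in geometry:
                ids.append(str(geometry["id"]))
    return ids

# Hand-written stand-in for the real file, which would be loaded with json.load().
sample = json.loads("""
{"type": "Topology",
 "objects": {"cantons": {"type": "GeometryCollection",
   "geometries": [{"type": "Polygon", "id": "ZH", "arcs": []},
                  {"type": "Polygon", "id": "VD", "arcs": []}]}},
 "arcs": []}
""")
print(canton_ids(sample))
```

Parsing the JSON avoids depending on how the file happens to be split into lines, which the grep approach relies on.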



In [26]:
# We already know the location of some frequent values, and it is better to call the API in as few cases as possible
grant_data.replace(["EPFL", "ETHZ", "LA"], ["VD", "ZH", "VD"], inplace=True)

In [18]:
acronyms = pd.read_csv("cantons.csv", header=None).dropna(axis=1)
acronyms = acronyms[0].apply(str)

In [25]:
grant_data["Canton"] = grant_data["short"][grant_data["short"].isin(acronyms.values)]
grant_data["Canton"].value_counts()
#grant_data["Canton"].describe()

ZH    12797
VD     8432
GE     6346
BE     5422
BS     4685
FR     2041
NE     1580
SG      424
LU      211
Name: Canton, dtype: int64

That's roughly 42K fewer Google searches! :D
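
The same matching can be sketched on toy data with `Series.where`, which keeps a value when the condition holds and leaves NaN elsewhere. The list `known` is a stand-in for the acronyms read from `cantons.csv`. The sketch also lists the `short` values that still have no canton, which are the candidates for the geocoding step.

```python
import pandas as pd

toy = pd.DataFrame({"short": ["ZH", "GE", "PSI", "nan", "VD"]})
known = ["ZH", "GE", "VD"]  # stand-in for the acronyms read from cantons.csv

# Keep "short" where it is a known canton acronym, NaN otherwise.
toy["Canton"] = toy["short"].where(toy["short"].isin(known))

# Distinct "short" values that still need a lookup.
todo = toy.loc[toy["Canton"].isna(), "short"].unique().tolist()
print(todo)
```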

In [20]:
grant_data.to_csv("data.csv")

We should finally be able to retrieve the canton feature in this dataframe.

## Linking the University name to a Swiss canton

In [12]:
geo_str = 'ch-cantons.topojson.json'

ch_map = folium.Map(location=[46.6430788,8.018626], tiles='Mapbox Bright', zoom_start=7)
ch_map.choropleth(geo_path=geo_str)
ch_map.save('ch_map.html')

In [13]:
%%HTML
<iframe width='100%' height="350" src="ch_map.html"></iframe>

In [14]:
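from geopy.geocoders import Nominatim

# Recent geopy versions require an explicit user_agent; the name below is a placeholder.
geolocator = Nominatim(user_agent="grant-choropleth-notebook")

# Geocode a sample of entries (rows 1 to 99) to test the approach.
location_uni = []
for uni in grant_data['University'][1:100]:
    location_uni.append(geolocator.geocode(uni))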
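
Since many rows share the same University string, one way to keep the number of API calls low is to geocode each distinct name only once. The sketch below is not the notebook's method: `geocode_unique` is a hypothetical helper, and `fake_geocode` is a stub standing in for `geolocator.geocode` so that the example runs offline.

```python
def geocode_unique(names, geocode):
    """Call geocode once per distinct name and return {name: result}."""
    cache = {}
    for name in names:
        if name not in cache:
            cache[name] = geocode(name)
    return cache

# Stub that records its calls instead of contacting a geocoding service.
calls = []
def fake_geocode(name):
    calls.append(name)
    return (0.0, 0.0)

names = ["Uni A", "Uni B", "Uni A", "Uni A"]
results = geocode_unique(names, fake_geocode)
print(len(calls), len(results))
```

With a real geocoder, the returned dictionary could then be mapped back onto the column, for instance with `Series.map`.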
    

In [15]:
import reverse_geocoder as rg

coord = []
for loc in location_uni:
    if loc is not None:
        coord.append((loc.latitude, loc.longitude))

pos = rg.search(coord)
for elem in pos:
    print(elem['admin1'], elem['cc'])

Loading formatted geocoded file...
Basel-City CH
Fribourg CH
Fribourg CH
Zurich CH
Vaud CH
Geneva CH
Fribourg CH
Geneva CH
Basel-City CH
Zurich CH
Fribourg CH
Geneva CH
Zurich CH
Vaud CH
Geneva CH
Bern CH
Zurich CH
Geneva CH
Zurich CH
Basel-City CH
Geneva CH
Neuchatel CH
Geneva CH
Geneva CH
Fribourg CH
Bern CH
Zurich CH
Basel-City CH
Basel-City CH
Zurich CH
Basel-City CH
Geneva CH
Vaud CH
Neuchatel CH
Geneva CH
Vaud CH
Bern CH
Geneva CH
Fribourg CH
Neuchatel CH
Bern CH
Zurich CH
Fribourg CH
Geneva CH
Geneva CH
Fribourg CH
Zurich CH
Zurich CH
Zurich CH
Zurich CH
Bern CH
Bern CH
Saint Gallen CH
Zurich CH
Zurich CH
Saint Gallen CH
Zurich CH
Zurich CH
Zurich CH
Zurich CH
Geneva CH
Geneva CH
Vaud CH
Basel-City CH
Zurich CH
Bern CH
Basel-City CH
Bern CH
Neuchatel CH
Bern CH
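
The reverse geocoder returns English region names, while the "Canton" column uses two-letter codes. A possible next step, sketched here under the assumption that the result dictionaries keep the `admin1` and `cc` keys used above, is a small lookup table. It covers only the names that appear in the output above, and `to_canton` is a hypothetical helper.

```python
# English admin1 names as printed above, mapped to two-letter canton codes.
ADMIN1_TO_CANTON = {
    "Basel-City": "BS",
    "Fribourg": "FR",
    "Zurich": "ZH",
    "Vaud": "VD",
    "Geneva": "GE",
    "Bern": "BE",
    "Neuchatel": "NE",
    "Saint Gallen": "SG",
}

def to_canton(result):
    """Return the canton code for a reverse-geocoding result dict, else None."""
    if result.get("cc") != "CH":
        return None  # not in Switzerland
    return ADMIN1_TO_CANTON.get(result.get("admin1"))

print(to_canton({"admin1": "Vaud", "cc": "CH"}))
```

Names missing from the table return `None`, so unmapped regions stay visible instead of being silently assigned to a canton.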
