# ADA / Applied Data Analysis
<h2 style="color:#a8a8a8">Homework 3 - Visualization<br>
Aimée Montero, Alfonso Peterssen, Cyriaque Brousse</h2>

<ol class="toc-item"><li><a href="#Part-1---Data-import">Part 1 - Data import</a><ol class="toc-item"><li><a href="#1a---Indexing">1a - Indexing</a></li><li><a href="#1b---Cleaning-the-data">1b - Cleaning the data</a></li></ol></li><li><a href="#Part-2---Mapping-universities-to-cantons">Part 2 - Mapping universities to cantons</a><ol class="toc-item"><li><a href="#2a---Defining-abstractions">2a - Defining abstractions</a></li><li><a href="#2b---Using-caching">2b - Using caching</a></li><li><a href="#2c---Retrieving-the-canton-for-each-entry">2c - Retrieving the canton for each entry</a></li><li><a href="#2d---Putting-everything-back-together">2d - Putting everything back together</a></li></ol></li><li><a href="#Part-3---Visualization">Part 3 - Visualization</a><ol class="toc-item"><li><a href="#3a---Why-are-you-not-Swiss??">3a - Why are you not Swiss??</a></li><li><a href="#3b---Summing-grants-per-canton">3b - Summing grants per canton</a></li><li><a href="#3c---Final-map">3c - Final map</a></li></ol></li></ol>

## Part 1 - Data import

Let's import the required libraries:

In [55]:
import folium
import pandas as pd
import numpy as np
import requests
import json

And define the constants for the Google APIs:

In [56]:
# change this flag to force overwrite of the saved values by fresh data from Google APIs (takes a long time)
force_download = False

In [57]:
places_api_url = 'https://maps.googleapis.com/maps/api/place/textsearch/json'
geocode_api_url = 'https://maps.googleapis.com/maps/api/geocode/json'

Read the data from the provided CSV file:

In [58]:
data = pd.read_csv('data/grantexport.csv', sep=';')

In [59]:
cols = {'Project Number' : 'pnr',
       'Project Title' : 'title',
       'University' : 'univ',
       'Institution' : 'inst',
       'Approved Amount' : 'amount'}
data = data.rename(columns=cols)
data = data[list(cols.values())]

### 1a - Indexing

According to the documentation, the values in the field `pnr` are unique. We check this and use it as an index for our data frame.

In [60]:
data.pnr.is_unique

True

In [61]:
data = data.set_index('pnr')

We convert the `amount` field to numbers; values that are not numeric become `NaN` and are discarded later. The documentation states that *"This amount is not indicated in the case of mobility fellowships since it depends on administrative factors, typically the destination, cost of living, family allowances (if applicable) and exchange rate differences."*

In [62]:
data.amount = pd.to_numeric(data.amount, errors='coerce')
data.sample(5)

| pnr | amount | univ | inst | title |
|---|---|---|---|---|
| 160083 | 151554.0 | EPF Lausanne - EPFL | Laboratoire d'hydraulique environnementale EPF... | The Stochastic Torrent: stochastic model for b... |
| 102331 | NaN | NaN | The Netherlands Cancer Institute Division of G... | Pharmacogenotyping and -phhenotyping in cancer... |
| 59161 | 713366.0 | Universität Zürich - ZH | Physik-Institut Universität Zürich | Elementary particle physics at the electron-Pr... |
| 136563 | 8000.0 | Universität Basel - BS | Musikwissenschaftliches Seminar Universität Basel | Urbanität, Identitätskonstruktion und Humanism... |
| 127021 | 305940.0 | EPF Lausanne - EPFL | Laboratoire de design et media EPFL IC ISIM LDM | Spatio-Temporal Memory Streaming |
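
To illustrate what `errors='coerce'` does, here is a minimal sketch on made-up values (the strings below are hypothetical and not taken from the dataset): anything that cannot be parsed as a number becomes `NaN` instead of raising an error.

```python
import pandas as pd

# Hypothetical raw values: two parseable amounts and one textual placeholder
raw = pd.Series(['151554.00', 'not indicated', '8000'])

# errors='coerce' turns unparseable entries into NaN instead of raising
amounts = pd.to_numeric(raw, errors='coerce')

# Number of entries that could not be parsed
print(amounts.isnull().sum())
```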


### 1b - Cleaning the data

The number of rows for which the `amount` field is not a number is:

In [63]:
data.amount.isnull().sum()

10910

In [152]:
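# 'Nicht zuteilbar' means 'not assignable': treat it as a missing value
data = data.replace('Nicht zuteilbar - NA', np.nan)
# shorten the labels of the non-university categories
data = data.replace('NPO (Biblioth., Museen, Verwalt.) - NPO', 'NPO')
data = data.replace('Firmen/Privatwirtschaft - FP', 'FP')
data = data.replace('Weitere Institute - FINST', 'FINST')
# drop every row that still has a missing value in any column
data = data.dropna()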
data.sample(5)

| pnr | amount | univ | inst | title |
|---|---|---|---|---|
| 154628 | 7400.0 | Université de Neuchâtel - NE | Linguistique française Sciences du langage et ... | Les marqueurs évidentiels dans les langues rom... |
| 29530 | 21900.0 | Universität Basel - BS | UNI: Katedra socioekonomickej geografie Prir o... | Stadterneuerung des Stadtteils Petrzalka in Br... |
| 134571 | 243084.0 | Eidg. Material und Prüfungsanstalt - EMPA | Eidg. Materialprüfungs- und Forschungsanstalt ... | Carbon-Nanotube/Metal Interfaces: Tailoring an... |
| 113542 | 214314.0 | EPF Lausanne - EPFL | Laboratoire en semiconducteurs avancés pour la... | III-V Nitride Microcavities and Nanostructures |
| 6917 | 61557.0 | Universität Bern - BE | Laboratorium für Populationsgenetik Institut f... | Enzymelektrophoretische Untersuchungen über di... |


## Part 2 - Mapping universities to cantons

<p>We want to display the total amount of money granted to each canton.<br>
In technical terms, this means mapping each `univ` value to its corresponding `canton` value.</p>

<p>To do that, we will use the Google Places API. We take care not to divulge the API key, which will be placed in a separate file, ignored by Git.<br>
The API call will return a location for each university. We then input this location in the Reverse Geocoding API, which will return a canton.</p>

### 2a - Defining abstractions

First, we need to import the Google API key from the key file:

In [65]:
api_key = !head api_key
api_key = api_key[0]

We define the following helper method for fetching JSON from Google APIs:

In [66]:
def fetch_json(url, params):
    ''' Fetches the json object resulting from the query of the url with
        the given params.
        Checks that the status is OK, otherwise returns None.
        If the status is OK, then it will return the first result,
        if it exists, None otherwise.
    '''
    response = requests.get(url, params)
    obj = json.loads(response.text)
    
    if obj['status'] != 'OK':
        #print('[E] status was', obj['status'])
        return None
    
    if len(obj['results']) < 1:
        return None
    
    return obj['results'][0]
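
The sketch below isolates the response handling of `fetch_json` so that it can be tried without a network call. The `first_result` helper and the sample payloads are hypothetical; they only assume the `status` / `results` layout that `fetch_json` itself relies on.

```python
def first_result(obj):
    """Return the first entry of obj['results'] if the status is OK, else None."""
    if obj.get('status') != 'OK':
        return None
    results = obj.get('results', [])
    return results[0] if results else None

# Hypothetical payloads mimicking the layout used above
ok_payload = {'status': 'OK', 'results': [{'name': 'first'}, {'name': 'second'}]}
empty_payload = {'status': 'ZERO_RESULTS', 'results': []}

print(first_result(ok_payload))
print(first_result(empty_payload))
```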

We define the following functions to get the canton from the university name, or from the institution name when the `univ` value denotes a non-profit organization (`NPO`), a private company (`FP`) or another institute (`FINST`):

In [153]:
def get_query(univ, inst):
    ''' Returns the university if it is not a non-profit or private organization, the institution otherwise.
    '''
    return univ if univ not in ('NPO', 'FP', 'FINST') else inst

In [154]:
def get_canton(univ, inst, cache):
    ''' Maps the input univ or inst to its canton.
        First, tries to query the APIs on the university.
          If no result was found, then query on the institution.
          If still no result is found, returns 'NA'.
        Caches the results in the cache dictionary.
    '''
    
    def get_location(query):
        ''' Returns a location in the form (lat,lng) for the given query word, or 'NA' if nothing was found.
            To do so, it uses the Google Places API.
        '''
        params = { 'key' : api_key, 'query' : query}
        places_json = fetch_json(places_api_url, params)
        
        if places_json is None:
            return 'NA'
        else:
            return places_json['geometry']['location']
    
    def get_canton_from_location(latlng):
        ''' Returns a canton from the provided location object, or 'NA' if nothing was found.
            To do so, it uses the Google Geocode API.
        '''
        params = {'latlng' : str(latlng['lat']) + ',' + str(latlng['lng']), 'sensor' : 'false'}
        geo_json = fetch_json(geocode_api_url, params)
        
        if geo_json is None:
            return 'NA'
        else:
            for r in geo_json['address_components']:
                if 'administrative_area_level_1' in r['types']:
                    return r['short_name']
            return 'NA'
    
    # use the cache for values that were previously retrieved
    if univ in cache:
        return cache[univ]
    elif inst in cache:
        return cache[inst]
    
    # determine the right query word (university or institution) and then perform a first lookup
    query  = get_query(univ, inst)
    latlng = get_location(query)
    
    # if it failed, try to lookup by institution name
    if latlng == 'NA' and univ not in ('NPO', 'FP', 'FINST'):
        latlng = get_location(query=inst)
    
    # if that failed too, we have no further options
    if latlng == 'NA':
        return 'NA'
    
    # querying geocode api to get a canton from the location
    canton = get_canton_from_location(latlng)
    if canton != 'NA':
        cache[query] = canton
    return canton
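
The canton extraction step can be tried in isolation. The following is a minimal sketch, assuming address components shaped like the ones read in `get_canton_from_location` (a `types` list and a `short_name` per component); the `canton_from_components` name and the sample data are hypothetical.

```python
def canton_from_components(components):
    """Return the short name of the first-level administrative area, or 'NA'."""
    for component in components:
        if 'administrative_area_level_1' in component.get('types', []):
            return component['short_name']
    return 'NA'

# Hypothetical address components, shaped like the ones read by get_canton
sample = [
    {'short_name': 'Lausanne', 'types': ['locality', 'political']},
    {'short_name': 'VD', 'types': ['administrative_area_level_1', 'political']},
    {'short_name': 'CH', 'types': ['country', 'political']},
]

print(canton_from_components(sample))
```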

### 2b - Using caching

To save time and API requests, we cache the results.<br>
We group the entries by university and try to find a canton for each distinct university. When a canton is found, we register it in the cache:

In [102]:
cache = {}
univs = data.univ.groupby(data.univ)

# pre-populate the cache with one lookup per distinct university
for name, _ in univs:
    canton = get_canton(name, None, {})
    if canton != 'NA':
        cache[name] = canton

As we can see, the canton for EPFL was successfully retrieved and cached. In subsequent calls for EPFL, the APIs will not be queried anymore.

In [103]:
cache['EPF Lausanne - EPFL']

'VD'

The initial cache covers the following share of the universities. It then grows as queries are executed on the individual dataframe entries.

In [104]:
print(len(cache.keys()), '/', len(univs))

60 / 76
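
The effect of the cache can be checked with a small sketch that counts how often the expensive lookup actually runs. `slow_lookup` is a hypothetical stand-in for the two API calls, and `cached_lookup` only mirrors the caching logic of `get_canton` (successful results are stored, `'NA'` results are not).

```python
calls = []

def slow_lookup(name):
    """Hypothetical stand-in for the two API calls; records each invocation."""
    calls.append(name)
    return {'EPF Lausanne - EPFL': 'VD'}.get(name, 'NA')

def cached_lookup(name, cache):
    """Return the cached canton if present; otherwise look it up and cache successes."""
    if name in cache:
        return cache[name]
    canton = slow_lookup(name)
    if canton != 'NA':
        cache[name] = canton
    return canton

cache_demo = {}
first = cached_lookup('EPF Lausanne - EPFL', cache_demo)
second = cached_lookup('EPF Lausanne - EPFL', cache_demo)

# The second call is served from the cache, so only one lookup was recorded
print(first, second, len(calls))
```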


### 2c - Retrieving the canton for each entry

We now iterate over all the entries of the dataframe, yielding exactly one `canton` value per entry.<br>
This would usually be done by mapping a function from one column to the other (as we originally did), but so much can go wrong over 47,000+ entries (network failures, etc.) that we prefer an explicit loop. This way, we at least keep what has already been computed.

In [168]:
if force_download:
    cantons = pd.DataFrame(columns=('pnr', 'canton'))
    cantons = cantons.set_index('pnr')
    
    for r in data.itertuples(index=True):
        cantons.loc[r[0]] = get_canton(r.univ, r.inst, cache)
else:
    cantons = pd.read_csv('data/out_cantons.csv', sep=';')
    cantons = cantons.set_index('pnr')
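
One way to make the iterative approach robust to failures is sketched below, under assumptions: errors are caught per row and everything computed so far is kept. The `resolve_all` and `flaky_lookup` names are hypothetical, and the lookup is a toy function rather than a real API call.

```python
def resolve_all(rows, lookup):
    """Apply lookup to each (key, value) pair, keeping partial results on failure."""
    results = {}
    for key, value in rows:
        try:
            results[key] = lookup(value)
        except Exception as error:
            # Stop at the first failure; everything computed so far is preserved
            print('stopped at', key, 'because of', repr(error))
            break
    return results

def flaky_lookup(value):
    """Hypothetical lookup that fails on one input, simulating a network error."""
    if value == 'boom':
        raise ConnectionError('simulated network failure')
    return value.upper()

partial = resolve_all([(1, 'vd'), (2, 'zh'), (3, 'boom'), (4, 'ge')], flaky_lookup)
print(partial)
```

The partial dictionary could then be written to disk and the loop resumed from the first missing key.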

The proportion of retrieved `canton` values (vs. `NA` values) is:

In [169]:
cantons = cantons.replace('NA', np.nan)
proportion = (len(cantons.dropna()) / len(cantons)) * 100
print('%.2f' % proportion, '% of canton values are not null')

98.30 % of canton values are not null


If the values were freshly downloaded, we write them to disk:

In [170]:
if force_download:
    cantons.to_csv('data/out_cantons.csv', sep=';', columns=['canton'])

### 2d - Putting everything back together

First, check that the two dataframes are of the same length:

In [171]:
len(data) == len(cantons)

True

Then add the `canton` column to the original data frame:

In [172]:
# join_axes was removed in pandas 1.0; reindexing on data.index gives the same alignment
data_cantons = pd.concat([data, cantons], axis=1).reindex(data.index)
data_cantons.sample(5)

| pnr | amount | univ | inst | title | canton |
|---|---|---|---|---|---|
| 150451 | 1153770.0 | Université de Lausanne - LA | Institut d'études politiques, historiques et i... | Party strategies and the dynamics of electoral... | VD |
| 61145 | 6860.0 | Universität Basel - BS | Deutsches Seminar Universität Basel | Spracherwerb und Lebensalter - ontogenetische ... | BS |
| 162493 | 117917.0 | Universität Zürich - ZH | Institut für Theoretische Physik Universität Z... | Gravitational Lensing by galaxies and galaxy c... | ZH |
| 26628 | 245893.0 | Universität Basel - BS | Departement Forschung Kantonsspital Basel | Carbohydrates as mediators and modulators of c... | BS |
| 165560 | 200000.0 | EPF Lausanne - EPFL | Institut de physique de la matière condensée E... | Light-induced control of the Metal Insulator T... | VD |
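
Combining the two frames relies on index alignment: rows are matched by their `pnr` label, not by their position. A minimal sketch on hypothetical toy frames (the values are made up), using `DataFrame.join` as one possible way to express this:

```python
import pandas as pd

# Hypothetical frames sharing the same index name but in a different row order
left = pd.DataFrame({'amount': [10.0, 20.0, 30.0]},
                    index=pd.Index([1, 2, 3], name='pnr'))
right = pd.DataFrame({'canton': ['ZH', 'VD', 'BS']},
                     index=pd.Index([3, 1, 2], name='pnr'))

# Rows are matched by index label, not by position
combined = left.join(right)
print(combined)
```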


## Part 3 - Visualization

We need to get rid of the `NA` values for the `canton` field:

In [173]:
print(data_cantons.canton.isnull().sum(), 'rows were deleted')
data_cantons = data_cantons.dropna()

800 rows were deleted


### 3a - Why are you not Swiss??

We notice that some queries yielded a `canton` value that does not match an actual Swiss canton. This is the case, for example, for grants given to embassies in foreign countries. <br>
We need to filter the data to keep only the actual Swiss cantons. To that end, we build a mapping from `short_name` to `long_name` for each canton from the provided `json` file, then use it to filter our dataframe.

In [174]:
# the context manager closes the file even if parsing fails
with open('data/ch-cantons.topojson.json', 'r', encoding='utf-8') as topo_file:
    obj = json.load(topo_file)

In [175]:
canton_ids = {}
for c in obj['objects']['cantons']['geometries']:
    canton_ids[c['id']] = c['properties']['name']
canton_ids

{'AG': 'Aargau',
 'AI': 'Appenzell Innerrhoden',
 'AR': 'Appenzell Ausserrhoden',
 'BE': 'Bern/Berne',
 'BL': 'Basel-Landschaft',
 'BS': 'Basel-Stadt',
 'FR': 'Fribourg',
 'GE': 'Genève',
 'GL': 'Glarus',
 'GR': 'Graubünden/Grigioni',
 'JU': 'Jura',
 'LU': 'Luzern',
 'NE': 'Neuchâtel',
 'NW': 'Nidwalden',
 'OW': 'Obwalden',
 'SG': 'St. Gallen',
 'SH': 'Schaffhausen',
 'SO': 'Solothurn',
 'SZ': 'Schwyz',
 'TG': 'Thurgau',
 'TI': 'Ticino',
 'UR': 'Uri',
 'VD': 'Vaud',
 'VS': 'Valais/Wallis',
 'ZG': 'Zug',
 'ZH': 'Zürich'}

It looks like Google encoded the data for Geneva with the long name `Genève` instead of the code `GE`. We fix this:

In [176]:
data_cantons.canton = data_cantons.canton.replace('Genève', 'GE')

The number of entries whose `canton` field does not match an actual canton is:

In [177]:
# index labels of the rows whose canton is not a Swiss canton code
not_swiss_predicate = data_cantons.index[~data_cantons.canton.isin(list(canton_ids))]
len(data_cantons.loc[not_swiss_predicate])

203

We use these index labels to drop the non-Swiss rows:

In [178]:
data_cantons = data_cantons.drop(not_swiss_predicate)

### 3b - Summing grants per canton

The first step is to group the entries by canton and sum the grants allotted to each canton:

In [222]:
# select the numeric column explicitly so that text columns are not summed
totals = data_cantons.groupby('canton')[['amount']].sum()

We have data for the following number of cantons:

In [223]:
print(len(totals), '/', len(canton_ids))

23 / 26


For the sake of completeness, we add `0`s for the cantons that are not represented:

In [224]:
for c in canton_ids:
    if c not in totals.index:
        totals.loc[c] = 0
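
As an alternative to the loop, the same completion can be sketched with `DataFrame.reindex` and its `fill_value` argument. The toy frame and the `known_codes` list below are hypothetical.

```python
import pandas as pd

# Hypothetical totals covering only two of three known codes
toy_totals = pd.DataFrame({'amount': [5.0, 7.0]},
                          index=pd.Index(['ZH', 'VD'], name='canton'))
known_codes = ['VD', 'ZH', 'UR']

# reindex adds the missing labels and fills their values with 0
completed = toy_totals.reindex(known_codes, fill_value=0)
print(completed)
```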

This gives the final list of cantons, sorted by descending total of grants received:

In [238]:
totals.sort_values(by='amount', ascending=False)

| canton | amount |
|---|---|
| ZH | 3695473000.0 |
| VD | 2421560000.0 |
| GE | 1872341000.0 |
| BE | 1560613000.0 |
| BS | 1393646000.0 |
| FR | 453358800.0 |
| NE | 401114800.0 |
| AG | 127624100.0 |
| TI | 120118400.0 |
| SG | 93552810.0 |


### 3c - Final map

In [283]:
map_cantons = folium.Map(location=[46.9, 8.3], zoom_start=8)
# recent folium releases expose this as the folium.Choropleth class
# (Map.choropleth with geo_path is the older API)
folium.Choropleth(geo_data='data/ch-cantons.topojson.json',
                  topojson='objects.cantons',
                  data=totals.reset_index(),
                  columns=['canton', 'amount'],
                  key_on='feature.id',
                  fill_color='BuPu',
                  legend_name='Total amount of SNF grants (CHF)'
).add_to(map_cantons)



In [284]:
map_cantons