In [11]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import unimap as um

# Assigning Universities/Institutions to Cantons 

In this notebook we provide a way to map the universities and institutions mentioned in the SNF P3 data to their canton of origin.

To do so we use the Places and Geocoding APIs from Google Maps, as well as Yandex's translate API:

Yandex: https://tech.yandex.com/translate/

Google Maps: https://github.com/googlemaps/google-maps-services-python.

The code used to get a canton name from a university name is contained in the **unimap.py** file. Below is a description of the whole process. The end product will be a JSON object of the form:
```
{
  'canton': [
    'uni1', 
    'uni2',
    ...
  ], 
  'canton': [
    ...
  ], 
  ...
}
```

First, we declare our API keys. The actual keys have been replaced with placeholders for security reasons.

In [10]:
maps = 'GMAPS_KEY_HERE'
yandex = 'YANDEX_KEY_HERE'
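One way to keep real keys out of the notebook altogether is to read them from environment variables. The sketch below assumes the keys are stored under the hypothetical names `GMAPS_KEY` and `YANDEX_KEY`; neither name comes from **unimap.py** or from the APIs themselves.

```python
import os

def load_key(name, default=""):
    """Return the API key stored in environment variable `name`, or `default`."""
    return os.environ.get(name, default)

# GMAPS_KEY and YANDEX_KEY are hypothetical variable names chosen for this sketch.
# The placeholders are used as fallbacks when the variables are not set.
maps = load_key("GMAPS_KEY", "GMAPS_KEY_HERE")
yandex = load_key("YANDEX_KEY", "YANDEX_KEY_HERE")
```

With this approach the notebook can be shared or committed without editing the cell first.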

## Preparing the data

We load the source CSV file into a pandas DataFrame and replace its NA values with empty strings ('') for easier string processing.

In [3]:
df = pd.read_csv('data/P3_GrantExport.csv', header=0, sep=';')
df = df.fillna('')
df.head(2)

| | "Project Number" | Project Title | Project Title English | Responsible Applicant | Funding Instrument | Funding Instrument Hierarchy | Institution | University | Discipline Number | Discipline Name | Discipline Name Hierarchy | Start Date | End Date | Approved Amount | Keywords |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Schlussband (Bd. VI) der Jacob Burckhardt-Biog... | | Kaegi Werner | Project funding (Div. I-III) | Project funding | | Nicht zuteilbar - NA | 10302 | Swiss history | Human and Social Sciences;Theology & religious... | 01.10.1975 | 30.09.1976 | 11619.0 | |
| 1 | 4 | Batterie de tests à l'usage des enseignants po... | | Massarenti Léonard | Project funding (Div. I-III) | Project funding | Faculté de Psychologie et des Sciences de l'Ed... | Université de Genève - GE | 10104 | Educational science and Pedagogy | Human and Social Sciences;Psychology, educatio... | 01.10.1975 | 30.09.1976 | 41022.0 | |
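The first column name in the output above keeps its literal quotes, which suggests that the file starts with a UTF-8 byte-order mark (BOM) that hides the opening quote from the parser. The sketch below reproduces the effect with the standard `csv` module on a made-up two-line sample (not the real export) and shows that decoding with `utf-8-sig` removes it. `pd.read_csv` accepts the same value through its `encoding` argument, although we have not re-run the notebook with it.

```python
import csv
import io

# A tiny stand-in for the start of the export: a UTF-8 byte-order mark,
# then a quoted, semicolon-separated header (made-up sample, not the real file).
raw = b'\xef\xbb\xbf"Project Number";"Project Title"\n"1";"Schlussband"\n'

def read_header(data, encoding):
    """Decode `data` and return the first CSV row as a list of column names."""
    text = io.StringIO(data.decode(encoding), newline="")
    return next(csv.reader(text, delimiter=";"))

plain = read_header(raw, "utf-8")        # BOM survives and shields the opening quote
cleaned = read_header(raw, "utf-8-sig")  # BOM is stripped before parsing

print(repr(plain[0]))
print(repr(cleaned[0]))
```

This does not matter for the rest of the notebook, which only uses the `University` column.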


In this part of the process we're only interested in the universities mentioned in this data. We want to assign to every unique university name the canton where the institution is located.

In [4]:
unis = df['University'].unique()
for u in unis[0:5]:
    print(u)
print('...')

Nicht zuteilbar - NA
Université de Genève - GE
NPO (Biblioth., Museen, Verwalt.) - NPO
Universität Basel - BS
Université de Fribourg - FR
...
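Each entry follows the pattern `name - code`, so a first cleaning step could separate the two parts and expand common abbreviations before querying any API. The sketch below is only an illustration of that idea: `split_name`, `expand` and the `ABBREVIATIONS` table are hypothetical and are not taken from **unimap.py**, whose actual rules may differ.

```python
def split_name(label):
    """Split 'Name - CODE' on the last ' - ' into (name, code)."""
    name, sep, code = label.rpartition(" - ")
    if not sep:  # no separator found: treat the whole label as the name
        return label.strip(), ""
    return name.strip(), code.strip()

# Hypothetical expansion table; unimap.py may use different rules.
ABBREVIATIONS = {
    "Eidg.": "Eidgenössische",
    "Inst.": "Institut",
    "Schweiz.": "Schweizerisches",
}

def expand(name, table=ABBREVIATIONS):
    """Replace each whitespace-separated token that appears in `table`."""
    return " ".join(table.get(token, token) for token in name.split())

name, code = split_name("Inst. de Hautes Etudes Internat. et du Dév - IHEID")
print(expand(name), "|", code)
```

Splitting on the last ` - ` (with surrounding spaces) keeps codes such as `HES-SO` and `SIK-ISEA` intact.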


## Processing the data

Now that we have the array of university names, we're ready to process it. We use the code contained in **unimap.py**.

The code consists of three classes:
* **`Univ()`**: describes a university object and provides name-handling methods such as replacing abbreviations and translating the name to English.

* **`UniMapper(gmaps, yandex)`**: contains all geocoding related methods. We will describe the way they're used below.

* **`CantonDict()`**: initialises an empty dictionary and provides a method that populates it with cantons as keys and lists of the universities in each canton as values. It also has an export method.

There is also a `corrections` function that applies 5 corrections to the final results (the only manual ones we make, along with name pre-cleaning).
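To give an idea of what the geocoding step has to do, the sketch below extracts a canton code from data shaped like a Geocoding API response, where the canton appears as the `administrative_area_level_1` address component. The function `canton_from_geocode` and the sample are hand-written for illustration; they are not the code in **unimap.py**, and the sample is not real API output. With google-maps-services-python, `googlemaps.Client(key=...).geocode(address)` returns a list of result dictionaries of this general shape.

```python
def canton_from_geocode(results):
    """Return the canton code from Geocoding-API-style results, or 'fail'."""
    for result in results:
        for component in result.get("address_components", []):
            if "administrative_area_level_1" in component.get("types", []):
                return component["short_name"]
    return "fail"

# Hand-written sample in the shape of a Geocoding API response (not real output).
sample = [{
    "address_components": [
        {"long_name": "Genève", "short_name": "Genève",
         "types": ["locality", "political"]},
        {"long_name": "Genève", "short_name": "GE",
         "types": ["administrative_area_level_1", "political"]},
        {"long_name": "Switzerland", "short_name": "CH",
         "types": ["country", "political"]},
    ]
}]

print(canton_from_geocode(sample))
```

Returning `'fail'` for an empty result list mirrors the key used for unassigned names later in this notebook.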

We instantiate a `UniMapper` with the proper API keys.

In [5]:
m = um.UniMapper(maps, yandex)

And initialise the cantons object...

In [6]:
cantons = um.CantonDict()

...before populating it as follows:

```
input: universities array, empty CantonDict dictionary
output: updated CantonDict dictionary

for each university in array:
    - create Univ() instance
    - use Places API to find address in CH
        - if no address found, translate name with Yandex
            and query the Places API again
        - if still nothing: canton <- 'fail'
    - otherwise, use Geocoding API to get canton of address:
        canton <- that canton
    - apply the corrections function to canton
    - if canton exists as key in dict:
          append current uni name to its value
      else:
          create it as key and start list with current uni
```
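The steps above can be sketched in plain Python. This is a minimal illustration rather than the actual `CantonDict.populate` code: the three services are passed in as hypothetical callables (`find_address`, `translate`, `canton_of`) and are replaced here by small lookup tables with made-up values, so the example runs without any API key.

```python
def populate(unis, find_address, translate, canton_of, correct=lambda c, u: c):
    """Group university names by canton, following the pseudocode steps."""
    d = {}
    for uni in unis:
        address = find_address(uni)
        if address is None:  # retry once with the translated name
            address = find_address(translate(uni))
        canton = "fail" if address is None else canton_of(address)
        canton = correct(canton, uni)  # manual overrides, if any
        d.setdefault(canton, []).append(uni)
    return d

# Stub services with made-up values so the sketch runs offline.
addresses = {"Universität Basel": "addr-basel", "University of Bern": "addr-bern"}
translations = {"Universität Bern": "University of Bern"}
canton_by_address = {"addr-basel": "BS", "addr-bern": "BE"}

result = populate(
    ["Universität Basel", "Universität Bern", "Weitere Institute"],
    find_address=addresses.get,
    translate=lambda name: translations.get(name, name),
    canton_of=canton_by_address.get,
)
print(result)
```

Using `dict.setdefault` covers both branches of the last pseudocode step: the list is created on first use and appended to afterwards.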

In [7]:
cantons.populate(unis, m)

 
fail <-- Nicht zuteilbar - NA
 
GE <-- Université de Genève - GE
 
fail <-- NPO (Biblioth., Museen, Verwalt.) - NPO
 
BS <-- Universität Basel - BS
 
FR <-- Université de Fribourg - FR
 
ZH <-- Universität Zürich - ZH
 
VD <-- Université de Lausanne - LA
 
BE <-- Universität Bern - BE
 
ZH <-- Eidg. Forschungsanstalt für Wald,Schnee,Land - WSL
 
NE <-- Université de Neuchâtel - NE
 
ZH <-- ETH Zürich - ETHZ
 
GE <-- Inst. de Hautes Etudes Internat. et du Dév - IHEID
 
SG <-- Universität St. Gallen - SG
 
fail <-- Weitere Institute - FINST
 
fail <-- Firmen/Privatwirtschaft - FP
 
GR <-- Pädagogische Hochschule Graubünden - PHGR
 
VD <-- EPF Lausanne - EPFL
 
ZH <-- Pädagogische Hochschule Zürich - PHZFH
 
LU <-- Universität Luzern - LU
 
ZH <-- Schweiz. Institut für Kunstwissenschaft - SIK-ISEA
 
TI <-- SUP della Svizzera italiana - SUPSI
 
JU <-- HES de Suisse occidentale - HES-SO
 
BE <-- Robert Walser-Stiftung Bern - RWS
 
AG <-- Paul Scherrer Institut - PSI
 
SG <-- Pädagogische 

We get an assignment rate of almost 90%, which is a good start. Let's check the failed assignments to see what is going on.

In [8]:
cantons.d['fail']

['Nicht zuteilbar - NA',
 'NPO (Biblioth., Museen, Verwalt.) - NPO',
 'Weitere Institute - FINST',
 'Firmen/Privatwirtschaft - FP',
 'Weitere Spitäler - ASPIT',
 'Forschungsanstalten Agroscope - AGS',
 'Istituto Svizzero di Roma - ISR',
 'Schweizer Kompetenzzentrum Sozialwissensch. - FORS']

Five of these eight failures are generic categories rather than identifiable institutions, one (ISR) is located in Rome, and the remaining two, AGS and FORS, turn out after checking to be spread across Switzerland.

Thus, with very few manual adjustments, we reach an assignment rate of about 95% of actual institutions.
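Rates like these can be computed directly from the dictionary. The helper below is a sketch: `assignment_rate` is a hypothetical function, and the `exclude` argument is our own way of leaving out entries that are not actual institutions. It is shown on a toy dictionary rather than on the real results.

```python
def assignment_rate(d, fail_key="fail", exclude=()):
    """Share of names mapped to a canton, ignoring any names in `exclude`."""
    skip = set(exclude)
    failed = [u for u in d.get(fail_key, []) if u not in skip]
    assigned = sum(len(unis) for canton, unis in d.items() if canton != fail_key)
    total = assigned + len(failed)
    return assigned / total if total else 0.0

# Toy data, not the notebook's results.
toy = {"GE": ["a", "b"], "ZH": ["c"], "fail": ["d", "generic"]}

print(assignment_rate(toy))
print(assignment_rate(toy, exclude=["generic"]))
```

On the notebook's object, the call would be `assignment_rate(cantons.d)`, with the generic categories passed through `exclude` to get the rate for actual institutions.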

We can then export the dictionary to a JSON file for further processing:

In [9]:
cantons.export('data/cantons.json')
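For further processing it is often more convenient to look up the canton of a given university, which means inverting the exported structure. The sketch below assumes the file has the `{canton: [uni, ...]}` form described at the top of this notebook and uses a small inline JSON string in place of `data/cantons.json`; `invert` is a hypothetical helper.

```python
import json

def invert(canton_to_unis):
    """Turn {canton: [uni, ...]} into {uni: canton}."""
    return {
        uni: canton
        for canton, unis in canton_to_unis.items()
        for uni in unis
    }

# Small inline stand-in for the content of the exported file.
exported = json.loads(
    '{"GE": ["Université de Genève - GE"],'
    ' "ZH": ["ETH Zürich - ETHZ", "Universität Zürich - ZH"]}'
)
uni_to_canton = invert(exported)
print(uni_to_canton["ETH Zürich - ETHZ"])
```

The resulting dictionary can then be applied to the original data, for example with `df['University'].map(uni_to_canton)`.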