# Capstone project proposal by Jörg Schreiner

# Political orientation of Swiss communes from a data science perspective

## 1) The problem

Using the detailed data that is available
- for about 2'000 __Swiss municipalities__
- from sources such as the __Federal Statistical Office__ or the __Federal Finance Administration__, supplemented by less conventional data providers such as __Swiss Federal Railways__ or __geographical data__

I propose that it will be possible
- to predict the __political orientation__ of the municipalities as expressed in federal or cantonal elections
- to find __significant factors__ that influence the political orientation
- to see __changes__ in these factors over time

The dependent variable, political orientation, will in the context of this project be a single number that summarizes how far to the left or to the right a municipality has voted. It is computed by assigning each party a left/right value and then taking the average of the party values, weighted by the parties' vote shares (the weighted sum divided by the total share of the parties considered).

An example: given party lr-values { GPS: -2, SP: -1, GLP: 0, CVP: 0, FDP: 1, SVP: 2 }, for the federal election results of 2019 in  
__Lausanne__: { GPS: 27%, SP: 27%, GLP: 7%, CVP: 2%, FDP: 15% and SVP: 9% } po = __-0.55__ and  
__Schwyz__: { GPS: 3%, SP: 18%, GLP: 3%, CVP: 25%, FDP: 20% and SVP: 35% } po = __+0.57__  
So Lausanne leans to the left politically and Schwyz to the right. Intuitively, that is not surprising. The aim of the capstone project is to quantify this and to find out which factors (social, economic, geographic, ...) lead to it.
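
The calculation can be sketched as follows. This is a minimal illustration, not the final implementation: the function name `political_orientation` is hypothetical, and dividing by the total share of the listed parties is an assumption, chosen because it reproduces the Lausanne value quoted above.

```python
# Left/right value per party, as given in the example above.
PARTY_LR = {"GPS": -2, "SP": -1, "GLP": 0, "CVP": 0, "FDP": 1, "SVP": 2}


def political_orientation(shares, party_lr=PARTY_LR):
    """Vote-share-weighted mean of the party left/right values.

    `shares` maps party name to vote share in any consistent unit (here: percent).
    Assumption: the weighted sum is divided by the total share of the listed
    parties, so parties without an lr-value do not pull the result towards zero.
    """
    total = sum(shares.values())
    if total == 0:
        raise ValueError("no votes for any of the listed parties")
    weighted = sum(party_lr[party] * share for party, share in shares.items())
    return weighted / total


# Federal election results 2019 for Lausanne, taken from the example above.
lausanne_2019 = {"GPS": 27, "SP": 27, "GLP": 7, "CVP": 2, "FDP": 15, "SVP": 9}
po_lausanne = political_orientation(lausanne_2019)
print(f"Lausanne: {po_lausanne:+.2f}")  # prints "Lausanne: -0.55"
```

With these lr-values the result always lies between -2 (all votes for the leftmost party) and +2 (all votes for the rightmost party).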

What this project is not about: it's not about predicting future election results.

## 2) The data

### (a) Clear overview of your data

#### 2.a.1 Regionalporträts 2020: Kennzahlen aller Gemeinden
(Regional portraits 2020: key data of all communes)

Source: https://www.bfs.admin.ch/asset/de/je-d-21.03.01  
Download: https://www.bfs.admin.ch/bfsstatic/dam/assets/11587763/master  
Filename: je-d-21.03.01.xlsx (German; English version also available)  
Format: Excel  
Size: 1040 KB

This is the main data set for the project. It is an official statistic from the Bundesamt für Statistik (BFS), the Federal Statistical Office of Switzerland. It covers all 2202 municipalities (as of 2019) with 41 variables. Most of the data is recent (2017-2019), except for a few variables. The official commune id makes it possible to join this data with other data sets without relying on commune names.

Features:
- Population: residents, change, density, foreigners
- Age distribution, 0-19y, 20-64y, 65+y
- Birth, mortality, marriage, divorce
- Households, number and size
- Area: total, settlements, agricultural, wooded, unproductive,  with changes
- Economy: employees, businesses, primary, secondary, tertiary sectors
- Housing: dwelling vacancy, new housing
- Social assistance rate
- Voter shares national elections for 10 parties (these will go into the political-orientation value)

All data is numerical. The data quality is expected to be high (because of the source). There are missing values in the employees/workplaces and the social assistance variables. This happens when known values are not published for privacy reasons because the value is very small.

#### 2.a.2 Die 4 Sprachgebiete der Schweiz nach Gemeinden
(The 4 language areas of Switzerland by municipalities)

Source: https://www.atlas.bfs.admin.ch/maps/13/de/12474_3175_235_227/20584.html  
Download: https://www.atlas.bfs.admin.ch/core/projects/13/xshared/xlsx/20584_131.xlsx  
Filename: 20584_131.xlsx  
Format: Excel  
Size: 58 KB  

This data set has a single variable, the language area of each commune (2255 communes, 2017). Its value is one of ["Deutsches Sprachgebiet", "Französisches Sprachgebiet", "Italienisches Sprachgebiet", "Rätoromanisches Sprachgebiet"] (German, French, Italian and Romansh language area). The communes are identified by their official id as in 2.a.1.

#### 2.a.3 Prämienregionen der Krankenversicherung
(Premium regions of health insurance)

Source: https://www.bag.admin.ch/bag/de/home/versicherungen/krankenversicherung/krankenversicherung-versicherer-aufsicht/praemienregionen.html  
Download: https://www.bag.admin.ch/dam/bag/de/dokumente/kuv-aufsicht/pus/praemienregionen/praemienregionen-version-maerz-2020.xlsx.download.xlsx/praemienregionen-version-maerz-2020.xlsx  
Filename: praemienregionen-version-maerz-2020.xlsx  
Format: Excel  
Size: 834 KB  

In the D_PRIM sheet we find
- canton (official two-letter code)
- premium region (codes [0, 1, 2, 3])
- mean monthly premium for adults, young adults and children (mean is for canton/region, not commune)

for 2'210 communes (year 2020), identified by their official id as in 2.a.1. 

#### 2.a.4 Durchschnittliches steuerbares Einkommen pro Kopf
(Average taxable income per capita)

Source: https://www.atlas.bfs.admin.ch/maps/13/de/15132_9164_9202_7267/23875.html  
Download: https://www.atlas.bfs.admin.ch/core/projects/13/xshared/xlsx/23875_131.xlsx  
Filename:  23875_131.xlsx  
Format: Excel  
Size: 98 KB  

This contains two variables, the taxable income as a total and per capita, for 2'294 communes (year 2016), identified by their official id as in 2.a.1. 

#### 2.a.5 SBB Billetautomaten
(Swiss Federal Railways SBB ticket vending machines)

Source: https://data.sbb.ch/explore/dataset/billetautomat/  
Download: https://data.sbb.ch/explore/dataset/billetautomat/download/?format=csv&timezone=Europe/Berlin&lang=de&use_labels_for_header=true&csv_separator=%3B  
Filename: billetautomat.csv  
Format: CSV (semicolon-separated, as requested by `csv_separator=%3B` in the download URL)  
Size: 164 KB  

This contains the locations of the 1385 SBB ticket vending machines. The information I want to use is simply the number of machines in a commune. I am therefore interested in the location (station) names (BPS NAME column), which often differ from the BFS commune names, and in the geo location (geopos column).

Why use this data and not, for example, a list of all public transport stops? The Swiss public transport network is very dense, and including all stops, postal buses and the like included, would probably only tell me that more than 90% of communes have one. Vending machines, on the other hand, are only found at train stations along the major axes. I hope that this will distinguish urban centers and well-connected communes from the remote ones.

#### 2.a.6 More ideas
If time allows, or if the variables above turn out not to be predictive enough, here are more ideas.

Conventional statistics:
- crime rates (https://www.bfs.admin.ch/bfs/en/home/statistics/crime-criminal-justice.html)
- commuters (https://www.bfs.admin.ch/bfs/en/home/statistics/mobility-transport/passenger-transport/commuting.html)

Other features:
- distance from border
- presence of military installations

### (b) Plan to manage and process the data

#### 2.b.1 Regionalporträts (Regional portraits)
- The data can be read using `pd.read_excel()`
- Replace special values ('*', 'X') with zeroes or other appropriate values, check for missing values, check data types
- Save the data as a plain csv-file with clean data
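
The cleaning step could look roughly like this. It is a sketch under assumptions: the helper name `clean_special_values`, the column names and the sample values are made up, and the markers are first turned into missing values so that the decision between zero and another fill value can be taken per column afterwards.

```python
import pandas as pd

# Markers named in the plan above; their exact spelling in the file is an assumption.
SPECIAL_VALUES = ["*", "X"]


def clean_special_values(df, id_columns=()):
    """Return a copy in which special markers are NaN and data columns are numeric."""
    cleaned = df.copy()
    for column in cleaned.columns:
        if column in id_columns:
            continue
        # Blank out the known markers only; any other non-numeric text makes
        # pd.to_numeric raise, so unexpected values do not go unnoticed.
        without_markers = cleaned[column].mask(cleaned[column].isin(SPECIAL_VALUES))
        cleaned[column] = pd.to_numeric(without_markers)
    return cleaned


# Made-up sample rows, for illustration only.
raw = pd.DataFrame({
    "commune_id": [1, 2, 3],
    "employees": [120, "X", 45],
    "social_assistance_rate": [1.5, "*", 0.8],
})
clean = clean_special_values(raw, id_columns=["commune_id"])
print(clean.isna().sum())  # number of missing values per column
```

A call such as `clean.fillna(0)` would then implement the "replace with zeroes" option for the columns where that is appropriate.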

#### 2.b.2 Sprachgebiete (Language areas)
- The data can be read using `pd.read_excel()`
- Data is complete, no missing values
- One-hot encoding of the categorical variable into four columns language_de, language_fr, etc.
- Save the data as a plain csv-file with clean and one-hot encoded data
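
A possible sketch of the one-hot encoding is shown below. The input column name `language_area`, the helper name and the short codes `it` and `rm` are assumptions; only `language_de` and `language_fr` are named in the plan above.

```python
import pandas as pd

# Short codes for the four values listed in 2.a.2 (the codes are a naming choice).
LANGUAGE_CODES = {
    "Deutsches Sprachgebiet": "de",
    "Französisches Sprachgebiet": "fr",
    "Italienisches Sprachgebiet": "it",
    "Rätoromanisches Sprachgebiet": "rm",
}


def one_hot_language(df, column="language_area"):
    """Replace the language column by four 0/1 columns language_de, language_fr, ..."""
    # A categorical with fixed categories guarantees all four columns,
    # even if a language does not occur in the rows at hand.
    codes = pd.Series(
        pd.Categorical(df[column].map(LANGUAGE_CODES),
                       categories=list(LANGUAGE_CODES.values())),
        index=df.index,
    )
    dummies = pd.get_dummies(codes, prefix="language", dtype=int)
    return pd.concat([df.drop(columns=column), dummies], axis=1)


# Made-up sample rows, for illustration only.
sample = pd.DataFrame({
    "commune_id": [1, 2],
    "language_area": ["Deutsches Sprachgebiet", "Französisches Sprachgebiet"],
})
encoded = one_hot_language(sample)
print(encoded)
```

For linear models one of the four columns would normally be dropped to avoid perfect collinearity; that choice is left open here.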

#### 2.b.3 Krankenversicherung (Health insurance)
- The data can be read using `pd.read_excel()`
- Data is complete, no missing values
- Save the data as a plain csv-file with clean data

#### 2.b.4 Steuerbares Einkommen (Taxable income)
- The data can be read using `pd.read_excel()`
- Data is complete, no missing values
- Save the data as a plain csv-file with clean data

#### 2.b.5 SBB Billetautomaten (SBB ticket vending machines)
- The data can be read using `pd.read_csv()`
- Geo location is contained in two coordinate systems, Swiss local reference system CH1903+, and as global latitude and longitude (WGS84).
- Data is complete, no missing values
- Save the data as a plain csv-file with clean data
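
Reading the file and counting machines per station might be sketched like this. The sample rows are made up, and the header name `station_name` as well as the "latitude,longitude" layout of the `geopos` column are assumptions that need to be checked against the real file.

```python
import csv
import io
from collections import Counter

# Made-up sample in the shape assumed for billetautomat.csv: semicolon-separated,
# with a station-name column and a "geopos" column holding "latitude,longitude".
SAMPLE_CSV = """station_name;geopos
Bern;46.9488,7.4391
Bern;46.9490,7.4395
Thun;46.7548,7.6296
"""


def read_machines(text):
    """Parse the CSV text into a list of (station_name, latitude, longitude)."""
    machines = []
    for row in csv.DictReader(io.StringIO(text), delimiter=";"):
        lat, lon = (float(part) for part in row["geopos"].split(","))
        machines.append((row["station_name"], lat, lon))
    return machines


machines = read_machines(SAMPLE_CSV)
machines_per_station = Counter(name for name, _, _ in machines)
print(machines_per_station.most_common())
```

This counts per station only. Getting from stations to communes still requires the name or geo-location matching discussed under "Combining the data sets".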

#### 2.b.x General
- Renaming columns to be easily readable and in a consistent scheme (python-variable like)
- Optionally create a database and import the data into tables

Feature engineering:
- Add ratios of features, e.g. per-capita or per-area features (some are already present in the data)
- Add polynomial (squared) features or interaction features (products of features). Because of the high number of possible combinations, a strategy has to be found to find a meaningful subset.
- Add change rates (over time) to features. This would usually require obtaining historical versions of the data sets and computing the change rates from them. Some change rates are already present in the data.
- And of course the dependent variable, the political orientation value, needs to be computed.
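
To get a feeling for the number of squared and interaction features, the degree-two terms can simply be enumerated. The sketch below uses placeholder feature names and takes all 41 variables of the regional portraits as an upper bound (the party shares among them would not be used as features).

```python
from itertools import combinations_with_replacement


def degree_two_terms(feature_names):
    """All squared terms (a, a) and pairwise interaction terms (a, b)."""
    return list(combinations_with_replacement(feature_names, 2))


# 41 is the number of variables in 2.a.1; the names x0..x40 are placeholders.
features = [f"x{i}" for i in range(41)]
terms = degree_two_terms(features)
print(len(terms))  # 861 = 41 squared terms + 820 pairwise interactions
```

With several hundred candidate terms for roughly 2'000 communes, some form of selection or regularization is clearly needed before they can all go into a model.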

Combining the data sets:
- Most data sets use the official commune id, which makes combining the data easy
- One obstacle is that the set of communes is not stable. Each year a few communes merge and form a new commune. I estimate that this concerns no more than about 100 communes. Either I drop these communes (they are small ones, and we have plenty of those) or I use a zero or mean value. In any case, the official Swiss commune register at the BFS documents the changes.
- Data sets that do not contain the official commune id can be merged by commune name or by geo location.
- By name: dealing with language (e.g. Genf/Genève), spelling (e.g. Zürich/Zuerich/Zurich) and other issues (names that are not communes at all; the Zürich Flughafen station is not in Zürich) can make this an expensive task.
- By geo location: in theory, this is very precise. The Swiss federal geoportal, https://www.geo.admin.ch/, offers swissBOUNDARIES3D, whose municipal boundaries "constitute the administrative borders of the municipalities of Switzerland". Web services to access the data are described at https://api3.geo.admin.ch/index.html; the documentation seems good but is rather extensive.
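
For the data sets that do carry the official commune id, the join and the check for communes without a match can be sketched as below. The column names, the helper name `join_on_commune_id` and all values are made up for illustration.

```python
import pandas as pd


def join_on_commune_id(base, other, key="commune_id"):
    """Left-join `other` onto `base` and report base communes without a match."""
    merged = base.merge(
        other, on=key, how="left",
        indicator=True,          # adds a "_merge" column: "both" or "left_only"
        validate="one_to_one",   # fail early if an id occurs twice on either side
    )
    unmatched = merged.loc[merged["_merge"] == "left_only", key].tolist()
    return merged.drop(columns="_merge"), unmatched


# Made-up ids and values, for illustration only.
portraits = pd.DataFrame({"commune_id": [1, 2, 3], "residents": [1500, 420, 9800]})
income = pd.DataFrame({"commune_id": [1, 3, 4], "income_per_capita": [31000, 42000, 28000]})

combined, unmatched_ids = join_on_commune_id(portraits, income)
print(unmatched_ids)  # [2]
```

The list of unmatched ids is the starting point for handling merged communes: those rows can be dropped, filled with a zero or mean value, or mapped by hand using the commune register.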


## 3) Exploratory data analysis (EDA)

### (a) Preliminary EDA

< details here as markdown/code cells >

### (b) How does the EDA inform your project plan?

< details here as markdown/code cells >

### (c) What further EDA do you plan for project?

< details here as markdown/code cells >

## 4) Machine learning 

### (a) Phrase your project goal as a clear machine learning question

< details here as markdown/code cells >

### (b) What models are you planning to use and why?

< details here as markdown/code cells >

### (c) Please tell us your detailed machine learning strategy 

< details here as markdown/code cells >

## 5) Additional information

< If needed, discuss here any elements or approaches that do not fit with the above categories >