# Capstone project

## Comparison between the cities of Zurich and Geneva (Switzerland)

# Introduction

In this notebook I will analyse and compare the neighborhoods of two Swiss cities: Zurich and Geneva. I will provide both a visual and a numerical comparison, also taking into account the population density of each neighborhood.

# Data

I took the data from the following Wikipedia pages:

- Zurich https://de.wikipedia.org/wiki/Stadtteile_der_Stadt_Z%C3%BCrich
    
- Geneva https://de.wikipedia.org/wiki/Genf#Stadtviertel

Unfortunately, the population data refer to different years: I will use 2018 for Zurich and 2015 for Geneva.

I downloaded the tables for Zurich and Geneva from these Wikipedia pages as CSV files. Let's have a look at the data by loading the two tables into two dataframes.

In [1]:
import pandas as pd
import numpy as np

### Loading Zurich data

In [2]:
zurich=pd.read_csv("zurich.csv")
zurich.head()

Unnamed: 0,Stadtkreis,Statistische Quartiere,BFS-Code,Fläche,Einwohner,Einwohner.1,Einwohner.2,Ausländer
0,Kreis 1,Rathaus,261011,0.38,3267,3194,3081,"30,1 %"
1,Kreis 1,Hochschulen,261012,0.56,664,665,695,"34,3 %"
2,Kreis 1,Lindenhof,261013,0.23,990,923,950,"30,1 %"
3,Kreis 1,City,261014,0.64,829,783,846,"30,0 %"
4,Kreis 2,Wollishofen,261021,5.75,18'923,15'937,15'592,"29,1 %"


First we rename the columns:

In [3]:
zurich.columns=['District', 'Neighborhood', 'BFS-Code', 'Area(km^2)',
       'Population2018', 'Population2013', 'Population2005', 'Foreigns (%)']
zurich.head()

Unnamed: 0,District,Neighborhood,BFS-Code,Area(km^2),Population2018,Population2013,Population2005,Foreigns (%)
0,Kreis 1,Rathaus,261011,0.38,3267,3194,3081,"30,1 %"
1,Kreis 1,Hochschulen,261012,0.56,664,665,695,"34,3 %"
2,Kreis 1,Lindenhof,261013,0.23,990,923,950,"30,1 %"
3,Kreis 1,City,261014,0.64,829,783,846,"30,0 %"
4,Kreis 2,Wollishofen,261021,5.75,18'923,15'937,15'592,"29,1 %"


We now look at the data type for each column by using .info()

In [4]:
zurich.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34 entries, 0 to 33
Data columns (total 8 columns):
District          34 non-null object
Neighborhood      34 non-null object
BFS-Code          34 non-null int64
Area(km^2)        34 non-null float64
Population2018    34 non-null object
Population2013    34 non-null object
Population2005    34 non-null object
Foreigns (%)      34 non-null object
dtypes: float64(1), int64(1), object(6)
memory usage: 2.2+ KB


We have some problems with data types and formats.

The three population columns (2018, 2013 and 2005) contain strings rather than numerical values, and some entries use " ' " as a thousands separator.

In [5]:
for cat in ["Population2018","Population2013", "Population2005"]:
    zurich[cat]=zurich[cat].apply(lambda x: int("".join(x.split("'"))))
    print(cat)
    print(zurich[cat].dtype)

Population2018
int64
Population2013
int64
Population2005
int64
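As an alternative to the row-wise `apply`, the same cleaning can be sketched with pandas string methods. The helper name `parse_swiss_int` and the sample values are assumptions made for illustration; it may also be possible to handle the separator at load time through the `thousands` argument of `pd.read_csv`.

```python
import pandas as pd


def parse_swiss_int(series: pd.Series) -> pd.Series:
    """Turn strings such as "18'923" into integers by dropping the apostrophes."""
    cleaned = series.astype(str).str.replace("'", "", regex=False)
    return pd.to_numeric(cleaned).astype("int64")


# Small sample taken from the first rows of the Zurich table
sample = pd.Series(["3267", "664", "18'923"], name="Population2018")
parsed = parse_swiss_int(sample)
print(parsed.tolist())
print(parsed.dtype)
```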


The data in the Foreigns (%) column are also not numerical values and use a comma as decimal separator.

We keep only the number, without the %.

In [6]:
zurich["Foreigns (%)"]=zurich["Foreigns (%)"].apply(lambda x: float((".".join(x.split(","))).split("%")[0]))
zurich["Foreigns (%)"].dtype

dtype('float64')
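The nested `split`/`join` calls above can be hard to read. A minimal sketch of the same conversion with chained string methods might look like this; `parse_percent` is a hypothetical helper name, not part of the original notebook.

```python
import pandas as pd


def parse_percent(series: pd.Series) -> pd.Series:
    """Convert strings such as "30,1 %" into floats such as 30.1."""
    cleaned = (
        series.str.replace("%", "", regex=False)   # drop the percent sign
              .str.replace(",", ".", regex=False)  # decimal comma -> decimal point
              .str.strip()                         # remove leftover whitespace
    )
    return cleaned.astype(float)


sample = pd.Series(["30,1 %", "34,3 %", "30,0 %"])
share = parse_percent(sample)
print(share.tolist())
```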

In [7]:
zurich.head()

Unnamed: 0,District,Neighborhood,BFS-Code,Area(km^2),Population2018,Population2013,Population2005,Foreigns (%)
0,Kreis 1,Rathaus,261011,0.38,3267,3194,3081,30.1
1,Kreis 1,Hochschulen,261012,0.56,664,665,695,34.3
2,Kreis 1,Lindenhof,261013,0.23,990,923,950,30.1
3,Kreis 1,City,261014,0.64,829,783,846,30.0
4,Kreis 2,Wollishofen,261021,5.75,18923,15937,15592,29.1


I add three columns:
- one for the population density (Population2018/Area)
- two for the coordinates of the neighborhoods, "Latitude" and "Longitude", initially filled with an address string that will be used later to get the coordinates.

In [8]:
zurich["PopDens"]=zurich.Population2018/zurich["Area(km^2)"]
zurich["Latitude"]=(zurich.Neighborhood+", "+zurich.District)
zurich["Longitude"]=(zurich.Neighborhood+", "+zurich.District)
zurich.head()

Unnamed: 0,District,Neighborhood,BFS-Code,Area(km^2),Population2018,Population2013,Population2005,Foreigns (%),PopDens,Latitude,Longitude
0,Kreis 1,Rathaus,261011,0.38,3267,3194,3081,30.1,8597.368421,"Rathaus, Kreis 1","Rathaus, Kreis 1"
1,Kreis 1,Hochschulen,261012,0.56,664,665,695,34.3,1185.714286,"Hochschulen, Kreis 1","Hochschulen, Kreis 1"
2,Kreis 1,Lindenhof,261013,0.23,990,923,950,30.1,4304.347826,"Lindenhof, Kreis 1","Lindenhof, Kreis 1"
3,Kreis 1,City,261014,0.64,829,783,846,30.0,1295.3125,"City, Kreis 1","City, Kreis 1"
4,Kreis 2,Wollishofen,261021,5.75,18923,15937,15592,29.1,3290.956522,"Wollishofen, Kreis 2","Wollishofen, Kreis 2"


### Loading Geneva data

In [9]:
geneve=pd.read_csv("genf.csv")
geneve

Unnamed: 0,Viertel,Quartier,Nr.,BFS-Code,Fläche,Einwohner
0,,,,,ha[12],(Ende 2015)
1,Cité,Cité – Centre,1.0,6621001.0,106,6'720
2,,,,,,
3,,,,,,
4,,,,,,
5,,,,,,
6,,,,,,
7,,St-Gervais – Chantepoulet,2.0,6621002.0,47,4'474
8,,Délices – Grottes – Montbrillant,3.0,6621003.0,68,13'806
9,,Pâquis,4.0,6621004.0,42,10'650


The NaN values in the Quartier, Nr., BFS-Code and Fläche columns are due to the format of the table on Wikipedia. We can simply remove those rows (index 2 to 6) and the first row (index 0) as well. Since I couldn't find the coordinates of the neighborhood O.N.U. with the geocoder, I decided to remove the row with index 17 as well.

In [10]:
for n in [0,2,3,4,5,6,17]:
    geneve.drop(n, axis=0, inplace=True)
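Hard-coded row indices break if the table is downloaded again with a different layout. As a sketch under the assumption that every layout row has no value in the Quartier column, `dropna` with `subset` could remove those rows instead. The sample frame below is a simplified stand-in for the real table, and the O.N.U. row would still need to be dropped separately.

```python
import numpy as np
import pandas as pd

# Simplified stand-in for the raw Geneva table (not the full data)
sample = pd.DataFrame({
    "Viertel": [np.nan, "Cité", np.nan, np.nan],
    "Quartier": [np.nan, "Cité – Centre", np.nan, "Pâquis"],
    "Einwohner": ["(Ende 2015)", "6'720", np.nan, "10'650"],
})

# Keep only rows that actually name a neighborhood, then renumber the index
cleaned = sample.dropna(subset=["Quartier"]).reset_index(drop=True)
print(cleaned)
```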

In [11]:
geneve.columns=['District', 'Neighborhood', 'Number', 'BFS-Code', 'Area', 'Population2015']

In [12]:
#reset the index
geneve.reset_index(drop=True, inplace=True)
geneve.head()

Unnamed: 0,District,Neighborhood,Number,BFS-Code,Area,Population2015
0,Cité,Cité – Centre,1.0,6621001.0,106,6'720
1,,St-Gervais – Chantepoulet,2.0,6621002.0,47,4'474
2,,Délices – Grottes – Montbrillant,3.0,6621003.0,68,13'806
3,,Pâquis,4.0,6621004.0,42,10'650
4,Plainpalais,Champel,11.0,6621011.0,180,17'968


The NaN values in the District column are due to the fact that several neighborhoods belong to the same district and the district name is given only for the first of them. We fill in this missing information with the following code:

In [13]:
gen_dist=list(geneve.District.dropna()) #list of the districts
geneve.District=geneve.District.fillna("0")

dis="Cité"
districts_genf=[]
for i in list(geneve.index):
    if geneve.iloc[i]["District"]!="0":
        dis=geneve.iloc[i]["District"]      
    districts_genf.append(dis)

geneve.District=districts_genf

geneve.head()

Unnamed: 0,District,Neighborhood,Number,BFS-Code,Area,Population2015
0,Cité,Cité – Centre,1.0,6621001.0,106,6'720
1,Cité,St-Gervais – Chantepoulet,2.0,6621002.0,47,4'474
2,Cité,Délices – Grottes – Montbrillant,3.0,6621003.0,68,13'806
3,Cité,Pâquis,4.0,6621004.0,42,10'650
4,Plainpalais,Champel,11.0,6621011.0,180,17'968
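The loop above carries the last seen district name forward into the empty rows. pandas offers this operation as a forward fill, so a shorter sketch of the same idea, shown here on a small sample rather than on the full table, could be:

```python
import numpy as np
import pandas as pd

# Sample mirroring the first rows of the Geneva table
sample = pd.DataFrame({
    "District": ["Cité", np.nan, np.nan, np.nan, "Plainpalais"],
    "Neighborhood": [
        "Cité – Centre",
        "St-Gervais – Chantepoulet",
        "Délices – Grottes – Montbrillant",
        "Pâquis",
        "Champel",
    ],
})

# Propagate the last non-missing district name downwards
sample["District"] = sample["District"].ffill()
print(sample)
```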


In [14]:
geneve.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 6 columns):
District          15 non-null object
Neighborhood      15 non-null object
Number            15 non-null float64
BFS-Code          15 non-null float64
Area              15 non-null object
Population2015    15 non-null object
dtypes: float64(2), object(4)
memory usage: 800.0+ bytes


I convert Area and Population2015 to numerical values, removing the " ' " thousands separator from the latter. The area of the neighborhoods is given in hectares (ha); I convert it to km^2 by dividing by 100.

In [15]:
geneve.Population2015=geneve.Population2015.apply(lambda x: int("".join(x.split("'"))))
geneve.Area=geneve.Area.apply(float)
geneve.Area=geneve.Area/100
geneve.head()

Unnamed: 0,District,Neighborhood,Number,BFS-Code,Area,Population2015
0,Cité,Cité – Centre,1.0,6621001.0,1.06,6720
1,Cité,St-Gervais – Chantepoulet,2.0,6621002.0,0.47,4474
2,Cité,Délices – Grottes – Montbrillant,3.0,6621003.0,0.68,13806
3,Cité,Pâquis,4.0,6621004.0,0.42,10650
4,Plainpalais,Champel,11.0,6621011.0,1.8,17968


I convert the values in the Number column to integers:

In [16]:
geneve.Number=geneve.Number.apply(int)
geneve.head()

Unnamed: 0,District,Neighborhood,Number,BFS-Code,Area,Population2015
0,Cité,Cité – Centre,1,6621001.0,1.06,6720
1,Cité,St-Gervais – Chantepoulet,2,6621002.0,0.47,4474
2,Cité,Délices – Grottes – Montbrillant,3,6621003.0,0.68,13806
3,Cité,Pâquis,4,6621004.0,0.42,10650
4,Plainpalais,Champel,11,6621011.0,1.8,17968


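The neighborhood name St-Gervais is written incorrectly; it should be St.Gervais. I correct that below. Note that the replacement acts on every hyphen in the column, so names such as Eaux-Vives and Grand-Pré become Eaux.Vives and Grand.Pré as well: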

In [17]:
geneve.Neighborhood=geneve.Neighborhood.apply(lambda x: x.replace("-" , '.'))
geneve.head()

Unnamed: 0,District,Neighborhood,Number,BFS-Code,Area,Population2015
0,Cité,Cité – Centre,1,6621001.0,1.06,6720
1,Cité,St.Gervais – Chantepoulet,2,6621002.0,0.47,4474
2,Cité,Délices – Grottes – Montbrillant,3,6621003.0,0.68,13806
3,Cité,Pâquis,4,6621004.0,0.42,10650
4,Plainpalais,Champel,11,6621011.0,1.8,17968
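If only the "St-" prefix should change, a narrower replacement is possible. The following is a sketch on a few sample names, assuming that "St-" appears only as the abbreviation at the start of St-Gervais and St-Jean:

```python
import pandas as pd

names = pd.Series([
    "St-Gervais – Chantepoulet",
    "Eaux-Vives – Lac",
    "Grand-Pré – Vermont",
    "St-Jean – Aire",
])

# Replace only the literal "St-" prefix; other hyphens are left untouched
fixed = names.str.replace("St-", "St.", regex=False)
print(fixed.tolist())
```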


If I try to get the coordinates of the neighborhoods using the names as they are, I cannot get all of them:

In [18]:
from geopy.geocoders import Nominatim

geolocator=Nominatim(user_agent="foursquare_agent")

for n in geneve.Neighborhood: 
    address="{}, {}, {}".format(n,"Genf","Switzerland")
    coord=geolocator.geocode(address)
    print("---")
    print(n)
    print(coord)

---
Cité – Centre
Cité, Genève, 1204, Schweiz/Suisse/Svizzera/Svizra
---
St.Gervais – Chantepoulet
None
---
Délices – Grottes – Montbrillant
None
---
Pâquis
Pâquis, Genève, Schweiz/Suisse/Svizzera/Svizra
---
Champel
Champel, Genève, 1206, Schweiz/Suisse/Svizzera/Svizra
---
La Cluse
Boulevard de la Cluse, Plainpalais, Genève, 1205, Schweiz/Suisse/Svizzera/Svizra
---
Jonction
Jonction, Genève, 1200, Schweiz/Suisse/Svizzera/Svizra
---
Bâtie – Acacias
None
---
Eaux.Vives – Lac
Eaux-Vives, Genève, Schweiz/Suisse/Svizzera/Svizra
---
Florissant – Malagnou
None
---
Sécheron
Sécheron, Pâquis, Genève, 1202, Schweiz/Suisse/Svizzera/Svizra
---
Grand.Pré – Vermont
None
---
Bouchet – Moillebeau
Rue Paul-Bouchet, Grottes et Saint-Gervais, Genève, 1201, Schweiz/Suisse/Svizzera/Svizra
---
Charmilles – Châtelaine
None
---
St.Jean – Aire
None


To get coordinates for every neighborhood, I use only the first name in each row of the Neighborhood column. Unfortunately I couldn't find separate population or area figures for the other neighborhoods in a row (for instance Chantepoulet in the second row), so each row is represented by its first name. Here I add two columns for the coordinates, in a format that will be useful to get the coordinates later, and a column with the population density (Population2015/Area).

In [19]:
geneve["PopDens"]=geneve.Population2015/geneve.Area
geneve["Latitude"]=geneve.Neighborhood.apply(lambda x: x.split("–")[0])
geneve["Longitude"]=geneve.Neighborhood.apply(lambda x: x.split("–")[0])

In [20]:
geneve.head()

Unnamed: 0,District,Neighborhood,Number,BFS-Code,Area,Population2015,PopDens,Latitude,Longitude
0,Cité,Cité – Centre,1,6621001.0,1.06,6720,6339.622642,Cité,Cité
1,Cité,St.Gervais – Chantepoulet,2,6621002.0,0.47,4474,9519.148936,St.Gervais,St.Gervais
2,Cité,Délices – Grottes – Montbrillant,3,6621003.0,0.68,13806,20302.941176,Délices,Délices
3,Cité,Pâquis,4,6621004.0,0.42,10650,25357.142857,Pâquis,Pâquis
4,Plainpalais,Champel,11,6621011.0,1.8,17968,9982.222222,Champel,Champel
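Splitting on the dash may leave a trailing space on the first name (for example "Cité " instead of "Cité"). A sketch of the same extraction with vectorized string methods and an explicit `strip`, shown on sample names only:

```python
import pandas as pd

names = pd.Series([
    "Cité – Centre",
    "Délices – Grottes – Montbrillant",
    "Pâquis",
])

# Take the part before the first en dash and remove surrounding whitespace
first = names.str.split("–").str[0].str.strip()
print(first.tolist())
```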


### Getting the coordinates

We first define functions to get the coordinates:

In [21]:
def getlat(row,city,state):
    address="{}, {}, {}".format(row,city,state)
    coord=geolocator.geocode(address)
    latitude=coord.latitude
    return latitude

def getlng(row,city,state):
    address="{}, {}, {}".format(row,city,state)
    coord=geolocator.geocode(address)
    longitude=coord.longitude
    return longitude

And functions to add the coordinates to a dataframe:

In [22]:
def add_lat(df,col,city,state):
    df[col]=df[col].apply(lambda x: getlat(x,city,state))
    return df

def add_lng(df,col,city,state):
    df[col]=df[col].apply(lambda x: getlng(x,city,state))
    return df
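The functions above query the geocoder twice per neighborhood (once for latitude, once for longitude) and raise an `AttributeError` if a name is not found, because `geocode` then returns `None`. A minimal sketch of a single lookup that handles missing results is given below. The helper `lookup_coordinates` and the stand-in `fake_geocode` are hypothetical, and the coordinates in the stand-in are placeholder values, not real data. In the notebook one would pass `geolocator.geocode` instead, possibly wrapped in geopy's `RateLimiter` to space out the requests.

```python
from types import SimpleNamespace


def lookup_coordinates(names, city, country, geocode):
    """Return one (latitude, longitude) tuple per name, or (None, None) if not found.

    `geocode` is any callable that takes an address string and returns an object
    with .latitude and .longitude attributes, or None when nothing is found.
    """
    results = []
    for name in names:
        location = geocode("{}, {}, {}".format(name, city, country))
        if location is None:
            results.append((None, None))
        else:
            results.append((location.latitude, location.longitude))
    return results


def fake_geocode(address):
    """Offline stand-in for geolocator.geocode; the coordinates are placeholders."""
    known = {
        "Pâquis, Genf, Switzerland": SimpleNamespace(latitude=1.0, longitude=2.0),
    }
    return known.get(address)


coords = lookup_coordinates(
    ["Pâquis", "Unknown place"], "Genf", "Switzerland", fake_geocode
)
print(coords)
```

Calling the geocoder once per name halves the number of requests, and the `(None, None)` entries make it easy to see which names still need to be adjusted.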

I now add the coordinates to the dataframes:

In [23]:
zurich=add_lat(zurich,"Latitude","Zurich","Switzerland")
zurich=add_lng(zurich,"Longitude","Zurich","Switzerland")
geneve=add_lat(geneve,"Latitude","Genf","Switzerland")
geneve=add_lng(geneve,"Longitude","Genf","Switzerland")

### Selecting the columns for each dataframe

We now have two dataframes:

In [24]:
zurich.head()

Unnamed: 0,District,Neighborhood,BFS-Code,Area(km^2),Population2018,Population2013,Population2005,Foreigns (%),PopDens,Latitude,Longitude
0,Kreis 1,Rathaus,261011,0.38,3267,3194,3081,30.1,8597.368421,47.372649,8.544311
1,Kreis 1,Hochschulen,261012,0.56,664,665,695,34.3,1185.714286,47.373846,8.548613
2,Kreis 1,Lindenhof,261013,0.23,990,923,950,30.1,4304.347826,47.372996,8.540799
3,Kreis 1,City,261014,0.64,829,783,846,30.0,1295.3125,47.372943,8.535346
4,Kreis 2,Wollishofen,261021,5.75,18923,15937,15592,29.1,3290.956522,47.342427,8.530708


In [25]:
geneve.head()

Unnamed: 0,District,Neighborhood,Number,BFS-Code,Area,Population2015,PopDens,Latitude,Longitude
0,Cité,Cité – Centre,1,6621001.0,1.06,6720,6339.622642,46.201035,6.146221
1,Cité,St.Gervais – Chantepoulet,2,6621002.0,0.47,4474,9519.148936,46.210723,6.139083
2,Cité,Délices – Grottes – Montbrillant,3,6621003.0,0.68,13806,20302.941176,46.207626,6.133291
3,Cité,Pâquis,4,6621004.0,0.42,10650,25357.142857,46.219521,6.147314
4,Plainpalais,Champel,11,6621011.0,1.8,17968,9982.222222,46.19149,6.158405


We now keep only the columns that could be useful for further analysis:

In [28]:
df_zurich=zurich[["District","Neighborhood","PopDens","Latitude","Longitude"]]
df_geneve=geneve[["District","Neighborhood","PopDens","Latitude","Longitude"]]

In [29]:
df_zurich.head()

Unnamed: 0,District,Neighborhood,PopDens,Latitude,Longitude
0,Kreis 1,Rathaus,8597.368421,47.372649,8.544311
1,Kreis 1,Hochschulen,1185.714286,47.373846,8.548613
2,Kreis 1,Lindenhof,4304.347826,47.372996,8.540799
3,Kreis 1,City,1295.3125,47.372943,8.535346
4,Kreis 2,Wollishofen,3290.956522,47.342427,8.530708


In [30]:
df_geneve.head()

Unnamed: 0,District,Neighborhood,PopDens,Latitude,Longitude
0,Cité,Cité – Centre,6339.622642,46.201035,6.146221
1,Cité,St.Gervais – Chantepoulet,9519.148936,46.210723,6.139083
2,Cité,Délices – Grottes – Montbrillant,20302.941176,46.207626,6.133291
3,Cité,Pâquis,25357.142857,46.219521,6.147314
4,Plainpalais,Champel,9982.222222,46.19149,6.158405
