# Data Preparation

As described in the data section, to compare the areas/boroughs of London and Frankfurt we first have to get to know both cities a little better. London has a population of approximately 9 million people, covers 1,572 km$^{2}$ and is organised into the City of London and 32 boroughs.

Frankfurt, on the other hand, has a population of approximately 0.8 million people and covers only 248.31 km$^{2}$. It is organised into 46 "Stadtteile" (boroughs), but these are significantly smaller than the boroughs of London, which makes a borough-to-borough comparison difficult.

I therefore decided to compare the [London areas](https://en.wikipedia.org/wiki/List_of_areas_of_London) with the [boroughs of Frankfurt](https://de.wikipedia.org/wiki/Liste_der_Stadtteile_von_Frankfurt_am_Main), because their areas and populations are more comparable.

For example:
* London area: Barnes, area: 4.50 km$^{2}$, population: 21,218
* Frankfurt area: Ostend, area: 5.56 km$^{2}$, population: 29,171
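
As a quick sanity check on this comparison, the population density of both example areas can be computed from the figures quoted above. This is only a minimal sketch: the dictionary layout and the helper name `density` are illustrative and not part of the notebook.

```python
# Figures quoted in the text above (area in km², population in people).
areas = {
    "Barnes (London)": {"area_km2": 4.50, "population": 21_218},
    "Ostend (Frankfurt)": {"area_km2": 5.56, "population": 29_171},
}


def density(entry):
    """Return people per km² for one area entry."""
    return entry["population"] / entry["area_km2"]


densities = {name: density(entry) for name, entry in areas.items()}
for name, value in densities.items():
    print(f"{name}: {value:,.0f} people per km²")
```

Both values land in the same range (roughly 4,700 and 5,200 people per km²), which supports comparing London areas with Frankfurt Stadtteile.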

In [1]:
import pandas as pd
import numpy as np
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import folium

## Getting Areas of London

In [2]:
df = pd.read_html("https://en.wikipedia.org/wiki/List_of_areas_of_London")
df = df[1]
df.head()

Unnamed: 0,Location,London borough,Post town,Postcode district,Dial code,OS grid ref
0,Abbey Wood,"Bexley, Greenwich [7]",LONDON,SE2,20,TQ465785
1,Acton,"Ealing, Hammersmith and Fulham[8]",LONDON,"W3, W4",20,TQ205805
2,Addington,Croydon[8],CROYDON,CR0,20,TQ375645
3,Addiscombe,Croydon[8],CROYDON,CR0,20,TQ345665
4,Albany Park,Bexley,"BEXLEY, SIDCUP","DA5, DA14",20,TQ478728


In [3]:
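# Keep only three columns. "Post town" and "OS grid ref" serve as placeholders
# and are overwritten with latitude/longitude values further down.
# Note: axis must be passed by keyword in current pandas versions.
df.drop(df.columns.difference(['Location','Post town', 'OS grid ref']), axis=1, inplace=True)
df.columns = ["Area", "Latitude", "Longitude"]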
df.head()

Unnamed: 0,Area,Latitude,Longitude
0,Abbey Wood,LONDON,TQ465785
1,Acton,LONDON,TQ205805
2,Addington,CROYDON,TQ375645
3,Addiscombe,CROYDON,TQ345665
4,Albany Park,"BEXLEY, SIDCUP",TQ478728


In [4]:
# Remove the trailing "(also ...)" remark after the area name.
for i, s in enumerate(df["Area"]):
    if " (" in s:
        # Cut the string at the first " (" and write it back.
        df.at[i, "Area"] = s[: s.find(" (")]
df.head()

Unnamed: 0,Area,Latitude,Longitude
0,Abbey Wood,LONDON,TQ465785
1,Acton,LONDON,TQ205805
2,Addington,CROYDON,TQ375645
3,Addiscombe,CROYDON,TQ345665
4,Albany Park,"BEXLEY, SIDCUP",TQ478728
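
The same clean-up can probably be written without an explicit loop by using the pandas string accessor. The sketch below runs on a small made-up sample rather than the scraped table; the entry "Sample Area (also Sample)" is hypothetical and only mimics the "(also ...)" pattern.

```python
import pandas as pd

# Made-up sample rows; the real names come from the Wikipedia table.
sample = pd.DataFrame({"Area": ["Abbey Wood", "Sample Area (also Sample)", "Acton"]})

# Remove everything from the first " (" to the end of the string.
sample["Area"] = sample["Area"].str.replace(r" \(.*$", "", regex=True)

print(sample["Area"].tolist())
```

Names without a parenthesised remark are left unchanged, matching the behaviour of the loop above.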


### Adding latitude and longitude

In [5]:
# Create the geocoder once instead of once per row.
geolocator = Nominatim(user_agent="l_explorer", timeout=None)

for i, area in enumerate(df["Area"]):
    address = f"{area}, London, Great Britain"
    try:
        location = geolocator.geocode(address)
        latitude = location.latitude
        longitude = location.longitude
    except Exception:
        # No match (location is None) or the request failed: mark as missing.
        latitude = float("nan")
        longitude = float("nan")
    # print(address, latitude, longitude)
    # .at expects a single column label, not a list.
    df.at[i, "Latitude"] = latitude
    df.at[i, "Longitude"] = longitude

df.head()


Unnamed: 0,Area,Latitude,Longitude
0,Abbey Wood,51.4876,0.11405
1,Acton,51.5081,-0.273261
2,Addington,51.3586,-0.0316347
3,Addiscombe,51.3797,-0.0742821
4,Albany Park,51.4354,0.125965
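
The geocoding cell sends one request per area in quick succession, while the public Nominatim service asks clients to stay at or below one request per second. A hedged sketch of how the lookup could be throttled with geopy's `RateLimiter` follows. The helper names `geocode_names` and `make_rate_limited_geocoder` are illustrative and not part of the notebook.

```python
import math


def geocode_names(names, geocode, suffix):
    """Return {name: (lat, lon)}; unresolved names map to (nan, nan).

    `geocode` is any callable that takes an address string and returns an
    object with `.latitude` / `.longitude`, or None when nothing is found.
    """
    coords = {}
    for name in names:
        location = geocode(f"{name}, {suffix}")
        if location is None:
            coords[name] = (math.nan, math.nan)
        else:
            coords[name] = (location.latitude, location.longitude)
    return coords


def make_rate_limited_geocoder(user_agent):
    """Build a Nominatim geocoder that waits at least 1 s between requests."""
    # Imported here so the helper above can be tested without geopy or network.
    from geopy.extra.rate_limiter import RateLimiter
    from geopy.geocoders import Nominatim

    geolocator = Nominatim(user_agent=user_agent)
    return RateLimiter(geolocator.geocode, min_delay_seconds=1)
```

A call such as `geocode_names(df["Area"], make_rate_limited_geocoder("l_explorer"), "London, Great Britain")` would then return the coordinates for all areas. With its default settings, `RateLimiter` retries failed requests and returns `None` if they keep failing, so those areas end up as NaN here as well.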


## Deleting duplicates and NaN values

In [6]:
print(df.shape)
print(df.isna().any())
df.dropna(axis=0, inplace=True)  # drop rows for which geocoding returned no coordinates
print(df.isna().any())

print("Duplicates?",df.duplicated().any())
df.drop_duplicates(inplace=True)
print("Duplicates:",df.duplicated().any())
df.reset_index(inplace=True, drop=True)
df.shape

(533, 3)
Area         False
Latitude      True
Longitude     True
dtype: bool
Area         False
Latitude     False
Longitude    False
dtype: bool
Duplicates? True
Duplicates: False


(522, 3)
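
To see how many rows each clean-up step removes, the two operations can be wrapped in a small function that returns a new frame instead of modifying `df` in place. This is a sketch on made-up data; `clean_coordinates` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def clean_coordinates(frame):
    """Drop rows with missing values, then exact duplicate rows."""
    before = len(frame)
    no_nan = frame.dropna(axis=0)
    deduped = no_nan.drop_duplicates().reset_index(drop=True)
    print(f"rows: {before} -> {len(no_nan)} after dropna -> {len(deduped)} after drop_duplicates")
    return deduped


# Made-up sample, not the real table.
sample = pd.DataFrame(
    {
        "Area": ["A", "B", "B", "C"],
        "Latitude": [51.5, 51.4, 51.4, np.nan],
        "Longitude": [-0.1, 0.1, 0.1, np.nan],
    }
)
cleaned = clean_coordinates(sample)
```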

In [7]:
df.head()

Unnamed: 0,Area,Latitude,Longitude
0,Abbey Wood,51.4876,0.11405
1,Acton,51.5081,-0.273261
2,Addington,51.3586,-0.0316347
3,Addiscombe,51.3797,-0.0742821
4,Albany Park,51.4354,0.125965


In [8]:
#df.to_csv("london_areas_latlong.csv", index=False)    # uncomment if you want to save the dataframe
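
Because geocoding several hundred areas takes a while, it may be worth reading the saved CSV on later runs instead of querying the geocoder again. A minimal sketch of that idea, assuming a hypothetical helper `load_or_build` and a caller-supplied `build` function that produces the dataframe:

```python
from pathlib import Path

import pandas as pd


def load_or_build(csv_path, build):
    """Read `csv_path` if it exists; otherwise call `build()` and save the result."""
    path = Path(csv_path)
    if path.exists():
        return pd.read_csv(path)
    frame = build()
    frame.to_csv(path, index=False)
    return frame
```

Here `build` would be a function that runs the scraping and geocoding cells above and returns `df`.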

## Getting Frankfurt Stadtteile

In [9]:
# df_f means dataframe frankfurt
df_f = pd.read_html("https://de.wikipedia.org/wiki/Liste_der_Stadtteile_von_Frankfurt_am_Main")
df_f = df_f[0]
df_f.drop(df_f.tail(1).index, inplace=True)  # delete the summary row "Stadt Frankfurt am Main" because it is not a Stadtteil
df_f.tail()

Unnamed: 0,Nr.,Stadtteil,Fläche[3]in km²,Einwohner,Weiblich,Männlich,Deutsche,Ausländer,Ausländerin Prozent,Einwohnerje km²,Ortsbezirk,Stadtgebietseit,Vorherige Zugehörigkeit
41,43,Kalbach-Riedberg,,,,,,,23,3154,12 Kalbach-Riedberg,1972[7],Obertaunuskreis
42,44,Harheim,,,,,,,136,1020,14 Harheim,1972[6],Landkreis Friedberg
43,45,Nieder-Eschbach,,,,,,,231,1804,15 Nieder-Eschbach,1972[6],Landkreis Friedberg
44,46,Bergen-Enkheim,,,,,,,199,1434,16 Bergen-Enkheim,1977[8],Main-Kinzig-Kreis
45,47,Frankfurter Berg,,,,,,,262,3435,10 Nord-Ost,1910[Anm. 15],Landkreis Frankfurt[Anm. 4]


In [10]:
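# Keep only three columns. "Weiblich" and "Männlich" are empty here and serve as
# placeholders that are overwritten with latitude/longitude values further down.
# Note: axis must be passed by keyword in current pandas versions.
df_f.drop(df_f.columns.difference(['Stadtteil','Weiblich', 'Männlich']), axis=1, inplace=True)
df_f.columns = ["Borough", "Latitude", "Longitude"]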
df_f.head()

Unnamed: 0,Borough,Latitude,Longitude
0,Altstadt,,
1,Innenstadt,,
2,Bahnhofsviertel,,
3,Westend-Süd,,
4,Westend-Nord,,
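
Reusing unrelated columns as placeholders works, but it can be easier to follow when the name column is selected explicitly and the coordinate columns are created empty. The sketch below shows this on a made-up stand-in for the scraped table; only the column name `Stadtteil` is taken from the notebook.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the scraped table.
raw = pd.DataFrame(
    {"Nr.": [1, 2], "Stadtteil": ["Altstadt", "Innenstadt"], "Einwohner": [np.nan, np.nan]}
)

boroughs = (
    raw[["Stadtteil"]]
    .rename(columns={"Stadtteil": "Borough"})
    .assign(Latitude=np.nan, Longitude=np.nan)  # float columns, filled in later
)
print(boroughs.columns.tolist())
```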


### Adding latitude and longitude

In [11]:
# Create the geocoder once instead of once per row.
geolocator = Nominatim(user_agent="f_explorer", timeout=None)

for i, borough in enumerate(df_f["Borough"]):
    address = f"{borough}, Frankfurt, Germany"
    location = geolocator.geocode(address)
    latitude = location.latitude
    longitude = location.longitude
    #print(latitude)
    # .at expects a single column label, not a list.
    df_f.at[i, "Latitude"] = latitude
    df_f.at[i, "Longitude"] = longitude

#df_f.to_csv("frankfurt_boroughs_latlong.csv", index=False)    # uncomment if you want to save the dataframe
df_f.head()


Unnamed: 0,Borough,Latitude,Longitude
0,Altstadt,50.1104,8.6829
1,Innenstadt,50.113,8.67434
2,Bahnhofsviertel,50.1077,8.66868
3,Westend-Süd,50.1152,8.66227
4,Westend-Nord,50.1264,8.66792
