### Applied Data Science Capstone by IBM/Coursera


## Capstone Project - Week 4 - Questions 1 & 2
## Topic: The Highest Quality Borough for Tourists in Berlin 



## Table of contents
* [Introduction](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Results](#results)
* [Discussion](#discussion)
* [Conclusion](#conclusion)

## Introduction <a name="introduction"></a>

**Berlin** is a great city for tourists. It is no surprise that <a href="https://www.worldatlas.com/articles/the-10-most-visited-cities-in-germany.html" target="_blank" rel="noopener">Worldatlas</a> ranks it as the most visited German city, with 31.1 million tourists in 2016.

However, as a tourist, you may find it **difficult** to decide in which part of the city to spend most of your time. Berlin is relatively large: it covers 891.8 km² of urbanized area, divided into 12 boroughs full of great places to visit. The question I will try to answer is **which borough has the highest variety of high-quality tourist venues.**

The main audience that may benefit from an answer to this question is **one-day tourists.** It will be extremely valuable for them to know in which district they should spend their only day in Berlin to get the most out of it.

## Data <a name="data"></a>

Based on the definition of the problem, the factors that will influence the solution are:

* geolocation of the Berlin boroughs
* number of unique categories of venues in the borough
* average rating of the venues measured by the number of likes

In order to get this data, I decided to use three data providers:

* the list of borough names will be obtained from **Wikipedia**
* the geolocation (latitude and longitude) of the boroughs in Berlin will be obtained using the **Google Maps API**
* the number of unique venue categories in each district and the average rating of the venues, measured by the number of likes, will be obtained using the **Foursquare API**
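
As a rough illustration of how the two venue-based factors could be computed once the venue data is available, here is a minimal sketch. The `venues` table, its column names (`borough`, `category`, `likes`) and its values are hypothetical placeholders, not real Foursquare results.

```python
import pandas as pd

# Hypothetical venue records; real ones would come from the Foursquare API.
venues = pd.DataFrame({
    "borough": ["Mitte", "Mitte", "Mitte", "Pankow", "Pankow"],
    "category": ["Museum", "Museum", "Park", "Cafe", "Park"],
    "likes": [120, 80, 40, 30, 50],
})

# One row per borough: variety (distinct categories) and quality (mean likes).
borough_scores = (
    venues.groupby("borough")
    .agg(unique_categories=("category", "nunique"),
         avg_likes=("likes", "mean"))
    .reset_index()
)
print(borough_scores)
```

Keeping variety and quality as separate columns leaves open how the two are later combined into a single ranking.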

### Obtaining the names of the Berlin boroughs from Wikipedia



#### Installing the necessary libraries

In [1]:
! pip install pandas
import pandas as pd
print("Libraries installed")

Libraries installed


#### Obtaining the data from Wikipedia

In [2]:
# Wikipedia page listing Berlin's boroughs and their localities
page = "https://en.wikipedia.org/wiki/Boroughs_and_neighborhoods_of_Berlin"

# pd.read_html returns one DataFrame per <table class="wikitable"> on the page
wikitables = pd.read_html(page, attrs={"class": "wikitable"})
print("Extracted {num} wikitables".format(num=len(wikitables)))

Extracted 13 wikitables
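
Picking tables by position breaks if the Wikipedia page gains or loses a table. A possible alternative, sketched here, is to keep only the tables that have a `Locality` column. The helper `select_locality_tables` is a hypothetical name, and the first toy table stands in for any table without that column. The real page layout is not verified here.

```python
import pandas as pd

def select_locality_tables(tables, required_column="Locality"):
    """Keep only the DataFrames that contain the required column."""
    return [t for t in tables if required_column in t.columns]

# Toy stand-ins for the list returned by read_html.
tables = [
    pd.DataFrame({"Borough": ["Mitte"]}),  # hypothetical table without "Locality"
    pd.DataFrame({"Locality": ["(0101) Mitte"], "Area in km²": [10.7]}),
    pd.DataFrame({"Locality": ["(0102) Moabit"], "Area in km²": [7.72]}),
]

locality_tables = select_locality_tables(tables)
print(len(locality_tables))
```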


#### Converting the Wikitable into a Pandas DataFrame

In [3]:
# Tables 1 to 12 are the ones used here (table 0 is skipped)
borough_tables = [pd.DataFrame(table) for table in wikitables[1:13]]

#### Concatenating all wikitables into one Pandas DataFrame 

In [4]:
# Stack the 12 tables and renumber the rows from 0
district_list = pd.concat(borough_tables, ignore_index=True)
district_list.head()

Unnamed: 0,Locality,Area in km²,Population as of 2008,Density inhabitants per km²,Map
0,(0101) Mitte,10.7,79582,7445,
1,(0102) Moabit,7.72,69425,8993,
2,(0103) Hansaviertel,0.53,5889,11111,
3,(0104) Tiergarten,5.17,12486,2415,
4,(0105) Wedding,9.23,76363,8273,


#### Removing the unnecessary columns

In [5]:
# `columns=` already targets columns, so no `axis` argument is needed
dropped_df = district_list.drop(columns=['Population as of 2008', 'Density inhabitants per km²', 'Map'])
dropped_df.head()

Unnamed: 0,Locality,Area in km²
0,(0101) Mitte,10.7
1,(0102) Moabit,7.72
2,(0103) Hansaviertel,0.53
3,(0104) Tiergarten,5.17
4,(0105) Wedding,9.23


#### Renaming the Locality column to District

In [13]:
renamed_df = dropped_df.rename(columns={"Locality": "District"})
renamed_df.head()

Unnamed: 0,District,Area in km²
0,(0101) Mitte,10.7
1,(0102) Moabit,7.72
2,(0103) Hansaviertel,0.53
3,(0104) Tiergarten,5.17
4,(0105) Wedding,9.23


#### Removing the codes from the District names

In [14]:
# Drop the 7-character code prefix, e.g. "(0101) ", from each name
renamed_df['District'] = renamed_df['District'].str[7:]
renamed_df.head()

Unnamed: 0,District,Area in km²
0,Mitte,10.7
1,Moabit,7.72
2,Hansaviertel,0.53
3,Tiergarten,5.17
4,Wedding,9.23
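
Slicing off a fixed number of characters only works while every code has the same width. As a hedged alternative, a regular expression can strip any leading parenthesized number. The `districts` Series below is a small stand-in built from the names shown above.

```python
import pandas as pd

# Stand-in for the District column before cleaning.
districts = pd.Series(["(0101) Mitte", "(0102) Moabit", "(0103) Hansaviertel"])

# Remove a leading "(digits)" code and any whitespace that follows it.
cleaned = districts.str.replace(r"^\(\d+\)\s*", "", regex=True)
print(cleaned.tolist())
```

Names without a code pass through unchanged, which makes this version safe to run more than once.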


#### Importing the Geospatial data of the Districts

In [8]:
# Load the district coordinates from a local CSV file (path is specific to the author's machine)
geo_data = pd.read_csv('/Users/kirilyunakov/Downloads/BERLIN.csv')
geo_data.head()

Unnamed: 0,Locale,Lat,Long
0,Mitte,52.519444,13.406667
1,Moabit,52.533333,13.333333
2,Hansaviertel,52.516667,13.338889
3,Tiergarten,52.516667,13.366667
4,Wedding,52.55,13.366667


#### Renaming the merge key column (Locale) to District


In [10]:
renamed_geo_data = geo_data.rename(columns={'Locale': 'District'})
renamed_geo_data.head()

Unnamed: 0,District,Lat,Long
0,Mitte,52.519444,13.406667
1,Moabit,52.533333,13.333333
2,Hansaviertel,52.516667,13.338889
3,Tiergarten,52.516667,13.366667
4,Wedding,52.55,13.366667


#### Merging the Geospatial data with the data from Wikipedia based on the District name

In [11]:
result = pd.merge(renamed_df, renamed_geo_data, on='District', how='left')

result.head()

Unnamed: 0,District,Area in km²,Lat,Long
0,Mitte,10.7,52.519444,13.406667
1,Moabit,7.72,52.533333,13.333333
2,Hansaviertel,0.53,52.516667,13.338889
3,Tiergarten,5.17,52.516667,13.366667
4,Wedding,9.23,52.55,13.366667
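
A left merge keeps every district even when no coordinates match, which leaves `NaN` in `Lat` and `Long`. One way to spot such rows, sketched here on toy data, is the `indicator=True` option of `pd.merge`. The frames below are small stand-ins, with `Wedding` deliberately left out of the coordinates to show a miss.

```python
import pandas as pd

# Toy stand-ins for the Wikipedia table and the coordinates table.
districts = pd.DataFrame({"District": ["Mitte", "Moabit", "Wedding"]})
coords = pd.DataFrame({
    "District": ["Mitte", "Moabit"],
    "Lat": [52.519444, 52.533333],
    "Long": [13.406667, 13.333333],
})

# indicator=True adds a "_merge" column telling where each row came from.
checked = pd.merge(districts, coords, on="District", how="left", indicator=True)

# Districts with no matching coordinates are marked "left_only".
missing = checked.loc[checked["_merge"] == "left_only", "District"].tolist()
print(missing)
```

An empty `missing` list would suggest that every district name matched exactly, including spelling and whitespace.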
