### Applied Data Science Capstone by IBM/Coursera


## Capstone Project -  Week 4 - Questions 1&2
<br></br>
<br></br>
## Topic: The Highest Quality Borough for Tourists in Berlin 



## Table of contents
* [Introduction](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Results](#results)
* [Discussion](#discussion)
* [Conclusion](#conclusion)

## Introduction <a name="introduction"></a>

**Berlin** is a great city for tourists. It is not a surprise that it is ranked as the most visited German city by <a href="https://www.worldatlas.com/articles/the-10-most-visited-cities-in-germany.html" target="_blank" rel="noopener">Worldatlas</a> with 31.1 milion tourists for 2016. 

However, as a tourist, you may find it **diffucult** to find out in which part of the city you should spend most of your time while being in Berlin. The city is relatively big. It is consisted by 891,8 km² of urbanized area, which is divided in 12 boroughs, full of great places to visit. The question I will try to answer is **which borough has the highest variaty of high quality tourist venues.** 

The main audience which may benefit by solving the problem above is the **one-day tourists.** It will be extremely valuable for them to know in which district should they spend their only day in Berlin in order to get the best out of it.

## Data <a name="data"></a>

Based on definition of the problem, factors that will influence the solution are:

* geolocation of the Berlin boroughs
* number of unique categories of venues in the borough
* average rating of the venues measured by the number of likes

In order to get this data, I decided to use three data providers:

* the list of the names of boroughs will be obtained from **Wikipedia**
* the geolocation latitute and longitute of the boroughs in Berlin will be obtained using **Google Maps API**
* number of unique categories of venues in the district and average rating of the venues measured by the number of likes will be obtained using **Foursquare API**

### Obtaining the names of the Berlin boroughs from Wikipedia



#### Installing the necessary libriaries

In [1]:
! pip install pandas
import pandas as pd

from pandas.io.html import read_html

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim 

import requests
from pandas.io.json import json_normalize 

!conda install -c conda-forge folium=0.5.0 --yes
import folium

print("Libraries installed")

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Libraries installed


#### Obtaining the data from Wikipedia

In [2]:
page = "https://en.wikipedia.org/wiki/Boroughs_and_neighborhoods_of_Berlin"

wikitables = read_html(page, attrs={"class":"wikitable"})
print("Exracted {num} wikitables".format(num=len(wikitables)))

Exracted 13 wikitables


#### Converting the Wikitable into a Pandas DataFrame

In [3]:
df1 = pd.DataFrame(wikitables[1])
df2 = pd.DataFrame(wikitables[2])
df3 = pd.DataFrame(wikitables[3])
df4 = pd.DataFrame(wikitables[4])
df5 = pd.DataFrame(wikitables[5])
df6 = pd.DataFrame(wikitables[6])
df7 = pd.DataFrame(wikitables[7])
df8 = pd.DataFrame(wikitables[8])
df9 = pd.DataFrame(wikitables[9])
df10 = pd.DataFrame(wikitables[10])
df11 = pd.DataFrame(wikitables[11])
df12 = pd.DataFrame(wikitables[12])

#### Concatenating all wikitables into one Pandas DataFrame 

In [4]:
district_list = pd.concat([df1, df2, df3, df4, df5, df6, df7, df8, df9, df10, df11, df12], ignore_index=True)
district_list.head()

Unnamed: 0,Locality,Area in km²,Population as of 2008,Density inhabitants per km²,Map
0,(0101) Mitte,10.7,79582,7445,
1,(0102) Moabit,7.72,69425,8993,
2,(0103) Hansaviertel,0.53,5889,11111,
3,(0104) Tiergarten,5.17,12486,2415,
4,(0105) Wedding,9.23,76363,8273,


#### Removing the unnecessary columns and renaming the main column

In [5]:
dropped_df = district_list.drop(columns=['Population as of 2008', 'Density inhabitants per km²', 'Map'], axis=1)
dropped_df.head()

Unnamed: 0,Locality,Area in km²
0,(0101) Mitte,10.7
1,(0102) Moabit,7.72
2,(0103) Hansaviertel,0.53
3,(0104) Tiergarten,5.17
4,(0105) Wedding,9.23


#### Renaming the Locality column into District

In [6]:
renamed_df = dropped_df.rename(columns={"Locality": "District"})
renamed_df.head()

Unnamed: 0,District,Area in km²
0,(0101) Mitte,10.7
1,(0102) Moabit,7.72
2,(0103) Hansaviertel,0.53
3,(0104) Tiergarten,5.17
4,(0105) Wedding,9.23


#### Removing the codes from the District names

In [7]:
renamed_df['District'] = pd.DataFrame(renamed_df['District'].str[7:])
renamed_df.head()

Unnamed: 0,District,Area in km²
0,Mitte,10.7
1,Moabit,7.72
2,Hansaviertel,0.53
3,Tiergarten,5.17
4,Wedding,9.23


#### Importing the Geospatial data of the Districts

In [8]:
import pandas as pd
geo_data = pd.read_csv('/Users/kirilyunakov/Downloads/Berlin_District_Coordinates.csv')
geo_data.head()

Unnamed: 0,Locale,Lat,Long
0,Mitte,52.519444,13.406667
1,Moabit,52.533333,13.333333
2,Hansaviertel,52.516667,13.338889
3,Tiergarten,52.516667,13.366667
4,Wedding,52.55,13.366667


#### Renaming the column based on which will perform the merge (District)

In [9]:
renamed_geo_data = geo_data.rename(columns={'Locale': 'District'})
renamed_geo_data.head()

Unnamed: 0,District,Lat,Long
0,Mitte,52.519444,13.406667
1,Moabit,52.533333,13.333333
2,Hansaviertel,52.516667,13.338889
3,Tiergarten,52.516667,13.366667
4,Wedding,52.55,13.366667


#### Merging the Geospatial data with the data from Wikipedia based on the District name

In [10]:
result = pd.merge(renamed_df, renamed_geo_data, on='District', how='left')
result.head()

Unnamed: 0,District,Area in km²,Lat,Long
0,Mitte,10.7,52.519444,13.406667
1,Moabit,7.72,52.533333,13.333333
2,Hansaviertel,0.53,52.516667,13.338889
3,Tiergarten,5.17,52.516667,13.366667
4,Wedding,9.23,52.55,13.366667


In [11]:
neighborhoods = result

In [12]:
address = 'Berlin'

geolocator = Nominatim(user_agent="tor_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Berlin are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Berlin are 52.5170365, 13.3888599.


In [13]:
map_berlin = folium.Map(location=[latitude, longitude], zoom_start=10.5)

for lat, lng, neighborhood in zip(neighborhoods['Lat'], neighborhoods['Long'], neighborhoods['District']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_berlin)  
    
map_berlin

#### Displaying the distribution

In [14]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="whitegrid")
sns.boxplot(y='Area in km²', data=neighborhoods)

<matplotlib.axes._subplots.AxesSubplot at 0x1a18609eb8>

#### 5-10km2 is the size of the 50% of the districts
This will allows us to set a appropriate perimater for our radius of scan.

#### Define Foursquare Credentials and Version

In [15]:
CLIENT_ID = 'X0UR5GR2EL3FBUPRBTPQW0M1XOECT3RPNNPUXEPACXEPO44J' 
CLIENT_SECRET = 'ANFDRV2DKSZG1J453PD3JW0V4DHM4FDKLR5UHFVDEW5IYFZO' 
VERSION = '20180605' 

print('My credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

My credentails:
CLIENT_ID: X0UR5GR2EL3FBUPRBTPQW0M1XOECT3RPNNPUXEPACXEPO44J
CLIENT_SECRET:ANFDRV2DKSZG1J453PD3JW0V4DHM4FDKLR5UHFVDEW5IYFZO


In [16]:
neighborhood_latitude = neighborhoods.loc[0, 'Lat'] 
neighborhood_longitude = neighborhoods.loc[0, 'Long'] 

neighborhood_name = neighborhoods.loc[0, 'District'] 

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Mitte are 52.51944399999999, 13.406667.


In [36]:
LIMIT = 50
radius = 500

section = "sights"
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius,
    section,
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=X0UR5GR2EL3FBUPRBTPQW0M1XOECT3RPNNPUXEPACXEPO44J&client_secret=ANFDRV2DKSZG1J453PD3JW0V4DHM4FDKLR5UHFVDEW5IYFZO&v=20180605&ll=52.51944399999999,13.406667&radius=500&limit=sights'

In [37]:
new_results = requests.get(url).json()

In [38]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [41]:
venues = new_results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues)

filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
nearby_venues.head()

KeyError: 'groups'

In [40]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

50 venues were returned by Foursquare.


In [29]:
def getNearbyVenues(names, latitudes, longitudes, radius=50):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        section = "sights"    
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius,
            section,
            LIMIT)
            
        # make the GET request
        new_results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in new_results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['District', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [30]:
berlin_venues = getNearbyVenues(names=neighborhoods['District'],
                                   latitudes=neighborhoods['Lat'],
                                   longitudes=neighborhoods['Long']
                                  )

Mitte


KeyError: 'groups'

In [None]:
print(berlin_venues.shape)
berlin_venues.tail()

In [None]:
berlin_venues.groupby('District').count()

In [None]:
print('There are {} uniques categories.'.format(len(berlin_venues['Venue Category'].unique())))