<h1 align="center"><font size="5">DATA SCIENCE PROJECT - The perfect weekend trip</font></h1>

<h1>Table of contents</h1>

<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ul>
        <li><a href="#ref_Problem">Problem Statement</a></li>
        <li><a href="#ref_Data">Data Description</a></li>
        <li><a href="#ref0">Importing Libraries</a></li>
        <li><a href="#ref1">Scraping Data & Creating A Dataframe of provinces</a></li>
        <li><a href="#ref2">Get Latitude And Longitude Coordinates</a></li>
        <li><a href="#ref3">Explore And Cluster The Cities</a></li>
        <ol>
            <li><a href="#ref31">Overview: Creating a map of the main cities in Italy</a></li>
            <li><a href="#ref32">Explore Veneto</a></li>
            <li><a href="#ref33">Analyze Each Province</a></li>
            <li><a href="#ref34">Cluster Provinces</a></li>
            <li><a href="#ref35">Examining Clusters</a></li>
        </ol><br>
        <li><a href="#ref_Conclusions">Conclusions</a></li>
    </ul>
</div>
<br>
<hr>

<a id="ref_Problem"></a>
# Problem Statement

You have just started working as a Junior Data Analyst at a travel agency.<br>
Since Italy is expected to be a highly sought-after travel destination after COVID-19, your manager asks you to propose a travel itinerary for busy people.<br> Ideally, the itinerary should pack several different experiences or cities into a weekend trip. Therefore, these cities should differ noticeably in the venues they offer while being geographically close to each other. <br><br>
<ul>
    <li>We want to create a weekend trip that includes 3 different cities that are close to each other</li>
    <li>The stakeholder is the agency manager</li>
    <li>The audience will be the tourists</li>
</ul>

<a id="ref_Data"></a>
# Data Description

To provide an interesting weekend trip we will use the following data:
<ul>
    <li>List of Italian provinces and regions from <a href="https://en.wikipedia.org/wiki/List_of_postal_codes_in_Italy">Wikipedia</a>.</li>
    <li>Latitude and longitude data of Italian cities from <a href="https://simplemaps.com/static/data/country-cities/it/it.csv">Simplemaps</a>.</li>
    <li>Venue data from Foursquare. Here we are interested in the 10 most common venue categories for each city in a specific region.</li>
</ul>




<a id="ref0"></a>
# Importing Libraries

In [290]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON into a pandas dataframe (top-level import, pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


<a id="ref1"></a>
# Scraping Data & Creating A Dataframe of provinces

<b>1.1 Scraping provinces data from Wikipedia</b><br>

In [291]:
wikipedia_link='https://en.wikipedia.org/wiki/List_of_postal_codes_in_Italy'
df=pd.read_html(wikipedia_link, header=0)[0]

df.head()

Unnamed: 0,Province,Code,Region,CAP capital towns,CAP other towns
0,Roma,RM,Lazio,001xx (00118 to 00199),000xx (00010 to 00069)
1,Vatican City,SCV,-,00120,-
2,Viterbo,VT,Lazio,01100,010xx (01010 to 01039)
3,Rieti,RI,Lazio,02100,020xx (02010 to 02049)
4,Frosinone,FR,Lazio,03100,030xx (03010 to 03049)


<b>1.2 Cleaning the dataframe</b><br> 
The dataframe consists of five columns, but we only need the first three.
<ul>
    <li>Some city names are in Italian and need to be translated to English</li>
    <li>Some rows in Province have no Region because they are not part of Italy, e.g. Vatican City</li>
    <li>The last two columns (<code>CAP capital towns</code> and <code>CAP other towns</code>) can be removed</li>
</ul>

In [292]:
df_clean = df.drop(["CAP capital towns", "CAP other towns"], axis=1)
df_clean = df_clean[df_clean["Region"] != "-"].copy()
# Use .loc with the original index labels to avoid chained assignment
df_clean.loc[0, "Province"] = "Rome"
df_clean.loc[41, "Province"] = "Venice"
df_clean.loc[64, "Province"] = "Florence"
df_clean.loc[90, "Province"] = "Naples"
# More names may still be in Italian; they are hard to find manually, so they are compared later
df_clean.head()


Unnamed: 0,Province,Code,Region
0,Rome,RM,Lazio
2,Viterbo,VT,Lazio
3,Rieti,RI,Lazio
4,Frosinone,FR,Lazio
5,Latina,LT,Lazio


Let's check how many rows (provinces) and columns the cleaned dataframe has


In [293]:
print('The dataframe has {} rows and {} columns'.format(df_clean.shape[0], df_clean.shape[1]))

The dataframe has 110 rows and 3 columns


<a id="ref2"></a>
# Get Latitude And Longitude Coordinates for each province's capital city

<p>
    Since I couldn't retrieve the geographical coordinates with the geocoder, I will import them from a CSV file instead.<br>
    This file is available on <a href="https://simplemaps.com/static/data/country-cities/it/it.csv">Simplemaps.com</a>
</p>

In [294]:
coord = pd.read_csv("it.csv") 
coord.head()

Unnamed: 0,city,lat,lng,country,iso2,admin,capital,population,population_proper
0,Rome,41.9,12.483333,Italy,IT,Lazio,primary,3339000.0,35452.0
1,Milan,45.466667,9.2,Italy,IT,Lombardy,admin,2945000.0,1306661.0
2,Naples,40.833333,14.25,Italy,IT,Campania,admin,2250000.0,988972.0
3,Turin,45.05,7.666667,Italy,IT,Piedmont,admin,1652000.0,865263.0
4,Florence,43.766667,11.25,Italy,IT,Tuscany,admin,1500000.0,371517.0


In [295]:
# Keep only the city name and its coordinates
coord_clean = coord.drop(coord.columns[3:], axis=1)
coord_clean.columns = ["Province", "lat", "lng"]
coord_clean.loc[56, "Province"] = "Padua"  # .loc avoids the chained-assignment warning
coord_clean.head(10)


Unnamed: 0,Province,lat,lng
0,Rome,41.9,12.483333
1,Milan,45.466667,9.2
2,Naples,40.833333,14.25
3,Turin,45.05,7.666667
4,Florence,43.766667,11.25
5,Salerno,40.683333,14.783333
6,Palermo,38.116667,13.366667
7,Catania,37.5,15.1
8,Genoa,44.416667,8.95
9,Bari,41.133333,16.85


In [151]:
coord_clean.shape

(118, 3)

In [152]:
df_IT = pd.merge(df_clean, coord_clean, on="Province", how="inner")  # join on the shared Province column

df_IT.head()


Unnamed: 0,Province,Code,Region,lat,lng
0,Rome,RM,Lazio,41.9,12.483333
1,Viterbo,VT,Lazio,42.416667,12.1
2,Rieti,RI,Lazio,42.4,12.85
3,Frosinone,FR,Lazio,41.633333,13.316667
4,Latina,LT,Lazio,41.466667,12.866667


In [297]:
# Check how many provinces were matched; the default inner merge drops names that appear in only one table
print('The dataframe has {} rows and {} columns'.format(df_IT.shape[0], df_IT.shape[1]))

The dataframe has 92 rows and 5 columns
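
The merged dataframe has 92 rows while the cleaned province list had 110, so 18 provinces found no exact name match and were dropped. A minimal sketch of how such mismatches could be listed is shown below. The names in the two lists are illustrative toy data, not the notebook's real mismatches, and `unmatched` is a hypothetical helper.

```python
# Toy data standing in for df_clean["Province"] and coord_clean["Province"];
# the specific names are illustrative only.
wikipedia_provinces = ["Rome", "Viterbo", "Padua", "Monza e Brianza"]
simplemaps_cities = ["Rome", "Viterbo", "Padua", "Monza"]


def unmatched(left, right):
    """Return the names in `left` that have no exact match in `right`, sorted."""
    right_set = set(right)
    return sorted(name for name in left if name not in right_set)


# Names that an inner merge on the name column would silently drop
missing = unmatched(wikipedia_provinces, simplemaps_cities)
print(missing)
```

With pandas itself, a left merge with `indicator=True` exposes the same information through the `_merge` column.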


<a id="ref3"></a>
# Explore And Cluster The Cities

In this section we explore and cluster the candidate cities for a weekend trip.<br>
After an overview of Italy, we will focus on a single region: Veneto<br><br>



<a id="ref31"></a>
## 1. Overview: Creating a map of the main cities in Italy 



In [304]:
# Creating a map of Italy using latitude and longitude values (I got the values from Google)
latitude = 37.9028
longitude = 12.4964
map_IT = folium.Map(location=[latitude, longitude], zoom_start=5)

# add markers to map
for lat, lng, province, region in zip(df_IT['lat'], df_IT['lng'], df_IT['Province'], df_IT['Region']):
    label = '{}, {}'.format(province, region)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_IT)  
map_IT 

<a id="ref32"></a>
## 2. Explore Veneto <br>
### Let's be a tourist in the Veneto region 


Veneto is a northeastern Italian region stretching from the Dolomite Mountains to the Adriatic Sea.<br> Venice, its regional capital, is famed for its canals, Gothic architecture and Carnival celebrations. <br>
So, it is a perfect candidate to be explored! <br><br>

We will explore the provinces through their capital cities.<br>
Then we will decide which cities are different enough to be part of a weekend trip


In [307]:
# Assigning data to variables related to Foursquare API
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (never publish the real value)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret (never publish the real value)
VERSION = '20180605' # Foursquare API version
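
Hard-coding API credentials in a notebook makes them easy to leak when the notebook is shared. One possible alternative, sketched here under assumptions, is to read them from environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are hypothetical choices for this example.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Read the Foursquare credentials from a mapping of environment variables.

    FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are hypothetical
    variable names chosen for this sketch.
    """
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError("Missing environment variable: {}".format(missing)) from None


# Example with an explicit mapping, so the sketch runs without real credentials
demo_env = {"FOURSQUARE_CLIENT_ID": "demo-id", "FOURSQUARE_CLIENT_SECRET": "demo-secret"}
demo_id, demo_secret = load_foursquare_credentials(demo_env)
print(demo_id)
```

In the notebook, calling the helper without arguments would read from the real environment.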

In [308]:
#Creating Veneto df
df_Veneto = df_IT[df_IT['Region'] == 'Veneto'].reset_index(drop=True)
df_Veneto.head(10)

Unnamed: 0,Province,Code,Region,lat,lng
0,Venice,VE,Veneto,45.438611,12.326667
1,Treviso,TV,Veneto,45.666667,12.245
2,Belluno,BL,Veneto,46.145,12.221389
3,Padua,PD,Veneto,45.416667,11.883333
4,Vicenza,VI,Veneto,45.55,11.55
5,Verona,VR,Veneto,45.45,11.0
6,Rovigo,RO,Veneto,45.066667,11.783333


So, it looks like there are 7 provinces in Veneto.<br>
The name of the province is also the name of the capital city of that province!

#### 2.1 Creating a map of Veneto and its provinces

In [310]:
# Center the map on Venice, the first row of df_Veneto
latitude_Venice = df_Veneto.loc[0, 'lat']
longitude_Venice = df_Veneto.loc[0, 'lng']
map_Venice = folium.Map(location=[latitude_Venice, longitude_Venice], zoom_start=7)

# add markers to map
for lat, lng, province, region in zip(df_Veneto['lat'], df_Veneto['lng'], df_Veneto['Province'], df_Veneto['Region']):
    label = '{}, {}'.format(province, region)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Venice)  
    
map_Venice

Looking at the map, we can see that some cities are on the water, like Venice, and some cities are close to the mountains.<br>
This could be a hint: on a weekend trip it could be nice to visit Venice and a city close to the mountains.

#### 2.2 Getting Venice's latitude and longitude values

In [313]:
neighborhood_latitude = df_Veneto.loc[0, 'lat'] # Province latitude value
neighborhood_longitude = df_Veneto.loc[0, 'lng'] # Province longitude value

neighborhood_name = df_Veneto.loc[0, 'Province'] # Province name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Venice are 45.438611, 12.326667.


#### 2.3 Getting the top 100 venues in Venice within a radius of 1000 meters.

First, let's create the GET request URL.

In [314]:
LIMIT = 100
radius = 1000
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, neighborhood_latitude, neighborhood_longitude, VERSION, radius, LIMIT)
url

'https://api.foursquare.com/v2/venues/explore?client_id=<CLIENT_ID>&client_secret=<CLIENT_SECRET>&ll=45.438611,12.326667&v=20180605&radius=1000&limit=100'
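
The request URL above is assembled with string formatting. As a hedged alternative, the same query string can be built from a dictionary with the standard library's `urlencode`, which also escapes any special characters. `build_explore_url` is a hypothetical helper and the credentials are placeholders.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, lat, lng, version, radius, limit):
    """Build the explore URL from a dict of query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "ll": "{},{}".format(lat, lng),
        "v": version,
        "radius": radius,
        "limit": limit,
    }
    # safe="," keeps the comma in the ll parameter unescaped, as in the URL above
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params, safe=",")


demo_url = build_explore_url("demo-id", "demo-secret", 45.438611, 12.326667, "20180605", 1000, 100)
print(demo_url)
```

`requests.get` also accepts such a dictionary directly through its `params` argument, which avoids building the URL by hand.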

Send the GET request and examine the results

In [315]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5ec4e1fc7828ae0028a2c1ee'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'San Polo',
  'headerFullLocation': 'San Polo, Venice',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 235,
  'suggestedBounds': {'ne': {'lat': 45.44761100900001,
    'lng': 12.339469550611614},
   'sw': {'lat': 45.42961099099999, 'lng': 12.313864449388387}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '5960d594bed483018d95a7dd',
       'name': 'Il Mercante',
       'location': {'address': 'San Polo 2564',
        'lat': 45.437286,
        'lng': 12.327226,
        'labeledLatLngs': [{'label': 'display',
          'lat': 45.437286,
          'lng': 12

We know that all the information is in the *items* key. <br>
Let's use the **get_category_type** function from the Foursquare lab.

In [316]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:  # rows produced by json_normalize use the 'venue.categories' key
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now we clean the json and structure it into a *pandas* dataframe.

In [317]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()



Unnamed: 0,name,categories,lat,lng
0,Il Mercante,Cocktail Bar,45.437286,12.327226
1,Campo dei Frari,Plaza,45.437193,12.327056
2,Pizza 2000,Pizza Place,45.4388,12.32867
3,Osteria Da Filo,Brewery,45.439548,12.327823
4,Ai Garzoti,Italian Restaurant,45.439759,12.324761
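
The dotted column names such as `venue.location.lat` come from flattening the nested JSON. The sketch below illustrates the idea with a small recursive function. It is a simplified stand-in written for this example, not the implementation of `json_normalize`, and the item only contains the fields used in this notebook.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one dict with dotted keys (lists are kept as-is)."""
    flat = {}
    for key, value in record.items():
        full_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# One item shaped like the response above, reduced to the fields we use
item = {
    "venue": {
        "name": "Il Mercante",
        "location": {"lat": 45.437286, "lng": 12.327226},
        "categories": [{"name": "Cocktail Bar"}],
    }
}
flat_item = flatten(item)
print(sorted(flat_item))
```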


In [318]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

100 venues were returned by Foursquare.


#### 2.4 Let's have an overview of each province

In [319]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Province', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
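
The function above indexes `v['venue']['categories'][0]` directly, which would raise an `IndexError` for a venue with an empty category list. A minimal sketch of a more tolerant row extraction follows. `venue_rows` is a hypothetical helper, and the second sample item is made up to show the empty-category case.

```python
def venue_rows(province, lat, lng, items):
    """Turn raw explore items into flat tuples, tolerating venues with no category."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((
            province, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


# Hypothetical items: the second venue has an empty category list
sample_items = [
    {"venue": {"name": "Il Mercante", "location": {"lat": 45.437286, "lng": 12.327226},
               "categories": [{"name": "Cocktail Bar"}]}},
    {"venue": {"name": "Unnamed spot", "location": {"lat": 45.44, "lng": 12.33},
               "categories": []}},
]
rows = venue_rows("Venice", 45.438611, 12.326667, sample_items)
print(rows[1][-1])
```

Rows with a missing category could then be dropped or labelled before the one-hot encoding step.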

In [333]:
City_venues = getNearbyVenues(names=df_Veneto['Province'],
                                   latitudes=df_Veneto['lat'],
                                   longitudes=df_Veneto['lng']
                                  )

Venice
Treviso
Belluno
Padua
Vicenza
Verona
Rovigo


In [334]:
#print(City_venues.shape)
#City_venues.head()
print('{} venues were returned by Foursquare.'.format(City_venues.shape[0]))

294 venues were returned by Foursquare.


Let's check how many venues were returned for each Province

In [337]:

City_venues.groupby('Province').count()
# City_venues

Province,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Belluno,8,8,8,8,8,8
Padua,18,18,18,18,18,18
Rovigo,10,10,10,10,10,10
Treviso,100,100,100,100,100,100
Venice,74,74,74,74,74,74
Verona,27,27,27,27,27,27
Vicenza,57,57,57,57,57,57


Let's find out how many unique categories can be curated from all the returned venues

In [336]:
print('There are {} unique categories.'.format(len(City_venues['Venue Category'].unique())))

There are 77 unique categories.


<a id="ref33"></a>
## 3. Analyze Each Province



In [375]:
# one hot encoding
City_onehot = pd.get_dummies(City_venues[['Venue Category']], prefix="", prefix_sep="")

# add province column back to dataframe
City_onehot['Province'] = City_venues['Province'] 

# move province column to the first column
fixed_columns = [City_onehot.columns[-1]] + list(City_onehot.columns[:-1])
City_onehot = City_onehot[fixed_columns]

City_onehot.head()

Unnamed: 0,Province,Art Museum,Bakery,Bar,Beach Bar,Bed & Breakfast,Beer Bar,Beer Garden,Big Box Store,Bistro,Boat or Ferry,Bookstore,Breakfast Spot,Brewery,Burger Joint,Café,Capitol Building,Castle,Cheese Shop,Church,Clothing Store,Cocktail Bar,Coffee Shop,Comic Shop,Cosmetics Shop,Cupcake Shop,Deli / Bodega,Design Studio,Dessert Shop,Diner,Electronics Store,Fish Market,Fountain,Fried Chicken Joint,Frozen Yogurt Shop,Gastropub,Gift Shop,Gourmet Shop,Gym,Historic Site,History Museum,Hotel,Ice Cream Shop,Indie Movie Theater,Italian Restaurant,Japanese Restaurant,Light Rail Station,Martial Arts Dojo,Mexican Restaurant,Museum,Park,Pastry Shop,Performing Arts Venue,Pizza Place,Platform,Plaza,Pub,Public Art,Restaurant,River,Sandwich Place,Scenic Lookout,Science Museum,Seafood Restaurant,Snack Place,Soccer Field,Soccer Stadium,Speakeasy,Street Food Gathering,Supermarket,Sushi Restaurant,Tea Room,Theater,Trattoria/Osteria,Vegetarian / Vegan Restaurant,Warehouse Store,Wine Bar,Winery
0,Venice,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Venice,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Venice,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Venice,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Venice,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [376]:
City_onehot.shape

(294, 78)

#### 3.1 Group rows by province, taking the mean frequency of occurrence of each category

In [377]:
city_grouped = City_onehot.groupby('Province').mean().reset_index()
city_grouped

Unnamed: 0,Province,Art Museum,Bakery,Bar,Beach Bar,Bed & Breakfast,Beer Bar,Beer Garden,Big Box Store,Bistro,Boat or Ferry,Bookstore,Breakfast Spot,Brewery,Burger Joint,Café,Capitol Building,Castle,Cheese Shop,Church,Clothing Store,Cocktail Bar,Coffee Shop,Comic Shop,Cosmetics Shop,Cupcake Shop,Deli / Bodega,Design Studio,Dessert Shop,Diner,Electronics Store,Fish Market,Fountain,Fried Chicken Joint,Frozen Yogurt Shop,Gastropub,Gift Shop,Gourmet Shop,Gym,Historic Site,History Museum,Hotel,Ice Cream Shop,Indie Movie Theater,Italian Restaurant,Japanese Restaurant,Light Rail Station,Martial Arts Dojo,Mexican Restaurant,Museum,Park,Pastry Shop,Performing Arts Venue,Pizza Place,Platform,Plaza,Pub,Public Art,Restaurant,River,Sandwich Place,Scenic Lookout,Science Museum,Seafood Restaurant,Snack Place,Soccer Field,Soccer Stadium,Speakeasy,Street Food Gathering,Supermarket,Sushi Restaurant,Tea Room,Theater,Trattoria/Osteria,Vegetarian / Vegan Restaurant,Warehouse Store,Wine Bar,Winery
0,Belluno,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.125,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0
1,Padua,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055556,0.0,0.111111,0.0,0.0,0.055556,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055556,0.0,0.0,0.0,0.0,0.111111,0.0,0.055556,0.055556,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055556,0.055556,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055556,0.0,0.0,0.0,0.0,0.0,0.111111,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Rovigo,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.2,0.0,0.1,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Treviso,0.01,0.01,0.05,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.01,0.0,0.0,0.02,0.11,0.0,0.0,0.0,0.0,0.03,0.01,0.01,0.01,0.01,0.0,0.0,0.0,0.0,0.01,0.02,0.01,0.01,0.0,0.01,0.0,0.01,0.01,0.0,0.02,0.0,0.0,0.04,0.0,0.1,0.02,0.0,0.0,0.01,0.01,0.0,0.0,0.01,0.05,0.0,0.1,0.0,0.0,0.03,0.0,0.01,0.0,0.0,0.02,0.01,0.0,0.0,0.01,0.0,0.01,0.01,0.01,0.01,0.03,0.01,0.01,0.07,0.02
4,Venice,0.040541,0.013514,0.040541,0.0,0.040541,0.0,0.013514,0.0,0.013514,0.0,0.0,0.0,0.013514,0.0,0.054054,0.0,0.0,0.0,0.0,0.0,0.013514,0.0,0.0,0.0,0.013514,0.013514,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013514,0.0,0.0,0.0,0.013514,0.0,0.148649,0.0,0.0,0.243243,0.0,0.0,0.0,0.0,0.0,0.0,0.013514,0.0,0.013514,0.0,0.067568,0.0,0.0,0.040541,0.0,0.013514,0.0,0.013514,0.013514,0.013514,0.0,0.0,0.0,0.013514,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.108108,0.0
5,Verona,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.148148,0.0,0.037037,0.037037,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.074074,0.0,0.185185,0.0,0.0,0.037037,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,0.037037,0.0,0.0,0.074074,0.037037,0.0,0.037037,0.0,0.0,0.037037,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,0.037037,0.0,0.0,0.037037,0.0
6,Vicenza,0.035088,0.035088,0.052632,0.017544,0.0,0.017544,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.175439,0.017544,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.035088,0.017544,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017544,0.017544,0.017544,0.017544,0.017544,0.035088,0.017544,0.087719,0.0,0.0,0.0,0.0,0.017544,0.017544,0.0,0.0,0.035088,0.0,0.052632,0.035088,0.017544,0.035088,0.0,0.035088,0.017544,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017544,0.0,0.035088,0.0,0.0,0.0,0.035088,0.017544


In [378]:
city_grouped.shape

(7, 78)
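
Taking the mean of a 0/1 column within a group is the same as dividing the number of venues in that category by the total number of venues in the province. This is why Belluno, with 8 venues, shows 0.125 for every category it has. The sketch below reproduces that calculation in plain Python on hypothetical (province, category) pairs; `category_frequencies` is an illustrative helper.

```python
from collections import Counter, defaultdict

# Hypothetical (province, category) pairs, one per venue
venues = [
    ("Rovigo", "Pizza Place"), ("Rovigo", "Pizza Place"), ("Rovigo", "Pub"),
    ("Rovigo", "Café"), ("Belluno", "Bar"), ("Belluno", "Wine Bar"),
]


def category_frequencies(pairs):
    """Return {province: {category: share of that province's venues}}."""
    counts = defaultdict(Counter)
    for province, category in pairs:
        counts[province][category] += 1
    return {
        province: {cat: n / sum(counter.values()) for cat, n in counter.items()}
        for province, counter in counts.items()
    }


freq = category_frequencies(venues)
print(freq["Rovigo"]["Pizza Place"])
```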

#### 3.2 Let's print each province along with the top 3 most common venues

In [379]:
num_top_venues = 3

for p in city_grouped['Province']:
    print("----"+p+"----")
    temp = city_grouped[city_grouped['Province'] == p].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Belluno----
                venue  freq
0                 Bar  0.12
1            Wine Bar  0.12
2  Italian Restaurant  0.12


----Padua----
                venue  freq
0               Hotel  0.11
1  Light Rail Station  0.11
2         Supermarket  0.11


----Rovigo----
            venue  freq
0     Pizza Place   0.2
1  Soccer Stadium   0.1
2             Pub   0.1


----Treviso----
                venue  freq
0                Café  0.11
1               Plaza  0.10
2  Italian Restaurant  0.10


----Venice----
                venue  freq
0  Italian Restaurant  0.24
1               Hotel  0.15
2            Wine Bar  0.11


----Verona----
                venue  freq
0  Italian Restaurant  0.19
1                Café  0.15
2          Restaurant  0.07


----Vicenza----
                venue  freq
0                Café  0.18
1  Italian Restaurant  0.09
2                 Bar  0.05




#### 3.3 Let's put that into a *pandas* dataframe <br>
First, let's write a function to sort the venues in descending order.<br>
Second, let's create the new dataframe and display the top 10 venues for each province.

In [386]:
# Defining the function
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]


num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Province']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # 4th and later positions have no special suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# Creating the new dataframe
province_venues_sorted = pd.DataFrame(columns=columns)
province_venues_sorted['Province'] = city_grouped['Province']

for ind in np.arange(city_grouped.shape[0]):
    province_venues_sorted.iloc[ind, 1:] = return_most_common_venues(city_grouped.iloc[ind, :], num_top_venues)

# Displaying the top 10 venues for each province.
province_venues_sorted

Unnamed: 0,Province,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Belluno,Soccer Stadium,Fried Chicken Joint,Bar,Wine Bar,Hotel,Supermarket,Italian Restaurant,Japanese Restaurant,Design Studio,Coffee Shop
1,Padua,Light Rail Station,Sushi Restaurant,Supermarket,Hotel,Breakfast Spot,Boat or Ferry,Gift Shop,Platform,Plaza,Indie Movie Theater
2,Rovigo,Pizza Place,Soccer Stadium,Italian Restaurant,Café,Park,Plaza,Design Studio,Pub,Dessert Shop,Diner
3,Treviso,Café,Italian Restaurant,Plaza,Wine Bar,Bar,Pizza Place,Ice Cream Shop,Restaurant,Trattoria/Osteria,Clothing Store
4,Venice,Italian Restaurant,Hotel,Wine Bar,Plaza,Café,Restaurant,Art Museum,Bed & Breakfast,Bar,Brewery
5,Verona,Italian Restaurant,Café,Ice Cream Shop,Restaurant,Soccer Field,Cheese Shop,Castle,Martial Arts Dojo,River,Scenic Lookout
6,Vicenza,Café,Italian Restaurant,Plaza,Bar,Art Museum,Wine Bar,Pub,Restaurant,Sandwich Place,Ice Cream Shop


#### Some interim conclusions

Every province in Veneto except Padua has Italian Restaurant among its top 10 most common venues.<br>
However, Treviso and Vicenza are quite similar, so we will visit only one of the two

<a id="ref34"></a>
## 4. Cluster Provinces

#### 4.1 Run *k*-means to cluster the provinces into 3 clusters.

In [387]:
# set number of clusters
kclusters = 3

Veneto_clustering = city_grouped.drop('Province', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(Veneto_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 2, 1, 1, 0, 1, 1])
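
To make the clustering step less of a black box, here is a minimal sketch of the basic *k*-means loop (Lloyd's algorithm) in plain Python. It is not scikit-learn's implementation, which by default uses a more careful initialization (k-means++). The points and the initial centroids below are hypothetical.

```python
def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    labels = []
    for _ in range(iterations):
        labels = [min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
                  for p in points]
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Hypothetical 2-D frequency vectors (e.g. share of cafés, share of hotels)
points = [[0.10, 0.00], [0.12, 0.01], [0.00, 0.15], [0.01, 0.14], [0.50, 0.50]]
initial = [list(points[0]), list(points[2]), list(points[4])]
labels, centers = kmeans(points, initial)
print(labels)
```

With only seven provinces and 77 sparse features, the resulting clusters are sensitive to the choice of `kclusters` and `random_state`, so they are best read as a rough grouping.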

#### 4.2 Creating a new dataframe that includes the cluster as well as the top 10 venues for each province.

In [388]:
# add clustering labels
province_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
df_Veneto2 = pd.merge(df_Veneto, province_venues_sorted)

df_Veneto2.drop(["Code", "Region"], axis=1, inplace=True) 
df_Veneto2


Unnamed: 0,Province,lat,lng,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Venice,45.438611,12.326667,0,Italian Restaurant,Hotel,Wine Bar,Plaza,Café,Restaurant,Art Museum,Bed & Breakfast,Bar,Brewery
1,Treviso,45.666667,12.245,1,Café,Italian Restaurant,Plaza,Wine Bar,Bar,Pizza Place,Ice Cream Shop,Restaurant,Trattoria/Osteria,Clothing Store
2,Belluno,46.145,12.221389,0,Soccer Stadium,Fried Chicken Joint,Bar,Wine Bar,Hotel,Supermarket,Italian Restaurant,Japanese Restaurant,Design Studio,Coffee Shop
3,Padua,45.416667,11.883333,2,Light Rail Station,Sushi Restaurant,Supermarket,Hotel,Breakfast Spot,Boat or Ferry,Gift Shop,Platform,Plaza,Indie Movie Theater
4,Vicenza,45.55,11.55,1,Café,Italian Restaurant,Plaza,Bar,Art Museum,Wine Bar,Pub,Restaurant,Sandwich Place,Ice Cream Shop
5,Verona,45.45,11.0,1,Italian Restaurant,Café,Ice Cream Shop,Restaurant,Soccer Field,Cheese Shop,Castle,Martial Arts Dojo,River,Scenic Lookout
6,Rovigo,45.066667,11.783333,1,Pizza Place,Soccer Stadium,Italian Restaurant,Café,Park,Plaza,Design Studio,Pub,Dessert Shop,Diner


#### 4.3 Visualizing the resulting clusters

In [389]:
# create map
latitude = df_Veneto2['lat'][0]
longitude = df_Veneto2['lng'][0]
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=9)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(df_Veneto2['lat'], df_Veneto2['lng'], df_Veneto2['Province'], df_Veneto2['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color= rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

<a id="ref35"></a>
## 5. Examining Clusters 

#### Cluster 0

In [390]:
df_Veneto2.loc[df_Veneto2['Cluster Labels'] == 0, df_Veneto2.columns[[0] + list(range(4, df_Veneto2.shape[1]))]]

Unnamed: 0,Province,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Venice,Italian Restaurant,Hotel,Wine Bar,Plaza,Café,Restaurant,Art Museum,Bed & Breakfast,Bar,Brewery
2,Belluno,Soccer Stadium,Fried Chicken Joint,Bar,Wine Bar,Hotel,Supermarket,Italian Restaurant,Japanese Restaurant,Design Studio,Coffee Shop


#### Cluster 1

In [391]:
df_Veneto2.loc[df_Veneto2['Cluster Labels'] == 1, df_Veneto2.columns[[0] + list(range(4, df_Veneto2.shape[1]))]]

Unnamed: 0,Province,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Treviso,Café,Italian Restaurant,Plaza,Wine Bar,Bar,Pizza Place,Ice Cream Shop,Restaurant,Trattoria/Osteria,Clothing Store
4,Vicenza,Café,Italian Restaurant,Plaza,Bar,Art Museum,Wine Bar,Pub,Restaurant,Sandwich Place,Ice Cream Shop
5,Verona,Italian Restaurant,Café,Ice Cream Shop,Restaurant,Soccer Field,Cheese Shop,Castle,Martial Arts Dojo,River,Scenic Lookout
6,Rovigo,Pizza Place,Soccer Stadium,Italian Restaurant,Café,Park,Plaza,Design Studio,Pub,Dessert Shop,Diner


#### Cluster 2

In [392]:
df_Veneto2.loc[df_Veneto2['Cluster Labels'] == 2, df_Veneto2.columns[[0] + list(range(4, df_Veneto2.shape[1]))]]

Unnamed: 0,Province,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,Padua,Light Rail Station,Sushi Restaurant,Supermarket,Hotel,Breakfast Spot,Boat or Ferry,Gift Shop,Platform,Plaza,Indie Movie Theater


<a id="ref_Conclusions"></a> 
# Conclusions 

We carried out this analysis to identify cities that could be visited together during a weekend trip. The initial request was to find cities that differ noticeably in the venues they offer while being geographically close to each other.<br>

Looking at the table and map presented above we can see how cities/Provinces are clustered:
<ul>
    <li>Cluster 0: Venice, Belluno</li>
    <li>Cluster 1: Verona, Vicenza, Treviso, Rovigo</li>
    <li>Cluster 2: Padua</li>
</ul><br>

We want to visit one city from each cluster. Since Cluster 2 contains only one city, Padua, it becomes the first city on our weekend trip. <br>

In Cluster 0 there are two cities, Venice and Belluno. Venice is much closer to Padua, so Venice becomes the second city on our weekend trip. <br>

In Cluster 1 we have four cities. Treviso has the shortest combined distance to Padua and Venice (Vicenza is somewhat closer to Padua alone, but much farther from Venice), so Treviso is the third city on our weekend trip. <br>
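
The distance arguments above can be checked numerically. The sketch below computes great-circle (haversine) distances from the coordinates in the Veneto dataframe and picks the Cluster 1 city with the smallest combined distance to Padua and Venice. Straight-line distance is only a proxy for travel time, and the helper names are illustrative.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lng) pairs."""
    lat1, lng1, lat2, lng2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


# Coordinates copied from the Veneto dataframe above
cities = {
    "Venice": (45.438611, 12.326667), "Treviso": (45.666667, 12.245),
    "Belluno": (46.145, 12.221389), "Padua": (45.416667, 11.883333),
    "Vicenza": (45.55, 11.55), "Verona": (45.45, 11.0), "Rovigo": (45.066667, 11.783333),
}


def combined_distance(city, anchors=("Padua", "Venice")):
    """Sum of the distances from `city` to each anchor city."""
    return sum(haversine_km(cities[city], cities[a]) for a in anchors)


cluster_1 = ["Treviso", "Vicenza", "Verona", "Rovigo"]
closest = min(cluster_1, key=combined_distance)
print(closest)
```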
