# Restaurant quality and profile at the top 50 airports in the world

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>
<a href="#item0">Context</a>

1. <a href="#item1">Download and Explore Dataset</a>

2. <a href="#item2">Explore Top 50 Airports by Passenger Traffic</a>

3. <a href="#item3">Analyze the Restaurants at each airport and cluster the airports</a>

 3.1. <a href="#item31">Restaurant Quality</a>
    
 3.2. <a href="#item31">Restaurant Profile</a>

4. <a href="#item4">Examine Clusters</a>
    
 4.1. <a href="#item31">Restaurant Quality</a>
    
 4.2. <a href="#item31">Restaurant Profile</a>
    
5. <a href="#item5">Report</a>

 5.1. <a href="#item31">Introduction</a>
    
 5.2. <a href="#item31">Methodology</a>

 5.3. <a href="#item31">Results</a>

 5.4. <a href="#item31">Discussion</a>
    
 5.5. <a href="#item31">Conclusion</a>

</font>
</div>

<font size = 5><a href="#item0">Context</a> <br></font>
While flying becomes a commodity and passenger traffic rises ever more people spent time at and around big airports. Apart from internal transport infrastruce, connecting public transport, new amenities such as sleep pods, sports etc stil a great opportunity to differentiate are great restaurants, hotels and shopping opportunites. Airport operators need a means to compare themselves with other top airports across the world.
The following analysis will provide a starting point by looking at the 50 airports with the largest passenger turnover per year and compare them based on how their restaurants, hotels and shops where rated by passengers. The rating data is supplied by foursquare, the airport location data is supplied by https://openflights.org/data.html and the data for the passenger traffic comes from Wikipedia (https://en.wikipedia.org/wiki/List_of_busiest_airports_by_passenger_traffic). 



In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analsysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

import xlrd

print('Libraries imported.')

Libraries imported.


## 1. Download and Explore Dataset

#### Source: https://openflights.org/data.html

In [2]:
airport_data = pd.read_csv("airports.csv")
airport_data.head()

Unnamed: 0,id,name,city,country,code,icao,latitude,longitude,altitude,offset,dst,timezone
0,1,Goroka,Goroka,Papua New Guinea,GKA,AYGA,-6.081689,145.391881,5282,10.0,U,Pacific/Port_Moresby
1,2,Madang,Madang,Papua New Guinea,MAG,AYMD,-5.207083,145.7887,20,10.0,U,Pacific/Port_Moresby
2,3,Mount Hagen,Mount Hagen,Papua New Guinea,HGU,AYMH,-5.826789,144.295861,5388,10.0,U,Pacific/Port_Moresby
3,4,Nadzab,Nadzab,Papua New Guinea,LAE,AYNZ,-6.569828,146.726242,239,10.0,U,Pacific/Port_Moresby
4,5,Port Moresby Jacksons Intl,Port Moresby,Papua New Guinea,POM,AYPY,-9.443383,147.22005,146,10.0,U,Pacific/Port_Moresby


#### Source: Wikipedia

In [3]:
top50_airports = pd.read_excel("airports_2016.xlsx")
top50_airports.head()

Unnamed: 0,rank,name,code,location,country,passengers
0,1,HartsfieldñJackson Atlanta International Airport,ATL/KATL,"Atlanta, Georgia",United States,104171935
1,2,Beijing Capital International Airport,PEK/ZBAA,"Chaoyang-Shunyi, Beijing",China,94393454
2,3,Dubai International Airport,DXB/OMDB,"Garhoud, Dubai",United Arab Emirates,83654250
3,4,Los Angeles International Airport,LAX/KLAX,"Los Angeles, California",United States,80921527
4,5,Tokyo International Airport,HND/RJTT,"Ota, Tokyo",Japan,79699762


In [4]:
def split_code(row):
    code = row.split("/")[0]
    icao = row.split("/")[1]
    return code

In [5]:
top50_airports["code_new"] = top50_airports["code"].apply(split_code)

In [6]:
top50_airports.head()

Unnamed: 0,rank,name,code,location,country,passengers,code_new
0,1,HartsfieldñJackson Atlanta International Airport,ATL/KATL,"Atlanta, Georgia",United States,104171935,ATL
1,2,Beijing Capital International Airport,PEK/ZBAA,"Chaoyang-Shunyi, Beijing",China,94393454,PEK
2,3,Dubai International Airport,DXB/OMDB,"Garhoud, Dubai",United Arab Emirates,83654250,DXB
3,4,Los Angeles International Airport,LAX/KLAX,"Los Angeles, California",United States,80921527,LAX
4,5,Tokyo International Airport,HND/RJTT,"Ota, Tokyo",Japan,79699762,HND


In [7]:
top50_airports.drop("code", axis=1, inplace=True)

In [8]:
top50_airports.rename({"code_new":"code"}, axis=1, inplace=True)

In [9]:
my_data = top50_airports.merge(airport_data, how="left", on="code")

In [10]:
my_data.head()

Unnamed: 0,rank,name_x,location,country_x,passengers,code,id,name_y,city,country_y,icao,latitude,longitude,altitude,offset,dst,timezone
0,1,HartsfieldñJackson Atlanta International Airport,"Atlanta, Georgia",United States,104171935,ATL,3682,Hartsfield Jackson Atlanta Intl,Atlanta,United States,KATL,33.636719,-84.428067,1026,-5.0,A,America/New_York
1,2,Beijing Capital International Airport,"Chaoyang-Shunyi, Beijing",China,94393454,PEK,3364,Capital Intl,Beijing,China,ZBAA,40.080111,116.584556,116,8.0,U,Asia/Chongqing
2,3,Dubai International Airport,"Garhoud, Dubai",United Arab Emirates,83654250,DXB,2188,Dubai Intl,Dubai,United Arab Emirates,OMDB,25.252778,55.364444,62,4.0,U,Asia/Dubai
3,4,Los Angeles International Airport,"Los Angeles, California",United States,80921527,LAX,3484,Los Angeles Intl,Los Angeles,United States,KLAX,33.942536,-118.408075,126,-8.0,A,America/Los_Angeles
4,5,Tokyo International Airport,"Ota, Tokyo",Japan,79699762,HND,2359,Tokyo Intl,Tokyo,Japan,RJTT,35.552258,139.779694,35,9.0,U,Asia/Tokyo


In [11]:
my_data.columns

Index(['rank', 'name_x', 'location', 'country_x', 'passengers', 'code', 'id',
       'name_y', 'city', 'country_y', 'icao', 'latitude', 'longitude',
       'altitude', 'offset', 'dst', 'timezone'],
      dtype='object')

In [12]:
my_data.drop(['rank', 'passengers', 'code',
       'name_y','country_y', 'icao',"id",
       'altitude', 'offset', 'dst', 'timezone'], axis=1, inplace=True)

In [13]:
my_data.columns = ["name", "location", "country", "city", "latitude", "longitude"]

## 2. Explore Top 50 Airports by Passenger Traffic

In [14]:
my_data.head()

Unnamed: 0,name,location,country,city,latitude,longitude
0,HartsfieldñJackson Atlanta International Airport,"Atlanta, Georgia",United States,Atlanta,33.636719,-84.428067
1,Beijing Capital International Airport,"Chaoyang-Shunyi, Beijing",China,Beijing,40.080111,116.584556
2,Dubai International Airport,"Garhoud, Dubai",United Arab Emirates,Dubai,25.252778,55.364444
3,Los Angeles International Airport,"Los Angeles, California",United States,Los Angeles,33.942536,-118.408075
4,Tokyo International Airport,"Ota, Tokyo",Japan,Tokyo,35.552258,139.779694


In [16]:
# create map of Manhattan using latitude and longitude values
map_top50 = folium.Map(location=[51.509865, -0.118092], zoom_start=2)

# add markers to map
for lat, lng, label in zip(my_data['latitude'], my_data['longitude'], my_data['name']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=3,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,).add_to(map_top50) 

folium.Marker(
    [85, -45],
    icon=DivIcon(
        icon_size=(600,200),
        icon_anchor=(100,0),
        html='<div style="font-size: 18pt">Top 50 Airports by passenger traffic</div>')
    ).add_to(map_top50)
    
map_top50

NameError: name 'DivIcon' is not defined

## 3. Analyze the Restaurants at each airport and cluster the airports

### 3.1. Restaurant Quality

In [None]:
CLIENT_ID = 'BMMZILCYQCZJE1KR2SJYMHMFLDBQBMRZOXFRN3C3AARRX113'
CLIENT_SECRET= 'C1GY15IRDEV4W0KO2U1FZXMLSE2PH33O50FB1FEVMW0G4BPI'
VERSION = '20180604'
LIMIT = 100
radius = 1000
print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

#### test for one airport

In [None]:
airport_latitude = my_data.loc[12, 'latitude'] # aiport latitude value
airport_longitude = my_data.loc[12, 'longitude'] # airport longitude value

airport_name = my_data.loc[12, 'name'] # airport name

print('Latitude and longitude values of {} and {}, {}.'.format(airport_latitude, airport_longitude, airport_name))

In [None]:
temp_lim = 30
radius=3000
search_term="restaurant"
url = 'https://api.foursquare.com/v2/venues/search?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&query={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    airport_latitude, 
    airport_longitude, 
    radius, 
    temp_lim,
    search_term)
url

In [None]:
results = requests.get(url).json()

In [None]:
# assign relevant part of JSON to venues
venues = results['response']['venues']


# tranform venues into a dataframe
dataframe = json_normalize(venues)
# keep only columns that include venue name, and anything that is associated with location
filtered_columns = ['name', 'categories'] + [col for col in dataframe.columns if col.startswith('location.')] + ['id']
dataframe_filtered = dataframe.loc[:, filtered_columns]
dataframe_filtered.head()

In [None]:
len(results["response"]['venues'])

In [None]:
results["response"]['venues'][1]

In [None]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

# filter the category for each row
dataframe_filtered['categories'] = dataframe_filtered.apply(get_category_type, axis=1)

# clean column names by keeping only last term
dataframe_filtered.columns = [column.split('.')[-1] for column in dataframe_filtered.columns]

dataframe_filtered

In [None]:
venue_id = dataframe_filtered.loc[0, 'id'] # ID of the first venue
url = 'https://api.foursquare.com/v2/venues/{}?client_id={}&client_secret={}&v={}'.format(venue_id, CLIENT_ID, CLIENT_SECRET, VERSION)
result = requests.get(url).json()

In [None]:
try:
    print(result['response']['venue']['rating'])
except:
    print('This venue has not been rated yet.')

## repeat for all airports

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=12000, LIMIT_to=20, query_term="restaurant"):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/search?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&query={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT_to,
            query_term)
   
        # make the GET request
        my_results = requests.get(url).json()["response"]["venues"]
        #print(my_results[0])
        # return only relevant information for each nearby venue
        try:
            for v in my_results:
                venues_list.append([(
                    name, 
                    lat, 
                    lng, 
                    v['id'],
                    v['name'], 
                    v['location']['lat'], 
                    v['location']['lng'],  
                    v['categories'][0]['name'])])
        except:
            pass

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['airport_name', 
                  'latitude', 
                  'longitude',
                  'Venue ID',
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']            
    
    return(nearby_venues)

In [None]:
airport_restaurants = getNearbyVenues(names=my_data['name'],
                                   latitudes=my_data['latitude'],
                                   longitudes=my_data['longitude']
                                  )
airport_restaurants.head()

In [None]:
def get_venue_rating(curr_venue_id):
    url = 'https://api.foursquare.com/v2/venues/{}?client_id={}&client_secret={}&v={}'.format(curr_venue_id, CLIENT_ID, CLIENT_SECRET, VERSION)
    result = requests.get(url).json()
    try:
        return result['response']['venue']['rating']
    except:
        return None

In [None]:
airport_restaurants["rating"] = airport_restaurants["Venue ID"].apply(get_venue_rating)

In [None]:
airport_restaurants_stripped = airport_restaurants.drop(["latitude","longitude","Venue ID", "Venue","Venue Latitude","Venue Longitude"], axis=1)

In [None]:
airport_restaurants_stripped = airport_restaurants_stripped.dropna()

In [None]:
airport_restaurants_stripped.head()

In [None]:
len(set(airport_restaurants_stripped["airport_name"]))

In [None]:
set(airport_restaurants_stripped["Venue Category"])

In [None]:
airport_restaurants_stripped.head()

In [None]:
len(airport_restaurants_stripped)

In [None]:
restaurant_df = pd.DataFrame()

In [None]:
for idx, item in enumerate(airport_restaurants_stripped.iterrows()):
    restaurant_df.loc[idx, item[1][1]] = float(item[1][2])
restaurant_df["airport_name"] = airport_restaurants_stripped["airport_name"]

In [None]:
restaurant_df.head()

In [None]:
restaurant_df_grouped = restaurant_df.groupby("airport_name").mean()

In [None]:
airport_restaurants_stripped.groupby("airport_name").mean().sort_values(by="rating", ascending=False)

In [None]:
restaurant_df_grouped = restaurant_df_grouped.fillna(0)

In [None]:
restaurant_df_grouped.head(25)

In [None]:
# set number of clusters
kclusters = 5

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(restaurant_df_grouped)

# check cluster labels generated for each row in the dataframe
set(kmeans.labels_)

In [None]:
airports_merged = my_data.copy()
len(set(restaurant_df_grouped.index))

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['name']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Best Rated Restaurant Category'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Best Rated Restaurant Category'.format(ind+1))

# create a new dataframe
restaurant_df_sorted = pd.DataFrame(columns=columns)
restaurant_df_sorted['name'] = restaurant_df_grouped.index

for ind in np.arange(restaurant_df_grouped.shape[0]):
    restaurant_df_sorted.iloc[ind, 1:] = return_most_common_venues(restaurant_df_grouped.iloc[ind, :], num_top_venues)

restaurant_df_sorted.head()


In [None]:
restaurant_df_sorted.head(50)

In [None]:
#merge airport data with most best rated venue(restaurant)

airports_merged = airports_merged.join(restaurant_df_sorted.set_index('name'), on='name')

try:
    airports_merged = airports_merged.drop("rating_mean", 1)
except:
    pass

In [None]:
restaurant_df_grouped["Cluster Label"] = kmeans.labels_

In [None]:
restaurant_df_grouped.head()

In [None]:
len(kmeans.labels_)

In [None]:
restaurant_df_grouped.reset_index(level=0, inplace=True)

In [None]:
airports_merged_wcluster = airports_merged.copy()

In [None]:
airports_merged_wcluster = airports_merged_wcluster.merge(restaurant_df_grouped, how="right",left_on="name", right_on="airport_name")

In [None]:
x = np.arange(kclusters)

ys = [i+x+(i*x)**2 for i in range(kclusters)]

colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

In [17]:
# create map
airports_clusters = folium.Map(location=[51.509865, -0.118092], zoom_start=2)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]


# add markers to the map
for lat, lng, label in zip(my_data['latitude'], my_data['longitude'], my_data['name']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=2,
        popup=label,
        color='grey',
        fill=True,
        fill_color='grey',
        fill_opacity=0.7,).add_to(airports_clusters)
    

markers_colors = []
for lat, lon, poi, cluster in zip(airports_merged_wcluster['latitude'], airports_merged_wcluster['longitude'], airports_merged_wcluster['name'], airports_merged_wcluster['Cluster Label']):
    label = folium.Popup(str(poi) + ', Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=1).add_to(airports_clusters)
 
    
    # add some info to the map to get an immediate understanding of the approach
folium.Marker(
    [84, -70],
    icon=DivIcon(
        icon_size=(600,200),
        icon_anchor=(100,0),
        html='<div style="font-size: 14pt; text-align:center">Airports clustered based on their restaurant user ratings to get an overview of performance in this area for airport operators</div>')
    ).add_to(airports_clusters)

airports_clusters

NameError: name 'kclusters' is not defined

#### The analysis shows there is one large cluster and a few that stand out from the crowd. However, a granular "restaurant analysis" based on the different  restaurant categories is biased because for some airports there are simply not enough or no rated restaurants.

#### Therefore, I will refine the approach to give airport operators or restaurant owners an unbiased impression how they rank among the top 50 airports and identify the "restaurant profile" for each airport and use it to cluster the airports.

### 3.2. Restaurant Types

In [None]:
def getNearbyVenues_explore(names, latitudes, longitudes, radius=8000, LIMIT_to=50, query_term="restaurant"):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
        try:    
            # make the GET request
            results = requests.get(url).json()["response"]['groups'][0]['items']

            # return only relevant information for each nearby venue
            venues_list.append([(
                name, 
                lat, 
                lng, 
                v['venue']['name'], 
                v['venue']['location']['lat'], 
                v['venue']['location']['lng'],  
                v['venue']['categories'][0]['name']) for v in results])
      
        except:
            venues_list.append([(
            name, 
            lat, 
            lng, 
            0, 
            0, 
            0,  
            0) for v in results])
        

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['name', 
                  'Latitude', 
                  'Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [None]:
airport_restaurants_full = getNearbyVenues_explore(names=my_data['name'],
                                   latitudes=my_data['latitude'],
                                   longitudes=my_data['longitude'],
                                   radius= 10000,
                                   LIMIT_to=50,
                                   query_term="restaurant"
                                  )

In [None]:
airport_restaurants_full.head()

In [None]:
airport_restaurants_full.shape

In [None]:
len(set(airport_restaurants_full["Venue Category"]))

In [None]:
# one hot encoding
airport_onehot = pd.get_dummies(airport_restaurants_full[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
airport_onehot['name'] = airport_restaurants_full["name"] 

# move neighborhood column to the first column
fixed_columns = [airport_onehot.columns[-1]] + list(airport_onehot.columns[:-1])
airport_onehot = airport_onehot[fixed_columns]

airport_onehot.head()

In [None]:
airport_restaurants_grouped = airport_onehot.groupby('name').mean().reset_index()
airport_restaurants_grouped.head()

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
num_top_venues = 50

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['name']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
airport_restaurants_sorted = pd.DataFrame(columns=columns)
airport_restaurants_sorted['name'] = airport_restaurants_grouped['name']

for ind in np.arange(airport_restaurants_grouped.shape[0]):
    airport_restaurants_sorted.iloc[ind, 1:] = return_most_common_venues(airport_restaurants_grouped.iloc[ind, :], num_top_venues)

airport_restaurants_sorted.head()

In [None]:
# set number of clusters
kclusters = 5

airport_restaurants_clustering = airport_restaurants_grouped.drop('name', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(airport_restaurants_clustering)

# check cluster labels generated for each row in the dataframe
set(kmeans.labels_)

In [None]:
# add clustering labels
my_data_restaurants = my_data.copy()
my_data_restaurants['Cluster Labels'] = kmeans.labels_

#merge airport data with most frequent venue(restaurant) to add latitude/longitude for each airport

my_data_restaurants = my_data_restaurants.join(airport_restaurants_sorted.set_index('name'), on='name')

try:
    my_data_restaurants = my_data_restaurants.drop("rating_mean", 1)
except:
    pass
my_data_restaurants.head()

In [None]:
x = np.arange(kclusters)

ys = [i+x+(i*x)**2 for i in range(kclusters)]

colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

In [None]:
# create map
airport_clusters = folium.Map(location=[51.509865, -0.118092], zoom_start=2)

from folium.features import DivIcon

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(my_data_restaurants['latitude'], my_data_restaurants['longitude'], my_data_restaurants['name'], my_data_restaurants['Cluster Labels']):
    label = folium.Popup(str(poi) + ', Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(airport_clusters)
    
# add some info to the map to get an immediate understanding of the approach
folium.Marker(
    [84, -70],
    icon=DivIcon(
        icon_size=(600,200),
        icon_anchor=(100,0),
        html='<div style="font-size: 14pt; text-align:center">Airports clustered based on their restaurant profile to identify gaps for airport operators or restaurants chains / owners</div>')
    ).add_to(airport_clusters)

airport_clusters

### 4. Examine Clusters

#### 4.1. Restaurant Quality

In [None]:
airports_merged_wcluster.loc[airports_merged_wcluster['Cluster Label'] == 0, airports_merged_wcluster.columns[[0] + [2] + list(range(6, airports_merged_wcluster.shape[1]))]].head(10)

In [None]:
airports_merged_wcluster.loc[airports_merged_wcluster['Cluster Label'] == 1, airports_merged_wcluster.columns[[0] + [2] + list(range(6, airports_merged_wcluster.shape[1]))]].head(10)

In [None]:
airports_merged_wcluster.loc[airports_merged_wcluster['Cluster Label'] == 2, airports_merged_wcluster.columns[[0] + [2] + list(range(6, airports_merged_wcluster.shape[1]))]].head(10)

In [None]:
airports_merged_wcluster.loc[airports_merged_wcluster['Cluster Label'] == 3, airports_merged_wcluster.columns[[0] + [2] + list(range(6, airports_merged_wcluster.shape[1]))]].head(10)

In [None]:
airports_merged_wcluster.loc[airports_merged_wcluster['Cluster Label'] == 4, airports_merged_wcluster.columns[[0] + [2] + list(range(6, airports_merged_wcluster.shape[1]))]].head(10)

In [None]:
airport_restaurants_stripped.groupby("airport_name").mean().sort_values(by="rating", ascending=False)

In [None]:
airport_restaurants_stripped.sort_values(by="rating").head()

#### 4.2. Restaurant Profile

In [None]:
def most_often_in_t10(df):
    counter_dict2 = {}
    for idx, item in enumerate(df.iterrows()):
        weight = 10 ## to account for 1st, 2nd, 3rd etc.
        for i in range(3,13):
            if item[1][i] in counter_dict2:
                counter_dict2[item[1][i]] += (1*weight)
            else:
                counter_dict2[item[1][i]] = (1*weight)
            weight -= 1
    return round(pd.Series(counter_dict2).sort_values(ascending=False)/len(df), 1).head(15)

In [None]:
my_data_restaurants.loc[my_data_restaurants['Cluster Labels'] == 0, my_data_restaurants.columns[[0] + [2] + list(range(6, 17))]].head(3)

In [None]:
len(my_data_restaurants.loc[my_data_restaurants['Cluster Labels'] == 0, my_data_restaurants.columns[[0] + [2] + list(range(6, 17))]])

In [None]:
most_often_in_t10(my_data_restaurants.loc[my_data_restaurants['Cluster Labels'] == 0, my_data_restaurants.columns[[0] + [2] + list(range(6, 17))]])

In [None]:
my_data_restaurants.loc[my_data_restaurants['Cluster Labels'] == 1, my_data_restaurants.columns[[0] + [2] + list(range(6, 17))]].head(3)

In [None]:
len(my_data_restaurants.loc[my_data_restaurants['Cluster Labels'] == 1, my_data_restaurants.columns[[0] + [2] + list(range(6, 17))]])

In [None]:
most_often_in_t10(my_data_restaurants.loc[my_data_restaurants['Cluster Labels'] == 1, my_data_restaurants.columns[[0] + [2] + list(range(6, 17))]])

In [None]:
my_data_restaurants.loc[my_data_restaurants['Cluster Labels'] == 2, my_data_restaurants.columns[[0] + [2] + list(range(6, 17))]].head(3)

In [None]:
len(my_data_restaurants.loc[my_data_restaurants['Cluster Labels'] == 2, my_data_restaurants.columns[[0] + [2] + list(range(6, 17))]])

In [None]:
most_often_in_t10(my_data_restaurants.loc[my_data_restaurants['Cluster Labels'] == 2, my_data_restaurants.columns[[0] + [2] + list(range(6, 17))]])

In [None]:
my_data_restaurants.loc[my_data_restaurants['Cluster Labels'] == 3, my_data_restaurants.columns[[0] + [2] + list(range(6, 17))]].head(10)

In [None]:
len(my_data_restaurants.loc[my_data_restaurants['Cluster Labels'] == 3, my_data_restaurants.columns[[0] + [2] + list(range(6, 17))]])

In [None]:
most_often_in_t10(my_data_restaurants.loc[my_data_restaurants['Cluster Labels'] == 3, my_data_restaurants.columns[[0] + [2] + list(range(6, 17))]])

In [None]:
my_data_restaurants.loc[my_data_restaurants['Cluster Labels'] == 4, my_data_restaurants.columns[[0] + [2] + list(range(6, 17))]].head(10)

In [None]:
len(my_data_restaurants.loc[my_data_restaurants['Cluster Labels'] == 4, my_data_restaurants.columns[[0] + [2] + list(range(6, 17))]])

In [None]:
most_often_in_t10(my_data_restaurants.loc[my_data_restaurants['Cluster Labels'] == 4, my_data_restaurants.columns[[0] + [2] + list(range(6, 17))]])

## 5. Report

### 5.1 Introduction

While flying becomes a commodity, passenger traffic rises and ever more people spent time at and around big airports the business opportunities at airports are still large. Apart from internal transport infrastruce, connecting public transport, new amenities such as sleep pods, sports etc also restaurants are stil a great opportunity to differentiate. 

The data analysis undertaken here is important for airport operators as a means to compare themselves with other top airports across the world but also for restaurant chains and owners to identify coverage gaps or shortfalls in high quality restaurants at certain airports. This analysis can support the optimization of an airports restaurant portfolio or help deciding on where to open a new restaurant and, importantly, in which category (e.g. shushi, fastfood, german).

The following analysis will provide a starting point to do this by looking at the 50 airports with the largest passenger turnover per year and compare them based on what restaurants are available and how restaurants were rated by passengers. The restaurant and rating data is supplied by foursquare, the airport location data is supplied by https://openflights.org/data.html and the data for the passenger traffic comes from Wikipedia (https://en.wikipedia.org/wiki/List_of_busiest_airports_by_passenger_traffic). 

Thus, we have three data sources:
    1. airport location data
    2. airport passenger traffic
    3. Data from the foursquare API to retrieve the different restaurants at each airport and their rating and category

### 5.2 Methodology

Firstly, the airport location data is merged with the airport passenger traffic data to generate a consistent data set of the top 50 airports and their respective locations.
Afterwards, this data is being visualized on a world map to identify which parts and cities of the world are represented and in what proportion.
The location of each airport, precisely the center of each airport, is now being used to retrieve data about restaurants in a radius of 10 kilometres, which is reasonable given the size of the respective airports. Here two different approaches are utilized:

    1. Retrieve all restaurants, their category and their ratings to measure the quality each airport has with regard to restaurants
    
    2. Retrieve popular restaurants and their category to measure the profile and coverage of diffenerent restaurant categories

In the first approach at first all restaurants in the 10km radius are being retrieved with the foursquare "search"-API. Afterwards this data is enriched with the ratings using the "venue"-API from foursquare. Only the restaurants with ratings are kept. 
After transforming the data an unsupervised machine learning technique is used to cluster the airports based on their similarity. Here the partitioning k-Means clustering algorithm is used. In order to answer the question at hand 5 clusters were chosen.

This algorithm was chosen because the idea is to find those airports that strongly differ from the majority, hence finding small clusters.
With respect to this approach, evaluating the quality of each airport, this must mean that airports in those "small" clusters perform either better than the rest or worse. Of course this holds only true under the assumption of a normal distribution respectively that the majority of airports perform average. This will be further evaluated in the results section.

Then, the data is being visualized to identify similar and distinct airport clusters.
Finally, those cluster are being examined on a high level by looking at which restaurant categories are rated best (and worst) to make recommendations for airport operators and restaurant owners what is working well at some airports but is not at others.

In the second approach, again restaurants in the 10km radius are being retrieved but this time with the foursquare "explore"-API which delivers the popular restaurant at each airport. Now this data is being transformed to identify which restaurant categories are highly represented at the respective airport and which are not. Then, the same partitioning clustering algorithm is used as in the first approach and the different clusters are being visualized on a map to find out similarities and differences between airports. Again, 5 clusters were chosen.

The partitioning cluster algorithm is chosen for similar reasons as in approach one. The difference here is that we want to find unique restaurant portfolios which are interesting to look at in comparison to the majority (small vs. large clusters).

Finally, those clusters are being examined on a high level by looking at the top 10 restaurant categories and how often they occur per cluster(results are weighted by amount of airports per cluster).

### 5.3 Results

Since two different approaches to answer the question of where an airport ranges when it comes to restaurants or where there might be potential at this airport to open up a new restaurant in a specific category also the results for those two approaches are different.

Unfortunately, the results for evaluating the quality of restaurants by looking at user ratings are biased since there is just not enough data, respectively for some airports only a few or none of the restaurants were rated. Moreover, not for all restaurant categories data was available. In addition, the "venue"-API only permits only 500 calls per day with a standard account and even then it is very slow which makes it pretty difficult to work with. Given that, even if more data were available, collecting only 10 ratings per airport would still not account for a comprehensive analysis.
Still, for 19 airports 326 rated venues were found and the clustering shows that 2 airports are quite unique (L.A. and JFK), there is one cluster with two airports and two bigger clusters. Especially, how the three small clusters compare to the bigger ones is of interest and can be find out by looking at the rating score and which restaurant categories were rated best.

Since approach number two only relied on the "explore"-API of foursquare which permits 95000 calls per day and data was readily available the results are much more useful and insightful. 
For the 50 airports 4645 restaurants or venues that contain a restaurant (such as hotels) in 379 different restaurant categories were used to cluster the airports. The assumption that many airports will be rather similar to each other and consequently are clustered together was right. Only 2 clusters comprise of 43 airports while the other 7 are split up in three different clusters.
This makes it particularly interesting to find out how they differentiate from the majority.

### 5.4 Discussion

The first part of the analysis tackles the restaurant quality. In order to make any recommendations for retaurants owners it is of particular interest to look at airports with average or bad ratings. 
This corresponds to a "red ocean strategy" where you try to push out existing competitors. This could be targeted at a single restaurant category with a weak rating e.g. Café's or American Restaurant's at Shanghai Pudong International Airport or Thai Restaurants at Suvarnabhumi Airport.It could also be targeted at an airport that in provides an average restaurant offering in general and it seems like airports such as Guangzhou Baiyun International Airport, Kuala Lumpur and Beijing Capital are worth looking at.

The second part of the analysis tackles the restaurant profile. In order to make any recommendations for retaurants owners it is of particular interest to look at airports with categories that are either not or only sparsely filled. This would correspond to a "blue ocean strategy" because it is about trying to occupy a spot with no one else has taken yet.
Especially in Cluster 0 but also in 1,2 and 3 it seems pretty unadvisable to open up a new coffee shop because there are plenty of them already while in Cluster 4 they are somewhat underrepresented compared to the others. Depending on where a restaurant owner or airport operator is located many more details could be derived from the analysis to have a starting point where new business opportunities are.

### 5.5 Conclusion

In this report airports where clustered based on the quality of their restaurants and based on the availability of restaurants in the different restaurant categories.Airport operators and restaurant owners alike can draw valuable insights from the provided analysis and get a starting point of where to look for new business opportunities.

Additionally, in line with the analysis a tool was developed that can easily serve to do the same analysis of the quality and/or profile of airports not just for restaurants but also for e.g. hotels, retail, IT or sports offerings by simply adjusting the query term of the location data provider foursquare. 

This analysis is highly dependent on the data quality of the location data provider. Therefore, it is advisable to incorporate other location data providers such as Google Places and merge the data to achieve a consistent analysis.