<h1><center><b>Guilty Pleasures After Dark - Where to Set Up Your Eatery in Night-Time Sydney</b></center></h1>

<h2><center>A Data-Driven Case Study by Ossama Mughal</center></h2>

<h2>Introduction to Applied Data Science Capstone</h2>

This project uses location data providers to solve a business problem. The data will be wrangled, processed, and fed to machine learning models to uncover patterns and answer the problem. The project follows the data science methodology, with practical data analysis conducted to reach a conclusion.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. [Introduction/Business Problem](#introduction-business-problem)

2. [Data](#data)

3. [Methodology](#methodology)

4. [Results](#results)

5. [Discussion](#discussion)    
    
6. [Conclusion](#conclusion)  
</font>
</div>

<a name="introduction-business-problem"></a>
## 1. Introduction/Business Problem

Since the Sydney Lockout Laws legislation was introduced in 2014, the night-time economy of Australia's most famous city has become an interesting subject of study. The laws enforce 1.30AM lockouts and 3AM last drinks at bars, pubs, and clubs in Sydney's entertainment precincts (such as Kings Cross) to deter alcohol- and drug-fuelled violence, and these businesses have suffered a decrease in trade and demand. Given pedestrian traffic dropped by 40% in Kings Cross, falling from a Saturday peak of 5,590 per hour between 1AM and 2AM in 2010 to a Saturday peak of 3,888 between 12AM and 1AM in 2015 (http://www.theguardian.com/news/datablog/2016/feb/11/sydneys-lockout-laws-five-key-facts-about-the-citys-alcohol-debate), other night-time facility types, such as eateries, are under greater pressure to pick up the slack. A Deloitte Access Economics report, ImagineSydney: Play, finds that the city's night-time economy is underperforming and losing $16bn a year, further stating that 'a number of sectors would need to expand their night-time services, ranging from restaurants and bars to arts and culture, entertainment and fitness centres' (https://www.news.com.au/finance/money/costs/sydney-is-losing-out-on-16-billion-a-year-due-to-an-underdeveloped-nighttime-economy/news-story/5fab6b8bd90e41bafdab5c7438ba2e3b).

A greater demand for these other services creates an opportunity for increased market share. Investors, particularly those interested in rising fads of international desserts such as bubble tea, can capitalise on the dynamic shift in Sydney's night-time activities by assessing the factors that have contributed to successful night-time restaurants/cafes. Indeed, factors such as parking space proximity and public transport accessibility are now critical to an eatery business's success in light of the Lockout Laws and the decrease in accessibility of these facilities (e.g. less frequent trains, more expensive parking during the lockout period) (https://www.thebalancesmb.com/choosing-restaurant-location-2888543). It is therefore critical that an investor can discern patterns in these details for existing night-time eateries, and apply them to predict the decisions most likely to make a new business successful.

The **business problem**, then, is defined as determining the ideal location(s) to set up a successful night-time eatery, based on factors including private/public transport accessibility, the street locations of existing successful eateries, and general facility features (e.g. Wi-Fi, outdoor seating). Foursquare's location API services will be used to obtain late-night venue details, such as coordinates, ratings, and restaurant features, and these details will be investigated for patterns and groupings. The **stakeholder** of this business problem is an investor looking to capitalise on the night-time economy's shifting demand towards eateries by opening a new eatery business, using the patterns discerned in the aforementioned factors to improve its chance of success.

<a name="data"></a>
## 2. Data

From a functional perspective, the data used to solve the business problem can be categorised into three areas:
<ol>
    <li>Late-Night Sydney Venues Details</li>
    <li>Parking Facility (Private Transport) Details</li>
    <li>Public Transport Details</li>
</ol>
The sub-sections below briefly detail these data areas from a technical perspective, as well as examples of the datasets.

<h3>Late-Night Sydney Venues Details</h3>
A Foursquare Developer API will be used to retrieve details about Sydney's venues in the late-night eatery category. The request parameters will define a query whose JSON response contains venue details such as name, coordinates, restaurant ratings, and restaurant features. The response fields that will be used for this business problem will be stored in a dataframe, and are as follows:

<table style="width:10%">
<tr>
    <th>API response attribute</th>
    <th>Field Name</th>
    <th>Field Description/ Relevance</th>
</tr>
<tr>
    <td>id</td>
    <td>Venue ID</td>
    <td>The Foursquare ID of the venue</td>
</tr>
<tr>
    <td>name</td>
    <td>Venue Name</td>
    <td>The venue name</td>
</tr>
<tr>
    <td>categories</td>
    <td>Venue Category</td>
    <td>The venue categories. This field will be filtered for late-night eateries.</td>
</tr>
<tr>
    <td>hasPerk</td>
    <td>Venue Feature Indicator</td>
    <td>Indicates whether the venue contains features (e.g. Wi-Fi, outdoor seating). This field will help interpret the significance of restaurant perks in an eatery's success alongside accessibility.</td>
</tr>
<tr>
    <td>location.address</td>
    <td>Venue Address</td>
    <td>The street address of the venue</td>
</tr>
<tr>
    <td>location.lat</td>
    <td>Venue Latitude Coordinate</td>
    <td>The latitude coordinate of the venue</td>
</tr>
<tr>
    <td>location.lng</td>
    <td>Venue Longitude Coordinate</td>
    <td>The longitude coordinate of the venue</td>
</tr>
<tr>
    <td>location.state</td>
    <td>Venue State</td>
    <td>The state of the venue</td>
</tr>
<tr>
    <td>location.country</td>
    <td>Venue Country</td>
    <td>The country of the venue</td>
</tr>
<tr>
    <td>location.formattedAddress</td>
    <td>Venue Formatted Address</td>
    <td>The overall address of the venue (ie. including state and country)</td>
</tr>
<tr>
    <td>delivery.id</td>
    <td>Delivery Vendor ID</td>
    <td>The Foursquare ID of the delivery vendor. This field will be transformed to a boolean value, and used to help interpret its significance as a restaurant perk in an eatery's success alongside accessibility</td>
</tr>
</table>

These properties will be used in conjunction with private/public transport data to train and test the clustering model, which will cluster the venues into groups based on their attribute values. Hence, this data will be critical in assessing the ideal locations to place a new late-night eatery.

An example dataframe of one restaurant's details, retrieved using the Foursquare API and processed for attributes of interest, is output below:

In [3]:
#Import relevant libraries
import os # library to read credentials from environment variables
import requests # library to handle requests
import json # library to handle JSON files
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
from pandas import json_normalize # transforms nested JSON into a flat pandas dataframe

#Set request parameter values
search_query = 'Bay Vista Gateau'
radius = 500
# Credentials are read from environment variables rather than hard-coded;
# the variable names below are assumptions, so set them to match your own setup
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
latitude = -33.961275
longitude = 151.155991
VERSION = '20180604'
LIMIT = 30

url = 'https://api.foursquare.com/v2/venues/search'
params = {
    'client_id': CLIENT_ID,
    'client_secret': CLIENT_SECRET,
    'll': '{},{}'.format(latitude, longitude),
    'v': VERSION,
    'query': search_query,
    'radius': radius,
    'limit': LIMIT,
}

# call API (requests URL-encodes the parameters) and store response values
results_venues = requests.get(url, params=params).json()

# assign relevant part of JSON to venues
venues = results_venues['response']['venues']

# transform venues into a dataframe
df_venues = json_normalize(venues)

# filter columns for relevant fields; reindex fills fields missing from the response with NaN
filtered_col_venues = ['id', 'name', 'categories','hasPerk','location.address','location.lat','location.lng','location.state','location.country','location.formattedAddress','delivery.id','location.crossStreet']
df_venues_filtered = df_venues.reindex(columns=filtered_col_venues)

df_venues_filtered


Unnamed: 0,id,name,categories,hasPerk,location.address,location.lat,location.lng,location.state,location.country,location.formattedAddress,delivery.id,location.crossStreet
0,4b05875ef964a5201b8e22e3,Bay Vista Gateau,"[{'id': '4bf58dd8d48988d1d0941735', 'name': 'D...",False,83 The Grand Pde.,-33.961268,151.156108,NSW,Australia,"[83 The Grand Pde., Brighton-Le-Sands NSW 2216...",,


<h4>Usage</h4>

This dataset will have its **categories** column filtered to show only venues that are open late at night in Sydney CBD and are considered eateries under various labels.

The distance from each venue's coordinates to the private/public transport locations in the respective datasets will be calculated. Locations within a certain radius of the venue will be flagged and included in a count for that venue.
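The proximity count could be sketched as follows. This is a minimal illustration rather than the notebook's final implementation: the haversine formula, the function names `haversine_m` and `count_within_radius`, and the radius values are assumptions chosen for the example. The venue coordinates come from the Foursquare example output above, and the two station coordinates come from the public transport example output later in this section.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def count_within_radius(venue, locations, radius_m=500):
    """Count the (lat, lon) locations lying within radius_m metres of the venue."""
    return sum(haversine_m(*venue, *loc) <= radius_m for loc in locations)


# Venue coordinates (lat, lon) of Bay Vista Gateau
venue = (-33.961268, 151.156108)
# Allawah Station and Arncliffe Station as (lat, lon)
stations = [(-33.96958391, 151.11433), (-33.93645619, 151.1472995)]

nearby_500m = count_within_radius(venue, stations, radius_m=500)
nearby_3km = count_within_radius(venue, stations, radius_m=3000)
```

The haversine distance treats the Earth as a sphere, which should be accurate enough at the scale of a few hundred metres; the radius itself remains a modelling choice to be tuned.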

The **hasPerk** and **delivery.id** values will be combined using OR logic, as either one is treated as a restaurant feature for the sake of the business problem.
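One way this combination might look in pandas is sketched below. The sample rows and the output column name `has_feature` are hypothetical; only the `hasPerk` and `delivery.id` column names come from the field table above.

```python
import pandas as pd

# Hypothetical sample rows; column names follow the Foursquare fields listed above
df = pd.DataFrame({
    'name': ['Venue A', 'Venue B', 'Venue C'],
    'hasPerk': [False, True, False],
    'delivery.id': [None, None, '12345'],
})

# delivery.id becomes True when a delivery vendor ID is present
has_delivery = df['delivery.id'].notna()

# eq(True) maps missing hasPerk values to False instead of propagating NaN
has_perk = df['hasPerk'].eq(True)

# A venue counts as having a feature if either indicator is True
df['has_feature'] = has_perk | has_delivery
```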

The value of using this dataset is two-fold: the venues will be grouped into clusters using their details to showcase dissimilarities, and the dataset will enable a correlation analysis of private/public transport accessibility against the success (i.e. rating) of a venue. The venues in each cluster can then be investigated to understand how the clusters differ, and ultimately to form a conclusion on ideal locations (i.e. locations near existing successful venues in these clusters) to set up a new late-night eatery.

<h3>Parking Facility (Private Transport) Details</h3>

An [OpenData Transport for NSW (tfNSW) API](https://opendata.transport.nsw.gov.au/node/306/exploreapi#/offstreetparking) will be used to retrieve details about Sydney's off-street parking, covering over 20,000 off-street parking spaces provided by major parking providers. The response will contain parking details in JSON format, such as building name, coordinates, total number of bays, and address.
Given the API response is restricted to a compressed file content type and limited to 5 calls a day, the GeoJSON file has been uploaded directly to the GitHub repository for simplicity, and is accessed using the *urllib* library.

The response fields that will be used for this business problem will be stored in a dataframe, and are as follows:

<table style="width:20%">
<tr>
    <th>API response attribute</th>
    <th>Field Name</th>
    <th>Field Description/Relevance</th>
</tr>
<tr>
    <td>geometry.coordinates</td>
    <td>Parking Space Coordinates</td>
    <td>The longitude and latitude coordinates of the parking space (GeoJSON lists longitude first)</td>
</tr>
<tr>
    <td>properties.FRD_ID</td>
    <td>Parking Space ID</td>
    <td>The tfNSW ID for the parking space</td>
</tr>
<tr>
    <td>properties.Building_name_location</td>
    <td>Parking Space Building Name</td>
    <td>The name of the CBD building associated with the parking space.</td>
</tr>
<tr>
    <td>properties.Street_Number_GPS</td>
    <td>Parking Space Street Number</td>
    <td>The street number of the parking space</td>
</tr>
<tr>
    <td>properties.Street_Name_GPS</td>
    <td>Parking Space Street Name</td>
    <td>The street name of the parking space</td>
</tr>
<tr>
    <td>properties.Suburb_GPS</td>
    <td>Parking Space Suburb</td>
    <td>The suburb of the parking space</td>
</tr>
<tr>
    <td>properties.State</td>
    <td>Parking Space State</td>
    <td>The state of the parking space</td>
</tr>
<tr>
    <td>properties.Postcode</td>
    <td>Parking Space Postal Code</td>
    <td>The postal code of the parking space</td>
</tr>
<tr>
    <td>properties.Total_number_of_bays</td>
    <td>Total Number of Parking Space Bays</td>
    <td>The total number of bays (unavailable/available) in the parking space. A larger space may compensate for the parking space being located slightly further from the eatery, and hence maintain accessibility for the individual. Therefore, this data field will be considered for training the clustering model.</td>
</tr>
</table>

These properties will be used in conjunction with public transport data and late-night eatery venue data to train and test the clustering model, which will cluster the venues into groups based on their attribute values. Hence, this data will be critical in assessing the ideal locations to place a new late-night eatery.

An example dataframe of some parking spaces' details, retrieved using the GeoJSON file provided by the OpenData tfNSW API and processed for attributes of interest, is output below:

In [8]:
import json
import urllib.request

with urllib.request.urlopen('https://raw.githubusercontent.com/ozzie-mughal/Coursera_Capstone/master/OffstreetparkingData.geojson') as geojson_offstreetparking:
    results_offstreetparking = json.loads(geojson_offstreetparking.read().decode())

# assign relevant part of the GeoJSON (the list of features) to offstreetparking
offstreetparking = results_offstreetparking['features']

# transform parking features into a dataframe
df_offstreetparking = json_normalize(offstreetparking)

# filter columns for relevant fields
filtered_col_offstreetparking = ['geometry.coordinates','properties.FRD_ID','properties.Building_name_location','properties.Street_Number_GPS','properties.Street_Name_GPS','properties.Suburb_GPS','properties.State','properties.Postcode','properties.Total_number_of_bays']
df_offstreetparking_filtered = df_offstreetparking.loc[:,filtered_col_offstreetparking]

df_offstreetparking_filtered.head()

Unnamed: 0,geometry.coordinates,properties.FRD_ID,properties.Building_name_location,properties.Street_Number_GPS,properties.Street_Name_GPS,properties.Suburb_GPS,properties.State,properties.Postcode,properties.Total_number_of_bays
0,"[151.212657, -33.86218]",FRD-001,93 Macquarie Street,2,Albert St,Sydney,NSW,2000,94.0
1,"[151.209166, -33.861878]",FRD-002,Gold Fields House,18,Pitt St,Sydney,NSW,2000,114.0
2,"[151.209767, -33.8644]",FRD-003,No. 1 OConnell St Car Park,3,Bent St,Sydney,NSW,2000,100.0
3,"[151.211869, -33.865398]",FRD-004,The Chifley Tower,27,Bent St,Sydney,NSW,2000,362.0
4,"[151.2107, -33.865404]",FRD-005,Sofitel Wentworth Hotel,15,Bligh St,Sydney,NSW,2000,173.0


<h4>Usage</h4>

This dataset will have its coordinates column split into separate columns, adjusted to represent latitude and longitude values appropriately.
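A possible way to split the coordinates column is sketched below, using two rows copied from the example output above. The new column names `longitude` and `latitude` are assumptions; the ordering relies on the GeoJSON convention of storing each point as longitude first, then latitude, which matches the sample values.

```python
import pandas as pd

# Two rows copied from the parking example output above
df = pd.DataFrame({
    'properties.FRD_ID': ['FRD-001', 'FRD-002'],
    'geometry.coordinates': [[151.212657, -33.86218], [151.209166, -33.861878]],
})

# GeoJSON stores each point as [longitude, latitude]
coords = pd.DataFrame(
    df['geometry.coordinates'].tolist(),
    columns=['longitude', 'latitude'],
    index=df.index,
)

# Replace the list-valued column with the two numeric columns
df = df.drop(columns='geometry.coordinates').join(coords)
```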

The distance between each parking space and each venue in the venue dataset will then be calculated, and the parking space flagged if it lies within a certain radius of the venue. A categorical variable will be added to the venue dataset to summarise these proximity calculations (e.g. 'Very accessible' if the number of parking spaces within the radius exceeds 10). The variable will be one-hot encoded to prepare the dataset as input to the clustering model.
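The categorisation and one-hot encoding might be sketched as below. The counts are hypothetical, and apart from the "more than 10" rule for 'Very accessible' mentioned in the text, the thresholds, labels, and column names (`parking_count`, `parking_access`) are assumptions for illustration.

```python
import pandas as pd

# Hypothetical counts of parking spaces within the chosen radius of each venue
venues = pd.DataFrame({
    'name': ['Venue A', 'Venue B', 'Venue C'],
    'parking_count': [0, 4, 12],
})

# Assumed thresholds: 0 = not accessible, 1-10 = accessible, >10 = very accessible
bins = [-1, 0, 10, float('inf')]
labels = ['Not accessible', 'Accessible', 'Very accessible']
venues['parking_access'] = pd.cut(venues['parking_count'], bins=bins, labels=labels)

# One-hot encode the categorical variable for use as clustering features
encoded = pd.get_dummies(venues, columns=['parking_access'], dtype=int)
```

Each venue ends up with one indicator column per accessibility level, which keeps the categorical variable from being read as an ordered numeric scale by a distance-based clustering model.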

The value of using this dataset is two-fold: it will provide a predictor for grouping venues into clusters, under the assumption that more successful venues have better access to parking spaces, and it will enable a correlation analysis of parking space accessibility against the success (i.e. rating) of a venue.

<h3>Public Transport Details</h3>

An [OpenData Transport for NSW (tfNSW) API](https://opendata.transport.nsw.gov.au/node/320/exploreapi) will be used to retrieve details about Sydney's public transport locations, including train stations, ferry wharves, and bus interchanges. Information on associated facilities (e.g. bicycle racks, commuter car parks) will also be available. The response will contain transport location details in JSON format, such as types of transport, coordinates, facilities available, and address.
Given the API response is restricted to a compressed file content type and limited to 5 calls a day, the CSV file has been uploaded directly to the GitHub repository for simplicity, and is accessed using the *pandas* library.

The response fields that will be used for this business problem will be stored in a dataframe, and are as follows:

<table style="width:20%">
<tr>
    <th>API response attribute</th>
    <th>Field Name</th>
    <th>Field Description/Relevance</th>
</tr>
<tr>
    <td>LOCATION_NAME</td>
    <td>Public Transport Location Name</td>
    <td>Location name of the Station/Wharf/Light Rail/Bus Interchange</td>
</tr>
<tr>
    <td>TSN</td>
    <td>Transit Stop Number</td>
    <td>Transit Stop Number (TSN) ID for each stop at the Station/Wharf/Light Rail/Bus Interchange</td>
</tr>
<tr>
    <td>X_COORD</td>
    <td>Public Transport Location Longitude</td>
    <td>Longitude coordinate of the Station/Wharf/Light Rail/Bus Interchange</td>
</tr>
<tr>
    <td>Y_COORD</td>
    <td>Public Transport Location Latitude</td>
    <td>Latitude coordinate of the Station/Wharf/Light Rail/Bus Interchange</td>
</tr>
<tr>
    <td>ADDRESS</td>
    <td>Public Transport Location Street Name</td>
    <td>Street Name of the Station/Wharf/Light Rail/Bus Interchange</td>
</tr>
<tr>
    <td>TRANSPORT</td>
    <td>Public Transport Type(s)</td>
    <td>The type of transport mode(s) available at the location</td>
</tr>
</table>

These properties will be used in conjunction with private transport data and late-night eatery venue data to train and test the clustering model, which will cluster the venues into groups based on their attribute values. Hence, this data will be critical in assessing the ideal locations to place a new late-night eatery.

An example dataframe of some public transport location details, retrieved using the CSV file provided by the OpenData tfNSW API and processed for attributes of interest, is output below:

In [7]:
results_publictransport = pd.read_csv('https://raw.githubusercontent.com/ozzie-mughal/Coursera_Capstone/master/LocationFacilitiesData.csv')
results_publictransport.drop(['EFA_ID','FACILITIES','ACCESS','PHONE'],axis=1,inplace=True)
results_publictransport.head()

Unnamed: 0,LOCATION_NAME,TSN,X_COORD,Y_COORD,ADDRESS,TRANSPORT
0,Aberdeen Station,233610,150.8920522,-32.16710358,"New England Hwy, Aberdeen",Train
1,Adamstown Station,228920,151.7200807,-32.93372988,"Park Ave, Adamstown",Train|Bus
2,Albion Park Station,252710,150.7984997,-34.56264671,"Princes Hwy, Albion Park",Train|Bus
3,Allawah Station,222020,151.11433,-33.96958391,"Railway Pde, Allawah",Train|Bus
4,Arncliffe Station,220520,151.1472995,-33.93645619,"Firth St, Arncliffe",Train|Bus|Taxi rank


<h4>Usage</h4>

This dataset will have its rows split so that each row corresponds to a single mode of transport. Furthermore, the coordinate columns will be adjusted to represent latitude and longitude values appropriately.
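The row split could be sketched as follows, using two rows copied from the example output above. The variable name `df_modes` is an assumption, and the approach relies on the pipe character being the delimiter in the TRANSPORT column, as the sample values suggest.

```python
import pandas as pd

# Two rows copied from the public transport example output above
df = pd.DataFrame({
    'LOCATION_NAME': ['Aberdeen Station', 'Arncliffe Station'],
    'TRANSPORT': ['Train', 'Train|Bus|Taxi rank'],
})

# Split the pipe-delimited modes into lists, then emit one row per mode
df_modes = (
    df.assign(TRANSPORT=df['TRANSPORT'].str.split('|'))
      .explode('TRANSPORT')
      .reset_index(drop=True)
)
```

After the split, Arncliffe Station appears once per mode, so counts of nearby transport can be taken per mode as well as in total.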

The distance between each transport location and each venue in the venue dataset will then be calculated, and the transport location flagged if it lies within a certain radius of the venue. A categorical variable will be added to the venue dataset to summarise these proximity calculations (e.g. 'Very accessible' if the number of public transport locations within the radius exceeds 10). The variable will be one-hot encoded to prepare the dataset as input to the clustering model.

The value of using this dataset is two-fold: it will provide a predictor for grouping venues into clusters, under the assumption that more successful venues have better access to public transport, and it will enable a correlation analysis of public transport accessibility against the success (i.e. rating) of a venue.

<a name="methodology"></a>
## 3. Methodology

<a name="results"></a>
## 4. Results

<a name="discussion"></a>
## 5. Discussion

<a name="conclusion"></a>
## 6. Conclusion