# Capstone Notebook

- **Cafe Help** is a Coffee Shop Supplies & Repairs business based in Melbourne. 
- **Cafe Help** has decided to expand to Sydney and would like to start with three service locations. 
- It is crucial that **Cafe Help** is located close to local cafes. 
- This notebook recommends the three best locations for **Cafe Help** to begin operations in Sydney.

#### Import, install and preprocess

In [54]:
import requests # library to handle requests
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation
import json # library to handle JSON files
import matplotlib.pyplot as plt
from collections import Counter

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans
    
# transforming json file into a pandas dataframe (top-level import, available in pandas 1.0+)
from pandas import json_normalize

!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library

print('Folium installed')
print('Libraries imported.')

Solving environment: done


  current version: 4.5.11
  latest version: 4.7.11

Please update conda by running

    $ conda update -n base -c defaults conda



# All requested packages already installed.

Solving environment: done


  current version: 4.5.11
  latest version: 4.7.11

Please update conda by running

    $ conda update -n base -c defaults conda



# All requested packages already installed.

Folium installed
Libraries imported.


In [3]:
#define the foursquare credentials we will use
CLIENT_ID = 'AQFVYFY2FD3LF2CGPOEOLTBR5LQRDZJYNXZVXMYOO34J0LDG' # your Foursquare ID
CLIENT_SECRET = 'ASZJ3CV5E0PNEB1OMRVCBMNPXB1XVPASWVHF3JMJG1EERE01' # your Foursquare Secret

VERSION = '20190820'
LIMIT = 9999999
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: AQFVYFY2FD3LF2CGPOEOLTBR5LQRDZJYNXZVXMYOO34J0LDG
CLIENT_SECRET:ASZJ3CV5E0PNEB1OMRVCBMNPXB1XVPASWVHF3JMJG1EERE01


#### Build dataframe of cafes in Sydney

We will now build a dataframe of cafes in Sydney.

**PROBLEM** - The FourSquare Search API returns at most 50 results per call. We cannot simply query all cafes within a defined radius of the Sydney CBD, as the results will always be capped at 50.

There are many more than 50 cafes in Sydney, so a different approach is required.

- We will build a grid of latitude and longitude values.
- This grid will form a "box" around the Sydney CBD.
- We will then query each point in that box in turn.
- We will use up to 500 points inside the box (as this is the daily FourSquare API limit).
- We will add each result set to a single list.
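
To illustrate why tiling helps, here is a minimal sketch in plain Python. It assumes a hypothetical `capped_search` function that stands in for any API returning at most 50 results per call (it is not the FourSquare client), and it uses synthetic venues on a unit square rather than real data.

```python
import random

CAP = 50  # assumed per-call result cap, mirroring the limit described above

def capped_search(venues, centre, radius, cap=CAP):
    """Hypothetical stand-in for a capped search API: nearest `cap` venues within `radius`."""
    cx, cy = centre
    hits = [(((x - cx) ** 2 + (y - cy) ** 2) ** 0.5, vid)
            for vid, (x, y) in venues.items()]
    hits = sorted(h for h in hits if h[0] <= radius)
    return [vid for _, vid in hits[:cap]]

random.seed(0)
# 1000 synthetic venues scattered over a unit square
venues = {i: (random.random(), random.random()) for i in range(1000)}

# one big query from the centre: capped at 50 no matter how large the radius is
single = set(capped_search(venues, (0.5, 0.5), radius=1.0))

# many small queries on a 21 x 21 grid, merged into one set of ids
steps = 20
tiled = set()
for i in range(steps + 1):
    for j in range(steps + 1):
        tiled.update(capped_search(venues, (i / steps, j / steps), radius=0.04))

print(len(single), len(tiled))
```

Merging into a set also removes the repeats that overlapping queries return, which is the same job the de-duplication step does later in this notebook.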

**CBD Points are as follows**
- West - -33.868505, 151.189767 (Go as far West as the Sydney fish markets)
- North - -33.853535, 151.208219 (Go as far North as Pylon Lookout on the Sydney Harbour Bridge)
- East - -33.873601, 151.229592 (Go as far East as Rushcutters Bay Park)
- South - -33.888811, 151.202038 (Go as far South as Cleveland St, Redfern)

**Corners of our box are as follows**
- NorthWest - -33.853535, 151.189767
- SouthWest - -33.888811, 151.189767 
- SouthEast - -33.888811, 151.229592
- NorthEast - -33.853535, 151.229592
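
The corners follow directly from the four extreme points: take the extreme latitude from North/South and the extreme longitude from East/West. A small sketch of that derivation, using only the coordinates listed above (the variable names are illustrative):

```python
# extreme points as listed above: (latitude, longitude)
extremes = {
    "West":  (-33.868505, 151.189767),
    "North": (-33.853535, 151.208219),
    "East":  (-33.873601, 151.229592),
    "South": (-33.888811, 151.202038),
}

# in the southern hemisphere the northernmost latitude is the largest (least negative)
north = max(lat for lat, _ in extremes.values())
south = min(lat for lat, _ in extremes.values())
east = max(lng for _, lng in extremes.values())
west = min(lng for _, lng in extremes.values())

corners = {
    "NorthWest": (north, west),
    "SouthWest": (south, west),
    "SouthEast": (south, east),
    "NorthEast": (north, east),
}
for name, (lat, lng) in corners.items():
    print(f"{name}: {lat}, {lng}")
```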

<img src = "https://raw.githubusercontent.com/mattingersole/Coursera_Capstone/master/CBD%20Box.png">

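**The box is roughly 4 km high** 
- we will split the Y axis into 25 intervals (26 rows of points, counting both edges)
- one point every 160 metres

**The box is roughly 3.7 km wide**
- we will split the X axis into 20 intervals (21 columns of points, counting both edges)
- one point every 185 metres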
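
The box dimensions can be sanity-checked from the corner coordinates. The sketch below uses the haversine formula with a spherical Earth of radius 6371 km, which is an approximation chosen here for illustration; the results should come out close to the rounded figures quoted above.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371000  # mean Earth radius in metres (spherical approximation)

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two (lat, lng) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

south, north = -33.888811, -33.853535
west, east = 151.189767, 151.229592

height_m = haversine_m(south, west, north, west)  # along the western edge
width_m = haversine_m(south, west, south, east)   # along the southern edge

print(f"height: {height_m:.0f} m, step: {height_m / 25:.0f} m")
print(f"width:  {width_m:.0f} m, step: {width_m / 20:.0f} m")
```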

In [7]:
# We build a new dataframe "grid" with our points

start_lat = -33.888811
end_lat = -33.853535
total_lats = 25
lat_inc = (end_lat - start_lat) / total_lats

start_long = 151.189767
end_long = 151.229592
total_long = 20
long_inc = (end_long - start_long) / total_long

# collect the points in a list first: DataFrame.append was removed in pandas 2.0
# both edges of the box are included, so this yields 26 x 21 = 546 points
points = []
for lat_points in range(total_lats + 1):
    for long_points in range(total_long + 1):
        points.append({'Latitude': start_lat + (lat_inc * lat_points),
                       'Longitude': start_long + (long_inc * long_points)})

grid = pd.DataFrame(points, columns=['Latitude', 'Longitude'])

grid.head()

| | Latitude | Longitude |
|---|---|---|
| 0 | -33.888811 | 151.189767 |
| 1 | -33.888811 | 151.191758 |
| 2 | -33.888811 | 151.193749 |
| 3 | -33.888811 | 151.195741 |
| 4 | -33.888811 | 151.197732 |
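
For comparison, the same grid can be generated without a dataframe. This is only a sketch in plain Python using `itertools.product`; the names `lat_steps`, `lng_steps` and `grid_points` are illustrative and do not appear elsewhere in the notebook.

```python
from itertools import product

start_lat, end_lat, lat_steps = -33.888811, -33.853535, 25
start_lng, end_lng, lng_steps = 151.189767, 151.229592, 20

lat_inc = (end_lat - start_lat) / lat_steps
lng_inc = (end_lng - start_lng) / lng_steps

# range(n + 1) includes both edges of the box, matching the inclusive loops above
lats = [start_lat + lat_inc * i for i in range(lat_steps + 1)]
lngs = [start_lng + lng_inc * j for j in range(lng_steps + 1)]

# row-major order: every longitude for the first latitude, then the next latitude
grid_points = list(product(lats, lngs))

print(len(grid_points))   # 26 * 21 points
print(grid_points[:2])
```

Counting both edges gives 26 x 21 = 546 query points, slightly more than the 25 x 20 = 500 that the interval counts alone would suggest.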


Show our query points on a map

In [9]:
#define our address as the centre of Sydney (Martin Place)
address = 'martin place, sydney'

geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

grid_map = folium.Map(location=[latitude, longitude], zoom_start=14) # generate map centred around Martin Place

# add the grid query points as blue circle markers
for lat, lng in zip(grid.Latitude, grid.Longitude):
    folium.features.CircleMarker(
        [lat, lng],
        radius=5,
        color='blue',
        fill = True,
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(grid_map)

# display map
grid_map

**We will now send each point to the FourSquare API and receive all cafes near each point**
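
As a side note, the query string can also be assembled from named parameters instead of one long format string, which makes it harder to swap two arguments by accident. The sketch below uses only the standard library; `build_search_url` is a hypothetical helper, the credentials are placeholders, and the parameter names are the ones that appear in this notebook's request URL.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

SEARCH_ENDPOINT = "https://api.foursquare.com/v2/venues/search"  # endpoint used in this notebook

def build_search_url(lat, lng, client_id, client_secret, version, category, radius, limit):
    """Hypothetical helper: assemble the search URL from named parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "ll": f"{lat},{lng}",
        "v": version,
        "categoryId": category,
        "radius": radius,
        "limit": limit,
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"

url = build_search_url(
    -33.888811, 151.189767,
    client_id="YOUR_CLIENT_ID",          # placeholder, not a real credential
    client_secret="YOUR_CLIENT_SECRET",  # placeholder, not a real credential
    version="20190820",
    category="4bf58dd8d48988d16d941735",
    radius=95,
    limit=50,
)

# parse the URL back to confirm each parameter landed where expected
query = parse_qs(urlsplit(url).query)
print(query["ll"], query["radius"], query["limit"])
```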

In [29]:
#Build a search query
category = '4bf58dd8d48988d16d941735' #FourSquare's cafe category ID
radius = 95 # metres; roughly half the larger spacing (185 m) between adjacent grid points

#create a dataframe to store the cafes in
cafes = pd.DataFrame()

#for every point in the grid
for lat, lng in zip(grid.Latitude, grid.Longitude):

    #Build a URL with the credentials, coordinates and search parameters already defined
    url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&categoryId={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, lat, lng, VERSION, category, radius, LIMIT)

    #Make the request and pull the results into an object
    results = requests.get(url).json()
    # assign relevant part of JSON to venues
    venues = results['response']['venues']

    # transform venues into a dataframe
    dataframe = json_normalize(venues)
    
    #append the new data frame into the cafes data frame 
    cafes = pd.concat([cafes, dataframe], ignore_index=True, sort =False)
    
cafes.head()

| | id | name | categories | referralId | hasPerk | location.address | location.lat | location.lng | location.labeledLatLngs | location.distance | location.postalCode | location.cc | location.city | location.state | location.country | location.formattedAddress | location.crossStreet | location.neighborhood | venuePage.id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 53993276498ee67090aabbd4 | Laneway | [{'id': '4bf58dd8d48988d16d941735', 'name': 'C... | v-1566261948 | False | Level 3, Wentworth Building | -33.889302 | 151.190775 | [{'label': 'display', 'lat': -33.8893018047187... | 107 | 2006.0 | AU | Darlington | NSW | Australia | [Level 3, Wentworth Building, Darlington NSW 2... | | | |
| 1 | 4c611418832fa5930937f1d3 | Azuri | [{'id': '4bf58dd8d48988d16d941735', 'name': 'C... | v-1566261948 | False | Wentworth Building, Butlin Ave | -33.889613 | 151.190521 | [{'label': 'display', 'lat': -33.8896128139542... | 113 | | AU | Darlington | NSW | Australia | [Wentworth Building, Butlin Ave, Darlington NS... | | | |
| 2 | 4b0f17f7f964a520165f23e3 | Azzuri Espresso | [{'id': '4bf58dd8d48988d1e0931735', 'name': 'C... | v-1566261948 | False | Wentworth Bldg., University of Sydney | -33.889655 | 151.191167 | [{'label': 'display', 'lat': -33.8896549887214... | 108 | 2006.0 | AU | Darlington | NSW | Australia | [Wentworth Bldg., University of Sydney (at Maz... | at Maze Cresent, cnr Butlin Ave. | | |
| 3 | 53993276498ee67090aabbd4 | Laneway | [{'id': '4bf58dd8d48988d16d941735', 'name': 'C... | v-1566261948 | False | Level 3, Wentworth Building | -33.889302 | 151.190775 | [{'label': 'display', 'lat': -33.8893018047187... | 106 | 2006.0 | AU | Darlington | NSW | Australia | [Level 3, Wentworth Building, Darlington NSW 2... | | | |
| 4 | 4b1c86fff964a520300824e3 | Parma Cucina & Bar | [{'id': '4bf58dd8d48988d16d941735', 'name': 'C... | v-1566261948 | False | 285A Crown St | -33.889236 | 151.19112 | [{'label': 'display', 'lat': -33.889236, 'lng'... | 75 | 2010.0 | AU | Surry Hills | NSW | Australia | [285A Crown St (Jane Foss Russell Bldg (Shop 3... | Jane Foss Russell Bldg (Shop 3) | | |
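
One thing worth checking is whether circles of radius 95 m, centred on grid points 160 m and 185 m apart, cover the whole box. The sketch below is plain geometry under a flat-plane assumption. It uses the rounded spacings quoted earlier rather than measured values, and it says nothing about how the FourSquare API applies `radius` internally.

```python
from math import hypot

step_y_m = 160   # spacing between grid rows, from the text above
step_x_m = 185   # spacing between grid columns, from the text above
radius_m = 95    # search radius used in the query

# the point furthest from every grid point is the centre of a grid cell,
# which sits half a diagonal away from each of its four corners
worst_case_m = hypot(step_x_m / 2, step_y_m / 2)

print(f"furthest distance from any grid point: {worst_case_m:.1f} m")
print(f"search radius covers the whole cell: {radius_m >= worst_case_m}")
```

Under these assumptions a radius of about 123 m would be needed for gap-free coverage. The `location.distance` values in the output above (for example 107 and 113) are larger than 95, which suggests the API may not treat the radius as a hard cutoff, so the practical gap may be smaller than this geometry implies.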


#### Wrangle the cafes dataframe

Let's do some more work on this data frame

In [47]:
# keep only columns that include venue name, and anything that is associated with location
filtered_columns = ['name', 'categories'] + [col for col in cafes.columns if col.startswith('location.')] + ['id']
dataframe_filtered = cafes.loc[:, filtered_columns].copy()

# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

# filter the category for each row
dataframe_filtered['categories'] = dataframe_filtered.apply(get_category_type, axis=1)

# clean column names by keeping only last term
dataframe_filtered.columns = [column.split('.')[-1] for column in dataframe_filtered.columns]

# drop unnecessary columns (take an explicit copy to avoid SettingWithCopyWarning)
syd_cafes = dataframe_filtered[['id','name', 'categories', 'address', 'city','postalCode','state','lat','lng']].copy()

# remove the duplicates created by overlapping search radii
syd_cafes = syd_cafes.drop_duplicates(subset="id", keep='first')

# set index to be the id
syd_cafes = syd_cafes.set_index('id')

syd_cafes.head()


| id | name | categories | address | city | postalCode | state | lat | lng |
|---|---|---|---|---|---|---|---|---|
| 53993276498ee67090aabbd4 | Laneway | Café | Level 3, Wentworth Building | Darlington | 2006.0 | NSW | -33.889302 | 151.190775 |
| 4c611418832fa5930937f1d3 | Azuri | Café | Wentworth Building, Butlin Ave | Darlington | | NSW | -33.889613 | 151.190521 |
| 4b0f17f7f964a520165f23e3 | Azzuri Espresso | Coffee Shop | Wentworth Bldg., University of Sydney | Darlington | 2006.0 | NSW | -33.889655 | 151.191167 |
| 4b1c86fff964a520300824e3 | Parma Cucina & Bar | Café | 285A Crown St | Surry Hills | 2010.0 | NSW | -33.889236 | 151.19112 |
| 4e6946e118a89685778fda60 | Snack Express | Café | Wentworth Building, Univetsity Of Sydney | Darlington | | NSW | -33.889411 | 151.191448 |
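
The de-duplication step matters because overlapping searches return the same venue more than once: in the raw results shown earlier, Laneway appears at both index 0 and index 3. The sketch below shows the keep-first-per-id idea in plain Python on the first five raw rows; it illustrates the logic only and is not how pandas implements `drop_duplicates`.

```python
# (id, name) pairs in the order they appear in the raw results, including one repeat
raw_rows = [
    ("53993276498ee67090aabbd4", "Laneway"),
    ("4c611418832fa5930937f1d3", "Azuri"),
    ("4b0f17f7f964a520165f23e3", "Azzuri Espresso"),
    ("53993276498ee67090aabbd4", "Laneway"),   # same id again, presumably from a neighbouring grid point
    ("4b1c86fff964a520300824e3", "Parma Cucina & Bar"),
]

# dicts preserve insertion order, so setdefault keeps the first row seen for each id
unique_by_id = {}
for venue_id, name in raw_rows:
    unique_by_id.setdefault(venue_id, name)

print(len(raw_rows), "->", len(unique_by_id))
print(list(unique_by_id.values()))
```

Keying on `id` rather than `name` is the safer choice, since different venues can share a name.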


In [48]:
#return some basic statistics on our dataframe
syd_cafes.describe(include=['object'])

| | name | categories | address | city | postalCode | state |
|---|---|---|---|---|---|---|
| count | 1561 | 1561 | 1194 | 1304 | 1113 | 1315 |
| unique | 1493 | 44 | 1102 | 41 | 24 | 8 |
| top | Toby's Estate | Café | Metcentre | Sydney | 2000 | NSW |
| freq | 7 | 1377 | 5 | 699 | 650 | 1297 |


**We now have 1561 unique cafes within our grid.**

#### Plot the cafes dataframe on a geo map

Let's plot the cafes on a map

In [52]:
#define our address as the centre of Sydney (Martin Place)
address = 'martin place, sydney'

geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

cafes_map = folium.Map(location=[latitude, longitude], zoom_start=14) # generate map centred around Martin Place

# add the cafes as red markers
for lat, lng in zip(syd_cafes.lat, syd_cafes.lng):
    folium.features.CircleMarker(
        [lat, lng],
        radius=2,
        color='red',
        fill = True,
        fill_color='red',
        fill_opacity=0.6
    ).add_to(cafes_map)

# display map
cafes_map

#### Cluster the cafes dataframe

In [57]:
# set number of clusters
kclusters = 3


X=syd_cafes.loc[:,['lat','lng']]

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(X)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int32)
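
For readers unfamiliar with k-means, the sketch below shows the basic idea (Lloyd's algorithm) in plain Python on synthetic points. It is not scikit-learn's implementation, which uses a smarter initialisation among other refinements; the naive random start used here can settle on a poor local optimum, so treat it as an illustration only.

```python
import random

def nearest(point, centres):
    """Index of the centre closest to `point` (squared Euclidean distance)."""
    return min(range(len(centres)),
               key=lambda c: (point[0] - centres[c][0]) ** 2 + (point[1] - centres[c][1]) ** 2)

def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm on 2-D points; returns (centres, labels)."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)              # naive initialisation: k random points
    for _ in range(iterations):
        # assignment step: each point joins its nearest centre
        labels = [nearest(p, centres) for p in points]
        # update step: each centre moves to the mean of its members
        new_centres = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centres.append((sum(m[0] for m in members) / len(members),
                                    sum(m[1] for m in members) / len(members)))
            else:
                new_centres.append(centres[c])   # leave an empty cluster's centre in place
        if new_centres == centres:               # nothing moved: converged
            break
        centres = new_centres
    # final assignment so the labels always match the returned centres
    return centres, [nearest(p, centres) for p in points]

# three synthetic blobs of 40 points each (not real cafe data)
rng = random.Random(1)
blobs = [(0.0, 0.0), (5.0, 0.0), (0.0, 5.0)]
points = [(bx + rng.gauss(0, 0.3), by + rng.gauss(0, 0.3)) for bx, by in blobs for _ in range(40)]

centres, labels = kmeans(points, k=3)
print(sorted((round(x, 1), round(y, 1)) for x, y in centres))
```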

In [62]:
#add our cluster labels to our dataframe
syd_cafes.insert(0, 'Cluster Labels', kmeans.labels_)
syd_cafes.head()

| id | Cluster Labels | name | categories | address | city | postalCode | state | lat | lng |
|---|---|---|---|---|---|---|---|---|---|
| 53993276498ee67090aabbd4 | 2 | Laneway | Café | Level 3, Wentworth Building | Darlington | 2006.0 | NSW | -33.889302 | 151.190775 |
| 4c611418832fa5930937f1d3 | 2 | Azuri | Café | Wentworth Building, Butlin Ave | Darlington | | NSW | -33.889613 | 151.190521 |
| 4b0f17f7f964a520165f23e3 | 2 | Azzuri Espresso | Coffee Shop | Wentworth Bldg., University of Sydney | Darlington | 2006.0 | NSW | -33.889655 | 151.191167 |
| 4b1c86fff964a520300824e3 | 2 | Parma Cucina & Bar | Café | 285A Crown St | Surry Hills | 2010.0 | NSW | -33.889236 | 151.19112 |
| 4e6946e118a89685778fda60 | 2 | Snack Express | Café | Wentworth Building, Univetsity Of Sydney | Darlington | | NSW | -33.889411 | 151.191448 |


#### Plot the clusters on a geo map

In [65]:
#define our address as the centre of Sydney (Martin Place)
address = 'martin place, sydney'

geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=14)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, cluster in zip(syd_cafes['lat'], syd_cafes['lng'], syd_cafes['Cluster Labels']):
    folium.CircleMarker(
        [lat, lon],
        radius=2,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

#### Make a recommendation to Cafe Help

Get the centres of our clusters

In [73]:
centres = pd.DataFrame(kmeans.cluster_centers_)  # columns default to 0 (latitude) and 1 (longitude)
centres.head()

| | 0 | 1 |
|---|---|---|
| 0 | -33.86757 | 151.205952 |
| 1 | -33.877151 | 151.220383 |
| 2 | -33.881541 | 151.204675 |
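
To get a feel for how far apart the three proposed locations are, the sketch below computes the pairwise great-circle distances between the cluster centres. It assumes a spherical Earth of radius 6371 km, and `centre_coords` is an illustrative name holding the values copied from the output above.

```python
from itertools import combinations
from math import asin, cos, radians, sin, sqrt

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lng) pairs, spherical Earth assumed."""
    lat1, lng1, lat2, lng2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))

# cluster centres copied from the output above: label -> (latitude, longitude)
centre_coords = {
    0: (-33.86757, 151.205952),
    1: (-33.877151, 151.220383),
    2: (-33.881541, 151.204675),
}

distances = {
    (i, j): haversine_km(centre_coords[i], centre_coords[j])
    for i, j in combinations(centre_coords, 2)
}
for (i, j), km in distances.items():
    print(f"centre {i} to centre {j}: {km:.2f} km")
```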


Plot the centres of our clusters on the cafes map to illustrate the point

In [81]:
#define our address as the centre of Sydney (Martin Place)
address = 'martin place, sydney'

geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

# create map
final_map = folium.Map(location=[latitude, longitude], zoom_start=14)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, cluster in zip(syd_cafes['lat'], syd_cafes['lng'], syd_cafes['Cluster Labels']):
    folium.CircleMarker(
        [lat, lon],
        radius=2,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(final_map)
       
# add the cluster centres as large yellow markers
for lat, lng in zip(centres[0], centres[1]):
    folium.CircleMarker(
        [lat, lng],
        radius=12,
        color='yellow',
        fill=True,
        fill_color='yellow',
        fill_opacity=0.7,
        parse_html=False).add_to(final_map)  
        
final_map

**The optimal three locations for Cafe Help's new shops are shown in the map above, along with their respective service areas**

#### - Wynyard Store
#### - Haymarket Store
#### - Darlinghurst Store