# Capstone Project - The Battle of Neighborhoods

## 1. Business Problem section

### 1.1 Background

I live in Sydney, which ranks among the world's top ten cities for quality of living. Dining culture is one of the ranking criteria: Sydney is famous for its high-quality restaurants and offers dishes from all over the world.

### 1.2 Business Problem
One of my friends wants to join this market by investing in a restaurant business in inner Sydney (i.e. the City of Sydney). He asked me which suburb and which type of restaurant would be the best investment. To answer the question, we can group the inner-Sydney suburbs into clusters with similar venue profiles. The target suburb should be the one in its cluster with the fewest restaurants.

Inner Sydney consists of 29 suburbs, but our focus is on the five busiest: Sydney, The Rocks, Haymarket, Ultimo and Pyrmont.

### 1.3 Target Audience
* Business owners who want to invest in or open a restaurant. This analysis serves as a guide to starting or expanding a restaurant in an area with the least competition.
* Data scientists who want to apply some of the most widely used exploratory data analysis techniques to obtain the necessary data, analyse it and, finally, tell a story with it.

## 2. Data section

### 2.1 Getting list of suburbs

First of all, let us import all the required libraries and packages.

In [1]:
import numpy as np
import pandas as pd
import datetime as dt # Datetime
import json # library to handle JSON files

!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

print('Libraries imported.')

We can import the list of suburbs in Sydney from the ArcGIS open data portal.

In [2]:
# read the suburb list from the ArcGIS open data portal
df = pd.read_csv('https://opendata.arcgis.com/datasets/2a2b04faf74446309f7b22fd1d6651a2_0.csv')
df.head()

Unnamed: 0,FID,NAME,F2005_06,F2006_07,F2007_08,F2008_09,F2009_10,F2010_11,F2011_12,F2012_13,F2013_14,F2014_15,Shape_Leng,Shape_Area
0,1,Alexandria,179751048,179751048,180395100.0,164415400.0,154257000.0,162372200.0,163169000.0,155096161,145509600.0,147534614,10168.649178,3523771.0
1,2,Forest Lodge + Annandale,16720193,16720193,16336300.0,15537920.0,15603170.0,15768910.0,15785640.0,18743393,20187370.0,20761284,8654.226944,545770.4
2,3,Millers Point + Barangaroo,39666586,39666586,41351600.0,41601440.0,41843320.0,40595740.0,37915610.0,34786136,30142880.0,37728668,3944.508809,463478.9
3,4,Beaconsfield,8454492,8454492,10127940.0,11923960.0,12339120.0,12848150.0,12517850.0,9622120,5212607.0,5090894,1916.726468,167472.0
4,5,Camperdown,116493273,116493273,119503300.0,122507800.0,126025800.0,126707100.0,124501400.0,129747022,133478900.0,139736392,7055.860737,1072898.0


We then process the data to keep only the five busiest suburbs.

In [3]:
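# drop every column except NAME (the positional axis argument is not accepted by recent pandas)
df.drop(columns=df.columns.difference(['NAME']), inplace=True)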
df.head()

Unnamed: 0,NAME
0,Alexandria
1,Forest Lodge + Annandale
2,Millers Point + Barangaroo
3,Beaconsfield
4,Camperdown


In [4]:
keep_list = ['Sydney', 'The Rocks', 'Haymarket', 'Pyrmont', 'Ultimo']
# take a copy so that later column assignments do not write to a view of df
df2 = df[df['NAME'].isin(keep_list)].copy()

In [None]:
df2

Unnamed: 0,NAME
14,Haymarket
19,Pyrmont
24,Sydney
25,The Rocks
26,Ultimo


### 2.2 Getting Coordinates of suburbs

We can get the coordinates of these five suburbs using the Nominatim geocoder from the geopy library.

In [None]:
geolocator = Nominatim(user_agent="Sydney_explorer")
df2['City_coord'] = df2['NAME'].apply(geolocator.geocode).apply(lambda x: (x.latitude, x.longitude))

In [None]:
df2[['Latitude', 'Longitude']] = df2['City_coord'].apply(pd.Series)

In [None]:
df2 = df2.drop(columns=['City_coord'])

In [None]:
df2

Inspecting `df2` shows that the coordinates of Haymarket and Ultimo are completely wrong, because suburbs with the same names exist in other countries. I therefore replaced these coordinates with values obtained from a Google search.

In [None]:
# use .loc instead of chained indexing so the assignment reliably updates df2
df2.loc[14, 'Latitude'] = -33.8809
df2.loc[14, 'Longitude'] = 151.2029
df2.loc[26, 'Latitude'] = -33.8822
df2.loc[26, 'Longitude'] = 151.1970

In [None]:
df2
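
The ambiguity described above comes from geocoding the bare suburb name. One possible way to reduce it is to qualify each query with the city and country before sending it to the geocoder. The sketch below is only an illustration: the helper names `qualified_query` and `geocode_suburb` and the region string are assumptions, not part of the original notebook, and whether Nominatim then returns the intended location for every suburb has not been verified here.

```python
from typing import Callable, Optional, Tuple


def qualified_query(suburb: str, region: str = "Sydney NSW, Australia") -> str:
    """Append a region so the geocoder is less likely to match a same-named place elsewhere."""
    return f"{suburb}, {region}"


def geocode_suburb(suburb: str, geocode: Callable) -> Optional[Tuple[float, float]]:
    """Return (latitude, longitude), or None when the geocoder finds nothing.

    `geocode` is any callable with the same shape as geolocator.geocode:
    it takes a query string and returns an object with .latitude and
    .longitude attributes, or None.
    """
    location = geocode(qualified_query(suburb))
    if location is None:
        return None
    return (location.latitude, location.longitude)


# Intended use with the notebook's geolocator (needs network access):
# df2['City_coord'] = df2['NAME'].apply(lambda s: geocode_suburb(s, geolocator.geocode))
```

Passing the geocoding function in as an argument keeps the helper easy to check without network access, and returning `None` makes a failed lookup visible instead of raising an `AttributeError` inside `apply`.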

In [None]:
address = 'Sydney, Australia'

# Nominatim requires an explicit user agent in recent geopy versions
geolocator = Nominatim(user_agent="Sydney_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Sydney are {}, {}.'.format(latitude, longitude))

A visualisation of the five suburbs:

In [None]:
map_sydney = folium.Map(location=[latitude, longitude], zoom_start=13)

# add markers to map
for lat, lng, suburb in zip(df2['Latitude'], df2['Longitude'], df2['NAME']):
    label = folium.Popup(str(suburb), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_sydney)
    
map_sydney

Looking at the map, you can clearly see that the coordinates of the Sydney suburb are wrong (the marker sits in the water). We therefore replace them with the coordinates of Town Hall Station, which is at the centre of the Sydney CBD.

In [None]:
# use .loc instead of chained indexing so the assignment reliably updates df2
df2.loc[24, 'Latitude'] = -33.8735
df2.loc[24, 'Longitude'] = 151.2069

### 2.3 Using Foursquare Location Data

To find suitable locations based on the venues already present, we access data through the Foursquare API and arrange it in a dataframe for visualisation. Foursquare data is very comprehensive and powers location data for Apple, Uber and others. For this business problem, and as required by the assignment, I used the Foursquare API to retrieve information about the popular spots around these five suburbs.

In this analysis, I requested up to 100 popular spots within a radius of 0.5 km of each suburb's centre.
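
Because the five suburbs sit close together, it is worth checking how far apart their centres are relative to the 0.5 km search radius. The sketch below uses the haversine formula with the manually corrected coordinates for Haymarket and Ultimo from section 2.2. The function name `haversine_m` and the spherical-Earth radius are assumptions made for this illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


# Manually corrected suburb centres from section 2.2
haymarket = (-33.8809, 151.2029)
ultimo = (-33.8822, 151.1970)

distance = haversine_m(*haymarket, *ultimo)
radius = 500  # search radius in metres, as used for the Foursquare requests

# Two circles overlap when their centres are closer than the sum of their radii
circles_overlap = distance < 2 * radius
print(f"{distance:.0f} m apart; search circles overlap: {circles_overlap}")
```

With these coordinates the two centres are a little over half a kilometre apart, so the two search circles overlap and some venues may be returned for both suburbs. This is worth keeping in mind when comparing per-suburb counts.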

In [None]:
# Replace the placeholders with your own credentials; never publish real keys in a notebook
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, LIMIT = 100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Suburb', 
                  'Suburb Latitude', 
                  'Suburb Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
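
The function above indexes `categories[0]` directly, which would raise an `IndexError` for any venue returned without a category. A slightly more defensive way to flatten the items could look like the sketch below. The helper name `parse_items`, the `'Unknown'` fallback and the sample payload are assumptions for illustration: the payload is hand-written to mimic only the fields the function reads and is not real API output.

```python
def parse_items(items, suburb, lat, lng):
    """Flatten 'explore' items into row tuples, tolerating venues with no category."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else 'Unknown'
        rows.append((
            suburb, lat, lng,
            venue['name'],
            venue['location']['lat'],
            venue['location']['lng'],
            category,
        ))
    return rows


# Hand-written payload with the same nesting the function expects (not real data)
sample_items = [
    {'venue': {'name': 'Example Cafe',
               'location': {'lat': -33.8740, 'lng': 151.2070},
               'categories': [{'name': 'Café'}]}},
    {'venue': {'name': 'Example Venue',
               'location': {'lat': -33.8741, 'lng': 151.2071},
               'categories': []}},
]

rows = parse_items(sample_items, 'Sydney', -33.8735, 151.2069)
for row in rows:
    print(row)
```

Separating the parsing from the HTTP request also makes it possible to check the row layout without calling the API.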

In [None]:
sydney_venues = getNearbyVenues(names=df2['NAME'],
                                   latitudes=df2['Latitude'],
                                   longitudes=df2['Longitude']
                                  )

print("Shape of venue dataframe is ", sydney_venues.shape)
sydney_venues.head()

A quick look at the number of venues in each suburb:

In [None]:
sydney_venues.groupby('Suburb').count()

In [None]:
# get the List of Unique Categories
print('There are {} unique categories.'.format(len(sydney_venues['Venue Category'].unique())))

## 3. Exploratory Data Analysis

### 3.1 Processing data

To prepare the data for analysis, we create a dataframe of the venue categories using pandas one-hot encoding.

In [None]:
# one hot encoding
venues_onehot = pd.get_dummies(sydney_venues[['Venue Category']], prefix="", prefix_sep="")

# add the suburb column back to the dataframe
venues_onehot['Suburb'] = sydney_venues['Suburb'] 

# move the suburb column to the first position
fixed_columns = [venues_onehot.columns[-1]] + list(venues_onehot.columns[:-1])

#fixed_columns
venues_onehot = venues_onehot[fixed_columns]

venues_onehot.head()

In [None]:
sydney_grouped = venues_onehot.groupby('Suburb').mean().reset_index()
sydney_grouped
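
To see what the grouped table contains, it may help to run the same two steps on a tiny made-up dataset. In the sketch below the suburbs `A` and `B` and their venues are toy values, not Foursquare results, and `dtype=float` is passed only so the indicator columns are numeric regardless of the pandas version.

```python
import pandas as pd

# Toy data for illustration only (not Foursquare results)
toy = pd.DataFrame({
    'Suburb': ['A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Café', 'Café', 'Hotel', 'Hotel', 'Pub'],
})

# One indicator column per category, one row per venue
onehot = pd.get_dummies(toy['Venue Category'], dtype=float)
onehot.insert(0, 'Suburb', toy['Suburb'])

# The mean of 0/1 indicators is the share of a suburb's venues in each category
shares = onehot.groupby('Suburb').mean()
print(shares)
```

Each row of the result sums to 1, so the grouped table describes the mix of venue types in a suburb rather than the absolute number of venues.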

### 3.2 Analysing the top venues

This analysis identifies the top venues in each suburb to give a deeper understanding of the data.

In [None]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Suburb']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # positions beyond 3rd fall back to the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

In [None]:
# Define a function to return the most common venue

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
# create a new dataframe
venues_sorted = pd.DataFrame(columns=columns)
venues_sorted['Suburb'] = sydney_grouped['Suburb']

for ind in np.arange(sydney_grouped.shape[0]):
    venues_sorted.iloc[ind, 1:] = return_most_common_venues(sydney_grouped.iloc[ind, :], num_top_venues)

The top ten venues of each suburb are shown below:

In [None]:
venues_sorted.head()
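
The ranking step can be tried in isolation on a single row of category frequencies. The sketch below is an alternative formulation using `Series.nlargest` rather than the notebook's `sort_values`. The helper name `top_categories` and the frequencies are made up for illustration.

```python
import pandas as pd


def top_categories(row, n):
    """Return the n category names with the highest frequency in a row."""
    # A row sliced from a mixed dataframe has object dtype, so cast before ranking
    return row.astype(float).nlargest(n).index.tolist()


# Toy frequencies for one suburb (illustrative values only)
row = pd.Series({'Café': 0.30, 'Hotel': 0.25, 'Pub': 0.10, 'Park': 0.05})
print(top_categories(row, 2))
```

Asking for more categories than exist simply returns all of them, so the helper does not fail for suburbs with few venue types.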

### 3.3 Clustering the suburbs

We cluster these five suburbs with k-means, using the venue category frequencies as features. The expectation is that suburbs with similar mixes of venue categories end up in the same cluster.

In [None]:
# Distribute into 3 clusters
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 3

# keep only the numeric category frequencies
sydney_grouped_clustering = sydney_grouped.drop(columns='Suburb')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(sydney_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:50]
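
For readers new to k-means, the sketch below runs the same estimator on a small hand-made matrix in which two rows are café-heavy and two are restaurant-heavy. The values are toy numbers chosen for illustration, and `n_init=10` is set explicitly only so the behaviour does not depend on the scikit-learn version's default.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy category-share rows (illustrative only): columns are [Café, Restaurant]
X = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = model.labels_

print(labels)                  # label numbers are arbitrary; only the grouping matters
print(model.cluster_centers_)  # each centroid is the average profile of its cluster
```

Rows with similar profiles receive the same label, which is the behaviour the suburb clustering relies on. With only five suburbs and three clusters, at least one cluster in the real analysis necessarily holds a single suburb, so the grouping should be read as indicative.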

In [None]:
sydney_grouped_clustering

In [None]:
# add clustering labels
venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

In [None]:
# rename the suburb column so it matches the NAME column in df2
venues_sorted.rename(columns={'Suburb': 'NAME'}, inplace=True)

The clusters are shown below:

In [None]:
sydney_merged = df2
# merge sydney_merged with venue data to add latitude/longitude for each suburb
sydney_merged = sydney_merged.join(venues_sorted.set_index('NAME'), on='NAME')

sydney_merged.head() # check the last columns!

A visualisation of the clusters:

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=13)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map, coloured by cluster label
for lat, lon, poi, cluster in zip(sydney_merged['Latitude'], sydney_merged['Longitude'], sydney_merged['NAME'], sydney_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

This is the final result on which we base the decision:

In [None]:
sydney_merged.head()

From the grouping above, The Rocks is the best place to open a restaurant because:
* It is similar to Sydney in terms of amenities.
* Its most common venues are cafes and hotels, which leaves an opportunity to open a restaurant.

The next question is which type of restaurant we should open in The Rocks.

### 3.4 Analysing restaurant types

In [None]:
sydney_restaurant = sydney_venues[sydney_venues['Venue Category'].str.contains('Restaurant')].reset_index(drop=True)
print("Shape of the dataframe including restaurants in Sydney", sydney_restaurant.shape)
sydney_restaurant.head()

In [None]:
sydney_restaurant_sort=pd.crosstab(sydney_restaurant['Suburb'], sydney_restaurant['Venue Category'])
sydney_restaurant_sort['Total'] = sydney_restaurant_sort.sum(axis=1)
df_t = sydney_restaurant_sort.T
print(df_t)
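
The cross-tabulation can also be queried directly to find, for each restaurant type, the suburb where it is least represented. The sketch below does this on a made-up table. The suburbs `A` and `B` and their venues are toy data, not the notebook's results.

```python
import pandas as pd

# Toy rows for illustration only (not the notebook's data)
toy = pd.DataFrame({
    'Suburb': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'Venue Category': ['Thai Restaurant', 'Thai Restaurant', 'Sushi Restaurant',
                       'Thai Restaurant', 'Sushi Restaurant', 'Sushi Restaurant',
                       'Sushi Restaurant'],
})

# Rows are suburbs, columns are restaurant types, cells are venue counts
counts = pd.crosstab(toy['Suburb'], toy['Venue Category'])

# For each restaurant type, the suburb with the fewest venues of that type
least_served = counts.idxmin()

print(counts)
print(least_served)
```

Running this before a `Total` column is added keeps the totals out of the comparison. A low count only hints at a gap in supply and says nothing about demand.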

A quick look at the types of restaurants in The Rocks shows that the suburb is dominated by Australian restaurants. There are only 2 Japanese restaurants, compared with 9 in Sydney. We therefore recommend opening a Japanese restaurant in The Rocks, both to fill the gap and to take advantage of the large number of tourists staying in this suburb.

## 4. Results and Discussion

We first examined the five suburbs according to their amenities. They fall into three clusters:
* The Rocks and Sydney
* Pyrmont and Ultimo
* Haymarket

The Rocks and Sydney are host to the world-famous Opera House and are also busy business areas. These two suburbs are similar and are therefore clustered together. However, The Rocks is favoured by tourists, with hotels and cafes being its most common venues, whereas the Sydney suburb is preferred by lovers of Japanese restaurants.

Pyrmont and Ultimo are popular residential areas for young people, so it is understandable that they are grouped together, with cafes being the most common venue. The data suggests that these two areas are not ideal for opening a restaurant.

Haymarket, where Chinatown is located, is famous for Asian dining and is therefore already highly competitive, which makes it a poor choice for a new restaurant.

Our first conclusion is that The Rocks is the best area to open a restaurant.

Secondly, we took a closer look at the types of restaurants in The Rocks. We found that the suburb is dominated by Australian restaurants. There are only 2 Japanese restaurants, compared with 9 in Sydney. We therefore recommend opening a Japanese restaurant in The Rocks, both to fill the gap and to take advantage of the large number of tourists staying in this suburb.

## 5. Conclusion

To sum up, Sydney ranks among the world's top ten cities for standard of living. The city has a reputation for its variety of restaurants, with dishes from all over the world, so investing in a restaurant business there is an attractive idea. The business problem we set out to answer is which suburb and which type of restaurant to invest in.

To solve this business problem, we used the Foursquare API to cluster Sydney suburbs by their venues and identify the most promising area. The results align with what I expected after living in Sydney for 4 years. After identifying The Rocks as the optimal suburb for a restaurant business, we investigated the restaurant types further. The final conclusion is that we should open a Japanese restaurant in The Rocks.

I hope you have enjoyed the analysis and gained a small glimpse of what Sydney's suburbs are like.