# IBM Data Science Capstone: Battle of the Neighborhoods

### Introduction: Business Plan 

Chinese restaurants are an indicator of Chinese cultural influence in many cities. They are often concentrated in distinct districts, or Chinatowns.

This project seeks to determine whether there is a pattern in the distribution of Chinese restaurants in the city of Toronto, and where the optimal location to open a new restaurant serving Chinese cuisine would be. The optimal location should account for factors that influence a food business, such as proximity to a loyal customer base and competition from other restaurants.

### Import libraries

In [1]:
import pandas as pd
import numpy as np
import urllib.request
from bs4 import BeautifulSoup
import json 
from pandas import json_normalize # the pandas.io.json import path is deprecated

#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image 
from IPython.display import HTML

#!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library

import requests
from sklearn.cluster import KMeans

### 1. Scrape Wikipedia page for list of Toronto neighborhoods

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, "lxml")

neighborhood = soup.find('table', class_ = 'wikitable')
neighborhood_rows = neighborhood.find_all('tr')

information = []
for row in neighborhood_rows:
    info = row.text.split('\n')[1:-1] # remove empty str (the first and last items)
    information.append(info)
    
information[0:20] #preview the first 20 rows

[['Postal Code', '', 'Borough', '', 'Neighbourhood'],
 ['M1A', '', 'Not assigned', '', 'Not assigned'],
 ['M2A', '', 'Not assigned', '', 'Not assigned'],
 ['M3A', '', 'North York', '', 'Parkwoods'],
 ['M4A', '', 'North York', '', 'Victoria Village'],
 ['M5A', '', 'Downtown Toronto', '', 'Regent Park, Harbourfront'],
 ['M6A', '', 'North York', '', 'Lawrence Manor, Lawrence Heights'],
 ['M7A',
  '',
  'Downtown Toronto',
  '',
  "Queen's Park, Ontario Provincial Government"],
 ['M8A', '', 'Not assigned', '', 'Not assigned'],
 ['M9A', '', 'Etobicoke', '', 'Islington Avenue, Humber Valley Village'],
 ['M1B', '', 'Scarborough', '', 'Malvern, Rouge'],
 ['M2B', '', 'Not assigned', '', 'Not assigned'],
 ['M3B', '', 'North York', '', 'Don Mills'],
 ['M4B', '', 'East York', '', 'Parkview Hill, Woodbine Gardens'],
 ['M5B', '', 'Downtown Toronto', '', 'Garden District, Ryerson'],
 ['M6B', '', 'North York', '', 'Glencairn'],
 ['M7B', '', 'Not assigned', '', 'Not assigned'],
 ['M8B', '', 'Not assign
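
The slice `[1:-1]` removes only the leading and trailing empty strings, which is why blank cells still sit between the fields in the preview above. A minimal sketch of a stricter parse, assuming each row's text looks like the hypothetical `raw_row_text` sample below (the helper name `parse_row` is illustrative and not part of the notebook):

```python
def parse_row(row_text):
    """Split a table row's text on newlines and keep only non-empty cells."""
    return [cell.strip() for cell in row_text.split("\n") if cell.strip()]

# Hypothetical sample mimicking the text of one <tr> element
raw_row_text = "\nM3A\n\nNorth York\n\nParkwoods\n"

print(raw_row_text.split("\n")[1:-1])  # slicing keeps the blank separators
print(parse_row(raw_row_text))         # filtering removes them
```

With this variant the header row would yield three column names instead of five, so no blank columns would reach the DataFrame.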

### 2. Load the data into a DataFrame, remove unassigned rows, and rename columns

In [3]:
neighbor_df = pd.DataFrame(information[1:], columns=information[0])
# where information[1:] contains each row of neighborhoods
# and columns = information[0] gives the column names

neighbor_df = neighbor_df[neighbor_df.Borough != 'Not assigned']

neighbor_df.reset_index(drop=True, inplace=True)
neighbor_df = neighbor_df.rename(columns = {'District':'Borough'})

neighbor_df.head(20)

Unnamed: 0,Postal Code,Unnamed: 2,Borough,Unnamed: 4,Neighbourhood
0,M3A,,North York,,Parkwoods
1,M4A,,North York,,Victoria Village
2,M5A,,Downtown Toronto,,"Regent Park, Harbourfront"
3,M6A,,North York,,"Lawrence Manor, Lawrence Heights"
4,M7A,,Downtown Toronto,,"Queen's Park, Ontario Provincial Government"
5,M9A,,Etobicoke,,"Islington Avenue, Humber Valley Village"
6,M1B,,Scarborough,,"Malvern, Rouge"
7,M3B,,North York,,Don Mills
8,M4B,,East York,,"Parkview Hill, Woodbine Gardens"
9,M5B,,Downtown Toronto,,"Garden District, Ryerson"


In [4]:
neighbor_df.shape

(103, 5)
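
The preview above still carries two blank-named columns, because the header row contains empty strings between the real field names. A minimal sketch of removing them together with the unassigned rows, using toy rows rather than the scraped data (the names `rows` and `toy_df` are illustrative):

```python
import pandas as pd

# Toy rows in the same shape as the scraped `information` list
rows = [
    ["Postal Code", "", "Borough", "", "Neighbourhood"],
    ["M1A", "", "Not assigned", "", "Not assigned"],
    ["M3A", "", "North York", "", "Parkwoods"],
    ["M4A", "", "North York", "", "Victoria Village"],
]

toy_df = pd.DataFrame(rows[1:], columns=rows[0])

# Drop the columns whose header is an empty string (boolean mask over the columns)
toy_df = toy_df.loc[:, toy_df.columns != ""]

# Keep only rows with an assigned borough and renumber the index
toy_df = toy_df[toy_df["Borough"] != "Not assigned"].reset_index(drop=True)

print(toy_df)
```

Applied to the real table, this would leave three columns instead of five, so later shape checks would differ accordingly.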

### 3. Obtain latitude and longitude values from CSV file

In [5]:
geospatial_data = pd.read_csv('Geospatial_Coordinates.csv')
geospatial_data.head()

final_table = neighbor_df.merge(geospatial_data, on = 'Postal Code')
final_table = final_table.rename(columns = {'Postal Code': 'PostalCode','Neighbourhood':'Neighborhood'})
final_table.head()

Unnamed: 0,PostalCode,Unnamed: 2,Borough,Unnamed: 4,Neighborhood,Latitude,Longitude
0,M3A,,North York,,Parkwoods,43.753259,-79.329656
1,M4A,,North York,,Victoria Village,43.725882,-79.315572
2,M5A,,Downtown Toronto,,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,,North York,,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,,Downtown Toronto,,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
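
`merge` performs an inner join by default, so any postal code missing from the coordinates file would be dropped without warning. A hedged sketch of how one might check for that, using toy frames in place of `neighbor_df` and `geospatial_data` (the postal code `M9Z` is a made-up value used to force a mismatch):

```python
import pandas as pd

# Toy stand-ins for neighbor_df and geospatial_data
toy_neighborhoods = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M9Z"],  # "M9Z" is hypothetical
    "Borough": ["North York", "North York", "Hypothetical"],
})
toy_coordinates = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# A left join keeps every neighborhood; `indicator` records where each row matched
checked = toy_neighborhoods.merge(
    toy_coordinates, on="Postal Code", how="left", indicator=True
)

# Rows found only in the left frame have no coordinates
unmatched = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```

An empty `unmatched` list would confirm that the inner join used in the notebook loses no rows.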


### 4. Get coordinates for Toronto 

In [6]:
address = 'Toronto, Ontario, Canada'

geolocator = Nominatim(user_agent="TO_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


### 5. Create map of Toronto neighborhoods

In [7]:
map_toronto = folium.Map(location = [latitude, longitude], zoom_start = 10)

for lat, long, bor, neigh in zip(final_table['Latitude'], final_table['Longitude'], 
                                 final_table['Borough'], final_table['Neighborhood']):
    label = '{}, {}'.format(neigh, bor)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, long],
        radius = 3, 
        popup = label,
        color = 'blue',
        fill = True,
        fill_opacity = 0.7,
        parse_html = False).add_to(map_toronto)

map_toronto


### 6. Filter for Toronto neighborhoods only

In [8]:
boroughs = list(final_table.Borough.unique())

toronto_boroughs = []
for x in boroughs:
    if "Toronto" in x:
        toronto_boroughs.append(x)
        
print(toronto_boroughs) 

['Downtown Toronto', 'East Toronto', 'West Toronto', 'Central Toronto']


In [9]:
df_toronto = final_table[final_table['Borough'].isin(toronto_boroughs)].reset_index(drop = True)
print(df_toronto.shape)
df_toronto.head()

(39, 7)


Unnamed: 0,PostalCode,Unnamed: 2,Borough,Unnamed: 4,Neighborhood,Latitude,Longitude
0,M5A,,Downtown Toronto,,"Regent Park, Harbourfront",43.65426,-79.360636
1,M7A,,Downtown Toronto,,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,M5B,,Downtown Toronto,,"Garden District, Ryerson",43.657162,-79.378937
3,M5C,,Downtown Toronto,,St. James Town,43.651494,-79.375418
4,M4E,,East Toronto,,The Beaches,43.676357,-79.293031



### 7. Create a new map of Toronto

In [10]:
map_toronto = folium.Map(location = [latitude, longitude], zoom_start = 10)

for lat, long, bor, neigh in zip(df_toronto['Latitude'], df_toronto['Longitude'], 
                                 df_toronto['Borough'], df_toronto['Neighborhood']):
    label = '{}, {}'.format(neigh, bor)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, long],
        radius = 3, 
        popup = label,
        color = 'blue',
        fill = True,
        fill_opacity = 0.7,
        parse_html = False).add_to(map_toronto)

map_toronto

### 8. Use Foursquare API to get venue data


In [11]:
CLIENT_ID = 'YOUR_CLIENT_ID' # replace with your Foursquare client ID
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # replace with your Foursquare client secret (no trailing spaces)
VERSION = '20200605'
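
Keeping keys out of the notebook avoids leaking them when it is shared, and stripping whitespace guards against a stray space inside the quotes being sent as part of the value. A minimal sketch, assuming the credentials are stored in environment variables; the variable name `FOURSQUARE_CLIENT_ID` and the helper `load_credential` are hypothetical:

```python
import os

def load_credential(name):
    """Return the named environment variable with surrounding whitespace removed."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

# Demonstration only: set a dummy value (note the trailing space) and read it back
os.environ["FOURSQUARE_CLIENT_ID"] = "dummy-id "
print(repr(load_credential("FOURSQUARE_CLIENT_ID")))
```

In real use the variable would be set outside the notebook, and the secret would be loaded the same way.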



### 9. Get top 100 locations within a 0.5-kilometer radius

In [15]:
radius = 500
LIMIT = 100

venues = []

for lat, long, post, borough, neighborhood in zip(df_toronto['Latitude'], df_toronto['Longitude'], df_toronto['PostalCode'], df_toronto['Borough'], df_toronto['Neighborhood']):
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT)
    
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    for venue in results:
        venues.append((
            post, 
            borough,
            neighborhood,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))

KeyError: 'groups'
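
The `KeyError: 'groups'` above indicates that the `response` object in the returned JSON had no `groups` entry, which is what one would expect if the API rejected the request (for example, because of invalid credentials or an exhausted quota). A defensive sketch of the parsing step, run on hypothetical payloads rather than live API output; the helper names and the assumed layout of the `meta` block are illustrative:

```python
def extract_items(payload):
    """Return the list of venue items from an explore-style payload, or [] if absent."""
    groups = payload.get("response", {}).get("groups", [])
    return groups[0].get("items", []) if groups else []

def describe_failure(payload):
    """Summarise the `meta` block, assumed to carry the status code and error type."""
    meta = payload.get("meta", {})
    return "{} {}".format(meta.get("code"), meta.get("errorType", ""))

# Hypothetical payloads: one successful, one rejected
ok_payload = {"meta": {"code": 200},
              "response": {"groups": [{"items": [{"venue": {"name": "Example Cafe"}}]}]}}
bad_payload = {"meta": {"code": 400, "errorType": "invalid_auth"}, "response": {}}

for payload in (ok_payload, bad_payload):
    items = extract_items(payload)
    if not items:
        print("No venues returned:", describe_failure(payload))
    else:
        print([item["venue"]["name"] for item in items])
```

Inside the request loop, printing the failure summary for the first bad response and stopping would surface the underlying cause instead of a `KeyError`.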

In [13]:
# converting the venues list into a new DataFrame
venues_df = pd.DataFrame(venues)

# defining the column names
venues_df.columns = ['PostalCode', 'Borough', 'Neighborhood', 'BoroughLatitude', 'BoroughLongitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']

print(venues_df.shape)
venues_df.head()

ValueError: Length mismatch: Expected axis has 0 elements, new values have 9 elements
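
This `ValueError` follows from the earlier failure: `venues` is empty, so the DataFrame has no columns to rename. Passing the names through the constructor's `columns` argument works for an empty list as well. A small sketch, with a hypothetical venue tuple standing in for real API results:

```python
import pandas as pd

column_names = ['PostalCode', 'Borough', 'Neighborhood', 'BoroughLatitude',
                'BoroughLongitude', 'VenueName', 'VenueLatitude',
                'VenueLongitude', 'VenueCategory']

# An empty result list still yields a frame with the expected columns
empty_df = pd.DataFrame([], columns=column_names)
print(empty_df.shape)

# A hypothetical single venue tuple in the same field order
sample = [('M5A', 'Downtown Toronto', 'Regent Park, Harbourfront',
           43.65426, -79.360636, 'Example Venue', 43.6540, -79.3610, 'Cafe')]
sample_df = pd.DataFrame(sample, columns=column_names)
print(sample_df.shape)
```

An empty frame does not fix the missing data, but it makes the cause (zero venues fetched) visible in the printed shape.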

### 10. Group the dataset by each neighborhood and count the number of each type of venue


In [None]:
venues_df.groupby(["PostalCode", "Borough", "Neighborhood"]).count()
venues_df['VenueCategory'].unique()[:50]


### 11. One-hot encode the venue categories for each area

In [None]:
# one hot encoding
toronto_onehot = pd.get_dummies(venues_df[['VenueCategory']], prefix="", prefix_sep="")

# add postal, borough and neighborhood column back to dataframe
toronto_onehot['PostalCode'] = venues_df['PostalCode'] 
toronto_onehot['Borough'] = venues_df['Borough'] 
toronto_onehot['Neighborhoods'] = venues_df['Neighborhood'] 

# move postal, borough and neighborhood column to the first column
fixed_columns = list(toronto_onehot.columns[-3:]) + list(toronto_onehot.columns[:-3])
toronto_onehot = toronto_onehot[fixed_columns]

print(toronto_onehot.shape)
toronto_onehot.head()

### 12. Group by neighborhood and compute the frequency of each venue type

In [None]:
toronto_grouped = toronto_onehot.groupby(["PostalCode", "Borough", "Neighborhoods"]).mean().reset_index()

print(toronto_grouped.shape)
toronto_grouped
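
Because each one-hot column holds only 0s and 1s, the group mean taken above equals the fraction of a neighborhood's venues that fall in that category. A small worked example on hypothetical data (the names `toy_venues`, `toy_onehot`, and `toy_freq` are illustrative):

```python
import pandas as pd

# Hypothetical venues for two neighborhoods
toy_venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "VenueCategory": ["Chinese Restaurant", "Cafe", "Cafe", "Park",
                      "Chinese Restaurant", "Chinese Restaurant"],
})

# One indicator column per category; dtype=float keeps the columns numeric for the mean
toy_onehot = pd.get_dummies(toy_venues["VenueCategory"], dtype=float)
toy_onehot.insert(0, "Neighborhood", toy_venues["Neighborhood"])

# The mean of a 0/1 column is the share of venues in that category
toy_freq = toy_onehot.groupby("Neighborhood").mean().reset_index()
print(toy_freq[["Neighborhood", "Chinese Restaurant"]])
```

Neighborhood A has one Chinese restaurant among four venues and B has two among two, so the frequencies come out as one quarter and one.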

### 13. Find Chinese restaurants in dataset

In [None]:
toronto_chinese = toronto_grouped[["Neighborhoods","Chinese Restaurant"]]


### 14. Use K-means clustering algorithm

In [None]:
from sklearn.cluster import KMeans # importing library 
toclusters = 3  # 3 clusters selected

# keep only the numeric frequency column for clustering
toronto_clustering = toronto_chinese.drop(columns=["Neighborhoods"])

# run k-means clustering
kmeans = KMeans(n_clusters=toclusters, random_state=1)
kmeans.fit(toronto_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:20]
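
To illustrate what k-means does with a single frequency column, the sketch below clusters hypothetical values into three groups. The numbers are invented for the example, and ranking clusters by their centers is one possible way to read the labels, since the cluster ids themselves carry no order:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical Chinese-restaurant frequencies for six neighborhoods
frequencies = np.array([0.00, 0.01, 0.05, 0.06, 0.20, 0.22]).reshape(-1, 1)

# n_init is set explicitly because its default differs between scikit-learn versions
model = KMeans(n_clusters=3, random_state=1, n_init=10).fit(frequencies)

# Cluster ids are arbitrary, so rank clusters by their centers to interpret them
order = np.argsort(model.cluster_centers_.ravel())
rank_of_label = {int(label): rank for rank, label in enumerate(order)}
ranked = [rank_of_label[int(label)] for label in model.labels_]
print(ranked)
```

Rank 0 then corresponds to the neighborhoods with the fewest Chinese restaurants relative to other venues, and the highest rank to the most saturated ones.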

### 15. Add labels to newly formed clusters

In [None]:
toronto_merged = toronto_chinese.copy()

# add clustering labels
toronto_merged["Cluster Labels"] = kmeans.labels_
toronto_merged.rename(columns={"Neighborhoods": "Neighborhood"}, inplace=True) 

toronto_merged = toronto_merged.join(venues_df.set_index("Neighborhood"), on="Neighborhood")

print(toronto_merged.shape)
toronto_merged.head()
toronto_merged.sort_values(["Cluster Labels"], inplace=True)
toronto_merged.head()

### 16. Plot a map of the clusters created by k-means

In [None]:
# Plot the map
map_clusters = folium.Map(location=[latitude, longitude],zoom_start=14)

# set color scheme for the clusters
markers_colors = {0: 'red', 1: 'blue', 2: 'green', 3: 'yellow', 4: 'cyan', 5: 'black'}

# add markers to the map
for lat, lon, cluster in zip(toronto_merged['BoroughLatitude'], toronto_merged['BoroughLongitude'], toronto_merged['Cluster Labels']):
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        color=markers_colors[cluster],
        fill=True,
        fill_color=markers_colors[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
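
The color lookup above raises a `KeyError` for any label outside the palette, which could happen if the number of clusters were increased later. A small sketch of a lookup with a fallback; the helper `color_for` and the `'gray'` fallback are assumptions, not part of the notebook:

```python
# Palette copied from the cell above; the fallback color is an assumption
cluster_colors = ['red', 'blue', 'green', 'yellow', 'cyan', 'black']

def color_for(cluster, fallback='gray'):
    """Return the palette color for a cluster id, or the fallback if out of range."""
    cluster = int(cluster)  # labels from k-means arrive as NumPy integers
    return cluster_colors[cluster] if 0 <= cluster < len(cluster_colors) else fallback

print([color_for(c) for c in (0, 2, 7)])
```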

### Final remarks

In conclusion, clusters 1 and 2 are the candidate areas for opening a Chinese restaurant, covering the following neighborhoods:

Cluster 1 - Adelaide, King, Stn A PO Boxes, Church and Wellesley.

Cluster 2 - Studio District, Summerhill West, Rathnelly, South Hill, Forest. 

Cluster 1 seems more suitable because it covers the downtown area. Further analysis could incorporate factors such as housing, foot traffic, and the concentration of potential customers who enjoy Chinese cuisine to refine the choice of location for a new Chinese restaurant.
