# Capstone Project - Finding the Best Neighborhood
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)


## Introduction: Business Problem <a name="introduction"></a>

The goal is to find neighborhoods in Toronto, Canada, that have all the basic-necessity shops within a short distance of home. People who are new to the city, or who are moving to Toronto from another city, need a place to live, and it can be difficult for them to find a neighborhood that covers all their necessities. This project groups the city's neighborhoods into categories according to the shops and facilities available in each one. The Foursquare API is used to find the venues near each neighborhood and to retrieve the venue categories and the number of venues per category.

## Data <a name="data"></a>

The boroughs and neighborhoods of Toronto are retrieved from Wikipedia (https://en.wikipedia.org/wiki/List_of_neighbourhoods_in_Toronto). The data comes as tables listing the postal codes and the neighborhood names in each borough. A geospatial dataset is then used to look up the latitude and longitude of each postal code. Finally, the Foursquare API is used to retrieve the venues near each neighborhood.

*	Wikipedia Data: Columns Retrieved: Borough, Postal Code, Neighborhoods
*	Foursquare Data: Latitude, Longitude, Venues, Category
* Example:
Consider the North York borough of Toronto. It contains many neighborhoods, spread over multiple postal codes. For example, postal code M3A belongs to the Parkwoods neighborhood and M4A belongs to Victoria Village. The geospatial data gives the longitude and latitude of Parkwoods, which are -79.329656 and 43.753259 respectively. Foursquare then provides the venues near Parkwoods, such as cafes and parks.
Each venue category becomes a column in the data. The categories selected for this project are basic-necessity categories, for example Gym, Grocery Store and Bank. These features can be changed according to the facilities a user needs, and the selected features determine how the neighborhoods are clustered.


Import of all necessary libraries

In [2]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

Request the page and parse it with BeautifulSoup

In [3]:
URL = 'http://en.turkcewiki.org/wiki/List_of_postal_codes_of_Canada:_M'
page = requests.get(URL)

soup = BeautifulSoup(page.content, 'lxml')

In [4]:
tble = soup.find('table')
print(len(tble))

2


In [5]:
postal_codes=[]
boroughs = []
neighs = []

Web Scraping from the table

In [6]:
for neigh in tble.find_all('td'):
  sp = neigh.find('span')
  # Skip cells without a span and cells whose postal code is unassigned
  if sp is None or sp.text == 'Not assigned':
    continue
  postal_codes.append(neigh.find('b').text)
  # The cell text has the form "Borough(Hood A / Hood B)"
  data = sp.text
  split_both = data.split("(")
  hoods = split_both[1].split(")")[0]
  hoods_data = hoods.replace("/",",")
  boroughs.append(split_both[0])
  neighs.append(hoods_data)
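
The string handling in the loop above can be isolated into a small helper, which makes it easier to check against sample cells. This is only a sketch: `split_cell` is a hypothetical name, and the input strings are assumed to follow the "Borough(Hood A / Hood B)" pattern that the loop relies on.

```python
def split_cell(text):
    """Split 'Borough(Hood A / Hood B)' into (borough, [neighbourhoods]).

    Assumes the cell text follows the pattern used by the scraping loop.
    A cell without parentheses is treated as a borough with no neighbourhoods.
    """
    borough, sep, rest = text.partition("(")
    if not sep:
        return borough.strip(), []
    inside = rest.split(")")[0]
    hoods = [h.strip() for h in inside.split("/") if h.strip()]
    return borough.strip(), hoods


print(split_cell("North York(Parkwoods)"))
print(split_cell("Downtown Toronto(Regent Park / Harbourfront)"))
```

Returning a list instead of a comma-joined string keeps each neighbourhood name free of stray spaces, at the cost of a different column type than the notebook uses.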

In [7]:
print(len(postal_codes))
print(len(boroughs))
print(len(neighs))

103
103
103


In [8]:
df = pd.DataFrame(
columns=['PostalCode','Borough','Neighbourhood'])
df

Unnamed: 0,PostalCode,Borough,Neighbourhood


In [9]:

df['PostalCode'] =postal_codes
df['Borough'] = boroughs
df['Neighbourhood'] = neighs

df

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park , Harbourfront"
3,M6A,North York,"Lawrence Manor , Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government
...,...,...,...
98,M8X,Etobicoke,"The Kingsway , Montgomery Road , Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East TorontoBusiness reply mail Processing Cen...,Enclave of M4L
101,M8Y,Etobicoke,"Old Mill South , King's Mill Park , Sunnylea ,..."


In [10]:
len(df['PostalCode'].unique())

103

GeoSpatial Data of Latitude and Longitude

In [11]:

from io import StringIO 
url = 'http://cocl.us/Geospatial_data'
s=requests.get(url).content
c=pd.read_csv(StringIO(s.decode('utf-8')))
c

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


In [12]:
df2= df.merge(c, left_on='PostalCode',right_on = 'Postal Code', how='left')
df2.drop(columns=['Postal Code'],inplace=True)
df2

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park , Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor , Lawrence Heights",43.718518,-79.464763
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway , Montgomery Road , Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East TorontoBusiness reply mail Processing Cen...,Enclave of M4L,43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South , King's Mill Park , Sunnylea ,...",43.636258,-79.498509


In [13]:
from geopy.geocoders import Nominatim 
import folium
import json
from pandas import json_normalize  # top-level import; the pandas.io.json path is deprecated since pandas 1.0
import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans

Getting Longitude and latitude of Toronto City

In [14]:
address = 'Toronto'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


Visualisation of all the neighborhoods of Toronto city

In [15]:
map_north_york = folium.Map(location=[latitude, longitude], zoom_start=11)

for lat, lng, label in zip(df2['Latitude'],df2['Longitude'], df2['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_north_york)  
    
map_north_york

## Methodology <a name="methodology"></a>

The aim of the project was to find the best neighborhood according to the places near it. For that, the longitude and latitude of each neighborhood were required. The geospatial data contains the latitude and longitude of each of the 103 postal codes of Toronto. The two datasets were merged on postal code to produce a new dataset with boroughs, neighborhoods, latitude and longitude.
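
A quick way to confirm that every postal code received coordinates is to merge with an indicator column. The sketch below uses three toy rows; `M0Z` and `Hypothetical` are made-up values added only to show what an unmatched code looks like.

```python
import pandas as pd

# Two rows copied from the tables above plus one made-up, unmatched code.
neighbourhoods = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M0Z"],
    "Neighbourhood": ["Parkwoods", "Victoria Village", "Hypothetical"],
})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# indicator=True adds a "_merge" column telling where each row came from.
merged = neighbourhoods.merge(
    coords, left_on="PostalCode", right_on="Postal Code",
    how="left", indicator=True,
).drop(columns=["Postal Code"])

# Postal codes that found no coordinates in the geospatial table.
unmatched = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

With a left join, unmatched rows are kept with missing coordinates, so checking them before mapping avoids markers silently disappearing.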

The next step was to retrieve the places near each neighborhood. The Foursquare API was used for this purpose, through its explore request, with a limit of 100 venues per neighborhood. The API returned a JSON response for each neighborhood, from which the name, latitude, longitude and category of each venue were extracted. The resulting data frame contains the neighborhood name, neighborhood latitude and longitude, venue name, venue category, venue latitude and venue longitude. As a cleaning step, neighborhoods with 5 or fewer nearby venues were removed from the dataset. The reasoning is that the solution should only suggest neighborhoods that can plausibly cover all the facilities, and neighborhoods with so few venues are not a good fit.
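
The extraction step can be sketched without calling the API by using a hand-written stand-in for the JSON body. The nesting mirrors the keys the notebook reads (`response`, `groups`, `items`, `venue`). The helper name `venues_from_response` and the second venue are hypothetical, and returning `None` for a venue without categories is an assumption rather than the notebook's behavior.

```python
import pandas as pd

# Hand-written stand-in for one explore response (not real API output).
mock_response = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Brookbanks Park",
                   "location": {"lat": 43.751976, "lng": -79.332140},
                   "categories": [{"name": "Park"}]}},
        {"venue": {"name": "Uncategorised Place",  # hypothetical venue
                   "location": {"lat": 43.75, "lng": -79.33},
                   "categories": []}},
    ]}]}
}


def venues_from_response(body, neighbourhood):
    """Flatten one response body into a DataFrame of venues."""
    groups = body.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        cats = venue.get("categories", [])
        rows.append({
            "Neighborhood": neighbourhood,
            "Venue": venue["name"],
            "Venue Latitude": venue["location"]["lat"],
            "Venue Longitude": venue["location"]["lng"],
            # Venues without a category get None instead of raising IndexError
            "Venue Category": cats[0]["name"] if cats else None,
        })
    return pd.DataFrame(rows)


venues = venues_from_response(mock_response, "Parkwoods")
print(venues)
```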

The category data had to be converted to numerical data for modeling, so the categories were one-hot encoded using the pandas get_dummies function. After this step, the data has one row per venue, with the neighborhood name and one numerical column per category. Because each neighborhood has multiple venues, it appears in multiple rows. To compare neighborhoods, all rows of the same neighborhood were combined into one by taking the mean of each category column, which gives the frequency of each category in that neighborhood.
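
A toy illustration of the encoding and averaging step, using made-up neighborhoods `A` and `B`: after one-hot encoding, the mean of a category column is the share of that neighborhood's venues falling in the category, so each row of the result sums to 1.

```python
import pandas as pd

# Made-up venues: neighborhood A has 4 venues, B has 2.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Bank", "Bank", "Park", "Pharmacy", "Park", "Park"],
})

# One column per category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# Mean per neighborhood = relative frequency of each category.
freq = onehot.groupby("Neighborhood").mean()
print(freq)
```

Because these are relative frequencies, a neighborhood with 2 banks out of 4 venues scores higher on `Bank` than one with 5 banks out of 100 venues, which is worth keeping in mind when reading the clusters.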

There were 313 different venue categories, and not all of them matter for this analysis. For a particular person, 6 or 7 categories would matter more than the rest. Since no specific personal requirements are available here, I selected 7 basic-necessity categories to cluster the neighborhoods: Gym / Fitness Center, Grocery Store, Bank, ATM, Pharmacy, Shopping Mall and Restaurant.
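
Since the list of necessities is meant to be changed per user, a selection that tolerates categories absent from the data can be convenient. The following sketch uses a toy frequency table; the constant name `NECESSITIES` and the choice to fill missing categories with 0.0 are assumptions, not part of the notebook.

```python
import pandas as pd

NECESSITIES = ["Gym / Fitness Center", "Shopping Mall", "Restaurant",
               "Grocery Store", "Bank", "ATM", "Pharmacy"]

# Toy frequency table in which only some of the categories exist.
grouped = pd.DataFrame({
    "Neighborhood": ["A", "B"],
    "Bank": [0.5, 0.0],
    "Park": [0.25, 1.0],
    "Pharmacy": [0.25, 0.0],
})

# reindex keeps the requested columns in order and fills absent ones with 0.0,
# whereas plain column selection would raise KeyError for a missing category.
features = (grouped.set_index("Neighborhood")
                   .reindex(columns=NECESSITIES, fill_value=0.0))
print(features)
```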

In [16]:
import os

# Read the Foursquare credentials from environment variables instead of
# hard-coding and printing them. The variable names are assumptions.
CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID', '')
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET', '')
VERSION = '20180605'  # Foursquare API version date
LIMIT = 100  # maximum number of venues returned per request


In [17]:
df3 = df2

In [18]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # Fall back to the flattened column name
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [19]:
df_toronto = pd.DataFrame(columns=["name","categories","lat","lng"])
df_toronto

Unnamed: 0,name,categories,lat,lng


Getting up to 100 venues per neighborhood from the Foursquare API

In [20]:
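def getNearbyVenues(names, latitudes, longitudes, radius=500):
    import os  # local import so the function does not rely on an earlier cell
    # Read the OAuth token from the environment; the variable name is an assumption.
    AUTH = os.environ.get('FOURSQUARE_OAUTH_TOKEN', '')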
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        url = 'https://api.foursquare.com/v2/venues/explore?&oauth_token={}&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            AUTH,
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [21]:
north_york_venues = getNearbyVenues(names=df3['Neighbourhood'],
                                   latitudes=df3['Latitude'],
                                   longitudes=df3['Longitude'])
north_york_venues

Parkwoods
Victoria Village
Regent Park , Harbourfront
Lawrence Manor , Lawrence Heights
Ontario Provincial Government
Islington Avenue
Malvern , Rouge
Don Mills
Parkview Hill , Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park , Princess Gardens , Martin Grove , Islington , Cloverdale
Rouge Hill , Port Union , Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate , Bloordale Gardens , Old Burnhamthorpe , Markland Wood
Guildwood , Morningside , West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor , Wilson Heights , Downsview North
Thorncliffe Park
Richmond , Adelaide , King
Dufferin , Dovercourt Village
Scarborough Village
Fairview , Henry Farm , Oriole
Northwood Park , York University
The Danforth  East
Harbourfront East , Union Station , Toronto Islands
Little Portugal , Trinity
Kennedy Park , Ionview , East Birchmount Park
Bayview Village
Downsview
T

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.332140,Park
1,Parkwoods,43.753259,-79.329656,Careful & Reliable Painting,43.752622,-79.331957,Construction & Landscaping
2,Parkwoods,43.753259,-79.329656,649 Variety,43.754513,-79.331942,Convenience Store
3,Parkwoods,43.753259,-79.329656,Towns On The Ravine,43.754754,-79.332552,Hotel
4,Parkwoods,43.753259,-79.329656,Sun Life,43.754760,-79.332783,Construction & Landscaping
...,...,...,...,...,...,...,...
3100,"Mimico NW , The Queensway West , South of Bloo...",43.628841,-79.520999,Koala Tan Tanning Salon & Sunless Spa,43.631370,-79.519006,Tanning Salon
3101,"Mimico NW , The Queensway West , South of Bloo...",43.628841,-79.520999,Once Upon A Child,43.631075,-79.518290,Kids Store
3102,"Mimico NW , The Queensway West , South of Bloo...",43.628841,-79.520999,Value Village,43.631269,-79.518238,Thrift / Vintage Store
3103,"Mimico NW , The Queensway West , South of Bloo...",43.628841,-79.520999,Kingsway Boxing Club,43.627254,-79.526684,Gym


In [22]:
north_york_venues['Venue Category'].unique()

array(['Park', 'Construction & Landscaping', 'Convenience Store', 'Hotel',
       'Fireworks Store', 'Food & Drink Shop', 'Bus Stop', 'BBQ Joint',
       'Hockey Arena', 'Portuguese Restaurant', 'Coffee Shop',
       'Bridal Shop', 'Intersection', 'Pizza Place',
       'Financial or Legal Service', 'Bakery', 'Distribution Center',
       'Spa', 'Restaurant', 'Breakfast Spot', 'Gym / Fitness Center',
       'Historic Site', 'Chocolate Shop', 'Farmers Market', 'Pub',
       'Performing Arts Venue', 'Dessert Shop', 'Yoga Studio',
       'Mexican Restaurant', 'Café', 'Theater', 'History Museum',
       'Event Space', 'French Restaurant', 'Food Truck', 'Shoe Store',
       'Greek Restaurant', 'Art Gallery', 'Cosmetics Shop',
       'Asian Restaurant', 'Electronics Store', 'Furniture / Home Store',
       'Brewery', 'Italian Restaurant', 'Bank', 'Discount Store',
       'Sandwich Place', 'Seafood Restaurant', 'Beer Store', 'Lounge',
       'Thrift / Vintage Store', 'Chinese Restaurant', 'Gym

In [23]:
north_york_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Agincourt,7,7,7,7,7,7
"Alderwood , Long Branch",12,12,12,12,12,12
"Bathurst Manor , Wilson Heights , Downsview North",34,34,34,34,34,34
Bayview Village,6,6,6,6,6,6
"Bedford Park , Lawrence Manor East",54,54,54,54,54,54
...,...,...,...,...,...,...
"Willowdale , Newtonbrook",3,3,3,3,3,3
Woburn,4,4,4,4,4,4
Woodbine Heights,16,16,16,16,16,16
"York Mills , Silver Hills",3,3,3,3,3,3


In [24]:
g = north_york_venues.groupby('Neighborhood') 
north_york_venues = g.filter(lambda x: len(x) > 5)
north_york_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Agincourt,7,7,7,7,7,7
"Alderwood , Long Branch",12,12,12,12,12,12
"Bathurst Manor , Wilson Heights , Downsview North",34,34,34,34,34,34
Bayview Village,6,6,6,6,6,6
"Bedford Park , Lawrence Manor East",54,54,54,54,54,54
...,...,...,...,...,...,...
Victoria Village,7,7,7,7,7,7
Westmount,10,10,10,10,10,10
"Wexford , Maryvale",7,7,7,7,7,7
Willowdale,58,58,58,58,58,58
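
Note that the condition `len(x) > 5` keeps neighborhoods with six or more venues, so a neighborhood with exactly five is dropped. The sketch below shows the same threshold with a named constant (`MIN_VENUES` is a hypothetical name) on made-up data, using `transform` instead of `filter`.

```python
import pandas as pd

# Made-up data: A has 6 venues, B has 5, C has 2.
venues = pd.DataFrame({
    "Neighborhood": ["A"] * 6 + ["B"] * 5 + ["C"] * 2,
    "Venue": [f"v{i}" for i in range(13)],
})

MIN_VENUES = 6  # hypothetical name; equivalent to len(x) > 5

# Broadcast each group's size back onto its rows, then keep the large groups.
counts = venues.groupby("Neighborhood")["Venue"].transform("size")
kept = venues[counts >= MIN_VENUES]
print(kept["Neighborhood"].unique())
```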


Encoding categorical data to numeric data

In [None]:
north_york_onehot = pd.get_dummies(north_york_venues[['Venue Category']], prefix="", prefix_sep="")

north_york_onehot['Neighborhood'] = north_york_venues['Neighborhood'] 

fixed_columns = [north_york_onehot.columns[-1]] + list(north_york_onehot.columns[:-1])
north_york_onehot = north_york_onehot[fixed_columns]

north_york_onehot.head()

Unnamed: 0,Yoga Studio,ATM,Accessories Store,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,Arepa Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Workshop,BBQ Joint,Baby Store,Bagel Shop,Bakery,Bank,Bar,Baseball Field,Baseball Stadium,Basketball Stadium,Bath House,Beach,Beer Bar,Beer Store,Belgian Restaurant,Bike Rental / Bike Share,Bike Shop,Bistro,Board Shop,Boat or Ferry,...,Social Club,Soup Place,Southern / Soul Food Restaurant,Spa,Speakeasy,Sporting Goods Shop,Sports Bar,Stadium,Stationery Store,Steakhouse,Supermarket,Supplement Shop,Sushi Restaurant,Swim School,Taco Place,Tailor Shop,Taiwanese Restaurant,Tanning Salon,Tapas Restaurant,Tea Room,Tech Startup,Tennis Court,Thai Restaurant,Theater,Theme Restaurant,Thrift / Vintage Store,Tibetan Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Watch Shop,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [None]:
north_york_onehot.shape

(3007, 314)

Finding mean of each neighborhood for each category

In [None]:
north_york_grouped = north_york_onehot.groupby('Neighborhood').mean().reset_index()
north_york_grouped

Unnamed: 0,Neighborhood,Yoga Studio,ATM,Accessories Store,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,Arepa Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Workshop,BBQ Joint,Baby Store,Bagel Shop,Bakery,Bank,Bar,Baseball Field,Baseball Stadium,Basketball Stadium,Bath House,Beach,Beer Bar,Beer Store,Belgian Restaurant,Bike Rental / Bike Share,Bike Shop,Bistro,Board Shop,...,Social Club,Soup Place,Southern / Soul Food Restaurant,Spa,Speakeasy,Sporting Goods Shop,Sports Bar,Stadium,Stationery Store,Steakhouse,Supermarket,Supplement Shop,Sushi Restaurant,Swim School,Taco Place,Tailor Shop,Taiwanese Restaurant,Tanning Salon,Tapas Restaurant,Tea Room,Tech Startup,Tennis Court,Thai Restaurant,Theater,Theme Restaurant,Thrift / Vintage Store,Tibetan Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Watch Shop,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,Agincourt,0.0,0.0000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000
1,"Alderwood , Long Branch",0.0,0.0000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.083333,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000
2,"Bathurst Manor , Wilson Heights , Downsview North",0.0,0.0000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.029412,0.0,0.0,0.0,0.000000,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.029412,0.0,0.029412,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000
3,Bayview Village,0.0,0.0000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000
4,"Bedford Park , Lawrence Manor East",0.0,0.0000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.019231,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.096154,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.038462,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.019231,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0000,0.000000,0.0,0.0,0.0,0.0,0.0,0.019231
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
69,Victoria Village,0.0,0.0000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000
70,Westmount,0.0,0.0000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000
71,"Wexford , Maryvale",0.0,0.0000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0,0.142857,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000
72,Willowdale,0.0,0.0000,0.017241,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.017241,0.0,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.017241,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.034483,0.0,0.0,0.0,0.0,0.0,0.0,0.017241,0.0,0.034483,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017241,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0000,0.017241,0.0,0.0,0.0,0.0,0.0,0.000000


Selecting only basic necessity categories

In [None]:
df_nec = north_york_grouped[['Neighborhood','Gym / Fitness Center','Shopping Mall','Restaurant','Grocery Store','Bank','ATM','Pharmacy']]
df_nec

Unnamed: 0,Neighborhood,Gym / Fitness Center,Shopping Mall,Restaurant,Grocery Store,Bank,ATM,Pharmacy
0,Agincourt,0.0,0.000000,0.000000,0.000000,0.000000,0.0000,0.000000
1,"Alderwood , Long Branch",0.0,0.000000,0.000000,0.000000,0.000000,0.0000,0.083333
2,"Bathurst Manor , Wilson Heights , Downsview North",0.0,0.029412,0.029412,0.029412,0.058824,0.0000,0.058824
3,Bayview Village,0.0,0.000000,0.000000,0.000000,0.166667,0.0000,0.000000
4,"Bedford Park , Lawrence Manor East",0.0,0.000000,0.038462,0.019231,0.000000,0.0000,0.019231
...,...,...,...,...,...,...,...,...
69,Victoria Village,0.0,0.000000,0.000000,0.000000,0.000000,0.0000,0.000000
70,Westmount,0.0,0.000000,0.000000,0.000000,0.000000,0.0000,0.000000
71,"Wexford , Maryvale",0.0,0.000000,0.000000,0.000000,0.000000,0.0000,0.000000
72,Willowdale,0.0,0.017241,0.017241,0.034483,0.017241,0.0000,0.034483


Removing rows where every selected category is zero

In [None]:
df_nec = df_nec[(df_nec['Gym / Fitness Center']!=0.0) | (df_nec['Shopping Mall']!=0.000000) | (df_nec['Restaurant']!=0.000000) | (df_nec['Grocery Store']!=0.000000) | (df_nec['Bank']!=0.000000) | (df_nec['ATM']!=0.0000) | (df_nec['Pharmacy']!=0.000000)]
df_nec

Unnamed: 0,Neighborhood,Gym / Fitness Center,Shopping Mall,Restaurant,Grocery Store,Bank,ATM,Pharmacy
1,"Alderwood , Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.083333
2,"Bathurst Manor , Wilson Heights , Downsview North",0.0,0.029412,0.029412,0.029412,0.058824,0.0,0.058824
3,Bayview Village,0.0,0.0,0.0,0.0,0.166667,0.0,0.0
4,"Bedford Park , Lawrence Manor East",0.0,0.0,0.038462,0.019231,0.0,0.0,0.019231
5,Berczy Park,0.0,0.01,0.02,0.01,0.0,0.0,0.02
6,"Brockton , Parkdale Village , Exhibition Place",0.03125,0.0,0.03125,0.0,0.0,0.0,0.0
8,Cedarbrae,0.0,0.0,0.0,0.0,0.076923,0.0,0.0
9,Central Bay Street,0.02,0.0,0.01,0.0,0.01,0.0,0.01
10,Christie,0.034483,0.0,0.034483,0.137931,0.034483,0.0,0.0
11,Church and Wellesley,0.01,0.0,0.02,0.01,0.0,0.0,0.01


In [None]:
df_nec.shape

(56, 8)
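
The chained `|` conditions in the filter above can be written more compactly with `any(axis=1)`, which also keeps working if the list of categories changes. A sketch on three rows, with values taken from the tables above and only two of the seven columns shown:

```python
import pandas as pd

df_toy = pd.DataFrame({
    "Neighborhood": ["Agincourt", "Bayview Village", "Cedarbrae"],
    "Bank": [0.0, 0.166667, 0.076923],
    "Pharmacy": [0.0, 0.0, 0.0],
})

# Every column except the name is treated as a feature.
feature_cols = [c for c in df_toy.columns if c != "Neighborhood"]

# Keep a row if at least one feature is non-zero.
non_zero = df_toy[(df_toy[feature_cols] != 0).any(axis=1)]
print(non_zero)
```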

## Analysis <a name="analysis"></a>

K-means clustering with 5 clusters was applied to the dataset. The clustering features were the 7 categories selected in the previous step, so the frequency of occurrence of each category determines the clusters of neighborhoods. Clusters with a higher frequency of these categories are the better candidates, and they help identify the neighborhoods that have the needed shops.
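
One way to make "better cluster" concrete is to average the features per cluster and rank the clusters by how many categories they cover and by their total frequency. This is a sketch on made-up numbers, not the notebook's method: the `labels` array stands in for `kmeans.labels_`, and the ranking rule is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up category frequencies for four neighborhoods.
features = pd.DataFrame(
    {"Bank": [0.17, 0.08, 0.00, 0.01],
     "Pharmacy": [0.00, 0.00, 0.08, 0.02],
     "Grocery Store": [0.00, 0.00, 0.00, 0.14]},
    index=["N1", "N2", "N3", "N4"],
)
labels = np.array([0, 0, 1, 2])  # stand-in for kmeans.labels_

# Mean feature values per cluster.
profile = features.groupby(labels).mean()
profile["Total"] = profile.sum(axis=1)
profile["Categories present"] = (profile.drop(columns="Total") > 0).sum(axis=1)

# Rank by coverage first, then by total frequency.
ranked = profile.sort_values(["Categories present", "Total"], ascending=False)
print(ranked)
```

Here cluster 2 ranks first because it covers all three categories, even though cluster 0 has the highest bank frequency.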

In [None]:
kclusters = 5

df_nec_clusters = df_nec.drop(columns='Neighborhood')  # keep only the numeric features

kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df_nec_clusters)

kmeans.labels_[0:10] 

array([4, 0, 3, 1, 1, 1, 3, 1, 0, 1], dtype=int32)

In [None]:
df_nec.insert(0, 'Cluster Labels', kmeans.labels_)

north_york_merged = df3

df_nec_merged = north_york_merged.join(df_nec.set_index('Neighborhood'), on='Neighbourhood')

df_nec_merged.head(10)

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,Gym / Fitness Center,Shopping Mall,Restaurant,Grocery Store,Bank,ATM,Pharmacy
0,M3A,North York,Parkwoods,43.753259,-79.329656,,,,,,,,
1,M4A,North York,Victoria Village,43.725882,-79.315572,,,,,,,,
2,M5A,Downtown Toronto,"Regent Park , Harbourfront",43.65426,-79.360636,1.0,0.012987,0.0,0.012987,0.0,0.012987,0.012987,0.012987
3,M6A,North York,"Lawrence Manor , Lawrence Heights",43.718518,-79.464763,,,,,,,,
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494,1.0,0.0,0.01,0.01,0.02,0.02,0.0,0.02
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242,,,,,,,,
6,M1B,Scarborough,"Malvern , Rouge",43.806686,-79.194353,,,,,,,,
7,M3B,North York,Don Mills,43.745906,-79.352188,1.0,0.0,0.0,0.073171,0.02439,0.0,0.02439,0.02439
8,M4B,East York,"Parkview Hill , Woodbine Gardens",43.706397,-79.309937,0.0,0.071429,0.0,0.0,0.0,0.071429,0.0,0.071429
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,1.0,0.01,0.01,0.01,0.0,0.01,0.0,0.0


In [None]:
df_nec_merged

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,Gym / Fitness Center,Shopping Mall,Restaurant,Grocery Store,Bank,ATM,Pharmacy
0,M3A,North York,Parkwoods,43.753259,-79.329656,,,,,,,,
1,M4A,North York,Victoria Village,43.725882,-79.315572,,,,,,,,
2,M5A,Downtown Toronto,"Regent Park , Harbourfront",43.654260,-79.360636,1.0,0.012987,0.00,0.012987,0.000000,0.012987,0.012987,0.012987
3,M6A,North York,"Lawrence Manor , Lawrence Heights",43.718518,-79.464763,,,,,,,,
4,M7A,Queen's Park,Ontario Provincial Government,43.662301,-79.389494,1.0,0.000000,0.01,0.010000,0.020000,0.020000,0.000000,0.020000
...,...,...,...,...,...,...,...,...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway , Montgomery Road , Old Mill North",43.653654,-79.506944,,,,,,,,
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160,1.0,0.010000,0.00,0.020000,0.010000,0.000000,0.000000,0.010000
100,M7Y,East TorontoBusiness reply mail Processing Cen...,Enclave of M4L,43.662744,-79.321558,1.0,0.086957,0.00,0.043478,0.000000,0.000000,0.000000,0.000000
101,M8Y,Etobicoke,"Old Mill South , King's Mill Park , Sunnylea ,...",43.636258,-79.498509,,,,,,,,


In [None]:
df_nec_merged.dropna(inplace=True)

In [None]:
df_nec_merged['Cluster Labels'] = df_nec_merged['Cluster Labels'].astype('int64')

Map of Clusters of different characteristics

In [None]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

markers_colors = []
for lat, lon, poi, cluster in zip(df_nec_merged['Latitude'], df_nec_merged['Longitude'],df_nec_merged['Neighbourhood'], df_nec_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [None]:
df_nec_merged.loc[df_nec_merged['Cluster Labels'] == 0, df_nec_merged.columns[[1] + [2] + list(range(5, df_nec_merged.shape[1]))]]

Unnamed: 0,Borough,Neighbourhood,Cluster Labels,Gym / Fitness Center,Shopping Mall,Restaurant,Grocery Store,Bank,ATM,Pharmacy
8,East York,"Parkview Hill , Woodbine Gardens",0,0.071429,0.0,0.0,0.0,0.071429,0.0,0.071429
25,Downtown Toronto,Christie,0,0.034483,0.0,0.034483,0.137931,0.034483,0.0,0.0
28,North York,"Bathurst Manor , Wilson Heights , Downsview North",0,0.0,0.029412,0.029412,0.029412,0.058824,0.0,0.058824
29,East York,Thorncliffe Park,0,0.026316,0.026316,0.026316,0.026316,0.026316,0.0,0.078947
31,West Toronto,"Dufferin , Dovercourt Village",0,0.0,0.0,0.0,0.041667,0.041667,0.0,0.083333
40,North York,Downsview,0,0.041667,0.083333,0.0,0.083333,0.041667,0.0,0.041667
46,North York,Downsview,0,0.041667,0.083333,0.0,0.083333,0.041667,0.0,0.041667
53,North York,Downsview,0,0.041667,0.083333,0.0,0.083333,0.041667,0.0,0.041667
59,North York,Willowdale,0,0.0,0.017241,0.017241,0.034483,0.017241,0.0,0.034483
60,North York,Downsview,0,0.041667,0.083333,0.0,0.083333,0.041667,0.0,0.041667


In [None]:
df_nec_merged.loc[df_nec_merged['Cluster Labels'] == 1, df_nec_merged.columns[[1] +[2] + list(range(5, df_nec_merged.shape[1]))]]

Unnamed: 0,Borough,Neighbourhood,Cluster Labels,Gym / Fitness Center,Shopping Mall,Restaurant,Grocery Store,Bank,ATM,Pharmacy
2,Downtown Toronto,"Regent Park , Harbourfront",1,0.012987,0.0,0.012987,0.0,0.012987,0.012987,0.012987
4,Queen's Park,Ontario Provincial Government,1,0.0,0.01,0.01,0.02,0.02,0.0,0.02
7,North York,Don Mills,1,0.0,0.0,0.073171,0.02439,0.0,0.02439,0.02439
9,Downtown Toronto,"Garden District, Ryerson",1,0.01,0.01,0.01,0.0,0.01,0.0,0.0
13,North York,Don Mills,1,0.0,0.0,0.073171,0.02439,0.0,0.02439,0.02439
15,Downtown Toronto,St. James Town,1,0.0,0.0,0.04,0.01,0.0,0.0,0.0
20,Downtown Toronto,Berczy Park,1,0.0,0.01,0.02,0.01,0.0,0.0,0.02
23,East York,Leaside,1,0.0,0.032787,0.032787,0.016393,0.032787,0.0,0.0
24,Downtown Toronto,Central Bay Street,1,0.02,0.0,0.01,0.0,0.01,0.0,0.01
30,Downtown Toronto,"Richmond , Adelaide , King",1,0.01,0.0,0.02,0.0,0.0,0.0,0.01


In [None]:
df_nec_merged.loc[df_nec_merged['Cluster Labels'] == 2, df_nec_merged.columns[[1] + [2] + list(range(5, df_nec_merged.shape[1]))]]

Unnamed: 0,Borough,Neighbourhood,Cluster Labels,Gym / Fitness Center,Shopping Mall,Restaurant,Grocery Store,Bank,ATM,Pharmacy
56,York,"Del Ray , Mount Dennis , Keelsdale and Silvert...",2,0.0,0.0,0.166667,0.0,0.0,0.0,0.166667


In [None]:
df_nec_merged.loc[df_nec_merged['Cluster Labels'] == 3, df_nec_merged.columns[[1] + [2] + list(range(5, df_nec_merged.shape[1]))]]

Unnamed: 0,Borough,Neighbourhood,Cluster Labels,Gym / Fitness Center,Shopping Mall,Restaurant,Grocery Store,Bank,ATM,Pharmacy
18,Scarborough,"Guildwood , Morningside , West Hill",3,0.0,0.0,0.083333,0.0,0.083333,0.0,0.0
26,Scarborough,Cedarbrae,3,0.0,0.0,0.0,0.0,0.076923,0.0,0.0
39,North York,Bayview Village,3,0.0,0.0,0.0,0.0,0.166667,0.0,0.0


In [None]:
df_nec_merged.loc[df_nec_merged['Cluster Labels'] == 4, df_nec_merged.columns[[1] + [2] + list(range(5, df_nec_merged.shape[1]))]]

Unnamed: 0,Borough,Neighbourhood,Cluster Labels,Gym / Fitness Center,Shopping Mall,Restaurant,Grocery Store,Bank,ATM,Pharmacy
10,North York,Glencairn,4,0.0,0.0,0.0,0.0,0.0,0.0,0.083333
14,East York,Woodbine Heights,4,0.0,0.0,0.0,0.0,0.0,0.0625,0.0625
17,Etobicoke,"Eringate , Bloordale Gardens , Old Burnhamthor...",4,0.0,0.0,0.0,0.0,0.0,0.0,0.076923
38,Scarborough,"Kennedy Park , Ionview , East Birchmount Park",4,0.0,0.0,0.0,0.0,0.0,0.0,0.166667
82,Scarborough,"Clarks Corners , Tam O'Shanter , Sullivan",4,0.0,0.045455,0.0,0.0,0.045455,0.0,0.090909
93,Etobicoke,"Alderwood , Long Branch",4,0.0,0.0,0.0,0.0,0.0,0.0,0.083333
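
The five cells above repeat the same selection with a different cluster label each time. As a minimal sketch, the selection could be wrapped in a small helper that picks columns by name instead of by position. The helper name `cluster_rows` and the toy frame are hypothetical; the toy rows are copied from the tables above and stand in for the full `df_nec_merged`.

```python
import pandas as pd


def cluster_rows(df, label, label_col="Cluster Labels",
                 id_cols=("Borough", "Neighbourhood")):
    """Return the rows of one cluster: id columns, the label, and every column after it."""
    start = df.columns.get_loc(label_col)           # position of the label column
    cols = list(id_cols) + list(df.columns[start:])  # label column plus amenity columns
    return df.loc[df[label_col] == label, cols]


# Toy frame (assumption): four rows and two amenity columns taken from the outputs above.
toy = pd.DataFrame(
    {
        "Borough": ["Downtown Toronto", "East York", "North York", "East York"],
        "Neighbourhood": ["Christie", "Leaside", "Glencairn", "Woodbine Heights"],
        "Cluster Labels": [0, 1, 4, 4],
        "ATM": [0.0, 0.0, 0.0, 0.0625],
        "Pharmacy": [0.0, 0.0, 0.083333, 0.0625],
    }
)

cluster_4 = cluster_rows(toy, 4)
print(cluster_4)
```

Selecting by column name avoids the hard-coded positions `[1]`, `[2]` and `range(5, ...)`, which silently break if the column order of the merged frame changes.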


## Results and Discussion <a name="results"></a>

The clustering produced 5 groups of neighborhoods. The red and purple clusters contain many more neighborhoods than the other three.
The red cluster lies mostly on the airport side of the city, which seems less populated.
The purple neighborhoods are near the University of Toronto and the beach side. This part of the city is denser than the others.
The yellow cluster consists of neighborhoods that are very far from the main city area.
The sea blue cluster contains a single neighborhood, which lies inside the city region.
The cyan cluster lies close to the border of the city.
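
The cluster sizes and amenity levels described above can be checked with a `groupby` on the cluster label. The following is only a sketch: the frame `sample` is a hypothetical subset built from a few rows of the tables above, so its counts and means do not match those of the full `df_nec_merged`.

```python
import pandas as pd

# Illustrative subset (assumption): rows copied from the cluster tables above,
# restricted to three of the seven amenity columns.
sample = pd.DataFrame(
    [
        ("Christie", 0, 0.034483, 0.0, 0.0),
        ("Thorncliffe Park", 0, 0.026316, 0.026316, 0.0),
        ("Regent Park , Harbourfront", 1, 0.012987, 0.0, 0.012987),
        ("Don Mills", 1, 0.0, 0.0, 0.02439),
        ("Cedarbrae", 3, 0.0, 0.0, 0.0),
        ("Glencairn", 4, 0.0, 0.0, 0.0),
    ],
    columns=["Neighbourhood", "Cluster Labels",
             "Gym / Fitness Center", "Shopping Mall", "ATM"],
)

amenities = ["Gym / Fitness Center", "Shopping Mall", "ATM"]

# Number of neighbourhoods in each cluster.
sizes = sample.groupby("Cluster Labels").size()

# Mean venue frequency per amenity category in each cluster.
profile = sample.groupby("Cluster Labels")[amenities].mean()

print(sizes)
print(profile.round(4))
```

Run against the full merged frame, the same two `groupby` calls would give the number of neighborhoods per cluster and a per-cluster amenity profile to set beside the map colors.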


The 5 clusters have different properties and characteristics. The sea blue cluster has only one neighborhood, in a very deserted area. It does not have all the necessary facilities, which makes it a very weak candidate. The cyan cluster is at the very edge of the city, so it unsurprisingly has fewer amenities and is also not a good choice. The yellow cluster has very similar properties to the cyan one, so it is also a poor candidate.

That leaves two clusters to choose from: red and purple. The red cluster has no ATMs. The purple cluster has a few ATMs but is short on gyms and shopping malls. The red cluster is widely scattered, while the purple cluster is densely packed. The choice of neighborhood therefore depends on distance, the preferred area, and which facilities matter most. For example, if gyms and shopping malls are more important and more frequently visited than ATMs, and the person likes to live in a scattered area with some free space, then neighborhoods from the red cluster are a better choice than those from the purple cluster. Within the selected cluster, the neighborhood can then be chosen by its proximity to the workplace.

One thing not considered in this discussion is the number of restaurants. The city has many categories of restaurants, so the choice depends on each person's favorite types of food. Here, I have considered only the generic restaurant category for clustering.
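
The trade-off between facilities can be made explicit with a simple weighted score. This is a sketch rather than part of the original analysis: the function `preference_score` and the weights are hypothetical, and the frequencies are copied from the cluster tables above for three neighborhoods.

```python
# Venue frequencies copied from the cluster tables above.
neighbourhoods = {
    "Thorncliffe Park": {"Gym / Fitness Center": 0.026316, "Shopping Mall": 0.026316, "ATM": 0.0},
    "Downsview": {"Gym / Fitness Center": 0.041667, "Shopping Mall": 0.083333, "ATM": 0.0},
    "Don Mills": {"Gym / Fitness Center": 0.0, "Shopping Mall": 0.0, "ATM": 0.02439},
}

# Hypothetical preferences: gyms and malls matter twice as much as ATMs.
weights = {"Gym / Fitness Center": 2.0, "Shopping Mall": 2.0, "ATM": 1.0}


def preference_score(frequencies, weights):
    """Weighted sum of amenity frequencies; amenities without a weight count as 0."""
    return sum(weights.get(name, 0.0) * freq for name, freq in frequencies.items())


# Rank neighbourhoods from best to worst under the chosen weights.
ranked = sorted(
    neighbourhoods,
    key=lambda name: preference_score(neighbourhoods[name], weights),
    reverse=True,
)
print(ranked)  # ['Downsview', 'Thorncliffe Park', 'Don Mills']
```

With these weights the two neighborhoods from the cluster without ATMs rank first. If all the weight is placed on ATMs instead, Don Mills moves to the top, which mirrors the red versus purple trade-off described above.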

## Conclusion <a name="conclusion"></a>

Overall, the project helps a person select the best neighborhood to live in. It may also help shop owners and business people determine what kinds of shops an area needs. If the basic needs of the people living in a neighborhood can be identified, then a single place offering all those facilities could be built there and would be likely to attract steady business. One limitation of this approach is that some small shops, especially in small cities, may not be registered on Foursquare, so they cannot be taken into account when finding the best-fit neighborhood. Even so, this project should help each of these stakeholders make a better-informed decision.