## Finding a better place in North York

Importing libraries

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

Using BeautifulSoup to scrape the list of postal codes from the given Wikipedia page

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
html_data = requests.get(url)
soup = BeautifulSoup(html_data.text, "html5lib")

Creating a table with three columns and filling it with data from the BeautifulSoup object.
* Only adding rows that have an assigned Borough

In [3]:
column_names = ['Postal Code', 'Borough', 'Neighbourhood']
data = pd.DataFrame(columns = column_names)

# DataFrame.append was removed in pandas 2.0, so rows are collected in a list first
rows = []
for row in soup.find('tbody').find_all('td'):
    if row.span.text != "Not assigned":
        pcode = row.p.text[0:3]
        bor = row.span.text.split('(')[0]
        neigh = row.span.text.split("(")[1].strip(")").replace(" /",',').replace(")"," ").strip(" ")
        rows.append({"Postal Code":pcode,"Borough":bor,"Neighbourhood":neigh})
data = pd.DataFrame(rows, columns = column_names)
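
The string handling in the loop above can be pulled out into a small helper, which makes it easier to check against sample cell text. The following is a minimal sketch, assuming each cell's span text looks like `Borough(Neighbourhood / Neighbourhood)`; the helper name `split_cell_text` and the sample strings are illustrative assumptions, not part of the original notebook.

```python
def split_cell_text(span_text):
    """Split 'Borough(Neigh A / Neigh B)' into (borough, 'Neigh A, Neigh B')."""
    # partition never raises, even when there is no '(' in the text
    borough, _, rest = span_text.partition("(")
    # Drop the closing parenthesis and turn ' /' separators into ','
    neighbourhood = rest.rstrip(")").replace(" /", ",").strip()
    return borough.strip(), neighbourhood


print(split_cell_text("North York(Parkwoods)"))
print(split_cell_text("Downtown Toronto(Regent Park / Harbourfront)"))
print(split_cell_text("Queen's Park"))
```

Unlike `split("(")[1]`, this version returns an empty neighbourhood instead of raising `IndexError` when a cell has no parenthesised part, which fits the cleaning step that follows.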
        

Cleaning the table:
* Assuming that any row without a Neighbourhood holds only '' (an empty string) in that column
* Replacing every empty Neighbourhood with the value of its Borough

In [4]:
data['Neighbourhood'] = data.apply(lambda x: x['Borough'] if(x['Neighbourhood']=='') else x['Neighbourhood'], axis = 1)

Printing the final table obtained from scraping the webpage

In [5]:
data

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East TorontoBusiness reply mail Processing Cen...,Enclave of M4L
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


Shape of Obtained Table:

In [6]:
data.shape

(103, 3)

## Finding a better place in North York | Finding location

Importing geocoder

In [7]:
# !pip install geocoder
import geocoder

Defining a function that geocodes a postal code

In [8]:
def get_latlong(postal_code):
    lat_long_coords = None
    while(lat_long_coords is None):
        g = geocoder.arcgis("{}, Toronto, Ontario".format(postal_code))
        lat_long_coords = g.latlng
    return lat_long_coords

get_latlong("M5A")

[43.65512000000007, -79.36263999999994]
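
The `while` loop in `get_latlong` keeps calling the geocoder until it returns coordinates, so it would never finish if a postal code cannot be resolved. A bounded retry is one way to guard against that. The sketch below uses a stand-in function instead of a real geocoder call; the helper name `retry_until_result` and its parameters are assumptions made for illustration.

```python
import time


def retry_until_result(fetch, max_attempts=5, delay_seconds=0.0):
    """Call fetch() until it returns a non-None value or attempts run out."""
    for _ in range(max_attempts):
        result = fetch()
        if result is not None:
            return result
        time.sleep(delay_seconds)  # brief pause before the next attempt
    return None  # caller decides how to handle a missing result


# Stand-in for a geocoder that fails twice before succeeding
responses = iter([None, None, [43.65512, -79.36264]])
example_coords = retry_until_result(lambda: next(responses))
print(example_coords)
```

In the notebook, `fetch` could be something like `lambda: geocoder.arcgis("M5A, Toronto, Ontario").latlng`, with a `None` result recorded as a missing coordinate rather than retried forever.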

Getting the latitude and longitude for each postal code

In [9]:
postal_codes = data['Postal Code']
coords = []
for postal_code in postal_codes.tolist():
    coords.append(get_latlong(postal_code))

Converting into a dataframe

In [10]:
data_coords = pd.DataFrame(coords,columns = ['Latitude', 'Longitude'])
data_coords.head()

Unnamed: 0,Latitude,Longitude
0,43.75245,-79.32991
1,43.73057,-79.31306
2,43.65512,-79.36264
3,43.72327,-79.45042
4,43.66253,-79.39188


Merging dataframes

In [11]:
data['Latitude'] = data_coords['Latitude']
data['Longitude'] = data_coords['Longitude']

In [12]:
data.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.75245,-79.32991
1,M4A,North York,Victoria Village,43.73057,-79.31306
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65512,-79.36264
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.72327,-79.45042
4,M7A,Queen's Park,Ontario Provincial Government,43.66253,-79.39188


## Finding a better place in North York | Separating data and mapping

Importing libraries

In [13]:
import folium
import json
import matplotlib.cm as cm
import matplotlib.colors as colors

from pandas import json_normalize  # top-level import available since pandas 1.0
from sklearn.cluster import KMeans
from geopy.geocoders import Nominatim


In [14]:
address = 'North York, Toronto'
geolocator = Nominatim(user_agent="my-app")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

In [15]:
york_data = data[data['Borough'].str.contains("North York")].reset_index(drop=True)
print(york_data.shape)
york_data

(24, 5)


Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.75245,-79.32991
1,M4A,North York,Victoria Village,43.73057,-79.31306
2,M6A,North York,"Lawrence Manor, Lawrence Heights",43.72327,-79.45042
3,M3B,North York,Don Mills North,43.74923,-79.36186
4,M6B,North York,Glencairn,43.70687,-79.44812
5,M3C,North York,Don Mills South,43.72168,-79.34352
6,M2H,North York,Hillcrest Village,43.80225,-79.35558
7,M3H,North York,"Bathurst Manor, Wilson Heights, Downsview North",43.75788,-79.44847
8,M2J,North York,"Fairview, Henry Farm, Oriole",43.78097,-79.34781
9,M3J,North York,"Northwood Park, York University",43.76476,-79.48798


In [16]:
map_yorkk = folium.Map(location = [latitude, longitude], zoom_start=11)
for pincode, lat, long, borough, neighbourhood in zip(york_data['Postal Code'],york_data['Latitude'],york_data['Longitude'],york_data['Borough'],york_data['Neighbourhood']):
    label = '{}, {}, {}'.format(neighbourhood,borough,pincode)
    label = folium.Popup(label, parse_html = True)
    folium.CircleMarker(
        [lat,long],
        radius=4,
        popup=label,
        color = 'blue',
        fill=True,
        fill_color= "#3186cc",
        fill_opacity=0.7,
        parse_html=False).add_to(map_yorkk)
map_yorkk

#### Foursquare data

In [17]:
CLIENT_ID = 'YOUR_CLIENT_ID' # Put your Foursquare Client ID here
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # Put your Foursquare Client Secret here
VERSION = '20210727'
LIMIT = 30
print('Your credentials:')
print('CLIENT_ID: Hidden')
print('CLIENT_SECRET: Hidden')

Your credentials:
CLIENT_ID: Hidden
CLIENT_SECRET: Hidden


#### 1. Exploring Neighbourhoods in North York

Finding the venues near each neighbourhood

In [18]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, radius, LIMIT)
        
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        venues_list.append([( name, lat, lng, v['venue']['name'], v['venue']['location']['lat'], v['venue']['location']['lng'], v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 'Neighborhood Latitude', 'Neighborhood Longitude', 'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category']
    
    return(nearby_venues)
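
The request URL and the nested JSON lookups in `getNearbyVenues` can also be written more defensively. The sketch below builds the same query string with `urllib.parse.urlencode` and reads the response with `.get`, so a missing key yields an empty result instead of a `KeyError`. The helper names `build_explore_url` and `extract_venues`, the placeholder credentials, and the sample payload are assumptions for illustration; only the endpoint and parameter names come from the function above.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=30):
    """Build the venues/explore URL with properly encoded query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)


def extract_venues(payload):
    """Return (name, lat, lng, category) tuples, tolerating missing keys."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories") or [{}]
        rows.append((venue.get("name"), location.get("lat"),
                     location.get("lng"), categories[0].get("name")))
    return rows


example_url = build_explore_url("EXAMPLE_ID", "EXAMPLE_SECRET", "20210727", 43.75245, -79.32991)
sample_payload = {"response": {"groups": [{"items": [{"venue": {
    "name": "KFC",
    "location": {"lat": 43.754387, "lng": -79.333021},
    "categories": [{"name": "Fast Food Restaurant"}]}}]}]}}
print(example_url)
print(extract_venues(sample_payload))
print(extract_venues({}))  # an empty or error response gives an empty list
```

With `requests`, the same parameters could instead be passed as `requests.get(base_url, params=params)`, which performs the encoding itself.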

In [19]:
york_venues = getNearbyVenues(names = york_data['Neighbourhood'], latitudes = york_data['Latitude'], longitudes = york_data['Longitude'])


Parkwoods
Victoria Village
Lawrence Manor, Lawrence Heights
Don Mills North
Glencairn
Don Mills South
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Fairview, Henry Farm, Oriole
Northwood Park, York University
Bayview Village
Downsview East
York Mills, Silver Hills
Downsview West
North Park, Maple Leaf Park, Upwood Park
Humber Summit
Willowdale, Newtonbrook
Downsview Central
Bedford Park, Lawrence Manor East
Humberlea, Emery
Willowdale South
Downsview Northwest
York Mills West
Willowdale West


In [20]:
york_venues

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.75245,-79.32991,Brookbanks Park,43.751976,-79.332140,Park
1,Parkwoods,43.75245,-79.32991,KFC,43.754387,-79.333021,Fast Food Restaurant
2,Parkwoods,43.75245,-79.32991,Variety Store,43.751974,-79.333114,Food & Drink Shop
3,Parkwoods,43.75245,-79.32991,TTC stop - 44 Valley Woods,43.755402,-79.333741,Bus Stop
4,Victoria Village,43.73057,-79.31306,Wigmore Park,43.731023,-79.310771,Park
...,...,...,...,...,...,...,...
236,Willowdale West,43.77989,-79.44678,Tim Hortons,43.780940,-79.444231,Coffee Shop
237,Willowdale West,43.77989,-79.44678,Antibes Park,43.778872,-79.448705,Park
238,Willowdale West,43.77989,-79.44678,Price Chopper,43.783237,-79.446339,Grocery Store
239,Willowdale West,43.77989,-79.44678,Bathurst Village Market,43.784063,-79.445984,Supermarket


In [21]:
york_venues.groupby("Neighborhood").count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Bathurst Manor, Wilson Heights, Downsview North",1,1,1,1,1,1
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",21,21,21,21,21,21
Don Mills North,4,4,4,4,4,4
Don Mills South,6,6,6,6,6,6
Downsview Central,2,2,2,2,2,2
Downsview East,11,11,11,11,11,11
Downsview Northwest,19,19,19,19,19,19
Downsview West,12,12,12,12,12,12
"Fairview, Henry Farm, Oriole",30,30,30,30,30,30


In [22]:
print('There are {} unique categories.'.format(len(york_venues['Venue Category'].unique())))

There are 90 unique categories.


#### 2. Analysing each neighbourhood in the borough


One-hot encoding the category of each venue

In [23]:
york_onehot = pd.get_dummies(york_venues[['Venue Category']], prefix="", prefix_sep="")

york_onehot['Neighborhood'] = york_venues['Neighborhood'] 

fixed_columns = [york_onehot.columns[-1]] + list(york_onehot.columns[:-1])
york_onehot = york_onehot[fixed_columns]
york_onehot

Unnamed: 0,Neighborhood,Arts & Crafts Store,BBQ Joint,Bakery,Bank,Bar,Basketball Court,Beer Store,Bookstore,Breakfast Spot,...,Supermarket,Sushi Restaurant,Thai Restaurant,Theater,Toy / Game Store,Trail,Turkish Restaurant,Video Game Store,Vietnamese Restaurant,Women's Store
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
236,Willowdale West,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
237,Willowdale West,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
238,Willowdale West,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
239,Willowdale West,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0


In [24]:
york_grouped = york_onehot.groupby('Neighborhood').mean().reset_index()
york_grouped

Unnamed: 0,Neighborhood,Arts & Crafts Store,BBQ Joint,Bakery,Bank,Bar,Basketball Court,Beer Store,Bookstore,Breakfast Spot,...,Supermarket,Sushi Restaurant,Thai Restaurant,Theater,Toy / Game Store,Trail,Turkish Restaurant,Video Game Store,Vietnamese Restaurant,Women's Store
0,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0
2,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.047619,...,0.0,0.047619,0.047619,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Don Mills North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Don Mills South,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Downsview Central,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Downsview East,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.181818,0.0,0.090909,0.0
7,Downsview Northwest,0.0,0.0,0.0,0.0,0.0,0.0,0.052632,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.052632,0.0
8,Downsview West,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0
9,"Fairview, Henry Farm, Oriole",0.0,0.0,0.0,0.066667,0.033333,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.033333,0.033333,0.0,0.0,0.033333,0.0,0.0


In [25]:
york_grouped.shape

(24, 91)
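
Taking the mean of the one-hot columns per neighbourhood gives, for each category, the share of that neighbourhood's venues that fall into it. The sketch below reproduces the calculation with plain Python as an illustration. The four Parkwoods rows are the ones shown in the venue table earlier; the "Example Area" rows and the helper name `category_frequencies` are hypothetical.

```python
from collections import Counter, defaultdict


def category_frequencies(venue_rows):
    """Return {neighbourhood: {category: share of that neighbourhood's venues}}."""
    counts = defaultdict(Counter)
    for neighbourhood, category in venue_rows:
        counts[neighbourhood][category] += 1
    frequencies = {}
    for neighbourhood, counter in counts.items():
        total = sum(counter.values())
        frequencies[neighbourhood] = {cat: n / total for cat, n in counter.items()}
    return frequencies


venue_rows = [
    ("Parkwoods", "Park"),
    ("Parkwoods", "Fast Food Restaurant"),
    ("Parkwoods", "Food & Drink Shop"),
    ("Parkwoods", "Bus Stop"),
    ("Example Area", "Park"),   # hypothetical rows for contrast
    ("Example Area", "Park"),
    ("Example Area", "Cafe"),
    ("Example Area", "Bank"),
]
freqs = category_frequencies(venue_rows)
print(freqs["Parkwoods"]["Park"])     # 0.25: one of four venues
print(freqs["Example Area"]["Park"])  # 0.5: two of four venues
```

A neighbourhood with very few venues, such as one with a single result, will show a frequency of 1.0 for that category, which is worth keeping in mind when reading the clusters.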

Displaying the top 5 venue categories for each neighbourhood in North York

In [26]:
num_top_venues = 5
for neigh in york_grouped['Neighborhood']:
    print("----"+neigh+"----")
    temp = york_grouped[york_grouped['Neighborhood'] == neigh].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Bathurst Manor, Wilson Heights, Downsview North----
                 venue  freq
0     Business Service   1.0
1  Arts & Crafts Store   0.0
2            Pet Store   0.0
3                 Park   0.0
4            Nightclub   0.0


----Bayview Village----
                        venue  freq
0                       Trail  0.50
1  Construction & Landscaping  0.25
2                        Park  0.25
3         Arts & Crafts Store  0.00
4    Mediterranean Restaurant  0.00


----Bedford Park, Lawrence Manor East----
                venue  freq
0  Italian Restaurant  0.10
1         Coffee Shop  0.10
2      Sandwich Place  0.10
3                 Pub  0.05
4             Butcher  0.05


----Don Mills North----
                 venue  freq
0                 Park  0.25
1          Coffee Shop  0.25
2         Burger Joint  0.25
3          Gas Station  0.25
4  Arts & Crafts Store  0.00


----Don Mills South----
             venue  freq
0     Intersection  0.17
1  Bubble Tea Shop  0.17
2      Coffee S

Finding the most common venue categories in each neighbourhood

In [27]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

In [28]:
import numpy as np
num_top_venues = 10
indicators = ['st', 'nd', 'rd']

columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # positions beyond 3rd use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = york_grouped['Neighborhood']

for ind in np.arange(york_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(york_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Bathurst Manor, Wilson Heights, Downsview North",Business Service,Women's Store,Furniture / Home Store,Department Store,Dessert Shop,Discount Store,Electronics Store,Fast Food Restaurant,Food & Drink Shop,Food Court
1,Bayview Village,Trail,Construction & Landscaping,Park,Fried Chicken Joint,Cosmetics Shop,Department Store,Dessert Shop,Discount Store,Electronics Store,Fast Food Restaurant
2,"Bedford Park, Lawrence Manor East",Coffee Shop,Sandwich Place,Italian Restaurant,Juice Bar,Breakfast Spot,Pub,Pharmacy,Fast Food Restaurant,Restaurant,Liquor Store
3,Don Mills North,Burger Joint,Coffee Shop,Gas Station,Park,Fried Chicken Joint,Department Store,Dessert Shop,Discount Store,Electronics Store,Fast Food Restaurant
4,Don Mills South,Intersection,Coffee Shop,Gym,Grocery Store,Supermarket,Bubble Tea Shop,Women's Store,Fried Chicken Joint,Dessert Shop,Discount Store
5,Downsview Central,Insurance Office,Construction & Landscaping,BBQ Joint,Gas Station,Dessert Shop,Discount Store,Electronics Store,Fast Food Restaurant,Food & Drink Shop,Food Court
6,Downsview East,Turkish Restaurant,Park,Middle Eastern Restaurant,Bakery,Vietnamese Restaurant,Italian Restaurant,Pizza Place,Chinese Restaurant,Sandwich Place,Latin American Restaurant
7,Downsview Northwest,Shopping Mall,Discount Store,Pizza Place,Grocery Store,Fried Chicken Joint,Pharmacy,Fast Food Restaurant,Caribbean Restaurant,Liquor Store,Sandwich Place
8,Downsview West,Hotel,Vietnamese Restaurant,Grocery Store,Department Store,Pizza Place,Coffee Shop,Discount Store,Fast Food Restaurant,Convenience Store,Beer Store
9,"Fairview, Henry Farm, Oriole",Clothing Store,Coffee Shop,Fast Food Restaurant,Bank,Juice Bar,Restaurant,Chocolate Shop,Cosmetics Shop,Movie Theater,Liquor Store
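
The column names above rely on a three-item suffix list with a fallback to "th", which works for positions 1 to 10. If more columns were ever needed, a small ordinal helper would also handle cases such as 11th and 21st. This is a sketch under that assumption; the function name `ordinal` is hypothetical.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


example_columns = ["Neighborhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)
]
print(example_columns)
```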


#### 3. Clustering Neighbourhoods

Importing Libraries

In [29]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# The private module sklearn.cluster.k_means_ is not available in recent scikit-learn releases, so it is not imported
km = KMeans(n_clusters=3, init='k-means++', max_iter=100, n_init=1, 
  verbose=True)



Performing KMeans Clustering

In [30]:
kclusters = 3
york_grouped_clustering = york_grouped.drop(columns=['Neighborhood'])
kmeans = KMeans(n_clusters=kclusters, random_state=2).fit(york_grouped_clustering)
print(kmeans.labels_)
print(len(kmeans.labels_))

[1 0 0 0 0 2 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0]
24
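
The entries of `kmeans.labels_` follow the row order of the frame passed to `fit`, here `york_grouped`, which is sorted by neighbourhood name after the `groupby`. That order differs from `york_data`, so labels should be matched by name rather than by position. The sketch below shows the difference on a two-row excerpt; the variable names are illustrative, and the label values are the ones shown for these two neighbourhoods in the notebook output.

```python
# Order and labels as they come out of the grouped (alphabetical) frame
grouped_order = ["Bathurst Manor, Wilson Heights, Downsview North", "Parkwoods"]
grouped_labels = [1, 0]

# Order of the same neighbourhoods in the original borough table
data_order = ["Parkwoods", "Bathurst Manor, Wilson Heights, Downsview North"]

# Matching by position silently assigns the wrong label to each row
positional = list(zip(data_order, grouped_labels))

# Matching by name keeps each neighbourhood with its own label
label_by_name = dict(zip(grouped_order, grouped_labels))
aligned = [label_by_name[name] for name in data_order]

print(positional)
print(list(zip(data_order, aligned)))
```

This is the reason the labels are attached to `neighborhoods_venues_sorted` and then joined on the neighbourhood name in the next cell.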


In [31]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
york_merged = york_data
york_merged = york_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighbourhood')

york_merged # check the last columns!

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.75245,-79.32991,0,Park,Bus Stop,Fast Food Restaurant,Food & Drink Shop,Cosmetics Shop,Department Store,Dessert Shop,Discount Store,Electronics Store,Food Court
1,M4A,North York,Victoria Village,43.73057,-79.31306,0,Park,Nail Salon,Grocery Store,German Restaurant,Food Court,Cosmetics Shop,Department Store,Dessert Shop,Discount Store,Electronics Store
2,M6A,North York,"Lawrence Manor, Lawrence Heights",43.72327,-79.45042,0,Clothing Store,Women's Store,Cosmetics Shop,Restaurant,Pharmacy,Electronics Store,Leather Goods Store,Coffee Shop,Greek Restaurant,Kitchen Supply Store
3,M3B,North York,Don Mills North,43.74923,-79.36186,0,Burger Joint,Coffee Shop,Gas Station,Park,Fried Chicken Joint,Department Store,Dessert Shop,Discount Store,Electronics Store,Fast Food Restaurant
4,M6B,North York,Glencairn,43.70687,-79.44812,0,Pizza Place,Grocery Store,Latin American Restaurant,Bakery,Bank,Japanese Restaurant,Fast Food Restaurant,Mediterranean Restaurant,Gas Station,Fried Chicken Joint
5,M3C,North York,Don Mills South,43.72168,-79.34352,0,Intersection,Coffee Shop,Gym,Grocery Store,Supermarket,Bubble Tea Shop,Women's Store,Fried Chicken Joint,Dessert Shop,Discount Store
6,M2H,North York,Hillcrest Village,43.80225,-79.35558,0,Residential Building (Apartment / Condo),Park,Women's Store,Fried Chicken Joint,Cosmetics Shop,Department Store,Dessert Shop,Discount Store,Electronics Store,Fast Food Restaurant
7,M3H,North York,"Bathurst Manor, Wilson Heights, Downsview North",43.75788,-79.44847,1,Business Service,Women's Store,Furniture / Home Store,Department Store,Dessert Shop,Discount Store,Electronics Store,Fast Food Restaurant,Food & Drink Shop,Food Court
8,M2J,North York,"Fairview, Henry Farm, Oriole",43.78097,-79.34781,0,Clothing Store,Coffee Shop,Fast Food Restaurant,Bank,Juice Bar,Restaurant,Chocolate Shop,Cosmetics Shop,Movie Theater,Liquor Store
9,M3J,North York,"Northwood Park, York University",43.76476,-79.48798,0,Furniture / Home Store,Miscellaneous Shop,Coffee Shop,Bar,Japanese Restaurant,Metro Station,Fast Food Restaurant,Caribbean Restaurant,Restaurant,Massage Studio


Visualizing data

In [32]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, 5))
rainbow = [colors.rgb2hex(i) for i in colors_array]

markers_colors = []
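# Labels come from york_merged: kmeans.labels_ follows the row order of york_grouped, not york_merged
for lat, lon, poi, cluster in zip(york_merged['Latitude'], york_merged['Longitude'], york_merged['Neighbourhood'], york_merged['Cluster Labels']):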
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker([lat, lon], radius=5, popup=label, color=rainbow[cluster-1], fill=True, fill_color=rainbow[cluster-1], fill_opacity=0.7).add_to(map_clusters)
map_clusters

## Conclusion 1: The following clusters are created:

In [33]:
york_merged[york_merged['Cluster Labels']==0][['Postal Code','Neighbourhood']]

Unnamed: 0,Postal Code,Neighbourhood
0,M3A,Parkwoods
1,M4A,Victoria Village
2,M6A,"Lawrence Manor, Lawrence Heights"
3,M3B,Don Mills North
4,M6B,Glencairn
5,M3C,Don Mills South
6,M2H,Hillcrest Village
8,M2J,"Fairview, Henry Farm, Oriole"
9,M3J,"Northwood Park, York University"
10,M2K,Bayview Village


In [34]:
york_merged[york_merged['Cluster Labels']==1][['Postal Code','Neighbourhood']]

Unnamed: 0,Postal Code,Neighbourhood
7,M3H,"Bathurst Manor, Wilson Heights, Downsview North"


In [35]:
york_merged[york_merged['Cluster Labels']==2][['Postal Code','Neighbourhood']]

Unnamed: 0,Postal Code,Neighbourhood
15,M9L,Humber Summit
17,M3M,Downsview Central


## Finding a better place in North York | House Pricing and Population

To find a better place according to house prices and population, a second analysis is done with scraped data on the average house price and the population of each area.

## Scraping average price and population in North York

In [36]:
url = "https://housepricehub.com/cities/city/Toronto"
url2 = 'https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/hlt-fst/pd-pl/Table.cfm?Lang=Eng&T=1201&SR=1&S=22&O=A&RPP=9999&PR=0'
html_data = requests.get(url)
html_data2 = requests.get(url2)
soup = BeautifulSoup(html_data.text, "html5lib")
soup2 = BeautifulSoup(html_data2.text, "html5lib")

Scraped data for the average price in each postal code

In [37]:
column_names = ['Postal Code', 'Price']
data = pd.DataFrame(columns = column_names)

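# Collect rows in a list; DataFrame.append was removed in pandas 2.0
rows = []
for row in soup.find('tbody').find_all('tr'):
    rows.append({"Postal Code":row.find('td').text, "Price":float((row.text.split('$')[1].split('N')[0]).replace(",",''))})
data = pd.DataFrame(rows, columns = column_names)
data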

Unnamed: 0,Postal Code,Price
0,M3C,9591000.0
1,M4W,6800492.0
2,M3B,6575363.0
3,M4V,6045268.0
4,M4Y,5970000.0
...,...,...
95,M3N,893249.0
96,M2A,839000.0
97,M5E,799000.0
98,M7K,699900.0


Scraped data for the population of each postal code

In [38]:
column_names = ['Postal Code', 'Population']
data2 = pd.DataFrame(columns = column_names)

# Collect rows in a list; DataFrame.append was removed in pandas 2.0
rows2 = []
for row in soup2.find_all('tbody')[1].find_all('tr'):
    rows2.append({"Postal Code":row.text.split('\t')[4].strip(), "Population":float(row.text.split('\t')[6].strip().replace(",",''))})
data2 = pd.DataFrame(rows2, columns = column_names)
data2

Unnamed: 0,Postal Code,Population
0,CanadaFootnote 1,35151728.0
1,A0A,46587.0
2,A0B,19792.0
3,A0C,12587.0
4,A0E,22294.0
...,...,...
1637,X0G,500.0
1638,X1A,20054.0
1639,Y0A,1641.0
1640,Y0B,6561.0


#### Merging all the data

In [39]:
york_housing = pd.merge(
left = york_merged, right = data, left_on = ['Postal Code'], right_on=['Postal Code'], how = 'left')
# york_housing  = york_housing[['Neighbourhood','Price','Latitude','Longitude']]
york_housing_final = pd.merge(
left = york_housing, right = data2, left_on = ['Postal Code'], right_on=['Postal Code'], how = 'left')
york_housing_final = york_housing_final[['Postal Code','Neighbourhood','Price','Population','Latitude','Longitude']]
york_housing_final

Unnamed: 0,Postal Code,Neighbourhood,Price,Population,Latitude,Longitude
0,M3A,Parkwoods,1632144.0,34615.0,43.75245,-79.32991
1,M4A,Victoria Village,1579992.0,14443.0,43.73057,-79.31306
2,M6A,"Lawrence Manor, Lawrence Heights",2049592.0,21048.0,43.72327,-79.45042
3,M3B,Don Mills North,6575363.0,13324.0,43.74923,-79.36186
4,M6B,Glencairn,1624918.0,28522.0,43.70687,-79.44812
5,M3C,Don Mills South,9591000.0,39153.0,43.72168,-79.34352
6,M2H,Hillcrest Village,1362927.0,24497.0,43.80225,-79.35558
7,M3H,"Bathurst Manor, Wilson Heights, Downsview North",1895263.0,37011.0,43.75788,-79.44847
8,M2J,"Fairview, Henry Farm, Oriole",1599720.0,58293.0,43.78097,-79.34781
9,M3J,"Northwood Park, York University",1068740.0,25473.0,43.76476,-79.48798


## Clustering data according to House Pricing and Population of the Area

#### Normalizing the price and population data to avoid skewed clustering

In [40]:
km = KMeans(n_clusters=3, init='k-means++', max_iter=100, n_init=1, 
  verbose=True)
kclusters = 3
york_house_clustering = york_housing_final[['Price','Population']]

scaler = StandardScaler().fit(york_house_clustering)
normalized_data = york_house_clustering.copy()
normalized_data[['Price','Population']] = scaler.transform(york_house_clustering)

kmeans = KMeans(n_clusters=kclusters, random_state=111).fit(normalized_data)
print(kmeans.labels_)
print(len(kmeans.labels_))

[0 0 0 1 0 1 0 0 2 0 0 0 1 0 0 0 0 0 0 0 2 0 1 0]
24
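
Prices are in the millions while populations are in the tens of thousands, so without scaling the Euclidean distances used by k-means would be driven almost entirely by price. `StandardScaler` rescales each column to zero mean and unit variance using the population standard deviation. The sketch below reproduces that calculation with the standard library on three rows taken from the merged table; the function name `standardize` is an assumption for illustration.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    centre = mean(values)
    spread = pstdev(values)  # divides by n, matching StandardScaler
    return [(v - centre) / spread for v in values]


# Parkwoods, Don Mills North and Fairview rows from the merged table
prices = [1632144.0, 6575363.0, 1599720.0]
populations = [34615.0, 13324.0, 58293.0]

price_z = standardize(prices)
population_z = standardize(populations)
print(price_z)
print(population_z)
```

After scaling, both columns have the same spread, so a difference of one standard deviation in population counts as much as one standard deviation in price.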


In [41]:
york_housing_final['cluster'] = kmeans.labels_
york_housing_final

Unnamed: 0,Postal Code,Neighbourhood,Price,Population,Latitude,Longitude,cluster
0,M3A,Parkwoods,1632144.0,34615.0,43.75245,-79.32991,0
1,M4A,Victoria Village,1579992.0,14443.0,43.73057,-79.31306,0
2,M6A,"Lawrence Manor, Lawrence Heights",2049592.0,21048.0,43.72327,-79.45042,0
3,M3B,Don Mills North,6575363.0,13324.0,43.74923,-79.36186,1
4,M6B,Glencairn,1624918.0,28522.0,43.70687,-79.44812,0
5,M3C,Don Mills South,9591000.0,39153.0,43.72168,-79.34352,1
6,M2H,Hillcrest Village,1362927.0,24497.0,43.80225,-79.35558,0
7,M3H,"Bathurst Manor, Wilson Heights, Downsview North",1895263.0,37011.0,43.75788,-79.44847,0
8,M2J,"Fairview, Henry Farm, Oriole",1599720.0,58293.0,43.78097,-79.34781,2
9,M3J,"Northwood Park, York University",1068740.0,25473.0,43.76476,-79.48798,0


### To decide what each cluster is best suited for, the clusters are grouped by their mean values

In [42]:
york_housing_final.groupby('cluster').mean(numeric_only=True)

Unnamed: 0_level_0,Price,Population,Latitude,Longitude
cluster,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,1596880.0,25150.944444,43.748603,-79.449109
1,6812207.0,18009.25,43.743918,-79.371577
2,2022157.0,67095.0,43.774355,-79.377545


### According to the analysis:
#### * Cluster 0 has the lowest mean house price along with a moderate mean population - **can be termed an ideal area for living**
#### * Cluster 1 has the highest mean house price and the lowest mean population - **can be termed the posh area of the city**
#### * Cluster 2 has a slightly higher mean house price than Cluster 0 (possibly better house quality) but by far the highest mean population - **can be termed an avoidable area for living**
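
Because k-means label numbers are arbitrary and can change between runs or random seeds, it can help to derive each cluster's description from the per-cluster means instead of hard-coding the cluster numbers. The sketch below does this for the means printed in the table above; the dictionary layout and variable names are assumptions for illustration.

```python
# Per-cluster means copied from the grouped table above
cluster_means = {
    0: {"Price": 1596880.0, "Population": 25150.944444},
    1: {"Price": 6812207.0, "Population": 18009.25},
    2: {"Price": 2022157.0, "Population": 67095.0},
}

cheapest = min(cluster_means, key=lambda c: cluster_means[c]["Price"])
priciest = max(cluster_means, key=lambda c: cluster_means[c]["Price"])
most_populous = max(cluster_means, key=lambda c: cluster_means[c]["Population"])
least_populous = min(cluster_means, key=lambda c: cluster_means[c]["Population"])

print("Lowest mean price: cluster", cheapest)             # cluster 0
print("Highest mean price: cluster", priciest)            # cluster 1
print("Highest mean population: cluster", most_populous)  # cluster 2
print("Lowest mean population: cluster", least_populous)  # cluster 1
```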

Plotting the clusters on a map

In [43]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

x = np.arange(kclusters)
rainbow = ['green','orange','red']

markers_colors = []
for lat, lon, poi, cluster in zip(york_housing_final['Latitude'], york_housing_final['Longitude'], 
                                  york_housing_final['Neighbourhood'],york_housing_final['cluster']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker([lat, lon], radius=5, popup=label, color=rainbow[cluster], fill=True, fill_color=rainbow[cluster], fill_opacity=0.7).add_to(map_clusters)
map_clusters

## Conclusion 2: The following clusters are created according to housing price and population:

### Cluster 0 has the lowest mean house price along with a moderate mean population - **can be termed an ideal area for living**

In [44]:
york_housing_final[york_housing_final['cluster']==0][['Postal Code','Neighbourhood']]

Unnamed: 0,Postal Code,Neighbourhood
0,M3A,Parkwoods
1,M4A,Victoria Village
2,M6A,"Lawrence Manor, Lawrence Heights"
4,M6B,Glencairn
6,M2H,Hillcrest Village
7,M3H,"Bathurst Manor, Wilson Heights, Downsview North"
9,M3J,"Northwood Park, York University"
10,M2K,Bayview Village
11,M3K,Downsview East
13,M3L,Downsview West


### Cluster 1 has the highest mean house price and the lowest mean population - **can be termed the posh area of the city**

In [45]:
york_housing_final[york_housing_final['cluster']==1][['Postal Code','Neighbourhood']]

Unnamed: 0,Postal Code,Neighbourhood
3,M3B,Don Mills North
5,M3C,Don Mills South
12,M2L,"York Mills, Silver Hills"
22,M2P,York Mills West


### Cluster 2 has a slightly higher mean house price than Cluster 0 (possibly better house quality) but by far the highest mean population - **can be termed an avoidable area for living**

In [46]:
york_housing_final[york_housing_final['cluster']==2][['Postal Code','Neighbourhood']]

Unnamed: 0,Postal Code,Neighbourhood
8,M2J,"Fairview, Henry Farm, Oriole"
20,M2N,Willowdale South
