<h1 align=center><font size = 5>Finding your ideal Neighborhood for a home in Toronto City</font></h1>

## Introduction

This project selects an ideal neighborhood from a list of candidates, based on the following criteria in order of priority:
1. Neighborhood should have good schools
2. Neighborhood should also have parks and restaurants nearby

In this project, I use the Toronto location data from the previous assignment and the Foursquare API to explore neighborhoods in Toronto. I then use the *k*-means clustering algorithm to group the neighborhoods and analyze the result for different values of *k*. Finally, I use the Folium library to visualize the Toronto neighborhoods and their emerging clusters.

## Problem Statement:

People with kids who move to Toronto from suburbs, small towns, or other countries share a common problem: finding a good neighborhood for a home. Their first criterion is a good school in the neighborhood, followed by other amenities such as restaurants and parks.

## Dataset:

For this project we use only the Toronto neighborhoods whose Borough name contains "Toronto". The school data has to come from online sources and must be available by postal code or by latitude and longitude. The following link lists all elementary schools by postal code along with their ratings:
http://ontario.compareschoolrankings.org/elementary/SchoolsByRankLocationName.aspx?schooltype=elementary

The restaurant and park information for all neighborhoods of interest in Toronto can be obtained from Foursquare. Since schools have a higher priority, we restrict our neighborhoods to those that have good schools.

Once we have the school data along with the school ratings, we merge it with the restaurant and park data. We summarize restaurants and parks by count and schools by mean rating. We then cluster this data set to find neighborhoods with similar characteristics. Users can use this data to find a neighborhood of their choice from these clusters.
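
The summarization plan described above can be sketched on toy data. This is a minimal illustration only: the data frames, the `Kind` column, and all values are hypothetical and are not the data used later in the notebook.

```python
import pandas as pd

# Hypothetical toy data; names and values are illustrative only.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "B", "B", "B"],
    "Kind": ["Park", "Restaurant", "Park", "Park", "Restaurant"],
})
schools = pd.DataFrame({
    "Neighborhood": ["A", "A", "C"],
    "Rating": [8.0, 9.0, 8.5],
})

# Count venues of each kind per neighborhood (one column per kind).
venue_counts = (
    venues.groupby(["Neighborhood", "Kind"]).size().unstack(fill_value=0).reset_index()
)

# Average the school ratings per neighborhood.
school_means = schools.groupby("Neighborhood", as_index=False)["Rating"].mean()

# Schools have priority, so start from the school table and left-join the counts.
# Neighborhoods without any venue get a count of 0.
summary = school_means.merge(venue_counts, on="Neighborhood", how="left").fillna(0)
print(summary)
```

Starting the merge from the school table means that a neighborhood without good schools (here `B`) drops out, while a neighborhood with schools but no venues (here `C`) stays with zero counts.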

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Download and Explore Dataset</a>

2. <a href="#item2">Analyze Each Neighborhood for restaurants and parks</a>

3. <a href="#item3">Find top schools in each Neighborhood in Toronto</a>

4. <a href="#item4">Cluster Neighborhoods</a>

</font>
</div>

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # comment out this line if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # comment out this line if folium is already installed
import folium # map rendering library

print('Libraries imported.')

Solving environment: done

# All requested packages already installed.

Solving environment: done

# All requested packages already installed.

Libraries imported.


## 1. Download and Explore Dataset

Load the Toronto location data saved from the previous assignment.

In [2]:
neighborhoods = pd.read_csv("toronto_data.csv")
neighborhoods.columns = ['Num', 'Postalcode', 'Borough', 'Neighborhood','Latitude','Longitude']
neighborhoods.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 6 columns):
Num             103 non-null int64
Postalcode      103 non-null object
Borough         103 non-null object
Neighborhood    103 non-null object
Latitude        103 non-null float64
Longitude       103 non-null float64
dtypes: float64(2), int64(1), object(3)
memory usage: 4.9+ KB


In [3]:
del neighborhoods['Num']
neighborhoods.head()

Unnamed: 0,Postalcode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


Next, we keep only the neighborhoods whose Borough contains "Toronto".

In [4]:
neighborhoods = neighborhoods[neighborhoods['Borough'].str.contains("Toronto")].sort_values('Postalcode').reset_index(drop=True)
neighborhoods

Unnamed: 0,Postalcode,Borough,Neighborhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
5,M4P,Central Toronto,Davisville North,43.712751,-79.390197
6,M4R,Central Toronto,North Toronto West,43.715383,-79.405678
7,M4S,Central Toronto,Davisville,43.704324,-79.38879
8,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316
9,M4V,Central Toronto,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",43.686412,-79.400049


In [5]:
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="coursera-capstone-project")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto City are 43.653963, -79.387207.


In [6]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

In [7]:
CLIENT_ID = '2FPN1CICVCJC3FTJJVN1VJG04RJ0IWGB3HGYL3HTAENMFUK1' # your Foursquare ID
CLIENT_SECRET = 'SXYV152RCW3DLF3RLTOMMSRU0LIZWJSFFRCFOQ2PXIWXK12X' # your Foursquare Secret
VERSION = '20180604'
LIMIT = 30
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 2FPN1CICVCJC3FTJJVN1VJG04RJ0IWGB3HGYL3HTAENMFUK1
CLIENT_SECRET:SXYV152RCW3DLF3RLTOMMSRU0LIZWJSFFRCFOQ2PXIWXK12X


In [8]:
neighborhoods.nunique()

Postalcode      38
Borough          4
Neighborhood    38
Latitude        38
Longitude       33
dtype: int64

In [9]:
neighborhoods['Neighborhood'].nunique()

38

## 2. Analyze Each Neighborhood for restaurants and parks

In [10]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:  # explore-style responses nest the categories under 'venue.'
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [11]:
# function that finds nearby venues for each neighborhood based on a search query
# NOTE: reads the module-level search_query and LIMIT variables set before each call
def getNearbyVenues(neighs, latitudes, longitudes, radius=500):
    
    frames = []  # one dataframe of venues per neighborhood
    for neigh, lat, lng in zip(neighs, latitudes, longitudes):
        
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            lat, 
            lng, 
            VERSION, 
            search_query, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()
        
        # assign relevant part of JSON to venues; skip neighborhoods with no matches
        venues = results['response']['venues']
        if not venues:
            continue

        # transform venues into a dataframe
        dataframe = json_normalize(venues)

        # keep only columns that include venue name, and anything that is associated with location
        filtered_columns = ['name', 'categories'] + [col for col in dataframe.columns if col.startswith('location.')] + ['id']
        dataframe_filtered = dataframe.loc[:, filtered_columns].copy()

        # filter the category for each row
        dataframe_filtered['categories'] = dataframe_filtered.apply(get_category_type, axis=1)        

        # clean column names by keeping only last term
        dataframe_filtered.columns = [column.split('.')[-1] for column in dataframe_filtered.columns]
        dataframe_filtered['neighborhood'] = neigh

        frames.append(dataframe_filtered)
            
    # combine the per-neighborhood results; sort=True keeps the columns in alphabetical order
    return pd.concat(frames, sort=True) if frames else pd.DataFrame()
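
The function above builds the request URL by string formatting, so a query containing spaces or other special characters is not escaped. A minimal sketch of one way to build the same URL with percent-encoding follows, using only the standard library. The helper name `build_search_url` and the placeholder credentials are assumptions for illustration; the endpoint and parameter names are the ones used in the notebook.

```python
from urllib.parse import urlencode

# Endpoint taken from the notebook; the credentials below are placeholders.
SEARCH_ENDPOINT = "https://api.foursquare.com/v2/venues/search"

def build_search_url(lat, lng, query, radius=500, limit=30,
                     client_id="YOUR_CLIENT_ID", client_secret="YOUR_CLIENT_SECRET",
                     version="20180604"):
    """Return a venues/search URL with every parameter percent-encoded."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "ll": "{},{}".format(lat, lng),
        "v": version,
        "query": query,
        "radius": radius,
        "limit": limit,
    }
    return SEARCH_ENDPOINT + "?" + urlencode(params)

# Hypothetical multi-word query at the coordinates of The Beaches
url = build_search_url(43.676357, -79.293031, "Ice Cream Shop")
print(url)
```

Passing the query as an argument, rather than reading a global variable, also makes each call self-describing.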

### Run the above function to get all parks by neighborhood

In [12]:
search_query = 'Park'
LIMIT = 30
toronto_parks = getNearbyVenues(neighs=neighborhoods['Neighborhood'],
                                   latitudes=neighborhoods['Latitude'],
                                   longitudes=neighborhoods['Longitude']
                                  )



In [13]:
print(toronto_parks.shape)
toronto_parks.head()

(566, 16)


Unnamed: 0,address,categories,cc,city,country,crossStreet,distance,formattedAddress,id,labeledLatLngs,lat,lng,name,neighborhood,postalCode,state
0,131 Glen Manor Drive,Park,CA,Toronto,Canada,,177,"[131 Glen Manor Drive, Toronto ON, Canada]",4dbc8fe96a23e294ba3237bd,"[{'label': 'display', 'lat': 43.67527822698259...",43.675278,-79.294647,Glen Stewart Park,The Beaches,,ON
1,,Park,CA,,Canada,,340,[Canada],51e19355498e102728a023e2,"[{'label': 'display', 'lat': 43.673384, 'lng':...",43.673384,-79.294036,glen manor park,The Beaches,,
2,,Park,CA,,Canada,,451,[Canada],4fb19f6ce4b0b9253b201145,"[{'label': 'display', 'lat': 43.672308, 'lng':...",43.672308,-79.292782,Small Park,The Beaches,,
3,6 Williamson Road,Recreation Center,CA,Toronto,Canada,,505,"[6 Williamson Road, Toronto ON, Canada]",533202d3498e5cfccb08b4e1,"[{'label': 'display', 'lat': 43.67365646362305...",43.673656,-79.298073,department of parks and recreation beaches rec...,The Beaches,,ON
0,,Bus Line,CA,Toronto,Canada,,654,"[Toronto ON M4K 1N1, Canada]",5a562882610f043e686ddff0,"[{'label': 'display', 'lat': 43.68441888798419...",43.684419,-79.356766,TTC Bus #100A Flemingdon Park To Don Mills & W...,"The Danforth West, Riverdale",M4K 1N1,ON


In [14]:
# only pick the rows where category is Park

toronto_parks_list = toronto_parks.loc[toronto_parks['categories'] == 'Park']
toronto_parks_list = toronto_parks_list[['id','name', 'categories','neighborhood']]
toronto_parks_count = toronto_parks_list.groupby('neighborhood').count().reset_index()

In [15]:
print(toronto_parks_list.shape)
toronto_parks_count

(137, 4)


Unnamed: 0,neighborhood,id,name,categories
0,"Adelaide, King, Richmond",4,4,4
1,Berczy Park,4,4,4
2,"Brockton, Exhibition Place, Parkdale Village",5,5,5
3,Business reply mail Processing Centre969 Eastern,1,1,1
4,"Cabbagetown, St. James Town",3,3,3
5,Central Bay Street,2,2,2
6,"Chinatown, Grange Park, Kensington Market",7,7,7
7,Christie,2,2,2
8,Church and Wellesley,7,7,7
9,"Commerce Court, Victoria Hotel",6,6,6
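
The count table above repeats the same number in the `id`, `name`, and `categories` columns, because `count()` counts the non-null values of every remaining column. If a single count column is preferred, `size()` is one option. The sketch below uses hypothetical rows; only the column names mirror the notebook, and `ParksCount` matches the name assigned later.

```python
import pandas as pd

# Hypothetical venue rows; only the column names mirror the notebook.
parks = pd.DataFrame({
    "id": ["a1", "a2", "b1"],
    "name": ["Park One", "Park Two", "Park Three"],
    "categories": ["Park", "Park", "Park"],
    "neighborhood": ["Christie", "Christie", "Rosedale"],
})

# size() counts rows per group and yields one named column.
parks_count = parks.groupby("neighborhood").size().reset_index(name="ParksCount")
print(parks_count)
```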


In [16]:
# get all restaurants by neighborhood
search_query = 'Restaurant'
LIMIT = 30
toronto_restaurant = getNearbyVenues(neighs=neighborhoods['Neighborhood'],
                                   latitudes=neighborhoods['Latitude'],
                                   longitudes=neighborhoods['Longitude']
                                  )



In [17]:
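# NOTE: comparing against the string 'NaN' does not drop rows with a missing category;
# use toronto_restaurant['categories'].notna() if those rows should be excluded
toronto_restaurants_list = toronto_restaurant.loc[toronto_restaurant['categories'] != 'NaN']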
toronto_restaurants_list = toronto_restaurants_list[['id','name', 'categories','neighborhood']]
toronto_restaurant_count = toronto_restaurants_list.groupby('neighborhood').count().reset_index()
toronto_restaurant_count

Unnamed: 0,neighborhood,id,name,categories
0,"Adelaide, King, Richmond",30,30,30
1,Berczy Park,14,14,14
2,"Brockton, Exhibition Place, Parkdale Village",4,4,4
3,"Cabbagetown, St. James Town",9,9,8
4,Central Bay Street,30,30,30
5,"Chinatown, Grange Park, Kensington Market",30,30,29
6,Christie,3,3,3
7,Church and Wellesley,19,19,19
8,"Commerce Court, Victoria Hotel",30,30,30
9,Davisville,6,6,5
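
In the table above, "Cabbagetown, St. James Town" has 9 ids but only 8 categories, which suggests that rows with a missing category passed the `!= 'NaN'` filter. A small sketch on hypothetical data shows the difference between comparing against the string `'NaN'` and testing for missing values with `notna()`:

```python
import pandas as pd

# Hypothetical venues; the second one has no category.
venues = pd.DataFrame({
    "id": ["r1", "r2", "r3"],
    "categories": ["Restaurant", None, "Café"],
})

# A missing value is not equal to the string 'NaN', so nothing is dropped here.
string_filtered = venues.loc[venues["categories"] != "NaN"]

# notna() tests for missing values and drops the second row.
null_filtered = venues.loc[venues["categories"].notna()]

print(len(string_filtered), len(null_filtered))  # prints: 3 2
```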


## 3. Find good schools in Toronto by neighborhood

### School ratings by postal code are available at:
http://ontario.compareschoolrankings.org/elementary/SchoolsByRankLocationName.aspx?schooltype=elementary
We download the data using the City and Postal Code filters.

In [18]:
school_ratings = pd.read_csv("toronto_schools.csv")
school_ratings.head()

Unnamed: 0,Rank,Rating,Postalcode,School
0,165/3064,8.4,M4E,Balmy Beach
1,224/3064,8.2,M4E,St Denis
2,567/3064,7.5,M4E,St John
3,770/3064,7.2,M4E,Williamson Road
4,986/3064,6.9,M4E,Adam Beck


In [19]:
# Only select schools where rating >= 8
top_schools = school_ratings.loc[school_ratings['Rating'] >= 8 ].reset_index(drop=True)
top_schools.head()

Unnamed: 0,Rank,Rating,Postalcode,School
0,165/3064,8.4,M4E,Balmy Beach
1,224/3064,8.2,M4E,St Denis
2,27/3064,9.7,M4K,Pape Avenue
3,65/3064,9.0,M4K,Jackman Avenue
4,101/3064,8.7,M4K,Withrow Avenue


In [20]:
top_schools_inToronto = top_schools.merge(neighborhoods, on='Postalcode', how = 'inner')
#df1 = df.merge(geocoord, how = 'left', on = 'Postcode')
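
An inner merge silently drops schools whose postal code has no matching neighborhood. A minimal sketch, on hypothetical rows, of how pandas can check the join and expose unmatched rows: `validate="many_to_one"` raises an error if a postal code appears more than once on the right side, and `indicator=True` adds a `_merge` column. The postal code `M9Z` is made up for the example.

```python
import pandas as pd

schools = pd.DataFrame({
    "School": ["S1", "S2", "S3"],
    "Postalcode": ["M4E", "M4E", "M9Z"],   # M9Z is a made-up code with no match
    "Rating": [8.4, 8.2, 9.0],
})
areas = pd.DataFrame({
    "Postalcode": ["M4E", "M4K"],
    "Neighborhood": ["The Beaches", "The Danforth West, Riverdale"],
})

# Left merge keeps every school; _merge records whether a neighborhood was found.
checked = schools.merge(areas, on="Postalcode", how="left",
                        validate="many_to_one", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "School"].tolist()
print(unmatched)  # prints: ['S3']
```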

### Group schools by neighborhood to get the mean of the rating

In [21]:
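# average only the numeric columns; text columns such as Rank and School cannot be averaged
schools_grouped = top_schools_inToronto.groupby('Neighborhood')[['Rating', 'Latitude', 'Longitude']].mean().reset_index()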
schools_grouped

Unnamed: 0,Neighborhood,Rating,Latitude,Longitude
0,Christie,8.2,43.669542,-79.422564
1,Davisville,8.2,43.704324,-79.38879
2,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",9.8,43.686412,-79.400049
3,"Dovercourt Village, Dufferin",10.0,43.669005,-79.442259
4,Lawrence Park,9.1,43.72802,-79.38879
5,"Little Portugal, Trinity",8.4,43.647927,-79.41975
6,"Moore Park, Summerhill East",8.566667,43.689574,-79.38316
7,North Toronto West,9.3,43.715383,-79.405678
8,"Parkdale, Roncesvalles",8.2,43.64896,-79.456325
9,Rosedale,8.7,43.679563,-79.377529


The original idea was to also use venue ratings when selecting a neighborhood. However, the ratings data for restaurants and parks is not good enough to use here: in most cases no rating is available.

In [22]:
def getVenueRating(venue_ids):
    
    ratingsl = []
    for venue_id in venue_ids:

        # create the API request URL for the venue details endpoint
        url = 'https://api.foursquare.com/v2/venues/{}?client_id={}&client_secret={}&v={}'.format(
            venue_id,
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION)

        # make the GET request
        result = requests.get(url).json()
        
        # fall back to 0 when the response carries no rating
        try:
            x = result['response']['venue']['rating']
        except (KeyError, TypeError):
            x = 0
        
        ratingsl.append(x)
        
    return ratingsl

In [23]:
#toronto_gshops_list.astype('str')
toronto_park_rating = getVenueRating(venue_ids = toronto_parks_list['id'])

In [24]:
toronto_parks_count.columns = ['Neighborhood','id','name','categories']
toronto_parks_count.head()

Unnamed: 0,Neighborhood,id,name,categories
0,"Adelaide, King, Richmond",4,4,4
1,Berczy Park,4,4,4
2,"Brockton, Exhibition Place, Parkdale Village",5,5,5
3,Business reply mail Processing Centre969 Eastern,1,1,1
4,"Cabbagetown, St. James Town",3,3,3


In [25]:
toronto_restaurant_count.columns = ['Neighborhood','id','name','categories']
toronto_restaurant_count.head()

Unnamed: 0,Neighborhood,id,name,categories
0,"Adelaide, King, Richmond",30,30,30
1,Berczy Park,14,14,14
2,"Brockton, Exhibition Place, Parkdale Village",4,4,4
3,"Cabbagetown, St. James Town",9,9,8
4,Central Bay Street,30,30,30


In [26]:
schools_grouped.columns =  ['Neighborhood','Rating','Latitude','Longitude']
schools_grouped.head()

Unnamed: 0,Neighborhood,Rating,Latitude,Longitude
0,Christie,8.2,43.669542,-79.422564
1,Davisville,8.2,43.704324,-79.38879
2,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",9.8,43.686412,-79.400049
3,"Dovercourt Village, Dufferin",10.0,43.669005,-79.442259
4,Lawrence Park,9.1,43.72802,-79.38879


### Merge the park and restaurant data with the school data

Schools have the first priority, so we keep only the neighborhoods where the schools are good.

In [30]:
#df = df[0:0]
df = schools_grouped.merge(toronto_parks_count, on='Neighborhood', how = 'left')
df.head()

Unnamed: 0,Neighborhood,Rating,Latitude,Longitude,id,name,categories
0,Christie,8.2,43.669542,-79.422564,2.0,2.0,2.0
1,Davisville,8.2,43.704324,-79.38879,3.0,3.0,3.0
2,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",9.8,43.686412,-79.400049,1.0,1.0,1.0
3,"Dovercourt Village, Dufferin",10.0,43.669005,-79.442259,6.0,6.0,6.0
4,Lawrence Park,9.1,43.72802,-79.38879,1.0,1.0,1.0


In [31]:
df = df.merge(toronto_restaurant_count, on='Neighborhood', how = 'left')
df.head()

Unnamed: 0,Neighborhood,Rating,Latitude,Longitude,id_x,name_x,categories_x,id_y,name_y,categories_y
0,Christie,8.2,43.669542,-79.422564,2.0,2.0,2.0,3.0,3.0,3.0
1,Davisville,8.2,43.704324,-79.38879,3.0,3.0,3.0,6.0,6.0,5.0
2,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",9.8,43.686412,-79.400049,1.0,1.0,1.0,5.0,5.0,4.0
3,"Dovercourt Village, Dufferin",10.0,43.669005,-79.442259,6.0,6.0,6.0,1.0,1.0,1.0
4,Lawrence Park,9.1,43.72802,-79.38879,1.0,1.0,1.0,,,


In [32]:
# take an explicit copy so the in-place fillna below does not act on a slice
df = df[['Neighborhood','Latitude','Longitude','Rating','id_x','id_y']].copy()
df.columns = ['Neighborhood','Latitude','Longitude','SchoolRating','ParksCount','RestaurantCount']
df.fillna(0, inplace = True)
df

Unnamed: 0,Neighborhood,Latitude,Longitude,SchoolRating,ParksCount,RestaurantCount
0,Christie,43.669542,-79.422564,8.2,2.0,3.0
1,Davisville,43.704324,-79.38879,8.2,3.0,6.0
2,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",43.686412,-79.400049,9.8,1.0,5.0
3,"Dovercourt Village, Dufferin",43.669005,-79.442259,10.0,6.0,1.0
4,Lawrence Park,43.72802,-79.38879,9.1,1.0,0.0
5,"Little Portugal, Trinity",43.647927,-79.41975,8.4,4.0,8.0
6,"Moore Park, Summerhill East",43.689574,-79.38316,8.566667,2.0,0.0
7,North Toronto West,43.715383,-79.405678,9.3,2.0,2.0
8,"Parkdale, Roncesvalles",43.64896,-79.456325,8.2,0.0,4.0
9,Rosedale,43.679563,-79.377529,8.7,3.0,0.0


## 4. Cluster Neighborhoods

In [33]:
# set number of clusters
kclusters = 4

df_clustering = df.drop(['Neighborhood', 'Latitude', 'Longitude'], axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([3, 0, 3, 1, 1, 0, 1, 1, 3, 1], dtype=int32)

In [34]:
# add clustering labels
df['Cluster Labels'] = kmeans.labels_

df.sort_values('Cluster Labels') # check the last columns!

Unnamed: 0,Neighborhood,Latitude,Longitude,SchoolRating,ParksCount,RestaurantCount,Cluster Labels
1,Davisville,43.704324,-79.38879,8.2,3.0,6.0,0
5,"Little Portugal, Trinity",43.647927,-79.41975,8.4,4.0,8.0,0
11,"Runnymede, Swansea",43.651571,-79.48445,8.766667,3.0,7.0,0
15,"The Danforth West, Riverdale",43.679557,-79.352188,8.9,3.0,9.0,0
3,"Dovercourt Village, Dufferin",43.669005,-79.442259,10.0,6.0,1.0,1
4,Lawrence Park,43.72802,-79.38879,9.1,1.0,0.0,1
6,"Moore Park, Summerhill East",43.689574,-79.38316,8.566667,2.0,0.0,1
7,North Toronto West,43.715383,-79.405678,9.3,2.0,2.0,1
9,Rosedale,43.679563,-79.377529,8.7,3.0,0.0,1
10,Roselawn,43.711695,-79.416936,8.45,0.0,0.0,1
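
k-means groups rows by Euclidean distance, so a column with a wide range (restaurant counts can reach the request limit of 30) can outweigh a column with a narrow range (school ratings here lie between 8 and 10). The notebook clusters the raw values. One possible refinement, shown only as a sketch, is to standardize each column first. The three rows below are copied from the merged table above.

```python
import numpy as np

# Three rows from the merged table: SchoolRating, ParksCount, RestaurantCount
features = np.array([
    [8.2, 2.0, 3.0],    # Christie
    [10.0, 6.0, 1.0],   # Dovercourt Village, Dufferin
    [8.4, 4.0, 8.0],    # Little Portugal, Trinity
])

# Z-score each column: subtract its mean and divide by its standard deviation.
scaled = (features - features.mean(axis=0)) / features.std(axis=0)

print(scaled.round(2))
```

After scaling, every column has mean 0 and standard deviation 1, so each feature contributes on a comparable scale to the distance computation.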


In [35]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(df['Latitude'], df['Longitude'], df['Neighborhood'], df['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
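
The introduction mentions analyzing the result with different values of *k*. A minimal sketch of one common way to do this is to fit k-means for a range of *k* and compare the inertia (the within-cluster sum of squared distances). The data below is synthetic and stands in for the feature table; the blob centers and the range of *k* are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the feature table: three well-separated blobs.
rng = np.random.RandomState(0)
centers = np.array([[8.0, 1.0, 2.0], [9.0, 5.0, 10.0], [10.0, 2.0, 25.0]])
points = np.vstack([c + rng.normal(scale=0.3, size=(10, 3)) for c in centers])

# Fit k-means for several k and record the inertia of each fit.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, random_state=0, n_init=10).fit(points)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 2))
```

Inertia generally falls as *k* grows, so the value to look for is the point where the decrease levels off, rather than the minimum.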