### Introduction

In this assignment, we explore and cluster the neighborhoods of Toronto. First, we scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to get the postal codes, boroughs, and neighborhoods of Toronto, and store the data in a dataframe.

Let's download and import the dependencies.

In [33]:
# Import pandas and numpy
import pandas as pd
import numpy as np

from geopy.geocoders import Nominatim

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

print("Libraries Imported")

Libraries Imported


Now define the URL of the page to scrape and use the <b>read_html</b> function of pandas to extract its tables.\
The first table on the page is the one we need, so we load it into a pandas dataframe.

In [34]:
# define the url
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

# Extract the tables
table_dfs = pd.read_html(url)

# Get the first table
table_df = table_dfs[0]
table_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


Change the column name Neighbourhood to Neighborhood.

In [35]:
table_df = table_df.rename(columns={'Neighbourhood': 'Neighborhood'})

Keep only the rows that have an assigned borough; drop every row whose borough is Not assigned.

In [36]:
# Keep only the rows whose borough is assigned (a boolean mask preserves the original index labels)
values = ['Not assigned']
table_df = table_df[~table_df['Borough'].isin(values)]
table_df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


Display the total number of rows.

In [37]:
print("There are {} rows in the dataframe.".format(table_df.shape[0]))

There are 103 rows in the dataframe.


### Define the Foursquare credentials and version

In [38]:
CLIENT_ID = 'MARBGB0XUSWGYWU2DBN2XNDIC2NJFECHSEP2MY10ZM5QZS2V' # your Foursquare ID
CLIENT_SECRET = 'RKZ0FXERT2WRFR4GRL1WSALH55VGWMXXL2GF0HLNY2GL3RRG' # your Foursquare Secret
VERSION = '20180604'
LIMIT = 30
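
Hard-coding API credentials in a notebook makes them easy to leak when the notebook is shared. The following is a minimal sketch of one alternative, assuming the credentials have been exported as environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are hypothetical; they are not prescribed by Foursquare.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Read the Foursquare credentials from environment variables.

    The variable names are hypothetical; use whatever names you export.
    """
    client_id = env.get('FOURSQUARE_CLIENT_ID')
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')
    if not client_id or not client_secret:
        raise RuntimeError('Foursquare credentials are not set in the environment')
    return client_id, client_secret
```

With both variables set, `CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials()` would replace the two hard-coded assignments.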

Next we need the latitude and longitude of each neighborhood. For this we use a CSV file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data . The download is saved as 'Geospatial_Coordinates.csv'.

We can use pandas read_csv method to read the data into a dataframe.

In [39]:
geodata_df = pd.read_csv('Geospatial_Coordinates.csv')
geodata_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Now the above dataframe needs to be merged with our postal code dataframe to get the full data.

In [40]:
table_df = table_df.join(geodata_df.set_index('Postal Code'), on='Postal Code')
table_df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
2,M3A,North York,Parkwoods,43.753259,-79.329656
3,M4A,North York,Victoria Village,43.725882,-79.315572
4,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
5,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
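
A join like the one above leaves `NaN` coordinates for any postal code that is missing from the CSV, without raising an error. As a hedged sketch on made-up rows (the code `M9Z` and its borough are toy values, not real data), `DataFrame.merge` with `validate` and `indicator` can make such gaps visible:

```python
import pandas as pd

# Toy data for illustration only; the real frames come from the steps above.
postal = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M9Z'],
    'Borough': ['North York', 'North York', 'Example Borough'],
})
coords = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# validate raises if a postal code appears more than once on either side;
# indicator adds a '_merge' column that records where each row came from.
merged = postal.merge(coords, on='Postal Code', how='left',
                      validate='one_to_one', indicator=True)

# Postal codes that found no coordinates in the CSV
missing = merged.loc[merged['_merge'] == 'left_only', 'Postal Code'].tolist()
print(missing)
```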


#### Explore the neighborhood

Now we will consider only the boroughs whose name contains the keyword 'Toronto'.

In [41]:
toronto_df = table_df.loc[table_df['Borough'].str.contains('Toronto')]
toronto_df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
4,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
13,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
22,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
30,M4E,East Toronto,The Beaches,43.676357,-79.293031


#### Let's create a function to get the nearby venues for all the neighborhoods in boroughs whose name contains 'Toronto'.

In [42]:
import requests
def getNearbyVenues(pc, names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for pc, name, lat, lng in zip(pc, names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            pc,
            name,
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = [
                  'Postal Code',
                  'Neighborhood',
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
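
`LIMIT` is defined earlier, but the request URL in `getNearbyVenues` never includes it, so the API's default result count applies. The following is a minimal sketch of building the URL with the limit included. `build_explore_url` is a hypothetical helper, and the sketch assumes the endpoint accepts a `limit` query parameter as in the course lab this notebook follows.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=30):
    """Build the venues/explore request URL (hypothetical helper)."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,  # assumed parameter name, see the note above
    }
    # urlencode percent-escapes the values, so the comma in 'll' becomes %2C
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

# Placeholder credentials, for illustration only
print(build_explore_url('ID', 'SECRET', '20180604', 43.65426, -79.360636))
```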

Now run the above function on each neighborhood and store the result in a new dataframe called *toronto_venues*.

In [43]:
toronto_venues = getNearbyVenues(pc=toronto_df['Postal Code'],
                                   names=toronto_df['Neighborhood'],
                                   latitudes=toronto_df['Latitude'],
                                   longitudes=toronto_df['Longitude']
                                  )

In [44]:
toronto_venues.head()

Unnamed: 0,Postal Code,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M5A,"Regent Park, Harbourfront",43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,M5A,"Regent Park, Harbourfront",43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,M5A,"Regent Park, Harbourfront",43.65426,-79.360636,Cooper Koo Family YMCA,43.653249,-79.358008,Distribution Center
3,M5A,"Regent Park, Harbourfront",43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
4,M5A,"Regent Park, Harbourfront",43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant
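
The list comprehension inside `getNearbyVenues` reads `v['venue']['categories'][0]`, which would raise an `IndexError` for a venue that has an empty category list. A minimal sketch of a more defensive row builder follows. `venue_rows` is a hypothetical helper, the `'Unknown'` fallback is an assumption, and the sample items are made up to mirror only the fields the function above reads.

```python
def venue_rows(pc, name, lat, lng, items):
    """Turn response items into rows, tolerating venues without a category."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Fall back to a placeholder instead of failing on an empty list
        category = categories[0]['name'] if categories else 'Unknown'
        rows.append((pc, name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows

# Made-up items shaped like the fields accessed above
sample_items = [
    {'venue': {'name': 'Example Bakery',
               'location': {'lat': 43.65, 'lng': -79.36},
               'categories': [{'name': 'Bakery'}]}},
    {'venue': {'name': 'Uncategorized Place',
               'location': {'lat': 43.66, 'lng': -79.37},
               'categories': []}},
]
rows = venue_rows('M5A', 'Regent Park, Harbourfront', 43.65426, -79.360636, sample_items)
print(rows)
```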


#### Let's check the size of the resulting dataframe

In [45]:
print(toronto_venues.shape)

(858, 8)


Let's check how many venues were returned for each postal code.

In [46]:
toronto_venues.groupby('Postal Code').count()

Postal Code,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
M4E,4,4,4,4,4,4,4
M4K,30,30,30,30,30,30,30
M4L,19,19,19,19,19,19,19
M4M,30,30,30,30,30,30,30
M4N,3,3,3,3,3,3,3
M4P,7,7,7,7,7,7,7
M4R,20,20,20,20,20,20,20
M4S,30,30,30,30,30,30,30
M4T,4,4,4,4,4,4,4
M4V,16,16,16,16,16,16,16


#### Let's find out how many unique categories can be curated from all the returned venues

In [47]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 194 unique categories.


### Analyze Each Neighborhood

In [48]:
# define a final dataframe whose first two columns are Postal Code and Neighborhood
toronto_final_df = pd.DataFrame(data=toronto_venues, columns=['Postal Code', 'Neighborhood'])

# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# Concatenating the above two dataframes
toronto_final_df = pd.concat([toronto_final_df, toronto_onehot], axis=1)


toronto_final_df.head()


Unnamed: 0,Postal Code,Neighborhood,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Theater,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Yoga Studio
0,M5A,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,M5A,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,M5A,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,M5A,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,M5A,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


The size of the new dataframe is:

In [49]:
toronto_final_df.shape

(858, 196)

#### Next, let's group rows by neighborhood and take the mean of the frequency of occurrence of each category

In [50]:
# One-hot encoding added a second 'Neighborhood' column (a venue category with that name); keep only the first
toronto_final_df = toronto_final_df.loc[:,~toronto_final_df.columns.duplicated()]

In [51]:
toronto_grouped = toronto_final_df.groupby(by=['Postal Code', 'Neighborhood']).mean().reset_index()
toronto_grouped

Unnamed: 0,Postal Code,Neighborhood,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Theater,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Yoga Studio
0,M4E,The Beaches,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0
1,M4K,"The Danforth West, Riverdale",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.033333
2,M4L,"India Bazaar, The Beaches West",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M4M,Studio District,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033333,0.033333
4,M4N,Lawrence Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,M4P,Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,M4R,"North Toronto West, Lawrence Park",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.05
7,M4S,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,M4T,"Moore Park, Summerhill East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0
9,M4V,"Summerhill West, Rathnelly, South Hill, Forest...",0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0
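
To make the grouping step concrete, here is a minimal sketch on made-up venues (the postal codes `M0A` and `M0B` and all rows are toy values). Because each venue contributes a single 1 to its category column, the per-group mean is the share of venues in that category, and each row of the result sums to 1.

```python
import pandas as pd

# Toy data: three venues in one postal code, one venue in another
venues = pd.DataFrame({
    'Postal Code': ['M0A', 'M0A', 'M0A', 'M0B'],
    'Venue Category': ['Bakery', 'Bakery', 'Park', 'Park'],
})

# One indicator column per category, as integers so the mean is a plain average
onehot = pd.get_dummies(venues['Venue Category'], dtype=int)
onehot.insert(0, 'Postal Code', venues['Postal Code'])

# Mean of the indicators = relative frequency of each category per postal code
freq = onehot.groupby('Postal Code').mean()
print(freq)
```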


Let's see the size of the new grouped dataframe.

In [52]:
toronto_grouped.shape

(39, 195)

#### Let's print each neighborhood along with the top 5 most common venues

In [53]:
num_top_venues = 5

for hood in toronto_grouped['Postal Code']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Postal Code'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[2:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----M4E----
                        venue  freq
0                         Pub  0.25
1                       Trail  0.25
2           Health Food Store  0.25
3                     Airport  0.00
4  Modern European Restaurant  0.00


----M4K----
                venue  freq
0    Greek Restaurant  0.23
1      Ice Cream Shop  0.07
2  Italian Restaurant  0.07
3          Restaurant  0.07
4         Yoga Studio  0.03


----M4L----
                venue  freq
0           Pet Store  0.05
1          Restaurant  0.05
2  Light Rail Station  0.05
3             Brewery  0.05
4                Park  0.05


----M4M----
              venue  freq
0              Café  0.13
1       Coffee Shop  0.10
2            Bakery  0.07
3       Fish Market  0.03
4  Stationery Store  0.03


----M4N----
                       venue  freq
0                       Park  0.33
1                   Bus Line  0.33
2                Swim School  0.33
3                    Airport  0.00
4  Middle Eastern Restaurant  0.00


----M4P----


#### Let us now put this data into a new dataframe

First, let's write a function to sort the venues in descending order.

In [54]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[2:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
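
When several categories share the same frequency, the order among them depends on the sort algorithm, which is why tied venues can appear in a different order from one listing to the next. The following is a minimal sketch of a variant that breaks ties alphabetically. `top_venues_deterministic` is a hypothetical alternative, not the function used in this notebook, and the sample row is built from the M4E frequencies printed earlier.

```python
import pandas as pd

def top_venues_deterministic(row, num_top_venues, skip=2):
    """Return the top categories, breaking frequency ties by category name."""
    # Skip the leading 'Postal Code' and 'Neighborhood' entries
    freqs = row.iloc[skip:].astype(float)
    ranked = (
        freqs.rename_axis('venue')
             .reset_index(name='freq')
             .sort_values(['freq', 'venue'], ascending=[False, True])
    )
    return ranked['venue'].head(num_top_venues).tolist()

sample_row = pd.Series({
    'Postal Code': 'M4E',
    'Neighborhood': 'The Beaches',
    'Trail': 0.25,
    'Pub': 0.25,
    'Health Food Store': 0.25,
    'Airport': 0.0,
})
print(top_venues_deterministic(sample_row, 3))
```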

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [55]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Postal Code', 'Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Postal Code'] = toronto_grouped['Postal Code']
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 2:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Postal Code,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,The Beaches,Trail,Pub,Health Food Store,Yoga Studio,Creperie,Dog Run,Distribution Center,Discount Store,Diner,Dessert Shop
1,M4K,"The Danforth West, Riverdale",Greek Restaurant,Restaurant,Italian Restaurant,Ice Cream Shop,Yoga Studio,Pizza Place,Juice Bar,Bookstore,Dessert Shop,Café
2,M4L,"India Bazaar, The Beaches West",Park,Sushi Restaurant,Pet Store,Pizza Place,Liquor Store,Light Rail Station,Pub,Burrito Place,Restaurant,Brewery
3,M4M,Studio District,Café,Coffee Shop,Bakery,Yoga Studio,Comfort Food Restaurant,Bookstore,Sandwich Place,Brewery,Cheese Shop,Pet Store
4,M4N,Lawrence Park,Park,Bus Line,Swim School,Dog Run,Distribution Center,Discount Store,Diner,Dessert Shop,Department Store,Deli / Bodega
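
The column-naming loop above relies on an exception to switch from 'st', 'nd', 'rd' to 'th'. As a sketch of an alternative, a small helper can compute the suffix directly. `ordinal` is a hypothetical name, and it also handles numbers such as 11 or 21 that the loop would not need for ten columns.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

num_top_venues = 10
columns = ['Postal Code', 'Neighborhood'] + [
    '{} Most Common Venue'.format(ordinal(i))
    for i in range(1, num_top_venues + 1)
]
print(columns)
```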


### Cluster the Neighborhoods

Run *k*-means to group the neighborhoods into 5 clusters.

In [56]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns=['Postal Code', 'Neighborhood'])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([4, 0, 0, 0, 2, 0, 0, 0, 3, 0], dtype=int32)
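
The first ten labels already suggest that cluster sizes are uneven. A minimal sketch of counting the members of each cluster with `np.bincount` follows. It uses only the ten labels printed above as sample input, so the counts do not describe the full set of 39 rows.

```python
import numpy as np

kclusters = 5
# The ten labels shown above, used here as sample input only
sample_labels = np.array([4, 0, 0, 0, 2, 0, 0, 0, 3, 0])

# minlength keeps a slot for clusters that have no members in the sample
cluster_sizes = np.bincount(sample_labels, minlength=kclusters)
for cluster, size in enumerate(cluster_sizes):
    print('Cluster {}: {} rows'.format(cluster, size))
```

On the real data, passing `kmeans.labels_` instead of `sample_labels` would give the size of every cluster.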

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [57]:
# Drop Neighborhood here so that only one Neighborhood column remains after the final merge
table_df = table_df.drop(columns='Neighborhood')

In [58]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)


In [59]:
# merge table_df with neighborhoods_venues_sorted to add the cluster label and top venues for each postal code
table_df = table_df.join(neighborhoods_venues_sorted.set_index('Postal Code'), on='Postal Code')
table_df.dropna(inplace=True)
table_df['Cluster Labels'] = table_df['Cluster Labels'].astype('int')
table_df.head()

Unnamed: 0,Postal Code,Borough,Latitude,Longitude,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,M5A,Downtown Toronto,43.65426,-79.360636,0,"Regent Park, Harbourfront",Coffee Shop,Park,Theater,Bakery,Breakfast Spot,Café,Restaurant,Pub,Chocolate Shop,Yoga Studio
6,M7A,Downtown Toronto,43.662301,-79.389494,0,"Queen's Park, Ontario Provincial Government",Coffee Shop,Diner,Yoga Studio,Hobby Shop,Park,Mexican Restaurant,Creperie,Portuguese Restaurant,Café,Burrito Place
13,M5B,Downtown Toronto,43.657162,-79.378937,0,"Garden District, Ryerson",Café,Coffee Shop,Clothing Store,Theater,Pizza Place,Bookstore,Sandwich Place,Diner,Burger Joint,Burrito Place
22,M5C,Downtown Toronto,43.651494,-79.375418,0,St. James Town,Restaurant,Coffee Shop,Gastropub,Café,Farmers Market,Creperie,Japanese Restaurant,Italian Restaurant,Diner,Ice Cream Shop
30,M4E,East Toronto,43.676357,-79.293031,4,The Beaches,Trail,Pub,Health Food Store,Yoga Studio,Creperie,Dog Run,Distribution Center,Discount Store,Diner,Dessert Shop


In [60]:
table_df.shape

(39, 16)

Let's visualize it.

In [61]:
address = 'Toronto'

geolocator = Nominatim(user_agent='ny-explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [62]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map, colored by cluster label (labels run from 0 to kclusters - 1)
for lat, lon, poi, cluster in zip(table_df['Latitude'], table_df['Longitude'], table_df['Neighborhood'], table_df['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters