### Segmenting and Clustering Neighborhoods in Toronto
Kaival Panchal\
Coursera-IBM Capstone Project

Goals:

1. Scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to obtain the table of postal codes, and transform that data into a pandas dataframe.
2. With the postal code, borough name, and neighborhood name of each neighborhood in a dataframe, add the latitude and longitude coordinates of each neighborhood, which are needed to use the Foursquare location data. Use the Geocoder package or the CSV file at http://cocl.us/Geospatial_data to create the dataframe with the latitude and longitude values.
3. Explore and cluster the neighborhoods in Toronto, working only with boroughs whose name contains the word Toronto.


#### Import libraries

In [1]:
import requests # to load webpages
from bs4 import BeautifulSoup as bs
import pandas as pd
import numpy as np

In [2]:
link = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

r = requests.get(link) # load the webpage content
webpage = bs(r.content, 'html.parser') # convert the HTML into a BeautifulSoup object
print(webpage.prettify()) # print the parsed HTML with indentation


<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of postal codes of Canada: M - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"b6928b0a-c33c-4c16-9dca-a680f600278b","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_postal_codes_of_Canada:_M","wgTitle":"List of postal codes of Canada: M","wgCurRevisionId":979555370,"wgRevisionId":979555370,"wgArticleId":539066,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Short description is different from Wikidata","Communicati

In [3]:
# Create a pandas dataframe from the first table on the page
table_rows = webpage.find('tbody').find_all('tr')
rows = []
for tr in table_rows:
    cells = tr.find_all('td')
    # the header row has only <th> cells, so it yields an empty list (an all-null row)
    rows.append([cell.get_text().strip() for cell in cells])

df = pd.DataFrame(rows, columns=['Postal Code', 'Borough', 'Neighborhood'])
df.head(50)





Unnamed: 0,Postal Code,Borough,Neighborhood
0,,,
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,"Regent Park, Harbourfront"
6,M6A,North York,"Lawrence Manor, Lawrence Heights"
7,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
8,M8A,Not assigned,Not assigned
9,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
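
Row 0 of the output above is empty because the header row of the table holds only `<th>` cells, so collecting `<td>` cells from it returns nothing. The sketch below reproduces that behaviour without BeautifulSoup, using the standard-library `html.parser` on a hand-written three-row snippet. `SAMPLE_HTML` and `RowCollector` are illustrative names, not part of the notebook, and the snippet is a stand-in for the much larger Wikipedia page.

```python
from html.parser import HTMLParser

# Hand-written stand-in for the Wikipedia table (assumption: same tr/th/td layout).
SAMPLE_HTML = """
<table><tbody>
<tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
<tr><td>M1A </td><td>Not assigned </td><td>Not assigned </td></tr>
<tr><td>M3A </td><td>North York </td><td>Parkwoods </td></tr>
</tbody></table>
"""


class RowCollector(HTMLParser):
    """Collect the stripped text of every <td> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the <tr> currently being read
        self._cell = None  # text fragments of the <td> currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a <td> (for example inside <th>) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


parser = RowCollector()
parser.feed(SAMPLE_HTML)
rows = parser.rows
print(rows)
```

The first collected row is an empty list, which pandas turns into a row of missing values when the dataframe is built. That is the row removed by `dropna(how='all')` in the cleaning step.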


In [4]:
print('The shape of the dataset prior to any data cleaning is',df.shape)

The shape of the dataset prior to any data cleaning is (181, 3)


In [5]:
# Data Cleaning
df = df.dropna(how = 'all') # drop rows in which every value is null (the empty header row)
df_filter = df[ df['Borough'] == 'Not assigned' ].index  # index of all rows whose borough is not assigned
df.drop(df_filter, inplace = True) # remove the rows whose borough is not assigned
df.head(50)



Unnamed: 0,Postal Code,Borough,Neighborhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,"Regent Park, Harbourfront"
6,M6A,North York,"Lawrence Manor, Lawrence Heights"
7,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
9,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
10,M1B,Scarborough,"Malvern, Rouge"
12,M3B,North York,Don Mills
13,M4B,East York,"Parkview Hill, Woodbine Gardens"
14,M5B,Downtown Toronto,"Garden District, Ryerson"


In [6]:
print('The shape of the dataset after data cleaning is',df.shape)

The shape of the dataset after data cleaning is (103, 3)


In [7]:
# Check to see if there are any duplicate zips
print('The length of the cleaned data set prior to checking for duplicate zips is ' + str(len(df)))

df['Count'] = 1
# Initialize the count of each row to 1. After grouping by postal code and summing,
# any duplicated postal code would have a count greater than 1,
# and the grouped frame would be shorter than the original.
PC = df.groupby('Postal Code').sum()
unique = np.unique(PC["Count"])
print('The length of the cleaned data set after checking for duplicate zips is ' + str(len(PC)) + ' and unique count values are all '+ str(unique))




The length of the cleaned data set prior to checking for duplicate zips is 103
The length of the cleaned data set after checking for duplicate zips is 103 and unique count values are all [1]
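
The same duplicate check can be done without a helper column. This is a minimal sketch on a four-row sample frame in which one postal code is repeated on purpose. The sample rows are illustrative, since the real frame has no duplicates.

```python
import pandas as pd

# Illustrative rows only; "M5A" is repeated on purpose to trigger the check.
sample = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M5A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto", "Downtown Toronto"],
})

# True when at least one postal code appears more than once.
has_duplicates = not sample["Postal Code"].is_unique

# List every postal code that appears more than once.
mask = sample["Postal Code"].duplicated(keep=False)
repeated = sample.loc[mask, "Postal Code"].unique().tolist()

print(has_duplicates, repeated)
```

On the cleaned notebook frame, `df['Postal Code'].is_unique` should return `True`, matching the result above.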


In [8]:
# Get the latitude and the longitude coordinates of each neighborhood from the CSV file
LLCD = pd.read_csv('Geospatial_Coordinates.csv')

# merge the two data sets on their matching postal codes
result = pd.merge(df, LLCD[['Postal Code', 'Latitude', 'Longitude']], on='Postal Code')
result.drop('Count', axis='columns', inplace=True) # remove the helper column used for the duplicate check
new_df = result
new_df


Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509
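
The default inner merge silently drops any postal code that is missing from the coordinates file. The sketch below shows one way to surface such rows, using the `how`, `validate` and `indicator` options of `pd.merge`. The two small frames are hand-made, and `M9Z` is a made-up postal code included only to produce a mismatch.

```python
import pandas as pd

# Hand-made sample frames; "M9Z" is a made-up code with no coordinates.
neighborhoods = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Etobicoke"],
})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

checked = pd.merge(
    neighborhoods,
    coords,
    on="Postal Code",
    how="left",              # keep every neighborhood row
    validate="one_to_one",   # raise if a postal code repeats on either side
    indicator=True,          # add a "_merge" column naming the source of each row
)

# Postal codes that found no match in the coordinates frame.
unmatched = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```

An empty `unmatched` list would confirm that every neighborhood received coordinates.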


In [9]:
# Create a map of Toronto, Canada using folium
import folium


# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[43.6487, -79.38544], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(new_df['Latitude'], new_df['Longitude'], new_df['Borough'], new_df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        #fill=True,
        #fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

### Foursquare

In [10]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_FOURSQUARE_CLIENT_ID
CLIENT_SECRET: YOUR_FOURSQUARE_CLIENT_SECRET
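
Credentials written into a notebook cell are easy to leak when the notebook is shared. One possible alternative, sketched below, reads them from environment variables and prints only a masked form. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and both helper functions are assumptions made for this example, not something the notebook or Foursquare defines.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from a mapping of environment variables."""
    try:
        # Hypothetical variable names chosen for this sketch.
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Set the environment variable {missing} first") from None


def mask(secret, visible=4):
    """Show only the first few characters of a secret."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)


# Usage with an explicit mapping so the example runs without real credentials.
demo_env = {"FOURSQUARE_CLIENT_ID": "abcdefgh", "FOURSQUARE_CLIENT_SECRET": "12345678"}
client_id, client_secret = load_foursquare_credentials(demo_env)
print("CLIENT_ID:", mask(client_id))
```

Calling `load_foursquare_credentials()` with no argument reads from the real process environment.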


In [11]:
# Restrict the study to East Toronto, one of the boroughs whose name contains "Toronto"
toronto_data=new_df[new_df['Borough'].str.contains("East Toronto")]
toronto_data

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
19,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
47,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572
54,M4M,East Toronto,Studio District,43.659526,-79.340923
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
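
The filter pattern decides how much of the city is studied. The sketch below compares the `"East Toronto"` pattern used here with the broader `"Toronto"` pattern named in the goals, on a small frame built from borough names that appear in the tables above.

```python
import pandas as pd

# Borough names taken from the tables above; one row per borough for brevity.
boroughs = pd.DataFrame({
    "Borough": [
        "North York",
        "Downtown Toronto",
        "East Toronto",
        "Etobicoke",
        "Scarborough",
        "East York",
    ]
})

# Pattern used in this notebook: a single borough.
east_only = boroughs[boroughs["Borough"].str.contains("East Toronto")]

# Broader pattern: every borough whose name contains "Toronto".
all_toronto = boroughs[boroughs["Borough"].str.contains("Toronto")]

print(east_only["Borough"].tolist())
print(all_toronto["Borough"].tolist())
```

Switching to the broader pattern would add more neighborhoods to every later step, including the clustering.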


In [12]:


def getNearbyVenues(names, latitudes, longitudes):
    radius=500
    LIMIT=100
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
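
The function above indexes `groups[0]` and `categories[0]` directly, so it raises an error if a request returns no groups or a venue has no category. The sketch below is a more defensive version of the parsing step. The payload shape is inferred only from the keys the function reads, the sample payload is hand-written rather than a real API response, and `extract_venues` is a hypothetical helper name.

```python
def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from an explore-style payload.

    Missing groups give an empty list; a venue without categories gets None.
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    venues = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        venues.append((
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return venues


# Hand-written payload mimicking the keys read by getNearbyVenues.
sample_payload = {
    "response": {
        "groups": [{
            "items": [
                {"venue": {"name": "Glen Manor Ravine",
                           "location": {"lat": 43.676821, "lng": -79.293942},
                           "categories": [{"name": "Trail"}]}},
                {"venue": {"name": "Uncategorized example venue",
                           "location": {"lat": 43.68, "lng": -79.29},
                           "categories": []}},
            ]
        }]
    }
}

parsed = extract_venues(sample_payload)
print(parsed)
```

Inside `getNearbyVenues`, the parsed tuples could be prefixed with the neighborhood name and coordinates before being appended to `venues_list`.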

In [13]:
toronto_venues = getNearbyVenues(names=toronto_data['Neighborhood'],
                                 latitudes=toronto_data['Latitude'],
                                 longitudes=toronto_data['Longitude'])

The Beaches
The Danforth West, Riverdale
India Bazaar, The Beaches West
Studio District
Business reply mail Processing Centre, South Central Letter Processing Plant Toronto


In [14]:
toronto_venues.head(50)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,The Beaches,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
3,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
4,"The Danforth West, Riverdale",43.679557,-79.352188,Pantheon,43.677621,-79.351434,Greek Restaurant
5,"The Danforth West, Riverdale",43.679557,-79.352188,MenEssentials,43.67782,-79.351265,Cosmetics Shop
6,"The Danforth West, Riverdale",43.679557,-79.352188,Cafe Fiorentina,43.677743,-79.350115,Italian Restaurant
7,"The Danforth West, Riverdale",43.679557,-79.352188,Mezes,43.677962,-79.350196,Greek Restaurant
8,"The Danforth West, Riverdale",43.679557,-79.352188,La Diperie,43.677702,-79.352265,Ice Cream Shop
9,"The Danforth West, Riverdale",43.679557,-79.352188,Dolce Gelato,43.677773,-79.351187,Ice Cream Shop


In [15]:
toronto_venues.groupby('Neighborhood').count() #Let's check how many venues were returned for each neighborhood

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Business reply mail Processing Centre, South Central Letter Processing Plant Toronto",16,16,16,16,16,16
"India Bazaar, The Beaches West",19,19,19,19,19,19
Studio District,37,37,37,37,37,37
The Beaches,4,4,4,4,4,4
"The Danforth West, Riverdale",43,43,43,43,43,43


In [16]:
#Let's find out how many unique categories can be curated from all the returned venues
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 66 unique categories.


#### Analyze each neighborhood

In [17]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# 'Neighborhood' is also a Foursquare venue category, so drop that dummy column
# before inserting the real neighborhood names under the same label
toronto_onehot.drop(['Neighborhood'], axis=1, inplace=True)
toronto_onehot.insert(loc=0, column='Neighborhood', value=toronto_venues['Neighborhood']) # add the neighborhood names as the first column
toronto_onehot.shape



(119, 66)

In [18]:
#Next, let's group rows by neighborhood and take the mean of each one-hot column, which gives the frequency of each category
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighborhood,American Restaurant,Auto Workshop,Bakery,Bank,Bar,Bookstore,Brewery,Bubble Tea Shop,Burrito Place,...,Seafood Restaurant,Skate Park,Spa,Stationery Store,Steakhouse,Sushi Restaurant,Thai Restaurant,Trail,Wine Bar,Yoga Studio
0,"Business reply mail Processing Centre, South C...",0.0,0.0625,0.0,0.0,0.0,0.0,0.0625,0.0,0.0625,...,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"India Bazaar, The Beaches West",0.0,0.0,0.0,0.0,0.0,0.0,0.052632,0.0,0.052632,...,0.0,0.0,0.0,0.0,0.052632,0.052632,0.0,0.0,0.0,0.0
2,Studio District,0.054054,0.0,0.054054,0.027027,0.027027,0.027027,0.054054,0.0,0.0,...,0.027027,0.0,0.0,0.027027,0.0,0.0,0.027027,0.0,0.027027,0.027027
3,The Beaches,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0
4,"The Danforth West, Riverdale",0.023256,0.0,0.023256,0.0,0.0,0.046512,0.023256,0.023256,0.0,...,0.0,0.0,0.023256,0.0,0.0,0.023256,0.0,0.023256,0.0,0.023256
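
Each value in the grouped table is the share of a neighborhood's venues that fall in a category, so every row sums to 1. The sketch below shows the one-hot and mean steps on a five-venue sample. The venue counts are made up for illustration and do not come from the Foursquare results.

```python
import pandas as pd

# Made-up sample: three venues in one neighborhood, two in another.
venues = pd.DataFrame({
    "Neighborhood": ["The Beaches"] * 3 + ["Studio District"] * 2,
    "Venue Category": ["Trail", "Pub", "Pub", "Coffee Shop", "Brewery"],
})

# One column per category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# The mean of a 0/1 column within a group is that category's frequency.
freq = onehot.groupby("Neighborhood").mean()
print(freq)
```

Here two of the three venues in The Beaches are pubs, so its `Pub` frequency is 2/3.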


In [19]:
#Let's print each neighborhood along with the top 5 most common venues
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Business reply mail Processing Centre, South Central Letter Processing Plant Toronto----
                  venue  freq
0  Gym / Fitness Center  0.06
1         Garden Center  0.06
2        Farmers Market  0.06
3                  Park  0.06
4           Pizza Place  0.06


----India Bazaar, The Beaches West----
               venue  freq
0               Park  0.11
1                Pub  0.05
2  Fish & Chips Shop  0.05
3      Movie Theater  0.05
4          Pet Store  0.05


----Studio District----
                 venue  freq
0          Coffee Shop  0.08
1  American Restaurant  0.05
2               Bakery  0.05
3            Gastropub  0.05
4              Brewery  0.05


----The Beaches----
                       venue  freq
0                      Trail  0.25
1                        Pub  0.25
2          Health Food Store  0.25
3        American Restaurant  0.00
4  Latin American Restaurant  0.00


----The Danforth West, Riverdale----
                    venue  freq
0        Greek Restau

In [20]:
#First, let's write a function to sort the venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [21]:
#Now let's create the new dataframe and display the top 10 venues for each neighborhood.
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # ranks beyond the third use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Business reply mail Processing Centre, South C...",Recording Studio,Burrito Place,Light Rail Station,Gym / Fitness Center,Park,Farmers Market,Pizza Place,Butcher,Restaurant,Comic Shop
1,"India Bazaar, The Beaches West",Park,Fish & Chips Shop,Restaurant,Italian Restaurant,Gym,Liquor Store,Burrito Place,Movie Theater,Pet Store,Pizza Place
2,Studio District,Coffee Shop,Brewery,Gastropub,Café,American Restaurant,Bakery,Bookstore,Bar,Bank,Cheese Shop
3,The Beaches,Trail,Pub,Health Food Store,Yoga Studio,Farmers Market,Convenience Store,Cosmetics Shop,Coworking Space,Dessert Shop,Diner
4,"The Danforth West, Riverdale",Greek Restaurant,Coffee Shop,Italian Restaurant,Restaurant,Bookstore,Ice Cream Shop,Furniture / Home Store,Indian Restaurant,Grocery Store,Fruit & Vegetable Store
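
The column names above rely on a three-element suffix list with a fallback to `th`. That works for ranks 1 to 10 but would label rank 21 as `21th`. A small helper that handles any positive rank is sketched below. `ordinal` is a hypothetical name introduced for this example.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, for example 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        # 11th, 12th and 13th are exceptions to the st/nd/rd pattern.
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


num_top_venues = 10
columns = ["Neighborhood"] + [
    f"{ordinal(rank)} Most Common Venue" for rank in range(1, num_top_venues + 1)
]
print(columns)
```

For `num_top_venues = 10` this builds the same column names as the cell above.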


In [22]:
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5 # equal to the number of East Toronto neighborhoods, so each one ends up in its own cluster

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]


array([3, 1, 4, 2, 0])

In [23]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto_data

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
19,M4E,East Toronto,The Beaches,43.676357,-79.293031,2,Trail,Pub,Health Food Store,Yoga Studio,Farmers Market,Convenience Store,Cosmetics Shop,Coworking Space,Dessert Shop,Diner
41,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,0,Greek Restaurant,Coffee Shop,Italian Restaurant,Restaurant,Bookstore,Ice Cream Shop,Furniture / Home Store,Indian Restaurant,Grocery Store,Fruit & Vegetable Store
47,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572,1,Park,Fish & Chips Shop,Restaurant,Italian Restaurant,Gym,Liquor Store,Burrito Place,Movie Theater,Pet Store,Pizza Place
54,M4M,East Toronto,Studio District,43.659526,-79.340923,4,Coffee Shop,Brewery,Gastropub,Café,American Restaurant,Bakery,Bookstore,Bar,Bank,Cheese Shop
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558,3,Recording Studio,Burrito Place,Light Rail Station,Gym / Fitness Center,Park,Farmers Market,Pizza Place,Butcher,Restaurant,Comic Shop


In [24]:
neighborhoods_venues_sorted.head()

Unnamed: 0,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,3,"Business reply mail Processing Centre, South C...",Recording Studio,Burrito Place,Light Rail Station,Gym / Fitness Center,Park,Farmers Market,Pizza Place,Butcher,Restaurant,Comic Shop
1,1,"India Bazaar, The Beaches West",Park,Fish & Chips Shop,Restaurant,Italian Restaurant,Gym,Liquor Store,Burrito Place,Movie Theater,Pet Store,Pizza Place
2,4,Studio District,Coffee Shop,Brewery,Gastropub,Café,American Restaurant,Bakery,Bookstore,Bar,Bank,Cheese Shop
3,2,The Beaches,Trail,Pub,Health Food Store,Yoga Studio,Farmers Market,Convenience Store,Cosmetics Shop,Coworking Space,Dessert Shop,Diner
4,0,"The Danforth West, Riverdale",Greek Restaurant,Coffee Shop,Italian Restaurant,Restaurant,Bookstore,Ice Cream Shop,Furniture / Home Store,Indian Restaurant,Grocery Store,Fruit & Vegetable Store


In [25]:
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map of Toronto using latitude and longitude values
map_toronto_clusters = folium.Map(location=[43.6487, -79.38544], zoom_start=10)

# set color scheme for the clusters: one rainbow color per cluster label
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map, indexing the palette directly by cluster label
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_toronto_clusters)
       
map_toronto_clusters
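
The marker colors only need to be distinct per cluster label. As a dependency-free sketch of that idea, the helper below builds one hex color per cluster with the standard-library `colorsys` module instead of matplotlib. `cluster_palette` is a hypothetical name, and the saturation and brightness values are arbitrary choices.

```python
import colorsys


def cluster_palette(k):
    """Return k evenly spaced hues as '#rrggbb' strings."""
    palette = []
    for i in range(k):
        # Spread hues evenly around the color wheel; 0.8 and 0.9 are arbitrary.
        r, g, b = colorsys.hsv_to_rgb(i / k, 0.8, 0.9)
        palette.append("#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)))
    return palette


palette = cluster_palette(5)

# Cluster labels in the row order of toronto_merged shown earlier.
labels = [2, 0, 1, 4, 3]
marker_colors = [palette[label] for label in labels]
print(marker_colors)
```

Indexing the palette directly by label keeps the mapping from cluster to color easy to read in a legend.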