<h2>  Applied Data Science Capstone Project </h2> 

This notebook is used mainly for the capstone project.

<h2>  Segmenting and Clustering Neighborhoods in Toronto </h2> 

<h3>  Part 1:  </h3> 
<p> Obtain the data from the table of postal codes and transform it into a pandas dataframe. </p>

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests

In [2]:
from sklearn.cluster import KMeans

In [3]:
#!conda install -c conda-forge folium=0.5.0 --yes
import folium

In [4]:
import matplotlib.cm as cm
import matplotlib.colors as colors

Scraping the Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, to obtain the data in the table of postal codes

In [5]:
#The below url contains html table of postal codes of the city of Toronto.
#url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
url = "https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=1011037969"

In [6]:
data  = requests.get(url).text
soup = BeautifulSoup(data,"html5lib")

In [7]:
#find all html tables in the web page
tables = soup.find_all('table')

Using Pandas to transform the data in the table on the Wikipedia page into a dataframe.

In [8]:
postal_codes_table = pd.read_html(str(tables[0]), flavor='bs4')[0]
postal_codes_table

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


<p>Wrangling and cleaning the data</p>

  * Removing cells with a borough that is Not assigned.

In [9]:
# Keep only the rows whose borough is assigned (avoids in-place edits on a column selection)
postal_codes_table = postal_codes_table[postal_codes_table['Borough'] != 'Not assigned']
postal_codes_table

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


* Combining the rows where one postal code area has more than one neighborhood, into one row with the neighborhoods separated with a comma 

In [10]:
postal_codes_df = postal_codes_table.groupby(['Postal Code','Borough'])['Neighbourhood'].apply(', '.join).reset_index()
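To see what the grouping does, here is a minimal sketch on made-up rows in which one postal code is split across two rows. The `rows` frame is hypothetical; only the column names and the `groupby` call come from the cell above.

```python
import pandas as pd

# Hypothetical rows: postal code M5A is split across two rows
rows = pd.DataFrame({
    'Postal Code': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Regent Park', 'Harbourfront', 'Parkwoods'],
})

# One row per (postal code, borough); neighbourhood names joined with ', '
combined = (
    rows.groupby(['Postal Code', 'Borough'])['Neighbourhood']
        .apply(', '.join)
        .reset_index()
)
print(combined)
```

Note that `groupby` sorts by the group keys, which is why the resulting dataframe is ordered by postal code rather than by the order of the scraped table.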

* If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

In [11]:
# Where the neighbourhood is 'Not assigned', fall back to the borough name
not_assigned = postal_codes_df['Neighbourhood'] == 'Not assigned'
postal_codes_df.loc[not_assigned, 'Neighbourhood'] = postal_codes_df.loc[not_assigned, 'Borough']

Now we have our dataframe

In [12]:
postal_codes_df

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ..."
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."


In [13]:
postal_codes_df.shape

(103, 3)

<h3>  Part 2:  </h3> 
<p> To get the latitude and the longitude coordinates of each neighborhood </p>

In [15]:
# The code was removed by Watson Studio for sharing.

In [16]:
df_data_1.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [17]:
neighborhood_coord = postal_codes_df.merge(df_data_1, on = 'Postal Code')

In [18]:
neighborhood_coord

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
...,...,...,...,...,...
98,M9N,York,Weston,43.706876,-79.518188
99,M9P,Etobicoke,Westmount,43.696319,-79.532242
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ...",43.688905,-79.554724
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437
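The default inner merge silently drops any postal code that has no coordinates. The sketch below shows one way to check for that. It assumes the coordinates come from a CSV with the three columns shown in `df_data_1.head()` (the loading code itself is not available here), and the `M9Z` row is made up to force a mismatch.

```python
import io
import pandas as pd

# Assumed CSV layout; the three rows are copied from df_data_1.head() above
csv_text = """Postal Code,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
"""
coords = pd.read_csv(io.StringIO(csv_text))

codes = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C', 'M9Z'],  # 'M9Z' is a made-up code with no coordinates
    'Borough': ['Scarborough', 'Scarborough', 'Hypothetical'],
})

# how='left' keeps every postal code, indicator=True adds a '_merge' column,
# and validate checks that the key is unique on both sides
merged = codes.merge(coords, on='Postal Code', how='left',
                     indicator=True, validate='one_to_one')
missing = merged[merged['_merge'] == 'left_only']
print(missing[['Postal Code']])
```

If `missing` is empty, the inner merge used above loses nothing.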


<h3>  Part 3:  </h3> 
<p> To explore and cluster the neighborhoods in Toronto (I decided to work only with boroughs whose name contains the word Toronto) </p>


In [19]:
neighborhood_coord['Borough'].unique()

array(['Scarborough', 'North York', 'East York', 'East Toronto',
       'Central Toronto', 'Downtown Toronto', 'York', 'West Toronto',
       'Toronto/York', 'Mississauga', 'Etobicoke'], dtype=object)

In [20]:
neighborhood_coord_toronto = neighborhood_coord[neighborhood_coord['Borough'].str.contains("Toronto")].reset_index(drop=True)
neighborhood_coord_toronto.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


In [21]:
# create map of Toronto using latitude and longitude values (43.651070, -79.347015)
latitude = 43.651070
longitude = -79.347015
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, label in zip(neighborhood_coord_toronto['Latitude'], neighborhood_coord_toronto['Longitude'], neighborhood_coord_toronto['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

<p> Let's borrow the getNearbyVenues function from the Segmenting and Clustering lab to get up to 100 venues within a radius of 500 meters of each neighbourhood. </p>

In [22]:
CLIENT_ID = 'XXXX'  # your Foursquare ID
CLIENT_SECRET = 'XXXX'  # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

In [23]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
                    
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
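The function above indexes straight into the JSON response, so a failed request or a venue without a category raises an exception. Below is a hedged sketch of a more defensive parser. `parse_explore_items` is a hypothetical helper, not part of the lab, and the payloads are hand-built to mimic the nesting that `getNearbyVenues` relies on, so no network call is made.

```python
def parse_explore_items(payload, name, lat, lng):
    """Flatten one explore response into rows, skipping malformed venues."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories', [])
        if 'name' not in venue or not categories:
            continue  # skip venues without a name or a category
        rows.append((name, lat, lng, venue['name'],
                     location.get('lat'), location.get('lng'),
                     categories[0]['name']))
    return rows

# Hand-built payload with one complete venue and one without categories
fake_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Glen Stewart Park',
               'location': {'lat': 43.675278, 'lng': -79.294647},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'Uncategorised place',
               'location': {'lat': 0.0, 'lng': 0.0},
               'categories': []}},
]}]}}

good_rows = parse_explore_items(fake_payload, 'The Beaches', 43.676357, -79.293031)
# A made-up error payload without a 'response' key yields no rows instead of a KeyError
empty_rows = parse_explore_items({'meta': {'code': 429}}, 'The Beaches', 43.676357, -79.293031)
print(good_rows)
print(empty_rows)
```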

<p> Run the above function on each neighborhood and create a new dataframe called toronto_venues. </p>

In [24]:
toronto_venues = getNearbyVenues(names=neighborhood_coord_toronto['Neighbourhood'],
                                   latitudes=neighborhood_coord_toronto['Latitude'],
                                   longitudes=neighborhood_coord_toronto['Longitude']
                                  )

In [25]:
print(toronto_venues.shape)
toronto_venues.head()

(1604, 7)


Unnamed: 0,Neighbourhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,The Beaches,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
3,The Beaches,43.676357,-79.293031,Glen Stewart Park,43.675278,-79.294647,Park
4,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood


<h5>Analyzing the neighbourhoods and preparing the data for clustering based on the Venue Categories</h5>

* Applying a one hot encoding to the toronto_venues data frame

In [26]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

In [27]:
# add neighborhood column back to dataframe
toronto_onehot = pd.concat([toronto_venues['Neighbourhood'], toronto_onehot], axis=1)
toronto_onehot.head()

Unnamed: 0,Neighbourhood,Adult Boutique,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Theme Restaurant,Tibetan Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Yoga Studio
0,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
1,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


* Grouping rows by neighborhood and taking the mean of the frequency of occurrence of each category

In [28]:
toronto_freq = toronto_onehot.groupby('Neighbourhood').mean().reset_index()
toronto_freq

Unnamed: 0,Neighbourhood,Adult Boutique,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Theme Restaurant,Tibetan Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Business reply mail Processing Centre, South C...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0625,0.0625,0.125,0.125,0.125,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.015873,0.0,0.0,0.015873,0.015873
5,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Church and Wellesley,0.014286,0.0,0.0,0.0,0.0,0.0,0.014286,0.0,0.0,...,0.014286,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028571
7,"Commerce Court, Victoria Hotel",0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0
8,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.028571,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Davisville North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
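The two steps above can be sketched on a tiny made-up example to show why the mean of the one-hot columns is a frequency. The neighbourhood names 'A' and 'B' and their venues are hypothetical, and `dtype=int` is an addition here to keep 0/1 indicators regardless of pandas version.

```python
import pandas as pd

# Hypothetical venues: three in neighbourhood 'A', one in 'B'
venues = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Park', 'Pub', 'Park', 'Trail'],
})

# One indicator column per category
onehot = pd.get_dummies(venues[['Venue Category']], prefix='', prefix_sep='', dtype=int)
onehot.insert(0, 'Neighbourhood', venues['Neighbourhood'])

# Mean of 0/1 indicators per neighbourhood = share of its venues in each category
freq = onehot.groupby('Neighbourhood').mean().reset_index()
print(freq)
```

Each row of `freq` sums to 1, so neighbourhoods with many venues and neighbourhoods with few are compared on the same scale.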


* Create the new dataframe and display the top 10 venues for each neighborhood. So, first we borrow the _return_most_common_venues_ function from the Segmenting and Clustering lab. 

In [29]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [30]:
num_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = toronto_freq['Neighbourhood']

for ind in np.arange(toronto_freq.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_freq.iloc[ind, :], num_venues)

neighbourhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Bakery,Cocktail Bar,Restaurant,Farmers Market,Seafood Restaurant,Cheese Shop,Beer Bar,Pharmacy,Art Gallery
1,"Brockton, Parkdale Village, Exhibition Place",Café,Breakfast Spot,Bakery,Coffee Shop,Nightclub,Pet Store,Performing Arts Venue,Restaurant,Climbing Gym,Burrito Place
2,"Business reply mail Processing Centre, South C...",Yoga Studio,Auto Workshop,Park,Comic Shop,Recording Studio,Restaurant,Farmers Market,Fast Food Restaurant,Skate Park,Burrito Place
3,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Lounge,Airport Service,Airport Terminal,Plane,Boutique,Coffee Shop,Bar,Sculpture Garden,Rental Car Location,Boat or Ferry
4,Central Bay Street,Coffee Shop,Sandwich Place,Italian Restaurant,Café,Bubble Tea Shop,Department Store,Thai Restaurant,Salad Place,Burger Joint,Japanese Restaurant


<h5>Clustering Neighborhoods</h5>

* Run _k_-means to cluster the neighborhoods into 6 clusters

In [31]:
# set number of clusters
kclusters = 6

toronto_clustering = toronto_freq.drop(columns='Neighbourhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 4, 2, 0, 2,
       2, 2, 2, 2, 3, 1, 2, 5, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2], dtype=int32)
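Before interpreting the clusters it helps to look at how many neighbourhoods fall into each one. A quick check with `np.bincount`, using the labels copied from the output above:

```python
import numpy as np

# Labels copied from the kmeans.labels_ output above
labels = np.array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 4, 2, 0, 2,
                   2, 2, 2, 2, 3, 1, 2, 5, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2])

# Number of neighbourhoods assigned to each cluster label 0..5
sizes = np.bincount(labels, minlength=6)
print(sizes)  # [ 2  1 33  2  1  1]
```

Cluster 2 holds 33 of the 40 neighbourhoods, and three clusters contain a single neighbourhood each, so most of the structure found here consists of a few outliers around one large group.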

* Create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood

In [32]:
# add clustering labels
neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
neighbourhoods_venues_sorted.head()

Unnamed: 0,Cluster Labels,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,2,Berczy Park,Coffee Shop,Bakery,Cocktail Bar,Restaurant,Farmers Market,Seafood Restaurant,Cheese Shop,Beer Bar,Pharmacy,Art Gallery
1,2,"Brockton, Parkdale Village, Exhibition Place",Café,Breakfast Spot,Bakery,Coffee Shop,Nightclub,Pet Store,Performing Arts Venue,Restaurant,Climbing Gym,Burrito Place
2,2,"Business reply mail Processing Centre, South C...",Yoga Studio,Auto Workshop,Park,Comic Shop,Recording Studio,Restaurant,Farmers Market,Fast Food Restaurant,Skate Park,Burrito Place
3,2,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Lounge,Airport Service,Airport Terminal,Plane,Boutique,Coffee Shop,Bar,Sculpture Garden,Rental Car Location,Boat or Ferry
4,2,Central Bay Street,Coffee Shop,Sandwich Place,Italian Restaurant,Café,Bubble Tea Shop,Department Store,Thai Restaurant,Salad Place,Burger Joint,Japanese Restaurant


In [33]:
# merge neighbourhoods_venues_sorted with neighborhood_coord_toronto to add latitude/longitude for each neighborhood

toronto_venues = neighborhood_coord_toronto.join(neighbourhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

toronto_venues.head() # check the last columns!

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,0,Health Food Store,Park,Trail,Pub,Neighborhood,Other Great Outdoors,Museum,Men's Store,Metro Station,Mexican Restaurant
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,2,Greek Restaurant,Coffee Shop,Italian Restaurant,Ice Cream Shop,Furniture / Home Store,Bookstore,Restaurant,Pub,Bubble Tea Shop,Spa
2,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572,2,Restaurant,Movie Theater,Sushi Restaurant,Sandwich Place,Fish & Chips Shop,Liquor Store,Coffee Shop,Italian Restaurant,Pub,Pet Store
3,M4M,East Toronto,Studio District,43.659526,-79.340923,2,Coffee Shop,Brewery,Café,American Restaurant,Gastropub,Bakery,Yoga Studio,Diner,Cheese Shop,Seafood Restaurant
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,4,Park,Bus Line,Swim School,Adult Boutique,Music Store,Metro Station,Mexican Restaurant,Middle Eastern Restaurant,Miscellaneous Shop,Modern European Restaurant
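The rows shown above all have a cluster label, but a neighbourhood for which no venues were returned would have no match in the join and would get `NaN`. That turns the label column into floats, which cannot be used as list indices later. A minimal sketch of the situation and one way to handle it, on made-up data:

```python
import pandas as pd

# Hypothetical data: neighbourhood 'C' returned no venues, so it has no cluster label
coords = pd.DataFrame({'Neighbourhood': ['A', 'B', 'C'],
                       'Latitude': [43.65, 43.66, 43.67]})
clustered = pd.DataFrame({'Neighbourhood': ['A', 'B'],
                          'Cluster Labels': [0, 2]})

joined = coords.join(clustered.set_index('Neighbourhood'), on='Neighbourhood')
# The unmatched row makes the label column float, since NaN cannot be stored as int
print(joined['Cluster Labels'].tolist())

# Drop unlabelled rows and restore integer labels before using them as indices
mappable = joined.dropna(subset=['Cluster Labels']).copy()
mappable['Cluster Labels'] = mappable['Cluster Labels'].astype(int)
print(mappable['Cluster Labels'].tolist())
```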


* Finally, let's visualize the resulting clusters

In [34]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_venues['Latitude'], toronto_venues['Longitude'], toronto_venues['Neighbourhood'], toronto_venues['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters