### Coursera/IBM Applied Data Science Capstone Course
#### Week 3 assignment: Toronto Neighborhoods 

---

## Part 1: Scraping neighborhood postal codes and names from a Wikipedia page

First step: we import the necessary libraries:

* *Requests* to grab html site data

* *BeautifulSoup* to scrape html data

* *Numpy* to handle data in a vectorized manner

* *Pandas* for data analysis and dataframes

In [1]:
import requests # library to grab html data
from bs4 import BeautifulSoup # library to scrape html data

import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis

In the next steps we:
* 1- Define the URL link
* 2- Use *requests.get* to download the data from the Wikipedia site and assign the response to the variable *wikipedia_data*
* 3- Use the response attribute *text* to extract the html as a text string, parse it with BeautifulSoup and assign the result to the variable *soup*

In [2]:
#1 Define the URL link
wikipedia_link="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

#2 Download the data site 
wikipedia_data= requests.get(wikipedia_link)

#3 parse data 
soup = BeautifulSoup(wikipedia_data.text, 'html.parser')

Next we define the dataframe column names by:
* 4- Finding the relevant postal code data in the body of the html table with the attribute *.tbody* and extracting the *th* html elements with the BeautifulSoup method *find_all*
* 5- Extracting the text contained in each *th* element and adding it to an array of column names
* 6- Creating a new dataframe with those columns
* 7- Adjusting some of the column names to fit the assignment description

In [3]:
#4 select data from html table 
column_name_array = soup.tbody.find_all('th')

#5 Extracting column names
column_names = [column_name_array[0].string , column_name_array[1].string , column_name_array[2].string.strip('\n')]

#6 create new dataframe
toronto_neighborhoods = pd.DataFrame(columns=column_names)

#7 Adjust column names 
toronto_neighborhoods = toronto_neighborhoods.rename(columns={'Postcode':'PostalCode' , 'Neighbourhood': 'Neighborhood'})
toronto_neighborhoods

Unnamed: 0,PostalCode,Borough,Neighborhood


---

Now we fill in the dataframe columns with the data from the wikipedia postalcodes table:
* 8- Using the BeautifulSoup method *find_all* we collect the table rows into an array variable *table_data*
* 9- Looping through the array (except the first element corresponding to the headers used for the column names), each element corresponding to a row in the table
* 10- Extracting the row elements using the *find_all* on the html tag *td*, which results in an array with the three values of interest. 
* 11- Assign the values to each column in the dataframe

In [4]:
#8
table_data = soup.tbody.find_all('tr')

#9
rows = []
for row in table_data[1:]:
    #10
    row_entries = row.find_all('td')
    #11 collect each row as a dict (DataFrame.append was removed in pandas 2.0)
    rows.append({'PostalCode': row_entries[0].get_text(),
                 'Borough': row_entries[1].get_text(),
                 'Neighborhood': row_entries[2].get_text().strip('\n')})

# build the dataframe once, reusing the column names defined above
toronto_neighborhoods = pd.DataFrame(rows, columns=toronto_neighborhoods.columns)
toronto_neighborhoods.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned


---

Now we clean the dataframe by eliminating all rows without an assigned borough, that is, those containing *'Not assigned'* as value. 
* 12- First convert all the elements with the value *'Not assigned'* in the column *Borough* into a *NaN*
* 13- Then drop all the rows containing *NaN*

In [5]:
#12 convert 'Not assigned' to NaN
toronto_neighborhoods.loc[toronto_neighborhoods['Borough'] == 'Not assigned','Borough'] = np.nan

#13 drop all NaN
toronto_neighborhoods = toronto_neighborhoods.dropna()
toronto_neighborhoods.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern


---

* 14- Next, find all rows where the *Neighborhood* value is *'Not assigned'* and replace it with the value in the *Borough* column using the *numpy.where* function

In [6]:
#14 Replace 'Not assigned' by borough name
toronto_neighborhoods['Neighborhood'] = np.where(toronto_neighborhoods['Neighborhood'] == 'Not assigned', toronto_neighborhoods['Borough'], toronto_neighborhoods['Neighborhood'])
toronto_neighborhoods.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Queen's Park
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern
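
The two cleaning steps above can also be sketched on a small toy frame. This is a minimal illustration, not the notebook's own code: the three rows are taken from the table shown earlier, and the boolean mask is an alternative to the *NaN* plus *dropna* route (it only looks at the *Borough* column, whereas *dropna* drops rows with a missing value in any column).

```python
import numpy as np
import pandas as pd

# Toy rows mirroring the scraped table (a small illustrative subset).
df = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Keep only rows with an assigned borough (boolean mask instead of NaN + dropna).
cleaned = df[df['Borough'] != 'Not assigned'].copy()

# Where the neighborhood is unassigned, fall back to the borough name.
cleaned['Neighborhood'] = np.where(
    cleaned['Neighborhood'] == 'Not assigned',
    cleaned['Borough'],
    cleaned['Neighborhood'],
)
print(cleaned)
```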


---

* 15- Finally, we group all neighborhoods with the same postal code and borough name into a single row, aggregating the neighborhood names into a comma-separated list
* 16- We then reset the index so that *PostalCode* and *Borough* become regular columns again

In [7]:
#15 Group neighborhoods with same postal code
toronto_neighborhoods = toronto_neighborhoods.groupby(['PostalCode', 'Borough']).agg(lambda x: ','.join(x.values))

#16 Reset index
toronto_neighborhoods.reset_index(inplace = True)
toronto_neighborhoods.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park"
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge"
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff,Cliffside West"
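
As a small self-contained sketch of the grouping step, the example below aggregates a three-row toy frame. Selecting the *Neighborhood* column before aggregating is a stylistic choice made here (an assumption, not the notebook's code); it makes explicit which column is being joined.

```python
import pandas as pd

# Toy rows: one postal code with two neighborhoods, one with a single neighborhood.
rows = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M4A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Victoria Village'],
})

# Join the neighborhood names per (PostalCode, Borough) pair, then
# turn the group keys back into ordinary columns.
grouped = (
    rows.groupby(['PostalCode', 'Borough'])['Neighborhood']
        .agg(','.join)
        .reset_index()
)
print(grouped)
```

Note that *groupby* sorts the group keys by default, which is why the resulting rows appear in postal code order.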


---

* 17- Display the dataframe size/shape

In [8]:
#17
toronto_neighborhoods.shape

(103, 3)

---
---
## Part 2: Acquiring longitude and latitude coordinates for each postal code


After several unsuccessful attempts to connect with the Geocoder Python package, I fell back on plan B and loaded the coordinates from the csv file:
* 1- Load the data from the csv file
* 2- Reset the index
* 3- Rename column to match *'toronto_neighborhoods'* dataframe


In [9]:

#1 Download csv data
LatLong_data = pd.read_csv('http://cocl.us/Geospatial_data', header=0, index_col=0)

#2 reset index
LatLong_data = LatLong_data.reset_index()

#3 rename column
LatLong_data = LatLong_data.rename(columns={'Postal Code' : 'PostalCode'})

LatLong_data.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


---

* 4- Next we merge both dataframes using the *PostalCode* column as the key

In [10]:
#4 Merge dataframes
toronto_neighborhoods = pd.merge(toronto_neighborhoods, LatLong_data, on='PostalCode')
toronto_neighborhoods.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff,Cliffside West",43.692657,-79.264848
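
The default inner merge silently drops postal codes that have no coordinates. A hedged sketch of how one could check for that is shown below, using the *how*, *validate* and *indicator* parameters of pandas' *merge*. The postal code *'M9Z'* and the borough *'Example'* are made up for illustration.

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C', 'M9Z'],  # 'M9Z' is a made-up code with no coordinates
    'Borough': ['Scarborough', 'Scarborough', 'Example'],
})
coords = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# A left merge keeps every neighborhood row; validate raises if a postal code
# is duplicated on either side; indicator adds a '_merge' column.
checked = neighborhoods.merge(coords, on='PostalCode', how='left',
                              validate='one_to_one', indicator=True)

# Postal codes that found no match in the coordinates table.
unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```

In the notebook the shape stays at 103 rows after the merge, which suggests every postal code found its coordinates.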


---

Display the dataframe size/shape

In [11]:
toronto_neighborhoods.shape

(103, 5)

---
---
## Part 3: Explore and cluster the neighborhoods in Toronto. 

In this part of the assignment we will:
* a- Obtain venues per postal code and explore results
* b- Cluster in 8 groups
* c- Plot the clusters on the map
* d- Analyze the clusters obtaining data about the size of each cluster and the most common venues per cluster

---

Part 3-a: Obtain the venues from the Foursquare API:
* 1- Define Foursquare credentials and version
* 2- Define request parameters such as *LIMIT* and *radius*

In [12]:
#1 FourSquare credentials
CLIENT_ID = 'ZNG3GOLASP2WLNX3Y5QYM4JRYHM1KEQ45AHQUK5ITQZTEMUO' # your Foursquare ID
CLIENT_SECRET = 'Z0KOIU0BDUWC5UUHFFLNYNMQSTLEN0RNLGLOGZVWHZ21Y00W' # your Foursquare Secret
VERSION = '20180604'

#2 FourSquare parameters
LIMIT = 100
radius = 500

---

* 3- Define a function to extract the category of the venue

In [13]:
# 3 
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

---

* 4- Define a function that loops through all postal codes, calls Foursquare to extract the relevant venue information and converts the results to a pandas dataframe

In [14]:
# 4 function to extract relevant FourSquare data
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
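
The list comprehension in the function above indexes *categories[0]* directly, which would fail for a venue with an empty category list. A minimal sketch of a more defensive flattening step follows. The helper name *extract_venue_rows* and the sample payload are hypothetical; the payload only mimics the shape of *response['groups'][0]['items']* used above and makes no network call.

```python
def extract_venue_rows(name, lat, lng, items):
    """Flatten Foursquare-style 'explore' items into tuples, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Use None when a venue carries no category instead of raising IndexError.
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows


# Hypothetical payload shaped like response['groups'][0]['items'].
sample_items = [
    {'venue': {'name': 'Example Cafe', 'location': {'lat': 43.65, 'lng': -79.38},
               'categories': [{'name': 'Café'}]}},
    {'venue': {'name': 'Unlabelled Spot', 'location': {'lat': 43.66, 'lng': -79.39},
               'categories': []}},
]
rows = extract_venue_rows('Example Hood', 43.65, -79.38, sample_items)
print(rows)
```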

---

* 5- Create new dataframe *toronto_data* from *toronto_neighborhoods* by dropping the *'PostalCode'* column
* 6- Run the above function on the rows of the *toronto_data* dataframe and create a new dataframe called *toronto_venues*

In [15]:
#5 Drop 'PostalCode' column
toronto_data = toronto_neighborhoods.drop(['PostalCode'], axis=1)

#6 Get venues
toronto_venues = getNearbyVenues(names=toronto_data['Neighborhood'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude']
                                  )

Rouge,Malvern
Highland Creek,Rouge Hill,Port Union
Guildwood,Morningside,West Hill
Woburn
Cedarbrae
Scarborough Village
East Birchmount Park,Ionview,Kennedy Park
Clairlea,Golden Mile,Oakridge
Cliffcrest,Cliffside,Scarborough Village West
Birch Cliff,Cliffside West
Dorset Park,Scarborough Town Centre,Wexford Heights
Maryvale,Wexford
Agincourt
Clarks Corners,Sullivan,Tam O'Shanter
Agincourt North,L'Amoreaux East,Milliken,Steeles East
L'Amoreaux West,Steeles West
Upper Rouge
Hillcrest Village
Fairview,Henry Farm,Oriole
Bayview Village
Silver Hills,York Mills
Newtonbrook,Willowdale
Willowdale South
York Mills West
Willowdale West
Parkwoods
Don Mills North
Flemingdon Park,Don Mills South
Bathurst Manor,Downsview North,Wilson Heights
Northwood Park,York University
CFB Toronto,Downsview East
Downsview West
Downsview Central
Downsview Northwest
Victoria Village
Woodbine Gardens,Parkview Hill
Woodbine Heights
The Beaches
Leaside
Thorncliffe Park
East Toronto
The Danforth West,Riverdale
The Beac

---

* 7- Explore how many venues were returned for each row or postal code

In [16]:
#7 Group venues by neighborhood and explore
toronto_venues_grouped = toronto_venues.groupby('Neighborhood').count()
print(toronto_venues_grouped.shape)
toronto_venues_grouped = toronto_venues_grouped.reset_index()
toronto_venues_grouped.head()

(100, 6)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Adelaide,King,Richmond",100,100,100,100,100,100
1,Agincourt,4,4,4,4,4,4
2,"Agincourt North,L'Amoreaux East,Milliken,Steel...",2,2,2,2,2,2
3,"Albion Gardens,Beaumond Heights,Humbergate,Jam...",9,9,9,9,9,9
4,"Alderwood,Long Branch",9,9,9,9,9,9


---

Not all postal codes returned venues, so the dataframe *toronto_venues_grouped* contains fewer rows than *toronto_neighborhoods* (100 versus 103). To make both dataframes the same size (necessary to apply clustering later) we need to:
* 8- Drop the rows/postal codes that did not return any venues

In [17]:
#8 Drop neighborhoods without venues
toronto_with_venues = toronto_venues_grouped[['Neighborhood']]
toronto_data_with_venues = pd.merge(toronto_data, toronto_with_venues, on='Neighborhood')
print(toronto_data_with_venues.shape)

(100, 4)


---

* 9- Explore the venues in each postal code, applying the same analysis as in the NYC example

In [18]:
# 9
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

# group rows by neighborhood and by taking the mean of the frequency of occurrence of each category
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
print(toronto_grouped.shape)
toronto_grouped.head()

(100, 271)


Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store
0,"Adelaide,King,Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.01
1,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Agincourt North,L'Amoreaux East,Milliken,Steel...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Albion Gardens,Beaumond Heights,Humbergate,Jam...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Alderwood,Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
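
To make the frequency table easier to read, here is a toy version of the same one-hot encoding and averaging. The neighborhood names *'A'* and *'B'* and the venue rows are invented for illustration; passing *dtype=float* is a choice made here so the dummy columns are numeric regardless of the pandas version.

```python
import pandas as pd

# Six invented venues spread over two neighborhoods.
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Park', 'Park', 'Café', 'Bank', 'Café', 'Café'],
})

# One indicator column per category (1.0 where the venue has that category).
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# The mean of the indicators is the share of each category within a neighborhood.
freq = onehot.groupby('Neighborhood').mean()
print(freq)
```

Each row of the result sums to 1, so a value such as 0.5 for *Park* in neighborhood *A* means that half of the venues returned for *A* are parks.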


---

* 10- Print each neighborhood along with the top 5 most common venues

In [19]:
# 10
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adelaide,King,Richmond----
                 venue  freq
0          Coffee Shop  0.06
1                 Café  0.05
2  American Restaurant  0.04
3           Steakhouse  0.04
4      Thai Restaurant  0.04


----Agincourt----
            venue  freq
0          Lounge  0.25
1    Skating Rink  0.25
2  Clothing Store  0.25
3  Breakfast Spot  0.25
4     Yoga Studio  0.00


----Agincourt North,L'Amoreaux East,Milliken,Steeles East----
                 venue  freq
0           Playground   0.5
1                 Park   0.5
2          Yoga Studio   0.0
3   Mexican Restaurant   0.0
4  Monument / Landmark   0.0


----Albion Gardens,Beaumond Heights,Humbergate,Jamestown,Mount Olive,Silverstone,South Steeles,Thistletown----
                  venue  freq
0         Grocery Store  0.22
1           Pizza Place  0.11
2              Pharmacy  0.11
3  Fast Food Restaurant  0.11
4            Beer Store  0.11


----Alderwood,Long Branch----
          venue  freq
0   Pizza Place  0.22
1  Skating Rink  0.11
2 

---

* 11- Write a function to sort the venues in descending order.
* 12- Create the new dataframe called *neighborhoods_venues_sorted* that displays the top 10 venues for each neighborhood.

In [20]:
#11
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]


#12
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide,King,Richmond",Coffee Shop,Café,American Restaurant,Steakhouse,Thai Restaurant,Bakery,Restaurant,Bar,Gym,Asian Restaurant
1,Agincourt,Lounge,Skating Rink,Clothing Store,Breakfast Spot,Drugstore,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop
2,"Agincourt North,L'Amoreaux East,Milliken,Steel...",Park,Playground,Women's Store,Donut Shop,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant
3,"Albion Gardens,Beaumond Heights,Humbergate,Jam...",Grocery Store,Beer Store,Fried Chicken Joint,Fast Food Restaurant,Coffee Shop,Pizza Place,Sandwich Place,Pharmacy,Airport Lounge,Airport Service
4,"Alderwood,Long Branch",Pizza Place,Gym,Skating Rink,Coffee Shop,Pharmacy,Pub,Sandwich Place,Pool,Diner,Department Store
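
The column labels and the ranking can also be sketched in isolation. The helper names *ordinal* and *top_categories* below are hypothetical and the frequencies are invented; the suffix helper is an alternative to the *try/except* lookup above and also covers numbers such as 21 or 23, which the lookup would label with "th".

```python
import pandas as pd


def ordinal(n):
    """Return '1st', '2nd', '3rd', '4th', ... for a positive integer."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return f'{n}{suffix}'


def top_categories(freq_row, n):
    """Names of the n largest entries of a frequency Series."""
    return freq_row.sort_values(ascending=False).index[:n].tolist()


# Invented frequencies for a single neighborhood.
freq_row = pd.Series({'Park': 0.1, 'Café': 0.5, 'Bank': 0.3, 'Gym': 0.05})
labels = [f'{ordinal(i)} Most Common Venue' for i in range(1, 4)]
top3 = dict(zip(labels, top_categories(freq_row, 3)))
print(top3)
```

When several categories share the same frequency (often 0.0 in sparse rows), their relative order after sorting is not meaningful, which explains entries such as *Dog Run* or *Donut Shop* appearing in the lower ranks of the table above.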


---

Part 3-b: Clustering neighborhoods based on the returned venues 

* 13- Import *KMeans* from scikit-learn's clustering module
* 14- Run k-means to cluster the neighborhoods into 8 clusters

In [21]:
#13 import clustering library
from sklearn.cluster import KMeans

#14 Clustering
# set number of clusters
kclusters = 8

toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([1, 1, 3, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
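
As a rough illustration of what the clustering step does, the sketch below runs k-means on six synthetic two-column "frequency profiles". The data is invented, and *n_init=10* is set explicitly here as an assumption so the behavior does not depend on the scikit-learn version's default.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic profiles: the first three rows are dominated by the first
# category, the last three by the second.
X = np.array([
    [0.9, 0.1], [0.8, 0.2], [1.0, 0.0],
    [0.1, 0.9], [0.2, 0.8], [0.0, 1.0],
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = model.labels_
print(labels)
```

The label numbers are arbitrary identifiers: which group is called 0 and which 1 carries no meaning, only the grouping itself does.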

---

Part 3-c: Plot clusters on the map

(Note: to visualize maps use this link https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/63e846b5-dc40-4a78-a4d2-d1ad3a6e7041/view?projectid=1d452119-c608-4261-827c-9cce84b02520&context=wdp )

* 15- Import the necessary modules from the libraries matplotlib, geopy and folium
* 16- Get Toronto's coordinates from Nominatim
* 17- Plot neighborhoods on map with Folium

In [22]:
#15
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

#!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

#16 get coordinates for Toronto
address = 'Toronto, Ontario, Canada'

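geolocator = Nominatim(user_agent="toronto_explorer") # recent geopy versions require a user_agent; this name is an arbitrary placeholder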
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

#17  create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Borough'], toronto_data['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto



The geographical coordinates of Toronto are 43.653963, -79.387207.


---

* 18- Create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.
* 19- Add clustering labels
* 20- merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
* 21- plot clusters on map

(Note: to visualize maps use this link https://dataplatform.cloud.ibm.com/analytics/notebooks/v2/63e846b5-dc40-4a78-a4d2-d1ad3a6e7041/view?projectid=1d452119-c608-4261-827c-9cce84b02520&context=wdp )

In [23]:
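#18: create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.
toronto_merged = toronto_data_with_venues.copy()

#19: add clustering labels (kmeans.labels_ follows the row order of toronto_grouped, so match on 'Neighborhood' rather than on position)
cluster_labels = pd.DataFrame({'Neighborhood': toronto_grouped['Neighborhood'],
                               'Cluster Labels': kmeans.labels_})
toronto_merged = toronto_merged.merge(cluster_labels, on='Neighborhood')

#20: join neighborhoods_venues_sorted to add the top 10 venues for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')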

#21: plot clusters on map
# Plot clusters on Toronto's map
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
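
*kmeans.labels_* follows the row order of the frame passed to *fit* (here *toronto_grouped*, which *groupby* sorts by neighborhood name), so labels are safest attached by matching on the *Neighborhood* key rather than by position. A small sketch of that idea, with invented labels and coordinates:

```python
import pandas as pd

# Order used for clustering (alphabetical, as produced by groupby); labels are invented.
clustered = pd.DataFrame({
    'Neighborhood': ['Agincourt', 'Parkwoods', 'Woburn'],
    'Cluster Labels': [1, 3, 1],
})

# The same neighborhoods in a different order; latitudes are illustrative only.
locations = pd.DataFrame({
    'Neighborhood': ['Woburn', 'Agincourt', 'Parkwoods'],
    'Latitude': [43.77, 43.79, 43.75],
})

# Matching on the key keeps each label with its own neighborhood,
# whatever the row order of the two frames.
merged = locations.merge(clustered, on='Neighborhood', how='left')
print(merged)
```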

---

Part 3-d: Analyze the clusters obtaining data about the size of each cluster and the most common venues per cluster


* 22- Group data in dataframe *toronto_grouped* by cluster label, save in *toronto_grouped_by_cluster* dataframe

In [24]:
#22 add cluster labels to *toronto_grouped* dataframe 
toronto_grouped['Cluster Labels'] = kmeans.labels_

# Group dataframe by cluster
toronto_grouped_by_cluster = toronto_grouped.groupby('Cluster Labels').mean().reset_index()
toronto_grouped_by_cluster

Unnamed: 0,Cluster Labels,Yoga Studio,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store
0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,0.004041,0.00119,0.000149,0.000149,0.000928,0.000928,0.000928,0.001855,0.001855,...,0.003556,0.002554,0.002403,0.000852,0.00211,0.005487,0.000866,0.001272,0.001648,0.00158
2,2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3,0.0,0.0,0.0,0.0,0.022222,0.0,0.0,0.0,0.0,...,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011111
4,4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


---

* 22- Create the new dataframe and display the top 5 venues for each cluster.

In [25]:
#22
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Cluster Labels']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
clusters_venues_sorted = pd.DataFrame(columns=columns)
clusters_venues_sorted['Cluster Labels'] = toronto_grouped_by_cluster['Cluster Labels']

for ind in np.arange(toronto_grouped_by_cluster.shape[0]):
    clusters_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped_by_cluster.iloc[ind, :], num_top_venues)

clusters_venues_sorted

Unnamed: 0,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,0,Business Service,Food Truck,Baseball Field,Drugstore,Diner
1,1,Coffee Shop,Pizza Place,Café,Fast Food Restaurant,Sandwich Place
2,2,Baseball Field,Women's Store,Drugstore,Dim Sum Restaurant,Diner
3,3,Park,Playground,Convenience Store,Bus Stop,Bank
4,4,Garden,Women's Store,Donut Shop,Dim Sum Restaurant,Diner
5,5,Bank,Drugstore,Dim Sum Restaurant,Diner,Discount Store
6,6,Bar,Women's Store,Drugstore,Dim Sum Restaurant,Diner
7,7,Playground,Trail,Field,Hockey Arena,Donut Shop


---

* 23- Get size of each cluster measured as the number of neighborhoods in that group
* 24- Add column *'Cluster Size'* to dataframe
* 25- Sort clusters by size

In [26]:
# 23 Get array of cluster sizes
cluster_sizes_array=[]

for k in range(kclusters):
    cluster_results = toronto_grouped.loc[toronto_grouped['Cluster Labels'] == k, toronto_grouped.columns[[1] + list(range(5, toronto_grouped.shape[1]))]]
    cluster_sizes_array.extend([cluster_results.shape[0]])

# 24 Add size to dataframe
clusters_venues_sorted['Cluster Size'] = pd.DataFrame(np.asarray(cluster_sizes_array))
clusters_venues_sorted

# 25 sort dataframe 
clusters_venues_sorted.sort_values(by='Cluster Size', ascending=False)

Unnamed: 0,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,Cluster Size
1,1,Coffee Shop,Pizza Place,Café,Fast Food Restaurant,Sandwich Place,77
3,3,Park,Playground,Convenience Store,Bus Stop,Bank,15
2,2,Baseball Field,Women's Store,Drugstore,Dim Sum Restaurant,Diner,2
7,7,Playground,Trail,Field,Hockey Arena,Donut Shop,2
0,0,Business Service,Food Truck,Baseball Field,Drugstore,Diner,1
4,4,Garden,Women's Store,Donut Shop,Dim Sum Restaurant,Diner,1
5,5,Bank,Drugstore,Dim Sum Restaurant,Diner,Discount Store,1
6,6,Bar,Women's Store,Drugstore,Dim Sum Restaurant,Diner,1
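
The cluster sizes can also be obtained without an explicit loop. A minimal sketch, assuming a small invented label array standing in for *kmeans.labels_* and four clusters:

```python
import numpy as np
import pandas as pd

labels = np.array([1, 1, 3, 1, 0, 3, 1])  # stand-in for kmeans.labels_
n_clusters = 4                            # stand-in for kclusters

# Count rows per label, then make sure every cluster id appears,
# including clusters that received no rows.
sizes = (
    pd.Series(labels)
      .value_counts()
      .reindex(range(n_clusters), fill_value=0)
)
print(sizes)
```

The very uneven sizes in the table above (77 and 15 neighborhoods against several clusters with only one or two) are worth keeping in mind when reading the per-cluster venue profiles.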


* 26- print each cluster along with the top 3 most common venues and the size of the cluster

In [27]:
# print each cluster along with the top 3 most common venues and the size of the cluster
num_top_venues = 3

for cluster in toronto_grouped_by_cluster['Cluster Labels']:
    print("---- Cluster Number = "+str(cluster)+" ----")
    currSize = clusters_venues_sorted.loc[cluster,'Cluster Size']
    print("---- Cluster Size = "+str(currSize)+" neighborhoods ----")
    temp = toronto_grouped_by_cluster[toronto_grouped_by_cluster['Cluster Labels'] == cluster].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

---- Cluster Number = 0 ----
---- Cluster Size = 1 neighborhoods ----
              venue  freq
0    Baseball Field  0.33
1        Food Truck  0.33
2  Business Service  0.33


---- Cluster Number = 1 ----
---- Cluster Size = 77 neighborhoods ----
         venue  freq
0  Coffee Shop  0.07
1  Pizza Place  0.05
2         Café  0.04


---- Cluster Number = 2 ----
---- Cluster Size = 2 neighborhoods ----
                       venue  freq
0             Baseball Field   1.0
1                Yoga Studio   0.0
2  Middle Eastern Restaurant   0.0


---- Cluster Number = 3 ----
---- Cluster Size = 15 neighborhoods ----
               venue  freq
0               Park  0.35
1         Playground  0.05
2  Convenience Store  0.05


---- Cluster Number = 4 ----
---- Cluster Size = 1 neighborhoods ----
                       venue  freq
0                     Garden   1.0
1  Middle Eastern Restaurant   0.0
2                      Motel   0.0


---- Cluster Number = 5 ----
---- Cluster Size = 1 neighborhoo

This is the end of the analysis