# Segmenting and Clustering Neighborhoods in Toronto

### 1. Start by creating a new Notebook for this assignment.

##### Preprocessing

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests

### 2. Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe 

In [2]:
# getting data from internet
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
wikipedia_page= requests.get(url).text

# parse the page with Beautiful Soup; the page is HTML, so use an HTML parser
soup = BeautifulSoup(wikipedia_page,'html.parser')
#print(soup.prettify())

### 3. To create the above dataframe:

##### 3.a) Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned. <br> & <br> 3.b) If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.

In [3]:
table = soup.find('table')
Postcode      = []
Borough       = []
Neighbourhood = []

#print(table)

# extract a clean form of the table, one <tr> per postal code entry
for tr_cell in table.find_all('tr'):
    td_cells = tr_cell.find_all('td')
    if len(td_cells) < 3:
        continue  # header row (no <td> cells) or incomplete row

    Postcode_var      = td_cells[0].text
    Borough_var       = td_cells[1].text
    Neighbourhood_var = str(td_cells[2].text).strip()

    # 3.a) an assigned borough is rendered as a link; skip rows without one
    if td_cells[1].find('a') is None:
        continue

    # 3.b) a neighbourhood without a link is treated as 'Not assigned' and
    # replaced by the borough name. Note that this also overwrites any real
    # neighbourhood name that simply has no link on the page.
    if td_cells[2].find('a') is None:
        Neighbourhood_var = Borough_var

    Postcode.append(Postcode_var)
    Borough.append(Borough_var)
    Neighbourhood.append(Neighbourhood_var)
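
Rules 3.a and 3.b can also be expressed on the cell text itself rather than on the presence of a link. The following is a minimal sketch on a few hand-written rows, assuming unassigned cells contain the literal text `Not assigned`. The frame `raw` and its rows are illustrative, not scraped data.

```python
import pandas as pd

# Hypothetical sample rows mimicking the scraped table (not real scraped data)
raw = pd.DataFrame({
    'Postcode':      ['M1A', 'M3A', 'M7A'],
    'Borough':       ['Not assigned', 'North York', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# 3.a) keep only rows with an assigned borough
assigned = raw[raw['Borough'] != 'Not assigned'].copy()

# 3.b) where the neighbourhood is not assigned, reuse the borough name
missing = assigned['Neighbourhood'] == 'Not assigned'
assigned.loc[missing, 'Neighbourhood'] = assigned.loc[missing, 'Borough']

print(assigned.reset_index(drop=True))
```

Comparing text leaves neighbourhood names that have no link untouched, which the link-based check does not.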

##### 3.c) More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.

In [4]:
unique_p = set(Postcode)
print('num of unique Postal codes:', len(unique_p))
Postcode_u      = []
Borough_u       = []
Neighbourhood_u = []

for postcode_unique_element in unique_p:
    p_var = ''
    b_var = ''
    n_var = ''
    for postcode_idx, postcode_element in enumerate(Postcode):
        if postcode_unique_element == postcode_element:
            p_var = postcode_element
            b_var = Borough[postcode_idx]
            if n_var == '': 
                n_var = Neighbourhood[postcode_idx]
            else:
                n_var = n_var + ', ' + Neighbourhood[postcode_idx]
    Postcode_u.append(p_var)
    Borough_u.append(b_var)
    Neighbourhood_u.append(n_var)

num of unique Postal codes: 100
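
The same combination step can be sketched with a pandas `groupby`, which keeps the rows in order of first appearance instead of the arbitrary order of a `set`. The sample frame `rows` below is a small stand-in built from the M5A example in the task text, not the full scraped table.

```python
import pandas as pd

# Sample rows based on the M5A example from the task description
rows = pd.DataFrame({
    'Postcode':      ['M5A', 'M5A', 'M4E'],
    'Borough':       ['Downtown Toronto', 'Downtown Toronto', 'East Toronto'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'The Beaches'],
})

# Join all neighbourhoods that share a postal code into one comma-separated string
combined = (
    rows.groupby(['Postcode', 'Borough'], sort=False)['Neighbourhood']
        .agg(', '.join)
        .reset_index()
)
print(combined)
```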


##### 3.d) The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood

In [5]:
toronto_dict = {'Postcode':Postcode_u, 'Borough':Borough_u, 'Neighbourhood':Neighbourhood_u}
df_toronto = pd.DataFrame.from_dict(toronto_dict)
df_toronto.to_csv('toronto_part1.csv')
#df_toronto.loc[df_toronto['Postcode'] == 'M5V']
#df_toronto.loc[df_toronto['Postcode'] == 'M9A']
df_toronto

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M6P,West Toronto,"High Park, West Toronto"
1,M1J,Scarborough,Scarborough Village
2,M4H,East York,Thorncliffe Park
3,M5E,Downtown Toronto,Berczy Park
4,M8Z,Etobicoke,"Etobicoke, Mimico NW, The Queensway West, Etob..."
5,M5X,Downtown Toronto,"First Canadian Place, Underground city"
6,M4G,East York,Leaside
7,M4E,East Toronto,The Beaches
8,M1H,Scarborough,Scarborough
9,M2J,North York,"North York, Henry Farm, North York"


##### 3.e) In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.

In [6]:
df_toronto.shape

(100, 3)

### 4. Submit a link to your Notebook on your Github repository

### 5.  In order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood

##### Because coordinates could not be obtained with geocoder.google, the provided CSV file is used instead

In [7]:
# Coordinates could not be obtained with geocoder.google, so download the provided CSV file instead
!wget -O GeoCord.csv http://cocl.us/Geospatial_data/


In [8]:
df_cord = pd.read_csv('GeoCord.csv') # Read the csv data
df_cord.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [9]:
# Create Latitude and Longitude columns in df_toronto
df_toronto['Latitude'] = np.nan
df_toronto['Longitude'] = np.nan

# For each postcode in df_toronto, find the matching row in df_cord and copy its coordinates
for idx in df_toronto.index:
    cord_idx = df_cord['Postal Code'] == df_toronto.loc[idx, 'Postcode']
    # .values[0] extracts the scalar; assigning a whole array to a single cell is fragile
    df_toronto.at[idx, 'Latitude'] = df_cord.loc[cord_idx, 'Latitude'].values[0]
    df_toronto.at[idx, 'Longitude'] = df_cord.loc[cord_idx, 'Longitude'].values[0]
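
The row-by-row lookup can also be written as a single left join. This is a minimal sketch on two small stand-in frames (`left` and `coords` are illustrative names, filled with values that appear in this notebook), not a drop-in replacement for the cell above.

```python
import pandas as pd

# Small stand-ins for df_toronto and df_cord, using values shown in this notebook
left = pd.DataFrame({'Postcode': ['M1H', 'M4G'],
                     'Borough': ['Scarborough', 'East York']})
coords = pd.DataFrame({'Postal Code': ['M1B', 'M1H', 'M4G'],
                       'Latitude': [43.806686, 43.773136, 43.70906],
                       'Longitude': [-79.194353, -79.239476, -79.363452]})

# A left join keeps every row of `left`; an unmatched postcode would get NaN
merged = left.merge(coords, how='left',
                    left_on='Postcode', right_on='Postal Code')
merged = merged.drop(columns='Postal Code')
print(merged)
```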

In [10]:
# Display the results
df_toronto.head(20)
#df_toronto.loc[df_toronto['Postcode'] == 'M5G']
#df_toronto.loc[df_toronto['Postcode'] == 'M2H']

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M6P,West Toronto,"High Park, West Toronto",43.661608,-79.464763
1,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
2,M4H,East York,Thorncliffe Park,43.705369,-79.349372
3,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
4,M8Z,Etobicoke,"Etobicoke, Mimico NW, The Queensway West, Etob...",43.628841,-79.520999
5,M5X,Downtown Toronto,"First Canadian Place, Underground city",43.648429,-79.38228
6,M4G,East York,Leaside,43.70906,-79.363452
7,M4E,East Toronto,The Beaches,43.676357,-79.293031
8,M1H,Scarborough,Scarborough,43.773136,-79.239476
9,M2J,North York,"North York, Henry Farm, North York",43.778517,-79.346556


### 6. Explore and cluster the neighborhoods in Toronto

In [11]:
!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library


In [12]:
address = 'Toronto, Canada'

# Nominatim expects a user_agent identifying the application; the name used here is arbitrary
geolocator = Nominatim(user_agent='toronto_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


In [13]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighbourhood in zip(df_toronto['Latitude'], df_toronto['Longitude'], df_toronto['Borough'], df_toronto['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        ).add_to(map_toronto)

map_toronto

In [14]:
#Set up for Foursquare
# Replace the placeholders with your own credentials; never commit real keys to a public repository
CLIENT_ID = 'your-foursquare-client-id'
CLIENT_SECRET = 'your-foursquare-client-secret'
VERSION = '20180605'
LIMIT = 100
radius = 500

In [19]:
#Define the function that fetches venue data
def getNearbyVenues(neighbourhood, latitudes, longitudes, radius=500):
    
    venues_list=[]
    # iterate over the arguments rather than over the global df_toronto
    for name, lat, lng in zip(neighbourhood, latitudes, longitudes):
                   
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # keep only the relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
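
The function indexes straight into the JSON response, so a failed request or a venue without a category raises an exception. The following sketch shows one way to extract the same fields more defensively. The helper `extract_venues` and the payload `sample_payload` are hypothetical; the payload only mirrors the nesting the function relies on (`response`, `groups[0]`, `items`, `venue`) and is not real API output.

```python
# Hypothetical payload with the same nesting the function above relies on
sample_payload = {
    'response': {
        'groups': [{
            'items': [
                {'venue': {'name': 'Example Cafe',
                           'location': {'lat': 43.66, 'lng': -79.46},
                           'categories': [{'name': 'Café'}]}},
                {'venue': {'name': 'Uncategorised Spot',
                           'location': {'lat': 43.67, 'lng': -79.47},
                           'categories': []}},
            ]
        }]
    }
}

def extract_venues(payload):
    """Return (name, lat, lng, category) tuples, tolerating missing groups or categories."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    venues = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Use None when a venue has no category instead of raising IndexError
        category = categories[0]['name'] if categories else None
        venues.append((venue['name'],
                       venue['location']['lat'],
                       venue['location']['lng'],
                       category))
    return venues

print(extract_venues(sample_payload))
print(extract_venues({'response': {}}))  # an empty response yields an empty list
```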

In [20]:
Toronto_venues = getNearbyVenues(neighbourhood=df_toronto['Neighbourhood'],
                                   latitudes=df_toronto['Latitude'],
                                   longitudes=df_toronto['Longitude']
                                  )

In [21]:
#Take a look at the result
print(Toronto_venues.shape)
Toronto_venues.head()

(2226, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"High Park, West Toronto",43.661608,-79.464763,Lithuania Park,43.658667,-79.463038,Park
1,"High Park, West Toronto",43.661608,-79.464763,Hole in the Wall,43.665296,-79.465118,Bar
2,"High Park, West Toronto",43.661608,-79.464763,nodo,43.665303,-79.465621,Italian Restaurant
3,"High Park, West Toronto",43.661608,-79.464763,Indie Alehouse,43.665475,-79.46529,Gastropub
4,"High Park, West Toronto",43.661608,-79.464763,famous last words,43.665181,-79.468471,Speakeasy


### Analyze Each Neighborhood

In [22]:
# one hot encoding
Toronto_onehot = pd.get_dummies(Toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
Toronto_onehot['Neighborhood'] = Toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
Toronto_onehot.set_index("Neighborhood", inplace= True)
Toronto_onehot.reset_index(inplace= True)
Toronto_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,"High Park, West Toronto",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"High Park, West Toronto",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"High Park, West Toronto",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"High Park, West Toronto",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"High Park, West Toronto",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [23]:
#Group by Neighborhood
Toronto_grouped = Toronto_onehot.groupby('Neighborhood').mean().reset_index()
Toronto_grouped.head()

Unnamed: 0,Neighborhood,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Agincourt North, Scarborough, Milliken, Scarbo...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,"Bathurst Manor, North York, Wilson Heights",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
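
Each value in the grouped table is the share of a neighbourhood's venues that fall into a category, because the mean of 0/1 indicators is a frequency. A minimal sketch on made-up data (the frame `venues` and its rows are hypothetical) makes this concrete:

```python
import pandas as pd

# Hypothetical venues: three in neighbourhood A, one in neighbourhood B
venues = pd.DataFrame({
    'Neighborhood':   ['A', 'A', 'A', 'B'],
    'Venue Category': ['Park', 'Bar', 'Park', 'Bar'],
})

# One indicator column per category; dtype=float keeps the mean well defined
onehot = pd.get_dummies(venues[['Venue Category']], prefix='', prefix_sep='', dtype=float)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])

# The mean of the indicators is the share of venues in each category
freq = onehot.groupby('Neighborhood').mean().reset_index()
print(freq)
```

Here neighbourhood A has two parks out of three venues, so its `Park` value is about 0.67.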


### Assume that K = 5 is an optimal K

In [25]:
from sklearn.cluster import KMeans

# Cluster with k-means
kclusters = 5
# pass the column by keyword; the positional axis argument is not accepted by newer pandas
Toronto_grouped_clustering = Toronto_grouped.drop(columns='Neighborhood')

kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(Toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
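
The first ten labels are all 0, which hints that the clusters may be uneven but does not show by how much. Counting the labels is a quick check. The sketch below uses a hypothetical list `labels` standing in for `kmeans.labels_`; the values are made up and are not the result of the model above.

```python
from collections import Counter

# Hypothetical labels standing in for kmeans.labels_ (not the real result)
labels = [0, 0, 0, 1, 0, 2, 0, 0, 3, 0, 4, 0]

# Count how many neighbourhoods fall into each cluster
sizes = Counter(labels)
for cluster, count in sorted(sizes.items()):
    share = count / len(labels)
    print('cluster {}: {} neighbourhoods ({:.0%})'.format(cluster, count, share))
```

With the fitted model, the same count could be taken from `Counter(kmeans.labels_.tolist())`.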

In [26]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [27]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # ranks beyond the third use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = Toronto_grouped['Neighborhood']

for ind in np.arange(Toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Agincourt,Clothing Store,Breakfast Spot,Latin American Restaurant,Lounge,Farmers Market
1,"Agincourt North, Scarborough, Milliken, Scarbo...",Park,Playground,Coffee Shop,Yoga Studio,Dumpling Restaurant
2,"Alderwood, Long Branch",Pizza Place,Pharmacy,Coffee Shop,Sandwich Place,Athletics & Sports
3,"Bathurst Manor, North York, Wilson Heights",Coffee Shop,Pet Store,Frozen Yogurt Shop,Shopping Mall,Sandwich Place
4,Bayview Village,Café,Bank,Chinese Restaurant,Japanese Restaurant,Dog Run
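
The suffix lookup used for the column names works for five columns but would produce "21th" for larger values of `num_top_venues`. A small helper can build the names for any count. This is a sketch only; the function name `ordinal` is an assumption and does not appear in the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'  # 11th, 12th and 13th are irregular
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

# Build the same column names as above for the top five venues
columns = ['Neighborhood'] + [
    '{} Most Common Venue'.format(ordinal(i)) for i in range(1, 6)
]
print(columns)
```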


### Assign Cluster

In [30]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

Toronto_merged = df_toronto.drop(["Postcode"], axis= 1)

# join df_toronto with neighborhoods_venues_sorted to add the cluster label and top venues for each neighbourhood
Toronto_merged = Toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighbourhood')
# drop neighbourhoods without a cluster label, i.e. those absent from the venue results
Toronto_merged.dropna(subset=["Cluster Labels"], axis= 0, inplace= True)
# astype returns a new dataframe and has no inplace option
Toronto_merged = Toronto_merged.astype({"Cluster Labels": "int64"})
Toronto_merged.head()

Unnamed: 0,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,West Toronto,"High Park, West Toronto",43.661608,-79.464763,0,Thai Restaurant,Mexican Restaurant,Café,Bar,Grocery Store
1,Scarborough,Scarborough Village,43.744734,-79.239476,1,Playground,Convenience Store,Yoga Studio,Eastern European Restaurant,Dog Run
2,East York,Thorncliffe Park,43.705369,-79.349372,0,Indian Restaurant,Yoga Studio,Bank,Gym,Gym / Fitness Center
3,Downtown Toronto,Berczy Park,43.644771,-79.373306,0,Coffee Shop,Cocktail Bar,Steakhouse,Bakery,Farmers Market
4,Etobicoke,"Etobicoke, Mimico NW, The Queensway West, Etob...",43.628841,-79.520999,0,Hardware Store,Bakery,Grocery Store,Fast Food Restaurant,Discount Store


### Visualize clustering

In [33]:
import matplotlib.cm as cm
import matplotlib.colors as colors
%matplotlib inline

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(Toronto_merged['Latitude'], Toronto_merged['Longitude'], Toronto_merged['Neighbourhood'], Toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        # cluster labels run from 0 to kclusters-1, so they index the color list directly
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters