Segmenting and Clustering in Toronto 

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
import requests # library to handle requests and web scraping

from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl
import csv

print("everything was installed")

everything was installed


Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M.

In [2]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, 'lxml')

print("The soup is hot and ready")

The soup is hot and ready


Web scraping: locate the postal code table on the page and collect its rows.

In [3]:
data_table = soup.find('table', {'class': 'wikitable sortable'})
data_rows = data_table.find_all('tr')

The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood

In [4]:
data=[]
for row in data_rows:
    data.append([t.text.strip() for t in row.find_all('td')])
    
df_toronto = pd.DataFrame(data, columns=['PostalCode', 'Borough', 'Neighborhood'])
df_toronto = df_toronto[~df_toronto['PostalCode'].isnull()] # drop the header row, which has no <td> cells and so yields an all-None row
df_toronto.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront


In [5]:
df_toronto.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 287 entries, 1 to 287
Data columns (total 3 columns):
PostalCode      287 non-null object
Borough         287 non-null object
Neighborhood    287 non-null object
dtypes: object(3)
memory usage: 9.0+ KB


In [6]:
df_toronto.shape

(287, 3)

Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [7]:
df_toronto.drop(df_toronto[df_toronto['Borough'] =="Not assigned"].index, axis=0, inplace=True)
df_toronto

Unnamed: 0,PostalCode,Borough,Neighborhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Downtown Toronto,Queen's Park
10,M9A,Queen's Park,Not assigned
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern
14,M3B,North York,Don Mills North
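
The same filter can also be written as a boolean mask instead of an in-place `drop`. A minimal sketch on a small hand-made frame; the `sample` rows are illustrative and are not the scraped table:

```python
import pandas as pd

# Illustrative rows only; the real data comes from the scraped Wikipedia table.
sample = pd.DataFrame(
    {
        "PostalCode": ["M1A", "M3A", "M4A"],
        "Borough": ["Not assigned", "North York", "North York"],
        "Neighborhood": ["Not assigned", "Parkwoods", "Victoria Village"],
    }
)

# Keep only rows whose borough is assigned, then renumber the index from 0.
assigned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```

Because the mask returns a new frame, the original `sample` is left untouched, which can make re-running a cell safer than an in-place drop.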


In [8]:
df_toronto.reset_index(drop=True, inplace=True)
df_toronto.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor


More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.

In [9]:
df_tor_group = df_toronto.groupby('PostalCode').agg(lambda x: ','.join(x))
df_tor_group.reset_index(inplace=True)
df_tor_group.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,"Scarborough,Scarborough","Rouge,Malvern"
1,M1C,"Scarborough,Scarborough,Scarborough","Highland Creek,Rouge Hill,Port Union"
2,M1E,"Scarborough,Scarborough,Scarborough","Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
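
Applying the same join to every column also repeats the borough name once per neighborhood, which is why the Borough column above reads `Scarborough,Scarborough`. One possible alternative is to aggregate the two columns differently. This is sketched on illustrative rows and assumes each postal code belongs to a single borough:

```python
import pandas as pd

# Illustrative rows only, shaped like the scraped table.
sample = pd.DataFrame(
    {
        "PostalCode": ["M1B", "M1B", "M1G"],
        "Borough": ["Scarborough", "Scarborough", "Scarborough"],
        "Neighborhood": ["Rouge", "Malvern", "Woburn"],
    }
)

# Take the first borough per postal code and comma-join the neighborhoods.
grouped = (
    sample.groupby("PostalCode", as_index=False)
    .agg({"Borough": "first", "Neighborhood": ",".join})
)
print(grouped)
```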


In [10]:
df_tor_group.shape

(103, 3)

If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.

In [11]:
not_assigned = df_tor_group['Neighborhood'] == "Not assigned" # the match is case-sensitive
df_tor_group.loc[not_assigned, 'Neighborhood'] = df_tor_group.loc[not_assigned, 'Borough']
df_tor_group.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,"Scarborough,Scarborough","Rouge,Malvern"
1,M1C,"Scarborough,Scarborough,Scarborough","Highland Creek,Rouge Hill,Port Union"
2,M1E,"Scarborough,Scarborough,Scarborough","Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
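
String comparison in pandas is case-sensitive, so the mask has to match the scraped text exactly or it selects nothing. A minimal sketch of the fill on two illustrative rows:

```python
import pandas as pd

# Illustrative rows only; M9A mirrors the Queen's Park case described above.
sample = pd.DataFrame(
    {
        "PostalCode": ["M7A", "M9A"],
        "Borough": ["Downtown Toronto", "Queen's Park"],
        "Neighborhood": ["Queen's Park", "Not assigned"],
    }
)

# Build the mask once and reuse it on both sides of the assignment.
mask = sample["Neighborhood"] == "Not assigned"
sample.loc[mask, "Neighborhood"] = sample.loc[mask, "Borough"]
print(sample)
```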


Remove the duplicate boroughs.

In [12]:
# collapse repeated borough names (e.g. "Scarborough,Scarborough") while keeping their order and the spaces inside multi-word names
df_tor_group['Borough'] = df_tor_group['Borough'].apply(lambda s: ','.join(dict.fromkeys(s.split(','))))
df_tor_group.set_index('PostalCode', inplace=True)
df_tor_group

Unnamed: 0_level_0,Borough,Neighborhood
PostalCode,Unnamed: 1_level_1,Unnamed: 2_level_1
M1B,Scarborough,"Rouge,Malvern"
M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
M1E,Scarborough,"Guildwood,Morningside,West Hill"
M1G,Scarborough,Woburn
M1H,Scarborough,Cedarbrae
M1J,Scarborough,Scarborough Village
M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park"
M1L,Scarborough,"Clairlea,Golden Mile,Oakridge"
M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West"
M1N,Scarborough,"Birch Cliff,Cliffside West"


In [13]:
df_tor_group.info()

<class 'pandas.core.frame.DataFrame'>
Index: 103 entries, M1B to M9W
Data columns (total 2 columns):
Borough         103 non-null object
Neighborhood    103 non-null object
dtypes: object(2)
memory usage: 2.4+ KB


In [14]:
df_tor_group.shape

(103, 2)

# Question 2:
Use the Geocoder package or the csv file to create a dataframe.

Now that you have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, in order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood.

Given that this package can be very unreliable, in case you are not able to get the geographical coordinates of the neighborhoods using the Geocoder package, here is a link to a csv file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data

In [15]:
!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim

!conda install -c conda-forge folium=0.5.0 --yes
import folium
print('Geopy and Folium installed')

Solving environment: done

# All requested packages already installed.

Solving environment: done

# All requested packages already installed.

Geopy and Folium installed


Get the latitude and longitude data from the CSV file

In [16]:
df_lat_long = pd.read_csv("http://cocl.us/Geospatial_data")
print("Lat and Long read")

Lat and Long read


In [17]:
df_lat_long.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [18]:
#rename columns and set the index to PostalCode
df_lat_long.columns = ['PostalCode', 'Latitude', 'Longitude']
if (df_lat_long.index.name != "PostalCode"):
    df_lat_long = df_lat_long.set_index('PostalCode')
    
df_lat_long.head()

Unnamed: 0_level_0,Latitude,Longitude
PostalCode,Unnamed: 1_level_1,Unnamed: 2_level_1
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
M1G,43.770992,-79.216917
M1H,43.773136,-79.239476


Join dataframes

In [20]:
df_combined = df_tor_group.join(df_lat_long)
df_combined.reset_index(inplace = True)
df_combined

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff,Cliffside West",43.692657,-79.264848
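
`DataFrame.join` defaults to a left join on the index, so a postal code missing from the coordinates file would come through with NaN latitude and longitude rather than raising an error. A quick check, sketched on illustrative frames; the code `M0Z` is made up to force a miss:

```python
import pandas as pd

# Illustrative frames; "M0Z" is a made-up postal code with no coordinates.
boroughs = pd.DataFrame(
    {"Borough": ["Scarborough", "Scarborough"]},
    index=pd.Index(["M1B", "M0Z"], name="PostalCode"),
)
coords = pd.DataFrame(
    {"Latitude": [43.806686], "Longitude": [-79.194353]},
    index=pd.Index(["M1B"], name="PostalCode"),
)

combined = boroughs.join(coords)  # left join on the shared PostalCode index
missing = combined[combined["Latitude"].isna()].index.tolist()
print(missing)
```

An empty `missing` list would mean every postal code found its coordinates.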


# Question 3:
Explore and cluster the neighborhoods in Toronto. You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City data. It is up to you.

Just make sure:

1. to add enough Markdown cells to explain what you decided to do and to report any observations you make.
2. to generate maps to visualize your neighborhoods and how they cluster together.

In [21]:
from geopy.geocoders import Nominatim 
address = 'Toronto, On'

geolocator = Nominatim(user_agent="on_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


Create a map of Toronto with neighborhoods superimposed on top

In [22]:
# create map of Toronto using latitude and longitude values
import folium
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_combined['Latitude'], df_combined['Longitude'], df_combined['Borough'], df_combined['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Looking at the "Scarborough" borough

In [23]:
scarborough_data = df_combined[df_combined['Borough'] == 'Scarborough'].reset_index(drop=True)
scarborough_data

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea,Golden Mile,Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest,Cliffside,Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff,Cliffside West",43.692657,-79.264848


Get the geographical coordinates of Scarborough

In [24]:
borough_address = 'Scarborough, On'

geo_borough = Nominatim(user_agent ='bor_explorer')
location = geo_borough.geocode(borough_address)
latitude_scar = location.latitude
longitude_scar = location.longitude
print('The geographical coordinates of Scarborough are {}, {}.'.format(latitude_scar, longitude_scar))

The geographical coordinates of Scarborough are 43.773077, -79.257774.


Map of Scarborough

In [25]:
# create map of Scarborough using latitude and Longitude values
map_scarborough = folium.Map(location=[latitude_scar, longitude_scar], zoom_start=11)

# add markers to map
for lat_scar, long_scar, label in zip(scarborough_data['Latitude'], scarborough_data['Longitude'], scarborough_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat_scar, long_scar],
        radius = 5,
        popup=label,
        color = 'red',
        fill = True,
        fill_color = '#3186cc',
        fill_opacity = 0.7,
        parse_html=False).add_to(map_scarborough)

map_scarborough

Define Foursquare Credentials and Version

In [27]:
# The code was removed by Watson Studio for sharing.

imported json and json_normalize


Get the neighborhood's name and coordinates

In [28]:
scarborough_data.loc[3,'Neighborhood']

'Woburn'

In [29]:
neighborhood_lat = scarborough_data.loc[3,'Latitude'] #neighborhood latitude value
neighborhood_long = scarborough_data.loc[3,'Longitude'] #Longitude value

neighborhood_name = scarborough_data.loc[3, 'Neighborhood'] #neighborhood name
print("Latitude and Longitude values of {} are {}, {}.".format(neighborhood_name,
                                                                neighborhood_lat,
                                                                neighborhood_long))

Latitude and Longitude values of Woburn are 43.7709921, -79.21691740000001.


Get the top 50 venues that are in Woburn within a radius of 500 meters.

In [31]:
# The code was removed by Watson Studio for sharing.
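
The request-building cell was stripped before sharing, so its exact contents are unknown. As a rough sketch only, a URL for the legacy Foursquare v2 `venues/explore` endpoint might be assembled as below. The credential strings and the version date are placeholders, the name `url_woburn` is taken from the cell that follows, and none of this is the author's original code.

```python
from urllib.parse import urlencode

# Placeholders: these values are hypothetical and must be replaced with real credentials.
CLIENT_ID = "YOUR_CLIENT_ID"
CLIENT_SECRET = "YOUR_CLIENT_SECRET"
VERSION = "20180605"  # assumed API version date

# Woburn coordinates as printed earlier in the notebook.
neighborhood_lat, neighborhood_long = 43.7709921, -79.2169174

params = {
    "client_id": CLIENT_ID,
    "client_secret": CLIENT_SECRET,
    "v": VERSION,
    "ll": "{},{}".format(neighborhood_lat, neighborhood_long),
    "radius": 500,  # search radius in meters
    "limit": 50,    # at most 50 venues
}
url_woburn = "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)
print(url_woburn)
```

Building the query with `urlencode` rather than string formatting keeps the parameters properly escaped.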

Send the GET request and examine the results

In [32]:
woburn_results = requests.get(url_woburn).json()
woburn_results

{'meta': {'code': 200, 'requestId': '5e1e2d6b0de0d9001b9cec1d'},
  'headerLocation': 'Toronto',
  'headerFullLocation': 'Toronto',
  'headerLocationGranularity': 'city',
  'totalResults': 3,
  'suggestedBounds': {'ne': {'lat': 43.7754921045, 'lng': -79.21069729639068},
   'sw': {'lat': 43.7664920955, 'lng': -79.22313750360935}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4cc1d28c06c254815ac18547',
       'name': 'Starbucks',
       'location': {'address': '300 Borough Dr',
        'crossStreet': 'Scarborough Town Centre',
        'lat': 43.770037201625215,
        'lng': -79.22115586641958,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.770037201625215,
          'lng': -79.22115586641958}],
        'distance': 356,
        'cc': 'CA',
        '
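
The `distance` field in the response can be sanity-checked against the 500 meter radius. A minimal sketch using the haversine formula, with the Woburn coordinates printed earlier and the Starbucks coordinates from the response; the Earth radius is the usual mean-radius approximation, so expect the result to be close to the reported value but not necessarily identical:

```python
from math import asin, cos, radians, sin, sqrt

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two (lat, lng) points."""
    earth_radius_m = 6371000  # mean Earth radius, an approximation
    dlat = radians(lat2 - lat1)
    dlng = radians(lng2 - lng1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlng / 2) ** 2
    return 2 * earth_radius_m * asin(sqrt(a))

woburn = (43.7709921, -79.2169174)
starbucks = (43.770037201625215, -79.22115586641958)

distance_m = haversine_m(*woburn, *starbucks)
print(round(distance_m), distance_m <= 500)
```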

Define a function that extracts the category of the venues found:

In [33]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError: # flattened rows use the 'venue.categories' column instead
        categories_list = row['venue.categories']
        
    if len(categories_list) ==0:
        return None
    else:
        return categories_list[0]['name']

Clean the JSON and structure it into a pandas dataframe:

In [34]:
venues = woburn_results['response']['groups'][0]['items']

nearby_venues = json_normalize(venues) #flatten JSON

#filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

#filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

#clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Starbucks,Coffee Shop,43.770037,-79.221156
1,Tim Hortons,Coffee Shop,43.770827,-79.223078
2,Korean Grill House,Korean Restaurant,43.770812,-79.214502
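
To see what the flattening step does, here is a small sketch that runs `pandas.json_normalize` on a hand-written list shaped like the `items` entries in the response. The single venue is illustrative and trimmed to the fields used in this notebook. The import path is an assumption, since the original import cell was stripped; older pandas versions exposed the function from `pandas.io.json` instead.

```python
import pandas as pd

# Illustrative item, trimmed to the fields this notebook uses.
items = [
    {
        "venue": {
            "name": "Starbucks",
            "categories": [{"name": "Coffee Shop"}],
            "location": {"lat": 43.770037, "lng": -79.221156},
        }
    }
]

flat = pd.json_normalize(items)  # nested dict keys become dotted column names
print(sorted(flat.columns))

# Lists are not expanded, so the category name still has to be pulled out by hand.
flat["venue.categories"] = flat["venue.categories"].apply(
    lambda cats: cats[0]["name"] if cats else None
)
print(flat)
```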


How many Venues were returned by Foursquare?