# Toronto Neighborhoods Analysis

### Part 1: Getting the names of neighborhoods and boroughs, and the postal code

The libraries will be imported as we need them, not at the beginning.

First, we need to get the list of neighborhoods in Toronto from the Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [72]:
wikipedia_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

In [10]:
!pip install bs4
!pip install requests

from bs4 import BeautifulSoup  # this module helps with web scraping
import requests  # this module helps us to download a web page

import pandas as pd
import numpy as np



In [73]:
# Get data from page
text_data = requests.get(wikipedia_url).text

# Process data with BeautifulSoup
soup = BeautifulSoup(text_data,"html5lib")

# Find all tables in the page
all_tables = soup.find_all('table')
print("There are {} tables in the page".format(len(all_tables)))

# Looking at the page, the table we want is the one containing 'M1A'
for index, table in enumerate(all_tables):
    if "M1A" in str(table):
        table_index = index
        break
print("The table we want is of index {}".format(table_index))

# Get that table in a variable and print it so we can see its structure
raw_table = all_tables[table_index]
print(raw_table.prettify())

There are 3 tables in the page
The table we want is of index 0
<table cellpadding="2" cellspacing="0" rules="all" style="width:100%; border-collapse:collapse; border:1px solid #ccc;">
 <tbody>
  <tr>
   <td style="width:11%; vertical-align:top; color:#ccc;">
    <p>
     <b>
      M1A
     </b>
     <br/>
     <span style="font-size:85%;">
      <i>
       Not assigned
      </i>
     </span>
    </p>
   </td>
   <td style="width:11%; vertical-align:top; color:#ccc;">
    <p>
     <b>
      M2A
     </b>
     <br/>
     <span style="font-size:85%;">
      <i>
       Not assigned
      </i>
     </span>
    </p>
   </td>
   <td style="width:11%; vertical-align:top;">
    <p>
     <b>
      M3A
     </b>
     <br/>
     <span style="font-size:85%;">
      <a href="/wiki/North_York" title="North York">
       North York
      </a>
      <br/>
      (
      <a href="/wiki/Parkwoods" title="Parkwoods">
       Parkwoods
      </a>
      )
     </span>
    </p>
   </td>
   <td style="width:11

Looking at the table, we can see that each data point is a cell (`<td>`), not a row!

It follows this structure:

```html
<td style="width:11%; vertical-align:top;">
    <p>
     <b>
      M5A <!-- This is the Postal Code -->
     </b>
     <br/>
     <span style="font-size:85%;">
      <a href="/wiki/Downtown_Toronto" title="Downtown Toronto">
       Downtown Toronto <!-- The first value is the Borough -->
      </a>
      <br/>
      (
      <a href="/wiki/Regent_Park" title="Regent Park">
       Regent Park <!-- The following values are the neighborhoods -->
      </a>
      /
      <a href="/wiki/Harbourfront,_Toronto" title="Harbourfront, Toronto">
       Harbourfront <!-- This is also a neighborhood -->
      </a>
      )
     </span>
    </p>
   </td>
```
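The string handling needed for this structure can be tried on the text of a single cell before touching the real page. The following is only a minimal sketch: it assumes the text of the `<span>` looks like `Borough(Neighborhood / Neighborhood)`, and the helper name `parse_cell_text` is hypothetical (it is not part of BeautifulSoup or of this notebook).

```python
def parse_cell_text(cell_text, span_text):
    """Split the text of one table cell into postal code, borough and neighborhoods.

    cell_text: full text of the cell (starts with the 3-character postal code)
    span_text: text of the <span> inside the cell
    """
    # Unassigned postal codes carry no borough, so they are skipped
    if span_text.strip() == "Not assigned":
        return None

    # Everything before the first "(" is the borough, the rest lists the neighborhoods
    borough, _, rest = span_text.partition("(")
    neighborhoods = [name.strip() for name in rest.rstrip(")").split("/")]

    return {
        "PostalCode": cell_text[:3],
        "Borough": borough.strip(),
        "Neighborhood": ", ".join(neighborhoods),
    }


# Example text, assumed to match what the M5A cell yields
example = parse_cell_text(
    "M5ADowntown Toronto(Regent Park / Harbourfront)",
    "Downtown Toronto(Regent Park / Harbourfront)",
)
print(example)
```

Splitting on `/` and joining with `, ` gives the same comma-separated neighborhood string that appears in the dataframe further down.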

Now, we will create a list to store our data.

In [74]:
# Create empty list
table_contents=[]

# Iterate through table cells ("td")
for row in raw_table.find_all('td'):
    
    # Create empty dictionary
    cell = {}
    
    if row.span.text=='Not assigned':
        pass
    else:
        cell['PostalCode'] = row.p.text[:3] # Get the first 3 characters of the text in each cell
        
        cell['Borough'] = (row.span.text).split('(')[0] # Get everything that is before the "("
        
        cell['Neighborhood'] = (((((row.span.text).split('(')[1]).strip(')')).replace(' /',',')).replace(')',' ')).strip(' ') # Get what is after the "(", and replace slashes with commas
        
        table_contents.append(cell) # Append cell to contents

# print(table_contents)

# Transform list into a dataframe 
toronto_df=pd.DataFrame(table_contents)

# Make adjustments as recommended
toronto_df['Borough']=toronto_df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                             'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                             'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                             'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})

print("Shape is {}".format(toronto_df.shape))
toronto_df.head(5)

Shape is (103, 3)


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government


Print the shape again just to be sure:

In [75]:
print("Shape of Toronto neighborhood dataframe is {}".format(toronto_df.shape))

Shape of Toronto neighborhood dataframe is (103, 3)


### Part 2: Getting latitude and longitude for each postal code

First, we import pgeocode

In [16]:
!pip install pgeocode
import pgeocode # import pgeocode



Now we get the latitudes and longitudes

In [76]:

# Convert postal codes to a list
postal_codes = toronto_df['PostalCode'].tolist()

# Define the geolocator
geolocator = pgeocode.Nominatim('ca')

# Create empty lists for lat and long
latitudes = []
longitudes = []

# Go through the postal codes and get the lat/long
for postal_code in postal_codes:
    
    # Query the postal code; the result is a Series whose fields are NaN when nothing is found
    g = geolocator.query_postal_code(postal_code)
    
    # Store lat and long (NaN marks a postal code that was not found)
    latitudes.append(g.latitude)
    longitudes.append(g.longitude)

Pass the lat and long we just got into the dataframe with neighborhoods and boroughs

In [80]:
toronto_df_latlong = toronto_df.copy()  # explicit copy, so the new columns do not touch toronto_df
toronto_df_latlong['Latitude'] = latitudes
toronto_df_latlong['Longitude'] = longitudes
toronto_df_latlong.head(5)


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.7545,-79.33
1,M4A,North York,Victoria Village,43.7276,-79.3148
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.6555,-79.3626
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.7223,-79.4504
4,M7A,Queen's Park,Ontario Provincial Government,43.6641,-79.3889


One of the postal codes (row 76) has no latitude or longitude. Let's investigate it.

In [81]:
toronto_df_latlong.iloc[[76]]

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
76,M7R,Mississauga,Enclave of L4W,,
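Row 76 was picked by inspection here. Rows with missing coordinates can also be located programmatically; the following is a minimal sketch on a small stand-in dataframe (the three rows are illustrative, not the full dataset), assuming missing values are stored as NaN.

```python
import numpy as np
import pandas as pd

# Stand-in data (not the real dataframe): one postal code has no coordinates
sample = pd.DataFrame({
    "PostalCode": ["M3A", "M7R", "M9R"],
    "Latitude": [43.7545, np.nan, 43.6898],
    "Longitude": [-79.33, np.nan, -79.5582],
})

# Keep only the rows where either coordinate is missing
missing = sample[sample[["Latitude", "Longitude"]].isna().any(axis=1)]
print(missing)
```

The index of `missing` then gives the row labels to inspect or drop.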


In [82]:
g = geolocator.query_postal_code('M7R')
g

postal_code       M7R
country_code      NaN
place_name        NaN
state_name        NaN
state_code        NaN
county_name       NaN
county_code       NaN
community_name    NaN
community_code    NaN
latitude          NaN
longitude         NaN
accuracy          NaN
Name: 0, dtype: object

The lookup indeed returned nothing for M7R. Let's delete this row, as it won't affect the exercise.

In [83]:
# Drop the row without coordinates; reset_index() keeps the old index as an 'index' column
toronto_df_latlong_fixed = toronto_df_latlong.drop(76).reset_index()


We now get the new shape:

In [85]:
print("The new shape of the dataframe is {}".format(toronto_df_latlong_fixed.shape))

The new shape of the dataframe is (102, 6)


Row 76 now holds valid coordinates (it was previously row 77):

In [101]:
toronto_df_latlong_fixed.iloc[[76]]

Unnamed: 0,index,PostalCode,Borough,Neighborhood,Latitude,Longitude
76,77,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ...",43.6898,-79.5582
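The extra `index` column in the output above appears because `reset_index` keeps the old row labels by default. The following sketch, on a small stand-in dataframe, contrasts that default with `drop=True`, which discards the old labels; which one is preferable depends on whether the original positions are still needed.

```python
import numpy as np
import pandas as pd

# Stand-in data (not the real dataframe)
sample = pd.DataFrame({
    "PostalCode": ["M3A", "M7R", "M9R"],
    "Latitude": [43.7545, np.nan, 43.6898],
})

# Default: the old labels are kept in a new 'index' column
kept = sample.drop(1).reset_index()

# drop=True: the old labels are discarded
dropped = sample.drop(1).reset_index(drop=True)

print(kept.columns.tolist())
print(dropped.columns.tolist())
```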


### Part 3: Explore the data

#### Part 3.1: Visualize the points on a map

#### First, we import folium and other libraries for colors

In [87]:
# For maps
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
!pip install folium # If the code above doesn't work
import folium # map rendering library

# For colors
import matplotlib.cm as cm
import matplotlib.colors as colors

# For K Means (later)
#!pip uninstall numpy
!pip install numpy
!pip install scikit-learn # the package is published as scikit-learn and imported as sklearn
from sklearn.cluster import KMeans

# Reference for issues
# https://stackoverflow.com/questions/57518050/conda-install-and-update-do-not-work-also-solving-environment-get-errors



#### Test map visualization

In [88]:
lat_test = 43.6532
long_test = -79.3832
toronto_test_map = folium.Map(location=[lat_test, long_test], zoom_start=12)
toronto_test_map

#### Now we will plot the location of each neighborhood

In [102]:
# create map of Toronto using latitude and longitude values
map_toronto_neigh = folium.Map(location=[lat_test, long_test], zoom_start=11)

# add markers to map
for lat, lng, label in zip(toronto_df_latlong_fixed['Latitude'], toronto_df_latlong_fixed['Longitude'], toronto_df_latlong_fixed['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto_neigh)  
    
map_toronto_neigh

#### Now we will get Foursquare data for each neighborhood

In [89]:
# Correct values are in a hidden cell

CLIENT_ID = 'xxxx' # your Foursquare ID
CLIENT_SECRET = 'xxxx' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

In [90]:
# The code was removed by Watson Studio for sharing.

#### Define function to get nearby venues based on latlong

In [104]:
LIMIT = 100
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    print('Done!!')
    return(nearby_venues)
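As an alternative to formatting the query string by hand, the URL can be assembled with the standard library so that every value is escaped. This is only a sketch: the helper name `build_explore_url` is hypothetical, the credentials are placeholders, and the parameter names are the ones used in the function above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Assemble the venues/explore URL with an encoded query string."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)


# Placeholder credentials, coordinates of the first row of the dataframe
url = build_explore_url("xxxx", "xxxx", "20180605", 43.7545, -79.33)
print(url)

# Decoding the query string recovers the original values
print(parse_qs(urlsplit(url).query)["ll"])
```

No request is sent in this sketch; it only builds and inspects the URL.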

#### Now we apply it on the Toronto dataset

In [105]:
toronto_venues = getNearbyVenues(names=toronto_df_latlong_fixed['PostalCode'] + ' - ' + toronto_df_latlong_fixed['Neighborhood'],
                                 latitudes = toronto_df_latlong_fixed['Latitude'],
                                 longitudes = toronto_df_latlong_fixed['Longitude'])

M3A - Parkwoods
M4A - Victoria Village
M5A - Regent Park, Harbourfront
M6A - Lawrence Manor, Lawrence Heights
M7A - Ontario Provincial Government
M9A - Islington Avenue
M1B - Malvern, Rouge
M3B - Don Mills North
M4B - Parkview Hill, Woodbine Gardens
M5B - Garden District, Ryerson
M6B - Glencairn
M9B - West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
M1C - Rouge Hill, Port Union, Highland Creek
M3C - Don Mills South
M4C - Woodbine Heights
M5C - St. James Town
M6C - Humewood-Cedarvale
M9C - Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
M1E - Guildwood, Morningside, West Hill
M4E - The Beaches
M5E - Berczy Park
M6E - Caledonia-Fairbanks
M1G - Woburn
M4G - Leaside
M5G - Central Bay Street
M6G - Christie
M1H - Cedarbrae
M2H - Hillcrest Village
M3H - Bathurst Manor, Wilson Heights, Downsview North
M4H - Thorncliffe Park
M5H - Richmond, Adelaide, King
M6H - Dufferin, Dovercourt Village
M1J - Scarborough Village
M2J - Fairview, Henry Farm, Oriole
M3J - Nor

#### Let's see our results :)

In [106]:
print('Shape of toronto_venues is {}'.format(toronto_venues.shape))
toronto_venues.head(15)

Shape of toronto_venues is (2149, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M3A - Parkwoods,43.7545,-79.33,Brookbanks Park,43.751976,-79.33214,Park
1,M3A - Parkwoods,43.7545,-79.33,TTC stop - 44 Valley Woods,43.755402,-79.333741,Bus Stop
2,M3A - Parkwoods,43.7545,-79.33,Variety Store,43.751974,-79.333114,Food & Drink Shop
3,M4A - Victoria Village,43.7276,-79.3148,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
4,M4A - Victoria Village,43.7276,-79.3148,Portugril,43.725819,-79.312785,Portuguese Restaurant
5,M4A - Victoria Village,43.7276,-79.3148,Tim Hortons,43.725517,-79.313103,Coffee Shop
6,M4A - Victoria Village,43.7276,-79.3148,Eglinton Ave E & Sloane Ave/Bermondsey Rd,43.726086,-79.31362,Intersection
7,M4A - Victoria Village,43.7276,-79.3148,Pizza Nova,43.725824,-79.31286,Pizza Place
8,M4A - Victoria Village,43.7276,-79.3148,Wigmore Park,43.731023,-79.310771,Park
9,"M5A - Regent Park, Harbourfront",43.6555,-79.3626,Tandem Coffee,43.653559,-79.361809,Coffee Shop


#### Let's check how many venues were returned for each neighborhood

In [107]:
toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"M1B - Malvern, Rouge",1,1,1,1,1,1
"M1E - Guildwood, Morningside, West Hill",32,32,32,32,32,32
M1G - Woburn,3,3,3,3,3,3
M1H - Cedarbrae,3,3,3,3,3,3
M1J - Scarborough Village,3,3,3,3,3,3
...,...,...,...,...,...,...
M9N - Weston,4,4,4,4,4,4
M9P - Westmount,9,9,9,9,9,9
"M9R - Kingsview Village, St. Phillips, Martin Grove Gardens, Richview Gardens",10,10,10,10,10,10
"M9V - South Steeles, Silverstone, Humbergate, Jamestown, Mount Olive, Beaumond Heights, Thistletown, Albion Gardens",10,10,10,10,10,10
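Because `count()` repeats the same number in every column, a single count per neighborhood may be easier to read. A minimal sketch on stand-in data (five rows in the shape of `toronto_venues`, not the real results) using `groupby(...).size()`:

```python
import pandas as pd

# Stand-in data with the same two columns as toronto_venues
sample = pd.DataFrame({
    "Neighborhood": ["M3A - Parkwoods"] * 3 + ["M4A - Victoria Village"] * 2,
    "Venue": ["Brookbanks Park", "TTC stop - 44 Valley Woods", "Variety Store",
              "Tim Hortons", "Pizza Nova"],
})

# One number per neighborhood, largest first
venue_counts = sample.groupby("Neighborhood").size().sort_values(ascending=False)
print(venue_counts)
```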


In [108]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 254 unique categories.


#### Analyze type of venues for each neighborhood

In [119]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['NeighborhoodName'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,NeighborhoodName,Accessories Store,Afghan Restaurant,Airport,American Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,...,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,M3A - Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,M3A - Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,M3A - Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,M4A - Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,M4A - Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [110]:
print('The new size, with categories of places on columns, is {}'.format(toronto_onehot.shape))

The new size, with categories of places on columns, is (2149, 254)


#### Now we will group it by neighborhood

In [122]:
toronto_grouped = toronto_onehot.groupby('NeighborhoodName').mean().reset_index()
toronto_grouped.head(5)

Unnamed: 0,NeighborhoodName,Accessories Store,Afghan Restaurant,Airport,American Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,...,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,"M1B - Malvern, Rouge",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"M1E - Guildwood, Morningside, West Hill",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,M1G - Woburn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M1H - Cedarbrae,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,M1J - Scarborough Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
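Taking the mean of one-hot columns gives, for each neighborhood, the share of its venues that fall into each category. A small sketch on made-up data (neighborhoods `A` and `B` are placeholders) makes the numbers easy to check by hand:

```python
import pandas as pd

# Made-up venues: neighborhood A has four venues, B has one
sample = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B"],
    "Venue Category": ["Park", "Park", "Coffee Shop", "Bus Stop", "Coffee Shop"],
})

# One column per category, 1 where the venue belongs to it
onehot = pd.get_dummies(sample["Venue Category"], dtype=int)
onehot.insert(0, "NeighborhoodName", sample["Neighborhood"])

# The mean of the 0/1 columns is the frequency of each category
grouped = onehot.groupby("NeighborhoodName").mean().reset_index()
print(grouped)
```

In this sketch half of the venues in `A` are parks, so its `Park` value is 0.5, while `B` has a single coffee shop and gets 1.0 in `Coffee Shop`.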


In [123]:
print('The new shape is {}'.format(toronto_grouped.shape))

The new shape is (99, 255)


### The new shape has fewer rows (99) than we had postal codes with coordinates (102). This likely means that no venues were returned within the 500 m radius for 3 of them
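To see which names are missing, the names that were requested can be compared with the names present in the venues dataframe. A minimal sketch on stand-in data; the third name is hypothetical and only illustrates a location without venues:

```python
import pandas as pd

# Names passed to the venue search (stand-in data)
requested = pd.Series([
    "M3A - Parkwoods",
    "M4A - Victoria Village",
    "Z9Z - Hypothetical Area",
])

# Names that actually appear in the venue results (stand-in data)
venues = pd.DataFrame({
    "Neighborhood": ["M3A - Parkwoods", "M3A - Parkwoods", "M4A - Victoria Village"],
})

# Requested names with no matching venue rows
no_venues = requested[~requested.isin(venues["Neighborhood"])]
print(no_venues.tolist())
```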

### Part 4: Cluster neighborhoods