<img src="https://www.toronto.ca/wp-content/uploads/2020/03/94a1-emergency-home-page-skyline.jpg" width=500>

# Segmenting and Clustering Neighborhoods in Toronto

## Instructions

In this assignment, you will be required to explore, segment, and cluster the neighborhoods in the city of Toronto. However, unlike New York, the neighborhood data is not readily available on the internet. What is interesting about the field of data science is that each project can be challenging in its unique way, so you need to learn to be agile and refine the skill to learn new libraries and tools quickly depending on the project.

For the Toronto neighborhood data, a Wikipedia page exists that has all the information we need to explore and cluster the neighborhoods in Toronto. You will be required to scrape the Wikipedia page and wrangle the data, clean it, and then read it into a pandas dataframe so that it is in a structured format like the New York dataset.

Once the data is in a structured format, you can replicate the analysis that we did to the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.

Your submission will be a link to your Jupyter Notebook on your Github repository.

# **Part I: Scrapping and Wrangling**

## Scrapping Data from Wikipedia

In [1]:
import requests
import lxml.html as lh
import pandas as pd

In [2]:
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'#Create a handle, page, to handle the contents of the website
page = requests.get(url)#Store the contents of the website under doc
doc = lh.fromstring(page.content)#Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')

In [3]:
[len(T) for T in tr_elements[:12]]

[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]

In [4]:
tr_elements = doc.xpath('//tr')#Create empty list
col=[]
i=0#For each row, store each first element (header) and an empty list
for t in tr_elements[0]:
    i+=1
    name=t.text_content()
    print ('%d:"%s"'%(i,name))
    col.append((name,[]))

1:"Postal Code
"
2:"Borough
"
3:"Neighborhood
"


In [5]:
col

[('Postal Code\n', []), ('Borough\n', []), ('Neighborhood\n', [])]

In [6]:
len(tr_elements)

185

In [7]:
#Since out first row is the header, data is stored on the second row onwards
for j in range(1,len(tr_elements)):
    #T is our j'th row
    T=tr_elements[j]
    
    #If row is not of size 10, the //tr data is not from our table 
    if len(T)!=3:
        break
    
    #i is the index of our column
    i=0
    
    #Iterate through each element of the row
    for t in T.iterchildren():
        data=t.text_content() 
        #Check if row is empty
        if i>0:
        #Convert any numerical value to integers
            try:
                data=str(data)
            except:
                pass
        #Append the data to the empty list of the i'th column
        col[i][1].append(data)
        #Increment i for the next column
        i+=1

In [8]:
[len(C) for (title,C) in col]

[181, 181, 181]

In [9]:
col

[('Postal Code\n',
  ['M1A\n',
   'M2A\n',
   'M3A\n',
   'M4A\n',
   'M5A\n',
   'M6A\n',
   'M7A\n',
   'M8A\n',
   'M9A\n',
   'M1B\n',
   'M2B\n',
   'M3B\n',
   'M4B\n',
   'M5B\n',
   'M6B\n',
   'M7B\n',
   'M8B\n',
   'M9B\n',
   'M1C\n',
   'M2C\n',
   'M3C\n',
   'M4C\n',
   'M5C\n',
   'M6C\n',
   'M7C\n',
   'M8C\n',
   'M9C\n',
   'M1E\n',
   'M2E\n',
   'M3E\n',
   'M4E\n',
   'M5E\n',
   'M6E\n',
   'M7E\n',
   'M8E\n',
   'M9E\n',
   'M1G\n',
   'M2G\n',
   'M3G\n',
   'M4G\n',
   'M5G\n',
   'M6G\n',
   'M7G\n',
   'M8G\n',
   'M9G\n',
   'M1H\n',
   'M2H\n',
   'M3H\n',
   'M4H\n',
   'M5H\n',
   'M6H\n',
   'M7H\n',
   'M8H\n',
   'M9H\n',
   'M1J\n',
   'M2J\n',
   'M3J\n',
   'M4J\n',
   'M5J\n',
   'M6J\n',
   'M7J\n',
   'M8J\n',
   'M9J\n',
   'M1K\n',
   'M2K\n',
   'M3K\n',
   'M4K\n',
   'M5K\n',
   'M6K\n',
   'M7K\n',
   'M8K\n',
   'M9K\n',
   'M1L\n',
   'M2L\n',
   'M3L\n',
   'M4L\n',
   'M5L\n',
   'M6L\n',
   'M7L\n',
   'M8L\n',
   'M9L\n',
   'M1M\n

In [10]:
Dict={title:column for (title,column) in col}
df=pd.DataFrame(Dict)

In [11]:
df.columns=["Postal Code","Borough", "Neighborhood"]

In [12]:
df

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A\n,Not assigned\n,\n
1,M2A\n,Not assigned\n,\n
2,M3A\n,North York\n,Parkwoods\n
3,M4A\n,North York\n,Victoria Village\n
4,M5A\n,Downtown Toronto\n,"Regent Park, Harbourfront\n"
...,...,...,...
176,M6Z\n,Not assigned\n,\n
177,M7Z\n,Not assigned\n,\n
178,M8Z\n,Etobicoke\n,"Mimico NW, The Queensway West, South of Bloor,..."
179,M9Z\n,Not assigned\n,\n


## Cleaning Data

In [13]:
df = df.replace(r'\n','', regex=True) 

In [14]:
df.columns = df.columns.str.strip()

In [15]:
df

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
176,M6Z,Not assigned,
177,M7Z,Not assigned,
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."
179,M9Z,Not assigned,


In [16]:
df.drop(df[df['Borough'] == "Not assigned"].index, inplace = True) 

In [17]:
df.reset_index(drop=True, inplace=True)

In [18]:
df.drop([103], inplace=True)

## Resultant Dataframe

In [19]:
df

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [20]:
df.shape

(103, 3)

# **Part II: Dataframe with longitudes and latitudes**

## Adding longitude and latitude to Neighborhood

In [21]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
n=df.Neighborhood.to_list()
lat=[]
long=[]

In [22]:
for i in range(len(n)):    

    try:
    
        address = n[i] +', Toronto'
        geolocator = Nominatim(user_agent="toronto_explorer")
        location = geolocator.geocode(address)
        latitude = location.latitude
        longitude = location.longitude
        lat.append(latitude)
        long.append(longitude)
    
    except:
        lat.append(None)        #This manage the addrees that retrieved errors
        long.append(None)

In [23]:
toronto_df=df.copy()

In [24]:
toronto_df["Latitude"]=lat
toronto_df["Longitude"]=long

In [25]:
toronto_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 103 entries, 0 to 102
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Postal Code   103 non-null    object 
 1   Borough       103 non-null    object 
 2   Neighborhood  103 non-null    object 
 3   Latitude      55 non-null     float64
 4   Longitude     55 non-null     float64
dtypes: float64(2), object(3)
memory usage: 4.8+ KB


In [27]:
toronto_df

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.758800,-79.320197
1,M4A,North York,Victoria Village,43.732658,-79.311189
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.640769,-79.379892
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.715283,-79.443914
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",,
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",,
99,M4Y,Downtown Toronto,Church and Wellesley,43.665524,-79.383801
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",,
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",,


## Dealing with NaN Data using the csv with Coordinates

In [28]:
coordinates=pd.read_csv("Toronto_Coordinates.csv")
coordinates.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [29]:
coordinates.loc[0,"Longitude"]

-79.19435340000001

In [30]:
lat_csv=[]
long_csv=[]
for i,row in toronto_df.iterrows():
     for j,row2 in coordinates.iterrows():
        if toronto_df.loc[i,"Postal Code"]==coordinates.loc[j,"Postal Code"]:
            lat_csv.append(coordinates.loc[j,"Latitude"])
            long_csv.append(coordinates.loc[j,"Longitude"])

In [31]:
toronto_df["Latitude"]=lat_csv
toronto_df["Longitude"]=long_csv

## Toronto Neighborhood Dataframe Coordinated

In [32]:
toronto_df

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509


In [33]:
toronto_df.to_csv('toronto_data.csv')

# **Part III: Clustering Toronto Neighborhood**

## Creating Toronto Map with folium

In [34]:
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

In [35]:
address = 'Toronto, Canada'
geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Toronto City are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Toronto City are 43.6534817, -79.3839347.


In [36]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_df['Latitude'], toronto_df['Longitude'], toronto_df['Borough'], toronto_df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

## Getting Neighborhood Venues with Foursquare

In [37]:
CLIENT_ID = 'RGWL5ZMJ5310ADNWYMA4D2IJ1G2LZT44LW5P22XVCKTKL3FY' # your Foursquare ID
CLIENT_SECRET = '0ID4LP32Z4NBH4HGE40SX51RECN2LLMINCV1VIEEZXWLLHYL' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: RGWL5ZMJ5310ADNWYMA4D2IJ1G2LZT44LW5P22XVCKTKL3FY
CLIENT_SECRET:0ID4LP32Z4NBH4HGE40SX51RECN2LLMINCV1VIEEZXWLLHYL


## Exploring Firts Neighborhood

In [38]:
toronto_df.loc[0, 'Neighborhood']

'Parkwoods'

In [39]:
latitude = toronto_df.loc[0, 'Latitude'] # neighborhood latitude value
longitude = toronto_df.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = toronto_df.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               latitude, 
                                                               longitude))

Latitude and longitude values of Parkwoods are 43.7532586, -79.3296565.


In [40]:
radius=500
LIMIT=100
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude, 
    longitude, 
    radius, 
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=RGWL5ZMJ5310ADNWYMA4D2IJ1G2LZT44LW5P22XVCKTKL3FY&client_secret=0ID4LP32Z4NBH4HGE40SX51RECN2LLMINCV1VIEEZXWLLHYL&v=20180605&ll=43.7532586,-79.3296565&radius=500&limit=100'

In [41]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5ecf20ac5fb726001be01beb'},
  'headerLocation': 'Parkwoods - Donalda',
  'headerFullLocation': 'Parkwoods - Donalda, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 2,
  'suggestedBounds': {'ne': {'lat': 43.757758604500005,
    'lng': -79.32343823984928},
   'sw': {'lat': 43.7487585955, 'lng': -79.33587476015072}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4e8d9dcdd5fbbbb6b3003c7b',
       'name': 'Brookbanks Park',
       'location': {'address': 'Toronto',
        'lat': 43.751976046055574,
        'lng': -79.33214044722958,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.751976046055574,
          'lng': -79.33214044722958}],
        'distance': 245,
        'cc': 'CA',
        'c

In [42]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [43]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = pd.json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Brookbanks Park,Park,43.751976,-79.33214
1,Variety Store,Food & Drink Shop,43.751974,-79.333114


In [44]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

2 venues were returned by Foursquare.


## Getting Venues for Neighborhoods

In [45]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [50]:
toronto_venues = getNearbyVenues(names=toronto_df['Neighborhood'],
                                   latitudes=toronto_df['Latitude'],
                                   longitudes=toronto_df['Longitude']
                                  )

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park, Ontario Provincial Government
Islington Avenue, Humber Valley Village
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto, Broadview North (Old East York)
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmo

In [51]:
print(toronto_venues.shape)
toronto_venues.head()

(2130, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
3,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop
4,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant


In [52]:
toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Agincourt,4,4,4,4,4,4
"Alderwood, Long Branch",9,9,9,9,9,9
"Bathurst Manor, Wilson Heights, Downsview North",20,20,20,20,20,20
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",25,25,25,25,25,25
...,...,...,...,...,...,...
"Willowdale, Willowdale West",4,4,4,4,4,4
Woburn,4,4,4,4,4,4
Woodbine Heights,8,8,8,8,8,8
York Mills West,4,4,4,4,4,4


In [53]:
toronto_venues['Neighborhood']

0                                               Parkwoods
1                                               Parkwoods
2                                        Victoria Village
3                                        Victoria Village
4                                        Victoria Village
                              ...                        
2125    Mimico NW, The Queensway West, South of Bloor,...
2126    Mimico NW, The Queensway West, South of Bloor,...
2127    Mimico NW, The Queensway West, South of Bloor,...
2128    Mimico NW, The Queensway West, South of Bloor,...
2129    Mimico NW, The Queensway West, South of Bloor,...
Name: Neighborhood, Length: 2130, dtype: object

In [54]:
print('There are {} uniques categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 275 uniques categories.


## One Hot Encoding

In [55]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

In [56]:
# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

In [57]:
toronto_onehot.shape

(2130, 275)

In [58]:
# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

In [59]:
toronto_onehot.head()

Unnamed: 0,Yoga Studio,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Transportation Service,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Preparing Data Set to K-Mean Model

In [88]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Transportation Service,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000,0.0,0.0,0.0,0.0,0.0,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.050,0.0,0.0,0.0,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000,0.0,0.0,0.0,0.0,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
91,"Willowdale, Willowdale West",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000,0.0,0.0,0.0,0.0,0.0,0.0
92,Woburn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000,0.0,0.0,0.0,0.0,0.0,0.0
93,Woodbine Heights,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0
94,York Mills West,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000,0.0,0.0,0.0,0.0,0.0,0.0


In [89]:
toronto_grouped.shape

(96, 275)

## Getting Hot Venues

In [90]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Agincourt----
                        venue  freq
0                      Lounge  0.25
1                Skating Rink  0.25
2   Latin American Restaurant  0.25
3              Breakfast Spot  0.25
4  Modern European Restaurant  0.00


----Alderwood, Long Branch----
            venue  freq
0     Pizza Place  0.22
1  Sandwich Place  0.11
2        Pharmacy  0.11
3             Gym  0.11
4     Coffee Shop  0.11


----Bathurst Manor, Wilson Heights, Downsview North----
                 venue  freq
0          Coffee Shop  0.10
1                 Bank  0.10
2          Bridal Shop  0.05
3                Diner  0.05
4  Fried Chicken Joint  0.05


----Bayview Village----
                 venue  freq
0   Chinese Restaurant  0.25
1                 Café  0.25
2                 Bank  0.25
3  Japanese Restaurant  0.25
4          Yoga Studio  0.00


----Bedford Park, Lawrence Manor East----
              venue  freq
0       Coffee Shop  0.08
1   Thai Restaurant  0.08
2  Sushi Restaurant  0.08
3        

4            Restaurant  0.08


----North Park, Maple Leaf Park, Upwood Park----
                        venue  freq
0            Basketball Court  0.25
1                      Bakery  0.25
2                        Park  0.25
3  Construction & Landscaping  0.25
4                 Yoga Studio  0.00


----North Toronto West,  Lawrence Park----
                    venue  freq
0             Coffee Shop  0.11
1          Clothing Store  0.11
2             Yoga Studio  0.05
3  Furniture / Home Store  0.05
4      Chinese Restaurant  0.05


----Northwest, West Humber - Clairville----
                 venue  freq
0            Drugstore  0.33
1  Rental Car Location  0.33
2                  Bar  0.33
3          Yoga Studio  0.00
4   Mexican Restaurant  0.00


----Northwood Park, York University----
                  venue  freq
0           Coffee Shop   0.2
1        Massage Studio   0.2
2                   Bar   0.2
3  Caribbean Restaurant   0.2
4    Miscellaneous Shop   0.2


----Old Mill South, Ki

In [91]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [92]:
import numpy as np

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Latin American Restaurant,Lounge,Skating Rink,Breakfast Spot,Donut Shop,Doner Restaurant,Dog Run,Drugstore,Curling Ice,Distribution Center
1,"Alderwood, Long Branch",Pizza Place,Coffee Shop,Skating Rink,Gym,Pharmacy,Athletics & Sports,Pub,Sandwich Place,Deli / Bodega,Cupcake Shop
2,"Bathurst Manor, Wilson Heights, Downsview North",Coffee Shop,Bank,Pizza Place,Shopping Mall,Bridal Shop,Sandwich Place,Deli / Bodega,Ice Cream Shop,Restaurant,Supermarket
3,Bayview Village,Chinese Restaurant,Café,Bank,Japanese Restaurant,Discount Store,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant,Diner
4,"Bedford Park, Lawrence Manor East",Thai Restaurant,Coffee Shop,Italian Restaurant,Sandwich Place,Restaurant,Sushi Restaurant,Greek Restaurant,Liquor Store,Juice Bar,Indian Restaurant


## K-Means Clustering

In [93]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters,random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 0, 1, 1, 1, 1, 1, 1, 1, 1])

In [94]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto_df

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.shape# check the last columns!

(103, 16)

In [95]:
toronto_merged.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,4.0,Park,Food & Drink Shop,Women's Store,Diner,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant,Discount Store
1,M4A,North York,Victoria Village,43.725882,-79.315572,0.0,Coffee Shop,Pizza Place,French Restaurant,Hockey Arena,Portuguese Restaurant,Dessert Shop,Cupcake Shop,Curling Ice,Dance Studio,Deli / Bodega
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,1.0,Coffee Shop,Park,Pub,Bakery,Theater,Café,Restaurant,Breakfast Spot,Yoga Studio,Hotel
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,1.0,Accessories Store,Clothing Store,Furniture / Home Store,Event Space,Boutique,Vietnamese Restaurant,Coffee Shop,Athletics & Sports,Miscellaneous Shop,Arts & Crafts Store
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,1.0,Coffee Shop,Yoga Studio,College Cafeteria,Bar,Beer Bar,Smoothie Shop,Sandwich Place,Burrito Place,Café,College Auditorium


In [96]:
toronto_merged.dropna(inplace=True)

In [97]:
toronto_merged["Cluster Labels"].shape

(100,)

## Printing Toronto Map with Clusters

In [98]:
address = 'Toronto, Canada'
geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Toronto City are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Toronto City are 43.6534817, -79.3839347.


In [149]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
import folium # plotting library
from folium import plugins

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(5)
ys = [i + x + (i*x)**2 for i in range(5)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, cluster, poi in zip(toronto_merged['Latitude'], toronto_merged['Longitude'],toronto_merged['Cluster Labels'] , toronto_merged['Neighborhood']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.vector_layers.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.9).add_to(map_clusters)
       
map_clusters

## Characterizing Clusters

In [100]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,North York,0.0,Coffee Shop,Pizza Place,French Restaurant,Hockey Arena,Portuguese Restaurant,Dessert Shop,Cupcake Shop,Curling Ice,Dance Studio,Deli / Bodega
8,East York,0.0,Pizza Place,Fast Food Restaurant,Bank,Breakfast Spot,Pharmacy,Gym / Fitness Center,Gastropub,Athletics & Sports,Intersection,Dim Sum Restaurant
35,East York,0.0,Park,Coffee Shop,Pizza Place,Convenience Store,Dim Sum Restaurant,Curling Ice,Dance Studio,Deli / Bodega,Department Store,Dessert Shop
50,North York,0.0,Pizza Place,Women's Store,Diner,Curling Ice,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant,Discount Store
63,York,0.0,Grocery Store,Pizza Place,Bus Line,Convenience Store,Women's Store,Dim Sum Restaurant,Dance Studio,Deli / Bodega,Department Store,Dessert Shop
70,Etobicoke,0.0,Pizza Place,Coffee Shop,Intersection,Middle Eastern Restaurant,Sandwich Place,Chinese Restaurant,Distribution Center,Discount Store,Diner,Cupcake Shop
72,North York,0.0,Coffee Shop,Pharmacy,Pizza Place,Bank,Diner,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant
77,Etobicoke,0.0,Bus Line,Pizza Place,Sandwich Place,Mobile Phone Shop,Women's Store,Dessert Shop,Curling Ice,Dance Studio,Deli / Bodega,Department Store
93,Etobicoke,0.0,Pizza Place,Coffee Shop,Skating Rink,Gym,Pharmacy,Athletics & Sports,Pub,Sandwich Place,Deli / Bodega,Cupcake Shop


In [101]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,Downtown Toronto,1.0,Coffee Shop,Park,Pub,Bakery,Theater,Café,Restaurant,Breakfast Spot,Yoga Studio,Hotel
3,North York,1.0,Accessories Store,Clothing Store,Furniture / Home Store,Event Space,Boutique,Vietnamese Restaurant,Coffee Shop,Athletics & Sports,Miscellaneous Shop,Arts & Crafts Store
4,Downtown Toronto,1.0,Coffee Shop,Yoga Studio,College Cafeteria,Bar,Beer Bar,Smoothie Shop,Sandwich Place,Burrito Place,Café,College Auditorium
7,North York,1.0,Japanese Restaurant,Gym,Café,Asian Restaurant,Beer Store,Coffee Shop,Restaurant,Gym / Fitness Center,Chinese Restaurant,Supermarket
9,Downtown Toronto,1.0,Clothing Store,Coffee Shop,Café,Middle Eastern Restaurant,Cosmetics Shop,Bubble Tea Shop,Italian Restaurant,Japanese Restaurant,Restaurant,Tea Room
...,...,...,...,...,...,...,...,...,...,...,...,...
96,Downtown Toronto,1.0,Coffee Shop,Bakery,Pizza Place,Café,Pub,Park,Chinese Restaurant,Restaurant,Italian Restaurant,Deli / Bodega
97,Downtown Toronto,1.0,Coffee Shop,Café,Japanese Restaurant,Hotel,Gym,Restaurant,Seafood Restaurant,Asian Restaurant,Steakhouse,Deli / Bodega
99,Downtown Toronto,1.0,Coffee Shop,Sushi Restaurant,Japanese Restaurant,Gay Bar,Restaurant,Pub,Bubble Tea Shop,Café,Yoga Studio,Gastropub
100,East Toronto,1.0,Light Rail Station,Spa,Burrito Place,Skate Park,Park,Restaurant,Butcher,Garden Center,Garden,Brewery


In [103]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,Scarborough,2.0,Fast Food Restaurant,Women's Store,Discount Store,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Distribution Center


In [104]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
57,North York,3.0,Baseball Field,Women's Store,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Curling Ice
101,Etobicoke,3.0,Construction & Landscaping,Baseball Field,Curling Ice,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Women's Store


In [105]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,North York,4.0,Park,Food & Drink Shop,Women's Store,Diner,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant,Discount Store
16,York,4.0,Park,Hockey Arena,Field,Trail,Dim Sum Restaurant,Curling Ice,Dance Studio,Deli / Bodega,Department Store,Dessert Shop
21,York,4.0,Park,Pool,Women's Store,Gift Shop,Creperie,Donut Shop,Doner Restaurant,Dog Run,Distribution Center,Discount Store
32,Scarborough,4.0,Playground,Convenience Store,Women's Store,Dim Sum Restaurant,Curling Ice,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Diner
49,North York,4.0,Park,Basketball Court,Bakery,Construction & Landscaping,Donut Shop,Doner Restaurant,Dog Run,Drugstore,Distribution Center,Curling Ice
61,Central Toronto,4.0,Park,Bus Line,Swim School,Dim Sum Restaurant,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Diner,Cupcake Shop
64,York,4.0,Park,Convenience Store,Women's Store,Diner,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant,Discount Store
66,North York,4.0,Construction & Landscaping,Park,Bank,Convenience Store,Women's Store,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant,Diner
83,Central Toronto,4.0,Park,Restaurant,Playground,Summer Camp,Dessert Shop,Cupcake Shop,Curling Ice,Dance Studio,Deli / Bodega,Department Store
85,Scarborough,4.0,Park,Playground,Diner,Curling Ice,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant,Discount Store


In [135]:
toronto_merged["Cluster Labels"].value_counts(sort=True)

1.0    76
4.0    12
0.0     9
3.0     2
2.0     1
Name: Cluster Labels, dtype: int64

In [136]:
toronto_merged.groupby("Cluster Labels")["1st Most Common Venue"].describe()

Unnamed: 0_level_0,count,unique,top,freq
Cluster Labels,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0.0,9,5,Pizza Place,4
1.0,76,38,Coffee Shop,16
2.0,1,1,Fast Food Restaurant,1
3.0,2,2,Construction & Landscaping,1
4.0,12,3,Park,10


In [131]:
toronto_merged.groupby("Cluster Labels")["2nd Most Common Venue"].describe()

Unnamed: 0_level_0,count,unique,top,freq
Cluster Labels,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0.0,9,5,Coffee Shop,3
1.0,76,47,Coffee Shop,9
2.0,1,1,Women's Store,1
3.0,2,2,Baseball Field,1
4.0,12,10,Playground,2


In [132]:
toronto_merged.groupby("Cluster Labels")["3rd Most Common Venue"].describe()

Unnamed: 0_level_0,count,unique,top,freq
Cluster Labels,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0.0,9,8,Pizza Place,2
1.0,76,44,Café,10
2.0,1,1,Discount Store,1
3.0,2,2,Curling Ice,1
4.0,12,9,Women's Store,4


In [133]:
toronto_merged.groupby("Cluster Labels")["4th Most Common Venue"].describe()

Unnamed: 0_level_0,count,unique,top,freq
Cluster Labels,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0.0,9,8,Convenience Store,2
1.0,76,52,Liquor Store,5
2.0,1,1,Dance Studio,1
3.0,2,2,Department Store,1
4.0,12,9,Diner,2


In [134]:
toronto_merged.groupby("Cluster Labels")["5th Most Common Venue"].describe()

Unnamed: 0_level_0,count,unique,top,freq
Cluster Labels,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0.0,9,7,Pharmacy,2
1.0,76,50,Restaurant,5
2.0,1,1,Deli / Bodega,1
3.0,2,2,Dessert Shop,1
4.0,12,8,Dance Studio,4


## Cluster Summary

**Cluster 0:** We can see in this cluster neighborhoods where the most popular venues are  pizza places, coffee shops and convenience stores.

**Cluster 1:** This cluster grouped the main amount of neighborhoods, with coffees everywhere , restaurants and liquor stores.

**Cluster 2:** This cluster looks almost without important venues and it's compose just by one neighborhood.

**Cluster 3:** This cluster is compose just by two neighborhoods, also, without important venues.

**Cluster 4:** This one looks like the family cluster, full of parks and R rated places. 

# **Appendix**

## Splitting Neighborhood 

We can see some neighborhood that are splitted by a comma, we can create a new dataframe separating them. This approach help us to retrieved more coordinates from geocode

In [160]:
def change_column_order(df, col_name, index):
    cols = df.columns.tolist()
    cols.remove(col_name)
    cols.insert(index, col_name)
    return df[cols]

def split_df(dataframe, col_name, sep):
    orig_col_index = dataframe.columns.tolist().index(col_name)
    orig_index_name = dataframe.index.name
    orig_columns = dataframe.columns
    dataframe = dataframe.reset_index()  # we need a natural 0-based index for proper merge
    index_col_name = (set(dataframe.columns) - set(orig_columns)).pop()
    df_split = pd.DataFrame(
        pd.DataFrame(dataframe[col_name].str.split(sep).tolist())
        .stack().reset_index(level=1, drop=1), columns=[col_name])
    df = dataframe.drop(col_name, axis=1)
    df = pd.merge(df, df_split, left_index=True, right_index=True, how='inner')
    df = df.set_index(index_col_name)
    df.index.name = orig_index_name
    # merge adds the column to the last place, so we need to move it back
    return change_column_order(df, col_name, orig_col_index)

In [184]:
df_Ap=split_df(df,"Neighborhood", ",")

In [185]:
df_Ap

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Manor
...,...,...,...
102,M8Z,Etobicoke,Mimico NW
102,M8Z,Etobicoke,The Queensway West
102,M8Z,Etobicoke,South of Bloor
102,M8Z,Etobicoke,Kingsway Park South West


In [244]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
n=df_Ap.Neighborhood.to_list()
lat_AP=[]
long_AP=[]

In [245]:
for i in range(len(n)):    

    try:
    
        address = n[i] +', Toronto'
        geolocator = Nominatim(user_agent="toronto_explorer")
        location = geolocator.geocode(address)
        latitude = location.latitude
        longitude = location.longitude
        lat_AP.append(latitude)
        long_AP.append(longitude)
    
    except:
        lat_AP.append(None)        #This manage the addrees that retrieved errors
        long_AP.append(None)

In [247]:
df_Ap["Latitude"]=lat_AP
df_Ap["Longitude"]=long_AP

In [248]:
df_Ap

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.758800,-79.320197
1,M4A,North York,Victoria Village,43.732658,-79.311189
2,M5A,Downtown Toronto,Regent Park,43.660706,-79.360457
2,M5A,Downtown Toronto,Harbourfront,43.640080,-79.380150
3,M6A,North York,Lawrence Manor,43.722079,-79.437507
...,...,...,...,...,...
102,M8Z,Etobicoke,Mimico NW,43.616677,-79.496805
102,M8Z,Etobicoke,The Queensway West,43.623618,-79.514764
102,M8Z,Etobicoke,South of Bloor,43.666534,-79.402926
102,M8Z,Etobicoke,Kingsway Park South West,43.647381,-79.511333


## Difference between data in `coordinates.csv` and `geocode`

In [250]:
df_Ap[df_Ap["Postal Code"]=="M3A"]

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.7588,-79.320197


In [251]:
coordinates[coordinates["Postal Code"]=="M3A"]

Unnamed: 0,Postal Code,Latitude,Longitude
25,M3A,43.753259,-79.329656


In [339]:
df_Ap.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 217 entries, 0 to 102
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Postal Code   217 non-null    object 
 1   Borough       217 non-null    object 
 2   Neighborhood  217 non-null    object 
 3   Latitude      205 non-null    float64
 4   Longitude     205 non-null    float64
dtypes: float64(2), object(3)
memory usage: 10.2+ KB


## Dropping NaN Data

In [254]:
df_Ap=df_Ap.drop(df_Ap[df_Ap["Latitude"]== 0].index) 

In [256]:
df_Ap.reset_index(drop=True)

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.758800,-79.320197
1,M4A,North York,Victoria Village,43.732658,-79.311189
2,M5A,Downtown Toronto,Regent Park,43.660706,-79.360457
3,M5A,Downtown Toronto,Harbourfront,43.640080,-79.380150
4,M6A,North York,Lawrence Manor,43.722079,-79.437507
...,...,...,...,...,...
212,M8Z,Etobicoke,Mimico NW,43.616677,-79.496805
213,M8Z,Etobicoke,The Queensway West,43.623618,-79.514764
214,M8Z,Etobicoke,South of Bloor,43.666534,-79.402926
215,M8Z,Etobicoke,Kingsway Park South West,43.647381,-79.511333
