 # Segmenting and Clustering Neighborhoods in Toronto

First, I import pandas, requests, and BeautifulSoup to get the table from Wikipedia, along with the other libraries used later in the notebook:

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import json
from geopy.geocoders import Nominatim
from pandas import json_normalize  # replaces the deprecated pandas.io.json import path
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium


In [2]:
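# Request a pinned revision (oldid) of the page so the table contents do not change.
req = requests.get("https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=945633050")

# Parse the HTML and take the first table on the page.
soup = BeautifulSoup(req.content,'lxml')
table = soup.find_all('table')[0]

# read_html returns a list of DataFrames; the first one is the postal code table.
df = pd.read_html(str(table))
nbh=pd.DataFrame(df[0])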


In [3]:
nbh.head()


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


Let's see how many rows and columns this table has before processing its data.

In [4]:
nbh.shape


(287, 3)

Now I need to process the data: remove the rows that contain "Not assigned" in the Borough column, then combine the neighborhoods that share the same postcode into a single row.

First, let's remove the "Not assigned" rows.

In [5]:
nbh.set_index('Borough', inplace=True) 
nbh.drop('Not assigned', axis=0, inplace=True)
nbh.reset_index(inplace=True)
nbh.head()



Unnamed: 0,Borough,Postcode,Neighbourhood
0,North York,M3A,Parkwoods
1,North York,M4A,Victoria Village
2,Downtown Toronto,M5A,Harbourfront
3,North York,M6A,Lawrence Heights
4,North York,M6A,Lawrence Manor
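
The set_index / drop / reset_index sequence can also be written as a single boolean filter. The sketch below is only an illustration on a small hand-made frame: the names `sample` and `assigned` and the five rows are assumptions for the example, not the scraped table. A side effect of this form is that the original column order is preserved.

```python
import pandas as pd

# Hypothetical sample mirroring the first rows of the scraped table.
sample = pd.DataFrame({
    'Postcode': ['M1A', 'M2A', 'M3A', 'M4A', 'M5A'],
    'Borough': ['Not assigned', 'Not assigned', 'North York',
                'North York', 'Downtown Toronto'],
    'Neighbourhood': ['Not assigned', 'Not assigned', 'Parkwoods',
                      'Victoria Village', 'Harbourfront'],
})

# Keep only the rows whose Borough is assigned, then renumber the index from 0.
assigned = sample[sample['Borough'] != 'Not assigned'].reset_index(drop=True)
print(assigned)
```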


In [6]:
# Reorder the columns and sort in one chained step instead of sorting a slice in place.
nbh = nbh[['Postcode','Borough','Neighbourhood']].sort_values(by='Postcode')
nbh.head()


Unnamed: 0,Postcode,Borough,Neighbourhood
7,M1B,Scarborough,Rouge
8,M1B,Scarborough,Malvern
22,M1C,Scarborough,Port Union
21,M1C,Scarborough,Rouge Hill
20,M1C,Scarborough,Highland Creek


In [7]:
nbh.shape


(210, 3)

Now, let's join the neighborhoods that share the same postcode and borough into a single comma-separated string.

In [8]:
nbh2=nbh.groupby(['Postcode','Borough']).apply(lambda x: ','.join(x['Neighbourhood']))
nbh2 = nbh2.reset_index()
nbh2.columns = ['Postcode','Borough','Neighbourhood']
nbh2.head(10)



Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Port Union,Rouge Hill,Highland Creek"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park"
7,M1L,Scarborough,"Golden Mile,Oakridge,Clairlea"
8,M1M,Scarborough,"Cliffcrest,Scarborough Village West,Cliffside"
9,M1N,Scarborough,"Cliffside West,Birch Cliff"
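
As a possible alternative to `apply` with a lambda, the same grouping can be expressed by selecting the Neighbourhood column and aggregating it with `','.join`. This is a minimal sketch on a hand-made frame; `sample` and `grouped` are illustrative names and the rows are copied from the output above rather than scraped.

```python
import pandas as pd

# Hypothetical sample: one row per (postcode, neighbourhood) pair.
sample = pd.DataFrame({
    'Postcode': ['M1B', 'M1B', 'M1C', 'M1C', 'M1C', 'M1G'],
    'Borough': ['Scarborough'] * 6,
    'Neighbourhood': ['Rouge', 'Malvern', 'Port Union',
                      'Rouge Hill', 'Highland Creek', 'Woburn'],
})

# Group by postcode and borough, then join the neighbourhood names of each group.
grouped = (
    sample.groupby(['Postcode', 'Borough'])['Neighbourhood']
          .agg(','.join)
          .reset_index()
)
print(grouped)
```

Because the aggregated column keeps its name, there is no need to reassign the column names afterwards.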


In [9]:
nbh2.shape

(103, 3)

Now, I load the latitude and longitude data from the provided CSV file so that it can be matched to the postcodes in the neighborhood dataframe.

In [10]:
geodata = pd.read_csv('http://cocl.us/Geospatial_data')
geodata.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [11]:
geodata.shape

(103, 3)

In the next step, I create two empty columns in my nbh2 pandas dataframe to populate with the latitude and longitude data later on.

In [12]:
nbh2['Latitude']=""
nbh2['Longitude']=""
nbh2.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",,
1,M1C,Scarborough,"Port Union,Rouge Hill,Highland Creek",,
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",,
3,M1G,Scarborough,Woburn,,
4,M1H,Scarborough,Cedarbrae,,


In [17]:
# For each postcode in nbh2, find the matching row in geodata and copy its coordinates.
for i in range(len(nbh2['Postcode'])):
    for j in range(len(geodata['Postal Code'])):
        if nbh2.loc[i,'Postcode']==geodata.loc[j,'Postal Code']:
            nbh2.loc[i,'Latitude']=geodata.loc[j, 'Latitude']
            nbh2.loc[i,'Longitude']=geodata.loc[j,'Longitude']
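
The nested loop compares every postcode in one frame with every postal code in the other. A left join on the postcode should produce the same pairing in one call and keeps the coordinates as floats. The sketch below is an assumption-labelled illustration: `nbh_sample`, `geo_sample` and `merged` are made-up names, and the three rows are copied from the outputs above instead of being loaded from the CSV.

```python
import pandas as pd

# Hypothetical samples standing in for nbh2 (without empty columns) and geodata.
nbh_sample = pd.DataFrame({
    'Postcode': ['M1B', 'M1C', 'M1E'],
    'Borough': ['Scarborough'] * 3,
    'Neighbourhood': ['Rouge,Malvern',
                      'Port Union,Rouge Hill,Highland Creek',
                      'Guildwood,Morningside,West Hill'],
})
geo_sample = pd.DataFrame({
    'Postal Code': ['M1E', 'M1B', 'M1C'],  # deliberately in a different order
    'Latitude': [43.763573, 43.806686, 43.784535],
    'Longitude': [-79.188711, -79.194353, -79.160497],
})

# Left join keeps every neighbourhood row, in its original order, and attaches
# the coordinates whose 'Postal Code' equals the row's 'Postcode'.
merged = (
    nbh_sample.merge(geo_sample, how='left',
                     left_on='Postcode', right_on='Postal Code')
              .drop(columns='Postal Code')
)
print(merged)
```

With this approach the empty Latitude and Longitude columns do not need to be created beforehand, and the later type conversion is not needed either.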

In [18]:
nbh2.tail()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
98,M9N,York,Weston,43.7069,-79.5182
99,M9P,Etobicoke,Westmount,43.6963,-79.5322
100,M9R,Etobicoke,"Richview Gardens,Kingsview Village,St. Phillip...",43.6889,-79.5547
101,M9V,Etobicoke,"Albion Gardens,Beaumond Heights,Humbergate,Jam...",43.7394,-79.5884
102,M9W,Etobicoke,Northwest,43.7067,-79.5941


In [15]:
nbh2.shape

(103, 5)

Next, I check the data types in my dataframe. Notice that the Latitude and Longitude columns are of type object.

In [20]:
nbh2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Postcode       103 non-null    object
 1   Borough        103 non-null    object
 2   Neighbourhood  103 non-null    object
 3   Latitude       103 non-null    object
 4   Longitude      103 non-null    object
dtypes: object(5)
memory usage: 2.1+ KB


So, to work with the data and plot the neighborhoods on a map, I convert the Latitude and Longitude columns to float using the "to_numeric()" function.

In [21]:
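# to_numeric returns a new Series, so assign the result back to each column.
nbh2['Latitude'] = pd.to_numeric(nbh2['Latitude'])
nbh2['Longitude'] = pd.to_numeric(nbh2['Longitude'])
nbh2['Longitude']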

0     -79.194353
1     -79.160497
2     -79.188711
3     -79.216917
4     -79.239476
         ...    
98    -79.518188
99    -79.532242
100   -79.554724
101   -79.588437
102   -79.594054
Name: Longitude, Length: 103, dtype: float64
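
One detail worth checking here is that `pd.to_numeric` does not modify a column in place: the dtype only changes once the returned Series is stored back in the dataframe. The following is a small sketch on a toy frame (the name `toy` and its two values are illustrative) that records the dtype at each stage.

```python
import pandas as pd

# Hypothetical toy column holding floats with dtype object, as in nbh2.
toy = pd.DataFrame({'Latitude': pd.Series([43.806686, 43.784535], dtype=object)})
before = str(toy['Latitude'].dtype)

# Calling to_numeric without assignment leaves the dataframe untouched.
pd.to_numeric(toy['Latitude'])
unassigned = str(toy['Latitude'].dtype)

# Assigning the result back replaces the column with a float64 Series.
toy['Latitude'] = pd.to_numeric(toy['Latitude'])
after = str(toy['Latitude'].dtype)

print(before, unassigned, after)
```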

Now I plot the neighborhoods on a map using folium.

In [22]:
address = 'Toronto, ON, Canada'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [23]:
toronto_map = folium.Map(location=[latitude, longitude], zoom_start=10)

for lat, lng, borough, neighborhood in zip(nbh2['Latitude'], nbh2['Longitude'], nbh2['Borough'], nbh2['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(toronto_map)  
    
toronto_map