# Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto
## By: Martin Foo Yang Chian

In [1]:
import numpy as np # library for vectorized numerical computation
import pandas as pd # library for data analysis


In [2]:
wikipedia_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M' # URL of the Wikipedia list of Canadian postal codes starting with M
df_list = pd.read_html(wikipedia_url) # To parse every HTML table on the page into a list of dataframes
canada_df = df_list[0] # The postal code table is the first dataframe in the list

canada_df.drop(canada_df[canada_df['Borough'] == 'Not assigned'].index, inplace = True) # To remove rows with 'Not Assigned' in 'Borough'


canada_df_filtered = canada_df.reset_index(drop=True) #To reset dataframe index

canada_df_filtered.head() # To display filtered dataframe

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


To extract the table shown above, I used the pandas library to read the HTML page at 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'. pd.read_html returns a list of dataframes, one for each table on the page, and the postal code table was found in df_list[0], as shown in the code above. The data was cleaned by removing the rows whose 'Borough' value is 'Not assigned'. After the index was reset, the dataframe has the expected format shown above.
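
The cleaning step described above can also be written as a boolean filter that leaves the original dataframe untouched. The following is a minimal sketch on a small hand-built table; the `raw` dataframe and its 'Not assigned' rows are illustrative stand-ins, not the actual Wikipedia data.

```python
import pandas as pd

# Illustrative stand-in for the table returned by pd.read_html; the
# 'Not assigned' rows are assumed values used only for this example.
raw = pd.DataFrame({
    'Postal Code': ['M1A', 'M2A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Keep only rows with an assigned borough, then renumber the index from 0.
cleaned = raw[raw['Borough'] != 'Not assigned'].reset_index(drop=True)

print(cleaned)
```

Filtering into a new variable instead of dropping rows in place means `raw` can still be inspected afterwards, which may help when re-running cells out of order.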


In [13]:
print('The dataframe has {} rows.'.format(canada_df_filtered.shape[0]))

The dataframe has 103 rows.


In [14]:
!wget -q -O 'geo_data_csv' http://cocl.us/Geospatial_data #To download CSV file for Geospatial Data
print('Data downloaded!')

Data downloaded!


In [18]:
geo_data = pd.read_csv('geo_data_csv') # To read CSV data using pandas
geo_data.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [32]:
geo_latlng = canada_df_filtered[canada_df_filtered['Postal Code'].isin(geo_data['Postal Code'])] # Keeping only the rows whose postal code appears in the geospatial data

geo_df = pd.merge(geo_latlng, geo_data, on='Postal Code') # Merging the two dataframes on their shared 'Postal Code' column

geo_df.head() # Displaying the dataframe with the latitude and longitude of the respective postal codes

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


In [34]:
pip install folium # To install Folium

Collecting folium
  Downloading folium-0.11.0-py2.py3-none-any.whl (93 kB)
Collecting branca>=0.3.0
  Downloading branca-0.4.1-py3-none-any.whl (24 kB)
Installing collected packages: branca, folium
Successfully installed branca-0.4.1 folium-0.11.0
Note: you may need to restart the kernel to use updated packages.


In [35]:
import folium # To import folium

In [45]:
# create a map centred on Toronto using its latitude and longitude values
map_Toronto = folium.Map(location=[43.651070, -79.347015], zoom_start=11)

# add markers to map
for lat, lng, borough, neighbourhood in zip(geo_df['Latitude'], geo_df['Longitude'], geo_df['Borough'], geo_df['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_Toronto)

map_Toronto # To display the map of Toronto with a marker for each neighbourhood and borough

Because the geocoder package was unreliable and did not return any geospatial data, the latitude and longitude of each postal code in the dataframe above were instead taken from the CSV file available at 'http://cocl.us/Geospatial_data' and matched by postal code. The latitude and longitude values in the merged dataframe, together with the postal codes, boroughs and neighbourhoods, are then used as reference data to add markers to the map of Toronto, Ontario generated by folium. The markers appear fairly evenly spaced across the map of Toronto because the geospatial data are based on the postal codes provided in the CSV file, so each marker represents one postal code.
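
The pairing of postal codes with coordinates can be checked for gaps with a left join. This is a minimal sketch, not the notebook's own method: the two real rows are copied from the merged table above, while the postal code 'M0X' and its 'Hypothetical' borough are made up to show what an unmatched row looks like.

```python
import pandas as pd

# Two neighbourhood rows from the table above, plus the made-up code 'M0X'
# (an assumption, added only to demonstrate an unmatched postal code).
neighbourhoods = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M0X'],
    'Borough': ['North York', 'North York', 'Hypothetical'],
})

# Coordinates for the two real codes, copied from the merged table above.
coords = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# A left join keeps every neighbourhood row; indicator=True adds a '_merge'
# column recording whether a coordinate match was found, and validate
# raises an error if a postal code is duplicated on either side.
merged = pd.merge(neighbourhoods, coords, on='Postal Code', how='left',
                  validate='one_to_one', indicator=True)

# Postal codes with no coordinates are marked 'left_only'.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Postal Code'].tolist()
print(unmatched)
```

An inner merge, as used in the notebook, silently drops unmatched postal codes, so a check like this is one way to confirm that all 103 rows received coordinates before plotting.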