**Part 1**  
_In this part of the assignment, we will scrap the data of all postal code, its borough and its neighborhoods in Canada from Wikipedia page.
In this data, we only consider boroughs which have assigned values. To make the data comprehensible, the values for postal code remain 
unique,which play the key role in the dataframe. The NA values for neighborhood are replaced by its borough, which makes the assumption 
that these boroughs have no neighbor or the information is missing_

In [None]:
# Import all needed libraries
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import numpy as np

In [3]:
# Scrap needed data from Wikipedia
url = "https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=890001695"
html = requests.get(url).text
soup = bs(html, 'html.parser')
ta=soup.find('table',{'class':'wikitable'})
table_headers = ta.find_all('th')
table_rows = ta.find_all('tr')

l = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [tr.text for tr in td]
    if row:
        row[-1] = row[-1].strip('\n')
        l.append(row)
df = pd.DataFrame(l, columns=["Postalcode", "Borough", "Neighbourhood"])
df.head()

Unnamed: 0,Postalcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [4]:
# Drop rows for Borough is not assigned
df.drop(df.loc[df['Borough']== 'Not assigned'].index, inplace=True)
df.head()

Unnamed: 0,Postalcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


In [10]:
# Group by PostalCode
df = df.groupby('Postalcode').agg({'Borough':'first', 
                             'Neighbourhood': ', '.join}).reset_index()
df.head()

Unnamed: 0,Postalcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [10]:
# Replace a Not assigned neighborhood by its borough
df['Neighbourhood'] = np.where(df['Neighbourhood'] == 0, df['Borough'], df['Neighbourhood'])
df.head()

Unnamed: 0,Postalcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


In [11]:
df.shape

(103, 3)

**Part 2**  
_As the geocoder library is unreliable and the runtime for requesting the coordinates is too big, we import the needed information from 
the given .csv file. The imported dataframe contains the information of postal code and its longtitude and latitude. In order to have the 
longtitute and latitute for the given postal code in the dataframe in part 1, we merge these two dataframes based on its common value, 
which is portal code_

In [13]:
!pip install geocoder

Collecting geocoder
[?25l  Downloading https://files.pythonhosted.org/packages/4f/6b/13166c909ad2f2d76b929a4227c952630ebaf0d729f6317eb09cbceccbab/geocoder-1.38.1-py2.py3-none-any.whl (98kB)
[K     |████████████████████████████████| 102kB 10.9MB/s ta 0:00:01
Collecting ratelim (from geocoder)
  Downloading https://files.pythonhosted.org/packages/f2/98/7e6d147fd16a10a5f821db6e25f192265d6ecca3d82957a4fdd592cad49c/ratelim-0.1.6-py2.py3-none-any.whl
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6


In [31]:
# Reading the .csv file to dataframe
file_name='http://cocl.us/Geospatial_data'
coordinates=pd.read_csv(file_name)
coordinates.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [33]:
# Rename a column in coordinates
coordinates.rename({'Postal Code': 'Postalcode'}, axis=1, inplace=True)
coordinates.head()

Unnamed: 0,Postalcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [28]:
coordinates.shape

(103, 3)

In [73]:
# Merge two dataframes
data = pd.merge(df, coordinates,how='left', on='Postalcode')
data.head()

Unnamed: 0,Postalcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


**Part 3**  
_In this part, we decide to work with only boroughs that contain the word Toronto, so a new dataframe which contains only needed information
is created. We use kmean clustering algorithm to cluster Torontos neighborhoods based on given longtitude and latitute and visualize the 
result_

In [71]:
# Keep only informaton for boroughs that contain the word Toronto
toronto_df = data[data['Borough'].str.contains("Toronto")]
toronto_df.head()

Unnamed: 0,Postalcode,Borough,Neighbourhood,Latitude,Longitude
37,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
42,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
43,M4M,East Toronto,Studio District,43.659526,-79.340923
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


In [58]:
# Cluster neighborhood based on longtitute and latitute
from sklearn.cluster import KMeans
kclusters = 4
kmeans = KMeans(n_clusters=4, random_state=0).fit(toronto_df[toronto_df.columns[3:4]])
kmeans.labels_

array([3, 3, 0, 0, 1, 1, 1, 1, 3, 3, 3, 0, 0, 2, 0, 2, 2, 0, 2, 2, 2, 2,
       1, 3, 0, 0, 2, 2, 2, 2, 0, 0, 2, 2, 0, 2, 2, 0], dtype=int32)

In [74]:
# Add clustering labels
toronto_df.insert(0, 'Cluster Labels', kmeans.labels_)
toronto_df.head()

Unnamed: 0,Cluster Labels,Postalcode,Borough,Neighbourhood,Latitude,Longitude
37,3,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,3,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
42,0,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
43,0,M4M,East Toronto,Studio District,43.659526,-79.340923
44,1,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


In [69]:
# Requesting coordinate of Toronto
from geopy.geocoders import Nominatim
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Toronto are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Toronto are 43.653963, -79.387207.


In [61]:
!pip install folium

Collecting folium
[?25l  Downloading https://files.pythonhosted.org/packages/fd/a0/ccb3094026649cda4acd55bf2c3822bb8c277eb11446d13d384e5be35257/folium-0.10.1-py2.py3-none-any.whl (91kB)
[K     |████████████████████████████████| 92kB 10.6MB/s eta 0:00:01
Collecting branca>=0.3.0 (from folium)
  Downloading https://files.pythonhosted.org/packages/81/6d/31c83485189a2521a75b4130f1fee5364f772a0375f81afff619004e5237/branca-0.4.0-py3-none-any.whl
Installing collected packages: branca, folium
Successfully installed branca-0.4.0 folium-0.10.1


In [70]:
# Visualize Toronto and its clustered neighborhoods
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_df['Latitude'], toronto_df['Longitude'], toronto_df['Neighbourhood'], toronto_df['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters