# Segmenting and Clustering Neighborhoods in Toronto
In this notebook, we are going to explore and cluster the neighborhoods in Toronto.

## Get the latitude and longitude coordinates of each neighborhood
Unfortunately, I couldn't get geocode to work at all, so I am going to use the CSV file available at http://cocl.us/Geospatial_data instead.

First, let's replicate the code in the first notebook.

In [19]:
from itertools import zip_longest

from bs4 import BeautifulSoup
import requests
import pandas as pd


def grouper(iterable, n, fillvalue=None):
    """Collect items from an iterable into fixed-length chunks of size n."""
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
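# Download the page and parse the HTML.
html = requests.get(url).text
soup = BeautifulSoup(html, 'lxml')

# The postal codes live in the sortable wikitable; collect all of its cells.
mytable = soup.find('table', {'class': 'wikitable sortable'})
contents = mytable.find_all('td')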

postcodes = []
boroughs = []
neighborhoods = []
for row in grouper(contents, 3):
    postcode = row[0].text
    borough = row[1].text
    neighborhood = row[2].text.rstrip()
    if borough == "Not assigned":
        continue
    postcodes.append(postcode)
    boroughs.append(borough)
    if neighborhood == 'Not assigned':
        neighborhoods.append(borough)
    else:
        neighborhoods.append(neighborhood)
        
postcode_list = []
borough_list = []
neighborhood_list = []
for p, b, n in zip(postcodes, boroughs, neighborhoods):
    if p in postcode_list:
        index = postcode_list.index(p)
        # Ensure that Borough is the same if Postcode is the same.  
        if b != borough_list[index]:
            raise ValueError("This table might be broken!")
        neighborhood_list[index] = neighborhood_list[index] + ", " + n
    else:
        postcode_list.append(p)
        borough_list.append(b)
        neighborhood_list.append(n)
        
df = pd.DataFrame(data={
    'PostalCode': postcode_list,
    'Borough': borough_list,
    'Neighborhood': neighborhood_list})
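
To see what `grouper` does in the cell above, here is a minimal standalone sketch on a toy list (the sample values are made up for illustration and are not the scraped table). Because the same iterator object is repeated `n` times, `zip_longest` pulls consecutive items into each tuple, which is how the flat list of `td` cells becomes rows of three.

```python
from itertools import zip_longest


def grouper(iterable, n, fillvalue=None):
    """Collect items from an iterable into fixed-length chunks of size n."""
    # The same iterator object appears n times, so each output tuple
    # consumes n consecutive items from the input.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


# Toy stand-in for the flat list of table cells (illustrative values only).
cells = ["M1A", "Not assigned", "Not assigned",
         "M3A", "North York", "Parkwoods",
         "M4A", "North York"]

rows = list(grouper(cells, 3))
for row in rows:
    print(row)
```

The last chunk is padded with `fillvalue` (here `None`) when the input length is not a multiple of `n`.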


Now, let's get the coordinates from the link.

In [20]:
url = "http://cocl.us/Geospatial_data"
df_geo = pd.read_csv(url)

In [21]:
df_geo.head()

,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [22]:
df_geo.shape

(103, 3)

In [23]:
# The CSV already appears to be sorted by postal code, but sort it anyway so the row order is guaranteed.
df_geo.sort_values(by='Postal Code', inplace=True)
df_geo.reset_index(drop=True, inplace=True)
df_geo.head()

,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


The postal code dataframe created earlier is not sorted in the same order as the coordinates dataframe, so let's sort it by the PostalCode column as well.

In [24]:
df.sort_values(by='PostalCode', inplace=True)
df.reset_index(drop=True, inplace=True)
df.head()

,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
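
Sorting both dataframes only lines them up row by row if they contain exactly the same postal codes. A quick way to confirm that is sketched below on toy frames; `df_toy` and `df_geo_toy` are hypothetical stand-ins for `df` and `df_geo`, filled with a few rows from the outputs above.

```python
import pandas as pd

# Toy frames standing in for df and df_geo (hypothetical names, a few sample rows).
df_toy = pd.DataFrame({"PostalCode": ["M1B", "M1C", "M1E"]})
df_geo_toy = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M1E"],
    "Latitude": [43.806686, 43.784535, 43.763573],
    "Longitude": [-79.194353, -79.160497, -79.188711],
})

# True only if both columns list the same codes in the same order.
same_order = df_toy["PostalCode"].tolist() == df_geo_toy["Postal Code"].tolist()
print(same_order)
```

If the check came back `False`, copying columns by row position would silently attach coordinates to the wrong postal codes.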


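Both dataframes now share the same row order and a fresh 0-based index, so we can copy the coordinate columns across. The assignment aligns rows by index.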

In [25]:
df[['Latitude', 'Longitude']] = df_geo[['Latitude', 'Longitude']]
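
The assignment above depends on both dataframes having identical row order. As an alternative sketch (not what this notebook runs), joining on the postal code itself removes that dependency. The toy frames `df_toy` and `df_geo_toy` are hypothetical stand-ins, with the coordinate rows deliberately shuffled to show that order no longer matters.

```python
import pandas as pd

# Toy frames standing in for df and df_geo (hypothetical names, sample rows only).
df_toy = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M1E"],
    "Borough": ["Scarborough"] * 3,
})
df_geo_toy = pd.DataFrame({
    "Postal Code": ["M1E", "M1B", "M1C"],  # shuffled on purpose
    "Latitude": [43.763573, 43.806686, 43.784535],
    "Longitude": [-79.188711, -79.194353, -79.160497],
})

# Rename the key so both frames agree, then join on it.
# validate="one_to_one" raises if a postal code is duplicated on either side.
merged = df_toy.merge(
    df_geo_toy.rename(columns={"Postal Code": "PostalCode"}),
    on="PostalCode",
    how="left",
    validate="one_to_one",
)
print(merged)
```

With `how="left"`, any postal code missing from the coordinates file would show up as `NaN` in `Latitude` and `Longitude` rather than shifting the remaining rows.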

In [26]:
df.head()

,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


In [27]:
df.shape

(103, 5)

That's it!