# Segmenting and Clustering Neighborhoods in Toronto
In this notebook, we are going to explore and cluster the neighborhoods in Toronto.

## Web scraping
First, we scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
to obtain the data in its table of postal codes. We want to create a dataframe that contains postal codes, boroughs, and neighborhoods. To do this, we use the BeautifulSoup library.

In [64]:
from bs4 import BeautifulSoup
import requests

In [65]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
response = requests.get(url)
response.raise_for_status()  # stop early if the page could not be fetched
soup = BeautifulSoup(response.text, 'lxml')
#print(soup.prettify())

The data we want is stored in a table tag whose class attribute is wikitable sortable.

In [66]:
mytable = soup.find('table', {'class': 'wikitable sortable'})

The actual contents are inside td tags.

In [None]:
contents = mytable.find_all('td')
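
To see what these two lookups return without fetching the live page, here is a minimal sketch on a made-up miniature of the table. The HTML string is an assumption for illustration only, and it uses Python's built-in `html.parser` so that lxml is not required. Note that passing the full string `wikitable sortable` matches the class attribute exactly as written, so the order of the two class names matters.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table, for illustration only.
html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village
</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate the table by its class attribute, then collect every data cell.
table = soup.find("table", {"class": "wikitable sortable"})
cells = table.find_all("td")

# Header cells are th tags, so they are not included in the result.
texts = [cell.get_text().strip() for cell in cells]
print(texts)
```

The result is one flat list of cell texts, with no row boundaries, which is why the cells are regrouped in threes in the next step.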

The cells are stored in a flat list. Since each row of the table has three columns, we want to process the list three items at a time. For this purpose, let's create a convenience function.

In [4]:
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    """Iterate over an iterable in fixed-size chunks of n items."""
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)
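
A quick way to check how `grouper` behaves is to run it on a short list of plain strings. The sample list below is made up for illustration and deliberately has a length that is not a multiple of three, to show how the last chunk is padded.

```python
from itertools import zip_longest

def grouper(iterable, n, fillvalue=None):
    """Iterate over an iterable in fixed-size chunks of n items."""
    # The same iterator is repeated n times, so zip_longest pulls
    # n consecutive items from it for each output tuple.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

# Illustrative flat list: two complete rows plus one leftover item.
cells = ["M3A", "North York", "Parkwoods",
         "M4A", "North York", "Victoria Village",
         "M5A"]

rows = list(grouper(cells, 3))
for row in rows:
    print(row)
```

The last tuple is padded with `None`. If the scraped list were ever not a multiple of three, calling `.text` on such a `None` would raise an `AttributeError`, which is a useful signal that the table layout has changed.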

Let's process the table content. We create three lists, which will contain items in the three columns. We want to skip rows whose borough is 'Not assigned'.

In [6]:
postcodes = []
boroughs = []
neighborhoods = []
for row in grouper(contents, 3):
    # Strip surrounding whitespace so comparisons with 'Not assigned' are exact.
    postcode = row[0].text.strip()
    borough = row[1].text.strip()
    neighborhood = row[2].text.strip()
    if borough == "Not assigned":
        continue
    postcodes.append(postcode)
    boroughs.append(borough)
    if neighborhood == 'Not assigned':
        neighborhoods.append(borough)
    else:
        neighborhoods.append(neighborhood)
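
The two rules in this loop (skip unassigned boroughs, fall back to the borough name when the neighborhood is unassigned) can also be sketched as a small function on plain strings. The name `clean_row` and the sample rows are hypothetical and only serve to make the rules easy to test in isolation.

```python
NOT_ASSIGNED = "Not assigned"

def clean_row(postcode, borough, neighborhood):
    """Return a cleaned (postcode, borough, neighborhood) tuple, or None to skip the row."""
    postcode = postcode.strip()
    borough = borough.strip()
    neighborhood = neighborhood.strip()
    # Rule 1: rows without a borough are dropped entirely.
    if borough == NOT_ASSIGNED:
        return None
    # Rule 2: a missing neighborhood falls back to the borough name.
    if neighborhood == NOT_ASSIGNED:
        neighborhood = borough
    return postcode, borough, neighborhood

# Illustrative rows; the trailing newline mimics the raw cell text.
samples = [
    ("M1A", "Not assigned", "Not assigned\n"),
    ("M3A", "North York", "Parkwoods\n"),
    ("M7A", "Queen's Park", "Not assigned\n"),
]

cleaned = [row for row in (clean_row(*s) for s in samples) if row is not None]
print(cleaned)
```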

Some postal codes appear in multiple rows. We want to compress these rows into a single row for each postal code, joining the neighborhood names with commas.

In [7]:
postcode_list = []
borough_list = []
neighborhood_list = []
for p, b, n in zip(postcodes, boroughs, neighborhoods):
    if p in postcode_list:
        index = postcode_list.index(p)
        # Ensure that Borough is the same if Postcode is the same.  
        if b != borough_list[index]:
            raise ValueError("This table might be broken!")
        neighborhood_list[index] = neighborhood_list[index] + ", " + n
    else:
        postcode_list.append(p)
        borough_list.append(b)
        neighborhood_list.append(n)
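
As an alternative sketch of the same merge, a dictionary keyed by postal code avoids the repeated `list.index` lookups. This is not the approach used in the cell above, just one possible variant. The sample rows are made up for illustration, and the code relies on dictionaries preserving insertion order (Python 3.7+).

```python
# Illustrative input: M5A appears twice with the same borough.
rows = [
    ("M5A", "Downtown Toronto", "Harbourfront"),
    ("M3A", "North York", "Parkwoods"),
    ("M5A", "Downtown Toronto", "Regent Park"),
]

merged = {}  # postcode -> (borough, list of neighborhood names)
for postcode, borough, neighborhood in rows:
    if postcode in merged:
        known_borough, names = merged[postcode]
        # The borough must be the same for every row of a postal code.
        if borough != known_borough:
            raise ValueError(f"Conflicting boroughs for {postcode}")
        names.append(neighborhood)
    else:
        merged[postcode] = (borough, [neighborhood])

# Flatten back to one tuple per postal code, in first-seen order.
result = [(p, b, ", ".join(names)) for p, (b, names) in merged.items()]
print(result)
```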

Let's create a dataframe.

In [50]:
import pandas as pd

df = pd.DataFrame(data={
    'PostalCode': postcode_list,
    'Borough': borough_list,
    'Neighborhood': neighborhood_list})

In [51]:
df.head(10)

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | M7A | Queen's Park | Queen's Park |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Rouge, Malvern |
| 7 | M3B | North York | Don Mills North |
| 8 | M4B | East York | Woodbine Gardens, Parkview Hill |
| 9 | M5B | Downtown Toronto | Ryerson, Garden District |


In [52]:
df.shape

(103, 3)

That's it!