### Segmenting and Clustering Neighborhoods in Toronto
In this notebook, I scrape data from Wikipedia to get the list of neighborhoods in Toronto, then use the Foursquare API to cluster these neighborhoods based on the venues they contain.

## Part 1: Getting the neighborhoods of Toronto

In [1]:
import pandas as pd

In [2]:
from IPython.display import IFrame
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
IFrame(url, width=800, height=350)

#### Using this Wikipedia page, we will create a dataframe containing three columns: PostalCode, Borough, and Neighborhood

In [3]:
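# read_html returns a list of tables; match="Postal" keeps only tables containing that text,
# and the trailing comma in "data, =" unpacks the single matching table.
data, = pd.read_html(url, match="Postal", skiprows=1)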
data.columns = ["PostalCode", "Borough", "Neighborhood"]
data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M2A,Not assigned,
1,M3A,North York,Parkwoods
2,M4A,North York,Victoria Village
3,M5A,Downtown Toronto,"Regent Park, Harbourfront"
4,M6A,North York,"Lawrence Manor, Lawrence Heights"


Let's drop all the rows that don't have a borough assigned to them:

In [4]:
data = data[data["Borough"] != "Not assigned"]
data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
1,M3A,North York,Parkwoods
2,M4A,North York,Victoria Village
3,M5A,Downtown Toronto,"Regent Park, Harbourfront"
4,M6A,North York,"Lawrence Manor, Lawrence Heights"
5,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
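
Note that the filtered output above starts at index 1, because a boolean-mask filter keeps the original row labels. The following is a minimal sketch on a small hand-built sample, not the notebook's own cell: the names `sample` and `assigned` are hypothetical, and the rows are copied from the first rows shown earlier. It applies the same filter and then renumbers the index from 0, which is optional but can make later positional lookups less surprising.

```python
import pandas as pd

# Hypothetical sample copied from the first rows of the table shown earlier.
sample = pd.DataFrame({
    "PostalCode": ["M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighborhood": [None, "Parkwoods", "Victoria Village"],
})

# Keep only rows whose borough is assigned, then renumber the index from 0.
assigned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```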


In [5]:
data['PostalCode'].describe()

count     103
unique    103
top       M1K
freq        1
Name: PostalCode, dtype: object

As we can see, there are 103 distinct postal codes.
Each postal code is associated with a borough and one or more neighborhoods separated by commas.
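
Because several neighborhoods can share one postal code in a single comma-separated string, it can help to see how that string could be split. The sketch below is only an illustration on a hand-built sample (the names `sample`, `names`, `per_code`, and `long_form` are hypothetical); the notebook itself keeps one row per postal code.

```python
import pandas as pd

# Hypothetical sample using two rows from the table shown earlier.
sample = pd.DataFrame({
    "PostalCode": ["M3A", "M5A"],
    "Borough": ["North York", "Downtown Toronto"],
    "Neighborhood": ["Parkwoods", "Regent Park, Harbourfront"],
})

# Split each string on commas and strip surrounding whitespace from every name.
names = sample["Neighborhood"].str.split(",").apply(
    lambda parts: [p.strip() for p in parts]
)

# Number of neighborhoods attached to each postal code.
per_code = names.apply(len)

# One row per (postal code, neighborhood) pair.
long_form = (
    sample.assign(Neighborhood=names)
    .explode("Neighborhood")
    .reset_index(drop=True)
)
print(long_form)
```

Whether the wide form (one row per postal code) or the long form is preferable depends on the later analysis; geocoding by postal code, for instance, only needs the wide form.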

Fortunately, every row with an assigned borough also has at least one neighborhood, so no further data cleaning is needed. The two checks below confirm this, since neither returns any rows.

In [6]:
data[data["Neighborhood"] == "Not assigned"]

Unnamed: 0,PostalCode,Borough,Neighborhood


In [7]:
data[data["Neighborhood"] == ""]

Unnamed: 0,PostalCode,Borough,Neighborhood
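
The two checks above compare against specific strings. If missing cells were stored as NaN when the page was parsed (an assumption, not something the outputs above confirm), string comparisons alone would not catch them. A minimal sketch of a combined check follows; the sample rows and the names `sample`, `neigh`, and `missing` are made up for illustration.

```python
import pandas as pd

# Hypothetical rows: one valid name and three different kinds of "missing" value.
sample = pd.DataFrame({
    "PostalCode": ["A1A", "B2B", "C3C", "D4D"],  # made-up codes
    "Neighborhood": ["Parkwoods", "Not assigned", None, "  "],
})

neigh = sample["Neighborhood"]

# Flag NaN/None, empty or whitespace-only strings, and the "Not assigned" placeholder.
missing = neigh.isna() | neigh.fillna("").str.strip().isin(["", "Not assigned"])
print(sample[missing])
```

An empty result from a check like this on the real dataframe would support the same conclusion as the two cells above.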


In [8]:
data.shape

(103, 3)