## Segmenting and Clustering Neighborhoods in Toronto

This notebook is part of the Coursera IBM Data Science capstone project. It analyzes and clusters neighborhoods in Toronto.

### Import libraries

In [16]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup # used to parse HTML from the website
import geocoder # geocoding library

### Parse table from website

We can define a function that downloads a URL and parses the first table on the page into a pandas DataFrame.

In [6]:
def parse_url_table(url):
    
    # download the page and parse its HTML
    response = requests.get(url)
    response.raise_for_status()  # stop early on HTTP errors
    soup = BeautifulSoup(response.text, 'lxml')
    
    # take the first table on the page
    table = soup.find_all("table")[0]
    
    # column names come from the header cells
    col_names = [th.get_text().strip() for th in table.find_all('th')]
    
    # collect one list of cell texts per data row (rows without <td> are header rows)
    rows = []
    for row in table.find_all('tr'):
        cols = row.find_all('td')
        if len(cols)>0:
            rows.append([col.get_text().strip() for col in cols])
    
    # build the DataFrame in one step; DataFrame.append was removed in pandas 2.0
    df = pd.DataFrame(rows, columns=col_names)
    
    return df
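
The row-collecting logic can be tried without a network request. The sketch below is not the BeautifulSoup-based function above: it swaps in the standard library's `html.parser` so it runs with no extra packages. The class name `TableParser` and the sample HTML are hypothetical, chosen only for illustration.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect header cells (<th>) and data rows (<td>) from an HTML table."""

    def __init__(self):
        super().__init__()
        self.header = []   # text of every <th> cell
        self.rows = []     # one list of cell texts per data row
        self._row = None   # cells of the <tr> currently being read
        self._cell = None  # (tag, text pieces) while inside a <th>/<td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td"):
            self._cell = (tag, [])

    def handle_data(self, data):
        # keep text only while inside a cell
        if self._cell is not None:
            self._cell[1].append(data)

    def handle_endtag(self, tag):
        if self._cell is not None and tag == self._cell[0]:
            text = "".join(self._cell[1]).strip()
            if tag == "th":
                self.header.append(text)
            elif self._row is not None:
                self._row.append(text)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip rows that held only header cells
                self.rows.append(self._row)
            self._row = None


# made-up sample in the shape of the postal code table
html = """
<table>
<tr><th>Postal code</th><th>Borough</th><th>Neighborhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td></td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

parser = TableParser()
parser.feed(html)
print(parser.header)
print(parser.rows)
```

Note that an empty cell comes back as an empty string rather than a missing value, which matters for the cleaning step.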

Get table from the Wikipedia page

In [7]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
df = parse_url_table(url)
df.head()

| | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | |
| 1 | M2A | Not assigned | |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront |


### Clean the table

Define a function to clean the table

In [8]:
def clean_table(df):
    
    # drop rows with 'Not assigned' Borough; copy so later assignments do not target a view
    df = df[df.Borough!='Not assigned'].copy()
    
    # give a missing Neighborhood (NaN, empty string or 'Not assigned') the same name as its Borough
    # a single .loc call is needed here: chained indexing would assign to a temporary copy
    missing = df.Neighborhood.isna() | df.Neighborhood.isin(['', 'Not assigned'])
    df.loc[missing,'Neighborhood'] = df.loc[missing,'Borough']
    
    # clean Neighborhood, change ' / ' to ', '
    df['Neighborhood'] = df['Neighborhood'].str.replace(' / ', ', ', regex=False)
    
    df = df.reset_index(drop=True)
    
    
    return df
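
The three cleaning rules can be checked on a small hand-made frame. This is a minimal sketch, assuming a missing neighborhood may show up either as NaN or as an empty string; the sample rows and the names `sample` and `cleaned` are made up for illustration.

```python
import numpy as np
import pandas as pd

# made-up rows covering each case the cleaning step has to handle
sample = pd.DataFrame({
    "Postal code": ["M1A", "M5A", "M7A", "M9A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Queen's Park", "Etobicoke"],
    "Neighborhood": ["", "Regent Park / Harbourfront", np.nan, ""],
})

# rule 1: drop rows whose Borough is 'Not assigned'
cleaned = sample[sample["Borough"] != "Not assigned"].copy()

# rule 2: a missing Neighborhood takes the Borough name
# (one .loc call with a boolean mask, so the assignment reaches the frame itself)
missing = cleaned["Neighborhood"].isna() | (cleaned["Neighborhood"] == "")
cleaned.loc[missing, "Neighborhood"] = cleaned.loc[missing, "Borough"]

# rule 3: replace ' / ' separators with ', '
cleaned["Neighborhood"] = cleaned["Neighborhood"].str.replace(" / ", ", ", regex=False)

cleaned = cleaned.reset_index(drop=True)
print(cleaned)
```

Writing rule 2 as `df[mask].loc[:, col] = ...` would assign to a temporary copy and leave the frame unchanged, which is why the mask is passed to `.loc` directly.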

Clean the pandas DataFrame

In [9]:
df = clean_table(df)
df.head()

| | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


Let's check the shape of the table.

In [10]:
df.shape

(103, 3)

### Get geographic coordinates for postal codes

Let's define a function to get the geographic coordinates for any given postal code.

In [37]:
# download geographic coordinates from the link
geo_code = pd.read_csv('http://cocl.us/Geospatial_data')

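def get_geo_post(postal_code):
    # select the coordinate columns of the row(s) matching this postal code
    match = geo_code.loc[geo_code['Postal Code']==postal_code, ['Latitude', 'Longitude']]
    
    # unknown postal code: return NaN instead of empty arrays
    if match.empty:
        return np.nan, np.nan
    
    # return plain scalars rather than one-element arrays
    latitude, longitude = match.iloc[0]
    
    return latitude, longitude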
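
A lookup keyed on the postal code can also be sketched with an indexed frame, which avoids scanning the column on every call. This is only an illustration, assuming each postal code appears once; the frame `coords`, the helper `lookup`, and the code `M9Z` are hypothetical, and the two coordinate pairs are taken from the output shown in this notebook.

```python
import pandas as pd

# small stand-in for the downloaded coordinate file
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# index by postal code so each lookup is a label access
coords_by_code = coords.set_index("Postal Code")


def lookup(postal_code):
    """Return (latitude, longitude) as floats, or None for an unknown code."""
    if postal_code not in coords_by_code.index:
        return None
    row = coords_by_code.loc[postal_code]
    return float(row["Latitude"]), float(row["Longitude"])


print(lookup("M3A"))
print(lookup("M9Z"))  # made-up code that is not in the table
```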

Use the function above to get the latitude and longitude for each postal code.

In [42]:
# add two new columns to the table
df['Latitude'] = np.nan
df['Longitude'] = np.nan

# get geographic coordinates for each postal code (row)
for idx in range(len(df.index)):
    postal_code = df.iloc[idx,0]        # the postal code is the first column
    df.iloc[idx,3], df.iloc[idx,4] = get_geo_post(postal_code)  # fill in Latitude and Longitude

df.head()

| | Postal code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |


In [43]:
df.shape

(103, 5)
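
The row-by-row lookup above could also be expressed as a single left join, which keeps every row of the neighborhood table and leaves NaN where no coordinates are found. A minimal sketch on made-up frames follows; `neighborhoods`, `coords`, the postal code `M9Z`, and "Example Borough" are hypothetical, and the differing key names ("Postal code" vs. "Postal Code") mirror the two tables used in this notebook.

```python
import pandas as pd

# stand-in for the cleaned neighborhood table
neighborhoods = pd.DataFrame({
    "Postal code": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Example Borough"],
})

# stand-in for the coordinate file, in a different row order
coords = pd.DataFrame({
    "Postal Code": ["M4A", "M3A"],
    "Latitude": [43.725882, 43.753259],
    "Longitude": [-79.315572, -79.329656],
})

# left join on the postal code, then drop the duplicated key column
merged = neighborhoods.merge(
    coords, how="left", left_on="Postal code", right_on="Postal Code"
).drop(columns="Postal Code")

print(merged)
```

A left join preserves the row order of the left table, so the result lines up with the original neighborhood rows.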