# Segmenting and Clustering Neighborhoods in Toronto

## Introduction

This notebook implements the Coursera assignment on segmenting and clustering neighborhoods in Toronto.

We start by scraping the Wikipedia table of postal codes and loading it into a pandas dataframe. We then use geocoder to fetch the coordinates
and merge them with the neighborhood data. Finally, we use the Foursquare API to explore venues in every Toronto neighborhood, then analyze and visualize the resulting neighborhood clusters.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Download and Explore Dataset</a>
 
</font>
</div>

In [1]:
# Import libraries
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

import requests # library to handle requests

print('Libraries imported.')

Libraries imported.


## 1. Download and Explore Dataset

We use the 'requests' and 'BeautifulSoup' libraries to download the page and parse it with the 'lxml' parser, then collect every 'td' cell of the postal code table for the subsequent dataframe generation.

In [2]:
website_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
results = requests.get(website_url).text

soup = BeautifulSoup(results, 'lxml')
my_table = soup.find('table', {'class': "wikitable sortable"})
# find_all is the current name of the method; findAll is a legacy alias
table = my_table.find_all('td')
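
As an offline illustration of the table lookup above, the following sketch parses a small hand-written HTML fragment instead of the live page. The fragment, the names `html`, `sample_table`, and `cells`, and the use of Python's built-in `html.parser` (swapped in for `lxml` so that no extra parser is needed) are assumptions for illustration; the live Wikipedia markup may differ.

```python
from bs4 import BeautifulSoup

# Hand-written fragment that mimics the three-column layout of the table.
# The trailing newline in the last cell of each row is deliberate.
html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1S</td><td>Scarborough</td><td>Agincourt
</td></tr>
  <tr><td>M2K</td><td>North York</td><td>Bayview Village
</td></tr>
</table>
"""

# 'html.parser' ships with Python; the notebook itself uses 'lxml'.
soup = BeautifulSoup(html, "html.parser")

# Locate the table by its class attribute, then collect every data cell.
sample_table = soup.find("table", {"class": "wikitable sortable"})
cells = [td.get_text(strip=True) for td in sample_table.find_all("td")]

print(cells)
```

Header cells use `th` rather than `td`, so they are skipped, and the result is a flat list with three entries per table row.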

Loop over the cells to extract the data and store it in a dictionary.

In [3]:
col_list = ['PostCode', 'Borough', 'Neighborhood']
data_dict = {}
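# Cells are stored row by row, three per row, so the column at offset
# `ind` sits at positions ind, ind + 3, ind + 6, ...
for ind, key in enumerate(col_list):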
    i = ind
    value = []
    while i <= len(table)-1:
        text = table[i].text
        value.append(text)
        i += 3
    data_dict[key] = value

print('Store data in dictionary!')

Store data in dictionary!
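
The loop above reads every third cell, starting at each column's offset. Assuming the flat list really holds three cells per row, the same split can be sketched with extended slicing. The `cells` list here is a short stand-in built from values that appear in this notebook's output, and `sliced` is a hypothetical name.

```python
# Flat list of cell texts, three per table row (stand-in for the scraped cells).
cells = [
    "M1S", "Scarborough", "Agincourt",
    "M2K", "North York", "Bayview Village",
    "M5E", "Downtown Toronto", "Berczy Park",
]

col_list = ["PostCode", "Borough", "Neighborhood"]

# cells[offset::3] starts at `offset` and takes every third element,
# which mirrors the while loop that advances with `i += 3`.
sliced = {
    key: cells[offset::len(col_list)]
    for offset, key in enumerate(col_list)
}

print(sliced)
```

Both versions depend on every row having exactly three cells; a row with a missing or merged cell would shift all later values into the wrong columns.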


#### Build dataframe and clean the dataset.

In [4]:
# Build and clean dataframe
data_df = pd.DataFrame(data_dict)
data_df = data_df[['PostCode', 'Borough', 'Neighborhood']]
data_df['Neighborhood'] = data_df['Neighborhood'].apply(lambda x: x.split("\n")[0])

# Drop rows whose Borough is 'Not assigned'; copy so later assignments modify this frame
data_df = data_df[data_df.Borough != 'Not assigned'].copy()

# If a row has a borough but a 'Not assigned' neighborhood, use the borough as the neighborhood.
# .loc avoids chained assignment, which may silently fail to update the dataframe.
not_assigned = data_df.Neighborhood == 'Not assigned'
data_df.loc[not_assigned, 'Neighborhood'] = data_df.loc[not_assigned, 'Borough']

# Combine rows that share the same PostCode and Borough into one row.
data_df = data_df.groupby(['PostCode', 'Borough']).agg({'Neighborhood': lambda x: ' , '.join(x)}).reset_index()
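
To make the three cleaning rules easier to check, here is a minimal sketch that applies them to a handful of toy rows. The rows are illustrative only: the `M5H` and `M1S` entries echo this notebook's output, while the `M7A` and `M1A` rows and the name `toy_df` are assumptions made up for the example.

```python
import pandas as pd

# Toy input: one row per (postal code, neighborhood) pair.
toy_df = pd.DataFrame({
    "PostCode": ["M5H", "M5H", "M5H", "M1S", "M7A", "M1A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Downtown Toronto",
                "Scarborough", "Queen's Park", "Not assigned"],
    "Neighborhood": ["Adelaide", "King", "Richmond",
                     "Agincourt", "Not assigned", "Not assigned"],
})

# Rule 1: drop rows that have no borough.
toy_df = toy_df[toy_df["Borough"] != "Not assigned"].copy()

# Rule 2: where the neighborhood is missing, fall back to the borough name.
mask = toy_df["Neighborhood"] == "Not assigned"
toy_df.loc[mask, "Neighborhood"] = toy_df.loc[mask, "Borough"]

# Rule 3: merge rows that share PostCode and Borough, joining the names.
toy_df = (
    toy_df.groupby(["PostCode", "Borough"])["Neighborhood"]
    .agg(" , ".join)
    .reset_index()
)

print(toy_df)
```

The six input rows collapse to three: the unassigned borough disappears, the borough name fills the missing neighborhood, and the three `M5H` rows become a single joined string.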

In [5]:
# Have a look at the data
data_df.sort_values('Neighborhood', inplace=True)
data_df = data_df.reset_index(drop=True)
data_df.head(10)

| | PostCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M5H | Downtown Toronto | Adelaide , King , Richmond |
| 1 | M1S | Scarborough | Agincourt |
| 2 | M1V | Scarborough | Agincourt North , L'Amoreaux East , Milliken ,... |
| 3 | M9V | Etobicoke | Albion Gardens , Beaumond Heights , Humbergate... |
| 4 | M8W | Etobicoke | Alderwood , Long Branch |
| 5 | M3H | North York | Bathurst Manor , Downsview North , Wilson Heights |
| 6 | M2K | North York | Bayview Village |
| 7 | M5M | North York | Bedford Park , Lawrence Manor East |
| 8 | M5E | Downtown Toronto | Berczy Park |
| 9 | M1N | Scarborough | Birch Cliff , Cliffside West |


In [6]:
# Print the number of rows in the cleaned data
print('Row number of cleaned data:', data_df.shape[0])

Row number of cleaned data: 103
