# Segmenting and Clustering Neighborhoods in Toronto

## Introduction

This notebook implements the Coursera assignment on segmenting and clustering neighborhoods in Toronto.

We start by scraping the Wikipedia page that lists Toronto postal codes and loading its table into a pandas dataframe. We then attach latitude and longitude to each postal code, either with geocoder or from a pre-built CSV file, and merge the coordinates with the neighborhood data. Finally, we will apply the Foursquare API to explore venues for all neighborhoods in Toronto, then analyze and visualize the clustering of neighborhoods.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Download and Explore Dataset</a>

2. <a href="#item2">Explore Neighborhoods in Toronto</a>
   
</font>
</div>

In [1]:
# Import libraries
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

import requests # library to handle requests

import warnings
warnings.filterwarnings('ignore')

print('Libraries imported.')

Libraries imported.


## 1. Download and Explore Dataset

We use the 'requests' library to download the page and 'BeautifulSoup' with the 'lxml' parser to parse it. We then locate the postal code table and collect all of its `td` cells, which are used to build the dataframe.

In [2]:
website_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
results = requests.get(website_url).text

soup = BeautifulSoup(results, 'lxml')
my_table = soup.find('table', {'class': "wikitable sortable"})
table = my_table.find_all('td')  # flat list of every cell in the table, row by row

Loop over the cells to extract the data for each column and store it in a dictionary.

In [3]:
col_list = ['PostCode', 'Borough', 'Neighborhood']
data_dict = {}
# The cells are stored row by row, so column k is every third cell starting at index k.
for ind, key in enumerate(col_list):
    i = ind
    value = []
    while i < len(table):
        value.append(table[i].text)
        i += len(col_list)
    data_dict[key] = value

print('Store data in dictionary!')

Store data in dictionary!
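
The loop above walks the flat cell list with a stride of three. As a minimal sketch of the same idea, the stride can also be written with list slicing. The `cells` list here is a hypothetical stand-in for the text of the scraped `td` tags, not the real page content.

```python
# Hypothetical flat cell list, standing in for the text of the scraped <td> tags.
cells = ['M1A', 'Not assigned', 'Not assigned\n',
         'M3A', 'North York', 'Parkwoods\n',
         'M4A', 'North York', 'Victoria Village\n']

col_list = ['PostCode', 'Borough', 'Neighborhood']

# Column k holds every third cell, starting at offset k.
data_dict = {key: cells[offset::len(col_list)]
             for offset, key in enumerate(col_list)}

print(data_dict['PostCode'])
print(data_dict['Borough'])
```

Slicing keeps the trailing newline on the neighborhood cells, just as the loop does, so the later cleaning step is still needed.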


#### Build dataframe and clean the dataset.

In [4]:
# Build and clean dataframe
data_df = pd.DataFrame(data_dict)
data_df = data_df[['PostCode', 'Borough', 'Neighborhood']]
data_df['Neighborhood'] = data_df['Neighborhood'].apply(lambda x: x.split("\n")[0])

# Drop the rows whose Borough is 'Not assigned'; copy so later assignments modify a real dataframe
data_df = data_df[data_df.Borough != 'Not assigned'].copy()

# If a row has a borough but a 'Not assigned' neighborhood, use the borough name as the neighborhood.
not_assigned = data_df.Neighborhood == 'Not assigned'
data_df.loc[not_assigned, 'Neighborhood'] = data_df.loc[not_assigned, 'Borough']

# Combine rows that share the same PostCode and Borough into one row, joining the neighborhood names.
data_df = data_df.groupby(['PostCode', 'Borough']).agg({'Neighborhood': lambda x: ' , '.join(x)}).reset_index()
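
The three cleaning steps above can be tried on a small frame to see their effect. This is a sketch under stated assumptions: the rows in `toy` are hypothetical and only mimic the shape of the scraped table.

```python
import pandas as pd

# Hypothetical rows that mimic the scraped table; not the real data.
toy = pd.DataFrame({
    'PostCode': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# 1. Drop rows without a borough; copy so later assignments hit a real frame.
toy = toy[toy.Borough != 'Not assigned'].copy()

# 2. Fall back to the borough name where the neighborhood is missing.
missing = toy.Neighborhood == 'Not assigned'
toy.loc[missing, 'Neighborhood'] = toy.loc[missing, 'Borough']

# 3. Collapse rows that share a postal code and borough into one row.
toy = (toy.groupby(['PostCode', 'Borough'])['Neighborhood']
          .agg(' , '.join)
          .reset_index())

print(toy)
```

Using `.loc` with a boolean mask avoids chained indexing, which pandas may apply to a temporary copy instead of the dataframe itself.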

In [5]:
# Have a look at the data
data_df.sort_values('Neighborhood', inplace=True)
data_df = data_df.reset_index(drop=True)
data_df.head(10)

Unnamed: 0,PostCode,Borough,Neighborhood
0,M5H,Downtown Toronto,"Adelaide , King , Richmond"
1,M1S,Scarborough,Agincourt
2,M1V,Scarborough,"Agincourt North , L'Amoreaux East , Milliken ,..."
3,M9V,Etobicoke,"Albion Gardens , Beaumond Heights , Humbergate..."
4,M8W,Etobicoke,"Alderwood , Long Branch"
5,M3H,North York,"Bathurst Manor , Downsview North , Wilson Heights"
6,M2K,North York,Bayview Village
7,M5M,North York,"Bedford Park , Lawrence Manor East"
8,M5E,Downtown Toronto,Berczy Park
9,M1N,Scarborough,"Birch Cliff , Cliffside West"


In [6]:
# Print the number of rows in the cleaned data
print('Row number of cleaned data:', data_df.shape[0])

Row number of cleaned data: 103


#### Use geocoder to fetch the coordinates data

Geocoder can fetch the coordinates, but it is slow and unreliable, so the lookup code below is left commented out and a pre-built CSV file of coordinates is read directly and used in the analysis.

In [7]:
# import geocoder 

# latitude = []
# longitude = []
# for post_code, borough in zip(data_df['PostCode'], data_df['Borough']):
#     print('Fetching Coordinates For:', post_code, borough)
    
#     # initialize your variable to None
#     lat_lng_coords = None

#     # loop until you get the coordinates
#     while(lat_lng_coords is None):
#         g = geocoder.google('{}, {}'.format(post_code, borough))
#         lat_lng_coords = g.latlng

#     latitude.append(lat_lng_coords[0])
#     longitude.append(lat_lng_coords[1])

# data_df['Latitude'] = latitude
# data_df['Longitude'] = longitude
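
The commented-out loop above retries until a result arrives, so it can spin forever if the service keeps returning nothing. A minimal sketch of a bounded retry follows. The names `fetch_with_retry`, `fake_geocode`, `max_attempts`, and `delay_seconds` are illustrative and not part of geocoder; `fake_geocode` is a hypothetical stand-in for a real lookup.

```python
import time

def fetch_with_retry(fetch, query, max_attempts=5, delay_seconds=0.0):
    """Call fetch(query) until it returns a non-None value or attempts run out."""
    for _ in range(max_attempts):
        result = fetch(query)
        if result is not None:
            return result
        time.sleep(delay_seconds)  # pause before the next attempt
    return None  # give up instead of looping forever

# Hypothetical stand-in for a geocoding call: fails twice, then succeeds.
calls = {'count': 0}

def fake_geocode(query):
    calls['count'] += 1
    return None if calls['count'] < 3 else [43.650571, -79.384568]

coords = fetch_with_retry(fake_geocode, 'M5H, Downtown Toronto')
print(coords, calls['count'])
```

In the notebook's setting, `fetch` could wrap the `geocoder.google(...)` call and return its `latlng` attribute, and a `None` result would mark a postal code that needs a fallback source such as the CSV file.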

In [8]:
# Read coordinates CSV file
coord_df = pd.read_csv('http://cocl.us/Geospatial_data')
coord_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Merge the coordinates into the dataset on the postal code, then quickly examine the resulting dataframe.

In [9]:
# Merge the coordinates into the dataset on PostCode (after renaming the key column to match)
coord_df = coord_df.rename(columns={'Postal Code': 'PostCode'})
data_df = data_df.merge(coord_df, on='PostCode')
data_df.head(10)

Unnamed: 0,PostCode,Borough,Neighborhood,Latitude,Longitude
0,M5H,Downtown Toronto,"Adelaide , King , Richmond",43.650571,-79.384568
1,M1S,Scarborough,Agincourt,43.7942,-79.262029
2,M1V,Scarborough,"Agincourt North , L'Amoreaux East , Milliken ,...",43.815252,-79.284577
3,M9V,Etobicoke,"Albion Gardens , Beaumond Heights , Humbergate...",43.739416,-79.588437
4,M8W,Etobicoke,"Alderwood , Long Branch",43.602414,-79.543484
5,M3H,North York,"Bathurst Manor , Downsview North , Wilson Heights",43.754328,-79.442259
6,M2K,North York,Bayview Village,43.786947,-79.385975
7,M5M,North York,"Bedford Park , Lawrence Manor East",43.733283,-79.41975
8,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
9,M1N,Scarborough,"Birch Cliff , Cliffside West",43.692657,-79.264848
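
`merge` performs an inner join by default, so a postal code with no match in the coordinates file would be dropped without any message. As a sketch of one way to check for this, a left join with `indicator=True` flags unmatched rows. The postal code `M0X` and its borough label are made up for the illustration.

```python
import pandas as pd

# Two real-looking rows plus one made-up postal code (M0X) with no coordinates.
left = pd.DataFrame({'PostCode': ['M5H', 'M1S', 'M0X'],
                     'Borough': ['Downtown Toronto', 'Scarborough', 'Hypothetical']})
coords = pd.DataFrame({'PostCode': ['M5H', 'M1S'],
                       'Latitude': [43.650571, 43.7942],
                       'Longitude': [-79.384568, -79.262029]})

# A left join keeps every row of `left`; the _merge column records the match status.
merged = left.merge(coords, on='PostCode', how='left', indicator=True)
unmatched = merged.loc[merged['_merge'] == 'left_only', 'PostCode'].tolist()

print(unmatched)
```

An empty `unmatched` list would confirm that the inner join used in the notebook loses no rows.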


## 2. Explore Neighborhoods in Toronto

Extract the neighborhoods and their corresponding coordinates for Toronto.
This assignment only clusters the neighborhoods in Toronto, so let's slice the original dataframe and create a new dataframe that holds only the rows whose borough name contains "Toronto".

In [10]:
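# Keep only boroughs whose name contains "Toronto"; sort_values returns a new dataframe,
# so nothing is modified in place on a slice of data_df
toronto_data = data_df[data_df.Borough.str.contains("Toronto")]
toronto_data = toronto_data.sort_values('Neighborhood').reset_index(drop=True)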
toronto_data.head(10)

Unnamed: 0,PostCode,Borough,Neighborhood,Latitude,Longitude
0,M5H,Downtown Toronto,"Adelaide , King , Richmond",43.650571,-79.384568
1,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
2,M6K,West Toronto,"Brockton , Exhibition Place , Parkdale Village",43.636847,-79.428191
3,M7Y,East Toronto,Business Reply Mail Processing Centre 969 Eastern,43.662744,-79.321558
4,M5V,Downtown Toronto,"CN Tower , Bathurst Quay , Island airport , Ha...",43.628947,-79.39442
5,M4X,Downtown Toronto,"Cabbagetown , St. James Town",43.667967,-79.367675
6,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
7,M5T,Downtown Toronto,"Chinatown , Grange Park , Kensington Market",43.653206,-79.400049
8,M6G,Downtown Toronto,Christie,43.669542,-79.422564
9,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316
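
The borough filter above relies on substring matching: any borough whose name contains "Toronto" is kept. A small sketch of that behavior, using borough labels that appear in the tables above as sample values:

```python
import pandas as pd

# Sample borough labels taken from the outputs shown earlier.
boroughs = pd.Series(['Downtown Toronto', 'North York', 'East Toronto',
                      'Scarborough', 'West Toronto', 'Etobicoke'])

# regex=False treats the pattern as a plain substring rather than a regular expression.
mask = boroughs.str.contains('Toronto', regex=False)
kept = boroughs[mask].tolist()

print(kept)
```

`str.contains` interprets its pattern as a regular expression by default. That makes no difference for the plain word "Toronto", but `regex=False` states the intent explicitly.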
