<a href="https://cognitiveclass.ai"><img src = "https://ibm.box.com/shared/static/9gegpsmnsoo25ikkbl4qzlvlyjbgxs5x.png" width = 400> </a>

<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto City</font></h1>
<h2 align=center><font size = 5>Part 1: Extract data</font></h2>


## Introduction

In this lab, you will learn how to convert addresses into their equivalent latitude and longitude values. You will also use the Foursquare API to explore neighborhoods in Toronto. You will use the **explore** function to get the most common venue categories in each neighborhood, and then use those categories as features to group the neighborhoods into clusters with the *k*-means clustering algorithm. Finally, you will use the Folium library to visualize the neighborhoods in Toronto and their emerging clusters.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Part 1: Scrape Data from Wikipedia</a>

2. <a href="#item2">Explore Neighborhoods in Toronto City</a>

</font>
</div>

Before we get the data and start exploring it, let's import all the libraries that we will need.

In [174]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests

#!conda install -c anaconda beautifulsoup4
from bs4 import BeautifulSoup

print('Libraries imported.')


Libraries imported.


## Part 1 - Scrape data from Wikipedia and Explore Dataset

In [175]:
# request and get HTML
import requests
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
response = requests.get(url)
response.text[:100] # Access the HTML with the text property

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title'

In [176]:
# Wrap the HTML in a BeautifulSoup object; naming the parser avoids a warning and keeps results consistent
soup = BeautifulSoup(response.text, 'html.parser')

In [177]:
#Extract table from HTML
htmltable = soup.find('table', { 'class' : 'wikitable sortable' })


In [180]:
# Extract the text of every HTML row <tr> as a list of cell strings
def tableDataText(table):
    rows = []
    trs = table.find_all('tr')
    headerow = [td.get_text(strip=True) for td in trs[0].find_all('th')] # header cells <th> of the first row
    if headerow: # if the first row is a header row, add it first and skip it below
        rows.append(headerow)
        trs = trs[1:]
    for tr in trs: # for every remaining table row
        rows.append([td.get_text(strip=True) for td in tr.find_all('td')]) # data cells <td>
    return rows

list_table = tableDataText(htmltable)
print('Sample from extracted rows')
list_table[:2]

Sample from extracted rows


[['Postcode', 'Borough', 'Neighbourhood'],
 ['M1A', 'Not assigned', 'Not assigned']]
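
The row extraction can be tried offline on a small table. The following is a minimal sketch, not the notebook's own function: `SAMPLE_HTML` and `table_to_rows` are hypothetical names, and the inline HTML is an invented stand-in for the Wikipedia page. It collects `<th>` and `<td>` cells in one pass instead of treating the header row separately.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table, so the example runs without a network call.
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

def table_to_rows(table):
    """Return the table as a list of rows, each row a list of cell strings."""
    rows = []
    for tr in table.find_all('tr'):
        # A row holds either <th> header cells or <td> data cells.
        cells = tr.find_all(['th', 'td'])
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
sample_table = soup.find('table', class_='wikitable')
sample_rows = table_to_rows(sample_table)
print(sample_rows)
```

The first element of `sample_rows` is the header, so it can be passed to `pd.DataFrame` in the same way as `list_table`.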

In [181]:
# Convert the extracted HTML table rows to a pandas DataFrame
df_table = pd.DataFrame(list_table[1:], columns=list_table[0])
df_table.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [182]:
print('Data shape:',  df_table.shape)

Data shape: (287, 3)


### - Cleaning Data

In [184]:
# Remove rows whose Borough is 'Not assigned'; .copy() makes later column assignments safe
df_table = df_table[df_table.Borough != 'Not assigned'].copy()
print("Data shape after removing 'Not assigned' boroughs:", df_table.shape)
df_table.head()

Data shape after removing 'Not assigned' boroughs: (210, 3)


Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor


In [185]:
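# If Neighbourhood is 'Not assigned', fill it with the Borough value.
# A single .loc assignment is needed here: chained indexing would write to a temporary copy and change nothing.
not_assigned = df_table.Neighbourhood == 'Not assigned'
df_table.loc[not_assigned, 'Neighbourhood'] = df_table.loc[not_assigned, 'Borough']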
df_table.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
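
Because the rows shown above contain no 'Not assigned' neighbourhood, the fill step is easier to verify on a small frame. The following is a minimal sketch with invented toy rows (the `toy` frame is hypothetical, not taken from the scraped table) that fills the placeholder from the Borough column through a boolean mask and `.loc`.

```python
import pandas as pd

# Hypothetical toy rows; the second row mimics a postcode with a borough but no neighbourhood.
toy = pd.DataFrame({
    'Postcode': ['M3A', 'M7A'],
    'Borough': ['North York', 'Downtown Toronto'],
    'Neighbourhood': ['Parkwoods', 'Not assigned'],
})

mask = toy['Neighbourhood'] == 'Not assigned'
# One .loc call selects rows and column together, so the write lands in toy itself.
toy.loc[mask, 'Neighbourhood'] = toy.loc[mask, 'Borough']
print(toy)
```

Rows that already have a neighbourhood are left untouched, since the mask selects only the placeholder rows.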


In [188]:
# Combine neighbourhoods that share a Postcode into one row, separated by commas
df_table['Neighbourhood'] = df_table.groupby(['Postcode','Borough'])['Neighbourhood'].transform(lambda x: ','.join(x))
df_table.drop_duplicates(inplace=True)
# Report the shape only after duplicates are dropped, so it reflects the combined rows
print('Data shape after combining Neighbourhoods:',  df_table.shape)

df_table.head(10)

Data shape after combining Neighbourhoods: (103, 3)


Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,"Lawrence Heights,Lawrence Manor"
7,M7A,Downtown Toronto,Queen's Park
9,M9A,Etobicoke,Islington Avenue
10,M1B,Scarborough,"Rouge,Malvern"
13,M3B,North York,Don Mills North
14,M4B,East York,"Woodbine Gardens,Parkview Hill"
16,M5B,Downtown Toronto,"Ryerson,Garden District"


In [189]:
# Copy the data into a separate neighborhoods DataFrame for further processing
# (a plain assignment would only create a second name for df_table)
neighborhoods = df_table.copy()
# Rename the Neighbourhood column to Neighborhood, the spelling used in the NY notebook
neighborhoods.rename(columns={'Neighbourhood': 'Neighborhood'}, inplace=True)
neighborhoods.head()

Unnamed: 0,Postcode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,"Lawrence Heights,Lawrence Manor"
7,M7A,Downtown Toronto,Queen's Park


### - Data Summary 

In [190]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 10 boroughs and 103 neighborhoods.


In [192]:
# Count rows per Postcode; every count is 1 because the neighbourhoods were already combined into one row each
neighborhoods.groupby(['Postcode','Borough']).count().sort_values('Neighborhood', ascending=False).head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Neighborhood
Postcode,Borough,Unnamed: 2_level_1
M1B,Scarborough,1
M5R,Central Toronto,1
M6G,Downtown Toronto,1
M6E,York,1
M6C,York,1
M6B,North York,1
M6A,North York,1
M5X,Downtown Toronto,1
M5W,Downtown Toronto,1
M5V,Downtown Toronto,1
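
Once the neighbourhoods are joined into one row per postcode, a row count per group is always 1, as the output above shows. A possible way to count the individual names instead is to split the comma-separated string. This is a minimal sketch on invented rows: the `toy` frame and the `NeighborhoodCount` column are hypothetical, and it assumes no neighbourhood name itself contains a comma.

```python
import pandas as pd

# Hypothetical toy rows in the combined, one-row-per-postcode layout.
toy = pd.DataFrame({
    'Postcode': ['M6A', 'M3A', 'M1B'],
    'Borough': ['North York', 'North York', 'Scarborough'],
    'Neighborhood': ['Lawrence Heights,Lawrence Manor', 'Parkwoods', 'Rouge,Malvern'],
})

# Split each joined string on commas and count the resulting pieces.
toy['NeighborhoodCount'] = toy['Neighborhood'].str.split(',').str.len()
print(toy.sort_values('NeighborhoodCount', ascending=False))
```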


In [193]:
neighborhoods.shape

(103, 3)

In [194]:
neighborhoods.to_csv('Toronto_Neighborhoods.csv', index=False)

### End of part 1 

<a id='item1'></a>