<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto City</font></h1>

## Introduction

In this lab, you will learn how to scrape Toronto neighborhood data from the web and convert addresses into their equivalent latitude and longitude values. You will also use the Foursquare API to explore neighborhoods in Toronto, and the Folium library to visualize those neighborhoods and their emerging clusters.

Before we get the data and start exploring it, let's download all the dependencies that we will need.

In [1]:
# Install Beautiful Soup for HTML parsing

!pip install beautifulsoup4



In [2]:
import requests
from bs4 import BeautifulSoup
from urllib.request import urlopen, Request
import csv

print("Libraries imported.")

Libraries imported.


The Toronto neighborhood data covers a total of 11 boroughs. To segment the neighborhoods and explore them, we need a dataset that contains the 11 boroughs and the neighborhoods in each borough, as well as the latitude and longitude coordinates of each neighborhood.

Before we start working with the dataset, we need to collect it ourselves, since the Toronto data is not readily available. Luckily, the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M lists the postal codes, and we can scrape it with the Beautiful Soup Python package to extract the dataset.

#### The function below extracts the data with Beautiful Soup and saves it locally as a CSV file.

In [3]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

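def getData(url):
    # Download the page and parse it with Python's built-in HTML parser
    html = urlopen(url)
    bsObj = BeautifulSoup(html, 'html.parser')
    # The first table with class "wikitable" holds the postal code data
    wikitable = bsObj.find('table', {'class': 'wikitable'})
    table = wikitable.find('tbody')

    # Collect the text of every <td> cell, row by row.
    # A header row made only of <th> cells yields an empty row here.
    output_rows = []
    for table_row in table.find_all('tr'):
        columns = table_row.find_all('td')
        output_row = [column.text.rstrip() for column in columns]
        output_rows.append(output_row)

    # newline='' lets the csv module control line endings itself
    with open('output.csv', 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerows(output_rows)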

getData(url)
print('Data downloaded!')

Data downloaded!
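
Because only `<td>` cells are collected, a header row built from `<th>` cells contributes no column names to the CSV. The sketch below shows one way to keep the header by collecting both cell types. It runs on a small made-up HTML snippet rather than the live page, and the names `SAMPLE_HTML` and `extract_rows` are illustrative assumptions, not part of the lab.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the Wikipedia table (assumption: same general layout)
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

def extract_rows(html):
    """Return the table as a list of rows, header cells included."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "wikitable"})
    rows = []
    for tr in table.find_all("tr"):
        # Passing a list of tag names matches both <th> and <td> cells
        cells = tr.find_all(["th", "td"])
        rows.append([cell.get_text().strip() for cell in cells])
    return rows

rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

With the header kept, the first CSV line would hold the column names, so no data row would be consumed as the header when the file is loaded.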


#### Load and explore the data

In [4]:
import pandas as pd

In [5]:
my_data = pd.read_csv("output.csv", delimiter=",")
my_data.head()

Unnamed: 0,M1A,Not assigned,Not assigned.1
0,M2A,Not assigned,Not assigned
1,M3A,North York,Parkwoods
2,M4A,North York,Victoria Village
3,M5A,Downtown Toronto,Harbourfront
4,M6A,North York,Lawrence Heights


#### Now let us rename the columns to __Postcode__, __Borough__, and __Neighborhood__

In [6]:
df = my_data.rename(columns={"M1A": "Postcode", "Not assigned": "Borough", "Not assigned.1": "Neighborhood"})
df.head()

Unnamed: 0,Postcode,Borough,Neighborhood
0,M2A,Not assigned,Not assigned
1,M3A,North York,Parkwoods
2,M4A,North York,Victoria Village
3,M5A,Downtown Toronto,Harbourfront
4,M6A,North York,Lawrence Heights
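
Renaming works here because pandas took the first data row (`M1A`) as the header, which also removes that row from the data. As a hedged alternative, the column names can be supplied at load time with `header=None` and `names=`, so every row stays in the frame. The sketch below reads from an in-memory string instead of `output.csv`; `csv_text` and `named` are illustrative names.

```python
import io
import pandas as pd

# Made-up CSV content with a leading blank line, like a file whose
# first written row was empty
csv_text = "\nM1A,Not assigned,Not assigned\nM3A,North York,Parkwoods\n"

# header=None tells pandas that no line holds column names;
# names= provides them instead. Blank lines are skipped by default.
named = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=["Postcode", "Borough", "Neighborhood"],
)
print(named)
```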


Let us check the total number of records we have with the .shape attribute.

In [7]:
df.shape

(286, 3)

#### Now keep only the cells that have an assigned borough, ignoring cells whose borough is __Not assigned__

In [8]:
df.drop(df[df['Borough'] == 'Not assigned'].index, inplace=True)
df.head()

Unnamed: 0,Postcode,Borough,Neighborhood
1,M3A,North York,Parkwoods
2,M4A,North York,Victoria Village
3,M5A,Downtown Toronto,Harbourfront
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
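
Filtering with a boolean mask is an equivalent way to discard unassigned boroughs without mutating the frame in place. A minimal sketch on a made-up three-row frame (the names `sample` and `assigned` are illustrative, not from the lab):

```python
import pandas as pd

sample = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep rows whose borough is assigned; reset_index renumbers them from 0
assigned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```

The original `sample` frame is left unchanged, which can make it easier to re-run a notebook cell safely.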


The next step is to find postal codes that contain more than one neighborhood and combine their rows into a single row, with the neighborhoods separated by commas, as shown in the output below.

In [9]:
df = df.groupby(['Postcode','Borough'])['Neighborhood'].apply(', '.join).reset_index()
df.head()

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
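
To see what the grouping does, here is a small sketch on made-up rows where one postal code appears twice. The frame `sample` and the result `merged` are illustrative names.

```python
import pandas as pd

sample = pd.DataFrame({
    "Postcode": ["M6A", "M6A", "M3A"],
    "Borough": ["North York", "North York", "North York"],
    "Neighborhood": ["Lawrence Heights", "Lawrence Manor", "Parkwoods"],
})

# Rows sharing the same (Postcode, Borough) pair collapse into one row;
# their neighborhood names are joined with ", "
merged = (
    sample.groupby(["Postcode", "Borough"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(merged)
```

Groups are sorted by key by default, so `M3A` comes before `M6A` in the result.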


Let us check whether any borough has a neighborhood of "Not assigned".

In [10]:
# Count rows whose neighborhood is exactly "Not assigned".
# Use == here: `row in 'Not assigned'` would be a substring test.

count = 0
for row in df['Neighborhood']:
    if row == 'Not assigned':
        count += 1
print(count)

1
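
When counting placeholder values like this, keep in mind that `in` between two strings is a substring test, not an equality test. A short sketch with made-up values shows how the two can disagree:

```python
# Made-up values; "Not" is a substring of "Not assigned" but not equal to it
names = ["Not assigned", "Not", "Woburn"]

# Substring test: true for any value contained in "Not assigned"
substring_count = sum(1 for name in names if name in "Not assigned")

# Equality test: true only for the exact placeholder
equality_count = sum(1 for name in names if name == "Not assigned")

print(substring_count, equality_count)
```

Both counts agree only when no value happens to be a fragment of the placeholder, so the equality test is the safer choice.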


Next, where the neighborhood is "Not assigned", replace it with the borough name.

In [11]:
unassigned = df['Neighborhood'] == 'Not assigned'
df.loc[unassigned, 'Neighborhood'] = df.loc[unassigned, 'Borough']
df.head()

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
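
The first five rows do not include a replaced value, so the effect is not visible above. The sketch below repeats the idea on a made-up two-row frame using `Series.mask`, which swaps in values from another Series wherever a condition holds. This is an alternative formulation, and the names `sample` and `unassigned` are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    "Postcode": ["M7A", "M1G"],
    "Borough": ["Queen's Park", "Scarborough"],
    "Neighborhood": ["Not assigned", "Woburn"],
})

# True where the neighborhood still holds the placeholder
unassigned = sample["Neighborhood"] == "Not assigned"

# Where the condition is True, take the borough name instead
sample["Neighborhood"] = sample["Neighborhood"].mask(unassigned, sample["Borough"])
print(sample)
```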


#### Let us check the total number of records we have with the __.shape__ attribute

In [12]:
df.shape

(103, 3)