# Segmenting and Clustering Neighborhoods in Toronto


<blockquote>In this assignment, I will explore, segment, and cluster the neighborhoods in the city of Toronto based on postal code and borough information.</blockquote>


### 1. Import Libraries

In [1]:
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import numpy as np # library to handle data in a vectorized manner
from bs4 import BeautifulSoup # this module helps with web scraping
import requests  # library to handle requests
import json # library to handle JSON files
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
from pandas import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

### 2. Downloading and Prepping Data

The neighborhood data is not available online as a ready-made dataset, so we scrape the __[Wikipedia page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M)__, which has all the information we need to explore and cluster the neighborhoods in Toronto. The goal is to read the table of postal codes on that page and transform it into a pandas DataFrame.

#### 2.1 Download the contents of the web page

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

#### 2.2 Scrape data from HTML tables into a DataFrame
> Use the **read_html** function to read HTML tables from the URL directly into DataFrames.
> Use the **match** parameter to select the specific table we want.
> Only tables containing text that matches the given string are returned.
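
To illustrate how **match** filters tables, here is a minimal sketch that runs offline. The two-table HTML snippet is made up for this example (it is not the Wikipedia page), and the default parser is assumed to be installed (`lxml`, or `bs4` with `html5lib`).

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; only the first one mentions "Neighbourhood".
html = """
<table>
  <thead><tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr></thead>
  <tbody>
    <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  </tbody>
</table>
<table>
  <thead><tr><th>Province</th><th>Prefix</th></tr></thead>
  <tbody><tr><td>Ontario</td><td>M</td></tr></tbody>
</table>
"""

# read_html returns a list with one DataFrame per table whose text matches.
tables = pd.read_html(StringIO(html), match="Neighbourhood")

demo = tables[0]
demo.columns = ["PostalCode", "Borough", "Neighborhood"]  # same renaming as in the notebook
print(len(tables))
print(demo)
```

Because the second table does not contain the word "Neighbourhood", it is skipped, so taking element `[0]` of the result gives the postal code table.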

In [3]:
df_Toronto = pd.read_html(url, match="Neighbourhood", flavor='bs4')[0]
df_Toronto.columns = ['PostalCode', 'Borough', 'Neighborhood']
df_Toronto.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


The dataframe must meet the following conditions:
- The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood.
- Only process the cells that have an assigned borough. Ignore cells with a borough that is *Not assigned*.
- If a cell has a borough but a *Not assigned* neighborhood, then the neighborhood will be the same as the borough.
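
As a rough sketch of how the second and third conditions could translate to pandas, consider the small hand-made frame below. The rows in `toy` are hypothetical and chosen only to exercise both rules; they are not taken from the scraped table.

```python
import pandas as pd

# Hypothetical rows: one unassigned borough, one complete row,
# and one row with a borough but no assigned neighborhood.
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Condition 2: keep only rows that have an assigned borough.
cleaned = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)

# Condition 3: where the neighborhood is unassigned, reuse the borough name.
unassigned = cleaned["Neighborhood"] == "Not assigned"
cleaned.loc[unassigned, "Neighborhood"] = cleaned.loc[unassigned, "Borough"]

print(cleaned)
```

Filtering with a boolean mask is one option among several; replacing *Not assigned* with `NaN` and calling `dropna` reaches the same result for the second condition.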

#### 2.3 Identify and handle missing values
As we can see, the DataFrame has missing values, which are represented by *Not assigned*. We will drop every row whose borough is *Not assigned*.

In [4]:
df_Toronto["Borough"] = df_Toronto["Borough"].replace("Not assigned", np.nan) # replace "Not assigned" with NaN (Not a Number)
df_Toronto.dropna(subset=["Borough"], axis=0, inplace=True) # drop rows with a missing value in the "Borough" column
df_Toronto.reset_index(drop=True, inplace=True) # reset the index after dropping rows
df_Toronto.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


In [5]:
df_Toronto.shape

(103, 3)