### This notebook is for Segmenting and Clustering Neighborhoods in Toronto - Week 3 Assignment 

For this assignment, you will be required to explore and cluster the neighborhoods in Toronto.

1. Start by creating a new Notebook for this assignment.
2. Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe
3. To create the above dataframe:

    * The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
    * Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
    * More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
    * If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.
    * Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
    * In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.
4. Submit a link to your Notebook on your Github repository.
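
The cleaning rules in step 3 can be sketched on a tiny hand-made table. The rows below are a hypothetical sample shaped like the Wikipedia table, not the scraped data, and the names `raw` and `clean` are chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample shaped like the Wikipedia table (not the scraped data).
raw = pd.DataFrame({
    "PostalCode": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
})

# Rule 1: keep only rows with an assigned borough.
clean = raw[raw["Borough"] != "Not assigned"].copy()

# Rule 2: an unassigned neighborhood takes the borough's name.
missing = clean["Neighborhood"] == "Not assigned"
clean.loc[missing, "Neighborhood"] = clean.loc[missing, "Borough"]

# Rule 3: one row per postal code, neighborhoods joined with a comma.
clean = (
    clean.groupby(["PostalCode", "Borough"], as_index=False, sort=False)["Neighborhood"]
    .agg(", ".join)
)

# .shape is a (rows, columns) tuple, so no parentheses are needed.
print(clean.shape)
```

On this sample the two M5A rows collapse into one and M7A takes its borough name as its neighborhood, leaving two rows.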

The pandas library can import tables from HTML pages directly, so we will use `pd.read_html` instead of scraping the page ourselves. First we install the parser libraries it depends on.

In [3]:
import sys
!conda install --yes --prefix {sys.prefix} lxml
!conda install --yes --prefix {sys.prefix} html5lib
!conda install --yes --prefix {sys.prefix} beautifulsoup4

Collecting package metadata: done
Solving environment: done

# All requested packages already installed.

Collecting package metadata: done
Solving environment: done

# All requested packages already installed.

Collecting package metadata: done
Solving environment: done

# All requested packages already installed.



In [46]:
import pandas as pd

df = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")[0]

In [47]:
df.head()

Unnamed: 0,0,1,2
0,Postcode,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village


So we have the data we need, but the header has been read in as the first data row.

In [48]:
df = df.drop(df.index[0])
df.head()


Unnamed: 0,0,1,2
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
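
Dropping the first row removes the stray header but leaves the columns labelled 0, 1, 2. One possible alternative, sketched here on a hypothetical three-row frame rather than the scraped table, is to promote that first row to column labels before dropping it.

```python
import pandas as pd

# Hypothetical frame mimicking the scraped table: the header arrived as row 0.
sample = pd.DataFrame([
    ["Postcode", "Borough", "Neighbourhood"],
    ["M1A", "Not assigned", "Not assigned"],
    ["M3A", "North York", "Parkwoods"],
])

# Use the first row as column labels, then drop it from the data.
sample.columns = sample.iloc[0].tolist()
sample = sample.drop(sample.index[0]).reset_index(drop=True)

print(sample.columns.tolist())
print(len(sample))
```

`pd.read_html` also accepts a `header` argument, which may achieve the same result at load time, depending on how the page's table is marked up.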


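Let's add column names.

In [55]:
# Use a list, not a set: a set has no guaranteed order, so labels can land on the wrong columns.
df.columns = ['PostalCode', 'Borough', 'Neighbourhood']
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood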
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront


We now have the dataframe we need. Next, let's read the coordinates and add them to it.

In [58]:
geo = pd.read_csv('http://cocl.us/Geospatial_data')
geo.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


OK, so we now need to merge the dataframes on postal code. The key columns are named differently (PostalCode and Postal Code), so we pass left_on and right_on to the merge.

In [67]:
dfinal = pd.merge(df,geo, left_on="PostalCode", right_on='Postal Code',how="left")

In [68]:
dfinal.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Postal Code,Latitude,Longitude
0,M1A,Not assigned,Not assigned,,,
1,M2A,Not assigned,Not assigned,,,
2,M3A,North York,Parkwoods,M3A,43.753259,-79.329656
3,M4A,North York,Victoria Village,M4A,43.725882,-79.315572
4,M5A,Downtown Toronto,Harbourfront,M5A,43.65426,-79.360636


As we can see, some postal codes have no borough or neighbourhood assigned, and there are some NaNs for latitude and longitude. Let's count the non-null values in each column.

In [69]:
dfinal.count()

PostalCode       288
Neighbourhood    288
Borough          288
Postal Code      211
Latitude         211
Longitude        211
dtype: int64
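
To see which postal codes lack coordinates, and not just how many, a left merge can carry an indicator column. The two small frames below are hypothetical stand-ins for `df` and `geo`, and the names `codes`, `coords`, and `unmatched` are illustrative.

```python
import pandas as pd

# Hypothetical stand-ins for the scraped table and the coordinates file.
codes = pd.DataFrame({"PostalCode": ["M1A", "M3A", "M4A"]})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# indicator=True adds a _merge column: 'both' or 'left_only' for a left join.
merged = pd.merge(codes, coords, left_on="PostalCode",
                  right_on="Postal Code", how="left", indicator=True)

# Postal codes that found no match in the coordinates table.
unmatched = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

In this sample only M1A is unmatched, which mirrors the unassigned rows in the merged output above.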

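Let's drop the rows where nothing is assigned and check the counts again.

In [71]:
# Drop rows where both fields are 'Not assigned'; keep rows like M7A that have a borough but no neighbourhood.
dfinal = dfinal[(dfinal[['Borough', 'Neighbourhood']] != 'Not assigned').any(axis=1)]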

In [72]:
dfinal.count()

PostalCode       211
Neighbourhood    211
Borough          211
Postal Code      211
Latitude         211
Longitude        211
dtype: int64

Bingo, we have a full match. This is the dataframe we need; we just have to remove the extra Postal Code column.

In [74]:
dfinal = dfinal.drop(columns=['Postal Code'])  # drop returns a new frame, so assign it back
dfinal

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
2,M3A,North York,Parkwoods,43.753259,-79.329656
3,M4A,North York,Victoria Village,43.725882,-79.315572
4,M5A,Downtown Toronto,Harbourfront,43.654260,-79.360636
5,M5A,Downtown Toronto,Regent Park,43.654260,-79.360636
6,M6A,North York,Lawrence Heights,43.718518,-79.464763
7,M6A,North York,Lawrence Manor,43.718518,-79.464763
8,M7A,Queen's Park,Not assigned,43.662301,-79.389494
10,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
11,M1B,Scarborough,Rouge,43.806686,-79.194353
12,M1B,Scarborough,Malvern,43.806686,-79.194353
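
The rows above still repeat postal codes such as M5A and M6A, so a quick duplicate check before reporting the shape can be useful. The sketch uses a hypothetical sample copied from the first rows of the output, with the columns labelled by their contents.

```python
import pandas as pd

# Hypothetical sample: a few rows copied from the output above.
sample = pd.DataFrame({
    "PostalCode": ["M3A", "M5A", "M5A"],
    "Borough": ["North York", "Downtown Toronto", "Downtown Toronto"],
    "Neighbourhood": ["Parkwoods", "Harbourfront", "Regent Park"],
})

# keep=False marks every row of a repeated postal code, not just the later ones.
repeated = sample[sample["PostalCode"].duplicated(keep=False)]
print(repeated["PostalCode"].unique().tolist())

# .shape is a (rows, columns) tuple; element 0 is the row count.
print(sample.shape[0])
```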
