# Capstone Project Week 3 - Final Project
Explore, segment, and cluster the neighborhoods in the city of Toronto. 

For the Toronto neighborhood data, scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M and transform the data into a pandas dataframe.

## Part 1: Create DataFrame

1. The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood.
2. Only process the cells that have an assigned borough. Ignore cells whose borough is "Not assigned".
3. More than one neighborhood can exist in one postal code area. These rows will be combined into one row with the neighborhoods separated by commas.
4. If a cell has a borough but a "Not assigned" neighborhood, then the neighborhood will be the same as the borough.
5. Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
6. In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.
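
Steps 2 to 4 above can be sketched on a small, made-up table. The sample rows below are hypothetical stand-ins for the scraped data, and this is one possible way to apply the rules rather than the approach used in the cells that follow.

```python
import pandas as pd

# Hypothetical sample rows standing in for the scraped table
raw = pd.DataFrame({
    "PostalCode": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Regent Park", "Harbourfront", "Not assigned"],
})

# Step 2: keep only rows with an assigned borough
df = raw[raw["Borough"] != "Not assigned"].copy()

# Step 4: an unassigned neighborhood takes the borough name
unassigned = df["Neighborhood"] == "Not assigned"
df.loc[unassigned, "Neighborhood"] = df.loc[unassigned, "Borough"]

# Step 3: one row per postal code, neighborhoods joined with commas
df = (
    df.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)

print(df.shape)
```

The fill in step 4 runs before the grouping in step 3 so that a placeholder value never ends up inside a joined neighborhood string.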


In [13]:
import pandas as pd
import numpy as np

In [14]:
# Read table from wikipedia page
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
toronto_codes = pd.read_html(url)[0]
toronto_codes.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [15]:
# Remove the rows whose Borough is "Not assigned"
# (.copy() makes the result an independent dataframe, so later column assignments do not warn)
# print(toronto_codes.shape[0])
toronto_codes = toronto_codes[toronto_codes.Borough != "Not assigned"].copy()
# print(toronto_codes["Borough"].value_counts())

# Update column names
toronto_codes.columns = ["PostalCode", "Borough", "Neighborhood"]

# Count rows where Neighborhood is "Not assigned"; if any existed,
# their Neighborhood would have to be set to the Borough
print("Rows with Neighborhood == Not assigned: {}"
    .format(toronto_codes["Neighborhood"].eq("Not assigned").sum()))

# Count postal codes that appear in more than one row; if any existed,
# their rows would have to be combined
print("Duplicate Postal Code Rows: ", sum(toronto_codes["PostalCode"].value_counts() > 1))

# Reindex
toronto_codes = toronto_codes.reset_index(drop=True)

toronto_codes.head()

Rows with Neighborhood == Not assigned: 0
Duplicate Postal Code Rows:  0


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [16]:
print("*" * 80)
print("Number of rows in the dataframe: ", toronto_codes.shape[0])
print("*" * 80)

********************************************************************************
Number of rows in the dataframe:  103
********************************************************************************


## Part 2: Get Latitude and Longitude of Postal Codes
1. From the dataframe of postal code, borough name, and neighborhood name, get the latitude and longitude of each neighborhood.
2. Use the Geocoder Python package to get latitude and longitude info: https://geocoder.readthedocs.io/index.html. (You may need to run a while loop for each postal code to work around `None` responses.)
3. In case you are not able to get the geographical coordinates of the neighborhoods using the Geocoder package, here is a link to a csv file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data
4. Create a dataframe with PostalCode, Borough, Neighborhood, Latitude, and Longitude.
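
If the csv fallback in step 3 is used, the coordinates can be joined to the postal code dataframe with a merge. The sketch below reads an inline excerpt instead of downloading the file. The header names `Postal Code`, `Latitude`, and `Longitude` are assumptions about the csv layout, and the rows are illustrative.

```python
import io
import pandas as pd

# Hypothetical excerpt of the coordinates file; header names are an assumption
csv_text = """Postal Code,Latitude,Longitude
M3A,43.75245,-79.32991
M4A,43.73057,-79.31306
"""
coords = pd.read_csv(io.StringIO(csv_text))

# Illustrative postal code rows; M5A has no match in the excerpt above
codes = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
    "Neighborhood": ["Parkwoods", "Victoria Village", "Regent Park, Harbourfront"],
})

# Align the key column name, then left-join so every postal code row is kept
merged = codes.merge(
    coords.rename(columns={"Postal Code": "PostalCode"}),
    on="PostalCode",
    how="left",
)

print(merged)
```

A left join keeps postal codes that have no coordinates and leaves their Latitude and Longitude as NaN, which makes missing matches easy to spot.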

In [17]:
!pip install geocoder



In [18]:
# Import the geocoding package
import geocoder

# Using arcgis which gives slightly different results than geocoder.google
def get_latlong(postal_code: str) -> tuple:
    """Return (latitude, longitude) for a postal code, or (NaN, NaN) if the lookup fails."""
    g = geocoder.arcgis("{}, Toronto, Ontario".format(postal_code))
    if not g.ok:
        print("Error in fulfilling request for Postal Code:", postal_code, g)
        return np.nan, np.nan

    return g.latlng[0], g.latlng[1]

# print(get_latlong("M4B"))
# print(get_latlong(toronto_codes.PostalCode[0]))
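
The assignment suggests looping until the geocoder stops returning `None`. A bounded retry avoids an endless loop when a lookup never succeeds. The helper below is a sketch: `retry_lookup` and `flaky_lookup` are hypothetical names, and the lookup function is passed in so the example runs without network access.

```python
import time
from typing import Callable, Optional, Tuple

LatLng = Tuple[float, float]

def retry_lookup(lookup: Callable[[str], Optional[LatLng]], query: str,
                 max_attempts: int = 5, delay: float = 0.0) -> LatLng:
    """Call lookup(query) until it returns coordinates or attempts run out."""
    for _ in range(max_attempts):
        result = lookup(query)
        if result is not None:
            return result
        time.sleep(delay)  # optional pause between attempts
    # Every attempt failed: signal missing coordinates with NaN
    return float("nan"), float("nan")

# Hypothetical stand-in for a geocoder call that fails twice before succeeding
calls = {"n": 0}

def flaky_lookup(query: str) -> Optional[LatLng]:
    calls["n"] += 1
    return None if calls["n"] < 3 else (43.75245, -79.32991)

result = retry_lookup(flaky_lookup, "M3A, Toronto, Ontario")
print(result)
```

In the notebook, the `lookup` argument could be a small wrapper around `geocoder.arcgis` that returns `None` whenever `g.ok` is false.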

In [19]:
# Note that Latitude and Longitude are slightly different from google lat/longs due to use
# of arcgis as provider instead of google (which was erroring out)

# Add Latitude and Longitude
toronto_codes["Latitude"], toronto_codes["Longitude"] = zip(*toronto_codes["PostalCode"].apply(get_latlong))


In [20]:
print(toronto_codes.shape)
toronto_codes.head()

(103, 5)


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.75245,-79.32991
1,M4A,North York,Victoria Village,43.73057,-79.31306
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65512,-79.36264
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.72327,-79.45042
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66253,-79.39188


## Part 3: Explore and cluster neighborhoods
1. You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City data.
2. Add enough Markdown cells to explain what you decided to do and to report any observations you make.
3. Generate maps to visualize your neighborhoods and how they cluster together.

Once you are happy with your analysis, submit a link to the new Notebook on your GitHub repository. (3 marks)

The submission will be a link to your Jupyter Notebook on your GitHub repository.
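
As a rough illustration of the clustering step, the sketch below runs k-means from scikit-learn on a handful of rows. Clustering on latitude and longitude alone is an assumption made to keep the example self-contained, and the actual analysis may use other features. The cluster count of 2 is also arbitrary.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Illustrative rows for a few Toronto postal codes
hoods = pd.DataFrame({
    "PostalCode": ["M5A", "M7A", "M5B", "M4E"],
    "Latitude": [43.65512, 43.66253, 43.65739, 43.67709],
    "Longitude": [-79.36264, -79.39188, -79.37804, -79.29547],
})

# Fit k-means on the coordinates; n_clusters=2 is an arbitrary choice for this sketch
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
hoods["Cluster"] = kmeans.fit_predict(hoods[["Latitude", "Longitude"]])

print(hoods)
```

The resulting `Cluster` column can then drive marker colors when the neighborhoods are drawn on a map.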


In [29]:
# Get a dataframe of only the rows whose borough name contains "Toronto"
df_toronto_hood = toronto_codes[toronto_codes["Borough"].str.contains("Toronto", case=False)]
print("Postal codes in boroughs that contain Toronto: ", df_toronto_hood.shape[0])
df_toronto_hood.head()

Postal codes in boroughs that contain Toronto:  39


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65512,-79.36264
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66253,-79.39188
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.65739,-79.37804
15,M5C,Downtown Toronto,St. James Town,43.65215,-79.37587
19,M4E,East Toronto,The Beaches,43.67709,-79.29547


Unnamed: 0,PostalCode
