## Scrape the necessary data and transform it into a dataframe

### 1. We scrape Wikipedia to obtain the postal codes

Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [None]:
# uncomment the line below to install dependencies
# !conda install -y anaconda beautifulsoup4 lxml

In [None]:
from bs4 import BeautifulSoup
import lxml

In [None]:
# download html file from wikipedia link
!wget -O 'postal_codes.html' 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

In [None]:
# open and scrape downloaded html file
with open('postal_codes.html') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

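l = []
table = soup.find('table')  # the postal code table is the first table on the page
for tr in table.find_all('tr'):
    row = []
    for td in tr.find_all('td'):
        # discard the whole row when a cell is exactly 'Not assigned'
        if td.text == 'Not assigned':
            row = []
            break
        row.append(td.text)
    # the header row and discarded rows are appended as empty lists;
    # they become all-None rows in the dataframe and are dropped later
    l.append(row)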

In [None]:
# import our generated list as a dataframe
import pandas as pd
df = pd.DataFrame(l, columns=["PostalCode", "Borough", "Neighborhood"])
df

### 2. We transform the scraped data into a workable pandas dataframe


In [None]:
import numpy as np
# data clean-up

# Keep only the rows that have an assigned borough. Rows left empty during scraping
# (the header row and rows whose borough is 'Not assigned') are all-None and are dropped here.
df_clean = df.dropna(subset=['Borough']).reset_index(drop=True)

# remove line jumps at the end of Neighborhood cell (leftover from scraping)
df_clean['Neighborhood'] = df_clean['Neighborhood'].str.replace('\n', '')

# group Neighborhoods with the same postal code
df_grouped = df_clean.groupby(['PostalCode', 'Borough'])['Neighborhood'].agg(','.join).reset_index()

# If a cell has a borough but a 'Not assigned' neighborhood, then the neighborhood will be the same as the borough.
df_grouped['Neighborhood'] = np.where(df_grouped['Neighborhood'] == 'Not assigned',
                                      df_grouped['Borough'],
                                      df_grouped['Neighborhood'])

df_grouped

In [None]:
df_grouped.shape
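
The clean-up rules in this step (strip the trailing line break, join neighborhoods that share a postal code, fall back to the borough when the neighborhood is 'Not assigned') can be checked on a small hand-made frame. The sketch below is only an illustration: `toy`, `grouped` and `result` are hypothetical names, and the three rows are made-up sample values rather than the scraped table.

```python
import numpy as np
import pandas as pd

# Hand-made rows that mimic the scraped columns (illustrative values only)
toy = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M7A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Harbourfront\n', 'Regent Park\n', 'Not assigned\n'],
})

# remove the trailing line break left over from scraping
toy['Neighborhood'] = toy['Neighborhood'].str.replace('\n', '', regex=False)

# join the neighborhoods that share a postal code and borough
grouped = (toy.groupby(['PostalCode', 'Borough'])['Neighborhood']
              .agg(','.join)
              .reset_index())

# fall back to the borough name when the neighborhood is 'Not assigned'
grouped['Neighborhood'] = np.where(grouped['Neighborhood'] == 'Not assigned',
                                   grouped['Borough'],
                                   grouped['Neighborhood'])

result = grouped['Neighborhood'].tolist()
print(result)
```

The two M5A rows collapse into one comma-separated entry, and the M7A row takes its borough name as its neighborhood.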

## Obtain coordinates of each Neighborhood with Geocoder Python

In order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood.

We will use the Geocoder Python package: https://geocoder.readthedocs.io/index.html

This package can be unreliable: when asked for the geographical coordinates of a given postal code, it will sometimes wrongly return None. To make sure that we get the coordinates for all of our neighborhoods, we will repeat the request in a while loop for each postal code until a result comes back.
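
The retry idea described above can also be bounded, so that a postal code that never resolves does not stall the notebook. The following is a minimal sketch under assumptions: `retry_until_result` and its parameters are hypothetical names, and a stand-in function replaces the real geocoder so the example runs offline. The coordinates are placeholder values, not geocoding results.

```python
import time


def retry_until_result(fetch, max_attempts=5, delay_seconds=0.0):
    """Call fetch() until it returns a non-None value or attempts run out."""
    for _ in range(max_attempts):
        result = fetch()
        if result is not None:
            return result
        time.sleep(delay_seconds)  # short pause before the next attempt
    return None  # give up after max_attempts failed calls


# Stand-in for a flaky geocoder: returns None twice, then placeholder coordinates
responses = iter([None, None, [43.65, -79.38]])
coords = retry_until_result(lambda: next(responses))

# A source that never answers yields None instead of looping forever
never = retry_until_result(lambda: None, max_attempts=3)
print(coords, never)
```

With the real package, `fetch` could wrap the request used in this notebook, for example `lambda: geocoder.arcgis('M5A, Toronto, Ontario').latlng`.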

In [None]:
# uncomment the three lines below to install dependencies. If you do, remember to restart the notebook's kernel
# !git clone https://github.com/DenisCarriere/geocoder
# !cd geocoder && python setup.py install
# !cd .. && rm -rf 

In [None]:
import geocoder


def lat_long_retriever(postalcode):
    """Return [latitude, longitude] for a Toronto postal code."""
    lat_lng_coords = None
    # geocoder sometimes returns None, so repeat the request until coordinates come back
    while lat_lng_coords is None:
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(postalcode))
        lat_lng_coords = g.latlng
    return lat_lng_coords

# run the lat_long_retriever function on each postal code and create a new column called lat_long
df_grouped['lat_long'] = df_grouped['PostalCode'].apply(lat_long_retriever)

df_grouped



In [None]:
# break lat_long column into latitude and longitude columns
df_geo = df_grouped.merge(df_grouped['lat_long'].apply(lambda s: pd.Series({'Latitude':s[0], 'Longitude':s[1]})), 
    left_index=True, right_index=True)
df_geo.drop(columns=['lat_long'], inplace=True)

df_geo.head()

In [None]:
df_geo.shape

## Explore and cluster the neighborhoods in Toronto

We will work only with boroughs that contain the word Toronto. We will replicate the analysis we did for the New York City data and generate maps to visualize the neighborhoods and how they cluster together.
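
A minimal sketch of the borough filter, assuming that "contain the word Toronto" means a plain substring match on the `Borough` column. `sample`, `toronto` and `boroughs` are hypothetical names, and the rows are illustrative values rather than the full scraped table.

```python
import pandas as pd

# Illustrative rows with the same text columns as the grouped dataframe
sample = pd.DataFrame({
    'PostalCode': ['M4E', 'M1B', 'M5A'],
    'Borough': ['East Toronto', 'Scarborough', 'Downtown Toronto'],
    'Neighborhood': ['The Beaches', 'Rouge,Malvern', 'Harbourfront'],
})

# keep only the rows whose borough name contains the word Toronto
mask = sample['Borough'].str.contains('Toronto', regex=False)
toronto = sample[mask].reset_index(drop=True)

boroughs = toronto['Borough'].tolist()
print(boroughs)
```

The same mask applied to `df_geo` would keep the latitude and longitude columns for the mapping step.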