# Segmenting and Clustering Neighborhoods in Toronto

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Question 1</a>
    
1.1. <a href="#item1">Conclusion</a>
2. <a href="#item2">Question 2</a>
    
2.1. <a href="#item1">Conclusion</a>
    
3. <a href="#item3">Question 3</a>
    
3.1. <a href="#item1">Conclusion</a>
</font>
</div>

## 1. Question 1

*Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe like the one shown below:*

<img src = "https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/7JXaz3NNEeiMwApe4i-fLg_40e690ae0e927abda2d4bde7d94ed133_Screen-Shot-2018-06-18-at-7.17.57-PM.png?expiry=1588636800000&hmac=ssKIQrsG6VHIIby2_yiH4jQ1yUt124BPn_UWPv6ncGk" width="500" height="500" />

The code below scrapes the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to obtain the table of postal codes and transforms the data into a pandas dataframe like the one shown above.

(optional) If needed, install the *BeautifulSoup4* package with *pip* in the current Jupyter kernel.

In [None]:
# import sys
# !{sys.executable} -m pip install BeautifulSoup4

Download the Wikipedia page that contains the complete list of postal codes for the city of Toronto. We use the *requests* library to fetch the page.

In [None]:
import requests
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

We will use the *BeautifulSoup* library to handle the content of the page, with the *lxml* parser, which the BeautifulSoup documentation recommends for its speed.

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(website_url,'lxml')

Let's look for a table in the page. *BeautifulSoup* allows us to search for an HTML element by its type, so let's look for a *table*.

In [None]:
soup.table.name

*soup.table* returns the first *table* element in the page, so its name is 'table'. Let's get the first table row (the first *tr* element) to see whether it contains the table headers.

In [None]:
# soup.table.tr.find_all() would return the tags inside the row as a list
print(soup.table.tr.text)

Now let's load the postal codes into a data frame. We will iterate over all the *tr* elements (the rows) of the table and collect the data from each one. Since the first row contains the column headers, we will need to remove it from the data frame after all the data is loaded.

In [None]:
import pandas as pd

The list of headers for our data frame.

In [None]:
headers=["Postalcode","Borough","Neighbourhood"]

The data frame is named *df_toronto*.

In [None]:
df_toronto = pd.DataFrame(columns= headers)

In [None]:
for row in soup.table.find_all('tr'):
    row_data=[]
    for data in row.find_all('th'):
        row_data.append(data.text.strip())
    for data in row.find_all('td'):
        row_data.append(data.text.strip())
    df_toronto.loc[len(df_toronto)] = row_data
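
Appending with *df.loc[len(df)]* grows the data frame one row at a time, which works but can get slow for larger tables. As a minimal alternative sketch, assuming the cell texts of each row have already been extracted into lists (the `rows` sample and the name `df_sample` below are made up for illustration and are not the scraped data), we could collect all rows first and build the data frame once:

```python
import pandas as pd

headers = ["Postalcode", "Borough", "Neighbourhood"]

# Hypothetical sample standing in for the cell texts of each <tr> element;
# the first entry plays the role of the table's header row.
rows = [
    ["Postal code", "Borough", "Neighborhood"],
    ["M1A", "Not assigned", ""],
    ["M3A", "North York", "Parkwoods"],
    ["M4A", "North York", "Victoria Village"],
]

# Skip the header row, then keep only rows with exactly one value per column.
data_rows = [r for r in rows[1:] if len(r) == len(headers)]

df_sample = pd.DataFrame(data_rows, columns=headers)
print(df_sample)
```

With this approach the header row never enters the data frame, so there is nothing to drop afterwards.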

Now we have our data frame, let's check it.

In [None]:
df_toronto.head()

We need to remove the first row from the Dataset.

In [None]:
# delete the first row from the dataFrame
df_toronto.drop(0, inplace=True)

In [None]:
df_toronto.head()

Now we will remove the postal codes not assigned to a borough.

In [None]:
# Get names of indexes for which the column Borough has a value "Not assigned"
not_assigned = df_toronto[df_toronto['Borough'] =='Not assigned'].index

# Delete the row indexes from the data frame
df_toronto.drop(not_assigned, inplace=True)

In [None]:
df_toronto.head()
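
An equivalent way to remove these rows is to keep only the rows that pass a boolean mask. The sketch below uses a small made-up data frame (`df_sample` is not the scraped table) and also resets the index, which the *drop* approach above leaves with gaps:

```python
import pandas as pd

# Made-up rows for illustration only.
df_sample = pd.DataFrame({
    "Postalcode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep the rows whose Borough is assigned, then renumber the index from 0.
mask = df_sample["Borough"] != "Not assigned"
df_assigned = df_sample[mask].reset_index(drop=True)
print(df_assigned)
```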

Let's check how many rows the data frame contains now.

In [None]:
df_toronto.shape

How many distinct postal codes are in the data frame?

In [None]:
df_toronto.nunique()

So there is no need to merge rows, since this version of the page no longer has repeated postal codes. However, the neighbourhood names are separated by slashes (/) rather than commas, so let's fix this.

In [None]:
df_toronto.Neighbourhood = df_toronto.Neighbourhood.replace(" /", ',', regex=True)
df_toronto.head()
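
To see what the substitution does in isolation, here is a small sketch on made-up strings (not the scraped column). It uses *Series.str.replace* with `regex=False`, so that " /" is matched literally rather than as a regular expression:

```python
import pandas as pd

# Made-up neighbourhood strings in the slash-separated style of the page.
sample = pd.Series([
    "Regent Park / Harbourfront",
    "Parkwoods",
    "Lawrence Manor / Lawrence Heights",
])

# Replace " /" with "," literally; the space after the slash stays in place.
cleaned = sample.str.replace(" /", ",", regex=False)
print(cleaned.tolist())
```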

### 1.1. Conclusion

We can see the full data frame below.

In [None]:
df_toronto

## 2. Question 2

Now that you have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, in order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood.

In an older version of this course, we were leveraging the Google Maps Geocoding API to get the latitude and the longitude coordinates of each neighborhood. However, recently Google started charging for their API: http://geoawesomeness.com/developers-up-in-arms-over-google-maps-api-insane-price-hike/, so we will use the Geocoder Python package instead: https://geocoder.readthedocs.io/index.html.

The problem with this package is that you sometimes have to be persistent to get the geographical coordinates of a given postal code. A call for a given postal code can return None, and the same call made again can return the coordinates. So, to make sure that you get the coordinates for all of our neighborhoods, you can run a while loop for each postal code (M5G, for example) that repeats the call until the result is no longer None.

Given that this package can be very unreliable, in case you are not able to get the geographical coordinates of the neighborhoods using the Geocoder package, here is a link to a csv file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data

Use the Geocoder package or the csv file to create the following dataframe:
<img src="https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/HZ3jNHNOEeiMwApe4i-fLg_f44f0f10ccfaf42fcbdba9813364e173_Screen-Shot-2018-06-18-at-7.18.16-PM.png?expiry=1588636800000&hmac=epN7y9Ean0VJJ-40CNmcZ9ztJhgYTUYI5v09sgxi9WY" width="500" height="500" />

There was no data retrieved from Google using the *geocoder* library. *Geocoder* offers other geodata sources, for example *arcgis* for the same purpose.

I will first retrieve the coordinates data from the csv file available for this question.

In [None]:
# load the geospatial data (latitude and longitude per postal code)
df_coord = pd.read_csv("http://cocl.us/Geospatial_data")

We can check that the file was loaded in the data frame.

In [None]:
df_coord.head()

We need to rename the first column to *Postalcode* so that we can merge this data frame with *df_toronto*, the boroughs dataframe we created earlier.

In [None]:
df_coord.rename(columns={'Postal Code':'Postalcode'},inplace=True)

This dataframe has the same number of rows.

In [None]:
df_coord.shape

Let's merge the two dataframes, using the *Postalcode* column as the key to join on.

In [None]:
df_all_boroughs = pd.merge(df_toronto, df_coord, on = 'Postalcode')

The result of the merge is the dataframe *df_all_boroughs*.

In [None]:
df_all_boroughs.head()
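
By default *pd.merge* performs an inner join, so a postal code that is missing from either frame is dropped silently. A hedged sketch of how to check for this, using two made-up frames (`left`, `right`, and their values are illustrative only, and `M9Z` is deliberately left without coordinates):

```python
import pandas as pd

# Made-up frames for illustration; M9Z has no coordinates on purpose.
left = pd.DataFrame({
    "Postalcode": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Etobicoke"],
})
right = pd.DataFrame({
    "Postalcode": ["M3A", "M4A"],
    "Latitude": [43.75, 43.73],
    "Longitude": [-79.33, -79.32],
})

# how="left" keeps every row of `left`; indicator=True adds a `_merge` column;
# validate="one_to_one" raises if a postal code is duplicated on either side.
merged = pd.merge(left, right, on="Postalcode", how="left",
                  indicator=True, validate="one_to_one")

# Postal codes that found no match in `right`.
unmatched = merged.loc[merged["_merge"] == "left_only", "Postalcode"].tolist()
print(unmatched)
```

If the list of unmatched postal codes is empty, the inner join used above loses no rows.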

### Retrieving the geo coordinates from the ArcGis database using the *geocoder* library

Since retrieving geo coordinates from Google with *geocoder* was not working, we use another well-known geo data provider, *ArcGis*.
We don't need an API key to use it.

First we need to install *geocoder* and import it.

In [None]:
import sys
!{sys.executable} -m pip install geocoder

In [None]:
import geocoder

Using *geocoder* to retrieve geo data from Google is not working:

In [None]:
# geocoder was already imported above
g = geocoder.google('Mountain View, CA')
# latlng is None when no result is returned
print(g.latlng)

So we will create a function to retrieve geo data from ArcGis using *geocoder*.

In [None]:
def get_latlng(postal_code):
    # initialize your variable to None
    lat_lng_coords = None
    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng
    return lat_lng_coords[0],lat_lng_coords[1]

If we pass a postal code to the function, we get its coordinates.

In [None]:
get_latlng('M3A')
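
Note that the while loop in *get_latlng* never ends if the service keeps returning None. A minimal sketch of a bounded retry follows, under the assumption that the lookup is passed in as a function; `retry_lookup`, `flaky_lookup`, and the coordinates are hypothetical names and values used only to make the example runnable without network access:

```python
import time


def retry_lookup(lookup, query, max_attempts=5, delay=0.0):
    """Call lookup(query) until it returns a non-None result or attempts run out."""
    for _ in range(max_attempts):
        result = lookup(query)
        if result is not None:
            return result
        time.sleep(delay)  # optional pause between attempts
    return None  # the caller decides how to handle a missing result


# Stand-in for a flaky geocoding call: returns None twice, then a made-up pair.
calls = {"count": 0}


def flaky_lookup(query):
    calls["count"] += 1
    return None if calls["count"] < 3 else (43.0, -79.0)


coords = retry_lookup(flaky_lookup, "M5G, Toronto, Ontario")
print(coords, calls["count"])
```

With *geocoder*, the lookup could be something like `lambda q: geocoder.arcgis(q).latlng`, which mirrors the call used in *get_latlng*.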

We will create a new column on the data frame and fill it with the geo data for the postal code in each line.

In [None]:
df_toronto['coord'] = df_toronto.Postalcode.apply(get_latlng)

The data frame now has all the geo data we need, but we still have to split the *coord* column into two columns, *Latitude* and *Longitude*, and then drop *coord*.

In [None]:
df_toronto.head()

In [None]:
df_toronto['Latitude'] = df_toronto.coord.apply(lambda x: x[0])
df_toronto['Longitude'] = df_toronto.coord.apply(lambda x: x[1])

In [None]:
df_toronto.drop("coord", axis=1, inplace=True)
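
The split and drop steps above can also be done in one pass. The sketch below, on a made-up frame (`df_sample` and its coordinates are illustrative, not real results), expands the tuples into a two-column data frame and joins it back on the index:

```python
import pandas as pd

# Made-up coordinates for illustration only.
df_sample = pd.DataFrame({
    "Postalcode": ["M3A", "M4A"],
    "coord": [(43.75, -79.33), (43.73, -79.32)],
})

# Expand each (lat, lng) tuple into two columns, aligned on the original index.
latlng = pd.DataFrame(df_sample["coord"].tolist(),
                      index=df_sample.index,
                      columns=["Latitude", "Longitude"])

# Replace the tuple column with the two new columns.
df_sample = df_sample.drop(columns="coord").join(latlng)
print(df_sample)
```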

Finally, we have the data frame with the desired data, labeled as we wanted.

In [None]:
df_toronto.head()

In [None]:
df_toronto.info()

In [None]:
df_toronto

### 2.1. Conclusion

We retrieved the geo data from two sources. The first source was the CSV file provided by IBM and the second source was the geo data from ArcGis retrieved using *geocoder*.

In [None]:
# from a CSV file
df_all_boroughs

In [None]:
# from the ArcGis geo data using the geocoder library
df_toronto

## 3. Question 3