# Segmenting and Clustering Neighborhoods in Toronto

## Question 1

*Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe like the one shown below:*

<img src = "https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/7JXaz3NNEeiMwApe4i-fLg_40e690ae0e927abda2d4bde7d94ed133_Screen-Shot-2018-06-18-at-7.17.57-PM.png?expiry=1588636800000&hmac=ssKIQrsG6VHIIby2_yiH4jQ1yUt124BPn_UWPv6ncGk" width="500" height="500" />

The code below scrapes the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, extracts the table of postal codes, and loads it into a pandas dataframe like the one shown above.

(Optional) If needed, install the *BeautifulSoup4* package in the current Jupyter kernel using *pip*.

In [1]:
# import sys
# !{sys.executable} -m pip install BeautifulSoup4

Scrape the webpage from Wikipedia that contains the complete list of postal codes for the city of Toronto. We use the *requests* library to download the page.

In [2]:
import requests
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

We will use the *BeautifulSoup* library to parse the content of the page, with the *lxml* parser, since it is the recommended one and reportedly very fast.

In [3]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(website_url,'lxml')

Let's look for a table in the page. *BeautifulSoup* allows us to search for an HTML element by its type, so let's look for a *table*.

In [4]:
soup.table.name

'table'

*soup.table* returns the first *table* element in the page. Let's get its first row (the first *tr* element) to see whether it contains the table headers.

In [5]:
#another way to get the same result: soup.table.tr.findAll()
print(soup.table.tr.text)


Postal code

Borough

Neighborhood



Now let's scrape the postal codes into a data frame. We iterate over all the *tr* elements (the rows) of the table and collect the data from each one. Since the first row contains the column headers, we will need to remove it from the data frame after all the data is loaded.

In [6]:
import pandas as pd

The list of headers for our data frame.

In [7]:
headers=["Postalcode","Borough","Neighbourhood"]

The data frame is named *df_toronto*.

In [8]:
df_toronto = pd.DataFrame(columns= headers)

In [9]:
for row in soup.table.find_all('tr'):
    row_data=[]
    for data in row.find_all('th'):
        row_data.append(data.text.strip())
    for data in row.find_all('td'):
        row_data.append(data.text.strip())
    df_toronto.loc[len(df_toronto)] = row_data
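
Appending to a data frame one row at a time with *.loc* works, but it gets slow as the table grows. A minimal sketch of an alternative is to collect the rows in a list and build the data frame once. It runs on a small inline HTML sample (made up for illustration, not the live page) and uses Python's built-in *html.parser* so that it does not depend on *lxml*. The names *sample_html*, *sample_soup* and *df_sample* are hypothetical.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the Wikipedia table.
sample_html = """
<table>
  <tr><th>Postal code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td></td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for tr in sample_soup.table.find_all("tr"):
    # Only <td> cells are collected, so the header row yields an empty list.
    cells = [cell.get_text(strip=True) for cell in tr.find_all("td")]
    if cells:
        rows.append(cells)

# Build the data frame in one step from the collected rows.
df_sample = pd.DataFrame(rows, columns=["Postalcode", "Borough", "Neighbourhood"])
print(df_sample)
```

Because the header row is skipped while parsing, this variant would not need a separate step to delete it afterwards.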

Now that we have our data frame, let's check it.

In [10]:
df_toronto.head()

Unnamed: 0,Postalcode,Borough,Neighbourhood
0,Postal code,Borough,Neighborhood
1,M1A,Not assigned,
2,M2A,Not assigned,
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village


We need to remove the first row from the data frame, since it only repeats the column headers.

In [11]:
# delete the first row from the dataFrame
df_toronto.drop(0, inplace=True)

In [12]:
df_toronto.head()

Unnamed: 0,Postalcode,Borough,Neighbourhood
1,M1A,Not assigned,
2,M2A,Not assigned,
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Regent Park / Harbourfront


Now we will remove the postal codes not assigned to a borough.

In [13]:
# Get names of indexes for which the column Borough has a value "Not assigned"
not_assigned = df_toronto[df_toronto['Borough'] =='Not assigned'].index

# Delete the row indexes from the data frame
df_toronto.drop(not_assigned, inplace=True)
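
The same filtering can also be written as a boolean mask, which avoids collecting the index labels first. A small sketch on a made-up three-row frame (the names *df_sample* and *df_assigned* are only for illustration):

```python
import pandas as pd

# Hypothetical sample rows in the same shape as df_toronto.
df_sample = pd.DataFrame(
    {
        "Postalcode": ["M1A", "M3A", "M4A"],
        "Borough": ["Not assigned", "North York", "North York"],
        "Neighbourhood": ["", "Parkwoods", "Victoria Village"],
    }
)

# Keep only the rows whose Borough is assigned; copy() gives an independent frame.
df_assigned = df_sample[df_sample["Borough"] != "Not assigned"].copy()
print(df_assigned)
```

As with *drop*, the surviving rows keep their original index labels.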

In [14]:
df_toronto.head()

Unnamed: 0,Postalcode,Borough,Neighbourhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Regent Park / Harbourfront
6,M6A,North York,Lawrence Manor / Lawrence Heights
7,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government


Let's check how many rows the data frame contains now.

In [15]:
df_toronto.shape

(103, 3)

How many distinct postal codes are in the data frame?

In [16]:
df_toronto.nunique()

Postalcode       103
Borough           10
Neighbourhood     98
dtype: int64

So there is no need to merge rows, since this version of the page no longer contains repeated postal codes. However, the neighbourhood names are separated by slashes (/) rather than commas, so let's fix this.

In [17]:
df_toronto.Neighbourhood = df_toronto.Neighbourhood.replace(" /", ',', regex=True)
df_toronto.head()

Unnamed: 0,Postalcode,Borough,Neighbourhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,"Regent Park, Harbourfront"
6,M6A,North York,"Lawrence Manor, Lawrence Heights"
7,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
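
Because the separator is a fixed string rather than a pattern, the same replacement can also be done with the string accessor and *regex=False*. A sketch on two sample values taken from the tables above (the names *neighbourhoods* and *cleaned* are hypothetical):

```python
import pandas as pd

neighbourhoods = pd.Series(
    ["Regent Park / Harbourfront", "Parkwoods"], name="Neighbourhood"
)

# Replace the literal " /" separator with a comma; no regular expression is involved.
cleaned = neighbourhoods.str.replace(" /", ",", regex=False)
print(cleaned.tolist())
```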


### Conclusion

We can see the full data frame below.

In [18]:
df_toronto

Unnamed: 0,Postalcode,Borough,Neighbourhood
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,"Regent Park, Harbourfront"
6,M6A,North York,"Lawrence Manor, Lawrence Heights"
7,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
9,M9A,Etobicoke,Islington Avenue
10,M1B,Scarborough,"Malvern, Rouge"
12,M3B,North York,Don Mills
13,M4B,East York,"Parkview Hill, Woodbine Gardens"
14,M5B,Downtown Toronto,"Garden District, Ryerson"


## Question 2

Now that you have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, in order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood.

In an older version of this course, we were leveraging the Google Maps Geocoding API to get the latitude and the longitude coordinates of each neighborhood. However, recently Google started charging for their API: http://geoawesomeness.com/developers-up-in-arms-over-google-maps-api-insane-price-hike/, so we will use the Geocoder Python package instead: https://geocoder.readthedocs.io/index.html.

The problem with this Package is you have to be persistent sometimes in order to get the geographical coordinates of a given postal code. So you can make a call to get the latitude and longitude coordinates of a given postal code and the result would be None, and then make the call again and you would get the coordinates. So, in order to make sure that you get the coordinates for all of our neighborhoods, you can run a while loop for each postal code. Taking postal code M5G as an example, your code would look something like this:

Given that this package can be very unreliable, in case you are not able to get the geographical coordinates of the neighborhoods using the Geocoder package, here is a link to a csv file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data

Use the Geocoder package or the csv file to create the following dataframe:
<img src="https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/HZ3jNHNOEeiMwApe4i-fLg_f44f0f10ccfaf42fcbdba9813364e173_Screen-Shot-2018-06-18-at-7.18.16-PM.png?expiry=1588636800000&hmac=epN7y9Ean0VJJ-40CNmcZ9ztJhgYTUYI5v09sgxi9WY" width="500" height="500" />

No data could be retrieved from Google using the *geocoder* library. *Geocoder* offers other geodata sources for the same purpose, for example *arcgis*.

I will first retrieve the coordinates data from the csv file available for this question.

In [19]:
# load the geospatial data (latitude and longitude for each postal code)
df_coord = pd.read_csv("http://cocl.us/Geospatial_data")

We can check that the file was loaded into the data frame.

In [20]:
df_coord.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


We need to rename the first column to *Postalcode* so we can merge this data frame with the boroughs data frame we previously created, *df_toronto*.

In [21]:
df_coord.rename(columns={'Postal Code':'Postalcode'},inplace=True)

This data frame has the same number of rows as *df_toronto*.

In [22]:
df_coord.shape

(103, 3)

Let's merge the two data frames, using the *Postalcode* column as the key to join on.

In [24]:
df_all_boroughs = pd.merge(df_toronto, df_coord, on = 'Postalcode')

The result of the merge is the dataframe *df_all_boroughs*.

In [25]:
df_all_boroughs.head()

Unnamed: 0,Postalcode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
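
*pd.merge* performs an inner join by default, so a postal code missing from either frame would be dropped without warning. A hedged sketch of how one might check for that, on made-up sample frames, using the *validate* and *indicator* parameters of *pd.merge* (the names *left*, *right*, *checked* and *unmatched* are hypothetical):

```python
import pandas as pd

# Hypothetical sample frames; M5A deliberately has no coordinates.
left = pd.DataFrame(
    {
        "Postalcode": ["M3A", "M4A", "M5A"],
        "Borough": ["North York", "North York", "Downtown Toronto"],
    }
)
right = pd.DataFrame(
    {
        "Postalcode": ["M3A", "M4A"],
        "Latitude": [43.753259, 43.725882],
        "Longitude": [-79.329656, -79.315572],
    }
)

# A left join keeps every row of `left`; validate raises if a key is duplicated,
# and indicator adds a _merge column recording where each row was found.
checked = pd.merge(
    left, right, on="Postalcode", how="left", validate="one_to_one", indicator=True
)

unmatched = checked.loc[checked["_merge"] == "left_only", "Postalcode"].tolist()
print(unmatched)
```

An empty *unmatched* list would indicate that every postal code received coordinates.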


### Retrieving the geo coordinates from the ArcGIS database using the *geocoder* library

Since retrieving geo coordinates from Google with *geocoder* was not working, we use another well-known geo data provider, *ArcGIS*.
We don't need an API key to use it.

First we need to install *geocoder* and import it.

In [26]:
import sys
!{sys.executable} -m pip install geocoder



In [27]:
import geocoder

Using *geocoder* to retrieve geo data from Google does not work; the call returns no coordinates:

In [29]:
g = geocoder.google('Mountain View, CA')
print(g.latlng)

None


So we will create a function to retrieve geo data from ArcGis using *geocoder*.

In [30]:
def get_latlng(postal_code):
    # initialize your variable to None
    lat_lng_coords = None
    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng
    return lat_lng_coords[0],lat_lng_coords[1]
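
The loop above never exits if the service keeps returning no coordinates, for example when the network is down. A minimal sketch of a bounded variant follows. The function name *get_latlng_bounded* and the parameters *lookup*, *max_attempts* and *delay* are introduced here for illustration; *lookup* stands for any callable that returns a latitude/longitude pair or *None*, such as `lambda q: geocoder.arcgis(q).latlng`.

```python
import time


def get_latlng_bounded(postal_code, lookup, max_attempts=5, delay=0.0):
    """Return (latitude, longitude), or raise after max_attempts failed lookups."""
    query = "{}, Toronto, Ontario".format(postal_code)
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords[0], coords[1]
        time.sleep(delay)  # pause before the next attempt
    raise RuntimeError(
        "no coordinates for {} after {} attempts".format(postal_code, max_attempts)
    )


# Stand-in lookup that fails twice before answering, to exercise the retry path.
answers = iter([None, None, [43.7529, -79.3356]])
result = get_latlng_bounded("M3A", lambda query: next(answers))
print(result)
```

Passing the lookup in as an argument keeps the retry logic testable without network access.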

If we pass the function a postal code, we get the geo data for it.

In [32]:
get_latlng('M3A')

(43.75293455500008, -79.33564142299997)

We will create a new column in the data frame and fill it with the geo data for the postal code in each row.

In [33]:
df_toronto['coord'] = df_toronto.Postalcode.apply(get_latlng)

The data frame now has all the geo data we need, but we still have to split it into two columns, *Latitude* and *Longitude*, and then drop the *coord* column.

In [34]:
df_toronto.head()

Unnamed: 0,Postalcode,Borough,Neighbourhood,coord
3,M3A,North York,Parkwoods,"(43.75293455500008, -79.33564142299997)"
4,M4A,North York,Victoria Village,"(43.72810248500008, -79.31188987099995)"
5,M5A,Downtown Toronto,"Regent Park, Harbourfront","(43.65096410900003, -79.35304116399999)"
6,M6A,North York,"Lawrence Manor, Lawrence Heights","(43.723265465000054, -79.45121077799996)"
7,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government","(43.66179000000005, -79.38938999999993)"


In [35]:
df_toronto['Latitude'] = df_toronto.coord.apply(lambda x: x[0])
df_toronto['Longitude'] = df_toronto.coord.apply(lambda x: x[1])

In [36]:
df_toronto.drop("coord", axis=1, inplace=True)
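
The two *apply* calls and the *drop* can also be replaced by expanding the tuples in one step. A sketch on a made-up two-row frame, using coordinate values from the output above (the names *df_sample* and *expanded* are hypothetical):

```python
import pandas as pd

df_sample = pd.DataFrame(
    {
        "Postalcode": ["M3A", "M4A"],
        "coord": [(43.752935, -79.335641), (43.728102, -79.311890)],
    },
    index=[3, 4],
)

# Turn the tuples into a two-column frame that shares the original index labels.
expanded = pd.DataFrame(
    df_sample["coord"].tolist(),
    columns=["Latitude", "Longitude"],
    index=df_sample.index,
)

# Drop the tuple column and attach the new columns by index.
df_sample = df_sample.drop(columns="coord").join(expanded)
print(df_sample)
```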

Finally, we have the data frame with the desired data, labeled as we wanted.

In [37]:
df_toronto.head()

Unnamed: 0,Postalcode,Borough,Neighbourhood,Latitude,Longitude
3,M3A,North York,Parkwoods,43.752935,-79.335641
4,M4A,North York,Victoria Village,43.728102,-79.31189
5,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.650964,-79.353041
6,M6A,North York,"Lawrence Manor, Lawrence Heights",43.723265,-79.451211
7,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66179,-79.38939


In [38]:
df_toronto.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 103 entries, 3 to 179
Data columns (total 5 columns):
Postalcode       103 non-null object
Borough          103 non-null object
Neighbourhood    103 non-null object
Latitude         103 non-null float64
Longitude        103 non-null float64
dtypes: float64(2), object(3)
memory usage: 4.8+ KB


In [39]:
df_toronto

Unnamed: 0,Postalcode,Borough,Neighbourhood,Latitude,Longitude
3,M3A,North York,Parkwoods,43.752935,-79.335641
4,M4A,North York,Victoria Village,43.728102,-79.311890
5,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.650964,-79.353041
6,M6A,North York,"Lawrence Manor, Lawrence Heights",43.723265,-79.451211
7,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.661790,-79.389390
9,M9A,Etobicoke,Islington Avenue,43.667481,-79.528953
10,M1B,Scarborough,"Malvern, Rouge",43.808626,-79.189913
12,M3B,North York,Don Mills,43.748900,-79.357220
13,M4B,East York,"Parkview Hill, Woodbine Gardens",43.707193,-79.311529
14,M5B,Downtown Toronto,"Garden District, Ryerson",43.657491,-79.377529


### Conclusion

We retrieved the geo data from two sources: the CSV file and the ArcGIS service, queried through *geocoder*.

In [43]:
# from a CSV file
df_all_boroughs

Unnamed: 0,Postalcode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937


In [None]:
# from the geocoder library (ArcGIS)
df_toronto
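
The two sources do not agree exactly: for M3A the CSV file gives (43.753259, -79.329656) while ArcGIS gives (43.752935, -79.335641). A rough sketch of measuring that gap with the haversine formula, assuming a spherical Earth of radius 6371 km (the function name *haversine_km* and the variable names are introduced here for illustration):

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))


# Coordinates for postal code M3A as shown in the two tables above.
csv_m3a = (43.753259, -79.329656)
arcgis_m3a = (43.752935, -79.335641)

gap_km = haversine_km(*csv_m3a, *arcgis_m3a)
print(round(gap_km, 2))
```

For this postal code the difference comes out at roughly half a kilometre, which is worth keeping in mind when choosing a search radius for nearby venues.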