# Part 1 Obtain Neighborhoods by Scraping

In [2]:
import pandas as pd

### Scrape the Neighborhood List from Wikipedia
The provided web page contains three tables; the one we need is the first.

In [3]:
# read_html returns a list of DataFrames, one per table on the page; take the first
df_raw = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
df_raw.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


### Clean the data
#### Remove the rows with Borough == 'Not assigned'

In [22]:
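# .copy() makes df_hoods an independent frame, so assigning to its columns later
# does not trigger pandas' SettingWithCopyWarning
df_hoods = df_raw[df_raw['Borough'] != 'Not assigned'].copy()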
df_hoods.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


In [23]:
# Still a problem with one row that has Neighbourhood == Not assigned
df_hoods[ df_hoods.Neighbourhood == 'Not assigned']

Unnamed: 0,Postcode,Borough,Neighbourhood
8,M7A,Queen's Park,Not assigned


#### Use Borough if Neighbourhood == 'Not assigned'
I am doing this before combining rows with the same postal code. If you combine the rows first (as implied by the instructions), you might end up with a Neighbourhood like 'Not assigned, Mr. Rogers, Boondocks, ...'

In [36]:
# Pattern: df.Col2 = df.Col1.where(df.Col2 == 'X', df.Col2)
# i.e. take Col1 where Col2 == 'X', otherwise keep Col2

df_hoods.Neighbourhood = df_hoods.Borough.where( df_hoods.Neighbourhood == 'Not assigned', df_hoods.Neighbourhood)

#Now none of the Neighbourhood values are 'Not assigned'
df_hoods[ df_hoods.Neighbourhood == 'Not assigned']

Unnamed: 0,Postcode,Borough,Neighbourhood
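
To illustrate why the order matters, here is a minimal sketch on toy data (the three rows are made up for the example, not scraped). It uses `Series.where` in the mirrored form, keeping Neighbourhood unless it is 'Not assigned', which should be equivalent to the cell above, and only then joins the rows per postcode.

```python
import pandas as pd

# Toy rows for illustration only (not taken from the scraped table)
toy = pd.DataFrame({
    "Postcode": ["M7A", "M5A", "M5A"],
    "Borough": ["Queen's Park", "Downtown Toronto", "Downtown Toronto"],
    "Neighbourhood": ["Not assigned", "Harbourfront", "Regent Park"],
})

# Series.where keeps the original value where the condition is True
# and takes the replacement (here: Borough) everywhere else.
toy["Neighbourhood"] = toy["Neighbourhood"].where(
    toy["Neighbourhood"] != "Not assigned", toy["Borough"]
)

# Only now combine the neighbourhoods that share a postcode.
combined = toy.groupby(["Postcode", "Borough"], as_index=False)["Neighbourhood"].agg(",".join)
print(combined)
```

Because the placeholder is replaced first, the joined string can never contain 'Not assigned'.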


#### Consolidate Neighborhoods from the same Postal Code

Assumptions:
* Postcode, Borough, and Neighbourhood will never be blank
* Postcode will always be a valid CA post code prefix for Toronto
* There will not be two Boroughs with the same Postcode
* There will not be two rows with the same value for Neighbourhood
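
These assumptions could be checked rather than taken on faith. The sketch below is one possible way to do that; `check_assumptions` is a hypothetical helper, and the pattern `M\d[A-Z]` (letter M, a digit, a letter) is my assumption about what a valid Toronto postcode prefix looks like. It is shown on toy frames, not on the scraped data.

```python
import pandas as pd

def check_assumptions(df):
    """Return one boolean per assumption in the list above (hypothetical helper)."""
    cols = ["Postcode", "Borough", "Neighbourhood"]
    # No column contains NaN or an empty/whitespace-only string
    no_blanks = all(
        df[c].notna().all() and (df[c].astype(str).str.strip() != "").all()
        for c in cols
    )
    # Assumed prefix format: 'M', one digit, one uppercase letter
    valid_prefix = df["Postcode"].str.match(r"^M\d[A-Z]$", na=False).all()
    # Each postcode maps to at most one borough
    one_borough = (df.groupby("Postcode")["Borough"].nunique() <= 1).all()
    # No neighbourhood name appears on more than one row
    unique_hoods = not df["Neighbourhood"].duplicated().any()
    return {
        "no_blanks": bool(no_blanks),
        "valid_prefix": bool(valid_prefix),
        "one_borough_per_postcode": bool(one_borough),
        "unique_neighbourhoods": bool(unique_hoods),
    }

# Toy frames for illustration only
good = pd.DataFrame({
    "Postcode": ["M1B", "M1B", "M3A"],
    "Borough": ["Scarborough", "Scarborough", "North York"],
    "Neighbourhood": ["Rouge", "Malvern", "Parkwoods"],
})
bad = good.copy()
bad.loc[1, "Borough"] = "North York"  # same postcode, two boroughs

print(check_assumptions(good))
print(check_assumptions(bad))
```

If the second assumption-style check fails on real data, grouping on both Postcode and Borough would produce more than one row per postcode.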

In [37]:
df_post = df_hoods.groupby(['Postcode', 'Borough'],as_index=False)['Neighbourhood'].agg(','.join)
df_post.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


### Final Cell
In the last cell of your notebook, use the *.shape* attribute to print the number of rows of your dataframe.

In [54]:
df_post.shape

(103, 3)

# Part 2 Add Geocoding

In [40]:
!pip install geocoder

Collecting geocoder
  Downloading https://files.pythonhosted.org/packages/4f/6b/13166c909ad2f2d76b929a4227c952630ebaf0d729f6317eb09cbceccbab/geocoder-1.38.1-py2.py3-none-any.whl (98kB)
Collecting ratelim (from geocoder)
  Downloading https://files.pythonhosted.org/packages/f2/98/7e6d147fd16a10a5f821db6e25f192265d6ecca3d82957a4fdd592cad49c/ratelim-0.1.6-py2.py3-none-any.whl
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6


Apparently, the Google geocoder will not work unless you have a Google Cloud account and an API key. Without proper credentials, you get the **REQUEST_DENIED** error shown below.

In [41]:
import geocoder as geo

In [44]:
# initialize your variable to None
lat_lng_coords = None
postal_code = 'M5A'

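# the retry loop is left commented out: without credentials every request is
# denied, so latlng would stay None and the loop would never end
# while(lat_lng_coords is None):
g = geo.google('{}, Toronto, Ontario'.format(postal_code))
print(g)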
lat_lng_coords = g.latlng

latitude = lat_lng_coords[0]
longitude = lat_lng_coords[1]

<[REQUEST_DENIED] Google - Geocode [empty]>


TypeError: 'NoneType' object is not subscriptable
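
The TypeError comes from indexing `g.latlng` when it is `None`. A small guard avoids the crash, although it does not make the request succeed without credentials. This is only a sketch: `safe_latlng` is a hypothetical helper, and the `SimpleNamespace` objects stand in for geocoder results so the example runs without network access.

```python
from types import SimpleNamespace

def safe_latlng(result):
    """Return (lat, lng), or None when the result carries no coordinates."""
    coords = getattr(result, "latlng", None)
    if not coords:
        return None
    return coords[0], coords[1]

# Stand-ins for geocoder results (hypothetical values, for illustration only)
denied = SimpleNamespace(latlng=None)
found = SimpleNamespace(latlng=[43.65, -79.36])

print(safe_latlng(denied))
print(safe_latlng(found))
```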

So we go ahead and use the provided CSV file of coordinates instead.

In [46]:
df_ll = pd.read_csv('https://cocl.us/Geospatial_data')
df_ll.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [56]:
df_hoods.shape

(211, 3)

In [57]:
# Merge neighborhoods (scraped from Wikipedia) with the latitude/longitude of each postcode.
# Rename 'Postal Code' to 'Postcode' so the key column has the same name in both data frames.
df_geo = pd.merge(
    df_post,
    df_ll.rename(columns={'Postal Code': 'Postcode'}),
    on='Postcode', how='inner')
df_geo.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


In [59]:
df_geo.shape

(103, 5)
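
The row count matching `df_post` suggests the inner merge dropped nothing. A more explicit check is possible with the `validate` and `indicator` parameters of `pd.merge`. The sketch below uses toy frames; the postcode 'M9Z' and its borough are made up to show what an unmatched row looks like.

```python
import pandas as pd

# Toy frames for illustration only; 'M9Z' is a made-up postcode with no coordinates
left = pd.DataFrame({
    "Postcode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
})
right = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

checked = pd.merge(
    left,
    right.rename(columns={"Postal Code": "Postcode"}),
    on="Postcode",
    how="left",               # keep every postcode from the left frame
    validate="one_to_one",    # raise if a postcode repeats on either side
    indicator=True,           # adds a '_merge' column: 'both' or 'left_only'
)

# Postcodes that found no coordinates
missing = checked.loc[checked["_merge"] == "left_only", "Postcode"].tolist()
print(missing)
```

An empty `missing` list on the real data would confirm that every postcode received coordinates.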

In [64]:
# Find the neighborhoods where the Borough contains 'Toronto'
df_toronto = df_geo[df_geo.Borough.str.contains("Toronto")].reset_index(drop=True)
df_toronto

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"The Beaches West,India Bazaar",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
5,M4P,Central Toronto,Davisville North,43.712751,-79.390197
6,M4R,Central Toronto,North Toronto West,43.715383,-79.405678
7,M4S,Central Toronto,Davisville,43.704324,-79.38879
8,M4T,Central Toronto,"Moore Park,Summerhill East",43.689574,-79.38316
9,M4V,Central Toronto,"Deer Park,Forest Hill SE,Rathnelly,South Hill,...",43.686412,-79.400049
