# Parts 1 and 2: Scraping and Preprocessing Toronto Neighborhood Data

### Part 1: Scraping the Toronto postal code HTML table from Wikipedia

In [21]:
import pandas as pd

Read the Toronto postal code table from Wikipedia using the pandas read_html function. It returns a list of DataFrames, one per table found on the page, so [0] selects the first:

In [22]:
df_all = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M', header=0)[0]
df_all.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


Remove rows where the borough is 'Not assigned':

In [23]:
df_int = df_all[df_all.Borough != 'Not assigned'].reset_index(drop=True)
df_int.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor


Set 'Not assigned' neighborhoods to the name of their borough:

In [24]:
# Fall back to the borough name wherever the neighbourhood is 'Not assigned'
finalneigh = []
for b, n in zip(df_int.Borough, df_int.Neighbourhood):
    finalneigh.append(b if n == 'Not assigned' else n)

df_int['Neighbourhood'] = finalneigh
df_int.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor
5,M7A,Downtown Toronto,Queen's Park
6,M9A,Queen's Park,Queen's Park
7,M1B,Scarborough,Rouge
8,M1B,Scarborough,Malvern
9,M3B,North York,Don Mills North
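The same fallback can also be written without an explicit loop. The following is a minimal sketch on a made-up toy frame (the rows and the name toy are illustrative, not the scraped data), using Series.mask to substitute the borough wherever the neighbourhood is 'Not assigned':

```python
import pandas as pd

# Toy frame with the same columns as df_int; the values are illustrative only.
toy = pd.DataFrame({
    'Postcode': ['M5A', 'M7A', 'M9A'],
    'Borough': ['Downtown Toronto', "Queen's Park", 'Etobicoke'],
    'Neighbourhood': ['Harbourfront', 'Not assigned', 'Not assigned'],
})

# Where the condition is True, take the index-aligned value from Borough;
# everywhere else the original neighbourhood is kept.
toy['Neighbourhood'] = toy['Neighbourhood'].mask(
    toy['Neighbourhood'] == 'Not assigned', toy['Borough']
)
print(toy)
```

Both versions should give the same column; the vectorized form just avoids building an intermediate Python list.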


Aggregate the neighborhoods by postcode, joining the names with commas. Note that groupby sorts the result by postcode:

In [25]:
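df_proc = df_int.groupby('Postcode').agg({
    # every row of a postcode carries the same borough in this data, so keep the first
    'Borough': 'first',
    # join all neighbourhood names of the postcode into one string
    'Neighbourhood': lambda x: ', '.join(x)
})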

df_proc = df_proc.reset_index()
df_proc.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
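To see what the aggregation does in isolation, here is a small sketch on three hand-written rows (the frame toy and the result name merged are hypothetical). It assumes, as in the table above, that each postcode belongs to exactly one borough:

```python
import pandas as pd

# Three illustrative rows: M1B appears twice, M1G once.
toy = pd.DataFrame({
    'Postcode': ['M1B', 'M1B', 'M1G'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighbourhood': ['Rouge', 'Malvern', 'Woburn'],
})

# as_index=False keeps Postcode as a regular column, so no reset_index is needed.
merged = (
    toy.groupby('Postcode', as_index=False)
       .agg({'Borough': 'first', 'Neighbourhood': ', '.join})
)
print(merged)
```

The two M1B rows collapse into a single row whose Neighbourhood value is the comma-separated list, while M1G passes through unchanged.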


Check the final shape of the resulting DataFrame:

In [26]:
df_proc.shape

(103, 3)

### Part 2: Adding Latitude and Longitude Data to the DataFrame

In [27]:
import geocoder

Create a function to get the latitude and longitude for each postal code:

I used the ArcGIS provider from geocoder because the Google provider was returning a "request denied" error.

In [28]:
def get_latlon(postal_code):
    # start with no coordinates
    lat_lng_coords = None

    # query ArcGIS until coordinates come back
    # (note: this retries indefinitely if the service never returns a result)
    while lat_lng_coords is None:
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng

    # g.latlng is a [latitude, longitude] pair
    latitude, longitude = lat_lng_coords
    return latitude, longitude
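Because the loop above keeps asking until it gets an answer, a postal code the service cannot resolve would stall the notebook. One possible variation is to cap the number of attempts. The sketch below is not part of the original notebook: retry_lookup, max_attempts, delay_seconds and fake_lookup are all hypothetical names, and the geocoder call is replaced by a stand-in so the example runs without network access:

```python
import time

def retry_lookup(lookup, query, max_attempts=5, delay_seconds=0.0):
    """Call lookup(query) until it returns a non-None value or attempts run out."""
    for _ in range(max_attempts):
        result = lookup(query)
        if result is not None:
            return result
        # pause between attempts; 0.0 keeps this example instantaneous
        time.sleep(delay_seconds)
    # give up and let the caller decide how to handle a missing result
    return None

# Stand-in for a geocoding call: fails twice, then returns a made-up coordinate pair.
calls = {'count': 0}

def fake_lookup(query):
    calls['count'] += 1
    return None if calls['count'] < 3 else [43.0, -79.0]

coords = retry_lookup(fake_lookup, 'M1B, Toronto, Ontario')
print(coords, calls['count'])
```

With the real service, the lookup argument could be something like lambda q: geocoder.arcgis(q).latlng, and a None return would mark a postal code to inspect by hand.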

Use the function to add the coordinates to the DataFrame:

In [29]:
lats = []
lons = []

codes = df_proc['Postcode']

for code in codes:
    lat, lon = get_latlon(code)
    lats.append(lat)
    lons.append(lon)
    
df_proc['Latitude'] = lats
df_proc['Longitude'] = lons
df_proc

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.811525,-79.195517
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.785665,-79.158725
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.765815,-79.175193
3,M1G,Scarborough,Woburn,43.768369,-79.217590
4,M1H,Scarborough,Cedarbrae,43.769688,-79.239440
5,M1J,Scarborough,Scarborough Village,43.743125,-79.231750
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.726276,-79.263625
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.713054,-79.285055
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.724235,-79.227925
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.696770,-79.259967


Assign the result to the final DataFrame and save it to a pickle file so it can be loaded in a different notebook:

In [30]:
df = df_proc
df.to_pickle('./torontoneighborhoods.pkl')
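For reference, loading works with pd.read_pickle. The round trip below is a minimal sketch using a one-row toy frame and a temporary directory (the file name toy_neighborhoods.pkl and the values are made up), so it does not touch the file saved above:

```python
import os
import tempfile

import pandas as pd

# One illustrative row; the coordinates are rounded placeholders.
toy = pd.DataFrame({'Postcode': ['M1B'], 'Latitude': [43.81], 'Longitude': [-79.19]})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'toy_neighborhoods.pkl')
    toy.to_pickle(path)               # write, as in the cell above
    restored = pd.read_pickle(path)   # read back, as the next notebook would

print(restored.equals(toy))
```

In the follow-up notebook, the equivalent call would be pd.read_pickle('./torontoneighborhoods.pkl'). Pickle files should only be loaded from sources you trust.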