# Neighborhoods in Toronto

This notebook builds a Toronto neighborhoods dataset from the table on the Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Getting Dataset</a>

2. <a href="#item2">Geocoder: Latitude & Longitude</a>  
    
</font>
</div>

<a id='item1'></a>

## 1. Getting Dataset

The Wikipedia page is parsed with the BeautifulSoup package.

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [2]:
wiki_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html_text = requests.get(wiki_url).text
wiki_soup = BeautifulSoup(html_text, 'html.parser')
# wiki_soup

### Wiki Table:

In [3]:
wiki_table = wiki_soup.find("table", attrs={"class": "wikitable sortable"})
wiki_table_data = wiki_table.tbody.find_all("tr")[1:]  # contains all rows (exclude header[0])
# wiki_table_data

In [4]:
wiki_headers = [ th.text.replace('\n', ' ').strip() for th in wiki_table.find_all("th")]
df_table = pd.DataFrame(columns=wiki_headers) #['Postal code', 'Borough', 'Neighborhood']

In [5]:
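rows = []
for td_row in wiki_table_data:
    # str(td) keeps the raw markup, e.g. '<td>M1A\n</td>'; it is stripped in the cleanup cell
    rows.append(dict(zip(wiki_headers, map(str, td_row.find_all("td")))))
# DataFrame.append was removed in pandas 2.0, so the frame is built from the collected rows
df_table = pd.DataFrame(rows, columns=wiki_headers)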
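
The parsing steps above can be tried offline on a small hand-written table. This is a minimal sketch, not the notebook's exact code: `SAMPLE_HTML` and `parse_wikitable` are hypothetical names introduced here, and the sample rows only imitate the layout of the Wikipedia table. It extracts cell text with `get_text(strip=True)` instead of stripping the `<td>` markup afterwards.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (assumption: same table class and header row)
SAMPLE_HTML = """
<table class="wikitable sortable">
<tbody>
<tr><th>Postal code</th><th>Borough</th><th>Neighborhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td></td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</tbody>
</table>
"""

def parse_wikitable(html):
    """Return the first 'wikitable sortable' table in html as a DataFrame of strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", attrs={"class": "wikitable sortable"})
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr")[1:]:  # skip the header row
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        rows.append(dict(zip(headers, cells)))
    return pd.DataFrame(rows, columns=headers)

sample_df = parse_wikitable(SAMPLE_HTML)
print(sample_df)
```

Because the text is extracted directly, no later cleanup of `<td>` tags is needed in this variant.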

In [6]:
# Clean up the columns: strip the '<td>' and '\n</td>' markup left over from the parsed cells
df_table = df_table.astype(str)
for col in df_table.columns:
    df_table[col] = df_table[col].apply(lambda x: x.replace('<td>','').replace('\n</td>',''))
df_table.rename(columns={'Postal code':'PostalCode'},inplace=True)
fname = 'toronto_data.csv'
df_table.to_csv(fname,index=False)
print('Original dataframe has {} rows'.format(df_table.shape[0]))

Original dataframe has 180 rows


### Toronto Dataframe:

The dataframe consists of three columns: PostalCode, Borough, and Neighborhood.
* a) Only cells with an assigned borough are processed; cells whose borough is Not assigned are ignored.
* b) More than one neighborhood can exist in one postal code area. These are combined into one row, with the neighborhoods separated by commas.
* c) If a cell has a borough but a Not assigned neighborhood, the neighborhood is set to the borough.
* d) The .shape attribute is used to print the number of rows of the dataframe.
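
Rules a) to d) can be sketched on a toy table. This is an illustration under stated assumptions, not the notebook's code: `raw` and `clean_postal_table` are hypothetical names, and the toy data lists one neighborhood per row, so rule b) is done with a `groupby` rather than the string replacement used in the next cell.

```python
import pandas as pd

# Toy input (assumption: one neighborhood per row, unlike the already-combined Wikipedia cells)
raw = pd.DataFrame({
    "PostalCode": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Downtown Toronto"],
    "Neighborhood": ["Not assigned", "Regent Park", "Harbourfront", "Not assigned"],
})

def clean_postal_table(df):
    # a) keep only rows with an assigned borough
    out = df[df["Borough"] != "Not assigned"].copy()
    # c) a 'Not assigned' neighborhood falls back to the borough name
    missing = out["Neighborhood"] == "Not assigned"
    out.loc[missing, "Neighborhood"] = out.loc[missing, "Borough"]
    # b) one row per postal code, neighborhoods joined with a comma
    out = (
        out.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
        .agg(", ".join)
        .reset_index()
    )
    return out

cleaned = clean_postal_table(raw)
# d) number of rows via the .shape attribute
print("Cleaned dataframe has {} rows".format(cleaned.shape[0]))
```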

In [7]:
df_toronto = pd.read_csv(fname)
df_toronto.rename(columns={'Community':'Borough','Neighbourhood':'Neighborhood','Postal Code':'PostalCode'},inplace=True)
print(df_toronto.columns)
# a) Ignore "Not assigned" cells (.copy() avoids SettingWithCopyWarning on the assignments below)
df_toronto = df_toronto[df_toronto['Borough']!='Not assigned'].copy()
# b) Combine neighborhoods
df_toronto['Neighborhood'] = df_toronto['Neighborhood'].apply(lambda x: x.replace(' /',','))
# c) A "Not assigned" neighborhood takes the borough name
df_toronto.loc[df_toronto['Neighborhood'] == 'Not assigned' , 'Neighborhood'] = df_toronto['Borough']
df_toronto=df_toronto.reset_index(drop=True)
# d) shape
print('Toronto dataframe has {} rows'.format(df_toronto.shape[0]))

Index(['PostalCode', 'Borough', 'Neighborhood'], dtype='object')
Toronto dataframe has 103 rows


In [8]:
df_toronto.head()

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


<a id='item2'></a>

## 2. Geocoder: Latitude & Longitude 

To use the Foursquare location data, we need the latitude and longitude coordinates of each neighborhood. This section builds a dataframe that holds the postal code of each neighborhood along with its borough name, neighborhood name, and coordinates.

In [9]:
# import geocoder 
# initialize your variable to None
# lat_lng_coords = None
# postal_code = 'M5G'
# # loop until you get the coordinates
# while(lat_lng_coords is None):
#   g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
#   lat_lng_coords = g.latlng

# latitude = lat_lng_coords[0]
# longitude = lat_lng_coords[1]
# print(latitude,longitude)
# ERROR: <[REQUEST_DENIED] Google - Geocode [empty]>
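
The commented-out loop above retries until coordinates arrive, so it never ends when every request is denied. A bounded retry is one way around that. This is a minimal sketch: `fetch_latlng` and its `lookup` argument are hypothetical, with `lookup` standing in for any geocoding call that returns `[lat, lng]` or `None`.

```python
import time

def fetch_latlng(lookup, postal_code, max_attempts=3, delay_seconds=0.0):
    """Call lookup up to max_attempts times; return [lat, lng] or None if all attempts fail."""
    query = '{}, Toronto, Ontario'.format(postal_code)
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(delay_seconds)  # pause before the next attempt
    return None

# Stand-in for a geocoder whose requests are always denied (assumption, no network access)
def always_denied(query):
    return None

print(fetch_latlng(always_denied, 'M5G'))  # None
```

A `None` result then signals that a fallback source for the coordinates is needed.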

Because the Geocoder package can be very unreliable (the request above was denied), the geographical coordinates of each postal code are read from a CSV file instead: http://cocl.us/Geospatial_data

In [10]:
df_coord = pd.read_csv('Geospatial_Coordinates.csv')
df_coord.rename(columns={'Postal Code':'PostalCode'},inplace=True)
df_coord.head()

|   | PostalCode | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [11]:
df_geocoder = pd.merge(df_toronto, df_coord, how='inner',on=['PostalCode'])
df_geocoder.head()

|   | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
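
An inner merge silently drops postal codes that have no coordinates. One way to check for that, sketched here on toy frames, is a left merge with `indicator=True`. The names `toronto`, `coords`, and `unmatched` are introduced for illustration, and the postal code `M9Z` is a made-up value with no coordinate row.

```python
import pandas as pd

toronto = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M9Z"],  # M9Z is hypothetical
    "Borough": ["North York", "North York", "Example Borough"],
    "Neighborhood": ["Parkwoods", "Victoria Village", "Example Neighborhood"],
})
coords = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# how='left' keeps every Toronto row; the _merge column records where each row was found.
# validate='one_to_one' raises if a postal code is duplicated on either side.
checked = toronto.merge(coords, how="left", on="PostalCode",
                        indicator=True, validate="one_to_one")
unmatched = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print("Postal codes without coordinates:", unmatched)
```

In the notebook the row count stays at 103 after the merge, which suggests that every postal code found a match.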


In [12]:
df_geocoder.to_csv('toronto_neighborhoods.csv',index=False)
print('Toronto Neighborhoods dataframe has {} rows and {} columns'.format(df_geocoder.shape[0],df_geocoder.shape[1]))

Toronto Neighborhoods dataframe has 103 rows and 5 columns
