# Analyzing neighborhoods in Toronto

This notebook obtains Toronto's postal codes from the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M and finds the corresponding latitudes and longitudes.

The final dataframe consists of six columns, including the index: index, Postcode, Borough, Neighbourhood, Latitude and Longitude. It is obtained in five steps:
1. Read the Wiki page and parse the HTML using BeautifulSoup
2. Parse through the postcode table and create a dataframe with contents
3. Cleanse the dataframe and retain useful rows
4. Read a csv file containing geographical coordinates
5. Retrieve latitude and longitude from the csv dataframe based on Postcode

The dimensions of the final dataframe are displayed at the end of the notebook.

## 1. Read the Wiki page and parse the HTML using BeautifulSoup

In [35]:
from bs4 import BeautifulSoup
import pandas as pd
import requests

response = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
soup = BeautifulSoup(response.text, "html.parser")
    
# Find the postcode table section, which is the first table on the Wiki page
table = soup.table

## 2. Parse through the postcode table and create a dataframe with contents

In [36]:
# Read column titles and find number of rows
column_names = []
n_rows = 0
for row in table.find_all('tr'):
    # Find heading row <th> tags
    th_tags = row.find_all('th') 
    if len(th_tags) > 0 and len(column_names) == 0:
        for th in th_tags:
            column_names.append(th.get_text())
    # Count rows with data
    td_tags = row.find_all('td')
    if len(td_tags) > 0:
        n_rows+=1

# Create a dataframe and read data into it
df = pd.DataFrame(columns = column_names,index = range(0,n_rows))
row_marker = 0
for row in table.find_all('tr'):
    column_marker = 0
    columns = row.find_all('td')
    for column in columns:
        df.iat[row_marker,column_marker] = column.get_text()
        column_marker += 1
    if len(columns) > 0:
        row_marker += 1                  

# Remove newline \n characters
df = df.replace('\n','', regex=True)
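
The two-pass approach above can also be written as a single pass that collects each row into a list and builds the dataframe once. The following is a minimal sketch on a small inline HTML table. The `html` string and its rows are a made-up stand-in, not the Wikipedia page itself, and stripping whitespace from header and cell text up front is an assumption about what is wanted rather than what the cell above does.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in for the Wikipedia table, for illustration only
html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

table = BeautifulSoup(html, "html.parser").table

header = []
rows = []
for tr in table.find_all("tr"):
    th_tags = tr.find_all("th")
    td_tags = tr.find_all("td")
    if th_tags and not header:
        # strip() removes the trailing newline that get_text() keeps
        header = [th.get_text().strip() for th in th_tags]
    elif td_tags:
        rows.append([td.get_text().strip() for td in td_tags])

# Build the dataframe in one step from the collected rows
sketch_df = pd.DataFrame(rows, columns=header)
print(sketch_df)
```

With the column names stripped this way, the later cells would refer to `Neighbourhood` rather than `Neighbourhood\n`.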


## 3. Cleanse the dataframe and retain useful rows

In [37]:
# Remove rows with a borough that is Not assigned
df = df[df.Borough!='Not assigned']

# Make Neighbourhood values same as Borough when Not assigned
df['Neighbourhood\n'] = df['Neighbourhood\n'].mask(df['Neighbourhood\n'] == 'Not assigned', df['Borough'])

# Combining rows with the same postcode, with the neighborhoods separated with a comma
df['Neighbourhood\n'] = df.groupby(['Postcode','Borough'])['Neighbourhood\n'].transform(lambda x: ', '.join(x))
df = df.drop_duplicates()

df = df.reset_index()
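
As an alternative to `transform` followed by `drop_duplicates`, the same cleansing can be sketched with `groupby(...).agg(...)`, which returns one row per postcode directly. The toy rows below are invented for illustration, and the column is named `Neighbourhood` without the trailing newline, which is an assumption that differs from the scraped dataframe.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "Postcode": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
})

# Drop rows whose borough is not assigned
toy = toy[toy["Borough"] != "Not assigned"].copy()

# Fall back to the borough name where the neighbourhood is not assigned
unassigned = toy["Neighbourhood"] == "Not assigned"
toy.loc[unassigned, "Neighbourhood"] = toy.loc[unassigned, "Borough"]

# One row per (Postcode, Borough), neighbourhoods joined with a comma
cleaned = (
    toy.groupby(["Postcode", "Borough"])["Neighbourhood"]
       .agg(", ".join)
       .reset_index()
)
print(cleaned)
```

Because the aggregation already collapses each group to one row, no separate de-duplication step is needed in this variant.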


## Dimensions of the final dataframe

In [38]:
print("(Rows, Columns) - ",df.shape)

(Rows, Columns) -  (103, 4)


## 4. Find latitude and longitude for each postal code

Note: An attempt was made to use the geocoder library, but it was too slow. The csv file with geographical coordinates was therefore used to obtain the latitude and longitude.

In [48]:
# Install geocoder library

# !pip install geocoder

In [22]:
# Using geocoder to retrieve coordinates (left commented out: too slow in practice)

#import geocoder
#
#for row in df.itertuples(index=True, name='Pandas'):
#
#    # initialize the coordinates to None and loop until they are returned
#    lat_lng_coords = None
#    while(lat_lng_coords is None):
#        g = geocoder.google('{}, Toronto, Ontario'.format(getattr(row, "Postcode")))
#        lat_lng_coords = g.latlng
#
#    # itertuples rows are read-only, so write the result back through the dataframe
#    df.at[row.Index, 'Latitude'] = lat_lng_coords[0]
#    df.at[row.Index, 'Longitude'] = lat_lng_coords[1]
#
#df.head(20)

KeyboardInterrupt: 
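
The commented-out loop above retries without limit whenever no coordinates come back, so a single unresponsive lookup can stall the whole cell. A minimal sketch of a bounded retry follows. It uses a hypothetical `lookup` callable in place of the real geocoder call, and `lookup_with_retry`, `max_attempts` and `delay_seconds` are illustrative names, not part of any library.

```python
import time

def lookup_with_retry(lookup, query, max_attempts=3, delay_seconds=0.0):
    """Call lookup(query) until it returns a non-None result or attempts run out."""
    for _ in range(max_attempts):
        result = lookup(query)
        if result is not None:
            return result
        time.sleep(delay_seconds)
    # Give up instead of looping forever
    return None

# Hypothetical stand-in for a geocoding call: fails once, then succeeds
calls = {"count": 0}

def fake_lookup(query):
    calls["count"] += 1
    return None if calls["count"] < 2 else [43.753259, -79.329656]

coords = lookup_with_retry(fake_lookup, "M3A, Toronto, Ontario")
always_none = lookup_with_retry(lambda q: None, "M3A, Toronto, Ontario")
print(coords, always_none)
```

A caller can then decide what to do with postcodes that still return `None`, for example leaving their coordinates empty.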

In [40]:
# Reading csv file containing geographical coordinates into a dataframe

import types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share your notebook.
client_e8764c861c054a20afb6b04c6ac44ea0 = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='<YOUR_IBM_API_KEY>',  # credential removed before sharing
    ibm_auth_endpoint="https://iam.bluemix.net/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3-api.us-geo.objectstorage.service.networklayer.com')

body = client_e8764c861c054a20afb6b04c6ac44ea0.get_object(Bucket='courseracapstone-donotdelete-pr-keavguhvhxmj3o',Key='Geospatial_Coordinates.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df_data_1 = pd.read_csv(body)
df_data_1.rename(columns={'Postal Code': 'Postcode'}, inplace=True)
df_data_1.head()



| | Postcode | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |
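
`pd.read_csv` accepts any file-like object, which is why the object-storage body can be passed to it directly. The sketch below shows the same read-and-rename step with an in-memory `io.StringIO` buffer as a hypothetical stand-in for the storage body, so it runs without cloud credentials. The two rows are copied from the output above.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for the object-storage body
csv_text = (
    "Postal Code,Latitude,Longitude\n"
    "M1B,43.806686,-79.194353\n"
    "M1C,43.784535,-79.160497\n"
)
body_like = io.StringIO(csv_text)

# Read the buffer and rename the key column to match the scraped dataframe
coords_df = pd.read_csv(body_like).rename(columns={"Postal Code": "Postcode"})
print(coords_df)
```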


## 5. Retrieve latitude and longitude from the csv dataframe based on Postcode

In [47]:
# Merging the two datasets on Postcode

df_latlon = pd.merge(df,df_data_1,on='Postcode')
df_latlon.head()

| | index | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|---|
| 0 | 2 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | 3 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | 4 | M5A | Downtown Toronto | Harbourfront, Regent Park | 43.65426 | -79.360636 |
| 3 | 6 | M6A | North York | Lawrence Heights, Lawrence Manor | 43.718518 | -79.464763 |
| 4 | 8 | M7A | Queen's Park | Queen's Park | 43.662301 | -79.389494 |
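
`pd.merge` performs an inner join by default, so any postcode without a match in the coordinates file would be dropped silently. One way to check for this, sketched here on toy frames, is a left join with `indicator=True`. The postcode `M9Z` and the borough `Example` are made up to force a miss. The other values are taken from the output above.

```python
import pandas as pd

# Toy frames for illustration; M9Z is a made-up postcode with no coordinates
left = pd.DataFrame({
    "Postcode": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Example"],
})
right = pd.DataFrame({
    "Postcode": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# how="left" keeps every row of `left`; indicator adds a `_merge` column
merged = pd.merge(left, right, on="Postcode", how="left", indicator=True)

# Postcodes that found no coordinates
missing = merged.loc[merged["_merge"] == "left_only", "Postcode"].tolist()
print(missing)
```

An empty `missing` list would indicate that every postcode in the scraped dataframe has coordinates.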
