
## This notebook will contain the Final Capstone Project for IBM's Data Science Professional Certificate

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests

In [2]:
!pip install geocoder



In [3]:
print("Hello Capstone Project Course!")

Hello Capstone Project Course!


For this project, I will use clustering algorithms to distinguish neighborhoods in Toronto, Canada. The data comes from the Wikipedia table of Toronto postal codes found [here](https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&direction=prev&oldid=1012023397).

Let's start by storing the URL of the page and then use the requests module to fetch its HTML source.

In [4]:
#url = 'https://www.zipcodesonline.com/2020/06/postal-code-of-toronto-in-2020.html'
url = 'https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&direction=prev&oldid=1012023397'

In [5]:
toronto_data = requests.get(url).text

A BeautifulSoup object is created to parse the HTML, and the first table on the page is selected.

In [6]:
soup = BeautifulSoup(toronto_data, 'html5lib')

In [7]:
toronto_table = soup.find('table')

In [8]:
column_list = ['Postal Code', 'Neighborhood', 'Borough']
rows = []
for row in toronto_table.tbody.find_all('tr'):
    cols = row.find_all('td')
    if cols:  # header rows contain <th> cells only, so they are skipped
        zipcode = cols[0].text.strip()
        borough = cols[1].text.strip()
        neighborhood = cols[2].text.strip()
        rows.append({'Postal Code': zipcode, 'Neighborhood': neighborhood, 'Borough': borough})

# Build the dataframe once; DataFrame.append was removed in pandas 2.0
toronto_df = pd.DataFrame(rows, columns=column_list)

print(toronto_df.head())
print(toronto_df.shape)

  Postal Code               Neighborhood           Borough
0         M1A               Not assigned      Not assigned
1         M2A               Not assigned      Not assigned
2         M3A                  Parkwoods        North York
3         M4A           Victoria Village        North York
4         M5A  Regent Park, Harbourfront  Downtown Toronto
(180, 3)
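
The extraction loop can be tried offline on a small inline table. The snippet below is a minimal sketch, not the notebook's actual run: `sample_html` is made-up data that only mimics the column order of the Wikipedia table, and the names `sample_soup`, `sample_table`, `records`, and `sample_df` are hypothetical. It assumes the built-in `html.parser` backend instead of `html5lib`, so it searches for rows directly under `<table>` rather than under `tbody`.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical two-row table with the same column order as the Wikipedia page
sample_html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

# 'html.parser' ships with Python; unlike 'html5lib' it does not insert a
# <tbody> element, so the rows are searched directly under <table>.
sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_table = sample_soup.find('table')

records = []
for tr in sample_table.find_all('tr'):
    cells = [td.text.strip() for td in tr.find_all('td')]
    if len(cells) == 3:  # the header row has <th> cells only, so it is skipped
        postal_code, borough, neighborhood = cells
        records.append({'Postal Code': postal_code,
                        'Neighborhood': neighborhood,
                        'Borough': borough})

# Build the dataframe in one step from the collected records
sample_df = pd.DataFrame(records, columns=['Postal Code', 'Neighborhood', 'Borough'])
print(sample_df)
```

Collecting the rows in a list and building the dataframe once avoids growing a dataframe row by row, which gets slow as the table grows.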


Having parsed the table, we see that several postal codes have not been assigned to a neighborhood or borough yet. Let's remove the rows for these unassigned areas with the drop method. In this table, no row has only one of the Borough and Neighborhood columns unassigned, so checking Borough alone and dropping the whole row is enough.

In [9]:
toronto_df.drop(labels=toronto_df[toronto_df['Borough'] == 'Not assigned'].index,axis=0,inplace=True)
toronto_df.head()

  Postal Code                                 Neighborhood           Borough
2         M3A                                    Parkwoods        North York
3         M4A                             Victoria Village        North York
4         M5A                    Regent Park, Harbourfront  Downtown Toronto
5         M6A             Lawrence Manor, Lawrence Heights        North York
6         M7A  Queen's Park, Ontario Provincial Government  Downtown Toronto


In [10]:
toronto_df.shape

(103, 3)
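
The claim that Borough and Neighborhood are always unassigned together can be checked before dropping anything. The sketch below runs on a made-up four-row sample rather than on `toronto_df`; the names `sample_df`, `borough_missing`, `neighborhood_missing`, `partially_assigned`, and `cleaned_df` are hypothetical. It also swaps in boolean-mask filtering as an alternative to `drop`, and it resets the index, which the cell above does not do.

```python
import pandas as pd

# Hypothetical sample in the same shape as toronto_df
sample_df = pd.DataFrame({
    'Postal Code': ['M1A', 'M2A', 'M3A', 'M4A'],
    'Neighborhood': ['Not assigned', 'Not assigned', 'Parkwoods', 'Victoria Village'],
    'Borough': ['Not assigned', 'Not assigned', 'North York', 'North York'],
})

borough_missing = sample_df['Borough'] == 'Not assigned'
neighborhood_missing = sample_df['Neighborhood'] == 'Not assigned'

# Rows where exactly one of the two fields is unassigned (expected: none)
partially_assigned = sample_df[borough_missing ^ neighborhood_missing]

# Boolean-mask filtering: keep the assigned rows and renumber the index
cleaned_df = sample_df[~borough_missing].reset_index(drop=True)
print(len(partially_assigned), cleaned_df.shape)
```

If `partially_assigned` were not empty on the real data, those rows would need separate handling instead of being dropped or kept wholesale.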

In [11]:
# @hidden
# Unused alternative: geocode each postal code with the geocoder package.
# import geocoder
#
# coords = {}
# for postal_code in toronto_df['Postal Code']:
#     lat_lng_coords = None  # reset for every postal code
#     # loop until you get the coordinates
#     while lat_lng_coords is None:
#         g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
#         lat_lng_coords = g.latlng
#     coords[postal_code] = (lat_lng_coords[0], lat_lng_coords[1])

Let's download a CSV file that contains the latitude and longitude of each postal code and then load it into a dataframe called geo_df.

In [12]:
 !wget 'https://cocl.us/Geospatial_data'

--2021-03-21 02:24:24--  https://cocl.us/Geospatial_data
Resolving cocl.us (cocl.us)... 52.116.127.226, 52.116.127.228
Connecting to cocl.us (cocl.us)|52.116.127.226|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2021-03-21 02:24:25--  https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Resolving ibm.box.com (ibm.box.com)... 107.152.26.197
Connecting to ibm.box.com (ibm.box.com)|107.152.26.197|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2021-03-21 02:24:25--  https://ibm.box.com/public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Reusing existing connection to ibm.box.com:443.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.ent.box.com/public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [followin

In [13]:
geo_df = pd.read_csv('/content/Geospatial_data')
geo_df

     Postal Code   Latitude   Longitude
0            M1B  43.806686  -79.194353
1            M1C  43.784535  -79.160497
2            M1E  43.763573  -79.188711
3            M1G  43.770992  -79.216917
4            M1H  43.773136  -79.239476
...          ...        ...         ...
98           M9N  43.706876  -79.518188
99           M9P  43.696319  -79.532242
100          M9R  43.688905  -79.554724
101          M9V  43.739416  -79.588437
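
As a side note on the loading step, `pd.read_csv` accepts a local path, a URL, or a file-like object, so the same call can be exercised without the downloaded file. The sketch below is illustrative only: the helper `load_geo` and the constant `EXPECTED_COLUMNS` are hypothetical names, and the in-memory sample reuses three rows from the output above instead of reading the real file. Whether the shortened download link resolves when passed directly to `read_csv` is not verified here.

```python
import io
import pandas as pd

# Columns the rest of the notebook relies on
EXPECTED_COLUMNS = ['Postal Code', 'Latitude', 'Longitude']


def load_geo(source):
    """Load the geospatial CSV and check that it has the expected columns.

    `source` may be a local path, a URL, or a file-like object.
    """
    df = pd.read_csv(source)
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f'missing columns: {missing}')
    return df


# In-memory stand-in for the downloaded file
sample_csv = io.StringIO(
    'Postal Code,Latitude,Longitude\n'
    'M1B,43.806686,-79.194353\n'
    'M1C,43.784535,-79.160497\n'
    'M1E,43.763573,-79.188711\n'
)
sample_geo_df = load_geo(sample_csv)
print(sample_geo_df.dtypes)
```

Checking the column names right after loading makes a renamed or missing column fail early, before the join on Postal Code.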


Now that we have a dataframe with the coordinates for each postal code, we can perform an inner join between toronto_df and geo_df on the Postal Code column to collect all of the information in a single dataframe.

In [14]:
toronto_geo_df = pd.merge(toronto_df, geo_df, how='inner',on='Postal Code')
toronto_geo_df


     Postal Code                                       Neighborhood           Borough   Latitude   Longitude
0            M3A                                          Parkwoods        North York  43.753259  -79.329656
1            M4A                                   Victoria Village        North York  43.725882  -79.315572
2            M5A                          Regent Park, Harbourfront  Downtown Toronto  43.654260  -79.360636
3            M6A                   Lawrence Manor, Lawrence Heights        North York  43.718518  -79.464763
4            M7A        Queen's Park, Ontario Provincial Government  Downtown Toronto  43.662301  -79.389494
...          ...                                                ...               ...        ...         ...
98           M8X      The Kingsway, Montgomery Road, Old Mill North         Etobicoke  43.653654  -79.506944
99           M4Y                               Church and Wellesley  Downtown Toronto  43.665860  -79.383160
100          M7Y  Business reply mail Processing Centre, South C...      East Toronto  43.662744  -79.321558
101          M8Y  Old Mill South, King's Mill Park, Sunnylea, Hu...         Etobicoke  43.636258  -79.498509
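
An inner join silently discards any postal code that has no match in the other dataframe, so it can be worth checking for unmatched rows first. The sketch below uses two made-up frames (`left`, `right`) rather than `toronto_df` and `geo_df`; the postal code `M9Z` and its borough are invented so that one row has no coordinates. It relies on the `validate` and `indicator` parameters of `pd.merge`.

```python
import pandas as pd

# Hypothetical samples; 'M9Z' has no coordinates on purpose
left = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M9Z'],
    'Borough': ['North York', 'North York', 'Etobicoke'],
})
right = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# validate raises MergeError if a postal code is duplicated on either side;
# indicator adds a '_merge' column showing where each row came from
check = pd.merge(left, right, how='left', on='Postal Code',
                 validate='one_to_one', indicator=True)

# Postal codes present on the left but missing from the coordinates table
unmatched = check.loc[check['_merge'] == 'left_only', 'Postal Code'].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would confirm that the inner join keeps every postal code.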


In [15]:
#!pip install folium==0.5.0
from geopy.geocoders import Nominatim
import folium
from IPython.display import HTML, display

In [16]:
address = 'Toronto, CA'

geolocator = Nominatim(user_agent='toronto_explorer')
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

In [17]:
# Center the map on the coordinates returned by the geocoder
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_geo_df['Latitude'], toronto_geo_df['Longitude'], toronto_geo_df['Borough'], toronto_geo_df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto