# Part 1 - Web Scraping a Wikipedia Page
### Looking for Postal Codes in Canada

In [1]:
import pandas as pd
import numpy as np
import requests

1. Get the HTML of the Wikipedia page.
2. Use `read_html` to convert the HTML into a list of DataFrame objects.
3. Remove the rows whose borough is 'Not assigned'.

In [2]:
page = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
wiki_page = requests.get(page)

wiki = pd.read_html(wiki_page.content, header = 0)[0]
df = wiki[wiki.Borough != 'Not assigned']
#df = df.groupby(['Postal Code']).first()
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [3]:
# Check whether every postal code appears only once
unique = df['Postal Code'].is_unique
print("Number of Unique Postal Codes = Total Postal Codes? Answer:", unique)

Number of Unique Postal Codes = Total Postal Codes? Answer: True


In [4]:
df[df['Neighbourhood'] == 'Not assigned']

Unnamed: 0,Postal Code,Borough,Neighbourhood


The empty result shows that no postal code has a borough but a 'Not assigned' neighbourhood, so the following rule does not need to be applied: <br>
>_If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough_
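
Had such rows existed, the rule could be applied with a boolean mask. The snippet below is only a sketch on made-up data: the `sample` frame, the `M9Z` row, and the `fill_missing_neighbourhood` helper are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def fill_missing_neighbourhood(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which 'Not assigned' neighbourhoods take the borough name."""
    out = frame.copy()
    missing = out["Neighbourhood"] == "Not assigned"
    # Copy the borough into the neighbourhood column only for the flagged rows
    out.loc[missing, "Neighbourhood"] = out.loc[missing, "Borough"]
    return out

# Hypothetical rows for illustration only
sample = pd.DataFrame({
    "Postal Code": ["M3A", "M9Z"],
    "Borough": ["North York", "Example Borough"],
    "Neighbourhood": ["Parkwoods", "Not assigned"],
})
filled = fill_missing_neighbourhood(sample)
print(filled)
```

The helper works on a copy, so the original frame is left unchanged.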

In [5]:
df.shape

(103, 3)

# Part 2 - Geocoding the Postal Codes found

In [8]:
!pip install geocoder
import geocoder



### The code below was supposed to get the latitude and longitude of every postal code, but the geocoder never returns coordinates, so the loop never terminates:
```python
# initialize your variable to None
lat_lng_coords = None

postal_code = 'M4A'

# loop until you get the coordinates
while(lat_lng_coords is None):
  g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
  lat_lng_coords = g.latlng

latitude = lat_lng_coords[0]
longitude = lat_lng_coords[1]
```
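
One way to keep a loop like this from running forever is to cap the number of attempts. The following is a minimal sketch, not the notebook's method: `geocode_with_retries`, `max_attempts`, `delay_seconds`, and the `always_empty` stand-in are hypothetical names introduced here, and the lookup is passed in as a plain callable so the sketch runs without network access.

```python
import time

def geocode_with_retries(lookup, query, max_attempts=3, delay_seconds=1.0):
    """Call lookup(query) until it returns coordinates or the attempts run out.

    `lookup` is any callable that returns a [lat, lng] pair, or None/empty on failure.
    Returns the coordinates, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        coords = lookup(query)
        if coords:
            return coords
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # brief pause before the next attempt
    return None

# Stand-in lookup that always fails, mimicking a geocoder that returns nothing
def always_empty(query):
    return None

result = geocode_with_retries(always_empty, "M4A, Toronto, Ontario",
                              max_attempts=2, delay_seconds=0)
print(result)
```

With the real library, the lookup could be something like `lambda q: geocoder.google(q).latlng`, as in the loop above. A `None` result then signals that a fallback source is needed.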
### So instead, I will use an online CSV file (loaded below) that contains the coordinates:

In [11]:
data = pd.read_csv("https://cocl.us/Geospatial_data")
data.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [12]:
print("Shape of wiki data is: ", df.shape)
print("Shape of csv data is: ", data.shape)

Shape of wiki data is:  (103, 3)
Shape of csv data is:  (103, 3)


In [13]:
df = df.join(data.set_index('Postal Code'), on='Postal Code', how='inner')
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
2,M3A,North York,Parkwoods,43.753259,-79.329656
3,M4A,North York,Victoria Village,43.725882,-79.315572
4,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
5,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
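
Because an inner join silently drops postal codes that are missing from either table, it can be worth confirming that every row found a match. The sketch below uses `DataFrame.merge` with its `validate` and `indicator` options on two miniature frames; the names `codes`, `coords`, `merged`, and `unmatched` are illustrative, and the values are copied from the outputs above.

```python
import pandas as pd

# Miniature stand-ins for the Wikipedia table and the coordinates CSV
codes = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
})
coords = pd.DataFrame({
    "Postal Code": ["M4A", "M3A"],
    "Latitude": [43.725882, 43.753259],
    "Longitude": [-79.315572, -79.329656],
})

# validate raises if a postal code is duplicated on either side;
# indicator adds a `_merge` column saying where each row came from
merged = codes.merge(coords, on="Postal Code", how="left",
                     validate="one_to_one", indicator=True)
unmatched = merged[merged["_merge"] != "both"]
print(len(unmatched), "postal codes without coordinates")
```

An empty `unmatched` frame means the left join and the inner join would return the same rows.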


# Part 3 - Clustering the Neighborhoods in Toronto
#### Clustering Toronto neighbourhoods by the similarity of their venue categories, using k-means and the Foursquare API

In [None]:
!conda install -c conda-forge geocoder --yes
import geocoder
from geopy.geocoders import Nominatim 

In [None]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))