## Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto
As part of the assignment in the '__IBM Data Science Professional Course__', we scrape one of the tables from `Wikipedia`. After cleaning the data, we run the postal codes through the Google Maps `Geocoding` API to get the latitude and longitude of each one. Finally, we explore and cluster the results to draw out some meaningful insights.

In [None]:
# Importing libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

### Web Scraping using BeautifulSoup
For this assignment, we are only interested in the table of Canadian postal codes that begin with M.

In [None]:
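# Download the Wikipedia page and parse its HTML
wiki_link = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
source = requests.get(wiki_link)
source.raise_for_status()  # stop early if the page could not be fetched
page_content = BeautifulSoup(source.content, "html.parser")

# The postal code table is the first <table> element on the page
table = page_content.find_all('table')[0]
df = pd.read_html(str(table))[0]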

### Data cleaning steps include:
- Only process the cells that have an assigned borough. Ignore cells whose borough is `Not assigned`.
- More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated by a comma, as shown in row 11 in the above table.
- If a cell has a borough but a `Not assigned` neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of both the Borough and the Neighborhood columns will be Queen's Park.
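The three rules above can be sketched on a small hand-made table. This is only an illustration: the rows in `raw` are stand-ins typed in by hand rather than scraped data, and the variable names `raw`, `missing`, and `cleaned` are assumptions made for the example.

```python
import pandas as pd

# Hand-made rows standing in for the scraped table (illustrative values only)
raw = pd.DataFrame({
    "Postcode": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
})

# Rule 1: keep only rows that have an assigned borough
cleaned = raw[raw["Borough"] != "Not assigned"].copy()

# Rule 3: where the neighbourhood is unassigned, reuse the borough name
missing = cleaned["Neighbourhood"] == "Not assigned"
cleaned.loc[missing, "Neighbourhood"] = cleaned.loc[missing, "Borough"]

# Rule 2: merge rows that share a postcode into one comma-separated row
cleaned = (
    cleaned.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)

print(cleaned)
```

Applying rule 3 before rule 2 means an unassigned neighbourhood is replaced by its borough before any rows are joined, so the literal text `Not assigned` never ends up inside a combined string.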


In [39]:
# Renaming columns
df.columns = ('Postcode','Borough','Neighbourhood')
df = df[1:]

# Removing where Borough = 'Not assigned'
df = df[df['Borough'] != 'Not assigned']
df.reset_index(drop=True,inplace=True)

# Combining neighbourhoods that share a postcode. Two ways of doing this are shown;
# the second is used because it produces a cleaner result.
# df = df.groupby(by=['Postcode','Borough'],as_index=False)['Neighbourhood'].apply(lambda x: "%s" % ', '.join(x['Neighbourhood'])).to_frame()
df = df.groupby(by=['Postcode','Borough'],as_index=False)['Neighbourhood'].apply(', '.join).reset_index()
df.columns = ['Postcode','Borough','Neighbourhoods']

# Dropping rows whose neighbourhood is still 'Not assigned'
# (the third cleaning rule would instead reuse the borough name for these rows)
df = df[df['Neighbourhoods'] != 'Not assigned']

In [40]:
df.shape

(102, 3)

### Google Maps Geocoding API
- Get the latitude and longitude for each postal code.

In [42]:
!pip install geocoder

In [None]:
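import geocoder

# initialize the result to None
lat_lng_coords = None
postal_code = 'M4M'
max_attempts = 10  # assumed cap so the loop cannot run forever
attempts = 0

# retry until coordinates come back or the attempts run out
while lat_lng_coords is None and attempts < max_attempts:
    g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
    lat_lng_coords = g.latlng
    attempts += 1

if lat_lng_coords is None:
    raise RuntimeError('No coordinates returned for {}'.format(postal_code))

latitude = lat_lng_coords[0]
longitude = lat_lng_coords[1]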
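
The retry loop above handles a single postal code. A minimal sketch of how it might be wrapped into a reusable helper follows. The function name `geocode_with_retry`, its parameters, and the `fake_lookup` stand-in are all hypothetical, introduced only for illustration. The lookup is passed in as a callable so the example runs without network access or an API key, and the coordinates returned by the stand-in are made-up numbers.

```python
import time


def geocode_with_retry(postal_code, lookup, max_attempts=5, pause_seconds=0.0):
    """Return (latitude, longitude) for a postal code, or None if every attempt fails.

    `lookup` is any callable that takes a query string and returns
    [latitude, longitude] or None.
    """
    query = '{}, Toronto, Ontario'.format(postal_code)
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords[0], coords[1]
        time.sleep(pause_seconds)  # brief pause before trying again
    return None


# Stand-in lookup for demonstration: fails once, then returns made-up coordinates
calls = {"n": 0}


def fake_lookup(query):
    calls["n"] += 1
    return None if calls["n"] < 2 else [43.0, -79.0]


coords = geocode_with_retry('M4M', fake_lookup)
print(coords)
```

With the `geocoder` package, the lookup could be something like `lambda q: geocoder.google(q).latlng`, and the helper could then be called once per value in `df['Postcode']` to build latitude and longitude columns.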