## Segmenting and Clustering Neighborhoods in Toronto - 2

#### Import libraries

In [3]:
import numpy as np        # library to handle data in a vectorized manner
import pandas as pd       # library for data analysis
from bs4 import BeautifulSoup   # package to parse the HTML of a web page
import requests           # library to send HTTP requests
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

In [2]:
#!pip install geopy
#!pip install beautifulsoup4

#### Download and preprocess dataset

We will scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to obtain the table of postal codes.

We use the BeautifulSoup package to parse the page's HTML, then load the contents of the table into a pandas dataframe.
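As a minimal sketch of this parsing step, the snippet below reads a table row by row instead of assuming that every three consecutive cells form one record. The `sample_html` string and the `table_to_dataframe` helper are hypothetical stand-ins introduced for illustration; the live Wikipedia page may be laid out differently.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table; the real page may differ.
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Harbourfront</td></tr>
</table>
"""

def table_to_dataframe(html):
    """Parse the first <table> in `html` row by row into a DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) == 3:  # skips the header row (it has <th> cells only)
            rows.append(cells)
    return pd.DataFrame(rows, columns=["Postal Code", "Borough", "Neighborhood"])

sample_df = table_to_dataframe(sample_html)
print(sample_df)
```

Iterating over `tr` elements keeps each record aligned with its own row, so a row with a missing or extra cell is skipped rather than shifting every record after it.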

In [4]:
#download data and parse it
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
source = requests.get(url).text
soup = BeautifulSoup(source, 'html.parser')

In [5]:
# extract the cells of the postal code table; the loop below assumes exactly three cells per row
table = soup.find('table')
td = table.find_all('td')
postcode = []
borough = []
neighborhood = []

for i in range(0, len(td), 3):
    postcode.append(td[i].text.strip())
    borough.append(td[i+1].text.strip())
    neighborhood.append(td[i+2].text.strip())

# The dataframe consists of three columns: Postal Code, Borough, and Neighborhood
df = pd.DataFrame({'Postal Code': postcode, 'Borough': borough, 'Neighborhood': neighborhood})

# Drop rows with a borough that is 'Not assigned'
# (assign the result back instead of calling inplace=True on a column selection)
df['Borough'] = df['Borough'].replace('Not assigned', np.nan)
df = df.dropna(subset=['Borough'])

# If a row has a Borough but a 'Not assigned' Neighborhood, use the Borough name as the Neighborhood
df['Neighborhood'] = df['Neighborhood'].mask(df['Neighborhood'] == 'Not assigned', df['Borough'])

# If more than one neighborhood exists in a postal code area, combine the rows into one row
# with the neighborhoods separated by commas
df = df.groupby(['Postal Code','Borough'])['Neighborhood'].apply(', '.join).reset_index()
#df.head(12)

In [6]:
# The number of rows of the dataframe
#df.shape
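The three cleaning rules can be checked on a small hand-made frame before running them on the scraped data. This is a sketch only: the rows in `toy` are made up to exercise each rule and are not taken from the Wikipedia table.

```python
import pandas as pd

# Made-up rows that mimic the scraped table (one row per neighborhood).
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
})

# 1. Drop rows whose borough is 'Not assigned'.
toy = toy[toy["Borough"] != "Not assigned"].copy()

# 2. Where the neighborhood is 'Not assigned', fall back to the borough name.
toy["Neighborhood"] = toy["Neighborhood"].mask(
    toy["Neighborhood"] == "Not assigned", toy["Borough"]
)

# 3. Collapse each postal code into one row with comma-separated neighborhoods.
toy = (
    toy.groupby(["Postal Code", "Borough"])["Neighborhood"]
    .apply(", ".join)
    .reset_index()
)
print(toy)
```

The result has two rows: `M5A` with the neighborhoods `Harbourfront, Regent Park`, and `M7A` with its neighborhood filled in from the borough name.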

We now have a dataframe holding the postal code of each neighborhood along with its borough and neighborhood names.
Next, we add the latitude and longitude coordinates of each postal code to the dataframe.
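An inner merge silently drops postal codes that have no coordinates, so it can help to check for unmatched keys first. The sketch below uses made-up frames (`left`, `right`) in place of the real data, and `M9Z` is a hypothetical code chosen to have no match; `validate` and `indicator` are standard `pd.merge` parameters.

```python
import pandas as pd

# Made-up frames standing in for the neighborhood and coordinate tables.
left = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M9Z"],  # 'M9Z' is hypothetical, no coordinates
    "Borough": ["Scarborough", "Scarborough", "Etobicoke"],
})
right = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# A left merge keeps every postal code; validate raises if keys are duplicated,
# and indicator adds a '_merge' column saying where each row came from.
checked = pd.merge(left, right, on="Postal Code", how="left",
                   validate="one_to_one", indicator=True)

# Postal codes that found no coordinates.
unmatched = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```

If `unmatched` is empty, an inner merge and a left merge give the same rows, and the inner merge is safe to use.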

In [7]:
# csv file that has the geographical coordinates of each postal code
url2 = 'http://cocl.us/Geospatial_data'
geospatial_df = pd.read_csv(url2)
geospatial_df.columns = ['Postal Code', 'Latitude', 'Longitude']
toronto_df = pd.merge(df, geospatial_df, on=['Postal Code'], how='inner')
toronto_df.head(12)

|   | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
| 5 | M1J | Scarborough | Scarborough Village | 43.744734 | -79.239476 |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park | 43.727929 | -79.262029 |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge | 43.711112 | -79.284577 |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West | 43.716316 | -79.239476 |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West | 43.692657 | -79.264848 |
