# Neighborhood Clustering
## Author: Jabe Hickey

Start by importing the urllib3, BeautifulSoup4, and pandas packages.

In [18]:
# import libraries
import urllib3
import pandas as pd
from bs4 import BeautifulSoup



Use urllib3 to fetch the Wikipedia page that contains the postal codes for Toronto, then parse the response into a BeautifulSoup object.

In [87]:
scrape_page = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
http = urllib3.PoolManager()
response = http.request('GET', scrape_page)
soup = BeautifulSoup(response.data,'lxml')
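The cell above parses whatever the server returns, even an error page. As a minimal sketch of a more defensive version, the helper below checks the HTTP status before parsing. The name `fetch_soup`, its `http` and `parser` parameters, and the choice of `RuntimeError` are my own assumptions for illustration and are not part of the original notebook.

```python
import urllib3
from bs4 import BeautifulSoup


def fetch_soup(url, http=None, parser="html.parser"):
    """Fetch url and return a BeautifulSoup object (hypothetical helper).

    Raises RuntimeError when the server does not answer with HTTP 200.
    """
    # Reuse a caller-supplied pool if given, otherwise create one.
    http = http or urllib3.PoolManager()
    response = http.request("GET", url)
    if response.status != 200:
        raise RuntimeError(f"GET {url} returned HTTP {response.status}")
    # response.data holds the raw bytes of the page body.
    return BeautifulSoup(response.data, parser)
```

With this in place, `soup = fetch_soup(scrape_page, parser='lxml')` would stand in for the last three lines of the cell, assuming lxml is installed.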



Parse the table out of the BeautifulSoup object and write its rows to a list. I printed out the first 10 entries so you can see what they look like.

In [88]:
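table = soup.find('table')
table_rows = table.find_all('tr')

scrape_table = []
for tr in table_rows:
    cells = tr.find_all('td')
    # keep non-empty cell text; a row with no <td> text (such as a header row of <th> cells) gives an empty list
    row = [cell.text.strip() for cell in cells if cell.text.strip()]
    if row:
        scrape_table.append(row)
scrape_table[0:10]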

[['M1A', 'Not assigned', 'Not assigned'],
 ['M2A', 'Not assigned', 'Not assigned'],
 ['M3A', 'North York', 'Parkwoods'],
 ['M4A', 'North York', 'Victoria Village'],
 ['M5A', 'Downtown Toronto', 'Harbourfront'],
 ['M5A', 'Downtown Toronto', 'Regent Park'],
 ['M6A', 'North York', 'Lawrence Heights'],
 ['M6A', 'North York', 'Lawrence Manor'],
 ['M7A', "Queen's Park", 'Not assigned'],
 ['M8A', 'Not assigned', 'Not assigned']]
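To make the row extraction easier to check without hitting the network, here is a small sketch that runs the same logic on an inline HTML string. The sample markup, the header labels in it, and the function name `table_to_rows` are assumptions made up for this illustration; only the parsing pattern mirrors the cell above.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table, for illustration only.
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
  </td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
  </td></tr>
</table>
"""


def table_to_rows(html):
    """Return the stripped <td> text of each row in the first table."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find("table").find_all("tr"):
        cells = [td.text.strip() for td in tr.find_all("td")]
        cells = [text for text in cells if text]
        # The header row contains only <th> cells, so it yields [] and is skipped.
        if cells:
            rows.append(cells)
    return rows


rows = table_to_rows(sample_html)
print(rows)
```

The header row drops out because it has no `<td>` cells, which is why the `if row:` check in the cell above is enough to keep it out of `scrape_table`.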

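Create a dataframe from the list and clean the data: drop rows whose borough is not assigned, give unassigned neighborhoods the name of their borough, and consolidate entries that share a postal code. I displayed the entries so you can see the table.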

In [95]:
# create data frame for postal codes
df_toronto_postal_codes = pd.DataFrame(scrape_table, columns=['PostalCode', 'Borough', 'Neighborhood'])
# drop entry if borough is not assigned
df_toronto_postal_codes.drop(df_toronto_postal_codes[df_toronto_postal_codes['Borough'] == 'Not assigned'].index, inplace=True)
# if neighborhood is not assigned then give it the same name as borough
df_toronto_postal_codes.loc[df_toronto_postal_codes['Neighborhood'].eq('Not assigned'), 'Neighborhood'] = df_toronto_postal_codes['Borough']
# create an aggregated data frame that concatenates neighborhoods into one entry if they have the same postal code
aggregated_postal_codes = df_toronto_postal_codes.groupby(['PostalCode', 'Borough'], as_index=False)['Neighborhood'].agg(lambda col: ', '.join(col))

# display first 75 entries in aggregated dataframe
aggregated_postal_codes.head(75)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |
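The three cleaning steps are easier to follow on a handful of rows. The sketch below applies them to four rows taken from the scraped list shown earlier. It filters with a boolean mask instead of `drop(..., inplace=True)`, which is my own substitution and not the notebook's method; the variable names `df`, `unassigned`, and `result` are likewise just for this illustration.

```python
import pandas as pd

# Four rows copied from the scraped list printed earlier.
rows = [
    ["M1A", "Not assigned", "Not assigned"],
    ["M5A", "Downtown Toronto", "Harbourfront"],
    ["M5A", "Downtown Toronto", "Regent Park"],
    ["M7A", "Queen's Park", "Not assigned"],
]
df = pd.DataFrame(rows, columns=["PostalCode", "Borough", "Neighborhood"])

# Step 1: keep only rows that have a borough.
df = df[df["Borough"] != "Not assigned"].copy()

# Step 2: an unassigned neighborhood takes the borough's name.
unassigned = df["Neighborhood"] == "Not assigned"
df.loc[unassigned, "Neighborhood"] = df.loc[unassigned, "Borough"]

# Step 3: one row per postal code, neighborhoods joined with ", ".
result = df.groupby(["PostalCode", "Borough"], as_index=False)["Neighborhood"].agg(
    lambda names: ", ".join(names)
)
print(result)
```

Here M1A disappears, the two M5A rows collapse into "Harbourfront, Regent Park", and M7A gets "Queen's Park" as its neighborhood.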


In the following cell, I couldn't get the geocoder to return longitude and latitude, so I used the CSV file with longitude and latitude instead. I imported it into a dataframe called geo_data and merged it with the aggregated_postal_codes dataframe to produce a new dataframe called postal_codes_lonlat. I printed out the head of the merged dataframe.

In [116]:
#!pip install --user geocoder
# load the coordinates and rename the key column to match aggregated_postal_codes
geo_data = pd.read_csv("http://cocl.us/Geospatial_data")
geo_data.rename(columns={'Postal Code': 'PostalCode'}, inplace=True)
postal_codes_lonlat = pd.merge(aggregated_postal_codes, geo_data, on='PostalCode')
postal_codes_lonlat.head()

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
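`pd.merge` defaults to an inner join, so a postal code with no match in the coordinates file would silently vanish from the result. A minimal sketch of one way to check for that is below, using a left join with `indicator=True`. The postal code "M9Z" is invented for this illustration and is assumed to have no coordinates; the two coordinate pairs are copied from the output above, and the variable names are my own.

```python
import pandas as pd

# "M9Z" is a made-up code with no coordinates, for illustration only.
codes = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
})
geo = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})
geo = geo.rename(columns={"Postal Code": "PostalCode"})

# A left join keeps every postal code; the _merge column records where each row matched.
merged = pd.merge(
    codes, geo, on="PostalCode", how="left", indicator=True, validate="one_to_one"
)
missing = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print(missing)  # postal codes that have no coordinates
```

If `missing` comes back empty on the real data, the inner join used in the cell above loses nothing.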
