# Week 3 Capstone Project
---
## Segmenting and Clustering Neighborhoods in Toronto


We will use the `requests` and `BeautifulSoup` Python libraries to scrape the provided Wikipedia page, extract the required information, and load it into a pandas DataFrame.

### Part 1: Web Scraping Wikipedia and Creating a DataFrame

In [20]:
!conda install -c conda-forge geopy --yes
!conda install -c conda-forge folium --yes


Collecting package metadata: ...working... done
Solving environment: ...working... 
  - anaconda::ca-certificates-2019.1.23-0, anaconda::openssl-1.1.1b-he774522_1
  - anaconda::openssl-1.1.1b-he774522_1, defaults::ca-certificates-2019.1.23-0
  - anaconda::ca-certificates-2019.1.23-0, defaults::openssl-1.1.1b-he774522_1
  - defaults::ca-certificates-2019.1.23-0, defaults::openssl-1.1.1b-he774522_1done

## Package Plan ##

  environment location: C:\Users\jpatel7\AppData\Local\Continuum\anaconda3

  added / updated specs:
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    certifi-2019.3.9           |           py37_0         149 KB  conda-forge
    conda-4.6.14               |           py37_0         2.1 MB  conda-forge
    geographiclib-1.49         |             py_0          32 KB  conda-forge
    geopy-1.20.0               |             py_0          57 KB  conda-forge
    

In [2]:
#import libraries
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup as bs

In [3]:
# create a request for the url
wiki = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
content = bs(wiki,'lxml')
print(content.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of postal codes of Canada: M - Wikipedia
  </title>
  <script>
   document.documentElement.className=document.documentElement.className.replace(/(^|\s)client-nojs(\s|$)/,"$1client-js$2");RLCONF={"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_postal_codes_of_Canada:_M","wgTitle":"List of postal codes of Canada: M","wgCurRevisionId":900271985,"wgRevisionId":900271985,"wgArticleId":539066,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Communications in Ontario","Postal codes in Canada","Toronto","Ontario-related lists"],"wgBreakFrames":!1,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June",

In [4]:
table = content.find('table',{'class':'wikitable sortable'})
print(table)

<table class="wikitable sortable">
<tbody><tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
</td></tr>
<tr>
<td>M4A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>
</td></tr>
<tr>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a href="/wiki/Harbourfront_(Toronto)" title="Harbourfront (Toronto)">Harbourfront</a>
</td></tr>
<tr>
<td>M5A</td>
<td><a href="/wiki/Downtown_Toronto" title="Downtown Toronto">Downtown Toronto</a></td>
<td><a href="/wiki/Regent_Park" title="Regent Park">Regent Park</a>
</td></tr>
<tr>
<td>M6A</td>

In [5]:
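# save all table rows to a CSV file
raw_data = "Postcode,Borough,Neighborhood\n"
for r in table.find_all('tr'):
    row = ""
    for item in r.find_all('td'):
        row = row + "," + item.text
    # the last cell of each row already ends with a newline; the header row has no <td> cells
    raw_data = raw_data + row[1:]
# the context manager closes the file, so the data is flushed before it is read back
with open("toronto.csv", 'wb') as file:
    n_bytes = file.write(bytes(raw_data, encoding='ascii', errors='ignore'))
n_bytes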

8768
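
The CSV round trip above works, but the rows can also go straight into a DataFrame. The following is a minimal sketch, assuming a small HTML sample copied from the table printed earlier (not the full page). The names `html`, `sample_table`, and `sample_df` are illustrative and do not appear in the notebook.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Small sample copied from the table output above (not the full Wikipedia page).
html = """
<table class="wikitable sortable">
<tbody><tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A</td>
<td>Not assigned</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
</td></tr>
</tbody></table>
"""

# html.parser ships with Python, so this sketch does not need lxml.
sample_table = BeautifulSoup(html, "html.parser").find("table", class_="wikitable")

rows = []
for tr in sample_table.find_all("tr"):
    # strip=True removes the trailing newline inside the last cell of each row
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row holds only <th> cells, so it is skipped
        rows.append(cells)

sample_df = pd.DataFrame(rows, columns=["Postcode", "Borough", "Neighborhood"])
print(sample_df)
```

Building the rows as lists also avoids any trouble with commas inside a cell, which a hand-assembled CSV string would not escape.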

In [6]:
# load the CSV file into a DataFrame
df = pd.read_csv('toronto.csv')
df.head()
df.shape

(288, 3)

In [7]:
# many rows have no assigned borough ('Not assigned'), so we drop them

df.drop(df [ df['Borough'] == 'Not assigned'].index, inplace=True)
df.head(10)

|    | Postcode | Borough | Neighborhood |
|----|----------|---------|--------------|
| 2  | M3A | North York | Parkwoods |
| 3  | M4A | North York | Victoria Village |
| 4  | M5A | Downtown Toronto | Harbourfront |
| 5  | M5A | Downtown Toronto | Regent Park |
| 6  | M6A | North York | Lawrence Heights |
| 7  | M6A | North York | Lawrence Manor |
| 8  | M7A | Queen's Park | Not assigned |
| 10 | M9A | Etobicoke | Islington Avenue |
| 11 | M1B | Scarborough | Rouge |
| 12 | M1B | Scarborough | Malvern |


In [8]:
# where the neighborhood is 'Not assigned', use the borough name instead
df.loc[df['Neighborhood'] == 'Not assigned','Neighborhood'] = df['Borough']
df.shape

(211, 3)
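
The two cleaning steps above can be tried on a toy frame. This is a sketch under assumed data: the three rows below are a hand-made sample modelled on the table, and `toy` and `missing` are illustrative names. It uses a boolean mask rather than `drop` by index, which is one of several equivalent ways to filter.

```python
import pandas as pd

# Hand-made sample modelled on the scraped table (not the real data set).
toy = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Step 1: keep only rows whose borough is assigned.
toy = toy[toy["Borough"] != "Not assigned"].copy()

# Step 2: where the neighborhood is missing, fall back to the borough name.
missing = toy["Neighborhood"] == "Not assigned"
toy.loc[missing, "Neighborhood"] = toy.loc[missing, "Borough"]

print(toy)
```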

In [9]:
# combine neighborhoods that share a postcode into one comma-separated row
df = df.groupby(['Postcode','Borough'],sort=False).agg(', '.join).reset_index()
df.shape

(103, 3)
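
To see what the `groupby` step does, here is a small sketch on assumed sample rows taken from the table above. The names `toy` and `grouped` are illustrative. Selecting the `Neighborhood` column explicitly makes it clear which column is being joined.

```python
import pandas as pd

# Sample rows taken from the table above: M5A appears twice.
toy = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# sort=False keeps the groups in order of first appearance.
grouped = (
    toy.groupby(["Postcode", "Borough"], sort=False)["Neighborhood"]
       .agg(", ".join)
       .reset_index()
)
print(grouped)
```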

### Part 2: Adding Latitude and Longitude using geocoder

In [15]:
df_loc = pd.read_csv('Toronto_locations.csv')
df_loc

|   | Postal Code | Latitude | Longitude |
|---|-------------|----------|-----------|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |
| 5 | M1J | 43.744734 | -79.239476 |
| 6 | M1K | 43.727929 | -79.262029 |
| 7 | M1L | 43.711112 | -79.284577 |
| 8 | M1M | 43.716316 | -79.239476 |
| 9 | M1N | 43.692657 | -79.264848 |


In [17]:
# merge the two DataFrames into one

# index both frames by postal code so they can be aligned
temp_df = df.set_index('Postcode')
temp_loc = df_loc.set_index('Postal Code')
loc_df = pd.concat([temp_df,temp_loc],axis=1,join='inner')
loc_df.index.name = 'PostalCode'
loc_df.reset_index(inplace=True)

In [28]:
loc_df.head()

|   | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|------------|---------|--------------|----------|-----------|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor | 43.718518 | -79.464763 |
| 4 | M7A | Queen's Park | Queen's Park | 43.662301 | -79.389494 |
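
The join above aligns the two frames on their index with `pd.concat`. An alternative is `DataFrame.merge` with `left_on`/`right_on`, which handles the differing key names directly. This is only a sketch: the frames `left` and `right` are small assumed samples, and the code `M9Z` is made up to show that an inner join drops rows without coordinates.

```python
import pandas as pd

left = pd.DataFrame({
    "Postcode": ["M3A", "M4A", "M9Z"],  # "M9Z" is a made-up code with no coordinates
    "Borough": ["North York", "North York", "Hypothetical"],
})
right = pd.DataFrame({
    "Postal Code": ["M4A", "M3A"],
    "Latitude": [43.725882, 43.753259],
    "Longitude": [-79.315572, -79.329656],
})

merged = (
    left.merge(right, left_on="Postcode", right_on="Postal Code", how="inner")
        .drop(columns="Postal Code")                # the key would otherwise appear twice
        .rename(columns={"Postcode": "PostalCode"})
)
print(merged)
```

An inner join keeps the row order of the left frame, so the result follows the order of the scraped table.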


### Part 3: Exploring and Clustering the Toronto Neighborhoods

In [24]:
# importing libraries
from geopy.geocoders import Nominatim

import matplotlib.cm as cm
import matplotlib.colors as colors

# !conda install scikit-learn --yes
from sklearn.cluster import KMeans

import folium
print('done!')

done!


In [25]:
# set up
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="tl-toronto-neigh")
location = geolocator.geocode(address)
lat = location.latitude
long = location.longitude
print(lat, long)

43.653963 -79.387207


In [35]:
map_toronto = folium.Map(location=[lat,long],zoom_start=10)

# add a circle marker with a popup label for each postal code
for lt, lg, pc, bgh, ngh in zip(loc_df['Latitude'], loc_df['Longitude'],
                                loc_df['PostalCode'], loc_df['Borough'], loc_df['Neighborhood']):
    label = "{} [ {} ]: {}".format(bgh, pc, ngh)
    popup = folium.Popup(label, parse_html=True)
    # parse_html is an option of folium.Popup, so it is not passed to CircleMarker
    folium.CircleMarker([lt, lg], radius=5, popup=popup, color='red', fill=True,
                        fill_color='#3186cc', fill_opacity=0.6).add_to(map_toronto)
map_toronto
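
The notebook imports `KMeans` but stops at the map, so the clustering step itself is not shown. The following is a rough sketch of what such a step could look like, assuming clustering on latitude and longitude alone. That is a simplification chosen for illustration, not necessarily the features the project intends to use. The value of `k` and the name `coords` are assumptions; the coordinates are copied from the tables above.

```python
import numpy as np
from sklearn.cluster import KMeans

# (latitude, longitude) pairs copied from the tables shown earlier.
coords = np.array([
    [43.806686, -79.194353],  # M1B
    [43.784535, -79.160497],  # M1C
    [43.763573, -79.188711],  # M1E
    [43.65426, -79.360636],   # M5A
    [43.662301, -79.389494],  # M7A
])

k = 2  # assumed number of clusters for this tiny sample

# random_state fixes the initialisation; n_init is set explicitly
# because its default differs between scikit-learn versions.
kmeans = KMeans(n_clusters=k, random_state=0, n_init=10).fit(coords)
labels = kmeans.labels_

print(labels)                   # one cluster id per postal code
print(kmeans.cluster_centers_)  # mean (lat, long) of each cluster
```

The cluster ids are arbitrary integers, so only the grouping matters, not which group is called 0 or 1. The labels could then be added as a column to `loc_df` and used to pick a marker color per cluster.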