### Exploring and clustering the neighborhoods in Toronto, Canada.

<p> Installing and importing necessary libraries. </p>

In [2]:
!pip install BeautifulSoup4
!pip install requests
import requests
from bs4 import BeautifulSoup  # To work with a HTML page
import pandas as pd
import numpy as np

Defaulting to user installation because normal site-packages is not writeable
Defaulting to user installation because normal site-packages is not writeable
Collecting requests
  Downloading requests-2.23.0-py2.py3-none-any.whl (58 kB)
Collecting idna<3,>=2.5
  Downloading idna-2.9-py2.py3-none-any.whl (58 kB)
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1
  Downloading urllib3-1.25.9-py2.py3-none-any.whl (126 kB)
Collecting chardet<4,>=3.0.2
  Downloading chardet-3.0.4-py2.py3-none-any.whl (133 kB)
Collecting certifi>=2017.4.17
  Downloading certifi-2020.4.5.1-py2.py3-none-any.whl (157 kB)
Installing collected packages: idna, urllib3, chardet, certifi, requests
Successfully installed certifi-2020.4.5.1 chardet-3.0.4 idna-2.9 requests-2.23.0 urllib3-1.25.9


Extracting content from a given URL and storing it using BeautifulSoup.

In [3]:
URL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
r = requests.get(URL) 
# print(r.content) 
soup = BeautifulSoup(r.content, 'html5lib')  # needs the html5lib package (pip install html5lib); 'html.parser' is a built-in alternative
# print(soup.prettify()) 

FeatureNotFound: Couldn't find a tree builder with the features you requested: html5lib. Do you need to install a parser library?

Creating a dictionary to hold the page content so that it can be converted directly to a DataFrame for easy manipulation.

In [4]:
from collections import defaultdict
dic = defaultdict(list)

Extracting data from a table tag with help of BeautifulSoup and storing it in a dictionary

In [5]:
table = soup.table
rows = table.find_all('tr')
for row in rows:  # named 'row' so the response object r is not overwritten
    cols = row.find_all('td')
    # Rows without three <td> cells (such as the header row) are skipped
    if len(cols) >= 3:
        dic['Postal Code'].append(cols[0].text.strip())
        dic['Borough'].append(cols[1].text.strip())
        dic['Neighborhood'].append(cols[2].text.strip())
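As a small offline illustration of the row-by-row extraction, the sketch below parses a made-up table with the built-in 'html.parser' backend. The HTML snippet and the names `sample_html`, `soup_demo`, and `records` are assumptions for illustration; this is not the Wikipedia page itself.

```python
from collections import defaultdict
from bs4 import BeautifulSoup

# Hypothetical miniature table standing in for the real page
sample_html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

soup_demo = BeautifulSoup(sample_html, 'html.parser')
records = defaultdict(list)
for row in soup_demo.table.find_all('tr'):
    # get_text(strip=True) removes surrounding whitespace and newlines
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    if len(cells) == 3:  # the header row has only <th> cells, so it is skipped
        for key, value in zip(('Postal Code', 'Borough', 'Neighborhood'), cells):
            records[key].append(value)

print(dict(records))
```

Checking the cell count before appending keeps the three lists the same length, which the later conversion to a DataFrame relies on.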

Converting Dictionary to Dataframe.

In [6]:
data = pd.DataFrame.from_dict(dic)
data

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,
176,M6Z,Not assigned,
177,M7Z,Not assigned,
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


Finding the index labels of the rows whose Borough is 'Not assigned'.

In [7]:
to_remove = data[data['Borough'] == 'Not assigned'].index

Dropping the rows whose Borough is 'Not assigned'.

In [8]:
data.drop(to_remove, inplace = True)
data

Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


Resetting index.

In [9]:
data.reset_index(inplace = True)
data.head()

Unnamed: 0,index,Postal Code,Borough,Neighborhood
0,2,M3A,North York,Parkwoods
1,3,M4A,North York,Victoria Village
2,4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,5,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


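Dropping the unnecessary index column. Note that without `inplace = True` (or reassignment) this call returns a new DataFrame and leaves `data` unchanged, which is why the `index` column still appears in later cells.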

In [10]:
data.drop('index', axis = 1)

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
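A minimal sketch of the filter-and-reindex steps on a small made-up frame (the `demo` rows are illustrative, not the scraped data). Passing `drop=True` to `reset_index` discards the old index instead of adding it as an `index` column, so no separate drop is needed.

```python
import pandas as pd

# Illustrative rows only
demo = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighborhood': ['', 'Parkwoods', 'Victoria Village'],
})

# Keep assigned boroughs and renumber the rows from 0 without keeping the old index
cleaned = demo[demo['Borough'] != 'Not assigned'].reset_index(drop=True)
print(cleaned)
```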


Shape of the final DataFrame.

In [11]:
data.shape

(103, 4)

Installing geocoder to get the latitude and longitude for each postal code in the 'data' DataFrame.

In [12]:
!pip install geocoder
import geocoder


The cell below did not work because the API was taking too long to respond.

In [13]:
# latitude = []
# longitude = []
# for pcode in data['Postal Code']:
#     lat_lng_coords = None
#     while(lat_lng_coords is None):
#       response = geocoder.google('{}, Toronto, Ontario'.format(pcode))
#       lat_lng_coords = response.latlng

#     latitude.append(lat_lng_coords[0])
#     longitude.append(lat_lng_coords[1])

So, I read the latitudes and longitudes from a CSV file instead.

In [14]:
lat_long = pd.read_csv('Geospatial_Coordinates.csv')
lat_long.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Sorting data by postal code and stripping whitespace from the column names.

In [15]:
data.columns = data.columns.str.strip()
data.sort_values('Postal Code', ascending = True, axis = 0, inplace = True)
data.head()

Unnamed: 0,index,Postal Code,Borough,Neighborhood
6,9,M1B,Scarborough,"Malvern, Rouge"
12,18,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
18,27,M1E,Scarborough,"Guildwood, Morningside, West Hill"
22,36,M1G,Scarborough,Woburn
26,45,M1H,Scarborough,Cedarbrae


Sorting lat_long in order to merge it with data so that each postal code gets its correct latitude and longitude.

In [16]:
lat_long.columns = lat_long.columns.str.strip()
lat_long.sort_values('Postal Code', ascending = True, axis = 0, inplace = True)
lat_long.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


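Adding Latitude and Longitude columns to the 'data' DataFrame. Note that column assignment aligns rows by index label rather than by 'Postal Code'; because 'data' still carries its pre-sort index labels (6, 12, 18, ...), the coordinates shown below differ from those listed for the same postal codes in lat_long (for example, M1B is 43.806686, -79.194353 there).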

In [17]:
data['Latitude'] = lat_long['Latitude']
data['Longitude'] = lat_long['Longitude']
data.head()

Unnamed: 0,index,Postal Code,Borough,Neighborhood,Latitude,Longitude
6,9,M1B,Scarborough,"Malvern, Rouge",43.727929,-79.262029
12,18,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.7942,-79.262029
18,27,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.778517,-79.346556
22,36,M1G,Scarborough,Woburn,43.77012,-79.408493
26,45,M1H,Scarborough,Cedarbrae,43.745906,-79.352188
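Assigning a column from one DataFrame to another matches rows by index label, not by position, so two sorted frames with different index labels can end up mismatched. A sketch of joining on the shared 'Postal Code' key instead, using two small frames whose coordinates are copied from the lat_long preview shown earlier (the names `left`, `right`, `misaligned`, and `merged` are illustrative):

```python
import pandas as pd

# Index labels 6 and 12 mimic the leftover pre-sort labels in `data`
left = pd.DataFrame(
    {'Postal Code': ['M1B', 'M1C'], 'Borough': ['Scarborough', 'Scarborough']},
    index=[6, 12],
)
right = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# Index-aligned assignment: `right` has no labels 6 or 12, so this yields NaN
misaligned = left.assign(Latitude=right['Latitude'])

# Key-based merge matches rows on 'Postal Code' regardless of index or row order
merged = left.merge(right, on='Postal Code', how='left')
print(merged)
```

With a merge, neither frame needs to be sorted first, and the result has a fresh 0-based index.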


In [18]:
data.reset_index(inplace = True)
data.head()

Unnamed: 0,level_0,index,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,6,9,M1B,Scarborough,"Malvern, Rouge",43.727929,-79.262029
1,12,18,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.7942,-79.262029
2,18,27,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.778517,-79.346556
3,22,36,M1G,Scarborough,Woburn,43.77012,-79.408493
4,26,45,M1H,Scarborough,Cedarbrae,43.745906,-79.352188


Removing the extra columns. The result may not look exactly like the one given in the assignment because it is sorted by postal code.

In [19]:
data.drop(['level_0', 'index'], axis = 1, inplace = True)
data.head(10)

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.727929,-79.262029
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.7942,-79.262029
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.778517,-79.346556
3,M1G,Scarborough,Woburn,43.77012,-79.408493
4,M1H,Scarborough,Cedarbrae,43.745906,-79.352188
5,M1J,Scarborough,Scarborough Village,43.728496,-79.495697
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park",43.70906,-79.363452
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge",43.72802,-79.38879
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West",43.667967,-79.367675
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.650571,-79.384568


Selecting the Boroughs whose name contains 'Scarborough'.

In [20]:
temp = data['Borough'].str.contains('Scarborough')
temp.value_counts()

False    86
True     17
Name: Borough, dtype: int64

Storing the rows whose Borough contains 'Scarborough'.

In [23]:
data_scar = data[temp].reset_index(drop=True)
data_scar.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.727929,-79.262029
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.7942,-79.262029
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.778517,-79.346556
3,M1G,Scarborough,Woburn,43.77012,-79.408493
4,M1H,Scarborough,Cedarbrae,43.745906,-79.352188


In [24]:
import json
!conda install -c conda-forge geopy --yes # comment out this line if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON into a pandas dataframe (in pandas >= 1.0, use: from pandas import json_normalize)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs:
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    geographiclib-1.50         |             py_0          34 KB  conda-forge
    geopy-1.22.0               |     pyh9f0ad1d_0          63 KB  conda-forge
    ------------------------------------------------------------
                                           Total:          97 KB

The following NEW packages will be INSTALLED:

  geographiclib      conda-forge/noarch::geographiclib-1.50-py_0
  geopy              conda-forge/noarch::geopy-1.22.0-pyh9f0ad1d_0



Downloading and Extracting Packages
geopy-1.22.0         | 63 KB     | ##################################### | 100% 
geographiclib-1.50   | 34 KB     | ###############################

In [25]:
address = 'Scarborough, CA'
# Look up the coordinates of Scarborough with Nominatim
geolocator = Nominatim(user_agent = "scar_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('Coordinates of Scarborough are {}, {}'.format(latitude, longitude))

Coordinates of Scarborough are 43.773077, -79.257774


In [31]:
map_scar = folium.Map(location = [latitude, longitude], zoom_start = 12)

# Add one circle marker per neighborhood, labelled with its name
for lat, lng, label in zip(data_scar['Latitude'], data_scar['Longitude'], data_scar['Neighborhood']):
    label = folium.Popup(label, parse_html = True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='Blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_scar)

map_scar

In [32]:
CLIENT_ID = 'K0S5WC0VPH3FXOJXNVY1WKEZOSTJCUJBGFMT52TQ2BHAT3MX' # your Foursquare ID
CLIENT_SECRET = 'IJ34AI144DDJLZ2W3ANDQ0NJSD2XMM01YHY3K0JHBRDLM0WX' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: K0S5WC0VPH3FXOJXNVY1WKEZOSTJCUJBGFMT52TQ2BHAT3MX
CLIENT_SECRET:IJ34AI144DDJLZ2W3ANDQ0NJSD2XMM01YHY3K0JHBRDLM0WX
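Hard-coding and printing API secrets in a notebook exposes them to anyone who can read it. One common alternative, sketched here, is to read them from environment variables and mask them when printing. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the `mask` helper are assumptions chosen for this example, not names required by Foursquare.

```python
import os

def mask(secret, visible=4):
    """Show only the first few characters of a secret when printing."""
    if len(secret) <= visible:
        return '*' * len(secret)
    return secret[:visible] + '*' * (len(secret) - visible)

# Hypothetical variable names; an empty string is used if a variable is not set
CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID', '')
CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET', '')

print('CLIENT_ID: ' + mask(CLIENT_ID))
print('CLIENT_SECRET: ' + mask(CLIENT_SECRET))
```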


In [37]:
data_scar.loc[0, 'Neighborhood'].split(',')[0]

'Malvern'

In [None]:
neigh_lat = data_scar.loc[0, 'Latitude']  # column name is capitalized in data_scar