# Segmenting and Clustering Neighborhoods in Toronto

This notebook builds the code to scrape the following Wikipedia page:
    https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
in order to obtain the table of postal codes and transform it into a pandas dataframe.

First we will import the libraries that will be required for the task:

In [33]:
!conda install -c conda-forge folium=0.5.0 --yes
import folium
from geopy.geocoders import Nominatim
import pandas as pd 
import wikipedia as wp
import matplotlib.cm as cm
import matplotlib.colors as colors
from bs4 import BeautifulSoup

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - folium=0.5.0


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    branca-0.3.1               |             py_0          25 KB  conda-forge
    altair-3.2.0               |           py36_0         770 KB  conda-forge
    folium-0.5.0               |             py_0          45 KB  conda-forge
    vincent-0.4.4              |             py_1          28 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         868 KB

The following NEW packages will be INSTALLED:

    altair:  3.2.0-py36_0 conda-forge
    branca:  0.3.1-py_0   conda-forge
    folium:  0.5.0-py_0   conda-forge
    vincent: 0.4.4-py_1   conda-forge


Downloading and Extracting Packages
branca-0.3.1         | 25 KB    

To obtain the data from the page, we use the wikipedia library to fetch the page HTML and encode it as UTF-8 bytes.

In [6]:
html = wp.page("List of postal codes of Canada: M").html().encode("UTF-8")
html

b'<div class="mw-parser-output"><p>This is a list of <a href="/wiki/Postal_codes_in_Canada" title="Postal codes in Canada">postal codes in Canada</a> where the first letter is M. Postal codes beginning with M are located within the city of <a href="/wiki/Toronto" title="Toronto">Toronto</a> in the province of <a href="/wiki/Ontario" title="Ontario">Ontario</a>. Only the first three characters are listed, corresponding to the Forward Sortation Area.\n</p><p><a href="/wiki/Canada_Post" title="Canada Post">Canada Post</a> provides a free postal code look-up tool on its website,<sup id="cite_ref-1" class="reference"><a href="#cite_note-1">&#91;1&#93;</a></sup> via its <a href="/wiki/Mobile_app" title="Mobile app">applications</a> for such <a href="/wiki/Smartphones" class="mw-redirect" title="Smartphones">smartphones</a> as the <a href="/wiki/IPhone" title="IPhone">iPhone</a> and <a href="/wiki/BlackBerry" title="BlackBerry">BlackBerry</a>,<sup id="cite_ref-2" class="reference"><a href="#c

The required table is the first one in that HTML (index 0), so we read it with:

In [18]:
df = pd.read_html(html, header = 0)[0]
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


To remove all rows whose Borough is 'Not assigned', we use:

In [17]:
df = df[df.Borough != 'Not assigned']
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
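
As a minimal sketch of the same filter on made-up data (the three rows below are illustrative and only mimic the shape of the scraped table), a boolean mask keeps the rows whose Borough is assigned:

```python
import pandas as pd

# Toy rows for illustration only; they mimic the shape of the scraped table.
toy = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# The comparison yields a boolean Series; indexing with it keeps the True rows.
assigned = toy[toy['Borough'] != 'Not assigned'].reset_index(drop=True)
print(assigned)
```

The `reset_index(drop=True)` call is optional; it renumbers the remaining rows from 0 so that later label-based lookups such as `.loc[0, ...]` do not hit a removed row.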


More than one neighbourhood can exist in one postal code area. Such neighbourhoods are combined into one row, with the names separated by commas, as in the M1B row ("Rouge, Malvern") of the table above.

In [16]:
df = df.groupby(['Postcode', 'Borough'])['Neighbourhood'].apply(list).apply(lambda x:', '.join(x)).to_frame().reset_index()
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
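
The grouping step can be sketched on a small made-up frame. The rows below are assumptions chosen so that one postcode appears twice; `agg(', '.join)` is used here as a shorter equivalent of the `apply(list)` plus `lambda` chain:

```python
import pandas as pd

# Toy rows for illustration only: M5A appears twice with different neighbourhoods.
toy = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M4A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Victoria Village'],
})

# Group on the key columns and join each group's neighbourhoods into one string.
combined = (
    toy.groupby(['Postcode', 'Borough'])['Neighbourhood']
       .agg(', '.join)
       .reset_index()
)
print(combined)
```

Because `groupby` sorts by the group keys by default, the M4A row comes first in the result even though it was last in the input.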


If a neighbourhood is 'Not assigned' but its borough is known, we set the neighbourhood to the borough name.

In [15]:
# Rows yielded by iterrows() may be copies, so assign through a boolean mask instead
not_assigned = df['Neighbourhood'] == 'Not assigned'
df.loc[not_assigned, 'Neighbourhood'] = df.loc[not_assigned, 'Borough']
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
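
A vectorised way to express this rule, sketched on made-up rows, is a boolean mask with `.loc`. The pandas documentation notes that rows yielded by `iterrows()` may be copies, so writing to them is not guaranteed to update the dataframe; the mask form avoids that question.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    'Postcode': ['M7A', 'M1B'],
    'Borough': ["Queen's Park", 'Scarborough'],
    'Neighbourhood': ['Not assigned', 'Rouge, Malvern'],
})

# Select the rows to fix, then copy the borough name into them in one step.
mask = toy['Neighbourhood'] == 'Not assigned'
toy.loc[mask, 'Neighbourhood'] = toy.loc[mask, 'Borough']
print(toy)
```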


The data is now cleaned for this week, and we can move on to applying k-means clustering.

In [19]:
df.shape

(288, 3)

To find the latitude and longitude of each neighbourhood, the Geocoder Python package (https://geocoder.readthedocs.io/index.html) is one option; here we load the coordinates from a CSV file of geospatial data instead.

Let's start with importing some useful libraries

In [20]:
import requests
import io

Let's load the data into a pandas dataframe and show how it is arranged.

In [23]:
url="http://cocl.us/Geospatial_data"
s=requests.get(url).content
c=pd.read_csv(io.StringIO(s.decode('utf-8')))
c.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
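
The decode-and-wrap step can be tried without a network call. In this sketch the bytes are made up to stand in for the body of an HTTP response:

```python
import io
import pandas as pd

# Made-up bytes standing in for the body of an HTTP response.
raw = b"Postal Code,Latitude,Longitude\nM1B,43.806686,-79.194353\n"

# Decode to text, then wrap it so read_csv can treat it like a file.
coords = pd.read_csv(io.StringIO(raw.decode('utf-8')))
print(coords)
```

`read_csv` expects a path or a file-like object, which is why the decoded string is wrapped in `io.StringIO` rather than passed directly.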


Merging the two datasets:

In [28]:
c.columns = ['Postcode', 'Latitude', 'Longitude']
df = pd.merge(c, df, on='Postcode')
df.head()

Unnamed: 0,Postcode,Latitude,Longitude,Latitude_x,Longitude_x,Latitude_y,Longitude_y,Borough,Neighbourhood
0,M1B,43.806686,-79.194353,43.806686,-79.194353,43.806686,-79.194353,Scarborough,Rouge
1,M1B,43.806686,-79.194353,43.806686,-79.194353,43.806686,-79.194353,Scarborough,Malvern
2,M1C,43.784535,-79.160497,43.784535,-79.160497,43.784535,-79.160497,Scarborough,Highland Creek
3,M1C,43.784535,-79.160497,43.784535,-79.160497,43.784535,-79.160497,Scarborough,Rouge Hill
4,M1C,43.784535,-79.160497,43.784535,-79.160497,43.784535,-79.160497,Scarborough,Port Union


In [30]:
df = df[['Postcode', 'Borough', 'Neighbourhood', 'Latitude', 'Longitude']]
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,Rouge,43.806686,-79.194353
1,M1B,Scarborough,Malvern,43.806686,-79.194353
2,M1C,Scarborough,Highland Creek,43.784535,-79.160497
3,M1C,Scarborough,Rouge Hill,43.784535,-79.160497
4,M1C,Scarborough,Port Union,43.784535,-79.160497
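
The `Latitude_x` and `Latitude_y` columns in the merge output above are the suffixes pandas adds when both frames already contain a column of the same name, which can happen when a merge cell is run more than once. A minimal sketch on made-up frames (all values here are assumptions) that renames only the key column and merges once:

```python
import pandas as pd

# Toy frames for illustration only.
coords = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C', 'M9Z'],
    'Latitude': [43.806686, 43.784535, 0.0],
    'Longitude': [-79.194353, -79.160497, 0.0],
})
areas = pd.DataFrame({
    'Postcode': ['M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighbourhood': ['Rouge, Malvern', 'Highland Creek, Rouge Hill, Port Union'],
})

# Rename only the key column, leaving the other column names untouched.
coords = coords.rename(columns={'Postal Code': 'Postcode'})

# The default inner join keeps postcodes present in both frames (M9Z is dropped).
merged = pd.merge(areas, coords, on='Postcode')
print(merged)
```

Writing the result to a new name such as `merged`, rather than back to `df`, keeps the cell safe to re-run.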


Next we create a geocoder instance to look up the coordinates of Toronto:

In [34]:
address = 'Toronto, ON, Canada'

geolocator = Nominatim(user_agent="to_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto, ON, Canada are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto, ON, Canada are 43.653963, -79.387207.


Creating a map of Toronto with the neighbourhoods superimposed on top of it, as done in the ungraded assignment of this course:

In [38]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighbourhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

Now we will start using the Foursquare API for further analysis.

In [41]:
CLIENT_ID = 'WSY11R05GHMIC1440CVGMX4WGG2IDQFIOIKG2MU4EP1ZEWKI' # your Foursquare ID
CLIENT_SECRET = 'RGS2XX3TTQOYMIEZ5F5HUNXKVTG12IGXMULW3FGSPIY2RBWX' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: WSY11R05GHMIC1440CVGMX4WGG2IDQFIOIKG2MU4EP1ZEWKI
CLIENT_SECRET:RGS2XX3TTQOYMIEZ5F5HUNXKVTG12IGXMULW3FGSPIY2RBWX
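
Printing a client secret stores it in the notebook output. One common alternative, sketched here under assumptions, is to read the credentials from environment variables and mask them when printing. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and both helper functions are hypothetical, not part of the course material:

```python
import os

def load_credentials(env=os.environ):
    """Read the Foursquare credentials from environment variables.

    The variable names are hypothetical; pick whatever suits your setup.
    """
    client_id = env.get('FOURSQUARE_CLIENT_ID')
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')
    if not client_id or not client_secret:
        raise RuntimeError('Foursquare credentials are not set in the environment.')
    return client_id, client_secret

def mask(secret, visible=4):
    """Show only the first few characters of a secret when printing."""
    return secret[:visible] + '*' * max(len(secret) - visible, 0)

# Example with a stand-in mapping instead of the real environment.
demo_env = {'FOURSQUARE_CLIENT_ID': 'demo-id-123',
            'FOURSQUARE_CLIENT_SECRET': 'demo-secret-456'}
demo_id, demo_secret = load_credentials(demo_env)
print('CLIENT_ID: ' + mask(demo_id))
```

In a real session, `load_credentials()` would be called without arguments so that it reads from `os.environ`.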


#### Let's explore the first neighborhood in our dataframe.
Get the neighborhood's name.

In [43]:
df.loc[0, 'Neighbourhood']

'Rouge'

Get the neighborhood's latitude and longitude values.

In [45]:
neighborhood_latitude = df.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = df.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = df.loc[0, 'Neighbourhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Rouge are 43.806686299999996, -79.19435340000001.


##### Now, let's get the top 100 venues that are in Rouge within a radius of 500 meters.
First, let's create the GET request URL. Name your URL url.

In [46]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius
# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/explore?&client_id=WSY11R05GHMIC1440CVGMX4WGG2IDQFIOIKG2MU4EP1ZEWKI&client_secret=RGS2XX3TTQOYMIEZ5F5HUNXKVTG12IGXMULW3FGSPIY2RBWX&v=20180605&ll=43.806686299999996,-79.19435340000001&radius=500&limit=100'
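
The same URL can also be assembled from a dictionary of query parameters, which avoids counting `{}` placeholders by hand. This is a sketch using the standard library; the function name `build_explore_url` is hypothetical and the credentials are placeholders:

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Assemble the explore URL from its query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # safe=',' keeps the comma in the ll value readable instead of %2C.
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params, safe=',')

# Placeholder credentials, for illustration only.
demo_url = build_explore_url('ID', 'SECRET', '20180605', 43.8066863, -79.1943534, 500, 100)
print(demo_url)
```

When using `requests`, passing the dictionary through the `params` argument of `requests.get` is another way to get the same encoding.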

Send the GET request and examine the results.

In [47]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5dc82df06bdee600392b3b8d'},
  'headerLocation': 'Malvern',
  'headerFullLocation': 'Malvern, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 1,
  'suggestedBounds': {'ne': {'lat': 43.8111863045, 'lng': -79.18812958073042},
   'sw': {'lat': 43.80218629549999, 'lng': -79.2005772192696}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4bb6b9446edc76b0d771311c',
       'name': "Wendy's",
       'location': {'crossStreet': 'Morningside & Sheppard',
        'lat': 43.80744841934756,
        'lng': -79.19905558052072,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.80744841934756,
          'lng': -79.19905558052072}],
        'distance': 387,
        'cc': 'CA',
        'city': 'Toronto',
    

From the Foursquare lab in the previous module, we know that all the information is in the *items* key. Before we proceed, let's borrow the **get_category_type** function from the Foursquare lab.

In [48]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:  # the flattened dataframe names the column 'venue.categories'
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now we are ready to clean the JSON and structure it into a pandas dataframe.

In [50]:
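try:
    from pandas import json_normalize  # pandas 1.0 and later
except ImportError:
    from pandas.io.json import json_normalize  # older pandas versions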
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Wendy's,Fast Food Restaurant,43.807448,-79.199056


And how many venues were returned by Foursquare?

In [51]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

1 venues were returned by Foursquare.


## 2. Explore Neighborhoods in Toronto

#### Let's create a function to repeat the same process for all the neighborhoods in Toronto

In [60]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            10)  # limit: at most 10 venues per neighbourhood (the LIMIT constant is not used here)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
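
The list comprehension above indexes `categories[0]`, which raises an `IndexError` if a venue comes back with an empty category list. A defensive variant of the per-venue extraction is sketched below; `venue_rows` is a hypothetical helper, not part of the original function, and the items are made up:

```python
def venue_rows(name, lat, lng, items):
    """Turn one neighbourhood's 'items' list into flat tuples.

    Hypothetical helper: it falls back to None when a venue has no
    category instead of raising IndexError.
    """
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows

# Made-up items for illustration only.
demo_items = [
    {'venue': {'name': 'Cafe A', 'categories': [{'name': 'Coffee Shop'}],
               'location': {'lat': 43.65, 'lng': -79.38}}},
    {'venue': {'name': 'Spot B', 'categories': [],
               'location': {'lat': 43.66, 'lng': -79.39}}},
]
demo_rows = venue_rows('Demo', 43.65, -79.38, demo_items)
print(demo_rows)
```

Each tuple has the same seven fields, in the same order, as the column names assigned in `getNearbyVenues`.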

Now write the code to run the above function on each neighborhood and create a new dataframe called toronto_venues.

In [61]:
toronto_venues = getNearbyVenues(names=df['Neighbourhood'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )



Rouge
Malvern
Highland Creek
Rouge Hill
Port Union
Guildwood
Morningside
West Hill
Woburn
Cedarbrae
Scarborough Village
East Birchmount Park
Ionview
Kennedy Park
Clairlea
Golden Mile
Oakridge
Cliffcrest
Cliffside
Scarborough Village West
Birch Cliff
Cliffside West
Dorset Park
Scarborough Town Centre
Wexford Heights
Maryvale
Wexford
Agincourt
Clarks Corners
Sullivan
Tam O'Shanter
Agincourt North
L'Amoreaux East
Milliken
Steeles East
L'Amoreaux West
Upper Rouge
Hillcrest Village
Fairview
Henry Farm
Oriole
Bayview Village
Silver Hills
York Mills
Newtonbrook
Willowdale
Willowdale South
York Mills West
Willowdale West
Parkwoods
Don Mills North
Flemingdon Park
Don Mills South
Bathurst Manor
Downsview North
Wilson Heights
Northwood Park
York University
CFB Toronto
Downsview East
Downsview West
Downsview Central
Downsview Northwest
Victoria Village
Woodbine Gardens
Parkview Hill
Woodbine Heights
The Beaches
Leaside
Thorncliffe Park
East Toronto
The Danforth West
Riverdale
The Beaches West
Indi

Let's check the size of the result:

In [62]:
print(toronto_venues.shape)
toronto_venues.head()

(1444, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Rouge,43.806686,-79.194353,Wendy's,43.807448,-79.199056,Fast Food Restaurant
1,Malvern,43.806686,-79.194353,Wendy's,43.807448,-79.199056,Fast Food Restaurant
2,Highland Creek,43.784535,-79.160497,Chris Effects Painting,43.784343,-79.163742,Construction & Landscaping
3,Highland Creek,43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
4,Rouge Hill,43.784535,-79.160497,Chris Effects Painting,43.784343,-79.163742,Construction & Landscaping
