# Generate a DataFrame by scraping Canadian postal codes from [Wikipedia](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M)

<div class="alert alert-warning alertwarning" style="margin-top: 20px">
    
<h3>Libraries</h3>
pandas, urllib.request, BeautifulSoup, geocoder

<h3>Keywords</h3>
Scraping, checking installed packages, inner-joining dataframes
</div>

<a id="top"></a>
<div class="alert alert-block alert-info" style="margin-top: 20px">
<h1>Table of Contents</h1>
<hr>
<ol>
    <li><a href="#Part_1">Part 1 - Scrape data from Wikipedia</a></li>
    <li><a href="#Part_2">Part 2 - Combining two dataframes</a></li>
    <li><a href="#Part_3">Part 3 - Explore a borough and its neighborhood venues</a></li>
</ol>
<hr>
</div>


<div class="alert alert-warning alertwarning" style="margin-top: 20px">
<a id="Part_1"></a>
<h2>Part 1 - Scrape data from Wikipedia</h2>
    
<a href="#top">Top</a>
</div>

In [1]:
import urllib.request
from bs4 import BeautifulSoup
import pandas as pd

### The function `GetCanadaPostalCode` scrapes the postal codes from the Wikipedia page using the [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#calling-a-tag-is-like-calling-find-all) library

In [2]:
def searchPostalCode(postalcodes, postalcode):
    '''
    Search for postalcode and return its row position in postalcodes.
    Return -1 if the postal code is not found (or the list is empty).
    '''
    # Each row is a list whose first element (position zero) is the postal code
    for x, row in enumerate(postalcodes):
        if row[0] == postalcode:
            return x
    return -1

In [3]:
def GetCanadaPostalCode(baseurl):
    '''
    Return two lists: the column names, and the rows of postal codes
    (postal code, borough, neighborhood) scraped from baseurl.
    '''
    postalcodes = []
    
    #Read Web Page
    content = urllib.request.urlopen(baseurl).read()
    
    #---------------------------------------------
    #Scrape Table Header from Web Page
    column_names=[]
    for rows in BeautifulSoup(content, "lxml").findAll("table"):
        for index, cols in enumerate(rows.findAll("th")):
            cv = cols.get_text().rstrip("\n")
            column_names.append(cv)
        
        #We need just the first table, the page has two tables
        break
    
    #---------------------------------------------
    #Scrape Table Content from Web Page
    #find all tr tags
    #Loop through all the rows in the table
    for rows in BeautifulSoup(content, "lxml").findAll("tr"):
        
        #search inside each row all tag td
        totalcol = len(rows.findAll("td"))
        
        if totalcol==3:
            #initiate variables
            cn = 0
            col=[]
            addRow=True
            
            #Loop through all the columns in the table
            for index, cols in enumerate(rows.findAll("td")):
                #cleanup carriage return
                cv = cols.get_text().rstrip("\n")
                #cleanup blank values or "Not assigned"
                if (cv=='Not assigned' or cv=="") :
                    addRow=False
                #append value in the column array
                col.append(cv)
                
            if addRow:
                #Search for an existing postal code; if found, append the neighborhood name to that row
                pos = searchPostalCode(postalcodes, col[0])
                if pos == -1:
                    #append the new postal code row to the array
                    postalcodes.append(col)
                else:
                    #append the neighborhood name (column 2) to the existing row
                    postalcodes[pos][2] += ", " + col[2]

    return column_names, postalcodes
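
The search-then-append step can also be written with a dictionary keyed by postal code, which avoids rescanning the list for every row. The following is only a sketch of that idea: `combine_rows` is a hypothetical helper that is not part of the notebook, and the sample rows are hand-written to show two neighborhoods sharing one code.

```python
def combine_rows(rows):
    """Merge rows that share a postal code, joining neighborhood names with ', '.

    Each row is [postal_code, borough, neighborhood]. First-seen order is kept.
    """
    combined = {}
    for code, borough, neighborhood in rows:
        if code in combined:
            # Postal code already seen: extend its neighborhood string
            combined[code][2] += ", " + neighborhood
        else:
            combined[code] = [code, borough, neighborhood]
    return list(combined.values())


# Illustrative rows (not scraped): two entries share the code M1B
sample_rows = [
    ["M1B", "Scarborough", "Malvern"],
    ["M1G", "Scarborough", "Woburn"],
    ["M1B", "Scarborough", "Rouge"],
]
print(combine_rows(sample_rows))
```

The result has the same row layout as `postalcodes`, so it could be passed to `pd.DataFrame` in the same way.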

### Read data

In [4]:
Header, PostalCodes = GetCanadaPostalCode("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")

### Generate dataframe

In [5]:
df_postalcodes = pd.DataFrame(PostalCodes, columns=Header) 

### Check results, sorting by column Postal code

In [6]:
df_postalcodes.sort_values(by='Postal Code', ascending=True).head(12)

Unnamed: 0,Postal Code,Borough,Neighborhood
6,M1B,Scarborough,"Malvern, Rouge"
12,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
18,M1E,Scarborough,"Guildwood, Morningside, West Hill"
22,M1G,Scarborough,Woburn
26,M1H,Scarborough,Cedarbrae
32,M1J,Scarborough,Scarborough Village
38,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park"
44,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge"
51,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West"
58,M1N,Scarborough,"Birch Cliff, Cliffside West"


In [7]:
df_postalcodes.shape

(103, 3)

<div class="alert alert-warning alertwarning" style="margin-top: 20px">
<a id="Part_2"></a>
<h2>Part 2 - Combining two dataframes</h2>
    
<a href="#top">Top</a>
</div>

In [8]:
import sys
import subprocess
import pkg_resources

#check if the library is installed
required = {'geocoder'}
installed = {pkg.key for pkg in pkg_resources.working_set}
missing = required - installed

if missing:
    python = sys.executable
    subprocess.check_call([python, '-m', 'pip', 'install', *missing], stdout=subprocess.DEVNULL)
else:
    print('Libraries are installed !')


Libraries are installed !
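
`pkg_resources` is deprecated in recent setuptools releases, so the same check may eventually need another approach. One possibility, sketched here with a hypothetical helper `find_missing`, is to ask the import system whether each module can be found. Note that this checks import names, which can differ from the names used with `pip install`.

```python
import importlib.util


def find_missing(packages):
    """Return the names in `packages` that cannot be imported in this environment."""
    return sorted(name for name in packages if importlib.util.find_spec(name) is None)


# 'json' ships with Python; the second name is deliberately bogus
print(find_missing({"json", "surely_not_a_real_package_xyz"}))
```

The returned list could then be passed to the same `pip install` call used in the cell above.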


In [9]:
import geocoder

In [10]:
def getCoordinates(postal_code):
    # initialize your variable to None
    lat_lng_coords = None

    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng
        print(lat_lng_coords)
        

    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
    return latitude, longitude

In [11]:
postal_code = "M5G"
latitude , longitude = getCoordinates(postal_code)
print(f"Postal code {postal_code} (latitude {latitude}, longitude {longitude})")   

None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None


KeyboardInterrupt: 
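
The `while` loop in `getCoordinates` never ends if the geocoder keeps returning `None`, which is why the cell had to be interrupted by hand. A bounded retry is one way to avoid that. The sketch below is an assumption-laden illustration: `get_coordinates_with_retry` and its parameters are hypothetical, and the geocoder is replaced by a stub so the example runs without network access.

```python
import time


def get_coordinates_with_retry(lookup, postal_code, max_attempts=5, delay_seconds=0.0):
    """Call `lookup(query)` until it returns coordinates or attempts run out.

    `lookup` is any callable returning a (lat, lng) pair or None; in the notebook
    it would wrap the geocoder call. Returns None if every attempt fails.
    """
    query = "{}, Toronto, Ontario".format(postal_code)
    for attempt in range(1, max_attempts + 1):
        coords = lookup(query)
        if coords is not None:
            return coords[0], coords[1]
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # pause before the next attempt
    return None


# Stand-in for a geocoder that never answers, mimicking the behaviour seen above
def always_fails(query):
    return None


print(get_coordinates_with_retry(always_fails, "M5G", max_attempts=3))
```

The caller can then decide what to do when the result is `None`, for example falling back to a downloaded coordinates file.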

<div class="alert alert-danger alertdanger" style="margin-top: 20px">
<h3>Results</h3>
<a href="#top">Top</a>

As the output above shows, the API returned no coordinates. I tried several times with the same result, so I will use the geospatial data file from the repository instead.
</div>

### Download data from IBM Repository into local repository

In [12]:
#remove the file, if it exists
!rm -rf 'toronto_data.csv'
#Download the geospatial data into a local CSV file
!wget -q -O 'toronto_data.csv' http://cocl.us/Geospatial_data
!ls
print('Data downloaded!')

toronto_data.csv
Data downloaded!


### Read the downloaded file

In [13]:
df_geodata = pd.read_csv('toronto_data.csv', keep_default_na=False, na_values=[""])

In [14]:
df_geodata.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


### Execute an inner join between df_postalcodes and df_geodata ([Data Carpentry reference](https://datacarpentry.org/python-ecology-lesson/05-merging-data/))

In [15]:
df_toronto = pd.merge(left=df_postalcodes, right=df_geodata, left_on='Postal Code', right_on='Postal Code')

In [16]:
df_toronto.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
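
An inner join silently drops postal codes that appear in only one of the two frames. The sketch below shows one way to check for that, using the `validate` and `indicator` options of `pd.merge`. The two small frames are illustrative stand-ins, not the notebook's data, and the code `M9Z` is invented to create a mismatch.

```python
import pandas as pd

# Illustrative frames (not the notebook's data): M9Z has no coordinates
codes = pd.DataFrame({"Postal Code": ["M1B", "M1C", "M9Z"],
                      "Borough": ["Scarborough", "Scarborough", "Example"]})
coords = pd.DataFrame({"Postal Code": ["M1B", "M1C"],
                       "Latitude": [43.806686, 43.784535],
                       "Longitude": [-79.194353, -79.160497]})

# validate raises MergeError if either side has duplicate keys
inner = pd.merge(codes, coords, on="Postal Code", how="inner", validate="one_to_one")

# An outer merge with indicator=True shows which keys failed to match
check = pd.merge(codes, coords, on="Postal Code", how="outer", indicator=True)
unmatched = check.loc[check["_merge"] != "both", "Postal Code"].tolist()

print(inner.shape)
print(unmatched)
```

If `unmatched` is empty for the real data, the inner join keeps every scraped postal code.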


<div class="alert alert-warning alertwarning" style="margin-top: 20px">
<a id="Part_3"></a>
<h2>Part 3 - Explore a borough and its neighborhood venues</h2>
    
<a href="#top">Top</a>
</div>

### a. Select a borough

In [17]:
borough = 'North York'
borough_data = df_toronto[df_toronto['Borough'] == borough].reset_index(drop=True)
borough_data.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
3,M3B,North York,Don Mills,43.745906,-79.352188
4,M6B,North York,Glencairn,43.709577,-79.445073


In [18]:
borough_data.shape

(24, 5)

#### Install the libraries

In [19]:
import sys
import subprocess
import pkg_resources

#check if the library is installed
required = {'geopy','folium'}
installed = {pkg.key for pkg in pkg_resources.working_set}
missing = required - installed

if missing:
    python = sys.executable
    subprocess.check_call([python, '-m', 'pip', 'install', *missing], stdout=subprocess.DEVNULL)
    print('Libraries are ready !')
else:
    print('Libraries are installed !')


Libraries are installed !


In [20]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import folium # map rendering library

In [21]:
address = 'Toronto, CA'

geolocator = Nominatim(user_agent="to_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [22]:
# create map of Toronto using latitude and longitude values
map_Toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

print(f"These are the neighborhood(s) of \"{borough}\" borough.")
for lat, lng, bor, neighborhood, postalcode in zip(borough_data['Latitude'], 
                                                   borough_data['Longitude'], 
                                                   borough_data['Borough'], 
                                                   borough_data['Neighborhood'], 
                                                   borough_data['Postal Code']):
    
    label = '{}, {}, {}'.format(neighborhood, bor, postalcode)
    print("\t"+label)
    
    label = folium.Popup(label, parse_html=True)
    
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Toronto)  
    


These are the neighborhood(s) of "North York" borough.
	Parkwoods, North York, M3A
	Victoria Village, North York, M4A
	Lawrence Manor, Lawrence Heights, North York, M6A
	Don Mills, North York, M3B
	Glencairn, North York, M6B
	Don Mills, North York, M3C
	Hillcrest Village, North York, M2H
	Bathurst Manor, Wilson Heights, Downsview North, North York, M3H
	Fairview, Henry Farm, Oriole, North York, M2J
	Northwood Park, York University, North York, M3J
	Bayview Village, North York, M2K
	Downsview, North York, M3K
	York Mills, Silver Hills, North York, M2L
	Downsview, North York, M3L
	North Park, Maple Leaf Park, Upwood Park, North York, M6L
	Humber Summit, North York, M9L
	Willowdale, Newtonbrook, North York, M2M
	Downsview, North York, M3M
	Bedford Park, Lawrence Manor East, North York, M5M
	Humberlea, Emery, North York, M9M
	Willowdale, North York, M2N
	Downsview, North York, M3N
	York Mills West, North York, M2P
	Willowdale, North York, M2R


In [23]:
map_Toronto

### b. Define Foursquare Credentials and Version

In [24]:
# @hidden_cell
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

#print('Your credentials:')
#print('CLIENT_ID: ' + CLIENT_ID)
#print('CLIENT_SECRET:' + CLIENT_SECRET)
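
Credentials written directly into a notebook are easy to leak when the notebook is shared. One common alternative is to read them from environment variables. The sketch below illustrates this. The helper `load_foursquare_credentials` and the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` are assumptions made for this example, not names defined by Foursquare.

```python
import os


def load_foursquare_credentials(env=None):
    """Read the client ID and secret from environment variables.

    The variable names used here are assumptions chosen for this sketch.
    """
    env = os.environ if env is None else env
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError("Missing environment variable: {}".format(missing)) from None


# Demonstration with a stand-in mapping instead of the real environment
demo_env = {"FOURSQUARE_CLIENT_ID": "demo-id", "FOURSQUARE_CLIENT_SECRET": "demo-secret"}
client_id, client_secret = load_foursquare_credentials(demo_env)
print(client_id)
```

Called without arguments, the helper reads from `os.environ`, so the values never need to appear in the notebook itself.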

#### Let's select a neighborhood

In [25]:
#pick a neighborhood - 'Bayview Village'
position = 10

borough_data.loc[position, 'Neighborhood']

'Bayview Village'

#### Select the coordinates of the neighborhood

In [26]:
neighborhood_longitude = borough_data.loc[position, 'Longitude'] # neighborhood longitude value
neighborhood_latitude = borough_data.loc[position, 'Latitude'] # neighborhood latitude value
neighborhood_name = borough_data.loc[position, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Bayview Village are 43.7869473, -79.385975.


#### Set up the Foursquare API request URL

In [27]:
LIMIT = 100 # limit of number of venues returned by Foursquare API

radius = 500 # define radius

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)

url # display URL


'https://api.foursquare.com/v2/venues/explore?&client_id=YOUR_FOURSQUARE_CLIENT_ID&client_secret=YOUR_FOURSQUARE_CLIENT_SECRET&v=20180605&ll=43.7869473,-79.385975&radius=500&limit=100'
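
The same query string can be assembled with `urllib.parse.urlencode`, which escapes special characters and avoids the stray `&` after the `?`. This is a minimal sketch: `build_explore_url` is a hypothetical helper, and the credentials passed in are placeholders.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Assemble the venues/explore URL with encoded query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # safe="," keeps the comma in the ll value unescaped
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params, safe=",")


print(build_explore_url("ID", "SECRET", "20180605", 43.7869473, -79.385975, 500, 100))
```

Since the notebook already uses `requests`, passing the same dictionary through the `params` argument of `requests.get` would be another option.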

In [28]:
import requests # library to handle requests
from pandas import json_normalize # transform JSON into a flat pandas dataframe (pandas >= 1.0)


### c. Make an API request call for the Foursquare API

In [29]:
results = requests.get(url).json()

#### Review the results

In [30]:
results

{'meta': {'code': 200, 'requestId': '5ec17fbcedbcad001bba0a21'},
 'response': {'headerLocation': 'Bayview Woods - Steeles',
  'headerFullLocation': 'Bayview Woods - Steeles, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 4,
  'suggestedBounds': {'ne': {'lat': 43.791447304500004,
    'lng': -79.37975323634863},
   'sw': {'lat': 43.7824472955, 'lng': -79.39219676365137}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4bd396d041b9ef3b799c00e6',
       'name': 'Sun Star Chinese Cuisine 翠景小炒',
       'location': {'address': '636 Finch Avenue East',
        'crossStreet': 'btwn Leslie St and Bayview Av',
        'lat': 43.78791448422642,
        'lng': -79.38123404311649,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.78791448

####  Selecting venues 

In [31]:
venues = results['response']['groups'][0]['items']

#Normalize the json value
nearby_venues = json_normalize(venues) # flatten JSON

#Check values
print(nearby_venues)

   reasons.count                                      reasons.items  \
0              0  [{'summary': 'This spot is popular', 'type': '...   
1              0  [{'summary': 'This spot is popular', 'type': '...   
2              0  [{'summary': 'This spot is popular', 'type': '...   
3              0  [{'summary': 'This spot is popular', 'type': '...   

                       referralId  \
0  e-0-4bd396d041b9ef3b799c00e6-0   
1  e-0-4b6dc5fcf964a5207d8e2ce3-1   
2  e-0-4bdc7dd8c79cc9287ecc86e9-2   
3  e-0-5404d153498ebbc7332f3e4e-3   

                                    venue.categories  \
0  [{'id': '4bf58dd8d48988d145941735', 'name': 'C...   
1  [{'id': '4bf58dd8d48988d10a951735', 'name': 'B...   
2  [{'id': '4bf58dd8d48988d16d941735', 'name': 'C...   
3  [{'id': '4bf58dd8d48988d111941735', 'name': 'J...   

                   venue.id venue.location.address venue.location.cc  \
0  4bd396d041b9ef3b799c00e6  636 Finch Avenue East                CA   
1  4b6dc5fcf964a5207d8e2ce3      

In [32]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

#### Preparing the venues data for further use

In [33]:
# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

nearby_venues

Unnamed: 0,venue.name,venue.categories,venue.location.lat,venue.location.lng
0,Sun Star Chinese Cuisine 翠景小炒,Chinese Restaurant,43.787914,-79.381234
1,TD Canada Trust,Bank,43.788074,-79.380367
2,Maxim's Cafe and Patisserie,Café,43.787863,-79.380751
3,Kaga Sushi,Japanese Restaurant,43.787758,-79.38109


#### Clean up the column names: venue.name -> name

In [34]:
# clean columns names
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Sun Star Chinese Cuisine 翠景小炒,Chinese Restaurant,43.787914,-79.381234
1,TD Canada Trust,Bank,43.788074,-79.380367
2,Maxim's Cafe and Patisserie,Café,43.787863,-79.380751
3,Kaga Sushi,Japanese Restaurant,43.787758,-79.38109


In [35]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

4 venues were returned by Foursquare.


### d. Explore Neighborhoods in the Borough

In [36]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    print(f"Searching venues in the {borough} borough.")
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print("\t"+name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
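
The expression `v['venue']['categories'][0]['name']` raises an `IndexError` if a venue comes back with an empty category list. A defensive accessor is one way to guard against that. The sketch below uses a hypothetical helper, `first_category_name`, and hand-written items shaped like the response shown earlier.

```python
def first_category_name(item, default=None):
    """Return the first category name of a Foursquare 'items' entry, or `default`."""
    categories = item.get("venue", {}).get("categories") or []
    return categories[0].get("name", default) if categories else default


# Hand-written items shaped like the response; the second has no categories
with_category = {"venue": {"name": "Kaga Sushi",
                           "categories": [{"name": "Japanese Restaurant"}]}}
without_category = {"venue": {"name": "Unnamed place", "categories": []}}

print(first_category_name(with_category))
print(first_category_name(without_category, default="Uncategorized"))
```

Inside `getNearbyVenues`, such a helper could replace the direct indexing in the list comprehension.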

#### Selecting venues in the borough.

In [37]:
borough_venues = getNearbyVenues(names=borough_data['Neighborhood'],
                                   latitudes=borough_data['Latitude'],
                                   longitudes=borough_data['Longitude']
                                  )

Searching venues in the North York borough.
	Parkwoods
	Victoria Village
	Lawrence Manor, Lawrence Heights
	Don Mills
	Glencairn
	Don Mills
	Hillcrest Village
	Bathurst Manor, Wilson Heights, Downsview North
	Fairview, Henry Farm, Oriole
	Northwood Park, York University
	Bayview Village
	Downsview
	York Mills, Silver Hills
	Downsview
	North Park, Maple Leaf Park, Upwood Park
	Humber Summit
	Willowdale, Newtonbrook
	Downsview
	Bedford Park, Lawrence Manor East
	Humberlea, Emery
	Willowdale
	Downsview
	York Mills West
	Willowdale


In [38]:
print(f"These are the venues found in the \"{borough}\" borough.")
borough_venues.head()

These are the venues found in the "North York" borough.


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
3,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop
4,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant


#### Count total venues categorized by neighborhood

In [39]:
borough_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Bathurst Manor, Wilson Heights, Downsview North",20,20,20,20,20,20
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",23,23,23,23,23,23
Don Mills,26,26,26,26,26,26
Downsview,16,16,16,16,16,16
"Fairview, Henry Farm, Oriole",67,67,67,67,67,67
Glencairn,4,4,4,4,4,4
Hillcrest Village,5,5,5,5,5,5
Humber Summit,2,2,2,2,2,2
"Humberlea, Emery",1,1,1,1,1,1


In [40]:
print('There are {} unique categories.'.format(len(borough_venues['Venue Category'].unique())))

There are 102 unique categories.


### e. Analyze Each Neighborhood

In [41]:
# one hot encoding
borough_onehot = pd.get_dummies(borough_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
borough_onehot['Neighborhood'] = borough_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [borough_onehot.columns[-1]] + list(borough_onehot.columns[:-1])
borough_onehot = borough_onehot[fixed_columns]

borough_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Airport,American Restaurant,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bakery,Bank,Bar,...,Sushi Restaurant,Tea Room,Thai Restaurant,Theater,Toy / Game Store,Trail,Video Game Store,Video Store,Vietnamese Restaurant,Women's Store
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### f. Group the dataframe by neighborhood, taking the mean frequency of each venue category

In [42]:
borough_grouped = borough_onehot.groupby('Neighborhood').mean().reset_index()
borough_grouped

Unnamed: 0,Neighborhood,Accessories Store,Airport,American Restaurant,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Bakery,Bank,Bar,...,Sushi Restaurant,Tea Room,Thai Restaurant,Theater,Toy / Game Store,Trail,Video Game Store,Video Store,Vietnamese Restaurant,Women's Store
0,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,...,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0
1,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bedford Park, Lawrence Manor East",0.0,0.0,0.043478,0.0,0.0,0.0,0.0,0.0,0.0,...,0.043478,0.0,0.043478,0.0,0.043478,0.0,0.0,0.0,0.0,0.0
3,Don Mills,0.0,0.0,0.0,0.0,0.076923,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Downsview,0.0,0.0625,0.0,0.0,0.0,0.0625,0.0,0.0625,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,"Fairview, Henry Farm, Oriole",0.0,0.0,0.014925,0.0,0.014925,0.0,0.029851,0.029851,0.014925,...,0.0,0.014925,0.0,0.014925,0.014925,0.0,0.014925,0.0,0.0,0.044776
6,Glencairn,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Hillcrest Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Humber Summit,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,"Humberlea, Emery",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
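
Each value in the grouped table is the share of a neighborhood's venues that fall in a given category: the mean of a 0/1 column equals the count divided by the number of venues. The sketch below checks this on a tiny example and compares it with a row-normalised `pd.crosstab`, which should give the same numbers. The Bayview Village categories are the four listed earlier; the `Example Hood` rows are invented for illustration.

```python
import pandas as pd

# Four Bayview Village venue categories from earlier, plus two illustrative rows
venues = pd.DataFrame({
    "Neighborhood": ["Bayview Village"] * 4 + ["Example Hood"] * 2,
    "Venue Category": ["Chinese Restaurant", "Bank", "Café", "Japanese Restaurant",
                       "Park", "Park"],
})

# One-hot encode, then average per neighborhood (the approach used above)
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])
by_mean = onehot.groupby("Neighborhood").mean()

# Row-normalised cross-tabulation: category count / venues in the neighborhood
by_crosstab = pd.crosstab(venues["Neighborhood"], venues["Venue Category"],
                          normalize="index")

print(by_mean.loc["Bayview Village", "Bank"])
print(by_crosstab.loc["Example Hood", "Park"])
```

With four venues in four different categories, each Bayview Village category has frequency 0.25, matching the grouped table.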


#### Print the top 5 venues for each neighborhood

In [43]:
#print the top 5 venues for each neighborhood
num_top_venues = 5
print(f"These are the venues in each neighborhood of the \"{borough}\" borough.\n")
for hood in borough_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = borough_grouped[borough_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

These are the venues in each neighborhood of the "North York" borough.

----Bathurst Manor, Wilson Heights, Downsview North----
            venue  freq
0     Coffee Shop  0.10
1            Bank  0.10
2  Ice Cream Shop  0.05
3     Bridal Shop  0.05
4        Pharmacy  0.05


----Bayview Village----
                 venue  freq
0  Japanese Restaurant  0.25
1                 Bank  0.25
2   Chinese Restaurant  0.25
3                 Café  0.25
4    Accessories Store  0.00


----Bedford Park, Lawrence Manor East----
                venue  freq
0  Italian Restaurant  0.13
1         Coffee Shop  0.09
2      Sandwich Place  0.09
3                 Pub  0.04
4        Liquor Store  0.04


----Don Mills----
                 venue  freq
0  Japanese Restaurant  0.08
1           Restaurant  0.08
2                  Gym  0.08
3           Beer Store  0.08
4          Coffee Shop  0.08


----Downsview----
            venue  freq
0   Grocery Store  0.19
1            Park  0.12
2  Baseball Field  0.06
3     

In [44]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [45]:
import numpy as np

### g. Create a dataframe selecting top 10 venues by neighborhood

In [46]:
#Select top venues by neighborhood
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = borough_grouped['Neighborhood']

for ind in np.arange(borough_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(borough_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Bathurst Manor, Wilson Heights, Downsview North",Coffee Shop,Bank,Park,Shopping Mall,Fried Chicken Joint,Ice Cream Shop,Diner,Deli / Bodega,Middle Eastern Restaurant,Pharmacy
1,Bayview Village,Chinese Restaurant,Café,Bank,Japanese Restaurant,Women's Store,Dog Run,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store
2,"Bedford Park, Lawrence Manor East",Italian Restaurant,Sandwich Place,Coffee Shop,Indian Restaurant,Liquor Store,Café,Butcher,Pharmacy,Pizza Place,Juice Bar
3,Don Mills,Japanese Restaurant,Gym,Coffee Shop,Asian Restaurant,Restaurant,Beer Store,Gym / Fitness Center,Concert Hall,Clothing Store,Dim Sum Restaurant
4,Downsview,Grocery Store,Park,Baseball Field,Home Service,Food Truck,Liquor Store,Discount Store,Shopping Mall,Snack Place,Gym / Fitness Center
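
The `indicators` list with a fallback to `th` works for the ten columns used here, but it would label a 21st column as `21th`. If more columns were ever needed, a small ordinal helper could generate the names. The function `ordinal` below is a hypothetical sketch, not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


columns = ["Neighborhood"] + ["{} Most Common Venue".format(ordinal(i)) for i in range(1, 11)]
print(columns[1], columns[4], columns[10])
```

For the first ten positions this produces the same column names as the loop above.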


### h. Cluster Neighborhoods

In [47]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

#### Create 5 clusters

In [48]:
# set number of clusters
kclusters = 5

borough_grouped_clustering = borough_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(borough_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 0, 0, 0, 0, 0, 1, 2], dtype=int32)

In [49]:
#add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
neighborhoods_venues_sorted

Unnamed: 0,Cluster Labels,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,0,"Bathurst Manor, Wilson Heights, Downsview North",Coffee Shop,Bank,Park,Shopping Mall,Fried Chicken Joint,Ice Cream Shop,Diner,Deli / Bodega,Middle Eastern Restaurant,Pharmacy
1,0,Bayview Village,Chinese Restaurant,Café,Bank,Japanese Restaurant,Women's Store,Dog Run,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store
2,0,"Bedford Park, Lawrence Manor East",Italian Restaurant,Sandwich Place,Coffee Shop,Indian Restaurant,Liquor Store,Café,Butcher,Pharmacy,Pizza Place,Juice Bar
3,0,Don Mills,Japanese Restaurant,Gym,Coffee Shop,Asian Restaurant,Restaurant,Beer Store,Gym / Fitness Center,Concert Hall,Clothing Store,Dim Sum Restaurant
4,0,Downsview,Grocery Store,Park,Baseball Field,Home Service,Food Truck,Liquor Store,Discount Store,Shopping Mall,Snack Place,Gym / Fitness Center
5,0,"Fairview, Henry Farm, Oriole",Clothing Store,Coffee Shop,Fast Food Restaurant,Women's Store,Restaurant,Mobile Phone Shop,Japanese Restaurant,Bakery,Chinese Restaurant,Bank
6,0,Glencairn,Pizza Place,Pub,Japanese Restaurant,Park,Women's Store,Diner,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop
7,0,Hillcrest Village,Dog Run,Golf Course,Pool,Fast Food Restaurant,Mediterranean Restaurant,Discount Store,Construction & Landscaping,Convenience Store,Cosmetics Shop,Deli / Bodega
8,1,Humber Summit,Pizza Place,Music Store,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store,Dessert Shop
9,2,"Humberlea, Emery",Baseball Field,Women's Store,Electronics Store,Construction & Landscaping,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant


In [50]:
borough_merged = borough_data

# join neighborhoods_venues_sorted to borough_data to add latitude/longitude for each neighborhood
borough_merged = borough_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

# Neighborhoods with no venues have no cluster label (NaN); assign them to cluster 0 so the column can be cast to int
borough_merged['Cluster Labels'] = borough_merged['Cluster Labels'].replace(np.nan, 0)

#Convert float values to int
borough_merged['Cluster Labels'] = borough_merged['Cluster Labels'].astype(int)
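
Replacing missing labels with 0 makes unlabelled neighborhoods indistinguishable from genuine members of cluster 0. Two alternatives are sketched below, on an illustrative frame rather than the notebook's data: keeping the gap visible with pandas' nullable integer dtype, or mapping only the rows that received a label.

```python
import numpy as np
import pandas as pd

# Illustrative labels: the NaN stands for a neighborhood with no venues returned
merged = pd.DataFrame({
    "Neighborhood": ["A", "B", "C"],
    "Cluster Labels": [0.0, np.nan, 2.0],
})

# Option 1: keep the gap visible with the nullable integer dtype
merged["Cluster Labels"] = merged["Cluster Labels"].astype("Int64")

# Option 2: keep only the rows that actually received a label
labelled = merged.dropna(subset=["Cluster Labels"])

print(merged["Cluster Labels"].isna().sum())
print(labelled["Neighborhood"].tolist())
```

Which option fits depends on whether unlabelled neighborhoods should appear on the cluster map at all.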

In [51]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

In [52]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

print(f"x = {x}\n")
print(f"ys = {ys}\n")
print(f"colors_array={colors_array}\n")
print(f"rainbow={rainbow}\n")


x = [0 1 2 3 4]

ys = [array([0, 1, 2, 3, 4]), array([ 1,  3,  7, 13, 21]), array([ 2,  7, 20, 41, 70]), array([  3,  13,  41,  87, 151]), array([  4,  21,  70, 151, 264])]

colors_array=[[5.00000000e-01 0.00000000e+00 1.00000000e+00 1.00000000e+00]
 [1.96078431e-03 7.09281308e-01 9.23289106e-01 1.00000000e+00]
 [5.03921569e-01 9.99981027e-01 7.04925547e-01 1.00000000e+00]
 [1.00000000e+00 7.00543038e-01 3.78411050e-01 1.00000000e+00]
 [1.00000000e+00 1.22464680e-16 6.12323400e-17 1.00000000e+00]]

rainbow=['#8000ff', '#00b5eb', '#80ffb4', '#ffb360', '#ff0000']
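
Only the length of `ys` feeds into the colours, so the palette can be built directly from the number of clusters. The sketch below wraps that in a hypothetical helper, `cluster_palette`, using the same matplotlib calls as the cell above.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors


def cluster_palette(n_clusters):
    """Return one hex colour per cluster, evenly spaced along the rainbow colormap."""
    return [colors.rgb2hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, n_clusters))]


palette = cluster_palette(5)
print(palette)
```

For five clusters this should reproduce the `rainbow` list printed above.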



#### setup folium map to show the clusters

In [53]:
# add a marker for each neighborhood, colored by its cluster label
for lat, lon, poi, cluster in zip(borough_merged['Latitude'],
                                  borough_merged['Longitude'],
                                  borough_merged['Neighborhood'],
                                  borough_merged['Cluster Labels']):

    txt = str(poi) + ' Cluster ' + str(cluster)

    # debug output: neighborhood, cluster, and coordinates
    print(txt)
    print(f"lat {lat},lon {lon}\n")

    label = folium.Popup(txt, parse_html=True)

    # labels run from 0 to kclusters-1, so they index the color list directly
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)

Parkwoods Cluster 4
lat 43.7532586,lon -79.3296565

Victoria Village Cluster 0
lat 43.725882299999995,lon -79.31557159999998

Lawrence Manor, Lawrence Heights Cluster 0
lat 43.718517999999996,lon -79.46476329999999

Don Mills Cluster 0
lat 43.745905799999996,lon -79.352188

Glencairn Cluster 0
lat 43.709577,lon -79.44507259999999

Don Mills Cluster 0
lat 43.72589970000001,lon -79.340923

Hillcrest Village Cluster 0
lat 43.8037622,lon -79.3634517

Bathurst Manor, Wilson Heights, Downsview North Cluster 0
lat 43.7543283,lon -79.4422593

Fairview, Henry Farm, Oriole Cluster 0
lat 43.7785175,lon -79.3465557

Northwood Park, York University Cluster 0
lat 43.7679803,lon -79.48726190000001

Bayview Village Cluster 0
lat 43.7869473,lon -79.385975

Downsview Cluster 0
lat 43.737473200000004,lon -79.46476329999999

York Mills, Silver Hills Cluster 0
lat 43.7574902,lon -79.37471409999999

Downsview Cluster 0
lat 43.7390146,lon -79.5069436

North Park, Maple Leaf Park, Upwood Park Cluster 3
lat 43
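
Because the cluster labels start at 0, how a label is turned into a list index matters: `rainbow[cluster]` maps label 0 to the first color, while `rainbow[cluster - 1]` maps it to index -1, which Python reads as the last element. The sketch below compares the two mappings using the five hex colors printed earlier; the names `direct` and `shifted` are introduced here for illustration only.

```python
# Hex colors copied from the notebook output for kclusters = 5
rainbow = ['#8000ff', '#00b5eb', '#80ffb4', '#ffb360', '#ff0000']
labels = range(len(rainbow))  # cluster labels 0, 1, 2, 3, 4

# Label used directly as the index
direct = {label: rainbow[label] for label in labels}

# Label shifted by one: label 0 becomes index -1 (the last color)
shifted = {label: rainbow[label - 1] for label in labels}

for label in labels:
    print(label, direct[label], shifted[label])
```

Both mappings give every cluster a distinct color, but the shifted one does so only because negative indices wrap around to the end of the list.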

In [54]:
map_clusters

### i. Examine Clusters

In [55]:
borough_merged.loc[borough_merged['Cluster Labels'] == 0, borough_merged.columns[[1] + list(range(5, borough_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,North York,0,Hockey Arena,Pizza Place,Portuguese Restaurant,Coffee Shop,Women's Store,Diner,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop
2,North York,0,Clothing Store,Women's Store,Miscellaneous Shop,Boutique,Coffee Shop,Event Space,Furniture / Home Store,Vietnamese Restaurant,Accessories Store,Sushi Restaurant
3,North York,0,Japanese Restaurant,Gym,Coffee Shop,Asian Restaurant,Restaurant,Beer Store,Gym / Fitness Center,Concert Hall,Clothing Store,Dim Sum Restaurant
4,North York,0,Pizza Place,Pub,Japanese Restaurant,Park,Women's Store,Diner,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop
5,North York,0,Japanese Restaurant,Gym,Coffee Shop,Asian Restaurant,Restaurant,Beer Store,Gym / Fitness Center,Concert Hall,Clothing Store,Dim Sum Restaurant
6,North York,0,Dog Run,Golf Course,Pool,Fast Food Restaurant,Mediterranean Restaurant,Discount Store,Construction & Landscaping,Convenience Store,Cosmetics Shop,Deli / Bodega
7,North York,0,Coffee Shop,Bank,Park,Shopping Mall,Fried Chicken Joint,Ice Cream Shop,Diner,Deli / Bodega,Middle Eastern Restaurant,Pharmacy
8,North York,0,Clothing Store,Coffee Shop,Fast Food Restaurant,Women's Store,Restaurant,Mobile Phone Shop,Japanese Restaurant,Bakery,Chinese Restaurant,Bank
9,North York,0,Coffee Shop,Furniture / Home Store,Miscellaneous Shop,Caribbean Restaurant,Bar,Metro Station,Massage Studio,Women's Store,Diner,Cosmetics Shop
10,North York,0,Chinese Restaurant,Café,Bank,Japanese Restaurant,Women's Store,Dog Run,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store


In [56]:
borough_merged.loc[borough_merged['Cluster Labels'] == 1, borough_merged.columns[[1] + list(range(5, borough_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
15,North York,1,Pizza Place,Music Store,Comfort Food Restaurant,Concert Hall,Construction & Landscaping,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store,Dessert Shop


In [57]:
borough_merged.loc[borough_merged['Cluster Labels'] == 2, borough_merged.columns[[1] + list(range(5, borough_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
19,North York,2,Baseball Field,Women's Store,Electronics Store,Construction & Landscaping,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant


In [58]:
borough_merged.loc[borough_merged['Cluster Labels'] == 3, borough_merged.columns[[1] + list(range(5, borough_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
14,North York,3,Park,Construction & Landscaping,Trail,Bakery,Dog Run,Concert Hall,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store


In [59]:
borough_merged.loc[borough_merged['Cluster Labels'] == 4, borough_merged.columns[[1] + list(range(5, borough_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,North York,4,Park,Food & Drink Shop,Women's Store,Dog Run,Construction & Landscaping,Convenience Store,Cosmetics Shop,Deli / Bodega,Department Store,Dessert Shop
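
The five cells above repeat the same filter with a different label each time. A minimal sketch of a loop-based alternative keeps the column selection in one place. The helper `cluster_view`, the toy frame `demo`, and the `Postal Code` column name are assumptions made for illustration; the column order mirrors what the positional selection above implies (Borough at position 1, `Cluster Labels` at position 5).

```python
import pandas as pd

def cluster_view(df, label, first_col=5):
    """Return the rows of one cluster, keeping column 1 and every column from first_col on."""
    keep = [1] + list(range(first_col, df.shape[1]))
    return df.loc[df['Cluster Labels'] == label, df.columns[keep]]

# Toy frame with made-up values
demo = pd.DataFrame({
    'Postal Code': ['X1', 'X2', 'X3'],
    'Borough': ['North York'] * 3,
    'Neighborhood': ['A', 'B', 'C'],
    'Latitude': [43.70, 43.71, 43.72],
    'Longitude': [-79.40, -79.41, -79.42],
    'Cluster Labels': [0, 1, 0],
    '1st Most Common Venue': ['Park', 'Pizza Place', 'Coffee Shop'],
})

# Print one sub-table per cluster label
for label in sorted(demo['Cluster Labels'].unique()):
    print(f"Cluster {label}")
    print(cluster_view(demo, label))
```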
