# Exploring Toronto (cont. part2)
I will be reusing much of the code from the previous part of this project. To save space, I have merged most of it into a single cell.

The goal of this notebook is simply to add the geocodes for the applicable postal codes in Toronto, Canada, which were extracted in the previous phase of this project.

### Installing Libraries

In [1]:
!conda install beautifulsoup4
!conda install lxml
!conda install requests

Fetching package metadata ...........
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following packages will be UPDATED:

    beautifulsoup4: 4.6.0-py35h442a8c9_1 --> 4.6.3-py35_0

beautifulsoup4 100% |################################| Time: 0:00:00  42.03 MB/s
Fetching package metadata ...........
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following packages will be UPDATED:

    libgcc-ng: 7.2.0-h7cc24e2_2     --> 8.2.0-hdf63c60_1    
    libxml2:   2.9.4-h6b072ca_5     --> 2.9.8-hf84eae3_0    
    libxslt:   1.1.29-hcf9102b_5    --> 1.1.33-h7d1a2b0_0   
    lxml:      4.1.0-py35ha401a81_0 --> 4.2.5-py35hefd8a0e_0

libgcc-ng-8.2. 100% |################################| Time: 0:00:00  89.42 MB/s
libxml2-2.9.8- 100% |################################| Time: 0:00:00  68.36 MB/s
libxslt-1.1.33 100% |################################| Time: 0:00:00  67.

### Importing the libraries

In [2]:
from bs4 import BeautifulSoup as bs
import requests as rq
import pandas as pd
import numpy as np

In [3]:
source = rq.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = bs(source, 'lxml')

table = soup.table

column_names = []
for te in table.find_all('th'):
    column_names.append(te.text)


## Creating a blank dataframe
postalCodes_DF = pd.DataFrame(columns = column_names)

for i,te in enumerate(table.find_all('tr')):
    if (te.td):
        row_lst = te.text[:].split('\n')
        # Below: Eliminating the 'Not assigned'-rows
        if row_lst[2] != 'Not assigned':
            postcode = row_lst[1]
            borough = row_lst[2]
            # Below: Where a 'Borough' is given but the 'Neighbourhood' is 'Not assigned', use the borough name as the neighbourhood
            if row_lst[3] == 'Not assigned': 
                nhood = borough
            else:
                nhood = row_lst[3]

            postalCodes_DF = postalCodes_DF.append({'Postcode': postcode,
                                                    'Borough': borough,
                                                    'Neighbourhood': nhood},ignore_index=True)

pc_DF = postalCodes_DF.drop(['Neighbourhood\n'],1)


new_pc_DF = pc_DF.groupby(['Postcode','Borough'])['Neighbourhood'].apply(','.join)
new_pc_DF = new_pc_DF.to_frame().reset_index()
new_pc_DF.loc[new_pc_DF['Postcode'] == 'M7A']

print(('The shape of the DF is: {}').format(new_pc_DF.shape))

The shape of the DF is: (103, 3)
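
The cleaning rules used above (skip rows without a borough, fall back to the borough name when the neighbourhood is 'Not assigned', then join neighbourhoods that share a postcode) can be illustrated on a few hand-made rows. This is only a sketch: the `rows` list is made-up sample data, and the records are collected in a plain list before building the dataframe, because `DataFrame.append` was removed in pandas 2.0.

```python
import pandas as pd

# Hypothetical sample rows in the (Postcode, Borough, Neighbourhood) layout of the scraped table
rows = [
    ('M1A', 'Not assigned', 'Not assigned'),
    ('M5A', 'Downtown Toronto', 'Harbourfront'),
    ('M5A', 'Downtown Toronto', 'Regent Park'),
    ('M7A', "Queen's Park", 'Not assigned'),
]

records = []
for postcode, borough, nhood in rows:
    if borough == 'Not assigned':
        continue  # skip postcodes without a borough
    if nhood == 'Not assigned':
        nhood = borough  # fall back to the borough name
    records.append({'Postcode': postcode, 'Borough': borough, 'Neighbourhood': nhood})

sample_df = pd.DataFrame(records)

# One row per postcode, with its neighbourhoods joined into a comma-separated string
grouped = (sample_df.groupby(['Postcode', 'Borough'])['Neighbourhood']
           .apply(','.join)
           .reset_index())
print(grouped)
```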


### GeoCoding data
In the interest of time, I am going straight to the CSV file of geocodes and using it to attach the latitudes and longitudes to the dataframe.

In [4]:
import types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0


df_data_1 = pd.read_csv(body)
df_data_1.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
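
The cell above relies on a `body` object created by storage-client code that is not shown here. As a stand-in sketch, assuming the same CSV is available as plain text or as a local file, `pd.read_csv` accepts any file-like object. The text below simply reuses the five rows displayed above; `csv_text` and `geo_df` are illustrative names.

```python
import io
import pandas as pd

# Illustrative CSV text built from the five rows shown above; a real run would
# pass a file path or the storage object to pd.read_csv instead
csv_text = """Postal Code,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
M1G,43.770992,-79.216917
M1H,43.773136,-79.239476
"""

geo_df = pd.read_csv(io.StringIO(csv_text))
print(geo_df.dtypes)
```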


### Joining the data
Below is a simple left join on the postal code.

In [5]:
ll_df = pd.merge(new_pc_DF,df_data_1,left_on='Postcode',right_on='Postal Code',how='left').drop('Postal Code', axis=1).rename(columns={'Postcode':'PostalCode'})
ll_df.shape

(103, 5)
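
Because this is a left join, any postcode missing from the geocode file would silently end up with empty coordinates. A quick check, sketched here on made-up sample frames (`left` and `right` are illustrative, and `M9Z` is a hypothetical unmatched code), is to use the `indicator` option of `pd.merge` and list the rows that found no match.

```python
import pandas as pd

# Made-up sample frames; 'M9Z' is a hypothetical postcode with no geocode entry
left = pd.DataFrame({'Postcode': ['M1B', 'M1C', 'M9Z'],
                     'Borough': ['Scarborough', 'Scarborough', 'Nowhere']})
right = pd.DataFrame({'Postal Code': ['M1B', 'M1C'],
                      'Latitude': [43.806686, 43.784535],
                      'Longitude': [-79.194353, -79.160497]})

# indicator=True adds a '_merge' column saying where each row came from
checked = pd.merge(left, right, left_on='Postcode', right_on='Postal Code',
                   how='left', indicator=True)

# Rows present only on the left side have no coordinates
unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched[['Postcode', 'Borough']])
```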

### Checking the received data against Coursera
I created a list of the first five rows as per the Coursera table for this assignment. I then checked the latlongs against the table given in their example.

In [6]:
ll_df[ll_df['PostalCode'].isin(['M5G','M2H','M4B','M1J','M4G'])]

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
17,M2H,North York,Hillcrest Village,43.803762,-79.363452
35,M4B,East York,"Woodbine Gardens,Parkview Hill",43.706397,-79.309937
38,M4G,East York,Leaside,43.70906,-79.363452
57,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383


## Analysis of data set
The below data set is the final dataset that I will apply the analysis on.

In [7]:
print(ll_df[ll_df['Borough'].str.contains('Toronto')].shape)
ll_df[ll_df['Borough'].str.contains('Toronto')].drop('PostalCode', axis=1).head()

(38, 5)


Unnamed: 0,Borough,Neighbourhood,Latitude,Longitude
37,East Toronto,The Beaches,43.676357,-79.293031
41,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188
42,East Toronto,"The Beaches West,India Bazaar",43.668999,-79.315572
43,East Toronto,Studio District,43.659526,-79.340923
44,Central Toronto,Lawrence Park,43.72802,-79.38879


In [8]:
ll_df['Borough'].value_counts()

North York          24
Downtown Toronto    18
Scarborough         17
Etobicoke           12
Central Toronto      9
West Toronto         6
East York            5
York                 5
East Toronto         5
Mississauga          1
Queen's Park         1
Name: Borough, dtype: int64

In [9]:
ll_df['Borough'].unique()[4]

'Central Toronto'

### Mapping all datapoints
First, plot all the data points on a map to check that they all fall within the city of Toronto.

To plot these on a map, I first need to install **_folium_**:

In [10]:
!pip install folium

Collecting folium
  Downloading https://files.pythonhosted.org/packages/43/77/0287320dc4fd86ae8847bab6c34b5ec370e836a79c7b0c16680a3d9fd770/folium-0.8.3-py2.py3-none-any.whl (87kB)
    100% |████████████████████████████████| 92kB 7.7MB/s eta 0:00:01
Requirement not upgraded as not directly required: six in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from folium)
Requirement not upgraded as not directly required: jinja2 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from folium)
Requirement not upgraded as not directly required: requests in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from folium)
Requirement not upgraded as not directly required: numpy in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from folium)
Collecting branca>=0.3.0 (from folium)
  Downloading https://files.pythonhosted.org/packages/63/36/1c93318e9653f4e414a2e0c3b98fc898b4970e939afeedeee6075dd3b703/branca-0.3.1-py3-none-any.whl
Requirement not upgraded a

In [11]:
import folium

The map below gives a quick overview of where all the points fall within Toronto:

In [12]:
# Toronto,ON, Canada from maps.google: 43.728671, -79.381604
tor_lat = 43.728671
tor_lng = -79.381604

# List of colors that I can run through
colours = ['red','blue','gray','darkred','lightred','orange','beige','green','darkgreen','lightgreen'
         ,'darkblue','lightblue','purple','darkpurple','pink','cadetblue','lightgray','black']
#source: https://stackoverflow.com/questions/36202514/foilum-map-module-trying-to-get-more-options-for-marker-colors

# Unique list of all the boroughs
bors = list(ll_df['Borough'].unique())

map_tor = folium.Map(location = [tor_lat,tor_lng], zoom_start=12)
for bor,name,lat,lng in zip(ll_df['Borough'],ll_df['Neighbourhood'],ll_df['Latitude'],ll_df['Longitude']):
    label = folium.Popup(name+'\n['+bor+']')
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=colours[bors.index(bor)],
        fill=True,
        fill_color=colours[bors.index(bor)],
        fill_opacity=0.4,
        parse_html=False).add_to(map_tor)  

map_tor

In [13]:
ll_df[ll_df['Neighbourhood'].str.contains('Harbour')]

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
53,M5A,Downtown Toronto,"Harbourfront,Regent Park",43.65426,-79.360636
59,M5J,Downtown Toronto,"Harbourfront East,Toronto Islands,Union Station",43.640816,-79.381752
68,M5V,Downtown Toronto,"CN Tower,Bathurst Quay,Island airport,Harbourf...",43.628947,-79.39442


### Starting the analysis
In this section I will be doing the following:
1. For each neighbourhood request FourSquare data (using the explore endpoint) and get nearby venues
2. Cluster the different neighbourhoods
3. Map the cluster

I would like to see whether there is a correlation between the physical location of the neighbourhoods and their assigned clusters. If so, then neighbourhoods that are geographically close to one another would tend to fall into the same cluster.
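
One way to make this comparison concrete, once cluster labels exist, is to compare the average distance between neighbourhoods in the same cluster with the average distance between all neighbourhoods. The sketch below is an assumption about how that could be measured, not part of the original analysis; `haversine_km`, `mean_pairwise_km`, and the cluster labels in `sample` are illustrative (the coordinates are taken from the tables above). If clusters followed geography, the within-cluster averages would be noticeably smaller than the overall one.

```python
import math
from itertools import combinations

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two lat/long points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def mean_pairwise_km(points):
    """Average distance over all unordered pairs of (lat, lng) points."""
    pairs = list(combinations(points, 2))
    if not pairs:
        return 0.0
    return sum(haversine_km(*a, *b) for a, b in pairs) / len(pairs)

# (cluster label, latitude, longitude); the labels here are made up for illustration
sample = [
    (0, 43.676357, -79.293031),  # The Beaches
    (0, 43.668999, -79.315572),  # The Beaches West, India Bazaar
    (1, 43.657952, -79.387383),  # Central Bay Street
    (1, 43.640816, -79.381752),  # Harbourfront East, Toronto Islands, Union Station
]

overall = mean_pairwise_km([(lat, lng) for _, lat, lng in sample])
within = {
    label: mean_pairwise_km([(lat, lng) for c, lat, lng in sample if c == label])
    for label in sorted({c for c, _, _ in sample})
}
print(overall, within)
```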

In [14]:
import requests
from pandas.io.json import json_normalize
from sklearn.cluster import KMeans
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

---
### Step1: Exploring each Borough

In this step I need to get all the venues surrounding the latitude and longitude of each postal code. This happens below:

In [16]:
from IPython.display import clear_output
import math
class ProgressBar:
    
    def __init__(self):
        self.progress = 0
        
    def updateProgress(self,curr,total):
        perc = curr / total
        prog = math.floor(perc * 20)
        if (self.progress < prog):
            self.progress = prog
            clear_output()
            print('ProgressBar: <{:<20}> [{:4.0f}%]'.format('='*self.progress,math.floor(perc*100)))
        else:
            pass

In [17]:
def getNearbyVenues(names, nhoods,latitudes, longitudes, radius=500):
    LIMIT = 100
    venues_list=[]
    total = names.shape[0]
    counter = 0
    progBar = ProgressBar()
    progBar.updateProgress(counter,total)
    for name,nhood, lat, lng in zip(names,nhoods, latitudes, longitudes):
        counter += 1
        progBar.updateProgress(counter,total)
#         print(str(counter) + ": " + nhood)

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)

        # make the GET request
        results = rq.get(url).json()["response"]['groups'][0]['items']

        # return only relevant information for each nearby venue
        venues_list.append([(
            name,
            nhood,
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
        progBar.updateProgress(counter,total)
#         progressBar(counter,total)

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Borough', 
                  'Neighbourhood',
                  'Borough_Latitude', 
                  'Borough_Longitude', 
                  'Venue', 
                  'Venue_Latitude', 
                  'Venue_Longitude', 
                  'Venue_Category']
    
    print(('Shape of the returned dataframe is: {}').format(nearby_venues.shape))
    
    return(nearby_venues)
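
The list comprehension above assumes that every venue has at least one category and that the response always contains a `groups` entry. A more defensive variant is sketched below. It reads the same nested structure as the function above (`response`, then `groups`, then `items`, then `venue`), but the helper name `extract_venues`, the `'Uncategorised'` fallback, and the sample payload are assumptions made for illustration.

```python
def extract_venues(payload, fallback_category='Uncategorised'):
    """Return (name, lat, lng, category) tuples from an explore-style response dict."""
    groups = payload.get('response', {}).get('groups', [])
    if not groups:
        return []  # nothing returned for this location
    venues = []
    for item in groups[0].get('items', []):
        venue = item['venue']
        categories = venue.get('categories') or []
        # Use the first category if there is one, otherwise the fallback label
        category = categories[0]['name'] if categories else fallback_category
        venues.append((venue['name'],
                       venue['location']['lat'],
                       venue['location']['lng'],
                       category))
    return venues

# Hand-written payload mimicking the structure used above (not real API output)
sample_payload = {
    'response': {'groups': [{'items': [
        {'venue': {'name': "Wendy's",
                   'location': {'lat': 43.807448, 'lng': -79.199056},
                   'categories': [{'name': 'Fast Food Restaurant'}]}},
        {'venue': {'name': 'Unnamed Kiosk',
                   'location': {'lat': 43.807, 'lng': -79.199},
                   'categories': []}},
    ]}]}
}

print(extract_venues(sample_payload))
```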

In [18]:
tor_venue_df = getNearbyVenues(ll_df['Borough'],ll_df['Neighbourhood'],ll_df['Latitude'],ll_df['Longitude'])

Shape of the returned dataframe is: (2265, 8)


In [19]:
tor_venue_df.head()

Unnamed: 0,Borough,Neighbourhood,Borough_Latitude,Borough_Longitude,Venue,Venue_Latitude,Venue_Longitude,Venue_Category
0,Scarborough,"Rouge,Malvern",43.806686,-79.194353,Wendy's,43.807448,-79.199056,Fast Food Restaurant
1,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497,Chris Effects Painting,43.784343,-79.163742,Construction & Landscaping
2,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
3,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711,Swiss Chalet Rotisserie & Grill,43.767697,-79.189914,Pizza Place
4,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store


The above dataframe contains every postal-code area (labelled by borough and neighbourhood) together with its nearby venues (at most 100 venues within a 500 m radius). This will form the basis for the clustering model.

Following on from the hypothesis set out in the introduction to the analysis (proximity correlated with clustering), and in order to work with a manageable dataset, I will subset the above dataset to the rows where the borough name contains 'Toronto'.

In [20]:
tor_sub_venue_df = tor_venue_df[tor_venue_df['Borough'].str.contains('Toronto')]
print(('The result is that there are {} unique boroughs. The overall shape of the dataframe is given below:\n{}')
      .format(len(tor_sub_venue_df['Borough'].unique()),tor_sub_venue_df.shape))
tor_sub_venue_df.head(3)

The result is that there are 4 unique boroughs. The overall shape of the dataframe is given below:
(1712, 8)


Unnamed: 0,Borough,Neighbourhood,Borough_Latitude,Borough_Longitude,Venue,Venue_Latitude,Venue_Longitude,Venue_Category
301,East Toronto,The Beaches,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
302,East Toronto,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
303,East Toronto,The Beaches,43.676357,-79.293031,Starbucks,43.678798,-79.298045,Coffee Shop


---
### Step2: Creating the clusters
First we need to get the dummy variables based on the venue categories:

In [21]:
tor_dummy_df = pd.get_dummies(tor_sub_venue_df[['Venue_Category']], prefix="", prefix_sep="")
tor_dummy_df['Neighbourhood'] = tor_sub_venue_df['Neighbourhood'] 
fixed_columns = [tor_dummy_df.columns[-1]] + list(tor_dummy_df.columns[:-1])
tor_dummy_df = tor_dummy_df[fixed_columns]

print(('There are {} total venues in this dataframe, while there are only {} unique neighbourhoods').format(tor_dummy_df.shape[0],len(tor_dummy_df['Neighbourhood'].unique())))
tor_dummy_df.head()

There are 1712 total venues in this dataframe, while there are only 38 unique neighbourhoods


Unnamed: 0,Neighbourhood,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Workshop,BBQ Joint,Baby Store,Bagel Shop,Bakery,Bank,Bar,Baseball Stadium,Basketball Stadium,Beach,Beer Bar,Beer Store,Belgian Restaurant,Bistro,Boat or Ferry,Bookstore,Boutique,Brazilian Restaurant,Breakfast Spot,Brewery,Bubble Tea Shop,Building,Burger Joint,Burrito Place,Bus Line,Business Service,Butcher,Café,Cajun / Creole Restaurant,Camera Store,Caribbean Restaurant,Cheese Shop,Chinese Restaurant,Chocolate Shop,Church,Climbing Gym,Clothing Store,Cocktail Bar,Coffee Shop,College Arts Building,College Gym,College Rec Center,Colombian Restaurant,Comfort Food Restaurant,Comic Shop,Concert Hall,Convenience Store,Cosmetics Shop,Coworking Space,Creperie,Cuban Restaurant,Cupcake Shop,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,Eastern European Restaurant,Electronics Store,Ethiopian Restaurant,Event Space,Falafel Restaurant,Farmers Market,Fast Food Restaurant,Filipino Restaurant,Fish & Chips Shop,Fish Market,Flea Market,Food,Food & Drink Shop,Food Court,Food Truck,Fountain,French Restaurant,Fried Chicken Joint,Fruit & Vegetable Store,Furniture / Home Store,Gaming Cafe,Garden,Garden Center,Gastropub,Gay Bar,General Entertainment,General Travel,German Restaurant,Gift Shop,Gluten-free Restaurant,Gourmet Shop,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Harbor / Marina,Health & Beauty Service,Health Food Store,Historic Site,History Museum,Hobby Shop,Home Service,Hookah Bar,Hospital,Hostel,Hotel,Hotel Bar,Hotpot Restaurant,Ice Cream Shop,Indian Restaurant,Indie Movie Theater,Intersection,Irish Pub,Italian Restaurant,Japanese Restaurant,Jazz 
Club,Jewelry Store,Jewish Restaurant,Juice Bar,Korean Restaurant,Lake,Latin American Restaurant,Light Rail Station,Lingerie Store,Liquor Store,Lounge,Mac & Cheese Joint,Malay Restaurant,Market,Martial Arts Dojo,Massage Studio,Medical Center,Mediterranean Restaurant,Men's Store,Metro Station,Mexican Restaurant,Middle Eastern Restaurant,Miscellaneous Shop,Modern European Restaurant,Molecular Gastronomy Restaurant,Monument / Landmark,Movie Theater,Museum,Music Store,Music Venue,Neighborhood,New American Restaurant,Nightclub,Noodle House,Office,Opera House,Optical Shop,Organic Grocery,Other Great Outdoors,Park,Performing Arts Venue,Pet Store,Pharmacy,Pizza Place,Plane,Playground,Plaza,Poke Place,Pool,Portuguese Restaurant,Poutine Place,Pub,Ramen Restaurant,Record Shop,Recording Studio,Rental Car Location,Restaurant,Sake Bar,Salad Place,Salon / Barbershop,Sandwich Place,Scenic Lookout,Sculpture Garden,Seafood Restaurant,Shoe Store,Shopping Mall,Skate Park,Skating Rink,Smoke Shop,Smoothie Shop,Snack Place,Soup Place,Southern / Soul Food Restaurant,Spa,Speakeasy,Sporting Goods Shop,Sports Bar,Stadium,Stationery Store,Steakhouse,Strip Club,Supermarket,Sushi Restaurant,Swim School,Taco Place,Tailor Shop,Taiwanese Restaurant,Tanning Salon,Tapas Restaurant,Tea Room,Tennis Court,Thai Restaurant,Theater,Theme Restaurant,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Women's Store,Yoga Studio
301,The Beaches,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
302,The Beaches,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
303,The Beaches,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
304,The Beaches,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
357,"The Danforth West,Riverdale",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


Then we need to group the venues by neighbourhood, taking the mean of each dummy column (i.e. the share of each venue category in that neighbourhood):

In [22]:
tor_nhood_cat_df = tor_dummy_df.groupby('Neighbourhood').mean().reset_index()
print(tor_nhood_cat_df.shape)
tor_nhood_cat_df.head(2)

(38, 243)


Unnamed: 0,Neighbourhood,Accessories Store,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Workshop,BBQ Joint,Baby Store,Bagel Shop,Bakery,Bank,Bar,Baseball Stadium,Basketball Stadium,Beach,Beer Bar,Beer Store,Belgian Restaurant,Bistro,Boat or Ferry,Bookstore,Boutique,Brazilian Restaurant,Breakfast Spot,Brewery,Bubble Tea Shop,Building,Burger Joint,Burrito Place,Bus Line,Business Service,Butcher,Café,Cajun / Creole Restaurant,Camera Store,Caribbean Restaurant,Cheese Shop,Chinese Restaurant,Chocolate Shop,Church,Climbing Gym,Clothing Store,Cocktail Bar,Coffee Shop,College Arts Building,College Gym,College Rec Center,Colombian Restaurant,Comfort Food Restaurant,Comic Shop,Concert Hall,Convenience Store,Cosmetics Shop,Coworking Space,Creperie,Cuban Restaurant,Cupcake Shop,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,Eastern European Restaurant,Electronics Store,Ethiopian Restaurant,Event Space,Falafel Restaurant,Farmers Market,Fast Food Restaurant,Filipino Restaurant,Fish & Chips Shop,Fish Market,Flea Market,Food,Food & Drink Shop,Food Court,Food Truck,Fountain,French Restaurant,Fried Chicken Joint,Fruit & Vegetable Store,Furniture / Home Store,Gaming Cafe,Garden,Garden Center,Gastropub,Gay Bar,General Entertainment,General Travel,German Restaurant,Gift Shop,Gluten-free Restaurant,Gourmet Shop,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Harbor / Marina,Health & Beauty Service,Health Food Store,Historic Site,History Museum,Hobby Shop,Home Service,Hookah Bar,Hospital,Hostel,Hotel,Hotel Bar,Hotpot Restaurant,Ice Cream Shop,Indian Restaurant,Indie Movie Theater,Intersection,Irish Pub,Italian Restaurant,Japanese Restaurant,Jazz 
Club,Jewelry Store,Jewish Restaurant,Juice Bar,Korean Restaurant,Lake,Latin American Restaurant,Light Rail Station,Lingerie Store,Liquor Store,Lounge,Mac & Cheese Joint,Malay Restaurant,Market,Martial Arts Dojo,Massage Studio,Medical Center,Mediterranean Restaurant,Men's Store,Metro Station,Mexican Restaurant,Middle Eastern Restaurant,Miscellaneous Shop,Modern European Restaurant,Molecular Gastronomy Restaurant,Monument / Landmark,Movie Theater,Museum,Music Store,Music Venue,Neighborhood,New American Restaurant,Nightclub,Noodle House,Office,Opera House,Optical Shop,Organic Grocery,Other Great Outdoors,Park,Performing Arts Venue,Pet Store,Pharmacy,Pizza Place,Plane,Playground,Plaza,Poke Place,Pool,Portuguese Restaurant,Poutine Place,Pub,Ramen Restaurant,Record Shop,Recording Studio,Rental Car Location,Restaurant,Sake Bar,Salad Place,Salon / Barbershop,Sandwich Place,Scenic Lookout,Sculpture Garden,Seafood Restaurant,Shoe Store,Shopping Mall,Skate Park,Skating Rink,Smoke Shop,Smoothie Shop,Snack Place,Soup Place,Southern / Soul Food Restaurant,Spa,Speakeasy,Sporting Goods Shop,Sports Bar,Stadium,Stationery Store,Steakhouse,Strip Club,Supermarket,Sushi Restaurant,Swim School,Taco Place,Tailor Shop,Taiwanese Restaurant,Tanning Salon,Tapas Restaurant,Tea Room,Tennis Court,Thai Restaurant,Theater,Theme Restaurant,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,"Adelaide,King,Richmond",0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.01,0.01,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.01,0.02,0.0,0.0,0.01,0.03,0.01,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.06,0.0,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.02,0.0,0.0,0.0,0.0,0.01,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.01,0.0,0.03,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.01,0.02,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.01,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.03,0.0,0.02,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0
1,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.0,0.0,0.017544,0.0,0.017544,0.035088,0.0,0.0,0.0,0.017544,0.017544,0.017544,0.0,0.017544,0.017544,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.035088,0.0,0.0,0.0,0.035088,0.0,0.0,0.0,0.0,0.017544,0.052632,0.070175,0.0,0.0,0.0,0.0,0.017544,0.0,0.017544,0.0,0.017544,0.0,0.017544,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.035088,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.0,0.0,0.017544,0.017544,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017544,0.017544,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.0,0.0,0.0,0.017544,0.017544,0.0,0.017544,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017544,0.017544,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.035088,0.0,0.0,0.0,0.0,0.035088,0.0,0.0,0.0,0.0,0.0,0.0,0.035088,0.0,0.017544,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.035088,0.0,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.017544,0.0,0.017544,0.0,0.0,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.0,0.0,0.0,0.0
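
To make the two steps above concrete, here is a small sketch on made-up venues (the `venues` frame is illustrative). One-hot encoding turns each venue into a row of 0/1 indicators, and taking the mean per neighbourhood turns those indicators into the share of each category among that neighbourhood's venues, so every row sums to 1.

```python
import pandas as pd

# Made-up venues for two neighbourhoods
venues = pd.DataFrame({
    'Neighbourhood': ['The Beaches', 'The Beaches', 'The Beaches', 'Studio District'],
    'Venue_Category': ['Pub', 'Coffee Shop', 'Coffee Shop', 'Bakery'],
})

# One indicator column per venue category
dummies = pd.get_dummies(venues[['Venue_Category']], prefix='', prefix_sep='', dtype=float)
dummies.insert(0, 'Neighbourhood', venues['Neighbourhood'])

# Mean of the indicators = share of each category per neighbourhood
freq = dummies.groupby('Neighbourhood').mean().reset_index()
print(freq)
```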


We are now ready to do the clustering!

For the clustering, all the dummy variables are used as features, and the number of clusters has to be chosen up front.

To get a good indication of whether geographic location influences the clustering, I focus only on a subset of the boroughs, namely the ones with 'Toronto' in their name.
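
The number of clusters is set by hand in the next cell. One common way to sanity-check such a choice, sketched here on synthetic data rather than on the notebook's frequencies, is to fit `KMeans` for several values of `k` and compare silhouette scores (a higher score means better separated clusters). The blob data, the range of `k`, and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated blobs of 20 points each
rng = np.random.RandomState(0)
centres = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centres])

# Fit KMeans for each candidate k and record the silhouette score
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

In the notebook, the same loop could presumably be run on `tor_clust_df` in place of `X`.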

In [23]:
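clust_num = 4
tor_clust_df = tor_nhood_cat_df.drop('Neighbourhood', axis=1)
kmeans = KMeans(n_clusters=clust_num, random_state=0).fit(tor_clust_df)
tor_df = ll_df[ll_df['Borough'].str.contains('Toronto')].drop('PostalCode', axis=1)
# kmeans.labels_ follows the row order of tor_nhood_cat_df (sorted by neighbourhood name),
# not the row order of tor_df, so attach the labels by neighbourhood name
label_by_nhood = pd.Series(kmeans.labels_, index=tor_nhood_cat_df['Neighbourhood'])
tor_df.insert(0, 'Cluster_Labels', tor_df['Neighbourhood'].map(label_by_nhood))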
print(tor_df.shape)
tor_df.head(3)

(38, 5)


Unnamed: 0,Cluster_Labels,Borough,Neighbourhood,Latitude,Longitude
37,3,East Toronto,The Beaches,43.676357,-79.293031
41,3,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188
42,3,East Toronto,"The Beaches West,India Bazaar",43.668999,-79.315572


The above dataframe has all the necessary information to compare the clustered results against the hypothesis.

All that remains is to map the result...

---
### Step3: Mapping the cluster

There are two maps that are needed to test the hypothesis:
1. Map of the neighbourhoods coloured by their accompanying borough (the same as the first map drawn in the analysis section)
2. Map of the neighbourhoods coloured by their resulting clusters

##### 1. Borough coloured map

In [32]:
# Toronto,ON, Canada from maps.google: 43.655564, -79.392854
tor_lat = 43.655564
tor_lng = -79.392854

# List of colors that I can run through
colours = ['red','blue','gray','darkred','lightred','orange','beige','green','darkgreen','lightgreen'
         ,'darkblue','lightblue','purple','darkpurple','pink','cadetblue','lightgray','black']
#source: https://stackoverflow.com/questions/36202514/foilum-map-module-trying-to-get-more-options-for-marker-colors

# Unique list of all the boroughs
bors = list(ll_df['Borough'].unique())

map_tor = folium.Map(location = [tor_lat,tor_lng], zoom_start=12)
for bor,name,lat,lng in zip(tor_df['Borough'],tor_df['Neighbourhood'],tor_df['Latitude'],tor_df['Longitude']):
    label = folium.Popup(name+'\n['+bor+']')
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=colours[bors.index(bor)],
        fill=True,
        fill_color=colours[bors.index(bor)],
        fill_opacity=0.4,
        parse_html=False).add_to(map_tor)  

map_tor

##### 2. Cluster coloured map

In [33]:
# Toronto,ON, Canada from maps.google: 43.655564, -79.392854
tor_lat = 43.655564
tor_lng = -79.392854

# List of colors that I can run through
colours = ['red','blue','gray','darkred','lightred','orange','beige','green','darkgreen','lightgreen'
         ,'darkblue','lightblue','purple','darkpurple','pink','cadetblue','lightgray','black']
#source: https://stackoverflow.com/questions/36202514/foilum-map-module-trying-to-get-more-options-for-marker-colors

# Unique list of all the boroughs
bors = list(ll_df['Borough'].unique())

map_tor = folium.Map(location = [tor_lat,tor_lng], zoom_start=12)
for clust,name,lat,lng in zip(tor_df['Cluster_Labels'],tor_df['Neighbourhood'],tor_df['Latitude'],tor_df['Longitude']):
    label = folium.Popup(name+'\n[Cluster '+str(clust)+']')
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=colours[clust],
        fill=True,
        fill_color=colours[clust],
        fill_opacity=0.4,
        parse_html=False).add_to(map_tor)  

map_tor

From the above it is clear that there is one majority cluster and that the remaining clusters are small minorities. The differences between them are most probably also very small. This would indicate that these 4 clusters could very easily be treated as a single cluster.

Further support for this is a simple count of how many neighbourhoods are in each cluster:

In [35]:
tor_df['Cluster_Labels'].value_counts()

3    34
2     2
1     1
0     1
Name: Cluster_Labels, dtype: int64
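
To check whether the clusters really differ only slightly, one could look at the dominant venue categories per cluster. The sketch below uses a made-up frequency table (`freq`) and made-up labels; in the notebook, the inputs would presumably be `tor_nhood_cat_df` and `kmeans.labels_`.

```python
import pandas as pd

# Made-up category shares for four neighbourhoods and made-up cluster labels
freq = pd.DataFrame({
    'Neighbourhood': ['A', 'B', 'C', 'D'],
    'Coffee Shop': [0.5, 0.4, 0.0, 0.1],
    'Park': [0.1, 0.2, 1.0, 0.1],
    'Pub': [0.4, 0.4, 0.0, 0.8],
})
labels = [0, 0, 1, 2]

# Average category share per cluster
profile = (freq.drop(columns='Neighbourhood')
           .assign(Cluster=labels)
           .groupby('Cluster')
           .mean())

# The top categories of each cluster, largest share first
top_n = 2
top_categories = {cluster: row.nlargest(top_n).index.tolist()
                  for cluster, row in profile.iterrows()}
print(top_categories)
```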

---
## Findings

Finally, testing the hypothesis.

First, latlongs (i.e. proximity) have no direct bearing on the cluster results. This follows simply from the fact that this variable was never included in the clustering model. **Only one variable** (namely _venue category_) was considered in calculating the clusters.

Therefore, **_proximity_ does not have a bearing on the cluster results**.

What can be said is that _there is very little diversity across these neighbourhoods_, and this is possibly due to the close geographic proximity of all of these neighbourhoods. **It is a very monotone society**. This can be expected of many cities around the world, and it also depends heavily on which cross-section of the city is taken.