## Assignment: Clustering & Segmenting Neighborhoods in the City of Toronto

___
### Task No.1
Prepare a dataframe with Wikipedia data about Toronto postal codes, boroughs and neighborhoods

##### 1) Retrieve information about Toronto neighborhoods from Wikipedia using BeautifulSoup and store it in a list (values[])

In [1]:
import requests
from bs4 import BeautifulSoup
values=[]
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

results = requests.get(url)

soup = BeautifulSoup(results.text, 'html.parser')

# collect the text of every table cell; use the link text when the cell contains a link
for cell in soup.find_all('td'):
    values.append(cell.string if cell.a is None else cell.a.string)

##### 2) slice information from the list into a pandas dataframe

In [2]:
import pandas as pd

zip_values=pd.DataFrame()
zip_values['PostalCode']=values[0:867:3]
zip_values['Borough']=values[1:867:3]
zip_values['Neighborhood']=values[2:867:3]
zip_values.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
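The three strided slices above can be illustrated on a toy list. This is a minimal sketch, not the notebook's code: the helper name `split_columns` and the sample `cells` are hypothetical, and `stop` plays the role of the hard-coded 867, cutting off cells that belong to later tables on the page.

```python
def split_columns(cells, stop, width=3):
    """Return `width` lists, one per table column, using strided slices."""
    return [cells[offset:stop:width] for offset in range(width)]


# Hypothetical flat cell list: two table rows plus one unrelated trailing cell.
cells = [
    "M1A", "Not assigned", "Not assigned\n",
    "M3A", "North York", "Parkwoods\n",
    "unrelated cell from a later table",
]

postal_codes, boroughs, neighborhoods = split_columns(cells, stop=6)
print(postal_codes)   # every 3rd cell starting at offset 0
print(boroughs)       # every 3rd cell starting at offset 1
print(neighborhoods)  # every 3rd cell starting at offset 2
```

Note that the last cell of each row still carries a trailing newline, which is why the later comparison uses 'Not assigned\n'.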


##### 3) Drop rows with borough = 'Not assigned' / 4) copy the borough's name to neighborhoods that are 'Not assigned' / 5) combine rows that share a postal code

In [3]:
# drop rows
drop_indices=zip_values[zip_values['Borough']=='Not assigned'].index
zip_values.drop(drop_indices, axis=0, inplace=True)

# copy borough names
copy_indices=zip_values[zip_values['Neighborhood']=='Not assigned\n'].index
zip_values.loc[copy_indices,'Neighborhood']=zip_values.loc[copy_indices,'Borough']

# combine neighborhood names
values=[]
for name, group in zip_values.groupby(zip_values['PostalCode']):
    values.append([name,group.iloc[0,1],', '.join(group['Neighborhood'].values.tolist()).replace('\n','')]) 
zip_values_new=pd.DataFrame(values, columns=['PostalCode', 'Borough', 'Neighborhood'])
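The same three steps can also be sketched with a boolean mask and `groupby(...).agg(...)` instead of an explicit loop. This is only an illustration under stated assumptions: the small `raw` frame is made-up sample data, and stripping the newline up front is a design choice that differs from the cell above.

```python
import pandas as pd

# Made-up sample in the same shape as zip_values.
raw = pd.DataFrame({
    "PostalCode": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Not assigned\n", "Harbourfront\n", "Regent Park\n", "Not assigned\n"],
})

# Strip the trailing newline first so comparisons do not depend on it.
raw["Neighborhood"] = raw["Neighborhood"].str.strip()

# 3) keep only rows with an assigned borough
assigned = raw[raw["Borough"] != "Not assigned"].copy()

# 4) fall back to the borough name where the neighborhood is missing
missing = assigned["Neighborhood"] == "Not assigned"
assigned.loc[missing, "Neighborhood"] = assigned.loc[missing, "Borough"]

# 5) one row per postal code, neighborhoods joined with ", "
combined = (
    assigned.groupby("PostalCode", as_index=False)
    .agg({"Borough": "first", "Neighborhood": ", ".join})
)
print(combined)
```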

##### final dataframe (first 10 rows)

In [4]:
zip_values_new.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


##### shape of dataframe: 103 rows / 3 columns

In [5]:
zip_values_new.shape

(103, 3)

___
### Task No.2
Add geographical coordinates to each row of the dataset

In [6]:
!wget -O coordinates.csv http://cocl.us/Geospatial_data

--2019-01-02 09:42:46--  http://cocl.us/Geospatial_data
Resolving cocl.us (cocl.us)... 159.8.72.228
Connecting to cocl.us (cocl.us)|159.8.72.228|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://cocl.us/Geospatial_data [following]
--2019-01-02 09:42:46--  https://cocl.us/Geospatial_data
Connecting to cocl.us (cocl.us)|159.8.72.228|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2019-01-02 09:42:47--  https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Resolving ibm.box.com (ibm.box.com)... 107.152.24.197
Connecting to ibm.box.com (ibm.box.com)|107.152.24.197|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.ent.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2019-01-02 09:42:48--  https://ibm.ent.box.com/shared/static

##### 1) load geographical data from csv file to dataframe

In [7]:
coordinates = pd.read_csv('coordinates.csv')
coordinates.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


##### 2) check whether the postal codes in both dataframes are identical, row by row

In [8]:
(zip_values_new['PostalCode']==coordinates['Postal Code']).values

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True], dtype=bool)

##### Yes, they are identical and in the same order, so we can simply add the columns 'Latitude' and 'Longitude' to the Toronto neighborhoods dataframe

In [9]:
zip_values_new['Latitude']=coordinates['Latitude']
zip_values_new['Longitude']=coordinates['Longitude']
zip_values_new.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
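Assigning the coordinate columns directly works here only because both dataframes list the postal codes in the same order. When that cannot be guaranteed, a key-based merge is a safer option. The following is a minimal sketch with made-up three-row frames; the variable names `neighborhoods`, `coords` and `merged` are hypothetical.

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M1E"],
    "Borough": ["Scarborough"] * 3,
})

# Deliberately in a different row order than `neighborhoods`.
coords = pd.DataFrame({
    "Postal Code": ["M1E", "M1B", "M1C"],
    "Latitude": [43.763573, 43.806686, 43.784535],
    "Longitude": [-79.188711, -79.194353, -79.160497],
})

# Match rows on the postal code instead of on their position.
# validate="one_to_one" raises an error if a code appears twice on either side.
merged = neighborhoods.merge(
    coords.rename(columns={"Postal Code": "PostalCode"}),
    on="PostalCode",
    how="left",
    validate="one_to_one",
)
print(merged)
```

With `how="left"` the row order of the left frame is kept, and any postal code without coordinates would show up as NaN rather than silently receiving another row's values.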


___
### Task No.3
Explore and cluster the neighborhoods of Toronto

In [10]:
from geopy.geocoders import Nominatim
address = 'Toronto, Canada'

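# recent geopy versions require a user_agent; the name used here is an arbitrary example
geolocator = Nominatim(user_agent='toronto_explorer')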
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


In [11]:
!conda install -c conda-forge folium=0.5.0 --yes # comment this line out if folium is already installed
import folium
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(zip_values_new['Latitude'], zip_values_new['Longitude'], zip_values_new['Borough'], zip_values_new['Neighborhood']):
    label = '{} ({})'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)

map_toronto

Fetching package metadata .............
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /opt/conda/envs/DSX-Python35:
#
folium                    0.5.0                      py_0    conda-forge


#####  Explore subset of data. Boroughs must contain the string 'Toronto'

In [12]:
# create a dataframe only with Boroughs containing 'Toronto'
toronto_data=zip_values_new[zip_values_new['Borough'].str.contains('Toronto', regex=False)]
toronto_data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
37,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
42,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
43,M4M,East Toronto,Studio District,43.659526,-79.340923
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


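##### prepare toronto_data for the next steps: keep only the neighborhood name and coordinates, and sort by neighborhood so the rows line up with the grouped venue counts later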

In [13]:
toronto_data=toronto_data.iloc[:,2:].sort_values(by=['Neighborhood']).reset_index(drop=True)
toronto_data.head()

Unnamed: 0,Neighborhood,Latitude,Longitude
0,"Adelaide, King, Richmond",43.650571,-79.384568
1,Berczy Park,43.644771,-79.373306
2,"Brockton, Exhibition Place, Parkdale Village",43.636847,-79.428191
3,Business reply mail Processing Centre969 Eastern,43.662744,-79.321558
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",43.628947,-79.39442


##### use the function from the course lab to get the top 100 venues for each neighborhood from toronto_data

In [14]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
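        # make the GET request; fall back to an empty list if the request
        # fails or the response does not contain any venues
        try:
            results = requests.get(url).json()["response"]['groups'][0]['items']
        except (requests.RequestException, KeyError, IndexError, ValueError):
            results = []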
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
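To make the nested lookups in the function easier to follow, here is a small sketch that flattens a response offline. The `sample` dictionary is hypothetical and only mirrors the keys the function reads; the helper `flatten_items` and its fallback to `None` for venues without a category are assumptions, not part of the course lab.

```python
def flatten_items(name, lat, lng, items):
    """Turn the 'items' list of an explore response into flat tuples."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Assumption: use None when a venue has no category instead of failing.
        category = categories[0]["name"] if categories else None
        rows.append((
            name, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


# Hypothetical response, shaped like the keys accessed in getNearbyVenues.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Nathan Phillips Square",
               "location": {"lat": 43.65227, "lng": -79.383516},
               "categories": [{"name": "Plaza"}]}},
    {"venue": {"name": "Uncategorized place",
               "location": {"lat": 43.65, "lng": -79.38},
               "categories": []}},
]}]}}

items = sample["response"]["groups"][0]["items"]
rows = flatten_items("Adelaide, King, Richmond", 43.650571, -79.384568, items)
for row in rows:
    print(row)
```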

hidden cell ... Foursquare credentials
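The content of the hidden cell is not shown. One common way to keep credentials out of a notebook is to read them from environment variables; the sketch below assumes that approach. The variable names (`FOURSQUARE_CLIENT_ID` and so on), the helper `load_foursquare_settings` and the default version date are all hypothetical; only the limit of 100 follows from the "top 100 venues" mentioned in the text.

```python
import os


def load_foursquare_settings(env=os.environ):
    """Read the API settings from a mapping of environment variables."""
    required = ("FOURSQUARE_CLIENT_ID", "FOURSQUARE_CLIENT_SECRET")
    missing = [key for key in required if key not in env]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {
        "CLIENT_ID": env["FOURSQUARE_CLIENT_ID"],
        "CLIENT_SECRET": env["FOURSQUARE_CLIENT_SECRET"],
        # Placeholder version date (YYYYMMDD); the notebook's value is not shown.
        "VERSION": env.get("FOURSQUARE_VERSION", "20190101"),
        # 100 matches the "top 100 venues" described in the text.
        "LIMIT": int(env.get("FOURSQUARE_LIMIT", "100")),
    }


# Demo with a plain dict standing in for the real environment.
settings = load_foursquare_settings({
    "FOURSQUARE_CLIENT_ID": "demo-id",
    "FOURSQUARE_CLIENT_SECRET": "demo-secret",
})
print(sorted(settings))
```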

##### feed the toronto data into the previous function and get the top 100 venues for each neighborhood

In [16]:
toronto_venues = getNearbyVenues(names=toronto_data['Neighborhood'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude']
                                  )

Adelaide, King, Richmond
Berczy Park
Brockton, Exhibition Place, Parkdale Village
Business reply mail Processing Centre969 Eastern
CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara
Cabbagetown, St. James Town
Central Bay Street
Chinatown, Grange Park, Kensington Market
Christie
Church and Wellesley
Commerce Court, Victoria Hotel
Davisville
Davisville North
Deer Park, Forest Hill SE, Rathnelly, South Hill, Summerhill West
Design Exchange, Toronto Dominion Centre
Dovercourt Village, Dufferin
First Canadian Place, Underground city
Forest Hill North, Forest Hill West
Harbord, University of Toronto
Harbourfront East, Toronto Islands, Union Station
Harbourfront, Regent Park
High Park, The Junction South
Lawrence Park
Little Portugal, Trinity
Moore Park, Summerhill East
North Toronto West
Parkdale, Roncesvalles
Rosedale
Roselawn
Runnymede, Swansea
Ryerson, Garden District
St. James Town
Stn A PO Boxes 25 The Esplanade
Studio District
Th

##### let's check size and content of the dataframe

In [17]:
print(toronto_venues.shape)
toronto_venues.head()

(1701, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Adelaide, King, Richmond",43.650571,-79.384568,Four Seasons Centre for the Performing Arts,43.650609,-79.38628,Concert Hall
1,"Adelaide, King, Richmond",43.650571,-79.384568,Nathan Phillips Square,43.65227,-79.383516,Plaza
2,"Adelaide, King, Richmond",43.650571,-79.384568,The Keg Steakhouse & Bar,43.649937,-79.384196,Steakhouse
3,"Adelaide, King, Richmond",43.650571,-79.384568,Shangri-La Toronto,43.649129,-79.386557,Hotel
4,"Adelaide, King, Richmond",43.650571,-79.384568,Estiatorio Volos,43.650329,-79.384533,Greek Restaurant


##### let's see how many venues were found per neighborhood

In [18]:
toronto_venues.groupby('Neighborhood').count().iloc[:,2]

Neighborhood
Adelaide, King, Richmond                                                                                      100
Berczy Park                                                                                                    54
Brockton, Exhibition Place, Parkdale Village                                                                   20
Business reply mail Processing Centre969 Eastern                                                               17
CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara     13
Cabbagetown, St. James Town                                                                                    49
Central Bay Street                                                                                             85
Chinatown, Grange Park, Kensington Market                                                                     100
Christie                                                                   

##### make a dataframe out of it

In [19]:
toronto_grouped=pd.DataFrame(toronto_venues.groupby('Neighborhood').count().iloc[:,2])
toronto_grouped.head()

Unnamed: 0_level_0,Venue
Neighborhood,Unnamed: 1_level_1
"Adelaide, King, Richmond",100
Berczy Park,54
"Brockton, Exhibition Place, Parkdale Village",20
Business reply mail Processing Centre969 Eastern,17
"CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara",13


##### let's cluster the neighborhoods according to the number of venues they have, using K-Means

In [20]:
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 5

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([0, 3, 4, 4, 4, 3, 0, 0, 4, 0, 0, 1, 2, 4, 0, 4, 0, 2, 1, 0, 3, 4, 2,
       3, 2, 4, 4, 2, 2, 1, 0, 0, 0, 1, 4, 2, 4, 1], dtype=int32)
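The label numbers themselves carry no order: cluster 0 is not "smaller" than cluster 4. A quick way to see what each label stands for is to summarize the venue counts per cluster. The sketch below uses the five counts and labels visible in this notebook's outputs for the first five neighborhoods; the variable names are hypothetical.

```python
import pandas as pd

# Venue counts and cluster labels of the first five neighborhoods.
counts = [100, 54, 20, 17, 13]
labels = [0, 3, 4, 4, 4]

# Range of venue counts and number of neighborhoods per cluster,
# ordered from the quietest cluster to the busiest one.
summary = (
    pd.DataFrame({"Venue": counts, "Cluster": labels})
    .groupby("Cluster")["Venue"]
    .agg(["min", "max", "count"])
    .sort_values("min")
)
print(summary)
```

On this small sample, label 4 groups the neighborhoods with the fewest venues and label 0 the one that reached the limit of 100.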

##### show neighborhoods together with number of venues and cluster labels.

In [21]:
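toronto_merged = toronto_data.copy()  # copy, so that toronto_data itself is not modified

# add clustering labels and number of venues
# (rows are matched by position: both frames are sorted by neighborhood name)
toronto_merged['Cluster Labels'] = kmeans.labels_
toronto_merged['number of venues'] = toronto_grouped['Venue'].values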

toronto_merged.head()

Unnamed: 0,Neighborhood,Latitude,Longitude,Cluster Labels,number of venues
0,"Adelaide, King, Richmond",43.650571,-79.384568,0,100
1,Berczy Park,43.644771,-79.373306,3,54
2,"Brockton, Exhibition Place, Parkdale Village",43.636847,-79.428191,4,20
3,Business reply mail Processing Centre969 Eastern,43.662744,-79.321558,4,17
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",43.628947,-79.39442,4,13


##### Finally, show the map of central Toronto with neighborhoods clustered according to the number of venues. For example, the different clusters may indicate quieter areas (fewer venues) or areas with higher traffic (more venues).

In [22]:
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters: one rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster, ven in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels'], toronto_merged['number of venues']):
    label = folium.Popup(str(poi) + ' Cluster: ' + str(cluster) + ' #Venues: ' + str(ven), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters