In [1]:
# The code was removed by Watson Studio for sharing.

# Segmenting and Clustering Neighborhoods in Toronto

_Author: Miguel Acevedo_

In this notebook I explore, segment, and cluster the neighborhoods of Toronto. For the neighborhood data, I scrape the [Wikipedia page](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) that lists the postal code, borough, and neighborhood names for each postal code area in Toronto, then wrangle and clean the data and read it into a pandas DataFrame.

I then convert these addresses into their equivalent latitude and longitude values using the [geopy library](https://geopy.readthedocs.io/en/latest/) and use the [Foursquare Places API](https://developer.foursquare.com/docs/places-api/) to get the most common venue categories in each neighborhood, which I use to group the neighborhoods into clusters with K-Means.

***


## Table of contents
> 1. [Download and Explore Data](#download)
> 2. [Explore Neighborhoods in Toronto](#explore)
> 3. [Analyze each Neighborhood](#analyze)
> 4. [Cluster Neighborhoods](#cluster)
> 5. [Examine Clusters](#examine)

***

## Importing Libraries

In [2]:
import numpy as np # handle data in a vectorized manner
import pandas as pd # data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import json # handle json files
from pandas.io.json import json_normalize # transform JSON file to pandas dataframe
import requests # handle requests
from bs4 import BeautifulSoup
import csv

!pip install geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# Clustering Model
from sklearn.cluster import KMeans

!pip install folium==0.5.0
import folium # map rendering library

print('Libraries imported!')

Libraries imported!


## <a name="download"></a>Download and Explore Data

***

The following code requests the Wikipedia page, parses it with Beautiful Soup, and saves each row to a CSV file:  
**This function no longer works. Please see the note below.**

In [3]:
# get html using requests and parse with Beautiful Soup
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'


def wikipedia_scraper(url):
    """
    Take a URL, fetch its HTML, parse the first table with Beautiful Soup,
    print each scraped record for manual inspection, and write the data to a CSV file.
    """
    html = requests.get(url, timeout=3).text # get html
    soup = BeautifulSoup(html, 'lxml') # parse html with Beautiful Soup
    table = soup.table # go to the first table on the page

    # create csv file to save data; the with-block closes it even if scraping fails
    with open("toronto_neighborhoods.csv", "w", newline="") as csv_file:
        csv_writer = csv.writer(csv_file)
        csv_writer.writerow(['PostalCode', 'Borough', 'Neighborhood'])

        # loop through each table cell (one postal code per <td>) and scrape data:
        for cell in table.find_all('td'):
            try:
                zip_code = cell.p.b.text # scrape postal code
                borough = cell.p.span.a.text.strip() # scrape borough
                # neighborhoods are listed in parentheses after the borough
                neighborhood = cell.span.text.split('(')[1].split(')')[0]
                # if multiple neighborhoods, separate with a comma
                neighborhood = neighborhood.replace(' /', ',')
                print(zip_code, borough, neighborhood, sep='\n', end='\n\n') # print results
            except (AttributeError, IndexError):
                # the cell does not match the expected layout: write an empty row
                zip_code = borough = neighborhood = None

            csv_writer.writerow([zip_code, borough, neighborhood]) # save to csv file

# wikipedia_scraper(url)
# df = pd.read_csv('toronto_neighborhoods.csv')
# df.head()

> **Note:** The structure of the web page has changed, so the function above no longer works.
> I left it in the notebook to outline how I would scrape the page with its old structure.  
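
The string handling inside the scraper can be tried without the network. The sketch below assumes that a cell's span text has the form `Borough(Neighborhood / Neighborhood)`, which is inferred from the `split('(')` logic above; the sample strings and the helper name `parse_span_text` are hypothetical.

```python
import csv
import io


def parse_span_text(span_text):
    """Split 'Borough(Name / Name)' into (borough, 'Name, Name')."""
    borough, _, rest = span_text.partition('(')
    neighborhood = rest.split(')')[0].replace(' /', ',')
    return borough.strip(), neighborhood


# Hypothetical cell contents, not copied from the live page
samples = [
    ('M5A', 'Downtown Toronto(Regent Park / Harbourfront)'),
    ('M3A', 'North York(Parkwoods)'),
]

buffer = io.StringIO()  # in-memory stand-in for the CSV file
writer = csv.writer(buffer)
writer.writerow(['PostalCode', 'Borough', 'Neighborhood'])
for postal_code, span_text in samples:
    writer.writerow([postal_code, *parse_span_text(span_text)])

print(buffer.getvalue())
```

The `csv` writer quotes any field that contains a comma, so a multi-neighborhood value stays in a single column when the file is read back.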

***

Since the data is now in a simpler format, I use pandas to read the table directly, as shown below:

In [4]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
df = pd.read_html(url)[0] # use pandas to scrape table. 
df.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront


Drop the postal codes that have no borough assigned:

In [5]:
df = df.loc[df['Borough'] != 'Not assigned', :].reset_index(drop=True)
df.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park / Harbourfront
3,M6A,North York,Lawrence Manor / Lawrence Heights
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
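
The filter above combines a boolean mask with `reset_index`. A minimal sketch on three rows taken from the scraped table (the names `sample`, `mask`, and `kept` are illustrative) shows the effect of each step:

```python
import pandas as pd

# Three rows copied from the scraped table above
sample = pd.DataFrame({
    'Postal code': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighborhood': [None, 'Parkwoods', 'Victoria Village'],
})

mask = sample['Borough'] != 'Not assigned'  # True for the rows to keep
kept = sample.loc[mask, :].reset_index(drop=True)  # renumber the rows from 0

print(kept)
```

Without `reset_index(drop=True)` the surviving rows would keep their original labels (1 and 2 here), which matters later when rows are addressed by position.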


In [6]:
# separate multiple neighborhoods with a comma instead of " /" (the .str accessor also skips missing values)
df['Neighborhood'] = df['Neighborhood'].str.replace(' /', ',', regex=False)
df.head(10)

Unnamed: 0,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


In [7]:
print(df.shape)
df.head()

(103, 3)


Unnamed: 0,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


### Geolocation data

In [8]:
# combine Borough and Postal code with city and country to build an address to geocode
df['Location'] = df['Borough'] + ', ' + df['Postal code'] + ', Toronto, Canada'

# use geopy's rate-limited geocoder to look up each row (Nominatim was imported above)
from geopy.extra.rate_limiter import RateLimiter
geolocator = Nominatim(user_agent="toronto_location") # instantiate geolocator
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1) # wait at least 1 second between requests
df['location2'] = df['Location'].apply(geocode) # call geocoding service
df['Latitude'] = df['location2'].apply(lambda loc: loc.latitude if loc else None) # get latitude
df['Longitude'] = df['location2'].apply(lambda loc: loc.longitude if loc else None) # get longitude
df.head()

Unnamed: 0,Postal code,Borough,Neighborhood,Location,location2,Latitude,Longitude
0,M3A,North York,Parkwoods,"North York, M3A, Toronto, Canada","(North York, Toronto, Golden Horseshoe, Ontari...",43.754326,-79.449117
1,M4A,North York,Victoria Village,"North York, M4A, Toronto, Canada",,,
2,M5A,Downtown Toronto,"Regent Park, Harbourfront","Downtown Toronto, M5A, Toronto, Canada",,,
3,M6A,North York,"Lawrence Manor, Lawrence Heights","North York, M6A, Toronto, Canada",,,
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government","Downtown Toronto, M7A, Toronto, Canada","(Downtown Yonge, Toronto Centre, Old Toronto, ...",43.656322,-79.380916


In [9]:
df['location2'].count()

18

**Unfortunately, geopy's Nominatim geocoder was inconsistent and returned latitude and longitude values for only 18 of the 103 postal codes. I will use the Python client library for the Google [Geocoding API](https://developers.google.com/maps/documentation/geocoding/start?hl=es_419) to geocode them instead.**
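
One way to raise the hit rate of a free geocoder before switching services is to try progressively less specific queries for each row. The sketch below is an assumption, not what this notebook does: `first_match` and `fake_geocode` are hypothetical names, and the stub stands in for a real callable such as the rate-limited `geocode` defined earlier.

```python
def first_match(queries, geocode):
    """Return the first non-None geocoding result among the queries, else None."""
    for query in queries:
        result = geocode(query)
        if result is not None:
            return result
    return None


def fake_geocode(query):
    """Stub geocoder for the demo: it only 'knows' one borough-level query."""
    known = {'North York, Toronto, Canada': (43.754326, -79.449117)}
    return known.get(query)


borough, postal_code = 'North York', 'M4A'
queries = [
    f'{borough}, {postal_code}, Toronto, Canada',  # most specific first
    f'{postal_code}, Toronto, Canada',
    f'{borough}, Toronto, Canada',                 # coarse fallback
]
print(first_match(queries, fake_geocode))
```

The trade-off is precision: a borough-level fallback returns the same point for every postal code in that borough, which would stack markers on the map and blur any later venue search.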

### Using Google Geocoding API

**Install and instantiate Google Maps Geocoding client:**

In [10]:
# The code was removed by Watson Studio for sharing.

Requirement already up-to-date: googlemaps in /opt/conda/envs/Python36/lib/python3.6/site-packages (4.2.0)


**Testing:**

In [11]:
# test
geocode_result = gmaps.geocode('1600 Amphitheatre Parkway, Mountain View, CA')
print(geocode_result)

[{'address_components': [{'long_name': '1600', 'short_name': '1600', 'types': ['street_number']}, {'long_name': 'Amphitheatre Parkway', 'short_name': 'Amphitheatre Pkwy', 'types': ['route']}, {'long_name': 'Mountain View', 'short_name': 'Mountain View', 'types': ['locality', 'political']}, {'long_name': 'Santa Clara County', 'short_name': 'Santa Clara County', 'types': ['administrative_area_level_2', 'political']}, {'long_name': 'California', 'short_name': 'CA', 'types': ['administrative_area_level_1', 'political']}, {'long_name': 'United States', 'short_name': 'US', 'types': ['country', 'political']}, {'long_name': '94043', 'short_name': '94043', 'types': ['postal_code']}], 'formatted_address': '1600 Amphitheatre Pkwy, Mountain View, CA 94043, USA', 'geometry': {'location': {'lat': 37.42231, 'lng': -122.0846241}, 'location_type': 'ROOFTOP', 'viewport': {'northeast': {'lat': 37.42365898029151, 'lng': -122.0832751197085}, 'southwest': {'lat': 37.42096101970851, 'lng': -122.0859730802915

In [12]:
geocode_result[0]['geometry']['location']

{'lat': 37.42231, 'lng': -122.0846241}

**Getting All Latitude and Longitude Values:**

In [13]:
def google_geocoder(address):
    ''' 
        Request a geocode for the address and return a (latitude, longitude) tuple.
        Return (nan, nan) if the request fails or yields no result.
    '''
    try:
        result = gmaps.geocode(address)
        latitude = result[0]['geometry']['location']['lat']
        longitude = result[0]['geometry']['location']['lng']
    except Exception: # avoid a bare except, which would also swallow KeyboardInterrupt
        latitude = np.nan
        longitude = np.nan
    return latitude, longitude

In [14]:
df['Coord'] = df['Location'].apply(google_geocoder)

In [15]:
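# count rows with a real latitude; failed lookups are stored as (nan, nan), which Series.count() would still count
found = df['Coord'].apply(lambda coord: pd.notna(coord[0])).sum()
print("Google Geocoding API was able to retrieve {} out of {} Boroughs".format(found, df.shape[0]))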
df.head()

Google Geocoding API was able to retrieve 103 out of 103 Boroughs


Unnamed: 0,Postal code,Borough,Neighborhood,Location,location2,Latitude,Longitude,Coord
0,M3A,North York,Parkwoods,"North York, M3A, Toronto, Canada","(North York, Toronto, Golden Horseshoe, Ontari...",43.754326,-79.449117,"(43.7532586, -79.3296565)"
1,M4A,North York,Victoria Village,"North York, M4A, Toronto, Canada",,,,"(43.72588229999999, -79.3155716)"
2,M5A,Downtown Toronto,"Regent Park, Harbourfront","Downtown Toronto, M5A, Toronto, Canada",,,,"(43.6542599, -79.36063589999999)"
3,M6A,North York,"Lawrence Manor, Lawrence Heights","North York, M6A, Toronto, Canada",,,,"(43.718518, -79.4647633)"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government","Downtown Toronto, M7A, Toronto, Canada","(Downtown Yonge, Toronto Centre, Old Toronto, ...",43.656322,-79.380916,"(43.6623015, -79.3894938)"


**Cleanup and format**

In [16]:
df['Latitude'] = df['Coord'].apply(lambda x: x[0])
df['Longitude'] = df['Coord'].apply(lambda x: x[1])

df = df.drop(columns=['Location', 'location2', 'Coord'])
df.head()

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
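
An alternative to the two `apply` calls is to expand the tuple column in one step. This is a sketch on two rows copied from the table above; the names `sample` and `coords` are illustrative.

```python
import pandas as pd

# Two rows copied from the geocoded table above
sample = pd.DataFrame({
    'Postal code': ['M3A', 'M4A'],
    'Coord': [(43.7532586, -79.3296565), (43.72588229999999, -79.3155716)],
})

# Build a two-column frame from the tuples, aligned on the original index
coords = pd.DataFrame(sample['Coord'].tolist(),
                      index=sample.index,
                      columns=['Latitude', 'Longitude'])
sample = sample.drop(columns='Coord').join(coords)

print(sample)
```

Passing `index=sample.index` keeps the new columns aligned with the original rows even when the frame does not use a default 0..n-1 index.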


In [17]:
# Saving file into Cloud Object Storage
project.save_data(file_name = "toronto_geolocation.csv", data = df.to_csv(index=False), overwrite=True)
print('Dataframe Saved as csv!')

Dataframe Saved as csv!


## <a name="explore"></a>Explore Neighborhoods in Toronto

**Visualize the Neighborhoods**

In [22]:
# create map of Toronto using Nominatim Toronto Latitude and Longitude

# get Toronto Geolocation coordinates
geolocator = Nominatim(user_agent='toronto_neigh')
location = geolocator.geocode('Toronto, Ontario, Canada')
latitude = location.latitude
longitude = location.longitude

# instantiate map
map_toronto = folium.Map(location=[latitude, longitude], 
                         zoom_start=11)

# add markers to map
for lat, lng, borough, neighborhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighborhood']):
  label = '{}, {}'.format(neighborhood, borough)
  label = folium.Popup(label, parse_html=True)
  folium.CircleMarker(
      [lat, lng],
      radius=5,
      popup=label,
      color='blue',
      fill=True,
      fill_color='#3186cc',
      fill_opacity=0.7,
      parse_html=False).add_to(map_toronto)

map_toronto
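
The map center can also be derived from the data itself, which avoids one more network call to Nominatim. A sketch, assuming the mean of the neighborhood coordinates is an acceptable center; the five rows are the ones shown in the cleaned table earlier, and the variable names are illustrative.

```python
import pandas as pd

# Coordinates of the first five rows of the cleaned table
sample = pd.DataFrame({
    'Latitude': [43.753259, 43.725882, 43.654260, 43.718518, 43.662301],
    'Longitude': [-79.329656, -79.315572, -79.360636, -79.464763, -79.389494],
})

# Mean coordinate as the map center, plus the bounding box of all points
center = [sample['Latitude'].mean(), sample['Longitude'].mean()]
south_west = [sample['Latitude'].min(), sample['Longitude'].min()]
north_east = [sample['Latitude'].max(), sample['Longitude'].max()]

print(center, south_west, north_east)
```

The resulting `center` list has the same `[latitude, longitude]` shape that `folium.Map(location=...)` receives in the cell above.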

## <a name="analyze"></a>Analyze each Neighborhood

## <a name="cluster"></a>Cluster Neighborhoods
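
The introduction describes grouping neighborhoods by their most common venue categories with K-Means. The sketch below shows one way that step could look, under loud assumptions: the `venues` table is made-up data rather than Foursquare output, the neighborhood names `A` to `D` are placeholders, and the choice of two clusters is arbitrary.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical venue data: one row per venue, not real Foursquare output
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'D', 'D', 'D'],
    'Category': ['Cafe', 'Cafe', 'Bar', 'Cafe', 'Bar', 'Cafe',
                 'Park', 'Park', 'Trail', 'Park', 'Trail', 'Park'],
})

# One-hot encode the categories, then average per neighborhood
# to get the share of each category among that neighborhood's venues
onehot = pd.get_dummies(venues['Category']).astype(float)
onehot.insert(0, 'Neighborhood', venues['Neighborhood'])
profile = onehot.groupby('Neighborhood').mean()

# Cluster the neighborhoods on their category shares
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(profile)
clusters = pd.Series(kmeans.labels_, index=profile.index, name='Cluster')

print(clusters)
```

Cluster numbers carry no meaning of their own; only which neighborhoods share a label matters. Here the two cafe-and-bar neighborhoods land in one cluster and the two park-and-trail neighborhoods in the other.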

## <a name="examine"></a>Examine Clusters