This notebook presents a solution to the Neighborhood Segmentation project outlined in Week 3 of the IBM Applied Data Science Capstone course on Coursera.

Created 5/21/19 by K Fullerton

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from bs4 import BeautifulSoup
import requests, json
import geocoder
import folium
from sklearn.cluster import KMeans

# Web Scraping to Collect Neighborhood Data

## Create the page object

In [2]:
url_to_scrape = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = requests.get(url_to_scrape)
page

<Response [200]>

<Response [200]> indicates that the request succeeded (HTTP status 200) and the page object was created.
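The status can also be checked in code rather than by eye. The sketch below relies on the `raise_for_status()` method of `requests.Response`, which raises `requests.HTTPError` for 4xx and 5xx status codes. The helper name `check_response` and the `timeout` value are hypothetical and not part of the original notebook.

```python
import requests

def check_response(response):
    """Return the response unchanged, or raise requests.HTTPError on a 4xx/5xx status."""
    response.raise_for_status()
    return response

# In the notebook this could wrap the request (needs network access):
# page = check_response(requests.get(url_to_scrape, timeout=10))
```

Failing early here keeps a bad download from surfacing later as a confusing parsing error.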

## Create the lists that will hold the scraped data
Each list becomes one column of the pandas dataframe built in the next step, where the .head() command is used to check its format.

In [3]:
code_list = list()
borough_list = list()
neighborhood_list = list()

## Use Beautiful Soup to scrape the table data into the pandas dataframe

In [4]:
soup = BeautifulSoup(page.content,'html.parser')
# Skip the first row and stop at row 286; the slice is hard-coded for the page as scraped
for tr in soup.find_all('tr')[1:287]:
    tds = tr.find_all('td')
    code_list.append(tds[0].text)
    borough_list.append(tds[1].text)
    neighborhood_list.append(tds[2].text)
# Strip surrounding whitespace from the neighborhood names
neighborhood_list = [s.strip() for s in neighborhood_list]
# Combine the three lists into rows and build the dataframe
zippedList = list(zip(code_list, borough_list, neighborhood_list))
toronto_data = pd.DataFrame(zippedList, columns=['PostalCode', 'Borough', 'Neighborhood'])

toronto_data.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned
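The hard-coded slice `[1:287]` above depends on the exact row count of the page at the time of scraping. A minimal sketch of a less brittle variant is shown here on a small inline HTML sample. It assumes the postal code table is the first `<table>` on the page and that every data row has exactly three `<td>` cells, neither of which is verified against the live Wikipedia page. The function name `parse_postal_rows` is hypothetical.

```python
from bs4 import BeautifulSoup

def parse_postal_rows(html):
    """Return (postal code, borough, neighborhood) tuples from the first table."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table')  # assumption: the first table holds the postal codes
    rows = []
    for tr in table.find_all('tr'):
        tds = tr.find_all('td')
        # Header rows use <th> cells, so they contain no <td> and are skipped here
        if len(tds) != 3:
            continue
        rows.append(tuple(td.get_text(strip=True) for td in tds))
    return rows

# Small stand-in for the scraped page
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
  </td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
  </td></tr>
</table>
"""
sample_rows = parse_postal_rows(sample_html)
```

The resulting list of tuples can be passed to `pd.DataFrame(..., columns=[...])` in the same way as `zippedList`.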


## Clean the dataframe by removing any postal codes that have not been assigned

In [5]:
toronto_data = toronto_data[toronto_data.Borough != 'Not assigned']
toronto_data.head(15)

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern


## Assign the borough name to any neighborhood that is not assigned

In [6]:
# Select the rows where Neighborhood is 'Not assigned'
missing_indices = toronto_data[toronto_data.Neighborhood == 'Not assigned']
# Iterate over those rows and replace 'Not assigned' with the borough name
for i, row in missing_indices.iterrows():
    toronto_data.at[i, 'Neighborhood'] = row.Borough
toronto_data.head(15)

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Queen's Park
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern
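The same replacement can be written without an explicit loop by using a boolean mask. This is a sketch on a two-row sample frame, not the notebook's own method; the name `sample` is illustrative.

```python
import pandas as pd

# Two rows taken from the output above, before the replacement
sample = pd.DataFrame({
    'PostalCode': ['M7A', 'M9A'],
    'Borough': ["Queen's Park", 'Etobicoke'],
    'Neighborhood': ['Not assigned', 'Islington Avenue'],
})

# True for rows whose neighborhood is still the placeholder
mask = sample['Neighborhood'] == 'Not assigned'
# Copy the borough name into the neighborhood column for those rows only
sample.loc[mask, 'Neighborhood'] = sample.loc[mask, 'Borough']
```

Because the assignment aligns on the index, only the masked rows change and the rest of the column is left untouched.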


Check the data frame to make sure that there are no remaining "Not assigned" values.

Note that this second check should return an empty data frame: by this point, every "Not assigned" neighborhood has been replaced with its borough name.

In [7]:
missing_indices = toronto_data[toronto_data.Neighborhood == 'Not assigned']
print(missing_indices)

Empty DataFrame
Columns: [PostalCode, Borough, Neighborhood]
Index: []


## Merge Neighborhood Names for postal codes with multiple neighborhoods listed

In [8]:
def concat_string_values(group):
    string = ''
    for name in group.Neighborhood:
        string += name + ' '
        
    return string

grouped_data = toronto_data.groupby(['PostalCode','Borough']).apply(concat_string_values) 

In [9]:
cleaned_toronto_data = pd.DataFrame(grouped_data.reset_index())
cleaned_toronto_data.rename(columns={'PostalCode': 'Postal Code', 0:'Neighborhoods'},  inplace=True)
cleaned_toronto_data.head(10)

Unnamed: 0,Postal Code,Borough,Neighborhoods
0,M1B,Scarborough,Rouge Malvern
1,M1C,Scarborough,Highland Creek Rouge Hill Port Union
2,M1E,Scarborough,Guildwood Morningside West Hill
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,East Birchmount Park Ionview Kennedy Park
7,M1L,Scarborough,Clairlea Golden Mile Oakridge
8,M1M,Scarborough,Cliffcrest Cliffside Scarborough Village West
9,M1N,Scarborough,Birch Cliff Cliffside West
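Joining names with a single space makes multi-word neighborhoods hard to tell apart (for example, "Highland Creek Rouge Hill Port Union"). The sketch below shows one possible alternative that uses a comma separator through `groupby(...).agg(', '.join)`. It is shown on a small sample and is not the approach used in this notebook; the names `sample` and `merged` are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# One row per (PostalCode, Borough); neighborhood names joined with ', '
merged = (sample
          .groupby(['PostalCode', 'Borough'], as_index=False)['Neighborhood']
          .agg(', '.join))
```

This also avoids the trailing space that string concatenation in a loop leaves at the end of each value.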


# Collecting Geographic Data

## Geocoder function
I attempted to use the geocoder function provided in the project instructions, but was unable to connect with the API. In order to complete the project, I used the provided geospatial data file.

## Alternative method for finding latitude and longitude
Since the geocoder API is not working, we will use the provided CSV file to populate the latitude and longitude values.

We will set the index of both the neighborhood dataframe and the geospatial dataframe to the postal code to facilitate merging.


In [10]:
lat_long_data = pd.read_csv('Geospatial_Coordinates.csv')
lat_long_data.set_index('Postal Code', inplace=True)
lat_long_data.head()

Postal Code,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
M1G,43.770992,-79.216917
M1H,43.773136,-79.239476


In [11]:
cleaned_toronto_data.set_index('Postal Code', inplace=True)
cleaned_toronto_data.head()

Postal Code,Borough,Neighborhoods
M1B,Scarborough,Rouge Malvern
M1C,Scarborough,Highland Creek Rouge Hill Port Union
M1E,Scarborough,Guildwood Morningside West Hill
M1G,Scarborough,Woburn
M1H,Scarborough,Cedarbrae


In [12]:
for index, row in cleaned_toronto_data.iterrows():
    lat = lat_long_data.loc[index,'Latitude']
    long = lat_long_data.loc[index, 'Longitude']
    cleaned_toronto_data.loc[index, 'Latitude'] = lat
    cleaned_toronto_data.loc[index, 'Longitude'] = long
cleaned_toronto_data.head()

Postal Code,Borough,Neighborhoods,Latitude,Longitude
M1B,Scarborough,Rouge Malvern,43.806686,-79.194353
M1C,Scarborough,Highland Creek Rouge Hill Port Union,43.784535,-79.160497
M1E,Scarborough,Guildwood Morningside West Hill,43.763573,-79.188711
M1G,Scarborough,Woburn,43.770992,-79.216917
M1H,Scarborough,Cedarbrae,43.773136,-79.239476
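Since both dataframes are indexed by postal code, the row-by-row lookup above can also be expressed as a single index join. The following is a sketch on two sample rows using `DataFrame.join`; the variable names `neighborhoods`, `coords`, and `combined` are illustrative and do not appear in the notebook.

```python
import pandas as pd

neighborhoods = pd.DataFrame(
    {'Borough': ['Scarborough', 'Scarborough'],
     'Neighborhoods': ['Rouge Malvern', 'Highland Creek Rouge Hill Port Union']},
    index=pd.Index(['M1B', 'M1C'], name='Postal Code'))

# Coordinates deliberately listed in a different order to show index alignment
coords = pd.DataFrame(
    {'Latitude': [43.784535, 43.806686],
     'Longitude': [-79.160497, -79.194353]},
    index=pd.Index(['M1C', 'M1B'], name='Postal Code'))

# Left join keeps every neighborhood row and matches coordinates by postal code
combined = neighborhoods.join(coords, how='left')
```

With a left join, a postal code that is missing from the coordinate file yields NaN values instead of raising a `KeyError` as the `.loc` lookup would.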


The example provided in the project instructions has a generic index, so we will reset the index on this dataframe and bring the Postal Code column into the body of the dataframe.

In [13]:
cleaned_toronto_data.reset_index(inplace=True)
cleaned_toronto_data.head()

Unnamed: 0,Postal Code,Borough,Neighborhoods,Latitude,Longitude
0,M1B,Scarborough,Rouge Malvern,43.806686,-79.194353
1,M1C,Scarborough,Highland Creek Rouge Hill Port Union,43.784535,-79.160497
2,M1E,Scarborough,Guildwood Morningside West Hill,43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


# Initial Clustering Analysis

In order to explore the Toronto space, we will first map the neighborhoods and color code the points by Borough name.

## Create Map
To begin, we will set the map center coordinates to those of downtown Toronto and create the map object. To facilitate visualization, we will use the Stamen Toner tile style in folium.

In [14]:
# Set central coordinates for Downtown Toronto to pin the map
toronto_lat = 43.6532
toronto_long = -79.3832
# Create map object
display_map = folium.Map(location=[toronto_lat, toronto_long],
                         tiles='Stamen Toner',
                         zoom_start=10)

## Create color mapping for boroughs
Since the borough names are strings, we will create a dataframe to map the name of each borough into a marker color string.

In [15]:
borough_list = cleaned_toronto_data.Borough.unique()
color_list = ['red', 'blue', 'green', 'purple', 'orange', 'darkred',
             'lightred', 'beige', 'darkblue', 'darkgreen', 'cadetblue']
borough_color_translation = pd.DataFrame()
borough_color_translation['Borough'] = borough_list
borough_color_translation['Color'] = color_list
borough_color_translation.set_index('Borough', inplace=True)
borough_color_translation

Borough,Color
Scarborough,red
North York,blue
East York,green
East Toronto,purple
Central Toronto,orange
Downtown Toronto,darkred
York,lightred
West Toronto,beige
Queen's Park,darkblue
Mississauga,darkgreen


In [16]:
borough_color_translation.loc['Scarborough'].iloc[0]

'red'
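The translation dataframe works only when the color list has exactly as many entries as there are boroughs. A plain dictionary built with `zip` and `itertools.cycle` is one possible alternative that tolerates a shorter palette by reusing colors. The sketch below uses a shortened sample; the names `boroughs`, `palette`, and `borough_colors` are illustrative.

```python
from itertools import cycle

boroughs = ['Scarborough', 'North York', 'East York']
palette = ['red', 'blue']  # intentionally shorter than the borough list

# cycle() repeats the palette, so every borough receives a color
borough_colors = dict(zip(boroughs, cycle(palette)))

scarborough_color = borough_colors['Scarborough']
```

Reused colors make boroughs harder to distinguish on the map, so a palette at least as long as the borough list is still preferable.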

## Create Markers for each neighborhood
Now we will loop through the dataframe and create a circle on the map corresponding to each latitude and longitude. The popup name for each point will be the Neighborhood, and the color is determined by the borough.

In [17]:
# Loop through the dataframe to create a circle for each neighborhood
for index, row in cleaned_toronto_data.iterrows():
    # Look up columns by name rather than by position
    borough = row['Borough']
    color_name = borough_color_translation.loc[borough, 'Color']
    folium.Circle(
        radius=100,
        location=[row['Latitude'], row['Longitude']],
        popup=row['Neighborhoods'],
        fill=True,
        color=color_name).add_to(display_map)

In [18]:
display_map