# Capstone Segmentation and Clustering of Toronto Neighborhoods

<hr></hr>

This workbook will satisfy the week 3 requirements for the IBM Data Science Capstone course on Coursera. In it, I will scrape tabular data from wikipedia, read it into a pandas dataframe, then clean that data. First, I will need to import the BeautifulSoup library to be able to read this data into a dataframe.

# Part 1: Web Scraping and Data Wrangling
<hr></hr>

### Step 1: Scrape the Website

In [4]:
import pandas as pd
import requests

In [5]:
pip install beautifulsoup4

Collecting beautifulsoup4
[?25l  Downloading https://files.pythonhosted.org/packages/1a/b7/34eec2fe5a49718944e215fde81288eec1fa04638aa3fb57c1c6cd0f98c3/beautifulsoup4-4.8.0-py3-none-any.whl (97kB)
[K     |████████████████████████████████| 102kB 10.2MB/s ta 0:00:01
[?25hCollecting soupsieve>=1.2 (from beautifulsoup4)
  Downloading https://files.pythonhosted.org/packages/0b/44/0474f2207fdd601bb25787671c81076333d2c80e6f97e92790f8887cf682/soupsieve-1.9.3-py2.py3-none-any.whl
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.8.0 soupsieve-1.9.3
Note: you may need to restart the kernel to use updated packages.


In [6]:
from bs4 import BeautifulSoup

In [7]:
BeautifulSoup

bs4.BeautifulSoup

In [8]:
r1 = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup = BeautifulSoup(r1.content,'html.parser')
table = soup.find(lambda tag: tag.name =='table' and ("wikitable" in tag['class']))

In [9]:
df = pd.read_html(str(table), flavor='bs4')[0]
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


### Step 2: Remove the "Not Assigned" data
Do this by creating a dataframe where it selects those that are not assigned, then inverse it.

In [10]:
df_2 = (df['Borough'] == 'Not assigned')|(df['Neighbourhood'] == 'Not assigned')
df = df[~df_2]

### Step 3: Check to see if there are any "Not Assigned" variables left

In [11]:
(df.Borough == 'Not assigned').sum()

0

In [12]:
(df.Neighbourhood == 'Not Assigned').sum()

0

In [13]:
post=df.Postcode.unique()

### Step 4: Group the dataframe by postcode
Some entries have the same postcode, so we need to group them together based on the unique postcode using the variable above.

In [14]:
Toronto = pd.DataFrame(columns=['Postcode','Borough','Neighbourhood'])
for code in post:
    temp_df = df[['Borough','Neighbourhood']][df['Postcode'] == code]
    boro = temp_df.Borough.unique()
    hood = temp_df.Neighbourhood.unique()
    Toronto = Toronto.append({
        'Postcode':code,
        'Borough':",".join(boro),
        'Neighbourhood':",".join(hood)},ignore_index=True)

In [15]:
Toronto.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront,Regent Park"
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M9A,Etobicoke,Islington Avenue


In [16]:
Toronto.shape

(102, 3)

# Part 2: Geocoding
<hr></hr>

### Step 5: Install the Geocoder to Geocode the Post Codes
#### Note: This did not work so I got the data from the CSV

In [17]:
#!conda install -c conda-forge geocoder --yes

In [18]:
#import geocoder # import geocoder

# initialize your variable to None
#lat_lng_coords = None

# loop until you get the coordinates
#while(lat_lng_coords is None):
  #g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
  #lat_lng_coords = g.latlng

#latitude = lat_lng_coords[0]
#longitude = lat_lng_coords[1]

In [19]:
postalcodes_from_csv = pd.read_csv('http://cocl.us/Geospatial_data')

In [20]:
postalcodes_from_csv.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [21]:
Toronto = Toronto.sort_values(by=['Postcode'])
Toronto.reset_index(inplace=True, drop=True)

In [22]:
pcode = postalcodes_from_csv.sort_values(by=['Postal Code'])

In [23]:
pcode.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [24]:
Toronto=pd.concat([Toronto, pcode[['Latitude','Longitude']]], axis = 1)

In [25]:
Toronto.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


# Part 3: Clustering of Neighborhoods
<hr></hr>
First, I wanted to only use those boroughs that contained the word "Toronto".

In [26]:
Toronto = Toronto[Toronto.Borough.str.contains('Toronto', na=False)]

In [27]:
Toronto.reset_index( inplace = True)
Toronto

Unnamed: 0,index,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,37,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,41,M4K,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188
2,42,M4L,East Toronto,"The Beaches West,India Bazaar",43.668999,-79.315572
3,43,M4M,East Toronto,Studio District,43.659526,-79.340923
4,44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
5,45,M4P,Central Toronto,Davisville North,43.712751,-79.390197
6,46,M4R,Central Toronto,North Toronto West,43.715383,-79.405678
7,47,M4S,Central Toronto,Davisville,43.704324,-79.38879
8,48,M4T,Central Toronto,"Moore Park,Summerhill East",43.689574,-79.38316
9,49,M4V,Central Toronto,"Deer Park,Forest Hill SE,Rathnelly,South Hill,...",43.686412,-79.400049


In [28]:
Toronto.drop(['index'], axis = 1, inplace = True)

##### Then I reset the index so it would be properly ordered, starting at 0.

In [29]:
Toronto

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"The Beaches West,India Bazaar",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
5,M4P,Central Toronto,Davisville North,43.712751,-79.390197
6,M4R,Central Toronto,North Toronto West,43.715383,-79.405678
7,M4S,Central Toronto,Davisville,43.704324,-79.38879
8,M4T,Central Toronto,"Moore Park,Summerhill East",43.689574,-79.38316
9,M4V,Central Toronto,"Deer Park,Forest Hill SE,Rathnelly,South Hill,...",43.686412,-79.400049


### Mapmaking
###### For the next part, I will map the above boroughs.

In [30]:
import folium

In [31]:
map_toronto = folium.Map(location=[43.6532, -79.3832], zoom_start=11)

# add markers to map
for lat, lng, borough, neighborhood in zip(Toronto['Latitude'], Toronto['Longitude'], Toronto['Borough'], Toronto['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)
map_toronto

##### Next, we will get information from Foursquare's API regarding establishments in each borough.

In [32]:
# @hidden_cell
CLIENT_ID = 'RNKIHBWGNRSAOCCLS1HI3CCQYEZTMJ5NF4P054EJFN5JLWKI' # your Foursquare ID
CLIENT_SECRET = 'PAJ2BMGBQEVPBU21HNV4IO3UGUQKLI4GEYNUTRDUW4WRH4NO' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

In [33]:
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe
import json

def getNearbyVenues(postcodes, latitudes, longitudes, radius=500, limit=100):
    
    venues_list=[]
    postcodes_done = ""
    for code, lat, lng in zip(postcodes, latitudes, longitudes):
        
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            limit)
            
        # make the GET request
        res = requests.get(url).json()["response"]
        if 'groups' not in res:
            continue;
        
        postcodes_done += code+","
        results = res['groups'][0]['items']
            
        # return only relevant information for each nearby venue
        venues_list.append([(
            code, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Postcode', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    print(postcodes_done[:-1])
    return(nearby_venues)

Earlier, we created a dataframe named 'Toronto' that only selected boroughs in Toronto. Now, we will combine that with the above pull request into a new dataframe so we have venues in these areas.

In [34]:
toronto_venues = getNearbyVenues(Toronto['Postcode'], Toronto['Latitude'], Toronto['Longitude'],radius=500, limit=100)

M4E,M4K,M4L,M4M,M4N,M4P,M4R,M4S,M4T,M4V,M4W,M4X,M4Y,M5A,M5B,M5C,M5E,M5G,M5H,M5J,M5K,M5L,M5N,M5P,M5R,M5S,M5T,M5V,M5W,M5X,M6G,M6H,M6J,M6K,M6P,M6R,M6S,M7Y


In [35]:
toronto_venues.head()

Unnamed: 0,Postcode,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M4E,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,M4E,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,M4E,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
3,M4E,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
4,M4K,43.679557,-79.352188,Pantheon,43.677621,-79.351434,Greek Restaurant


### One-Hot Encoding
Now, we will do one-hot encoding so we can make the categorical data ready to be read by a machine learning algorithm.

In [36]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Postcode'] = toronto_venues['Postcode'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.shape

(1703, 231)

In [37]:
toronto_grouped = toronto_onehot.groupby('Postcode').sum()

In [38]:
toronto_grouped.reset_index(inplace=True)
toronto_grouped.head()

Unnamed: 0,Postcode,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Yoga Studio
0,M4E,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,M4K,0,0,0,0,0,0,1,0,0,...,0,0,1,0,0,0,0,0,0,1
2,M4L,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,M4M,0,0,0,0,0,0,2,0,0,...,0,0,0,0,0,0,0,0,0,1
4,M4N,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now, we will take the most common venues and put them into a dataframe.

In [43]:
def return_most_common_venues(row, num_top_venues=3):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [44]:
import numpy as np
toronto_top = 3
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Postcode']
for ind in np.arange(toronto_top):
    columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Postcode'] = toronto_grouped.Postcode


for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], toronto_top)

neighborhoods_venues_sorted.head()

Unnamed: 0,Postcode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
0,M4E,Health Food Store,Trail,Pub
1,M4K,Greek Restaurant,Coffee Shop,Italian Restaurant
2,M4L,Sandwich Place,Gym,Pub
3,M4M,Café,Coffee Shop,Italian Restaurant
4,M4N,Park,Bus Line,Swim School


##### Now we can begin k-means clustering of these common venues to see what boroughs are similar.

In [46]:
from sklearn.cluster import KMeans
kclusters = 5

toronto_clustered = toronto_grouped.drop('Postcode', 1)

kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_clustered)

In [48]:
Toronto['K_label'] = kmeans.labels_

In [49]:
Toronto.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,K_label
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,2
1,M4K,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188,0
2,M4L,East Toronto,"The Beaches West,India Bazaar",43.668999,-79.315572,2
3,M4M,East Toronto,Studio District,43.659526,-79.340923,0
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,2


In [50]:
Toronto_final = pd.merge(left=Toronto, right=neighborhoods_venues_sorted, on='Postcode')
Toronto_final.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude,K_label,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,2,Health Food Store,Trail,Pub
1,M4K,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188,0,Greek Restaurant,Coffee Shop,Italian Restaurant
2,M4L,East Toronto,"The Beaches West,India Bazaar",43.668999,-79.315572,2,Sandwich Place,Gym,Pub
3,M4M,East Toronto,Studio District,43.659526,-79.340923,0,Café,Coffee Shop,Italian Restaurant
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,2,Park,Bus Line,Swim School


Now that all data is in one frame, we can go about mapping.

In [51]:
map_toronto = folium.Map(location=[43.6532, -79.3832], zoom_start=11)

colors = ['red','blue','green','orange','purple']

# add markers to map
for lat, lng, borough, neighborhood, lb in zip(Toronto_final['Latitude'], Toronto_final['Longitude'], 
                                         Toronto_final['Borough'],Toronto_final['Neighbourhood'],
                                        Toronto_final['K_label']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=colors[lb],
        fill=True,
        fill_color= colors[lb],
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)
    
map_toronto


Based on earlier k-means clustering, I can now see which neighborhoods are similar in terms of venues they contain, which would help inform a choice for moving.