# Segmenting and Clustering Neighborhoods in the city of Toronto, Canada

## Part 2 week 3 IBM Data Science - Applied Data Science Capstone

Part 1 builds a DataFrame containing the postal code, borough, and neighborhood of each Toronto postal area. Part 2 adds the latitude and longitude coordinates of each postal code to that DataFrame.

### **PART 1**

**1. Installing and importing libraries**

In [1]:
# This code installs the required libraries
!pip install beautifulsoup4
!conda install -c conda-forge geopy --yes
!conda install -c conda-forge folium=0.5.0 --yes

print('Libraries installed successfully!')

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Libraries installed successfully!


In [2]:
# This code imports the required libraries
import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

print('Libraries imported successfully!')

Libraries imported successfully!


**2. Scraping the list of postal codes in Canada dataset from Wikipedia**

In [3]:
# This code requests the Wikipedia page and reads the HTML text of the server's response
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
results = requests.get(url).text

# This code parses the HTML document with the BeautifulSoup library and selects the first table on the page
Canada_data = BeautifulSoup(results, 'html.parser')
wikipedia_table = Canada_data.find('table')

# This code creates an empty DataFrame with the target column names; the rows are filled in the next cell
column_names = ['Postal Code', 'Borough', 'Neighborhood']
df = pd.DataFrame(columns=column_names)
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood


In [4]:
# This code loops over the table rows and appends every row with exactly
# three cells (postal code, borough, neighborhood) to the df DataFrame
for tr_cell in wikipedia_table.find_all('tr'):
    row_data = []
    for td_cell in tr_cell.find_all('td'):
        row_data.append(td_cell.text.strip())
    if len(row_data) == 3:
        df.loc[len(df)] = row_data

# This code displays the first 12 results in the df DataFrame
df.head(12)

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,Not assigned
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"
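
Appending one row at a time with `df.loc[len(df)]` works for a table of this size, but it becomes slow as the row count grows. A common alternative is to collect the rows in a list and build the DataFrame in a single call. The sketch below assumes the cell texts have already been extracted; `scraped_rows` and `sample_df` are hypothetical names, and the sample rows are a small subset of the table above.

```python
import pandas as pd

# Hypothetical sample of cell texts, shaped like what the scraping loop collects
# for each table row; a header row yields an empty list because it has no <td> cells.
scraped_rows = [
    [],
    ['M1A', 'Not assigned', 'Not assigned'],
    ['M3A', 'North York', 'Parkwoods'],
    ['M5A', 'Downtown Toronto', 'Regent Park, Harbourfront'],
]

column_names = ['Postal Code', 'Borough', 'Neighborhood']

# Keep only complete rows, then build the DataFrame once.
complete_rows = [row for row in scraped_rows if len(row) == len(column_names)]
sample_df = pd.DataFrame(complete_rows, columns=column_names)
print(sample_df)
```

The filter on `len(row)` plays the same role as the `len(row_data) == 3` check in the loop above.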


**3. Data cleaning**

In [5]:
# This code removes the rows whose borough is "Not assigned".
# reset_index() renumbers the rows and keeps the old row labels in a new 'index' column.
df = df[~df['Borough'].str.contains("Not assigned")].reset_index()

# This code displays the first 12 results in the df DataFrame
df.head(12)

Unnamed: 0,index,Postal Code,Borough,Neighborhood
0,2,M3A,North York,Parkwoods
1,3,M4A,North York,Victoria Village
2,4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,5,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,9,M1B,Scarborough,"Malvern, Rouge"
7,11,M3B,North York,Don Mills
8,12,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,13,M5B,Downtown Toronto,"Garden District, Ryerson"


In [6]:
# This code removes the leftover 'index' column created by reset_index()
df.drop(['index'], axis = 1, inplace = True)

# This code displays the first 12 results in the cleaned df DataFrame
df.head(12)

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


In [7]:
# This code prints the dimensions of the df DataFrame
print('The shape for df DataFrame is:', df.shape)

The shape for df DataFrame is: (103, 3)


The DataFrame contains 103 rows (one per postal code) and 3 columns: Postal Code, Borough, and Neighborhood.
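
The cleaning above takes two cells: one to filter and reset the index, and one to drop the leftover `index` column. As a minimal sketch, the same result can be reached in one step by passing `drop=True` to `reset_index`. The three-row `sample_df` below is a hypothetical stand-in for the scraped table.

```python
import pandas as pd

# Hypothetical three-row sample with the same columns as the scraped table.
sample_df = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Filter out unassigned boroughs and renumber the rows;
# drop=True discards the old row labels instead of adding an 'index' column.
cleaned_df = sample_df[sample_df['Borough'] != 'Not assigned'].reset_index(drop=True)
print(cleaned_df.shape)
```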

### **PART 2**

**4. Adding the latitude and longitude coordinates to the DataFrame**

In [8]:
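# This code defines a function that returns the coordinates (latitude, longitude) of one postal code.
# The cells below read the coordinates from a CSV file instead of calling it.
def get_geocode(postal_code, max_attempts=10):

    # imported here because the geocoder package is not installed or imported in the cells above
    import geocoder

    # initialize the variable to None
    lat_lng_coords = None

    # retry a limited number of times instead of looping forever
    attempts = 0
    while lat_lng_coords is None and attempts < max_attempts:
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng
        attempts += 1

    # give up if no coordinates were returned
    if lat_lng_coords is None:
        return None, None

    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
    return latitude, longitude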
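
A geocoding service can return no result for a request, so a retry loop benefits from an upper bound on the number of attempts. The sketch below shows that pattern in isolation, without calling any real service. `get_latlng_with_retry`, `lookup`, and `fake_lookup` are hypothetical names introduced for illustration; `lookup` stands for any function that returns a `[latitude, longitude]` list or `None`.

```python
import time

def get_latlng_with_retry(lookup, postal_code, max_attempts=3, delay_seconds=0.0):
    """Call lookup(postal_code) until it returns coordinates or attempts run out."""
    for _ in range(max_attempts):
        coords = lookup(postal_code)
        if coords is not None:
            return coords[0], coords[1]
        # Wait before the next attempt (no wait by default in this sketch).
        time.sleep(delay_seconds)
    # No coordinates after max_attempts tries.
    return None

# Hypothetical stand-in for a geocoding call: fails twice, then succeeds.
responses = iter([None, None, [43.806686, -79.194353]])

def fake_lookup(postal_code):
    return next(responses)

result = get_latlng_with_retry(fake_lookup, 'M1B')
print(result)
```

Passing the lookup function as an argument keeps the retry logic independent of any particular geocoding library.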

In [9]:
# This code reads the geographical coordinates of each postal code from a CSV file using the pandas library
locgeo_df = pd.read_csv('https://cocl.us/Geospatial_data')

# This code displays the first 12 results of the locgeo_df DataFrame
locgeo_df.head(12)

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


Before joining the 'df' and 'locgeo_df' DataFrames, the shape of 'locgeo_df' should be checked to confirm that it has the same number of rows (one per postal code) as 'df'.

In [10]:
# This code prints the dimensions of the locgeo_df DataFrame
print('The shape for locgeo_df DataFrame is:', locgeo_df.shape)

The shape for locgeo_df DataFrame is: (103, 3)


Both DataFrames have 103 rows, so the Latitude and Longitude columns from 'locgeo_df' can be added to the existing 'df' DataFrame by joining on the Postal Code column.
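
Matching shapes do not by themselves guarantee that both tables hold the same postal codes. As a minimal sketch on a hypothetical three-row sample (`left_df` and `coords_df` are illustrative names, with coordinates taken from the rows shown in this notebook), the key sets can be compared directly and the join can be validated as one-to-one using `DataFrame.merge`:

```python
import pandas as pd

# Hypothetical samples: same postal codes, but in a different row order.
left_df = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M1B'],
    'Borough': ['North York', 'North York', 'Scarborough'],
})
coords_df = pd.DataFrame({
    'Postal Code': ['M1B', 'M3A', 'M4A'],
    'Latitude': [43.806686, 43.753259, 43.725882],
    'Longitude': [-79.194353, -79.329656, -79.315572],
})

# Compare the sets of keys rather than only the shapes.
same_keys = set(left_df['Postal Code']) == set(coords_df['Postal Code'])

# validate='one_to_one' raises an error if a postal code is duplicated on either side.
merged_df = left_df.merge(coords_df, on='Postal Code', how='left', validate='one_to_one')

# Count rows that did not find coordinates in coords_df.
missing_coords = int(merged_df['Latitude'].isna().sum())
print(same_keys, missing_coords)
```

A left merge keeps the row order of `left_df`, which matches the behavior of joining on the Postal Code column.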

In [11]:
# This code joins df and locgeo_df DataFrames on Postal Code column
df = df.join(locgeo_df.set_index('Postal Code'), on = 'Postal Code')

# This code displays the first 12 results of the new df DataFrame
df.head(12)

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
