# Segmenting and Clustering Neighborhoods in Toronto
### Coursera Course: IBM Data Science Professional Certificate
#### Capstone Project - Week 3 Assignment

---
# Part 1 - Build Neighborhood Dataset in Toronto

Data source: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M  
Use pandas or the BeautifulSoup package to transform the table on the Wikipedia page into a pandas dataframe.

In [3]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

class HTMLTableParser:

    def parse_url(self, url):
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'lxml')
        return [(table.get('id', "table"), self.parse_html_table(table)) for table in soup.find_all('table')]  

    def parse_html_table(self, table):
        n_columns = 0
        n_rows = 0
        column_names = []

        table_name = table.get('id', "no name")
        print(f'\nProcessing table[{table_name}] ...')
        # Find number of rows and columns
        # we also find the column titles if we can
        for row in table.find_all('tr'):

            # Determine the number of rows in the table
            td_tags = row.find_all('td')
            if len(td_tags) > 0:
                n_rows += 1
                if n_columns == 0:
                    # Set the number of columns for our table
                    n_columns = len(td_tags)
                    print(f"Number of Columns: {n_columns}")

            # Validate the number of columns for each data row;
            # rows without <td> cells (e.g. header rows) are skipped
            if td_tags and n_columns != len(td_tags):
                print(f"Number of Column MISMATCH! Require: {n_columns}, Found: {len(td_tags)}, table ignored!")
                return None

            # Handle column names if we find them
            if len(column_names) == 0:
                th_tags = row.find_all('th') 
                if len(th_tags) > 0:
                    for th in th_tags:
                        column_names.append(th.get_text().strip())
                    print(f"Column name: {column_names}")

        print(f"Number of Rows: {n_rows}")

        # Safeguard on Column Titles
        if len(column_names) > 0 and len(column_names) != n_columns:
            print("Column titles do not match the number of columns, table ignored!")
            return None

        columns = column_names if len(column_names) > 0 else range(0, n_columns)
        df = pd.DataFrame(columns = columns, index = range(0, n_rows))
        row_index = 0
        for row in table.find_all('tr'):
            column_index = 0
            columns = row.find_all('td')
            for column in columns:
                df.iat[row_index, column_index] = column.get_text().strip()
                column_index += 1
            if len(columns) > 0:
                row_index += 1

        # Convert to float if possible
        for col in df:
            try:
                df[col] = df[col].astype(float)
            except ValueError:
                pass

        return df

In [4]:
# Fetch the tables from the Wikipedia page and keep the first one as a pandas dataframe
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

hp = HTMLTableParser()
Toronto_df = hp.parse_url(url)[0][1]    # Grabbing the table from the tuple
Toronto_df.head()


Processing table[no name] ...
Column name: ['Postal Code', 'Borough', 'Neighborhood']
Number of Columns: 3
Number of Rows: 180

Processing table[no name] ...
Number of Columns: 2
Column name: ['Canadian postal codes']
Number of Column MISMATCH! Require: 2, Found: 31, table ignored!

Processing table[no name] ...
Number of Columns: 12
Number of Column MISMATCH! Require: 12, Found: 18, table ignored!


| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | |
| 1 | M2A | Not assigned | |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


In [5]:
# Print the number of rows and columns in the original table
Toronto_df.shape

(180, 3)

#### The dataframe will consist of three columns: `PostalCode`, `Borough`, and `Neighborhood`

In [6]:
Toronto_df.rename(columns={"Postal Code": "PostalCode"}, inplace=True)
Toronto_df.columns

Index(['PostalCode', 'Borough', 'Neighborhood'], dtype='object')

#### Only process the cells that have an assigned borough. Ignore cells with a borough that is `Not assigned`.

In [7]:
Toronto_df = Toronto_df[Toronto_df['Borough'] != 'Not assigned'].reset_index(drop=True)
Toronto_df.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


#### More than one neighborhood can exist in one postal code area.  
For example, in an earlier version of the table on the Wikipedia page, M5A was listed twice with two neighborhoods: Harbourfront and Regent Park. Such rows should be combined into one row, with the neighborhoods separated by a comma.

> The table has been cleaned up recently, so the `Postal Code` column no longer contains duplicate values. Let's double-check.

In [8]:
# Check whether any postal code appears more than once
print("Number of duplicated values in PostalCode column:", len(Toronto_df[Toronto_df.duplicated(['PostalCode'])]))

Number of duplicated values in PostalCode column: 0
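
If the source table did still list a postal code on several rows, the rows could be combined with a `groupby`. The following is a minimal sketch on a small hand-written frame; the `sample` rows are illustrative and are not read from the Wikipedia page.

```python
import pandas as pd

# Illustrative rows only: M5A appears twice, as in the older table layout
sample = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# Join the neighborhoods of each (PostalCode, Borough) pair with a comma
combined = (sample
            .groupby(['PostalCode', 'Borough'], sort=False)['Neighborhood']
            .agg(', '.join)
            .reset_index())
print(combined)
```

Passing `sort=False` keeps the postal codes in their order of first appearance rather than sorting them alphabetically.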


#### If a cell has a borough but a `Not assigned` neighborhood, then the neighborhood will be the same as the borough.

> The table has been cleaned up recently, so the `Neighborhood` column no longer contains **Not assigned** or **empty** values. Let's double-check.

In [9]:
# Check whether any neighborhood is "Not assigned", an empty string, or missing
print('Number of rows with a "Not assigned" or empty Neighborhood value: ',
      len(Toronto_df[Toronto_df['Neighborhood'].isnull()
                     | Toronto_df['Neighborhood'].isin(['Not assigned', ''])]))

Number of rows with a "Not assigned" or empty Neighborhood value:  0
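
Had the check found any such rows, the rule above could be applied by copying the borough into the neighborhood column. This is a minimal sketch; the `sample` frame, the `M9Z` code, and `Example Borough` are made up for illustration.

```python
import pandas as pd

# Illustrative rows only; the second row has a borough but no neighborhood
sample = pd.DataFrame({
    'PostalCode': ['M3A', 'M9Z'],
    'Borough': ['North York', 'Example Borough'],
    'Neighborhood': ['Parkwoods', 'Not assigned'],
})

# Where the neighborhood is missing, fall back to the borough name
missing = sample['Neighborhood'].isna() | sample['Neighborhood'].isin(['Not assigned', ''])
sample.loc[missing, 'Neighborhood'] = sample.loc[missing, 'Borough']
print(sample)
```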


#### In the last cell of your notebook, use the `.shape` attribute to print the number of rows of your dataframe.

In [10]:
Toronto_df.shape

(103, 3)

---
# Part 2 - Find Geographical Coordinates for All Postal Codes in Toronto

Since `geocoder` doesn't work, I use the geospatial dataset available at http://cocl.us/Geospatial_data instead.

In [11]:
!wget -q -O 'Geospatial_data.csv' https://cocl.us/Geospatial_data
print('Geospatial data of Toronto downloaded!')

Geospatial data of Toronto downloaded!


Load the CSV file into a dataframe.

In [12]:
geo_df = pd.read_csv('Geospatial_data.csv')
geo_df.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


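Merge the two datasets on the postal code (`PostalCode` in `Toronto_df`, `Postal Code` in `geo_df`), then drop the redundant `Postal Code` column.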

In [13]:
all_df = Toronto_df.merge(geo_df, left_on='PostalCode', right_on='Postal Code').drop('Postal Code', axis=1)
all_df.head()

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |


In [14]:
all_df.shape

(103, 5)
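
The row count is still 103, which suggests the default inner join dropped no postal codes. A more explicit check could use a left join with `indicator=True`, so that codes without coordinates are listed rather than silently removed. The frames below are a minimal sketch; `M0X` and `Example Borough` are made up to show an unmatched row.

```python
import pandas as pd

# Illustrative frames; 'M0X' is a made-up code with no coordinates
left = pd.DataFrame({'PostalCode': ['M3A', 'M4A', 'M0X'],
                     'Borough': ['North York', 'North York', 'Example Borough']})
right = pd.DataFrame({'Postal Code': ['M3A', 'M4A'],
                      'Latitude': [43.753259, 43.725882],
                      'Longitude': [-79.329656, -79.315572]})

# A left join keeps every postal code; indicator=True adds a '_merge' column
# recording where each row matched, and validate raises an error if a
# postal code is duplicated on either side
checked = left.merge(right, left_on='PostalCode', right_on='Postal Code',
                     how='left', indicator=True, validate='one_to_one')
unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```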

---
# Part 3 - Explore and Cluster the Neighborhoods in Toronto

Explore only the boroughs whose names contain the word `Toronto`.

In [15]:
toronto_data = all_df[all_df['Borough'].str.contains("Toronto", case=False)].reset_index(drop=True)
toronto_data.head()

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 1 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
| 2 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |
| 3 | M5C | Downtown Toronto | St. James Town | 43.651494 | -79.375418 |
| 4 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |


In [16]:
toronto_data.shape

(39, 5)
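
To see how the filtered rows are spread across boroughs, a per-borough count can help. The sketch below uses a few illustrative rows rather than the real data; `na=False` is an added safeguard (not in the cell above) so that missing borough values count as non-matches.

```python
import pandas as pd

# Illustrative boroughs only, including one missing value
sample = pd.DataFrame({
    'Borough': ['Downtown Toronto', 'North York', 'East Toronto', None, 'Downtown Toronto'],
})

# Case-insensitive match; na=False treats missing boroughs as non-matches
mask = sample['Borough'].str.contains('Toronto', case=False, na=False)
counts = sample.loc[mask, 'Borough'].value_counts()
print(counts)
```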

In [2]:
#!conda install -c conda-forge geopy --yes # uncomment this line if the library is not installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
#from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

!conda install -c conda-forge folium=0.5.0 --yes # comment this line out if the library is already installed
import folium # map rendering library

print('Libraries imported.')

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - folium


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2020.4.5.1 |       hecc5488_0         146 KB  conda-forge
    folium-0.11.0              |             py_0          61 KB  conda-forge
    python_abi-3.6             |          1_cp36m           4 KB  conda-forge
    certifi-2020.4.5.1         |   py36h9f0ad1d_0         151 KB  conda-forge
    branca-0.4.1               |             py_0          26 KB  conda-forge
    openssl-1.1.1g             |       h516909a_0         2.1 MB  conda-forge
    ------------------------------------------------------------
                                           Total:         2.5 MB

The following NEW packages will be INSTALLED:

    branca:          0.4.1-py_0        conda-forge
    folium:          

#### Use the geopy library to get the latitude and longitude of the city of Toronto.
In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent `tor_explorer`, as shown below.

In [17]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="tor_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print(f'The geographical coordinates of Toronto are {latitude}, {longitude}.')

The geographical coordinates of Toronto are 43.6534817, -79.3839347.
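
Note that `geolocator.geocode` returns `None` when no match is found, in which case reading `location.latitude` raises an `AttributeError`. A small guard could look like the sketch below. `coordinates_or_default`, `FakeLocation`, and `stub_geocode` are hypothetical names, and the stub stands in for the real geocoder so the snippet runs offline.

```python
from collections import namedtuple

# Stand-in for the object returned by a geocoder (hypothetical, for offline use)
FakeLocation = namedtuple('FakeLocation', ['latitude', 'longitude'])

def coordinates_or_default(geocode, address, default):
    """Return (latitude, longitude) for address, or default if nothing is found."""
    location = geocode(address)
    if location is None:
        return default
    return (location.latitude, location.longitude)

# Stub geocoder: knows a single address and returns None otherwise
def stub_geocode(address):
    known = {'Toronto, Ontario': FakeLocation(43.6534817, -79.3839347)}
    return known.get(address)

print(coordinates_or_default(stub_geocode, 'Toronto, Ontario', (0.0, 0.0)))
print(coordinates_or_default(stub_geocode, 'Nowhere', (0.0, 0.0)))
```

With geopy, `geolocator.geocode` would be passed in place of `stub_geocode`.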


#### Generate a map to visualize the neighborhoods of Toronto and show how they cluster together.

In [19]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, label in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
    
map_toronto
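
All markers on this map share one color. To tell boroughs apart, each borough could be given its own marker color. This is a minimal sketch; `borough_colors` and its palette are assumptions made for illustration and are not part of folium.

```python
from itertools import cycle

def borough_colors(boroughs, palette=('blue', 'red', 'green', 'purple', 'orange')):
    """Map each distinct borough to a color, reusing the palette if it runs out."""
    colors = cycle(palette)
    mapping = {}
    for borough in boroughs:
        if borough not in mapping:
            mapping[borough] = next(colors)
    return mapping

color_of = borough_colors(['Downtown Toronto', 'East Toronto', 'Downtown Toronto'])
print(color_of)
```

In the marker loop, `toronto_data['Borough']` would be added to the `zip`, and `color=color_of[borough]` would replace the fixed `color='blue'`.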