# Segmenting and Clustering Neighborhoods in Toronto
### Coursera Course: IBM Data Science Professional Certificate
#### Capstone Project - Week 3 Assignment

---
# Part 1 - Build Neighborhood Dataset in Toronto

Data source: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M  
Use pandas or the BeautifulSoup package to transform the table on the Wikipedia page into a pandas dataframe.

In [52]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

class HTMLTableParser:

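    def parse_url(self, url):
        """Fetch `url` and return a list of (table id, DataFrame or None) tuples."""
        response = requests.get(url)
        response.raise_for_status()  # Fail early on HTTP errors
        soup = BeautifulSoup(response.text, 'lxml')
        # Tables without an id attribute are labelled "table"
        return [(table.get('id', "table"), self.parse_html_table(table)) for table in soup.find_all('table')]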

    def parse_html_table(self, table):
        n_columns = 0
        n_rows = 0
        column_names = []

        table_name = table.get('id', "no name")
        print(f'\nProcessing table[{table_name}] ...')
        # Find number of rows and columns
        # we also find the column titles if we can
        for row in table.find_all('tr'):

            # Determine the number of rows in the table
            td_tags = row.find_all('td')
            if len(td_tags) > 0:
                n_rows += 1
                if n_columns == 0:
                    # Set the number of columns for our table
                    n_columns = len(td_tags)
                    print(f"Number of Columns: {n_columns}")

            # Validate the number of columns in each row
            if n_columns != 0 and n_columns != len(td_tags):
                print(f"Number of Column MISMATCH! Require: {n_columns}, Found: {len(td_tags)}, table ignored!")
                return None

            # Handle column names if we find them
            if len(column_names) == 0:
                th_tags = row.find_all('th') 
                if len(th_tags) > 0:
                    for th in th_tags:
                        column_names.append(th.get_text().strip())
                    print(f"Column name: {column_names}")

        print(f"Number of Rows: {n_rows}")

        # Safeguard on Column Titles
        if len(column_names) > 0 and len(column_names) != n_columns:
            print("Column titles do not match the number of columns, table ignored!")
            return None

        columns = column_names if len(column_names) > 0 else range(0, n_columns)
        df = pd.DataFrame(columns = columns, index = range(0, n_rows))
        row_index = 0
        for row in table.find_all('tr'):
            column_index = 0
            columns = row.find_all('td')
            for column in columns:
                df.iat[row_index, column_index] = column.get_text().strip()
                column_index += 1
            if len(columns) > 0:
                row_index += 1

        # Convert to float if possible
        for col in df:
            try:
                df[col] = df[col].astype(float)
            except ValueError:
                pass

        return df
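
The assignment text also allows plain pandas. As a point of comparison with the hand-written parser above, here is a minimal sketch using `pandas.read_html`, which returns one dataframe per `<table>` element it can parse. The HTML snippet is a made-up stand-in for the Wikipedia page (an assumption, so that the example runs offline), and `read_html` needs an HTML parser such as `lxml` to be installed.

```python
import io

import pandas as pd

# Hypothetical stand-in for the Wikipedia page, so the sketch runs offline.
SAMPLE_HTML = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td></td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per parsed <table>.
tables = pd.read_html(io.StringIO(SAMPLE_HTML))
sample_df = tables[0]

print(sample_df.shape)
print(list(sample_df.columns))
```

Unlike the parser above, which stores empty cells as empty strings, `read_html` reads them as `NaN`. For the live page, the same call could be given the text of `requests.get(url)` wrapped in `io.StringIO`.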

In [53]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

hp = HTMLTableParser()
Toronto_df = hp.parse_url(url)[0][1]    # First table on the page; [1] picks the DataFrame out of the (id, DataFrame) tuple
Toronto_df.head()


Processing table[no name] ...
Column name: ['Postal Code', 'Borough', 'Neighborhood']
Number of Columns: 3
Number of Rows: 180

Processing table[no name] ...
Number of Columns: 2
Column name: ['Canadian postal codes']
Number of Column MISMATCH! Require: 2, Found: 31, table ignored!

Processing table[no name] ...
Number of Columns: 12
Number of Column MISMATCH! Require: 12, Found: 18, table ignored!


|   | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | |
| 1 | M2A | Not assigned | |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
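
Indexing with `[0][1]` assumes the postal-code table is the first one on the page, and the log above shows that ignored tables come back as `None`. A more defensive selection might look like the sketch below. `pick_table` is a hypothetical helper (not part of the notebook) that returns the first parsed table containing the expected column names; the `parsed` list is made-up data in the shape returned by `parse_url`.

```python
import pandas as pd


def pick_table(parsed_tables, required_columns):
    """Return the first DataFrame that has all of `required_columns`.

    `parsed_tables` is a list of (table id, DataFrame or None) tuples,
    the shape returned by HTMLTableParser.parse_url.
    """
    for _table_id, df in parsed_tables:
        if df is not None and set(required_columns).issubset(df.columns):
            return df
    raise ValueError(f"No table with columns {list(required_columns)} found")


# Made-up parser output: one ignored table (None) and one usable table.
parsed = [
    ("table", None),
    ("table", pd.DataFrame({"Postal Code": ["M3A"],
                            "Borough": ["North York"],
                            "Neighborhood": ["Parkwoods"]})),
]

selected = pick_table(parsed, ["Postal Code", "Borough", "Neighborhood"])
print(selected)
```

Selecting by column names keeps the notebook working if Wikipedia adds another table ahead of the postal-code one.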


In [54]:
# Print the number of rows and columns in the original table
Toronto_df.shape

(180, 3)

#### The dataframe will consist of three columns: `PostalCode`, `Borough`, and `Neighborhood`

In [55]:
Toronto_df.rename(columns={"Postal Code": "PostalCode"}, inplace=True)
Toronto_df.columns

Index(['PostalCode', 'Borough', 'Neighborhood'], dtype='object')

#### Only process the cells that have an assigned borough. Ignore cells with a borough that is `Not assigned`.

In [56]:
Toronto_df = Toronto_df[Toronto_df['Borough'] != 'Not assigned'].reset_index(drop=True)
Toronto_df.head()

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


#### More than one neighborhood can exist in one postal code area.  
For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row, with the neighborhoods separated by a comma, as shown in row 11 in the above table.

> The Wikipedia table has since been cleaned up, so the `PostalCode` column no longer contains duplicate values. Let's double-check.

In [57]:
# Count the rows whose PostalCode repeats an earlier row
print("Number of duplicated values in the PostalCode column:", Toronto_df.duplicated(['PostalCode']).sum())

Number of duplicated values in the PostalCode column: 0
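
No merge is needed for the current table, but the combining step described above can be sketched with `groupby`. This is an illustration only: the three-row `sample` dataframe is made up to mimic the older layout in which M5A appeared twice, and it is not data from the current page.

```python
import pandas as pd

# Made-up rows mimicking the older table layout with a repeated postal code.
sample = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M6A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Lawrence Heights"],
})

# Group rows that share a postal code and borough, then join their
# neighborhoods into one comma-separated string per group.
combined = (
    sample.groupby(["PostalCode", "Borough"], as_index=False, sort=False)["Neighborhood"]
          .agg(", ".join)
)

print(combined)
```

Passing `sort=False` keeps the postal codes in their order of first appearance rather than sorting them alphabetically.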


#### If a cell has a borough but a `Not assigned` neighborhood, then the neighborhood will be the same as the borough.

> The Wikipedia table has since been cleaned up, so the `Neighborhood` column no longer contains **Not assigned** or **empty** values for rows with an assigned borough. Let's double-check.

In [58]:
# Count rows whose neighborhood is "Not assigned" or missing
print('Number of rows with a "Not assigned" or missing Neighborhood:',
      len(Toronto_df[(Toronto_df['Neighborhood'] == 'Not assigned') | (Toronto_df['Neighborhood'].isnull())]))

Number of rows with a "Not assigned" or missing Neighborhood: 0
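
If such rows did exist, the rule above (neighborhood falls back to the borough) could be applied as in the following sketch. The `sample` rows are hypothetical. The mask also treats empty strings as missing, because the parser in this notebook stores empty cells as `''`, which `isnull()` alone would not flag.

```python
import pandas as pd

# Hypothetical rows covering the three "missing" cases plus one normal row.
sample = pd.DataFrame({
    "PostalCode": ["M7A", "M9A", "M1X", "M1B"],
    "Borough": ["Queen's Park", "Etobicoke", "Scarborough", "Scarborough"],
    "Neighborhood": ["Not assigned", "", None, "Malvern, Rouge"],
})

filled = sample.copy()

# A neighborhood counts as missing if it is NaN/None, an empty string,
# or the literal text "Not assigned".
missing = (
    filled["Neighborhood"].isna()
    | filled["Neighborhood"].str.strip().isin(["", "Not assigned"])
)

# Fall back to the borough name for those rows only.
filled.loc[missing, "Neighborhood"] = filled.loc[missing, "Borough"]

print(filled)
```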


#### In the last cell of your notebook, use the `.shape` attribute to print the number of rows of your dataframe.

In [59]:
Toronto_df.shape

(103, 3)