### Coursera IBM Data Science Capstone Project
#### Assignment: Segmenting and Clustering Neighborhoods in Toronto

In [1]:
import pandas as pd
import numpy as np

# convert an address into latitude and longitude values
from geopy.geocoders import Nominatim

# scrape webpage
import requests
import urllib.request
import time
from bs4 import BeautifulSoup

Python Tutorial: Web Scraping with BeautifulSoup and Requests
https://www.youtube.com/watch?v=ng2o98k983k

Web Scraping HTML Tables with Python
https://towardsdatascience.com/web-scraping-html-tables-with-python-c9baba21059

Extracting HTML Tables using requests + beautiful soup &&
Saving it as CSV File in Python
https://www.thepythoncode.com/article/convert-html-tables-into-csv-files-in-python

Beautiful Soup Documentation https://beautiful-soup-4.readthedocs.io/en/latest/

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url)

In [3]:
# display the response object returned by the requests library;
# a status of 200 means the page was fetched successfully
response

<Response [200]>

In [4]:
# parse the HTML for a nicer, nested BeautifulSoup data structure
soup = BeautifulSoup(response.text, 'html.parser')

##### Define Functions for Web Scraping

In [5]:
# define a function to accept the target URL and create a soup object
def get_soup(url):
    html = requests.get(url)
    soup = BeautifulSoup(html.content, 'html.parser')
    return soup

In [6]:
# extract and return all tables in a soup object
# find the 'table' HTML tag
def get_all_tables(soup):
    return soup.find_all('table')

In [7]:
# extract the table headers
# find the 'th' tags
def get_table_headers(table):
    headers = []
    for th in table.find('tr').find_all('th'):
        headers.append(th.text.strip())
    return headers

In [8]:
# extract all the table rows
def get_table_rows(table):
    rows = []
    for tr in table.find_all('tr')[1:]:
        cells = []
        tds = tr.find_all('td') #grab all td tags in the table row
        
        # if no td tags, search for th tags
        if len(tds) == 0:
            ths = tr.find_all('th')
            for th in ths:
                cells.append(th.text.strip())
        
        # otherwise, use the regular td tags
        else:
            for td in tds:
                cells.append(td.text.strip())
        
        rows.append(cells)
    return rows
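To see what the header and row helpers return, here is a minimal sketch that applies the same `find`/`find_all` logic to a small inline table. The HTML string is a hypothetical stand-in for the Wikipedia page, not the real markup.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature table standing in for the Wikipedia page
html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# header row: text of every <th> in the first <tr>
headers = [th.text.strip() for th in table.find("tr").find_all("th")]

# data rows: text of every <td> in each remaining <tr>
rows = [[td.text.strip() for td in tr.find_all("td")]
        for tr in table.find_all("tr")[1:]]

print(headers)
print(rows)
```

The headers come back as a flat list and the rows as a list of lists, which is the shape `pd.DataFrame(rows, columns=headers)` expects.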

In [9]:
def save_as_csv(table_name, headers, rows):
    pd.DataFrame(rows, columns=headers).to_csv(f"{table_name}.csv")

In [10]:
def main(url):
    soup = get_soup(url) #get the soup
    tables = get_all_tables(soup) #extract all tables
    print(f"[+] Found a total of {len(tables)} tables.")
    
    # iterate over all tables
    for i, table in enumerate(tables, start=1):
        headers = get_table_headers(table) #get table headers
        rows = get_table_rows(table) #get table rows
        table_name = f"table-{i}"
        print(f"[+] Saving {table_name}")
        save_as_csv(table_name, headers, rows)

In [11]:
main(url)

[+] Found a total of 5 tables.
[+] Saving table-1


AttributeError: 'NoneType' object has no attribute 'find_all'
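The traceback is consistent with `table.find('tr')` returning `None` inside `get_table_headers`, which happens when a table element contains no `<tr>`. Which table on the page triggers this is not shown above, so the sketch below uses a hypothetical empty table; `get_table_headers_safe` is an illustrative name, not part of the original notebook.

```python
from bs4 import BeautifulSoup

def get_table_headers_safe(table):
    """Return the header texts, or an empty list if the table has no <tr>."""
    first_row = table.find("tr")
    if first_row is None:
        # nothing to read from; let the caller decide whether to skip the table
        return []
    return [th.text.strip() for th in first_row.find_all("th")]

# Hypothetical input: one ordinary table and one table with no rows at all
html = "<table><tr><th>Postcode</th></tr></table><table></table>"
tables = BeautifulSoup(html, "html.parser").find_all("table")

results = [get_table_headers_safe(t) for t in tables]
print(results)
```

A caller such as `main` would probably also want to skip any table whose header list comes back empty, since there is nothing useful to save for it.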

##### Clean Data
The postal code table was saved as table-1.csv before the error occurred, so the cleaning steps below read that file.

In [12]:
file = pd.read_csv('/Users/hahatrisha/Downloads/table-1.csv')
print('The shape of the Wiki table is', file.shape)
print('Columns are: ', file.columns)
file.head()

The shape of the Wiki table is (287, 4)
Columns are:  Index(['Unnamed: 0', 'Postcode', 'Borough', 'Neighbourhood'], dtype='object')


|   | Unnamed: 0 | Postcode | Borough | Neighbourhood |
|---|---|---|---|---|
| 0 | 0 | M1A | Not assigned | Not assigned |
| 1 | 1 | M2A | Not assigned | Not assigned |
| 2 | 2 | M3A | North York | Parkwoods |
| 3 | 3 | M4A | North York | Victoria Village |
| 4 | 4 | M5A | Downtown Toronto | Harbourfront |


To create the above dataframe:
1. The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood

In [13]:
# remove the 'Unnamed: 0' numbering
file.drop('Unnamed: 0', axis=1, inplace=True)
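The `Unnamed: 0` column appears because `to_csv` writes the dataframe index by default and `read_csv` then reads it back as an ordinary, unnamed column. As a sketch of two ways to avoid the extra column in the first place (the small dataframe and the in-memory buffers are stand-ins for the real CSV file):

```python
import io
import pandas as pd

df = pd.DataFrame({"Postcode": ["M3A", "M4A"],
                   "Borough": ["North York", "North York"]})

# default behaviour: the index is written out and comes back as 'Unnamed: 0'
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)
print(list(with_index.columns))

# option 1: do not write the index at all
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
no_index = pd.read_csv(buf)

# option 2: keep the file as it is, but read the first column back as the index
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
as_index = pd.read_csv(buf, index_col=0)
```

Either option leaves only the data columns, so no `drop` step is needed afterwards.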

2. Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

In [14]:
n_rows_before = file.shape[0]  # row count before filtering
del_index = file[file['Borough'] == 'Not assigned'].index
file.drop(del_index, inplace=True)
print("Deleted", n_rows_before - file.shape[0], "rows that do not have an assigned borough!")
print("The shape of the updated dataframe is", file.shape)

Deleted 77 rows that do not have an assigned borough!
The shape of the updated dataframe is (210, 3)
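The same filter can also be written as a boolean mask, which avoids collecting index labels first. This is a minimal sketch on a toy dataframe; the rows are taken from the table preview above.

```python
import pandas as pd

toy = pd.DataFrame({
    "Postcode": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

# keep only the rows whose borough is assigned
assigned = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)

n_dropped = len(toy) - len(assigned)
print("Dropped", n_dropped, "rows; new shape is", assigned.shape)
```

Unlike `drop(..., inplace=True)`, this returns a new dataframe and leaves the original untouched, which makes the cell safe to re-run.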


3. More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.

In [15]:
file

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M6A | North York | Lawrence Heights |
| 6 | M6A | North York | Lawrence Manor |
| ... | ... | ... | ... |
| 281 | M8Z | Etobicoke | Kingsway Park South West |
| 282 | M8Z | Etobicoke | Mimico NW |
| 283 | M8Z | Etobicoke | The Queensway West |
| 284 | M8Z | Etobicoke | Royal York South West |


In [16]:
# group the rows by postal code and borough, then join the unique
# neighbourhood names of each group with commas
data = file.groupby(['Postcode','Borough'])['Neighbourhood'].agg(lambda s: ','.join(s.unique())).reset_index()
data

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge,Malvern |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| ... | ... | ... | ... |
| 98 | M9N | York | Weston |
| 99 | M9P | Etobicoke | Westmount |
| 100 | M9R | Etobicoke | Kingsview Village,Martin Grove Gardens,Richvie... |
| 101 | M9V | Etobicoke | Albion Gardens,Beaumond Heights,Humbergate,Jam... |
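To make the combining step easier to follow, here is a small worked sketch on a handful of rows (the postal codes and neighbourhoods are the ones mentioned above). It groups by postal code and borough and joins each group's neighbourhoods with a comma.

```python
import pandas as pd

toy = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M6A", "M6A", "M4A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto",
                "North York", "North York", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park",
                      "Lawrence Heights", "Lawrence Manor", "Victoria Village"],
})

# one row per (Postcode, Borough); neighbourhoods joined in order of appearance
merged = (toy.groupby(["Postcode", "Borough"])["Neighbourhood"]
             .agg(",".join)
             .reset_index())
print(merged)
```

`groupby` sorts the group keys by default, so the result is ordered by postal code rather than by the order of the input rows.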


4. If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.

In [17]:
replace_index = data[data['Neighbourhood'] == 'Not assigned'].index
replace_index

Int64Index([93], dtype='int64')

In [18]:
data.loc[93,:]

Postcode                  M9A
Borough          Queen's Park
Neighbourhood    Not assigned
Name: 93, dtype: object

In [19]:
# copy the borough into the neighbourhood column for the unassigned rows only;
# a blanket data.replace() would rewrite 'Not assigned' in every column and row
data.loc[replace_index, 'Neighbourhood'] = data.loc[replace_index, 'Borough']

In [20]:
data.loc[93,:]

Postcode                  M9A
Borough          Queen's Park
Neighbourhood    Queen's Park
Name: 93, dtype: object
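The rule in step 4 can also be expressed as a boolean-mask assignment, which pairs each row with its own borough even when several rows are unassigned. The sketch below uses hypothetical rows (two unassigned neighbourhoods in different boroughs) to illustrate that case.

```python
import pandas as pd

# hypothetical rows for illustration; not taken from the scraped table
toy = pd.DataFrame({
    "Postcode": ["M7A", "M9A", "M4A"],
    "Borough": ["Queen's Park", "Etobicoke", "North York"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Victoria Village"],
})

# select the rows with an unassigned neighbourhood
mask = toy["Neighbourhood"] == "Not assigned"

# row-aligned assignment: each selected row receives its own borough
toy.loc[mask, "Neighbourhood"] = toy.loc[mask, "Borough"]
print(toy)
```

Because the assignment aligns on the index, each unassigned row picks up the borough from the same row rather than a single shared replacement value.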

5. Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
6. In the last cell of your notebook, use the .shape attribute to print the number of rows of your dataframe.

In [21]:
print("The number of rows in the dataframe is:", data.shape[0])

The number of rows in the dataframe is: 103
