# Explore and Cluster Neighborhood Data in Toronto

## Retrieve Data from Wikipedia Using Beautiful Soup

In [1]:
# Wikipedia hosts a page of postal codes used in Toronto, Ontario, Canada
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

In [2]:
# Import the necessary libraries.
import requests
import csv
import pandas as pd
from bs4 import BeautifulSoup

Create a placeholder list to hold the parsed rows.

In [3]:
# List of dictionaries, one per parsed table row
postalcodes_list = []

In [4]:
# Open a file handle to store the parsed output in comma-separated value (CSV) format
# newline='' lets the csv module control the line endings
csv_file = open('cms_scrape.csv', 'w', newline='')
csv_writer = csv.writer(csv_file)
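
As an aside, the file handle above stays open across several cells and is closed manually later. A minimal sketch of an alternative, assuming the rows are already available when the file is opened, is to use a `with` block so the file is closed automatically. The temporary path and the sample rows below are illustrative only and are not part of the notebook.

```python
import csv
import os
import tempfile

# Hypothetical output path in a temporary directory (not the notebook's file)
path = os.path.join(tempfile.mkdtemp(), "example_scrape.csv")

# The with block closes the file even if an error occurs while writing
with open(path, "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["Postcode", "Borough", "Neighbourhood"])
    writer.writerow(["M3A", "North York", "Parkwoods"])

# Read the file back to confirm what was written
with open(path, newline="") as handle:
    rows = list(csv.reader(handle))
```

The notebook keeps the handle open instead because the header and the data rows are written in different cells.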

In [5]:
# Get the webpage to parse
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')

# Find the table that contains the information
table = soup.find('table',class_='wikitable sortable').tbody

## Check the Content of the HTML Page

Find the table that contains the postal code information. Below is an excerpt of the page's HTML source showing the table and its first several rows. Note the tags, elements, and attributes used.
Note that in some rows the content of the last `<td>` element includes a trailing carriage return or newline character.

    <table class="wikitable sortable">
        <tbody><tr>
        <th>Postcode</th>
        <th>Borough</th>
        <th>Neighbourhood
        </th></tr>
        <tr>
        <td>M1A</td>
        <td>Not assigned</td>
        <td>Not assigned
        </td></tr>
        <tr>
        <td>M2A</td>
        <td>Not assigned</td>
        <td>Not assigned
        </td></tr>
        <tr>
        <td>M3A</td>
        <td><a href="/wiki/North_York" title="North York">North York</a></td>
        <td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
        </td></tr>
        <tr>
        <td>M4A</td>
        <td><a href="/wiki/North_York" title="North York">North York</a></td>
        <td><a href="/wiki/Victoria_Village" title="Victoria Village">Victoria Village</a>
        </td></tr>
        <tr>
    </table>
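
To see why the trailing newline matters, here is a small sketch that extracts the cell text from one row of the excerpt and then strips it. It uses the standard library's `html.parser` rather than Beautiful Soup so that it runs without extra packages; the `CellCollector` class is a hypothetical helper written for this illustration and is not used elsewhere in the notebook.

```python
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being parsed
        self._cell = None  # text fragments of the cell being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

html = """<table><tbody><tr>
<td>M3A</td>
<td><a href="/wiki/North_York" title="North York">North York</a></td>
<td><a href="/wiki/Parkwoods" title="Parkwoods">Parkwoods</a>
</td></tr></tbody></table>"""

parser = CellCollector()
parser.feed(html)
parser.close()

raw_cells = parser.rows[0]                       # last cell ends with "\n"
clean_cells = [cell.rstrip() for cell in raw_cells]
```

The last raw cell carries the newline that sits before `</td>` in the source, which is why the parsing code in this notebook calls `rstrip()` on every cell.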

### Parse the Table Information

In [6]:
# Parse the table headers from the table
# Remove any end of line characters
tableheaderTags = table.find_all('th')
columnHeaders = []
for header in tableheaderTags:
    columnHeaders.append(header.get_text().rstrip())
csv_writer.writerow(columnHeaders)

32
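
The `32` shown above is the return value of `csv_writer.writerow`, which passes back whatever the underlying file object's `write` method returns, here the number of characters written. A small sketch, using an in-memory `io.StringIO` buffer in place of the notebook's file, reproduces the count:

```python
import csv
import io

# In-memory stand-in for the CSV file used in the notebook
buffer = io.StringIO()
writer = csv.writer(buffer)

# writerow returns the result of buffer.write(), i.e. the character count
written = writer.writerow(["Postcode", "Borough", "Neighbourhood"])
line = buffer.getvalue()
```

The header text is 30 characters long, and the default line terminator `\r\n` adds two more.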

### Apply the Specified Rules

Only process the rows that have an assigned borough. Ignore rows whose borough is `Not assigned`.

More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated by a comma, as shown in row 11 in the above table.

If a row has a borough but a `Not assigned` neighborhood, then the neighborhood will be the same as the borough. So for the 9th row in the table on the Wikipedia page (M7A), the value of both the Borough and the Neighborhood columns will be Queen's Park.
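
The first and last rules can be sketched as a small function that works on one row at a time. The function name `apply_rules` and the sample rows are illustrative assumptions; the notebook itself applies the same logic inline in the parsing loop.

```python
NOT_ASSIGNED = "Not assigned"

def apply_rules(code, borough, neighborhood):
    """Return a cleaned row as a dict, or None if the row should be skipped."""
    if borough == NOT_ASSIGNED:
        # Rule: ignore rows without an assigned borough
        return None
    if neighborhood == NOT_ASSIGNED:
        # Rule: fall back to the borough name
        neighborhood = borough
    return {"postcode": code, "borough": borough, "neighborhood": neighborhood}

sample_rows = [
    ("M1A", "Not assigned", "Not assigned"),
    ("M3A", "North York", "Parkwoods"),
    ("M7A", "Queen's Park", "Not assigned"),
]

# Keep only the rows that survive the rules
kept = [row for row in (apply_rules(*r) for r in sample_rows) if row is not None]
```

Combining rows that share a postcode is a separate step, handled later once the data is in a dataframe.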

In [7]:
# Parse the contents of each row
for table_row in table.find_all('tr'):
    cells = table_row.find_all('td')
    if len(cells) == 3:
        code = cells[0].get_text().rstrip()
        boro = cells[1].get_text().rstrip()
        hood = cells[2].get_text().rstrip()

        if boro != "Not assigned":
            if hood == "Not assigned":
                # Assign a default value for neighbourhoods that are not assigned
                hood = boro
                print("Assigned neighborhood with borough {0}, {1}, {2}".format(code,boro,hood))
            postalcode = {'postcode': code, 'borough': boro, 'neighborhood': hood}
            postalcodes_list.append(postalcode)
            print("Added {0}, {1}, {2} to the list".format(code,boro,hood))
            csv_writer.writerow([code, boro, hood])
        else:
            # Skip any rows where the Borough is "Not assigned"
            print("Skipped {0}, {1}, {2}".format(code,boro,hood))
csv_file.close()

Skipped M1A, Not assigned, Not assigned
Skipped M2A, Not assigned, Not assigned
Added M3A, North York, Parkwoods to the list
Added M4A, North York, Victoria Village to the list
Added M5A, Downtown Toronto, Harbourfront to the list
Added M5A, Downtown Toronto, Regent Park to the list
Added M6A, North York, Lawrence Heights to the list
Added M6A, North York, Lawrence Manor to the list
Assigned neighborhood with borough M7A, Queen's Park, Queen's Park
Added M7A, Queen's Park, Queen's Park to the list
Skipped M8A, Not assigned, Not assigned
Added M9A, Etobicoke, Islington Avenue to the list
Added M1B, Scarborough, Rouge to the list
Added M1B, Scarborough, Malvern to the list
Skipped M2B, Not assigned, Not assigned
Added M3B, North York, Don Mills North to the list
Added M4B, East York, Woodbine Gardens to the list
Added M4B, East York, Parkview Hill to the list
Added M5B, Downtown Toronto, Ryerson to the list
Added M5B, Downtown Toronto, Garden District to the list
Added M6B, North York, G

## Load the information into a Data Frame
Show the first several rows of the table.

In [8]:
# Load the data from the CSV file written during parsing
df_postalCodesToronto = pd.read_csv('cms_scrape.csv')
df_postalCodesToronto.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |


### Apply the additional rule to the data
Collapse the rows that share a postcode and borough so that each postcode appears only once.

More than one neighborhood can exist in one postal code area. For example, M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated by a comma.

In [9]:
# Recall the column names
df_postalCodesToronto.columns

Index(['Postcode', 'Borough', 'Neighbourhood'], dtype='object')

Use the `groupby` method on the postcodes and boroughs.
This lets you find the postcodes that have more than one neighborhood and collapse the neighborhood information into a single row.

In [10]:
df_postalCodesCollapsed = df_postalCodesToronto.groupby(by=['Postcode','Borough']).agg(lambda x: ','.join(x))
df_postalCodesCollapsed.reset_index(level=['Postcode','Borough'], inplace=True)
df_postalCodesCollapsed.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge,Malvern |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
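
To make the collapse step concrete, the sketch below does by hand what the `groupby` and `','.join` combination does, using a plain dictionary instead of pandas. The three sample rows are taken from the parsing output earlier in the notebook; the variable names are illustrative.

```python
rows = [
    ("M5A", "Downtown Toronto", "Harbourfront"),
    ("M5A", "Downtown Toronto", "Regent Park"),
    ("M3A", "North York", "Parkwoods"),
]

# Group the neighborhoods under each (postcode, borough) key
grouped = {}
for code, borough, hood in rows:
    grouped.setdefault((code, borough), []).append(hood)

# Join each group's neighborhoods with a comma; keys are sorted,
# as groupby does by default
collapsed = [
    (code, borough, ",".join(hoods))
    for (code, borough), hoods in sorted(grouped.items())
]
```

Each postcode now appears once, and M5A carries both of its neighborhoods in a single comma-separated string.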


### Now the table is formatted.

Show the size of the dataframe; print the number of rows in the dataframe.

In [11]:
# This outputs (103,3)
df_postalCodesCollapsed.shape

(103, 3)

Export the revised dataframe to a CSV file for later use.

In [12]:
# to_csv returns None when given a path, so there is nothing useful to assign
df_postalCodesCollapsed.to_csv('postalCodesToronto_dataframe.csv', index=False, header=True)