# Segmenting and Clustering Neighborhoods in Toronto - Part I

<hr style="height:5px;background-color:black">

Let's initialize the target dataframe, called `neighborhoods`, with three columns: **PostalCode**, **Borough**, and **Neighborhood**.

### Let's scrape the Wikipedia table thanks to <a href="import requests">this tutorial</a>

In [6]:
import pandas as pd

# Install BeautifulSoup into the current kernel's environment
%pip install BeautifulSoup4

Collecting BeautifulSoup4
  Downloading https://files.pythonhosted.org/packages/e8/b5/7bb03a696f2c9b7af792a8f51b82974e51c268f15e925fc834876a4efa0b/beautifulsoup4-4.9.0-py3-none-any.whl (109kB)
Collecting soupsieve>1.2 (from BeautifulSoup4)
  Downloading https://files.pythonhosted.org/packages/05/cf/ea245e52f55823f19992447b008bcbb7f78efc5960d77f6c34b5b45b36dd/soupsieve-2.0-py2.py3-none-any.whl
Installing collected packages: soupsieve, BeautifulSoup4
Successfully installed BeautifulSoup4-4.9.0 soupsieve-2.0
Note: you may need to restart the kernel to use updated packages.


In [9]:
import requests
from bs4 import BeautifulSoup

# Download the Wikipedia page that lists the postal codes starting with M.
# requests.get(url).text sends a GET request and returns the page's HTML as a string,
# so website_url holds the HTML source of the page, not the URL itself.
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

# Parse the HTML source into a BeautifulSoup object (soup).
# soup.prettify() returns an indented view of the document that shows how the tags are nested.
soup = BeautifulSoup(website_url, 'html.parser')
#print(soup.prettify())
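
The request above sets no timeout and never checks the HTTP status, so a failed download could hand an error page to BeautifulSoup without any warning. Below is a minimal sketch of a more defensive fetch. The helper name `fetch_html`, the constant `WIKI_URL`, and the `User-Agent` string are assumptions made for illustration, not part of the original notebook.

```python
import requests

WIKI_URL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

def fetch_html(url, timeout=10.0):
    """Return the HTML of `url` as text, raising on HTTP errors (hypothetical helper)."""
    # Illustrative User-Agent; replace it with something that identifies your project.
    headers = {"User-Agent": "toronto-neighborhoods-notebook/0.1 (educational use)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text

# Usage (performs a network request, so it is left commented out here):
# website_html = fetch_html(WIKI_URL)
```

Failing early with `raise_for_status()` keeps a network problem from surfacing later as a confusing parsing error.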

<hr style="height:3px;background-color:crimson">

### Our first task is to find the table with the `wikitable sortable` class in the HTML source

In [12]:
my_table = soup.find('table',{'class':'wikitable sortable'})
#my_table
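
To see what this lookup does without hitting the network, here is a small sketch on a made-up HTML snippet. The names `sample_html`, `sample_soup`, `by_attr` and `by_css` are hypothetical, and the `navbox` class on the second table is only a placeholder.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the Wikipedia page: two tables, only one with the target classes.
sample_html = """
<table class="wikitable sortable"><tr><th>Postal code</th></tr></table>
<table class="navbox"><tr><th>Canadian postal codes</th></tr></table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Match on the full class attribute string, as in the cell above.
by_attr = sample_soup.find("table", {"class": "wikitable sortable"})

# CSS selector: matches a <table> that carries both classes, in any order.
by_css = sample_soup.select_one("table.wikitable.sortable")

print(by_attr == by_css)
```

The CSS selector is a little more tolerant, since it would still match if the page listed the two classes in a different order.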

In [14]:
# Let's extract the table headers for our dataframe columns
th = my_table.find_all('th')
th

[<th>Postal code
 </th>,
 <th>Borough
 </th>,
 <th>Neighborhood
 </th>]

In [70]:
# Tip 1 (find all tip links at the bottom of this notebook)
# Let's put these headers in a list
headerz = [element.text for element in soup.find_all('th')]
headerz

['Postal code\n', 'Borough\n', 'Neighborhood\n', 'Canadian postal codes\n']

In [71]:
# Tip 2
# Let's prepare the headers of our dataframe
headerz = [s.replace('\n', '') for s in headerz]  # strip the trailing \n from each header
headerz.pop()  # drop the last header: it comes from the 'Canadian postal codes' table at the bottom of the Wikipedia page
headerz

['Postal code', 'Borough', 'Neighborhood']
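
The extra `'Canadian postal codes'` header shows up because the search runs over the whole page. As a possible alternative, the sketch below searches inside the target table only and lets `get_text(strip=True)` trim the newlines. It runs on a made-up snippet, and `sample_html`, `sample_table` and `headers` are hypothetical names.

```python
from bs4 import BeautifulSoup

# Stand-in for the page: the target table plus a second table at the bottom.
sample_html = """
<table class="wikitable sortable">
<tr><th>Postal code
</th><th>Borough
</th><th>Neighborhood
</th></tr>
</table>
<table class="navbox"><tr><th>Canadian postal codes
</th></tr></table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find("table", {"class": "wikitable sortable"})

# Search inside the table only; strip=True removes the surrounding whitespace.
headers = [th.get_text(strip=True) for th in sample_table.find_all("th")]
print(headers)  # ['Postal code', 'Borough', 'Neighborhood']
```

With this approach there is nothing to `pop()` afterwards, because headers from other tables are never collected.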

<hr style="height:3px;background-color:crimson">

### Now we will prepare the content of our dataframe

In [58]:
# Let's retrieve the rows of the table, delimited by the `<tr>` tag
rowz = [element.text for element in soup.find_all('tr')]
rowz = [s.replace('\n\n', ',') for s in rowz] # replace each double \n with a comma
rowz.pop(0) # remove the first element of the list, containing the headers

# Let's remove the last 4 items at the bottom of the list as they come from the table
# 'Canadian postal codes' at the bottom of the Wikipedia page
for i in range(4):
    rowz.pop()

In [59]:
# Tip 3
# Let's remove the leading \n from each item of the list
def remove_cruft(s):
    return s[1:]

rowz = [remove_cruft(s) for s in rowz]
rowz

['M1A,Not assigned,\n',
 'M2A,Not assigned,\n',
 'M3A,North York,Parkwoods\n',
 'M4A,North York,Victoria Village\n',
 'M5A,Downtown Toronto,Regent Park / Harbourfront\n',
 'M6A,North York,Lawrence Manor / Lawrence Heights\n',
 "M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government\n",
 'M8A,Not assigned,\n',
 'M9A,Etobicoke,Islington Avenue\n',
 'M1B,Scarborough,Malvern / Rouge\n',
 'M2B,Not assigned,\n',
 'M3B,North York,Don Mills\n',
 'M4B,East York,Parkview Hill / Woodbine Gardens\n',
 'M5B,Downtown Toronto,Garden District / Ryerson\n',
 'M6B,North York,Glencairn\n',
 'M7B,Not assigned,\n',
 'M8B,Not assigned,\n',
 'M9B,Etobicoke,West Deane Park / Princess Gardens / Martin Grove / Islington / Cloverdale\n',
 'M1C,Scarborough,Rouge Hill / Port Union / Highland Creek\n',
 'M2C,Not assigned,\n',
 'M3C,North York,Don Mills\n',
 'M4C,East York,Woodbine Heights\n',
 'M5C,Downtown Toronto,St. James Town\n',
 'M6C,York,Humewood-Cedarvale\n',
 'M7C,Not assigned,\n',
 'M8C,Not ass
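
The string-based cleanup above depends on fixed positions (drop the first item, drop the last four, cut the first character). A sketch of another option is to read the `<td>` cells of each row directly, which avoids most of that bookkeeping. It uses a made-up three-row table, and `sample_html`, `sample_table` and `rows` are hypothetical names.

```python
from bs4 import BeautifulSoup

# Stand-in for the Wikipedia table: a header row followed by three data rows.
sample_html = """
<table class="wikitable sortable">
<tr><th>Postal code</th><th>Borough</th><th>Neighborhood</th></tr>
<tr><td>M1A
</td><td>Not assigned
</td><td>
</td></tr>
<tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
<tr><td>M5A
</td><td>Downtown Toronto
</td><td>Regent Park / Harbourfront
</td></tr>
</table>
"""
sample_table = BeautifulSoup(sample_html, "html.parser").find("table")

rows = []
for tr in sample_table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has only <th> cells, so it yields an empty list
        rows.append(cells)

print(rows)
```

Each row comes out as a list of three clean strings, ready to be passed to `pd.DataFrame(rows, columns=...)`.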

In [73]:
# Tip 4
# Let's convert the list to a dataframe with the headers we got
neighborhoods = pd.DataFrame([sub.split(",") for sub in rowz], columns=headerz)
neighborhoods.head()

| | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | \n |
| 1 | M2A | Not assigned | \n |
| 2 | M3A | North York | Parkwoods\n |
| 3 | M4A | North York | Victoria Village\n |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront\n |
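
Splitting on every comma works here because the neighborhoods are separated by slashes, not commas. If a neighborhood ever contained a comma, the row would produce a fourth field and `pd.DataFrame` would reject it. A minimal sketch of a guard, using `maxsplit`, is shown below. The second row is made up for illustration, and `sample_rows`, `records` and `sample_df` are hypothetical names.

```python
import pandas as pd

headers = ["Postal code", "Borough", "Neighborhood"]
sample_rows = [
    "M3A,North York,Parkwoods\n",
    "M9Z,Example Borough,First Place, Second Place\n",  # made-up row with an extra comma
]

# maxsplit=2 yields at most three fields, so extra commas stay in the last one.
records = [row.strip().split(",", 2) for row in sample_rows]
sample_df = pd.DataFrame(records, columns=headers)
print(sample_df)
```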


<hr style="height:3px;background-color:crimson">

### Let's remove the rows of our dataframe where the Borough is 'Not assigned'

In [74]:
neighborhoods = neighborhoods.drop(neighborhoods[neighborhoods['Borough'] == 'Not assigned'].index)
neighborhoods.head()

| | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods\n |
| 3 | M4A | North York | Victoria Village\n |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront\n |
| 5 | M6A | North York | Lawrence Manor / Lawrence Heights\n |
| 6 | M7A | Downtown Toronto | Queen's Park / Ontario Provincial Government\n |
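
The same filtering can also be written as a boolean mask, which some readers find easier to follow than `drop(...index)`. The sketch below uses a small made-up dataframe, and `sample_df` and `assigned` are hypothetical names.

```python
import pandas as pd

sample_df = pd.DataFrame({
    "Postal code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighborhood": ["", "Parkwoods", "Victoria Village"],
})

# Keep only the rows whose Borough is assigned; the original index labels are preserved.
assigned = sample_df[sample_df["Borough"] != "Not assigned"]
print(assigned)
```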


In [92]:
# Tip 5
# Let's remove the trailing \n at the end of each string in the Neighborhood column
neighborhoods['Neighborhood'] = neighborhoods['Neighborhood'].str.replace('\n', '', regex=False)

# Let's replace the slashes (/) with commas (,) in the Neighborhood column
neighborhoods['Neighborhood'] = neighborhoods['Neighborhood'].str.replace(' /', ',', regex=False)
neighborhoods.head()

| | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
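
As a compact illustration of this cleanup, the two steps can be chained on a Series. This is only a sketch on two sample values, with `sample` and `cleaned` as hypothetical names. It uses `str.strip()` for the newline and `regex=False` so that the pattern is treated as literal text.

```python
import pandas as pd

sample = pd.Series(["Parkwoods\n", "Regent Park / Harbourfront\n"], name="Neighborhood")

cleaned = (
    sample.str.strip()                            # drop the trailing newline
          .str.replace(" / ", ", ", regex=False)  # literal replacement, not a regex
)
print(cleaned.tolist())  # ['Parkwoods', 'Regent Park, Harbourfront']
```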


In [97]:
# Finally, let's reset the index of our dataframe
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html

neighborhoods = neighborhoods.reset_index(drop=True)
neighborhoods = neighborhoods.rename(columns={'Postal code': 'PostalCode'}) # Let's match the assignment example
neighborhoods.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |


In [109]:
# Tip 6
# Let's check if we have duplicated Postal codes as mentioned in the assignment statement

duplicateRowsDF = neighborhoods[neighborhoods.duplicated(['PostalCode'])]
if duplicateRowsDF.empty:
    print('No duplicates!')
else:
    print("Duplicate rows (except first occurrence) based on the PostalCode column are:")
    print(duplicateRowsDF)

No duplicates!
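
The scraped table has no duplicated postal codes, so nothing more is needed here. If it did, one possible way to handle them would be to merge the rows so that each postal code appears once. The sketch below shows that idea on made-up data; `sample_df`, `dupes` and `merged` are hypothetical names, and merging is an assumption about what the assignment expects.

```python
import pandas as pd

# Made-up data in which the postal code M5A appears twice.
sample_df = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# keep=False flags every row of a duplicated group, not just the repeats.
dupes = sample_df[sample_df.duplicated("PostalCode", keep=False)]
print(dupes)

# One row per postal code, with the neighborhoods joined by commas.
merged = (
    sample_df.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
             .agg(", ".join)
             .reset_index()
)
print(merged)
```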


In [106]:
neighborhoods.shape

(103, 3)

# <span style="color:crimson">Our dataframe consists of 103 rows and 3 columns</span>

<hr style="height:3px;background-color:black">

# Acknowledgments: list of tips

1. Extract a string from within an HTML tag: <br />
https://stackoverflow.com/questions/28212766/extract-string-from-tag-with-beautifulsoup

2. Delete character from a list of strings: <br />
https://stackoverflow.com/questions/8282553/removing-character-in-list-of-strings

3. Delete first and last characters of a list's item: <br />
https://stackoverflow.com/questions/11832984/removing-first-four-and-last-four-characters-of-strings-in-list-or-removing-spe

4. Convert a list to a dataframe: <br />
https://stackoverflow.com/questions/32224363/python-convert-comma-separated-list-to-pandas-dataframe

5. Replace characters in a column: <br />
https://stackoverflow.com/questions/28986489/how-to-replace-a-characters-in-a-column-of-a-pandas-dataframe

6. Identify duplicates: <br />
https://thispointer.com/pandas-find-duplicate-rows-in-a-dataframe-based-on-all-or-selected-columns-using-dataframe-duplicated-in-python/