# Segmenting and Clustering Neighborhoods in Toronto: Part 1

I will use two approaches to extract the postal code information from the Wikipedia page:

1. Using the BeautifulSoup library

2. Using the dedicated wikipedia library

### ===================================================================================
## 1. Complete the task using the BeautifulSoup library

Import libraries

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

Scrape the Wikipedia page

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
source = requests.get(url).text
Canada_pcode_data = BeautifulSoup(source, 'lxml')

Create the DataFrame

In [3]:
# Locate the table that holds the postal code data
post_code_table = Canada_pcode_data.find('table', {'class':'wikitable sortable'})

# Parse the table to extract relevant information
table_rows = post_code_table.find_all('tr')
data = []
for row in table_rows:
    td = []
    for t in row.find_all('td'):
        td.append(t.text.strip())
    data.append(td)

# Extract names for the DataFrame columns
column_names = (post_code_table.tr.text).split("\n")
column_names = ' '.join(column_names).split()

# Create new DataFrame
Toronto_df = pd.DataFrame(data, columns = column_names)
Toronto_df.head()

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | | | |
| 1 | M1A | Not assigned | Not assigned |
| 2 | M2A | Not assigned | Not assigned |
| 3 | M3A | North York | Parkwoods |
| 4 | M4A | North York | Victoria Village |
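
Row 0 above is empty because the header row of the table contains only `<th>` cells, so collecting `<td>` cells from it yields an empty list. The following is a minimal sketch of one way to avoid that row, assuming a small hypothetical HTML snippet (`SAMPLE_HTML`) in place of the live Wikipedia page and Python's built-in `html.parser` instead of `lxml`. The names `SAMPLE_HTML`, `soup`, and `sample_df` are illustrative and not part of the notebook.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table, so the sketch runs offline.
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", class_="wikitable")

# Header cells are <th> elements, so read the column names from them directly.
columns = [th.get_text(strip=True) for th in table.find_all("th")]

# Keep only rows that contain <td> cells; this skips the header row.
rows = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:
        rows.append(cells)

sample_df = pd.DataFrame(rows, columns=columns)
print(sample_df)
```

With the header row skipped at parse time, the later null check on the Borough column would no longer be needed for this purpose.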


Clean up the DataFrame

In [4]:
# Remove empty rows (the header row of the table produces a row with no values)
Toronto_df = Toronto_df[~Toronto_df['Borough'].isnull()]

# Remove rows whose Borough is Not assigned
Toronto_df = Toronto_df[Toronto_df.Borough != 'Not assigned']

# More than one neighborhood can exist in one postal code area.
# These rows will be combined into one row with the neighborhoods separated with a comma
Toronto_df = Toronto_df.groupby(['Postcode', 'Borough'])['Neighbourhood'].apply(lambda x: ', '.join(x)).reset_index()

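# If a row has a Borough but a Not assigned neighborhood, the neighborhood is set to the borough name.
# The result is assigned back to the column instead of relying on an in-place replace on a column selection.
Toronto_df['Neighbourhood'] = Toronto_df['Neighbourhood'].mask(Toronto_df['Neighbourhood'] == 'Not assigned', Toronto_df['Borough'])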

Toronto_df.head(10)

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |
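
The three cleanup rules (drop unassigned boroughs, merge neighborhoods that share a postal code, fall back to the borough name) can be checked on a handful of rows before running them on the full table. This is a minimal sketch on hypothetical data; the frame `raw` and the variable names are made up for illustration, and `.agg(", ".join)` is used here as an equivalent of the lambda in the cell above.

```python
import pandas as pd

# Hypothetical rows chosen to exercise each cleanup rule.
raw = pd.DataFrame(
    {
        "Postcode": ["M1A", "M1B", "M1B", "M7A"],
        "Borough": ["Not assigned", "Scarborough", "Scarborough", "Queen's Park"],
        "Neighbourhood": ["Not assigned", "Rouge", "Malvern", "Not assigned"],
    }
)

# 1. Drop rows whose borough is not assigned.
clean = raw[raw["Borough"] != "Not assigned"]

# 2. Merge rows that share a postal code into one comma-separated entry.
clean = (
    clean.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)

# 3. Where the neighbourhood is still unassigned, fall back to the borough.
unassigned = clean["Neighbourhood"] == "Not assigned"
clean["Neighbourhood"] = clean["Neighbourhood"].mask(unassigned, clean["Borough"])

print(clean)
```

In this toy input, M1A is dropped, the two M1B rows collapse into one, and the M7A row takes its borough name as its neighborhood.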


Finally, use the .shape attribute to show the number of rows and columns in the DataFrame.

In [5]:
Toronto_df.shape

(103, 3)

### ===================================================================================
## 2. Complete the task using the wikipedia library

Import libraries

In [6]:
import pandas as pd
import wikipedia as wp

Load the data from the Wikipedia page using the wikipedia library

In [7]:
Toronto_wiki = wp.page("List of postal codes of Canada: M").html().encode("UTF-8")

Create the DataFrame containing postal code information from the Wikipedia page

In [8]:
# Create a DataFrame
Toronto_df2 = pd.read_html(Toronto_wiki, header = 0)[0]
Toronto_df2.head()

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
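
Taking element `[0]` of the `pd.read_html` result relies on the postal code table being the first table on the page. A slightly more defensive sketch, assuming a hypothetical offline HTML snippet and that `lxml` is installed (the first approach already uses it), selects the table by its content with the `match` argument. `SAMPLE_PAGE` and `sample_df2` are illustrative names only.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; only the second holds postal codes.
SAMPLE_PAGE = """
<table>
  <tr><th>Province</th><th>Prefix</th></tr>
  <tr><td>Ontario</td><td>M</td></tr>
</table>
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

# match keeps only tables whose text matches the given pattern;
# wrapping the markup in StringIO passes it as a file-like object.
tables = pd.read_html(StringIO(SAMPLE_PAGE), match="Postcode", header=0)
sample_df2 = tables[0]

print(len(tables))
print(sample_df2)
```

Selecting by content keeps the cell working if another table is later added above the postal code table on the page.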


Clean up the DataFrame

In [9]:
# Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
Toronto_df2 = Toronto_df2[Toronto_df2.Borough != 'Not assigned']

# More than one neighborhood can exist in one postal code area.
# These rows will be combined into one row with the neighborhoods separated with a comma
Toronto_df2 = Toronto_df2.groupby(['Postcode','Borough'])['Neighbourhood'].apply(lambda x: ', '.join(x)).reset_index()

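# If a row has a Borough but a Not assigned neighborhood, the neighborhood is set to the borough name.
# The result is assigned back to the column instead of relying on an in-place replace on a column selection.
Toronto_df2['Neighbourhood'] = Toronto_df2['Neighbourhood'].mask(Toronto_df2['Neighbourhood'] == 'Not assigned', Toronto_df2['Borough'])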

Toronto_df2.head(10)

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |


As before, use the .shape attribute to show the number of rows and columns in the DataFrame.

In [10]:
Toronto_df2.shape

(103, 3)
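
Both approaches report the same shape, but matching shapes do not guarantee matching contents. A minimal sketch of a row-by-row comparison follows, using two hypothetical stand-in frames (`df_a`, `df_b`) and a hypothetical helper `normalise`; in the notebook, `Toronto_df` and `Toronto_df2` would take their place.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical stand-ins for the results of the two approaches.
df_a = pd.DataFrame(
    {
        "Postcode": ["M1B", "M1C"],
        "Borough": ["Scarborough", "Scarborough"],
        "Neighbourhood": ["Rouge, Malvern", "Highland Creek, Rouge Hill, Port Union"],
    }
)
# Same rows in a different order, to show why normalising comes first.
df_b = df_a.iloc[::-1].reset_index(drop=True)


def normalise(df):
    """Sort by postal code and rebuild the index so row order does not matter."""
    return df.sort_values("Postcode").reset_index(drop=True)


# Raises AssertionError with a readable diff if the frames differ.
assert_frame_equal(normalise(df_a), normalise(df_b))

frames_match = normalise(df_a).equals(normalise(df_b))
print(frames_match)
```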