<h1 align="center"><b>Segmenting and Clustering Neighborhoods in Toronto</b></h1>
<h1 align="center"><b>PART 1</b></h1>

 #### 1. Install and import the necessary packages

Install the necessary packages [Beautiful Soup, lxml]

In [1]:
# Install Beautiful Soup 4
!conda install -c conda-forge beautifulsoup4 --yes

# Install lxml parser
!conda install -c conda-forge lxml --yes

Collecting package metadata: ...working... done
Solving environment: ...working... 
  - anaconda::ca-certificates-2019.1.23-0, anaconda::openssl-1.1.1b-he774522_1
  - anaconda::openssl-1.1.1b-he774522_1, defaults::ca-certificates-2019.1.23-0
  - anaconda::ca-certificates-2019.1.23-0, defaults::openssl-1.1.1b-he774522_1
  - defaults::ca-certificates-2019.1.23-0, defaults::openssl-1.1.1b-he774522_1
done

# All requested packages already installed.

Collecting package metadata: ...working... done
Solving environment: ...working... 
  - anaconda::ca-certificates-2019.1.23-0, anaconda::openssl-1.1.1b-he774522_1
  - anaconda::openssl-1.1.1b-he774522_1, defaults::ca-certificates-2019.1.23-0
  - anaconda::ca-certificates-2019.1.23-0, defaults::openssl-1.1.1b-he774522_1
  - defaults::ca-certificates-2019.1.23-0, defaults::openssl-1.1.1b-he774522_1
done

# All requested packages already installed.



Import the necessary packages

In [2]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

 #### 2. Scrape the Toronto post codes Wikipedia page with Beautiful Soup

Fetch the Toronto postal codes Wikipedia page and parse it with Beautiful Soup

In [3]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')

# Get the table with the postal codes [class = 'wikitable sortable']
table = soup.find('table', class_='wikitable sortable')
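
The table lookup above can be tried offline on a small inline snippet. The following is a minimal sketch, not the notebook's own cell: the HTML string is made up for illustration, and it uses Python's built-in `html.parser` instead of `lxml` so that it runs without the extra dependency. Passing the full string `'wikitable sortable'` to `class_` matches the exact value of the `class` attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the downloaded Wikipedia page
html = """
<table class="infobox"><tr><td>ignored</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

# Select the table whose class attribute is exactly 'wikitable sortable'
table = soup.find('table', class_='wikitable sortable')

# Collect the text of every header and data cell, row by row
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
    for tr in table.find_all('tr')
]
print(rows)
```
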

Convert the postal codes table to a pandas Data Frame

In [4]:
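from io import StringIO  # recent pandas versions expect a file-like object rather than a raw HTML string
postal_codes_initial_df = pd.read_html(StringIO(str(table)))[0]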

Give the columns the required names

In [5]:
postal_codes_initial_df.rename(columns={'Postcode': 'PostalCode', 'Neighbourhood': 'Neighborhood'}, inplace=True)
print(postal_codes_initial_df.head(10), '\n')
print('Initial Toronto postal codes dataframe shape: ', postal_codes_initial_df.shape, '\n')

  PostalCode           Borough      Neighborhood
0        M1A      Not assigned      Not assigned
1        M2A      Not assigned      Not assigned
2        M3A        North York         Parkwoods
3        M4A        North York  Victoria Village
4        M5A  Downtown Toronto      Harbourfront
5        M5A  Downtown Toronto       Regent Park
6        M6A        North York  Lawrence Heights
7        M6A        North York    Lawrence Manor
8        M7A      Queen's Park      Not assigned
9        M8A      Not assigned      Not assigned 

Initial Toronto postal codes dataframe shape:  (288, 3) 



 #### 3. Process the initial Toronto post codes Data Frame

Drop the rows whose 'Borough' value is 'Not assigned'

In [6]:
postal_codes_initial_df.drop(postal_codes_initial_df[postal_codes_initial_df['Borough'] == 'Not assigned'].index, axis=0, inplace=True)

Replace the 'Not assigned' value in 'Neighborhood' with the corresponding value of 'Borough'

In [7]:
# Boolean mask of the rows whose neighborhood is unassigned
unassigned_neighborhood = postal_codes_initial_df['Neighborhood'] == 'Not assigned'
postal_codes_initial_df.loc[unassigned_neighborhood, 'Neighborhood'] = postal_codes_initial_df.loc[unassigned_neighborhood, 'Borough']
postal_codes_initial_df.head(10) # Notice the Queen's Park Neighborhood

|    | PostalCode | Borough | Neighborhood |
|----|------------|---------|--------------|
| 2  | M3A | North York | Parkwoods |
| 3  | M4A | North York | Victoria Village |
| 4  | M5A | Downtown Toronto | Harbourfront |
| 5  | M5A | Downtown Toronto | Regent Park |
| 6  | M6A | North York | Lawrence Heights |
| 7  | M6A | North York | Lawrence Manor |
| 8  | M7A | Queen's Park | Queen's Park |
| 10 | M9A | Etobicoke | Islington Avenue |
| 11 | M1B | Scarborough | Rouge |
| 12 | M1B | Scarborough | Malvern |
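
The two cleaning steps in this section can be reproduced on a handful of rows. This is a minimal sketch with made-up sample data (`sample_df` and `cleaned_df` are illustrative names, not part of the notebook); it filters with a boolean mask instead of calling `drop`, which is one of several equivalent ways to express the same step.

```python
import pandas as pd

# Hypothetical sample rows mirroring the layout of the scraped table
sample_df = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Step 1: keep only the rows with an assigned borough
# (.copy() gives an independent frame that is safe to modify)
cleaned_df = sample_df[sample_df['Borough'] != 'Not assigned'].copy()

# Step 2: where the neighborhood is unassigned, fall back to the borough name
unassigned = cleaned_df['Neighborhood'] == 'Not assigned'
cleaned_df.loc[unassigned, 'Neighborhood'] = cleaned_df.loc[unassigned, 'Borough']

print(cleaned_df)
```
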


Create a new Data Frame to combine the neighborhoods of the same postal code

In [8]:
postal_codes_grouped_df = pd.DataFrame(postal_codes_initial_df.groupby('PostalCode')['Neighborhood'].apply(lambda tags: ', '.join(tags))).reset_index()

Check that the dataframe was created successfully

In [9]:
postal_codes_grouped_df.head(10)

|   | PostalCode | Neighborhood |
|---|------------|--------------|
| 0 | M1B | Rouge, Malvern |
| 1 | M1C | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Guildwood, Morningside, West Hill |
| 3 | M1G | Woburn |
| 4 | M1H | Cedarbrae |
| 5 | M1J | Scarborough Village |
| 6 | M1K | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Birch Cliff, Cliffside West |


Now merge the initial and the grouped neighborhoods data frames into the final data frame

In [10]:
toronto_postal_codes_df = pd.merge(postal_codes_initial_df, postal_codes_grouped_df, how='inner', on='PostalCode')

# Drop the single neighborhood column
toronto_postal_codes_df.drop('Neighborhood_x', axis=1, inplace=True)

# Now drop the created duplicate rows
toronto_postal_codes_df.drop_duplicates(inplace=True)

# Rename the created Neighborhood_y to Neighborhood
toronto_postal_codes_df.rename(columns={'Neighborhood_y': 'Neighborhood'}, inplace=True)

# Reset the data frame index
toronto_postal_codes_df.reset_index(drop=True, inplace=True)
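
As an alternative to grouping, merging, and de-duplicating, the same result can probably be reached in a single `groupby` over both key columns. The sketch below uses made-up sample rows and assumes that every postal code belongs to exactly one borough; if that assumption does not hold for the scraped data, a postal code would appear once per borough. The names `sample_df` and `combined_df` are illustrative only.

```python
import pandas as pd

# Hypothetical sample rows: one row per (postal code, neighborhood) pair
sample_df = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M7A', 'M1B', 'M1B'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', "Queen's Park",
                'Scarborough', 'Scarborough'],
    'Neighborhood': ['Harbourfront', 'Regent Park', "Queen's Park",
                     'Rouge', 'Malvern'],
})

# Group on both keys so 'Borough' survives the aggregation;
# sort=False keeps the groups in order of first appearance
combined_df = (
    sample_df
    .groupby(['PostalCode', 'Borough'], sort=False)['Neighborhood']
    .agg(', '.join)
    .reset_index()
)

print(combined_df)
```

Grouping on both columns avoids the intermediate `Neighborhood_x` / `Neighborhood_y` columns and the duplicate rows that the merge creates.
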

Check an example from the assignment instructions in which one postal code has many neighborhoods

In [11]:
# Set display.max_colwidth to None so that long column values are shown in full
# (None replaces the deprecated value -1 in pandas 1.0 and later)
pd.set_option('display.max_colwidth', None)
toronto_postal_codes_df[toronto_postal_codes_df['PostalCode'] == 'M5V']

|    | PostalCode | Borough | Neighborhood |
|----|------------|---------|--------------|
| 87 | M5V | Downtown Toronto | CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara |


Examine the final dataframe

In [12]:
toronto_postal_codes_df.head(10)

|   | PostalCode | Borough | Neighborhood |
|---|------------|---------|--------------|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | M7A | Queen's Park | Queen's Park |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Rouge, Malvern |
| 7 | M3B | North York | Don Mills North |
| 8 | M4B | East York | Woodbine Gardens, Parkview Hill |
| 9 | M5B | Downtown Toronto | Ryerson, Garden District |


Finally print the shape of the final data frame

In [13]:
print('The final dataframe has {} rows.'.format(toronto_postal_codes_df.shape[0]))

The final dataframe has 103 rows.
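
Beyond the row count, a few quick checks can confirm that the cleaning rules held. The helper below is a hypothetical sketch (`check_postal_codes` is not part of the notebook) and is shown here on made-up sample rows; in the notebook it could be called on `toronto_postal_codes_df`.

```python
import pandas as pd

def check_postal_codes(df):
    """Return a list of problems found in a cleaned postal-code frame."""
    problems = []
    # Each postal code should appear on exactly one row
    if df['PostalCode'].duplicated().any():
        problems.append('duplicate postal codes')
    # No unassigned boroughs or neighborhoods should remain
    if (df['Borough'] == 'Not assigned').any():
        problems.append('unassigned boroughs')
    if (df['Neighborhood'] == 'Not assigned').any():
        problems.append('unassigned neighborhoods')
    return problems

# Hypothetical sample data: one clean frame and one with a duplicated code
clean_df = pd.DataFrame({
    'PostalCode': ['M3A', 'M5A'],
    'Borough': ['North York', 'Downtown Toronto'],
    'Neighborhood': ['Parkwoods', 'Harbourfront, Regent Park'],
})
dirty_df = pd.concat([clean_df, clean_df.iloc[[0]]], ignore_index=True)

clean_problems = check_postal_codes(clean_df)
dirty_problems = check_postal_codes(dirty_df)
print(clean_problems)
print(dirty_problems)
```
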
