# Segmenting and Clustering Neighborhoods in Toronto (Part 1)

The goal of this project is to explore, segment, and cluster the neighborhoods in the city of Toronto. 
The neighborhood data comes from a <a href='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'>Wikipedia</a> page that contains all the information we need to explore and cluster the neighborhoods in Toronto.

**In this first part**, we scrape the Wikipedia page to obtain the data in its table of postal codes and transform that data into a pandas DataFrame.

### Import libraries

In [1]:
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd 

### Grab the Data

**Make a request to the server to fetch the Wikipedia page**

In [2]:
res = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
res

<Response [200]>
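
The `<Response [200]>` above shows that the request succeeded. A slightly more defensive variant is sketched below. The helper name `fetch_html`, the 10-second timeout, and the injectable `get` parameter are assumptions made for illustration and are not part of the original notebook.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Return the page body as text, raising on HTTP error statuses.

    `get` is injectable (hypothetical design choice) so the helper can be
    exercised without network access.
    """
    res = get(url, timeout=timeout)
    res.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return res.text
```

With network access, `fetch_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')` would return the same text as `res.text` in the cell above.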

**Create a BeautifulSoup object** 
It parses the HTML string (`res.text`) into a tree of elements that we can navigate and manipulate with Python.

In [3]:
soup_data = BeautifulSoup(res.text, 'html.parser')

**Extract all rows of the table using a CSS selector**   
The list of postal codes is in a table with `class='wikitable'`.

In [4]:
all_rows = soup_data.select('.wikitable tr')
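
To see what the selector `.wikitable tr` matches, here is a minimal sketch on a tiny inline HTML snippet. The markup in `sample_html` is a hypothetical stand-in, not the real Wikipedia page.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the page (hypothetical markup, for illustration only).
sample_html = """
<table class="wikitable">
  <tr><th>Postal code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
<table class="other"><tr><td>ignored</td></tr></table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# '.wikitable tr' matches every <tr> that is a descendant of an element
# whose class is "wikitable"; rows of the second table are not matched.
sample_rows = sample_soup.select('.wikitable tr')
print(len(sample_rows))
```

If a page contained more than one table with that class, the same selector would return the rows of all of them, so it is worth checking the first row before treating it as the header.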

**Let's separate the header (first row) from the content (all remaining rows)**

In [5]:
tab_header = all_rows[0] #table header
tab_content = all_rows[1:] #table content

**Extract the header's column names**

In [6]:
columns = []

for column in tab_header.select('th'): 
    col_name = column.getText()
    columns.append(col_name.replace('\n', ''))  
    
columns

['Postal code', 'Borough', 'Neighborhood']

**Extract the content (remaining rows) of the table**

In [7]:
data = []

for table_row in tab_content:
    row = []
    for table_data in table_row.select('td'): 
        row.append(table_data.getText().replace('\n', ''))
    data.append(row)

data


[['M1A', 'Not assigned', ''],
 ['M2A', 'Not assigned', ''],
 ['M3A', 'North York', 'Parkwoods'],
 ['M4A', 'North York', 'Victoria Village'],
 ['M5A', 'Downtown Toronto', 'Regent Park / Harbourfront'],
 ['M6A', 'North York', 'Lawrence Manor / Lawrence Heights'],
 ['M7A', 'Downtown Toronto', "Queen's Park / Ontario Provincial Government"],
 ['M8A', 'Not assigned', ''],
 ['M9A', 'Etobicoke', 'Islington Avenue'],
 ['M1B', 'Scarborough', 'Malvern / Rouge'],
 ['M2B', 'Not assigned', ''],
 ['M3B', 'North York', 'Don Mills'],
 ['M4B', 'East York', 'Parkview Hill / Woodbine Gardens'],
 ['M5B', 'Downtown Toronto', 'Garden District, Ryerson'],
 ['M6B', 'North York', 'Glencairn'],
 ['M7B', 'Not assigned', ''],
 ['M8B', 'Not assigned', ''],
 ['M9B',
  'Etobicoke',
  'West Deane Park / Princess Gardens / Martin Grove / Islington / Cloverdale'],
 ['M1C', 'Scarborough', 'Rouge Hill / Port Union / Highland Creek'],
 ['M2C', 'Not assigned', ''],
 ['M3C', 'North York', 'Don Mills'],
 ['M4C', 'East York',
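
The header loop and the row loop do the same thing for different cell tags, so they can be folded into one helper. The sketch below assumes a hypothetical helper name, `cell_texts`, and runs on inline sample markup rather than the live page. It uses `.strip()` instead of `.replace('\n', '')`, which is a small variation on the cells above: it also trims surrounding spaces.

```python
from bs4 import BeautifulSoup


def cell_texts(row, tag):
    """Return the text of every <tag> cell in a row, without surrounding whitespace."""
    return [cell.get_text().strip() for cell in row.select(tag)]


# Hypothetical sample markup; cells end with a newline, as in the scraped output above.
sample_html = """
<table class="wikitable">
<tr><th>Postal code
</th><th>Borough
</th><th>Neighborhood
</th></tr>
<tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""

rows = BeautifulSoup(sample_html, 'html.parser').select('.wikitable tr')
header = cell_texts(rows[0], 'th')               # column names from the first row
body = [cell_texts(r, 'td') for r in rows[1:]]   # one list of cell texts per data row
print(header)
print(body)
```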

**Let's create a DataFrame for our raw data**

In [8]:
toronto_raw = pd.DataFrame(data=data, columns=columns)
toronto_raw.head()

| | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | |
| 1 | M2A | Not assigned | |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront |


### Let's clean our Data

In [9]:
# Remove all the rows where Borough = 'Not assigned'
toronto_raw = toronto_raw[toronto_raw.Borough != 'Not assigned']
toronto_raw

| | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park / Harbourfront |
| 5 | M6A | North York | Lawrence Manor / Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park / Ontario Provincial Government |
| ... | ... | ... | ... |
| 160 | M8X | Etobicoke | The Kingsway / Montgomery Road / Old Mill North |
| 165 | M4Y | Downtown Toronto | Church and Wellesley |
| 168 | M7Y | East Toronto | Business reply mail Processing CentrE |
| 169 | M8Y | Etobicoke | Old Mill South / King's Mill Park / Sunnylea /... |


In [10]:
# Check that Neighborhood doesn't contain any empty strings
(toronto_raw.Neighborhood=='').sum() 

0

In [11]:
# Check that Neighborhood doesn't contain any missing values (NaN or None)
toronto_raw['Neighborhood'].isnull().sum()

0
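
The two checks above can miss a cell that holds only whitespace. A minimal sketch of a stricter check follows, on a small made-up frame. The `sample` data and the variable names are assumptions for illustration.

```python
import pandas as pd

# Made-up data: one valid name, one whitespace-only string, one missing value.
sample = pd.DataFrame({
    'Borough': ['North York', 'Etobicoke', 'Scarborough'],
    'Neighborhood': ['Parkwoods', '   ', None],
})

# Missing values only (NaN or None).
is_missing = sample['Neighborhood'].isna()

# Missing values plus empty or whitespace-only strings.
is_blank = sample['Neighborhood'].fillna('').str.strip().eq('')

print(int(is_missing.sum()))
print(int(is_blank.sum()))
```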

In [12]:
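# Reset the index first so we work on a new DataFrame rather than a filtered slice
# (assigning to a column of a slice can trigger pandas' SettingWithCopyWarning)
toronto_cleaned = toronto_raw.reset_index(drop=True)
# Replace the separator ' / ' with a comma in the Neighborhood column
toronto_cleaned['Neighborhood'] = toronto_cleaned['Neighborhood'].apply(lambda n: n.replace(' / ', ', '))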
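
As a possible alternative to `apply` with a lambda, pandas offers the vectorized `Series.str.replace`. The sketch below uses a made-up `sample` Series. Passing `regex=False` treats `' / '` as a literal string, and missing values pass through unchanged instead of raising an error.

```python
import pandas as pd

# Made-up values, including a missing entry.
sample = pd.Series(['Regent Park / Harbourfront', 'Parkwoods', None])

# Literal (non-regex) replacement applied element-wise; None stays missing.
cleaned = sample.str.replace(' / ', ', ', regex=False)
print(cleaned.tolist())
```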

In [13]:
toronto_cleaned.head(12)

| | Postal code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Malvern, Rouge |
| 7 | M3B | North York | Don Mills |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson |


In [14]:
toronto_cleaned.shape

(103, 3)

In [15]:
toronto_cleaned.to_csv('Toronto_postal_code.csv')
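
By default, `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read again. The sketch below shows the difference on a made-up two-row frame, using in-memory buffers instead of files. Whether to pass `index=False` depends on how the file is read later.

```python
import io

import pandas as pd

sample = pd.DataFrame({
    'Postal code': ['M3A', 'M4A'],
    'Borough': ['North York', 'North York'],
})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written.
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```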