# Segmenting and Clustering Neighbourhoods in Toronto - Assignment

<h4>Introduction:</h4>
In this notebook, we fetch a table of postal codes from a website and transform the data into a pandas dataframe.
For the Toronto neighborhood data, a Wikipedia page exists that has all the information we need to explore and cluster the neighborhoods in Toronto. The task is to scrape the Wikipedia page, wrangle and clean the data, and then read it into a pandas dataframe so that it is in a structured format like the New York dataset.

Below, we get the data from the website, scrape it, and clean it. Let's import all the dependencies.

In [1]:
import pandas as pd
import requests
import lxml.html as lh

from geopy.geocoders import Nominatim
from geopy.exc import GeocoderTimedOut

import matplotlib.cm as cm
import matplotlib.colors as colors

import folium

Send the GET request and examine the results.

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

# Keep TLS certificate verification enabled (the requests default)
htmlrequest = requests.get(url)

website_content = lh.fromstring(htmlrequest.content)
trtagElements = website_content.xpath('//tr')

htmlrequest.status_code




200

Checking the length of the first 12 rows.
Note: every row must have the same length; if not, we have picked up something other than the postal code table.

In [3]:
[len(element) for element in trtagElements[:12]]

[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
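The same check can be run over every row rather than only the first twelve. Below is a minimal sketch, assuming the rows have already been extracted into plain Python lists; the `rows` data and the `expected_cells` name are made up for illustration and are not scraped from the page.

```python
from collections import Counter

# Hypothetical stand-in for the parsed <tr> elements: each inner list is one row's cells.
rows = [
    ["Postal Code", "Borough", "Neighborhood"],
    ["M1A", "Not assigned", ""],
    ["M3A", "North York", "Parkwoods"],
    ["Navigation link"],  # a stray row from another table on the page
]

# Count how many rows have each length.
length_counts = Counter(len(row) for row in rows)

# Indices of rows whose length differs from the expected 3 cells.
expected_cells = 3
odd_rows = [i for i, row in enumerate(rows) if len(row) != expected_cells]

print(length_counts)
print(odd_rows)
```

A count with more than one distinct length signals that some rows belong to a different table.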

Parsing the first row as header

In [4]:
Headers=[]
counter=0
# For each cell in the first row, store the header name together with an empty list
for element in trtagElements[0]:
    counter+=1
    name=element.text_content()
    print('%d:%s'%(counter,name))
    Headers.append((name,[]))

1:Postal Code

2:Borough

3:Neighborhood
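The blank lines in the printed output appear because each header string ends with a newline character. As a small sketch of stripping that whitespace at parse time, the snippet below uses a hard-coded list (`raw_headers`, an illustrative name) in place of the `text_content()` values.

```python
# Hypothetical raw header strings, shaped like the values text_content() returned.
raw_headers = ["Postal Code\n", "Borough\n", "Neighborhood\n"]

# str.strip() removes leading and trailing whitespace, including "\n".
clean_headers = [name.strip() for name in raw_headers]

for position, name in enumerate(clean_headers, start=1):
    print('%d:%s' % (position, name))
```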



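Filling the columns for the pandas dataframe. Each cell of a row is appended to the list paired with the matching header, and parsing stops at the first row that does not have exactly three cells.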

In [5]:
for element in range(len(trtagElements)):
    row = trtagElements[element]
    # Stop at the first row that does not have exactly 3 cells
    if len(row) != 3:
        break

    i = 0
    for cell in row.iterchildren():
        data = cell.text_content()
        if i > 0:
            # Convert any numerical value to an integer
            try:
                data = int(data)
            except ValueError:
                pass
        Headers[i][1].append(data)
        i += 1
    
#type(Headers)


Headers

[('Postal Code\n',
  ['Postal Code\n',
   'M1A\n',
   'M2A\n',
   'M3A\n',
   'M4A\n',
   'M5A\n',
   'M6A\n',
   'M7A\n',
   'M8A\n',
   'M9A\n',
   'M1B\n',
   'M2B\n',
   'M3B\n',
   'M4B\n',
   'M5B\n',
   'M6B\n',
   'M7B\n',
   'M8B\n',
   'M9B\n',
   'M1C\n',
   'M2C\n',
   'M3C\n',
   'M4C\n',
   'M5C\n',
   'M6C\n',
   'M7C\n',
   'M8C\n',
   'M9C\n',
   'M1E\n',
   'M2E\n',
   'M3E\n',
   'M4E\n',
   'M5E\n',
   'M6E\n',
   'M7E\n',
   'M8E\n',
   'M9E\n',
   'M1G\n',
   'M2G\n',
   'M3G\n',
   'M4G\n',
   'M5G\n',
   'M6G\n',
   'M7G\n',
   'M8G\n',
   'M9G\n',
   'M1H\n',
   'M2H\n',
   'M3H\n',
   'M4H\n',
   'M5H\n',
   'M6H\n',
   'M7H\n',
   'M8H\n',
   'M9H\n',
   'M1J\n',
   'M2J\n',
   'M3J\n',
   'M4J\n',
   'M5J\n',
   'M6J\n',
   'M7J\n',
   'M8J\n',
   'M9J\n',
   'M1K\n',
   'M2K\n',
   'M3K\n',
   'M4K\n',
   'M5K\n',
   'M6K\n',
   'M7K\n',
   'M8K\n',
   'M9K\n',
   'M1L\n',
   'M2L\n',
   'M3L\n',
   'M4L\n',
   'M5L\n',
   'M6L\n',
   'M7L\n',
   'M8L\n',
 

Checking the length of each column

In [6]:
[len(C) for (title,C) in Headers]

[182, 182, 182]
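Equal column lengths matter because `pd.DataFrame` raises an error when it is given a dictionary of lists with different lengths. To illustrate how row-wise data becomes column-wise lists, here is a sketch that uses `zip` on a few made-up rows. It is an alternative to the index-based loop used in this notebook, not the notebook's own method.

```python
# Hypothetical rows, each with the same three cells as the postal code table.
rows = [
    ["Postal Code", "Borough", "Neighborhood"],
    ["M3A", "North York", "Parkwoods"],
    ["M4A", "North York", "Victoria Village"],
]

header, *body = rows

# zip(*body) regroups the cells by position, producing one tuple per column.
columns = {title: list(values) for title, values in zip(header, zip(*body))}

# Every column should hold exactly one value per body row.
column_lengths = [len(values) for values in columns.values()]
print(column_lengths)
```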

Converting the dictionary into a pandas dataframe

In [7]:
Dict={title:column for (title,column) in Headers}


df=pd.DataFrame(Dict, columns = Dict.keys())

df = df.replace('\n','', regex=True)
df.columns=df.columns.str.replace('\n','')
df = df.drop(df.index[0])

df

|     | Postal Code | Borough | Neighborhood |
| --- | --- | --- | --- |
| 1 | M1A | Not assigned | |
| 2 | M2A | Not assigned | |
| 3 | M3A | North York | Parkwoods |
| 4 | M4A | North York | Victoria Village |
| 5 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| ... | ... | ... | ... |
| 177 | M6Z | Not assigned | |
| 178 | M7Z | Not assigned | |
| 179 | M8Z | Etobicoke | Mimico NW, The Queensway West, South of Bloor,... |
| 180 | M9Z | Not assigned | |
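The cleanup step can also be written with string methods instead of a regex replace. The following is a minimal sketch on a small hand-built frame; the values mimic the raw scraped columns (trailing newlines and a repeated header row), and the names `raw` and `clean` are illustrative.

```python
import pandas as pd

# Small hand-built frame that mimics the raw scraped columns, header row included.
raw = pd.DataFrame({
    "Postal Code\n": ["Postal Code\n", "M3A\n", "M4A\n"],
    "Borough\n": ["Borough\n", "North York\n", "North York\n"],
    "Neighborhood\n": ["Neighborhood\n", "Parkwoods\n", "Victoria Village\n"],
})

# Strip whitespace from the column names and from every string cell.
clean = raw.rename(columns=str.strip)
clean = clean.apply(lambda column: column.str.strip())

# Drop the first row, which repeats the header.
clean = clean.drop(clean.index[0])

print(clean)
```

Note that `str.strip` only removes whitespace at the ends of each string, whereas the regex replace removes every newline, so the two approaches differ if a cell contains a newline in the middle.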


Filtering out the rows where Borough is not assigned

In [8]:
df.drop(df[df.Borough == "Not assigned"].index, inplace=True)
df.drop(df[df.Borough == "Canadian postal codes"].index, inplace=True)
# Renumber the index from 0 after dropping rows
df = df.reset_index(drop=True)

df

|     | Postal Code | Borough | Neighborhood |
| --- | --- | --- | --- |
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North |
| 99 | M4Y | Downtown Toronto | Church and Wellesley |
| 100 | M7Y | East Toronto | Business reply mail Processing Centre |
| 101 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... |
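The same filtering can be expressed with a boolean mask, which avoids `inplace` operations. This is a sketch on a few rows taken from the tables in this notebook; the names `sample`, `unwanted`, and `filtered` are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    "Postal Code": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighborhood": ["", "", "Parkwoods", "Victoria Village"],
})

# Keep only the rows whose Borough is not one of the unwanted labels.
unwanted = ["Not assigned", "Canadian postal codes"]
filtered = sample[~sample["Borough"].isin(unwanted)]

# Renumber the index from 0 and discard the old index values.
filtered = filtered.reset_index(drop=True)

print(filtered)
```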


Getting the shape of the dataframe

In [9]:
df.shape

(103, 3)