# Toronto Neighborhood Segmentation Data

### By: Gyan Prakash

*This notebook fetches the postal code table from a Wikipedia page and converts it into a pandas DataFrame. The data is then wrangled to clean it.*

### Let's import the libraries 

In [93]:
import numpy as np
import requests
import pandas as pd


### Now we will fetch the tables from the given page into a list of dataframe objects

In [94]:
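url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
r = requests.get(url)  # note: this response is not used below
page = pd.read_html(url)  # read_html downloads the URL itself and parses every table on the page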

*We now have the tables from the Wikipedia page, stored as a list.*
### Let's check the datatype of 'page':

In [95]:
type(page)

list

*Since we need to work only with the first table,* 
### Let's take out the first table from page: 

In [96]:
df=page[0]

In [97]:
type(df)

pandas.core.frame.DataFrame

*Let's have a look at our dataframe:*

In [98]:
df.head()

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | Postcode | Borough | Neighbourhood |
| 1 | M1A | Not assigned | Not assigned |
| 2 | M2A | Not assigned | Not assigned |
| 3 | M3A | North York | Parkwoods |
| 4 | M4A | North York | Victoria Village |


### Insert the column names:

In [99]:
df.columns = ['Post Code', 'Borough', 'Neighborhood']
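
*The `df.head()` output above shows that the page's own header row (`Postcode`, `Borough`, `Neighbourhood`) was read in as data at index 0, so renaming the columns leaves that row in the frame. Below is a minimal sketch of one way to drop it, run on a small hand-made frame rather than the scraped table. The names `raw` and `cleaned` are illustrative and not part of this notebook.*

```python
import pandas as pd

# Toy stand-in for the scraped table: the first row is really the header.
raw = pd.DataFrame([
    ["Postcode", "Borough", "Neighbourhood"],
    ["M1A", "Not assigned", "Not assigned"],
    ["M3A", "North York", "Parkwoods"],
])
raw.columns = ["Post Code", "Borough", "Neighborhood"]

# Drop the leftover header row (position 0) and renumber the index from 0.
cleaned = raw.iloc[1:].reset_index(drop=True)
print(cleaned)
```

*Passing `header=0` to `pd.read_html` is another option that should have a similar effect at read time, since it tells pandas to treat the first row as column labels.*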

*Missing values in our dataframe are displayed as 'Not assigned'. Let's replace them with NumPy NaN, which makes them easier to process.*

In [100]:
df.replace("Not assigned", np.nan, inplace=True)

### Dropping rows where Borough is NaN

In [101]:
df.dropna(subset=["Borough"], axis=0, inplace=True)

# reset the index, because dropping rows leaves gaps in it
df.reset_index(drop=True, inplace=True)
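
*To make the effect of these three steps easier to see, here is a small sketch on a hand-made frame. The `toy` frame and its values are made up for illustration. It is not the scraped data.*

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Post Code": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Turn the placeholder string into a real missing value.
toy = toy.replace("Not assigned", np.nan)

# Keep only rows that have a Borough, then renumber the index 0..n-1.
toy = toy.dropna(subset=["Borough"]).reset_index(drop=True)
print(toy)
```

*The `M1A` row is dropped because its Borough is missing, while the `M7A` row survives with a missing Neighborhood, which is handled in the next step.*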

In [103]:
df.head(10)

|   | Post Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | Postcode | Borough | Neighbourhood |
| 1 | M3A | North York | Parkwoods |
| 2 | M4A | North York | Victoria Village |
| 3 | M5A | Downtown Toronto | Harbourfront |
| 4 | M5A | Downtown Toronto | Regent Park |
| 5 | M6A | North York | Lawrence Heights |
| 6 | M6A | North York | Lawrence Manor |
| 7 | M7A | Queen's Park | NaN |
| 8 | M9A | Etobicoke | Islington Avenue |
| 9 | M1B | Scarborough | Rouge |


### Now, let's replace each missing Neighborhood value with the Borough from the same row, as instructed in the assignment question

In [104]:
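# fill each missing Neighborhood with the Borough from the same row;
# assigning the result back avoids an in-place edit on a column selection
df['Neighborhood'] = df['Neighborhood'].fillna(df['Borough'])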
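
*To see how a row-wise fill like this behaves, here is a small sketch on a hand-made frame that uses `Series.fillna` with another column as the fill value. The `toy` frame is illustrative only.*

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Borough": ["North York", "Queen's Park"],
    "Neighborhood": ["Parkwoods", np.nan],
})

# fillna aligns on the index, so each missing Neighborhood
# takes the Borough value from the same row.
toy["Neighborhood"] = toy["Neighborhood"].fillna(toy["Borough"])
print(toy)
```

*Values that are already present, such as `Parkwoods`, are left untouched. Only the missing entry is filled.*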

In [105]:
df.head(10)

|   | Post Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | Postcode | Borough | Neighbourhood |
| 1 | M3A | North York | Parkwoods |
| 2 | M4A | North York | Victoria Village |
| 3 | M5A | Downtown Toronto | Harbourfront |
| 4 | M5A | Downtown Toronto | Regent Park |
| 5 | M6A | North York | Lawrence Heights |
| 6 | M6A | North York | Lawrence Manor |
| 7 | M7A | Queen's Park | Queen's Park |
| 8 | M9A | Etobicoke | Islington Avenue |
| 9 | M1B | Scarborough | Rouge |


## Hurrah!
### We are done with the cleaning phase.
### Finally, let's check the number of rows and columns in the dataframe:

In [106]:
df.shape

(212, 3)
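
*Alongside the shape, a quick check that no missing values remain can be useful after cleaning. The sketch below runs on a hand-made frame, and the `cleaned` and `missing` names are illustrative rather than taken from this notebook.*

```python
import pandas as pd

# Toy stand-in for a cleaned frame.
cleaned = pd.DataFrame({
    "Post Code": ["M3A", "M7A"],
    "Borough": ["North York", "Queen's Park"],
    "Neighborhood": ["Parkwoods", "Queen's Park"],
})

# Count missing values per column; every count should be zero after cleaning.
missing = cleaned.isna().sum()
print(missing)

# shape is a (rows, columns) tuple.
rows, cols = cleaned.shape
print(rows, cols)
```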

### *That's all for now. Thanks for visiting*