# Toronto Neighborhood Segmentation Data

### By: Gyan Prakash

*This notebook fetches the postal code table from a Wikipedia page and converts it into a pandas DataFrame. The data is then cleaned in a few data wrangling steps.*

### Let's import the libraries 

In [1]:
import numpy as np
import requests
import pandas as pd


### Now we will fetch the tables from the given page into a list of dataframe objects

In [2]:
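url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
r = requests.get(url)
r.raise_for_status()  # stop early if the page could not be fetched
page = pd.read_html(url)  # returns one DataFrame per table found on the page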

*We now have the tables from the Wikipedia page, returned as a list.*
### Let's check the datatype of 'page':

In [3]:
type(page)

list

*Since we need to work only with the first table,* 
### Let's take out the first table from page: 

In [4]:
df=page[0]

In [5]:
type(df)

pandas.core.frame.DataFrame
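
*Selecting `page[0]` relies on the table's position on the page, which may change if the article is edited. As a hedged alternative, the sketch below picks the first DataFrame whose columns contain a set of expected names. `pick_table` is a hypothetical helper, and the toy `tables` list only stands in for the list returned by `pd.read_html`.*

```python
import pandas as pd


def pick_table(tables, required_columns):
    """Return the first DataFrame containing all required columns (hypothetical helper)."""
    required = set(required_columns)
    for table in tables:
        # Compare by string name so non-string column labels do not break the check.
        if required.issubset(map(str, table.columns)):
            return table
    raise ValueError(f"no table has columns {sorted(required)}")


# Toy stand-in for the list returned by pd.read_html (assumed data, not the real page).
tables = [
    pd.DataFrame({"Key": ["a"], "Value": [1]}),
    pd.DataFrame({
        "Postcode": ["M3A"],
        "Borough": ["North York"],
        "Neighbourhood": ["Parkwoods"],
    }),
]

selected = pick_table(tables, ["Postcode", "Borough", "Neighbourhood"])
print(selected.shape)
```

*With the real data, the call would be `pick_table(page, [...])`, assuming the column names match those shown in the output of `df.head()`.*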

In [6]:
# drop the first row (index label 0)
df.drop(labels=0, axis=0, inplace=True)

# reset the index, because we dropped a row
df.reset_index(drop=True, inplace=True)

*Let's have a look at our dataframe:*

In [7]:
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M2A,Not assigned,Not assigned
1,M3A,North York,Parkwoods
2,M4A,North York,Victoria Village
3,M5A,Downtown Toronto,Harbourfront
4,M5A,Downtown Toronto,Regent Park


### Rename the columns:

In [8]:
df.columns = ['Post Code', 'Borough', 'Neighborhood']

*Missing values in our dataframe are displayed as 'Not assigned'. Let's replace them with NumPy NaN, which makes the processing easier.*

In [9]:
df.replace("Not assigned", np.nan, inplace=True)
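
*To illustrate what the replacement does, here is a minimal sketch on a toy dataframe. The rows are made up for illustration and only mirror the shape of the real table. Once the placeholder is NaN, `isna()` can count the missing values per column.*

```python
import numpy as np
import pandas as pd

# Toy rows mirroring the table's shape (assumed data, not the real page).
toy = pd.DataFrame({
    "Post Code": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Replace the text placeholder with a real missing-value marker.
cleaned = toy.replace("Not assigned", np.nan)

# Count missing values per column.
missing_per_column = cleaned.isna().sum()
print(missing_per_column)
```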

### Dropping NaN rows for Borough

In [10]:
df.dropna(subset=["Borough"], axis=0, inplace=True)

# reset the index, because we dropped rows
df.reset_index(drop=True, inplace=True)
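
*The `subset` argument matters here: only rows with a missing Borough are removed, while rows that are missing just the Neighborhood are kept. A small sketch on made-up rows (assumed data) shows the difference.*

```python
import numpy as np
import pandas as pd

# Toy rows (assumed data): M1A has no Borough, M7A only lacks a Neighborhood.
toy = pd.DataFrame({
    "Post Code": ["M1A", "M3A", "M7A"],
    "Borough": [np.nan, "North York", "Queen's Park"],
    "Neighborhood": [np.nan, "Parkwoods", np.nan],
})

# Drop rows whose Borough is missing, then renumber the index from 0.
kept = toy.dropna(subset=["Borough"]).reset_index(drop=True)
print(kept)
```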

In [11]:
df.head(10)

Unnamed: 0,Post Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
5,M6A,North York,Lawrence Manor
6,M7A,Queen's Park,
7,M9A,Etobicoke,Islington Avenue
8,M1B,Scarborough,Rouge
9,M1B,Scarborough,Malvern


### Now, let's replace missing Neighborhood values with the corresponding Borough values, as instructed in the assignment question

In [12]:
# fill each missing Neighborhood with the Borough from the same row
df['Neighborhood'] = df['Neighborhood'].fillna(df['Borough'])
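
*One way to express this row-wise fill is `Series.fillna` with another Series: the two columns are aligned on the index, so each missing Neighborhood takes the Borough from its own row. A minimal sketch on toy data (assumed rows):*

```python
import numpy as np
import pandas as pd

# Toy rows (assumed data): the second row has no Neighborhood.
toy = pd.DataFrame({
    "Borough": ["North York", "Queen's Park"],
    "Neighborhood": ["Parkwoods", np.nan],
})

# Assign the result back instead of modifying the column in place.
toy["Neighborhood"] = toy["Neighborhood"].fillna(toy["Borough"])
print(toy)
```

*Assigning the result back to the column avoids relying on in-place changes to a column selection, which recent pandas versions may not propagate to the parent dataframe.*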

### Grouping by Postal code

In [13]:
df = df.groupby('Post Code').agg({'Borough':'first','Neighborhood': ', '.join}).reset_index()
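
*To see what this aggregation does, here is a small sketch on toy rows (assumed data). Rows sharing a postal code collapse into one: the Borough is taken from the first row of each group, and the Neighborhood names are joined with a comma.*

```python
import pandas as pd

# Toy rows (assumed data): M5A appears twice with different neighborhoods.
toy = pd.DataFrame({
    "Post Code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

grouped = (
    toy.groupby("Post Code")
       .agg({"Borough": "first", "Neighborhood": ", ".join})
       .reset_index()
)
print(grouped)
```

*Note that `groupby` sorts the group keys by default, which is why the result is ordered by postal code rather than by the original row order.*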


In [14]:
df.head(10)

Unnamed: 0,Post Code,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


## Hurrah!
### We are done with the cleaning phase.
### Finally, let's check the number of rows and columns in the dataframe:

In [15]:
df.shape

(103, 3)
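
*Beyond the shape, a couple of quick sanity checks can confirm the cleaning worked: every postal code should appear once, and no cell should be missing. The sketch below uses a hypothetical helper, `check_cleaned`, on toy data (assumed rows). It is one possible check, not part of the assignment.*

```python
import numpy as np
import pandas as pd


def check_cleaned(frame):
    """Return True when postal codes are unique and no cell is missing (hypothetical helper)."""
    codes_unique = bool(frame["Post Code"].is_unique)
    nothing_missing = not bool(frame.isna().any().any())
    return codes_unique and nothing_missing


# Toy frames (assumed data): one clean, one with a duplicate code and a missing value.
clean = pd.DataFrame({
    "Post Code": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighborhood": ["Rouge, Malvern", "Highland Creek"],
})
dirty = pd.DataFrame({
    "Post Code": ["M1B", "M1B"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighborhood": ["Rouge", np.nan],
})

print(check_cleaned(clean), check_cleaned(dirty))
```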

### *That's all for now. Thanks for visiting*