# Toronto Neighborhood Segmentation Data

### By: Gyan Prakash

*The notebook below fetches the tables from a Wikipedia page and converts them into pandas dataframes. Data wrangling is then performed to clean the data.*

### Let's import the libraries 

In [25]:
import numpy as np
import requests
import pandas as pd


### Now we will fetch the tables from the given page into a list of dataframe objects

In [26]:
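from io import StringIO
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
r = requests.get(url)
r.raise_for_status()  # stop early if the download failed
# parse every table in the downloaded HTML into a list of dataframes
page = pd.read_html(StringIO(r.text))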

*We now have the tables from the Wikipedia page, stored in a list.*
### Let's check the datatype of 'page':

In [27]:
type(page)

list

*Since we need to work only with the first table,* 
### Let's take out the first table from page: 

In [28]:
df=page[0]

In [29]:
type(df)

pandas.core.frame.DataFrame

In [30]:
df.drop(labels=0, axis=0, inplace=True)

# reset index, because we dropped a row
df.reset_index(drop=True, inplace=True)

*Let's have a look at our dataframe:*

In [31]:
df.head()

|  | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M2A | Not assigned | Not assigned |
| 1 | M3A | North York | Parkwoods |
| 2 | M4A | North York | Victoria Village |
| 3 | M5A | Downtown Toronto | Harbourfront |
| 4 | M5A | Downtown Toronto | Regent Park |


In [32]:
df.shape

(287, 3)

### Insert the column names:

In [33]:
df.columns=['Post Code','Borough','Neighborhood']

*Missing values in our dataframe are displayed as 'Not assigned'. Let's replace them with numpy NaN, which makes the processing easier.*

In [34]:
df.replace("Not assigned", np.nan, inplace=True)

### Dropping NaN rows for Borough

In [35]:
df.dropna(subset=["Borough"], axis=0, inplace=True)

# reset index, because we dropped rows
df.reset_index(drop=True, inplace=True)
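
The replace-then-drop steps above can be tried on a small made-up frame. This is only a sketch: the rows below are hypothetical sample data, not the Wikipedia table, and it uses the non-inplace forms of `replace` and `dropna`, which return new dataframes.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows that mimic the shape of the scraped table
sample = pd.DataFrame({
    'Post Code': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Turn the placeholder string into a real missing value
cleaned = sample.replace('Not assigned', np.nan)

# Keep only rows whose Borough is known, then renumber the index from 0
cleaned = cleaned.dropna(subset=['Borough']).reset_index(drop=True)

print(cleaned)
```

Only the row with a missing Borough is removed; a row with a known Borough but a missing Neighborhood is kept for the later fill step.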

In [36]:
df.shape

(211, 3)

In [37]:
df.head(10)

|  | Post Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |
| 5 | M6A | North York | Lawrence Manor |
| 6 | M7A | Queen's Park | NaN |
| 7 | M9A | Etobicoke | Islington Avenue |
| 8 | M1B | Scarborough | Rouge |
| 9 | M1B | Scarborough | Malvern |


In [39]:
df['Borough'].unique()

array(['North York', 'Downtown Toronto', "Queen's Park", 'Etobicoke',
       'Scarborough', 'East York', 'York', 'East Toronto', 'West Toronto',
       'Central Toronto', 'Mississauga'], dtype=object)

### Now, let's replace missing neighbourhood values with the values of corresponding Borough, as instructed in the assignment question

In [40]:
# fill each missing neighborhood with the borough from the same row
df['Neighborhood'] = df['Neighborhood'].fillna(df['Borough'])
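
To see how the fill behaves, here is a small sketch on made-up rows. It uses `fillna` with the Borough column as the fill values; pandas aligns the two columns by index, so each missing neighborhood receives the borough from its own row. The sample data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the second one has no neighborhood
sample = pd.DataFrame({
    'Borough': ['North York', "Queen's Park"],
    'Neighborhood': ['Parkwoods', np.nan],
})

# Values that are already present are left untouched; only the gaps are filled
sample['Neighborhood'] = sample['Neighborhood'].fillna(sample['Borough'])

print(sample)
```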

In [41]:
df['Neighborhood'].unique()

array(['Parkwoods', 'Victoria Village', 'Harbourfront', 'Regent Park',
       'Lawrence Heights', 'Lawrence Manor', "Queen's Park",
       'Islington Avenue', 'Rouge', 'Malvern', 'Don Mills North',
       'Woodbine Gardens', 'Parkview Hill', 'Ryerson', 'Garden District',
       'Glencairn', 'Cloverdale', 'Islington', 'Martin Grove',
       'Princess Gardens', 'West Deane Park', 'Highland Creek',
       'Rouge Hill', 'Port Union', 'Flemingdon Park', 'Don Mills South',
       'Woodbine Heights', 'St. James Town', 'Humewood-Cedarvale',
       'Bloordale Gardens', 'Eringate', 'Markland Wood',
       'Old Burnhamthorpe', 'Guildwood', 'Morningside', 'West Hill',
       'The Beaches', 'Berczy Park', 'Caledonia-Fairbanks', 'Woburn',
       'Leaside', 'Central Bay Street', 'Christie', 'Cedarbrae',
       'Hillcrest Village', 'Bathurst Manor', 'Downsview North',
       'Wilson Heights', 'Thorncliffe Park', 'Adelaide', 'King',
       'Richmond', 'Dovercourt Village', 'Dufferin',
       'Scarborou

In [42]:
df.head(10)

|  | Post Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |
| 5 | M6A | North York | Lawrence Manor |
| 6 | M7A | Queen's Park | Queen's Park |
| 7 | M9A | Etobicoke | Islington Avenue |
| 8 | M1B | Scarborough | Rouge |
| 9 | M1B | Scarborough | Malvern |


In [43]:
df = df.groupby('Post Code').agg({'Borough':'first','Neighborhood': ', '.join}).reset_index()
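
The grouping step above collapses rows that share a post code into one row, keeping the first borough and joining the neighborhood names with commas. A minimal sketch on hypothetical sample rows shows the effect:

```python
import pandas as pd

# Hypothetical rows: M5A appears twice with different neighborhoods
sample = pd.DataFrame({
    'Post Code': ['M5A', 'M5A', 'M7A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Harbourfront', 'Regent Park', "Queen's Park"],
})

grouped = (
    sample.groupby('Post Code')
          .agg({'Borough': 'first', 'Neighborhood': ', '.join})
          .reset_index()
)

print(grouped)
```

Taking the first borough assumes every row of a post code carries the same borough, which holds for the rows shown in this notebook.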


## Hurrah!
### We are done with the cleaning phase.
### Finally, let's check the number of rows and columns in the dataframe:

In [44]:
df.shape

(103, 3)

# Finding the latitude and longitude for each borough
### Let's create another dataframe which will combine the above dataframe with the latitude and longitude of each borough:

In [45]:
column_names=['Post Code','Borough','Neighborhood','latitude','longitude']
df2=pd.DataFrame(columns=column_names)

In [46]:
#!conda install -c conda-forge geopy --yes

In [47]:
df2[['Post Code','Borough','Neighborhood']]=df[['Post Code','Borough','Neighborhood']]

### Import the library for finding the latitude and longitude. We are going to use geopy's Nominatim geocoder with a user agent named 'foursquare_agent'

In [48]:
from geopy.geocoders import Nominatim

In [49]:
CLIENT_ID = 'removed for privacy' # your Foursquare ID
CLIENT_SECRET = 'removed for privacy' # your Foursquare Secret
VERSION = '20180604'
LIMIT = 30
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: removed for privacy
CLIENT_SECRET:removed for privacy


In [64]:
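# create the geocoder once instead of once per row
geolocator = Nominatim(user_agent="foursquare_agent")
for index, row in df2.iterrows():
    # note: a bare borough name is ambiguous and can match a place outside Toronto
    location = geolocator.geocode(row['Borough'])
    if location is not None:
        # write through df2.at, because assigning to `row` may only change a copy
        df2.at[index, 'latitude'] = location.latitude
        df2.at[index, 'longitude'] = location.longitude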


In [51]:
df2.head()

|  | Post Code | Borough | Neighborhood | latitude | longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 54.28476 | -0.409034 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 54.28476 | -0.409034 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 54.28476 | -0.409034 |
| 3 | M1G | Scarborough | Woburn | 54.28476 | -0.409034 |
| 4 | M1H | Scarborough | Cedarbrae | 54.28476 | -0.409034 |
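
The coordinates shown above (about 54.28, -0.41) appear to belong to Scarborough in England rather than Toronto, which sits near 43.7, -79.4, because the query contained only the borough name. One possible remedy, sketched below, is to add city context to each query and to geocode every distinct borough only once. The helper name `geocode_boroughs`, the suffix `', Toronto, Ontario, Canada'`, and the stand-in geocoder used for the dry run are assumptions made for illustration; in real use a callable such as the `geocode` method of a geopy `Nominatim` instance would be passed in, and its results would still need checking.

```python
from collections import namedtuple

import pandas as pd


def geocode_boroughs(frame, geocode, suffix=', Toronto, Ontario, Canada'):
    """Return a copy of frame with latitude and longitude filled per Borough.

    `geocode` is any callable that takes a query string and returns an object
    with .latitude and .longitude attributes, or None when nothing is found.
    """
    coords = {}
    # Look up each distinct borough once, with city context appended
    for borough in frame['Borough'].unique():
        location = geocode(borough + suffix)
        if location is None:
            coords[borough] = (float('nan'), float('nan'))
        else:
            coords[borough] = (location.latitude, location.longitude)

    out = frame.copy()
    out['latitude'] = out['Borough'].map(lambda b: coords[b][0])
    out['longitude'] = out['Borough'].map(lambda b: coords[b][1])
    return out


# Dry run with a stand-in geocoder, so no network access is needed.
# The coordinates it returns are placeholders, not real lookups.
Location = namedtuple('Location', ['latitude', 'longitude'])
calls = []


def fake_geocode(query):
    calls.append(query)
    return Location(1.0, 2.0) if query.startswith('Scarborough') else None


sample = pd.DataFrame({
    'Post Code': ['M1B', 'M1C', 'M7A'],
    'Borough': ['Scarborough', 'Scarborough', "Queen's Park"],
})
result = geocode_boroughs(sample, fake_geocode)
print(result)
```

Passing the geocoder in as a callable keeps the lookup logic testable offline, and caching by borough reduces the number of requests sent to the geocoding service.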


# Finally we have our dataframe with longitude and latitude columns