# City of Toronto
### <i>This notebook applies a technique to segment the city of Toronto into neighborhoods.</i>

### Step 1: Scrape data from Wikipedia

Let's import the packages we will use to scrape the data from the Wikipedia page listing Toronto's postal codes.

In [1]:
from bs4 import BeautifulSoup
import requests


Next we will create our soup object, which will contain all the data read from the wiki page.

In [2]:
# define the page we want to scrape
wikipage = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

# use requests to get the document behind the url
r = requests.get(wikipage)
r.raise_for_status()  # stop early if the request failed
data = r.text

# create the soup object
soup = BeautifulSoup(data, 'lxml')

# let's make sure we are reading the right page by looking at the title
wikititle = soup.title.text
print(wikititle)

List of postal codes of Canada: M - Wikipedia


Let's search for the table in the document.

In [3]:
import numpy as np

postal_table = soup.table.text
# this works because the table we want is the first one in the document;
# find_all('table') would return every table in the document instead.

# let's split the obtained string into a list at each \n
postal_table = postal_table.split('\n')

# now let's get rid of the empty strings
postal_table = np.array(list(filter(None, postal_table)))

# by now we have a flat array containing all the elements we need; it is a numpy array because we want its reshape method.
# we need to reshape it into an n x 3 matrix, n being the number of table rows (header row included)
n = len(postal_table) // 3

final_table = postal_table.reshape(n, 3)

We have created a matrix containing all the elements of the table on the wiki page.
Let's use pandas to build the dataframe.
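
The split-and-reshape step above relies on every table cell producing exactly one non-empty line of text; an empty cell would shift every later value by one column. As a hedged alternative, the sketch below reads the table row by row with `find_all`. The HTML snippet and the helper name `table_to_rows` are made up for illustration (the `M9Z` row is hypothetical), and `html.parser` is used only so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the postal-code table; the second data row has an empty cell.
SAMPLE_HTML = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M9Z</td><td>Etobicoke</td><td></td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""


def table_to_rows(html):
    """Return the first table in `html` as a list of rows, one list of cell strings per <tr>."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.table.find_all("tr"):
        # keep empty cells as '' so every row keeps the same number of columns
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


rows = table_to_rows(SAMPLE_HTML)
header, body = rows[0], rows[1:]
print(header)
print(body)
```

Because each row is built from its own `<tr>`, the empty cell stays in its column and `pd.DataFrame(body, columns=header)` could be built directly from the result.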

### Step 2: Building our DataFrame

In [4]:
import pandas as pd

headers = final_table[0]
df = pd.DataFrame(final_table[1:], columns = headers)

# now let's get rid of the rows where Borough is not assigned

df = df[df.Borough != "Not assigned"]
# drop=True discards the old index instead of adding it back as a column
df = df.reset_index(drop=True)



Next we will group by Postcode and Borough and join the neighbourhoods of each group into a single comma-separated string.

In [5]:
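# collect the neighbourhoods of each (Postcode, Borough) pair into a list
new_df = df.groupby(['Postcode', 'Borough']).agg(lambda x: list(x)).reset_index()
# join each list into one string; assigning the whole column avoids chained indexing
new_df['Neighbourhood'] = new_df['Neighbourhood'].apply(', '.join)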

new_df.head()

| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
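
To see what the grouping does in isolation, here is a minimal sketch on three rows taken from the output above, laid out one neighbourhood per row as they would be before grouping. The names `toy` and `grouped` are illustrative only, and aggregating with `', '.join` directly is one possible shortcut, not necessarily the approach used in the cell above.

```python
import pandas as pd

# One row per neighbourhood, as in the scraped table before grouping.
toy = pd.DataFrame({
    "Postcode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge", "Malvern", "Woburn"],
})

# Group on both keys and join the neighbourhood names of each group into one string.
grouped = (
    toy.groupby(["Postcode", "Borough"])["Neighbourhood"]
       .agg(", ".join)
       .reset_index()
)
print(grouped)
```

Grouping on both Postcode and Borough keeps the Borough column in the result; rows within a group keep their original order, so the joined string follows the order of the source table.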


Now let's find any Neighbourhood with the value "Not assigned" and replace it with the Borough name.

In [6]:
# boolean series marking the rows whose Neighbourhood contains 'Not assigned'
list_bool = new_df['Neighbourhood'].str.contains(r'Not assigned')
ind_repl = []

for i in range(len(new_df['Neighbourhood'])):
    if list_bool[i]:
        ind_repl.append(i)
        # write through .loc so the dataframe itself is updated (no chained indexing)
        new_df.loc[i, 'Neighbourhood'] = new_df.loc[i, 'Neighbourhood'].replace('Not assigned', new_df.loc[i, 'Borough'])
        print('Found a non assigned neighbourhood for PostCode: ', new_df.loc[i, 'Postcode'])
        print('Value replaced to: ', new_df.loc[i, 'Borough'])

Found a non assigned neighbourhood for PostCode:  M7A
Value replaced to:  Queen's Park


Let's check that the value was properly replaced and that the dataframe was updated.

In [7]:
new_df.iloc[ind_repl]


| | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 85 | M7A | Queen's Park | Queen's Park |
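
The same replacement can also be written without a loop. The sketch below is an assumption-laden alternative on a two-row toy dataframe built from values shown above: it matches cells that are exactly `'Not assigned'`, whereas the loop above replaces the substring wherever it occurs. The names `toy` and `mask` are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "Postcode": ["M1G", "M7A"],
    "Borough": ["Scarborough", "Queen's Park"],
    "Neighbourhood": ["Woburn", "Not assigned"],
})

# Rows whose neighbourhood is exactly the placeholder value.
mask = toy["Neighbourhood"] == "Not assigned"

# Copy the borough name into those rows only; .loc aligns both sides on the index.
toy.loc[mask, "Neighbourhood"] = toy.loc[mask, "Borough"]
print(toy)
```

Keeping `mask` around also gives the list of affected rows for a later check, for example with `toy[mask]`.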


Finally, let's print the shape of our final dataframe

In [8]:
print(new_df.shape)
print('Dataframe contains', new_df.shape[0] ,'rows and', new_df.shape[1], 'columns' )

(103, 3)
Dataframe contains 103 rows and 3 columns


### Step 3: Adding geo coordinates to our DataFrame

Let's start by reading the CSV file containing all coordinates of the Toronto postal codes.

In [9]:
GeoFile = 'Geospatial_Coordinates.csv'

#use pandas to read the csv file and create new dataframe
df_geo = pd.read_csv(GeoFile)
df_geo.head()


| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


The next step is to merge the geo dataframe with our main dataframe.
Here we use the merge function and specify, with left_on and right_on, which column of each dataframe should be matched.
The how argument sets the join type: with how = 'left', every row of our main dataframe is kept and the coordinate columns are filled in wherever the postal code matches.

In [10]:
#now we merge the two dataframes using the merge method
new_df = pd.merge(new_df, df_geo, left_on = 'Postcode', right_on = 'Postal Code', how = 'left')

#the merge also adds the postal code from the second dataframe so we drop it
new_df.drop('Postal Code', axis = 1, inplace = True)

new_df.head()

| | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
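
A left merge silently leaves NaN coordinates for any postal code that is missing from the CSV file. One way to check for that, sketched here on toy data, is the `indicator` argument of `pd.merge`, which adds a `_merge` column recording where each row came from. The postal code `M0X` and its borough are hypothetical values added only to force a mismatch; the other values come from the tables above.

```python
import pandas as pd

left = pd.DataFrame({
    "Postcode": ["M1B", "M1C", "M0X"],  # 'M0X' is a made-up code with no coordinates
    "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
})
right = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# how='left' keeps every row of `left`; indicator=True adds the '_merge' column.
merged = pd.merge(left, right, left_on="Postcode", right_on="Postal Code",
                  how="left", indicator=True)

# Rows marked 'left_only' found no match in `right`, so their coordinates are NaN.
missing = merged.loc[merged["_merge"] == "left_only", "Postcode"].tolist()
print(missing)
```

An empty `missing` list would indicate that every postal code in the main dataframe received coordinates.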
