# Segmenting and Clustering Neighborhoods in Toronto

#### I will explore, segment, and cluster the neighborhoods in the city of Toronto.

The information about Toronto neighborhoods is retrieved from Wikipedia using the Python package Beautiful Soup.

## *Assignment - Part 1*

In [None]:
#install the BeautifulSoup package and the lxml parser used below (run if necessary)
!pip install beautifulsoup4 lxml

In [1]:
#import packages
import pandas as pd
from bs4 import BeautifulSoup
import requests # library to handle requests

Get the Wikipedia page as HTML and create a BeautifulSoup object

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
source = requests.get(url)
#print (source.status_code)
#soup = BeautifulSoup(source.text,"html.parser")
soup = BeautifulSoup(source.text,"lxml")
#print(soup.prettify())

Build a list with the table data: every table row with exactly three cells becomes one record

In [3]:
list_with_data = []
table = soup.find('table')

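# find_all is the current name of the older findAll alias
for tr in table.find_all('tr'):
    aux_list = []
    for td in tr.find_all('td'):
        data = td.text.strip()
        aux_list.append(data)
    # keep only complete rows (PostalCode, Borough, Neighborhood)
    if len(aux_list) == 3:
        list_with_data.append(aux_list)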
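
To see why the three-cell check also skips the header row, here is a minimal sketch on a small inline HTML table. The HTML snippet and the names `soup_demo`, `rows`, and `cells` are made up for illustration, and the built-in `html.parser` is used so that `lxml` is not required. Header cells are `<th>` elements, so a search for `<td>` finds nothing in that row.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table, for illustration only
html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

soup_demo = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup_demo.find("table").find_all("tr"):
    # Collect the stripped text of every <td> cell in this row
    cells = [td.text.strip() for td in tr.find_all("td")]
    if len(cells) == 3:  # the header row yields 0 <td> cells and is skipped
        rows.append(cells)

print(rows)
```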

### **3.1 - The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood**

In [4]:
#The dataframe will have 3 Columns = PostalCode, Borough, and Neighborhood
columns_name = ['PostalCode', 'Borough','Neighborhood']
df = pd.DataFrame(data = list_with_data, columns=columns_name)
df.head(5)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


### **3.2 - Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.**

In [5]:
index_names = df[df['Borough'] == 'Not assigned'].index

df.drop(index_names, inplace = True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
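
The same filter can also be written as a boolean mask, which avoids `inplace=True` and returns a new dataframe. The sketch below uses a toy dataframe named `demo` (an assumption for illustration, with three rows copied from the table above) rather than the full `df`.

```python
import pandas as pd

# Toy rows copied from the first table in this notebook
demo = pd.DataFrame({
    'PostalCode': ['M1A', 'M2A', 'M3A'],
    'Borough': ['Not assigned', 'Not assigned', 'North York'],
    'Neighborhood': ['Not assigned', 'Not assigned', 'Parkwoods'],
})

# Keep only rows whose borough is assigned; the original index labels are preserved
assigned = demo[demo['Borough'] != 'Not assigned']
print(assigned)
```

As with `drop`, the surviving rows keep their old index labels, which is why the index is reset later on.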


### **3.3 - More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.**

In the current version of the Wikipedia page, the neighborhoods of each postal code are already combined in a single row. Even so, we can check whether any postal code is duplicated.

In [6]:
df_duplicated = df[df.duplicated(['PostalCode'])]
df_duplicated

Unnamed: 0,PostalCode,Borough,Neighborhood
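
No duplicates are found, so nothing has to be combined here. If the page listed a postal code on several rows, one possible way to combine them is a `groupby` with a string join. This is only a sketch on a made-up dataframe (`demo` is hypothetical and mimics the layout described in the assignment text), not a step that was needed for the current data.

```python
import pandas as pd

# Hypothetical layout: M5A appears on two rows, one neighborhood each
demo = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# Group rows that share a postal code and borough, then join their neighborhoods
combined = (
    demo.groupby(['PostalCode', 'Borough'], sort=False)['Neighborhood']
        .agg(', '.join)
        .reset_index()
)
print(combined)
```

`sort=False` keeps the postal codes in their order of first appearance instead of sorting them alphabetically.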


### **3.4 - If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.**

In [7]:
#First check which boroughs don't have a neighborhood assigned
df_neighborhood_check = df[df.Neighborhood == 'Not assigned']
df_neighborhood_check.shape[0]

0

No row in the dataframe has a 'Not assigned' neighborhood, so no replacement is needed.
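
For reference, if such rows did exist, the rule from 3.4 could be applied with a boolean mask and `.loc`. The sketch below runs on a hypothetical two-row dataframe (`demo` and its first row are invented for illustration).

```python
import pandas as pd

# Hypothetical data: the first row has a borough but no neighborhood
demo = pd.DataFrame({
    'PostalCode': ['M7A', 'M3A'],
    'Borough': ["Queen's Park", 'North York'],
    'Neighborhood': ['Not assigned', 'Parkwoods'],
})

# Where the neighborhood is 'Not assigned', copy the borough name into it
missing = demo['Neighborhood'] == 'Not assigned'
demo.loc[missing, 'Neighborhood'] = demo.loc[missing, 'Borough']
print(demo)
```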

### **3.5 - Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.**

In [21]:
#Reset the index, which has gaps after the 'Not assigned' rows were dropped
df = df.reset_index(drop = True)
df.head(15)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


### **3.6 - In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.**

Let's check how many rows the dataframe has.

In [10]:
print('The dataframe has {} rows'.format(df.shape[0]))

The dataframe has 103 rows


## *Assignment - Part 2*

Now that we have built a dataframe with the postal code, borough name, and neighborhood name of each neighborhood, we need the latitude and longitude coordinates of each neighborhood in order to use the Foursquare location data.

Here is a link to a csv file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data

In [11]:
# Download the csv file from the given url
!wget -q -O Geospatial_data_Toronto.csv http://cocl.us/Geospatial_data
print('Download finished')

Download finished


In [15]:
import pandas as pd

df_geospatial = pd.read_csv('Geospatial_data_Toronto.csv')

df_geospatial.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In order to add the Latitude and Longitude columns from the geospatial dataframe, first we need to rename the *Postal Code* column and then merge the initial dataframe (which contains the Neighborhoods) with the geospatial dataframe.

In [18]:
df_geospatial.rename(columns={'Postal Code': 'PostalCode'}, inplace=True)
df_geospatial.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


The final dataframe after merging.

In [22]:
df_merged = pd.merge(df, df_geospatial, on='PostalCode')
df_merged.head(15)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
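
By default `pd.merge` performs an inner join, so a postal code without coordinates would be dropped silently. One way to guard against that is a left join with `indicator=True` and `validate='one_to_one'`. The sketch below uses two small made-up dataframes (`left`, `right`, and the postal code `M0Z` are assumptions for illustration).

```python
import pandas as pd

left = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M0Z'],  # 'M0Z' is a made-up code with no coordinates
    'Borough': ['North York', 'North York', 'Nowhere'],
})
right = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# Left join keeps every row of `left`; validate raises if a key is duplicated
checked = pd.merge(left, right, on='PostalCode', how='left',
                   validate='one_to_one', indicator=True)

# Rows flagged 'left_only' found no matching coordinates
unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched['PostalCode'].tolist())
```

An empty list would mean that every postal code received coordinates.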
