# Segmenting and Clustering Neighbourhoods in Toronto

## Part One


### Obtain the data from the table on the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M and load it into a pandas dataframe.
### To do so, I will use the read_html function of the pandas library and take the first table on the Wikipedia page.

In [1]:
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
toronto_nh_df = pd.read_html(url, flavor='html5lib', header=0)[0]
toronto_nh_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
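
Taking the first table with `[0]` works only as long as the page layout stays the same. As a minimal sketch, the table could instead be picked by its column names. `pick_table` is a hypothetical helper, not part of pandas, and the two small dataframes below are made-up stand-ins for the list that `pd.read_html` returns.

```python
import pandas as pd


def pick_table(tables, required_columns):
    """Return the first dataframe in `tables` that has every required column."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise ValueError(f"no table contains the columns {sorted(required)}")


# Made-up stand-ins for the list of tables parsed from a page.
navigation_box = pd.DataFrame({"Links": ["A", "B"]})
postal_codes = pd.DataFrame({
    "Postal Code": ["M1A", "M3A"],
    "Borough": ["Not assigned", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods"],
})

selected = pick_table([navigation_box, postal_codes],
                      ["Postal Code", "Borough", "Neighbourhood"])
print(selected)
```

With the real page, the same helper would be called on the list returned by `pd.read_html(url, flavor='html5lib', header=0)`.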


### Drop rows whose Borough is not assigned. If a row has an assigned Borough but a not assigned Neighbourhood, the Neighbourhood is set to the Borough.

In [2]:
# keep only rows with an assigned borough (copy to avoid chained-assignment warnings)
toronto_nh_df = toronto_nh_df[toronto_nh_df['Borough'] != 'Not assigned'].copy()
# where the neighbourhood is not assigned, fall back to the borough of the same row
toronto_nh_df['Neighbourhood'] = toronto_nh_df['Neighbourhood'].mask(
    toronto_nh_df['Neighbourhood'] == 'Not assigned', toronto_nh_df['Borough'])

toronto_nh_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
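
The first rows above do not contain a row with an assigned Borough and a not assigned Neighbourhood, so the fallback rule is not visible in the output. The sketch below applies the same two steps to a small made-up dataframe. The `M9Z` row is invented purely for illustration, and `Series.mask` is one possible way to express the rule.

```python
import pandas as pd

# Made-up rows: M1A has no borough, M9Z (invented) has a borough but no neighbourhood.
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M9Z"],
    "Borough": ["Not assigned", "North York", "Etobicoke"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Step 1: drop rows without a borough.
cleaned = toy[toy["Borough"] != "Not assigned"].copy()

# Step 2: where the neighbourhood is not assigned, take the borough of the same row.
cleaned["Neighbourhood"] = cleaned["Neighbourhood"].mask(
    cleaned["Neighbourhood"] == "Not assigned", cleaned["Borough"])

print(cleaned)
```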


### Combine the neighbourhoods that share a postal code into a single row

In [3]:
toronto_nh_df = toronto_nh_df.groupby(['Postal Code', 'Borough'])['Neighbourhood'].apply(', '.join).reset_index()
toronto_nh_df.head(10)

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park"
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge"
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
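
To see what the grouping does when a postal code appears on more than one row, here is a small sketch on made-up input. It assumes that Malvern and Rouge arrive as two separate rows, which the source table may or may not do.

```python
import pandas as pd

# Made-up input in which one postal code is spread over two rows.
toy = pd.DataFrame({
    "Postal Code": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighbourhood": ["Malvern", "Rouge", "Woburn"],
})

# Group by postal code and borough, then join the neighbourhood names of each group.
combined = (toy.groupby(["Postal Code", "Borough"])["Neighbourhood"]
               .apply(", ".join)
               .reset_index())

print(combined)
```

Because `groupby` sorts by the grouping keys by default, the result is also ordered by postal code.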


### Display the number of rows in the dataframe

In [4]:
toronto_nh_df.shape

(103, 3)

## Part Two

### Update the dataframe by adding geographical coordinates of each neighbourhood.

### To do so, I will read the coordinates from the CSV file and add them to the original dataframe

In [5]:
csv = 'http://cocl.us/Geospatial_data'
toronto_coor_df = pd.read_csv(csv)
toronto_coor_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


### The 'Postal Code' column is redundant because it is already present in the neighbourhood dataframe, so I will drop it

In [6]:
toronto_coor_df.drop("Postal Code", axis=1, inplace=True)
toronto_coor_df.head()

Unnamed: 0,Latitude,Longitude
0,43.806686,-79.194353
1,43.784535,-79.160497
2,43.763573,-79.188711
3,43.770992,-79.216917
4,43.773136,-79.239476


In [7]:
toronto_coor_df.shape

(103, 2)

### Having established that the two dataframes have the same number of rows, I can now concatenate them. This assumes both are in the same postal code order, as the first rows of each suggest.

In [8]:
toronto_nh_df = pd.concat([toronto_nh_df, toronto_coor_df], axis=1)
toronto_nh_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


In [9]:
toronto_nh_df.shape

(103, 5)
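
Concatenating side by side pairs rows purely by position, so it is correct only if both dataframes list the postal codes in the same order. A possible alternative, sketched below, is to keep the 'Postal Code' column in the coordinates dataframe and merge on it. The rows are taken from the outputs above, with the coordinates shuffled on purpose. `validate="one_to_one"` makes pandas raise an error if a postal code is duplicated on either side.

```python
import pandas as pd

neighbourhoods = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M1E"],
    "Borough": ["Scarborough"] * 3,
    "Neighbourhood": ["Malvern, Rouge",
                      "Rouge Hill, Port Union, Highland Creek",
                      "Guildwood, Morningside, West Hill"],
})

# Coordinates deliberately listed in a different order than the neighbourhoods.
coordinates = pd.DataFrame({
    "Postal Code": ["M1E", "M1B", "M1C"],
    "Latitude": [43.763573, 43.806686, 43.784535],
    "Longitude": [-79.188711, -79.194353, -79.160497],
})

# Match rows by postal code instead of by position.
merged = neighbourhoods.merge(coordinates, on="Postal Code",
                              how="left", validate="one_to_one")

print(merged)
```

A left merge keeps every neighbourhood row, so a postal code without coordinates would show up as missing values instead of silently shifting the other rows.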

## Part Three

### Explore and cluster the neighbourhoods in Toronto

### Import needed libraries

In [10]:
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium
print('All libraries imported')

All libraries imported


### Create a map of Toronto neighbourhoods

In [11]:
# map centre as numeric coordinates (floats, not strings)
latitude = 43.653963
longitude = -79.387207

toronto_map = folium.Map(location=[latitude, longitude], zoom_start=10)

for lat, lng, borough, neighborhood in zip(toronto_nh_df['Latitude'], 
                                           toronto_nh_df['Longitude'], 
                                           toronto_nh_df['Borough'], 
                                           toronto_nh_df['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(toronto_map)


toronto_map
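
The map centre above is hard-coded. As a rough sketch, it could instead be derived from the data by averaging the coordinates. `map_centre` is a hypothetical helper, and the sample rows are the first five rows of the coordinates shown earlier, so the result here is the centre of those five points only, not of the whole city.

```python
import pandas as pd


def map_centre(df, lat_col="Latitude", lng_col="Longitude"):
    """Return [latitude, longitude] of the mean point of the dataframe."""
    return [float(df[lat_col].mean()), float(df[lng_col].mean())]


# First five coordinate rows from the CSV output above.
sample = pd.DataFrame({
    "Latitude": [43.806686, 43.784535, 43.763573, 43.770992, 43.773136],
    "Longitude": [-79.194353, -79.160497, -79.188711, -79.216917, -79.239476],
})

centre = map_centre(sample)
print(centre)
```

The resulting list can be passed as the `location` argument of `folium.Map` in place of the fixed latitude and longitude.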