<h1 align=center><font size = 5>Segmenting and Clustering Neighbourhoods in Toronto</font></h1>

## Introduction

In this assignment, you will be required to explore, segment, and cluster the neighbourhoods in the city of Toronto. However, unlike New York, the neighbourhood data is not readily available on the internet. What is interesting about the field of data science is that each project can be challenging in its unique way, so you need to learn to be agile and refine the skill to learn new libraries and tools quickly depending on the project.

For the Toronto neighbourhood data, a Wikipedia page exists that has all the information we need to explore and cluster the neighbourhoods in Toronto. You will be required to scrape the Wikipedia page, wrangle and clean the data, and then read it into a pandas dataframe so that it is in a structured format like the New York dataset.

Once the data is in a structured format, you can replicate the analysis that we did to the New York City dataset to explore and cluster the neighbourhoods in the city of Toronto.

Your submission will be a link to your Jupyter Notebook on your Github repository.

For this assignment, you will be required to explore and cluster the neighbourhoods in Toronto.

Start by creating a new Notebook for this assignment.
Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe like the one shown below:
   

# PART 1 --------------------------------

In [1]:
#!pip install beautifulsoup4
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
import requests 
from bs4 import BeautifulSoup 

print('Libraries imported.')

Libraries imported.


## Scrape the List of postal codes of Canada wiki page with BeautifulSoup

In [2]:
# Download the page and explore the dataset
from IPython.display import display_html

source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
# Parse the List of postal codes of Canada wiki page with BeautifulSoup
soup = BeautifulSoup(source, 'lxml')
print(soup.title)
# Keep the first table on the page as an HTML string and render it
tab = str(soup.table)
display_html(tab, raw=True)

<title>List of postal codes of Canada: M - Wikipedia</title>


Postal code,Borough,Neighborhood
M1A,Not assigned,
M2A,Not assigned,
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,Regent Park / Harbourfront
M6A,North York,Lawrence Manor / Lawrence Heights
M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
M8A,Not assigned,
M9A,Etobicoke,Islington Avenue
M1B,Scarborough,Malvern / Rouge


## Convert the content of the postal code HTML table to a dataframe

## The dataframe will consist of three columns: Postcode, Borough, and Neighbourhood

In [3]:
from io import StringIO

# Parse the HTML table; read_html returns a list of dataframes
dfs = pd.read_html(StringIO(tab))
toronto = dfs[0]
# Rename the columns
toronto.columns = ['Postcode','Borough','Neighbourhood']
toronto.columns

Index(['Postcode', 'Borough', 'Neighbourhood'], dtype='object')

In [4]:
# Total rows and columns
toronto.shape

(180, 3)

## Clean dataframe 

In [5]:
# clean dataframe 
# Ignore cells with a borough that is Not assigned

toronto = toronto[toronto.Borough!='Not assigned']
toronto = toronto[toronto.Borough!= 0]
toronto.reset_index(drop = True, inplace = True)
toronto.head()


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park / Harbourfront
3,M6A,North York,Lawrence Manor / Lawrence Heights
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government


## More than one neighborhood can exist in one postal code area

In [6]:
# More than one neighbourhood can exist in one postal code area. Rows that share a
# Postcode are combined into one row listing all of their neighbourhoods.
# First, give any 'Not assigned' neighbourhood the name of its borough.
not_assigned = toronto['Neighbourhood'] == 'Not assigned'
toronto = toronto.assign(Neighbourhood=toronto['Neighbourhood'].mask(not_assigned, toronto['Borough']))

toronto_groupby = toronto.groupby(['Postcode','Borough'])['Neighbourhood'].apply(', '.join).reset_index()
toronto_groupby

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,Malvern / Rouge
1,M1C,Scarborough,Rouge Hill / Port Union / Highland Creek
2,M1E,Scarborough,Guildwood / Morningside / West Hill
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,Kennedy Park / Ionview / East Birchmount Park
7,M1L,Scarborough,Golden Mile / Clairlea / Oakridge
8,M1M,Scarborough,Cliffside / Cliffcrest / Scarborough Village West
9,M1N,Scarborough,Birch Cliff / Cliffside West


## If a cell has a borough but a Not assigned neighbourhood, then the neighbourhood will be the same as the borough.

In [7]:
# Replacing the name of the neighbourhoods which are 'Not assigned' with value of Borough
toronto_groupby['Neighbourhood'] = toronto_groupby['Neighbourhood'].mask(toronto_groupby['Neighbourhood'] == 'Not assigned', toronto_groupby['Borough'])
toronto_groupby.shape


(103, 3)
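
The fallback rule can be checked on a small example. This is a sketch on made-up rows (the `sample` frame is an assumption for illustration): wherever the neighbourhood is 'Not assigned', the borough from the same row is used instead.

```python
import pandas as pd

# Illustrative rows only: the second row has no neighbourhood assigned.
sample = pd.DataFrame({
    'Postcode': ['M3A', 'M7A'],
    'Borough': ['North York', "Queen's Park"],
    'Neighbourhood': ['Parkwoods', 'Not assigned'],
})

# Where the neighbourhood is 'Not assigned', take the borough from the same row.
missing = sample['Neighbourhood'] == 'Not assigned'
sample['Neighbourhood'] = sample['Neighbourhood'].mask(missing, sample['Borough'])
print(sample)
```

`Series.mask` aligns the replacement values by index, so each row receives its own borough.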

# PART 2 ------------------------------------

Now that you have built a dataframe of the postal code of each neighbourhood along with the borough name and neighbourhood name, in order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighbourhood.

In an older version of this course, we were leveraging the Google Maps Geocoding API to get the latitude and the longitude coordinates of each neighbourhood. However, recently Google started charging for their API: http://geoawesomeness.com/developers-up-in-arms-over-google-maps-api-insane-price-hike/, so we will use the Geocoder Python package instead: https://geocoder.readthedocs.io/index.html.

The problem with this package is that it is not always reliable: a call for the latitude and longitude coordinates of a given postal code can return None, and the same call made again can return the coordinates. So, in order to make sure that you get the coordinates for all of your neighbourhoods, you can run a while loop for each postal code. Taking postal code M5G as an example, your code would look something like this:
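
A minimal sketch of the retry idea follows. It uses a bounded number of attempts rather than an unbounded while loop, so a postal code that never resolves cannot hang the notebook. The names `get_latlng` and `lookup` are hypothetical; with the Geocoder package, `lookup` might wrap a call such as `geocoder.google(address).latlng`, which is not exercised here. A stand-in function simulates a lookup that fails twice.

```python
import time

def get_latlng(postal_code, lookup, max_tries=5, delay=0.0):
    """Call lookup(address) until it returns coordinates or max_tries is reached."""
    address = '{}, Toronto, Ontario'.format(postal_code)
    for _ in range(max_tries):
        coords = lookup(address)      # expected to return [lat, lng] or None
        if coords is not None:
            return coords
        time.sleep(delay)             # brief pause before trying again
    return None                       # give up instead of looping forever

# Stand-in for a geocoder call that fails twice before succeeding.
# The coordinates are placeholder values, not real geocoder output.
attempts = {'count': 0}

def flaky_lookup(address):
    attempts['count'] += 1
    return None if attempts['count'] < 3 else [43.0, -79.0]

coords = get_latlng('M5G', flaky_lookup)
print(coords, 'after', attempts['count'], 'attempts')
```

Returning None after the last attempt lets the caller decide how to handle a postal code that could not be resolved.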

In [8]:
## Importing the csv file containing the latitudes and longitudes for various neighbourhoods in Canada

In [9]:
!pip install geocoder
import geocoder
#####################################################################
# This loop was discarded because of errors when using geocoder.google;
# the coordinates are read from a CSV file instead.
#for i in range(0,len(toronto_groupby)):
    #print("Row number:"+str(i))
    #initialize your variable to None
    #lat_lng_coords = None
    #loop until you get the coordinates
    #postal_code_from_df=toronto_groupby.iloc[i]['Postcode']
    #print("postal_code_from_df:"+str(postal_code_from_df))
    #while(lat_lng_coords is None):
        #print('while loop')
        #g = geocoder.google('{}, Toronto, Ontario'.format(postal_code_from_df))
        #print("value of g:"+str(g))
        #lat_lng_coords = g.latlng
        #print("lat_lng_coords:"+str(lat_lng_coords))
######################################################################
df_geo_coordinate = pd.read_csv('https://cocl.us/Geospatial_data')
df_geo_coordinate.rename(columns={'Postal Code':'Postcode'},inplace=True)
df_geo_coordinate.shape




(103, 3)

In [10]:
df_geo_coordinate

Unnamed: 0,Postcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


In [11]:
toronto_groupby

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,Malvern / Rouge
1,M1C,Scarborough,Rouge Hill / Port Union / Highland Creek
2,M1E,Scarborough,Guildwood / Morningside / West Hill
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,Kennedy Park / Ionview / East Birchmount Park
7,M1L,Scarborough,Golden Mile / Clairlea / Oakridge
8,M1M,Scarborough,Cliffside / Cliffcrest / Scarborough Village West
9,M1N,Scarborough,Birch Cliff / Cliffside West


## Merge the coordinate and neighbourhood datasets on Postcode

In [12]:
toronto_merge=pd.merge(toronto_groupby, df_geo_coordinate, on='Postcode')
toronto_merge.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,Malvern / Rouge,43.806686,-79.194353
1,M1C,Scarborough,Rouge Hill / Port Union / Highland Creek,43.784535,-79.160497
2,M1E,Scarborough,Guildwood / Morningside / West Hill,43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
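
The default inner join silently drops any postal code that has no match in the coordinate file. As a hedged sketch on small hand-made frames (the `M9Z` row is a made-up code used only for illustration), a left join with `indicator=True` makes unmatched postal codes visible, and `validate='one_to_one'` raises an error if a postal code is duplicated on either side.

```python
import pandas as pd

neighbourhoods = pd.DataFrame({
    'Postcode': ['M1B', 'M1C', 'M9Z'],   # 'M9Z' is a made-up code with no coordinates
    'Borough': ['Scarborough', 'Scarborough', 'Example Borough'],
})
coordinates = pd.DataFrame({
    'Postcode': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# A left join keeps every neighbourhood row; the `_merge` column added by
# `indicator=True` records whether each row found a match.
checked = pd.merge(neighbourhoods, coordinates, on='Postcode',
                   how='left', indicator=True, validate='one_to_one')
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postcode'].tolist()
print(unmatched)
```

An empty `unmatched` list would indicate that every postal code received coordinates.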


# PART 3 ------------------------------------

## Visualizing all the Neighbourhoods of the above data frame using Folium

Generate maps to visualize your neighbourhoods and how they cluster together.

In [18]:
!conda install -c conda-forge folium=0.5.0 
#!pip install folium==0.5.0


In [19]:
#!pip install folium==0.5.0
import folium # plotting library

# Centre the map on Toronto
map_toronto = folium.Map(location=[43.651070,-79.347015],zoom_start=10)

# Add one circle marker per postal code, labelled with its neighbourhood(s) and borough
for lat,lng,borough,neighbourhood in zip(toronto_merge['Latitude'],toronto_merge['Longitude'],toronto_merge['Borough'],toronto_merge['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat,lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)
map_toronto

## K-means clustering of the neighbourhoods

In [20]:
toronto_merge

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,Malvern / Rouge,43.806686,-79.194353
1,M1C,Scarborough,Rouge Hill / Port Union / Highland Creek,43.784535,-79.160497
2,M1E,Scarborough,Guildwood / Morningside / West Hill,43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,Kennedy Park / Ionview / East Birchmount Park,43.727929,-79.262029
7,M1L,Scarborough,Golden Mile / Clairlea / Oakridge,43.711112,-79.284577
8,M1M,Scarborough,Cliffside / Cliffcrest / Scarborough Village West,43.716316,-79.239476
9,M1N,Scarborough,Birch Cliff / Cliffside West,43.692657,-79.264848


In [21]:
from sklearn.cluster import KMeans
kmeans_num=6
# Cluster on the coordinates only: drop the non-numeric columns
toronto_clustering = toronto_merge.drop(columns=['Postcode','Borough','Neighbourhood'])
kmeans = KMeans(n_clusters = kmeans_num,random_state=0).fit(toronto_clustering)
# Store each row's cluster label as the first column
toronto_merge.insert(0, 'Cluster Labels', kmeans.labels_)
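
To give a rough idea of what fitting K-means does, here is a simplified sketch of Lloyd's algorithm in NumPy. It is not scikit-learn's implementation (which uses a smarter initialisation and several restarts), and `simple_kmeans` is a hypothetical helper name. It alternates between assigning each point to its nearest centre and moving each centre to the mean of its points. The six coordinate pairs are copied from the table above.

```python
import numpy as np

def simple_kmeans(points, k, n_iter=20, seed=0):
    """Simplified K-means (Lloyd's algorithm); not scikit-learn's implementation."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()

    def nearest(current_centers):
        # Euclidean distance from every point to every centre, then the closest centre.
        dists = np.linalg.norm(points[:, None, :] - current_centers[None, :, :], axis=2)
        return dists.argmin(axis=1)

    for _ in range(n_iter):
        labels = nearest(centers)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:          # leave an empty cluster's centre unchanged
                centers[j] = members.mean(axis=0)
    return nearest(centers), centers

# Latitude/longitude pairs for M1B, M1C, M1E, M1K, M1L and M1N.
coords = np.array([
    [43.806686, -79.194353],
    [43.784535, -79.160497],
    [43.763573, -79.188711],
    [43.727929, -79.262029],
    [43.711112, -79.284577],
    [43.692657, -79.264848],
])
labels, centers = simple_kmeans(coords, k=2)
print(labels)
```

Which group receives label 0 depends on the random starting points, so only the grouping itself is meaningful, not the label numbers.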

In [22]:
import matplotlib.cm as cm
import matplotlib.colors as colors
# create map
map_clusters = folium.Map(location=[43.651070,-79.347015],zoom_start=10)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kmeans_num))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, neighbourhood, cluster in zip(toronto_merge['Latitude'], toronto_merge['Longitude'], toronto_merge['Neighbourhood'], toronto_merge['Cluster Labels']):
    label = folium.Popup(str(neighbourhood) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters
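
The colour lookup can also be built without matplotlib. The sketch below uses only the standard library; `cluster_palette` is a hypothetical helper, not part of folium or scikit-learn. It spaces hues evenly around the colour wheel so that each cluster label gets its own hex colour.

```python
import colorsys

def cluster_palette(n_clusters):
    """Return one hex colour per cluster label, evenly spaced around the hue wheel."""
    palette = []
    for i in range(n_clusters):
        r, g, b = colorsys.hsv_to_rgb(i / n_clusters, 1.0, 1.0)
        palette.append('#{:02x}{:02x}{:02x}'.format(
            round(r * 255), round(g * 255), round(b * 255)))
    return palette

palette = cluster_palette(6)
# K-means labels run from 0 to n_clusters - 1, so a label indexes the palette directly.
colour_for_label_0 = palette[0]
print(palette)
```

Because the labels start at 0, `palette[label]` gives a one-to-one mapping between clusters and colours.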