# Segmenting and Clustering Neighborhoods in Toronto

# ---- PART 1 BEGINS ----

### Step 1: Start by creating a new Notebook for this assignment.

### Step 2: Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas  dataframe like the one shown \[IMAGE IN COURSERA\].

I did the following steps after reading about the pandas read_html function, which returns a list of dataframes, one for each HTML table it finds on the page.


In [3]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
import pandas as pd
df = pd.read_html(url, header=0)
df

[    Postal Code           Borough  \
 0           M1A      Not assigned   
 1           M2A      Not assigned   
 2           M3A        North York   
 3           M4A        North York   
 4           M5A  Downtown Toronto   
 ..          ...               ...   
 175         M5Z      Not assigned   
 176         M6Z      Not assigned   
 177         M7Z      Not assigned   
 178         M8Z         Etobicoke   
 179         M9Z      Not assigned   
 
                                          Neighbourhood  
 0                                         Not assigned  
 1                                         Not assigned  
 2                                            Parkwoods  
 3                                     Victoria Village  
 4                            Regent Park, Harbourfront  
 ..                                                 ...  
 175                                       Not assigned  
 176                                       Not assigned  
 177                

In [4]:
len(df)

3

In [5]:
df[0]

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."
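
Since read_html returns a list, picking `df[0]` depends on the postal code table staying first on the page. A minimal sketch of choosing the table by its column names instead is below; `pick_table` is a hypothetical helper and the miniature tables are made up to stand in for the scraped list.

```python
import pandas as pd

def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include every required name."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise ValueError(f"no table with columns {sorted(required)}")

# Stand-in for the list returned by pd.read_html (made-up miniature tables).
tables = [
    pd.DataFrame({"Province": ["Ontario"], "Prefix": ["M"]}),
    pd.DataFrame({
        "Postal Code": ["M1A", "M3A"],
        "Borough": ["Not assigned", "North York"],
        "Neighbourhood": ["Not assigned", "Parkwoods"],
    }),
]

postal_codes = pick_table(tables, ["Postal Code", "Borough", "Neighbourhood"])
print(postal_codes.shape)
```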


### Step 3: To create the above dataframe (shown in coursera):
- #### The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood

I will rename the columns by assigning a new list to the dataframe's .columns attribute, as follows:

In [6]:
df[0].columns = ['PostalCode','Borough','Neighborhood']
df[0].head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
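
Assigning a list to `.columns` relies on the column order staying the same. As a hedged alternative, the sketch below renames by name with `DataFrame.rename`, so a reordered table would still end up with the right labels. The one-row frame is a made-up sample, not the scraped data.

```python
import pandas as pd

# Made-up one-row sample with the column names shown in the Wikipedia table.
raw = pd.DataFrame({
    "Postal Code": ["M3A"],
    "Borough": ["North York"],
    "Neighbourhood": ["Parkwoods"],
})

# rename returns a new frame; columns not listed in the mapping are left alone.
renamed = raw.rename(columns={"Postal Code": "PostalCode",
                              "Neighbourhood": "Neighborhood"})
print(list(renamed.columns))
```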


- #### Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
To do this, I will drop every row whose Borough value is 'Not assigned', as follows:

In [7]:
filtered_df = df[0].drop(df[0][df[0].Borough == 'Not assigned'].index)
filtered_df

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [26]:
filtered_df.reset_index(drop=True,inplace=True) # Since the index column values in Coursera start as 0, 1, 2..., I will reset the index column.
filtered_df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
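
The drop and the index reset can also be written as a single chained expression with a boolean mask. This is only a sketch on a made-up three-row sample; it should give the same result as the two cells above.

```python
import pandas as pd

# Made-up sample: one unassigned row and two assigned rows.
sample = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep rows with an assigned borough, then renumber the index from 0.
assigned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```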


- #### More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11  in the table (image is in Coursera).

By the time I did this assignment, the Wikipedia table already listed one row per postal code with the neighborhoods separated by commas, so nothing needed combining and I moved on to the next task.
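
If the table did still list a postal code on several rows, the rows could be combined roughly as sketched here. The three-row frame is made up to mimic the M5A case described in the task.

```python
import pandas as pd

# Made-up rows mimicking a postal code that is listed twice.
rows = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# One row per (PostalCode, Borough); neighborhoods joined with a comma.
combined = (
    rows.groupby(["PostalCode", "Borough"], as_index=False, sort=False)["Neighborhood"]
        .agg(", ".join)
)
print(combined)
```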


- #### If a cell has a borough but a Not assigned  neighborhood, then the neighborhood will be the same as the borough.
After filtering the rows by removing the ones where the Borough was 'Not assigned', there were no values in the 'Neighborhood' column equal to 'Not assigned', so this is already done.
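
For completeness, here is a sketch of how the rule could be applied if such rows existed. The sample values are made up; the count printed first is the kind of check that shows whether any rows need the fix.

```python
import pandas as pd

# Made-up sample: the first row has a borough but no neighborhood.
sample = pd.DataFrame({
    "PostalCode": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods"],
})

unassigned = sample["Neighborhood"] == "Not assigned"
print(unassigned.sum())  # how many rows need the borough copied over

# Copy the borough name into the neighborhood for those rows only.
sample.loc[unassigned, "Neighborhood"] = sample.loc[unassigned, "Borough"]
```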

- #### Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.

- #### In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.

In [8]:
filtered_df.shape

(103, 3)

# ---- PART 1 ENDS HERE ----

# ---- PART 2 BEGINS ----
_Now that you have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, in order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood._

_In an older version of this course, we were leveraging the Google Maps Geocoding API to get the latitude and the longitude coordinates of each neighborhood. However, recently Google started charging for their API: http://geoawesomeness.com/developers-up-in-arms-over-google-maps-api-insane-price-hike/, so we will use the Geocoder Python package instead: https://geocoder.readthedocs.io/index.html._

_The problem with this Package is you have to be persistent sometimes in order to get the geographical coordinates of a given postal code. So you can make a call to get the latitude and longitude coordinates of a given postal code and the result would be None, and then make the call again and you would get the coordinates. So, in order to make sure that you get the coordinates for all of our neighborhoods, you can run a while loop for each postal code. Taking postal code M5G as an example, your code would look something like this:_

In [193]:
'''
import geocoder # import geocoder

# initialize your variable to None
lat_lng_coords = None

# loop until you get the coordinates
while(lat_lng_coords is None):
  g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
  lat_lng_coords = g.latlng

latitude = lat_lng_coords[0]
longitude = lat_lng_coords[1]
'''

_Given that this package can be very unreliable, in case you are not able to get the geographical coordinates of the neighborhoods using the Geocoder package, here is a link to a csv file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data._
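
The while loop in the example above never stops if the service keeps returning None. A minimal sketch of a bounded retry follows, so that the code can give up and fall back to the csv file. `geocode_with_retry` and `flaky_lookup` are hypothetical names; `flaky_lookup` only simulates an unreliable geocoder and does not call any real service.

```python
import time

def geocode_with_retry(lookup, query, max_attempts=5, delay_seconds=0.0):
    """Call lookup(query) until it returns coordinates or attempts run out."""
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(delay_seconds)  # pause before trying again
    return None  # caller can fall back to the csv file

# Hypothetical stand-in for a flaky geocoder: fails twice, then succeeds.
calls = {"count": 0}

def flaky_lookup(query):
    calls["count"] += 1
    return None if calls["count"] < 3 else [43.657952, -79.387383]

coords = geocode_with_retry(flaky_lookup, "M5G, Toronto, Ontario")
print(coords, calls["count"])
```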

_Use the Geocoder package or the csv file to create the following dataframe: \[IMAGE IN COURSERA\]_

_Important Note: There is a limit on how many times you can call the geocoder.google function. It is 2500 times per day. This should be way more than enough for you to get acquainted with the package and to use it to get the geographical coordinates of the neighborhoods in Toronto._

_Once you are able to create the above dataframe, submit a link to the new Notebook on your Github repository._

### In the next cells I tried to connect to geocoder and geopy.geocoders, but I decided to use the given .csv file after all.

In [243]:
#df_ll = pd.DataFrame(columns = ['PostalCode','Latitude','Longitude'])
#df_ll['PostalCode'] = filtered_df['PostalCode']
#df_ll

In [195]:
#!pip install geocoder

In [197]:
#from geopy.geocoders import Nominatim
#
#geolocator = Nominatim(user_agent="IBMcoursething")
#
#def get_lat_lng(row):
#    print(row)
#    location = geolocator.geocode('{}, Toronto, ON'.format(row[0]))
#    print(location[:])
#    return location[-1]

In [198]:
#df_ll = df_ll.apply(get_lat_lng,axis=0,result_type='expand')

In [201]:
df_ll=pd.read_csv('http://cocl.us/Geospatial_data')

In [204]:
df_ll.columns = ['PostalCode','Latitude','Longitude']
df_ll.columns

Index(['PostalCode', 'Latitude', 'Longitude'], dtype='object')

In [205]:
df_ll

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


### After obtaining the dataframe I merged it with the previous one

In [221]:
merged_df = pd.merge(df_ll, filtered_df, on='PostalCode')

In [235]:
merged_df

Unnamed: 0,PostalCode,Latitude,Longitude,Borough,Neighborhood
0,M1B,43.806686,-79.194353,Scarborough,"Malvern, Rouge"
1,M1C,43.784535,-79.160497,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,43.763573,-79.188711,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,43.770992,-79.216917,Scarborough,Woburn
4,M1H,43.773136,-79.239476,Scarborough,Cedarbrae
...,...,...,...,...,...
98,M9N,43.706876,-79.518188,York,Weston
99,M9P,43.696319,-79.532242,Etobicoke,Westmount
100,M9R,43.688905,-79.554724,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ..."
101,M9V,43.739416,-79.588437,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."
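
The default merge is an inner join, so a postal code missing from either dataframe would be dropped silently. The sketch below shows one way to check for that with the `indicator` and `validate` options of `pd.merge`. The small frames are samples built for illustration, with the M1C and M3A rows deliberately left unmatched.

```python
import pandas as pd

coords = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})
hoods = pd.DataFrame({
    "PostalCode": ["M1B", "M3A"],
    "Borough": ["Scarborough", "North York"],
    "Neighborhood": ["Malvern, Rouge", "Parkwoods"],
})

# Outer join keeps every code; _merge says which side each row came from.
# validate raises if a postal code is duplicated in either frame.
check = pd.merge(coords, hoods, on="PostalCode", how="outer",
                 validate="one_to_one", indicator=True)
unmatched = check[check["_merge"] != "both"]
print(unmatched[["PostalCode", "_merge"]])
```

An empty `unmatched` frame would mean the inner join loses nothing.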


### Then I reordered the columns to match the dataframe image in Coursera

In [238]:
merged_df = merged_df[['PostalCode','Borough','Neighborhood','Latitude','Longitude']]

### And finally I chose the specific Postal Codes that appear in the dataframe image in Coursera, in the same order.

In [242]:
items = ['M5G','M2H','M4B','M1J','M4G','M4M','M1R','M9V','M9L','M5V','M1B','M5A']
merged_df.iloc[pd.Index(merged_df['PostalCode']).get_indexer(items)].reset_index(drop=True)
#merged_df.loc[merged_df['PostalCode'].isin(['M5G','M2H','M4B','M1J','M4G','M4M','M1R','M9V','M9L','M5V','M1B','M5A'])] 

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
1,M2H,North York,Hillcrest Village,43.803762,-79.363452
2,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
3,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
4,M4G,East York,Leaside,43.70906,-79.363452
5,M4M,East Toronto,Studio District,43.659526,-79.340923
6,M1R,Scarborough,"Wexford, Maryvale",43.750072,-79.295849
7,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437
8,M9L,North York,Humber Summit,43.756303,-79.565963
9,M5V,Downtown Toronto,"CN Tower, King and Spadina, Railway Lands, Har...",43.628947,-79.39442
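
The commented-out `isin` line returns the matching rows in table order, which is why the cell uses `get_indexer` to follow the order of the list. One caveat: `get_indexer` returns -1 for a code that is not found, and `iloc` would then quietly pick the last row. A sketch of an alternative that raises a KeyError on a missing code, using a made-up three-row frame:

```python
import pandas as pd

frame = pd.DataFrame({
    "PostalCode": ["M1B", "M4B", "M5G"],
    "Borough": ["Scarborough", "East York", "Downtown Toronto"],
})
wanted = ["M5G", "M1B"]

# isin keeps the order of the table, not the order of the list.
in_table_order = frame[frame["PostalCode"].isin(wanted)]

# Label-based lookup follows the list order and fails loudly on unknown codes.
in_list_order = frame.set_index("PostalCode").loc[wanted].reset_index()
print(in_list_order)
```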


# ---- PART 2 ENDS HERE ----

# ---- PART 3 BEGINS ----

_Explore and cluster the neighborhoods in Toronto. You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City data. It is up to you._

_Just make sure:_

1. _to add enough Markdown cells to explain what you decided to do and to report any observations you make._
2. _to generate maps to visualize your neighborhoods and how they cluster together._

_Once you are happy with your analysis, submit a link to the new Notebook on your Github repository. (3 marks)_

### First I will select all rows where Toronto is included in the Borough name.

In [291]:
toronto_df = merged_df[merged_df['Borough'].str.contains('Toronto')]
toronto_df

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
37,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
42,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572
43,M4M,East Toronto,Studio District,43.659526,-79.340923
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
45,M4P,Central Toronto,Davisville North,43.712751,-79.390197
46,M4R,Central Toronto,"North Toronto West, Lawrence Park",43.715383,-79.405678
47,M4S,Central Toronto,Davisville,43.704324,-79.38879
48,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316
49,M4V,Central Toronto,"Summerhill West, Rathnelly, South Hill, Forest...",43.686412,-79.400049


### Next I installed the Folium library so that maps can be shown within this notebook.

In [292]:
!pip install folium==0.5.0
import folium



### Then I plotted the neighborhoods from the dataframe I already have on a map, to check that the data displays properly before starting to cluster

In [293]:
central_toronto_lat = 43.665860
central_toronto_lng = -79.388790

toronto_map = folium.Map(location=[central_toronto_lat,central_toronto_lng],zoom_start=12)

for lat,lng,label in zip(toronto_df.Latitude,toronto_df.Longitude,toronto_df.Neighborhood):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat,lng],
        radius=5,
        popup=label,
        fill=True,
        color='blue',
        fill_opacity=0.5
    ).add_to(toronto_map)

# MAPS DON'T LOAD ON GITHUB, IF YOU WANT TO SEE THEM GO TO THIS LINK:
https://nbviewer.jupyter.org/github/melipass/Coursera_Capstone/blob/master/Week3Notebook.ipynb

In [294]:
toronto_map

### Next I ran the pd.get_dummies() function to turn each distinct value of the Borough column into its own indicator column, so the boroughs can be handled numerically.

In [295]:
toronto_df_onehot = pd.get_dummies(toronto_df[['Borough']],prefix="",prefix_sep="")
toronto_df_onehot

Unnamed: 0,Central Toronto,Downtown Toronto,East Toronto,West Toronto
37,0,0,1,0
41,0,0,1,0
42,0,0,1,0
43,0,0,1,0
44,1,0,0,0
45,1,0,0,0
46,1,0,0,0
47,1,0,0,0
48,1,0,0,0
49,1,0,0,0
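
The output above shows 0/1 integers. Recent pandas versions return True/False columns from get_dummies by default, so passing `dtype=int` is one way to keep the same output across versions. A small sketch on made-up rows:

```python
import pandas as pd

# Made-up sample of the Borough column.
boroughs = pd.DataFrame({
    "Borough": ["East Toronto", "Central Toronto", "East Toronto"],
})

# One indicator column per distinct borough, forced to integer 0/1.
onehot = pd.get_dummies(boroughs[["Borough"]], prefix="", prefix_sep="", dtype=int)
print(onehot)
```

Each row has exactly one 1, in the column of its own borough.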


### In this step I created the clusters based on the previous dataframe, using KMeans

In [296]:
from sklearn.cluster import KMeans
kclusters = 4
#toronto_df_clustering = toronto_df.drop('Neighborhood',1)
#toronto_df_clustering = toronto_df_clustering.drop('PostalCode',1)
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_df_onehot)
kmeans.labels_[0:10]

array([3, 3, 3, 3, 1, 1, 1, 1, 1, 1], dtype=int32)
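
Because the only feature is the one-hot borough, the input has just as many distinct points as there are boroughs. With kclusters = 4 and four boroughs, a partition with zero within-cluster distance simply groups the rows by borough, which is consistent with the labels above (the East Toronto rows share one label and the Central Toronto rows share another). The sketch below illustrates this on made-up one-hot rows, using NumPy only rather than KMeans itself.

```python
import numpy as np

# Made-up one-hot rows for four boroughs, one column per borough.
onehot_rows = np.array([
    [0, 0, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
])

# Only four distinct points exist, however many rows there are.
distinct_rows = np.unique(onehot_rows, axis=0)

# For one-hot rows, the position of the 1 identifies the borough group.
group_ids = onehot_rows.argmax(axis=1)
print(len(distinct_rows), group_ids)
```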

### Now I add the resulting clusters to the Toronto dataframe I'm working with

In [297]:
toronto_df.insert(0,'ClusterLabels',kmeans.labels_)

In [298]:
toronto_df

Unnamed: 0,ClusterLabels,PostalCode,Borough,Neighborhood,Latitude,Longitude
37,3,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,3,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
42,3,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572
43,3,M4M,East Toronto,Studio District,43.659526,-79.340923
44,1,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
45,1,M4P,Central Toronto,Davisville North,43.712751,-79.390197
46,1,M4R,Central Toronto,"North Toronto West, Lawrence Park",43.715383,-79.405678
47,1,M4S,Central Toronto,Davisville,43.704324,-79.38879
48,1,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316
49,1,M4V,Central Toronto,"Summerhill West, Rathnelly, South Hill, Forest...",43.686412,-79.400049


In [289]:
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

### Finally, I draw each point on the map in a color that depends on the cluster it belongs to. The 4 clusters are shown in purple, red, cyan and olive.


In [300]:
#central_toronto_lat = 43.665860
#central_toronto_lng = -79.388790

toronto_map_clusters = folium.Map(location=[central_toronto_lat,central_toronto_lng],zoom_start=12)

# One hex color per cluster, evenly spaced along matplotlib's rainbow colormap.
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(c) for c in colors_array]

for lat,lng,label,clus in zip(toronto_df.Latitude,toronto_df.Longitude,toronto_df.Neighborhood,toronto_df.ClusterLabels):
    label = folium.Popup(label + ', Cluster ' + str(clus), parse_html=True)
    folium.CircleMarker(
        [lat,lng],
        radius=5,
        popup=label,
        fill=True,
        color=rainbow[clus-1],
        fill_opacity=0.5,
        fill_color=rainbow[clus-1]
    ).add_to(toronto_map_clusters)
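
A note on the `rainbow[clus-1]` index used in the cell above: cluster labels start at 0, so label 0 becomes index -1, which Python reads as the last color in the list. Every cluster still gets its own color, only shifted by one position. The sketch below spells out the mapping, assuming the four colors come out of the colormap in the order purple, cyan, olive, red. That order is my assumption, and the color names stand in for the actual hex codes.

```python
# Assumed palette order, with names standing in for the hex codes.
palette = ["purple", "cyan", "olive", "red"]

# Same indexing as the map cell: label 0 wraps around to the last entry.
colour_by_cluster = {clus: palette[clus - 1] for clus in range(len(palette))}
print(colour_by_cluster)
```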

In [301]:
toronto_map_clusters

# This finishes the last part. Thanks for taking a look at my Week 3 assignment submission!
