# Capstone Project 2: Neighborhoods in Toronto
<BLOCKQUOTE>This section of the notebook explores and visualizes the neighborhood clusters of Toronto using k-means. Data is taken from two sources:

<UL>
<LI><A HREF="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"> List_of_postal_codes_of_Canada: M</A>
<LI><A HREF="http://cocl.us/Geospatial_data"> Toronto Lat/Long data</A>
</UL>
    
#### Link to Github repository: 

### Step 1: Import Libraries

In [1]:
import pandas as pd   # library for data analysis
import numpy as np    # library to handle data in a vectorized manner
import random         # library for random number generation
import requests       # library to handle requests

#Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

!conda install -c conda-forge folium=0.5.0 --yes  # installs folium; comment this line out if folium is already installed
import folium # plotting library
print('folium imported')

from sklearn.cluster import KMeans  # k-means clustering from scikit-learn

print('All libraries imported')

Solving environment: done

# All requested packages already installed.

folium imported
All libraries imported


### Step 2: Load two datasets -- Toronto Neighborhoods and Toronto Lat/Long
<UL>
<LI>Load the Lat/Long data directly from <A HREF="http://cocl.us/Geospatial_data"> http://cocl.us/Geospatial_data</A>.
<LI>Use pandas read_html to scrape the Toronto Neighborhoods table from <A HREF="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M">List_of_postal_codes_of_Canada:_M</A>.
</UL>

##### Load the Toronto Lat/Long data directly from the url and read into a dataframe called df_geo.

In [3]:
geo_data='http://cocl.us/Geospatial_data/Geospatial_Coordinates.csv'  # download directly from the url
df_geo=pd.read_csv(geo_data)

print('\n There are {} rows and {} postal codes in the Lat/Long dataframe: \n'.format(
              ( df_geo.shape[0]), len(df_geo['Postal Code'].unique()) ))
df_geo.head()


 There are 103 rows and 103 postal codes in the Lat/Long dataframe: 



,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


##### Use pandas read_html() to scrape the Toronto Neighborhoods table from the website and read into a dataframe called df_nbh.

In [4]:
nbh_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
df = pd.read_html(nbh_url)
df_nbh = df[0]

# Change the column name from Postcode to PostalCode
df_nbh=df_nbh.rename(columns = {'Postcode':'PostalCode'})

# Sort the dataframe by PostalCode and Neighborhood in ascending order(default), and reset the index
df_nbh.sort_values(['PostalCode', 'Neighborhood'], axis=0, ascending=True, inplace=True) 
df_nbh.reset_index(drop=True, inplace=True)

print('\n There are {} rows and {} postal codes in the original Neighborhoods dataframe, shown below: \n'.format(
              ( df_nbh.shape[0]), len(df_nbh['PostalCode'].unique()) ))
df_nbh.head()


 There are 287 rows and 180 postal codes in the original Neighborhoods dataframe, shown below: 



,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M1B,Scarborough,Malvern
2,M1B,Scarborough,Rouge
3,M1C,Scarborough,Highland Creek
4,M1C,Scarborough,Port Union


Observation: The two datasets differ in size. The Neighborhoods dataframe has 287 rows and 180 unique postal codes, while the Lat/Long dataframe has 103 rows and 103 unique postal codes. After processing the Neighborhoods dataframe in Step 3 below, we expect the number of postal codes in both datasets to become equal.

### Step 3: Pre-process the Neighborhood dataframe
<UL>
<LI>Task 1: Drop rows with 'Not assigned' Boroughs
<LI>Task 2: For 'Not assigned' Neighborhoods, make the Borough name the Neighborhood name
<LI>Task 3: Combine rows that share a PostalCode (postal codes with more than one neighborhood) into a single row
</UL>

In [5]:
df_nbh.columns # First, inspect the column names

Index(['PostalCode', 'Borough', 'Neighborhood'], dtype='object')

#### Task 1: Remove dataframe rows with Borough = Not assigned

In [6]:
df_nbh = df_nbh[df_nbh['Borough'] != 'Not assigned']

print('\n This new dataframe has {} rows and {} postal codes, shown below: \n'.format(
                  ( df_nbh.shape[0]), len(df_nbh['PostalCode'].unique()) ))
df_nbh.head()


 This new dataframe has 210 rows and 103 postal codes, shown below: 



,PostalCode,Borough,Neighborhood
1,M1B,Scarborough,Malvern
2,M1B,Scarborough,Rouge
3,M1C,Scarborough,Highland Creek
4,M1C,Scarborough,Port Union
5,M1C,Scarborough,Rouge Hill


##### Observation: We see that both datasets now have the same number of postal codes (103).
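Equal counts do not by themselves guarantee that the two dataframes contain the same postal codes. A minimal sketch of a stricter check is given here, using small stand-in dataframes (`geo` and `nbh` are hypothetical names and the rows are a toy subset, not the full data):

```python
import pandas as pd

# Toy stand-ins for df_geo and df_nbh (illustrative subset only)
geo = pd.DataFrame({'Postal Code': ['M1B', 'M1C', 'M1E']})
nbh = pd.DataFrame({'PostalCode': ['M1B', 'M1B', 'M1C', 'M1E']})

geo_codes = set(geo['Postal Code'])
nbh_codes = set(nbh['PostalCode'])

# Codes present in one dataframe but missing from the other
only_in_geo = geo_codes - nbh_codes
only_in_nbh = nbh_codes - geo_codes

print('Same set of postal codes:', geo_codes == nbh_codes)
print('Only in geo:', sorted(only_in_geo), '| only in nbh:', sorted(only_in_nbh))
```

Applied to the real `df_geo` and `df_nbh`, two empty difference sets would confirm that every postal code can be matched in the join of Step 4.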

#### Task 2: For 'Not assigned' Neighborhoods, make the Borough name the Neighborhood name

In [7]:
# Work on an explicit copy so that the assignment below modifies df_nbh itself
df_nbh = df_nbh.copy()
nbh_col = df_nbh.columns.get_loc('Neighborhood')  # column position of Neighborhood

for i in range(0, df_nbh.shape[0]):
    if df_nbh['Neighborhood'].values[i] == 'Not assigned':  # check each Neighborhood value to see if it is 'Not assigned'
        # Print the current row, replace the Neighborhood of this row only with its Borough name, and print the corrected row.
        print("\nCurrent:   Row=", i, " Borough=", df_nbh['Borough'].values[i], "  Neighborhood=", df_nbh['Neighborhood'].values[i] )
        df_nbh.iloc[i, nbh_col] = df_nbh['Borough'].values[i]
        print("Corrected: Row=", i, " Borough=", df_nbh['Borough'].values[i], "  Neighborhood=", df_nbh['Neighborhood'].values[i], "\n" )

df_nbh.head()


Current:   Row= 182  Borough= Queen's Park   Neighborhood= Not assigned
Corrected: Row= 182  Borough= Queen's Park   Neighborhood= Queen's Park 



,PostalCode,Borough,Neighborhood
1,M1B,Scarborough,Malvern
2,M1B,Scarborough,Rouge
3,M1C,Scarborough,Highland Creek
4,M1C,Scarborough,Port Union
5,M1C,Scarborough,Rouge Hill
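Task 2 can also be expressed without a loop. The sketch below uses a boolean mask on a toy dataframe; the second 'Not assigned' row (M9A) is a hypothetical addition, included only to show that each row receives its own Borough name:

```python
import pandas as pd

# Toy data: the M9A row is hypothetical and only illustrates a second unassigned row
toy = pd.DataFrame({
    'PostalCode': ['M1B', 'M7A', 'M9A'],
    'Borough': ['Scarborough', "Queen's Park", 'Etobicoke'],
    'Neighborhood': ['Malvern', 'Not assigned', 'Not assigned'],
})

# Select the rows whose Neighborhood is unassigned
mask = toy['Neighborhood'] == 'Not assigned'

# Copy the Borough name into Neighborhood for those rows only
toy.loc[mask, 'Neighborhood'] = toy.loc[mask, 'Borough']
print(toy)
```

Because the assignment is restricted to the masked rows, other rows and other columns are left untouched.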


#### Task 3: Combine rows that share a PostalCode into a single row

In [8]:
# Nested loops to compare the postal codes of two rows at a time.
# prev_row is the index for the first row to be compared ("previous row").
# cur_row is the index for the second row to be compared ("current row").

for i in range(1,4):  # run the loop three times to consolidate all postal codes and their neighborhoods. 
    df_nbh.reset_index(drop=True, inplace=True)  # reset the index at the start of each outer loop

    for prev_row in range (0, df_nbh.shape[0]):  
        if(prev_row < df_nbh.shape[0]-1):
            cur_row = prev_row+1
            Px=df_nbh['PostalCode'].values[prev_row]  # postal code for the first row
            Py=df_nbh['PostalCode'].values[cur_row]   # postal code for the second row
           
    # if postal codes match, concatenate the Neighborhood names and delete the first row from the dataframe:
            if (Px==Py):
                Nx=df_nbh['Neighborhood'].values[prev_row] 
                Ny=df_nbh['Neighborhood'].values[cur_row]
            
                df_nbh.replace( to_replace=df_nbh['Neighborhood'].values[cur_row], 
                                 value=(Nx+", "+Ny), inplace=True )            
              
                df_nbh.drop(df_nbh.index[prev_row], inplace=True)  #Drop the first row in the match

print('\n This new dataframe has {} rows and {} postal codes, shown below: \n'.format(
                  ( df_nbh.shape[0]), len(df_nbh['PostalCode'].unique()) ))    
df_nbh.head()


 This new dataframe has 103 rows and 103 postal codes, shown below: 



,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Highland Creek, Port Union, Rouge Hill"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


##### Observation: After pre-processing the Neighborhoods dataframe, the number of rows and postal codes are now the same, as expected.
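As an alternative to the nested loops in Task 3, the same consolidation can be sketched with a pandas `groupby`. This is not the method used in the notebook. It avoids dropping rows while iterating, and it does not depend on `replace`, which matches a neighborhood name wherever it occurs in the dataframe. The toy rows below are copied from the tables above:

```python
import pandas as pd

# A few rows taken from the Neighborhoods table shown earlier
toy = pd.DataFrame({
    'PostalCode': ['M1B', 'M1B', 'M1C', 'M1C', 'M1C', 'M1G'],
    'Borough': ['Scarborough'] * 6,
    'Neighborhood': ['Malvern', 'Rouge', 'Highland Creek',
                     'Port Union', 'Rouge Hill', 'Woburn'],
})

# Group rows by postal code (and borough), then join the neighborhood names
combined = (
    toy.groupby(['PostalCode', 'Borough'])['Neighborhood']
       .agg(', '.join)
       .reset_index()
)
print(combined)
```

The result has one row per postal code, with the neighborhood names joined in their original row order.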

### Step 4: Left-join the Neighborhoods dataframe with the Lat/Long dataframe

In [9]:
# check the column names of each dataframe.
print("df_geo columns: ", df_geo.columns)
print("df_nbh columns: ", df_nbh.columns)

# Create a new dataframe df_tor. Join df_nbh and df_geo on the PostalCode column. 
df_tor = df_nbh.join(df_geo.set_index('Postal Code'),on='PostalCode') 

print('\n This dataframe has {} rows and {} postal codes, shown below: \n'.format(
                  ( df_tor.shape[0]), len(df_tor['PostalCode'].unique()) ))    

df_tor.reset_index(drop=True, inplace=True)
df_tor.head()

df_geo columns:  Index(['Postal Code', 'Latitude', 'Longitude'], dtype='object')
df_nbh columns:  Index(['PostalCode', 'Borough', 'Neighborhood'], dtype='object')

 This dataframe has 103 rows and 103 postal codes, shown below: 



,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Port Union, Rouge Hill",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
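The same left join can also be written with `DataFrame.merge`, which offers optional sanity checks. The sketch below is an illustration on toy dataframes (`nbh` and `geo` are stand-in names), not the notebook's own code:

```python
import pandas as pd

# Toy stand-ins for df_nbh and df_geo, with values taken from the tables above
nbh = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighborhood': ['Malvern, Rouge', 'Highland Creek, Port Union, Rouge Hill'],
})
geo = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C', 'M1E'],
    'Latitude': [43.806686, 43.784535, 43.763573],
    'Longitude': [-79.194353, -79.160497, -79.188711],
})

merged = nbh.merge(
    geo,
    how='left',
    left_on='PostalCode',
    right_on='Postal Code',
    validate='one_to_one',  # raises if a postal code is duplicated on either side
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
).drop(columns='Postal Code')

# Rows of the left dataframe that found no coordinates
unmatched = merged[merged['_merge'] != 'both']
print(merged)
print('Rows without coordinates:', len(unmatched))
```

With `how='left'`, every row of the left dataframe is kept, so any postal code without a match would appear with missing Latitude and Longitude values and `_merge == 'left_only'`.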


### Submit this cell's results: we create a separate dataframe in the row order specified in the assignment instructions:

In [10]:
df_submit = df_tor # Note: this binds df_submit to the same object as df_tor; no copy is made
df_submit.index = df_submit.PostalCode # Dataframe index = PostalCode (this also changes the index of df_tor)

row_list = ['M5G', 'M2H', 'M4B', 'M1J', 'M4G', 'M4M', 'M1R', 'M9V', 'M9L', 'M5V', 'M1B', 'M5A'] # postal codes, listed in order for submission
df_submit=df_submit.loc[row_list] # locate and keep only the postal codes needed for submission

df_submit.reset_index(drop=True, inplace=True)

In [11]:
df_submit

,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
1,M2H,North York,Hillcrest Village,43.803762,-79.363452
2,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
3,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
4,M4G,East York,Leaside,43.70906,-79.363452
5,M4M,East Toronto,Studio District,43.659526,-79.340923
6,M1R,Scarborough,"Maryvale, Wexford",43.750072,-79.295849
7,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ...",43.739416,-79.588437
8,M9L,North York,Humber Summit,43.756303,-79.565963
9,M5V,Downtown Toronto,"Bathurst Quay, CN Tower, Harbourfront West, Is...",43.628947,-79.39442
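Selecting rows in a prescribed order can also be done without modifying the source dataframe. A minimal sketch on toy data follows (`tor`, `submit` and the two-entry `row_list` are hypothetical stand-ins):

```python
import pandas as pd

tor = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C', 'M1E'],
    'Borough': ['Scarborough'] * 3,
})

row_list = ['M1E', 'M1B']  # hypothetical submission order

submit = (
    tor.set_index('PostalCode')  # returns a new dataframe; tor keeps its own index
       .loc[row_list]            # select the rows in the listed order
       .reset_index()            # turn PostalCode back into a regular column
)
print(submit)
print(tor.index.tolist())
```

Since `set_index` returns a new object here, the original dataframe keeps its default integer index and can be reused in later steps unchanged.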


### Step 5: Run k-means and visualize Toronto's neighborhoods with k=5
<UL>
<LI>Red star = City center
<LI>Colored circles = Neighborhoods (one color per cluster), with a pop-up showing Neighborhood name, Borough and cluster
</UL>

In [12]:
# Create a new dataframe df_tor_clusters5, to be used in k-means modeling.
# Since k-means does not handle categorical variables,
# drop all the columns in df_tor_clusters5 except for the latitude/longitude columns.

df_tor5 = df_tor.copy() # independent copy for k=5, so that columns added later do not change df_tor
df_tor_clusters5 = df_tor5.drop(['PostalCode', 'Borough', 'Neighborhood'], axis=1)
df_tor_clusters5.head()

PostalCode,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
M1G,43.770992,-79.216917
M1H,43.773136,-79.239476


#### Run k=5 and k=11 side by side

In [16]:
k5 = 5  #specify 5 clusters
k_means5 = KMeans(init="k-means++", n_clusters=k5, n_init=12)
tor_clusters5 = k_means5.fit(df_tor_clusters5)
labels5 = k_means5.labels_

In [17]:
# Add a Cluster Label column to the new Toronto dataframe df_tor5
# and place the k_means labels in 'Cluster Label'

df_tor5.insert(0, 'Cluster Label', labels5)
df_tor5.reset_index(drop=True, inplace=True)  # replace the PostalCode index with a default integer index
df_tor5.head()

,Cluster Label,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,3,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,3,M1C,Scarborough,"Highland Creek, Port Union, Rouge Hill",43.784535,-79.160497
2,3,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,3,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


In [18]:
df_tor5.groupby('Cluster Label')[['Latitude', 'Longitude']].mean() # mean lat/long (center) of each cluster

Cluster Label,Latitude,Longitude
0,43.707334,-79.324039
1,43.681909,-79.527637
2,43.662429,-79.397089
3,43.776575,-79.242084
4,43.748934,-79.418171
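The per-cluster means above can be cross-checked against the centers stored on the fitted model. The sketch below assumes scikit-learn's `cluster_centers_` attribute and uses six coordinates copied from the tables in this notebook. It also sets `random_state`, which the notebook does not, so that the labels are repeatable between runs:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Six coordinates from the tables above: three in Scarborough, three near downtown
coords = pd.DataFrame({
    'Latitude':  [43.806686, 43.784535, 43.763573, 43.657952, 43.659526, 43.628947],
    'Longitude': [-79.194353, -79.160497, -79.188711, -79.387383, -79.340923, -79.394420],
})

# random_state is an addition for reproducibility; k=2 is chosen for the toy data
km = KMeans(n_clusters=2, init='k-means++', n_init=12, random_state=0).fit(coords)

# Attach the labels and average the coordinates per cluster
labeled = coords.assign(**{'Cluster Label': km.labels_})
group_means = labeled.groupby('Cluster Label')[['Latitude', 'Longitude']].mean()

print(group_means)
print('Matches cluster_centers_:', np.allclose(group_means.to_numpy(), km.cluster_centers_))
```

On well-separated data such as this, the group means and the model's centers should agree up to floating-point error, since each k-means center is the mean of the points assigned to it.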


In [19]:
df_tor5.groupby('Cluster Label').count() # number of postalcodes in each cluster

Cluster Label,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,18,18,18,18,18
1,24,24,24,24,24
2,30,30,30,30,30
3,14,14,14,14,14
4,17,17,17,17,17


In [20]:
df_tor5.groupby(['Borough', 'Cluster Label']).count()

Borough,Cluster Label,PostalCode,Neighborhood,Latitude,Longitude
Central Toronto,2,5,5,5,5
Central Toronto,4,4,4,4,4
Downtown Toronto,2,19,19,19,19
East Toronto,0,5,5,5,5
East York,0,5,5,5,5
Etobicoke,1,11,11,11,11
Mississauga,1,1,1,1,1
North York,0,5,5,5,5
North York,1,6,6,6,6
North York,4,13,13,13,13


We make 3 general observations:
    1. The number of postal codes per cluster ranges from a low of 14 (Cluster 3) to a high of 30 (Cluster 2).
    2. There are 11 Boroughs.
    3. Many Boroughs ended up in more than one Cluster.
       For example, Central Toronto is in Clusters 2 and 4, North York is in Clusters 0, 1 and 4, etc.
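Which boroughs are split across clusters can be computed directly instead of read off the table. A small sketch follows, using the (Borough, Cluster Label, count) rows visible in the output above (`counts` and the column name `PostalCodes` are illustrative):

```python
import pandas as pd

# (Borough, Cluster Label, number of postal codes), copied from the visible table rows
rows = [
    ('Central Toronto', 2, 5), ('Central Toronto', 4, 4),
    ('Downtown Toronto', 2, 19), ('East Toronto', 0, 5),
    ('East York', 0, 5), ('Etobicoke', 1, 11), ('Mississauga', 1, 1),
    ('North York', 0, 5), ('North York', 1, 6), ('North York', 4, 13),
]
counts = pd.DataFrame(rows, columns=['Borough', 'Cluster Label', 'PostalCodes'])

# Number of distinct clusters each borough appears in
clusters_per_borough = counts.groupby('Borough')['Cluster Label'].nunique()

# Keep only the boroughs that span more than one cluster
split_boroughs = clusters_per_borough[clusters_per_borough > 1]
print(split_boroughs)
```

On the full dataframe, the equivalent call would be `df_tor5.groupby('Borough')['Cluster Label'].nunique()`.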

### Visualize the clusters and City Center
#### Red star = Toronto city center 
#### Circles = Toronto Neighborhoods with pop-up label of Neighborhood, Borough, cluster

In [23]:
# Create map with Toronto's city center coordinates
tor_lat = 43.6532 
tor_long = -79.3832
map = folium.Map(location=[tor_lat, tor_long], zoom_start=10)

# Add popup red star marker for the City Center
folium.Marker(
    location=[tor_lat, tor_long],
    popup='Toronto City Center',
    icon=folium.Icon(color='red', icon='star')
).add_to(map)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, k5))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, nbh, boro, cluster in zip(df_tor5['Latitude'], df_tor5['Longitude'], df_tor5['Neighborhood'], df_tor5['Borough'], df_tor5['Cluster Label']):
    label = '{}, {}, cluster {}'.format(nbh, boro, cluster)
    label = folium.Popup(label, parse_html=True)

    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # cluster labels run from 0 to k5-1, so they index the palette directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7,
        parse_html=False).add_to(map)

map
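The mapping from cluster label to marker color can be inspected on its own, without drawing a map. A minimal sketch follows, assuming the same matplotlib `rainbow` colormap used in the cell above (`palette` and `color_for` are illustrative names):

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

k = 5  # number of clusters, as in the k=5 run

# One hex color per cluster, evenly spaced along the rainbow colormap
palette = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, k))]

# Explicit lookup from cluster label (0 to k-1) to color
color_for = {label: palette[label] for label in range(k)}

for label, hex_color in color_for.items():
    print('cluster', label, '->', hex_color)
```

Printing the lookup makes it easy to match a circle color on the map to its cluster label.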