<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto</font></h1> 

<h3 align=left><font size = 5>1. Import libraries</font></h3> 

In [1]:
# Install the packages this notebook needs; comment out any line whose package is already installed.
!conda install -c conda-forge geopy --yes
!conda install -c conda-forge folium=0.5.0 --yes
!conda install lxml --yes
!conda update -c conda-forge -y pandas
!conda install -c anaconda beautifulsoup4 --yes
print('Libraries installed.')

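Before re-running the install cell, it can help to check which packages are already importable, since each `conda install` re-solves the environment. A minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical, and the list holds import names rather than conda package names.

```python
import importlib.util

def missing_packages(module_names):
    """Return the names from module_names that cannot be imported here."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]

# Import names, not conda package names: beautifulsoup4 is imported as bs4,
# scikit-learn as sklearn.
required = ["numpy", "pandas", "geopy", "folium", "lxml", "bs4", "sklearn"]
print("Missing:", missing_packages(required))
```

Only the packages reported as missing need an install line.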

In [2]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0; older versions: from pandas.io.json import json_normalize)
from bs4 import BeautifulSoup
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

print('Libraries imported.')

Libraries imported.


<h2 align=left><font size = 5>2. Scraping the table from the Wikipedia page </font></h2> 
<h3 align=left><font size = 3>Use the BeautifulSoup package or <b>any other way</b> you are comfortable with </font></h3> 
<h3 align=left><font size = 3>I am using pd.read_html </font></h3> 



In [3]:
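# read_html returns a list of every table on the page whose text matches 'Neighborhood'
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M', match='Neighborhood')
df = tables[0] # the first match is the postal code table (already a DataFrame)
df.head()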

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront


<h3 align=left><font size = 3> - The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood <br>
- Ignore cells with a borough that is Not assigned. </font></h3> 

In [4]:
df.columns= ["PostalCode","Borough","Neighborhood"] # The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood

In [5]:
df.drop(df.index[df['Borough'] == 'Not assigned'], inplace = True) # Ignore cells with a borough that is Not assigned.

In [6]:
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
5,M6A,North York,Lawrence Manor / Lawrence Heights
6,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
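The rename-and-filter step can also be written without in-place mutation. The sketch below runs on a small toy frame (the rows are illustrative, not the scraped data) and renames by mapping, so it does not depend on the order of the scraped columns.

```python
import pandas as pd

# Toy rows shaped like the scraped table; illustrative only.
raw = pd.DataFrame({
    "Postal code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighborhood": [None, "Parkwoods", "Victoria Village"],
})

# Rename by mapping so the result does not depend on column order.
clean = raw.rename(columns={"Postal code": "PostalCode"})

# Keep only rows whose borough is assigned; the original index labels are kept.
clean = clean[clean["Borough"] != "Not assigned"]
print(clean)
```

As in the output above, the surviving rows keep their original index labels, which is why the notebook's filtered frame starts at index 2.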


<h3 align=left><font size = 3> Merge cells with the same postal code. <br>
More than one neighborhood can exist in one postal code area; separate them with a comma.  </font></h3> 

In [7]:
# Merge rows that share a postal code, joining their neighborhoods with ","
df = df.groupby(['PostalCode', 'Borough'], as_index=False).agg(lambda x: ','.join(x))
# The source table separates neighborhoods inside a cell with " / "; use ", " instead
df["Neighborhood"] = df["Neighborhood"].str.replace(" / ", ", ", regex=False)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
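To see what the grouping step does when a postal code really does span several rows, here is a minimal sketch on a toy frame. The one-neighborhood-per-row layout is an assumption made for illustration; in the scraped table shown above, neighborhoods already share a cell.

```python
import pandas as pd

# Toy layout with one neighborhood per row; illustrative only.
toy = pd.DataFrame({
    "PostalCode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighborhood": ["Malvern", "Rouge", "Woburn"],
})

# Group on the key columns and join the neighborhood strings of each group.
merged = (
    toy.groupby(["PostalCode", "Borough"], as_index=False)["Neighborhood"]
       .agg(", ".join)
)
print(merged)
```

Selecting the `Neighborhood` column before aggregating makes it explicit which column is being joined, and `as_index=False` keeps the keys as ordinary columns.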


<h3 align=left><font size = 3> If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.  </font></h3> 

In [8]:
# If a cell has a borough but a Not assigned neighborhood, the neighborhood becomes the borough.
# A boolean mask checks every row, including the first one (index 0).
unassigned = df['Neighborhood'] == 'Not assigned'
df.loc[unassigned, 'Neighborhood'] = df.loc[unassigned, 'Borough']
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
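The fallback rule can be checked on a toy frame. This sketch treats both the literal string `Not assigned` and a missing value as unassigned, since the scraped table above shows empty neighborhood cells; the rows themselves are made up for illustration.

```python
import pandas as pd

# Made-up rows: one literal "Not assigned", one normal value, one missing value.
toy = pd.DataFrame({
    "PostalCode": ["M7A", "M1B", "M9A"],
    "Borough": ["Queen's Park", "Scarborough", "Etobicoke"],
    "Neighborhood": ["Not assigned", "Malvern, Rouge", None],
})

# Treat both the literal string and a missing value as "not assigned".
unassigned = toy["Neighborhood"].isna() | (toy["Neighborhood"] == "Not assigned")

# Series.mask replaces values where the condition is True with the borough.
toy["Neighborhood"] = toy["Neighborhood"].mask(unassigned, toy["Borough"])
print(toy)
```

Because the mask is computed for the whole column at once, the first row is handled the same way as every other row.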


<h3 align=left><font size = 3> Show final result - with 15 random rows</font></h3> 

In [9]:
df_random = df.sample(n=15)
df_random

Unnamed: 0,PostalCode,Borough,Neighborhood
69,M5W,Downtown Toronto,Stn A PO Boxes
57,M5G,Downtown Toronto,Central Bay Street
67,M5T,Downtown Toronto,"Kensington Market, Chinatown, Grange Park"
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ..."
73,M6C,York,Humewood-Cedarvale
63,M5N,Central Toronto,Roselawn
5,M1J,Scarborough,Scarborough Village
90,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
12,M1S,Scarborough,Agincourt
61,M5L,Downtown Toronto,"Commerce Court, Victoria Hotel"


<h3 align=left><font size = 3> Use the .shape attribute to print the number of rows of your dataframe.  <br>
Total rows: 103</font></h3> 

In [10]:
df.shape[0]

103

<h3 align=left><font size = 3> Save Dataframe to CSV file</font></h3> 


In [11]:
df.to_csv(r'Toronto.csv', index = False)

In [12]:
# Requires Geospatial_Coordinates.csv in the working directory; download it from https://cocl.us/Geospatial_data first
coordinate = pd.read_csv('Geospatial_Coordinates.csv')
df = df.merge(coordinate, how='left', left_on='PostalCode', right_on='Postal Code')
df = df.drop(columns='Postal Code') # drop the duplicate key column
df

FileNotFoundError: [Errno 2] File Geospatial_Coordinates.csv does not exist: 'Geospatial_Coordinates.csv'
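The error above means the coordinates file was not in the working directory when the cell ran. One way to guard against that is to fall back to the download URL when the local file is absent. This is a minimal sketch, not the notebook's method: the helper names `load_coordinates` and `add_coordinates` are hypothetical, the `Latitude` and `Longitude` column names and values are placeholder assumptions, and whether the URL still serves the file is not verified here.

```python
import os
import pandas as pd

def load_coordinates(path="Geospatial_Coordinates.csv",
                     url="https://cocl.us/Geospatial_data"):
    """Read the coordinates CSV from disk, falling back to the URL if absent."""
    source = path if os.path.exists(path) else url
    return pd.read_csv(source)

def add_coordinates(neighborhoods, coordinates):
    """Left-join coordinates onto neighborhoods by postal code."""
    merged = neighborhoods.merge(
        coordinates, how="left",
        left_on="PostalCode", right_on="Postal Code",
        validate="one_to_one",  # raise if a postal code repeats on either side
    )
    return merged.drop(columns="Postal Code")

# Toy frames with placeholder coordinates; not real data.
hoods = pd.DataFrame({"PostalCode": ["M1B", "M1C"],
                      "Borough": ["Scarborough", "Scarborough"]})
coords = pd.DataFrame({"Postal Code": ["M1B"],
                       "Latitude": [43.0], "Longitude": [-79.0]})

result = add_coordinates(hoods, coords)
print(result)
```

With a left join, postal codes that have no match keep their row and get missing coordinates, so checking `result["Latitude"].isna().sum()` after the merge shows how many codes went unmatched.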