## Introduction

This Notebook will convert addresses into their equivalent latitude and longitude values. It will also use the Foursquare API to explore neighborhoods in Toronto, use the **explore** function to get the most common venue categories in each neighborhood, and then use this feature to group the neighborhoods into clusters. It will use the *k*-means clustering algorithm to complete this task. Finally, it will use the Folium library to visualize the neighborhoods in Toronto and their emerging clusters.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Create Data Frame of Toronto Neighborhoods from Wikipedia Page </a>

2. <a href="#item2">Obtain Latitude and Longitude of Neighborhoods </a> 

   
</font>
</div>

### Download Dependencies 

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0; older versions import it from pandas.io.json)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


### 1. Create Data Frame of Toronto Neighborhoods from Wikipedia Page

Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe like the one shown below:
![](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/7JXaz3NNEeiMwApe4i-fLg_40e690ae0e927abda2d4bde7d94ed133_Screen-Shot-2018-06-18-at-7.17.57-PM.png?expiry=1545868800000&hmac=u23NZ3SR85OnKCQRnfOqtsMbnGrHBrgQGuCS4Pakc9s)

In [2]:
#Install website scraping package and parser package, import BeautifulSoup
#!conda install -c conda-forge beautifulsoup4 --yes #Uncomment line if beautifulsoup4 has not been installed
#!conda install -c conda-forge lxml --yes #Uncomment line if lxml has not been installed
from bs4 import BeautifulSoup

#### Scrape data from Wikipedia

In [3]:
html = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(html,'lxml')
#print(soup.prettify())
table = soup.find('tbody') #create an object containing the table information from Wikipedia
#print(table.prettify())

#The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
toronto_df = pd.DataFrame(columns=['PostalCode','Borough','Neighborhood']) #Create an empty data frame with the desired column names

Rindex = 0 #Set the starting row index
Cindex = 0 #Set the starting column index
cells = table.find_all('td')
i = 0
for cell in cells: #loop through all occurrences of <td> within <tbody>. Note: <td> contains each cell within the table
        Rindex = int(i/3)
        Cindex = i % 3
        if Cindex == 0:
            toronto_df.loc[Rindex] = '' #Create a blank row to fill in with data from Wikipedia
        toronto_df.iloc[Rindex,Cindex] = cell.text #assign the text of the current instance of <td> to the corresponding cell
        i += 1
toronto_df.head()


|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned\n |
| 1 | M2A | Not assigned | Not assigned\n |
| 2 | M3A | North York | Parkwoods\n |
| 3 | M4A | North York | Victoria Village\n |
| 4 | M5A | Downtown Toronto | Harbourfront\n |
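
The loop above maps a flat sequence of `<td>` cells onto rows of three columns using integer division and the remainder. As a minimal sketch of the same idea, the flat list can be sliced into groups of three and passed to the `DataFrame` constructor in one step. The `cell_texts` list and the `sample_df` name are hypothetical stand-ins for the scraped cells, not data taken from the live page.

```python
import pandas as pd

# Hypothetical sample of <td> texts in document order (three cells per table row).
cell_texts = [
    "M1A", "Not assigned", "Not assigned\n",
    "M3A", "North York", "Parkwoods\n",
    "M5A", "Downtown Toronto", "Harbourfront\n",
]

columns = ["PostalCode", "Borough", "Neighborhood"]
n_cols = len(columns)

# Slice the flat list into consecutive groups of n_cols cells, one group per row.
rows = [cell_texts[i:i + n_cols] for i in range(0, len(cell_texts), n_cols)]

# Build the dataframe once instead of growing it row by row.
sample_df = pd.DataFrame(rows, columns=columns)
print(sample_df)
```

Building the frame in a single call avoids repeated `.loc` row creation, which tends to get slow as the table grows.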


#### Clean Data

In [4]:
#Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
toronto_df.replace('\n', '', regex=True, inplace=True) # remove the \n from strings in the Neighborhood column
toronto_df.drop(toronto_df[toronto_df.Borough == 'Not assigned'].index, inplace=True) #drop rows where borough is not assigned

#If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.
for r in range(0,len(toronto_df['Neighborhood'])): #Cycle through all the rows in the data frame
    if toronto_df.iloc[r,2]=='Not assigned': #check if the neighborhood is unassigned
        toronto_df.iloc[r,2] = toronto_df.iloc[r,1] #replace the neighborhood with the borough
        
#More than one neighborhood can exist in one postal code area. 
#For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. 
#These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
groups = pd.DataFrame(toronto_df.groupby(['PostalCode'])['Neighborhood'].apply(', '.join))
for pc in range(0,len(toronto_df['PostalCode'])): #Loop through the rows of the data frame
    PC = toronto_df.iloc[pc,0]  #Determine current postal code in for loop
    PCindex = groups.index.get_loc(PC) #Find the index for the row in groups that matches the current postal code
    toronto_df.iloc[pc,2] = groups.iloc[PCindex,0] #replace neighborhood in toronto_df with the neighborhoods from groups

toronto_df.drop_duplicates(inplace=True) #Drop duplicate rows so there is only 1 row per postal code
toronto_df.reset_index(drop=True, inplace=True) #reset the index

toronto_df.head() 

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | M7A | Queen's Park | Queen's Park |
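
The three cleaning rules (drop unassigned boroughs, fall back to the borough name when the neighborhood is unassigned, and join neighborhoods that share a postal code) can also be written without explicit row loops. The sketch below is one possible vectorized version, run on a small hypothetical `raw` frame rather than the scraped data. It groups on both `PostalCode` and `Borough`, which assumes each postal code belongs to a single borough; the cell above groups on `PostalCode` only.

```python
import pandas as pd

# Hypothetical sample rows that mimic the scraped table before cleaning.
raw = pd.DataFrame({
    "PostalCode": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Not assigned\n", "Harbourfront\n", "Regent Park\n", "Not assigned\n"],
})

clean = raw.copy()

# Remove the trailing newline from every neighborhood string.
clean["Neighborhood"] = clean["Neighborhood"].str.strip()

# Keep only rows that have an assigned borough.
clean = clean[clean["Borough"] != "Not assigned"].copy()

# Where the neighborhood is unassigned, reuse the borough name.
unassigned = clean["Neighborhood"] == "Not assigned"
clean.loc[unassigned, "Neighborhood"] = clean.loc[unassigned, "Borough"]

# Combine neighborhoods that share a postal code into one comma-separated string.
clean = (
    clean.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(clean)
```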


In [5]:
toronto_df.shape #Use the .shape attribute to report the number of rows and columns of the dataframe.

(103, 3)

### 2. Obtain Latitude and Longitude of Neighborhoods

Use the Geocoder package or the csv file to create the following dataframe:
![](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/HZ3jNHNOEeiMwApe4i-fLg_f44f0f10ccfaf42fcbdba9813364e173_Screen-Shot-2018-06-18-at-7.18.16-PM.png?expiry=1545868800000&hmac=JbgRgo-KXbGsROeU07QAOwJWpLCImBGiMluM3PgizGM)

**This cell has been converted to markdown because the geocoder was taking too long. The coordinates are obtained from the .csv file and processed in the next cell.**


#!conda install -c conda-forge geocoder --yes #Uncomment line if geocoder has not been installed
import geocoder # import geocoder

# create lists to hold one latitude and one longitude per postal code
latitude = [None] * len(toronto_df)
longitude = [None] * len(toronto_df)

for code in range(0,len(toronto_df['PostalCode'])):
    postal_code = toronto_df.iloc[code,0]
    # initialize your variable to None
    lat_lng_coords = None

    # loop until you get the coordinates
    while(lat_lng_coords is None):
      g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
      lat_lng_coords = g.latlng

    latitude[code] = lat_lng_coords[0]
    longitude[code] = lat_lng_coords[1]
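
The `while` loop in the geocoder cell never stops if a lookup keeps returning `None`, which may be one reason the cell took so long. A bounded retry is one way to guard against that. The sketch below is an illustration only: `geocode_with_retry` and `flaky_lookup` are hypothetical names, and the fake lookup stands in for a real geocoding call so the example runs without network access.

```python
import time

def geocode_with_retry(lookup, query, max_attempts=5, delay_seconds=0.0):
    """Call lookup(query) until it returns coordinates or the attempts run out."""
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(delay_seconds)  # pause before trying again
    return None  # give up instead of looping forever

# Hypothetical stand-in for a geocoding call: fails twice, then succeeds.
calls = {"count": 0}

def flaky_lookup(query):
    calls["count"] += 1
    if calls["count"] < 3:
        return None
    return [43.7533, -79.3297]

result = geocode_with_retry(flaky_lookup, "M3A, Toronto, Ontario")
missing = geocode_with_retry(lambda query: None, "M1A, Toronto, Ontario", max_attempts=2)
print(result, missing)
```

With a real geocoder, `lookup` could be a small function that wraps the call and returns its `latlng` value, and a `None` result can then be handled explicitly, for example by falling back to the .csv file.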

In [6]:
# Read in the data from the .csv file
filename = 'Geospatial_Coordinates.csv'
lat_lng_coords = pd.read_csv(filename)
lat_lng_coords.head()
lat_lng_coords.set_index('Postal Code', inplace=True)

In [7]:
#enter the geospatial data into the toronto_df data frame
toronto_df['Latitude'] = ''
toronto_df['Longitude'] = ''
 
for l in range(0,len(toronto_df['PostalCode'])): #Loop through the rows of the data frame
    code = toronto_df.iloc[l,0]  #Determine current postal code in for loop
    coords = lat_lng_coords.loc[code] #look up the coordinate row for the current postal code
    toronto_df.iloc[l,3] = coords.iloc[0] #enter the latitude into the toronto_df data frame
    toronto_df.iloc[l,4] = coords.iloc[1] #enter the longitude into the toronto_df data frame
    
toronto_df.head()

|   | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.7533 | -79.3297 |
| 1 | M4A | North York | Victoria Village | 43.7259 | -79.3156 |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park | 43.6543 | -79.3606 |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor | 43.7185 | -79.4648 |
| 4 | M7A | Queen's Park | Queen's Park | 43.6623 | -79.3895 |
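
Attaching coordinates by postal code is a key-based join, so `DataFrame.merge` is a possible alternative to the row loop. The sketch below uses two small hypothetical frames; the `Latitude` and `Longitude` column names in `coords` are an assumption about the .csv layout (only `Postal Code` is confirmed by the cell above), and the values are the rounded ones displayed in the output.

```python
import pandas as pd

# Hypothetical subset of the cleaned neighborhood table.
neighborhoods = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighborhood": ["Parkwoods", "Victoria Village"],
})

# Hypothetical subset of the coordinates file, deliberately in a different row order.
coords = pd.DataFrame({
    "Postal Code": ["M4A", "M3A"],
    "Latitude": [43.7259, 43.7533],
    "Longitude": [-79.3156, -79.3297],
})

# A left join keeps every neighborhood row and matches coordinates by postal code.
merged = neighborhoods.merge(
    coords, left_on="PostalCode", right_on="Postal Code", how="left"
).drop(columns="Postal Code")
print(merged)
```

A left join leaves `NaN` in the coordinate columns for any postal code missing from the file, whereas the `.loc` lookup in the loop raises a `KeyError`.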
