# Segmenting and Clustering Neighborhoods in Toronto
_This notebook documents my analysis of the neighborhoods in Toronto. Every step is explained for reference purposes._

## 1. Import Libraries
_All libraries used in the analysis are imported in this section. Importing them up front lets the analysis run smoothly and confirms that every required package is available in the notebook._

In [37]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON into a pandas dataframe (the pandas.io.json import path is deprecated since pandas 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means for the clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if folium is not already installed
import folium # map rendering library

print('Libraries imported.')

Libraries imported.
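
_As a quick way to confirm that the packages imported above are installed before running the rest of the notebook, the following is a minimal sketch using only the standard library. The `REQUIRED` list and the `missing_packages` helper are illustrative names that are not part of the original notebook._

```python
import importlib.util

# Top-level import names used in this notebook
# (scikit-learn is imported under the name "sklearn").
REQUIRED = ["numpy", "pandas", "geopy", "requests", "matplotlib", "sklearn", "folium"]

def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```

_Checking with `find_spec` does not import the packages, so it is fast and has no side effects._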


## 2. Data Mining
_A Wikipedia page lists all the information needed to explore the neighborhoods in Toronto, so the data was scraped from that page and read into a Pandas dataframe. Because the data was already structured in tabular form, Pandas was the preferred option; otherwise, the Beautiful Soup package would have been used._

In [65]:
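# Wikipedia page listing the Canadian postal codes that start with M
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
# Send the GET request and read every HTML table on the page into a list of dataframes
html = requests.get(url).content
tor_df = pd.read_html(html)
# Keep the first table, which holds the postal codes
tor_df = tor_df[0]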
tor_df.head()

|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
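
_For comparison, this is roughly what a manual parse involves when a page is not clean enough for `pd.read_html`. It is a minimal sketch only: the standard-library `html.parser` module stands in for Beautiful Soup so that the example has no extra dependencies, and `TableParser` and `sample_html` are hypothetical names and data modeled on the columns shown above, not the real page._

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows
        self._row = None   # row currently being read, or None outside a <tr>
        self._cell = None  # text fragments of the current cell, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# Hypothetical HTML fragment, not the actual Wikipedia markup.
sample_html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

parser = TableParser()
parser.feed(sample_html)
header, *records = parser.rows  # first row is the header, the rest are data rows
print(header)
print(records)
```

_The resulting `header` and `records` could then be passed to `pd.DataFrame(records, columns=header)`. With a well-formed table, `pd.read_html` does all of this in one call, which is why it was preferred here._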


In [66]:
tor_df.shape

(180, 3)

In [67]:
borough_list = tor_df.index[tor_df['Borough'] == 'Not assigned']
neighborhood_list = tor_df.index[tor_df['Neighbourhood'] == 'Not assigned']
print('Prior to Cleaning, the Dataframe has: ')
print('{} Postal codes'.format(tor_df['Postal Code'].unique().shape[0]))
print('{} rows with Not assigned Borough'.format(borough_list.shape[0]))
print('{} rows with Not assigned Neighborhood'.format(neighborhood_list.shape[0]))



Prior to Cleaning, the Dataframe has: 
180 Postal codes
77 rows with Not assigned Borough
77 rows with Not assigned Neighborhood
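
_The two counts above are equal (77 and 77), but equal counts alone do not prove that the same rows are unassigned in both columns. A minimal sketch of a direct check follows, run on a small hypothetical frame (`toy`) rather than on the scraped data; the same comparison could be applied to `tor_df`._

```python
import pandas as pd

# Hypothetical toy rows shaped like the scraped table.
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

borough_missing = toy["Borough"] == "Not assigned"
neighbourhood_missing = toy["Neighbourhood"] == "Not assigned"

# Exclusive OR keeps rows where exactly one of the two columns is unassigned.
mismatched = toy[borough_missing ^ neighbourhood_missing]
print("{} rows are unassigned in only one column".format(len(mismatched)))
```

_If the mismatch count is zero, dropping rows by the Borough column alone also removes every unassigned neighborhood._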


## 3. Data Cleaning
_All rows with "Not assigned" in the Borough column are dropped. In addition, the columns are renamed, as specified, to PostalCode, Borough, and Neighborhood._

In [68]:
tor_df = tor_df.rename(columns={"Postal Code": "PostalCode", "Neighbourhood": "Neighborhood"})
tor_df = tor_df[tor_df.Borough !='Not assigned']
tor_df.reset_index(drop=True, inplace=True)
tor_df.head(12)

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 5 | M9A | Etobicoke | Islington Avenue, Humber Valley Village |
| 6 | M1B | Scarborough | Malvern, Rouge |
| 7 | M3B | North York | Don Mills |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson |


In [69]:
tor_df.shape

(103, 3)

In [70]:
print('The final DataFrame shape is {}'.format(tor_df.shape))

The final DataFrame shape is (103, 3)
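
_The cleaning steps in this section could also be wrapped in a single function so that they can be rerun on a fresh scrape. The following is a minimal sketch under that assumption: `clean_postal_table` and the `raw` frame are hypothetical names and data, not part of the original notebook._

```python
import pandas as pd

def clean_postal_table(df):
    """Rename the columns, drop unassigned boroughs, and renumber the rows.

    Returns a new dataframe and leaves the input unchanged.
    """
    cleaned = df.rename(columns={"Postal Code": "PostalCode",
                                 "Neighbourhood": "Neighborhood"})
    cleaned = cleaned[cleaned["Borough"] != "Not assigned"]
    return cleaned.reset_index(drop=True)

# Hypothetical toy rows shaped like the scraped table.
raw = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

cleaned = clean_postal_table(raw)
print("The cleaned toy DataFrame shape is {}".format(cleaned.shape))
```

_Returning a new dataframe instead of modifying the input keeps the raw scrape available for comparison, such as the before and after row counts reported above._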
