## Segmenting Neighborhoods in Toronto

#### Section 0: Required packages

This notebook requires requests, lxml, pandas, and folium to run.

#### Section 1: Scraping Neighborhoods from Toronto Postal Codes

##### Step 1: importing required packages

In [1]:
import requests
import lxml.html as lh
import pandas as pd
# from bs4 import BeautifulSoup  # Disabled: environment issues when importing beautifulsoup4 (the module itself is named bs4). Not used for analysis.

##### Step 2: Store the website we're scraping as a variable, and store the elements from the table

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

In [3]:
page = requests.get(url)
doc = lh.fromstring(page.content)
table_elements = doc.xpath('//tr')

In [4]:
[len(T) for T in table_elements[:12]]  # Check that the first rows each have three cells, as expected for the postal code table

[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
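Note that `//tr` matches rows from every table on the page, which is why the row width is checked at all. The sketch below shows the difference between selecting all rows and selecting only the rows of one table. The inline HTML and the `wikitable` and `navbox` class names are assumptions made for illustration; they are not taken from the scraped page.

```python
import lxml.html as lh

# Hypothetical page with two tables; only the first one holds postal codes
sample_html = """
<html><body>
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
<table class="navbox">
  <tr><td>Unrelated navigation row</td></tr>
</table>
</body></html>
"""

sample_doc = lh.fromstring(sample_html)

# Every row in the document, regardless of which table it belongs to
all_rows = sample_doc.xpath('//tr')
# Only rows inside a table whose class attribute contains "wikitable"
table_rows = sample_doc.xpath('//table[contains(@class, "wikitable")]//tr')

# len() of a row element counts its child cells
widths_all = [len(row) for row in all_rows]
widths_table = [len(row) for row in table_rows]
print(widths_all, widths_table)
```

Restricting the XPath to one table would make the later width check a safeguard rather than the only filter.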

In [5]:
tr_elements = doc.xpath('//tr')
# Create an empty list to hold (header, values) pairs
col = []
i = 0
# For each cell in the header row, store its text and an empty list
for t in tr_elements[0]:
    i += 1
    name = t.text_content()
    print('%d:"%s"' % (i, name))
    col.append((name, []))

1:"Postcode"
2:"Borough"
3:"Neighborhood
"


Now onto creating a dictionary of the postal codes, boroughs, and neighborhoods. The final dataframe should have one row per postal code, each corresponding to one or more neighborhoods within one borough. Rows whose borough cell is "Not assigned" are skipped.

In [6]:
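postaldict = {}

# Start at 1 to skip the header row
for r in range(1, len(tr_elements)):
    row = tr_elements[r]

    # A row without exactly three cells is outside the postal code table, so stop
    if len(row) != 3:
        break
    # After splitting on newlines, indexes 1 to 3 hold postcode, borough, neighborhood
    item = row.text_content().split('\n')

    # Skip postal codes with no assigned borough
    if item[2] != 'Not assigned':
        if item[1] not in postaldict:
            postaldict[item[1]] = [item[2], [item[3]]]
        else:
            postaldict[item[1]][1].append(item[3])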
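To make the resulting structure concrete, here is a minimal sketch of the same grouping on a handful of hand-written rows. The `sample_rows` list and the helper names are illustrative assumptions, and `collections.defaultdict` is swapped in for the explicit key check used above.

```python
from collections import defaultdict

# Illustrative (postcode, borough, neighborhood) rows; not scraped data
sample_rows = [
    ("M1A", "Not assigned", "Not assigned"),
    ("M3A", "North York", "Parkwoods"),
    ("M6A", "North York", "Lawrence Heights"),
    ("M6A", "North York", "Lawrence Manor"),
]

borough_of = {}
hoods_of = defaultdict(list)
for code, borough, hood in sample_rows:
    # Rows without an assigned borough are dropped entirely
    if borough == "Not assigned":
        continue
    borough_of[code] = borough
    hoods_of[code].append(hood)

# Same shape as postaldict: {postcode: [borough, [neighborhood, ...]]}
grouped = {code: [borough_of[code], hoods_of[code]] for code in borough_of}
print(grouped)
```

Each postal code maps to its borough and a list of neighborhoods, so a code that appears on several rows still produces a single entry.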

##### Step 3: Clean-up

Assign any "Not assigned" neighborhood the name of its borough, and convert the neighborhood lists into strings.

In [7]:
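for key in postaldict:
    # A neighborhood of 'Not assigned' is replaced by the borough name
    if 'Not assigned' in postaldict[key][1]:
        postaldict[key][1] = postaldict[key][0]
    # Otherwise join the list of neighborhoods into one comma-separated string
    elif not isinstance(postaldict[key][1], str):
        y = ", ".join(postaldict[key][1])
        postaldict[key][1] = y
        print(postaldict[key][1])
print(postaldict)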

Parkwoods
Victoria Village
Harbourfront
Lawrence Heights, Lawrence Manor
Queen's Park
Rouge, Malvern
Don Mills North
Woodbine Gardens, Parkview Hill
Ryerson, Garden District
Glencairn
Cloverdale, Islington, Martin Grove, Princess Gardens, West Deane Park
Highland Creek, Rouge Hill, Port Union
Flemingdon Park, Don Mills South
Woodbine Heights
St. James Town
Humewood-Cedarvale
Bloordale Gardens, Eringate, Markland Wood, Old Burnhamthorpe
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Downsview North, Wilson Heights
Thorncliffe Park
Adelaide, King, Richmond
Dovercourt Village, Dufferin
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto
Harbourfront East, Toronto Islands, Union Station
Little Portugal, Trinity
East Birchmount Park, Ionview, Kennedy Park
Bayview Village
CFB Toronto, Downsview East
The Danforth West, Riverdale
Design
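The grouping and clean-up steps can also be expressed in pandas once the rows are in a dataframe. The following is a sketch under assumptions, not the method used in this notebook: the sample rows are hand-written, and an unassigned neighborhood is replaced row by row before grouping rather than per postal code.

```python
import pandas as pd

# Illustrative rows; the real data comes from the scraped table
sample_rows = [
    ("M1A", "Not assigned", "Not assigned"),
    ("M3A", "North York", "Parkwoods"),
    ("M5A", "Downtown Toronto", "Harbourfront"),
    ("M6A", "North York", "Lawrence Heights"),
    ("M6A", "North York", "Lawrence Manor"),
    ("M7A", "Queen's Park", "Not assigned"),
]
raw = pd.DataFrame(sample_rows, columns=["PostalCode", "Borough", "Neighborhood"])

# Drop postal codes that have no borough
assigned = raw[raw["Borough"] != "Not assigned"].copy()

# An unassigned neighborhood takes the name of its borough
missing = assigned["Neighborhood"] == "Not assigned"
assigned.loc[missing, "Neighborhood"] = assigned.loc[missing, "Borough"]

# One row per postal code, with neighborhoods joined into a single string
tidy = (
    assigned.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(tidy)
```

Because `reset_index()` produces a fresh numeric index, this route would not need a separate renumbering step.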

For the dataframe, we want a numeric index. With that in mind, I'm making a dictionary with a number as the key and the postal code as the first element of the value.

In [9]:
finaldict = {}
i=0
for k in postaldict.keys():
    finaldict[i] = [k,postaldict[k][0],postaldict[k][1]]
    i+=1
len(finaldict)

103

##### Final Step: Creating the Dataframe from the dictionary

In [14]:
df = pd.DataFrame.from_dict(finaldict,orient='index', columns = ['PostalCode','Borough', 'Neighborhood'])

In [28]:
df.head()

|   | PostalCode | Borough | Neighborhood |
|---|------------|---------|--------------|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | M7A | Queen's Park | Queen's Park |


In [16]:
df.shape

(103, 3)
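An intermediate dictionary keyed by number is one way to get a numeric index. As a hedged alternative, pandas assigns a default 0-based index when a dataframe is built from a list of rows. The small `postal` dictionary below is illustrative and mirrors the cleaned-up shape of `postaldict` (postcode mapped to borough and neighborhood string).

```python
import pandas as pd

# Illustrative stand-in for the cleaned-up postaldict
postal = {
    "M3A": ["North York", "Parkwoods"],
    "M4A": ["North York", "Victoria Village"],
}

# Flatten each entry into a [postcode, borough, neighborhood] row
records = [[code, borough, hood] for code, (borough, hood) in postal.items()]

# No index is given, so pandas numbers the rows 0, 1, 2, ...
frame = pd.DataFrame(records, columns=["PostalCode", "Borough", "Neighborhood"])
print(frame)
```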

#### Section 2: Retrieving the geographic coordinates of each postal code

In [17]:
import geocoder

ModuleNotFoundError: No module named 'geocoder'

Because geocoder is not installed in this environment, the coordinates are read from a CSV file of postal code coordinates instead.

In [26]:
geocodecsv = "C:/Users/Mercedes/Documents/DSCapstone/Coursera_Capstone/Geospatial_Coordinates.csv"
geodf = pd.read_csv(geocodecsv)
geodf.rename(columns = {'Postal Code':'PostalCode'}, inplace = True)
geodf.head()

|   | PostalCode | Latitude | Longitude |
|---|------------|----------|-----------|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


The numeric indexes of the neighborhood dataframe and the coordinate dataframe do not line up, so we join the two on their shared PostalCode column.

In [25]:
fulldf = pd.merge(df, geodf, how='outer', on='PostalCode')

In [27]:
fulldf.head()

|   | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|------------|---------|--------------|----------|-----------|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor | 43.718518 | -79.464763 |
| 4 | M7A | Queen's Park | Queen's Park | 43.662301 | -79.389494 |
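
An outer join keeps postal codes that appear in only one of the two dataframes, leaving missing values in the other side's columns. The sketch below shows one way to check for such rows using the `indicator` and `validate` parameters of `pd.merge`. The two small dataframes are illustrative, and the `M9Z` code and its borough are hypothetical values added to force a mismatch.

```python
import pandas as pd

# Illustrative neighborhood data; "M9Z" is a hypothetical code with no coordinates
hoods = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Hypothetical Borough"],
})
# Illustrative coordinate data; "M1B" has no matching neighborhood row here
coords = pd.DataFrame({
    "PostalCode": ["M1B", "M3A", "M4A"],
    "Latitude": [43.806686, 43.753259, 43.725882],
    "Longitude": [-79.194353, -79.329656, -79.315572],
})

merged = pd.merge(
    hoods,
    coords,
    how="outer",
    on="PostalCode",
    indicator=True,          # adds a _merge column: left_only, right_only, or both
    validate="one_to_one",   # raises MergeError if a postal code repeats on either side
)

# Rows that were present in only one of the inputs
unmatched = merged[merged["_merge"] != "both"]
print(unmatched[["PostalCode", "_merge"]])
```

If `unmatched` is empty on the real data, an inner or left join would give the same rows as the outer join used above.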


#### Section 3: Exploring the Neighborhoods in Toronto Using Foursquare