<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto</font></h1>
<h1 align=right><font size = 4>by Pavel Milchev</font></h1>

<h1> Step 1 - Parse data from site and shape it</h1>

*Install the needed libraries.*

In [48]:
!pip install beautifulsoup4 # used to parse the html
print("BeautifulSoup 4 is installed!")
!pip install html5lib # already installed but was in the example ^_^
print("HTML5 library is installed!")
!pip install lxml # already installed but was in the example ^_^
print("lxml library is installed!")

BeautifulSoup 4 is installed!
HTML5 library is installed!
lxml library is installed!


*Import the needed libraries*

pd.read_html parses HTML with lxml by default and falls back to BeautifulSoup with html5lib if that fails.

In [49]:
from bs4 import BeautifulSoup
import requests
import lxml
import pandas as pd

*Read the tables of the wiki page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M and create a data frame from the first one*

**I assume that:** the page contains at least one table and the first table is the one I need

In [50]:
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]

**I assume that:** the names of the columns from the html page are as follows: Postcode, Borough, Neighbourhood

In [51]:
# rename the column to match the name expected by the assignment
df.rename(columns={'Postcode':'PostalCode'}, inplace=True)
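
Both assumptions above can be checked before the rest of the notebook relies on them. The following is a minimal sketch, not part of the original assignment: `pick_table` is a hypothetical helper, and the toy frames only stand in for the list that pd.read_html returns.

```python
import pandas as pd

def pick_table(tables, required_columns):
    """Return the first DataFrame that contains every required column."""
    for table in tables:
        if set(required_columns).issubset(table.columns):
            return table
    raise ValueError("no table with columns {}".format(sorted(required_columns)))

# Toy stand-ins for the list returned by pd.read_html (illustrative data only)
tables = [
    pd.DataFrame({'Note': ['unrelated table']}),
    pd.DataFrame({'Postcode': ['M3A'],
                  'Borough': ['North York'],
                  'Neighbourhood': ['Parkwoods']}),
]

# fails loudly if no table has the expected column names
checked = pick_table(tables, ['Postcode', 'Borough', 'Neighbourhood'])
checked = checked.rename(columns={'Postcode': 'PostalCode'})
```

Selecting the table by its columns instead of by position `[0]` would make the notebook fail with a clear message if the page layout changes.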

*Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.*

In [52]:
#Ignore cells with a borough that is Not assigned.
indexNames = df[df['Borough'] == 'Not assigned'].index
df.drop(indexNames , inplace=True)

*If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.*

In [53]:
nonNeighbourhoods = df[df['Neighbourhood'] == 'Not assigned'].index
for k in nonNeighbourhoods:
    df.at[k,'Neighbourhood'] = df.at[k,'Borough']
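
The two cleaning rules above can also be written without an explicit loop. This is a sketch on a small toy frame (the values are illustrative, not taken from the scraped table): a boolean mask keeps the rows with an assigned borough, and `.loc` copies the borough into the neighbourhood where needed.

```python
import pandas as pd

# Toy data in the shape of the Wikipedia table (illustrative values only)
toy = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M7A'],
    'Borough': ['Not assigned', 'North York', "Queen's Park"],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# keep only the rows with an assigned borough
toy = toy[toy['Borough'] != 'Not assigned'].copy()

# where the neighbourhood is not assigned, copy the borough over
missing = toy['Neighbourhood'] == 'Not assigned'
toy.loc[missing, 'Neighbourhood'] = toy.loc[missing, 'Borough']
```

The `.copy()` after filtering makes `toy` an independent frame, so the later assignment does not raise a SettingWithCopyWarning.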

***The procedure to fulfill the following requirement:*** More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.

In [64]:
# group by PostalCode to find the Neighbourhoods with the same postal code
grouped = df.groupby('PostalCode', sort=False)
# collect one single-row DataFrame per postal code
rows = []
for postalCode, postalCode_df in grouped:
    # work on a copy so the assignment below does not trigger SettingWithCopyWarning
    postalCode_df = postalCode_df.copy()
    # transform the list of unique neighbourhoods to a comma separated single string
    postalCode_df['Neighbourhood'] = ",".join(postalCode_df['Neighbourhood'].unique())
    # drop duplicated rows based on the Borough column
    postalCode_df = postalCode_df.drop_duplicates(subset='Borough')
    # keep the current dataframe with ONE LINE for the final refined_df
    rows.append(postalCode_df)

# combine the rows and reindex the table starting the rows from 0
refined_df = pd.concat(rows)
refined_df.reset_index(drop=True, inplace=True)
print(refined_df.shape)
refined_df


(103, 3)


Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park
...,...,...,...
98,M8X,Etobicoke,"The Kingsway,Montgomery Road,Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,Business Reply Mail Processing Centre 969 Eastern
101,M8Y,Etobicoke,"Humber Bay,King's Mill Park,Kingsway Park Sout..."
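
The combining step can also be expressed as a single aggregation. The sketch below is an alternative to the loop above, not the method used in this notebook. It runs on toy rows modelled on the M5A example from the requirement, and it assumes every postal code belongs to exactly one borough, so taking the first borough of each group loses nothing.

```python
import pandas as pd

# Toy rows modelled on the M5A example from the requirement
toy = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M7A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Harbourfront', 'Regent Park', "Queen's Park"],
})

combined = (
    toy.groupby('PostalCode', sort=False)
       .agg({'Borough': 'first',  # assumption: one borough per postal code
             'Neighbourhood': lambda s: ",".join(s.unique())})
       .reset_index()
)
```

Here `combined` has one row per postal code, with the neighbourhoods of M5A joined into a single comma separated string.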


In [65]:
#save the transformed data frame for the next steps of the assignment
refined_df.to_csv('toronto_postal_step1.csv', mode='w')
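
By default to_csv also writes the row index, and read_csv later loads it back as a column named `Unnamed: 0`, which is why Step 2 has to drop that column. The sketch below shows the difference on a toy frame, using in-memory buffers instead of files; passing `index=False` is one way to avoid the extra column.

```python
import io
import pandas as pd

toy = pd.DataFrame({'PostalCode': ['M3A', 'M4A'],
                    'Borough': ['North York', 'North York']})

# default: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded = pd.read_csv(without_index)
```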


<h1> Step 2 - Add geolocation to the data </h1>

*Install the geocoder library.*

In [44]:
!pip install geocoder
print("The geocoder library is installed!")

The geocoder library is installed!


*Define a getLatLng function that takes a Toronto postal code as its argument and returns its latitude and longitude*

In [38]:
import geocoder # used to look up the coordinates
import pandas as pd

def getLatLng(postal_code, max_attempts=10):
    # build the address string that is sent to the geocoder
    address = '{}, Toronto, Ontario'.format(postal_code)
    # retry until the coordinates arrive or the attempts run out
    for _ in range(max_attempts):
        g = geocoder.arcgis(address)
        if g.latlng is not None:
            return g.latlng
    # every attempt failed
    return None
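
The retry logic in getLatLng can be separated from the geocoder itself, which makes it testable without network access. The following is a sketch under stated assumptions: `retry_lookup`, `fake_lookup`, and the `delay_seconds` pause are hypothetical names and choices for illustration and are not part of the geocoder library.

```python
import time

def retry_lookup(lookup, query, max_attempts=10, delay_seconds=0.0):
    """Call lookup(query) until it returns a non-None result or attempts run out."""
    for attempt in range(max_attempts):
        result = lookup(query)
        if result is not None:
            return result
        # wait a little longer after each failed attempt
        time.sleep(delay_seconds * (attempt + 1))
    return None

# Hypothetical stand-in for a geocoder call: fails twice, then succeeds
calls = []

def fake_lookup(query):
    calls.append(query)
    return None if len(calls) < 3 else [43.75, -79.33]

coords = retry_lookup(fake_lookup, 'M3A, Toronto, Ontario')
```

In this notebook, `lookup` could be `lambda address: geocoder.arcgis(address).latlng`, with a non-zero `delay_seconds` to avoid sending repeated requests back to back.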

*Read the dataframe from the csv saved at the end of the previous step*

In [39]:
df = pd.read_csv('toronto_postal_step1.csv')
df.drop(columns = ['Unnamed: 0'], inplace = True)
df

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park
...,...,...,...
98,M8X,Etobicoke,"The Kingsway,Montgomery Road,Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,Business Reply Mail Processing Centre 969 Eastern
101,M8Y,Etobicoke,"Humber Bay,King's Mill Park,Kingsway Park Sout..."


*Create lists of latitudes and longitudes for each postal code from the data frame*

In [45]:
latList = []
lngList = []

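for postal_code in df['PostalCode']:
    #print("finding the coords of: ", postal_code)
    coords = getLatLng(postal_code)
    # getLatLng returns None when every attempt fails; store missing values in that case
    lat, lng = coords if coords is not None else (None, None)
    latList.append(lat)
    lngList.append(lng)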
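
Because getLatLng returns None when all attempts fail, it helps to decide up front how a failed lookup should appear in the table. The sketch below uses hypothetical lookup results (illustrative values only): it turns each failure into a pair of missing values and builds both coordinate columns in one step.

```python
import pandas as pd

# Hypothetical lookup results: one [lat, lng] pair per postal code, None on failure
results = [[43.75242, -79.329242], None, [43.650295, -79.359166]]

# replace failed lookups with a pair of missing values
pairs = [r if r is not None else [None, None] for r in results]

# build both coordinate columns at once
coords_df = pd.DataFrame(pairs, columns=['Latitude', 'Longitude'])

# count the postal codes that still need a second lookup
failed = int(coords_df['Latitude'].isna().sum())
```

A frame like `coords_df` could then be joined to the postal code table by position, and the rows with missing coordinates retried or reported.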

*Add the columns Latitude and Longitude to the data frame*

In [46]:
df['Latitude'] = latList
df['Longitude'] = lngList
df

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.752420,-79.329242
1,M4A,North York,Victoria Village,43.730600,-79.313265
2,M5A,Downtown Toronto,Harbourfront,43.650295,-79.359166
3,M6A,North York,"Lawrence Heights,Lawrence Manor",43.723270,-79.451286
4,M7A,Downtown Toronto,Queen's Park,43.661150,-79.391715
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway,Montgomery Road,Old Mill North",43.653760,-79.510890
99,M4Y,Downtown Toronto,Church and Wellesley,43.666585,-79.381302
100,M7Y,East Toronto,Business Reply Mail Processing Centre 969 Eastern,43.648690,-79.385440
101,M8Y,Etobicoke,"Humber Bay,King's Mill Park,Kingsway Park Sout...",43.632835,-79.489550


In [47]:
#save the transformed data frame for the next steps of the assignment
df.to_csv('toronto_postal_step2.csv', mode='w')

<h1> Step 3 - Explore and cluster </h1>