## Segmenting and Clustering Neighborhoods in Toronto

##### We need to import the libraries for fetching web data (urllib) and parsing the HTML (Beautiful Soup)

In [1]:
import urllib.request
from bs4 import BeautifulSoup


##### Store the URL of the Wikipedia page in a variable

In [2]:
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
url

'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

##### Open the URL with urllib.request and store the response in a variable

In [3]:
html_page = urllib.request.urlopen(url)
html_page

<http.client.HTTPResponse at 0x10a8a1610>
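Some sites respond differently to requests that carry no descriptive User-Agent, and a hung connection can stall the notebook. A minimal sketch of a more defensive fetch, assuming two hypothetical helpers (`build_request`, `fetch_html`) and a placeholder User-Agent string, none of which are part of the original notebook:

```python
import urllib.request

URL = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"


def build_request(url, user_agent="toronto-neighborhoods-notebook/0.1"):
    """Hypothetical helper: wrap the URL in a Request with an explicit User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=30):
    """Download the page and return its raw bytes; the connection closes on exit."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


# Building the request does not touch the network; only fetch_html() does
request = build_request(URL)
```

The bytes returned by `fetch_html(URL)` can be passed to `BeautifulSoup` in place of `html_page`.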

##### Next, parse the HTML with Beautiful Soup and store the result in a variable

In [4]:
html_postal_codes = BeautifulSoup(html_page, "lxml")


##### After exploring the HTML, find the table that holds the postal codes

In [5]:
postal_table=html_postal_codes.find('table', class_='wikitable sortable')


##### The table has three columns: Postal Code, Borough, and Neighborhood. They appear as th elements in the header row, and each data row holds three matching td elements. Create three empty lists, loop through the table rows, and append the three cell values of each data row to the lists.

In [6]:
#create empty lists for each segment
postal_code=[]
borough=[]
neighborhood=[]

#loop through the table rows and append the text of each td element to the lists
#rows without exactly three td cells (such as the th header row) are skipped
for row in postal_table.find_all('tr'):
    cells=row.find_all('td')
    if len(cells)==3:
        postal_code.append(cells[0].find(string=True))
        borough.append(cells[1].find(string=True))
        neighborhood.append(cells[2].find(string=True))
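The same row-by-row extraction can be tried offline on a small HTML snippet. This is a sketch under stated assumptions: the sample HTML is made up to mimic the three-column layout, the stdlib `html.parser` backend stands in for `lxml`, and `get_text(strip=True)` is used as one way to read a cell and trim its whitespace in a single call.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the three-column layout of the postal-code table
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
  <tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", class_="wikitable sortable")

rows = []
for tr in table.find_all("tr"):
    cells = tr.find_all("td")
    if len(cells) == 3:  # the header row has only <th> cells, so it is skipped
        # get_text(strip=True) returns the cell text without surrounding whitespace
        rows.append([cell.get_text(strip=True) for cell in cells])

print(rows)
```

Collecting each row as one list keeps the three values together, which avoids the lists drifting out of step if a row is malformed.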


##### On review, the values contain newline characters. Strip them from every item in each list.

In [7]:
#Strip out the new lines from the list
postal_code = [ p.strip() for p in postal_code ]
borough = [ b.strip() for b in borough ]
neighborhood = [n.strip() for n in neighborhood]


#### Create a pandas data frame and map the data to the data frame

In [26]:
#Assign the data to a pandas data frame and map each list to a column
import pandas as pd
df=pd.DataFrame(postal_code,columns=['PostalCode'])
df['Borough']=borough
df['Neighborhood']=neighborhood
#Convert the slashes to commas in Neighborhood
df['Neighborhood']=df['Neighborhood'].str.replace('/',',')

#Filter out the rows whose Borough is 'Not assigned'
df=df[df.Borough != 'Not assigned']

df.head(30)


| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park , Harbourfront |
| 5 | M6A | North York | Lawrence Manor , Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park , Ontario Provincial Government |
| 8 | M9A | Etobicoke | Islington Avenue |
| 9 | M1B | Scarborough | Malvern , Rouge |
| 11 | M3B | North York | Don Mills |
| 12 | M4B | East York | Parkview Hill , Woodbine Gardens |
| 13 | M5B | Downtown Toronto | Garden District, Ryerson |
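The data frame construction and cleanup can also be written more compactly. A minimal sketch, assuming short illustrative lists in place of the scraped ones (the `M1A` row is made up to show the filter at work); the `reset_index` call is an optional extra that the notebook itself does not perform:

```python
import pandas as pd

# Illustrative lists in the same shape as postal_code, borough, and neighborhood
postal_code = ["M1A", "M3A", "M5A"]
borough = ["Not assigned", "North York", "Downtown Toronto"]
neighborhood = ["Not assigned", "Parkwoods", "Regent Park / Harbourfront"]

# Build all three columns at once from a dict of lists
df = pd.DataFrame(
    {"PostalCode": postal_code, "Borough": borough, "Neighborhood": neighborhood}
)

# Literal (non-regex) replacement of the slash separator with a comma
df["Neighborhood"] = df["Neighborhood"].str.replace("/", ",", regex=False)

# Drop unassigned boroughs, then renumber the index from 0 (optional)
df = df[df["Borough"] != "Not assigned"].reset_index(drop=True)

print(df)
```

Without `reset_index`, the filtered frame keeps the original row labels, which is why the output above starts at index 2.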


In [27]:
df.shape

(103, 3)

## Second Part of the Project: Find the Latitude and Longitude for Each Postal Code

#### Read the csv file that maps each postal code to its latitude and longitude into a pandas data frame so that we can combine it with our existing dataset. The Postal Code column is renamed to PostalCode so that the column names match between the two data frames

In [28]:
lat_long=pd.read_csv('Geospatial_Coordinates.csv')
lat_long.rename(columns = {'Postal Code':'PostalCode'}, inplace=True)
lat_long.head()

| | PostalCode | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |
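The read-and-rename step can be reproduced without the file on disk. A minimal sketch, assuming the csv header is `Postal Code,Latitude,Longitude` (inferred from the rename above) and using the first three rows of the output as in-memory text:

```python
import io

import pandas as pd

# In-memory stand-in for Geospatial_Coordinates.csv (header layout is an assumption)
CSV_TEXT = """Postal Code,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
"""

lat_long = pd.read_csv(io.StringIO(CSV_TEXT))

# rename() returns a new frame, so reassigning avoids inplace=True
lat_long = lat_long.rename(columns={"Postal Code": "PostalCode"})

print(lat_long.dtypes)
```

Checking `dtypes` after loading confirms that the coordinates were parsed as floats and the postal codes as strings.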


#### Now that the coordinates are in a data frame, we can merge it with our current data frame using an inner join on PostalCode

In [30]:
df2=pd.merge(df,lat_long, on='PostalCode', how='inner')

In [31]:
df2.head()

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park , Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor , Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park , Ontario Provincial Government | 43.662301 | -79.389494 |


In [32]:
df2.shape

(103, 5)
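An inner join silently drops any postal code that has no coordinates. Here the merged frame still has 103 rows, so nothing was lost, but the check can be made explicit. A minimal sketch, assuming a left join with pandas' `indicator` and `validate` options; the `M0X` row is made up to show what an unmatched code looks like:

```python
import pandas as pd

# First two rows copied from the outputs above; "M0X" is a made-up unmatched code
df = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M0X"],
    "Borough": ["North York", "North York", "Hypothetical"],
    "Neighborhood": ["Parkwoods", "Victoria Village", "Hypothetical"],
})
lat_long = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# how="left" keeps every row of df; indicator=True adds a _merge column;
# validate="one_to_one" raises if a postal code repeats on either side
checked = pd.merge(df, lat_long, on="PostalCode", how="left",
                   validate="one_to_one", indicator=True)

# Postal codes present in df but missing from lat_long
unmatched = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would confirm that the inner join kept every postal code.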