 ## Segmenting and Clustering Neighborhoods in Toronto 
Please see line 103 for the table with latitude and longitude.

Let's import the necessary libraries

In [82]:
import requests 
from bs4 import BeautifulSoup
import pandas as pd

Next, we fetch the page at the link and parse it with Beautiful Soup.

In [83]:
info = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
soup = BeautifulSoup(info.content, "html.parser")

In [84]:
table = soup.find('tbody') 
rangees = table.find_all('tr') 

rangee = [k.get_text() for k in rangees]


We create a dataframe from the extracted row text, split each row into cells on the newline character, and use the values of the first row, found with dfmamou1.iloc[0], as the column names.

In [85]:
dfmamou = pd.DataFrame(rangee)
dfmamou1 = dfmamou[0].str.split('\n', expand = True)
dfmamou2 = dfmamou1.rename(columns = dfmamou1.iloc[0])  


In [86]:
dfmamou2.head(3)

Unnamed: 0,Unnamed: 1,Postcode,Borough,Neighbourhood,Unnamed: 5
0,,Postcode,Borough,Neighbourhood,
1,,M1A,Not assigned,Not assigned,
2,,M2A,Not assigned,Not assigned,
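
The split on newlines above leaves an empty column on each side of the table, because every row's text starts and ends with a newline. As a minimal sketch, assuming row strings shaped like the ones returned by get_text() (the `row_text` list below is hypothetical sample data, not the scraped page), stripping each string before splitting avoids the empty columns, and the header row can be promoted and dropped in one pass:

```python
import pandas as pd

# Hypothetical row strings, shaped like the text that get_text() returns
# for each <tr> of the table (cells separated by newlines).
row_text = [
    "\nPostcode\nBorough\nNeighbourhood\n",
    "\nM1A\nNot assigned\nNot assigned\n",
    "\nM3A\nNorth York\nParkwoods\n",
]

# Strip the outer newlines first so that no empty edge columns are created,
# then split each row into one column per cell.
cells = pd.Series(row_text).str.strip().str.split("\n", expand=True)

# Promote the first row to column names, then drop it from the data.
cells.columns = cells.iloc[0].tolist()
postcodes = cells.iloc[1:].reset_index(drop=True)

print(postcodes)
```

The names `cells` and `postcodes` are illustrative only; the rest of the notebook keeps using the dfmamou dataframes.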


The header now appears twice: once as the column names and once as the row at index 0. We drop the row.

In [87]:
dfmamou3 = dfmamou2.drop(dfmamou2.index[0]) 

In [88]:
dfmamou3.head(3)

Unnamed: 0,Unnamed: 1,Postcode,Borough,Neighbourhood,Unnamed: 5
1,,M1A,Not assigned,Not assigned,
2,,M2A,Not assigned,Not assigned,
3,,M3A,North York,Parkwoods,


In this step, we remove all rows where Borough has a value of "Not assigned".

In [89]:
dfmamou4 = dfmamou3[dfmamou3.Borough != 'Not assigned']

In [90]:
dfmamou4.head()

Unnamed: 0,Unnamed: 1,Postcode,Borough,Neighbourhood,Unnamed: 5
3,,M3A,North York,Parkwoods,
4,,M4A,North York,Victoria Village,
5,,M5A,Downtown Toronto,Harbourfront,
6,,M5A,Downtown Toronto,Regent Park,
7,,M6A,North York,Lawrence Heights,


Next, we combine the neighbourhoods that share the same postal code and borough into a single row, separated by commas:

In [91]:
dfmamou5 = dfmamou4.groupby(['Postcode','Borough'], sort = False).agg(','.join)

In [92]:
dfmamou5.head()

Postcode,Borough,Neighbourhood
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,"Harbourfront,Regent Park"
M6A,North York,"Lawrence Heights,Lawrence Manor"
M7A,Queen's Park,Not assigned


In [93]:
dfmamou5.reset_index(inplace = True)
dfmamou5.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront,Regent Park"
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Queen's Park,Not assigned
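
To make the grouping step easier to follow, here is a small self-contained sketch on a few rows copied from the table above. It selects the Neighbourhood column explicitly before aggregating, which is an assumption about style rather than what the notebook does, and it joins with ", " for readability:

```python
import pandas as pd

# A few rows taken from the filtered table shown earlier.
sample = pd.DataFrame({
    "Postcode": ["M3A", "M5A", "M5A", "M6A", "M6A"],
    "Borough": ["North York", "Downtown Toronto", "Downtown Toronto",
                "North York", "North York"],
    "Neighbourhood": ["Parkwoods", "Harbourfront", "Regent Park",
                      "Lawrence Heights", "Lawrence Manor"],
})

# Group on postal code and borough, keep the original row order (sort=False),
# and join the neighbourhood names of each group into one string.
combined = (
    sample.groupby(["Postcode", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)

print(combined)
```

Selecting the column before calling agg keeps the join from being applied to any other string columns that happen to be in the dataframe.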


Next, we replace every remaining "Not assigned" value, such as the neighbourhood of M7A, with a Queen's Park label:

In [94]:
dfmamou6 = dfmamou5.replace("Not assigned","Queens's park")

In [95]:
dfmamou6.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront,Regent Park"
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Queen's Park,Queens's park
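
The replacement above hard-codes the new value. A possible alternative, sketched here on two sample rows, copies the borough name into the Neighbourhood column wherever the neighbourhood is "Not assigned". This is a different technique from the replace() call in the notebook, and the `sample` and `unassigned` names are illustrative:

```python
import pandas as pd

# Two rows shaped like the grouped table: one normal, one unassigned.
sample = pd.DataFrame({
    "Postcode": ["M6A", "M7A"],
    "Borough": ["North York", "Queen's Park"],
    "Neighbourhood": ["Lawrence Heights,Lawrence Manor", "Not assigned"],
})

# Boolean mask of the rows whose neighbourhood is still unassigned.
unassigned = sample["Neighbourhood"] == "Not assigned"

# Copy the borough name into those rows only; other rows are untouched.
sample.loc[unassigned, "Neighbourhood"] = sample.loc[unassigned, "Borough"]

print(sample)
```

Because the value comes from the Borough column, the same two lines would keep working if another borough had an unassigned neighbourhood.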


In [96]:
dfmamou6.shape

(103, 3)

## Creating the dataframe from the assignment

First, let's rename the column that we will use to merge the two dataframes, so that it is called "Postal Code" in both.

In [98]:
dfmamou7 = dfmamou6 
df3 = dfmamou7.rename(columns={"Postcode":"Postal Code"})
df3.head(3)

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront,Regent Park"


Second, let's use the geocoding CSV data provided in the assignment. 

In [100]:
dfgeo = pd.read_csv('http://cocl.us/Geospatial_data')
dfgeo.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Third, let's examine the column names to verify that we can join the two dataframes. "Postal Code" appears in both, so it is a match!

In [101]:
dfgeo.columns

Index(['Postal Code', 'Latitude', 'Longitude'], dtype='object')

In [102]:
df3.columns

Index(['Postal Code', 'Borough', 'Neighbourhood'], dtype='object')

Final step: we merge the two dataframes with an inner join, which keeps only the postal codes that appear in both dataframes.

In [112]:
df4 = pd.merge(df3, dfgeo, on='Postal Code', how ='inner')
df4

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Harbourfront,Regent Park",43.654260,-79.360636
3,M6A,North York,"Lawrence Heights,Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,Queens's park,43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
7,M3B,North York,Don Mills North,43.745906,-79.352188
8,M4B,East York,"Woodbine Gardens,Parkview Hill",43.706397,-79.309937
9,M5B,Downtown Toronto,"Ryerson,Garden District",43.657162,-79.378937


# Phase II: Neighborhood visualization

In [109]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library

print('Libraries imported.')

Fetching package metadata .............
Solving package specifications: .

Package plan for installation in environment /opt/conda/envs/DSX-Python35:

The following NEW packages will be INSTALLED:

    altair:  2.2.2-py35_1 conda-forge
    branca:  0.3.1-py_0   conda-forge
    folium:  0.5.0-py_0   conda-forge
    vincent: 0.4.4-py_1   conda-forge

Libraries imported.


#### Let's create a map to visualize Toronto neighborhoods

In [110]:
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="on_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


#### Next, we create a map of the city with the neighborhoods superimposed, using their latitude and longitude values

In [115]:

map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df4['Latitude'], df4['Longitude'], df4['Borough'], df4['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto
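
The map above draws every marker in the same blue. As one possible extension, and only a sketch, the matplotlib modules imported earlier (cm and colors) could assign one color per borough. The `boroughs` list below is a small sample taken from the merged table, and `borough_colors` is an illustrative name:

```python
import matplotlib.cm as cm
import matplotlib.colors as colors
import numpy as np

# Sample borough names taken from the merged table.
boroughs = ["North York", "Downtown Toronto", "Queen's Park", "Etobicoke"]

# Evenly spaced positions along the colormap, one per borough.
positions = np.linspace(0, 1, len(boroughs))

# Convert each RGBA color from the rainbow colormap to a hex string.
borough_colors = {
    borough: colors.rgb2hex(cm.rainbow(position))
    for borough, position in zip(boroughs, positions)
}

print(borough_colors)
```

Inside the marker loop, `borough_colors[borough]` could then be passed as the `color` and `fill_color` arguments of folium.CircleMarker in place of the fixed values.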