<a href="https://colab.research.google.com/github/marienbaptiste/IBM-Capstone/blob/master/Toronto.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 align="center">Segmenting and Clustering Neighborhoods in Toronto</h1>


*The whole assignment is in this notebook*



##Part 1

In [0]:
#Libraries
import requests
import pandas as pd
import numpy as np

**1.   Scraping data with BeautifulSoup**

In [0]:
# Make the request to a url
r = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

# Create soup from content of request
c = r.content

from bs4 import BeautifulSoup

soup = BeautifulSoup(c, 'html.parser') #soup is now the parsed HTML document

In [0]:
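# Inspecting the page source, we can see that our data is wrapped in a table with class "wikitable sortable"
# (the browser shows an extra "jquery-tablesorter" class, but it is absent from the downloaded HTML); let's isolate that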
neigh_table = soup.find('table',{'class':'wikitable sortable'})

# Our data in <td> is nested into <tr>
table_rows = neigh_table.find_all('tr')

#Let's append it all in a list called data
data = []
for row in table_rows:
    data.append([t.text.strip() for t in row.find_all('td')])
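
To see why the list ends up with an empty first entry, here is a minimal sketch on a hand-written HTML snippet. The snippet is an assumption made up for illustration, not the actual Wikipedia markup: its header row holds `<th>` cells, so `find_all('td')` returns nothing for that row.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the postal code table, for illustration only
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td> Not assigned </td></tr>
  <tr><td>M3A</td><td>North York</td><td> Parkwoods </td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_table = sample_soup.find('table', {'class': 'wikitable sortable'})

# One list per <tr>; the header row has no <td>, so it yields an empty list
sample_rows = [
    [cell.text.strip() for cell in row.find_all('td')]
    for row in sample_table.find_all('tr')
]
print(sample_rows)
```

That empty list is the artifact row filtered out in the next step.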

**2.   The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood**

In [4]:
df = pd.DataFrame(data, columns=['PostalCode', 'Borough', 'Neighbourhood'])

#From a first inspection, it looks like we created rows with empty values
print((~df['PostalCode'].isnull()).value_counts())

df = df[~df['PostalCode'].isnull()]  #Filter the artifact at the beginning (empty row)

#Let's check the cleaned up result
print(df.shape)
df.head()

True     287
False      1
Name: PostalCode, dtype: int64
(287, 3)


Unnamed: 0,PostalCode,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront


**3.   Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.**

In [5]:
#Dropping the rows in "Borough" containing the string "Not assigned"

#Building the condition
indexNa = df[ df['Borough'] == 'Not assigned' ].index
 
#Delete these row indexes from dataFrame
df.drop(indexNa, inplace=True)

#Reset the index to 0
df.reset_index(drop=True, inplace=True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor
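
A shorter variant of the same step, sketched on a toy frame (the rows are copied from the outputs above): keep the rows we want with a boolean mask instead of collecting the indexes to drop.

```python
import pandas as pd

toy = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Keep only rows with an assigned borough, then renumber the index from 0
toy_assigned = toy[toy['Borough'] != 'Not assigned'].reset_index(drop=True)
print(toy_assigned)
```

This returns a new dataframe rather than modifying the original in place, which is a matter of taste here.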


**4&5.   If more than one neighborhood exists per postal code, it should be combined with a comma**

In [6]:
#Counting the rows whose PostalCode maps to more than one distinct Neighbourhood
def find_dupes():
    return str(df[df.groupby('PostalCode')['Neighbourhood'].transform('nunique') > 1].shape[0])

print ("Number of duplicates before processing: " + find_dupes()) #163 duplicates found

#Proceed to aggregate neighborhood sharing PostalCode
df=df.groupby(['PostalCode','Borough'], sort=False).agg(', '.join)
df.reset_index(inplace=True) #Reset the index to 0

#Checking
print ("Number of duplicates after processing: " + find_dupes()) #0 duplicate found, nice!
df.head() #M5A doesn't have any duplicate, this is confirmed when inspecting the Wikipedia page

Number of duplicates before processing: 163
Number of duplicates after processing: 0


Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Queen's Park,Not assigned
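
To make the aggregation easier to follow, here is a small sketch on three rows taken from the tables above. It selects the `Neighbourhood` column explicitly before joining, which is an assumption about style rather than a change in the result.

```python
import pandas as pd

toy = pd.DataFrame({
    'PostalCode': ['M5A', 'M6A', 'M6A'],
    'Borough': ['Downtown Toronto', 'North York', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Lawrence Heights', 'Lawrence Manor'],
})

# One row per (PostalCode, Borough); neighbourhood names are joined with ", "
toy_combined = (
    toy.groupby(['PostalCode', 'Borough'], sort=False)['Neighbourhood']
       .agg(', '.join)
       .reset_index()
)
print(toy_combined)
```

`sort=False` keeps the postal codes in their order of first appearance, as in the cell above.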


**6.  When a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.**

In [7]:
#Conditional return
df['Neighbourhood'] = np.where(df['Neighbourhood'] == 'Not assigned',df['Borough'], df['Neighbourhood'])
df.head() #Observing that M7A got the right value

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
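
For comparison, the same replacement can be sketched with pandas alone, using `Series.mask` in place of `np.where`. The toy rows are taken from the table above.

```python
import pandas as pd

toy = pd.DataFrame({
    'PostalCode': ['M6A', 'M7A'],
    'Borough': ['North York', "Queen's Park"],
    'Neighbourhood': ['Lawrence Heights, Lawrence Manor', 'Not assigned'],
})

# Where the neighbourhood is "Not assigned", take the borough name instead
not_assigned = toy['Neighbourhood'] == 'Not assigned'
toy['Neighbourhood'] = toy['Neighbourhood'].mask(not_assigned, toy['Borough'])
print(toy)
```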


**7.  In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe**

In [8]:
#OK
df.shape

(103, 3)

##Part 2

In [9]:
#Libraries
!pip install geocoder #No conda on Google Colab...
import geocoder

#Utility function
def get_loc(postal_code, max_tries=5):
    # Query the geocoder until it returns coordinates, giving up after
    # max_tries attempts so that a failing API cannot loop forever
    for _ in range(max_tries):
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng
        if lat_lng_coords is not None:
            print(lat_lng_coords)
            latitude, longitude = lat_lng_coords
            return latitude, longitude
    # No coordinates were returned after max_tries attempts
    return None, None


**1.   Import Geocodes**

In [10]:
#We built the function above, we just have to iterate
#for i in range(0,len(df)):
#    df['Latitude'][i],df['Longitude'][i]=get_loc(df.iloc[i]['PostalCode'])

#Falling back to the CSV file, as the API is broken
lat_lon_data = pd.read_csv('https://cocl.us/Geospatial_data')
lat_lon_data.head() #We can observe that we must merge the tables

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
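
Had the geocoder worked, the commented-out loop would still need the `Latitude` and `Longitude` columns to exist before writing to them. Here is a minimal sketch of one way to fill both columns. `fake_get_loc` is a hypothetical stand-in for the real lookup, fed with two coordinates from the CSV preview above.

```python
import pandas as pd

# Hypothetical stand-in for a geocoder call (no network access needed)
FAKE_COORDS = {
    'M1B': (43.806686, -79.194353),
    'M1C': (43.784535, -79.160497),
}

def fake_get_loc(postal_code):
    return FAKE_COORDS[postal_code]

toy = pd.DataFrame({'PostalCode': ['M1B', 'M1C']})

# Look up each postal code once, then split the (lat, lng) pairs into columns
coords = [fake_get_loc(code) for code in toy['PostalCode']]
toy['Latitude'] = [lat for lat, _ in coords]
toy['Longitude'] = [lng for _, lng in coords]
print(toy)
```

Assigning whole columns at once also avoids the chained `df['Latitude'][i] = ...` pattern, which pandas may not apply to the original dataframe.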


**2.   Fixing before join**

In [11]:
#Postal Code in the newly imported csv is called PostalCode in our Dataframe, let's rename that
lat_lon_data.rename(columns={'Postal Code':'PostalCode'},inplace=True)
lat_lon_data.head() #The key column is now named PostalCode, matching our dataframe

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


**3.   Merging**

In [12]:
#And now the merge, using an inner join (the default, but we make it explicit)
df_merged = pd.merge(df,lat_lon_data,on='PostalCode', how='inner')
#Checking, rows should be preserved
df_merged.shape #103, the row count is unchanged

(103, 5)
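
Comparing shapes confirms that no row was lost. If we wanted to know which postal codes failed to match, a left join with `indicator=True` is one option. This is only a sketch on toy data: `M9Z` and its borough are made up so that one row has no coordinates.

```python
import pandas as pd

left = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M9Z'],  # M9Z is a made-up code
    'Borough': ['North York', 'North York', 'Made-up borough'],
})
right = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# validate raises if PostalCode is duplicated on either side;
# indicator adds a _merge column telling where each row came from
checked = pd.merge(left, right, on='PostalCode', how='left',
                   validate='one_to_one', indicator=True)
unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched['PostalCode'].tolist())
```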

**4.   Output**

In [13]:
#Final result
df_merged.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494


##Part 3

In [20]:
#Libraries

!pip install geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

!pip install folium
import folium # mapping library



**1.   Explore and cluster the neighborhoods in Toronto. You can decide to work with only boroughs that contain the word Toronto.**

In [14]:
#Generating a subset of our dataframe based on filter
df_toronto = df_merged[df_merged['Borough'].str.contains('Toronto')]
df_toronto.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
9,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,M4E,East Toronto,The Beaches,43.676357,-79.293031
20,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
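
One thing to keep in mind: the filtered dataframe keeps the original index (2, 9, 15, ...) and is a slice of `df_merged`. If columns are added to it later, taking an explicit copy and resetting the index may be safer. A minimal sketch on toy data, where `ClusterLabel` is a hypothetical column name:

```python
import pandas as pd

toy = pd.DataFrame({
    'Borough': ['North York', 'Downtown Toronto', "Queen's Park", 'East Toronto'],
})

# Filter, copy, and renumber so later column assignments act on an independent frame
toy_toronto = toy[toy['Borough'].str.contains('Toronto')].copy().reset_index(drop=True)

# Hypothetical extra column, added without touching the original frame
toy_toronto['ClusterLabel'] = 0
print(toy_toronto)
```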


**2.   Replicate the same analysis we did on the New York City data. It is up to you. First we shall plot the neighborhoods**

In [18]:
#Let's get the geographical coordinates of Toronto
address = 'Toronto'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.
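
If the Nominatim service is unavailable, a rough fallback is to center the map on the mean of the neighbourhood coordinates. This is only a sketch, using the five rows previewed above rather than the full dataframe.

```python
import pandas as pd

toy = pd.DataFrame({
    'Latitude': [43.65426, 43.657162, 43.651494, 43.676357, 43.644771],
    'Longitude': [-79.360636, -79.378937, -79.375418, -79.293031, -79.373306],
})

# The mean of the points is a reasonable map center when no geocoder is at hand
center = [toy['Latitude'].mean(), toy['Longitude'].mean()]
print(center)
```

The resulting pair can be passed as the `location` argument of `folium.Map` in place of the geocoded values.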


In [28]:
#Visualizing all the Neighbourhoods of the above data frame using Folium

#Create a map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

#Add markers to map
for lat, lng, borough, neighbourhood in zip(df_toronto['Latitude'], df_toronto['Longitude'], df_toronto['Borough'], df_toronto['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
    
map_toronto