<a href="https://colab.research.google.com/github/marienbaptiste/IBM-Capstone/blob/master/Toronto.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 align="center">Segmenting and Clustering Neighborhoods in Toronto</h1>


*The whole assignment is in this notebook.*



## Part 1

In [0]:
#Libraries
import requests
import pandas as pd
import numpy as np

**1.   Scraping data with BeautifulSoup**

In [0]:
# Make the request to a url
r = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

# Create soup from content of request
c = r.content

from bs4 import BeautifulSoup

soup = BeautifulSoup(c, 'html.parser')  # soup is now the parsed HTML; naming the parser avoids a bs4 warning

In [0]:
# Inspecting the page source, we can see that our data is wrapped in a "wikitable sortable jquery-tablesorter" table; let's isolate it
neigh_table = soup.find('table',{'class':'wikitable sortable'})

# Each table row is a <tr> whose data cells are <td> elements
table_rows = neigh_table.find_all('tr')

# Collect the stripped cell text of every row in a list called data
data = []
for row in table_rows:
    data.append([t.text.strip() for t in row.find_all('td')])
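
To make the effect of this loop concrete, here is a minimal sketch of the same extraction run on a small hand-written table. The HTML string, its values, and the `sample_*` names are made up for illustration (this is not the Wikipedia page), and `html.parser` is chosen only because it ships with Python. Note that the header row contains only `<th>` cells, so it contributes an empty list.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table (illustration only)
html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td> Not assigned </td></tr>
  <tr><td>M3A</td><td>North York</td><td> Parkwoods </td></tr>
</table>
"""

sample_soup = BeautifulSoup(html, 'html.parser')
sample_table = sample_soup.find('table', {'class': 'wikitable sortable'})

# One list per <tr>; only <td> cells are collected,
# so the header row (which has <th> cells only) yields an empty list
sample_rows = [[td.text.strip() for td in tr.find_all('td')]
               for tr in sample_table.find_all('tr')]

print(sample_rows)
```

The empty first entry is the artifact that gets filtered out once the rows are loaded into a dataframe.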

**2.   The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood**

In [29]:
df = pd.DataFrame(data, columns=['PostalCode', 'Borough', 'Neighbourhood'])

# The header row of the table has no <td> cells, so it produced an empty list,
# which pandas turns into a row of missing values. Count the non-null postal codes:
print(df['PostalCode'].notnull().value_counts())

df = df[df['PostalCode'].notnull()]  # Filter out the artifact at the beginning (empty row)

# Check the cleaned-up result
print(df.shape)
df.head()

True     287
False      1
Name: PostalCode, dtype: int64
(287, 3)


Unnamed: 0,PostalCode,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,Harbourfront
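
The single `False` in the count above comes from the empty list produced by the header row. As a rough sketch of that behaviour on made-up data (the `sample_*` names and values are illustrative only), a short row is padded with missing values when the dataframe is built, and `dropna` is shown as one possible alternative to the boolean filter used above:

```python
import pandas as pd

# Illustrative rows: the first one is empty, like the scraped header row
sample_rows = [
    [],
    ['M1A', 'Not assigned', 'Not assigned'],
    ['M3A', 'North York', 'Parkwoods'],
]

sample_df = pd.DataFrame(sample_rows,
                         columns=['PostalCode', 'Borough', 'Neighbourhood'])

# The empty row is padded with missing values in every column
n_missing = sample_df['PostalCode'].isnull().sum()

# Alternative to boolean filtering: drop rows whose PostalCode is missing
cleaned = sample_df.dropna(subset=['PostalCode'])

print(n_missing)
print(cleaned.shape)
```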


**3.   Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.**

In [30]:
# Dropping the rows whose "Borough" is the string "Not assigned"

# Collect the index labels of those rows
indexNa = df[ df['Borough'] == 'Not assigned' ].index

# Delete these rows from the dataframe
df.drop(indexNa, inplace=True)

# Reset the index so that it starts at 0 again
df.reset_index(drop=True, inplace=True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor
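
The same filtering can also be written as a single boolean mask. This is only a sketch of an equivalent approach on made-up data (the `sample_df` and `kept` names are illustrative), not a change to the cell above:

```python
import pandas as pd

# Illustrative data only
sample_df = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# Keep only the rows with an assigned borough, then renumber the index from 0
kept = sample_df[sample_df['Borough'] != 'Not assigned'].reset_index(drop=True)

print(kept)
```

This avoids `inplace=True` and returns a new dataframe, which some readers find easier to follow.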


**4&5.   If more than one neighborhood exists for a postal code, the neighborhoods are combined into one row, separated by commas**

In [31]:
# Count the rows whose postal code is shared by more than one neighbourhood
def find_dupes():
    return str(df[df.groupby('PostalCode')['Neighbourhood'].transform('nunique') > 1].shape[0])

print("Number of duplicates before processing: " + find_dupes())  # 163 rows share a postal code

# Aggregate the neighbourhoods sharing a PostalCode and Borough into one comma-separated string
df = df.groupby(['PostalCode','Borough'], sort=False).agg(', '.join)
df.reset_index(inplace=True)  # Move PostalCode and Borough from the index back into columns

# Check the result
print("Number of duplicates after processing: " + find_dupes())  # 0 duplicates left, nice!
df.head()

Number of duplicates before processing: 163
Number of duplicates after processing: 0


Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Queen's Park,Not assigned
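
To illustrate the aggregation in isolation, here is a minimal sketch on three made-up rows (the `sample_df` and `combined` names are illustrative). It selects the `Neighbourhood` column explicitly before joining, which is a small variation on the cell above, and `sort=False` keeps the groups in order of first appearance:

```python
import pandas as pd

# Illustrative data only: M6A appears twice with different neighbourhoods
sample_df = pd.DataFrame({
    'PostalCode': ['M5A', 'M6A', 'M6A'],
    'Borough': ['Downtown Toronto', 'North York', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Lawrence Heights', 'Lawrence Manor'],
})

# Join the neighbourhoods of each (PostalCode, Borough) group with ', '
combined = (sample_df
            .groupby(['PostalCode', 'Borough'], sort=False)['Neighbourhood']
            .agg(', '.join)
            .reset_index())

print(combined)
```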


**6.  When a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.**

In [32]:
# Where the neighbourhood is 'Not assigned', use the borough; otherwise keep the neighbourhood
df['Neighbourhood'] = np.where(df['Neighbourhood'] == 'Not assigned', df['Borough'], df['Neighbourhood'])
df.head()  # M7A now has the borough name as its neighbourhood

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
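
As a small sketch of the conditional replacement on made-up data (the `sample_df`, `via_where`, and `via_mask` names are illustrative), `np.where` picks the borough wherever the condition holds. `Series.mask` is shown as one possible pandas-only equivalent:

```python
import numpy as np
import pandas as pd

# Illustrative data only
sample_df = pd.DataFrame({
    'Borough': ['North York', "Queen's Park"],
    'Neighbourhood': ['Parkwoods', 'Not assigned'],
})

unassigned = sample_df['Neighbourhood'] == 'Not assigned'

# np.where(condition, value_if_true, value_if_false), evaluated element by element
via_where = np.where(unassigned, sample_df['Borough'], sample_df['Neighbourhood'])

# Equivalent with pandas: replace the values where the condition is True
via_mask = sample_df['Neighbourhood'].mask(unassigned, sample_df['Borough'])

print(list(via_where))
print(via_mask.tolist())
```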
