# Part I:
## Segmenting and Clustering Neighborhoods in Toronto

In this assignment we will segment and cluster neighborhoods in Toronto. The first step is to scrape the Wikipedia page below and transform its table into a pandas dataframe.

#### Wikipedia page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M


In [2]:
# Install html-table-parser-python3 to scrape the page
!pip install html-table-parser-python3



In [3]:
import urllib.request  
from html_table_parser import HTMLTableParser
import pandas as pd 
  
# function to get the url content

def url_get_contents(url): 
    
    req = urllib.request.Request(url=url) 
    f = urllib.request.urlopen(req) 
  
    #reading contents of the website 
    return f.read() 
  
# Get the HTML content with the function
xhtml = url_get_contents('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').decode('utf-8') 
  
# Using HTMLTableParser
p = HTMLTableParser() 
p.feed(xhtml) 
  
# Load the first table into a pandas DataFrame and promote its first row to the header
df1 = pd.DataFrame(p.tables[0])
dfheader = df1.iloc[0]
df1 = df1[1:].copy()  # copy so later in-place edits do not act on a slice
df1.columns = dfheader
df1.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village
5,M5A,Downtown Toronto,"Regent Park, Harbourfront"
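
The header promotion used above can be tried without a network call. This is a minimal sketch, assuming the parser returns each table as a list of rows whose first row holds the column names; the `rows` sample is hypothetical and only mirrors a few lines of the output shown above.

```python
import pandas as pd

# Hypothetical sample shaped like p.tables[0]: the first row is the header.
rows = [
    ["Postal Code", "Borough", "Neighbourhood"],
    ["M1A", "Not assigned", "Not assigned"],
    ["M3A", "North York", "Parkwoods"],
    ["M5A", "Downtown Toronto", "Regent Park, Harbourfront"],
]

# Build the frame from the data rows and use the first row as column names.
sample = pd.DataFrame(rows[1:], columns=rows[0])
print(sample)
```

Passing `columns=` at construction avoids the separate slicing step and gives a frame whose index starts at 0 rather than 1.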


### Only process the cells that have an assigned borough; ignore cells whose borough is Not assigned.

In [4]:
# remove 'Not assigned' values from Borough column
indexNames = df1[ df1['Borough'] =='Not assigned'].index
df1.drop(indexNames , inplace=True)

# Where the neighbourhood is 'Not assigned', use the borough name instead
df1.loc[df1['Neighbourhood'] =='Not assigned' , 'Neighbourhood'] = df1['Borough']

# Combine rows that share a postal code and borough, joining neighbourhoods with ', '
result = df1.groupby(['Postal Code','Borough'], sort=False).agg( ', '.join)

df=result.reset_index()
df.head(15)

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"
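
The three cleaning steps in the cell above (drop unassigned boroughs, fall back to the borough name, join neighbourhoods per postal code) can be checked on a small frame. This is a sketch only; the `toy` data is hypothetical and is arranged so that each step has something to do.

```python
import pandas as pd

# Hypothetical rows: M1A has no borough, M5A appears twice, M7A lacks a neighbourhood.
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Downtown Toronto"],
    "Neighbourhood": ["Not assigned", "Regent Park", "Harbourfront", "Not assigned"],
})

# Step 1: keep only rows with an assigned borough.
toy = toy[toy["Borough"] != "Not assigned"].copy()

# Step 2: where the neighbourhood is missing, fall back to the borough name.
mask = toy["Neighbourhood"] == "Not assigned"
toy.loc[mask, "Neighbourhood"] = toy.loc[mask, "Borough"]

# Step 3: one row per postal code, neighbourhoods joined with ", ".
grouped = (
    toy.groupby(["Postal Code", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(grouped)
```

With `sort=False` the groups keep the order in which each postal code first appears.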



### If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. Checking to see if this condition exists in our dataframe

In [5]:
# Check whether "Not assigned" appears anywhere in the dataframe
df[df.eq("Not assigned").any(axis=1)]

Unnamed: 0,Postal Code,Borough,Neighbourhood


### In this case "Not assigned" no longer appears anywhere in the dataframe. Therefore we don't have a case where the dataframe has an entry with an assigned borough but a Not assigned neighborhood.
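
For contrast, here is a sketch of what the same check returns when the condition does exist. The `toy` frame is hypothetical; the row-wise test with `any(axis=1)` is the same as the one used on `df`.

```python
import pandas as pd

# Hypothetical frame in which one row still has a placeholder neighbourhood.
toy = pd.DataFrame({
    "Postal Code": ["M3A", "M7A"],
    "Borough": ["North York", "Downtown Toronto"],
    "Neighbourhood": ["Parkwoods", "Not assigned"],
})

# Row-wise check: True where any column equals "Not assigned".
has_placeholder = toy.eq("Not assigned").any(axis=1)
print(toy[has_placeholder])

# Apply the borough fallback, then repeat the check.
toy.loc[has_placeholder, "Neighbourhood"] = toy.loc[has_placeholder, "Borough"]
remaining = toy[toy.eq("Not assigned").any(axis=1)]
print(remaining.empty)
```

An empty result from the second check corresponds to the empty output shown for `df` above.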

In [6]:
# Replace "Not assigned" in the Neighbourhood column with the Borough value from the same row
mask = df['Neighbourhood'] == 'Not assigned'
df.loc[mask, 'Neighbourhood'] = df.loc[mask, 'Borough']
df

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


### Rows that have the same postal code will be combined into one row with the neighborhoods separated with a comma

In [7]:
df = df.groupby(['Postal Code', 'Borough'])['Neighbourhood'].apply(', '.join).reset_index()
df

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ..."
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."
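
The row order in this output differs from the earlier one (M1B now comes first instead of M3A) because `groupby` sorts the group keys by default, while the earlier cell passed `sort=False`. A small sketch on hypothetical data shows the difference:

```python
import pandas as pd

# Hypothetical rows, listed in page order rather than alphabetical order.
toy = pd.DataFrame({
    "Postal Code": ["M3A", "M1B", "M1B"],
    "Borough": ["North York", "Scarborough", "Scarborough"],
    "Neighbourhood": ["Parkwoods", "Malvern", "Rouge"],
})

keys = ["Postal Code", "Borough"]

# Default behaviour: group keys are sorted.
sorted_groups = toy.groupby(keys)["Neighbourhood"].apply(", ".join).reset_index()

# sort=False keeps the order in which each key first appears.
unsorted_groups = (
    toy.groupby(keys, sort=False)["Neighbourhood"].apply(", ".join).reset_index()
)

print(sorted_groups["Postal Code"].tolist())
print(unsorted_groups["Postal Code"].tolist())
```

Both calls produce the same rows; only their order changes.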


#### Using the shape attribute to confirm the number of rows of the dataframe

In [8]:
# See the dimensions of the dataframe
df.shape

(103, 3)

In [9]:
#Export dataframe result into a new .CSV file
df.to_csv('df1.csv')
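
Note that `to_csv` writes the row index as an unnamed first column by default, which shows up as `Unnamed: 0` when the file is read back. The sketch below illustrates this on a hypothetical one-row frame, writing to a string instead of a file; `index=False` is an optional alternative, not something the cell above uses.

```python
import io
import pandas as pd

# Hypothetical one-row frame with the same columns as df.
toy = pd.DataFrame({
    "Postal Code": ["M1B"],
    "Borough": ["Scarborough"],
    "Neighbourhood": ["Malvern, Rouge"],
})

# With no path argument, to_csv returns the CSV text as a string.
with_index = toy.to_csv()
without_index = toy.to_csv(index=False)

print(with_index.splitlines()[0])     # header begins with an empty field for the index
print(without_index.splitlines()[0])  # header lists only the three data columns

# Reading the first version back yields an extra "Unnamed: 0" column.
round_trip = pd.read_csv(io.StringIO(with_index))
print(list(round_trip.columns))
```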

# End of Part I.