## Toronto Neighbourhoods Segmentation and Clustering

<img src="https://typicalbritto.files.wordpress.com/2015/03/mapa-de-dosbarrios-11.jpg" alt="Toronto Neighborhoods" align="left">

<p><strong>Step 1: Scrape the Wikipedia page <a href="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M">https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M</a> to extract its table of postal codes and load the data into a pandas dataframe.</strong></p>

In [None]:
# Install the HTML parsers that pandas.read_html relies on
%conda install -c conda-forge lxml
%conda install -c anaconda beautifulsoup4

In [17]:
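import pandas as pd

# Wikipedia page listing the Canadian postal codes that begin with "M" (Toronto)
page = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
# Parse every table with class "wikitable"; the first column becomes the index
wikitables = pd.read_html(page, index_col=0, attrs={"class": "wikitable"})

print("Extracted {num} wikitables".format(num=len(wikitables)))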

Extracted 1 wikitables


In [18]:
toronto_postal_codes = wikitables[0]
toronto_postal_codes.head()

Unnamed: 0_level_0,Borough,Neighbourhood
Postcode,Unnamed: 1_level_1,Unnamed: 2_level_1
M1A,Not assigned,Not assigned
M2A,Not assigned,Not assigned
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,Harbourfront


In [19]:
toronto_postal_codes.shape

(288, 2)

<p><strong>Step 2: Processing the dataframe according to the assignment instructions below:</strong></p>
<ul>
<li>The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood</li>
</ul>

In [23]:
toronto_postal_codes.reset_index(inplace=True)

In [24]:
toronto_postal_codes.columns

Index(['Postcode', 'Borough', 'Neighbourhood'], dtype='object')

<ul>
<li>Only process the cells that have an assigned borough. Ignore cells with a borough that is&nbsp;<strong>Not assigned.</strong></li>
</ul>

In [26]:
# Collect the index labels of the rows whose borough is 'Not assigned'
condition = toronto_postal_codes[toronto_postal_codes['Borough'] == 'Not assigned'].index

# Drop those rows from the dataframe in place
toronto_postal_codes.drop(condition, inplace=True)

In [28]:
toronto_postal_codes.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
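
The same filtering step can also be written as a single boolean mask. The sketch below is only an illustration on a small hand-built frame (the `sample` and `assigned` names are assumptions, and the rows are copied from the preview of the scraped table); it is not the cell that produced the output above.

```python
import pandas as pd

# Toy stand-in for the scraped table; values copied from the earlier preview.
sample = pd.DataFrame({
    "Postcode": ["M1A", "M2A", "M3A", "M4A", "M5A"],
    "Borough": ["Not assigned", "Not assigned", "North York",
                "North York", "Downtown Toronto"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Parkwoods",
                      "Victoria Village", "Harbourfront"],
})

# Keep only the rows with an assigned borough, then renumber the index from 0.
assigned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```

Unlike `drop(..., inplace=True)`, this version returns a new dataframe and leaves the original untouched, which can make it easier to re-run a cell safely.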


<ul>
<li>More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that&nbsp;<strong>M5A</strong>&nbsp;is listed twice and has two neighborhoods:&nbsp;<strong>Harbourfront&nbsp;</strong>and&nbsp;<strong>Regent Park</strong>. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in&nbsp;<strong>row 11&nbsp;</strong>in the above table.</li>
</ul>

In [45]:
toronto_postal_codes = toronto_postal_codes.groupby(['Postcode','Borough'])['Neighbourhood'].apply(', '.join).reset_index()

In [46]:
toronto_postal_codes.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
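
To see what the `groupby` step does in isolation, here is a minimal sketch on three hand-written rows taken from the earlier preview (the `rows` and `merged` names are assumptions made for this illustration). Rows that share a postal code and borough collapse into one row, with their neighbourhoods joined by a comma.

```python
import pandas as pd

# Three rows from the earlier preview: M5A appears twice.
rows = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M6A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Lawrence Heights"],
})

# Group on (Postcode, Borough) and join each group's neighbourhood names.
merged = (
    rows.groupby(["Postcode", "Borough"])["Neighbourhood"]
        .agg(", ".join)
        .reset_index()
)
print(merged)
```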


<ul>
<li>If a cell has a borough but a&nbsp;<strong>Not assigned&nbsp;</strong>neighborhood, then the neighborhood will be the same as the borough. So for the&nbsp;<strong>9th</strong>&nbsp;cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be&nbsp;<strong>Queen's Park.</strong></li>
</ul>

In [56]:
condition = toronto_postal_codes[ toronto_postal_codes['Neighbourhood'] == 'Not assigned' ].index

In [68]:
for i in condition:
    # Use a single .loc[row, column] call; chained indexing may assign to a temporary copy
    toronto_postal_codes.loc[i, 'Neighbourhood'] = toronto_postal_codes.loc[i, 'Borough']

In [73]:
toronto_postal_codes.loc[(toronto_postal_codes['Postcode'] == 'M7A')]

Unnamed: 0,Postcode,Borough,Neighbourhood
85,M7A,Queen's Park,Queen's Park
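
The loop can also be replaced by one vectorised assignment. The following is a sketch on a two-row toy frame (the `sample` and `unassigned` names are assumptions for illustration), showing the borough being copied into the neighbourhood column only where the neighbourhood is `Not assigned`.

```python
import pandas as pd

# Toy frame: one row with a missing neighbourhood, one complete row.
sample = pd.DataFrame({
    "Postcode": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods"],
})

# Boolean mask selecting the rows whose neighbourhood is missing.
unassigned = sample["Neighbourhood"] == "Not assigned"

# Copy the borough into the neighbourhood column for those rows only.
sample.loc[unassigned, "Neighbourhood"] = sample.loc[unassigned, "Borough"]
print(sample)
```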


<ul>
<li>Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.</li>
<li>In the last cell of your notebook, use the&nbsp;<strong>.shape</strong>&nbsp;method to print the number of rows of your dataframe.</li>
</ul>

In [63]:
toronto_postal_codes.shape

(103, 3)
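
Beyond the row count, a few quick checks can confirm that the cleaning rules held. The helper below, `check_postal_codes`, is a hypothetical function written for this sketch and is not part of the assignment or of pandas. It is shown on a two-row toy frame, though it could presumably be called on `toronto_postal_codes` as well.

```python
import pandas as pd

def check_postal_codes(df):
    """Hypothetical helper: basic sanity checks for the cleaned table."""
    return {
        # Each postal code should appear exactly once after the groupby step.
        "unique_postcodes": bool(df["Postcode"].is_unique),
        # No placeholder values should remain in either text column.
        "no_unassigned_borough": not (df["Borough"] == "Not assigned").any(),
        "no_unassigned_neighbourhood": not (df["Neighbourhood"] == "Not assigned").any(),
    }

# Toy frame standing in for the cleaned dataframe.
sample = pd.DataFrame({
    "Postcode": ["M1B", "M7A"],
    "Borough": ["Scarborough", "Queen's Park"],
    "Neighbourhood": ["Rouge, Malvern", "Queen's Park"],
})

checks = check_postal_codes(sample)
print(checks)
```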