## Toronto Neighbourhoods Segmentation and Clustering

<img src="https://typicalbritto.files.wordpress.com/2015/03/mapa-de-dosbarrios-11.jpg" alt="Toronto Neighborhoods" align="left">

<p><strong>Step 1: Scrape the Wikipedia page <a href="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M">https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M</a> to obtain the table of postal codes, and load that table into a pandas dataframe.</strong></p>

In [None]:
#Install the packages if required (remove #)
#conda install -c conda-forge lxml
#conda install -c anaconda beautifulsoup4

In [1]:
import pandas as pd

page = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
# read_html returns a list of DataFrames, one per table that matches attrs
wikitables = pd.read_html(page, index_col=0, attrs={"class": "wikitable"})

print("Extracted {num} wikitables".format(num=len(wikitables)))

In [2]:
toronto_postal_codes = wikitables[0]
toronto_postal_codes.head()

In [3]:
toronto_postal_codes.shape

<p><strong>Step 2: Processing the dataframe according to the assignment instructions below:</strong></p>
<ul>
<li>The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood</li>
</ul>

In [4]:
toronto_postal_codes.reset_index(inplace=True)

In [5]:
toronto_postal_codes.columns

<ul>
<li>Only process the cells that have an assigned borough. Ignore cells with a borough that is&nbsp;<strong>Not assigned.</strong></li>
</ul>

In [6]:
# Index labels of the rows whose borough is 'Not assigned'
condition = toronto_postal_codes[toronto_postal_codes['Borough'] == 'Not assigned'].index

# Drop those rows from the dataframe
toronto_postal_codes.drop(condition, inplace=True)

In [7]:
toronto_postal_codes.head()
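
The same filter can also be written as a boolean mask, without collecting index labels first. The following is a minimal sketch on a small illustrative dataframe; the rows are toy data chosen for the example, and the names `sample` and `assigned` are hypothetical, not part of the notebook.

```python
import pandas as pd

# Toy rows for illustration only (not read from the Wikipedia table)
sample = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only the rows whose borough is assigned, then renumber the index
assigned = sample[sample["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```

Unlike `drop(..., inplace=True)`, this returns a new dataframe and leaves `sample` untouched, which makes the cell safe to re-run.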

<ul>
<li>More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that&nbsp;<strong>M5A</strong>&nbsp;is listed twice and has two neighborhoods:&nbsp;<strong>Harbourfront&nbsp;</strong>and&nbsp;<strong>Regent Park</strong>. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in&nbsp;<strong>row 11&nbsp;</strong>in the above table.</li>
</ul>

In [8]:
# Combine rows that share a postal code and borough, joining their neighbourhood names with ', '
toronto_postal_codes = toronto_postal_codes.groupby(['Postcode','Borough'])['Neighbourhood'].apply(', '.join).reset_index()

In [9]:
toronto_postal_codes.head()
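
To see what the grouping step does in isolation, here is a small sketch on toy data. The M5A neighbourhoods come from the assignment text; the borough names and the M6A row are illustrative assumptions, and `sample` and `combined` are hypothetical names.

```python
import pandas as pd

# Toy rows: M5A appears twice, once per neighbourhood
sample = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M6A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Lawrence Heights"],
})

# One row per (Postcode, Borough); neighbourhood names are joined with ', '
combined = (
    sample.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .apply(", ".join)
    .reset_index()
)
print(combined)
```

The two M5A rows collapse into a single row whose neighbourhood is `Harbourfront, Regent Park`, while M6A is left as it was.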

<ul>
<li>If a cell has a borough but a&nbsp;<strong>Not assigned&nbsp;</strong>neighborhood, then the neighborhood will be the same as the borough. So for the&nbsp;<strong>9th</strong>&nbsp;cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be&nbsp;<strong>Queen's Park.</strong></li>
</ul>

In [10]:
condition = toronto_postal_codes[ toronto_postal_codes['Neighbourhood'] == 'Not assigned' ].index

In [11]:
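# Assign with a single .loc[row, column] call; chained indexing (.loc[i]['Neighbourhood'])
# may write to a temporary copy and is not guaranteed to update the dataframe
for i in condition:
    toronto_postal_codes.loc[i, 'Neighbourhood'] = toronto_postal_codes.loc[i, 'Borough']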

In [12]:
toronto_postal_codes.loc[(toronto_postal_codes['Postcode'] == 'M7A')]
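
The replacement can also be done without a loop by assigning through a boolean mask. This is a sketch on toy data rather than the notebook's dataframe. The M7A row mirrors the Queen's Park case described above, the second row is illustrative, and `sample` and `mask` are hypothetical names.

```python
import pandas as pd

# Toy rows: M7A has a borough but no assigned neighbourhood
sample = pd.DataFrame({
    "Postcode": ["M7A", "M5A"],
    "Borough": ["Queen's Park", "Downtown Toronto"],
    "Neighbourhood": ["Not assigned", "Harbourfront, Regent Park"],
})

# True for the rows whose neighbourhood still needs a value
mask = sample["Neighbourhood"] == "Not assigned"

# Copy the borough into the neighbourhood column for those rows only
sample.loc[mask, "Neighbourhood"] = sample.loc[mask, "Borough"]
print(sample)
```

Because the assignment goes through one `.loc[rows, column]` call, it writes to the dataframe itself rather than to a temporary copy.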

<ul>
<li>Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.</li>
<li>In the last cell of your notebook, use the&nbsp;<strong>.shape</strong>&nbsp;method to print the number of rows of your dataframe.</li>
</ul>

In [13]:
toronto_postal_codes.shape

<p><strong>Step 3: Now we need to get the latitude and the longitude coordinates of each neighborhood in order to utilize the Foursquare location data.</strong></p>

In [18]:
# Tried to use geocoder with no luck. Will use the CSV file offered by Coursera
import pandas as pd
df_coord = pd.read_csv("https://cocl.us/Geospatial_data")
df_coord.head()

In [24]:
# Attach the coordinates by matching Postcode against the CSV's 'Postal Code' column
toronto_postal_codes = toronto_postal_codes.join(df_coord.set_index('Postal Code'), on='Postcode')

In [25]:
toronto_postal_codes.head()
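
After combining the two tables, it is worth checking that every postal code actually received coordinates. The sketch below does this with `merge` on toy data. The `Latitude` and `Longitude` column names are assumed to match the CSV, the coordinate values are placeholders, and `codes`, `coords`, `merged`, and `missing` are hypothetical names.

```python
import pandas as pd

codes = pd.DataFrame({
    "Postcode": ["M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Queen's Park"],
})
coords = pd.DataFrame({
    "Postal Code": ["M7A", "M5A"],   # deliberately in a different order
    "Latitude": [2.0, 1.0],          # placeholder values, not real coordinates
    "Longitude": [-2.0, -1.0],       # placeholder values, not real coordinates
})

# Left merge keeps every row of `codes`, matched on the differently named key columns
merged = codes.merge(coords, left_on="Postcode", right_on="Postal Code", how="left")
merged = merged.drop(columns="Postal Code")

# Count the rows that found no match in the coordinates table
missing = merged["Latitude"].isna().sum()
print(merged)
print("Rows without coordinates:", missing)
```

A non-zero count would indicate postal codes that are absent from the coordinates file and would need to be handled before any location lookups.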

<p><strong>Step 4: Finally, let's explore and cluster the neighborhoods in Toronto.</strong></p>