# Segmenting and Clustering Neighborhoods in Toronto

This notebook is divided into three sections, as required by the assignment.

## SECTION 1

### Getting data from Wikipedia

First, we import the required libraries:

In [38]:
import pandas as pd
from bs4 import BeautifulSoup
import requests

Next, we use requests and BeautifulSoup to fetch the page HTML from Wikipedia:

In [39]:
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
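The request above assumes that the download succeeds. A minimal sketch of a more defensive fetch is shown here; the helper name `fetch_html` and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the response body as bytes (hypothetical helper, not from the notebook)."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError on 4xx/5xx responses,
    # so an error page is never parsed as if it were the real table.
    response.raise_for_status()
    return response.content


# Usage (needs network access):
# html = fetch_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
```

Failing early here makes a broken download easier to spot than an empty or missing table later on.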

Uncommenting and running the following line prints the full page HTML, which is long; we use it to find the class name of the table we want to parse:

In [None]:
#print(soup.prettify())

Inspecting that output lets us discover the class name: wikitable.  
Now we use the class name to select only that table:

In [40]:
tb = soup.find('table', class_='wikitable')
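As an alternative to reading the full prettified HTML, the class names of all tables can be listed directly. The sketch below runs on a small made-up HTML string (`sample_html` is a stand-in, not the real Wikipedia markup) and also shows that `class_='wikitable'` matches a table that carries additional classes.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup differs.
sample_html = """
<html><body>
<table class="navbox"><tr><td>navigation</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Postal Code</th></tr>
  <tr><td>M3A</td></tr>
</table>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Collect the class list of every <table> element on the page.
table_classes = [t.get("class", []) for t in sample_soup.find_all("table")]
print(table_classes)

# class_ matches if 'wikitable' is any one of the element's classes.
wikitable = sample_soup.find("table", class_="wikitable")
print(wikitable.find("td").get_text())
```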

Now tb holds only the table's HTML.  
We use pandas to turn the table content into a data frame:

In [41]:
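# Wrap the HTML in StringIO: recent pandas versions deprecate passing a literal HTML string
from io import StringIO
df = pd.read_html(StringIO(str(tb)))[0]
df.shape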

(180, 3)

Now we remove all the rows where Borough is 'Not assigned' and check how many rows remain:

In [42]:
df.drop(df[df.Borough == 'Not assigned'].index, inplace=True)
df.shape

(103, 3)
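The same filter can also be written without `inplace=True`, by keeping only the rows that pass a boolean mask. A small sketch on made-up rows (the `toy` frame and the name `kept` are assumptions for illustration):

```python
import pandas as pd

# Three illustrative rows in the same layout as the scraped table.
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep rows whose Borough is assigned; reset_index renumbers them from 0.
kept = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)
print(kept.shape)
```

Resetting the index is optional; the notebook keeps the original row labels, which is why the final data frame starts at index 2.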

Next, we rename the columns by removing the white space, so that 'Postal Code' becomes 'PostalCode':

In [43]:
df.columns = [c.replace(' ', '') for c in df.columns]
df.columns.values

array(['PostalCode', 'Borough', 'Neighborhood'], dtype=object)

We know that the data frame has 103 rows, so let's check that it also has 103 unique postal codes:

In [44]:
len(df.PostalCode.unique())

103
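If the count of unique codes had not matched the number of rows, the next step would be to find the repeated codes. A hedged sketch using pandas' `is_unique` and `duplicated` on a made-up series (the values are illustrative, not taken from the table):

```python
import pandas as pd

# Illustrative codes with one deliberate repeat.
codes = pd.Series(["M3A", "M4A", "M5A", "M4A"], name="PostalCode")

# is_unique is True only when no value occurs more than once.
print(codes.is_unique)

# keep=False flags every occurrence of a repeated value, not just the later ones.
repeated = codes[codes.duplicated(keep=False)]
print(repeated)
```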

Last check: see whether any neighborhood is still 'Not assigned':

In [45]:
len(df[df.Neighborhood=='Not assigned'])

0
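The count is zero here, so nothing needs fixing. Had some rows remained, one possible fallback (an assumption for illustration, not something this notebook does) would be to reuse the borough name as the neighborhood:

```python
import pandas as pd

# Made-up rows; the second one has no neighborhood assigned.
toy = pd.DataFrame({
    "Borough": ["North York", "Etobicoke"],
    "Neighborhood": ["Parkwoods", "Not assigned"],
})

missing = toy["Neighborhood"] == "Not assigned"

# mask() replaces values where the condition is True with the given alternative.
toy["Neighborhood"] = toy["Neighborhood"].mask(missing, toy["Borough"])
print(toy)
```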

**Here is the final data frame:**

In [46]:
df

|  | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| ... | ... | ... | ... |
| 160 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North |
| 165 | M4Y | Downtown Toronto | Church and Wellesley |
| 168 | M7Y | East Toronto | Business reply mail Processing Centre |
| 169 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... |


**Here is the shape of the data frame:**

In [47]:
df.shape

(103, 3)

## SECTION 2

### Getting Latitude and Longitude

We use the provided CSV file because the other possible sources are unstable. The easiest way first!

In [33]:
url2='http://cocl.us/Geospatial_data'
ll=pd.read_csv(url2)
ll.head()

|  | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


Now a quick check that this file has the same number of rows (103) as the Toronto postal codes data frame:

In [29]:
len(ll)

103
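Equal row counts do not by themselves guarantee that both tables contain the same postal codes. A stricter check compares the two sets of codes; the sketch below uses made-up values, and the names `codes_a` and `codes_b` are assumptions for illustration.

```python
import pandas as pd

# Two illustrative code lists of equal length that differ in one entry.
codes_a = pd.Series(["M3A", "M4A", "M5A"])
codes_b = pd.Series(["M5A", "M3A", "M9Z"])

# Set differences reveal codes present on only one side.
only_in_a = sorted(set(codes_a) - set(codes_b))
only_in_b = sorted(set(codes_b) - set(codes_a))
print(only_in_a, only_in_b)
```

Two empty lists would mean that every postal code appears in both tables.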

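As before, we remove the white space from the column names, so that both data frames share the key 'PostalCode':

In [34]: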
ll.columns = [c.replace(' ', '') for c in ll.columns]
ll.head()

|  | PostalCode | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


Finally, we merge the two data frames on 'PostalCode':

In [50]:
new_df = pd.merge(df, ll, on='PostalCode')
new_df.head()

|  | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
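By default `pd.merge` performs an inner join, which silently drops postal codes that have no match. A hedged sketch of a more explicit merge, using the `how`, `validate` and `indicator` parameters of `pd.merge` on two made-up frames (`left`, `right` and `merged` are illustrative names):

```python
import pandas as pd

left = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
})
right = pd.DataFrame({
    "PostalCode": ["M4A", "M3A"],
    "Latitude": [43.725882, 43.753259],
    "Longitude": [-79.315572, -79.329656],
})

# how='left' keeps every row of the left frame,
# validate raises MergeError if a postal code repeats on either side,
# indicator adds a '_merge' column saying where each row was found.
merged = pd.merge(left, right, on="PostalCode", how="left",
                  validate="one_to_one", indicator=True)

# Rows without coordinates would show up here as 'left_only'.
unmatched = merged[merged["_merge"] != "both"]
print(len(unmatched))
```

An empty `unmatched` frame confirms that every postal code received its coordinates.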
