# IBM Applied Data Science Capstone - Week Three Assignment

### Segmenting and Clustering Toronto's Neighborhoods

#### Part One: Scraping Toronto Postal Code Information

We want to extract the table of Toronto postal codes (those beginning with M) from a Wikipedia page. The URL is: "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

First, we import our dependencies:

In [1]:
import pandas as pd

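Now we scrape the data from the Wikipedia page using pandas. pd.read_html returns a list of dataframes, one per table found on the page, so we take the first one with [0]: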

In [14]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

toronto_df = pd.read_html(url)[0]
toronto_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


As we can see, a number of postal codes have a borough and neighbourhood of 'Not assigned'.

Let's treat 'Not assigned' as a missing (NA) value when reading the table and drop those rows:

In [15]:
toronto_df = pd.read_html(url, na_values=['Not assigned'])[0]
toronto_df.dropna(inplace=True)
toronto_df

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
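
As a minimal offline sketch of the same idea, the snippet below applies na_values and dropna to a small in-memory sample instead of the live Wikipedia page. The names sample_csv, sample_df and cleaned_df are hypothetical, and pd.read_csv stands in for pd.read_html only so the example runs without a network connection; both readers accept na_values.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same shape as the Wikipedia table.
sample_csv = io.StringIO(
    "Postal Code,Borough,Neighbourhood\n"
    "M1A,Not assigned,Not assigned\n"
    "M3A,North York,Parkwoods\n"
    'M5A,Downtown Toronto,"Regent Park, Harbourfront"\n'
)

# na_values converts the listed strings to NaN while parsing.
sample_df = pd.read_csv(sample_csv, na_values=["Not assigned"])

# dropna() with its default how="any" removes every row that
# contains at least one NaN, so the M1A row is dropped.
cleaned_df = sample_df.dropna()
print(cleaned_df)
```

Note that the default how="any" also drops a row where only the neighbourhood is missing. If such rows should be kept, dropna(subset=['Borough']) restricts the check to the borough column.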


Our row numbering now has gaps where rows were dropped, so let's reset the index:

In [16]:
toronto_df.reset_index(drop=True, inplace=True)
toronto_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
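
To illustrate what drop=True does here, the following sketch resets the index of a small hypothetical dataframe (gappy_df is an assumed name, not part of the assignment) with and without that argument.

```python
import pandas as pd

# Hypothetical dataframe whose index starts at 2, as it would after
# the first two rows were dropped.
gappy_df = pd.DataFrame(
    {"Postal Code": ["M3A", "M4A", "M5A"]},
    index=[2, 3, 4],
)

# drop=True discards the old index and renumbers rows from 0.
renumbered = gappy_df.reset_index(drop=True)

# Without drop=True the old index is kept as a new column named "index".
kept_old = gappy_df.reset_index()

print(renumbered)
print(kept_old)
```

Using drop=True keeps the dataframe at its original three columns, which is why it is the better fit for this step.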


In [5]:
toronto_df.shape

(103, 3)

The cleaned dataframe has 103 rows and 3 columns, one row for each assigned postal code.
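
Beyond checking the shape, a few extra sanity checks can catch problems before the next step. The helper below is a hedged sketch: check_clean is a hypothetical function, and the conditions it tests (no missing values, unique postal codes, a 0..n-1 index) are assumptions about what a clean table should look like rather than requirements stated in the assignment.

```python
import pandas as pd


def check_clean(df: pd.DataFrame, key: str = "Postal Code") -> None:
    """Raise ValueError if df has missing values, repeated keys, or a gappy index."""
    if df.isna().any().any():
        raise ValueError("missing values remain")
    if not df[key].is_unique:
        raise ValueError(f"duplicate values in {key!r}")
    if list(df.index) != list(range(len(df))):
        raise ValueError("index is not 0..n-1")


# Hypothetical two-row sample that satisfies all three checks.
sample = pd.DataFrame(
    {"Postal Code": ["M3A", "M4A"], "Borough": ["North York", "North York"]}
)
check_clean(sample)  # returns None without raising
```

On the real data this would be called as check_clean(toronto_df) after the index reset.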

#### Part Two: Assigning Latitude and Longitude Data for Each Postal Code in Toronto

Let's first grab the data from the following URL: "https://cocl.us/Geospatial_data"

In [17]:
url2 = 'https://cocl.us/Geospatial_data'

latlong_df = pd.read_csv(url2)
latlong_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Since both dataframes have a 'Postal Code' column, we can merge them using 'Postal Code' as the join key:

In [23]:
neighborhoods = pd.merge(left=toronto_df, right=latlong_df, on='Postal Code')
neighborhoods

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509
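
pd.merge defaults to an inner join, which silently drops any postal code that has no match in the other dataframe. The sketch below shows one way to surface such rows, using small hypothetical samples (left, right, merged and unmatched are assumed names, and M5A is deliberately left out of the coordinate sample to force a mismatch).

```python
import pandas as pd

# Hypothetical samples; coordinates are copied from the tables above.
left = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
})
right = pd.DataFrame({
    "Postal Code": ["M4A", "M3A", "M1B"],
    "Latitude": [43.725882, 43.753259, 43.806686],
    "Longitude": [-79.315572, -79.329656, -79.194353],
})

merged = pd.merge(
    left,
    right,
    on="Postal Code",
    how="left",             # keep every row of the left dataframe
    validate="one_to_one",  # raise MergeError if a key repeats on either side
    indicator=True,         # add a _merge column describing each row's origin
)

# Rows marked "left_only" found no coordinates in the right dataframe.
unmatched = merged.loc[merged["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```

In the notebook's own merge the result has 103 rows (index 0 to 101 is shown, with rows elided), matching the 103 rows of toronto_df, so no postal code was lost by the inner join.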
