In [2]:
import pandas as pd

# Scraping Wikipedia

pandas can read HTML tables directly. `pd.read_html` returns a list of dataframes, one per table on the page, and we are interested in the first. We keep only the rows where `'Borough'` is _not_ `'Not assigned'`, then reset the index numbering.

In [41]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
df = pd.read_html(url, header=0)[0]
df = df[df['Borough'] != 'Not assigned'].reset_index(drop=True)
df.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park / Harbourfront
3,M6A,North York,Lawrence Manor / Lawrence Heights
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government


At this point, the dataframe looks almost right, but the instructions imply that some postal codes may be listed twice. Let's check if any postal code appears more than once in the dataframe:

In [42]:
(df['Postal code'].value_counts() > 1).any()

False
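
Had the check returned `True`, the duplicated rows would need to be collapsed into one row per postal code. The following is a minimal sketch of one way to do that, using a small made-up dataframe (`toy`) rather than the scraped data; the duplicated `M5A` rows are an assumption for illustration only.

```python
import pandas as pd

# Toy data (hypothetical): postal code M5A is listed twice.
toy = pd.DataFrame({
    'Postal code': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Regent Park', 'Harbourfront', 'Parkwoods'],
})

# True if any postal code occurs more than once.
has_duplicates = toy['Postal code'].duplicated().any()

# One row per postal code, with its neighborhoods joined by commas.
# sort=False keeps the postal codes in order of first appearance.
combined = (
    toy.groupby(['Postal code', 'Borough'], sort=False)['Neighborhood']
       .agg(', '.join)
       .reset_index()
)

print(has_duplicates)
print(combined)
```

Grouping on both `'Postal code'` and `'Borough'` keeps the borough column in the result; it assumes a postal code never spans two boroughs.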

Great. Now we need to make sure that multiple neighborhoods in one row are separated by commas instead of forward slashes.

In [43]:
# Literal (non-regex) replacement: ' /' becomes ',', so 'A / B' turns into 'A, B'
df['Neighborhood'] = df['Neighborhood'].str.replace(' /', ',', regex=False)
df.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
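
The replacement above relies on every slash being preceded by a space. If the spacing around the slashes were inconsistent, a regular expression would be more forgiving. This is only a sketch on made-up strings (`names` is hypothetical, and the second entry deliberately has no spaces around the slash):

```python
import pandas as pd

# Hypothetical neighborhood strings with inconsistent spacing around '/'.
names = pd.Series([
    'Regent Park / Harbourfront',
    'Lawrence Manor/Lawrence Heights',
    'Parkwoods',
])

# Replace a slash and any surrounding whitespace with a comma and one space.
cleaned = names.str.replace(r'\s*/\s*', ', ', regex=True)

print(cleaned.tolist())
```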


Now we need to check whether any row with an assigned borough still has a `'Not assigned'` neighborhood.

In [44]:
(df['Neighborhood'] == 'Not assigned').any()

False

There do not appear to be any. Apparently the dataset has changed since the assignment was written.
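
If such rows had been present, one plausible fix would be to fall back to the borough name for the missing neighborhood. That rule is an assumption here, not something this notebook needed to apply, and the dataframe below is made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical): the first row has a borough but no neighborhood.
toy = pd.DataFrame({
    'Borough': ["Queen's Park", 'North York'],
    'Neighborhood': ['Not assigned', 'Parkwoods'],
})

# Boolean mask of the rows that need a fallback value.
missing = toy['Neighborhood'] == 'Not assigned'

# Copy the borough name into the neighborhood column for those rows only.
toy.loc[missing, 'Neighborhood'] = toy.loc[missing, 'Borough']

print(toy)
```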

# Get Latitude and Longitude

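Now we load and inspect the latitude and longitude data from the provided CSV file. Note that the download below saves it as `lat_lng.json`, although its contents are CSV, which is why it is read with `pd.read_csv`.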

In [21]:
!wget --quiet http://cocl.us/Geospatial_data -O lat_lng.json

In [45]:
lat_lng = pd.read_csv('lat_lng.json')
lat_lng.rename(columns={'Postal Code': 'Postal code'}, inplace=True)

In [46]:
lat_lng

Unnamed: 0,Postal code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


Merge the two dataframes on their shared `Postal code` column.

In [57]:
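# Inner join on the shared key; naming it explicitly avoids relying on column-name overlap
df = pd.merge(df, lat_lng, on='Postal code')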
df

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,Business reply mail Processing CentrE,43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509
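
An inner join silently drops any postal code that has no coordinates, so it can be worth checking for unmatched keys. The sketch below uses two small made-up frames (`left`, `right`) and a fictitious postal code `X0X`. The `indicator` and `validate` arguments of `pd.merge` are standard, but whether such a check is needed for this dataset is an assumption.

```python
import pandas as pd

# Toy frames (hypothetical); 'X0X' is a made-up code with no coordinates.
left = pd.DataFrame({
    'Postal code': ['M3A', 'M4A', 'X0X'],
    'Borough': ['North York', 'North York', 'Nowhere'],
})
right = pd.DataFrame({
    'Postal code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# A left join keeps every row of `left`; indicator=True adds a '_merge' column
# recording where each row came from; validate raises if a key is duplicated.
checked = pd.merge(left, right, on='Postal code', how='left',
                   validate='one_to_one', indicator=True)

# Postal codes that found no match in the coordinate table.
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postal code'].tolist()

print(unmatched)
```

An empty `unmatched` list would mean the inner join used above loses no rows.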
