## Step 1: Import libraries

**Please make sure that the following packages are available:**
* numpy
* pandas
* matplotlib
* requests
* scikit-learn (imported as `sklearn`)
* beautifulsoup4
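
A quick way to confirm that the packages listed above are installed is sketched below. It assumes the import names `sklearn` and `bs4` for the scikit-learn and beautifulsoup4 packages, and the helper `missing_packages` is a hypothetical name introduced only for this illustration.

```python
import importlib.util


def missing_packages(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Import names (assumed): scikit-learn is imported as sklearn, beautifulsoup4 as bs4
required = ["numpy", "pandas", "matplotlib", "requests", "sklearn", "bs4"]
missing = missing_packages(required)
print("Missing packages:", missing if missing else "none")
```

Any name that shows up in the list would need to be installed before running the cells that follow.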

In [2]:
import pandas as pd # library to process data as dataframes
import numpy as np
import matplotlib.pyplot as plt # plotting library
import requests
# backend for rendering plots within the browser
%matplotlib inline 
from bs4 import BeautifulSoup
from sklearn.cluster import KMeans 
from sklearn.datasets import make_blobs

print('Libraries imported.')

Libraries imported.




## Step 2: Create a Pandas Dataframe from the Wikipedia table
Here I use the Beautiful Soup library to parse the table and load it into a DataFrame.

In [3]:
# Obtain the HTML code of the Wikipedia page with the list of postal codes of Canada and parse it with BeautifulSoup
website_cotent_in_html = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").text
soup = BeautifulSoup(website_cotent_in_html,"html.parser")

# Get the table with the post codes from the html code
my_table = soup.find('table',{'class':'wikitable sortable'})

# Iterate through all the rows of the table, collect the text of each <td> cell and store it in a Python list
# (the header row contains no <td> cells, so it produces an empty list)
table_rows=my_table.find_all('tr')
table_data = []

for row in table_rows:
    table_data.append([t.text.strip() for t in row.find_all('td')])

# Create the pandas DataFrame from the Python list
postal_codes_raw_df = pd.DataFrame(table_data, columns=['PostalCode', 'Borough', 'Neighbourhood'])

print(postal_codes_raw_df.shape)
postal_codes_raw_df.head(5)

(181, 3)


| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | | | |
| 1 | M1A | Not assigned | |
| 2 | M2A | Not assigned | |
| 3 | M3A | North York | Parkwoods |
| 4 | M4A | North York | Victoria Village |
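
Row 0 of the output is empty because the header row of the table is built from `<th>` cells, so `find_all('td')` returns nothing for it. The sketch below reproduces this on a small hand-written HTML snippet (a stand-in for the Wikipedia page, not its real markup) and shows one possible way to skip such rows before building the DataFrame. The names `all_rows` and `data_rows` are illustrative only.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the Wikipedia table (assumption: header cells are <th>)
html = """
<table class="wikitable sortable">
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td></td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="wikitable")

# Same extraction as in the notebook: one list of cell texts per <tr>
all_rows = [[td.text.strip() for td in tr.find_all("td")] for tr in table.find_all("tr")]

# Keep only the rows that actually contain <td> cells
data_rows = [cells for cells in all_rows if cells]

print(all_rows[0])   # the header row yields an empty list
print(data_rows)
```

In this notebook the empty row is harmless, because it is dropped later together with the rows that have no borough.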


## Step 3: Format and present the Dataframe meeting requirements

* *Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned*

I will replace all empty, "Not assigned" and "None" values with NaN, and then use the dropna method to drop the rows that have NaN in the Borough column.



In [39]:
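# Treat empty strings, 'None' and 'Not assigned' as missing values
postal_codes_raw_df = postal_codes_raw_df.replace(['None', '', 'Not assigned'], np.nan)
# Keep only the rows with an assigned borough and renumber them from 0
postal_codes_df = postal_codes_raw_df.dropna(subset=['Borough']).reset_index(drop=True)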
print(postal_codes_df.shape)
postal_codes_df.head(20)

(103, 3)


| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 5 | M9A | Etobicoke | Islington Avenue, Humber Valley Village |
| 6 | M1B | Scarborough | Malvern, Rouge |
| 7 | M3B | North York | Don Mills |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson |
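
The replace-then-drop step can be seen in isolation on a tiny DataFrame. This is only a sketch: the three rows are copied from the tables above, and the names `demo` and `cleaned` are made up for the illustration.

```python
import numpy as np
import pandas as pd

# Three rows taken from the raw table: one unassigned, two assigned
demo = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["", "Parkwoods", "Victoria Village"],
})

cleaned = (
    demo.replace(["", "Not assigned"], np.nan)  # placeholders become NaN
        .dropna(subset=["Borough"])             # drop rows without a borough
        .reset_index(drop=True)                 # renumber the remaining rows
)

print(cleaned)
```

Only the rows for M3A and M4A should remain, since M1A has no assigned borough.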


* *More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table*

First, I will check how many rows there are per postal code in the DataFrame:

In [40]:
postal_codes_df[['PostalCode','Neighbourhood']].groupby('PostalCode').count().sort_values(by="Neighbourhood", ascending=False)


| PostalCode | Neighbourhood |
|---|---|
| M1B | 1 |
| M5R | 1 |
| M6G | 1 |
| M6E | 1 |
| M6C | 1 |
| ... | ... |
| M3L | 1 |
| M3K | 1 |
| M3J | 1 |
| M3H | 1 |


No PostalCode is spread over multiple rows: each postal code appears exactly once, with its neighbourhoods already combined into a single row and separated by commas.
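
Had the table listed a postal code over several rows, as in the M5A example from the requirement, the rows could have been merged with a `groupby`. The sketch below uses a small made-up DataFrame (`split_rows` is not part of the notebook) and is one possible approach, not a step that was needed here.

```python
import pandas as pd

# Hypothetical input where M5A is split over two rows
split_rows = pd.DataFrame({
    "PostalCode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# Detect whether any postal code occurs more than once
has_duplicates = bool(split_rows["PostalCode"].duplicated().any())

# Join the neighbourhoods of each (PostalCode, Borough) pair with a comma
combined = (
    split_rows.groupby(["PostalCode", "Borough"])["Neighbourhood"]
              .agg(", ".join)
              .reset_index()
)

print(has_duplicates)
print(combined)
```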

* *If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.*

Since all the empty and "Not assigned" values were previously converted to NaN, let's check how many NaN values there are in the Neighbourhood column:

In [49]:
# NaN never compares equal to itself, so count missing values with isna() instead of == np.nan
empties_in_neighbourhood = postal_codes_df['Neighbourhood'].isna().sum()
print('Number of empty values in the Neighbourhood column: {}'.format(empties_in_neighbourhood))

Number of empty values in the Neighbourhood column: 0


No neighbourhood has an empty or NaN value, so there is nothing to replace for this condition.
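
When counting missing values, it helps to remember that NaN is never equal to itself, so a comparison with `== np.nan` finds no matches even when NaN values are present, whereas `isna()` does find them. The sketch below shows the difference on a made-up two-row DataFrame (the missing neighbourhood is invented for the illustration) and how the requirement could be met with `fillna` if such a row existed.

```python
import numpy as np
import pandas as pd

# Invented example: the first row has a borough but no neighbourhood
demo = pd.DataFrame({
    "Borough": ["Etobicoke", "North York"],
    "Neighbourhood": [np.nan, "Parkwoods"],
})

# Equality against NaN is always False, so this count misses the gap
equality_hits = int((demo["Neighbourhood"] == np.nan).sum())

# isna() is the reliable way to detect missing values
isna_hits = int(demo["Neighbourhood"].isna().sum())

# Use the borough as the neighbourhood wherever the neighbourhood is missing
demo["Neighbourhood"] = demo["Neighbourhood"].fillna(demo["Borough"])

print(equality_hits, isna_hits)
print(demo)
```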

* *In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.*

In [51]:
postal_codes_df.shape

(103, 3)