### Data Frame with Postal Codes of Canada 
This notebook creates a data frame which will be used for the clustering analysis of the Toronto neighborhoods. The data frame has the following properties:


1. The data frame consists of three columns: PostalCode, Borough, and Neighborhood.

2. Rows that have no assigned borough are dropped.

3. If a postal code area has more than one neighborhood, its rows are combined into one row with the neighborhoods separated by commas.

4. If a row has a borough but a "Not assigned" neighborhood, the neighborhood is set to the borough name.
          

In the last two cells of the notebook, the .shape attribute gives the number of rows in the final data frame, which is then saved to a CSV file.

First import the libraries needed to create the data frame.

In [10]:
import requests # library for making HTTP requests in Python
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
from bs4 import BeautifulSoup # library for pulling data out of HTML and XML files

Scrape the postal code table from the Wikipedia page at the URL below.

In [11]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
index = requests.get(url).text  # Request the page's HTML as text
soup = BeautifulSoup(index, 'html.parser')  # Parse the HTML
# Locate the table on the Wikipedia page that holds the postal codes
mtable = soup.find('table', {'class': 'wikitable sortable'})


Build a list of rows from the table cells, skipping the header row.

In [12]:
df=[
    [td.get_text(strip=True) for td in tr.find_all('td') if td.string or td.a]
    for tr in mtable.find_all('tr')[1:]
] 

Collect the headings from the first row of the table into a list.

In [13]:
Header=[]
for st in mtable.find_all('tr')[0].stripped_strings:
    Header.append(st)
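
The two extraction steps above can be tried offline on a small inline table. This is a minimal sketch, assuming a hand-written HTML snippet that only imitates the layout of the Wikipedia table; the names `html`, `header`, and `rows` are illustrative and not part of the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the postal code table (not the real page content)
html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td><a href="#">North York</a></td><td>Parkwoods</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "wikitable sortable"})
all_rows = table.find_all("tr")

# Headings come from the <th> cells of the first row
header = [th.get_text(strip=True) for th in all_rows[0].find_all("th")]

# Every later row becomes a list of stripped cell texts
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in all_rows[1:]
]

print(header)
print(rows)
```

Working on a fixed snippet like this makes it easier to check the parsing logic before pointing it at the live page, whose markup may change over time.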

Create a data frame with the proper column names and the data in the table.

In [14]:
df=pd.DataFrame(df,columns=Header) 
df.rename(columns={'Postcode':'PostalCode','Neighbourhood':'Neighborhood'},inplace=True) #Change the column names
df=df[~df.Borough.str.contains("Not assigned")] #Remove the rows which are not assigned a borough.
df.reset_index(drop=True,inplace=True)
df.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |
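
The borough filter can be checked on a toy frame. The sketch below is illustrative only: the `toy` data is made up, and it uses an exact comparison (`!=`) rather than the `str.contains` call in the cell above, which is a slightly stricter test because it cannot match a longer string that merely contains "Not assigned".

```python
import pandas as pd

# Made-up rows that mimic the scraped table
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M3A", "M5A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Harbourfront"],
})

# Keep only rows with an assigned borough, then renumber the index from 0
assigned = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```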


If a row has a borough but a "Not assigned" neighborhood, set the neighborhood to the borough name.

In [15]:
# Assign through .loc with a boolean mask; chained indexing may write to a copy
mask = df.Neighborhood == 'Not assigned'
df.loc[mask, 'Neighborhood'] = df.loc[mask, 'Borough']
df.head(10)

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront |
| 3 | M5A | Downtown Toronto | Regent Park |
| 4 | M6A | North York | Lawrence Heights |
| 5 | M6A | North York | Lawrence Manor |
| 6 | M7A | Queen's Park | Queen's Park |
| 7 | M9A | Etobicoke | Islington Avenue |
| 8 | M1B | Scarborough | Rouge |
| 9 | M1B | Scarborough | Malvern |
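
The same fill can also be written without modifying values in place. The following is a sketch on made-up data, assuming `Series.mask` as an alternative formulation; it replaces entries where the condition is true with the matching value from another column and returns a new Series.

```python
import pandas as pd

# Made-up rows: one neighborhood is missing, one is present
toy = pd.DataFrame({
    "PostalCode": ["M7A", "M9A"],
    "Borough": ["Queen's Park", "Etobicoke"],
    "Neighborhood": ["Not assigned", "Islington Avenue"],
})

# Where the neighborhood is "Not assigned", take the borough instead
unassigned = toy["Neighborhood"] == "Not assigned"
toy["Neighborhood"] = toy["Neighborhood"].mask(unassigned, toy["Borough"])
print(toy)
```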


If a postal code area has more than one neighborhood, its rows are combined into one row, with the neighborhoods separated by commas.

In [16]:
df_new = df.groupby('PostalCode',as_index=False).agg({'Borough':'first','Neighborhood':lambda col: ", ".join(col)})
df_new.head()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
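
To see how the grouping behaves, the sketch below applies the same kind of aggregation to a few made-up rows in shuffled order. The `toy` frame is an assumption for illustration; `", ".join` is passed directly as the aggregation function in place of the lambda.

```python
import pandas as pd

# Made-up rows, deliberately not sorted by postal code
toy = pd.DataFrame({
    "PostalCode": ["M5A", "M1B", "M5A", "M1B", "M3A"],
    "Borough": ["Downtown Toronto", "Scarborough", "Downtown Toronto",
                "Scarborough", "North York"],
    "Neighborhood": ["Harbourfront", "Rouge", "Regent Park",
                     "Malvern", "Parkwoods"],
})

# One row per postal code: keep the first borough, join the neighborhoods
merged = toy.groupby("PostalCode", as_index=False).agg(
    {"Borough": "first", "Neighborhood": ", ".join}
)
print(merged)
```

By default `groupby` sorts the group keys, so the result is ordered by postal code, while the neighborhoods within each group keep their original row order.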


Use the .shape attribute to show the number of rows in the final data frame ```df_new```.

In [17]:
df_new.shape[0] 

103

Save the final data frame to a CSV file.

In [18]:
df_new.to_csv("Postal_Codes_DataFrame.csv",index=False)
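
Because the combined neighborhoods contain commas, it can be worth confirming that they survive a CSV round trip. This is a small sketch on made-up data, assuming an in-memory `io.StringIO` buffer in place of a file on disk.

```python
import io

import pandas as pd

# Made-up rows; the first neighborhood value contains a comma
toy = pd.DataFrame({
    "PostalCode": ["M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighborhood": ["Rouge, Malvern", "Woburn"],
})

# Write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

restored = pd.read_csv(io.StringIO(csv_text))
print(csv_text)
print(restored)
```

The writer quotes any field that contains the delimiter, so the joined neighborhoods are read back as a single column value.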