# Toronto Data Frame
## Applied Data Science Capstone
## Week 3 - Part 1

#### Author: Sumit Chhabra

<font color="red">Note: this Jupyter Notebook was created in Cognitive Class AI Lab.</font>

#### Skip the next step if BeautifulSoup is already installed

In [None]:
# Install BeautifulSoup4
!conda install -c conda-forge beautifulsoup4 --yes

# Read from the Wikipedia page
Use the BeautifulSoup library to parse the HTML.

In [1]:
#Read from wikipedia
import requests
from bs4 import BeautifulSoup

res = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup = BeautifulSoup(res.content,'lxml')
#soup

In [2]:
#extract table from html
table = soup.find_all('table')[0]
#table
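
Taking the first `<table>` on the page works only while the page layout stays the same. A slightly more defensive variant is sketched below: it selects the table by CSS class instead of by position. The `wikitable` class name, the helper name `first_wikitable`, and the inline `SAMPLE_HTML` are assumptions made for illustration; the sample stands in for `res.content` so the snippet runs offline.

```python
from bs4 import BeautifulSoup

# Hypothetical sample HTML standing in for the downloaded page (res.content).
SAMPLE_HTML = """
<html><body>
<table class="navbox"><tr><td>navigation</td></tr></table>
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
</body></html>
"""


def first_wikitable(html):
    """Return the first <table class="wikitable"> element, or raise if none exists."""
    # 'html.parser' ships with Python; 'lxml' (used in the notebook) also works if installed.
    soup = BeautifulSoup(html, "html.parser")
    found = soup.find("table", class_="wikitable")
    if found is None:
        raise ValueError("no table with class 'wikitable' found")
    return found


sample_table = first_wikitable(SAMPLE_HTML)

# Collect the text of every header and data cell, row by row.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in sample_table.find_all("tr")
]
print(rows[0])  # ['Postcode', 'Borough', 'Neighbourhood']
```

Raising an error when no table matches makes a layout change on the page fail loudly instead of silently parsing the wrong table.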

### Data Preparation

1. Convert the HTML table to a pandas dataframe
2. Rename the Postcode column to PostalCode
3. Drop rows whose Borough is "Not assigned"

In [3]:
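import pandas as pd
from io import StringIO

NA = 'Not assigned'

#convert to dataframe (wrap the HTML in StringIO; newer pandas deprecates passing literal HTML strings)
df = pd.read_html(StringIO(str(table)))[0]
print(df.head())
print("Old:", df.shape)

#rename Postcode to PostalCode
df.rename(columns={'Postcode': 'PostalCode'}, inplace=True)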
print(df.head())

#drop Borough with "Not assigned"
df = df[~df['Borough'].isin([NA])]
#df.head()
print("New:", df.shape)
df.head()

  Postcode           Borough     Neighbourhood
0      M1A      Not assigned      Not assigned
1      M2A      Not assigned      Not assigned
2      M3A        North York         Parkwoods
3      M4A        North York  Victoria Village
4      M5A  Downtown Toronto      Harbourfront
Old: (289, 3)
  PostalCode           Borough     Neighbourhood
0        M1A      Not assigned      Not assigned
1        M2A      Not assigned      Not assigned
2        M3A        North York         Parkwoods
3        M4A        North York  Victoria Village
4        M5A  Downtown Toronto      Harbourfront
New: (212, 3)


| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
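
The rename and filter steps can also be tried on a small hand-made frame, which makes the effect easy to check by eye. This is a minimal sketch: the `toy` and `cleaned` names and the four sample rows are made up for illustration, and the use of `!=` with `.copy()` is one possible alternative to the `isin` filter above, not the notebook's own method.

```python
import pandas as pd

NA = "Not assigned"

# Toy rows mirroring the first lines of the scraped table (illustrative data only).
toy = pd.DataFrame(
    {
        "Postcode": ["M1A", "M3A", "M5A", "M5A"],
        "Borough": [NA, "North York", "Downtown Toronto", "Downtown Toronto"],
        "Neighbourhood": [NA, "Parkwoods", "Harbourfront", "Regent Park"],
    }
)

# rename() returns a new frame, so inplace=True is not required.
toy = toy.rename(columns={"Postcode": "PostalCode"})

# Keep rows whose Borough is assigned. copy() detaches the result from the
# original frame, so later in-place edits do not raise SettingWithCopyWarning.
cleaned = toy[toy["Borough"] != NA].copy()

print(cleaned.shape)  # (3, 3)
```

The filtered frame keeps the original index labels (1, 2, 3 here), which is why the notebook output above starts at index 2 rather than 0.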


If a row has a borough but its neighbourhood is "Not assigned", the neighbourhood is set to the borough name.

In [4]:
#Replace Not assigned Neighborhood with Borough name
for index, row in df.iterrows():
    if row['Neighbourhood'] == NA:
        print ('Found ', row['Borough'], ' - replace it')
        df.at[index, 'Neighbourhood'] = row['Borough']

Found  Queen's Park  - replace it
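
The same replacement can be written without an explicit loop. The sketch below uses a boolean mask with `.loc`; the `toy` frame, its rows, and the `unassigned` name are assumptions for illustration, and this is an alternative to the `iterrows` loop above rather than what the notebook ran.

```python
import pandas as pd

NA = "Not assigned"

# Toy rows (illustrative data only): one unassigned neighbourhood, one assigned.
toy = pd.DataFrame(
    {
        "PostalCode": ["M7A", "M3A"],
        "Borough": ["Queen's Park", "North York"],
        "Neighbourhood": [NA, "Parkwoods"],
    }
)

# Boolean mask of rows whose neighbourhood is unassigned.
unassigned = toy["Neighbourhood"] == NA

# Copy the borough name into the neighbourhood column for those rows only.
toy.loc[unassigned, "Neighbourhood"] = toy.loc[unassigned, "Borough"]

print(toy["Neighbourhood"].tolist())  # ["Queen's Park", 'Parkwoods']
```

On a table of this size the loop is fast enough; the masked assignment mainly avoids row-by-row writes and scales better on larger frames.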


#### Data Validation

In [5]:
#validate Not assigned in Borough and Neighbourhood

b_count =  df['Borough'].str.contains(NA).sum()
if b_count > 0:
    print ('Present in Borough')
else:
    print ('Not Present in Borough')

n_count =  df['Neighbourhood'].str.contains(NA).sum()
if n_count > 0:
    print ('Present in Neighbourhood')
else:
    print ('Not Present in Neighbourhood')

Not Present in Borough
Not Present in Neighbourhood
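
Note that `str.contains` performs a substring match (and treats the pattern as a regular expression by default), so it would also flag a value such as "Not assigned yet". An exact-match check is sketched below. The helper name `columns_with_placeholder` and the toy frames are hypothetical, introduced only for this example.

```python
import pandas as pd

NA = "Not assigned"


def columns_with_placeholder(frame, columns, placeholder=NA):
    """Return the names of the given columns that still contain the placeholder value."""
    # eq() compares whole cell values, so only exact matches count.
    return [col for col in columns if frame[col].eq(placeholder).any()]


# Toy frames (illustrative data only): one clean, one with a leftover placeholder.
clean = pd.DataFrame(
    {
        "Borough": ["North York", "Queen's Park"],
        "Neighbourhood": ["Parkwoods", "Queen's Park"],
    }
)
dirty = clean.copy()
dirty.loc[1, "Neighbourhood"] = NA

print(columns_with_placeholder(clean, ["Borough", "Neighbourhood"]))  # []
print(columns_with_placeholder(dirty, ["Borough", "Neighbourhood"]))  # ['Neighbourhood']
```

Returning the list of offending columns, rather than printing per column, lets the check be reused in an `assert` before the grouping step.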


#### Combine neighbourhoods
More than one neighbourhood can exist in one postal code area.
Rows that share a postal code will be combined into one row, with the neighbourhoods separated by commas.

In [6]:
#group by postal code & borough and join neighbourhoods with commas

grouped = df.groupby(['PostalCode', 'Borough'])['Neighbourhood'].agg(','.join).reset_index()
#grouped
print(grouped.loc[grouped['PostalCode'] == 'M4B'])
print(grouped.loc[grouped['PostalCode'] == 'M5A'])


   PostalCode    Borough                   Neighbourhood
35        M4B  East York  Woodbine Gardens,Parkview Hill
   PostalCode           Borough             Neighbourhood
53        M5A  Downtown Toronto  Harbourfront,Regent Park
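
To see the grouping in isolation, the sketch below applies the same `groupby` and `','.join` pattern to three hand-made rows and then checks that every postal code appears exactly once. The `toy` and `combined` names and the sample rows are assumptions for illustration.

```python
import pandas as pd

# Toy rows (illustrative data only): M5A appears twice, M3A once.
toy = pd.DataFrame(
    {
        "PostalCode": ["M5A", "M5A", "M3A"],
        "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
        "Neighbourhood": ["Harbourfront", "Regent Park", "Parkwoods"],
    }
)

# Group by postal code and borough, then join each group's neighbourhoods with commas.
combined = (
    toy.groupby(["PostalCode", "Borough"])["Neighbourhood"]
    .agg(",".join)
    .reset_index()
)

# After combining, each postal code should occur in exactly one row.
print(combined["PostalCode"].is_unique)  # True
print(combined["Neighbourhood"].tolist())  # ['Parkwoods', 'Harbourfront,Regent Park']
```

Because `groupby` sorts by the group keys by default, the result is ordered by postal code, which matches the ordering of the final output.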


In [7]:
#final output for assignment 1
grouped.head(11)

| | PostalCode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge,Malvern |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park,Ionview,Kennedy Park |
| 7 | M1L | Scarborough | Clairlea,Golden Mile,Oakridge |
| 8 | M1M | Scarborough | Cliffcrest,Cliffside,Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff,Cliffside West |
