# IBM Data Science Capstone Notebook

### Loyda, Jayred

This Jupyter Notebook covers the Capstone section of the IBM Data Science Professional Certificate course.

In [73]:
import pandas as pd
import numpy as np

In [74]:
print("Hello Capstone Project Course!")

Hello Capstone Project Course!


## Part 1.1: Segmenting and Clustering Neighborhoods in Toronto

For this assignment, you will be required to explore and cluster the neighborhoods in Toronto.

1. Start by creating a new Notebook for this assignment.
2. Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe like the one shown below:

![Canada](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/7JXaz3NNEeiMwApe4i-fLg_40e690ae0e927abda2d4bde7d94ed133_Screen-Shot-2018-06-18-at-7.17.57-PM.png?expiry=1582761600000&hmac=IpmKvZARyx0Nnai2V60_NkUXZEXtzjgva_i4RLEyWek "Canada")

3. To create the above dataframe:

    * The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
    * Only process the cells that have an assigned borough. Ignore cells with a borough that is **Not assigned.**
    * More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that **M5A** is listed twice and has two neighborhoods: **Harbourfront** and **Regent Park.** These two rows will be combined into one row with the neighborhoods separated with a comma as shown in **row 11** in the above table.
    * If a cell has a borough but a **Not assigned** neighborhood, then the neighborhood will be the same as the borough. So for the **9th** cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be **Queen's Park.**
    * Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
    * In the last cell of your notebook, use the **.shape** attribute to print the number of rows of your dataframe.
4. Submit a link to your Notebook on your GitHub repository.

In [75]:
import requests
from bs4 import BeautifulSoup

### Use requests to fetch the URL

In [76]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
res = requests.get(url)

### Use BeautifulSoup to parse the page and find the table

In [77]:
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')[0]
#print(table.prettify())

# Add Table to dataframe
df = pd.read_html(str(table))
#print(df[0].to_json())
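
The parsing step can be tried without a network call. Below is a minimal sketch that feeds a hand-written table to `pd.read_html`. The HTML snippet and the names `html`, `tables`, and `sample` are made up for illustration. Wrapping the string in `StringIO` is an assumption about newer pandas versions, which prefer a file-like object over a literal HTML string. It also assumes `lxml` is installed, as in the cell above.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-row excerpt standing in for the Wikipedia table.
html = """
<table>
  <thead>
    <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  </thead>
  <tbody>
    <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  </tbody>
</table>
"""

# read_html returns a list of DataFrames, one per <table> element found,
# which is why the notebook indexes the result with [0].
tables = pd.read_html(StringIO(html))
sample = tables[0]
print(sample)
```

Because the result is a list, `df[0]` in the next cell is already a DataFrame.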

### Convert the HTML table to JSON, then read it back from JSON

In [78]:
df_can = pd.read_json(df[0].to_json()).sort_index()
df_can.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


### Drop 'Not assigned' Boroughs

In [79]:
df_filter = df_can[df_can['Borough'] == 'Not assigned'].index
df_can.drop(df_filter,inplace=True)
df_can.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
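
The same filter can also be written as a boolean mask instead of collecting index labels and calling `drop`. This is a minimal sketch on a hand-written frame. The names `toy` and `assigned` are illustrative, not part of the notebook.

```python
import pandas as pd

# Made-up rows mirroring the first few lines of the scraped table.
toy = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only rows whose borough is assigned. copy() gives an independent
# frame so later column assignments do not touch the original.
assigned = toy[toy["Borough"] != "Not assigned"].copy()
print(assigned)
```

As in the notebook's output, the surviving rows keep their original index labels.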


### Find 'Not assigned' Neighbourhood  

In [80]:
df_can[df_can['Neighbourhood'] == 'Not assigned']

Unnamed: 0,Postcode,Borough,Neighbourhood


In [81]:
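# Where the neighbourhood is 'Not assigned', fall back to the borough in the same row
df_can['Neighbourhood'] = df_can['Neighbourhood'].mask(df_can['Neighbourhood'] == 'Not assigned', df_can['Borough'])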
df_can[df_can['Borough'] == "Queen's Park"]

Unnamed: 0,Postcode,Borough,Neighbourhood
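
The scraped table no longer has a row with a borough but a **Not assigned** neighbourhood, so both checks above come back empty. To see the rule from the assignment in action, here is a minimal sketch on a hand-written frame. The rows and the names `toy` and `unassigned` are illustrative assumptions.

```python
import pandas as pd

# Made-up rows: the first has a borough but no assigned neighbourhood.
toy = pd.DataFrame({
    "Postcode": ["M7A", "M3A"],
    "Borough": ["Queen's Park", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods"],
})

# Boolean mask marking the rows that need the fallback.
unassigned = toy["Neighbourhood"] == "Not assigned"

# mask() replaces values where the condition is True, taking the
# replacement from the aligned row of the Borough column.
toy["Neighbourhood"] = toy["Neighbourhood"].mask(unassigned, toy["Borough"])
print(toy)
```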


### Duplicate Handling

In [82]:
# Check for duplicate postcodes before merging
if df_can['Postcode'].duplicated().any():
    print("Duplicates Exist!")
else:
    print("No duplicates!")

# Merge rows that share a postcode: keep the first borough, join neighbourhoods with ', '
df_dupe = df_can.groupby(['Postcode']).agg({'Borough':'first','Neighbourhood':', '.join})
df_can = df_dupe.reset_index()

# Check again after merging
if df_can['Postcode'].duplicated().any():
    print("Duplicates Exist!")
else:
    print("No duplicates!")

Duplicates Exist!
No duplicates!
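
The merge can be checked on the **M5A** example from the assignment text. This is a minimal sketch on a hand-written frame. The names `toy` and `merged` are illustrative.

```python
import pandas as pd

# Made-up rows: M5A appears twice with different neighbourhoods.
toy = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

# One row per postcode: take the first borough seen and join all
# neighbourhoods in the group into a single comma-separated string.
merged = (
    toy.groupby("Postcode")
       .agg({"Borough": "first", "Neighbourhood": ", ".join})
       .reset_index()
)
print(merged)
```

`groupby` sorts the group keys by default, which is why the merged notebook frame starts at M1B rather than M3A.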


In [83]:
df_can.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [84]:
df_can.shape

(103, 3)
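
Beyond the shape, the three cleaning rules from the assignment can be checked directly. The helper below is a hypothetical sketch, not part of the course material. `find_problems` and the toy frame are names introduced here for illustration, and the column names follow the notebook's `Postcode`/`Neighbourhood` spelling.

```python
import pandas as pd


def find_problems(df: pd.DataFrame) -> list:
    """Return a list of human-readable rule violations (empty if clean)."""
    problems = []
    if list(df.columns) != ["Postcode", "Borough", "Neighbourhood"]:
        problems.append("unexpected columns")
        return problems  # the remaining checks need these columns
    if (df["Borough"] == "Not assigned").any():
        problems.append("unassigned borough present")
    if (df["Neighbourhood"] == "Not assigned").any():
        problems.append("unassigned neighbourhood present")
    if not df["Postcode"].is_unique:
        problems.append("duplicate postcodes present")
    return problems


# Made-up frame that already satisfies all three rules.
toy = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge, Malvern", "Highland Creek, Rouge Hill, Port Union"],
})
print(find_problems(toy))
```

In the notebook, calling `find_problems(df_can)` after the last cell would be the equivalent check.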