# Segmenting and Clustering Neighborhoods in Toronto
Week 3 assignment for IBM Data Science Capstone Project

## Part 1: Scraping Wikipedia Page for Postal Code Data
This section scrapes the postal code table from Wikipedia and loads it into a pandas dataframe.
First, install and import the required packages:

In [2]:
!conda install -c conda-forge geocoder


In [3]:
import numpy as np
import pandas as pd
import geocoder

Retrieve the table from the Wikipedia page with pandas and rename the postal code column to match the assignment instructions:

In [5]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
postal_df = pd.read_html(url, attrs={'class': 'wikitable sortable'})[0]
postal_df.rename(columns={'Postal Code': 'PostalCode'}, inplace=True)
postal_df.head()

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | |
| 1 | M2A | Not assigned | |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


Remove rows with no borough assigned:

In [8]:
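# Keep only rows with an assigned borough, then renumber the index from 0.
# Chaining reset_index avoids modifying a filtered slice in place.
postal_filt_df = postal_df[postal_df['Borough'] != 'Not assigned'].reset_index(drop=True)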
postal_filt_df.head()

|   | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
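
The filter is a boolean mask over the Borough column. A minimal sketch on the five rows from the first preview, retyped by hand rather than scraped, may make the effect easier to see. The names `sample_df`, `has_borough` and `sample_filt_df` are illustrative, and the blank Neighborhood cells are assumed to be missing values:

```python
import numpy as np
import pandas as pd

# First five rows of the scraped table, retyped by hand for illustration.
# Assumption: the blank Neighborhood cells are missing values (NaN).
sample_df = pd.DataFrame({
    'PostalCode': ['M1A', 'M2A', 'M3A', 'M4A', 'M5A'],
    'Borough': ['Not assigned', 'Not assigned', 'North York',
                'North York', 'Downtown Toronto'],
    'Neighborhood': [np.nan, np.nan, 'Parkwoods', 'Victoria Village',
                     'Regent Park, Harbourfront'],
})

# True for every row whose borough is known
has_borough = sample_df['Borough'] != 'Not assigned'

# Keep those rows and renumber the index from 0
sample_filt_df = sample_df[has_borough].reset_index(drop=True)
print(sample_filt_df)
```

Only the three rows with a real borough should remain, indexed 0 to 2.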


We confirm that the PostalCode column contains no duplicates, so no rows need to be merged:

In [10]:
print('Number of rows in the dataframe:', postal_filt_df.shape[0])
print('Number of unique Postal Codes:', len(pd.unique(postal_filt_df['PostalCode'])))

Number of rows in the dataframe: 103
Number of unique Postal Codes: 103
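
Had the two counts differed, rows sharing a postal code would have needed merging. The following is only a sketch of one possible approach, assuming neighborhoods with the same code are joined into a comma-separated string. The duplicated `M5A` rows are made up for illustration and do not appear in the scraped table, and `dup_df` and `merged_df` are hypothetical names:

```python
import pandas as pd

# Hypothetical rows: M5A is deliberately split across two rows.
dup_df = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Regent Park', 'Harbourfront', 'Parkwoods'],
})

# One row per (PostalCode, Borough) pair; neighborhoods are joined with ', '.
# sort=False keeps the groups in order of first appearance.
merged_df = (
    dup_df.groupby(['PostalCode', 'Borough'], sort=False)['Neighborhood']
    .agg(', '.join)
    .reset_index()
)
print(merged_df)
```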


We confirm that no neighborhood is missing from the DataFrame, so there is no need to substitute the borough name:

In [12]:
print('Number of neighborhoods with NaN value:', postal_filt_df['Neighborhood'].isna().sum())
print("Number of neighborhoods with 'Not assigned' value:", (postal_filt_df['Neighborhood'] == 'Not assigned').sum())

Number of neighborhoods with NaN value: 0
Number of neighborhoods with 'Not assigned' value: 0
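
If either count had been nonzero, the borough name could have stood in for the missing neighborhood. A minimal sketch of that fallback follows, using made-up rows: the codes `X1X` and `X2X` and the boroughs `Borough A` and `Borough B` are placeholders, not real data, and `gap_df` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one NaN neighborhood and one 'Not assigned' placeholder.
gap_df = pd.DataFrame({
    'PostalCode': ['M3A', 'X1X', 'X2X'],
    'Borough': ['North York', 'Borough A', 'Borough B'],
    'Neighborhood': ['Parkwoods', np.nan, 'Not assigned'],
})

# Turn the 'Not assigned' placeholder into NaN so both cases look the same
neighborhood = gap_df['Neighborhood'].mask(gap_df['Neighborhood'] == 'Not assigned')

# Fill each missing neighborhood with the borough from the same row
gap_df['Neighborhood'] = neighborhood.fillna(gap_df['Borough'])
print(gap_df)
```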


We're now ready to show the number of rows in this filtered dataframe:

In [13]:
print('Number of rows in the filtered dataframe:', postal_filt_df.shape[0])

Number of rows in the filtered dataframe: 103
