## Week 3 - CAPSTONE: IBM Data Science Professional Certificate
# Segmenting and Clustering Neighborhoods in Toronto

In [36]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

## SECTION 1: Zipcode Data
### Scrape the Data

Pull the data from the table on the Wikipedia page by requesting the page and parsing the HTML.  We use the requests package to make the GET request and BeautifulSoup to parse the HTML.  BeautifulSoup provides a simple API for traversing HTML nodes and searching by tag and attribute.

#### Make the GET request

In [37]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
r = requests.get(url)
# print the server response to make sure we get a 200 (OK) status code
print(r)

<Response [200]>
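
Printing the response works for a quick visual check, but a script can also fail fast on a bad status. The following is a minimal sketch, assuming a hypothetical helper named fetch_html (it is not part of the original notebook) and an arbitrary 10 second timeout. The timeout argument of requests.get and Response.raise_for_status are standard parts of the requests package.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Hypothetical helper: return the page HTML, or raise on a bad status code."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for any 4xx or 5xx response
    response.raise_for_status()
    return response.text
```

Calling fetch_html(url) in place of requests.get(url) would return the text that is handed to BeautifulSoup in the next step.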


#### Parse the page

In [38]:
# parse the page HTML with BeautifulSoup
page = BeautifulSoup(r.text, 'html.parser')


#### Find the table with the data

In [39]:
tables = page.find_all('table')
print(f'There are {len(tables)} HTML tables on the page')

There are 5 HTML tables on the page


Since there is more than one table, let's pull the header cells (HTML tag **th**) from each table to find the correct one.

In [40]:
[table.find_all('th') for table in tables]

[[<th>Postcode</th>,
  <th>Borough</th>,
  <th>Neighbourhood
  </th>],
 [],
 [],
 [<th class="navbox-title" style="font-size:110%"><a href="/wiki/Postal_codes_in_Canada" title="Postal codes in Canada">Canadian postal codes</a>
  </th>],
 []]

From the Python list above, we can see that the first table on the page is the one we want to scrape.
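
Relying on the table's position can break if the page layout changes. As a hedged alternative, the table can be chosen by matching its header text. This sketch runs on a small inline HTML string that stands in for the real page; the helper name find_table_by_header and the sample markup are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the Wikipedia page: one navigation table and one data table
SAMPLE_HTML = """
<table><tr><th class="navbox-title">Canadian postal codes</th></tr></table>
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

EXPECTED_HEADER = ["Postcode", "Borough", "Neighbourhood"]


def find_table_by_header(page, expected_header):
    """Return the first <table> whose <th> texts equal expected_header, else None."""
    for table in page.find_all("table"):
        header = [th.get_text(strip=True) for th in table.find_all("th")]
        if header == expected_header:
            return table
    return None


page = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = find_table_by_header(page, EXPECTED_HEADER)

# Collect the cell text of every data row (the first row holds the header)
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.find_all("tr")[1:]
]
print(rows)
```

Here the navigation table is skipped because its single header cell does not match the expected column names.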

#### Parse the table

Loop over every row in the table, get the text from each cell, and map it to its column name.  Then convert the result to a pandas DataFrame for cleaning.

In [41]:
# create list of every row in the table finding the HTML tag <tr>
zipcode_table = tables[0].find_all('tr')
header = [th.get_text(strip=True) for th in zipcode_table[0].find_all('th')]

# loop over every row and create a dict for each row
zipcodes = map(
    # function to create the dict using the header names and text value inside each cell
    lambda row: {head: cell.get_text(strip=True) for head, cell in zip(header, row.find_all('td'))},
    # skip the first row as this was the header row
    zipcode_table[1:]
)

# convert to pandas dataframe
zip_df = pd.DataFrame(zipcodes)
zip_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


### Clean the Dataframe

Replace every "Not assigned" placeholder with the pandas.NA singleton to mark missing values.  Any postal code without a Borough is dropped, and any missing Neighbourhood is replaced with the Borough value.

In [42]:
# convert "Not assigned" to pandas.NA types
zip_df = zip_df.replace("Not assigned", pd.NA)

In [43]:
# drop all rows without a Borough
zip_df = zip_df.dropna(subset=['Borough'])
# fill any missing Neighbourhoods with the Borough Name
zip_df['Neighbourhood'] = zip_df['Neighbourhood'].fillna(value=zip_df['Borough'])
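
To make the effect of these cleaning rules concrete, here is a small sketch on a three-row toy frame. The rows are illustrative stand-ins rather than the scraped data, and the chained style with assign is just one way to write the same steps.

```python
import pandas as pd

# Toy rows: no Borough, fully assigned, and Borough without Neighbourhood
toy = pd.DataFrame({
    "Postcode": ["M1A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Harbourfront", "Not assigned"],
})

cleaned = (
    toy.replace("Not assigned", pd.NA)          # mark placeholders as missing
       .dropna(subset=["Borough"])              # drop rows with no Borough
       .assign(Neighbourhood=lambda df: df["Neighbourhood"].fillna(df["Borough"]))
       .reset_index(drop=True)
)
print(cleaned)
```

The first row is dropped because it has no Borough, and the last row takes its Borough name as its Neighbourhood.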

### Transform the Dataframe

Combine duplicated postal codes into a single row by forming a comma-separated list of their Neighbourhoods.  This assumes there is exactly one Borough name per postal code.

In [44]:
# For each postal code and borough group create a list of all the Neighbourhoods 
zip_df = zip_df.groupby(['Postcode','Borough']).apply(lambda x: ", ".join(x['Neighbourhood'].unique())).reset_index()
zip_df.columns = ['Postcode','Borough', 'Neighbourhood']
zip_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
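
The same grouping can be sketched on a toy frame to show how duplicated postal codes collapse into one row. This variant uses agg on the Neighbourhood column with as_index=False instead of apply plus reset_index; it is offered as an equivalent-looking alternative under that assumption, not as the notebook's method.

```python
import pandas as pd

# Toy rows: M1B appears twice with different Neighbourhoods
toy = pd.DataFrame({
    "Postcode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge", "Malvern", "Woburn"],
})

combined = (
    toy.groupby(["Postcode", "Borough"], as_index=False)["Neighbourhood"]
       # join the distinct names of each group, keeping their original order
       .agg(lambda names: ", ".join(names.unique()))
)
print(combined)
```

With as_index=False the group keys stay as regular columns, so the column names do not need to be reassigned afterwards.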


### Print Shape of the Dataframe

In [45]:
zip_df.shape

(103, 3)

## SECTION 2: Latitude and Longitude

Create a function that looks up the latitude and longitude for an address of the form "city, state, zipcode, country" using the MapBox geocoder from the geopy package.

In [46]:
from geopy.geocoders import MapBox
from geopy.extra.rate_limiter import RateLimiter

from config import mapbox_api_key # the API key is kept in a local config file; getting one requires a MapBox account
locator = MapBox(api_key=mapbox_api_key)
# limit the rate of API calls to one per second to avoid being blocked
geocode = RateLimiter(locator.geocode, min_delay_seconds=1)

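def zip_to_coords(zipcode: str, city: str, state: str, country: str) -> pd.Series:
    """Get the latitude and longitude for the given zipcode using the geocode function above."""
    address = f'{city}, {state}, {zipcode}, {country}'
    location = geocode(address)
    # geocode returns None when no match is found, so fall back to missing values
    if location is None:
        return pd.Series((float('nan'), float('nan')))
    return pd.Series((location.latitude, location.longitude))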

Create new Latitude and Longitude columns in the zipcode dataframe.

In [47]:
zip_df[['Latitude', 'Longitude']] = zip_df['Postcode'].apply(zip_to_coords, args=('Toronto', 'Ontario', 'Canada'))
zip_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.808241,-79.220533
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.78,-79.19
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.77,-79.19
3,M1G,Scarborough,Woburn,43.78,-79.23
4,M1H,Scarborough,Cedarbrae,43.78,-79.25
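
The two-column assignment above works because apply returns a DataFrame when the applied function returns a Series. The sketch below shows that mechanism offline, assuming a hypothetical fake_zip_to_coords function and a made-up lookup table with placeholder coordinates in place of the MapBox service.

```python
import pandas as pd

# Hypothetical offline lookup standing in for the geocoding service
FAKE_COORDS = {"M1B": (43.81, -79.22), "M1G": (43.78, -79.23)}


def fake_zip_to_coords(zipcode, city, state, country):
    """Return (latitude, longitude) as a Series, or NaN values for unknown codes."""
    lat, lon = FAKE_COORDS.get(zipcode, (float("nan"), float("nan")))
    return pd.Series((lat, lon))


toy = pd.DataFrame({"Postcode": ["M1B", "M1G", "M9Z"]})

# args supplies the extra positional arguments after the Postcode value
toy[["Latitude", "Longitude"]] = toy["Postcode"].apply(
    fake_zip_to_coords, args=("Toronto", "Ontario", "Canada")
)
print(toy)
```

Each returned Series becomes one row of the result, and its two positions are assigned in order to the Latitude and Longitude columns.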


In [48]:
# export dataframe as a csv
zip_df.to_csv('week3_data.csv')

## SECTION 3: Cluster and Map

In [49]:
# load the csv if it was already created, so we can skip the code above
zip_df = pd.read_csv('week3_data.csv', index_col=0)
zip_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.808241,-79.220533
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.78,-79.19
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.77,-79.19
3,M1G,Scarborough,Woburn,43.78,-79.23
4,M1H,Scarborough,Cedarbrae,43.78,-79.25
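
The index_col=0 argument matters because to_csv writes the index as an unnamed first column by default. A small sketch of that round trip, using an in-memory buffer and a toy frame instead of week3_data.csv:

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "Postcode": ["M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough"],
})

# By default to_csv also writes the index, as a first column with an empty name
csv_text = toy.to_csv()

# Without index_col, that column comes back as a data column named "Unnamed: 0"
naive = pd.read_csv(io.StringIO(csv_text))

# index_col=0 turns it back into the index, as in the cell above
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Alternatively, skip writing the index in the first place
no_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(naive.columns))
print(list(restored.columns))
```

Either approach avoids carrying a stray index column into later steps; which one fits best depends on whether the index holds meaningful values.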
