# Capstone - IBM Applied Data Science #
# Segmenting, Clustering and GeoCoding Canadian Neighborhoods #

This notebook is the first step of a peer-graded project evaluating neighborhoods in a Canadian province.

Processes completed in this notebook include:    

1. Download data via Internet page scrape,        
2. Create a pandas dataframe from the scraped data,
3. Clean data in pandas dataframe (remove/rename empty cells),     
4. Explore and cluster neighborhoods,
5. Geocode using postal codes.

### Step 0 - Libraries and Notebook Setup

In [1]:
# library imports
import pandas as pd
pd.set_option('display.max_columns', None)    
pd.set_option('display.max_rows', None)
import numpy as np
from pandas import json_normalize # JSON file -> pandas DF (top-level import in pandas >= 1.0; the pandas.io.json path is deprecated)
import json # JSON file manipulation
import requests # HTTP library
from bs4 import BeautifulSoup # scrapes info from web pages, on top of an HTML or XML parser
import matplotlib.pyplot as plt # plotting library
import matplotlib.cm as cm
import matplotlib.colors as colors
# %matplotlib inline # magic that renders matplotlib plots inside Jupyter
# !conda update -n base -c defaults conda --yes     # may need to run first time in new environment
# !conda install -c conda-forge folium=0.5.0 --yes  # may need to run first time in new environment
import folium # map rendering
from sklearn.cluster import KMeans # clustering algorithm

  _nan_object_mask = _nan_object_array != _nan_object_array


Sklearn library note: Importing KMeans from sklearn.cluster produces a large pink FutureWarning. Currently, when comparing two arrays of dtype=object, numpy checks whether the comparison function returns False when both objects being compared are the exact same object. Right now (2019), it assumes that all objects compare equal to themselves, but that will be changed at some time in the future. Since it is a FutureWarning, it is ignored for now. (This info is courtesy of Stack Overflow: https://stackoverflow.com/questions/43591503/error-when-using-numpy-to-encode-categorical-features-of-dataset)
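
If the warning is only noise, one option is to silence it for a limited scope rather than globally. The sketch below uses only the standard-library `warnings` module; `noisy()` is a hypothetical stand-in for the library call that emits the warning, not part of sklearn or numpy.

```python
import warnings


def noisy():
    """Hypothetical stand-in for a library call that emits a FutureWarning."""
    warnings.warn("behaviour will change in a future release", FutureWarning)
    return 42


# Scoped suppression: the filter only applies inside the `with` block,
# and the previous warning filters are restored on exit.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("ignore", category=FutureWarning)
    result = noisy()

# `caught` records warnings that would have been shown; ignored ones are not recorded.
suppressed_count = len(caught)
```

Keeping the filter inside a `with` block means other, possibly more important, warnings raised elsewhere in the notebook stay visible.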

### Step 1 - Download data via Internet page scrape ###
Download Canadian postal code data from Wikipedia (url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M") and scrape it using BeautifulSoup.

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
page = requests.get(url).text  # wikiArticle -> text document
wikiHtml = BeautifulSoup(page, "html.parser")  # text -> parseTree
neighborhoodTable = wikiHtml.find("table", class_ = "wikitable")  # extract the first wikitable from the HTML
neighborhoodRows = neighborhoodTable.find_all("tr")  # extract the table rows
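
The row extraction can also be done cell by cell, which avoids depending on where newlines fall in the row text. The following is a minimal sketch run against a small inline HTML string; `sample_html` is a hypothetical stand-in, not the real Wikipedia page, whose markup may have changed since this notebook was written.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML shaped like a simple three-column wikitable.
sample_html = """
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
table = soup.find("table", class_="wikitable")
if table is None:
    # find() returns None when nothing matches, so fail with a clear message
    raise ValueError("no table with class 'wikitable' found")

rows = []
for tr in table.find_all("tr"):
    # collect the stripped text of every header or data cell in this row
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    if cells:
        rows.append(cells)

header, records = rows[0], rows[1:]
```

Checking for `None` before calling `find_all` turns a confusing `AttributeError` into an explicit message if the page layout differs from what the scrape expects.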

### Step 2 - Create and Populate Pandas Dataframe ###
Project instructions require:
- Dataframe has three columns: PostalCode, Borough, and Neighborhood.
- Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
- Group on postal code; more than one neighborhood can exist in one postal code area. 
- If cell has a borough but a Not assigned neighborhood, then neighborhood = borough.
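
As a rough illustration of how these requirements fit together, the sketch below applies them to a few toy rows. The frame `toy` and the variable names are hypothetical and chosen only for this example; the notebook's own cells follow.

```python
import pandas as pd

# Toy rows shaped like the scraped table (hypothetical subset).
toy = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M5A", "Downtown Toronto", "Harbourfront"],
        ["M5A", "Downtown Toronto", "Regent Park"],
        ["M7A", "Queen's Park", "Not assigned"],
    ],
    columns=["PostalCode", "Borough", "Neighborhood"],
)

# Keep only rows that have an assigned borough.
assigned = toy[toy["Borough"] != "Not assigned"].copy()

# A "Not assigned" neighborhood takes the name of its borough.
missing = assigned["Neighborhood"] == "Not assigned"
assigned.loc[missing, "Neighborhood"] = assigned.loc[missing, "Borough"]

# One row per postal code, with its neighborhoods joined by commas.
grouped = assigned.groupby("PostalCode", as_index=False).agg(
    {"Borough": "first", "Neighborhood": ", ".join}
)
```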

In [3]:
# create and set up dataframe
neighborhoodInfo = []
colNames = ['PostalCode','Borough','Neighborhood'] # project requires columns PostalCode, Borough, Neighborhood
for row in neighborhoodRows:
    hoodInfo = row.text.split('\n')[1:-1]
    neighborhoodInfo.append(hoodInfo)
hoodDF = pd.DataFrame(neighborhoodInfo[1:], columns = colNames)
initialDataFrameShape = hoodDF.shape
print("initial dataframe shape", initialDataFrameShape)
print(hoodDF.head())

initial dataframe shape (289, 3)
  PostalCode           Borough      Neighborhood
0        M1A      Not assigned      Not assigned
1        M2A      Not assigned      Not assigned
2        M3A        North York         Parkwoods
3        M4A        North York  Victoria Village
4        M5A  Downtown Toronto      Harbourfront


### Step 3 - Clean data in pandas dataframe (remove/rename empty cells)

In [4]:
# remove entries with a "Not assigned" borough
noBorough = hoodDF.index[hoodDF["Borough"] == "Not assigned"]
hoodDF.drop(noBorough, inplace=True)  # noBorough already holds the index labels to drop
hoodDF.reset_index(drop=True, inplace=True)
print("after removing cells without assigned boroughs, dataframe shape:", hoodDF.shape)
hoodDF.head()

after removing cells without assigned boroughs, dataframe shape: (212, 3)


  PostalCode           Borough      Neighborhood
0        M3A        North York         Parkwoods
1        M4A        North York  Victoria Village
2        M5A  Downtown Toronto      Harbourfront
3        M5A  Downtown Toronto       Regent Park
4        M6A        North York  Lawrence Heights


In [5]:
# set neighborhood = borough for cells with a borough but a "Not assigned" neighborhood
noHood = hoodDF["Neighborhood"] == "Not assigned"  # boolean mask of affected rows
hoodDF.loc[noHood, "Neighborhood"] = hoodDF.loc[noHood, "Borough"]  # .loc avoids chained assignment
hoodDF.head(10)

  PostalCode           Borough      Neighborhood
0        M3A        North York         Parkwoods
1        M4A        North York  Victoria Village
2        M5A  Downtown Toronto      Harbourfront
3        M5A  Downtown Toronto       Regent Park
4        M6A        North York  Lawrence Heights
5        M6A        North York    Lawrence Manor
6        M7A      Queen's Park      Queen's Park
7        M9A         Etobicoke  Islington Avenue
8        M1B       Scarborough             Rouge
9        M1B       Scarborough           Malvern


### Step 4 - Explore and Cluster Neighborhoods

In [6]:
postalCodeGroup = hoodDF.groupby('PostalCode')
hoodGroups = postalCodeGroup['Neighborhood'].apply(', '.join)  # join neighborhoods that share a postal code
boroughGroups = postalCodeGroup['Borough'].first()  # assumes one borough per postal code
neighborhood = pd.DataFrame(list(zip(boroughGroups.index, boroughGroups, hoodGroups)))
neighborhood.columns = ['PostalCode', 'Borough', 'Neighborhood']
print("The dataframe when grouped by postal code is", neighborhood.shape, '\n')
print(neighborhood.head(10))

The dataframe when grouped by postal code is (103, 3) 

  PostalCode      Borough                                     Neighborhood
0        M1B  Scarborough                                   Rouge, Malvern
1        M1C  Scarborough           Highland Creek, Rouge Hill, Port Union
2        M1E  Scarborough                Guildwood, Morningside, West Hill
3        M1G  Scarborough                                           Woburn
4        M1H  Scarborough                                        Cedarbrae
5        M1J  Scarborough                              Scarborough Village
6        M1K  Scarborough      East Birchmount Park, Ionview, Kennedy Park
7        M1L  Scarborough                  Clairlea, Golden Mile, Oakridge
8        M1M  Scarborough  Cliffcrest, Cliffside, Scarborough Village West
9        M1N  Scarborough                      Birch Cliff, Cliffside West
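
The grouping above keeps a single borough per postal code, which assumes every postal code belongs to exactly one borough. A quick way to check that assumption is to count distinct boroughs per group. The sketch below does so on a hypothetical toy frame; on the real data, `hoodDF` would take the place of `toy`.

```python
import pandas as pd

# Toy rows (hypothetical) in the same shape as the cleaned dataframe.
toy = pd.DataFrame(
    {
        "PostalCode": ["M1B", "M1B", "M5A", "M5A"],
        "Borough": ["Scarborough", "Scarborough", "Downtown Toronto", "Downtown Toronto"],
        "Neighborhood": ["Rouge", "Malvern", "Harbourfront", "Regent Park"],
    }
)

# Number of distinct boroughs seen for each postal code.
boroughs_per_code = toy.groupby("PostalCode")["Borough"].nunique()

# Postal codes mapped to more than one borough would need manual review.
ambiguous_codes = boroughs_per_code[boroughs_per_code > 1].index.tolist()
```

An empty `ambiguous_codes` list means picking any one borough per group is safe.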


In [7]:
# borough_count = len(neighborhood['Borough'].unique())
# neighborhood_count = neighborhood.shape[0]

# print("Dataframe has", borough_count, "boroughs and", neighborhood_count, "neighborhoods")

### Step 5 - Geocode

In [8]:
# create DF from csv file with geographical coordinates of each postal code ("https://cocl.us/Geospatial_data")
geocodeDF = pd.read_csv("https://cocl.us/Geospatial_data") 
geocodedHood = neighborhood.merge(geocodeDF, left_on='PostalCode', right_on='Postal Code') # csv file uses 'Postal Code'
geocodedHood = geocodedHood.drop(columns='Postal Code')   # drop the duplicate key column, leaving only 'PostalCode'
print(geocodedHood.head())

  PostalCode      Borough                            Neighborhood   Latitude  \
0        M1B  Scarborough                          Rouge, Malvern  43.806686   
1        M1C  Scarborough  Highland Creek, Rouge Hill, Port Union  43.784535   
2        M1E  Scarborough       Guildwood, Morningside, West Hill  43.763573   
3        M1G  Scarborough                                  Woburn  43.770992   
4        M1H  Scarborough                               Cedarbrae  43.773136   

   Longitude  
0 -79.194353  
1 -79.160497  
2 -79.188711  
3 -79.216917  
4 -79.239476  
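
The default inner merge silently drops any postal code that has no match in the coordinates file. A possible safeguard, sketched here on toy frames, is a left merge with `indicator=True` so unmatched codes can be listed. The coordinates for M1B and M1C are copied from the output above; `M9Z` is a made-up code included only to show an unmatched row, and the frame names are hypothetical.

```python
import pandas as pd

# Toy neighborhood frame; "M9Z" is a made-up code with no coordinates.
hoods = pd.DataFrame(
    {
        "PostalCode": ["M1B", "M1C", "M9Z"],
        "Borough": ["Scarborough", "Scarborough", "Etobicoke"],
    }
)

# Toy coordinates frame using the csv file's 'Postal Code' column name.
coords = pd.DataFrame(
    {
        "Postal Code": ["M1B", "M1C"],
        "Latitude": [43.806686, 43.784535],
        "Longitude": [-79.194353, -79.160497],
    }
)

merged = hoods.merge(
    coords.rename(columns={"Postal Code": "PostalCode"}),  # align key names first
    on="PostalCode",
    how="left",             # keep every neighborhood row, matched or not
    validate="one_to_one",  # raise if a postal code appears twice on either side
    indicator=True,         # adds a '_merge' column describing each row's origin
)

# Postal codes that found no coordinates.
unmatched = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
```

Renaming the key column before merging also removes the need to drop a duplicate `Postal Code` column afterwards.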
