<font size="6">Capstone Project - Applied Data Science Specialization</font><br>

This notebook will be used to complete the Capstone Project for the Applied Data Science Specialization by IBM. It is divided into sections, each split into several parts. Every section corresponds to a different requirement of the capstone project.<br>




<div style="text-align: right"><font size="5">Making the Notebook</font></div>

<div style="text-align: right">Peer Graded Assignment for IBM's Applied Data Science Capstone, Week 1</div>
<br>
<br>

In [1]:
import numpy as np
import pandas as pd

In [2]:
print('Hello Capstone Project Course!')

Hello Capstone Project Course!




<div style="text-align: right"><font size="5">Segmenting and Clustering Neighborhoods in Toronto</font></div>

<div style="text-align: right">Peer Graded Assignment for IBM's Applied Data Science Capstone, Week 3</div>
<br>
<br>

<font size="4">Part 1 - Getting and Wrangling Data</font>


First, I import and install everything I need. The following lines of code are taken (almost) directly from the course's "Segmenting and Clustering Neighborhoods in New York City" Jupyter Notebook:


In [5]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

These packages only have to be installed once, so I uncomment these lines only when I need to work in a different environment.

In [6]:
!conda install -c conda-forge geopy --yes 
!conda install -c conda-forge folium=0.5.0 --yes

^C


I then proceed to import the remaining libraries.

In [8]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0; older versions expose it under pandas.io.json)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.



Now, according to the instructions, I need data on the boroughs, neighborhoods and postal codes of Toronto, which I can obtain here:
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

Let's try just reading it as a CSV.

In [9]:
df = pd.read_csv('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
df.head()

ParserError: Error tokenizing data. C error: Expected 1 fields in line 7, saw 43


That does not work, since the page is HTML rather than CSV. Checking online, there appears to be a pandas function, read_html, which can "read HTML tables into a list of DataFrame objects" according to the documentation.

In [10]:
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
df

[    Postal Code           Borough  \
 0           M1A      Not assigned   
 1           M2A      Not assigned   
 2           M3A        North York   
 3           M4A        North York   
 4           M5A  Downtown Toronto   
 5           M6A        North York   
 6           M7A  Downtown Toronto   
 7           M8A      Not assigned   
 8           M9A         Etobicoke   
 9           M1B       Scarborough   
 10          M2B      Not assigned   
 11          M3B        North York   
 12          M4B         East York   
 13          M5B  Downtown Toronto   
 14          M6B        North York   
 15          M7B      Not assigned   
 16          M8B      Not assigned   
 17          M9B         Etobicoke   
 18          M1C       Scarborough   
 19          M2C      Not assigned   
 20          M3C        North York   
 21          M4C         East York   
 22          M5C  Downtown Toronto   
 23          M6C              York   
 24          M7C      Not assigned   
 25         

The table we need seems to be the first object in the list. Let's pull it out into its own dataframe.

In [11]:
df = pd.DataFrame(df[0])
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


Now let's make the changes required in the assignment:

    1.- Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.

    2.- More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.

    3.- If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.

    4.- Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.

    5.- In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.

In [12]:
#1. We use "loc" to remove the ones that have "Not assigned".
df = df.loc[df.Borough != 'Not assigned']

#2. Apparently, as of this date (July 7, 2020), there are no repeated values of Postal Code: the total number of
# postal codes equals the number of unique postal codes, so we leave the table as it is.
df.shape[0] == df['Postal Code'].unique().shape[0]

#3. The number of values in Neighborhood that are 'Not assigned' is 0, so there is nothing to replace.
(df.Neighborhood == 'Not assigned').sum()

#4. Everything up to this point has been explained in Markdown cells or in the comments.

#5. The total number of rows is:
print('Total number of rows is: {} '.format(df.shape[0]))

Total number of rows is: 103 
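
Steps 2 and 3 turned out to need no changes on the current version of the page. Had the table still contained repeated postal codes or unassigned neighborhoods, they could be handled roughly as follows. This is only a sketch on a made-up toy table: the rows and the names `raw` and `clean` are illustrative assumptions, not the scraped Wikipedia data.

```python
import pandas as pd

# Hypothetical toy rows, for illustration only (not the scraped table).
raw = pd.DataFrame({
    'Postal Code': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# Step 1: keep only rows with an assigned borough.
clean = raw.loc[raw['Borough'] != 'Not assigned'].copy()

# Step 3: a 'Not assigned' neighborhood takes the name of its borough.
unassigned = clean['Neighborhood'] == 'Not assigned'
clean.loc[unassigned, 'Neighborhood'] = clean.loc[unassigned, 'Borough']

# Step 2: collapse repeated postal codes, joining neighborhoods with a comma.
clean = (
    clean.groupby(['Postal Code', 'Borough'], as_index=False)['Neighborhood']
         .agg(', '.join)
)

print(clean)
```

Applying step 3 before step 2 keeps the placeholder text 'Not assigned' out of the joined neighborhood strings.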


<font size="4">Part 2 - Adding Latitude and Longitude</font>

I still need latitude and longitude coordinates to visualize the information on a map. As of this date, we are given a CSV with the corresponding coordinates, available here: https://cocl.us/Geospatial_data .

In [13]:
lat_lon = pd.read_csv('https://cocl.us/Geospatial_data')
lat_lon.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


I join the two tables on Postal Code to attach the coordinates to each row. Because pd.merge performs an inner join by default, any postal code without matching coordinates would have been dropped. The row count is still 103, so no rows were eliminated and every Postal Code in the original dataframe now has coordinates.

In [14]:
toronto = pd.merge(df,lat_lon, on ='Postal Code')
print('Total number of rows after join is: {} '.format(toronto.shape[0]))
toronto.head()

Total number of rows after join is: 103 


Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
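
An unchanged row count is good evidence that the join worked, but it can also be checked more directly. The sketch below uses two small hypothetical frames (`codes` and `coords`, standing in for `df` and `lat_lon`): a left join with `indicator=True` marks any postal code that found no coordinates, and `validate='one_to_one'` makes pandas raise an error if a postal code repeats on either side.

```python
import pandas as pd

# Hypothetical toy frames standing in for df and lat_lon.
codes = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M5A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
})
coords = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# A left join keeps every row of `codes`; the `_merge` column records
# whether a match was found in `coords`.
checked = pd.merge(codes, coords, on='Postal Code', how='left',
                   indicator=True, validate='one_to_one')

# Postal codes that have no coordinates in `coords`.
missing = checked.loc[checked['_merge'] == 'left_only', 'Postal Code'].tolist()
print('Postal codes without coordinates:', missing)
```

In this toy example M5A is deliberately left out of `coords`, so it is the one code reported as missing.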


<font size="4">Part 3 - Clustering and Visualizing</font>

I proceed to make clusters and create maps to understand the behavior of the wrangled data. Let's first see how many postal codes each borough has.

In [15]:
toronto.groupby('Borough').count()

Borough,Postal Code,Neighborhood,Latitude,Longitude
Central Toronto,9,9,9,9
Downtown Toronto,19,19,19,19
East Toronto,5,5,5,5
East York,5,5,5,5
Etobicoke,12,12,12,12
Mississauga,1,1,1,1
North York,24,24,24,24
Scarborough,17,17,17,17
West Toronto,6,6,6,6
York,5,5,5,5
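
Since count() repeats the same number in every column, a single count per borough may be easier to read. A minimal sketch, assuming a small hypothetical frame `toy` in place of `toronto`:

```python
import pandas as pd

# Hypothetical toy rows standing in for the `toronto` dataframe.
toy = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M5A', 'M4B'],
    'Borough': ['North York', 'North York', 'Downtown Toronto', 'East York'],
})

# size() counts the rows in each group and returns one value per borough.
per_borough = toy.groupby('Borough').size().sort_values(ascending=False)
print(per_borough)
```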
