# Applied Data Science Capstone Project Notebook
* This notebook is used mainly for the capstone project.

## Peer-graded Assignment Part 1: Capstone Project Notebook
### Learning Objectives
* Learn about the problem that you will be working on in this capstone course
* Learn how to get started with Git and GitHub
* Apply your data analysis and machine learning skills to solve a problem using real-world data
* Create a project on Watson Studio, start a notebook, and share it with your peers

In [1]:
import pandas as pd
import numpy as np

In [2]:
print('Hello Capstone Project Course!')

Hello Capstone Project Course!


## Peer-graded Assignment Part 2: Segmenting and Clustering Neighborhoods in Toronto

### Requirements

* Scrape the Wikipedia page, wrangle and clean the data, and read it into a pandas dataframe so that it is in a structured format like the New York dataset
    (columns = Borough, Neighborhood, Latitude, Longitude).
* Replicate the analysis done on the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.
* Explore, segment, and cluster the neighborhoods in the city of Toronto based on the postal code and borough information.

### Prerequisite Libraries

In [3]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation
import io # in-memory text streams, used to read downloaded content
import geocoder
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images and HTML
from IPython.display import Image, HTML, display_html

# transforming a JSON file into a pandas dataframe
# (available as pandas.json_normalize since pandas 1.0)
from pandas import json_normalize

import folium # plotting library
from bs4 import BeautifulSoup
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

print('Folium installed')
print('Libraries imported.')

Folium installed
Libraries imported.


### Scrape the Wikipedia page of Canadian Postal Codes beginning with "M"

In [4]:
source = requests.get('https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=1011037969').text
soup = BeautifulSoup(source, 'lxml')
print(soup.title)

# Keep the first table on the page as an HTML string and render it in the notebook
tab = str(soup.table)
display_html(tab, raw=True)

<title>List of postal codes of Canada: M - Wikipedia</title>


Postal Code,Borough,Neighbourhood
M1A,Not assigned,Not assigned
M2A,Not assigned,Not assigned
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,"Regent Park, Harbourfront"
M6A,North York,"Lawrence Manor, Lawrence Heights"
M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
M8A,Not assigned,Not assigned
M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
M1B,Scarborough,"Malvern, Rouge"
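
The extraction of table rows can be tried offline on a small fragment. The sketch below is illustrative only: the `html` string is a hypothetical stand-in for the Wikipedia table, and it uses Python's built-in `html.parser` backend rather than `lxml` so that it needs no extra parser.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment shaped like the Wikipedia table.
html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect the stripped text of every header and data cell, row by row.
rows = []
for tr in soup.table.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    if cells:
        rows.append(cells)

header, records = rows[0], rows[1:]
print(header)
print(records)
```

The first row holds the column names and the remaining rows hold the data, which is the same split that `pd.read_html` performs when it builds a dataframe from the table.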


### Create a Pandas Dataframe

In [5]:
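# Wrap the HTML string in StringIO; newer pandas versions deprecate passing literal HTML
dfs = pd.read_html(io.StringIO(tab))
df = dfs[0]
df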

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


#### Clean the dataframe

In [6]:
# Drop the rows where Borough is 'Not assigned'
indexNames = df[df['Borough'] == 'Not assigned'].index
df.drop(indexNames, inplace=True)

# Reset the index and rename the column
df.reset_index(drop=True, inplace=True)
df.rename(columns={'Postal Code': 'POSTAL_CODE'}, inplace=True)

# Set POSTAL_CODE to object (astype returns a new Series, so assign it back)
df['POSTAL_CODE'] = df['POSTAL_CODE'].astype('object')

df



Unnamed: 0,POSTAL_CODE,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


Figure 1: Dataframe created by scraping Wikipedia website - Answer to Question 1
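
The cleaning step can also be written as a single chain that returns a new frame instead of mutating in place. This is a minimal sketch on a hypothetical three-row sample (`raw` is not the scraped table), offered as one possible formulation rather than the cell used above.

```python
import pandas as pd

# Hypothetical sample shaped like the scraped table.
raw = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Victoria Village"],
})

clean = (
    raw[raw["Borough"] != "Not assigned"]           # keep only assigned boroughs
    .rename(columns={"Postal Code": "POSTAL_CODE"})  # match the coordinate file's key name
    .reset_index(drop=True)                          # renumber rows from 0
)
print(clean)
```

Because nothing is modified in place, `raw` stays available for re-running the cell.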

#### CSV file of Postal Code Latitude and Longitude data

* The CSV file contains the latitude and longitude data for the Canadian three-character postal codes beginning with "M".
* The CSV file is in my GitHub repository.

In [7]:
df.dtypes

POSTAL_CODE      object
Borough          object
Neighbourhood    object
dtype: object

In [8]:
# Download the CSV file of coordinates
# Alternative source (raw version of the file on GitHub):
# url = "https://raw.githubusercontent.com/jocko1984/Applied_Data_Science_Capstone_Project/main/TorontoNeighbourhood_GeoData_Final.csv"
url = "https://cocl.us/Geospatial_data"
download = requests.get(url).content

# Read the downloaded content into a pandas dataframe
df_LL = pd.read_csv(io.StringIO(download.decode('utf-8')))
df_LL.rename(columns={'Postal Code': 'POSTAL_CODE'}, inplace=True)

# Print the first 5 rows of the dataframe
print(df_LL.head())

  POSTAL_CODE   Latitude  Longitude
0         M1B  43.806686 -79.194353
1         M1C  43.784535 -79.160497
2         M1E  43.763573 -79.188711
3         M1G  43.770992 -79.216917
4         M1H  43.773136 -79.239476
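
The decode-and-parse step can be checked without network access. In this sketch the `payload` bytes are a hypothetical stand-in for `requests.get(url).content`, holding two rows copied from the output above.

```python
import io

import pandas as pd

# Hypothetical bytes standing in for the downloaded response body.
payload = (
    b"Postal Code,Latitude,Longitude\n"
    b"M1B,43.806686,-79.194353\n"
    b"M1C,43.784535,-79.160497\n"
)

# Decode the bytes to text, wrap them in a file-like object, and parse as CSV.
coords = pd.read_csv(io.StringIO(payload.decode("utf-8")))
coords = coords.rename(columns={"Postal Code": "POSTAL_CODE"})
print(coords)
```

`pd.read_csv` also accepts `io.BytesIO(payload)` directly, which skips the explicit decode.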


In [9]:
df_LL.shape

(103, 3)

In [10]:
df_LL.dtypes

POSTAL_CODE     object
Latitude       float64
Longitude      float64
dtype: object

In [11]:
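# Check that every postal code in df has coordinates in df_LL;
# an empty result means no postal code is missing
ML_df = df
NLP_df = df_LL

ML_NLP = ML_df[~ML_df.POSTAL_CODE.isin(NLP_df.POSTAL_CODE)]

print(ML_NLP)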

Empty DataFrame
Columns: [POSTAL_CODE, Borough, Neighbourhood]
Index: []
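
The same coverage check can be expressed as a left merge with `indicator=True`, which labels each row by where its key was found. The sketch below uses hypothetical frames (`left`, `right`) with one deliberately unmatched code, `M9Z`.

```python
import pandas as pd

# Hypothetical frames: "M9Z" has no coordinates on the right-hand side.
left = pd.DataFrame({"POSTAL_CODE": ["M3A", "M4A", "M9Z"]})
right = pd.DataFrame({
    "POSTAL_CODE": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
})

# The _merge column reads "both" for matched keys and "left_only" otherwise.
check = left.merge(right, on="POSTAL_CODE", how="left", indicator=True)
unmatched = check.loc[check["_merge"] == "left_only", "POSTAL_CODE"].tolist()
print(unmatched)
```

An empty `unmatched` list corresponds to the empty dataframe printed above.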


#### Merge to create desired structure

In [12]:
pd.options.display.max_rows = 999
df_Struct = pd.merge(df,df_LL,on='POSTAL_CODE')
df_Struct

Unnamed: 0,POSTAL_CODE,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937


Figure 2: Dataframe with the geographical coordinates of the neighborhoods in the Toronto "M" postal codes - Answer to Question 2
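
Since each postal code should appear once in both frames, the merge can be asked to verify that. This is a sketch on hypothetical two-row frames; `validate="one_to_one"` is a standard `pd.merge` option that raises an error if a key is duplicated on either side.

```python
import pandas as pd

# Hypothetical frames; the coordinate rows are deliberately in a different order.
boroughs = pd.DataFrame({
    "POSTAL_CODE": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
})
coords = pd.DataFrame({
    "POSTAL_CODE": ["M4A", "M3A"],
    "Latitude": [43.725882, 43.753259],
    "Longitude": [-79.315572, -79.329656],
})

# The default inner join keeps the key order of the left frame.
merged = pd.merge(boroughs, coords, on="POSTAL_CODE", validate="one_to_one")
print(merged)
```

The rows are matched by key rather than by position, so the order of the coordinate file does not matter.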

### Explore and cluster the neighborhoods in Toronto. 

* Working only with the boroughs that contain the word Toronto: explore, cluster, and visualize the data.

#### Create dataframe for Boroughs containing the word Toronto

In [13]:
Dtoronto_data = df_Struct[df_Struct['Borough'] == 'Downtown Toronto']  
Etoronto_data = df_Struct[df_Struct['Borough'] == 'East Toronto']
Ctoronto_data = df_Struct[df_Struct['Borough'] == 'Central Toronto']
TYtoronto_data = df_Struct[df_Struct['Borough'] == 'Toronto/York'] 
Wtoronto_data = df_Struct[df_Struct['Borough'] == 'West Toronto']

toronto_data = pd.concat([Dtoronto_data, Etoronto_data, Ctoronto_data, TYtoronto_data, Wtoronto_data])
toronto_data.reset_index(drop=True, inplace=True)
toronto_data

Unnamed: 0,POSTAL_CODE,Borough,Neighbourhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
5,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
6,M6G,Downtown Toronto,Christie,43.669542,-79.422564
7,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
8,M5J,Downtown Toronto,"Harbourfront East, Union Station, Toronto Islands",43.640816,-79.381752
9,M5K,Downtown Toronto,"Toronto Dominion Centre, Design Exchange",43.647177,-79.381576
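
An alternative to listing each borough by name is a substring filter. This sketch runs on a hypothetical four-row sample; note that, unlike the concatenation above, it keeps the rows in their original order rather than grouping them by borough.

```python
import pandas as pd

# Hypothetical sample with two boroughs that contain "Toronto" and two that do not.
sample = pd.DataFrame({
    "POSTAL_CODE": ["M3A", "M5A", "M4E", "M6C"],
    "Borough": ["North York", "Downtown Toronto", "East Toronto", "York"],
})

# Keep rows whose borough name contains the literal word "Toronto".
mask = sample["Borough"].str.contains("Toronto", regex=False)
toronto_only = sample[mask].reset_index(drop=True)
print(toronto_only)
```

The filter would also pick up any borough name containing "Toronto" that is not in the explicit list, so the two approaches are equivalent only when the list is complete.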


##### Explore the first Toronto Neighbourhood

In [14]:
toronto_data.loc[0, 'Neighbourhood']

'Regent Park, Harbourfront'

In [15]:
neighbourhood_latitude = toronto_data.loc[0, 'Latitude'] # neighbourhood latitude value
neighbourhood_longitude = toronto_data.loc[0, 'Longitude'] # neighbourhood longitude value

neighbourhood_name = toronto_data.loc[0, 'Neighbourhood'] # neighbourhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighbourhood_name, 
                                                               neighbourhood_latitude, 
                                                               neighbourhood_longitude))

Latitude and longitude values of Regent Park, Harbourfront are 43.6542599, -79.3606359.


##### Visualize the first Toronto Neighbourhood

In [16]:
# zoom_start=18 is the deepest zoom level of the default tiles
map_neighbourhood = folium.Map(location=[neighbourhood_latitude, neighbourhood_longitude], zoom_start=18)

folium.Marker(
    location=[neighbourhood_latitude, neighbourhood_longitude],
    popup=neighbourhood_name,
    icon=folium.Icon(icon="cloud"),
).add_to(map_neighbourhood)


map_neighbourhood

#### Cluster the Boroughs containing the word Toronto

In [17]:
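# Average only the numeric coordinate columns; the text columns cannot be averaged
toronto_grouped = toronto_data.groupby('Neighbourhood')[['Latitude', 'Longitude']].mean().reset_index()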
toronto_grouped

Unnamed: 0,Neighbourhood,Latitude,Longitude
0,Berczy Park,43.644771,-79.373306
1,"Brockton, Parkdale Village, Exhibition Place",43.636847,-79.428191
2,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
3,"CN Tower, King and Spadina, Railway Lands, Har...",43.628947,-79.39442
4,Central Bay Street,43.657952,-79.387383
5,Christie,43.669542,-79.422564
6,Church and Wellesley,43.66586,-79.38316
7,"Commerce Court, Victoria Hotel",43.648198,-79.379817
8,Davisville,43.704324,-79.38879
9,Davisville North,43.712751,-79.390197
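
Grouping matters only when a neighbourhood name appears on more than one row. The sketch below uses hypothetical data in which `Alpha` spans two postal codes, so its coordinates are averaged while `Beta` is unchanged.

```python
import pandas as pd

# Hypothetical sample: "Alpha" appears under two postal codes.
sample = pd.DataFrame({
    "POSTAL_CODE": ["X1A", "X1B", "X2A"],
    "Neighbourhood": ["Alpha", "Alpha", "Beta"],
    "Latitude": [43.0, 44.0, 45.0],
    "Longitude": [-79.0, -80.0, -81.0],
})

# Select the numeric columns before averaging so text columns are left out.
grouped = (
    sample.groupby("Neighbourhood")[["Latitude", "Longitude"]]
    .mean()
    .reset_index()
)
print(grouped)
```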


#### Cluster the Boroughs containing the word Toronto and add the labels to the Dataframe

In [18]:
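k = 5  # number of clusters

# Cluster on the coordinates only
toronto_clustering = toronto_data.drop(columns=['POSTAL_CODE', 'Borough', 'Neighbourhood'])
# n_init=10 is set explicitly because the default differs between scikit-learn versions
kmeans = KMeans(n_clusters=k, random_state=0, n_init=10).fit(toronto_clustering)

# Add the cluster label of each row as the first column
toronto_data.insert(0, 'Cluster Labels', kmeans.labels_)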

In [19]:
toronto_data

Unnamed: 0,Cluster Labels,POSTAL_CODE,Borough,Neighbourhood,Latitude,Longitude
0,2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
1,2,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
2,2,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
3,2,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,2,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
5,2,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
6,0,M6G,Downtown Toronto,Christie,43.669542,-79.422564
7,2,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
8,2,M5J,Downtown Toronto,"Harbourfront East, Union Station, Toronto Islands",43.640816,-79.381752
9,2,M5K,Downtown Toronto,"Toronto Dominion Centre, Design Exchange",43.647177,-79.381576
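
To show roughly what the clustering does with the coordinates, here is a toy version of Lloyd's algorithm in NumPy. It is a sketch for intuition only, not scikit-learn's implementation: `lloyd_kmeans` is a hypothetical helper, the six coordinate pairs are made up, and the initialisation is a plain random pick rather than k-means++.

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=20, seed=0):
    """Toy Lloyd's algorithm: returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct input points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its members; keep it if it has none.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical (latitude, longitude) pairs forming two tight groups.
coords = np.array([
    [43.65, -79.38], [43.66, -79.39], [43.64, -79.37],
    [43.70, -79.45], [43.71, -79.46], [43.72, -79.44],
])
labels, centroids = lloyd_kmeans(coords, k=2)
print(labels)
```

The label numbers themselves are arbitrary; only the grouping is meaningful, which is why cluster 0 and cluster 2 in the table above carry no ordering.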


#### Visualize Dataframe for Boroughs containing the word Toronto

In [20]:
# create map
map_clusters = folium.Map(location=[43.651070, -79.347015], zoom_start=10)

# set color scheme for the clusters: one color per cluster label
colors_array = cm.rainbow(np.linspace(0, 1, k))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
# cluster labels run from 0 to k-1, so they index the color list directly
for lat, lon, neighbourhood, cluster in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Neighbourhood'], toronto_data['Cluster Labels']):
    label = folium.Popup(str(neighbourhood) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters
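
The colour assignment can be inspected on its own, without drawing a map. In this sketch the `labels` list is a hypothetical stand-in for the `Cluster Labels` column; labels from KMeans take values from 0 to k-1, so a palette of k colours can be indexed by the label.

```python
import numpy as np
from matplotlib import cm, colors

k = 5  # same number of clusters as in the notebook

# One hex colour per cluster, evenly spaced along the rainbow colormap.
palette = [colors.rgb2hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, k))]

# Hypothetical labels as KMeans would return them (values 0 .. k-1).
labels = [2, 2, 0, 4, 1, 3]
marker_colours = [palette[label] for label in labels]
print(marker_colours)
```

Rows with the same label receive the same colour, and each of the k clusters gets a distinct one.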