# **APPLIED DATA SCIENCE CAPSTONE - WEEK 3**

## **SOLUTION TO QUESTION 1**

### **Importing the required libraries**

In [11]:
import pandas as pd
import numpy as np
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

### **Scraping the data from the url**

In [3]:
data = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

### **Loading the scraped data into a data frame**

In [4]:
df = data[0]
df.head()

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


Now we will start to modify the data frame according to our requirements.

### **Replacing the "Not assigned" values in the Borough column with NaN**

In [5]:
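# Assign the result back; replace(..., inplace=True) on a column selection may not update df in newer pandas.
df['Borough'] = df['Borough'].replace('Not assigned', np.nan)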
df.head()

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | NaN | Not assigned |
| 1 | M2A | NaN | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


### **Dropping the rows where the Borough value is NaN**

In [6]:
df.dropna(subset=['Borough'], inplace=True)
df.head()

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
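
The two cleaning steps so far can be tried on a small frame without hitting Wikipedia. This is only a sketch: the `toy` rows and the `cleaned` name are made up for illustration, and the final `reset_index` is an optional extra that the notebook itself does not perform.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows that mimic the layout of the scraped table.
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M5A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto"],
    "Neighborhood": ["Not assigned", "Parkwoods", "Regent Park, Harbourfront"],
})

# Mark unassigned boroughs as missing, assigning the result back to the column.
toy["Borough"] = toy["Borough"].replace("Not assigned", np.nan)

# Drop rows with a missing borough and renumber the index from 0.
cleaned = toy.dropna(subset=["Borough"]).reset_index(drop=True)
print(cleaned)
```

Only the two rows with an assigned borough should remain in `cleaned`.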


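### **Replacing the "Not assigned" values in the Neighborhood column with the corresponding Borough value**

In [7]:
# Where Neighborhood is 'Not assigned', take the Borough value from the same row.
df['Neighborhood'] = df['Neighborhood'].mask(df['Neighborhood'] == 'Not assigned', df['Borough'])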
df.head()

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
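
None of the first five rows has an unassigned neighborhood, so the output above does not show the replacement at work. A minimal sketch on made-up rows (the `toy` frame is hypothetical) illustrates a row-wise fallback using `Series.mask`, which substitutes the aligned value from another Series wherever a condition holds:

```python
import pandas as pd

# Hypothetical rows: the second one has a borough but no neighborhood.
toy = pd.DataFrame({
    "Borough": ["North York", "Queen's Park"],
    "Neighborhood": ["Parkwoods", "Not assigned"],
})

# Where the condition is True, use the Borough value from the same row.
is_unassigned = toy["Neighborhood"] == "Not assigned"
toy["Neighborhood"] = toy["Neighborhood"].mask(is_unassigned, toy["Borough"])
print(toy)
```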


### **Grouping the data by postal code and borough, then joining the neighborhoods in each group**

In [8]:
df = df.groupby(['Postal Code', 'Borough'])['Neighborhood'].apply(', '.join).reset_index()
df.head()

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
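
The grouping step matters only when a postal code appears on several rows. The sketch below uses made-up rows (the `toy` frame is an assumption, not the scraped data) to show how such rows collapse into a single comma-separated entry:

```python
import pandas as pd

# Hypothetical rows in which postal code M5A is listed twice.
toy = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M6A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Lawrence Manor"],
})

# One row per (postal code, borough) pair; neighborhoods are joined in their original order.
grouped = (
    toy.groupby(["Postal Code", "Borough"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(grouped)
```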


### **Using the shape attribute to check the shape of our data**

In [9]:
df.shape

(103, 3)
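
Beyond the shape, a few extra checks can confirm that the cleaning did what was intended. The helper below is a hypothetical addition (the name `summarize_cleaning` and the toy frame are not part of the original notebook); it reports the row count, whether postal codes are unique, and whether any missing or "Not assigned" values remain.

```python
import pandas as pd

def summarize_cleaning(frame: pd.DataFrame) -> dict:
    """Return simple sanity-check figures for a cleaned postal code frame."""
    return {
        "rows": len(frame),
        "unique_postal_codes": bool(frame["Postal Code"].is_unique),
        "missing_values": int(frame.isna().sum().sum()),
        "unassigned_left": int((frame == "Not assigned").sum().sum()),
    }

# Hypothetical cleaned rows used only to exercise the helper.
toy = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighborhood": ["Malvern, Rouge", "Rouge Hill, Port Union, Highland Creek"],
})
report = summarize_cleaning(toy)
print(report)
```

Calling `summarize_cleaning(df)` on the notebook's frame would be the analogous check for the real data.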

## **SOLUTION TO QUESTION 2**

### **Importing the required libraries for this part**

In [13]:
import json
from geopy.geocoders import Nominatim

### **Reading the coordinate values from the csv file**

In [15]:
df_coord = pd.read_csv("http://cocl.us/Geospatial_data")
df_coord.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


### **Merging the postal code data frame and the coordinate data frame into a single data frame using a left join**

In [16]:
final_df = df.merge(df_coord, on='Postal Code', how='left')
final_df.head()

| | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Rouge Hill, Port Union, Highland Creek | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
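
A left join keeps every postal code even when no coordinates are found for it, so it can be worth checking for unmatched rows. The sketch below uses made-up frames (`left`, `right`, and the postal code `M9Z` are hypothetical) together with the `indicator` and `validate` options of `DataFrame.merge`:

```python
import pandas as pd

# Hypothetical frames: M9Z has no entry in the coordinate table.
left = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Etobicoke"],
})
right = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# validate raises if a postal code is duplicated on either side;
# indicator adds a _merge column recording where each row came from.
merged = left.merge(
    right, on="Postal Code", how="left", validate="one_to_one", indicator=True
)
unmatched = merged.loc[merged["_merge"] == "left_only", "Postal Code"].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would mean that every postal code received coordinates.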


## **SOLUTION TO QUESTION 3**

### **Importing the libraries required for this part**

In [27]:
from sklearn.cluster import KMeans
#!conda install -c conda-forge folium=0.5.0 --yes
import folium

### **Creating a new data frame that keeps only the rows whose borough name contains "Toronto"**

In [29]:
# Filter first, then copy, so that adding a column later does not trigger SettingWithCopyWarning.
toronto_df = final_df[final_df['Borough'].str.contains("Toronto")].copy()
toronto_df.head()

| | Postal Code | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 37 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |
| 41 | M4K | East Toronto | The Danforth West, Riverdale | 43.679557 | -79.352188 |
| 42 | M4L | East Toronto | India Bazaar, The Beaches West | 43.668999 | -79.315572 |
| 43 | M4M | East Toronto | Studio District | 43.659526 | -79.340923 |
| 44 | M4N | Central Toronto | Lawrence Park | 43.72802 | -79.38879 |


### **Converting the coordinates data frame into an array to implement KMeans clustering later**

In [37]:
df_coordinates = toronto_df[['Latitude','Longitude']]
X = np.asarray(df_coordinates)

### **Creating a KMeans clustering model using sklearn and using it to cluster the neighborhoods of Toronto**

In [42]:
clf = KMeans(n_clusters = 4, random_state = 0)
clf.fit(X)
clusters = clf.labels_
toronto_df['Clusters'] = clusters
toronto_df.head()

| | Postal Code | Borough | Neighborhood | Latitude | Longitude | Clusters |
|---|---|---|---|---|---|---|
| 37 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 | 3 |
| 41 | M4K | East Toronto | The Danforth West, Riverdale | 43.679557 | -79.352188 | 3 |
| 42 | M4L | East Toronto | India Bazaar, The Beaches West | 43.668999 | -79.315572 | 3 |
| 43 | M4M | East Toronto | Studio District | 43.659526 | -79.340923 | 3 |
| 44 | M4N | Central Toronto | Lawrence Park | 43.72802 | -79.38879 | 2 |
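
To give a feel for what the clustering step does, here is a minimal NumPy sketch of the basic k-means iteration (Lloyd's algorithm): assign each point to its nearest center, then move each center to the mean of its points. It is an illustration only, not scikit-learn's implementation; the function name `lloyd_kmeans` and the toy points are assumptions. Like the notebook, it uses plain Euclidean distance, which on raw latitude and longitude values is only an approximation of distance on the ground.

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=20, seed=0):
    """Basic k-means: alternate nearest-center assignment and center updates."""
    rng = np.random.default_rng(seed)
    # Start from k distinct points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its points; keep it if it has none.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment against the final centers.
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1), centers

# Hypothetical toy points forming two well-separated groups.
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, centers = lloyd_kmeans(points, k=2)
print(labels)
```

The numeric label given to each group depends on the random start, so only the grouping itself is meaningful, which also applies to the `Clusters` column above.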


### **Creating a map of Toronto using the latitudes and longitudes of various neighborhoods**

In [43]:
toronto_map = folium.Map(location=[43.65, -79.4], zoom_start = 15)

colors = ['red','green','blue','yellow']

for lat, lon, bor, clu in zip(toronto_df['Latitude'], toronto_df['Longitude'], toronto_df['Borough'], toronto_df['Clusters']):
    # Popup text shows the borough; the fill color encodes the cluster label.
    label = folium.Popup(bor, parse_html = False)
    folium.CircleMarker(
        [lat, lon],
        radius = 5,
        popup = label,
        color = 'black',
        fill = True,
        fill_color = colors[clu],
        fill_opacity = 0.7
    ).add_to(toronto_map)
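
The map center above is hard-coded. One alternative, sketched here on a few rows copied from the output shown earlier, is to derive the center and a bounding box from the data. The names `center` and `bounds` are illustrative assumptions.

```python
import pandas as pd

# A few coordinate rows taken from the Toronto data frame shown earlier.
coords = pd.DataFrame({
    "Latitude": [43.676357, 43.679557, 43.72802],
    "Longitude": [-79.293031, -79.352188, -79.38879],
})

# Center of the map: the mean of the marker coordinates.
center = [coords["Latitude"].mean(), coords["Longitude"].mean()]

# South-west and north-east corners enclosing all markers.
bounds = [
    [coords["Latitude"].min(), coords["Longitude"].min()],
    [coords["Latitude"].max(), coords["Longitude"].max()],
]
print(center, bounds)
```

With Folium, `center` could be passed as `location` to `folium.Map`, and `bounds` to the map's `fit_bounds` method so that all markers are in view.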

### **Displaying the map of clusters across the Toronto boroughs**

In [44]:
toronto_map

### The notebook now meets all the requirements. The map created with Folium is displayed above.

# **THANK YOU FOR READING MY NOTEBOOK.**