<h1 align="center">Clustering Covid-19 Cases in Toronto - Canada</h1>

<p align="justify">In this notebook we cluster the Covid-19 case data for the city of Toronto, Canada. To do this, we download the data from the following URLs: <a href="https://open.toronto.ca/dataset/covid-19-cases-in-toronto/">Toronto Covid Data</a> and <a href="https://cocl.us/Geospatial_data">Toronto Postal Code Coordinates</a>. The data must be cleaned and normalized before we can get the visual insight we are looking for. The main goal is to see how the active Covid-19 cases were distributed across the city of Toronto on the date the data was downloaded. With this, people arriving in Toronto can learn where the virus hot spots are and avoid venues in the neighborhoods with the most active cases.</p>

<h2>1. Managing the Data</h2>

<h3>1.1 Downloading the data</h3>
<p>We need the CSV files that hold the data for our insights. To download and read them, we import the <b>Pandas</b> and <b>Wget</b> libraries.</p>

In [1]:
import pandas as pd
import wget
print("Libraries imported!")

Libraries imported!


<p>Next we download the data from their respective sources. Both files will be saved in the project folder.</p>

In [9]:
wget.download("https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/e5bf35bc-e681-43da-b2ce-0242d00922ad?format=csv")
wget.download("https://cocl.us/Geospatial_data")
print("Csv files downloaded!")

Csv files downloaded!


<h3>1.2 Transforming the data</h3>
<p>With the CSV files in our project folder, we load each of them into a pandas dataframe with <code>read_csv</code>.</p>

In [2]:
covid_df = pd.read_csv('COVID19 cases.csv')
coordinates_df = pd.read_csv('Geospatial_Coordinates.csv')
print("Dataframe conversion done")

Dataframe conversion done


<p>Let's check our dataframes</p>

In [3]:
# Toronto Covid-19 dataframe
covid_df.head()

Unnamed: 0,_id,Assigned_ID,Outbreak Associated,Age Group,Neighbourhood Name,FSA,Source of Infection,Classification,Episode Date,Reported Date,Client Gender,Outcome,Currently Hospitalized,Currently in ICU,Currently Intubated,Ever Hospitalized,Ever in ICU,Ever Intubated
0,1,1,Sporadic,50 to 59 Years,Willowdale East,M2N,Travel,CONFIRMED,2020-01-22,2020-01-23,FEMALE,RESOLVED,No,No,No,No,No,No
1,2,2,Sporadic,50 to 59 Years,Willowdale East,M2N,Travel,CONFIRMED,2020-01-21,2020-01-23,MALE,RESOLVED,No,No,No,Yes,No,No
2,3,3,Sporadic,20 to 29 Years,Parkwoods-Donalda,M3A,Travel,CONFIRMED,2020-02-05,2020-02-21,FEMALE,RESOLVED,No,No,No,No,No,No
3,4,4,Sporadic,60 to 69 Years,Church-Yonge Corridor,M4W,Travel,CONFIRMED,2020-02-16,2020-02-25,FEMALE,RESOLVED,No,No,No,No,No,No
4,5,5,Sporadic,60 to 69 Years,Church-Yonge Corridor,M4W,Travel,CONFIRMED,2020-02-20,2020-02-26,MALE,RESOLVED,No,No,No,No,No,No


In [4]:
# Toronto Coordinates Dataframe
coordinates_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


<h3>1.3 Cleaning and normalizing the data</h3>
<p align="justify">Looking at the dataframes, it is clear that we need to clean and normalize the Toronto Covid dataframe, because we do not need all of the information in it. For our project we only need the following features: Neighbourhood Name, FSA (postal code), Classification and Outcome. But first we replace the space in the name of the Postal Code column of the Toronto Coordinates dataframe with an underscore.</p>

In [5]:
coordinates_df.rename(columns={'Postal Code': 'Postal_Code'}, inplace=True)
coordinates_df.head()

Unnamed: 0,Postal_Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [6]:
# Drop the columns in the Covid-19 dataframe that are not necessary for our project
covid_df.drop(columns=[ '_id', 'Assigned_ID', 'Outbreak Associated', 'Age Group', 'Source of Infection', 'Episode Date', 'Reported Date', 'Client Gender', 'Currently Hospitalized', 'Currently in ICU', 'Currently Intubated', 'Ever Hospitalized', 'Ever in ICU', 'Ever Intubated'], inplace=True)
covid_df.head()

Unnamed: 0,Neighbourhood Name,FSA,Classification,Outcome
0,Willowdale East,M2N,CONFIRMED,RESOLVED
1,Willowdale East,M2N,CONFIRMED,RESOLVED
2,Parkwoods-Donalda,M3A,CONFIRMED,RESOLVED
3,Church-Yonge Corridor,M4W,CONFIRMED,RESOLVED
4,Church-Yonge Corridor,M4W,CONFIRMED,RESOLVED


<p>We already have the data that we will use, but let's give some of the columns more meaningful names and reorder them for a better view.</p>

In [7]:
# Let's rename the columns Neighbourhood Name, Classification and FSA
covid_df.rename(columns={'Neighbourhood Name': 'Neighborhood', 'FSA': 'Postal_Code', 'Classification': 'Status'}, inplace=True)

# Reordering the columns positions

# .copy() gives an independent dataframe, so later changes do not touch a view of covid_df
covid_reduced_df = covid_df[['Postal_Code', 'Neighborhood', 'Status', 'Outcome']].copy()
covid_reduced_df.head()

Unnamed: 0,Postal_Code,Neighborhood,Status,Outcome
0,M2N,Willowdale East,CONFIRMED,RESOLVED
1,M2N,Willowdale East,CONFIRMED,RESOLVED
2,M3A,Parkwoods-Donalda,CONFIRMED,RESOLVED
3,M4W,Church-Yonge Corridor,CONFIRMED,RESOLVED
4,M4W,Church-Yonge Corridor,CONFIRMED,RESOLVED


<p>Now we clean the data by following these steps:<br>
<ul>
<li>Drop the nan/null values in Postal Code, because without this data we cannot map the Neighborhood</li>
<li>The nan/null values in the Neighborhood column will be replaced by the Postal Code value</li>
<li>In the Status column, we only need the current confirmed cases</li>
<li>In the Outcome column, we only need the current active cases</li>
</ul>
</p>
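
<p>The four steps above can be sketched on a small, made-up dataframe before applying them to the real data. The rows and the name <code>toy_df</code> are assumptions for illustration only; they just mimic the columns of covid_reduced_df.</p>

```python
import pandas as pd

# Hypothetical toy data that mimics the columns of covid_reduced_df
toy_df = pd.DataFrame({
    "Postal_Code": ["M2N", None, "M4W", "M8V", "M1B"],
    "Neighborhood": ["Willowdale East", "Annex", None, "New Toronto", "Rouge"],
    "Status": ["CONFIRMED", "CONFIRMED", "CONFIRMED", "PROBABLE", "CONFIRMED"],
    "Outcome": ["RESOLVED", "ACTIVE", "ACTIVE", "ACTIVE", "ACTIVE"],
})

# 1. Drop the rows without a postal code
toy_clean = toy_df.dropna(subset=["Postal_Code"]).copy()

# 2. Use the postal code where the neighborhood name is missing
toy_clean["Neighborhood"] = toy_clean["Neighborhood"].fillna(toy_clean["Postal_Code"])

# 3 and 4. Keep only the confirmed cases that are still active
mask = (toy_clean["Status"] == "CONFIRMED") & (toy_clean["Outcome"] == "ACTIVE")
toy_clean = toy_clean[mask].reset_index(drop=True)

print(toy_clean)
```

<p>Combining the two filters in one boolean mask is only a shortcut; the cells below apply them one at a time so that the shape can be checked after each step.</p>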

In [8]:
# Before the changes let's check the dataframe shape
covid_reduced_df.shape

(15338, 4)

In [9]:
# Drop the nan/null values in the Postal Code column
covid_reduced_df = covid_reduced_df.dropna(subset=['Postal_Code'])
covid_reduced_df.shape

(14775, 4)

In [10]:
# Fill the nan/null values in Neighborhood with the Postal Code. First, we must know how many records have no Neighborhood value
count = covid_reduced_df["Neighborhood"].isna().sum()
print(count)

46


In [11]:
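# Let's replace the null data in Neighborhood.
# Assigning the result back avoids calling fillna(inplace=True) on a column selection,
# which newer pandas versions do not reliably write back to the dataframe
covid_reduced_df['Neighborhood'] = covid_reduced_df['Neighborhood'].fillna(covid_reduced_df['Postal_Code'])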

# Checking if there are still some null rows
count = covid_reduced_df["Neighborhood"].isna().sum()
print(count)

0


In [12]:
# Removing values that are not necessary in the column Status

covid_clean_df = covid_reduced_df[covid_reduced_df.Status == 'CONFIRMED']
covid_clean_df.shape

(13673, 4)

In [13]:
# Check if we have rows in Status with other values than CONFIRMED

covid_clean_df.groupby(by='Status').agg('count')

Unnamed: 0_level_0,Postal_Code,Neighborhood,Outcome
Status,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
CONFIRMED,13673,13673,13673


In [14]:
# Removing values that are not necessary in the column Outcome

covid_clean_df = covid_clean_df[covid_clean_df.Outcome == 'ACTIVE']
covid_clean_df.shape

(349, 4)

In [15]:
# Check if we have rows in Outcome with other values than ACTIVE

covid_clean_df.groupby(by='Outcome').agg('count')

Unnamed: 0_level_0,Postal_Code,Neighborhood,Status
Outcome,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
ACTIVE,349,349,349


<p>This completes the data cleaning. We started with a dataframe of 15338 records and filtered it down to 349 records. Let's move on to the final part of managing the data.</p>

<h3>1.4 Merging the dataframes</h3>
<p>With both dataframes ready, we merge the Latitude and Longitude columns from coordinates_df into covid_clean_df to get a new dataframe called covid_toronto_df.</p>

In [16]:
# We need to reset the index in the covid_clean_df

covid_clean_df.reset_index(inplace=True, drop=True)

# Merge the Latitude and Longitude postal code values

covid_toronto_df = pd.merge(covid_clean_df, coordinates_df, on='Postal_Code')
covid_toronto_df.head()

Unnamed: 0,Postal_Code,Neighborhood,Status,Outcome,Latitude,Longitude
0,M8V,Mimico (includes Humber Bay Shores),CONFIRMED,ACTIVE,43.605647,-79.501321
1,M8V,New Toronto,CONFIRMED,ACTIVE,43.605647,-79.501321
2,M8V,New Toronto,CONFIRMED,ACTIVE,43.605647,-79.501321
3,M8V,Mimico (includes Humber Bay Shores),CONFIRMED,ACTIVE,43.605647,-79.501321
4,M1B,Rouge,CONFIRMED,ACTIVE,43.806686,-79.194353
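
<p><code>pd.merge</code> performs an inner join by default, so any case whose postal code is missing from the coordinates dataframe is dropped without warning. The sketch below shows one way to check for such rows, using a left join and the <code>indicator</code> parameter. The toy frames and the postal code <code>M9Z</code> are made up for illustration.</p>

```python
import pandas as pd

# Hypothetical toy frames that mimic covid_clean_df and coordinates_df
cases = pd.DataFrame({
    "Postal_Code": ["M8V", "M8V", "M1B", "M9Z"],
    "Neighborhood": ["New Toronto", "Mimico", "Rouge", "Unknown"],
})
coords = pd.DataFrame({
    "Postal_Code": ["M8V", "M1B"],
    "Latitude": [43.605647, 43.806686],
    "Longitude": [-79.501321, -79.194353],
})

# A left join keeps every case; indicator=True adds a "_merge" column
# that says whether a match was found in coords
checked = pd.merge(cases, coords, on="Postal_Code", how="left", indicator=True)

# Cases whose postal code has no coordinates
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["Postal_Code", "Neighborhood"]])
```

<p>If <code>unmatched</code> is empty, the inner join used in this notebook loses no cases.</p>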


In [17]:
# Now we can combine the Postal_Code and Neighborhood columns

covid_toronto_df['Neighborhood'] = covid_toronto_df['Neighborhood'] + ' ' + covid_toronto_df['Postal_Code']

In [18]:
# Finally we group by Neighborhood and count the total active cases in each one. This also gets rid of Outcome and Status

covid_toronto_total = covid_toronto_df.groupby(['Neighborhood','Latitude', 'Longitude']).Outcome.agg('count').to_frame('Active_Cases').reset_index()
covid_toronto_total

Unnamed: 0,Neighborhood,Latitude,Longitude,Active_Cases
0,Agincourt South-Malvern West M1S,43.794200,-79.262029,2
1,Alderwood M8W,43.602414,-79.543484,2
2,Annex M5R,43.672710,-79.405678,2
3,Banbury-Don Mills M3B,43.745906,-79.352188,1
4,Banbury-Don Mills M3C,43.725900,-79.340923,1
...,...,...,...,...
128,Woburn M1G,43.770992,-79.216917,4
129,Woburn M1H,43.773136,-79.239476,5
130,York University Heights M3J,43.767980,-79.487262,3
131,Yorkdale-Glen Park M6A,43.718518,-79.464763,1
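
<p>Since the goal is to find the hot spots, it can help to sort the grouped result by the number of active cases. The following is a minimal sketch on made-up rows shaped like covid_toronto_df (one row per active case); it counts rows with <code>size()</code>, which here gives the same totals as counting the Outcome column because the keys contain no missing values.</p>

```python
import pandas as pd

# Hypothetical toy rows shaped like covid_toronto_df (one row per active case)
toy_cases = pd.DataFrame({
    "Neighborhood": ["Woburn M1G", "Woburn M1G", "Rouge M1B", "Woburn M1G", "Annex M5R"],
    "Latitude": [43.770992, 43.770992, 43.806686, 43.770992, 43.672710],
    "Longitude": [-79.216917, -79.216917, -79.194353, -79.216917, -79.405678],
})

totals = (
    toy_cases.groupby(["Neighborhood", "Latitude", "Longitude"])
    .size()                                      # number of rows per group
    .reset_index(name="Active_Cases")            # back to a flat dataframe
    .sort_values("Active_Cases", ascending=False)
    .reset_index(drop=True)
)
print(totals)
```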


<h2>2. Mapping and Clustering the Data</h2>

<h3>2.1 Mapping the data</h3>
<p>Using the previous dataframe, let's map the Covid-19 cases in Toronto. First, we need to import the libraries and modules for the task.</p>

In [19]:
from geopy.geocoders import Nominatim
import folium
import requests
import numpy as np
from folium.plugins import FastMarkerCluster
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
print('Libraries imported.')

Libraries imported.


<p>We will need to save the Toronto coordinates in their respective variables</p>

In [20]:
address = 'Toronto, ON'
geolocator = Nominatim(user_agent="toronto_mapping")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


<p>Generating Toronto's map</p>

In [21]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

map_toronto

<h3>2.2 Clustering the data</h3>
<p>We will use KMeans to cluster the covid-19 cases in Toronto to be able to show it in the map</p>

In [22]:
# set number of clusters
kclusters = 3

# drop the neighborhood column to work only with numeric values in a new dataframe
# (the axis is named explicitly; pandas 2.0 no longer accepts it as a positional argument)
toronto_grouped_clustering = covid_toronto_total.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [23]:
# add clustering labels to the dataframe
toronto_grouped_clustering.insert(0, 'Cluster Labels', kmeans.labels_)
toronto_grouped_clustering.head()

Unnamed: 0,Cluster Labels,Latitude,Longitude,Active_Cases
0,1,43.7942,-79.262029,2
1,1,43.602414,-79.543484,2
2,1,43.67271,-79.405678,2
3,1,43.745906,-79.352188,1
4,1,43.7259,-79.340923,1
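
<p>KMeans measures Euclidean distance on the raw columns. Latitude and longitude differ by only fractions of a degree across the city, while the case counts differ by whole units, so Active_Cases probably dominates the clusters above. A possible variation, which is not what this notebook does, is to standardize the columns first. The sketch below uses scikit-learn's <code>StandardScaler</code> on made-up data; the values and the name <code>toy</code> are assumptions.</p>

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data shaped like toronto_grouped_clustering
toy = pd.DataFrame({
    "Latitude": [43.60, 43.61, 43.80, 43.81, 43.70, 43.71],
    "Longitude": [-79.54, -79.50, -79.20, -79.19, -79.40, -79.41],
    "Active_Cases": [1, 2, 1, 12, 2, 11],
})

# Rescale every column to zero mean and unit variance
scaled = StandardScaler().fit_transform(toy)

# Cluster the scaled values instead of the raw ones
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)

toy.insert(0, "Cluster Labels", labels)
print(toy)
```

<p>Whether scaling is desirable depends on the question: without it the clusters mostly describe case counts, with it location and case counts carry similar weight.</p>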


<h3>2.3 Adding the clustered data to the Toronto map</h3>
<p>We will configure the color scheme for the clusters and add the markers to the map based on that scheme</p>

In [29]:
# set color scheme for the clusters: one shade of red per cluster label
colors_array = cm.Reds(np.linspace(0, 1, kclusters))
heat = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
# note: KMeans labels are arbitrary, so a darker shade does not by itself mean more cases
for lat, lon, poi, cluster, cases in zip(toronto_grouped_clustering['Latitude'], toronto_grouped_clustering['Longitude'], covid_toronto_total['Neighborhood'], toronto_grouped_clustering['Cluster Labels'], toronto_grouped_clustering['Active_Cases']):
    label = folium.Popup('Neighborhood: ' + str(poi) + " Active Covid Cases: " + str(cases), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=8,
        popup=label,
        color=heat[cluster],
        fill=True,
        fill_color=heat[cluster],
        fill_opacity=1).add_to(map_toronto)

map_toronto

<h3>2.4 Clustering the Covid-19 cases without using KMeans</h3>
<p>Python offers several ways to show clustered data without running KMeans or another clustering method ourselves. Folium has a plugin, FastMarkerCluster, that groups nearby markers at each zoom level and lets us cluster our Toronto Covid-19 dataframe very quickly</p>

In [91]:
# Using FastMarkerCluster to easily cluster the dataframe info that we created in the previous steps
FastMarkerCluster(data=list(zip(covid_toronto_total['Latitude'].values, covid_toronto_total['Longitude'].values))).add_to(map_toronto)
folium.LayerControl(position='topright').add_to(map_toronto)

map_toronto

<p>As you can see, this is the easiest way to cluster the data. However, if we want to work with the clusters themselves, computing them explicitly is the better option, whether with KMeans or with another clustering method.</p>
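
<p>One example of working with explicit clusters is to order them by severity, so that a darker marker always means more active cases. KMeans labels are arbitrary numbers, so the sketch below ranks each cluster by its mean case count and picks the color from that rank. The data, the name <code>toy_clusters</code> and the three hex colors are assumptions for illustration.</p>

```python
import pandas as pd

# Hypothetical toy result shaped like toronto_grouped_clustering
toy_clusters = pd.DataFrame({
    "Cluster Labels": [1, 1, 0, 2, 2, 0],
    "Active_Cases": [1, 2, 12, 5, 6, 14],
})

# Hypothetical light-to-dark red palette, one color per cluster
palette = ["#fcbba1", "#fb6a4a", "#a50f15"]

# Rank the clusters by their mean number of active cases (0 = fewest)
mean_cases = toy_clusters.groupby("Cluster Labels")["Active_Cases"].mean()
rank = mean_cases.rank(method="first").astype(int) - 1

# Translate each row's label into its rank, then into a color
toy_clusters["Color"] = toy_clusters["Cluster Labels"].map(rank).map(lambda r: palette[r])
print(toy_clusters)
```

<p>The resulting Color column could then be passed to the <code>color</code> and <code>fill_color</code> arguments of the circle markers instead of indexing the palette with the raw label.</p>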