<h1 align="center">Clustering Covid-19 Cases in Toronto - Canada</h1>

<p align="justify">In this notebook we cluster the Covid-19 case data for the city of Toronto, Canada. To do so, we download the data from two sources: <a href="https://open.toronto.ca/dataset/covid-19-cases-in-toronto/">Toronto Covid Data</a> and <a href="https://cocl.us/Geospatial_data">Toronto Postal Code Coordinates</a>. The data must be cleaned and normalized before it can give the visual insight we are looking for. The main goal is to see how the clusters of Covid-19 cases are distributed across the city as of the date the data was downloaded. With this information, people arriving in Toronto can learn where the virus hot spots are and avoid venues in the neighborhoods with the most active cases.</p>

<h2>1. Managing the Data</h2>

<h3>1.1 Downloading the data</h3>
<p>We need the CSV files that hold the data for our analysis. To get them, we import the <b>Pandas</b> and <b>Wget</b> libraries.</p>

In [1]:
import pandas as pd
import wget
print("Libraries imported!")

Libraries imported!


<p>Next, we download the data from their respective sources. Both files will be saved in the project folder.</p>

In [2]:
covid_csv = wget.download("https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/e5bf35bc-e681-43da-b2ce-0242d00922ad?format=csv")
coordinates_csv = wget.download("https://cocl.us/Geospatial_data")
print("Csv files downloaded!")

Csv files downloaded!
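
<p>Re-running the download cell fetches both files again. The following is a minimal sketch of one way to skip the download when a file is already present. The helper name <b>download_if_missing</b> and its <b>fetch</b> parameter are assumptions made for illustration; they are not part of the original notebook or of the Wget library, and the sketch uses the standard library's <b>urlretrieve</b> instead of <b>wget.download</b>.</p>

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url, filename, fetch=urlretrieve):
    """Download `url` to `filename` unless that file already exists.

    `fetch` is any callable that takes (url, destination). It defaults to
    urllib's urlretrieve and can be replaced, for example in a test.
    This helper is hypothetical and only illustrates the idea.
    """
    path = Path(filename)
    if not path.exists():
        fetch(url, str(path))
    # Return the local path so it can be passed straight to pd.read_csv
    return str(path)
```

<p>Under these assumptions, a call such as download_if_missing(url, "covid_cases.csv") would return the local file name, which can be passed to pd.read_csv just like the value returned by wget.download. The file name "covid_cases.csv" is only an example.</p>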


<h3>1.2 Transforming the data</h3>
<p>With the CSV files in our project folder, we convert them into pandas dataframes. Pandas provides the tools for this task.</p>

In [140]:
covid_df = pd.read_csv(covid_csv)
coordinates_df = pd.read_csv(coordinates_csv)
print("Dataframe conversion done")

Dataframe conversion done


<p>Let's check our dataframes.</p>

In [141]:
# Toronto Covid-19 dataframe
covid_df.head()

Unnamed: 0,_id,Outbreak Associated,Age Group,Neighbourhood Name,FSA,Source of Infection,Classification,Episode Date,Reported Date,Client Gender,Outcome,Currently Hospitalized,Currently in ICU,Currently Intubated,Ever Hospitalized,Ever in ICU,Ever Intubated
0,44294,Sporadic,50-59,Malvern,M1B,Institutional,CONFIRMED,2020-03-25,2020-03-27,MALE,RESOLVED,No,No,No,No,No,No
1,44295,Sporadic,20-29,Malvern,M1B,Community,CONFIRMED,2020-03-20,2020-03-28,MALE,RESOLVED,No,No,No,Yes,No,No
2,44296,Sporadic,60-69,Malvern,M1B,Travel,CONFIRMED,2020-03-04,2020-03-08,FEMALE,RESOLVED,No,No,No,Yes,Yes,Yes
3,44297,Outbreak Associated,50-59,Rouge,M1B,N/A - Outbreak associated,CONFIRMED,2020-05-02,2020-05-04,FEMALE,RESOLVED,No,No,No,No,No,No
4,44298,Sporadic,30-39,Rouge,M1B,Close contact,CONFIRMED,2020-05-31,2020-06-06,FEMALE,RESOLVED,No,No,No,No,No,No


In [142]:
# Toronto Coordinates Dataframe
coordinates_df.head()

Unnamed: 0,Postal_Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


<h3>1.3 Cleaning and normalizing the data</h3>
<p align="justify">Looking at the dataframes, it is clear that the Toronto Covid dataframe needs to be cleaned and normalized, because we do not need all of the information in it. For our project we only need the following features: Neighbourhood Name, FSA (postal code), Classification and Outcome. Let's work on it.</p>

In [143]:
# Drop the columns in the Covid-19 dataframe that are not necessary for our project
covid_df.drop(columns=[ '_id', 'Outbreak Associated', 'Age Group', 'Source of Infection', 'Episode Date', 'Reported Date', 'Client Gender', 'Currently Hospitalized', 'Currently in ICU', 'Currently Intubated', 'Ever Hospitalized', 'Ever in ICU', 'Ever Intubated'], inplace=True)
covid_df.head()

Unnamed: 0,Neighbourhood Name,FSA,Classification,Outcome
0,Malvern,M1B,CONFIRMED,RESOLVED
1,Malvern,M1B,CONFIRMED,RESOLVED
2,Malvern,M1B,CONFIRMED,RESOLVED
3,Rouge,M1B,CONFIRMED,RESOLVED
4,Rouge,M1B,CONFIRMED,RESOLVED
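
<p>Instead of listing the thirteen columns to drop, the four required columns could be selected while reading the file. The sketch below illustrates this with the <b>usecols</b> parameter of <b>pd.read_csv</b>. The two rows are made up for illustration and use only a subset of the real columns; the names <b>sample_csv</b>, <b>wanted</b> and <b>sample_df</b> are assumptions, not part of the notebook.</p>

```python
import io

import pandas as pd

# Two made-up rows laid out like the Toronto file (not real data)
sample_csv = io.StringIO(
    "_id,Age Group,Neighbourhood Name,FSA,Classification,Outcome\n"
    "1,50-59,Malvern,M1B,CONFIRMED,RESOLVED\n"
    "2,20-29,Rouge,M1B,PROBABLE,ACTIVE\n"
)

# Only these columns are parsed; every other column is skipped
wanted = ["Neighbourhood Name", "FSA", "Classification", "Outcome"]
sample_df = pd.read_csv(sample_csv, usecols=wanted)
print(sample_df.columns.tolist())
```

<p>Naming the columns to keep also means the cell does not fail if the data source later adds or removes one of the unused columns.</p>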


<p>We now have the data that we will use, but let's rename some of the columns to more meaningful names and reorder them for a better view.</p>

In [144]:
# Let's rename the columns Neighbourhood Name, Classification and FSA
covid_df.rename(columns={'Neighbourhood Name': 'Neighborhood', 'FSA': 'Postal_Code', 'Classification': 'Status'}, inplace=True)

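# Reorder the columns. The .copy() call gives an independent dataframe, so the
# in-place edits in the following cells do not act on a slice of covid_df
covid_reduced_df = covid_df[['Postal_Code', 'Neighborhood', 'Status', 'Outcome']].copy()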
covid_reduced_df.head()

Unnamed: 0,Postal_Code,Neighborhood,Status,Outcome
0,M1B,Malvern,CONFIRMED,RESOLVED
1,M1B,Malvern,CONFIRMED,RESOLVED
2,M1B,Malvern,CONFIRMED,RESOLVED
3,M1B,Rouge,CONFIRMED,RESOLVED
4,M1B,Rouge,CONFIRMED,RESOLVED


<p>Now we are going to clean the data according to the following rules:</p>
<ul>
<li>Drop the rows with NaN/null values in Postal Code, because without this data we cannot map the Neighborhood</li>
<li>Replace the NaN/null values in the Neighborhood column with the Postal Code value</li>
<li>In the Status column, keep only the confirmed cases</li>
<li>In the Outcome column, keep only the active cases</li>
</ul>
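
<p>As a quick illustration, the four rules can be sketched on a small made-up dataframe. The rows below are invented for the example and are not taken from the Toronto data, and the function name <b>clean_cases</b> is an assumption used only here. The notebook itself applies the same steps one cell at a time.</p>

```python
import pandas as pd

# Made-up rows for illustration only
toy_df = pd.DataFrame({
    "Postal_Code": ["M1B", None, "M1C", "M1E", "M1G"],
    "Neighborhood": ["Malvern", "Rouge", None, "West Hill", "Woburn"],
    "Status": ["CONFIRMED", "CONFIRMED", "CONFIRMED", "PROBABLE", "CONFIRMED"],
    "Outcome": ["ACTIVE", "ACTIVE", "ACTIVE", "ACTIVE", "RESOLVED"],
})


def clean_cases(df):
    """Apply the four cleaning rules and return a new dataframe."""
    # Rule 1: drop rows without a postal code
    out = df.dropna(subset=["Postal_Code"]).copy()
    # Rule 2: use the postal code where the neighborhood is missing
    out["Neighborhood"] = out["Neighborhood"].fillna(out["Postal_Code"])
    # Rule 3: keep only confirmed cases
    out = out[out["Status"] == "CONFIRMED"]
    # Rule 4: keep only active cases
    out = out[out["Outcome"] == "ACTIVE"]
    return out.reset_index(drop=True)


toy_clean = clean_cases(toy_df)
print(toy_clean)
```

<p>In this toy example two rows survive: the Malvern row, and the M1C row whose missing neighborhood was filled with its postal code.</p>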

In [145]:
# Before the changes let's check the dataframe shape
covid_reduced_df.shape

(14911, 4)

In [146]:
# Drop the nan/null values in the Postal Code column
covid_reduced_df.dropna(subset=['Postal_Code'], inplace=True)
covid_reduced_df.shape

(14344, 4)

In [147]:
# Fill the NaN/null values in Neighborhood with the Postal Code. First, we must know how many records have no Neighborhood value
count = covid_reduced_df["Neighborhood"].isna().sum()
print(count)

46


In [148]:
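# Let's replace the null values in Neighborhood with the Postal Code.
# Assigning the result back avoids an in-place fillna on a column accessor,
# which recent pandas versions warn about
covid_reduced_df["Neighborhood"] = covid_reduced_df["Neighborhood"].fillna(covid_reduced_df["Postal_Code"])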

# Checking if there are still some null rows
count = covid_reduced_df["Neighborhood"].isna().sum()
print(count)

0


In [149]:
# Keep only the rows whose Status is CONFIRMED

covid_clean_df = covid_reduced_df[covid_reduced_df.Status == 'CONFIRMED']
covid_clean_df.shape

(13262, 4)

In [150]:
# Check if we have rows in Status with other values than CONFIRMED

covid_clean_df.groupby(by='Status').agg('count')

Status,Postal_Code,Neighborhood,Outcome
CONFIRMED,13262,13262,13262


In [151]:
# Keep only the rows whose Outcome is ACTIVE

covid_clean_df = covid_clean_df[covid_clean_df.Outcome == 'ACTIVE']
covid_clean_df.shape

(532, 4)

In [152]:
# Check if we have rows in Outcome with other values than ACTIVE

covid_clean_df.groupby(by='Outcome').agg('count')

Outcome,Postal_Code,Neighborhood,Status
ACTIVE,532,532,532


<p>This completes the data cleaning: we started with a dataframe of 14911 records and filtered it down to 532 records. Let's move on to the final part of managing the data.</p>
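
<p>The row counts reported by the shape outputs in this section can be tallied to see how many records each filter removed. The sketch below only does arithmetic on those four numbers. The step labels and the names <b>steps</b>, <b>removed</b> and <b>share_kept</b> are assumptions introduced for the example.</p>

```python
# Row counts copied from the shape outputs in this section
steps = [
    ("initial", 14911),
    ("postal code present", 14344),
    ("status CONFIRMED", 13262),
    ("outcome ACTIVE", 532),
]

# Compare each step with the one before it
removed = {}
for (_, before), (name, after) in zip(steps, steps[1:]):
    removed[name] = before - after
    print(f"{name}: kept {after}, removed {before - after}")

# Fraction of the original rows that remain after all filters
share_kept = steps[-1][1] / steps[0][1]
print(f"share of original rows kept: {share_kept:.1%}")
```

<p>By far the largest reduction comes from the last filter, which keeps only the cases that were still active when the data was downloaded.</p>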

<h3>1.4 Merging the dataframes</h3>
<p>With both dataframes ready, we merge the Latitude and Longitude columns from coordinates_df into covid_clean_df to get a new dataframe called covid_toronto_df.</p>

In [154]:
# We need to reset the index in the covid_clean_df

covid_clean_df.reset_index(inplace=True, drop=True)

# Merge the Latitude and Longitude postal code values

covid_toronto_df = pd.merge(covid_clean_df, coordinates_df, on='Postal_Code')
covid_toronto_df.head()

Unnamed: 0,Postal_Code,Neighborhood,Status,Outcome,Latitude,Longitude
0,M1B,Rouge,CONFIRMED,ACTIVE,43.806686,-79.194353
1,M1B,Rouge,CONFIRMED,ACTIVE,43.806686,-79.194353
2,M1B,Malvern,CONFIRMED,ACTIVE,43.806686,-79.194353
3,M1B,Malvern,CONFIRMED,ACTIVE,43.806686,-79.194353
4,M1B,Malvern,CONFIRMED,ACTIVE,43.806686,-79.194353


In [155]:
# Check the dataframe shape
covid_toronto_df.shape

(532, 6)
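
<p>By default <b>pd.merge</b> performs an inner join, so a case whose postal code is missing from the coordinates table would be dropped without any message. Here the row count stays at 532, which suggests that nothing was lost. The following is a minimal sketch of how such a check could be made explicit with the <b>how</b>, <b>validate</b> and <b>indicator</b> parameters. The rows are made up, and "M9Z" is a placeholder for a postal code that is absent from the coordinates table.</p>

```python
import pandas as pd

# Made-up rows; "M9Z" stands for a postal code absent from the coordinates
cases = pd.DataFrame({
    "Postal_Code": ["M1B", "M1B", "M9Z"],
    "Neighborhood": ["Malvern", "Rouge", "M9Z"],
})
coords = pd.DataFrame({
    "Postal_Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

checked = pd.merge(
    cases,
    coords,
    on="Postal_Code",
    how="left",              # keep every case, even without coordinates
    validate="many_to_one",  # raise if a postal code repeats in coords
    indicator=True,          # add a _merge column describing each row
)

# Cases that found no matching coordinates
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched["Postal_Code"].unique())
```

<p>Rows flagged as left_only have empty Latitude and Longitude values and would need to be handled before plotting them on a map.</p>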