# Covid-19 Dashboard's Dataset
By Abdullah Almuzaini

## Table of Contents
- [Introduction](#intro)
- [Part I - Data Gathering](#Gathering)
- [Part II - Preparing Data](#Preparing)
- [Part III- Exploring the datasets](#Exploring)
- [Part IV- Exporting the datasets](#Exporting)



## <a id='intro'> Introduction </a>

<p style="font-size:18px">The first step in my project is to collect the data needed to build the dashboard. Second, I will prepare the downloaded datasets, which means assessing them and cleaning them where necessary. Lastly, once all the data is in the proper format and shape, it has to be exported from this notebook and stored in the dataset folder in the project directory. These steps rely on a few small Python scripts and libraries, which I describe next. </p><br>


In [69]:
import pandas as pd


from create_new_directory import new_folder
from get_dataset import get_dataset
from reshape_dataset import reshape

## <a id='Gathering'>Part I - Data Gathering</a>

<p style="font-size:18px">The Python scripts I will be using in this part of the project are:<br><br>
    - <b>create_new_directory.py:</b> contains the function <b>new_folder()</b>, which takes a string as the folder name and creates a new directory with that name inside the current working directory. Here it is used to create the directory in which the datasets will be stored. <br><br>
    - <b>get_dataset.py:</b> contains the function <b>get_dataset()</b>, which downloads a dataset from the internet using the requests library and saves it in the 'dataset' directory inside the current working directory. It takes two arguments: the URL, and the file name, which should include the file extension (`csv`, for example). <br><br>
</p>
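
The two helper scripts are not shown in this notebook, so the following is only a minimal sketch of what they might look like. The function bodies, the `folder_name` parameter of `get_dataset`, and the returned path are all assumptions, and the sketch uses the standard-library `urllib` in place of requests so that it has no third-party dependency.

```python
import os
import urllib.request


def new_folder(folder_name):
    """Create folder_name inside the current working directory if it is missing."""
    path = os.path.join(os.getcwd(), folder_name)
    os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    return path


def get_dataset(url, file_name, folder_name="dataset"):
    """Download url and save it as folder_name/file_name (assumed behavior)."""
    target = os.path.join(new_folder(folder_name), file_name)
    # Assumption: urllib stands in for requests to keep this sketch dependency-free.
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        out.write(response.read())
    return target
```

Returning the saved path is a design choice of this sketch: it lets the caller pass the result straight to `pd.read_csv` instead of rebuilding the path by hand.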

#### Collect the daily covid-19 confirmed cases dataset

In [70]:
# Create a new directory in the current running directory if it does not exist 

folder_name = 'dataset'
new_folder(folder_name)

In [71]:
# Download the confirmed covid-19 cases from its source https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series

URL = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv'
file_name = 'covid19_confirmed_cases.csv'

# get_dataset takes the URL and the file name
data = get_dataset(URL,file_name)
confirmed_df = pd.read_csv('dataset'+ '/'+file_name)


In [72]:
confirmed_df.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,9/30/21,10/1/21,10/2/21,10/3/21,10/4/21,10/5/21,10/6/21,10/7/21,10/8/21,10/9/21
0,,Afghanistan,33.93911,67.709953,0,0,0,0,0,0,...,155174,155191,155191,155191,155287,155309,155380,155429,155448,155466
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,170131,170778,171327,171794,171794,172618,173190,173723,174168,174643
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,203359,203517,203657,203789,203915,204046,204171,204276,204388,204490
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,15222,15222,15222,15222,15267,15271,15284,15288,15291,15291
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,56583,58076,58603,58943,58943,59895,60448,60803,61023,61245


#### Collect the daily covid-19 death cases dataset

In [73]:
# Download the death covid-19 cases from its source https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series

URL = 'https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'
file_name = 'covid19_death_cases.csv'

# get_dataset takes the URL and the file name
data = get_dataset(URL,file_name)
death_df = pd.read_csv('dataset'+ '/'+file_name)

death_df.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,9/30/21,10/1/21,10/2/21,10/3/21,10/4/21,10/5/21,10/6/21,10/7/21,10/8/21,10/9/21
0,,Afghanistan,33.93911,67.709953,0,0,0,0,0,0,...,7204,7206,7206,7206,7212,7214,7220,7221,7221,7221
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,2698,2705,2710,2713,2713,2725,2734,2746,2753,2759
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,5812,5815,5819,5822,5826,5831,5838,5843,5846,5850
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,130,130,130,130,130,130,130,130,130,130
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,1537,1567,1574,1577,1577,1587,1598,1603,1613,1618


#### Collect the data of the daily recovery from covid19 

In [74]:
URL = 'https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv'
file_name = 'covid19_recovered_cases.csv'

# get_dataset takes the URL and the file name
data = get_dataset(URL,file_name)
recovery_df = pd.read_csv('dataset'+ '/'+file_name)

recovery_df.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,9/30/21,10/1/21,10/2/21,10/3/21,10/4/21,10/5/21,10/6/21,10/7/21,10/8/21,10/9/21
0,,Afghanistan,33.93911,67.709953,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


> <p style="font-size:18px"><br>
    Now we have three datasets downloaded and stored in the dataset directory. All three are in the wide format shown above, with one column per date. 
    </p>

## <a id='Preparing'>Part II- Preparing Data</a>

<p style="font-size:18px">  - In this part of the project, I will need to use the following Python script: <br><br>
    - <b>reshape_dataset.py:</b> contains a function called <b>reshape()</b> that transforms a dataset from the wide format into the long format using the pandas function melt(). The function takes three arguments: the dataframe, the name of the new column that will hold the former column headers (here, <b>date</b>), and the name of the new column that will hold the values (for example, <b>confirmed</b>).
    
</p>
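
Since reshape_dataset.py is not shown here, the following is a minimal sketch of how such a function could be written with `pd.melt`. The body of `reshape` and the choice of identifier columns are assumptions based on the column layout shown above, and the small `wide` frame is toy data rather than real counts.

```python
import pandas as pd


def reshape(df, var_name, value_name):
    """Melt a wide time-series frame into long format (assumed implementation)."""
    # Assumption: these four columns identify a location; every other column is a date.
    id_columns = ["Province/State", "Country/Region", "Lat", "Long"]
    return pd.melt(df, id_vars=id_columns, var_name=var_name, value_name=value_name)


# Toy data: two locations and two date columns.
wide = pd.DataFrame({
    "Province/State": [None, "Queensland"],
    "Country/Region": ["Angola", "Australia"],
    "Lat": [-11.2027, -27.4698],
    "Long": [17.8739, 153.0251],
    "1/22/20": [0, 0],
    "1/23/20": [0, 1],
})
long = reshape(wide, "date", "confirmed")
print(long.shape)  # (4, 6): 2 locations x 2 dates, 4 id columns + date + value
```

The same arithmetic explains the shapes reported in the cells that follow: each row of the wide frame produces one long row per date column, and the number of columns always becomes six.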

#### Prepare the three datasets to be in an appropriate format and shape by converting them from the wide format to the long format.

In [75]:
confirmed = reshape(confirmed_df,
               'date',
               'confirmed')
confirmed.shape

(174933, 6)

In [76]:
confirmed.sample(5)

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,confirmed
40970,,South Sudan,6.877,31.307,6/16/20,1776
144452,,Papua New Guinea,-6.314993,143.95555,6/22/21,17013
33770,Queensland,Australia,-27.4698,153.0251,5/22/20,1060
163006,Hong Kong,China,22.3,114.2,8/28/21,12100
99170,Reunion,France,-21.1151,55.5364,1/11/21,9359


In [77]:
death = reshape(death_df,
               'date',
               'deaths')
death.shape

(174933, 6)

In [78]:
death.tail(5)

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,deaths
174928,,Vietnam,14.058324,108.277199,10/9/21,20442
174929,,West Bank and Gaza,31.9522,35.2332,10/9/21,4465
174930,,Yemen,15.552727,48.516388,10/9/21,1775
174931,,Zambia,-13.133897,27.849332,10/9/21,3653
174932,,Zimbabwe,-19.015438,29.154857,10/9/21,4636


In [79]:
recovery = reshape(recovery_df,
               'date',
               'recovery')
recovery.shape

(165528, 6)

In [80]:
recovery.sample(5)

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,recovery
67818,,Timor-Leste,-8.8742,125.7275,10/4/20,28
12927,,Uruguay,-32.5228,-55.7658,3/10/20,0
33797,,Antigua and Barbuda,17.0608,-61.7964,5/29/20,19
132141,,Jordan,31.24,36.51,6/5/21,720190
50389,,Syria,34.8021,38.9968,7/30/20,229


><p style="font-size:18px"><br>
    As we can see above, the datasets are now in a suitable format. The date columns that followed the Long column, one for each day from January 22, 2020, to October 9, 2021, have been collapsed into a single column called date, and the values that sat under those columns are now stored in a new column called recovery in the recovery dataset, deaths in the death dataset, and confirmed in the confirmed dataset. 
    </p><br>

## <a id='Exploring'> Part III- Exploring the datasets</a>

<p style="font-size:18px"><br>
    Next, I will explore the datasets using the pandas method info(), which prints general information about each dataset, such as the number of records, the names of the columns, the count of non-null records in each column, and the data type of each column.<br>
    </p>

In [81]:
confirmed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 174933 entries, 0 to 174932
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   Province/State  54549 non-null   object 
 1   Country/Region  174933 non-null  object 
 2   Lat             173679 non-null  float64
 3   Long            173679 non-null  float64
 4   date            174933 non-null  object 
 5   confirmed       174933 non-null  int64  
dtypes: float64(2), int64(1), object(3)
memory usage: 8.0+ MB


In [82]:
death.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 174933 entries, 0 to 174932
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   Province/State  54549 non-null   object 
 1   Country/Region  174933 non-null  object 
 2   Lat             173679 non-null  float64
 3   Long            173679 non-null  float64
 4   date            174933 non-null  object 
 5   deaths          174933 non-null  int64  
dtypes: float64(2), int64(1), object(3)
memory usage: 8.0+ MB


In [83]:
recovery.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165528 entries, 0 to 165527
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   Province/State  44517 non-null   object 
 1   Country/Region  165528 non-null  object 
 2   Lat             164901 non-null  float64
 3   Long            164901 non-null  float64
 4   date            165528 non-null  object 
 5   recovery        165528 non-null  int64  
dtypes: float64(2), int64(1), object(3)
memory usage: 7.6+ MB


> <p style="font-size:18px"><br> According to the summaries above, all three dataframes have missing values in the Province/State, Lat, and Long columns. We can also see that the date column is stored as object (string) rather than as a datetime type. Furthermore, the confirmed and death datasets match each other in size and in non-null counts. However, the recovery dataset has fewer records than the other two (165528 versus 174933) and a different pattern of missing values, which needs to be inspected to find out what the issue is. 
    </p><br>
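
The notebook keeps the date column as text, so the following is only a hedged sketch of how it could be converted if chronological ordering is needed. The toy values and the variable names `as_text` and `as_dates` are illustrative assumptions; the format string follows the month/day/two-digit-year pattern visible in the column headers.

```python
import pandas as pd

dates = pd.DataFrame({"date": ["1/10/21", "1/2/21", "12/31/20", "1/1/21"]})

# As strings, the values sort character by character, not chronologically.
as_text = sorted(dates["date"])
print(as_text)  # ['1/1/21', '1/10/21', '1/2/21', '12/31/20']

# An explicit format avoids any ambiguity between month-first and day-first dates.
dates["date"] = pd.to_datetime(dates["date"], format="%m/%d/%y")
as_dates = dates["date"].sort_values().dt.strftime("%Y-%m-%d").tolist()
print(as_dates)  # ['2020-12-31', '2021-01-01', '2021-01-02', '2021-01-10']
```

This character-by-character ordering is why a `groupby('date')` on the text column lists 1/1/21 followed by 1/10/21 rather than starting at 1/22/20.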


<p style="font-size:18px"><br>
In the next cell, I will take a subset of each of the three datasets that contains only the records with NaN in `Province/State`, to find out where the issue is. 
    </p><br>

In [84]:
# Storing only rows with null values in ['Province/State'] column

recovery_na = recovery[recovery['Province/State'].isna()]
confirmed_na = confirmed[confirmed['Province/State'].isna()]
death_na = death[death['Province/State'].isna()]

<p style="font-size:18px"><br>
Next, I will use pandas.DataFrame.sample to draw random samples from the above subsets to investigate the issue.
    </p><br>

In [85]:
recovery_na.sample(20)

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,recovery
129627,,Andorra,42.5063,1.5218,5/27/21,13405
87090,,Timor-Leste,-8.8742,125.7275,12/16/20,30
19046,,Cameroon,3.848,11.5021,4/3/20,17
61631,,Germany,51.165691,10.451526,9/11/20,231349
112343,,Kenya,-0.0236,37.9062,3/22/21,90376
64020,,India,20.593684,78.96288,9/20/20,4396399
128131,,Djibouti,11.8251,42.5903,5/21/21,11313
70397,,Montenegro,42.708678,19.37439,10/14/20,10201
36868,,Mongolia,46.8625,103.8467,6/9/20,87
36798,,Fiji,-17.7134,178.065,6/9/20,18


In [86]:
confirmed_na.sample(20)

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,confirmed
175,,Malawi,-13.2543,34.3015,1/22/20,0
91796,,Antigua and Barbuda,17.0608,-61.7964,12/16/20,151
155590,,Mongolia,46.8625,103.8467,8/1/21,164155
12444,,Liberia,6.428055,-9.429499,3/6/20,0
132115,,Indonesia,-0.7893,113.9213,5/9/21,1713684
4415,,Singapore,1.2833,103.8333,2/6/20,28
91463,,Singapore,1.2833,103.8333,12/14/20,58325
136687,,Ukraine,48.3794,31.1656,5/25/21,2244084
40307,,Gabon,-0.8037,11.6094,6/14/20,3463
45111,,Nepal,28.1667,84.25,7/1/20,14046


In [87]:
death_na.sample(20)

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,deaths
55892,,Colombia,4.5709,-74.2973,8/9/20,12842
21990,,Seychelles,-4.6796,55.492,4/9/20,0
52982,,Trinidad and Tobago,10.6918,-61.2225,7/29/20,8
96196,,Saint Lucia,13.9094,-60.9789,12/31/20,5
115652,,Iceland,64.9631,-19.0208,3/11/21,29
133003,,New Zealand,-40.9006,174.886,5/12/21,26
75463,,Georgia,42.3154,43.3569,10/18/20,136
76111,,San Marino,43.9424,12.4578,10/20/20,42
4398,,Poland,51.9194,19.1451,2/6/20,0
165905,,Malta,35.9375,14.3754,9/7/21,445


> <p style="font-size:18px"><br> 
    After running the above three cells several times, it turns out that the reason for the null values in the column `Province/State` is that many countries do not report covid-19 cases by Province/State. Instead, they report the cases for the entire country as a whole. However, one country is not reported consistently across the three datasets: Canada. Canada's deaths and confirmed cases are reported by Province/State, but its recovery cases are reported for the entire country as a whole, without a Province/State breakdown. This inconsistency would cause problems when merging the three datasets. <br><br> </p><br>
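
Random sampling found this inconsistency by inspection. As a complement, here is a minimal sketch of a systematic check that compares the set of Province/State values per country between two dataframes. The helper name `granularity_mismatches` and the toy frames are hypothetical and not part of the project scripts.

```python
import pandas as pd


def granularity_mismatches(left, right):
    """Return countries whose set of Province/State values differs between two frames."""
    def regions(df):
        # Missing provinces become "" so that the sets compare cleanly.
        pairs = df[["Country/Region", "Province/State"]].fillna("").drop_duplicates()
        found = {}
        for country, province in zip(pairs["Country/Region"], pairs["Province/State"]):
            found.setdefault(country, set()).add(province)
        return found

    a, b = regions(left), regions(right)
    return sorted(c for c in a.keys() & b.keys() if a[c] != b[c])


# Toy frames: Canada is split by province on one side only.
confirmed_toy = pd.DataFrame({
    "Country/Region": ["Canada", "Canada", "Angola"],
    "Province/State": ["Alberta", "Ontario", None],
})
recovery_toy = pd.DataFrame({
    "Country/Region": ["Canada", "Angola"],
    "Province/State": [None, None],
})
print(granularity_mismatches(confirmed_toy, recovery_toy))  # ['Canada']
```

Run on the real dataframes, a check like this would list every country reported at a different level of detail, not only the ones that happen to appear in a random sample.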


<p style="font-size:18px"><br> The following three cells will demonstrate the issue of Canada's covid-19 reporting method
   </p> <br>

In [88]:
death_na[death_na['Country/Region']== "Canada"]

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,deaths


In [89]:
confirmed_na[confirmed_na['Country/Region']== "Canada"]

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,confirmed


In [90]:
recovery_na[recovery_na['Country/Region']== "Canada"]

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,recovery
39,,Canada,56.1304,-106.3468,1/22/20,0
303,,Canada,56.1304,-106.3468,1/23/20,0
567,,Canada,56.1304,-106.3468,1/24/20,0
831,,Canada,56.1304,-106.3468,1/25/20,0
1095,,Canada,56.1304,-106.3468,1/26/20,0
...,...,...,...,...,...,...
164247,,Canada,56.1304,-106.3468,10/5/21,0
164511,,Canada,56.1304,-106.3468,10/6/21,0
164775,,Canada,56.1304,-106.3468,10/7/21,0
165039,,Canada,56.1304,-106.3468,10/8/21,0


<p style="font-size:18px"><br> 
In order to resolve the above issue, I will aggregate the deaths and confirmed cases of Covid-19 in Canada to the country level so that they match the recovery dataset. 
    </p><br>

<p style="font-size:18px"><br> 
    The first step we need to take is to fetch confirmed cases and deaths records of Canada and aggregate them by `date`, then store them in a separate sub-data frame
    </p><br>

In [91]:
# Select the numeric column before summing so that the text columns are not aggregated
canada_conf = confirmed[confirmed['Country/Region'] == 'Canada'].groupby('date')[['confirmed']].sum()
canada_conf.head()

Unnamed: 0_level_0,confirmed
date,Unnamed: 1_level_1
1/1/21,591149
1/10/21,666375
1/11/21,674624
1/12/21,681015
1/13/21,688097


In [92]:
# Select the numeric column before summing so that the text columns are not aggregated
canada_dth = death[death['Country/Region'] == 'Canada'].groupby('date')[['deaths']].sum()
canada_dth.head()

Unnamed: 0_level_0,deaths
date,Unnamed: 1_level_1
1/1/21,15806
1/10/21,17074
1/11/21,17199
1/12/21,17359
1/13/21,17539


<p style="font-size:18px"><br> 
    The next step is to copy Canada's rows from the recovery dataframe without the `recovery` column. The remaining columns (Province/State, Country/Region, Lat, Long, and date) serve as a country-level template: joining the aggregated deaths and confirmed cases onto them gives Canada one row per date for the entire country instead of one per Province/State, so that these dataframes eventually match the recovery dataframe.
    </p><br>

In [93]:
canada_recovery = recovery[recovery['Country/Region'] == 'Canada'][recovery.columns[:-1]].reset_index(drop=True)
canada_recovery.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,date
0,,Canada,56.1304,-106.3468,1/22/20
1,,Canada,56.1304,-106.3468,1/23/20
2,,Canada,56.1304,-106.3468,1/24/20
3,,Canada,56.1304,-106.3468,1/25/20
4,,Canada,56.1304,-106.3468,1/26/20



<p style="font-size:18px"><br> 
    Now, we are set to join `canada_recovery` with `canada_conf` and `canada_dth` on the date. 
    </p><br>

In [94]:
canada_covid_19_conf = canada_recovery.merge(canada_conf, how='inner', left_on='date', right_index=True)
canada_covid_19_conf.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,confirmed
0,,Canada,56.1304,-106.3468,1/22/20,0
1,,Canada,56.1304,-106.3468,1/23/20,0
2,,Canada,56.1304,-106.3468,1/24/20,0
3,,Canada,56.1304,-106.3468,1/25/20,0
4,,Canada,56.1304,-106.3468,1/26/20,1


In [95]:
canada_covid_19_conf.tail()

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,confirmed
622,,Canada,56.1304,-106.3468,10/5/21,1651603
623,,Canada,56.1304,-106.3468,10/6/21,1655406
624,,Canada,56.1304,-106.3468,10/7/21,1659517
625,,Canada,56.1304,-106.3468,10/8/21,1663716
626,,Canada,56.1304,-106.3468,10/9/21,1665312


In [96]:
canada_covid_19_deaths = canada_recovery.merge(canada_dth, how='inner', left_on='date', right_index=True)
canada_covid_19_deaths.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,deaths
0,,Canada,56.1304,-106.3468,1/22/20,0
1,,Canada,56.1304,-106.3468,1/23/20,0
2,,Canada,56.1304,-106.3468,1/24/20,0
3,,Canada,56.1304,-106.3468,1/25/20,0
4,,Canada,56.1304,-106.3468,1/26/20,0


In [97]:
canada_covid_19_deaths.tail()

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,deaths
622,,Canada,56.1304,-106.3468,10/5/21,28109
623,,Canada,56.1304,-106.3468,10/6/21,28165
624,,Canada,56.1304,-106.3468,10/7/21,28196
625,,Canada,56.1304,-106.3468,10/8/21,28239
626,,Canada,56.1304,-106.3468,10/9/21,28246


> <p style="font-size:18px"><br> 
    Finally, we have Canada's Covid-19 information matched in the three dataframes
</p><br><br>
<p style="font-size:18px"><br> 
   Now, we need to put Canada's data back into the original dataframes (`confirmed`, `death`) by taking each dataframe without Canada's records and then appending Canada's country-level data from `canada_covid_19_conf` and `canada_covid_19_deaths`.
</p><br>

In [98]:
# DataFrame.append was removed in pandas 2.0, so pd.concat is used to add Canada's rows back
confirmed = pd.concat([confirmed[confirmed['Country/Region'] != 'Canada'], canada_covid_19_conf])
confirmed.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,confirmed
0,,Afghanistan,33.93911,67.709953,1/22/20,0
1,,Albania,41.1533,20.1683,1/22/20,0
2,,Algeria,28.0339,1.6596,1/22/20,0
3,,Andorra,42.5063,1.5218,1/22/20,0
4,,Angola,-11.2027,17.8739,1/22/20,0


In [99]:
# DataFrame.append was removed in pandas 2.0, so pd.concat is used to add Canada's rows back
death = pd.concat([death[death['Country/Region'] != 'Canada'], canada_covid_19_deaths])
death.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,deaths
0,,Afghanistan,33.93911,67.709953,1/22/20,0
1,,Albania,41.1533,20.1683,1/22/20,0
2,,Algeria,28.0339,1.6596,1/22/20,0
3,,Andorra,42.5063,1.5218,1/22/20,0
4,,Angola,-11.2027,17.8739,1/22/20,0


<p style="font-size:18px"><br> 
    To make sure that the three dataframes match, I will rerun the same check I did earlier on the rows with a missing `Province/State`.
    </p><br>

In [100]:
recovery_na = recovery[recovery['Province/State'].isna()]
confirmed_na = confirmed[confirmed['Province/State'].isna()]
death_na = death[death['Province/State'].isna()]

In [101]:
print(confirmed_na[confirmed_na['Country/Region']== "Canada"].shape)
print(recovery_na[recovery_na['Country/Region']== "Canada"].shape)
print(death_na[death_na['Country/Region']== "Canada"].shape)

(627, 6)
(627, 6)
(627, 6)


> <p style="font-size:18px"><br> The three dataframes now match for Canada </p>

<p style="font-size:18px"><br> The Final Step is to combine the three datasets into one master dataset
    </p><br>

In [102]:
columns = ['Country/Region','Province/State','date']
master_df = confirmed.merge(death, how='inner', on=columns)
master_df = master_df.merge(recovery, how='inner', on=columns)
master_df = master_df.drop(columns=['Lat_x', 'Long_x','Lat_y','Long_y'])
master_df = master_df[['Country/Region','Province/State','Lat','Long','date', 'confirmed','recovery','deaths']]
master_df.head()

Unnamed: 0,Country/Region,Province/State,Lat,Long,date,confirmed,recovery,deaths
0,Afghanistan,,33.93911,67.709953,1/22/20,0,0,0
1,Albania,,41.1533,20.1683,1/22/20,0,0,0
2,Algeria,,28.0339,1.6596,1/22/20,0,0,0
3,Andorra,,42.5063,1.5218,1/22/20,0,0,0
4,Angola,,-11.2027,17.8739,1/22/20,0,0,0


> <p style="font-size:18px"><br> Now, we have a dataframe containing all the covid-19 data we need and in the proper format.
    </p><br>
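
The merge above uses `Province/State` as a key even though it is missing for most rows, and an inner join silently drops any row that has no partner. The following is a minimal sketch, on toy frames with made-up values, of two pandas options that can make both behaviors visible. It is an optional check under these assumptions, not a step the notebook performs.

```python
import pandas as pd

keys = ["Country/Region", "Province/State", "date"]
confirmed_toy = pd.DataFrame({
    "Country/Region": ["Angola", "Australia"],
    "Province/State": [None, "Queensland"],
    "date": ["1/22/20", "1/22/20"],
    "confirmed": [0, 0],
})
deaths_toy = pd.DataFrame({
    "Country/Region": ["Angola", "Australia", "Canada"],
    "Province/State": [None, "Queensland", "Ontario"],
    "date": ["1/22/20", "1/22/20", "1/22/20"],
    "deaths": [0, 0, 0],
})

# validate="one_to_one" raises MergeError if a key combination is duplicated on either side.
merged = confirmed_toy.merge(deaths_toy, how="inner", on=keys, validate="one_to_one")
print(len(merged))  # 2: the Angola rows match even though Province/State is missing

# An outer merge with indicator=True lists the rows an inner merge would drop.
audit = confirmed_toy.merge(deaths_toy, how="outer", on=keys, indicator=True)
dropped = audit[audit["_merge"] != "both"]
print(dropped["Country/Region"].tolist())  # ['Canada']
```

pandas matches missing key values against each other during a merge, which is why the country-level rows survive the inner join.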

<p style="font-size:18px"><br> 
    With the covid-19 data downloaded and almost prepared, there is only one last step to make all the data ready for my future analysis. The data we have so far lacks the population of each country. Thus, I will add a population dataset to my dataset list; the one I will be using is <a href='https://www.kaggle.com/tanuprabhu/population-by-country-2020'>Population by Country - 2020</a> by <a href='https://www.kaggle.com/tanuprabhu'>Tanu N Prabhu</a>.
</p>

<p style="font-size:18px"><br> 
    To download the dataset, I will use the <a href='https://github.com/Kaggle/kaggle-api'>Kaggle API</a>, which makes it possible to interact with the Kaggle website from code or from the command line, for example to download and upload datasets or to interact with competitions. The `authenticate()` call in the next cell expects Kaggle API credentials to be configured beforehand (for example, a `kaggle.json` token file).
</p>

In [103]:
# Download the population dataset from its source https://www.kaggle.com/tanuprabhu/population-by-country-2020
from kaggle.api.kaggle_api_extended import KaggleApi
api = KaggleApi()
api.authenticate()

api.dataset_download_files('tanuprabhu/population-by-country-2020', path = 'dataset',  unzip=True)
population_df = pd.read_csv('dataset/population_by_country_2020.csv')
population_df.head()

Unnamed: 0,Country (or dependency),Population (2020),Yearly Change,Net Change,Density (P/Km²),Land Area (Km²),Migrants (net),Fert. Rate,Med. Age,Urban Pop %,World Share
0,China,1440297825,0.39 %,5540090,153,9388211,-348399.0,1.7,38,61 %,18.47 %
1,India,1382345085,0.99 %,13586631,464,2973190,-532687.0,2.2,28,35 %,17.70 %
2,United States,331341050,0.59 %,1937734,36,9147420,954806.0,1.8,38,83 %,4.25 %
3,Indonesia,274021604,1.07 %,2898047,151,1811570,-98955.0,2.3,30,56 %,3.51 %
4,Pakistan,221612785,2.00 %,4327022,287,770880,-233379.0,3.6,23,35 %,2.83 %


<p style="font-size:18px"> After downloading the dataset, the first thing I would like to check is whether the country names match between the two datasets. </p>

In [104]:
countries_covid = master_df['Country/Region'].unique()
countries_pop = population_df['Country (or dependency)'].unique()
unmatched = [x for x in countries_covid if x not in countries_pop]
print(unmatched)
print(len(unmatched))
print([x for x in countries_pop if x not in countries_covid])

['Burma', 'Congo (Brazzaville)', 'Congo (Kinshasa)', "Cote d'Ivoire", 'Czechia', 'Diamond Princess', 'Korea, South', 'Kosovo', 'MS Zaandam', 'Saint Kitts and Nevis', 'Saint Vincent and the Grenadines', 'Sao Tome and Principe', 'Summer Olympics 2020', 'Taiwan*', 'US', 'West Bank and Gaza']
16
['United States', 'DR Congo', 'Myanmar', 'South Korea', "Côte d'Ivoire", 'North Korea', 'Taiwan', 'Czech Republic (Czechia)', 'Hong Kong', 'Turkmenistan', 'Congo', 'State of Palestine', 'Puerto Rico', 'Réunion', 'Macao', 'Western Sahara', 'Guadeloupe', 'Martinique', 'French Guiana', 'New Caledonia', 'French Polynesia', 'Mayotte', 'Sao Tome & Principe', 'Channel Islands', 'Guam', 'Curaçao', 'St. Vincent & Grenadines', 'Aruba', 'Tonga', 'U.S. Virgin Islands', 'Isle of Man', 'Cayman Islands', 'Bermuda', 'Northern Mariana Islands', 'Greenland', 'American Samoa', 'Saint Kitts & Nevis', 'Faeroe Islands', 'Sint Maarten', 'Turks and Caicos', 'Saint Martin', 'Gibraltar', 'British Virgin Islands', 'Caribbean

<p style="font-size:18px"> There are 16 unmatched entries: 13 country names, 2 cruise ships (Diamond Princess and MS Zaandam), and the Summer Olympics 2020. Twelve of the country names differ only in spelling and can be fixed by replacing them directly in the dataframe. Kosovo, which has no matching name in the population dataset, and the three non-country entries are left for later analysis. </p>


In [105]:
country_mapper = {
    'Congo (Brazzaville)': 'Congo',
    'Congo (Kinshasa)': 'DR Congo',
    "Cote d'Ivoire": "Côte d'Ivoire",
    'Czechia': 'Czech Republic (Czechia)',
    'Korea, South': 'South Korea',
    'Saint Vincent and the Grenadines': 'St. Vincent & Grenadines',
    'Taiwan*': 'Taiwan',
    'US': 'United States',
    'West Bank and Gaza': 'State of Palestine',
    'Saint Kitts and Nevis': 'Saint Kitts & Nevis',
    'Burma': 'Myanmar',
    'Sao Tome and Principe': 'Sao Tome & Principe'
}
master_df['Country/Region'] = master_df['Country/Region'].replace(country_mapper)

In [106]:
countries_covid = master_df['Country/Region'].unique()
[x for x in countries_covid if x not in countries_pop]


['Diamond Princess', 'Kosovo', 'MS Zaandam', 'Summer Olympics 2020']

> <p style="font-size:18px"> Now the only names left unmatched are Kosovo and the three non-country entries (Diamond Princess, MS Zaandam, and Summer Olympics 2020).</p>

## <a id='Exporting'>Part IV- Exporting the datasets</a>

<p style="font-size:18px"> The last remaining step is to export the data, since it is now almost ready to be uploaded to Tableau </p>

In [107]:
master_df.to_csv('dataset/COVID-19.csv')