# Covid-19 Dashboard's Dataset
By Abdullah Almuzaini

## Table of Contents
- [Introduction](#intro)
- [Part I - Data Gathering](#Gathering)
- [Part II - Preparing Data](#Preparing)
- [Part III - Exploring the datasets](#Exploring)
- [Part IV - Exporting the datasets](#Exporting)



## <a id='intro'> Introduction </a>

<p style="font-size:18px">The first step in my project is to collect the data needed to build the dashboard. Second, I will prepare the downloaded datasets, which means assessing them and cleaning them where necessary. Lastly, once all the data for the project is collected and in the proper format and shape, it will be exported from this notebook and stored in the dataset folder in the project directory. These steps rely on a few Python scripts I wrote and on some libraries, which I describe next. </p><br>


In [1]:
import pandas as pd


from create_new_directory import new_folder
from get_dataset import get_dataset
from reshape_dataset import reshape

## <a id='Gathering'>Part I - Data Gathering</a>

<p style="font-size:18px">The Python scripts I will use in this part of the project are:<br><br>
    - <b>create_new_directory.py:</b> contains the function <b>new_folder()</b>, which takes a folder name as a string and creates that directory inside the current working directory if it does not already exist. Here it is used to create the directory in which the datasets will be stored. <br><br>
    - <b>get_dataset.py:</b> contains the function <b>get_dataset()</b>, which downloads a dataset from the internet using the requests library and saves it in the 'dataset' directory inside the current working directory. It takes two arguments: the URL, and the file name, which should include the file extension (for example, `csv`). <br><br>
</p>
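<p style="font-size:18px">The two scripts are not reproduced in this notebook, so the following is only a minimal sketch of what they might look like. The function names and arguments follow the description above; the optional <b>folder</b> parameter and the returned path are assumptions. The sketch uses the standard-library urllib instead of requests purely to stay dependency-free.</p>

```python
import os
import urllib.request


def new_folder(folder_name):
    """Create folder_name inside the current working directory if it is missing."""
    path = os.path.join(os.getcwd(), folder_name)
    os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    return path


def get_dataset(url, file_name, folder="dataset"):
    """Download url and save it as folder/file_name; return the saved path.

    Hypothetical stand-in for get_dataset.py, which uses requests in the project.
    """
    target = os.path.join(new_folder(folder), file_name)
    with urllib.request.urlopen(url) as response:
        content = response.read()
    with open(target, "wb") as handle:
        handle.write(content)
    return target
```

<p style="font-size:18px">Returning the saved path is a design choice of this sketch; it lets the caller pass the result straight to pd.read_csv instead of rebuilding the path by hand.</p>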

#### Collect the daily covid-19 confirmed cases dataset

In [2]:
# Create a new directory in the current running directory if it does not exist 

folder_name = 'dataset'
new_folder(folder_name)

In [3]:
# Download the confirmed covid-19 cases from its source https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series

URL = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv'
file_name = 'covid19_confirmed_cases.csv'

# get_dataset takes the URL and the file name
data = get_dataset(URL,file_name)
confirmed_df = pd.read_csv('dataset'+ '/'+file_name)


In [4]:
confirmed_df.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,11/18/21,11/19/21,11/20/21,11/21/21,11/22/21,11/23/21,11/24/21,11/25/21,11/26/21,11/27/21
0,,Afghanistan,33.93911,67.709953,0,0,0,0,0,0,...,156739,156812,156864,156896,156911,157015,157032,157144,157171,157190
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,195021,195523,195988,195988,196611,197167,197776,198292,198732,199137
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,208532,208695,208839,208952,209111,209283,209463,209624,209817,209980
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,16035,16086,16086,16086,16299,16342,16426,16566,16712,16712
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,64985,64997,65011,65024,65033,65061,65080,65105,65130,65139


#### Collect the daily covid-19 death cases dataset

In [5]:
# Download the death covid-19 cases from its source https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series

URL = 'https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'
file_name = 'covid19_death_cases.csv'

# get_dataset takes the URL and the file name
data = get_dataset(URL,file_name)
death_df = pd.read_csv('dataset'+ '/'+file_name)

death_df.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,11/18/21,11/19/21,11/20/21,11/21/21,11/22/21,11/23/21,11/24/21,11/25/21,11/26/21,11/27/21
0,,Afghanistan,33.93911,67.709953,0,0,0,0,0,0,...,7297,7361,7363,7365,7365,7305,7306,7307,7307,7308
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,3022,3029,3035,3035,3049,3053,3063,3068,3077,3085
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,6009,6015,6017,6021,6026,6030,6035,6041,6046,6052
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,130,130,130,130,130,130,131,131,131,131
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,1729,1729,1730,1730,1730,1730,1731,1732,1733,1733


#### Collect the data of the daily recovery from covid19 

In [6]:
URL = 'https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv'
file_name = 'covid19_recovered_cases.csv'

# get_dataset takes the URL and the file name
data = get_dataset(URL,file_name)
recovery_df = pd.read_csv('dataset'+ '/'+file_name)

recovery_df.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,11/18/21,11/19/21,11/20/21,11/21/21,11/22/21,11/23/21,11/24/21,11/25/21,11/26/21,11/27/21
0,,Afghanistan,33.93911,67.709953,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,,Albania,41.1533,20.1683,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,,Algeria,28.0339,1.6596,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,,Andorra,42.5063,1.5218,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,,Angola,-11.2027,17.8739,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


> <p style="font-size:18px"><br>
    Now we have three datasets downloaded and stored in the dataset directory. The datasets are in wide format, with one column per date, as shown above. 
    </p>

## <a id='Preparing'>Part II - Preparing Data</a>

<p style="font-size:18px">  - In this part of the project, I will use the following Python script: <br><br>
    - <b>reshape_dataset.py:</b> contains the function <b>reshape()</b>, which transforms a dataset from wide format into long format using the pandas function melt(). It takes three arguments: the dataframe, the name of the new column that will hold the former column headers (here, the dates), and the name of the new column that will hold the values.
    
</p>
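<p style="font-size:18px">Since reshape_dataset.py is not shown here, the following is a minimal sketch of how such a function could be written with pandas melt(). The list of identifier columns is an assumption based on the column names visible in the datasets above, and the numbers in the toy frame are made up.</p>

```python
import pandas as pd

# Columns that identify a location and must not be melted (assumed from the data above).
ID_COLUMNS = ["Province/State", "Country/Region", "Lat", "Long"]


def reshape(df, var_name, value_name):
    """Turn every non-identifier column (one per date) into var_name/value_name rows."""
    return pd.melt(df, id_vars=ID_COLUMNS, var_name=var_name, value_name=value_name)


# Toy wide-format frame: two locations, two date columns.
wide = pd.DataFrame({
    "Province/State": [None, None],
    "Country/Region": ["Afghanistan", "Albania"],
    "Lat": [33.93911, 41.1533],
    "Long": [67.709953, 20.1683],
    "1/22/20": [5, 7],
    "1/23/20": [6, 9],
})

long = reshape(wide, "date", "confirmed")
print(long.shape)  # (4, 6): 2 locations x 2 dates, 4 id columns + date + confirmed
```

<p style="font-size:18px">melt() stacks the date columns one after another, so the long frame lists every location for the first date, then every location for the second date, and so on.</p>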

#### Prepare the three datasets to be in an appropriate format and shape by converting them from the wide format to the long format.

In [7]:
confirmed = reshape(confirmed_df,
               'date',
               'confirmed')
confirmed.shape

(189280, 6)

In [8]:
confirmed.sample(5)

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,confirmed
113360,,Summer Olympics 2020,35.6491,139.7737,3/1/21,0
180927,Nova Scotia,Canada,44.682,-63.7443,10/29/21,7354
182638,Macau,China,22.1667,113.55,11/4/21,77
35051,Quebec,Canada,52.9399,-73.5491,5/26/20,49019
119196,Sint Maarten,Netherlands,18.0425,-63.0548,3/22/21,2104


In [9]:
death = reshape(death_df,
               'date',
               'deaths')
death.shape

(189280, 6)

In [10]:
death.tail(5)

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,deaths
189275,,Vietnam,14.058324,108.277199,11/27/21,24692
189276,,West Bank and Gaza,31.9522,35.2332,11/27/21,4789
189277,,Yemen,15.552727,48.516388,11/27/21,1945
189278,,Zambia,-13.133897,27.849332,11/27/21,3667
189279,,Zimbabwe,-19.015438,29.154857,11/27/21,4704


In [11]:
recovery = reshape(recovery_df,
               'date',
               'recovery')
recovery.shape

(179140, 6)

In [12]:
recovery.sample(5)

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,recovery
18535,Gibraltar,United Kingdom,36.1408,-5.3536,3/31/20,34
17354,,Honduras,15.2,-86.2419,3/27/20,0
37324,,Sudan,12.8628,30.2176,6/10/20,2202
108361,,Uganda,1.373333,32.290275,3/5/21,15065
126328,,North Macedonia,41.6086,21.7453,5/12/21,139266


><p style="font-size:18px"><br>
    As we can see above, the datasets are now in a suitable format. I have transformed the columns after the Long column, which hold one date each from January 22, 2020, through November 27, 2021, into a single column called date, and stored the values that were under each date column in a new variable called recovery in the recovery dataset, deaths in the death dataset, and confirmed in the confirmed dataset. 
    </p><br>

## <a id='Exploring'> Part III - Exploring the datasets</a>

<p style="font-size:18px"><br>
    Next, I will explore the datasets taking advantage of the pandas method info(), which will return general information about each dataset such as the number of records, the name of the variables we have, the count of the non-null records, as well as the data type of each column.<br>
    </p>
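<p style="font-size:18px">As a complement to info(), the null counts can also be tabulated directly. The helper below is a hypothetical convenience function, not part of the project scripts, and it is shown on a small toy frame.</p>

```python
import pandas as pd


def missing_summary(df):
    """Return, per column, the number of nulls and the share of rows that are null."""
    nulls = df.isna()
    return pd.DataFrame({"missing": nulls.sum(), "share": nulls.mean().round(3)})


# Toy frame with the same kind of gaps as the real data.
toy = pd.DataFrame({
    "Province/State": [None, "Quebec", None, None],
    "Country/Region": ["Albania", "Canada", "Algeria", "Angola"],
    "Lat": [41.1533, 52.9399, 28.0339, None],
})

summary = missing_summary(toy)
print(summary)
```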

In [13]:
confirmed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 189280 entries, 0 to 189279
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   Province/State  58812 non-null   object 
 1   Country/Region  189280 non-null  object 
 2   Lat             187928 non-null  float64
 3   Long            187928 non-null  float64
 4   date            189280 non-null  object 
 5   confirmed       189280 non-null  int64  
dtypes: float64(2), int64(1), object(3)
memory usage: 8.7+ MB


In [14]:
death.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 189280 entries, 0 to 189279
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   Province/State  58812 non-null   object 
 1   Country/Region  189280 non-null  object 
 2   Lat             187928 non-null  float64
 3   Long            187928 non-null  float64
 4   date            189280 non-null  object 
 5   deaths          189280 non-null  int64  
dtypes: float64(2), int64(1), object(3)
memory usage: 8.7+ MB


In [15]:
recovery.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 179140 entries, 0 to 179139
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   Province/State  47996 non-null   object 
 1   Country/Region  179140 non-null  object 
 2   Lat             178464 non-null  float64
 3   Long            178464 non-null  float64
 4   date            179140 non-null  object 
 5   recovery        179140 non-null  int64  
dtypes: float64(2), int64(1), object(3)
memory usage: 8.2+ MB


> <p style="font-size:18px"><br> According to the above summaries, all three dataframes have missing values in the Province/State, Lat, and Long columns. We can also see that the date column is stored as a generic object (string) type rather than as a date. Furthermore, the confirmed and death datasets match each other in size and null counts. However, the recovery dataset has fewer records overall than the other two and more missing values in Province/State, which needs to be inspected to find out what the issue is. 
    </p><br>
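<p style="font-size:18px">The date issue can be illustrated with a short sketch. It assumes the month/day/two-digit-year layout seen in the date column above and uses toy values; the conversion itself is not applied to the project dataframes in this notebook.</p>

```python
import pandas as pd

toy = pd.DataFrame({
    "date": ["1/22/20", "1/10/21", "1/1/21", "11/27/21"],
    "confirmed": [0, 3, 2, 4],
})

# As plain strings, the dates sort character by character, not chronologically.
string_order = sorted(toy["date"])
print(string_order)  # ['1/1/21', '1/10/21', '1/22/20', '11/27/21']

# Parse the strings into datetime64 values so that sorting follows the calendar.
toy["date"] = pd.to_datetime(toy["date"], format="%m/%d/%y")
chronological = toy.sort_values("date")["confirmed"].tolist()
print(chronological)  # [0, 2, 3, 4]
```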


<p style="font-size:18px"><br>
In the next cell, I will create subsets of the three datasets that contain only the records with NaN values in `Province/State`, to find out where the issue is. 
    </p><br>

In [16]:
# Storing only rows with null values in ['Province/State'] column

recovery_na = recovery[recovery['Province/State'].isna()]
confirmed_na = confirmed[confirmed['Province/State'].isna()]
death_na = death[death['Province/State'].isna()]

<p style="font-size:18px"><br>
Next, I will use pandas.DataFrame.sample to draw random samples from the above sub-datasets to investigate the issue.
    </p><br>

In [17]:
recovery_na.sample(20)

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,recovery
64372,,Ukraine,48.3794,31.1656,9/20/20,80438
41211,,Ireland,53.1424,-7.6921,6/25/20,23364
96597,,Israel,31.046051,34.851612,1/20/21,488158
124157,,Israel,31.046051,34.851612,5/4/21,831090
20143,,Andorra,42.5063,1.5218,4/7/20,39
112219,,Guinea,9.9456,-9.6966,3/20/21,15862
104940,,Afghanistan,33.93911,67.709953,2/21/21,48834
110851,,Costa Rica,9.7489,-83.7534,3/15/21,188967
51825,,Latvia,56.8796,24.6032,8/4/20,1070
27502,,Samoa,-13.759,-172.1046,5/4/20,0


In [18]:
confirmed_na.sample(20)

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,confirmed
92840,,"Korea, South",35.907757,127.766922,12/18/20,48570
12184,,Honduras,15.2,-86.2419,3/5/20,0
162686,,Argentina,-38.4161,-63.6167,8/25/21,5155079
108800,,"Korea, South",35.907757,127.766922,2/13/21,83525
174921,,Niger,17.607789,8.081666,10/7/21,6084
51105,,Hungary,47.1625,19.5033,7/22/20,4366
167390,,Singapore,1.2833,103.8333,9/10/21,70612
36965,,Antigua and Barbuda,17.0608,-61.7964,6/2/20,26
51685,,Latvia,56.8796,24.6032,7/24/20,1205
9061,,Czechia,49.8175,15.473,2/23/20,0


In [19]:
death_na.sample(20)

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,deaths
141282,,Kuwait,29.31166,47.481766,6/9/21,1806
146974,,Turkey,38.9637,35.2433,6/29/21,49687
68134,,Congo (Brazzaville),-0.228,15.8277,9/21/20,89
145990,,Egypt,26.820553,30.802498,6/26/21,16062
3746,,Djibouti,11.8251,42.5903,2/4/20,0
175388,,Dominican Republic,18.7357,-70.1627,10/9/21,4067
147441,,Kosovo,42.602636,20.902977,7/1/21,2262
148326,,Pakistan,30.3753,69.3451,7/4/21,22427
166047,,Armenia,40.0691,45.0382,9/6/21,4924
53038,,Finland,61.92411,25.748151,7/29/20,329


> <p style="font-size:18px"><br> 
    After running the above three cells several times, it turns out that the null values in the column `Province/State` arise because many countries did not report covid-19 cases by Province/State. Instead, they reported the covid-19 cases for the country as a whole. However, one country is not reported consistently across the three datasets: Canada. Canada's covid-19 deaths and confirmed cases are broken down by Province/State, but its recovery cases are reported for the entire country as a whole, without considering the Province/State. This inconsistency would cause problems when merging the three datasets. <br><br> </p><br>
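<p style="font-size:18px">Repeated sampling finds this kind of inconsistency by chance. A more systematic check is sketched below: it compares, per country, how many distinct `Province/State` labels each dataframe holds. The helper names and the toy frames are hypothetical.</p>

```python
import pandas as pd


def province_counts(df):
    """Count distinct Province/State labels per country, treating NaN as one label."""
    return df.groupby("Country/Region")["Province/State"].nunique(dropna=False)


def granularity_mismatches(left, right):
    """Return the countries whose number of Province/State labels differs."""
    counts = pd.concat(
        [province_counts(left), province_counts(right)], axis=1, keys=["left", "right"]
    )
    shared = counts.dropna()  # keep only countries present in both frames
    return shared[shared["left"] != shared["right"]]


# Toy frames: Canada is split by province on the left but not on the right.
confirmed_toy = pd.DataFrame({
    "Country/Region": ["Albania", "Canada", "Canada"],
    "Province/State": [None, "Alberta", "Quebec"],
})
recovery_toy = pd.DataFrame({
    "Country/Region": ["Albania", "Canada"],
    "Province/State": [None, None],
})

mismatches = granularity_mismatches(confirmed_toy, recovery_toy)
print(mismatches)
```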


<p style="font-size:18px"><br> The following three cells will demonstrate the issue of Canada's covid-19 reporting method
   </p> <br>

In [20]:
death_na[death_na['Country/Region']== "Canada"]

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,deaths


In [21]:
confirmed_na[confirmed_na['Country/Region']== "Canada"]

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,confirmed


In [22]:
recovery_na[recovery_na['Country/Region']== "Canada"]

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,recovery
39,,Canada,56.1304,-106.3468,1/22/20,0
304,,Canada,56.1304,-106.3468,1/23/20,0
569,,Canada,56.1304,-106.3468,1/24/20,0
834,,Canada,56.1304,-106.3468,1/25/20,0
1099,,Canada,56.1304,-106.3468,1/26/20,0
...,...,...,...,...,...,...
177854,,Canada,56.1304,-106.3468,11/23/21,0
178119,,Canada,56.1304,-106.3468,11/24/21,0
178384,,Canada,56.1304,-106.3468,11/25/21,0
178649,,Canada,56.1304,-106.3468,11/26/21,0


<p style="font-size:18px"><br> 
To resolve the above issue, I will recalculate Canada's deaths and confirmed cases at the country level so that they match the recovery dataset. 
    </p><br>

<p style="font-size:18px"><br> 
    The first step we need to take is to fetch confirmed cases and deaths records of Canada and aggregate them by `date`, then store them in a separate sub-data frame
    </p><br>

In [23]:
# Select the value column before summing so that only `confirmed` is aggregated
canada_conf = confirmed[confirmed['Country/Region'] == 'Canada'].groupby('date')[['confirmed']].sum()
canada_conf.head()

Unnamed: 0_level_0,confirmed
date,Unnamed: 1_level_1
1/1/21,591149
1/10/21,666375
1/11/21,674624
1/12/21,681015
1/13/21,688097


In [24]:
# Select the value column before summing so that only `deaths` is aggregated
canada_dth = death[death['Country/Region'] == 'Canada'].groupby('date')[['deaths']].sum()
canada_dth.head()

Unnamed: 0_level_0,deaths
date,Unnamed: 1_level_1
1/1/21,15806
1/10/21,17074
1/11/21,17199
1/12/21,17359
1/13/21,17539


<p style="font-size:18px"><br> 
    The next step is to copy Canada's rows from the recovery dataframe without the `recovery` column. I want to reuse these identifying columns for the death and confirmed dataframes so that deaths and confirmed cases are recorded for the entire country as a whole instead of by Province/State, and therefore match the recovery dataframe.
    </p><br>

In [25]:
canada_recovery = recovery[recovery['Country/Region'] == 'Canada'][recovery.columns[:-1]].reset_index(drop=True)
canada_recovery.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,date
0,,Canada,56.1304,-106.3468,1/22/20
1,,Canada,56.1304,-106.3468,1/23/20
2,,Canada,56.1304,-106.3468,1/24/20
3,,Canada,56.1304,-106.3468,1/25/20
4,,Canada,56.1304,-106.3468,1/26/20



<p style="font-size:18px"><br> 
    Now, we are set to join and apply `canada_recovery` on the `canada_conf` and `canada_dth` 
    </p><br>

In [26]:
canada_covid_19_conf = canada_recovery.merge(canada_conf, how='inner', left_on='date', right_index=True)
canada_covid_19_conf.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,confirmed
0,,Canada,56.1304,-106.3468,1/22/20,0
1,,Canada,56.1304,-106.3468,1/23/20,0
2,,Canada,56.1304,-106.3468,1/24/20,0
3,,Canada,56.1304,-106.3468,1/25/20,0
4,,Canada,56.1304,-106.3468,1/26/20,1


In [27]:
canada_covid_19_conf.tail()

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,confirmed
671,,Canada,56.1304,-106.3468,11/23/21,1780643
672,,Canada,56.1304,-106.3468,11/24/21,1783319
673,,Canada,56.1304,-106.3468,11/25/21,1787542
674,,Canada,56.1304,-106.3468,11/26/21,1790579
675,,Canada,56.1304,-106.3468,11/27/21,1792561


In [28]:
canada_covid_19_deaths = canada_recovery.merge(canada_dth, how='inner', left_on='date', right_index=True)
canada_covid_19_deaths.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,deaths
0,,Canada,56.1304,-106.3468,1/22/20,0
1,,Canada,56.1304,-106.3468,1/23/20,0
2,,Canada,56.1304,-106.3468,1/24/20,0
3,,Canada,56.1304,-106.3468,1/25/20,0
4,,Canada,56.1304,-106.3468,1/26/20,0


In [29]:
canada_covid_19_deaths.tail()

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,deaths
671,,Canada,56.1304,-106.3468,11/23/21,29608
672,,Canada,56.1304,-106.3468,11/24/21,29635
673,,Canada,56.1304,-106.3468,11/25/21,29655
674,,Canada,56.1304,-106.3468,11/26/21,29671
675,,Canada,56.1304,-106.3468,11/27/21,29681


> <p style="font-size:18px"><br> 
    Finally, we have Canada's Covid-19 information matched in the three dataframes
</p><br><br>
<p style="font-size:18px"><br> 
   Now, we need to put Canada's data back in the original dataframes (`confirmed`, `death`) by copying the dataframes excluding records of Canada, then insert Canada's data from (`canada_covid_19_deaths` and `canada_covid_19_conf`)
</p><br>

In [30]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
confirmed = pd.concat([confirmed[confirmed['Country/Region'] != 'Canada'], canada_covid_19_conf])
confirmed.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,confirmed
0,,Afghanistan,33.93911,67.709953,1/22/20,0
1,,Albania,41.1533,20.1683,1/22/20,0
2,,Algeria,28.0339,1.6596,1/22/20,0
3,,Andorra,42.5063,1.5218,1/22/20,0
4,,Angola,-11.2027,17.8739,1/22/20,0


In [31]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
death = pd.concat([death[death['Country/Region'] != 'Canada'], canada_covid_19_deaths])
death.head()

Unnamed: 0,Province/State,Country/Region,Lat,Long,date,deaths
0,,Afghanistan,33.93911,67.709953,1/22/20,0
1,,Albania,41.1533,20.1683,1/22/20,0
2,,Algeria,28.0339,1.6596,1/22/20,0
3,,Andorra,42.5063,1.5218,1/22/20,0
4,,Angola,-11.2027,17.8739,1/22/20,0


<p style="font-size:18px"><br> 
    To make sure that the three dataframes match, I will rerun the earlier check on the rows with a null `Province/State`.
    </p><br>

In [32]:
recovery_na = recovery[recovery['Province/State'].isna()]
confirmed_na = confirmed[confirmed['Province/State'].isna()]
death_na = death[death['Province/State'].isna()]

In [33]:
print(confirmed_na[confirmed_na['Country/Region']== "Canada"].shape)
print(recovery_na[recovery_na['Country/Region']== "Canada"].shape)
print(death_na[death_na['Country/Region']== "Canada"].shape)

(676, 6)
(676, 6)
(676, 6)


> <p style="font-size:18px"><br> Canada's records now match across the three dataframes </p>

<p style="font-size:18px"><br> The Final Step is to combine the three datasets into one master dataset
    </p><br>

In [34]:
columns = ['Country/Region','Province/State','date']
master_df = confirmed.merge(death, how='inner', on=columns)
master_df = master_df.merge(recovery, how='inner', on=columns)
master_df = master_df.drop(columns=['Lat_x', 'Long_x','Lat_y','Long_y'])
master_df = master_df[['Country/Region','Province/State','Lat','Long','date', 'confirmed','recovery','deaths']]
master_df.head()

Unnamed: 0,Country/Region,Province/State,Lat,Long,date,confirmed,recovery,deaths
0,Afghanistan,,33.93911,67.709953,1/22/20,0,0,0
1,Albania,,41.1533,20.1683,1/22/20,0,0,0
2,Algeria,,28.0339,1.6596,1/22/20,0,0,0
3,Andorra,,42.5063,1.5218,1/22/20,0,0,0
4,Angola,,-11.2027,17.8739,1/22/20,0,0,0


> <p style="font-size:18px"><br> Now, we have a dataframe containing all the covid-19 data we need and in the proper format.
    </p><br>
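<p style="font-size:18px">An inner merge silently drops any key combination that is missing from one side. As a hedged sketch, an outer merge with pandas' indicator option can list such rows before committing to the inner join. The toy frames below are hypothetical and mimic the situation before Canada was fixed.</p>

```python
import pandas as pd

KEYS = ["Country/Region", "Province/State", "date"]

confirmed_toy = pd.DataFrame({
    "Country/Region": ["Albania", "China", "Canada"],
    "Province/State": [None, "Hubei", "Quebec"],
    "date": ["1/22/20", "1/22/20", "1/22/20"],
    "confirmed": [0, 1, 0],
})
recovery_toy = pd.DataFrame({
    "Country/Region": ["Albania", "China", "Canada"],
    "Province/State": [None, "Hubei", None],
    "date": ["1/22/20", "1/22/20", "1/22/20"],
    "recovery": [0, 0, 0],
})

# indicator=True adds a `_merge` column: 'both', 'left_only' or 'right_only'.
audit = confirmed_toy.merge(recovery_toy, how="outer", on=KEYS, indicator=True)

# Rows that an inner merge on the same keys would discard.
dropped_by_inner = audit[audit["_merge"] != "both"]
print(dropped_by_inner[KEYS + ["_merge"]])
```

<p style="font-size:18px">Note that pandas matches null keys against each other when merging, which is why the country-level rows with a null `Province/State` line up.</p>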

<p style="font-size:18px"><br> 
    With the covid-19 data downloaded and almost prepared, there is only one last step to make all the data ready for my future analysis. The data we have so far lacks population information for each country. Thus, I will add a population dataset to my list of datasets; the one I will be using is <a href='https://www.kaggle.com/tanuprabhu/population-by-country-2020'>Population by Country - 2020</a> by <a href='https://www.kaggle.com/tanuprabhu'>Tanu N Prabhu</a>.
</p>

<p style="font-size:18px"><br> 
    To download the dataset, I will use the <a href='https://github.com/Kaggle/kaggle-api'>Kaggle API</a>, which makes it possible to interact with the Kaggle website from the command line or from Python. It can be used to download and upload datasets and to interact with competitions.
</p>
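<p style="font-size:18px">The Kaggle API needs credentials before authenticate() can succeed: either a kaggle.json token file in the `.kaggle` folder of the home directory (or in the folder named by KAGGLE_CONFIG_DIR), or the KAGGLE_USERNAME and KAGGLE_KEY environment variables. The helper below is a hypothetical pre-flight check, not part of the Kaggle package, and it only tests that credentials appear to be present.</p>

```python
import os
from pathlib import Path


def kaggle_credentials_available(env=None, home=None):
    """Return True if Kaggle credentials appear to be configured.

    env and home default to the real environment and home directory;
    they are parameters only so the check can be tried in isolation.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    # Option 1: credentials supplied through environment variables.
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True
    # Option 2: a kaggle.json token file in the configuration folder.
    config_dir = Path(env.get("KAGGLE_CONFIG_DIR", home / ".kaggle"))
    return (config_dir / "kaggle.json").is_file()
```

<p style="font-size:18px">Calling kaggle_credentials_available() before the download cell gives a clearer message than a failed authentication would.</p>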

In [35]:
# Download the population dataset from its source https://www.kaggle.com/tanuprabhu/population-by-country-2020
from kaggle.api.kaggle_api_extended import KaggleApi
api = KaggleApi()
api.authenticate()

api.dataset_download_files('tanuprabhu/population-by-country-2020', path = 'dataset',  unzip=True)
population_df = pd.read_csv('dataset/population_by_country_2020.csv')
population_df.head()

Unnamed: 0,Country (or dependency),Population (2020),Yearly Change,Net Change,Density (P/Km²),Land Area (Km²),Migrants (net),Fert. Rate,Med. Age,Urban Pop %,World Share
0,China,1440297825,0.39 %,5540090,153,9388211,-348399.0,1.7,38,61 %,18.47 %
1,India,1382345085,0.99 %,13586631,464,2973190,-532687.0,2.2,28,35 %,17.70 %
2,United States,331341050,0.59 %,1937734,36,9147420,954806.0,1.8,38,83 %,4.25 %
3,Indonesia,274021604,1.07 %,2898047,151,1811570,-98955.0,2.3,30,56 %,3.51 %
4,Pakistan,221612785,2.00 %,4327022,287,770880,-233379.0,3.6,23,35 %,2.83 %


<p style="font-size:18px"> After downloading the dataset, the first thing I would like to check is whether the country names match between the two datasets. </p>

In [36]:
countries_covid = master_df['Country/Region'].unique()
countries_pop = population_df['Country (or dependency)'].unique()
unmatched = [x for x in countries_covid if x not in countries_pop]
print(unmatched)
print(len(unmatched))
print([x for x in countries_pop if x not in countries_covid])

['Burma', 'Congo (Brazzaville)', 'Congo (Kinshasa)', "Cote d'Ivoire", 'Czechia', 'Diamond Princess', 'Korea, South', 'Kosovo', 'MS Zaandam', 'Saint Kitts and Nevis', 'Saint Vincent and the Grenadines', 'Sao Tome and Principe', 'Summer Olympics 2020', 'Taiwan*', 'US', 'West Bank and Gaza']
16
['United States', 'DR Congo', 'Myanmar', 'South Korea', "Côte d'Ivoire", 'North Korea', 'Taiwan', 'Czech Republic (Czechia)', 'Hong Kong', 'Turkmenistan', 'Congo', 'State of Palestine', 'Puerto Rico', 'Réunion', 'Macao', 'Western Sahara', 'Guadeloupe', 'Martinique', 'French Guiana', 'New Caledonia', 'French Polynesia', 'Mayotte', 'Sao Tome & Principe', 'Channel Islands', 'Guam', 'Curaçao', 'St. Vincent & Grenadines', 'Aruba', 'U.S. Virgin Islands', 'Isle of Man', 'Cayman Islands', 'Bermuda', 'Northern Mariana Islands', 'Greenland', 'American Samoa', 'Saint Kitts & Nevis', 'Faeroe Islands', 'Sint Maarten', 'Turks and Caicos', 'Saint Martin', 'Gibraltar', 'British Virgin Islands', 'Caribbean Netherla

<p style="font-size:18px"> There are 13 unmatched country names, plus two cruise ships (Diamond Princess and MS Zaandam) and the Summer Olympics 2020. Twelve of the country names can be fixed by simply replacing them directly in the dataframe. Kosovo is not renamed here, so I will leave it and the three non-country entries for later analysis. </p>


In [37]:
country_mapper = {
    'Congo (Brazzaville)': 'Congo',
    'Congo (Kinshasa)': 'DR Congo',
    "Cote d'Ivoire": "Côte d'Ivoire",
    'Czechia': 'Czech Republic (Czechia)',
    'Korea, South': 'South Korea',
    'Saint Vincent and the Grenadines': 'St. Vincent & Grenadines',
    'Taiwan*': 'Taiwan',
    'US': 'United States',
    'West Bank and Gaza': 'State of Palestine',
    'Saint Kitts and Nevis': 'Saint Kitts & Nevis',
    'Burma': 'Myanmar',
    'Sao Tome and Principe': 'Sao Tome & Principe'
}
master_df['Country/Region'] = master_df['Country/Region'].replace(country_mapper)

In [38]:
countries_covid = master_df['Country/Region'].unique()
[x for x in countries_covid if x not in countries_pop]


['Diamond Princess', 'Kosovo', 'MS Zaandam', 'Summer Olympics 2020']

> <p style="font-size:18px"> Now the only remaining mismatches in the country names are Kosovo and the three non-country entries.</p>
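<p style="font-size:18px">The same check can be written with sets, which also makes it easy to test a renaming dictionary before applying it to the dataframe. This is a minimal sketch; the function name and the short toy lists are hypothetical.</p>

```python
def unmatched_names(covid_names, population_names, mapper=None):
    """Return the covid names that have no counterpart after optional renaming."""
    mapper = mapper or {}
    renamed = {mapper.get(name, name) for name in covid_names}
    return sorted(renamed - set(population_names))


covid = ["US", "Burma", "Kosovo", "Diamond Princess", "Albania"]
population = ["United States", "Myanmar", "Albania"]
mapper = {"US": "United States", "Burma": "Myanmar"}

print(unmatched_names(covid, population))          # ['Burma', 'Diamond Princess', 'Kosovo', 'US']
print(unmatched_names(covid, population, mapper))  # ['Diamond Princess', 'Kosovo']
```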

## <a id='Exporting'>Part IV - Exporting the datasets</a>

<p style="font-size:18px"> The last remaining step is to export the master dataset, since it is almost ready to be uploaded to Tableau </p>

In [39]:
master_df.to_csv('dataset/COVID-19.csv')
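<p style="font-size:18px">By default, to_csv also writes the row index as an extra unnamed first column. If that column is not wanted in Tableau, it can be left out with index=False. The sketch below shows this on a toy frame written to a temporary folder and read back to confirm the columns; it is an optional variation, not what the cell above does.</p>

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    "Country/Region": ["Albania", "Algeria"],
    "date": ["1/22/20", "1/22/20"],
    "confirmed": [0, 0],
})

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "COVID-19.csv")
    toy.to_csv(path, index=False)  # index=False keeps the row index out of the file
    restored = pd.read_csv(path)   # read back to verify the round trip

columns_match = list(restored.columns) == list(toy.columns)
print(columns_match)  # True
```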