# In this notebook, we will analyze COVID-19 statistics per country using Python and Pandas.

#### This analysis is based on the third assignment of the course "Zero to Pandas" at jovian.ai. The data used in this notebook comes from two CSV files: https://gist.githubusercontent.com/aakashns/28b2e504b3350afd9bdb157893f9725c/raw/994b65665757f4f8887db1c85986a897abb23d84/countries.csv and https://gist.githubusercontent.com/aakashns/b2a968a6cfd9fbbb0ff3d6bd0f26262b/raw/b115ed1dfa17f10fc88bf966236cd4d9032f1df8/covid-countries-data.csv

#### You can also check out the course at https://jovian.ai/learn/data-analysis-with-Python-Zero-to-Pandas.

First, I imported the Pandas library as pd to have access to its set of built-in functions.

In [2]:
import pandas as pd

After that, I downloaded the data from its source using the urlretrieve function from the urllib.request module.

In [3]:
from urllib.request import urlretrieve

urlretrieve('https://gist.githubusercontent.com/aakashns/28b2e504b3350afd9bdb157893f9725c/raw/994b65665757f4f8887db1c85986a897abb23d84/countries.csv', 
            'countries.csv')

('countries.csv', <http.client.HTTPMessage at 0x21722d2bf10>)
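Re-running the cell above downloads the file again every time. A minimal sketch of one way to skip the download when the file is already on disk is shown here. The helper name `download_if_missing` is hypothetical and not part of the course material, and the demo uses a local `file://` URL so that it runs without network access.

```python
import tempfile
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url, filename):
    """Download url to filename unless the file already exists (hypothetical helper)."""
    path = Path(filename)
    if not path.exists():
        # urlretrieve returns a (local_filename, headers) tuple, ignored here
        urlretrieve(url, str(path))
    return path


# Demo with a local file:// URL so the sketch runs without network access.
workdir = Path(tempfile.mkdtemp())
source = workdir / 'source.csv'
source.write_text('location,continent\nAfghanistan,Asia\n')

target = download_if_missing(source.as_uri(), workdir / 'copy.csv')
print(target.read_text())
```

In the notebook, the same helper could be called with the gist URL and 'countries.csv' in place of the direct urlretrieve call.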

After downloading the CSV file, I loaded it into a data frame with the read_csv function and displayed it to get a first look at the table.

In [5]:
countries_df = pd.read_csv('countries.csv')
countries_df

Unnamed: 0,location,continent,population,life_expectancy,hospital_beds_per_thousand,gdp_per_capita
0,Afghanistan,Asia,38928341.0,64.83,0.50,1803.987
1,Albania,Europe,2877800.0,78.57,2.89,11803.431
2,Algeria,Africa,43851043.0,76.88,1.90,13913.839
3,Andorra,Europe,77265.0,83.73,,
4,Angola,Africa,32866268.0,61.15,,5819.495
...,...,...,...,...,...,...
205,Vietnam,Asia,97338583.0,75.40,2.60,6171.884
206,Western Sahara,Africa,597330.0,70.26,,
207,Yemen,Asia,29825968.0,66.12,0.70,1479.147
208,Zambia,Africa,18383956.0,63.89,2.00,3689.251


Then I sought to find the answer to the following questions:

1. What are the continents present in the data set?

    1.1 How many countries per continent are there?
    
    
2. What is the total population of the data set?

In [8]:
# 1. What are the continents present in the data set?
continents = countries_df['continent'].unique()
continents

print('{} are the continents in the data set.'.format(continents))

['Asia' 'Europe' 'Africa' 'North America' 'South America' 'Oceania'] are the continents in the data set.
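The printed message above embeds the raw array, brackets and quotes included. As a small sketch of a tidier variant, the snippet below joins the names and also counts them with `nunique`. The five-row `sample_df` is an illustrative subset of the rows shown earlier, not the full data set.

```python
import pandas as pd

# Illustrative subset of the countries table shown above
sample_df = pd.DataFrame({
    'location': ['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola'],
    'continent': ['Asia', 'Europe', 'Africa', 'Europe', 'Africa'],
})

# unique() keeps the order of first appearance; nunique() counts distinct values
continents = sample_df['continent'].unique()
continent_count = sample_df['continent'].nunique()

print('There are {} continents: {}'.format(continent_count, ', '.join(continents)))
# There are 3 continents: Asia, Europe, Africa
```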


In [6]:
# 1.1 How many countries per continent are there?
country_counts_df = countries_df.groupby('continent')['location'].count()
country_counts_df

continent
Africa           55
Asia             47
Europe           51
North America    36
Oceania           8
South America    13
Name: location, dtype: int64
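The same per-continent counts can also be obtained with `value_counts`, which skips the explicit `groupby`. The sketch below compares both approaches on an illustrative five-row subset of the table; `sample_df` is made up for the example.

```python
import pandas as pd

# Illustrative subset of the countries table shown above
sample_df = pd.DataFrame({
    'location': ['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola'],
    'continent': ['Asia', 'Europe', 'Africa', 'Europe', 'Africa'],
})

# groupby + count: number of non-missing locations per continent, sorted by continent name
by_group = sample_df.groupby('continent')['location'].count()

# value_counts: number of rows per continent, sorted by frequency
by_value = sample_df['continent'].value_counts()

print(by_group)
print(by_value)
```

Both give the same numbers as long as no location is missing; they differ only in how the result is ordered.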

In [7]:
# 2. What is the total population of the data set?

total_population = countries_df['population'].sum()
print('The total population is {}'.format(int(total_population)))

The total population is 7757980095


Then I added a new column called "gdp", computed as the product of population and GDP per capita. Rows with a missing value in either column (for example Andorra) get NaN in the new column.

In [8]:
countries_df['gdp'] = countries_df['population'] * countries_df['gdp_per_capita']
countries_df

Unnamed: 0,location,continent,population,life_expectancy,hospital_beds_per_thousand,gdp_per_capita,gdp
0,Afghanistan,Asia,38928341.0,64.83,0.50,1803.987,7.022622e+10
1,Albania,Europe,2877800.0,78.57,2.89,11803.431,3.396791e+10
2,Algeria,Africa,43851043.0,76.88,1.90,13913.839,6.101364e+11
3,Andorra,Europe,77265.0,83.73,,,
4,Angola,Africa,32866268.0,61.15,,5819.495,1.912651e+11
...,...,...,...,...,...,...,...
205,Vietnam,Asia,97338583.0,75.40,2.60,6171.884,6.007624e+11
206,Western Sahara,Africa,597330.0,70.26,,,
207,Yemen,Asia,29825968.0,66.12,0.70,1479.147,4.411699e+10
208,Zambia,Africa,18383956.0,63.89,2.00,3689.251,6.782303e+10
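To see why some rows end up without a GDP value, here is a small sketch using two rows from the table above. Multiplying by NaN gives NaN, so any country without a per capita GDP also lacks a total GDP. The name `sample_df` is only for illustration.

```python
import numpy as np
import pandas as pd

# Two rows taken from the table above; Andorra has no gdp_per_capita
sample_df = pd.DataFrame({
    'location': ['Albania', 'Andorra'],
    'population': [2877800.0, 77265.0],
    'gdp_per_capita': [11803.431, np.nan],
})

# Element-wise product; NaN in either operand propagates to the result
sample_df['gdp'] = sample_df['population'] * sample_df['gdp_per_capita']

# List the locations whose GDP could not be computed
missing_gdp = sample_df.loc[sample_df['gdp'].isna(), 'location'].tolist()
print(sample_df)
print(missing_gdp)
```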


Then I downloaded another CSV file that contains the overall COVID-19 statistics for different countries.

In [39]:
urlretrieve('https://gist.githubusercontent.com/aakashns/b2a968a6cfd9fbbb0ff3d6bd0f26262b/raw/b115ed1dfa17f10fc88bf966236cd4d9032f1df8/covid-countries-data.csv', 
            'covid-countries-data.csv')

('covid-countries-data.csv', <http.client.HTTPMessage at 0x21722f1d0a0>)

In [12]:
covid_data_df = pd.read_csv('covid-countries-data.csv')
covid_data_df

Unnamed: 0,location,total_cases,total_deaths,total_tests
0,Afghanistan,38243.0,1409.0,
1,Albania,9728.0,296.0,
2,Algeria,45158.0,1525.0,
3,Andorra,1199.0,53.0,
4,Angola,2729.0,109.0,
...,...,...,...,...
207,Western Sahara,766.0,1.0,
208,World,26059065.0,863535.0,
209,Yemen,1976.0,571.0,
210,Zambia,12415.0,292.0,


Then I noticed that there are numerous NaN values in the "total_tests" column, so the testing data is quite incomplete. I counted the missing values to see how many rows are affected.

In [20]:
total_tests_missing = covid_data_df['total_tests'].isna().sum()

In [21]:
print("The data for total tests is missing for {} countries.".format(int(total_tests_missing)))

The data for total tests is missing for 122 countries.
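Besides the raw count, the share of missing values can be useful. Because `isna()` returns booleans, `sum()` counts the missing rows and `mean()` gives their fraction. The sketch below uses three rows from the table above; `covid_sample_df` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Three rows from the COVID-19 table above; only Vietnam reports total_tests
covid_sample_df = pd.DataFrame({
    'location': ['Afghanistan', 'Albania', 'Vietnam'],
    'total_tests': [np.nan, np.nan, 261004.0],
})

missing_mask = covid_sample_df['total_tests'].isna()
missing_count = int(missing_mask.sum())   # True counts as 1
missing_share = missing_mask.mean()       # fraction of rows that are missing

print('{} of {} rows have no test data ({:.0%}).'.format(
    missing_count, len(covid_sample_df), missing_share))
```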


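Then, to bring all the information together, I merged the two data frames on the "location" column. By default, merge keeps only the locations present in both data frames, so the "World" row of the COVID-19 data does not appear in the result.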

In [24]:
combined_df = countries_df.merge(covid_data_df, on='location')
combined_df

Unnamed: 0,location,continent,population,life_expectancy,hospital_beds_per_thousand,gdp_per_capita,gdp,total_cases,total_deaths,total_tests
0,Afghanistan,Asia,38928341.0,64.83,0.50,1803.987,7.022622e+10,38243.0,1409.0,
1,Albania,Europe,2877800.0,78.57,2.89,11803.431,3.396791e+10,9728.0,296.0,
2,Algeria,Africa,43851043.0,76.88,1.90,13913.839,6.101364e+11,45158.0,1525.0,
3,Andorra,Europe,77265.0,83.73,,,,1199.0,53.0,
4,Angola,Africa,32866268.0,61.15,,5819.495,1.912651e+11,2729.0,109.0,
...,...,...,...,...,...,...,...,...,...,...
205,Vietnam,Asia,97338583.0,75.40,2.60,6171.884,6.007624e+11,1046.0,35.0,261004.0
206,Western Sahara,Africa,597330.0,70.26,,,,766.0,1.0,
207,Yemen,Asia,29825968.0,66.12,0.70,1479.147,4.411699e+10,1976.0,571.0,
208,Zambia,Africa,18383956.0,63.89,2.00,3689.251,6.782303e+10,12415.0,292.0,
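The merged table has 209 rows while the COVID-19 table has 211, because the default inner merge drops locations that have no match. One way to find out which rows were dropped is an outer merge with `indicator=True`, sketched below on a few rows from the tables above. The two sample data frames are illustrative subsets, not the full data.

```python
import pandas as pd

# Illustrative subsets of the two tables above
countries_sample_df = pd.DataFrame({
    'location': ['Yemen', 'Zambia'],
    'continent': ['Asia', 'Africa'],
})
covid_sample_df = pd.DataFrame({
    'location': ['World', 'Yemen', 'Zambia'],
    'total_cases': [26059065.0, 1976.0, 12415.0],
})

# Default merge is an inner join: only locations found in both tables survive
inner_df = countries_sample_df.merge(covid_sample_df, on='location')

# An outer join with indicator=True adds a "_merge" column that says
# whether each row came from the left table, the right table, or both
audit_df = countries_sample_df.merge(
    covid_sample_df, on='location', how='outer', indicator=True)
unmatched = audit_df.loc[audit_df['_merge'] != 'both', 'location'].tolist()

print(len(inner_df))
print(unmatched)
```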


After that, I added 3 columns to the new data frame: "tests_per_million," "cases_per_million," and "deaths_per_million." Each one divides a total by the population and multiplies by one million.

In [36]:
combined_df['tests_per_million'] = combined_df['total_tests'] * 1e6 / combined_df['population']
combined_df['cases_per_million'] = combined_df['total_cases'] * 1e6 / combined_df['population']
combined_df['deaths_per_million'] = combined_df['total_deaths'] * 1e6 / combined_df['population']

In [26]:
combined_df

Unnamed: 0,location,continent,population,life_expectancy,hospital_beds_per_thousand,gdp_per_capita,gdp,total_cases,total_deaths,total_tests,tests_per_million,cases_per_million
0,Afghanistan,Asia,38928341.0,64.83,0.50,1803.987,7.022622e+10,38243.0,1409.0,,,982.394806
1,Albania,Europe,2877800.0,78.57,2.89,11803.431,3.396791e+10,9728.0,296.0,,,3380.359997
2,Algeria,Africa,43851043.0,76.88,1.90,13913.839,6.101364e+11,45158.0,1525.0,,,1029.804468
3,Andorra,Europe,77265.0,83.73,,,,1199.0,53.0,,,15518.022390
4,Angola,Africa,32866268.0,61.15,,5819.495,1.912651e+11,2729.0,109.0,,,83.033462
...,...,...,...,...,...,...,...,...,...,...,...,...
205,Vietnam,Asia,97338583.0,75.40,2.60,6171.884,6.007624e+11,1046.0,35.0,261004.0,2681.403324,10.745996
206,Western Sahara,Africa,597330.0,70.26,,,,766.0,1.0,,,1282.373228
207,Yemen,Asia,29825968.0,66.12,0.70,1479.147,4.411699e+10,1976.0,571.0,,,66.250993
208,Zambia,Africa,18383956.0,63.89,2.00,3689.251,6.782303e+10,12415.0,292.0,,,675.317108
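As a worked check of the per-million formula, the sketch below recomputes the Vietnam row from the table above: 1046 cases and 261004 tests for a population of 97338583. The loop and the `vietnam_df` name are only one possible way to write it.

```python
import pandas as pd

# The Vietnam row from the table above
vietnam_df = pd.DataFrame({
    'location': ['Vietnam'],
    'population': [97338583.0],
    'total_cases': [1046.0],
    'total_tests': [261004.0],
})

# total_cases -> cases_per_million, total_tests -> tests_per_million
for column in ['total_cases', 'total_tests']:
    per_million = column.replace('total_', '') + '_per_million'
    vietnam_df[per_million] = vietnam_df[column] * 1e6 / vietnam_df['population']

print(vietnam_df[['cases_per_million', 'tests_per_million']].round(6))
```

The results agree with the values 10.745996 and 2681.403324 in the output above.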


Lastly, I sorted the combined data frame by each of the 3 new columns and took the top 10 countries for each one.

1. The first is the list of the top 10 countries with the highest number of tests per million people.

In [29]:
highest_tests_df = combined_df.sort_values('tests_per_million', ascending=False).head(10)
highest_tests_df

Unnamed: 0,location,continent,population,life_expectancy,hospital_beds_per_thousand,gdp_per_capita,gdp,total_cases,total_deaths,total_tests,tests_per_million,cases_per_million
197,United Arab Emirates,Asia,9890400.0,77.97,1.2,67293.483,665559500000.0,71540.0,387.0,7177430.0,725696.635121,7233.276713
14,Bahrain,Asia,1701583.0,77.29,2.0,43290.705,73662730000.0,52440.0,190.0,1118837.0,657527.137965,30818.36149
115,Luxembourg,Europe,625976.0,82.25,4.51,94277.965,59015740000.0,7928.0,124.0,385820.0,616349.508607,12665.022301
122,Malta,Europe,441539.0,82.53,4.485,36513.323,16122060000.0,1931.0,13.0,188539.0,427004.183096,4373.339614
53,Denmark,Europe,5792203.0,80.9,2.5,46682.515,270394600000.0,17195.0,626.0,2447911.0,422621.755488,2968.645954
96,Israel,Asia,8655541.0,82.97,2.99,33132.32,286778200000.0,122539.0,969.0,2353984.0,271962.665303,14157.289533
89,Iceland,Europe,341250.0,82.99,2.91,46482.958,15862310000.0,2121.0,10.0,88829.0,260304.761905,6215.384615
157,Russia,Europe,145934460.0,72.58,8.05,24765.954,3614206000000.0,1005000.0,17414.0,37176827.0,254750.159763,6886.653091
199,United States,North America,331002647.0,78.86,2.77,54225.446,17948770000000.0,6114406.0,185744.0,83898416.0,253467.507769,18472.377957
10,Australia,Oceania,25499881.0,83.44,3.84,44648.71,1138537000000.0,25923.0,663.0,6255797.0,245326.517406,1016.592979


This dataframe shows that the United Arab Emirates has the highest number of tests per million people, followed by Bahrain and Luxembourg.

2. The second is the list of the top 10 countries with the highest number of positive cases per million people.

In [30]:
highest_cases_df = combined_df.sort_values('cases_per_million', ascending=False).head(10)
highest_cases_df

Unnamed: 0,location,continent,population,life_expectancy,hospital_beds_per_thousand,gdp_per_capita,gdp,total_cases,total_deaths,total_tests,tests_per_million,cases_per_million
155,Qatar,Asia,2881060.0,80.23,1.2,116935.6,336898500000.0,119206.0,199.0,634745.0,220316.48074,41375.74365
14,Bahrain,Asia,1701583.0,77.29,2.0,43290.705,73662730000.0,52440.0,190.0,1118837.0,657527.137965,30818.36149
147,Panama,North America,4314768.0,78.51,2.3,22267.037,96077100000.0,94084.0,2030.0,336345.0,77952.04748,21805.112117
40,Chile,South America,19116209.0,80.18,2.11,22767.037,435219400000.0,414739.0,11344.0,2458762.0,128621.841287,21695.671982
162,San Marino,Europe,33938.0,84.97,3.8,56861.47,1929765000.0,735.0,42.0,,,21657.13949
9,Aruba,North America,106766.0,76.29,,35973.781,3840777000.0,2211.0,12.0,,,20708.839893
105,Kuwait,Asia,4270563.0,75.49,2.0,65530.537,279852300000.0,86478.0,535.0,621616.0,145558.325682,20249.789079
150,Peru,South America,32971846.0,76.74,1.6,12236.706,403466800000.0,663437.0,29259.0,584232.0,17719.117092,20121.318048
27,Brazil,South America,212559409.0,75.88,2.2,14103.452,2997821000000.0,3997865.0,123780.0,4797948.0,22572.268255,18808.224105
199,United States,North America,331002647.0,78.86,2.77,54225.446,17948770000000.0,6114406.0,185744.0,83898416.0,253467.507769,18472.377957


This dataframe shows that Qatar and Bahrain have the highest number of cases per million people, and both countries are in Asia.

3. The third dataframe shows the list of the top 10 countries with the highest number of deaths per million people.

In [37]:
highest_deaths_df = combined_df.sort_values('deaths_per_million', ascending=False).head(10)
highest_deaths_df

Unnamed: 0,location,continent,population,life_expectancy,hospital_beds_per_thousand,gdp_per_capita,gdp,total_cases,total_deaths,total_tests,tests_per_million,cases_per_million,deaths_per_million
162,San Marino,Europe,33938.0,84.97,3.8,56861.47,1929765000.0,735.0,42.0,,,21657.13949,1237.550828
150,Peru,South America,32971846.0,76.74,1.6,12236.706,403466800000.0,663437.0,29259.0,584232.0,17719.117092,20121.318048,887.393445
18,Belgium,Europe,11589616.0,81.63,5.64,42658.576,494396500000.0,85817.0,9898.0,2281853.0,196887.713967,7404.645676,854.040375
3,Andorra,Europe,77265.0,83.73,,,,1199.0,53.0,,,15518.02239,685.950948
177,Spain,Europe,46754783.0,83.56,2.97,34272.36,1602397000000.0,479554.0,29194.0,6416533.0,137238.001939,10256.790198,624.406705
198,United Kingdom,Europe,67886004.0,81.32,2.54,39753.244,2698689000000.0,338676.0,41514.0,13447568.0,198090.434075,4988.89285,611.525168
40,Chile,South America,19116209.0,80.18,2.11,22767.037,435219400000.0,414739.0,11344.0,2458762.0,128621.841287,21695.671982,593.4231
97,Italy,Europe,60461828.0,83.51,3.18,35220.084,2129471000000.0,271515.0,35497.0,5214766.0,86248.897403,4490.684602,587.097697
27,Brazil,South America,212559409.0,75.88,2.2,14103.452,2997821000000.0,3997865.0,123780.0,4797948.0,22572.268255,18808.224105,582.331314
182,Sweden,Europe,10099270.0,82.8,2.22,46949.283,474153500000.0,84532.0,5820.0,,,8370.109919,576.279276


This last dataframe shows that San Marino has the highest number of deaths per million people in the data set, followed by Peru, Belgium, and Andorra.
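The three rankings above all use `sort_values(...).head(10)`. As an alternative sketch, `nlargest` does the same in one call. The example below uses four tests_per_million values from the tables above and takes the top 2 instead of the top 10 to keep it short; San Marino has no test data, so it can never enter the ranking.

```python
import numpy as np
import pandas as pd

# Four tests_per_million values taken from the tables above
sample_df = pd.DataFrame({
    'location': ['Bahrain', 'Luxembourg', 'United Arab Emirates', 'San Marino'],
    'tests_per_million': [657527.137965, 616349.508607, 725696.635121, np.nan],
})

# Approach used in the notebook: sort descending (NaN goes last), keep the first rows
top_sorted = sample_df.sort_values('tests_per_million', ascending=False).head(2)

# Equivalent shortcut: the n rows with the largest values in the column
top_nlargest = sample_df.nlargest(2, 'tests_per_million')

print(top_sorted['location'].tolist())
print(top_nlargest['location'].tolist())
```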

## That's all for this Jupyter notebook. Thank you for checking out my project. Till we meet next time!

### Christian P. Ortiz

### December 25, 2022