<a href="https://colab.research.google.com/github/misterhay/Interesting-Problems/blob/master/covid-cases-per-capita.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# COVID-19 Cases Per Capita

This Jupyter notebook uses [COVID-19 statistics from Johns Hopkins CSSE](https://github.com/CSSEGISandData/COVID-19) and population statistics from [Gapminder](http://gapm.io/dpop).

First, import the data by running the next cell. You can change the date, but keep the `'MM-DD-YYYY'` format, because that is how the daily report files in the CSSE data set are named.

In [0]:
date = '03-30-2020'

import pandas as pd

csv_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/'+date+'.csv'
covid_stats = pd.read_csv(csv_url)
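# Rename only exact matches in the country column so the names agree with the Gapminder data
covid_stats['Country_Region'] = covid_stats['Country_Region'].replace('US', 'United States')
covid_stats['Country_Region'] = covid_stats['Country_Region'].replace('Korea, South', 'South Korea')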

pop_csv_url = 'https://docs.google.com/spreadsheets/d/18Ep3s1S0cvlT1ovQG9KdipLEoQ1Ktz5LtTTQpDcWbX0/export?gid=1668956939&format=csv'
pop_df = pd.read_csv(pop_csv_url)
current_population = pop_df[pop_df['time']==2019]

print('Data successfully imported')

Data successfully imported
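
If you would rather not type the date string by hand, the sketch below builds the same URL from a Python `date` object. The helper name `daily_report_url` is made up for this example; the base URL is the one used in the cell above, and the sketch assumes the daily report files keep the `MM-DD-YYYY.csv` naming.

```python
from datetime import date

# Same folder as csv_url in the import cell
BASE_URL = ('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/'
            'csse_covid_19_data/csse_covid_19_daily_reports/')

def daily_report_url(day):
    """Return the daily report URL for a datetime.date (hypothetical helper)."""
    # %m-%d-%Y gives zero-padded month and day, then a four-digit year
    return BASE_URL + day.strftime('%m-%d-%Y') + '.csv'

print(daily_report_url(date(2020, 3, 30)))
```

You could then pass the result to `pd.read_csv()` in place of `csv_url`.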


### Create a DataFrame

Run the next cell to create a [DataFrame](https://www.tutorialspoint.com/python_pandas/python_pandas_dataframe.htm) from the downloaded data.

In [0]:
# If you prefer specific countries, put a # in front of the next line and remove the six ' marks around the next list
country_list = current_population['name'].unique()
'''
country_list = ['Italy', 'Spain', 'Germany', 'France', 
                'Israel', 'United States', 'United Kingdom',
                'South Korea', 'Singapore', 'Australia',
                'Canada', 'China', 'Argentina', 'Russia', 'India']
'''

rows = []  # collect one dictionary per country, then build the DataFrame once

for country in country_list:
    # total confirmed cases across all rows (provinces, states) for this country
    confirmed = covid_stats[covid_stats['Country_Region']==country]['Confirmed'].sum()
    population = current_population[current_population['name']==country]['population'].values[0]
    percent = (confirmed/population)*100
    if percent != 0:
        rows.append({'Country':country,'Population':population,'Confirmed':confirmed,'Percent':percent})

# DataFrame.append was removed in pandas 2.0, so build the table from the list of rows
df = pd.DataFrame(rows, columns=['Country', 'Population', 'Confirmed', 'Percent'])

df.sort_values('Confirmed',ascending=False)

| | Country | Population | Confirmed | Percent |
|---|---|---|---|---|
| 153 | United States | 329093110 | 161807 | 0.049168 |
| 76 | Italy | 59216525 | 101739 | 0.171808 |
| 135 | Spain | 46441049 | 87956 | 0.189393 |
| 33 | China | 1420062022 | 82198 | 0.005788 |
| 57 | Germany | 82438639 | 66885 | 0.081133 |
| ... | ... | ... | ... | ... |
| 30 | Central African Republic | 4825711 | 3 | 0.000062 |
| 133 | Somalia | 15636171 | 3 | 0.000019 |
| 17 | Belize | 390231 | 3 | 0.000769 |
| 144 | Timor-Leste | 1352360 | 1 | 0.000074 |
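
The same per-country calculation can also be done without a loop. The following is only a sketch on two tiny made-up tables (`covid_sample` and `population_sample` are invented for illustration and reuse the column names from the cells above); it totals the cases with `groupby` and matches the names with `merge`.

```python
import pandas as pd

# Made-up numbers, only to show the shape of the calculation
covid_sample = pd.DataFrame({
    'Country_Region': ['Canada', 'Canada', 'Italy'],
    'Confirmed': [250, 250, 500],
})
population_sample = pd.DataFrame({
    'name': ['Canada', 'Italy', 'Tuvalu'],
    'population': [1000, 2000, 500],
})

# Add up the provincial rows so there is one total per country
totals = covid_sample.groupby('Country_Region', as_index=False)['Confirmed'].sum()

# An inner merge keeps only the names that appear in both tables
merged = population_sample.merge(totals, left_on='name',
                                 right_on='Country_Region', how='inner')
merged['Percent'] = merged['Confirmed'] / merged['population'] * 100

per_capita = merged[['name', 'population', 'Confirmed', 'Percent']]
print(per_capita)
```

One difference from the loop: the inner merge drops countries that are missing from the case data, but it keeps a country that is listed with zero confirmed cases, whereas the loop skips both.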


### Add World Data

In [0]:
world_population = current_population['population'].sum()
world_confirmed_cases = covid_stats['Confirmed'].sum()
world_percent = (world_confirmed_cases/world_population)*100
world_values = {'Country':'World','Population':world_population,'Confirmed':world_confirmed_cases,'Percent':world_percent}
# DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead
df = pd.concat([df, pd.DataFrame([world_values])], ignore_index=True)
df.sort_values('Confirmed',ascending=False).head(16)

| | Country | Population | Confirmed | Percent |
|---|---|---|---|---|
| 160 | World | 7705467692 | 782365 | 0.010153 |
| 153 | United States | 329093110 | 161807 | 0.049168 |
| 76 | Italy | 59216525 | 101739 | 0.171808 |
| 135 | Spain | 46441049 | 87956 | 0.189393 |
| 33 | China | 1420062022 | 82198 | 0.005788 |
| 57 | Germany | 82438639 | 66885 | 0.081133 |
| 53 | France | 65480710 | 45170 | 0.068982 |
| 72 | Iran | 82820766 | 41495 | 0.050102 |
| 152 | United Kingdom | 66959016 | 22453 | 0.033532 |
| 140 | Switzerland | 8608259 | 15922 | 0.184962 |


### Adding More Data

You can also edit and then run the next cell to add other data to the DataFrame. Each time you run it, the code adds another row, so running it twice with the same values creates a duplicate.

In [0]:
place = 'Edmonton'
population = 1461182
confirmed = 111
percent = (confirmed/population)*100
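# use this place's own numbers, not the world totals
new_row = {'Country':place,'Population':population,'Confirmed':confirmed,'Percent':percent}
# DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)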
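
Because every run of that cell adds a row, you may prefer a version that replaces an existing row with the same name. Here is a minimal sketch; `add_place` is a hypothetical helper, not part of pandas, and `sample` is a small stand-in for `df`.

```python
import pandas as pd

def add_place(table, place, population, confirmed):
    """Return a copy of table with one row for place, replacing any earlier one."""
    percent = (confirmed / population) * 100
    new_row = pd.DataFrame([{'Country': place, 'Population': population,
                             'Confirmed': confirmed, 'Percent': percent}])
    # Drop any earlier row for this place before adding the new one
    without_place = table[table['Country'] != place]
    return pd.concat([without_place, new_row], ignore_index=True)

# Small stand-in for df, using the Canada figures shown in this notebook
sample = pd.DataFrame([{'Country': 'Canada', 'Population': 37279811,
                        'Confirmed': 7398, 'Percent': 7398 / 37279811 * 100}])

sample = add_place(sample, 'Edmonton', 1461182, 111)
sample = add_place(sample, 'Edmonton', 1461182, 111)  # running again does not duplicate
print(sample)
```

With the real table you would call `df = add_place(df, 'Edmonton', 1461182, 111)`.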

### Sorting Data

You can also sort by percent.

In [0]:
df.sort_values('Percent',ascending=False).head(20)

| | Country | Population | Confirmed | Percent |
|---|---|---|---|---|
| 66 | Holy See | 799 | 6 | 0.750939 |
| 126 | San Marino | 33683 | 230 | 0.682837 |
| 3 | Andorra | 77072 | 370 | 0.480071 |
| 90 | Luxembourg | 596992 | 1988 | 0.333003 |
| 69 | Iceland | 340566 | 1086 | 0.318881 |
| 135 | Spain | 46441049 | 87956 | 0.189393 |
| 140 | Switzerland | 8608259 | 15922 | 0.184962 |
| 76 | Italy | 59216525 | 101739 | 0.171808 |
| 88 | Liechtenstein | 38404 | 62 | 0.161442 |
| 100 | Monaco | 39102 | 49 | 0.125313 |


### Specific Countries

To see a DataFrame of specific countries, edit and run the next cell.

In [0]:
#df[df['Country']=='Canada']
list_of_countries = ['Canada', 'China', 'Italy']
df[df['Country'].isin(list_of_countries)]

| | Country | Population | Confirmed | Percent |
|---|---|---|---|---|
| 29 | Canada | 37279811 | 7398 | 0.019845 |
| 33 | China | 1420062022 | 82198 | 0.005788 |
| 76 | Italy | 59216525 | 101739 | 0.171808 |


## Names of Countries

Unfortunately, these two data sets don't always use the same country/region names, which is why the first code cell calls `.replace()`. Also, not every country/region appears in both data sets. A country that is missing from the CSSE data ends up with zero confirmed cases, which is why we used `if percent != 0:` (if percent is not zero) to skip those rows when creating the DataFrame.

To see the country/region names, run each of the following cells.

In [0]:
covid_stats['Country_Region'].unique()

array(['United States', 'Canada', 'United Kingdom', 'China',
       'Netherlands', 'Australia', 'Denmark', 'France', 'Afghanistan',
       'Albania', 'Algeria', 'Andorra', 'Angola', 'Antigua and Barbuda',
       'Argentina', 'Armenia', 'Austria', 'Azerbaijan', 'Bahamas',
       'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium',
       'Belize', 'Benin', 'Bhutan', 'Bolivia', 'Bosnia and Herzegovina',
       'Botswana', 'Brazil', 'Brunei', 'Bulgaria', 'Burkina Faso',
       'Burma', 'Cabo Verde', 'Cambodia', 'Cameroon',
       'Central African Republic', 'Chad', 'Chile', 'Colombia',
       'Congo (Brazzaville)', 'Congo (Kinshasa)', 'Costa Rica',
       "Cote d'Ivoire", 'Croatia', 'Cuba', 'Cyprus', 'Czechia',
       'Diamond Princess', 'Djibouti', 'Dominica', 'Dominican Republic',
       'Ecuador', 'Egypt', 'El Salvador', 'Equatorial Guinea', 'Eritrea',
       'Estonia', 'Eswatini', 'Ethiopia', 'Fiji', 'Finland', 'Gabon',
       'Gambia', 'Georgia', 'Germany', 'Ghana', 'Greece', ...

In [0]:
current_population['name'].unique()

array(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
       'Antigua and Barbuda', 'Argentina', 'Armenia', 'Australia',
       'Austria', 'Azerbaijan', 'Bahamas', 'Bahrain', 'Bangladesh',
       'Barbados', 'Belarus', 'Belgium', 'Belize', 'Benin', 'Bhutan',
       'Bolivia', 'Bosnia and Herzegovina', 'Botswana', 'Brazil',
       'Brunei', 'Bulgaria', 'Burkina Faso', 'Burundi', 'Cambodia',
       'Cameroon', 'Canada', 'Cape Verde', 'Central African Republic',
       'Chad', 'Chile', 'China', 'Colombia', 'Comoros',
       'Congo, Dem. Rep.', 'Congo, Rep.', 'Costa Rica', "Cote d'Ivoire",
       'Croatia', 'Cuba', 'Cyprus', 'Czech Republic', 'Denmark',
       'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador', 'Egypt',
       'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia',
       'Ethiopia', 'Fiji', 'Finland', 'France', 'Gabon', 'Gambia',
       'Georgia', 'Germany', 'Ghana', 'Greece', 'Grenada', 'Guatemala',
       'Guinea', 'Guinea-Bissau', 'Guyana', 'Haiti', ...
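
Comparing two long lists by eye is tedious, so here is a sketch that uses Python sets to find the names that appear in only one list. The two small sets below are hand-picked from the outputs above for illustration; with the real data you could build them with `set(covid_stats['Country_Region'].unique())` and `set(current_population['name'].unique())`.

```python
# A few names copied from the two outputs above
covid_names = {'Canada', 'Czechia', 'Cabo Verde',
               'Congo (Kinshasa)', 'Diamond Princess'}
population_names = {'Canada', 'Czech Republic', 'Cape Verde',
                    'Congo, Dem. Rep.', 'Comoros'}

# Set difference keeps the names found in the first set but not in the second
only_in_covid = sorted(covid_names - population_names)
only_in_population = sorted(population_names - covid_names)

print('Only in the case data:', only_in_covid)
print('Only in the population data:', only_in_population)
```

Names that show up in these lists are candidates for another `.replace()` in the first code cell, although some (such as `'Diamond Princess'`, a cruise ship) have no match in the population data at all.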

**Hopefully that's an interesting introduction to data science using online COVID-19 data.**