pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language. pandas is well suited for many different kinds of data: 

* Tabular data with heterogeneously typed columns, as in an SQL table or Excel spreadsheet.
* Ordered and unordered (not necessarily fixed-frequency) time series data.
* Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels.
* Any other form of observational / statistical data sets. The data need not be labeled at all to be placed into a pandas data structure.

In [1]:
import pandas as pd

This tutorial uses the original Our World in Data (OWID) COVID-19 dataset. Pre-processing a dataset with SQL queries can simplify the analysis; in this tutorial, however, I process and visualize the data entirely with pandas.

In [2]:
filename = 'owid-covid-data.csv'
path = 'C:/Users/Matth/git/DataAnalysisWorkbooks/Covid19/Data/Raw_data/'

# pandas provides a function to read CSV files directly.
data = pd.read_csv(path + filename)
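
With a file this wide, a few `read_csv` options can be handy. The sketch below is only an illustration: it reads a tiny in-memory CSV with made-up values (the `csv_text` and `sample` names are hypothetical, not part of the tutorial's data) and uses `usecols` to load a subset of columns and `parse_dates` to convert `date` while reading.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: three rows of made-up values.
csv_text = """iso_code,location,date,total_cases,total_deaths
AAA,Exampleland,2020-03-01,10,1
AAA,Exampleland,2020-03-02,15,1
BBB,Samplestan,2020-03-01,7,
"""

# usecols keeps only the listed columns; parse_dates converts 'date' to datetime64.
sample = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["location", "date", "total_cases"],
    parse_dates=["date"],
)

print(sample.dtypes)
print(sample.shape)
```

The same keyword arguments can be passed when reading from a file path instead of a `StringIO` object.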

In [3]:
# Print a list of columns in the dataframe
data.columns

Index(['iso_code', 'continent', 'location', 'date', 'total_cases', 'new_cases',
       'new_cases_smoothed', 'total_deaths', 'new_deaths',
       'new_deaths_smoothed', 'total_cases_per_million',
       'new_cases_per_million', 'new_cases_smoothed_per_million',
       'total_deaths_per_million', 'new_deaths_per_million',
       'new_deaths_smoothed_per_million', 'reproduction_rate', 'icu_patients',
       'icu_patients_per_million', 'hosp_patients',
       'hosp_patients_per_million', 'weekly_icu_admissions',
       'weekly_icu_admissions_per_million', 'weekly_hosp_admissions',
       'weekly_hosp_admissions_per_million', 'new_tests', 'total_tests',
       'total_tests_per_thousand', 'new_tests_per_thousand',
       'new_tests_smoothed', 'new_tests_smoothed_per_thousand',
       'positive_rate', 'tests_per_case', 'tests_units', 'total_vaccinations',
       'people_vaccinated', 'people_fully_vaccinated', 'total_boosters',
       'new_vaccinations', 'new_vaccinations_smoothed',
       't

In [4]:
data.shape  # rows x columns
#data.shape[0] # N rows
#data.shape[1] # N columns

(154911, 67)

In [5]:
data.info()  # column names, data types, N entries, and more.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 154911 entries, 0 to 154910
Data columns (total 67 columns):
 #   Column                                      Non-Null Count   Dtype  
---  ------                                      --------------   -----  
 0   iso_code                                    154911 non-null  object 
 1   continent                                   145592 non-null  object 
 2   location                                    154911 non-null  object 
 3   date                                        154911 non-null  object 
 4   total_cases                                 152118 non-null  float64
 5   new_cases                                   152042 non-null  float64
 6   new_cases_smoothed                          150891 non-null  float64
 7   total_deaths                                134784 non-null  float64
 8   new_deaths                                  134951 non-null  float64
 9   new_deaths_smoothed                         134819 non-null  float64
 

In [6]:
# Print the first five rows of the dataframe
data.head()  # Or .tail() for the last five rows

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,...,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
0,AFG,Asia,Afghanistan,2020-02-24,5.0,5.0,,,,,...,,,37.746,0.5,64.83,0.511,,,,
1,AFG,Asia,Afghanistan,2020-02-25,5.0,0.0,,,,,...,,,37.746,0.5,64.83,0.511,,,,
2,AFG,Asia,Afghanistan,2020-02-26,5.0,0.0,,,,,...,,,37.746,0.5,64.83,0.511,,,,
3,AFG,Asia,Afghanistan,2020-02-27,5.0,0.0,,,,,...,,,37.746,0.5,64.83,0.511,,,,
4,AFG,Asia,Afghanistan,2020-02-28,5.0,0.0,,,,,...,,,37.746,0.5,64.83,0.511,,,,


In [7]:
# Select a specific row by integer position (iloc is short for 'integer location')
data.iloc[5]

# Select a specific row and column by integer position
data.iloc[5, 3]

# Read rows in a loop, stopping after the first three
for index, row in data.iterrows():
    if index == 3:
        break
    print(index, row)

0 iso_code                                           AFG
continent                                         Asia
location                                   Afghanistan
date                                        2020-02-24
total_cases                                        5.0
                                              ...     
human_development_index                          0.511
excess_mortality_cumulative_absolute               NaN
excess_mortality_cumulative                        NaN
excess_mortality                                   NaN
excess_mortality_cumulative_per_million            NaN
Name: 0, Length: 67, dtype: object
1 iso_code                                           AFG
continent                                         Asia
location                                   Afghanistan
date                                        2020-02-25
total_cases                                        5.0
                                              ...     
human_development_index   
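
`iloc` selects by integer position, while `loc` selects by index label. On the default `RangeIndex` the two coincide, which can hide the difference. The following is a minimal sketch on a small made-up frame (the `toy` name, its values, and its index labels are assumptions for illustration). It also shows `head(3)` with `itertuples()` as one way to loop over the first few rows without a `break`.

```python
import pandas as pd

toy = pd.DataFrame(
    {"location": ["A", "B", "C", "D"], "total_cases": [5.0, 8.0, 13.0, 21.0]},
    index=[10, 20, 30, 40],  # non-default labels so that loc and iloc differ
)

by_position = toy.iloc[1]  # second row, whatever its label
by_label = toy.loc[20]     # row whose index label is 20
cell = toy.iloc[1, 1]      # second row, second column

# Iterate over only the first three rows; itertuples yields namedtuples
# and is generally faster than iterrows.
seen = []
for row in toy.head(3).itertuples():
    seen.append((row.Index, row.location, row.total_cases))
print(seen)
```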

For more on pandas.read_csv(), see: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

<ins>sample of pd.Series methods:</ins>  
max() - maximum value  
argmax() - integer position of the maximum value (idxmax() returns its index label)  
abs() - absolute value of every element  
append() - concatenate two Series (deprecated in pandas 1.4 and removed in 2.0; use pd.concat() instead)  
count() - number of non-N/A values in the Series  
describe() - some useful statistics of the Series  
isna() - detect missing values

A full list of methods is available at: https://pandas.pydata.org/docs/reference/api/pandas.Series.html
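
As a quick illustration of the Series methods listed above, here is a small sketch on made-up values (the Series `s` and its labels are assumptions for the example). Two Series are joined with `pd.concat()`, since `Series.append()` is not available in pandas 2.0 and later.

```python
import pandas as pd

s = pd.Series([3.0, -7.0, None, 5.0], index=["a", "b", "c", "d"])

print(s.max())         # largest value, missing values skipped
print(s.argmax())      # integer position of the largest value
print(s.idxmax())      # index label of the largest value
print(s.abs().max())   # largest absolute value
print(s.count())       # number of non-missing values
print(s.isna().sum())  # number of missing values
print(s.describe())    # count, mean, std, min, quartiles, max

# Join two Series end to end.
joined = pd.concat([s, pd.Series([1.0], index=["e"])])
print(len(joined))
```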

<ins>sample of pd.DataFrame methods:</ins>  
head() - return the first n rows (default = 5)  
describe() - some useful statistics of the DataFrame  
isnull().sum() - number of missing values in each column  
df[column].plot() - line plot of the column (requires matplotlib)  
df.loc[df.index <= <row_name>].copy() - independent copy of the rows whose index label is at most <row_name>

A full list of methods is available at: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
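
The DataFrame methods listed above can be tried on a small frame. This is a sketch with made-up numbers and a date index (the `df` contents and the cutoff date are assumptions). Plotting is left out so that the example runs without matplotlib.

```python
import pandas as pd

df = pd.DataFrame(
    {"new_cases": [5.0, 0.0, None, 12.0], "new_deaths": [None, None, 1.0, 0.0]},
    index=pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03", "2020-03-04"]),
)

first_two = df.head(2)                  # first n rows
missing_per_column = df.isnull().sum()  # missing values in each column
summary = df.describe()                 # count, mean, std, min, quartiles, max

# Rows whose index label is on or before a cutoff, as an independent copy.
early = df.loc[df.index <= pd.Timestamp("2020-03-02")].copy()
early["new_cases"] = 0.0                # changes the copy only, not df

print(missing_per_column)
print(early)
```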

In [8]:
# Print statistical information of the dataframe using the .describe() method
data.describe()

Unnamed: 0,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,total_cases_per_million,new_cases_per_million,new_cases_smoothed_per_million,total_deaths_per_million,...,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
count,152118.0,152042.0,150891.0,134784.0,134951.0,134819.0,151409.0,151333.0,150187.0,134088.0,...,98913.0,97491.0,63887.0,115351.0,144587.0,127027.0,5234.0,5234.0,5234.0,5234.0
mean,2179361.0,9020.597,8824.114,53463.77,170.198768,169.629285,24026.684048,117.101736,112.917633,464.14848,...,10.624086,32.765025,50.808165,3.028368,73.583455,0.72582,34799.99,9.011418,16.175762,857.853331
std,12923920.0,59087.42,54682.22,280935.6,829.516656,809.990207,39281.323194,444.991939,304.941323,730.507166,...,10.553918,13.515979,31.812348,2.451936,7.496354,0.149982,99499.91,16.669961,31.09173,1283.239325
min,1.0,-74347.0,-6223.0,1.0,-1918.0,-232.143,0.001,-3125.829,-272.971,0.0,...,0.1,7.7,1.188,0.1,53.28,0.394,-31959.4,-28.45,-95.92,-1745.051271
25%,1619.0,1.0,6.0,69.0,0.0,0.143,525.443,0.025,1.465,16.1405,...,1.9,21.6,19.351,1.3,69.5,0.602,-97.525,-0.87,-0.52,-37.856745
50%,21415.0,71.0,95.0,692.0,2.0,2.286,4083.126,10.24,16.29,108.046,...,6.3,31.4,49.839,2.4,75.05,0.743,2580.05,5.505,7.205,401.789828
75%,260953.2,944.0,995.0,6568.0,19.0,20.143,31547.758,88.786,101.687,630.972,...,19.3,41.3,83.241,4.0,78.93,0.845,21479.95,13.8875,22.6875,1456.137413
max,326152800.0,3701643.0,2946256.0,5535426.0,18062.0,14704.857,386379.502,51427.491,9241.954,6093.182,...,44.0,78.1,100.0,13.8,86.75,0.957,1043824.0,115.0,374.34,7912.067517


In [9]:
# Drop a column (drop() returns a new dataframe; data itself is not modified)
data.drop(columns=['total_deaths']).head()

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,new_deaths,new_deaths_smoothed,total_cases_per_million,...,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
0,AFG,Asia,Afghanistan,2020-02-24,5.0,5.0,,,,0.126,...,,,37.746,0.5,64.83,0.511,,,,
1,AFG,Asia,Afghanistan,2020-02-25,5.0,0.0,,,,0.126,...,,,37.746,0.5,64.83,0.511,,,,
2,AFG,Asia,Afghanistan,2020-02-26,5.0,0.0,,,,0.126,...,,,37.746,0.5,64.83,0.511,,,,
3,AFG,Asia,Afghanistan,2020-02-27,5.0,0.0,,,,0.126,...,,,37.746,0.5,64.83,0.511,,,,
4,AFG,Asia,Afghanistan,2020-02-28,5.0,0.0,,,,0.126,...,,,37.746,0.5,64.83,0.511,,,,
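
To see that `drop()` leaves the original frame alone, here is a minimal sketch on a two-column frame with made-up values (the `original` and `trimmed` names are hypothetical).

```python
import pandas as pd

original = pd.DataFrame({"total_cases": [5.0, 8.0], "total_deaths": [1.0, 2.0]})

# drop() returns a new frame without the column.
trimmed = original.drop(columns=["total_deaths"])

print(list(original.columns))  # still has both columns
print(list(trimmed.columns))   # only the remaining column
```

To make the change stick, assign the result back to the same name, for example `original = original.drop(columns=["total_deaths"])`.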


In [10]:
# Combine multiple columns into a new column (appended as the last column; this modifies the dataframe in place)
data['caseDeathRatio'] = data['total_cases'] / data['total_deaths']
data.head()

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,...,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million,caseDeathRatio
0,AFG,Asia,Afghanistan,2020-02-24,5.0,5.0,,,,,...,,37.746,0.5,64.83,0.511,,,,,
1,AFG,Asia,Afghanistan,2020-02-25,5.0,0.0,,,,,...,,37.746,0.5,64.83,0.511,,,,,
2,AFG,Asia,Afghanistan,2020-02-26,5.0,0.0,,,,,...,,37.746,0.5,64.83,0.511,,,,,
3,AFG,Asia,Afghanistan,2020-02-27,5.0,0.0,,,,,...,,37.746,0.5,64.83,0.511,,,,,
4,AFG,Asia,Afghanistan,2020-02-28,5.0,0.0,,,,,...,,37.746,0.5,64.83,0.511,,,,,
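
The ratio is `NaN` in the rows shown because `total_deaths` is missing there. The sketch below, on made-up values, shows how missing and zero denominators behave. It uses `assign()`, which is a non-mutating alternative to the column assignment above, not the approach taken in this tutorial.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"total_cases": [5.0, 100.0, 40.0], "total_deaths": [np.nan, 4.0, 0.0]}
)

# assign() returns a new frame with the extra column and leaves toy unchanged.
with_ratio = toy.assign(caseDeathRatio=toy["total_cases"] / toy["total_deaths"])

# Dividing by a missing value gives NaN; dividing by zero gives inf.
# Optionally treat inf as missing as well.
cleaned = with_ratio.replace([np.inf, -np.inf], np.nan)

print(with_ratio)
print(cleaned)
```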


In [11]:
# Count missing (NaN) values in each column
data.isnull().sum()

iso_code                                        0
continent                                    9319
location                                        0
date                                            0
total_cases                                  2793
                                            ...  
excess_mortality_cumulative_absolute       149677
excess_mortality_cumulative                149677
excess_mortality                           149677
excess_mortality_cumulative_per_million    149677
caseDeathRatio                              20128
Length: 68, dtype: int64
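
Once the missing values have been counted, a common next step is to decide whether to drop or fill them. The following is a hedged sketch on a small made-up frame. Whether filling with a constant such as 0 is appropriate depends on what a missing value means in the real data, so treat that step as an assumption.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "location": ["A", "A", "B", "B"],
        "total_cases": [5.0, np.nan, 7.0, 9.0],
        "total_deaths": [np.nan, np.nan, np.nan, 1.0],
    }
)

missing_fraction = toy.isnull().mean()            # share of missing values per column
complete_rows = toy.dropna()                      # keep rows with no missing values
cases_known = toy.dropna(subset=["total_cases"])  # only require total_cases
filled = toy.fillna({"total_deaths": 0.0})        # fill one column with a constant

print(missing_fraction)
print(filled)
```

Each of these calls returns a new frame; `toy` itself is left unchanged.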

In [12]:
# Sort rows by date (sort_values() returns a new dataframe; data keeps its original order)
data.sort_values(['date'])

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,...,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million,caseDeathRatio
110985,PER,South America,Peru,2020-01-01,,,,,,,...,,,1.60,76.74,0.777,,,,,
91308,MEX,North America,Mexico,2020-01-01,,,,,,,...,21.4,87.847,1.38,75.05,0.779,,,,,
5462,ARG,South America,Argentina,2020-01-01,,,,,,,...,27.7,,5.00,76.67,0.845,,,,,
110986,PER,South America,Peru,2020-01-02,,,,,,,...,,,1.60,76.74,0.777,,,,,
5463,ARG,South America,Argentina,2020-01-02,,,,,,,...,27.7,,5.00,76.67,0.845,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23243,BFA,Africa,Burkina Faso,2022-01-15,20127.0,155.0,109.571,339.0,0.0,0.857,...,23.9,11.877,0.40,61.58,0.452,,,,,59.371681
110984,PRY,South America,Paraguay,2022-01-15,475686.0,,,16670.0,,,...,21.6,79.602,1.30,74.25,0.728,,,,,28.535453
111730,PER,South America,Peru,2022-01-15,2496542.0,22833.0,19693.857,203265.0,10.0,35.143,...,,,1.60,76.74,0.777,,,,,12.282203
108277,PLW,Oceania,Palau,2022-01-15,8.0,0.0,0.000,,,,...,22.7,,4.80,73.70,0.826,,,,,
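
In the `info()` output the `date` column has dtype `object`, meaning the dates are stored as strings. ISO-formatted strings happen to sort in date order, but converting them with `pd.to_datetime` makes date comparisons and arithmetic possible. A minimal sketch on made-up rows (the `toy` frame is an assumption), sorting by two keys with different directions:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "location": ["Peru", "Argentina", "Peru", "Argentina"],
        "date": ["2020-01-02", "2020-01-01", "2020-01-01", "2020-01-02"],
        "total_cases": [2.0, 1.0, 1.0, 3.0],
    }
)

# Convert the string dates to datetime64 so they sort and compare as dates.
toy["date"] = pd.to_datetime(toy["date"])

# Sort by location (A to Z), then by date with the newest first;
# reset_index(drop=True) renumbers the rows from 0.
ordered = toy.sort_values(
    ["location", "date"], ascending=[True, False]
).reset_index(drop=True)

print(ordered)
```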
