pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool,
built on top of the Python programming language. pandas is well suited for many different kinds of data: 

* Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
* Ordered and unordered (not necessarily fixed-frequency) time series data
* Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
* Any other form of observational / statistical data sets. The data need not be labeled at all to be placed into a pandas data structure

In [1]:
import pandas as pd

This tutorial uses the original OWID (Our World in Data) COVID-19 dataset. Pre-processing a dataset with SQL queries can simplify the analysis, but for this tutorial I will process and visualize the data entirely in pandas.

In [2]:
filename = 'owid-covid-data.csv'
path = 'C:/Users/Matth/git/DataAnalysisWorkbooks/Covid19/Data/Raw_data/'

# pd.read_csv() loads a CSV file directly into a DataFrame.
data = pd.read_csv(path + filename)
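
As a minimal sketch of what `pd.read_csv()` does, the example below reads a small in-memory CSV instead of the file on disk. The three-row excerpt and the names `csv_text` and `sample` are hypothetical and exist only for illustration; the column names are a subset of those in the OWID file. The `parse_dates` argument is optional and is shown here on the assumption that the `date` column is wanted as datetime values rather than strings.

```python
import io

import pandas as pd

# Hypothetical three-row excerpt that reuses a few of the dataset's column names.
csv_text = """iso_code,location,date,total_cases
AFG,Afghanistan,2020-02-24,5.0
AFG,Afghanistan,2020-02-25,5.0
AFG,Afghanistan,2020-02-26,5.0
"""

# read_csv accepts a file path or any file-like object.
# parse_dates converts the named column from strings to datetime64 values.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(sample.dtypes)
print(sample.shape)
```

Without `parse_dates`, the `date` column would be loaded as plain strings (object dtype), which still works for display but not for date arithmetic.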

In [3]:
# Print a list of columns in the dataframe
data.columns

Index(['iso_code', 'continent', 'location', 'date', 'total_cases', 'new_cases',
       'new_cases_smoothed', 'total_deaths', 'new_deaths',
       'new_deaths_smoothed', 'total_cases_per_million',
       'new_cases_per_million', 'new_cases_smoothed_per_million',
       'total_deaths_per_million', 'new_deaths_per_million',
       'new_deaths_smoothed_per_million', 'reproduction_rate', 'icu_patients',
       'icu_patients_per_million', 'hosp_patients',
       'hosp_patients_per_million', 'weekly_icu_admissions',
       'weekly_icu_admissions_per_million', 'weekly_hosp_admissions',
       'weekly_hosp_admissions_per_million', 'new_tests', 'total_tests',
       'total_tests_per_thousand', 'new_tests_per_thousand',
       'new_tests_smoothed', 'new_tests_smoothed_per_thousand',
       'positive_rate', 'tests_per_case', 'tests_units', 'total_vaccinations',
       'people_vaccinated', 'people_fully_vaccinated', 'total_boosters',
       'new_vaccinations', 'new_vaccinations_smoothed',
       't

In [4]:
# Print the first five rows of the dataframe
data.head()  # Or .tail() for the last five rows

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,...,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
0,AFG,Asia,Afghanistan,2020-02-24,5.0,5.0,,,,,...,,,37.746,0.5,64.83,0.511,,,,
1,AFG,Asia,Afghanistan,2020-02-25,5.0,0.0,,,,,...,,,37.746,0.5,64.83,0.511,,,,
2,AFG,Asia,Afghanistan,2020-02-26,5.0,0.0,,,,,...,,,37.746,0.5,64.83,0.511,,,,
3,AFG,Asia,Afghanistan,2020-02-27,5.0,0.0,,,,,...,,,37.746,0.5,64.83,0.511,,,,
4,AFG,Asia,Afghanistan,2020-02-28,5.0,0.0,,,,,...,,,37.746,0.5,64.83,0.511,,,,


For more on pandas.read_csv(), see: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

<ins>Sample of pd.Series methods:</ins>  
max() - maximum value  
argmax() - integer position of the maximum value (idxmax() returns its index label)  
abs() - absolute value of every element  
append() - concatenate two Series (removed in pandas 2.0; use pd.concat() instead)  
count() - number of non-N/A values in the Series  
describe() - summary statistics of the Series  
isna() - detect missing values

A full list of methods is available at: https://pandas.pydata.org/docs/reference/api/pandas.Series.html
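
The Series methods listed above can be tried on a small example. This is only a sketch: the Series `s`, its values, and its index labels are made up for illustration and do not come from the dataset. It also uses `pd.concat()` in place of `append()`, since `Series.append()` is not available in pandas 2.0 and later.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with one missing value and string index labels.
s = pd.Series([3.0, -7.0, np.nan, 5.0], index=["a", "b", "c", "d"])

largest = s.max()            # maximum value, missing values are skipped
position = s.argmax()        # integer position of the maximum
label = s.idxmax()           # index label of the maximum
magnitudes = s.abs()         # element-wise absolute value
n_present = s.count()        # number of non-missing values
n_missing = s.isna().sum()   # number of missing values

# pd.concat() joins two Series end to end.
combined = pd.concat([s, pd.Series([1.0], index=["e"])])

print(largest, position, label, n_present, n_missing)
print(s.describe())
print(combined)
```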

<ins>Sample of pd.DataFrame methods:</ins>  
head() - return the first n rows (default = 5)  
describe() - summary statistics of the numeric columns  
isnull().sum() - number of missing values in each column  
df[column].plot() - line plot of the column (the default plot kind)  
loc[df.index <= <row_name>].copy() - independent copy of the rows whose index label is less than or equal to <row_name>

A full list of methods is available at: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
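
The `loc[...].copy()` pattern from the list above is sketched below on a toy DataFrame. The frame `df`, the `cutoff` date, and the name `early` are hypothetical; the example assumes a datetime index, which is one common case where comparing `df.index` against a label is useful.

```python
import pandas as pd

# Hypothetical frame indexed by date.
df = pd.DataFrame(
    {"new_cases": [5.0, 0.0, 0.0, 0.0]},
    index=pd.to_datetime(
        ["2020-02-24", "2020-02-25", "2020-02-26", "2020-02-27"]
    ),
)

cutoff = pd.Timestamp("2020-02-25")

# The boolean mask keeps rows up to and including the cutoff.
# .copy() makes the result independent of the original frame.
early = df.loc[df.index <= cutoff].copy()

# Changing the copy leaves df untouched.
early["new_cases"] = 0.0

print(early)
print(df)
```

Taking an explicit copy is a design choice: it makes it clear that later edits are meant for the subset only.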

In [5]:
# Print statistical information of the dataframe using the .describe() method
data.describe()

Unnamed: 0,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,total_cases_per_million,new_cases_per_million,new_cases_smoothed_per_million,total_deaths_per_million,...,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
count,150066.0,150016.0,148865.0,132849.0,133027.0,132895.0,149366.0,149316.0,148170.0,132162.0,...,97485.0,96081.0,63032.0,113707.0,142523.0,125334.0,5234.0,5234.0,5234.0,5234.0
mean,2129889.0,8396.613,8258.887,52687.4,170.658588,170.173845,23340.273531,105.798327,102.640578,456.016508,...,10.584668,32.758078,50.815453,3.026723,73.596327,0.725873,34799.99,9.011418,16.175762,857.853331
std,12621620.0,49814.58,46361.92,277070.2,831.686247,812.521382,38060.066126,364.490261,243.501643,720.868866,...,10.498829,13.521854,31.813518,2.452859,7.48983,0.149989,99499.91,16.669961,31.09173,1283.239325
min,1.0,-74347.0,-6223.0,1.0,-1918.0,-232.143,0.001,-3125.829,-272.971,0.0,...,0.1,7.7,1.188,0.1,53.28,0.394,-31959.4,-28.45,-95.92,-1745.051271
25%,1568.25,1.0,5.714,68.0,0.0,0.143,510.9505,0.022,1.44,15.823,...,1.9,21.6,19.351,1.3,69.5,0.602,-97.525,-0.87,-0.52,-37.856745
50%,20580.5,69.0,91.429,673.0,2.0,2.286,4010.154,10.018,15.843,105.32,...,6.3,31.4,49.839,2.4,75.05,0.743,2580.05,5.505,7.205,401.789828
75%,255435.8,920.0,972.143,6446.0,19.0,20.286,30549.622,86.32625,98.924,620.855,...,19.1,41.3,83.241,4.0,78.93,0.845,21479.95,13.8875,22.6875,1456.137413
max,300290300.0,2540890.0,1964839.0,5472566.0,18062.0,14704.714,326925.563,51427.491,7406.207,6083.26,...,44.0,78.1,100.0,13.8,86.75,0.957,1043824.0,115.0,374.34,7912.067517


In [6]:
# Drop a column. drop() returns a new DataFrame, so `data` itself is unchanged.
data.drop(['total_deaths'], axis=1).head()

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,new_deaths,new_deaths_smoothed,total_cases_per_million,...,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million
0,AFG,Asia,Afghanistan,2020-02-24,5.0,5.0,,,,0.126,...,,,37.746,0.5,64.83,0.511,,,,
1,AFG,Asia,Afghanistan,2020-02-25,5.0,0.0,,,,0.126,...,,,37.746,0.5,64.83,0.511,,,,
2,AFG,Asia,Afghanistan,2020-02-26,5.0,0.0,,,,0.126,...,,,37.746,0.5,64.83,0.511,,,,
3,AFG,Asia,Afghanistan,2020-02-27,5.0,0.0,,,,0.126,...,,,37.746,0.5,64.83,0.511,,,,
4,AFG,Asia,Afghanistan,2020-02-28,5.0,0.0,,,,0.126,...,,,37.746,0.5,64.83,0.511,,,,
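
To illustrate why the drop above is not permanent, here is a small sketch on a hypothetical two-column frame (`df` and `trimmed` are illustrative names, and the values are made up). `drop()` returns a new DataFrame, so the result has to be assigned to a name if it is to be kept.

```python
import pandas as pd

# Hypothetical frame with two of the dataset's column names.
df = pd.DataFrame({"total_cases": [5.0, 5.0], "total_deaths": [1.0, 1.0]})

# drop() returns a new frame without the column; df is not modified.
trimmed = df.drop(columns=["total_deaths"])

print(list(df.columns))       # still has both columns
print(list(trimmed.columns))  # only total_cases remains
```

`drop(columns=[...])` is equivalent to `drop([...], axis=1)`; assigning the result back to the same name (`df = df.drop(...)`) is one way to make the change stick.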


In [7]:
# Compute a new column from existing columns. The new column is appended as the
# last column, and the assignment modifies `data` itself.
data['caseDeathRatio'] = data['total_cases'] / data['total_deaths']
data.head()

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,...,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index,excess_mortality_cumulative_absolute,excess_mortality_cumulative,excess_mortality,excess_mortality_cumulative_per_million,caseDeathRatio
0,AFG,Asia,Afghanistan,2020-02-24,5.0,5.0,,,,,...,,37.746,0.5,64.83,0.511,,,,,
1,AFG,Asia,Afghanistan,2020-02-25,5.0,0.0,,,,,...,,37.746,0.5,64.83,0.511,,,,,
2,AFG,Asia,Afghanistan,2020-02-26,5.0,0.0,,,,,...,,37.746,0.5,64.83,0.511,,,,,
3,AFG,Asia,Afghanistan,2020-02-27,5.0,0.0,,,,,...,,37.746,0.5,64.83,0.511,,,,,
4,AFG,Asia,Afghanistan,2020-02-28,5.0,0.0,,,,,...,,37.746,0.5,64.83,0.511,,,,,
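
The `caseDeathRatio` values above are empty because `total_deaths` is missing in those rows: arithmetic between columns propagates missing values. The sketch below shows this on a hypothetical three-row frame (the values are made up), and also shows one possible way to handle a zero denominator, which would otherwise produce `inf`. Replacing `inf` with `NaN` is my assumption about what is wanted, not something the dataset requires.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: a missing denominator, a normal one, and a zero.
df = pd.DataFrame(
    {
        "total_cases": [5.0, 100.0, 40.0],
        "total_deaths": [np.nan, 4.0, 0.0],
    }
)

# Element-wise division: NaN propagates, and dividing by zero yields inf.
df["caseDeathRatio"] = df["total_cases"] / df["total_deaths"]
raw_ratio = df["caseDeathRatio"].copy()

# Optional clean-up: treat infinite ratios as missing.
df["caseDeathRatio"] = df["caseDeathRatio"].replace([np.inf, -np.inf], np.nan)

print(raw_ratio)
print(df)
```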


In [8]:
# Count the missing (NaN) values in each column
data.isnull().sum()

iso_code                                        0
continent                                    9202
location                                        0
date                                            0
total_cases                                  2629
                                            ...  
excess_mortality_cumulative_absolute       147461
excess_mortality_cumulative                147461
excess_mortality                           147461
excess_mortality_cumulative_per_million    147461
caseDeathRatio                              19847
Length: 68, dtype: int64
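
Raw counts like those above can be hard to compare, so a possible follow-up is to express them as a fraction of all rows. The sketch below does this on a hypothetical four-row frame (the values and the name `missing_fraction` are illustrative). It relies on the fact that the mean of a boolean column is the share of `True` values.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with different amounts of missing data per column.
df = pd.DataFrame(
    {
        "location": ["Afghanistan"] * 4,
        "total_cases": [5.0, 5.0, np.nan, 5.0],
        "excess_mortality": [np.nan, np.nan, np.nan, 1.0],
    }
)

# isna() gives a boolean frame; mean() of booleans is the fraction of True.
missing_fraction = df.isna().mean().sort_values(ascending=False)

print(missing_fraction)
```

Sorting in descending order puts the sparsest columns first, which can help when deciding which columns are too incomplete to analyze.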