# Libraries and loading the dataset
This first example notebook shows how to load the datasets and how to do some basic manipulation of the data. To run
the code in this notebook you will need to install, besides Jupyter, the
[Pandas](https://pandas.pydata.org/docs/getting_started/index.html#installation) package.

For more info on how to set up the Python environment for these notebooks, see README.md.

## Getting an overview of the data
Whenever you get a dataset, you will want to inspect it to see which variables (columns) it has and what their values
look like. One way of doing this is with [Pandas](https://pandas.pydata.org/docs/getting_started/overview.html),
a package that allows for easy manipulation of data tables (or dataframes, as they are called in Pandas).

For a quick overview of Pandas' capabilities, see
[10 minutes to pandas](https://pandas.pydata.org/docs/user_guide/10min.html#min).

In [106]:
import pandas as pd
owid_data = pd.read_csv(r"../data/owid-covid-data.csv")
cgrt_data = pd.read_csv(r"../data/OxCGRT_latest_combined.csv", low_memory=False)

cgrt_data.head()

Unnamed: 0,CountryName,CountryCode,RegionName,RegionCode,Jurisdiction,Date,C1_combined_numeric,C1_combined,C2_combined_numeric,C2_combined,...,StringencyIndex,StringencyIndexForDisplay,StringencyLegacyIndex,StringencyLegacyIndexForDisplay,GovernmentResponseIndex,GovernmentResponseIndexForDisplay,ContainmentHealthIndex,ContainmentHealthIndexForDisplay,EconomicSupportIndex,EconomicSupportIndexForDisplay
0,Aruba,ABW,,,NAT_TOTAL,20200101,0.0,0,0.0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Aruba,ABW,,,NAT_TOTAL,20200102,0.0,0,0.0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Aruba,ABW,,,NAT_TOTAL,20200103,0.0,0,0.0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Aruba,ABW,,,NAT_TOTAL,20200104,0.0,0,0.0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Aruba,ABW,,,NAT_TOTAL,20200105,0.0,0,0.0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


The table above shows the first five rows of `cgrt_data`. For a per-column summary of a dataframe, call its `info()`
method (e.g. `owid_data.info()`): it lists the index of each column (using standard 0-indexing), the column name, the
number of non-null values and the data type. For example, at index 0 of the OWID data we find `iso_code`, a shorthand
code for countries (see [here](https://www.iso.org/obp/ui/#search/code/)).

That summary reports 83862 non-null values for that column, meaning 83862 non-empty cells. Since dataframes can be
sparsely populated, some rows have empty values in some columns (e.g. a row can miss the value for `new_tests` but
still have other values).

Lastly, the summary gives the data type of each column. Most are either floating point numbers (`float64`) or
`object`, the type Pandas uses for strings and for columns that mix several types.
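
As a minimal sketch of this kind of inspection, the block below builds a tiny made-up dataframe (the column names mirror the OWID data, but the values are invented for illustration) and prints the same kind of per-column summary, plus the non-null counts and data types on their own.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "iso_code": ["ARG", "ARG", "MEX"],
        "new_cases": [np.nan, 5.0, 7.0],
        "new_tests": [np.nan, np.nan, 120.0],
    }
)

# Prints the index, name, non-null count and dtype of every column.
toy.info()

# The same information, available as objects you can work with.
non_null_counts = toy.notna().sum()  # non-empty cells per column
column_types = toy.dtypes            # data type per column
print(non_null_counts)
print(column_types)
```

Running the same calls on `owid_data` or `cgrt_data` should show which columns are sparsely populated before you start an analysis.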

Note that this example loads only two of the datasets. We have provided several for you to analyse. Please check
the `data` directory for more data files.

## Fixing the date column
The `date` column is not of the `datetime` type, a commonly used format for timestamps in Python. This column is useful
as the x-axis for various plots and analyses, but most of those techniques assume that the timestamps are in the
aforementioned `datetime` format.

This means that if we want to use the `date` column effectively, we have to convert its data type to `datetime`. The
following code block shows how to do that. We recommend doing this for your own analysis too!

In [109]:
# Parse the date columns into datetime; the two sources store dates in different formats.
owid_data['dateFixed'] = pd.to_datetime(owid_data['date'], format='%Y-%m-%d')
cgrt_data['dateFixed'] = pd.to_datetime(cgrt_data['Date'], format='%Y%m%d')
cgrt_data = cgrt_data.rename(columns={"CountryCode":"iso_code"})
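
To check that a conversion like the one above worked, you can inspect the resulting data type. The sketch below uses two tiny made-up frames that mimic the two date formats (the `_toy` names and values are illustrative assumptions, not part of the datasets).

```python
import pandas as pd

# Toy data mimicking the two date formats; values are illustrative only.
owid_toy = pd.DataFrame({"date": ["2020-01-01", "2020-01-02"]})
cgrt_toy = pd.DataFrame({"Date": [20200101, 20200102]})

owid_toy["dateFixed"] = pd.to_datetime(owid_toy["date"], format="%Y-%m-%d")
# Casting to str first makes the %Y%m%d parsing explicit for integer dates.
cgrt_toy["dateFixed"] = pd.to_datetime(cgrt_toy["Date"].astype(str), format="%Y%m%d")

# Both columns should now hold datetime values that compare equal.
both_datetime = (
    pd.api.types.is_datetime64_any_dtype(owid_toy["dateFixed"])
    and pd.api.types.is_datetime64_any_dtype(cgrt_toy["dateFixed"])
)
same_dates = bool((owid_toy["dateFixed"] == cgrt_toy["dateFixed"]).all())
print(both_datetime, same_dates)
```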




Your dataset is now ready for your analysis! You may want to sort your dataset by date so that changes over time are
easier to see. Indexed by country code and date and then sorted, the data looks like this:

CountryCode,dateFixed,CountryName,RegionName,RegionCode,Jurisdiction,Date,C1_combined_numeric,C1_combined,C2_combined_numeric,C2_combined,C3_combined_numeric,...,StringencyIndex,StringencyIndexForDisplay,StringencyLegacyIndex,StringencyLegacyIndexForDisplay,GovernmentResponseIndex,GovernmentResponseIndexForDisplay,ContainmentHealthIndex,ContainmentHealthIndexForDisplay,EconomicSupportIndex,EconomicSupportIndexForDisplay
ABW,2020-01-01,Aruba,,,NAT_TOTAL,20200101,0.0,0,0.0,0,0.0,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.0,0.0
ABW,2020-01-02,Aruba,,,NAT_TOTAL,20200102,0.0,0,0.0,0,0.0,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.0,0.0
ABW,2020-01-03,Aruba,,,NAT_TOTAL,20200103,0.0,0,0.0,0,0.0,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.0,0.0
ABW,2020-01-04,Aruba,,,NAT_TOTAL,20200104,0.0,0,0.0,0,0.0,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.0,0.0
ABW,2020-01-05,Aruba,,,NAT_TOTAL,20200105,0.0,0,0.0,0,0.0,...,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZWE,2021-04-21,Zimbabwe,,,NAT_TOTAL,20210421,1.0,1G,2.0,2G,2.0,...,51.85,51.85,60.71,60.71,49.27,49.27,56.31,56.31,0.0,0.0
ZWE,2021-04-22,Zimbabwe,,,NAT_TOTAL,20210422,1.0,1G,2.0,2G,2.0,...,51.85,51.85,60.71,60.71,49.27,49.27,56.31,56.31,0.0,0.0
ZWE,2021-04-23,Zimbabwe,,,NAT_TOTAL,20210423,1.0,1G,2.0,2G,2.0,...,51.85,51.85,60.71,60.71,49.27,49.27,56.31,56.31,0.0,0.0
ZWE,2021-04-24,Zimbabwe,,,NAT_TOTAL,20210424,1.0,1G,2.0,2G,2.0,...,51.85,51.85,60.71,60.71,49.27,49.27,56.31,56.31,0.0,0.0
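
A table indexed and sorted like the one above could be produced along the following lines. This is a sketch on a small made-up frame, not necessarily the cell that generated the output; only the column names `CountryCode`, `dateFixed` and `StringencyIndex` are taken from the data.

```python
import pandas as pd

# Toy rows in shuffled order; values are illustrative only.
toy = pd.DataFrame(
    {
        "CountryCode": ["ZWE", "ABW", "ZWE", "ABW"],
        "dateFixed": pd.to_datetime(
            ["2020-01-02", "2020-01-02", "2020-01-01", "2020-01-01"]
        ),
        "StringencyIndex": [10.0, 0.0, 5.0, 0.0],
    }
)

# Index by country and date, then sort on that index.
by_country_date = toy.set_index(["CountryCode", "dateFixed"]).sort_index()

# Alternatively, keep the columns as they are and sort purely by date.
by_date = toy.sort_values("dateFixed", kind="stable")
print(by_country_date)
```

Note that `set_index` raises a `KeyError` if one of the named columns does not exist, so check the column names (for instance after a rename) before indexing.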


In [98]:
full_data = owid_data.merge(cgrt_data, left_index=True, right_index=True)
full_data.head()

dateFixed,iso_code,CountryCode,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,total_cases_per_million,...,StringencyIndex,StringencyIndexForDisplay,StringencyLegacyIndex,StringencyLegacyIndexForDisplay,GovernmentResponseIndex,GovernmentResponseIndexForDisplay,ContainmentHealthIndex,ContainmentHealthIndexForDisplay,EconomicSupportIndex,EconomicSupportIndexForDisplay
2020-01-01,ARG,ABW,South America,Argentina,2020-01-01,,,,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-01,ARG,AFG,South America,Argentina,2020-01-01,,,,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-01,ARG,AGO,South America,Argentina,2020-01-01,,,,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-01,ARG,ALB,South America,Argentina,2020-01-01,,,,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2020-01-01,ARG,AND,South America,Argentina,2020-01-01,,,,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
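
In the output above, the rows for `ARG` are paired with `ABW`, `AFG`, `AGO` and so on, which suggests the join matched on the date alone rather than on country and date together. A minimal sketch of joining on both keys is shown below; it uses toy frames with invented values and assumes both frames carry `iso_code` and `dateFixed` as ordinary columns.

```python
import pandas as pd

# Toy frames standing in for owid_data and cgrt_data; values are illustrative only.
owid_toy = pd.DataFrame(
    {
        "iso_code": ["ARG", "ARG", "MEX"],
        "dateFixed": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-01"]),
        "new_cases": [1.0, 2.0, 3.0],
    }
)
cgrt_toy = pd.DataFrame(
    {
        "iso_code": ["ARG", "ARG", "ABW"],
        "dateFixed": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-01"]),
        "StringencyIndex": [0.0, 11.1, 0.0],
    }
)

# Join on country AND date; validate raises if a key pair is duplicated on either side.
merged = owid_toy.merge(
    cgrt_toy,
    on=["iso_code", "dateFixed"],
    how="inner",
    validate="one_to_one",
)
print(merged)
```

If the real data has more than one row per country and date (for example for different `Jurisdiction` values), `validate` will raise an error and the duplicates need to be filtered out first.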


In [99]:
full_data.to_csv("../data/full_data.csv")




## Taking a look at the dataframe
The next notebook deals in more depth with data visualisation and exploring the dataset, but for now you can
view (part of) your dataframe by using the following commands:

In [5]:
# Show first N (e.g. 10) rows of the dataframe
data.head(10)

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,...,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index
3009,ARG,South America,Argentina,2020-01-01,,,,,,,...,18933.907,0.6,191.032,5.5,16.2,27.7,,5.0,76.67,0.845
48722,MEX,North America,Mexico,2020-01-01,,,,,,,...,17336.469,2.5,152.783,13.06,6.9,21.4,87.847,1.38,75.05,0.779
48723,MEX,North America,Mexico,2020-01-02,,,,,,,...,17336.469,2.5,152.783,13.06,6.9,21.4,87.847,1.38,75.05,0.779
3010,ARG,South America,Argentina,2020-01-02,,,,,,,...,18933.907,0.6,191.032,5.5,16.2,27.7,,5.0,76.67,0.845
3011,ARG,South America,Argentina,2020-01-03,,,,,,,...,18933.907,0.6,191.032,5.5,16.2,27.7,,5.0,76.67,0.845
48724,MEX,North America,Mexico,2020-01-03,,,,,,,...,17336.469,2.5,152.783,13.06,6.9,21.4,87.847,1.38,75.05,0.779
3012,ARG,South America,Argentina,2020-01-04,,,,,,,...,18933.907,0.6,191.032,5.5,16.2,27.7,,5.0,76.67,0.845
75171,THA,Asia,Thailand,2020-01-04,,,,,,,...,16277.671,0.1,109.861,7.04,1.9,38.8,90.67,2.1,77.15,0.777
48725,MEX,North America,Mexico,2020-01-04,,,,,,,...,17336.469,2.5,152.783,13.06,6.9,21.4,87.847,1.38,75.05,0.779
75172,THA,Asia,Thailand,2020-01-05,,,,,,,...,16277.671,0.1,109.861,7.04,1.9,38.8,90.67,2.1,77.15,0.777


In [11]:
# Show the last N (e.g. 10) rows of the dataframe
data.tail(10)

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,...,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index
46479,MYS,Asia,Malaysia,2021-04-24,390252.0,2717.0,2484.714,1426.0,11.0,8.0,...,26808.164,0.1,260.942,16.74,1.0,42.4,,1.9,76.16,0.81
41871,LAO,Asia,Laos,2021-04-24,247.0,88.0,27.0,,,0.0,...,6397.36,22.7,368.111,4.0,7.3,51.2,49.839,1.5,67.92,0.613
59539,PRY,South America,Paraguay,2021-04-24,265296.0,2162.0,2418.857,5802.0,87.0,78.714,...,8827.01,1.7,199.128,8.27,5.0,21.6,79.602,1.3,74.25,0.728
3935,ABW,North America,Aruba,2021-04-24,,,,,,,...,35973.781,,,11.62,,,,,76.29,
57045,OWID_OCE,,Oceania,2021-04-24,42985.0,276.0,203.571,1041.0,3.0,3.0,...,,,,,,,,,,
75170,TZA,Africa,Tanzania,2021-04-24,509.0,0.0,0.0,21.0,0.0,0.0,...,2683.304,49.1,217.288,5.75,3.3,26.7,47.953,0.7,65.46,0.529
28257,GAB,Africa,Gabon,2021-04-24,22433.0,0.0,82.143,138.0,0.0,0.714,...,16562.413,3.4,259.967,7.2,,,,6.3,66.47,0.703
80366,URY,South America,Uruguay,2021-04-24,182326.0,2789.0,2846.571,2283.0,56.0,62.143,...,20551.409,0.1,160.708,6.93,14.0,19.9,,2.8,77.91,0.817
21298,DJI,Africa,Djibouti,2021-04-24,10746.0,8.0,47.714,132.0,0.0,2.571,...,2705.406,22.5,258.037,6.05,1.7,24.5,,1.4,67.11,0.524
83861,ZWE,Africa,Zimbabwe,2021-04-24,38064.0,19.0,52.143,1556.0,0.0,0.571,...,1899.775,21.4,307.846,1.82,1.6,30.7,36.791,1.7,61.49,0.571


In [76]:
from linearmodels import PanelOLS
regression = PanelOLS(timeData["new_cases"], timeData[["total_cases","diabetes_prevalence"]], time_effects=True)
print(regression.fit())

                          PanelOLS Estimation Summary                           
Dep. Variable:              new_cases   R-squared:                        0.8155
Estimator:                   PanelOLS   R-squared (Between):              0.9619
No. Observations:               76599   R-squared (Within):               0.5743
Date:                Fri, Jun 18 2021   R-squared (Overall):              0.8162
Time:                        20:33:00   Log-likelihood                -8.386e+05
Cov. Estimator:            Unadjusted                                           
                                        F-statistic:                   1.682e+05
Entities:                         187   P-value                           0.0000
Avg Obs:                       409.62   Distribution:                 F(2,76138)
Min Obs:                       94.000                                           
Max Obs:                       459.00   F-statistic (robust):          1.682e+05
                            