<center>
<a href="https://github.com/kamu-data/kamu-cli">
<img alt="kamu" src="https://raw.githubusercontent.com/kamu-data/kamu-cli/master/docs/readme_files/kamu_logo.png" width=270/>
</a>
</center>

<br/>

<div align="center">
<a href="https://github.com/kamu-data/kamu-cli">Repo</a> | 
<a href="https://docs.kamu.dev/cli/">Docs</a> | 
<a href="https://docs.kamu.dev/cli/learn/learning-materials/">Tutorials</a> | 
<a href="https://docs.kamu.dev/cli/learn/examples/">Examples</a> |
<a href="https://docs.kamu.dev/cli/get-started/faq/">FAQ</a> |
<a href="https://discord.gg/nU6TXRQNXC">Discord</a> |
<a href="https://kamu.dev">Website</a>
</div>


<center>

<br/>
    
# Covid Jupyter Notebook

</center>

This tutorial is a step-by-step guide to exploring Covid data: using SQL to manipulate it and building visually pleasing graphs from the results.

<div class="alert alert-block alert-success">
To follow this example, check out the kamu-cli repository and navigate into the examples/covid sub-directory.

Create a temporary kamu workspace in that folder using:
<p style="background:black">
<code style="background:black;color:white">kamu init
</code>
</p>
</div>

<div class="alert alert-block alert-success">
Then add all dataset manifests found in the current directory and pull their data:
<p style="background:black">
<code style="background:black;color:white">kamu add --recursive .
kamu pull --all
</code>
</p>
</div>

## Connect to Kamu
First we need to import the `kamu` library and create a connection to the server. We will let the library figure out where to find the server, but you can also connect to other nodes by providing a URL.

<div class="alert alert-block alert-success">

Connect to the `kamu` server.

</div>

In [None]:
import kamu

con = kamu.connect()

You can already query data using the connection object.

In [None]:
con.query("select 1 as value")

## Load Kamu Extension
To avoid typing `con.query("...")` all the time, let's load the <code>kamu</code> Jupyter extension.

In [None]:
%load_ext kamu

The extension provides a convenient `%%sql` cell magic. Let's use it to look at the data from the province of BC.

In [None]:
%%sql
select * from 'covid19.british-columbia.case-details' limit 3

## Explore Data

We can use the same approach to sample data from other provinces:

In [None]:
%%sql
select * from 'covid19.alberta.case-details' limit 3

In [None]:
%%sql
select * from 'covid19.ontario.case-details' limit 3

In [None]:
%%sql
select * from 'covid19.quebec.case-details' limit 3

Notice how data schemas and column semantics differ slightly between provinces. This makes it pretty difficult to work with data across all provinces.

To tackle that, we have created several harmonization datasets `{province}.case-details.hm` that bring data from all provinces into a common format. The `covid19.canada.case-details` dataset then uses a `UNION ALL` operation to derive a new pan-Canadian dataset.
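
The harmonize-then-union pattern described above can be sketched outside of kamu with Python's built-in `sqlite3` module. This is only an illustration of the idea: the table names, column names, and values below are hypothetical and do not match the real provincial schemas or the actual dataset definitions in this example.

```python
import sqlite3

# In-memory database standing in for two provincial sources.
# All schemas and rows here are hypothetical, for illustration only.
db = sqlite3.connect(":memory:")
db.executescript("""
    create table province_a (reported_date text, sex text, age_group text);
    create table province_b (date_reported text, gender text, age text);

    insert into province_a values ('2021-01-01', 'F', '30-39');
    insert into province_b values ('2021-01-02', 'MALE', '40s');

    -- Harmonization views: rename columns and normalize values
    -- so that both sources expose the same common format.
    create view province_a_hm as
    select 'A' as province, reported_date, sex as gender, age_group
    from province_a;

    create view province_b_hm as
    select
        'B' as province,
        date_reported as reported_date,
        case gender when 'MALE' then 'M' when 'FEMALE' then 'F' else 'U' end as gender,
        age as age_group
    from province_b;

    -- Combined view: identical columns in identical order, stacked with UNION ALL.
    create view all_cases as
    select * from province_a_hm
    union all
    select * from province_b_hm;
""")

rows = db.execute(
    "select province, reported_date, gender, age_group "
    "from all_cases order by province"
).fetchall()
for row in rows:
    print(row)
```

`UNION ALL` requires only that both sides have the same number of columns in a compatible order, which is why each source gets its own harmonization step first.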

<div class="alert alert-block alert-success">
Take a minute to study the definitions of these datasets.
</div>

Let's sample the pan-Canadian dataset now.

In [None]:
%%sql
select * from 'covid19.canada.case-details' limit 3

Let's write a query that counts the number of cases by age group and by province.

In [None]:
%%sql -o age_cases
select
    province,
    age_group,
    count(*)
from 'covid19.canada.case-details'
group by province, age_group
order by province, age_group

We can use `plotly` to visualize this data as a pie chart.

In [None]:
import plotly.express

# One pie per province (facet_col), sliced by age group.
plotly.express.pie(
    age_cases,
    values='count(*)',
    names='age_group',
    color='age_group',
    title='Cases by Age Group and Province',
    facet_col='province',
)

As can be seen, over a third of Quebec's cases have an unknown age group, which is probably due to Quebec's strict privacy laws under the Act Respecting Access to Documents Held by Public Bodies and the Protection of Personal Information. Such differences in law can lead to errors when comparing data across provinces!
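
A proportion like this can also be checked numerically rather than read off the chart. The following is a minimal sketch in plain Python; the helper `group_shares` is hypothetical, and the counts are made up for illustration and are not taken from the dataset.

```python
from collections import defaultdict

def group_shares(rows):
    """Convert (province, group, count) rows into per-province fractions."""
    totals = defaultdict(int)
    for province, _, count in rows:
        totals[province] += count
    return {
        (province, group): count / totals[province]
        for province, group, count in rows
    }

# Made-up counts for illustration only; not taken from the dataset.
rows = [
    ("QC", "20-29", 30),
    ("QC", "UNKNOWN", 20),
    ("ON", "20-29", 45),
    ("ON", "UNKNOWN", 5),
]
shares = group_shares(rows)
print(shares[("QC", "UNKNOWN")])  # fraction of QC rows with an unknown group
```

The same function could be fed the rows of a `province, age_group, count(*)` query result to compare the share of unknown values across provinces.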

Now let's look at the distribution of cases by gender and by province:

In [None]:
%%sql -o total_cases
select
    province,
    gender,
    count(*)
from 'covid19.canada.case-details'
group by province, gender
order by province, gender

In [None]:
plotly.express.bar(total_cases, x='province', y='count(*)', color='gender', title='Cases per Gender')


Here you can see that Quebec has a large number of people whose gender was classified as undefined. This is probably again due to Quebec's strict privacy laws.

The last dataset that we will look at is the daily case aggregation for the four provinces.

In [None]:
%%sql -o daily_cases
select * from 'covid19.canada.daily-cases'

We can use it to create a line plot that compares the number of cases per day by province.

In [None]:
plotly.express.line(daily_cases, x="reported_date", y="total_daily", color="province")

As seen in the graph above, the case data has multiple spikes, including two extreme ones in Quebec in late December 2020 and early January 2021. As explained in [this data source issue](https://github.com/ccodwg/Covid19Canada/issues/44), these spikes don't reflect an actual surge in cases, but rather a **delay in data entry** due to holidays and weekends: cases are attributed to the day they were entered on, instead of the past data being amended for the days they were registered on. This issue makes the data hard to work with, often requiring some "smoothing" to get an approximate number of cases on a certain date.
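
One common form of such smoothing is a trailing moving average. The sketch below is a generic illustration in plain Python, not something this example or kamu prescribes; the function name `trailing_average` and the daily counts are made up.

```python
from collections import deque

def trailing_average(values, window=7):
    """Return the trailing moving average of `values` over up to `window` points."""
    recent = deque(maxlen=window)  # keeps only the last `window` values
    smoothed = []
    for value in values:
        recent.append(value)
        smoothed.append(sum(recent) / len(recent))
    return smoothed

# Made-up daily counts: two days with no reporting, then a catch-up spike.
daily = [100, 110, 0, 0, 330, 120, 115]
smoothed = trailing_average(daily, window=3)
print(smoothed)
```

Assuming `daily_cases` is a pandas DataFrame, a similar per-province result could be obtained with `daily_cases.groupby("province")["total_daily"].transform(lambda s: s.rolling(7, min_periods=1).mean())`. Note that smoothing only spreads a spike over neighbouring days; it does not recover the dates on which the cases actually occurred.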


Kamu offers a combination of techniques, such as [watermarks](https://docs.kamu.dev/glossary/#watermark) and explicit [retractions and corrections](https://docs.kamu.dev/glossary/#retractions-and-corrections), to automatically account for late-arriving data while providing both **minimal latency** and **accuracy and consistency** of data.
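
To see why late-arriving data matters, the toy sketch below counts the same records twice: once by the date they were entered and once by the date they actually occurred. It does not show how Kamu's watermarks or corrections work; the records are hypothetical and only illustrate how entry delays create artificial spikes.

```python
from collections import Counter

# Hypothetical records as (event_date, entry_date) pairs. The cases from
# Dec 25 and Dec 26 were only entered on Dec 27 because of the holidays.
records = [
    ("2020-12-24", "2020-12-24"),
    ("2020-12-25", "2020-12-27"),
    ("2020-12-26", "2020-12-27"),
    ("2020-12-27", "2020-12-27"),
]

# Counting by entry date piles the delayed cases onto a single day.
by_entry = Counter(entry for _, entry in records)

# Counting by event date attributes each case to the day it occurred.
by_event = Counter(event for event, _ in records)

print(sorted(by_entry.items()))
print(sorted(by_event.items()))
```

Grouping by entry date shows nothing on Dec 25 and 26 followed by a spike on Dec 27, while grouping by event date shows one case per day.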

Continue to [other examples](https://docs.kamu.dev/cli/get-started/examples/) to learn more!