<center>
<a href="https://github.com/kamu-data/kamu-cli">
<img alt="kamu" src="https://raw.githubusercontent.com/kamu-data/kamu-cli/master/docs/readme_files/kamu_logo.png" width=270/>
</a>
</center>

<br/>

<div align="center">
<a href="https://github.com/kamu-data/kamu-cli">Repo</a> | 
<a href="https://docs.kamu.dev/cli/">Docs</a> | 
<a href="https://docs.kamu.dev/cli/learn/learning-materials/">Tutorials</a> | 
<a href="https://docs.kamu.dev/cli/learn/examples/">Examples</a> |
<a href="https://docs.kamu.dev/cli/get-started/faq/">FAQ</a> |
<a href="https://discord.gg/nU6TXRQNXC">Discord</a> |
<a href="https://kamu.dev">Website</a>
</div>


<center>

<br/>
    
# Covid Jupyter Notebook

</center>

This tutorial is a step-by-step guide to exploring Covid case data: you will use SQL to query and reshape the data, then plot the results as charts.

<div class="alert alert-block alert-success">
To follow this example, check out the kamu-cli repository and navigate into the examples/covid sub-directory.

Create a temporary kamu workspace in that folder using:
<p style="background:black">
<code style="background:black;color:white">kamu init
</code>
</p>
</div>

<div class="alert alert-block alert-success">
Then add all dataset manifests found in the current directory and pull their data:
<p style="background:black">
<code style="background:black;color:white">kamu add --recursive .
kamu pull --all
</code>
</p>
</div>

## Load Kamu Extension
<div class="alert alert-block alert-success">
Start by loading the <code>kamu</code> Jupyter extension in the notebook:
</div>

In [None]:
%load_ext kamu

## Import and Test Data
<div class="alert alert-block alert-success">
Now it is time to start importing the Covid data by province. First, import the data for the province of BC using the <code>%import_dataset</code> command. The <code>--alias</code> option gives the dataset a shorter name, <code>cases_bc</code>, that is easier to reference in queries.
</div>

In [None]:
%import_dataset covid19.british-columbia.case-details --alias cases_bc

<div class="alert alert-block alert-success">
To test whether the data was loaded correctly, run a SQL query.
</div>

In [None]:
%%sql
SELECT * FROM cases_bc
ORDER BY id DESC
LIMIT 10

<div class="alert alert-block alert-success">
Now import the rest of the Covid datasets and create aliases for them.
</div>

In [None]:
%import_dataset covid19.ontario.case-details --alias cases_on
%import_dataset covid19.alberta.case-details --alias cases_ab
%import_dataset covid19.quebec.case-details --alias cases_qb

<div class="alert alert-block alert-success">
Test again whether the data was imported correctly. The query below checks the Ontario data. To test the Alberta data, change <code>cases_on</code> to <code>cases_ab</code>. For Quebec, change it to <code>cases_qb</code> and change <code>id</code> to <code>row_id</code>.
</div>

In [None]:
%%sql
SELECT * FROM cases_on
ORDER BY id DESC
LIMIT 10

<div class="alert alert-block alert-success">
The next dataset that you import contains the case details for the four provinces combined. The <code>covid19.canada.case-details</code> dataset uses an SQL query in its yaml file to combine the data, so you don't have to combine the provinces yourself with 'UNION ALL'.
The SQL queries that harmonize the data of each province can be found in <code>(province).case-details.hm.</code> If you open these yamls, you will find queries that remove the semantic differences between the datasets so that they can be compared. For example, only two provinces report a 90+ age group, whereas the other two stop at 80+. To compare the data, the finer age ranges are therefore folded into a single 80+ range.
</div>

In [None]:
%import_dataset covid19.canada.case-details --alias cases_four_provinces
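
The harmonization idea described above can be sketched in plain Python. This is only an illustration, not the query from the yaml files: the age labels, the placeholder provinces `A` and `B`, and the helper `harmonize_age_group` are all assumptions made up for this example.

```python
# Hypothetical labels: the real provincial datasets may spell age bands differently.
AGE_GROUP_MAP = {"80-89": "80+", "90+": "80+"}


def harmonize_age_group(label):
    """Collapse the finer bands above 80 into the coarser '80+' bucket."""
    return AGE_GROUP_MAP.get(label, label)


# Made-up rows for two placeholder provinces: A is fine-grained, B is coarse.
province_a = [("A", "70-79"), ("A", "80-89"), ("A", "90+")]
province_b = [("B", "70-79"), ("B", "80+")]

# List concatenation plays the role of UNION ALL: duplicate rows are kept.
combined = [
    (province, harmonize_age_group(age_group))
    for province, age_group in province_a + province_b
]

for row in combined:
    print(row)
```

After harmonization both provinces share the same set of age groups, which is what makes a per-group comparison meaningful.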

<div class="alert alert-block alert-success">
Again, check that the import worked by displaying 10 rows of the data.
</div>

In [None]:
%%sql
SELECT * FROM cases_four_provinces
LIMIT 10

<div class="alert alert-block alert-success">
To use this dataset, the following SQL query counts the cases by age group and by province. The <code>-o age_cases</code> option saves the result to a local variable for plotting.
</div>

In [None]:
%%sql -o age_cases
SELECT province, age_group, COUNT(*) 
FROM cases_four_provinces
GROUP BY province, age_group
ORDER BY province, age_group;
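
To see what shape of result a query like this produces, here is a small sketch that runs the same `GROUP BY` against a handful of made-up rows. It swaps in Python's built-in `sqlite3` module as a stand-in engine, which is not what the notebook itself uses; the table name `cases`, the `AS cases` column alias, and the rows are assumptions for illustration.

```python
import sqlite3

# In-memory database with a tiny, made-up table of individual cases.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cases (province TEXT, age_group TEXT)")
rows = [
    ("A", "20-29"), ("A", "20-29"), ("A", "80+"),
    ("B", "20-29"), ("B", "80+"), ("B", "80+"),
]
conn.executemany("INSERT INTO cases VALUES (?, ?)", rows)

# One output row per (province, age_group) pair, with the number of cases in it.
result = conn.execute(
    """
    SELECT province, age_group, COUNT(*) AS cases
    FROM cases
    GROUP BY province, age_group
    ORDER BY province, age_group
    """
).fetchall()
conn.close()

for row in result:
    print(row)
```

Each individual case becomes one unit in the count of its group, so six input rows collapse into four summary rows.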

<div class="alert alert-block alert-success">
    With <code>plotly.express.pie</code> you can create pie charts that compare the cases per age group, with one chart per province. As can be seen, over a third of Quebec's cases have an unknown age group, which is probably due to Quebec's strict privacy laws that are part of the Act Respecting Access to Documents Held by Public Bodies and the Protection of Personal Information. Such differences in law can cause errors when comparing data. </div>

In [None]:
%%local 
import plotly.express 
plotly.express.pie(age_cases, values='count(1)', names='age_group', color='age_group', title='Cases by Age Group and Province', facet_col='province')
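
The share of cases with an unknown age group can also be computed directly instead of being read off the chart. The sketch below uses made-up counts for two placeholder provinces; the label `"UNKNOWN"` and the variable `age_cases_sample` are assumptions, so check the actual values in your data before adapting it.

```python
from collections import defaultdict

# Made-up (province, age_group, count) rows; "UNKNOWN" is an assumed label.
age_cases_sample = [
    ("A", "20-29", 40), ("A", "80+", 20), ("A", "UNKNOWN", 40),
    ("B", "20-29", 70), ("B", "80+", 30),
]

# Total number of cases per province.
totals = defaultdict(int)
for province, _, count in age_cases_sample:
    totals[province] += count

# Fraction of each province's cases that fall in the unknown bucket.
unknown_share = {province: 0.0 for province in totals}
for province, age_group, count in age_cases_sample:
    if age_group == "UNKNOWN":
        unknown_share[province] = count / totals[province]

for province, share in sorted(unknown_share.items()):
    print(f"{province}: {share:.0%} of cases have an unknown age group")
```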

<div class="alert alert-block alert-success">
Another attribute available in this dataset is gender. The next SQL query counts the cases by gender and by province.
</div>

In [None]:
%%sql -o total_cases
SELECT province, gender, COUNT(*) 
FROM cases_four_provinces
GROUP BY province, gender
ORDER BY province, gender;

<div class="alert alert-block alert-success">
    With <code>plotly.express.bar</code> you can create a bar chart that compares the cases per province, broken down by gender (male, female, unspecified).
</div>

In [None]:
%%local 
import plotly.express 
plotly.express.bar(total_cases, x='province', y='count(1)', color='gender', title='Cases per Gender')


<div class="alert alert-block alert-info">
    Looking through the data, you can see that Quebec has a large number of people whose gender was not specified. This is probably again due to Quebec's strict privacy laws.
</div>

<div class="alert alert-block alert-success">
The last dataset that we are importing is daily cases for the four provinces.
</div>

In [None]:
%import_dataset covid19.canada.daily-cases --alias daily_cases

<div class="alert alert-block alert-success">
Now test again to see if the data was successfully imported for this dataset.
</div>

In [None]:
%%sql -o daily_cases
select * from daily_cases

<div class="alert alert-block alert-success">
The last step is to create a line plot that compares the number of cases per day by province.
</div>

In [None]:
%%local
import plotly.express
plotly.express.line(daily_cases, x="reported_date" , y="total_daily", color="province")

<div class="alert alert-block alert-info">

As seen in the graph above, the case data has multiple spikes, including two significant ones in Quebec in late December 2020 and early January 2021. As explained in [this data source issue](https://github.com/ccodwg/Covid19Canada/issues/44), these spikes don't reflect an actual surge in cases, but rather a **delay in data entry** due to the holidays and weekends: cases are attributed to the day they were entered on instead of amending the past data for the days they were registered on. This issue makes the data hard to work with, often requiring some "smoothing" to get an approximate number of cases on a certain date.


Kamu offers a combination of techniques, such as [watermarks](https://docs.kamu.dev/glossary/#watermark) and explicit [retractions and corrections](https://docs.kamu.dev/glossary/#retractions-and-corrections), to automatically account for late-arriving data while providing both **minimal latency** and **accuracy and consistency** of data. Continue to [other examples](https://docs.kamu.dev/cli/get-started/examples/) to learn more.

</div>
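
The "smoothing" mentioned in the note above is often done with a trailing moving average. The following is a minimal sketch of that general technique in plain Python, not something this notebook or Kamu performs; the helper `rolling_mean`, the 7-day window, and the daily counts are made up for illustration.

```python
from collections import deque


def rolling_mean(values, window=7):
    """Trailing moving average; early points average the values seen so far."""
    recent = deque(maxlen=window)  # keeps only the last `window` values
    smoothed = []
    for value in values:
        recent.append(value)
        smoothed.append(sum(recent) / len(recent))
    return smoothed


# Made-up daily counts: two days with no entries, then a catch-up spike.
total_daily = [10, 12, 0, 0, 40, 11, 13, 12, 10, 14]
smoothed = rolling_mean(total_daily)

for raw, avg in zip(total_daily, smoothed):
    print(f"{raw:>3}  {avg:6.2f}")
```

A moving average spreads a reporting spike over the window, so it hides the artifact but does not recover the true date of each case. That limitation is why corrections at the data level are preferable when they are available.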