<center>
<a href="https://github.com/kamu-data/kamu-cli">
<img alt="kamu" src="https://raw.githubusercontent.com/kamu-data/kamu-cli/master/docs/readme_files/kamu_logo.png" width=270/>
</a>
</center>

<br/>

<div align="center">
<a href="https://github.com/kamu-data/kamu-cli">Repo</a> | 
<a href="https://docs.kamu.dev/cli/">Docs</a> | 
<a href="https://docs.kamu.dev/cli/learn/learning-materials/">Tutorials</a> | 
<a href="https://docs.kamu.dev/cli/learn/examples/">Examples</a> |
<a href="https://docs.kamu.dev/cli/get-started/faq/">FAQ</a> |
<a href="https://discord.gg/nU6TXRQNXC">Discord</a> |
<a href="https://kamu.dev">Website</a>
</div>


<center>

<br/>
    
# 1. Introduction

</center>

## Welcome

Hi, and thanks for checking out [kamu](https://github.com/kamu-data/kamu-cli) - the new generation data management tool!

In this tutorial we will learn how to use the `kamu` tool to build **decentralized collaborative data processing pipelines**.

<div class="alert alert-block alert-info">

The end result will look like a smaller version of [this pipeline](${KAMU_WEB_UI_URL}kamu/covid19.canada.case-details?tab=lineage) created by our community:

![](files/web-ui.png)
</div>

This environment comes with the `kamu` command-line tool pre-installed, so give it a try now.

<div class="alert alert-block alert-success">
<b>Your turn:</b> Open the <b>Terminal</b> tab in Jupyter and run:

<p style="background:black">
<code style="background:black;color:white">kamu
</code>
</p>
</div>

<div class="alert alert-block alert-warning">
<details>
<summary style="display:list-item"><b>New to Jupyter?</b></summary>

* Open the <b>File</b> menu at the top of the window
* Select <b>New -> Terminal</b>
* This will open a terminal in a new browser tab
* Now you can switch between the terminal tab and this lesson as you continue

</details>
</div>

## What is Kamu for?

[Kamu](https://github.com/kamu-data/kamu-cli) is a tool based on the [Open Data Fabric](http://opendatafabric.org/) protocol that connects publishers and consumers of data into a <mark>decentralized data supply chain</mark>. It lets you get data fast, in a ready-to-use form for analysis and ML tasks, while ensuring that the data is trustworthy and easy to keep up to date.

In this demo, we are going to explore some of the key features of `kamu` through some <mark>real world examples</mark>.

<div class="alert alert-block alert-warning">
<b>Short on time?</b> See <a href="https://www.youtube.com/watch?v=oUTiWW6W78A&list=PLV91cS45lwVG20Hicztbv7hsjN6x69MJk">this video</a> for a quick tour of key features.
</div>

If you have any questions during this demo, you can chat with us on [Discord](https://discord.gg/nU6TXRQNXC) or create an issue in the [kamu-cli](https://github.com/kamu-data/kamu-cli) GitHub repository.

## Workspaces

You start working with `kamu` by creating a workspace. A workspace is just a directory where `kamu` stores data and metadata of the datasets.

<div class="alert alert-block alert-success">
Go ahead and create your first workspace (we'll do it right in this lesson's directory):<br/>

<p style="background:black">
<code style="background:black;color:white">cd "01 - Kamu Basics (COVID-19 example)"
kamu init
</code>
</p>
</div>

<div class="alert alert-block alert-info">
<b>Note:</b> Similarly to <code>git</code>, this creates a <code>.kamu</code> directory in the folder where you ran the command.
</div>

Your new workspace is currently empty. Confirm that by running:

<p style="background:black">
<code style="background:black;color:white">kamu list
</code>
</p>

## Adding the first dataset
Let's add some new datasets to our workspace!

In `kamu`, datasets are defined using `.yaml` files. You can import them with the `kamu add` command.

In this demo, we are going to work with some disaggregated COVID-19 datasets published by different provinces of Canada.

<div class="alert alert-block alert-success">
To add a dataset to the workspace run:<br/>

<p style="background:black">
<code style="background:black;color:white">kamu add datasets/british-columbia.case-details.yaml
</code>
</p>
</div>

<div class="alert alert-block alert-info">
<b>Note:</b> Every command in <code>kamu</code> is well documented. Try running <code>kamu add -h</code> or <code>kamu add --help</code> to see all the parameters and useful examples.
</div>

This particular dataset contains case data from British Columbia, including the day, age group, gender, and area where each case was registered. Datasets like this one that ingest or receive external data are called `root` datasets. They contain valuable data that cannot be reconstructed if lost (also known as **source data**).

The dataset definition file looks like this:

```yaml
kind: DatasetSnapshot
version: 1
content:
  name: covid19.british-columbia.case-details
  kind: Root
  metadata:
    # Specifies the source of data that can be periodically polled to refresh the dataset
    - kind: SetPollingSource
      # Where to fetch the data from.
      fetch:
        kind: Url
        url: http://www.bccdc.ca/Health-Info-Site/Documents/BCCDC_COVID19_Dashboard_Case_Details.csv
      # How to parse the data.
      read:
        kind: Csv
        separator: ","
        header: true
        nullValue: ""
      # Pre-processing query that shapes the data (optional)
      preprocess:
        kind: Sql
        engine: datafusion  # Kamu supports many data processing engines like DataFusion, Spark, Flink...
        query: |
          select
            to_timestamp(Reported_Date) as reported_date,
            Classification_Reported as classification,
            id,
            HA as ha,
            Sex as sex,
            Age_Group as age_group
          from input
      # How to combine newly-ingested data with data that is already in the dataset
      merge:
        kind: Ledger
        primaryKey:
          - id
    # Tells kamu to use `reported_date` column as event time instead of the default `event_time`
    - kind: SetVocab
      eventTimeColumn: reported_date
```

<div class="alert alert-block alert-info">

Detailed schemas for `DatasetSnapshot` and other metadata can be found in the [ODF reference](https://docs.kamu.dev/odf/reference/#datasetsnapshot).

When creating a new dataset you can also use the `kamu new` command to start with an annotated template.
</div>

As you can see, it tells `kamu`:
- Where to fetch the data from,
- What type of data to expect and how to read it,
- What pre-processing steps are needed to shape the data into a nice typed schema,
- How to merge data we read with data that *is already in the dataset* (more on this in a minute).
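
To make the `read` and `preprocess` steps more concrete, the sketch below mimics them in plain Python on a tiny inline CSV. It only illustrates the idea and is not how `kamu` works internally: the two sample rows are made up, the `parse_case_details` helper is hypothetical, and the `YYYY-MM-DD` date format is an assumption about the source file.

```python
import csv
import io
from datetime import datetime

# Made-up sample in the shape of the source CSV (values are illustrative only)
SAMPLE_CSV = """Reported_Date,Classification_Reported,id,HA,Sex,Age_Group
2020-01-29,Lab-diagnosed,1,Out of Canada,M,40-49
2020-02-06,Lab-diagnosed,2,Vancouver Coastal,F,
"""


def parse_case_details(text):
    """Hypothetical helper: read CSV rows and reshape them like the preprocess query."""
    rows = []
    # `read` step: comma separator, first line is the header
    for raw in csv.DictReader(io.StringIO(text), delimiter=","):
        # nullValue "" means that empty strings are treated as missing values
        rec = {key: (value if value != "" else None) for key, value in raw.items()}
        # `preprocess` step: rename columns and give the date a proper type
        rows.append({
            "reported_date": datetime.strptime(rec["Reported_Date"], "%Y-%m-%d"),
            "classification": rec["Classification_Reported"],
            "id": rec["id"],
            "ha": rec["HA"],
            "sex": rec["Sex"],
            "age_group": rec["Age_Group"],
        })
    return rows


for row in parse_case_details(SAMPLE_CSV):
    print(row)
```

In the real pipeline these steps run inside the configured engine, so you only describe them declaratively in the `.yaml` file.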

<div class="alert alert-block alert-info">
<b>Note:</b> Kamu strictly follows the <b>"data as code"</b> philosophy in which you never alter the data directly. Instead, you express all transformations with queries (SQL in this case).
</div>

### Pulling the data in

We can now see the dataset in our workspace, but it is still empty:

<p style="background:black">
<code style="background:black;color:white">kamu list
</code>
</p>

We told `kamu` where to get the data from, but have not fetched it yet.

<div class="alert alert-block alert-success">
So let's run the following command to fetch data:<br/>

<p style="background:black">
<code style="background:black;color:white">kamu pull covid19.british-columbia.case-details
</code>
</p>
</div>

<div class="alert alert-block alert-info">
Make sure to use <b>shell completions</b>, they will save you a lot of typing!
<p style="background:black">
<code style="background:black;color:white">kamu pull co&lt;TAB&gt;
</code>
</p>
</div>

During this time `kamu` will fetch the data from its source, then read and preprocess it as specified.

<div class="alert alert-block alert-success">
Once completed, we can use the <code>tail</code> command to see a sample of the new data:

<p style="background:black">
<code style="background:black;color:white">kamu tail covid19.british-columbia.case-details
</code>
</p>
</div>

As easy as that!

## Ledger nature of data

A very important aspect of `kamu` is that it stores the history of data, not just snapshots. If we run the `pull` command on this dataset tomorrow, it will only add the records that were not previously observed.

<div class="alert alert-block alert-success">
Run the <code>pull</code> command again to verify that our data is still up-to-date with the source:

<p style="background:black">
<code style="background:black;color:white">kamu pull covid19.british-columbia.case-details
</code>
</p>
</div>

In other words, in `kamu` <mark>data is a ledger</mark> - an append-only record of events where **past events never change** (are immutable).

<div class="alert alert-block alert-info">

See the [Merge Strategies](https://docs.kamu.dev/cli/ingest/merge-strategies/) topic in our documentation for an explanation of how `kamu` transforms different types of data into a ledger.

</div>
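
As a rough mental model of the `Ledger` merge strategy with `primaryKey: [id]`, the toy function below appends only records whose key has not been seen before and never touches existing ones. The `ledger_merge` name, the in-memory list, and the sample records are assumptions made for illustration, not the actual implementation.

```python
def ledger_merge(ledger, snapshot, primary_key="id"):
    """Append records from `snapshot` whose primary key is not yet in `ledger`.

    Returns the newly appended records. Existing records are never modified.
    """
    seen = {rec[primary_key] for rec in ledger}
    added = []
    for rec in snapshot:
        if rec[primary_key] not in seen:
            seen.add(rec[primary_key])
            ledger.append(rec)
            added.append(rec)
    return added


ledger = []
day1 = [{"id": 1, "ha": "Fraser"}, {"id": 2, "ha": "Interior"}]
day2 = day1 + [{"id": 3, "ha": "Northern"}]

print(len(ledger_merge(ledger, day1)))  # first pull: both records are new
print(len(ledger_merge(ledger, day2)))  # next pull: only id=3 is new
print(len(ledger_merge(ledger, day2)))  # nothing new: the ledger is up to date
```

The three calls print `2`, `1`, and `0`, which mirrors what you observe when pulling a source that has not changed.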

## Ledger nature of metadata 

Additionally, every modification to the dataset itself (like adding data, changing the description, or changing the license) is stored in the so-called **metadata chain**.

<div class="alert alert-block alert-success">
Inspect the metadata chain using the <code>log</code> command ("Q" to close):

<p style="background:black">
<code style="background:black;color:white">kamu log covid19.british-columbia.case-details
</code>
</p>
</div>

As you can see, **metadata is also a ledger**!

You will see at least four types of metadata **blocks** (in bottom-up order):
- `Seed` - establishes a globally-unique cryptographic identity of the dataset
- `SetPollingSource` - declares how external data should be ingested
- `SetVocab` - renames some system columns (in our case it specifies that the `reported_date` column should be treated as the event time)
- `AddData` - shows the N new records that were added during the last `pull` command and the hashes of the files that were produced

You can think of the metadata chain as a `git` commit log, except that instead of data it stores an accurate **history of events** that affected what the dataset looks like throughout its entire lifetime.
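
The chain idea can be sketched in a few lines of Python: every block records the hash of the block before it, so altering any past block breaks the link that follows it. This is a deliberately simplified illustration. The `append_block` and `verify` helpers, the JSON encoding, and the use of SHA-256 are assumptions and do not reflect the actual ODF block format.

```python
import hashlib
import json


def block_hash(block):
    """Hash a block by hashing its canonical JSON representation."""
    encoded = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_block(chain, event):
    """Append a new block that points to the hash of the previous block."""
    prev = block_hash(chain[-1]) if chain else None
    chain.append({"prev": prev, "event": event})


def verify(chain):
    """Check that every block still points to the hash of its predecessor."""
    return all(
        chain[i]["prev"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )


chain = []
for event in ["Seed", "SetPollingSource", "SetVocab", "AddData"]:
    append_block(chain, event)

print(verify(chain))            # True: the chain is intact
chain[1]["event"] = "tampered"  # try to rewrite history
print(verify(chain))            # False: the next block no longer matches
```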

## Analyzing Data

Getting raw data in is just a small first step on our journey towards collaboration on data, but before we continue, let's take a quick break and see what we can do with data we already have.

### SQL Shell

Kamu has a built-in SQL shell which you can start by running:

<p style="background:black">
<code style="background:black;color:white">kamu sql
</code>
</p>

<div class="alert alert-block alert-info">
The default SQL shell is based on the <a href="https://arrow.apache.org/datafusion/">Apache DataFusion</a> engine, but we also support Spark.
</div>

<div class="alert alert-block alert-success">
Try the following queries in the SQL shell:

<p style="background:black">
<code style="background:black;color:white">show tables;</code>
</p>
<p style="background:black">
<code style="background:black;color:white">describe "covid19.british-columbia.case-details";</code>
</p>
<p style="background:black">
<code style="background:black;color:white">select * from "covid19.british-columbia.case-details" limit 10;</code>
</p>
</div>

<div class="alert alert-block alert-success">

Now try writing a query that shows the total number of cases per region (`ha` column).

</div>

Press **Ctrl + D** to exit.

### Notebooks

When you install Kamu CLI on your computer you can use the `kamu notebook` command to start an integrated Jupyter Notebook environment identical to the one you are currently using.

Since we're already in the notebook environment - let's give this integration a try!

<div class="alert alert-block alert-success">
Start by creating a connection to the <code>kamu</code> SQL server:
</div>

<div class="alert alert-block alert-warning">
<details>
<summary style="display:list-item"><b>New to Jupyter?</b></summary>

Jupyter notebooks contain cells that are **executable**, so static text can be mixed with computations and data visualization.

**You** are in control of what runs when, so you'll need to **select the code cell below** and then click the **"Run"** button on the top panel, or press `Shift + Enter`.

</details>
</div>

In [None]:
import kamu

con = kamu.connect("file://")
print("Connected to kamu via", con)

Using the `kamu` Python library we can connect to any remote kamu node by providing a URL.

When the URL is a local path, the `kamu` library will automatically start an SQL server for that local workspace and connect to it. Super convenient!

We can now send SQL requests using the `query(sql)` method. The result will be returned as a Pandas DataFrame:

In [None]:
con.query("select 1")

In [None]:
con.query("select * from 'covid19.british-columbia.case-details' limit 3")

Writing `con.query(...)` many times can get old fast, but the `kamu` Jupyter extension can help with that.

<div class="alert alert-block alert-success">
Load the <code>kamu</code> Jupyter extension:
</div>

In [None]:
%load_ext kamu

The extension provides a very convenient `%%sql` cell magic:

<div class="alert alert-block alert-success">
To see the schema and number of records in the dataset run:
</div>

In [None]:
%%sql
select count(*) from 'covid19.british-columbia.case-details'

In [None]:
%%sql
describe 'covid19.british-columbia.case-details'

To see a sample of data run:

In [None]:
%%sql
select
    *
from 'covid19.british-columbia.case-details'
order by reported_date desc
limit 3

<div class="alert alert-block alert-success">
Run this simple SQL query to count the number of cases per age group:
</div>

In [None]:
%%sql
select
    age_group,
    count(*) as case_count 
from 'covid19.british-columbia.case-details'
group by age_group

The `kamu` extension also provides a convenient auto-viz widget that you can use to quickly plot data in a data frame.

<div class="alert alert-block alert-success">
    
Once you get the results, try switching the results view from the "Table" tab to the "Bar" tab and build a histogram.

</div>

Using `kamu` with Jupyter lets you offload complex computations to a selection of powerful SQL engines. You don't have to download all the data into the notebook, where it often would not fit into memory. Instead, you can shape and aggregate data on the SQL engine side and download only the results, which are often much smaller, for the final visualization.

Using the `-o <variable_name>` parameter of the `%%sql` cell magic we can save the result into a variable.

When you expect a lot of data and don't want to display a table you can also use the `-q` or `--quiet` flag.

<div class="alert alert-block alert-success">

Let's count the number of cases per day and pull the result from the SQL engine into our notebook:
    
</div>

In [None]:
%%sql -o df -q
select
    reported_date as date,
    count(*) as case_count
from 'covid19.british-columbia.case-details'
group by date
order by date

We now have a variable `df` containing the data as a Pandas DataFrame, and you are free to do anything with it that you'd normally do in Jupyter.

This environment already comes with some popular plotting libraries pre-installed (like `plotly`, `bokeh`, `mapbox`, etc.), but if your favorite library is missing - you can always `pip install` it from the terminal.

<div class="alert alert-block alert-success">

Let's do some basic plotting:

</div>

In [None]:
import plotly.express as px

fig = px.scatter(
    df, x="date", y="case_count", 
    trendline="rolling", trendline_options=dict(window=7), 
    trendline_color_override="red")

fig.show()
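
For intuition, the red trendline above is a rolling average over 7 consecutive rows of `case_count`. Assuming it behaves like a pandas rolling mean with `window=7` (no value until a full window is available), it can be sketched in plain Python as follows. The `rolling_mean` helper is hypothetical and the counts are made-up numbers.

```python
def rolling_mean(values, window=7):
    """Trailing moving average; positions without a full window yield None."""
    result = []
    for i in range(len(values)):
        if i + 1 < window:
            # Not enough preceding points yet to fill the window
            result.append(None)
        else:
            chunk = values[i + 1 - window : i + 1]
            result.append(sum(chunk) / window)
    return result


counts = [1, 2, 3, 4, 5, 6, 7, 8]  # made-up daily case counts
print(rolling_mean(counts))
# [None, None, None, None, None, None, 4.0, 5.0]
```

Note that the window counts rows, not calendar days, so dates with no reported cases are simply skipped rather than counted as zero.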

---

## Up Next
🎉 Well done so far! 🎉

Now that we've covered the basics of root datasets and data exploration, you are ready to move on to the next chapter where we will take a look at <mark>the key feature</mark> of `kamu` - **collaborative data processing**!