<center>
<a href="https://github.com/kamu-data/kamu-cli">
<img alt="kamu" src="https://raw.githubusercontent.com/kamu-data/kamu-cli/master/docs/readme_files/kamu_logo.png" width=270/>
</a>
</center>

<br/>

<div align="center">
<a href="https://docs.kamu.dev/cli/">Docs</a> | 
<a href="https://docs.kamu.dev/cli/learn/learning-materials/">Tutorials</a> | 
<a href="https://docs.kamu.dev/cli/learn/examples/">Examples</a> |
<a href="https://docs.kamu.dev/cli/get-started/faq/">FAQ</a> |
<a href="https://discord.gg/nU6TXRQNXC">Discord</a> |
<a href="https://kamu.dev">Website</a>
</div>


<center>

<br/>
    
# 2. Sharing and Collaboration

</center>

<div class="alert alert-block alert-info">
If you skipped the previous chapter or are continuing after a break, use the following commands to get your environment ready for this chapter:
    
<p style="background:black">
<code style="background:black;color:white">cd "01 - Kamu Basics (COVID-19 example)"
./init-chapter-2.sh
</code>
</p>
</div>

## Repositories

Once you have created a dataset, you might want to share it with other people. You can do so by using **repositories**.

A repository can be just some storage, e.g.:
- Cloud and on-prem like S3, GCS, or Minio
- Decentralized storage like IPFS, Arweave (see next tutorial on "Web3 Data")
- Or even some old FTP server (see [full list](https://docs.kamu.dev/node/deploy/storage/))

As a repository for this demo we will use [**Kamu Node**](https://docs.kamu.dev/node/). You can think of it as a small server on top of some storage (AWS S3 or Minio in this case) that speaks the ODF protocol and provides a bunch of cool additional features, like highly optimized uploads and downloads, dataset search, and even remote SQL query execution.

<div class="alert alert-block alert-success">
So let's add the node as a repository:
    
<p style="background:black">
<code style="background:black;color:white"> kamu repo add kamu-node ${KAMU_NODE_URL}
</code>
</p>
</div>

<div class="alert alert-block alert-success">
To see the list of all known repositories use:

<p style="background:black">
<code style="background:black;color:white"> kamu repo list
</code>
</p>
</div>

## Pushing Data
You happen to already have `write` access to the node, so let's try sharing our dataset.

<div class="alert alert-block alert-success">
<b>Push</b> your local dataset into the repository (note that we are storing it under your personal account):

<p style="background:black">
<code style="background:black;color:white"> kamu push covid19.british-columbia.case-details --to kamu-node/${GITHUB_LOGIN}/covid19.british-columbia.case-details
</code>
</p>
</div>

<div class="alert alert-block alert-success">
This initial push creates an association between your local dataset and the repository, so next time you want to update the remote dataset you can simply run:

<p style="background:black">
<code style="background:black;color:white"> kamu push covid19.british-columbia.case-details
</code>
</p>
</div>

## Discovering Data

Using the `search` command, we can also search for datasets in known repositories.

<div class="alert alert-block alert-success">
Try searching for <code>covid</code> datasets:

<p style="background:black">
<code style="background:black;color:white"> kamu search covid
</code>
</p>
</div>

First, you should see the dataset you pushed in this list, under your account:

`kamu-node/${GITHUB_LOGIN}/covid19.british-columbia.case-details`


But more excitingly, you will also see some COVID-19 datasets added by other people, some related to other provinces of Canada, like:

`kamu-node/kamu/covid19.ontario.case-details`

<div class="alert alert-block alert-success">
Let's pull this dataset into our workspace, and have a look:

<p style="background:black">
<code style="background:black;color:white"> kamu pull kamu-node/kamu/covid19.ontario.case-details
</code>
</p>
<p style="background:black">
<code style="background:black;color:white"> kamu tail covid19.ontario.case-details
</code>
</p>
</div>

Ah! This dataset is very similar to the dataset that we created. It also contains individual case data, but from the province of Ontario.

## Derivative Datasets

Canada has 13 provinces and territories, so while we could run data analysis on each of them separately, things would be much simpler if we had one Canada-wide dataset.

I'm sure other researchers would also appreciate having such a dataset, so let's create one!

<div class="alert alert-block alert-success">
Let's start with checking the schema of both datasets:

<p style="background:black">
<code style="background:black;color:white"> kamu inspect schema covid19.british-columbia.case-details
</code>
</p>
<p style="background:black">
<code style="background:black;color:white"> kamu inspect schema covid19.ontario.case-details
</code>
</p>
</div>

The datasets are similar but they do have somewhat different schemas, so it looks like we will need to **harmonize the data** to be able to combine them. We can do this using a **derivative dataset**, which we will define in another `.yaml` file.

<div class="alert alert-block alert-success">
Add the pre-made derivative dataset to your workspace with:

<p style="background:black">
<code style="background:black;color:white"> kamu add datasets/canada.case-details.yaml
</code>
</p>
</div>

This is what this file looks like:

```yaml
kind: DatasetSnapshot
version: 1
content:
  name: covid19.canada.case-details
  kind: Derivative
  # List of metadata events that get dataset into its initial state
  metadata:
    - kind: SetTransform
      # References the datasets that will be used as sources of data.
      inputs:
        - datasetRef: covid19.british-columbia.case-details
        - datasetRef: covid19.ontario.case-details
      # Transformation that will be applied to produce new data
      transform:
        kind: Sql
        engine: spark
        query: |
          SELECT
            "BC" as province,
            id,
            reported_date,
            sex as gender,
            case when age_group = '<10' then '<20'
                 when age_group = '10-19' then '<20' 
                 when age_group = '20-29' then '20s'
                 when age_group = '30-39' then '30s'
                 when age_group = '40-49' then '40s'
                 when age_group = '50-59' then '50s'
                 when age_group = '60-69' then '60s'
                 when age_group = '70-79' then '70s'
                 when age_group = '80-89' then '80s'
                 when age_group = '90+' then '90+'
                 else 'UNKNOWN' end as age_group,
            ha as location
            FROM `covid19.british-columbia.case-details`
          UNION ALL
          SELECT
            "ON" as province,
            id,
            case_reported_date as reported_date,
            case when lower(gender) = 'male' then 'M' 
                 when lower(gender) = 'female' then 'F' 
                 else 'U' end as gender,
            age_group,
            city as location
            FROM `covid19.ontario.case-details`
    - kind: SetVocab
      eventTimeColumn: reported_date
```

Derivative datasets are used to transform, combine, enrich and aggregate data from multiple sources. Unlike root datasets, they work only with data that is already in the system to **guarantee reproducible/verifiable results**.

In our case, the inputs are two root COVID-19 datasets from BC and Ontario. Our output is computed via an SQL query that harmonizes the dataset schemas and performs a `UNION ALL` operation.
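
To make the harmonization step more tangible, here is a minimal Python sketch of the same logic applied to plain dictionaries. It is only an illustration of what the SQL above does, not how `kamu` or Spark execute it. The function names and the sample rows are made up for this example.

```python
# Mapping of BC age brackets to the coarser labels used in the query above.
BC_AGE_GROUPS = {
    "<10": "<20", "10-19": "<20",
    "20-29": "20s", "30-39": "30s", "40-49": "40s", "50-59": "50s",
    "60-69": "60s", "70-79": "70s", "80-89": "80s", "90+": "90+",
}


def harmonize_bc(row):
    """Rename BC columns and recode age groups, like the first SELECT."""
    return {
        "province": "BC",
        "id": row["id"],
        "reported_date": row["reported_date"],
        "gender": row["sex"],
        "age_group": BC_AGE_GROUPS.get(row["age_group"], "UNKNOWN"),
        "location": row["ha"],
    }


def harmonize_on(row):
    """Rename Ontario columns and recode gender, like the second SELECT."""
    gender = {"male": "M", "female": "F"}.get((row["gender"] or "").lower(), "U")
    return {
        "province": "ON",
        "id": row["id"],
        "reported_date": row["case_reported_date"],
        "gender": gender,
        "age_group": row["age_group"],
        "location": row["city"],
    }


def union_all(bc_rows, on_rows):
    """Concatenate both harmonized inputs without removing duplicates."""
    return [harmonize_bc(r) for r in bc_rows] + [harmonize_on(r) for r in on_rows]


# Made-up sample rows, for illustration only.
bc_rows = [{"id": 1, "reported_date": "2020-01-29", "sex": "F",
            "age_group": "10-19", "ha": "Region A"}]
on_rows = [{"id": 7, "case_reported_date": "2020-01-30", "gender": "MALE",
            "age_group": "20s", "city": "City B"}]
combined = union_all(bc_rows, on_rows)
```

After harmonization both inputs share the same six columns, which is what makes the union possible.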

<div class="alert alert-block alert-success">
Just like before, the dataset is empty until we pull it:

<p style="background:black">
<code style="background:black;color:white"> kamu pull covid19.canada.case-details
</code>
</p>
</div>

This time, we are not fetching any external data while pulling. Instead, `kamu` will check which records from the input datasets have not been processed yet and it will feed them into the query.
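
The idea of "only feed unprocessed records into the query" can be pictured with a toy model. The sketch below assumes each input is an append-only list and remembers how many records were already consumed. The class `IncrementalTransform` and its bookkeeping are hypothetical and only illustrate the concept. The way `kamu` actually tracks this is not covered in this chapter.

```python
class IncrementalTransform:
    """Toy model: remembers how many records of each input were processed."""

    def __init__(self, query):
        self.query = query    # function (input_name, record) -> output record
        self.offsets = {}     # input name -> number of records already processed
        self.output = []      # accumulated results

    def pull(self, inputs):
        """Process only records that arrived since the previous pull."""
        processed = 0
        for name, records in inputs.items():
            start = self.offsets.get(name, 0)
            fresh = records[start:]
            self.output.extend(self.query(name, r) for r in fresh)
            self.offsets[name] = len(records)
            processed += len(fresh)
        return processed


# Made-up inputs: two append-only lists of records.
inputs = {"bc": ["case-1", "case-2"], "on": ["case-3"]}
transform = IncrementalTransform(lambda name, record: (name, record))

first = transform.pull(inputs)    # processes all three records
second = transform.pull(inputs)   # nothing new, so nothing is processed
inputs["on"].append("case-4")
third = transform.pull(inputs)    # only the newly appended record
```

The second call doing no work is the same "up-to-date" behavior you will see when pulling a derivative dataset whose inputs have not changed.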

## Keeping Data Up-to-Date

If you run the pull command again:

<p style="background:black">
<code style="background:black;color:white"> kamu pull covid19.canada.case-details
</code>
</p>

You will see that the dataset is up-to-date, because neither of its two inputs has changed.

By adding the `--recursive` flag you can instruct `kamu` to check for updates in every dataset that is part of the dependency tree.

<div class="alert alert-block alert-success">
Try running:

<p style="background:black">
<code style="background:black;color:white"> kamu pull --recursive covid19.canada.case-details
</code>
</p>
</div>

This operation will perform three steps:
- For `covid19.british-columbia.case-details` - check the **external website** for updates
- For `covid19.ontario.case-details` - check **the repository** for updates (since we pulled this dataset from a repo)
- For `covid19.canada.case-details` - **apply the transformation** query to any new data in the above

And just like that, <mark>with a single command you can keep a very large number of datasets fully up-to-date</mark>!

With `kamu`, you are always only a few clicks away from getting the latest data into your projects, notebooks and dashboards.
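
One way to picture the recursive pull described above is as a depth-first walk over the dependency tree, where inputs are refreshed before the datasets that depend on them. The following Python sketch is an assumption-laden illustration: the `DATASETS` table and the `plan_recursive_pull` function are invented for this example and are not part of `kamu`.

```python
# Hypothetical description of the workspace built in this chapter.
DATASETS = {
    "covid19.british-columbia.case-details": {
        "inputs": [],
        "action": "check external website",
    },
    "covid19.ontario.case-details": {
        "inputs": [],
        "action": "check repository",
    },
    "covid19.canada.case-details": {
        "inputs": [
            "covid19.british-columbia.case-details",
            "covid19.ontario.case-details",
        ],
        "action": "apply transformation",
    },
}


def plan_recursive_pull(name, datasets, plan=None):
    """Return (dataset, action) steps with every input listed before its consumer."""
    if plan is None:
        plan = []
    if any(step[0] == name for step in plan):
        return plan  # already scheduled through another path
    for dep in datasets[name]["inputs"]:
        plan_recursive_pull(dep, datasets, plan)
    plan.append((name, datasets[name]["action"]))
    return plan


plan = plan_recursive_pull("covid19.canada.case-details", DATASETS)
for dataset, action in plan:
    print(f"{dataset}: {action}")
```

The resulting plan lists the two root datasets first and the derivative dataset last, matching the three steps listed above.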

## Welcome to Stream Processing!

You might not have recognized it, but you've just been doing **stream processing**. 

As opposed to traditional batch processing (e.g. classic SQL, Pandas, or most other frameworks), all processing in `kamu` is streaming in nature.

<div class="alert alert-block alert-info">

**Have not used stream processing before?** It's a fairly young technology, but it's rapidly entering the data science and analytics space. A very simple overview of differences between streaming and batch can be found in [this talk recording](https://youtu.be/XxKnTusccUM?t=574).

</div>

To understand the benefits of this, let's create one more derivative dataset. This time we will use the Canada-wide dataset from the previous step to create the "total daily new cases" dataset, similar to the COVID statistics you used to hear on the news.

<div class="alert alert-block alert-success">
Add the pre-made derivative dataset to your workspace with:

<p style="background:black">
<code style="background:black;color:white"> kamu add datasets/canada.daily-cases.yaml
</code>
</p>
<p style="background:black">
<code style="background:black;color:white"> kamu pull covid19.canada.daily-cases
</code>
</p>
<p style="background:black">
<code style="background:black;color:white"> kamu tail covid19.canada.daily-cases
</code>
</p>
</div>

This dataset is defined as:

```yaml
kind: DatasetSnapshot
version: 1
content:
  name: covid19.canada.daily-cases
  kind: Derivative
  metadata:
    - kind: SetTransform
      inputs:
        - datasetRef: covid19.canada.case-details
      transform:
        kind: Sql
        engine: flink
        query: |
          SELECT
            TUMBLE_START(`reported_date`, INTERVAL '1' DAY) as `reported_date`,
            `province`,
            COUNT(*) as `total_daily`
          FROM `covid19.canada.case-details`
          GROUP BY TUMBLE(`reported_date`, INTERVAL '1' DAY), `province`
    - kind: SetVocab
      eventTimeColumn: reported_date
```

<div class="alert alert-block alert-info">

Notice that we already used **three different data processing engines**: DataFusion, Spark, and now Flink!
    
`kamu` allows you to use the individual strengths of [multiple different engines](https://docs.kamu.dev/cli/supported-engines/) and mix them within a single data pipeline.

</div>


Calculating the daily cases per province is an **aggregation**. In classic SQL we would write it as:

```sql
SELECT
  reported_date,
  province,
  COUNT(*) as total_daily
FROM `covid19.canada.case-details`
GROUP BY reported_date, province
```

But this type of query is flawed in many ways:
- Data from different provinces may be updated on **different cadences** and may **lag** by one or several days (e.g. the BC CDC does not update its dataset on weekends and statutory holidays)
- Data may be **out-of-order**, can be **back-filled**, or contain errors that were later **corrected**

<div class="alert alert-block alert-danger">

Naively executing this batch query is guaranteed to constantly produce **inaccurate results**. What's worse, the errors will be concentrated in the most recent data - data everyone cares about the most.

</div>

Stream processing takes all these temporal problems into account and will delay producing the result until we have certainty that the input data is complete. This is an amazing property that ensures that as we build more and more complex data pipelines we don't end up creating a massive cascade of incorrect data.
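
The difference can be sketched in a few lines of Python. The example below contrasts a naive batch count with a count that only emits days that are known to be complete. The "watermark" here is simply passed in as a parameter meaning "all events before this date have arrived". How a real engine such as Flink derives it is outside the scope of this sketch, and the function names and sample events are made up.

```python
from collections import Counter
from datetime import date


def batch_daily_counts(events):
    """Naive batch aggregation: counts whatever has arrived so far."""
    return dict(Counter((e["reported_date"], e["province"]) for e in events))


def streaming_daily_counts(events, watermark):
    """Emit only daily windows that end at or before the watermark.

    A one-day window for day `d` is complete once the watermark has moved
    past `d`, i.e. when `d < watermark`.
    """
    counts = batch_daily_counts(events)
    return {key: n for key, n in counts.items() if key[0] < watermark}


# Made-up events: BC has not reported anything for Jan 2 and Jan 3 yet.
events = [
    {"reported_date": date(2021, 1, 1), "province": "BC"},
    {"reported_date": date(2021, 1, 1), "province": "BC"},
    {"reported_date": date(2021, 1, 1), "province": "ON"},
    {"reported_date": date(2021, 1, 2), "province": "ON"},
    {"reported_date": date(2021, 1, 3), "province": "ON"},
]

batch = batch_daily_counts(events)
stream = streaming_daily_counts(events, watermark=date(2021, 1, 2))
```

The batch result already contains rows for Jan 2 and Jan 3 even though BC data for those days is still missing, while the streaming-style result holds them back until the watermark advances.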

Stream processing is a truly fascinating topic. We will have to park it for now, but be sure to later check out our examples and tutorials on this topic to learn more.

### Lineage

So things are getting a bit complicated. We have datasets building on top of datasets, building on yet more datasets...

<div class="alert alert-block alert-success">
If you ever lose your bearings, the <b>lineage</b> command will help you out:

<p style="background:black">
<code style="background:black;color:white"> kamu inspect lineage covid19.canada.daily-cases
</code>
</p>
</div>
    
This command shows you a graph of datasets and their dependencies. Thanks to the metadata, `kamu` knows exactly where every single bit of data came from, so lineage is **guaranteed** to be accurate.
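
Conceptually, a lineage graph is just the "inputs" relation between datasets followed transitively. The Python sketch below renders that relation as an indented tree for the datasets created in this chapter. The `INPUTS` table and the `lineage` function are hypothetical, and the printed layout is not the actual output format of `kamu inspect lineage`.

```python
# Which datasets each dataset reads from (taken from the manifests above).
INPUTS = {
    "covid19.canada.daily-cases": ["covid19.canada.case-details"],
    "covid19.canada.case-details": [
        "covid19.british-columbia.case-details",
        "covid19.ontario.case-details",
    ],
    "covid19.british-columbia.case-details": [],
    "covid19.ontario.case-details": [],
}


def lineage(name, inputs, depth=0):
    """Return the dataset and all of its upstream inputs as indented lines."""
    lines = ["  " * depth + name]
    for dep in inputs[name]:
        lines.extend(lineage(dep, inputs, depth + 1))
    return lines


tree = lineage("covid19.canada.daily-cases", INPUTS)
print("\n".join(tree))
```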

When you install `kamu` on your machine, you can also display lineage in a browser by running:

<p style="background:black">
<code style="background:black;color:white"> kamu inspect lineage --browse
</code>
</p>

It will look something like this:

![](files/lineage.png)

### Kamu Web UI
An even better option for exploring pipelines is [Kamu Web UI](https://docs.kamu.dev/platform/).

With `kamu` installed on your machine you can run it as:

<p style="background:black">
<code style="background:black;color:white"> kamu ui
</code>
</p>

The Kamu Node that we pushed data to and pulled data from also comes with this web interface!

<div class="alert alert-block alert-success">
Follow this link to check it out:
    
${KAMU_WEB_UI_URL}kamu/covid19.canada.case-details?tab=lineage
</div>

There you will see a COVID data pipeline built by our community that is a bigger version of what we built in this exercise:

![](files/web-ui.png)

<div class="alert alert-block alert-info">

The above is a taste of how `kamu` embraces **"local-first" design**:
- All features are available in a simple CLI tool
- You can build really complex pipelines on your laptop, without entangling your data in proprietary data platforms
- When your data becomes large - you can [**deploy your own Kamu Node**](https://docs.kamu.dev/node/quick-start/) in a cloud or on-prem and run your pipelines in a scalable data lake
- You can then invite other people to your node or **connect multiple nodes together** to collaborate on data.
    
This flexibility makes `kamu` the **world's first decentralized data lake and multi-party data processing network**.
</div>


---

## Up Next
🎉 Well done! 🎉

You have now discovered how `kamu` can be used to <mark>share</mark> data with other people, and how to easily keep all datasets <mark>up-to-date</mark>.

But besides having recent data - there's one very important component still missing:
- How can we **trust data** that we get from others?
- How can we reliably reuse their derivative datasets without fear of them introducing some mistakes, or worse, some **malicious** data?

Find out how `kamu` enables <mark>true collaboration and reuse</mark> in the next chapter!