<center>
<a href="https://github.com/kamu-data/kamu-cli">
<img alt="kamu" src="https://raw.githubusercontent.com/kamu-data/kamu-cli/master/docs/readme_files/kamu_logo.png" width=270/>
</a>
</center>

<br/>

<center><i>World's first decentralized real-time data warehouse, on your laptop</i></center>

<br/>

<div align="center">
<a href="https://docs.kamu.dev/cli/">Docs</a> | 
<a href="https://docs.kamu.dev/cli/learn/learning-materials/">Tutorials</a> | 
<a href="https://docs.kamu.dev/cli/learn/examples/">Examples</a> |
<a href="https://docs.kamu.dev/cli/get-started/faq/">FAQ</a> |
<a href="https://discord.gg/nU6TXRQNXC">Discord</a> |
<a href="https://kamu.dev">Website</a>
</div>


<center>

<br/>
<br/>
    
# 2. Sharing and Collaboration

</center>

<div class="alert alert-block alert-info">
If you skipped the previous chapter or are continuing after a break, use the following commands to get your environment ready for this chapter:
    
<p style="background:black">
<code style="background:black;color:white">kamu init
kamu add demo/datasets/ca.bccdc.covid19.case-details.yaml
kamu pull --all
</code>
</p>
</div>

## Repositories

After you have created your dataset, you might want to share it with other people. We can do so by using **repositories**. There are many kinds of repositories in `kamu`: some only provide storage capabilities (S3, GCS, FTP, ...), while others also support remote query execution.

In this demo environment we have an S3-compatible storage server running under the `minio` hostname. This server has a shared S3 bucket called `kamu-hub` that you can read from and write to.

<div class="alert alert-block alert-success">
So let's add this S3 bucket as a repository:
    
<p style="background:black">
<code style="background:black;color:white"> &dollar; kamu repo add kamu-hub s3+http://minio/kamu-hub
</code>
</p>
</div>

<div class="alert alert-block alert-success">
To see the list of all known repositories use:

<p style="background:black">
<code style="background:black;color:white"> &dollar; kamu repo list
</code>
</p>
</div>

<div class="alert alert-block alert-success">
Now we are ready to <b>push</b> your local dataset into the repository:

<p style="background:black">
<code style="background:black;color:white"> &dollar; kamu push ca.bccdc.covid19.case-details --as kamu-hub/ca.bccdc.covid19.case-details
</code>
</p>
</div>

<div class="alert alert-block alert-warning">

If you are doing this demo in a **shared environment**, somebody may have already pushed a dataset to the repository using the same name. `kamu` will not let you overwrite this dataset, so simply **prefix the alias with your username** to make it unique:

<p style="background:black">
<code style="background:black;color:white"> &dollar; kamu push ca.bccdc.covid19.case-details --as kamu-hub/&lt;your-github-username&gt;.ca.bccdc.covid19.case-details
</code>
</p>
</div>

<div class="alert alert-block alert-success">
This initial push creates an association between your local dataset and the repository, so next time you want to update the dataset you can simply run:

<p style="background:black">
<code style="background:black;color:white"> &dollar; kamu push ca.bccdc.covid19.case-details
</code>
</p>
</div>

## Discovering Data

Using the `search` command, we can also search for datasets in known repositories.

<div class="alert alert-block alert-success">
Give the search command a try:

<p style="background:black">
<code style="background:black;color:white"> &dollar; kamu search covid
</code>
</p>
</div>

You should see your dataset in this list, and also some other COVID-19 datasets from other provinces that were added by other people. 

<div class="alert alert-block alert-success">
Let's pull one of these datasets into our workspace, and have a look:

<p style="background:black">
<code style="background:black;color:white"> &dollar; kamu pull kamu-hub/ca.ontario.data.covid19.case-details
</code>
</p>
<p style="background:black">
<code style="background:black;color:white"> &dollar; kamu tail ca.ontario.data.covid19.case-details
</code>
</p>
</div>

Ah! This dataset is very similar to the dataset that we created. It also contains individual case data, but from the province of Ontario.

## Derivative Datasets

Canada has 13 provinces and territories, so while you could run data analysis on each of them separately, things would be much simpler if we had one Canada-wide dataset. I'm sure other researchers would also appreciate having such a dataset, so let's create one!

<div class="alert alert-block alert-success">
Let's start by checking the schemas of both datasets:

<p style="background:black">
<code style="background:black;color:white"> &dollar; kamu inspect schema ca.bccdc.covid19.case-details
</code>
</p>
<p style="background:black">
<code style="background:black;color:white"> &dollar; kamu inspect schema ca.ontario.data.covid19.case-details
</code>
</p>
</div>

The datasets are similar but they do have somewhat different schemas, so it looks like we will need to **harmonize the data** to be able to combine them. We'll do all this using a **derivative dataset**, which we will define in another `.yaml` file. 

<div class="alert alert-block alert-success">
Add the pre-made derivative dataset to your workspace with:

<p style="background:black">
<code style="background:black;color:white"> &dollar; kamu add demo/datasets/ca.covid19.case-details.yaml
</code>
</p>
</div>

This is what this file looks like:

```yaml
version: 1
kind: DatasetSnapshot
content:
  id: ca.covid19.case-details
  source:
    kind: derivative
    inputs:
      - ca.bccdc.covid19.case-details
      - ca.ontario.data.covid19.case-details
    transform:
      kind: sql
      engine: spark
      query: >
        SELECT
          "BC" as province,
          id,
          reported_date,
          sex as gender,
          Case when age_group = '<10' then '<20'
               when age_group = '10-19' then '<20' 
               when age_group = '20-29' then '20s'
               when age_group = '30-39' then '30s'
               when age_group = '40-49' then '40s'
               when age_group = '50-59' then '50s'
               when age_group = '60-69' then '60s'
               when age_group = '70-79' then '70s'
               when age_group = '80-89' then '80s'
               when age_group = '90+' then '90+'
               else 'UNKNOWN' end as age_group,
          ha as location
          FROM `ca.bccdc.covid19.case-details`
        UNION ALL
        SELECT
          "ON" as province,
          id,
          case_reported_date as reported_date,
          Case when lower(gender) = 'male' then 'M' 
               when lower(gender) = 'female' then 'F' 
               else 'U' end as gender,
          age_group,
          city as location
          FROM `ca.ontario.data.covid19.case-details`
  vocab:
    eventTimeColumn: reported_date
```

Unlike root datasets, derivative datasets work on data that is **already in the system**. They can transform, combine, enrich and aggregate data from multiple sources. 

In our case, the inputs are two root COVID-19 datasets from BC and Ontario. Our output is computed via an SQL query that harmonizes the dataset schemas and performs a `UNION ALL` operation.
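
If SQL is not your first language, the harmonization step in the query above can be read as two small row-mapping functions followed by a concatenation. The Python below is only a rough sketch of that logic: `kamu` runs the SQL inside an engine, not this code, and the function names (`harmonize_bc`, `harmonize_on`, `union_all`) are made up for illustration. The column names and mappings are taken from the query.

```python
# Illustrative sketch only: mirrors the SQL above, it is not how kamu executes it.

# BC age groups are remapped to the labels used in the Ontario data.
BC_AGE_GROUPS = {
    "<10": "<20", "10-19": "<20",
    "20-29": "20s", "30-39": "30s", "40-49": "40s", "50-59": "50s",
    "60-69": "60s", "70-79": "70s", "80-89": "80s", "90+": "90+",
}

# Ontario spells gender out; the combined dataset uses single letters.
ON_GENDERS = {"male": "M", "female": "F"}


def harmonize_bc(row: dict) -> dict:
    """Map one BC record onto the common schema."""
    return {
        "province": "BC",
        "id": row["id"],
        "reported_date": row["reported_date"],
        "gender": row["sex"],
        "age_group": BC_AGE_GROUPS.get(row["age_group"], "UNKNOWN"),
        "location": row["ha"],
    }


def harmonize_on(row: dict) -> dict:
    """Map one Ontario record onto the common schema."""
    return {
        "province": "ON",
        "id": row["id"],
        "reported_date": row["case_reported_date"],
        # A missing gender falls through to 'U', like the ELSE branch in SQL.
        "gender": ON_GENDERS.get((row["gender"] or "").lower(), "U"),
        "age_group": row["age_group"],
        "location": row["city"],
    }


def union_all(bc_rows: list, on_rows: list) -> list:
    """UNION ALL: keep every row from both inputs, duplicates included."""
    return [harmonize_bc(r) for r in bc_rows] + [harmonize_on(r) for r in on_rows]
```

Once both sides produce rows with the same six columns, combining them is trivial, which is exactly why the harmonization has to happen first.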

<div class="alert alert-block alert-success">
Just like before, the dataset is empty until we pull it:

<p style="background:black">
<code style="background:black;color:white"> &dollar; kamu pull ca.covid19.case-details
</code>
</p>
</div>

This time, we are not fetching any external data while pulling. Instead, `kamu` will check which records from the input datasets have not been processed yet and feed them into the query.
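
To build some intuition for "only feed in what has not been processed yet", here is a toy Python model. It is a sketch under heavy assumptions: a single input, records kept in a plain list, and a simple counter as the bookmark. The class `DerivativeDataset` and its fields are hypothetical and say nothing about how `kamu` actually stores its metadata.

```python
class DerivativeDataset:
    """Toy model of a derivative dataset that remembers how far it has read."""

    def __init__(self, transform):
        self.transform = transform  # function applied to every new input record
        self.processed_upto = 0     # how many input records were already consumed
        self.records = []           # output produced so far

    def pull(self, input_records: list) -> str:
        # Only look at records that arrived since the previous pull.
        new = input_records[self.processed_upto:]
        if not new:
            return "up-to-date"
        self.records.extend(self.transform(r) for r in new)
        self.processed_upto = len(input_records)
        return f"{len(new)} new record(s) processed"
```

Calling `pull` twice in a row on an unchanged input does the work once and then reports `up-to-date`; appending a record to the input makes the next `pull` process just that one record.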

## Keeping data up-to-date

If you run the pull command again:

<p style="background:black">
<code style="background:black;color:white"> &dollar; kamu pull ca.covid19.case-details
</code>
</p>

You will see that the dataset is up-to-date, because neither of its two inputs has changed.

By adding the `--recursive` flag you can instruct `kamu` to check for updates in every dataset that is part of the dependency tree.

<div class="alert alert-block alert-success">
Try running:

<p style="background:black">
<code style="background:black;color:white"> &dollar; kamu pull --recursive ca.covid19.case-details
</code>
</p>
</div>

This operation will perform three steps:
- For `ca.bccdc.covid19.case-details` - check the **external website** for updates
- For `ca.ontario.data.covid19.case-details` - check **the repository** for updates (since we pulled this dataset from a repo)
- For `ca.covid19.case-details` - **apply the transformation** query to any new data in the above

And just like that, <mark>with a single command you can keep a very large number of datasets fully up-to-date</mark>!

With `kamu`, you are always a few clicks away from getting the latest data into your projects, notebooks and dashboards.
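
The important detail in a recursive update is the order: every input has to be refreshed before the dataset that depends on it. A minimal sketch of that ordering, assuming the dependency graph is a plain dictionary built from the `inputs` lists shown earlier (the `update_order` helper is hypothetical, not part of `kamu`):

```python
# Dataset -> list of its inputs, as declared in the YAML files above.
DEPENDENCIES = {
    "ca.bccdc.covid19.case-details": [],
    "ca.ontario.data.covid19.case-details": [],
    "ca.covid19.case-details": [
        "ca.bccdc.covid19.case-details",
        "ca.ontario.data.covid19.case-details",
    ],
}


def update_order(target: str, deps: dict = DEPENDENCIES) -> list:
    """Return datasets in an order where inputs always come before dependents."""
    order, seen = [], set()

    def visit(name: str) -> None:
        if name in seen:
            return
        seen.add(name)
        for upstream in deps[name]:
            visit(upstream)      # refresh inputs first
        order.append(name)       # then the dataset itself

    visit(target)
    return order
```

For `ca.covid19.case-details` this yields the two root datasets first and the derivative dataset last, matching the three steps listed above.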

## Welcome to Stream Processing!

You might not have recognized it, but you've just been doing **stream processing**. 

As opposed to traditional batch processing (e.g. classic SQL, Pandas, or most other frameworks), all processing in `kamu` is streaming in nature.

<div class="alert alert-block alert-info">

**Have not used stream processing before?** It's a fairly young technology, but it's rapidly entering the data science and analytics space. A very simple overview of the differences between streaming and batch can be found in [this talk recording](https://youtu.be/XxKnTusccUM?t=574).

</div>

To understand the benefits of this, let's create one more derivative dataset. This time we will use the Canada-wide dataset from the previous step to create the "total daily new cases" dataset, similar to the COVID statistics you hear on the news nowadays.

<div class="alert alert-block alert-success">
Add the pre-made derivative dataset to your workspace with:

<p style="background:black">
<code style="background:black;color:white"> &dollar; kamu add demo/datasets/ca.covid19.daily-cases.yaml
</code>
</p>
<p style="background:black">
<code style="background:black;color:white"> &dollar; kamu pull ca.covid19.daily-cases
</code>
</p>
<p style="background:black">
<code style="background:black;color:white"> &dollar; kamu tail ca.covid19.daily-cases
</code>
</p>
</div>

This dataset is defined as:

```yaml
version: 1
kind: DatasetSnapshot
content:
  id: ca.covid19.daily-cases
  source:
    kind: derivative
    inputs:
      - ca.covid19.case-details
    transform:
      kind: sql
      engine: flink
      query: >
        SELECT
          TUMBLE_START(`reported_date`, INTERVAL '1' DAY) as `reported_date`,
          `province`,
          COUNT(*) as `total_daily`
        FROM `ca.covid19.case-details`
        GROUP BY TUMBLE(`reported_date`, INTERVAL '1' DAY), `province`
  vocab:
    eventTimeColumn: reported_date
```

<div class="alert alert-block alert-info">

**Notice** that previously we were using Apache Spark but have switched over to the Apache Flink engine for this query. `kamu` already supports [multiple different frameworks](https://docs.kamu.dev/cli/transform/supported-engines/) and can be easily extended to support more.

</div>


Calculating the daily cases per province is an **aggregation**. In classic SQL we would write it as:

```sql
SELECT
  reported_date,
  province,
  COUNT(*) as total_daily
FROM `ca.covid19.case-details`
GROUP BY reported_date, province
```

But this type of query is flawed in many ways:
- Data from different provinces may be updated on **different cadences**
- It may **lag** by one or several days (e.g. CDC of BC does not update their dataset on weekends)
- Data may be **out-of-order**

<div class="alert alert-block alert-danger">

Executing this batch query is guaranteed to constantly produce **inaccurate results**. What's worse, the errors will be concentrated in the most recent data - data everyone cares about the most.

</div>

Stream processing takes all these temporal problems into account and delays producing a result until we can be certain that the input data is complete. This is an amazing property that ensures that as we build more and more complex data pipelines we don't end up creating a massive cascade of incorrect data.
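
To make "wait until the input is complete" a little more concrete, here is a toy daily counter in Python. It keeps each day's window open and only emits a total once a *watermark* has moved past that day. Treat it as a sketch: the `DailyCounter` class and the `allowed_lag_days` parameter are hypothetical, and the fixed-lag watermark rule is a simplifying assumption, not a description of how Flink or `kamu` derive watermarks.

```python
from collections import Counter
from datetime import date, timedelta


class DailyCounter:
    """Counts events per (day, province), emitting a day only once it is final."""

    def __init__(self, allowed_lag_days: int):
        self.allowed_lag = timedelta(days=allowed_lag_days)
        self.open_windows = Counter()  # (day, province) -> count so far
        self.watermark = date.min      # days before this are considered complete

    def add(self, reported_date: date, province: str) -> list:
        """Ingest one event and return totals for windows that just became final."""
        if reported_date < self.watermark:
            # Too late: that day's total was already emitted (this toy drops it).
            return []
        self.open_windows[(reported_date, province)] += 1
        # Assumption: data may lag by at most `allowed_lag` behind the newest event.
        self.watermark = max(self.watermark, reported_date - self.allowed_lag)
        final = sorted(key for key in self.open_windows if key[0] < self.watermark)
        return [(day, prov, self.open_windows.pop((day, prov))) for day, prov in final]
```

With `allowed_lag_days=1`, an event for March 1 that arrives after an event for March 2 is still counted, because the March 1 window is not emitted until an event for March 3 pushes the watermark forward. The batch query, by contrast, would have reported March 1 as soon as it ran.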

Stream processing is a truly fascinating topic. We will have to park it for now, but be sure to later check out our examples and tutorials on this topic to learn more.

## Lineage

So things are getting a bit complicated. We have datasets building on top of datasets, building on yet more datasets...

<div class="alert alert-block alert-success">
If you ever lose your bearings, the <b>lineage</b> command will help you out:

<p style="background:black">
<code style="background:black;color:white"> &dollar; kamu inspect lineage ca.covid19.daily-cases
</code>
</p>
</div>
    
This command shows you a graph of datasets and their dependencies. Thanks to the metadata, `kamu` knows exactly where every single bit of data came from, so lineage is **guaranteed** to be accurate.
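
Conceptually, the lineage of a dataset is just the `inputs` declarations followed all the way up to the root datasets. A small Python sketch of that walk, assuming the graph from this chapter is written out by hand as a dictionary (the `lineage_lines` helper and the indented text format are hypothetical, and are not the actual output of the `kamu` command):

```python
# Dataset -> its declared inputs, copied from the YAML files in this chapter.
INPUTS = {
    "ca.covid19.daily-cases": ["ca.covid19.case-details"],
    "ca.covid19.case-details": [
        "ca.bccdc.covid19.case-details",
        "ca.ontario.data.covid19.case-details",
    ],
    "ca.bccdc.covid19.case-details": [],
    "ca.ontario.data.covid19.case-details": [],
}


def lineage_lines(name: str, depth: int = 0) -> list:
    """Return the dataset and everything upstream of it as indented lines."""
    lines = ["  " * depth + name]
    for upstream in INPUTS[name]:
        lines.extend(lineage_lines(upstream, depth + 1))
    return lines
```

Running `print("\n".join(lineage_lines("ca.covid19.daily-cases")))` lists the daily-cases dataset, then the Canada-wide dataset beneath it, then the two provincial root datasets beneath that.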

When you install `kamu` on your desktop, you can also display lineage in a browser by running:

<p style="background:black">
<code style="background:black;color:white"> &dollar; kamu inspect lineage --browse
</code>
</p>

It will look something like this:

![](files/lineage.png)

---

## Up Next
🎉 Well done! 🎉

You have now discovered how `kamu` can be used to <mark>share</mark> data with other people, and how to easily keep all datasets <mark>up-to-date</mark>.

But besides having recent data, there's one very important component still missing. How can we trust data that we get from others? How can we reliably reuse their derivative datasets without fear of them introducing some mistakes or, worse, some malicious data?

Find out how `kamu` enables <mark>true collaboration and reuse</mark> in the next chapter!