<center>
<a href="https://github.com/kamu-data/kamu-cli">
<img alt="kamu" src="https://raw.githubusercontent.com/kamu-data/kamu-cli/master/docs/readme_files/kamu_logo.png" width=270/>
</a>
</center>

<br/>

<center><i>World's first decentralized real-time data warehouse, on your laptop</i></center>

<br/>

<div align="center">
<a href="https://github.com/kamu-data/kamu-cli">Repo</a> | 
<a href="https://docs.kamu.dev/cli/">Docs</a> | 
<a href="https://docs.kamu.dev/cli/learn/learning-materials/">Tutorials</a> | 
<a href="https://docs.kamu.dev/cli/learn/examples/">Examples</a> |
<a href="https://docs.kamu.dev/cli/get-started/faq/">FAQ</a> |
<a href="https://discord.gg/nU6TXRQNXC">Discord</a> |
<a href="https://kamu.dev">Website</a>
</div>


<center>

<br/>
<br/>
    
# 1. Introduction

</center>

## Welcome

Hi, and thank you for checking out [kamu](https://github.com/kamu-data/kamu-cli) - the **new generation data management tool**!

This environment comes with the `kamu` command-line tool pre-installed, so give it a try now.

<div class="alert alert-block alert-success">
<b>Your turn:</b> Open the <b>Terminal</b> tab in Jupyter and run:

<p style="background:black">
<code style="background:black;color:white">kamu
</code>
</p>
</div>

<div class="alert alert-block alert-warning">
<details>
<summary style="display:list-item"><b>New to Jupyter?</b></summary>

* Go back to Jupyter's <b>main tab</b> that shows the list of files
* In the top right corner click <b>New -> Terminal</b>
* Now you can switch between the terminal tab and this lesson as you continue

</details>
</div>

## What is Kamu for?

[Kamu](https://github.com/kamu-data/kamu-cli) is a tool based on the [Open Data Fabric](http://opendatafabric.org/) protocol that connects publishers and consumers of data into a <mark>decentralized data supply chain</mark>. It lets you get data quickly, in a form that is ready for analysis and ML tasks, while ensuring that the data is trustworthy and easy to keep up to date.

In this demo, we are going to explore some of the key features of `kamu` through <mark>real-world examples</mark>.

<div class="alert alert-block alert-warning">
<b>Short on time?</b> See <a href="https://www.youtube.com/watch?v=oUTiWW6W78A&list=PLV91cS45lwVG20Hicztbv7hsjN6x69MJk">this video</a> for a quick tour of key features.
</div>

If you have any questions during this demo, you can chat with us on [Discord](https://discord.gg/nU6TXRQNXC) or create an issue in the [kamu-cli](https://github.com/kamu-data/kamu-cli) GitHub repository.

## Workspaces

You start working with `kamu` by creating a workspace. A workspace is just a directory where `kamu` stores data and metadata of the datasets.

<div class="alert alert-block alert-success">
Go ahead and create your first workspace (we'll do it right in this chapter's directory):<br/>

<p style="background:black">
<code style="background:black;color:white">cd "01 - Kamu Basics (COVID-19 example)"
kamu init
</code>
</p>
</div>

<div class="alert alert-block alert-info">
<b>Note:</b> Similarly to <code>git</code>, this creates a <code>.kamu</code> directory in the folder where you ran the command.
</div>

Your new workspace is currently empty. Confirm that by running:

<p style="background:black">
<code style="background:black;color:white">kamu list
</code>
</p>

## Adding the first dataset
Let's add some new datasets to our workspace!

In `kamu`, datasets are defined using `.yaml` files. You can import them with the `kamu add` command.

In this demo, we are going to work with some disaggregated COVID-19 datasets published by different provinces of Canada.

<div class="alert alert-block alert-success">
To add a dataset to the workspace run:<br/>

<p style="background:black">
<code style="background:black;color:white">kamu add demo/datasets/british-columbia.case-details.yaml
</code>
</p>
</div>

<div class="alert alert-block alert-info">
<b>Note:</b> Every command in <code>kamu</code> is well documented. Try running <code>kamu add -h</code> or <code>kamu add --help</code> to see all the parameters and useful examples.
</div>

This particular dataset contains case data from British Columbia, including the day, age group, gender, and area where each case was registered. Datasets like this one, which ingest or receive external data, are called `root` datasets. They contain valuable data that cannot be reconstructed if lost (also known as **source data**).

The dataset definition file looks like this:

```yaml
kind: DatasetSnapshot
version: 1
content:
  name: british-columbia.case-details
  kind: root
  metadata:
    # Specifies the source of data that can be periodically polled to refresh the dataset
    # See: https://github.com/kamu-data/open-data-fabric/blob/master/open-data-fabric.md#setpollingsource-schema
    - kind: setPollingSource
      # Where to fetch the data from.
      fetch:
        kind: url
        url: http://www.bccdc.ca/Health-Info-Site/Documents/BCCDC_COVID19_Dashboard_Case_Details.csv
      # How to parse the data.
      read:
        kind: csv
        separator: ","
        header: true
        nullValue: ""
      # Pre-processing query that shapes the data (optional)
      preprocess:
        kind: sql
        engine: spark
        query: >
          SELECT
            CAST(UNIX_TIMESTAMP(Reported_Date, "yyyy-MM-dd") as TIMESTAMP) as reported_date,
            Classification_Reported as classification,
            id,
            ha,
            sex,
            age_group
          FROM input
      # How to combine newly-ingested data with data that is already in the dataset
      merge:
        kind: ledger
        primaryKey:
          - id
    # Tells kamu to use `reported_date` column as event time instead of the default `event_time`
    - kind: setVocab
      eventTimeColumn: reported_date
```

<div class="alert alert-block alert-info">

Detailed schemas for `DatasetSnapshot` and other metadata can be found in the [ODF reference](https://github.com/kamu-data/open-data-fabric/blob/master/open-data-fabric.md#reference-information). When creating a new dataset you can also use the `kamu new` command to start with an annotated template.

</div>

As you can see, it tells `kamu` where to fetch the data from, what type of data to expect, and which pre-processing steps are needed to shape the data into a nice typed schema.

<div class="alert alert-block alert-info">
<b>Note:</b> Kamu strictly follows the <b>"data as code"</b> philosophy, in which you never alter the data directly. Instead, you express all transformations with queries (SQL in this case).
</div>
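
To make the read and preprocess steps more concrete, here is a rough plain-Python sketch of what they describe. It is only an illustration, not how `kamu` or Spark implement ingestion: the sample rows are made up, and the helper names `read_rows` and `preprocess` are hypothetical.

```python
import csv
import io
from datetime import datetime

# Made-up sample rows that only mimic the column names used in the
# preprocess query above; they are not real records from the source.
RAW_CSV = """Reported_Date,Classification_Reported,id,ha,sex,age_group
2020-01-29,Lab-diagnosed,1,Vancouver Coastal,M,40-49
2020-02-06,Lab-diagnosed,2,Fraser,F,
"""


def read_rows(text, null_value=""):
    """Read step: parse CSV with a header row, mapping null_value to None."""
    reader = csv.DictReader(io.StringIO(text), delimiter=",")
    return [
        {key: (None if value == null_value else value) for key, value in row.items()}
        for row in reader
    ]


def preprocess(row):
    """Preprocess step: rename columns and parse the date, like the SQL query."""
    return {
        "reported_date": datetime.strptime(row["Reported_Date"], "%Y-%m-%d"),
        "classification": row["Classification_Reported"],
        "id": row["id"],
        "ha": row["ha"],
        "sex": row["sex"],
        "age_group": row["age_group"],
    }


records = [preprocess(row) for row in read_rows(RAW_CSV)]
for record in records:
    print(record)
```

Note how the empty `age_group` field in the second row becomes `None`, which mirrors the `nullValue: ""` setting of the read step.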

### Pulling the data in

We can now see the dataset in our workspace, but it is still empty:

<p style="background:black">
<code style="background:black;color:white">kamu list
</code>
</p>

We told `kamu` where to get the data from, but have not fetched it yet.

<div class="alert alert-block alert-success">
So let's run the following command to fetch data:<br/>

<p style="background:black">
<code style="background:black;color:white">kamu pull british-columbia.case-details
</code>
</p>
</div>

<div class="alert alert-block alert-info">
Make sure to use <b>shell completions</b>, they will save you a lot of typing!
<p style="background:black">
<code style="background:black;color:white">kamu pull br&lt;TAB&gt;
</code>
</p>
</div>

While the command runs, `kamu` fetches the data from its source, then reads and preprocesses it as specified.

<div class="alert alert-block alert-success">
Once completed, we can use the <code>tail</code> command to see a sample of the new data:

<p style="background:black">
<code style="background:black;color:white">kamu tail british-columbia.case-details
</code>
</p>
</div>

## Ledger nature of Data and Metadata 

A very important aspect of `kamu` is that it stores the history of data, not just snapshots. If we run the `pull` command on this dataset tomorrow, it will only add the records that were not previously observed.

<div class="alert alert-block alert-success">
Run the <code>pull</code> command again to verify that our data is still up-to-date with the source:

<p style="background:black">
<code style="background:black;color:white">kamu pull british-columbia.case-details
</code>
</p>
</div>

In other words, in `kamu` <mark>data is a ledger</mark> - an append-only record of events where past events never change (are immutable). Exactly how external data is transformed into a ledger is determined by the merge strategies documented [here](https://github.com/kamu-data/kamu-cli/blob/master/docs/merge_strategies.md).
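
As a rough illustration of the idea behind the `ledger` strategy used in our dataset, here is a minimal Python sketch that appends only the records whose primary key has not been seen before. This is a simplification under our own assumptions, not the actual `kamu` implementation; the function name `merge_ledger` and the sample records are made up.

```python
def merge_ledger(existing, incoming, primary_key):
    """Return only the incoming records whose primary key was never seen before."""
    seen = {tuple(record[k] for k in primary_key) for record in existing}
    new_records = []
    for record in incoming:
        key = tuple(record[k] for k in primary_key)
        if key not in seen:
            seen.add(key)
            new_records.append(record)
    return new_records


dataset = []

# First pull: everything is new.
day1 = [{"id": 1, "ha": "Fraser"}, {"id": 2, "ha": "Interior"}]
dataset += merge_ledger(dataset, day1, ["id"])

# Second pull: the source still contains the old records plus one new case.
day2 = day1 + [{"id": 3, "ha": "Northern"}]
added = merge_ledger(dataset, day2, ["id"])
dataset += added

print(len(added), len(dataset))  # prints "1 3"
```

Existing records are never modified or removed; the second pull only appends the one record that was not previously observed.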

Additionally, every event that affects the dataset is stored in the so-called **metadata chain**.

<div class="alert alert-block alert-success">
Inspect the metadata chain using the <code>log</code> command ("Q" to close):

<p style="background:black">
<code style="background:black;color:white">kamu log british-columbia.case-details
</code>
</p>
</div>

As you can see, **metadata is also a ledger**! There are four metadata **blocks** (in bottom-up order):
- `Seed` - establishes a globally-unique cryptographic identity of the dataset
- `SetPollingSource` - declares how external data should be ingested
- `SetVocab` - renames some system columns (in our case specifies that `reported_date` column should be treated as an event time)
- `AddData` - shows N new records that were added during the last `pull` command and hashes of the files that were produced.

You can think of the metadata chain as a `git` commit log, except that instead of the data itself it stores an accurate **history of events** that affected how the dataset looks throughout its entire lifetime.
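
To build some intuition for the `git` analogy, here is a small Python sketch of an append-only chain in which every block references the hash of the previous one. The block layout, the JSON encoding, the use of SHA-256, and the record count are all assumptions made for illustration; the real block format is defined by the ODF specification.

```python
import hashlib
import json


def block_hash(prev_hash, event):
    """Hash a block body made of the previous block's hash and the event."""
    body = {"prev": prev_hash, "event": event}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_block(chain, event):
    """Append an event as a new block linked to the current head of the chain."""
    prev_hash = chain[-1]["hash"] if chain else None
    chain.append({"prev": prev_hash, "event": event, "hash": block_hash(prev_hash, event)})


def verify(chain):
    """Check that every block links to its predecessor and matches its own hash."""
    prev_hash = None
    for block in chain:
        if block["prev"] != prev_hash:
            return False
        if block["hash"] != block_hash(block["prev"], block["event"]):
            return False
        prev_hash = block["hash"]
    return True


chain = []
for event in [
    {"kind": "Seed"},
    {"kind": "SetPollingSource"},
    {"kind": "SetVocab", "eventTimeColumn": "reported_date"},
    {"kind": "AddData", "records": 100},  # hypothetical record count
]:
    append_block(chain, event)

print(verify(chain))  # True

# Rewriting history breaks the chain.
chain[1]["event"]["kind"] = "Tampered"
print(verify(chain))  # False
```

Because each block depends on the hash of the one before it, changing any past event invalidates everything that follows, which is what makes such a history trustworthy.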

## Analyzing Data

Getting raw data in is just a small first step on our journey towards collaboration on data, but before we continue, let's take a quick break and see how you can analyze the data that we already have.

### SQL Shell

Kamu has a built-in SQL shell which you can start by running:

<p style="background:black">
<code style="background:black;color:white">kamu sql
</code>
</p>

<div class="alert alert-block alert-info">
The default SQL shell is based on <a href="https://spark.apache.org/">Apache Spark</a>.
</div>

<div class="alert alert-block alert-success">
Once the shell starts, try the following queries:

<p style="background:black">
<code style="background:black;color:white">&gt; show tables;</code>
</p>
<p style="background:black">
<code style="background:black;color:white">&gt; describe `british-columbia.case-details`;</code>
</p>
<p style="background:black">
<code style="background:black;color:white">&gt; select * from `british-columbia.case-details` limit 10;</code>
</p>
</div>

<div class="alert alert-block alert-success">

Now try writing a query that shows the total number of cases for each region (`ha` column).

</div>

Press **Ctrl + D** to exit.

### Notebooks

When you install `kamu` on your computer, you can use the `kamu notebook` command to start an integrated Jupyter Notebook environment, identical to the one you are currently using.

Since we're already in the notebook environment - let's give this integration a try!

<div class="alert alert-block alert-success">
Start by loading the <code>kamu</code> Jupyter extension:
</div>

In [None]:
%load_ext kamu

<div class="alert alert-block alert-warning">
<details>
<summary style="display:list-item"><b>New to Jupyter?</b></summary>

Jupyter notebooks contain cells that are **executable**, so static text can be mixed with computations and data visualization.

**You** are in control of what runs when, so you'll need to **select the code cell above** and then click the **"Run"** button on the top panel, or press `Shift + Enter`.

</details>
</div>

We can now import the dataset we have in our workspace into this notebook environment. We can also give it a less verbose alias.

<div class="alert alert-block alert-success">
Run the cell below to import the dataset (this may take 15 seconds or so the first time):
</div>

In [None]:
%import_dataset british-columbia.case-details --alias bc_covid19

<div class="alert alert-block alert-success">
To see the schema and number of records in the dataset run:
</div>

In [None]:
bc_covid19.printSchema()
bc_covid19.count()

<div class="alert alert-block alert-info">
<details>
<summary style="display:list-item"><b>What did we just run?</b></summary>

The code you type into a regular cell is executed by the [PySpark](https://spark.apache.org/docs/latest/api/python/) server that `kamu` runs when you are working with notebooks.

So it's Python code, but it is **executed remotely**, not in the notebook kernel. We will discuss the benefits of this later.

</details>
</div>

You can use the `%%sql` cell command to run SQL queries on the imported datasets.

<div class="alert alert-block alert-success">
To see a sample of data run:
</div>

In [None]:
%%sql
select * from bc_covid19 
order by reported_date desc
limit 5

<div class="alert alert-block alert-info">
<details>
<summary style="display:list-item"><b>What did we just run?</b></summary>

Similarly to the PySpark code, the queries in `%%sql` cells are sent to and executed by the Spark SQL engine. The results are then returned back to the notebook kernel.

</details>
</div>

<div class="alert alert-block alert-success">
Let's run this simple SQL query to build a histogram of cases by age group:
</div>

In [None]:
%%sql
select
    age_group,
    count(*) as case_count 
from bc_covid19
group by age_group

<div class="alert alert-block alert-success">
    
Once you get the results, try using the built-in data visualizer to plot the data as a **bar chart**

</div>

SQL is great for shaping and aggregating data, but for more advanced processing or visualizations you might need more tools. Using the `-o <variable_name>` parameter of the `%%sql` command, we can ask for the result of a query to be returned into the notebook as a **Pandas dataframe**.

<div class="alert alert-block alert-success">

Let's count the number of cases per day and pull the result from Spark into our notebook:
    
</div>

In [None]:
%%sql -o df
select
    reported_date as date,
    count(*) as case_count
from bc_covid19
group by date
order by date

We now have a variable `df` containing the data as a Pandas dataframe, and you are free to do anything with it that you'd normally do in Jupyter.

<div class="alert alert-block alert-warning">

Note that if you just type `df` in a cell, you will get an error. That's because by default this kernel executes operations in the remote PySpark environment. To access `df` you need to use the `%%local` cell command, which executes code in this local Python kernel.
    
</div>

This environment already comes with some popular plotting libraries pre-installed (like `plotly`, `bokeh`, `mapbox`, etc.), but if your favorite library is missing, you can always `pip install` it from the terminal.

<div class="alert alert-block alert-success">
    
Let's do some basic plotting:
    
</div>

In [None]:
%%local
import plotly.express as px

fig = px.scatter(
    df, x="date", y="case_count", 
    trendline="rolling", trendline_options=dict(window=7), 
    trendline_color_override="red")

fig.show()
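
If you are curious what the red trendline represents, the sketch below computes a trailing 7-day average in plain Python. It assumes the rolling trendline is a simple mean over the last `window` points (which, as far as we know, is the default in `plotly`); the function name `rolling_mean` and the daily counts are made up for illustration.

```python
def rolling_mean(values, window=7):
    """Mean of each trailing window; None until a full window is available."""
    result = []
    for i in range(len(values)):
        if i + 1 < window:
            result.append(None)
        else:
            chunk = values[i + 1 - window : i + 1]
            result.append(sum(chunk) / window)
    return result


daily_cases = [3, 5, 4, 8, 6, 7, 9, 12, 10]  # made-up daily counts
print(rolling_mean(daily_cases))
```

The first six entries are `None` because a 7-day window is not yet full; after that each value smooths out day-to-day noise in the case counts.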

---

## Up Next
🎉 Well done so far! 🎉

Now that we have covered the basics of root datasets and data exploration, you are ready to move on to the next chapter, where we will take a look at <mark>the key feature</mark> of `kamu` - **data collaboration**!