<center>
<a href="https://github.com/kamu-data/kamu-cli">
<img alt="kamu" src="https://raw.githubusercontent.com/kamu-data/kamu-cli/master/docs/readme_files/kamu_logo.png" width=270/>
</a>
</center>

<br/>

<center><i>World's first decentralized real-time data warehouse, on your laptop</i></center>

<br/>

<div align="center">
<a href="https://docs.kamu.dev/cli/">Docs</a> | 
<a href="https://docs.kamu.dev/cli/learn/learning-materials/">Tutorials</a> | 
<a href="https://docs.kamu.dev/cli/learn/examples/">Examples</a> |
<a href="https://docs.kamu.dev/cli/get-started/faq/">FAQ</a> |
<a href="https://discord.gg/nU6TXRQNXC">Discord</a> |
<a href="https://kamu.dev">Website</a>
</div>


<center>

<br/>
<br/>
    
# 3. Trustworthiness of Data

</center>

<div class="alert alert-block alert-info">
If you skipped the previous chapter or are continuing after a break, use the following commands to get your environment ready for this chapter:
    
<p style="background:black">
<code style="background:black;color:white">cd "01 - Kamu Basics (COVID-19 example)"
./init-chapter-3.sh
</code>
</p>
</div>

# Problem
So far we've built a simple but very useful data supply chain, parts of which could be **owned and maintained by different people and organizations**. We also saw how easy it is to keep all of our datasets up to date, as long as publishers continue to update the source data. The ODF protocol is in fact designed to achieve <mark>near real-time latency</mark>.

But getting data fast is not the only concern. When multiple parties are involved in publishing and transforming data, there is a big question of whether you can **trust** the data you are receiving. When working with data in a decentralized setting, you have to account for the worst case: **malicious intent**.

<div class="alert alert-block alert-warning">
<b>Short on time?</b> See <a href="https://www.youtube.com/watch?v=hN_vpHYmwi0&list=PLV91cS45lwVG20Hicztbv7hsjN6x69MJk&index=2">this video</a> for a quick explainer of trust and collaboration.
</div>

After incidents like [the Surgisphere scandal](https://www.the-scientist.com/features/the-surgisphere-scandal-what-went-wrong--67955) (where a publication based on fake medical data derailed drug research worldwide), the sentiment is changing from assuming that all research is done in good faith to <mark>considering any research unreliable until proven otherwise</mark>.

This is why we've built `kamu` with <mark>complete reproducibility and verifiability</mark> in mind. In this section, we will see how to assess the trustworthiness of any dataset.

## Root Datasets
Previously, you've seen the clear separation `kamu` makes between root and derivative datasets. Let's see how trust works in each of them separately, starting with root datasets and source data.

### Reproducibility & Verifiability
In our example, we used root datasets that directly fetch the COVID-19 data from government websites. Why are we duplicating the data instead of using it directly?

In general, data sources like websites should be considered **non-reproducible**:
- The website can be down or **unreachable**
- Publishers do not give us any clear guarantees around **availability** of data, so the dataset may disappear at any time
- There is no clarity around **preserving history**, so the publisher may (intentionally or by accident) destructively update data and alter history

We could make some assumptions on a case-by-case basis, but what we really want is a **clear contract** on how data is preserved.

<div class="alert alert-block alert-info">

Destructive updates that do not preserve history are unfortunately the norm for a very large portion of data publishers today. For example, http://geonames.org (one of the most widely used GIS data publishers) updates its datasets daily by **overwriting** the old data, so the data you download today will be different from the data you download tomorrow.
    
This makes **copying and versioning data** the only way to share a **stable data reference** with your colleagues and make your data project reproducible, but by doing so you lose the ability to verify that input data actually came from the trusted publisher.

</div>

The contract that `kamu` enforces through root datasets is simple: **complete reproducibility & verifiability** of source data. It addresses all of the concerns above: you can be certain that the data observed today will still be available tomorrow, that the history will not be altered, and that you can easily check "*did this data actually come from this publisher?*".


#### In Practice

Since a dataset is a ledger, a single metadata block hash is sufficient to create a **stable reference** to data and ensure **reproducibility**. The same effect can be achieved using a fixed position in the ledger (as an offset, or as an upper bound for the `system_time` column).
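
To make the idea of a fixed position more concrete, here is a minimal sketch in plain Python. It does not use `kamu` at all: the `Record` type and the `snapshot_at_offset` / `snapshot_as_of` helpers are hypothetical names made up for this illustration. It only shows why an append-only ledger makes such references stable: reading up to a fixed offset (or a `system_time` bound) returns the same rows no matter how much data is added later.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Record:
    offset: int            # position of the record in the ledger
    system_time: datetime  # when the record was added to the ledger
    value: str


def snapshot_at_offset(ledger, max_offset):
    """Return all records up to and including a fixed offset."""
    return [r for r in ledger if r.offset <= max_offset]


def snapshot_as_of(ledger, system_time_bound):
    """Return all records added no later than the given system time."""
    return [r for r in ledger if r.system_time <= system_time_bound]


def day(n):
    # Helper for building timestamps in this toy example
    return datetime(2021, 1, n, tzinfo=timezone.utc)


ledger = [
    Record(0, day(1), "case A"),
    Record(1, day(2), "case B"),
]
before = snapshot_at_offset(ledger, 1)

# New data is only ever appended, existing records are never modified
ledger.append(Record(2, day(3), "case C"))
after = snapshot_at_offset(ledger, 1)

same_by_offset = before == after
same_by_time = snapshot_as_of(ledger, day(2)) == before
print(same_by_offset, same_by_time)
```

Both comparisons hold because the reference points at a position in history rather than at "the latest data".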

<!--div class="alert alert-block alert-danger">
// TODO - show how to work with stable references
</div-->

Since the metadata chain is secured cryptographically, verifying that it is authentic only requires validating its integrity (something that `kamu` always does automatically) and then asking the publisher whether they have a metadata block with this hash.

<!--div class="alert alert-block alert-danger">
// TODO - show how to check metadata is authentic
</div-->

And since the metadata chain contains hashes of all data files in the dataset, you can easily verify that your local copy of the data corresponds to the metadata and was not tampered with.
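
The following is a toy sketch of how such a check can work, written in plain Python. It is not the actual ODF metadata format or the `kamu` implementation: the block layout and the field names `prev_block_hash` and `data_hash` are assumptions chosen for illustration. Each block stores the hash of the previous block and the hash of one data file, so a single trusted head hash pins down the whole history.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def block_hash(block: dict) -> str:
    # Hash a canonical JSON encoding so equal blocks always hash equally
    return sha256_hex(json.dumps(block, sort_keys=True).encode())


def append_block(chain: list, data: bytes) -> None:
    prev = block_hash(chain[-1]) if chain else None
    chain.append({"prev_block_hash": prev, "data_hash": sha256_hex(data)})


def verify_chain(chain: list, head_hash: str, data_files: list) -> bool:
    """Check the chain against a trusted head hash and local data files."""
    if not chain or len(chain) != len(data_files):
        return False
    # The head hash is what you would confirm with the publisher
    if block_hash(chain[-1]) != head_hash:
        return False
    for i, (block, data) in enumerate(zip(chain, data_files)):
        expected_prev = block_hash(chain[i - 1]) if i > 0 else None
        if block["prev_block_hash"] != expected_prev:
            return False  # the chain of blocks is broken
        if block["data_hash"] != sha256_hex(data):
            return False  # the data file does not match the metadata
    return True


files = [b"id,cases\n1,10\n", b"id,cases\n2,12\n"]
chain = []
for f in files:
    append_block(chain, f)
head = block_hash(chain[-1])

ok = verify_chain(chain, head, files)

# Changing a data file after the fact is detected
tampered_files = [b"id,cases\n1,99\n", files[1]]
tampered_ok = verify_chain(chain, head, tampered_files)
print(ok, tampered_ok)
```

Patching the metadata to match a tampered file does not help an attacker in this sketch either, because changing an old block changes its hash and breaks the link stored in the next block.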

<div class="alert alert-block alert-success">
To verify data that we previously downloaded from a repository run:

<p style="background:black">
<code style="background:black;color:white"> kamu verify ontario.case-details
</code>
</p>
</div>


### Trustworthiness of Root Data
While the tools above can ensure that your data actually came from a certain publisher, they unfortunately cannot tell you <mark>whether the data itself is trustworthy</mark>.

Remember that the source data is always under complete control of the owner of the root dataset, which means that measurement and processing errors, and even malicious data, can make their way in. Don't trust any random person presenting you with data.

<div class="alert alert-block alert-info">

Take our first dataset for example (`british-columbia.case-details`). If you downloaded this root dataset from the repository and inspected its metadata, you would see that it fetches data from the official source, the CDC website. That sounds trustworthy, right?
    
Not quite! The person who published the dataset **could've tampered with the data** when it was being pulled from the CDC website. This problem disappears when organizations that publish data also own the corresponding root datasets in ODF, as it removes the middle man.
    
</div>

The amount of trust you can put into the data should be equal to how much you can trust the publisher. You can measure this trust based on the following factors:

- The first factor is <mark>**reputation**</mark>
    - Your degree of trust may go up if you know the publisher personally
    - Or see that other people in the community trust them
    - An affiliation with the government, university, or some reputable organization would also improve their credibility
- Another possibility is <mark>**audit**</mark>
    - A third party can audit the methodology and tools used to collect data
    - This method doesn't scale well and always leaves the question of "who watches the watchmen?"
- A much better option is <mark>**cross-validation**</mark>
    - You can find another independent publisher that provides similar data and see if their measurements are similar
- A step above that is <mark>**outlier detection**</mark>
    - You can create a derivative dataset that continuously reads from a group of independent publishers, compares their data, and discards (and possibly penalizes) the outliers

`kamu` helps you keep track of which data publishers you depend on and lets you implement the above advanced techniques to improve the reliability of data.
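
As a rough illustration of cross-validation and outlier detection, here is a minimal sketch in plain Python. In `kamu` this logic would live in a derivative dataset; the code below only demonstrates the idea. The `consensus` function, the publisher names, the numbers, and the 10% tolerance are all made-up assumptions.

```python
from statistics import median


def consensus(readings: dict, tolerance: float = 0.1):
    """Combine one measurement reported by several independent publishers.

    Returns the median value and the sorted names of publishers whose
    value deviates from the median by more than `tolerance` (relative).
    """
    mid = median(readings.values())
    outliers = sorted(
        name
        for name, value in readings.items()
        if abs(value - mid) > tolerance * abs(mid)
    )
    return mid, outliers


# The same daily case count as reported by three hypothetical publishers
daily_cases = {"publisher-a": 1020, "publisher-b": 1000, "publisher-c": 4000}
value, outliers = consensus(daily_cases)
print(value, outliers)  # 1020 ['publisher-c']
```

The median is used here because a single bad publisher cannot move it arbitrarily far, unlike the mean. A real pipeline would also need to decide how to handle the case where publishers legitimately disagree.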

## Derivative Datasets
As you saw before, `kamu` strictly follows the "Data as Code" philosophy, where you never modify data directly and only manipulate data through queries. This makes collaborating on data very similar to collaborating on software.

Derivative data in `kamu` is inseparable from the queries that produced it, because data whose provenance we cannot establish cannot be trusted and is therefore useless.

The stream processing queries we discussed previously in fact have very **strict properties**: they are fully **deterministic and reproducible**, meaning that the same query on the same inputs will always produce the same result.

Thanks to this, the verification process takes just three steps:

1. Understand **which root datasets you depend on** and whether you trust their publishers

<div class="alert alert-block alert-success">
For this you can use the already familiar lineage command:

<p style="background:black">
<code style="background:black;color:white"> kamu inspect lineage canada.daily-cases
</code>
</p>
</div>

2. Audit the queries to make sure they are not malicious

<div class="alert alert-block alert-success">
Use the following command to see the queries used by a dataset:

<p style="background:black">
<code style="background:black;color:white"> kamu inspect query canada.daily-cases
</code>
</p>
</div>


3. Verify that the data you downloaded matches the declared transformations and was not tampered with

<div class="alert alert-block alert-success">
This is done by the same command we used previously for root datasets:

<p style="background:black">
<code style="background:black;color:white"> kamu verify canada.daily-cases
</code>
</p>
</div>

When called on a derivative dataset, in addition to comparing the hashes of data to metadata, the `verify` command does one extra thing - it **replays all the transformations locally** to guarantee that the resulting data was in fact produced by queries declared in the metadata chain. It re-executes the queries in the same sequence and on the same inputs as recorded in metadata, one block at a time.

The last part relies on determinism and reproducibility of the queries.
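
To show why determinism matters here, below is a minimal sketch of replay-based verification in plain Python. It is not how `kamu` is implemented: the block layout, the `output_hash` field, the `daily_totals` query, and the sample rows are all hypothetical. The point is that a forged output can be made consistent with its own hash, but it cannot be made consistent with re-executing the declared query on the recorded inputs.

```python
import hashlib
import json


def data_hash(rows) -> str:
    # Canonical JSON encoding so equal data always hashes equally
    encoded = json.dumps(rows, sort_keys=True).encode()
    return hashlib.sha256(encoded).hexdigest()


def daily_totals(rows):
    """A deterministic query: total cases per date, in sorted order."""
    totals = {}
    for row in rows:
        totals[row["date"]] = totals.get(row["date"], 0) + row["cases"]
    return [{"date": d, "cases": c} for d, c in sorted(totals.items())]


def verify_by_replay(blocks, query, inputs, downloaded) -> bool:
    """Replay the query one block at a time and compare with metadata."""
    if not (len(blocks) == len(inputs) == len(downloaded)):
        return False
    for block, input_rows, output_rows in zip(blocks, inputs, downloaded):
        # 1. The local data must match the hash recorded in metadata
        if data_hash(output_rows) != block["output_hash"]:
            return False
        # 2. Re-executing the declared query must produce the same data
        if data_hash(query(input_rows)) != block["output_hash"]:
            return False
    return True


inputs = [
    [{"date": "2021-01-01", "cases": 3}, {"date": "2021-01-01", "cases": 2}],
    [{"date": "2021-01-02", "cases": 7}],
]
honest = [daily_totals(batch) for batch in inputs]
blocks = [{"output_hash": data_hash(rows)} for rows in honest]
replay_ok = verify_by_replay(blocks, daily_totals, inputs, honest)

# Editing the output and re-hashing it passes check 1 but fails the replay
forged = [[{"date": "2021-01-01", "cases": 50}], honest[1]]
forged_blocks = [{"output_hash": data_hash(rows)} for rows in forged]
forged_ok = verify_by_replay(forged_blocks, daily_totals, inputs, forged)
print(replay_ok, forged_ok)
```

If `daily_totals` were not deterministic (for example, if it returned rows in arbitrary order), the honest data could fail verification too, which is why this approach depends on reproducible queries.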

<div class="alert alert-block alert-info">

**Here's another interesting way to look at this:**
    
Because all queries are stored in metadata, and are deterministic, the entirety of derivative data can be deleted and fully re-created from only root datasets and metadata. This makes derivative data **a form of caching**! 
    
This is a pretty amazing property: derivative data can be stored cheaply, unreliably, or even not stored at all. This can potentially result in massive savings, since the amount of derivative data we produce on the path to insight can often exceed the volume of the original raw data.
</div>

Since data pipelines can get quite long, you can also use the `--recursive` flag to verify all derivative data starting from the root datasets:

<p style="background:black">
<code style="background:black;color:white"> kamu verify --recursive canada.daily-cases
</code>
</p>


## Network Effect

We just saw how `kamu` provides a simple step-by-step process to evaluate trustworthiness of data, enabling efficient reuse and collaboration. Most of this work is automated, but remember that the remaining manual bits are a <mark>**community effort**</mark>!

Just like in open source software, you will always have <mark>an army of peers on your side</mark>, helping to establish trustworthiness of different publishers and auditing the derivative queries in complex transformation pipelines! Enabling the collaboration effect is the **true superpower** of `kamu`.

---

## Well Done! 🎉

This concludes the introductory demo!

Next, we invite you to try other demos and explore our growing [collection of examples](https://docs.kamu.dev/cli/learn/examples/) and collection of [external datasets](https://github.com/kamu-data/kamu-contrib). 

You can continue working in this environment: all examples are located in the `~/XX - Other Examples` directory.

Also, please tell us what you think about this demo:
- by [joining our Discord](https://discord.gg/nU6TXRQNXC)
- by creating an [issue](https://github.com/kamu-data/kamu-cli/issues)
- or e-mailing us at [info@kamu.dev](mailto:info@kamu.dev)

Thank you for checking out `kamu`!