# Functional Data Engineering in Team 🥑

* Forecasting Team 🥑 at Maersk
* Data and machine learning pipelines in k8s
* Implementation of Great Expectations
* Pain & Gain

___

Presenter: Micha B A Kunze - michabenachim.kunze@maersk.com | [@mbakunze](https://github.com/mbakunze)

# Forecasting Team 🥑 at Maersk


# Maersk

* largest ocean container shipper (~20-25% of global volume)
* ~ 700 vessels (owned + chartered)


## Team 🥑
We build forecasts - together:

Edward, Lasse, Hans, Karin, Julija, Henrik, Ricko, Luca, Marco, Andreas, Micha, Søren and Julia.

## Building forecasts

 * batch processing of __full__ historical datasets
 * crucial that data is historically accurate (what was known at which point in time)
 * computationally heavy jobs
 

 ### Tech stack:
 * Git
 * Docker
 * Kubernetes
 * Python, R, Spark
 * Azure Blob Storage
 * Datadog

# How we work

 * everything is code, and code lives in git
 * DevOps / GitOps
 * all running code is containerized
 * highly collaborative: software engineers | data scientists | data engineers pair to solve challenges
 * multiple forecasting products in production -> millions of forecasts a day

 ![git_commits](assets/git_commits.png)


## Forecasting products

HOW MUCH CAN I SAY HERE?

# Data and machine learning pipelines in k8s

# How we build pipelines

 * use pippi to run pipelines (handles scheduling, compute and storage)
 * separate functions (-> code) and data (-> config)
 * use git branching to build new or modify existing pipeline code
 * datasets are saved as __immutable__ snapshots for each run!
 
 This allows us to isolate concerns and develop in parallel.
 
![pippi_r2l](assets/pippi_r2l.png)

```python
# This might be the code for your data pipeline.
# It runs locally the same way as on k8s.

import pandas as pd


def most_important_transform(source01_path: str, source02_path: str, destination_path: str):
    # Read both source datasets.
    source01_df = pd.read_parquet(source01_path)
    source02_df = pd.read_parquet(source02_path)

    # A left join keeps every row of the first source.
    transformed_df = source01_df.merge(source02_df, on="shipment_id", how="left")

    # Write the result as the destination dataset.
    transformed_df.to_parquet(destination_path)
```


## Your configuration could then look like this:

```bash
SOURCE01_STORAGE_ACCOUNT=prod
SOURCE01_DATASET=shipments

SOURCE02_STORAGE_ACCOUNT=prod
SOURCE02_DATASET=commitments

DESTINATION_STORAGE_ACCOUNT=prod
DESTINATION_DATASET=commitment_shipments

JOB_EXECUTION_TARGET=ge_entrypoint python -m most_important_transform

SCHEDULER_STRATEGY=Any
```
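
A minimal sketch of how such a configuration could be turned into the arguments of the transform. How pippi actually maps datasets to storage paths is not shown in this talk, so the `resolve_path` and `build_arguments` helpers and the `/data/<account>/<dataset>/` layout are assumptions made purely for illustration.

```python
def resolve_path(prefix: str, env: dict) -> str:
    """Build a dataset path from <PREFIX>_STORAGE_ACCOUNT and <PREFIX>_DATASET.

    Hypothetical helper: the path layout is an assumption, not pippi's real one.
    """
    account = env[f"{prefix}_STORAGE_ACCOUNT"]
    dataset = env[f"{prefix}_DATASET"]
    return f"/data/{account}/{dataset}/data.parquet"


def build_arguments(env: dict) -> dict:
    """Map the configuration onto the keyword arguments of the transform."""
    return {
        "source01_path": resolve_path("SOURCE01", env),
        "source02_path": resolve_path("SOURCE02", env),
        "destination_path": resolve_path("DESTINATION", env),
    }


# Example configuration, mirroring the values above.
example_env = {
    "SOURCE01_STORAGE_ACCOUNT": "prod",
    "SOURCE01_DATASET": "shipments",
    "SOURCE02_STORAGE_ACCOUNT": "prod",
    "SOURCE02_DATASET": "commitments",
    "DESTINATION_STORAGE_ACCOUNT": "prod",
    "DESTINATION_DATASET": "commitment_shipments",
}

arguments = build_arguments(example_env)
```

The point of the split is that the function never hard-codes where data lives: pointing the same code at a different storage account or dataset only requires a config change.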

# Functional Data Engineering

Functional as in _functional programming_. Many of its principles lend themselves well to data engineering.

 * __immutable data__ - snapshot all the data!
 * __idempotent functions__ - data pipelines are functions that have no side effects
 * __reproducibility__ - foundation of the scientific method (and sanity)
 
 
___ 
 _Check out:_ Maxime Beauchemin, creator of Apache Airflow
  * [medium post on functional data engineering](https://medium.com/@maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a)
  * [youtube video on functional data engineering](https://youtu.be/4Spo2QRTz1k)

## Immutability

Data snapshotting:

![pippi-snapshotting](assets/pippi-pipeline.png)

## Immutability

Data & code dependencies:

![pippi-snapshotting](assets/pippi-snaphots.png)
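
One way to picture immutable snapshots is a writer that puts every run into its own directory and refuses to overwrite an existing one. This is only a sketch under assumed names: `write_snapshot`, the `run-001` style run ids and the JSON file format are hypothetical and do not describe how pippi stores data.

```python
import json
import tempfile
from pathlib import Path


def write_snapshot(root: Path, dataset: str, run_id: str, rows: list) -> Path:
    """Write rows to <root>/<dataset>/<run_id>/data.json and never overwrite."""
    snapshot_dir = root / dataset / run_id
    # exist_ok=False raises FileExistsError if this run was already snapshotted.
    snapshot_dir.mkdir(parents=True, exist_ok=False)
    target = snapshot_dir / "data.json"
    target.write_text(json.dumps(rows))
    return target


root = Path(tempfile.mkdtemp())
first = write_snapshot(root, "shipments", "run-001", [{"shipment_id": 1}])
second = write_snapshot(
    root, "shipments", "run-002", [{"shipment_id": 1}, {"shipment_id": 2}]
)

# Writing the same run again is rejected, so run-001 stays untouched.
try:
    write_snapshot(root, "shipments", "run-001", [])
    overwritten = True
except FileExistsError:
    overwritten = False
```

Because older snapshots are never modified, a downstream job can always be pointed back at exactly the data an earlier run saw.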



## Idempotency

Running the exact pipeline again produces the same output.

OR

Repeatedly running the pipeline will not change the outcome.
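
As a small illustration, the sketch below runs a side-effect-free transform twice on the same inputs and gets the same result. It is a simplified stand-in written in plain Python: `left_join`, the `teu` and `customer` columns and the sample rows are hypothetical, and it assumes `shipment_id` is unique in the second source.

```python
def left_join(shipments: list, commitments: list, key: str = "shipment_id") -> list:
    """Left-join two lists of row dicts on `key` without modifying the inputs."""
    lookup = {row[key]: row for row in commitments}
    # Build new dicts so neither input list nor its rows are mutated.
    return [{**row, **lookup.get(row[key], {})} for row in shipments]


shipments = [{"shipment_id": 1, "teu": 2}, {"shipment_id": 2, "teu": 1}]
commitments = [{"shipment_id": 1, "customer": "A"}]

first_run = left_join(shipments, commitments)
second_run = left_join(shipments, commitments)

# Same inputs, same output, and the inputs themselves are unchanged.
same_output = first_run == second_run
```

The property only holds as long as the inputs are fixed, which is where immutable snapshots come in.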

## Reproducibility

The foundation for scientific and analytical work.

Immutability is a key enabler of reproducibility, and so is idempotency. 


# Implementation of Great Expectations


## Data Docs

# How we use data validation

* validate source AND destination data
* break pipeline when either is bad

![ge_flow](assets/ge_entrypoint_logic.svg)


![ge_flow](assets/ge_entrypoint_flow.svg)
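
The "validate source AND destination, break when either is bad" idea can be sketched as a wrapper around the transform. This is not the Great Expectations API and not the team's `ge_entrypoint`: `ValidationError`, `expect_no_missing` and `run_validated` are hypothetical stand-ins using plain Python checks.

```python
class ValidationError(Exception):
    """Raised to stop the pipeline when a dataset fails its checks."""


def expect_no_missing(rows: list, column: str) -> bool:
    """Stand-in expectation: every row has a non-null value in `column`."""
    return all(row.get(column) is not None for row in rows)


def run_validated(source: list, transform, column: str) -> list:
    """Validate the source, run the transform, then validate the destination."""
    if not expect_no_missing(source, column):
        raise ValidationError("source data failed validation")
    destination = transform(source)
    if not expect_no_missing(destination, column):
        raise ValidationError("destination data failed validation")
    return destination


good = [{"shipment_id": 1}, {"shipment_id": 2}]
bad = [{"shipment_id": 1}, {"shipment_id": None}]

# Good data passes through the (here trivial) transform.
result = run_validated(good, lambda rows: list(rows), "shipment_id")

# Bad source data stops the run before the transform is executed.
try:
    run_validated(bad, lambda rows: list(rows), "shipment_id")
    pipeline_broke = False
except ValidationError:
    pipeline_broke = True
```

Failing loudly before and after the transform keeps bad data from silently flowing into downstream snapshots.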

# Pain & Gain

## Things Break Constantly 💥

 * not a question of IF, but only of WHEN
 * when we first put Great Expectations into production, we broke many data pipelines whose data turned out to be OK
    - we started to learn about our data quality
    - it is an iterative process

## Data Validation in Team 🥑

* data validation has been in production for all our forecasting products for ~3 months
* already prevented a handful of incidents 💪  
* led to investigations of data quality issues and to improvements

<img src="assets/ge-coverage.png" alt="ge-coverage" width="200" style="float: left;"/>

