# Data Lake Maturity Model

Adapted and extended from: [Zaloni Data Lake Maturity Model](https://resources.zaloni.com/ebooks/data-lake-maturity-model)

## Organization's Readiness for Big Data and Analytics

### Data

#### Data Lake
A large repository of organizational data characterized by best practices in architecture, curation, and access. A data lake is a loose confederation of databases that:

- can have different structures,
- can come from different vendors (internal and external sources),
- can be processed through different tools.

The goals for a data lake include:

- Integrating new sources of data, such as social media feeds or sensor data from IoT,
- Democratization of data (self-service): business users with little or no programming skill can create their own reports and dashboards,
- Access to broader ranges of data coupled with better security and privacy guarantees (personal data regulations, trade secrets).

By sharing data, data lakes can open up communications between company divisions, cutting down silos and multiplying the benefits of insights from each division.

### Technology

- [MapReduce: Simplified Data Processing on Large Clusters (2004)](https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf) - MapReduce is a two-step algorithm. `The Map step` accepts input data as key/value pairs, or creates a key, and determines all of the values associated with each key; for instance, all the web pages that contain a word. This step is highly parallelizable. `The Reduce step` takes the output of all the Map jobs and creates the final dataset linking documents to words. The algorithm is designed from the start to distribute work among large numbers of computers. The open-source implementation of MapReduce is Hadoop.

    In 2009 [Spark](http://spark.apache.org/) extended the MapReduce algorithm so that you can set up any kind of pipeline you like with key/value algorithms. Hadoop traditionally processed data in big batches. Modern needs require streaming (fast) data to be processed in near real time, so additional tools were added to the picture, such as [Storm](http://storm.apache.org/) and [Flink](https://flink.apache.org/). These tools separate the streams into manageable chunks for effective processing, creating microbatches of data called `windows`.
    
- **Containers**. Modern programming involves dividing programs into small modules that perform individual, well-defined services and expose them through APIs (`microservices`). `Virtual machines` (VMs) and `containers` have become important because microservices don't need a full computer system, so running many of them on a single chip can save a lot of hardware. In addition, each service may be created and torn down quickly, or fail and be replaced automatically.
    
    [Docker](https://www.docker.com/) is by far the most popular container platform. [Kubernetes](https://kubernetes.io/) is an open-source system for automating deployment, scaling, and management of containerized applications. There are many other tools for administering and providing resources (`orchestration`).
    
- **AI libraries**. Analytics are being increasingly formalized and packaged into programming libraries, usually providing modern AI tools. When these algorithms were first discovered, they were published in papers and each data scientist would code up their own functions. As the algorithms became standard, programmers released libraries to handle classic data analysis tasks such as clustering and classification ([scikit-learn](https://scikit-learn.org/stable/)). A few years later, programmers took the next step and released libraries such as [TensorFlow](https://www.tensorflow.org/) and [MXNet](http://mxnet.apache.org/).
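
The Map and Reduce steps described in the MapReduce item above can be sketched in plain Python. This is a minimal single-process illustration, not how Hadoop is implemented: the function names (`map_step`, `shuffle`, `reduce_step`, `build_index`) and the sample documents are assumptions made for the example.

```python
from collections import defaultdict


def map_step(doc_id, text):
    """Emit one (word, doc_id) pair for every distinct word in a document."""
    for word in set(text.lower().split()):
        yield word, doc_id


def shuffle(pairs):
    """Group emitted values by key, the work done between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_step(word, doc_ids):
    """Collapse all values for one key into a single sorted list of documents."""
    return word, sorted(set(doc_ids))


def build_index(documents):
    # Every map_step call is independent of the others, which is what makes
    # this phase easy to spread over many machines in a real cluster.
    pairs = [pair for doc_id, text in documents.items()
             for pair in map_step(doc_id, text)]
    return dict(reduce_step(word, ids) for word, ids in shuffle(pairs).items())


# Hypothetical sample pages, for illustration only.
documents = {
    "page1": "data lake stores raw data",
    "page2": "a data warehouse stores curated data",
}
index = build_index(documents)
```

Here `index["lake"]` lists only `page1`, while `index["data"]` lists both pages, which is the "documents linked to words" dataset the Reduce step produces.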

### People

Business users need data faster than ever. Waiting weeks or months for a programmer to code up a report is no longer acceptable. Business users have also become more technically savvy, so they have a better idea than previous generations did of which decisions they need data for and what data they want. Many have become used to creating Excel macros. Giving each team and user control over queries can speed up decisions by orders of magnitude.

Self-service requires work at many levels as well as new architectures for data storage:
- Users need to be able to find data through searches and queries (`data catalog`)
- A comprehensive online `taxonomy` (a list of terms and their relationships) must be available
- A process must be in place for giving users access to data. This might include having the data owner vet the access, copying the data to a new repository, anonymizing or masking sensitive parts of the data, and checking later to make sure the user adheres to the contract provided with the data
- Tools must support access by people at different skill levels (web-based query interface, APIs, customizable dashboards)
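
As a rough illustration of how a data catalog and a taxonomy work together, the sketch below expands a search term with related terms before matching it against dataset descriptions and tags. Everything in it (the `CatalogEntry` fields, the `TAXONOMY` contents, the dataset names) is hypothetical and chosen only for the example.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    owner: str
    description: str
    tags: set = field(default_factory=set)


# Hypothetical taxonomy: each term maps to related, narrower terms.
TAXONOMY = {
    "customer": {"client", "lead"},
    "revenue": {"sales", "invoice"},
}


def expand(term):
    """Return the term itself plus any related terms from the taxonomy."""
    term = term.lower()
    return {term} | TAXONOMY.get(term, set())


def search(catalog, term):
    """Return names of datasets whose description or tags match the term."""
    terms = expand(term)
    hits = []
    for entry in catalog:
        words = set(entry.description.lower().split())
        words |= {tag.lower() for tag in entry.tags}
        if terms & words:
            hits.append(entry.name)
    return hits


catalog = [
    CatalogEntry("crm_leads", "sales-team",
                 "Sales leads entered by account managers", {"lead"}),
    CatalogEntry("web_logs", "platform-team",
                 "Raw clickstream events", {"clickstream"}),
    CatalogEntry("billing", "finance-team",
                 "Monthly invoice totals", {"invoice"}),
]
```

A business user searching for `"customer"` would find `crm_leads` even though that word never appears in its description, because the taxonomy relates "customer" to "lead".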

When self-service becomes universal and everyone in the organization is trained to consult the data before making a decision, the organization has become truly data driven.

### Processes

#### Curation: Cleaning, Prepping, and Provenance

Curation means putting things in a suitable order for use. In terms of datasets, it covers a number of tasks in cleaning, prepping, and preserving the provenance (`lineage`) of data.

Typical `cleaning operations`, usually performed automatically, include:
- converting inconsistent values to standards (e.g. units)
- removing bad values (e.g. a human age above 150)
- filling in missing values (e.g. through arithmetic techniques such as interpolation)
- removing rows that contain bad values
- deduplication (e.g. based on the proximity of different records)
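
The cleaning operations listed above can be sketched with the standard library alone. This is an illustrative sketch under assumptions: the record layout, the `KG_PER_UNIT` table, and the choice to deduplicate on a normalized name (a crude stand-in for real proximity matching) are all made up for the example.

```python
KG_PER_UNIT = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}


def interpolate(values):
    """Fill None gaps linearly between the nearest known neighbours.

    Gaps at the very start or end have only one neighbour and are left as None.
    """
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


def clean(records, max_age=150):
    """Drop rows with bad ages, standardize weights to kg, and deduplicate."""
    cleaned, seen = [], set()
    for rec in records:
        if not 0 <= rec["age"] <= max_age:
            continue  # remove rows that contain a bad value
        key = rec["name"].strip().lower()  # assumed duplicate criterion
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({
            "name": rec["name"].strip(),
            "age": rec["age"],
            "weight_kg": round(rec["weight"] * KG_PER_UNIT[rec["unit"]], 2),
        })
    return cleaned


records = [
    {"name": "Ada", "age": 36, "weight": 60, "unit": "kg"},
    {"name": " ada ", "age": 36, "weight": 60000, "unit": "g"},  # duplicate
    {"name": "Bob", "age": 212, "weight": 80, "unit": "kg"},     # bad age
    {"name": "Cy", "age": 41, "weight": 150, "unit": "lb"},
]
result = clean(records)
readings = interpolate([10.0, None, None, 16.0])
```

After cleaning, only the rows for Ada and Cy remain, with weights expressed in kilograms, and the two missing readings are filled with evenly spaced values between 10.0 and 16.0.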

### TODO: 
work on this list and link to an external notebook for steps

Even after data is cleaned, there may be a need for more preprocessing before querying. Security might also require `preparation` in order to conform to regulations and protect corporate secrets. The preparation may include:
- Merging records from different sources (can be also considered as part of cleaning)
- Putting data into a schema
- Adding tags to mark the value of the data for various purposes or to indicate provenance information such as source of the data
- Providing aggregate information such as totals or averages
- Determining the security and privacy rules that apply to datasets or individual fields and preparing them so that they are kept safe from unauthorized view (access rules, encryption, separation of sensitive data to different datasets)
- De-identifying or anonymizing data to protect privacy. This involves generalizing the data so that it applies to a larger group of entities.
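
To make the last two preparation steps more concrete, the sketch below generalizes an exact age into a band, truncates a postcode, replaces an e-mail address with a salted hash, and attaches provenance tags. The field names, the salt, and the band width are assumptions for illustration. Note that salted hashing is pseudonymization rather than full anonymization, so it is only one part of a real privacy process.

```python
import hashlib


def generalize_age(age, width=10):
    """Replace an exact age with a band so it applies to a larger group."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def pseudonymize(value, salt):
    """Replace an identifier with a salted hash; equal inputs still match."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]


def prepare(record, salt="example-salt"):
    """Return a de-identified copy of a record with provenance tags added."""
    return {
        "customer": pseudonymize(record["email"], salt),
        "age_band": generalize_age(record["age"]),
        "region": record["postcode"][:2] + "***",  # keep only a coarse area
        "tags": {"source": record["source"], "sensitivity": "de-identified"},
    }


# Hypothetical input record.
record = {
    "email": "ada@example.com",
    "age": 37,
    "postcode": "10115",
    "source": "crm_export",
}
prepared = prepare(record)
```

The prepared record no longer contains the e-mail address or the exact age, but records belonging to the same customer can still be joined through the `customer` value.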

`Provenance` covers many types of metadata:
- the source of data
- generation and collection dates
- owner
- other

The metadata helps each user determine the value of the data.
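
One possible way to carry provenance alongside a dataset is a small immutable record that accumulates processing steps. The `Provenance` class, its field names, and the sample values are hypothetical; real catalogs define their own metadata models.

```python
from dataclasses import dataclass, replace
from datetime import date


@dataclass(frozen=True)
class Provenance:
    source: str
    generated_on: date
    collected_on: date
    owner: str
    lineage: tuple = ()  # ordered processing steps applied so far

    def after(self, step):
        """Return a copy with one more processing step appended."""
        return replace(self, lineage=self.lineage + (step,))


# Hypothetical dataset metadata.
raw = Provenance(
    source="sensor-gateway",
    generated_on=date(2020, 1, 14),
    collected_on=date(2020, 1, 15),
    owner="iot-team",
)
curated = raw.after("deduplicated").after("units normalized")
```

Because the record is frozen, `raw` keeps an empty lineage while `curated` records both steps, so each version of the data carries its own history.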

#### Data Access

#### Architecture

`Zones` are an important part of a data lake's architecture: they satisfy the need to give different users data of different types. If data is expected to be sensitive, it can be loaded into a `transient` or `landing` zone so that it can be vetted before it even enters the data lake. A `raw` zone may exist for cleaning and prepping. The data ready for querying is stored in a `gold` or `trusted` zone, which serves the rest of the organization as the "single source of truth". Different users might want data to be represented with different schemas. Data transformed to meet these needs is moved from the gold zone to a `work` or `refined` zone, where users can issue the precise queries they want. Some companies store sensitive data in a separate `sensitive` zone.

<img src="images/datalake-zones.png" alt="Data lake zones" style="width: 800px;"/>
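
The movement of data between zones can be modeled as a small state machine. The sketch below is one possible reading of the zones described in this section: the `ALLOWED_MOVES` table is an assumption, and a real data lake may permit other transitions.

```python
from enum import Enum


class Zone(Enum):
    TRANSIENT = "transient"  # also called landing
    RAW = "raw"
    TRUSTED = "trusted"      # also called gold
    REFINED = "refined"      # also called work
    SENSITIVE = "sensitive"


# Assumed transitions between zones, for illustration only.
ALLOWED_MOVES = {
    Zone.TRANSIENT: {Zone.RAW, Zone.SENSITIVE},
    Zone.RAW: {Zone.TRUSTED},
    Zone.TRUSTED: {Zone.REFINED},
    Zone.REFINED: set(),
    Zone.SENSITIVE: set(),
}


def promote(dataset, target):
    """Return a copy of the dataset in the target zone, if the move is allowed."""
    current = dataset["zone"]
    if target not in ALLOWED_MOVES[current]:
        raise ValueError(
            f"cannot move from {current.value} to {target.value}")
    return {**dataset, "zone": target}


# Hypothetical dataset that has just landed.
dataset = {"name": "crm_export", "zone": Zone.TRANSIENT}
vetted = promote(dataset, Zone.RAW)
trusted = promote(vetted, Zone.TRUSTED)
```

Trying to move `dataset` straight from the transient zone to the refined zone raises a `ValueError`, which mirrors the idea that data must be vetted and cleaned before users query it.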

## Zaloni Data Lake Maturity Model

<img src="images/datalake-zaloni-maturity-model.png" alt="Zaloni data lake maturity model" style="width: 800px;"/>

Many organizations today still conduct business the way they always have: consulting data only through quarterly reports or canned dashboards, keeping most of their data in the dark where no one uses it, and making decisions mostly on intuition and personal experience. They realize that they are not making full use of their data and technology infrastructure, but it still "does the trick" for them. Unable to think of better uses for their data, they go on with business as usual until they are bumped out of their market by more dynamic companies.

### Data at the Ignore Level
The organization preserves whatever information it can as structured data, using traditional relational databases or data marts. It operates with limited datasets collected for an immediate purpose, such as sales leads. Much of the data is collected internally and might even be entered manually. This limits the amount of data that a company can feasibly collect and calls its quality into question. `Unstructured` plain-text data is rarely used because no tools are available to process it. The data is often stored in silos (different systems) with different schemas, which limits automation.