# Data Lake Maturity Model

Adapted and extended from: [Zaloni Data Lake Maturity Model](https://resources.zaloni.com/ebooks/data-lake-maturity-model)

---

The first thing needed for analytics is data. It should be accessible, complete, trustworthy, well governed, and easy to use for anyone who needs to make data-driven decisions. Below are some thoughts on how these data collections should be organized.

<img src="images/ds-hierarchy-of-needs.png" alt="Data Science Hierarchy of Needs" style="width: 800px;"/>

- From [The AI Hierarchy of Needs](https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007)

---
## Organization's Readiness for big data and analytics

### Data

#### Data Lake
A large repository of organizational data characterized by best practices in architecture, curation, and access. A data lake is a loose confederation of databases that can:

- have different structures,
- come from different vendors (internal and external sources),
- be processed through different tools.

The goals for a data lake include:

- Integrating new sources of data, such as social media feeds or sensor data from IoT,
- Democratizing data (self-service): business users with little or no programming skill can create their own reports and dashboards,
- Giving access to broader ranges of data, coupled with better security and privacy guarantees (personal data regulations, trade secrets).

By sharing data, data lakes can open up communications between company divisions, cutting down silos and multiplying the benefits of insights from each division.

### Technology

- [MapReduce: Simplified Data Processing on Large Clusters (2004)](https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf) - MapReduce is a two-step algorithm. It accepts input data as key/value pairs, or creates a key, as part of `the Map step`. The Map step determines all of the values associated with a key; for instance, all the web pages that contain a word. This step is highly parallelizable. `The Reduce step` takes all output of the Map jobs and creates the final dataset linking documents to words. The algorithm is designed from the start to distribute work among large numbers of computers. The open-source implementation of MapReduce is Hadoop.

    In 2009, [Spark](http://spark.apache.org/) extended the MapReduce algorithm so that you can set up any kind of pipeline you like with key/value algorithms. Hadoop traditionally processed data in big batches. Modern needs call for streaming (fast) data to be processed in near real time, so additional tools such as [Storm](http://storm.apache.org/) and [Flink](https://flink.apache.org/) were added to the picture. These tools separate the streams into manageable chunks for processing, effectively creating microbatches of data called `windows`.
    
- **Containers**. Modern programming involves dividing programs into small modules that provide individual, well-defined services and expose them through APIs (`microservices`). `Virtual machines` (VMs) and `containers` have become important because microservices don't need a full computer system, so running many of them on a single chip can save a lot of hardware. In addition, each service can be created and torn down quickly, or fail and be replaced automatically.
    
    [Docker](https://www.docker.com/) is by far the most popular container platform. [Kubernetes](https://kubernetes.io/) is an open-source system for automating deployment, scaling, and management of containerized applications. There are many other tools for administering and providing resources (`orchestration`).
    
- **AI libraries**. Analytics are being increasingly formalized and packaged into programming libraries, usually providing modern AI tools. When these algorithms were first discovered, they were published in papers and each data scientist would code up their own functions. As they became standard, programmers released libraries to handle classic data analysis tasks such as clustering and classification ([scikit-learn](https://scikit-learn.org/stable/)). A few years later, programmers took the next step and released libraries such as [TensorFlow](https://www.tensorflow.org/) and [MXNet](http://mxnet.apache.org/).
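
The Map and Reduce steps described for MapReduce above can be illustrated on a single machine. The following is a minimal sketch in plain Python, not Hadoop code: the function names (`map_step`, `shuffle`, `reduce_step`, `build_index`) and the sample pages are assumptions made up for this illustration. It builds the "which pages contain which word" dataset used as the example in the text.

```python
from collections import defaultdict
from itertools import chain


def map_step(doc_id, text):
    """Map: emit a (word, doc_id) pair for every distinct word in one document."""
    return [(word, doc_id) for word in set(text.lower().split())]


def shuffle(pairs):
    """Group emitted values by key, as a framework does between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_step(word, doc_ids):
    """Reduce: collapse all values for one key into the final record."""
    return word, sorted(doc_ids)


def build_index(documents):
    # Each map_step call is independent, so a cluster could run them in parallel.
    mapped = chain.from_iterable(map_step(d, t) for d, t in documents.items())
    return dict(reduce_step(w, ids) for w, ids in shuffle(mapped).items())


# Hypothetical sample pages, invented for illustration.
pages = {
    "page1": "data lake stores raw data",
    "page2": "a data warehouse stores curated tables",
}
index = build_index(pages)
```

Here `index["data"]` lists both pages, while `index["lake"]` lists only `page1`. In a real cluster the map calls and the per-key reduce calls would be spread across many machines.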

### People

Business users need data faster than ever. Waiting weeks or months for a programmer to code up a report is no longer acceptable. Business users have also become more technically savvy, so they have a better idea than in previous generations what kinds of decisions they need data to make and what data they want. Many have become used to creating Excel macros. Giving each team and user control over queries can speed up decisions by orders of magnitude.

Self-service requires work at many levels as well as new architectures for data storage:
- Users need to be able to find data through searches and queries (`data catalog`)
- A comprehensive online `taxonomy` (a list of terms and their relationships) must be available
- A process must be in place for giving users access to data. This might include having the data owner vet the access, copying the data to a new repository, anonymizing or masking sensitive parts of the data, and checking later to make sure the user adheres to the contract provided with the data
- Tools must support access by people with different levels of skill (web-based query interface, APIs, customizable dashboards)
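
To make the first two points more concrete, here is a small sketch of how a data catalog search might use a taxonomy to widen a query. Everything in it (the `CatalogEntry` class, the `TAXONOMY` mapping, the dataset names) is hypothetical and only meant to show the idea; a production catalog would be a dedicated service.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    description: str
    tags: set = field(default_factory=set)


# Hypothetical taxonomy: each term maps to its narrower terms.
TAXONOMY = {
    "customer": {"client", "account"},
    "sales": {"orders", "leads"},
}


def expand(term):
    """Return the term plus its narrower terms from the taxonomy."""
    return {term} | TAXONOMY.get(term, set())


def search(catalog, term):
    """Find entries whose tags or description match the term or a related term."""
    terms = expand(term.lower())
    hits = []
    for entry in catalog:
        words = set(entry.description.lower().split()) | entry.tags
        if terms & words:
            hits.append(entry.name)
    return hits


# Hypothetical catalog content.
catalog = [
    CatalogEntry("crm_accounts", "Account master data from the CRM", {"account"}),
    CatalogEntry("web_orders", "Online orders by day", {"orders"}),
    CatalogEntry("sensor_feed", "IoT temperature readings", {"iot"}),
]
```

A search for `"customer"` finds `crm_accounts` even though the word never appears in that entry, because the taxonomy relates "customer" to "account".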

When self-service becomes universal and everyone in the organization is trained to consult the data before making a decision, the organization has become truly data driven.

### Processes

#### Curation: Cleaning, Prepping, and Provenance

Curation means putting things in a suitable order for use. In terms of datasets, it covers a number of tasks in cleaning, prepping, and preserving the provenance (`lineage`) of data.

Typical `cleaning operations`, usually automated, include:
- converting inconsistent values to standards (e.g. units)
- removing bad values (e.g. a human age above 150)
- filling in missing values (e.g. through arithmetic techniques such as interpolation)
- removing rows that contain bad values
- deduplication (e.g. based on the proximity of different records)
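
The operations above can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the field names (`name`, `age`, `height`, `unit`), the sample rows, and the exact-match deduplication key are all invented here, and real pipelines often use fuzzy matching instead.

```python
def clean(rows):
    """Apply unit conversion, bad-value removal, and deduplication to raw rows."""
    seen = set()
    cleaned = []
    for row in rows:
        # Convert inconsistent units to a standard (here: centimetres).
        height = row["height"] * 100 if row["unit"] == "m" else row["height"]
        # Drop rows that contain impossible values.
        if row["age"] is not None and not 0 <= row["age"] <= 150:
            continue
        # Deduplicate on a normalized key.
        key = (row["name"].strip().lower(), height)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(
            {"name": row["name"].strip(), "age": row["age"], "height_cm": height}
        )
    return cleaned


def interpolate(values):
    """Fill interior None gaps by linear interpolation between known neighbours."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


# Hypothetical sample rows.
rows = [
    {"name": "Ann", "age": 34, "height": 1.5, "unit": "m"},
    {"name": " ann ", "age": 34, "height": 150, "unit": "cm"},
    {"name": "Bob", "age": 212, "height": 180, "unit": "cm"},
]
```

With these rows, `clean(rows)` keeps a single record for Ann: the second row is a duplicate once units are standardized, and Bob's row is dropped because of the impossible age. `interpolate` leaves leading and trailing gaps untouched, since there is no neighbour on one side to interpolate from.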

### TODO: 
work on this list and link to an external notebook for steps

Even after data is cleaned, there may be a need for more preprocessing before querying. Security might also require `preparation` in order to conform to regulations and protect corporate secrets. The preparation may include:
- Merging records from different sources (can be also considered as part of cleaning)
- Putting data into a schema
- Adding tags to mark the value of the data for various purposes or to indicate provenance information such as source of the data
- Providing aggregate information such as totals or averages
- Determining the security and privacy rules that apply to datasets or individual fields and preparing them so that they are kept safe from unauthorized view (access rules, encryption, separation of sensitive data to different datasets)
- De-identifying or anonymizing data to protect privacy. This involves generalizing the data so that it applies to a larger group of entities.
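
As an illustration of the last point, the sketch below generalizes a record so that it describes a larger group rather than one person. The field names, the ten-year age bands, and the choice of a salted SHA-256 digest are assumptions for this example. Note that hashing an identifier is pseudonymization rather than full anonymization, so it may not be sufficient on its own for a given regulation.

```python
import hashlib


def generalize_age(age, width=10):
    """Replace an exact age with a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def pseudonymize(value, salt):
    """Replace an identifier with a salted SHA-256 digest, truncated for readability."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:12]


def deidentify(record, salt):
    """Return a copy of the record with direct identifiers removed or generalized."""
    return {
        "customer_ref": pseudonymize(record["email"], salt),
        "age_band": generalize_age(record["age"]),
        "region": record["postcode"][:2],  # keep only the coarse area prefix
        "total_spend": record["total_spend"],
    }


# Hypothetical input record.
record = {
    "email": "ann@example.com",
    "age": 34,
    "postcode": "SW1A 1AA",
    "total_spend": 120.5,
}
safe = deidentify(record, salt="rotate-me")
```

The result keeps the aggregate-friendly fields (`age_band`, `region`, `total_spend`) while the e-mail address and exact postcode no longer appear.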

`Provenance` covers many types of metadata:
- the source of data
- generation and collection dates
- owner
- other

The metadata is useful to help determine the value of the data to each user.
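
One possible way to keep this metadata attached to a dataset is a small record type. The sketch below is an assumption about how that could look, with hypothetical field names chosen to mirror the list above; it is not a standard schema.

```python
from dataclasses import dataclass, field, asdict
from datetime import date


@dataclass(frozen=True)
class Provenance:
    source: str
    generated_on: date
    collected_on: date
    owner: str
    extra: dict = field(default_factory=dict)  # the open-ended "other" metadata


def tag_dataset(name, provenance):
    """Attach provenance to a dataset descriptor so it travels with the data."""
    record = asdict(provenance)
    # Store dates as ISO strings so the record can be serialized as JSON.
    record["generated_on"] = provenance.generated_on.isoformat()
    record["collected_on"] = provenance.collected_on.isoformat()
    return {"dataset": name, "provenance": record}


# Hypothetical dataset and owner names.
tagged = tag_dataset(
    "web_orders",
    Provenance("shop-frontend", date(2020, 1, 1), date(2020, 1, 2), "sales-team"),
)
```

Making the record immutable (`frozen=True`) is a design choice: provenance describes where data came from, so it should not change silently after the fact.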

#### Data Access

#### Architecture

`Zones` are an important part of a data lake's architecture. They satisfy the need to give different users data of different types. If data is expected to be sensitive, it can be loaded into a `transient` or `landing` zone so that it can be vetted before it even enters the data lake. A `raw` zone may exist for cleaning and prepping. Data that is ready for querying is stored in a `gold` or `trusted` zone, which serves the rest of the organization as the "single source of truth". Different users might want data to be represented with different schemas. Data transformed to meet these needs is moved from the gold zone to a `work` or `refined` zone, where users can issue the precise queries they want. Some companies store sensitive data in a separate location, the `sensitive` zone.
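
The flow between zones can be written down as a small set of allowed moves. The sketch below is one reading of the paragraph above, not a fixed standard: the `ALLOWED_MOVES` table and the `promote` helper are assumptions, and an organization may permit other paths.

```python
from enum import Enum


class Zone(Enum):
    TRANSIENT = "transient"
    RAW = "raw"
    TRUSTED = "trusted"
    REFINED = "refined"
    SENSITIVE = "sensitive"


# Assumed promotion paths, read off the description of the zones.
ALLOWED_MOVES = {
    Zone.TRANSIENT: {Zone.RAW, Zone.SENSITIVE},
    Zone.RAW: {Zone.TRUSTED},
    Zone.TRUSTED: {Zone.REFINED},
    Zone.REFINED: set(),
    Zone.SENSITIVE: set(),
}


def promote(dataset, target):
    """Return a copy of the dataset descriptor in the target zone, if the move is allowed."""
    current = dataset["zone"]
    if target not in ALLOWED_MOVES[current]:
        raise ValueError(
            f"cannot move {dataset['name']} from {current.value} to {target.value}"
        )
    return {**dataset, "zone": target}


# Hypothetical dataset descriptor.
orders = {"name": "orders", "zone": Zone.TRANSIENT}
orders_raw = promote(orders, Zone.RAW)
```

Trying to move `orders_raw` straight to `Zone.REFINED` raises a `ValueError`, because in this sketch data must pass through the trusted zone first.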

<img src="images/datalake-zones.png" alt="Data lake zones" style="width: 800px;"/>

---
## Zaloni Data Lake Maturity Model

<img src="images/datalake-zaloni-maturity-model.png" alt="Zaloni data lake maturity model" style="width: 800px;"/>

Many organizations today still conduct business the way they always have: consulting data only through quarterly reports or canned dashboards, keeping most of their data in the dark where no one uses it, and making decisions mostly on intuition and personal experience. They realize that they are not making full use of their data and technology infrastructure, but it still "does the trick" for them. Not being able to think of better uses of data, they go on with business as usual until they are bumped out of their market by more dynamic companies.

### Level 1: Ignore
The first imperative for moving out of this level is to wake people up. Big changes are taking place, and the companies that understand and exploit them will be the winners. The winners are winning more while the losers are losing big. No single person or organization can hold back global trends for very long. Every organization has to take that to heart.

To propose the use of big data, a strong alignment with the business is needed. You might find obvious pain points that the data lake will solve, or you might suggest that analytics can find new revenue streams delivered by existing business teams. You need to get buy-in from many sides: upper management as well as key teams.

Most businesses set up test projects after hearing about big data, analytics, and data lakes. If chosen carefully, these will produce the desired results and stimulate wider adoption of the tools and data-driven thinking.

#### Data
The organization preserves whatever information it can as structured data, using traditional relational databases or data marts (data warehouse). The organization operates with limited datasets collected for an immediate purpose, such as sales leads. Much of the data is collected internally and might even be entered manually. This limits the amount of data that a company can feasibly collect and calls its quality into question. `Unstructured` plain-text data is rarely used because there are no tools available to process it. The data is often stored in silos (different systems) with different schemas, which limits automation.

#### Technology
Organizations typically use tools that were hand-coded in-house, or standalone tools (often proprietary) that don't interact well. The tools might be created by a particular team for a particular purpose, so along with siloed data there is siloed technology for processing the data. There can be redundancy, with IT staff writing the same code over and over. There is no common code repository and no way of searching for useful functions that already exist.

#### People
This level is characterized by a business culture resistant to change. Analytics are run only by individuals or single teams. There is no staff for big data, although one or more lone voices, probably among the IT managers, might have introduced the idea of big data.

##### Intuition vs. Data to Drive Decisions
The key to effective organizations in our time is embracing big data while maintaining a critical attitude (intuition). Professionals and managers can use:
- data to become better managers, because it helps them to
    - represent what they know so that others can tap this insight when making their own decisions,
    - learn hidden insights from data that open up new ways for their business to work and help them understand it better
- their intuition to
    - suggest new directions for research and company investment
    - determine what data is needed to make the ensuing decisions
    - evaluate the validity of analytics and challenge suspicious conclusions
    - assess the impact of decisions on employees, clients, and other aspects of the business
    - find resources and marshal the organization to act effectively on data-driven decisions
    
#### Process
The manager of each team sets its agenda, which might jibe poorly with the agendas of other divisions. And the direction taken is based on the intuition of the manager, with little or no recourse to data. The organization lacks processes around the use of data. In the absence of a searchable data catalog, most of it goes unused. There is little data governance as policies around sensitive and regulated data are enforced manually and violations can slip by.

### Level 2: Store
At this level, there is executive-level recognition of the value of data and its return on investment (ROI). A few departments are running some of the big data tools and data stores mentioned earlier in this report, either on-premises or in the cloud.

To move forward, an organization should:
- Make sure the experiments are aimed at clear business value
- Make sure business sponsorship and executive leadership are in place
- Start to define and document processes at the corporate level for setting up the zones of a data lake, preserving relevant provenance, defining ownership, and auditing
- Call on managers to use data to evaluate the results of every initiative
- Get the rules and resources to do training and certification for a wide range of staff
- Make sure your technology and data infrastructure can scale

#### Data
Although a few teams are using data lake tools and data stores, the use of each dataset still tends to be limited to one team. Oversight and governance at the corporate level are limited, as is data visibility. This is called a `data puddle`. Unfortunately, many data puddles put together do not form a data lake. That requires more planning and governance than the individual teams can do on their own. Instead, the natural evolution is to a `data swamp`. A data swamp collects lots of data but fails to organize it so it can be used effectively. It can also be very inefficient and costly, because it can contain redundant data and lead to duplicated work (by individual teams) and duplicated tools. Combining datasets later can be difficult because of inconsistencies between different puddles.

#### Technology
Several teams are running big data projects, but each team chooses its own tools, so they probably don't work with tools chosen by other teams. The data warehouse still serves most of the company, so data moves regularly between it and the new data puddles.

#### People
Self-service does not yet exist. Each team depends on IT staff to do queries and run its analytics. By now, business strategists recognize the value of data in decision making, but there is no vision or strategy around the ways big data can transform the organization. The business culture is still largely resistant to becoming data driven. To overcome this resistance, you need to train people.
    
#### Process
There is little data governance, and it is done only at the level of the individual team. Roles and responsibilities around data ownership and data sharing are unclear because there are no formal policies.

### Level 3: Govern
At this level, people, processes and technologies begin to become coordinated across the enterprise. The company is running at least one scalable production data lake, and is making sure that new datasets are linked into this data lake instead of being stored in silos in individual departments. Departments are working together to standardize and centralize their data, so a governance organization is present.

Business users are acquiring useful data across the enterprise and are improving the visibility of their data by managing provenance and other metadata. The company has 360-degree views of customers, products, and so on. There is still a lot of manual work involved in the data life cycle, but teams are sharing best practices and are consolidating their efforts for the sake of efficiency and consistency. 

A data catalog should be in place, and all data entered should be tagged. First attempts at process automation should happen, and as many insights as possible should be exposed to users through web forms and dashboards. Getting to the next stage requires investigating the processes people routinely go through and then automating them.

#### Data
Data quickly moves through different stages in one or more data lakes to become useful to business teams. New datasets are being generated or acquired from outside, and shared across the enterprise in the data lake. Data protection and security are in place, along with auditing to ensure that only authorized people get access to data. Metadata is always stored. The data lake also stores metadata about data usage, such as reports generated from the data.

#### Technology
Newly built systems use HDFS or other big data stores by default. The enterprise data warehouse is maintained mostly for legacy applications. IT staff is routinely exploring new tools. Teams try to use the same tools. If the company does not already have a unified data catalog, the staff is building one.

#### People
Business strategists cite data and the results of analytics when making decisions. The company is using analytics not only to streamline current operations, but to generate more value, relying on predictive analytics. The business is evolving into a data-driven institution where managers do not fear the encroachment of data on their expertise.
    
#### Process
Data lake architecture and the use of tools are informed by documented rules and best practices. Enterprise-wide governance and security policies are in place. There is a repeatable, well understood way to onboard new divisions and incorporate their data into the lake.

### Level 4: Automate
This level frees employees from many routine tasks and helps speed up innovation. One key to reaching this stage is autodiscovery: for instance, tools that crawl existing datasets to identify other datasets that need to be added to the catalog.

This is the time to build advanced capabilities such as text mining, machine learning, forecast modeling, statistical model building, and prescriptive analytics.

#### Data
Provenance is consistently used for all data, and staff can have confidence in data quality. Data enables immediate action on business operations. Metadata might continue to evolve; for instance, machine learning models can be traced to the datasets they use as input.

#### Technology
Data engineers and programmers are focused on processes that can be automated to eliminate manual repetitive tasks. Autodiscovery makes it possible to monitor datasets, carrying out such tasks as automating their addition to the data lake or the catalog. Tools cannot always make a firm decision but should be used to augment human intelligence.
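
A very reduced form of autodiscovery is a crawler that compares what exists in storage with what the catalog already knows. The sketch below assumes a plain file system, a hypothetical set of file extensions that count as datasets, and a catalog represented as a set of relative paths. It only reports candidates, leaving the final decision to a person, in line with the point about augmenting human intelligence.

```python
import os


def discover(root, extensions=(".csv", ".parquet", ".json")):
    """Walk a directory tree and return relative paths of files that look like datasets."""
    found = set()
    for folder, _dirs, files in os.walk(root):
        for filename in files:
            if filename.lower().endswith(extensions):
                full = os.path.join(folder, filename)
                # Normalize separators so paths compare equal across platforms.
                found.add(os.path.relpath(full, root).replace(os.sep, "/"))
    return found


def uncatalogued(root, catalog_paths):
    """Return discovered datasets missing from the catalog, for a human to review."""
    return sorted(discover(root) - set(catalog_paths))
```

Calling `uncatalogued("/data/lake", known_paths)` (both arguments are placeholders here) would return the sorted list of dataset-like files that have no catalog entry yet.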

At this level, the analytic models being used for machine learning or other analysis are part of the overall system. It is easy to identify the model's lineage, the quality of the model, where it is being applied and by whom.

Self-service is universally available, with ever-expanding tools for users who can't program.

#### People
Predictive analysis is routinely used to make optimal business decisions. The company is responsive to changes from clients and the marketplace. New and existing employees go through training sessions that explain the value of data, how the organization is using it, and how to seek it out.
    
#### Process
Processes and best practices are fully defined at the corporate level, and most are routinely followed. The information architecture and associated standards are fully defined and implemented. Continuous development and integration are in place in all development efforts where they are useful. Audits and conformance are automated as much as possible and incorporated into continuous development.

### Level 5: Optimize


#### Data


#### Technology


#### People


#### Process

