# Module 8: Data in Motion

# Introduction

Modern applications have many components that need to communicate with each other. For a simple application, these components may all reside together on a single computer. However, for a web application of any sophistication, the components will be split across different servers for performance or reliability reasons. In this module, we cover **data in motion** &mdash; i.e. how to reliably move data between different servers in a way that allows for future modification without downtime.

# Learning Outcomes  

By the end of this module, you will be able to:

* Define metadata, its types and its importance
* Compare and contrast formats for encoding and transmitting data as messages
* Describe patterns of data transfer between one or more systems

# Readings and Resources

We invite you to further supplement this notebook with the following recommended texts/resources:

- Kleppmann, M. (2017). Chapter 4: Encoding and Evolution in *Designing Data-Intensive Applications*. O’Reilly: Boston. http://shop.oreilly.com/product/0636920032175.do


- Apache Software Foundation (2012). Apache Avro™ 1.8.2 Documentation. https://avro.apache.org/docs/current/

<h1>Table of Contents<span class="tocSkip"></span></h1>
<br>
<div class="toc">
<ul class="toc-item">
<li><span><a href="#Module-8:-Data-in-Motion" data-toc-modified-id="Module-8:-Data-in-Motion">Module 8: Data in Motion</a></span>
</li>
<li><span><a href="#Introduction" data-toc-modified-id="Introduction">Introduction</a></span>
</li>
<li><span><a href="#Learning-Outcomes" data-toc-modified-id="Learning-Outcomes">Learning Outcomes</a></span>
</li>
<li><span><a href="#Readings-and-Resources" data-toc-modified-id="Readings-and-Resources">Readings and Resources</a></span>
</li>
<li><span><a href="#Why-do-data-scientists-need-to-know-about-data-encoding?" data-toc-modified-id="Why-do-data-scientists-need-to-know-about-data-encoding?">Why do data scientists need to know about data encoding?</a></span>
</li>
<li><span><a href="#Metadata" data-toc-modified-id="Metadata">Metadata</a></span>
<ul class="toc-item">
<li><span><a href="#Metadata-in-a-network-environment" data-toc-modified-id="Metadata-in-a-network-environment">Metadata in a network environment</a></span>
</li>
<li><span><a href="#Translation-between-representations" data-toc-modified-id="Translation-between-representations">Translation between representations</a></span>
</li>
<li><span><a href="#Standardized-metadata" data-toc-modified-id="Standardized-metadata">Standardized metadata</a></span>
</li>
<li><span><a href="#Protocols" data-toc-modified-id="Protocols">Protocols</a></span>
</li>
<li><span><a href="#The-API-Economy" data-toc-modified-id="The-API-Economy">The API Economy</a></span>
</li>
</ul>
</li>
<li><span><a href="#Messages" data-toc-modified-id="Messages">Messages</a></span>
<ul class="toc-item">
<li><span><a href="#Message-formats" data-toc-modified-id="Message-formats">Message formats</a></span>
<ul class="toc-item">
<li><span><a href="#Fixed-field" data-toc-modified-id="Fixed-field">Fixed field</a></span>
</li>
<li><span><a href="#CSV" data-toc-modified-id="CSV">CSV</a></span>
</li>
<li><span><a href="#XML" data-toc-modified-id="XML">XML</a></span>
</li>
<li><span><a href="#JSON" data-toc-modified-id="JSON">JSON</a></span>
</li>
<li><span><a href="#Binary-encoding-of-JSON" data-toc-modified-id="Binary-encoding-of-JSON">Binary encoding of JSON</a></span>
</li>
<li><span><a href="#Proprietary-encodings-for-data-interchange" data-toc-modified-id="Proprietary-encodings-for-data-interchange">Proprietary encodings for data interchange</a></span>
</li>
</ul>
</li>
<li><span><a href="#Message-Versioning" data-toc-modified-id="Message-Versioning">Message Versioning</a></span>
<ul class="toc-item">
<li><span><a href="#Backward-and-forward-compatibility" data-toc-modified-id="Backward-and-forward-compatibility">Backward and forward compatibility</a></span>
</li>
</ul>
</li>
<li><span><a href="#Message-Protocols" data-toc-modified-id="Message-Protocols">Message Protocols</a></span>
<ul class="toc-item">
<li><span><a href="#Thrift" data-toc-modified-id="Thrift">Thrift</a></span>
</li>
<li><span><a href="#Protocol-Buffers" data-toc-modified-id="Protocol-Buffers">Protocol Buffers</a></span>
</li>
<li><span><a href="#Code-generation-from-Protocol-Buffers-and-Thrift-schemas" data-toc-modified-id="Code-generation-from-Protocol-Buffers-and-Thrift-schemas">Code generation from Protocol Buffers and Thrift schemas</a></span>
<ul class="toc-item">
<li><span><a href="#How-Protocol-Buffers-and-Thrift-enable-schema-evolution" data-toc-modified-id="How-Protocol-Buffers-and-Thrift-enable-schema-evolution">How Protocol Buffers and Thrift enable schema evolution</a></span>
</li>
</ul>
</li>
<li><span><a href="#Avro" data-toc-modified-id="Avro">Avro</a></span>
<ul class="toc-item">
<li><span><a href="#How-to-Communicate-the-Writer-Schema-in-Avro" data-toc-modified-id="How-to-Communicate-the-Writer-Schema-in-Avro">How to Communicate the Writer Schema in Avro</a></span>
</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
<li><span><a href="#Data-integration-patterns" data-toc-modified-id="Data-integration-patterns">Data integration patterns</a></span>
<ul class="toc-item">
<li><span><a href="#The-Point-to-Point-Pattern" data-toc-modified-id="The-Point-to-Point-Pattern">The Point-to-Point Pattern</a></span>
<ul class="toc-item">
<li><span><a href="#Advantages-and-disadvantages-of-Point-to-Point" data-toc-modified-id="Advantages-and-disadvantages-of-Point-to-Point">Advantages and disadvantages of Point-to-Point</a></span>
</li>
</ul>
</li>
<li><span><a href="#ETL-(Extract,-Transform,-Load)" data-toc-modified-id="ETL-(Extract,-Transform,-Load)">ETL (Extract, Transform, Load)</a></span>
</li>
<li><span><a href="#Publish-&-Subscribe" data-toc-modified-id="Publish-&-Subscribe">Publish & Subscribe</a></span>
</li>
<li><span><a href="#Service-Oriented-Architecture-and-Web-Services" data-toc-modified-id="Service-Oriented-Architecture-and-Web-Services">Service Oriented Architecture and Web Services</a></span>
<ul class="toc-item">
<li><span><a href="#Service-Oriented-Architecture-(SOA)" data-toc-modified-id="Service-Oriented-Architecture-(SOA)">Service-Oriented Architecture (SOA)</a></span>
</li>
<li><span><a href="#Web-Services" data-toc-modified-id="Web-Services">Web Services</a></span>
</li>
</ul>
</li>
<li><span><a href="#Microservices" data-toc-modified-id="Microservices">Microservices</a></span>
</li>
</ul>
</li>
<li><span><a href="#Integration-Tools" data-toc-modified-id="Integration-Tools">Integration Tools</a></span>
<ul class="toc-item">
<li><span><a href="#Integration-Brokers" data-toc-modified-id="Integration-Brokers">Integration Brokers</a></span>
</li>
<li><span><a href="#Enterprise-Service-Buses-(ESBs)" data-toc-modified-id="Enterprise-Service-Buses-(ESBs)">Enterprise Service Buses (ESBs)</a></span>
</li>
<li><span><a href="#The-Distributed-Actor-Framework" data-toc-modified-id="The-Distributed-Actor-Framework">The Distributed Actor Framework</a></span>
</li>
<li><span><a href="#Microservices-and-integration-brokers" data-toc-modified-id="Microservices-and-integration-brokers">Microservices and integration brokers</a></span>
</li>
</ul>
</li>
<li><span><a href="#References" data-toc-modified-id="References">References</a></span>
</li>
</ul>
</div>

# Why do data scientists need to know about data encoding?

Before we can analyze data, we need to procure it. Often the data resides in existing systems, each of which may use any of a wide variety of methods for enabling access to the data it holds. As we'll see, there are many different methods and tools that you'll need to know about and use to get access, or at least to be able to communicate with the big data engineers who might be assisting you.

# Metadata

We can use data to describe other data. When we do this, we call it **metadata**. Metadata makes data *self-describing* and makes data easier to find, interpret and use.

> _“Data that provides information about other data.”_ (Merriam-Webster, 2018)

The National Information Standards Organization (NISO) identifies three major types of metadata:

1. **Descriptive metadata**: information that helps in finding, identifying and understanding a resource. Properties  include elements such as title, author, subject, keywords, etc.<br><br>

2. **Structural metadata**: Information about containers of data, indicating how compound objects are put together, for example, names of fields within objects. Relational table schemas contain structural metadata about columns' types and relationships (via foreign keys) to other tables.<br><br>

3. **Administrative metadata**: Information that helps manage a resource, such as when and how it was created. For example, the file type and intellectual property rights (such as a Creative Commons license) associated with a resource.

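NISO also lists markup languages as a metadata type. Markup languages such as XML allow us to tag parts of some content with additional information about its meaning, type, structure or how it should be formatted. JSON, although strictly a data-interchange format rather than a markup language, plays a similar role when it labels values with field names.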


## Metadata in a network environment

In a distributed network environment, it is frequently necessary for a system process (or "service") to send data to, or retrieve data from, another process with which it does not share memory, and which may in fact be on a different machine in a different part of the world.

Prior to the internet, it was common for systems to send records of a known, fixed format between processes or programs. This is okay if the format of the data never needs to change. But, systems evolve over time and when an application change was required, the application needed to be shut down everywhere, updated and brought up again, usually incurring significant downtime.

However, in today's 24/7/365 world, this is no longer acceptable.  Companies such as Google release dozens of changes every day and can't possibly shut down all the servers providing a particular service for a period of time while they are all patched with a new version.

The solution the industry developed is to make the messages between the services *self-describing* by adding metadata. The metadata allows the receiving process to determine what a message contains just by looking at it. As long as the receiver can handle every currently-active version of the message structure, the service provider can take its time rolling out a new version of a service, server by server, and everything keeps working.
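
As a minimal sketch of this idea, the receiver below inspects a `version` field carried inside each message and handles two structures side by side. The field names (`version`, `name`, `first_name`, `last_name`) and the function `handle_message` are hypothetical, chosen only for illustration.

```python
import json

def handle_message(raw: str) -> str:
    """Return a display name from a message in either known version."""
    msg = json.loads(raw)
    version = msg.get("version", 1)  # assumption: a missing field means version 1
    if version == 1:
        return msg["name"]
    if version == 2:
        return f'{msg["first_name"]} {msg["last_name"]}'
    raise ValueError(f"unsupported message version: {version}")

# Old and new senders can coexist while a rollout is in progress.
old_msg = json.dumps({"version": 1, "name": "Mary Smith"})
new_msg = json.dumps({"version": 2, "first_name": "Mary", "last_name": "Smith"})

print(handle_message(old_msg))
print(handle_message(new_msg))
```

Because the receiver decides what to do from the message itself, servers can be upgraded one at a time rather than all at once.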


## Translation between representations

The sending and receiving services may use quite different internal representations of the same data. Data can be stored in memory in different data structures such as lists, arrays, hash tables, trees, objects, structs, etc. and there's nothing to stop two endpoints from using totally different ones.

Below are some examples of storing the names `John`, `Shawn`, and `Mary` in Python using a few different data structures.

In [1]:
names_list = ["John", "Shawn", "Mary"]
print(names_list)

['John', 'Shawn', 'Mary']


In [2]:
names_tuple = ("John", "Shawn", "Mary")
print(names_tuple)

('John', 'Shawn', 'Mary')


In [3]:
names_set = {"John", "Shawn", "Mary"}
print(names_set)

{'John', 'Mary', 'Shawn'}


Which data structure a program uses depends on how the program uses the data. Different structures are convenient for different purposes.

Ultimately though, all data at its lowest level is simply an array of bytes. Whether writing data to a file or sending data over a network, the data is transmitted as a sequence of bytes. The data can be stored as a flat file, JSON document, XML document, etc., but on disk it is stored as bytes irrespective of the format chosen. Often, the first few bytes, called the **file signature** (or **magic number**), identify the data format so that applications know how to process the data. For example, `EF BB BF` (the UTF-8 byte order mark) at the start of a text file signals that it is UTF-8 encoded, although many UTF-8 files omit it. Beyond that, any structure the data might have is lost when we write it to disk unless we add metadata to describe it.
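
To illustrate how a file signature can be used, here is a small sketch that compares the leading bytes of some data against a handful of well-known signatures. The table is deliberately tiny, and the function name `sniff_format` is made up for this example.

```python
# A few well-known signatures (leading bytes) and what they indicate.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"%PDF": "PDF document",
    b"\x1f\x8b": "gzip archive",
    b"\xef\xbb\xbf": "UTF-8 text with byte order mark",
}

def sniff_format(data: bytes) -> str:
    """Guess a format from the first few bytes, or return 'unknown'."""
    for signature, description in SIGNATURES.items():
        if data.startswith(signature):
            return description
    return "unknown"

# The "utf-8-sig" codec writes the EF BB BF marker before the text.
print(sniff_format("hello".encode("utf-8-sig")))
print(sniff_format(b"%PDF-1.7 ..."))
print(sniff_format("hello".encode("utf-8")))
```

Note that plain UTF-8 text without the marker comes back as `unknown`: a signature identifies a format only when the writer chose to include one.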

When we add metadata to preserve information such as the names of the fields in a data structure or the relationships between them (for example, an object graph), we say the data is **encoded** or **serialized**. Reconstructing the data structure from the metadata and data is called **decoding** or **deserialization**. You may also see the terms **pickling** and **marshalling**, which are alternative words for *encoding*, and **hydration**, which usually refers to populating an in-memory object from stored data (that is, *decoding*).
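
A round trip through an encoding can be sketched with the standard `json` library. The `record` dictionary is just an illustrative example; the field names travel with the data, so the structure can be rebuilt on the other side.

```python
import json

record = {"name": "Mary", "friends": ["John", "Shawn"]}

# Encoding (serialization): in-memory structure -> text -> bytes.
encoded = json.dumps(record).encode("utf-8")

# Decoding (deserialization): bytes -> text -> in-memory structure.
decoded = json.loads(encoded.decode("utf-8"))

print(type(encoded).__name__)
print(decoded == record)
```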

## Standardized metadata

Standardized encoding offers the advantage of **portability**. Using metadata to describe the encoding format removes the need to hard-code the encoding/decoding logic, which can be replaced by a standard encoding/decoding library. It also helps ensure the encoder/decoder performs well (probably better than custom code written by a non-specialist). Most importantly, it allows organizations that adopt a standard encoding to share data with fewer headaches than if they had to synchronize every update to their systems with all of their ecosystem partners (say, for example, along a supply chain).

As a concrete example, **HL7** is an industry-specific standard which provides a framework for exchange, integration, sharing and retrieval of electronic health information and defines how information is packaged and communicated between healthcare providers. For more details see http://www.hl7.org/.

The standards that we will look at, however, will be for the most part industry-agnostic.

## Protocols

When the way in which two processes communicate with each other is standardized, we call the specification of how they communicate a **protocol** or an **API** (borrowing the term **application programming interface** from object-oriented programming). Although the terms are sometimes used interchangeably, *protocol* is typically used to refer to the way in which metadata is required to be embedded in messages, and *API* is used to refer to the set of services that are available as a bundle, what's expected in a properly-formatted message, and what is produced as output (including the output format).

## The API Economy

The standardization of APIs enabled the rise of **digital services** that reside on the internet and can be easily integrated into new applications without requiring much custom-coding. This is sometimes referred to as the **API economy** &mdash; an economy of electronic services rather than physical goods.

# Messages

Data can be sent between systems in two ways: as a *flat file* or as a *message*. A **flat file** is simply a text file that is written by one system in its entirety and then read by another.  A **message** can contain an entire file, but is typically something smaller like a single record or query result and allows the sender and receiver to both be running concurrently rather than sequentially.  Messages also typically include metadata that make their contents self-describing.

A message can be sent directly between a sender and a receiver, or can use a service that facilitates the transfer, for example, by temporarily caching the message if the receiver is not currently available. This intermediary software is broadly referred to by the term *Enterprise Application Integration (EAI)*. We will look at the various styles of EAI later in this notebook. First, let's focus on the formats a message can take.

## Message formats

The most common message formats are:

- Fixed field 
- CSV
- XML
- JSON

### Fixed field

**Fixed field format** refers to files where data is stored in rows and columns, with each column having a fixed character length. This format does not use any special marking (delimiter) to separate fields, but expects each data element in a column to have the same, fixed length. This approach is rarely used nowadays as it has little flexibility and is not self-describing.
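
To make the idea concrete, the sketch below parses one fixed field record by slicing at agreed offsets. The layout (three 10-character columns) and the names `LAYOUT` and `parse_fixed` are assumptions for illustration; in practice the layout has to be agreed between sender and receiver out of band, which is exactly why the format is not self-describing.

```python
# Hypothetical layout: (field name, start offset, end offset).
LAYOUT = [("first_name", 0, 10), ("last_name", 10, 20), ("city", 20, 30)]

def parse_fixed(line: str) -> dict:
    """Slice each field out of the line and strip the padding."""
    return {name: line[start:end].strip() for name, start, end in LAYOUT}

# Build a sample record by padding each value to its column width.
line = "John".ljust(10) + "Smith".ljust(10) + "Toronto".ljust(10)

print(repr(line))
print(parse_fixed(line))
```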

### CSV

**CSV format** includes both a delimiter and an optional enclosing character such as quote marks. Commonly-used delimiters are the comma, pipe (|) and tab, but any character that can't appear elsewhere in the message as data could be used. The enclosing character, usually double quotes, is often used at the beginning and end of a value. An optional first line, also referred to as a header row, may have names of fields. The header row, if present, provides a little self-description, but there is no reliable way to tell programmatically whether a CSV file has a header row: the receiver just has to know in advance or guess by trying to parse it.
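
Here is a short sketch using Python's standard `csv` module with a pipe delimiter and double quotes as the enclosing character. The sample data is made up; note that `csv.DictReader` simply assumes the first line is a header row.

```python
import csv
import io

# The quoted values contain a comma and a pipe that are data, not delimiters.
raw = 'name|city|note\n"Smith, John"|Toronto|"uses a | in the text"\n'

reader = csv.DictReader(io.StringIO(raw), delimiter="|", quotechar='"')
rows = list(reader)

print(rows[0]["name"])
print(rows[0]["note"])
```

The enclosing quotes are what allow the delimiter character itself to appear inside a value.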

### XML

**XML** stands for eXtensible Markup Language. XML is a textual data format designed for human and machine consumption. XML has a self-defining structure of metadata surrounding each data element. XML documents are usually accompanied by a schema following a standard known as **XML Schema** or **XSD**. When XML is used for messaging, the sender and receiver typically both know the schema in advance and can use it to validate that a message is correctly structured.

XSD documents are themselves XML documents. The XSD provides a standardized method for interpretation by generic XML parsers. An XML document is called *well formed* if it conforms to the syntax of XML. Further, it is *valid* if it is both well formed and conforms to its associated schema.  
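
The notion of *well formed* can be checked with Python's standard `xml.etree.ElementTree` parser, as in the sketch below. The sample documents and the helper `is_well_formed` are illustrative only. This parser checks syntax but does not validate against an XSD; schema validation typically needs an additional library.

```python
import xml.etree.ElementTree as ET

good = "<mail><to>Sam</to><from>Tony</from></mail>"
bad = "<mail><to>Sam</mail>"  # the <to> element is never closed

def is_well_formed(text: str) -> bool:
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed(good))
print(is_well_formed(bad))
print(ET.fromstring(good).find("to").text)
```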

You can see an example of an XML document and its XSD schema at https://www.w3schools.com/xml/schema_example.asp.

### JSON

**JavaScript Object Notation (JSON)** is based on how objects work in the JavaScript Programming Language.

JSON is built on two structures:

1. A collection of name/value pairs.<br><br>

2. An ordered list of values.  

JSON Schema specifies a JSON-based format to define the structure of JSON data for validation, documentation, and interaction control. It is based on the concept of the XML Schema (XSD) but is JSON-based. Like XSD, it is self-describing and the same encoding/decoding tools can be used for both the schema and data. If no schema is prescribed, all the object field names will need to be included within the encoded data.

Here is an example of a JSON document/message:

    {
      "mail": {
        "to": "Sam",
        "from": "Tony",
        "subject": "Module 8 Data in Motion",
        "body": "This is a JSON document"
      }
    }

Here's an example of how we can read JSON using the Python library `json`:

In [4]:
import json
doc = '{ "userName": "Martin", "favoriteNumber": 1337, "interests": ["daydreaming", "hacking"]}'
y = json.loads(doc)
print(y["favoriteNumber"])

1337


JSON is simpler than XML and easier for people to read. Also, whereas XML has to be parsed by an XML parser, JSON can be parsed by a standard JavaScript function (`JSON.parse`). JSON is less verbose than XML, so the same information takes fewer bytes in JSON than in XML. XML, on the other hand, has the advantage that XSD allows for a far more precise definition of the allowable contents of an XML document than JSON Schema does. In the end, though, JSON has become the dominant format for new systems, and XML is now chosen far less often.

### Binary encoding of JSON

To reduce the size of the textual version of JSON, the data can be binary encoded. This can reduce the total size of the document, allowing for more efficient storage and/or data communications. Several standards exist for doing this, such as MessagePack, BSON, and BJSON. For example, MongoDB stores JSON-like documents in the binary BSON format.

However, binary encodings of JSON are not widely adopted. Most JSON is stored in the plain textual version. Unless the messages are huge, the savings aren't worth the trouble. In a data centre where millions of messages per minute are travelling between systems, though, bandwidth becomes a scarce resource. In a moment, we will talk about Thrift, Protocol Buffers and Avro. These are binary encoding libraries which reduce the size even more than the above-mentioned encodings and provide for *message versioning*.

### Proprietary encodings for data interchange

Many programming languages such as Java (`java.io.Serializable`), Ruby (`Marshal`), and Python (`pickle`) have built-in support for encoding and decoding. Though they are convenient to use, the downside of using language-specific encoding is that all intercommunicating applications get tied to a particular language. They are fine for situations where you just need to store then recover a complex data structure such as an object graph within a single application, but aren't well suited to communicating between services that have been written in different programming languages.
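
As a brief illustration of the trade-off, the sketch below uses Python's `pickle` to round-trip a small object graph that contains a set and a shared list. The variable names are arbitrary. `pickle` preserves Python-specific details that JSON cannot express, but the resulting bytes are only meaningful to another Python program.

```python
import json
import pickle

shared = ["daydreaming", "hacking"]
graph = {"a": shared, "b": shared, "tags": {"x", "y"}}  # two references to one list

restored = pickle.loads(pickle.dumps(graph))

# The shared reference and the set type both survive the round trip.
print(restored["a"] is restored["b"])
print(restored["tags"] == {"x", "y"})

# A language-neutral format such as JSON has no notion of a Python set.
try:
    json.dumps(graph)
except TypeError as exc:
    print("json cannot encode this:", exc)
```

One further caution from the Python documentation: unpickling data from an untrusted source is unsafe, which is another reason to prefer language-neutral formats between services.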

## Message Versioning

In modern applications with numerous intercommunicating systems, it is important that the data communications between the various components be done in a way that is independent of the specifics of each of the components. We want to avoid a situation where any small change in the internals of a component requires redesigning the data intercommunications in order to keep everything working.

For example, if two applications exchange data using schema `Version 1`, and the sending application changes its data schema to `Version 2`, the receiving application will break, since it still expects data in the `Version 1` format.

### Backward and forward compatibility

There are two kinds of compatibility we need to consider:

1. **Backward compatibility** is a property of an application which allows it to be able to read data that was written previously by an older version of the application.<br><br>

2. **Forward compatibility** is a property of an application which allows it to read data that was written by a newer version of the application.

For example, consider an application that uses a database table `Users` which consists of four fields: `First Name`, `Last Name`, `Address`, and `City`.

This application (Version 1) exposes (provides to other programs or users) two services:

- The first service is responsible for collecting data from the user and updating the data in the `Users` table. 


- The second service retrieves data from the `Users` table and shows it back to the user.

Your manager asks you to add a new field `Country` to the database table which will require you to update the application to a newer version (Version 2) which knows how to read and write the table using the new table schema. 

The newer application (Version 2) is considered backward compatible if it can read records that don't contain a value for the `Country` field and still operate properly.

For the older Version 1 application to be considered forward compatible, it must still be able to successfully read a record written by the Version 2 application, and, if required, update and write the record back, doing so without dropping anything that had been put in the new `Country` field previously by the Version 2 application.
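
The two properties can be sketched with plain dictionaries standing in for table rows. The function names `v2_read` and `v1_update_city`, the key names, and the `"unknown"` default are all assumptions made for this illustration.

```python
def v2_read(record: dict) -> dict:
    """Version 2 reader: backward compatible because a missing country gets a default."""
    return {**record, "country": record.get("country", "unknown")}

def v1_update_city(record: dict, new_city: str) -> dict:
    """Version 1 writer: forward compatible because fields it doesn't know are kept."""
    updated = dict(record)  # copy everything, including fields Version 1 never heard of
    updated["city"] = new_city
    return updated

v1_record = {"first_name": "Mary", "last_name": "Smith",
             "address": "1 Main St", "city": "Toronto"}
v2_record = {**v1_record, "country": "Canada"}

print(v2_read(v1_record)["country"])        # old record read by new code
print(v1_update_city(v2_record, "Ottawa"))  # new record updated by old code
```

The key design choice in `v1_update_city` is to copy the whole record rather than rebuild it from the four known fields; rebuilding would silently drop `country`.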

Next we will look at modern message protocols, standardized ways of encoding messages using metadata that have built-in features like support for message versioning.

## Message Protocols

Several standards have arisen over the last two decades for encoding messages in a highly-standardized, versioned, compact fashion.  These are advanced messaging techniques that are only really needed for big data applications involving large data rates and volumes, where there is also a need to allow services to be updated without downtime.

### Thrift
  
**Thrift** was originally developed by Facebook, and is now Apache [Thrift](https://thrift.apache.org/) (Apache Software Foundation, 2017).  

The Apache Thrift software framework, for scalable cross-language services development, combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml, Delphi and other languages (Apache Software Foundation, 2017).

Thrift provides an interface definition language which is used to describe the structure of a message, as well as tools to generate code from the description: "IDL (Interface Description Language) is a practical and useful tool for controlling the exchange of structured data between different components of a large system. IDL is a notation for describing collections of programs and the data structures through which they communicate. Using IDL, a designer gives abstract descriptions of data structures, together with representation specifications that specialize the abstract structures for particular programs. A tool, the IDL translator, generates readers and writers that map between concrete internal representations and abstract exchange representations." (Lamb, 1987)

We won't concern ourselves with the details here, but schema definition using Thrift looks like:

    struct  Person {
        1: required string     userName,
        2: optional i64        favoriteNumber,
        3: optional list<string>     interests
    }

### Protocol Buffers

**Protocol Buffers**, also known as **ProtoBuf**, was originally developed by [Google](https://developers.google.com/protocol-buffers/) (Google Developers, n.d.) and is another format designed for serializing structured data. Like Thrift, it also provides an interface definition language.

The equivalent schema definition to the one for Thrift, but for Protocol Buffers looks like this:

    message Person {
        required string user_name       = 1;
        optional int64  favorite_number = 2;
        repeated string interests       = 3;
    }
    
ProtoBuf has become particularly popular recently, especially for communications between high volume services where compact messages conserve network bandwidth or where versioning is necessary for 24/7/365 operation.

### Code generation from Protocol Buffers and Thrift schemas

Both Protocol Buffers and Thrift come with a code generation tool that takes a schema definition and produces classes that implement the schema in various programming languages.  Application code can call this generated code to encode/decode records of the schema in a native data structure type for the language.

The big difference between schema-based encodings such as ProtoBuf or Thrift and schemaless binary encodings of JSON is what identifies each field: ProtoBuf or Thrift encoded data contains **field tags** (which are numbers), while schemaless data contains the **field names** themselves. Field tags are much shorter than the field names they represent. The schema tells the parser how to reconstitute the full message.

#### How Protocol Buffers and Thrift enable schema evolution

Each field is identified by its tag number, annotated with a datatype. If a field's value is not set, the field is simply omitted from the encoded record. A field's tag cannot be changed in the schema, since that would invalidate existing encoded data; a new field can be added only by giving it a new tag number.

Forward compatibility is supported because old code can skip any field whose tag it doesn't recognize, with the datatype annotation telling it how many bytes to skip. Backward compatibility is supported as long as each field has a unique tag and every field added after the initial deployment is optional (or has a default value), so that new code can still read old data in which the field is missing.
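
The skipping mechanism can be sketched with a toy tag-length-value format. This is not the actual Thrift or Protocol Buffers wire format; the one-byte tag, one-byte length layout and the functions `encode` and `decode` are simplifications invented for this example.

```python
import json
import struct

def encode(fields: dict) -> bytes:
    """Encode {tag: value_bytes} as: tag byte, length byte, value bytes."""
    out = bytearray()
    for tag, value in fields.items():
        out += struct.pack("BB", tag, len(value)) + value
    return bytes(out)

def decode(data: bytes, known_tags: dict) -> dict:
    """Decode the fields named in known_tags and skip any unrecognized tag."""
    result, pos = {}, 0
    while pos < len(data):
        tag, length = struct.unpack_from("BB", data, pos)
        pos += 2
        if tag in known_tags:
            result[known_tags[tag]] = data[pos:pos + length]
        pos += length  # the length says how far to advance, known tag or not
    return result

# A newer writer adds tag 4; the older reader's schema only knows tags 1 and 2.
message = encode({1: b"Martin", 2: struct.pack(">q", 1337), 4: b"Canada"})
old_schema = {1: "userName", 2: "favoriteNumber"}

decoded = decode(message, old_schema)
print(decoded["userName"].decode("utf-8"))
print(struct.unpack(">q", decoded["favoriteNumber"])[0])

as_json = json.dumps({"userName": "Martin", "favoriteNumber": 1337, "country": "Canada"})
print(len(message), "bytes with tags versus", len(as_json), "bytes as JSON text")
```

The old reader recovers the fields it understands and steps over tag 4 without error, and the tagged encoding is shorter because no field names are transmitted.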

### Avro

Avro began as a sub-project of Hadoop (Apache Software Foundation, 2018). It has two schema languages:

1. Interface definition language (IDL) for human editing<br><br>

2. JSON based for machine consumption
    
Here are examples of both:

IDL:
    
    record Person {
        string    userName;
        union {null, long}    favoriteNumber = null;
        array<string>    interests;
    }


JSON:

    {
        "type": "record",
        "name": "Person",
        "fields": [
            {"name": "userName",       "type": "string"},
            {"name": "favoriteNumber", "type": ["null", "long"], "default": null},
            {"name": "interests",      "type": {"type": "array", "items": "string"}}
        ]
    }

Avro achieves forward and backward compatibility in a different way than Thrift and ProtoBuf. It doesn't use tag numbers in the schema. In fact, the encoded data doesn't contain any information to identify fields or their datatypes: the values are simply concatenated together in the order the schema lists them (a string, for example, is encoded as a length prefix followed by its UTF-8 bytes). It's very compact.

Avro doesn't require code generation. To decode the data, however, the reader must know the exact schema with which the data was written (the writer's schema).

For schema evolution, the reader's schema and the writer's schema do not have to be the same, only compatible, as the Avro library handles translation. The order of fields doesn't matter, as the Avro library matches fields by name using the reader's and writer's schemas. It uses a default value if a field is expected in the reader's schema but is not provided by the writer's schema. It ignores fields that appear in the writer's schema but not in the reader's schema. To maintain compatibility, a programmer can only add or remove fields that have a default value, which the reader can fill in for the missing value if it requires it.
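
The resolution rules just described can be sketched in plain Python. This is an illustration of the idea only, not the Avro library or its binary format; the simplified schema lists and the `resolve` function are hypothetical.

```python
WRITER_SCHEMA = [
    {"name": "userName"},
    {"name": "favoriteNumber"},
    {"name": "interests"},
]
READER_SCHEMA = [
    {"name": "interests"},                      # different field order
    {"name": "userName"},
    {"name": "country", "default": "unknown"},  # never written by the writer
]                                               # favoriteNumber is not wanted

def resolve(values: list, writer_schema: list, reader_schema: list) -> dict:
    """Label unnamed values with the writer's schema, then reshape for the reader."""
    written = {f["name"]: v for f, v in zip(writer_schema, values)}
    record = {}
    for f in reader_schema:
        if f["name"] in written:
            record[f["name"]] = written[f["name"]]
        elif "default" in f:
            record[f["name"]] = f["default"]
        else:
            raise ValueError(f"no value or default for field {f['name']!r}")
    return record

# The encoded data carries values only, in the writer's field order.
encoded_values = ["Martin", 1337, ["daydreaming", "hacking"]]
print(resolve(encoded_values, WRITER_SCHEMA, READER_SCHEMA))
```

Without `WRITER_SCHEMA` the bare list of values could not be interpreted at all, which is why the reader always needs access to the writer's schema.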

#### How to Communicate the Writer Schema in Avro

If you're using Hadoop, you probably have millions of records stored in large files. Each of these files includes the writer's schema once, at the beginning of the file. In a store (such as a database) where the records weren't all written at the same time, different records could have been written using different writer schemas. By including a version number at the beginning of each encoded record and keeping a list of schema versions alongside the records, individual records can be correctly decoded at any time.

# Data integration patterns

Over the history of networked computers, a number of common patterns have emerged for sharing data between intercommunicating processes.

## The Point-to-Point Pattern

The expression **point-to-point pattern** usually refers to hand-crafted one-off integrations, each between a single sender and receiver. These are either old-style simple file or record transfers, or can make use of more modern protocols and tools, but are single-purpose and are not intended to support sharing by processes other than the two they were written for.

### Advantages and disadvantages of Point-to-Point

Point-to-point integrations are still the most common type:

* They're considered expedient because using the pattern avoids the cognitive burden of learning to use more sophisticated tools and patterns
* They can potentially be made very efficient, as they can avoid the overhead of a broker or general-purpose translator

There are also some disadvantages of using this pattern:

* They are not (in general) reusable
* They tightly couple the source and destination of the data (i.e. changes to the data being transferred usually imply changes to both the sender and receiver)
* They do not benefit from the advantages of using an integration broker, which can provide services such as automatic encryption/decryption, prioritization of traffic, guaranteed message delivery, message replay, etc. (Integration brokers are described below.)

## ETL (Extract, Transform, Load)

*ETL* is a broad term (originating from the relational database community) for batch data transfer where data is extracted from various source systems and placed into a centralized data store of some kind, such as a data warehouse. In a *big data* environment, the data could be collected into Amazon S3 (Simple Storage Service), a data lake, a lakehouse or another mass storage technology. The term predates the rise of the API economy, so it is based on the assumption that there are no existing APIs we can use to get the data from the various systems we want to aggregate from.

The pattern is named for its three stages:

- **Extract**: Data is usually extracted from several sources with varying formats e.g. relational databases, XML files, flat files, non-relational or NoSQL databases.  


- **Transform**: This extracted data is then cleaned and converted into a usable form by applying a series of rules/transformations that describe how to convert the data to a form that is acceptable to the receiver. Transformations typically include cleaning, type conversion, sorting, aggregation, and simple arithmetic and string manipulation, but can include more complex transformations.  


- **Load**: The transformed data is then loaded into the target database.

There are a variety of software tools available to assist with this process.  These ETL tools typically include adapters for reading and writing data from/to the most popular database management systems and may also provide point-and-click mapping tools to describe simple transformations quickly and easily without custom coding.
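The three stages can be illustrated with a minimal sketch using only the Python standard library. The source data, the table name `customer_totals` and the transformation rules are all hypothetical, chosen only to show cleaning, type conversion, aggregation and sorting; a real ETL tool would add adapters, scheduling and error handling.

```python
import csv
import io
import sqlite3

# Hypothetical source data: a flat file exported from a source system.
SOURCE_CSV = """customer,amount,currency
 alice ,10.50,usd
BOB,3.25,usd
alice,4.50,usd
"""

def extract(text):
    """Extract: read raw rows from the flat-file source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: clean names, convert types, and aggregate per customer."""
    totals = {}
    for row in rows:
        name = row["customer"].strip().lower()          # cleaning
        amount = float(row["amount"])                   # type conversion
        totals[name] = totals.get(name, 0.0) + amount   # aggregation
    return sorted(totals.items())                       # sorting

def load(records, conn):
    """Load: write the transformed records into the target database."""
    conn.execute("CREATE TABLE IF NOT EXISTS customer_totals (customer TEXT, total REAL)")
    conn.executemany("INSERT INTO customer_totals VALUES (?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")  # an in-memory database stands in for the warehouse
load(transform(extract(SOURCE_CSV)), conn)
result = conn.execute("SELECT customer, total FROM customer_totals ORDER BY customer").fetchall()
print(result)  # [('alice', 15.0), ('bob', 3.25)]
```

Keeping each stage as a separate function mirrors how ETL tools let you swap a source adapter or a target database without touching the transformation rules.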

## Publish & Subscribe

*Publish and subscribe*, also popularly known as **pub-sub**, is a many-to-many messaging pattern where messages are designed to have a well-documented structure that is not tied to how any single receiver will use them. Instead, the messages are designed to be general and are *broadcast* by sending them to a special kind of broker application that makes them available to any other applications that have registered with the broker to receive them. Senders of messages are called **publishers** and receivers are called **subscribers**.  For this to work, either the message structure must be fully self-describing or the publisher must provide a schema to all subscribers.  This is the pattern that public APIs use.

One such broker application that is particularly popular for big data applications is **Kafka**. Kafka classifies messages from publishers into classes (or topics). Subscribers receive messages in one or more topics of interest, without necessarily knowing where the messages originated.

The publishers push messages to topics and subscribers who have subscribed to the topic are notified whenever a message arrives.

This kind of pattern reduces the coupling between publishers and subscribers &mdash; a subscriber can subscribe or unsubscribe at any time without it impacting the publishers.
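The decoupling described above can be sketched with a toy in-memory broker. This is an illustration only: the `Broker` class, the `orders` topic and the handler functions are hypothetical, and this is not how Kafka's API looks. A real broker adds persistence, networking and delivery guarantees.

```python
from collections import defaultdict

class Broker:
    """A toy in-memory broker that routes messages by topic."""
    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def unsubscribe(self, topic, callback):
        self._subscribers[topic].remove(callback)

    def publish(self, topic, message):
        # The publisher does not know who (if anyone) receives the message.
        for callback in list(self._subscribers[topic]):
            callback(message)

billing_inbox, shipping_inbox = [], []

def billing_handler(message):
    billing_inbox.append(message)

def shipping_handler(message):
    shipping_inbox.append(message)

broker = Broker()
broker.subscribe("orders", billing_handler)
broker.subscribe("orders", shipping_handler)
broker.publish("orders", {"order_id": 1})

# A subscriber can leave at any time; the publisher's code does not change.
broker.unsubscribe("orders", shipping_handler)
broker.publish("orders", {"order_id": 2})

print(len(billing_inbox), len(shipping_inbox))  # 2 1
```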

This pattern does have some unique challenges however. The timing and order in which messages arrive at the subscribers is not usually well-defined and it is possible for the subscribers to receive out-of-order messages. The receiving applications typically need to be written to still work correctly if this happens.
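One common way to tolerate out-of-order delivery is to have the publisher attach a version (or sequence) number to each message so that subscribers can discard stale updates. The sketch below assumes such a `version` field exists; the field names and the `PriceCache` class are hypothetical.

```python
class PriceCache:
    """Subscriber that tolerates out-of-order delivery using a per-item version number."""
    def __init__(self):
        self.prices = {}    # item -> latest known price
        self.versions = {}  # item -> version of that price

    def on_message(self, msg):
        # Ignore any update that is not newer than the one already applied.
        if msg["version"] <= self.versions.get(msg["item"], -1):
            return False
        self.prices[msg["item"]] = msg["price"]
        self.versions[msg["item"]] = msg["version"]
        return True

cache = PriceCache()
cache.on_message({"item": "widget", "version": 2, "price": 9.99})   # newer update arrives first
cache.on_message({"item": "widget", "version": 1, "price": 12.00})  # stale update arrives late
print(cache.prices)  # {'widget': 9.99}
```

This "latest version wins" approach suits messages that carry complete state. Messages that describe increments (e.g. "add 5") would instead need to be buffered and applied in sequence.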

## Service Oriented Architecture and Web Services

### Service-Oriented Architecture (SOA)

The patterns above all existed prior to the internet and were used for inter-application communications within companies and to a limited extent over proprietary networks with trading partners.  With the rise of the internet it became apparent that a more standard-based point-to-point approach was needed to enable the *API economy* vision.

SOA is an architectural style that is based on the idea that applications should be organized like a community of interacting service providers: all code should be modularized into single-purpose functions, and these functions should be invokable through published web APIs. The manner in which each function is programmed (what programming language it is in, what specific machine it's running on, and the details of how it operates internally) should be irrelevant to the user of the function.  These functions are called **services**, hence **service-oriented architecture**.

### Web Services

It's important to distinguish between the terms **services** and **web services**.  Web services are a particular style of SOA.  Once the SOA concept emerged, standards bodies, driven primarily by the major software vendors, worked towards creating a set of detailed protocols as to how these services should interact with each other.  Conceptually, they wanted to create the API economy by enabling systems on the internet to discover, connect to each other and intercommunicate without any prior knowledge of each other's schemas.  The vision was that, for example, a car manufacturer's system could reach out onto the net and find tire suppliers, negotiate prices with them and place an order, all without human intervention. Or perhaps it might require *a lot* of intervention, passing through many levels of approval requiring messages to move around from person to person as they work their way through an organization. Even if that vision was a bit far-fetched, web services promised to at minimum allow packaged software products (such as ERPs) from the various vendors to connect to each other and share data with minimal fuss.

For a while in the early 2000s, SOA and web services were almost synonymous. However, three issues arose.

First, the standards became so complex over time that few programmers understood how to use them.  The standards bodies tried to build more and more intelligence into the messages via metadata to the point where the messages were more metadata than data. 

Second, it became apparent over time that although the vendors were claiming that web services would allow their products to interoperate with each other, they in the end didn't really want them to.  They liked to be able to tell new customers that their systems were open, but once a customer had some of their product set, they wanted their clients to buy more from them and not mix-and-match with other vendors' products.

Third, it became possible to run JavaScript on servers (whereas previously it only ran in browsers) and it was a trivial exercise to serialize and deserialize JavaScript objects to move data between systems written in JavaScript.  Since most programming languages already had a JSON-like data structure (e.g. Python dictionaries), it was fairly easy to use JSON for messaging regardless of the programming language the services were written in.  It wasn't long before open source developers provided standard libraries to make it easy to use JSON across all languages and even complex integrations.  It was a bonus that JSON was much easier to read than the XML that web services required.
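To illustrate how little ceremony JSON messaging needs, here is a minimal round trip using Python's standard `json` module. The `order` structure is a made-up example, not taken from any real API.

```python
import json

# A hypothetical message built from ordinary Python dictionaries and lists.
order = {
    "order_id": 1001,
    "customer": "Alice",
    "items": [{"sku": "T-205", "qty": 4}],
    "paid": False,
}

message = json.dumps(order)     # serialize to a JSON string for transmission
received = json.loads(message)  # the receiving service deserializes it

print(message)
print(received == order)  # True
```

The receiver could equally be written in JavaScript, Java or Go; only the JSON text crosses the wire.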

In January 2009, Anne Thomas Manes of Burton Group (a research firm later acquired by Gartner) wrote a watershed article titled "SOA is Dead; Long Live Services", which stated for the record what was becoming obvious, although the title doesn't seem quite right: it should arguably have been called "Web services are dead; long live SOA".  You can still find applications that use web services but they are becoming increasingly rare.

## Microservices

We've discussed how in the mid-2010s, internet companies were looking for ways of incrementally deploying new features without incurring downtime &mdash; if you're Google you can't go down over the weekend to do an upgrade. As a result, the idea of systems as interacting, small, autonomous services gained a lot of traction. **Microservices** are just services as we've described them above, but with an emphasis on truly making them small and single purpose, independently deployable/upgradable, and with minimal dependencies on integration brokers (sometimes called the "smart endpoints and dumb pipes" approach).

Sam Newman, author of *Building Microservices* (Newman, 2015), distilled the essence of this kind of service design into a set of eight principles for building microservices:
  
- **Modeled after Business Domain**: Each service should have a meaningful name and a very specific business purpose (i.e. they should be small; hence *micro*services). The focus should be on designing high quality APIs that make them easy-to-use.


- **Embracing a Culture of Automation**: This requires investments in infrastructure automation, automatic testing and continuous delivery to reduce the risk of human error when deploying updates to production.


- **Hide Implementation Details**: All data access should be through APIs; no direct access to databases should be available.


- **Decentralize All the Things**: Microservices should be able to send messages directly between each other without needing a central EAI broker.


- **Deploy Independently**: It should be possible to update a single microservice in production without needing to update any other microservices simultaneously. Allowance should be made for the coexistence of multiple versions of endpoints so the switchover to new versions can be gradual.


- **Consumer First**: API documentation is critical.


- **Isolate Failure**: Reasonable timeouts for microservices are extremely important. If a call to a service doesn't produce a response in a reasonable amount of time, the calling program shouldn't hang.  It should wake up again and produce an error message for the user or take some other appropriate action.


- **Highly Observable**: Microservices should be designed to provide an ongoing stream of information about their health and raise alerts that go to the operations team if errors are occurring.

Most large systems built today follow many or all of these principles.
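As an illustration of the **Isolate Failure** principle, the sketch below wraps a call to a slow service so that the caller gives up after a fixed time instead of hanging. The service function and the helper `call_with_timeout` are hypothetical. In practice, most HTTP client libraries accept a timeout argument directly (for example, `urllib.request.urlopen(url, timeout=...)` in the standard library), which is usually the simpler choice.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def slow_inventory_service(item_id):
    """Hypothetical stand-in for a remote service that has become slow."""
    time.sleep(0.5)
    return {"item_id": item_id, "in_stock": True}

def call_with_timeout(service, arg, timeout_s):
    """Call service(arg), but stop waiting after timeout_s seconds."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(service, arg)
    try:
        return {"ok": True, "result": future.result(timeout=timeout_s)}
    except FutureTimeout:
        # The caller stops waiting and reports an error instead of hanging.
        return {"ok": False, "error": f"no response within {timeout_s}s"}
    finally:
        # Do not block on the abandoned call; the worker finishes in the background.
        executor.shutdown(wait=False)

response = call_with_timeout(slow_inventory_service, "sku-42", timeout_s=0.1)
print(response)  # {'ok': False, 'error': 'no response within 0.1s'}
```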

# Integration Tools

This section focuses on intermediary systems, typically purchased as software products from software vendors, that facilitate communication of messages between application systems.

## Integration Brokers

**Integration brokers** are systems that act as a bridge between two or more intercommunicating systems. An integration broker is a program or module that validates, transforms and/or routes data. It is often used to translate protocols between systems. Integration brokers typically provide asynchronous delivery by queuing messages between systems. The advantage brokers have over one-off integrations is that they provide standard adapters for a wide variety of databases and enterprise software such as ERP and CRM systems (similar to ETL tools), but with better guarantees for delivery despite temporary network issues.  The messages are not only queued but are also retried if the receiving system fails. They also support security features for encryption/decryption of data while en route.

Examples of integration brokers include Apache ActiveMQ and IBM MQ.
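The queue-and-retry behaviour described above can be sketched in a few lines. This is a toy model, not the API of ActiveMQ or IBM MQ: the `FlakyReceiver` class, the `drain` function and the `max_attempts` parameter are hypothetical names chosen for illustration.

```python
from collections import deque

class FlakyReceiver:
    """Hypothetical receiving system that is unavailable for its first few calls."""
    def __init__(self, failures_before_success):
        self.remaining_failures = failures_before_success
        self.received = []

    def deliver(self, message):
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("receiver temporarily unavailable")
        self.received.append(message)

def drain(queue, receiver, max_attempts=5):
    """Deliver queued messages in order, retrying each one on failure."""
    dead_letters = []
    while queue:
        message = queue.popleft()
        for _ in range(max_attempts):
            try:
                receiver.deliver(message)
                break  # delivered; move on to the next message
            except ConnectionError:
                continue  # a real broker would wait (back off) before retrying
        else:
            dead_letters.append(message)  # gave up after max_attempts
    return dead_letters

queue = deque([{"id": 1}, {"id": 2}, {"id": 3}])
receiver = FlakyReceiver(failures_before_success=2)
undelivered = drain(queue, receiver)
print(receiver.received, undelivered)
```

Here the sender simply enqueues messages and moves on; the retry logic lives in the broker, which is exactly the responsibility that the microservices approach later pushes back to the endpoints.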

## Enterprise Service Buses (ESBs)

As the internet became ubiquitous and it became apparent that reliable many-to-many interactions were becoming a must, second-generation enterprise application integration tools began to support standards-based messaging protocols, initially based on web services, then JSON. Earlier integration brokers had often used proprietary transport protocols that resulted in incompatibilities between different vendors' brokers.  The name **Enterprise Service Bus** was coined, patterned after the idea of a bus from electronics, i.e. a kind of superhighway for moving data between plug-and-play modules.

Their features typically include:

* Delivery guarantees
* Pushing messages as events to registered listeners using the publish and subscribe pattern
* Procedural or rules-based workflow management (sometimes called **orchestration**)
* Quality of Service (QoS) management for prioritization of message delivery and/or rate throttling
* Governance management (using policy-based constraints to define who can send what kinds of messages)
* Encryption/decryption
* Protocol conversion

Examples of ESBs include Apache Camel, Microsoft BizTalk Server, IBM App Connect, and Oracle Service Bus.

Many integration broker products already on the market when the ESB concept emerged were expanded with these additional features and rebranded as ESBs.  At the same time many of the products previously described as ETL tools added integration broker capabilities and similarly rebranded.

## The Distributed Actor Framework

There is an alternative many-to-many broker-free integration pattern that is well-suited to distributed big data and microservice applications. It is a concurrent programming framework which uses small, independent software objects called **actors**. Actors communicate via asynchronous message-passing (similar to the publish & subscribe model).

The key idea when using actors however is that all shared mutable data in a system of interacting applications must be encapsulated inside actors within the applications. The actors ensure only one thread can update any piece of data at a time. They provide a safe way to share mutable state between asynchronous processes i.e. they can make data changes transactional with little fuss.  They also scale well to huge systems because each actor takes very little memory.  An application can contain millions of actors with no problem.

Actors work similarly to reading emails from one's email mailbox: they accumulate messages they haven't got to reading yet, and they process one and only one message at a time, (normally) in the order they arrived. Because messages are processed atomically they are like relational database transactions and provide similar guarantees.
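The mailbox idea can be sketched with the standard library's `queue` and `threading` modules. This is a toy illustration, not the API of Akka, Orleans or Erlang OTP: the `CounterActor` class and its message names are hypothetical. The point is that the counter is only ever modified by the actor's own thread, so no locks are needed even with several concurrent senders.

```python
import queue
import threading

class CounterActor:
    """A toy actor: private state, a mailbox, and one thread that processes messages in order."""
    def __init__(self):
        self._count = 0                # only the actor's own thread touches this
        self._mailbox = queue.Queue()  # messages wait here until processed
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, message, reply_to=None):
        """Asynchronously drop a message in the mailbox; the sender does not wait."""
        self._mailbox.put((message, reply_to))

    def _run(self):
        while True:
            message, reply_to = self._mailbox.get()  # one message at a time
            if message == "increment":
                self._count += 1
            elif message == "get":
                reply_to.put(self._count)
            elif message == "stop":
                break

actor = CounterActor()

def worker():
    for _ in range(1000):
        actor.send("increment")

# Four threads send messages concurrently.
threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

reply = queue.Queue()
actor.send("get", reply_to=reply)
total = reply.get(timeout=5)
print(total)  # 4000
actor.send("stop")
```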

Three popular actor frameworks to explore if you would like to know more are: Akka, Orleans and Erlang OTP.

## Microservices and integration brokers

Integration brokers were very popular in the early 2000s but are becoming less so since the rise of microservices. Microservices are intended to follow the "Decentralize All the Things" principle, which recommends that services not depend on a broker to manage message delivery for them, but should handle communications failures themselves.  The notion is that the author of a sending service has a better understanding of what the appropriate service behaviour should be in the face of failure (e.g. retry immediately, retry later, circuit break, inform the user that it wasn't successful, try an alternative service provider, reverse a partial transaction, etc.) than a generic broker would or could know.  So newer services tend to rely less on brokers, but in a large company you will likely see a mix of brokers being used for some integrations, and microservices-style direct point-to-point used for others.
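As one example of a sender handling failure itself, here is a minimal sketch of the circuit-breaker idea mentioned above: after a number of consecutive failures the sender stops calling the service for a cool-down period and fails fast instead. The class, its parameters (`failure_threshold`, `reset_after_s`) and the payment service are all hypothetical; production libraries offer more refined behaviour.

```python
import time

class CircuitOpenError(Exception):
    """Raised when calls are being short-circuited."""

class CircuitBreaker:
    def __init__(self, call, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self._call = call
        self._failure_threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed (calls allowed)

    def __call__(self, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after_s:
                raise CircuitOpenError("circuit open; not calling the service")
            self._opened_at = None  # cool-down elapsed: allow a trial call
        try:
            result = self._call(*args, **kwargs)
        except ConnectionError:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                self._opened_at = self._clock()  # open the circuit
            raise
        self._failures = 0  # any success resets the failure count
        return result

calls = []

def flaky_payment_service(amount):
    """Hypothetical remote service that is currently unreachable."""
    calls.append(amount)
    raise ConnectionError("payment service unreachable")

charge = CircuitBreaker(flaky_payment_service, failure_threshold=2)
outcomes = []
for _ in range(4):
    try:
        charge(10)
    except CircuitOpenError:
        outcomes.append("skipped")
    except ConnectionError:
        outcomes.append("failed")

print(outcomes, len(calls))  # ['failed', 'failed', 'skipped', 'skipped'] 2
```

The calling service decides what "skipped" means for its users (show an error, queue the charge for later, or try another provider), which is the kind of decision a generic broker cannot make on its behalf.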


**End of Module**

You have reached the end of this module.

If you have any questions, please reach out to your peers using the discussion boards. If you and your peers are unable to come to a suitable conclusion, do not hesitate to reach out to your instructor on the designated discussion board.

When you are comfortable with the content, you may proceed to the next module.

# References

- Apache Software Foundation (2017). Apache Thrift™. Retrieved November 28, 2018 from https://thrift.apache.org/   


- Apache Software Foundation (2018). Avro™. Retrieved December 13, 2018 from https://avro.apache.org/  


- Arpaci-Dusseau, R.H.; Arpaci-Dusseau, A.C., (2015). Chapter 48: Distributed Systems in Operating Systems: Three Easy Pieces. Retrieved December 13, 2018 from http://pages.cs.wisc.edu/~remzi/OSTEP/  


- Google Developers (n.d.). Protocol Buffers. Retrieved December 13, 2018 from https://developers.google.com/protocol-buffers/  


- Health Level Seven® International (2018). Introduction to HL7 Standards. Retrieved December 12, 2018 from http://www.hl7.org/implement/standards/  


- Lamb, D.A. (1987). IDL: Sharing Intermediate Representations. Transactions on Programming Languages and Systems (TOPLAS), 9(3), pp 297-318. DOI: 10.1145/24039.24040


- Malinverno, P. & O'Neill, M., (2016). Magic Quadrant for Full Life Cycle API Management. Published: 27 October 2016 ID: G00277632  


- Merriam-Webster, (2018). Metadata. Retrieved November 28, 2018 from https://www.merriam-webster.com/dictionary/metadata  


- Newman, S. (2015). Building Microservices: Designing Fine-Grained Systems. O’Reilly Media. http://shop.oreilly.com/product/0636920033158.do  


- OpenGroup (2016). Service-Oriented Architecture – What is SOA? Retrieved November 28, 2018 http://www.opengroup.org/soa/source-book/soa/p1.htm  


- Riley, J. (2017). Understanding Metadata: What is Metadata, and What is it For? National Information Standards Organization (NISO). Retrieved December 12, 2018 from https://groups.niso.org/apps/group_public/download.php/17446/Understanding%20Metadata.pdf. Creative Commons Attribution Non-Commercial 4.0 International license.


- Amazon Web Services (2019). What is a data lake? Retrieved January 3, 2019 from https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/