# Data Engineering Pipeline

A data pipeline is a set of tools and processes for performing data integration.

## Data Ingestion

Data ingestion is the transportation of data from assorted sources to a storage medium where it can be accessed, used, and analyzed by an organization. The destination is typically a data warehouse. 

Batch vs. streaming ingestion:

- The most common kind of data ingestion is batch processing. Here, the ingestion layer periodically collects and groups source data and sends it to the destination system. Groups may be processed based on any logical ordering, the activation of certain conditions, or a simple schedule. When near-real-time data is not important, batch processing is typically used, since it is generally easier and cheaper to implement than streaming ingestion.

- Real-time processing (also called stream processing or streaming) involves no grouping at all. Data is sourced, manipulated, and loaded as soon as it’s created or recognized by the data ingestion layer. This kind of ingestion is more expensive, since it requires systems to constantly monitor sources and accept new information. However, it may be appropriate for analytics that require continually refreshed data.

[ref](https://www.stitchdata.com/resources/data-ingestion/)
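The difference between the two ingestion modes described above can be sketched in a few lines. This is only an illustration: `batch_ingest`, `stream_ingest`, the `load` callback, and the `batch_size` parameter are hypothetical names, and a real ingestion layer would usually group on a schedule or condition rather than on a fixed count.

```python
from typing import Callable, Iterable, List

Record = dict


def batch_ingest(source: Iterable[Record],
                 load: Callable[[List[Record]], None],
                 batch_size: int = 3) -> None:
    """Collect records into groups and load each group in one call."""
    batch: List[Record] = []
    for record in source:
        batch.append(record)
        if len(batch) >= batch_size:
            load(batch)
            batch = []
    if batch:  # flush the final, partially filled group
        load(batch)


def stream_ingest(source: Iterable[Record],
                  load: Callable[[List[Record]], None]) -> None:
    """Load every record as soon as it arrives, with no grouping."""
    for record in source:
        load([record])


events = [{"id": i} for i in range(7)]

batch_calls: List[List[Record]] = []
batch_ingest(events, batch_calls.append, batch_size=3)

stream_calls: List[List[Record]] = []
stream_ingest(events, stream_calls.append)

print(len(batch_calls))   # number of load calls in batch mode
print(len(stream_calls))  # number of load calls in streaming mode
```

Batch mode makes fewer, larger calls to the destination, while streaming mode makes one call per record, which is where the extra cost of constantly accepting new information comes from.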

### Data Ingestion with Singer

[Singer](https://www.singer.io) describes itself as the open-source standard for writing scripts that move data.

> Singer describes how data extraction scripts—called “taps” —and data loading scripts—called “targets”— should communicate, allowing them to be used in any combination to move data from any source to any destination. Send data between databases, web APIs, files, queues, and just about anything else you can think of.

Singer applications communicate with JSON, making them easy to work with and implement in any programming language. Singer also supports JSON Schema to provide rich data types and rigid structure when needed. Many Singer configuration files are also JSON, so it is useful to know how to write configuration details, such as a database address, to a JSON file.

```python
import json

database_address = {
    "host": "10.0.0.5",
    "port": 1234
}

# Open the configuration file in writable mode
with open("data/dp/database_config.json", "w") as fh:
    # Serialize the object in this file handle
    json.dump(obj=database_address, fp=fh)
```

_Note: `json.dumps` serializes the object to a string and returns it, whereas `json.dump` writes the serialized object to a file handle._
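A short sketch of that difference, using an in-memory `io.StringIO` buffer as a stand-in for a real file (the buffer is an assumption made so the example runs without touching disk):

```python
import io
import json

database_address = {"host": "10.0.0.5", "port": 1234}

# json.dumps returns the serialized object as a str
as_text = json.dumps(database_address)

# json.dump writes the same serialization to a file-like object
buffer = io.StringIO()
json.dump(database_address, buffer)

print(as_text)
print(buffer.getvalue() == as_text)
```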

JSON Schema is a vocabulary that allows us to annotate and validate JSON documents. Below is an example JSON schema object.

```python
import singer

# Describe the fields of a product record
schema = {'properties': {
    'brand': {'type': 'string'},
    'model': {'type': 'string'},
    'price': {'type': 'number'},
    'currency': {'type': 'string'},
    'quantity': {'type': 'integer', 'minimum': 1},
    'date': {'type': 'string', 'format': 'date'},
    'countrycode': {'type': 'string', 'pattern': "^[A-Z]{2}$"},
    'store_name': {'type': 'string'}}}

# Write the schema
singer.write_schema(stream_name="products", schema=schema, key_properties=[])
```

Sample data whose items carry fields described by this schema:

```python
{'items': [{'brand': 'Huggies',
            'model': 'newborn',
            'price': 6.8,
            'currency': 'EUR',
            'quantity': 40,
            'date': '2019-02-01',
            'countrycode': 'DE'
            },
           {…}]}
```
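To make the schema keywords above concrete, here is a minimal hand-written check for three of them (`type`, `minimum`, `pattern`). It is a sketch, not a JSON Schema implementation: `check_record` and `PY_TYPES` are hypothetical names, only a subset of the fields is included, and the type check is stricter than the specification in places (for example, it rejects `1.0` as an integer). In practice a dedicated validation library such as `jsonschema` would normally be used instead.

```python
import re

schema = {"properties": {
    "brand": {"type": "string"},
    "price": {"type": "number"},
    "quantity": {"type": "integer", "minimum": 1},
    "countrycode": {"type": "string", "pattern": "^[A-Z]{2}$"},
}}

# Python types accepted for each JSON Schema "type" keyword used here
PY_TYPES = {"string": str, "number": (int, float), "integer": int}


def check_record(record: dict, schema: dict) -> list:
    """Return a list of human-readable violations (empty if none)."""
    errors = []
    for name, rules in schema["properties"].items():
        if name not in record:
            continue  # properties are optional unless listed under "required"
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, PY_TYPES[rules["type"]]):
            errors.append(f"{name}: expected {rules['type']}")
            continue
        if "minimum" in rules and value < rules["minimum"]:
            errors.append(f"{name}: below minimum {rules['minimum']}")
        if "pattern" in rules and not re.search(rules["pattern"], value):
            errors.append(f"{name}: does not match {rules['pattern']}")
    return errors


good = {"brand": "Huggies", "price": 6.8, "quantity": 40, "countrycode": "DE"}
bad = {"brand": "Huggies", "price": "6.8", "quantity": 0, "countrycode": "de"}

print(check_record(good, schema))
print(check_record(bad, schema))
```

The first record passes every check, while the second one violates the `type` rule for `price`, the `minimum` rule for `quantity`, and the `pattern` rule for `countrycode`.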

Singer extracts and loads data with _taps_ and _targets_: scripts that extract and load data, respectively, and that can be written in any programming language.
- Singer Taps: Taps extract data from any source and write it to a standard stream in a JSON-based format.
- Singer Targets: Targets consume data from taps and do something with it, like load it into a file, API or database.

These can easily be mixed and matched to create small data pipelines that move data.

Taps and targets communicate over streams, using three kinds of messages:
- schema (metadata)
- state (process metadata)
- record (data)
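The three message kinds listed above can be illustrated without the Singer library, using only the standard `json` module. In this sketch the `tap` and `target` functions, the `last_date` state key, and the in-memory buffer are all hypothetical; real taps and targets are separate programs connected through standard output and standard input, with each message written as one JSON object per line.

```python
import io
import json


def tap(out) -> None:
    """Hypothetical tap: write SCHEMA, RECORD and STATE messages, one per line."""
    messages = [
        {"type": "SCHEMA", "stream": "products",
         "schema": {"properties": {"brand": {"type": "string"},
                                   "price": {"type": "number"}}},
         "key_properties": []},
        {"type": "RECORD", "stream": "products",
         "record": {"brand": "Huggies", "price": 6.8}},
        {"type": "STATE", "value": {"last_date": "2019-02-01"}},
    ]
    for message in messages:
        out.write(json.dumps(message) + "\n")


def target(lines) -> dict:
    """Hypothetical target: collect records per stream and keep the latest state."""
    schemas, records, state = {}, {}, None
    for line in lines:
        message = json.loads(line)
        if message["type"] == "SCHEMA":
            schemas[message["stream"]] = message["schema"]
        elif message["type"] == "RECORD":
            records.setdefault(message["stream"], []).append(message["record"])
        elif message["type"] == "STATE":
            state = message["value"]
    return {"schemas": schemas, "records": records, "state": state}


# An in-memory buffer stands in for the pipe between the two processes
pipe = io.StringIO()
tap(pipe)
pipe.seek(0)
result = target(pipe)

print(result["records"])
print(result["state"])
```

Because the target only depends on the message format and not on where the messages came from, any tap that emits these messages could be paired with it, which is what makes taps and targets interchangeable.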