## What is a Data Pipeline?
<u>Definition</u>: A series of steps in which data is processed. Depending on the data requirements of each step, some steps may run in parallel. Data pipelines typically run on a schedule, which can be every minute, once an hour, once a day, or once a year, depending on how frequently the data is delivered and how often the data consumers need new insights. Schedules are the most common mechanism for triggering a pipeline run, but external triggers and events can also be used to execute data pipelines.
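
As a rough illustration of "a series of steps, some of which may run in parallel", here is a minimal Python sketch. The step names and the sample records are made up for this example; the only point is that the two load steps do not depend on each other, while the final step needs both.

```python
from concurrent.futures import ThreadPoolExecutor


def load_rides():
    # Hypothetical step: stands in for reading ride events from a source.
    return [{"dock": "A"}, {"dock": "B"}, {"dock": "A"}]


def load_docks():
    # Hypothetical step: stands in for reading dock metadata.
    return {"A": "Main St", "B": "Harbor"}


def count_rides_per_dock(rides, docks):
    # Depends on both loads, so it can only start once they have finished.
    counts = {}
    for ride in rides:
        name = docks[ride["dock"]]
        counts[name] = counts.get(name, 0) + 1
    return counts


def run_pipeline():
    # The two load steps are independent, so they may run in parallel.
    with ThreadPoolExecutor(max_workers=2) as pool:
        rides_future = pool.submit(load_rides)
        docks_future = pool.submit(load_docks)
        rides = rides_future.result()
        docks = docks_future.result()
    return count_rides_per_dock(rides, docks)


if __name__ == "__main__":
    print(run_pipeline())
```

In a real deployment a scheduler or an external event would call something like `run_pipeline()`; here it is simply called once by hand.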
### Real World Data Pipelines
The following are some examples of real-world data pipelines:
* Automated marketing emails
* Real-time pricing in rideshare apps
* Targeted advertising based on browsing history

### Example
Pretend we work at a bikeshare company and want to find out where to build additional bike dock locations.

A data pipeline to accomplish this task would look like:
1. Load application event data from a source such as S3 or Kafka
2. Load the data into an analytic warehouse such as Redshift
3. Perform data transformations that identify high-traffic bike docks so the business can determine where to build additional locations.
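
The three steps above can be sketched in Python. This is only a toy version under stated assumptions: a hard-coded list stands in for the S3 or Kafka source, an in-memory SQLite database stands in for Redshift, and the table and column names (`ride_events`, `dock_traffic`, `dock_id`) are invented for the example.

```python
import sqlite3

# Made-up events standing in for application event data from S3 or Kafka.
EVENTS = [
    ("dock_1", "2019-01-01"),
    ("dock_2", "2019-01-01"),
    ("dock_1", "2019-01-02"),
    ("dock_3", "2019-01-02"),
    ("dock_1", "2019-01-02"),
]


def extract_events():
    # Step 1: load application event data from the source.
    return list(EVENTS)


def load_into_warehouse(conn, events):
    # Step 2: load the raw events into the analytic warehouse
    # (SQLite is only a stand-in for Redshift here).
    conn.execute("CREATE TABLE ride_events (dock_id TEXT, ride_date TEXT)")
    conn.executemany("INSERT INTO ride_events VALUES (?, ?)", events)
    conn.commit()


def rank_docks_by_traffic(conn):
    # Step 3: transform the data to identify the high-traffic docks.
    conn.execute(
        """
        CREATE TABLE dock_traffic AS
        SELECT dock_id, COUNT(*) AS rides
        FROM ride_events
        GROUP BY dock_id
        """
    )
    return conn.execute(
        "SELECT dock_id, rides FROM dock_traffic ORDER BY rides DESC, dock_id"
    ).fetchall()


def run_pipeline():
    conn = sqlite3.connect(":memory:")
    try:
        load_into_warehouse(conn, extract_events())
        return rank_docks_by_traffic(conn)
    finally:
        conn.close()


if __name__ == "__main__":
    for dock_id, rides in run_pipeline():
        print(dock_id, rides)
```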

### QUIZ QUESTION
What is a data pipeline?
- [ ] A visual way of displaying data to business users
- [ ] An algorithm that classifies data.
- [x] A series of steps in which data is processed.
- [ ] A type of database.

### Extract Transform Load (ETL) and Extract Load Transform (ELT)
"ETL is normally a continuous, ongoing process with a well-defined workflow. ETL first extracts data from homogeneous or heterogeneous data sources. Then, data is cleansed, enriched, transformed, and stored either back in the lake or in a data warehouse.

"ELT (Extract, Load, Transform) is a variant of ETL wherein the extracted data is first loaded into the target system. Transformations are performed after the data is loaded into the data warehouse. ELT typically works well when the target system is powerful enough to handle transformations. Analytical databases like Amazon Redshift and Google BigQ."
Source: [Xplenty.com](https://www.xplenty.com/blog/etl-vs-elt/)

This [Quora post](https://www.quora.com/What-is-the-difference-between-the-ETL-and-ELT) is also helpful if you'd like to read more.
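
To make the difference in ordering concrete, here is a small sketch that computes the same result both ways. It assumes an in-memory SQLite database as a stand-in for the target system, and the rows, table names, and cleaning rules are invented for illustration.

```python
import sqlite3

# Made-up raw records; the blank city stands in for dirty source data.
RAW_ROWS = [("  Seattle ", 3), ("portland", 5), ("", 2), ("SEATTLE", 4)]


def clean(rows):
    # Transformation done in application code: trim, lowercase, drop blanks.
    return [(city.strip().lower(), n) for city, n in rows if city.strip()]


def etl(conn):
    # ETL: transform first, then load only the cleaned rows.
    conn.execute("CREATE TABLE rides_etl (city TEXT, rides INTEGER)")
    conn.executemany("INSERT INTO rides_etl VALUES (?, ?)", clean(RAW_ROWS))
    return conn.execute(
        "SELECT city, SUM(rides) FROM rides_etl GROUP BY city ORDER BY city"
    ).fetchall()


def elt(conn):
    # ELT: load the raw rows as-is, then transform inside the target system.
    conn.execute("CREATE TABLE rides_raw (city TEXT, rides INTEGER)")
    conn.executemany("INSERT INTO rides_raw VALUES (?, ?)", RAW_ROWS)
    conn.execute(
        """
        CREATE TABLE rides_elt AS
        SELECT LOWER(TRIM(city)) AS city, rides
        FROM rides_raw
        WHERE TRIM(city) <> ''
        """
    )
    return conn.execute(
        "SELECT city, SUM(rides) FROM rides_elt GROUP BY city ORDER BY city"
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print(etl(conn))
    print(elt(conn))
    conn.close()
```

Both functions end with the same totals. The difference is where the cleaning happens: in the pipeline code before loading (ETL) or as SQL inside the target after loading (ELT), which also leaves the raw rows available in the target.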

### What is S3?
"Amazon S3 has a simple web services interface that you can use to store and retrieve any amount of data, at any time, from anywhere on the web. It gives any developer access to the same highly scalable, reliable, fast, inexpensive data storage infrastructure that Amazon uses to run its own global network of web sites."
Source: [Amazon Web Services Documentation](https://docs.aws.amazon.com/AmazonS3/latest/dev/Welcome.html).

If you want to learn more, start [here](https://docs.aws.amazon.com/AmazonS3/latest/dev/Welcome.html).

### What is Kafka?
"Apache Kafka is an **open-source stream-processing software platform** developed by LinkedIn and donated to the Apache Software Foundation, written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Its storage layer is essentially a massively scalable pub/sub message queue designed as a distributed transaction log, making it highly valuable for enterprise infrastructures to process streaming data."
Source: Wikipedia.

If you want to learn more, start [here](https://kafka.apache.org/intro).

### What is Redshift?
"Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more... The first step to create a data warehouse is to launch a set of nodes, called an Amazon Redshift cluster. After you provision your cluster, you can upload your data set and then perform data analysis queries. Regardless of the size of the data set, Amazon Redshift offers fast query performance using the same SQL-based tools and business intelligence applications that you use today."

If you want to learn more, start [here](https://docs.aws.amazon.com/redshift/latest/mgmt/welcome.html).

In other words, S3 is an example of a final data store where data might be loaded (e.g. in ETL), while Redshift is an example of a data warehouse product, provided specifically by Amazon.




## Data Validation
Data validation is the process of ensuring that data is present, correct, and meaningful. Ensuring the quality of your data through automated validation checks is a critical step in building data pipelines at any organization.

Data validation can be done manually by quality assurance, data engineers, or even data consumers, but it is much better to perform it in an automated fashion. Validation can and should become part of your pipeline definitions.

### What could go wrong?
In our previous bikeshare example we loaded event data, analyzed it, and ranked our busiest locations to determine where to build additional capacity.

* What would happen if the data was wrong?
* What would happen if our system miscalculated the location ranking?
* What if no data was produced at all?

A mistake in our data pipeline can cause serious problems for our business, for our customers, and for anyone who depends on that data. That is why it's important to perform data validation, so we can be confident that the data we produce is accurate and correct.

### Data Validation in Action
In our bikesharing example, we could have added the following validation steps:

After loading from S3 to Redshift:
* Validate that the number of rows in Redshift matches the number of records in S3

Once location business analysis is complete:
* Validate that all locations have a daily visit count greater than 0
* Validate that the number of locations in our output table matches the number of locations in the input table.
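
Checks like these can be written as small functions that raise an error, so a failed check stops the pipeline instead of passing bad data along. The sketch below is one possible shape, not a prescribed implementation: the function names are invented, and the counts are passed in as plain numbers, whereas in practice they would come from queries against S3 and Redshift.

```python
def validate_row_counts(s3_record_count, redshift_row_count):
    """Fail if the load dropped or duplicated records."""
    if s3_record_count != redshift_row_count:
        raise ValueError(
            f"Row count mismatch: {s3_record_count} in S3, "
            f"{redshift_row_count} in Redshift"
        )


def validate_daily_visits(daily_visits):
    """Fail if any location has a daily visit count of 0 or less."""
    bad = sorted(loc for loc, visits in daily_visits.items() if visits <= 0)
    if bad:
        raise ValueError(f"Locations without visits: {bad}")


def validate_location_counts(input_locations, output_locations):
    """Fail if the analysis lost or invented locations."""
    n_in, n_out = len(set(input_locations)), len(set(output_locations))
    if n_in != n_out:
        raise ValueError(f"Expected {n_in} locations, found {n_out}")


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    visits = {"dock_1": 12, "dock_2": 7}
    validate_row_counts(19, 19)
    validate_daily_visits(visits)
    validate_location_counts(["dock_1", "dock_2"], visits.keys())
    print("all checks passed")
```

Raising an exception is a deliberate choice here: most schedulers treat an uncaught error as a failed step, which makes the validation part of the pipeline definition rather than a separate manual task.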

### Why is it important?
* Data pipelines provide a set of logical guidelines and a common set of terminology.
* The conceptual framework of data pipelines will help you better organize and execute everyday data engineering tasks.

### QUIZ QUESTION
Which of the following are examples of data validation?
- [x] Ensuring that the number of rows in Redshift matches the number of records in S3
- [x] Ensuring that the number of rows in a table is greater than zero
- [ ] Ensuring that the output table matches the needs of the data consumer.