## Data Lineage
### Definition
The data lineage of a dataset describes the discrete steps involved in the creation, movement, and calculation of that dataset.

### Why is Data Lineage important?
1. **Instilling Confidence**: Being able to describe the data lineage of a particular dataset or analysis will build confidence in data consumers (engineers, analysts, data scientists, etc.) that our data pipeline is creating meaningful results using the correct datasets. If the data lineage is unclear, it's less likely that the data consumers will trust or use the data.
2. **Defining Metrics**: Another major benefit of surfacing data lineage is that it allows everyone in the organization to agree on the definition of how a particular metric is calculated.
3. **Debugging**: Data lineage helps data engineers track down the root of errors when they occur. If each step of the data movement and transformation process is well described, problems are much easier to locate.

In general, data lineage has important implications for a business. Each department or business unit's success is tied to data and to the flow of data between departments. For example, sales departments rely on data to make sales forecasts, while at the same time the finance department needs to track sales and revenue. Each of these departments and roles depends on data, and on knowing where to find it. Data flow and data lineage tools enable data engineers and architects to track the flow of this large web of data.
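
As a minimal sketch of what a lineage record can look like, the snippet below models each dataset as a node that lists the datasets it was derived from, then walks that graph to recover every upstream source. The dataset names and the `LINEAGE` mapping are hypothetical, chosen to mirror the bikeshare example used in the exercises; this is not how Airflow itself stores lineage.

```python
# Hypothetical lineage map: each dataset points to the datasets it was built from.
LINEAGE = {
    "s3_trips_csv": [],
    "s3_stations_csv": [],
    "trips": ["s3_trips_csv"],
    "stations": ["s3_stations_csv"],
    "station_traffic": ["trips"],
}


def upstream_of(dataset, lineage):
    """Return every dataset that `dataset` depends on, directly or indirectly."""
    seen = []
    stack = list(lineage.get(dataset, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.append(parent)
            # Keep walking towards the original sources.
            stack.extend(lineage.get(parent, []))
    return seen


print(upstream_of("station_traffic", LINEAGE))
```

When a number in `station_traffic` looks wrong, a walk like this tells an engineer which tables and files to inspect first.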

## Visualizing Data Lineage
As we already know, Airflow parses our DAG and surfaces a visualization of the graph. Airflow also keeps track of all the runs of a particular DAG and shows us the rendered code for each task. One thing to keep in mind: Airflow keeps a record of historical DAG runs and task executions, but it does not store the data from those historical runs. Whatever the latest code in your DAG definition is, that is what Airflow will render for you in the browser. So if you modify the DAG, the code shown for historical runs changes as well; be careful about assuming that what you see is what actually ran in the past.
To see a demo, watch this [tutorial](https://www.youtube.com/watch?v=1IGTicTXeUQ) by Udacity on YouTube.

### QUESTION 1 OF 3
What is the data lineage of a dataset?
- [ ] The starting location of a dataset
- [x] Description of the discrete steps involved in the creation, movement, and calculation of that dataset
- [ ] The final destination of a dataset
- [ ] The calculation steps involved in producing a dataset

### QUESTION 2 OF 3
Which of the following are benefits of visualizing data lineage?
- [ ] Allows business users to modify our data pipelines
- [x] Builds confidence in our users that our data pipelines are designed properly
- [x] Helps organizations surface and agree on dataset definitions
- [ ] Provides easy access to credential management
- [x] Makes locating errors more obvious

### QUESTION 3 OF 3
Which components of Airflow can be used to track data lineage?
- [ ] Connections Configuration Page
- [x] Rendered code tab for a task
- [x] Graph view for a DAG
- [x] Historical runs under the tree view
- [ ] Airflow home page


## Creating exercises folder

In [1]:
import os

if not os.path.exists('exercises'):
    os.makedirs('exercises')

## Exercise 1: Data Lineage in Airflow

**Instructions**:
1. Run the DAG as it is first, and observe the Airflow UI
2. Next, open up the DAG and add the copy and load tasks
3. Reload the Airflow UI and run the DAG once more, observing the Airflow UI

In [2]:
%%writefile exercises/lesson2_exercise1.py

import datetime
import logging

from airflow import DAG
from airflow.contrib.hooks.aws_hook import AwsHook
from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.postgres_operator import PostgresOperator
from airflow.operators.python_operator import PythonOperator

import sql_statements

def load_trip_data_to_redshift(*args, **kwargs):
    aws_hook = AwsHook("aws_credentials")
    credentials = aws_hook.get_credentials()
    redshift_hook = PostgresHook("redshift")
    sql_stmt = sql_statements.COPY_ALL_TRIPS_SQL.format(
        credentials.access_key,
        credentials.secret_key,
    )
    redshift_hook.run(sql_stmt)


def load_station_data_to_redshift(*args, **kwargs):
    aws_hook = AwsHook("aws_credentials")
    credentials = aws_hook.get_credentials()
    redshift_hook = PostgresHook("redshift")
    sql_stmt = sql_statements.COPY_STATIONS_SQL.format(
        credentials.access_key,
        credentials.secret_key,
    )
    redshift_hook.run(sql_stmt)


dag = DAG(
    'lesson2.exercise1',
    start_date=datetime.datetime.now() - datetime.timedelta(days=1)
)

create_trips_table = PostgresOperator(
    task_id="create_trips_table",
    dag=dag,
    postgres_conn_id="redshift",
    sql=sql_statements.CREATE_TRIPS_TABLE_SQL
)

copy_trips_task = PythonOperator(
    task_id='load_trips_from_s3_to_redshift',
    dag=dag,
    python_callable=load_trip_data_to_redshift,
)

create_stations_table = PostgresOperator(
    task_id="create_stations_table",
    dag=dag,
    postgres_conn_id="redshift",
    sql=sql_statements.CREATE_STATIONS_TABLE_SQL,
)

copy_stations_task = PythonOperator(
    task_id='load_stations_from_s3_to_redshift',
    dag=dag,
    python_callable=load_station_data_to_redshift,
)

create_trips_table >> copy_trips_task

Overwriting exercises/lesson2_exercise1.py


In [3]:
%%writefile exercises/sql_statements.py

CREATE_TRIPS_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS trips (
trip_id INTEGER NOT NULL,
start_time TIMESTAMP NOT NULL,
end_time TIMESTAMP NOT NULL,
bikeid INTEGER NOT NULL,
tripduration DECIMAL(16,2) NOT NULL,
from_station_id INTEGER NOT NULL,
from_station_name VARCHAR(100) NOT NULL,
to_station_id INTEGER NOT NULL,
to_station_name VARCHAR(100) NOT NULL,
usertype VARCHAR(20),
gender VARCHAR(6),
birthyear INTEGER,
PRIMARY KEY(trip_id))
DISTSTYLE ALL;
"""

CREATE_STATIONS_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS stations (
id INTEGER NOT NULL,
name VARCHAR(250) NOT NULL,
city VARCHAR(100) NOT NULL,
latitude DECIMAL(9, 6) NOT NULL,
longitude DECIMAL(9, 6) NOT NULL,
dpcapacity INTEGER NOT NULL,
online_date TIMESTAMP NOT NULL,
PRIMARY KEY(id))
DISTSTYLE ALL;
"""

COPY_SQL = """
COPY {}
FROM '{}'
ACCESS_KEY_ID '{{}}'
SECRET_ACCESS_KEY '{{}}'
IGNOREHEADER 1
DELIMITER ','
"""

COPY_MONTHLY_TRIPS_SQL = COPY_SQL.format(
    "trips",
    "s3://udacity-dend/data-pipelines/divvy/partitioned/{year}/{month}/divvy_trips.csv"
)

COPY_ALL_TRIPS_SQL = COPY_SQL.format(
    "trips",
    "s3://udacity-dend/data-pipelines/divvy/unpartitioned/divvy_trips_2018.csv"
)

COPY_STATIONS_SQL = COPY_SQL.format(
    "stations",
    "s3://udacity-dend/data-pipelines/divvy/unpartitioned/divvy_stations_2017.csv"
)

LOCATION_TRAFFIC_SQL = """
BEGIN;
DROP TABLE IF EXISTS station_traffic;
CREATE TABLE station_traffic AS
SELECT
    DISTINCT(t.from_station_id) AS station_id,
    t.from_station_name AS station_name,
    num_departures,
    num_arrivals
FROM trips t
JOIN (
    SELECT
        from_station_id,
        COUNT(from_station_id) AS num_departures
    FROM trips
    GROUP BY from_station_id
) AS fs ON t.from_station_id = fs.from_station_id
JOIN (
    SELECT
        to_station_id,
        COUNT(to_station_id) AS num_arrivals
    FROM trips
    GROUP BY to_station_id
) AS ts ON t.from_station_id = ts.to_station_id
"""


Overwriting exercises/sql_statements.py
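
The `COPY_SQL` template is filled in two passes, which is why its credential placeholders are written as `{{}}`. A small standalone sketch of that behavior, using a shortened template, an example bucket name, and made-up credential strings (all illustrative assumptions, not values from the exercise):

```python
# Shortened version of the COPY template; doubled braces survive the first format().
TEMPLATE = "COPY {} FROM '{}' ACCESS_KEY_ID '{{}}' SECRET_ACCESS_KEY '{{}}'"

# Pass 1: fill in the table and S3 path. '{{}}' collapses to '{}', and the named
# fields inside the path argument are inserted as literal text, not interpreted.
monthly = TEMPLATE.format("trips", "s3://example-bucket/{year}/{month}/trips.csv")

# Pass 2: fill in the credentials (positional) and the partition (keyword).
final = monthly.format("FAKE_KEY", "FAKE_SECRET", year=2018, month=1)

print(monthly)
print(final)
```

The first pass happens once, when `sql_statements.py` is imported; the second pass happens inside the task callable each time it runs.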


In [4]:
!cp exercises/sql_statements.py $AIRFLOW_HOME/dags
!cp exercises/lesson2_exercise1.py $AIRFLOW_HOME/dags

First, load the Airflow UI and run this DAG once.<br>
Next, configure the task ordering for stations data to have the create run before
the copy. <br>
Then, run this DAG once more and inspect the run history to see the
differences.

In [5]:
%%writefile exercises/lesson2_exercise1.py

import datetime
import logging

from airflow import DAG
from airflow.contrib.hooks.aws_hook import AwsHook
from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.postgres_operator import PostgresOperator
from airflow.operators.python_operator import PythonOperator

import sql_statements

def load_trip_data_to_redshift(*args, **kwargs):
    aws_hook = AwsHook("aws_credentials")
    credentials = aws_hook.get_credentials()
    redshift_hook = PostgresHook("redshift")
    sql_stmt = sql_statements.COPY_ALL_TRIPS_SQL.format(
        credentials.access_key,
        credentials.secret_key,
    )
    redshift_hook.run(sql_stmt)


def load_station_data_to_redshift(*args, **kwargs):
    aws_hook = AwsHook("aws_credentials")
    credentials = aws_hook.get_credentials()
    redshift_hook = PostgresHook("redshift")
    sql_stmt = sql_statements.COPY_STATIONS_SQL.format(
        credentials.access_key,
        credentials.secret_key,
    )
    redshift_hook.run(sql_stmt)


dag = DAG(
    'lesson2.exercise1',
    start_date=datetime.datetime.now() - datetime.timedelta(days=1)
)

create_trips_table = PostgresOperator(
    task_id="create_trips_table",
    dag=dag,
    postgres_conn_id="redshift",
    sql=sql_statements.CREATE_TRIPS_TABLE_SQL
)

copy_trips_task = PythonOperator(
    task_id='load_trips_from_s3_to_redshift',
    dag=dag,
    python_callable=load_trip_data_to_redshift,
)

create_stations_table = PostgresOperator(
    task_id="create_stations_table",
    dag=dag,
    postgres_conn_id="redshift",
    sql=sql_statements.CREATE_STATIONS_TABLE_SQL,
)

copy_stations_task = PythonOperator(
    task_id='load_stations_from_s3_to_redshift',
    dag=dag,
    python_callable=load_station_data_to_redshift,
)

create_trips_table >> copy_trips_task

create_stations_table >> copy_stations_task

Overwriting exercises/lesson2_exercise1.py


In [6]:
!cp exercises/lesson2_exercise1.py $AIRFLOW_HOME/dags

## Data Pipeline Schedules

### Schedules
Pipelines are often driven by schedules which determine what data should be analyzed and when.

### Why Schedules
* Pipeline schedules can reduce the amount of data that needs to be processed in a given run. A schedule scopes the job to only the data for the time period since the pipeline last ran. In a naive analysis, with no scope, we would analyze all of the data every time.
* Using schedules to select only data relevant to the time period of the given pipeline execution can help improve the quality and accuracy of the analyses performed by our pipeline.
* Running pipelines on a schedule will decrease the time it takes the pipeline to run.
* An analysis of larger scope can leverage already-completed work. For example, if the aggregates for all months prior to now have already been computed by a scheduled job, then we only need to perform the aggregation for the current month and add it to the existing totals.
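
The last point can be sketched in a few lines. The station names and counts below are made up, and `aggregate_month` and `add_to_totals` are hypothetical helpers written for illustration only:

```python
from collections import defaultdict


def aggregate_month(trips):
    """Count trips per starting station for a single month's partition."""
    counts = defaultdict(int)
    for station_id in trips:
        counts[station_id] += 1
    return dict(counts)


def add_to_totals(existing_totals, month_counts):
    """Merge one month's counts into running totals without reprocessing old months."""
    merged = dict(existing_totals)
    for station_id, count in month_counts.items():
        merged[station_id] = merged.get(station_id, 0) + count
    return merged


# Totals already produced by earlier scheduled runs (made-up numbers).
totals_through_last_month = {"station_a": 120, "station_b": 75}
# Only the current month's trips are read in this run.
current_month_trips = ["station_a", "station_b", "station_a", "station_c"]

updated = add_to_totals(totals_through_last_month, aggregate_month(current_month_trips))
print(updated)
```

Each run touches one month of raw data, however many months of history exist.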

### Selecting the time period
Determining the appropriate time period for a schedule is based on a number of factors which you need to consider as the pipeline designer.

1. **What is the size of data, on average, for a time period?** If an entire year's worth of data is only a few KB or MB, then perhaps it's fine to load the entire dataset. If an hour's worth of data is hundreds of MB or even in the GBs, then you will likely need to schedule your pipeline more frequently.

2. **How frequently is data arriving, and how often does the analysis need to be performed?** If our bikeshare company needs trip data every hour, that will be a driving factor in determining the schedule. Alternatively, if we have to load hundreds of thousands of tiny records, even if they don't add up to much in terms of MB or GB, the file access alone will slow down our analysis and we’ll likely want to run it more often.

3. **What's the frequency on related datasets?** A good rule of thumb is that the frequency of a pipeline’s schedule should be determined by the dataset in our pipeline which requires the most frequent analysis. This isn’t universally the case, but it's a good starting assumption. For example, if our trips data is updating every hour, but our bikeshare station table only updates once a quarter, we’ll probably want to run our trip analysis every hour, and not once a quarter.
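
The rule of thumb in the third point can be written down directly. This is only a starting assumption, as noted above, and the interval values are hypothetical:

```python
from datetime import timedelta

# Hypothetical update intervals for the datasets a pipeline depends on.
update_intervals = {
    "trips": timedelta(hours=1),
    "stations": timedelta(days=90),
}


def suggested_schedule(intervals):
    """Pick the shortest interval, i.e. the most frequently updated dataset."""
    name = min(intervals, key=intervals.get)
    return name, intervals[name]


print(suggested_schedule(update_intervals))
```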

## Schedules in Airflow
### Start Date
Airflow will begin running pipelines on the start date selected. Whenever the start date of a DAG is in the past, and the time difference between the start date and now includes more than one schedule interval, Airflow will automatically schedule and execute a DAG run to satisfy each one of those intervals. This feature is useful in almost all enterprise settings, where companies have established years of data that may need to be retroactively analyzed.

### End Date
Airflow pipelines can also have end dates. You can use an `end_date` with your pipeline to let Airflow know when to stop running the pipeline. An `end_date` can also be useful when you want to perform an overhaul or redesign of an existing pipeline: update the old pipeline with an `end_date`, and then have the new pipeline start on the end date of the old pipeline.
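
To make the interaction between start date, end date, and schedule interval concrete, here is a simplified model of which monthly runs would be created. It is a sketch, not Airflow's scheduler code. It assumes the start date falls on the first of a month, that a run becomes due only once its whole interval has elapsed, and that no run is created for an interval starting after the end date. `monthly_run_dates` and `next_month` are hypothetical helpers.

```python
from datetime import datetime


def next_month(moment):
    """Return the first instant of the month after `moment`."""
    if moment.month == 12:
        return datetime(moment.year + 1, 1, 1)
    return datetime(moment.year, moment.month + 1, 1)


def monthly_run_dates(start_date, now, end_date=None):
    """List the interval start dates a monthly schedule would need to cover."""
    runs = []
    current = start_date
    # A run is due only after its interval [current, next_month(current)) has ended.
    while next_month(current) <= now:
        if end_date is not None and current > end_date:
            break
        runs.append(current)
        current = next_month(current)
    return runs


now = datetime(2018, 4, 15)
print(monthly_run_dates(datetime(2018, 1, 1), now))
print(monthly_run_dates(datetime(2018, 1, 1), now, end_date=datetime(2018, 2, 1)))
```

Under these assumptions the first call yields three runs to backfill (January, February, March), while the second stops after the February interval because of the end date.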

### QUESTION 1 OF 5
How are schedules used by data pipelines?

- [x] Determine what data should be analyzed and when
- [ ] Determine when to interact with data sources
- [ ] Determine when to run particular tasks
- [ ] Determine when to email observers

### QUESTION 2 OF 5
Which of the following are used by Airflow to determine schedules?

- [x] start_date
- [ ] interval_date
- [x] end_date
- [x] schedule_interval

### QUESTION 3 OF 5
True or False: End date is required by Airflow Schedules.
- [ ] True
- [x] False

### QUESTION 4 OF 5
True or False: Start date is required by Airflow Schedules.
- [x] True
- [ ] False

### QUESTION 5 OF 5
True or False: Schedule interval is required by Airflow Schedules.
- [ ] True
- [x] False

## Updating DAGs

### Common Questions
**Wouldn't creating a new DAG for every feature change become cumbersome because feature changes or bugs happen all the time?**

Yes, it can. It really comes down to the kind of change we are making as a team. The question to ask is: "What is the change that I'm making? Is it a fundamental change that alters the meaning of what the DAG is trying to do, or is it something that can easily be reflected in the existing data pipeline we've already designed?"

Suppose we add a marketing email send at the end of an existing DAG. If we don't rerun the DAG to reflect that change, someone who comes by later may look at the run history and conclude that those emails were sent for the whole period the DAG has been running, when in fact they were not. The simplest solution is to design a new DAG. The other option is to clear the history of DAG runs.

There is another important consideration before doing that. Airflow has no concept of the data stores that exist outside of Airflow. If rerunning a task is destructive, or if it triggers some kind of downstream action, rerunning can be really dangerous, both for internal usage of Airflow and for other data consumers. So it is possible to rerun a DAG or clear its history, but there will be repercussions, and we need to understand them. That is why we weigh the pros and cons: do we need a new DAG, or do we go back, wipe the history, and try again?

For more, watch this [tutorial](https://www.youtube.com/watch?v=zuLPBN9SZRc) by Udacity on YouTube.

**Is time the only type of partition? Can you partition other types, such as events or values?**

Time is the most common kind of partition, but it is not the only one. There are also logical partitioning and data size partitioning, and we can partition around events and other kinds of values as well.

## Exercise 2: Schedules and Backfills in Airflow

In [7]:
%%writefile exercises/lesson2_exercise2.py

import datetime
import logging

from airflow import DAG
from airflow.contrib.hooks.aws_hook import AwsHook
from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.postgres_operator import PostgresOperator
from airflow.operators.python_operator import PythonOperator

import sql_statements


def load_trip_data_to_redshift(*args, **kwargs):
    aws_hook = AwsHook("aws_credentials")
    credentials = aws_hook.get_credentials()
    redshift_hook = PostgresHook("redshift")
    sql_stmt = sql_statements.COPY_ALL_TRIPS_SQL.format(
        credentials.access_key,
        credentials.secret_key,
    )
    redshift_hook.run(sql_stmt)


def load_station_data_to_redshift(*args, **kwargs):
    aws_hook = AwsHook("aws_credentials")
    credentials = aws_hook.get_credentials()
    redshift_hook = PostgresHook("redshift")
    sql_stmt = sql_statements.COPY_STATIONS_SQL.format(
        credentials.access_key,
        credentials.secret_key,
    )
    redshift_hook.run(sql_stmt)


dag = DAG(
    'lesson2.exercise2',
    start_date=datetime.datetime(2018, 1, 1, 0, 0, 0, 0),
    # TODO: Set the end date to February first
    end_date=datetime.datetime(2018, 2, 1, 0, 0, 0, 0),
    # TODO: Set the schedule to be monthly
    schedule_interval='@monthly',
    # TODO: set the number of max active runs to 1
    max_active_runs=1
)

create_trips_table = PostgresOperator(
    task_id="create_trips_table",
    dag=dag,
    postgres_conn_id="redshift",
    sql=sql_statements.CREATE_TRIPS_TABLE_SQL
)

copy_trips_task = PythonOperator(
    task_id='load_trips_from_s3_to_redshift',
    dag=dag,
    python_callable=load_trip_data_to_redshift,
    provide_context=True,
)

create_stations_table = PostgresOperator(
    task_id="create_stations_table",
    dag=dag,
    postgres_conn_id="redshift",
    sql=sql_statements.CREATE_STATIONS_TABLE_SQL,
)

copy_stations_task = PythonOperator(
    task_id='load_stations_from_s3_to_redshift',
    dag=dag,
    python_callable=load_station_data_to_redshift,
)

create_trips_table >> copy_trips_task
create_stations_table >> copy_stations_task

Overwriting exercises/lesson2_exercise2.py


In [8]:
!cp exercises/lesson2_exercise2.py $AIRFLOW_HOME/dags

## Data Partitioning
### Schedule partitioning
Not only are schedules great for reducing the amount of data our pipelines have to process, but they also help us meet the timing guarantees that our data consumers may need.

### Logical partitioning
Conceptually related data can be partitioned into discrete segments and processed separately. This process of separating data based on its conceptual relationship is called logical partitioning. With logical partitioning, unrelated things belong in separate steps. Consider your dependencies and separate processing around those boundaries.

Also worth mentioning, the data location is another form of logical partitioning. For example, if our data is stored in a key-value store like Amazon S3 in a format such as `s3://<bucket>/<year>/<month>/<day>`, we could say that our data is logically partitioned by time.
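
A minimal sketch of building such a location from a date follows. The bucket name is a placeholder, and the unpadded month and day are an assumption that matches how the exercises pass `execution_date.month` into the S3 path:

```python
from datetime import date


def partition_prefix(bucket, day):
    """Build the s3://<bucket>/<year>/<month>/<day> prefix for one day of data."""
    return f"s3://{bucket}/{day.year}/{day.month}/{day.day}"


print(partition_prefix("example-bucket", date(2018, 3, 7)))
```

A pipeline run for a given day then only needs to read the objects under that one prefix.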

### Size Partitioning
Size partitioning separates data for processing based on desired or required storage limits. This essentially sets the amount of data included in a data pipeline run. Size partitioning is critical to understand when working with large datasets, especially with Airflow.
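
One way to picture size partitioning is a batching function that closes a batch before it would exceed a byte limit. This is an illustrative sketch only; `partition_by_size` is a hypothetical helper and the 8-byte limit is chosen just to keep the example small.

```python
def partition_by_size(records, max_bytes):
    """Group records into batches whose combined encoded size stays within max_bytes.

    A single record larger than max_bytes is placed in a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for record in records:
        size = len(record.encode("utf-8"))
        # Close the current batch if adding this record would exceed the limit.
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(record)
        current_size += size
    if current:
        batches.append(current)
    return batches


rows = ["aaaa", "bbbb", "cc", "dddddddddd", "e"]
print(partition_by_size(rows, max_bytes=8))
```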

## Why Data Partitioning?
Pipelines designed to work with partitioned data fail more gracefully. Smaller datasets, smaller time periods, and related concepts are easier to debug than big datasets, large time periods, and unrelated concepts. Partitioning makes debugging and rerunning failed tasks much simpler. It also enables easier redos of work, reducing cost and time.

Another great thing about Airflow is that if your data is partitioned appropriately, your tasks will naturally have fewer dependencies on each other. Because of this, Airflow will be able to parallelize execution of your DAGs to produce your results even faster.
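
The idea that independent partitions can be processed side by side can be sketched with the standard library. Here threads stand in for Airflow workers, and the monthly partitions and their values are made up; this illustrates the principle and is not how Airflow executes tasks.

```python
from concurrent.futures import ThreadPoolExecutor

# Made-up monthly partitions; each one can be processed without the others.
partitions = {
    "2018-01": [3, 5, 2],
    "2018-02": [7, 1],
    "2018-03": [4, 4, 4],
}


def process_partition(item):
    """Summarize one partition independently of every other partition."""
    month, durations = item
    return month, sum(durations)


with ThreadPoolExecutor(max_workers=3) as pool:
    # map() preserves input order even though the work runs concurrently.
    results = dict(pool.map(process_partition, partitions.items()))

print(results)
```

If one partition fails, only that partition needs to be rerun; the results for the others remain valid.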

### QUESTION 1 OF 4
What are four common types of data partitioning?

- [x] Location
- [x] Logical
- [x] Size
- [ ] Cloud
- [x] Time

### QUESTION 2 OF 4
Logical Partitioning is the process of...
- [ ] Processing data based on its location in a datastore
- [ ] Separating data for processing based on desired or required storage limits
- [ ] Processing data based on a schedule or when it was created
- [x] Breaking conceptually related data into discrete groups for processing

### QUESTION 3 OF 4
Time Partitioning is the process of...
- [ ] Processing data based on its location in a datastore
- [ ] Separating data for processing based on desired or required storage limits
- [x] Processing data based on a schedule or when it was created
- [ ] Breaking conceptually related data into discrete groups for processing

### QUESTION 4 OF 4
Size Partitioning is the process of...
- [ ] Processing data based on its location in a datastore
- [x] Separating data for processing based on desired or required storage limits
- [ ] Processing data based on a schedule or when it was created
- [ ] Breaking conceptually related data into discrete groups for processing

## Exercise 3
**Instructions**:
1. Modify the bikeshare DAG to load data month by month, instead of loading it all at once, every time. 
2. Use time partitioning to parallelize the execution of the DAG.

In [9]:
%%writefile exercises/lesson2_exercise3.py

import datetime
import logging

from airflow import DAG
from airflow.contrib.hooks.aws_hook import AwsHook
from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.postgres_operator import PostgresOperator
from airflow.operators.python_operator import PythonOperator

import sql_statements

def load_trip_data_to_redshift(*args, **kwargs):
    aws_hook = AwsHook("aws_credentials")
    credentials = aws_hook.get_credentials()
    redshift_hook = PostgresHook('redshift')
    execution_date = kwargs['execution_date']
    sql_stmt = sql_statements.COPY_MONTHLY_TRIPS_SQL.format(
        credentials.access_key,
        credentials.secret_key,
        year=execution_date.year,
        month=execution_date.month
    )
    redshift_hook.run(sql_stmt)
    
def load_station_data_to_redshift(*args, **kwargs):
    aws_hook = AwsHook("aws_credentials")
    credentials = aws_hook.get_credentials()
    redshift_hook = PostgresHook('redshift')
    sql_smt = sql_statements.COPY_STATIONS_SQL.format(
        credentials.access_key,
        credentials.secret_key
    )
    redshift_hook.run(sql_smt)
    
dag = DAG(
        'lesson2.exercise3',
        start_date=datetime.datetime(2018, 1, 1, 0, 0, 0, 0),
        end_date=datetime.datetime(2019, 1, 1, 0, 0, 0, 0),
        schedule_interval='@monthly',
        max_active_runs=1
)

create_trips_table = PostgresOperator(
    task_id='create_trips_table',
    dag=dag,
    postgres_conn_id='redshift',
    sql=sql_statements.CREATE_TRIPS_TABLE_SQL
)

copy_trips_task = PythonOperator(
    task_id='load_trips_from_s3_to_redshift',
    dag=dag,
    python_callable=load_trip_data_to_redshift,
    provide_context=True
)

create_stations_table = PostgresOperator(
    task_id="create_stations_table",
    dag=dag,
    postgres_conn_id="redshift",
    sql=sql_statements.CREATE_STATIONS_TABLE_SQL,
)

copy_stations_task = PythonOperator(
    task_id='load_stations_from_s3_to_redshift',
    dag=dag,
    python_callable=load_station_data_to_redshift,
)

create_trips_table >> copy_trips_task
create_stations_table >> copy_stations_task

Overwriting exercises/lesson2_exercise3.py


In [10]:
!cp exercises/lesson2_exercise3.py $AIRFLOW_HOME/dags

## Data Quality
**Data quality** is the measure of how well a dataset satisfies its intended use. When we talk about the intended use of data, we're typically referring to how our downstream data consumers are going to utilize this data.

Adherence to a set of **requirements** is a good starting point for measuring data quality. Requirements should be defined by you and your data consumers before you start creating your data pipeline.

### Examples of Data Quality Requirements
* Data must be a certain size
* Data must be accurate to some margin of error
* Data must arrive within a given timeframe from the start of execution
* Pipelines must run on a particular schedule
* Data must not contain any sensitive information
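
Two of these requirements, minimum size and timely arrival, can be expressed as simple checks that raise when the requirement is violated. The function names, thresholds, and timestamps below are hypothetical and only meant as a sketch:

```python
from datetime import datetime, timedelta


def check_min_rows(row_count, minimum):
    """Raise if the dataset is smaller than the agreed minimum size."""
    if row_count < minimum:
        raise ValueError(f"Expected at least {minimum} rows, found {row_count}")


def check_arrival(started_at, finished_at, allowed):
    """Raise if the data arrived later than the agreed window after the start."""
    elapsed = finished_at - started_at
    if elapsed > allowed:
        raise ValueError(f"Data took {elapsed}, allowed {allowed}")


start = datetime(2018, 1, 1, 0, 0)
check_min_rows(row_count=1500, minimum=1)
check_arrival(start, start + timedelta(minutes=40), allowed=timedelta(hours=1))
print("all checks passed")
```

Raising an exception is what lets an orchestrator mark the step as failed instead of silently passing bad data downstream.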

### QUESTION 1 OF 4
Which of the following are true about data quality requirements?
- [ ] Requirements tell engineers exactly how to build a data pipeline
- [x] Requirements are how we can set and measure quality
- [x] Requirements allow both engineering and non-engineering roles to agree on the high-level method for preparing the output.
- [x] Requirements tell engineers what the output of their data pipelines should be

### QUESTION 2 OF 4
How would you set a requirement for ensuring that data arrives within a certain timeframe of a DAG starting?

- [x] Use a Service Level Agreement
- [ ] Use a schedule
- [ ] Use a start date
- [ ] Use an end date

### QUESTION 3 OF 4
What kind of requirement would be violated if no data was produced by a DAG?
- [ ] Data must be accurate to some margin of error
- [ ] Data must arrive within a given timeframe from the start of execution
- [ ] Pipelines must run on a particular schedule
- [ ] Data must not contain any sensitive information
- [x] Data must be of a certain size

### QUESTION 4 OF 4
What kind of requirement would be violated if data arrived after it was needed?
- [ ] Data must be accurate to some margin of error
- [x] Data must arrive within a given timeframe from the start of execution
- [ ] Pipelines must run on a particular schedule
- [ ] Data must not contain any sensitive information
- [ ] Data must be of a certain size

## Exercise 4: Data Quality
**Instructions**:
1. Set an SLA on our bikeshare traffic calculation operator
2. Add a data verification step after the load step from S3 to Redshift
3. Add a data verification step after we calculate our output table

In [11]:
%%writefile exercises/lesson2_exercise4.py

import datetime
import logging

from airflow import DAG
from airflow.contrib.hooks.aws_hook import AwsHook
from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.postgres_operator import PostgresOperator
from airflow.operators.python_operator import PythonOperator

import sql_statements


def load_trip_data_to_redshift(*args, **kwargs):
    aws_hook = AwsHook("aws_credentials")
    credentials = aws_hook.get_credentials()
    redshift_hook = PostgresHook("redshift")
    execution_date = kwargs["execution_date"]
    sql_stmt = sql_statements.COPY_MONTHLY_TRIPS_SQL.format(
        credentials.access_key,
        credentials.secret_key,
        year=execution_date.year,
        month=execution_date.month
    )
    redshift_hook.run(sql_stmt)


def load_station_data_to_redshift(*args, **kwargs):
    aws_hook = AwsHook("aws_credentials")
    credentials = aws_hook.get_credentials()
    redshift_hook = PostgresHook("redshift")
    sql_stmt = sql_statements.COPY_STATIONS_SQL.format(
        credentials.access_key,
        credentials.secret_key,
    )
    redshift_hook.run(sql_stmt)


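def check_greater_than_zero(*args, **kwargs):
    # The table to validate is passed in through the operator's `params`
    table = kwargs["params"]["table"]
    redshift_hook = PostgresHook("redshift")
    records = redshift_hook.get_records(f"SELECT COUNT(*) FROM {table}")
    # Fail if the query itself returned nothing
    if len(records) < 1 or len(records[0]) < 1:
        logging.error(f"No records present in destination {table}")
        raise ValueError(f"Data quality check failed. {table} returned no results")
    num_records = records[0][0]
    # Fail if the table exists but is empty
    if num_records < 1:
        logging.error(f"No records present in destination {table}")
        raise ValueError(f"Data quality check failed. {table} contained 0 rows")
    logging.info(f"Data quality on table {table} check passed with {num_records} records")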


dag = DAG(
    'lesson2.exercise4',
    start_date=datetime.datetime(2018, 1, 1, 0, 0, 0, 0),
    end_date=datetime.datetime(2019, 1, 1, 0, 0, 0, 0),
    schedule_interval='@monthly',
    max_active_runs=1
)

create_trips_table = PostgresOperator(
    task_id="create_trips_table",
    dag=dag,
    postgres_conn_id="redshift",
    sql=sql_statements.CREATE_TRIPS_TABLE_SQL
)

copy_trips_task = PythonOperator(
    task_id='load_trips_from_s3_to_redshift',
    dag=dag,
    python_callable=load_trip_data_to_redshift,
    provide_context=True,
)

check_trips = PythonOperator(
    task_id='check_trips_data',
    dag=dag,
    python_callable=check_greater_than_zero,
    provide_context=True,
    params={
        'table': 'trips',
    }
)

create_stations_table = PostgresOperator(
    task_id="create_stations_table",
    dag=dag,
    postgres_conn_id="redshift",
    sql=sql_statements.CREATE_STATIONS_TABLE_SQL,
)

copy_stations_task = PythonOperator(
    task_id='load_stations_from_s3_to_redshift',
    dag=dag,
    python_callable=load_station_data_to_redshift,
)

check_stations = PythonOperator(
    task_id='check_stations_data',
    dag=dag,
    python_callable=check_greater_than_zero,
    provide_context=True,
    params={
        'table': 'stations',
    }
)

create_trips_table >> copy_trips_task >> check_trips
create_stations_table >> copy_stations_task >> check_stations


Overwriting exercises/lesson2_exercise4.py


In [12]:
!cp exercises/lesson2_exercise4.py $AIRFLOW_HOME/dags