## What is a Data Pipeline?
<u>Definition</u>: A series of steps in which data is processed. 

Depending on the data requirements of each step, some steps may run in parallel. Data pipelines also typically run on a schedule, which can be once an hour, once a day, every minute, or once a year. The schedule depends on how frequently the data is delivered and how often the data consumers need new insights. Schedules are the most common mechanism for triggering a data pipeline, but external triggers and events can also be used.
### Real World Data Pipelines
The following are some examples of real-world data pipelines:
* Automated marketing emails
* Real-time pricing in rideshare apps
* Targeted advertising based on browsing history

### Example
Pretend we work at a bikeshare company and want to email customers who didn't complete a purchase.

A data pipeline to accomplish this task would look like:
1. Load application event data from a source such as S3 or Kafka
2. Load the data into an analytic warehouse such as Redshift
3. Perform data transformations that identify high-traffic bike docks so the business can determine where to build additional locations.

### QUIZ QUESTION
What is a data pipeline?
- [ ] A visual way of displaying data to business users
- [ ] An algorithm that classifies data.
- [x] A series of steps in which data is processed.
- [ ] A type of database.

### Extract Transform Load (ETL) and Extract Load Transform (ELT):
"ETL is normally a continuous, ongoing process with a well-defined workflow. ETL first extracts data from homogeneous or heterogeneous data sources. Then, data is cleansed, enriched, transformed, and stored either back in the lake or in a data warehouse.

"ELT (Extract, Load, Transform) is a variant of ETL wherein the extracted data is first loaded into the target system. Transformations are performed after the data is loaded into the data warehouse. ELT typically works well when the target system is powerful enough to handle transformations. Analytical databases like Amazon Redshift and Google BigQ."
Source: [Xplenty.com](https://www.xplenty.com/blog/etl-vs-elt/)

This [Quora post](https://www.quora.com/What-is-the-difference-between-the-ETL-and-ELT) is also helpful if you'd like to read more.
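
To make the difference in ordering concrete, here is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for the target system. The `RAW_EVENTS` rows, table names, and function names are assumptions made up for illustration; a real pipeline would target a warehouse such as Redshift.

```python
import sqlite3

# Hypothetical raw events: (customer_id, event_name) with inconsistent formatting.
RAW_EVENTS = [
    (1, " Checkout_Started "),
    (2, "purchase_completed"),
    (1, "PURCHASE_COMPLETED"),
]


def run_etl(events):
    """ETL: clean the rows in Python first, then load only the cleaned rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (customer_id INTEGER, event_name TEXT)")
    cleaned = [(cid, name.strip().lower()) for cid, name in events]
    conn.executemany("INSERT INTO events VALUES (?, ?)", cleaned)
    return conn


def run_elt(events):
    """ELT: load the raw rows first, then transform them with SQL in the target."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_events (customer_id INTEGER, event_name TEXT)")
    conn.executemany("INSERT INTO raw_events VALUES (?, ?)", events)
    conn.execute(
        "CREATE TABLE events AS "
        "SELECT customer_id, LOWER(TRIM(event_name)) AS event_name FROM raw_events"
    )
    return conn


def fetch_events(conn):
    """Read the transformed table in a stable order for comparison."""
    return conn.execute(
        "SELECT customer_id, event_name FROM events ORDER BY customer_id, event_name"
    ).fetchall()


etl_rows = fetch_events(run_etl(RAW_EVENTS))
elt_rows = fetch_events(run_elt(RAW_EVENTS))
print(etl_rows == elt_rows)  # both approaches end with the same cleaned table
```

Both functions produce the same `events` table. The difference is where the transformation runs: in the pipeline code (ETL) or inside the target system (ELT), which also keeps the raw rows available there.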

### What is S3?
"Amazon S3 has a simple web services interface that you can use to store and retrieve any amount of data, at any time, from anywhere on the web. It gives any developer access to the same highly scalable, reliable, fast, inexpensive data storage infrastructure that Amazon uses to run its own global network of web sites."
Source: [Amazon Web Services Documentation](https://docs.aws.amazon.com/AmazonS3/latest/dev/Welcome.html).

If you want to learn more, start [here](https://docs.aws.amazon.com/AmazonS3/latest/dev/Welcome.html).

### What is Kafka?
"Apache Kafka is an **open-source stream-processing software platform** developed by Linkedin and donated to the Apache Software Foundation, written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Its storage layer is essentially a massively scalable pub/sub message queue designed as a distributed transaction log, making it highly valuable for enterprise infrastructures to process streaming data."
Source: Wikipedia.

If you want to learn more, start [here](https://kafka.apache.org/intro).

### What is Redshift?
"Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more... The first step to create a data warehouse is to launch a set of nodes, called an Amazon Redshift cluster. After you provision your cluster, you can upload your data set and then perform data analysis queries. Regardless of the size of the data set, Amazon Redshift offers fast query performance using the same SQL-based tools and business intelligence applications that you use today."

If you want to learn more, start [here](https://docs.aws.amazon.com/redshift/latest/mgmt/welcome.html).

In other words, S3 is an example of a final data store where data might be loaded (e.g. by an ETL process), while Redshift is an example of a data warehouse product provided by Amazon.




## Data Validation
Data Validation is the process of ensuring that data is present, correct & meaningful. Ensuring the quality of your data through automated validation checks is a critical step in building data pipelines at any organization.

Data validation can be done manually by quality assurance, data engineers, or even data customers. However, it is preferable to perform data validation in an automated fashion. Validation can and should become part of your pipeline definitions.

### What could go wrong?
In our previous bikeshare example we loaded event data, analyzed it, and ranked our busiest locations to determine where to build additional capacity.

What would happen if the data was wrong?
What would happen if our system miscalculated the location ranking?
What if no data was produced at all?

When we make a mistake in our data pipeline, it can lead to serious problems for our business, for our customers, and for the people who depend on that data.
So it's really important that we perform data validation to ensure that the data we're creating is accurate and correct.

### Data Validation in Action
In our bikesharing example, we could have added the following validation steps:

After loading from S3 to Redshift:
* Validate that the number of rows in Redshift matches the number of records in S3

Once location business analysis is complete:
* Validate that all locations have a daily visit count greater than 0
* Validate that the number of locations in our output table matches the number of locations in the input table
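
The checks above can be sketched as small Python functions that raise an error when a check fails, so that a pipeline step running them would fail visibly. The function names and the sample values are assumptions for illustration; in a real pipeline the counts would come from queries against S3 and Redshift.

```python
def validate_row_counts(source_count, target_count):
    """Fail if the warehouse does not hold exactly as many rows as the source."""
    if source_count != target_count:
        raise ValueError(
            f"Row count mismatch: source={source_count}, target={target_count}"
        )


def validate_daily_visits(visits_by_location):
    """Fail if any location reports zero (or negative) daily visits."""
    empty = [loc for loc, visits in visits_by_location.items() if visits <= 0]
    if empty:
        raise ValueError(f"Locations with no visits: {empty}")


def validate_location_counts(input_locations, output_locations):
    """Fail if the analysis output has a different number of distinct locations."""
    if len(set(input_locations)) != len(set(output_locations)):
        raise ValueError("Location count mismatch between input and output")


# Hypothetical values standing in for real query results.
validate_row_counts(source_count=3, target_count=3)
validate_daily_visits({"dock_a": 12, "dock_b": 7})
validate_location_counts(["dock_a", "dock_b"], ["dock_b", "dock_a"])
print("All validation checks passed")
```

Raising an exception, rather than only logging a warning, is a deliberate choice in this sketch: a failed check should stop downstream steps from consuming bad data.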

### Why is it important?
* Data pipelines provide a set of logical guidelines and a common set of terminology.
* The conceptual framework of data pipelines will help you better organize and execute everyday data engineering tasks.

### QUIZ QUESTION
Which of the following are examples of data validation?
- [x] Ensuring that the number of rows in Redshift matches the number of records in S3
- [x] Ensuring that the number of rows in a table is greater than zero
- [ ] Ensuring that the output table matches the needs of the data consumer.

## DAGs and Data Pipelines

### Definitions
* **Directed Acyclic Graphs (DAGs)**: DAGs are a special subset of graphs in which the edges between nodes have a specific direction, and no cycles exist. When we say “no cycles exist”, we mean that no node has a path back to itself.
* **Nodes**: A step in the data pipeline process.
* **Edges**: The dependencies or other relationships between nodes.

<img src="images/dags.png">

### Data Pipelines as DAGs
In ETL, each step of the process typically depends on the last.

Each step is a node and dependencies on prior steps are directed edges.
### Common Questions
#### Are there real world cases where a data pipeline is not a DAG?

It is possible to model a data pipeline that is not a DAG, meaning that it contains a cycle within the process. However, the vast majority of use cases for data pipelines can be described as a directed acyclic graph (DAG). This makes the code more understandable and maintainable.
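
One practical reason to prefer acyclic pipelines is that a cycle leaves no valid order in which to run the steps. The sketch below illustrates this with `graphlib` from the Python standard library (Python 3.9 or later). The step names are hypothetical, and this is not how Airflow itself checks for cycles.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a step; its value is the set of steps it depends on.
acyclic_pipeline = {
    "load_to_warehouse": {"extract_from_s3"},
    "analyze": {"load_to_warehouse"},
}

# Making "extract_from_s3" depend on "analyze" closes the loop and creates a cycle.
cyclic_pipeline = {**acyclic_pipeline, "extract_from_s3": {"analyze"}}


def execution_order(pipeline):
    """Return one valid run order, or None when the graph contains a cycle."""
    try:
        return list(TopologicalSorter(pipeline).static_order())
    except CycleError:
        return None


print(execution_order(acyclic_pipeline))
print(execution_order(cyclic_pipeline))  # None: no step can safely run first
```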

#### Can we have two different pipelines for the same data and can we merge them back together?

Yes. It's not uncommon for a data pipeline to take the same dataset, perform two different processes to analyze it, then merge the results of those two processes back together.

## Bikeshare DAG

* First, extract the data from S3 and load it into Redshift.
* Perform analysis in Redshift using SQL.
* Deliver data to some destination server.

<img src="images/bikeshare_dag_1.png">

What happens if we need to add another data source?

* Let's say that we want to integrate the data from the city's API. This step needs to be completed before we perform the Redshift analysis, as shown below:

<img src="images/bikeshare_dag_2.png">
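
The extended bikeshare DAG can also be written down as data. The sketch below, again using the standard-library `graphlib` module (Python 3.9+), groups the steps into stages: every step in a stage has all of its dependencies satisfied, so steps in the same stage could run in parallel. The step names are assumptions chosen to mirror the diagram.

```python
from graphlib import TopologicalSorter

# step -> set of steps that must finish first (names are illustrative)
bikeshare_dag = {
    "load_s3_to_redshift": set(),
    "load_city_api_to_redshift": set(),
    "analyze_in_redshift": {"load_s3_to_redshift", "load_city_api_to_redshift"},
    "deliver_results": {"analyze_in_redshift"},
}

sorter = TopologicalSorter(bikeshare_dag)
sorter.prepare()

stages = []
while sorter.is_active():
    # get_ready() returns every step whose dependencies are all done.
    ready = sorted(sorter.get_ready())
    stages.append(ready)
    sorter.done(*ready)

for number, stage in enumerate(stages, start=1):
    print(f"Stage {number}: {stage}")
```

Both load steps land in the first stage, which reflects that the new city API source does not have to wait for the S3 load; only the analysis has to wait for both.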

### QUESTION 1 OF 3
What are the two components of ALL graphs?
- [ ] Cycle
- [x] Node
- [x] Edge
- [ ] Direction

### QUESTION 2 OF 3
Which of the following are features which define a Directed Acyclic Graph?
- [ ] Has Cycles
- [x] No Cycles
- [x] Nodes may have more than one edge that connects to them
- [x] Edges between nodes imply a directed relationship

### QUESTION 3 OF 3
Which graph(s) shown below are directed acyclic graphs (DAG)?

<img src="images/dag-quiz.png">

- [ ] Graph 1
- [x] Graph 2
- [ ] Graph 3

## Introduction to Apache Airflow

### Apache Airflow
* "Airflow is a platform to programmatically author, schedule and monitor workflows. Use airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed. When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative."

* Airflow allows users to write DAGs in Python that run on a schedule and/or from an external trigger.

* Airflow is simple to maintain and can run data analysis itself or trigger external tools (Redshift, Spark, Presto, Hadoop, etc) during execution.

Airflow also provides a web-based UI for users to visualize and interact with their data pipelines.

If you'd like to learn more, start [here](https://airflow.apache.org/).

## Creating Exercises Folder

In [1]:
import os

if not os.path.exists('exercises'):
    os.makedirs('exercises')

## Airflow Installation

Airflow installation is straightforward.

```bash
# airflow needs a home, ~/airflow is the default,
# but you can lay foundation somewhere else if you prefer
# (optional)
export AIRFLOW_HOME=~/airflow

# install from pypi using pip
pip install apache-airflow

# initialize the database
# (Airflow 1.10.x command; Airflow 2.x renamed it to `airflow db init`)
airflow initdb

# start the web server, default port is 8080
airflow webserver -p 8080

# start the scheduler
airflow scheduler
```

After installing Airflow, create a folder named `dags` inside the `AIRFLOW_HOME` folder and put all of your Airflow DAG files there.

## Exercise 1: Airflow DAGs
**Instructions**:
Define a function that uses the Python logger to log a message. Then finish filling in the details of the DAG below. Once you’ve done that, run the `/opt/airflow/start.sh` command to start the web server. Once the Airflow web server is ready, open the Airflow UI using the "Access Airflow" button. Turn your DAG “On”, and then run your DAG. If you get stuck, you can take a look at the solution file or the video walkthrough on the next page.

In [2]:
%%writefile exercises/exercise1.py
import datetime
import logging

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def hello_world():
    logging.info("Hello World!")


dag = DAG(
        'lesson1.exercise1',
        start_date=datetime.datetime.now())

greet_task = PythonOperator(
   task_id="hello_world_task",
   python_callable=hello_world,
   dag=dag
)

Overwriting exercises/exercise1.py


In [3]:
!cp exercises/exercise1.py $AIRFLOW_HOME/dags

## How Airflow Works

### Components of Airflow
* **Scheduler** orchestrates the execution of jobs on a trigger or schedule. The Scheduler chooses how to prioritize the running and execution of tasks within the system. You can learn more about the Scheduler from the official [Apache Airflow documentation](https://airflow.apache.org/scheduler.html).
* **Work Queue** is used by the scheduler in most Airflow installations to deliver tasks that need to be run to the **Workers**.
* **Worker** processes execute the operations defined in each DAG. In most Airflow installations, workers pull from the **work queue** when they are ready to process a task. When a worker completes the execution of a task, it will attempt to process more work from the **work queue** until there is no further work remaining. When new work arrives in the queue, the worker will begin to process it.
* **Database** saves credentials, connections, history, and configuration. The database, often referred to as the metadata database, also stores the state of all tasks in the system. Airflow components interact with the database with the Python ORM, [SQLAlchemy](https://www.sqlalchemy.org/).
* **Web Interface** provides a control dashboard for users and maintainers. Throughout this course you will see how the web interface allows users to perform tasks such as stopping and starting DAGs, retrying failed tasks, and configuring credentials. The web interface is built using the [Flask web-development microframework](http://flask.pocoo.org/).

<img src="images/airflow-diagram.png">

Airflow itself is not a data processing framework. Airflow does not pass data in memory between the steps of a DAG. Instead, it coordinates the movement of data between other data storage and data processing tools, and its database stores only metadata. When Airflow workers execute tasks, they work with Redshift, Spark, and other systems. We also do not typically run heavy processing workloads on Airflow itself, because a task is limited to the processing power of the single machine its worker runs on. This is why Airflow developers prefer to use Airflow to trigger heavy processing steps in analytic warehouses like Redshift or data frameworks like Spark.

### Order of Operations For an Airflow DAG

* The Airflow Scheduler starts DAGs based on time or external triggers.
* Once a DAG is started, the Scheduler looks at the steps within the DAG and determines which steps can run by looking at their dependencies.
* The Scheduler places runnable steps in the queue.
* Workers pick up those tasks and run them.
* Once the worker has finished running the step, the final status of the task is recorded and additional tasks are placed by the scheduler until all tasks are complete.
* Once all tasks have been completed, the DAG is complete.

<img src="images/how-airflow-works.png">
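
The order of operations above can be sketched as a toy, single-process simulation. This is not Airflow's implementation: the `dag` dictionary, the task names, and the state labels are assumptions for illustration, and a single loop plays the roles of both the scheduler and one worker.

```python
from collections import deque

# Hypothetical DAG: task -> set of upstream tasks it depends on.
dag = {
    "extract": set(),
    "transform_a": {"extract"},
    "transform_b": {"extract"},
    "load": {"transform_a", "transform_b"},
}


def run_dag(dag):
    """Simulate the scheduler/queue/worker loop; return run order and final states."""
    state = {task: "pending" for task in dag}
    queue = deque()
    run_order = []

    while any(status != "success" for status in state.values()):
        # Scheduler: queue every pending task whose upstream tasks all succeeded.
        for task, upstream in dag.items():
            if state[task] == "pending" and all(state[u] == "success" for u in upstream):
                state[task] = "queued"
                queue.append(task)

        if not queue:
            raise RuntimeError("No runnable tasks left; the graph may contain a cycle")

        # Worker: pull one task from the queue, "run" it, and record its status.
        task = queue.popleft()
        run_order.append(task)
        state[task] = "success"

    return run_order, state


order, final_state = run_dag(dag)
print(order)
```

Note that `load` is only queued after both `transform_a` and `transform_b` have been recorded as successful, which mirrors how the scheduler checks dependencies before placing a step in the queue.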

## Quiz: Airflow Runtime Architecture
### QUESTION 1 OF 4
What are the five components of Airflow’s architecture?
- [x] Scheduler
- [ ] Data Warehouse
- [x] Workers
- [x] UI/Web Server
- [x] Queue
- [ ] Streaming Server
- [x] Database

### QUESTION 2 OF 4
What does the Airflow UI do?
- [ ] Allows the user to construct data pipelines graphically in the UI
- [x] Provides a control interface for users and maintainers
- [ ] Allows the user to write queries against databases
- [ ] Runs and records the outcome of individual pipeline tasks

### QUESTION 3 OF 4
What does the Airflow Scheduler do?
- [ ] Runs and records the outcome of individual pipeline tasks
- [ ] Provides a control interface for users and maintainers
- [ ] Sends email on a scheduled basis
- [x] Starts DAGs based on triggers or schedules and moves them towards completion

### QUESTION 4 OF 4
What do the Airflow Workers do?
- [x] Runs and records the outcome of individual pipeline tasks
- [ ] Provides a control interface for users and maintainers
- [ ] Sends email on a scheduled basis
- [ ] Starts DAGs based on triggers or schedules and moves them towards completion

## Building a Data Pipeline
### Creating a DAG
Creating a DAG is easy. Give it a name, a description, a start date, and an interval.

```python
from datetime import datetime

from airflow import DAG

divvy_dag = DAG(
    'divvy',
    description='Analyzes Divvy Bikeshare Data',
    start_date=datetime(2019, 2, 4),
    schedule_interval='@daily')

```

### Creating Operators to Perform Tasks
**Operators** define the atomic steps of work that make up a DAG. Instantiated operators are referred to as **Tasks**.

```python
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def hello_world():
    print("Hello World")

divvy_dag = DAG(...)
task = PythonOperator(
    task_id='hello_world',
    python_callable=hello_world,
    dag=divvy_dag)
```

### Schedules
**Schedules** are optional, and may be defined with cron strings or Airflow Presets. Airflow provides the following presets:

* `@once` - Run a DAG once and then never again
* `@hourly` - Run the DAG every hour
* `@daily` - Run the DAG every day
* `@weekly` - Run the DAG every week
* `@monthly` - Run the DAG every month
* `@yearly` - Run the DAG every year
* `None` - Only run the DAG when the user initiates it

**Start Date:** If your start date is in the past, Airflow will run your DAG as many times as there are schedule intervals between that start date and the current date.

**End Date:** Unless you specify an optional end date, Airflow will continue to run your DAGs until you disable or delete the DAG.
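
To see why a start date in the past produces several runs, the following sketch counts the schedule intervals that have fully elapsed. It is a simplified model, not Airflow's scheduler: it assumes a fixed `timedelta` interval, assumes a run is created only once its whole interval has passed, and ignores end dates and DAG settings such as `catchup`. The function name `scheduled_runs` is made up for this example.

```python
from datetime import datetime, timedelta


def scheduled_runs(start_date, now, interval):
    """List the execution dates of every interval that has fully elapsed."""
    runs = []
    execution_date = start_date
    # A run covers [execution_date, execution_date + interval) and is only
    # counted once that whole interval lies in the past.
    while execution_date + interval <= now:
        runs.append(execution_date)
        execution_date += interval
    return runs


runs = scheduled_runs(
    start_date=datetime(2019, 2, 4),
    now=datetime(2019, 2, 7, 12, 0),
    interval=timedelta(days=1),
)
for execution_date in runs:
    print(execution_date.date())
```

With a daily interval, a start date of 2019-02-04, and a current time of midday on 2019-02-07, the sketch yields three runs (for the 4th, 5th, and 6th); the interval starting on the 7th has not finished yet.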

## Exercise 2: Run the Schedules
**Instructions**: Complete the DAG so that it runs once a day. Once you’ve done that, open the Airflow UI, turn the last exercise off, and then turn this exercise on. Wait a moment and refresh the UI to see Airflow automatically run your DAG.

In [4]:
%%writefile exercises/exercise2.py

import datetime
import logging

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def hello_world():
    logging.info("Hello World")

dag = DAG(
        "lesson1.exercise2",
        start_date=datetime.datetime.now() - datetime.timedelta(days=2),
        schedule_interval='@daily')

task = PythonOperator(
        task_id="hello_world_task",
        python_callable=hello_world,
        dag=dag)

Overwriting exercises/exercise2.py


In [5]:
!cp exercises/exercise2.py $AIRFLOW_HOME/dags

## Operators and Tasks

### Operators 
Operators define the atomic steps of work that make up a DAG. Airflow comes with many Operators that can perform common operations. Here are a handful of common ones:

* `PythonOperator`
* `PostgresOperator`
* `RedshiftToS3Operator`
* `S3ToRedshiftOperator`
* `BashOperator`
* `SimpleHttpOperator`
* `Sensor`

### Task Dependencies
In Airflow DAGs:

* Nodes = Tasks
* Edges = Ordering and dependencies between tasks

Task dependencies can be described programmatically in Airflow using `>>` and `<<`

* a `>>` b means a comes before b
* a `<<` b means a comes after b

```python
hello_world_task = PythonOperator(task_id='hello_world', ...)
goodbye_world_task = PythonOperator(task_id='goodbye_world', ...)
...
# Use >> to denote that goodbye_world_task depends on hello_world_task
hello_world_task >> goodbye_world_task
```

Task dependencies can also be set with `set_downstream` and `set_upstream`:

* `a.set_downstream(b)` means a comes before b
* `a.set_upstream(b)` means a comes after b

```python

hello_world_task = PythonOperator(task_id='hello_world', ...)
goodbye_world_task = PythonOperator(task_id='goodbye_world', ...)
...
hello_world_task.set_downstream(goodbye_world_task)

```
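
The `>>` and `<<` syntax is ordinary Python operator overloading. The toy class below shows how the two styles of declaring dependencies can be equivalent. `ToyTask` is a hypothetical stand-in written for this example; it is not Airflow's operator class and only tracks task IDs.

```python
class ToyTask:
    """A stand-in for an Airflow task; not Airflow's real implementation."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = set()
        self.downstream = set()

    def set_downstream(self, other):
        # self runs before other
        self.downstream.add(other.task_id)
        other.upstream.add(self.task_id)

    def set_upstream(self, other):
        # self runs after other
        other.set_downstream(self)

    def __rshift__(self, other):
        # a >> b: a runs before b. Returning `other` allows chains like a >> b >> c.
        self.set_downstream(other)
        return other

    def __lshift__(self, other):
        # a << b: a runs after b.
        self.set_upstream(other)
        return other


hello = ToyTask("hello_world")
goodbye = ToyTask("goodbye_world")
cleanup = ToyTask("cleanup")

hello >> goodbye >> cleanup

print(goodbye.upstream)
print(cleanup.upstream)
```

Because `__rshift__` returns the right-hand task, `hello >> goodbye >> cleanup` reads left to right in execution order.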

## Exercise 3: Task Dependencies
**Instructions** :  Define tasks and graphs in this exercise

In [6]:
%%writefile exercises/exercise3.py

import datetime
import logging

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def hello_world():
    logging.info("Hello World")


def addition():
    logging.info(f"2 + 2 = {2+2}")


def subtraction():
    logging.info(f"6 - 2 = {6-2}")


def division():
    logging.info(f"10 / 2 = {int(10/2)}")


dag = DAG(
    "lesson1.exercise3",
    schedule_interval='@hourly',
    start_date=datetime.datetime.now() - datetime.timedelta(days=1))

hello_world_task = PythonOperator(
    task_id="hello_world",
    python_callable=hello_world,
    dag=dag)

addition_task = PythonOperator(
    task_id="addition",
    python_callable=addition,
    dag=dag
    )

subtraction_task = PythonOperator(
    task_id="subtraction",
    python_callable=subtraction,
    dag=dag
    )
#
# TODO: Define a division task that calls the `division` function above
#

division_task = PythonOperator(
    task_id="division",
    python_callable=division,
    dag=dag
    )

#
# Configuring the task dependencies such that the graph looks like the following:
#
#                    ->  addition_task
#                   /                 \
#   hello_world_task                   -> division_task
#                   \                 /
#                    ->subtraction_task

hello_world_task >> addition_task
hello_world_task >> subtraction_task

addition_task >> division_task
subtraction_task >> division_task

Overwriting exercises/exercise3.py


In [7]:
!cp exercises/exercise3.py $AIRFLOW_HOME/dags

## Connection via Airflow Hooks
Connections can be accessed in code via hooks. Hooks provide a reusable interface to external systems and databases. With hooks, you don’t have to worry about how and where to store these connection strings and secrets in your code.

```python
from airflow import DAG
from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.python_operator import PythonOperator

def load():
    # Create a PostgresHook object using the `demo` connection
    db_hook = PostgresHook('demo')
    df = db_hook.get_pandas_df('SELECT * FROM rides')
    print(f'Successfully used PostgresHook to return {len(df)} records')

load_task = PythonOperator(task_id='load', python_callable=load, ...)
```

Airflow comes with many Hooks that can integrate with common systems. Here are a few common ones:

* `HttpHook`
* `PostgresHook` (works with Redshift)
* `MySqlHook`
* `SlackHook`
* `PrestoHook`
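
The idea behind a hook, looking up connection details by ID instead of hard-coding them, can be illustrated without Airflow. In the sketch below, `CONNECTIONS` and `ToySqliteHook` are hypothetical names invented for this example, and an in-memory SQLite database stands in for an external system.

```python
import sqlite3

# Hypothetical stand-in for Airflow's connection store: conn_id -> database path.
CONNECTIONS = {"demo": ":memory:"}


class ToySqliteHook:
    """Looks up connection details by ID so callers never hard-code them."""

    def __init__(self, conn_id):
        self.conn_id = conn_id
        self._conn = None

    def get_conn(self):
        # Open the connection lazily and reuse it for later calls.
        if self._conn is None:
            self._conn = sqlite3.connect(CONNECTIONS[self.conn_id])
        return self._conn

    def run(self, sql, parameters=()):
        conn = self.get_conn()
        conn.execute(sql, parameters)
        conn.commit()

    def get_records(self, sql):
        return self.get_conn().execute(sql).fetchall()


hook = ToySqliteHook("demo")
hook.run("CREATE TABLE rides (ride_id INTEGER)")
hook.run("INSERT INTO rides VALUES (?)", (1,))
print(hook.get_records("SELECT COUNT(*) FROM rides"))
```

The calling code only knows the connection ID `"demo"`. Where the database lives, and any credentials it needs, stay in one central place.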

## Exercise 4: Connections and Hooks

**Instructions**: We're going to create a connection and a variable.
1. Open your browser to localhost:8080 and open Admin->Variables
2. Click "Create"
3. Set "Key" equal to "s3_bucket" and set "Val" equal to "udacity-dend"
4. Click save, then click "Create" again and set "Key" equal to "s3_prefix" and "Val" equal to "data-pipelines"
5. Click save
6. Open Admin->Connections
7. Click "Create"
8. Set "Conn Id" to "aws_credentials", "Conn Type" to "Amazon Web Services"
9. Set "Login" to your aws_access_key_id and "Password" to your aws_secret_key
10. Click save
11. Run the DAG

In [8]:
%%writefile exercises/exercise4.py

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.hooks import S3_hook
from airflow.models import Variable

import datetime
import logging

def list_keys():
    hook = S3_hook.S3Hook('aws_credentials')
    bucket = Variable.get('s3_bucket')
    prefix = Variable.get('s3_prefix')
    logging.info(f"Listing Keys from {bucket}/{prefix}")
    keys = hook.list_keys(bucket, prefix=prefix)
    for key in keys:
        logging.info(f"- s3://{bucket}/{key}")

dag = DAG('lesson1.exercise4', start_date=datetime.datetime.now() - datetime.timedelta(days=1))

list_task = PythonOperator(
    task_id="list_keys",
    python_callable=list_keys,
    dag=dag
)

list_task

Overwriting exercises/exercise4.py


In [9]:
!cp exercises/exercise4.py $AIRFLOW_HOME/dags

## Context and Templating
[Here](https://airflow.apache.org/macros.html) is the Apache Airflow documentation on **context variables** that can be included as kwargs.

Here is a link to a [blog post](https://blog.godatadriven.com/zen-of-python-and-apache-airflow) that also discusses this topic.

### Runtime Variables
Airflow leverages templating to allow users to "fill in the blank" with important runtime variables for tasks.

```python

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def hello_date(*args, **kwargs):
    print(f"Hello {kwargs['execution_date']}")

divvy_dag = DAG(...)
task = PythonOperator(
    task_id="hello_date",
    python_callable=hello_date,
    provide_context=True,
    dag=divvy_dag)

```
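
Conceptually, providing context means that the callable receives runtime values as keyword arguments. The sketch below imitates that with a toy runner. `run_with_context` is a hypothetical helper, and only two of the context keys (`execution_date` and `ds`) are modeled; Airflow passes many more.

```python
from datetime import datetime


def hello_date(*args, **kwargs):
    return f"Hello {kwargs['execution_date']}"


def run_with_context(python_callable, execution_date):
    """Toy runner: build a context dict and pass it as keyword arguments."""
    context = {
        "execution_date": execution_date,
        # `ds` is the execution date formatted as YYYY-MM-DD
        "ds": execution_date.strftime("%Y-%m-%d"),
    }
    return python_callable(**context)


print(run_with_context(hello_date, datetime(2019, 2, 4)))
```

Accepting `**kwargs` lets the callable pick out only the keys it needs and ignore the rest.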

## Exercise 5: Context and Templating
**Instructions**: Use the Airflow context in the `PythonOperator` to complete the TODOs below. Once you are done, run your DAG and check the logs to see the context in use.

In [10]:
%%writefile exercises/exercise5.py

import datetime
import logging

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python_operator import PythonOperator
from airflow.hooks.S3_hook import S3Hook


def log_details(*args, **kwargs):
    #
    # TODO: Extract ds, run_id, prev_ds, and next_ds from the kwargs, and log them
    # NOTE: Look here for context variables passed in on kwargs:
    #       https://airflow.apache.org/macros.html
    #
    ds = kwargs['ds']
    run_id = kwargs['run_id']
    previous_ds = kwargs.get('prev_ds')
    next_ds = kwargs.get('next_ds')

    logging.info(f"Execution date is {ds}")
    logging.info(f"My run id is {run_id}")
    if previous_ds:
        logging.info(f"My previous run was on {previous_ds}")
    if next_ds:
        logging.info(f"My next run will be {next_ds}")

dag = DAG(
    'lesson1.exercise5',
    schedule_interval="@daily",
    start_date=datetime.datetime.now() - datetime.timedelta(days=2)
)

list_task = PythonOperator(
    task_id="log_details",
    python_callable=log_details,
    provide_context=True,
    dag=dag
)

Overwriting exercises/exercise5.py


In [11]:
!cp exercises/exercise5.py $AIRFLOW_HOME/dags

## Quiz: Review of Pipeline Components

### QUESTION 1 OF 2
Match the following definitions to the component they describe.

|DEFINITION|COMPONENT|
|-----------|---------|
|A collection of nodes and edges that describe the order of operations for a data pipeline|DAG|
|An instantiated step in a pipeline fully parameterized for execution|Task|
|A reusable connection to an external database or system|Hook|
|An abstract building block that can be configured to perform some work|Operator|

### QUESTION 2 OF 2
Which of the following constructs a DAG that runs task "B", then "C", then "A"?
- [ ] A >> B >> C
- [x] B >> C >> A
- [ ] C >> A >> B
- [ ] B >> A >> C

## Exercise 6: Build the S3 to Redshift DAG

**Instructions**: Copy and populate the trips table. Then, add another operator which creates a traffic analysis table from the trips table you created. Note, in this class, we won’t be writing SQL -- all of the SQL statements we run against Redshift are predefined and included in your lesson.

In [12]:
%%writefile exercises/exercise6.py

import datetime
import logging

from airflow import DAG
from airflow.contrib.hooks.aws_hook import AwsHook
from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.postgres_operator import PostgresOperator
from airflow.operators.python_operator import PythonOperator

import sql_statements


def load_data_to_redshift(*args, **kwargs):
    aws_hook = AwsHook("aws_credentials")
    credentials = aws_hook.get_credentials()
    redshift_hook = PostgresHook("redshift")
    redshift_hook.run(sql_statements.COPY_ALL_TRIPS_SQL.format(credentials.access_key, credentials.secret_key))


dag = DAG(
    'lesson1.exercise6',
    start_date=datetime.datetime.now() - datetime.timedelta(days=1)
)

create_table = PostgresOperator(
    task_id="create_table",
    dag=dag,
    postgres_conn_id="redshift",
    sql=sql_statements.CREATE_TRIPS_TABLE_SQL
)

copy_task = PythonOperator(
    task_id='load_from_s3_to_redshift',
    dag=dag,
    python_callable=load_data_to_redshift
)

location_traffic_task = PostgresOperator(
    task_id="calculate_location_traffic",
    dag=dag,
    postgres_conn_id="redshift",
    sql=sql_statements.LOCATION_TRAFFIC_SQL
)

create_table >> copy_task
copy_task >> location_traffic_task


Overwriting exercises/exercise6.py


In [13]:
%%writefile exercises/sql_statements.py

CREATE_TRIPS_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS trips (
trip_id INTEGER NOT NULL,
start_time TIMESTAMP NOT NULL,
end_time TIMESTAMP NOT NULL,
bikeid INTEGER NOT NULL,
tripduration DECIMAL(16,2) NOT NULL,
from_station_id INTEGER NOT NULL,
from_station_name VARCHAR(100) NOT NULL,
to_station_id INTEGER NOT NULL,
to_station_name VARCHAR(100) NOT NULL,
usertype VARCHAR(20),
gender VARCHAR(6),
birthyear INTEGER,
PRIMARY KEY(trip_id))
DISTSTYLE ALL;
"""

CREATE_STATIONS_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS stations (
id INTEGER NOT NULL,
name VARCHAR(250) NOT NULL,
city VARCHAR(100) NOT NULL,
latitude DECIMAL(9, 6) NOT NULL,
longitude DECIMAL(9, 6) NOT NULL,
dpcapacity INTEGER NOT NULL,
online_date TIMESTAMP NOT NULL,
PRIMARY KEY(id))
DISTSTYLE ALL;
"""

COPY_SQL = """
COPY {}
FROM '{}'
ACCESS_KEY_ID '{{}}'
SECRET_ACCESS_KEY '{{}}'
IGNOREHEADER 1
DELIMITER ','
"""

COPY_MONTHLY_TRIPS_SQL = COPY_SQL.format(
    "trips",
    "s3://udacity-dend/data-pipelines/divvy/partitioned/{year}/{month}/divvy_trips.csv"
)

COPY_ALL_TRIPS_SQL = COPY_SQL.format(
    "trips",
    "s3://udacity-dend/data-pipelines/divvy/unpartitioned/divvy_trips_2018.csv"
)

COPY_STATIONS_SQL = COPY_SQL.format(
    "stations",
    "s3://udacity-dend/data-pipelines/divvy/unpartitioned/divvy_stations_2017.csv"
)

LOCATION_TRAFFIC_SQL = """
BEGIN;
DROP TABLE IF EXISTS station_traffic;
CREATE TABLE station_traffic AS
SELECT
    DISTINCT(t.from_station_id) AS station_id,
    t.from_station_name AS station_name,
    num_departures,
    num_arrivals
FROM trips t
JOIN (
    SELECT
        from_station_id,
        COUNT(from_station_id) AS num_departures
    FROM trips
    GROUP BY from_station_id
) AS fs ON t.from_station_id = fs.from_station_id
JOIN (
    SELECT
        to_station_id,
        COUNT(to_station_id) AS num_arrivals
    FROM trips
    GROUP BY to_station_id
) AS ts ON t.from_station_id = ts.to_station_id
"""

Overwriting exercises/sql_statements.py


In [14]:
!cp exercises/sql_statements.py $AIRFLOW_HOME/dags
!cp exercises/exercise6.py $AIRFLOW_HOME/dags
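
A detail worth noting in `sql_statements.py` is that `COPY_SQL` is formatted twice: once when the module is loaded (table name and S3 path) and once inside the task (AWS credentials). The doubled braces `{{}}` are what make this work. The sketch below repeats the template from above and walks through both stages; the bucket path and credential strings are placeholder values, not real ones.

```python
COPY_SQL = """
COPY {}
FROM '{}'
ACCESS_KEY_ID '{{}}'
SECRET_ACCESS_KEY '{{}}'
IGNOREHEADER 1
DELIMITER ','
"""

# Stage 1: fill in the table name and S3 path.
# The doubled braces {{}} are escapes, so they come out as literal {}.
stage_one = COPY_SQL.format("trips", "s3://example-bucket/trips.csv")
print(stage_one)

# Stage 2: the remaining {} placeholders receive the credentials.
stage_two = stage_one.format("EXAMPLE_KEY_ID", "EXAMPLE_SECRET")
print(stage_two)
```

This is why `load_data_to_redshift` can call `.format(credentials.access_key, credentials.secret_key)` on `COPY_ALL_TRIPS_SQL` even though that string has already been through one `.format` call.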