# DAGs Lab

### Introduction

In this lesson, we'll practice creating our own DAGs in Airflow.  Let's get started.

### Creating a Dag

Begin by creating a DAG with the id of `get_tax_receipts`.  Then boot up an Airflow webserver that can see the dags folder, and check that the DAG is properly loaded.

Make sure the DAG file lands in the correct folder in the Airflow container.  To check, open a shell in the Docker container and look in the `/usr/local/airflow/dags` folder.

<img src="./sh-airflow.png" width="90%">

Then go to `localhost:8080`, where the DAG should appear.

<img src="./get_tax_receipts.png" width="70%"> 

The next step is to add a `start_date` to the DAG.  Set the start date to five days before the current time.
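
One way to compute that value uses only the standard library, as in this sketch. The `dag_kwargs` name is an assumption made for illustration; the date could just as well be passed straight to the DAG as `start_date=...`.

```python
from datetime import datetime, timedelta

# Five days before the moment this file is parsed.
start_date = datetime.now() - timedelta(days=5)

# Hypothetical helper dict: these keyword arguments could be passed along
# as DAG("get_tax_receipts", **dag_kwargs).
dag_kwargs = {"start_date": start_date}
```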

> It's best to shut down and restart the Docker container each time you update the code.

If you update the code, you can check that Airflow saw the changes by clicking on the lightning bolt in the right-hand panel of the DAG.

> <img src="./code-view.png" width="40%">

Then turn on the DAG with the switch on the left.  Shortly afterward, we should see the DAG run.

<img src="./dag-run.png" width="90%">

Now if you click on the green DAG runs button, you can see five successes.  The reason is that a DAG runs once a day by default, and we set the start date to five days ago.  So Airflow runs the DAG five times to backfill the missed intervals.

<img src="./dag-run-history.png" width="60%">
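
The count of five can be reproduced with a simplified model of the backfill. This is not Airflow's scheduler code, only a sketch of the idea that one run is created for each full interval that has elapsed since the start date. The function name and the fixed timestamp are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def completed_intervals(start_date, now, interval=timedelta(days=1)):
    """Return the start of every full schedule interval between start_date and now."""
    starts = []
    current = start_date
    # A run is only created once its whole interval has elapsed.
    while current + interval <= now:
        starts.append(current)
        current += interval
    return starts


now = datetime(2021, 3, 6, 12, 0)  # hypothetical "current time"
start_date = now - timedelta(days=5)
runs = completed_intervals(start_date, now)
print(len(runs))  # 5
```

With a start date only one day in the past, the same function returns a single interval.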

Let's set up the DAG so that it only runs once.  We can do that by setting the `start_date` to one day in the past.  Make that change now.

### Adding Tasks

So far we have created a DAG, but we have not added any steps or tasks to it.  Let's do that now.

* First add a step called `retrieve_receipts`.  For now, the task will not actually retrieve any receipts.  Instead, we should just see the string `getting receipts` in the logs when we run the DAG.

At this point, it's probably best to shut down the Airflow server and start it again.  If we click on the `get_tax_receipts` DAG, we should see the task listed there.

> If the task is not listed, try clicking the refresh button at the top right of the tree view.  It may also be worth changing the task and DAG ids.

> <img src="./with_task.png" width="50%">

Then if we flip the DAG on, we should see both the DAG and the related task running.

> If this is not occurring, try adding new DAG or task ids.  

<img src="./dag-task-runs.png" width="80%"> 

From there, click the green circle under recent tasks, then click on the task id, and look at the logs.  We should see something like the following:

<img src="./get_receipts_log.png" width="80%">

> Notice that in the second to last line it says the returned value was `getting receipts`.

* Now let's add another task called `find_large_receipts` that returns `found large receipts` when run.  Then have the `retrieve_receipts` step and the `find_large_receipts` step be called in sequential order.

After shutting down the Docker container and restarting it, we should now see the following under the DAG's tree view.

> <img src="./dag-sequential.png" width="40%">

Notice that in this updated diagram, `get_receipts` is run first, followed by `find_receipts`.

Now, let's switch on the dag, and check that both of these tasks are run.

> <img src="./running_tasks.png" width="60%">

Click on the recent tasks, then look at the logs of each task to confirm that the correct values were returned.

The `find_receipts` task should produce logs like the following:

> <img src="./found_large_receipts.png" width="100%">

In the third to last line, we can see that `found large receipts` was returned.

### Working with Schedule Intervals

Now let's change how often our DAG is run.  We can do this with the schedule interval.  Update the DAG so that it sets the schedule interval to hourly, by adding the following keyword argument.

```python
schedule_interval='@hourly'
```

We can see the various options for setting how often the DAG will be run in the [presets documentation](https://airflow.apache.org/docs/apache-airflow/stable/dag-run.html#cron-presets).

### Make it real(er)

Ok, now let's have our `get_receipts` task connect to our Texas Drinks API.  The url to reach is the following:

`https://data.texas.gov/resource/naix-2893.json?taxpayer_zip=77036`

The next time we run the DAG, have that task return the first dictionary from the API.  For example, the log should look something like the following:

<img src="./returned_api_value.png" width="100%">

> We can see that the `Returned value` is the dictionary with the `taxpayer_number` and `name`.
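
The callable for this task might look like the sketch below, which uses only the standard library. It assumes the endpoint returns a JSON list of dictionaries. The `opener` parameter is a hypothetical addition that lets the function be exercised without a network call; the `requests` library would work equally well if it is installed in the container.

```python
import json
from urllib.request import urlopen

RECEIPTS_URL = "https://data.texas.gov/resource/naix-2893.json?taxpayer_zip=77036"


def get_receipts(url=RECEIPTS_URL, opener=urlopen):
    """Fetch the receipts and return the first record as a dictionary."""
    with opener(url, timeout=30) as response:
        records = json.load(response)  # assumed to be a JSON list of dicts
    # Raises IndexError if the API returns an empty list.
    return records[0]
```

Passing `get_receipts` as the `python_callable` of the task means the returned dictionary shows up in the task log as the returned value.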

### Summary

In this lab, we practiced working with directed acyclic graphs.  We saw that we could initialize a DAG, and then add tasks to the DAG.  Then we moved into specifying a schedule interval for the DAG, by adding the `schedule_interval` keyword argument with the preset of `'@hourly'`.