### **8.4 - [Practice] Storing your logs in AWS S3**

We have learned how to set up a custom logging configuration in Airflow. Now it’s time to see how to store and read logs on AWS S3. Before moving forward, let me remind you what an AWS S3 bucket is. An S3 bucket is a public cloud storage resource. A bucket is used to store objects, which consist of data and metadata that describes the data. You can think of it as a hard drive in the cloud.

That being said, let's move forward. At this point, I assume that you already have an AWS account. If not, please take your time to create one and come back when you are done.

Ok. By default, Airflow stores its log files locally without compression. If you are running a lot of jobs, or even a small number of jobs frequently, disk space can get eaten up pretty fast. Storing logs on S3 not only relieves you of disk maintenance considerations, it also gives these files better durability guarantees than what most disk manufacturers and file systems can offer.

All right, time to go to the AWS console.

Once you are connected to your account, the first step is to create the S3 bucket where the logs of Airflow

will be stored.

Click on “Services”, type “S3” and select the first choice.

Now, click on “Create bucket”.

From there, we have to configure the bucket. The name here must be unique. In my case I’m going to type marcl-airflow-logs.

Notice that you have to set your own name. Don’t use the same one as mine, it won't work.

Okay.

Select the region closest to you. For me it’s Paris. Click on “Next”.

Here we don't have to change anything.

Next.

Since we don’t want to allow public access to the bucket, we keep the box checked here. Ok, everything is good. We can click on “Create Bucket”, and the new bucket has been created here.

Perfect.

If you click on it, you can see that for now it is empty.

After having created the bucket to store the logs, we are going to create a new user with only the permission to read and write to that bucket. Indeed, it is a best practice to do this to avoid potential security issues.

So, click on “Services”, type “IAM” and select the service. Then click on “Users” and “Add User”. We need to define a username, let’s say “airflow-log-s3”.

We select “Programmatic access” since we will need an access key ID and a secret access key in order to create the connection to AWS S3 from Airflow. Click on “Next”. Select “Attach existing policies directly” and click on “Create policy”.

Here, we are going to create a policy that only gives read and write permissions on the bucket we created previously. Select “Choose a service” and select “S3”. Then in “Actions”, under “Access Level” and “List”, choose “ListBucket”. In “Read”, choose “GetObject”. In “Write”, select “PutObject”, “DeleteObject”, “ReplicateObject” and “RestoreObject”. Now that the actions are set, in “Resources”, click on the warnings.

From there, we have to specify the resources that the user is allowed to interact with. First, we restrict the access to only the bucket we created by clicking on “Add ARN”. Here, type the name you defined for the bucket. In my case it’s “marcl-airflow-logs”. Check that you didn’t make any mistakes and click on “Add”.

Okay. We do the same for the objects. Type the same bucket name, “marcl-airflow-logs” in my case, and select “Any” for the object name. Click on “Add”.

Perfect.

We can review the policy.

We need to give a name to the policy, let’s say “ReadWriteS3AirflowLogs”.

And finally click on “Create policy”.
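For reference, the choices made in the visual editor roughly correspond to a JSON policy document like the one generated below. This is only a sketch: the function name `build_log_bucket_policy` is hypothetical, and the document the console actually produces may be laid out differently. It simply shows the two resource scopes we selected, the bucket itself for listing and the objects inside it for reading and writing.

```python
import json


def build_log_bucket_policy(bucket_name: str) -> dict:
    """Return an IAM policy document restricted to a single S3 bucket.

    Hypothetical helper for illustration, not part of Airflow or AWS tooling.
    """
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                # Read and write actions apply to any object in the bucket.
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:DeleteObject",
                    "s3:ReplicateObject",
                    "s3:RestoreObject",
                ],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


if __name__ == "__main__":
    # Replace the bucket name with your own.
    print(json.dumps(build_log_bucket_policy("marcl-airflow-logs"), indent=2))
```

Comparing this output with the JSON tab of the policy editor is a quick way to check that the bucket name was typed correctly in both ARNs.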

Okay, back to the User view, refresh the policies by clicking here and select “ReadWriteS3AirflowLogs”. Click on “Next”, “Next” again and “Create User”.

Perfect, the user has been successfully created. Here, we get the access key ID and secret access key that we are going to use in Airflow to connect to S3. So, download the credentials right now, as you will not be able to see this page again, and save them somewhere safe.

Click on “Close”. So we are done with AWS, let’s move to the Airflow side. In your terminal, check that you are under the folder airflow-materials/airflow-section-8 and start the Docker containers with start.sh.

As you can see, Docker is building a new Docker image of Airflow. Why? Because in order to store your logs in AWS S3, you have to add the package S3 along with the install of Airflow.

To be more concrete, go to your code editor, check that you are under the folder airflow-materials/airflow-section-8 and open the Dockerfile in the folder Docker. Here, you have the different commands to set up the Docker container running Airflow. If you take a look at the line where Airflow is installed with pip, you can see here the package S3 as well as the other usual packages such as crypto, celery, postgres and so on. So if you want to access S3 with Airflow, this package must be installed.
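If you want to double-check that the image really contains what S3 access needs, a small check like the one below can be run with the Python interpreter inside the container. It is a sketch under one assumption: that the S3 package pulls in the `boto3` library, which is not stated in the Dockerfile itself. The helper name `is_module_installed` is hypothetical.

```python
import importlib.util


def is_module_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Assumption: boto3 is the library brought in by the S3 package.
    print("boto3 installed:", is_module_installed("boto3"))
```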

Alright. Back to the terminal, I’m going to pause the video right now, and I will come back when the build is done. Ok, the build is done, Airflow should be running. Let’s type “docker ps”, and the containers are running as expected.

Perfect.

Now in your web browser, open a new tab and type localhost:8080. Enter. From the Airflow UI, click on “Admin”, “Connections” and “Create”.

Here, we are going to set up the connection we need to access the S3 bucket from Airflow. Let's type “AWSS3LogStorage” for the name. Then select “S3” for the connection type. Finally, in the “Extra” field, type the following JSON value: {"aws_access_key_id": "the access key", "aws_secret_access_key": "the secret key"}.

Notice that for each key you should input your own credentials from the CSV file you downloaded earlier.

Like that.
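Typing JSON by hand is error-prone, especially if the secret key contains characters that need escaping. The snippet below is a minimal sketch that builds the same value with `json.dumps`, so the quoting is always valid. The function name and the two environment variable names are hypothetical, chosen only for this example.

```python
import json
import os


def build_s3_connection_extra(access_key_id: str, secret_access_key: str) -> str:
    """Return the JSON string to paste into the Extra field of the connection."""
    return json.dumps(
        {
            "aws_access_key_id": access_key_id,
            "aws_secret_access_key": secret_access_key,
        }
    )


if __name__ == "__main__":
    # Hypothetical variable names; falls back to the placeholders used in the video.
    extra = build_s3_connection_extra(
        os.environ.get("AIRFLOW_LOG_ACCESS_KEY_ID", "the access key"),
        os.environ.get("AIRFLOW_LOG_SECRET_ACCESS_KEY", "the secret key"),
    )
    print(extra)
```

Reading the values from the environment rather than hard-coding them keeps the credentials out of any file you might commit by accident.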

Okay.

Click on “Create”. Perfect, the connection has been successfully created. Now, from your code editor, open the file airflow.cfg.

In order to use an external storage for your logs, three parameters must be defined: remote_logging, remote_log_conn_id and remote_base_log_folder.

Let's start with the first one. remote_logging, when set to True, allows Airflow to write and read logs from a remote location. So change the value from False to True. Then, remote_log_conn_id corresponds to the connection we just created to connect to AWS S3. So here, we have to put the connection id AWSS3LogStorage.

Finally, remote_base_log_folder indicates the folder where the logs are going to be stored and read from your remote storage. So type s3:// followed by your bucket name, which is marcl-airflow-logs in my case, and the folder where we want to save the logs, which is /airflow-logs.

At the end, you should have the same line as mine except for the name of the bucket, where you have to put your own.

Okay, save the file.
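To recap the three settings, here is a small sketch that renders them in the same format as airflow.cfg using Python's `configparser`. The function name is hypothetical, and the section name `core` is an assumption: it depends on your Airflow version, so check under which section these keys already appear in your own file before copying anything.

```python
import configparser
import io


def render_remote_logging_config(
    conn_id: str,
    bucket_name: str,
    folder: str = "airflow-logs",
    section: str = "core",  # Assumption: the section name varies with the Airflow version.
) -> str:
    """Return the remote logging settings as airflow.cfg-style text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser[section] = {
        "remote_logging": "True",
        "remote_log_conn_id": conn_id,
        "remote_base_log_folder": f"s3://{bucket_name}/{folder}",
    }
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    # Replace the bucket name with your own.
    print(render_remote_logging_config("AWSS3LogStorage", "marcl-airflow-logs"))
```

The value of remote_log_conn_id must match the connection name typed in the Airflow UI exactly, including the case.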

Now, if you go back to the AWS S3 dashboard,

you should find your bucket here and if you click on it, it should be empty.

All right, it’s time to see if everything works as expected. From your terminal, execute the command ./restart.sh so that the modifications we made are applied. Ok, type docker ps. Enter.

Great, now type docker logs -f with the container id of the worker where the tasks of the DAGs are going to be executed. Enter. The purpose of this command is to show the connections that will be made to your AWS S3 bucket by Airflow in order to read and write the logs.

So in your web browser, go to the Airflow UI and click on “DAGs”. From there, turn on the toggle of the DAG logger_dag to schedule it, and wait for the DAGRun to finish. This DAG has only two tasks: t1, doing nothing, and t2, printing out a message to the standard output.

All right, the DAGRun is finished. Click on the DAG, then “Graph View”. Let’s click on “t1” and “View Log”.

From there, pay attention to the first line here, indicating that the logs you are currently reading have been fetched remotely from the following file stored in your S3 bucket.

Indeed, if you go back to the S3 dashboard and click on this icon to refresh the bucket, you can see a new folder called airflow-logs. If you click on it, you obtain another folder with the name of the DAG, logger_dag. Then, in it, you have one folder for each task.

Click on “t1”, the last date, and 1.log. “Open”. And it corresponds to the logs displayed in the Airflow UI.
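The folder hierarchy we just walked through can be summed up as a simple path pattern. The helper below is a sketch based only on what we see in the bucket (base folder, DAG, task, date, then the numbered log file); the function name and the example date are hypothetical, and the exact layout depends on how log file names are configured in your Airflow setup.

```python
def expected_log_key(
    dag_id: str,
    task_id: str,
    execution_date: str,
    try_number: int,
    base_folder: str = "airflow-logs",
) -> str:
    """Return the object key where a task log is expected in the bucket."""
    return f"{base_folder}/{dag_id}/{task_id}/{execution_date}/{try_number}.log"


if __name__ == "__main__":
    # Hypothetical execution date, used only to show the shape of the key.
    print(expected_log_key("logger_dag", "t1", "2019-01-01T00:00:00+00:00", 1))
```

The file name 1.log suggests that each new attempt of a task gets its own numbered file in the same date folder.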

Well done.

You are now able to store and read the logs produced by Airflow in AWS S3. Before moving forward, go back to your terminal and execute the script ./stop.sh.

Finally, in your code editor, remove the values of the variables remote_log_conn_id and remote_base_log_folder, then set the value of remote_logging to False. Save the file and we are done.

Perfect.

Let's take a quick break and see you for the next video.
