# Accessing EFD data stored as Parquet files from S3

Angelo Fausti, Simon Krughoff

November 1, 2021


## Introduction

The EFD uses the [Amazon S3 Sink connector](https://docs.confluent.io/kafka-connect-s3-sink/current/overview.html) to convert data from Kafka topics in Avro format to Parquet format with Snappy compression.
In this notebook we show how to access EFD data stored as Parquet files from S3. 


## Reading Parquet files from S3


In [None]:
import io
import boto3
import pandas as pd
import pyarrow.parquet as pq

The Amazon S3 Sink connector runs at LDF. EFD data is replicated from the Summit and stored in an S3 bucket:

In [None]:
BUCKET_NAME = "efd-int"

The AWS credentials to read this bucket can be found in 1Password (search for "EFD AWS S3 credentials" and then "Credentials for the efd-reader-s3 IAM user").
Add the read credentials to the `~/.aws/credentials` file.

For example:
```
cat ~/.aws/credentials
[default]
aws_access_key_id = <the aws_access_key_id>
aws_secret_access_key = <the aws_secret_access_key>
```

The S3 region can be added to the `~/.aws/config` file.

```
cat ~/.aws/config 
[default]
region=us-east-1
```
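Before creating the S3 resource, it can help to confirm that the credentials file is in the expected shape. The following is a minimal sketch using only the standard library; the helper name `has_aws_credentials` is hypothetical and not part of boto3. It only checks that both keys are present and non-empty for a profile, not that they are valid.

```python
import configparser
from pathlib import Path


def has_aws_credentials(path=Path.home() / ".aws" / "credentials", profile="default"):
    """Return True if `profile` in the credentials file defines both keys.

    Hypothetical helper for illustration; it does not contact AWS.
    """
    # Disable interpolation so a '%' in a secret is not treated as a placeholder.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(path)  # a missing file is silently skipped
    if not parser.has_section(profile):
        return False
    required = ("aws_access_key_id", "aws_secret_access_key")
    return all(parser.get(profile, key, fallback="").strip() for key in required)
```

Calling `has_aws_credentials()` with no arguments checks the `[default]` profile in `~/.aws/credentials`.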

In [None]:
s3 = boto3.resource('s3')
bucket = s3.Bucket(BUCKET_NAME)

The connector is configured to partition data from Kafka topics by time on an hourly basis, using the `Record` timestamp (added by Kafka when the message arrived at the Kafka broker) as the reference.
The following variables are used to construct the path to the Parquet files on S3 for one of the topics, for example, `lsst.sal.ATCamera.logevent_heartbeat`.

In [None]:
topic = "lsst.sal.ATCamera.logevent_heartbeat"
year = "2021"
month = "10"
day = "28"
hour = "01"
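The same path components can be derived from a `datetime` instead of being typed by hand. This is a sketch: `partition_prefix` is a hypothetical helper, and it assumes that month, day and hour are zero-padded to two digits (as in the values above) and that the partition boundaries follow UTC, which this notebook does not state.

```python
from datetime import datetime, timezone


def partition_prefix(topic, when, hourly=True):
    """Build the S3 key prefix for the partition containing `when`.

    Hypothetical helper; assumes zero-padded month/day/hour path components.
    """
    prefix = f"topics/{topic}/year={when:%Y}/month={when:%m}/day={when:%d}"
    if hourly:
        # Drop the hour component (hourly=False) to match a whole day.
        prefix += f"/hour={when:%H}"
    return prefix


# Assumption: the timestamp is expressed in UTC.
when = datetime(2021, 10, 28, 1, tzinfo=timezone.utc)
prefix = partition_prefix("lsst.sal.ATCamera.logevent_heartbeat", when)
```

The resulting `prefix` can be passed as the `Prefix` argument of `bucket.objects.filter()`.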

We use the `bucket.download_fileobj()` method to download each Parquet file into an in-memory buffer, then use PyArrow to read it and convert it to a pandas DataFrame. The per-file results are combined into a single DataFrame.

In [None]:
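frames = []
prefix = f"topics/{topic}/year={year}/month={month}/day={day}/hour={hour}"
for obj in bucket.objects.filter(Prefix=prefix):
    buffer = io.BytesIO()
    bucket.download_fileobj(obj.key, buffer)
    buffer.seek(0)  # rewind the buffer before reading it
    frames.append(pq.read_table(buffer).to_pandas())
    print(f"{bucket.name}:{obj.key}")
# DataFrame.append was removed in pandas 2.0; concatenate once instead
df = pd.concat(frames) if frames else pd.DataFrame()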

The connector is configured to invoke file commits to S3 every 10 minutes (see the [`rotate.interval.ms` configuration setting](https://docs.confluent.io/kafka-connect-s3-sink/current/configuration_options.html)), so you should see 6 files under each hourly path.
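The expected file count follows directly from the rotation interval. As a small sketch (the constant and function names are illustrative, and the 10-minute value is taken from the text above rather than read from the connector configuration):

```python
HOUR_MS = 60 * 60 * 1000
ROTATE_INTERVAL_MS = 10 * 60 * 1000  # 10 minutes, as stated above


def expected_files_per_hour(rotate_interval_ms=ROTATE_INTERVAL_MS):
    """Number of file commits that fit in one hourly partition."""
    return HOUR_MS // rotate_interval_ms
```

Comparing this value with the number of keys printed by the download loop is a quick way to spot an hour with missing files.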

NOTE: To read all the Parquet files on a given day you can filter the bucket objects using a prefix like `f"topics/{topic}/year={year}/month={month}/day={day}"`.

In [None]:
df.head()