# Querying Hudi tables via Athena and pydbtools

This notebook demonstrates that Hudi tables can be read using `pydbtools` (which is simply a wrapper for `awswrangler`). It assumes that `test_hudi_database` has been created using the `dummy_database_creator` found in the `helpers` subdirectory of this project.

In [None]:
import os
import time
import awswrangler as wr
import pydbtools as pydb

In [None]:
database_name = "test_hudi_database"
table_name = "test_hudi_table"

## Checking the table's information in the Glue catalog

Let's query the table's information as it's found in the Glue catalog.

In [None]:
table_details = wr.catalog.table(database=database_name, table=table_name)
table_details

A couple of things that are worth noting:
* The Glue catalog records the Hudi table information as ordinary fields that you can query. This differs from Iceberg, where the equivalent information is hidden and is queried using the `$` syntax.
* `status` is recorded as a partition. In the equivalent Iceberg example, `status` is a hidden partition and so does not appear as a partition in the Glue catalog.
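
The two observations above can also be checked programmatically. The following is a minimal sketch, assuming the frame returned by `wr.catalog.table` exposes `Column Name` and `Partition` columns and has been converted to rows with `table_details.to_dict("records")`. The `summarise_columns` helper and the sample rows are hypothetical and only illustrate the shape of the data.

```python
# The five metadata columns that Hudi adds to every record.
HUDI_META_COLUMNS = {
    "_hoodie_commit_time",
    "_hoodie_commit_seqno",
    "_hoodie_record_key",
    "_hoodie_partition_path",
    "_hoodie_file_name",
}


def summarise_columns(rows):
    """Split catalog rows into Hudi metadata, partition and data columns."""
    summary = {"hudi_meta": [], "partitions": [], "data": []}
    for row in rows:
        name = row["Column Name"]
        if name in HUDI_META_COLUMNS:
            summary["hudi_meta"].append(name)
        elif row["Partition"]:
            summary["partitions"].append(name)
        else:
            summary["data"].append(name)
    return summary


# Illustrative rows only; in the notebook use table_details.to_dict("records").
sample_rows = [
    {"Column Name": "_hoodie_file_name", "Partition": False},
    {"Column Name": "pk", "Partition": False},
    {"Column Name": "status", "Partition": True},
]
summary = summarise_columns(sample_rows)
print(summary)
```

For a Hudi table, `status` would be expected under `partitions` and the `_hoodie_*` fields under `hudi_meta`.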

## Querying the data

We'll now query the dataset using `pydbtools`.

In [None]:
sql = f"""
    SELECT *
    FROM {database_name}.{table_name}
    LIMIT 10
"""
df = pydb.read_sql_query(sql)
df

Because the Hudi (`_hoodie_*`) metadata columns are returned by the query, we can construct the path of the file that contains any given record from `_hoodie_partition_path` and `_hoodie_file_name`, together with the table's location, as shown below.

In [None]:
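# Look up the table's root S3 location in the Glue catalog.
table_location = wr.catalog.get_table_location(database=database_name, table=table_name)
# Build the S3 URI with "/" explicitly: os.path.join would use "\" on Windows.
first_record_location = "/".join([
    table_location.rstrip("/"),
    df["_hoodie_partition_path"].iloc[0],
    df["_hoodie_file_name"].iloc[0],
])
# Read the underlying Parquet file and keep only the row for the first record.
# NOTE: Athena returns lower-case column names (`pk`), whereas the Parquet file
# is assumed to keep the original casing (`PK`).
record_df = wr.s3.read_parquet(first_record_location)
record_df[record_df.PK == df.pk[0]]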
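
The path construction above can be generalised to any row of the query result. This is a minimal sketch using only string operations; the `hudi_file_uri` helper, the bucket name and the file name are hypothetical and are not part of `pydbtools` or `awswrangler`.

```python
def hudi_file_uri(table_location, partition_path, file_name):
    """Join a table location, partition path and file name into one S3 URI."""
    parts = [table_location.rstrip("/"), partition_path.strip("/"), file_name]
    # Skip empty segments (e.g. an empty partition path) to avoid "//".
    return "/".join(part for part in parts if part)


# Hypothetical values, for illustration only.
uri = hudi_file_uri(
    "s3://example-bucket/test_hudi_table/",
    "status=active",
    "example-file.parquet",
)
print(uri)
```

In the notebook this could be applied row-wise, for example with `df.apply(lambda r: hudi_file_uri(table_location, r["_hoodie_partition_path"], r["_hoodie_file_name"]), axis=1)`.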

## Limitations of Hudi with Athena

As noted in the [AWS documentation for Hudi](https://docs.aws.amazon.com/athena/latest/ug/querying-hudi.html):

* Athena does not support incremental queries.
* Unlike with Iceberg, Athena does not support `CTAS` or `INSERT INTO` on Hudi data.
* Using `MSCK REPAIR TABLE` on Hudi tables in Athena is not supported. If you need to load a Hudi table that was not created in AWS Glue, use `ALTER TABLE ADD PARTITION` instead.
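
To illustrate the last point, here is a minimal sketch of building an `ALTER TABLE ADD PARTITION` statement for a single partition. The `add_partition_sql` helper, the partition value and the S3 location are hypothetical. Values are interpolated without escaping, so this is only suitable for trusted input.

```python
def add_partition_sql(table, partition, location):
    """Build an Athena ALTER TABLE ADD PARTITION statement for one partition."""
    # Render the partition spec, e.g. {"status": "active"} -> status = 'active'.
    spec = ", ".join(f"{col} = '{val}'" for col, val in partition.items())
    return (
        f"ALTER TABLE {table} "
        f"ADD IF NOT EXISTS PARTITION ({spec}) "
        f"LOCATION '{location}'"
    )


# Hypothetical partition value and location, for illustration only.
ddl = add_partition_sql(
    "test_hudi_table",
    {"status": "active"},
    "s3://example-bucket/test_hudi_table/status=active/",
)
print(ddl)
```

The resulting statement would then be run against the relevant database, for example with `wr.athena.start_query_execution(sql=ddl, database=database_name)`.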