# deltalake 0.7.0

New features in deltalake 0.7.0:

* Cleaner API for getting a file list
* Can manually create checkpoints
* Can get DataFrame of table add actions

In [2]:
import pandas as pd
from deltalake import DeltaTable, write_deltalake

example_df = pd.DataFrame({
    "part": ["a", "a", "b", "b"],
    "value": [1, 2, 3, 4]
})

write_deltalake(
    "example_table",
    example_df,
    partition_by=["part"],
    mode="overwrite"
)

## Cleaner API for file lists

There used to be four different methods on `DeltaTable` to get the list of files: `files()`, `file_paths()`, `file_uris()`, and `files_by_partition()`. These varied in whether you could pass filters (to select a subset of partitions) and whether they returned relative paths or absolute URIs. In 0.7.0, we've consolidated them into two functions:

 * `DeltaTable.files()`: get the paths of files as they appear in the Delta log (usually relative, but they can be absolute, particularly if the files are located outside of the Delta table root)
 * `DeltaTable.file_uris()`: get the absolute URIs for files.
 
Both of these functions accept partition filters.
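
As a rough mental model, each filter is a `(column, operator, value)` tuple, and a list of them is combined with AND. The sketch below mimics that matching in plain Python. `matches_partition` is a hypothetical helper, not part of deltalake, and the operator set shown (`=`, `!=`, `in`, `not in`) is an assumption here, so check the library documentation for what your version supports.

```python
from typing import Dict, List, Sequence, Tuple, Union

# One filter: (partition column, operator, value or list of values)
Filter = Tuple[str, str, Union[str, Sequence[str]]]


def matches_partition(partition: Dict[str, str], filters: List[Filter]) -> bool:
    """Return True if the partition values satisfy every filter (logical AND)."""
    for column, op, value in filters:
        actual = partition.get(column)
        if op == "=":
            ok = actual == value
        elif op == "!=":
            ok = actual != value
        elif op == "in":
            ok = actual in value
        elif op == "not in":
            ok = actual not in value
        else:
            raise ValueError(f"unsupported operator: {op!r}")
        if not ok:
            return False
    return True


# Partition values are compared as strings in this sketch.
partitions = [{"part": "a"}, {"part": "b"}]
selected = [p for p in partitions if matches_partition(p, [("part", "=", "a")])]
print(selected)
```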

In [7]:
table = DeltaTable("example_table")

print(f"All files:\n  {table.files()}")
print(f"All file URIs:\n  {table.file_uris()}")
print(f"Files in partition part=a:\n  {table.files([('part', '=', 'a')])}")
print(f"File URIs in partition part=a:\n  {table.file_uris([('part', '=', 'a')])}")

All files:
  ['part=a/0-a635877b-2228-4237-9a16-4f72ff147cd7-0.parquet', 'part=b/0-a635877b-2228-4237-9a16-4f72ff147cd7-0.parquet']
All file URIs:
  ['/Users/willjones/Documents/notebooks/example_table/part=a/0-a635877b-2228-4237-9a16-4f72ff147cd7-0.parquet', '/Users/willjones/Documents/notebooks/example_table/part=b/0-a635877b-2228-4237-9a16-4f72ff147cd7-0.parquet']
Files in partition part=a:
  ['part=a/0-a635877b-2228-4237-9a16-4f72ff147cd7-0.parquet']
File URIs in partition part=a:
  ['/Users/willjones/Documents/notebooks/example_table/part=a/0-a635877b-2228-4237-9a16-4f72ff147cd7-0.parquet']
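
The difference between the two methods can be pictured as a path join against the table root. The sketch below is only an illustration under that assumption: `to_absolute` is a hypothetical helper, `/data/example_table` is a made-up root, and the file names are placeholders. The real resolution happens inside deltalake.

```python
import posixpath


def to_absolute(table_root: str, log_path: str) -> str:
    """Resolve a path from the Delta log against the table root.

    Paths that are already absolute (a URI with a scheme, or a path
    starting with "/") are returned unchanged.
    """
    if "://" in log_path or log_path.startswith("/"):
        return log_path
    return posixpath.join(table_root, log_path)


root = "/data/example_table"  # hypothetical table location
print(to_absolute(root, "part=a/file-0.parquet"))
print(to_absolute(root, "s3://other-bucket/file-1.parquet"))
```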


## Manually create checkpoints

We now allow manually creating checkpoints. This can be useful if you have done several operations that create and remove many files (such as successive overwrites), but haven't yet hit an automatic checkpoint.

In [8]:
table.create_checkpoint()

In [11]:
import os

# Now we will see a checkpoint file in our log:
os.listdir(os.path.join(table.table_uri, '_delta_log'))

['_last_checkpoint',
 '00000000000000000000.checkpoint.parquet',
 '00000000000000000000.json']
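
The file names in this listing follow the Delta log naming scheme: the version number is zero-padded to 20 digits, and `_last_checkpoint` is a small JSON file that records the version of the most recent checkpoint. The sketch below reads that pointer using only the standard library. `latest_checkpoint_file` is a hypothetical helper, and it assumes a single-file checkpoint on a local filesystem; the temporary directory stands in for a real `_delta_log` folder.

```python
import json
import os
import tempfile
from typing import Optional


def latest_checkpoint_file(log_dir: str) -> Optional[str]:
    """Return the checkpoint file name that `_last_checkpoint` points to, if any."""
    pointer = os.path.join(log_dir, "_last_checkpoint")
    if not os.path.exists(pointer):
        return None  # no checkpoint has been written yet
    with open(pointer) as f:
        version = json.load(f)["version"]
    # Log file names use the version zero-padded to 20 digits.
    return f"{version:020d}.checkpoint.parquet"


# Simulate a log directory holding a checkpoint pointer for version 0.
with tempfile.TemporaryDirectory() as log_dir:
    with open(os.path.join(log_dir, "_last_checkpoint"), "w") as f:
        json.dump({"version": 0, "size": 3}, f)
    print(latest_checkpoint_file(log_dir))
```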

## Get DataFrame of active add actions

We also have a new experimental API that provides a table of the active add actions. The active add actions are the metadata for the set of files that currently make up the table. This lets you see each file's partition values, record counts, and statistics. (Note: there is currently a bug in the modification time, but that will soon be fixed.) This data can be useful for understanding how well compaction and Z-order are working for your table.

In [20]:
# flatten=True eliminates nested columns, making it easier to work with in Pandas
table.get_add_actions(flatten=True).to_pandas()

| | path | size_bytes | modification_time | data_change | partition.part | num_records | null_count.value | min.value | max.value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | part=a/0-a635877b-2228-4237-9a16-4f72ff147cd7-... | 1936 | 1970-01-20 09:27:03.636 | True | a | 2 | 0 | 1 | 2 |
| 1 | part=b/0-a635877b-2228-4237-9a16-4f72ff147cd7-... | 1936 | 1970-01-20 09:27:03.636 | True | b | 2 | 0 | 3 | 4 |
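
One way to use this data when judging compaction is to look at file counts and average file sizes per partition, since many small files in one partition suggest that compaction would help. The sketch below does this with plain dictionaries so that it runs without a table. `summarize_by_partition` is a hypothetical helper, and the rows are hand-copied from the output above; with a real table they could come from the pandas frame via `DataFrame.to_dict("records")`.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Mapping


def summarize_by_partition(
    rows: Iterable[Mapping[str, Any]], key: str = "partition.part"
) -> Dict[str, Dict[str, float]]:
    """Aggregate file count, record count, and bytes per partition value."""
    summary: Dict[str, Dict[str, float]] = defaultdict(
        lambda: {"files": 0, "records": 0, "bytes": 0}
    )
    for row in rows:
        stats = summary[row[key]]
        stats["files"] += 1
        stats["records"] += row["num_records"]
        stats["bytes"] += row["size_bytes"]
    for stats in summary.values():
        stats["avg_file_bytes"] = stats["bytes"] / stats["files"]
    return dict(summary)


# Values copied from the add actions shown above (paths omitted).
rows = [
    {"partition.part": "a", "size_bytes": 1936, "num_records": 2},
    {"partition.part": "b", "size_bytes": 1936, "num_records": 2},
]
for part, stats in summarize_by_partition(rows).items():
    print(part, stats)
```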
