diff --git a/_blog.yml b/_blog.yml index 50dc862c4e..5db77e293e 100644 --- a/_blog.yml +++ b/_blog.yml @@ -4730,3 +4730,13 @@ tags: - research - community + +- local: improve_parquet_dedupe + title: "Improving Parquet Dedupe on Hugging Face Hub" + author: yuchenglow + thumbnail: /blog/assets/improve_parquet_dedupe/thumbnail.png + date: October 5, 2024 + tags: + - parquet + - dedupe + - storage diff --git a/assets/improve_parquet_dedupe/thumbnail.png b/assets/improve_parquet_dedupe/thumbnail.png new file mode 100644 index 0000000000..27c2419538 Binary files /dev/null and b/assets/improve_parquet_dedupe/thumbnail.png differ diff --git a/improve_parquet_dedupe.md b/improve_parquet_dedupe.md new file mode 100644 index 0000000000..a31e1a5837 --- /dev/null +++ b/improve_parquet_dedupe.md @@ -0,0 +1,153 @@ +--- +title: "Improving Parquet Dedupe on Hugging Face Hub" +thumbnail: /blog/assets/improve_parquet_dedupe/thumbnail.png +authors: + - user: yuchenglow + - user: seanses +--- + +# Improving Parquet Dedupe on Hugging Face Hub + +The Xet team at Hugging Face is working on improving the efficiency of the Hub's +storage architecture to make it easier and quicker for users to +store and update data and models. As Hugging Face hosts nearly 11PB of datasets, +with Parquet files alone accounting for over 2.2PB of that storage, +optimizing Parquet storage is a high priority. + +Most Parquet files are bulk exports from various data analysis pipelines +or databases, often appearing as full snapshots rather than incremental +updates. Data deduplication becomes critical for efficiency when users want to +update their datasets on a regular basis. Only by deduplicating can we store +all versions as compactly as possible, without requiring everything to be uploaded +again on every update. In an ideal case, we should be able to store every version +of a growing dataset with only a little more space than the size of its largest version. + +Our default storage algorithm uses byte-level [Content-Defined Chunking (CDC)](https://joshleeb.com/posts/content-defined-chunking.html), +which generally dedupes well over insertions and deletions, but the Parquet layout brings some challenges. +Here we run some experiments to see how well some simple modifications to +Parquet files dedupe, using a 2GB Parquet file with 1,092,000 rows from the +[FineWeb](https://huggingface.co/datasets/HuggingFaceFW/fineweb/tree/main/data/CC-MAIN-2013-20) +dataset and generating visualizations using our [dedupe +estimator](https://github.com/huggingface/dedupe_estimator). + +## Background + +Parquet files work by splitting the table into row groups, each with a fixed +number of rows (for instance 1000 rows). Each column within the row group is +then compressed and stored:

+ *Parquet Layout*
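
For illustration, here is a minimal pyarrow sketch of this layout. The toy data, column names, and file name are illustrative, not taken from the experiments below:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small table and write it with fixed-size row groups,
# mirroring the layout described above (1000 rows per row group).
table = pa.table({
    "id": list(range(10_000)),
    "text": [f"row {i}" for i in range(10_000)],
})
pq.write_table(table, "example.parquet", row_group_size=1000)

# Each row group stores its columns as separately compressed column chunks.
meta = pq.ParquetFile("example.parquet").metadata
print(meta.num_row_groups)          # 10 row groups of 1000 rows each
print(meta.row_group(0).num_rows)   # 1000
print(meta.row_group(0).column(1))  # metadata for the "text" column chunk
```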

+ +Intuitively, this means that operations which do not mess with the row +grouping, like modifications or appends, should dedupe pretty well. So let's +test this out! + +## Append + +Here we append 10,000 new rows to the file and compare the results with the +original version. Green represents all deduped blocks, red represents all +new blocks, and shades in between show different levels of deduplication. + +
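
A rough sketch of how such an append can be produced with pyarrow (the file names, and reusing a slice of existing rows as the "new" data, are illustrative assumptions rather than the exact setup used here):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Read the original file, fabricate ~10,000 extra rows with the same
# schema, and write both versions with identical writer settings so
# that only the appended rows should introduce new chunks.
original = pq.read_table("fineweb_slice.parquet")
extra = original.slice(0, 10_000)  # stand-in for genuinely new rows
appended = pa.concat_tables([original, extra])

pq.write_table(original, "v1.parquet", row_group_size=1000)
pq.write_table(appended, "v2.parquet", row_group_size=1000)
# v1.parquet and v2.parquet can then be compared with the dedupe estimator.
```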

+ *Visualization of dedupe from data appends*

+ +We can see that indeed we are able to dedupe nearly the entire file, +with new blocks appearing only at the end of the file. The new file is 99.1% +deduped, requiring only 20MB of additional storage. This matches our +intuition pretty well. + +## Modification + +Given the layout, we would expect row modifications to be fairly +isolated, but this is apparently not the case. Here we make a small +modification to row 10,000, and we see that while most of the file does dedupe, +there are many small, regularly spaced sections of new data!
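
A sketch of how such a single-row edit might be made with pyarrow (the "text" column name and the file names are assumptions, not the exact code used for the experiment):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Edit a single value in the "text" column and rewrite the file.
table = pq.read_table("v1.parquet")
values = table.column("text").to_pylist()
values[10_000] = values[10_000] + " (edited)"  # touch exactly one row

idx = table.schema.get_field_index("text")
modified = table.set_column(
    idx, "text", pa.array(values, type=table.schema.field("text").type)
)
pq.write_table(modified, "v2_modified.parquet", row_group_size=1000)
```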

+ *Visualization of dedupe from data modifications*

+ +A quick scan of the [Parquet file +format](https://parquet.apache.org/docs/file-format/metadata/) suggests +that absolute file offsets are part of the Parquet column headers (see the +structures ColumnChunk and ColumnMetaData)! This means that any +modification is likely to rewrite all the column headers. So while the +data does dedupe well (it is mostly green), we get new bytes in every +column header. + +In this case, the new file is only 89% deduped, requiring 230MB of additional +storage. + +## Deletion + +Here we delete a row from the middle of the file (note: insertion should have +similar behavior). As this reorganizes the entire row group layout (each +row group is 1000 rows), we see that while we dedupe the first half of +the file, the rest of the file consists of completely new blocks.
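
Again, a rough pyarrow sketch of the deletion, with illustrative file names and row index:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Drop a single row from the middle of the table and rewrite the file.
# Every row after the deletion point shifts, so all subsequent row
# groups are laid out (and compressed) differently.
table = pq.read_table("v1.parquet")
row_to_delete = 500_000
kept = pa.concat_tables([
    table.slice(0, row_to_delete),   # rows before the deleted row
    table.slice(row_to_delete + 1),  # rows after it, shifted up by one
])
pq.write_table(kept, "v2_deleted.parquet", row_group_size=1000)
```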

+ *Visualization of dedupe from data deletion*

+ +This is mostly because the Parquet format compresses each column +aggressively. If we turn off compression, we are able to dedupe far more +effectively:

+ *Visualization of dedupe from data deletion without column compression*
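
For reference, disabling compression in the rewrite is a one-line change with pyarrow (same illustrative file names as above):

```python
import pyarrow.parquet as pq

# Rewrite both versions with compression disabled; dedupe improves,
# but the files themselves grow substantially.
for src, dst in [("v1.parquet", "v1_uncompressed.parquet"),
                 ("v2_deleted.parquet", "v2_uncompressed.parquet")]:
    pq.write_table(pq.read_table(src), dst,
                   row_group_size=1000, compression="NONE")
```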

+ +However, the file sizes are nearly 2x larger if we store the data +uncompressed. + +Is it possible to have the benefit of dedupe and compression at the same +time? + +## Content Defined Row Groups + +One potential solution is to apply CDC not only at the byte level, but also at the row level: +we split row groups not based on absolute count (1000 rows), but on a hash of a provided +“Key” column. In other words, we split off a row group whenever the hash of +the key column % [target row count] = 0, with some allowances for a minimum +and a maximum row group size. + +We hacked up a quick, inefficient experimental demonstration +[here](https://gist.github.com/ylow/db38522fb0ca69bdf1065237222b4d1c). + +With this, we are able to efficiently dedupe across compressed Parquet +files even as we delete a row. Here we clearly see a big red block +representing the rewritten row group, followed by a small change for every +column header.

+ *Visualization of dedupe from data deletion with content defined row groups*
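
For concreteness, here is a minimal sketch of the boundary rule described above. It is a simplified stand-in for the linked demonstration, not the actual code; the key column name, the thresholds, and the hash choice are all illustrative assumptions:

```python
import zlib

import pyarrow.parquet as pq

MIN_ROWS, MAX_ROWS, TARGET_ROWS = 500, 2000, 1000

def stable_hash(value) -> int:
    # A deterministic hash; Python's built-in hash() is salted per process,
    # which would make row group boundaries unstable across runs.
    return zlib.crc32(str(value).encode("utf-8"))

def write_content_defined_row_groups(table, path, key="id"):
    """Cut a row group wherever hash(key) % TARGET_ROWS == 0, so boundaries
    follow the data itself instead of absolute row positions."""
    writer = pq.ParquetWriter(path, table.schema)
    keys = table.column(key).to_pylist()
    start = 0
    for i, value in enumerate(keys):
        size = i - start + 1
        at_boundary = stable_hash(value) % TARGET_ROWS == 0
        if (size >= MIN_ROWS and at_boundary) or size >= MAX_ROWS:
            # Each slice is below the default row group size, so each
            # write_table call emits exactly one row group here.
            writer.write_table(table.slice(start, size))
            start = i + 1
    if start < len(keys):
        writer.write_table(table.slice(start, len(keys) - start))
    writer.close()
```

The key property is that the hash is deterministic: the same rows always produce the same boundaries, so an insertion or deletion only perturbs the row group it lands in rather than shifting every group after it.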

+ +## Optimizing Parquet for Dedupe-ability + +Based on these experiments, we could consider improving Parquet file +dedupe-ability in a couple of ways: + +1. Use relative offsets instead of absolute offsets for file structure +data. This would make the Parquet structures position independent and +easy to “memcpy” around, although it is an invasive file format change that +is probably difficult to do. +2. Support content defined chunking on row groups. The format actually +supports this today as it does not require row groups to be uniformly sized, +so this can be done with minimal blast radius. Only Parquet format writers +would have to be updated. + +While we will continue exploring ways to improve Parquet storage performance +(e.g., perhaps we could optionally rewrite Parquet files before uploading? +Strip absolute file offsets on upload and restore on download?), we would +love to work with the Apache Arrow project to see if there is interest in +implementing some of these ideas in the Parquet / Arrow code base. + +In the meantime, we are also exploring the behavior of our data dedupe process +on other common file types. Please do try our [dedupe +estimator](https://github.com/huggingface/dedupe_estimator) and tell us about +your findings! +