# Parquet Content-Defined Chunking

Apache Parquet is a columnar storage format that is widely used in the data engineering community. 

Hugging Face hosts nearly 11PB of datasets, with Parquet files alone accounting for over 2.2PB of that storage, so optimizing Parquet storage is a high priority.
Hugging Face has introduced a new storage layer called [Xet](https://huggingface.co/blog/xet-on-the-hub) that leverages content-defined chunking to efficiently deduplicate chunks of data, reducing storage costs and improving download/upload speeds.

While Xet is format agnostic, Parquet's layout and column-chunk (data page) based compression can produce entirely different byte-level representations for data with minor changes, leading to suboptimal deduplication performance. To address this, the Parquet files should be written in a way that minimizes the byte-level differences between similar data, which is where content-defined chunking (CDC) comes into play.
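
To make the idea of content-defined chunking concrete, here is a minimal sketch using only the Python standard library. It is a toy, not the algorithm used by Xet or by Parquet: the `GEAR` table, the `mask`, and the size limits are assumptions chosen purely for illustration. Chunk boundaries are placed where a rolling hash of the most recent bytes matches a bit pattern, so an insertion only disturbs the chunks around the edit, whereas fixed-size chunks all shift.

```python
import random

rng = random.Random(0)
# One pseudo-random 32-bit value per possible byte (illustrative "gear" table).
GEAR = [rng.getrandbits(32) for _ in range(256)]

def cdc_chunks(data, mask=0x3F, min_size=16, max_size=256):
    """Cut a chunk wherever the rolling hash of the latest bytes matches the mask."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        # The left shift ages out old bytes: the low bits tested by the mask
        # depend only on the last few bytes seen.
        rolling = ((rolling << 1) + GEAR[byte]) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= min_size and (rolling & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, rolling = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def fixed_chunks(data, size=80):
    """Baseline: cut every `size` bytes regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def count_new(old_chunks, new_chunks):
    """Number of chunks in new_chunks that are not already known."""
    known = set(old_chunks)
    return sum(1 for chunk in new_chunks if chunk not in known)

original = bytes(rng.getrandbits(8) for _ in range(20_000))
edited = original[:100] + b"inserted!!" + original[100:]

fixed_new = count_new(fixed_chunks(original), fixed_chunks(edited))
cdc_new = count_new(cdc_chunks(original), cdc_chunks(edited))
print(f"fixed-size: {fixed_new} new chunks, content-defined: {cdc_new} new chunks")
```

With fixed-size chunks, nearly every chunk after the insertion is new. With content-defined boundaries, the chunker falls back in step with the original boundaries shortly after the edit.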

Let's explore the performance benefits of the new Parquet CDC feature used alongside Hugging Face's Xet storage layer.

### Note about required pyarrow version

The Parquet content-defined chunking feature hasn't been released yet, so we need to install a nightly build of `pyarrow`.

## Prepare the data to experiment with

For demonstration purposes, we will use a manageably sized subset of the [finemath-4plus](https://huggingface.co/datasets/HuggingFaceTB/finemath) dataset.

In [123]:
# from datasets import load_dataset

# # ds = load_dataset("HuggingFaceTB/finemath", "finemath-4plus", split="train[:100_000]")
# ds = load_dataset("Open-Orca/OpenOrca", )
# table = ds.data.table

In [124]:
# table[:5].to_pandas()

In [1]:
import pyarrow.parquet as pq
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="Open-Orca/OpenOrca", 
    filename="3_5M-GPT3_5-Augmented.parquet", 
    repo_type="dataset"
)

full = pq.read_table(path)

In [None]:
# TODO(kszucs): have synthetic data here as well

In [10]:
table = full[:500_000]

In [11]:
table[:5].to_pandas()

|   | id | system_prompt | question | response |
|---|----|---------------|----------|----------|
| 0 | t0.1791914 | You are an AI assistant that follows instructi... | Q:The exercise is to decide whether the questi... | True. |
| 1 | flan.2203053 | You are a helpful assistant, who always provid... | Sentence 1: There is no need. \n\nSentence 2: ... | Yes, if the first sentence is true, then the s... |
| 2 | flan.1943030 | You are a helpful assistant, who always provid... | On his oak mantelpiece are a drinking bowl fro... | We cannot conclude that the sentence "The Ebol... |
| 3 | t0.870962 | You are an AI assistant that follows instructi... | Question: Read the following paragraph and ext... | The full name of the person whose version of "... |
| 4 | t0.314926 | You are an AI assistant that follows instructi... | I have a test where I am given the following a... | The Rova compound remained largely closed to t... |


### Upload the table as a Parquet file to Hugging Face Hub

Since [pyarrow>=21.0.0](https://github.com/apache/arrow/pull/45089), Hugging Face URIs can be passed to `pyarrow` functions to read and write Parquet (and other file formats) directly on the Hub using the `hf://` URI scheme.

In [None]:
# import pyarrow.parquet as pq

# # Write the table to the Hugging Face Hub
# pq.write_table(table, "hf://datasets/kszucs/pq/finemath-100k.parquet")

Processing Files (0 / 0)                : |          |  0.00B /  0.00B            

New Data Upload                         : |          |  0.00B /  0.00B            

  ..._mlhlq2kgrd_m0000gn/T/hffs-e71zysgn:  80%|#######9  |  218MB /  273MB            

No files have been modified since last commit. Skipping to prevent empty commit.


We can see that the table has been uploaded entirely as new data because it is not known to the Xet storage layer yet. Now read it back as a `pyarrow` table:

In [None]:
# table = pq.read_table("hf://datasets/kszucs/pq/finemath-100k.parquet")
# len(table)

100000

Note that all `pyarrow` functions that accept a file path also accept a Hugging Face URI, such as [pyarrow datasets](https://arrow.apache.org/docs/python/dataset.html), [CSV functions](https://arrow.apache.org/docs/python/generated/pyarrow.csv.read_csv.html), the [incremental Parquet writer](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html), or reading only the Parquet metadata:

In [None]:
# pq.read_metadata("hf://datasets/kszucs/pq/finemath-100k.parquet")

<pyarrow._parquet.FileMetaData object at 0x16ce99b70>
  created_by: parquet-cpp-arrow version 21.0.0-SNAPSHOT
  num_columns: 16
  num_rows: 100000
  num_row_groups: 1
  format_version: 2.6
  serialized_size: 11809

## Different Use Cases for Parquet Deduplication

To demonstrate the effectiveness of the content-defined chunking feature, we will see how it performs when:
1. Re-uploading exact copies of the table
2. Adding/removing columns from the table
3. Changing column types in the table
4. Appending new rows and concatenating tables
5. Inserting/deleting rows in the table
6. Changing the row-group size of the table
7. Changing the file-level splitting/partitioning of the table

TODO(kszucs): add the ppm visualization table for each use case


### 1. Re-uploading Exact Copies of the Table

While this use case sounds trivial, traditional file systems do not deduplicate files, resulting in a full re-upload and re-download of the data. In contrast, a system utilizing content-defined chunking can recognize that the file content is identical and avoid unnecessary data transfer.
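
As a rough illustration of why an exact copy costs nothing to upload, the sketch below models a content-addressed chunk store with the standard library. The `ChunkStore` class is hypothetical and uses fixed-size chunks for brevity; it is not how Xet is implemented, only a way to show that chunks already known by their hash need not be transferred again.

```python
import hashlib
import os

class ChunkStore:
    """Toy content-addressed store: chunks are keyed by their SHA-256 digest."""

    def __init__(self):
        self.chunks = {}

    def upload(self, data, chunk_size=64 * 1024):
        """Store data and return how many bytes actually had to be transferred."""
        transferred = 0
        for offset in range(0, len(data), chunk_size):
            chunk = data[offset:offset + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:
                # Only chunks the store has never seen are sent.
                self.chunks[digest] = chunk
                transferred += len(chunk)
        return transferred

store = ChunkStore()
payload = os.urandom(1 << 20)  # 1 MiB of random bytes standing in for a Parquet file

first_upload = store.upload(payload)
second_upload = store.upload(payload)  # same content under a different name or repository
print(first_upload, second_upload)
```

Because the store is keyed by content rather than by file name, the same reasoning covers uploading the copy to a different repository that shares the store.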

In [None]:
# pq.write_table(table, "hf://datasets/kszucs/pq/finemath-100k-copy.parquet")

Processing Files (0 / 0)                : |          |  0.00B /  0.00B            

New Data Upload                         : |          |  0.00B /  0.00B            

  ..._mlhlq2kgrd_m0000gn/T/hffs-r5bynfdz:  77%|#######6  |  210MB /  273MB            

No files have been modified since last commit. Skipping to prevent empty commit.


We can see that no new data has been uploaded, and the operation was instantaneous. Now let's see what happens if we upload the same file again, but to a different repository:


In [None]:
# pq.write_table(table, "hf://datasets/kszucs/pq-copy/finemath-100k-second-copy.parquet")

Processing Files (0 / 0)                : |          |  0.00B /  0.00B            

New Data Upload                         : |          |  0.00B /  0.00B            

  ..._mlhlq2kgrd_m0000gn/T/hffs-ywdkqwzf:  74%|#######3  |  201MB /  273MB            

No files have been modified since last commit. Skipping to prevent empty commit.


The upload was instantaneous again since deduplication works across repositories as well. This is a key feature of the Xet storage layer, allowing efficient data sharing and collaboration. We can also visualize the deduplication between the two parquet files using the deduplication estimation tool:

In [12]:
# TODO(kszucs): remove this cell and introduce the visualization later when needed

from de import visualize

visualize(table, {"exact-copy": table})


#### Parquet Deduplication for Exact-copy
    
| Variant | No Compression | Zstd Compression  | Snappy Compression |
|---------|----------------|-------------------|--------------------|
| Vanilla Parquet | ![exact-copy Vanilla](temp-none-exact-copy-nocdc.parquet.png) | ![exact-copy Vanilla](temp-zstd-exact-copy-nocdc.parquet.png) | ![exact-copy Vanilla](temp-snappy-exact-copy-nocdc.parquet.png) |



The three totally green heatmaps indicate that the two files deduplicate perfectly.

### 2. Adding and Removing Columns from the Table

First write out the original and changed tables to local parquet files to see their sizes:

In [20]:
import pyarrow.parquet as pq
import pyarrow.compute as pc

table_with_new_columns = table.add_column(
    table.schema.get_field_index("question"),
    "question_length",
    pc.utf8_length(table["question"])
)
table_with_new_columns = table_with_new_columns.add_column(
    table_with_new_columns.schema.get_field_index("response"),
    "response_short",
    pc.utf8_slice_codeunits(table["response"], 0, 50)
)
table_with_removed_columns = table.drop(["response"])
    
pq.write_table(table, "/tmp/original.parquet")
pq.write_table(table_with_new_columns, "/tmp/with-new-columns.parquet")
pq.write_table(table_with_removed_columns, "/tmp/with-removed-columns.parquet")

In [17]:
!ls -lah /tmp/*.parquet

-rw-r--r--  1 kszucs  wheel   455M Jul 15 22:51 /tmp/original.parquet
-rw-r--r--  1 kszucs  wheel   471M Jul 15 22:51 /tmp/with-new-columns.parquet
-rw-r--r--  1 kszucs  wheel   455M Jul 15 22:51 /tmp/with-removed-columns.parquet


Now upload them to Hugging Face to see how much data is actually transferred:

In [None]:
# pq.write_table(table_with_new_columns, "hf://datasets/kszucs/pq/finemath-100k-with-new-columns.parquet")

Processing Files (0 / 0)                : |          |  0.00B /  0.00B            

New Data Upload                         : |          |  0.00B /  0.00B            

  ..._mlhlq2kgrd_m0000gn/T/hffs-x0nrujai:  69%|######8   |  226MB /  330MB            

No files have been modified since last commit. Skipping to prevent empty commit.


We can see that only the new columns and the new parquet metadata placed in the file's footer were uploaded, while the original data was not transferred again. This is a huge benefit of the content-defined chunking feature, as it allows us to efficiently add new columns without transferring the entire dataset again. 

The same applies to removing columns, as we can see below:

In [None]:
# pq.write_table(table_with_removed_columns, "hf://datasets/kszucs/pq/finemath-100k-with-removed-columns.parquet")

Processing Files (0 / 0)                : |          |  0.00B /  0.00B            

New Data Upload                         : |          |  0.00B /  0.00B            

  ..._mlhlq2kgrd_m0000gn/T/hffs-4i1yxc9o: 100%|##########| 7.03MB / 7.03MB            

No files have been modified since last commit. Skipping to prevent empty commit.


Now let's visualize the deduplication heatmaps:

In [18]:
visualize(table, {"with-new-columns": table_with_new_columns})


#### Parquet Deduplication for With-new-columns3
    
| Variant | No Compression | Zstd Compression  | Snappy Compression |
|---------|----------------|-------------------|--------------------|
| Vanilla Parquet | ![with-new-columns3 Vanilla](temp-none-with-new-columns3-nocdc.parquet.png) | ![with-new-columns3 Vanilla](temp-zstd-with-new-columns3-nocdc.parquet.png) | ![with-new-columns3 Vanilla](temp-snappy-with-new-columns3-nocdc.parquet.png) |



Adding two new columns means that we have unseen data pages which must be transferred (highlighted in red), but the rest of the data remains unchanged, so it is not transferred again. Note the small red area in the footer metadata, which almost always changes when we modify the Parquet file.

In [21]:
visualize(table, {"with-removed-columns1": table_with_removed_columns})


#### Parquet Deduplication for With-removed-columns1
    
| Variant | No Compression | Zstd Compression  | Snappy Compression |
|---------|----------------|-------------------|--------------------|
| Vanilla Parquet | ![with-removed-columns1 Vanilla](temp-none-with-removed-columns1-nocdc.parquet.png) | ![with-removed-columns1 Vanilla](temp-zstd-with-removed-columns1-nocdc.parquet.png) | ![with-removed-columns1 Vanilla](temp-snappy-with-removed-columns1-nocdc.parquet.png) |



Since we are removing entire columns, we only see changes in the footer metadata. All the other columns remain unchanged and already exist in the storage layer, so they are not transferred again.

### 3. Changing Column Types in the Table

Another common use case is changing column types, e.g. to reduce the storage size or to optimize the data for specific queries. Let's cast the `question_length` column to `int64` and see how much data is transferred:

In [29]:
# first make the table much smaller by removing the largest columns
# so that the change stands out in the heatmap
table_without_text = table_with_new_columns.drop(["question", "response"])

In [30]:
# cast the question_length column to int64
table_with_casted_column = table_without_text.set_column(
    table_without_text.schema.get_field_index("question_length"),
    "question_length",
    table_without_text["question_length"].cast("int64")
)

# pq.write_table(table_with_casted_column, "hf://datasets/kszucs/pq/finemath-100k-with-casted-column.parquet")

Again, we can see that only the new column and the updated parquet metadata were uploaded. Now visualize the deduplication heatmap:

In [32]:
visualize(table_without_text, {"with-casted-column1": table_with_casted_column})


#### Parquet Deduplication for With-casted-column1
    
| Variant | No Compression | Zstd Compression  | Snappy Compression |
|---------|----------------|-------------------|--------------------|
| Vanilla Parquet | ![with-casted-column1 Vanilla](temp-none-with-casted-column1-nocdc.parquet.png) | ![with-casted-column1 Vanilla](temp-zstd-with-casted-column1-nocdc.parquet.png) | ![with-casted-column1 Vanilla](temp-snappy-with-casted-column1-nocdc.parquet.png) |



The first red block indicates the recast column, while the second red block indicates the updated metadata in the footer. The rest of the data remains unchanged and is not transferred again.

### 4. Appending New Rows and Concatenating Tables

We are going to append new rows by concatenating another slice of the original dataset to the table. 

In [36]:
# ds = load_dataset("HuggingFaceTB/finemath", "finemath-4plus", split="train[:110_000]")
# table = ds.data.table[:100_000]
# next_10k_rows = ds.data.table[100_000:110_000]

import pyarrow as pa

original = table[:300_000]
next_20k_rows = table[300_000:320_000]
appended = pa.concat_tables([original, next_20k_rows])

assert len(appended) == 320_000

Now check that only the new rows are being uploaded since the original data is already known to the Xet storage layer:

In [None]:
# pq.write_table(appended, "hf://datasets/kszucs/pq/finemath-110k.parquet")

Processing Files (0 / 0)                : |          |  0.00B /  0.00B            

New Data Upload                         : |          |  0.00B /  0.00B            

  ..._mlhlq2kgrd_m0000gn/T/hffs-lzvp9gvu:  70%|######9   |  210MB /  300MB            

In [37]:
visualize(original, {"with-appended-rows": appended})


#### Parquet Deduplication for With-appended-rows
    
| Variant | No Compression | Zstd Compression  | Snappy Compression |
|---------|----------------|-------------------|--------------------|
| Vanilla Parquet | ![with-appended-rows Vanilla](temp-none-with-appended-rows-nocdc.parquet.png) | ![with-appended-rows Vanilla](temp-zstd-with-appended-rows-nocdc.parquet.png) | ![with-appended-rows Vanilla](temp-snappy-with-appended-rows-nocdc.parquet.png) |



Since each column gets new data, we can see multiple red stripes. This follows from the Parquet file layout, where whole columns are laid out one after another (within each row group). Note the large red area at the bottom, which is the new data for the `text` column.

### 5. Inserting / Deleting Rows in the Table

Here comes the difficult part: insertions and deletions shift the existing rows, which leads to different column chunks, or data pages in Parquet nomenclature. Since each data page is compressed separately, even a single row insertion or deletion can lead to a completely different byte-level representation from the edited row(s) to the end of the Parquet file.

This Parquet-specific problem cannot be solved by the Xet storage layer alone; the Parquet file itself needs to be written in a way that minimizes the data page differences even when rows are inserted or deleted.
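
The effect of a single inserted row on page boundaries can be sketched with plain Python lists. This is a deliberately simplified model, not the Parquet writer's actual logic: `fixed_pages` closes a page every 100 rows, while the hypothetical `content_defined_pages` closes a page after any row whose hash meets a condition, so the boundaries depend on the data rather than on row positions.

```python
import hashlib

def fixed_pages(rows, page_size=100):
    """Close a page after every `page_size` rows."""
    return [tuple(rows[i:i + page_size]) for i in range(0, len(rows), page_size)]

def content_defined_pages(rows, divisor=100):
    """Close a page after any row whose hash is divisible by `divisor`."""
    pages, current = [], []
    for row in rows:
        current.append(row)
        digest = hashlib.sha256(row.encode()).digest()
        if int.from_bytes(digest[:4], "big") % divisor == 0:
            pages.append(tuple(current))
            current = []
    if current:
        pages.append(tuple(current))
    return pages

def changed_pages(old_pages, new_pages):
    """Number of pages in new_pages that did not exist before."""
    known = set(old_pages)
    return sum(1 for page in new_pages if page not in known)

rows = [f"row-{i}" for i in range(10_000)]
with_insert = rows[:5_000] + ["a newly inserted row"] + rows[5_000:]

fixed_changed = changed_pages(fixed_pages(rows), fixed_pages(with_insert))
cd_changed = changed_pages(content_defined_pages(rows), content_defined_pages(with_insert))
print(f"fixed-size pages changed: {fixed_changed}, content-defined pages changed: {cd_changed}")
```

With fixed-size pages, every page from the insertion point onward contains shifted rows and therefore differs. With content-defined boundaries, only the page around the inserted row changes.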

Let's try to use the existing mechanism and see how it performs.

In [121]:
from datasets import load_dataset

ds = load_dataset("HuggingFaceTB/finemath", "finemath-4plus", split="train[:110_000]")

table_source = ds.data.table
table_original = table_source[:100_000]

# remove 10k rows from the middle
table_with_deleted_rows = pa.concat_tables([
    table_source[:50_000],
    table_source[60_000:100_000]
])

# insert 10k rows into the middle, taken from the end of `table_source`
table_with_inserted_rows = pa.concat_tables([
    table_source[:50_000],
    table_source[100_000:110_000],
    table_source[50_000:100_000]
])

assert len(table_original) == 100_000
assert len(table_with_deleted_rows) == 90_000
assert len(table_with_inserted_rows) == 110_000

Resolving data files:   0%|          | 0/128 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/64 [00:00<?, ?it/s]

In [113]:
pq.write_table(table_original, "hf://datasets/kszucs/pq/finemath-200k-original.parquet")

Processing Files (0 / 0)                : |          |  0.00B /  0.00B            

New Data Upload                         : |          |  0.00B /  0.00B            

  ..._mlhlq2kgrd_m0000gn/T/hffs-hktgiian:  95%|#########5| 6.00MB / 6.31MB            

In [114]:
pq.write_table(table_with_deleted_rows, "hf://datasets/kszucs/pq/finemath-190k-with-deleted-rows.parquet")

Processing Files (0 / 0)                : |          |  0.00B /  0.00B            

New Data Upload                         : |          |  0.00B /  0.00B            

  ..._mlhlq2kgrd_m0000gn/T/hffs-cd_zush_:  50%|####9     | 2.82MB / 5.68MB            

In [122]:
visualize(
    table_original, 
    {
        "with-deleted-rows": table_with_deleted_rows,
        "with-inserted-rows": table_with_inserted_rows
    }, 
    with_content_defined_chunking=True
)


#### Parquet Deduplication for With-deleted-rows
    
| Variant | No Compression | Zstd Compression  | Snappy Compression |
|---------|----------------|-------------------|--------------------|
| Vanilla Parquet | ![with-deleted-rows Vanilla](temp-none-with-deleted-rows-nocdc.parquet.png) | ![with-deleted-rows Vanilla](temp-zstd-with-deleted-rows-nocdc.parquet.png) | ![with-deleted-rows Vanilla](temp-snappy-with-deleted-rows-nocdc.parquet.png) |
| CDC Parquet | ![with-deleted-rows CDC](temp-none-with-deleted-rows-cdc.parquet.png) | ![with-deleted-rows CDC](temp-zstd-with-deleted-rows-cdc.parquet.png) | ![with-deleted-rows CDC](temp-snappy-with-deleted-rows-cdc.parquet.png) |




#### Parquet Deduplication for With-inserted-rows
    
| Variant | No Compression | Zstd Compression  | Snappy Compression |
|---------|----------------|-------------------|--------------------|
| Vanilla Parquet | ![with-inserted-rows Vanilla](temp-none-with-inserted-rows-nocdc.parquet.png) | ![with-inserted-rows Vanilla](temp-zstd-with-inserted-rows-nocdc.parquet.png) | ![with-inserted-rows Vanilla](temp-snappy-with-inserted-rows-nocdc.parquet.png) |
| CDC Parquet | ![with-inserted-rows CDC](temp-none-with-inserted-rows-cdc.parquet.png) | ![with-inserted-rows CDC](temp-zstd-with-inserted-rows-cdc.parquet.png) | ![with-inserted-rows CDC](temp-snappy-with-inserted-rows-cdc.parquet.png) |



In [82]:
table.to_pandas().head()

|   | url | fetch_time | content_mime_type | warc_filename | warc_record_offset | warc_record_length | text | token_count | char_count | metadata | score | int_score | crawl | snapshot_type | language | language_score |
|---|-----|------------|-------------------|---------------|--------------------|--------------------|------|-------------|------------|----------|-------|-----------|-------|---------------|----------|----------------|
| 0 | https://ex.burnettmediagroup.com/what-is-the-p... | 1627454305000000000 | text/html | crawl-data/CC-MAIN-2021-31/segments/1627046153... | 265657037 | 7750 | # What is the term”exclamation mark” in mathem... | 509 | 1930 | {"found_math": false, "script_math_tex": 0, "s... | 4.53125 | 5 | CC-MAIN-2021-31 | longest | en | 0.901994 |
| 1 | https://esingaporemath.com/program-grade-3 | 1638473770000000000 | text/html | crawl-data/CC-MAIN-2021-49/segments/1637964362... | 305205293 | 13865 | Our curriculum is spiral\n\nPlease note that o... | 1234 | 6132 | {"found_math": false, "script_math_tex": 0, "s... | 4.09375 | 4 | CC-MAIN-2021-49 | latest | en | 0.914701 |
| 2 | https://www.teacherspayteachers.com/Product/Fr... | 1537662858000000000 | text/html | crawl-data/CC-MAIN-2018-39/segments/1537267158... | 874532398 | 19691 | # Fractions Bundle "Twist" 12 Worksheets\n\nSu... | 709 | 3141 | {"found_math": false, "script_math_tex": 0, "s... | 3.921875 | 4 | CC-MAIN-2018-39 | latest | en | 0.764651 |
| 3 | https://www.hackmath.net/en/math-problem/8311 | 1611170840000000000 | text/html | crawl-data/CC-MAIN-2021-04/segments/1610703521... | 835761036 | 13306 | # Height of the room\n\nGiven the floor area o... | 641 | 2490 | {"found_math": false, "script_math_tex": 0, "s... | 4.1875 | 4 | CC-MAIN-2021-04 | longest | en | 0.920097 |
| 4 | https://www.jiskha.com/questions/838764/solve-... | 1540044392000000000 | text/html | crawl-data/CC-MAIN-2018-43/segments/1539583512... | 980654936 | 3697 | math\n\nsolve by using subsitution:\n\n4x+y=2\... | 316 | 731 | {"found_math": false, "script_math_tex": 0, "s... | 3.671875 | 4 | CC-MAIN-2018-43 | latest | en | 0.822756 |


According to the deduplication estimator, we expect to upload approximately 0.5GB of data to Hugging Face Hub.

In [83]:
pq.write_table(table, "hf://datasets/kszucs/pq/finemath-1m.parquet")

Processing Files (0 / 0)                : |          |  0.00B /  0.00B            

New Data Upload                         : |          |  0.00B /  0.00B            

  ..._mlhlq2kgrd_m0000gn/T/hffs-hqxs945m:  80%|#######9  |  218MB /  273MB            

In [84]:
pq.write_table(table[:100_000], "hf://datasets/kszucs/pq/finemath-sample.parquet")

Processing Files (0 / 0)                : |          |  0.00B /  0.00B            

New Data Upload                         : |          |  0.00B /  0.00B            

  ..._mlhlq2kgrd_m0000gn/T/hffs-6jeszd21:  80%|#######9  |  218MB /  273MB            

- reordering columns?