Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
20 changes: 20 additions & 0 deletions .github/workflows/xet_build_documentation.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
name: Build Xet documentation

on:
push:
paths:
- "docs/xet/**"
branches:
- main

jobs:
build:
uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main
with:
commit_sha: ${{ github.sha }}
package: hub-docs
package_name: xet
path_to_docs: hub-docs/docs/xet/
additional_args: --not_python_module
secrets:
hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
21 changes: 21 additions & 0 deletions .github/workflows/xet_build_pr_documentation.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
name: Build Xet PR Documentation

on:
pull_request:
paths:
- "docs/xet/**"

concurrency:
group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
cancel-in-progress: true

jobs:
build:
uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@main
with:
commit_sha: ${{ github.event.pull_request.head.sha }}
pr_number: ${{ github.event.number }}
package: hub-docs
package_name: xet
path_to_docs: hub-docs/docs/xet/
additional_args: --not_python_module
16 changes: 16 additions & 0 deletions .github/workflows/xet_upload_pr_documentation.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
name: Upload Xet PR Documentation

on:
workflow_run:
workflows: ["Build Xet PR Documentation"]
types:
- completed

jobs:
build:
uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@main
with:
package_name: xet
secrets:
hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
comment_bot_token: ${{ secrets.COMMENT_BOT_TOKEN }}
30 changes: 30 additions & 0 deletions docs/xet/_toctree.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
- local: index
title: Xet Protocol Specification

- title: Building a client library for xet storage
sections:
- local: upload-protocol
title: Upload Protocol
- local: download-protocol
title: Download Protocol
- local: api
title: CAS API
- local: auth
title: Authentication and Authorization
- local: file-id
title: Hugging Face Hub Files Conversion to Xet File ID's

- title: Overall Xet architecture
sections:
- local: chunking
title: Content-Defined Chunking
- local: hashing
title: Hashing Methods
- local: file-reconstruction
title: File Reconstruction
- local: xorb
title: Xorb Format
- local: shard
title: Shard Format
- local: deduplication
title: Deduplication
193 changes: 193 additions & 0 deletions docs/xet/api.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,193 @@
# CAS API Documentation

This document describes the HTTP API endpoints used by the Content Addressable Storage (CAS) client to interact with the remote CAS server.

## Authentication

To authenticate, authorize, and obtain the API base URL, follow the instructions in [Authentication](./auth).

## Converting Hashes to Strings

Sometimes hashes are used in API paths as hexadecimal strings (reconstruction, xorb upload, global dedupe API).

To convert a 32 hash to a 64 hexadecimal character string to be used as part of an API path there is a specific procedure, MUST NOT directly convert each byte.

### Procedure

For every 8 bytes in the hash (indices 0-7, 8-15, 16-23, 24-31) reverse the order of each byte in those regions then concatenate the regions back in order.

Otherwise stated, consider each 8 byte part of a hash as a little endian 64 bit unsigned integer, then concatenate the hexadecimal representation of the 4 numbers in order (each padded with 0's to 16 characters).

> [!NOTE]
> In all cases that a hash is represented as a string it is converted from a byte array to a string using this procedure.

### Example

Suppose a hash value is:
`[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]`

Then before converting to a string it will first have its bytes reordered to:
`[7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8, 23, 22, 21, 20, 19, 18, 17, 16, 31, 30, 29, 28, 27, 26, 25, 24]`

So the string value of the the provided hash [0..32] is **NOT** `000102030405060708090a0b0c0d0e0f101112131415161718191a1b1c1d1e1f`.
It is: `07060504030201000f0e0d0c0b0a0908171615141312111f1e1d1c1b1a1918`.

## Endpoints

### 1. Get File Reconstruction

- **Description**: Retrieves reconstruction information for a specific file, includes byte range support when `Range` header is set.
- **Path**: `/v1/reconstructions/{file_id}`
- **Method**: `GET`
- **Parameters**:
- `file_id`: File hash in hex format (64 lowercase hexadecimal characters).
See [file hashes](./hashing#file-hashes) for computing the file hash and [converting hashes to strings](./api#converting-hashes-to-strings).
- **Headers**:
- `Range`: OPTIONAL. Format: `bytes={start}-{end}` (end is inclusive).
- **Minimum Token Scope**: `read`
- **Body**: None.
- **Response**: JSON (`QueryReconstructionResponse`)

```json
{
"offset_into_first_range": 0,
"terms": [...],
"fetch_info": {...}
}
```

- **Error Responses**: See [Error Cases](./api#error-cases)
- `400 Bad Request`: Malformed `file_id` in the path. Fix the path before retrying.
- `401 Unauthorized`: Refresh the token to continue making requests, or provide a token in the `Authorization` header.
- `404 Not Found`: The file does not exist. Not retryable.
- `416 Range Not Satisfiable`: The requested byte range start exceeds the end of the file. Not retryable.

```txt
GET /v1/reconstructions/0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef
-H "Authorization: Bearer <token>"
OPTIONAL: -H Range: "bytes=0-100000"
```

### Example File Reconstruction Response Body

See [QueryReconstructionResponse](./download-protocol#queryreconstructionresponse-structure) for more details in the download protocol specification.

### 2. Query Chunk Deduplication (Global Deduplication)

- **Description**: Checks if a chunk exists in the CAS for deduplication purposes.
- **Path**: `/v1/chunks/{prefix}/{hash}`
- **Method**: `GET`
- **Parameters**:
- `prefix`: The only acceptable prefix for the Global Deduplication API is `default-merkledb`.
- `hash`: Chunk hash in hex format (64 lowercase hexadecimal characters).
See [Chunk Hashes](./hashing#chunk-hashes) to compute the chunk hash and [converting hashes to strings](./api#converting-hashes-to-strings).
- **Minimum Token Scope**: `read`
- **Body**: None.
- **Response**: Shard format bytes (`application/octet-stream`), deserialize as a [shard](./shard#global-deduplication).
- **Error Responses**: See [Error Cases](./api#error-cases)
- `400 Bad Request`: Malformed hash in the path. Fix the path before retrying.
- `401 Unauthorized`: Refresh the token to continue making requests, or provide a token in the `Authorization` header.
- `404 Not Found`: Chunk not already tracked by global deduplication. Not retryable.

```txt
GET /v1/chunks/default-merkledb/0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef
-H "Authorization: Bearer <token>"
```

#### Example Shard Response Body

An example shard response body can be found in [Xet reference files](https://huggingface.co/datasets/xet-team/xet-spec-reference-files/blob/main/Electric_Vehicle_Population_Data_20250917.csv.shard.dedupe).

### 3. Upload Xorb

- **Description**: Uploads a serialized Xorb to the server; uploading real data in serialized format.
- **Path**: `/v1/xorbs/{prefix}/{hash}`
- **Method**: `POST`
- **Parameters**:
- `prefix`: The only acceptable prefix for the Xorb upload API is `default`.
- `hash`: Xorb hash in hex format (64 lowercase hexadecimal characters).
See [Xorb Hashes](./hashing#xorb-hashes) to compute the hash, and [converting hashes to strings](./api#converting-hashes-to-strings).
- **Minimum Token Scope**: `write`
- **Body**: Serialized Xorb bytes (`application/octet-stream`).
See [xorb format serialization](./xorb).
- **Response**: JSON (`UploadXorbResponse`)

```json
{
"was_inserted": true
}
```

- Note: `was_inserted` is `false` if the Xorb already exists; this is not an error.

- **Error Responses**: See [Error Cases](./api#error-cases)
- `400 Bad Request`: Malformed hash in the path, Xorb hash does not match the body, or body is incorrectly serialized.
- `401 Unauthorized`: Refresh the token to continue making requests, or provide a token in the `Authorization` header.
- `403 Forbidden`: Token provided but does not have a wide enough scope (for example, a `read` token was provided). Clients MUST retry with a `write` scope token.

```txt
POST /v1/xorbs/default/0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef
-H "Authorization: Bearer <token>"
```

#### Example Xorb Request Body

An example xorb request body can be found in [Xet reference files](https://huggingface.co/datasets/xet-team/xet-spec-reference-files/blob/main/eea25d6ee393ccae385820daed127b96ef0ea034dfb7cf6da3a950ce334b7632.xorb).

### 4. Upload Shard

- **Description**: Uploads a Shard to the CAS.
Uploads file reconstructions and new xorb listing, serialized into the shard format; marks the files as uploaded.
- **Path**: `/v1/shards`
- **Method**: `POST`
- **Minimum Token Scope**: `write`
- **Body**: Serialized Shard data as bytes (`application/octet-stream`).
See [Shard format guide](./shard#shard-upload).
- **Response**: JSON (`UploadShardResponse`)

```json
{
"result": 0
}
```

- Where `result` is:
- `0`: The Shard already exists.
- `1`: `SyncPerformed` — the Shard was registered.

The value of `result` does not carry any meaning, if the upload shard API returns a `200 OK` status code, the upload was successful and the files listed are considered uploaded.

- **Error Responses**: See [Error Cases](./api#error-cases)
- `400 Bad Request`: Shard is incorrectly serialized or Shard contents failed verification.
- Can mean that a referenced Xorb doesn't exist or the shard is too large
- `401 Unauthorized`: Refresh the token to continue making requests, or provide a token in the `Authorization` header.
- `403 Forbidden`: Token provided but does not have a wide enough scope (for example, a `read` token was provided).

```txt
POST /v1/shards
-H "Authorization: Bearer <token>"
```

#### Example Shard Request Body

An example shard request body can be found in [Xet reference files](https://huggingface.co/datasets/xet-team/xet-spec-reference-files/blob/main/Electric_Vehicle_Population_Data_20250917.csv.shard.verification-no-footer).

## Error Cases

### Non-Retryable Errors

- **400 Bad Request**: Returned when the request parameters are invalid (for example, invalid Xorb/Shard on upload APIs).
- **401 Unauthorized**: Refresh the token to continue making requests, or provide a token in the `Authorization` header.
- **403 Forbidden**: Token provided but does not have a wide enough scope (for example, a `read` token was provided for an API requiring `write` scope).
- **404 Not Found**: Occurs on `GET` APIs where the resource (Xorb, file) does not exist.
- **416 Range Not Satisfiable**: Reconstruction API only; returned when byte range requests are invalid. Specifically, the requested start range is greater than or equal to the length of the file.

### Retryable Errors

- **Connection Errors**: Often caused by network issues. Retry if intermittent.
Clients SHOULD ensure no firewall blocks requests and SHOULD NOT use DNS overrides.
- **429 Rate Limiting**: Lower your request rate using a backoff strategy, then wait and retry.
Assume all APIs are rate limited.
- **500 Internal Server Error**: The server experienced an intermittent issue; clients SHOULD retry their requests.
- **503 Service Unavailable**: Service is temporarily unable to process requests; wait and retry.
- **504 Gateway Timeout**: Service took too long to respond; wait and retry.
Loading