Merged
4 changes: 2 additions & 2 deletions TOC-tidb-cloud-lake.md
@@ -28,15 +28,15 @@
- Data Sources
- [Overview](/tidb-cloud-lake/guides/data-sources.md)
- [Amazon S3 - Credentials](/tidb-cloud-lake/guides/aws-credentials.md)
- [Amazon SQS (S3) - IAM Role](/tidb-cloud-lake/guides/amazon-sqs-s3-iam-role.md)
- [Amazon SQS (S3) - IAM Role](/tidb-cloud-lake/guides/amazon-sqs-s3-iam-role.md) ![BETA](/media/tidb-cloud/blank_transparent_placeholder.png)
- [MySQL - Credentials](/tidb-cloud-lake/guides/mysql-credentials.md)
- [PostgreSQL - Credentials](/tidb-cloud-lake/guides/postgresql-credentials.md)
- [FeiShuBot](/tidb-cloud-lake/guides/feishubot.md)
- Integration Tasks
- [Overview](/tidb-cloud-lake/guides/integration-tasks.md)
- [Task Management](/tidb-cloud-lake/guides/task-management.md)
- [Amazon S3 Integration Task](/tidb-cloud-lake/guides/integrate-with-amazon-s3.md)
- [Amazon SQS (S3) Integration Task](/tidb-cloud-lake/guides/integrate-with-amazon-sqs-s3.md)
- [Amazon SQS (S3) Integration Task](/tidb-cloud-lake/guides/integrate-with-amazon-sqs-s3.md) ![BETA](/media/tidb-cloud/blank_transparent_placeholder.png)
- [MySQL Integration Task](/tidb-cloud-lake/guides/integrate-with-mysql.md)
- [PostgreSQL Integration Task](/tidb-cloud-lake/guides/integrate-with-postgresql.md)
- Connect
10 changes: 5 additions & 5 deletions tidb-cloud-lake/guides/amazon-sqs-s3-iam-role.md
@@ -1,9 +1,9 @@
---
title: Amazon SQS (S3) - IAM Role
title: Amazon SQS (S3) - IAM Role (Beta)
summary: Learn how to create an "Amazon SQS (S3) - IAM Role" data source in {{{ .lake }}}.
---

# Amazon SQS (S3) - IAM Role
# Amazon SQS (S3) - IAM Role (Beta)

This page describes how to create an `Amazon SQS (S3) - IAM Role` data source. This data source stores the configuration required to access an Amazon SQS queue and the corresponding S3 bucket, and is used for consuming S3 object creation events delivered from Amazon S3 to SQS.

@@ -49,7 +49,7 @@ Before creating the data source, complete the following configuration in your AW
5. Attach S3 read permissions and SQS consume permissions to the IAM Role.
6. Upload a test object and confirm that S3 can deliver the event to SQS.

Prepare the following variables first. `AWS_REGION` must be the Region where both the S3 bucket and SQS queue are located. `EXTERNAL_ID` is the organization ID from the {{{ .lake }}} console.
Prepare the following variables first. `AWS_REGION` must be the Region where both the S3 bucket and SQS queue are located. `EXTERNAL_ID` is the organization ID from the {{{ .lake }}} platform console.

```bash
export AWS_REGION="<bucket-and-sqs-region>"
@@ -231,9 +231,9 @@ aws s3api get-bucket-notification-configuration \

Confirm that `QueueArn` points to the target SQS queue, `Events` includes `s3:ObjectCreated:*`, and `FilterRules` matches the `Object Key Prefix` / `Object Key Suffix` configured in the {{{ .lake }}} data source.

## Step 4: Create an IAM Role for Platform to Assume
## Step 4: Create an IAM Role for {{{ .lake }}} to Assume

low

According to the style guide, headings should use sentence case. Please update this heading to use sentence case.

Suggested change
## Step 4: Create an IAM Role for {{{ .lake }}} to Assume
## Step 4: Create an IAM role for {{{ .lake }}} to assume
References
  1. Use sentence case for headings (e.g., ## Configure the cluster). (link)


Generate `trust-policy.json`. `ExternalId` is the organization ID from the Platform console.
Generate `trust-policy.json`. `ExternalId` is the organization ID from the {{{ .lake }}} platform console.

```bash
jq -n \
2 changes: 1 addition & 1 deletion tidb-cloud-lake/guides/data-integration-overview.md
@@ -23,7 +23,7 @@ Not every data source corresponds to an ingestion task. For example, `FeiShuBot`
| Task Type | Description |
|-----------|-------------|
| [Amazon S3](/tidb-cloud-lake/guides/integrate-with-amazon-s3.md) | Imports CSV, Parquet, or NDJSON files from Amazon S3 with support for one-time or continuous ingestion. |
| [Amazon SQS (S3)](/tidb-cloud-lake/guides/integrate-with-amazon-sqs-s3.md) | Consumes S3 object creation events from an SQS queue and writes the corresponding object data into {{{ .lake }}}. |
| [Amazon SQS (S3) (Beta)](/tidb-cloud-lake/guides/integrate-with-amazon-sqs-s3.md) | Consumes S3 object creation events from an SQS queue and writes the corresponding object data into {{{ .lake }}}. |
| [MySQL](/tidb-cloud-lake/guides/integrate-with-mysql.md) | Synchronizes table data from MySQL using `Snapshot`, `CDC Only`, or `Snapshot + CDC` modes. |
| [PostgreSQL](/tidb-cloud-lake/guides/integrate-with-postgresql.md) | Synchronizes table data from PostgreSQL using `Snapshot`, `CDC Only`, or `Snapshot + CDC` modes. |

2 changes: 1 addition & 1 deletion tidb-cloud-lake/guides/data-sources.md
@@ -14,7 +14,7 @@ Data sources do not execute synchronization by themselves. Their role is to cent
| Type | Purpose |
|------|---------|
| [Amazon S3 - Credentials](/tidb-cloud-lake/guides/aws-credentials.md) | Stores the Access Key and Secret Key required to access Amazon S3. These credentials can be reused across multiple S3 import tasks. |
| [Amazon SQS (S3) - IAM Role](/tidb-cloud-lake/guides/amazon-sqs-s3-iam-role.md) | Stores the queue URL, Region, IAM Role, and S3 path scope required for SQS (S3) ingestion. It can be used to consume S3 object creation events. |
| [Amazon SQS (S3) - IAM Role (Beta)](/tidb-cloud-lake/guides/amazon-sqs-s3-iam-role.md) | Stores the queue URL, Region, IAM Role, and S3 path scope required for SQS (S3) ingestion. It can be used to consume S3 object creation events. |
| [MySQL - Credentials](/tidb-cloud-lake/guides/mysql-credentials.md) | Stores the host, port, username, password, and database information required to access MySQL. These settings can be reused across multiple MySQL sync tasks. |
| [PostgreSQL - Credentials](/tidb-cloud-lake/guides/postgresql-credentials.md) | Stores the host, port, username, password, and database information required to access PostgreSQL. These settings can be reused across multiple PostgreSQL sync tasks. |
| [FeiShuBot](/tidb-cloud-lake/guides/feishubot.md) | Stores a FeiShu bot webhook and message template for task failure notifications and similar scenarios. |
6 changes: 3 additions & 3 deletions tidb-cloud-lake/guides/integrate-with-amazon-sqs-s3.md
@@ -1,15 +1,15 @@
---
title: Amazon SQS (S3) Integration Task
title: Amazon SQS (S3) Integration Task (Beta)
summary: Learn how to create an Amazon SQS (S3) integration task that consumes S3 object creation events from an SQS queue and writes the corresponding object data into {{{ .lake }}}.
---

# Amazon SQS (S3) Integration Task
# Amazon SQS (S3) Integration Task (Beta)

This page describes how to create an Amazon SQS (S3) integration task that consumes S3 object creation events from an SQS queue and writes the corresponding object data into {{{ .lake }}}.

This task is designed for S3 event-driven data ingestion. After an upstream system writes an object to S3, S3 sends an `ObjectCreated` event to SQS. {{{ .lake }}} consumes the SQS message through AssumeRole and writes data into {{{ .lake }}} based on the bucket and object key in the event.

If you need to create reusable SQS (S3) connection settings first, see [Amazon SQS (S3) - IAM Role](/tidb-cloud-lake/guides/amazon-sqs-s3-iam-role.md).
If you need to create reusable SQS (S3) connection settings first, see [Amazon SQS (S3) - IAM Role (Beta)](/tidb-cloud-lake/guides/amazon-sqs-s3-iam-role.md).

## Use Cases

2 changes: 1 addition & 1 deletion tidb-cloud-lake/guides/integration-tasks.md
@@ -14,7 +14,7 @@ Unlike data sources, integration tasks are the executable units that actually pe
| Task Type | Description |
|-----------|-------------|
| [Amazon S3](/tidb-cloud-lake/guides/integrate-with-amazon-s3.md) | Imports CSV, Parquet, or NDJSON files from Amazon S3 with support for one-time or continuous ingestion. |
| [Amazon SQS (S3)](/tidb-cloud-lake/guides/integrate-with-amazon-sqs-s3.md) | Consumes S3 object creation events from an SQS queue and writes the corresponding object data into {{{ .lake }}}. |
| [Amazon SQS (S3) (Beta)](/tidb-cloud-lake/guides/integrate-with-amazon-sqs-s3.md) | Consumes S3 object creation events from an SQS queue and writes the corresponding object data into {{{ .lake }}}. |
| [MySQL](/tidb-cloud-lake/guides/integrate-with-mysql.md) | Synchronizes table data from MySQL using `Snapshot`, `CDC Only`, or `Snapshot + CDC`. |
| [PostgreSQL](/tidb-cloud-lake/guides/integrate-with-postgresql.md) | Synchronizes table data from PostgreSQL using `Snapshot`, `CDC Only`, or `Snapshot + CDC`. |

153 changes: 142 additions & 11 deletions tidb-cloud-lake/guides/schema-evolution.md
@@ -5,15 +5,16 @@ summary: Automatically evolve table schemas when loading data with COPY INTO.

# Schema Evolution

Schema evolution allows {{{ .lake }}} to automatically add new columns to a table during `COPY INTO` when the source Parquet files contain columns not yet present in the table.
Schema evolution allows {{{ .lake }}} to automatically add columns that exist in source files but are missing from the target table during `COPY INTO`. It currently supports **Parquet** and **NDJSON** files.

## How It Works

When enabled, `COPY INTO`:
When enabled, {{{ .lake }}} infers the source file schema before loading and appends new columns to the end of the table. New columns are nullable, and missing values are filled with `NULL`.

1. Infers the schema from source Parquet files.
2. Adds any new columns (not in the table) as nullable columns.
3. Loads the data, filling missing values with `NULL`.
The workflow differs slightly by file format:

- **Parquet**: After the table option is enabled, `COPY INTO` infers new columns directly from Parquet file schemas.
- **NDJSON**: After the table option is enabled, `COPY INTO` uses `AUTO` sampling values for schema inference. You can optionally add `SCHEMA_EVOLUTION = (...)` to override the file and record sampling limits.

## Enabling Schema Evolution

@@ -27,15 +28,21 @@ ALTER TABLE my_table SET OPTIONS(ENABLE_SCHEMA_EVOLUTION = true);
CREATE TABLE my_table(id INT) ENABLE_SCHEMA_EVOLUTION = true;
```

To disable, set it back to `false`:
To disable schema evolution, set it back to `false`:

```sql
ALTER TABLE my_table SET OPTIONS(ENABLE_SCHEMA_EVOLUTION = false);
```

## Tutorial
## Privileges

When `COPY INTO <table>` loads files from a stage or external location and runs schema evolution inference, the loading role must have both `INSERT` and `ALTER` privileges on the target table. `ALTER` is required because {{{ .lake }}} may append new columns before loading.

Query-based COPY is not affected. For example, `COPY INTO <table> FROM (SELECT ... FROM @stage)` keeps the existing privilege requirements.
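
As a minimal sketch of the privilege setup described above, the statements below grant both privileges to a loading role. The role name `loader_role` and the `default` database are hypothetical, and the example assumes that {{{ .lake }}} accepts the common `GRANT ... ON <database>.<table> TO ROLE <role>` form. Check the privilege reference for the exact syntax in your environment.

```sql
-- Hypothetical role; replace it with the role that runs COPY INTO.
CREATE ROLE IF NOT EXISTS loader_role;

-- INSERT is needed to load rows. ALTER is needed because schema
-- evolution may append new columns before loading.
GRANT INSERT, ALTER ON default.my_table TO ROLE loader_role;

-- Review the privileges held by the role (assumed to be supported).
SHOW GRANTS FOR ROLE loader_role;
```

If the role should not hold `ALTER`, a query-based COPY avoids the extra requirement, at the cost of not running schema evolution inference.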

## Parquet Example

low

According to the style guide, headings should use sentence case. Please update this heading to use sentence case.

Suggested change
## Parquet Example
## Parquet example
References
  1. Use sentence case for headings (e.g., ## Configure the cluster). (link)


This tutorial uses a fully runnable example to demonstrate schema evolution.
The following example loads Parquet files with different schemas and automatically adds missing columns.

### Step 1: Create a Table and Stage

@@ -72,7 +79,7 @@ FILE_FORMAT = (TYPE = parquet MISSING_FIELD_AS = FIELD_DEFAULT);

### Step 4: Verify Results

The table now has three columns `amount` and `currency` were added automatically:
The table now has three columns. `amount` and `currency` were added automatically:

```sql
DESC invoices;
@@ -104,6 +111,129 @@ SELECT * FROM invoices ORDER BY order_id;

Row 3 has `currency = NULL` because its source file did not contain that column.

## NDJSON Example

low

According to the style guide, headings should use sentence case. Please update this heading to use sentence case.

Suggested change
## NDJSON Example
## NDJSON example
References
  1. Use sentence case for headings (e.g., ## Configure the cluster). (link)


{{{ .lake }}} loads NDJSON files with `TYPE = ndjson`. NDJSON files do not have an embedded columnar schema like Parquet files, so {{{ .lake }}} samples file content, infers fields that are missing from the target table, and appends them as nullable columns.

### Step 1: Create a Table and Stage

low

According to the style guide, headings should use sentence case. Please update this heading to use sentence case.

Suggested change
### Step 1: Create a Table and Stage
### Step 1: Create a table and stage
References
  1. Use sentence case for headings (e.g., ## Configure the cluster). (link)


```sql
CREATE OR REPLACE TABLE events(id INT);
CREATE OR REPLACE STAGE events_stage;
```

### Step 2: Generate NDJSON Files with Different Fields

low

According to the style guide, headings should use sentence case. Please update this heading to use sentence case.

Suggested change
### Step 2: Generate NDJSON Files with Different Fields
### Step 2: Generate NDJSON files with different fields
References
  1. Use sentence case for headings (e.g., ## Configure the cluster). (link)


```sql
-- File with fields: id, city, score
COPY INTO @events_stage FROM (
SELECT 1 AS id, 'SF' AS city, 9 AS score
UNION ALL
SELECT 2, 'NYC', 8
) FILE_FORMAT = (TYPE = ndjson);

-- File with fields: id, score (no city)
COPY INTO @events_stage FROM (
SELECT 3 AS id, 7 AS score
) FILE_FORMAT = (TYPE = ndjson);
```
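
Before loading, it can help to confirm that the unloaded files landed in the stage. The check below is a sketch that assumes the `LIST` stage command is available in {{{ .lake }}}. The generated file names will differ in your environment.

```sql
-- Expect at least one NDJSON file for each COPY INTO @events_stage statement above.
LIST @events_stage;
```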

### Step 3: Enable Schema Evolution and Load

low

According to the style guide, headings should use sentence case. Please update this heading to use sentence case.

Suggested change
### Step 3: Enable Schema Evolution and Load
### Step 3: Enable schema evolution and load
References
  1. Use sentence case for headings (e.g., ## Configure the cluster). (link)


```sql
ALTER TABLE events SET OPTIONS(ENABLE_SCHEMA_EVOLUTION = true);

COPY INTO events
FROM @events_stage/
FILE_FORMAT = (TYPE = ndjson MISSING_FIELD_AS = FIELD_DEFAULT)
SCHEMA_EVOLUTION = (
SAMPLE_FILES = AUTO,
SAMPLE_RECORDS_PER_FILE = AUTO,
SAMPLE_TOTAL_RECORDS = AUTO
);
```

The three `SCHEMA_EVOLUTION` sampling options accept either `AUTO` or a positive integer:

| Option | Description |
|------|------|
| `SAMPLE_FILES` | Number of files to sample. |
| `SAMPLE_RECORDS_PER_FILE` | Maximum number of records to sample from each selected file. |
| `SAMPLE_TOTAL_RECORDS` | Maximum number of records to sample across all selected files. |

If `SCHEMA_EVOLUTION` is omitted, {{{ .lake }}} uses `AUTO` for all three sampling options. The current `AUTO` behavior samples up to 64 files, 1,000 records per file, and 10,000 records in total. These internal defaults may change in future versions. If your load is sensitive to the sampling strategy, set `SAMPLE_FILES`, `SAMPLE_RECORDS_PER_FILE`, and `SAMPLE_TOTAL_RECORDS` explicitly.
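
For loads that are sensitive to sampling, the limits can be pinned instead of relying on `AUTO`. The sketch below reuses the `events` table and `events_stage` stage from this example. The integer values are illustrative only and are not recommendations.

```sql
-- Same load as above, with explicit sampling limits instead of AUTO.
-- The numbers are placeholders; choose values that cover the field
-- variety you expect in your files.
COPY INTO events
FROM @events_stage/
FILE_FORMAT = (TYPE = ndjson MISSING_FIELD_AS = FIELD_DEFAULT)
SCHEMA_EVOLUTION = (
    SAMPLE_FILES = 128,
    SAMPLE_RECORDS_PER_FILE = 2000,
    SAMPLE_TOTAL_RECORDS = 50000
);
```

Explicit values keep the sampling behavior stable if the internal `AUTO` defaults change in a later version.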

#### NDJSON Inference Rules

When running schema evolution for NDJSON, {{{ .lake }}} infers new columns using these rules:

- Schema is inferred only from sampled NDJSON records. Fields not covered by the sample are not added to the target table ahead of time.
- Each line must be a JSON object. {{{ .lake }}} uses top-level object field names as candidate column names.
- Columns that already exist in the target table are not added again. Only fields missing from the target table are appended.
- New field types are inferred from sampled JSON values, such as integers, floats, strings, and booleans.
- Schema evolution uses shallow NDJSON inference: if a top-level field value is an object or array, it is appended as a `VARIANT` column instead of being recursively expanded.
- `NULL` samples only mark the field as nullable. They do not force later non-null values to become `VARCHAR` or `VARIANT`.
- Same-name fields across files or records are merged: integer and float conflicts become `DOUBLE`; other scalar conflicts become `VARCHAR`; any conflict involving an object, array, or `VARIANT` becomes `VARIANT`.
- If loading encounters extra fields that were not inferred during sampling, the load fails and reports those field names. Increase `SAMPLE_FILES`, `SAMPLE_RECORDS_PER_FILE`, or `SAMPLE_TOTAL_RECORDS` and retry.

> **Note:**
>
> The `INFER_SCHEMA` table function does not limit NDJSON nesting depth by default. The rules here describe the shallow inference used by `COPY INTO` schema evolution.

For example, the following NDJSON records infer six new columns: `name`, `age`, `active`, `score`, `profile`, and `tags`:

```json
{"id":1,"name":"Alice","age":30,"active":true,"score":1,"profile":{"city":"SF"},"tags":["new"]}
{"id":2,"name":"Bob","age":null,"active":false,"score":1.5,"profile":{"city":"NYC"},"tags":["vip"]}
```

If the target table only has `id INT`, {{{ .lake }}} appends:

```text
name VARCHAR NULL
age BIGINT NULL
active BOOLEAN NULL
score DOUBLE NULL
profile VARIANT NULL
tags VARIANT NULL
```

The second row has `age = NULL`, which does not change the `BIGINT` type inferred from the first row. `score` contains both an integer and a float, so it becomes `DOUBLE`. `profile` and `tags` are an object and an array, so schema evolution appends them as `VARIANT` columns.
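
To make the result concrete, the appended columns roughly correspond to the manual DDL below. This is only an illustration: `events_demo` is a hypothetical table name, and schema evolution applies the equivalent change automatically during `COPY INTO`, so you do not need to run these statements yourself.

```sql
-- Hypothetical target table that starts with a single column.
CREATE OR REPLACE TABLE events_demo(id INT);

-- Manual equivalent of the six columns inferred from the two records above.
ALTER TABLE events_demo ADD COLUMN name VARCHAR NULL;
ALTER TABLE events_demo ADD COLUMN age BIGINT NULL;      -- the null sample does not change the type
ALTER TABLE events_demo ADD COLUMN active BOOLEAN NULL;
ALTER TABLE events_demo ADD COLUMN score DOUBLE NULL;    -- integer and float merged
ALTER TABLE events_demo ADD COLUMN profile VARIANT NULL; -- object, not expanded
ALTER TABLE events_demo ADD COLUMN tags VARIANT NULL;    -- array, not expanded
```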

### Step 4: Verify Results

low

According to the style guide, headings should use sentence case. Please update this heading to use sentence case.

Suggested change
### Step 4: Verify Results
### Step 4: Verify results
References
  1. Use sentence case for headings (e.g., ## Configure the cluster). (link)


The table now has three columns. `city` and `score` were added automatically:

```sql
DESC events;
```

```text
┌───────┬─────────┬──────┬─────────┬───────┐
│ Field │ Type    │ Null │ Default │ Extra │
├───────┼─────────┼──────┼─────────┼───────┤
│ id    │ INT     │ YES  │ NULL    │       │
│ city  │ VARCHAR │ YES  │ NULL    │       │
│ score │ BIGINT  │ YES  │ NULL    │       │
└───────┴─────────┴──────┴─────────┴───────┘
```

```sql
SELECT * FROM events ORDER BY id;
```

```text
┌────┬──────┬───────┐
│ id │ city │ score │
├────┼──────┼───────┤
│ 1  │ SF   │ 9     │
│ 2  │ NYC  │ 8     │
│ 3  │ NULL │ 7     │
└────┴──────┴───────┘
```

If the sample does not cover a field that appears later in the data, loading fails and returns the extra field name. Increase `SAMPLE_FILES`, `SAMPLE_RECORDS_PER_FILE`, or `SAMPLE_TOTAL_RECORDS` and retry.

## Column Match Mode

By default, column names are matched case-insensitively. Use `COLUMN_MATCH_MODE` for case-sensitive matching:
@@ -117,8 +247,9 @@ COLUMN_MATCH_MODE = CASE_SENSITIVE;

## Limitations

- Supported for **Parquet** files only.
- Currently supports **Parquet** and **NDJSON** files.
- New columns are appended to the end of the table and are always nullable.
- If the same column name appears in multiple files with **different data types**, the load fails.
- No automatic type promotion (e.g., `INT` → `BIGINT`).
- No automatic type promotion, such as `INT` to `BIGINT`.
- Column drops and renames are not supported through schema evolution.
- NDJSON relies on sampling to infer schema. If sampling does not cover all fields, increase the `SCHEMA_EVOLUTION` sampling options.
2 changes: 1 addition & 1 deletion tidb-cloud-lake/guides/task-management.md
@@ -52,6 +52,6 @@ Click a task to view its execution history. The run history includes:
For field-level configuration and detailed behavior, continue with the relevant task guide:

- [Amazon S3 Integration Task](/tidb-cloud-lake/guides/integrate-with-amazon-s3.md)
- [Amazon SQS (S3) Integration Task](/tidb-cloud-lake/guides/integrate-with-amazon-sqs-s3.md)
- [Amazon SQS (S3) Integration Task (Beta)](/tidb-cloud-lake/guides/integrate-with-amazon-sqs-s3.md)
- [MySQL Integration Task](/tidb-cloud-lake/guides/integrate-with-mysql.md)
- [PostgreSQL Integration Task](/tidb-cloud-lake/guides/integrate-with-postgresql.md)