43 changes: 43 additions & 0 deletions ticdc/ticdc-architecture.md
@@ -76,6 +76,49 @@ In table split mode, pay attention to the following settings:
- [`scheduler.region-count-per-span`](/ticdc/ticdc-changefeed-config.md#region-count-per-span-new-in-v854): the default value is `100`. During changefeed initialization, tables that meet the split conditions are split according to this parameter. After splitting, each split sub-table contains at most `region-count-per-span` regions.
- [`scheduler.write-key-threshold`](/ticdc/ticdc-changefeed-config.md#write-key-threshold): the default value is `0` (disabled). When the sink write throughput of a table exceeds this threshold, TiCDC triggers table splitting. In most cases, keep this parameter at `0`.
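
As an illustration, these parameters sit under the `[scheduler]` section of the changefeed configuration file. The following fragment is a sketch assuming the defaults described above, with table splitting turned on:

```toml
[scheduler]
# Allow a single table to be split across multiple TiCDC nodes.
enable-table-across-nodes = true
# Each split sub-table contains at most this many regions (new in v8.5.4).
region-count-per-span = 100
# 0 disables throughput-based splitting; keep it at 0 in most cases.
write-key-threshold = 0
```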

## Storage Sink file name changes and consumption instructions

After you switch to the new TiCDC architecture and enable table-level task splitting, for [Storage Sink](/ticdc/ticdc-sink-to-cloud-storage.md), the name format of data change files changes from `CDC_{num}.{extension}` to `CDC_{uuid}_{num}.{extension}`, and the name format of index files changes from `CDC.index` to `CDC_{uuid}.index`. Here, `uuid` identifies a sub-replication task created by table splitting, and `num` is the file sequence number.

- Data change record path

```
{scheme}://{prefix}/{schema}/{table}/{table-version-separator}/{partition-separator}/{date-separator}/CDC_{uuid}_{num}.{extension}
```

- Index file path

```
{scheme}://{prefix}/{schema}/{table}/{table-version-separator}/{partition-separator}/{date-separator}/meta/CDC_{uuid}.index
```
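
For consumers that need to tell the two name formats apart, the patterns can be parsed as follows. This is a minimal Python sketch; the helper names are illustrative, and the numeric `uuid` pattern assumes uuids like those in the directory listing example:

```python
import re

# Data change file: CDC_{uuid}_{num}.{extension}, for example CDC_11_000002.json.
DATA_FILE_RE = re.compile(r"^CDC_(?P<uuid>\d+)_(?P<num>\d+)\.(?P<ext>\w+)$")
# Index file: CDC_{uuid}.index, for example CDC_11.index.
INDEX_FILE_RE = re.compile(r"^CDC_(?P<uuid>\d+)\.index$")

def parse_data_file(name):
    """Return (uuid, num, extension) for a data change file name, or None."""
    m = DATA_FILE_RE.match(name)
    if m is None:
        return None
    return m.group("uuid"), int(m.group("num")), m.group("ext")

def parse_index_file(name):
    """Return the uuid recorded in an index file name, or None."""
    m = INDEX_FILE_RE.match(name)
    return m.group("uuid") if m else None
```

For example, `parse_data_file("CDC_11_000002.json")` returns `("11", 2, "json")`, and `parse_index_file("CDC_11.index")` returns `"11"`.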

After table-level task splitting is enabled, under the `{schema}/{table}/{table-version-separator}/` directory, the same table might have multiple data files with different `uuid` values but the same sequence number. For example:

```
├── metadata
└── test
├── tbl_1
│ ├── 437752935075545091
│ │ ├── CDC_11_000001.json
│ │ ├── CDC_11_000002.json
│ │ ├── CDC_22_000001.json
│ │ └── meta
│ │ ├── CDC_11.index
│ │ └── CDC_22.index
│ ├── 437752935075546092
│ │ ├── CDC_33_000001.json
│ │ ├── CDC_44_000001.json
│ │ └── meta
│ │ ├── CDC_33.index
│ │ └── CDC_44.index
```

Because multiple sub-replication tasks write files in parallel, the downstream might read a data file before it is fully written and thus fail to read part of the data. To avoid this, read the data in the following order when writing a downstream consumer program:

1. Read the `meta/CDC_{uuid}.index` file (for example, `CDC_11.index`) to obtain the name of the last fully written data file of that sub-task (for example, `CDC_11_000002.json`).
2. Read, in order, the data files whose sequence numbers are less than or equal to the sequence number in that file name (for example, `CDC_11_000001.json` and `CDC_11_000002.json`).
3. After reading the DML events from the files of all sub-tasks, sort the events by their `commit-ts`, and then process them downstream in a unified manner.
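
The three steps above can be sketched as follows. This is an illustrative Python outline, not a complete consumer: `list_dir`, `read_file`, and `parse_events` are hypothetical callbacks that you would back with your storage service and file format (CSV or Canal-JSON):

```python
import re

def consume_table_dir(list_dir, read_file, parse_events):
    """Read every fully written data file under one
    {schema}/{table}/{table-version-separator}/ directory and return the
    DML events of all sub-tasks merged and sorted by commit-ts."""
    events = []
    # Step 1: each meta/CDC_{uuid}.index file records the name of the
    # last fully written data file of that sub-task.
    for index_name in list_dir("meta"):
        m = re.match(r"^CDC_(\d+)\.index$", index_name)
        if m is None:
            continue
        uuid = m.group(1)
        last_complete = read_file("meta/" + index_name).strip()
        max_num = int(re.match(r"^CDC_%s_(\d+)\." % uuid, last_complete).group(1))
        # Step 2: read this sub-task's files whose sequence numbers are
        # less than or equal to the number in that file name.
        for name in sorted(list_dir(".")):
            dm = re.match(r"^CDC_%s_(\d+)\." % uuid, name)
            if dm is not None and int(dm.group(1)) <= max_num:
                events.extend(parse_events(read_file(name)))
    # Step 3: merge the events of all sub-tasks and sort them by commit-ts.
    events.sort(key=lambda e: e["commit_ts"])
    return events
```

Data files with sequence numbers greater than the one recorded in the index file (for example, a `CDC_11_000003.json` that is still being written) are skipped until the index file advances past them.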

## Compatibility

Except as described in the following special cases, the new TiCDC architecture is fully compatible with the classic architecture.
10 changes: 10 additions & 0 deletions ticdc/ticdc-storage-consumer-dev-guide.md
@@ -176,3 +176,13 @@ The consumption logic is consistent. Specifically, the consumer parses the table
After DDL events are properly processed, you can process DML events in the `{schema}/{table}/{table-version-separator}/` directory based on the specific file format (CSV or Canal-JSON) and file number.

TiCDC ensures that data is replicated at least once. Therefore, there might be duplicate data. You need to compare the commit ts of the change data with the consumer checkpoint. If the commit ts is less than the consumer checkpoint, you need to perform deduplication.

When processing files, a downstream consumer might read a data file before it is fully written and thus fail to read part of the data. To avoid this issue, read data in the following order when writing a downstream consumer:

1. Read the `meta/CDC.index` file in the `{schema}/{table}/{table-version-separator}/` directory to obtain the name of the file that has been completely written.
2. For the [new TiCDC architecture](/ticdc/ticdc-architecture.md), read files in sequence whose file numbers are less than or equal to the number in that file name. For the [classic TiCDC architecture](/ticdc/ticdc-classic-architecture.md), read files in sequence whose file numbers are less than the number in that file name.
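
The only difference between the two architectures in step 2 is whether the file named by the index is itself safe to read. A minimal sketch under that assumption (the function and parameter names are illustrative):

```python
import re

def files_to_consume(index_content, data_file_names, new_arch):
    """Return the data files that are safe to read, given the file name
    stored in the CDC index file. The trailing _{num}.{extension} part is
    shared by the classic CDC_{num}.{extension} format and the new
    CDC_{uuid}_{num}.{extension} format."""
    boundary = int(re.search(r"_(\d+)\.\w+$", index_content.strip()).group(1))
    safe = []
    for name in sorted(data_file_names):
        m = re.search(r"_(\d+)\.\w+$", name)
        if m is None:
            continue
        num = int(m.group(1))
        # New architecture: the indexed file is fully written (<=).
        # Classic architecture: only files before it are guaranteed (<).
        ok = num <= boundary if new_arch else num < boundary
        if ok:
            safe.append(name)
    return safe
```

In the new architecture, apply this filter separately to the files of each `uuid`, because files of different sub-tasks share the same directory.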

> **Note:**
>
> When `scheduler.enable-table-across-nodes` is enabled in the [new TiCDC architecture](/ticdc/ticdc-architecture.md), the name format of data change files changes from `CDC_{num}.{extension}` to `CDC_{uuid}_{num}.{extension}`, and the name format of index files changes from `CDC.index` to `CDC_{uuid}.index`. In this case, files with different UUIDs but the same sequence number can exist in the `{schema}/{table}/{table-version-separator}/` directory. When writing a downstream consumer, follow the read order described in [Storage Sink file name changes and consumption instructions](/ticdc/ticdc-architecture.md#storage-sink-file-name-changes-and-consumption-instructions).