lake: incremental updates 0607 #23019
Changes from all commits
9d4a5b9
56f5cc1
330a1fb
f85fb8a
32e7988
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -5,15 +5,16 @@ summary: Automatically evolve table schemas when loading data with COPY INTO. | |||||
|
|
||||||
| # Schema Evolution | ||||||
|
|
||||||
| Schema evolution allows {{{ .lake }}} to automatically add new columns to a table during `COPY INTO` when the source Parquet files contain columns not yet present in the table. | ||||||
| Schema evolution allows {{{ .lake }}} to automatically add columns that exist in source files but are missing from the target table during `COPY INTO`. It currently supports **Parquet** and **NDJSON** files. | ||||||
|
|
||||||
| ## How It Works | ||||||
|
|
||||||
| When enabled, `COPY INTO`: | ||||||
| When enabled, {{{ .lake }}} infers the source file schema before loading and appends new columns to the end of the table. New columns are nullable, and missing values are filled with `NULL`. | ||||||
|
|
||||||
| 1. Infers the schema from source Parquet files. | ||||||
| 2. Adds any new columns (not in the table) as nullable columns. | ||||||
| 3. Loads the data, filling missing values with `NULL`. | ||||||
| The workflow differs slightly by file format: | ||||||
|
|
||||||
| - **Parquet**: After the table option is enabled, `COPY INTO` infers new columns directly from Parquet file schemas. | ||||||
| - **NDJSON**: After the table option is enabled, `COPY INTO` uses `AUTO` sampling values for schema inference. You can optionally add `SCHEMA_EVOLUTION = (...)` to override the file and record sampling limits. | ||||||
|
|
||||||
| ## Enabling Schema Evolution | ||||||
|
|
||||||
|
|
@@ -27,15 +28,21 @@ ALTER TABLE my_table SET OPTIONS(ENABLE_SCHEMA_EVOLUTION = true); | |||||
| CREATE TABLE my_table(id INT) ENABLE_SCHEMA_EVOLUTION = true; | ||||||
| ``` | ||||||
|
|
||||||
| To disable, set it back to `false`: | ||||||
| To disable schema evolution, set it back to `false`: | ||||||
|
|
||||||
| ```sql | ||||||
| ALTER TABLE my_table SET OPTIONS(ENABLE_SCHEMA_EVOLUTION = false); | ||||||
| ``` | ||||||
|
|
||||||
| ## Tutorial | ||||||
| ## Privileges | ||||||
|
|
||||||
| When `COPY INTO <table>` loads files from a stage or external location and runs schema evolution inference, the loading role must have both `INSERT` and `ALTER` privileges on the target table. `ALTER` is required because {{{ .lake }}} may append new columns before loading. | ||||||
|
|
||||||
| Query-based COPY is not affected. For example, `COPY INTO <table> FROM (SELECT ... FROM @stage)` keeps the existing privilege requirements. | ||||||
|
|
||||||
| ## Parquet Example | ||||||
|
Contributor
According to the style guide, headings should use sentence case. Please update this heading to use sentence case.
|
||||||
|
|
||||||
| This tutorial uses a fully runnable example to demonstrate schema evolution. | ||||||
| The following example loads Parquet files with different schemas and automatically adds missing columns. | ||||||
|
|
||||||
| ### Step 1: Create a Table and Stage | ||||||
|
|
||||||
|
|
@@ -72,7 +79,7 @@ FILE_FORMAT = (TYPE = parquet MISSING_FIELD_AS = FIELD_DEFAULT); | |||||
|
|
||||||
| ### Step 4: Verify Results | ||||||
|
|
||||||
| The table now has three columns — `amount` and `currency` were added automatically: | ||||||
| The table now has three columns. `amount` and `currency` were added automatically: | ||||||
|
|
||||||
| ```sql | ||||||
| DESC invoices; | ||||||
|
|
@@ -104,6 +111,129 @@ SELECT * FROM invoices ORDER BY order_id; | |||||
|
|
||||||
| Row 3 has `currency = NULL` because its source file did not contain that column. | ||||||
|
|
||||||
| ## NDJSON Example | ||||||
|
Contributor
According to the style guide, headings should use sentence case. Please update this heading to use sentence case.
|
||||||
|
|
||||||
| {{{ .lake }}} loads NDJSON files with `TYPE = ndjson`. NDJSON files do not have an embedded columnar schema like Parquet files, so {{{ .lake }}} samples file content, infers fields that are missing from the target table, and appends them as nullable columns. | ||||||
|
|
||||||
| ### Step 1: Create a Table and Stage | ||||||
|
Contributor
According to the style guide, headings should use sentence case. Please update this heading to use sentence case.
|
||||||
|
|
||||||
| ```sql | ||||||
| CREATE OR REPLACE TABLE events(id INT); | ||||||
| CREATE OR REPLACE STAGE events_stage; | ||||||
| ``` | ||||||
|
|
||||||
| ### Step 2: Generate NDJSON Files with Different Fields | ||||||
|
Contributor
According to the style guide, headings should use sentence case. Please update this heading to use sentence case.
|
||||||
|
|
||||||
| ```sql | ||||||
| -- File with fields: id, city, score | ||||||
| COPY INTO @events_stage FROM ( | ||||||
| SELECT 1 AS id, 'SF' AS city, 9 AS score | ||||||
| UNION ALL | ||||||
| SELECT 2, 'NYC', 8 | ||||||
| ) FILE_FORMAT = (TYPE = ndjson); | ||||||
|
|
||||||
| -- File with fields: id, score (no city) | ||||||
| COPY INTO @events_stage FROM ( | ||||||
| SELECT 3 AS id, 7 AS score | ||||||
| ) FILE_FORMAT = (TYPE = ndjson); | ||||||
| ``` | ||||||
|
|
||||||
| ### Step 3: Enable Schema Evolution and Load | ||||||
|
Contributor
According to the style guide, headings should use sentence case. Please update this heading to use sentence case.
|
||||||
|
|
||||||
| ```sql | ||||||
| ALTER TABLE events SET OPTIONS(ENABLE_SCHEMA_EVOLUTION = true); | ||||||
|
|
||||||
| COPY INTO events | ||||||
| FROM @events_stage/ | ||||||
| FILE_FORMAT = (TYPE = ndjson MISSING_FIELD_AS = FIELD_DEFAULT) | ||||||
| SCHEMA_EVOLUTION = ( | ||||||
| SAMPLE_FILES = AUTO, | ||||||
| SAMPLE_RECORDS_PER_FILE = AUTO, | ||||||
| SAMPLE_TOTAL_RECORDS = AUTO | ||||||
| ); | ||||||
| ``` | ||||||
|
|
||||||
| The three `SCHEMA_EVOLUTION` sampling options accept either `AUTO` or a positive integer: | ||||||
|
|
||||||
| | Option | Description | | ||||||
| |------|------| | ||||||
| | `SAMPLE_FILES` | Number of files to sample. | | ||||||
| | `SAMPLE_RECORDS_PER_FILE` | Maximum number of records to sample from each selected file. | | ||||||
| | `SAMPLE_TOTAL_RECORDS` | Maximum number of records to sample across all selected files. | | ||||||
|
|
||||||
| If `SCHEMA_EVOLUTION` is omitted, {{{ .lake }}} uses `AUTO` for all three sampling options. The current `AUTO` behavior samples up to 64 files, 1,000 records per file, and 10,000 records in total. These internal defaults may change in future versions. If your load is sensitive to the sampling strategy, set `SAMPLE_FILES`, `SAMPLE_RECORDS_PER_FILE`, and `SAMPLE_TOTAL_RECORDS` explicitly. | ||||||
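The interaction of the three limits can be illustrated with a small Python sketch. It is not the actual implementation: the function name `sample_records` is hypothetical, and the sketch assumes files are visited in listing order and records are taken from the start of each file, which the documentation does not specify. Only the default values (64, 1,000, 10,000) come from the text above.

```python
from itertools import islice

def sample_records(files, sample_files=64, sample_records_per_file=1000,
                   sample_total_records=10000):
    """Collect sample records under three caps (hypothetical sketch).

    `files` is a list of iterables, one per file, each yielding records.
    Assumption: files are taken in order and records from the top of each file.
    """
    sampled = []
    for records in files[:sample_files]:           # cap on number of files
        remaining = sample_total_records - len(sampled)
        if remaining <= 0:                         # total cap already reached
            break
        per_file = min(sample_records_per_file, remaining)
        sampled.extend(islice(records, per_file))  # cap per file
    return sampled
```

Under these assumptions, 100 files of 5,000 records each would be cut off by the total cap after 10 files, so a field that first appears in a later file would not be seen by the sample.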
|
|
||||||
| #### NDJSON Inference Rules | ||||||
|
|
||||||
| When running Schema Evolution for NDJSON, {{{ .lake }}} infers new columns using these rules: | ||||||
|
|
||||||
| - Schema is inferred only from sampled NDJSON records. Fields not covered by the sample are not added to the target table ahead of time. | ||||||
| - Each line must be a JSON object. {{{ .lake }}} uses top-level object field names as candidate column names. | ||||||
| - Columns that already exist in the target table are not added again. Only fields missing from the target table are appended. | ||||||
| - New field types are inferred from sampled JSON values, such as integers, floats, strings, and booleans. | ||||||
| - Schema Evolution uses shallow NDJSON inference: if a top-level field value is an object or array, it is appended as a `VARIANT` column instead of being recursively expanded. | ||||||
| - `NULL` samples only mark the field as nullable. They do not force later non-null values to become `VARCHAR` or `VARIANT`. | ||||||
| - Same-name fields across files or records are merged: integer and float conflicts become `DOUBLE`; other scalar conflicts become `VARCHAR`; any conflict involving an object, array, or `VARIANT` becomes `VARIANT`. | ||||||
| - If loading encounters extra fields that were not inferred during sampling, the load fails and reports those field names. Increase `SAMPLE_FILES`, `SAMPLE_RECORDS_PER_FILE`, or `SAMPLE_TOTAL_RECORDS` and retry. | ||||||
|
|
||||||
| > **Note:** | ||||||
| > | ||||||
| > The `INFER_SCHEMA` table function does not limit NDJSON nesting depth by default. The rules here describe the shallow inference used by `COPY INTO` Schema Evolution. | ||||||
|
|
||||||
| For example, the following NDJSON records infer six new columns: `name`, `age`, `active`, `score`, `profile`, and `tags`: | ||||||
|
|
||||||
| ```json | ||||||
| {"id":1,"name":"Alice","age":30,"active":true,"score":1,"profile":{"city":"SF"},"tags":["new"]} | ||||||
| {"id":2,"name":"Bob","age":null,"active":false,"score":1.5,"profile":{"city":"NYC"},"tags":["vip"]} | ||||||
| ``` | ||||||
|
|
||||||
| If the target table only has `id INT`, {{{ .lake }}} appends: | ||||||
|
|
||||||
| ```text | ||||||
| name VARCHAR NULL | ||||||
| age BIGINT NULL | ||||||
| active BOOLEAN NULL | ||||||
| score DOUBLE NULL | ||||||
| profile VARIANT NULL | ||||||
| tags VARIANT NULL | ||||||
| ``` | ||||||
|
|
||||||
| The second row has `age = NULL`, which does not change the `BIGINT` type inferred from the first row. `score` contains both an integer and a float, so it becomes `DOUBLE`. `profile` and `tags` are an object and an array, so Schema Evolution appends them as `VARIANT` columns. | ||||||
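The inference rules listed above can be sketched in Python to make the merge behavior concrete. This is an illustrative model only, not the engine's code: the function names are hypothetical, the type names follow the example output above, and the handling of a field that is only ever `null` in the sample (kept here as `None`) is an assumption, since the text does not say what type such a field receives.

```python
import json

def scalar_type(value):
    """Map one non-null JSON value to a column type (shallow inference)."""
    # bool must be checked before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, int):
        return "BIGINT"
    if isinstance(value, float):
        return "DOUBLE"
    if isinstance(value, str):
        return "VARCHAR"
    return "VARIANT"  # objects and arrays are not expanded recursively

def merge_types(a, b):
    """Merge two inferred types for the same field name."""
    if a == b:
        return a
    if "VARIANT" in (a, b):
        return "VARIANT"   # any conflict involving VARIANT
    if {a, b} == {"BIGINT", "DOUBLE"}:
        return "DOUBLE"    # integer and float conflict
    return "VARCHAR"       # other scalar conflicts

def infer_new_columns(lines, existing_columns):
    """Return {field: type} for top-level fields missing from the table."""
    # Case-insensitive matching, mirroring the default column match mode.
    existing = {name.lower() for name in existing_columns}
    inferred = {}
    for line in lines:
        record = json.loads(line)
        if not isinstance(record, dict):
            raise ValueError("each NDJSON line must be a JSON object")
        for field, value in record.items():
            if field.lower() in existing:
                continue  # existing columns are not added again
            if value is None:
                # A null only marks the field as seen; it does not set a type.
                inferred.setdefault(field, None)
                continue
            new_type = scalar_type(value)
            old_type = inferred.get(field)
            inferred[field] = (new_type if old_type is None
                               else merge_types(old_type, new_type))
    return inferred
```

Calling `infer_new_columns` on the two records above with `existing_columns=["id"]` yields `name` as `VARCHAR`, `age` as `BIGINT`, `active` as `BOOLEAN`, `score` as `DOUBLE`, and `profile` and `tags` as `VARIANT`, matching the appended columns listed earlier.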
|
|
||||||
| ### Step 4: Verify Results | ||||||
|
Contributor
According to the style guide, headings should use sentence case. Please update this heading to use sentence case.
|
||||||
|
|
||||||
| The table now has three columns. `city` and `score` were added automatically: | ||||||
|
|
||||||
| ```sql | ||||||
| DESC events; | ||||||
| ``` | ||||||
|
|
||||||
| ```text | ||||||
| ┌─────────────────────────────────────────────────────────┐ | ||||||
| │ Field │ Type │ Null │ Default │ Extra │ | ||||||
| ├───────┼──────────────┼────────┼─────────┼──────────────┤ | ||||||
| │ id │ INT │ YES │ NULL │ │ | ||||||
| │ city │ VARCHAR │ YES │ NULL │ │ | ||||||
| │ score │ BIGINT │ YES │ NULL │ │ | ||||||
| └─────────────────────────────────────────────────────────┘ | ||||||
| ``` | ||||||
|
|
||||||
| ```sql | ||||||
| SELECT * FROM events ORDER BY id; | ||||||
| ``` | ||||||
|
|
||||||
| ```text | ||||||
| ┌────────────────────────────┐ | ||||||
| │ id │ city │ score │ | ||||||
| ├────┼──────┼────────────────┤ | ||||||
| │ 1 │ SF │ 9 │ | ||||||
| │ 2 │ NYC │ 8 │ | ||||||
| │ 3 │ NULL │ 7 │ | ||||||
| └────────────────────────────┘ | ||||||
| ``` | ||||||
|
|
||||||
| If the sample does not cover a field that appears later in the data, loading fails and returns the extra field name. Increase `SAMPLE_FILES`, `SAMPLE_RECORDS_PER_FILE`, or `SAMPLE_TOTAL_RECORDS` and retry. | ||||||
|
|
||||||
| ## Column Match Mode | ||||||
|
|
||||||
| By default, column names are matched case-insensitively. Use `COLUMN_MATCH_MODE` for case-sensitive matching: | ||||||
|
|
@@ -117,8 +247,9 @@ COLUMN_MATCH_MODE = CASE_SENSITIVE; | |||||
|
|
||||||
| ## Limitations | ||||||
|
|
||||||
| - Supported for **Parquet** files only. | ||||||
| - Currently supports **Parquet** and **NDJSON** files. | ||||||
| - New columns are appended to the end of the table and are always nullable. | ||||||
| - If the same column name appears in multiple files with **different data types**, the load fails. | ||||||
| - No automatic type promotion (e.g., `INT` → `BIGINT`). | ||||||
| - No automatic type promotion, such as `INT` to `BIGINT`. | ||||||
| - Column drops and renames are not supported through schema evolution. | ||||||
| - NDJSON relies on sampling to infer schema. If sampling does not cover all fields, increase the `SCHEMA_EVOLUTION` sampling options. | ||||||
According to the style guide, headings should use sentence case. Please update this heading to use sentence case.