# Iceberg Workshop

## Environment Setup

In [1]:
%%configure -f
{
    "conf":{
        "spark.sql.extensions":"org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        "spark.sql.catalog.glue_catalog":"org.apache.iceberg.spark.SparkCatalog",
        "spark.sql.catalog.glue_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        "spark.sql.catalog.glue_catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        "spark.sql.catalog.glue_catalog.warehouse":"s3://myemr-bucket-01/data/iceberg-folder/"
        }
}
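
Every catalog-specific key in the configuration above shares the prefix `spark.sql.catalog.<catalog name>`. The following is a minimal sketch, not part of the original workshop, that generates the same JSON for an arbitrary catalog name; the helper `iceberg_glue_conf` is a hypothetical name introduced only for illustration.

```python
import json


def iceberg_glue_conf(catalog: str, warehouse: str) -> dict:
    """Build a %%configure payload for an Iceberg catalog backed by AWS Glue."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "conf": {
            # Enables Iceberg SQL extensions such as CALL procedures.
            "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
            prefix: "org.apache.iceberg.spark.SparkCatalog",
            f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
            f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
            f"{prefix}.warehouse": warehouse,
        }
    }


conf = iceberg_glue_conf("glue_catalog", "s3://myemr-bucket-01/data/iceberg-folder/")
print(json.dumps(conf, indent=4))
```

The printed JSON can be pasted below a `%%configure -f` line. The catalog name chosen here is the one used in every later `glue_catalog.…` reference.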

## Small-File Management and Expired Data Cleanup

First, we create an Iceberg table and insert some data in several batches.

In [11]:
%%sql
CREATE TABLE glue_catalog.iceberg_db.sample_table_20240809 (
id int,
a string,
b int
)
USING iceberg
LOCATION 's3://myemr-bucket-01/data/iceberg-folder/iceberg_db.db/sample_table_20240809/'
TBLPROPERTIES (
    'format' = 'iceberg/parquet',
    'format-version' = '2',
    'write.metadata.delete-after-commit.enabled' = 'true',
    'write.metadata.previous-versions-max' = '5',
    'history.expire.max-snapshot-age-ms' = '86400',
    'history.expire.min-snapshots-to-keep' = '1',
    'write.update.mode' = 'merge-on-read',
    'write.delete.mode' = 'merge-on-read',
    'write.merge.mode' = 'merge-on-read'
)
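
Note that `history.expire.max-snapshot-age-ms` is expressed in milliseconds, so the value `86400` above corresponds to about 86 seconds rather than one day. The sketch below assumes a one-day retention was intended and shows how to derive the value; `to_millis` is a hypothetical helper, not an Iceberg API.

```python
from datetime import timedelta


def to_millis(delta: timedelta) -> int:
    """Convert a timedelta to whole milliseconds."""
    return delta // timedelta(milliseconds=1)


one_day_ms = to_millis(timedelta(days=1))
print(one_day_ms)  # value to use if one day of snapshot history is wanted

# What the literal 86400 means when read as milliseconds:
print(timedelta(milliseconds=86400).total_seconds(), "seconds")
```

The walkthrough below is not affected by this property, because the later `expire_snapshots` call passes an explicit `older_than` timestamp.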

### Insert Data

In [14]:
%%sql
insert into glue_catalog.iceberg_db.sample_table_20240809 values 
(1,'test01',10),
(2,'test02',20),
(3,'test03',30);

In [15]:
%%sql
insert into glue_catalog.iceberg_db.sample_table_20240809 values 
(4,'test04',40),
(5,'test05',50),
(6,'test06',60);

In [16]:
%%sql
insert into glue_catalog.iceberg_db.sample_table_20240809 values 
(7,'test07',70),
(8,'test08',80),
(9,'test09',90);

In [20]:
%%sql
insert into glue_catalog.iceberg_db.sample_table_20240809 values 
(10,'test10',100),
(11,'test11',110),
(12,'test12',120);

In [21]:
%%sql
insert into glue_catalog.iceberg_db.sample_table_20240809 values 
(13,'test13',130),
(14,'test14',140),
(15,'test15',150);

In [23]:
%%sql
insert into glue_catalog.iceberg_db.sample_table_20240809 values 
(16,'test16',160),
(17,'test17',170),
(18,'test18',180);

We have now written data in 6 separate batches. Next, let's look at what the table's directories contain.

In [24]:
%%sh
aws s3 ls s3://myemr-bucket-01/data/iceberg-folder/iceberg_db.db/sample_table_20240809/metadata/

2024-08-09 12:26:39       2645 00001-b65a2c18-b674-4fa1-b18f-62bdd288c3b1.metadata.json
2024-08-09 12:26:44       3750 00002-187219c4-18f6-44c5-b120-53dd119eb5a6.metadata.json
2024-08-09 12:27:10       4854 00003-a86cbfc0-e6de-4f9a-b896-c130844dc6d6.metadata.json
2024-08-09 12:29:12       5960 00004-94a7ab43-e3d9-426e-94c9-22eb39b3d927.metadata.json
2024-08-09 12:29:29       7067 00005-2531dfbe-849b-4ab1-afa2-85406b74c3ab.metadata.json
2024-08-09 12:30:28       7964 00006-27f4d74a-7a8f-40fd-9edb-5467426cd88b.metadata.json
2024-08-09 12:29:29       6802 2c708c90-e34e-48c7-8ead-00404e3803fa-m0.avro
2024-08-09 12:29:12       6804 302c4aa6-1e35-4231-ba45-073978d1aff7-m0.avro
2024-08-09 12:27:10       6802 464ade3d-a575-468d-8f51-6ec77bec14aa-m0.avro
2024-08-09 12:30:28       6810 76d80d7e-8481-47b3-b3b7-7572e46c3e98-m0.avro
2024-08-09 12:26:39       6815 92c4f650-15ed-4128-87d9-bfcf12d5af2c-m0.avro
2024-08-09 12:26:44       6809 f49c732b-7209-4cd4-ac76-11531ab12d25-m0.avro
2024-08-09 12:29

In [25]:
%%sh
aws s3 ls s3://myemr-bucket-01/data/iceberg-folder/iceberg_db.db/sample_table_20240809/data/

2024-08-09 12:26:43        804 00000-12-e8faa9e2-18e9-4aa1-9dce-ee8d5bff0adb-00001.parquet
2024-08-09 12:27:10        804 00000-15-a36af0e9-3fe2-4ef8-a745-c0bb8c0c04cc-00001.parquet
2024-08-09 12:29:11        804 00000-18-6c2bc256-af53-4298-b701-cc5bb85c6be2-00001.parquet
2024-08-09 12:29:29        804 00000-20-c8f2e180-0424-46bd-83ca-64fa0fda1330-00001.parquet
2024-08-09 12:30:28        804 00000-23-608cdd88-d8e6-49ac-905d-c59a0c4264a4-00001.parquet
2024-08-09 12:26:39        803 00000-9-ba0eda63-f999-42a7-a313-2c5b098d2019-00001.parquet
2024-08-09 12:26:39        804 00001-10-ba0eda63-f999-42a7-a313-2c5b098d2019-00001.parquet
2024-08-09 12:26:43        803 00001-13-e8faa9e2-18e9-4aa1-9dce-ee8d5bff0adb-00001.parquet
2024-08-09 12:27:10        804 00001-16-a36af0e9-3fe2-4ef8-a745-c0bb8c0c04cc-00001.parquet
2024-08-09 12:29:11        803 00001-19-6c2bc256-af53-4298-b701-cc5bb85c6be2-00001.parquet
2024-08-09 12:29:29        804 00001-21-c8f2e180-0424-46bd-83ca-64fa0fda1330-00001.parquet
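
Each small insert produced its own tiny Parquet files of roughly 800 bytes. As a rough sketch, assuming the four-column `aws s3 ls` format shown above (date, time, size, key), a listing can be summarized as follows; `summarize_listing` is a hypothetical helper introduced for illustration.

```python
from typing import Tuple


def summarize_listing(listing: str, suffix: str = ".parquet") -> Tuple[int, int]:
    """Return (file count, total bytes) for keys ending in `suffix`."""
    count = 0
    total = 0
    for line in listing.splitlines():
        parts = line.split()
        # Expected shape: date, time, size in bytes, object key.
        if len(parts) == 4 and parts[3].endswith(suffix):
            count += 1
            total += int(parts[2])
    return count, total


# Three lines copied from the listing above.
sample = """\
2024-08-09 12:26:43        804 00000-12-e8faa9e2-18e9-4aa1-9dce-ee8d5bff0adb-00001.parquet
2024-08-09 12:27:10        804 00000-15-a36af0e9-3fe2-4ef8-a745-c0bb8c0c04cc-00001.parquet
2024-08-09 12:26:39        803 00000-9-ba0eda63-f999-42a7-a313-2c5b098d2019-00001.parquet
"""
count, total = summarize_listing(sample)
print(f"{count} files, {total} bytes")
```

Lines that do not match the expected shape, such as truncated output, are skipped rather than raising an error.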


### Small-File Compaction: rewrite_data_files

In [28]:
%%sql
CALL glue_catalog.system.rewrite_data_files('iceberg_db.sample_table_20240809')

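After running the compaction command, the returned result reports that 17 files were rewritten and one file was added. Next, let's look at the files now present in the metadata and data directories.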

In [35]:
%%sh
aws s3 ls --summarize s3://myemr-bucket-01/data/iceberg-folder/iceberg_db.db/sample_table_20240809/data/

2024-08-09 12:26:43        804 00000-12-e8faa9e2-18e9-4aa1-9dce-ee8d5bff0adb-00001.parquet
2024-08-09 12:27:10        804 00000-15-a36af0e9-3fe2-4ef8-a745-c0bb8c0c04cc-00001.parquet
2024-08-09 12:29:11        804 00000-18-6c2bc256-af53-4298-b701-cc5bb85c6be2-00001.parquet
2024-08-09 12:29:29        804 00000-20-c8f2e180-0424-46bd-83ca-64fa0fda1330-00001.parquet
2024-08-09 12:30:28        804 00000-23-608cdd88-d8e6-49ac-905d-c59a0c4264a4-00001.parquet
2024-08-09 12:33:34        951 00000-26-7127a9c2-52ba-4d58-8eea-e91d0be607a3-00001.parquet
2024-08-09 12:26:39        803 00000-9-ba0eda63-f999-42a7-a313-2c5b098d2019-00001.parquet
2024-08-09 12:26:39        804 00001-10-ba0eda63-f999-42a7-a313-2c5b098d2019-00001.parquet
2024-08-09 12:26:43        803 00001-13-e8faa9e2-18e9-4aa1-9dce-ee8d5bff0adb-00001.parquet
2024-08-09 12:27:10        804 00001-16-a36af0e9-3fe2-4ef8-a745-c0bb8c0c04cc-00001.parquet
2024-08-09 12:29:11        803 00001-19-6c2bc256-af53-4298-b701-cc5bb85c6be2-00001.parquet


At this point we can see that the 17 files have been merged into a single file, `00000-26-7127a9c2-52ba-4d58-8eea-e91d0be607a3-00001.parquet`.
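
By default, `rewrite_data_files` uses a bin-packing strategy: small files are grouped so that each group can be rewritten into a file close to the target size. The following is a simplified illustration of that idea using greedy first-fit packing; it is not Iceberg's actual implementation, and the 512 MB target is an assumption chosen to mirror a typical target file size.

```python
from typing import List


def bin_pack(sizes: List[int], target: int) -> List[List[int]]:
    """Greedy first-fit: place each file in the first group that stays within `target` bytes."""
    groups: List[List[int]] = []
    for size in sizes:
        for group in groups:
            if sum(group) + size <= target:
                group.append(size)
                break
        else:
            # No existing group has room, so start a new one.
            groups.append([size])
    return groups


# File sizes in bytes, taken from the data/ listing before compaction.
small_files = [804] * 8 + [803] * 3
groups = bin_pack(small_files, target=512 * 1024 * 1024)
print(len(groups), "group(s);", sum(groups[0]), "input bytes in the first group")
```

Because all inputs are far below the target, they land in a single group, which matches the single output file observed above.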

In [30]:
%%sh
aws s3 ls s3://myemr-bucket-01/data/iceberg-folder/iceberg_db.db/sample_table_20240809/metadata/

2024-08-09 12:26:44       3750 00002-187219c4-18f6-44c5-b120-53dd119eb5a6.metadata.json
2024-08-09 12:27:10       4854 00003-a86cbfc0-e6de-4f9a-b896-c130844dc6d6.metadata.json
2024-08-09 12:29:12       5960 00004-94a7ab43-e3d9-426e-94c9-22eb39b3d927.metadata.json
2024-08-09 12:29:29       7067 00005-2531dfbe-849b-4ab1-afa2-85406b74c3ab.metadata.json
2024-08-09 12:30:28       7964 00006-27f4d74a-7a8f-40fd-9edb-5467426cd88b.metadata.json
2024-08-09 12:33:35       8907 00007-16785830-4dd4-4c15-b2f6-260b90ded345.metadata.json
2024-08-09 12:33:35       6802 1623bf74-4813-45a7-b7c8-935d0d16a2a2-m0.avro
2024-08-09 12:33:35       6813 1623bf74-4813-45a7-b7c8-935d0d16a2a2-m1.avro
2024-08-09 12:33:35       6802 1623bf74-4813-45a7-b7c8-935d0d16a2a2-m2.avro
2024-08-09 12:33:35       6802 1623bf74-4813-45a7-b7c8-935d0d16a2a2-m3.avro
2024-08-09 12:33:35       6808 1623bf74-4813-45a7-b7c8-935d0d16a2a2-m4.avro
2024-08-09 12:33:35       6815 1623bf74-4813-45a7-b7c8-935d0d16a2a2-m5.avro
2024-08-09 12:33

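At this point the table has 7 snapshots (the latest metadata file is version `00007`). To double-check, the following query lists the files referenced by the current snapshot.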

In [37]:
%%sql
select file_path from glue_catalog.iceberg_db.sample_table_20240809.files
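
Besides `files`, Iceberg exposes other metadata tables, such as `snapshots`, through the same `<table>.<metadata table>` naming. The sketch below only builds the query strings; `metadata_query` is a hypothetical helper, and the selected columns are a small subset chosen for illustration.

```python
from typing import List


def metadata_query(table: str, metadata_table: str, columns: List[str]) -> str:
    """Build a SELECT against one of an Iceberg table's metadata tables."""
    return f"SELECT {', '.join(columns)} FROM {table}.{metadata_table}"


table = "glue_catalog.iceberg_db.sample_table_20240809"
queries = {
    # Data files referenced by the current snapshot.
    "files": metadata_query(table, "files", ["file_path", "record_count", "file_size_in_bytes"]),
    # One row per snapshot still tracked by the table.
    "snapshots": metadata_query(table, "snapshots", ["committed_at", "snapshot_id", "operation"]),
}
for sql in queries.values():
    print(sql)  # each string can be run in a %%sql cell
```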

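The data is now merged into a single file, but the files belonging to historical snapshots still occupy storage. We therefore need to consider how to clean up these historical files.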

### expire_snapshots

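First, expire the old snapshots.

One concept to be clear on: every write, update, delete, upsert, or compaction in Iceberg produces a new snapshot while keeping the old data and metadata, which is what enables snapshot isolation and time travel. The `expire_snapshots` procedure removes old snapshots that are no longer needed, together with their files.

The procedure deletes the old snapshots and the data files that are required only by those snapshots. In other words, `expire_snapshots` never deletes a file that a non-expired snapshot still needs. For the table populated above, we expire all snapshots older than `2024-08-09 12:33:00`.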

In [44]:
%%sql
CALL glue_catalog.system.expire_snapshots(table => 'iceberg_db.sample_table_20240809',older_than => TIMESTAMP '2024-08-09 12:33:00')
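
In practice the cutoff is usually computed relative to the current time rather than hard-coded. The following is a minimal sketch that formats such a cutoff into the same `CALL` statement; `expire_call` is a hypothetical helper, and the fixed `now` value is an assumption used only to make the example reproducible.

```python
from datetime import datetime, timedelta


def expire_call(catalog: str, table: str, cutoff: datetime) -> str:
    """Format an expire_snapshots CALL that expires snapshots older than `cutoff`."""
    ts = cutoff.strftime("%Y-%m-%d %H:%M:%S")
    return (
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', older_than => TIMESTAMP '{ts}')"
    )


now = datetime(2024, 8, 9, 13, 3, 0)  # stand-in for datetime.now()
cutoff = now - timedelta(minutes=30)   # keep the last 30 minutes of snapshots
sql = expire_call("glue_catalog", "iceberg_db.sample_table_20240809", cutoff)
print(sql)
```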

According to the returned result, 6 manifest lists and 6 manifest files were deleted, along with 17 data files. Now let's check the number of files in the metadata and data directories again.

In [47]:
%%sh
aws s3 ls s3://myemr-bucket-01/data/iceberg-folder/iceberg_db.db/sample_table_20240809/metadata/

2024-08-09 12:29:12       5960 00004-94a7ab43-e3d9-426e-94c9-22eb39b3d927.metadata.json
2024-08-09 12:29:29       7067 00005-2531dfbe-849b-4ab1-afa2-85406b74c3ab.metadata.json
2024-08-09 12:30:28       7964 00006-27f4d74a-7a8f-40fd-9edb-5467426cd88b.metadata.json
2024-08-09 12:33:35       8907 00007-16785830-4dd4-4c15-b2f6-260b90ded345.metadata.json
2024-08-09 12:49:29       6277 00008-6b7c8400-ff10-4728-8bde-da1374fa3a42.metadata.json
2024-08-09 12:57:39       3587 00009-64dc227a-53f2-44fa-9047-61f4a1f01d8f.metadata.json
2024-08-09 12:33:35       6802 1623bf74-4813-45a7-b7c8-935d0d16a2a2-m0.avro
2024-08-09 12:33:35       6813 1623bf74-4813-45a7-b7c8-935d0d16a2a2-m1.avro
2024-08-09 12:33:35       6802 1623bf74-4813-45a7-b7c8-935d0d16a2a2-m2.avro
2024-08-09 12:33:35       6802 1623bf74-4813-45a7-b7c8-935d0d16a2a2-m3.avro
2024-08-09 12:33:35       6808 1623bf74-4813-45a7-b7c8-935d0d16a2a2-m4.avro
2024-08-09 12:33:35       6815 1623bf74-4813-45a7-b7c8-935d0d16a2a2-m5.avro
2024-08-09 12:33

After expiring the old snapshots, we can look at the current data files once more: the data directory now contains only one file, and only one snapshot remains.

In [46]:
%%sh
aws s3 ls s3://myemr-bucket-01/data/iceberg-folder/iceberg_db.db/sample_table_20240809/data/

2024-08-09 12:33:34        951 00000-26-7127a9c2-52ba-4d58-8eea-e91d0be607a3-00001.parquet
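
The result above follows from the rule that `expire_snapshots` removes only files that no retained snapshot references. As a simplified model of that rule (not Iceberg's implementation; the snapshot and file names below are made up for illustration), the set of deletable files can be computed as a set difference:

```python
from typing import Dict, Set


def deletable_files(snapshots: Dict[str, Set[str]], expired: Set[str]) -> Set[str]:
    """Files referenced by expired snapshots and by no retained snapshot."""
    kept: Set[str] = set()
    removable: Set[str] = set()
    for snapshot_id, files in snapshots.items():
        if snapshot_id in expired:
            removable |= files
        else:
            kept |= files
    return removable - kept


# Hypothetical snapshot-to-file mapping: two inserts, then a compaction.
snapshots = {
    "insert-1": {"a.parquet"},
    "insert-2": {"a.parquet", "b.parquet"},
    "compaction": {"merged.parquet"},
}
print(sorted(deletable_files(snapshots, expired={"insert-1", "insert-2"})))
print(sorted(deletable_files(snapshots, expired={"insert-1"})))
```

Expiring both insert snapshots frees the small files, whereas expiring only the first frees nothing, because the second insert's snapshot still references `a.parquet`.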
