# Query Lifecycle

## Creating a Table

In [6]:
create_table = """
CREATE TABLE default.db.orders (
    order_id BIGINT,
    customer_id BIGINT,
    order_amount DECIMAL(10,2),
    order_ts TIMESTAMP
)
USING iceberg
PARTITIONED BY (HOUR(order_ts))
"""

spark.sql(create_table)

DataFrame[]
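
Iceberg's `hour` transform stores, for each row, the number of whole hours between the Unix epoch (1970-01-01 00:00 UTC) and the timestamp. The sketch below reproduces that arithmetic in plain Python to show which rows would share a partition. The helper name `hour_partition_value` and the sample timestamps are illustrative assumptions, not part of the notebook or of any Iceberg API.

```python
from datetime import datetime, timedelta, timezone

# Reference point used by the hour transform.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def hour_partition_value(ts: datetime) -> int:
    """Return the whole hours elapsed since the Unix epoch (hypothetical helper)."""
    # Dividing two timedeltas with // yields a floored integer.
    return (ts - EPOCH) // timedelta(hours=1)

# Hypothetical order timestamps: the first two fall in the same hour.
first = datetime(2025, 3, 13, 20, 0, 0, tzinfo=timezone.utc)
second = datetime(2025, 3, 13, 20, 59, 59, tzinfo=timezone.utc)
third = datetime(2025, 3, 13, 21, 0, 0, tzinfo=timezone.utc)

for ts in (first, second, third):
    print(ts.isoformat(), hour_partition_value(ts))
```

Rows whose `order_ts` values produce the same integer land in the same partition, so a filter on a narrow time range only needs to touch the matching hours.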

### Table Metadata
Inspecting the metadata file that was written to MinIO for this newly created table reveals the following:

**Table identification:**

- format-version: Iceberg table format (spec) version (2)
- table-uuid: Unique identifier
- location: S3 storage path


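**Schema definition:**

- Field specifications with IDs, names, types
- decimal(10, 2) for order_amount: 10 total digits, 2 of them after the decimal point
- timestamptz for order_ts: Spark's TIMESTAMP type maps to Iceberg's timestamp with time zone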
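
To make the precision and scale concrete, here is a minimal sketch that derives the value range implied by `decimal(10, 2)` with Python's standard `decimal` module. The helper `decimal_bounds` is a hypothetical name introduced for illustration only.

```python
from decimal import Decimal

def decimal_bounds(precision: int, scale: int):
    """Return (min, max, step) for a fixed-point decimal type (hypothetical helper)."""
    # Smallest representable increment, e.g. 0.01 for scale 2.
    step = Decimal(1).scaleb(-scale)
    # Largest value: all `precision` digits set to 9.
    max_value = Decimal(10) ** (precision - scale) - step
    return -max_value, max_value, step

low, high, step = decimal_bounds(10, 2)
print(low, high, step)
```

With precision 10 and scale 2, eight digits remain for the integer part of `order_amount`.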


**Partitioning:**

- Partitioned by hour of order_ts
- Source ID 4 maps to the order_ts field


**Table properties:**

- Parquet compression codec: zstd
- Owner: root


**Snapshot tracking:**

- current-snapshot-id: -1 indicates that no snapshot exists yet, so no data has been committed
- The empty snapshots array confirms this


**Timestamps:**

- last-updated-ms: When the metadata was last modified, in milliseconds since the Unix epoch

This is a newly created table without any data yet.
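
Because `last-updated-ms` is an epoch value in milliseconds, it is easier to read once converted to a UTC datetime. A minimal sketch using only the standard library, applied to the value that appears in this table's metadata file (the helper name `epoch_ms_to_utc` is an assumption made for illustration):

```python
from datetime import datetime, timedelta, timezone

def epoch_ms_to_utc(ms: int) -> datetime:
    """Convert milliseconds since the Unix epoch to an aware UTC datetime."""
    # Integer timedelta arithmetic avoids float rounding of the milliseconds.
    return datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(milliseconds=ms)

last_updated = epoch_ms_to_utc(1741897628742)
print(last_updated.isoformat())
```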
    
**Location of the metadata file:** `warehouse/default/db/orders/metadata/00000-34f871b7-5f96-4abb-abfc-0a6bd70d20d1.metadata.json`

```json
{
  "format-version": 2,
  "table-uuid": "ff988440-bf07-4d0a-a172-caeb1c67e67d",
  "location": "s3://warehouse/default/db/orders",
  "last-sequence-number": 0,
  "last-updated-ms": 1741897628742,
  "last-column-id": 4,
  "current-schema-id": 0,
  "schemas": [
    {
      "type": "struct",
      "schema-id": 0,
      "fields": [
        {
          "id": 1,
          "name": "order_id",
          "required": false,
          "type": "long"
        },
        {
          "id": 2,
          "name": "customer_id",
          "required": false,
          "type": "long"
        },
        {
          "id": 3,
          "name": "order_amount",
          "required": false,
          "type": "decimal(10, 2)"
        },
        {
          "id": 4,
          "name": "order_ts",
          "required": false,
          "type": "timestamptz"
        }
      ]
    }
  ],
  "default-spec-id": 0,
  "partition-specs": [
    {
      "spec-id": 0,
      "fields": [
        {
          "name": "order_ts_hour",
          "transform": "hour",
          "source-id": 4,
          "field-id": 1000
        }
      ]
    }
  ],
  "last-partition-id": 1000,
  "default-sort-order-id": 0,
  "sort-orders": [
    {
      "order-id": 0,
      "fields": []
    }
  ],
  "properties": {
    "owner": "root",
    "write.parquet.compression-codec": "zstd"
  },
  "current-snapshot-id": -1,
  "refs": {},
  "snapshots": [],
  "statistics": [],
  "partition-statistics": [],
  "snapshot-log": [],
  "metadata-log": []
}
```
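
The same facts can be read programmatically. The sketch below parses a trimmed copy of the metadata shown above (only the keys it needs), resolves the partition field's `source-id` to a column name, and checks whether a snapshot exists. The function names are hypothetical, and treating a missing or null `current-snapshot-id` the same as -1 is an assumption rather than something this file demonstrates.

```python
import json

# Trimmed copy of the metadata file above; only the keys used below are kept.
METADATA_JSON = """
{
  "current-schema-id": 0,
  "schemas": [
    {"schema-id": 0, "fields": [
      {"id": 1, "name": "order_id", "type": "long"},
      {"id": 2, "name": "customer_id", "type": "long"},
      {"id": 3, "name": "order_amount", "type": "decimal(10, 2)"},
      {"id": 4, "name": "order_ts", "type": "timestamptz"}
    ]}
  ],
  "default-spec-id": 0,
  "partition-specs": [
    {"spec-id": 0, "fields": [
      {"name": "order_ts_hour", "transform": "hour",
       "source-id": 4, "field-id": 1000}
    ]}
  ],
  "current-snapshot-id": -1,
  "snapshots": []
}
"""

def describe_partitioning(metadata: dict) -> list:
    """Render each partition field of the default spec as transform(column)."""
    schema = next(s for s in metadata["schemas"]
                  if s["schema-id"] == metadata["current-schema-id"])
    names_by_id = {f["id"]: f["name"] for f in schema["fields"]}
    spec = next(p for p in metadata["partition-specs"]
                if p["spec-id"] == metadata["default-spec-id"])
    return [f'{f["transform"]}({names_by_id[f["source-id"]]})'
            for f in spec["fields"]]

def has_snapshot(metadata: dict) -> bool:
    """True once at least one snapshot has been committed (assumed convention)."""
    return metadata.get("current-snapshot-id") not in (None, -1)

metadata = json.loads(METADATA_JSON)
print(describe_partitioning(metadata))
print(has_snapshot(metadata))
```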

### Useful Queries to Inspect Table Metadata


In [33]:
# List the namespaces nested under `default`
spark.sql("SHOW NAMESPACES IN default").toPandas()

Unnamed: 0,namespace
0,default.db


In [34]:
spark.sql("SHOW TABLES IN default.db").toPandas()

Unnamed: 0,namespace,tableName,isTemporary
0,default.db,orders,False


In [24]:
spark.sql("SELECT * FROM default.db.orders.manifests").toPandas()

Unnamed: 0,content,path,length,partition_spec_id,added_snapshot_id,added_data_files_count,existing_data_files_count,deleted_data_files_count,added_delete_files_count,existing_delete_files_count,deleted_delete_files_count,partition_summaries


In [23]:
spark.sql("SELECT * FROM default.db.orders.history").toPandas()

Unnamed: 0,made_current_at,snapshot_id,parent_id,is_current_ancestor


In [22]:
spark.sql("SELECT * FROM default.db.orders.snapshots").toPandas()

Unnamed: 0,committed_at,snapshot_id,parent_id,operation,manifest_list,summary


In [21]:
spark.sql("SELECT * FROM default.db.orders.files").toPandas()

Unnamed: 0,content,file_path,file_format,spec_id,partition,record_count,file_size_in_bytes,column_sizes,value_counts,null_value_counts,...,lower_bounds,upper_bounds,key_metadata,split_offsets,equality_ids,sort_order_id,referenced_data_file,content_offset,content_size_in_bytes,readable_metrics
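
The four metadata-table queries above follow one pattern: the table identifier followed by the metadata table's name. A small sketch that builds those statements in one place, so they can be rerun after each write (the helper `metadata_queries` is a hypothetical convenience, not an Iceberg or Spark API):

```python
# Metadata tables queried in this section.
METADATA_TABLES = ("manifests", "history", "snapshots", "files")

def metadata_queries(table: str, names=METADATA_TABLES) -> dict:
    """Map each metadata table name to its SELECT statement (hypothetical helper)."""
    return {name: f"SELECT * FROM {table}.{name}" for name in names}

queries = metadata_queries("default.db.orders")
for name, sql in queries.items():
    print(f"{name}: {sql}")
```

With the notebook's `spark` session available, each statement can be passed to `spark.sql(sql).toPandas()` as in the cells above. For this table every result is still empty, since nothing has been written yet.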
