# 🗄️ CATALOG & METADATA

---

## 📋 **DAY 5 - LESSON 3: CATALOG & METADATA**

### **🎯 OBJECTIVES:**

1. **Spark Catalog API** - Explore metadata
2. **Database Management** - Create, drop, use databases
3. **Table Management** - Create, alter, drop tables
4. **Schema Operations** - Inspect and modify schemas
5. **Metadata Queries** - Query table/column metadata
6. **Best Practices** - Metadata management

---

## 💡 **CATALOG & METADATA:**

- Catalog = Metadata store (databases, tables, columns)
- Spark Catalog API = Python interface to metadata
- Metastore = Persistent metadata storage
- Important for data governance and discovery
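
To make the catalog versus metastore distinction concrete, the sketch below interprets the session setting that selects the catalog backend. It assumes the standard `spark.sql.catalogImplementation` setting (`hive` when `enableHiveSupport()` is used, `in-memory` otherwise); the helper name `describe_catalog_backend` is hypothetical and not part of Spark.

```python
def describe_catalog_backend(implementation: str) -> str:
    """Explain what a spark.sql.catalogImplementation value means for persistence."""
    descriptions = {
        "hive": "Hive metastore: database and table metadata persists across sessions",
        "in-memory": "in-memory catalog: metadata is lost when the session stops",
    }
    return descriptions.get(implementation, f"unrecognized backend: {implementation}")


# With a live session (assumed to be named `spark`), the value could be read like this:
# print(describe_catalog_backend(spark.conf.get("spark.sql.catalogImplementation")))
print(describe_catalog_backend("hive"))
```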

---

## 🔧 **SETUP**

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import col, lit, when, desc, asc
from pyspark.sql.types import *
import random
from datetime import datetime, timedelta

spark = SparkSession.builder \
    .appName("CatalogMetadata") \
    .master("spark://spark-master:7077") \
    .config("spark.executor.memory", "2g") \
    .config("spark.driver.memory", "1g") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000") \
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin") \
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin123") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .enableHiveSupport() \
    .getOrCreate()

print("✅ Spark Session Created")
print(f"Spark Version: {spark.version}")
print(f"Hive Support: Enabled")

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
26/01/11 16:00:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


✅ Spark Session Created
Spark Version: 3.5.1
Hive Support: Enabled


---

## 📊 **1. SPARK CATALOG API OVERVIEW**

### **What is Spark Catalog?**
- Interface to Spark's metadata
- Access via `spark.catalog`
- Query databases, tables, columns, functions
- Manage cache and temporary views
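
As a small illustration of querying tables and temporary views through the catalog, the sketch below separates persistent tables from temporary views. The `TableInfo` tuple is a stand-in for the entries returned by `spark.catalog.listTables()` so the example runs without a cluster; `split_by_persistence` is a hypothetical helper name.

```python
from collections import namedtuple

# Stand-in for the objects returned by spark.catalog.listTables(); only the
# fields used here are modeled (an assumption so the sketch runs without Spark).
TableInfo = namedtuple("TableInfo", ["name", "tableType", "isTemporary"])


def split_by_persistence(tables):
    """Return (persistent_names, temporary_names) from catalog table entries."""
    persistent = sorted(t.name for t in tables if not t.isTemporary)
    temporary = sorted(t.name for t in tables if t.isTemporary)
    return persistent, temporary


sample = [
    TableInfo("employees_temp", "TEMPORARY", True),
    TableInfo("employees_persistent", "MANAGED", False),
]
persistent, temporary = split_by_persistence(sample)
print(persistent, temporary)
# With a live session: split_by_persistence(spark.catalog.listTables())
```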

In [2]:
print("="*80)
print("📊 1. SPARK CATALOG API OVERVIEW")
print("="*80)

print("\n💡 Spark Catalog API Methods:")
print("-" * 80)

catalog_methods = [
    ("currentDatabase()", "Get current database"),
    ("setCurrentDatabase(name)", "Set current database"),
    ("listDatabases()", "List all databases"),
    ("listTables()", "List tables in current database"),
    ("listColumns(tableName)", "List columns in table"),
    ("listFunctions()", "List available functions"),
    ("tableExists(tableName)", "Check if table exists"),
    ("databaseExists(dbName)", "Check if database exists"),
    ("createTable()", "Create table from a data source"),
    ("dropTempView(viewName)", "Drop temporary view"),
    ("isCached(tableName)", "Check if table is cached"),
    ("cacheTable(tableName)", "Cache table"),
    ("uncacheTable(tableName)", "Uncache table"),
    ("clearCache()", "Clear all cached tables"),
    ("refreshTable(tableName)", "Refresh table metadata")
]

for method, description in catalog_methods:
    print(f"   {method:30s} - {description}")

print("\n🔹 Current Database:")
print(f"   {spark.catalog.currentDatabase()}")

📊 1. SPARK CATALOG API OVERVIEW

💡 Spark Catalog API Methods:
--------------------------------------------------------------------------------
   currentDatabase()              - Get current database
   setCurrentDatabase(name)       - Set current database
   listDatabases()                - List all databases
   listTables()                   - List tables in current database
   listColumns(tableName)         - List columns in table
   listFunctions()                - List available functions
   tableExists(tableName)         - Check if table exists
   databaseExists(dbName)         - Check if database exists
   createTable()                  - Create table from a data source
   dropTempView(viewName)         - Drop temporary view
   isCached(tableName)            - Check if table is cached
   cacheTable(tableName)          - Cache table
   uncacheTable(tableName)        - Uncache table
   clearCache()                   - Clear all cached tables
   refreshTable(tableName)        - Refresh table metadata

---

## 🗄️ **2. DATABASE MANAGEMENT**

In [3]:
print("="*80)
print("🗄️ 2. DATABASE MANAGEMENT")
print("="*80)

# A. List all databases
print("\n📊 A. LIST ALL DATABASES")
print("-" * 80)

databases = spark.catalog.listDatabases()
print(f"\nFound {len(databases)} database(s):\n")
for db in databases:
    print(f"   📁 {db.name}")
    print(f"      Description: {db.description}")
    print(f"      Location: {db.locationUri}")
    print()

# Using SQL
print("Using SQL:")
spark.sql("SHOW DATABASES").show()

🗄️ 2. DATABASE MANAGEMENT

📊 A. LIST ALL DATABASES
--------------------------------------------------------------------------------


26/01/11 16:00:49 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
26/01/11 16:00:49 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
26/01/11 16:00:53 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
26/01/11 16:00:53 WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore UNKNOWN@172.18.0.9
26/01/11 16:00:53 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
                                                                                


Found 1 database(s):

   📁 default
      Description: Default Hive database
      Location: file:/opt/spark-notebooks/day05/spark-warehouse

Using SQL:
+---------+
|namespace|
+---------+
|  default|
+---------+



In [4]:
# B. Create databases
print("\n📊 B. CREATE DATABASES")
print("-" * 80)

# Create test databases
databases_to_create = [
    ("hr_database", "Human Resources data"),
    ("sales_database", "Sales and revenue data"),
    ("analytics_database", "Analytics and reporting data")
]

for db_name, description in databases_to_create:
    # Drop if exists
    spark.sql(f"DROP DATABASE IF EXISTS {db_name} CASCADE")
    
    # Create database
    spark.sql(f"""
        CREATE DATABASE IF NOT EXISTS {db_name}
        COMMENT '{description}'
    """)
    print(f"✅ Created database: {db_name}")

print("\n📋 All databases:")
spark.sql("SHOW DATABASES").show()


📊 B. CREATE DATABASES
--------------------------------------------------------------------------------


26/01/11 16:00:58 WARN ObjectStore: Failed to get database hr_database, returning NoSuchObjectException
26/01/11 16:00:58 WARN ObjectStore: Failed to get database hr_database, returning NoSuchObjectException
26/01/11 16:00:58 WARN ObjectStore: Failed to get database hr_database, returning NoSuchObjectException
26/01/11 16:00:58 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
26/01/11 16:00:58 WARN ObjectStore: Failed to get database hr_database, returning NoSuchObjectException


✅ Created database: hr_database
✅ Created database: sales_database


26/01/11 16:00:58 WARN ObjectStore: Failed to get database sales_database, returning NoSuchObjectException
26/01/11 16:00:58 WARN ObjectStore: Failed to get database sales_database, returning NoSuchObjectException
26/01/11 16:00:58 WARN ObjectStore: Failed to get database sales_database, returning NoSuchObjectException
26/01/11 16:00:58 WARN ObjectStore: Failed to get database sales_database, returning NoSuchObjectException
26/01/11 16:00:58 WARN ObjectStore: Failed to get database analytics_database, returning NoSuchObjectException
26/01/11 16:00:58 WARN ObjectStore: Failed to get database analytics_database, returning NoSuchObjectException
26/01/11 16:00:58 WARN ObjectStore: Failed to get database analytics_database, returning NoSuchObjectException
26/01/11 16:00:58 WARN ObjectStore: Failed to get database analytics_database, returning NoSuchObjectException


✅ Created database: analytics_database

📋 All databases:
+------------------+
|         namespace|
+------------------+
|analytics_database|
|           default|
|       hr_database|
|    sales_database|
+------------------+
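
The cell above interpolates database names and comments directly into SQL strings. A minimal sketch of guarding that interpolation follows; the identifier rule (letters, digits, underscores) and the helper name `create_database_sql` are assumptions chosen for illustration, not Spark requirements.

```python
import re

# Assumption: only plain identifiers are accepted as database names.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def create_database_sql(name: str, comment: str) -> str:
    """Build a CREATE DATABASE statement, rejecting names that are not plain identifiers."""
    if not _IDENTIFIER.match(name):
        raise ValueError(f"unsafe database name: {name!r}")
    # Escape backslashes and single quotes so the comment stays inside the SQL string literal.
    escaped = comment.replace("\\", "\\\\").replace("'", "\\'")
    return f"CREATE DATABASE IF NOT EXISTS {name} COMMENT '{escaped}'"


print(create_database_sql("hr_database", "Human Resources data"))
# With a live session: spark.sql(create_database_sql("hr_database", "Human Resources data"))
```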



In [5]:
# C. Switch database
print("\n📊 C. SWITCH DATABASE")
print("-" * 80)

print(f"Current database: {spark.catalog.currentDatabase()}")

# Switch to hr_database
spark.catalog.setCurrentDatabase("hr_database")
print(f"Switched to: {spark.catalog.currentDatabase()}")

# Using SQL
spark.sql("USE sales_database")
print(f"Switched to: {spark.catalog.currentDatabase()}")

# Switch back to default
spark.catalog.setCurrentDatabase("default")
print(f"Switched to: {spark.catalog.currentDatabase()}")


📊 C. SWITCH DATABASE
--------------------------------------------------------------------------------
Current database: default
Switched to: hr_database
Switched to: sales_database
Switched to: default
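
Switching databases changes session-wide state, so it is easy to forget to switch back. One possible pattern, sketched here under the assumption that a session object exposing `catalog.currentDatabase()` and `catalog.setCurrentDatabase()` is passed in, is a context manager that restores the previous database; `use_database` is a hypothetical helper name.

```python
from contextlib import contextmanager


@contextmanager
def use_database(spark, name):
    """Temporarily switch the current database, restoring the previous one on exit."""
    previous = spark.catalog.currentDatabase()
    spark.catalog.setCurrentDatabase(name)
    try:
        yield
    finally:
        # Runs even if the body raises, so the session is not left in `name`.
        spark.catalog.setCurrentDatabase(previous)


# Usage with a live session (assumed to be named `spark`):
# with use_database(spark, "hr_database"):
#     spark.sql("SHOW TABLES").show()
```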


In [6]:
# D. Database properties
print("\n📊 D. DATABASE PROPERTIES")
print("-" * 80)

# Describe database
spark.sql("DESCRIBE DATABASE hr_database").show(truncate=False)

# Extended info
spark.sql("DESCRIBE DATABASE EXTENDED hr_database").show(truncate=False)


📊 D. DATABASE PROPERTIES
--------------------------------------------------------------------------------
+--------------+--------------------------------------------------------------+
|info_name     |info_value                                                    |
+--------------+--------------------------------------------------------------+
|Catalog Name  |spark_catalog                                                 |
|Namespace Name|hr_database                                                   |
|Comment       |Human Resources data                                          |
|Location      |file:/opt/spark-notebooks/day05/spark-warehouse/hr_database.db|
|Owner         |spark                                                         |
+--------------+--------------------------------------------------------------+

+--------------+--------------------------------------------------------------+
|info_name     |info_value                                                    |
+--------

In [7]:
# E. Check if database exists
print("\n📊 E. CHECK IF DATABASE EXISTS")
print("-" * 80)

databases_to_check = ["hr_database", "sales_database", "nonexistent_db"]

for db_name in databases_to_check:
    exists = spark.catalog.databaseExists(db_name)
    status = "✅ EXISTS" if exists else "❌ NOT FOUND"
    print(f"   {db_name:20s} - {status}")


📊 E. CHECK IF DATABASE EXISTS
--------------------------------------------------------------------------------
   hr_database          - ✅ EXISTS
   sales_database       - ✅ EXISTS
   nonexistent_db       - ❌ NOT FOUND


26/01/11 16:00:59 WARN ObjectStore: Failed to get database nonexistent_db, returning NoSuchObjectException
26/01/11 16:00:59 WARN ObjectStore: Failed to get database nonexistent_db, returning NoSuchObjectException


---

## 📋 **3. TABLE MANAGEMENT**

In [8]:
print("="*80)
print("📋 3. TABLE MANAGEMENT")
print("="*80)

# Create sample data
print("\n🔹 Creating sample data...")

employees_data = []
for i in range(1, 101):
    employees_data.append((
        f"EMP{i:04d}",
        f"Employee {i}",
        random.randint(22, 60),
        random.choice(["Engineering", "Sales", "Marketing", "HR", "Finance"]),
        random.randint(50000, 120000),
        random.choice(["Active", "Inactive"]),
        (datetime(2020, 1, 1) + timedelta(days=random.randint(0, 1460))).strftime("%Y-%m-%d")
    ))

employees = spark.createDataFrame(employees_data,
    ["employee_id", "name", "age", "department", "salary", "status", "hire_date"])

print(f"✅ Created DataFrame with {employees.count()} rows")
employees.show(5)

📋 3. TABLE MANAGEMENT

🔹 Creating sample data...


                                                                                

✅ Created DataFrame with 100 rows
+-----------+----------+---+----------+------+--------+----------+
|employee_id|      name|age|department|salary|  status| hire_date|
+-----------+----------+---+----------+------+--------+----------+
|    EMP0001|Employee 1| 33|     Sales|112953|  Active|2021-08-17|
|    EMP0002|Employee 2| 43|        HR|103986|Inactive|2020-12-22|
|    EMP0003|Employee 3| 33|     Sales| 84113|Inactive|2021-08-13|
|    EMP0004|Employee 4| 48|   Finance|115490|  Active|2020-01-08|
|    EMP0005|Employee 5| 59|        HR| 82896|Inactive|2022-05-08|
+-----------+----------+---+----------+------+--------+----------+
only showing top 5 rows



In [10]:
# A. Create tables
print("\n📊 A. CREATE TABLES")
print("-" * 80)

# Method 1: Create temporary view
employees.createOrReplaceTempView("employees_temp")
print("✅ Created temporary view: employees_temp")

# Method 2: Create global temporary view
employees.createOrReplaceGlobalTempView("employees_global")
print("✅ Created global temporary view: employees_global")
print("   Access via: global_temp.employees_global")

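# NOTE: Methods 3 and 4 are commented out in this run, so hr_database.employees_persistent
# and hr_database.employees_sql are never created. Later cells that reference them raise
# TABLE_OR_VIEW_NOT_FOUND until these blocks are uncommented and executed.
# # Method 3: Save as table (persistent)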
# spark.sql("USE hr_database")
# employees.write.mode("overwrite").saveAsTable("employees_persistent")
# print("✅ Created persistent table: hr_database.employees_persistent")

# # Method 4: Create table using SQL
# spark.sql("""
#     CREATE TABLE IF NOT EXISTS hr_database.employees_sql (
#         employee_id STRING,
#         name STRING,
#         age INT,
#         department STRING,
#         salary INT,
#         status STRING,
#         hire_date STRING
#     )
#     USING parquet
# """)
# print("✅ Created table using SQL: hr_database.employees_sql")

# # Insert data
# spark.sql("""
#     INSERT INTO hr_database.employees_sql
#     SELECT * FROM employees_temp
# """)
# print("✅ Inserted data into employees_sql")


📊 A. CREATE TABLES
--------------------------------------------------------------------------------
✅ Created temporary view: employees_temp
✅ Created global temporary view: employees_global
   Access via: global_temp.employees_global


In [11]:
# B. List tables
print("\n📊 B. LIST TABLES")
print("-" * 80)

# List tables in hr_database (temporary views are listed as well)
print("\nTables in hr_database:")
tables = spark.catalog.listTables("hr_database")
for table in tables:
    print(f"\n   📄 {table.name}")
    print(f"      Database: {table.database}")
    print(f"      Type: {table.tableType}")
    print(f"      Is Temporary: {table.isTemporary}")

# Using SQL
print("\nUsing SQL:")
spark.sql("SHOW TABLES IN hr_database").show()

# List all tables (including temp views)
print("\nAll tables (including temp views):")
all_tables = spark.catalog.listTables()
for table in all_tables:
    print(f"   {table.name:30s} - {table.tableType:15s} - Temp: {table.isTemporary}")


📊 B. LIST TABLES
--------------------------------------------------------------------------------

Tables in hr_database:


                                                                                


   📄 employees_temp
      Database: None
      Type: TEMPORARY
      Is Temporary: True

Using SQL:
+---------+--------------+-----------+
|namespace|     tableName|isTemporary|
+---------+--------------+-----------+
|         |employees_temp|       true|
+---------+--------------+-----------+


All tables (including temp views):
   employees_temp                 - TEMPORARY       - Temp: True


In [12]:
# C. Check if table exists
print("\n📊 C. CHECK IF TABLE EXISTS")
print("-" * 80)

tables_to_check = [
    "employees_temp",
    "employees_persistent",
    "hr_database.employees_sql",
    "nonexistent_table"
]

for table_name in tables_to_check:
    exists = spark.catalog.tableExists(table_name)
    status = "✅ EXISTS" if exists else "❌ NOT FOUND"
    print(f"   {table_name:35s} - {status}")


📊 C. CHECK IF TABLE EXISTS
--------------------------------------------------------------------------------
   employees_temp                      - ✅ EXISTS
   employees_persistent                - ❌ NOT FOUND
   hr_database.employees_sql           - ❌ NOT FOUND
   nonexistent_table                   - ❌ NOT FOUND
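
Because later cells in this run fail with `TABLE_OR_VIEW_NOT_FOUND` for exactly the tables reported missing above, an existence check can be used as a guard. The sketch below assumes a session object exposing `catalog.tableExists()`; `missing_tables` is a hypothetical helper name.

```python
def missing_tables(spark, table_names):
    """Return the names that spark.catalog.tableExists() reports as absent."""
    return [name for name in table_names if not spark.catalog.tableExists(name)]


# Usage with a live session (assumed to be named `spark`):
# missing = missing_tables(spark, ["hr_database.employees_persistent"])
# if missing:
#     print(f"Skipping: create these tables first: {missing}")
```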


In [16]:
# D. Table properties
print("\n📊 D. TABLE PROPERTIES")
print("-" * 80)

# Describe table
print("\nDESCRIBE TABLE:")
spark.sql("DESCRIBE TABLE hr_database.employees_persistent").show(truncate=False)

# Extended info
print("\nDESCRIBE TABLE EXTENDED:")
spark.sql("DESCRIBE TABLE EXTENDED hr_database.employees_persistent").show(truncate=False)

# Formatted
print("\nDESCRIBE TABLE FORMATTED:")
spark.sql("DESCRIBE FORMATTED hr_database.employees_persistent").show(100, truncate=False)

In [17]:
# E. Alter table
print("\n📊 E. ALTER TABLE")
print("-" * 80)

# Add column
spark.sql("""
    ALTER TABLE hr_database.employees_sql
    ADD COLUMNS (email STRING COMMENT 'Employee email')
""")
print("✅ Added column: email")

# Rename table
spark.sql("""
    ALTER TABLE hr_database.employees_sql
    RENAME TO hr_database.employees_renamed
""")
print("✅ Renamed table: employees_sql → employees_renamed")

# Set table properties
spark.sql("""
    ALTER TABLE hr_database.employees_renamed
    SET TBLPROPERTIES ('comment' = 'Employee master data')
""")
print("✅ Set table properties")

# Verify changes
print("\nVerify changes:")
spark.sql("DESCRIBE TABLE hr_database.employees_renamed").show(truncate=False)


📊 E. ALTER TABLE
--------------------------------------------------------------------------------


AnalysisException: [TABLE_OR_VIEW_NOT_FOUND] The table or view `hr_database`.`employees_sql` cannot be found. Verify the spelling and correctness of the schema and catalog.
If you did not qualify the name with a schema, verify the current_schema() output, or qualify the name with the correct schema and catalog.
To tolerate the error on drop use DROP VIEW IF EXISTS or DROP TABLE IF EXISTS.; line 2 pos 16;
'AddColumns [QualifiedColType(None,email,StringType,true,Some(Employee email),None,None)]
+- 'UnresolvedTable [hr_database, employees_sql], ALTER TABLE ... ADD COLUMNS


In [None]:
# F. Drop tables
print("\n📊 F. DROP TABLES")
print("-" * 80)

# Drop temporary view
spark.catalog.dropTempView("employees_temp")
print("✅ Dropped temporary view: employees_temp")

# Drop global temporary view
spark.catalog.dropGlobalTempView("employees_global")
print("✅ Dropped global temporary view: employees_global")

# Drop table using SQL
spark.sql("DROP TABLE IF EXISTS hr_database.employees_renamed")
print("✅ Dropped table: hr_database.employees_renamed")

print("\n📋 Remaining tables:")
spark.sql("SHOW TABLES IN hr_database").show()

---

## 🔍 **4. SCHEMA OPERATIONS**

In [14]:
print("="*80)
print("🔍 4. SCHEMA OPERATIONS")
print("="*80)

# A. List columns
print("\n📊 A. LIST COLUMNS")
print("-" * 80)

columns = spark.catalog.listColumns("hr_database.employees_persistent")
print(f"\nTable: hr_database.employees_persistent")
print(f"Columns: {len(columns)}\n")

# Loop variable is `column`, not `col`, so the imported pyspark.sql.functions.col is not shadowed
for column in columns:
    print(f"   📌 {column.name}")
    print(f"      Type: {column.dataType}")
    print(f"      Nullable: {column.nullable}")
    print(f"      Description: {column.description}")
    print(f"      Is Partition: {column.isPartition}")
    print(f"      Is Bucket: {column.isBucket}")
    print()

🔍 4. SCHEMA OPERATIONS

📊 A. LIST COLUMNS
--------------------------------------------------------------------------------


AnalysisException: [TABLE_OR_VIEW_NOT_FOUND] The table or view `hr_database`.`employees_persistent` cannot be found. Verify the spelling and correctness of the schema and catalog.
If you did not qualify the name with a schema, verify the current_schema() output, or qualify the name with the correct schema and catalog.
To tolerate the error on drop use DROP VIEW IF EXISTS or DROP TABLE IF EXISTS.;
'UnresolvedTableOrView [hr_database, employees_persistent], Catalog.listColumns, true


In [15]:
# B. Get schema
print("\n📊 B. GET SCHEMA")
print("-" * 80)

df = spark.table("hr_database.employees_persistent")

print("\nMethod 1: printSchema()")
df.printSchema()

print("\nMethod 2: schema")
print(df.schema)

print("\nMethod 3: dtypes")
for col_name, col_type in df.dtypes:
    print(f"   {col_name:20s} - {col_type}")

print("\nMethod 4: columns")
print(f"   Columns: {df.columns}")


📊 B. GET SCHEMA
--------------------------------------------------------------------------------


AnalysisException: [TABLE_OR_VIEW_NOT_FOUND] The table or view `hr_database`.`employees_persistent` cannot be found. Verify the spelling and correctness of the schema and catalog.
If you did not qualify the name with a schema, verify the current_schema() output, or qualify the name with the correct schema and catalog.
To tolerate the error on drop use DROP VIEW IF EXISTS or DROP TABLE IF EXISTS.;
'UnresolvedRelation [hr_database, employees_persistent], [], false


In [None]:
# C. Schema as JSON
print("\n📊 C. SCHEMA AS JSON")
print("-" * 80)

schema_json = df.schema.json()
print("\nSchema JSON:")
print(schema_json)

# Parse back to schema (json.loads, not eval: JSON literals such as `true` are not valid Python)
import json
from pyspark.sql.types import StructType
parsed_schema = StructType.fromJson(json.loads(schema_json))
print("\n✅ Schema can be serialized and deserialized")
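
The round trip can be illustrated without a cluster. The sketch below uses a hand-written JSON string in the shape produced by `df.schema.json()` (it is not output from this notebook) and shows why `json.loads` is the appropriate parser: JSON booleans such as `true` are not Python names, so `eval` raises `NameError`.

```python
import json

# Hand-written example in the shape produced by df.schema.json(); an assumption,
# not captured output.
schema_json = (
    '{"type":"struct","fields":['
    '{"name":"age","type":"long","nullable":true,"metadata":{}}]}'
)

schema_dict = json.loads(schema_json)  # JSON `true` becomes Python `True`
print(schema_dict["fields"][0]["nullable"])

try:
    eval(schema_json)  # `true` is not a Python name, so this raises NameError
except NameError as exc:
    print(f"eval failed: {exc}")

# With PySpark available: StructType.fromJson(schema_dict)
```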

In [None]:
# D. Schema evolution example
print("\n📊 D. SCHEMA EVOLUTION EXAMPLE")
print("-" * 80)

# Original schema
print("\n1. Original schema:")
df.printSchema()

# Add new column
df_v2 = df.withColumn("bonus", (col("salary") * 0.1).cast("int"))
print("\n2. After adding 'bonus' column:")
df_v2.printSchema()

# Change column type
df_v3 = df_v2.withColumn("age", col("age").cast("string"))
print("\n3. After changing 'age' type to string:")
df_v3.printSchema()

# Drop column
df_v4 = df_v3.drop("status")
print("\n4. After dropping 'status' column:")
df_v4.printSchema()

print("""
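💡 Schema Evolution:
   - Add columns: withColumn()
   - Change types: cast()
   - Drop columns: drop()
   - Rename columns: withColumnRenamed()
   
   These DataFrame operations do not modify the stored table;
   the new schema takes effect only when the result is written back.
   
   ⚠️  Be careful with schema changes in production!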
""")

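The schema changes walked through above can be summarized programmatically. The sketch below compares two `df.dtypes`-style lists of `(column, type)` pairs; the sample values mirror the add, retype, and drop steps from the cell, and treating dropped or retyped columns as breaking is an assumption, not a Spark rule. `classify_schema_change` is a hypothetical helper name.

```python
def classify_schema_change(old_dtypes, new_dtypes):
    """Compare two df.dtypes-style lists of (column, type) pairs."""
    old, new = dict(old_dtypes), dict(new_dtypes)
    return {
        "added": sorted(set(new) - set(old)),
        "dropped": sorted(set(old) - set(new)),
        "retyped": sorted(c for c in set(old) & set(new) if old[c] != new[c]),
    }


v1 = [("age", "bigint"), ("salary", "bigint"), ("status", "string")]
v4 = [("age", "string"), ("salary", "bigint"), ("bonus", "int")]
changes = classify_schema_change(v1, v4)
print(changes)
# Assumption: dropped or retyped columns are treated as breaking for downstream readers.
print("breaking" if changes["dropped"] or changes["retyped"] else "additive")
```
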
---

## 📊 **5. METADATA QUERIES**

In [None]:
print("="*80)
print("📊 5. METADATA QUERIES")
print("="*80)

# A. Table statistics
print("\n📊 A. TABLE STATISTICS")
print("-" * 80)

# Analyze table
spark.sql("ANALYZE TABLE hr_database.employees_persistent COMPUTE STATISTICS")
print("✅ Computed table statistics")

# Show statistics
spark.sql("DESCRIBE EXTENDED hr_database.employees_persistent").show(100, truncate=False)

In [None]:
# B. Column statistics
print("\n📊 B. COLUMN STATISTICS")
print("-" * 80)

# Analyze columns
spark.sql("""
    ANALYZE TABLE hr_database.employees_persistent
    COMPUTE STATISTICS FOR COLUMNS salary, age, department
""")
print("✅ Computed column statistics")

# Show column stats
spark.sql("DESCRIBE EXTENDED hr_database.employees_persistent salary").show(truncate=False)

In [None]:
# C. List functions
print("\n📊 C. LIST FUNCTIONS")
print("-" * 80)

functions = spark.catalog.listFunctions()
print(f"\nTotal functions: {len(functions)}")

# Show first 20
print("\nFirst 20 functions:")
for func in functions[:20]:
    print(f"   {func.name:30s} - {func.description}")

# Filter by name
print("\nString functions (contains 'string'):")
string_funcs = [f for f in functions if 'string' in f.name.lower()]
for func in string_funcs[:10]:
    print(f"   {func.name}")
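
A substring test such as `'string' in name` can miss most of a function family. As one alternative, the sketch below filters by a shell-style pattern using the standard library's `fnmatch`. `FunctionInfo` is a stand-in for the entries returned by `spark.catalog.listFunctions()`, and `find_functions` is a hypothetical helper name.

```python
from collections import namedtuple
from fnmatch import fnmatch

# Stand-in for entries returned by spark.catalog.listFunctions()
# (assumption: only the `name` field is needed here).
FunctionInfo = namedtuple("FunctionInfo", ["name"])


def find_functions(functions, pattern):
    """Return sorted function names matching a shell-style pattern, case-insensitively."""
    return sorted(f.name for f in functions if fnmatch(f.name.lower(), pattern.lower()))


sample = [FunctionInfo("date_format"), FunctionInfo("date_add"), FunctionInfo("concat")]
print(find_functions(sample, "date_*"))
# With a live session: find_functions(spark.catalog.listFunctions(), "date_*")
```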

In [None]:
# D. Function details
print("\n📊 D. FUNCTION DETAILS")
print("-" * 80)

# Describe function
functions_to_describe = ['avg', 'concat', 'date_format', 'explode']

for func_name in functions_to_describe:
    print(f"\n{'='*80}")
    print(f"Function: {func_name}")
    print('='*80)
    spark.sql(f"DESCRIBE FUNCTION {func_name}").show(truncate=False)
    
    print(f"\nExtended info:")
    spark.sql(f"DESCRIBE FUNCTION EXTENDED {func_name}").show(truncate=False)

In [None]:
# E. Custom metadata query
print("\n📊 E. CUSTOM METADATA QUERY")
print("-" * 80)

# Create metadata summary
metadata_summary = []

for db in spark.catalog.listDatabases():
    tables = spark.catalog.listTables(db.name)
    for table in tables:
        if not table.isTemporary:
            columns = spark.catalog.listColumns(f"{db.name}.{table.name}")
            metadata_summary.append((
                db.name,
                table.name,
                table.tableType,
                len(columns)
            ))

metadata_df = spark.createDataFrame(metadata_summary,
    ["database", "table", "type", "column_count"])

print("\n📋 Metadata Summary:")
metadata_df.show(truncate=False)

print("\n📊 Statistics:")
metadata_df.groupBy("database").agg(
    F.count("table").alias("table_count"),
    F.sum("column_count").alias("total_columns")
).show()

---

## 💾 **6. CACHE MANAGEMENT**

In [None]:
print("="*80)
print("💾 6. CACHE MANAGEMENT")
print("="*80)

# A. Cache table
print("\n📊 A. CACHE TABLE")
print("-" * 80)

table_name = "hr_database.employees_persistent"

# Check if cached
print(f"Is cached before: {spark.catalog.isCached(table_name)}")

# Cache table
spark.catalog.cacheTable(table_name)
print(f"✅ Cached table: {table_name}")

# Check again
print(f"Is cached after: {spark.catalog.isCached(table_name)}")

# Using SQL
spark.sql(f"CACHE TABLE {table_name}")
print(f"✅ Cached using SQL")

In [None]:
# B. Uncache table
print("\n📊 B. UNCACHE TABLE")
print("-" * 80)

# Uncache
spark.catalog.uncacheTable(table_name)
print(f"✅ Uncached table: {table_name}")
print(f"Is cached: {spark.catalog.isCached(table_name)}")

# Using SQL
spark.sql(f"UNCACHE TABLE {table_name}")
print(f"✅ Uncached using SQL")

In [None]:
# C. Clear all cache
print("\n📊 C. CLEAR ALL CACHE")
print("-" * 80)

# Cache a table so there is something to clear
spark.catalog.cacheTable(table_name)
print(f"✅ Cached: {table_name}")

# Clear all
spark.catalog.clearCache()
print("✅ Cleared all cache")
print(f"Is cached: {spark.catalog.isCached(table_name)}")

In [None]:
# D. Refresh table
print("\n📊 D. REFRESH TABLE")
print("-" * 80)

# Refresh metadata
spark.catalog.refreshTable(table_name)
print(f"✅ Refreshed table: {table_name}")

# Using SQL
spark.sql(f"REFRESH TABLE {table_name}")
print(f"✅ Refreshed using SQL")

print("""
💡 When to refresh:
   - After external changes to data files
   - After partition changes
   - To update metadata cache
""")

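The cache and uncache calls in this section come in pairs, and the uncache step is easy to skip when a cell fails midway. A minimal sketch of pairing them automatically follows, assuming a session object exposing `catalog.cacheTable()` and `catalog.uncacheTable()`; `cached_table` is a hypothetical helper name.

```python
from contextlib import contextmanager


@contextmanager
def cached_table(spark, table_name):
    """Cache a table for the duration of a block, then release it."""
    spark.catalog.cacheTable(table_name)
    try:
        yield
    finally:
        # Runs even if the body raises, so the cached data is always released.
        spark.catalog.uncacheTable(table_name)


# Usage with a live session (assumed to be named `spark`):
# with cached_table(spark, "hr_database.employees_persistent"):
#     spark.table("hr_database.employees_persistent").count()
```
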
---

## 🎯 **7. BEST PRACTICES**

In [None]:
print("="*80)
print("🎯 7. BEST PRACTICES")
print("="*80)

print("""
💡 CATALOG & METADATA BEST PRACTICES:

1. DATABASE ORGANIZATION
   ✅ Use separate databases for different domains
   ✅ Name databases clearly (e.g., hr_prod, sales_staging)
   ✅ Add descriptions to databases
   ✅ Use consistent naming conventions

2. TABLE MANAGEMENT
   ✅ Use meaningful table names
   ✅ Add comments to tables and columns
   ✅ Document schema changes
   ✅ Use temporary views for intermediate results
   ✅ Drop temporary views when done

3. SCHEMA MANAGEMENT
   ✅ Define explicit schemas (don't rely on inference)
   ✅ Use appropriate data types
   ✅ Document column meanings
   ✅ Plan for schema evolution
   ✅ Version your schemas

4. METADATA QUERIES
   ✅ Compute statistics regularly
   ✅ Use ANALYZE TABLE for better query planning
   ✅ Monitor table sizes
   ✅ Track schema changes

5. CACHE MANAGEMENT
   ✅ Cache frequently accessed tables
   ✅ Uncache when done
   ✅ Monitor cache memory usage
   ✅ Use LAZY caching for large tables
   ✅ Clear cache periodically

6. PERFORMANCE
   ✅ Use partitioning for large tables
   ✅ Compute statistics for cost-based optimization
   ✅ Use appropriate file formats (Parquet, ORC)
   ✅ Refresh metadata after external changes

7. GOVERNANCE
   ✅ Document all tables and columns
   ✅ Track data lineage
   ✅ Implement access controls
   ✅ Monitor metadata changes
   ✅ Regular metadata audits

8. COMMON MISTAKES TO AVOID
   ❌ Not dropping temporary views
   ❌ Forgetting to uncache tables
   ❌ Not computing statistics
   ❌ Using SELECT * in production
   ❌ Not documenting schema changes
   ❌ Inconsistent naming conventions
   ❌ Not refreshing metadata after external changes
""")

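The "LAZY caching" recommendation above corresponds to Spark SQL's `CACHE LAZY TABLE` statement, which defers materialization until the table is first used. The sketch below only builds the statement text; `cache_table_sql` is a hypothetical helper, and the `MEMORY_AND_DISK` storage level is an illustrative choice rather than a recommendation from this lesson.

```python
def cache_table_sql(table_name: str, lazy: bool = True, storage_level: str = "") -> str:
    """Build a CACHE TABLE statement, optionally lazy and with a storage level."""
    parts = ["CACHE"]
    if lazy:
        parts.append("LAZY")
    parts += ["TABLE", table_name]
    if storage_level:
        parts.append(f"OPTIONS ('storageLevel' '{storage_level}')")
    return " ".join(parts)


print(cache_table_sql("hr_database.employees_persistent"))
print(cache_table_sql("hr_database.employees_persistent", lazy=False,
                      storage_level="MEMORY_AND_DISK"))
# With a live session: spark.sql(cache_table_sql("hr_database.employees_persistent"))
```
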
---

## 📚 **8. PRACTICAL EXAMPLES**

In [None]:
print("="*80)
print("📚 8. PRACTICAL EXAMPLES")
print("="*80)

# Example 1: Data catalog report
print("\n📊 EXAMPLE 1: DATA CATALOG REPORT")
print("-" * 80)

def generate_catalog_report():
    """Generate comprehensive catalog report"""
    
    report = []
    
    for db in spark.catalog.listDatabases():
        db_name = db.name
        tables = spark.catalog.listTables(db_name)
        
        for table in tables:
            if not table.isTemporary:
                table_name = f"{db_name}.{table.name}"
                
                # Get columns
                columns = spark.catalog.listColumns(table_name)
                
                # Get row count (if possible)
                try:
                    row_count = spark.table(table_name).count()
                except Exception:  # count failed; -1 marks the row count as unknown
                    row_count = -1
                
                report.append({
                    'database': db_name,
                    'table': table.name,
                    'type': table.tableType,
                    'columns': len(columns),
                    'rows': row_count,
                    'cached': spark.catalog.isCached(table_name)
                })
    
    return spark.createDataFrame(report)

catalog_report = generate_catalog_report()
print("\n📋 Catalog Report:")
catalog_report.show(truncate=False)

print("\n📊 Summary by Database:")
catalog_report.groupBy("database").agg(
    F.count("table").alias("tables"),
    F.sum("columns").alias("total_columns"),
    F.sum("rows").alias("total_rows")
).show()

In [None]:
# Example 2: Schema comparison
print("\n📊 EXAMPLE 2: SCHEMA COMPARISON")
print("-" * 80)

def compare_schemas(table1, table2):
    """Compare schemas of two tables"""
    
    cols1 = {col.name: col.dataType for col in spark.catalog.listColumns(table1)}
    cols2 = {col.name: col.dataType for col in spark.catalog.listColumns(table2)}
    
    # Columns only in table1
    only_in_1 = set(cols1.keys()) - set(cols2.keys())
    # Columns only in table2
    only_in_2 = set(cols2.keys()) - set(cols1.keys())
    # Common columns
    common = set(cols1.keys()) & set(cols2.keys())
    # Type differences
    type_diffs = {col: (cols1[col], cols2[col]) 
                  for col in common if cols1[col] != cols2[col]}
    
    print(f"\nComparing: {table1} vs {table2}")
    print(f"\nColumns only in {table1}: {only_in_1}")
    print(f"Columns only in {table2}: {only_in_2}")
    print(f"Common columns: {len(common)}")
    print(f"Type differences: {type_diffs}")
    
    return {
        'only_in_1': only_in_1,
        'only_in_2': only_in_2,
        'common': common,
        'type_diffs': type_diffs
    }

# Create two similar tables for comparison
df1 = employees.select("employee_id", "name", "age", "department", "salary")
df1.createOrReplaceTempView("employees_v1")

df2 = employees.select("employee_id", "name", "department", "salary", "status")
df2.createOrReplaceTempView("employees_v2")

comparison = compare_schemas("employees_v1", "employees_v2")

print("""
💡 Use Case:
   - Compare dev vs prod schemas
   - Validate schema migrations
   - Track schema evolution
""")

In [None]:
# Example 3: Metadata-driven processing
print("\n📊 EXAMPLE 3: METADATA-DRIVEN PROCESSING")
print("-" * 80)

def process_all_tables_in_database(database_name, operation):
    """Process all tables in a database using metadata"""
    
    tables = spark.catalog.listTables(database_name)
    results = []
    
    for table in tables:
        if not table.isTemporary:
            table_name = f"{database_name}.{table.name}"
            print(f"\nProcessing: {table_name}")
            
            try:
                df = spark.table(table_name)
                result = operation(df, table_name)
                # Store the result as a string so the column type matches the error messages
                results.append((table_name, "SUCCESS", str(result)))
            except Exception as e:
                results.append((table_name, "FAILED", str(e)))
    
    return results

# Example operation: Count rows
def count_rows(df, table_name):
    count = df.count()
    print(f"   Rows: {count:,}")
    return count

# Process all tables
results = process_all_tables_in_database("hr_database", count_rows)

print("\n📊 Processing Results:")
results_df = spark.createDataFrame(results, ["table", "status", "result"])
results_df.show(truncate=False)

print("""
💡 Use Case:
   - Batch processing all tables
   - Data quality checks
   - Automated reporting
   - Metadata-driven ETL
""")

---

## 🎓 **KEY TAKEAWAYS**

### **✅ What You Learned:**

1. **Spark Catalog API**
   - Access metadata via `spark.catalog`
   - List databases, tables, columns, functions
   - Check existence, cache status

2. **Database Management**
   - CREATE/DROP databases
   - Switch databases
   - Query database properties

3. **Table Management**
   - Create temporary/persistent tables
   - List and describe tables
   - ALTER and DROP tables
   - Table properties and metadata

4. **Schema Operations**
   - List columns and types
   - Get schema information
   - Schema evolution
   - Schema serialization

5. **Metadata Queries**
   - Table and column statistics
   - Function discovery
   - Custom metadata reports

6. **Cache Management**
   - Cache/uncache tables
   - Check cache status
   - Clear all cache
   - Refresh metadata

### **📊 Quick Reference:**

```python
# List databases
spark.catalog.listDatabases()

# List tables
spark.catalog.listTables("database_name")

# List columns
spark.catalog.listColumns("table_name")

# Check existence
spark.catalog.databaseExists("db_name")
spark.catalog.tableExists("table_name")

# Cache management
spark.catalog.cacheTable("table_name")
spark.catalog.isCached("table_name")
spark.catalog.uncacheTable("table_name")
spark.catalog.clearCache()

# Refresh metadata
spark.catalog.refreshTable("table_name")
```

### **🚀 Next:** Day 5 - Lesson 4: SQL Optimization

---

In [None]:
# Cleanup
print("="*80)
print("🧹 CLEANUP")
print("="*80)

# Drop test databases
for db_name in ["hr_database", "sales_database", "analytics_database"]:
    spark.sql(f"DROP DATABASE IF EXISTS {db_name} CASCADE")
    print(f"✅ Dropped database: {db_name}")

# Clear cache
spark.catalog.clearCache()
print("✅ Cleared cache")

# Stop session
spark.stop()

print("\n✅ Spark session stopped")
print("\n🎉 DAY 5 - LESSON 3 COMPLETED!")
print("\n💡 Remember:")
print("   - Catalog API provides programmatic access to metadata")
print("   - Use databases to organize tables")
print("   - Document schemas and tables")
print("   - Compute statistics for better performance")
print("   - Manage cache carefully")
print("\n🔥 Quote: 'Good metadata is the foundation of good data!' 🗄️")