# PySpark: Zero to Hero
## Module 27: Spark SQL, Catalogs, and Query Hints

Spark SQL allows us to query data using standard SQL syntax, providing an alternative to the DataFrame API. This module explores how to use Spark SQL, manage metadata using Catalogs, and optimize queries using SQL Hints.

### Agenda:
1.  **Spark SQL Basics:** Registering DataFrames as Temp Views and querying them with SQL.
2.  **Catalog Implementation:** Understanding `in-memory` vs. `hive` catalog for persisting metadata.
3.  **Persistence:** Creating Managed Tables that survive application restarts.
4.  **SQL Hints:** Using hints like `BROADCAST` and `SHUFFLE_MERGE` directly in SQL queries to influence the join strategy the optimizer picks.

In [None]:
from pyspark.sql import SparkSession

# We start with the "in-memory" catalog, set explicitly here.
# Table metadata registered in this catalog is lost when the application stops.
spark = SparkSession.builder \
    .appName("Spark_SQL_Demo") \
    .master("local[*]") \
    .config("spark.sql.catalogImplementation", "in-memory") \
    .getOrCreate()

print("Spark Session Created with In-Memory Catalog")

In [None]:
# Create sample DataFrame
data = [
    (1, "Alice", 1000, "HR"),
    (2, "Bob", 1200, "Engineering"),
    (3, "Charlie", 1100, "Engineering"),
    (4, "David", 1300, "HR")
]
schema = ["id", "name", "salary", "dept_name"]
emp_df = spark.createDataFrame(data, schema)

# 1. Register as Temporary View
# This makes the dataframe accessible via SQL within this session
emp_df.createOrReplaceTempView("employee")

# 2. Query using Spark SQL
result = spark.sql("""
    SELECT dept_name, AVG(salary) as avg_salary
    FROM employee
    GROUP BY dept_name
""")

print("SQL Query Result:")
result.show()

# 3. Check Catalog
print("Listing Tables in Default Database:")
spark.sql("SHOW TABLES").show()

## 2. Using Hive Catalog for Persistence

The `in-memory` catalog is transient: table definitions disappear when the application stops. To persist table metadata (schema, location) across Spark sessions, we enable Hive support. When no external metastore is configured, Spark creates a local `metastore_db` directory (an embedded Derby database) to store table definitions.

*Note: `spark.sql.catalogImplementation` is a static setting, so it only takes effect when a new session is created. Stopping the session and building a new one, as done below, is usually enough; if the catalog does not switch, restart the kernel and run the Hive-enabled cell first.*

In [None]:
# Stop previous session to switch configuration
spark.stop()

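# Enable Hive support to persist metadata.
# enableHiveSupport() already sets spark.sql.catalogImplementation to "hive";
# the explicit config line below is kept only to make the switch visible.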
spark = SparkSession.builder \
    .appName("Spark_SQL_Hive_Demo") \
    .master("local[*]") \
    .config("spark.sql.catalogImplementation", "hive") \
    .config("spark.sql.warehouse.dir", "spark-warehouse") \
    .enableHiveSupport() \
    .getOrCreate()

print("Spark Session Restarted with Hive Support")

In [None]:
# Re-create DataFrame
emp_df = spark.createDataFrame(data, schema)

# Save as a MANAGED Table
# This saves both data (in warehouse dir) and metadata (in metastore_db)
emp_df.write.mode("overwrite").saveAsTable("employee_managed")

print("Table 'employee_managed' created.")

# Verify it exists in catalog
spark.sql("SHOW TABLES").show()

# Querying the table
spark.sql("SELECT * FROM employee_managed WHERE salary > 1100").show()

## 3. SQL Query Hints

Just like in the DataFrame API, we can guide the optimizer using Hints in SQL.

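*   **Broadcast Hint:** `/*+ BROADCAST(table_name) */` - Asks Spark to broadcast the named table, regardless of the automatic broadcast size threshold.
*   **Merge Join Hint:** `/*+ SHUFFLE_MERGE(table_name) */` - Asks Spark to use a shuffle sort-merge join (SortMergeJoin). Hints are requests: Spark can ignore one when the join cannot support the requested strategy.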

In [None]:
# Create a small department DataFrame for joining
dept_data = [("HR", "Human Resources"), ("Engineering", "Software Engineering")]
dept_df = spark.createDataFrame(dept_data, ["dept_name", "dept_full_name"])
dept_df.createOrReplaceTempView("department")

# 1. Request a broadcast join using a SQL hint
print("Plan with Broadcast Hint:")
spark.sql("""
    SELECT /*+ BROADCAST(d) */ 
        e.name, d.dept_full_name 
    FROM employee_managed e 
    JOIN department d ON e.dept_name = d.dept_name
""").explain()

# 2. Request a sort-merge join using a SQL hint
print("\nPlan with Merge Hint:")
spark.sql("""
    SELECT /*+ SHUFFLE_MERGE(d) */ 
        e.name, d.dept_full_name 
    FROM employee_managed e 
    JOIN department d ON e.dept_name = d.dept_name
""").explain()

## Summary

1.  **Spark SQL:** Provides a declarative way to query DataFrames using SQL.
2.  **Catalog:** 
    *   `in-memory`: Transient (lost on restart).
    *   `hive`: Persistent (saves metadata to metastore).
3.  **Tables:** With the `hive` catalog, `saveAsTable` creates managed tables whose data and metadata can be queried in future sessions.
4.  **Hints:** Use SQL comment syntax `/*+ HINT_NAME(table) */` to request a join strategy; Spark follows the hint when the join supports it.