# 🚀 SPARK SQL BASICS

---

## 📋 **DAY 5 - LESSON 1: SQL BASICS**

### **🎯 OBJECTIVES:**

1. **Spark SQL Overview** - Why use SQL?
2. **Temporary Views** - createOrReplaceTempView
3. **Basic Queries** - SELECT, WHERE, GROUP BY
4. **SQL Functions** - Built-in functions
5. **Subqueries & CTEs** - WITH clause
6. **SQL vs DataFrame API** - Performance comparison

---

## 💡 **WHY USE SPARK SQL?**

### **DataFrame API:**
```python
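from pyspark.sql import functions as F
from pyspark.sql.functions import col, desc

# df is assumed to be an existing DataFrame with age, country and salary columns
df.filter(col("age") > 25) \
  .groupBy("country") \
  .agg(F.avg("salary").alias("avg_salary")) \
  .orderBy(desc("avg_salary"))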
```

### **Spark SQL:**
```sql
SELECT country, AVG(salary) as avg_salary
FROM employees
WHERE age > 25
GROUP BY country
ORDER BY avg_salary DESC
```

### **Advantages:**
- ✅ Easy to read and understand (standard SQL syntax)
- ✅ Familiar to SQL developers
- ✅ Runs through the same Catalyst optimizer as the DataFrame API, so equivalent queries perform the same
- ✅ Integrates with BI tools

---

## 🔧 **SETUP**

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
# Import only the helpers used below; importing sum, max and min from
# pyspark.sql.functions would shadow the Python built-ins of the same name
from pyspark.sql.functions import col, desc
import time
import random
from datetime import datetime, timedelta

spark = SparkSession.builder \
    .appName("SparkSQL_Basics") \
    .master("spark://spark-master:7077") \
    .config("spark.executor.memory", "2g") \
    .config("spark.driver.memory", "1g") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000") \
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin") \
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin123") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .getOrCreate()

print("✅ Spark Session Created")
print(f"Spark Version: {spark.version}")
print(f"Adaptive Query Execution Enabled: {spark.conf.get('spark.sql.adaptive.enabled')}")

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
26/01/11 15:29:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


✅ Spark Session Created
Spark Version: 3.5.1
Adaptive Query Execution Enabled: true


---

## 📊 **1. CREATE SAMPLE DATA**

In [2]:
print("🔹 Generating sample data...")

# Employees data
departments = ["Engineering", "Sales", "Marketing", "HR", "Finance"]
countries = ["USA", "UK", "Germany", "France", "Canada"]
cities = {
    "USA": ["New York", "San Francisco", "Seattle"],
    "UK": ["London", "Manchester"],
    "Germany": ["Berlin", "Munich"],
    "France": ["Paris", "Lyon"],
    "Canada": ["Toronto", "Vancouver"]
}

employees_data = []
for i in range(1, 1001):
    country = random.choice(countries)
    city = random.choice(cities[country])
    dept = random.choice(departments)
    
    # Salary based on department
    base_salary = {
        "Engineering": 80000,
        "Sales": 70000,
        "Marketing": 65000,
        "HR": 60000,
        "Finance": 75000
    }[dept]
    
    salary = base_salary + random.randint(-10000, 30000)
    
    employees_data.append((
        f"EMP{i:04d}",
        f"Employee {i}",
        random.randint(22, 60),
        dept,
        country,
        city,
        salary,
        random.choice(["Active", "Active", "Active", "Inactive"]),  # 75% active
        (datetime(2020, 1, 1) + timedelta(days=random.randint(0, 1460))).strftime("%Y-%m-%d")
    ))

employees = spark.createDataFrame(employees_data,
    ["employee_id", "name", "age", "department", "country", "city", "salary", "status", "hire_date"])

print(f"✅ Generated {employees.count():,} employees")
employees.show(5)

# Projects data
projects_data = []
for i in range(1, 51):
    dept = random.choice(departments)
    budget = random.randint(50000, 500000)
    
    projects_data.append((
        f"PROJ{i:03d}",
        f"Project {i}",
        dept,
        budget,
        random.choice(["Planning", "In Progress", "Completed", "On Hold"])
    ))

projects = spark.createDataFrame(projects_data,
    ["project_id", "project_name", "department", "budget", "status"])

print(f"\n✅ Generated {projects.count():,} projects")
projects.show(5)

🔹 Generating sample data...

✅ Generated 1,000 employees
+-----------+----------+---+-----------+-------+-------------+------+--------+----------+
|employee_id|      name|age| department|country|         city|salary|  status| hire_date|
+-----------+----------+---+-----------+-------+-------------+------+--------+----------+
|    EMP0001|Employee 1| 59|  Marketing| France|         Lyon| 80535|  Active|2020-06-29|
|    EMP0002|Employee 2| 30|         HR| Canada|      Toronto| 83090|  Active|2021-10-04|
|    EMP0003|Employee 3| 40|Engineering|    USA|San Francisco| 78789|Inactive|2022-04-07|
|    EMP0004|Employee 4| 59|      Sales| Canada|      Toronto| 91691|  Active|2022-11-07|
|    EMP0005|Employee 5| 50|      Sales|     UK|       London| 83566|  Active|2022-07-30|
+-----------+----------+---+-----------+-------+-------------+------+--------+----------+
only showing top 5 rows


✅ Generated 50 projects
+----------+------------+----------+------+-----------+
|project_id|project_name|department|budget|     stat
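
The generator cell above draws from the global `random` module without a seed, so every run produces different rows and the outputs shown in this lesson will not reproduce exactly. As a minimal sketch, assuming a fixed seed is acceptable for teaching purposes, the helper below (`make_employees` is a hypothetical name, and only three of the columns are generated to keep it short) uses a private `random.Random(seed)` so that the same seed always yields the same rows. The base salaries are copied from the cell above.

```python
import random

# Base salary per department, copied from the data-generation cell
BASE_SALARY = {
    "Engineering": 80000,
    "Sales": 70000,
    "Marketing": 65000,
    "HR": 60000,
    "Finance": 75000,
}


def make_employees(n, seed=42):
    """Return n (employee_id, department, salary) tuples, identical for a given seed."""
    rng = random.Random(seed)  # private generator; the global random state is untouched
    departments = sorted(BASE_SALARY)  # fixed order so a seed always maps to the same picks
    rows = []
    for i in range(1, n + 1):
        dept = rng.choice(departments)
        salary = BASE_SALARY[dept] + rng.randint(-10000, 30000)
        rows.append((f"EMP{i:04d}", dept, salary))
    return rows


rows_a = make_employees(5)
rows_b = make_employees(5)
print(rows_a == rows_b)  # True: same seed, same rows
```

The resulting tuples can be passed to `spark.createDataFrame(rows, ["employee_id", "department", "salary"])` in the same way as `employees_data`.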

---

## 📋 **2. CREATE TEMPORARY VIEWS**

### **What is a temporary view?**
- A virtual table in Spark SQL, backed by a DataFrame
- Exists only within the current SparkSession
- Lets you query the data with SQL

### **Syntax:**
```python
df.createOrReplaceTempView("table_name")
spark.sql("SELECT * FROM table_name")
```
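
When several DataFrames need to be exposed to SQL, the registration step can be wrapped in a small helper. The sketch below is one possible shape, not part of the Spark API: `register_views` is a hypothetical function that calls `createOrReplaceTempView` on each entry of a name-to-DataFrame mapping and rejects names that would need backtick quoting in SQL. `FakeFrame` is a stand-in class used only so the sketch runs without a Spark session.

```python
import re

# Plain SQL identifiers: letters, digits and underscores, not starting with a digit
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def register_views(frames):
    """Register each DataFrame in `frames` (name -> DataFrame) as a temp view.

    Returns the list of registered names in insertion order.
    """
    # Validate every name first so a bad name registers nothing
    for name in frames:
        if not _IDENTIFIER.match(name):
            raise ValueError(f"not a plain SQL identifier: {name!r}")
    for name, df in frames.items():
        df.createOrReplaceTempView(name)
    return list(frames)


class FakeFrame:
    """Stand-in for a DataFrame so the sketch runs without Spark (illustration only)."""

    def __init__(self):
        self.view_name = None

    def createOrReplaceTempView(self, name):
        self.view_name = name


employees_df, projects_df = FakeFrame(), FakeFrame()
registered = register_views({"employees": employees_df, "projects": projects_df})
print(registered)  # ['employees', 'projects']
```

With a real session the call would be `register_views({"employees": employees, "projects": projects})`, and `spark.catalog.dropTempView("employees")` removes a view again before the session ends.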

In [3]:
print("="*80)
print("🔹 DEMO 1: Create Temporary Views")
print("="*80)

# Create temporary views
employees.createOrReplaceTempView("employees")
projects.createOrReplaceTempView("projects")

print("\n✅ Created temporary views:")
print("   - employees")
print("   - projects")

# List all tables
print("\n📋 Available tables:")
spark.sql("SHOW TABLES").show()

# Simple query
print("\n🔹 Simple SQL query:")
result = spark.sql("""
    SELECT * 
    FROM employees 
    LIMIT 5
""")
result.show()

print("""
💡 KEY POINTS:
   - createOrReplaceTempView() registers a DataFrame as a temporary view
   - Use spark.sql() to run SQL queries against the view
   - spark.sql() returns a DataFrame
   - The view exists only in the current session
""")

🔹 DEMO 1: Create Temporary Views

✅ Created temporary views:
   - employees
   - projects

📋 Available tables:
+---------+---------+-----------+
|namespace|tableName|isTemporary|
+---------+---------+-----------+
|         |employees|       true|
|         | projects|       true|
+---------+---------+-----------+


🔹 Simple SQL query:
+-----------+----------+---+-----------+-------+-------------+------+--------+----------+
|employee_id|      name|age| department|country|         city|salary|  status| hire_date|
+-----------+----------+---+-----------+-------+-------------+------+--------+----------+
|    EMP0001|Employee 1| 59|  Marketing| France|         Lyon| 80535|  Active|2020-06-29|
|    EMP0002|Employee 2| 30|         HR| Canada|      Toronto| 83090|  Active|2021-10-04|
|    EMP0003|Employee 3| 40|Engineering|    USA|San Francisco| 78789|Inactive|2022-04-07|
|    EMP0004|Employee 4| 59|      Sales| Canada|      Toronto| 91691|  Active|2022-11-07|
|    EMP0005|Employee 

---

## 🔍 **3. BASIC SQL QUERIES**

In [4]:
print("="*80)
print("🔹 DEMO 2: Basic SQL Queries")
print("="*80)

# Query 1: SELECT with WHERE
print("\n📊 Query 1: High earners (salary > 80000)")
result1 = spark.sql("""
    SELECT employee_id, name, department, salary
    FROM employees
    WHERE salary > 80000
    ORDER BY salary DESC
    LIMIT 10
""")
result1.show()

# Query 2: GROUP BY with aggregations
print("\n📊 Query 2: Average salary by department")
result2 = spark.sql("""
    SELECT 
        department,
        COUNT(*) as employee_count,
        AVG(salary) as avg_salary,
        MIN(salary) as min_salary,
        MAX(salary) as max_salary
    FROM employees
    WHERE status = 'Active'
    GROUP BY department
    ORDER BY avg_salary DESC
""")
result2.show()

# Query 3: Multiple conditions
print("\n📊 Query 3: Engineering employees in USA")
result3 = spark.sql("""
    SELECT name, city, salary
    FROM employees
    WHERE department = 'Engineering'
      AND country = 'USA'
      AND salary > 75000
    ORDER BY salary DESC
""")
result3.show()

# Query 4: HAVING clause
print("\n📊 Query 4: Departments with avg salary > 70000")
result4 = spark.sql("""
    SELECT 
        department,
        COUNT(*) as count,
        AVG(salary) as avg_salary
    FROM employees
    GROUP BY department
    HAVING AVG(salary) > 70000
    ORDER BY avg_salary DESC
""")
result4.show()

print("""
💡 SQL BASICS:
   - SELECT: Choose columns
   - WHERE: Filter rows before grouping
   - GROUP BY: Group rows for aggregation
   - HAVING: Filter groups after aggregation
   - ORDER BY: Sort results
   - LIMIT: Limit number of rows
""")

🔹 DEMO 2: Basic SQL Queries

📊 Query 1: High earners (salary > 80000)
+-----------+------------+-----------+------+
|employee_id|        name| department|salary|
+-----------+------------+-----------+------+
|    EMP0127|Employee 127|Engineering|109993|
|    EMP0184|Employee 184|Engineering|109861|
|    EMP0915|Employee 915|Engineering|109813|
|    EMP0513|Employee 513|Engineering|109597|
|    EMP0119|Employee 119|Engineering|109342|
|    EMP0227|Employee 227|Engineering|109293|
|    EMP0014| Employee 14|Engineering|109291|
|    EMP0265|Employee 265|Engineering|109183|
|    EMP0447|Employee 447|Engineering|108618|
|    EMP0301|Employee 301|Engineering|108589|
+-----------+------------+-----------+------+


📊 Query 2: Average salary by department

+-----------+--------------+-----------------+----------+----------+
| department|employee_count|       avg_salary|min_salary|max_salary|
+-----------+--------------+-----------------+----------+----------+
|Engineering|           128|    88664.8359375|     70683|    109861|
|    Finance|           160|      83612.98125|     65354|    104792|
|      Sales|           145|80713.99310344827|     60030|     99948|
|  Marketing|           162|76166.91358024691|     55399|     94834|
|         HR|           163|69340.77300613497|     50271|     89966|
+-----------+--------------+-----------------+----------+----------+


📊 Query 3: Engineering employees in USA
+------------+-------------+------+
|        name|         city|salary|
+------------+-------------+------+
|Employee 301|San Francisco|108589|
|Employee 380|     New York|107733|
|Employee 472|      Seattle|107447|
|Employee 675|San Francisco|105027|
|Employee 244|      Seattle|100634|
|Employee 535|      Seattle| 97087|
|Employee 
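
WHERE and HAVING are easy to confuse because both filter: WHERE removes rows before grouping, while HAVING removes whole groups after aggregation. The sketch below makes the difference visible on four invented rows. It runs on Python's built-in `sqlite3` as a stand-in so that no cluster is needed; the table and its values are assumptions for illustration, and the same SQL text is expected to work when passed to `spark.sql()` against a view with these columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("A", "Engineering", 90000),
        ("B", "Engineering", 60000),
        ("C", "HR", 72000),
        ("D", "HR", 50000),
    ],
)

# WHERE drops low-salary rows first, so each average covers the remaining rows only
where_first = conn.execute("""
    SELECT department, AVG(salary) AS avg_salary
    FROM employees
    WHERE salary > 70000
    GROUP BY department
    ORDER BY department
""").fetchall()

# HAVING averages all rows per department, then drops departments below the threshold
having_after = conn.execute("""
    SELECT department, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY department
    HAVING AVG(salary) > 70000
    ORDER BY department
""").fetchall()

print(where_first)   # [('Engineering', 90000.0), ('HR', 72000.0)]
print(having_after)  # [('Engineering', 75000.0)]
conn.close()
```

HR appears in the first result because one HR row passes the row filter, but it is absent from the second because the department average over all HR rows is 61000.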

---

## 🔧 **4. SQL FUNCTIONS**

In [5]:
print("="*80)
print("🔹 DEMO 3: SQL Functions")
print("="*80)

# String functions
print("\n📊 String Functions:")
result_str = spark.sql("""
    SELECT 
        name,
        UPPER(name) as upper_name,
        LOWER(name) as lower_name,
        LENGTH(name) as name_length,
        SUBSTRING(name, 1, 8) as short_name,
        CONCAT(name, ' - ', department) as full_info
    FROM employees
    LIMIT 5
""")
result_str.show(truncate=False)

# Date functions
print("\n📊 Date Functions:")
result_date = spark.sql("""
    SELECT 
        name,
        hire_date,
        YEAR(hire_date) as hire_year,
        MONTH(hire_date) as hire_month,
        DAYOFWEEK(hire_date) as day_of_week,
        DATEDIFF(CURRENT_DATE(), hire_date) as days_employed,
        ADD_MONTHS(hire_date, 12) as first_anniversary
    FROM employees
    LIMIT 5
""")
result_date.show()

# Math functions
print("\n📊 Math Functions:")
result_math = spark.sql("""
    SELECT 
        name,
        salary,
        ROUND(salary / 12, 2) as monthly_salary,
        ROUND(salary * 1.1, 2) as salary_with_10pct_raise,
        CEIL(salary / 1000) * 1000 as rounded_up_salary,
        FLOOR(salary / 1000) * 1000 as rounded_down_salary
    FROM employees
    LIMIT 5
""")
result_math.show()

# Conditional functions
print("\n📊 Conditional Functions (CASE WHEN):")
result_case = spark.sql("""
    SELECT 
        name,
        age,
        salary,
        CASE 
            WHEN salary > 90000 THEN 'High'
            WHEN salary > 70000 THEN 'Medium'
            ELSE 'Low'
        END as salary_level,
        CASE 
            WHEN age < 30 THEN 'Junior'
            WHEN age < 45 THEN 'Mid-level'
            ELSE 'Senior'
        END as seniority
    FROM employees
    LIMIT 10
""")
result_case.show()

print("""
💡 COMMON SQL FUNCTIONS:

String:
   - UPPER(), LOWER(), LENGTH()
   - SUBSTRING(), CONCAT(), TRIM()

Date:
   - YEAR(), MONTH(), DAY()
   - DATEDIFF(), DATE_ADD(), DATE_SUB()
   - CURRENT_DATE(), CURRENT_TIMESTAMP()

Math:
   - ROUND(), CEIL(), FLOOR()
   - ABS(), SQRT(), POW()

Conditional:
   - CASE WHEN ... THEN ... ELSE ... END
   - IF(), COALESCE(), NULLIF()
""")

🔹 DEMO 3: SQL Functions

📊 String Functions:
+----------+----------+----------+-----------+----------+------------------------+
|name      |upper_name|lower_name|name_length|short_name|full_info               |
+----------+----------+----------+-----------+----------+------------------------+
|Employee 1|EMPLOYEE 1|employee 1|10         |Employee  |Employee 1 - Marketing  |
|Employee 2|EMPLOYEE 2|employee 2|10         |Employee  |Employee 2 - HR         |
|Employee 3|EMPLOYEE 3|employee 3|10         |Employee  |Employee 3 - Engineering|
|Employee 4|EMPLOYEE 4|employee 4|10         |Employee  |Employee 4 - Sales      |
|Employee 5|EMPLOYEE 5|employee 5|10         |Employee  |Employee 5 - Sales      |
+----------+----------+----------+-----------+----------+------------------------+


📊 Date Functions:
+----------+----------+---------+----------+-----------+-------------+-----------------+
|      name| hire_date|hire_year|hire_month|day_of_week|days_employed|first_anniversary|
+
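
A "first anniversary" is a calendar-year offset, which differs from a 365-day offset whenever a leap day falls in between. The sample hire dates start in 2020, a leap year, so the distinction matters for this data. The pure-Python sketch below illustrates the gap; `add_one_year` is a hypothetical helper that mirrors the month-based behaviour of Spark SQL's `ADD_MONTHS(hire_date, 12)`, which also falls back to the last day of February when needed.

```python
from datetime import date, timedelta


def add_one_year(d):
    """Same month and day one year later; Feb 29 falls back to Feb 28."""
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        # Only reached for Feb 29 when the following year is not a leap year
        return d.replace(year=d.year + 1, day=28)


hire = date(2020, 1, 15)             # 2020 contains a leap day after this date
print(hire + timedelta(days=365))    # 2021-01-14, one day short of the anniversary
print(add_one_year(hire))            # 2021-01-15
print(add_one_year(date(2020, 2, 29)))  # 2021-02-28
```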

---

## 🔗 **5. SUBQUERIES & CTEs**

### **Subquery:**
```sql
SELECT * FROM table1
WHERE col1 IN (SELECT col2 FROM table2)
```

### **CTE (Common Table Expression):**
```sql
WITH temp_table AS (
    SELECT * FROM table1
)
SELECT * FROM temp_table
```
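
The two forms are interchangeable for many queries: a derived-table subquery can be lifted into a named CTE without changing the result. The sketch below checks this on five invented rows, using Python's built-in `sqlite3` as a stand-in so it runs without a cluster. The table contents and the `dept_avg` name are assumptions for illustration; the SQL itself is expected to be accepted by `spark.sql()` as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("A", "Engineering", 90000),
        ("B", "Engineering", 60000),
        ("C", "HR", 72000),
        ("D", "HR", 50000),
        ("E", "HR", 55000),
    ],
)

# Form 1: the department averages live in an inline (derived-table) subquery
subquery_sql = """
    SELECT e.name
    FROM employees e
    JOIN (
        SELECT department, AVG(salary) AS avg_salary
        FROM employees
        GROUP BY department
    ) d ON e.department = d.department
    WHERE e.salary > d.avg_salary
    ORDER BY e.name
"""

# Form 2: the same averages, named once in a CTE and then joined
cte_sql = """
    WITH dept_avg AS (
        SELECT department, AVG(salary) AS avg_salary
        FROM employees
        GROUP BY department
    )
    SELECT e.name
    FROM employees e
    JOIN dept_avg d ON e.department = d.department
    WHERE e.salary > d.avg_salary
    ORDER BY e.name
"""

via_subquery = conn.execute(subquery_sql).fetchall()
via_cte = conn.execute(cte_sql).fetchall()
print(via_subquery)             # [('A',), ('C',)]
print(via_subquery == via_cte)  # True
conn.close()
```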

In [6]:
print("="*80)
print("🔹 DEMO 4: Subqueries & CTEs")
print("="*80)

# Subquery example
print("\n📊 Subquery: Employees earning above department average")
result_subquery = spark.sql("""
    SELECT 
        e.name,
        e.department,
        e.salary,
        dept_avg.avg_salary as dept_avg_salary
    FROM employees e
    JOIN (
        SELECT department, AVG(salary) as avg_salary
        FROM employees
        GROUP BY department
    ) dept_avg ON e.department = dept_avg.department
    WHERE e.salary > dept_avg.avg_salary
    ORDER BY e.department, e.salary DESC
""")
result_subquery.show(10)

# CTE example 1: Single CTE
print("\n📊 CTE Example 1: Department statistics")
result_cte1 = spark.sql("""
    WITH dept_stats AS (
        SELECT 
            department,
            COUNT(*) as employee_count,
            AVG(salary) as avg_salary,
            MAX(salary) as max_salary
        FROM employees
        WHERE status = 'Active'
        GROUP BY department
    )
    SELECT 
        department,
        employee_count,
        ROUND(avg_salary, 2) as avg_salary,
        max_salary,
        ROUND(max_salary - avg_salary, 2) as salary_gap
    FROM dept_stats
    ORDER BY avg_salary DESC
""")
result_cte1.show()

# CTE example 2: Multiple CTEs
print("\n📊 CTE Example 2: Multiple CTEs")
result_cte2 = spark.sql("""
    WITH 
    high_earners AS (
        SELECT department, COUNT(*) as high_earner_count
        FROM employees
        WHERE salary > 80000
        GROUP BY department
    ),
    dept_totals AS (
        SELECT department, COUNT(*) as total_count
        FROM employees
        GROUP BY department
    )
    SELECT 
        dt.department,
        dt.total_count,
        COALESCE(he.high_earner_count, 0) as high_earner_count,
        ROUND(COALESCE(he.high_earner_count, 0) * 100.0 / dt.total_count, 2) as high_earner_pct
    FROM dept_totals dt
    LEFT JOIN high_earners he ON dt.department = he.department
    ORDER BY high_earner_pct DESC
""")
result_cte2.show()

# CTE example 3: CTE combined with a window function
# PERCENT_RANK() ranks each salary within its department on a 0-1 scale.
# MAX over the rows at or below a rank gives the salary at that percentile
# (AVG would instead give the mean of the lower slice).
print("\n📊 CTE Example 3: Salary percentiles")
result_cte3 = spark.sql("""
    WITH salary_stats AS (
        SELECT 
            department,
            salary,
            PERCENT_RANK() OVER (PARTITION BY department ORDER BY salary) as percentile
        FROM employees
    )
    SELECT 
        department,
        MAX(CASE WHEN percentile <= 0.25 THEN salary END) as p25_salary,
        MAX(CASE WHEN percentile <= 0.50 THEN salary END) as median_salary,
        MAX(CASE WHEN percentile <= 0.75 THEN salary END) as p75_salary
    FROM salary_stats
    GROUP BY department
    ORDER BY median_salary DESC
""")
result_cte3.show()

print("""
💡 SUBQUERIES vs CTEs:

Subquery:
   ✅ Good for simple, one-time use
   ❌ Can be hard to read when nested
   ❌ Must be repeated in the query text if needed more than once

CTE (WITH clause):
   ✅ More readable and maintainable
   ✅ Can be referenced multiple times by name
   ✅ Better for complex queries
   ✅ Can chain multiple CTEs

Best Practice:
   - Use CTEs for complex queries
   - Use subqueries for simple, one-time filters
""")

🔹 DEMO 4: Subqueries & CTEs

📊 Subquery: Employees earning above department average
+------------+-----------+------+-----------------+
|        name| department|salary|  dept_avg_salary|
+------------+-----------+------+-----------------+
|Employee 127|Engineering|109993|89469.69832402235|
|Employee 184|Engineering|109861|89469.69832402235|
|Employee 915|Engineering|109813|89469.69832402235|
|Employee 513|Engineering|109597|89469.69832402235|
|Employee 119|Engineering|109342|89469.69832402235|
|Employee 227|Engineering|109293|89469.69832402235|
| Employee 14|Engineering|109291|89469.69832402235|
|Employee 265|Engineering|109183|89469.69832402235|
|Employee 447|Engineering|108618|89469.69832402235|
|Employee 301|Engineering|108589|89469.69832402235|
+------------+-----------+------+-----------------+
only showing top 10 rows


📊 CTE Example 1: Department statistics
+-----------+--------------+----------+----------+----------+
| department|employee_count|avg_salary|max_salary|s
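
The percentile query in this section relies on `PERCENT_RANK()`, which is defined as `(rank - 1) / (n - 1)` within each partition, with tied values sharing the lowest rank. Aggregating the rows with `percentile <= p` behaves differently depending on the aggregate: AVG gives the mean of the lower slice, while MAX gives the value sitting at that percentile. The pure-Python sketch below reproduces the calculation on five invented salaries; `percent_rank` is a hypothetical helper written for illustration, not a Spark function.

```python
def percent_rank(values):
    """Return (value, percent_rank) pairs: (rank - 1) / (n - 1), ties share the lowest rank."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 1:
        return [(ordered[0], 0.0)]
    # ordered.index(v) is the position of the first occurrence, i.e. rank - 1
    return [(v, ordered.index(v) / (n - 1)) for v in ordered]


salaries = [50, 60, 70, 80, 90]
ranked = percent_rank(salaries)
print(ranked)  # [(50, 0.0), (60, 0.25), (70, 0.5), (80, 0.75), (90, 1.0)]

lower_half = [v for v, p in ranked if p <= 0.5]
avg_of_lower_half = sum(lower_half) / len(lower_half)
value_at_median = max(lower_half)
print(avg_of_lower_half)  # 60.0, the mean of the lower slice
print(value_at_median)    # 70, the actual median of the five salaries
```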

---

## ⚡ **6. SQL vs DATAFRAME API - PERFORMANCE**

In [7]:
print("="*80)
print("🔹 DEMO 5: SQL vs DataFrame API Performance")
print("="*80)

# Query: Average salary by department and country

# Method 1: SQL
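print("\n📊 Method 1: Using SQL")
# Note: this is a single wall-clock measurement, and the first query also pays
# one-time warm-up costs, so treat the two timings as rough indications only
start = time.time()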
result_sql = spark.sql("""
    SELECT 
        department,
        country,
        COUNT(*) as employee_count,
        AVG(salary) as avg_salary,
        MAX(salary) as max_salary
    FROM employees
    WHERE status = 'Active'
    GROUP BY department, country
    HAVING COUNT(*) > 5
    ORDER BY department, avg_salary DESC
""")
result_sql.show(10)
time_sql = time.time() - start
print(f"⏱️  SQL Time: {time_sql:.3f}s")

# Method 2: DataFrame API
print("\n📊 Method 2: Using DataFrame API")
start = time.time()
result_df = employees \
    .filter(col("status") == "Active") \
    .groupBy("department", "country") \
    .agg(
        F.count("*").alias("employee_count"),
        F.avg("salary").alias("avg_salary"),
        F.max("salary").alias("max_salary")
    ) \
    .filter(col("employee_count") > 5) \
    .orderBy("department", desc("avg_salary"))
result_df.show(10)
time_df = time.time() - start
print(f"⏱️  DataFrame Time: {time_df:.3f}s")

# Compare execution plans
print("\n📊 Execution Plans:")
print("\n🔹 SQL Plan:")
result_sql.explain()

print("\n🔹 DataFrame Plan:")
result_df.explain()

# Comparison
print("\n" + "="*80)
print("📊 PERFORMANCE COMPARISON")
print("="*80)

comparison = [
    ("SQL", time_sql, "More readable"),
    ("DataFrame API", time_df, "More programmatic")
]

comparison_df = spark.createDataFrame(comparison,
    ["Method", "Time (s)", "Note"])
comparison_df.show(truncate=False)

print("""
💡 KEY INSIGHTS:

Performance:
   - Both go through the Catalyst optimizer
   - Equivalent queries compile to the same (or nearly the same) physical plan
   - Performance is therefore equivalent; differences between single-run
     timings mostly reflect warm-up and measurement noise

When to use SQL:
   ✅ Complex analytical queries
   ✅ Team familiar with SQL
   ✅ Ad-hoc analysis
   ✅ BI tool integration

When to use DataFrame API:
   ✅ Programmatic data processing
   ✅ Reusable, composable Python code (compile-time type safety applies
      only to typed Datasets in Scala/Java, not to PySpark)
   ✅ IDE autocomplete
   ✅ Complex transformations

Best Practice:
   - Use both! Mix and match as needed
   - SQL for queries, DataFrame for transformations
   - Choose based on readability and team preference
""")

🔹 DEMO 5: SQL vs DataFrame API Performance

📊 Method 1: Using SQL
+-----------+-------+--------------+-----------------+----------+
| department|country|employee_count|       avg_salary|max_salary|
+-----------+-------+--------------+-----------------+----------+
|Engineering|Germany|            31|91465.19354838709|    109861|
|Engineering| France|            24|90679.04166666667|    109813|
|Engineering|     UK|            27|88126.81481481482|    109293|
|Engineering| Canada|            27|86416.14814814815|    108068|
|Engineering|    USA|            19|85511.63157894737|    108589|
|    Finance|Germany|            29|85195.06896551725|    104792|
|    Finance|     UK|            39|84679.56410256411|    104053|
|    Finance|    USA|            37|83775.18918918919|    104261|
|    Finance| Canada|            31|83401.45161290323|    104504|
|    Finance| France|            24|         79991.25|    103168|
+-----------+-------+--------------+-----------------+----------+
only
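
A single timed run per method is a weak basis for a performance comparison, because the first query absorbs one-time start-up costs. One way to get steadier numbers, sketched here under the assumption that the workload can be wrapped in a zero-argument callable, is to discard a warm-up run and take the median of several repeats. `time_action` is a hypothetical helper, and the lambda below is a stand-in workload so the sketch runs without Spark.

```python
import statistics
import time


def time_action(action, warmup=1, repeats=5):
    """Run `action` untimed `warmup` times, then return the median of `repeats` timed runs in seconds."""
    for _ in range(warmup):
        action()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic clock intended for interval timing
        action()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in workload; each call appends one entry so the number of runs can be checked
calls = []
median_s = time_action(lambda: calls.append(sum(range(10_000))), warmup=1, repeats=5)
print(f"median over 5 runs: {median_s:.6f}s ({len(calls)} calls including warm-up)")
```

With the DataFrames from this section, the callables could be `lambda: result_sql.collect()` and `lambda: result_df.collect()`, which force full execution of each query.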

---

## 🎓 **KEY TAKEAWAYS**

### **✅ What You Learned:**

1. **Temporary Views**
   - createOrReplaceTempView()
   - Query with spark.sql()
   - Session-scoped

2. **Basic SQL**
   - SELECT, WHERE, GROUP BY
   - HAVING, ORDER BY, LIMIT
   - Aggregations (COUNT, AVG, SUM, etc.)

3. **SQL Functions**
   - String: UPPER, LOWER, CONCAT
   - Date: YEAR, MONTH, DATEDIFF
   - Math: ROUND, CEIL, FLOOR
   - Conditional: CASE WHEN

4. **Subqueries & CTEs**
   - Subqueries for simple filters
   - CTEs (WITH) for complex queries
   - Multiple CTEs for readability

5. **SQL vs DataFrame API**
   - Same performance (Catalyst optimizer)
   - SQL: More readable for queries
   - DataFrame: More programmatic
   - Use both as needed!

### **📊 Quick Reference:**

```python
# Create view
df.createOrReplaceTempView("table_name")

# Run SQL
result = spark.sql("SELECT * FROM table_name")

# CTE
spark.sql("""
    WITH temp AS (
        SELECT * FROM table_name
    )
    SELECT * FROM temp
""")
```

### **🚀 Next:** Day 5 - Lesson 2: Advanced SQL

---

In [8]:
# Cleanup
spark.catalog.clearCache()
spark.stop()

print("✅ Spark session stopped")
print("\n🎉 DAY 5 - LESSON 1 COMPLETED!")
print("\n💡 Remember:")
print("   - SQL and DataFrame API have same performance")
print("   - Use createOrReplaceTempView() for SQL queries")
print("   - CTEs make complex queries readable")
print("   - Mix SQL and DataFrame API as needed")
print("\n🔥 Quote: 'SQL is not dead, it's just distributed!' 🚀")

✅ Spark session stopped

🎉 DAY 5 - LESSON 1 COMPLETED!

💡 Remember:
   - SQL and DataFrame API have same performance
   - Use createOrReplaceTempView() for SQL queries
   - CTEs make complex queries readable
   - Mix SQL and DataFrame API as needed

🔥 Quote: 'SQL is not dead, it's just distributed!' 🚀
