# PySpark: Zero to Hero
## Module 34: What is Spark Connect?

**Spark Connect** (introduced in Spark 3.4) is a client-server architecture that decouples the client (where you write code) from the Spark driver (where queries are analyzed, optimized, and scheduled). A thin client connects to a remote Spark cluster over the **gRPC** protocol, so you can work from any language that has a Connect client (such as Python, Go, or Rust) or from an IDE such as VS Code or PyCharm.

### Key Benefits:
1.  **Decoupling:** The cluster can be upgraded without upgrading every client at the same time, as long as the versions stay compatible.
2.  **Remote Connectivity:** Run Spark code from your local laptop/IDE against a remote production cluster.
3.  **Stability:** A client crash doesn't take down the cluster, and a cluster restart doesn't necessarily crash the client; the client only notices when it next sends a request for execution.

### Agenda:
1.  **Architecture:** Understand how Spark Connect works (gRPC, Arrow).
2.  **Setup:** Prerequisites for running Spark Connect (Server & Client).
3.  **Execution:** Connecting to a remote Spark server using `remote()`.
4.  **Comparison:** Spark Session vs. Spark Connect Session.
5.  **Limitations:** Why RDDs don't work in Spark Connect.

## 1. Architecture and Setup

Traditionally, the client application and the Spark driver were tightly coupled: in PySpark, your Python process talks to a JVM driver running on the same machine. With Spark Connect:
*   **Client:** Translates DataFrame operations into unresolved logical plans (encoded with Protocol Buffers) and sends them over **gRPC**.
*   **Server:** Receives the plans, analyzes and executes them, and streams results back as **Apache Arrow** batches.
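
To make the client/server split concrete, here is a toy model in plain Python. It is not Spark's implementation: `ToyDataFrame`, `fake_server_execute`, and the JSON encoding are hypothetical stand-ins (the real protocol uses Protocol Buffers over gRPC and returns Arrow batches). The only point is that a transformation merely grows a plan description, and nothing runs until an action ships that plan to the server.

```python
import json


class ToyDataFrame:
    """Hypothetical stand-in for a thin-client DataFrame: it only holds a plan."""

    def __init__(self, plan):
        self.plan = plan  # nested dict describing the query; no data lives here

    @classmethod
    def range(cls, n):
        return cls({"op": "range", "end": n})

    def with_squared(self, new_col, src_col):
        # A transformation wraps the existing plan; nothing is executed yet.
        return ToyDataFrame(
            {"op": "with_squared", "new": new_col, "src": src_col, "child": self.plan}
        )

    def collect(self):
        # An action serializes the plan and hands it to the "server".
        request = json.dumps(self.plan)
        return fake_server_execute(request)


def fake_server_execute(request):
    """Hypothetical server side: decode the plan, run it, return rows."""

    def run(node):
        if node["op"] == "range":
            return [{"id": i} for i in range(node["end"])]
        if node["op"] == "with_squared":
            rows = run(node["child"])
            return [{**row, node["new"]: row[node["src"]] ** 2} for row in rows]
        raise ValueError(f"unknown op: {node['op']}")

    return run(json.loads(request))


df = ToyDataFrame.range(4).with_squared("id_squared", "id")
print(json.dumps(df.plan))  # the plan is just data until an action runs
rows = df.collect()
print(rows)
```

Because the client only ever ships a description of the work, it needs no local JVM, which is why it can stay "thin".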

### Prerequisites (Based on Video Demo)
To run this notebook successfully, you need a running **Spark Connect Server**. 

**Server Side (Docker):**
If you are following the course Docker setup:
1.  Use the image `pyspark-cluster-3.5.5` (Spark 3.5+ is recommended for Connect).
2.  Start the container. The Spark Connect Server usually listens on port `15002`.
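
Before opening the notebook, it can help to confirm that something is actually listening on the Connect port. The sketch below uses only the standard library; `is_port_open` is a hypothetical helper, and `localhost:15002` assumes the container publishes the default port on your machine. It only shows that a TCP connection is accepted, not that the listener is a Spark Connect server.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Assumed endpoint: the Docker setup described above, published on localhost.
    print("Port 15002 reachable:", is_port_open("localhost", 15002))
```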

**Client Side (Libraries):**
You need to install specific dependencies to use the thin client:
```bash
pip install pyspark==3.5.5 pandas pyarrow grpcio grpcio-status protobuf
```

```python
# Import SparkSession
from pyspark.sql import SparkSession

# In a standard local run, we usually call .master("local").getOrCreate().
# For Spark Connect, we use the .remote() option instead.

# Connection string format: sc://<host>:<port>
# The default Spark Connect port is 15002.
connection_string = "sc://localhost:15002"

try:
    spark = SparkSession.builder.remote(connection_string).getOrCreate()

    # Reading the version forces a round trip to the server, so connection
    # problems tend to surface here rather than at the first action.
    print(f"Server Spark version: {spark.version}")
    print("Spark Connect Session created successfully!")
    print(spark)
except Exception as e:
    print("Could not connect to the Spark Connect server. Ensure the Docker container is running.")
    print(f"Error: {e}")

# Note: If you see 'pyspark.sql.connect.session.SparkSession', you are using the Connect client API.
```
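
The connection string is easy to mistype, so a small builder can help when the host or port comes from configuration. This is a sketch: `build_connect_url` is a hypothetical helper, not part of PySpark. The optional `;key=value` suffix follows the Spark Connect connection-string format as I understand it (parameters such as `use_ssl` and `token`); verify the exact parameter names against the documentation for your Spark version.

```python
def build_connect_url(host: str, port: int = 15002, **params: str) -> str:
    """Build an sc:// URL; extra keyword arguments become ;key=value parameters."""
    if not host:
        raise ValueError("host must not be empty")
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    url = f"sc://{host}:{port}"
    if params:
        # Parameters follow "/;" and are separated by ";".
        url += "/;" + ";".join(f"{key}={value}" for key, value in params.items())
    return url


print(build_connect_url("localhost"))
# sc://localhost:15002
print(build_connect_url("spark.example.com", 443, use_ssl="true"))
# sc://spark.example.com:443/;use_ssl=true
```

The returned string can be passed directly to `SparkSession.builder.remote(...)`.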

```python
# Let's verify the type of our session object.
# Standard: pyspark.sql.session.SparkSession
# Connect:  pyspark.sql.connect.session.SparkSession

print(f"Session Type: {type(spark)}")

# This object is a "thin client". It does not contain the heavy JVM driver logic locally.
```

```python
# Spark Connect supports the standard DataFrame API.
# The code looks exactly the same as standard PySpark.

# 1. Create a range DataFrame
df = spark.range(10)

# 2. Apply a transformation
df_modified = df.withColumn("value_squared", df["id"] * df["id"])

# 3. Action (triggers execution on the remote server)
# The plan is sent via gRPC, executed remotely, and the results are streamed back.
print("Executing DataFrame action...")
df_modified.show()
```

```python
# Unlike standard Spark, the client doesn't host the UI on localhost:4040.
# The UI lives on the Spark server side.

# If you check the Spark Master UI (usually localhost:8080 in the Docker setup),
# you will see an application named "Spark Connect Server".
# All queries run by this client appear under that application.
```

## 2. Limitations: The RDD API

One major difference with Spark Connect is that it **does not support RDDs (Resilient Distributed Datasets)**.

Because RDDs contain arbitrary Python/Java code (lambdas) that is hard to serialize and send over gRPC in a language-agnostic way, Spark Connect focuses purely on the **DataFrame/Dataset API**, whose operations can be expressed as declarative plans.

```python
try:
    # Attempting to access the underlying RDD fails in Spark Connect
    rdd = df.rdd
    print(rdd.getNumPartitions())
except Exception as e:
    print("Caught expected error:")
    print(f"{type(e).__name__}: {e}")

# You should see an error such as NotImplementedError or AttributeError,
# depending on the PySpark version.
# This confirms we are using the decoupled architecture.
```
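
When code reached for `df.rdd` only to do something simple, there is often a DataFrame-level substitute that works over Connect. As one hedged example, the sketch below approximates a partition count with `spark_partition_id()` from `pyspark.sql.functions`. The helper name `nonempty_partition_count` is hypothetical, and the result is not identical to `getNumPartitions()`: partitions that hold no rows are not counted.

```python
def nonempty_partition_count(df) -> int:
    """Count the partitions of `df` that contain at least one row."""
    # Imported lazily so that defining the helper does not require PySpark.
    from pyspark.sql import functions as F

    # Tag each row with the ID of the partition it lives in, then count distinct IDs.
    return df.select(F.spark_partition_id().alias("pid")).distinct().count()
```

With the Connect session from earlier, `nonempty_partition_count(spark.range(10))` runs entirely through the DataFrame API, so no RDD access is needed.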

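```python
# For comparison, here is how a standard session looks (if you wanted to run RDDs locally).
# This starts a JVM process locally on your machine, so it requires a local Java installation.
# Run this cell in a fresh kernel or a separate process: PySpark generally does not allow
# a classic session and a Connect session to coexist in one Python process.

# Same class as before; the alias is only for readability.
from pyspark.sql import SparkSession as StandardSession

spark_local = (
    StandardSession.builder
    .appName("Standard_Session")
    .master("local[*]")
    .getOrCreate()
)

print(f"Standard Session Type: {type(spark_local)}")

# RDDs work here
print(f"RDD Partition Count: {spark_local.range(10).rdd.getNumPartitions()}")

spark_local.stop()
```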

## Summary

1.  **Spark Connect** decouples the client from the Spark driver, which runs on the server.
2.  **Connectivity:** Use `SparkSession.builder.remote("sc://host:port")`.
3.  **Port:** Default is `15002`.
4.  **API Support:** The DataFrame and SQL APIs are supported (coverage varies by Spark version). **No RDD support**.
5.  **Use Case:** Ideal for modern data stacks, connecting IDEs (VS Code/Jupyter) to remote clusters, and building lightweight data apps.