# Spark Streaming with PySpark
## Module 18: Azure Cosmos DB Integration

In this module, we explore **NoSQL** databases and integrate **Azure Cosmos DB** with Apache Spark.

### Learning Objectives:
1.  **Understand NoSQL vs. SQL:** How scaling out (horizontal) differs from scaling up (vertical), and what schema flexibility means in practice.
2.  **Azure Cosmos DB Architecture:**
    *   **Database:** Logical container for data.
    *   **Container:** Equivalent to a table, partitioned physically and logically.
    *   **Items:** The actual data documents (e.g., JSON).
    *   **Partition Key:** The item property whose value decides which logical partition an item belongs to; it determines how data is distributed across physical partitions.
3.  **Spark Connector:** Configuring `azure-cosmos-spark` to read/write data.
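
To make the role of the partition key concrete, here is a minimal sketch of hash-based placement. It is illustrative only: `assign_partition`, the MD5 hash, the partition count of 4, and the sample items are all assumptions made for this example. Cosmos DB uses its own internal hash function and manages the number of physical partitions itself.

```python
import hashlib
from collections import Counter


def assign_partition(partition_key_value: str, partition_count: int) -> int:
    """Map a partition key value to a partition index (illustrative stand-in)."""
    digest = hashlib.md5(partition_key_value.encode("utf-8")).hexdigest()
    return int(digest, 16) % partition_count


# Hypothetical items; 'customerId' plays the role of the partition key.
items = [
    {"id": "1", "customerId": "cust-a"},
    {"id": "2", "customerId": "cust-b"},
    {"id": "3", "customerId": "cust-a"},
    {"id": "4", "customerId": "cust-c"},
]

# Items that share a partition key value always land in the same partition.
placement = {item["id"]: assign_partition(item["customerId"], 4) for item in items}
items_per_partition = Counter(placement.values())

print(placement)
print(items_per_partition)
```

The point to take away is that placement depends only on the partition key value, so a key with few distinct values concentrates data in few partitions.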

In [None]:
import os
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit
from pyspark.sql.types import StringType

# Define the Cosmos DB Spark Connector package
# Ensure this version matches your Spark/Scala version (Spark 3.3.x / Scala 2.12)
cosmos_connector_package = "com.azure.cosmos.spark:azure-cosmos-spark_3-3_2-12:4.15.0"

spark = SparkSession.builder \
    .appName("Azure_Cosmos_DB_Integration") \
    .master("local[*]") \
    .config("spark.jars.packages", cosmos_connector_package) \
    .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")

print("Spark Session Created successfully with Cosmos DB Connector")

In [None]:
# --- Cosmos DB Connection Configuration ---
# In a production environment, use spark-defaults.conf or a secret manager.
# For this lab, we define them here (Replace with your actual credentials).

# Endpoint: Found in Azure Portal -> Cosmos DB Account -> Keys
cosmos_endpoint = "https://<your-account-name>.documents.azure.com:443/"

# Key: Found in Azure Portal -> Cosmos DB Account -> Keys (Primary or Secondary Key)
cosmos_master_key = "<your-primary-key>"

database_name = "self"
container_name = "device-data"

# Create a dictionary for write configuration
write_config = {
    "spark.cosmos.accountEndpoint": cosmos_endpoint,
    "spark.cosmos.accountKey": cosmos_master_key,
    "spark.cosmos.database": database_name,
    "spark.cosmos.container": container_name,
    "spark.cosmos.write.strategy": "ItemAppend", # Strategies: ItemAppend, ItemOverwrite, ItemDelete, etc.
    "spark.cosmos.write.bulk.enabled": "true"
}
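
The write strategies named in the comment above behave differently when an item with the same `id` already exists. The sketch below is a rough in-memory model, assuming the connector's documented semantics: `ItemAppend` creates items and ignores conflicts with existing ones, `ItemOverwrite` upserts, and `ItemDelete` removes matching items. The `apply_write` helper and the `temp` field are hypothetical and exist only for this illustration; nothing here talks to Cosmos DB.

```python
def apply_write(store, item, strategy, partition_key="customerId"):
    """Apply one item to an in-memory 'container' keyed by (partition key, id)."""
    key = (item[partition_key], item["id"])
    if strategy == "ItemAppend":
        # Create only: an existing item with the same key is left untouched.
        store.setdefault(key, item)
    elif strategy == "ItemOverwrite":
        # Upsert: create the item or replace the existing one.
        store[key] = item
    elif strategy == "ItemDelete":
        # Remove the item if it exists.
        store.pop(key, None)
    else:
        raise ValueError(f"Unsupported strategy in this sketch: {strategy}")
    return store


container = {}
apply_write(container, {"id": "e1", "customerId": "c1", "temp": 20}, "ItemAppend")
apply_write(container, {"id": "e1", "customerId": "c1", "temp": 25}, "ItemAppend")
temp_after_append = container[("c1", "e1")]["temp"]  # first write is kept

apply_write(container, {"id": "e1", "customerId": "c1", "temp": 25}, "ItemOverwrite")
temp_after_overwrite = container[("c1", "e1")]["temp"]  # replaced by the upsert

apply_write(container, {"id": "e1", "customerId": "c1"}, "ItemDelete")
```

If re-running the lab should update existing documents rather than skip them, `ItemOverwrite` is the strategy to try in `write_config`.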

In [None]:
# We will use a sample JSON file representing device data to write to Cosmos DB.
# Make sure the file exists in your datasets folder.

input_path = "datasets/device_03.json"

# Read JSON data
source_df = spark.read.json(input_path)

# --- Important: Handling ID and Partition Key ---
# Cosmos DB requires a string 'id' field (unique within each logical partition)
# and a Partition Key property for distribution.
# Our container was created with Partition Key: /customerId
# Our source data has 'eventId'. We create the 'id' column from 'eventId',
# casting to string in case 'eventId' was inferred as a numeric type.

df_to_write = source_df.withColumn("id", col("eventId").cast(StringType()))

print("Source Data Schema:")
df_to_write.printSchema()
df_to_write.show(5, truncate=False)
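
The same `id` and partition key rules can be checked on a single record in plain Python before anything is sent to Spark. This is a minimal sketch: `prepare_item` and the `temperature` field are hypothetical, while `eventId` and `customerId` come from the lab's data and container setup.

```python
def prepare_item(record, id_field="eventId", partition_key="customerId"):
    """Return a copy of record with a string 'id'; fail if required fields are missing."""
    if record.get(id_field) is None:
        raise ValueError(f"Missing '{id_field}', cannot derive 'id'")
    if record.get(partition_key) is None:
        raise ValueError(f"Missing partition key '{partition_key}'")
    prepared = dict(record)  # leave the caller's record unchanged
    prepared["id"] = str(record[id_field])  # Cosmos DB expects 'id' to be a string
    return prepared


sample = {"eventId": 1001, "customerId": "cust-a", "temperature": 21.5}
item = prepare_item(sample)
print(item)
```

A check like this is useful on a handful of sample rows; for the full dataset, the equivalent DataFrame operations scale better.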

In [None]:
# Write the dataframe to Cosmos DB using the OLTP format
try:
    df_to_write.write \
        .format("cosmos.oltp") \
        .options(**write_config) \
        .mode("append") \
        .save()
    print("Data successfully written to Cosmos DB!")
except Exception as e:
    print(f"Error writing to Cosmos DB: {e}")

In [None]:
# Configure read settings
read_config = {
    "spark.cosmos.accountEndpoint": cosmos_endpoint,
    "spark.cosmos.accountKey": cosmos_master_key,
    "spark.cosmos.database": database_name,
    "spark.cosmos.container": container_name,
    "spark.cosmos.read.inferSchema.enabled": "true"  # Infer column types by sampling items from the container
}

# Read from Cosmos DB
cosmos_df = spark.read \
    .format("cosmos.oltp") \
    .options(**read_config) \
    .load()

print("Data read from Cosmos DB:")
cosmos_df.printSchema()
cosmos_df.show(5, truncate=False)

## Security Note: Managing Secrets

Hardcoding keys in your notebook (as done in the configuration cell above) is strictly for learning purposes. In a real-world scenario, decouple secrets from your code.

**Method: Using `spark-defaults.conf`**
1.  Navigate to your Spark installation's `conf` folder.
2.  Open (or create) `spark-defaults.conf`.
3.  Add your configurations there:
    ```properties
    spark.cosmos.accountEndpoint https://<your-account>.documents.azure.com:443/
    spark.cosmos.accountKey <your-secret-key>
    ```
4.  Restart your Spark session so the new settings are loaded.
5.  In your code, you can then omit the endpoint and key from the `options` dictionary, as the connector can pick them up from the Spark configuration.
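
Step 5 can be sketched as a small helper that adds credentials to the options only when they are passed explicitly. `build_cosmos_options` is a hypothetical helper written for this example, not part of the connector; it assumes the endpoint and key are otherwise supplied through `spark-defaults.conf` as described above.

```python
def build_cosmos_options(database, container, extra=None, endpoint=None, key=None):
    """Build connector options; credentials are included only if given explicitly."""
    options = {
        "spark.cosmos.database": database,
        "spark.cosmos.container": container,
    }
    if endpoint is not None:
        options["spark.cosmos.accountEndpoint"] = endpoint
    if key is not None:
        options["spark.cosmos.accountKey"] = key
    # Merge any additional read/write settings last.
    options.update(extra or {})
    return options


# No endpoint or key in code: they are expected to come from spark-defaults.conf.
read_options = build_cosmos_options(
    "self",
    "device-data",
    extra={"spark.cosmos.read.inferSchema.enabled": "true"},
)
print(read_options)
```

The resulting dictionary can be passed to the reader in the same way as before, for example `spark.read.format("cosmos.oltp").options(**read_options).load()`.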