# PySpark: Zero to Hero
## Module 28: Azure Cosmos DB Integration

Azure Cosmos DB is a fully managed NoSQL database for modern app development. In big data pipelines, it is common to offload processed data from Spark to Cosmos DB for low-latency serving or to read transactional data from Cosmos DB into Spark for analytics.

### Agenda:
1.  **NoSQL vs SQL:** Understanding the need for Cosmos DB.
2.  **Setup:** Configuring the Spark Session with the Cosmos DB Connector.
3.  **Writing Data:** Loading JSON data into Cosmos DB using `ItemAppend` and `ItemOverwrite`.
4.  **Reading Data:** Querying data from Cosmos DB into a DataFrame.
5.  **Operations:** Deleting items using `ItemDelete`.
6.  **Security:** Best practices for managing credentials.

### Prerequisites
To run this notebook, you need:
1.  An **Azure Cosmos DB for NoSQL** account created in the Azure Portal.
2.  A Database named `self` and a Container named `device-data`.
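3.  Partition Key: `/customerId` (partition key paths are case-sensitive, so this must match the `customerId` field in the data below).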

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, when

# We need to load the Azure Cosmos DB Spark Connector.
# The version depends on your Spark version.
# For Spark 3.3.x, we use: com.azure:azure-cosmos-spark_3-3_2-12:4.15.0
# Check Maven Central for the latest version compatible with your environment.

spark = SparkSession.builder \
    .appName("CosmosDB_Integration") \
    .master("local[*]") \
    .config("spark.jars.packages", "com.azure:azure-cosmos-spark_3-3_2-12:4.15.0") \
    .getOrCreate()

print("Spark Session Created with Cosmos DB Connector")
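
The connector's artifact name encodes the Spark minor version and the Scala binary version, as in the coordinate used above. A minimal sketch of deriving that coordinate from a Spark version string follows. The helper name `cosmos_connector_coordinate` is hypothetical, Scala 2.12 is assumed as the default, and the connector version is an input that still has to be checked against Maven Central.

```python
def cosmos_connector_coordinate(spark_version: str,
                                connector_version: str,
                                scala_version: str = "2.12") -> str:
    """Build the Maven coordinate for the Cosmos DB Spark connector.

    Hypothetical helper: it only formats the string and does not verify
    that the artifact is actually published for this combination.
    """
    # Keep only major and minor, e.g. "3.3.2" -> ("3", "3")
    major, minor = spark_version.split(".")[:2]
    # The artifact name uses dashes instead of dots, e.g. "2.12" -> "2-12"
    scala = scala_version.replace(".", "-")
    return (f"com.azure:azure-cosmos-spark_{major}-{minor}_{scala}:"
            f"{connector_version}")


coordinate = cosmos_connector_coordinate("3.3.2", "4.15.0")
print(coordinate)  # com.azure:azure-cosmos-spark_3-3_2-12:4.15.0
```

In a notebook, `pyspark.__version__` could be passed as `spark_version` so the package string follows the installed PySpark release.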

In [None]:
# -----------------------------------------------------------------
# ⚠️ WARNING: Do not commit real keys to version control (Git).
# In production, use Azure Key Vault or Spark Configs (spark-defaults.conf).
# -----------------------------------------------------------------

# Replace these with your actual Azure Cosmos DB credentials
cosmos_endpoint = "https://<YOUR_COSMOS_ACCOUNT>.documents.azure.com:443/"
cosmos_master_key = "<YOUR_PRIMARY_KEY>"
cosmos_database_name = "self"
cosmos_container_name = "device-data"

# Base Configuration Dictionary
cosmos_config = {
    "spark.cosmos.accountEndpoint": cosmos_endpoint,
    "spark.cosmos.accountKey": cosmos_master_key,
    "spark.cosmos.database": cosmos_database_name,
    "spark.cosmos.container": cosmos_container_name
}

print("Configuration Defined.")
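
One way to keep the key out of the notebook is to read it from environment variables. The sketch below assumes two variable names, `COSMOS_ENDPOINT` and `COSMOS_KEY`, which are illustrative and not defined by the connector; `load_cosmos_config` is likewise a hypothetical helper.

```python
import os


def load_cosmos_config(database: str, container: str, env=os.environ) -> dict:
    """Build connector options from environment variables.

    COSMOS_ENDPOINT and COSMOS_KEY are assumed names chosen for this sketch.
    """
    required = ("COSMOS_ENDPOINT", "COSMOS_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Fail early instead of sending an empty key to Cosmos DB
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "spark.cosmos.accountEndpoint": env["COSMOS_ENDPOINT"],
        "spark.cosmos.accountKey": env["COSMOS_KEY"],
        "spark.cosmos.database": database,
        "spark.cosmos.container": container,
    }


# Demo with a stand-in mapping so the example runs without real credentials
demo_env = {
    "COSMOS_ENDPOINT": "https://example.documents.azure.com:443/",
    "COSMOS_KEY": "not-a-real-key",
}
demo_config = load_cosmos_config("self", "device-data", env=demo_env)
print(sorted(demo_config))
```

Calling `load_cosmos_config("self", "device-data")` without the `env` argument reads from the real process environment.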

In [None]:
# Let's create sample device data similar to the video
data = [
    {
        "eventId": "e001",
        "customerId": "C001",
        "deviceType": "Sensor",
        "temperature": 75,
        "timestamp": "2023-01-01T10:00:00"
    },
    {
        "eventId": "e002",
        "customerId": "C002",
        "deviceType": "Thermostat",
        "temperature": 68,
        "timestamp": "2023-01-01T10:05:00"
    }
]

# Create DataFrame
df = spark.createDataFrame(data)

# Cosmos DB requires a unique 'id' field for every document.
# In our data, 'eventId' is unique, so we map 'eventId' to 'id'.
df_to_write = df.withColumn("id", col("eventId"))

print("Data prepared for writing:")
df_to_write.show(truncate=False)
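
Cosmos DB enforces `id` uniqueness within a logical partition, so it can help to check for duplicate (partition key, `id`) pairs before writing. The following is a minimal sketch on plain Python dictionaries; `find_duplicate_keys` is a hypothetical helper and the sample records are made up for illustration.

```python
from collections import Counter


def find_duplicate_keys(records, id_field="id", partition_field="customerId"):
    """Return (partition key, id) pairs that occur more than once."""
    counts = Counter((r[partition_field], r[id_field]) for r in records)
    return sorted(key for key, n in counts.items() if n > 1)


sample = [
    {"id": "e001", "customerId": "C001"},
    {"id": "e002", "customerId": "C002"},
    {"id": "e001", "customerId": "C001"},  # duplicate within partition C001
    {"id": "e001", "customerId": "C002"},  # same id, different partition
]
duplicates = find_duplicate_keys(sample)
print(duplicates)  # [('C001', 'e001')]
```

On a DataFrame, a comparable check could be written as `df_to_write.groupBy("customerId", "id").count().filter("count > 1")`.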

In [None]:
# Writing data to Cosmos DB using "cosmos.oltp" format
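# spark.cosmos.write.strategy: "ItemAppend" creates new items and skips any item whose id already exists.
# (If the option is omitted, the connector defaults to "ItemOverwrite".)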

try:
    df_to_write.write \
        .format("cosmos.oltp") \
        .options(**cosmos_config) \
        .option("spark.cosmos.write.strategy", "ItemAppend") \
        .mode("append") \
        .save()
        
    print("Data successfully written to Cosmos DB.")
except Exception as e:
    print(f"Error writing to Cosmos DB: {e}")
    print("Ensure your Cosmos DB Endpoint and Key are correct and the container exists.")

In [None]:
# Reading data back from Cosmos DB to verify
# Note: spark.cosmos.read.inferSchema.enabled=true lets the connector infer column types by sampling the JSON documents

try:
    df_read = spark.read \
        .format("cosmos.oltp") \
        .options(**cosmos_config) \
        .option("spark.cosmos.read.inferSchema.enabled", "true") \
        .load()

    print("Data read from Cosmos DB:")
    df_read.show(truncate=False)
    df_read.printSchema()
except Exception as e:
    print(f"Error reading from Cosmos DB: {e}")
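
Read options can be assembled separately from the shared connection settings, which keeps per-query choices out of `cosmos_config`. The sketch below builds such a dictionary. `build_read_options` is a hypothetical helper, and the example assumes the connector's `spark.cosmos.read.customQuery` option for passing a Cosmos DB SQL query; the query text itself is illustrative.

```python
from typing import Optional


def build_read_options(base_config: dict,
                       custom_query: Optional[str] = None,
                       infer_schema: bool = True) -> dict:
    """Return a copy of base_config extended with read-specific options."""
    options = dict(base_config)  # copy so the shared config is not mutated
    # Spark options are strings, so the boolean becomes "true" or "false"
    options["spark.cosmos.read.inferSchema.enabled"] = str(infer_schema).lower()
    if custom_query:
        options["spark.cosmos.read.customQuery"] = custom_query
    return options


base = {
    "spark.cosmos.database": "self",
    "spark.cosmos.container": "device-data",
}
read_options = build_read_options(
    base, custom_query="SELECT * FROM c WHERE c.deviceType = 'Sensor'"
)
print(read_options["spark.cosmos.read.customQuery"])
```

The result would then be passed to the reader with `spark.read.format("cosmos.oltp").options(**read_options).load()`.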

In [None]:
# To update an item, we modify the data and use "ItemOverwrite" strategy.
# We MUST provide the same 'id' and 'partition key' (customerId) to find and replace the item.

# Let's change temperature for eventId 'e001'
df_updated = df_to_write.withColumn(
    "temperature", 
    when(col("id") == "e001", 90).otherwise(col("temperature"))
)

try:
    df_updated.write \
        .format("cosmos.oltp") \
        .options(**cosmos_config) \
        .option("spark.cosmos.write.strategy", "ItemOverwrite") \
        .mode("append") \
        .save()
        
    print("Data updated in Cosmos DB.")
except Exception as e:
    print(f"Error updating Cosmos DB: {e}")

In [None]:
# To delete items, we only need the 'id' and the 'partition key'.
# We can select specific rows to delete.

df_to_delete = df_updated.filter(col("id") == "e002").select("id", "customerId")

try:
    df_to_delete.write \
        .format("cosmos.oltp") \
        .options(**cosmos_config) \
        .option("spark.cosmos.write.strategy", "ItemDelete") \
        .mode("append") \
        .save()
        
    print("Item e002 deleted from Cosmos DB.")
except Exception as e:
    print(f"Error deleting from Cosmos DB: {e}")

## Summary

1.  **Connectors:** You must load the `azure-cosmos-spark` JAR that matches your Spark and Scala versions.
2.  **Format:** Use `cosmos.oltp` for transactional read/write.
3.  **Write Strategies:**
    *   `ItemAppend`: Insert new items; an item whose `id` already exists is skipped rather than overwritten.
    *   `ItemOverwrite`: Upsert (insert or update). This is the connector's default strategy.
    *   `ItemDelete`: Delete items based on `id` and Partition Key.
4.  **ID Requirement:** Every item in Cosmos DB must have a string property named `id` that is unique within its logical partition.
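
The three write strategies can be compared with a toy in-memory model. This is a sketch of the semantics only, not the connector's implementation: the store is a plain dictionary keyed by (partition key, `id`), `apply_write` is a hypothetical function, and `ItemAppend` is modeled as leaving an existing item untouched, which is how the connector documentation describes it.

```python
def apply_write(store, items, strategy, partition_field="customerId"):
    """Apply items to an in-memory dict keyed by (partition key, id)."""
    for item in items:
        key = (item[partition_field], item["id"])
        if strategy == "ItemAppend":
            store.setdefault(key, dict(item))  # create; existing item is kept
        elif strategy == "ItemOverwrite":
            store[key] = dict(item)            # upsert: insert or replace
        elif strategy == "ItemDelete":
            store.pop(key, None)               # delete if present
        else:
            raise ValueError(f"Unknown strategy: {strategy}")
    return store


store = {}
apply_write(store, [{"id": "e001", "customerId": "C001", "temperature": 75}], "ItemAppend")
apply_write(store, [{"id": "e001", "customerId": "C001", "temperature": 90}], "ItemAppend")
after_append = store[("C001", "e001")]["temperature"]
print(after_append)  # 75: the second append did not replace the item

apply_write(store, [{"id": "e001", "customerId": "C001", "temperature": 90}], "ItemOverwrite")
after_overwrite = store[("C001", "e001")]["temperature"]
print(after_overwrite)  # 90

apply_write(store, [{"id": "e001", "customerId": "C001"}], "ItemDelete")
print(len(store))  # 0
```

Keying the store by both fields mirrors why updates and deletes need the `id` and the partition key together.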

### Security Best Practice
In the video, we moved the keys to `spark-defaults.conf` (or `spark-env.sh`) on the cluster.
This allows you to access configs like:
```python
# In production code
endpoint = spark.conf.get("spark.cosmos.accountEndpoint")
key = spark.conf.get("spark.cosmos.accountKey")
```