# Connection Layer Deep Dive

## 🎯 The Problem

Your data pipeline needs to run:
- Locally during development (filesystem)
- In Azure for production (Data Lake)
- On Databricks (DBFS)
- Maybe S3 or GCS in the future

**Without connection abstraction**:
```python
# Nightmare code - don't do this!
if env == "local":
    path = f"./data/{table}.parquet"
    df = pd.read_parquet(path)
elif env == "azure":
    path = f"abfss://{container}@{account}.dfs.core.windows.net/{table}.parquet"
    storage_options = {"account_key": get_key_from_vault()}
    df = pd.read_parquet(path, storage_options=storage_options)
elif env == "databricks":
    path = f"dbfs:/FileStore/{table}.parquet"
    df = spark.read.parquet(path).toPandas()
```

**With connection abstraction**:
```python
path = connection.get_path(f"{table}.parquet")
df = pd.read_parquet(path, storage_options=connection.pandas_storage_options())
```

## 🦉 First Principles: The Contract

All connections must implement:
1. **`get_path(relative_path)`** - Convert relative to absolute/URI
2. **`validate()`** - Check configuration early

Optional but useful:
3. **`pandas_storage_options()`** - Credentials for pandas/fsspec
4. **`configure_spark(spark)`** - Set up Spark session

## 🔍 Read Odibi: BaseConnection

The foundation - an Abstract Base Class (ABC):

In [None]:
from abc import ABC, abstractmethod

class BaseConnection(ABC):
    """Abstract base class for connections."""

    @abstractmethod
    def get_path(self, relative_path: str) -> str:
        """Get full path for a relative path.

        Args:
            relative_path: Relative path or table name

        Returns:
            Full path to resource
        """
        pass

    @abstractmethod
    def validate(self) -> None:
        """Validate connection configuration.

        Raises:
            ValueError: If the configuration is invalid
        """
        pass

### Why ABC?

- **Enforces interface**: Python refuses to instantiate a subclass that has not implemented these methods
- **Type safety**: All connections can be typed as `BaseConnection`
- **Extensibility**: Easy to add S3, GCS, etc.

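To see the enforcement in practice, here is a minimal sketch. `HalfFinishedConnection` and `try_instantiate` are hypothetical names made up for this illustration, and the base class is a compact copy of the one above:

```python
from abc import ABC, abstractmethod


class BaseConnection(ABC):
    """Compact copy of the contract shown above."""

    @abstractmethod
    def get_path(self, relative_path: str) -> str: ...

    @abstractmethod
    def validate(self) -> None: ...


class HalfFinishedConnection(BaseConnection):
    """Hypothetical subclass that forgets to implement validate()."""

    def get_path(self, relative_path: str) -> str:
        return relative_path


def try_instantiate(cls) -> str:
    """Return 'ok' if cls can be instantiated, else the TypeError message."""
    try:
        cls()
        return "ok"
    except TypeError as exc:
        return str(exc)


message = try_instantiate(HalfFinishedConnection)
print(message)  # the message names the missing abstract method
```

The error appears at instantiation, before any path is resolved or any data is read.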
## 🔍 Read Odibi: LocalConnection

Simplest implementation - just prefix paths with a base directory:

In [None]:
from pathlib import Path

class LocalConnection(BaseConnection):
    """Connection to local filesystem."""

    def __init__(self, base_path: str = "./data"):
        self.base_path = Path(base_path)

    def get_path(self, relative_path: str) -> str:
        """Get full path for a relative path."""
        full_path = self.base_path / relative_path
        return str(full_path.absolute())

    def validate(self) -> None:
        """Validate that base path exists or can be created."""
        self.base_path.mkdir(parents=True, exist_ok=True)

### Test LocalConnection

In [None]:
conn = LocalConnection(base_path="./my_data")
conn.validate()

print(conn.get_path("raw/sales.parquet"))
print(conn.get_path("processed/sales_clean.parquet"))

## 🔍 Read Odibi: AzureADLS

Production-ready Azure Data Lake connection with multiple features:

### Key Features
1. **Two auth modes**: Key Vault (production) or direct key (development)
2. **Path prefix**: Namespace within container
3. **Storage options**: For pandas/fsspec
4. **Spark configuration**: Auto-configure Spark sessions
5. **Validation**: Fail fast with clear errors
6. **Key caching**: Avoid repeated Key Vault calls

In [None]:
import posixpath
import warnings
import os
from typing import Optional

class AzureADLS(BaseConnection):
    """Azure Data Lake Storage Gen2 connection."""

    def __init__(
        self,
        account: str,
        container: str,
        path_prefix: str = "",
        auth_mode: str = "key_vault",
        key_vault_name: Optional[str] = None,
        secret_name: Optional[str] = None,
        account_key: Optional[str] = None,
        validate: bool = True,
        **kwargs,
    ):
        self.account = account
        self.container = container
        self.path_prefix = path_prefix.strip("/") if path_prefix else ""
        self.auth_mode = auth_mode
        self.key_vault_name = key_vault_name
        self.secret_name = secret_name
        self.account_key = account_key
        self._cached_key: Optional[str] = None

        if validate:
            self.validate()

    def validate(self) -> None:
        """Validate ADLS connection configuration."""
        if not self.account:
            raise ValueError("ADLS connection requires 'account'")
        if not self.container:
            raise ValueError("ADLS connection requires 'container'")

        if self.auth_mode == "key_vault":
            if not self.key_vault_name or not self.secret_name:
                raise ValueError(
                    f"key_vault mode requires 'key_vault_name' and 'secret_name' "
                    f"for connection to {self.account}/{self.container}"
                )
        elif self.auth_mode == "direct_key":
            if not self.account_key:
                raise ValueError(
                    f"direct_key mode requires 'account_key' "
                    f"for connection to {self.account}/{self.container}"
                )
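            if os.getenv("ODIBI_ENV") == "production":
                warnings.warn(
                    "⚠️  Using direct_key in production is not recommended. "
                    "Use auth_mode: key_vault.",
                    UserWarning,
                )
        else:
            # Reject unknown modes here instead of failing later in get_storage_key()
            raise ValueError(
                f"Unsupported auth_mode '{self.auth_mode}'; "
                "expected 'key_vault' or 'direct_key'"
            )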

    def get_storage_key(self, timeout: float = 30.0) -> str:
        """Get storage account key (cached)."""
        if self._cached_key:
            return self._cached_key

        if self.auth_mode == "key_vault":
            from azure.identity import DefaultAzureCredential
            from azure.keyvault.secrets import SecretClient
            import concurrent.futures

            credential = DefaultAzureCredential()
            kv_uri = f"https://{self.key_vault_name}.vault.azure.net"
            client = SecretClient(vault_url=kv_uri, credential=credential)

            def _fetch():
                secret = client.get_secret(self.secret_name)
                return secret.value

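            # Not a `with` block: leaving one calls shutdown(wait=True), which
            # would block until the fetch finishes and defeat the timeout.
            executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
            future = executor.submit(_fetch)
            try:
                self._cached_key = future.result(timeout=timeout)
                return self._cached_key
            except concurrent.futures.TimeoutError:
                raise TimeoutError(
                    f"Key Vault fetch timed out after {timeout}s"
                ) from None
            finally:
                executor.shutdown(wait=False)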

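        elif self.auth_mode == "direct_key":
            return self.account_key

        # Reached only when validation was skipped and auth_mode is unknown
        raise ValueError(f"Unsupported auth_mode '{self.auth_mode}'")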

    def pandas_storage_options(self) -> dict:
        """Get storage options for pandas/fsspec."""
        return {"account_name": self.account, "account_key": self.get_storage_key()}

    def configure_spark(self, spark) -> None:
        """Configure Spark session with storage account key."""
        config_key = f"fs.azure.account.key.{self.account}.dfs.core.windows.net"
        spark.conf.set(config_key, self.get_storage_key())

    def uri(self, path: str) -> str:
        """Build abfss:// URI for given path."""
        if self.path_prefix:
            full_path = posixpath.join(self.path_prefix, path.lstrip("/"))
        else:
            full_path = path.lstrip("/")

        return f"abfss://{self.container}@{self.account}.dfs.core.windows.net/{full_path}"

    def get_path(self, relative_path: str) -> str:
        """Get full abfss:// URI for relative path."""
        return self.uri(relative_path)

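The caching and timeout behaviour of `get_storage_key` can be tried without Azure. The sketch below is an illustration under assumptions, not Odibi code: `CachedSecret`, `fetch_calls`, and `slow_fetch` are hypothetical names, and a plain callable stands in for the Key Vault client. One detail worth knowing: leaving a `with ThreadPoolExecutor(...)` block waits for the worker to finish, so this sketch shuts the executor down with `wait=False` to keep the timeout meaningful.

```python
import concurrent.futures
import time
from typing import Callable, Optional


class CachedSecret:
    """Hypothetical helper: fetch a secret once, with a bounded wait."""

    def __init__(self, fetch: Callable[[], str]):
        self._fetch = fetch
        self._cached: Optional[str] = None
        self.fetch_calls = 0  # counter for demonstration only

    def _counted_fetch(self) -> str:
        self.fetch_calls += 1
        return self._fetch()

    def get(self, timeout: float = 30.0) -> str:
        if self._cached is not None:
            return self._cached  # cache hit: no second fetch
        executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        future = executor.submit(self._counted_fetch)
        try:
            self._cached = future.result(timeout=timeout)
            return self._cached
        except concurrent.futures.TimeoutError:
            raise TimeoutError(f"fetch timed out after {timeout}s") from None
        finally:
            # Do not wait for a slow worker; let the caller move on.
            executor.shutdown(wait=False)


# Caching: two reads, one fetch.
fast = CachedSecret(lambda: "fake-key")
first, second = fast.get(), fast.get()


def slow_fetch() -> str:
    time.sleep(0.5)  # simulates an unresponsive secret store
    return "too-late"


# Timeout: the caller gets control back well before the fetch completes.
slow = CachedSecret(slow_fetch)
started = time.monotonic()
try:
    slow.get(timeout=0.05)
    timed_out = False
except TimeoutError:
    timed_out = True
elapsed = time.monotonic() - started
```

The worker thread still runs to completion in the background; the timeout only bounds how long the caller waits.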
### Test AzureADLS (Development Mode)

In [None]:
conn = AzureADLS(
    account="mystorageaccount",
    container="data",
    path_prefix="project_x/v2",
    auth_mode="direct_key",
    account_key="fake_key_for_demo"
)

print(conn.get_path("raw/sales.parquet"))
print(conn.get_path("processed/sales_clean.parquet"))
print()
print("Storage options:", conn.pandas_storage_options())

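The test above does not exercise `configure_spark`. Since that method only calls `spark.conf.set`, it can be checked without a Spark installation. The following is a sketch under assumptions: `RecordingConf` is a hypothetical stand-in for `spark.conf`, and the method's logic is copied into a free function so the example runs on its own.

```python
from types import SimpleNamespace


class RecordingConf:
    """Hypothetical stand-in for spark.conf that records every set() call."""

    def __init__(self):
        self.settings = {}

    def set(self, key: str, value: str) -> None:
        self.settings[key] = value


def configure_spark(spark, account: str, storage_key: str) -> None:
    """Same logic as AzureADLS.configure_spark, written as a free function."""
    config_key = f"fs.azure.account.key.{account}.dfs.core.windows.net"
    spark.conf.set(config_key, storage_key)


# Any object with a .conf.set(key, value) method works as the "session".
fake_spark = SimpleNamespace(conf=RecordingConf())
configure_spark(fake_spark, account="mystorageaccount", storage_key="fake_key_for_demo")
print(fake_spark.conf.settings)
```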
### Using with Pandas

In [None]:
import pandas as pd

path = conn.get_path("raw/sales.parquet")
storage_options = conn.pandas_storage_options()

# df = pd.read_parquet(path, storage_options=storage_options)

## 🔍 Read Odibi: LocalDBFS

Mock Databricks filesystem for local development:

In [None]:
from pathlib import Path
from typing import Union

class LocalDBFS(BaseConnection):
    """Mock DBFS connection for local development.
    
    Maps dbfs:/ paths to local filesystem for testing.
    """

    def __init__(self, root: Union[str, Path] = ".dbfs"):
        self.root = Path(root).resolve()

    def resolve(self, path: str) -> str:
        """Resolve dbfs:/ path to local filesystem path."""
        clean_path = path.replace("dbfs:/", "").lstrip("/")
        local_path = self.root / clean_path
        return str(local_path)

    def ensure_dir(self, path: str) -> None:
        """Create parent directories for given path."""
        local_path = Path(self.resolve(path))
        local_path.parent.mkdir(parents=True, exist_ok=True)

    def get_path(self, relative_path: str) -> str:
        """Get local filesystem path for DBFS path."""
        return self.resolve(relative_path)

    def validate(self) -> None:
        """Validate local DBFS configuration."""
        pass

### Test LocalDBFS

In [None]:
conn = LocalDBFS(root="./mock_dbfs")

print(conn.get_path("dbfs:/FileStore/raw/sales.parquet"))
print(conn.get_path("dbfs:/mnt/data/processed/sales.parquet"))
print(conn.get_path("FileStore/tables/customers.csv"))

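The cell above only prints paths. A write-then-read round trip shows where `ensure_dir` fits in. This is a sketch using a compact copy of the mock (without the ABC base) and a temporary directory as the root; the file name and contents are made up for the example.

```python
import tempfile
from pathlib import Path


class LocalDBFS:
    """Compact copy of the mock above (no ABC base, for brevity)."""

    def __init__(self, root):
        self.root = Path(root).resolve()

    def resolve(self, path: str) -> str:
        clean_path = path.replace("dbfs:/", "").lstrip("/")
        return str(self.root / clean_path)

    def ensure_dir(self, path: str) -> None:
        Path(self.resolve(path)).parent.mkdir(parents=True, exist_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    conn = LocalDBFS(root=tmp)
    target = "dbfs:/FileStore/raw/sales.csv"

    # Create <root>/FileStore/raw/ before writing; the write would fail otherwise.
    conn.ensure_dir(target)
    local_file = Path(conn.resolve(target))
    local_file.write_text("id,amount\n1,9.99\n")

    round_trip = local_file.read_text()
    inside_root = conn.root in local_file.parents  # file stays under the mock root
```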
## 🏗️ Path Resolution Strategies

### LocalConnection: Base Path Join

In [None]:
local = LocalConnection(base_path="/data/project")
print(local.get_path("raw/sales.parquet"))
# /data/project/raw/sales.parquet

### AzureADLS: URI Construction with Prefix

In [None]:
azure = AzureADLS(
    account="myaccount",
    container="datalake",
    path_prefix="team/project",
    auth_mode="direct_key",
    account_key="key"
)
print(azure.get_path("raw/sales.parquet"))
# abfss://datalake@myaccount.dfs.core.windows.net/team/project/raw/sales.parquet

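The prefix handling is easy to get wrong when inputs carry stray slashes. The sketch below restates the `uri` logic as a standalone function (`build_abfss_uri` is a hypothetical name, not part of Odibi) and checks that leading and trailing slashes do not produce doubled separators.

```python
import posixpath


def build_abfss_uri(account: str, container: str, path: str, path_prefix: str = "") -> str:
    """Mirror AzureADLS.uri: strip stray slashes, then join prefix and path."""
    prefix = path_prefix.strip("/")
    relative = path.lstrip("/")
    full_path = posixpath.join(prefix, relative) if prefix else relative
    return f"abfss://{container}@{account}.dfs.core.windows.net/{full_path}"


base = "abfss://datalake@myaccount.dfs.core.windows.net"

# Slashes around the prefix and in front of the path are normalised away.
with_prefix = build_abfss_uri("myaccount", "datalake", "/raw/sales.parquet", "/team/project/")
without_prefix = build_abfss_uri("myaccount", "datalake", "raw/sales.parquet")

assert with_prefix == f"{base}/team/project/raw/sales.parquet"
assert without_prefix == f"{base}/raw/sales.parquet"
```

`posixpath` is used rather than `os.path` so the separator stays `/` on Windows as well.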
### LocalDBFS: Protocol Stripping

In [None]:
dbfs = LocalDBFS(root="/tmp/dbfs")
print(dbfs.get_path("dbfs:/FileStore/raw/sales.parquet"))
# /tmp/dbfs/FileStore/raw/sales.parquet
# (macOS prints /private/tmp/..., because Path.resolve() follows the /tmp symlink)

## 🏗️ Polymorphism in Action

The power: write code once, swap connections:

In [None]:
def read_sales_data(connection: BaseConnection) -> pd.DataFrame:
    """Read sales data using any connection."""
    path = connection.get_path("raw/sales.parquet")
    
    # Get storage options if available
    storage_options = {}
    if hasattr(connection, 'pandas_storage_options'):
        storage_options = connection.pandas_storage_options()
    
    return pd.read_parquet(path, storage_options=storage_options)

# Works with any connection!
# df = read_sales_data(LocalConnection())
# df = read_sales_data(AzureADLS(...))
# df = read_sales_data(LocalDBFS())

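The `hasattr` check works, but the optional capability can also be given a name. One possible variation, sketched here under assumptions, uses `typing.Protocol`; `SupportsStorageOptions`, `PlainConnection`, `CredentialedConnection`, and `storage_options_for` are hypothetical names introduced only for this example.

```python
from typing import Optional, Protocol, runtime_checkable


@runtime_checkable
class SupportsStorageOptions(Protocol):
    """Optional capability: connections that can supply fsspec credentials."""

    def pandas_storage_options(self) -> dict: ...


class PlainConnection:
    """Stand-in for a connection without credentials (e.g. local files)."""

    def get_path(self, relative_path: str) -> str:
        return f"/data/{relative_path}"


class CredentialedConnection:
    """Stand-in for a cloud connection that exposes storage options."""

    def get_path(self, relative_path: str) -> str:
        return f"abfss://container@account.dfs.core.windows.net/{relative_path}"

    def pandas_storage_options(self) -> dict:
        return {"account_name": "account", "account_key": "fake"}


def storage_options_for(connection) -> Optional[dict]:
    """Return credentials if the connection offers them, else None."""
    if isinstance(connection, SupportsStorageOptions):
        return connection.pandas_storage_options()
    return None


local_options = storage_options_for(PlainConnection())
cloud_options = storage_options_for(CredentialedConnection())
```

At runtime the `isinstance` check only tests that the method exists, so it behaves like `hasattr`; the gain is a named type that editors and type checkers can use. Returning `None` matches the default value of `storage_options` in `pd.read_parquet`.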
## ✅ Connection Validation

Validation catches configuration errors early:

In [None]:
try:
    bad_conn = AzureADLS(
        account="myaccount",
        container="data",
        auth_mode="key_vault"
        # Missing: key_vault_name and secret_name!
    )
except ValueError as e:
    print(f"❌ Validation caught error: {e}")

In [None]:
try:
    bad_conn = AzureADLS(
        account="",  # Empty!
        container="data",
        auth_mode="direct_key",
        account_key="key"
    )
except ValueError as e:
    print(f"❌ Validation caught error: {e}")

## 🎯 Key Takeaways

1. **BaseConnection** defines the contract all connections must follow
2. **LocalConnection** is simple: base_path + relative_path
3. **AzureADLS** is complex: authentication, URI building, Spark config
4. **LocalDBFS** enables local testing of Databricks code
5. **Validation** catches errors when the connection is constructed, not when data is first read
6. **Polymorphism** lets you write storage-agnostic code

## 🚀 Next Steps

Try the exercises to build S3 and GCS connections!