### Prepare environment

In [0]:
%run ../environment/prepare_environment


### Great Expectations

This notebook ensures the Silver layer data meets quality standards before any downstream processing by creating a Great Expectations suite for validation.

In detail, this notebook:
* Loads the Great Expectations context for the Silver layer.
* Creates an expectation suite for telco_silver data.
* Adds expectations for required columns, boolean types, and non-negative numeric columns.
* Sets up a Spark dataframe as the data source.
* Defines a batch and validation run against the suite.

**To register a table in Feature Store, you need to remove the target column and define a primary key for the table.**
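
The two requirements above can be checked before registration. The following is a minimal sketch in plain Python, not part of the Feature Store API: the helper names `check_feature_table_columns` and `find_duplicate_keys` and the sample column lists are hypothetical. It only inspects column names and key values, so it can be pointed at `silver_df.columns` or at a collected list of `customer_id` values.

```python
def check_feature_table_columns(columns, primary_key, label):
    """Return a list of problems that would block registering the table."""
    problems = []
    if label in columns:
        problems.append(f"target column '{label}' must be dropped")
    if primary_key not in columns:
        problems.append(f"primary key '{primary_key}' is missing")
    return problems


def find_duplicate_keys(keys):
    """Return the sorted key values that occur more than once."""
    seen, duplicates = set(), set()
    for key in keys:
        if key in seen:
            duplicates.add(key)
        seen.add(key)
    return sorted(duplicates)


# Hypothetical column lists, standing in for silver_df.columns
columns_before = ["customer_id", "tenure", "churn"]
columns_after = ["customer_id", "tenure"]

problems_before = check_feature_table_columns(columns_before, "customer_id", "churn")
problems_after = check_feature_table_columns(columns_after, "customer_id", "churn")

# Hypothetical key values; a primary key should have no repeats
duplicate_ids = find_duplicate_keys(["A1", "B2", "A1"])
```

An empty `problems_after` list and an empty duplicate list would suggest the DataFrame is ready to be registered with `customer_id` as its primary key.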

In [0]:
# Load the base silver table
silver_df = spark.table("ai_ml_in_practice.telco_customer_churn_silver.telco_silver")

# Drop the target column (churn) and the underscore-prefixed helper columns
silver_df = silver_df.drop("churn", "_is_processed", "_row_id")

display(silver_df)

### Create a Feature Store table for customer-level features

This cell registers the prepared feature DataFrame as a **Feature Store table**.

What happens here:
- `FeatureEngineeringClient` is initialized to interact with Databricks Feature Store.
- A new feature table is created from the current `silver_df`.
- `customer_id` is defined as the primary key, making features joinable and reusable.
- Metadata (description and tags) is attached for lineage and discoverability.

In [0]:
from databricks.feature_engineering import FeatureEngineeringClient

fe = FeatureEngineeringClient()

fe.create_table(
    name="ai_ml_in_practice.telco_customer_churn_silver.telco_customer_features",
    primary_keys=["customer_id"],
    df=silver_df,
    description="Telco customer features",
    tags={"source": "silver", "format": "delta"}
)

### Build a training dataset using Feature Store lookups

Here we assemble the training dataset by joining labels with reusable features stored in the Feature Store.

What happens step by step:
- `FeatureLookup` defines which feature table to join and how (`customer_id`).
- Only selected feature columns are pulled into the training set.
- The base DataFrame provides the label (`churn`) and join key.
- `create_training_set` materializes a point-in-time-consistent dataset.
- `load_df()` returns a Spark DataFrame ready for modeling.
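
Conceptually, the lookup steps above behave like a keyed join followed by a column selection. The sketch below imitates that in plain Python on lists of dictionaries; it is an illustration only, not how Databricks implements `create_training_set`. The function `lookup_features` and the sample rows are hypothetical, and treating unmatched keys as a left join that yields `None` is an assumption of this sketch.

```python
def lookup_features(label_rows, feature_table, lookup_key, feature_names,
                    exclude_columns=()):
    """Left-join selected features onto label rows, then drop excluded columns."""
    # Index the feature table by its key for constant-time lookups
    by_key = {row[lookup_key]: row for row in feature_table}
    result = []
    for row in label_rows:
        features = by_key.get(row[lookup_key], {})
        joined = dict(row)
        for name in feature_names:
            # Only the requested feature columns are pulled in;
            # a key with no match yields None for every feature
            joined[name] = features.get(name)
        result.append({k: v for k, v in joined.items() if k not in exclude_columns})
    return result


# Hypothetical label rows (join key + label) and feature rows
labels = [
    {"customer_id": "A1", "churn": True},
    {"customer_id": "Z9", "churn": False},
]
features = [
    {"customer_id": "A1", "internet_service": "dsl", "tech_support": "yes", "tenure": 12},
]

training_rows = lookup_features(
    labels,
    features,
    lookup_key="customer_id",
    feature_names=["internet_service", "tech_support"],
    exclude_columns=["customer_id"],
)
```

Note that `tenure` is not in `feature_names`, so it never reaches the result, and `customer_id` is used for the join but excluded from the output, mirroring the role of `exclude_columns`.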

In [0]:
from databricks.feature_engineering import FeatureLookup

feature_lookups = [
  FeatureLookup(
    table_name="ai_ml_in_practice.telco_customer_churn_silver.telco_customer_features",
    feature_names=["internet_service", "online_security", "online_backup", "device_protection", "tech_support", "streaming_tv", "streaming_movies"],
    lookup_key="customer_id"
  )
]

feature_df = spark.table("ai_ml_in_practice.telco_customer_churn_silver.telco_silver").select("customer_id", "churn")
display(feature_df)

training_set = fe.create_training_set(
  df=feature_df,
  feature_lookups=feature_lookups,
  exclude_columns=["customer_id"],
  label="churn",
)

training_df = training_set.load_df()
display(training_df)

### Transform the training dataset

The transformations applied earlier to the base dataset can also be applied to the assembled training dataset.

As an example, we will encode the internet service columns as embeddings.
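
To make the encoding step concrete, here is a small plain-Python sketch of the idea: the categorical values of a row are joined into one sequence, split into tokens, and the token vectors are averaged into a single embedding. It assumes the registered model behaves like a Word2Vec-style model that averages token vectors, which the notebook does not state. The helpers `to_tokens` and `average_embedding`, the two-dimensional vectors, and the sample row are all hypothetical.

```python
def to_tokens(row, columns, sep=";"):
    """Join the non-null values of the given columns, then split into tokens."""
    # Null values are skipped when joining, as Spark's concat_ws does
    sequence = sep.join(str(row[c]) for c in columns if row[c] is not None)
    return sequence.split(sep)


def average_embedding(tokens, vectors, size):
    """Average the vectors of all tokens; unknown tokens contribute zeros."""
    total = [0.0] * size
    for token in tokens:
        for i, value in enumerate(vectors.get(token, [0.0] * size)):
            total[i] += value
    if not tokens:
        return total
    return [value / len(tokens) for value in total]


# Hypothetical two-dimensional token vectors
vectors = {"dsl": [1.0, 0.0], "yes": [0.0, 1.0]}

# Hypothetical row with one missing value
row = {"internet_service": "dsl", "online_security": "yes", "online_backup": None}
columns = ["internet_service", "online_security", "online_backup"]

tokens = to_tokens(row, columns)
embedding = average_embedding(tokens, vectors, size=2)
```

One consequence of joining and re-splitting on `;` is that a category value containing the separator itself would be broken into several tokens, so the separator should be a character that never appears in the data.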

In [0]:
import mlflow
import mlflow.spark
from pyspark.sql.functions import split, concat_ws, col

# Define categorical columns
internet_categorical_columns = ["internet_service", "online_security", "online_backup", "device_protection", "tech_support", "streaming_tv", "streaming_movies"]
training_df = training_df.withColumn("categorical_sequence", concat_ws(";", *internet_categorical_columns))
training_df = training_df.withColumn("categorical_tokens", split(col("categorical_sequence"), ";"))

model = mlflow.spark.load_model(
    "models:/ai_ml_in_practice.telco_customer_churn_silver.telco_word2vec_internet_services/1",
    dfs_tmpdir="/Volumes/ai_ml_in_practice/telco_customer_churn_silver/mlflow_tmp"
)

# Generate embeddings for categorical columns
training_df = model.transform(training_df)
# drop() returns a new DataFrame, so the result must be reassigned
training_df = training_df.drop("categorical_sequence", "categorical_tokens")

display(training_df)