### Prepare environment

In [0]:
%run ../environment/prepare_environment

### Perform feature engineering steps to transform the data from silver layer

In this section, we perform several feature engineering steps to transform the dataset and save it as a feature table:
* load the silver Telco dataset
* cast the integer and boolean columns to `Double` type
* perform one-hot encoding on the `gender` column
* perform embedding encoding on the internet service columns and on `payment_method`
* perform feature hashing on the `phone_service` and `multiple_lines` columns
* perform ordinal encoding on the `contract` column
* perform scaling on the `tenure`, `monthly_charges`, and `total_charges` columns

In [0]:
silver_df = spark.table("ai_ml_in_practice.telco_customer_churn_silver.telco_silver").drop("customer_id", "_row_id", "_is_processed")
display(silver_df)

### Type casting

We scan the schema and find all columns that are integers or booleans.
Then we explicitly cast them to `Double`.

**Note: the final vector assembly automatically converts these columns to `Double`, so no manual conversion is required. This step is done only for code and data sanity.**

In [0]:
from pyspark.sql.types import IntegerType, BooleanType, StringType, DoubleType
from pyspark.sql.functions import col, count, when

# Collect the names of all integer and boolean columns
integer_cols = [
    field.name
    for field in silver_df.schema.fields
    if isinstance(field.dataType, (IntegerType, BooleanType))
]

# Cast each of those columns to double
for column in integer_cols:
    silver_df = silver_df.withColumn(column, col(column).cast("double"))

display(silver_df)

### One-hot encoding

To one-hot encode categorical features in Spark ML, we follow a simple but explicit three-step workflow built around Spark MLlib transformers.

First, categorical values must be converted into numbers.
This is handled by `StringIndexer`, which:
- scans a string column,
- assigns a unique numeric index to each distinct category,
- defines how unseen values should be treated during transformation.

Next, the indexed values are expanded using `OneHotEncoder`.
At this stage:
- each category becomes its own binary dimension (with the default `dropLast=True`, the last category is encoded as the all-zeros vector),
- no artificial ordering is introduced,
- the output is a vector representation suitable for ML models.

Finally, `VectorAssembler` is used to:
- collect one or more encoded columns,
- pack them into a single feature vector,
- produce a clean, model-ready column.

The *StringIndexer → OneHotEncoder → VectorAssembler* pattern is the standard approach for low-cardinality categorical features where interpretability and simplicity matter more than compactness.
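
To make the indexing and encoding steps concrete, here is a minimal plain-Python sketch of the idea. It is not Spark's implementation: the helper names `fit_string_index` and `one_hot` are hypothetical, the sample values are made up, and `handleInvalid` is ignored. The sketch assumes the default behaviors of ordering categories by descending frequency and dropping the last category.

```python
from collections import Counter


def fit_string_index(values):
    """Assign index 0 to the most frequent category, 1 to the next, and so on."""
    counts = Counter(values)
    # Ties are broken alphabetically so the mapping is deterministic.
    ordered = sorted(counts, key=lambda value: (-counts[value], value))
    return {category: index for index, category in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Expand a category index into a binary vector."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    # With drop_last, the last category stays as the all-zeros vector.
    if index < size:
        vector[index] = 1.0
    return vector


# Made-up sample values for illustration only
genders = ["Male", "Female", "Male", "Female", "Male"]
mapping = fit_string_index(genders)
encoded = [one_hot(mapping[gender], len(mapping)) for gender in genders]
```

In this sketch, two categories with the last one dropped are encoded in a single dimension; passing `drop_last=False` gives one dimension per category.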

In [0]:
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.sql.functions import col

# StringIndexer
string_cols = ["gender"]
index_cols = ["gender_index"]

string_indexer = StringIndexer(inputCols=string_cols, outputCols=index_cols, handleInvalid="skip")
string_indexer_model = string_indexer.fit(silver_df)
silver_df = string_indexer_model.transform(silver_df)

# Create a list of one-hot encoded feature names
ohe_cols = ["gender_encoded"]

# Instantiate the OneHotEncoder with the column lists
ohe = OneHotEncoder(inputCols=index_cols, outputCols=ohe_cols, handleInvalid="keep")

# Fit the OneHotEncoder on the indexed data
ohe_model = ohe.fit(silver_df)

# Transform silver_df using the ohe_model
silver_df = ohe_model.transform(silver_df)

# Use VectorAssembler to pack the one-hot encoded columns into a single feature vector column
assembler = VectorAssembler(inputCols=ohe_cols, outputCol="gender_features")
silver_df = assembler.transform(silver_df)

display(silver_df)

silver_df = silver_df.drop("gender", "gender_index", "gender_encoded")

### Categorical embeddings encoding

We transform multiple related categorical columns into dense embeddings using pre-trained `Word2Vec` models stored in MLflow.

First, for internet-related features:
- We combine columns like `internet_service`, `online_security`, and others into a single sequence per customer.
- The sequence is tokenized into an array so Word2Vec can process it.
- The registered MLflow Word2Vec model is loaded and applied to generate `categorical_embedding` vectors.
- Temporary columns used for tokenization are dropped after embedding.

Next, for payment method:
- We normalize the `payment_method` column by removing parentheses and split the string into tokens.
- The corresponding MLflow Word2Vec model for payment methods is loaded.
- Each row is transformed into a dense vector representing the payment feature.
- Original and intermediate columns are dropped to keep the dataset clean.
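
The tokenize-then-embed flow can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the registered model: the helpers `tokenize_payment` and `average_embedding` are hypothetical, the 2-dimensional `toy_vectors` table and the sample value `"Bank transfer (automatic)"` are made up, and the sketch assumes a document vector is the sum of the known token vectors divided by the token count.

```python
import re


def tokenize_payment(method):
    """Remove parentheses, then split the string on spaces."""
    return re.sub(r"[()]", "", method).split(" ")


def average_embedding(tokens, vectors, size):
    """Sum the vectors of known tokens and divide by the total token count."""
    total = [0.0] * size
    for token in tokens:
        vector = vectors.get(token)
        if vector is not None:  # tokens missing from the table contribute nothing
            total = [a + b for a, b in zip(total, vector)]
    return [value / len(tokens) for value in total] if tokens else total


# Hypothetical lookup table; a trained Word2Vec model would supply real vectors.
toy_vectors = {
    "Bank": [0.2, 0.4],
    "transfer": [0.4, 0.0],
    "automatic": [0.0, 0.2],
}

tokens = tokenize_payment("Bank transfer (automatic)")
embedding = average_embedding(tokens, toy_vectors, size=2)
```

Each row ends up with one fixed-length vector regardless of how many tokens it contains, which is what makes the result usable as a model feature.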

In [0]:
import mlflow
import mlflow.spark
from pyspark.sql.functions import split, concat_ws

# Define categorical columns
internet_categorical_columns = ["internet_service", "online_security", "online_backup", "device_protection", "tech_support", "streaming_tv", "streaming_movies"]
silver_df = silver_df.withColumn("categorical_sequence", concat_ws(";", *internet_categorical_columns))
silver_df = silver_df.withColumn("categorical_tokens", split(col("categorical_sequence"), ";"))

model = mlflow.spark.load_model(
    "models:/ai_ml_in_practice.telco_customer_churn_silver.telco_word2vec_internet_services/1",
    dfs_tmpdir="/Volumes/ai_ml_in_practice/telco_customer_churn_silver/mlflow_tmp"
)

# Generate embeddings for categorical columns
silver_df = model.transform(silver_df)

display(silver_df)

silver_df = silver_df.drop("categorical_sequence", "categorical_tokens", *internet_categorical_columns)

In [0]:
import mlflow
import mlflow.spark
from pyspark.sql.functions import split, concat_ws, regexp_replace, col

silver_df = (
    silver_df
    .withColumn("payment_normalized", regexp_replace(col('payment_method'), r'[()]', ''))
    .withColumn("payment_tokens", split(col("payment_normalized"), " "))
)

model = mlflow.spark.load_model(
    "models:/ai_ml_in_practice.telco_customer_churn_silver.telco_word2vec_payment_methods/1",
    dfs_tmpdir="/Volumes/ai_ml_in_practice/telco_customer_churn_silver/mlflow_tmp"
)

# Generate embeddings for categorical columns
silver_df = model.transform(silver_df)

display(silver_df)

silver_df = silver_df.drop("payment_method", "payment_normalized", "payment_tokens")

### Feature hashing

In this step we handle tricky categorical combinations with **feature hashing**.

- We first merge `phone_service` and `multiple_lines` into a single string per row.
- Then we apply `FeatureHasher` to convert this combined category into a fixed-size numeric vector.
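- The `numFeatures=8` parameter fixes the output vector at 8 dimensions, which keeps the representation compact and avoids exploding dimensions; distinct categories may collide in the same bucket.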
- This approach is useful when categories are numerous or inconsistent (like “No phone service” + “Yes”).
- Finally, all intermediate and original columns are dropped, leaving only clean, hashed features.

The result is a small, consistent numerical representation of complex categorical combinations.
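
The hashing trick itself can be sketched in a few lines of plain Python. This is a simplified illustration, not `FeatureHasher`: the helper `hash_row` is hypothetical, the sample value is made up, and the sketch swaps in SHA-256 from the standard library where Spark uses MurmurHash3, so the bucket positions will not match Spark's output.

```python
import hashlib


def hash_row(row, num_features=8):
    """Map each categorical column=value pair to one of num_features buckets."""
    vector = [0.0] * num_features
    for column, value in row.items():
        key = f"{column}={value}".encode("utf-8")
        digest = hashlib.sha256(key).digest()
        # Use the first 8 bytes of the digest as a stable integer hash.
        bucket = int.from_bytes(digest[:8], "big") % num_features
        vector[bucket] += 1.0
    return vector


# Made-up combined category for illustration only
row = {"phone_multiple_lines": "Yes_No"}
hashed = hash_row(row)
```

The vector length depends only on `num_features`, never on the number of distinct categories, at the cost of possible collisions between categories.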

In [0]:
from pyspark.ml.feature import FeatureHasher
from pyspark.sql.functions import concat_ws, col

silver_df = silver_df.withColumn(
    "phone_multiple_lines",
    concat_ws("_", col("phone_service"), col("multiple_lines"))
)

hasher = FeatureHasher(
    inputCols=["phone_multiple_lines"],
    outputCol="phone_multiple_lines_hashed",
    numFeatures=8
)

silver_df = hasher.transform(silver_df)
display(silver_df)

silver_df = silver_df.drop("phone_service", "multiple_lines", "phone_multiple_lines")

### Ordinal encoding

Here we encode an **ordinal categorical feature** (`contract`) into numbers.
- We define an explicit order: `"Month-to-month" → 1`, `"One year" → 2`, `"Two year" → 3`.
- A new column `contract_ordinal` is created with these numeric values.
- The original string column is dropped to avoid duplication or accidental misuse.
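
As a plain-Python sketch of the same mapping (the helper `encode_contract` is hypothetical, and returning `None` for an unlisted value is a choice made for this sketch):

```python
# Explicit category order, lowest commitment first
ordered_contracts = ["Month-to-month", "One year", "Two year"]

# Rank each category starting from 1
contract_rank = {
    category: rank for rank, category in enumerate(ordered_contracts, start=1)
}


def encode_contract(value):
    """Return the ordinal rank of a contract type, or None if it is not listed."""
    return contract_rank.get(value)


encoded = [
    encode_contract(value)
    for value in ["One year", "Month-to-month", "Two year", "Unknown"]
]
```

Because the ranks preserve the order of contract length, a model can treat a larger value as a longer commitment.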

In [0]:
from pyspark.sql.functions import col

ordinal_cat = "contract"

ordered_list = [
    "Month-to-month",
    "One year",
    "Two year"
]

# Map each category to its rank, kept as a string so the replacement values match the column type
ordinal_dict = {category: str(index + 1) for index, category in enumerate(ordered_list)}

silver_df = (
    silver_df
    .withColumn(f"{ordinal_cat}_ordinal", col(ordinal_cat))
    .replace(to_replace=ordinal_dict, subset=[f"{ordinal_cat}_ordinal"])
    .withColumn(f"{ordinal_cat}_ordinal", col(f"{ordinal_cat}_ordinal").cast("int"))
)

display(silver_df)

silver_df = silver_df.drop(ordinal_cat)

### Numerical columns scaling

This step scales continuous numeric features for ML models.
- We first gather `tenure`, `monthly_charges`, and `total_charges` into a single vector using `VectorAssembler`.
- `StandardScaler` then standardizes the vector: centering (mean=0) and scaling (unit variance).
- The scaled output goes into `numeric_scaled`, making the data consistent and comparable across features.
- Original columns are dropped to avoid confusion.
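
The arithmetic behind standardization can be sketched for a single column in plain Python. The helper `standardize` is hypothetical and the sample values are made up; the sketch assumes the sample standard deviation (dividing by n - 1), which may differ slightly from other scaling conventions.

```python
from statistics import mean, stdev


def standardize(values):
    """Center values on their mean and scale them to unit sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    return [(value - mu) / sigma for value in values]


# Made-up tenure values for illustration only
tenure = [1.0, 12.0, 24.0, 48.0, 72.0]
scaled = standardize(tenure)
```

After scaling, features measured in months and features measured in currency sit on the same scale, so neither dominates purely because of its units.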

In [0]:
from pyspark.ml.feature import VectorAssembler, StandardScaler

numeric_cols = ["tenure", "monthly_charges", "total_charges"]

assembler = VectorAssembler(
    inputCols=numeric_cols,
    outputCol="numeric_features"
)

silver_df = assembler.transform(silver_df)

scaler = StandardScaler(
    inputCol="numeric_features",
    outputCol="numeric_scaled",
    withMean=True,
    withStd=True
)

scaler_model = scaler.fit(silver_df)
silver_df = scaler_model.transform(silver_df)

display(silver_df.select(*numeric_cols, "numeric_scaled"))

silver_df = silver_df.drop("numeric_features", *numeric_cols)

### Final assembly

In this final step we combine all prepared features into a single vector for modeling.
- We collect all columns except the target (`churn`) as inputs.
- `VectorAssembler` merges them into a single feature vector column, `customer_features`.
- Intermediate columns are dropped to keep the DataFrame clean and ML-ready.

Why:
- Spark ML models expect a single vector column for features.
- Assembling everything at the end simplifies the pipeline and avoids accidental column misuse.
- Ensures consistent input structure for training and inference.

Result: a DataFrame with `churn` as the label and `customer_features` as a unified input vector.
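
Conceptually, the assembly concatenates scalar and vector columns in the order they are listed. A minimal plain-Python sketch of that idea follows; the helper `assemble` is hypothetical and the row values are made up.

```python
def assemble(row, input_cols):
    """Concatenate scalar and vector columns into one flat feature list."""
    features = []
    for column in input_cols:
        value = row[column]
        if isinstance(value, (list, tuple)):
            # Vector columns contribute all of their elements, in order.
            features.extend(float(item) for item in value)
        else:
            # Scalar columns (including integers) contribute a single float.
            features.append(float(value))
    return features


# Made-up row for illustration only
row = {
    "senior_citizen": 0.0,
    "contract_ordinal": 2,
    "numeric_scaled": [-0.5, 0.1, 0.3],
    "churn": 1.0,
}

input_cols = [column for column in row if column != "churn"]
customer_features = assemble(row, input_cols)
```

The label is excluded from the inputs, and the position of every feature in the output is fixed by the column order, which keeps training and inference consistent.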

In [0]:
from pyspark.ml.feature import VectorAssembler

assembly_columns = [column for column in silver_df.columns if column != "churn"]

print(assembly_columns)

assembler = VectorAssembler(
    inputCols=assembly_columns,
    outputCol="customer_features"
)

silver_df = assembler.transform(silver_df)
silver_df = silver_df.drop(*assembly_columns)

display(silver_df)