



# Build a Feature Engineering Pipeline

In this demo, we load a dataset and build a feature engineering pipeline that handles imputation and transformation. We fit the pipeline on the training set, apply it to the training and test sets, and display the results. Finally, we save the fitted pipeline to disk so that the same preparation steps can be reused consistently in later machine learning tasks.

**Learning Objectives:**

*By the end of this demo, you will be able to:*

* Create a data preparation and feature engineering pipeline with multiple steps.
* Create a pipeline with tasks for data imputation and transformation.
* Apply a data preparation and feature engineering pipeline to a training/modeling set and a holdout set.
* Display the results of the transformation.
* Save a data preparation and feature engineering pipeline for potential future use.


## Requirements

Please review the following requirements before starting the lesson:

* To run this notebook, you need to use one of the following Databricks runtime(s): **13.3.x-cpu-ml-scala2.12, 13.3.x-scala2.12**


## Classroom Setup

Before starting the demo, run the provided classroom setup script. This script will define configuration variables necessary for the demo. Execute the following cell:

In [0]:
%run ../Includes/Classroom-Setup-01

Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.


Resetting the learning environment:
| dropping the catalog "labuser8052946_1734373459_n1iz_da"...(0 seconds)

Skipping install of existing datasets to "dbfs:/mnt/dbacademy-datasets/data-preparation-for-machine-learning/v01"

Validating the locally installed datasets:
| listing local files...(0 seconds)
| validation completed...(0 seconds total)
Creating & using the catalog "labuser8052946_1734373459_n1iz_da"...(1 seconds)

Predefined tables in "labuser8052946_1734373459_n1iz_da.default":
| -none-

Predefined paths variables:
| DA.paths.working_dir: dbfs:/mnt/dbacademy-users/labuser8052946_1734373459@vocareum.com/data-preparation-for-machine-learning
| DA.paths.datasets:    dbfs:/mnt/dbacademy-datasets/data-preparation-for-machine-learning/v01

Setup completed (2 seconds)


**Other Conventions:**

Throughout this demo, we'll refer to the object `DA`. This object, provided by Databricks Academy, contains **variables such as your username, catalog name, schema name, working directory, and dataset locations**. Run the code block below to view these details:

In [0]:
print(f"Username:          {DA.username}")
print(f"Catalog Name:      {DA.catalog_name}")
print(f"Schema Name:       {DA.schema_name}")
print(f"Working Directory: {DA.paths.working_dir}")
print(f"Dataset Location:  {DA.paths.datasets}")

Username:          labuser8052946_1734373459@vocareum.com
Catalog Name:      labuser8052946_1734373459_n1iz_da
Schema Name:       default
Working Directory: dbfs:/mnt/dbacademy-users/labuser8052946_1734373459@vocareum.com/data-preparation-for-machine-learning
Dataset Location:  dbfs:/mnt/dbacademy-datasets/data-preparation-for-machine-learning/v01



## Data Preparation

Before building the pipeline, we will ensure consistency in the dataset by converting Integer and Boolean columns to Double data types and addressing missing values in both numeric and string columns within the **`Telco`** dataset. These are the steps we will follow in this section.

1. Load dataset

1. Split dataset into train and test sets

1. Convert Integer and Boolean columns to Double

1. Handle missing values

  * Numeric Columns

  * String Columns


### Load Dataset

In [0]:
from pyspark.sql.functions import when, col

# Load dataset
dataset_path = f"{DA.paths.datasets}/telco/telco-customer-churn-missing.csv"
telco_df = spark.read.csv(dataset_path, header="true", inferSchema="true", multiLine="true", escape='"')

# Select columns of interest
telco_df = telco_df.select("gender", "SeniorCitizen", "Partner", "tenure", "InternetService", "Contract", "PaperlessBilling", "PaymentMethod", "TotalCharges", "Churn")

Quick pre-processing: replace the literal string `"null"` with real nulls, then set the types of the following columns:
* `SeniorCitizen` as `boolean`
* `TotalCharges` as `double`

In [0]:
# replace "null" values with Null
for column in telco_df.columns:
  telco_df = telco_df.withColumn(column, when(col(column) == "null", None).otherwise(col(column)))

# clean-up columns
telco_df = telco_df.withColumn("SeniorCitizen", when(col("SeniorCitizen")==1, True).otherwise(False))
telco_df = telco_df.withColumn("TotalCharges", col("TotalCharges").cast("double"))

display(telco_df)

gender,SeniorCitizen,Partner,tenure,InternetService,Contract,PaperlessBilling,PaymentMethod,TotalCharges,Churn
,False,Yes,1.0,DSL,Month-to-month,Yes,Electronic check,29.85,No
Male,False,No,34.0,DSL,One year,No,Mailed check,1889.5,No
Male,False,No,2.0,DSL,Month-to-month,Yes,Mailed check,108.15,Yes
Male,False,No,45.0,DSL,One year,No,Bank transfer (automatic),1840.75,No
Female,False,No,,Fiber optic,Month-to-month,Yes,Electronic check,151.65,Yes
Female,False,No,8.0,Fiber optic,Month-to-month,Yes,Electronic check,820.5,Yes
Male,False,No,,Fiber optic,Month-to-month,Yes,Credit card (automatic),1949.4,No
Female,False,No,10.0,DSL,Month-to-month,No,Mailed check,301.9,No
,False,Yes,28.0,Fiber optic,Month-to-month,Yes,Electronic check,3046.05,Yes
Male,False,No,62.0,DSL,One year,No,Bank transfer (automatic),3487.95,No


### Train / Test Split

In [0]:
train_df, test_df = telco_df.randomSplit([.8, .2], seed=42)
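
`randomSplit` assigns each row to a split at random according to the normalized weights, so the resulting sizes are only approximately 80/20 rather than exact. The plain-Python sketch below illustrates that idea only; it is not Spark's implementation, and the helper `random_split` and the sample rows are hypothetical.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split by drawing a uniform number per row."""
    total = sum(weights)
    # Cumulative upper bounds, e.g. [0.8, 1.0] for weights [0.8, 0.2]
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            # The last bucket also catches any floating-point rounding leftovers
            if draw < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


rows = list(range(1000))
train_rows, test_rows = random_split(rows, [0.8, 0.2], seed=42)
print(len(train_rows), len(test_rows))  # close to 800 / 200, but not exact
```

Because the assignment is random per row, small datasets can deviate noticeably from the requested proportions; fixing the seed only makes the same draw repeatable.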

### Transform Dataset

In [0]:
from pyspark.sql.types import IntegerType, BooleanType, StringType, DoubleType
from pyspark.sql.functions import col, count, when


# Get a list of integer & boolean columns
integer_cols = [column.name for column in train_df.schema.fields if (column.dataType == IntegerType() or column.dataType == BooleanType())]

# Cast each integer/boolean column to double in both the train and test sets
for column in integer_cols:
    train_df = train_df.withColumn(column, col(column).cast("double"))
    test_df = test_df.withColumn(column, col(column).cast("double"))

string_cols = [c.name for c in train_df.schema.fields if c.dataType == StringType()]
num_cols = [c.name for c in train_df.schema.fields if c.dataType == DoubleType()]

# Get a list of columns with missing values
# Numerical
num_missing_values_logic = [count(when(col(column).isNull(),column)).alias(column) for column in num_cols]
row_dict_num = train_df.select(num_missing_values_logic).first().asDict()
num_missing_cols = [column for column in row_dict_num if row_dict_num[column] > 0]

# String
string_missing_values_logic = [count(when(col(column).isNull(),column)).alias(column) for column in string_cols]
row_dict_string = train_df.select(string_missing_values_logic).first().asDict()
string_missing_cols = [column for column in row_dict_string if row_dict_string[column] > 0]

print(f"Numeric columns with missing values: {num_missing_cols}")
print(f"String columns with missing values: {string_missing_cols}")

Numeric columns with missing values: ['tenure', 'TotalCharges']
String columns with missing values: ['gender', 'Partner', 'InternetService', 'PaymentMethod']
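
The `count(when(col(c).isNull(), c))` idiom used above works because `when` without an `otherwise` yields null for rows that fail the condition, and `count` skips nulls, so only the rows where the column is missing are counted. A plain-Python analogue on made-up rows (the sample data and the helper `count_missing` are hypothetical):

```python
rows = [
    {"tenure": 1.0, "gender": None},
    {"tenure": None, "gender": "Male"},
    {"tenure": None, "gender": "Female"},
]


def count_missing(rows, column):
    # Mirror when(...) with no otherwise: a marker for missing rows, None for the rest
    flagged = [column if row[column] is None else None for row in rows]
    # Mirror count(...): only non-None entries are counted
    return sum(1 for value in flagged if value is not None)


missing_counts = {c: count_missing(rows, c) for c in ["tenure", "gender"]}
columns_with_missing = [c for c, n in missing_counts.items() if n > 0]
print(missing_counts)        # {'tenure': 2, 'gender': 1}
print(columns_with_missing)  # ['tenure', 'gender']
```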



## Create a Pipeline

This section defines a Spark ML pipeline for preprocessing a dataset. The pipeline indexes categorical columns, imputes missing values, scales numerical features, one-hot encodes categorical features, and assembles the final feature vector for machine learning.

In this Spark ML pipeline, we preprocess a dataset for predicting customer churn in a telecommunications **`telco`** company. The pipeline includes the following key steps:

* **Convert Categorical Columns to Numerical Indices:**
This step converts categorical columns to numerical indices, allowing the model to process categorical data.

* **Impute Missing Values:**
The Imputer fills in missing values in the **numerical columns that have them (e.g. `tenure`, `TotalCharges`) using the `mode` strategy**, so that these columns are complete before scaling.
**Missing categorical values are encoded as a separate category by the `StringIndexer`, which is configured with `handleInvalid="keep"`.**

* **VectorAssembler and RobustScaler:**
These steps combine relevant numerical columns into a feature vector and then scale the features to reduce sensitivity to outliers.

* **Perform One-Hot Encoding on Categorical Variables:**
This step converts the indexed categorical columns into binary sparse vectors, enabling the model to process categorical data effectively.

* **Pipeline:**
 All these steps are encapsulated in a Pipeline, providing a convenient and reproducible way to preprocess the data for machine learning tasks.
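
To make the categorical handling concrete, here is a plain-Python sketch of what indexing with `handleInvalid="keep"` followed by one-hot encoding produces for a single column. It only mimics the behavior (labels ordered by descending frequency, missing or unseen values sent to an extra index equal to the number of labels). The helper names and sample values are hypothetical, and the alphabetical tie-breaking is an assumption of this sketch.

```python
from collections import Counter


def fit_string_index(values):
    """Learn label -> index, most frequent label first (None is not a label)."""
    counts = Counter(v for v in values if v is not None)
    # Assumption for this sketch: ties are broken alphabetically
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}


def transform_keep(value, mapping):
    """Missing or unseen values go to the extra index len(mapping)."""
    return mapping.get(value, len(mapping))


def one_hot(index, size):
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


train_gender = ["Male", "Female", "Male", None, "Male", "Female"]
mapping = fit_string_index(train_gender)
size = len(mapping) + 1  # one extra slot for the missing/unseen bucket

for raw in ["Male", "Female", None]:
    idx = transform_keep(raw, mapping)
    print(raw, idx, one_hot(idx, size))
```

With two observed labels, a missing value lands on index 2 and the encoded vector has length 3, which matches the `gender_index` and `gender_ohe` values in the transformed output shown later in this demo.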


In [0]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import Imputer, VectorAssembler, RobustScaler, StringIndexer, OneHotEncoder

# Imputer (mode strategy for the double/numeric columns with missing values)
to_impute = num_missing_cols
imputer = Imputer(inputCols=to_impute, outputCols=to_impute, strategy='mode')

# Scale numerical
numerical_assembler = VectorAssembler(inputCols=num_cols, outputCol="numerical_assembled")
numerical_scaler = RobustScaler(inputCol="numerical_assembled", outputCol="numerical_scaled")

# String/Cat Indexer (will encode missing/null as separate index)
string_cols_indexed = [c + '_index' for c in string_cols]
string_indexer = StringIndexer(inputCols=string_cols, outputCols=string_cols_indexed, handleInvalid="keep")

# OHE categoricals
ohe_cols = [column + '_ohe' for column in string_cols]
one_hot_encoder = OneHotEncoder(inputCols=string_cols_indexed, outputCols=ohe_cols, handleInvalid="keep")

# Assembler (All)
feature_cols = ["numerical_scaled"] + ohe_cols
vector_assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")

# Instantiate the pipeline
stages_list = [
    imputer,
    numerical_assembler,
    numerical_scaler,
    string_indexer,
    one_hot_encoder,
    vector_assembler
]

pipeline = Pipeline(stages=stages_list)
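
The `Imputer` above is configured with `strategy='mode'`, so each missing value is replaced by the most frequent value of that column in the data the pipeline is fitted on. The plain-Python sketch below contrasts mode with mean on made-up numbers; the sample values are hypothetical, and picking the smallest value when several tie is a choice made for this sketch only.

```python
from statistics import mean, multimode

tenure = [1.0, 2.0, 2.0, None, 45.0, None, 2.0]
observed = [v for v in tenure if v is not None]

mode_value = min(multimode(observed))  # smallest value if several are tied
mean_value = mean(observed)

filled_with_mode = [mode_value if v is None else v for v in tenure]
filled_with_mean = [mean_value if v is None else v for v in tenure]

print(mode_value, mean_value)
print(filled_with_mode)
```

The mode always returns a value that was actually observed, while the mean is pulled toward large values such as `45.0`.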


## Fit the Pipeline

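In Spark ML, **`fitting`** a pipeline means running each estimator stage on a dataset so that it can learn what it needs (for example, imputation values, label-to-index mappings, and scaling statistics). The result is a `PipelineModel` that can transform data.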

In the previous step we created a pipeline. Now, we will fit a model based on the pipeline. This pipeline will index string columns, impute specified columns, scale numerical columns, one-hot-encode specified columns, and finally create a vector from all input columns.


In [0]:
pipeline_model = pipeline.fit(train_df)


Next, we can use this fitted pipeline model to transform any dataset that has the same input columns.

In [0]:
# Transform both train_df and test_df
train_transformed_df = pipeline_model.transform(train_df)
test_transformed_df = pipeline_model.transform(test_df)
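
Fitting on `train_df` only and then calling `transform` on both sets means that every learned quantity (imputation values, label indices, scaling ranges) comes from the training data and is reused unchanged on the test data. A toy estimator in plain Python illustrates that split between learning and applying; the class `ModeFiller` and the sample values are hypothetical and not part of Spark.

```python
from collections import Counter


class ModeFiller:
    """Toy estimator: fit() learns a fill value, transform() applies it."""

    def fit(self, values):
        observed = [v for v in values if v is not None]
        self.fill_value_ = Counter(observed).most_common(1)[0][0]
        return self

    def transform(self, values):
        return [self.fill_value_ if v is None else v for v in values]


train = [2.0, 2.0, 10.0, None]
test = [None, 10.0, 10.0]

model = ModeFiller().fit(train)  # learned from the training values only
print(model.transform(train))
print(model.transform(test))     # the test gap is filled with the training mode
```

Even though `10.0` is the most frequent value in the toy test set, the gap there is filled with `2.0`, the value learned during fitting.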

In [0]:
train_transformed_df.select("features").show(3, truncate=False)

+---------------------------------------------------------------------------------------------------+
|features                                                                                           |
+---------------------------------------------------------------------------------------------------+
|(28,[1,2,5,8,9,13,18,20,25],[0.02040816326530612,0.8199672156740302,1.0,1.0,1.0,1.0,1.0,1.0,1.0])  |
|(28,[1,2,5,8,9,13,17,20,26],[0.02040816326530612,0.006166575599094529,1.0,1.0,1.0,1.0,1.0,1.0,1.0])|
|(28,[1,2,5,8,9,14,18,23,25],[0.02040816326530612,2.3911638435719307,1.0,1.0,1.0,1.0,1.0,1.0,1.0])  |
+---------------------------------------------------------------------------------------------------+
only showing top 3 rows
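
Each row above is printed in Spark's sparse vector notation `(size, [indices], [values])`: only the listed positions are non-zero. The sketch below expands the first row into a dense list in plain Python; the helper `to_dense` is hypothetical.

```python
def to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list."""
    dense = [0.0] * size
    for position, value in zip(indices, values):
        dense[position] = value
    return dense


# First row of the output above
size = 28
indices = [1, 2, 5, 8, 9, 13, 18, 20, 25]
values = [0.02040816326530612, 0.8199672156740302, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

dense = to_dense(size, indices, values)
non_zero = sum(1 for v in dense if v != 0.0)
print(len(dense), non_zero)  # 28 9
```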



## Save and Reuse the Pipeline

Preserving the Telco Customer Churn Prediction pipeline, encompassing the model, parameters, and metadata, is vital for maintaining reproducibility, enabling version control, and facilitating collaboration among team members. This ensures a detailed record of the machine learning workflow. In this section, we will follow these steps:

1. **Save the Pipeline:** Save the fitted pipeline model, including all of its stages, to the working directory. The saved pipeline is stored in the **`spark_pipelines`** folder for clarity.

1. **Explore Loaded Pipeline Stages:** Upon loading the pipeline, inspect the stages to reveal key transformations and understand the sequence of operations applied during the pipeline's execution.



### Save the Pipeline

In [0]:
pipeline_model.save(f"{DA.paths.working_dir}/spark_pipelines")

### Load and Use Saved Model

In [0]:
from pyspark.ml import PipelineModel


# Load the pipeline
loaded_pipeline = PipelineModel.load(f"{DA.paths.working_dir}/spark_pipelines")

# Show pipeline stages
loaded_pipeline.stages

[ImputerModel: uid=Imputer_6a24585d832a, strategy=mode, missingValue=NaN, numInputCols=2, numOutputCols=2,
 VectorAssembler_aa193568007c,
 RobustScalerModel: uid=RobustScaler_14329d14323c, numFeatures=3, withCentering=false, withScaling=true,
 StringIndexerModel: uid=StringIndexer_75060fbb823f, handleInvalid=keep, numInputCols=7, numOutputCols=7,
 OneHotEncoderModel: uid=OneHotEncoder_7c045eab7b63, dropLast=true, handleInvalid=keep, numInputCols=7, numOutputCols=7,
 VectorAssembler_edbf6443b8da]
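
The loaded `RobustScalerModel` reports `withCentering=false, withScaling=true`, which means each numeric feature is divided by a per-feature range (the interquartile range with the default quantile settings) and no median is subtracted. The ranges in the sketch below (49 for `tenure`, roughly 3202.75 for `TotalCharges`) are not read from the model; they are inferred from raw/scaled pairs in this demo's transformed output and should be treated as assumptions. The helper `robust_scale` is hypothetical.

```python
def robust_scale(value, feature_range, median=0.0, with_centering=False):
    """Scale one value: optionally subtract the median, then divide by the range."""
    centered = value - median if with_centering else value
    return centered / feature_range


TENURE_RANGE = 49.0            # assumption inferred from the demo output
TOTAL_CHARGES_RANGE = 3202.75  # assumption inferred from the demo output

print(robust_scale(1.0, TENURE_RANGE))
print(robust_scale(1993.25, TOTAL_CHARGES_RANGE))
```

Because no centering is applied, a raw value of `0.0` stays `0.0` after scaling, and values larger than the range map to numbers above 1.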

In [0]:
# Let's use loaded pipeline to transform the test dataset
test_transformed_df = loaded_pipeline.transform(test_df)
display(test_transformed_df)

gender,SeniorCitizen,Partner,tenure,InternetService,Contract,PaperlessBilling,PaymentMethod,TotalCharges,Churn,numerical_assembled,numerical_scaled,gender_index,Partner_index,InternetService_index,Contract_index,PaperlessBilling_index,PaymentMethod_index,Churn_index,gender_ohe,Partner_ohe,InternetService_ohe,Contract_ohe,PaperlessBilling_ohe,PaymentMethod_ohe,Churn_ohe,features
,0.0,,1.0,Fiber optic,Month-to-month,Yes,Electronic check,1993.25,No,"Map(vectorType -> dense, length -> 3, values -> List(0.0, 1.0, 1993.25))","Map(vectorType -> dense, length -> 3, values -> List(0.0, 0.02040816326530612, 0.6223557879946922))",2.0,2.0,0.0,0.0,0.0,0.0,0.0,"Map(vectorType -> sparse, length -> 3, indices -> List(2), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(2), values -> List(1.0))","Map(vectorType -> sparse, length -> 4, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 4, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 5, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 28, indices -> List(1, 2, 5, 8, 9, 13, 17, 20, 25), values -> List(0.02040816326530612, 0.6223557879946922, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))"
,0.0,,3.0,,Month-to-month,No,Mailed check,237.65,No,"Map(vectorType -> dense, length -> 3, values -> List(0.0, 3.0, 237.65))","Map(vectorType -> dense, length -> 3, values -> List(0.0, 0.061224489795918366, 0.07420185777847163))",2.0,2.0,3.0,0.0,1.0,1.0,0.0,"Map(vectorType -> sparse, length -> 3, indices -> List(2), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(2), values -> List(1.0))","Map(vectorType -> sparse, length -> 4, indices -> List(3), values -> List(1.0))","Map(vectorType -> sparse, length -> 4, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 5, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 28, indices -> List(1, 2, 5, 8, 12, 13, 18, 21, 25), values -> List(0.061224489795918366, 0.07420185777847163, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))"
,0.0,,3.0,Fiber optic,Month-to-month,No,Electronic check,257.05,Yes,"Map(vectorType -> dense, length -> 3, values -> List(0.0, 3.0, 257.05))","Map(vectorType -> dense, length -> 3, values -> List(0.0, 0.061224489795918366, 0.08025915229099993))",2.0,2.0,0.0,0.0,1.0,0.0,1.0,"Map(vectorType -> sparse, length -> 3, indices -> List(2), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(2), values -> List(1.0))","Map(vectorType -> sparse, length -> 4, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 4, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 5, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 28, indices -> List(1, 2, 5, 8, 9, 13, 18, 20, 26), values -> List(0.061224489795918366, 0.08025915229099993, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))"
,0.0,,7.0,DSL,Month-to-month,Yes,Mailed check,19.75,Yes,"Map(vectorType -> dense, length -> 3, values -> List(0.0, 7.0, 19.75))","Map(vectorType -> dense, length -> 3, values -> List(0.0, 0.14285714285714285, 0.006166575599094529))",2.0,2.0,1.0,0.0,0.0,1.0,1.0,"Map(vectorType -> sparse, length -> 3, indices -> List(2), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(2), values -> List(1.0))","Map(vectorType -> sparse, length -> 4, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 4, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 5, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 28, indices -> List(1, 2, 5, 8, 10, 13, 17, 21, 26), values -> List(0.14285714285714285, 0.006166575599094529, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))"
,0.0,,14.0,DSL,One year,Yes,Mailed check,773.2,No,"Map(vectorType -> dense, length -> 3, values -> List(0.0, 14.0, 773.2))","Map(vectorType -> dense, length -> 3, values -> List(0.0, 0.2857142857142857, 0.241417531808602))",2.0,2.0,1.0,2.0,0.0,1.0,0.0,"Map(vectorType -> sparse, length -> 3, indices -> List(2), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(2), values -> List(1.0))","Map(vectorType -> sparse, length -> 4, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 4, indices -> List(2), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 5, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 28, indices -> List(1, 2, 5, 8, 10, 15, 17, 21, 25), values -> List(0.2857142857142857, 0.241417531808602, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))"
,0.0,,25.0,No,One year,No,Mailed check,520.1,No,"Map(vectorType -> dense, length -> 3, values -> List(0.0, 25.0, 520.1))","Map(vectorType -> dense, length -> 3, values -> List(0.0, 0.5102040816326531, 0.16239169463742098))",2.0,2.0,2.0,2.0,1.0,1.0,0.0,"Map(vectorType -> sparse, length -> 3, indices -> List(2), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(2), values -> List(1.0))","Map(vectorType -> sparse, length -> 4, indices -> List(2), values -> List(1.0))","Map(vectorType -> sparse, length -> 4, indices -> List(2), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 5, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 28, indices -> List(1, 2, 5, 8, 11, 15, 18, 21, 25), values -> List(0.5102040816326531, 0.16239169463742098, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))"
,0.0,,39.0,DSL,Month-to-month,No,Electronic check,2184.35,Yes,"Map(vectorType -> dense, length -> 3, values -> List(0.0, 39.0, 2184.35))","Map(vectorType -> dense, length -> 3, values -> List(0.0, 0.7959183673469387, 0.6820232612598548))",2.0,2.0,1.0,0.0,1.0,0.0,1.0,"Map(vectorType -> sparse, length -> 3, indices -> List(2), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(2), values -> List(1.0))","Map(vectorType -> sparse, length -> 4, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 4, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 5, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 28, indices -> List(1, 2, 5, 8, 10, 13, 18, 20, 26), values -> List(0.7959183673469387, 0.6820232612598548, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))"
,0.0,,47.0,No,Two year,No,Credit card (automatic),873.4,No,"Map(vectorType -> dense, length -> 3, values -> List(0.0, 47.0, 873.4))","Map(vectorType -> dense, length -> 3, values -> List(0.0, 0.9591836734693877, 0.27270314573413473))",2.0,2.0,2.0,1.0,1.0,3.0,0.0,"Map(vectorType -> sparse, length -> 3, indices -> List(2), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(2), values -> List(1.0))","Map(vectorType -> sparse, length -> 4, indices -> List(2), values -> List(1.0))","Map(vectorType -> sparse, length -> 4, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 5, indices -> List(3), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 28, indices -> List(1, 2, 5, 8, 11, 14, 18, 23, 25), values -> List(0.9591836734693877, 0.27270314573413473, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))"
,0.0,,68.0,Fiber optic,Month-to-month,Yes,Electronic check,7320.9,No,"Map(vectorType -> dense, length -> 3, values -> List(0.0, 68.0, 7320.9))","Map(vectorType -> dense, length -> 3, values -> List(0.0, 1.3877551020408163, 2.2858168761220825))",2.0,2.0,0.0,0.0,0.0,0.0,0.0,"Map(vectorType -> sparse, length -> 3, indices -> List(2), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(2), values -> List(1.0))","Map(vectorType -> sparse, length -> 4, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 4, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 5, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 28, indices -> List(1, 2, 5, 8, 9, 13, 17, 20, 25), values -> List(1.3877551020408163, 2.2858168761220825, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))"
,0.0,,68.0,No,Two year,No,Mailed check,1657.4,No,"Map(vectorType -> dense, length -> 3, values -> List(0.0, 68.0, 1657.4))","Map(vectorType -> dense, length -> 3, values -> List(0.0, 1.3877551020408163, 0.5174927796424948))",2.0,2.0,2.0,1.0,1.0,1.0,0.0,"Map(vectorType -> sparse, length -> 3, indices -> List(2), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(2), values -> List(1.0))","Map(vectorType -> sparse, length -> 4, indices -> List(2), values -> List(1.0))","Map(vectorType -> sparse, length -> 4, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 5, indices -> List(1), values -> List(1.0))","Map(vectorType -> sparse, length -> 3, indices -> List(0), values -> List(1.0))","Map(vectorType -> sparse, length -> 28, indices -> List(1, 2, 5, 8, 11, 14, 18, 21, 25), values -> List(1.3877551020408163, 0.5174927796424948, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0))"
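
The final `features` vector concatenates `numerical_scaled` and the one-hot columns in the order given to the last `VectorAssembler`. Using the vector lengths shown in the output above (3 for the scaled numerics, then 3, 3, 4, 4, 3, 5, 3 for the one-hot columns), the sketch below maps each position of the 28-element vector back to the column it came from. The helper `slot_owner` is hypothetical, and the block order assumes `string_cols` follows the column order of the displayed output.

```python
blocks = [
    ("numerical_scaled", 3),
    ("gender_ohe", 3),
    ("Partner_ohe", 3),
    ("InternetService_ohe", 4),
    ("Contract_ohe", 4),
    ("PaperlessBilling_ohe", 3),
    ("PaymentMethod_ohe", 5),
    ("Churn_ohe", 3),
]


def slot_owner(position):
    """Return (block name, offset inside the block) for a features position."""
    start = 0
    for name, length in blocks:
        if position < start + length:
            return name, position - start
        start += length
    raise IndexError(position)


# Non-zero positions of the first row in the output above
for position in [1, 2, 5, 8, 9, 13, 17, 20, 25]:
    print(position, slot_owner(position))
```

Note that `Churn_ohe` occupies the last three positions because `Churn` is a string column and is therefore part of `string_cols`. If `features` is meant as model input, the label would typically be excluded from that list before assembling.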



## Clean up Classroom

Run the following cell to remove lesson-specific assets created during this lesson.

In [0]:
DA.cleanup()

Resetting the learning environment:
| dropping the catalog "labuser8052946_1734373459_n1iz_da"...(0 seconds)
| removing the working directory "dbfs:/mnt/dbacademy-users/labuser8052946_1734373459@vocareum.com/data-preparation-for-machine-learning"...(0 seconds)

Validating the locally installed datasets:
| listing local files...(1 seconds)
| validation completed...(1 seconds total)



## Conclusion

In summary, the feature engineering pipeline showcased in this demo offers a systematic and consistent approach to imputation and transformation. Fitting it once on the training set and applying it to both the training and test sets keeps data preparation efficient and reproducible.

The final step of saving the pipeline to disk ensures future usability, enhancing the overall effectiveness of the data preparation process.


&copy; 2024 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the 
<a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/><a href="https://databricks.com/privacy-policy">Privacy Policy</a> | 
<a href="https://databricks.com/terms-of-use">Terms of Use</a> | 
<a href="https://help.databricks.com/">Support</a>