## Module 2: Perform data prep and cleansing using Apache Spark and Data Wrangler

In this challenge, you will prepare the dataset for model training.

Now that you have ingested and explored the data, you can transform it. You can either write and run code in a notebook or use Data Wrangler to generate the code for you.

**Data Wrangler** is a tool used in notebooks. It offers an easy-to-use interface for exploring data. This tool shows data in a grid format, offers automatic summary statistics, built-in visualizations, and a library of common data-cleaning operations. Each operation can be done in just a few clicks. It shows the changes to the data right away and creates code in pandas or PySpark that can be saved back to the notebook for future use.






### Part 1 Launching Data Wrangler

To explore and transform any pandas DataFrame in your notebook, launch Data Wrangler directly from the notebook.

>[!NOTE]
>Data Wrangler cannot be opened while the notebook kernel is busy. Any running cell must finish executing before you launch Data Wrangler.

Reference: [Launching Data Wrangler](https://learn.microsoft.com/en-us/fabric/data-science/data-wrangler#launching-data-wrangler)



In [1]:
import pandas as pd

# Read a CSV into a Pandas DataFrame

load_from_lakehouse=  

df=


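
The two blanks in the cell above are left for you to fill in with the location of your own file. As a minimal sketch of the pattern, assuming a CSV file is reachable from the notebook, the example below writes a tiny illustrative file to a temporary folder and reads it back with `pd.read_csv`. The file name `sample.csv` and its rows are made up for illustration; in the notebook, `load_from_lakehouse` would instead hold the path to your file in the lakehouse.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Write a tiny CSV so the sketch runs anywhere (illustrative rows, not the real dataset).
tmp_dir = Path(tempfile.mkdtemp())
sample_path = tmp_dir / "sample.csv"  # hypothetical file name
sample_path.write_text("Age,Sex,RestingBP\n40,M,140\n49,F,160\n")

# In the notebook, this variable would point at your own file in the lakehouse.
load_from_lakehouse = str(sample_path)

# Read the CSV into a pandas DataFrame.
df = pd.read_csv(load_from_lakehouse)
print(df.shape)
```

Once `df` exists and the cell has finished running, the DataFrame becomes available to Data Wrangler.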


### Part 2 Data cleaning through Data Wrangler

Several preprocessing steps are important for developing robust, efficient, and reliable machine learning models.

 1. Removing unnecessary columns from a dataset before training a machine learning model is a best practice that can improve model performance, aid interpretability, and reduce complexity.
 2. Many machine learning algorithms do not support missing values. Dropping rows with missing values ensures compatibility with a wide range of algorithms without needing additional imputation strategies.
 3. Handling duplicate rows is an essential step in data preparation because duplicates over-weight repeated records and can skew both training and evaluation.
 4. Most machine learning algorithms operate on numerical data (integers, floats, etc.). If you feed them non-numeric data (e.g., strings), they won’t work, so you must adjust the data types of some columns.


Reference: [Browsing and applying data-cleaning operations](https://learn.microsoft.com/en-us/fabric/data-science/data-wrangler#browsing-data-cleaning-operations)
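
For orientation, the four steps above can be sketched by hand in pandas. The code that Data Wrangler generates typically resembles a single `clean_data` function like the one below, but treat this as an approximation rather than the tool's exact output. The column `RowId` and the sample rows are hypothetical and stand in for whatever columns you decide to drop or convert.

```python
import numpy as np
import pandas as pd


def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    # 1. Drop a column that carries no predictive signal (hypothetical column name).
    df = df.drop(columns=["RowId"])
    # 2. Drop rows that contain missing values in any column.
    df = df.dropna()
    # 3. Drop rows that are exact duplicates of an earlier row.
    df = df.drop_duplicates()
    # 4. Cast a column to the numeric type the model expects.
    df = df.astype({"Age": "int64"})
    return df


# Illustrative data: one missing Age and one duplicated record.
raw = pd.DataFrame(
    {
        "RowId": [1, 2, 3, 4],
        "Age": [40.0, 49.0, np.nan, 40.0],
        "Sex": ["M", "F", "F", "M"],
    }
)

df_clean = clean_data(raw.copy())
print(df_clean)
```

Note that the surviving rows keep their original index labels, which is why the index in the cleaned output can have gaps.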


In [2]:
# Code generated by Data Wrangler for pandas DataFrame



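| Unnamed: 0 | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |
| 6 | 45 | F | ATA | 130 | 237 | 0 | Normal | 170 | N | 0.0 | Up | 0 |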


## Feature Engineering

In the feature engineering process, especially when dealing with categorical data, encoding is a crucial step. One of the simplest methods for converting categorical values into numerical values is using the **LabelEncoder**. Here’s how and why you would use LabelEncoder in the feature engineering process:


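**Label encoding** converts a categorical variable into numerical labels by assigning each unique category value a unique integer. This method is best suited to ordinal categorical data, where the categories have a natural order. Note that scikit-learn's `LabelEncoder` assigns integers in sorted order of the category values, which does not necessarily match that natural order.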
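
As a small illustration of what the encoder does to a single column, the sketch below fits `LabelEncoder` on a handful of slope values. The value `Down` is an assumption added so that the example has three categories; the integer codes follow the alphabetical order of the categories.

```python
from sklearn.preprocessing import LabelEncoder

lab = LabelEncoder()

# Fit on the unique values and replace each value with its integer code.
codes = lab.fit_transform(["Up", "Flat", "Down", "Flat"])

print(lab.classes_.tolist())  # ['Down', 'Flat', 'Up']
print(codes.tolist())         # [2, 1, 0, 1]

# The mapping is reversible, which helps when interpreting results later.
print(lab.inverse_transform([0, 2]).tolist())  # ['Down', 'Up']
```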

1. Check whether the data types are numerical.

In [17]:
# Input the code to check the datatypes


Age                 int64
Sex                object
ChestPainType      object
RestingBP           int64
Cholesterol         int64
FastingBS           int64
RestingECG         object
MaxHR               int64
ExerciseAngina     object
Oldpeak           float64
ST_Slope           object
HeartDisease        int64
dtype: object
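
One possible way to produce a listing like the one above is the `dtypes` attribute of a pandas DataFrame. The sketch below uses a small made-up DataFrame in place of the cleaned data and also lists the columns that still need encoding; the name `non_numeric` is introduced here for illustration only.

```python
import pandas as pd

# Illustrative stand-in for the cleaned DataFrame.
df_clean = pd.DataFrame(
    {
        "Age": [40, 49],
        "Sex": ["M", "F"],
        "Oldpeak": [0.0, 1.0],
    }
)

# Show the data type of every column.
print(df_clean.dtypes)

# List the columns that are not numeric and therefore need encoding.
non_numeric = df_clean.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```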

2. Transform the categorical values. Run the following cell as is to get started with the encoding process.

In [None]:
# Import the LabelEncoder class from the sklearn.preprocessing module.
# LabelEncoder is provided by scikit-learn, a popular Python machine learning library, and encodes categorical labels as numerical values.
# Create an instance of the LabelEncoder class and assign it to the variable lab.

from sklearn.preprocessing import LabelEncoder
import pandas as pd
lab = LabelEncoder()


Complete the code below to transform the categorical values, following the numbered steps. Ask Bing Chat for help if needed.

In [None]:
# 1. Initialize a DataFrame to be cleaned.
data_df1 = df_clean
# 2. Split object and non-object data types. You could use select_dtypes with its include and exclude parameters for that.
obj =
not_obj =
# 3. Encode categorical columns.
for i in range(0, obj.shape[1]):
  obj.iloc[:,i] = lab.fit_transform(obj.iloc[:,i])

# 4. Combine encoded and non-encoded data.
df_new =
# 5. Display the first 10 rows to evaluate.
df_new.head(10)



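| Unnamed: 0 | Sex | ChestPainType | RestingECG | ExerciseAngina | ST_Slope | Age | RestingBP | Cholesterol | FastingBS | MaxHR | Oldpeak | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 0 | 2 | 40 | 140 | 289 | 0 | 172 | 0.0 | 0 |
| 1 | 0 | 2 | 1 | 0 | 1 | 49 | 160 | 180 | 0 | 156 | 1.0 | 1 |
| 3 | 0 | 0 | 1 | 1 | 1 | 48 | 138 | 214 | 0 | 108 | 1.5 | 1 |
| 4 | 1 | 2 | 1 | 0 | 2 | 54 | 150 | 195 | 0 | 122 | 0.0 | 0 |
| 6 | 0 | 1 | 1 | 0 | 2 | 45 | 130 | 237 | 0 | 170 | 0.0 | 0 |
| 7 | 1 | 1 | 1 | 0 | 2 | 54 | 110 | 208 | 0 | 142 | 0.0 | 0 |
| 9 | 0 | 1 | 1 | 0 | 2 | 48 | 120 | 284 | 0 | 120 | 0.0 | 0 |
| 17 | 0 | 1 | 1 | 0 | 2 | 43 | 120 | 201 | 0 | 165 | 0.0 | 0 |
| 18 | 1 | 0 | 1 | 0 | 1 | 60 | 100 | 248 | 0 | 125 | 1.0 | 1 |
| 19 | 1 | 1 | 1 | 0 | 1 | 36 | 120 | 267 | 0 | 160 | 3.0 | 1 |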
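
If you get stuck, here is one possible way to work through the steps, shown on a small made-up DataFrame rather than the real data. It is a sketch, not the only valid answer: it splits the columns with `select_dtypes`, encodes each non-numeric column, and joins the two halves with `pd.concat`, which places the encoded columns first as in the output above. It assigns whole columns with `obj[col] = ...` instead of writing through `iloc`, an alternative chosen here so that the encoded columns end up with an integer dtype.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative stand-in for df_clean (values are made up for the sketch).
df_clean = pd.DataFrame(
    {
        "Age": [40, 49, 48],
        "Sex": ["M", "F", "F"],
        "ST_Slope": ["Up", "Flat", "Flat"],
        "HeartDisease": [0, 1, 1],
    }
)

lab = LabelEncoder()

# 1. Work on a copy so the cleaned DataFrame stays untouched.
data_df1 = df_clean.copy()

# 2. Split non-numeric and numeric columns.
obj = data_df1.select_dtypes(exclude="number").copy()
not_obj = data_df1.select_dtypes(include="number")

# 3. Encode each categorical column as integers.
for col in obj.columns:
    obj[col] = lab.fit_transform(obj[col])

# 4. Combine encoded and non-encoded columns side by side.
df_new = pd.concat([obj, not_obj], axis=1)

# 5. Display the first 10 rows to evaluate.
print(df_new.head(10))
```

Because the same `lab` instance is refitted for every column, it only remembers the mapping of the last column. Keep one encoder per column if you need to reverse the encoding later.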


#### Save processed data to a Delta Table

Refer to notebook 1 for more information about V-Order and Optimize Write.

In [14]:
spark.conf.set("spark.sql.parquet.vorder.enabled", "true") # Enable Verti-Parquet write
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true") # Enable automatic delta optimized write


Refer to notebook 1 for more information on writing delta tables to the lakehouse.

In [None]:
# Save the DataFrame as a Delta table. Complete the code below.
table_name = "heartfailure_processed"
data_df_processed = spark.createDataFrame(df_new)
data_df_processed.


Spark dataframe saved to delta table: heartfailure_processed
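
As a hedged sketch of how the write step could be completed, the helper below chains the standard `DataFrameWriter` calls `mode`, `format`, and `save`. The function name `save_as_delta` is hypothetical, and the relative `Tables/<name>` path assumes that a default lakehouse is attached to the notebook; notebook 1 remains the reference for the approach used in this workshop.

```python
def save_as_delta(spark_df, table_name: str, mode: str = "overwrite") -> str:
    """Write a Spark DataFrame as a Delta table and return the target path."""
    # Assumption: a default lakehouse is attached, so a relative Tables/ path resolves to it.
    path = f"Tables/{table_name}"
    # Overwrite any existing table of the same name and store the data in Delta format.
    spark_df.write.mode(mode).format("delta").save(path)
    print(f"Spark dataframe saved to delta table: {table_name}")
    return path
```

In the notebook, this would be called as `save_as_delta(data_df_processed, table_name)`.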


In [16]:
%%sql

select * from heartfailure_processed limit 100;


<Spark SQL result set with 100 rows and 12 fields>