## Module 2: Perform data prep and cleansing using Apache Spark and Data Wrangler

Now that you have ingested and explored the data, you can transform it. You can either write the code yourself in a notebook or let Data Wrangler generate it for you.

**Data Wrangler** is a notebook-based tool with an easy-to-use interface for exploring data. It shows data in a grid and offers automatic summary statistics, built-in visualizations, and a library of common data-cleaning operations. Each operation takes only a few clicks, previews its effect on the data right away, and generates pandas or PySpark code that can be saved back to the notebook for future use.




#### Launching Data Wrangler

To explore and transform any pandas DataFrame in your notebook, launch Data Wrangler directly from the notebook.

>[!NOTE]
>Data Wrangler cannot be opened while the notebook kernel is busy. Any running cell must finish executing before you launch Data Wrangler.



In [7]:
import pandas as pd

# Read the Delta table into a Spark DataFrame

data_df = spark.read.format("delta").load("Tables/heartFailure")

# Convert to a pandas df (needed to use Data Wrangler)
df = data_df.toPandas()
display(df)



**NOTE**: Once you open Data Wrangler, you will not be able to see the notebook. To keep these instructions visible while you work in Data Wrangler, duplicate this browser tab first.

1. Under the notebook ribbon Home tab, select Launch Data Wrangler. You'll see a list of activated pandas DataFrames available for editing.

2. From the list of pandas DataFrames, select `df`, the DataFrame you created in the previous cell, to open it in Data Wrangler.


![image-alt-text](https://github.com/lesantana/wthds/blob/main/1-fdatascience.png?raw=true)

Data Wrangler launches and generates a descriptive overview of your data. The table in the middle shows each data column.

![image-alt-text](https://github.com/lesantana/wthds/blob/main/2-fdatascience.png?raw=true)


The Summary panel next to the table shows information about the DataFrame. When you select a column in the table, the summary updates with information about that column. In some instances, the data displayed and summarized is a truncated view of your DataFrame; when this happens, a warning image appears in the Summary panel. That is not the case for this DataFrame.
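
For reference, roughly the same information can be pulled with plain pandas. The sketch below is not what Data Wrangler runs internally; it uses a small hypothetical DataFrame (`sample`) with made-up values, and in the notebook you would call the same methods on `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `df`; the values are made up for illustration.
sample = pd.DataFrame({
    "Age": [60, 63, 59, 58],
    "Sex": ["M", "M", "F", "M"],
    "Cholesterol": [218.0, np.nan, 233.0, 458.0],
})

# Overall shape and per-column data types.
n_rows, n_cols = sample.shape
column_types = sample.dtypes

# Per-column counts of missing and distinct values.
missing_per_column = sample.isna().sum()
distinct_per_column = sample.nunique()

# Descriptive statistics (count, mean, std, min, quartiles, max) for numeric columns.
numeric_summary = sample.describe()

print(f"{n_rows} rows, {n_cols} columns")
print(column_types)
print(missing_per_column)
print(distinct_per_column)
print(numeric_summary)
```

Selecting a single column in Data Wrangler corresponds loosely to calling `sample["Cholesterol"].describe()` on one column instead of the whole DataFrame.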

>Each operation you do can be applied in a matter of clicks, updating the data display in real time and generating code that you can save back to your notebook as a reusable function.


### Let's start cleaning the data with Data Wrangler

> Several preprocessing steps are important for developing robust, efficient, and reliable machine learning models.  

&nbsp;
 ###### 1. Removing unnecessary columns from a dataset before training a machine learning model is a best practice that enhances model performance, improves interpretability & reduces complexity.
   a. On the **Operations** panel, expand **Schema** and select **Drop columns**.

   b. Select **RowNumber**. The column appears in red in the preview to show that the code changes it (in this case, drops it).
   
   c. Select **Apply**. A new step is created in the **Cleaning steps** panel on the bottom left.
&nbsp;   
&nbsp;   
![image-alt-text](https://github.com/lesantana/wthds/blob/main/3-fdatascience.png?raw=true)

&nbsp;
###### 2. Many machine learning algorithms do not support missing values. Dropping rows with missing values ensures compatibility with a wide range of algorithms without needing additional imputation strategies.

   a. On the **Operations** panel, select **Find and replace**, and then select **Drop missing values**.
   
   b. Select the **RestingBP**, **Cholesterol** and **FastingBS** columns. These are the columns flagged as containing missing values.

   c. Select **Apply**. A new step is created in the **Cleaning steps** panel on the bottom left.

&nbsp;
&nbsp;
![image-alt-text](https://github.com/lesantana/wthds/blob/main/4-fdatascience.png?raw=true)

&nbsp;
###### 3. Handling duplicate rows is an essential step in data preparation because it ensures data quality.

   a. On the **Operations** panel, select **Find and replace**, and then select **Drop duplicate rows**.

   b. Select **Apply**. A new step is created in the **Cleaning steps** panel on the bottom left.

&nbsp;
&nbsp;
   ![image-alt-text](https://github.com/lesantana/wthds/blob/main/5-fdatascience.png?raw=true)

&nbsp;   

###### 4. Select **Add code to notebook** at the top left to close Data Wrangler and add the code to the notebook automatically. **Add code to notebook** wraps the generated code in a function.

 &nbsp;
   ![image-alt-text](https://github.com/lesantana/wthds/blob/main/5.5.1-datascience.png?raw=true)
   ![image-alt-text](https://github.com/lesantana/wthds/blob/main/7-fdatascience.png?raw=true)


In [8]:
# Code generated by Data Wrangler for pandas DataFrame

def clean_data(df):
    # Drop column: 'RowNumber'
    df = df.drop(columns=['RowNumber'])
    # Drop rows with missing data in columns: 'RestingBP', 'Cholesterol', 'FastingBS'
    df = df.dropna(subset=['RestingBP', 'Cholesterol', 'FastingBS'])
    # Drop duplicate rows across all columns
    df = df.drop_duplicates()
    return df

df_clean = clean_data(df.copy())
df_clean.head()


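| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 60 | M | ASY | 132.0 | 218.0 | 0.0 | ST | 140 | Y | 1.5 | Down | 1 |
| 2 | 63 | M | ASY | 170.0 | 177.0 | 0.0 | Normal | 84 | Y | 2.5 | Down | 1 |
| 3 | 59 | M | ASY | 122.0 | 233.0 | 0.0 | Normal | 117 | Y | 1.3 | Down | 1 |
| 4 | 58 | M | ASY | 132.0 | 458.0 | 1.0 | Normal | 69 | N | 1.0 | Down | 0 |
| 7 | 62 | M | ASY | 158.0 | 210.0 | 1.0 | Normal | 112 | Y | 3.0 | Down | 1 |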
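
Before relying on the cleaned DataFrame, it can help to measure how many rows each cleaning step removes and to confirm that the result meets expectations. The following is a minimal sketch that applies the same three operations one at a time. The DataFrame `raw` and its values are hypothetical; in the notebook you would start from `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the raw data; the values are made up.
raw = pd.DataFrame({
    "RowNumber": [1, 2, 3, 4, 5],
    "Age": [60, 63, 63, 58, 62],
    "RestingBP": [132.0, 170.0, 170.0, np.nan, 158.0],
    "Cholesterol": [218.0, 177.0, 177.0, 458.0, 210.0],
    "FastingBS": [0.0, 0.0, 0.0, 1.0, 1.0],
})
checked_columns = ["RestingBP", "Cholesterol", "FastingBS"]

# Apply the three cleaning operations separately so each effect can be measured.
step1 = raw.drop(columns=["RowNumber"])
step2 = step1.dropna(subset=checked_columns)
step3 = step2.drop_duplicates()

rows_with_missing = len(step1) - len(step2)
duplicate_rows = len(step2) - len(step3)
print(f"Rows dropped for missing values: {rows_with_missing}")
print(f"Duplicate rows dropped: {duplicate_rows}")
print(f"Rows remaining: {len(step3)} of {len(raw)}")

# Post-conditions that should hold for the cleaned DataFrame.
assert "RowNumber" not in step3.columns
assert not step3[checked_columns].isna().any().any()
assert not step3.duplicated().any()
```

Note that the duplicate check runs after `RowNumber` is dropped. While a unique row identifier is still present, no two rows compare as equal, so the order of the cleaning steps matters.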


## Feature Engineering

###### In the feature engineering process, especially when dealing with categorical data, encoding is a crucial step. One of the simplest methods for converting categorical values into numerical values is using the **LabelEncoder**. Here’s how and why you would use LabelEncoder in the feature engineering process:


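###### **Label encoding** converts categorical variables into numerical labels by assigning each unique category value a unique integer. This method is best suited to ordinal categorical data, where the categories have a natural order. Note that scikit-learn's LabelEncoder assigns the integers in sorted (alphabetical) order of the category values, which does not necessarily match that natural order.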
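
As a small illustration of what label encoding does, the sketch below encodes a made-up list of chest pain categories (not taken from the dataset rows) and then reads the mapping back from the fitted encoder. The variable names are chosen for this example only.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample of a categorical column.
chest_pain = ["ASY", "NAP", "ATA", "ASY", "TA"]

encoder = LabelEncoder()
codes = encoder.fit_transform(chest_pain)

# classes_ holds the sorted unique categories; each category's position is its code.
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'ASY': 0, 'ATA': 1, 'NAP': 2, 'TA': 3}

# inverse_transform recovers the original labels from the integer codes.
restored = encoder.inverse_transform(codes)
print(list(restored))
```

Because the codes follow sorted order, `ASY` becomes 0 simply because it sorts first, not because it ranks lowest in any clinical sense.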

###### 1. Checking if datatypes are numerical

In [9]:
df_clean.dtypes


Age                 int32
Sex                object
ChestPainType      object
RestingBP         float64
Cholesterol       float64
FastingBS         float64
RestingECG         object
MaxHR               int32
ExerciseAngina     object
Oldpeak           float64
ST_Slope           object
HeartDisease        int32
dtype: object
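
The `object` columns in the output above are the ones that still need encoding. As a sketch, the non-numeric columns and their distinct values can also be listed programmatically. The DataFrame `sample` below is a hypothetical stand-in with made-up values; in the notebook you would use `df_clean`.

```python
import pandas as pd

# Hypothetical stand-in for `df_clean`; the values are made up.
sample = pd.DataFrame({
    "Age": [60, 63, 59],
    "Sex": ["M", "F", "M"],
    "ST_Slope": ["Down", "Flat", "Up"],
    "Oldpeak": [1.5, 2.5, 1.3],
})

# Columns whose dtype is not numeric are the candidates for encoding.
categorical_columns = [
    col for col in sample.columns
    if not pd.api.types.is_numeric_dtype(sample[col])
]

# Distinct values per categorical column, to judge how many labels each needs.
unique_values = {col: sorted(sample[col].unique()) for col in categorical_columns}

print(categorical_columns)
print(unique_values)
```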

###### 2. Transforming categorical values

In [10]:
# Import the LabelEncoder class from the sklearn.preprocessing module.
# LabelEncoder is provided by scikit-learn, a popular Python machine learning library,
# and encodes categorical labels as numerical values.
# Then create an instance of LabelEncoder and assign it to the variable lab.

from sklearn.preprocessing import LabelEncoder
import pandas as pd
lab = LabelEncoder()


The following code transforms the categorical values. Complete the code according to the numbered steps. If needed, ask Bing Chat for help.

In [11]:
# 1. Initialize the DataFrame to be encoded (copy so that df_clean stays unchanged).
data_df1 = df_clean.copy()
# 2. Split object and non-object columns using select_dtypes with include and exclude.
obj = data_df1.select_dtypes(include='object').copy()
not_obj = data_df1.select_dtypes(exclude='object')
# 3. Encode each categorical column as integer labels.
for col in obj.columns:
    obj[col] = lab.fit_transform(obj[col])
# 4. Combine encoded and non-encoded data.
df_new = pd.concat([obj, not_obj], axis=1)
# 5. Display the first 10 rows to evaluate.
df_new.head(10)


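| | Sex | ChestPainType | RestingECG | ExerciseAngina | ST_Slope | Age | RestingBP | Cholesterol | FastingBS | MaxHR | Oldpeak | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 2 | 1 | 0 | 60 | 132.0 | 218.0 | 0.0 | 140 | 1.5 | 1 |
| 2 | 1 | 0 | 1 | 1 | 0 | 63 | 170.0 | 177.0 | 0.0 | 84 | 2.5 | 1 |
| 3 | 1 | 0 | 1 | 1 | 0 | 59 | 122.0 | 233.0 | 0.0 | 117 | 1.3 | 1 |
| 4 | 1 | 0 | 1 | 0 | 0 | 58 | 132.0 | 458.0 | 1.0 | 69 | 1.0 | 0 |
| 7 | 1 | 0 | 1 | 1 | 0 | 62 | 158.0 | 210.0 | 1.0 | 112 | 3.0 | 1 |
| 8 | 1 | 0 | 2 | 1 | 0 | 61 | 120.0 | 282.0 | 0.0 | 135 | 4.0 | 1 |
| 9 | 1 | 0 | 1 | 1 | 0 | 59 | 125.0 | 222.0 | 0.0 | 135 | 2.5 | 1 |
| 11 | 1 | 2 | 0 | 0 | 0 | 76 | 104.0 | 113.0 | 0.0 | 120 | 3.5 | 1 |
| 12 | 1 | 0 | 2 | 1 | 0 | 70 | 170.0 | 192.0 | 0.0 | 129 | 3.0 | 1 |
| 15 | 1 | 0 | 1 | 1 | 0 | 64 | 134.0 | 273.0 | 0.0 | 102 | 4.0 | 1 |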
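
Reusing a single encoder instance for every column means that, after the loop, the encoder only remembers the categories of the last column it was fitted on. If the label-to-integer mappings are needed later (for example, to interpret model output), one option is to keep one encoder per column. The sketch below assumes this variation; the names `encoders`, `encoded`, and `sample` are hypothetical and the values are made up.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for `df_clean`; the values are made up.
sample = pd.DataFrame({
    "Sex": ["M", "F", "M"],
    "ExerciseAngina": ["Y", "N", "Y"],
    "Age": [60, 63, 59],
})
categorical_columns = ["Sex", "ExerciseAngina"]

# Fit a separate encoder per column and keep it for later lookups.
encoders = {}
encoded = sample.copy()
for col in categorical_columns:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(sample[col])

# The mapping for any column can be read back from its own encoder.
sex_mapping = {str(label): int(code) for code, label in enumerate(encoders["Sex"].classes_)}
original_sex = encoders["Sex"].inverse_transform(encoded["Sex"])

print(encoded)
print(sex_mapping)
```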


**NOTE:** You can also explore how to one-hot encode the categorical columns with Data Wrangler. However, this will not create labels in your existing columns, but rather a new column for each category with True and False values. Using this alternative format might need some modification to the code in the model training process. Please discuss this possibility with hack attendees to raise awareness of this Data Wrangler feature.
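
To show attendees what the one-hot format looks like, the sketch below uses `pandas.get_dummies` on a hypothetical DataFrame with made-up values. This is plain pandas and is not necessarily the exact code Data Wrangler generates for its one-hot operation.

```python
import pandas as pd

# Hypothetical sample; the values are made up.
sample = pd.DataFrame({
    "ST_Slope": ["Down", "Flat", "Up", "Flat"],
    "Age": [60, 63, 59, 58],
})

# Replace ST_Slope with one indicator column per category.
one_hot = pd.get_dummies(sample, columns=["ST_Slope"])

print(one_hot.columns.tolist())
print(one_hot)
```

The original `ST_Slope` column is replaced by `ST_Slope_Down`, `ST_Slope_Flat`, and `ST_Slope_Up`. In recent pandas versions these indicator columns are boolean, while older versions produce 0/1 integers, so downstream training code may need a cast such as `astype(int)`.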

#### Save processed data to a Delta Table

Refer to notebook 1 for more information about V-Order and optimized write.

In [12]:
spark.conf.set("spark.sql.parquet.vorder.enabled", "true") # Enable V-Order (Verti-Parquet) write
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "true") # Enable automatic delta optimized write


Refer to notebook 1 for more information on writing delta tables to the lakehouse.

In [13]:
table_name = "heartfailure_processed"
data_df_processed = spark.createDataFrame(df_new)
data_df_processed.write.mode("overwrite").format("delta").save(f"Tables/{table_name}")
print(f"Spark dataframe saved to delta table: {table_name}")


Spark dataframe saved to delta table: heartfailure_processed


In [14]:
%%sql

select * from heartfailure_processed limit 100;


<Spark SQL result set with 100 rows and 12 fields>