
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img
    src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png"
    alt="Databricks Learning"
  >
</div>


# Exploratory Data Analysis and Feature Engineering

In this lesson, we’ll walk you through basic exploratory data analysis and the process of creating and storing a feature table in the Feature Store. We’ll begin by demonstrating how to load data into a Spark DataFrame, view essential statistical information, and perform visual analysis using both built-in tools and code. Next, we’ll create a feature table, showing you how to store and explore it within the Feature Store UI. By the end of this demo, you should have a foundational understanding of the key steps involved in creating a feature table for Feature Engineering.

## **Learning Objectives**:

_By the end of this demo, you will be able to:_


1. **Perform Basic Exploratory Data Analysis (EDA):**
    - Utilize Spark and Pandas to store our data as a DataFrame.
    - Use built-in functionality to analyze data from a statistical perspective. Additionally, we will visualize the summary statistics. 


2. **Introduction to Feature Engineering with Databricks:**
    - Create a feature table and store it in Feature Store from a PySpark DataFrame.
    - Inspect the Feature Store table using the UI and from the notebook.

## REQUIRED - SELECT CLASSIC COMPUTE
Before executing cells in this notebook, please select your classic compute cluster in the lab. Be aware that **Serverless** is enabled by default.

Follow these steps to select the classic compute cluster:
1. Navigate to the top-right of this notebook and click the drop-down menu to select your cluster. By default, the notebook will use **Serverless**.

2. If your cluster is available, select it and continue to the next cell. If the cluster is not shown:

   - Click **More** in the drop-down.

   - In the **Attach to an existing compute resource** window, use the first drop-down to select your unique cluster.

**NOTE:** If your cluster has terminated, you might need to restart it in order to select it. To do this:

1. Right-click on **Compute** in the left navigation pane and select *Open in new tab*.

2. Find the triangle icon to the right of your compute cluster name and click it.

3. Wait a few minutes for the cluster to start.

4. Once the cluster is running, complete the steps above to select your cluster.


## Requirements

Please review the following requirements before starting the lesson:

* To run this notebook, you need to use one of the following Databricks runtime(s): **16.4.x-cpu-ml-scala2.12**


## Classroom Setup

Before starting the lesson, we need to build some data assets and define some configuration variables required for this demonstration. The output of the following cell is hidden to keep the notebook uncluttered. To view the output, hover over the next cell and click the eye icon.

The cell after the setup, titled `View Setup Variables`, displays the various variables that were created. You can click the Catalog icon in the notebook space to the right to see that your catalog was created with no data.

In [0]:
%run ../Includes/Classroom-Setup-1

In [0]:
print(f"Username:          {DA.username}")
print(f"Catalog Name:      {DA.catalog_name}")
print(f"Schema Name:       {DA.schema_name}")
print(f"Working Directory: {DA.paths.working_dir}")

## Part 1: Perform Basic Exploratory Data Analysis (EDA)

In this section, we will show how you can utilize Databricks Notebooks for exploratory analysis. This will be presented in two flavors: built-in tools and demonstrative custom code.

### Read and Inspect the Dataset

In this section, we will use a fictional dataset from a wine rating company, which includes measurements ranging from acidity to pH levels. Ideally, a data scientist or machine learning practitioner would take this dataset and perform various feature engineering tasks in order to predict the `quality` rating of the wine.

The next cell creates one table, `wine_quality_table`. We then load it into two DataFrames, one using Spark and another using pandas.


In [0]:
DA.create_demo_table()

In [0]:
df = spark.read.table('wine_quality_table')
pdf = df.toPandas()

display(df)

### Inspect Statistics: Numerical Values and Visuals

Here we will show different ways to display and visualize descriptive statistics.
1. `dbutils.data.summarize(<spark_or_pandas_dataframe>)` - This method separates the numerical and categorical features of your Spark or pandas DataFrame. It also displays histograms and quartile estimates. The generated profile offers options such as resizing and feature search. You can consider this a more managed approach to summarizing statistics.
2. `<spark_or_pandas_dataframe>.describe()` - This method returns only a table of summary statistics. You can get a profile like the one generated by `dbutils.data.summarize` by adding a data profile to the displayed output.
    - Click on the **+** icon and select **data profile**.
3. `display(<spark_or_pandas_dataframe>)` - This returns the table. From this, we can build a visual to inspect the feature variables.
4. Custom code - We can use the pandas DataFrame along with other Python libraries to build custom visualizations.
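
To make the columns of the `describe` output concrete, here is a minimal standard-library sketch of the statistics that Spark's `df.describe()` reports for a numeric column: count, mean, sample standard deviation, min, and max. The pH readings and the helper name `describe_column` are invented for illustration and do not come from `wine_quality_table`.

```python
import statistics

# Hypothetical pH readings, invented for illustration only.
ph_values = [3.0, 3.2, 3.4, 3.6, 3.8]


def describe_column(values):
    """Return the summary statistics Spark's describe() reports for a numeric column."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        # Sample standard deviation (n - 1 in the denominator).
        "stddev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


summary = describe_column(ph_values)
print(summary)
```

The pandas version of `describe()` additionally reports the 25%, 50%, and 75% quantiles.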

In [0]:
dbutils.data.summarize(pdf)

In [0]:
display(df.describe())

In [0]:
print(pdf.describe())

In [0]:
display(df)

In [0]:
from pyspark.sql import functions as F

# Find the min, Q1, median, Q3, and max of pH, grouped by the quality rating

column_stats = 'pH'

display(df.groupBy('quality').agg(
    F.min(column_stats).alias('min'),
    F.expr(f'percentile({column_stats}, 0.25)').alias('Q1'),
    F.expr(f'percentile({column_stats}, 0.5)').alias('median'),
    F.expr(f'percentile({column_stats}, 0.75)').alias('Q3'),
    F.max(column_stats).alias('max')
))
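
The grouped aggregation above can be hard to picture without a cluster at hand. The following is a minimal plain-Python sketch of the same idea on made-up `(quality, pH)` pairs. It assumes that the quartiles are obtained by linear interpolation between neighboring sorted values, which is what `statistics.quantiles` does with `method="inclusive"`; the data and the helper name `five_number_summary` are hypothetical.

```python
import statistics
from collections import defaultdict

# Hypothetical (quality, pH) pairs, invented for illustration only.
rows = [
    (5, 3.1), (5, 3.2), (5, 3.3), (5, 3.4), (5, 3.5),
    (6, 3.0), (6, 3.2), (6, 3.4),
]

# Group the pH values by quality rating (the groupBy step).
ph_by_quality = defaultdict(list)
for quality, ph in rows:
    ph_by_quality[quality].append(ph)


def five_number_summary(values):
    """Return min, Q1, median, Q3, and max for a list of numbers."""
    # method="inclusive" interpolates linearly between the sorted data points.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "Q1": q1, "median": median, "Q3": q3, "max": max(values)}


# Apply the summary to each group (the agg step).
summaries = {quality: five_number_summary(values) for quality, values in ph_by_quality.items()}
for quality in sorted(summaries):
    print(quality, summaries[quality])
```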

### Bubble Chart Using GUI Visualization Editor

We can now use the **Visualization Editor** in the Databricks UI to build a bubble chart using our grouped summary statistics.

**Steps:**
1. Create the grouped DataFrame in the following cell.
2. In the output result cell:
   - Click the **+**  dropdown next to Table (top-right of the table display).
   - Select **Visualization**.

3. In the **Visualization Editor**:
   - Select **Bubble** as the visualization type.
   - Under **X column**, select `quality`.
   - Under **Y columns**, select `median_pH`.
   - Under **Group by**, select `count`.
   - Under **Bubble size column**, select `count`.
   - Under **Bubble size coefficient**, confirm that the value is `1`.
   - Leave **Bubble size proportional to** as `Diameter`.

4. Click **Save** to render the chart.

This creates a bubble chart that shows:
- Wine **quality** on the x-axis.
- **Median pH** level on the y-axis.
- **Bubble size** proportional to the number of samples.


In [0]:
from pyspark.sql.functions import expr, count

# Median pH and number of samples for each quality rating
grouped_df = df.groupBy("quality").agg(
    expr("percentile(pH, 0.5)").alias("median_pH"),
    count("pH").alias("count")
)
display(grouped_df)

Databricks visualization. Run in Databricks to view.

## Part 2: Introduction to Feature Engineering on Databricks

After exploring the data, suppose we want to predict `quality`. There are many ways to prepare this dataset, such as outlier analysis. Since this is an introductory lesson, we will keep it simple and add a single feature that categorizes each wine's `pH` as low, average, or high.



### Business Logic

Based on our analysis above, suppose business stakeholders give you the following guidelines for pH levels.

1. Low pH: <= Q1
2. Average pH: > Q1 and < Q3
3. High pH: >= Q3

Let's take this business logic and create a new **feature** and store it in a feature table in our Feature Store.
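
Before moving to Spark, the rule can be written as a plain Python function. This sketch follows the boundaries used by the Spark `when` chain later in this section: values at or below Q1 are Low, values strictly between Q1 and Q3 are Average, and everything else is High. The function name `ph_category` and the example thresholds are hypothetical.

```python
def ph_category(ph, q1, q3):
    """Bucket a pH reading using the quartile thresholds q1 and q3."""
    if ph <= q1:
        return "Low"
    if ph < q3:
        return "Average"
    return "High"


# Hypothetical thresholds, invented for illustration only.
Q1_EXAMPLE, Q3_EXAMPLE = 3.2, 3.4

labels = [ph_category(ph, Q1_EXAMPLE, Q3_EXAMPLE) for ph in (3.0, 3.2, 3.3, 3.4, 3.6)]
print(labels)
```

One difference to keep in mind: in Spark, a null `pH` makes both `when` conditions evaluate to null, so the row falls through to `otherwise` and is labeled High, whereas this Python function would raise an error on `None`.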

In [0]:
feature_variables = ['fixed_acidity',
                     'volatile_acidity',
                     'citric_acid',
                     'pH',
                     'sulphates',
                     'alcohol',
                     'quality']
prediction_variable = 'quality'
primary_key = ['wine_id']

In [0]:
feature_df = df.select(primary_key + feature_variables)
display(feature_df)

In [0]:
from pyspark.sql.functions import col, when

# Compute Q1 and Q3 of pH; a relative error of 0.0 requests exact quantiles
quantiles = feature_df.approxQuantile("pH", [0.25, 0.75], 0.0)

Q1, Q3 = quantiles

# Add the pHCategory feature according to the business logic
feature_df2 = feature_df.withColumn(
    "pHCategory",
    when(col("pH") <= Q1, "Low")
    .when((col("pH") > Q1) & (col("pH") < Q3), "Average")
    .otherwise("High")
)

display(feature_df2)

### Save features to feature table

Now that we have created our feature DataFrame, let's store it as a feature table within Feature Store. We have all the ingredients we need to do this within Databricks Unity Catalog:
1. Feature data (Spark DataFrame)
2. Primary key (the column that uniquely identifies each row)
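
A primary key is only useful if it identifies each row uniquely and is never null. The following is a minimal plain-Python sketch of that check on made-up rows; the helper name `check_primary_key` and the sample data are hypothetical, and the notebook itself does not perform this step.

```python
from collections import Counter


def check_primary_key(rows, key):
    """Return (null_count, duplicated_keys) for the given key column."""
    values = [row.get(key) for row in rows]
    null_count = sum(value is None for value in values)
    counts = Counter(value for value in values if value is not None)
    duplicated = sorted(value for value, n in counts.items() if n > 1)
    return null_count, duplicated


# Hypothetical rows, invented for illustration only.
sample_rows = [
    {"wine_id": 1, "pH": 3.1},
    {"wine_id": 2, "pH": 3.3},
    {"wine_id": 2, "pH": 3.4},
    {"wine_id": None, "pH": 3.0},
]

result = check_primary_key(sample_rows, "wine_id")
print(result)
```

On the Spark DataFrame, the same check could be expressed with `groupBy("wine_id").count()` filtered to counts greater than one.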

In [0]:
from databricks.feature_engineering import FeatureEngineeringClient

# Instantiate the FeatureEngineeringClient
fe = FeatureEngineeringClient()

In [0]:
# Set the feature table name for storage in UC
feature_table_name = f'{DA.catalog_name}.{DA.schema_name}.wine_quality_features'

print(f"The name of the feature table: {feature_table_name}\n\n")

# Create the feature table
fe.create_table(
    name=feature_table_name,
    primary_keys=primary_key,
    df=feature_df2,
    description="Wine quality features",
    tags={"source": "bronze", "format": "delta"}
)

Now, go inspect your feature table using the UI!
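
The table can also be inspected from the notebook. The sketch below is one possible way to do this, assuming the client's `get_table` and `read_table` methods and the `description` and `primary_keys` attributes of the returned table object; the helper name `summarize_feature_table` is hypothetical.

```python
def summarize_feature_table(fe, table_name):
    """Collect basic metadata and the column names of a feature table."""
    # Table metadata (description, primary keys, and so on).
    table = fe.get_table(name=table_name)
    # Table contents as a Spark DataFrame.
    features_df = fe.read_table(name=table_name)
    return {
        "description": table.description,
        "primary_keys": table.primary_keys,
        "columns": features_df.columns,
    }


# In the notebook, assuming `fe` and `feature_table_name` from the cells above:
# print(summarize_feature_table(fe, feature_table_name))
```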

# Conclusion And Next Steps

In this lesson, we learned about basic EDA and how to perform feature engineering and save the result to the Feature Store. Notice that a feature table is simply a Delta table that has a primary key. However, Feature Store allows us to separate the tables that will be used for ML from those that will not. In the next lesson, we will be introduced to AutoML, Databricks' automated machine learning tool, which can be used to establish a baseline model and to verify the predictive power of a given dataset.

&copy; 2026 Databricks, Inc. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg logo are trademarks of the <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br/><br/><a href="https://databricks.com/privacy-policy" target="_blank">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use" target="_blank">Terms of Use</a> | <a href="https://help.databricks.com/" target="_blank">Support</a>