In [0]:
%pip install mlflow==3.1.0
# If you hit issues, make sure this version matches your AutoML dependency version. For production usage, use env_manager='conda'.
%pip install azure-core azure-storage-file-datalake #for the display() in Azure only
dbutils.library.restartPython()

# Data engineering with Databricks - Building our C360 database

Building a C360 database requires ingesting data from multiple sources.

It's a complex process that combines batch loads and streaming ingestion to support real-time insights, used for personalization and marketing targeting, among other things.

Ingesting, transforming and cleaning data to create clean SQL tables for our downstream users (Data Analysts and Data Scientists) is complex.

<link href="https://fonts.googleapis.com/css?family=DM Sans" rel="stylesheet"/>
<div style="width: 300px; height: 300px; text-align: center; float: right; margin: 30px 60px 10px 10px; font-family: 'DM Sans'; border-radius: 50%; border: 25px solid #fcba33ff; box-sizing: border-box; overflow: hidden;">
  <div style="display: flex; flex-direction: column; align-items: center; justify-content: center; height: 100%; width: 100%;">
    <div style="font-size: 70px; color: #70c4ab; font-weight: bold;">
      73%
    </div>
    <div style="color: #1b5162; padding: 0 30px; text-align: center;">
      of enterprise data goes unused for analytics and decision making
    </div>
  </div>
  <div style="color: #bfbfbf; padding-top: 5px;">
    Source: Forrester
  </div>
</div>

<br>


## <img src="https://raw.githubusercontent.com/databricks-demos/dbdemos-resources/refs/heads/main/images/john.png" style="float:left; margin: -35px 0px 0px 0px" width="80px"> John, as a Data Engineer, spends immense time...

* Hand-coding data ingestion & transformations and dealing with technical challenges:<br>
  *Supporting streaming and batch, handling concurrent operations, small files issues, GDPR requirements, complex DAG dependencies...*<br><br>
* Building custom frameworks to enforce quality and tests<br><br>
* Building and maintaining scalable infrastructure, with observability and monitoring<br><br>
* Managing incompatible governance models from different systems
<br style="clear: both">

This results in **operational complexity** and overhead, requires expert profiles, and ultimately **puts data projects at risk**.


# Simplify Ingestion and Transformation with Lakeflow Connect & DLT

<img src="https://github.com/databricks-demos/dbdemos-resources/blob/main/images/cross_demo_assets/Lakehouse_Demo_Team_architecture_1.png?raw=true" style="float: right" width="500px">

In this notebook, we'll work as a Data Engineer to build our C360 database. <br>
We'll consume and clean our raw data sources to prepare the tables required for our BI & ML workloads.

We want to ingest the datasets below from Salesforce Sales Cloud and blob storage (`/demos/retail/churn/`) incrementally into our Data Warehousing tables:

- Customer profile data *(name, age, address etc)*
- Orders history *(what our customer bought over time)*
- Streaming Events from our application *(when was the last time customers used the application, typically a stream from a Kafka queue)*


<a href="https://www.databricks.com/resources/demos/tours/platform/discover-databricks-lakeflow-connect-demo" target="_blank"><img src="https://github.com/databricks-demos/dbdemos-resources/blob/main/images/product/lakeflow-connect-anim.gif?raw=true" style="float: right; margin-right: 20px" width="250px"></a>

## 1/ Ingest data with Lakeflow Connect


Lakeflow Connect offers built-in data ingestion connectors for popular SaaS applications, databases and file sources (such as Salesforce, Workday, and SQL Server), so you can build incremental data pipelines at scale, fully integrated with Databricks.


## 2/ Prepare and transform your data with DLT

<div>
  <div style="width: 45%; float: left; margin-bottom: 10px; padding-right: 45px">
    <p style="min-height: 65px;">
      <img style="width: 50px; float: left; margin: 0px 5px 30px 0px;" src="https://raw.githubusercontent.com/diganparikh-dp/Images/refs/heads/main/Icons/LakeFlow%20Connect.jpg"/> 
      <strong>Efficient end-to-end ingestion</strong> <br/>
      Enable analysts and data engineers to innovate rapidly with simple pipeline development and maintenance 
    </p>
    <p>
      <img style="width: 50px; float: left; margin: 0px 5px 30px 0px;" src="https://raw.githubusercontent.com/diganparikh-dp/Images/refs/heads/main/Icons/LakeFlow%20Pipelines.jpg"/> 
      <strong>Flexible and easy setup</strong> <br/>
      By automating complex administrative tasks and gaining broader visibility into pipeline operations
    </p>
  </div>
  <div style="width: 48%; float: left">
    <p style="min-height: 65px;">
      <img style="width: 50px; float: left; margin: 0px 5px 30px 0px;" src="https://raw.githubusercontent.com/QuentinAmbard/databricks-demo/main/retail/resources/images/lakehouse-retail/logo-trust.png"/> 
      <strong>Trust your data</strong> <br/>
      With built-in orchestration, quality controls and quality monitoring to ensure accurate and useful BI, Data Science, and ML 
    </p>
    <p>
      <img style="width: 50px; float: left; margin: 0px 5px 30px 0px;" src="https://raw.githubusercontent.com/QuentinAmbard/databricks-demo/main/retail/resources/images/lakehouse-retail/logo-stream.png"/> 
      <strong>Simplify batch and streaming</strong> <br/>
      With self-optimization and auto-scaling data pipelines for batch or streaming processing 
    </p>
</div>
</div>

<br style="clear:both">

## Building a DLT pipeline to analyze and reduce churn

In this example, we'll implement an end-to-end DLT pipeline consuming our customer information. We'll use the medallion architecture, but we could also build a star schema, a data vault or any other data model.

We'll incrementally load new data with Auto Loader, enrich this information and then load a model from MLflow to perform our customer churn prediction.

This information will then be used to build our DBSQL dashboard to track customer behavior and churn.

Let's implement the following flow: 
 
<div><img width="1100px" src="https://github.com/databricks-demos/dbdemos-resources/blob/main/images/retail/lakehouse-churn/lakehouse-retail-churn-de.png?raw=true"/></div>

*Note that we're including the ML model our [Data Scientist built]($../04-Data-Science-ML/04.1-automl-churn-prediction) using Databricks AutoML to predict the churn. We'll cover that in the next section.*

Your DLT Pipeline has been installed and started for you! Open the <a dbdemos-pipeline-id="dlt-churn" href="#joblist/pipelines/bef07be0-ca9f-471f-ab5e-cb73b2a7c024" target="_blank">Churn DLT pipeline</a> to see it in action.<br/>
*(Note: the pipeline will automatically start once the initialization job is completed. This might take a few minutes; check the installation logs for more details.)*

### 1/ Loading our data using Databricks Autoloader (cloud_files)
<div style="float:right">
  <img width="500px" src="https://github.com/QuentinAmbard/databricks-demo/raw/main/retail/resources/images/lakehouse-retail/lakehouse-retail-churn-de-small-1.png"/>
</div>
  
Auto Loader allows us to efficiently ingest millions of files from cloud storage, and supports efficient schema inference and evolution at scale.

For more details on Auto Loader, run `dbdemos.install('auto-loader')`

Let's use it in our pipeline to ingest the raw JSON & CSV data being delivered to our blob storage `/demos/retail/churn/...`. 
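
Each bronze table in this step carries an expectation on `_rescued_data IS NULL`. As a rough mental model, Auto Loader keeps values that do not fit the expected schema in a rescued-data column instead of silently discarding them, so a non-null `_rescued_data` flags a record that drifted from the schema. The plain-Python sketch below only illustrates that idea; `EXPECTED_SCHEMA` and `parse_record` are hypothetical names, the sample records are made up, and the real behaviour depends on the configured schema evolution mode.

```python
import json

# Hypothetical expected schema for an order record: column name -> Python type.
EXPECTED_SCHEMA = {"id": str, "user_id": str, "amount": int}

def parse_record(raw: dict) -> dict:
    """Split a raw record into typed columns plus a rescued-data payload."""
    row = {}
    rescued = {}
    for name, value in raw.items():
        expected_type = EXPECTED_SCHEMA.get(name)
        if expected_type is not None and isinstance(value, expected_type):
            row[name] = value
        else:
            # Unknown column or type mismatch: keep the value instead of dropping it.
            rescued[name] = value
    # Columns that could not be populated are left as NULL (None).
    for name in EXPECTED_SCHEMA:
        row.setdefault(name, None)
    row["_rescued_data"] = json.dumps(rescued) if rescued else None
    return row

clean = parse_record({"id": "o1", "user_id": "u1", "amount": 20})
drifted = parse_record({"id": "o2", "user_id": "u2", "amount": "n/a", "coupon": "X"})
```

Here `clean` passes the `_rescued_data IS NULL` check, while `drifted` would be counted as a violation because its `amount` and `coupon` values end up in the rescued payload.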

In [0]:
import dlt
from pyspark.sql import functions as F

@dlt.create_table(comment="Application events and sessions")
@dlt.expect("App events correct schema", "_rescued_data IS NULL")
def churn_app_events():
  return (
    spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("cloudFiles.inferColumnTypes", "true")
      .load("/Volumes/data_pioneers/c360/c360/events"))

In [0]:
@dlt.create_table(comment="Spending score from raw data")
@dlt.expect("Orders correct schema", "_rescued_data IS NULL")
def churn_orders_bronze():
  return (
    spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.inferColumnTypes", "true")
      .load("/Volumes/data_pioneers/c360/c360/orders"))

In [0]:
@dlt.create_table(comment="Raw user data coming from json files ingested in incremental with Auto Loader to support schema inference and evolution")
@dlt.expect("Users correct schema", "_rescued_data IS NULL")
def churn_users_bronze():
  return (
    spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.inferColumnTypes", "true")
      .load("/Volumes/data_pioneers/c360/c360/users"))

### 2/ Enforce quality and materialize our tables for Data Analysts
<div style="float:right">
  <img width="500px" src="https://github.com/QuentinAmbard/databricks-demo/raw/main/retail/resources/images/lakehouse-retail/lakehouse-retail-churn-de-small-2.png"/>
</div>

The next layer, often called silver, consumes **incremental** data from the bronze layer and cleans up some information.

We're also adding [expectations](https://docs.databricks.com/workflows/delta-live-tables/delta-live-tables-expectations.html) on several fields to enforce and track our data quality. This ensures that our dashboards are relevant and makes it easy to spot potential errors caused by data anomalies.
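
The bronze tables use `expect`, which records violations but keeps the rows, while the silver tables use `expect_or_drop`, which removes violating rows from the output. The sketch below mimics that difference in plain Python on made-up rows; `apply_expectation` is a hypothetical helper, not part of the DLT API.

```python
from typing import Callable

def apply_expectation(rows, predicate: Callable[[dict], bool], drop: bool):
    """Return (kept_rows, metrics) for one expectation.

    drop=False mimics `expect`: violations are counted but the rows are kept.
    drop=True mimics `expect_or_drop`: violating rows are removed from the output.
    """
    passed = [r for r in rows if predicate(r)]
    failed = [r for r in rows if not predicate(r)]
    kept = passed if drop else list(rows)
    metrics = {"passed_records": len(passed), "failed_records": len(failed)}
    return kept, metrics

# Made-up rows: the second one violates "user_id IS NOT NULL".
users = [{"user_id": "u1"}, {"user_id": None}, {"user_id": "u3"}]

def has_id(row: dict) -> bool:
    return row["user_id"] is not None

kept_warn, metrics_warn = apply_expectation(users, has_id, drop=False)
kept_drop, metrics_drop = apply_expectation(users, has_id, drop=True)
```

Both calls report the same metrics (two passed, one failed); only the set of rows that flows downstream differs.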

For more advanced DLT capabilities, run `dbdemos.install('dlt-loans')`, or `dbdemos.install('dlt-cdc')` for a CDC/SCD Type 2 example.

These tables are clean and ready to be used by the BI team!

In [0]:
@dlt.create_table(comment="User data cleaned and anonymized for analysis.")
@dlt.expect_or_drop("user_valid_id", "user_id IS NOT NULL")
def churn_users():
  return (dlt
          .read_stream("churn_users_bronze")
          .select(F.col("id").alias("user_id"),
                  F.sha1(F.col("email")).alias("email"), 
                  F.to_timestamp(F.col("creation_date"), "MM-dd-yyyy HH:mm:ss").alias("creation_date"), 
                  F.to_timestamp(F.col("last_activity_date"), "MM-dd-yyyy HH:mm:ss").alias("last_activity_date"), 
                  F.initcap(F.col("firstname")).alias("firstname"), 
                  F.initcap(F.col("lastname")).alias("lastname"), 
                  F.col("address"), 
                  F.col("canal"), 
                  F.col("country"),
                  F.col("gender").cast("int").alias("gender"),
                  F.col("age_group").cast("int").alias("age_group"), 
                  F.col("churn").cast("int").alias("churn")))

In [0]:
@dlt.create_table(comment="Order data cleaned and anonymized for analysis.")
@dlt.expect_or_drop("order_valid_id", "order_id IS NOT NULL")
@dlt.expect_or_drop("order_valid_user_id", "user_id IS NOT NULL")
def churn_orders():
  return (dlt
          .read_stream("churn_orders_bronze")
          .select(F.col("amount").cast("int").alias("amount"),
                  F.col("id").alias("order_id"),
                  F.col("user_id"),
                  F.col("item_count").cast("int").alias("item_count"),
                  F.to_timestamp(F.col("transaction_date"), "MM-dd-yyyy HH:mm:ss").alias("creation_date"))
         )
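
The column-level cleaning applied in the silver tables can be checked on a single record outside Spark. The sketch below reproduces a few of those steps with the Python standard library; `clean_user` is a hypothetical helper, the sample record is made up, and `str.capitalize()` only matches Spark's `initcap` for single-word values.

```python
import hashlib
from datetime import datetime

def clean_user(raw: dict) -> dict:
    """Apply the same kind of cleaning as the silver user table to one record."""
    return {
        "user_id": raw["id"],
        # SHA-1 hex digest of the UTF-8 email, comparable to F.sha1 on a string column.
        "email": hashlib.sha1(raw["email"].encode("utf-8")).hexdigest(),
        # Spark's "MM-dd-yyyy HH:mm:ss" pattern corresponds to this strptime format.
        "creation_date": datetime.strptime(raw["creation_date"], "%m-%d-%Y %H:%M:%S"),
        "firstname": raw["firstname"].capitalize(),
        "churn": int(raw["churn"]),
    }

# Made-up record for illustration only.
sample = {"id": "u1", "email": "jane@example.com",
          "creation_date": "03-15-2023 08:30:00", "firstname": "jANE", "churn": "1"}
cleaned = clean_user(sample)
```

Note that hashing an email is pseudonymization rather than strong anonymization: the same address always yields the same digest, so known addresses can still be matched against the hashed column.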

### 3/ Aggregate and join data to create our ML features
<div style="float:right">
  <img width="500px" src="https://github.com/QuentinAmbard/databricks-demo/raw/main/retail/resources/images/lakehouse-retail/lakehouse-retail-churn-de-small-3.png"/>
</div>

We're now ready to create the features required for our Churn prediction.

We need to enrich our user dataset with extra information that our model will use to help predict churn, such as:

* last order date
* number of items bought
* number of actions on our website
* device used (ios/iphone)
* ...
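
As a small illustration of these features, the sketch below computes a per-user order count, total amount and days since the last order in plain Python. The order tuples and the `order_features` helper are made up for illustration; the pipeline itself computes the equivalent values with Spark aggregations and joins.

```python
from collections import defaultdict
from datetime import date

# Made-up orders for illustration: (user_id, amount, order_date).
orders = [
    ("u1", 30, date(2024, 1, 10)),
    ("u1", 20, date(2024, 2, 1)),
    ("u2", 15, date(2024, 1, 20)),
]

def order_features(orders, today: date) -> dict:
    """Aggregate orders per user, similar in spirit to groupBy("user_id").agg(...)."""
    stats = defaultdict(
        lambda: {"order_count": 0, "total_amount": 0, "last_transaction": None}
    )
    for user_id, amount, order_date in orders:
        s = stats[user_id]
        s["order_count"] += 1
        s["total_amount"] += amount
        if s["last_transaction"] is None or order_date > s["last_transaction"]:
            s["last_transaction"] = order_date
    for s in stats.values():
        # Whole days between the reference date and the most recent order.
        s["days_since_last_order"] = (today - s["last_transaction"]).days
    return dict(stats)

features = order_features(orders, today=date(2024, 3, 1))
```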

In [0]:
@dlt.create_table(comment="Final user table with all information for Analysis / ML")
def churn_features():
  churn_app_events_stats_df = (dlt
          .read("churn_app_events")
          .groupby("user_id")
          .agg(F.first("platform").alias("platform"),
               F.count('*').alias("event_count"),
               F.count_distinct("session_id").alias("session_count"),
               F.max(F.to_timestamp("date", "MM-dd-yyyy HH:mm:ss")).alias("last_event"))
                              )
  
  churn_orders_stats_df = (dlt
          .read("churn_orders")
          .groupby("user_id")
          .agg(F.count('*').alias("order_count"),
               F.sum("amount").alias("total_amount"),
               F.sum("item_count").alias("total_item"),
               F.max("creation_date").alias("last_transaction"))
         )
  
  return (dlt
          .read("churn_users")
          .join(churn_app_events_stats_df, on="user_id")
          .join(churn_orders_stats_df, on="user_id")
          .withColumn("days_since_creation", F.datediff(F.current_timestamp(), F.col("creation_date")))
          .withColumn("days_since_last_activity", F.datediff(F.current_timestamp(), F.col("last_activity_date")))
          .withColumn("days_last_event", F.datediff(F.current_timestamp(), F.col("last_event")))
         )

### 4/ Enriching the gold data with an ML model
<div style="float:right">
  <img width="500px" src="https://github.com/QuentinAmbard/databricks-demo/raw/main/retail/resources/images/lakehouse-retail/lakehouse-retail-churn-de-small-4.png"/>
</div>

Our Data Science team has built a churn prediction model using AutoML and saved it in the Databricks Model Registry. 

One of the key values of the Lakehouse is that we can easily load this model and predict churn directly in our pipeline. 

Note that we don't have to worry about the model framework (sklearn or other): MLflow abstracts that for us.
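
To give an intuition for this kind of abstraction, the sketch below wraps two differently shaped scoring functions behind one `predict` method. It is a conceptual illustration only and not how MLflow is implemented; `UniformModel`, `FrameworkAModel` and `framework_b_infer` are hypothetical names, and the scoring rules are made up.

```python
from typing import Callable, List

class UniformModel:
    """Wraps any framework-specific scoring callable behind one `predict` method."""

    def __init__(self, predict_fn: Callable[[List[dict]], List[int]]):
        self._predict_fn = predict_fn

    def predict(self, rows: List[dict]) -> List[int]:
        return self._predict_fn(rows)

# Two hypothetical "frameworks" with different native call styles.
class FrameworkAModel:
    def score(self, rows):
        return [1 if r["days_since_last_activity"] > 30 else 0 for r in rows]

def framework_b_infer(rows):
    return [1 if r["order_count"] == 0 else 0 for r in rows]

models = [UniformModel(FrameworkAModel().score), UniformModel(framework_b_infer)]
rows = [
    {"days_since_last_activity": 45, "order_count": 3},
    {"days_since_last_activity": 2, "order_count": 0},
]
# The calling code is identical whichever framework produced the model.
predictions = [m.predict(rows) for m in models]
```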

In [0]:
import mlflow 
mlflow.set_registry_uri('databricks-uc')
#                                                                                                     Stage/version  
#                                                                                   Model name               |        
#                                                                                       |                    |        
predict_churn_udf = mlflow.pyfunc.spark_udf(spark, "models:/data_pioneers.c360.dbdemos_customer_churn@prod", "long", env_manager='virtualenv')
spark.udf.register("predict_churn", predict_churn_udf)

In [0]:
model_features = predict_churn_udf.metadata.get_input_schema().input_names()

@dlt.create_table(comment="Customer at risk of churn")
def churn_prediction():
  return (dlt
          .read('churn_features')
          .withColumn('churn_prediction', predict_churn_udf(*model_features)))

## Our pipeline is now ready!

As you can see, building data pipelines with Databricks lets you focus on your business implementation while the engine handles the hard data engineering work for you.

Open the <a dbdemos-pipeline-id="dlt-churn" href="#joblist/pipelines/bef07be0-ca9f-471f-ab5e-cb73b2a7c024" target="_blank">Churn DLT pipeline</a> and click on start to visualize your lineage and consume the new data incrementally!

# Next: secure and share data with Unity Catalog

Now that these tables are available in our Lakehouse, let's review how we can share them with the Data Scientists and Data Analysts teams.

Jump to the [Governance with Unity Catalog notebook]($../00-churn-introduction-lakehouse) or [Go back to the introduction]($../00-churn-introduction-lakehouse)

## Optional: Checking your data quality metrics with DLT
DLT tracks all your data quality metrics. You can query the expectations directly as a SQL table with Databricks SQL to track your expectation metrics and send alerts as required. This lets you build the following dashboards:

<img width="1000" src="https://github.com/databricks-demos/dbdemos-resources/blob/main/images/retail/lakehouse-churn/lakehouse-retail-c360-dashboard-dlt-stat.png?raw=true">

<a dbdemos-dashboard-id="dlt-quality-stat" href='/sql/dashboardsv3/01f06ca9586f10459b035cc4a1f29312' target="_blank">Data Quality Dashboard</a>
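
As a sketch of what such a dashboard computes, the snippet below derives a pass rate per expectation from a JSON payload shaped like the data-quality section of a pipeline event. The payload layout (`flow_progress.data_quality.expectations` with `passed_records` and `failed_records`) is an assumption to verify against your own event log, the numbers are made up, and `expectation_pass_rates` is a hypothetical helper.

```python
import json

# Made-up payload, assumed to resemble the `details` field of a flow_progress event.
details = json.dumps({
    "flow_progress": {
        "data_quality": {
            "expectations": [
                {"name": "user_valid_id", "dataset": "churn_users",
                 "passed_records": 980, "failed_records": 20},
                {"name": "order_valid_id", "dataset": "churn_orders",
                 "passed_records": 500, "failed_records": 0},
            ]
        }
    }
})

def expectation_pass_rates(details_json: str) -> dict:
    """Map (dataset, expectation name) to the fraction of records that passed."""
    payload = json.loads(details_json)
    expectations = (
        payload.get("flow_progress", {})
        .get("data_quality", {})
        .get("expectations", [])
    )
    rates = {}
    for e in expectations:
        total = e["passed_records"] + e["failed_records"]
        # Avoid a division by zero when no records were processed.
        rates[(e["dataset"], e["name"])] = e["passed_records"] / total if total else None
    return rates

rates = expectation_pass_rates(details)
```

A rate that drops below an agreed threshold is a natural trigger for an alert.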

# Building our first business dashboard with Databricks SQL

Our data is now available! We can start building dashboards to get insights from our past and current business.

<img style="float: left; margin-right: 50px;" width="500px" src="https://github.com/databricks-demos/dbdemos-resources/blob/main/images/retail/lakehouse-churn/lakehouse-retail-c360-dashboard-churn-prediction.png?raw=true" />

<img width="500px" src="https://github.com/databricks-demos/dbdemos-resources/blob/main/images/retail/lakehouse-churn/lakehouse-retail-c360-dashboard-churn.png?raw=true"/>

<a dbdemos-dashboard-id="churn-universal" href='/sql/dashboardsv3/01f06ca9586f10459b035cc4a1f29312'  target="_blank">Open the DBSQL Dashboard</a>