
# Data Science on the Databricks Lakehouse

## ML is key to disruption & personalization

Being able to ingest and query our credit-related database is a first step, but this isn't enough to thrive in a very competitive market.

Customers now expect real-time personalization and new forms of communication. Modern data companies achieve this with AI.

<style>
.right_box{
  margin: 30px; box-shadow: 10px -10px #CCC; width:650px;height:300px; background-color: #1b3139ff; box-shadow:  0 0 10px  rgba(0,0,0,0.6);
  border-radius:25px;font-size: 35px; float: left; padding: 20px; color: #f9f7f4; }
.badge {
  clear: left; float: left; height: 30px; width: 30px;  display: table-cell; vertical-align: middle; border-radius: 50%; background: #fcba33ff; text-align: center; color: white; margin-right: 10px}
.badge_b { 
  height: 35px}
</style>
<link href='https://fonts.googleapis.com/css?family=DM Sans' rel='stylesheet'>
<div style="font-family: 'DM Sans'">
  <div style="width: 500px; color: #1b3139; margin-left: 50px; float: left">
    <div style="color: #ff5f46; font-size:80px">90%</div>
    <div style="font-size:30px;  margin-top: -20px; line-height: 30px;">
      Enterprise applications will be AI-augmented by 2025 — IDC
    </div>
    <div style="color: #ff5f46; font-size:80px">$10T+</div>
    <div style="font-size:30px;  margin-top: -20px; line-height: 30px;">
       Projected business value creation by AI in 2030 — PwC
    </div>
  </div>
</div>



  <div class="right_box">
      But a huge challenge is getting ML to work at scale!<br/><br/>
      Most ML projects still fail before getting to production.
  </div>
  
<br style="clear: both">


### A cluster has been created for this demo
To run this demo, just select the cluster `dbdemos-lakehouse-fsi-credit-junyi_tiong` from the dropdown menu ([open cluster configuration](https://e2-demo-field-eng.cloud.databricks.com/#setting/clusters/0922-083237-e7fg83pu/configuration)). <br />
*Note: If the cluster was deleted after 30 days, you can re-create it with `dbdemos.create_cluster('lakehouse-fsi-credit')` or re-install the demo: `dbdemos.install('lakehouse-fsi-credit')`*


## So what makes machine learning and data science difficult?

These are the top challenges we have observed companies struggle with:
1. Inability to ingest the required data in a timely manner,
2. Inability to properly control the access of the data,
3. Inability to trace problems in the feature store to the raw data,

... and many other data-related problems.


# Data-centric Machine Learning

In Databricks, machine learning is not a separate product or service that needs to be "connected" to the data. Because the Lakehouse is a single, unified product, machine learning in Databricks "sits" on top of the data, which removes challenges such as the inability to discover and access data.

<br />
<img src="https://raw.githubusercontent.com/borisbanushev/CAPM_Databricks/main/MLontheLakehouse.png" width="1300px" />

In [0]:
%pip install databricks-sdk==0.36.0 mlflow==2.19.0 databricks-feature-store==0.17.0 scikit-learn==1.3.0
dbutils.library.restartPython()

In [0]:
%run ../_resources/00-setup $reset_all_data=false


# Credit Scoring default prediction


<img src="https://raw.githubusercontent.com/databricks-demos/dbdemos-resources/main/images/fsi/credit_decisioning/fsi-credit-decisioning-ml-2.png" style="float: right" width="800px">

## Single click deployment with AutoML


Let's see how we can now leverage the credit decisioning data to build a model predicting and explaining customer creditworthiness.

We'll start by retrieving our data from the feature store and creating our training dataset.

We'll then use Databricks AutoML to automatically build our model.

In [0]:
from databricks import feature_store
fs = feature_store.FeatureStoreClient()
features_set = fs.read_table(name=f"{catalog}.{db}.credit_decisioning_features")
display(features_set)

In [0]:
from pyspark.sql import functions as F
from pyspark.sql.functions import col
import plotly.express as px

# Label customers with more than 60 days of overdue credit as defaulted
credit_bureau_label = (spark.table("credit_bureau_gold")
                            .withColumn("defaulted", F.when(col("CREDIT_DAY_OVERDUE") > 60, 1)
                                                      .otherwise(0))
                            .select("cust_id", "defaulted"))
# As you can see, we have a fairly imbalanced dataset
df = credit_bureau_label.groupBy('defaulted').count().toPandas()
px.pie(df, values='count', names='defaulted', title='Credit default ratio')

In [0]:
training_dataset = credit_bureau_label.join(features_set, "cust_id", "inner")


## Balancing our dataset

Let's upsample the minority class (defaulted customers) and downsample the majority class to balance our dataset and improve our model performance.

In [0]:
# Do not block self-joins under remote filtering, since the same source tables are combined several times below
spark.conf.set("spark.databricks.remoteFiltering.blockSelfJoins", "false")

major_df = training_dataset.filter(col("defaulted") == 0)
minor_df = training_dataset.filter(col("defaulted") == 1)

# Duplicate the minority rows
oversampled_df = minor_df.union(minor_df)

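# Downsample majority rows to roughly 3x the oversampled minority count
# (the sampling fraction is capped at 1.0, the maximum allowed without replacement)
sample_fraction = min(1.0, oversampled_df.count() / major_df.count() * 3)
undersampled_df = major_df.sample(fraction=sample_fraction, seed=42)

# Combine both oversampled minority rows and undersampled majority rows
train_df = undersampled_df.union(oversampled_df).drop('cust_id').na.fill(0)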

# Save it as a table to be able to select it with the AutoML UI
train_df.write.mode('overwrite').saveAsTable('credit_risk_train_df')
train_df = spark.table('credit_risk_train_df')

# Visualize the credit default ratio
px.pie(train_df.groupBy('defaulted').count().toPandas(), values='count', names='defaulted', title='Credit default ratio')
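
The resampling arithmetic in the cell above can be checked without Spark. The following is a minimal sketch in plain Python, assuming hypothetical class counts (10,000 non-defaulted and 500 defaulted rows, not taken from the demo dataset). The `resampling_plan` helper and the `ResamplingPlan` class are illustrative names, and the cap at 1.0 is a safeguard of this sketch. Because `DataFrame.sample` draws rows randomly, the majority count is an expected value rather than an exact one.

```python
from dataclasses import dataclass


@dataclass
class ResamplingPlan:
    minority_rows: int             # minority rows after duplication
    majority_fraction: float       # fraction that would be passed to DataFrame.sample
    expected_majority_rows: float  # expected (not exact) number of sampled majority rows
    expected_minority_share: float


def resampling_plan(n_major: int, n_minor: int, ratio: int = 3) -> ResamplingPlan:
    """Mirror the arithmetic of the balancing cell for the given class counts."""
    # minor_df.union(minor_df) doubles the minority class
    minority_rows = 2 * n_minor
    # Aim for `ratio` majority rows per minority row; cap the fraction at 1.0
    fraction = min(1.0, ratio * minority_rows / n_major)
    expected_majority = fraction * n_major
    share = minority_rows / (minority_rows + expected_majority)
    return ResamplingPlan(minority_rows, fraction, expected_majority, share)


# Hypothetical counts, for illustration only
plan = resampling_plan(n_major=10_000, n_minor=500)
print(plan)
```

With a 3:1 target, the minority class should make up about a quarter of the training table, which is what the pie chart above lets you verify visually.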


## Accelerating credit scoring model creation using MLflow and Databricks AutoML

MLflow is an open source project for model tracking, packaging and deployment. Every time your Data Science team works on a model, Databricks tracks and auto-logs the parameters and data used. This ensures ML traceability and reproducibility, making it easy to know which parameters and data were used to build each model and model version.

### A glass-box solution that empowers data teams without taking control away

While Databricks simplifies model deployment and governance (MLOps) with MLflow, bootstrapping new ML projects can still be a long and inefficient process.

Instead of creating the same boilerplate for each new project, Databricks AutoML can automatically generate state-of-the-art models for classification, regression, and forecasting.


<img width="1000" src="https://github.com/QuentinAmbard/databricks-demo/raw/main/retail/resources/images/auto-ml-full.png"/>


Models can be deployed directly, or you can use the generated notebooks to bootstrap projects with best practices, saving you weeks' worth of effort.

<br style="clear: both">

<img style="float: right" width="600" src="https://raw.githubusercontent.com/borisbanushev/CAPM_Databricks/main/MLFlowAutoML.png"/>

### Using Databricks AutoML with our Credit Scoring dataset

AutoML is available in the "Machine Learning" space. All we have to do is start a new AutoML experiment and select the training table we just created (`credit_risk_train_df`).

Our prediction target is the `defaulted` column.

Click on Start, and Databricks will do the rest.

While this is done using the UI, you can also leverage the [Python API](https://docs.databricks.com/applications/machine-learning/automl.html#automl-python-api-1):

In [0]:
from datetime import datetime

model_name = "dbdemos_fsi_credit_decisioning"
xp_path = "/Shared/dbdemos/experiments/lakehouse-fsi-credit-decisioning"
xp_name = f"automl_credit_{datetime.now().strftime('%Y-%m-%d_%H:%M:%S')}"
try:
    from databricks import automl
    automl_run = automl.classify(
        experiment_name = xp_name,
        experiment_dir = xp_path,
        dataset = train_df.sample(0.1),
        target_col = "defaulted",
        timeout_minutes = 10
    )
    #Make sure all users can access dbdemos shared experiment
    DBDemos.set_experiment_permission(f"{xp_path}/{xp_name}")
except Exception as e:
    if "cannot import name 'automl'" in str(e) or 'method_whitelist' in str(e):
        # Note: cannot import name 'automl' from 'databricks' likely means you're using serverless. Dbdemos doesn't support autoML serverless API - this will be improved soon.
        # Adding a temporary workaround to make sure it works well for now - ignore this for classic run
        automl_run = DBDemos.create_mockup_automl_run(f"{xp_path}/{xp_name}", train_df.sample(0.1).toPandas(), model_name=model_name, target_col="defaulted")
    else:
        raise e

## Deploying our model in production

Our model is now ready. We can review the notebook generated by the AutoML run and customize it if required.
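
As a minimal sketch of how the generated notebook could be located programmatically: `mlflow_run_id` on `best_trial` is the attribute this notebook already uses, while `notebook_url` is an assumption based on the Databricks AutoML Python API and may be missing on the mockup run used as a serverless fallback. The `describe_best_trial` helper and the stand-in values are hypothetical.

```python
from types import SimpleNamespace


def describe_best_trial(summary) -> str:
    """Return a one-line description of the best AutoML trial."""
    trial = summary.best_trial
    # getattr with a default keeps the helper usable if the attribute is absent
    notebook = getattr(trial, "notebook_url", "<unknown notebook>")
    return f"best run {trial.mlflow_run_id}, generated notebook: {notebook}"


# Stand-in object with made-up values so the sketch runs outside Databricks;
# in this notebook you would pass `automl_run` instead.
fake_run = SimpleNamespace(
    best_trial=SimpleNamespace(
        mlflow_run_id="abc123",
        notebook_url="https://example.com/notebook",
    )
)
print(describe_best_trial(fake_run))
```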

For this demo, we'll consider that our model is ready and deploy it in production in our Model Registry:

In [0]:
model_name = "dbdemos_fsi_credit_decisioning"
from mlflow import MlflowClient
import mlflow

#Use Databricks Unity Catalog to save our model
mlflow.set_registry_uri('databricks-uc')
client = MlflowClient()

#Add model within our catalog
latest_model = mlflow.register_model(f'runs:/{automl_run.best_trial.mlflow_run_id}/model', f"{catalog}.{db}.{model_name}")
# Flag it as Production ready using UC Aliases
client.set_registered_model_alias(name=f"{catalog}.{db}.{model_name}", alias="prod", version=latest_model.version)
#DBDemos.set_model_permission(f"{catalog}.{db}.{model_name}", "ALL_PRIVILEGES", "account users")

We just flagged our AutoML model as production ready!

Open [the dbdemos_fsi_credit_decisioning model](#mlflow/models/dbdemos_fsi_credit_decisioning) to explore its artifacts and analyze the parameters used, including traceability to the notebook used for its creation.
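
Downstream code can then refer to the model by its alias instead of a version number. The following is a minimal sketch, assuming hypothetical catalog and schema names (`main`, `fsi_credit`). The two helper functions are illustrative and not part of dbdemos; `load_prod_model` needs MLflow and a Databricks workspace to actually run.

```python
def registered_model_uri(catalog: str, schema: str, model_name: str, alias: str = "prod") -> str:
    """Build an MLflow `models:/<name>@<alias>` URI for a Unity Catalog model."""
    return f"models:/{catalog}.{schema}.{model_name}@{alias}"


def load_prod_model(catalog: str, schema: str, model_name: str):
    """Load whichever model version currently carries the `prod` alias."""
    import mlflow  # imported lazily so the URI helper works without MLflow installed

    mlflow.set_registry_uri("databricks-uc")
    return mlflow.pyfunc.load_model(registered_model_uri(catalog, schema, model_name))


# Hypothetical catalog and schema names, for illustration only
uri = registered_model_uri("main", "fsi_credit", "dbdemos_fsi_credit_decisioning")
print(uri)
```

Resolving the alias at load time means that moving `prod` to a newer version later does not require changing the consuming code.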


## Our model predicting default risks is now deployed in production


So far we have:
* ingested all required data in a single source of truth,
* properly secured all data (including granting granular access controls, masked PII data, applied column level filtering),
* enhanced that data through feature engineering,
* used MLflow and Databricks AutoML to track experiments and build a machine learning model,
* registered the model.

### Next steps
We're now ready to use our model for:

- Batch inference in notebook [03.3-Batch-Scoring-credit-decisioning]($./03.3-Batch-Scoring-credit-decisioning) to identify currently underbanked customers with good creditworthiness (**increase the revenue**) and to predict which current credit holders might default so we can prevent those defaults (**manage risk**),
- Real time inference with [03.4-model-serving-BNPL-credit-decisioning]($./03.4-model-serving-BNPL-credit-decisioning) to enable ```Buy Now, Pay Later``` capabilities within the bank.

Extra: review model explainability & fairness with [03.5-Explainability-and-Fairness-credit-decisioning]($./03.5-Explainability-and-Fairness-credit-decisioning)