
# Accelerating Data Science with Databricks AutoML

## Predicting patient readmission risk: single-click deployment with AutoML

<img src="https://github.com/databricks-demos/dbdemos-resources/blob/main/images/hls/patient-readmission/patient-risk-ds-flow-2.png?raw=true" width="700px" style="float: right; margin-left: 10px;" />


In this notebook, we will explore how to use Databricks AutoML to generate the best notebooks to predict our patient readmission risk and deploy our model in production.

Databricks AutoML allows you to quickly generate baseline models and notebooks. 

ML experts can accelerate their workflow by fast-forwarding through the usual trial-and-error and focus on customizations using their domain knowledge, and citizen data scientists can quickly achieve usable results with a low-code approach.

In [0]:
dbutils.widgets.text("catalog", "", "Catalog")
dbutils.widgets.text("schema", "", "Schema")

In [0]:
from pyspark.sql import SparkSession
spark: SparkSession

catalog = dbutils.widgets.get("catalog")
schema = dbutils.widgets.get("schema")

spark.sql(f"USE `{catalog}`.`{schema}`")

DataFrame[]
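The `USE` statement above interpolates the widget values directly into SQL, and both widgets default to an empty string. A minimal sketch of a guard that could run first is shown below. `quote_identifier` and `use_statement` are hypothetical helpers, not part of the original notebook, and the allowed-character rule is an assumption that is stricter than what Unity Catalog itself accepts. The names `main` and `hls` are placeholders.

```python
import re

# Hypothetical helpers (not part of the original notebook).
# Assumption: identifiers are limited to letters, digits and underscores.
_IDENTIFIER_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def quote_identifier(name: str) -> str:
    """Return `name` wrapped in backticks, or raise if it is empty or looks unsafe."""
    if not name:
        raise ValueError("identifier is empty; set the notebook widget first")
    if not _IDENTIFIER_PATTERN.fullmatch(name):
        raise ValueError(f"unexpected characters in identifier: {name!r}")
    return f"`{name}`"


def use_statement(catalog: str, schema: str) -> str:
    """Build the USE statement for a catalog and schema pair."""
    return f"USE {quote_identifier(catalog)}.{quote_identifier(schema)}"


# Placeholder names for illustration only.
print(use_statement("main", "hls"))
```

In the notebook, the resulting string could be passed to `spark.sql(...)` in place of the inline f-string.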

In [0]:
import mlflow
mlflow.set_registry_uri("databricks-uc")

### Getting our training dataset 

Let's use our training dataset, which determines readmission after 30 days for our entire population. We will use this as our training label, which is what we want our model to predict.

In [0]:
training_dataset = spark.sql(f"""
                      SELECT 
                      b.*
                      , a.claim_amount
                      FROM ddavis_hls_sql.ai.training_patient_claims a 
                      INNER JOIN ddavis_hls_sql.ai.feature_beneficiary b ON a.beneficiary_code = b.beneficiary_code
                      """)

training_dataset.write.mode("overwrite").saveAsTable("ddavis_hls_sql.ai.training_dataset")

display(training_dataset)

| beneficiary_code | date_of_birth | date_of_death | gender | race | esrd_flag | state | county_code | heart_failure_flag | cronic_kidney_disease_flag | cancer_flag | copd_flag | depression_flag | diabetes_flag | ischemic_heart_disease_flag | osteoporosis_flag | asrheumatoid_arthritis_flag | stroke_transient_ischemic_attack_flag | claim_amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FF54A63FA55F5556 | 1954-03-01 | | Male | Black | No | WA | 220 | No | No | No | No | Yes | No | No | No | No | No | 13940.0 |
| FF5FA9DFBDCB7632 | 1926-01-01 | | Female | White | Yes | MD | 150 | Yes | Yes | Yes | Yes | No | Yes | Yes | No | Yes | No | 47150.0 |
| FFB3ECE3201A4722 | 1954-12-01 | | Female | Black | No | VA | 921 | No | No | No | No | No | Yes | No | No | No | No | 5020.0 |
| FFF58E99961D73D2 | 1932-12-01 | | Female | White | Yes | CO | 340 | Yes | No | No | No | No | No | No | No | No | No | 1000.0 |
| 000DFFF54425C08E | 1943-10-01 | | Female | White | No | IN | 880 | Yes | No | No | Yes | No | No | Yes | No | No | No | 28500.0 |
| 0019EF0547183BF0 | 1945-12-01 | 2010-02-01 | Male | White | No | LA | 250 | No | Yes | No | No | Yes | Yes | Yes | No | No | No | 90860.0 |
| 0023288365A51F33 | 1949-05-01 | | Female | White | No | MA | 20 | Yes | No | No | No | No | No | No | No | No | No | 38000.0 |
| 0027B2294564172E | 1925-05-01 | | Female | White | No | AR | 650 | Yes | No | No | No | No | Yes | No | No | Yes | No | 21580.0 |
| 00398D4814D4CBD9 | 1935-07-01 | | Male | White | No | NJ | 160 | Yes | No | No | Yes | No | No | Yes | No | No | No | 9230.0 |
| 003D116838833B5A | 1924-02-01 | 2010-05-01 | Male | Others | No | CA | 200 | No | No | No | No | No | No | No | No | No | No | 7520.0 |


### Define what features to look up for our model

Let's only keep the relevant features for our model training. We are removing columns such as `SSN` or `IDs`.

This step could also be done from the UI by selecting the `training_dataset` table and choosing the columns of interest.

*Note: these features could also be retrieved from our Feature Store tables. For more details, open the companion notebook.*

In [0]:
all_columns = training_dataset.columns
feature_names = ["race", "date_of_birth", "stroke_transient_ischemic_attack_flag", "cronic_kidney_disease_flag", "esrd_flag", "county_code", "diabetes_flag", "cancer_flag", "state", "copd_flag", "gender", "heart_failure_flag", "asrheumatoid_arthritis_flag", "depression_flag", "date_of_death", "ischemic_heart_disease_flag", "osteoporosis_flag"]

excluded_columns = [c for c in all_columns if c not in feature_names]

In [0]:
excluded_columns

['beneficiary_code', 'claim_amount']
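The exclusion list above is simply every column that is not a named feature. Because the feature names are typed by hand, a misspelled name would not fail: the real column would silently land in the excluded list. The sketch below, written in plain Python with toy column names, shows one way to catch that. `split_columns` is a hypothetical helper, not part of the original notebook.

```python
def split_columns(all_columns, feature_names):
    """Return (excluded, missing).

    excluded: dataset columns that are not used as features.
    missing:  requested features that do not exist in the dataset.
    """
    feature_set = set(feature_names)
    column_set = set(all_columns)
    excluded = [c for c in all_columns if c not in feature_set]
    missing = [f for f in feature_names if f not in column_set]
    return excluded, missing


# Toy column names for illustration only.
columns = ["beneficiary_code", "gender", "race", "claim_amount"]
features = ["gender", "race", "state"]  # "state" is absent from `columns`

excluded, missing = split_columns(columns, features)
print("excluded:", excluded)
print("missing:", missing)
```

In the notebook, the same check could be applied to `training_dataset.columns` and `feature_names`, raising an error when `missing` is not empty.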

In [0]:
from databricks import automl
summary = automl.regress(training_dataset, target_col="claim_amount", exclude_cols=excluded_columns, primary_metric="mae", timeout_minutes=30)

2024/11/27 22:05:57 INFO databricks.automl.client.manager: AutoML will optimize for mean absolute error metric, which is tracked as val_mean_absolute_error in the MLflow experiment.
2024/11/27 22:05:58 INFO databricks.automl.client.manager: MLflow Experiment ID: 3582014443838527
2024/11/27 22:05:58 INFO databricks.automl.client.manager: MLflow Experiment: https://e2-demo-field-eng.cloud.databricks.com/?o=1444828305810485#mlflow/experiments/3582014443838527
2024/11/27 22:07:01 INFO databricks.automl.client.manager: Data exploration notebook: https://e2-demo-field-eng.cloud.databricks.com/?o=1444828305810485#notebook/3582014443838552
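The log above states that AutoML optimizes mean absolute error (MAE), the average of the absolute differences between predicted and actual values. As a reminder of what that metric measures, here is a minimal plain-Python sketch. The values are made up for illustration and are not results from this run.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |actual - predicted| over all rows."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one value is required")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up claim amounts, for illustration only.
actual = [1000.0, 5000.0, 20000.0]
predicted = [1500.0, 4000.0, 23000.0]

mae = mean_absolute_error(actual, predicted)
print(mae)  # absolute errors are 500, 1000 and 3000, so the mean is 1500.0
```

Because MAE is expressed in the same unit as the target, a value of 1500.0 here would mean predictions are off by 1,500 on average, in the unit of `claim_amount`.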


## Deploying our model in production

Our model is now ready. We can review the notebook generated by the AutoML run and customize it if required.

For this demo, we'll consider that our model is ready and deploy it in production in our Model Registry:

In [0]:
model_name = f"{catalog}.{schema}.claims_prediction_model"
model_registered = mlflow.register_model(f"runs:/{summary.best_trial.mlflow_run_id}/model", model_name)

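# Flag this model version as the production model by assigning the "Production" alias
client = mlflow.tracking.MlflowClient()
# An f-string formats the version correctly whether it is returned as a string or an integer
print(f"Registering model version {model_registered.version} as production model")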
client.set_registered_model_alias(model_name, "Production", model_registered.version)

We just flagged our AutoML model as production ready!

Open [the dbdemos_hls_patient_readmission model](#mlflow/models/dbdemos_hls_patient_readmission) to explore its artifacts and analyze the parameters used, including traceability to the notebook used for its creation.
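Once an alias is set, downstream code can refer to the model by alias instead of by version number, so promoting a new version does not require changing the consumers. The sketch below shows the `models:/<name>@<alias>` URI form that MLflow uses for aliases. `model_uri_for_alias` and `load_production_model` are hypothetical helpers, not part of the original notebook, and the model name is a placeholder. Loading requires a workspace where the model is actually registered, so that function is defined but not called here.

```python
def model_uri_for_alias(model_name: str, alias: str) -> str:
    """Build an MLflow model URI that points at a registered-model alias."""
    return f"models:/{model_name}@{alias}"


def load_production_model(model_name: str, alias: str = "Production"):
    """Load the model version that currently carries `alias`.

    Assumes the model lives in Unity Catalog, as in this notebook.
    """
    # Imported lazily so the URI helper works even where MLflow is not installed.
    import mlflow
    import mlflow.pyfunc

    mlflow.set_registry_uri("databricks-uc")
    return mlflow.pyfunc.load_model(model_uri_for_alias(model_name, alias))


# Placeholder three-level name (catalog.schema.model), for illustration only.
uri = model_uri_for_alias("main.hls.claims_prediction_model", "Production")
print(uri)
```

In the notebook, `load_production_model(model_name)` would return a generic Python function model whose `predict` method accepts a pandas DataFrame of features.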


## Our model predicting patient readmission risk is now deployed in production

So far we have:
* ingested all required data in a single source of truth using the OMOP data model,
* properly secured all data (including granting granular access controls, masking PII data, and applying column-level filtering),
* enhanced that data through feature engineering (and Feature Store as an option),
* used Databricks AutoML and MLflow to track experiments and build a machine learning model,
* registered the model.

### Next steps
We're now ready to use our model for:

- Batch inference in notebook [04.3-Batch-Scoring-patient-readmission]($./04.3-Batch-Scoring-patient-readmission) to start identifying patients at risk and providing custom care to reduce readmission risk,
- Real-time inference with [04.4-Model-Serving-patient-readmission]($./04.4-Model-Serving-patient-readmission) to enable real-time capabilities and instantly get insights for a specific patient,
- Model explainability for our entire population or a specific patient, to understand the risk factors and further personalize care, with [04.5-Explainability-patient-readmission]($./04.5-Explainability-patient-readmission).