
# ML: predict credit owners with high default probability

Once all data is loaded and secured (the **data unification** part), we can proceed to exploring, understanding, and using the data to create actionable insights - **data decisioning**.


As outlined in the [introductory notebook]($../00-Credit-Decisioning), we will build machine learning (ML) models for driving three business outcomes:
1. Identify currently underbanked customers with high credit worthiness so we can offer them credit instruments,
2. Predict current credit owners with high probability of defaulting along with the loss-given default, and
3. Offer instantaneous micro-loans (Buy Now, Pay Later) when a customer does not have the required credit limit or account balance to complete a transaction.

Here is the flow we'll implement: 

<img src="https://raw.githubusercontent.com/databricks-demos/dbdemos-resources/main/images/fsi/credit_decisioning/fsi-credit-decisioning-ml-0.png" width="1200px">



### A cluster has been created for this demo
To run this demo, just select the cluster `dbdemos-lakehouse-fsi-credit-junyi_tiong` from the dropdown menu ([open cluster configuration](https://e2-demo-field-eng.cloud.databricks.com/#setting/clusters/0922-083237-e7fg83pu/configuration)). <br />
*Note: If the cluster was deleted after 30 days, you can re-create it with `dbdemos.create_cluster('lakehouse-fsi-credit')` or re-install the demo: `dbdemos.install('lakehouse-fsi-credit')`*


## The need for Enhanced Collaboration

Feature engineering is an iterative process: we need to quickly generate new features, test the model, and return to feature selection and further feature engineering, many times over. The Databricks Lakehouse enables data teams to collaborate effectively through the following Databricks Notebook features:
1. Sharing and collaborating in the same Notebook by any team member (with different access modes),
2. Ability to use Python, SQL, and R simultaneously in the same Notebook on the same data,
3. Native integration with Git repositories (including AWS CodeCommit, Azure DevOps, GitLab, GitHub, and others), making Notebooks usable in CI/CD workflows,
4. Variable explorer,
5. Automatic Data Profiling (in the cell below), and
6. GUI-based dashboards (in the cell below) that can also be added to any Databricks SQL Dashboard.

These features help teams within FSI organizations build better ML models in less time, making the most of market opportunities such as rising interest rates.

In [0]:
%pip install databricks-sdk==0.36.0 mlflow==2.19.0 databricks-feature-store==0.17.0
dbutils.library.restartPython()

In [0]:
%run ../_resources/00-setup $reset_all_data=false


## Data exploration & Features creation

<img src="https://raw.githubusercontent.com/databricks-demos/dbdemos-resources/main/images/fsi/credit_decisioning/fsi-credit-decisioning-ml-1.png" style="float: right" width="800px">

<br/><br/>
The first step as a Data Scientist is to explore and understand our data in order to create features.

<br/>

This is where we use our existing tables and transform the data so it is ready for our ML models. These features will later be stored in the Databricks Feature Store (see below) and used to train the aforementioned ML models.

<br/>

Let's start with some data exploration. Databricks comes with built-in Data Profiling to help you bootstrap that.

In [0]:
%sql
SELECT * FROM customer_gold WHERE tenure_months BETWEEN 10 AND 150

In [0]:
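import plotly.express as px  # harmless if the setup notebook already imported it

# Total monthly income per tenure and education level
data = (spark.table("customer_gold")
             .where("tenure_months BETWEEN 10 AND 150")
             .groupBy("tenure_months", "education").sum("income_monthly")
             .orderBy("education").toPandas())

px.bar(data, x="tenure_months", y="sum(income_monthly)", color="education", title="Total monthly income by tenure and education")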


# Building our Features for Credit Default risks

To build our model predicting credit default risk, we'll need a bunch of features. To improve governance and centralize our data for multiple ML projects, we can save our ML features using a Feature Store.

In [0]:
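from datetime import date
from pyspark.sql import functions as F
from pyspark.sql.functions import col

# Derive age from the birth year, then keep one row per customer
customer_gold_features = (spark.table("customer_gold")
                               .withColumn('age', date.today().year - col('birth_year'))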
                               .select('cust_id', 'education', 'marital_status', 'months_current_address', 'months_employment', 'is_resident',
                                       'tenure_months', 'product_cnt', 'tot_rel_bal', 'revenue_tot', 'revenue_12m', 'income_annual', 'tot_assets', 
                                       'overdraft_balance_amount', 'overdraft_number', 'total_deposits_number', 'total_deposits_amount', 'total_equity_amount', 
                                       'total_UT', 'customer_revenue', 'age', 'avg_balance', 'num_accs', 'balance_usd', 'available_balance_usd')).dropDuplicates(['cust_id'])
display(customer_gold_features)

In [0]:
telco_gold_features = (spark.table("telco_gold")
                            .select('cust_id', 'is_pre_paid', 'number_payment_delays_last12mo', 'pct_increase_annual_number_of_delays_last_3_year', 'phone_bill_amt',
                                    'avg_phone_bill_amt_lst12mo')).dropDuplicates(['cust_id'])
display(telco_gold_features)

In [0]:
fund_trans_gold_features = spark.table("fund_trans_gold").dropDuplicates(['cust_id'])

# For each window, total transactions = sent + received (both count and amount)
for c in ['12m', '6m', '3m']:
  fund_trans_gold_features = fund_trans_gold_features.withColumn('tot_txn_cnt_'+c, col('sent_txn_cnt_'+c)+col('rcvd_txn_cnt_'+c))\
                                                     .withColumn('tot_txn_amt_'+c, col('sent_txn_amt_'+c)+col('rcvd_txn_amt_'+c))

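# Share of the 12-month transaction amount that falls in the last 3 and 6 months (0 when there was no 12-month activity)
fund_trans_gold_features = fund_trans_gold_features.withColumn('ratio_txn_amt_3m_12m', F.when(col('tot_txn_amt_12m')==0, 0).otherwise(col('tot_txn_amt_3m')/col('tot_txn_amt_12m')))\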
                                                   .withColumn('ratio_txn_amt_6m_12m', F.when(col('tot_txn_amt_12m')==0, 0).otherwise(col('tot_txn_amt_6m')/col('tot_txn_amt_12m')))\
                                                   .na.fill(0)
display(fund_trans_gold_features)
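
The ratio features above guard against division by zero and then fill remaining nulls with 0. The same logic can be sketched in plain Python, outside Spark. The helper name `safe_ratio` and the sample rows are hypothetical and only illustrate the rule; they are not taken from `fund_trans_gold`.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is 0
    or either value is missing (mirrors the when/otherwise + na.fill(0) steps)."""
    if numerator is None or denominator is None or denominator == 0:
        return 0.0
    return numerator / denominator

# Hypothetical per-customer totals, for illustration only
rows = [
    {"cust_id": 1, "tot_txn_amt_3m": 300.0, "tot_txn_amt_12m": 1200.0},
    {"cust_id": 2, "tot_txn_amt_3m": 0.0, "tot_txn_amt_12m": 0.0},    # no activity
    {"cust_id": 3, "tot_txn_amt_3m": None, "tot_txn_amt_12m": 500.0}, # missing value
]

ratios = {r["cust_id"]: safe_ratio(r["tot_txn_amt_3m"], r["tot_txn_amt_12m"]) for r in rows}
print(ratios)
```

A ratio near 0.25 for the 3-month window suggests activity spread evenly over the year, while a much higher value suggests a recent spike in transfers.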

In [0]:
feature_df = customer_gold_features.join(telco_gold_features.alias('telco'), "cust_id", how="left")
feature_df = feature_df.join(fund_trans_gold_features, "cust_id", how="left")
display(feature_df)
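
Because both joins are left joins on `cust_id`, every customer from `customer_gold_features` is kept, and customers without a telco or fund transfer record get nulls in those columns. A minimal plain-Python sketch of that behavior follows; the `left_join` helper and the sample rows are hypothetical and are not part of the demo.

```python
def left_join(left_rows, right_by_key, key, right_columns):
    """Keep every left row; fill right-side columns with None when no match exists."""
    joined = []
    for row in left_rows:
        match = right_by_key.get(row[key], {})
        merged = dict(row)
        for column in right_columns:
            merged[column] = match.get(column)  # None when the key has no match
        joined.append(merged)
    return joined

# Hypothetical sample data: customer 2 has no telco record
customers = [{"cust_id": 1, "tenure_months": 24}, {"cust_id": 2, "tenure_months": 60}]
telco_by_cust = {1: {"phone_bill_amt": 42.0}}

feature_rows = left_join(customers, telco_by_cust, "cust_id", ["phone_bill_amt"])
print(feature_rows)
```

Any downstream model training step therefore has to handle these missing values, for example through imputation.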


# Databricks Feature Store

<img src="https://github.com/QuentinAmbard/databricks-demo/raw/main/product_demos/mlops-end2end-flow-feature-store.png" style="float:right" width="650" />

Once our features are ready, we'll save them in the Databricks Feature Store.

Under the hood, feature tables are backed by Delta Lake tables. This makes our features discoverable and reusable across the organization, increasing team efficiency.


Databricks Feature Store brings advanced capabilities to accelerate and simplify your ML journey, such as point-in-time support and an online store that serves features within milliseconds for real-time serving.

### Why use Databricks Feature Store?

Databricks Feature Store is fully integrated with other components of Databricks.

* **Discoverability**. The Feature Store UI, accessible from the Databricks workspace, lets you browse and search for existing features.

* **Lineage**. When you create a feature table with Feature Store, the data sources used to create the feature table are saved and accessible. For each feature in a feature table, you can also access the models, notebooks, jobs, and endpoints that use the feature.

* **Batch and Online feature lookup for real time serving**. When you use features from Feature Store to train a model, the model is packaged with feature metadata. When you use the model for batch scoring or online inference, it automatically retrieves features from Feature Store. The caller does not need to know about them or include logic to look up or join features to score new data. This makes model deployment and updates much easier.

* **Point-in-time lookups**. Feature Store supports time series and event-based use cases that require point-in-time correctness.
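
To make point-in-time correctness concrete: for each training event, only the most recent feature value observed at or before the event time may be used, so that later information does not leak into training. The sketch below shows this idea in plain Python. It is not the Feature Store API; the function name, integer timestamps, and values are hypothetical.

```python
from bisect import bisect_right

def point_in_time_value(history, as_of):
    """Return the latest feature value observed at or before `as_of`, else None.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)  # number of observations with ts <= as_of
    if idx == 0:
        return None  # no feature value existed yet at that time
    return history[idx - 1][1]

# Hypothetical balance history for one customer: (day, avg_balance)
feature_history = [(1, 100.0), (5, 250.0), (9, 80.0)]

label_days = [0, 5, 7, 12]
values = [point_in_time_value(feature_history, day) for day in label_days]
print(values)
```

An event on day 7 sees the value recorded on day 5, never the later value from day 9.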


For more details about Databricks Feature Store, run `dbdemos.install('feature-store')`

In [0]:
from databricks import feature_store
fs = feature_store.FeatureStoreClient()

# Drop the feature table if it already exists, to clean up the demo state
drop_fs_table(f"{catalog}.{db}.credit_decisioning_features")
  
fs.create_table(
    name=f"{catalog}.{db}.credit_decisioning_features",
    primary_keys=["cust_id"],
    df=feature_df,
    description="Features for Credit Decisioning.")
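
The feature table declares `cust_id` as its primary key, so each customer is expected to appear exactly once; this is why the source tables were deduplicated on `cust_id` earlier. A pre-flight check for duplicate keys can be sketched in plain Python as below. The helper `duplicate_keys` and the sample rows are hypothetical; on a Spark DataFrame, a comparable check would be `feature_df.groupBy("cust_id").count().where("count > 1")`.

```python
from collections import Counter

def duplicate_keys(rows, key):
    """Return the sorted key values that occur more than once in `rows`."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)

# Hypothetical sample: customer 2 appears twice
sample_rows = [{"cust_id": 1}, {"cust_id": 2}, {"cust_id": 2}]

duplicates = duplicate_keys(sample_rows, "cust_id")
print(duplicates)
```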


## Next steps

After creating our features and storing them in the Databricks Feature Store, we can proceed to [03.2-AutoML-credit-decisioning]($./03.2-AutoML-credit-decisioning) and build our credit decisioning model.