# Snowflake built-in ML functions

The Snowflake ML functions enable you to extract predictions and insights from your data using machine learning. You don’t have to be a machine learning expert to use them.

## Time-Series Functions
Train machine learning models on time-series data to determine how a specified metric (for example, sales) varies over time. The model provides insights or predictions based on the trends detected in the data.

- **Forecasting:** predict future values from past trends
- **Anomaly Detection:** flag values that deviate from expected values

## Other Analysis Functions
These functions don’t require time-series data, but they benefit from a large number of features.

- **Classification:** map rows into two or more classes based on their most predictive features
- **Top Insights:** find the dimensions and values that affect a metric


# Credit scoring model
Credit scoring models evaluate an applicant's features, such as demographic information, payment history, number of accounts, credit types, and other financial information, to calculate a credit score.

The credit score is used by lending institutions to determine how risky it is to lend money to the applicant.

For this demo, we will create a credit scoring model:
- Get demo data from https://www.kaggle.com/datasets/parisrohan/credit-score-classification/ and copy it into the DEMO_DATA Snowflake table
- Use the Snowflake Classification ML function

The account setup (role, database, schema, warehouse) and importing demo data are in the ```setup.sql``` script.


In [None]:
select * from demo_data;

## Clean the demo data
1. Remove underscores from numeric variables
2. Convert the Month value to a number so we can sort by it
3. Set the Age to null when it is less than 18 or greater than 100
4. Convert the credit history age from years and months into a total number of months

In [None]:
-- clean demo data
create or replace view demo_data_clean as
select 
  "Customer_ID" as CUST_ID, 
  month(to_date('2024-'||substr("Month", 1, 3)||'-01', 'YYYY-Mon-DD')) as MTH,
  case 
    when try_to_number(replace("Age", '_', '')) < 18 then null
    when try_to_number(replace("Age", '_', '')) > 100 then null
    else try_to_number(replace("Age", '_', ''))
  end as AGE,
  case
    when substr("Occupation", 1, 2) = '__' then null
    else "Occupation"
  end as OCCUPATION,
  to_number(replace("Annual_Income", '_', '')) as ANNUAL_INCOME,
  to_number(replace("Monthly_Inhand_Salary", '_', '')) as MONTHLY_SALARY,
  to_number("Delay_from_due_date") as DELAY_FROM_DUE_DT,
  "Changed_Credit_Limit" as CHANGED_CREDIT_LIMIT,
  to_number(replace("Outstanding_Debt", '_', '')) as OUTSTANDING_DEBT,
  to_number(replace("Credit_Utilization_Ratio", '_', '')) as CREDIT_UTIL_RATIO,
  try_to_number(REGEXP_SUBSTR("Credit_History_Age", '^([0-9]+)')) * 12 +
    try_to_number(substr(REGEXP_SUBSTR("Credit_History_Age", 'and ([0-9]+)'), 5, 2)) AS CREDIT_HIST_AGE_MTHS,
  "Payment_of_Min_Amount" as PYMT_MIN_AMT,
  to_number(replace("Amount_invested_monthly", '_', '')) as AMT_INVESTED_MTHLY,
  "Payment_Behaviour" as PYMT_BEH,
  replace("Monthly_Balance", '_', '') as MONTHLY_BALANCE,
  "Credit_Score" as CREDIT_SCORE
from demo_data;
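The cleaning rules above can also be checked on a few sample values outside Snowflake. The following is a minimal Python sketch, not part of the original notebook. The helper names are hypothetical, and the input shapes (a trailing underscore on numbers, credit history text such as "22 Years and 1 Months") are assumptions based on the expressions in the view.

```python
import re

# Three-letter month abbreviations, in calendar order
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def clean_number(raw):
    """Strip underscores and parse a number; return None if parsing fails."""
    if raw is None:
        return None
    try:
        return float(raw.replace("_", ""))
    except ValueError:
        return None


def month_to_number(name):
    """Map a month name such as 'August' to 8 using its first three letters."""
    return MONTHS.index(name[:3].title()) + 1


def clean_age(raw):
    """Return the age, or None when it is missing or outside 18..100."""
    age = clean_number(raw)
    if age is None or age < 18 or age > 100:
        return None
    return int(age)


def credit_history_months(raw):
    """Convert text like '22 Years and 1 Months' to a total month count."""
    match = re.match(r"^(\d+) Years and (\d+) Months$", raw or "")
    if match is None:
        return None
    years, months = map(int, match.groups())
    return years * 12 + months
```

Like the `try_to_number` calls in the view, these helpers return a null value (`None`) for input they cannot parse rather than raising an error.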


## Aggregate the cleaned demo data
The demo data contains several months' worth of data for each customer. We want to aggregate the data so that we have only one record per customer.
1. Some variables are aggregated by ```last_value()```, e.g., age, occupation, and outstanding debt.
2. Some variables are aggregated by ```avg()```, e.g., monthly salary and amount invested monthly.

In [None]:

-- aggregate demo data
create or replace view demo_data_agg as
with aggregates as (
  select cust_id, mth,
    last_value(age) ignore nulls over (partition by cust_id order by mth desc) as age,
    last_value(occupation) ignore nulls over (partition by cust_id order by mth desc) as occupation,
    avg(annual_income) over (partition by cust_id) as avg_annual_income,
    avg(monthly_salary) over (partition by cust_id) as avg_monthly_salary,
    last_value(delay_from_due_dt) ignore nulls over (partition by cust_id order by mth desc) as delay_from_due_dt,
    last_value(changed_credit_limit) ignore nulls over (partition by cust_id order by mth desc) as changed_credit_limit,
    last_value(outstanding_debt) ignore nulls over (partition by cust_id order by mth desc) as outstanding_debt,
    avg(credit_util_ratio) over (partition by cust_id) as credit_util_ratio,
    last_value(credit_hist_age_mths) ignore nulls over (partition by cust_id order by mth desc) as credit_hist_age_mths,
    last_value(pymt_min_amt) ignore nulls over (partition by cust_id order by mth desc) as pymt_min_amt,
    avg(amt_invested_mthly) over (partition by cust_id) as amt_invested_mthly,
    last_value(pymt_beh) ignore nulls over (partition by cust_id order by mth desc) as pymt_beh,
    avg(monthly_balance) over (partition by cust_id) as monthly_balance,
    last_value(credit_score) ignore nulls over (partition by cust_id order by mth desc) as credit_score
  from demo_data_clean
)
select * from aggregates where mth = 8;
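The two aggregation styles can be illustrated on a handful of rows in plain Python. This is a sketch of the idea only, not a line-for-line port of the view. The function names and sample field names are hypothetical. It sorts each customer's rows in ascending month order, so "last" here means the most recent non-null value. Which row `last_value()` returns in SQL depends on the sort direction and window frame, so verify the view's output against your own data.

```python
from collections import defaultdict


def last_non_null(values):
    """Return the last value that is not None, scanning from the end."""
    for value in reversed(values):
        if value is not None:
            return value
    return None


def mean_ignoring_nulls(values):
    """Average the values that are not None; None if there are none."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


def aggregate_by_customer(rows):
    """Collapse monthly rows into one record per customer."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["cust_id"]].append(row)

    result = {}
    for cust_id, cust_rows in grouped.items():
        # Ascending month order: the last element is the latest month
        ordered = sorted(cust_rows, key=lambda r: r["mth"])
        result[cust_id] = {
            "occupation": last_non_null([r["occupation"] for r in ordered]),
            "avg_monthly_salary": mean_ignoring_nulls(
                [r["monthly_salary"] for r in ordered]
            ),
        }
    return result
```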


## Exploratory data analysis
Before building any machine learning model, we perform exploratory data analysis to get familiar with the data, understand the distribution, identify any outliers, deal with missing values, etc.

We will look at the distribution of the values of the classification variable ```CREDIT_SCORE```.

In [None]:
select * from demo_data_agg;

In [None]:
select credit_score, count(*) as cnt from demo_data_agg group by all;

In [None]:
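# Import Python packages
import streamlit as st

# Credit_score_dist is assumed to be the name of the SQL cell above;
# its result is converted to a pandas DataFrame
df = Credit_score_dist.to_pandas()

# Index the counts by credit score so that each class becomes one bar
df1 = df.set_index('CREDIT_SCORE')['CNT']

st.bar_chart(df1)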
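Besides the raw counts, the share of each class makes any imbalance easier to see. A minimal sketch in plain Python follows; the function name `class_shares` is hypothetical and no real data is used.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of rows that fall into each class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}
```

A class whose share is much smaller than the others is a candidate for the rebalancing step later in this notebook.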

## Split the demo data into training and test sets
1. Create the training data by taking a 70% sample of the demo data
2. Create the test data by taking the remaining 30% of the demo data
3. Remove the ```cust_id``` and ```mth``` columns from the training data because they don't represent features for model training

In [None]:
-- create training data by taking a 70% sample from the demo_data_agg view
create or replace table demo_data_samp70 as
select * 
from demo_data_agg
SAMPLE (70);

-- create test data by taking remaining 30% data
create or replace table test_data as
select * 
from demo_data_agg 
where cust_id not in (select cust_id from demo_data_samp70);

-- remove the cust_id and mth columns from the training data
create or replace view train_data
as select * exclude (cust_id, mth) from demo_data_samp70;
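The split logic can be sketched in plain Python. By default, `SAMPLE (70)` in Snowflake includes each row with a probability of 70%, so the training set holds approximately, not exactly, 70% of the rows. The sketch below mimics that behavior and builds the test set from the customers that were not sampled. The function name and the fixed seed are assumptions for illustration, and it assumes one row per customer ID.

```python
import random


def split_train_test(cust_ids, train_fraction=0.7, seed=42):
    """Split unique customer IDs into train and test lists."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    # Each ID is kept for training with probability train_fraction
    train = {cid for cid in cust_ids if rng.random() < train_fraction}
    # The test set is everything that was not sampled for training
    test = [cid for cid in cust_ids if cid not in train]
    return sorted(train), test
```

Because the test set is defined as the complement of the training set, the two sets never share a customer.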

## Use the Classification ML Function to create the credit scoring model
Use the ```TRAIN_DATA``` view as input.

In [None]:
create or replace SNOWFLAKE.ML.CLASSIFICATION credit_scoring_model (
    input_data => TABLE(train_data),
    target_colname => 'credit_score'
);

## Predict the credit score on the test data
Save the predicted results to the ```MODEL_OUTPUT``` table. Parse the relevant columns, such as the predicted class and the probability from the returned JSON.

In [None]:
create or replace table model_output as
select *, 
  credit_scoring_model!PREDICT(
    input_data => object_construct(*))
    as prediction
from test_data;

In [None]:
select *,
  prediction:"class"::varchar as class,
  prediction:"probability"."Good" as probability_good,
  prediction:"probability"."Standard" as probability_standard,
  prediction:"probability"."Poor" as probability_poor
from model_output;
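The same parsing can be sketched in Python for readers who export the prediction column. The sample value below is hypothetical and only mirrors the shape that the query above reads: a "class" key and a "probability" object keyed by class label. The helper names are assumptions as well.

```python
import json

# Hypothetical prediction value with the shape used in the query above
SAMPLE_PREDICTION = (
    '{"class": "Standard", '
    '"probability": {"Good": 0.1, "Poor": 0.25, "Standard": 0.65}}'
)


def parse_prediction(raw_json):
    """Flatten a prediction object into one value per output column."""
    prediction = json.loads(raw_json)
    probabilities = prediction["probability"]
    return {
        "class": prediction["class"],
        "probability_good": probabilities.get("Good"),
        "probability_standard": probabilities.get("Standard"),
        "probability_poor": probabilities.get("Poor"),
    }


def most_probable_class(raw_json):
    """Return the label with the highest probability."""
    probabilities = json.loads(raw_json)["probability"]
    return max(probabilities, key=probabilities.get)
```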


## Create the confusion matrix
Compare the actual class (column ```CREDIT_SCORE```) with the predicted class (column ```CLASS```) by counting the rows for each pair of actual and predicted values, then plot the counts as a heatmap.

In [None]:
select credit_score, prediction:"class"::varchar as class, count(cust_id) as cnt from model_output group by all;

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Confusion_matrix_SQL is assumed to be the name of the SQL cell above
df = Confusion_matrix_SQL.to_pandas()
# Rows are the actual classes, columns are the predicted classes
heatmap_data = df.pivot_table(index='CREDIT_SCORE', columns='CLASS', values='CNT', aggfunc='sum', fill_value=0)

fig, ax = plt.subplots(figsize=(4, 3))
sns.heatmap(heatmap_data, annot=True, cmap='YlGnBu', fmt='d', ax=ax)
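To make the structure of the confusion matrix explicit, here is a minimal sketch that builds one from (actual, predicted) pairs without pandas. The function name is hypothetical. Rows correspond to actual classes and columns to predicted classes, matching the pivot above.

```python
from collections import Counter


def confusion_matrix(pairs, labels):
    """Count (actual, predicted) pairs into a square matrix.

    matrix[i][j] is the number of rows whose actual class is labels[i]
    and whose predicted class is labels[j].
    """
    counts = Counter(pairs)
    return [
        [counts[(actual, predicted)] for predicted in labels]
        for actual in labels
    ]
```

Correct predictions land on the diagonal, so a good model concentrates its counts there.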


## Evaluate model performance
Call the ```SHOW_EVALUATION_METRICS()``` and ```SHOW_FEATURE_IMPORTANCE()``` methods of the model and review the results.

In [None]:
call credit_scoring_model!SHOW_EVALUATION_METRICS();

In [None]:
call credit_scoring_model!SHOW_FEATURE_IMPORTANCE();
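Per-class precision, recall, and F1 can also be derived by hand from a confusion matrix, which helps when interpreting the reported metrics. The sketch below uses the standard definitions; the function name is hypothetical. The numbers it produces from the test-data confusion matrix may differ from what Snowflake reports, since Snowflake computes its metrics on evaluation data that it selects itself.

```python
def per_class_metrics(matrix, labels):
    """Compute precision, recall, F1, and support for each class.

    matrix[i][j] counts rows with actual class labels[i] and
    predicted class labels[j].
    """
    metrics = {}
    for i, label in enumerate(labels):
        true_positives = matrix[i][i]
        predicted_total = sum(row[i] for row in matrix)  # column sum
        actual_total = sum(matrix[i])                    # row sum
        precision = true_positives / predicted_total if predicted_total else 0.0
        recall = true_positives / actual_total if actual_total else 0.0
        denominator = precision + recall
        f1 = 2 * precision * recall / denominator if denominator else 0.0
        metrics[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": actual_total,
        }
    return metrics
```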

## Create another model using training data with better balanced classes
- take 100% of the Good class
- take 30% of the Standard class
- take 55% of the Poor class

In [None]:
-- take sample data for better balanced classes
create or replace table train_data_bal as
select * from train_data where credit_score = 'Good' 
union all
select * from (select * from train_data where credit_score = 'Standard') SAMPLE (30) 
union all
select * from (select * from train_data where credit_score = 'Poor') SAMPLE (55);

select credit_score, count(*) as cnt from train_data_bal group by all;
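The per-class sampling above can be sketched in plain Python as follows. Each row is kept with a probability that depends on its class, which mirrors sampling each class at a different percentage. The function name, the fixed seed, and the `label_key` parameter are assumptions for illustration.

```python
import random


def downsample_by_class(rows, fractions, label_key="credit_score", seed=7):
    """Keep each row with the probability assigned to its class.

    Classes missing from `fractions` are kept in full.
    """
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    return [
        row for row in rows
        if rng.random() < fractions.get(row[label_key], 1.0)
    ]


# The fractions used in this notebook
FRACTIONS = {"Good": 1.0, "Standard": 0.30, "Poor": 0.55}
```

As with the SQL version, the resulting class sizes are approximate because every row is sampled independently.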

## Train the model and predict
Predict on the same test data as before.

In [None]:
create or replace SNOWFLAKE.ML.CLASSIFICATION credit_scoring_model_bal (
    input_data => TABLE(train_data_bal),
    target_colname => 'credit_score'
);

create or replace table model_output_bal as
select *, 
  credit_scoring_model_bal!PREDICT(
    input_data => object_construct(*))
    as prediction
from test_data;

## Evaluate the new model
Plot the confusion matrix

In [None]:
select credit_score, prediction:"class"::varchar as class, count(cust_id) as cnt from model_output_bal group by all;

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Confusion_matrix_SQL_bal is assumed to be the name of the SQL cell above
df = Confusion_matrix_SQL_bal.to_pandas()
# Rows are the actual classes, columns are the predicted classes
heatmap_data = df.pivot_table(index='CREDIT_SCORE', columns='CLASS', values='CNT', aggfunc='sum', fill_value=0)

fig, ax = plt.subplots(figsize=(4, 3))
sns.heatmap(heatmap_data, annot=True, cmap='YlGnBu', fmt='d', ax=ax)


In [None]:
call credit_scoring_model_bal!SHOW_EVALUATION_METRICS();

## The importance of exploratory data analysis
- find the distribution of feature values to identify underrepresented groups
- identify outliers and remove them if they distort the distribution
- deal with missing values (remove, impute, average, etc.)
- calculate aggregated values (features)
- balance the classes
- take data samples that represent the feature groups more evenly

## Next steps
Explore some of the newer functionality related to machine learning, such as:
- Register the model in the model registry for use by the organization (https://docs.snowflake.com/en/developer-guide/snowflake-ml/model-registry/overview)
- Use the feature store to calculate aggregated or derived values on a regular basis (https://docs.snowflake.com/en/developer-guide/snowflake-ml/feature-store/overview)