## Credit score modeling with Layer
[![Open in Layer](https://development.layer.co/assets/badge.svg)](https://development.layer.co/layer/credit-score) [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/layerai/examples/blob/main/credit-score/credit-score.ipynb) [![Layer Examples Github](https://badgen.net/badge/icon/github?icon=github&label)](https://github.com/layerai/examples/tree/main/credit-score)

In this project we use Layer to build a credit scoring model. The project uses the [Home Credit Default Risk dataset](https://www.kaggle.com/c/home-credit-default-risk/overview) that is hosted on Kaggle.

## What are we going to learn?
- Creating features in Python
- Fetching features and datasets and using them to train a model in Layer
- Using multiple datasets and featuresets in a Layer project
- Tracking experiments by:
     - logging model parameters
     - logging model evaluation metrics

### Install Layer

In [1]:
!pip install layer -qqq


### Authenticate your Layer account

In [2]:
import layer
from layer.decorators import model, fabric, dataset

In [3]:
!layer --version

Layer, version 0.9.345059


In [None]:
layer.login()

### Initialize a Layer project

In [5]:
# init Layer
layer.init("credit-score")

Project(name='credit-score', raw_datasets=[], derived_datasets=[], featuresets=[], models=[], path=PosixPath('.'), project_files_hash='', readme='', organization=Organization(id=UUID('d7325da3-0646-4fa6-855d-8d19eece8b79'), name='layer'), _id=UUID('da40128d-bc11-473d-af2c-bc38c8fd2dc7'), functions=[])

### Dataset definition
The first step is to define all the datasets that we will use in this project. We will use the following datasets:

- The application data
- The installment payments data
- The previous application data

In Layer, we define datasets using the `dataset` decorator. 

Passing a decorated function to `layer.run` executes it and saves the resulting data to Layer, so that later steps can fetch it with `layer.get_dataset` instead of rebuilding it.

In [55]:
@dataset("installments_payments")
def read_installments_data():
    import pandas as pd
    df = pd.read_csv("https://raw.githubusercontent.com/layerml/layerv2_credit_score/main/installments_payments.csv")
    return df

In [56]:
layer.run([read_installments_data])

Output()

Run(project_name='credit-score')

In [59]:
@dataset("previous_application")
def read_previous_application_data():
    import pandas as pd
    df = pd.read_csv("https://raw.githubusercontent.com/layerml/layerv2_credit_score/main/previous_application.csv")
    df['APPLIED_AWARDED_AMOUNT_DIFF'] = df['AMT_CREDIT'] - df['AMT_APPLICATION']
    df['GOODS_PRICE_APPLIED_DIFF'] = df['AMT_GOODS_PRICE'] - df['AMT_APPLICATION']
    df = df[['SK_ID_PREV', 'SK_ID_CURR', 'APPLIED_AWARDED_AMOUNT_DIFF','GOODS_PRICE_APPLIED_DIFF']]
    return df

In [60]:
layer.run([read_previous_application_data])

Output()

Run(project_name='credit-score')
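
As a rough mental model for the decorate-then-run pattern used above, the following dependency-free sketch registers a function under a name, runs it once, and serves the stored result afterwards. It is not Layer's implementation: `toy_dataset`, `toy_run`, `toy_get_dataset`, and the two sample rows are hypothetical names and values invented for illustration, and the real service stores data remotely rather than in a local dictionary.

```python
from typing import Callable, Dict, List

# Hypothetical in-memory store; Layer's real storage is remote and is not modeled here.
_STORE: Dict[str, List[dict]] = {}
build_calls = 0  # counts how often the builder function actually runs


def toy_dataset(name: str):
    """Attach a dataset name to the decorated function (illustrative only)."""
    def decorator(func: Callable[[], List[dict]]):
        func.toy_name = name
        return func
    return decorator


def toy_run(funcs):
    """Execute each builder once and store its result under its name."""
    for func in funcs:
        _STORE[func.toy_name] = func()


def toy_get_dataset(name: str) -> List[dict]:
    """Return the stored rows without re-running the builder."""
    return _STORE[name]


@toy_dataset("installments_demo")
def read_installments_demo():
    global build_calls
    build_calls += 1
    # Two made-up rows standing in for a CSV download.
    return [
        {"SK_ID_PREV": 1, "SK_ID_CURR": 10, "AMT_PAYMENT": 100.0},
        {"SK_ID_PREV": 2, "SK_ID_CURR": 11, "AMT_PAYMENT": 250.0},
    ]


toy_run([read_installments_demo])
rows_first = toy_get_dataset("installments_demo")
rows_second = toy_get_dataset("installments_demo")
print(len(rows_first), build_calls)
```

Fetching the stored rows twice does not call the builder again, which is the convenience the `run` step is described as providing.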

### Create features
Next, derive features from the application data and register them as a dataset.

In [63]:
@dataset("application_features")
def extract_application_features():
    import pandas as pd
    df = pd.read_csv("https://raw.githubusercontent.com/layerml/layerv2_credit_score/main/application_train.csv")
    # credit amount relative to the client's income
    df['CREDIT_INCOME_RATIO'] = df['AMT_CREDIT'] / df['AMT_INCOME_TOTAL']
    # loan annuity relative to the client's income
    df['ANNUITY_INCOME_RATIO'] = df['AMT_ANNUITY'] / df['AMT_INCOME_TOTAL']
    # annuity relative to the credit amount (a payment rate; its inverse approximates the loan term in payment periods)
    df['CREDIT_TERM'] = df['AMT_ANNUITY'] / df['AMT_CREDIT']
    # difference between the price of the goods and the credit amount
    df['GOODS_PRICE_LOAN_DIFFERENCE'] = df['AMT_GOODS_PRICE'] - df['AMT_CREDIT']
    # days employed relative to the age of the client in days
    df['DAYS_EMPLOYED_RATIO'] = df['DAYS_EMPLOYED'] / df['DAYS_BIRTH']
    df = df[['TARGET', 'SK_ID_CURR', 'ANNUITY_INCOME_RATIO', 'CREDIT_INCOME_RATIO',
             'CREDIT_TERM', 'DAYS_EMPLOYED_RATIO', 'GOODS_PRICE_LOAN_DIFFERENCE',
             'REGION_RATING_CLIENT_W_CITY', 'OWN_CAR_AGE', 'DAYS_BIRTH',
             'REGION_RATING_CLIENT', 'REG_CITY_NOT_WORK_CITY',
             'LIVE_CITY_NOT_WORK_CITY', 'DAYS_REGISTRATION', 'DAYS_ID_PUBLISH',
             'FLAG_DOCUMENT_3']]
    return df

Calling the `run` command saves the new features so that you don't have to run the feature extraction functions again. 

In [64]:
layer.run([extract_application_features])

Output()

Run(project_name='credit-score')
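
To make the derived columns concrete, here is a small dependency-free sketch that applies the same arithmetic as `extract_application_features` to a single applicant. The applicant values are invented for illustration, and `application_ratios` is a hypothetical helper, not part of the project. In the Home Credit data the `DAYS_*` columns are counted backwards from the application date, so they are negative and their ratio comes out positive.

```python
def application_ratios(row: dict) -> dict:
    """Compute the derived application features for one applicant row."""
    income = row["AMT_INCOME_TOTAL"]
    credit = row["AMT_CREDIT"]
    return {
        # credit amount relative to income
        "CREDIT_INCOME_RATIO": credit / income,
        # annuity relative to income
        "ANNUITY_INCOME_RATIO": row["AMT_ANNUITY"] / income,
        # annuity relative to credit amount (a payment rate)
        "CREDIT_TERM": row["AMT_ANNUITY"] / credit,
        # goods price minus credit amount
        "GOODS_PRICE_LOAN_DIFFERENCE": row["AMT_GOODS_PRICE"] - credit,
        # share of the client's life (in days) spent in the current job
        "DAYS_EMPLOYED_RATIO": row["DAYS_EMPLOYED"] / row["DAYS_BIRTH"],
    }


# Made-up applicant used only to illustrate the arithmetic.
applicant = {
    "AMT_INCOME_TOTAL": 100000.0,
    "AMT_CREDIT": 400000.0,
    "AMT_ANNUITY": 20000.0,
    "AMT_GOODS_PRICE": 360000.0,
    "DAYS_EMPLOYED": -2000,
    "DAYS_BIRTH": -16000,
}

features = application_ratios(applicant)
for name, value in features.items():
    print(f"{name}: {value}")
```

For this applicant the credit is four times the annual income, and the annuity is 5% of the credit amount, which would correspond to roughly 20 payment periods if interest were ignored.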

## Model definition
In Layer, models are defined using the `model` decorator. In this case we decorate a function called `train`, but you can give this function any name you prefer, e.g. `train_model`.

In [67]:
@fabric("f-medium")
@model(name="credit_score_model")
def train():
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.compose import ColumnTransformer
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import average_precision_score, roc_auc_score, roc_curve,precision_recall_curve
    from sklearn.ensemble import HistGradientBoostingClassifier
    application_features =  layer.get_dataset('layer/credit-score/datasets/application_features').to_pandas()
    previous_application_features = layer.get_dataset('layer/credit-score/datasets/previous_application').to_pandas()
    installments_payments = layer.get_dataset('layer/credit-score/datasets/installments_payments').to_pandas()
    dff = installments_payments.merge(previous_application_features, on=['SK_ID_PREV', 'SK_ID_CURR']).merge(application_features,on=['SK_ID_CURR'])
    X = dff.drop(["TARGET", "SK_ID_CURR",'index'], axis=1)
    y = dff["TARGET"]
    random_state = 13
    test_size = 0.3
    categories = dff.select_dtypes(include=['object']).columns.tolist() 
    transformer = ColumnTransformer(transformers=[('cat', OneHotEncoder(handle_unknown='ignore', drop="first"), categories)],remainder='passthrough')
    X = transformer.fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size,
                                                        random_state=random_state)
  

    # Model parameters
    learning_rate = 0.01
    max_depth = 6
    min_samples_leaf = 10
    random_state = 42
    early_stopping = True
    # Log model parameters
    layer.log(
      {
      "min_samples_leaf":min_samples_leaf,
      "learning_rate":learning_rate,
      "random_state":random_state,
      "early_stopping":early_stopping,
      "max_depth":max_depth
      })
    # Model: Define a HistGradient Boosting Classifier
    model = HistGradientBoostingClassifier(learning_rate=learning_rate,max_depth=max_depth,min_samples_leaf=min_samples_leaf,early_stopping=early_stopping,random_state=random_state)

    # Fit the model
    model.fit(X_train, y_train)
    # Predict probabilities of target
    probs = model.predict_proba(X_test)[:,1]
    # Calculate average precision and area under the receiver operating characteristic curve (ROC AUC)
    avg_precision = average_precision_score(y_test, probs, pos_label=1)
    auc = roc_auc_score(y_test, probs)
    layer.log({"AUC":f'{auc:.4f}'})
    layer.log({"avg_precision":f'{avg_precision:.4f}'})
    plt.figure(figsize=(30,12))
    plt.subplot(1,2,1)
    plt.title("ROC Curve")
    # plot no skill roc curve
    plt.plot([0, 1], [0, 1], linestyle='--', label='No Skill')
    # calculate roc curve for model
    fpr, tpr, _ = roc_curve(y_test, probs)
    # plot model roc curve
    plt.plot(fpr, tpr, marker='.', label='HistGradientBoosting')
    # axis labels
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    # show the legend
    plt.legend()
    # calculate the precision-recall auc
    precision, recall, _ = precision_recall_curve(y_test, probs)
    # calculate the no skill line as the proportion of the positive class
    no_skill = len(y[y==1]) / len(y)
    plt.subplot(1,2,2)
    plt.title("Precision Recall curve")
    # plot the no skill precision-recall curve
    plt.plot([0, 1], [no_skill, no_skill], linestyle='--', label='No Skill')
    # plot the model precision-recall curve
    plt.plot(recall, precision, marker='.', label='HistGradientBoosting')
    # axis labels
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    # show the legend
    plt.legend()
    # log the figure before showing it, since plt.show() may close the current figure
    layer.log({"Curves":plt.gcf()})
    # show the plot
    plt.show()
    
    return model

Passing the `train` function to `layer.run` runs all your model code on Layer infrastructure and stores the resulting model in Layer. The model is then available for inference immediately. Layer also saves everything logged inside the function, such as metrics, parameters, and images.

In [None]:
train()

In [68]:
# Run the whole project on Layer infrastructure
layer.run([train])

Output()

Run(project_name='credit-score')

You can also train the model on your own infrastructure by calling the `train` function directly, as in the `train()` cell above. The code executes locally, but the resulting model is still saved to Layer.
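
The `AUC` value logged by `train` can be read as the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter. The sketch below computes that quantity directly by comparing every positive/negative pair, and also computes the "no skill" precision baseline drawn in the precision-recall plot. It is a plain-Python illustration with made-up labels and scores, not the routine scikit-learn uses internally; `pairwise_auc` and `no_skill_precision` are hypothetical helper names.

```python
from typing import Sequence


def pairwise_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


def no_skill_precision(y_true: Sequence[int]) -> float:
    """Precision of a classifier that flags everyone: the positive class share."""
    return sum(y_true) / len(y_true)


# Made-up labels and scores for illustration.
y_demo = [0, 0, 1, 1]
scores_demo = [0.1, 0.4, 0.35, 0.8]

auc_demo = pairwise_auc(y_demo, scores_demo)
baseline_demo = no_skill_precision(y_demo)
print(auc_demo, baseline_demo)  # 0.75 0.5
```

The pairwise loop is quadratic in the number of samples, so it is only suitable for building intuition on small inputs.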

## Using Layer entities
After running the project, you can start using the Layer entities immediately. For example, you can fetch the trained model and use it to make predictions.

In [73]:
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
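# Fetch the trained model object from Layer
credit_model = layer.get_model('layer/credit-score/models/credit_score_model').get_train()
# One sample row; the values are expected to follow the column order used during training
data = np.array([[1731690, -1916.0, -1953.0, 6953.31, 6953.31, 1731690, 0, 0, 1731690, 0.2976, 7.47512, 0.039812, 1731690, 0.189752, -161451.0, 1731690, 1731690, 1731690, 1731690, 1, -16074.0, 1731690, 0.0]])
# No categorical columns are listed, so the transformer passes every value through unchanged
categories = []
transformer = ColumnTransformer(
        transformers=[('cat', OneHotEncoder(handle_unknown='ignore', drop="first"), categories)],
        remainder='passthrough')
data = transformer.fit_transform(data)
# Predict the class label for the sample
credit_model.predict(data)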


array([0])

In [74]:
credit_model.predict_proba(data)

array([[0.93264026, 0.06735974]])
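
The two numbers returned by `predict_proba` above are the estimated probabilities of class 0 and class 1, in that order, since scikit-learn orders the columns by sorted class label. In the Home Credit data, `TARGET` equal to 1 marks a client with payment difficulties. The sketch below shows how a class label follows from the second column and a cutoff; `decide` and the stricter 0.05 cutoff are hypothetical and only illustrate that a lender could choose a threshold other than the default.

```python
from typing import Sequence


def decide(proba_row: Sequence[float], threshold: float = 0.5) -> int:
    """Return 1 (predicted default) when P(class 1) exceeds the threshold."""
    p_default = proba_row[1]
    return 1 if p_default > threshold else 0


# Probabilities copied from the predict_proba output above.
proba = [0.93264026, 0.06735974]

label_default = decide(proba)                  # 0.5 cutoff, matches predict()
label_strict = decide(proba, threshold=0.05)   # hypothetical stricter cutoff
print(label_default, label_strict)  # 0 1
```

With the 0.5 cutoff the sample is classified as 0, in line with the `array([0])` returned by `predict`, while a cutoff of 0.05 would flag the same applicant.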

## Where to go from here?

Now that you have created this credit score project, you can:

- Join our [Slack Community](https://bit.ly/layercommunityslack)
- Visit [Layer Examples Repo](https://github.com/layerai/examples) for more examples
- Browse [Trending Layer Projects](https://layer.ai) on our main page
- Check out [Layer Documentation](https://docs.layer.ai) to learn more