![giskard_logo.png](https://raw.githubusercontent.com/Giskard-AI/giskard/main/readme/Logo_full_darkgreen.png)

# About Giskard

Open-Source CI/CD platform for ML teams. Deliver ML products, better & faster. 

*   Collaborate faster with feedback from business stakeholders.
*   Deploy automated tests to eliminate regressions, errors & biases.

🏡 [Website](https://giskard.ai/)

📗 [Documentation](https://docs.giskard.ai/)

## Installing `giskard`

In [1]:
!pip install giskard

Defaulting to user installation because normal site-packages is not writeable
Collecting giskard
  Downloading giskard-1.8.0-py3-none-any.whl (69 kB)
Collecting pydantic<2.0.0,>=1.10.2
  Downloading pydantic-1.10.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
Collecting grpcio<2.0.0,>=1.46.3
  Downloading grpcio-1.51.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.8 MB)
Collecting tenacity<9.0.0,>=8.1.0
  Downloading tenacity-8.2.2-py3-none-any.whl (24 kB)
Collecting protobuf<4.0.0,>=3.9.2
  Downloading protobuf-3.20.3-cp310-cp310

## Connect the external worker in daemon mode

In [1]:
!giskard worker start -d

2023-03-05 20:41:51,220 pid:1447 MainThread giskard.cli  INFO     Starting ML Worker client daemon
2023-03-05 20:41:51,220 pid:1447 MainThread giskard.cli  INFO     Python: /usr/bin/python3 (3.10.6)
2023-03-05 20:41:51,220 pid:1447 MainThread giskard.cli  INFO     Giskard Home: /home/mathro/giskard-home
2023-03-05 20:41:51,221 pid:1447 MainThread giskard.cli_utils INFO     Writing logs to /home/mathro/giskard-home/run/ml-worker.log


# Start by creating an ML model 🚀🚀🚀

Let's create a credit scoring model using the German Credit Scoring dataset ([link](https://github.com/Giskard-AI/giskard-client/tree/main/sample_data/classification) to download the dataset).

In [56]:
import pandas as pd
import numpy as np
import random

from sklearn import model_selection
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

In [6]:
# To download and read the credit scoring dataset
url = 'https://raw.githubusercontent.com/Giskard-AI/examples/main/datasets/credit_scoring_classification_model_dataset/german_credit_prepared.csv'
credit = pd.read_csv(url, sep=',',engine="python") #To download go to https://github.com/Giskard-AI/giskard-client/tree/main/sample_data/classification

In [7]:
# Declare the type of each column in the dataset(example: category, numeric, text)
column_types = {'default':"category",
               'account_check_status':"category", 
               'duration_in_month':"numeric",
               'credit_history':"category",
               'purpose':"category",
               'credit_amount':"numeric",
               'savings':"category",
               'present_employment_since':"category",
               'installment_as_income_perc':"numeric",
               'sex':"category",
               'personal_status':"category",
               'other_debtors':"category",
               'present_residence_since':"numeric",
               'property':"category",
               'age':"numeric",
               'other_installment_plans':"category",
               'housing':"category",
               'credits_this_bank':"numeric",
               'job':"category",
               'people_under_maintenance':"numeric",
               'telephone':"category",
               'foreign_worker':"category"}
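
Writing this dictionary by hand is error-prone on wide datasets. As a rough sketch (this is not part of the Giskard API), a first draft could be inferred from the pandas dtypes and then corrected manually. The helper `infer_column_types` and the toy DataFrame are hypothetical names introduced for illustration, and the rule "numeric dtype means numeric, anything else means category" is an assumption that ignores free-text columns.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def infer_column_types(df):
    """Map each column to "numeric" or "category" based on its pandas dtype (hypothetical helper)."""
    return {col: "numeric" if is_numeric_dtype(df[col]) else "category" for col in df.columns}

# Toy data standing in for the credit DataFrame
toy = pd.DataFrame({
    "age": [25, 40],
    "purpose": ["car", "tv"],
    "default": ["Default", "Not default"],
})
inferred_types = infer_column_types(toy)
```

The result should still be reviewed by hand, since an integer-coded category would be classified as numeric by this rule.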

In [8]:
# feature_types is used to declare the features the model is trained on
feature_types = {i:column_types[i] for i in column_types if i!='default'}

# Pipeline to fill missing values, transform and scale the numeric columns
columns_to_scale = [key for key in feature_types.keys() if feature_types[key]=="numeric"]
numeric_transformer = Pipeline([('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# Pipeline to fill missing values and one hot encode the categorical values
columns_to_encode = [key for key in feature_types.keys() if feature_types[key]=="category"]
categorical_transformer = Pipeline([
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore',sparse=False)) ])

# Perform preprocessing of the columns with the above pipelines
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, columns_to_scale),
      ('cat', categorical_transformer, columns_to_encode)
          ]
)

# Pipeline for the model Logistic Regression
clf_logistic_regression = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(max_iter =1000))])

# Split the data into train and test
Y=credit['default']
X= credit.drop(columns="default")
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.20,random_state = 30, stratify = Y)
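
A note on compatibility: the `sparse` argument of `OneHotEncoder` used above was renamed to `sparse_output` in scikit-learn 1.2 and the old name was removed in 1.4. A minimal sketch of a version-tolerant constructor, assuming you cannot pin the scikit-learn version (`make_dense_onehot` is a hypothetical helper, not something from the notebook):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

def make_dense_onehot():
    """Build a dense one-hot encoder on both old and recent scikit-learn versions."""
    try:
        # scikit-learn >= 1.2
        return OneHotEncoder(handle_unknown="ignore", sparse_output=False)
    except TypeError:
        # Older versions only know the `sparse` argument
        return OneHotEncoder(handle_unknown="ignore", sparse=False)

# Three rows with two distinct categories give a dense (3, 2) array
encoded = make_dense_onehot().fit_transform(np.array([["a"], ["b"], ["a"]], dtype=object))
```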

In [9]:
# Fit and score your model
clf_logistic_regression.fit(X_train, Y_train)
clf_logistic_regression.score(X_test, Y_test)

0.755

In [10]:
# Prepare data to upload on Giskard
train_data = pd.concat([X_train, Y_train], axis=1)
test_data = pd.concat([X_test, Y_test ], axis=1)

# Upload the model in Giskard 🚀🚀🚀

### Initiate a project

In [12]:
from giskard import GiskardClient

url = "http://localhost:19000" #if Giskard is installed locally (for installation, see: https://docs.giskard.ai/start/guides/installation)
#url = "http://app.giskard.ai" # If you want to upload on giskard URL
token = "eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiJhZG1pbiIsInRva2VuX3R5cGUiOiJBUEkiLCJhdXRoIjoiUk9MRV9BRE1JTiIsImV4cCI6MTY4NTQ2MDI1OX0.NuosCjh2EhAiCc7d411quTY89bAv8qfBIqpVJD1f6yo" #you can generate your API token in the Admin tab of the Giskard application (for installation, see: https://docs.giskard.ai/start/guides/installation)

client = GiskardClient(url, token)

# your_project = client.create_project("project_key", "PROJECT_NAME", "DESCRIPTION")
# Choose the arguments you want. But "project_key" should be unique and in lower case
# credit_scoring = client.create_project("credit_scoring", "German Credit Scoring", "Project to predict if user will default")

# If you've already created a project with the key "credit_scoring" use
credit_scoring = client.get_project("credit_scoring")
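
Rather than pasting the API token into the notebook, it may be safer to read it from the environment. The sketch below is only a suggestion: the variable names `GISKARD_URL` and `GISKARD_TOKEN` and the helper `load_giskard_settings` are hypothetical, chosen for this example.

```python
import os

def load_giskard_settings(env=os.environ):
    """Read the Giskard URL and API token from environment variables (hypothetical names)."""
    url = env.get("GISKARD_URL", "http://localhost:19000")
    token = env.get("GISKARD_TOKEN")
    if not token:
        raise RuntimeError("Set the GISKARD_TOKEN environment variable before running this cell")
    return url, token

# Example with an explicit mapping instead of the real environment
demo_url, demo_token = load_giskard_settings({"GISKARD_TOKEN": "my-secret-token"})
# The values could then be passed to the client as above: GiskardClient(demo_url, demo_token)
```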


### Upload your model and a dataset (see [documentation](https://docs.giskard.ai/start/guides/upload-your-model))

In [13]:
credit_scoring.upload_model_and_df(
    prediction_function=clf_logistic_regression.predict_proba, # Python function which takes pandas dataframe as input and returns probabilities for classification model OR returns predictions for regression model
    model_type='classification', # "classification" for classification model OR "regression" for regression model
    df=test_data, # the dataset you want to use to inspect your model
    column_types=column_types, # A dictionary with columns names of df as key and types(category, numeric, text) of columns as values
    target='default', # The column name in df corresponding to the actual target variable (ground truth).
    feature_names=list(feature_types.keys()), # List of the feature names of prediction_function
    classification_labels=clf_logistic_regression.classes_ ,  # List of the classification labels of your prediction
    model_name='logistic_regression_v1', # Name of the model
    dataset_name='test_data' # Name of the dataset
)



Dataset successfully uploaded to project key 'credit_scoring' with ID = 27. It is available at http://localhost:19000 
Model successfully uploaded to project key 'credit_scoring' with ID = 28. It is available at http://localhost:19000 


(28, 27)

### 🌟 If you want to upload a dataset without a model






For example, let's upload the train set to Giskard. This is key to creating drift tests in Giskard.


In [14]:
credit_scoring.upload_df(
    df=train_data, # The dataset you want to upload
    column_types=column_types, # All the column types of df
    target="default", # Do not pass this parameter if dataset doesn't contain target column
    name="train_data" # Name of the dataset
)

Dataset successfully uploaded to project key 'credit_scoring' with ID = 29. It is available at http://localhost:19000 




29
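
To give an intuition of what a drift test compares, here is a minimal sketch using SciPy's two-sample Kolmogorov-Smirnov test on a single numeric feature. It is an illustration under my own assumptions, not Giskard's implementation: `has_drifted`, the 0.05 threshold and the toy arrays are all hypothetical.

```python
import numpy as np
from scipy.stats import ks_2samp

def has_drifted(reference, current, alpha=0.05):
    """Flag drift when the KS test rejects 'both samples come from the same distribution'."""
    result = ks_2samp(reference, current)
    return bool(result.pvalue < alpha), float(result.statistic)

# Toy "age" feature: one reference sample, one identical sample, one strongly shifted sample
reference_ages = np.arange(20, 70)
same_flag, same_stat = has_drifted(reference_ages, reference_ages)
shifted_flag, shifted_stat = has_drifted(reference_ages, reference_ages + 100)
```

Here the reference sample plays the role of the train set and the second sample the role of new data.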

You can also upload new production data to use as a validation set for your existing model. In that case, you might not have the ground truth target variable.

In [15]:
production_data = credit.drop(columns="default")

In [16]:
credit_scoring.upload_df(
    df=production_data, # The dataset you want to upload
    column_types=feature_types, # All the column types without the target
    name="production_data"# Name of the dataset
)



Dataset successfully uploaded to project key 'credit_scoring' with ID = 30. It is available at http://localhost:19000 


30

### 🌟 If you just want to upload a model without a dataframe 

This happens, for instance, when you have built a new version of the model and you want to inspect it using a validation dataframe that is already in Giskard.

For example, let's create a second version of the model using a random forest.

In [17]:
clf_random_forest = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', RandomForestClassifier(max_depth=10,random_state=0))])

clf_random_forest.fit(X_train, Y_train)
clf_random_forest.score(X_test, Y_test)

0.76

In [18]:
credit_scoring.upload_model(
    prediction_function=clf_random_forest.predict_proba, # Python function which takes pandas dataframe as input and returns probabilities for classification model OR returns predictions for regression model
    model_type='classification', # "classification" for classification model OR "regression" for regression model
    feature_names=list(feature_types.keys()), # List of the feature names of prediction_function
    name='random_forest', # Name of the model
    validate_df=train_data, # Optional. Validation df is not uploaded in the app, it's only used to check whether the model has the expected format
    target="default", # Optional. target should be a column of validate_df. Pass this parameter only if validate_df is being passed
    classification_labels=["Default","Not default"] # List of the classification labels of your prediction

)

Model successfully uploaded to project key 'credit_scoring' with ID = 31. It is available at http://localhost:19000 


31

### Happy Exploration ! 🧑‍🚀


### Analysis

In [14]:
credit.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 22 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   default                     1000 non-null   object
 1   account_check_status        1000 non-null   object
 2   duration_in_month           1000 non-null   int64 
 3   credit_history              1000 non-null   object
 4   purpose                     1000 non-null   object
 5   credit_amount               1000 non-null   int64 
 6   savings                     1000 non-null   object
 7   present_employment_since    1000 non-null   object
 8   installment_as_income_perc  1000 non-null   int64 
 9   sex                         1000 non-null   object
 10  personal_status             1000 non-null   object
 11  other_debtors               1000 non-null   object
 12  present_residence_since     1000 non-null   int64 
 13  property                    1000 non-null   objec

In [15]:
credit.describe()

| | duration_in_month | credit_amount | installment_as_income_perc | present_residence_since | age | credits_this_bank | people_under_maintenance |
| --- | --- | --- | --- | --- | --- | --- | --- |
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 20.903 | 3271.258 | 2.973 | 2.845 | 35.546 | 1.407 | 1.155 |
| std | 12.058814 | 2822.736876 | 1.118715 | 1.103718 | 11.375469 | 0.577654 | 0.362086 |
| min | 4.0 | 250.0 | 1.0 | 1.0 | 19.0 | 1.0 | 1.0 |
| 25% | 12.0 | 1365.5 | 2.0 | 2.0 | 27.0 | 1.0 | 1.0 |
| 50% | 18.0 | 2319.5 | 3.0 | 3.0 | 33.0 | 1.0 | 1.0 |
| 75% | 24.0 | 3972.25 | 4.0 | 4.0 | 42.0 | 2.0 | 1.0 |
| max | 72.0 | 18424.0 | 4.0 | 4.0 | 75.0 | 4.0 | 2.0 |


In [38]:
credit.nunique()

default                         2
account_check_status            4
duration_in_month              33
credit_history                  5
purpose                        10
credit_amount                 921
savings                         5
present_employment_since        5
installment_as_income_perc      4
sex                             2
personal_status                 3
other_debtors                   3
present_residence_since         4
property                        4
age                            53
other_installment_plans         3
housing                         3
credits_this_bank               4
job                             4
people_under_maintenance        2
telephone                       2
foreign_worker                  2
dtype: int64

In [None]:
columns=credit.keys().tolist()

### Feature selection

I decided to select the numerical features with high cardinality: `duration_in_month`, `credit_amount` and `age`.

### 1st method - Global Data Augmentation 

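I first had to choose between two classic methods: oversampling and undersampling. I found it more interesting to work with oversampling for this first part.

The idea is to create new rows whose values are very close to those of the 1000 original rows: for each selected feature, the new value is drawn from a normal distribution centered on the original value, with a standard deviation of roughly one tenth of the feature's standard deviation.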

In [41]:
credit_augmented = credit.copy()

In [42]:
from scipy.stats import norm

def random_normal_integer(mean,std):
    """
    Returns a random integer drawn from a normal distribution (inverse transform sampling),
    truncated toward zero and clamped to a minimum of 1
    """
    new_value = norm.ppf(np.random.random(1), loc=mean, scale=std).astype(int)[0]
    new_value = max(new_value,1)
    return(new_value)

In [43]:
data_aug_data = {
    'duration_in_month':{
        "std" : credit['duration_in_month'].std().round().astype(int)/10,
    },
    'credit_amount':{
        "std" : credit['credit_amount'].std().round().astype(int)/10,
    },
    'age':{
        "std" : credit['age'].std().round().astype(int)/10,
    },
}

In [44]:
def data_augmentation(row,data_aug_data=data_aug_data):
    for el in data_aug_data.keys():
        row[el] = random_normal_integer(row[el],data_aug_data[el]["std"])
    return(row)

In [45]:
credit_augmented = credit_augmented.apply(data_augmentation,axis=1)

In [46]:
augmented_data = pd.concat([credit,credit_augmented],ignore_index=True)
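
The row-by-row `apply` above can be slow on larger datasets. As a hedged alternative, here is a vectorized sketch of the same idea on a toy DataFrame. It is not the notebook's exact procedure: it uses `numpy.random.default_rng` instead of `norm.ppf`, and it takes exactly one tenth of the unrounded standard deviation. The names `jitter_numeric`, `scale` and `seed` are hypothetical.

```python
import numpy as np
import pandas as pd

def jitter_numeric(df, columns, scale=0.1, seed=0):
    """Return a copy of df where each listed column gets Gaussian noise around its own values."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for col in columns:
        std = df[col].std() * scale  # noise level: a fraction of the column's spread
        noisy = rng.normal(loc=df[col].to_numpy(), scale=std)
        out[col] = np.maximum(noisy.astype(int), 1)  # integer values, never below 1
    return out

# Toy data standing in for the credit DataFrame
toy = pd.DataFrame({
    "age": [25, 40, 61],
    "credit_amount": [1200, 5000, 9000],
    "purpose": ["car", "tv", "car"],
})
toy_augmented = pd.concat([toy, jitter_numeric(toy, ["age", "credit_amount"])], ignore_index=True)
```

As in the notebook, the non-numeric columns (and therefore the target) are copied unchanged, so each synthetic row is a near-duplicate of an original one.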

In [50]:
# Split the data into train and test
Y=augmented_data['default']
X= augmented_data.drop(columns="default")
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.20,random_state = 30, stratify = Y)
# Fit and score your model
clf_logistic_regression.fit(X_train, Y_train)
clf_logistic_regression.score(X_test, Y_test)
# Prepare data to upload on Giskard
train_data = pd.concat([X_train, Y_train], axis=1)
test_data = pd.concat([X_test, Y_test ], axis=1)

credit_scoring.upload_model_and_df(
    prediction_function=clf_logistic_regression.predict_proba, # Python function which takes pandas dataframe as input and returns probabilities for classification model OR returns predictions for regression model
    model_type='classification', # "classification" for classification model OR "regression" for regression model
    df=test_data, # the dataset you want to use to inspect your model
    column_types=column_types, # A dictionary with columns names of df as key and types(category, numeric, text) of columns as values
    target='default', # The column name in df corresponding to the actual target variable (ground truth).
    feature_names=list(feature_types.keys()), # List of the feature names of prediction_function
    classification_labels=clf_logistic_regression.classes_ ,  # List of the classification labels of your prediction
    model_name='logistic_regression_global_augmented_data', # Name of the model
    dataset_name='test_data_global_augmented_data' # Name of the dataset
)



Dataset successfully uploaded to project key 'credit_scoring' with ID = 719. It is available at http://localhost:19000 
Model successfully uploaded to project key 'credit_scoring' with ID = 720. It is available at http://localhost:19000 


(720, 719)

In [51]:
credit_scoring.upload_df(
    df=train_data, # The dataset you want to upload
    column_types=column_types, # All the column types of df, including the target column
    name="train_data_global_augmented_data"# Name of the dataset
)

Dataset successfully uploaded to project key 'credit_scoring' with ID = 721. It is available at http://localhost:19000 




721

With the Giskard platform, we get for the original model:

*   Accuracy on the original test set: 0.76
*   F1 score on the original test set: 0.83
*   Accuracy difference: 0.01
*   F1 difference: 0.01

With the Giskard platform, we get for the first method:

*   Accuracy on the original test set: 0.80
*   F1 score on the original test set: 0.86
*   Accuracy difference: 0.04
*   F1 difference: 0.03

The overfitting is higher than for the original model, but it remains small overall.
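
The figures above come from the Giskard platform. To reproduce similar checks locally, the sketch below computes accuracy and F1 on the train and test splits and reports the gap between them. It assumes that the "difference" values are train-minus-test gaps, which the notebook does not state explicitly, and it uses synthetic data from `make_classification`; `train_test_gap` is a hypothetical helper.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

def train_test_gap(model, X_tr, y_tr, X_te, y_te):
    """Return train score, test score and their gap for accuracy and F1."""
    scores = {}
    for name, metric in (("accuracy", accuracy_score), ("f1", f1_score)):
        train_score = metric(y_tr, model.predict(X_tr))
        test_score = metric(y_te, model.predict(X_te))
        scores[name] = {"train": train_score, "test": test_score, "gap": train_score - test_score}
    return scores

# Synthetic binary classification data standing in for the credit dataset
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=30, stratify=y_demo
)
demo_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
gaps = train_test_gap(demo_model, X_tr, y_tr, X_te, y_te)
```

A gap that grows after augmentation suggests the model fits the training rows better than it generalizes.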

### 2nd method - Low performance features data augmentation

Here again, I decided to use heuristic oversampling methods rather than undersampling methods.

In [84]:
def pick_rows_random(df,rows_nb,column_name,value):
    """
    The idea is to select a random subset of the original rows that do not already have the specific value
    and replace a specific column with that value to get more data for the low performance slices
    """
    existing_df = df.copy() #Copy
    
    selRows = existing_df[existing_df[column_name] == value ].index #Select rows ID with the specific value
    existing_df = existing_df.drop(selRows, axis=0) #Remove rows with the specific value
    
    existing_df.reset_index(drop=True, inplace=True) #Index reset
    random_index_list = random.sample(range(len(existing_df)), rows_nb) #Random positions among the remaining rows
    new_data = existing_df.iloc[random_index_list].copy() #Subset selection (explicit copy avoids SettingWithCopyWarning)
    new_data[column_name]= value #New value attribution
    return(new_data)

In [67]:
def pick_rows_random_2(df,rows_nb,column_name,value,column_name_2,value_2):
    """
    This function is the same as pick_rows_random but with 2 features
    """
    existing_df = df.copy()
    
    selRows = existing_df[(existing_df[column_name] == value) & (existing_df[column_name_2] == value_2) ].index
    existing_df = existing_df.drop(selRows, axis=0) #Remove rows that already have both values
    
    existing_df.reset_index(drop=True, inplace=True)
    random_index_list = random.sample(range(len(existing_df)), rows_nb) #Random positions among the remaining rows
    new_data = existing_df.iloc[random_index_list].copy() #Explicit copy avoids SettingWithCopyWarning
    new_data[column_name]= value
    new_data[column_name_2]= value_2
    return(new_data)
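
The two functions above differ only in the number of columns they overwrite. A possible generalization, sketched here on toy data, takes a dictionary of column/value pairs instead. `pick_rows_random_n`, `overrides` and `seed` are hypothetical names, and the sketch uses `DataFrame.sample` rather than `random.sample`, so it will not pick the same rows as the notebook's functions.

```python
import pandas as pd

def pick_rows_random_n(df, rows_nb, overrides, seed=None):
    """Sample rows that do not already match all overrides, then overwrite those columns."""
    already_matching = pd.Series(True, index=df.index)
    for col, val in overrides.items():
        already_matching &= (df[col] == val)
    candidates = df[~already_matching]
    new_data = candidates.sample(n=rows_nb, random_state=seed).copy()
    for col, val in overrides.items():
        new_data[col] = val
    return new_data.reset_index(drop=True)

# Toy data standing in for the credit DataFrame
toy = pd.DataFrame({
    "savings": ["none", "low", "high", "low", "none", "high"],
    "personal_status": ["divorced", "single", "single", "married", "single", "married"],
    "age": [30, 41, 52, 23, 37, 64],
})
extra_rows = pick_rows_random_n(toy, 3, {"savings": "none", "personal_status": "divorced"}, seed=0)
```

Keep in mind that overwriting feature values this way creates combinations that may never occur in real data.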

In [59]:
# pick_rows_random(credit,250,"credit_history","all credits at this bank paid back duly").head()

I looked for the low performance slices in the Giskard Inspector by analysing SHAP values.

In [74]:
no_savings_account_augmentation = pick_rows_random_2(credit,50,"savings","unknown/ no savings account",
                                                   "credit_history","all credits at this bank paid back duly")


In [75]:
personal_status_augmentation = pick_rows_random_2(credit,50,"savings","unknown/ no savings account",
                                                 "personal_status","divorced")


In [78]:
duration_in_month_augmentation = pick_rows_random_2(credit,100,"duration_in_month",36,
                                                   "personal_status","divorced")


In [79]:
account_check_status_augmentation = pick_rows_random_2(credit,50,"account_check_status","< 0 DM",
                                                      "default","Default")


In [80]:
augmented_data = pd.concat([credit,
                            pick_rows_random(credit,100,"purpose","(vacation - does not exist?)"),
                            no_savings_account_augmentation,
                            personal_status_augmentation,
                            duration_in_month_augmentation,
                            account_check_status_augmentation
                           ],
                           ignore_index=True)
# augmented_data = credit


In [81]:
# Split the data into train and test
Y=augmented_data['default']
X= augmented_data.drop(columns="default")
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.20,random_state = 30, stratify = Y)
# Fit and score your model
clf_logistic_regression.fit(X_train, Y_train)
clf_logistic_regression.score(X_test, Y_test)
# Prepare data to upload on Giskard
train_data = pd.concat([X_train, Y_train], axis=1)
test_data= pd.concat([X_test, Y_test ], axis=1)

credit_scoring.upload_model_and_df(
    prediction_function=clf_logistic_regression.predict_proba, # Python function which takes pandas dataframe as input and returns probabilities for classification model OR returns predictions for regression model
    model_type='classification', # "classification" for classification model OR "regression" for regression model
    df=test_data, # the dataset you want to use to inspect your model
    column_types=column_types, # A dictionary with columns names of df as key and types(category, numeric, text) of columns as values
    target='default', # The column name in df corresponding to the actual target variable (ground truth).
    feature_names=list(feature_types.keys()), # List of the feature names of prediction_function
    classification_labels=clf_logistic_regression.classes_ ,  # List of the classification labels of your prediction
    model_name='logistic_regression_augmented_data_detailed', # Name of the model
    dataset_name='test_data_augmented_data_detailed' # Name of the dataset
)



Dataset successfully uploaded to project key 'credit_scoring' with ID = 800. It is available at http://localhost:19000 
Model successfully uploaded to project key 'credit_scoring' with ID = 801. It is available at http://localhost:19000 


(801, 800)

In [83]:
credit_scoring.upload_df(
    df=train_data, # The dataset you want to upload
    column_types=column_types, # All the column types of df, including the target column
    name="train_data_augmented_data_detailed"# Name of the dataset
)



Dataset successfully uploaded to project key 'credit_scoring' with ID = 836. It is available at http://localhost:19000 


836

With the Giskard platform, we get for the second method:

*   Accuracy on the original test set: 0.84
*   F1 score on the original test set: 0.89
*   Accuracy difference: 0.08
*   F1 difference: 0.05



## Results

## First method: global data augmentation

*   Accuracy: about 5% improvement (from 0.76 to 0.80)
*   F1 score: about 4% improvement (from 0.83 to 0.86)

## Second method: low performance features data augmentation

*   Accuracy: about 10% improvement (from 0.76 to 0.84)
*   F1 score: about 7% improvement (from 0.83 to 0.89)
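
These percentages appear to be relative improvements over the original model, which is my reading of the numbers rather than something the notebook states. Under that assumption, they can be recomputed from the scores reported earlier. Because those scores are rounded to two decimals, the results only approximately match the figures above. `relative_improvement` is a hypothetical helper.

```python
def relative_improvement(baseline, new):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100

# Scores reported by the Giskard platform earlier in this notebook
baseline_scores = {"accuracy": 0.76, "f1": 0.83}
method_scores = {
    "global augmentation": {"accuracy": 0.80, "f1": 0.86},
    "low performance slices": {"accuracy": 0.84, "f1": 0.89},
}

improvements = {
    method: {metric: relative_improvement(baseline_scores[metric], score)
             for metric, score in scores.items()}
    for method, scores in method_scores.items()
}
```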

## Conclusion

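Both augmentation methods improve accuracy and F1 score, which is encouraging. However, the accuracy and F1 differences grow as well (from 0.01 to 0.04 and then 0.08 for accuracy), so we have to be careful with overfitting.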