![giskard_logo.png](https://raw.githubusercontent.com/Giskard-AI/giskard/main/readme/Logo_full_darkgreen.png)

# About Giskard

Open-Source CI/CD platform for ML teams. Deliver ML products, better & faster. 

*   Collaborate faster with feedback from business stakeholders.
*   Deploy automated tests to eliminate regressions, errors & biases.

🏡 [Website](https://giskard.ai/)

📗 [Documentation](https://docs.giskard.ai/)

## Installing `giskard`

In [None]:
!pip install giskard

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting giskard
  Downloading giskard-1.9.1-py3-none-any.whl (69 kB)
Collecting eli5<0.14.0,>=0.13.0 (from giskard)
  Downloading eli5-0.13.0.tar.gz (216 kB)
  Preparing metadata (setup.py) ... done
Collecting grpcio<=1.51.1,>=1.46.3 (from giskard)
  Downloading grpcio-1.51.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.8 MB)
Collecting importlib_metadata<5.0.0,>=4.11.4 (from giskard)
  Downloading importlib_metadata-4.13.0-py3-none-any.whl (23 kB)
Collecting lockfile<0.13.0,>=0.12.2 (from giskard)

## Connect the external worker in daemon mode

In [None]:
!giskard worker start -d

2023-05-24 14:05:29,831 pid:670 MainThread giskard.cli  INFO     Starting ML Worker client daemon
2023-05-24 14:05:29,831 pid:670 MainThread giskard.cli  INFO     Python: /usr/bin/python3 (3.10.11)
2023-05-24 14:05:29,831 pid:670 MainThread giskard.cli  INFO     Giskard Home: /root/giskard-home
2023-05-24 14:05:29,832 pid:670 MainThread giskard.cli_utils INFO     Writing logs to /root/giskard-home/run/ml-worker.log


# Start by creating an ML model 🚀🚀🚀

Let's create a credit scoring model using the German Credit scoring dataset ([link](https://github.com/Giskard-AI/giskard-client/tree/main/sample_data/classification) to download the dataset).

In [None]:
!pip install tqdm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import pandas as pd

from sklearn import model_selection
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings("ignore")
import random
import numpy as np
from tqdm import tqdm
import copy

In [None]:
# To download and read the credit scoring dataset
url = 'https://raw.githubusercontent.com/Giskard-AI/examples/main/datasets/credit_scoring_classification_model_dataset/german_credit_prepared.csv'
credit = pd.read_csv(url, sep=',',engine="python") #To download go to https://github.com/Giskard-AI/giskard-client/tree/main/sample_data/classification

In [None]:
# Declare the type of each column in the dataset(example: category, numeric, text)
column_types = {'default':"category",
               'account_check_status':"category", 
               'duration_in_month':"numeric",
               'credit_history':"category",
               'purpose':"category",
               'credit_amount':"numeric",
               'savings':"category",
               'present_employment_since':"category",
               'installment_as_income_perc':"numeric",
               'sex':"category",
               'personal_status':"category",
               'other_debtors':"category",
               'present_residence_since':"numeric",
               'property':"category",
               'age':"numeric",
               'other_installment_plans':"category",
               'housing':"category",
               'credits_this_bank':"numeric",
               'job':"category",
               'people_under_maintenance':"numeric",
               'telephone':"category",
               'foreign_worker':"category"}

In [None]:
# feature_types is used to declare the features the model is trained on
feature_types = {i:column_types[i] for i in column_types if i!='default'}

# Pipeline to fill missing values, transform and scale the numeric columns
columns_to_scale = [key for key in feature_types.keys() if feature_types[key]=="numeric"]
numeric_transformer = Pipeline([('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# Pipeline to fill missing values and one hot encode the categorical values
columns_to_encode = [key for key in feature_types.keys() if feature_types[key]=="category"]
categorical_transformer = Pipeline([
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        # sparse_output replaces the deprecated sparse argument (scikit-learn >= 1.2)
        ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False)) ])

# Perform preprocessing of the columns with the above pipelines
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, columns_to_scale),
      ('cat', categorical_transformer, columns_to_encode)
          ]
)

# Pipeline for the model Logistic Regression
clf_logistic_regression = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(max_iter =1000))])

# Split the data into train and test


# **Defining functions that we will use for both techniques**

In [None]:
#returns the rows of X_test whose true label differs from the predicted label
def check_diff(X_test, y_test, y_predicted):
  indices=[]
  X_test=X_test.reset_index(drop=True)
  y_list=y_test.tolist()
  X_test["result"]=y_list
  for i, row in enumerate(y_test):
      if row!=y_predicted[i]:
        indices.append(i)
  return X_test[['account_check_status', 'duration_in_month',
       'credit_history', 'purpose', 'credit_amount', 'savings',
       'present_employment_since', 'installment_as_income_perc', 'sex',
       'personal_status', 'other_debtors', 'present_residence_since',
       'property', 'age', 'other_installment_plans', 'housing',
       'credits_this_bank', 'job', 'people_under_maintenance', 'telephone',
       'foreign_worker',"result"]][X_test.index.isin(indices)]

#Uses the previous function to compute the score of each data slice
#score of a data slice = percentage of the slice among the wrongly predicted data - percentage of the slice in the original data
#(the higher the score, the more the slice is over-represented among the errors)
def check_worst_performances(clf,X_test, Y_test,bare=5):
  print("Data slices' performance")
  y_predicted=clf.predict(X_test)
  dataset_to_check=check_diff(X_test,Y_test,y_predicted)
  for i in ['account_check_status', 'duration_in_month',
        'credit_history', 'purpose', 'credit_amount', 'savings',
        'present_employment_since', 'installment_as_income_perc', 'sex',
        'personal_status', 'other_debtors', 'present_residence_since',
        'property', 'age', 'other_installment_plans', 'housing',
        'credits_this_bank', 'job', 'people_under_maintenance', 'telephone',
        'foreign_worker']:
    list_values=set(dataset_to_check[i].tolist())
    for j in list_values:
      percent_pred=len(dataset_to_check[dataset_to_check[i]==j])/len(dataset_to_check)
      percent_x=len(X_test[X_test[i]==j])/len(X_test)
      if 100*(percent_pred-percent_x)>bare:
        print("column ", i, "  value  ", j, "  score ", np.round(100*(percent_pred-percent_x)))

#Helps to find insights for data slices given the attribute name.
#Prints the mean of the most important numerical attributes (duration_in_month and credit_amount)
#Prints the most common value of every other attribute for each data slice
def find_similar_cat(column_name, dataset=credit):
  print("\n")
  print(" Dataframe with mean Duration and Amount")
  print("\n")
  #double brackets select a list of columns (a bare tuple is rejected by recent pandas versions)
  display(dataset.groupby([column_name])[['duration_in_month',
       'credit_amount']].mean())
  print("\n")
  print(" Dataframe with the most common value for categorical attributes")
  print("\n")
  #use the dataset argument rather than the global credit dataframe
  display(dataset.groupby([column_name]).agg(pd.Series.mode))


In [None]:
#Example for Savings
find_similar_cat('savings')



 Dataframe with mean Duration and Amount




| savings | duration_in_month | credit_amount |
|---|---|---|
| .. >= 1000 DM | 18.3125 | 2573.395833 |
| ... < 100 DM | 20.441128 | 3187.832504 |
| 100 <= ... < 500 DM | 22.737864 | 3384.038835 |
| 500 <= ... < 1000 DM | 19.031746 | 2572.111111 |
| unknown/ no savings account | 22.715847 | 3906.409836 |




 Dataframe with the most common value for categorical attributes




| savings | default | account_check_status | duration_in_month | credit_history | purpose | present_employment_since | installment_as_income_perc | sex | personal_status | other_debtors | present_residence_since | property | age | other_installment_plans | housing | credits_this_bank | job | people_under_maintenance | telephone | foreign_worker |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| .. >= 1000 DM | Not default | no checking account | 24 | existing credits paid back duly till now | [car (new), radio/television] | 1 <= ... < 4 years | 4 | male | divorced | none | 2 | real estate | 27 | none | own | 1 | skilled employee / official | 1 | none | yes |
| ... < 100 DM | Not default | < 0 DM | 12 | existing credits paid back duly till now | domestic appliances | 1 <= ... < 4 years | 4 | male | single | none | 4 | if not A121/A122 : car or other, not in attrib... | 26 | none | own | 1 | skilled employee / official | 1 | none | yes |
| 100 <= ... < 500 DM | Not default | 0 <= ... < 200 DM | 24 | existing credits paid back duly till now | car (new) | 1 <= ... < 4 years | 4 | male | single | none | 4 | if not A121/A122 : car or other, not in attrib... | [30, 31] | none | own | 1 | skilled employee / official | 1 | none | yes |
| 500 <= ... < 1000 DM | Not default | no checking account | 24 | existing credits paid back duly till now | domestic appliances | 1 <= ... < 4 years | 4 | male | single | none | 4 | if not A121/A122 : car or other, not in attrib... | [23, 35, 36] | none | own | 1 | skilled employee / official | 1 | none | yes |
| unknown/ no savings account | Not default | no checking account | 24 | existing credits paid back duly till now | domestic appliances | .. >= 7 years | 4 | male | single | none | 4 | if not A121/A122 : car or other, not in attrib... | 36 | none | own | 1 | skilled employee / official | 1 | none | yes |


In [None]:
#Check which data slices perform worst on the training set (the higher the score, the worse the performance)
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(credit.drop(columns=["default"]), credit["default"], test_size=0.20,random_state = 30, stratify = credit["default"])
clf_logistic_regression = Pipeline(steps=[('preprocessor', preprocessor),
                    ('classifier', LogisticRegression(max_iter =1000))])
clf_logistic_regression.fit(X_train, Y_train)
print("The original score is  ",clf_logistic_regression.score(X_test, Y_test))
print("\n")
check_worst_performances(clf_logistic_regression,X_train,Y_train.to_numpy())

The original score is   0.755


Data slices' performance
column  account_check_status   value   < 0 DM   score  11.0
column  account_check_status   value   0 <= ... < 200 DM   score  6.0
column  duration_in_month   value   18   score  6.0
column  credit_history   value   existing credits paid back duly till now   score  7.0
column  savings   value   ... < 100 DM   score  8.0
column  present_employment_since   value   1 <= ... < 4 years   score  5.0
column  installment_as_income_perc   value   2   score  7.0
column  age   value   22   score  6.0
column  housing   value   rent   score  5.0
column  credits_this_bank   value   1   score  6.0
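
To make the score above concrete, here is a minimal sketch of the same computation on a toy table. The helper `slice_scores` and the toy data are hypothetical and not part of the notebook; the sketch assumes the score is the share of a value among the misclassified rows minus its share among all rows, in percentage points.

```python
import pandas as pd

def slice_scores(X, y_true, y_pred, column):
    """Share of each value among misclassified rows minus its share overall (in points)."""
    X = X.reset_index(drop=True)
    # Boolean mask of the rows where the prediction is wrong
    wrong = pd.Series(list(y_true)) != pd.Series(list(y_pred))
    share_wrong = X.loc[wrong, column].value_counts(normalize=True)
    share_all = X[column].value_counts(normalize=True)
    # Values that never appear among the errors get a share of 0 there
    share_wrong = share_wrong.reindex(share_all.index, fill_value=0)
    return (100 * (share_wrong - share_all)).round(1)

# Hypothetical toy data: 2 "rent" rows and 8 "own" rows
toy = pd.DataFrame({"housing": ["rent"] * 2 + ["own"] * 8})
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]

scores = slice_scores(toy, y_true, y_pred, "housing")
print(scores)
```

Here "rent" makes up 20% of the rows but two of the three errors, so its score is strongly positive, while "own" is under-represented among the errors and gets a negative score.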


# **Technique 1: Improve a data slice with the data from another data slice**
The principle of this data augmentation technique is as follows: Let's assume that we have identified a data slice (Slice X) that exhibits poor performance.

Using the *find_similar_cat()* function, we can determine the data slice (Slice Y) that is most similar to Slice X.

Next, we randomly duplicate data points in Slice Y and replace their numerical features with the mean values from Slice X.

Finally, we inject these augmented data points into Slice X.
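
The steps above can be sketched on a toy table as follows. This is an illustrative simplification, not the notebook's own implementation: the function `inject_from_similar_slice`, its parameters, and the toy values are hypothetical.

```python
import random
import pandas as pd

def inject_from_similar_slice(train, column, target_value, donor_value, n,
                              numeric_means, seed=0):
    """Copy n random donor rows (Slice Y), relabel them as the target slice
    (Slice X), and overwrite the numeric features with Slice X's means."""
    donors = train[train[column] == donor_value]
    rng = random.Random(seed)
    picked = donors.iloc[rng.sample(range(len(donors)), n)].copy()
    picked[column] = target_value
    for name, mean in numeric_means.items():
        picked[name] = mean
    # The original rows are kept; the relabelled copies are appended
    return pd.concat([train, picked], ignore_index=True)

# Hypothetical toy data: "low" plays Slice X, "unknown" plays Slice Y
toy = pd.DataFrame({
    "savings": ["low", "low", "unknown", "unknown", "unknown"],
    "duration_in_month": [12, 18, 24, 36, 48],
    "credit_amount": [1000, 2000, 3000, 4000, 5000],
    "default": ["Default", "Not default", "Not default", "Default", "Not default"],
})
means = (toy.loc[toy["savings"] == "low", ["duration_in_month", "credit_amount"]]
         .mean().to_dict())
augmented = inject_from_similar_slice(toy, "savings", "low", "unknown",
                                      n=2, numeric_means=means)
print(augmented)
```

The copied rows keep the donor's label and categorical features, so the choice of a genuinely similar Slice Y matters: a poorly matched donor slice would inject misleading labels into Slice X.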

In [None]:
#function for technique 1 when we are considering a unique Slice Y
#It is important to indicate that we duplicate values only from the original dataset and NOT from the augmented one to avoid multiplying datapoints too much
def data_inject_0(column_name, value_to_inject, value_to_find, number, mean_duration, mean_amount, X, Y, X_to_aug, Y_to_aug):
  X_train=X
  Y_train=Y
  randomlist = random.sample(range(0, len(X_train[X_train[column_name]==value_to_find])),number)
  X_to_inject=X_train[X_train[column_name]==value_to_find].reset_index(drop=True)
  X_to_inject=X_to_inject.loc[X_to_inject.index[randomlist]]
  X_to_inject[column_name]=value_to_inject
  X_to_inject["duration_in_month"]=mean_duration
  X_to_inject["credit_amount"]=mean_amount
  Y_to_inject=pd.DataFrame(X_to_inject['default']).reset_index(drop=True)
  X_to_aug=pd.concat([X_to_aug, X_to_inject], axis=0,ignore_index=True).reset_index(drop=True)
  Y_to_aug=pd.concat([Y_to_aug, Y_to_inject], axis=0,ignore_index=True).reset_index(drop=True)
  Y_to_aug=pd.DataFrame(Y_to_aug).reset_index(drop=True)
  return X_to_aug, Y_to_aug

#function for technique 1 when we are considering multiple Slices Y
def data_inject_1(column_name, value_to_inject, values_to_find, number, mean_duration, mean_amount, X, Y, X_to_aug, Y_to_aug):
  X_train=X
  Y_train=Y
  randomlist = random.sample(range(0, len(X_train[X_train[column_name].isin(values_to_find)])),number)
  X_to_inject=X_train[X_train[column_name].isin(values_to_find)].reset_index(drop=True)
  X_to_inject=X_to_inject.loc[X_to_inject.index[randomlist]]
  X_to_inject[column_name]=value_to_inject
  X_to_inject["duration_in_month"]=mean_duration
  X_to_inject["credit_amount"]=mean_amount
  Y_to_inject=pd.DataFrame(X_to_inject['default']).reset_index(drop=True)
  X_to_aug=pd.concat([X_to_aug, X_to_inject], axis=0,ignore_index=True).reset_index(drop=True)
  Y_to_aug=pd.concat([Y_to_aug, Y_to_inject], axis=0,ignore_index=True).reset_index(drop=True)
  Y_to_aug=pd.DataFrame(Y_to_aug).reset_index(drop=True)
  return X_to_aug, Y_to_aug

## **Performing technique 1 :)**

In [None]:
Y=credit['default']
X=credit
random.seed(41)

X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.20,random_state = 30, stratify = Y)
Y_train=pd.DataFrame(Y_train).reset_index(drop=True)

X_train_new,Y_train_new=data_inject_0("savings","... < 100 DM","unknown/ no savings account",140,20.441128,3187.832504,X_train,Y_train,X_train,Y_train)
X_train_new,Y_train_new=data_inject_0("credit_history","all credits at this bank paid back duly","no credits taken/ all credits paid back duly",30,22.693878,3344.877551,X_train,Y_train,X_train_new,Y_train_new)
X_train_new,Y_train_new=data_inject_0('account_check_status',"< 0 DM","no checking account",100,21.339416,3175.218978,X_train,Y_train,X_train_new,Y_train_new)
X_train_new,Y_train_new=data_inject_1("duration_in_month",18,[15,16,20,21],45,18,2718.33628,X_train,Y_train,X_train_new,Y_train_new)
X_train_new,Y_train_new=data_inject_0('present_employment_since',"... < 1 year","unemployed",45,19.401163,2952.453488,X_train,Y_train,X_train_new,Y_train_new)
clf_logistic_regression = Pipeline(steps=[('preprocessor', preprocessor),
                    ('classifier', LogisticRegression(max_iter =1000))])
clf_logistic_regression.fit(X_train_new, Y_train_new)
print("the final score is ",clf_logistic_regression.score(X_test, Y_test))


the final score is  0.79


# **Technique 2: Use "corrected" outliers from an underperforming data slice**
The principle of this data augmentation technique is as follows: Suppose we have identified a subset of data (Slice X) that exhibits poor performance.

The model should perform well on the most common data points of Slice X, even when the slice as a whole performs poorly. The errors are therefore more likely to come from the slice's outliers, so those are the points we focus on.

In our case, we define an outlier as a data point where more than half of its categorical features differ from the most common values.

Once an outlier is identified, we create a duplicate of it and randomly modify one of its "outlying" features to match the most common value. This makes it slightly less of an outlier.

Finally, we introduce these augmented data points into Slice X.
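
As a rough illustration of this outlier rule, the sketch below applies it to a toy slice. It is a simplified stand-in under stated assumptions, not the notebook's function: `correct_outliers` and the toy data are hypothetical, the input is assumed to be already filtered to Slice X, and ties between modes are resolved by taking the first one.

```python
import random
import pandas as pd

def correct_outliers(slice_df, cat_columns, seed=0):
    """Return copies of the outlying rows with one outlying feature reset to its mode."""
    rng = random.Random(seed)
    # Most common value per categorical column (first mode if there are ties)
    modes = slice_df[cat_columns].mode().iloc[0]
    corrected = []
    for _, row in slice_df.iterrows():
        differing = [c for c in cat_columns if row[c] != modes[c]]
        # Outlier: more than half of the categorical features differ from the mode
        if len(differing) > len(cat_columns) / 2:
            fixed = row.copy()
            col = rng.choice(differing)
            fixed[col] = modes[col]
            corrected.append(fixed)
    return pd.DataFrame(corrected, columns=slice_df.columns).reset_index(drop=True)

# Hypothetical toy slice: the last row differs from the mode in every feature
toy = pd.DataFrame({
    "housing": ["own", "own", "own", "rent"],
    "job": ["skilled", "skilled", "skilled", "unskilled"],
    "telephone": ["none", "none", "none", "yes"],
    "default": ["Not default", "Not default", "Default", "Default"],
})
cats = ["housing", "job", "telephone"]
new_rows = correct_outliers(toy, cats)
print(new_rows)
```

Only the last row qualifies as an outlier here, and its copy keeps the original label while moving one feature closer to the typical profile of the slice.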

In [None]:
#function for technique 2
def corrected_outliers(column_name, value, dataset, columns_cat):
  columns=copy.deepcopy(columns_cat)
  account_check=dataset.groupby([column_name]).agg(pd.Series.mode)
  account_check=account_check[account_check.index==value]
  account_check=account_check.reset_index(drop=True)
  columns.remove(column_name)
  account_check=account_check[columns]
  credit_new=dataset[dataset[column_name]==value]
  new_dataset=pd.DataFrame(columns=dataset.columns)

  for i,r in credit_new.iterrows():

    #comparing the row with "the most common row"
    res=r[columns].eq(account_check)
    res=res.values.flatten().tolist()
    res=[x for x in res[1:]]

    #counting the number of outlying features 
    if res.count(False)>len(res)/2:
      index=[]
      for j in range(len(res)):
        if res[j] == False:
          index.append(j)
      column= random.choice(index)
      r[account_check.columns[column]]=account_check[account_check.columns[column]].values[0]
      new_dataset=pd.concat([r,new_dataset],axis=1, ignore_index=True)

  new_dataset=new_dataset.transpose()
  new_dataset=new_dataset.dropna(how='all')
  return new_dataset

#performing data injection using the function for the technique 2
def data_injection(X, Y, dataset_to_inject, X_to_aug,Y_to_aug):
  X_to_inject=dataset_to_inject.reset_index(drop=True)
  Y_to_inject=pd.DataFrame(dataset_to_inject['default']).reset_index(drop=True)

  X_to_aug=pd.concat([X_to_aug, X_to_inject], axis=0,ignore_index=True).reset_index(drop=True)
  Y_to_aug=pd.concat([Y_to_aug, Y_to_inject], axis=0,ignore_index=True).reset_index(drop=True)
  Y_to_aug=pd.DataFrame(Y_to_aug).reset_index(drop=True)
  return X_to_aug,Y_to_aug


In [None]:
Y=credit['default']
X=credit
random.seed(41)
columns=['account_check_status','credit_history', 'purpose',
        'savings', 'present_employment_since', 'sex', 'personal_status',
        'other_debtors', 'property', 'other_installment_plans', 'housing',
        'job', 'telephone', 'foreign_worker']

X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.20,random_state = 30, stratify = Y)
Y_train=pd.DataFrame(Y_train).reset_index(drop=True)
X_train_new, Y_train_new=data_injection(X_train,Y_train,corrected_outliers(column_name='account_check_status',value="< 0 DM",dataset=credit,columns_cat=columns),X_to_aug=X_train,Y_to_aug=Y_train)
X_train_new, Y_train_new=data_injection(X_train,Y_train,corrected_outliers(column_name='savings',value="... < 100 DM",dataset=credit,columns_cat=columns),X_to_aug=X_train_new, Y_to_aug=Y_train_new)
X_train_new, Y_train_new=data_injection(X_train,Y_train,corrected_outliers(column_name="credit_history",value="no credits taken/ all credits paid back duly",dataset=credit,columns_cat=columns),X_to_aug=X_train_new, Y_to_aug=Y_train_new)
X_train_new, Y_train_new=data_injection(X_train,Y_train,corrected_outliers(column_name="personal_status",value="divorced",dataset=credit,columns_cat=columns),X_to_aug=X_train_new, Y_to_aug=Y_train_new)
Y_train_new=Y_train_new.to_numpy()
X_train_new=X_train_new.drop(columns=["default"])
clf_logistic_regression.fit(X_train_new, Y_train_new)
print("the final score is ",clf_logistic_regression.score(X_test, Y_test))

the final score is  0.805


# **Conclusion**

I applied both techniques to a limited number of data slices (4-5) and observed an improvement of roughly 3.5 to 5 percentage points in the classification score (from 0.755 to 0.79 with technique 1 and 0.805 with technique 2). However, this is only a starting point, as there are more data slices with poor performance.

Based on my observations, it appears that technique 2 yields better performance. However, technique 1 has the potential to leverage "field knowledge" more effectively. It could be interesting to explore the possibility of combining both approaches to achieve even better performance.