![giskard_logo.png](https://raw.githubusercontent.com/Giskard-AI/giskard/main/readme/Logo_full_darkgreen.png)

# About Giskard

Open-Source CI/CD platform for ML teams. Deliver ML products, better & faster. 

*   Collaborate faster with feedback from business stakeholders.
*   Deploy automated tests to eliminate regressions, errors & biases.

🏡 [Website](https://giskard.ai/)

📗 [Documentation](https://docs.giskard.ai/)

## Installing `giskard`

In [None]:
!pip install giskard

## Connect the external worker in daemon mode

In [None]:
!giskard worker start -d

# Start by creating an ML model 🚀🚀🚀

Let's create a house pricing model based on a Kaggle dataset ([link](https://raw.githubusercontent.com/Giskard-AI/giskard-client/main/sample_data/regression/house-prices/house_price_updated.csv) to download the dataset).

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor

In [2]:
# To download and read the House Pricing Dataset
url = 'https://raw.githubusercontent.com/Giskard-AI/examples/main/datasets/house_pricing_regression_model_dataset/house_price_updated.csv'
data = pd.read_csv(url)

In [11]:
data.KitchenQuality.unique()

array(['Good', 'Average', 'Excellent', 'Fair'], dtype=object)

In [3]:
# Declare the type of each column in the dataset (for example: category, numeric, text)
column_types = {'TypeOfDewelling': 'category',
                'BldgType': 'category',
                'AbvGroundLivingArea': 'numeric',
                'Neighborhood': 'category',
                'KitchenQuality': 'category',
                'NumGarageCars': 'numeric',
                'YearBuilt': 'numeric',
                'RemodelYear':  'numeric',
                'ExternalQuality': 'category',
                'LotArea': 'numeric',
                'LotShape': 'category',
                'Fireplaces': 'numeric',
                'NumBathroom': 'numeric',
                'Basement1Type': 'category',
                'Basement1SurfaceArea': 'numeric',
                'Basement2Type': 'category',
                'Basement2SurfaceArea': 'numeric',
                'TotalBasementArea': 'numeric',
                'GarageArea': 'numeric',
                '1stFlrArea': 'numeric',
                '2ndFlrArea': 'numeric',
                'Utilities': 'category',
                'OverallQuality': 'category',
                'SalePrice': 'numeric'
                }
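
Because `column_types` is written by hand, a misspelled column name or type is easy to miss. The following is a minimal sketch of a sanity check, assuming the only allowed types are the three named above (category, numeric, text). `check_column_types` is a hypothetical helper written for this notebook, not part of Giskard, and the toy column list stands in for `list(data.columns)`.

```python
def check_column_types(columns, column_types, allowed=("category", "numeric", "text")):
    """Compare a column_types mapping against the actual column names."""
    columns = list(columns)
    # Columns present in the data but not declared in the mapping
    missing = [c for c in columns if c not in column_types]
    # Columns declared in the mapping but absent from the data
    unknown = [c for c in column_types if c not in columns]
    # Declared types that are not one of the allowed values
    bad_type = {c: t for c, t in column_types.items() if t not in allowed}
    return missing, unknown, bad_type


# Toy stand-ins (assumptions for illustration, not the real dataset)
example_columns = ["LotArea", "Neighborhood", "SalePrice"]
example_types = {"LotArea": "numeric", "Neighborhood": "categry", "YearBuilt": "numeric"}

missing, unknown, bad_type = check_column_types(example_columns, example_types)
print(missing)   # ['SalePrice']
print(unknown)   # ['YearBuilt']
print(bad_type)  # {'Neighborhood': 'categry'}
```

In the notebook itself, the call would be `check_column_types(data.columns, column_types)`, and all three results should be empty.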

In [4]:
# feature_types is used to declare the features the model is trained on
feature_types = {i:column_types[i] for i in column_types if i!='SalePrice'}

# Pipeline to fill missing values, transform and scale the numeric columns
numeric_features = [key for key in feature_types.keys() if feature_types[key]=="numeric"]
numeric_transformer = Pipeline([('imputer', SimpleImputer(missing_values= np.nan, strategy='mean')),
                                ('scaler', StandardScaler())])

# Pipeline to fill missing values and one hot encode the categorical values
categorical_features = [key for key in feature_types.keys() if feature_types[key]=="category"]
categorical_transformer = Pipeline([
        ('imputer', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
        # scikit-learn >= 1.2 renames `sparse` to `sparse_output` (`sparse` was removed in 1.4)
        ('onehot', OneHotEncoder(handle_unknown='ignore', sparse=False))])

# Initiate Preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# Pipeline for the Random Forest Model
reg_random_forest = Pipeline(steps=[('preprocessor', preprocessor),
                      ('regressor', RandomForestRegressor())])

# Split the data
y = data['SalePrice']
X = data.drop(columns="SalePrice")
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.20, random_state = 30)

In [5]:
# Fit and score your model
reg_random_forest.fit(X_train, y_train)
print("model score: %.3f" % reg_random_forest.score(X_test, y_test))

model score: 0.858
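
For a scikit-learn regressor, `score` returns the coefficient of determination, so the number printed above can be read with the usual definition:

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

Here $y_i$ is the true `SalePrice` of the $i$-th row of the test set, $\hat{y}_i$ is the model's prediction for that row, and $\bar{y}$ is the mean of the true test values. A score of 0.858 therefore means the model accounts for roughly 86% of the variance of `SalePrice` on the test set. Since `RandomForestRegressor()` is created without a `random_state`, the exact value may differ slightly from one run to the next.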


In [6]:
# Prepare data to upload on Giskard
train_data = pd.concat([X_train, y_train], axis=1)
test_data = pd.concat([X_test, y_test ], axis=1)

# Upload the model in Giskard 🚀🚀🚀

### Initiate a project

In [None]:
from giskard import GiskardClient

url = "http://localhost:19000"  # If Giskard is installed locally (for installation, see: https://docs.giskard.ai/start/guides/installation)
# url = "http://app.giskard.ai"  # If you want to upload to the Giskard URL instead
token = "YOUR GENERATED TOKEN"  # You can generate your API token in the Admin tab of the Giskard application

client = GiskardClient(url, token)

# your_project = client.create_project("project_key", "PROJECT_NAME", "DESCRIPTION")
# Choose any arguments you like, but "project_key" must be unique and in lower case
house_pricing = client.create_project("house_pricing", "House pricing model", "Project to predict house prices")

# If you've already created a project with the key "house_pricing", use:
# house_pricing = client.get_project("house_pricing")
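
If you rerun this cell often, switching between `create_project` and `get_project` by hand gets tedious. Below is a minimal sketch of a helper that tries to create the project and falls back to fetching it. It assumes that `create_project` raises an exception when the key is already taken, which is not confirmed here. `get_or_create_project` and `FakeClient` are hypothetical names used only for illustration; `FakeClient` is an in-memory stand-in so the example runs without a Giskard server.

```python
def get_or_create_project(client, key, name, description):
    """Return the project with this key, creating it if it does not exist yet."""
    if key != key.lower():
        raise ValueError("project key should be lower case: %r" % key)
    try:
        return client.create_project(key, name, description)
    except Exception:
        # Assumption: create_project raises when the key is already taken
        return client.get_project(key)


class FakeClient:
    """Hypothetical in-memory stand-in for GiskardClient, for illustration only."""

    def __init__(self):
        self.projects = {}

    def create_project(self, key, name, description):
        if key in self.projects:
            raise RuntimeError("project key already exists")
        self.projects[key] = {"key": key, "name": name, "description": description}
        return self.projects[key]

    def get_project(self, key):
        return self.projects[key]


fake = FakeClient()
first = get_or_create_project(fake, "house_pricing", "House pricing model", "Project to predict house prices")
second = get_or_create_project(fake, "house_pricing", "House pricing model", "Project to predict house prices")
print(first is second)  # True
```

Catching a broad `Exception` keeps the sketch short. If creation fails for another reason, such as a bad token, the fallback `get_project` call will most likely fail as well and surface the error.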

### Upload your model and a dataset (see [documentation](https://docs.giskard.ai/start/guides/upload-your-model))

In [None]:
house_pricing.upload_model_and_df(
    prediction_function=reg_random_forest.predict, # Python function which takes pandas dataframe as input and returns probabilities for classification model OR returns predictions for regression model
    model_type='regression', # "classification" for classification model OR "regression" for regression model
    df=test_data, # The dataset you want to use to inspect your model
    column_types=column_types, # A dictionary with the column names of df as keys and the column types (category, numeric, text) as values
    target='SalePrice', # The column name in df corresponding to the actual target variable (ground truth).
    feature_names=list(feature_types.keys()), # List of the feature names of prediction_function
    model_name='random_forest_v1', # Name of the model
    dataset_name='test_data' # Name of the dataset
)

### 🌟 If you want to upload a dataset without a model


For example, let's upload the train set to Giskard. This is key to creating drift tests in Giskard.

In [None]:
house_pricing.upload_df(
    df=train_data, # The dataset you want to upload
    column_types=column_types, # All the column types of df
    target="SalePrice", # Do not pass this parameter if the dataset doesn't contain a target column
    name="train_data"  # Name of the dataset
)

You can also upload new production data to use as a validation set for your existing model. In that case, you might not have the ground truth target variable.

In [None]:
production_data = data.drop(columns="SalePrice")

In [None]:
house_pricing.upload_df(
    df=production_data, # The dataset you want to upload
    column_types=feature_types, # All the column types without the target
    name="production_data" # Name of the dataset
)

### 🌟 If you just want to upload a model without a dataframe 

This happens, for instance, when you have built a new version of the model and want to inspect it using a validation dataframe that is already in Giskard.

For example, let's create a second version of the model using the catboost library.

In [None]:
!pip install catboost

In [None]:
from catboost import CatBoostRegressor

X['Basement1Type'] = X['Basement1Type'].fillna("")
X['Basement2Type'] = X['Basement2Type'].fillna("")
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.20, random_state = 30)

model = CatBoostRegressor(iterations=2,
                           learning_rate=1,
                           depth=2)

model.fit(X_train, y_train, cat_features=categorical_features)

In [None]:
def prediction_function(X):
    # Work on a copy so the caller's dataframe is not modified
    X = X.copy()
    X['Basement1Type'] = X['Basement1Type'].fillna("")
    X['Basement2Type'] = X['Basement2Type'].fillna("")
    return model.predict(X)

In [None]:
house_pricing.upload_model(
    prediction_function=prediction_function, # Python function which takes pandas dataframe as input and returns probabilities for classification model OR returns predictions for regression model
    model_type='regression', # "classification" for classification model OR "regression" for regression model
    feature_names=list(feature_types.keys()), # List of the feature names of prediction_function
    name='catboost', # Name of the model
    validate_df=train_data, # Optional. The validation df is not uploaded to the app; it is only used to check that the model has the expected format
    target="SalePrice", # Optional. target should be a column of validate_df. Pass this parameter only if validate_df is passed
)

### Happy exploration! 🧑‍🚀