### Installing Vectice

In [92]:
# Install the Vectice Python library.
# In this notebook we version our code with GitHub. GitLab and Bitbucket are also supported:
# !pip install -q "vectice[github,gitlab,bitbucket]"
!pip install -q "vectice[github]==22.3.5.1"

In [93]:
# Verify that the Vectice Python library was installed
!pip show vectice

Name: vectice
Version: 22.3.5.1
Summary: Vectice Python library
Home-page: https://www.vectice.com
Author: Vectice Inc.
Author-email: sdk@vectice.com
License: Apache License 2.0
Location: /usr/local/lib/python3.7/dist-packages
Requires: requests, urllib3, python-dotenv
Required-by: 
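
If you prefer to check the installation from Python rather than from the pip output, a sketch along these lines should work. It assumes Python 3.8 or later, where `importlib.metadata` is part of the standard library (the output above shows a Python 3.7 environment, which would need the `importlib_metadata` backport instead). `installed_version` is a hypothetical helper name used only for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# "vectice" is the distribution name shown by `pip show` above.
found = installed_version("vectice")
if found is None:
    print("vectice is not installed")
else:
    print("vectice version:", found)
```

Returning None instead of raising keeps the check usable at the top of a notebook without stopping execution.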


## Reading the data

In [94]:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('kc_house_data_cleaned.csv')

# Run head to make sure the data was loaded properly
df.head()

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,221900.0,3,1.0,1180,5650,1.0,0,0,3,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,538000.0,3,2.25,2570,7242,2.0,0,0,3,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,180000.0,2,1.0,770,10000,1.0,0,0,3,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,604000.0,4,3.0,1960,5000,1.0,0,0,5,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,510000.0,3,2.0,1680,8080,1.0,0,0,3,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


## Vectice configuration 

To authenticate and start logging your work to Vectice, you need an API endpoint and an API token.

- The API endpoint is "app.vectice.com" for SaaS and the IP address of your instance for private deployments.
- The API token authenticates your API requests to the Vectice application. You can create one at any time from the "API Tokens" section of the "My Profile" page.

The endpoint and token can be set as environment variables, added to a .env file, or passed as parameters when you initialize Vectice.
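
As a rough illustration of the environment-variable option, here is a minimal sketch that reads the two settings from an environment-like mapping and fails early if the token is missing. The helper name `resolve_vectice_settings` and its `default_endpoint` argument are assumptions made for this example and are not part of the Vectice library. Only the variable names `VECTICE_API_ENDPOINT` and `VECTICE_API_TOKEN` come from this notebook.

```python
import os


def resolve_vectice_settings(env=None, default_endpoint="app.vectice.com"):
    """Read the endpoint and token from an environment-like mapping.

    Hypothetical helper: falls back to the SaaS endpoint when none is set
    and raises if no token is available.
    """
    env = os.environ if env is None else env
    endpoint = env.get("VECTICE_API_ENDPOINT", default_endpoint)
    token = env.get("VECTICE_API_TOKEN")
    if not token:
        raise RuntimeError("VECTICE_API_TOKEN is not set")
    return {"endpoint": endpoint, "token": token}


# Example with an explicit mapping instead of the real environment.
settings = resolve_vectice_settings({"VECTICE_API_TOKEN": "example-token"})
print(settings["endpoint"])
```

Checking for the token up front gives a clearer error than a failed request later in the notebook.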

In [95]:
import json

# Load the API token from a local JSON file so it is not hard-coded in the notebook
with open('EB_key.json') as f:
    EB_key = json.load(f)

In [None]:
#Import the required packages
from vectice import Vectice
from vectice import Experiment
from vectice.api.json import JobType
from vectice.api.json import ModelType
from vectice.api.json import FileMetadata
from vectice.api.json import StageStatus
from vectice.api.json import ModelVersionStatus

import logging
import os
logging.basicConfig(level=logging.INFO)

# Specify the API endpoint for Vectice. It's "app.vectice.com" for SaaS and the IP address for private deployments.
# You can specify your API endpoint here in the notebook, but we recommend adding it to a .env file
os.environ['VECTICE_API_ENDPOINT'] = "app.vectice.com"

# To use the Vectice Python library, you first need to authenticate your account using an API token.
# You can generate an API token from the Vectice UI, in the "API Tokens" section of the "My Profile" page,
# which is located under your profile picture.
os.environ['VECTICE_API_TOKEN'] = EB_key['key']

# Add your project ID. The project ID can be found on the project settings page in the Vectice UI.
# project_ID is a placeholder: replace it with your own project ID before running this cell.
Project_id = project_ID

# Initialize Vectice
vectice = Vectice(project=Project_id)

## Train/Test Split

In [99]:
import string
from math import sqrt

# Load scikit-learn packages
from sklearn.model_selection import train_test_split  # Model Selection
from sklearn.metrics import mean_absolute_error, mean_squared_error  # Model Evaluation
from sklearn.linear_model import LinearRegression  # Linear Regression
from sklearn.tree import DecisionTreeRegressor, plot_tree  # Decision Tree Regression

Here, we'll initialize our data preparation experiment.

- An experiment in Vectice groups runs of any type (EXTRACTION, PREPARATION, TRAINING, INFERENCE, DEPLOYMENT). A run represents all the metadata that you log to Vectice for a given job.

- Each execution of an experiment is called a run. It begins when you start a tracked experiment with experiment.start() and ends when you stop it with experiment.complete().

- Each run has inputs and outputs. The inputs can be code, dataset and model versions, and the outputs can be dataset and model versions.

- By default, every artifact used before experiment.start() is considered an _input_ of the run, and every artifact added between experiment.start() and experiment.complete() is considered an _output_. However, you can still declare your run's inputs explicitly when starting it.

Please check the [documentation](https://doc.vectice.com/sdk/experiment.html) for more information.
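
To make the default input/output rule concrete, here is a toy model of it in plain Python. `ToyRun` is a hypothetical class written for this illustration only. It does not use or reproduce the Vectice library, and simply mimics the rule that artifacts registered before the start are inputs and those registered during the run are outputs.

```python
class ToyRun:
    """Hypothetical stand-in for a tracked run; not part of the Vectice library."""

    def __init__(self):
        self.started = False
        self.completed = False
        self.inputs = []
        self.outputs = []

    def add_artifact(self, name):
        if self.completed:
            raise RuntimeError("the run is already complete")
        # Before start(): input. Between start() and complete(): output.
        target = self.outputs if self.started else self.inputs
        target.append(name)

    def start(self):
        self.started = True

    def complete(self):
        self.completed = True


run = ToyRun()
run.add_artifact("Cleaned data v1")   # registered before start, so an input
run.start()
run.add_artifact("Training data v1")  # registered during the run, so an output
run.add_artifact("Testing data v1")
run.complete()

print("inputs:", run.inputs)
print("outputs:", run.outputs)
```

This mirrors the data preparation job in this notebook, where the cleaned dataset version is created before the run starts and the training and testing versions are created before it completes.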


In [None]:
# Initialize your data preparation experiment
# You can also declare your API endpoint and API token when initializing your experiment if you don't want to declare them as environment variables
# experiment = Experiment(job="Data Preparation", user_token="API Token", api_endpoint= "API Endpoint", project=Project_id, job_type=JobType.PREPARATION)
experiment = Experiment(job="Data Preparation", project=Project_id, job_type=JobType.PREPARATION)

Let's suppose that the EDA has already been done and produced the cleaned data. Here, we create a dataset for our cleaned data, along with its first version.

In [None]:
# We create our cleaned data dataset as it doesn't exist yet.
experiment.vectice.create_dataset(name = "Cleaned data",
                                  description="cleaned data"
                                  )

In [103]:
# We create our first version of the cleaned data dataset
experiment.add_dataset_version(dataset="Cleaned data",
                               metadata=[FileMetadata(
                              name="cleaned_data.csv",
                              uri="gs://vectice_tutorial/kc_house_data_cleaned.csv",
                              size = 1.8e+6,
                              )])

ArtifactReference(code=None, dataset=Cleaned data, model=None, version_number=None, version_id=13179, version_name=None, version_strategy=VersionStrategy.MANUAL, description=None, )

In [104]:
# Create a code checkpoint for this version of the notebook.
# Vectice automatically tracks your code and attaches a code version to your run if you're working in a git repo.
# However, you can still create code versions manually, as we do here with add_code_version_uri().
input_code = experiment.add_code_version_uri(git_uri="https://github.com/vectice/vectice-examples",
                                             entrypoint="Quick_references/Training_notebook.ipynb")

### Create a run for the data preparation job

Each execution of an experiment is called a run. It begins when you start a tracked experiment with experiment.start() and ends when you stop it with experiment.complete().

In [105]:
# All the dataset versions that have been created before starting the experiment will be automatically
# attached as inputs of the run
# Define some run properties
technique = ["Technique", "Train Test Split"]

experiment.start(run_properties={technique[0]: technique[1]},
                run_notes="Data preparation run")

# Train/Test split

# We will use an 80/20 split to prepare the data
test_size = 0.2

# We will set the random seed so we always generate the same split.
random_state = 42

train, test = train_test_split(df, test_size = test_size, random_state = random_state)

# Push our training and testing data to the storage

# Generate X_train, X_test, y_train, y_test, which we will need for modeling
X = df.drop("price", axis=1).values
y = df["price"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
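
Both train_test_split calls above use the same test_size and random_state, which is what keeps the split reproducible from one execution to the next. The sketch below illustrates the idea with the standard library only. `split_indices` is a hypothetical helper, and its shuffling and rounding are a simplification, not scikit-learn's actual implementation.

```python
import random


def split_indices(n_rows, test_size, seed):
    """Shuffle row indices with a fixed seed and cut off the test share."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    # The first n_test shuffled indices form the test set, the rest the training set.
    return indices[n_test:], indices[:n_test]


train_a, test_a = split_indices(100, test_size=0.2, seed=42)
train_b, test_b = split_indices(100, test_size=0.2, seed=42)

print(len(train_a), len(test_a))
print(train_a == train_b and test_a == test_b)
```

With the same seed, the two calls return identical index lists, so any metric computed on the test rows can be compared across runs.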

Let's create our training and testing datasets, as they don't exist yet.

In [106]:
# We create our training dataset as it doesn't exist yet
experiment.vectice.create_dataset(name = "Training data",
                                  description="Training data")

# We create our testing dataset as it doesn't exist yet
experiment.vectice.create_dataset(name = "Testing data",
                                  description="Testing data")

Dataset(name=Testing data, id=10253, description=Testing data, connection=None, resources=[])

Let's create the first versions of our training and testing datasets.

In [107]:
# We create our first version of the training dataset
experiment.add_dataset_version(dataset="Training data",
                               metadata=[FileMetadata(
                              name="train.csv",
                              uri="gs://vectice_tutorial/train_cleaned_kc_house_data.csv")])

# We create our first version of the testing dataset
experiment.add_dataset_version(dataset="Testing data",
                               metadata=[FileMetadata(
                              name="test.csv",
                              uri="gs://vectice_tutorial/test_cleaned_kc_house_data.csv")])


# We complete the current experiment's run 
# All the dataset versions that have been created after starting the experiment will be automatically
# attached as outputs of the run
experiment.complete()

## Modeling

In [108]:
# First, we list the available datasets. We need the train/test split data.
vectice.list_datasets()

[Dataset(name=Testing data, id=10253, description=Testing data, connection=None, resources=None),
 Dataset(name=Training data, id=10252, description=Training data, connection=None, resources=None),
 Dataset(name=Cleaned data, id=10251, description=cleaned data, connection=None, resources=None)]

In [109]:
# From there, we look up the current versions of each dataset
vectice.list_dataset_versions(dataset = "Training data")

[DatasetVersion(dataset=Dataset(name=Training data, id=10252, description=Training data, connection=None, resources=None), id=13180, description=None, is_starred=False, auto_version=True, name=Version 1, properties=None, version=None)]

In [110]:
vectice.list_dataset_versions(dataset = "Testing data")

[DatasetVersion(dataset=Dataset(name=Testing data, id=10253, description=Testing data, connection=None, resources=None), id=13181, description=None, is_starred=False, auto_version=True, name=Version 1, properties=None, version=None)]

In [112]:
# Load the current dataset versions to be used in the modeling experiment
train_ds_version = vectice.get_dataset_version(version=13180)
test_ds_version =  vectice.get_dataset_version(version=13181)

In [None]:
# Initialize your Modeling experiment
# You can also declare your API endpoint and API token when initializing your experiment if you don't want to declare them as environment variables
# experiment = Experiment(job="Modeling", user_token="API Token", api_endpoint= "API Endpoint", project=Project_id, job_type=JobType.TRAINING)
experiment = Experiment(job="Modeling", project=Project_id, job_type=JobType.TRAINING)

### Baseline model

Let's create our baseline model, log the run and the model version along with its metrics to Vectice, and document all of this automatically.

In [114]:
# Define some run properties
technique = ["Approach", "Linear Regression"]

# We start our run and declare its inputs, properties and notes
experiment.start(inputs = [input_code, train_ds_version, test_ds_version],
                 run_properties={technique[0]: technique[1]},
                 run_notes="Linear regression training run")

# Linear regression model
lr_rg = LinearRegression()
lr_rg.fit(X_train, y_train)
lr_pred = lr_rg.predict(X_test)

# Evaluate metrics (scikit-learn expects the true values first, then the predictions)
MAE = round(mean_absolute_error(y_test, lr_pred),3)
RMSE = round(sqrt(mean_squared_error(y_test, lr_pred)),3)

print("Root Mean Squared Error: ", RMSE)
print("Mean Absolute Error: ", MAE)

# Let's log the model we trained along with its metrics, as a new version 
# of the "Price Predictor" model in Vectice.
metrics = {"RMSE": RMSE,
           "MAE": MAE}

model = experiment.add_model_version(
                                model="Price Predictor",
                                algorithm = technique[1],
                                metrics = metrics)

# You can automatically document the run using a default template.
# You can provide a stage name. The run's documentation will be added to the specified stage.
# If the stage doesn't exist, a new stage will be created.
experiment.document_run(name="Modeling stage")

# We complete the current experiment's run 
# All the model versions that have been created after starting the experiment will be automatically
# attached as outputs of the run
experiment.complete()

Root Mean Squared Error:  156149.062
Mean Absolute Error:  109761.979
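
For readers who want to see what these two metrics compute, here is a small standard-library sketch. The function names `mae` and `rmse` and the sample prices are made up for illustration. The notebook itself relies on scikit-learn's mean_absolute_error and mean_squared_error.

```python
from math import sqrt


def mae(y_true, y_pred):
    """Mean of the absolute differences between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the mean of the squared differences."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Made-up prices, for illustration only.
y_true = [200000.0, 350000.0, 500000.0]
y_pred = [210000.0, 330000.0, 560000.0]

print("Mean Absolute Error: ", mae(y_true, y_pred))
print("Root Mean Squared Error: ", round(rmse(y_true, y_pred), 3))
```

Because RMSE squares each error before averaging, the single 60,000 miss weighs more heavily in it than in the MAE, which is why RMSE is the larger of the two here, as it is in the output above. Both metrics are symmetric in their arguments, so swapping true and predicted values does not change the result.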


### Advanced model

Let's create an advanced model using the decision tree algorithm, log the run and the model version along with its metrics and hyperparameters to Vectice, and document all of this automatically.

Here, we'll try several values of the tree depth parameter and create a new run with a new model version for each value.

In [None]:
## Define some run properties
technique = ["Approach", "Decision Tree"]

tree_depth = [2, 4, 6]
# We create a new run with a new model version for each value of the tree depth parameter and declare its inputs
for i in range(len(tree_depth)):
  experiment.start(inputs = [input_code, train_ds_version, test_ds_version],
                 run_properties={technique[0]: technique[1]},
                 run_notes="Decision tree training run with tree_depth = "+str(tree_depth[i]))


  # Decision tree model
  dtr = DecisionTreeRegressor(max_depth=tree_depth[i], min_samples_split=50)
  dtr.fit(X_train,y_train)
  dtr_pred = dtr.predict(X_test) 

  data_feature_names = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
                      'waterfront', 'view', 'condition', 'grade', 'sqft_above',
                      'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat',
                      'long', 'sqft_living15', 'sqft_lot15']

  # Visualize the Decision Tree Model
  plt.figure(figsize=(25, 10))
  plot_tree(dtr, feature_names=data_feature_names, filled=True, fontsize=10)
  attachment_name = "DecisionTree_"+str(tree_depth[i])+".png"
  plt.savefig(attachment_name)

  # Evaluate metrics (scikit-learn expects the true values first, then the predictions)
  MAE = round(mean_absolute_error(y_test, dtr_pred),3)
  RMSE = round(sqrt(mean_squared_error(y_test, dtr_pred)),3)


  # Let's log the model we trained along with its metrics and hyperparameters, as a new version 
  # of the "Price Predictor" model in Vectice.
  metrics = {"RMSE" : RMSE,
             "MAE": MAE}
  hyper_parameters = {"Tree Depth": tree_depth[i]}
  model = experiment.add_model_version(
                                model="Price Predictor",
                                algorithm = technique[1],
                                metrics = metrics,
                                hyper_parameters = hyper_parameters,
                                attachment=[attachment_name])

  # You can automatically document the run using a default template.
  # You can provide a stage name. The run's documentation will be added to the specified stage.
  # If the stage doesn't exist, a new stage will be created.
  experiment.document_run(name="Modeling stage")

  # We complete the current experiment's run 
  # All the model versions that have been created after starting the experiment will be automatically
  # attached as outputs of the run
  experiment.complete()

### Update the model

Let's update our model's description and type

In [123]:
experiment.update_model(model = "Price Predictor", description= " House price prediction model", type = ModelType.REGRESSION)

Model(name=Price Predictor, id=4794, description= House price prediction model, type=ModelType.REGRESSION)

### Get a table of all the model versions

You can also retrieve all the model versions you created in previous runs, to analyze them offline and understand in more detail what drives the models' performance.

In [116]:
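# We get a dataframe of all our model versions and sort it by MAE descending.
# We use a new variable name so that we don't overwrite the housing data stored in df.
model_versions_df = experiment.list_model_versions_dataframe(model="Price Predictor").sort_values(by='MAE', ascending=False)
model_versions_df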

Unnamed: 0,createdDate,name,versionNumber,status,algorithmName,isStarred,MAE,RMSE,Tree Depth
0,2022-08-25T15:09:15.131Z,Version 4,4,EXPERIMENTATION,Decision Tree,False,95872.64,144649.557,6.0
2,2022-08-25T15:08:51.707Z,Version 2,2,EXPERIMENTATION,Decision Tree,False,141604.093,203604.213,2.0
1,2022-08-25T15:09:03.145Z,Version 3,3,EXPERIMENTATION,Decision Tree,False,112088.196,165154.562,4.0
3,2022-08-25T15:07:56.602Z,Version 1,1,EXPERIMENTATION,Linear Regression,False,109761.979,156149.062,
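
Note that the rows above are not in numeric MAE order: 95872.64 comes first even though it is the smallest value. That ordering is what a descending sort on text would produce, so one plausible explanation (an assumption, not something the notebook confirms) is that the MAE column holds strings. The sketch below reproduces both orderings with plain Python, using values copied from the table. In pandas, converting the column with pd.to_numeric before sorting would be the equivalent fix.

```python
# (version name, MAE as text) pairs copied from the table above.
rows = [
    ("Version 4", "95872.64"),
    ("Version 2", "141604.093"),
    ("Version 3", "112088.196"),
    ("Version 1", "109761.979"),
]

# Sorting text in descending order compares character by character,
# so "9..." sorts ahead of every value that starts with "1".
as_text = sorted(rows, key=lambda row: row[1], reverse=True)

# Converting to float first gives a true numeric ranking (lowest MAE first).
as_numbers = sorted(rows, key=lambda row: float(row[1]))

print([name for name, _ in as_text])
print([name for name, _ in as_numbers])
```

The text-based ordering matches the table as displayed, while the numeric ordering ranks the versions from best to worst MAE.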


### Update the status of the best model version and star it

The best model version is version 4, a decision tree with a tree depth of 6. Let's star it and update its status to Production.
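
The choice of version 4 can be checked directly from the metrics in the table. The following sketch uses values copied from that table. The dictionary layout and variable names are just one hypothetical way to organize them.

```python
# Metrics copied from the model versions table above.
baseline = {"name": "Version 1", "MAE": 109761.979, "RMSE": 156149.062}
trees = [
    {"name": "Version 2", "depth": 2, "MAE": 141604.093, "RMSE": 203604.213},
    {"name": "Version 3", "depth": 4, "MAE": 112088.196, "RMSE": 165154.562},
    {"name": "Version 4", "depth": 6, "MAE": 95872.64, "RMSE": 144649.557},
]

# The best decision tree is the one with the lowest MAE.
best = min(trees, key=lambda v: v["MAE"])

# Which decision trees actually beat the linear regression baseline?
beats_baseline = [v["name"] for v in trees if v["MAE"] < baseline["MAE"]]

# Relative MAE improvement of the best tree over the baseline.
mae_gain = (baseline["MAE"] - best["MAE"]) / baseline["MAE"]

print("best:", best["name"], "with depth", best["depth"])
print("versions beating the baseline on MAE:", beats_baseline)
print("relative MAE improvement: {:.1%}".format(mae_gain))
```

On these numbers, only the depth-6 tree improves on the linear regression baseline. The trees with depths 2 and 4 have a higher MAE and RMSE than the baseline.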

In [117]:
experiment.list_model_versions(model = "Price Predictor")

[ModelVersion(model=Model(name=Price Predictor, id=4794, description=None, type=ModelType.OTHER), id=13110, description=None, metrics={}, hyper_parameters={'Tree Depth': '6'}, algorithm_name=Decision Tree, status=ModelVersionStatus.EXPERIMENTATION, is_starred=False, version={'versionNumber': 4, 'versionName': 'Version 4', 'id': 13110}, user_declared_version=None),
 ModelVersion(model=Model(name=Price Predictor, id=4794, description=None, type=ModelType.OTHER), id=13109, description=None, metrics={}, hyper_parameters={'Tree Depth': '4'}, algorithm_name=Decision Tree, status=ModelVersionStatus.EXPERIMENTATION, is_starred=False, version={'versionNumber': 3, 'versionName': 'Version 3', 'id': 13109}, user_declared_version=None),
 ModelVersion(model=Model(name=Price Predictor, id=4794, description=None, type=ModelType.OTHER), id=13108, description=None, metrics={}, hyper_parameters={'Tree Depth': '2'}, algorithm_name=Decision Tree, status=ModelVersionStatus.EXPERIMENTATION, is_starred=False,

In [118]:
experiment.update_model_version(version = 13110, status=ModelVersionStatus.PRODUCTION, is_starred=True)

ModelVersion(model=Model(name=Price Predictor, id=4794, description=None, type=ModelType.OTHER), id=13110, description=None, metrics=None, hyper_parameters=None, algorithm_name=Decision Tree, status=ModelVersionStatus.PRODUCTION, is_starred=True, version={'versionNumber': 4, 'versionName': 'Version 4', 'id': 13110}, user_declared_version=None)

### Add a conclusion and complete the stage

Let's add a conclusion to our documentation:

In [119]:
# We retrieve the stage
stage = experiment.vectice.get_stage(stage = "Modeling stage")

In [120]:
# We add our conclusion to the stage. By default, it is added at the end of the stage
stage.add_block(text = "Conclusion")
stage.add_block(text = "With a depth of 6, the decision tree performs better than our linear regression baseline; with depths of 2 and 4 it does not. As the depth of the tree increases, the MAE and RMSE decrease, but a deeper tree is also more likely to overfit. One good compromise could be a depth of 6, with an MAE of 96K and an RMSE of 145K.")

Stage(id=25007, name=Modeling stage, status=StageStatus.InProgress, origin=StageOrigin.VecticeFile)

In [121]:
# We complete our stage
experiment.vectice.update_stage(stage="Modeling stage", status=StageStatus.Completed)

Stage(id=25007, name=Modeling stage, status=StageStatus.Completed, origin=StageOrigin.VecticeFile)