## Preparation

This notebook continues our EDA notebook. As in the EDA notebook, you will need the secret key JSON file that was provided with your Vectice account to access the housing data stored in Google Cloud Storage (GCS).

We will use the following models to make our predictions and to illustrate how to log information into Vectice with the Vectice SDK:
- Linear Regression
- Decision Tree Regression
- Random Forest Regression

Vectice lets us track our experiments, datasets, analyses, and documentation in one place.

### Install Vectice and GCS packages

In [None]:
## Requirements
!pip3 install fsspec
!pip3 install gcsfs
!pip3 install vectice

In [None]:
!pip3 show vectice

### Retrieve the data from GCS

Let's retrieve the file that was generated after the EDA was completed.

In [None]:
# Upload the service account JSON key file that was provided with your tutorial account; it is needed to access GCS.
# The name should be something like gridmauk-10b1aaafb63f.json.
from google.colab import files
uploaded = files.upload()


In [None]:
# Once your file is uploaded, set the credentials for GCS and load the data into a pandas DataFrame.
# Double-check that the JSON file name below matches the file you uploaded.
import os
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'gcsTutorialTest.json'

#df_2019 = pd.read_csv('gs://test-vectice-storage/kc_house_data_2019.csv')
df = pd.read_csv('gs://vectice_tutorial/kc_house_data_cleaned.csv')

# Run head to make sure the data was loaded properly
df.head()


Vectice provides a generic metadata layer that is potentially suitable for most data science workflows.

For this tutorial we will use the popular scikit-learn library for modeling and track experiments directly through our Python SDK, which lets you choose exactly what you would like to track (metrics, properties, and so on). The same mechanisms would apply to R, Java, or even more generic REST APIs to track metadata from any programming language and library.

In [None]:
# To use the Vectice SDK, let's set up the configuration first.
# The Vectice API key below can be generated from the UI.
# For better security, we strongly recommend keeping these settings in a dedicated file rather than hard-coding them here.
os.environ['VECTICE_API_ENDPOINT']= "be-dev.vectice.com"
#os.environ['VECTICE_API_TOKEN'] = "APITOKEN_APITOKEN"
os.environ['VECTICE_API_TOKEN'] = "wz1pa0y0Z.2xQ7rgbPO3Gdm4qLNRkWwz1pa0y0ZoX6V5BK82aM9DjnlJEYem"

from vectice import Vectice

#vectice = Vectice(project_token="PROJECTTOKEN_PROJECTTOKEN")
vectice = Vectice(project_token="nwlOqv91Hgn9JaGve863")
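
The recommendation above to keep settings in a dedicated file could look roughly like the sketch below. It is only an illustration: the file name `vectice_settings.json` and the helper `load_settings` are hypothetical and not part of the Vectice SDK, and the values written in the demo are placeholders. The idea is to read the settings from a JSON file that is kept out of the notebook and export them as environment variables before creating the Vectice object.

```python
import json
import os
import tempfile


def load_settings(path):
    """Read key/value settings from a JSON file and export them as environment variables."""
    with open(path, encoding="utf-8") as fh:
        settings = json.load(fh)
    for key, value in settings.items():
        os.environ[key] = str(value)
    # Return only the setting names so secret values are never printed.
    return sorted(settings)


# Demo with a throwaway file and placeholder values (not real credentials).
with tempfile.TemporaryDirectory() as tmp_dir:
    settings_path = os.path.join(tmp_dir, "vectice_settings.json")  # hypothetical file name
    with open(settings_path, "w", encoding="utf-8") as fh:
        json.dump(
            {
                "VECTICE_API_ENDPOINT": "be-dev.vectice.com",
                "VECTICE_API_TOKEN": "APITOKEN_APITOKEN",
            },
            fh,
        )
    loaded_keys = load_settings(settings_path)

print("Loaded settings:", loaded_keys)
```

In a real project the settings file would live outside the notebook and outside version control, so the token never appears in a shared cell.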


### Split dataset into training and testing

Below, we split the dataset into training data and testing data and save them in GCS. The GCS code has been commented out because the data has already been generated.

In [None]:
import string
from math import sqrt

# Load scikit-learn packages
from sklearn.model_selection import train_test_split  # Model Selection
from sklearn.metrics import mean_absolute_error, mean_squared_error  # Model Evaluation
from sklearn.linear_model import LinearRegression  # Linear Regression
from sklearn.tree import DecisionTreeRegressor, plot_tree  # Decision Tree Regression
from sklearn.ensemble import RandomForestRegressor  # Random Forest Regression


In [None]:
# Use auto-versioning here
input_ds_version = vectice.create_dataset_version().with_parent_name("kc_house_data")

# Start a Vectice run. The job type should be PREPARATION in this case.
vectice.create_run("jobSplitHousingData")
vectice.start_run(inputs=[input_ds_version])

# Let's split training and testing sets using a random split of 80% for training, 20% for testing.
# We use a seed to make the results reproducible for this example.
train, test = train_test_split(df, test_size=0.2, random_state = 42)

# We commented out the code that persists the training and testing sets in GCS
# because we already generated them for you, but feel free to uncomment and run it.
# The key you were provided for this tutorial may not have write permissions to GCS.
# Let us know if you want to be able to write files as well and we can issue you a different key.
# train.to_csv (r'gs://vectice_data_examples/reference_data/training_data.csv', index = False, header = True)
# test.to_csv (r'gs://vectice_data_examples/reference_data/testing_data.csv', index = False, header = True)

# Let's further generate X_train, X_test, y_train and y_test, which we will need for modeling.
X = df.drop("price", axis=1).values
y = df["price"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

train_ds_version = vectice.create_dataset_version().with_parent_name("train_kc_house_data")
test_ds_version = vectice.create_dataset_version().with_parent_name("test_kc_house_data")

vectice.end_run(outputs=[train_ds_version,test_ds_version])

X_train
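
To illustrate why the fixed seed makes the split reproducible, here is a minimal pure-Python sketch. It is not scikit-learn's implementation; the helper `seeded_split` is a hypothetical name used only for this example. The same principle is what lets the cell above call train_test_split twice with random_state=42 and obtain matching partitions for the full frame and for X and y.

```python
import random


def seeded_split(rows, test_size=0.2, seed=42):
    """Shuffle a copy of rows with a fixed seed, then hold out the last test_size fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]


rows = list(range(10))
train_a, test_a = seeded_split(rows)
train_b, test_b = seeded_split(rows)

print(len(train_a), len(test_a))                # 8 2
print(train_a == train_b and test_a == test_b)  # True: same seed, same split
```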


### Name your experiments

Give each experiment a unique name.

In [None]:
# Let's generate some unique names for the modeling experiments that follow
import random

def get_random_string(length):
    return "".join(random.choice(string.ascii_letters) for _ in range(length))

rdm_str = get_random_string(5)
print("Generated random string to make job names unique:", rdm_str)

# Linear regression
LR_EXPERIMENT_NAME = "LR-Model-" + rdm_str

# Decision tree
DT_EXPERIMENT_NAME = "DT-Model-" + rdm_str

# Random forest
RF_EXPERIMENT_NAME = "RF-Model-" + rdm_str


## Modeling

### Linear regression model

First, we will fit a basic Linear Regression and record its error metrics as a baseline.

In [None]:
# Each "trial" for a given job is called a run. For LR we will only do one run.
# Setting the job's name is mandatory and the run's name is optional; both can be updated from the Vectice UI.

vectice.create_run(LR_EXPERIMENT_NAME)

vectice.start_run(inputs=[input_ds_version])

lr_rg = LinearRegression()
lr_rg.fit(X_train, y_train)
lr_pred = lr_rg.predict(X_test)

# Evaluate metrics (scikit-learn's convention is y_true first, then y_pred)
MAE = mean_absolute_error(y_test, lr_pred)
RMSE = sqrt(mean_squared_error(y_test, lr_pred))

print("Root Mean Squared Error: ", RMSE)
print("Mean Absolute Error: ", MAE)

# The first time, you must define a user version.
# Here, we use a random string so that the same user-generated name is unlikely to be used twice.
# This works around the known issue described in the commented-out lines below.
model_version = (
    vectice.create_model_version()
    .with_parent_name("linearRegression")
    .with_property("Algorithm", "Linear Regression")
    .with_metric("RMSE", RMSE)
    .with_metric("MAE", MAE)
    .with_user_version(get_random_string(12))
)

# Known issue: successive runs fail and may point to another model despite defining the parent_name and number and version
#model_version = vectice.create_model_version().with_parent_name("linearRegression").with_property("Algorithm","Linear Regression").with_metric("RMSE",RMSE).with_metric("MAE",MAE).with_existing_version(186,"Version 1")

vectice.end_run(outputs=[model_version])
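
For readers who want to see what the two logged metrics measure, here is a small pure-Python sketch with hand-checkable numbers. The function names are hypothetical and the values are toy data, not housing prices. MAE is the average absolute difference between true and predicted values, and RMSE is the square root of the average squared difference, so both are expressed in the same unit as the target (here, the price).

```python
from math import sqrt


def mean_absolute_error_py(y_true, y_pred):
    """Average of |true - predicted| over all samples."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error_py(y_true, y_pred):
    """Square root of the average of (true - predicted)^2 over all samples."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Toy values chosen so the arithmetic is easy to follow.
y_true_demo = [3.0, 5.0]
y_pred_demo = [1.0, 9.0]

mae_demo = mean_absolute_error_py(y_true_demo, y_pred_demo)        # (2 + 4) / 2 = 3.0
rmse_demo = root_mean_squared_error_py(y_true_demo, y_pred_demo)   # sqrt((4 + 16) / 2) = sqrt(10)

print("MAE:", mae_demo)
print("RMSE:", round(rmse_demo, 4))
```

Because RMSE squares each error before averaging, it penalizes large misses more heavily than MAE, which is why the two numbers can rank models differently.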


### Decision tree model

Let's try a decision tree and see whether it gives a lower error. Try different values for tree_depth. Metrics are logged in the same way as above.

In [None]:
# We can do a few runs with different max depth for the tree.
# Just change the value below and re-run this cell.
# You should see the new runs showing up in the Vectice UI.
tree_depth = 4

vectice.create_run(DT_EXPERIMENT_NAME)

vectice.start_run(inputs=[input_ds_version])

dtr = DecisionTreeRegressor(max_depth=tree_depth, min_samples_split=50)
dtr.fit(X_train,y_train)
dtr_pred = dtr.predict(X_test) 

data_feature_names = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
                      'waterfront', 'view', 'condition', 'grade', 'sqft_above',
                      'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat',
                      'long', 'sqft_living15', 'sqft_lot15']

# Visualize the Decision Tree Model
plt.figure(figsize=(25, 10))
plot_tree(dtr, feature_names=data_feature_names, filled=True, fontsize=10)

MAE = mean_absolute_error(y_test, dtr_pred)
RMSE = sqrt(mean_squared_error(y_test, dtr_pred))

print("Root Mean Squared Error:", RMSE)
print("Mean Absolute Error:", MAE)

model_version2 = (
    vectice.create_model_version()
    .with_parent_name("decisionTree")
    .with_property("Algorithm", "Decision Tree")
    .with_metric("RMSE", RMSE)
    .with_metric("MAE", MAE)
    .with_user_version(get_random_string(12))
)

vectice.end_run(outputs=[model_version2])
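
Instead of editing tree_depth by hand and re-running the cell, the depths could also be swept in a loop. The sketch below is a rough illustration on synthetic data generated with scikit-learn's make_regression, not on the housing data, and it leaves out the Vectice calls entirely. The candidate depths are arbitrary assumptions.

```python
from math import sqrt

from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (assumption made for this example only).
X_demo, y_demo = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

results = []
for depth in (2, 4, 6, 8):  # arbitrary candidate depths
    model = DecisionTreeRegressor(max_depth=depth, min_samples_split=50, random_state=42)
    model.fit(X_tr, y_tr)
    rmse = sqrt(mean_squared_error(y_te, model.predict(X_te)))
    results.append((depth, rmse))
    print(f"max_depth={depth}: RMSE={rmse:.2f}")

# Pick the depth with the lowest test RMSE.
best_depth, best_rmse = min(results, key=lambda item: item[1])
print("Lowest test RMSE at max_depth =", best_depth)
```

In the notebook, each iteration would correspond to one run. Note that choosing the depth on the test set gives an optimistic estimate; a separate validation set would be cleaner.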

### Random forest model

Let's use Random Forest Regression and tune some of its hyper-parameters.

In [None]:
# You can do multiple runs by modifying the values below
nb_trees = 30
min_samples = 30

vectice.create_run(RF_EXPERIMENT_NAME)

vectice.start_run(inputs=[input_ds_version])

rf_regressor = RandomForestRegressor(n_estimators=nb_trees, min_samples_leaf=min_samples)
rf_regressor.fit(X_train, y_train)
print("R^2 score:", rf_regressor.score(X_test, y_test))
rf_regressor_pred = rf_regressor.predict(X_test)

MAE = mean_absolute_error(y_test, rf_regressor_pred)
RMSE = sqrt(mean_squared_error(y_test, rf_regressor_pred))

print("Root Mean Squared Error:", RMSE)
print("Mean Absolute Error:", MAE)

# Here's an alternative way to declare metrics
metrics = [("RMSE",RMSE),
           ("MAE",MAE)]

model_version3 = (
    vectice.create_model_version()
    .with_parent_name("randomForest")
    .with_property("Algorithm", "Random Forest")
    .with_metrics(metrics)
    .with_user_version(get_random_string(12))
)

vectice.end_run(outputs=[model_version3])
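
When doing several runs with different values of nb_trees and min_samples, it can help to list the combinations up front rather than editing the cell by hand. Here is a minimal sketch using only the standard library. The helper `expand_grid` is a hypothetical name, and the candidate values other than 30 are assumptions chosen for illustration.

```python
from itertools import product

# Candidate values per hyper-parameter (illustrative assumptions).
param_grid = {
    "nb_trees": [30, 100],
    "min_samples": [10, 30],
}


def expand_grid(grid):
    """Return one dict per combination of the candidate values in grid."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


run_configs = expand_grid(param_grid)
for config in run_configs:
    print(config)
```

Each dictionary could then feed one run of the cell above, for example by setting nb_trees and min_samples from it before fitting the model.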


We can see that the Random Forest Regressor model gives the lowest error and should be the preferred approach despite the complexity of the algorithm. Let's look at the feature importances to see which variables influence the model the most.

In [None]:
columns = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors',
           'waterfront', 'view', 'condition', 'grade', 'sqft_above',
           'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat',
           'long', 'sqft_living15', 'sqft_lot15']

importance = pd.DataFrame({'Importance': rf_regressor.feature_importances_ * 100}, index=columns)
# Sort ascending so the most important feature ends up at the top of the horizontal bar chart
importance.sort_values(by="Importance", axis=0, ascending=True).plot(kind="barh", color="b", legend=False)
plt.xlabel("Variable Importance")
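
The chart relies on the hard-coded column list being in the same order, and of the same length, as the features the model was trained on. A small pure-Python sketch of pairing names with importances and ranking them is shown below. The helper `rank_features`, the feature names, and the importance values are all made up for illustration and are not results from this notebook.

```python
def rank_features(names, importances, top_k=3):
    """Pair each feature name with its importance and return the top_k pairs, largest first."""
    if len(names) != len(importances):
        raise ValueError("names and importances must have the same length")
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Made-up names and values, for illustration only.
demo_names = ["feature_a", "feature_b", "feature_c"]
demo_importances = [0.2, 0.5, 0.3]

top_features = rank_features(demo_names, demo_importances, top_k=2)
print(top_features)  # [('feature_b', 0.5), ('feature_c', 0.3)]
```

The length check is a cheap guard against a column list that has drifted out of sync with the training data.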


Thank you and congratulations! You have successfully completed this tutorial.

In this tutorial we have illustrated how you can capture your experiments, hyper-parameters, datasets and metrics in Vectice for analysis and documentation, and to start a business conversation around the findings.

This will enable you to:
1. Make your experiments more reproducible.
2. Track the data that is used for each experiment and be able to quickly debug issues with live models.
3. Capture the key insights that were discovered during the project and record all decisions taken.

We are continually extending the Vectice SDK.

Let us know what improvements you would like to see in the solution and what your favorite features are after completing this tutorial. Any comments will be appreciated!

Feel free to explore more and come up with your own ideas on how to best start leveraging Vectice!
