<a href="https://colab.research.google.com/github/vectice/vectice-examples/blob/master/MLflow/Diamonds_Price_Prediction/Diamonds_Price_Predictions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook walks you through the Vectice integration with MLflow.

## Install Vectice, MLflow and GCS packages

To keep the Vectice library light, we install only the primary dependencies and let you install the others when needed. Here, we install the `github` extra because our notebook is hosted on GitHub and we need the GitHub package to point to the notebook from the Vectice UI. Add the other extras (`gitlab`, `bitbucket`) if you are going to use them: `!pip install -q "vectice[github,gitlab,bitbucket]"`.

In [1]:
!pip3 install -q vectice[github]==22.3.5.1
!pip3 install -q fsspec
!pip3 install -q gcsfs
!pip3 install -q google-cloud-storage
!pip3 install -q mlflow
!pip3 install -q xgboost


In [2]:
!pip3 show vectice

Name: vectice
Version: 2.2.3
Summary: Vectice Python library
Home-page: https://github.com/vectice/vectice-python
Author: Vectice Inc.
Author-email: sdk@vectice.com
License: Apache License 2.0
Location: /usr/local/lib/python3.7/dist-packages
Requires: requests, python-dotenv, urllib3
Required-by: 


The main entry point of the SDK is the high-level API, which provides several ways to track your runs:

* a procedural solution with two methods to call: vectice.create_run() and vectice.save_after_run()

* a more powerful solution based on the vectice.Vectice class, which itself offers several possibilities:

  * use an instance of vectice.Vectice to call create_run(), start_run() and end_run() (fluent API)

  * use the context manager syntax (the Python with keyword), in which case the end of the run is managed automatically.
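
To make the context manager option concrete, here is a minimal sketch of how a `with` block can close a run automatically. `ToyRun` and its status strings are hypothetical names invented for this illustration; this is not the Vectice implementation, only the general Python mechanism it relies on.

```python
class ToyRun:
    """Hypothetical stand-in for a tracked run; not part of the Vectice API."""

    def __init__(self, name):
        self.name = name
        self.status = "created"

    def __enter__(self):
        # Entering the `with` block starts the run.
        self.status = "started"
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Leaving the block ends the run, even if an exception was raised.
        self.status = "failed" if exc_type is not None else "completed"
        return False  # do not swallow exceptions


with ToyRun("demo") as run:
    status_inside = run.status

print(status_inside, "->", run.status)
```

Because `__exit__` always runs when the block is left, there is no explicit "end run" call to forget.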

In [1]:
import logging
from math import sqrt
import os 
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, QuantileTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn import metrics

import mlflow
from vectice import Experiment, Vectice
from vectice.api.json.artifact_type import JobArtifactType
from vectice.api.json.artifact_version import VersionStrategy
from vectice.api.json.model import ModelType
from vectice.api.json.job import JobType
from vectice.api.json.run import RunStatus
from vectice.models.dataset_metadata import DatasetMetadata

### Data:
This classic dataset contains the prices and other attributes of almost 54,000 diamonds. It has 10 attributes, including the target, i.e. price.

### Feature description:

price: price in US dollars ($326--$18,823). This is the target column, the value the models learn to predict from the other features.

### The 4 Cs of Diamonds:

- carat (0.2--5.01) The carat is the diamond’s physical weight measured in metric carats.  One carat equals 1/5 gram and is subdivided into 100 points. Carat weight is the most objective grade of the 4Cs. 

- cut (Fair, Good, Very Good, Premium, Ideal) In determining the quality of the cut, the diamond grader evaluates the cutter’s skill in the fashioning of the diamond. The more precise the diamond is cut, the more captivating the diamond is to the eye.  

- color, from J (worst) to D (best) The colour of gem-quality diamonds occurs in many hues, in the range from colourless to light yellow or light brown. Colourless diamonds are the rarest. Other natural colours (blue, red, pink, for example) are known as "fancy", and their colour grading is different from that of white colourless diamonds.  

- clarity (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)) Diamonds can have internal characteristics known as inclusions or external characteristics known as blemishes. Diamonds without inclusions or blemishes are rare; however, most characteristics can only be seen with magnification.  
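
As a quick check of the carat arithmetic above (one carat is 1/5 gram and 100 points), the following sketch converts the smallest and largest carat values in the dataset. The function and constant names are assumptions made up for this example.

```python
GRAMS_PER_CARAT = 0.2    # one metric carat is 1/5 gram
POINTS_PER_CARAT = 100   # one carat is subdivided into 100 points


def carat_to_grams(carat):
    """Convert a weight in carats to grams."""
    return carat * GRAMS_PER_CARAT


def carat_to_points(carat):
    """Convert a weight in carats to points."""
    return carat * POINTS_PER_CARAT


for carat in (0.2, 1.0, 5.01):
    grams = round(carat_to_grams(carat), 3)
    points = round(carat_to_points(carat))
    print(f"{carat} ct = {grams} g = {points} points")
```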

### Goal: 

The goal is to predict the prices of diamonds using the features in the given dataset, so this is a regression problem. You'll perform a bit of data cleaning and create multiple models that are fed into MLflow. The code used to achieve this is hidden, but you can view it. However, it'll be more fun to give it a good old college try as a team and resort to the hidden code only if all else fails.

Here is a link to the [Vectice Python library Documentation](https://doc.vectice.com/)

We are going to load data stored in Google Cloud Storage, which is provided by Vectice on the tutorial page.
```
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'readerKey.json'
```

## Credentials Setup:
##### The Vectice API Endpoint and Token are needed to connect to the Vectice UI. Furthermore, a Google Cloud Storage credential JSON is needed to connect to the Google Cloud Storage to retrieve and upload the datasets. A project token links the runs to the relevant project and it's needed to create runs.

In [2]:
# Vectice API Endpoint
os.environ['VECTICE_API_ENDPOINT'] ='https://app.vectice.com'

# To use the Vectice Python library, you first need to authenticate your account using an API token.
# You can generate an API token from the Vectice UI, by going to the "API Tokens" section in the "My Profile" section
# which is located under your profile picture.
# You can specify your API Token here in the notebook, but we recommend you to add it to a .env file
os.environ['VECTICE_API_TOKEN'] = "Vectice API Token"

# Download the "JSON file" from the "Vectice Tutorial Page" in the application so that 
# you can access the GCS bucket. The name of the JSON file should be "readerKey.json"
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "readerKey.json"

# The project ID: replace ID with the ID of your Vectice project
project_id = ID
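
The comments above recommend keeping the API token in a `.env` file rather than in the notebook. The sketch below shows one way that could work using only the standard library. `load_env_file`, the variable name `DEMO_API_TOKEN` and its value are hypothetical, invented for this illustration; in practice the `python-dotenv` package (already a Vectice dependency) offers `load_dotenv()`, which handles more cases.

```python
import os
import tempfile


def load_env_file(path):
    """Hypothetical helper: read KEY=VALUE lines from a .env-style file into os.environ."""
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            # Skip blank lines, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Variables already set in the environment take precedence over the file.
            os.environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))


# Demo with a throwaway file and a made-up variable name and value.
with tempfile.TemporaryDirectory() as tmp_dir:
    env_path = os.path.join(tmp_dir, ".env")
    with open(env_path, "w", encoding="utf-8") as handle:
        handle.write('# secrets stay out of the notebook\nDEMO_API_TOKEN="not-a-real-token"\n')
    load_env_file(env_path)

print(os.environ["DEMO_API_TOKEN"])
```

With this pattern the notebook can be shared or committed without exposing the token, as long as the `.env` file itself is excluded from version control.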

### Reading the data from GCS

In [4]:
data = pd.read_csv(r"gs://vectice-examples-samples/Diamonds/diamonds.csv")
data.head(5)

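|   | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |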


In [5]:
# This shows you the number of rows and columns
data.shape

(53940, 10)

In [6]:
# The details of the data
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   carat    53940 non-null  float64
 1   cut      53940 non-null  object 
 2   color    53940 non-null  object 
 3   clarity  53940 non-null  object 
 4   depth    53940 non-null  float64
 5   table    53940 non-null  float64
 6   price    53940 non-null  int64  
 7   x        53940 non-null  float64
 8   y        53940 non-null  float64
 9   z        53940 non-null  float64
dtypes: float64(6), int64(1), object(3)
memory usage: 4.1+ MB


### Data Cleaning
In machine learning, if the data is irrelevant or error-prone then it leads to an incorrect model being built.

In [7]:
# Dropping dimensionless diamonds
data = data.drop(data[data["x"]==0].index)
data = data.drop(data[data["y"]==0].index)
data = data.drop(data[data["z"]==0].index)
# We dropped 20 dimensionless entries
data.shape

(53920, 10)

In [None]:
#Dropping the outliers. 
data = data[(data["depth"]<75)&(data["depth"]>45)]
data = data[(data["table"]<80)&(data["table"]>40)]
data = data[(data["x"]<30)]
data = data[(data["y"]<30)]
data = data[(data["z"]<30)&(data["z"]>2)]
# We dropped 13 outliers
data.shape

In [8]:
# Get list of categorical variables
object_cols = [i for i in data.columns if data[i].dtype == 'object']
print(f"Categorical variables: {object_cols}")

Categorical variables: ['cut', 'color', 'clarity']


#### Why are Categorical Features important?
Machine learning models require all input and output variables to be numeric.

This means that if your data contains categorical data, you must encode it to numbers before you can fit and evaluate a model.

#### We have three categorical variables. Let us have a look at them with violin plots.
##### Violin plots are a method of plotting numeric data and can be considered a combination of the box plot with a kernel density plot. In the violin plot, we can find the same information as in the box plots:
* median (a white dot on the violin plot)
* interquartile range (the black bar in the center of violin)
* the lower/upper adjacent values (the black lines stretched from the bar), which extend to the most extreme observations within the first quartile minus 1.5 IQR and the third quartile plus 1.5 IQR respectively. These limits can be used in a simple outlier detection technique (Tukey’s fences): observations lying outside of these “fences” can be considered outliers.
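
The fence rule in the last bullet can be sketched in a few lines of standard-library Python. The sample values are made up for illustration, and `tukey_fences` is a hypothetical helper name; note that libraries differ slightly in how they interpolate quartiles, so exact fence values may vary between tools.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # made-up data with one extreme value
lower, upper = tukey_fences(sample)

inside = [v for v in sample if lower <= v <= upper]
outliers = [v for v in sample if v < lower or v > upper]
adjacent = (min(inside), max(inside))  # the whisker ends on a box or violin plot

print("fences:", lower, upper)
print("adjacent values:", adjacent)
print("outliers:", outliers)
```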

![Image](https://miro.medium.com/max/520/1*TTMOaNG1o4PgQd-e8LurMg.png)

Probability Density Function:

![Image](https://upload.wikimedia.org/wikipedia/commons/thumb/1/1a/Boxplot_vs_PDF.svg/525px-Boxplot_vs_PDF.svg.png)

#### Label encoding the data to get rid of object dtype.
This approach is very simple: it converts each value in a column to a number. Consider a dataset of bridges with a column named bridge-types that has the values below. Though the dataset has many more columns, to understand label encoding we focus on one categorical column only. We encode the text values by assigning a running sequence number to each distinct value, as shown below:

![Markdown Logo is here.](https://miro.medium.com/max/289/1*VinegxkUYMzik9GpucWCFA.png)
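
The idea can be sketched without any library: collect the distinct values, sort them, and replace each value with its position in that sorted list. `label_encode` is a hypothetical helper written for this illustration; it mirrors the sorted-classes behaviour of scikit-learn's `LabelEncoder` but is not its implementation.

```python
def label_encode(values):
    """Map each distinct value to an integer code, assigned in sorted order."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[value] for value in values], classes


cuts = ["Ideal", "Premium", "Good", "Premium", "Good"]
codes, classes = label_encode(cuts)

print(classes)
print(codes)
```

Note that the codes follow alphabetical order, not quality order: here "Good" becomes 0 and "Ideal" becomes 1 simply because of spelling. For ordered categories such as cut, an explicit mapping in quality order may be a better fit.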


In [9]:
# Make copy to avoid changing original data 
label_data = data.copy()

In [10]:
def encoder_labels(columns: list, dataframe: pd.DataFrame, encoder: LabelEncoder) -> pd.DataFrame:
    for col in columns:
        dataframe[col] = encoder.fit_transform(dataframe[col])
    return dataframe

In [11]:
encoder = LabelEncoder()
label_data = encoder_labels(object_cols, label_data, encoder)

In [13]:
label_data.head(5)



In [None]:
data.describe()

|   | carat | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|
| count | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 | 53940.0 |
| mean | 0.79794 | 61.749405 | 57.457184 | 3932.799722 | 5.731157 | 5.734526 | 3.538734 |
| std | 0.474011 | 1.432621 | 2.234491 | 3989.439738 | 1.121761 | 1.142135 | 0.705699 |
| min | 0.2 | 43.0 | 43.0 | 326.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.4 | 61.0 | 56.0 | 950.0 | 4.71 | 4.72 | 2.91 |
| 50% | 0.7 | 61.8 | 57.0 | 2401.0 | 5.7 | 5.71 | 3.53 |
| 75% | 1.04 | 62.5 | 59.0 | 5324.25 | 6.54 | 6.54 | 4.04 |
| max | 5.01 | 79.0 | 95.0 | 18823.0 | 10.74 | 58.9 | 31.8 |


#### Correlation Matrix:
A correlation matrix is useful for showing the correlation coefficients (or degree of relationship) between variables. The correlation matrix is symmetric, as the correlation between a variable V1 and variable V2 is the same as the correlation between V2 and variable V1. Also, the values on the diagonal are always equal to one, because a variable is always perfectly correlated with itself.
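
The two properties just described (symmetry and a diagonal of ones) can be checked with a small from-scratch sketch of the Pearson coefficient. It uses only the first five rows shown by `data.head(5)`, so the coefficients themselves are not representative of the full dataset; `pearson` is a helper written for this illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# First five rows of three numeric columns, copied from the head of the dataset.
columns = {
    "carat": [0.23, 0.21, 0.23, 0.29, 0.31],
    "depth": [61.5, 59.8, 56.9, 62.4, 63.3],
    "price": [326, 326, 327, 334, 335],
}

# Build the full matrix as a nested dictionary.
corr = {a: {b: pearson(columns[a], columns[b]) for b in columns} for a in columns}

for name, row in corr.items():
    print(name, [round(value, 3) for value in row.values()])
```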

In [None]:
#correlation matrix
corrmat= label_data.corr()
f, ax = plt.subplots(figsize=(12,12))
sns.heatmap(corrmat,annot=True);

#### Points to notice:
* "x", "y" and "z" show a high correlation to the target column.
* "depth", "cut" and "table" show low correlation. We could consider dropping them but let's rather keep them.

### Model Building
#### Steps involved in Model Building

* Setting up features and target
* Build a pipeline of standard scaler and model for five different regressors.
* Fit all the models on training data
* Get mean of cross-validation on the training set for all the models for negative root mean square error
* Pick the model with the best cross-validation score
* Fit the best model on the training set and predict on the test set
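
A note on the "negative root mean square error" step: scikit-learn scorers follow a "greater is better" convention, so error metrics are reported with a flipped sign. Picking the best model therefore means taking the maximum score, which is the one closest to zero. The sketch below uses placeholder model names and made-up scores purely to illustrate the selection logic.

```python
# Made-up mean cross-validation scores (negative RMSE), for illustration only.
cv_scores = {
    "model_a": -1350.0,
    "model_b": -750.0,
    "model_c": -550.0,
}

# The highest score is the least negative one, i.e. the smallest RMSE.
best_name = max(cv_scores, key=cv_scores.get)
best_rmse = -cv_scores[best_name]

print(f"best model: {best_name} (RMSE {best_rmse})")
```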

### Train-Test Split Evaluation 
The procedure involves taking a dataset and dividing it into two subsets. The first subset is used to fit the model and is referred to as the training dataset. The second subset is not used to train the model; instead, the input element of the dataset is provided to the model, then predictions are made and compared to the expected values. This second dataset is referred to as the test dataset.

For this section, we will re-use some datasets that have already been created to illustrate dataset versioning. You can create new datasets with or without a connection by using **vectice.add_dataset()** or **experiment.vectice.add_dataset()** 

In [None]:
# We create our first experiment for splitting the data and specify the workspace and the project we will be working on
# Each experiment contains only one job. Each invocation of the job is called a run.
# auto_code=True enables you to track your git changes for your code automatically every time you execute a run (see below).
experiment = Experiment(job="Split Diamonds Data", project=project_id, job_type=JobType.PREPARATION, auto_code=True)

In [12]:
# The Vectice library automatically detects if there have been changes to the dataset you are using.
# If it detects changes, it will generate a new version of your dataset automatically. 
experiment.add_dataset_version(dataset="Diamonds Cleaned", version_strategy=VersionStrategy.AUTOMATIC)


# If you are using your local environment with GIT installed or JupyterLab etc... the code
# tracking is automated.
## You can also add code versions manually by using: experiment.add_code_version_uri(git_uri) 

# All the artifacts created before starting the experiment run will be attached as inputs of the run
# The created dataset version and code version will be automatically attached as inputs of the run
experiment.start(auto_code=True, check_remote_repository=False)

train, test = train_test_split(label_data, test_size=0.2, random_state = 42)

# Assigning the features as X and the target as y
X = label_data.drop(["price"], axis =1)
y = label_data["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.25, random_state=42)

# Data preparation
def prepare_data():
    """Read and prepare data."""
    df = pd.read_csv(r"gs://vectice-examples-samples/Diamonds/diamonds_cleaned.csv")

    X = df.drop(["price"], axis =1)
    y = df["price"]
    X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.25, random_state=42)

    return X_train, X_test, y_train, y_test

experiment.add_dataset_version(dataset="Diamonds Train Test Split", version_strategy=VersionStrategy.AUTOMATIC)

# When completing an experiment, all the artifacts that have been added after `experiment.start()` are considered outputs by default.
experiment.complete()

### Pipelines
In most machine learning projects the data that you have to work with is unlikely to be in the ideal format for producing the best performing model. There are quite often a number of transformation steps, such as encoding categorical variables, feature scaling and normalisation, that need to be performed. Scikit-learn has built-in functions for most of these commonly used transformations in its preprocessing package.
However, in a typical machine learning workflow you will need to apply all these transformations at least twice. Once when training the model and again on any new data you want to predict on. Of course you could write a function to apply them and reuse that but you would still need to run this first and then call the model separately. Scikit-learn pipelines are a tool to simplify this process. They have several key benefits:
* They make your workflow much easier to read and understand.
* They enforce the implementation and order of steps in your project.
* These in turn make your work much more reproducible.

### StandardScaler Example:
A StandardScaler subtracts the mean and then divides by the standard deviation, which shifts the distribution to have a mean of 0 and a standard deviation of 1.

In [None]:
example = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
scaler = StandardScaler().fit(example)
X_scaled = scaler.transform(example)
print(f"Before: {example[0]} \nAfter: {X_scaled[0]}")

Before: [ 1. -1.  2.] 
After: [ 0.         -1.22474487  1.33630621]
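
The printed values can be reproduced by hand, which makes the formula explicit: each column is scaled independently as (value - column mean) / column standard deviation, using the population standard deviation. The sketch below is a from-scratch check with the standard library, not scikit-learn's implementation; `standard_scale` is a helper name made up for this example.

```python
import statistics


def standard_scale(column):
    """Scale one column to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # population standard deviation (divides by n)
    return [(value - mean) / std for value in column]


example = [[1.0, -1.0, 2.0],
           [2.0, 0.0, 0.0],
           [0.0, 1.0, -1.0]]

# Transpose rows into columns, scale each column, then read back the first row.
columns = list(zip(*example))
scaled_columns = [standard_scale(column) for column in columns]
first_row = [column[0] for column in scaled_columns]

print([round(value, 8) for value in first_row])
```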


### Cross Validation:

Cross-validation (CV) works as follows. A test set should still be held out for final evaluation, but a separate validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches exist, but generally follow the same principles). The following procedure is followed for each of the k “folds”:

- A model is trained using k-1 of the folds as training data;

- the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).

The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. This approach can be computationally expensive, but does not waste too much data (as is the case when fixing an arbitrary validation set), which is a major advantage in problems such as inverse inference where the number of samples is very small.

![Image](https://scikit-learn.org/stable/_images/grid_search_cross_validation.png)
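
The fold bookkeeping can be sketched as a small generator that yields, for each fold, the indices used for training and the indices held out for validation. This is a simplified illustration without shuffling; `k_fold_indices` is a hypothetical helper, and in the notebook the splitting is handled by `cross_val_score`.

```python
def k_fold_indices(n_samples, k):
    """Yield (training_indices, validation_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
for number, (training, validation) in enumerate(folds, start=1):
    print(f"fold {number}: train={training} validate={validation}")
```

Every sample lands in a validation fold exactly once, so each one contributes to both training and evaluation across the k rounds.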

### Models:

1. LinearRegression <a href="https://ml-cheatsheet.readthedocs.io/en/latest/linear_regression.html" target="_blank">more info</a>.
2. DecisionTreeRegressor <a href="https://ml-cheatsheet.readthedocs.io/en/latest/classification_algos.html#decision-trees" target="_blank">more info</a>.
3. RandomForestRegressor <a href="https://www.geeksforgeeks.org/random-forest-regression-in-python/" target="_blank">more info</a>.
4. KNeighborsRegressor <a href="https://ml-cheatsheet.readthedocs.io/en/latest/classification_algos.html#k-nearest-neighbor" target="_blank">more info</a>.
5. XGBRegressor <a href="https://machinelearningmastery.com/xgboost-for-regression/" target="_blank">more info</a>.

In [None]:
from mlflow.tracking import MlflowClient
tracking_uri = "Enter your MLflow tracking URI"
mlflow.set_tracking_uri(tracking_uri)
# Initialise Experiment with MLflow 
experiment = Experiment(job="MLflow-Diamond-Models", project=project_id, lib=MlflowClient(), job_type=JobType.TRAINING, auto_code=True)

In [None]:
import warnings
warnings.filterwarnings("ignore")
logging.basicConfig(level=logging.INFO)
"""Vectice MLflow adapter fluent usage in Python ``with`` syntax."""
X_train, X_test, y_train, y_test = prepare_data()

mlflow.autolog(silent=True)

# Building pipelines of standard scaler and model for the regressors.
pipeline_lr=Pipeline([("scalar1",StandardScaler()),
                 ("Diamonds_Regressor",LinearRegression())])

pipeline_dt=Pipeline([("scalar2",StandardScaler()),
                    ("Diamonds_Regressor",DecisionTreeRegressor())])

pipeline_rf=Pipeline([("scalar3",StandardScaler()),
                    ("Diamonds_Regressor",RandomForestRegressor())])


pipeline_kn=Pipeline([("scalar4",StandardScaler()),
                    ("Diamonds_Regressor",KNeighborsRegressor())])


pipeline_xgb=Pipeline([("scalar5",StandardScaler()),
                    ("Diamonds_Regressor",XGBRegressor())])

# Pipelines list to iterate over
pipelines = [pipeline_lr, pipeline_dt, pipeline_rf, pipeline_kn, pipeline_xgb]

for pipe in pipelines:
    # Experiment name for each pipeline 
    MLFLOW_EXPERIMENT_NAME = "MLflow-Diamond-Models"
    mlflow.set_experiment(MLFLOW_EXPERIMENT_NAME)
    algorithm = pipe.steps[1][1]
    
    # Fit each model 
    pipe.fit(X_train, y_train)
    ## The "with" completes the experiment run automatically
    with experiment.start():
        # Add Dataset Version as an Input for each run.
        experiment.add_dataset_version(dataset="Diamonds Train Test Split", artifact_type=JobArtifactType.INPUT)
        cv_score = cross_val_score(pipe, X_train, y_train,scoring="neg_root_mean_squared_error", cv=10, n_jobs=-1)
        mlflow.log_param('AlgorithmName', algorithm.__class__.__name__)
        mlflow.log_params(algorithm.get_params())
        mlflow.log_param('Scaler', 'StandardScaler')
        mlflow.log_metric("Cross Validation", float(cv_score.mean()))
        # Label each score with the algorithm name so the models can be told apart
        print(f"{algorithm.__class__.__name__}: {cv_score.mean()}")

mlflow.end_run()

We can list the models in the project by using **experiment.list_models()**

In [None]:
experiment.list_models()

We can update an existing model by using **experiment.update_model()**

In [None]:
experiment.update_model("MLflow-Diamond-Models", type=ModelType.REGRESSION)

#### Testing the Model with the best score on the test set
In the scores above, Random Forest appears to achieve the best negative root mean squared error, although your results may differ. Let's evaluate this model on the test set with several metrics.

In [21]:
# Model prediction on test data
pred = pipeline_rf.predict(X_test)

In [22]:
# Model Evaluation
print("R^2:",metrics.r2_score(y_test, pred))
print("Adjusted R^2:",1 - (1-metrics.r2_score(y_test, pred))*(len(y_test)-1)/(len(y_test)-X_test.shape[1]-1))
print("MAE:",metrics.mean_absolute_error(y_test, pred))
print("MSE:",metrics.mean_squared_error(y_test, pred))
print("RMSE:",np.sqrt(metrics.mean_squared_error(y_test, pred)))

R^2: 0.9801545400964425
Adjusted R^2: 0.9801412773698418
MAE: 264.81263809241136
MSE: 301048.2996842575
RMSE: 548.6786852833427
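
For reference, the metrics printed above can be written out from their definitions. The sketch below recomputes them on four made-up values so that each formula is visible, including the adjusted R^2 correction for the number of features. `regression_report` is a hypothetical helper, not a scikit-learn function.

```python
from math import sqrt


def regression_report(y_true, y_pred, n_features):
    """Compute R^2, adjusted R^2, MAE, MSE and RMSE from their definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1 - ss_res / ss_tot
    adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {"R^2": r2, "Adjusted R^2": adjusted_r2,
            "MAE": mae, "MSE": mse, "RMSE": sqrt(mse)}


# Made-up targets and predictions, for illustration only.
report = regression_report([3, 5, 7, 9], [2.5, 5.0, 7.5, 9.0], n_features=1)
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```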


#### End

Congratulations and as Jake Peralta would say:

![Image](https://i.imgur.com/I1wR7mE.gif?noredirect)