# Overview
In this section you'll train the same model using a Vertex AI Managed Notebook.

## Setup
First, install the required packages in the notebook environment, then import the libraries.

In [None]:
! pip3 install --upgrade google-cloud-aiplatform google-cloud-bigquery google-cloud-bigquery-storage pyarrow --user
! pip3 install category_encoders fsspec gcsfs db-dtypes --user
! pip3 install scikit-learn==1.0 --user

Now restart the kernel by running the cell below. You only need to do this once, after the packages are installed.

In [None]:
# Automatically restart kernel after installs
import os

if not os.getenv("IS_TESTING"):
    # Skip the restart when IS_TESTING is set (e.g. automated test runs)
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)
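
After the kernel restarts, it can be useful to confirm which package versions were actually picked up, since the prebuilt container used later in this section targets scikit-learn 1.0. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the package list is taken from the pip commands above.

```python
from importlib import metadata

# Distribution names taken from the pip commands above.
REQUIRED = [
    "google-cloud-aiplatform",
    "google-cloud-bigquery",
    "category_encoders",
    "scikit-learn",
]


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

If scikit-learn reports something other than 1.0, rerun the install cell and restart the kernel again before training.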

Import the libraries.

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_curve, roc_auc_score, classification_report, accuracy_score, confusion_matrix 
from sklearn.preprocessing import OneHotEncoder
from category_encoders import BinaryEncoder
import os
from google.cloud import bigquery
import joblib
import logging
import collections
import tempfile
import time
from json import dumps
from google.cloud import aiplatform as vertex_ai

Next, define some variables. Change them to match your own project, bucket, and dataset.

In [None]:
# variables
REGION = "us-central1"
PROJECT_ID = "leedeb-experimentation"
BUCKET_URI = "gs://bq-logit-regression-demo"
MODELS_URI = f"{BUCKET_URI}/models/"
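
When building object paths from these variables, a trailing slash on one piece and a leading slash on the next produce `//`, which Cloud Storage keeps as part of the object name rather than collapsing it. The sketch below shows one way to join segments safely; `gcs_join` is a hypothetical helper, not part of any Google library.

```python
def gcs_join(base_uri, *parts):
    """Join path segments onto a gs:// URI with exactly one '/' between them."""
    if not base_uri.startswith("gs://"):
        raise ValueError(f"expected a gs:// URI, got {base_uri!r}")
    segments = [base_uri.rstrip("/")]
    # Drop empty segments and stray slashes so no '//' appears in the path.
    segments += [p.strip("/") for p in parts if p.strip("/")]
    return "/".join(segments)


BUCKET_URI = "gs://bq-logit-regression-demo"
MODELS_URI = gcs_join(BUCKET_URI, "models")
print(gcs_join(MODELS_URI, "model.joblib"))
# gs://bq-logit-regression-demo/models/model.joblib
```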

Within Vertex AI notebooks, the BigQuery magic commands let you read query results into dataframes using basic SELECT statements.

You can read more about the BigQuery magic commands here: https://cloud.google.com/bigquery/docs/visualize-jupyter

In [2]:
%bigquery_stats leedeb-experimentation.bq_databricks_vertex.training_data

Getting table schema...: 100%|██████████| 100/100 [00:01<00:00, 55.45it/s]
Querying data in column 'pageviews': : 5it [00:05,  1.06s/it]
Retrieve stats for 'pageviews': 100%|██████████| 5/5 [00:15<00:00,  3.20s/it]


In [4]:
%%bigquery train_data
SELECT
*
FROM
  `leedeb-experimentation.bq_databricks_vertex.training_data`

Query complete after 0.00s: 100%|██████████| 1/1 [00:00<00:00, 829.08query/s] 
Downloading: 100%|██████████| 829285/829285 [00:01<00:00, 738253.22rows/s] 
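
Outside a notebook, or where the magics are not loaded, the same read can be done with the BigQuery Python client. This is a rough sketch rather than the method used in this section; `select_all_sql` and `load_table` are hypothetical helper names, and running `load_table` assumes `google-cloud-bigquery` is installed and credentials are configured.

```python
def select_all_sql(table_id):
    """Build the same SELECT * statement used in the %%bigquery cell."""
    return f"SELECT * FROM `{table_id}`"


def load_table(table_id, project_id):
    """Run the query and return a pandas DataFrame (needs GCP credentials)."""
    # Imported lazily so select_all_sql works without the client installed.
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    return client.query(select_all_sql(table_id)).to_dataframe()


# Example call, using the project and table from this section:
# train_data = load_table(
#     "leedeb-experimentation.bq_databricks_vertex.training_data",
#     "leedeb-experimentation",
# )
```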


In [None]:
train_data.head() 

## Train the model


In [None]:
# separate the features from the label
X = train_data.drop(columns='label')
y = train_data.label.values.astype('bool')

# use binary encoding to encode two categorical features
enc = BinaryEncoder().fit(X)

# transform the dataset
numeric_dataset = enc.transform(X)

# split train and test; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(numeric_dataset, y, random_state=42)

# define model
scaler = StandardScaler()
lr = LogisticRegression()
model = Pipeline([('standardize', scaler),
                  ('log_reg', lr)])

# fit the model
model.fit(X_train, y_train)
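
To make the binary encoding step above more concrete, here is a simplified illustration of the idea: each category gets an integer code, and the code is written out as binary digits, one column per bit. This is a sketch of the general technique, not the `category_encoders` implementation, and the `channels` values are made up for the example.

```python
def binary_encode(values):
    """Map each category to an integer code, then to a list of 0/1 bits."""
    # Assign codes 1, 2, 3, ... in order of first appearance.
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes) + 1)
    # Number of bits needed to represent the largest code.
    width = max(codes.values()).bit_length()
    return [[int(b) for b in format(codes[v], f"0{width}b")] for v in values]


# Hypothetical categorical column, for illustration only.
channels = ["organic", "paid", "email", "organic", "social"]
for channel, bits in zip(channels, binary_encode(channels)):
    print(channel, bits)
```

Four distinct categories fit in three bit columns here, whereas one-hot encoding would need four; the gap widens as the number of categories grows.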

## Evaluate the model

In [None]:
# evaluate the model - training score
y_train_hat = model.predict(X_train)
y_train_hat_probs = model.predict_proba(X_train)[:,1]

train_accuracy = accuracy_score(y_train, y_train_hat)*100
train_auc_roc = roc_auc_score(y_train, y_train_hat_probs)*100

print('Confusion matrix:\n', confusion_matrix(y_train, y_train_hat))

print('Training AUC: %.4f %%' % train_auc_roc)

print('Training accuracy: %.4f %%' % train_accuracy)

In [None]:
# evaluate the model - testing score
y_test_hat = model.predict(X_test)
y_test_hat_probs = model.predict_proba(X_test)[:,1]

test_accuracy = accuracy_score(y_test, y_test_hat)*100
test_auc_roc = roc_auc_score(y_test, y_test_hat_probs)*100

print('Confusion matrix:\n', confusion_matrix(y_test, y_test_hat))

print('Testing AUC: %.4f %%' % test_auc_roc)

print('Testing accuracy: %.4f %%' % test_accuracy)

In [None]:
# check precision and recall
print(classification_report(y_test, y_test_hat, digits=6))
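
The accuracy, precision, and recall reported above can all be derived from the confusion matrix. The sketch below assumes the binary layout that scikit-learn's `confusion_matrix` returns, `[[tn, fp], [fn, tp]]`; the helper name and the example counts are made up for illustration and are not results from this model.

```python
def metrics_from_confusion(matrix):
    """Compute accuracy, precision, and recall from [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        # Guard against division by zero when a class is never predicted/present.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up counts for illustration only.
example = [[90, 10], [5, 45]]
print(metrics_from_confusion(example))
```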

Save the model artifact to Cloud Storage, in the bucket you defined earlier.

In [None]:
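# serialize the fitted pipeline to a local file
joblib.dump(model, 'model.joblib')

# MODELS_URI already ends with "/", so no extra slash is needed
!gsutil cp ./model.joblib {MODELS_URI}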

## Register the model
Now, go to the Vertex AI Model Registry.

Since we didn't use the API to log this model, register it with the scikit-learn 1.0 prebuilt container:

* Click Import
* Import as new model
![](./import1.png)
* Settings: you can keep the default settings. There is no need to add a prediction schema, as it is optional.
![](./import2.png)
You can also configure explainability, but it must be specified when you import the model.

Once your model is imported, you should see it in the model registry. Now, you can use it for either batch or online prediction. 
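
As an alternative to the console steps, the import can also be done with the Vertex AI SDK. The following is a sketch under stated assumptions, not the procedure this section followed: the helper names are hypothetical, the display name in the example call is made up, and the image URI is assumed to be the prebuilt scikit-learn 1.0 CPU prediction container (check the current list of prebuilt containers before relying on it). It also assumes the artifact folder contains a file named `model.joblib`.

```python
# Assumed prebuilt scikit-learn 1.0 CPU prediction image; verify before use.
SKLEARN_IMAGE = "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest"


def upload_kwargs(display_name, artifact_uri):
    """Collect arguments for Model.upload; artifact_uri is the folder holding model.joblib."""
    return {
        "display_name": display_name,
        "artifact_uri": artifact_uri.rstrip("/"),
        "serving_container_image_uri": SKLEARN_IMAGE,
    }


def register_model(project_id, region, display_name, artifact_uri):
    """Upload the saved artifact to the Vertex AI Model Registry."""
    # Imported lazily; requires google-cloud-aiplatform and GCP credentials.
    from google.cloud import aiplatform as vertex_ai

    vertex_ai.init(project=project_id, location=region)
    return vertex_ai.Model.upload(**upload_kwargs(display_name, artifact_uri))


# Example call with the variables defined earlier (display name is hypothetical):
# model_resource = register_model(PROJECT_ID, REGION, "bq-logit-regression", MODELS_URI)
```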