# Google Cloud - XGBoost Regression and Bigframes

<img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">

Bigframes (BigQuery DataFrames) lets you work with large datasets in Vertex AI notebooks much as you would with pandas DataFrames. Under the hood the data is stored and processed in BigQuery, in the [Google region](https://cloud.google.com/about/locations) you selected, so you are not limited by the memory of the 'local' runtime you started for the notebook. Keep in mind, though, that processing very large datasets in BigQuery can incur costs.
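One way to keep an eye on those costs is a BigQuery dry run, which reports how many bytes a query would scan without executing it. The sketch below is a minimal illustration, not a pricing tool: the helper names are hypothetical and the price per TiB is a placeholder you have to replace with the current on-demand price for your region.

```python
def estimate_query_cost_usd(bytes_processed: int, usd_per_tib: float) -> float:
    """Convert scanned bytes into an approximate on-demand cost."""
    tib = bytes_processed / 2**40  # 1 TiB = 2**40 bytes
    return tib * usd_per_tib


def dry_run_bytes(sql: str, project: str) -> int:
    """Ask BigQuery how many bytes a query would scan, without running it."""
    # Imported lazily so the cost helper above works without cloud credentials
    from google.cloud import bigquery

    client = bigquery.Client(project=project)
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=job_config)
    return job.total_bytes_processed


# Placeholder price (assumption): look up the real value before relying on it
PLACEHOLDER_USD_PER_TIB = 6.25
print(estimate_query_cost_usd(50 * 2**30, PLACEHOLDER_USD_PER_TIB))
```

In a notebook you would pass the result of `dry_run_bytes(...)` into `estimate_query_cost_usd(...)` before running an expensive query.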

Vertex AI / Colab (now) also offers the option to schedule notebooks directly.

* [Use Python XGBoost and Optuna hyper parameter tuning to build model and deploy with KNIME Python nodes](https://github.com/ml-score/knime_meets_python/blob/main/machine_learning/binary/notebooks/kn_example_python_xgboost_hyper_parameter_optuna.ipynb)
* [Machine Learning Fundamentals with BigQuery DataFrames](https://github.com/googleapis/python-bigquery-dataframes/blob/main/notebooks/getting_started/ml_fundamentals_bq_dataframes.ipynb)
* [BigQuery DataFrames: Your Gateway to Scalable Data Analysis and ML in the Cloud](https://medium.com/technoesis/bigquery-dataframes-your-gateway-to-scalable-data-analysis-and-ml-in-the-cloud-73c2d2466549)
* [End-to-end user journey for each model](https://cloud.google.com/bigquery/docs/e2e-journey)

---

by Markus Lauber (https://medium.com/@mlxl)

https://yam-united.telekom.com/profile/markus-lauber/



---
#### Google GitHub repository with a large code base for trainings
https://github.com/GoogleCloudPlatform/training-data-analyst

https://github.com/GoogleCloudPlatform/training-data-analyst/tree/master/self-paced-labs/vertex-ai

### MEDIUM - more articles to consider


[Getting Started with BigQuery ML: A Practical Tutorial for Beginners](https://medium.com/@dipan.saha/getting-started-with-bigquery-ml-a-practical-tutorial-for-beginners-9653329d2cc4)


[How to use advance feature engineering to preprocess data in BigQuery ML](https://cloud.google.com/blog/products/data-analytics/preprocess-data-use-bigquery-ml)

In [55]:
# Prepare the environment and the packages
from google.colab import auth
auth.authenticate_user()
project_id = 'de123456-user-prd-1'
dataset_id = 'xgb_regression_project'
region_id = 'europe-west3' #  https://cloud.google.com/bigquery/docs/locations#supported_locations

# https://cloud.google.com/about/locations

from google.cloud import bigquery
import pandas as pd
from pandas_gbq import to_gbq

import bigframes.pandas as bpd

# Initialize the BigQuery client
client = bigquery.Client(project=project_id)



In [56]:
from google.cloud import aiplatform
import joblib

In [58]:
from bigframes.ml.model_selection import train_test_split

#### Class XGBRegressor

https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.ml.ensemble.XGBRegressor

In [57]:
# https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.ml.ensemble.XGBRegressor

from bigframes.ml.ensemble import XGBRegressor
# import xgboost as xgb

In [59]:
# Note: The project option is not required in all environments.
# On BigQuery Studio, the project ID is automatically detected.
bpd.options.bigquery.project = project_id

# Note: The location option is not required.
# It defaults to the location of the first table or query
# passed to read_gbq(). For APIs where a location can't be
# auto-detected, the location defaults to the "US" location.
bpd.options.bigquery.location = region_id

In [None]:
### GCP - XGBoost and Bigframes

In [60]:
# Define your source and destination tables
train_table = 'regression_train'
train_table_id = f"{project_id}.{dataset_id}.{train_table}"
train_table_id_new = f"{project_id}.{dataset_id}.{train_table}_new"
test_table = 'regression_test'
test_table_id = f"{project_id}.{dataset_id}.{test_table}"
test_table_id_new = f"{project_id}.{dataset_id}.{test_table}_new"

print("Train (train_table_id): ", train_table_id, " - Test (test_table_id): ", test_table_id)
print("Train NEW (train_table_id_new): ", train_table_id_new, " - Test NEW (test_table_id_new): ", test_table_id_new)

Train (train_table_id):  de123456-user-prd-1.xgb_regression_project.regression_train  - Test (test_table_id):  de123456-user-prd-1.xgb_regression_project.regression_test
Train NEW (train_table_id_new):  de123456-user-prd-1.xgb_regression_project.regression_train_new  - Test NEW (test_table_id_new):  de123456-user-prd-1.xgb_regression_project.regression_test_new


In [61]:
# SQL query to get the first 10 rows as a sample file to see the structure
query = f"""
SELECT *
FROM `{train_table_id}`
LIMIT 10
"""

# Execute the query and load results into a DataFrame
query_job = client.query(query)  # Run the query
df = query_job.to_dataframe()  # Convert the results into a pandas DataFrame


# Convert to a BigQuery DataFrame (bpd = bigframes.pandas)
# data_test_bpd = bpd.DataFrame(data_test)

In [62]:
excluded_features = ['row_id']
label = ['Target']

# features = [feat for feat in data.columns if feat not in excluded_features and not feat==label]
df_features = [feat for feat in df.columns if feat not in excluded_features and feat not in label]

df_num_cols = df[df_features].select_dtypes(include='number').columns.tolist()
df_cat_cols = df[df_features].select_dtypes(exclude='number').columns.tolist()

df_rest_cols = [feat for feat in df.columns if feat not in df_cat_cols and feat not in df_num_cols]

print(f'''{"df shape:":20} {df.shape}
{"df[features] shape:":20} {df[df_features].shape}
categorical columns: {df_cat_cols}
numerical columns: {df_num_cols}
feature columns: {df_features}
rest columns: {df_rest_cols}''')

# THX David Gutmann

df shape:            (10, 81)
df[features] shape:  (10, 79)
categorical columns: ['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature', 'SaleType', 'SaleCondition']
numerical columns: ['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'v_1stFlrSF', 'v_2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotR

In [63]:
# Format columns lists as strings for the query
df_num_cols_str = ', '.join([f"'{col}'" for col in df_num_cols])
df_cat_cols_str = ', '.join([f"'{col}'" for col in df_cat_cols])
df_target_str = ', '.join([f"'{col}'" for col in label])

print("df_num_cols_str: ", df_num_cols_str)
print("df_cat_cols_str: ", df_cat_cols_str)
print("df_target_str: ", df_target_str)

df_num_cols_str:  'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'v_1stFlrSF', 'v_2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', 'v_3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold'
df_cat_cols_str:  'MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', '
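Note that names wrapped in single quotes are string literals in BigQuery SQL; if a column list is meant to go into a `SELECT` clause, the names need backticks instead. A minimal sketch of that variant, where `quote_identifier`, `build_select` and the table ID are hypothetical and used only for illustration:

```python
def quote_identifier(name: str) -> str:
    """Wrap a column name in backticks so BigQuery treats it as an identifier."""
    if "`" in name:
        raise ValueError(f"unsupported column name: {name!r}")
    return f"`{name}`"


def build_select(columns, table_id, limit=None):
    """Build a SELECT statement for the given columns of a table."""
    column_list = ", ".join(quote_identifier(col) for col in columns)
    sql = f"SELECT {column_list}\nFROM `{table_id}`"
    if limit is not None:
        sql += f"\nLIMIT {int(limit)}"
    return sql


# Hypothetical table ID, only used to show the generated SQL
example_sql = build_select(
    ["LotArea", "MSZoning"], "my-project.my_dataset.regression_train", limit=10
)
print(example_sql)
```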

In [64]:
# Define the SQL query to create or replace a table with the converted values for the Target variable
query = f"""
CREATE OR REPLACE TABLE `{train_table_id_new}` AS
  SELECT  SAFE_CAST(Target AS INT64) AS Target_int
        , *
FROM `{train_table_id}`
"""

# Run the query and wait for it to finish, so the new table exists before it is read
query_job = client.query(query)
query_job.result()

In [65]:
# Define the SQL query to create or replace a table with the converted values
query = f"""
CREATE OR REPLACE TABLE `{test_table_id_new}` AS
  SELECT  SAFE_CAST(Target AS INT64) AS Target_int
        , *
FROM `{test_table_id}`
"""

# Run the query and wait for it to finish, so the new table exists before it is read
query_job = client.query(query)
query_job.result()
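`SAFE_CAST` returns `NULL` instead of raising an error when a value cannot be converted, so a failed conversion of the target goes unnoticed unless you look for it. A minimal sketch of such a check, assuming the `bigquery.Client` created earlier; the helper names and the example table ID are hypothetical:

```python
def failed_cast_check_sql(table_id: str,
                          source_col: str = "Target",
                          cast_col: str = "Target_int") -> str:
    """Build a query counting rows where the cast column is NULL but the source is not."""
    return (
        "SELECT\n"
        "  COUNT(*) AS total_rows,\n"
        f"  COUNTIF({cast_col} IS NULL AND {source_col} IS NOT NULL) AS failed_casts\n"
        f"FROM `{table_id}`"
    )


def count_failed_casts(client, table_id: str) -> int:
    """Run the check with an existing bigquery.Client and return the failure count."""
    rows = client.query(failed_cast_check_sql(table_id)).result()
    return next(iter(rows))["failed_casts"]


# Hypothetical table ID, only used to show the generated SQL
print(failed_cast_check_sql("my-project.my_dataset.regression_train_new"))
```

Calling `count_failed_casts(client, train_table_id_new)` after the table has been created should return 0 if every target value was converted.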

In [68]:
# del data

In [70]:
# load the data from BigQuery into a (temporary) Bigframes structure like a Pandas dataframe
data = bpd.read_gbq(train_table_id_new)

# BigQuery DataFrames creates a default numbered index, which we can give a name
data.index.name = "train_id"
data.head()


train_id,Target_int,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,Target,row_id
0,166000,20,RL,80,11900,Pave,,IR1,Lvl,AllPub,...,,,,0,6,2008,WD,Normal,166000,Row830
1,582933,60,RL,107,13891,Pave,,Reg,Lvl,AllPub,...,,,,0,1,2009,New,Partial,582933,Row803
2,130000,70,RH,55,8525,Pave,,Reg,Bnk,AllPub,...,,,,0,11,2008,WD,Abnorml,130000,Row1234
3,385000,20,RL,68,50271,Pave,,IR1,Low,AllPub,...,,,,0,11,2006,WD,Normal,385000,Row53
4,320000,60,RL,134,19378,Pave,,IR1,HLS,AllPub,...,,,,0,3,2006,New,Partial,320000,Row159


[Bigframes Functions](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.dataframe.DataFrame) - Overview

In [71]:
# Make the integer column Target_int the actual Target: drop the original Target column, then rename Target_int

data = data.drop(['Target'], axis=1).rename(columns={"Target_int": "Target"})
# data = data.rename(columns={"Target_int": "Target"})
data.head()

train_id,Target,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,row_id
0,166000,20,RL,80,11900,Pave,,IR1,Lvl,AllPub,...,0,,,,0,6,2008,WD,Normal,Row830
1,582933,60,RL,107,13891,Pave,,Reg,Lvl,AllPub,...,0,,,,0,1,2009,New,Partial,Row803
2,130000,70,RH,55,8525,Pave,,Reg,Bnk,AllPub,...,0,,,,0,11,2008,WD,Abnorml,Row1234
3,385000,20,RL,68,50271,Pave,,IR1,Low,AllPub,...,0,,,,0,11,2006,WD,Normal,Row53
4,320000,60,RL,134,19378,Pave,,IR1,HLS,AllPub,...,0,,,,0,3,2006,New,Partial,Row159


In [72]:
type(data)

bigframes.dataframe.DataFrame

In [75]:
# del data_test

In [76]:
# del data_test
data_test = bpd.read_gbq(test_table_id_new)

# BigQuery DataFrames creates a default numbered index, which we can give a name
data_test.index.name = "test_id"
data_test.head()


test_id,Target_int,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,Target,row_id
0,174000,20,RL,80.0,8400,Pave,,Reg,Lvl,AllPub,...,,GdPrv,,0,7,2008,COD,Abnorml,174000,Row1435
1,225000,50,RL,81.0,15593,Pave,,Reg,Lvl,AllPub,...,,,,0,7,2006,WD,Normal,225000,Row69
2,119200,20,RL,60.0,11664,Pave,,Reg,Lvl,AllPub,...,,,,0,11,2007,WD,Normal,119200,Row1014
3,150900,90,RL,55.0,12640,Pave,,IR1,Lvl,AllPub,...,,,,0,7,2006,WD,Normal,150900,Row940
4,161500,50,RL,,11250,Pave,,Reg,Lvl,AllPub,...,,,,0,11,2009,WD,Normal,161500,Row1262


In [77]:
data_test = data_test.drop(['Target'], axis=1).rename(columns={"Target_int": "Target"})
# data_test = data_test.rename(columns={"Target_int": "Target"})
data_test.head()

test_id,Target,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,row_id
0,174000,20,RL,80.0,8400,Pave,,Reg,Lvl,AllPub,...,0,,GdPrv,,0,7,2008,COD,Abnorml,Row1435
1,225000,50,RL,81.0,15593,Pave,,Reg,Lvl,AllPub,...,0,,,,0,7,2006,WD,Normal,Row69
2,119200,20,RL,60.0,11664,Pave,,Reg,Lvl,AllPub,...,0,,,,0,11,2007,WD,Normal,Row1014
3,150900,90,RL,55.0,12640,Pave,,IR1,Lvl,AllPub,...,0,,,,0,7,2006,WD,Normal,Row940
4,161500,50,RL,,11250,Pave,,Reg,Lvl,AllPub,...,0,,,,0,11,2009,WD,Normal,Row1262


In [78]:
excluded_features = ['row_id']
label = ['Target']
# features = [feat for feat in data.columns if feat not in excluded_features and not feat==label]
features = [feat for feat in data.columns if feat not in excluded_features and feat not in label]

num_cols = data[features].select_dtypes(include='number').columns.tolist()
cat_cols = data[features].select_dtypes(exclude='number').columns.tolist()

rest_cols = [feat for feat in data.columns if feat not in cat_cols and feat not in num_cols]

print(f'''{"data shape:":20} {data.shape}
{"data[features] shape:":20} {data[features].shape}
categorical columns: {cat_cols}
numerical columns: {num_cols}
feature columns: {features}
rest columns: {rest_cols}''')

# THX David Gutmann

data shape:          (1183, 81)
data[features] shape: (1183, 79)
categorical columns: ['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature', 'SaleType', 'SaleCondition']
numerical columns: ['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'Tot

In [None]:
print(data.dtypes)

In [None]:
# data[cat_cols] = data[cat_cols].astype('category')

In [None]:
# split training data into X and y
X = data[features]
y = data[label]

In [None]:
# split data into train and test sets
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)

In [None]:
type(X_train)

In [None]:
# Show the shape of the data after the split
print(f"""X_train shape: {X_train.shape}
X_test shape: {X_test.shape}
y_train shape: {y_train.shape}
y_test shape: {y_test.shape}""")
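To make concrete what `test_size` and `random_state` control, here is a plain-Python sketch of a seeded random split over row positions. It only illustrates the idea and is not how bigframes implements `train_test_split`; `split_indices` is a hypothetical helper, and the sizes bigframes returns may differ slightly.

```python
import random


def split_indices(n_rows: int, test_size: float, seed: int):
    """Shuffle row positions with a fixed seed and cut off the test share."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # same seed, same order
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


# Same row count, test share and seed as in the cell above
train_idx, test_idx = split_indices(1183, 0.33, seed=7)
print(len(train_idx), len(test_idx))
```

Because the shuffle is seeded, running the split again with the same seed reproduces the same partition, which is what makes results comparable between runs.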

In [None]:
# D_train = xgb.DMatrix(X_train, label=y_train, enable_categorical=True)
# D_test = xgb.DMatrix(X_test, label=y_test, enable_categorical=True)


Example of how to customize the model's hyperparameters:

https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.ml.ensemble.XGBRegressor

```Python
model = XGBRegressor(
    max_iterations=500,          # Maximum number of boosting rounds (in bigframes, n_estimators is the number of parallel trees per round; see the reference above)
    learning_rate=0.05,          # Lower learning rate
    max_depth=8,                 # Maximum depth of trees
    subsample=0.8,               # Subsample ratio
    colsample_bytree=0.8,        # Column subsample ratio by tree
    min_tree_child_weight=1,     # Minimum child weight (bigframes name for XGBoost's min_child_weight; see the reference above)
    gamma=0,                     # Minimum loss reduction
    reg_alpha=0.01,              # L1 regularization term on weights
    reg_lambda=1                 # L2 regularization term on weights
)
```




In [None]:
# Using the XGBRegressor from the bigframes.ml package
# from bigframes.ml.ensemble import XGBClassifier

model = XGBRegressor()
# Here we pass the feature columns without transforms - BQML will then use
# automatic preprocessing to encode these columns
model.fit(X_train, y_train)

In [None]:
import bigframes.ml.metrics

In [None]:
# evaluate the best model on the test data

y_pred = model.predict(X_test)

In [None]:
type(y_pred)

In [None]:
y_pred.head()

In [None]:
# Join the predictions with the true labels on their shared index
y_pred = y_pred.join(y_test, how='inner')

# The inner join keeps only rows whose index appears in both y_pred and y_test


In [None]:
y_pred.head()

In [None]:
# Compare the actual and the predicted value for the first row
# (a regression model returns its prediction in the column predicted_Target)
print(y_pred['Target'].iloc[0], " ", y_pred['predicted_Target'].iloc[0])


In [None]:
# Only relevant for classifiers: extract the probability for label=1
# y_pred['prob_1'] = y_pred['predicted_Target_probs'].apply(lambda x: [item['prob'] for item in x if item['label'] == 1][0])

https://cloud.google.com/bigquery/docs/clustered-tables

table_id = df.to_gbq(clustering_columns=("index", "int_col"))

In [None]:
# write results back to your BigQuery project

v_target_table = f"{project_id}.{dataset_id}.census_predicted"
y_pred.to_gbq(destination_table=v_target_table,  if_exists='replace')

In [None]:
# Define the SQL query to create or replace a table that adds the prediction error.
# A regression model returns one numeric column (predicted_Target), so nothing has to be unnested.

query = f"""
CREATE OR REPLACE TABLE `{v_target_table}_new` AS
SELECT
  predicted_Target - Target AS residual,
  *
FROM `{v_target_table}`
"""

# Run the query and wait for it to finish
query_job = client.query(query)
query_job.result()

In [None]:
if 'y_score' in globals():
    del y_score

In [None]:
# Specify your SQL query to select only the desired columns
sql_query = f"""
SELECT train_id,
       Target,
       predicted_Target
FROM `{v_target_table}_new`
"""

# Use the SQL query to load data
y_score = bpd.read_gbq(sql_query)

In [None]:
y_score.head()

In [None]:
# R² is a regression metric; AUC only applies to classifiers
r2 = bigframes.ml.metrics.r2_score(y_score['Target'], y_score['predicted_Target'])
print("R2: ", r2)
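For a numeric target such as a sale price, regression metrics like RMSE and R² are the usual way to judge predictions. The sketch below spells out both formulas in plain Python so it is clear what such scores measure; the numbers are made-up toy values, not results from this notebook, and the function names are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: typical size of the prediction error."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values for illustration only
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 310.0]
print("RMSE:", rmse(actual, predicted))
print("R2:", r2(actual, predicted))
```

RMSE is expressed in the unit of the target, while R² is unitless and compares the model against always predicting the mean.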

In [None]:
# Initialize Vertex AI
aiplatform.init(project=project_id, location=region_id)

In [None]:
# Save the model to your BigQuery dataset (it shows up under Models)
model_name = 'model_xgboost_01'
model.to_gbq(f"{dataset_id}.{model_name}", replace=True)

In [None]:
model_load = bpd.read_gbq_model(f"{dataset_id}.{model_name}")

In [None]:
type(model_load)

### Apply the stored model to a (new) dataset

In [None]:
data_test = data_test.rename(columns={"native-country": "native_country"})
# data_test = data_test.rename(columns={"Target_int": "Target"})
data_test.head()

In [None]:
print(data_test.dtypes)

In [None]:
# Predict the outcome with (new) data

test_pred = model_load.predict(data_test)

In [None]:
# For a regression model the prediction is a plain numeric column (predicted_Target), so nothing has to be untangled

test_pred.head()

## Apply the model via SQL Code in BigQuery

You can also use the model directly in BigQuery with SQL, provided the input table has the same feature columns that were used to train the model.

```SQL
SELECT
  predicted_Target,
  Target
FROM
  ML.PREDICT(MODEL `de123456-user-prd-1.xgb_regression_project.model_xgboost_01`,
    (SELECT
      * EXCEPT (Target, Target_int),
      Target_int AS Target
     FROM
     `de123456-user-prd-1.xgb_regression_project.regression_test_new`))
```
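The same kind of statement can be issued from the notebook. A minimal sketch, assuming the `bigquery.Client` created at the top of the notebook; `ml_predict_sql`, `predict_to_dataframe` and the IDs in the example call are hypothetical names used for illustration only:

```python
def ml_predict_sql(model_id: str, table_id: str) -> str:
    """Build an ML.PREDICT statement for a stored BigQuery ML model."""
    return (
        "SELECT *\n"
        f"FROM ML.PREDICT(MODEL `{model_id}`,\n"
        f"  (SELECT * FROM `{table_id}`))"
    )


def predict_to_dataframe(client, model_id: str, table_id: str):
    """Run the prediction with an existing bigquery.Client and return a pandas DataFrame."""
    return client.query(ml_predict_sql(model_id, table_id)).to_dataframe()


# Hypothetical IDs, only used to show the generated SQL
print(ml_predict_sql("my-project.my_dataset.model_xgboost_01",
                     "my-project.my_dataset.regression_test_new"))
```

Building the statement in a small function keeps the model and table IDs in one place, so the same call can be reused for other input tables.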

