## Google Cloud - XGBoost Binary Classification and Bigframes 

<img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">

by Markus Lauber (https://medium.com/@mlxl)


Bigframes (BigQuery DataFrames) lets you work with large datasets in Vertex AI notebooks much as you would with pandas DataFrames. Under the hood, the data is stored in BigQuery in the [Google region](https://cloud.google.com/about/locations) you selected, so you are not limited by the memory of the 'local' engine you fired up to run the notebook. Keep in mind, though, that processing very large datasets can incur costs.

Vertex AI / Colab now also lets you schedule notebooks directly.

* [Use Python XGBoost and Optuna hyper parameter tuning to build model and deploy with KNIME Python nodes](https://github.com/ml-score/knime_meets_python/blob/main/machine_learning/binary/notebooks/kn_example_python_xgboost_hyper_parameter_optuna.ipynb)
* [Machine Learning Fundamentals with BigQuery DataFrames](https://github.com/googleapis/python-bigquery-dataframes/blob/main/notebooks/getting_started/ml_fundamentals_bq_dataframes.ipynb)
* [BigQuery DataFrames: Your Gateway to Scalable Data Analysis and ML in the Cloud](https://medium.com/technoesis/bigquery-dataframes-your-gateway-to-scalable-data-analysis-and-ml-in-the-cloud-73c2d2466549)
* [End-to-end user journey for each model](https://cloud.google.com/bigquery/docs/e2e-journey)

---
#### Google GitHub repository with a large code base for trainings
https://github.com/GoogleCloudPlatform/training-data-analyst

https://github.com/GoogleCloudPlatform/training-data-analyst/tree/master/self-paced-labs/vertex-ai

### More articles to consider (Medium and Google Cloud Blog)

[Getting Started with BigQuery ML: A Practical Tutorial for Beginners](https://medium.com/@dipan.saha/getting-started-with-bigquery-ml-a-practical-tutorial-for-beginners-9653329d2cc4)


[How to use advance feature engineering to preprocess data in BigQuery ML](https://cloud.google.com/blog/products/data-analytics/preprocess-data-use-bigquery-ml)



---


### Helpful Links

##### by Google

*   [bigframes.pandas](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.dataframe.DataFrame) provides a pandas-compatible API for analytics.
*   [bigframes.ml](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.ml.cluster) provides a scikit-learn-like API for ML.
*   [bigframes.ml.llm](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.ml.llm) provides LLM capabilities.


* [BigFrames API Reference](https://cloud.google.com/python/docs/reference/bigframes/latest)


* [BigFrames GitHub page with sample notebooks](https://github.com/googleapis/python-bigquery-dataframes)

* [Troubleshooting notebook runtimes](https://cloud.google.com/colab/docs/troubleshooting)


In [1]:
# Prepare the environment and the packages
from google.colab import auth
auth.authenticate_user()
project_id = 'de123456-user-prd-1'
dataset_id = 'xgb_classification_project'
region_id = 'europe-west3' #  https://cloud.google.com/bigquery/docs/locations#supported_locations

# https://cloud.google.com/about/locations

from google.cloud import bigquery
import pandas as pd
from pandas_gbq import to_gbq

import bigframes.pandas as bpd

# Initialize the BigQuery client
client = bigquery.Client(project=project_id)



In [2]:
from google.cloud import aiplatform
import joblib

In [3]:
from bigframes.ml.model_selection import train_test_split

In [4]:
from bigframes.ml.ensemble import XGBClassifier
# import xgboost as xgb

In [5]:
# Note: The project option is not required in all environments.
# On BigQuery Studio, the project ID is automatically detected.
bpd.options.bigquery.project = project_id

# Note: The location option is not required.
# It defaults to the location of the first table or query
# passed to read_gbq(). For APIs where a location can't be
# auto-detected, the location defaults to the "US" location.
bpd.options.bigquery.location = region_id

In [None]:
### GCP - XGBoost and Bigframes

In [6]:
# Define your source and destination tables
train_table = 'census_train'
train_table_id = f"{project_id}.{dataset_id}.{train_table}"
train_table_id_new = f"{project_id}.{dataset_id}.{train_table}_new"
test_table = 'census_test'
test_table_id = f"{project_id}.{dataset_id}.{test_table}"
test_table_id_new = f"{project_id}.{dataset_id}.{test_table}_new"

print("Train (train_table_id): ", train_table_id, " - Test (test_table_id): ", test_table_id)
print("Train NEW (train_table_id_new): ", train_table_id_new, " - Test NEW (test_table_id_new): ", test_table_id_new)

Train (train_table_id):  de123456-user-prd-1.xgb_classification_project.census_train  - Test (test_table_id):  de123456-user-prd-1.xgb_classification_project.census_test
Train NEW (train_table_id_new):  de123456-user-prd-1.xgb_classification_project.census_train_new  - Test NEW (test_table_id_new):  de123456-user-prd-1.xgb_classification_project.census_test_new
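
The table IDs above are assembled with plain f-strings. As a minimal sketch, the same step could be wrapped in a small helper that also rejects characters which would break the backtick-quoted SQL used later. The helper name `qualified_table_id` and its deliberately strict character check are assumptions for illustration; they are not part of the notebook or of any Google library.

```python
import re

# Hypothetical helper (not part of the notebook). The allowed character set is
# a deliberately strict assumption, not BigQuery's full naming rule.
_ALLOWED_PART = re.compile(r"^[A-Za-z0-9_\-]+$")


def qualified_table_id(project: str, dataset: str, table: str) -> str:
    """Join project, dataset and table into 'project.dataset.table'."""
    for name, value in (("project", project), ("dataset", dataset), ("table", table)):
        if not _ALLOWED_PART.match(value):
            raise ValueError(f"unexpected characters in {name}: {value!r}")
    return f"{project}.{dataset}.{table}"


train_id = qualified_table_id(
    "de123456-user-prd-1", "xgb_classification_project", "census_train"
)
print(train_id)
```

A check like this mainly guards against a stray backtick or space ending up inside a query string.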


In [7]:
# SQL query to get the first 10 rows as a sample file to see the structure
query = f"""
SELECT *
FROM `{train_table_id}`
LIMIT 10
"""

# Execute the query and load results into a DataFrame
query_job = client.query(query)  # Run the query
df = query_job.to_dataframe()  # Convert the results into a pandas DataFrame

# Convert to a BigQuery DataFrames (bigframes.pandas, bpd) DataFrame
# data_test_bpd = bpd.DataFrame(data_test)

In [8]:
excluded_features = ['row_id']
label = ['Target']

# features = [feat for feat in data.columns if feat not in excluded_features and not feat==label]
df_features = [feat for feat in df.columns if feat not in excluded_features and feat not in label]

df_num_cols = df[df_features].select_dtypes(include='number').columns.tolist()
df_cat_cols = df[df_features].select_dtypes(exclude='number').columns.tolist()

df_rest_cols = [feat for feat in df.columns if feat not in df_cat_cols and feat not in df_num_cols]

print(f'''{"df shape:":20} {df.shape}
{"df[features] shape:":20} {df[df_features].shape}
categorical columns: {df_cat_cols}
numerical columns: {df_num_cols}
feature columns: {df_features}
rest columns: {df_rest_cols}''')

# THX David Gutmann

df shape:            (10, 16)
df[features] shape:  (10, 14)
categorical columns: ['workclass', 'education', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'native_country']
numerical columns: ['age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week']
feature columns: ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country']
rest columns: ['Target', 'row_id']


In [9]:
# Format columns lists as strings for the query
df_num_cols_str = ', '.join([f"'{col}'" for col in df_num_cols])
df_cat_cols_str = ', '.join([f"'{col}'" for col in df_cat_cols])
df_target_str = ', '.join([f"'{col}'" for col in label])

print("df_num_cols_str: ", df_num_cols_str)
print("df_cat_cols_str: ", df_cat_cols_str)
print("df_target_str: ", df_target_str)

df_num_cols_str:  'age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week'
df_cat_cols_str:  'workclass', 'education', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'native_country'
df_target_str:  'Target'
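
The strings above wrap each column name in single quotes, which SQL treats as string literals (useful, for example, inside an `IN (...)` filter on column names). To select the columns themselves, the names would need to be quoted as identifiers with backticks instead. The following sketch shows both forms side by side; the helper names are hypothetical and the table ID is simply the one used in this notebook.

```python
from typing import Iterable


def as_string_literals(cols: Iterable[str]) -> str:
    """'age', 'fnlwgt' -> values to compare against, e.g. in an IN (...) list."""
    return ", ".join(f"'{col}'" for col in cols)


def as_identifiers(cols: Iterable[str]) -> str:
    """`age`, `fnlwgt` -> column references for a SELECT list."""
    return ", ".join(f"`{col}`" for col in cols)


num_cols = ["age", "fnlwgt", "education_num"]
table_id = "de123456-user-prd-1.xgb_classification_project.census_train"

# Explicit column list instead of SELECT *
select_query = f"SELECT {as_identifiers(num_cols)} FROM `{table_id}` LIMIT 10"

print(as_string_literals(num_cols))
print(select_query)
```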


In [10]:
# Define the SQL query to create or replace a table with the converted values for the Target variable
query = f"""
CREATE OR REPLACE TABLE `{train_table_id_new}` AS
  SELECT  SAFE_CAST(Target AS INT64) AS Target_int
        , *
FROM `{train_table_id}`
"""

# Run the query and wait until the new table has been written
query_job = client.query(query)
query_job.result()

In [11]:
# Define the SQL query to create or replace a table with the converted values
query = f"""
CREATE OR REPLACE TABLE `{test_table_id_new}` AS
  SELECT  SAFE_CAST(Target AS INT64) AS Target_int
        , *
FROM `{test_table_id}`
"""

# Run the query and wait until the new table has been written
query_job = client.query(query)
query_job.result()
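
Both queries rely on `SAFE_CAST`, which returns `NULL` instead of raising an error when a value cannot be converted. A rough Python analogue is sketched below to make that behaviour concrete. The function name `safe_cast_int` is made up for this illustration, and Python's `int()` parsing rules are not identical to BigQuery's, so treat it as an analogy only.

```python
from typing import List, Optional


def safe_cast_int(value: Optional[str]) -> Optional[int]:
    """Return int(value), or None when the value is missing or not convertible."""
    if value is None:
        return None
    try:
        return int(value)
    except ValueError:
        return None


raw_targets: List[Optional[str]] = ["0", "1", "n/a", None]
converted = [safe_cast_int(v) for v in raw_targets]
print(converted)  # [0, 1, None, None]
```

Because failed conversions pass silently, it can be worth counting the `NULL` values in `Target_int` before training.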

In [52]:
if 'data' in globals():
    del data

In [53]:
# load the data from BigQuery into a (temporary) Bigframes structure like a Pandas dataframe
data = bpd.read_gbq(train_table_id_new)
data = data.reset_index(drop=True)

In [54]:
# BigQuery DataFrames creates a default numbered index, which we can give a name
# data.index.name = "train_id"
data.head()

Unnamed: 0,Target_int,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,Target,row_id
0,0,27,Private,124953,HS-grad,9,Never-married,Other-service,Not-in-family,White,Male,0,1980,40,United-States,0,Row121
1,0,24,Private,229773,Bachelors,13,Never-married,Exec-managerial,Not-in-family,White,Male,0,0,40,United-States,0,Row3745
2,0,52,Private,208137,Assoc-voc,11,Married-civ-spouse,Adm-clerical,Husband,White,Male,0,0,40,United-States,0,Row1150
3,0,65,,146722,12th,8,Married-civ-spouse,,Husband,White,Male,0,0,10,United-States,0,Row28369
4,0,25,Private,193787,Bachelors,13,Married-civ-spouse,Sales,Wife,White,Female,0,0,45,United-States,0,Row34115


[Bigframes Functions](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.dataframe.DataFrame) - Overview

In [55]:
# The integer Target_int gets to be the 'real' Target by renaming and dropping

data = data.drop(['Target'], axis=1).rename(columns={"Target_int": "Target"})
# data = data.rename(columns={"Target_int": "Target"})
data.head()

Unnamed: 0,Target,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,row_id
0,0,27,Private,124953,HS-grad,9,Never-married,Other-service,Not-in-family,White,Male,0,1980,40,United-States,Row121
1,0,24,Private,229773,Bachelors,13,Never-married,Exec-managerial,Not-in-family,White,Male,0,0,40,United-States,Row3745
2,0,52,Private,208137,Assoc-voc,11,Married-civ-spouse,Adm-clerical,Husband,White,Male,0,0,40,United-States,Row1150
3,0,65,,146722,12th,8,Married-civ-spouse,,Husband,White,Male,0,0,10,United-States,Row28369
4,0,25,Private,193787,Bachelors,13,Married-civ-spouse,Sales,Wife,White,Female,0,0,45,United-States,Row34115


In [56]:
type(data)

bigframes.dataframe.DataFrame

#### Import the Test data

In [60]:
if 'data_test' in globals():
    del data_test

In [61]:
data_test = bpd.read_gbq(test_table_id_new)
data_test = data_test.reset_index(drop=True)

In [62]:
# BigQuery DataFrames creates a default numbered index, which we can give a name
# data_test.index.name = "test_id"
data_test.head()

Unnamed: 0,Target_int,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,Target,row_id
0,0,38,Private,206535,Some-college,10,Divorced,Tech-support,Unmarried,White,Female,0,0,50,United-States,0,Row1079
1,0,56,Private,183169,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Female,0,0,35,United-States,0,Row13436
2,0,69,Private,29087,7th-8th,4,Married-civ-spouse,Handlers-cleaners,Husband,White,Male,0,0,6,United-States,0,Row14372
3,1,38,Private,149347,Masters,14,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,1,Row14644
4,0,48,Private,310639,Some-college,10,Divorced,Exec-managerial,Not-in-family,White,Male,0,0,50,United-States,0,Row10348


In [63]:
data_test = data_test.drop(['Target'], axis=1).rename(columns={"Target_int": "Target"})
# data_test = data_test.rename(columns={"Target_int": "Target"})
data_test.head()

Unnamed: 0,Target,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,row_id
0,0,38,Private,206535,Some-college,10,Divorced,Tech-support,Unmarried,White,Female,0,0,50,United-States,Row1079
1,0,56,Private,183169,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Female,0,0,35,United-States,Row13436
2,0,69,Private,29087,7th-8th,4,Married-civ-spouse,Handlers-cleaners,Husband,White,Male,0,0,6,United-States,Row14372
3,1,38,Private,149347,Masters,14,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,Row14644
4,0,48,Private,310639,Some-college,10,Divorced,Exec-managerial,Not-in-family,White,Male,0,0,50,United-States,Row10348


In [64]:
# examine the data BigQuery object

excluded_features = ['row_id', 'train_id', 'test_id']
label = ['Target']

# features = [feat for feat in data.columns if feat not in excluded_features and not feat==label]
features = [feat for feat in data.columns if feat not in excluded_features and feat not in label]

num_cols = data[features].select_dtypes(include='number').columns.tolist()
cat_cols = data[features].select_dtypes(exclude='number').columns.tolist()

rest_cols = [feat for feat in data.columns if feat not in cat_cols and feat not in num_cols]

print(f'''{"data shape:":20} {data.shape}
{"data[features] shape:":20} {data[features].shape}
categorical columns: {cat_cols}
numerical columns: {num_cols}
feature columns: {features}
rest columns: {rest_cols}''')

# THX David Gutmann

data shape:          (34189, 16)
data[features] shape: (34189, 14)
categorical columns: ['workclass', 'education', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'native_country']
numerical columns: ['age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week']
feature columns: ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country']
rest columns: ['Target', 'row_id']


In [65]:
print(data.dtypes)

Target                      Int64
age                         Int64
workclass         string[pyarrow]
fnlwgt                      Int64
education         string[pyarrow]
education_num               Int64
marital_status    string[pyarrow]
occupation        string[pyarrow]
relationship      string[pyarrow]
race              string[pyarrow]
sex               string[pyarrow]
capital_gain                Int64
capital_loss                Int64
hours_per_week              Int64
native_country    string[pyarrow]
row_id            string[pyarrow]
dtype: object


In [None]:
# data[cat_cols] = data[cat_cols].astype('category')

In [68]:
if 'X' in globals():
    del X

if 'y' in globals():
    del y

In [69]:
# split training data into X and y
X = data[features]
y = data[label]

In [70]:
print(X.dtypes)

age                         Int64
workclass         string[pyarrow]
fnlwgt                      Int64
education         string[pyarrow]
education_num               Int64
marital_status    string[pyarrow]
occupation        string[pyarrow]
relationship      string[pyarrow]
race              string[pyarrow]
sex               string[pyarrow]
capital_gain                Int64
capital_loss                Int64
hours_per_week              Int64
native_country    string[pyarrow]
dtype: object


In [71]:
# split data into train and test sets
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)

In [72]:
type(X_train)

bigframes.dataframe.DataFrame

In [73]:
# Show the shape of the data after the split
print(f"""X_train shape: {X_train.shape}
X_test shape: {X_test.shape}
y_train shape: {y_train.shape}
y_test shape: {y_test.shape}""")

X_train shape: (22907, 14)
X_test shape: (11282, 14)
y_train shape: (22907, 1)
y_test shape: (11282, 1)
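
As a quick plausibility check, the printed shapes can be compared with the requested `test_size`. The arithmetic below only confirms that the observed row counts are consistent with a 33% split of 34,189 rows; it is not a description of how bigframes rounds or samples internally.

```python
total_rows = 34189   # rows in `data`, from the shape printed earlier
test_size = 0.33     # fraction passed to train_test_split

# Truncating the product reproduces the row counts shown above
expected_test_rows = int(total_rows * test_size)
expected_train_rows = total_rows - expected_test_rows

print(expected_test_rows, expected_train_rows)  # 11282 22907
```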


In [None]:
# D_train = xgb.DMatrix(X_train, label=y_train, enable_categorical=True)
# D_test = xgb.DMatrix(X_test, label=y_test, enable_categorical=True)

In [74]:
X_train.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country
8976,20,,307149,Some-college,10,Never-married,,Own-child,White,Female,0,0,35,United-States
30998,37,Private,172538,HS-grad,9,Never-married,Machine-op-inspct,Own-child,White,Male,0,0,40,United-States
20268,21,Private,206681,Some-college,10,Never-married,Sales,Own-child,White,Female,0,0,15,United-States
1197,21,Private,270043,Some-college,10,Never-married,Other-service,Own-child,White,Female,0,0,16,United-States
3067,19,,285177,Some-college,10,Never-married,,Own-child,White,Male,0,0,18,United-States



Example of how to customize the model's hyperparameters:

https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.ml.ensemble.XGBClassifier

```Python
model = XGBClassifier(
    n_estimators=500,            # Increase the number of boosting rounds
    learning_rate=0.05,          # Lower learning rate
    max_depth=8,                 # Maximum depth of trees
    subsample=0.8,               # Subsample ratio
    colsample_bytree=0.8,        # Column subsample ratio by tree
    min_child_weight=1,          # Minimum child weight
    gamma=0,                     # Minimum loss reduction
    reg_alpha=0.01,              # L1 regularization term on weights
    reg_lambda=1                 # L2 regularization term on weights
)
```




In [75]:
# Using the XGBClassifier from the bigframes.ml package
# from bigframes.ml.ensemble import XGBClassifier

model = XGBClassifier()
# Here we pass the feature columns without transforms - BQML will then use
# automatic preprocessing to encode these columns
model.fit(X_train[features], y_train)

XGBClassifier()

In [91]:
# import bigframes.ml.metrics
from bigframes.ml import metrics

In [77]:
# evaluate the best model on the test data
y_pred = model.predict(X_test)

In [78]:
type(y_pred)

bigframes.dataframe.DataFrame

In [79]:
y_pred.head()

Unnamed: 0,predicted_Target,predicted_Target_probs,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country
0,0,"[{'label': 1, 'prob': 0.02194412797689438}  {'...",27,Private,124953,HS-grad,9,Never-married,Other-service,Not-in-family,White,Male,0,1980,40,United-States
1,0,"[{'label': 1, 'prob': 0.017343081533908844}  {...",24,Private,229773,Bachelors,13,Never-married,Exec-managerial,Not-in-family,White,Male,0,0,40,United-States
5,0,"[{'label': 1, 'prob': 0.0384746752679348}  {'l...",33,Private,80058,HS-grad,9,Never-married,Transport-moving,Not-in-family,White,Male,0,0,40,United-States
7,0,"[{'label': 1, 'prob': 0.06342051923274994}  {'...",47,Private,70209,HS-grad,9,Divorced,Adm-clerical,Unmarried,White,Female,0,0,40,United-States
8,0,"[{'label': 1, 'prob': 0.028626788407564163}  {...",32,Private,186824,HS-grad,9,Never-married,Machine-op-inspct,Unmarried,White,Male,0,0,40,United-States


In [80]:
# Merge y_pred and y_test by their index to get the original Target column back
y_pred = y_pred.join(y_test, how='inner')

# This will perform an inner join based on the indexes of y_pred and y_test


In [81]:
y_pred.head()

Unnamed: 0,predicted_Target,predicted_Target_probs,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,Target
0,0,"[{'label': 1, 'prob': 0.02194412797689438}  {'...",27,Private,124953,HS-grad,9,Never-married,Other-service,Not-in-family,White,Male,0,1980,40,United-States,0
1,0,"[{'label': 1, 'prob': 0.017343081533908844}  {...",24,Private,229773,Bachelors,13,Never-married,Exec-managerial,Not-in-family,White,Male,0,0,40,United-States,0
5,0,"[{'label': 1, 'prob': 0.0384746752679348}  {'l...",33,Private,80058,HS-grad,9,Never-married,Transport-moving,Not-in-family,White,Male,0,0,40,United-States,0
7,0,"[{'label': 1, 'prob': 0.06342051923274994}  {'...",47,Private,70209,HS-grad,9,Divorced,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,0
8,0,"[{'label': 1, 'prob': 0.028626788407564163}  {...",32,Private,186824,HS-grad,9,Never-married,Machine-op-inspct,Unmarried,White,Male,0,0,40,United-States,0


In [82]:
# inspect the first line of results to see the structure of the prediction column
print(y_pred['Target'].iloc[0], " ", y_pred['predicted_Target_probs'].iloc[0])


0   [{'label': 1, 'prob': 0.02194412797689438}, {'label': 0, 'prob': 0.9780558347702026}]
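
The prediction column is a list of `{'label', 'prob'}` entries per row. Once a row has been pulled into local Python (as with `.iloc[0]` above), the probability for a given label can be picked out with a small function. This is a sketch on plain Python objects; `prob_for_label` is a hypothetical name, and the notebook itself does the unnesting in SQL further down.

```python
from typing import Dict, List, Union

ProbEntry = Dict[str, Union[int, float]]


def prob_for_label(probs: List[ProbEntry], label: int = 1) -> float:
    """Return the probability stored for `label` in one prediction row."""
    for entry in probs:
        if entry["label"] == label:
            return float(entry["prob"])
    raise KeyError(f"label {label} not found in {probs!r}")


# The first row printed above
row = [
    {"label": 1, "prob": 0.02194412797689438},
    {"label": 0, "prob": 0.9780558347702026},
]

print(prob_for_label(row, label=1))
print(prob_for_label(row, label=0))
```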


In [None]:
# Extract the probability for label=1
# y_pred['prob_1'] = y_pred['predicted_Target_probs'].apply(lambda x: [item['prob'] for item in x if item['label'] == 1][0])

Clustered tables: https://cloud.google.com/bigquery/docs/clustered-tables

Example: `table_id = df.to_gbq(clustering_columns=("index", "int_col"))`

In [83]:
# write results back to your BigQuery project

v_target_table = f"{project_id}.{dataset_id}.census_predicted"
y_pred.to_gbq(destination_table=v_target_table,  if_exists='replace')

'de123456-user-prd-1.xgb_classification_project.census_predicted'

In [84]:
# Define the SQL query to create or replace a table with the converted values
# unnest the data table

query = f"""
CREATE OR REPLACE TABLE `{v_target_table}_new` AS
SELECT
  (SELECT prob FROM UNNEST(predicted_Target_probs) WHERE label = 1 LIMIT 1) AS proba_1,
  (SELECT prob FROM UNNEST(predicted_Target_probs) WHERE label = 0 LIMIT 1) AS proba_0,
  *
FROM `{v_target_table}`
"""

# `{v_target_table}`

# Run the query and wait until the new table has been written
query_job = client.query(query)
query_job.result()

In [85]:
if 'y_score' in globals():
    del y_score

In [88]:
# Specify your SQL query to select only the desired columns
sql_query = f"""
SELECT
       Target,
       proba_1
FROM `{v_target_table}_new`
"""

# Use the SQL query to load data
y_score = bpd.read_gbq(sql_query)

In [35]:
y_score.head()

Unnamed: 0,train_id,Target,proba_1
0,16821,0,0.035299
1,29310,0,0.009112
2,14649,0,0.147506
3,5670,0,0.006607
4,31296,1,0.99401


In [89]:
auc_score = metrics.roc_auc_score(y_score['Target'], y_score['proba_1'])
print("AUC: ", auc_score)

AUC:  0.9195750098181543
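
To make the AUC value easier to interpret: it can be read as the probability that a randomly chosen positive row receives a higher score than a randomly chosen negative row. The sketch below computes that quantity by brute force in plain Python. It is an illustration for small samples only, not the implementation used by `bigframes.ml.metrics`; the function name `pairwise_roc_auc` is made up here.

```python
from typing import Sequence


def pairwise_roc_auc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: one of the four (positive, negative) pairs is ranked the wrong way round
toy_auc = pairwise_roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)  # 0.75

# The five rows shown by y_score.head(): the single positive has the top score
head_auc = pairwise_roc_auc(
    [0, 0, 0, 0, 1], [0.035299, 0.009112, 0.147506, 0.006607, 0.99401]
)
print(head_auc)  # 1.0
```

The double loop is quadratic in the number of rows, so for the full table the library function remains the sensible choice.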


In [93]:
# Initialize Vertex AI
aiplatform.init(project=project_id, location=region_id)

In [94]:
# Save the model to a dataset in your BigQuery project (it appears under Models)
model_name = 'model_xgboost_01'
model.to_gbq(f"{dataset_id}.{model_name}", replace=True)

XGBClassifier(booster='GBTREE', tree_method='AUTO')

In [95]:
model_load = bpd.read_gbq_model(f"{dataset_id}.{model_name}")

In [96]:
type(model_load)

bigframes.ml.ensemble.XGBClassifier

### Apply the stored model to a (new) dataset

In [97]:
# data_test = data_test.rename(columns={"native-country": "native_country"})
# data_test = data_test.rename(columns={"Target_int": "Target"})
data_test.head()

Unnamed: 0,Target,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,row_id
0,0,38,Private,206535,Some-college,10,Divorced,Tech-support,Unmarried,White,Female,0,0,50,United-States,Row1079
1,0,56,Private,183169,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Female,0,0,35,United-States,Row13436
2,0,69,Private,29087,7th-8th,4,Married-civ-spouse,Handlers-cleaners,Husband,White,Male,0,0,6,United-States,Row14372
3,1,38,Private,149347,Masters,14,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,Row14644
4,0,48,Private,310639,Some-college,10,Divorced,Exec-managerial,Not-in-family,White,Male,0,0,50,United-States,Row10348


In [98]:
print(data_test.dtypes)

Target                      Int64
age                         Int64
workclass         string[pyarrow]
fnlwgt                      Int64
education         string[pyarrow]
education_num               Int64
marital_status    string[pyarrow]
occupation        string[pyarrow]
relationship      string[pyarrow]
race              string[pyarrow]
sex               string[pyarrow]
capital_gain                Int64
capital_loss                Int64
hours_per_week              Int64
native_country    string[pyarrow]
row_id            string[pyarrow]
dtype: object


In [99]:
# Predict the outcome with (new) data
test_pred = model_load.predict(data_test)

In [100]:
# the predicted Target probs are a nested object that will have to be untangled
test_pred.head()

Unnamed: 0,predicted_Target,predicted_Target_probs,Target,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,row_id
0,0,"[{'label': 1, 'prob': 0.10922787338495255}  {'...",0,38,Private,206535,Some-college,10,Divorced,Tech-support,Unmarried,White,Female,0,0,50,United-States,Row1079
1,0,"[{'label': 1, 'prob': 0.03274129703640938}  {'...",0,56,Private,183169,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Female,0,0,35,United-States,Row13436
2,0,"[{'label': 1, 'prob': 0.05487420782446861}  {'...",0,69,Private,29087,7th-8th,4,Married-civ-spouse,Handlers-cleaners,Husband,White,Male,0,0,6,United-States,Row14372
3,1,"[{'label': 1, 'prob': 0.7182402610778809}  {'l...",1,38,Private,149347,Masters,14,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,Row14644
4,0,"[{'label': 1, 'prob': 0.21385003626346588}  {'...",0,48,Private,310639,Some-college,10,Divorced,Exec-managerial,Not-in-family,White,Male,0,0,50,United-States,Row10348


## Apply the model via SQL Code in BigQuery

You can also apply the stored model directly in BigQuery with SQL, provided the input table has the same feature columns that were used to train the model.

```SQL
SELECT
  (SELECT prob FROM UNNEST(predicted_Target_probs) WHERE label = 1 LIMIT 1) AS proba_1,
  (SELECT prob FROM UNNEST(predicted_Target_probs) WHERE label = 0 LIMIT 1) AS proba_0,
  predicted_Target,
  Target
FROM
  ML.PREDICT(MODEL `de123456-user-prd-1.xgb_classification_project.model_xgboost_01`,
    (SELECT
      *
     FROM
     `de123456-user-prd-1.xgb_classification_project.census_test_new`))
```
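
If the same scoring query needs to run against several tables, the SQL above could be generated from Python. The following is a minimal sketch that only builds the query text; `build_predict_query` is a hypothetical helper, and executing the result (for example through the `client` created at the top of the notebook) is left out so the block runs without a BigQuery connection.

```python
def build_predict_query(model_id: str, table_id: str) -> str:
    """Return an ML.PREDICT query that unnests the class probabilities."""
    return f"""
SELECT
  (SELECT prob FROM UNNEST(predicted_Target_probs) WHERE label = 1 LIMIT 1) AS proba_1,
  (SELECT prob FROM UNNEST(predicted_Target_probs) WHERE label = 0 LIMIT 1) AS proba_0,
  predicted_Target,
  Target
FROM
  ML.PREDICT(MODEL `{model_id}`,
    (SELECT * FROM `{table_id}`))
""".strip()


predict_sql = build_predict_query(
    "de123456-user-prd-1.xgb_classification_project.model_xgboost_01",
    "de123456-user-prd-1.xgb_classification_project.census_test_new",
)
print(predict_sql)
```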

