DISCLAIMER
Copyright 2025 Google LLC.

This solution, including any related sample code or data, is made available on an “as is,” “as available,” and “with all faults” basis, solely for illustrative purposes, and without warranty or representation of any kind. This solution is experimental, unsupported and provided solely for your convenience. Your use of it is subject to your agreements with Google, as applicable, and may constitute a beta feature as defined under those agreements. To the extent that you make any data available to Google in connection with your use of the solution, you represent and warrant that you have all necessary and appropriate rights, consents and permissions to permit Google to use and process that data. By using any portion of this solution, you acknowledge, assume and accept all risks, known and unknown, associated with its usage, including with respect to your deployment of any portion of this solution in your systems, or usage in connection with your business, if at all.

# Step 1: Install product_return_predictor

In [None]:
cd ~

In [None]:
cd test/product_return_predictor

In [None]:
!pip install product_return_predictor
# Install immutabledict for using product_return_predictor package
!pip install immutabledict

# Step 2: Import General Python Packages and Python Modules

## Import General Python Packages

In [None]:
import numpy as np
import pandas as pd
from google.cloud import bigquery
from google.cloud import storage
import logging
logging.basicConfig(level=logging.INFO, force=True)

## Import Python Modules

In [None]:
import product_return_predictor.constant as constant
import product_return_predictor.product_return_predictor as product_return_predictor
import product_return_predictor.utils as utils
import product_return_predictor.model as model
import product_return_predictor.data_cleaning_feature_selection as data_cleaning_feature_selection
import product_return_predictor.model_prediction_evaluation as model_prediction_evaluation

# Step 3: Define Solution Parameters

There are five groups of solution parameters/attributes that need to be defined:

- **GCP Product Return Project Parameters**: These parameters define the core GCP resources used by the solution.
    - `project_id`: Your Google Cloud Project ID where the product return prediction solution will be deployed and run. All BigQuery datasets, models, and other resources created by this solution will reside within this project.
    - `dataset_id`: The BigQuery dataset ID within your project_id where intermediate and final tables (e.g., ML-ready data, predictions) will be stored.
    - `gcp_bq_client`: An authenticated BigQuery client object. This client is used to interact with BigQuery for querying data, creating tables, and managing datasets. This is typically initialized as bigquery.Client() which will use your default GCP credentials or those configured in your environment. You generally don't need to modify this unless you have a specific authentication setup.
    - `gcp_storage`: An authenticated Google Cloud Storage client object. This client is used for interacting with Cloud Storage buckets, for example, to store temporary files or model artifacts. Similar to gcp_bq_client, this is usually initialized as storage.Client() and uses your configured GCP credentials.
    - `gcp_bucket_name`: The name of a Google Cloud Storage bucket within your project_id. This bucket will be used to store temporary files, model artifacts, or other data during the pipeline execution. Provide the name of an existing or new Cloud Storage bucket. Ensure the service account running the solution has appropriate permissions (read/write) to this bucket.
    - `location`: The geographic location for your GCP resources (e.g., BigQuery datasets, Cloud Storage buckets). It's best practice to keep all resources in the same location for performance and cost efficiency. Example Value: 'us' (United States multi-region). Specify a valid GCP region or multi-region, such as 'us' or 'europe-west1'.

    
- **Feature Engineering Parameters**: These parameters control how your data is prepared for machine learning, including column definitions and data filtering.
    - `use_ga4_data_for_feature_engineering`: A boolean flag indicating whether the solution should use GA4 raw data as the source for feature engineering. Example Value: True or False. Set to True if you're leveraging GA4 data. If set to False, you must preprocess your own data source, do all the feature engineering yourself, and provide the resulting ML-ready table via `ml_training_table_name` in the GCP project and dataset used for the solution (see `project_id`, `dataset_id` above). You must also specify the corresponding column names for the transaction, refund, and ID fields.
    - `transaction_date_col`: The name of the column in your dataset that represents the transaction date. This is a required parameter if `use_ga4_data_for_feature_engineering` is False. If `use_ga4_data_for_feature_engineering` is True, this is automatically set. Example Value: 'transaction_date'. If not using GA4 data, ensure this matches the column name in your input table.
    - `transaction_id_col`: The name of the column in your dataset that uniquely identifies each transaction. This is a required parameter if `use_ga4_data_for_feature_engineering` is False. If `use_ga4_data_for_feature_engineering` is True, this is automatically set. Example Value: 'transaction_id'. If not using GA4 data, ensure this matches the column name in your input table.
    - `refund_value_col`: The name of the column representing the monetary value of the refund (the target variable for regression). This is a required parameter if `use_ga4_data_for_feature_engineering` is False. If `use_ga4_data_for_feature_engineering` is True, this is automatically set. Example Value: 'refund_value'. If not using GA4 data, ensure this matches the column name in your input table.
    - `refund_flag_col`: The name of the column indicating whether a transaction was refunded (a binary flag, e.g., 0 for no refund, 1 for refund; the target variable for classification). This is a required parameter if `use_ga4_data_for_feature_engineering` is False. If `use_ga4_data_for_feature_engineering` is True, this is automatically set. Example Value: 'refund_flag'. If not using GA4 data, ensure this matches the column name in your input table.
    - `refund_proportion_col`: The name of the column representing the proportion of the original transaction amount that was refunded (another potential target variable for regression). This is a required parameter if `use_ga4_data_for_feature_engineering` is False. If `use_ga4_data_for_feature_engineering` is True, this is automatically set. Example Value: 'refund_proportion'. If not using GA4 data, ensure this matches the column name in your input table.
    - `recency_of_transaction_for_prediction_in_days`: The number of days to look back from the current date (or a defined cutoff) to consider transactions for prediction. Transactions older than this window will not be included in the prediction dataset. Example Value: 2, which means only the latest 2 days of transactions are considered when making predictions with the pre-trained BigQuery ML model. Adjust this value based on your business needs for recent transactions. When `use_ga4_data_for_feature_engineering` is False, this parameter is not needed.
    - `return_policy_window_in_days`: The number of days within which a product can be returned according to your business's return policy. This influences how refunds are identified and considered for prediction. Example Value: 30, which means customers can no longer return products more than 30 days after the transaction date. The return policy window is used to exclude from model training any transactions that have not yet passed the return deadline, which avoids noise. Set this to match your actual return policy. When `use_ga4_data_for_feature_engineering` is False, this parameter is not needed.
    - `recency_of_data_in_days`: The number of days to look back in your historical data for model training. This parameter helps define the training data window. Example Value: 365 (approximately 1 year), which means the past year of historical transaction and return data is used for model training. Choose a duration that provides sufficient historical data for training relevant models. Because consumer behavior and your products may have evolved over time, also keep the recency and relevance of the data in mind when choosing this number. This parameter also applies to the prediction pipeline, where it determines how much historical data is used to create features for the prediction data. When `use_ga4_data_for_feature_engineering` is False, this parameter is not needed.

- **Google Analytics 4 (GA4) Raw Datasets Parameters**: These parameters are crucial if you plan to use GA4 data for feature engineering.
    - `ga4_project_id`: The Google Cloud Project ID where your GA4 raw dataset resides. This is typically a public dataset or a project you own containing your GA4 export. Example Value: 'bigquery-public-data'. If `use_ga4_data_for_feature_engineering` is False, you can leave this as None.
    - `ga4_dataset_id`: The BigQuery dataset ID within your `ga4_project_id` that contains your raw GA4 event data. Example Value: 'my_ga4_dataset_id'. If `use_ga4_data_for_feature_engineering` is False, you can leave this as None.


- **Modeling Parameters**: These parameters control the type of machine learning models used and the overall modeling approach.
    - `regression_model_type`: Specifies the type of regression model to be used for predicting the refund_value or refund_proportion. Example Value: constant.LinearBigQueryMLModelType.LINEAR_REGRESSION. You can choose from available regression model types within the constant.LinearBigQueryMLModelType enum (e.g., LINEAR_REGRESSION).
    - `binary_classifier_model_type`: Specifies the type of binary classification model to be used for predicting the refund_flag. Example Value: constant.LinearBigQueryMLModelType.LOGISTIC_REGRESSION. You can choose from available classification model types within the constant.LinearBigQueryMLModelType enum (e.g., LOGISTIC_REGRESSION).
    - `is_two_step_model`: A boolean flag indicating whether to use a two-step modeling approach. In a two-step model, a binary classifier first predicts if a refund will occur, and if so, a regression model then predicts the refund value. Example Value: True. How to Define: Set to True for a two-step approach or False for a single-step model (either regression or classification directly).
    

- **Optional Parameters**: These parameters offer further customization and are often required under specific conditions, as noted in their descriptions.
    - `ml_prediction_table_name`: The name of the BigQuery table containing your preprocessed data for generating predictions. This parameter is required when `use_ga4_data_for_feature_engineering` is False. If you are providing your own data for prediction, specify the BigQuery table name here.
        - **[Important Note]**: `ml_prediction_table_name` needs to be preprocessed properly with columns that represent refund value, refund proportion and refund flag.

    - `ml_training_table_name`: The name of the BigQuery table containing your preprocessed, ML-ready data for model training. This parameter is required when `use_ga4_data_for_feature_engineering` is False. If you are providing your own preprocessed data, specify the BigQuery table name here and make sure the table is under the GCP project and dataset (given by `project_id` and `dataset_id`) used for your model.
        - **[Important Note]**: `ml_training_table_name` needs to be preprocessed properly with columns that represent refund value, refund proportion and refund flag.
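
To make the three day-based window parameters above more concrete, here is a minimal sketch of the date ranges they imply. The helper `compute_date_windows` and its returned keys are hypothetical and not part of `product_return_predictor`; the exact boundary handling (inclusive vs. exclusive dates) inside the solution's SQL is an assumption.

```python
from datetime import date, timedelta


def compute_date_windows(
    cutoff_date: date,
    recency_of_transaction_for_prediction_in_days: int,
    return_policy_window_in_days: int,
    recency_of_data_in_days: int,
) -> dict:
    """Illustrates the date ranges implied by the three window parameters.

    Hypothetical helper for illustration only; boundary handling is assumed.
    """
    # Oldest date of historical data considered at all.
    history_start = cutoff_date - timedelta(days=recency_of_data_in_days)
    # Training only uses transactions whose return deadline has already passed.
    training_end = cutoff_date - timedelta(days=return_policy_window_in_days)
    # Only the most recent transactions are scored by the prediction pipeline.
    prediction_start = cutoff_date - timedelta(
        days=recency_of_transaction_for_prediction_in_days
    )
    return {
        'history_start': history_start,
        'training_end': training_end,
        'prediction_start': prediction_start,
        'prediction_end': cutoff_date,
    }


# Example values taken from the parameter descriptions: 2, 30, and 365 days.
windows = compute_date_windows(date(2025, 1, 31), 2, 30, 365)
for name, value in windows.items():
    print(name, value)
```

With these example values, training would draw on transactions up to 30 days before the cutoff, while prediction would only look at the final 2 days.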


In [None]:
# GCP product return project Parameters:
project_id = 'your-gcp-project-id'
dataset_id = 'your_dataset_id'
gcp_bq_client = bigquery.Client()
gcp_storage = storage.Client()
gcp_bucket_name='your_gcp_bucket_name'
location='your_gcp_location_name'

# Feature Engineering Parameters:
use_ga4_data_for_feature_engineering = True
transaction_date_col = 'transaction_date'
transaction_id_col='transaction_id'
refund_value_col='refund_value'
refund_flag_col = 'refund_flag'
refund_proportion_col = 'refund_proportion'
recency_of_transaction_for_prediction_in_days=1800
return_policy_window_in_days=30
recency_of_data_in_days=1800

# GA4 raw datasets Parameters:
ga4_project_id = 'your-gcp-project-id-for-ga4-data'
ga4_dataset_id = 'your_dataset_id_for_ga4_data'

# Modeling Parameters:
regression_model_type = constant.LinearBigQueryMLModelType.LINEAR_REGRESSION
binary_classifier_model_type = constant.LinearBigQueryMLModelType.LOGISTIC_REGRESSION
is_two_step_model=True

# Optional Parameters
ml_prediction_table_name= None
ml_training_table_name= None

# Validate input parameters
if use_ga4_data_for_feature_engineering and not ga4_project_id:
    raise ValueError('ga4_project_id should be provided when use_ga4_data_for_feature_engineering is set to True.')
if not use_ga4_data_for_feature_engineering and not ml_prediction_table_name:
    raise ValueError('ml_prediction_table_name should not be None when use_ga4_data_for_feature_engineering is set to False.')
if not use_ga4_data_for_feature_engineering and not ml_training_table_name:
    raise ValueError('ml_training_table_name should not be None when use_ga4_data_for_feature_engineering is set to False.')
if use_ga4_data_for_feature_engineering and not ga4_dataset_id:
    raise ValueError('ga4_dataset_id should be provided when use_ga4_data_for_feature_engineering is set to True.')

# Step 4: Create a `ProductReturnPredictor` instance called `product_return` with all the required parameters

Use the predefined parameters above to create a `product_return` instance.
For more details on the `ProductReturnPredictor` class, please refer to `product_return_predictor.py` module under `product_return_predictor/product_return_predictor` directory.

In [None]:
product_return = product_return_predictor.ProductReturnPredictor(project_id=project_id,
                                                                 dataset_id = dataset_id,
                                                                 gcp_bq_client=gcp_bq_client,
                                                                 gcp_storage=gcp_storage,
                                                                 gcp_bucket_name=gcp_bucket_name,
                                                                 location=location,
                                                                 ga4_project_id=ga4_project_id,
                                                                 ga4_dataset_id=ga4_dataset_id,
                                                                 use_ga4_data_for_feature_engineering=use_ga4_data_for_feature_engineering,
                                                                 transaction_date_col=transaction_date_col,
                                                                 transaction_id_col=transaction_id_col,
                                                                 refund_value_col=refund_value_col,
                                                                 refund_flag_col=refund_flag_col,
                                                                 refund_proportion_col=refund_proportion_col,
                                                                 regression_model_type=regression_model_type,
                                                                 binary_classifier_model_type=binary_classifier_model_type,
                                                                 ml_training_table_name=ml_training_table_name,
                                                                 ml_prediction_table_name=ml_prediction_table_name,
                                                                 is_two_step_model=is_two_step_model)

# Step 5: Prediction Pipeline - Data Processing and Feature Engineering for Prediction Data (data_processing_feature_engineering)
This section details how the `data_processing_feature_engineering` method operates when you are preparing data for generating predictions with your trained model. Unlike the training phase, this step focuses on consistently applying the same preprocessing and feature engineering steps that were learned during model training, ensuring your new prediction data is in the correct format.

Here's how you'd typically call this method for prediction in your Colab notebook:
```
product_return.data_processing_feature_engineering(
      data_pipeline_type=constant.DataPipelineType.PREDICTION,
      recency_of_transaction_for_prediction_in_days=recency_of_transaction_for_prediction_in_days,
      return_policy_window_in_days=return_policy_window_in_days,
      recency_of_data_in_days=recency_of_data_in_days)
      
```


### What's Happening During This Step?
When `data_pipeline_type` is set to `constant.DataPipelineType.PREDICTION`, the `data_processing_feature_engineering` method performs the following:
- **Determining Data Source**: It checks the `use_ga4_data_for_feature_engineering` parameter to identify the source of your raw data.
- **Data Ingestion and Feature Creation (If using GA4 Data)**: If `use_ga4_data_for_feature_engineering` is True, the pipeline connects to your specified GA4 BigQuery project and dataset (using `product_return.ga4_project_id` and `product_return.ga4_dataset_id`). It executes the same BigQuery SQL queries used during training to extract and engineer relevant transaction features from GA4 data. These queries are specifically formatted for the prediction pipeline (`data_pipeline_type.value` will be 'prediction'). This ensures that the features created for prediction data are consistent with those the model was trained on. The processed data (for both "existing customers" and "first-time purchases") is then loaded into Pandas DataFrames. If `use_ga4_data_for_feature_engineering` is False, then `ml_prediction_table_name` needs to be provided, and the prediction table should have the same format and features as the provided ML training data.
- **Applying Pre-trained Pipelines**: Unlike the training phase where pipelines are fit (learned), here, the already trained data preprocessing and feature selection pipelines are loaded from your Google Cloud Storage bucket (using `product_return.gcp_storage` and `product_return.gcp_bucket_name`). These pre-trained pipelines ensure that data scaling, categorical encoding, and feature selection are applied consistently to your new prediction data, preventing data leakage or inconsistencies. This means:
    - **Data Cleaning**: Data types are converted, and missing values are imputed (e.g., 'unknown' for strings, 0 for numerics). Rows or columns with high invalid values are NOT removed at this stage to prevent loss of prediction instances; the model is expected to handle them based on how it was trained.
    - **Feature Selection**: Only the features identified as important during the training phase are retained.
    - **Data Transformation**: Data is scaled and transformed using the exact transformations learned from the training data (e.g., `MinMaxScaler` parameters from training are applied).

- The final preprocessed, ML-ready data for prediction is then saved to BigQuery in your dataset_id, typically in tables named something like PREDICTION_ml_ready_data_for_existing_customers and PREDICTION_ml_ready_data_for_first_time_purchase.

- **Handling User-Provided Preprocessed Data (If not using GA4 Data)**: If `use_ga4_data_for_feature_engineering` is False, the solution reads the ml_prediction_table_name you've provided. The processed data is then saved back to BigQuery in your dataset_id, usually in a table named `PREDICTION_ml_data_your_table_name_with_target_variable_refund_value`.

The goal of this prediction pipeline step is purely to prepare new, unseen data in the exact same way as the training data, so the model can make accurate and reliable predictions.
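
The difference between fitting a transformation during training and only applying it during prediction can be sketched in a few lines of plain Python. This is an illustration of the idea, not the solution's implementation: the functions `fit_min_max`, `apply_min_max`, and `impute_numeric` are hypothetical, and the actual pipelines are loaded from your Cloud Storage bucket rather than recomputed.

```python
def fit_min_max(values):
    """Learns min-max scaling parameters from training data only."""
    return min(values), max(values)


def apply_min_max(values, params):
    """Scales values using parameters learned at training time."""
    low, high = params
    span = high - low
    if span == 0:
        # Constant training column: map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(value - low) / span for value in values]


def impute_numeric(values, fill_value=0.0):
    """Replaces missing numeric values, mirroring the 0-for-numerics rule above."""
    return [fill_value if value is None else value for value in values]


# Training phase: parameters are learned (fit) once.
training_order_values = [10.0, 20.0, 50.0]
params = fit_min_max(training_order_values)

# Prediction phase: the stored parameters are reused, never refit.
prediction_order_values = [30.0, None, 60.0]
scaled = apply_min_max(impute_numeric(prediction_order_values), params)
print(scaled)
```

Because the parameters come from the training data, scaled prediction values can fall outside the 0 to 1 range when new data lies outside the range seen during training.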



In [None]:
# Run feature engineering for creating prediction data
product_return.data_processing_feature_engineering(data_pipeline_type=constant.DataPipelineType.PREDICTION,
                                                   recency_of_transaction_for_prediction_in_days=recency_of_transaction_for_prediction_in_days,
                                                   return_policy_window_in_days=return_policy_window_in_days,
                                                   recency_of_data_in_days=recency_of_data_in_days)

# Step 6: Prediction Pipeline - Prediction Generation (prediction_pipeline_prediction_generation)
This step uses your previously trained BigQuery ML models to generate predictions on new, unseen data. The output of this stage, the generated predictions, is stored directly in BigQuery for easy access and integration.

Here's how you'll call this method in your Colab notebook:

```
probability_threshold_for_prediction = 0.5
first_time_purchase = True

product_return.prediction_pipeline_prediction_generation(
    is_two_step_model=is_two_step_model,
    regression_model_type=regression_model_type,
    binary_classifier_model_type=binary_classifier_model_type,
    first_time_purchase=first_time_purchase,
    probability_threshold_for_prediction=probability_threshold_for_prediction
)
```

### What's Happening During This Step?
The `prediction_pipeline_prediction_generation` method focuses solely on applying the pre-trained models to new data and storing the results. It orchestrates the following:
- **Identify Input Data for Prediction**: The method first determines the BigQuery table containing the preprocessed, ML-ready data for which you want to generate predictions.
    - If `use_ga4_data_for_feature_engineering` is True, it fetches data from the tables prepared by the `data_processing_feature_engineering` method (e.g., `PREDICTION_ml_ready_data_for_first_time_purchase` or `PREDICTION_ml_ready_data_for_existing_customers`). The choice depends on the `first_time_purchase` flag you provide.
    - If `use_ga4_data_for_feature_engineering` is False, it uses the `ml_prediction_table_name` you've specified, which should contain your own preprocessed data.

    - It also identifies the name of the training table (`preprocessed_training_table_name`) to correctly reference the trained models in BigQuery, as BigQuery ML models are often associated with their training data.

- Model Selection and Prediction Execution (`model.bigquery_ml_model_prediction`): Based on the is_two_step_model parameter:
    - **Single-Step Model**: If `is_two_step_model` is False, it identifies the relevant pre-trained regression model (based on regression_model_type) in BigQuery.
    - **Two-Step Model**: If `is_two_step_model` is True, it identifies both the pre-trained binary classification model (based on binary_classifier_model_type) and the regression model.
    - It then constructs and executes BigQuery ML `ML.PREDICT` queries. These queries apply the chosen trained model(s) directly within BigQuery to your new prediction data.
    - For a two-step model, the classification model first predicts the likelihood of a refund (refund_flag). If the predicted probability exceeds the `probability_threshold_for_prediction`, the regression model then predicts the refund_value or refund_proportion.

- **Prediction Output**: The results of these BigQuery ML prediction queries are written to a new BigQuery table within your specified dataset_id. The name of this table will be derived from the preprocessed_table_name and the target variables (e.g., `prediction_ml_data_your_table_name_with_target_variable_refund_value`).
- **Logging**: The process logs important information, such as the BigQuery job ID for the prediction query and the name of the destination table where your predictions are stored. This is helpful for monitoring and debugging.
- **The primary outcome of this step** is a BigQuery table filled with your model's predictions for the input data, ready for downstream consumption, analysis, or integration into other systems.

- **What You Need to Provide**: When calling `prediction_pipeline_prediction_generation`, you'll provide the following:
    - `is_two_step_model`: A boolean value indicating whether the model used for prediction is a two-step model (classification followed by regression) or a single-step regression model. This must match the model architecture used during training.
    - `regression_model_type`: The type of regression model that was previously trained and will now be used for generating predictions. This must match the `regression_model_type` used in the training pipeline. Accepted Values: Values from `constant.LinearBigQueryMLModelType`, `constant.DNNBigQueryMLModelType`, or `constant.BoostedTreeBigQueryMLModelType` (e.g., LINEAR_REGRESSION). Use the appropriate constant (e.g., `constant.LinearBigQueryMLModelType.LINEAR_REGRESSION`).
    - `binary_classifier_model_type`: The type of binary classification model that was previously trained and will be used for generating predictions (if `is_two_step_model` is True). This must match the `binary_classifier_model_type` used in the training pipeline. Accepted Values: Values from `constant.LinearBigQueryMLModelType`, `constant.DNNBigQueryMLModelType`, or `constant.BoostedTreeBigQueryMLModelType` (e.g., LOGISTIC_REGRESSION). Use the appropriate constant (e.g., `constant.LinearBigQueryMLModelType.LOGISTIC_REGRESSION`).
    - `first_time_purchase`: This boolean flag is required if `product_return.use_ga4_data_for_feature_engineering` is True. It specifies whether to use the model trained on first-time purchase data (True) or the model trained on existing customer data (False) when generating predictions, so that the correct specialized model is applied. This parameter is ignored if you're not using GA4 data.
    - `probability_threshold_for_prediction`: For two-step models, this is the probability cutoff used by the binary classification model to determine if a transaction is predicted as a "refund" (1) or "no refund" (0). Predictions exceeding this threshold will proceed to the regression step.
    - `bqml_template_files_dir`: (Optional, usually uses default) A mapping (dictionary) providing file paths to the BigQuery ML SQL query templates for prediction operations. This parameter typically uses a default value provided by the solution (`constant.BQML_QUERY_TEMPLATE_FILES`).
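
The two-step logic described above (classify first, then regress only where the predicted probability exceeds the threshold) can be sketched as follows. This is a simplified illustration in plain Python, not the solution's BigQuery ML implementation: the function `combine_two_step_predictions` and the output keys are hypothetical, and assigning a predicted refund value of 0 to transactions at or below the threshold is an assumption.

```python
def combine_two_step_predictions(
    refund_probabilities,
    regression_predictions,
    probability_threshold=0.5,
):
    """Combines classifier probabilities with regression outputs.

    Hypothetical helper for illustration only.
    """
    combined = []
    for probability, predicted_value in zip(
        refund_probabilities, regression_predictions
    ):
        # Step 1: the classifier flags a refund only above the threshold.
        predicted_flag = int(probability > probability_threshold)
        # Step 2: the regression output is kept only for flagged transactions.
        # Assumption: unflagged transactions get a predicted refund value of 0.
        combined.append({
            'predicted_refund_flag': predicted_flag,
            'predicted_refund_value': predicted_value if predicted_flag else 0.0,
        })
    return combined


results = combine_two_step_predictions(
    refund_probabilities=[0.8, 0.5, 0.1],
    regression_predictions=[25.0, 12.0, 40.0],
    probability_threshold=0.5,
)
for row in results:
    print(row)
```

Raising the threshold makes the classifier more conservative, so fewer transactions reach the regression step; lowering it has the opposite effect.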


In [None]:
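# Probability cutoff used by the binary classifier in the two-step model
probability_threshold_for_prediction = 0.5
# Use the model trained on first-time purchase data (only relevant with GA4 data)
first_time_purchase = True
# Model types must match those used in the training pipeline
regression_model_type = constant.LinearBigQueryMLModelType.LINEAR_REGRESSION
binary_classifier_model_type = constant.LinearBigQueryMLModelType.LOGISTIC_REGRESSION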
product_return.prediction_pipeline_prediction_generation(is_two_step_model=is_two_step_model,
                                                         regression_model_type=regression_model_type,
                                                         binary_classifier_model_type=binary_classifier_model_type,
                                                         first_time_purchase=first_time_purchase,
                                                         probability_threshold_for_prediction=probability_threshold_for_prediction)