DISCLAIMER
Copyright 2025 Google LLC.

This solution, including any related sample code or data, is made available on an “as is,” “as available,” and “with all faults” basis, solely for illustrative purposes, and without warranty or representation of any kind. This solution is experimental, unsupported and provided solely for your convenience. Your use of it is subject to your agreements with Google, as applicable, and may constitute a beta feature as defined under those agreements. To the extent that you make any data available to Google in connection with your use of the solution, you represent and warrant that you have all necessary and appropriate rights, consents and permissions to permit Google to use and process that data. By using any portion of this solution, you acknowledge, assume and accept all risks, known and unknown, associated with its usage, including with respect to your deployment of any portion of this solution in your systems, or usage in connection with your business, if at all.

# Step 1: Install product_return_predictor

In [None]:
cd ~

In [None]:
cd test/product_return_predictor

In [None]:
!pip install product_return_predictor
# Install immutabledict for using product_return_predictor package
!pip install immutabledict

# Step 2: Import General Python Packages and Python Modules

## Import General Python Packages

In [None]:
import numpy as np
import pandas as pd
from google.cloud import bigquery
from google.cloud import storage
import logging
logging.basicConfig(level=logging.INFO, force=True)

## Import Python Modules

In [None]:
import product_return_predictor.constant as constant
import product_return_predictor.product_return_predictor as product_return_predictor
import product_return_predictor.utils as utils
import product_return_predictor.model as model
import product_return_predictor.data_cleaning_feature_selection as data_cleaning_feature_selection
import product_return_predictor.model_prediction_evaluation as model_prediction_evaluation

# Step 3: Define Solution Parameters

There are 5 types of solution parameters/attributes that need to be defined:

- **GCP Product Return Project Parameters**: These parameters define the core GCP resources used by the solution.
    - `project_id`: Your Google Cloud Project ID where the product return prediction solution will be deployed and run. All BigQuery datasets, models, and other resources created by this solution will reside within this project.
    - `dataset_id`: The BigQuery dataset ID within your project_id where intermediate and final tables (e.g., ML-ready data, predictions) will be stored.
    - `gcp_bq_client`: An authenticated BigQuery client object. This client is used to interact with BigQuery for querying data, creating tables, and managing datasets. This is typically initialized as bigquery.Client() which will use your default GCP credentials or those configured in your environment. You generally don't need to modify this unless you have a specific authentication setup.
    - `gcp_storage`: An authenticated Google Cloud Storage client object. This client is used for interacting with Cloud Storage buckets, for example, to store temporary files or model artifacts. Similar to gcp_bq_client, this is usually initialized as storage.Client() and uses your configured GCP credentials.
    - `gcp_bucket_name`: The name of a Google Cloud Storage bucket within your project_id. This bucket will be used to store temporary files, model artifacts, or other data during the pipeline execution. Provide the name of an existing or new Cloud Storage bucket. Ensure the service account running the solution has appropriate permissions (read/write) to this bucket.
    - `location`: The geographic location for your GCP resources (e.g., BigQuery datasets, Cloud Storage buckets). It's best practice to keep all resources in the same location for performance and cost efficiency. Example Value: 'us' (United States multi-region). Specify a valid GCP region or multi-region, such as 'us' or 'europe-west1'.

    
- **Feature Engineering Parameters**: These parameters control how your data is prepared for machine learning, including column definitions and data filtering.
    - `use_ga4_data_for_feature_engineering`: A boolean flag indicating whether the solution should use GA4 raw data as the source for feature engineering. Example Value: True or False. Set to True if you're leveraging GA4 data. If set to False, you must preprocess your own data source, complete all feature engineering yourself, and provide the resulting ML-ready table via `ml_training_table_name` in the GCP project and dataset used for the solution (see `project_id` and `dataset_id` above). You must also specify the corresponding column names for the transaction, refund, and ID fields.
    - `transaction_date_col`: The name of the column in your dataset that represents the transaction date. This is a required parameter if `use_ga4_data_for_feature_engineering` is False. If `use_ga4_data_for_feature_engineering` is True, this is automatically set. Example Value: 'transaction_date'. If not using GA4 data, ensure this matches the column name in your input table.
    - `transaction_id_col`: The name of the column in your dataset that uniquely identifies each transaction. This is a required parameter if `use_ga4_data_for_feature_engineering` is False. If `use_ga4_data_for_feature_engineering` is True, this is automatically set. Example Value: 'transaction_id'. If not using GA4 data, ensure this matches the column name in your input table.
    - `refund_value_col`: The name of the column representing the monetary value of the refund (the target variable for regression). This is a required parameter if `use_ga4_data_for_feature_engineering` is False. If `use_ga4_data_for_feature_engineering` is True, this is automatically set. Example Value: 'refund_value'. If not using GA4 data, ensure this matches the column name in your input table.
    - `refund_flag_col`: The name of the column indicating whether a transaction was refunded (a binary flag, e.g., 0 for no refund, 1 for refund; the target variable for classification). This is a required parameter if `use_ga4_data_for_feature_engineering` is False. If `use_ga4_data_for_feature_engineering` is True, this is automatically set. Example Value: 'refund_flag'. If not using GA4 data, ensure this matches the column name in your input table.
    - `refund_proportion_col`: The name of the column representing the proportion of the original transaction amount that was refunded (another potential target variable for regression). This is a required parameter if `use_ga4_data_for_feature_engineering` is False. If `use_ga4_data_for_feature_engineering` is True, this is automatically set. Example Value: 'refund_proportion'. If not using GA4 data, ensure this matches the column name in your input table.
    - `return_policy_window_in_days`: The number of days within which a product can be returned according to your business's return policy. This influences how refunds are identified and considered for prediction. Example Value: 30, meaning customers can no longer return a product once 30 days have passed since the transaction date. The return policy window is used to exclude transactions that have not yet passed the return deadline from model training, which avoids noisy labels. How to Define: Set this to match your actual return policy. When `use_ga4_data_for_feature_engineering` is False, this parameter is not needed.
    - `recency_of_data_in_days`: The number of days to look back in your historical data for model training. This parameter helps define the training data window. Example Value: 365 (approximately 1 year), meaning the past year of historical transaction and return data is used for model training. Choose a duration that provides enough historical data to train relevant models. Consumer behavior and your product range may have changed over time, so also weigh the recency and relevance of the data when choosing this number. When `use_ga4_data_for_feature_engineering` is False, this parameter is not needed.

- **Google Analytics 4 (GA4) Raw Datasets Parameters**: These parameters are crucial if you plan to use GA4 data for feature engineering.
    - `ga4_project_id`: The Google Cloud Project ID where your GA4 raw dataset resides. This is typically the project hosting a public dataset or a project you own that contains your GA4 export. Example Value: 'bigquery-public-data'. If `use_ga4_data_for_feature_engineering` is False, you can leave this as None.
    - `ga4_dataset_id`: The BigQuery dataset ID within your `ga4_project_id` that contains your raw GA4 event data. Example Value: 'my_ga4_dataset_id'. If `use_ga4_data_for_feature_engineering` is False, you can leave this as None.

- **Modeling Parameters**: These parameters control the type of machine learning models used and the overall modeling approach.
    - `regression_model_type`: Specifies the type of regression model to be used for predicting the refund_value or refund_proportion. Example Value: constant.LinearBigQueryMLModelType.LINEAR_REGRESSION. You can choose from available regression model types within the constant.LinearBigQueryMLModelType enum (e.g., LINEAR_REGRESSION).
    - `binary_classifier_model_type`: Specifies the type of binary classification model to be used for predicting the refund_flag. Example Value: constant.LinearBigQueryMLModelType.LOGISTIC_REGRESSION. You can choose from available classification model types within the constant.LinearBigQueryMLModelType enum (e.g., LOGISTIC_REGRESSION).
    - `is_two_step_model`: A boolean flag indicating whether to use a two-step modeling approach. In a two-step model, a binary classifier first predicts if a refund will occur, and if so, a regression model then predicts the refund value. Example Value: True. How to Define: Set to True for a two-step approach or False for a single-step model (either regression or classification directly).
    

- **Optional Parameters**: These parameters offer further customization and are often required under specific conditions, as noted in their descriptions.
    - `ml_training_table_name`: The name of the BigQuery table containing your preprocessed, ML-ready data for model training. This parameter is required when `use_ga4_data_for_feature_engineering` is False. If you are providing your own preprocessed data, specify the BigQuery table name here and make sure the table is under your GCP project and dataset (provided based on project_id, dataset_id) for your model.
        - **[Important Note]**: ml_training_table_name needs to be preprocessed properly with columns that represent refund value, refund proportion and refund flag.
    - `invalid_value_threshold_for_row_removal`: The threshold (as a proportion) for removing rows during data cleaning. If a row has a proportion of invalid (e.g., null) values exceeding this threshold, the entire row will be removed. Default Value: 0.5 (50%). Adjust this value based on your data quality and tolerance for missing data.
    - `invalid_value_threshold_for_column_removal`: The threshold (as a proportion) for removing columns during data cleaning. If a column has a proportion of invalid values exceeding this threshold, the entire column will be removed. Default Value: 0.95 (95%). Adjust this value based on your data quality. Columns with very high proportions of missing values might not be useful for modeling.
    - `min_correlation_threshold_with_numeric_labels_for_feature_reduction`: The minimum correlation threshold used for feature reduction. Features with a correlation below this threshold with the numeric target labels (e.g., refund_value, refund_proportion) might be removed to simplify the model and prevent overfitting. Default Value: 0.1. Adjust this value to control the aggressiveness of feature reduction based on correlation.
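
To make the two removal thresholds concrete, here is a minimal pandas sketch of threshold-based cleaning. It is an illustration only and not the solution's own implementation: the function name `drop_sparse_rows_and_columns`, the choice to treat only nulls as invalid, and the order (columns first, then rows) are all assumptions.

```python
import pandas as pd


def drop_sparse_rows_and_columns(
    df: pd.DataFrame,
    row_threshold: float = 0.5,
    column_threshold: float = 0.95,
) -> pd.DataFrame:
    """Drops columns, then rows, whose share of null values exceeds a threshold."""
    # Share of null values per column (assumption: "invalid" means null here).
    column_invalid_share = df.isna().mean(axis=0)
    kept = df.loc[:, column_invalid_share <= column_threshold]
    # Share of null values per row, computed on the remaining columns.
    row_invalid_share = kept.isna().mean(axis=1)
    return kept.loc[row_invalid_share <= row_threshold]


# Hypothetical toy data: 'coupon' is entirely null, and row 1 is null
# in every remaining column.
example = pd.DataFrame({
    'price': [10.0, None, 30.0, 25.0],
    'coupon': [None, None, None, None],
    'quantity': [1, None, 3, 2],
})
cleaned = drop_sparse_rows_and_columns(example)
print(cleaned)
```

With the default thresholds, the fully null `coupon` column and the fully null row are removed, while rows with complete data are kept. Lowering either threshold makes the cleaning more aggressive.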

In [None]:
# GCP product return project Parameters:
project_id = 'your-gcp-project-id'
dataset_id = 'your_dataset_id'
gcp_bq_client = bigquery.Client()
gcp_storage = storage.Client()
gcp_bucket_name='your_gcp_bucket_name'
location='your_gcp_location_name'

# Feature Engineering Parameters:
use_ga4_data_for_feature_engineering = True
transaction_date_col = 'transaction_date'
transaction_id_col='transaction_id'
refund_value_col='refund_value'
refund_flag_col = 'refund_flag'
refund_proportion_col = 'refund_proportion'
return_policy_window_in_days=30
recency_of_data_in_days=1800

# GA4 raw datasets Parameters:
ga4_project_id = 'your-gcp-project-id-for-ga4-data'
ga4_dataset_id = 'your_dataset_id_for_ga4_data'

# Modeling Parameters:
regression_model_type = constant.LinearBigQueryMLModelType.LINEAR_REGRESSION
binary_classifier_model_type = constant.LinearBigQueryMLModelType.LOGISTIC_REGRESSION
is_two_step_model=True

# Optional Parameters
ml_training_table_name = None

# Validate input parameters
if use_ga4_data_for_feature_engineering and not ga4_project_id:
    raise ValueError('ga4_project_id should be provided when use_ga4_data_for_feature_engineering is set to True.')
if use_ga4_data_for_feature_engineering and not ga4_dataset_id:
    raise ValueError('ga4_dataset_id should be provided when use_ga4_data_for_feature_engineering is set to True.')
if not use_ga4_data_for_feature_engineering and not ml_training_table_name:
    raise ValueError('ml_training_table_name should not be None when use_ga4_data_for_feature_engineering is set to False.')

# Step 4: Create a `ProductReturnPredictor` instance called `product_return` with all the required parameters

Use the predefined parameters above to create a `product_return` instance.
For more details on the `ProductReturnPredictor` class, refer to the `product_return_predictor.py` module under the `product_return_predictor/product_return_predictor` directory.

In [None]:
product_return = product_return_predictor.ProductReturnPredictor(project_id=project_id,
                                                                 dataset_id = dataset_id,
                                                                 gcp_bq_client=gcp_bq_client,
                                                                 gcp_storage=gcp_storage,
                                                                 gcp_bucket_name=gcp_bucket_name,
                                                                 location=location,
                                                                 ga4_project_id=ga4_project_id,
                                                                 ga4_dataset_id=ga4_dataset_id,
                                                                 use_ga4_data_for_feature_engineering=use_ga4_data_for_feature_engineering,
                                                                 transaction_date_col=transaction_date_col,
                                                                 transaction_id_col=transaction_id_col,
                                                                 refund_value_col=refund_value_col,
                                                                 refund_flag_col=refund_flag_col,
                                                                 refund_proportion_col=refund_proportion_col,
                                                                 regression_model_type=regression_model_type,
                                                                 binary_classifier_model_type=binary_classifier_model_type,
                                                                 ml_training_table_name=ml_training_table_name,
                                                                 is_two_step_model=is_two_step_model)

# Step 5: Data Processing and Feature Engineering (`data_processing_feature_engineering`)
This step is where your raw data gets transformed into a format suitable for machine learning. It involves several critical sub-steps, including data cleaning, feature creation (if using GA4 data), and preparing the data for model training or prediction.

The `data_processing_feature_engineering` method handles the heavy lifting of preparing your data. It intelligently adapts its process based on whether you're using Google Analytics 4 (GA4) data as your source or providing your own preprocessed data.

**Here's how to call this method:**

```
product_return.data_processing_feature_engineering(
    data_pipeline_type=constant.DataPipelineType.TRAINING,
    return_policy_window_in_days=return_policy_window_in_days,
    recency_of_data_in_days=recency_of_data_in_days
)
```

**What's Happening During This Step?**

The `data_processing_feature_engineering` method orchestrates the following:

- **Determining Data Source**: It checks the `use_ga4_data_for_feature_engineering` parameter (defined in Step 3). This flag dictates whether the pipeline will query GA4 raw data or use a pre-existing table you've provided.
- **Data Ingestion and Feature Creation (If using GA4 Data)**: If `use_ga4_data_for_feature_engineering` is True, the pipeline connects to your specified GA4 BigQuery project and dataset. It executes a series of BigQuery SQL queries (defined in `constant.GA4_DATA_PIPELINE_QUERY_TEMPLATES`) to extract relevant transaction data and engineer features directly within BigQuery. This includes calculating metrics like `refund_value`, `refund_flag`, and `refund_proportion`. It also segregates the data into two main categories: `_ml_ready_data_for_existing_customers` and `_ml_ready_data_for_first_time_purchase`, which are then further processed.

- **Data Cleaning and Preprocessing**: The extracted data is then loaded into Pandas DataFrames. A comprehensive data cleaning process is applied, which includes:
    - **Type Conversion**: Ensuring columns have the correct data types (e.g., string, numeric, date).
    - **Missing Value Imputation**: Filling missing string values with `'unknown'` and numeric values with 0.
    - **Invalid Data Removal (for Training)**: If you are running the training pipeline, rows and columns with a high proportion of invalid or zero values are identified and potentially removed based on thresholds (invalid_value_threshold_for_row_removal and invalid_value_threshold_for_column_removal). This step is skipped during the prediction pipeline to avoid altering the data structure unexpectedly.
    
- **Feature Selection**: A feature selection pipeline is applied to identify and retain the most relevant features for modeling based on their correlation with target variables. This pipeline is trained during the training phase and then saved to Google Cloud Storage for consistent use during prediction.

- **Data Transformation (Scaling and Resampling)**: Features are scaled (e.g., using `MinMaxScaler`) to normalize their ranges, which can improve model performance. If there's a significant imbalance in your target variable (e.g., very few refunds), data resampling techniques may be applied to balance the classes. This data processing pipeline is also trained and saved for reusability.

- **Train/Test Split**: For the training pipeline, the data is split into training and testing sets based on your specified `train_test_split_test_size_proportion` and `transaction_date_col` (for chronological splitting). This ensures your model is evaluated on unseen data.

- The processed, ML-ready data for both existing and first-time customers is then saved back to BigQuery in your specified dataset_id.


- **Handling User-Provided Preprocessed Data (If not using GA4 Data)**: If `use_ga4_data_for_feature_engineering` is False, the solution assumes you've already performed the initial data preparation and provided an ML-ready table. It uses a BigQuery SQL query to read your `ml_training_table_name` (or `ml_prediction_table_name`) and prepares it by creating a train/test split column based on the `transaction_id_col` and `train_test_split_test_size_proportion`. This data is then saved as ML-ready tables in BigQuery.
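
The scaling and chronological split described above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the pipeline's actual code: the solution works on Pandas DataFrames and a saved scaling pipeline, while the helper names `min_max_scale` and `chronological_split`, the toy transactions, and the rounding rule for the test set size are hypothetical.

```python
from datetime import date


def min_max_scale(values):
    """Rescales numbers to the range [0, 1]; a constant column maps to zeros."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


def chronological_split(rows, date_key, test_size_proportion):
    """Sorts rows by date and puts the most recent share in the test set."""
    ordered = sorted(rows, key=lambda row: row[date_key])
    n_test = round(len(ordered) * test_size_proportion)
    split_at = len(ordered) - n_test
    return ordered[:split_at], ordered[split_at:]


# Hypothetical toy transactions, deliberately out of date order.
transactions = [
    {'transaction_date': date(2024, 3, 1), 'order_value': 80.0},
    {'transaction_date': date(2024, 1, 1), 'order_value': 20.0},
    {'transaction_date': date(2024, 4, 1), 'order_value': 50.0},
    {'transaction_date': date(2024, 2, 1), 'order_value': 20.0},
    {'transaction_date': date(2024, 5, 1), 'order_value': 35.0},
]
scaled = min_max_scale([row['order_value'] for row in transactions])
train, test = chronological_split(transactions, 'transaction_date', 0.2)
print(scaled)
print(len(train), len(test))
```

Splitting by date rather than at random keeps the most recent transactions out of training, so the evaluation resembles how the model would be used on future orders.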




**What You Need to Provide?**

For this step, you'll explicitly provide values for the following parameters when calling the function:

- `data_pipeline_type`: This crucial parameter tells the pipeline whether it's preparing data for model training or generating predictions.
    - Accepted Values:
        - `constant.DataPipelineType.TRAINING`: Select this when you are training a new model. The pipeline will perform data cleaning, feature engineering, and split data into training and testing sets. It will also train and save the data preprocessing and feature selection pipelines to Google Cloud Storage.
        - `constant.DataPipelineType.PREDICTION`: Choose this when you want to generate predictions using a pre-trained model. The pipeline will load the saved preprocessing and feature selection pipelines from Cloud Storage and apply them to your new data, without splitting into train/test sets.
    - How to Provide: Directly use constant.DataPipelineType.TRAINING or constant.DataPipelineType.PREDICTION.

- `return_policy_window_in_days`: An integer indicating your product return policy window in days. This is used to define the timeframe within which a return is considered valid for labeling purposes in the training data. How to Provide: An integer value (e.g., 30).
    - Note: This should be provided during Step 3.

- `recency_of_data_in_days`: An integer specifying the number of historical days of data to consider for model training. This helps define the scope of your training dataset. How to Provide: An integer value (e.g., 1800 for approximately 5 years).
    - Note: This should be provided during Step 3.
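
As a rough illustration of how these two integers could bound the training data, the sketch below computes the date range of eligible transactions. It assumes inclusive boundaries and a reference date passed in explicitly; the helper `training_window` is hypothetical and the solution's SQL templates may define the window differently.

```python
from datetime import date, timedelta


def training_window(as_of, return_policy_window_in_days, recency_of_data_in_days):
    """Returns the (start, end) dates of transactions eligible for training."""
    # Oldest transaction date still considered recent enough.
    start = as_of - timedelta(days=recency_of_data_in_days)
    # Newest transaction date whose return deadline has already passed.
    end = as_of - timedelta(days=return_policy_window_in_days)
    if start > end:
        raise ValueError(
            'recency_of_data_in_days must be at least '
            'return_policy_window_in_days.'
        )
    return start, end


start, end = training_window(
    as_of=date(2023, 12, 31),
    return_policy_window_in_days=30,
    recency_of_data_in_days=365,
)
print(start, end)
```

Transactions newer than `end` are left out because their customers could still return the product, so their refund labels are not final yet.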

**Important Considerations:**

- **Pre-defined Parameters**: This step heavily relies on the parameters you've set up in the initial configuration (e.g., ga4_project_id, project_id, dataset_id, gcp_bucket_name, `use_ga4_data_for_feature_engineering`, `transaction_date_col`, etc.). Ensure those are correctly defined before running this step.

- **Data Availability**: Make sure your GA4 data (if `use_ga4_data_for_feature_engineering` is True) or your ml_training_table_name/ml_prediction_table_name (if `use_ga4_data_for_feature_engineering` is False) are accessible in BigQuery.

- **If you have done feature engineering yourself without using the ga4 data export directly**:
    - You will still need to run the following code to prepare your dataset for modeling. However, all the data cleaning, validation, feature engineering, and data scaling steps will be skipped.
    - Also, make sure you set **`use_ga4_data_for_feature_engineering`** to False, and set the values for the following parameters when creating the **ProductReturnPredictor** instance:
        - `ml_training_table_name`
        - `ml_prediction_table_name`
        - `transaction_date_col`
        - `transaction_id_col`
        - `refund_value_col`
        - `refund_flag_col`
        - `refund_proportion_col`
        


- **If you decide to rely on the solution to do feature engineering for you**, then make sure to turn **`use_ga4_data_for_feature_engineering`** to True. In this case there's no need to specify the parameters listed above.
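
For the bring-your-own-data path, a parameter block might look like the sketch below. The table names are hypothetical placeholders and the helper `find_missing_parameters` is not part of the package; it only illustrates checking that every parameter listed above has a value before the instance is created.

```python
# All values below are hypothetical placeholders; replace them with your own.
own_data_parameters = {
    'use_ga4_data_for_feature_engineering': False,
    'ml_training_table_name': 'my_ml_ready_training_table',
    'ml_prediction_table_name': 'my_ml_ready_prediction_table',
    'transaction_date_col': 'transaction_date',
    'transaction_id_col': 'transaction_id',
    'refund_value_col': 'refund_value',
    'refund_flag_col': 'refund_flag',
    'refund_proportion_col': 'refund_proportion',
}


def find_missing_parameters(parameters):
    """Lists the names of parameters that are unset (None or empty string)."""
    return [
        name for name, value in parameters.items()
        if value is None or value == ''
    ]


missing = find_missing_parameters(own_data_parameters)
if missing:
    raise ValueError(f'Set these parameters before creating the instance: {missing}')
print('All required parameters are set.')
```

These values would then be passed to the `ProductReturnPredictor` constructor in Step 4, alongside the GCP project parameters.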
    

By understanding this step, you'll have a clear picture of how your raw data evolves into the clean, transformed, and ML-ready format necessary for building powerful prediction models!

In [None]:
product_return.data_processing_feature_engineering(
    data_pipeline_type=constant.DataPipelineType.TRAINING,
    return_policy_window_in_days=return_policy_window_in_days,
    recency_of_data_in_days=recency_of_data_in_days
)

# Step 6: Run Modeling Pipeline

## Model Training, Evaluation, and Prediction (model_training_pipeline_evaluation_and_prediction)

This is a central step in the Product Return Predictor solution. This method handles the entire machine learning lifecycle, from training your models on the prepared data to evaluating their performance and generating initial predictions. It also provides valuable insights into what drives these predictions through feature importance analysis.

Here's how you'd typically call this method in your Colab notebook:
```
regression_model_type = constant.LinearBigQueryMLModelType.LINEAR_REGRESSION
binary_classifier_model_type = constant.LinearBigQueryMLModelType.LOGISTIC_REGRESSION
first_time_purchase = True

performance_metrics_dfs, model_prediction_df, predictions_actuals_distribution, feature_importance_dfs = product_return.model_training_pipeline_evaluation_and_prediction(
    is_two_step_model=True,
    regression_model_type=regression_model_type,
    binary_classifier_model_type=binary_classifier_model_type,
    first_time_purchase=first_time_purchase,
    num_tiers_to_create_avg_prediction=10,
    probability_threshold_for_prediction=0.5,
    probability_threshold_for_model_evaluation=0.5,
    bqml_template_files_dir=constant.BQML_QUERY_TEMPLATE_FILES
)
```

**What's Happening During This Step?**
The `model_training_pipeline_evaluation_and_prediction` method orchestrates a sequence of critical operations:
- **Determine Input Data Table**: The method first identifies the BigQuery table containing your ML-ready data.
    - If `use_ga4_data_for_feature_engineering` (a parameter set previously) is True, it uses the data prepared by the data_processing_feature_engineering step, specifically for "first-time purchases" or "existing customers" based on the first_time_purchase flag you provide.
    - If `use_ga4_data_for_feature_engineering` is False, it utilizes the ml_training_table_name you previously defined.

- **Model Training (model.bigquery_ml_model_training)**: The core of this step involves training one or more BigQuery ML models.
    - Single-Step Model: If is_two_step_model is False, it directly trains a single regression model (e.g., Linear Regression) to predict the refund_value or refund_proportion.
    - Two-Step Model: If is_two_step_model is True, it trains two separate models:
        - A binary classification model (e.g., Logistic Regression) to predict the refund_flag (whether a refund will occur).
        - A regression model (e.g., Linear Regression) to predict the refund_value or refund_proportion for transactions identified as likely to be refunded by the classification model.
    These models are trained directly within BigQuery, leveraging its powerful ML capabilities.

- **Model Performance Evaluation (model_prediction_evaluation.model_performance_metrics)**: After training, the solution retrieves various performance metrics for the trained models from BigQuery.
    - For regression models, metrics like Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R2 are typically evaluated.
    - For classification models, metrics such as accuracy, precision, recall, F1-score, and AUC (Area Under the Receiver Operating Characteristic Curve) are commonly assessed.
    - If a two-step model is used, it also evaluates the combined performance of both models.
    - These metrics are returned as Pandas DataFrames, providing a detailed understanding of how well your model performs.

- **Model Prediction Generation (model_prediction_evaluation.model_prediction)**: The trained models are then used to generate predictions on the training data itself. This allows for a direct comparison between the model's predictions and the actual historical outcomes.

    - **Prediction and Actuals Distribution Visualization (`model_prediction_evaluation.plot_predictions_actuals_distribution`)**: The method generates a plot showing the distribution of your model's predictions versus the actual observed values. This helps you visually inspect if the model's predictions align with the real-world data distribution.
    - **Tier-Level Average Prediction Comparison (`model_prediction_evaluation.compare_and_plot_tier_level_avg_prediction`)**: To further assess the model's performance and understand its behavior across different prediction ranges, this step divides predictions into "tiers" (e.g., deciles). It then compares the average predicted refund value within each tier to the actual average refund value for that tier, providing a valuable sanity check and insights into model calibration.
        - **Note:** The model should differentiate between orders with high and low predicted refund values, so that money and resources can be prioritized toward customers with high net value when activating on Google Ads. A useful check is therefore a chart comparing the average predicted value and the average actual value by decile of predicted refund value. If the decile-level average predicted and actual values are closely aligned and the predicted value differs substantially across deciles, the model is doing a decent job overall.

- **Feature Importance Analysis** (`model_prediction_evaluation.training_feature_importance`): Understanding why a model makes certain predictions is crucial. This step retrieves and visualizes the feature importance for your trained model(s). Feature importance indicates which input variables had the most significant impact on the model's predictions. This can help you understand the key drivers of product returns and validate business hypotheses.
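
The tier-level comparison described above can be approximated with a few lines of plain Python. This is a sketch of the general idea only: the function `tier_level_averages`, the way ties and uneven tier sizes are handled, and the toy numbers are assumptions rather than the behavior of `compare_and_plot_tier_level_avg_prediction`.

```python
def tier_level_averages(predicted, actual, num_tiers):
    """Averages predicted and actual values within tiers ranked by prediction."""
    # Rank transactions from highest to lowest predicted refund value.
    pairs = sorted(zip(predicted, actual), key=lambda pair: pair[0], reverse=True)
    n = len(pairs)
    tiers = []
    for tier in range(num_tiers):
        # Slice boundaries spread the rows as evenly as possible across tiers.
        chunk = pairs[tier * n // num_tiers:(tier + 1) * n // num_tiers]
        if not chunk:
            continue
        avg_predicted = sum(p for p, _ in chunk) / len(chunk)
        avg_actual = sum(a for _, a in chunk) / len(chunk)
        tiers.append((tier + 1, avg_predicted, avg_actual))
    return tiers


# Hypothetical predicted and actual refund values for six transactions.
predicted = [40.0, 5.0, 30.0, 0.0, 20.0, 10.0]
actual = [50.0, 0.0, 20.0, 0.0, 30.0, 0.0]
tiers = tier_level_averages(predicted, actual, num_tiers=3)
for tier, avg_predicted, avg_actual in tiers:
    print(tier, avg_predicted, avg_actual)
```

In this toy case the averages fall steadily from tier 1 to tier 3 and the predicted and actual averages stay close within each tier, which is the pattern the note above describes for a well-behaved model.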

**Output:**
The method returns several objects:
- `performance_metrics_dfs`: A dictionary containing DataFrames of model performance metrics for each trained model.
- `model_prediction_df`: A DataFrame with the model's predictions on the training data.
- `predictions_actuals_distribution`: A dictionary containing descriptive statistics of the prediction and actual distributions.
- `feature_importance_dfs`: A dictionary containing DataFrames of feature importance for each trained model.
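
As a reading aid for the regression metrics named earlier (MAE, RMSE, and R2), the sketch below computes them by hand on toy numbers. The function `regression_metrics` and the dictionary keys are hypothetical; the column names inside `performance_metrics_dfs` come from BigQuery ML and may differ.

```python
import math


def regression_metrics(actual, predicted):
    """Computes MAE, RMSE, and R2 for paired actual and predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    residual_ss = sum(e * e for e in errors)
    rmse = math.sqrt(residual_ss / n)
    mean_actual = sum(actual) / n
    total_ss = sum((a - mean_actual) ** 2 for a in actual)
    # R2 compares the model's squared error with that of always
    # predicting the mean; it is undefined when all actuals are equal.
    r2 = 1 - residual_ss / total_ss
    return {'mae': mae, 'rmse': rmse, 'r2': r2}


metrics = regression_metrics(
    actual=[0.0, 10.0, 20.0, 30.0],
    predicted=[2.0, 8.0, 22.0, 28.0],
)
print(metrics)
```

MAE and RMSE are in the same unit as the refund value, so they can be compared directly with typical order sizes, while R2 is unitless.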

**What You Need to Provide**:
When calling model_training_pipeline_evaluation_and_prediction, you'll provide the following:
- `is_two_step_model`: A boolean value that determines whether to use a two-step modeling approach.
    - Accepted Values:
        - True: The solution will first train a classification model to predict if a refund will occur (refund_flag), and then a regression model to predict the refund amount (refund_value or refund_proportion) for transactions identified as refunds.
        - False: The solution will train a single regression model to directly predict the refund_value or refund_proportion.

- `regression_model_type`: Specifies the type of regression model to use for predicting refund values.
    - Accepted Values: Values from constant.LinearBigQueryMLModelType (e.g., LINEAR_REGRESSION), constant.DNNBigQueryMLModelType (e.g., DNN_REGRESSOR), or constant.BoostedTreeBigQueryMLModelType (e.g., BOOSTED_TREE_REGRESSOR).
    - Use the appropriate constant, e.g., constant.LinearBigQueryMLModelType.LINEAR_REGRESSION.

- `binary_classifier_model_type`: Specifies the type of binary classification model to use for predicting refund flags (if is_two_step_model is True).
    - Accepted Values: Values from constant.LinearBigQueryMLModelType (e.g., LOGISTIC_REGRESSION), constant.DNNBigQueryMLModelType (e.g., DNN_CLASSIFIER), or constant.BoostedTreeBigQueryMLModelType (e.g., BOOSTED_TREE_CLASSIFIER).
    - Use the appropriate constant, e.g., constant.LinearBigQueryMLModelType.LOGISTIC_REGRESSION.

- `first_time_purchase`: A boolean flag that, when `use_ga4_data_for_feature_engineering` is True, tells the pipeline whether to train the model specifically for first-time purchasers or existing customers. This allows for tailored models based on customer behavior. This parameter is ignored if you're not using GA4 data.
    - Accepted Values: True (for first-time purchase data) or False (for existing customer data).
    - How to Provide: Set to True or False if `use_ga4_data_for_feature_engineering` is True.

- `num_tiers_to_create_avg_prediction`: The number of tiers (or bins) to divide the predictions into for the "tier-level average prediction vs. actual" comparison. More tiers provide a more granular view.
    - How to Provide: An integer value (e.g., 10).

- `probability_threshold_for_prediction`: For binary classification models, the probability cutoff used to classify a transaction as a "refund" (1) or "no refund" (0) during prediction. For example, if set to 0.5, any predicted probability greater than or equal to 0.5 is classified as a refund.
    - How to Provide: A float value between 0 and 1 (e.g., 0.5).

- `probability_threshold_for_model_evaluation`: Similar to `probability_threshold_for_prediction`, but used when computing the binary classification model's evaluation metrics (e.g., precision, recall, accuracy). You might use a different threshold for evaluation than for live prediction.
    - How to Provide: A float value between 0 and 1 (e.g., 0.5).

- `bqml_template_files_dir`: A mapping (dictionary) that provides the file paths to the BigQuery ML SQL query templates. These templates define the BigQuery ML operations for model training and prediction. This parameter usually uses a default value provided by the solution (`constant.BQML_QUERY_TEMPLATE_FILES`).
    - **Note**: Typically, you won't need to change this and can use the default constant.

- `**plot_kwargs`: Additional keyword arguments passed directly to the underlying `matplotlib.pyplot` functions used for generating plots (e.g., `figsize=(10, 6)` to control plot size, `dpi=300` for higher resolution).
    - How to Provide: You can add arguments like `figsize=(10, 7)` directly to the function call.
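
To make the interaction between `is_two_step_model` and the probability thresholds more concrete, here is a minimal sketch in plain NumPy. It is not the package's implementation: the helper name `combine_two_step_predictions`, the sample numbers, and the rule of setting the predicted refund to 0 for transactions below the threshold are all assumptions made for illustration.

```python
import numpy as np


def combine_two_step_predictions(refund_probability, predicted_refund_value, threshold=0.5):
    """Hypothetical helper (not part of product_return_predictor).

    Keeps the regression output only where the classifier's probability
    meets the threshold; otherwise the predicted refund is 0.
    """
    refund_probability = np.asarray(refund_probability, dtype=float)
    predicted_refund_value = np.asarray(predicted_refund_value, dtype=float)
    # Probabilities greater than or equal to the threshold are classified as refunds (1).
    predicted_refund_flag = (refund_probability >= threshold).astype(int)
    return predicted_refund_flag, predicted_refund_flag * predicted_refund_value


# Made-up classifier probabilities and regression outputs for four transactions.
probabilities = [0.10, 0.50, 0.80, 0.49]
regression_values = [12.0, 30.0, 45.0, 20.0]

flags, two_step_values = combine_two_step_predictions(
    probabilities, regression_values, threshold=0.5
)
print(flags.tolist())            # [0, 1, 1, 0]
print(two_step_values.tolist())  # [0.0, 30.0, 45.0, 0.0]
```

Raising the threshold in this sketch flags fewer transactions as refunds, which is why you may want different values for prediction and for evaluation.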

After this step completes, you'll have trained models, performance metrics for each of them, the predictions themselves, and the feature importance behind those predictions.

In [None]:
regression_model_type = constant.LinearBigQueryMLModelType.LINEAR_REGRESSION
binary_classifier_model_type = constant.LinearBigQueryMLModelType.LOGISTIC_REGRESSION
first_time_purchase = True

# Train, evaluate and predict; the call returns four outputs.
(
    performance_metrics_dfs,
    model_prediction_df,
    predictions_actuals_distribution,
    feature_importance_dfs,
) = product_return.model_training_pipeline_evaluation_and_prediction(
    is_two_step_model=True,
    regression_model_type=regression_model_type,
    binary_classifier_model_type=binary_classifier_model_type,
    first_time_purchase=first_time_purchase,
    num_tiers_to_create_avg_prediction=10,
    probability_threshold_for_prediction=0.5,
    probability_threshold_for_model_evaluation=0.5,
    bqml_template_files_dir=constant.BQML_QUERY_TEMPLATE_FILES,
)

The `model_training_pipeline_evaluation_and_prediction` method returns a tuple containing four key components. Each component is designed to give you a different perspective on your model's effectiveness.
- `performance_metrics_dfs`: A mapping (dictionary) in which each key identifies a model trained during the pipeline (e.g., 'regression_model', 'classification_model', 'two_step_model') and each value is a Pandas DataFrame of that model's performance metrics. Use these DataFrames to judge how well each model predicts its target variable.

- For Regression Models: You'll typically find metrics like:
     - Mean Absolute Error (MAE): The average of the absolute differences between predicted and actual values. It gives you a sense of the average magnitude of errors.
     - Mean Squared Error (MSE): The average of the squared differences between predicted and actual values. It penalizes larger errors more heavily.
     - Root Mean Squared Error (RMSE): The square root of MSE, often preferred as it's in the same units as the target variable.
     - R-squared (R2): Represents the proportion of the variance in the dependent variable that's predictable from the independent variables. A higher R2 indicates a better fit.

- For Classification Models: You'll typically find metrics like:
    - Accuracy: The proportion of correctly classified instances (both true positives and true negatives) out of the total instances.
    - Precision: Of all the instances predicted as positive, what proportion were actually positive. Useful when the cost of false positives is high.
    - Recall (Sensitivity): Of all the actual positive instances, what proportion were correctly identified. Useful when the cost of false negatives is high.
    - F1-Score: The harmonic mean of precision and recall. It's a good measure when you need a balance between precision and recall.
    - AUC (Area Under the Receiver Operating Characteristic Curve): Measures the ability of a binary classifier to discriminate between positive and negative classes. A higher AUC indicates better discriminatory power.
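
The definitions above can be checked by hand on a small example. The sketch below computes each metric with NumPy on made-up data. It only illustrates the formulas; it does not show how the package or BigQuery ML computes them, and all variable names are chosen for this example.

```python
import numpy as np

# Made-up regression targets and predictions (illustration only).
y_true = np.array([10.0, 0.0, 25.0, 5.0])
y_pred = np.array([12.0, 1.0, 20.0, 5.0])

errors = y_pred - y_true
mae = np.mean(np.abs(errors))   # average magnitude of the errors
mse = np.mean(errors ** 2)      # squaring penalizes large errors more heavily
rmse = np.sqrt(mse)             # back in the units of the target
r2 = 1.0 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)
print(f'MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} R2={r2:.3f}')

# Made-up binary labels and predicted refund probabilities.
labels = np.array([1, 0, 1, 1, 0, 0])
probabilities = np.array([0.9, 0.4, 0.3, 0.7, 0.6, 0.1])
threshold = 0.5  # plays the role of an evaluation threshold
predicted = (probabilities >= threshold).astype(int)

tp = int(np.sum((predicted == 1) & (labels == 1)))
fp = int(np.sum((predicted == 1) & (labels == 0)))
fn = int(np.sum((predicted == 0) & (labels == 1)))
tn = int(np.sum((predicted == 0) & (labels == 0)))

accuracy = (tp + tn) / len(labels)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# AUC as the share of (positive, negative) pairs ranked correctly; ties count as half.
pos = probabilities[labels == 1]
neg = probabilities[labels == 0]
auc = np.mean((pos[:, None] > neg[None, :]) + 0.5 * (pos[:, None] == neg[None, :]))
print(f'accuracy={accuracy:.3f} precision={precision:.3f} '
      f'recall={recall:.3f} f1={f1:.3f} auc={auc:.3f}')
```

Accuracy, precision, recall and F1 depend on the chosen threshold, while AUC is computed from the probabilities alone.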

You can access specific model metrics like this:

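```
# The key names below are illustrative; run list(performance_metrics_dfs.keys())
# to see the keys produced by your own run.

# For regression model metrics
print(performance_metrics_dfs['linear_regression_model'])

# For classification model metrics (if is_two_step_model is True)
print(performance_metrics_dfs['logistic_regression_model'])
```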

- `model_prediction_df`: A Pandas DataFrame containing the predictions your trained model(s) generated on the input dataset (specifically, the preprocessed data used for training and testing). It includes the original `transaction_id_col` and `transaction_date_col` (if available), the actual refund value/flag, and the model's corresponding predictions. Use it to inspect individual predictions, compare them directly with the actual outcomes, and understand model behavior at a granular level. It is also the raw data behind the distribution plots and tier-level comparisons.
    - Columns typically included:
        - `transaction_id_col` (e.g., 'transaction_id')
        - `transaction_date_col` (e.g., 'transaction_date')
        - Actual target variable column(s) (e.g., 'refund_value', 'refund_flag')
        - `prediction` (the model's predicted value)


- `predictions_actuals_distribution`: A mapping (dictionary) with the key 'prediction' and, if `use_prediction_pipeline` was False (i.e., you ran the training pipeline), also the key 'actual'. The values are Pandas Series containing descriptive statistics (e.g., count, mean, standard deviation, min, max, quartiles) for the distribution of your model's predictions and of the actual target values. This gives a high-level statistical summary of predictions and actuals, which is useful for quickly spotting bias or unexpected ranges in the model's output compared to the real data. This output corresponds to the histograms plotted by the `plot_predictions_actuals_distribution` function.

- `feature_importance_dfs`: A mapping (dictionary) where keys are the model types (e.g., 'linear_regression', 'logistic_regression') and values are Pandas DataFrames. Each DataFrame lists the features used by that specific model and their corresponding importance scores (or attribution values).
    - What it tells you: Which input variables contributed most significantly to the model's predictions.
    - This is invaluable for:
        - Interpretability: Gaining insights into the drivers of product returns.
        - Feature Engineering: Identifying features that might need further refinement or new features that could be created.
        - Domain Knowledge Validation: Confirming if the model's learned importances align with your business understanding.
    - Columns typically included:
        - `feature` (the name of the input column)
        - `attribution` (the importance score for that feature)


Check out the model performance metrics for each of the model types:

In [None]:
for model_type in performance_metrics_dfs.keys():
    print(f'performance_metrics for {model_type}:')
    display(performance_metrics_dfs[model_type])
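
If you prefer a single table over one DataFrame per model, a mapping of DataFrames can be stacked with `pd.concat`. The sketch below uses a made-up stand-in for `performance_metrics_dfs`. The keys, the `metric` and `value` columns, and the numbers are assumptions for illustration, so the layout of your real metric tables may differ.

```python
import pandas as pd

# Hypothetical stand-in for performance_metrics_dfs; keys, columns and values are made up.
metrics_by_model = {
    'regression_model': pd.DataFrame({'metric': ['mae', 'rmse'], 'value': [4.2, 7.9]}),
    'classification_model': pd.DataFrame({'metric': ['precision', 'recall'], 'value': [0.71, 0.64]}),
}

# pd.concat accepts a mapping and uses its keys as the outer index level.
combined = (
    pd.concat(metrics_by_model, names=['model_type', 'row'])
    .reset_index(level='model_type')  # turn the model key into a regular column
    .reset_index(drop=True)
)
print(combined)
```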

Check out the feature importance for each of the model types:

In [None]:
for model_type in feature_importance_dfs.keys():
    print(f'feature importance for {model_type}:')
    display(feature_importance_dfs[model_type].sort_values(by='attribution', ascending=False))
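
Raw attribution scores are easier to read as shares of the total. The sketch below builds a small table with the `feature` and `attribution` columns described earlier and adds each feature's relative and cumulative share. The feature names and scores are invented for illustration, and taking absolute values before normalizing is a choice made for this example, not something the package prescribes.

```python
import pandas as pd

# Hypothetical feature importance table; feature names and scores are made up.
feature_importance = pd.DataFrame({
    'feature': ['order_value', 'item_count', 'discount_rate', 'days_since_last_purchase'],
    'attribution': [0.40, 0.10, 0.30, 0.20],
})

ranked = (
    feature_importance
    .sort_values(by='attribution', ascending=False)
    .reset_index(drop=True)
)
# Share of total absolute attribution, and the running total down the ranking.
ranked['share'] = ranked['attribution'].abs() / ranked['attribution'].abs().sum()
ranked['cumulative_share'] = ranked['share'].cumsum()
print(ranked.head(3))
```

The cumulative share shows how many top features account for most of the attribution, which can guide where to focus feature engineering.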

You can inspect the model predictions DataFrame for the training and test datasets here:

In [None]:
model_prediction_df.head()
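
The tier-level comparison controlled by `num_tiers_to_create_avg_prediction` can be approximated directly from a predictions table. The sketch below is an assumption about how such a comparison could be built, not the package's own code: it uses synthetic data, made-up column names `actual` and `prediction`, and equal-sized tiers from `pd.qcut`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a predictions table; the column names are assumptions.
rng = np.random.default_rng(seed=0)
actual = rng.gamma(shape=2.0, scale=10.0, size=1000)
prediction = actual + rng.normal(loc=0.0, scale=5.0, size=1000)
df = pd.DataFrame({'actual': actual, 'prediction': prediction})

num_tiers = 10
# Rank first so that tied predictions cannot produce duplicate bin edges,
# then cut the ranks into equal-sized tiers numbered 1..num_tiers.
ranks = df['prediction'].rank(method='first')
df['tier'] = pd.qcut(ranks, q=num_tiers, labels=False) + 1

tier_summary = (
    df.groupby('tier')
    .agg(
        avg_prediction=('prediction', 'mean'),
        avg_actual=('actual', 'mean'),
        n=('prediction', 'size'),
    )
    .reset_index()
)
print(tier_summary)
```

If the model is well calibrated, `avg_prediction` and `avg_actual` should track each other closely across tiers. Large gaps in the top or bottom tiers point to systematic over- or under-prediction.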

You can compare the distribution of the predictions versus the actuals here (the 'actual' key is only present when you ran the training pipeline):

In [None]:
predictions_actuals_distribution['prediction']

In [None]:
predictions_actuals_distribution['actual']
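
Viewing the two summaries side by side makes differences easier to spot. The sketch below assumes both the 'prediction' and 'actual' keys are present and that each value is a Series shaped like the output of `Series.describe()`. The numbers are made up, and the `distribution` dictionary is a stand-in for `predictions_actuals_distribution`.

```python
import pandas as pd

# Stand-in for predictions_actuals_distribution, built from made-up values.
distribution = {
    'prediction': pd.Series([2.0, 15.0, 5.0, 30.0]).describe(),
    'actual': pd.Series([0.0, 20.0, 5.0, 50.0]).describe(),
}

# Building a DataFrame from the mapping aligns both Series on the statistic name.
comparison = pd.DataFrame(distribution)
comparison['difference'] = comparison['prediction'] - comparison['actual']
print(comparison)
```

A mean difference far from zero, or a much narrower range for predictions than for actuals, is a quick signal that the model is biased or is compressing extreme values.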