# AGNBoost Basic Usage Tutorial

This notebook demonstrates the basic workflow for using AGNBoost to predict AGN fractions from photometric data. We'll walk through:

1. Loading astronomical data with the Catalog class
2. Exploring the dataset structure and properties
3. Splitting data into training, validation, and test sets
4. Cleaning the data by removing rows with missing values
5. Loading a pre-trained AGN fraction model
6. Making predictions with uncertainty quantification
7. Evaluating model performance

Let's start by importing the necessary libraries and loading our data.

In [2]:
# Set agnboost folder as root
import os
os.chdir(os.path.expanduser("/home/kurt/Documents/agnboost/"))

# Import necessary libraries
import numpy as np
import pandas as pd
from agnboost import dataset, model
#from sklearn.metrics import mean_squared_error

# Set random seed for reproducibility
np.random.seed(123)

print("AGNBoost Basic Usage Tutorial")
print("=" * 40)

2025-05-25 16:36:50.131 | INFO     | agnboost.config:<module>:11 - PROJ_ROOT path is: /home/kurt/Documents/agnboost


AGNBoost Basic Usage Tutorial


## Loading the Data

We'll use the Catalog class to load our astronomical dataset. The `data/cigale_mock_small.csv` file is a small CIGALE mock catalog containing photometric measurements and AGN fraction labels for our analysis.

In [25]:
# Load the astronomical data using the Catalog class
catalog = dataset.Catalog(path="data/cigale_mock_small.csv", summarize=False)


Current working directory: /home/kurt/Documents/agnboost
Looking for bands file at: /home/kurt/Documents/agnboost/allowed_bands.json
[INFO] Loaded bands file metadata: This file contains the allowed photometric bands for JWST
[INFO] Loaded 11 allowed bands from agnboost/allowed_bands.json
[INFO] Attempting to load file with delimiter: ','
[INFO] Successfully loaded data with 1000 rows.
[INFO] Found 11 valid band columns:
[INFO]   - jwst.nircam.F115W (F115W): 1.154 μm
[INFO]   - jwst.nircam.F150W (F150W): 1.501 μm
[INFO]   - jwst.nircam.F200W (F200W): 1.988 μm
[INFO]   - jwst.nircam.F277W (F277W): 2.776 μm
[INFO]   - jwst.nircam.F356W (F356W): 3.565 μm
[INFO]   - jwst.nircam.F410M (F410M): 4.083 μm
[INFO]   - jwst.nircam.F444W (F444W): 4.402 μm
[INFO]   - jwst.miri.F770W (F770W): 7.7 μm
[INFO]   - jwst.miri.F1000W (F1000W): 10.0 μm
[INFO]   - jwst.miri.F1500W (F1500W): 15.0 μm
[INFO]   - jwst.miri.F2100W (F2100W): 21.0 μm


## Exploring the Dataset

Let's examine the structure of our data to understand what photometric bands are available and get basic statistics about our dataset. The `print_data_summary()` method provides comprehensive information about:

- Dataset dimensions and memory usage
- Photometric band validation and metadata
- Column-by-column statistics including missing values
- Summary statistics for numerical columns

This information helps us understand data quality and identify any potential issues before modeling.

In [26]:
# Display comprehensive data summary
catalog.print_data_summary()

# Check which photometric bands were validated
valid_bands = catalog.get_valid_bands()
print(f"\nValid photometric bands found: {len(valid_bands)}")
for band_name, info in valid_bands.items():
    print(f"  {band_name}: {info['shorthand']} ({info['wavelength']} μm)")

# Check if our target variable exists
target_column = 'agn.fracAGN'
if target_column in catalog.get_data().columns:
    print(f"\nTarget variable '{target_column}' found in dataset")
    target_stats = catalog.get_data()[target_column].describe()
    print("Target variable statistics:")
    print(target_stats)
else:
    print(f"Warning: Target variable '{target_column}' not found in dataset")
    print("Available columns:", list(catalog.get_data().columns))


DATA SUMMARY: cigale_mock_small.csv
Dimensions: 1000 rows × 26 columns
Memory usage: 0.20 MB
--------------------------------------------------------------------------------
Valid Band Columns:
--------------------------------------------------------------------------------
Column Name                    Shorthand       Wavelength (μm)
--------------------------------------------------------------------------------
jwst.nircam.F115W              F115W           1.154          
jwst.nircam.F150W              F150W           1.501          
jwst.nircam.F200W              F200W           1.988          
jwst.nircam.F277W              F277W           2.776          
jwst.nircam.F356W              F356W           3.565          
jwst.nircam.F410M              F410M           4.083          
jwst.nircam.F444W              F444W           4.402          
jwst.miri.F770W                F770W           7.700          
jwst.miri.F1000W               F1000W          10.000         
jwst.miri.F15

## Creating Train/Test/Validation Splits

Before any modeling, we need to split our data into separate sets for training, validation, and testing. The catalog's `split_data()` method does this, with optional stratification so that each split remains representative of the full sample.

We'll use the default split ratios:
- 60% for training
- 20% for validation  
- 20% for testing

The random state ensures reproducible results.

In [27]:
# Create train/validation/test splits
catalog.split_data(test_size=0.2, val_size=0.2, random_state=42)

# Get split information
split_info = catalog.get_train_val_test_sizes()
print("Data split summary:")
print(f"  Total samples: {split_info['total']}")
print(f"  Training: {split_info['train']['size']} ({split_info['train']['percentage']:.1f}%)")
print(f"  Validation: {split_info['validation']['size']} ({split_info['validation']['percentage']:.1f}%)")
print(f"  Test: {split_info['test']['size']} ({split_info['test']['percentage']:.1f}%)")

Data split summary:
  Total samples: 1000
  Training: 600 (60.0%)
  Validation: 200 (20.0%)
  Test: 200 (20.0%)
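
To make the 60/20/20 arithmetic concrete, here is a minimal NumPy sketch of a shuffled three-way split. It is only an illustration of the idea: the helper `split_indices` is hypothetical, it is not how `split_data()` is implemented, and it does no stratification.

```python
import numpy as np

def split_indices(n_rows, test_size=0.2, val_size=0.2, random_state=42):
    """Hypothetical helper: shuffle row indices and cut them into three groups."""
    rng = np.random.default_rng(random_state)
    shuffled = rng.permutation(n_rows)

    n_test = int(round(n_rows * test_size))
    n_val = int(round(n_rows * val_size))

    test_idx = shuffled[:n_test]
    val_idx = shuffled[n_test:n_test + n_val]
    train_idx = shuffled[n_test + n_val:]  # whatever remains is the training set
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 600 200 200
```

Fixing `random_state` makes the shuffle, and therefore the split, repeatable between runs.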


## Cleaning the Data

Real astronomical datasets often contain missing values due to various observational limitations. Before training or making predictions, we will remove rows that have NaN values in critical columns.

The `drop_nan()` method removes rows with missing values in the validated photometric band columns, ensuring our model receives complete data for all features.

In [28]:
# Drop rows with NaN values in the validated columns
catalog.drop_nan(inplace=True)


[INFO] No rows with NaN values found in the specified columns.
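
Our mock catalog has no missing values, so nothing was dropped. To see what this kind of cleaning does when values are missing, here is a small pandas sketch on a toy table. It assumes the behavior described above (only the band columns are checked) and uses plain `DataFrame.dropna`; it is not the `drop_nan()` implementation, and the toy values and the `comment` column are made up.

```python
import numpy as np
import pandas as pd

# Toy table: two band columns plus an unrelated column (all values are made up)
band_columns = ["jwst.nircam.F115W", "jwst.miri.F770W"]
toy = pd.DataFrame({
    "jwst.nircam.F115W": [24.1, np.nan, 23.5],
    "jwst.miri.F770W": [22.0, 21.7, 21.9],
    "comment": [None, "ok", "ok"],
})

# Only the band columns are checked, so the missing comment in row 0 is ignored
cleaned = toy.dropna(subset=band_columns).reset_index(drop=True)
print(f"{len(toy)} rows -> {len(cleaned)} rows")
```

Restricting the check with `subset` avoids discarding sources just because an unrelated column is empty.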


## Creating Features

AGNBoost automatically engineers features from photometric data, including colors and transformations. Let's create the feature dataframe that will be used for modeling.

By default, AGNBoost creates a feature set consisting of the photometric bands, the derived colors, and the squares of those colors.

In [29]:
# Create features for modeling
catalog.create_feature_dataframe()

# Get information about created features
features = catalog.get_features()
print("Feature engineering complete:")
print(f"  Feature dataframe shape: {features.shape}")


[INFO] Created feature dataframe with 121 columns and 1000 rows.
Feature engineering complete:
  Feature dataframe shape: (1000, 121)
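
The 121 columns are consistent with one plausible reading of the default recipe: 11 bands, plus all 55 pairwise colors, plus the 55 squared colors. The sketch below builds such a table for toy data. It is an assumption about the recipe, not AGNBoost's actual code: `build_features`, the column naming, and the random magnitudes are all hypothetical.

```python
from itertools import combinations

import numpy as np
import pandas as pd

def build_features(photometry: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical recipe: bands + all pairwise colors + squared colors."""
    columns = {band: photometry[band] for band in photometry.columns}
    for blue, red in combinations(photometry.columns, 2):
        color = photometry[blue] - photometry[red]  # color = magnitude difference
        columns[f"{blue}-{red}"] = color
        columns[f"({blue}-{red})^2"] = color ** 2
    return pd.DataFrame(columns)

bands = ["F115W", "F150W", "F200W", "F277W", "F356W", "F410M",
         "F444W", "F770W", "F1000W", "F1500W", "F2100W"]
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(24.0, 1.0, size=(5, len(bands))), columns=bands)

features_sketch = build_features(toy)
print(features_sketch.shape)  # (5, 121): 11 bands + 55 colors + 55 squares
```

Matching the column count does not prove that AGNBoost uses exactly these colors, so treat this as a way to reason about the feature count rather than a replacement for `create_feature_dataframe()`.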


## Loading the Pre-trained Model

AGNBoost comes with pre-trained models for common astronomical tasks. We'll load the model specifically trained for AGN fraction estimation (`agn.fracAGN`).

The `load_model()` method automatically:
- Checks for compatible pre-trained models
- Validates feature compatibility between the model and our data
- Loads model metadata including training parameters and performance metrics

In [30]:
# Initialize AGNBoost with the target model
agnboost_m = model.AGNBoost(
    feature_names=catalog.get_feature_names(),
    target_variables={'agn.fracAGN': 'ZABeta'},
)

# Load the pre-trained model
filename = '2025_05_22-PM06_59_58_agn.fracAGN_model.pkl.gz'
agnboost_m.load_model(file_name=filename, overwrite=True)
print(agnboost_m.models)

if agnboost_m.models['agn.fracAGN'] is not None:
    print("✅ Pre-trained model loaded successfully!")
    
    # Display model information
    model_info = agnboost_m.model_info.get('agn.fracAGN', {})
    if model_info:
        print("\nModel information:")
        if 'training_timestamp' in model_info:
            print(f"  Trained: {model_info['training_timestamp']}")
        if 'best_score' in model_info:
            print(f"  Best validation score: {model_info['best_score']:.6f}")
        if 'features' in model_info:
            print(f"  Number of features: {len(model_info['features'])}")
else:
    print("❌ No pre-trained models found!")
    print("You may need to train a new model or check the models directory.")

{'agn.fracAGN': <xgboostlss.model.XGBoostLSS object at 0x7905e74b1090>}
✅ Pre-trained model loaded successfully!

Model information:
  Best validation score: -649218.125000
  Number of features: 121


## Making Predictions

Now we'll use our loaded model to predict AGN fractions for the test set. AGNBoost converts our catalog data into the format required by the underlying XGBoostLSS model.

The prediction process uses the engineered features (the photometric bands, their colors, and the squared colors) that were created from our photometric band data.

In [31]:
# Make predictions on the test set
#agnboost_m.models['agn.fracAGN'].booster.set_param( {'device': 'cpu'})
preds = agnboost_m.predict( data = catalog, split_use = 'test', model_name = 'agn.fracAGN')

print(f"  Mean: {np.mean(preds):.6f}")
print(f"  Std: {np.std(preds):.6f}")
print(f"  Min: {np.min(preds):.6f}")
print(f"  Max: {np.max(preds):.6f}")




  Mean: 0.504962
  Std: 0.325251
  Min: 0.000308
  Max: 0.989859
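
These statistics describe the predictions alone. To evaluate performance we would compare them against the true `agn.fracAGN` values of the same test-split rows. The sketch below shows the usual regression metrics with NumPy. The helper `regression_metrics` and the toy arrays are hypothetical; how to retrieve the test-split labels from the catalog is not shown in this tutorial, so that step is left out.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Hypothetical helper: basic error metrics for fraction predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_pred - y_true
    return {
        "rmse": float(np.sqrt(np.mean(residuals ** 2))),  # penalizes large errors
        "mae": float(np.mean(np.abs(residuals))),         # typical absolute error
        "bias": float(np.mean(residuals)),                # systematic over/under-prediction
    }

# Toy values standing in for the true and predicted AGN fractions
y_true = [0.10, 0.50, 0.90]
y_pred = [0.20, 0.50, 0.80]

metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"  {name}: {value:.6f}")
```

With real data, `y_pred` would be the `preds` array from the cell above and `y_true` the matching labels in the same row order.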


## Quantifying Prediction Uncertainty

One of AGNBoost's key advantages is its ability to provide robust uncertainty estimates through XGBoostLSS distributional modeling. Rather than just point estimates, we get full uncertainty quantification for each prediction.

The `prediction_uncertainty()` method returns uncertainty estimates that account for both model uncertainty and the inherent variability in the data. This is crucial for astronomical applications where understanding prediction confidence is essential for scientific interpretation.

Since the loaded data is a CIGALE mock catalog with no photometric uncertainty, we will only estimate the model (aleatoric + epistemic) uncertainty for each source.

In [32]:
model_uncertainty = agnboost_m.prediction_uncertainty( uncertainty_type = 'model', model_name = 'agn.fracAGN', catalog = catalog)

print("✅ Uncertainty estimates generated")
print("Uncertainty statistics:")
print(f"  Mean uncertainty: {np.mean(model_uncertainty):.6f}")
print(f"  Std uncertainty: {np.std(model_uncertainty):.6f}")
print(f"  Min uncertainty: {np.min(model_uncertainty):.6f}")
print(f"  Max uncertainty: {np.max(model_uncertainty):.6f}")

Processing truncated model uncertainty: 100%|█| 1000/1000 [07:09<00:00,  2.33it/

✅ Uncertainty estimates generated
Uncertainty statistics:
  Mean uncertainty: 0.033900
  Std uncertainty: 0.013166
  Min uncertainty: 0.000940
  Max uncertainty: 0.071419
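
To build intuition for what "aleatoric + epistemic" means, the sketch below applies the law of total variance to a small set of hypothetical model draws: the aleatoric part is the average predicted variance, and the epistemic part is the spread of the predicted means. This is a generic textbook decomposition under stated assumptions, not the procedure `prediction_uncertainty()` runs internally; `decompose_uncertainty` and all numbers are made up.

```python
import numpy as np

def decompose_uncertainty(member_means, member_variances):
    """Hypothetical helper: split total predictive variance into two parts.

    Both inputs have shape (n_members, n_sources), where each member is one
    model draw that predicts a mean and a variance for every source.
    """
    member_means = np.asarray(member_means, dtype=float)
    member_variances = np.asarray(member_variances, dtype=float)

    aleatoric = member_variances.mean(axis=0)  # average noise predicted by the members
    epistemic = member_means.var(axis=0)       # disagreement between the members
    total_std = np.sqrt(aleatoric + epistemic)
    return aleatoric, epistemic, total_std

# Three hypothetical draws for two sources
means = [[0.40, 0.80], [0.50, 0.80], [0.60, 0.80]]
variances = [[0.010, 0.002], [0.020, 0.002], [0.030, 0.002]]

aleatoric, epistemic, total_std = decompose_uncertainty(means, variances)
print("aleatoric variance:", aleatoric)
print("epistemic variance:", epistemic)
print("total std:", total_std)
```

The second source has zero epistemic variance because all three draws agree on its mean, so its total uncertainty comes entirely from the predicted noise.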



