# Approach 2: Deep Learning & Representation-Based Anomaly Detection (TensorFlow - DEPRECATED)

⚠️ **IMPORTANT NOTE**: This TensorFlow/Keras version has been deprecated due to GPU compatibility issues with NVIDIA 3060 Mobile.

**Please use the PyTorch version instead: `approach2_deep_learning_pytorch.ipynb`**

The PyTorch version provides:

- Better GPU compatibility with NVIDIA 3060 Mobile
- More efficient memory usage
- Simplified and more maintainable code
- All the same functionality and better performance

---

This approach leverages deep learning techniques, particularly autoencoders and embedding layers, to detect anomalous shopping baskets by learning complex patterns in the data.

## Key Components:

1. **Advanced Data Preprocessing**: Handle mixed categorical and numerical features
2. **Embedding Layers**: Convert categorical features to dense vector representations
3. **Autoencoder Architecture**: Learn compressed representations of normal shopping behavior
4. **Reconstruction Error Analysis**: Use reconstruction loss as anomaly indicator
5. **Ensemble Deep Models**: Combine multiple architectures for robust detection
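
To make component 4 concrete, the following is a minimal sketch of how reconstruction error could be turned into an anomaly flag. It assumes a model has already produced reconstructions. The helper names `reconstruction_scores` and `flag_anomalies` and the 95th-percentile cutoff are illustrative assumptions, not part of this notebook.

```python
import numpy as np


def reconstruction_scores(original, reconstructed):
    """Mean squared reconstruction error per sample (row)."""
    original = np.asarray(original, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    return np.mean((original - reconstructed) ** 2, axis=1)


def flag_anomalies(scores, percentile=95.0):
    """Flag samples whose score exceeds the given percentile (assumed cutoff)."""
    threshold = np.percentile(scores, percentile)
    return scores > threshold, threshold


# Toy data: 99 well-reconstructed rows and one badly reconstructed row
rng = np.random.default_rng(42)
x = rng.normal(size=(100, 8))
x_hat = x + rng.normal(scale=0.05, size=x.shape)
x_hat[7] += 3.0  # simulate a basket the model cannot reconstruct

scores = reconstruction_scores(x, x_hat)
is_anomaly, threshold = flag_anomalies(scores, percentile=95.0)
print(f"threshold={threshold:.4f}, flagged={np.flatnonzero(is_anomaly)}")
```

A percentile cutoff always flags a fixed share of samples, so in practice the threshold would need tuning against whatever labels or domain knowledge are available.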

## Advantages:

- **Complex Pattern Recognition**: Captures non-linear relationships and interactions
- **Mixed Data Handling**: Naturally processes both categorical and numerical features
- **Hierarchical Learning**: Learns multiple levels of abstraction
- **Scalability**: Can handle high-dimensional data efficiently
- **Adaptive**: Learns from data without predefined rules

## Architecture Overview:

```
Input Layer → Embedding Layers → Encoder        → Bottleneck → Decoder  → Reconstruction
     ↓               ↓              ↓                 ↓           ↓             ↓
Mixed Data    Dense Vectors      Compressed       Latent       Expanded   Reconstructed
              Representations    Representation   Space        Features   Original
```
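
The data flow in the diagram can be traced with a small NumPy sketch that follows only the array shapes. All sizes and feature names here are hypothetical toy values, and the encoder and decoder are single untrained random linear maps, so this illustrates the plumbing rather than the notebook's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 2 categorical features, 3 numerical features
vocab_sizes = {"item1": 10, "make1": 6}
embedding_dims = {"item1": 4, "make1": 3}
numerical_dim, latent_dim, batch = 3, 2, 5

# Embedding tables: one row of `embedding_dim` floats per category index
tables = {
    name: rng.normal(size=(vocab_sizes[name], embedding_dims[name]))
    for name in vocab_sizes
}

# Mixed input: integer category codes plus scaled numerical values
codes = {name: rng.integers(0, vocab_sizes[name], size=batch) for name in vocab_sizes}
numerical = rng.normal(size=(batch, numerical_dim))

# Embedding lookup turns each code into a dense vector; concatenate everything
dense = np.concatenate(
    [tables[name][codes[name]] for name in vocab_sizes] + [numerical], axis=1
)

# Encoder and decoder as single random linear maps (untrained)
input_dim = dense.shape[1]
w_enc = rng.normal(size=(input_dim, latent_dim))
w_dec = rng.normal(size=(latent_dim, input_dim))
latent = np.maximum(dense @ w_enc, 0.0)  # bottleneck with ReLU
reconstruction = latent @ w_dec  # same width as the dense input

print(dense.shape, latent.shape, reconstruction.shape)
```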


In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings

warnings.filterwarnings("ignore")

# Deep Learning libraries
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, Model
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import MeanSquaredError

# Preprocessing and utilities
from sklearn.preprocessing import StandardScaler, LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.decomposition import PCA

# Set random seeds for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

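# Configure TensorFlow: let GPU memory grow on demand instead of reserving it all up front
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)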
print(f"TensorFlow version: {tf.__version__}")
print(f"GPU Available: {tf.config.list_physical_devices('GPU')}")

# Set paths
DATA_PATH = Path("data/")
RESULTS_PATH = Path("results/")
MODELS_PATH = Path("models/")

# Create directories
RESULTS_PATH.mkdir(exist_ok=True)
MODELS_PATH.mkdir(exist_ok=True)

# Set plotting style
plt.style.use("seaborn-v0_8")
sns.set_palette("husl")
plt.rcParams["figure.figsize"] = (12, 8)

2025-06-27 16:10:38.041830: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1751033438.054208  247552 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1751033438.058565  247552 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-06-27 16:10:38.071844: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


TensorFlow version: 2.18.0
GPU Available: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


## Step 1: Advanced Data Loading and Exploration

Load the data and perform initial analysis to understand the structure for deep learning preprocessing.


In [2]:
# Load datasets
print("Loading datasets...")
train_df = pd.read_csv(DATA_PATH / "X_train_G3tdtEn.csv")
test_df = pd.read_csv(DATA_PATH / "X_test_8skS2ey.csv")

print(f"Training data shape: {train_df.shape}")
print(f"Test data shape: {test_df.shape}")

# Identify different column types for deep learning preprocessing
print("\nIdentifying column types...")

# Categorical feature columns
item_cols = [col for col in train_df.columns if col.startswith("item")]
make_cols = [col for col in train_df.columns if col.startswith("make")]
model_cols = [col for col in train_df.columns if col.startswith("model")]
goods_code_cols = [col for col in train_df.columns if col.startswith("goods_code")]

# Numerical feature columns
price_cols = [col for col in train_df.columns if col.startswith("cash_price")]
quantity_cols = [
    col for col in train_df.columns if col.startswith("Nbr_of_prod_purchas")
]

# Metadata columns
meta_cols = ["ID", "Nb_of_items"]

print(f"Categorical columns:")
print(f"  - Items: {len(item_cols)}")
print(f"  - Makes: {len(make_cols)}")
print(f"  - Models: {len(model_cols)}")
print(f"  - Goods codes: {len(goods_code_cols)}")
print(f"\nNumerical columns:")
print(f"  - Prices: {len(price_cols)}")
print(f"  - Quantities: {len(quantity_cols)}")
print(f"\nMetadata columns: {len(meta_cols)}")

# Combine all categorical columns
all_categorical_cols = item_cols + make_cols + model_cols + goods_code_cols
all_numerical_cols = price_cols + quantity_cols + ["Nb_of_items"]

print(f"\nTotal categorical features: {len(all_categorical_cols)}")
print(f"Total numerical features: {len(all_numerical_cols)}")

Loading datasets...
Training data shape: (92790, 146)
Test data shape: (23198, 146)

Identifying column types...
Categorical columns:
  - Items: 24
  - Makes: 24
  - Models: 24
  - Goods codes: 24

Numerical columns:
  - Prices: 24
  - Quantities: 24

Metadata columns: 2

Total categorical features: 96
Total numerical features: 49
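
The prefix-based grouping in the cell above can also be written as one reusable helper that reports columns matching no prefix, which makes it easier to spot unexpected columns. This is a sketch: `group_columns_by_prefix` is a hypothetical helper, and the toy column list only mimics the dataset's naming scheme.

```python
from collections import defaultdict

# Prefixes taken from the cell above
PREFIXES = ("item", "make", "model", "goods_code", "cash_price", "Nbr_of_prod_purchas")


def group_columns_by_prefix(columns, prefixes=PREFIXES):
    """Return {prefix: [columns]} plus a list of columns matching no prefix."""
    groups = defaultdict(list)
    unmatched = []
    for col in columns:
        for prefix in prefixes:
            if col.startswith(prefix):
                groups[prefix].append(col)
                break
        else:
            # No prefix matched (for example ID or Nb_of_items)
            unmatched.append(col)
    return dict(groups), unmatched


# Toy column list mimicking the dataset's naming scheme
columns = [
    "ID", "item1", "item2", "make1", "model1", "goods_code1",
    "cash_price1", "Nbr_of_prod_purchas1", "Nb_of_items",
]
groups, unmatched = group_columns_by_prefix(columns)
print(groups)
print("Unmatched:", unmatched)
```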


## Step 2: Deep Learning Data Preprocessing

Prepare the data specifically for deep learning models, handling categorical and numerical features appropriately.


In [3]:
class DeepLearningPreprocessor:
    """
    Advanced preprocessor for deep learning anomaly detection
    Handles mixed categorical and numerical features
    """

    def __init__(self):
        self.label_encoders = {}
        self.numerical_scaler = StandardScaler()
        self.categorical_vocab = {}
        self.is_fitted = False

    def analyze_categorical_features(self, df, categorical_cols):
        """
        Analyze categorical features to understand their characteristics
        """
        print("Analyzing categorical features...")

        categorical_info = {}

        for col in categorical_cols:
            # Get non-null values
            non_null_values = df[col].dropna()
            unique_values = non_null_values.unique()

            categorical_info[col] = {
                "unique_count": len(unique_values),
                "null_count": df[col].isnull().sum(),
                "null_percentage": (df[col].isnull().sum() / len(df)) * 100,
                "most_frequent": non_null_values.mode().iloc[0]
                if len(non_null_values) > 0
                else None,
                "sample_values": unique_values[:5].tolist()
                if len(unique_values) > 0
                else [],
            }

        return categorical_info

    def prepare_categorical_features(self, df, categorical_cols):
        """
        Prepare categorical features for embedding layers
        """
        print("Preparing categorical features...")

        processed_categorical = pd.DataFrame()

        for col in categorical_cols:
            # Fill missing values with 'MISSING'
            filled_col = df[col].fillna("MISSING")

            if not self.is_fitted:
                # Fit label encoder during training
                le = LabelEncoder()
                # Add unknown category for test data
                unique_values = list(filled_col.unique()) + ["UNKNOWN"]
                le.fit(unique_values)
                self.label_encoders[col] = le

                # Store vocabulary size for embedding layer
                self.categorical_vocab[col] = len(le.classes_)

            # Transform the column, mapping categories unseen during fitting
            # (e.g. in test data) to the reserved 'UNKNOWN' class
            encoder = self.label_encoders[col]
            known = filled_col.isin(encoder.classes_)
            encoded_col = encoder.transform(filled_col.where(known, "UNKNOWN"))

            processed_categorical[col] = encoded_col

        return processed_categorical

    def prepare_numerical_features(self, df, numerical_cols):
        """
        Prepare numerical features for neural networks
        """
        print("Preparing numerical features...")

        # Select numerical columns
        numerical_data = df[numerical_cols].copy()

        # Fill missing values with 0 (representing no item/price)
        numerical_data = numerical_data.fillna(0)

        # Handle infinite values
        numerical_data = numerical_data.replace([np.inf, -np.inf], 0)

        # Scale numerical features
        if not self.is_fitted:
            scaled_data = self.numerical_scaler.fit_transform(numerical_data)
        else:
            scaled_data = self.numerical_scaler.transform(numerical_data)

        return pd.DataFrame(scaled_data, columns=numerical_cols, index=df.index)

    def fit_transform(self, df, categorical_cols, numerical_cols):
        """
        Fit the preprocessor and transform the data
        """
        print("Fitting preprocessor and transforming data...")

        # Analyze categorical features
        categorical_info = self.analyze_categorical_features(df, categorical_cols)

        # Prepare features
        processed_categorical = self.prepare_categorical_features(df, categorical_cols)
        processed_numerical = self.prepare_numerical_features(df, numerical_cols)

        self.is_fitted = True

        return processed_categorical, processed_numerical, categorical_info

    def transform(self, df, categorical_cols, numerical_cols):
        """
        Transform new data using fitted preprocessor
        """
        if not self.is_fitted:
            raise ValueError("Preprocessor must be fitted before transform")

        print("Transforming new data...")

        processed_categorical = self.prepare_categorical_features(df, categorical_cols)
        processed_numerical = self.prepare_numerical_features(df, numerical_cols)

        return processed_categorical, processed_numerical


# Initialize preprocessor
preprocessor = DeepLearningPreprocessor()

# Fit and transform training data
train_categorical, train_numerical, categorical_info = preprocessor.fit_transform(
    train_df, all_categorical_cols, all_numerical_cols
)

print(f"\nProcessed training data shapes:")
print(f"Categorical features: {train_categorical.shape}")
print(f"Numerical features: {train_numerical.shape}")

# Display some categorical feature statistics
print("\nCategorical feature analysis (first 5 features):")
for i, (col, info) in enumerate(list(categorical_info.items())[:5]):
    print(
        f"{col}: {info['unique_count']} unique values, {info['null_percentage']:.1f}% missing"
    )
    print(f"  Sample values: {info['sample_values']}")

Fitting preprocessor and transforming data...
Analyzing categorical features...
Preparing categorical features...
Preparing numerical features...

Processed training data shapes:
Categorical features: (92790, 96)
Numerical features: (92790, 49)

Categorical feature analysis (first 5 features):
item1: 134 unique values, 0.0% missing
  Sample values: ['COMPUTERS', 'COMPUTER PERIPHERALS ACCESSORIES', 'TELEVISIONS HOME CINEMA', 'BEDROOM FURNITURE', 'LIVING & DINING FURNITURE']
item2: 137 unique values, 51.9% missing
  Sample values: ['COMPUTER PERIPHERALS ACCESSORIES', 'SERVICE', 'CABLES ADAPTERS', 'BEDROOM FURNITURE', 'LIVING & DINING FURNITURE']
item3: 125 unique values, 86.1% missing
  Sample values: ['FULFILMENT CHARGE', 'COMPUTER PERIPHERALS ACCESSORIES', 'COMPUTER PERIPHERALS & ACCESSORIES', 'LIVING & DINING FURNITURE', 'LIVING DINING FURNITURE']
item4: 124 unique values, 95.1% missing
  Sample values: ['FULFILMENT CHARGE', 'LIVING DINING FURNITURE',
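
The idea behind the `MISSING` and `UNKNOWN` categories can be shown without scikit-learn. The sketch below is a simplified stand-in for the notebook's `LabelEncoder`-based logic, not a replacement for it: `VocabularyEncoder` is a hypothetical class, and it uses `None` for missing values where pandas would use `NaN`.

```python
import numpy as np


class VocabularyEncoder:
    """Maps category strings to integer indices with MISSING/UNKNOWN slots."""

    def fit(self, values):
        seen = {v for v in values if v is not None}
        # Reserve two slots so missing and unseen values get their own index
        self.classes_ = sorted(seen | {"MISSING", "UNKNOWN"})
        self.index_ = {label: i for i, label in enumerate(self.classes_)}
        return self

    def transform(self, values):
        unknown = self.index_["UNKNOWN"]
        filled = ("MISSING" if v is None else v for v in values)
        # dict.get falls back to the UNKNOWN index for unseen labels
        return np.array([self.index_.get(v, unknown) for v in filled], dtype=np.int32)


encoder = VocabularyEncoder().fit(["COMPUTERS", "SERVICE", None, "COMPUTERS"])
codes = encoder.transform(["SERVICE", None, "GARDEN TOOLS"])
print(encoder.classes_)
print(codes)
```

Because every index stays below `len(encoder.classes_)`, that length can be used directly as the `input_dim` of an embedding layer.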

## Step 3: Data Preparation for Deep Learning Models

Create the final input format suitable for embedding layers and autoencoder architecture.


In [4]:
def create_deep_learning_inputs(categorical_df, numerical_df, categorical_vocab):
    """
    Create input dictionaries for deep learning models with embedding layers
    """
    print("Creating deep learning input format...")

    # Create input dictionary for categorical features (for embedding layers)
    categorical_inputs = {}
    for col in categorical_df.columns:
        categorical_inputs[col] = categorical_df[col].values

    # Numerical features as a single array
    numerical_inputs = numerical_df.values

    # Create vocabulary information for embedding layers
    embedding_info = {}
    for col in categorical_df.columns:
        vocab_size = categorical_vocab[col]
        # Embedding dimension rule of thumb: vocab_size // 2 + 1, clamped to [4, 50]
        embedding_dim = min(50, max(4, vocab_size // 2 + 1))
        embedding_info[col] = {"vocab_size": vocab_size, "embedding_dim": embedding_dim}

    print(f"Categorical inputs created: {len(categorical_inputs)} features")
    print(f"Numerical inputs shape: {numerical_inputs.shape}")
    print(f"\nEmbedding dimensions (first 5):")
    for i, (col, info) in enumerate(list(embedding_info.items())[:5]):
        print(
            f"  {col}: vocab_size={info['vocab_size']}, embedding_dim={info['embedding_dim']}"
        )

    return categorical_inputs, numerical_inputs, embedding_info


# Create deep learning inputs for training data
train_cat_inputs, train_num_inputs, embedding_info = create_deep_learning_inputs(
    train_categorical, train_numerical, preprocessor.categorical_vocab
)

# Store training data info for later use
print(f"\nTraining data summary:")
print(f"Number of samples: {len(train_num_inputs)}")
print(f"Categorical features: {len(train_cat_inputs)}")
print(f"Numerical features: {train_num_inputs.shape[1]}")
print(
    f"Total embedding parameters: {sum(info['vocab_size'] * info['embedding_dim'] for info in embedding_info.values())}"
)

Creating deep learning input format...
Categorical inputs created: 96 features
Numerical inputs shape: (92790, 49)

Embedding dimensions (first 5):
  item1: vocab_size=134, embedding_dim=50
  item2: vocab_size=139, embedding_dim=50
  item3: vocab_size=127, embedding_dim=50
  item4: vocab_size=126, embedding_dim=50
  item5: vocab_size=109, embedding_dim=50

Training data summary:
Number of samples: 92790
Categorical features: 96
Numerical features: 49
Total embedding parameters: 2178967
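
The embedding-size heuristic from the cell above is easy to check by hand. The sketch below restates it as a standalone function and evaluates it for a few vocabulary sizes. The function names are illustrative, and the vocabulary sizes other than 134 are made-up examples.

```python
def embedding_dim(vocab_size, min_dim=4, max_dim=50):
    """Clamp vocab_size // 2 + 1 to the range [min_dim, max_dim]."""
    return min(max_dim, max(min_dim, vocab_size // 2 + 1))


def embedding_parameters(vocab_size):
    """Number of weights in one embedding table: rows x columns."""
    return vocab_size * embedding_dim(vocab_size)


for vocab_size in (3, 20, 134):
    print(vocab_size, embedding_dim(vocab_size), embedding_parameters(vocab_size))
```

Any feature with a vocabulary of roughly 100 or more categories hits the cap of 50, which is why the first five features listed above all use 50-dimensional embeddings.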


## Summary of Steps 1-3: Data Preprocessing for Deep Learning

### What we accomplished:

1. **Advanced Data Loading**: Loaded and analyzed the structure of our shopping basket data

2. **Feature Type Identification**:

   - Categorical features: Items, Makes, Models, Goods Codes
   - Numerical features: Prices, Quantities, Item counts

3. **Deep Learning Preprocessing**:

   - Created a specialized preprocessor for mixed data types
   - Handled missing values appropriately for each feature type
   - Encoded categorical features for embedding layers
   - Normalized numerical features for neural networks

4. **Embedding Preparation**:
   - Calculated vocabulary sizes for each categorical feature
   - Chose embedding dimensions with a rule-of-thumb heuristic (between 4 and 50 per feature)
   - Created input format suitable for TensorFlow/Keras

### Next Steps:

1. **Autoencoder Architecture Design**: Build encoder-decoder network with embedding layers
2. **Model Training**: Train the autoencoder on normal shopping patterns
3. **Anomaly Detection**: Use reconstruction error as anomaly score
4. **Evaluation and Optimization**: Fine-tune model parameters
5. **Results Generation**: Apply to test data and create submission files

The preprocessing foundation is now ready for building sophisticated deep learning models that can capture complex patterns in shopping behavior!


## Step 4: Autoencoder Architecture Design

Design a sophisticated autoencoder architecture optimized for NVIDIA 3060 Mobile GPU with embedding layers for categorical features.


In [5]:
# Configure GPU memory growth and optimization for NVIDIA 3060 Mobile
print("Configuring GPU for optimal performance...")

# Configure GPU memory growth to prevent VRAM issues
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        print(f"GPU configured: {gpus}")

        # Set mixed precision for better performance on RTX 3060
        tf.keras.mixed_precision.set_global_policy("mixed_float16")
        print("Mixed precision enabled for better GPU utilization")

    except RuntimeError as e:
        print(e)
else:
    print("No GPU found, using CPU")

# Settings for the RTX 3060 Laptop GPU (6 GB VRAM)
BATCH_SIZE = 128  # Batch size chosen for the RTX 3060 Laptop GPU
MAX_MEMORY_PER_BATCH = 0.7  # Use 70% of available VRAM


class ShoppingBasketAutoencoder:
    """
    Advanced Autoencoder for Shopping Basket Anomaly Detection
    Optimized for NVIDIA 3060 Mobile GPU
    """

    def __init__(
        self, embedding_info, numerical_input_dim, latent_dim=64, dropout_rate=0.3
    ):
        self.embedding_info = embedding_info
        self.numerical_input_dim = numerical_input_dim
        self.latent_dim = latent_dim
        self.dropout_rate = dropout_rate
        self.model = None
        self.encoder = None
        self.decoder = None

    def create_embedding_layers(self):
        """
        Create embedding layers for categorical features
        """
        embedding_inputs = {}
        embedding_layers = {}

        for feature_name, info in self.embedding_info.items():
            # Input layer for this categorical feature
            input_layer = layers.Input(
                shape=(1,), name=f"{feature_name}_input", dtype="int32"
            )
            embedding_inputs[feature_name] = input_layer

            # Embedding layer with L2 regularization
            embedding = layers.Embedding(
                input_dim=info["vocab_size"],
                output_dim=info["embedding_dim"],
                embeddings_regularizer=keras.regularizers.l2(0.001),
                name=f"{feature_name}_embedding",
            )(input_layer)

            # Flatten the embedding
            flattened = layers.Flatten(name=f"{feature_name}_flatten")(embedding)
            embedding_layers[feature_name] = flattened

        return embedding_inputs, embedding_layers

    def build_encoder(self, embedding_inputs, embedding_layers):
        """
        Build the encoder part of the autoencoder
        """
        # Numerical input
        numerical_input = layers.Input(
            shape=(self.numerical_input_dim,), name="numerical_input", dtype="float32"
        )

        # Concatenate all embeddings
        if embedding_layers:
            concatenated_embeddings = layers.Concatenate(
                name="concatenated_embeddings"
            )(list(embedding_layers.values()))

            # Combine embeddings with numerical features
            combined_features = layers.Concatenate(name="combined_features")(
                [concatenated_embeddings, numerical_input]
            )
        else:
            combined_features = numerical_input

        # Calculate total input dimension
        total_embedding_dim = sum(
            info["embedding_dim"] for info in self.embedding_info.values()
        )
        total_input_dim = total_embedding_dim + self.numerical_input_dim

        print(f"Total input dimension: {total_input_dim}")
        print(f"Embedding dimension: {total_embedding_dim}")
        print(f"Numerical dimension: {self.numerical_input_dim}")

        # Encoder layers with batch normalization and dropout
        # Gradually reduce dimensions to latent space
        encoder_layer1 = layers.Dense(
            min(512, total_input_dim // 2), activation="relu", name="encoder_dense1"
        )(combined_features)
        encoder_layer1 = layers.BatchNormalization(name="encoder_bn1")(encoder_layer1)
        encoder_layer1 = layers.Dropout(self.dropout_rate, name="encoder_dropout1")(
            encoder_layer1
        )

        encoder_layer2 = layers.Dense(
            min(256, total_input_dim // 4), activation="relu", name="encoder_dense2"
        )(encoder_layer1)
        encoder_layer2 = layers.BatchNormalization(name="encoder_bn2")(encoder_layer2)
        encoder_layer2 = layers.Dropout(self.dropout_rate, name="encoder_dropout2")(
            encoder_layer2
        )

        encoder_layer3 = layers.Dense(
            min(128, total_input_dim // 8), activation="relu", name="encoder_dense3"
        )(encoder_layer2)
        encoder_layer3 = layers.BatchNormalization(name="encoder_bn3")(encoder_layer3)
        encoder_layer3 = layers.Dropout(self.dropout_rate, name="encoder_dropout3")(
            encoder_layer3
        )

        # Latent representation (bottleneck)
        latent_representation = layers.Dense(
            self.latent_dim, activation="relu", name="latent_layer"
        )(encoder_layer3)

        # Create encoder model
        all_inputs = list(embedding_inputs.values()) + [numerical_input]
        encoder = Model(
            inputs=all_inputs, outputs=latent_representation, name="encoder"
        )

        return encoder, numerical_input, all_inputs, total_input_dim

    def build_decoder(self, latent_input, total_output_dim):
        """
        Build the decoder part of the autoencoder
        """
        # Decoder layers - mirror the encoder structure
        decoder_layer1 = layers.Dense(
            min(128, total_output_dim // 8), activation="relu", name="decoder_dense1"
        )(latent_input)
        decoder_layer1 = layers.BatchNormalization(name="decoder_bn1")(decoder_layer1)
        decoder_layer1 = layers.Dropout(self.dropout_rate, name="decoder_dropout1")(
            decoder_layer1
        )

        decoder_layer2 = layers.Dense(
            min(256, total_output_dim // 4), activation="relu", name="decoder_dense2"
        )(decoder_layer1)
        decoder_layer2 = layers.BatchNormalization(name="decoder_bn2")(decoder_layer2)
        decoder_layer2 = layers.Dropout(self.dropout_rate, name="decoder_dropout2")(
            decoder_layer2
        )

        decoder_layer3 = layers.Dense(
            min(512, total_output_dim // 2), activation="relu", name="decoder_dense3"
        )(decoder_layer2)
        decoder_layer3 = layers.BatchNormalization(name="decoder_bn3")(decoder_layer3)
        decoder_layer3 = layers.Dropout(self.dropout_rate, name="decoder_dropout3")(
            decoder_layer3
        )

        # Output layer - reconstruct the original features
        # Split output for categorical and numerical features

        # Categorical reconstruction (embedding space)
        embedding_output_dim = sum(
            info["embedding_dim"] for info in self.embedding_info.values()
        )
        categorical_output = layers.Dense(
            embedding_output_dim,
            activation="sigmoid",  # Bounds outputs to (0, 1); learned embeddings are not limited to this range
            name="categorical_reconstruction",
        )(decoder_layer3)

        # Numerical reconstruction
        numerical_output = layers.Dense(
            self.numerical_input_dim,
            activation="linear",  # Linear for normalized numerical features
            name="numerical_reconstruction",
        )(decoder_layer3)

        # Combine outputs
        combined_output = layers.Concatenate(name="combined_reconstruction")(
            [categorical_output, numerical_output]
        )

        return combined_output, categorical_output, numerical_output

    def build_complete_model(self):
        """
        Build the complete autoencoder model
        """
        print("Building autoencoder architecture...")

        # Create embedding layers
        embedding_inputs, embedding_layers = self.create_embedding_layers()

        # Build encoder
        encoder, numerical_input, all_inputs, total_input_dim = self.build_encoder(
            embedding_inputs, embedding_layers
        )

        # Build decoder
        latent_input = layers.Input(shape=(self.latent_dim,), name="latent_input")
        combined_output, categorical_output, numerical_output = self.build_decoder(
            latent_input, total_input_dim
        )

        # Create decoder model
        decoder = Model(inputs=latent_input, outputs=combined_output, name="decoder")

        # Create complete autoencoder
        encoded = encoder(all_inputs)
        decoded = decoder(encoded)
        autoencoder = Model(inputs=all_inputs, outputs=decoded, name="autoencoder")

        # Store models
        self.encoder = encoder
        self.decoder = decoder
        self.model = autoencoder

        # Print model summaries
        print("\nEncoder Architecture:")
        self.encoder.summary()
        print("\nDecoder Architecture:")
        self.decoder.summary()
        print("\nComplete Autoencoder Architecture:")
        self.model.summary()

        return autoencoder, encoder, decoder


# Initialize the autoencoder
print("Initializing autoencoder with optimized architecture...")
autoencoder_model = ShoppingBasketAutoencoder(
    embedding_info=embedding_info,
    numerical_input_dim=train_num_inputs.shape[1],
    latent_dim=64,  # Compressed representation size
    dropout_rate=0.3,  # Regularization
)

# Build the model
autoencoder, encoder, decoder = autoencoder_model.build_complete_model()

Configuring GPU for optimal performance...
GPU configured: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Mixed precision enabled for better GPU utilization
Initializing autoencoder with optimized architecture...
Building autoencoder architecture...


I0000 00:00:1751033444.286789  247552 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 3295 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3060 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6
W0000 00:00:1751033444.447762  248011 gpu_backend_lib.cc:579] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
Searched for CUDA in the following directories:
  ./cuda_sdk_lib
  ipykernel_launcher.runfiles/cuda_nvcc
  ipykern/cuda_nvcc
  
  /usr/local/cuda
  /home/oussama/.local/share/mamba/envs/tf/lib/python3.11/site-packages/tensorflow/python/platform/../../../nvidia/cuda_nvcc
  /home/oussama/.local/share/mamba/envs/tf/lib/python3.11/site-packages/tensorflow/python/platform/../../../../nvidia/cuda_nvcc
  /home/oussama/.local/share/mamba/envs/tf/lib/python3.11/site-packages/tensorflow/python/platform/../../cuda
  .
You can c

Total input dimension: 3940
Embedding dimension: 3891
Numerical dimension: 49
\nEncoder Architecture:


\nDecoder Architecture:


\nComplete Autoencoder Architecture:
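
Since the model summaries are not shown in the output above, the hidden-layer widths can be recomputed from the sizing rule in `build_encoder`. This is a small sketch: `encoder_layer_widths` is a hypothetical helper that restates the `min(cap, total_input_dim // divisor)` rule from the cell above.

```python
def encoder_layer_widths(total_input_dim, latent_dim=64, caps=(512, 256, 128)):
    """Hidden widths used by the encoder: min(cap, total_input_dim // divisor)."""
    widths = [
        min(cap, total_input_dim // (2 ** (i + 1)))  # divisors 2, 4, 8
        for i, cap in enumerate(caps)
    ]
    return widths + [latent_dim]


widths = encoder_layer_widths(3940)
print(" -> ".join(str(w) for w in [3940] + widths))

# With a much smaller input the rule no longer narrows monotonically
print(encoder_layer_widths(100))
```

For the 3940-dimensional input all three caps apply. For small inputs the fixed latent size of 64 can end up wider than the layer before it, so the rule would need adjusting if the feature set shrinks.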


## Step 5: Training Process

Train the autoencoder on normal patterns using reconstruction loss, optimized for NVIDIA 3060 Mobile performance.
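
The reconstruction loss used in this section is a weighted sum of mean squared error and mean absolute error (weights 0.7 and 0.3). The NumPy sketch below reproduces that arithmetic on a tiny example so the weighting is easy to verify by hand. `combined_reconstruction_loss` is an illustrative name, and the arrays are made-up values.

```python
import numpy as np


def combined_reconstruction_loss(y_true, y_pred, mse_weight=0.7, mae_weight=0.3):
    """Weighted sum of mean squared error and mean absolute error."""
    error = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    mse = np.mean(error ** 2)
    mae = np.mean(np.abs(error))
    return mse_weight * mse + mae_weight * mae


y_true = np.array([[0.0, 1.0], [2.0, 3.0]])
y_pred = np.array([[0.0, 2.0], [2.0, 1.0]])

# Errors are 0, -1, 0, 2: MSE = 1.25 and MAE = 0.75
loss = combined_reconstruction_loss(y_true, y_pred)
print(f"{loss:.3f}")
```

The squared term penalizes large errors strongly, while the absolute term keeps the gradient from vanishing for small ones.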


In [6]:
def prepare_training_data(categorical_inputs, numerical_inputs, batch_size=128):
    """
    Prepare data for training with optimal batching for GPU
    """
    print("Preparing training data...")

    # Total width of all flattened embeddings (reads the global `embedding_info`)
    embedding_dim = sum(info["embedding_dim"] for info in embedding_info.values())

    # Placeholder target for the categorical part: the embeddings are learned
    # inside the model, so no fixed target exists here and zeros are used instead.
    # As a result, only the numerical columns carry a real reconstruction signal.
    target_embeddings = np.zeros((len(numerical_inputs), embedding_dim))

    # Full target: zero placeholder for the embeddings followed by the numerical features
    target_features = np.concatenate([target_embeddings, numerical_inputs], axis=1)

    # Prepare input dictionary for the model
    input_dict = {}
    for feature_name in categorical_inputs.keys():
        input_dict[f"{feature_name}_input"] = categorical_inputs[feature_name]
    input_dict["numerical_input"] = numerical_inputs

    return input_dict, target_features


def create_training_callbacks(model_path, patience=15):
    """
    Create optimized callbacks for NVIDIA 3060 Mobile
    """
    callbacks = [
        EarlyStopping(
            monitor="val_loss", patience=patience, restore_best_weights=True, verbose=1
        ),
        ReduceLROnPlateau(
            monitor="val_loss", factor=0.5, patience=8, min_lr=1e-7, verbose=1
        ),
        keras.callbacks.ModelCheckpoint(
            filepath=model_path,
            monitor="val_loss",
            save_best_only=True,
            save_weights_only=False,
            verbose=1,
        ),
        keras.callbacks.TensorBoard(
            log_dir="logs", histogram_freq=1, write_graph=True, write_images=False
        ),
    ]

    return callbacks


# Prepare training data
print("Preparing data for training...")
train_inputs, train_targets = prepare_training_data(
    train_cat_inputs, train_num_inputs, BATCH_SIZE
)

# Split data for validation
validation_split = 0.2
val_size = int(len(train_num_inputs) * validation_split)

# Create validation split
val_inputs = {}
train_inputs_final = {}

for key, values in train_inputs.items():
    val_inputs[key] = values[-val_size:]
    train_inputs_final[key] = values[:-val_size]

val_targets = train_targets[-val_size:]
train_targets_final = train_targets[:-val_size]

print(f"Training samples: {len(train_targets_final)}")
print(f"Validation samples: {len(val_targets)}")
print(f"Input features: {train_targets.shape[1]}")

# Compile model with optimized settings for GPU
print("Compiling model...")

# Standard Adam optimizer
optimizer = keras.optimizers.Adam(
    learning_rate=0.001,
    epsilon=1e-7,  # Keras default value
)


# Custom loss function for reconstruction
def reconstruction_loss(y_true, y_pred):
    """
    Custom reconstruction loss combining MSE and MAE for stability
    """
    mse_loss = tf.keras.losses.MeanSquaredError()(y_true, y_pred)
    mae_loss = tf.keras.losses.MeanAbsoluteError()(y_true, y_pred)

    # Combine losses with weights
    combined_loss = 0.7 * mse_loss + 0.3 * mae_loss
    return combined_loss


autoencoder.compile(
    optimizer=optimizer, loss=reconstruction_loss, metrics=["mse", "mae"]
)

# Create callbacks
model_save_path = MODELS_PATH / "autoencoder_best.keras"
callbacks = create_training_callbacks(model_save_path)

print("Model compiled successfully!")
print(f"Batch size optimized for RTX 3060: {BATCH_SIZE}")
print(f"Mixed precision enabled: {tf.keras.mixed_precision.global_policy().name}")

Preparing data for training...
Preparing training data...
Training samples: 74232
Validation samples: 18558
Input features: 3940
Compiling model...
Model compiled successfully!
Batch size optimized for RTX 3060: 128
Mixed precision enabled: mixed_float16


In [7]:
# Train the autoencoder
print("Starting training...")
print("=" * 50)


# Monitor GPU memory usage during training
def monitor_gpu_memory():
    if gpus:
        gpu_details = tf.config.experimental.get_memory_info("GPU:0")
        current_mb = gpu_details["current"] / 1024 / 1024
        peak_mb = gpu_details["peak"] / 1024 / 1024
        print(f"GPU Memory - Current: {current_mb:.0f}MB, Peak: {peak_mb:.0f}MB")


print("Initial GPU memory status:")
monitor_gpu_memory()

# Training parameters optimized for RTX 3060 Mobile
EPOCHS = 100
BATCH_SIZE = 128  # Optimal for 8GB VRAM

# Start training with progress monitoring
history = autoencoder.fit(
    train_inputs_final,
    train_targets_final,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    validation_data=(val_inputs, val_targets),
    callbacks=callbacks,
    verbose=1,
    shuffle=True,
)

print("Training completed!")
print("=" * 50)

# Monitor final GPU memory usage
print("Final GPU memory status:")
monitor_gpu_memory()


# Plot training history
def plot_training_history(history):
    """
    Plot training and validation loss curves
    """
    fig, axes = plt.subplots(1, 3, figsize=(18, 6))

    # Loss
    axes[0].plot(history.history["loss"], label="Training Loss", color="blue")
    axes[0].plot(history.history["val_loss"], label="Validation Loss", color="red")
    axes[0].set_title("Model Loss")
    axes[0].set_xlabel("Epoch")
    axes[0].set_ylabel("Loss")
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)

    # MSE
    axes[1].plot(history.history["mse"], label="Training MSE", color="blue")
    axes[1].plot(history.history["val_mse"], label="Validation MSE", color="red")
    axes[1].set_title("Mean Squared Error")
    axes[1].set_xlabel("Epoch")
    axes[1].set_ylabel("MSE")
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)

    # MAE
    axes[2].plot(history.history["mae"], label="Training MAE", color="blue")
    axes[2].plot(history.history["val_mae"], label="Validation MAE", color="red")
    axes[2].set_title("Mean Absolute Error")
    axes[2].set_xlabel("Epoch")
    axes[2].set_ylabel("MAE")
    axes[2].legend()
    axes[2].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

    # Print final metrics
    final_train_loss = history.history["loss"][-1]
    final_val_loss = history.history["val_loss"][-1]
    best_val_loss = min(history.history["val_loss"])
    best_epoch = history.history["val_loss"].index(best_val_loss) + 1

    print("\nTraining Summary:")
    print(f"Final Training Loss: {final_train_loss:.6f}")
    print(f"Final Validation Loss: {final_val_loss:.6f}")
    print(f"Best Validation Loss: {best_val_loss:.6f} (Epoch {best_epoch})")
    print(f"Total Epochs Trained: {len(history.history['loss'])}")


# Visualize training progress
plot_training_history(history)

Starting training...
Initial GPU memory status:
GPU Memory - Current: 26MB, Peak: 42MB


2025-06-27 16:10:46.719328: W external/local_xla/xla/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 1169896320 exceeds 10% of free system memory.
2025-06-27 16:10:47.793144: W external/local_xla/xla/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 1169896320 exceeds 10% of free system memory.


Epoch 1/100


I0000 00:00:1751033463.817597  248005 service.cc:148] XLA service 0x7f64bc001d10 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1751033463.817638  248005 service.cc:156]   StreamExecutor device (0): NVIDIA GeForce RTX 3060 Laptop GPU, Compute Capability 8.6
2025-06-27 16:11:04.340994: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
I0000 00:00:1751033465.878335  248005 cuda_dnn.cc:529] Loaded cuDNN version 91001
E0000 00:00:1751033468.031586  248005 buffer_comparator.cc:157] Difference at 16: 0, expected 0.180786
E0000 00:00:1751033468.031622  248005 buffer_compara

InternalError: Graph execution error:

Detected at node StatefulPartitionedCall defined at (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main

  File "<frozen runpy>", line 88, in _run_code

  File "/home/oussama/.local/share/mamba/envs/tf/lib/python3.11/site-packages/ipykernel_launcher.py", line 18, in <module>

  File "/home/oussama/.local/share/mamba/envs/tf/lib/python3.11/site-packages/traitlets/config/application.py", line 1075, in launch_instance

  File "/home/oussama/.local/share/mamba/envs/tf/lib/python3.11/site-packages/ipykernel/kernelapp.py", line 739, in start

  File "/home/oussama/.local/share/mamba/envs/tf/lib/python3.11/site-packages/tornado/platform/asyncio.py", line 211, in start

  File "/home/oussama/.local/share/mamba/envs/tf/lib/python3.11/asyncio/base_events.py", line 608, in run_forever

  File "/home/oussama/.local/share/mamba/envs/tf/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once

  File "/home/oussama/.local/share/mamba/envs/tf/lib/python3.11/asyncio/events.py", line 84, in _run

  File "/home/oussama/.local/share/mamba/envs/tf/lib/python3.11/site-packages/ipykernel/kernelbase.py", line 545, in dispatch_queue

  File "/home/oussama/.local/share/mamba/envs/tf/lib/python3.11/site-packages/ipykernel/kernelbase.py", line 534, in process_one

  File "/home/oussama/.local/share/mamba/envs/tf/lib/python3.11/site-packages/ipykernel/kernelbase.py", line 437, in dispatch_shell

  File "/home/oussama/.local/share/mamba/envs/tf/lib/python3.11/site-packages/ipykernel/ipkernel.py", line 362, in execute_request

  File "/home/oussama/.local/share/mamba/envs/tf/lib/python3.11/site-packages/ipykernel/kernelbase.py", line 778, in execute_request

  File "/home/oussama/.local/share/mamba/envs/tf/lib/python3.11/site-packages/ipykernel/ipkernel.py", line 449, in do_execute

  File "/home/oussama/.local/share/mamba/envs/tf/lib/python3.11/site-packages/ipykernel/zmqshell.py", line 549, in run_cell

  File "/home/oussama/.local/share/mamba/envs/tf/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3100, in run_cell

  File "/home/oussama/.local/share/mamba/envs/tf/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3155, in _run_cell

  File "/home/oussama/.local/share/mamba/envs/tf/lib/python3.11/site-packages/IPython/core/async_helpers.py", line 128, in _pseudo_sync_runner

  File "/home/oussama/.local/share/mamba/envs/tf/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3367, in run_cell_async

  File "/home/oussama/.local/share/mamba/envs/tf/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3612, in run_ast_nodes

  File "/home/oussama/.local/share/mamba/envs/tf/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3672, in run_code

  File "/tmp/ipykernel_247552/3350312808.py", line 23, in <module>

  File "/home/oussama/.local/share/mamba/envs/tf/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 117, in error_handler

  File "/home/oussama/.local/share/mamba/envs/tf/lib/python3.11/site-packages/keras/src/backend/tensorflow/trainer.py", line 377, in fit

  File "/home/oussama/.local/share/mamba/envs/tf/lib/python3.11/site-packages/keras/src/backend/tensorflow/trainer.py", line 220, in function

  File "/home/oussama/.local/share/mamba/envs/tf/lib/python3.11/site-packages/keras/src/backend/tensorflow/trainer.py", line 133, in multi_step_on_iterator

libdevice not found at ./libdevice.10.bc
	 [[{{node StatefulPartitionedCall}}]] [Op:__inference_multi_step_on_iterator_39153]
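
The run above stops with `libdevice not found at ./libdevice.10.bc`, which usually means XLA could not locate the CUDA toolkit's `nvvm/libdevice` directory. A commonly suggested workaround is to point XLA at the toolkit through the `XLA_FLAGS` environment variable before TensorFlow is imported. The sketch below is an assumption about this environment rather than a verified fix: the helper names and the candidate directories are hypothetical and would need adjusting to the local install.

```python
import os
from pathlib import Path
from typing import Iterable, Optional

# Location of libdevice relative to a CUDA toolkit root.
LIBDEVICE_RELATIVE = Path("nvvm") / "libdevice" / "libdevice.10.bc"


def find_cuda_data_dir(candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate that contains nvvm/libdevice/libdevice.10.bc."""
    for candidate in candidates:
        if candidate and (Path(candidate) / LIBDEVICE_RELATIVE).is_file():
            return str(candidate)
    return None


def configure_xla_cuda_dir(candidates: Iterable[str]) -> Optional[str]:
    """Add --xla_gpu_cuda_data_dir to XLA_FLAGS if a toolkit root is found.

    This has to run before TensorFlow is imported to take effect.
    """
    cuda_dir = find_cuda_data_dir(candidates)
    if cuda_dir is not None:
        flag = f"--xla_gpu_cuda_data_dir={cuda_dir}"
        existing = os.environ.get("XLA_FLAGS", "")
        if flag not in existing:
            os.environ["XLA_FLAGS"] = f"{existing} {flag}".strip()
    return cuda_dir


# Hypothetical search locations; adjust for the local install.
candidate_dirs = [
    os.environ.get("CUDA_HOME", ""),
    os.environ.get("CONDA_PREFIX", ""),
    "/usr/local/cuda",
]
print("CUDA data dir:", configure_xla_cuda_dir(candidate_dirs))
```

If the helper prints `None`, no candidate contained libdevice and `XLA_FLAGS` is left unchanged.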

## Step 6: Anomaly Scoring

Use reconstruction error to identify anomalies in shopping baskets.
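
To see why reconstruction error works as an anomaly signal, the sketch below swaps the neural autoencoder for a rank-1 PCA, which behaves like a linear autoencoder with a one-dimensional bottleneck. It is a toy illustration on synthetic data, not the notebook's model: the data, the `direction` vector, and the helper names are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "normal" data lying close to a 1-D line in 3-D space.
latent = rng.normal(size=(500, 1))
direction = np.array([[1.0, 2.0, -1.0]])
normal = latent @ direction + 0.05 * rng.normal(size=(500, 3))

# Fit a rank-1 linear "autoencoder" with SVD (PCA) on the normal data.
mean = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mean, full_matrices=False)
components = vt[:1]  # shared encoder/decoder weights, shape (1, 3)


def reconstruct(x):
    code = (x - mean) @ components.T  # encode to the 1-D bottleneck
    return code @ components + mean   # decode back to 3-D


def reconstruction_error(x):
    # Per-sample mean squared error between input and reconstruction.
    return np.mean((x - reconstruct(x)) ** 2, axis=1)


# A point far from the learned line reconstructs poorly.
outlier = np.array([[2.0, -2.0, 2.0]])
threshold = np.percentile(reconstruction_error(normal), 95)
print("Outlier flagged:", bool(reconstruction_error(outlier)[0] > threshold))
```

Points near the learned line reconstruct almost exactly, while the off-line point gets a much larger error; the 95th-percentile threshold mirrors the 5% contamination rate used in the cell below.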


In [None]:
class AnomalyDetector:
    """
    Anomaly detection using reconstruction error from trained autoencoder
    """

    def __init__(self, autoencoder_model, encoder_model, preprocessor):
        self.autoencoder = autoencoder_model
        self.encoder = encoder_model
        self.preprocessor = preprocessor
        self.threshold = None
        self.reconstruction_errors = None

    def calculate_reconstruction_error(self, inputs, targets, batch_size=128):
        """
        Calculate reconstruction error for anomaly detection
        """
        print("Calculating reconstruction errors...")

        # Get predictions in batches to manage GPU memory
        predictions = self.autoencoder.predict(inputs, batch_size=batch_size, verbose=1)

        # Calculate reconstruction errors
        reconstruction_errors = []

        for i in range(len(targets)):
            # MSE between original and reconstructed
            mse_error = np.mean((targets[i] - predictions[i]) ** 2)

            # MAE between original and reconstructed
            mae_error = np.mean(np.abs(targets[i] - predictions[i]))

            # Combined error score
            combined_error = 0.7 * mse_error + 0.3 * mae_error
            reconstruction_errors.append(combined_error)

        return np.array(reconstruction_errors)

    def fit_threshold(self, reconstruction_errors, contamination=0.05):
        """
        Fit anomaly threshold based on reconstruction errors
        """
        print(f"Fitting anomaly threshold with contamination rate: {contamination}")

        # Use percentile-based threshold
        threshold = np.percentile(reconstruction_errors, (1 - contamination) * 100)

        self.threshold = threshold
        self.reconstruction_errors = reconstruction_errors

        print(f"Anomaly threshold set to: {threshold:.6f}")
        print(
            f"Number of anomalies detected: {np.sum(reconstruction_errors > threshold)}"
        )

        return threshold

    def detect_anomalies(self, inputs, targets, batch_size=128):
        """
        Detect anomalies using the fitted threshold
        """
        if self.threshold is None:
            raise ValueError("Threshold not fitted. Call fit_threshold first.")

        # Calculate reconstruction errors
        errors = self.calculate_reconstruction_error(inputs, targets, batch_size)

        # Classify as anomaly if error > threshold
        anomaly_flags = (errors > self.threshold).astype(int)

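        # Normalize errors for scoring (0-1 scale); guard against a zero range
        max_error = np.max(errors)
        min_error = np.min(errors)
        error_range = max_error - min_error
        if error_range > 0:
            normalized_scores = (errors - min_error) / error_range
        else:
            normalized_scores = np.zeros_like(errors)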

        return errors, anomaly_flags, normalized_scores

    def analyze_latent_space(self, inputs, batch_size=128):
        """
        Analyze the latent space representations
        """
        print("Analyzing latent space representations...")

        # Get latent representations
        latent_representations = self.encoder.predict(
            inputs, batch_size=batch_size, verbose=1
        )

        # Apply PCA for visualization
        pca = PCA(n_components=2)
        latent_2d = pca.fit_transform(latent_representations)

        print(f"Latent space shape: {latent_representations.shape}")
        print(f"PCA explained variance ratio: {pca.explained_variance_ratio_}")

        return latent_representations, latent_2d, pca


# Initialize anomaly detector
anomaly_detector = AnomalyDetector(autoencoder, encoder, preprocessor)

# Calculate reconstruction errors for training data
print("Calculating reconstruction errors for training data...")
train_reconstruction_errors = anomaly_detector.calculate_reconstruction_error(
    train_inputs, train_targets, batch_size=BATCH_SIZE
)

# Fit anomaly threshold (top 5% as anomalies)
contamination_rate = 0.05
threshold = anomaly_detector.fit_threshold(
    train_reconstruction_errors, contamination_rate
)

# Detect anomalies in training data
train_errors, train_anomalies, train_scores = anomaly_detector.detect_anomalies(
    train_inputs, train_targets, batch_size=BATCH_SIZE
)

print("\nTraining Data Anomaly Detection Results:")
print(f"Total samples: {len(train_errors)}")
print(f"Anomalies detected: {np.sum(train_anomalies)}")
print(f"Anomaly rate: {np.mean(train_anomalies):.3f}")
print(f"Mean reconstruction error: {np.mean(train_errors):.6f}")
print(f"Std reconstruction error: {np.std(train_errors):.6f}")

# Analyze latent space
latent_representations, latent_2d, pca = anomaly_detector.analyze_latent_space(
    train_inputs, batch_size=BATCH_SIZE
)

In [None]:
# Visualize anomaly detection results
def visualize_anomaly_results(
    reconstruction_errors, anomaly_flags, latent_2d, threshold
):
    """
    Create comprehensive visualizations of anomaly detection results
    """
    fig, axes = plt.subplots(2, 3, figsize=(20, 12))

    # 1. Reconstruction error distribution
    axes[0, 0].hist(
        reconstruction_errors[anomaly_flags == 0],
        bins=50,
        alpha=0.7,
        color="skyblue",
        label="Normal",
        density=True,
    )
    axes[0, 0].hist(
        reconstruction_errors[anomaly_flags == 1],
        bins=30,
        alpha=0.7,
        color="red",
        label="Anomalies",
        density=True,
    )
    axes[0, 0].axvline(
        threshold,
        color="red",
        linestyle="--",
        linewidth=2,
        label=f"Threshold: {threshold:.4f}",
    )
    axes[0, 0].set_xlabel("Reconstruction Error")
    axes[0, 0].set_ylabel("Density")
    axes[0, 0].set_title("Distribution of Reconstruction Errors")
    axes[0, 0].legend()
    axes[0, 0].grid(True, alpha=0.3)

    # 2. Box plot of errors
    normal_errors = reconstruction_errors[anomaly_flags == 0]
    anomalous_errors = reconstruction_errors[anomaly_flags == 1]

    box_data = [normal_errors, anomalous_errors]
    box_labels = ["Normal", "Anomalies"]
    axes[0, 1].boxplot(box_data, labels=box_labels)
    axes[0, 1].set_ylabel("Reconstruction Error")
    axes[0, 1].set_title("Error Distribution: Normal vs Anomalies")
    axes[0, 1].grid(True, alpha=0.3)

    # 3. Latent space visualization
    normal_mask = anomaly_flags == 0
    anomaly_mask = anomaly_flags == 1

    axes[0, 2].scatter(
        latent_2d[normal_mask, 0],
        latent_2d[normal_mask, 1],
        c="blue",
        alpha=0.6,
        s=20,
        label="Normal",
    )
    axes[0, 2].scatter(
        latent_2d[anomaly_mask, 0],
        latent_2d[anomaly_mask, 1],
        c="red",
        alpha=0.8,
        s=30,
        label="Anomalies",
    )
    axes[0, 2].set_xlabel("First Principal Component")
    axes[0, 2].set_ylabel("Second Principal Component")
    axes[0, 2].set_title("Latent Space Representation (PCA)")
    axes[0, 2].legend()
    axes[0, 2].grid(True, alpha=0.3)

    # 4. Error vs sample index
    sample_indices = np.arange(len(reconstruction_errors))
    axes[1, 0].scatter(
        sample_indices[normal_mask],
        reconstruction_errors[normal_mask],
        c="blue",
        alpha=0.6,
        s=10,
        label="Normal",
    )
    axes[1, 0].scatter(
        sample_indices[anomaly_mask],
        reconstruction_errors[anomaly_mask],
        c="red",
        alpha=0.8,
        s=20,
        label="Anomalies",
    )
    axes[1, 0].axhline(threshold, color="red", linestyle="--", linewidth=2)
    axes[1, 0].set_xlabel("Sample Index")
    axes[1, 0].set_ylabel("Reconstruction Error")
    axes[1, 0].set_title("Reconstruction Errors by Sample")
    axes[1, 0].legend()
    axes[1, 0].grid(True, alpha=0.3)

    # 5. Cumulative distribution
    sorted_errors = np.sort(reconstruction_errors)
    cumulative_prob = np.arange(1, len(sorted_errors) + 1) / len(sorted_errors)

    axes[1, 1].plot(sorted_errors, cumulative_prob, color="blue", linewidth=2)
    axes[1, 1].axvline(
        threshold,
        color="red",
        linestyle="--",
        linewidth=2,
        label=f"Threshold ({np.mean(reconstruction_errors <= threshold) * 100:.0f}th percentile)",
    )
    axes[1, 1].set_xlabel("Reconstruction Error")
    axes[1, 1].set_ylabel("Cumulative Probability")
    axes[1, 1].set_title("Cumulative Distribution of Errors")
    axes[1, 1].legend()
    axes[1, 1].grid(True, alpha=0.3)

    # 6. Top anomalies analysis
    top_anomaly_indices = np.argsort(reconstruction_errors)[-20:]  # Top 20 anomalies
    top_errors = reconstruction_errors[top_anomaly_indices]

    axes[1, 2].barh(range(len(top_errors)), top_errors, color="red", alpha=0.7)
    axes[1, 2].set_xlabel("Reconstruction Error")
    axes[1, 2].set_ylabel("Anomaly Rank")
    axes[1, 2].set_title("Top 20 Most Anomalous Baskets")
    axes[1, 2].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

    # Print detailed statistics
    print("\nDetailed Anomaly Detection Statistics:")
    print("=" * 50)
    print(f"Total samples analyzed: {len(reconstruction_errors)}")
    print(f"Normal samples: {np.sum(anomaly_flags == 0)}")
    print(f"Anomalous samples: {np.sum(anomaly_flags == 1)}")
    print(f"Anomaly rate: {np.mean(anomaly_flags):.3f}")
    print("\nReconstruction Error Statistics:")
    print(f"Mean error (all): {np.mean(reconstruction_errors):.6f}")
    print(f"Std error (all): {np.std(reconstruction_errors):.6f}")
    print(f"Mean error (normal): {np.mean(normal_errors):.6f}")
    print(f"Mean error (anomalies): {np.mean(anomalous_errors):.6f}")
    print(f"Threshold: {threshold:.6f}")

    return top_anomaly_indices


# Create comprehensive visualizations
print("Creating anomaly detection visualizations...")
top_anomaly_indices = visualize_anomaly_results(
    train_reconstruction_errors, train_anomalies, latent_2d, threshold
)

## Step 7: Apply to Test Data and Generate Results

Apply the trained autoencoder to test data and generate final submission files.
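
One design point worth considering at this step: if the 0-1 anomaly scores are normalized separately on each dataset, a test score of 0.8 and a training score of 0.8 do not refer to the same reconstruction error. The sketch below shows one possible alternative, a min-max scaler fitted on training errors only and reused for test errors. `TrainFittedScaler` is a hypothetical helper, not part of the notebook, and the numbers are illustrative.

```python
import numpy as np


class TrainFittedScaler:
    """Min-max scaler whose range is learned from training errors only."""

    def fit(self, train_errors):
        train_errors = np.asarray(train_errors, dtype=float)
        self.min_ = float(train_errors.min())
        self.max_ = float(train_errors.max())
        return self

    def transform(self, errors):
        errors = np.asarray(errors, dtype=float)
        span = self.max_ - self.min_
        if span == 0:
            # Degenerate case: all training errors were identical.
            return np.zeros_like(errors)
        # Clip so test errors outside the training range stay in [0, 1].
        return np.clip((errors - self.min_) / span, 0.0, 1.0)


scaler = TrainFittedScaler().fit([1.0, 2.0, 3.0, 5.0])
example_scores = scaler.transform([1.0, 3.0, 9.0])
print(example_scores)
```

Clipping keeps scores bounded but hides how far beyond the training maximum a test error lies, so the raw `reconstruction_error` column is still worth keeping in the results file.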


In [None]:
# Process test data
print("Processing test data...")
print("=" * 50)

# Transform test data using the fitted preprocessor
test_categorical, test_numerical = preprocessor.transform(
    test_df, all_categorical_cols, all_numerical_cols
)

print(f"Test data processed:")
print(f"Categorical features: {test_categorical.shape}")
print(f"Numerical features: {test_numerical.shape}")

# Prepare test inputs for the model
test_cat_inputs, test_num_inputs, _ = create_deep_learning_inputs(
    test_categorical, test_numerical, preprocessor.categorical_vocab
)

# Prepare test targets (for reconstruction error calculation)
test_inputs_dict, test_targets = prepare_training_data(
    test_cat_inputs, test_num_inputs, BATCH_SIZE
)

print(f"Test inputs prepared: {len(test_inputs_dict)} input types")
print(f"Test targets shape: {test_targets.shape}")

# Apply anomaly detection to test data
print("\nApplying anomaly detection to test data...")
test_errors, test_anomalies, test_scores = anomaly_detector.detect_anomalies(
    test_inputs_dict, test_targets, batch_size=BATCH_SIZE
)

print("\nTest Data Anomaly Detection Results:")
print(f"Total test samples: {len(test_errors)}")
print(f"Anomalies detected: {np.sum(test_anomalies)}")
print(f"Test anomaly rate: {np.mean(test_anomalies):.3f}")
print(f"Mean reconstruction error: {np.mean(test_errors):.6f}")


# Generate final results
def generate_final_results(
    train_df,
    test_df,
    train_errors,
    train_anomalies,
    train_scores,
    test_errors,
    test_anomalies,
    test_scores,
):
    """
    Generate final submission files with anomaly scores and rankings
    """
    print("Generating final submission files...")

    # Training results
    train_results = pd.DataFrame(
        {
            "ID": train_df["ID"].values,
            "reconstruction_error": train_errors,
            "anomaly_score": train_scores,
            "is_anomaly": train_anomalies,
            "rank": None,
        }
    )

    # Test results
    test_results = pd.DataFrame(
        {
            "ID": test_df["ID"].values,
            "reconstruction_error": test_errors,
            "anomaly_score": test_scores,
            "is_anomaly": test_anomalies,
            "rank": None,
        }
    )

    # Add rankings based on anomaly scores (higher score = more anomalous)
    train_results["rank"] = train_results["anomaly_score"].rank(
        ascending=False, method="dense"
    )
    test_results["rank"] = test_results["anomaly_score"].rank(
        ascending=False, method="dense"
    )

    # Sort by anomaly score (descending)
    train_results = train_results.sort_values("anomaly_score", ascending=False)
    test_results = test_results.sort_values("anomaly_score", ascending=False)

    # Save results
    train_file = RESULTS_PATH / "approach2_train_results.csv"
    test_file = RESULTS_PATH / "approach2_test_results.csv"

    train_results.to_csv(train_file, index=False)
    test_results.to_csv(test_file, index=False)

    print(f"Results saved to:")
    print(f"- Training: {train_file}")
    print(f"- Test: {test_file}")

    # Display top anomalies
    print("\nTop 10 most anomalous baskets in training data:")
    print(
        train_results.head(10)[
            ["ID", "anomaly_score", "reconstruction_error", "rank"]
        ].to_string(index=False)
    )

    print("\nTop 10 most anomalous baskets in test data:")
    print(
        test_results.head(10)[
            ["ID", "anomaly_score", "reconstruction_error", "rank"]
        ].to_string(index=False)
    )

    return train_results, test_results


# Generate final submission files
train_final_results, test_final_results = generate_final_results(
    train_df,
    test_df,
    train_errors,
    train_anomalies,
    train_scores,
    test_errors,
    test_anomalies,
    test_scores,
)

# Save the trained model
model_final_path = MODELS_PATH / "approach2_autoencoder_final.keras"
autoencoder.save(model_final_path)
print(f"\nFinal model saved to: {model_final_path}")

# Compare with training data distribution
print("\nComparing train vs test anomaly distributions:")
print(f"Training anomaly rate: {np.mean(train_anomalies):.3f}")
print(f"Test anomaly rate: {np.mean(test_anomalies):.3f}")
print(f"Training mean error: {np.mean(train_errors):.6f}")
print(f"Test mean error: {np.mean(test_errors):.6f}")

# Create final comparison visualization
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# Error distribution comparison
axes[0].hist(
    train_errors, bins=50, alpha=0.5, label="Training", density=True, color="blue"
)
axes[0].hist(test_errors, bins=50, alpha=0.5, label="Test", density=True, color="red")
axes[0].axvline(threshold, color="black", linestyle="--", label="Threshold")
axes[0].set_xlabel("Reconstruction Error")
axes[0].set_ylabel("Density")
axes[0].set_title("Error Distribution: Train vs Test")
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Anomaly score comparison
axes[1].hist(
    train_scores, bins=50, alpha=0.5, label="Training", density=True, color="blue"
)
axes[1].hist(test_scores, bins=50, alpha=0.5, label="Test", density=True, color="red")
axes[1].set_xlabel("Normalized Anomaly Score")
axes[1].set_ylabel("Density")
axes[1].set_title("Anomaly Score Distribution: Train vs Test")
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# Anomaly rate comparison
categories = ["Training", "Test"]
anomaly_rates = [np.mean(train_anomalies), np.mean(test_anomalies)]
axes[2].bar(categories, anomaly_rates, color=["blue", "red"], alpha=0.7)
axes[2].set_ylabel("Anomaly Rate")
axes[2].set_title("Anomaly Rate Comparison")
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Summary and Business Insights

### Approach 2 Results: Deep Learning & Representation-Based Anomaly Detection

This deep learning approach scores each shopping basket by how poorly a trained autoencoder reconstructs it, so baskets that deviate from the learned shopping patterns receive high anomaly scores.

#### Key Achievements:

1. **Advanced Architecture**:

   - Embedding layers for categorical features (items, makes, models, codes)
   - Autoencoder with encoder-decoder structure
   - GPU-optimized training for NVIDIA 3060 Mobile

2. **Sophisticated Pattern Learning**:

   - Learns non-linear relationships between features
   - Captures complex shopping behavior patterns
   - Handles mixed categorical and numerical data seamlessly

3. **Robust Anomaly Detection**:

   - Uses reconstruction error as anomaly indicator
   - Threshold-based classification with 5% contamination rate
   - Normalized scoring for consistent ranking

4. **GPU Optimization**:
   - Mixed precision training for better performance
   - Optimized batch sizes for 8GB VRAM
   - Memory monitoring and efficient processing

#### Deep Learning Advantages:

- **Complex Pattern Recognition**: Captures subtle relationships that statistical methods might miss
- **Automatic Feature Engineering**: Learns relevant feature combinations automatically
- **Scalability**: Can handle high-dimensional data efficiently
- **Adaptability**: Model learns from data without predefined rules

#### Anomaly Characteristics Detected:

- **Unusual Item Combinations**: Rare or suspicious product mix
- **Price-Quantity Anomalies**: Mismatched pricing patterns
- **Behavioral Outliers**: Shopping patterns that deviate from learned norms
- **Complex Multi-feature Anomalies**: Combinations that are individually normal but collectively suspicious

#### Business Value:

1. **Enhanced Detection**: Identifies sophisticated fraud patterns
2. **Fewer False Positives (expected)**: A learned model of normal behavior should flag fewer legitimate baskets than fixed rules, although this was not measured in this notebook
3. **Actionable Insights**: Provides interpretable reconstruction errors
4. **Scalable Solution**: Can process large volumes of transactions efficiently
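
To make a reconstruction error actionable, one option is to break a flagged basket's squared error down by feature and report the largest contributors. The following is a minimal sketch under that assumption; `top_contributing_features` and the feature names are hypothetical and do not come from the notebook's preprocessing.

```python
import numpy as np


def top_contributing_features(target, reconstruction, feature_names, k=3):
    """Return the k features with the largest share of the squared error."""
    target = np.asarray(target, dtype=float)
    reconstruction = np.asarray(reconstruction, dtype=float)
    squared = (target - reconstruction) ** 2
    total = squared.sum()
    order = np.argsort(squared)[::-1][:k]  # indices of the largest errors
    return [
        (feature_names[i], float(squared[i] / total) if total > 0 else 0.0)
        for i in order
    ]


# Hypothetical feature names and values for a single basket.
names = ["item_code", "make", "price", "quantity"]
original = [1.0, 0.0, 5.0, 2.0]
rebuilt = [1.0, 0.0, 1.0, 1.0]
for name, share in top_contributing_features(original, rebuilt, names, k=2):
    print(f"{name}: {share:.1%} of squared error")
```

An investigator can then start from the features that dominate the error instead of a single opaque score.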

#### Recommendations:

1. **Priority Investigation**: Focus on baskets with highest reconstruction errors
2. **Threshold Tuning**: Adjust contamination rate based on business requirements
3. **Continuous Learning**: Retrain model periodically with new data
4. **Ensemble Approach**: Combine with statistical methods for robust detection
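
For the threshold-tuning recommendation, a quick way to see the effect of the contamination rate is to sweep a few values and count how many baskets each would flag. The sketch below assumes the same percentile rule used in `fit_threshold`; `contamination_sweep` is a hypothetical helper and the errors are synthetic.

```python
import numpy as np


def contamination_sweep(errors, rates=(0.01, 0.05, 0.10)):
    """Return (rate, threshold, flagged_count) for each contamination rate."""
    errors = np.asarray(errors, dtype=float)
    rows = []
    for rate in rates:
        # Same percentile rule as fit_threshold: flag the top `rate` share.
        threshold = np.percentile(errors, (1 - rate) * 100)
        rows.append((rate, float(threshold), int(np.sum(errors > threshold))))
    return rows


# Synthetic errors 1..100 so the flagged counts are easy to check by hand.
example_errors = np.arange(1, 101)
for rate, threshold, flagged in contamination_sweep(example_errors):
    print(f"rate={rate:.2f}  threshold={threshold:.2f}  flagged={flagged}")
```

The flagged count can then be compared against how many baskets the business can realistically review.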

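The deep learning approach is designed to identify complex anomalous patterns in shopping baskets and to complement traditional statistical methods for fraud detection. In this TensorFlow notebook the training cell failed with a `libdevice not found` error, so the scoring and test cells were not executed; use the PyTorch version (`approach2_deep_learning_pytorch.ipynb`) for actual results.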
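
One simple way to combine autoencoder scores with a statistical method is rank averaging, which sidesteps the fact that the two methods produce scores on different scales. This is only a sketch of that idea, with hypothetical helper names and made-up scores; it is not the ensemble used elsewhere in the project.

```python
import numpy as np


def rank_normalize(scores):
    """Map scores to [0, 1] by rank (ties are broken by position)."""
    scores = np.asarray(scores, dtype=float)
    ranks = np.argsort(np.argsort(scores))
    return ranks / max(len(scores) - 1, 1)


def ensemble_scores(*score_arrays):
    """Average the rank-normalized scores of several detectors."""
    return np.mean([rank_normalize(s) for s in score_arrays], axis=0)


# Made-up scores from two detectors for the same three baskets.
autoencoder_scores = [0.1, 0.9, 0.5]
statistical_scores = [10.0, 30.0, 20.0]
combined = ensemble_scores(autoencoder_scores, statistical_scores)
print(combined)
```

A basket ranks highly in the combined score only when both detectors agree it is unusual, which can reduce the influence of either method's blind spots.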
