# Fake News Classifier on Google Colab

This notebook implements a fake news classifier using PhoBERT and TensorFlow, adapted from the `fake_news_classifier_no_shadow.py` script. It is designed to run on Google Colab with GPU support.

## Setup Instructions
1. **Enable GPU**: Go to `Runtime > Change runtime type` and select `GPU`.
2. Run the cells below to install dependencies and execute the code.
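3. **Upload Data**: Ensure `vnexpress_dataset.csv` and `vnexpress_fake_dataset.csv` are uploaded to the Colab environment (use the file upload feature in the left sidebar). Note that the pipeline cell reads `./data/vnexpress_combined_dataset.csv` and expects `Content` and `Label` columns, so that combined file must exist under `./data/` before you run it.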
4. Monitor logs and TensorBoard for training progress and results.

**Note**: You may need to restart the runtime after installing dependencies if prompted.

In [2]:
!conda install -y tensorflow=2.10 cudatoolkit=11.2 cudnn=8.1 -c conda-forge
!conda install -y pandas=1.5 scikit-learn=1.2 imbalanced-learn=0.10 numpy=1.23 -c conda-forge
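!pip install transformers==4.25
# Keep protobuf below 3.20 to avoid "Descriptors cannot be created directly" when importing TensorFlow
!pip install "protobuf<3.20"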

# Verify installations
import tensorflow as tf
import transformers
import pandas as pd
import sklearn
import imblearn
import numpy as np

print(f"TensorFlow version: {tf.__version__}")
print(f"Transformers version: {transformers.__version__}")
print(f"Pandas version: {pd.__version__}")
print(f"Scikit-learn version: {sklearn.__version__}")
print(f"Imbalanced-learn version: {imblearn.__version__}")
print(f"Numpy version: {np.__version__}")

# Check GPU availability and CUDA/CUDNN
print("GPU Available:", tf.config.list_physical_devices('GPU'))
print("CUDA Available:", tf.test.is_built_with_cuda())
try:
    import subprocess
    nvcc_output = subprocess.check_output(['nvcc', '--version']).decode('utf-8')
    print("CUDA Version (nvcc):", nvcc_output.split('\n')[3])
except Exception as err:
    print(f"CUDA Version (nvcc): Not found {err}")

# Check TensorBoard compatibility
try:
    from tensorboard import program
    print("TensorBoard is available")
except ImportError:
    print("TensorBoard is not available, please ensure TensorFlow is installed correctly")

TypeError: Descriptors cannot be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates
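
The traceback above names two workarounds: downgrading protobuf, or switching to the pure-Python implementation. Below is a minimal sketch of the second one, assuming it runs in a fresh runtime before TensorFlow is imported. The helper name `protobuf_is_too_new` is hypothetical, and the 3.20 cutoff is taken from the error message rather than from any TensorFlow documentation.

```python
import os
from importlib import metadata


def protobuf_is_too_new(version: str) -> bool:
    """Return True when the version is above the 3.20.x line named in the error."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) > (3, 20)


try:
    installed = metadata.version("protobuf")
except metadata.PackageNotFoundError:
    installed = None  # protobuf is not installed in this environment

if installed and protobuf_is_too_new(installed):
    # Workaround 2 from the error message: pure-Python parsing (slower).
    # Workaround 1 would instead be reinstalling protobuf 3.20.x or lower.
    os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"
    print(f"protobuf {installed} detected; using the pure-Python implementation")
```

The environment variable only takes effect if it is set before protobuf is first imported, so this check belongs at the top of the notebook.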

## Code Overview

The code performs the following steps:
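1. **System Configuration**: Configures CPU threading, GPU memory growth, mixed precision, and XLA compilation for TensorFlow.
2. **Data Preprocessing**: Loads the Vietnamese news dataset, validates it, rebalances the classes when the data is heavily imbalanced, and tokenizes the text with each model's tokenizer.
3. **Model Building**: Constructs a model with a custom transformer layer (PhoBERT or multilingual BERT), a CLS token extractor, dropout, and a sigmoid dense output.
4. **Training**: Trains each model with stratified k-fold cross-validation, early stopping, checkpointing, learning-rate reduction, and TensorBoard logging.
5. **Evaluation**: Evaluates each fold on its validation split, logs a confusion matrix and classification report, and runs sample inference with the best model.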
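
As an illustration of step 2, the pipeline cell chooses the sequence length from the average token count of a text sample, adds a margin of 50 tokens, and clamps the result to the range 128 to 512 (low-memory runs always use 128). The arithmetic can be sketched on its own; the function name `choose_max_length` and the token counts below are hypothetical, not real tokenizer output.

```python
def choose_max_length(token_lengths, margin=50, lower=128, upper=512):
    """Average token count plus a margin, clamped to [lower, upper]."""
    avg_length = int(sum(token_lengths) / len(token_lengths))
    return min(max(lower, avg_length + margin), upper)


print(choose_max_length([40, 60, 50]))     # average 50 -> 100 -> raised to 128
print(choose_max_length([300, 320, 310]))  # average 310 -> 360
print(choose_max_length([900, 1100]))      # average 1000 -> 1050 -> capped at 512
```

Because the value is based on the average rather than a high percentile, texts longer than the result are truncated by the tokenizer.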

Run the cell below to execute the entire pipeline.

In [15]:
import os
import subprocess
import time
import pandas as pd
import tensorflow as tf
import logging
import json
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report, f1_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.layers import Dropout, Dense, Input
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, TensorBoard, ReduceLROnPlateau
from transformers import AutoTokenizer, TFAutoModel
from imblearn.over_sampling import SMOTE

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('fake_news_classifier.log'),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)

# Resource optimization configurations
NUM_FOLDS = 3  # Default number of folds, can be adjusted
MAX_EPOCHS = 10  # Default max epochs, can be adjusted
LOW_MEMORY_THRESHOLD = 8 * 1024  # 8GB in MB

# Utility functions for system checks
def configure_cpu_threads():
    # os.cpu_count() reports logical CPUs; it cannot distinguish physical cores
    num_logical_cores = os.cpu_count()
    if num_logical_cores:
        os.environ["TF_NUM_INTEROP_THREADS"] = str(num_logical_cores)
        os.environ["TF_NUM_INTRAOP_THREADS"] = str(num_logical_cores)
        logger.info(f"CPU threading configured: {num_logical_cores} logical cores")

def check_nvidia_gpu():
    try:
        output = subprocess.check_output(['nvidia-smi'], stderr=subprocess.STDOUT).decode('utf-8')
        return output
    except (subprocess.CalledProcessError, FileNotFoundError):
        return None

def get_gpu_memory():
    """Returns the total memory (in MiB) of the first GPU listed by nvidia-smi, or 0."""
    import re
    nvidia_output = check_nvidia_gpu()
    if not nvidia_output:
        return 0
    # The nvidia-smi table reports memory usage as "<used>MiB / <total>MiB"
    match = re.search(r'(\d+)\s*MiB\s*/\s*(\d+)\s*MiB', nvidia_output)
    if match:
        return int(match.group(2))
    logger.warning("Could not parse GPU memory from nvidia-smi output")
    return 0

def check_cuda_installation():
    try:
        subprocess.check_output(['nvcc', '--version'], stderr=subprocess.STDOUT).decode('utf-8')
        return True
    except (subprocess.CalledProcessError, FileNotFoundError):
        return False

def configure_tensorflow():
    gpu_memory = get_gpu_memory()
    gpu_devices = tf.config.list_physical_devices('GPU')
    is_low_memory = gpu_memory < LOW_MEMORY_THRESHOLD or not gpu_devices

    if gpu_devices and not is_low_memory:
        try:
            for gpu_device in gpu_devices:
                tf.config.experimental.set_memory_growth(gpu_device, True)
            tf.keras.mixed_precision.set_global_policy("mixed_float16")
            logger.info(f"Detected {len(gpu_devices)} GPU(s) with {gpu_memory}MB memory: {[gpu_device.name for gpu_device in gpu_devices]}")
        except RuntimeError as runtime_error:
            logger.error(f"GPU error: {runtime_error}")
    else:
        logger.warning(f"Low memory GPU ({gpu_memory}MB) or no GPU detected. Running on CPU or with reduced settings.")
        tf.keras.mixed_precision.set_global_policy("float32")

    tf.config.optimizer.set_jit(True)
    logger.info("XLA JIT compilation enabled")
    return is_low_memory

# Preprocessing function with better error handling
def preprocess_data(texts, labels, text_tokenizer, max_length=256):
    if len(texts) == 0:
        raise ValueError("Empty texts provided for preprocessing")

    texts_list = texts.tolist() if hasattr(texts, 'tolist') else list(texts)
    labels_list = labels.tolist() if hasattr(labels, 'tolist') else list(labels)

    # Filter out invalid entries
    valid_data = [(str(text), label) for text, label in zip(texts_list, labels_list)
                  if text and not (isinstance(text, float) and pd.isna(text))]

    if not valid_data:
        raise ValueError("No valid text entries found after filtering")

    cleaned_texts, cleaned_labels = zip(*valid_data)
    invalid_count = len(texts_list) - len(valid_data)
    if invalid_count > 0:
        logger.warning(f"Removed {invalid_count} invalid text entries during preprocessing")

    # Process in batches to avoid memory issues with large datasets
    batch_size = 1000
    all_input_ids = []
    all_attention_masks = []

    for i in range(0, len(cleaned_texts), batch_size):
        batch_texts = list(cleaned_texts[i:i+batch_size])
        tokenized_inputs = text_tokenizer(batch_texts, padding='max_length', truncation=True,
                                        max_length=max_length, return_tensors='tf')
        all_input_ids.append(tokenized_inputs['input_ids'])
        all_attention_masks.append(tokenized_inputs['attention_mask'])

    return (tf.concat(all_input_ids, axis=0),
            tf.concat(all_attention_masks, axis=0),
            tf.convert_to_tensor(cleaned_labels, dtype=tf.float32))

# Function to determine optimal max_length
def determine_max_length(texts, tokenizer, sample_size=1000, is_low_memory=False):
    """
    Determines optimal max_length based on tokenized text lengths.
    Returns a value between 128 and 512, or 128 for low memory.
    """
    if is_low_memory:
        logger.info("Low memory detected, using max_length=128")
        return 128

    sample_texts = texts.sample(min(sample_size, len(texts))).tolist()
    lengths = [len(tokenizer.encode(text, add_special_tokens=True)) for text in sample_texts]
    avg_length = int(sum(lengths) / len(lengths))
    max_length = min(max(128, avg_length + 50), 512)  # Add padding, cap at 512
    logger.info(f"Calculated max_length: {max_length} (based on average token length: {avg_length})")
    return max_length

# Function to get embeddings for SMOTE
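def get_embeddings(texts, tokenizer, model, max_length=256, batch_size=32):
    input_ids, attention_mask, _ = preprocess_data(texts, [0]*len(texts), tokenizer, max_length=max_length)
    # Run the model in batches; one forward pass over the whole dataset can exhaust memory
    chunks = []
    for start in range(0, input_ids.shape[0], batch_size):
        batch = [input_ids[start:start+batch_size], attention_mask[start:start+batch_size]]
        chunks.append(model(batch)[0][:, 0, :].numpy())  # CLS token
    return np.concatenate(chunks, axis=0)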

# Inference function
def predict_news(model, tokenizer, texts, max_length=256):
    """
    Predicts whether input text(s) are fake news.
    Args:
        model: Trained model
        tokenizer: Transformer tokenizer
        texts: Single string or list of strings
        max_length: Max token length
    Returns:
        List of dicts with text, probability, and label
    """
    if isinstance(texts, str):
        texts = [texts]

    input_ids, attention_mask, _ = preprocess_data(texts, [0]*len(texts), tokenizer, max_length=max_length)
    predictions = model.predict([input_ids, attention_mask], verbose=0)
    results = []

    for text, prob in zip(texts, predictions.flatten()):
        label = "Fake" if prob > 0.5 else "Real"
        results.append({
            "text": text[:100] + "..." if len(text) > 100 else text,
            "probability_fake": float(prob),
            "label": label
        })

    return results

# Custom layers
class CustomTransformerLayer(tf.keras.layers.Layer):
    def __init__(self, model_name="vinai/phobert-base", trainable_layers=2, **kwargs):
        super().__init__(**kwargs)
        self.model_name = model_name
        self.transformer = TFAutoModel.from_pretrained(model_name)
        self.trainable_layers = trainable_layers
        logger.info(f"Transformer model initialized: {model_name}")
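        # Fine-tune only the last few encoder blocks. TFAutoModel wraps one main layer,
        # so slicing self.transformer.layers directly would leave everything trainable.
        main_layer = self.transformer.layers[0]
        encoder_blocks = getattr(getattr(main_layer, "encoder", None), "layer", None)
        if encoder_blocks and trainable_layers > 0:
            # Assumes a BERT/RoBERTa-style main layer exposing .embeddings and .encoder.layer
            main_layer.embeddings.trainable = False
            for block in encoder_blocks[:-trainable_layers]:
                block.trainable = False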
        self.supports_masking = True

    def call(self, inputs_data, **kwargs):
        input_ids_data, attention_mask_data = inputs_data
        return self.transformer(input_ids=input_ids_data, attention_mask=attention_mask_data, **kwargs)[0]

    def get_config(self):
        config = super().get_config()
        config.update({"model_name": self.model_name, "trainable_layers": self.trainable_layers})
        return config

class CLSTokenExtractor(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.supports_masking = True

    def call(self, input_data):
        return input_data[:, 0, :]

    def get_config(self):
        return super().get_config()

# Model building function
def build_model(dropout_rate=0.4, learning_rate=1e-5, model_name="vinai/phobert-base"):
    """
    Builds model with specified transformer (PhoBERT or BERT multilingual).
    model_name: 'vinai/phobert-base' or 'bert-base-multilingual-cased'
    """
    # Variable sequence length: max_length is chosen at runtime and may differ from 256
    input_ids_data = Input(shape=(None,), dtype=tf.int32, name='input_ids')
    attention_mask_data = Input(shape=(None,), dtype=tf.int32, name='attention_mask')

    transformer_layer = CustomTransformerLayer(model_name=model_name, trainable_layers=2)
    transformer_output = transformer_layer([input_ids_data, attention_mask_data])

    cls_token_extractor = CLSTokenExtractor()
    text_embedding = cls_token_extractor(transformer_output)

    dropout_layer = Dropout(dropout_rate)(text_embedding)
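    # Keep the output in float32 so the sigmoid stays numerically stable under mixed_float16
    output_layer = Dense(1, activation='sigmoid', dtype='float32')(dropout_layer)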

    built_model = Model(inputs=[input_ids_data, attention_mask_data], outputs=output_layer)

    optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
    built_model.compile(optimizer=optimizer, loss='binary_crossentropy',
                       metrics=['accuracy', tf.keras.metrics.AUC(name='auc')])
    return built_model

# Dataset creation with memory optimization
def create_dataset(input_ids, attention_masks, labels, batch_size=32, buffer_size=1000, is_large_dataset=False):
    """
    Creates a TensorFlow dataset with specified batch size.
    Note: Adjust batch_size and buffer_size based on dataset size and hardware.
    """
    if is_large_dataset:
        buffer_size = min(buffer_size, 500)  # Reduce buffer for large datasets
        logger.info(f"Large dataset detected, reduced buffer_size to {buffer_size}")

    dataset = tf.data.Dataset.from_tensor_slices(
        ({'input_ids': input_ids, 'attention_mask': attention_masks}, labels)
    )
    return dataset.shuffle(buffer_size).batch(batch_size).prefetch(tf.data.AUTOTUNE)

# Model saving and loading with proper directory creation
def save_model(model_to_save, path='./model/fake_news_model.keras'):
    os.makedirs(os.path.dirname(path), exist_ok=True)

    try:
        model_to_save.save(path, save_format='keras')
        logger.info(f"Model saved successfully to {path}")
    except Exception as save_error:
        logger.error(f"Error saving model: {save_error}")
        weights_path = path.replace('.keras', '.weights.h5')
        model_to_save.save_weights(weights_path)
        logger.info(f"Model weights saved to {weights_path}")

def load_model(path='./model/fake_news_model.keras'):
    custom_objects = {
        'CustomTransformerLayer': CustomTransformerLayer,
        'CLSTokenExtractor': CLSTokenExtractor
    }

    try:
        return tf.keras.models.load_model(path, custom_objects=custom_objects)
    except Exception as load_error:
        logger.error(f"Error loading model: {load_error}")
        weights_path = path.replace('.keras', '.weights.h5')
        if os.path.exists(weights_path):
            loaded_model = build_model()
            loaded_model.load_weights(weights_path)
            logger.info(f"Model weights loaded from {weights_path}")
            return loaded_model
        else:
            raise FileNotFoundError(f"Neither model nor weights found at {path}")

# Main execution
configure_cpu_threads()
logger.info("CHECKING GPU PREREQUISITES:")
logger.info(f"NVIDIA GPU available: {bool(check_nvidia_gpu())}")
logger.info(f"CUDA installed: {check_cuda_installation()}")
is_low_memory = configure_tensorflow()

try:
    # Set up logging directory
    log_dir = "./logs/" + time.strftime("%Y%m%d-%H%M%S")
    os.makedirs(log_dir, exist_ok=True)
    logger.info("To visualize training progress, run: tensorboard --logdir=./logs")

    # Load data
    logger.info("Loading dataset...")
    combined_data = pd.read_csv('./data/vnexpress_combined_dataset.csv')

    # Check for NaN values and invalid Labels
    if combined_data['Content'].isnull().any() or combined_data['Label'].isnull().any():
        logger.error("Found NaN values in 'Content' or 'Label' columns")
        raise ValueError("Invalid data detected")
    if not combined_data['Label'].isin([0, 1]).all():
        logger.error("Invalid Label values detected (must be 0 or 1)")
        raise ValueError("Invalid Label values")

    # Check label distribution
    label_ratio = combined_data['Label'].mean()
    is_large_dataset = len(combined_data) > 100000

    # Initialize tokenizer for SMOTE and max_length
    smote_tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
    max_length = determine_max_length(combined_data['Content'], smote_tokenizer, is_low_memory=is_low_memory)

    if label_ratio < 0.2 or label_ratio > 0.8:
        logger.warning(f"Imbalanced dataset detected: {label_ratio*100:.1f}% fake news. Applying random oversampling to balance data.")

        # SMOTE interpolates synthetic embedding vectors, which have no corresponding text
        # to tokenize. Random oversampling of row indices is used here instead, so every
        # resampled row keeps its own Content/Label pair.
        # Note: duplicating rows before the cross-validation split can leak copies into validation folds.
        from imblearn.over_sampling import RandomOverSampler
        oversampler = RandomOverSampler(random_state=42)
        row_indices = np.arange(len(combined_data)).reshape(-1, 1)
        resampled_indices, _ = oversampler.fit_resample(row_indices, combined_data['Label'])
        combined_data = combined_data.iloc[resampled_indices.ravel()].reset_index(drop=True)
        new_label_ratio = combined_data['Label'].mean()
        logger.info(f"After oversampling: {len(combined_data)} records, {new_label_ratio*100:.1f}% fake news")

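    # Compute balanced class weights (inversely proportional to class frequency),
    # using the label distribution of the data that will actually be trained on
    fake_fraction = min(max(float(combined_data['Label'].mean()), 1e-6), 1 - 1e-6)
    class_weight = {0: 0.5 / (1.0 - fake_fraction), 1: 0.5 / fake_fraction}
    logger.info(f"Class weights: {class_weight}")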

    logger.info(f"Dataset loaded: {len(combined_data)} records "
                f"({len(combined_data[combined_data['Label'] == 0])} real, "
                f"{len(combined_data[combined_data['Label'] == 1])} fake)")

    # Resource optimization
    num_folds = NUM_FOLDS
    max_epochs = MAX_EPOCHS
    batch_size = 32

    if is_large_dataset:
        num_folds = min(num_folds, 2)  # Reduce folds for large datasets
        max_epochs = min(max_epochs, 5)  # Reduce epochs for large datasets
        batch_size = 64
        logger.info("Large dataset detected (>100K samples): capped num_folds at 2 and max_epochs at 5, set batch_size to 64")

    if is_low_memory:
        batch_size = 8  # Reduce batch size for low memory
        max_epochs = min(max_epochs, 5)  # Reduce epochs for low memory
        logger.info("Low memory detected, reduced batch_size to 8, max_epochs to 5")

    # Multi-model evaluation with cross-validation
    model_names = ["vinai/phobert-base", "bert-base-multilingual-cased"]
    model_results = []
    tokenizers = {}  # Cache tokenizers for each model
    kfold = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=42)

    for model_name in model_names:
        logger.info(f"\n=== Training model: {model_name} with {num_folds}-fold cross-validation ===")

        # Load and cache tokenizer
        tokenizers[model_name] = AutoTokenizer.from_pretrained(model_name)

        # Prepare data
        X = combined_data['Content'].values
        y = combined_data['Label'].values
        fold_results = []
        best_fold_auc = 0
        best_fold_model_path = None

        for fold_idx, (train_idx, val_idx) in enumerate(kfold.split(X, y)):
            logger.info(f"\n--- Fold {fold_idx + 1}/{num_folds} ---")

            # Split data
            train_texts, val_texts = X[train_idx], X[val_idx]
            train_labels, val_labels = y[train_idx], y[val_idx]

            # Preprocess data
            logger.info("Preprocessing data...")
            train_input_ids, train_attention_mask, train_labels_data = preprocess_data(train_texts, train_labels, tokenizers[model_name], max_length=max_length)
            val_input_ids, val_attention_mask, val_labels_data = preprocess_data(val_texts, val_labels, tokenizers[model_name], max_length=max_length)

            # Create datasets
            logger.info("Creating datasets...")
            train_dataset = create_dataset(train_input_ids, train_attention_mask, train_labels_data, batch_size=batch_size, is_large_dataset=is_large_dataset)
            val_dataset = create_dataset(val_input_ids, val_attention_mask, val_labels_data, batch_size=batch_size, is_large_dataset=is_large_dataset)

            # Train model
            logger.info("Building model...")
            model = build_model(dropout_rate=0.4, learning_rate=1e-5, model_name=model_name)

            # Set up callbacks
            model_dir = f"./model/{model_name.split('/')[-1]}/fold_{fold_idx + 1}"
            os.makedirs(model_dir, exist_ok=True)
            callbacks = [
                EarlyStopping(patience=5, restore_best_weights=True),
                ModelCheckpoint(f"{model_dir}/checkpoint_loss.keras", monitor='val_loss', save_best_only=True),
                ModelCheckpoint(f"{model_dir}/checkpoint_auc.keras", monitor='val_auc', save_best_only=True, mode='max'),
                TensorBoard(log_dir=os.path.join(log_dir, f"{model_name.split('/')[-1]}/fold_{fold_idx + 1}"), histogram_freq=1),
                ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=3, min_lr=1e-6)
            ]

            logger.info("Training model...")
            start_time = time.time()
            history = model.fit(
                train_dataset,
                validation_data=val_dataset,
                epochs=max_epochs,
                callbacks=callbacks,
                class_weight=class_weight
            )
            training_time = time.time() - start_time
            logger.info(f"Total training time: {training_time:.2f} seconds")

            # Save training history
            history_path = os.path.join(log_dir, f"training_history_{model_name.split('/')[-1]}_fold_{fold_idx + 1}_{time.strftime('%Y%m%d-%H%M%S')}.json")
            with open(history_path, 'w') as f:
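                # default=float converts NumPy scalars (such as the logged learning rate) to plain floats
                json.dump(history.history, f, default=float)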
            logger.info(f"Training history saved to {history_path}")

            # Evaluate model
            logger.info("Evaluating model...")
            val_results = model.evaluate(val_dataset, verbose=1)
            logger.info(f"Validation Loss: {val_results[0]:.4f}")
            logger.info(f"Validation Accuracy: {val_results[1]:.4f}")
            logger.info(f"Validation AUC: {val_results[2]:.4f}")

            # Predict and compute metrics
            logger.info("Generating predictions for detailed evaluation...")
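            # Predict on the unshuffled tensors: create_dataset() shuffles, so predictions
            # from val_dataset would not line up with val_labels_data
            predictions = model.predict(
                {'input_ids': val_input_ids, 'attention_mask': val_attention_mask},
                batch_size=batch_size
            )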
            predicted_labels = (predictions > 0.5).astype(int).flatten()
            true_labels = val_labels_data.numpy().flatten()

            cm = confusion_matrix(true_labels, predicted_labels)
            logger.info(f"Confusion Matrix:\n{cm}")
            report = classification_report(true_labels, predicted_labels, target_names=['Real', 'Fake'], output_dict=True)
            logger.info(f"\nClassification Report:\n{classification_report(true_labels, predicted_labels, target_names=['Real', 'Fake'])}")

            # Compute F1-score and AUC
            f1 = f1_score(true_labels, predicted_labels)
            auc = roc_auc_score(true_labels, predictions.flatten())

            # Save model
            model_path = f"{model_dir}/fake_news_model.keras"
            save_model(model, model_path)

            # Store fold results
            fold_results.append({
                "fold": fold_idx + 1,
                "val_loss": float(val_results[0]),
                "val_accuracy": float(val_results[1]),
                "val_auc": float(val_results[2]),
                "f1_score": float(f1),
                "classification_report": report,
                "model_path": model_path
            })

            # Update best fold
            if auc > best_fold_auc:
                best_fold_auc = auc
                best_fold_model_path = model_path

        # Compute average metrics
        avg_results = {
            "model_name": model_name,
            "avg_val_loss": float(np.mean([r["val_loss"] for r in fold_results])),
            "avg_val_accuracy": float(np.mean([r["val_accuracy"] for r in fold_results])),
            "avg_val_auc": float(np.mean([r["val_auc"] for r in fold_results])),
            "avg_f1_score": float(np.mean([r["f1_score"] for r in fold_results])),
            "std_val_auc": float(np.std([r["val_auc"] for r in fold_results])),
            "folds": fold_results,
            "best_fold_model_path": best_fold_model_path
        }
        model_results.append(avg_results)

        # Log fold results
        logger.info(f"\nSummary for {model_name}:")
        logger.info(f"Average Validation Loss: {avg_results['avg_val_loss']:.4f}")
        logger.info(f"Average Validation Accuracy: {avg_results['avg_val_accuracy']:.4f}")
        logger.info(f"Average Validation AUC: {avg_results['avg_val_auc']:.4f}")
        logger.info(f"Average F1-score: {avg_results['avg_f1_score']:.4f}")
        logger.info(f"Standard Deviation AUC: {avg_results['std_val_auc']:.4f}")
        logger.info(f"Best fold model saved at: {best_fold_model_path}")

    # Save model comparison results
    comparison_path = os.path.join(log_dir, f"cross_validation_comparison_{time.strftime('%Y%m%d-%H%M%S')}.json")
    with open(comparison_path, 'w') as f:
        json.dump(model_results, f, indent=2, default=float)  # default=float guards against NumPy scalars
    logger.info(f"Cross-validation comparison results saved to {comparison_path}")

    # Log best model
    best_model = max(model_results, key=lambda x: x["avg_val_auc"])
    logger.info(f"\nBest model based on average AUC: {best_model['model_name']}")
    logger.info(f"Average Validation AUC: {best_model['avg_val_auc']:.4f}")
    logger.info(f"Average F1-score: {best_model['avg_f1_score']:.4f}")
    logger.info(f"Best fold model saved at: {best_model['best_fold_model_path']}")

    # Example inference
    logger.info("\nTesting inference on sample text...")
    sample_texts = [
        "Chính phủ Việt Nam công bố kế hoạch phát triển năng lượng tái tạo đến năm 2030.",
        "Cá mập biết bay xuất hiện ở Hà Nội, gây hoang mang dư luận."
    ]
    best_model_instance = load_model(best_model["best_fold_model_path"])
    best_tokenizer = tokenizers[best_model["model_name"]]
    predictions = predict_news(best_model_instance, best_tokenizer, sample_texts, max_length=max_length)
    for pred in predictions:
        logger.info(f"Text: {pred['text']}")
        logger.info(f"Probability Fake: {pred['probability_fake']:.4f}")
        logger.info(f"Label: {pred['label']}\n")

except Exception as e:
    logger.error(f"Error in execution: {e}")
    import traceback
    logger.error(traceback.format_exc())

AttributeError: module 'tensorflow._api.v2.compat.v2.__internal__' has no attribute 'register_load_context_function'
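
One likely cause of this error is a standalone `keras` package that is newer than the installed TensorFlow, so that Keras calls a TensorFlow internal that does not exist yet. This is an assumption based on the error text, not something confirmed for this run. A minimal sketch for checking it, assuming the TensorFlow 2.x convention that the `keras` and `tensorflow` packages share the same major.minor version (the helper names are hypothetical):

```python
from importlib import metadata


def major_minor(version: str):
    """Return the (major, minor) part of a version string such as '2.10.0'."""
    return tuple(int(part) for part in version.split(".")[:2])


def versions_aligned(tf_version: str, keras_version: str) -> bool:
    """True when both packages are on the same major.minor release."""
    return major_minor(tf_version) == major_minor(keras_version)


def installed_version(package: str):
    """Installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


tf_version = installed_version("tensorflow")
keras_version = installed_version("keras")
if tf_version and keras_version and not versions_aligned(tf_version, keras_version):
    print(f"Version mismatch: tensorflow {tf_version} vs keras {keras_version}")
```

If a mismatch is reported, reinstalling `keras` at the same major.minor version as TensorFlow and restarting the runtime is the usual remedy.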

## Viewing Results

- **Logs**: Check the `fake_news_classifier.log` file in the Colab file explorer for detailed logs, including the per-fold `Validation Accuracy`, `Validation AUC`, and `Classification Report`.
- **TensorBoard**: Run the cell below to visualize training metrics (loss, accuracy, AUC) in TensorBoard.
- **Model Files**: Each fold's model is saved as `./model/<model_name>/fold_<n>/fake_news_model.keras`, or as weights in `fake_news_model.weights.h5` in the same directory if saving the full model fails.
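
The pipeline cell also writes a `cross_validation_comparison_<timestamp>.json` file into each run's directory under `./logs`, with one entry per model containing `model_name`, `avg_val_auc`, and related fields. A minimal sketch for reading the newest file and picking the best model is shown below; the function name is hypothetical, and the demo uses made-up numbers rather than real results.

```python
import glob
import json
import os
import tempfile


def best_model_from_comparison(log_root="./logs"):
    """Return the entry with the highest avg_val_auc from the newest comparison file."""
    pattern = os.path.join(log_root, "*", "cross_validation_comparison_*.json")
    paths = sorted(glob.glob(pattern), key=os.path.getmtime)
    if not paths:
        return None  # no run has finished yet
    with open(paths[-1]) as f:
        results = json.load(f)
    return max(results, key=lambda entry: entry["avg_val_auc"])


# Demo with made-up numbers (not real results)
with tempfile.TemporaryDirectory() as demo_root:
    run_dir = os.path.join(demo_root, "20240101-000000")
    os.makedirs(run_dir)
    demo_file = os.path.join(run_dir, "cross_validation_comparison_20240101-000000.json")
    with open(demo_file, "w") as f:
        json.dump([
            {"model_name": "model-a", "avg_val_auc": 0.5},
            {"model_name": "model-b", "avg_val_auc": 0.75},
        ], f)
    demo_best = best_model_from_comparison(demo_root)

print(demo_best["model_name"])  # model-b
```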

### Launch TensorBoard
Run the following cell to start TensorBoard and view training progress.

In [13]:
%load_ext tensorboard
%tensorboard --logdir ./logs