### 3.6 Working with TensorFlow Datasets

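In this section, we reproduce the baseline NLP setup using a hand-built `tf.data.Dataset` pipeline (abbreviated TFDS in this notebook, not to be confused with the `tensorflow_datasets` package) instead of the conversion helpers in the Hugging Face `datasets` library. The raw data is still loaded with `datasets.load_dataset()`; only tokenization, tensor conversion, and splitting are done by hand. While Hugging Face datasets offer rich features and streamlined integration with Transformers, a native `tf.data` pipeline remains a robust alternative, often preferred in contexts where: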

- Native TensorFlow data pipelines are required,
- Seamless integration with `tf.data` APIs and performance optimizations (e.g., prefetching, caching, parallelization) is desired,
- Projects are bound to TensorFlow/TPU environments with minimal external dependencies.

This notebook demonstrates that the workflow is compatible with a native `tf.data` pipeline by mirroring one of the earlier experiments. It is a standalone detour meant purely for demonstration: **no significant analytical insights were gained** here in the context of the project’s primary goals. For the remainder of the research, the Hugging Face `datasets` library will continue to be used for data handling.


In [1]:
# Install dependencies
!pip install transformers scikit-learn pandas numpy tqdm tensorflow
!pip install -q datasets

# Imports
from datasets import load_dataset
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
import tensorflow as tf
import numpy as np



In [2]:
# Ensure reproducibility
tf.random.set_seed(42)

# Load the dataset from Hugging Face
def load_raw_dataset():
    return load_dataset("financial_phrasebank", "sentences_allagree")["train"]

# Initialize tokenizer
def load_tokenizer():
    # trust_remote_code is not needed for a standard Hub tokenizer such as DistilBERT
    return AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Tokenize the dataset manually
def tokenize_dataset(dataset_split, tokenizer, max_length=256):
    input_ids, attention_masks, labels = [], [], []

    for example in dataset_split:
        encoded = tokenizer(
            example["sentence"],
            truncation=True,
            padding="max_length",
            max_length=max_length
        )
        input_ids.append(encoded["input_ids"])
        attention_masks.append(encoded["attention_mask"])
        labels.append(example["label"])

    return input_ids, attention_masks, labels

# Build the full tf.data.Dataset and split AFTER creation
def build_tf_datasets(input_ids, attention_masks, labels, batch_size=16, val_frac=0.1, test_frac=0.1):
    total_size = len(input_ids)
    val_size = int(val_frac * total_size)
    test_size = int(test_frac * total_size)

    full_dataset = tf.data.Dataset.from_tensor_slices((
        {
            "input_ids": tf.constant(input_ids),
            "attention_mask": tf.constant(attention_masks)
        },
        tf.constant(labels)
    ))

    # Shuffle once and split into test, val, and train
    full_dataset = full_dataset.shuffle(total_size, reshuffle_each_iteration=False)
    test_ds = full_dataset.take(test_size).batch(batch_size).prefetch(tf.data.AUTOTUNE)
    val_ds = full_dataset.skip(test_size).take(val_size).batch(batch_size).prefetch(tf.data.AUTOTUNE)
    train_ds = full_dataset.skip(test_size + val_size).batch(batch_size).prefetch(tf.data.AUTOTUNE)

    return train_ds, val_ds, test_ds

# Load and compile the model
def load_model():
    model = TFAutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=3
    )
    # Freeze encoder to reduce overfitting
    model.distilbert.trainable = False

    # Learning rate schedule
    lr_schedule = tf.keras.optimizers.schedules.PolynomialDecay(
        initial_learning_rate=5e-5,
        decay_steps=1000,
        end_learning_rate=1e-6,
        power=1.0
    )

    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=lr_schedule),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"]
    )
    return model

# Main runner
def main():
    raw_dataset = load_raw_dataset()
    tokenizer = load_tokenizer()
    input_ids, attention_masks, labels = tokenize_dataset(raw_dataset, tokenizer)

    train_ds, val_ds, test_ds = build_tf_datasets(input_ids, attention_masks, labels)

    model = load_model()
    model.summary()

    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=2, restore_best_weights=True
    )

    history = model.fit(
        train_ds,
        validation_data=val_ds,
        epochs=3,
        callbacks=[early_stop]
    )

    eval_loss, eval_accuracy = model.evaluate(test_ds)
    print(f"\nEvaluated Test Loss: {eval_loss:.4f}, Evaluated Test Accuracy: {eval_accuracy:.4f}")

    return model, history

# Run the pipeline
model_manual_tfds, history_manual_tfds = main()




Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.bias']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a

Model: "tf_distil_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMa  multiple                  66362880  
 inLayer)                                                        
                                                                 
 pre_classifier (Dense)      multiple                  590592    
                                                                 
 classifier (Dense)          multiple                  2307      
                                                                 
 dropout_19 (Dropout)        multiple                  0 (unused)
                                                                 
Total params: 66955779 (255.42 MB)
Trainable params: 592899 (2.26 MB)
Non-trainable params: 66362880 (253.15 MB)
_________________________________________________________________
Epoch 1/3
Epoch 2/3
Epoch 3/3

Evaluated Te

#### Why This Is Different

In the main workflow, we typically:
- Load the dataset using `datasets.load_dataset()` (Hugging Face)
- Tokenize using `dataset.map(...)`
- Convert directly to TensorFlow format using `.to_tf_dataset(...)`

While convenient and efficient, this abstracts away the underlying mechanics of how TensorFlow ingests and prepares data for training. The `.to_tf_dataset()` method wraps a lot of logic (batching, formatting, and padding) into a single function.
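To make the padding and batching steps concrete, here is a minimal pure-Python sketch of what happens to each sequence before it reaches the model. It is a toy emulation, not the tokenizer's or the `datasets` library's actual implementation: the helper names `pad_and_mask` and `batches`, the token IDs, and the choice of `0` as the padding ID are all assumptions made for illustration.

```python
from typing import Dict, Iterator, List

PAD_ID = 0  # assumption: 0 is used as the padding token id in this toy example


def pad_and_mask(token_ids: List[int], max_length: int) -> Dict[str, List[int]]:
    """Truncate or right-pad one sequence and build its attention mask.

    A real tokenizer typically keeps the closing special token when
    truncating; this toy version simply cuts the list.
    """
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    return {
        "input_ids": kept + [PAD_ID] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


def batches(
    rows: List[Dict[str, List[int]]], batch_size: int
) -> Iterator[Dict[str, List[List[int]]]]:
    """Group padded rows into column-wise batches; the last one may be smaller."""
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        yield {
            key: [row[key] for row in chunk]
            for key in ("input_ids", "attention_mask")
        }


# Made-up token ids of different lengths (hypothetical data)
toy_sequences = [[101, 7592, 102], [101, 2023, 2003, 1037, 3231, 102], [101, 102]]
padded = [pad_and_mask(seq, max_length=5) for seq in toy_sequences]
batched = list(batches(padded, batch_size=2))
```

Because every row has the same length after padding, the rows can be stacked into rectangular tensors, which is what `tf.constant(input_ids)` in the notebook cell relies on.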

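The manual pipeline in this section makes those steps explicit: the text is tokenized example by example, the resulting lists are converted to tensors with `tf.data.Dataset.from_tensor_slices`, and the train, validation, and test splits are carved out with `take` and `skip`. It shows how to bridge tokenized NLP data and TensorFlow training infrastructure without relying on helper abstractions.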
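The split arithmetic used in `build_tf_datasets` can be checked without TensorFlow. The sketch below mirrors the same logic on plain index lists: one fixed shuffle, then slices that play the role of `take` and `skip`. The function name `split_indices` and the dataset size of 1000 are hypothetical, and Python's `random` module stands in for `Dataset.shuffle`, so the resulting order will not match TensorFlow's.

```python
import random
from typing import List, Tuple


def split_indices(
    total_size: int,
    val_frac: float = 0.1,
    test_frac: float = 0.1,
    seed: int = 42,
) -> Tuple[List[int], List[int], List[int]]:
    """Return disjoint (train, val, test) index lists from one fixed shuffle."""
    val_size = int(val_frac * total_size)
    test_size = int(test_frac * total_size)

    order = list(range(total_size))
    # Shuffle once with a fixed seed. This plays the role of
    # reshuffle_each_iteration=False: the order must not change between
    # passes, otherwise the three slices would overlap across epochs.
    random.Random(seed).shuffle(order)

    test = order[:test_size]                      # like take(test_size)
    val = order[test_size:test_size + val_size]   # like skip(test_size).take(val_size)
    train = order[test_size + val_size:]          # like skip(test_size + val_size)
    return train, val, test


train_idx, val_idx, test_idx = split_indices(1000)  # hypothetical dataset size
print(len(train_idx), len(val_idx), len(test_idx))
```

The key design point is that the shuffle order is frozen before slicing. If the order changed on every pass, an example could land in the training slice in one epoch and in the test slice in the next.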

