### 3.6 Working with TensorFlow Datasets

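In this section, we reproduce the baseline NLP setup with a hand-built `tf.data.Dataset` pipeline instead of the `.to_tf_dataset()` helper from the Hugging Face `datasets` library. The data is still loaded and tokenized with Hugging Face tools; only the conversion into TensorFlow input is done manually. Hugging Face datasets offer rich features and streamlined integration with Transformers, but a native TensorFlow pipeline remains a robust alternative, often preferred in contexts where: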

- Native TensorFlow data pipelines are required,
- Seamless integration with `tf.data` APIs and performance optimizations (e.g., prefetching, caching, parallelization) is desired,
- Projects are bound to TensorFlow/TPU environments with minimal external dependencies.
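
As a rough illustration of the `tf.data` optimizations listed above, the sketch below chains a parallel `map`, `cache`, `shuffle`, `batch`, and `prefetch` on synthetic data. The toy tensors and the `add_length` helper are assumptions made for this example and are not part of the project code.

```python
import tensorflow as tf

# Synthetic stand-in for tokenized text: 8 sequences of 4 token ids (assumed data).
token_ids = tf.reshape(tf.range(32), (8, 4))
labels = tf.constant([0, 1, 2, 0, 1, 2, 0, 1])

def add_length(ids, label):
    # Hypothetical per-example transform: count the non-zero token ids.
    length = tf.math.count_nonzero(ids, dtype=tf.int32)
    return {"input_ids": ids, "length": length}, label

pipeline = (
    tf.data.Dataset.from_tensor_slices((token_ids, labels))
    .map(add_length, num_parallel_calls=tf.data.AUTOTUNE)  # parallelization
    .cache()                                                # caching after the map
    .shuffle(buffer_size=8)                                 # buffer covers all 8 rows
    .batch(4)
    .prefetch(tf.data.AUTOTUNE)                             # prefetching
)

first_features, first_labels = next(iter(pipeline))
print(first_features["input_ids"].shape, first_labels.shape)
```

Placing `cache()` after `map()` means the transformed examples are stored once and reused in later epochs, while `shuffle()` after `cache()` still reorders them on every pass.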

This notebook demonstrates compatibility with native `tf.data` pipelines by mirroring one of the earlier experiments. It is a standalone detour meant purely for demonstration: **no significant analytical insights were gained** here in the context of the project’s primary goals. For the remainder of the research, the Hugging Face `datasets` library will continue to be used for data handling.


In [None]:
# Imports
from datasets import load_dataset
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
import tensorflow as tf
import numpy as np

In [None]:
# Load dataset from Hugging Face
dataset = load_dataset("financial_phrasebank", "sentences_allagree")
checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Tokenize the dataset manually
def encode_dataset(dataset_split, max_length=128):
    input_ids, attention_masks, labels = [], [], []

    for example in dataset_split:
        encoded = tokenizer(
            example["sentence"],
            truncation=True,
            padding="max_length",
            max_length=max_length
        )
        input_ids.append(encoded["input_ids"])
        attention_masks.append(encoded["attention_mask"])
        labels.append(example["label"])

    return input_ids, attention_masks, labels

# Encode training data only (demo)
input_ids, attention_masks, labels = encode_dataset(dataset["train"])

# Manually build a tf.data.Dataset (NOT using .to_tf_dataset)
def create_tf_dataset(input_ids, attention_masks, labels, batch_size=16):
    tf_dataset = tf.data.Dataset.from_tensor_slices((
        {
            "input_ids": tf.constant(input_ids),
            "attention_mask": tf.constant(attention_masks)
        },
        tf.constant(labels)
    ))
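    # Size the shuffle buffer to the whole split; a fixed buffer of 1000
    # only partially shuffles a split that has more examples than that.
    return tf_dataset.shuffle(len(labels)).batch(batch_size).prefetch(tf.data.AUTOTUNE)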

# Create TensorFlow dataset
tf_dataset_manual = create_tf_dataset(input_ids, attention_masks, labels)

# Load model and compile
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"]
)

# Train
history = model.fit(tf_dataset_manual, epochs=3)
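
Before training on a hand-built dataset, it can help to confirm that each element has the `(features_dict, labels)` structure that Keras `fit` consumes. The following is a minimal sketch of such a check on toy values; the token ids and labels are assumed stand-ins for real tokenizer output.

```python
import tensorflow as tf

# Toy stand-ins for tokenizer output: 5 sequences padded to length 6 (assumed values).
input_ids = [[101, 2023, 102, 0, 0, 0]] * 5
attention_masks = [[1, 1, 1, 0, 0, 0]] * 5
labels = [0, 1, 2, 1, 0]

dataset = tf.data.Dataset.from_tensor_slices((
    {
        "input_ids": tf.constant(input_ids),
        "attention_mask": tf.constant(attention_masks),
    },
    tf.constant(labels),
)).batch(2)

# element_spec describes the structure, shapes, and dtypes without iterating.
feature_spec, label_spec = dataset.element_spec
print(feature_spec)
print(label_spec)

# Iterating shows the concrete batch shapes, including the smaller final batch.
batch_shapes = [features["input_ids"].shape.as_list() for features, _ in dataset]
print(batch_shapes)
```

The final batch here holds a single example because `batch()` keeps the remainder unless `drop_remainder=True` is passed.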


#### Why This Is Different

In the main workflow, we typically:
- Load the dataset using `datasets.load_dataset()` (Hugging Face)
- Tokenize using `dataset.map(...)`
- Convert directly to TensorFlow format using `.to_tf_dataset(...)`

This route is convenient and efficient, but it abstracts away the mechanics of how TensorFlow ingests and prepares data for training. The `.to_tf_dataset()` method wraps a lot of logic (batching, formatting, and padding) into a single call.
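
To make the padding step concrete, here is a minimal NumPy sketch of per-batch (dynamic) padding, the kind of work a padding-aware collate function usually performs. It contrasts with the fixed `max_length=128` padding applied at tokenization time in the manual pipeline. The helper name `pad_batch` and the toy token ids are hypothetical, and this is not the library's actual implementation.

```python
import numpy as np

def pad_batch(sequences, pad_id=0):
    """Pad variable-length token-id lists to the longest sequence in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = np.full((len(sequences), max_len), pad_id, dtype=np.int32)
    attention_mask = np.zeros((len(sequences), max_len), dtype=np.int32)
    for row, seq in enumerate(sequences):
        input_ids[row, :len(seq)] = seq       # real tokens on the left
        attention_mask[row, :len(seq)] = 1    # 1 marks real tokens, 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two toy sequences of different lengths (assumed token ids).
batch = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

With dynamic padding, each batch is only as wide as its longest sequence, whereas fixed-length padding makes every batch 128 tokens wide regardless of content.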

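The manual pipeline in this section shows how to bridge tokenized NLP data and TensorFlow training infrastructure without relying on that helper: padding is fixed at tokenization time (`padding="max_length"`), and shuffling, batching, and prefetching are chained explicitly on a `tf.data.Dataset`.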

