<a href="https://colab.research.google.com/github/joannachang1028/95820_Application-of-NL-X-and-LLM/blob/main/95_820_A1_LSTM_Example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Goal & Overview

**Goal.** Demonstrate the *same* NLP model—**Embedding → LSTM → Dense (logits)**—implemented in both **PyTorch** and **TensorFlow/Keras**, trained and evaluated on the **exact same inputs**, for a clean apples-to-apples comparison.

**Task.** Tiny **binary sentiment** classification (1 = positive, 0 = negative) on a hand-made list of short movie-review–style sentences.

**What we keep identical across PyTorch and TensorFlow/Keras**

- **Data & preprocessing**
  - Tokenization: simple whitespace; all text lowercased
  - Vocabulary & ID mapping shared by both frameworks, with special tokens:
    - `<pad>` → 0 (padding)
    - `<unk>` → 1 (out-of-vocabulary)
  - Fixed sequence length `MAX_LEN` with **right-truncation** (drop extra tokens) and **right-padding** (append `<pad>`)
  - Deterministic train/validation split using a fixed random seed

- **Model architecture**
  - `Embedding(EMBED_DIM) → LSTM(HIDDEN_DIM) → Dense(NUM_CLASSES logits)` (no activation on the final layer; we use logits)

- **Training recipe**
  - Optimizer: **Adam** with learning rate **1e-3**
  - Loss: **cross-entropy on logits** (`from_logits=True` in Keras; `nn.CrossEntropyLoss` in PyTorch)
  - Epochs: small and identical in both
  - Batch size: **full-batch** (one update per epoch) for clarity and brevity

- **Inference**
  - Same sample sentences evaluated in both implementations
  - Report **softmax probabilities** `[P(neg), P(pos)]` and the predicted label

> **Why these simplifications?**
> We intentionally **do not use masking** and we take the **final time step** (even if it corresponds to padding) so the two frameworks behave identically for teaching. In production, you’d typically enable masking (e.g., `mask_zero=True` in Keras) or use packed sequences in PyTorch to ignore padded positions.

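**Takeaway.** The two implementations are designed to be functionally equivalent. Small differences in results come from framework-level details (default weight initializers, the frozen `<pad>` embedding that `padding_idx` creates in PyTorch, numeric nuances), not from modeling choices.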
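One framework-level difference can be checked with plain arithmetic. PyTorch's `nn.LSTM` stores two bias vectors per layer (`bias_ih` and `bias_hh`), while the Keras `LSTM` layer stores one, so the two layers do not have the same number of trainable parameters even when their sizes match. The sketch below counts LSTM parameters for this notebook's `EMBED_DIM = 16` and `HIDDEN_DIM = 32`; the helper name `lstm_param_count` is hypothetical and introduced only for illustration.

```python
def lstm_param_count(embed_dim, hidden_dim, bias_vectors):
    """Parameters in one LSTM layer: 4 gates, each with input and recurrent weights."""
    weights = 4 * hidden_dim * (embed_dim + hidden_dim)
    biases = bias_vectors * 4 * hidden_dim
    return weights + biases

EMBED_DIM, HIDDEN_DIM = 16, 32  # values used later in this notebook

pytorch_lstm = lstm_param_count(EMBED_DIM, HIDDEN_DIM, bias_vectors=2)  # bias_ih + bias_hh
keras_lstm = lstm_param_count(EMBED_DIM, HIDDEN_DIM, bias_vectors=1)    # single bias

print(pytorch_lstm, keras_lstm)  # 6400 6272
```

The 128 extra parameters in PyTorch are the second bias vector. The two biases are added together inside each gate, so both layers can represent the same functions even though their parameter counts differ.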


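### The "training recipe" in detail
The training recipe is the set of strategies and parameters used to train a machine learning model. It covers how the model's error is measured (the loss function), how the model's internal parameters are adjusted to reduce that error (the optimizer and learning rate), and how data is fed to the model during training (epochs and batch size). In this notebook, the recipe is chosen so that the PyTorch and TensorFlow models learn under the same conditions, which keeps the comparison fair.

* Optimizer:
An optimizer is an algorithm that updates the model's weights using the gradients of the loss function. The gradient gives the direction and magnitude of the loss's change with respect to each weight, and the optimizer uses that information to adjust the weights so the loss decreases step by step.
Adam (Adaptive Moment Estimation) is a very popular optimizer. It combines the strengths of two other optimizers (RMSprop and AdaGrad) and copes well with sparse gradients and non-stationary objectives. Adam adapts the effective step size for each parameter individually and often converges faster than plain stochastic gradient descent (SGD).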
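To make the per-parameter adaptation concrete, here is a minimal sketch of one Adam update for a single scalar weight, using the commonly cited default constants (`beta1=0.9`, `beta2=0.999`, `eps=1e-8`). The function `adam_step` and the toy objective are assumptions made for illustration; this is not how either framework implements its optimizer internally.

```python
import math

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (illustrative sketch)."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Toy objective f(w) = (w - 3)^2 with gradient 2 * (w - 3)
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 4):
    grad = 2 * (w - 3)
    w, m, v = adam_step(w, grad, m, v, t)
    print(t, round(w, 6))
```

In these first steps the weight moves by roughly `lr` per update regardless of how large the raw gradient is, because the update divides by the square root of the squared-gradient average.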

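* Learning rate:
The learning rate is a hyperparameter that sets how large a step the optimizer takes on each weight update.
1e-3 (0.001) is a common default. The best value varies with the task and the model.
The learning rate affects both training speed and the final result:
  - Too high: the model may oscillate around a minimum of the loss, or even diverge, which makes training unstable or causes it to fail.
  - Too low: the model learns very slowly, needs more epochs to converge, and may get stuck in a local minimum.

Is 1e-3 fast or slow? Compared with very small learning rates (such as 1e-5), 1e-3 is moderately fast. For this small dataset it is a reasonable choice that allows quick experiments.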
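The effect of step size can be seen on a one-dimensional toy problem. The sketch below runs plain gradient descent (not Adam) on `f(w) = w**2`; the function name `run_gd` and the three learning rates are assumptions picked to make the three regimes visible, not values from this notebook.

```python
def run_gd(lr, steps=20, w0=1.0):
    """Plain gradient descent on f(w) = w**2, whose gradient is 2*w."""
    w = w0
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

too_high = run_gd(lr=1.1)    # each step multiplies w by -1.2: oscillates and grows
too_low = run_gd(lr=0.001)   # each step multiplies w by 0.998: barely moves
moderate = run_gd(lr=0.1)    # each step multiplies w by 0.8: steady progress

print(too_high, too_low, moderate)
```

After 20 steps the high rate has moved away from the minimum at 0, the low rate is still close to its starting point, and the moderate rate is near 0.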

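* Loss function:
The loss function measures the gap between the model's predictions and the true labels. Training aims to minimize it.
Cross-entropy loss is a common loss for classification. It measures the difference between the model's predicted probability distribution and the true label distribution.
In this binary task the model outputs logits (raw scores with no activation applied) for two classes. Both `tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)` (TensorFlow) and `nn.CrossEntropyLoss()` (PyTorch) accept logits directly and compute the corresponding cross-entropy loss.
In Keras, `from_logits=True` tells the loss that the model outputs logits rather than probabilities that have already passed through softmax. The loss then applies softmax internally, which is more numerically stable. PyTorch's `nn.CrossEntropyLoss` always expects logits.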
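A minimal sketch of why working from logits is numerically safer: the loss for one example can be written as `log_sum_exp(logits) - logits[label]`, and subtracting the maximum logit first keeps the exponentials from overflowing. The function name `cross_entropy_from_logits` is hypothetical; the frameworks' own implementations are more general.

```python
import math

def cross_entropy_from_logits(logits, label):
    """Cross-entropy for one example, computed directly from logits."""
    m = max(logits)                                      # shift by the max for stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]                   # equals -log(softmax(logits)[label])

loss_pos = cross_entropy_from_logits([1.0, 2.0], label=1)   # true class has the larger logit
loss_neg = cross_entropy_from_logits([1.0, 2.0], label=0)   # true class has the smaller logit
big = cross_entropy_from_logits([1000.0, 0.0], label=0)     # exp(1000) is never evaluated

print(round(loss_pos, 4), round(loss_neg, 4), big)
```

A naive version that first computes `math.exp(1000.0)` would raise `OverflowError`, which is the kind of failure the shifted form avoids.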
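* Epochs:
One epoch is one complete pass over the training set, with every example used for parameter updates once.
In this notebook, `EPOCHS = 3` means the model makes 3 complete passes over the training set.
For a small dataset and a demo, a handful of epochs is acceptable and shows the training process quickly. Real applications usually need many more epochs for the model to learn well.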
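* Batch size:
The batch size sets how many training examples are used for each parameter update.
  - Full-batch: in this notebook, `BATCH_SIZE_FULL = len(X_train)`, so every weight update uses all training examples. This is what "full-batch gradient descent" means.
  - Advantage: on a small dataset, full-batch training gives a more accurate gradient estimate.
  - Drawback: on a large dataset, full-batch training is very expensive in memory and compute and may converge more slowly. In practice, mini-batch training is more common because it balances efficiency against the accuracy of the gradient estimate.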
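The link between batch size and the number of weight updates is simple arithmetic. The sketch below assumes the 9 training examples this notebook's split produces; the mini-batch size of 4 is a hypothetical value chosen for contrast, and `updates_per_epoch` is an illustrative helper name.

```python
import math

def updates_per_epoch(num_examples, batch_size):
    """Number of optimizer steps in one pass over the data (last batch may be smaller)."""
    return math.ceil(num_examples / batch_size)

n_train = 9   # training-set size in this notebook (75% of 12 examples)
epochs = 3

full_batch = updates_per_epoch(n_train, batch_size=n_train)  # all examples in one batch
mini_batch = updates_per_epoch(n_train, batch_size=4)        # hypothetical mini-batch size

print(full_batch * epochs, mini_batch * epochs)  # 3 9
```

With full-batch training and 3 epochs the model receives only 3 updates in total, which is worth keeping in mind when reading the training logs.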

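* Softmax probabilities:
The softmax function is a common activation, typically applied after the output layer in multi-class classification. It converts the model's raw outputs (logits) into a probability distribution: every class gets a value between 0 and 1, and the values sum to 1.
In this binary task, softmax turns the model's two logits into two probabilities `[P(neg), P(pos)]`, the probabilities of the negative and positive classes. For example, logits `[1.0, 2.0]` give approximately `[0.2689, 0.7311]`, meaning the model assigns about 26.89% to negative and about 73.11% to positive.
* Are softmax probabilities a metric for binary classification? No. Softmax probabilities do not measure overall classification performance. They are the model's output for a single example and express its confidence that the example belongs to each class.
Common metrics for binary classification include accuracy, precision, recall, F1 score, and AUC (area under the ROC curve).
You can use the softmax probabilities to choose the predicted class (usually the one with the highest probability) and then compare those predictions with the true labels to compute the metrics above.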
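The worked example with logits `[1.0, 2.0]` can be reproduced in a few lines. This is a plain-Python sketch for illustration; the notebook itself uses `torch.softmax` and `tf.nn.softmax`.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)                              # shift by the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0])        # [P(neg), P(pos)]
pred = probs.index(max(probs))     # index of the most probable class

print([round(p, 4) for p in probs], pred)  # [0.2689, 0.7311] 1
```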
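The step from probabilities to metrics can be sketched as follows. The probabilities and labels below are made-up values for illustration, not outputs of this notebook, and `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up softmax outputs [P(neg), P(pos)] for four examples
probs = [[0.3, 0.7], [0.6, 0.4], [0.2, 0.8], [0.55, 0.45]]
y_true = [1, 0, 0, 1]

y_pred = [0 if p[0] >= p[1] else 1 for p in probs]   # pick the more probable class
metrics = binary_metrics(y_true, y_pred)
print(y_pred, metrics)
```

Here one prediction lands in each cell of the confusion matrix, so all four metrics come out to 0.5.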

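### Why deliberately skip masking?
* What is masking?
When handling variable-length sequences (such as sentences), we usually pad them to a common length (`MAX_LEN`) so that several sequences can be stacked into one batch. The padded positions hold a special padding token (for example `<pad>`).
Masking tells the model which positions are padding so it can ignore them in its computations. For example, padded positions should not contribute when computing an average or attention scores.
* Why not use masking in this example?
PyTorch and TensorFlow/Keras handle masking in subtly different ways. Both provide masking, but the underlying implementations and default behaviors can differ.

To keep those framework-level differences from affecting the comparison, this example deliberately skips masking. The model therefore treats padded positions (`<pad>`) as valid input: their embeddings and the LSTM computations over them all count.

* Consequences:
The model has to learn how to handle padded positions. Ideally it learns that the `<pad>` embedding carries no useful information and that the LSTM state should change little when it processes padding.

In real applications masking is the usual choice, because it improves efficiency (no computation on padded positions) and quality (padding does not interfere with the model). For this comparison, giving up a little performance in exchange for matching behavior across frameworks is worth it.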
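The difference between reading a fixed final position and reading the last real token can be shown on one encoded sentence. The sketch below uses the encoding of "this film was great" from this notebook's vocabulary; `last_real_index` is a hypothetical helper that mimics what a mask would identify, not an API of either framework.

```python
PAD_ID = 0
MAX_LEN = 6

def last_real_index(ids, pad_id=PAD_ID):
    """Index of the last non-pad token in a right-padded sequence (-1 if all pad)."""
    real = [i for i, tok in enumerate(ids) if tok != pad_id]
    return real[-1] if real else -1

encoded = [27, 13, 29, 14, 0, 0]          # "this film was great", right-padded to MAX_LEN

fixed_step = MAX_LEN - 1                  # what this notebook reads: a <pad> position
masked_step = last_real_index(encoded)    # what a masked model would effectively read

print(fixed_step, masked_step)  # 5 3
```

Without masking, the LSTM runs two extra steps on `<pad>` before the classifier sees its output.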
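### Why does taking the final time step give the same behavior in PyTorch and TensorFlow?

* LSTM outputs:
An LSTM layer typically exposes two things:
  - The hidden states at every time step, with shape `(batch_size, sequence_length, hidden_dim)` when the batch dimension comes first. This holds the LSTM's hidden state after each token in the sequence.
  - The final hidden and cell states. In PyTorch each has shape `(num_layers * num_directions, batch_size, hidden_dim)`. They represent the LSTM's state after it has processed the whole sequence.
* Why take the last time step of the per-step hidden states?
In this example the model does not use the final state tuple `(h_n, c_n)` directly. Instead it takes the last time step of the per-step hidden-state tensor. For a single-layer, unidirectional LSTM these hold the same values.
In PyTorch this is `out[:, -1, :]`.
In TensorFlow/Keras, with `return_sequences=False` (the default), the LSTM layer directly returns each example's hidden state at the last time step.
* How is identical behavior ensured?
Because every sequence is padded to the same `MAX_LEN`, the "last time step" of every sequence is index `MAX_LEN - 1`.
Whether that step holds the sequence's real final token or a `<pad>` token, both frameworks read the hidden state from the same index of the LSTM output.
The key point: because we do not use masking, the LSTM processes `<pad>` with a standard forward step, and that computation is equivalent in PyTorch and TensorFlow (given the same inputs and initial weights). So whether the last time step is a real token or `<pad>`, both frameworks derive its hidden state with the same logic.
With masking enabled, the frameworks treat padded positions differently, which changes how the last valid time step is selected and computed and therefore changes the model's behavior.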
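To see why the last per-step output and the final hidden state coincide, and that the state keeps changing on padded steps when nothing is masked, here is a toy one-unit LSTM in plain Python. All weights are arbitrary numbers chosen for illustration, the pad input is taken as `0.0`, and `lstm_forward` is a hypothetical function, not either framework's implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Arbitrary illustrative weights per gate: (input weight, recurrent weight, bias)
WEIGHTS = {
    "i": (0.5, 0.3, 0.1),    # input gate
    "f": (0.4, 0.2, 0.0),    # forget gate
    "o": (0.6, -0.1, 0.2),   # output gate
    "g": (0.7, 0.5, -0.1),   # candidate cell value
}

def lstm_forward(xs):
    """Run a 1-unit LSTM over scalars; return per-step outputs and the final (h, c)."""
    h, c = 0.0, 0.0
    outputs = []
    for x in xs:
        pre = {k: wx * x + wh * h + b for k, (wx, wh, b) in WEIGHTS.items()}
        i, f, o = sigmoid(pre["i"]), sigmoid(pre["f"]), sigmoid(pre["o"])
        g = math.tanh(pre["g"])
        c = f * c + i * g            # update the cell state
        h = o * math.tanh(c)         # hidden state, which is also this step's output
        outputs.append(h)
    return outputs, (h, c)

# Two "real" inputs followed by two padded positions (0.0)
outputs, (h_n, c_n) = lstm_forward([1.0, -0.5, 0.0, 0.0])

print(outputs[-1] == h_n)   # True: last per-step output is the final hidden state
print(outputs[1], outputs[3])
```

The hidden state after the last real input (`outputs[1]`) differs from the state after the two pad steps (`outputs[3]`), which is the drift the classifier has to learn to tolerate when masking is off.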

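In summary, the simplifications exist to:

1. Avoid potential differences between PyTorch and TensorFlow/Keras in how they handle variable-length sequences and masking.
2. Make both frameworks read the LSTM output at exactly the same position (the fixed index `MAX_LEN - 1`) when processing padded sequences.
3. Keep the inputs, the computation in every layer (including padded positions), and the way the final representation is taken from the LSTM logically the same in both models, so the comparison is truly apples-to-apples.

The cost is that the model may have to learn to ignore padded positions, and it handles variable-length sequences less efficiently than a model that uses masking. For comparing the frameworks themselves, though, this is an effective way to control variables.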

In [None]:
# @title Shared setup: tiny dataset, vocab, encoding, splits
import numpy as np
from collections import Counter
import random

# --------------------------
# Reproducibility
# --------------------------
SEED = 1337
random.seed(SEED)
np.random.seed(SEED)

# --------------------------
# Tiny movie-sentiment dataset: (text, label) with 1=pos, 0=neg
# --------------------------
data = [
    ("i love this movie", 1),
    ("this film was great", 1),
    ("amazing acting and story", 1),
    ("highly recommend it", 1),
    ("i hate this movie", 0),
    ("this film was terrible", 0),
    ("boring plot and bad acting", 0),
    ("do not recommend it", 0),
    ("what a fantastic experience", 1),
    ("worst film ever", 0),
    ("best film ever", 1),
    ("what a waste of time", 0),
]

# ============================================================
# WORD-LEVEL VOCAB (detailed explanation)
# ------------------------------------------------------------
# Goal: Map each *word* to a stable integer ID that both frameworks will share.
# Steps:
#   1) Tokenize each sentence by whitespace (quick & deterministic).
#   2) Count word frequencies with Counter (not required, but useful if you later want cutoffs).
#   3) Create a vocabulary list (itos = index->token) that starts with special symbols:
#        <pad> : used to pad short sequences to MAX_LEN (index 0 here)
#        <unk> : used for any out-of-vocabulary word seen at inference time (index 1 here)
#      Then append all observed words in a sorted order for determinism.
#   4) Build the reverse map (stoi = token->index).
#
# Notes:
# - We choose **PAD_ID=0** and **UNK_ID=1**, which is common and convenient.
# - Sorting the tokens makes the mapping stable across runs (given fixed data).
# - In more realistic pipelines, you'd lowercase consistently (we do), and maybe strip punctuation.
# - For *strict parity* between frameworks, we’ll reuse these exact IDs everywhere.
# ============================================================
counter = Counter()
for text, _ in data:
    counter.update(text.lower().split())

# Counter will look like:
# {'this': 4, 'film': 4, 'i': 2, 'movie': 2, 'was': 2, 'acting': 2, 'and': 2,
#  'recommend': 2, 'it': 2, 'what': 2, 'a': 2, 'ever': 2,
#  'love': 1, 'great': 1, 'amazing': 1, 'story': 1, 'highly': 1, 'hate': 1,
#  'terrible': 1, 'boring': 1, 'plot': 1, 'bad': 1, 'do': 1, 'not': 1,
#  'fantastic': 1, 'experience': 1, 'worst': 1, 'best': 1,
#  'waste': 1, 'of': 1, 'time': 1}

PAD, UNK = "<pad>", "<unk>"
itos = [PAD, UNK] + sorted(counter.keys())   # index -> token
stoi = {w: i for i, w in enumerate(itos)}    # token -> index
PAD_ID, UNK_ID = stoi[PAD], stoi[UNK]
vocab_size = len(itos)

# ============================================================
# ENCODE TO FIXED LENGTH (detailed explanation)
# ------------------------------------------------------------
# Goal: Convert each sentence into a *fixed-length* vector of token IDs so that:
#   - Batches can be simple NumPy arrays / tensors (no ragged shapes).
#   - Both frameworks see identical inputs.
#
# Choices we make:
#   - MAX_LEN = 6: short enough for a toy demo.
#   - Truncation policy: keep the *first* MAX_LEN tokens (drop the rest).
#       This is often called "right-truncation".
#   - Padding policy: if a sentence has < MAX_LEN tokens, append PAD_ID to the right
#       ("right-padding") until length is exactly MAX_LEN.
#
# Consequences:
#   - The *last* position is often padding for short sentences. Since we take the output at
#     the last time step, that step may correspond to <pad>. This is OK here because:
#       * both frameworks do the exact same thing, so the comparison is fair, and
#       * the model can learn to treat PAD as "uninformative".
#   - In practice, you'd often use masking so the model ignores padded positions.
# ============================================================
MAX_LEN = 6

def encode(text):
    toks = text.lower().split()
    ids = [stoi.get(tok, UNK_ID) for tok in toks]     # map tokens to IDs, OOV -> UNK_ID
    ids = ids[:MAX_LEN]                               # truncate to MAX_LEN (right-truncation)
    ids = ids + [PAD_ID] * (MAX_LEN - len(ids))       # right-pad with PAD_ID to fixed length
    return ids


# itos (index → token)

# (alphabetical after the two specials)

# 0: <pad>
# 1: <unk>
# 2: a
# 3: acting
# 4: amazing
# 5: and
# 6: bad
# 7: best
# 8: boring
# 9: do
# 10: ever
# 11: experience
# 12: fantastic
# 13: film
# 14: great
# 15: hate
# 16: highly
# 17: i
# 18: it
# 19: love
# 20: movie
# 21: not
# 22: of
# 23: plot
# 24: recommend
# 25: story
# 26: terrible
# 27: this
# 28: time
# 29: was
# 30: waste
# 31: what
# 32: worst

# stoi (token → index)
# {
#  '<pad>':0, '<unk>':1,
#  'a':2, 'acting':3, 'amazing':4, 'and':5, 'bad':6, 'best':7, 'boring':8, 'do':9,
#  'ever':10, 'experience':11, 'fantastic':12, 'film':13, 'great':14, 'hate':15,
#  'highly':16, 'i':17, 'it':18, 'love':19, 'movie':20, 'not':21, 'of':22, 'plot':23,
#  'recommend':24, 'story':25, 'terrible':26, 'this':27, 'time':28, 'was':29,
#  'waste':30, 'what':31, 'worst':32
# }

# IDs for specials and vocab size
# PAD_ID = 0
# UNK_ID = 1
# vocab_size = 33

# encode(t) with MAX_LEN = 6 (right-truncate then right-pad with PAD_ID)
# encode("this film was great")
# # -> [27, 13, 29, 14, 0, 0]

# encode("boring plot and bad acting")
# # -> [8, 23, 5, 6, 3, 0]

# encode("what a fantastic experience")
# # -> [31, 2, 12, 11, 0, 0]

# encode("i love this movie")
# # -> [17, 19, 27, 20, 0, 0]

# encode("this film was terrible")
# # -> [27, 13, 29, 26, 0, 0]

# encode("do not recommend it")
# # -> [9, 21, 24, 18, 0, 0]

# encode("worst film ever")
# # -> [32, 13, 10, 0, 0, 0]

# encode("best film ever")
# # -> [7, 13, 10, 0, 0, 0]

# encode("what a waste of time")
# # -> [31, 2, 30, 22, 28, 0]

# Vectorize the whole dataset
X = np.array([encode(t) for t, _ in data], dtype=np.int32)  # int32: good for TF; will cast to long in Torch
y = np.array([lbl for _, lbl in data], dtype=np.int32)

# --------------------------
# Train/val split (75/25), with a fixed permutation for reproducibility
# --------------------------
perm = np.random.RandomState(SEED).permutation(len(X))
X, y = X[perm], y[perm]
split = int(0.75 * len(X))
X_train, y_train = X[:split], y[:split]
X_val,   y_val   = X[split:], y[split:]

# --------------------------
# Shared hyperparameters (kept small for speed & clarity)
# --------------------------
EMBED_DIM   = 16
HIDDEN_DIM  = 32
NUM_CLASSES = 2
EPOCHS = 3

# We'll do *full-batch* training to keep code tiny:
BATCH_SIZE_FULL = len(X_train)  # one update per epoch using all training examples

# Three sample sentences for end-of-notebook predictions
sample_texts = ["this movie was great", "this movie was boring", "i recommend this film"]

def encode_batch(texts):
    return np.array([encode(t) for t in texts], dtype=np.int32)

print(f"Vocab size: {vocab_size} | Train={len(X_train)} | Val={len(X_val)} | MAX_LEN={MAX_LEN}")
print("First encoded example:", X_train[0])

Vocab size: 33 | Train=9 | Val=3 | MAX_LEN=6
First encoded example: [27 13 29 26  0  0]
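The two edge cases of `encode`, out-of-vocabulary words and sentences longer than `MAX_LEN`, do not occur in the dataset above. The sketch below exercises them with a deliberately tiny, hypothetical vocabulary (not the notebook's 33-entry one) so it can run on its own.

```python
PAD_ID, UNK_ID, MAX_LEN = 0, 1, 6

# Hypothetical mini-vocabulary for illustration only
stoi = {"<pad>": 0, "<unk>": 1, "film": 2, "great": 3, "this": 4, "was": 5}

def encode(text):
    """Lowercase, split on whitespace, map to IDs, then truncate and right-pad."""
    ids = [stoi.get(tok, UNK_ID) for tok in text.lower().split()]
    ids = ids[:MAX_LEN]
    return ids + [PAD_ID] * (MAX_LEN - len(ids))

oov = encode("This film was AWESOME")                           # "awesome" is unknown
truncated = encode("this film was great this film was great")   # 8 tokens, 6 kept

print(oov)        # [4, 2, 5, 1, 0, 0]
print(truncated)  # [4, 2, 5, 3, 4, 2]
```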


In [None]:
# @title PyTorch LSTM (tiny, full-batch)
import torch
import torch.nn as nn
torch.manual_seed(SEED)

# Convert numpy arrays → torch tensors
Xtr = torch.from_numpy(X_train).long()   # Embedding expects torch.long (int64)
ytr = torch.from_numpy(y_train).long()
Xva = torch.from_numpy(X_val).long()
yva = torch.from_numpy(y_val).long()

# --------------------------
# Model: Embedding → LSTM → Dense(logits)
# --------------------------
class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes, pad_id):
        super().__init__()
        # padding_idx=pad_id keeps the PAD row frozen at zeros (not updated by training).
        # Note: In TF below we DON'T freeze PAD; the PAD embedding will be trainable there.
        # That tiny discrepancy is usually negligible for this demo. If you want exact parity,
        # remove padding_idx here so PAD is also trainable in PyTorch.
        self.emb  = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_id)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc   = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        x = self.emb(x)            # shape: (B, T, E)
        out, _ = self.lstm(x)      # shape: (B, T, H); we ignore hidden state tuple for simplicity
        last = out[:, -1, :]       # take the *final* time step (could be PAD; same choice in TF)
        return self.fc(last)       # logits (unnormalized scores), shape: (B, C)

model = LSTMClassifier(vocab_size, EMBED_DIM, HIDDEN_DIM, NUM_CLASSES, PAD_ID)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# ============================================================
# FULL-BATCH GRADIENT DESCENT (detailed explanation)
# ------------------------------------------------------------
# "Full-batch" means we use *all* training examples in one giant batch to compute:
#   1) A single forward pass producing logits for the entire training set.
#   2) A single scalar loss value (average over all examples).
#   3) A single backward pass computing gradients w.r.t. all model parameters.
#   4) One optimizer step that updates parameters using those gradients.
#
# Why do this here?
#   - Our dataset is tiny, so it's easy and compact to express.
#   - It reduces code (no DataLoader or loops over mini-batches).
#   - It ensures both frameworks take an equally simple path.
#
# Trade-offs (for real training):
#   - Full-batch can be slow/ memory-heavy for large datasets.
#   - Mini-batches add stochasticity that can help generalization.
# ============================================================
for epoch in range(1, EPOCHS+1):
    model.train()
    opt.zero_grad()                 # clear old gradients
    logits = model(Xtr)             # forward pass on *all* training examples
    loss = loss_fn(logits, ytr)     # compute average cross-entropy loss over the batch
    loss.backward()                 # backprop: compute gradients dLoss/dParam
    opt.step()                      # update parameters (Adam optimizer)

    # Quick validation accuracy (no gradient tracking)
    model.eval()
    with torch.no_grad():
        val_logits = model(Xva)
        val_pred = val_logits.argmax(1)
        val_acc = (val_pred == yva).float().mean().item()
    print(f"[PyTorch] Epoch {epoch}/{EPOCHS}  train_loss={loss.item():.4f}  val_acc={val_acc:.3f}")

# Inference on sample sentences
with torch.no_grad():
    logits = model(torch.from_numpy(encode_batch(sample_texts)).long())
    probs = torch.softmax(logits, dim=1).cpu().numpy()
    preds = probs.argmax(1)

print("\n[PyTorch] Predictions:")
for t, p, pr in zip(sample_texts, preds, probs):
    print(f"{t!r} → {'pos' if p==1 else 'neg'}  probs={pr}")


[PyTorch] Epoch 1/3  train_loss=0.6908  val_acc=0.333
[PyTorch] Epoch 2/3  train_loss=0.6891  val_acc=0.333
[PyTorch] Epoch 3/3  train_loss=0.6874  val_acc=0.333

[PyTorch] Predictions:
'this movie was great' → pos  probs=[0.4464912 0.5535088]
'this movie was boring' → pos  probs=[0.4553566  0.54464334]
'i recommend this film' → pos  probs=[0.43842876 0.56157124]
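A useful reference point for the losses above: a two-class model that outputs uniform probabilities has a cross-entropy of ln 2. A minimal check:

```python
import math

num_classes = 2
chance_loss = -math.log(1.0 / num_classes)   # loss when every class gets probability 1/2

print(round(chance_loss, 4))  # 0.6931
```

The reported training losses (0.6908 down to 0.6874) sit just below this value, which suggests that three full-batch updates at a learning rate of 1e-3 have barely moved the model away from guessing.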


In [None]:
# @title TensorFlow/Keras LSTM (tiny, full-batch)
import tensorflow as tf
tf.random.set_seed(SEED)

# Same architecture: Embedding → LSTM → Dense(logits)
tf_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=vocab_size,
                              output_dim=EMBED_DIM),  # input_length omitted: deprecated in Keras 3, length is inferred from the input
    tf.keras.layers.LSTM(HIDDEN_DIM),
    tf.keras.layers.Dense(NUM_CLASSES)  # logits (no activation)
])

tf_model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-3),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"]
)

# ============================================================
# FULL-BATCH GRADIENT DESCENT IN KERAS (what happens under the hood)
# ------------------------------------------------------------
# We set batch_size = number of training examples → one update per epoch.
# Keras still handles the same steps internally:
#   forward → loss → backward (autodiff) → optimizer step → repeat per epoch
# This mirrors the PyTorch loop conceptually, just handled by .fit().
# ============================================================
tf_model.fit(
    X_train, y_train,
    batch_size=len(X_train),   # full-batch (one gradient update per epoch)
    epochs=EPOCHS,
    validation_data=(X_val, y_val),
    verbose=1
)

# Inference on sample sentences (same as PyTorch)
arr = encode_batch(sample_texts)
logits = tf_model(arr, training=False)
probs = tf.nn.softmax(logits, axis=1).numpy()
preds = probs.argmax(1)

print("\n[TensorFlow] Predictions:")
for t, p, pr in zip(sample_texts, preds, probs):
    print(f"{t!r} → {'pos' if p==1 else 'neg'}  probs={pr}")


Epoch 1/3
1/1 ━━━━━━━━━━━━━━━━━━━━ 3s 3s/step - accuracy: 0.6667 - loss: 0.6902 - val_accuracy: 0.3333 - val_loss: 0.6976
Epoch 2/3
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 283ms/step - accuracy: 0.5556 - loss: 0.6891 - val_accuracy: 0.3333 - val_loss: 0.6997
Epoch 3/3
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 99ms/step - accuracy: 0.5556 - loss: 0.6880 - val_accuracy: 0.3333 - val_loss: 0.7020

[TensorFlow] Predictions:
'this movie was great' → pos  probs=[0.48581555 0.5141845 ]
'this movie was boring' → pos  probs=[0.486884 0.513116]
'i recommend this film' → pos  probs=[0.48593917 0.51406074]
