# Notebook 3: Model Training (Fast CPU Version)
## Amazon Review Sentiment Analysis - ZUM Project

Goal: train three different sentiment classification models.

**Three Modeling Approaches:**

1. **Model A: Classical ML (Logistic Regression)**
   - Input: TF-IDF vectors
   - Purpose: fast baseline benchmark

2. **Model B: Neural Network (LSTM)**
   - Framework: TensorFlow/Keras
   - Architecture: Embedding (masked) → LSTM → Dense → Sigmoid
   - Later change: switched to pre-padding and masking to fix accuracy problems.

3. **Model C: Transformer (DistilBERT)**
   - Framework: Hugging Face Transformers
   - Model: `distilbert-base-uncased`
   - **Optimization:** subsample the data to 2,000 rows to avoid a roughly 3-hour training run
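
To make the pre-padding and masking idea from Model B concrete, here is a minimal NumPy sketch. `pad_pre` is a hypothetical stand-in written for illustration, not the Keras implementation; it assumes token id 0 is reserved for padding.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad and right-truncate integer sequences to a fixed length."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for i, seq in enumerate(sequences):
        trimmed = list(seq)[:maxlen]          # keep the first maxlen tokens
        if trimmed:
            out[i, -len(trimmed):] = trimmed  # padding values end up in front
    return out

padded = pad_pre([[5, 8, 2], [7], [1, 2, 3, 4, 5, 6]], maxlen=4)
mask = padded != 0  # True where a real token sits, False on padding

print(padded)
print(mask.sum(axis=1))  # number of real tokens per row
```

With pre-padding the real tokens sit at the end of each row, so the last LSTM step always sees actual text; the mask tells the layer which positions to skip.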

## 1. Import Required Libraries and Configuration

In [1]:
import pandas as pd
import numpy as np
import pickle
import joblib
import os
import warnings
import matplotlib.pyplot as plt
warnings.filterwarnings('ignore')

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, SpatialDropout1D, Masking
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

from transformers import (
    AutoTokenizer, 
    AutoModelForSequenceClassification,
    TrainingArguments, 
    Trainer
)
from datasets import Dataset
import torch

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
tf.random.set_seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)

MODELS_DIR = 'models'
DATA_DIR = 'data/processed'
os.makedirs(MODELS_DIR, exist_ok=True)






## 2. Load Processed Data (Robust)

In [2]:
print("\n[1/4] Loading Data from Notebook 2...")

try:
    train_df = pd.read_csv(f'{DATA_DIR}/train.csv')
    val_df = pd.read_csv(f'{DATA_DIR}/val.csv')
    print("CSV files found and loaded.")
except FileNotFoundError:
    raise FileNotFoundError("Critical Error: Train/Val CSVs not found. Please run Notebook 2 first!")

print(f"   Columns found: {list(train_df.columns)}")

if 'clean_text' in train_df.columns:
    TEXT_COL = 'clean_text'
    print("   Found 'clean_text' column. Using optimized text.")
elif 'pre_clean' in train_df.columns:
    TEXT_COL = 'pre_clean'
    print("   'clean_text' missing. Using 'pre_clean' instead.")
else:
    TEXT_COL = 'text'
    print("   'clean_text' missing. Falling back to raw 'text' column.")

train_df = train_df.fillna({TEXT_COL: '', 'text': ''})
val_df = val_df.fillna({TEXT_COL: '', 'text': ''})

X_train_clean = train_df[TEXT_COL].astype(str).values
X_val_clean = val_df[TEXT_COL].astype(str).values

RAW_COL = 'text' if 'text' in train_df.columns else TEXT_COL
X_train_raw = train_df[RAW_COL].astype(str).values
X_val_raw = val_df[RAW_COL].astype(str).values

y_train = train_df['label'].values.astype(int)
y_val = val_df['label'].values.astype(int)

print("Data Loaded Successfully")
print(f"  Train Samples: {len(X_train_clean):,}")
print(f"  Val Samples:   {len(X_val_clean):,}")


[1/4] Loading Data from Notebook 2...
CSV files found and loaded.
   Columns found: ['text', 'label']
   'clean_text' missing. Falling back to raw 'text' column.
Data Loaded Successfully
  Train Samples: 35,000
  Val Samples:   7,500
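
The column fallback in the cell above can also be expressed as one small function. This is a sketch under assumptions: `pick_text_column` is a hypothetical helper, and unlike the cell it raises an error when none of the candidate columns exists instead of defaulting to `'text'`.

```python
def pick_text_column(columns, preferred=("clean_text", "pre_clean", "text")):
    """Return the first preferred column name present in `columns`."""
    for name in preferred:
        if name in columns:
            return name
    raise KeyError(f"None of {preferred} found in columns: {list(columns)}")

# The logged run found only these two columns, so raw text is used.
print(pick_text_column(["text", "label"]))
print(pick_text_column(["text", "pre_clean", "label"]))
```

Note that in the logged run only `text` and `label` were present, so Models A and B were trained on the raw `text` column.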


## 3. Model A: Classical ML (Logistic Regression)

In [3]:
print("\n[2/4] Training Model A: Logistic Regression...")

tfidf_path = f'{DATA_DIR}/tfidf_vectorizer.pkl'

if not os.path.exists(tfidf_path):
    fallback_path = f'{MODELS_DIR}/tfidf_vectorizer.pkl'
    if os.path.exists(fallback_path):
        tfidf_path = fallback_path
    else:
        raise FileNotFoundError(f"Run Notebook 2 first! Missing {tfidf_path}")

print(f"   Loading vectorizer from: {tfidf_path}")
with open(tfidf_path, 'rb') as f:
    tfidf = pickle.load(f)

print("   Vectorizing data...")
X_train_tfidf = tfidf.transform(X_train_clean)
X_val_tfidf = tfidf.transform(X_val_clean)

print("   Fitting Logistic Regression...")
lr_model = LogisticRegression(max_iter=1000, random_state=RANDOM_SEED, n_jobs=-1)
lr_model.fit(X_train_tfidf, y_train)

val_acc = lr_model.score(X_val_tfidf, y_val)
print(f"   LR Validation Accuracy: {val_acc:.4f}")

joblib.dump(lr_model, f'{MODELS_DIR}/lr_baseline.joblib')
print(f"   Model saved to {MODELS_DIR}/lr_baseline.joblib")


[2/4] Training Model A: Logistic Regression...
   Loading vectorizer from: data/processed/tfidf_vectorizer.pkl
   Vectorizing data...
   Fitting Logistic Regression...
   LR Validation Accuracy: 0.8647
   Model saved to models/lr_baseline.joblib
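
The cell above keeps the vectorizer and the classifier in separate files, so both must be loaded together at inference time. One possible alternative, shown here on invented toy data and not the method used in this notebook, is to bundle both steps in a scikit-learn `Pipeline` and persist a single artifact. The file name `lr_pipeline.joblib` is an assumption.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy reviews invented for illustration only (1 = positive, 0 = negative).
texts = [
    "great product love it",
    "excellent quality works great",
    "terrible waste of money",
    "broke after one day terrible",
]
labels = [1, 1, 0, 0]

# Vectorizer and classifier travel together as one object.
pipe = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(max_iter=1000, random_state=42),
)
pipe.fit(texts, labels)

# Round trip through a single file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "lr_pipeline.joblib")
    joblib.dump(pipe, path)
    restored = joblib.load(path)

new_reviews = ["great quality", "terrible waste"]
print(restored.predict(new_reviews))
```

The restored pipeline accepts raw strings directly, which removes the risk of pairing a model with the wrong vectorizer.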


## 4. Model B: Neural Network (LSTM)

In [4]:
print("\n[3/4] Training Model B: Neural Network (LSTM) - FIXED...")

MAX_VOCAB = 10000   # vocabulary size kept by the tokenizer
MAX_LEN = 100       # sequence length cap; the average review is ~32 words
EMBED_DIM = 128     # embedding vector size

print("   Tokenizing...")
tokenizer = Tokenizer(num_words=MAX_VOCAB, oov_token="<OOV>")
tokenizer.fit_on_texts(X_train_clean)

train_seq = tokenizer.texts_to_sequences(X_train_clean)
val_seq = tokenizer.texts_to_sequences(X_val_clean)

print("   Padding sequences (Pre-padding)...")
X_train_pad = pad_sequences(train_seq, maxlen=MAX_LEN, padding='pre', truncating='post')
X_val_pad = pad_sequences(val_seq, maxlen=MAX_LEN, padding='pre', truncating='post')

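model_b = Sequential([
    keras.Input(shape=(MAX_LEN,)),
    # mask_zero=True makes the LSTM skip the zero padding. A Masking layer placed
    # before Embedding does not produce a per-timestep mask for integer inputs.
    Embedding(input_dim=MAX_VOCAB, output_dim=EMBED_DIM, mask_zero=True),
    SpatialDropout1D(0.2),
    LSTM(64, return_sequences=False),
    Dense(32, activation='relu'),
    Dropout(0.3),
    Dense(1, activation='sigmoid')
])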

model_b.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model_b.summary()  # summary() prints the table itself and returns None

# Train
print("   Starting Training...")
es = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

history_b = model_b.fit(
    X_train_pad, y_train,
    validation_data=(X_val_pad, y_val),
    epochs=10,           
    batch_size=32,     
    callbacks=[es],
    verbose=1
)

# Save in the native .keras format instead of legacy .h5 for compatibility
model_save_path = f'{MODELS_DIR}/model_b_lstm_final.keras'
model_b.save(model_save_path)
print(f"   Model saved to: {model_save_path}")

model_b.save_weights(f'{MODELS_DIR}/model_b_weights.weights.h5')
print("   Weights saved separately")

tokenizer_path = f'{MODELS_DIR}/model_b_tokenizer_final.pkl'
with open(tokenizer_path, 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
print(f"   Tokenizer saved to: {tokenizer_path}")

import json
model_b_config = {
    'MAX_VOCAB': MAX_VOCAB,
    'MAX_LEN': MAX_LEN,
    'EMBED_DIM': EMBED_DIM,
    'LSTM_UNITS': 64,
    'DENSE_UNITS': 32,
    'DROPOUT': 0.3,
    'SPATIAL_DROPOUT': 0.2
}
with open(f'{MODELS_DIR}/model_b_config.json', 'w') as f:
    json.dump(model_b_config, f, indent=2)
print("   Config saved")

print("\n   Model B Saved Successfully (all artifacts)")


[3/4] Training Model B: Neural Network (LSTM) - FIXED...
   Tokenizing...
   Padding sequences (Pre-padding)...


None
   Starting Training...
Epoch 1/10
1094/1094 ━━━━━━━━━━━━━━━━━━━━ 31s 27ms/step - accuracy: 0.8360 - loss: 0.3748 - val_accuracy: 0.8612 - val_loss: 0.3139
Epoch 2/10
1094/1094 ━━━━━━━━━━━━━━━━━━━━ 30s 27ms/step - accuracy: 0.9042 - loss: 0.2395 - val_accuracy: 0.8488 - val_loss: 0.3447
Epoch 3/10
1094/1094 ━━━━━━━━━━━━━━━━━━━━ 31s 28ms/step - accuracy: 0.9357 - loss: 0.1677 - val_accuracy: 0.8496 - val_loss: 0.4027
Epoch 4/10
1094/1094 ━━━━━━━━━━━━━━━━━━━━ 31s 28ms/step - accuracy: 0.9559 - loss: 0.1226 - val_accuracy: 0.8460 - val_loss: 0.4849
   Model saved to: models/model_b_lstm_final.keras
   Weights saved separately
   Tokenizer saved to: models/model_b_tokenizer_final.pkl
   Config saved

   Model B Saved Successfully (all artifacts)
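
The log above stops after epoch 4 even though `epochs=10` was requested. The sketch below replays the logged `val_loss` values through a simplified patience rule to show why. `early_stopping_trace` is a hypothetical helper that approximates the callback's behavior with `min_delta=0`; it is not the Keras source.

```python
def early_stopping_trace(val_losses, patience):
    """Return (best_epoch, stop_epoch), both 1-indexed, for a patience rule."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: remember this epoch and reset the counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)

# val_loss values copied from the training log of Model B.
logged_val_loss = [0.3139, 0.3447, 0.4027, 0.4849]
best_epoch, stop_epoch = early_stopping_trace(logged_val_loss, patience=3)
print(best_epoch, stop_epoch)
```

Validation loss was lowest in epoch 1 and rose for three epochs in a row while training accuracy kept climbing, which is a typical overfitting pattern. Because `restore_best_weights=True` is set, the saved model carries the epoch 1 weights (val_accuracy 0.8612).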


## 5. Model C: Transformer (DistilBERT)

In [5]:
print("\n[4/4] Training Model C: DistilBERT - FAST MODE...")

MODEL_NAME = "distilbert-base-uncased"
tokenizer_bert = AutoTokenizer.from_pretrained(MODEL_NAME)
model_bert = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Training on 35,000 samples on CPU takes 3-5 hours. 
# We will use a smaller subset (2,000 samples) to finish in 15 mins.
SAMPLE_SIZE = 2000 
VAL_SIZE = 500

print(f"OPTIMIZATION: Subsampling dataset to {SAMPLE_SIZE} rows for speed on CPU.")
print("   (If you have a GPU, set SAMPLE_SIZE = len(X_train_raw))")

def create_dataset_dict(texts, labels):
    return {
        "text": [str(t) for t in texts],
        "label": [int(l) for l in labels]
    }

train_hf = Dataset.from_dict(create_dataset_dict(X_train_raw[:SAMPLE_SIZE], y_train[:SAMPLE_SIZE]))
val_hf = Dataset.from_dict(create_dataset_dict(X_val_raw[:VAL_SIZE], y_val[:VAL_SIZE]))

def tokenize_func(examples):
    return tokenizer_bert(examples["text"], padding="max_length", truncation=True, max_length=128)

print("   Tokenizing datasets (this might take a moment)...")
train_hf = train_hf.map(tokenize_func, batched=True)
val_hf = val_hf.map(tokenize_func, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, preds)}

training_args = TrainingArguments(
    output_dir=f'{MODELS_DIR}/transformer_checkpoints',
    eval_strategy="epoch",
    save_strategy="epoch",
    num_train_epochs=2,            
    per_device_train_batch_size=16,  
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    fp16=torch.cuda.is_available(),  
    logging_steps=50,
)

trainer = Trainer(
    model=model_bert,
    args=training_args,
    train_dataset=train_hf,
    eval_dataset=val_hf,
    compute_metrics=compute_metrics,
)

print("   Starting DistilBERT Training (Check logs below)...")
trainer.train()

trainer.save_model(f'{MODELS_DIR}/transformer_model')
tokenizer_bert.save_pretrained(f'{MODELS_DIR}/transformer_model')

print("\n" + "="*60)
print("ALL MODELS TRAINED & SAVED SUCCESSFULLY")
print("="*60)


[4/4] Training Model C: DistilBERT - FAST MODE...


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


OPTIMIZATION: Subsampling dataset to 2000 rows for speed on CPU.
   (If you have a GPU, set SAMPLE_SIZE = len(X_train_raw))
   Tokenizing datasets (this might take a moment)...
   Starting DistilBERT Training (Check logs below)...


| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.4632        | 0.414447        | 0.798    |
| 2     | 0.2745        | 0.38799         | 0.822    |
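
The cell above takes the first `SAMPLE_SIZE` training rows and the first `VAL_SIZE` validation rows. If the CSV files happen to be ordered, for example by label, such a slice may not be representative. A minimal sketch of a shuffled, class-proportional draw follows; `stratified_indices` is a hypothetical helper and the toy labels are invented.

```python
import numpy as np

def stratified_indices(labels, n_samples, seed=42):
    """Return shuffled row indices whose class shares roughly match `labels`.

    The total may differ from n_samples by a row or two because of rounding.
    """
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    picked = []
    for cls in np.unique(labels):
        cls_idx = np.flatnonzero(labels == cls)
        # Share of the sample for this class, proportional to its frequency.
        n_cls = int(round(n_samples * len(cls_idx) / len(labels)))
        n_cls = min(n_cls, len(cls_idx))
        picked.append(rng.choice(cls_idx, size=n_cls, replace=False))
    idx = np.concatenate(picked)
    rng.shuffle(idx)
    return idx

# Toy labels sorted on purpose: a plain [:100] slice would contain only class 0.
toy_labels = np.array([0] * 700 + [1] * 300)
idx = stratified_indices(toy_labels, n_samples=100)
print(len(idx), toy_labels[idx].mean())
```

In the notebook this could replace the slices, for example `X_train_raw[idx]` and `y_train[idx]` with `idx = stratified_indices(y_train, SAMPLE_SIZE)`.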


SafetensorError: Error while serializing: I/O error: The requested operation cannot be performed on a file with a user-mapped section open. (os error 1224)
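
Windows error 1224 means the target file is still memory-mapped by an open handle, so it cannot be overwritten. One possible workaround, assuming the old file in `models/transformer_model` is held open by a model loaded earlier in the same session, is to write into a directory that does not exist yet. `fresh_dir` is a hypothetical helper; restarting the kernel before re-running the cell is another option.

```python
import os
import tempfile

def fresh_dir(base_path):
    """Return base_path if unused, else the first free base_path_1, base_path_2, ..."""
    if not os.path.exists(base_path):
        return base_path
    n = 1
    while os.path.exists(f"{base_path}_{n}"):
        n += 1
    return f"{base_path}_{n}"

# Demo inside a temporary directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "transformer_model")
    first = fresh_dir(target)    # target is free, returned unchanged
    os.makedirs(first)
    second = fresh_dir(target)   # target now exists, a suffix is added
    print(os.path.basename(first), os.path.basename(second))
```

In the notebook, the save step could then read `save_dir = fresh_dir(f'{MODELS_DIR}/transformer_model')` followed by `trainer.save_model(save_dir)` and `tokenizer_bert.save_pretrained(save_dir)`.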

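## 6. Training Summary

All three models were trained. Models A and B were saved; the save step for Model C ended with a `SafetensorError` in the logged run, so its directory should be verified before use.

- **Model A (Logistic Regression)**: saved as `models/lr_baseline.joblib`
- **Model B (LSTM)**: saved as `models/model_b_lstm_final.keras`, with weights, tokenizer, and config stored alongside it in `models/`
- **Model C (DistilBERT)**: target directory `models/transformer_model`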
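
For a quick side-by-side view, the validation accuracies from the logs can be collected in one place. This sketch only restates numbers already printed in this notebook. The figures are not directly comparable, because Model C was trained on 2,000 rows and evaluated on 500 validation rows while Models A and B used the full 35,000 and 7,500.

```python
# (model, validation accuracy, validation rows), copied from the cell outputs.
logged_results = [
    ("A: Logistic Regression", 0.8647, 7500),
    ("B: LSTM (epoch 1 weights)", 0.8612, 7500),
    ("C: DistilBERT (epoch 2)", 0.822, 500),
]

ranked = sorted(logged_results, key=lambda row: row[1], reverse=True)
for name, acc, n_val in ranked:
    print(f"{name:<28} val_acc={acc:.4f}  (n_val={n_val:,})")
```

On these runs the TF-IDF baseline is marginally ahead of the LSTM, and the DistilBERT result mainly reflects the reduced training set.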

