# Notebook 2: Text Preprocessing and Feature Engineering
## Sentiment Analysis of Amazon Reviews - ZUM Project

Cleaning the data and preparing features for modeling.

1. **Text cleaning:** Remove HTML → Lowercase → Remove URLs/noise → Tokenize → Remove stopwords → **Lemmatize**
2. **Stratified split:** 70% train / 15% validation / 15% test
3. **TF-IDF vectorization**
4. **Saving the processed data**


- **Lemmatization** (spaCy): preserves semantic meaning ("running" → "run")
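
The regex part of the cleaning step listed above can be sketched in isolation. This is a minimal illustration, not the notebook's own function: the name `regex_clean` and the sample string are hypothetical, and the patterns are simplified.

```python
import re

def regex_clean(text: str) -> str:
    """Illustrative regex cleaner (hypothetical helper, simplified patterns)."""
    text = re.sub(r'<[^>]+>', ' ', text)        # HTML tags become a space
    text = re.sub(r'http\S+|www\S+', '', text)  # URLs are removed
    text = re.sub(r'[^a-zA-Z\s]', '', text)     # keep only letters and whitespace
    text = text.lower().strip()                 # lowercase, trim the ends
    return re.sub(r'\s+', ' ', text)            # collapse repeated whitespace

# Hypothetical review snippet with a tag, a URL, digits and punctuation
example = "Great <br />game!! See http://example.com NOW, it's 100% fun"
cleaned = regex_clean(example)
print(cleaned)
```

Because every non-letter character is deleted rather than replaced by a space, contractions such as "it's" collapse into "its". That matters later when cleaned tokens are compared against a stopword list.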

## 1. Importing Required Libraries

In [1]:
import pandas as pd
import numpy as np
import re
import os
import pickle
import warnings
warnings.filterwarnings('ignore')
import nltk
import spacy

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

from tqdm.auto import tqdm
import time

print("="*60)
print("STEP 1/3: Downloading NLTK data...")
print("="*60)
nltk.download('stopwords', quiet=False)
nltk.download('punkt', quiet=False)
print("NLTK data ready\n")

print("="*60)
print("STEP 2/3: Loading spaCy model (this takes 5-10 seconds)...")
print("="*60)
start_time = time.time()
try:
    nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
    load_time = time.time() - start_time
    print(f"spaCy model loaded in {load_time:.2f} seconds\n")
except OSError:
    print("\nModel not found. Downloading spaCy English model...")
    print("This will take 30-60 seconds. Please wait...\n")
    import subprocess
    import sys
    
    result = subprocess.run(
        [sys.executable, '-m', 'spacy', 'download', 'en_core_web_sm'],
        capture_output=True,
        text=True
    )
    
    if result.returncode == 0:
        print("Download complete! Loading model...")
        nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
        load_time = time.time() - start_time
        print(f"spaCy model downloaded and loaded in {load_time:.2f} seconds\n")
    else:
        print("Download failed. Please run this command manually in a terminal:")
        print("   python -m spacy download en_core_web_sm")
        print("\nThen restart this notebook kernel and try again.")
        print(f"\nError details:\n{result.stderr}")
        raise RuntimeError("Failed to download spaCy model")

# Set reproducibility
print("="*60)
print("STEP 3/3: Setting up environment...")
print("="*60)
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

print("All libraries loaded successfully")
print(f"Random seed set to {RANDOM_SEED}")
print(f"spaCy optimized (disabled parser, ner for speed)")
print("="*60)


STEP 1/3: Downloading NLTK data...


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\paula\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\paula\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


NLTK data ready

STEP 2/3: Loading spaCy model (this takes 5-10 seconds)...
spaCy model loaded in 0.64 seconds

STEP 3/3: Setting up environment...
All libraries loaded successfully
Random seed set to 42
spaCy optimized (disabled parser, ner for speed)


## 2. Loading the Balanced Dataset

Load the balanced subset of Amazon reviews created in Notebook 1.

In [2]:
data_path = 'data/processed/amazon_subset.csv'

print(f"Loading dataset from: {data_path}")
df = pd.read_csv(data_path)

print(f"\n{'='*60}")
print(f"Dataset Loaded:")
print(f"{'='*60}")
print(f"Total samples: {len(df):,}")
print(f"Columns: {list(df.columns)}")
print(f"Label distribution:\n{df['label'].value_counts()}")
print(f"{'='*60}")

print("\nSample raw text (before preprocessing):")
print(df['text'].iloc[0][:200] + "...")

print("\n" + "="*80)
print("KEY EDA FINDINGS (from Notebook 1)")
print("="*80)

df['word_count'] = df['text'].apply(lambda x: len(str(x).split()))

percentile_95 = df['word_count'].quantile(0.95)
percentile_99 = df['word_count'].quantile(0.99)
mean_length = df['word_count'].mean()
std_length = df['word_count'].std()

print(f"\nText Length Statistics:")
print(f"  Mean word count:     {mean_length:.1f} words")
print(f"  Std deviation:       {std_length:.1f} words")
print(f"  95th percentile:     {percentile_95:.0f} words (covers 95% of reviews)")
print(f"  99th percentile:     {percentile_99:.0f} words (covers 99% of reviews)")

print(f"\nINSIGHT for Neural Networks (Notebook 3):")
print(f"   - Recommended max_length for padding: {int(percentile_95)} words")
print(f"   - This ensures 95% of reviews fit without truncation")
print(f"   - Longer reviews will be truncated, shorter ones padded")

EDA_STATS = {
    'mean_word_count': mean_length,
    'std_word_count': std_length,
    'percentile_95': int(percentile_95),
    'percentile_99': int(percentile_99),
    'recommended_max_length': int(percentile_95)
}

print("="*80)

Loading dataset from: data/processed/amazon_subset.csv

Dataset Loaded:
Total samples: 50,000
Columns: ['text', 'label']
Label distribution:
label
1    25000
0    25000
Name: count, dtype: int64

Sample raw text (before preprocessing):
I LOVE this game! My husband and I hadn't played Final Fantasy games since the old Nintendo versions, but we bought this when we bought a PlayStation 2 for our son for Xmas. Wow! Have these games chan...

KEY EDA FINDINGS (from Notebook 1)

Text Length Statistics:
  Mean word count:     77.6 words
  Std deviation:       42.4 words
  95th percentile:     160 words (covers 95% of reviews)
  99th percentile:     180 words (covers 99% of reviews)

INSIGHT for Neural Networks (Notebook 3):
   - Recommended max_length for padding: 160 words
   - This ensures 95% of reviews fit without truncation
   - Longer reviews will be truncated, shorter ones padded
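
The padding guidance printed above can be illustrated with a small sketch. It assumes whitespace tokens and a hypothetical `<pad>` token; the actual tokenizer and padding scheme used in Notebook 3 are not shown in this notebook.

```python
from typing import List

def pad_or_truncate(tokens: List[str], max_length: int,
                    pad_token: str = "<pad>") -> List[str]:
    """Cut a token list to max_length, or right-pad it with pad_token."""
    clipped = tokens[:max_length]
    return clipped + [pad_token] * (max_length - len(clipped))

# A short review is padded up to the target length
short_review = pad_or_truncate("love this game".split(), max_length=5)

# A 200-word review is cut down to the 95th-percentile length reported above
long_review = pad_or_truncate(("word " * 200).split(), max_length=160)

print(short_review)
print(len(long_review))
```

Choosing the 95th percentile as `max_length` means roughly 5% of reviews lose their tail, in exchange for shorter and cheaper input sequences.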


## 3. Text Cleaning Pipeline (Main Function)


### **Why Lemmatization Instead of Stemming?**
| Aspect | Lemmatization (Used) | Stemming (Avoided) |
|--------|------------------------|----------------------|
| Output | Valid words ("running" → "run") | Stems that may not be words ("studies" → "studi") |
| Semantic preservation | High (keeps meaning) | Low (can lose meaning) |
| Accuracy | Uses a dictionary + POS tags | Uses heuristic rules |
| Speed | Slower (requires linguistic analysis) | Faster (simple suffix stripping) |
| ZUM recommendation | **Preferred for NLP tasks** | Acceptable for simple IR tasks |

**Conclusion:** Lemmatization produces cleaner, more meaningful features for sentiment analysis.
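
The contrast in the table can be made concrete with a toy sketch. Both helpers below are hypothetical: `toy_stem` is a much cruder suffix stripper than a real stemmer such as Porter's, and `toy_lemmatize` is a tiny lookup table standing in for spaCy's dictionary and POS-based lemmatizer.

```python
# Hypothetical suffix rules: (suffix, replacement). Far simpler than Porter.
SUFFIX_RULES = (("ies", "i"), ("ing", ""), ("ed", ""), ("s", ""))

def toy_stem(word: str) -> str:
    """Strip the first matching suffix; the result need not be a real word."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)] + replacement
    return word

# Hypothetical lookup table standing in for a real lemmatizer
LEMMA_TABLE = {"running": "run", "studies": "study", "bought": "buy"}

def toy_lemmatize(word: str) -> str:
    """Map a word to its dictionary form when the table knows it."""
    return LEMMA_TABLE.get(word, word)

for word in ("running", "studies", "bought"):
    print(f"{word:8s} stem={toy_stem(word):8s} lemma={toy_lemmatize(word)}")
```

The stemmer only ever removes characters, so an irregular form such as "bought" is left untouched, while a dictionary-based approach can map it to "buy".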

In [None]:
stop_words = set(stopwords.words('english'))

print("="*80)
print("IDENTIFYING DOMAIN-SPECIFIC STOPWORDS (Data-Driven)")
print("="*80)

from sklearn.feature_extraction.text import CountVectorizer

negative_corpus = df[df['label'] == 0]['text'].tolist()
positive_corpus = df[df['label'] == 1]['text'].tolist()

vectorizer = CountVectorizer(
    max_features=100,
    stop_words='english',
    token_pattern=r'\b[a-zA-Z]{3,}\b' 
)

X_neg = vectorizer.fit_transform(negative_corpus)
neg_words = set(vectorizer.get_feature_names_out())

X_pos = vectorizer.fit_transform(positive_corpus)
pos_words = set(vectorizer.get_feature_names_out())

overlapping_words = neg_words.intersection(pos_words)

print(f"\nTop 100 words in NEGATIVE reviews: {len(neg_words)}")
print(f"Top 100 words in POSITIVE reviews: {len(pos_words)}")
print(f"Overlapping words (domain-specific): {len(overlapping_words)}")

print(f"\nOverlapping words (appear in BOTH sentiment classes):")
print(f"   {sorted(overlapping_words)}")

sentiment_indicators = {
    'great', 'excellent', 'good', 'best', 'love', 'perfect', 'happy',
    'bad', 'poor', 'worst', 'terrible', 'horrible', 'disappointing', 'awful'
}

domain_stopwords = overlapping_words - sentiment_indicators

domain_stopwords.update({
    'amazon', 'product', 'products', 'item', 'items',
    'buy', 'bought', 'purchase', 'purchased', 'order', 'ordered',
    'price', 'shipping', 'delivery'
})

all_stopwords = stop_words.union(domain_stopwords)

print(f"\nFINAL STOPWORD CONFIGURATION:")
print(f"   Standard English stopwords:  {len(stop_words)}")
print(f"   Domain-specific stopwords:   {len(domain_stopwords)}")
print(f"   Total combined stopwords:    {len(all_stopwords)}")

print(f"\nDomain-specific stopwords being removed:")
print(f"   {sorted(domain_stopwords)}")
print("="*80)

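def pre_clean(text):
    """
    Fast regex-based pre-cleaning before spaCy processing.
    """
    # Replace <br /> tags, then any remaining HTML tags, with a space
    text = re.sub(r'<br\s*/>', ' ', text)
    text = re.sub(r'<.*?>', ' ', text)
    # Drop URLs, @mentions and #hashtags
    text = re.sub(r'http\S+|www\S+|@\S+|#\S+', '', text)
    # Keep letters and whitespace only; digits and punctuation are deleted,
    # so contractions lose their apostrophe ("hadn't" -> "hadnt")
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = text.lower().strip()
    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    
    return text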

print("\nText cleaning functions defined")
print("Using optimized batch processing with nlp.pipe()")
print("Using lemmatization (spaCy)")
print("Domain-specific stopwords DATA-DRIVEN from EDA frequency analysis")

IDENTIFYING DOMAIN-SPECIFIC STOPWORDS (Data-Driven)

Top 100 words in NEGATIVE reviews: 100
Top 100 words in POSITIVE reviews: 100
Overlapping words (domain-specific): 73

Overlapping words (appear in BOTH sentiment classes):
   ['album', 'amazon', 'author', 'best', 'better', 'big', 'book', 'books', 'bought', 'buy', 'characters', 'day', 'did', 'didn', 'different', 'does', 'doesn', 'don', 'dvd', 'far', 'game', 'going', 'good', 'got', 'great', 'hard', 'just', 'know', 'like', 'little', 'long', 'look', 'looking', 'lot', 'love', 'make', 'movie', 'music', 'need', 'new', 'old', 'people', 'play', 'price', 'product', 'quality', 'read', 'reading', 'real', 'really', 'recommend', 'right', 'say', 'set', 'songs', 'sound', 'story', 'sure', 'thing', 'think', 'thought', 'time', 'times', 'use', 'used', 'using', 'want', 'way', 'work', 'works', 'worth', 'year', 'years']

FINAL STOPWORD CONFIGURATION:
   Standard English stopwords:  198
   Domain-specific stopwords:   78
   Total combined stopwords:    270
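
The totals above do not satisfy 198 + 78 = 276 because `union` counts a word that appears in both lists only once. A small sketch with made-up sets shows the arithmetic; the words below are illustrative and are not the notebook's actual lists.

```python
# Hypothetical miniature stopword lists (not the real NLTK or domain lists)
standard = {"did", "does", "just", "the", "and"}
domain = {"did", "does", "just", "game", "book", "amazon"}

combined = standard | domain   # union: shared words are counted once
shared = standard & domain     # intersection: words present in both lists

# Inclusion-exclusion: |A ∪ B| = |A| + |B| - |A ∩ B|
assert len(combined) == len(standard) + len(domain) - len(shared)

# Applying the same identity to the counts reported above
reported_overlap = 198 + 78 - 270
print(f"Words shared by both lists in the run above: {reported_overlap}")
```

So the reported counts imply that six domain-specific terms were already present in the standard English list.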

### Testing the Cleaning Function



In [4]:
sample_text = df['text'].iloc[0]

print("="*80)
print("BEFORE vs AFTER Cleaning Example:")
print("="*80)
print(f"\nBEFORE (Raw Text):")
print(f"{sample_text}\n")

pre_cleaned = pre_clean(sample_text)
print(f"AFTER PRE-CLEAN (Regex):")
print(f"{pre_cleaned}\n")
doc = nlp(pre_cleaned)
lemmatized_tokens = [token.lemma_ for token in doc if token.lemma_ not in all_stopwords and len(token.lemma_) > 2]
cleaned_sample = ' '.join(lemmatized_tokens)

print(f"AFTER FULL CLEAN (Lemmatized + Domain Stopwords Removed):")
print(f"{cleaned_sample}\n")

print("="*80)
print(f"Statistics:")
print(f"  Original length:  {len(sample_text)} characters, {len(sample_text.split())} words")
print(f"  Cleaned length:   {len(cleaned_sample)} characters, {len(cleaned_sample.split())} words")
print(f"  Reduction:        {(1 - len(cleaned_sample)/len(sample_text))*100:.1f}%")
print(f"  Stopwords used:   {len(all_stopwords)} (union of {len(stop_words)} standard and {len(domain_stopwords)} domain-specific)")
print("="*80)

BEFORE vs AFTER Cleaning Example:

BEFORE (Raw Text):
I LOVE this game! My husband and I hadn't played Final Fantasy games since the old Nintendo versions, but we bought this when we bought a PlayStation 2 for our son for Xmas. Wow! Have these games changed! They were fun before, but now they are visually amazing too and still incredibly fun! We have stressful lives, but for a few minutes (which can easily turn into a few hours) a day we can travel to distant lands, gather treasure, fight monsters successfully, spend money at all sorts of shops, and quest to save the world -- all without getting out of your PJs! Where else can you do all this over a period of several months for 50 bucks?? We'll definitely buy FF 11 when we are done with this one

AFTER PRE-CLEAN (Regex):
i love this game my husband and i hadnt played final fantasy games since the old nintendo versions but we bought this when we bought a playstation for our son for xmas wow have these games changed they were fun before 

## 4. Applying the Text Cleaning



In [5]:
print("="*80)
print("OPTIMIZED TEXT CLEANING PIPELINE")
print("="*80)

print("\nStep 1: Applying regex pre-cleaning (fast)...")
texts_to_process = df['text'].astype(str).apply(pre_clean).tolist()
print(f"Pre-cleaned {len(texts_to_process):,} texts")
print("\nStep 2: Applying SpaCy lemmatization with nlp.pipe() (optimized batch processing)...")
print("(This will take approximately 3-5 minutes for 50,000 reviews)\n")

from tqdm.auto import tqdm

clean_texts = []
for doc in tqdm(nlp.pipe(texts_to_process, batch_size=2000), total=len(texts_to_process), desc="Processing"):
    tokens = [token.lemma_ for token in doc if token.lemma_ not in all_stopwords and len(token.lemma_) > 2]
    clean_texts.append(' '.join(tokens))

df['cleaned_text'] = clean_texts

print("\n" + "="*80)
print("Text cleaning complete!")
print("="*80)
print(f"Domain-specific stopwords removed: {len(domain_stopwords)} terms")
print(f"Total stopwords filtered: {len(all_stopwords)} terms")

print("\nBEFORE vs AFTER Cleaning Examples:")
for i in range(3):
    print(f"\n--- Example {i+1} ---")
    print(f"BEFORE: {df['text'].iloc[i][:100]}...")
    print(f"AFTER:  {df['cleaned_text'].iloc[i][:100]}...")

OPTIMIZED TEXT CLEANING PIPELINE

Step 1: Applying regex pre-cleaning (fast)...
Pre-cleaned 50,000 texts

Step 2: Applying SpaCy lemmatization with nlp.pipe() (optimized batch processing)...
(This will take approximately 3-5 minutes for 50,000 reviews)





Text cleaning complete!
Domain-specific stopwords removed: 78 terms
Total stopwords filtered: 270 terms

BEFORE vs AFTER Cleaning Examples:

--- Example 1 ---
BEFORE: I LOVE this game! My husband and I hadn't played Final Fantasy games since the old Nintendo versions...
AFTER:  love husband final fantasy since nintendo version playstation son xmas wow change fun visually amazi...

--- Example 2 ---
BEFORE: Just watch PBS This is not an exercise video. It is a collection of Elmo segments from Sesame Street...
AFTER:  watch pbs exercise video collection elmo segment sesame street son get move two...

--- Example 3 ---
BEFORE: Black Spots on Vitamins I purchased Yummi Bears Multi-Vitamin & Mineral for my three year old daught...
AFTER:  black spot vitamin yummi bear multivitamin mineral three daughter discover vitamin cover black spot ...


## 5. Train/Validation/Test Split

Split the data **70% train / 15% validation / 15% test** with stratification to preserve class balance.

**Why stratified?**
- Ensures all three sets have the same class distribution (50/50 positive/negative)
- Keeps evaluation balanced at every stage

**Why three sets?**
- **Train (70%):** Fitting model parameters
- **Validation (15%):** Hyperparameter tuning, early stopping
- **Test (15%):** Evaluation
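
What stratification does can be shown with a pure-Python sketch. This is an illustration of the idea only, not scikit-learn's implementation: the function name `stratified_split` and its rounding behavior are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.7, 0.15, 0.15), seed=42):
    """Return one index list per split, keeping class ratios in each split."""
    rng = random.Random(seed)

    # Group sample indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    splits = [[] for _ in fractions]
    for indices in by_class.values():
        rng.shuffle(indices)
        n, start = len(indices), 0
        for i, frac in enumerate(fractions):
            # The last split takes whatever remains, so no sample is dropped
            end = n if i == len(fractions) - 1 else start + round(frac * n)
            splits[i].extend(indices[start:end])
            start = end
    return splits

# Hypothetical balanced toy dataset: 100 negatives, 100 positives
labels = [0] * 100 + [1] * 100
train_idx, val_idx, test_idx = stratified_split(labels)

for name, idx in (("train", train_idx), ("val", val_idx), ("test", test_idx)):
    positives = sum(labels[i] for i in idx)
    print(f"{name}: {len(idx)} samples, {positives / len(idx):.2f} positive")
```

Each class is shuffled and cut separately, which is why every split ends up with the same 50/50 ratio as the full dataset.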

In [6]:
X = df['cleaned_text']
y = df['label']

X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, 
    test_size=0.3,
    random_state=RANDOM_SEED,
    stratify=y
)

X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp,
    test_size=0.5,
    random_state=RANDOM_SEED,
    stratify=y_temp
)

print(f"{'='*60}")
print("Train/Validation/Test Split Summary:")
print(f"{'='*60}")
print(f"Training set size:     {len(X_train):,} samples ({len(X_train)/len(X)*100:.1f}%)")
print(f"Validation set size:   {len(X_val):,} samples ({len(X_val)/len(X)*100:.1f}%)")
print(f"Test set size:         {len(X_test):,} samples ({len(X_test)/len(X)*100:.1f}%)")
print(f"Total:                 {len(X):,} samples")

print(f"\nTraining label distribution:")
print(y_train.value_counts().sort_index())

print(f"\nValidation label distribution:")
print(y_val.value_counts().sort_index())

print(f"\nTest label distribution:")
print(y_test.value_counts().sort_index())
print(f"{'='*60}")

train_pos_ratio = (y_train == 1).sum() / len(y_train)
val_pos_ratio = (y_val == 1).sum() / len(y_val)
test_pos_ratio = (y_test == 1).sum() / len(y_test)

print(f"\nStratification check (all should be ~0.50):")
print(f"  Train positive ratio:      {train_pos_ratio:.4f}")
print(f"  Validation positive ratio: {val_pos_ratio:.4f}")
print(f"  Test positive ratio:       {test_pos_ratio:.4f}")
print(f"  Max difference: {max(abs(train_pos_ratio - val_pos_ratio), abs(train_pos_ratio - test_pos_ratio), abs(val_pos_ratio - test_pos_ratio)):.6f}")
print(f"{'='*60}")

Train/Validation/Test Split Summary:
Training set size:     35,000 samples (70.0%)
Validation set size:   7,500 samples (15.0%)
Test set size:         7,500 samples (15.0%)
Total:                 50,000 samples

Training label distribution:
label
0    17500
1    17500
Name: count, dtype: int64

Validation label distribution:
label
0    3750
1    3750
Name: count, dtype: int64

Test label distribution:
label
0    3750
1    3750
Name: count, dtype: int64

Stratification check (all should be ~0.50):
  Train positive ratio:      0.5000
  Validation positive ratio: 0.5000
  Test positive ratio:       0.5000
  Max difference: 0.000000


## 6. TF-IDF Vectorization (for Classical ML)


- This will be the input to our Logistic Regression model (Model A)
- Neural networks and Transformers will use different representations
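
To make the TF-IDF weighting less of a black box, here is a small pure-Python sketch. It assumes whitespace tokenization and unigrams only, and it follows the smoothed IDF formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization. The helper names and the toy documents are hypothetical.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Learn smoothed IDF weights from the training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0
            for t, df in doc_freq.items()}

def transform(doc, idf):
    """Weight a document with the fitted IDF; unseen terms are ignored."""
    counts = Counter(t for t in doc.split() if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

# Hypothetical cleaned training texts
train_docs = ["love game fun", "waste money", "love story"]
idf = fit_idf(train_docs)

# A validation text is transformed with the training vocabulary only
val_vec = transform("love waste unseenword", idf)
print(val_vec)
```

Fitting on the training split and only transforming the validation and test splits keeps document frequencies from the held-out data out of the features.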

In [7]:
tfidf = TfidfVectorizer(
    max_features=5000,
    ngram_range=(1, 2),
    min_df=2,
    max_df=0.95
)

print("Fitting TF-IDF vectorizer on TRAINING data ONLY (prevents data leakage)...\n")

X_train_tfidf = tfidf.fit_transform(X_train)
X_val_tfidf = tfidf.transform(X_val)
X_test_tfidf = tfidf.transform(X_test)

print(f"{'='*60}")
print("TF-IDF Vectorization Summary:")
print(f"{'='*60}")
print(f"Vocabulary size: {len(tfidf.vocabulary_):,} unique terms")
print(f"\nTraining matrix shape:   {X_train_tfidf.shape}")
print(f"  - {X_train_tfidf.shape[0]:,} samples")
print(f"  - {X_train_tfidf.shape[1]:,} features")
print(f"\nValidation matrix shape: {X_val_tfidf.shape}")
print(f"  - {X_val_tfidf.shape[0]:,} samples")
print(f"  - {X_val_tfidf.shape[1]:,} features")
print(f"\nTest matrix shape:       {X_test_tfidf.shape}")
print(f"  - {X_test_tfidf.shape[0]:,} samples")
print(f"  - {X_test_tfidf.shape[1]:,} features")

train_sparsity = (1 - X_train_tfidf.nnz / (X_train_tfidf.shape[0] * X_train_tfidf.shape[1])) * 100
print(f"\nMatrix sparsity: {train_sparsity:.2f}% (mostly zeros - efficient storage)")
print(f"{'='*60}")

Fitting TF-IDF vectorizer on TRAINING data ONLY (prevents data leakage)...

TF-IDF Vectorization Summary:
Vocabulary size: 5,000 unique terms

Training matrix shape:   (35000, 5000)
  - 35,000 samples
  - 5,000 features

Validation matrix shape: (7500, 5000)
  - 7,500 samples
  - 5,000 features

Test matrix shape:       (7500, 5000)
  - 7,500 samples
  - 5,000 features

Matrix sparsity: 99.52% (mostly zeros - efficient storage)


### 6.5 Validating EDA Findings Against the TF-IDF Vocabulary

Check whether the key bigrams identified in Notebook 1's n-gram analysis are captured by our TF-IDF vectorizer.
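
Before running the check, it can help to ask whether an expected bigram could survive the cleaning pipeline at all. The sketch below is a hypothetical diagnostic, not part of the notebook: `ILLUSTRATIVE_STOPWORDS` is a small hand-picked subset ('not' and 'it' from the standard English list; 'product', 'quality', 'worth', 'time' and 'recommend' from the overlapping words printed in Section 3).

```python
# Hand-picked subset for illustration; the real combined list is much longer
ILLUSTRATIVE_STOPWORDS = {
    "not", "it", "product", "quality", "worth", "time", "recommend",
}

def diagnose_bigram(bigram, stopwords=ILLUSTRATIVE_STOPWORDS):
    """List the reasons a bigram cannot appear in the cleaned text."""
    reasons = []
    for word in bigram.split():
        if not word.isalpha():
            # pre-cleaning deletes every non-letter character
            reasons.append(f"'{word}' contains a non-letter character")
        elif word in stopwords:
            reasons.append(f"'{word}' is a stopword")
    return reasons

for bigram in ("waste money", "highly recommend", "don't waste", "not worth"):
    reasons = diagnose_bigram(bigram)
    status = "; ".join(reasons) if reasons else "can survive cleaning"
    print(f"{bigram!r}: {status}")
```

A bigram flagged here is missing by construction, whatever its frequency, so its absence from the vocabulary says nothing about `min_df` or `max_df`.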

In [8]:
print("="*80)
print("VALIDATING EDA N-GRAM FINDINGS IN TF-IDF VOCABULARY")
print("="*80)

vocabulary = set(tfidf.get_feature_names_out())

expected_negative_bigrams = [
    'waste money', 'customer service', 'poor quality', 'not worth',
    'waste time', 'return product', 'don\'t waste', 'completely useless'
]

expected_positive_bigrams = [
    'highly recommend', 'great product', 'love it', 'excellent quality',
    'well worth', 'amazing product', 'perfectly fine', 'absolutely love'
]

print("\nNEGATIVE SENTIMENT BIGRAMS:")
print("-" * 80)
found_neg = 0
for bigram in expected_negative_bigrams:
    if bigram in vocabulary:
        print(f"  FOUND:   '{bigram}'")
        found_neg += 1
    else:
        print(f"  MISSING: '{bigram}' (may be filtered by min_df/max_df or stopword removal)")

print(f"\nNegative bigrams found: {found_neg}/{len(expected_negative_bigrams)} ({found_neg/len(expected_negative_bigrams)*100:.0f}%)")

print("\nPOSITIVE SENTIMENT BIGRAMS:")
print("-" * 80)
found_pos = 0
for bigram in expected_positive_bigrams:
    if bigram in vocabulary:
        print(f"  FOUND:   '{bigram}'")
        found_pos += 1
    else:
        print(f"  MISSING: '{bigram}' (may be filtered by min_df/max_df or stopword removal)")

print(f"\nPositive bigrams found: {found_pos}/{len(expected_positive_bigrams)} ({found_pos/len(expected_positive_bigrams)*100:.0f}%)")

print("\n" + "="*80)
print("ACTUAL TOP BIGRAMS IN TF-IDF VOCABULARY (Sample)")
print("="*80)
bigrams_in_vocab = [f for f in vocabulary if ' ' in f]
print(f"Total bigrams in vocabulary: {len(bigrams_in_vocab)}")
print(f"\nSample of 30 bigrams:")
for i, bg in enumerate(sorted(bigrams_in_vocab)[:30], 1):
    print(f"  {i:2d}. {bg}")

print("\n" + "="*80)
print("INSIGHT:")
print("="*80)
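print("Some EDA bigrams may be missing due to:")
print("  1. Stopword removal (domain terms such as 'product', 'quality', 'worth'; standard terms such as 'not', 'it')")
print("  2. Apostrophes stripped by pre_clean (\"don't waste\" can never match)")
print("  3. min_df=2 filter (bigrams appearing in <2 documents) or the max_features=5000 cap")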

VALIDATING EDA N-GRAM FINDINGS IN TF-IDF VOCABULARY

NEGATIVE SENTIMENT BIGRAMS:
--------------------------------------------------------------------------------
  FOUND:   'waste money'
  FOUND:   'customer service'
  MISSING: 'poor quality' (may be filtered by min_df/max_df or stopword removal)
  MISSING: 'not worth' (may be filtered by min_df/max_df or stopword removal)
  MISSING: 'waste time' (may be filtered by min_df/max_df or stopword removal)
  MISSING: 'return product' (may be filtered by min_df/max_df or stopword removal)
  MISSING: 'don't waste' (may be filtered by min_df/max_df or stopword removal)
  FOUND:   'completely useless'

Negative bigrams found: 3/8 (38%)

POSITIVE SENTIMENT BIGRAMS:
--------------------------------------------------------------------------------
  MISSING: 'highly recommend' (may be filtered by min_df/max_df or stopword removal)
  MISSING: 'great product' (may be filtered by min_df/max_df or stopword removal)
  MISSING: 'love it' (may be filtered 

## 7. Saving Processed Data and Artifacts

Save all processed data and fitted objects for use in Notebook 3 (Training).

**Output files:**
- `train.csv`, `val.csv`, `test.csv`: Cleaned text splits (for the NN/Transformer)
- `X_train_tfidf.pkl`, `X_val_tfidf.pkl`, `X_test_tfidf.pkl`: TF-IDF features (for Logistic Regression)
- `y_train.pkl`, `y_val.pkl`, `y_test.pkl`: Labels for all splits
- `tfidf_vectorizer.pkl`: Fitted TF-IDF vectorizer (for inference/deployment)
- `eda_stats.pkl`, `eda_stats.json`: Text length statistics and the recommended `max_length`
- `domain_stopwords.pkl`: Domain-specific stopword set
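
A downstream notebook would read these files back with the matching loaders. The sketch below shows the round trip on a temporary directory. The helper names are hypothetical, and the dictionary is a stand-in containing only the `recommended_max_length` value reported earlier.

```python
import json
import pickle
import tempfile
from pathlib import Path

def save_pickle(obj, path):
    """Serialize obj to path in binary mode."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load_pickle(path):
    """Load a pickled object; only do this with files you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Stand-in for EDA_STATS with a single key
stats = {"recommended_max_length": 160}

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)

    # Same content written in both formats, mirroring eda_stats.pkl / .json
    save_pickle(stats, out_dir / "eda_stats.pkl")
    (out_dir / "eda_stats.json").write_text(json.dumps(stats, indent=2))

    restored = load_pickle(out_dir / "eda_stats.pkl")
    restored_json = json.loads((out_dir / "eda_stats.json").read_text())

print(restored, restored_json)
```

Pickle is convenient for sparse matrices and fitted vectorizers but is tied to the library versions that wrote it, which is one reason to keep a JSON copy of plain statistics.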

In [9]:
output_dir = 'data/processed'
os.makedirs(output_dir, exist_ok=True)

print("Saving processed data and artifacts...\n")

print("Saving TF-IDF matrices...")
with open(f'{output_dir}/X_train_tfidf.pkl', 'wb') as f:
    pickle.dump(X_train_tfidf, f)
    
with open(f'{output_dir}/X_val_tfidf.pkl', 'wb') as f:
    pickle.dump(X_val_tfidf, f)
    
with open(f'{output_dir}/X_test_tfidf.pkl', 'wb') as f:
    pickle.dump(X_test_tfidf, f)

print("Saving labels...")
with open(f'{output_dir}/y_train.pkl', 'wb') as f:
    pickle.dump(y_train, f)
    
with open(f'{output_dir}/y_val.pkl', 'wb') as f:
    pickle.dump(y_val, f)
    
with open(f'{output_dir}/y_test.pkl', 'wb') as f:
    pickle.dump(y_test, f)

print("Saving TF-IDF vectorizer...")
with open(f'{output_dir}/tfidf_vectorizer.pkl', 'wb') as f:
    pickle.dump(tfidf, f)

print("Saving EDA statistics (from Notebook 1)...")
with open(f'{output_dir}/eda_stats.pkl', 'wb') as f:
    pickle.dump(EDA_STATS, f)

import json
with open(f'{output_dir}/eda_stats.json', 'w') as f:
    json.dump(EDA_STATS, f, indent=2)

print("Saving domain stopwords list...")
with open(f'{output_dir}/domain_stopwords.pkl', 'wb') as f:
    pickle.dump(domain_stopwords, f)

print("Saving cleaned text CSVs...")
train_df = pd.DataFrame({
    'text': X_train.values,
    'label': y_train.values
})

val_df = pd.DataFrame({
    'text': X_val.values,
    'label': y_val.values
})

test_df = pd.DataFrame({
    'text': X_test.values,
    'label': y_test.values
})

train_df.to_csv(f'{output_dir}/train.csv', index=False)
val_df.to_csv(f'{output_dir}/val.csv', index=False)
test_df.to_csv(f'{output_dir}/test.csv', index=False)

print(f"\n{'='*80}")
print("All files saved successfully!")
print(f"{'='*80}")
print(f"Output directory: {output_dir}/\n")

print(f"CSV Files (for NN/Transformer):")
print(f"  - train.csv      ({len(train_df):,} rows)")
print(f"  - val.csv        ({len(val_df):,} rows)")
print(f"  - test.csv       ({len(test_df):,} rows)")

print(f"\nTF-IDF Matrices (for Logistic Regression):")
print(f"  - X_train_tfidf.pkl ({X_train_tfidf.shape})")
print(f"  - X_val_tfidf.pkl   ({X_val_tfidf.shape})")
print(f"  - X_test_tfidf.pkl  ({X_test_tfidf.shape})")

print(f"\nLabels:")
print(f"  - y_train.pkl ({len(y_train):,} labels)")
print(f"  - y_val.pkl   ({len(y_val):,} labels)")
print(f"  - y_test.pkl  ({len(y_test):,} labels)")

print(f"\nVectorizer:")
print(f"  - tfidf_vectorizer.pkl (vocabulary size: {len(tfidf.vocabulary_):,})")

print(f"\nEDA Statistics (from Notebook 1):")
print(f"  - eda_stats.pkl / eda_stats.json")
print(f"    • Recommended max_length: {EDA_STATS['recommended_max_length']} words")
print(f"    • Mean word count: {EDA_STATS['mean_word_count']:.1f} words")
print(f"    • 95th percentile: {EDA_STATS['percentile_95']} words")

print(f"\nDomain Stopwords:")
print(f"  - domain_stopwords.pkl ({len(domain_stopwords)} terms removed)")

print(f"\n{'='*80}")
print("Preprocessing complete! Ready for Notebook 3 (Model Training)")
print(f"{'='*80}")

print(f"\nFinal Data Integrity Check:")
print(f"  Total samples: {len(train_df) + len(val_df) + len(test_df):,}")
print(f"  Expected:      {len(df):,}")
print(f"  Match: {'YES' if len(train_df) + len(val_df) + len(test_df) == len(df) else 'NO'}")
print(f"\n  Train balance: {train_df['label'].value_counts().to_dict()}")
print(f"  Val balance:   {val_df['label'].value_counts().to_dict()}")
print(f"  Test balance:  {test_df['label'].value_counts().to_dict()}")

print(f"\n{'='*80}")
print("KEY INSIGHTS CARRIED FORWARD FROM EDA:")
print(f"{'='*80}")
print(f"Domain stopwords: {len(domain_stopwords)} neutral terms identified and removed")
print(f"Bigrams preserved: TF-IDF uses (1,2)-grams per EDA N-gram analysis")
print(f"Padding guidance: max_length={EDA_STATS['recommended_max_length']} for 95% coverage")

Saving processed data and artifacts...

Saving TF-IDF matrices...
Saving labels...
Saving TF-IDF vectorizer...
Saving EDA statistics (from Notebook 1)...
Saving domain stopwords list...
Saving cleaned text CSVs...

All files saved successfully!
Output directory: data/processed/

CSV Files (for NN/Transformer):
  - train.csv      (35,000 rows)
  - val.csv        (7,500 rows)
  - test.csv       (7,500 rows)

TF-IDF Matrices (for Logistic Regression):
  - X_train_tfidf.pkl ((35000, 5000))
  - X_val_tfidf.pkl   ((7500, 5000))
  - X_test_tfidf.pkl  ((7500, 5000))

Labels:
  - y_train.pkl (35,000 labels)
  - y_val.pkl   (7,500 labels)
  - y_test.pkl  (7,500 labels)

Vectorizer:
  - tfidf_vectorizer.pkl (vocabulary size: 5,000)

EDA Statistics (from Notebook 1):
  - eda_stats.pkl / eda_stats.json
    • Recommended max_length: 160 words
    • Mean word count: 77.6 words
    • 95th percentile: 160 words

Domain Stopwords:
  - domain_stopwords.pkl (78 terms removed)

Preprocessing complete! Read