# 🔬 Stage 1: Baseline Embedding Detector

## Deepfake Detection System with CLIP + Vector Database

**Approach:**
1. Use a pretrained CLIP encoder (no fine-tuning)
2. Build a vector database from known real/fake images
3. Classify by k-NN similarity

**Expected accuracy:** ~65-75%
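
The three steps above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the repository's implementation: `l2_normalize` and `knn_predict` are hypothetical helper names, random vectors stand in for CLIP embeddings, and a majority vote over the most cosine-similar neighbours is one common way to do k-NN classification (a tie falls back to "real" here).

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def knn_predict(query, db_embeddings, db_labels, k=10):
    """Majority vote over the k most similar database entries.

    Returns (label, confidence), where confidence is the fraction of the
    k neighbours that voted for the winning label.
    """
    q = l2_normalize(np.atleast_2d(query))
    db = l2_normalize(db_embeddings)
    sims = (db @ q.T).ravel()          # cosine similarity to every entry
    top_k = np.argsort(-sims)[:k]      # indices of the k nearest neighbours
    votes = [db_labels[i] for i in top_k]
    fake_share = votes.count("fake") / len(votes)
    if fake_share > 0.5:
        return "fake", fake_share
    return "real", 1.0 - fake_share    # ties are resolved as "real"


# Toy data: random vectors stand in for CLIP embeddings (assumption).
rng = np.random.default_rng(0)
real = rng.normal(loc=+1.0, scale=0.5, size=(50, 8))
fake = rng.normal(loc=-1.0, scale=0.5, size=(50, 8))
db_embeddings = np.vstack([real, fake])
db_labels = ["real"] * 50 + ["fake"] * 50

label, confidence = knn_predict(np.full(8, -1.0), db_embeddings, db_labels, k=10)
print(label, confidence)
```

Because the database vectors are only compared, never trained on, adding more known real/fake images needs no retraining; the trade-off is that prediction cost grows with database size.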

---

## 📊 GPU Setup
**Runtime → Change runtime type → GPU (T4)**

In [None]:
# Check GPU
import torch
print(f"PyTorch: {torch.__version__}")
print(f"CUDA: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

## 1️⃣ Clone Repository

In [None]:
# Clone the repo
!git clone https://github.com/kordin33/DeepFake.git
%cd DeepFake
!ls -la

## 2️⃣ Install Dependencies

In [None]:
# Install CLIP and dependencies
!pip install -q git+https://github.com/openai/CLIP.git
!pip install -q scikit-learn tqdm matplotlib seaborn
!pip install -q umap-learn  # for UMAP visualization

print("\n✅ Dependencies installed!")

## 3️⃣ Prepare Data

In [None]:
# Download and prepare data from HuggingFace
# Use a smaller subset for a quick test

MAX_TRAIN_PER_CLASS = 2000  # Can be increased
MAX_TEST_PER_CLASS = 500

!python efficientnet_b0_deepfake.py --prepare --data-root ./data \
    --max-per-class-a {MAX_TRAIN_PER_CLASS} \
    --max-per-class-b {MAX_TEST_PER_CLASS}

In [None]:
# Check the data structure
from pathlib import Path

data_root = Path("./data")
for folder in data_root.rglob("*"):
    if folder.is_dir() and (folder / "fake").exists():
        fake_count = len(list((folder / "fake").glob("*")))
        real_count = len(list((folder / "real").glob("*")))
        print(f"{folder}: fake={fake_count}, real={real_count}")

## 4️⃣ Initialize Stage 1 Detector

In [None]:
import sys
sys.path.insert(0, '.')

from deepfake_research.embeddings.stage1_baseline import (
    Stage1BaselineDetector,
    Stage1Config,
    run_stage1_experiment,
)
from deepfake_research.embeddings.visualization import EmbeddingVisualizer

# Configuration
config = Stage1Config(
    encoder_name="clip",
    encoder_variant="ViT-B/32",  # You can switch to "ViT-L/14" for better results
    k_neighbors=10,
    db_backend="numpy",
    device="cuda",
)

# Initialize the detector
detector = Stage1BaselineDetector(config)

## 5️⃣ Build Vector Database (Training)

In [None]:
# Build the vector database from the training data
TRAIN_REAL = "./data/A_standardized_224/train/real"
TRAIN_FAKE = "./data/A_standardized_224/train/fake"

train_stats = detector.fit_from_folder(
    real_folder=TRAIN_REAL,
    fake_folder=TRAIN_FAKE,
    max_images=2000,  # Can be increased
    batch_size=64,
)

print("\n" + "="*50)
print("DATABASE STATISTICS")
print("="*50)
for key, value in train_stats.items():
    print(f"  {key}: {value}")

## 6️⃣ Evaluate on Test Set A (In-Domain)

In [None]:
# Test on Dataset A (same domain as the training data)
TEST_A_REAL = "./data/A_standardized_224/test_A/real"
TEST_A_FAKE = "./data/A_standardized_224/test_A/fake"

print("\n" + "="*50)
print("EVALUATION - DATASET A (In-Domain)")
print("="*50)

results_A = detector.evaluate_from_folder(
    real_folder=TEST_A_REAL,
    fake_folder=TEST_A_FAKE,
    max_images=500,
)

## 7️⃣ Evaluate on Test Set B (Cross-Domain)

In [None]:
# Test on Dataset B (a different domain, so this checks generalization!)
TEST_B_REAL = "./data/B_standardized_224/test_B/real"
TEST_B_FAKE = "./data/B_standardized_224/test_B/fake"

print("\n" + "="*50)
print("EVALUATION - DATASET B (Cross-Domain)")
print("="*50)

results_B = detector.evaluate_from_folder(
    real_folder=TEST_B_REAL,
    fake_folder=TEST_B_FAKE,
    max_images=500,
)
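
The summary cell further down reads `accuracy` and `f1` from each result entry. As a reference for what those two numbers mean, here is a small sketch that computes them from label lists. It assumes "fake" is the positive class for F1, which the notebook does not state, and `binary_metrics` is a hypothetical helper, not part of `deepfake_research`.

```python
def binary_metrics(y_true, y_pred, positive="fake"):
    """Accuracy and F1 for a two-class problem, with `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # fakes caught
    fp = sum(t != positive and p == positive for t, p in pairs)  # reals flagged as fake
    fn = sum(t == positive and p != positive for t, p in pairs)  # fakes missed
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "f1": f1}


# Toy example: 3 fakes and 3 reals, with one mistake in each class.
y_true = ["fake", "fake", "fake", "real", "real", "real"]
y_pred = ["fake", "fake", "real", "real", "real", "fake"]
example_metrics = binary_metrics(y_true, y_pred)
print(example_metrics)
```

Comparing these numbers between Dataset A and Dataset B is what exposes a generalization gap: a large drop on B suggests the neighbours found in the database reflect dataset-specific cues rather than real/fake cues.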

## 8Ô∏è‚É£ üìä Visualization - t-SNE & Cluster Analysis

In [None]:
# Fetch all embeddings from the database
embeddings, labels = detector.db.get_all_embeddings()

print(f"Embeddings shape: {embeddings.shape}")
print(f"Labels: {len(labels)} ({sum(l == 'real' for l in labels)} real, {sum(l == 'fake' for l in labels)} fake)")

In [None]:
# t-SNE Visualization
visualizer = EmbeddingVisualizer(figsize=(12, 10))

visualizer.plot_tsne(
    embeddings=embeddings,
    labels=labels,
    title="t-SNE: CLIP Embeddings (Real vs Fake)",
    save_path="stage1_tsne.png",
)

In [None]:
# UMAP Visualization (often better than t-SNE)
visualizer.plot_umap(
    embeddings=embeddings,
    labels=labels,
    title="UMAP: CLIP Embeddings (Real vs Fake)",
    save_path="stage1_umap.png",
)

In [None]:
# Cluster Analysis
cluster_metrics = visualizer.plot_cluster_analysis(
    embeddings=embeddings,
    labels=labels,
    title="Cluster Analysis: Class Separation",
    save_path="stage1_clusters.png",
)

print("\nCluster Metrics:")
for key, value in cluster_metrics.items():
    print(f"  {key}: {value:.4f}")
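
One of the cluster metrics used in the summary is the silhouette score. The sketch below computes it with NumPy for the two-class case, using the real/fake labels as the cluster assignment. It is an illustration of the standard definition, not the code inside `EmbeddingVisualizer`; the function names and the toy data are assumptions. In practice `sklearn.metrics.silhouette_score` does the same job.

```python
import numpy as np


def pairwise_distances(X: np.ndarray) -> np.ndarray:
    """Euclidean distance matrix computed from the Gram matrix."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.sqrt(np.clip(d2, 0.0, None))  # clip tiny negatives from rounding


def silhouette_two_class(embeddings, labels) -> float:
    """Mean silhouette score when the labels define exactly two clusters."""
    X = np.asarray(embeddings, dtype=float)
    y = np.asarray(labels)
    dist = pairwise_distances(X)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = y == y[i]
        same[i] = False                # exclude the point itself
        other = y != y[i]
        if not same.any() or not other.any():
            continue                   # score stays 0 for a singleton cluster
        a = dist[i, same].mean()       # mean distance to own class
        b = dist[i, other].mean()      # mean distance to the other class
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return float(scores.mean())


# Toy data standing in for embeddings (assumption).
rng = np.random.default_rng(0)
toy_labels = ["real"] * 40 + ["fake"] * 40
separated = np.vstack([rng.normal(0.0, 0.1, (40, 4)), rng.normal(5.0, 0.1, (40, 4))])
overlapping = rng.normal(0.0, 1.0, (80, 4))

score_separated = silhouette_two_class(separated, toy_labels)
score_overlapping = silhouette_two_class(overlapping, toy_labels)
print(f"separated: {score_separated:.3f}, overlapping: {score_overlapping:.3f}")
```

Scores near 1 mean each class forms its own tight cluster, while scores near 0 mean the classes overlap, which is the case the analysis cell flags with its `< 0.1` check.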

## 9️⃣ Single Image Prediction Example

In [None]:
import random
from pathlib import Path

import matplotlib.pyplot as plt
from PIL import Image

# Take up to 5 fake and 5 real test images and shuffle them
test_images = list(Path(TEST_A_FAKE).glob("*.jpg"))[:5] + list(Path(TEST_A_REAL).glob("*.jpg"))[:5]
random.shuffle(test_images)

fig, axes = plt.subplots(2, 5, figsize=(15, 6))

for ax, img_path in zip(axes.flatten(), test_images):
    img = Image.open(img_path).convert("RGB")
    result = detector.predict(img, method="knn")
    
    # Ground truth comes from the parent folder name ("real" or "fake")
    gt = "real" if img_path.parent.name == "real" else "fake"
    correct = result.prediction == gt
    
    ax.imshow(img)
    ax.set_title(
        f"Pred: {result.prediction}\nGT: {gt}\nConf: {result.confidence:.2f}",
        color="green" if correct else "red",
        fontsize=9,
    )
    ax.axis("off")

plt.suptitle("Stage 1 Predictions (Green=Correct, Red=Wrong)", fontsize=14)
plt.tight_layout()
plt.savefig("stage1_predictions.png", dpi=150)
plt.show()

## 🔟 Summary & Results

In [None]:
print("\n" + "="*60)
print("📊 STAGE 1 RESULTS SUMMARY")
print("="*60)

print("\n🔧 Configuration:")
print(f"   Encoder: CLIP {config.encoder_variant}")
print(f"   K neighbors: {config.k_neighbors}")
print(f"   Database size: {detector.db.count()} embeddings")

print("\n📈 Results - Dataset A (In-Domain):")
for method, metrics in results_A.items():
    print(f"   {method.upper()}: Accuracy = {metrics['accuracy']:.4f}, F1 = {metrics['f1']:.4f}")

print("\n📈 Results - Dataset B (Cross-Domain):")
for method, metrics in results_B.items():
    print(f"   {method.upper()}: Accuracy = {metrics['accuracy']:.4f}, F1 = {metrics['f1']:.4f}")

print("\n📊 Cluster Separation:")
print(f"   Silhouette Score: {cluster_metrics['silhouette_score']:.4f}")
print(f"   K-Means Accuracy: {cluster_metrics['kmeans_accuracy']:.4f}")
print(f"   Separation Ratio: {cluster_metrics['separation_ratio']:.4f}")

print("\n" + "="*60)
print("💡 ANALYSIS")
print("="*60)

best_acc_A = max(m['accuracy'] for m in results_A.values())
best_acc_B = max(m['accuracy'] for m in results_B.values())

if best_acc_A < 0.70:
    print("❌ Low accuracy! CLIP embeddings don't separate well.")
    print("   → Need Stage 2: Better classification algorithms")
    print("   → Or Stage 3: LoRA fine-tuning")
elif best_acc_A < 0.80:
    print("⚠️ Moderate accuracy. Baseline works, but can be improved.")
    print("   → Proceed to Stage 2: Try SVM, MLP classifiers")
else:
    print("✅ Good accuracy! CLIP embeddings are useful.")
    print("   → Stage 2 may still improve results")

if cluster_metrics['silhouette_score'] < 0.1:
    print("\n⚠️ Classes overlap significantly in embedding space.")
    print("   → Fine-tuning (Stage 3) will likely help a lot!")

print("\n" + "="*60)

## 1️⃣1️⃣ Save Results & Model

In [None]:
# Save the detector
detector.save("./stage1_detector")

# Save results to JSON
import json

results_summary = {
    "config": {
        "encoder_variant": config.encoder_variant,
        "k_neighbors": config.k_neighbors,
    },
    "results_A": results_A,
    "results_B": results_B,
    "cluster_metrics": cluster_metrics,
}

with open("stage1_results.json", "w") as f:
    json.dump(results_summary, f, indent=2, default=str)

print("\n✅ Results saved to stage1_results.json")

## 1️⃣2️⃣ Save to Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

import shutil
from datetime import datetime

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
dest = f"/content/drive/MyDrive/deepfake_stage1_{timestamp}"

# Copy results
import os
os.makedirs(dest, exist_ok=True)

for f in ["stage1_tsne.png", "stage1_umap.png", "stage1_clusters.png", 
          "stage1_predictions.png", "stage1_results.json"]:
    if os.path.exists(f):
        shutil.copy(f, dest)

shutil.copytree("./stage1_detector", f"{dest}/detector", dirs_exist_ok=True)

print(f"\n✅ Saved to: {dest}")

---

# 🎯 Next Steps

## If accuracy < 75%:
→ **Stage 2**: Try better classifiers (SVM, MLP, XGBoost) on embeddings

## If accuracy 75-85%:
→ **Stage 3**: LoRA fine-tuning on CLIP

## If you want better generalization:
→ **Stage 4**: Add frequency domain features (FFT)
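
As a rough idea of what frequency domain features could look like, the sketch below reduces a grayscale image to a radially averaged log power spectrum. This is only one possible design and an assumption on my part: the notebook does not define Stage 4, and `radial_power_spectrum` is a hypothetical helper.

```python
import numpy as np


def radial_power_spectrum(gray: np.ndarray, n_bins: int = 16) -> np.ndarray:
    """Average the log power spectrum over rings of equal spatial frequency."""
    spectrum = np.fft.fftshift(np.fft.fft2(gray))   # move the DC term to the centre
    power = np.log1p(np.abs(spectrum) ** 2)

    h, w = gray.shape
    yy, xx = np.indices((h, w))
    radius = np.hypot(yy - h // 2, xx - w // 2)      # distance from the DC term

    edges = np.linspace(0.0, radius.max(), n_bins + 1)
    ring = np.clip(np.digitize(radius.ravel(), edges) - 1, 0, n_bins - 1)

    sums = np.bincount(ring, weights=power.ravel(), minlength=n_bins)
    counts = np.bincount(ring, minlength=n_bins)
    return sums / np.maximum(counts, 1)              # mean log power per ring


# Toy input: a smooth vertical gradient, so energy sits at low frequencies.
gray = np.outer(np.linspace(0.0, 1.0, 64), np.ones(64))
features = radial_power_spectrum(gray)
print(features.shape)
```

Such a vector could be concatenated with the CLIP embedding (for example with `np.concatenate`) before classification; whether that actually improves cross-domain accuracy would have to be measured on Dataset B.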

---

**Report your results and we'll proceed to the next stage!**