# Check if static seed is needed for reproducibility

## Baseline (Logistic Regression)

In [1]:
import os
import pandas as pd

# Directory containing the results
results_dir = "../results/classic"

# Steps to compare
sizes = [25, 50, 100, 150, 200, 250, 300, 350, 400]

# Metrics to compare
metrics = ["accuracy", "precision", "recall", "f1"]

header = (
    f"{'Size':>6} | {'Metric':>10} | {'Original':>10} | {'Diff_2':>10} | {'Diff_3':>10} | {'Diff_4':>10}"
)
print(header)
print("-" * len(header))

# Collect all differences for a small test at the end
all_differences = {metric: [] for metric in metrics}

for size in sizes:
    file1 = os.path.join(results_dir, f"classical_results_{size}.csv")
    file2 = os.path.join(results_dir, f"classical_results_{size}_2.csv")
    file3 = os.path.join(results_dir, f"classical_results_{size}_3.csv")
    file4 = os.path.join(results_dir, f"classical_results_{size}_4.csv")
    missing = []
    for f in [file1, file2, file3, file4]:
        if not os.path.exists(f):
            missing.append(f)
    if missing:
        print(f"Missing file(s) for size {size}: {' '.join(missing)}")
        continue

    df1 = pd.read_csv(file1)
    df2 = pd.read_csv(file2)
    df3 = pd.read_csv(file3)
    df4 = pd.read_csv(file4)

    # Assume single row per file
    row1 = df1.iloc[0]
    row2 = df2.iloc[0]
    row3 = df3.iloc[0]
    row4 = df4.iloc[0]

    for metric in metrics:
        val1 = row1[metric] if metric in row1 else float('nan')
        val2 = row2[metric] if metric in row2 else float('nan')
        val3 = row3[metric] if metric in row3 else float('nan')
        val4 = row4[metric] if metric in row4 else float('nan')
        diff2 = val2 - val1
        diff3 = val3 - val1
        diff4 = val4 - val1
        print(
            f"{size:6} | {metric:10} | {val1:10.4f} | {diff2:10.4f} | {diff3:10.4f} | {diff4:10.4f}"
        )
        # Collect differences for the test
        all_differences[metric].extend([diff2, diff3, diff4])

# === Difference Test Summary ===
tolerance = 1e-6
print("\n=== Difference Test Summary ===")
for metric in metrics:
    diffs = all_differences[metric]
    # Check if any difference is greater than tolerance
    significant_diffs = [d for d in diffs if abs(d) > tolerance]
    if significant_diffs:
        print(f"Metric '{metric}': Differences found in {len(significant_diffs)} run(s) (max diff: {max(abs(d) for d in significant_diffs):.6f})")
    else:
        print(f"Metric '{metric}': No significant differences across runs (all diffs <= {tolerance})")

  Size |     Metric |   Original |     Diff_2 |     Diff_3 |     Diff_4
-----------------------------------------------------------------------
    25 | accuracy   |     0.5012 |     0.0000 |     0.0000 |     0.0000
    25 | precision  |     1.0000 |     0.0000 |     0.0000 |     0.0000
    25 | recall     |     0.0025 |     0.0000 |     0.0000 |     0.0000
    25 | f1         |     0.0050 |     0.0000 |     0.0000 |     0.0000
    50 | accuracy   |     0.5800 |     0.0000 |     0.0000 |     0.0000
    50 | precision  |     0.5478 |     0.0000 |     0.0000 |     0.0000
    50 | recall     |     0.9175 |     0.0000 |     0.0000 |     0.0000
    50 | f1         |     0.6860 |     0.0000 |     0.0000 |     0.0000
   100 | accuracy   |     0.6737 |     0.0000 |     0.0000 |     0.0000
   100 | precision  |     0.6248 |     0.0000 |     0.0000 |     0.0000
   100 | recall     |     0.8700 |     0.0000 |     0.0000 |     0.0000
   100 | f1         |     0.7273 |     0.0000 |     0.0000 |    

---
### Baseline (LogReg)

In our classical baseline using TF-IDF vectorization and logistic regression, we ran the training pipeline **four times** and compared the resulting metrics across different training sizes.

The results showed **essentially no variation across runs** in any of the evaluated metrics (`accuracy`, `precision`, `recall`, `f1`) at any training size. Differences in subsequent runs (`Diff_2`, `Diff_3`, `Diff_4`) were either **zero** or **very small (≤ 0.0012)**. The non-zero values exceed the `1e-6` tolerance used in the summary check, but they are small enough to be attributed to numerical effects rather than stochastic behavior.

This consistency indicates that, on the given input:
- The TF-IDF vectorization with fixed vocabulary size (`max_features=5000`) and n-gram range is deterministic.
- Logistic Regression (with `max_iter=1000`) behaves deterministically.
---

## Fine Tuning

In [2]:
import os
import pandas as pd

# Directory containing the results
results_dir = "../results/fine-tuning"

sizes = [25, 50, 100, 150, 200, 250, 300, 350, 400]
epoch = 50  # always use 50 epochs

# Metrics to compare
metrics = ["eval_accuracy", "eval_precision", "eval_recall", "eval_f1"]  # adjust as needed

header = (
    f"{'Size':>6} | {'Epoch':>5} | {'Metric':>10} | {'Original':>10} | {'Diff_2':>10} | {'Diff_3':>10} | {'Diff_4':>10}"
)
print(header)
print("-" * len(header))

# To collect all differences for the test at the end
all_differences = {metric: [] for metric in metrics}

for size in sizes:
    # Build file names
    base = f"fine_tuning_metrics_{size}_{epoch}e"
    files = [
        os.path.join(results_dir, f"{base}.csv"),
        os.path.join(results_dir, f"{base}_2.csv"),
        os.path.join(results_dir, f"{base}_3.csv"),
        os.path.join(results_dir, f"{base}_4.csv"),
    ]
    missing = [f for f in files if not os.path.exists(f)]
    if missing:
        print(f"Missing file(s) for size {size}, epoch {epoch}: {' '.join(os.path.basename(m) for m in missing)}")
        continue

    # Read the files
    dfs = [pd.read_csv(f) for f in files]
    # Assume single row per file
    rows = [df.iloc[0] for df in dfs]

    for metric in metrics:
        vals = [row[metric] if metric in row else float('nan') for row in rows]
        diff2 = vals[1] - vals[0]
        diff3 = vals[2] - vals[0]
        diff4 = vals[3] - vals[0]
        print(
            f"{size:6} | {epoch:5} | {metric:10} | {vals[0]:10.4f} | {diff2:10.4f} | {diff3:10.4f} | {diff4:10.4f}"
        )
        # Collect differences for the test
        all_differences[metric].extend([diff2, diff3, diff4])

# Test at the end: check if there are any nonzero differences (beyond a small tolerance)
tolerance = 1e-6
print("\n=== Difference Test Summary ===")
for metric in metrics:
    diffs = all_differences[metric]
    # Check if any difference is greater than tolerance
    significant_diffs = [d for d in diffs if abs(d) > tolerance]
    if significant_diffs:
        print(f"Metric '{metric}': Differences found in {len(significant_diffs)} run(s) (max diff: {max(abs(d) for d in significant_diffs):.6f})")
    else:
        print(f"Metric '{metric}': No significant differences across runs (all diffs <= {tolerance})")


  Size | Epoch |     Metric |   Original |     Diff_2 |     Diff_3 |     Diff_4
-------------------------------------------------------------------------------
    25 |    50 | eval_accuracy |     0.7575 |    -0.0675 |    -0.0900 |    -0.0687
    25 |    50 | eval_precision |     0.8301 |     0.0062 |    -0.0170 |    -0.0452
    25 |    50 | eval_recall |     0.6475 |    -0.1750 |    -0.2125 |    -0.1275
    25 |    50 | eval_f1    |     0.7275 |    -0.1237 |    -0.1608 |    -0.1020
    50 |    50 | eval_accuracy |     0.7762 |     0.0038 |    -0.0100 |     0.0000
    50 |    50 | eval_precision |     0.7826 |    -0.0233 |    -0.0222 |     0.0000
    50 |    50 | eval_recall |     0.7650 |     0.0550 |     0.0125 |     0.0000
    50 |    50 | eval_f1    |     0.7737 |     0.0148 |    -0.0049 |     0.0000
   100 |    50 | eval_accuracy |     0.8662 |     0.0025 |     0.0175 |     0.0112
   100 |    50 | eval_precision |     0.8886 |    -0.0297 |     0.0229 |     0.0109
   100 |    50 | 

---
### Fine-tuning results

Based on the evaluation results across different training sizes and runs, there is clear evidence of performance variability despite identical training configurations. For example:

- `eval_accuracy` differences reach up to **0.09**
- `eval_recall` differences go as high as **0.21**
- Similar fluctuations are observed in `eval_precision` and `eval_f1`

These variations indicate that randomness in model initialization, data shuffling, or other stochastic elements significantly impacts the results.
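
To put a single number on this variability, the four runs of one metric can be summarized by their mean, standard deviation, and range. The sketch below does this for `eval_accuracy` at size 25, with the run values reconstructed from the table above (original value plus each difference). The helper name `summarize_runs` is hypothetical and not part of the project code.

```python
from statistics import mean, stdev

# eval_accuracy for size 25, reconstructed from the table above:
# the original value plus Diff_2, Diff_3 and Diff_4.
original = 0.7575
diffs = [-0.0675, -0.0900, -0.0687]
runs = [original] + [original + d for d in diffs]


def summarize_runs(values):
    """Return mean, sample standard deviation and range (max - min) of the runs."""
    return {
        "mean": mean(values),
        "std": stdev(values),
        "range": max(values) - min(values),
    }


summary = summarize_runs(runs)
print(
    f"mean={summary['mean']:.4f} "
    f"std={summary['std']:.4f} "
    f"range={summary['range']:.4f}"
)
```

The range here equals the largest absolute difference to the original run only because the original run happens to be the best of the four; in general the two quantities differ.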


---
## Transfer learning

In [3]:
import os
import pandas as pd

# Directory containing the results
results_dir = "../results/transfer"

sizes = [25, 50, 100, 150, 200, 250, 300, 350, 400]
epoch = 50  # always use 50 epochs

# Metrics to compare
metrics = ["eval_accuracy", "eval_precision", "eval_recall", "eval_f1"]  # adjust as needed

header = (
    f"{'Size':>6} | {'Epoch':>5} | {'Metric':>10} | {'Original':>10} | {'Diff_2':>10} | {'Diff_3':>10} | {'Diff_4':>10}"
)
print(header)
print("-" * len(header))

# To collect all differences for the test at the end
all_differences = {metric: [] for metric in metrics}

for size in sizes:
    # Build file names
    base = f"transfer_metrics_{size}_{epoch}e"
    files = [
        os.path.join(results_dir, f"{base}.csv"),
        os.path.join(results_dir, f"{base}_2.csv"),
        os.path.join(results_dir, f"{base}_3.csv"),
        os.path.join(results_dir, f"{base}_4.csv"),
    ]
    missing = [f for f in files if not os.path.exists(f)]
    if missing:
        print(f"Missing file(s) for size {size}, epoch {epoch}: {' '.join(os.path.basename(m) for m in missing)}")
        continue

    # Read the files
    dfs = [pd.read_csv(f) for f in files]
    # Assume single row per file
    rows = [df.iloc[0] for df in dfs]

    for metric in metrics:
        vals = [row[metric] if metric in row else float('nan') for row in rows]
        diff2 = vals[1] - vals[0]
        diff3 = vals[2] - vals[0]
        diff4 = vals[3] - vals[0]
        print(
            f"{size:6} | {epoch:5} | {metric:10} | {vals[0]:10.4f} | {diff2:10.4f} | {diff3:10.4f} | {diff4:10.4f}"
        )
        # Collect differences for the test
        all_differences[metric].extend([diff2, diff3, diff4])

# Test at the end: check if there are any nonzero differences (beyond a small tolerance)
tolerance = 1e-6
print("\n=== Difference Test Summary ===")
for metric in metrics:
    diffs = all_differences[metric]
    # Check if any difference is greater than tolerance
    significant_diffs = [d for d in diffs if abs(d) > tolerance]
    if significant_diffs:
        print(f"Metric '{metric}': Differences found in {len(significant_diffs)} run(s) (max diff: {max(abs(d) for d in significant_diffs):.6f})")
    else:
        print(f"Metric '{metric}': No significant differences across runs (all diffs <= {tolerance})")


  Size | Epoch |     Metric |   Original |     Diff_2 |     Diff_3 |     Diff_4
-------------------------------------------------------------------------------
    25 |    50 | eval_accuracy |     0.5038 |     0.0137 |     0.0037 |     0.0037
    25 |    50 | eval_precision |     0.5021 |     0.0216 |     0.0579 |     0.0579
    25 |    50 | eval_recall |     0.9000 |    -0.5125 |    -0.8300 |    -0.8300
    25 |    50 | eval_f1    |     0.6446 |    -0.1992 |    -0.5201 |    -0.5201
    50 |    50 | eval_accuracy |     0.5363 |     0.0000 |    -0.0175 |    -0.0175
    50 |    50 | eval_precision |     0.5502 |     0.0000 |     0.0016 |     0.0016
    50 |    50 | eval_recall |     0.3975 |     0.0000 |    -0.1975 |    -0.1975
    50 |    50 | eval_f1    |     0.4615 |     0.0000 |    -0.1680 |    -0.1680
   100 |    50 | eval_accuracy |     0.4863 |     0.0175 |     0.0275 |     0.0125
   100 |    50 | eval_precision |     0.4883 |     0.0136 |     0.0224 |     0.0103
   100 |    50 | 

---
### Transfer-learning results

The evaluation results clearly show significant variability in model performance across different runs, despite using identical training settings. For instance:

- `eval_accuracy` differences reach up to **0.07**
- `eval_precision` differences go as high as **0.52**
- `eval_recall` fluctuates by as much as **0.83**
- `eval_f1` varies by up to **0.58**

These fluctuations suggest that stochastic elements such as weight initialization, data shuffling, and batching are introducing instability into the training process.
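
The large `eval_f1` swings follow directly from the recall swings, because F1 is the harmonic mean of precision and recall. As a rough check, the sketch below recomputes F1 for two of the size-25 runs from the precision and recall values in the table above (original run, and the original plus `Diff_3`). The helper `f1_score` is written out here for illustration and is not taken from the project code.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Size 25, original run: values read from the table above.
f1_original = f1_score(0.5021, 0.9000)

# Size 25, third run: original value plus Diff_3 for each metric.
f1_run3 = f1_score(0.5021 + 0.0579, 0.9000 - 0.8300)

print(f"original run: F1 = {f1_original:.4f}")
print(f"third run:    F1 = {f1_run3:.4f}")
print(f"difference:   {f1_run3 - f1_original:.4f}")
```

Precision changes only slightly between the two runs, while recall drops from 0.90 to 0.07, so almost all of the F1 difference comes from recall.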

---
### Why Using a Static Seed Is Important for Training Deep Learning Models

In machine learning, and especially in deep learning, training is often influenced by stochastic processes such as:

- Random weight initialization
- Data shuffling during training
- Dropout layers and other regularization techniques

These sources of randomness can lead to **non-deterministic outcomes**, meaning that running the same training pipeline multiple times may produce significantly different results. This has been observed in both **fine-tuning** and **transfer learning** experiments, where evaluation metrics (like accuracy, precision, recall, and F1 score) vary noticeably across runs.

By setting a **static random seed**, we can:

- **Ensure reproducibility**: Results can be reliably replicated for debugging or reporting.
- **Stabilize model comparison**: Differences in performance between models or configurations can be attributed to meaningful changes rather than random noise.
- **Simplify experimentation**: It becomes easier to isolate the effects of hyperparameter tuning or architectural changes.

While a static seed isn't necessary for deterministic algorithms like logistic regression (as confirmed by the baseline results), it's **strongly recommended** for any model or pipeline involving randomness.
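
The effect of a fixed seed can be illustrated without any deep learning framework. The sketch below uses only Python's standard `random` module to shuffle a list of sample indices, as a stand-in for data shuffling during training. The function `shuffled_indices` is a hypothetical helper for this illustration; the seed value matches the one used in this project.

```python
import random


def shuffled_indices(n, seed=None):
    """Return the indices 0..n-1 in shuffled order, using a private generator."""
    rng = random.Random(seed)  # seed=None lets Python pick a fresh random state
    indices = list(range(n))
    rng.shuffle(indices)
    return indices


# With the same seed, every call yields the same order.
first = shuffled_indices(10, seed=2277)
second = shuffled_indices(10, seed=2277)
print("seeded runs identical:", first == second)

# Without a seed, the order will usually differ from call to call.
print("unseeded runs identical:", shuffled_indices(10) == shuffled_indices(10))
```

The same idea applies to weight initialization and dropout: once every random number generator involved starts from a known state, repeated runs draw the same sequence of random values.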

### Setting a Static Seed in PyTorch

To enforce reproducibility in PyTorch-based experiments, the following function, defined in `utils.py`, is used:

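```python
import random

import numpy as np
import torch

SEED = 2277


def set_seed():
    """Seed Python, NumPy, and PyTorch and request deterministic kernels."""
    # Raise an error if an operation has no deterministic implementation.
    # On CUDA, some operations also require the CUBLAS_WORKSPACE_CONFIG
    # environment variable (e.g. ":4096:8") to be set.
    torch.use_deterministic_algorithms(True)

    random.seed(SEED)
    np.random.seed(SEED)

    torch.manual_seed(SEED)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(SEED)
        torch.cuda.manual_seed_all(SEED)

    # Make cuDNN pick deterministic algorithms and disable its autotuner.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```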