# ESA Project: Fake or Real: The Impostor Hunt in Texts

This notebook is dedicated to **baseline training**.  
It covers:

- Load the exploded training and validation data (`train_exploded.csv` and `val_exploded.csv`).
- Build and train a **Logistic Regression** model on **TF-IDF (Term Frequency-Inverse Document Frequency)** features extracted from the `text_chunk` column.
- Use the trained baseline model to predict the probability of the "real" class (`label=1`) for every chunk in the validation set.
- Implement the **Realness score comparison** strategy, where the average "real" probability of all chunks is calculated for `file1` and `file2`, and the file with the higher average is predicted as the real text.
- Determine the final **Accuracy** of the baseline model on the original text pair classification task (predicting 1 or 2), providing a crucial benchmark for the DistilBERT model.

# Import libraries

In [1]:
from sklearn.model_selection import GroupKFold
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
import os
import sys
import numpy as np
import pandas as pd

# Add the src folder to Python path
sys.path.append(os.path.abspath(os.path.join('..', 'src')))
import config

# Evaluation function (Realness score comparison for baseline model)

In [2]:
def evaluate_original_texts_baseline(exploded_df, chunk_probs):
    """
    Combines chunk-level probabilities to make a final prediction on the original 
    text pairs based on the Realness Score Comparison strategy, adapted for the baseline.
    """
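    # Work on a copy so the caller's DataFrame is not modified in place,
    # then add the chunk probabilities to the exploded DataFrame
    exploded_df = exploded_df.copy()
    exploded_df['real_prob'] = chunk_probs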

    # Identify Original Text Pairs
    # We need to ensure the original index is available. 
    # Assuming 'original_index' is a column that uniquely identifies the original text pair.
    # If not, we create a temporary one based on unique pairs.
    if 'original_index' not in exploded_df.columns:
        print("Warning: 'original_index' column not found. Creating a temporary one based on unique pairs")
        # Create a temporary ID based on the unique combination of the original texts
        # Note: This relies on 'file1_text' and 'file2_text' being present and identical for all chunks of the same original pair
        exploded_df['original_index'] = exploded_df.groupby(['file1_text', 'file2_text']).ngroup()
        
    # Calculate Realness Score for each original text (file1 and file2)
    # Group by the original pair ID and the source file ('file1' or 'file2')
    # Calculate the average 'real_prob' for each source
    realness_scores = exploded_df.groupby(['original_index', 'source'])['real_prob'].mean().reset_index()
    
    # Pivot the table to get file1_score and file2_score side-by-side
    realness_scores_pivot = realness_scores.pivot(
        index='original_index', 
        columns='source', 
        values='real_prob'
    ).reset_index()
    
    # Merge back the true label (real_text_id) for the original pair
    true_labels = exploded_df[['original_index', 'real_text_id']].drop_duplicates()
    
    final_evaluation_df = realness_scores_pivot.merge(true_labels, on='original_index', how='left')
    
    # Make Final Prediction
    # Prediction: 1 if file1_score > file2_score, else 2 (ties therefore go to 2)
    # A missing score (a file with no chunks, unlikely but safe) is treated as 0
    final_evaluation_df['predicted_text_id'] = np.where(
        final_evaluation_df['file1'].fillna(0) > final_evaluation_df['file2'].fillna(0), 
        1, 
        2
    )
    
    # Calculate Final Accuracy
    true_labels = final_evaluation_df['real_text_id'].values
    predicted_labels = final_evaluation_df['predicted_text_id'].values
    
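    final_accuracy = accuracy_score(true_labels, predicted_labels)
    # F1 treats text id 1 as the positive class (the default, made explicit here)
    final_f1 = f1_score(true_labels, predicted_labels, average='binary', pos_label=1)
    # Fix the label order so the matrix is always 2x2, even if a fold contains only one class
    cm = confusion_matrix(true_labels, predicted_labels, labels=[1, 2])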
    
    print('-' * 80)
    print("\nBaseline Evaluation Results (Original text pair level) ")
    print(f"Total original text pairs evaluated: {len(final_evaluation_df)}")
    print(f"Final Prediction Accuracy: {final_accuracy:.4f}")
    print(f"Final Prediction F1-Score: {final_f1:.4f}")
    print("\nConfusion Matrix (True Label vs. Predicted Label):")
    print(cm)
    
    return final_evaluation_df, final_accuracy, final_f1
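
The aggregation performed by the function above can be checked on a toy table. The sketch below is a minimal illustration, not the notebook's function: it repeats only the group, average, and compare steps, and every value in `toy` is made up for the example.

```python
import numpy as np
import pandas as pd

# Toy chunk-level table: two pairs, each with chunks from file1 and file2.
# All probabilities and labels here are hypothetical.
toy = pd.DataFrame({
    "original_index": [0, 0, 0, 1, 1, 1],
    "source": ["file1", "file1", "file2", "file1", "file2", "file2"],
    "real_text_id": [1, 1, 1, 2, 2, 2],
    "real_prob": [0.9, 0.7, 0.4, 0.2, 0.6, 0.8],
})

# Mean "real" probability per (pair, source), with one column per source.
scores = (
    toy.groupby(["original_index", "source"])["real_prob"]
    .mean()
    .unstack("source")
)

# Predict the file with the higher mean score; ties fall to file 2.
scores["predicted_text_id"] = np.where(scores["file1"] > scores["file2"], 1, 2)

# Attach the true label of each pair (aligned on original_index).
truth = toy.drop_duplicates("original_index").set_index("original_index")["real_text_id"]
scores["real_text_id"] = truth

print(scores)
```

In this toy case, pair 0 has mean scores 0.8 (file1) and 0.4 (file2), so file 1 is predicted; pair 1 has 0.2 and 0.7, so file 2 is predicted.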

# Load data

In [3]:
TRAIN_CLEANED_FILE = config.PROCESSED_DATA_DIR / "train_exploded.csv"
print(f"Loading exploded training data from: {TRAIN_CLEANED_FILE}")
train_df = pd.read_csv(TRAIN_CLEANED_FILE)

# Ensure 'original_index' exists
if 'original_index' not in train_df.columns:
    print("Creating 'original_index' to group chunks belonging to same text pair")
    train_df['original_index'] = train_df.groupby(['file1_text', 'file2_text']).ngroup()

train_df.head()

Loading exploded training data from: /Users/photoli93/Desktop/Projets perso Python/esa_fake_or_real/data/processed/train_exploded.csv
Creating 'original_index' to group chunks belonging to same text pair


Unnamed: 0,real_text_id,file1_text,file2_text,file1_char_len,file2_char_len,file1_word_len,file2_word_len,combined_text,file1_text_cleaned,file2_text_cleaned,...,file2_text_cleaned_tokens,file2_text_cleaned_num_tokens,file1_text_cleaned_chunks,file1_text_cleaned_num_chunks,file2_text_cleaned_chunks,file2_text_cleaned_num_chunks,text_chunk,source,label,original_index
0,2,We determine accurate values for the total lit...,We determine accurate values for the total lit...,7101,2525,751,406,We determine accurate values for the total lit...,determine accurate value total lithium abundan...,determine accurate value total lithium abundan...,...,"['determine', 'accurate', 'value', 'total', 'l...",280,['determine accurate value total lithium abund...,4,['determine accurate value total lithium abund...,1,determine accurate value total lithium abundan...,file1,0,70
1,2,We determine accurate values for the total lit...,We determine accurate values for the total lit...,7101,2525,751,406,We determine accurate values for the total lit...,determine accurate value total lithium abundan...,determine accurate value total lithium abundan...,...,"['determine', 'accurate', 'value', 'total', 'l...",280,['determine accurate value total lithium abund...,4,['determine accurate value total lithium abund...,1,resultantkan usp mind lux color hotel empirica...,file1,0,70
2,2,We determine accurate values for the total lit...,We determine accurate values for the total lit...,7101,2525,751,406,We determine accurate values for the total lit...,determine accurate value total lithium abundan...,determine accurate value total lithium abundan...,...,"['determine', 'accurate', 'value', 'total', 'l...",280,['determine accurate value total lithium abund...,4,['determine accurate value total lithium abund...,1,lantern einen histor cean gemusept replacequis...,file1,0,70
3,2,We determine accurate values for the total lit...,We determine accurate values for the total lit...,7101,2525,751,406,We determine accurate values for the total lit...,determine accurate value total lithium abundan...,determine accurate value total lithium abundan...,...,"['determine', 'accurate', 'value', 'total', 'l...",280,['determine accurate value total lithium abund...,4,['determine accurate value total lithium abund...,1,##igero centerall omgevingtocol lacao lamidora...,file1,0,70
4,2,The 160-megapixel **Edam** camera was designed...,The QUEST camera has 160 megapixels and was cr...,2076,1368,337,219,The 160-megapixel **Edam** camera was designed...,megapixel edam camera design fabricate yale un...,qu camera megapixel create yale university hel...,...,"['qu', 'camera', 'mega', '##pi', '##x', '##el'...",165,['megapixel edam camera design fabricate yale ...,1,['qu camera megapixel create yale university h...,1,megapixel edam camera design fabricate yale un...,file1,0,40


# Prepare Features and Labels

In [4]:
X = train_df['text_chunk']
y = train_df['label']
groups = train_df['original_index']

# Define Baseline model

In [5]:
baseline_model = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=3000, ngram_range=(1, 2), stop_words='english')),
    ('clf', LogisticRegression(solver='liblinear', C=0.1, penalty='l2', random_state=42))
])
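
A quick sanity check of this kind of pipeline on a toy corpus may help before running it on the real chunks. The sketch below assumes a four-sentence, made-up corpus and a simplified vectorizer (no `max_features`, no stop-word list); it also looks up the "real" column through `classes_` rather than assuming it is column 1.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: label 1 = "real"-looking, label 0 = "fake"-looking.
texts = [
    "telescope measures stellar lithium abundance",
    "camera observes distant galaxy survey",
    "lantern hotel color mind random",
    "replace random hotel lantern color",
]
labels = [1, 1, 0, 0]

# Same structure as the baseline: TF-IDF on uni- and bigrams, then
# an L2-regularized logistic regression (L2 is the default penalty).
toy_model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(solver="liblinear", C=0.1, random_state=42)),
])
toy_model.fit(texts, labels)

# predict_proba columns follow classes_, so find the column for label 1.
classes = list(toy_model.named_steps["clf"].classes_)
real_col = classes.index(1)

queries = ["stellar lithium survey", "hotel lantern"]
probs = toy_model.predict_proba(queries)[:, real_col]
print(dict(zip(queries, probs.round(3))))
```

With labels 0 and 1, `classes_` is sorted, so the "real" class ends up in column 1, which is what indexing with `[:, 1]` relies on.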

# Setup GroupKFold CV

In [6]:
n_unique_pairs = train_df['original_index'].nunique()
n_splits = min(5, n_unique_pairs)
gkf = GroupKFold(n_splits=n_splits)

pair_accuracies = []
pair_f1 = []
fold_results = []

print(f"\nStarting {n_splits}-fold GroupKFold cross-validation at the pair level\n")


Starting 5-fold GroupKFold cross-validation at the pair level
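
Grouping by `original_index` matters because chunks from the same text pair must not appear in both the training and the validation split. The sketch below uses made-up toy arrays (10 chunks from 5 pairs) to check that `GroupKFold` keeps each group on one side of every split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical toy data: 5 pairs with 2 chunks each.
toy_groups = np.repeat(np.arange(5), 2)   # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
toy_X = np.arange(10).reshape(-1, 1)
toy_y = np.tile([0, 1], 5)

toy_gkf = GroupKFold(n_splits=5)

# For each fold, count the groups present on both sides of the split.
overlaps = []
for train_idx, val_idx in toy_gkf.split(toy_X, toy_y, toy_groups):
    shared = set(toy_groups[train_idx]) & set(toy_groups[val_idx])
    overlaps.append(len(shared))

print(overlaps)  # one entry per fold; 0 means no pair leaks across the split
```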



# Cross-Validation loop

In [9]:
for fold, (train_idx, val_idx) in enumerate(gkf.split(X, y, groups)):
    print('-' * 80)
    print(f"Fold {fold+1}/{n_splits}")

    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
    val_df = train_df.iloc[val_idx].copy()

    # Train the model
    baseline_model.fit(X_train, y_train)

    # Predict probabilities for validation chunks
    chunk_probs = baseline_model.predict_proba(X_val)[:, 1]

    # Evaluate at original text pair level
    fold_final_df, fold_accuracy, fold_f1 = evaluate_original_texts_baseline(val_df, chunk_probs)
    pair_accuracies.append(fold_accuracy)
    pair_f1.append(fold_f1)

    # Store results
    fold_final_df['fold'] = fold + 1
    fold_final_df['fold_accuracy'] = fold_accuracy
    fold_final_df['fold_f1'] = fold_f1
    fold_results.append(fold_final_df)

    print(f"Fold {fold+1} Pair-Level Accuracy: {fold_accuracy:.4f}")
    print(f"Fold {fold+1} Pair-Level f1: {fold_f1:.4f}\n")

--------------------------------------------------------------------------------
Fold 1/5
--------------------------------------------------------------------------------

Baseline Evaluation Results (Original text pair level) 
Total original text pairs evaluated: 14
Final Prediction Accuracy: 0.7857
Final Prediction F1-Score: 0.8000

Confusion Matrix (True Label vs. Predicted Label):
[[6 2]
 [1 5]]
Fold 1 Pair-Level Accuracy: 0.7857
Fold 1 Pair-Level f1: 0.8000

--------------------------------------------------------------------------------
Fold 2/5
--------------------------------------------------------------------------------

Baseline Evaluation Results (Original text pair level) 
Total original text pairs evaluated: 15
Final Prediction Accuracy: 0.9333
Final Prediction F1-Score: 0.9333

Confusion Matrix (True Label vs. Predicted Label):
[[7 0]
 [1 7]]
Fold 2 Pair-Level Accuracy: 0.9333
Fold 2 Pair-Level f1: 0.9333

----------------------------------------------------------------

# Results

In [12]:
# Combine all folds’ results
cv_results_df = pd.concat(fold_results, ignore_index=True)

# Summary
mean_acc = np.mean(pair_accuracies)
std_acc = np.std(pair_accuracies)
mean_f1 = np.mean(pair_f1)
std_f1 = np.std(pair_f1)

print('-' * 80)
print("\nCross-Validation Summary")
print(f"Average Pair-Level Accuracy: {mean_acc:.4f}")
print(f"Std. Dev: {std_acc:.4f}")
print(f"Average Pair-Level f1: {mean_f1:.4f}")
print(f"Std. Dev: {std_f1:.4f}")

# Save detailed results
output_path = config.OUTPUT_DIR / "baseline_cv_results.csv"
cv_results_df.to_csv(output_path, index=False)
print(f"\nDetailed CV results saved to: {output_path}")

--------------------------------------------------------------------------------

Cross-Validation Summary
Average Pair-Level Accuracy: 0.7971
Std. Dev: 0.0945
Average Pair-Level f1: 0.7974
Std. Dev: 0.0837

Detailed CV results saved to: /Users/photoli93/Desktop/Projets perso Python/esa_fake_or_real/results/baseline_cv_results.csv
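
One detail when reading the spread above: `np.std` divides by the number of folds by default (`ddof=0`). With only a handful of folds, the sample estimate (`ddof=1`) is noticeably larger. The sketch below illustrates the difference on three hypothetical fold accuracies; they are not results from this notebook.

```python
import numpy as np

# Hypothetical fold accuracies, chosen only to make the arithmetic easy.
toy_fold_acc = np.array([0.70, 0.80, 0.90])

pop_std = np.std(toy_fold_acc)             # divides by n (NumPy default, ddof=0)
sample_std = np.std(toy_fold_acc, ddof=1)  # divides by n - 1

print(f"mean={toy_fold_acc.mean():.4f}  std(ddof=0)={pop_std:.4f}  std(ddof=1)={sample_std:.4f}")
```

Either convention is defensible for a baseline report, as long as the same one is used when comparing against another model.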


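Across the 5 folds, the baseline reaches an average pair-level accuracy of 0.7971 (std. dev. 0.0945) and an average F1-score of 0.7974 (std. dev. 0.0837). DistilBERT outperforms it on both core metrics (accuracy and F1-score), but the baseline model is faster and more lightweight, which makes it suitable for quick experiments or resource-constrained environments.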

# End of baseline evaluation notebook