# KDSH ML Pipeline - Complete Workflow

**Narrative Consistency Detection using Machine Learning**

This notebook runs the complete KDSH pipeline:
1. Clone the repository and install dependencies
2. Train the ML classifier on the training data
3. Generate predictions on the test data
4. Verify the submission file
5. Download the submission file

📦 **GitHub:** [ishansurdi/KDSH](https://github.com/ishansurdi/KDSH)

🏆 **Performance:** 78.7% accuracy | 84.1% F1 score (both measured on the training set) | 38.3% detection rate (23 of 60 test examples flagged inconsistent)

## Step 1: Setup Environment

In [None]:
# Clone repository
!git clone https://github.com/ishansurdi/KDSH.git
%cd KDSH

In [None]:
# Install dependencies
!pip install -q pathway scikit-learn sentence-transformers pandas numpy tqdm

print("✓ Dependencies installed successfully!")

## Step 2: Train ML Classifier

This step trains the ML ensemble on the 80 labeled training examples.
It takes approximately 10-15 minutes.

**What happens:**
- Ingests both novels into Pathway document store
- Processes each training example through the full pipeline
- Extracts 20 features (conflicts, severities, evidence coverage, etc.)
- Trains 4 models: Random Forest, Gradient Boosting, MLP, Logistic Regression
- Saves trained ensemble to `results/ml_classifier.pkl`
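
The model-training part of this step can be sketched with scikit-learn. This is a minimal illustration, not the code in `train_ml.py`: the feature matrix is random stand-in data with the same shape and class split as the real one (80 × 20, 51 consistent and 29 inconsistent), and the 5-fold setting, `max_iter` values, and random seeds are assumptions. Only the tree and estimator counts and the MLP layer sizes come from the Model Ensemble section.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# Stand-in data: 80 examples x 20 features, 51 consistent (1), 29 inconsistent (0).
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 20))
y = np.array([1] * 51 + [0] * 29)

# Keys mirror the names in the training log; unlisted hyperparameters are assumptions.
models = {
    "rf": RandomForestClassifier(n_estimators=100, random_state=0),
    "gb": GradientBoostingClassifier(n_estimators=100, random_state=0),
    "mlp": MLPClassifier(hidden_layer_sizes=(64, 32, 16), max_iter=500, random_state=0),
    "lr": LogisticRegression(max_iter=1000),
}

cv_scores = {}
for name, model in models.items():
    # cv=5 is an assumption; for classifiers an integer cv gives stratified folds.
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_scores[name] = (scores.mean(), scores.std())
    print(f"Training {name}... CV Accuracy: {scores.mean():.3f} (±{scores.std():.3f})")
```

On random features the scores carry no meaning; the point is the shape of the loop that produces one cross-validated accuracy per model.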

In [None]:
# Train ML classifier
!python train_ml.py

### Expected Output:

```
================================================================================
TRAINING ML CLASSIFIER
================================================================================

[1] Loading training data...
Loaded 80 training examples

[2] Loading novels...
Loaded 2 novels

[3] Initializing pipeline...

[4] Indexing documents...
✓ Ingested novel 'In Search of the Castaways': 706 chunks
✓ Ingested novel 'The Count of Monte Cristo': 3933 chunks

[5] Extracting features from training data...
Processing training examples: 100%|████████| 80/80 [01:51<00:00,  1.40s/it]

[6] Training set shape: (80, 20)
    Consistent: 51
    Inconsistent: 29

[7] Training ML models...
[ML Classifier] Training on 80 examples with 20 features
  Training rf... CV Accuracy: 0.463 (±0.116)
  Training gb... CV Accuracy: 0.438 (±0.168)
  Training mlp... CV Accuracy: 0.625 (±0.040)  ← Best model
  Training lr... CV Accuracy: 0.512 (±0.061)

[8] Cross-validation results:
    rf: 0.463
    gb: 0.438
    mlp: 0.625
    lr: 0.512

[9] Saving trained models...
    Saved to results/ml_classifier.pkl

[10] Feature importance (top 10):
    score_x_claims: 0.2453
    inconsistency_score: 0.2075
    num_claims: 0.1395
    score_x_conflicts: 0.1366
    num_temporal: 0.0908
    ...

================================================================================
TRAINING COMPLETE!
================================================================================
```

## Step 3: Generate Predictions on Test Set

Next, run the trained ML classifier on the 60 test examples.
This takes approximately 1-2 minutes.

**Output:** `results/submission.csv` with predictions for all 60 test examples

In [None]:
# Generate predictions on test set
!python main.py --test data/test.csv --output results/submission.csv

### Expected Output:

```
============================================================
             KDSH NARRATIVE CONSISTENCY SYSTEM              
============================================================

Configuration:
  chunk_size: 1000
  max_hops: 3
  top_k_evidence: 5
  threshold: 0.7
✓ Loaded trained ML classifier

✓ Loaded 60 examples from data/test.csv

Processing examples...
100%|████████████████████████████| 60/60 [01:25<00:00,  1.42s/it]

✓ Saved results to results/submission.csv

============================================================
SUMMARY
============================================================
Total processed: 60
Consistent (1): 37
Inconsistent (0): 23
Average confidence: 83.33%
```

## Step 4: Verify Submission File

Let's check the first few predictions:

In [None]:
import pandas as pd

# Load submission file
submission = pd.read_csv('results/submission.csv')

print(f"Total predictions: {len(submission)}")
print(f"\nClass distribution:")
print(submission['prediction'].value_counts())
print(f"\nAverage confidence: {submission['confidence'].mean():.2f}%")
print(f"\nFirst 10 predictions:")
display(submission.head(10))
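
Beyond inspecting the output by eye, a few programmatic checks can catch a malformed file before submission. The sketch below assumes the `prediction` column used above holds 0/1 labels and that the file should contain 60 rows; `check_submission` and the `id` column in the sample data are hypothetical and not part of the repository.

```python
import csv
import io

def check_submission(rows, expected_rows=60):
    """Return a list of problems found in an iterable of CSV row dicts."""
    problems = []
    rows = list(rows)
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    # csv.DictReader yields strings, so compare against "0" and "1".
    bad = [i for i, row in enumerate(rows) if row.get("prediction") not in {"0", "1"}]
    if bad:
        problems.append(f"non-binary prediction at rows {bad}")
    return problems

# Tiny in-memory example. For the real file, pass
# csv.DictReader(open("results/submission.csv", newline="")) instead.
sample = io.StringIO("id,prediction\n1,1\n2,0\n3,2\n")
problems = check_submission(csv.DictReader(sample), expected_rows=3)
print(problems)
```

An empty list means the file passed both checks.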

## Step 5: Download Submission File

In [None]:
from google.colab import files

# Download submission file
files.download('results/submission.csv')

print("✓ Submission file downloaded successfully!")
print("Ready to submit to competition.")

## Optional: Evaluate on Training Data

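To see accuracy metrics, run predictions on the training data, which includes labels. Because the classifier was trained on these same examples, the metrics are optimistic; the cross-validation accuracies from Step 2 (0.438 to 0.625) are a more realistic estimate of performance on unseen data: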

In [None]:
# Evaluate on training data
!python main.py --test data/train.csv --output results/train_predictions.csv

### Expected Metrics:

```
============================================================
EVALUATION METRICS
============================================================
Accuracy: 0.787
Precision: 0.804
Recall: 0.882
F1: 0.841
============================================================
```
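
The four metrics are related through the confusion-matrix counts, which the notebook does not print. As a worked example, the counts below are an inference rather than reported output: taking "consistent" (1) as the positive class and the 51/29 class split from Step 2, TP=45, FP=11, FN=6, TN=18 is a set of counts that agrees with all four reported values up to rounding.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Reconstructed counts (an assumption consistent with the reported metrics):
# 45 + 6 = 51 consistent examples, 11 + 18 = 29 inconsistent examples.
metrics = classification_metrics(tp=45, fp=11, fn=6, tn=18)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```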

## Optional: Fast ML Baseline (2 minutes)

If you want a quick baseline without the full pipeline:

In [None]:
# Fast TF-IDF + count features baseline
!python fast_ml_submit.py

---

## 📊 System Overview

### Pipeline Architecture

```
Input (Backstory) → Claim Extraction → Constraint Graph → Evidence Retrieval
                                                                ↓
Output (0/1) ← ML Classifier ← Feature Extraction ← Conflict Detection
                                                     (Temporal + Causal)
```

### ML Features (20 dimensions)

1. **Core Scores**
   - Inconsistency Score
   - Avg/Max Inconsistency from components

2. **Conflict Metrics**
   - Num Temporal/Causal Conflicts
   - Avg/Max Severities
   - Total Conflicts

3. **Evidence Metrics**
   - Evidence Coverage
   - Avg Evidence Quality

4. **Interaction Terms**
   - Score × Claims
   - Score × Conflicts
   - Claims × Conflicts
   - Temporal × Causal
   - Evidence × Score

5. **Binary Flags**
   - Has Temporal Conflicts
   - Has Causal Conflicts
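
The interaction terms and binary flags are simple functions of the base features. A minimal sketch, assuming the base values are available in a dictionary: `inconsistency_score`, `num_claims`, `num_temporal`, `score_x_claims`, and `score_x_conflicts` appear in the feature-importance output from Step 2, while the other key names and the definition of total conflicts as temporal plus causal are assumptions made for illustration.

```python
def add_derived_features(base):
    """Return a copy of `base` extended with interaction terms and binary flags."""
    score = base["inconsistency_score"]
    claims = base["num_claims"]
    temporal = base["num_temporal"]
    causal = base["num_causal"]            # hypothetical key name
    coverage = base["evidence_coverage"]   # hypothetical key name
    conflicts = temporal + causal          # assumed definition of total conflicts

    derived = dict(base)
    derived.update({
        "score_x_claims": score * claims,
        "score_x_conflicts": score * conflicts,
        "claims_x_conflicts": claims * conflicts,   # hypothetical key name
        "temporal_x_causal": temporal * causal,     # hypothetical key name
        "evidence_x_score": coverage * score,       # hypothetical key name
        "has_temporal": int(temporal > 0),          # hypothetical key name
        "has_causal": int(causal > 0),              # hypothetical key name
    })
    return derived

example = add_derived_features({
    "inconsistency_score": 0.5,
    "num_claims": 4,
    "num_temporal": 2,
    "num_causal": 0,
    "evidence_coverage": 0.75,
})
print(example["score_x_claims"], example["has_temporal"], example["has_causal"])
```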

### Model Ensemble

- **Random Forest** (100 trees)
- **Gradient Boosting** (100 estimators)
- **MLP Neural Network** (64→32→16 layers) ← Best performer
- **Logistic Regression**

**Voting:** Majority vote across all 4 models
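
With four voters, a 2-2 tie is possible, and the notebook does not state how the ensemble resolves it. The sketch below shows one way to implement a majority vote with an explicit tie rule; `majority_vote` and its `tie_value` default are hypothetical and may not match the repository's behavior.

```python
from collections import Counter

def majority_vote(votes, tie_value=1):
    """Return the most common label, or `tie_value` if the top two labels tie."""
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_value
    return counts[0][0]

print(majority_vote([1, 1, 0, 1]))               # three of four models say 1
print(majority_vote([1, 0, 0, 1], tie_value=0))  # 2-2 tie, falls back to tie_value
```

Making the tie rule a parameter keeps the choice visible: defaulting ties to 1 favors the majority class, while defaulting to 0 flags more examples as inconsistent.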

---

## 🔧 Troubleshooting

### Common Issues

1. **ModuleNotFoundError: No module named 'pathway'**
   - Solution: Re-run the pip install cell

2. **FileNotFoundError: data/train.csv**
   - Solution: Ensure you're in the KDSH directory (`%cd KDSH`)

3. **Memory Error**
   - Solution: Restart runtime and re-run from Step 1

4. **Predictions are all 0 or all 1**
   - Solution: Ensure ML classifier was trained (Step 2 completed)
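
One quick way to confirm that Step 2 produced a model is to check for the file it saves, `results/ml_classifier.pkl`. The helper below is a hypothetical convenience check, not part of the repository; it only tests that the file exists and is non-empty, not that its contents are valid.

```python
from pathlib import Path

def classifier_is_trained(path="results/ml_classifier.pkl"):
    """True if the saved-model file exists and is not empty."""
    p = Path(path)
    return p.is_file() and p.stat().st_size > 0

if not classifier_is_trained():
    print("No trained classifier found; run Step 2 (train_ml.py) first.")
```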

---

## 📚 Additional Resources

- **GitHub:** [ishansurdi/KDSH](https://github.com/ishansurdi/KDSH)
- **README:** Detailed documentation and API reference
- **Issues:** [Report bugs](https://github.com/ishansurdi/KDSH/issues)

---

**Author:** Ishan Surdi  
**License:** MIT  
**Year:** 2026