# Conflict Detection and Resolution

## Overview

In real-world Knowledge Graph construction, data often comes from multiple heterogeneous sources (databases, APIs, files, streams). These sources may provide conflicting information about the same entities or relationships. 

The **Semantica Conflict Detection and Resolution** module (`semantica.conflicts`) provides a suite of tools for identifying, analyzing, and resolving these discrepancies to ensure high data quality and trust.

**Key Capabilities:**
- **Conflict Detection**: Identify value mismatches, type inconsistencies, and temporal contradictions.
- **Source Tracking**: Trace every piece of data back to its origin with credibility scoring.
- **Resolution Strategies**: Apply automated strategies like voting, credibility weighting, or recency.
- **Investigation Guides**: Generate human-readable guides for complex conflicts requiring manual review.

## Installation

Ensure Semantica is installed:

```bash
pip install semantica
```

```python
from semantica.conflicts import (
    ConflictDetector,
    ConflictResolver,
    SourceTracker,
    ConflictAnalyzer,
    InvestigationGuideGenerator,
    ResolutionStrategy
)
import json
from datetime import datetime
```

## Step 1: Simulating Multi-Source Data

Let's simulate a scenario where we receive data about the same person from three different sources: an HR database, a LinkedIn scrape, and a public directory. Note the discrepancies in `name`, `birth_date`, and `department`; the later steps check `birth_date` and `department`.

```python
# Define simulated data sources
sources = {
    "hr_db": {"credibility": 0.95, "type": "internal_database"},
    "linkedin_scrape": {"credibility": 0.60, "type": "web_scrape"},
    "public_dir": {"credibility": 0.40, "type": "public_api"}
}

# Define entities from these sources
entity_records = [
    {
        "id": "emp_001",
        "name": "John Doe",
        "birth_date": "1980-05-15",
        "department": "Engineering",
        "source": "hr_db",
        "timestamp": "2023-01-01T10:00:00"
    },
    {
        "id": "emp_001",
        "name": "Jonathan Doe",
        "birth_date": "1980-05-15",
        "department": "Software Engineering",
        "source": "linkedin_scrape",
        "timestamp": "2023-06-15T14:30:00"
    },
    {
        "id": "emp_001",
        "name": "John Doe",
        "birth_date": "1982-05-15",  # Conflict!
        "department": "Engineering",
        "source": "public_dir",
        "timestamp": "2022-12-01T09:00:00"
    }
]

print(f"Loaded {len(entity_records)} records for Employee 001")
```

## Step 2: Tracking Sources

Before detecting conflicts, we register our sources with the `SourceTracker`. This allows the system to factor in source credibility during resolution.

```python
source_tracker = SourceTracker()

# Register sources with their metadata and credibility scores
for source_id, metadata in sources.items():
    source_tracker.register_source(
        source_id=source_id,
        source_type=metadata["type"],
        credibility_score=metadata["credibility"]
    )

print("Sources registered successfully.")
```
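To make the idea of tracing a value back to its origin concrete, here is a minimal sketch in plain Python. It does not use or mirror `SourceTracker` internals: `build_provenance` and the trimmed-down `records` list are hypothetical names introduced only for illustration.

```python
from collections import defaultdict

# Hypothetical, trimmed-down copies of the tutorial records.
records = [
    {"id": "emp_001", "birth_date": "1980-05-15", "source": "hr_db"},
    {"id": "emp_001", "birth_date": "1980-05-15", "source": "linkedin_scrape"},
    {"id": "emp_001", "birth_date": "1982-05-15", "source": "public_dir"},
]
credibility = {"hr_db": 0.95, "linkedin_scrape": 0.60, "public_dir": 0.40}


def build_provenance(records, prop):
    """Map (entity_id, value) to the list of sources that asserted that value."""
    index = defaultdict(list)
    for rec in records:
        if prop in rec:
            index[(rec["id"], rec[prop])].append(rec["source"])
    return dict(index)


provenance = build_provenance(records, "birth_date")
for (entity_id, value), srcs in provenance.items():
    # Pick the most credible source backing this particular value.
    best = max(srcs, key=lambda s: credibility[s])
    print(f"{entity_id} birth_date={value}: sources={srcs}, most credible={best}")
```

An index like this answers "who said that?" for any value, which is the information a credibility-based strategy needs later on.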

## Step 3: Detecting Conflicts

Now we use `ConflictDetector` to identify discrepancies. We'll check for value conflicts in `birth_date` and `department`.

The detector compares values across all records for the same entity ID.

```python
detector = ConflictDetector()

# Detect conflicts for specific properties
conflicts = []

# Check birth_date
dob_conflicts = detector.detect_value_conflicts(entity_records, "birth_date")
conflicts.extend(dob_conflicts)

# Check department
dept_conflicts = detector.detect_value_conflicts(entity_records, "department")
conflicts.extend(dept_conflicts)

print(f"Detected {len(conflicts)} conflicts:")
for conflict in conflicts:
    print(f"- {conflict.conflict_type.value}: {conflict.property_name} for {conflict.entity_id}")
    print(f"  Values: {conflict.conflicting_values}")
    print(f"  Severity: {conflict.severity}")
    print("---")
```
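The comparison "across all records for the same entity ID" can be sketched in plain Python as follows. This is an illustration of the general technique under simple assumptions (exact string equality, no normalization), not the library's implementation; `find_value_conflicts` and the sample `records` are hypothetical.

```python
from collections import defaultdict


def find_value_conflicts(records, prop):
    """Return {entity_id: {value: [sources]}} for entities with >1 distinct value."""
    by_entity = defaultdict(lambda: defaultdict(list))
    for rec in records:
        if prop in rec:
            by_entity[rec["id"]][rec[prop]].append(rec["source"])
    # Keep only entities where the sources disagree.
    return {eid: dict(vals) for eid, vals in by_entity.items() if len(vals) > 1}


# Hypothetical sample: emp_001 has disagreeing sources, emp_002 does not.
records = [
    {"id": "emp_001", "department": "Engineering", "source": "hr_db"},
    {"id": "emp_001", "department": "Software Engineering", "source": "linkedin_scrape"},
    {"id": "emp_001", "department": "Engineering", "source": "public_dir"},
    {"id": "emp_002", "department": "Sales", "source": "hr_db"},
]

department_conflicts = find_value_conflicts(records, "department")
for entity_id, values in department_conflicts.items():
    print(entity_id, values)
```

Exact matching treats "Engineering" and "Software Engineering" as different values; whether that counts as a real conflict or a naming variant is a policy decision that a production detector would have to encode.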

## Step 4: Analyzing Patterns

The `ConflictAnalyzer` can help identify systemic issues, such as a specific source consistently contradicting others.

```python
analyzer = ConflictAnalyzer()
analysis = analyzer.analyze_conflicts(conflicts)

print("Conflict Analysis Summary:")
print(f"Total Conflicts: {analysis['total_conflicts']}")
print(f"By Type: {analysis['by_type']}")
print(f"By Severity: {analysis['by_severity']}")
```
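One way to spot a source that consistently contradicts the others is to count how often each source disagrees with the majority value. The sketch below is a plain-Python illustration of that idea, not what `ConflictAnalyzer` does internally; `dissent_by_source` and the `{source: value}` conflict representation are assumptions made for this example.

```python
from collections import Counter


def dissent_by_source(conflicts):
    """Count, per source, how often its value differed from the most common value.

    Each conflict is a {source: value} mapping. If two values tie for most
    common, the one seen first is treated as the majority.
    """
    dissent = Counter()
    for values_by_source in conflicts:
        majority_value, _ = Counter(values_by_source.values()).most_common(1)[0]
        for source, value in values_by_source.items():
            if value != majority_value:
                dissent[source] += 1
    return dissent


# The two conflicts from the tutorial data, rewritten as {source: value}.
sample_conflicts = [
    {"hr_db": "1980-05-15", "linkedin_scrape": "1980-05-15", "public_dir": "1982-05-15"},
    {"hr_db": "Engineering", "linkedin_scrape": "Software Engineering", "public_dir": "Engineering"},
]

dissent = dissent_by_source(sample_conflicts)
for source, count in dissent.most_common():
    print(f"{source}: disagreed with the majority {count} time(s)")
```

With only two conflicts the counts are not meaningful on their own; the pattern becomes informative once it is accumulated over many entities.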

## Step 5: Resolving Conflicts

We can resolve conflicts using different strategies. 

### Strategy A: Voting
Uses the most frequent value. Useful when you have many sources of equal standing.

### Strategy B: Credibility Weighted
Prefers values from trusted sources (like our HR DB) over lower-trust sources (public directory).

```python
resolver = ConflictResolver()

# Link the source tracker to the resolver so credibility-based strategies can look up scores
resolver.set_source_tracker(source_tracker)

print("--- Resolution: Voting ---")
voting_results = resolver.resolve_conflicts(conflicts, strategy=ResolutionStrategy.VOTING)
for res in voting_results:
    print(f"Property: {res.metadata.get('property_name')}")
    print(f"Resolved Value: {res.resolved_value}")
    print(f"Confidence: {res.confidence:.2f}")

print("\n--- Resolution: Credibility Weighted ---")
# This should favor the HR DB value for birth_date
credibility_results = resolver.resolve_conflicts(conflicts, strategy=ResolutionStrategy.CREDIBILITY_WEIGHTED)
for res in credibility_results:
    print(f"Property: {res.metadata.get('property_name')}")
    print(f"Resolved Value: {res.resolved_value}")
    print(f"Confidence: {res.confidence:.2f}")
```
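To show how the strategies differ, here is a plain-Python sketch of voting, credibility weighting, and the recency strategy mentioned in the overview, applied to the `department` claims. The function names and the confidence formulas (share of votes, share of total credibility) are assumptions chosen for illustration; the library may compute its results differently.

```python
from collections import Counter, defaultdict
from datetime import datetime

credibility = {"hr_db": 0.95, "linkedin_scrape": 0.60, "public_dir": 0.40}

# (value, source, timestamp) claims for the department of emp_001.
claims = [
    ("Engineering", "hr_db", "2023-01-01T10:00:00"),
    ("Software Engineering", "linkedin_scrape", "2023-06-15T14:30:00"),
    ("Engineering", "public_dir", "2022-12-01T09:00:00"),
]


def resolve_by_voting(claims):
    """Pick the most frequent value; confidence is its share of the votes."""
    counts = Counter(value for value, _, _ in claims)
    value, votes = counts.most_common(1)[0]
    return value, votes / len(claims)


def resolve_by_credibility(claims, credibility):
    """Sum source credibility per value; confidence is the winner's share."""
    weights = defaultdict(float)
    for value, source, _ in claims:
        weights[value] += credibility[source]
    value = max(weights, key=weights.get)
    return value, weights[value] / sum(weights.values())


def resolve_by_recency(claims):
    """Pick the value from the claim with the latest timestamp."""
    value, _, _ = max(claims, key=lambda c: datetime.fromisoformat(c[2]))
    return value


voted_value, voted_conf = resolve_by_voting(claims)
weighted_value, weighted_conf = resolve_by_credibility(claims, credibility)
recent_value = resolve_by_recency(claims)

print(f"Voting:      {voted_value} ({voted_conf:.2f})")
print(f"Credibility: {weighted_value} ({weighted_conf:.2f})")
print(f"Recency:     {recent_value}")
```

On this data, voting and credibility weighting both select `Engineering`, while recency selects `Software Engineering` because the LinkedIn record is the newest. That divergence is why the choice of strategy should follow from how much you trust each source versus how quickly the underlying fact changes.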

## Step 6: Generating Investigation Guides

For critical conflicts or those with low resolution confidence, manual intervention is needed. The `InvestigationGuideGenerator` creates a structured guide for human analysts.

```python
guide_generator = InvestigationGuideGenerator()

# Generate a guide for the first conflict (e.g., birth_date)
guide = guide_generator.generate_guide(conflicts[0])

print(f"Investigation Guide for {guide.conflict_id}:")
print(f"Title: {guide.title}")
print("Steps:")
for i, step in enumerate(guide.steps, 1):
    print(f"{i}. {step.description} (Action: {step.action_type})")

print("\nRecommended Checks:")
for check in guide.checklist:
    print(f"[ ] {check}")
```
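Deciding which conflicts go to a human can be expressed as a simple triage rule over severity and resolution confidence. The sketch below is hypothetical: the `Resolution` dataclass, the `"critical"` severity label, the 0.75 threshold, and the sample values are all assumptions for illustration rather than library types or defaults.

```python
from dataclasses import dataclass


@dataclass
class Resolution:
    """Hypothetical summary of one resolved conflict."""
    property_name: str
    resolved_value: str
    confidence: float
    severity: str


def needs_manual_review(res, min_confidence=0.75):
    """Flag critical conflicts and any resolution below the confidence threshold."""
    return res.severity == "critical" or res.confidence < min_confidence


# Illustrative values only.
resolutions = [
    Resolution("birth_date", "1980-05-15", 0.79, "high"),
    Resolution("department", "Engineering", 0.69, "medium"),
    Resolution("tax_id", "123-45-6789", 0.95, "critical"),
]

review_queue = [r.property_name for r in resolutions if needs_manual_review(r)]
auto_accepted = [r.property_name for r in resolutions if not needs_manual_review(r)]

print("Manual review:", review_queue)
print("Auto-accepted:", auto_accepted)
```

Only the items in the review queue would then be passed to a guide generator, which keeps analyst time focused on the cases automation cannot settle.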

## Conclusion

In this notebook, we explored how to:
1.  **Track** data provenance and source credibility.
2.  **Detect** conflicts in multi-source data.
3.  **Analyze** conflict patterns by type and severity.
4.  **Resolve** conflicts using automated strategies tailored to your data governance needs.
5.  **Investigate** complex issues with generated guides.

By integrating these steps into your pipeline, you help keep your Knowledge Graph accurate, consistent, and trustworthy.