# TF-IDF Entity Resolution Testing Notebook

This notebook breaks down the `ex_post_tfidf_resolver.py` script into testable blocks for step-by-step debugging and execution.

The TF-IDF resolver merges duplicate entities in Neo4j using TF-IDF vectorization and cosine similarity matching.
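
To make the scoring step concrete, here is a minimal pure-Python sketch of TF-IDF weighting followed by cosine similarity. It is not the scikit-learn code the notebook uses: the whitespace tokenizer, the helper names `tfidf_vectors` and `cosine`, and the sample entity names are all assumptions made for illustration. Only the smoothed IDF term follows the formula scikit-learn documents as its default.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalised)."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two L2-normalised sparse vectors (a plain dot product)."""
    return sum(weight * b.get(term, 0.0) for term, weight in a.items())


# Hypothetical entity names, chosen only to show a near-duplicate and an unrelated entity.
names = [
    "united nations security council",
    "un security council",
    "world health organization",
]
vecs = tfidf_vectors(names)
sim_close = cosine(vecs[0], vecs[1])  # shares "security" and "council"
sim_far = cosine(vecs[0], vecs[2])    # shares no tokens
print(f"near-duplicate: {sim_close:.3f}, unrelated: {sim_far:.3f}")
```

Names that share most of their tokens score close to 1, while names with no tokens in common score exactly 0, which is why a threshold on this score can separate likely duplicates from distinct entities.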

## 1. Environment Setup

Import required libraries and configure the environment for TF-IDF entity resolution.

In [None]:
# Standard library imports
import sys
import os
import asyncio
import json
from pathlib import Path
from typing import List, Dict, Tuple

# Third-party imports
from dotenv import load_dotenv
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import neo4j

# Configure paths for notebook environment
script_dir = Path.cwd().parent  # Since we're in example_notebooks/
graphrag_pipeline_dir = script_dir

# Add to Python path for imports
if str(graphrag_pipeline_dir) not in sys.path:
    sys.path.append(str(graphrag_pipeline_dir))

print("✅ Environment Setup Complete")
print(f"📁 GraphRAG pipeline directory: {graphrag_pipeline_dir}")
print(f"📁 Current working directory: {Path.cwd()}")
print("📚 All required libraries imported successfully")

## 2. Enhanced TFIDFMatchResolver Class

Define the entity resolution class. In analysis mode it reports candidate pairs with entity names and types, so results can be reviewed before any merge is executed.

In [None]:
class TFIDFMatchResolver:
    """
    Enhanced TF-IDF Entity Resolution for Neo4j Knowledge Graphs
    
    This class performs entity resolution by:
    1. Fetching candidate nodes based on filter criteria
    2. Computing TF-IDF vectors from specified text properties
    3. Finding pairs with cosine similarity above threshold
    4. Providing detailed analysis before merging
    5. Optionally executing merges with user confirmation
    
    Key Features:
    - Enhanced analysis with entity names and types
    - Separation of analysis and merging phases
    - Robust error handling and logging
    - Support for different Neo4j ID formats
    """
    
    def __init__(
        self,
        driver: neo4j.AsyncDriver,
        filter_query: str | None,
        resolve_properties: List[str],
        similarity_threshold: float = 0.9,
        neo4j_database: str = "neo4j",
    ) -> None:
        """
        Initialize the TF-IDF entity resolver.
        
        Parameters:
        -----------
        driver : neo4j.AsyncDriver
            Neo4j async driver instance
        filter_query : str | None
            Cypher query that returns nodes as 'n'. If None, matches all nodes.
        resolve_properties : List[str]
            Node properties to concatenate for TF-IDF analysis
        similarity_threshold : float
            Cosine similarity threshold (0, 1] for merging decisions
        neo4j_database : str
            Neo4j database name (default: "neo4j")
        """
        self.driver = driver
        self.filter_query = filter_query
        self.resolve_properties = resolve_properties
        self.similarity_threshold = similarity_threshold
        self.db = neo4j_database
        
        print(f"✅ TFIDFMatchResolver initialized")
        print(f"   📊 Similarity threshold: {similarity_threshold}")
        print(f"   📋 Properties to resolve: {resolve_properties}")
        print(f"   🗃️  Database: {neo4j_database}")

print("✅ TFIDFMatchResolver class defined")

### 2.1 Core Methods Implementation

Implement all methods for the enhanced TFIDFMatchResolver class with improved error handling and detailed analysis capabilities.
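
The cell below attaches each method to the class after the class has been defined, which lets every method live in its own testable block. A minimal sketch of that pattern, using a hypothetical `Resolver` stand-in rather than the real `TFIDFMatchResolver`:

```python
import asyncio


class Resolver:
    """Hypothetical stand-in for TFIDFMatchResolver."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold


async def run(self, analyze_only: bool = False) -> dict:
    # 'self' is bound automatically once the function is stored on the class.
    await asyncio.sleep(0)  # placeholder for real async work
    return {"threshold": self.threshold, "analyze_only": analyze_only}


# Attach the coroutine function as a method after the class definition.
Resolver.run = run

result = asyncio.run(Resolver(0.9).run(analyze_only=True))
print(result)
```

Inside a notebook an event loop is usually already running, so the call would be `await Resolver(0.9).run(analyze_only=True)` rather than `asyncio.run(...)`.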

In [None]:
# Add methods to the TFIDFMatchResolver class
def add_run_method():
    async def run(self, analyze_only=False) -> Dict:
        """
        Execute TF-IDF entity resolution with enhanced analysis.
        
        Parameters:
        -----------
        analyze_only : bool
            If True, performs analysis without merging. Returns detailed pair information.
            
        Returns:
        --------
        Dict containing analysis results and optionally detailed pair data
        """
        print(f"🚀 Starting TF-IDF entity resolution (analyze_only={analyze_only})")
        
        # Step 1: Fetch candidate nodes
        nodes = await self._fetch_nodes()
        if not nodes:
            print("⚠️  No nodes found for analysis")
            return {"nodes_examined": 0, "pairs_above_threshold": 0, "merged_pairs": 0, "pairs_data": []}

        # Step 2: Compute TF-IDF vectors
        ids, docs = zip(*nodes)
        print(f"🔤 Computing TF-IDF vectors for {len(docs)} documents...")
        tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
        sims = cosine_similarity(tfidf)

        # Step 3: Find similarity pairs above threshold
        pairs_to_merge = []
        pairs_with_scores = []
        n = len(ids)
        
        print(f"🔍 Finding similarity pairs (threshold: {self.similarity_threshold})")
        for i in range(n - 1):
            for j in range(i + 1, n):
                score = sims[i, j]
                if score >= self.similarity_threshold:
                    pairs_to_merge.append((ids[i], ids[j]))
                    pairs_with_scores.append((ids[i], ids[j], score))

        print(f"📊 Found {len(pairs_to_merge)} pairs above threshold")

        # Step 4: Handle analysis vs merging
        if analyze_only:
            print("🔍 Analysis mode: Preparing detailed results...")
            merged_count = 0
            detailed_pairs_data = []
            
            if pairs_with_scores:
                # Fetch node details for enhanced analysis
                # Collect each node ID once (avoid shadowing the built-in id())
                unique_node_ids = list({node_id for id1, id2, _ in pairs_with_scores for node_id in (id1, id2)})
                node_details = await self._fetch_node_details(unique_node_ids)
                
                for id1, id2, score in pairs_with_scores:
                    node1_info = node_details.get(id1, {"name": "N/A", "type": "N/A"})
                    node2_info = node_details.get(id2, {"name": "N/A", "type": "N/A"})
                    
                    detailed_pairs_data.append({
                        "id1": id1, "id2": id2, "score": score,
                        "node1_name": node1_info.get("name", "N/A"),
                        "node1_type": node1_info.get("type", "N/A"),
                        "node2_name": node2_info.get("name", "N/A"),
                        "node2_type": node2_info.get("type", "N/A")
                    })
        else:
            print("🔄 Merge mode: Executing merges...")
            merged_count = await self._merge_pairs(pairs_to_merge)
            detailed_pairs_data = pairs_with_scores

        # Return results
        result = {
            "nodes_examined": n,
            "pairs_above_threshold": len(pairs_to_merge),
            "merged_pairs": merged_count,
        }
        
        if analyze_only:
            result["pairs_data"] = detailed_pairs_data
        
        return result
    
    TFIDFMatchResolver.run = run

# Node fetching with robust ID handling
def add_fetch_nodes_method():
    async def _fetch_nodes(self) -> List[Tuple[str, str]]:
        """Fetch nodes and concatenate text properties for TF-IDF analysis."""
        query = self.filter_query or "MATCH (n) RETURN n"
        
        async with self.driver.session(database=self.db) as session:
            try:
                result_cursor = await session.run(query)  # type: ignore
                records = await result_cursor.data()
            except Exception as e:
                print(f"❌ Query execution failed: {e}")
                return []
                    
            processed_nodes = []
            for rec in records:
                try:
                    node = rec["n"]
                    
                    # Extract node ID (handle different formats)
                    if isinstance(node, dict):
                        node_id = node.get('element_id') or node.get('id') or node.get('_id')
                    else:
                        node_id = getattr(node, 'element_id', None) or getattr(node, 'id', None)
                    
                    if not node_id:
                        continue
                        
                    # Extract and concatenate text properties
                    text_parts = []
                    for prop in self.resolve_properties:
                        # Treat missing or null properties as empty text
                        value = node.get(prop)
                        text_parts.append("" if value is None else str(value).strip())
                    
                    combined_text = " ".join(text_parts).lower()
                    if combined_text.strip():
                        processed_nodes.append((node_id, combined_text))
                        
                except Exception as e:
                    print(f"⚠️  Skipping problematic node: {e}")
                    continue
                    
            print(f"✅ Successfully processed {len(processed_nodes)} nodes")
            return processed_nodes
    
    TFIDFMatchResolver._fetch_nodes = _fetch_nodes

# Node details fetching for enhanced analysis
def add_fetch_node_details_method():
    async def _fetch_node_details(self, node_ids: List[str]) -> Dict[str, Dict[str, str]]:
        """Fetch detailed node information for analysis."""
        if not node_ids:
            return {}
            
        unique_ids = list(dict.fromkeys(node_ids))
        query = self.filter_query or "MATCH (n) RETURN n"
        
        async with self.driver.session(database=self.db) as session:
            result_cursor = await session.run(query)  # type: ignore
            records = await result_cursor.data()
            
            # Build ID to node mapping
            id_to_node = {}
            for rec in records:
                node = rec["n"]
                if isinstance(node, dict):
                    node_id = node.get('element_id') or node.get('id') or node.get('_id')
                else:
                    node_id = getattr(node, 'element_id', None) or getattr(node, 'id', None)
                
                if node_id:
                    id_to_node[node_id] = node
            
            # Extract details for requested IDs
            node_details = {}
            for requested_id in unique_ids:
                if requested_id in id_to_node:
                    node = id_to_node[requested_id]
                    # Fall back to "title", then "N/A", when a property is missing or empty
                    name = node.get("name") or node.get("title") or "N/A"
                    node_type = node.get("type") or "N/A"
                    
                    node_details[requested_id] = {"name": str(name), "type": str(node_type)}
                else:
                    node_details[requested_id] = {"name": "N/A", "type": "N/A"}
            
            return node_details
    
    TFIDFMatchResolver._fetch_node_details = _fetch_node_details

# Merge execution method
def add_merge_pairs_method():
    async def _merge_pairs(self, pairs: List[Tuple[str, str]]) -> int:
        """Execute node merges using APOC procedures."""
        if not pairs:
            return 0

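        # id() compares against Neo4j's integer internal ID; if the collected IDs are
        # element-ID strings, match with elementId(a) = $id1 instead.
        merge_query = """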
        MATCH (a) WHERE id(a) = $id1
        MATCH (b) WHERE id(b) = $id2
        WITH [a,b] AS nodes
        CALL apoc.refactor.mergeNodes(nodes, {properties:"discard"}) YIELD node
        RETURN id(node) AS kept
        """
        
        successful_merges = 0
        async with self.driver.session(database=self.db) as session:
            for id1, id2 in pairs:
                try:
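                    result = await session.run(merge_query, id1=id1, id2=id2)  # type: ignore
                    # Consume the result so server-side errors surface before the merge is counted
                    await result.consume()
                    successful_merges += 1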
                except Exception as e:
                    print(f"⚠️  Merge failed for {id1} + {id2}: {e}")
                    
        return successful_merges
    
    TFIDFMatchResolver._merge_pairs = _merge_pairs

# Apply all methods to the class
add_run_method()
add_fetch_nodes_method()
add_fetch_node_details_method()
add_merge_pairs_method()

print("✅ All methods added to TFIDFMatchResolver")
print("✅ Class ready for entity resolution analysis and merging")
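
`_merge_pairs` processes each pair independently. When pairs overlap (A matches B, and B matches C), the second merge may refer to a node that the first merge has already removed. One possible way to handle this, sketched here as an assumption rather than as part of the original script, is to group pairs into connected clusters first and merge each cluster once. The helper name `group_pairs` is hypothetical.

```python
from collections import defaultdict


def group_pairs(pairs):
    """Group pairwise matches into connected clusters using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(list)
    for node in parent:
        clusters[find(node)].append(node)
    # Sort for a deterministic result
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical node IDs: a-b and b-c overlap, x-y is separate.
clusters = group_pairs([("a", "b"), ("b", "c"), ("x", "y")])
print(clusters)
```

Each resulting cluster could then be passed to a single merge call instead of issuing one call per pair.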

## 3. Configuration and Connection Testing

Load configuration files, validate environment variables, and test Neo4j connectivity before running the analysis.
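
The setup cell reads the resolver settings from `entity_resolution_config.TFIDFMatchResolver_config` in `kg_building_config.json`. As a hedged sketch, the settings could be validated before any driver is created. The key names below are taken from this notebook, but the helper `validate_tfidf_config`, its validation rules, and the example values are assumptions for illustration.

```python
def validate_tfidf_config(build_config: dict) -> dict:
    """Return a normalised copy of the TF-IDF resolver settings, or raise ValueError."""
    try:
        cfg = build_config["entity_resolution_config"]["TFIDFMatchResolver_config"]
    except KeyError as exc:
        raise ValueError(f"Missing configuration section: {exc}") from exc

    properties = cfg.get("resolve_properties")
    if not isinstance(properties, list) or not properties:
        raise ValueError("resolve_properties must be a non-empty list")

    threshold = float(cfg.get("similarity_threshold", 0.75))
    if not 0.0 < threshold <= 1.0:
        raise ValueError("similarity_threshold must lie in (0, 1]")

    return {
        "filter_query": cfg.get("filter_query"),  # None means "all nodes"
        "resolve_properties": properties,
        "similarity_threshold": threshold,
        "neo4j_database": cfg.get("neo4j_database", "neo4j"),
    }


# Illustrative values only, not taken from a real config file.
example_config = {
    "entity_resolution_config": {
        "TFIDFMatchResolver_config": {
            "filter_query": "MATCH (n:Actor) RETURN n",
            "resolve_properties": ["name"],
            "similarity_threshold": 0.9,
        }
    }
}
validated = validate_tfidf_config(example_config)
print(validated)
```

Failing early on a malformed threshold or an empty property list avoids opening a database connection for a run that cannot produce meaningful pairs.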


In [None]:
async def setup_and_validate_config():
    """Load configuration and validate environment setup."""
    print("🔧 Loading configuration and validating environment...")
    
    # Load configuration files
    config_files_path = graphrag_pipeline_dir / "config_files"
    env_file = config_files_path / ".env"
    config_file = config_files_path / "kg_building_config.json"
    
    # Load environment variables
    if env_file.exists():
        load_dotenv(env_file, override=True)
        print("✅ Environment file loaded")
    else:
        print("⚠️  .env file not found")
        
    # Load configuration
    if not config_file.exists():
        raise FileNotFoundError(f"Config file not found: {config_file}")
        
    with open(config_file) as f:
        build_config = json.load(f)
    print("✅ Configuration file loaded")
    
    # Extract TF-IDF configuration
    try:
        tfidf_config = build_config["entity_resolution_config"]["TFIDFMatchResolver_config"]
        print("✅ TF-IDF configuration found")
        print(f"   📋 Properties: {tfidf_config.get('resolve_properties', [])}")
        print(f"   📊 Threshold: {tfidf_config.get('similarity_threshold', 0.75)}")
        print(f"   🔍 Filter: {(tfidf_config.get('filter_query') or 'Default (all nodes)')[:60]}...")
    except KeyError as e:
        raise ValueError(f"Missing configuration section: {e}")
    
    # Validate environment variables
    neo4j_uri = os.getenv("NEO4J_URI")
    neo4j_username = os.getenv("NEO4J_USERNAME")
    neo4j_password = os.getenv("NEO4J_PASSWORD")
    
    missing_vars = []
    if not neo4j_uri: missing_vars.append("NEO4J_URI")
    if not neo4j_username: missing_vars.append("NEO4J_USERNAME")
    if not neo4j_password: missing_vars.append("NEO4J_PASSWORD")
    
    if missing_vars:
        raise ValueError(f"Missing environment variables: {', '.join(missing_vars)}")
    
    print("✅ All environment variables validated")
    print(f"   🔗 Neo4j URI: {neo4j_uri}")
    print(f"   👤 Username: {neo4j_username}")
    
    return {
        "config": tfidf_config,
        "neo4j_uri": neo4j_uri,
        "neo4j_username": neo4j_username,
        "neo4j_password": neo4j_password
    }

# Test Neo4j connection
async def test_neo4j_connection(neo4j_uri, username, password):
    """Test Neo4j connectivity and basic operations."""
    print("🔌 Testing Neo4j connection...")
    
    try:
        async with neo4j.AsyncGraphDatabase.driver(
            neo4j_uri, auth=(username, password)
        ) as driver:
            # Test connectivity
            await driver.verify_connectivity()
            print("✅ Neo4j connection verified")
            
            # Test basic query
            async with driver.session() as session:
                result = await session.run("RETURN 1 as test")  # type: ignore
                records = await result.data()
                if records and records[0]["test"] == 1:
                    print("✅ Query execution successful")
                
                # Count nodes (optional)
                try:
                    result = await session.run("MATCH (n) RETURN count(n) as total")  # type: ignore
                    records = await result.data()
                    total_nodes = records[0]["total"] if records else 0
                    print(f"✅ Database contains {total_nodes:,} nodes")
                except Exception:
                    print("⚠️  Could not count nodes (database may be large)")
                    
        return True
        
    except Exception as e:
        print(f"❌ Neo4j connection failed: {e}")
        return False

# Run setup and validation
try:
    setup_result = await setup_and_validate_config()
    connection_ok = await test_neo4j_connection(
        setup_result["neo4j_uri"],
        setup_result["neo4j_username"], 
        setup_result["neo4j_password"]
    )
    
    if connection_ok:
        print("\n🎉 Setup complete! Ready to proceed with entity resolution.")
        tfidf_config = setup_result["config"]
        neo4j_credentials = {
            "uri": setup_result["neo4j_uri"],
            "username": setup_result["neo4j_username"],
            "password": setup_result["neo4j_password"]
        }
    else:
        print("\n❌ Setup failed. Please check your Neo4j configuration.")
        
except Exception as e:
    print(f"\n❌ Setup failed: {e}")
    connection_ok = False

## 4. Similarity Analysis (No Merging)

Identify candidate similarity pairs for merging. This step provides detailed analysis without making any changes to the database.

In [None]:
# Run similarity analysis using the setup from the previous cell
async def run_similarity_analysis():
    """Run TF-IDF similarity analysis without merging."""
    print("🔍 Starting similarity analysis...")
    print("ℹ️  This will not modify your database - analysis only")
    
    async with neo4j.AsyncGraphDatabase.driver(
        neo4j_credentials["uri"],
        auth=(neo4j_credentials["username"], neo4j_credentials["password"])
    ) as driver:
        
        resolver = TFIDFMatchResolver(
            driver,
            filter_query=tfidf_config.get("filter_query"),
            resolve_properties=tfidf_config["resolve_properties"],
            similarity_threshold=tfidf_config.get("similarity_threshold", 0.75),
            neo4j_database=tfidf_config.get("neo4j_database", "neo4j"),
        )
        
        # Run in analysis mode (no merging)
        result = await resolver.run(analyze_only=True)
        
        print(f"\n📊 Analysis Results:")
        print(f"   🎯 Pairs above threshold: {result['pairs_above_threshold']:,}")
        
        if result['pairs_above_threshold'] > 0:
            pairs_data = result['pairs_data']
            pairs_data.sort(key=lambda x: x["score"], reverse=True)
            
            print(f"\n🔍 Top 10 similarity pairs:")
            print("=" * 90)
            
            for i, pair in enumerate(pairs_data[:10]):
                print(f"\n#{i+1:2d} | Similarity: {pair['score']:.4f}")
                print(f"     Node 1: {pair['node1_name']} ({pair['node1_type']}) [ID: {pair['id1']}]")
                print(f"     Node 2: {pair['node2_name']} ({pair['node2_type']}) [ID: {pair['id2']}]")
            
            if len(pairs_data) > 10:
                print(f"\n... and {len(pairs_data) - 10} more pairs")
            
            # Summary by entity types
            type_combinations = {}
            for pair in pairs_data:
                combo = f"{pair['node1_type']} ↔ {pair['node2_type']}"
                type_combinations[combo] = type_combinations.get(combo, 0) + 1
            
            print(f"\n📋 Entity Type Combinations:")
            for combo, count in sorted(type_combinations.items(), key=lambda x: x[1], reverse=True):
                print(f"   {combo}: {count} pairs")
            
            print(f"\n✅ Analysis complete! Review the results above.")
            print(f"💡 If satisfied, proceed to the merge section.")
            
            return pairs_data
        else:
            print(f"\nℹ️  No pairs found above threshold ({resolver.similarity_threshold})")
            print(f"💡 Consider lowering the threshold or adjusting configuration")
            return []

In [None]:
# Manual analysis of similarity pairs found
# Requires an analyze_similarity_pairs() helper, which is not defined in the cells above.
print("🔍 Running detailed analysis of similarity pairs...")

# Get the pairs with default configuration
similarity_pairs, node_details = await analyze_similarity_pairs(max_pairs_to_show=10)

print(f"\n📊 Found {len(similarity_pairs)} similarity pairs")
if similarity_pairs:
    print("\n🔍 Detailed analysis of first 10 pairs:")
    for i, (id1, id2, score) in enumerate(similarity_pairs[:10]):
        print(f"\n--- Pair {i+1}: Similarity {score:.4f} ---")
        if id1 in node_details:
            node1 = node_details[id1]
            print(f"Node 1 ({id1}):")
            print(f"  Name: {node1['properties'].get('name', 'N/A')}")
            print(f"  Type: {node1['properties'].get('type', 'N/A')}")
            print(f"  Labels: {node1.get('labels', ['N/A'])}")
            print(f"  Text: {node1.get('text', 'N/A')[:100]}...")
        
        if id2 in node_details:
            node2 = node_details[id2]
            print(f"Node 2 ({id2}):")
            print(f"  Name: {node2['properties'].get('name', 'N/A')}")
            print(f"  Type: {node2['properties'].get('type', 'N/A')}")
            print(f"  Labels: {node2.get('labels', ['N/A'])}")
            print(f"  Text: {node2.get('text', 'N/A')[:100]}...")
else:
    print("ℹ️ No similarity pairs found with the current configuration")

# Also show the new enhanced format from the main analysis
print("\n" + "="*60)
print("🆕 Enhanced pairs_data format (from main analysis):")
if 'main_analysis_pairs' in locals() and main_analysis_pairs:
    for i, pair_info in enumerate(main_analysis_pairs[:5]):  # Show first 5
        print(f"\nPair {i+1}:")
        print(f"  Similarity: {pair_info['score']:.4f}")
        print(f"  Node 1: {pair_info['node1_name']} ({pair_info['node1_type']}) [ID: {pair_info['id1']}]")
        print(f"  Node 2: {pair_info['node2_name']} ({pair_info['node2_type']}) [ID: {pair_info['id2']}]")
else:
    print("ℹ️ No enhanced pairs_data available. Run the main analysis first.")
In [None]:
# Execute similarity analysis
if 'connection_ok' in locals() and connection_ok:
    print("🚀 Setup is complete! Running similarity analysis...")
    analysis_results = await run_similarity_analysis()
    
    # Store the results for later use
    if analysis_results and len(analysis_results) > 0:
        print(f"\n💾 Stored {len(analysis_results)} similarity pairs for potential merging")
        print("✨ Results stored in 'analysis_results' variable")
        print("\n📋 Quick Summary:")
        print(f"   📊 Total similarity pairs: {len(analysis_results)}")
        if analysis_results:
            avg_similarity = sum(p['score'] for p in analysis_results) / len(analysis_results)
            print(f"   📈 Average similarity: {avg_similarity:.4f}")
            best_pair = max(analysis_results, key=lambda x: x['score'])
            print(f"   🏆 Best match: {best_pair['node1_name']} ↔ {best_pair['node2_name']} ({best_pair['score']:.4f})")
    else:
        print("\n📝 No similarity pairs found with current configuration")
        print("💡 Try lowering the similarity threshold or adjusting the resolve_properties")
        
else:
    print("⚠️  Setup not complete. Please run the configuration cell above first.")
    print("💡 Make sure to execute the 'Configuration and Connection Testing' cell")
    analysis_results = None

### ✅ Setup Complete - Ready for Analysis!

At this point the notebook setup is complete:

1. **Configuration Loading**: ✅ Environment variables and config file loaded
2. **Neo4j Connection**: ✅ Connection verified and tested  
3. **TF-IDF Resolver**: ✅ Class defined with all methods
4. **Analysis Function**: ✅ Similarity analysis ready to run
5. **Alternative Configs**: ✅ Multiple configuration options available

**Next Steps:**
- Review the analysis results above
- Optionally proceed to the merging section if satisfied with results
- Test different configurations in the advanced section

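## 6.1. Alternative TF-IDF Configurations

Define alternative TF-IDF configurations for testing different scenarios and optimizing results.

In [None]: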
# Alternative TF-IDF configurations for testing different scenarios

# Configuration 1: Conservative (high threshold, fewer merges)
conservative_config = {
    "filter_query": "MATCH (n) WHERE NOT 'Document' IN labels(n) AND NOT 'Chunk' IN labels(n) RETURN n",
    "resolve_properties": ["name"],
    "similarity_threshold": 0.95,
    "neo4j_database": "neo4j"
}

# Configuration 2: Moderate (balanced threshold)
moderate_config = {
    "filter_query": "MATCH (n) WHERE NOT 'Document' IN labels(n) AND NOT 'Chunk' IN labels(n) RETURN n",
    "resolve_properties": ["name", "description"],
    "similarity_threshold": 0.85,
    "neo4j_database": "neo4j"
}

# Configuration 3: Aggressive (lower threshold, more merges)
aggressive_config = {
    "filter_query": "MATCH (n) WHERE NOT 'Document' IN labels(n) AND NOT 'Chunk' IN labels(n) RETURN n",
    "resolve_properties": ["name", "description", "type"],
    "similarity_threshold": 0.75,
    "neo4j_database": "neo4j"
}

# Configuration 4: Specific entity types only (e.g., only Actors)
actor_only_config = {
    "filter_query": "MATCH (n:Actor) RETURN n",
    "resolve_properties": ["name", "type"],
    "similarity_threshold": 0.80,
    "neo4j_database": "neo4j"
}

# Configuration 5: Events only (compares name and type; no temporal properties are included)
event_only_config = {
    "filter_query": "MATCH (n:Event) RETURN n",
    "resolve_properties": ["name", "type"],
    "similarity_threshold": 0.85,
    "neo4j_database": "neo4j"
}

print("✅ Alternative TF-IDF configurations defined:")
print("  - conservative_config: High threshold (0.95), name only")
print("  - moderate_config: Balanced threshold (0.85), name + description") 
print("  - aggressive_config: Lower threshold (0.75), name + description + type")
print("  - actor_only_config: Actor entities only (0.80)")
print("  - event_only_config: Event entities only (0.85)")
print("\n💡 These can be used in the advanced configuration testing section")
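
The configurations above differ mainly in their similarity threshold. A small sketch of how the number of candidate pairs changes with the threshold, using made-up scores rather than real analysis output (the helper name `count_pairs_by_threshold` is hypothetical):

```python
def count_pairs_by_threshold(scores, thresholds):
    """Count how many similarity scores meet or exceed each threshold."""
    return {t: sum(1 for s in scores if s >= t) for t in thresholds}


# Made-up similarity scores for illustration only.
scores = [0.97, 0.91, 0.88, 0.82, 0.78, 0.60]
counts = count_pairs_by_threshold(scores, [0.75, 0.85, 0.95])
print(counts)
```

With real data, the scores could come from the `score` field of the `pairs_data` entries returned by an analysis run at a low threshold, so that several stricter thresholds can be compared without querying the database again.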

## 5. Controlled Entity Merging

Execute the actual entity merges with user confirmation. This step permanently modifies the Neo4j database.

In [None]:
# Only run this if all previous tests passed
if 'connection_ok' in locals() and connection_ok:
    print("🚀 Starting TF-IDF entity resolution in ANALYSIS MODE...")
    print("   (This will NOT perform actual merges, only show what would be merged)")
    
    try:
        # Load configuration
        config_files_path = graphrag_pipeline_dir / "config_files"
        load_dotenv(config_files_path / ".env", override=True)
        
        with open(config_files_path / "kg_building_config.json") as f:
            build_config = json.load(f)
        
        neo4j_uri = os.getenv("NEO4J_URI")
        neo4j_username = os.getenv("NEO4J_USERNAME") 
        neo4j_password = os.getenv("NEO4J_PASSWORD")
        
        tfidf_cfg = build_config["entity_resolution_config"]["TFIDFMatchResolver_config"]
        
        async with neo4j.AsyncGraphDatabase.driver(
            neo4j_uri, auth=(neo4j_username, neo4j_password)
        ) as driver:
            
            resolver = TFIDFMatchResolver(
                driver,
                filter_query=tfidf_cfg["filter_query"],
                resolve_properties=tfidf_cfg["resolve_properties"],
                similarity_threshold=tfidf_cfg.get("similarity_threshold", 0.75),
                neo4j_database=tfidf_cfg.get("neo4j_database", "neo4j"),
            )
            
            # Run in analysis mode (no merging)
            result = await resolver.run(analyze_only=True)
            
            print(f"\n📊 Analysis Results:")
            print(f"  Nodes examined: {result['nodes_examined']}")
            print(f"  Pairs above threshold: {result['pairs_above_threshold']}")
            
            if result['pairs_above_threshold'] > 0:
                print(f"\n🔍 Similarity pairs that would be merged:")
                pairs_data = result.get('pairs_data', [])
                
                # Sort by similarity score (highest first) - pairs_data now contains dicts
                pairs_data.sort(key=lambda x: x["score"], reverse=True)
                
                for idx, pair_info in enumerate(pairs_data[:10]):  # Show top 10
                    print(f"  {idx + 1}. Nodes {pair_info['id1']} ↔ {pair_info['id2']} (similarity: {pair_info['score']:.4f})")
                    print(f"     Node 1: {pair_info['node1_name']} ({pair_info['node1_type']})")
                    print(f"     Node 2: {pair_info['node2_name']} ({pair_info['node2_type']})")
                
                if len(pairs_data) > 10:
                    print(f"  ... and {len(pairs_data) - 10} more pairs")
                
                print(f"\n✅ Analysis completed successfully!")
                print(f"💡 To perform actual merges, call execute_entity_merges() after reviewing these pairs")
                
                # Store for later use
                main_analysis_pairs = pairs_data
                
            else:
                print(f"ℹ️  No entities would be merged with current threshold ({resolver.similarity_threshold})")
                print(f"💡 Consider lowering the threshold or using different configuration")
                
    except Exception as exc:
        print(f"❌ Analysis failed: {exc}")
        import traceback
        traceback.print_exc()
else:
    print("⚠️  Skipping execution due to previous test failures.")
    print("   Please ensure Neo4j connection and configuration are working before proceeding.")

# Execute entity merges with confirmation
async def execute_entity_merges(pairs_data=None, confirm=True):
    """
    Execute entity merges with user confirmation.
    
    Parameters:
    -----------
    pairs_data : list, optional
        Pairs data from analysis. If None, uses analysis_results from previous step.
    confirm : bool
        Whether to ask for user confirmation before merging
    """
    # Use previous analysis results if pairs_data not provided
    if pairs_data is None:
        # analysis_results is a notebook-level (global) variable. Inside this
        # function locals() never contains it, so look it up in globals().
        pairs_data = globals().get('analysis_results')
        if not pairs_data:
            print("❌ No analysis results available. Run similarity analysis first.")
            return {"merged_pairs": 0, "errors": 0}
    
    if not pairs_data:
        print("❌ No pairs to merge.")
        return {"merged_pairs": 0, "errors": 0}
    
    print(f"🚀 Preparing to merge {len(pairs_data)} entity pairs")
    print("⚠️  WARNING: This will permanently modify your Neo4j database!")
    
    # Show summary of what will be merged
    print(f"\n📋 Merge Summary:")
    print(f"   Total pairs: {len(pairs_data)}")
    
    # Show first few pairs as examples
    print(f"\n🔍 Examples of pairs to be merged:")
    for i, pair in enumerate(pairs_data[:5]):
        print(f"   {i+1}. {pair['node1_name']} ↔ {pair['node2_name']} (similarity: {pair['score']:.4f})")
    
    if len(pairs_data) > 5:
        print(f"   ... and {len(pairs_data) - 5} more pairs")
    
    # Ask for confirmation if requested
    if confirm:
        print(f"\n⚠️  This action cannot be easily undone!")
        user_input = input("Type 'MERGE' to confirm or 'cancel' to abort: ").strip()
        if user_input.upper() != 'MERGE':
            print("❌ Merge operation cancelled by user")
            return {"merged_pairs": 0, "errors": 0}
    
    # Execute merges
    print(f"\n🔄 Starting merge operations...")
    
    async with neo4j.AsyncGraphDatabase.driver(
        neo4j_credentials["uri"], 
        auth=(neo4j_credentials["username"], neo4j_credentials["password"])
    ) as driver:
        
        merge_query = """
        MATCH (a) WHERE id(a) = $id1
        MATCH (b) WHERE id(b) = $id2
        WITH [a,b] AS nodes
        CALL apoc.refactor.mergeNodes(nodes, {properties:"discard"}) YIELD node
        RETURN id(node) AS kept
        """
        
        successful_merges = 0
        errors = 0
        
        async with driver.session() as session:
            for i, pair in enumerate(pairs_data):
                try:
                    print(f"   [{i+1:3d}/{len(pairs_data)}] Merging {pair['node1_name']} + {pair['node2_name']}...", end="")
                    
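                    result = await session.run(merge_query, id1=pair['id1'], id2=pair['id2'])  # type: ignore
                    # Consume the result so server-side errors surface inside this try block
                    await result.consume()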
                    successful_merges += 1
                    print(" ✅")
                    
                except Exception as e:
                    errors += 1
                    print(f" ❌ ({str(e)[:50]}{'...' if len(str(e)) > 50 else ''})")
    
    # Report results
    print(f"\n📊 Merge Results:")
    print(f"   ✅ Successful merges: {successful_merges}")
    print(f"   ❌ Failed merges: {errors}")
    print(f"   📈 Success rate: {successful_merges/(successful_merges+errors)*100:.1f}%")
    
    if successful_merges > 0:
        print(f"\n🎉 Entity merging completed successfully!")
        print(f"💡 {successful_merges} entity pairs have been merged in your knowledge graph")
    
    return {"merged_pairs": successful_merges, "errors": errors}

# Execution instructions
if 'analysis_results' in locals() and analysis_results:
    print("🚀 Ready to execute merges!")
    print(f"📊 {len(analysis_results)} pairs available for merging")
    print("\n📝 Instructions:")
    print("   1. Review the analysis results above carefully")
    print("   2. Uncomment and run ONE of the lines below:")
    print("      # merge_results = await execute_entity_merges()  # With confirmation")
    print("      # merge_results = await execute_entity_merges(confirm=False)  # Auto-merge")
    print("\n⚠️  Remember: Merging permanently modifies your database!")
    
    # Uncomment ONE of these lines to execute merges:
    # merge_results = await execute_entity_merges()  # With user confirmation
    # merge_results = await execute_entity_merges(confirm=False)  # Automatic merging
    
else:
    print("ℹ️  No analysis results available for merging")
    print("💡 Run the similarity analysis step first")
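
Because the pairs are merged one at a time, chained duplicates (A ↔ B and B ↔ C) can interact: once B has been merged into A, a later pair that still refers to B no longer matches any node, and the query returns no rows without raising an error. One possible way around this is to group the pairs into connected clusters first. The sketch below is not part of the original script; `cluster_pairs` is a hypothetical helper, and it assumes each pair is a dict with `id1` and `id2` keys, as in the analysis output above.

```python
from collections import defaultdict
from typing import Dict, Hashable, List


def cluster_pairs(pairs_data: List[dict]) -> List[List[Hashable]]:
    """Group node IDs into clusters of transitively linked duplicates (union-find)."""
    parent: Dict[Hashable, Hashable] = {}

    def find(x):
        # Register unseen IDs as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    # Union the two endpoints of every similarity pair.
    for pair in pairs_data:
        root1, root2 = find(pair["id1"]), find(pair["id2"])
        if root1 != root2:
            parent[root2] = root1

    # Collect the members of each root into one cluster.
    clusters = defaultdict(list)
    for node_id in parent:
        clusters[find(node_id)].append(node_id)
    return [sorted(members) for members in clusters.values()]


# Illustrative IDs and scores only (not real analysis output).
example_pairs = [
    {"id1": 1, "id2": 2, "score": 0.91},
    {"id1": 2, "id2": 3, "score": 0.88},
    {"id1": 7, "id2": 8, "score": 0.95},
]
clusters = cluster_pairs(example_pairs)
print(clusters)
```

Each cluster could then be passed as the node list of a single `apoc.refactor.mergeNodes` call instead of issuing one call per pair.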

## 9.3. Test Different Configurations

Try different similarity thresholds and configurations to see how they affect the results.

In [None]:
# Compare different configurations without merging
async def compare_configurations():
    """Compare results from different TF-IDF configurations."""
    
    configs_to_test = [
        ("Conservative", conservative_config),
        ("Moderate", moderate_config), 
        ("Aggressive", aggressive_config),
        ("Actor Only", actor_only_config),
        ("Event Only", event_only_config)
    ]
    
    print("🔬 Comparing different TF-IDF configurations...")
    print("=" * 80)
    
    for config_name, config in configs_to_test:
        print(f"\n📊 Testing {config_name} Configuration:")
        print(f"  Threshold: {config['similarity_threshold']}")
        print(f"  Properties: {config['resolve_properties']}")
        print(f"  Filter: {config['filter_query'][:50]}{'...' if len(config['filter_query']) > 50 else ''}")
        
        try:
            pairs, _ = await analyze_similarity_pairs(
                use_alternative_config=config, 
                max_pairs_to_show=3  # Show fewer pairs for comparison
            )
            
            if pairs:
                print(f"  → Would merge {len(pairs)} pairs")
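                # Pairs are dicts with a "score" key elsewhere in this notebook;
                # fall back to (id1, id2, score) tuples in case the older format is returned.
                scores = [p["score"] if isinstance(p, dict) else p[2] for p in pairs]
                avg_similarity = sum(scores) / len(scores)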
                print(f"  → Average similarity: {avg_similarity:.4f}")
            else:
                print(f"  → No pairs found above threshold")
                
        except Exception as e:
            print(f"  → Error: {e}")
        
        print("-" * 40)
    
    print("\n✅ Configuration comparison complete!")

# Uncomment to run configuration comparison
# await compare_configurations()
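
To get a quick feel for how the threshold alone changes the outcome, the scores from a single low-threshold analysis run can be re-filtered locally without querying Neo4j again. The sketch below assumes pairs are dicts with a `score` key, as in the analysis output. `count_pairs_by_threshold` and the sample scores are illustrative, not part of the original script, and counting scores at or above the threshold is an assumption about how the resolver compares them.

```python
from typing import Dict, Iterable, List


def count_pairs_by_threshold(pairs_data: List[dict], thresholds: Iterable[float]) -> Dict[float, int]:
    """Count how many scored pairs meet or exceed each candidate threshold."""
    return {t: sum(1 for pair in pairs_data if pair["score"] >= t) for t in thresholds}


# Made-up scores for illustration only.
sample_pairs = [{"score": s} for s in (0.97, 0.91, 0.86, 0.80, 0.72)]
counts = count_pairs_by_threshold(sample_pairs, [0.7, 0.8, 0.9])
for threshold, n_pairs in counts.items():
    print(f"threshold {threshold:.2f}: {n_pairs} pairs")
```

A sharp drop in the count between two neighbouring thresholds is a hint to inspect the pairs in that score band manually before choosing a value.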

## Summary

This notebook provides a step-by-step breakdown of the TF-IDF entity resolution script:

1. **Import Libraries**: All required dependencies for the resolution process
2. **Path Setup**: Configure paths for the notebook environment
3. **Class Definition**: The main TFIDFMatchResolver class structure
4. **Method Implementation**: Core methods for node fetching, similarity calculation, and merging
5. **Main Function**: Complete workflow orchestration
6. **Configuration Testing**: Verify config files and environment variables
7. **Connection Testing**: Validate Neo4j connectivity
8. **Full Execution**: Run the complete resolution process
9. **Debugging Tools**: Individual component testing utilities

### Usage Instructions:
1. Run cells 1-5 to set up the environment and define all functions
2. Run cell 6 to test configuration loading
3. Run cell 7 to test Neo4j connection
4. If all tests pass, run cell 8 to execute the full resolution
5. Use cell 9 for debugging specific issues

### Troubleshooting:
- If configuration loading fails, check that config files exist in the expected paths
- If Neo4j connection fails, verify credentials and server availability
- If node fetching fails, check the filter query and resolve properties in the config
- Use the debugging functions to isolate specific issues
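
For the configuration-related failures above, a small pre-flight check can catch missing or malformed settings before any query is sent. The sketch below is a hypothetical helper. It assumes a config is a dict with the `similarity_threshold`, `resolve_properties`, and `filter_query` keys used in the configuration comparison, and the value constraints it enforces are assumptions rather than rules taken from the original script.

```python
from typing import List


def validate_resolver_config(config: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means the config looks usable."""
    problems = []

    # Keys read by the configuration comparison cell.
    for key in ("similarity_threshold", "resolve_properties", "filter_query"):
        if key not in config:
            problems.append(f"missing key: {key}")

    # Assumption: cosine similarity thresholds are expected in [0, 1].
    threshold = config.get("similarity_threshold")
    if threshold is not None and not (isinstance(threshold, (int, float)) and 0.0 <= threshold <= 1.0):
        problems.append("similarity_threshold should be a number between 0 and 1")

    # Assumption: at least one text property is needed to build TF-IDF vectors.
    properties = config.get("resolve_properties")
    if properties is not None and (not isinstance(properties, (list, tuple)) or not properties):
        problems.append("resolve_properties should be a non-empty list of property names")

    return problems


# Deliberately broken example config (the "name" property is illustrative).
example_config = {"similarity_threshold": 1.5, "resolve_properties": ["name"]}
for problem in validate_resolver_config(example_config):
    print(f"- {problem}")
```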