# Graph-based Knowledge Graph with Hybrid Search

This notebook demonstrates:
1. Building a **Graph-based Knowledge Graph** from documents
2. **Graph-based search** using relationships and entities
3. **Hybrid approach** combining Graph KG with embedding-based search
4. **Performance comparison** between different search methods

## Why Knowledge Graphs?
- **Relationships**: Capture explicit relationships between entities
- **Reasoning**: Enable complex queries and inference
- **Structure**: Organized knowledge representation
- **Complementary**: Works well with embedding-based semantic search
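
To make the first two bullets concrete, here is a minimal sketch of how explicit relationships can be stored as (subject, relation, object) triples and followed transitively. It uses plain Python dictionaries only; the `triples` data and the `neighbors` and `ancestors` helpers are illustrative assumptions, not part of this notebook's code, which builds a `networkx` graph instead.

```python
from collections import defaultdict

# Hypothetical example triples: (subject, relation, object)
triples = [
    ("Deep Learning", "subset_of", "Machine Learning"),
    ("Machine Learning", "subset_of", "Artificial Intelligence"),
    ("Deep Learning", "uses", "Neural Networks"),
]

# Index outgoing edges by subject for direct lookup
outgoing = defaultdict(list)
for subj, rel, obj in triples:
    outgoing[subj].append((rel, obj))

def neighbors(entity):
    """Return the (relation, object) pairs leaving `entity`."""
    return outgoing.get(entity, [])

def ancestors(entity, relation="subset_of"):
    """Follow one relation type transitively (a simple form of inference)."""
    chain = []
    current = entity
    seen = {entity}
    while True:
        nxt = [o for r, o in neighbors(current) if r == relation and o not in seen]
        if not nxt:
            return chain
        current = nxt[0]
        seen.add(current)
        chain.append(current)

print(ancestors("Deep Learning"))
```

Following `subset_of` from "Deep Learning" reaches "Artificial Intelligence" even though no single triple states that link, which is the kind of multi-step query that plain embedding similarity does not expose directly.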

In [1]:
# Install required packages
!pip install networkx matplotlib spacy transformers sentence-transformers
!pip install langchain chromadb faiss-cpu openai
!python -m spacy download en_core_web_sm


In [2]:
import networkx as nx
import matplotlib.pyplot as plt
import spacy
import numpy as np
import pandas as pd
from collections import defaultdict, Counter
import json
import re
from typing import List, Dict, Tuple, Set
import time

# Embedding and vector store imports
from sentence_transformers import SentenceTransformer
import faiss
from langchain.vectorstores import FAISS, Chroma
from langchain.embeddings import HuggingFaceEmbeddings

# Load spaCy model for NER
nlp = spacy.load("en_core_web_sm")

print("✅ All packages imported successfully!")

ImportError: libbz2.so.1.0: cannot open shared object file: No such file or directory

## Sample Document Data

Let's create some sample documents that contain entities and relationships we can extract.

In [None]:
# Sample documents with rich entity relationships
sample_documents = [
    {
        "id": "doc_1",
        "title": "Introduction to Machine Learning",
        "content": "Machine Learning is a subset of Artificial Intelligence that enables computers to learn without being explicitly programmed. Python is widely used for ML development. Popular libraries include TensorFlow, PyTorch, and Scikit-learn. Supervised learning uses labeled data to train models."
    },
    {
        "id": "doc_2",
        "title": "Deep Learning Fundamentals",
        "content": "Deep Learning is a subset of Machine Learning that uses neural networks with multiple layers. TensorFlow and PyTorch are the most popular frameworks. Deep Learning excels in computer vision and natural language processing. CUDA enables GPU acceleration for faster training."
    },
    {
        "id": "doc_3",
        "title": "Python for Data Science",
        "content": "Python is the preferred programming language for data science and machine learning. Key libraries include NumPy for numerical computing, Pandas for data manipulation, and Matplotlib for visualization. Jupyter notebooks provide an interactive development environment."
    },
    {
        "id": "doc_4",
        "title": "Natural Language Processing",
        "content": "Natural Language Processing (NLP) is a field of Artificial Intelligence focused on understanding human language. Transformers revolutionized NLP with models like BERT and GPT. PyTorch and TensorFlow support NLP model development. Applications include sentiment analysis and machine translation."
    },
    {
        "id": "doc_5",
        "title": "Computer Vision Applications",
        "content": "Computer Vision enables machines to interpret visual information. Deep Learning techniques, particularly Convolutional Neural Networks (CNNs), are essential. OpenCV provides computer vision tools, while TensorFlow and PyTorch offer deep learning frameworks. Applications include image recognition and object detection."
    },
    {
        "id": "doc_6",
        "title": "Data Visualization with Python",
        "content": "Data visualization is crucial for data analysis and presentation. Matplotlib is the foundational plotting library in Python. Seaborn provides statistical visualizations, while Plotly enables interactive charts. Jupyter notebooks integrate well with visualization libraries for exploratory data analysis."
    }
]

print(f"📚 Created {len(sample_documents)} sample documents")
for doc in sample_documents:
    print(f"  - {doc['title']} ({len(doc['content'])} chars)")

## 1. Entity Extraction and Relationship Building

We'll extract entities and relationships from documents to build our knowledge graph.

In [None]:
class KnowledgeGraphBuilder:
    def __init__(self):
        self.graph = nx.DiGraph()
        self.entities = set()
        self.relationships = []
        self.entity_documents = defaultdict(list)  # Track which documents contain which entities
        
    def extract_entities(self, text: str) -> List[str]:
        """Extract named entities using spaCy NER"""
        doc = nlp(text)
        entities = []
        
        # Extract named entities
        for ent in doc.ents:
            # 'TECHNOLOGY' is not a label emitted by en_core_web_sm, so it is omitted here
            if ent.label_ in ['ORG', 'PRODUCT', 'LANGUAGE', 'PERSON']:
                entities.append(ent.text)
        
        # Extract key technical terms (simple keyword extraction)
        tech_terms = [
            'Machine Learning', 'Deep Learning', 'Artificial Intelligence', 'AI',
            'Python', 'TensorFlow', 'PyTorch', 'Scikit-learn', 'NumPy', 'Pandas',
            'Matplotlib', 'Seaborn', 'Plotly', 'Jupyter', 'OpenCV', 'CUDA',
            'Neural Networks', 'CNN', 'BERT', 'GPT', 'Transformers',
            'Computer Vision', 'Natural Language Processing', 'NLP',
            'Supervised Learning', 'Data Science', 'Data Visualization'
        ]
        
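        # Match whole words only, so 'AI' does not fire on words like 'train';
        # the optional 's' still catches simple plurals such as 'CNNs'.
        for term in tech_terms:
            if re.search(rf"\b{re.escape(term)}s?\b", text, re.IGNORECASE):
                entities.append(term)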
        
        return list(set(entities))  # Remove duplicates
    
    def extract_relationships(self, text: str, entities: List[str]) -> List[Tuple[str, str, str]]:
        """Extract relationships between entities"""
        relationships = []
        text_lower = text.lower()
        
        # Define relationship patterns
        patterns = [
            ('subset_of', ['is a subset of', 'subset of']),
            ('uses', ['uses', 'utilize', 'employs']),
            ('includes', ['includes', 'include', 'contains']),
            ('enables', ['enables', 'enable', 'allows']),
            ('supports', ['supports', 'support']),
            ('provides', ['provides', 'provide', 'offers']),
            ('excels_in', ['excels in', 'good for']),
            ('revolutionized', ['revolutionized', 'transformed']),
            ('integrates_with', ['integrate', 'integrates with'])
        ]
        
        # Find relationships between entities
        for i, entity1 in enumerate(entities):
            for j, entity2 in enumerate(entities):
                if i < j:  # Visit each unordered pair once; the elif below covers the reverse direction
                    entity1_lower = entity1.lower()
                    entity2_lower = entity2.lower()
                    
                    # Check if both entities appear in the same sentence
                    sentences = text.split('.')
                    for sentence in sentences:
                        sentence_lower = sentence.lower()
                        if entity1_lower in sentence_lower and entity2_lower in sentence_lower:
                            # Check for relationship patterns
                            for rel_type, keywords in patterns:
                                for keyword in keywords:
                                    if keyword in sentence_lower:
                                        # Determine direction based on sentence structure
                                        entity1_pos = sentence_lower.find(entity1_lower)
                                        entity2_pos = sentence_lower.find(entity2_lower)
                                        keyword_pos = sentence_lower.find(keyword)
                                        
                                        # Simple heuristic: if keyword is between entities
                                        if entity1_pos < keyword_pos < entity2_pos:
                                            relationships.append((entity1, rel_type, entity2))
                                        elif entity2_pos < keyword_pos < entity1_pos:
                                            relationships.append((entity2, rel_type, entity1))
        
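        # The same triple can be found more than once (several keywords per relation type), so drop duplicates in order
        return list(dict.fromkeys(relationships))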
    
    def build_graph(self, documents: List[Dict]) -> None:
        """Build knowledge graph from documents"""
        for doc in documents:
            text = doc['title'] + ' ' + doc['content']
            
            # Extract entities
            entities = self.extract_entities(text)
            
            # Add entities to graph
            for entity in entities:
                self.graph.add_node(entity, type='entity')
                self.entities.add(entity)
                self.entity_documents[entity].append(doc['id'])
            
            # Extract and add relationships
            relationships = self.extract_relationships(text, entities)
            for subj, pred, obj in relationships:
                self.graph.add_edge(subj, obj, relation=pred, source_doc=doc['id'])
                self.relationships.append((subj, pred, obj, doc['id']))
    
    def get_entity_info(self, entity: str) -> Dict:
        """Get information about an entity"""
        if entity not in self.entities:
            return {}
        
        # Get connections
        outgoing = list(self.graph.successors(entity))
        incoming = list(self.graph.predecessors(entity))
        
        # Get relationship details
        relationships = []
        for neighbor in outgoing:
            edge_data = self.graph[entity][neighbor]
            relationships.append({
                'type': 'outgoing',
                'target': neighbor,
                'relation': edge_data.get('relation', 'related_to'),
                'source_doc': edge_data.get('source_doc', 'unknown')
            })
        
        for neighbor in incoming:
            edge_data = self.graph[neighbor][entity]
            relationships.append({
                'type': 'incoming',
                'source': neighbor,
                'relation': edge_data.get('relation', 'related_to'),
                'source_doc': edge_data.get('source_doc', 'unknown')
            })
        
        return {
            'entity': entity,
            'documents': self.entity_documents[entity],
            'connections': len(outgoing) + len(incoming),
            'relationships': relationships
        }

# Build the knowledge graph
kg_builder = KnowledgeGraphBuilder()
kg_builder.build_graph(sample_documents)

print(f"🕸️ Knowledge Graph built successfully!")
print(f"   Entities: {len(kg_builder.entities)}")
print(f"   Relationships: {len(kg_builder.relationships)}")
print(f"   Graph edges: {kg_builder.graph.number_of_edges()}")

# Show some entities
print(f"\n📋 Sample entities:")
for entity in list(kg_builder.entities)[:10]:
    print(f"   - {entity}")
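
The direction heuristic inside `extract_relationships` can be examined in isolation. The sketch below is a simplified stand-in, assuming a hypothetical helper named `infer_direction` that is not part of the class above: it applies the same "keyword lies between the two entities" rule to a single sentence.

```python
def infer_direction(sentence, entity_a, entity_b, keyword):
    """Return (subject, object) if `keyword` lies between the two entities, else None."""
    s = sentence.lower()
    pos_a = s.find(entity_a.lower())
    pos_b = s.find(entity_b.lower())
    pos_k = s.find(keyword.lower())
    if min(pos_a, pos_b, pos_k) < 0:
        return None  # an entity or the keyword is missing from the sentence
    if pos_a < pos_k < pos_b:
        return (entity_a, entity_b)
    if pos_b < pos_k < pos_a:
        return (entity_b, entity_a)
    return None

sentence = "Deep Learning is a subset of Machine Learning that uses neural networks"

# Intended result: the argument order does not matter, the text order does
forward = infer_direction(sentence, "Machine Learning", "Deep Learning", "is a subset of")

# Side effect: any entity after the keyword is also treated as its object
spurious = infer_direction(sentence, "Deep Learning", "Neural Networks", "is a subset of")

print(forward)
print(spurious)
```

The second call shows a limitation of the purely positional rule: it also yields a `subset_of` pair for "Deep Learning" and "Neural Networks", although the sentence says "uses". Edges extracted this way are best treated as noisy candidates rather than verified facts.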

## 2. Graph Visualization

Let's visualize our knowledge graph to understand the entity relationships.

In [None]:
def visualize_knowledge_graph(kg_builder, max_nodes=20, figsize=(15, 10)):
    """Visualize the knowledge graph"""
    # Get subgraph with most connected nodes
    node_degrees = dict(kg_builder.graph.degree())
    top_nodes = sorted(node_degrees.items(), key=lambda x: x[1], reverse=True)[:max_nodes]
    top_node_names = [node[0] for node in top_nodes]
    
    subgraph = kg_builder.graph.subgraph(top_node_names)
    
    plt.figure(figsize=figsize)
    
    # Create layout
    pos = nx.spring_layout(subgraph, k=3, iterations=50)
    
    # Draw nodes with different colors based on degree
    node_colors = []
    node_sizes = []
    for node in subgraph.nodes():
        degree = node_degrees[node]
        if degree >= 4:
            node_colors.append('red')  # Highly connected
            node_sizes.append(1000)
        elif degree >= 2:
            node_colors.append('orange')  # Moderately connected
            node_sizes.append(700)
        else:
            node_colors.append('lightblue')  # Less connected
            node_sizes.append(500)
    
    # Draw the graph
    nx.draw_networkx_nodes(subgraph, pos, node_color=node_colors, 
                          node_size=node_sizes, alpha=0.8)
    nx.draw_networkx_edges(subgraph, pos, edge_color='gray', 
                          arrows=True, arrowsize=20, alpha=0.6)
    nx.draw_networkx_labels(subgraph, pos, font_size=8, font_weight='bold')
    
    # Add edge labels for relationships
    edge_labels = {}
    for edge in subgraph.edges(data=True):
        relation = edge[2].get('relation', 'related')
        edge_labels[(edge[0], edge[1])] = relation
    
    nx.draw_networkx_edge_labels(subgraph, pos, edge_labels, font_size=6)
    
    plt.title(f"Knowledge Graph - Top {max_nodes} Most Connected Entities", 
              fontsize=16, fontweight='bold')
    plt.axis('off')
    plt.tight_layout()
    
    # Add legend
    legend_elements = [
        plt.scatter([], [], c='red', s=100, label='Highly connected (4+ edges)'),
        plt.scatter([], [], c='orange', s=100, label='Moderately connected (2-3 edges)'),
        plt.scatter([], [], c='lightblue', s=100, label='Less connected (0-1 edges)')
    ]
    plt.legend(handles=legend_elements, loc='upper right')
    
    plt.show()
    
    # Print statistics
    print(f"\n📊 Graph Statistics:")
    print(f"   Total nodes: {subgraph.number_of_nodes()}")
    print(f"   Total edges: {subgraph.number_of_edges()}")
    print(f"   Average degree: {np.mean([d for n, d in subgraph.degree()]):.2f}")
    
    return subgraph

# Visualize the knowledge graph
subgraph = visualize_knowledge_graph(kg_builder)

## 3. Graph-based Search Implementation

Now let's implement different graph-based search methods.

In [None]:
class GraphBasedSearch:
    def __init__(self, kg_builder: KnowledgeGraphBuilder):
        self.kg = kg_builder
        self.graph = kg_builder.graph
        self.entities = kg_builder.entities
        
    def entity_search(self, query: str, top_k: int = 5) -> List[Dict]:
        """Search for entities matching the query"""
        query_lower = query.lower()
        matches = []
        
        for entity in self.entities:
            if query_lower in entity.lower():
                score = len(query_lower) / len(entity.lower())  # Simple relevance score
                entity_info = self.kg.get_entity_info(entity)
                matches.append({
                    'entity': entity,
                    'score': score,
                    'info': entity_info
                })
        
        # Sort by score and connections
        matches.sort(key=lambda x: (x['score'], x['info']['connections']), reverse=True)
        return matches[:top_k]
    
    def relationship_search(self, entity: str, max_hops: int = 2) -> Dict:
        """Find entities related to a given entity within max_hops"""
        if entity not in self.entities:
            return {'error': f'Entity "{entity}" not found in knowledge graph'}
        
        related_entities = {}
        
        # Use BFS to find entities within max_hops
        visited = {entity}
        queue = [(entity, 0)]  # (entity, hop_count)
        
        while queue:
            current_entity, hops = queue.pop(0)
            
            if hops < max_hops:
                # Get neighbors (both directions)
                neighbors = set(self.graph.successors(current_entity)) | set(self.graph.predecessors(current_entity))
                
                for neighbor in neighbors:
                    if neighbor not in visited:
                        visited.add(neighbor)
                        queue.append((neighbor, hops + 1))
                        
                        # Get relationship info
                        relationship = 'unknown'
                        if self.graph.has_edge(current_entity, neighbor):
                            relationship = self.graph[current_entity][neighbor].get('relation', 'related_to')
                        elif self.graph.has_edge(neighbor, current_entity):
                            relationship = self.graph[neighbor][current_entity].get('relation', 'related_to')
                        
                        if hops + 1 not in related_entities:
                            related_entities[hops + 1] = []
                        
                        related_entities[hops + 1].append({
                            'entity': neighbor,
                            'relationship': relationship,
                            'path_length': hops + 1,
                            'via': current_entity if hops > 0 else None
                        })
        
        return {
            'query_entity': entity,
            'related_entities': related_entities,
            'total_found': sum(len(entities) for entities in related_entities.values())
        }
    
    def path_search(self, start_entity: str, end_entity: str) -> List[List[Tuple[str, str, str]]]:
        """Find up to 3 simple paths (at most 4 edges each) between two entities"""
        if start_entity not in self.entities or end_entity not in self.entities:
            return []
        
        try:
            # all_simple_paths yields paths in no guaranteed length order
            paths = []
            for path in nx.all_simple_paths(self.graph, start_entity, end_entity, cutoff=4):
                # Add relationship information
                path_with_relations = []
                for i in range(len(path) - 1):
                    relation = self.graph[path[i]][path[i+1]].get('relation', 'related_to')
                    path_with_relations.append((path[i], relation, path[i+1]))
                paths.append(path_with_relations)
                if len(paths) == 3:  # Stop early instead of exhausting the generator
                    break
            return paths
        except nx.NodeNotFound:  # Raised when an endpoint is missing from the graph
            return []
    
    def concept_expansion(self, concepts: List[str]) -> Dict:
        """Expand a list of concepts using graph relationships"""
        expanded_concepts = set(concepts)
        expansion_info = {}
        
        for concept in concepts:
            if concept in self.entities:
                # Find related entities
                related = self.relationship_search(concept, max_hops=1)
                if 'related_entities' in related:
                    for hop_level, entities in related['related_entities'].items():
                        for entity_info in entities:
                            expanded_concepts.add(entity_info['entity'])
                expansion_info[concept] = related
        
        return {
            'original_concepts': concepts,
            'expanded_concepts': list(expanded_concepts),
            'expansion_details': expansion_info,
            'expansion_ratio': len(expanded_concepts) / len(concepts) if concepts else 0.0
        }

# Initialize graph search
graph_search = GraphBasedSearch(kg_builder)

print("🔍 Graph-based search initialized!")
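
The hop-limited traversal in `relationship_search` is a breadth-first search. As a minimal sketch of the same idea, the version below runs on a plain adjacency dictionary and uses `collections.deque`, whose `popleft()` avoids the linear cost of `list.pop(0)`. The `adjacency` data and the `within_hops` helper are assumptions made for illustration only.

```python
from collections import deque

# Hypothetical undirected adjacency map (assumption for illustration)
adjacency = {
    "Python": {"NumPy", "Machine Learning"},
    "NumPy": {"Python"},
    "Machine Learning": {"Python", "Deep Learning"},
    "Deep Learning": {"Machine Learning", "Computer Vision"},
    "Computer Vision": {"Deep Learning"},
}

def within_hops(start, max_hops):
    """Return {entity: hop_count} for every entity reachable in at most max_hops steps."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distances[current] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in adjacency.get(current, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[current] + 1
                queue.append(neighbor)
    del distances[start]  # report only the related entities, not the start node
    return distances

print(within_hops("Python", 2))
```

With a limit of two hops, "Computer Vision" is not returned because it sits three steps away from "Python"; raising `max_hops` widens the neighbourhood at the cost of pulling in less closely related entities.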

## 4. Demo: Graph-based Search Examples

In [None]:
# Example 1: Entity Search
print("🔍 Example 1: Entity Search for 'Python'")
print("=" * 50)
results = graph_search.entity_search("Python", top_k=3)
for result in results:
    print(f"\n📌 Entity: {result['entity']} (Score: {result['score']:.3f})")
    print(f"   Documents: {result['info']['documents']}")
    print(f"   Connections: {result['info']['connections']}")
    if result['info']['relationships']:
        print(f"   Relationships:")
        for rel in result['info']['relationships'][:3]:  # Show top 3
            if rel['type'] == 'outgoing':
                print(f"     → {rel['relation']} → {rel['target']}")
            else:
                print(f"     ← {rel['relation']} ← {rel['source']}")

print("\n" + "=" * 70)

In [None]:
# Example 2: Relationship Search
print("🔍 Example 2: Finding entities related to 'Machine Learning'")
print("=" * 60)
related_results = graph_search.relationship_search("Machine Learning", max_hops=2)
if 'error' not in related_results:
    print(f"Query Entity: {related_results['query_entity']}")
    print(f"Total Related Entities Found: {related_results['total_found']}")
    
    for hop_level, entities in related_results['related_entities'].items():
        print(f"\n🔗 Hop {hop_level} ({len(entities)} entities):")
        for entity_info in entities[:5]:  # Show top 5 per level
            via_text = f" (via {entity_info['via']})" if entity_info['via'] else ""
            print(f"   - {entity_info['entity']} [{entity_info['relationship']}]{via_text}")
else:
    print(related_results['error'])

print("\n" + "=" * 70)

In [None]:
# Example 3: Path Search
print("🔍 Example 3: Finding paths between 'Python' and 'Computer Vision'")
print("=" * 65)
paths = graph_search.path_search("Python", "Computer Vision")
if paths:
    print(f"Found {len(paths)} path(s):")
    for i, path in enumerate(paths, 1):
        print(f"\n🛤️ Path {i}:")
        path_str = path[0][0]  # Start with first entity
        for step in path:
            path_str += f" --[{step[1]}]--> {step[2]}"
        print(f"   {path_str}")
else:
    print("No paths found between these entities.")

print("\n" + "=" * 70)

In [None]:
# Example 4: Concept Expansion
print("🔍 Example 4: Concept Expansion for ['Deep Learning', 'NLP']")
print("=" * 60)
expansion_results = graph_search.concept_expansion(["Deep Learning", "NLP"])
print(f"Original concepts: {expansion_results['original_concepts']}")
print(f"Expanded to {len(expansion_results['expanded_concepts'])} concepts")
print(f"Expansion ratio: {expansion_results['expansion_ratio']:.2f}x")

print(f"\n📈 Expanded Concepts:")
new_concepts = set(expansion_results['expanded_concepts']) - set(expansion_results['original_concepts'])
for concept in list(new_concepts)[:8]:  # Show first 8 new concepts
    print(f"   + {concept}")

if len(new_concepts) > 8:
    print(f"   ... and {len(new_concepts) - 8} more")

print("\n" + "=" * 70)

## 5. Embedding-based Search Setup

Now let's set up embedding-based search to compare and combine with graph search.

In [None]:
# Initialize embedding model
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

# Create embeddings for documents
documents_text = []
document_metadata = []

for doc in sample_documents:
    full_text = f"{doc['title']} {doc['content']}"
    documents_text.append(full_text)
    document_metadata.append({
        'doc_id': doc['id'],
        'title': doc['title']
    })

# Generate embeddings
print("🧮 Generating embeddings...")
start_time = time.time()
document_embeddings = embedding_model.encode(documents_text, show_progress_bar=True)
embedding_time = time.time() - start_time

print(f"✅ Generated {len(document_embeddings)} embeddings in {embedding_time:.2f} seconds")
print(f"   Embedding dimension: {document_embeddings.shape[1]}")

# Create FAISS index for fast similarity search
dimension = document_embeddings.shape[1]
faiss_index = faiss.IndexFlatIP(dimension)  # Inner product for similarity

# Normalize embeddings for cosine similarity
normalized_embeddings = document_embeddings / np.linalg.norm(document_embeddings, axis=1, keepdims=True)
faiss_index.add(normalized_embeddings.astype('float32'))

print(f"📊 FAISS index created with {faiss_index.ntotal} vectors")
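
The cell above normalizes the embeddings before adding them to an inner-product index. A small worked example shows why: for unit-length vectors, the inner product equals the cosine similarity. The sketch uses only the standard library, and the toy vectors are hypothetical stand-ins for real embeddings.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed directly from its definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical toy vectors standing in for a document and a query embedding
doc_vec = [3.0, 4.0, 0.0]
query_vec = [1.0, 2.0, 2.0]

inner_product_after_norm = dot(l2_normalize(doc_vec), l2_normalize(query_vec))

print(inner_product_after_norm, cosine(doc_vec, query_vec))
```

Both values agree up to floating-point rounding, so an inner-product index over normalized vectors ranks documents by cosine similarity, provided the query vector is normalized in the same way.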

## 6. Hybrid Search: Combining Graph and Embeddings

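This section combines graph-based knowledge with embedding-based semantic search: entities found in the query are expanded through the graph to enrich the query, and the resulting scores are blended with plain embedding similarity.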

In [None]:
class HybridKnowledgeSearch:
    def __init__(self, graph_search: GraphBasedSearch, embedding_model, faiss_index, 
                 documents_text: List[str], document_metadata: List[Dict]):
        self.graph_search = graph_search
        self.embedding_model = embedding_model
        self.faiss_index = faiss_index
        self.documents_text = documents_text
        self.document_metadata = document_metadata
    
    def embedding_search(self, query: str, top_k: int = 5) -> List[Dict]:
        """Perform embedding-based similarity search"""
        # Generate query embedding
        query_embedding = self.embedding_model.encode([query])
        query_embedding = query_embedding / np.linalg.norm(query_embedding)
        
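        # Search in FAISS index. Clamp k to the index size: FAISS pads missing
        # results with index -1, which would silently map to the last document.
        top_k = min(top_k, self.faiss_index.ntotal)
        scores, indices = self.faiss_index.search(query_embedding.astype('float32'), top_k)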
        
        results = []
        for i, (score, idx) in enumerate(zip(scores[0], indices[0])):
            results.append({
                'rank': i + 1,
                'doc_id': self.document_metadata[idx]['doc_id'],
                'title': self.document_metadata[idx]['title'],
                'content': self.documents_text[idx],
                'similarity_score': float(score),
                'search_type': 'embedding'
            })
        
        return results
    
    def graph_enhanced_search(self, query: str, top_k: int = 5) -> List[Dict]:
        """Enhance embedding search with graph knowledge"""
        # Step 1: Find relevant entities in the query
        query_entities = self.graph_search.kg.extract_entities(query)
        
        # Step 2: Expand concepts using graph relationships
        if query_entities:
            expansion_results = self.graph_search.concept_expansion(query_entities)
            expanded_concepts = expansion_results['expanded_concepts']
        else:
            expanded_concepts = []
        
        # Step 3: Create enriched query
        enriched_query = query
        if expanded_concepts:
            # Add up to 5 expanded concepts; they come from a set, so the order is arbitrary rather than ranked
            top_concepts = expanded_concepts[:5]  # Limit to avoid query bloat
            enriched_query += " " + " ".join(top_concepts)
        
        # Step 4: Perform embedding search with enriched query
        embedding_results = self.embedding_search(enriched_query, top_k)
        
        # Step 5: Add graph context to results
        for result in embedding_results:
            result['search_type'] = 'graph_enhanced'
            result['query_entities'] = query_entities
            result['expanded_concepts'] = expanded_concepts
            result['enriched_query'] = enriched_query
        
        return embedding_results
    
    def hybrid_search(self, query: str, top_k: int = 5, alpha: float = 0.7) -> List[Dict]:
        """Combine embedding and graph search with weighted scoring"""
        # Get results from both methods
        embedding_results = self.embedding_search(query, top_k * 2)  # Get more candidates
        graph_results = self.graph_enhanced_search(query, top_k * 2)
        
        # Create combined scoring
        doc_scores = {}
        
        # Score from embedding search
        for result in embedding_results:
            doc_id = result['doc_id']
            embedding_score = result['similarity_score']
            doc_scores[doc_id] = {
                'embedding_score': embedding_score,
                'graph_score': 0.0,
                'result': result
            }
        
        # Score from graph-enhanced search
        for result in graph_results:
            doc_id = result['doc_id']
            graph_score = result['similarity_score']
            if doc_id in doc_scores:
                doc_scores[doc_id]['graph_score'] = graph_score
            else:
                doc_scores[doc_id] = {
                    'embedding_score': 0.0,
                    'graph_score': graph_score,
                    'result': result
                }
        
        # Calculate hybrid scores
        hybrid_results = []
        for doc_id, scores in doc_scores.items():
            # Weighted combination
            hybrid_score = (alpha * scores['embedding_score'] + 
                          (1 - alpha) * scores['graph_score'])
            
            result = scores['result'].copy()
            result.update({
                'hybrid_score': hybrid_score,
                'embedding_component': scores['embedding_score'],
                'graph_component': scores['graph_score'],
                'search_type': 'hybrid',
                'alpha': alpha
            })
            hybrid_results.append(result)
        
        # Sort by hybrid score
        hybrid_results.sort(key=lambda x: x['hybrid_score'], reverse=True)
        
        # Re-rank
        for i, result in enumerate(hybrid_results[:top_k], 1):
            result['rank'] = i
        
        return hybrid_results[:top_k]
    
    def compare_search_methods(self, query: str, top_k: int = 3) -> Dict:
        """Compare all search methods side by side"""
        start_time = time.time()
        embedding_results = self.embedding_search(query, top_k)
        embedding_time = time.time() - start_time
        
        start_time = time.time()
        graph_results = self.graph_enhanced_search(query, top_k)
        graph_time = time.time() - start_time
        
        start_time = time.time()
        hybrid_results = self.hybrid_search(query, top_k)
        hybrid_time = time.time() - start_time
        
        return {
            'query': query,
            'embedding_search': {
                'results': embedding_results,
                'time': embedding_time
            },
            'graph_enhanced_search': {
                'results': graph_results,
                'time': graph_time
            },
            'hybrid_search': {
                'results': hybrid_results,
                'time': hybrid_time
            }
        }

# Initialize hybrid search
hybrid_search = HybridKnowledgeSearch(
    graph_search, embedding_model, faiss_index, 
    documents_text, document_metadata
)

print("🎯 Hybrid Knowledge Search initialized!")
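
The weighted combination in `hybrid_search` can be illustrated with a few hand-picked numbers. The sketch below assumes a hypothetical helper, `combine_scores`, and made-up scores; it mirrors the formula `alpha * embedding_score + (1 - alpha) * graph_score`, with a missing score counted as 0.0.

```python
def combine_scores(embedding_scores, graph_scores, alpha=0.7):
    """Blend two {doc_id: score} maps; a document missing from one map scores 0.0 there."""
    doc_ids = set(embedding_scores) | set(graph_scores)
    combined = {
        doc_id: alpha * embedding_scores.get(doc_id, 0.0)
        + (1 - alpha) * graph_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    # Highest blended score first
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

# Hypothetical scores, chosen only to show how alpha changes the ranking
embedding_scores = {"doc_1": 0.60, "doc_2": 0.50}
graph_scores = {"doc_2": 0.90, "doc_3": 0.80}

ranking_default = combine_scores(embedding_scores, graph_scores)  # alpha = 0.7
ranking_graph_heavy = combine_scores(embedding_scores, graph_scores, alpha=0.2)

print([doc_id for doc_id, _ in ranking_default])
print([doc_id for doc_id, _ in ranking_graph_heavy])
```

A document found by both methods ("doc_2" here) ranks first under either weighting, while lowering `alpha` lets a graph-only hit overtake an embedding-only hit. A linear blend like this is only meaningful when both scores share a scale; in this notebook both components are cosine similarities, so that condition holds.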

## 7. Comprehensive Search Comparison

In [None]:
# Test different search methods with various queries
test_queries = [
    "Python libraries for machine learning",
    "Deep learning frameworks",
    "Data visualization tools",
    "Computer vision with neural networks"
]

print("🔬 COMPREHENSIVE SEARCH METHOD COMPARISON")
print("=" * 80)

for i, query in enumerate(test_queries, 1):
    print(f"\n🎯 Query {i}: '{query}'")
    print("-" * 60)
    
    comparison = hybrid_search.compare_search_methods(query, top_k=3)
    
    print(f"\n📊 Performance:")
    print(f"   Embedding Search: {comparison['embedding_search']['time']:.3f}s")
    print(f"   Graph Enhanced:   {comparison['graph_enhanced_search']['time']:.3f}s")
    print(f"   Hybrid Search:    {comparison['hybrid_search']['time']:.3f}s")
    
    # Show top result from each method
    methods = ['embedding_search', 'graph_enhanced_search', 'hybrid_search']
    method_names = ['🔍 Embedding', '🕸️ Graph Enhanced', '🎯 Hybrid']
    
    for method, method_name in zip(methods, method_names):
        results = comparison[method]['results']
        if results:
            top_result = results[0]
            score_info = ""
            if method == 'hybrid_search':
                score_info = f" (E:{top_result['embedding_component']:.3f}, G:{top_result['graph_component']:.3f})"
            elif 'similarity_score' in top_result:
                score_info = f" ({top_result['similarity_score']:.3f})"
            
            print(f"\n{method_name} Top Result{score_info}:")
            print(f"   📄 {top_result['title']}")
            
            # Show additional context for graph-enhanced search
            if method == 'graph_enhanced_search' and 'expanded_concepts' in top_result:
                concepts = top_result['expanded_concepts'][:3]
                if concepts:
                    print(f"   🧠 Expanded concepts: {', '.join(concepts)}")
    
    print("\n" + "=" * 80)

## 8. Interactive Search Demo

In [None]:
def interactive_search_demo(query: str):
    """Interactive demonstration of all search capabilities"""
    print(f"🔍 INTERACTIVE SEARCH DEMO")
    print(f"Query: '{query}'")
    print("=" * 70)
    
    # 1. Entity Recognition in Query
    query_entities = graph_search.kg.extract_entities(query)
    print(f"\n🏷️ Entities detected in query: {query_entities if query_entities else 'None'}")
    
    # 2. Concept Expansion
    if query_entities:
        expansion = graph_search.concept_expansion(query_entities[:2])  # Limit to 2 entities
        # sorted() keeps the displayed concepts stable across runs (set order is arbitrary)
        expanded_concepts = sorted(set(expansion['expanded_concepts']) - set(query_entities))
        print(f"\n🧠 Concept expansion found {len(expanded_concepts)} related concepts:")
        for concept in expanded_concepts[:6]:
            print(f"   + {concept}")
    
    # 3. Hybrid Search Results
    print(f"\n🎯 Hybrid Search Results:")
    hybrid_results = hybrid_search.hybrid_search(query, top_k=4)
    
    for i, result in enumerate(hybrid_results, 1):
        print(f"\n   {i}. {result['title']}")
        print(f"      📊 Hybrid Score: {result['hybrid_score']:.3f}")
        print(f"      🔍 Embedding: {result['embedding_component']:.3f} | "
              f"🕸️ Graph: {result['graph_component']:.3f}")
        
        # Show snippet
        content = result['content']
        if len(content) > 150:
            content = content[:150] + "..."
        print(f"      📝 {content}")
    
    # 4. Graph Relationships (if entities found)
    if query_entities:
        print(f"\n🕸️ Graph Relationships for '{query_entities[0]}':")
        entity_info = graph_search.kg.get_entity_info(query_entities[0])
        if entity_info and entity_info['relationships']:
            for rel in entity_info['relationships'][:4]:
                if rel['type'] == 'outgoing':
                    print(f"      {query_entities[0]} --[{rel['relation']}]--> {rel['target']}")
                else:
                    print(f"      {rel['source']} --[{rel['relation']}]--> {query_entities[0]}")
        else:
            print(f"      No relationships found")

# Test the interactive demo
demo_queries = [
    "How to use Python for deep learning?",
    "Computer vision with TensorFlow"
]

for query in demo_queries:
    interactive_search_demo(query)
    print("\n" + "=" * 100 + "\n")

## 9. Performance Analysis and Insights

In [None]:
def analyze_search_performance():
    """Analyze and compare search method performance"""
    test_queries = [
        "machine learning algorithms",
        "python data science",
        "neural networks",
        "visualization libraries",
        "deep learning frameworks"
    ]
    
    results = {
        'embedding': {'times': [], 'avg_scores': []},
        'graph_enhanced': {'times': [], 'avg_scores': []},
        'hybrid': {'times': [], 'avg_scores': []}
    }
    
    print("📊 PERFORMANCE ANALYSIS")
    print("=" * 50)
    
    for query in test_queries:
        comparison = hybrid_search.compare_search_methods(query, top_k=3)
        
        # Collect timing data
        results['embedding']['times'].append(comparison['embedding_search']['time'])
        results['graph_enhanced']['times'].append(comparison['graph_enhanced_search']['time'])
        results['hybrid']['times'].append(comparison['hybrid_search']['time'])
        
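        # Collect average scores. 'similarity_score' and 'hybrid_score' are different
        # quantities, so compare them across methods with care.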
        if comparison['embedding_search']['results']:
            avg_emb_score = np.mean([r['similarity_score'] for r in comparison['embedding_search']['results']])
            results['embedding']['avg_scores'].append(avg_emb_score)
        
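        if comparison['graph_enhanced_search']['results']:
            # Graph-enhanced results may lack 'similarity_score'; skip those instead of raising KeyError
            graph_scores = [r['similarity_score']
                            for r in comparison['graph_enhanced_search']['results']
                            if 'similarity_score' in r]
            if graph_scores:
                results['graph_enhanced']['avg_scores'].append(np.mean(graph_scores))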
        
        if comparison['hybrid_search']['results']:
            avg_hybrid_score = np.mean([r['hybrid_score'] for r in comparison['hybrid_search']['results']])
            results['hybrid']['avg_scores'].append(avg_hybrid_score)
    
    # Calculate statistics
    print(f"\n⏱️ Average Response Times:")
    for method, data in results.items():
        avg_time = np.mean(data['times'])
        std_time = np.std(data['times'])
        print(f"   {method.replace('_', ' ').title():15}: {avg_time:.4f}s (±{std_time:.4f}s)")
    
    print(f"\n📈 Average Relevance Scores:")
    for method, data in results.items():
        if data['avg_scores']:
            avg_score = np.mean(data['avg_scores'])
            std_score = np.std(data['avg_scores'])
            print(f"   {method.replace('_', ' ').title():15}: {avg_score:.4f} (±{std_score:.4f})")
    
    return results

# Run performance analysis
performance_results = analyze_search_performance()

# Create visualization
plt.figure(figsize=(12, 5))

# Subplot 1: Response Times
plt.subplot(1, 2, 1)
methods = ['Embedding', 'Graph Enhanced', 'Hybrid']
times = [np.mean(performance_results['embedding']['times']),
         np.mean(performance_results['graph_enhanced']['times']),
         np.mean(performance_results['hybrid']['times'])]
errors = [np.std(performance_results['embedding']['times']),
          np.std(performance_results['graph_enhanced']['times']),
          np.std(performance_results['hybrid']['times'])]

bars = plt.bar(methods, times, yerr=errors, capsize=5, 
               color=['skyblue', 'lightgreen', 'orange'], alpha=0.7)
plt.title('Average Response Time by Search Method')
plt.ylabel('Time (seconds)')
plt.xticks(rotation=45)

# Add value labels on bars (the loop variable is not named `time`, which would shadow the time module)
for bar, avg_time in zip(bars, times):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.001,
             f'{avg_time:.4f}s', ha='center', va='bottom')

# Subplot 2: Score Distribution
plt.subplot(1, 2, 2)
scores = [performance_results['embedding']['avg_scores'],
          performance_results['graph_enhanced']['avg_scores'],
          performance_results['hybrid']['avg_scores']]
labels = ['Embedding', 'Graph Enhanced', 'Hybrid']
colors = ['skyblue', 'lightgreen', 'orange']

# `tick_labels` replaces the `labels` keyword, which is deprecated since Matplotlib 3.9
bp = plt.boxplot(scores, tick_labels=labels, patch_artist=True)
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)
    patch.set_alpha(0.7)

plt.title('Relevance Score Distribution')
plt.ylabel('Relevance Score')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()
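
The timings above come from a single call per query, so they can include one-off costs such as cache warm-up. A minimal sketch of a steadier measurement follows. It assumes a hypothetical helper `time_search` and a stand-in `fake_search` function, neither of which is part of this notebook, and uses only the standard library.

```python
import statistics
import time as time_module  # aliased so a variable named `time` cannot shadow it


def time_search(search_fn, query, repeats=5, warmup=1):
    """Return (median, stdev) wall-clock seconds for search_fn(query).

    Hypothetical helper: `search_fn` is any callable that takes a query string.
    """
    for _ in range(warmup):
        search_fn(query)  # warm-up runs are discarded
    samples = []
    for _ in range(repeats):
        start = time_module.perf_counter()
        search_fn(query)
        samples.append(time_module.perf_counter() - start)
    spread = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return statistics.median(samples), spread


# Stand-in search function so the sketch runs on its own
def fake_search(query):
    return sorted(query.split())


median_s, spread_s = time_search(fake_search, "deep learning frameworks")
print(f"median={median_s:.6f}s, stdev={spread_s:.6f}s")
```

In the notebook, the same helper could wrap a real method, for example `time_search(lambda q: hybrid_search.hybrid_search(q, top_k=3), query)`.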

## 10. Summary and Key Insights

### 🎯 What We've Built:

1. **Graph-based Knowledge Graph**: Extracted entities and relationships from documents
2. **Multiple Search Methods**: Entity search, relationship traversal, path finding, concept expansion
3. **Embedding-based Search**: Fast semantic similarity search using sentence transformers
4. **Hybrid Approach**: Combined graph knowledge with embedding similarity for enhanced results

### 🔍 Search Method Comparison:

| Method | Strengths | Weaknesses | Best Use Cases |
|--------|-----------|------------|----------------|
| **Embedding Search** | Fast, semantic understanding, good for general queries | No explicit relationships, may miss domain connections | General semantic search, content similarity |
| **Graph Search** | Explicit relationships, reasoning capabilities, domain knowledge | Requires good entity extraction, limited by graph coverage | Domain-specific queries, relationship exploration |
| **Hybrid Search** | Combines semantic similarity with explicit graph relationships; context-aware ranking | More complex to build and tune; slower than pure embedding search | Complex queries requiring both semantic and structural understanding |
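
To make the hybrid row concrete, here is a minimal sketch of one way to blend the two signals. It assumes a weighted sum with a hypothetical weight `alpha` and a simple concept-overlap graph score. The weighting actually used by `HybridKnowledgeSearch` is not shown in this section, so treat `combine_scores` and `graph_overlap` as illustrative names only. The returned keys mirror the `embedding_component`, `graph_component`, and `hybrid_score` fields printed in the cells above.

```python
def graph_overlap(doc_entities, expanded_concepts):
    """Fraction of expanded query concepts found among a document's entities."""
    concepts = set(expanded_concepts)
    if not concepts:
        return 0.0
    return len(concepts & set(doc_entities)) / len(concepts)


def combine_scores(embedding_score, graph_score, alpha=0.7):
    """Weighted sum of two scores assumed to lie in [0, 1].

    `alpha` (hypothetical) is the share given to the embedding signal.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    embedding_component = alpha * embedding_score
    graph_component = (1.0 - alpha) * graph_score
    return {
        "embedding_component": embedding_component,
        "graph_component": graph_component,
        "hybrid_score": embedding_component + graph_component,
    }


# Illustrative values, not results from the notebook
graph_score = graph_overlap(
    {"python", "tensorflow"},
    {"python", "tensorflow", "keras", "pytorch"},
)
result = combine_scores(embedding_score=0.8, graph_score=graph_score, alpha=0.7)
print(result)
```

A single weight keeps the trade-off easy to tune: `alpha=1.0` reduces to pure embedding search, while lower values let graph coverage reorder documents with similar embedding scores.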

### 💡 Key Benefits of Graph + Embeddings:

1. **Enhanced Query Understanding**: Graph concepts expand query context
2. **Better Relevance**: Combining structural and semantic signals
3. **Explainable Results**: Graph relationships provide reasoning trails
4. **Domain Adaptation**: Graph captures domain-specific relationships
5. **Query Expansion**: Automatic concept expansion using graph relationships
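
Query expansion (points 1 and 5) can be sketched as a bounded breadth-first walk from the query entities. The example below uses a plain adjacency dict and toy data rather than the notebook's graph, and `expand_concepts` is a hypothetical helper, not the `concept_expansion` method used above.

```python
from collections import deque

# Toy undirected graph as an adjacency dict (illustrative data only)
TOY_GRAPH = {
    "python": {"tensorflow", "pandas"},
    "tensorflow": {"python", "deep learning"},
    "pandas": {"python"},
    "deep learning": {"tensorflow", "computer vision"},
    "computer vision": {"deep learning"},
}


def expand_concepts(graph, seeds, max_hops=1):
    """Return concepts within max_hops of any seed, excluding the seeds."""
    known_seeds = [s for s in seeds if s in graph]
    distance = {s: 0 for s in known_seeds}
    queue = deque(known_seeds)
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    # Sorted so the output order is stable across runs
    return sorted(c for c, d in distance.items() if d > 0)


print(expand_concepts(TOY_GRAPH, ["python"]))
# → ['pandas', 'tensorflow']
print(expand_concepts(TOY_GRAPH, ["python"], max_hops=2))
# → ['deep learning', 'pandas', 'tensorflow']
```

Capping the hop count matters in practice: each extra hop adds more loosely related concepts, which can dilute the query rather than sharpen it.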

### 🚀 Next Steps:

- **Larger Knowledge Graphs**: Use more sophisticated entity extraction (spaCy, BERT-NER)
- **Dynamic Updates**: Real-time graph updates as new documents are added
- **Graph Embeddings**: Combine node2vec or GraphSAGE with document embeddings
- **Multi-modal**: Extend to images, videos with visual knowledge graphs
- **Federated Search**: Combine multiple knowledge sources and graphs
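
As a starting point for the "Dynamic Updates" item, the sketch below grows a co-occurrence graph one document at a time instead of rebuilding it. `IncrementalGraph` is a hypothetical class, not part of this notebook, and it assumes entities have already been extracted (for instance by something like `graph_search.kg.extract_entities`). The embedding index would need a matching incremental update, which is not shown here.

```python
from collections import defaultdict
from itertools import combinations


class IncrementalGraph:
    """Hypothetical minimal co-occurrence graph that grows per document."""

    def __init__(self):
        self.doc_ids = set()                  # documents already ingested
        self.entity_docs = defaultdict(set)   # entity -> ids of documents mentioning it
        self.edge_weights = defaultdict(int)  # (a, b) with a < b -> co-occurrence count

    def add_document(self, doc_id, entities):
        """Add one document; return False if it was already ingested."""
        if doc_id in self.doc_ids:
            return False  # avoid double-counting edges
        self.doc_ids.add(doc_id)
        unique = sorted(set(entities))
        for entity in unique:
            self.entity_docs[entity].add(doc_id)
        for a, b in combinations(unique, 2):
            self.edge_weights[(a, b)] += 1
        return True

    def neighbours(self, entity):
        """Map each co-occurring entity to its co-occurrence count."""
        result = {}
        for (a, b), weight in self.edge_weights.items():
            if a == entity:
                result[b] = weight
            elif b == entity:
                result[a] = weight
        return result


graph = IncrementalGraph()
graph.add_document("d1", ["python", "tensorflow"])
graph.add_document("d2", ["python", "tensorflow", "keras"])
print(graph.neighbours("python"))
```

Tracking ingested document ids makes `add_document` safe to call twice with the same document, which helps when updates arrive from a stream that may redeliver.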


In [None]:
# Final demonstration: Show the complete pipeline
def complete_pipeline_demo(user_query: str):
    """Demonstrate the complete pipeline from query to results"""
    print(f"🔄 COMPLETE PIPELINE DEMONSTRATION")
    print(f"User Query: '{user_query}'")
    print("=" * 80)
    
    # Step 1: Entity extraction
    entities = graph_search.kg.extract_entities(user_query)
    print(f"\n1️⃣ Entity Extraction: {entities}")
    
    # Step 2: Graph expansion
    if entities:
        expansion = graph_search.concept_expansion(entities[:2])
        # sorted() keeps the displayed concepts stable across runs (set order is arbitrary)
        new_concepts = sorted(set(expansion['expanded_concepts']) - set(entities))
        print(f"\n2️⃣ Graph Expansion: +{len(new_concepts)} concepts")
        print(f"   Added: {', '.join(new_concepts[:5])}{'...' if len(new_concepts) > 5 else ''}")
    
    # Step 3: Embedding generation
    print(f"\n3️⃣ Query Embedding: Generated {embedding_model.encode([user_query]).shape[1]}D vector")
    
    # Step 4: Hybrid search
    results = hybrid_search.hybrid_search(user_query, top_k=3)
    print(f"\n4️⃣ Hybrid Search Results:")
    
    for i, result in enumerate(results, 1):
        print(f"\n   📄 {i}. {result['title']}")
        print(f"      🎯 Final Score: {result['hybrid_score']:.4f}")
        print(f"      📊 Components: Embedding({result['embedding_component']:.3f}) + "
              f"Graph({result['graph_component']:.3f})")
    
    print(f"\n✅ Pipeline completed successfully!")
    return results

# Demo with a complex query
demo_query = "What are the best Python frameworks for building computer vision applications?"
final_results = complete_pipeline_demo(demo_query)

print("\n" + "=" * 80)
print("📚 NOTEBOOK COMPLETE! You now have a comprehensive understanding of:")
print("   🕸️ Graph-based Knowledge Graphs")
print("   🔍 Multiple search methodologies")
print("   🎯 Hybrid embedding + graph approaches")
print("   📊 Performance comparison and analysis")
print("\n🚀 Ready to build your own hybrid knowledge search systems!")