# Module 3: Unstructured Data Processing with Neo4j

This notebook covers extracting, processing, and storing unstructured data in Neo4j for knowledge graph construction and AI applications.

## Learning Objectives
- Extract entities and relationships from text using NLP
- Process documents and create knowledge graphs
- Generate embeddings for semantic analysis
- Build document-based graph structures

## Prerequisites
- Completion of Module 2: Structured Data
- Basic understanding of NLP concepts

## Setup and Dependencies

In [None]:
# Install required packages
!pip install neo4j pandas numpy spacy openai sentence-transformers nltk

## Install Required NLP and ML Packages

For unstructured data processing, we need specialized packages:

- **spacy**: Advanced NLP library for entity extraction and language processing
- **openai**: API access for advanced language models (optional)
- **sentence-transformers**: Pre-trained models for creating semantic embeddings
- **nltk**: Natural Language Toolkit for text preprocessing
- **neo4j, pandas, numpy**: Core data processing and graph database connectivity

These packages enable sophisticated text analysis and knowledge graph construction.

## Import NLP Libraries and Download Language Resources

**Library Setup:**
- **spacy & nltk**: For natural language processing
- **sentence_transformers**: For generating semantic embeddings
- **numpy**: For numerical operations on embeddings
- **typing**: For type hints and better code documentation

**NLTK Downloads:**
- **punkt**: Sentence tokenization models
- **stopwords**: Common words to filter out during processing

**Important Note:** The `nltk.download()` commands download necessary language models and data files for text processing.

In [None]:
# Import libraries
import os
import pandas as pd
import numpy as np
from neo4j import GraphDatabase
import spacy
from sentence_transformers import SentenceTransformer
import openai
import json
import re
from typing import List, Dict, Tuple
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

# Download NLTK data
nltk.download('punkt')
nltk.download('punkt_tab')  # Newer NLTK releases look for this resource when tokenizing sentences
nltk.download('stopwords')

## Establish Neo4j Connection for Knowledge Graphs

**Connection Setup for Text Processing:**
- Load credentials from environment variables for security
- Create a reusable connection function for graph operations
- Test the connection to ensure we can store extracted knowledge
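
The credential-loading step can be made stricter than a silent fallback to a default password. The sketch below is one way to do that; the helper name `require_env` is hypothetical and is not part of the notebook or the Neo4j driver.

```python
import os
from typing import Optional

def require_env(name: str, default: Optional[str] = None) -> str:
    """Return an environment variable, or raise if it is unset and has no default."""
    value = os.environ.get(name, default)
    if value is None:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

# The URI has a harmless local default, so it never raises.
uri = require_env("NEO4J_URI", "bolt://localhost:7687")
print(uri)
```

Calling `require_env("NEO4J_PASSWORD")` without a default would then stop the notebook early when the variable is missing, rather than attempting to connect with a placeholder password.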

**Knowledge Graph Storage:**
Unlike structured data import, unstructured data processing creates more complex graph structures:
- Documents become nodes
- Extracted entities become nodes
- Relationships between entities are discovered through text analysis
- Embeddings enable semantic search and similarity analysis

## Neo4j Connection Setup

## Initialize NLP Models for Text Processing

**SpaCy Model Setup:**
- Download and load the English language model (`en_core_web_sm`)
- This model includes pre-trained capabilities for named entity recognition
- Provides part-of-speech tagging, dependency parsing, and more

**Sentence Transformer Setup:**
- Load `all-MiniLM-L6-v2` - a lightweight but effective embedding model
- This model converts text into 384-dimensional vectors
- Enables semantic similarity calculations and vector search

**Why These Models:**
- SpaCy excels at extracting structured information from text
- Sentence transformers capture semantic meaning for similarity search
- Together they enable both symbolic and semantic knowledge extraction

## Create Sample Document Collection

**Document Data Structure:**
Each document contains:
- **id**: Unique identifier for tracking
- **title**: Document title for human reference
- **content**: The actual text content to be processed

**Sample Content Themes:**
- **AI Research**: Academic and technical content with people, organizations, and technologies
- **Business Partnerships**: Corporate entities, partnerships, and strategic relationships  
- **Technology Innovation**: Products, companies, and technical developments

**Why This Data:**
These documents contain rich entity relationships perfect for demonstrating knowledge graph construction from unstructured text.

In [None]:
# Neo4j connection settings
NEO4J_URI = os.getenv('NEO4J_URI', 'bolt://localhost:7687')
NEO4J_USERNAME = os.getenv('NEO4J_USERNAME', 'neo4j')
NEO4J_PASSWORD = os.getenv('NEO4J_PASSWORD', 'password')

# Create Neo4j driver
driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))

def run_query(query, parameters=None):
    """Execute a Cypher query and return results"""
    with driver.session() as session:
        result = session.run(query, parameters or {})
        return [record.data() for record in result]

# Test connection
print("Testing Neo4j connection...")
result = run_query("RETURN 'Connected to Neo4j!' as message")
print(result[0]['message'])

## Define Entity and Relationship Extraction Functions

**Entity Extraction with SpaCy:**
The `extract_entities()` function uses SpaCy's named entity recognition to find:
- **PERSON**: Names of people (Dr. Sarah Johnson, Prof. Michael Chen)
- **ORG**: Organizations (Stanford University, Google AI, Microsoft)
- **GPE**: Geographic/political entities (countries, cities, states)
- **PRODUCT**: Named products; generic technical terms such as "neural networks" are common nouns and are usually not tagged

**Relationship Extraction via Dependency Parsing:**
The `extract_relationships()` function identifies grammatical relationships:
- **nsubj**: Nominal subject (who is doing something)
- **dobj**: Direct object (what is being done to)
- **pobj**: Object of preposition (relationships through prepositions)
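
To make the role of these labels concrete, here is a minimal sketch that assembles subject-verb-object triples from a hand-written parse. The `Tok` tuple and the parse itself are illustrative assumptions, used so the example runs without spaCy; a real parser may label the sentence differently.

```python
from typing import List, NamedTuple, Tuple

class Tok(NamedTuple):
    text: str
    dep: str   # dependency label, e.g. "nsubj"
    head: int  # index of the governing token
    pos: str   # coarse part-of-speech tag

# Hand-written parse of "Microsoft announced a partnership with OpenAI".
tokens = [
    Tok("Microsoft", "nsubj", 1, "PROPN"),
    Tok("announced", "ROOT", 1, "VERB"),
    Tok("a", "det", 3, "DET"),
    Tok("partnership", "dobj", 1, "NOUN"),
    Tok("with", "prep", 3, "ADP"),
    Tok("OpenAI", "pobj", 4, "PROPN"),
]

def svo_triples(toks: List[Tok]) -> List[Tuple[str, str, str]]:
    """Pair every nsubj of a verb with every dobj of the same verb."""
    triples = []
    for verb_index, verb in enumerate(toks):
        if verb.pos != "VERB":
            continue
        subjects = [t.text for t in toks if t.dep == "nsubj" and t.head == verb_index]
        objects = [t.text for t in toks if t.dep == "dobj" and t.head == verb_index]
        triples.extend((s, verb.text, o) for s in subjects for o in objects)
    return triples

print(svo_triples(tokens))
```

Note that `OpenAI` is a `pobj` whose head is the preposition `with`, not the verb, so reaching it requires walking through the preposition.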

**Learning Point:** 
NLP automates the discovery of entities and relationships, which would be impractical to extract by hand from large document collections.

## Create Knowledge Graph from Extracted Entities

**Graph Construction Pattern:**
The `create_document_graph()` function demonstrates how to build knowledge graphs from NLP output:

1. **Document Nodes**: Each document becomes a node with metadata
2. **Entity Nodes**: Each extracted entity becomes a separate node
3. **CONTAINS_ENTITY Relationships**: Connect documents to their entities
4. **Position Tracking**: Store where entities appear in the source text

**Graph Design Benefits:**
- **Searchable entities**: Find all documents mentioning specific people/organizations
- **Entity relationships**: Discover connections between entities across documents
- **Provenance tracking**: Know exactly where information came from
- **Scalable structure**: Easy to add new documents and entities
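
As the number of entities grows, sending one query per entity becomes slow. The sketch below shows an alternative that batches rows with Cypher's `UNWIND`. It is not what `create_document_graph()` does; the names `BATCH_ENTITY_QUERY` and `build_entity_rows` are hypothetical, and the final call is commented out because it needs a live Neo4j instance.

```python
from typing import Dict, List

# One round trip for all entities of a document.
BATCH_ENTITY_QUERY = """
UNWIND $rows AS row
MATCH (d:Document {id: row.doc_id})
MERGE (e:Entity {text: row.text})
SET e.label = row.label
MERGE (d)-[:CONTAINS_ENTITY {start_pos: row.start_pos, end_pos: row.end_pos}]->(e)
"""

def build_entity_rows(doc_id: str, entities: List[Dict]) -> List[Dict]:
    """Flatten extracted entities into parameter rows for the batched query."""
    return [
        {
            "doc_id": doc_id,
            "text": entity["text"],
            "label": entity["label"],
            "start_pos": entity["start"],
            "end_pos": entity["end"],
        }
        for entity in entities
    ]

rows = build_entity_rows("doc1", [
    {"text": "Stanford University", "label": "ORG", "start": 23, "end": 42},
])
print(rows)
# run_query(BATCH_ENTITY_QUERY, {"rows": rows})  # requires a running database
```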

## Lesson 1: Text Processing and NLP Setup

## Generate Vector Embeddings for Semantic Search

**Text Chunking Strategy:**
- Break documents into smaller, overlapping chunks (100 words with 20-word overlap)
- Overlapping reduces the chance that a concept is split across a chunk boundary
- Smaller chunks provide more precise search results
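
The effect of overlap is easiest to see on a tiny input. This is a minimal sketch with deliberately small sizes (4 words per chunk, 1 word of overlap); the function name `sliding_chunks` is hypothetical.

```python
from typing import List

def sliding_chunks(words: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Return overlapping windows of words; consecutive windows share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

words = [f"w{i}" for i in range(10)]
for chunk in sliding_chunks(words, chunk_size=4, overlap=1):
    print(chunk)
```

Each window starts on the last word of the previous one, so a phrase that straddles a boundary still appears whole in at least one chunk as long as it is no longer than the overlap plus one word.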

**Embedding Generation:**
- Convert each text chunk into a 384-dimensional vector
- These vectors capture semantic meaning beyond keyword matching
- Similar concepts have similar vectors (high cosine similarity)
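
Cosine similarity itself is a short formula: the dot product of two vectors divided by the product of their lengths. The sketch below uses toy 3-dimensional vectors as stand-ins for real 384-dimensional embeddings; the values are invented for illustration and do not come from any model.

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented vectors: the first two point in a similar direction, the third does not.
v_graph_database = [0.9, 0.1, 0.0]
v_knowledge_graph = [0.8, 0.2, 0.1]
v_cooking = [0.0, 0.1, 0.9]

print(cosine_similarity(v_graph_database, v_knowledge_graph))
print(cosine_similarity(v_graph_database, v_cooking))
```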

**Graph Storage:**
- **Chunk nodes**: Store text chunks as separate entities
- **HAS_CHUNK relationships**: Connect documents to their chunks
- **Embedding vectors**: Store as array properties for vector search

**Why This Approach:**
Chunking + embeddings enables finding relevant content even when exact keywords don't match the query.

## Create Vector Search Index for Fast Similarity Queries

**Vector Index Configuration:**
- **384 dimensions**: Matches our sentence transformer model output
- **Cosine similarity**: A common choice for text embeddings (compares the angle between vectors and ignores their length)
- **Index name**: `chunk_embeddings` for easy reference

**Why Vector Indexes Matter:**
- **Performance**: An approximate nearest-neighbor index avoids comparing the query against every stored vector
- **Similarity function**: Cosine similarity compares direction rather than magnitude, which suits sentence embeddings
- **Scalability**: Lookups stay fast as the number of stored embeddings grows

**Error Handling:**
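Because the statement uses `IF NOT EXISTS`, re-running the cell does not fail when the index is already there. The `try`/`except` block catches other failures, such as a Neo4j version without vector index support.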

## Implement and Test Semantic Search

**Semantic Search Process:**
1. **Query embedding**: Convert search query to vector using same model
2. **Vector similarity**: Use Neo4j's vector search to find similar chunks
3. **Graph enrichment**: Include document context and metadata
4. **Ranking**: Results ordered by similarity score

**Search Query Breakdown:**
- `db.index.vector.queryNodes()`: Neo4j's vector search procedure, invoked with `CALL`
- Returns chunks ranked by cosine similarity to query
- Joins with document metadata for context
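
Conceptually, the ranking step is a top-k selection by similarity. The sketch below does this by brute force over an in-memory dictionary, which is only a mental model: the vector index answers the same question approximately and without scanning every chunk. The chunk IDs and 2-dimensional vectors are invented for illustration.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query: Sequence[float], chunks: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Score every chunk against the query and keep the k best, highest first."""
    scored = ((chunk_id, cosine(query, vector)) for chunk_id, vector in chunks.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

chunks = {
    "doc1_chunk_0": [0.9, 0.1],
    "doc2_chunk_0": [0.5, 0.5],
    "doc3_chunk_0": [0.1, 0.9],
}
for chunk_id, score in top_k([1.0, 0.0], chunks, k=2):
    print(chunk_id, round(score, 3))
```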

**Test Search:**
Query "AI research and neural networks" should find relevant content even if exact phrases don't appear in the text, demonstrating semantic understanding beyond keyword matching.

In [None]:
# Load spaCy model for NLP processing
!python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Initialize sentence transformer for embeddings
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

print("NLP models loaded successfully!")

## Discover Entity Co-occurrence Patterns

**Co-occurrence Analysis:**
This function finds entities that appear together in the same documents:
- **Pattern detection**: Entities mentioned in the same document likely have relationships
- **Frequency counting**: How often entities co-occur indicates relationship strength
- **Graph relationships**: Create CO_OCCURS_WITH edges between related entities
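
The counting logic can be sketched in plain Python before it is expressed in Cypher. The entity names below are invented placeholders, not output from the sample documents.

```python
from collections import Counter
from itertools import combinations

# Invented placeholder data: the set of distinct entities found in each document.
doc_entities = {
    "d1": {"Alpha Corp", "Beta Labs", "Gamma University"},
    "d2": {"Alpha Corp", "Beta Labs"},
    "d3": {"Gamma University", "Delta Inc"},
}

cooccurrence = Counter()
for entities in doc_entities.values():
    # Sorting gives each unordered pair a single canonical form,
    # the same role `e1.text < e2.text` plays in a Cypher query.
    for e1, e2 in combinations(sorted(entities), 2):
        cooccurrence[(e1, e2)] += 1

for pair, count in cooccurrence.most_common():
    print(pair, count)
```

Using sets per document means an entity mentioned several times in one document still contributes only one co-occurrence for that document.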

**Knowledge Discovery:**
- **Automatic relationship detection**: No manual relationship definition needed
- **Frequency as a signal**: Higher co-occurrence counts suggest, but do not prove, a stronger relationship
- **Network effects**: Entities become connected through shared contexts

**Applications:**
- **Recommendation systems**: Find related entities to suggest
- **Knowledge exploration**: Discover unexpected connections
- **Graph analysis**: Enable centrality and community detection algorithms

## Create Entity Similarity Using Context Embeddings

**Context-Based Similarity:**
Instead of just co-occurrence counting, this approach uses semantic similarity:
1. **Context extraction**: Gather text around each entity mention
2. **Context embeddings**: Generate vectors representing each entity's typical context
3. **Similarity calculation**: Use cosine similarity between context vectors
4. **Relationship creation**: Connect entities with high semantic similarity
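
The context-extraction step amounts to slicing a window around each mention while staying inside the text. A minimal sketch, assuming character offsets like those stored on `CONTAINS_ENTITY`; the helper name `context_window` is hypothetical.

```python
def context_window(text: str, start: int, end: int, margin: int = 50) -> str:
    """Return the mention plus up to `margin` characters on each side."""
    lo = max(0, start - margin)        # clamp so the window never starts before the text
    hi = min(len(text), end + margin)  # clamp so it never runs past the end
    return text[lo:hi]

text = ("Dr. Sarah Johnson from Stanford University published "
        "groundbreaking research on neural networks.")
mention = "Stanford University"
start = text.index(mention)
end = start + len(mention)

print(context_window(text, start, end, margin=10))
```

The clamping matters for mentions near the beginning of a document, where `start - margin` would otherwise be negative.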

**Advanced Concept:**
This creates a more nuanced understanding of entity relationships:
- Entities are similar if they appear in similar contexts
- Goes beyond simple co-occurrence to semantic relatedness
- Enables discovery of conceptually related entities even from different documents

**Threshold Tuning:**
The 0.7 similarity threshold can be adjusted based on desired precision vs. recall.
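
One way to choose a threshold is to count how many pairs survive at several candidate values. The similarity scores below are invented for illustration; the helper name `pairs_above` is hypothetical.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]

def pairs_above(similarities: Dict[Pair, float], threshold: float) -> List[Pair]:
    """Return the entity pairs whose similarity exceeds the threshold."""
    return [pair for pair, score in similarities.items() if score > threshold]

# Invented scores, not measured from any model.
similarities = {
    ("A", "B"): 0.91,
    ("A", "C"): 0.74,
    ("B", "C"): 0.66,
    ("A", "D"): 0.52,
    ("C", "D"): 0.31,
}

for threshold in (0.5, 0.7, 0.9):
    print(threshold, len(pairs_above(similarities, threshold)))
```

A higher threshold keeps fewer, more confident links (higher precision); a lower one keeps more links at the cost of weaker ones (higher recall).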

In [None]:
# Sample unstructured text data
sample_documents = [
    {
        "id": "doc1",
        "title": "AI Research Breakthrough",
        "content": "Dr. Sarah Johnson from Stanford University published groundbreaking research on neural networks. The study, conducted in collaboration with Google AI, demonstrates significant improvements in natural language processing. The research team, including Prof. Michael Chen from MIT, used advanced transformer architectures to achieve state-of-the-art results."
    },
    {
        "id": "doc2",
        "title": "Tech Company Partnership",
        "content": "Microsoft announced a strategic partnership with OpenAI to develop advanced AI systems. The collaboration focuses on integrating GPT models into Microsoft's suite of productivity tools. CEO Satya Nadella emphasized the importance of responsible AI development in the announcement."
    },
    {
        "id": "doc3",
        "title": "Database Innovation",
        "content": "Neo4j released new graph database capabilities for AI applications. The update includes enhanced vector search functionality and improved integration with machine learning workflows. CTO Jim Webber highlighted the benefits for knowledge graph construction and graph neural networks."
    }
]

print(f"Loaded {len(sample_documents)} sample documents")
for doc in sample_documents[:1]:
    print(f"\nDocument: {doc['title']}")
    print(f"Content: {doc['content'][:100]}...")

## Complete Knowledge Graph Construction

**Scale Up Processing:**
Now we process all remaining documents to build a comprehensive knowledge graph:
- Extract entities from each document
- Create document and entity nodes
- Build relationships between entities

**Knowledge Graph Growth:**
As we add more documents:
- New entities are discovered
- Existing entities gain more connections
- Co-occurrence patterns become more reliable
- The graph becomes richer and more connected

**Processing Feedback:**
The loop provides feedback on entities found per document, helping you understand the richness of your knowledge extraction process.

## Comprehensive Knowledge Graph Analysis

**Multi-Perspective Analysis:**
The `analyze_knowledge_graph()` function provides insights from multiple angles:

1. **Document Statistics**: How many entities and chunks per document
2. **Entity Connectivity**: Which entities are most connected (central in the graph)
3. **Graph Structure**: Overall composition of nodes by type

**Key Metrics:**
- **Entity count**: Richness of information extraction
- **Connection count**: How well entities are linked (graph density)
- **Node distribution**: Balance of different node types
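
Graph density can be computed directly from node and edge counts. For an undirected graph with N nodes and E edges it is 2E / (N(N - 1)), the fraction of possible pairs that are actually connected. The sketch below assumes undirected, simple relationships such as `CO_OCCURS_WITH`; the function name is hypothetical.

```python
def graph_density(node_count: int, edge_count: int) -> float:
    """Fraction of possible undirected node pairs that are connected."""
    if node_count < 2:
        return 0.0  # density is undefined for fewer than two nodes; report 0
    possible_pairs = node_count * (node_count - 1) / 2
    return edge_count / possible_pairs

print(graph_density(node_count=5, edge_count=4))
```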

**Business Value:**
These metrics help assess the quality and completeness of your knowledge extraction pipeline and identify the most important entities in your domain.

## Advanced Hybrid Search: Combining Semantic and Graph Intelligence

**Multi-Modal Search Approach:**
The `hybrid_search()` function demonstrates how to combine different search methods:

1. **Semantic search**: Find content similar to the query using embeddings
2. **Graph traversal**: Discover related entities through graph relationships
3. **Combined results**: Merge both approaches for comprehensive results

**Hybrid Search Benefits:**
- **Semantic matching**: Finds conceptually related content
- **Graph enrichment**: Discovers connected entities and concepts
- **Comprehensive coverage**: Captures both direct matches and related information
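
One possible way to merge the two result sets is to attach the graph-derived entities to each semantic hit by document title. This is a sketch under the assumption that the result rows use the keys `document_title`, `document`, and `entity`; the function name `merge_results` and the sample rows are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

def merge_results(semantic: List[Dict], graph: List[Dict]) -> List[Dict]:
    """Attach the entities found for each document to its semantic hits."""
    entities_by_doc = defaultdict(list)
    for row in graph:
        entities_by_doc[row["document"]].append(row["entity"])
    return [
        {**hit, "entities": entities_by_doc.get(hit["document_title"], [])}
        for hit in semantic
    ]

# Invented sample rows shaped like the two result lists.
semantic_hits = [
    {"document_title": "Doc A", "chunk_text": "some text", "score": 0.8},
    {"document_title": "Doc B", "chunk_text": "other text", "score": 0.6},
]
graph_rows = [
    {"document": "Doc A", "entity": "Alpha Corp"},
    {"document": "Doc A", "entity": "Beta Labs"},
]

for merged in merge_results(semantic_hits, graph_rows):
    print(merged["document_title"], merged["entities"])
```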

**Real-World Application:**
This pattern is used in advanced search engines, recommendation systems, and knowledge discovery platforms to provide more complete and contextual results than either approach alone.

## Lesson 2: Entity Extraction and Relationship Discovery

## Module Completion and Resource Cleanup

**Optional Cleanup:**
The commented-out `driver.close()` line shows how to close the Neo4j connection once you are completely done with the session.

**Why Keep Connection Open:**
In Jupyter notebooks, you might want to keep the connection open for additional experimentation and queries after completing the main module exercises.

**Production Consideration:**
In production applications, always ensure proper resource cleanup with try-finally blocks or context managers.
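
A minimal sketch of the context-manager approach, using `contextlib.closing` from the standard library. `FakeDriver` is a stand-in invented so the example runs without a database; the only assumption about a real driver is that it exposes a `close()` method.

```python
from contextlib import closing

class FakeDriver:
    """Stand-in for a database driver; it only records whether close() was called."""
    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True

fake = FakeDriver()
try:
    with closing(fake) as drv:
        # Work with the driver here; close() runs even if this block raises.
        raise ValueError("simulated failure during a query")
except ValueError:
    pass

print(fake.closed)
```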

In [None]:
def extract_entities(text: str) -> List[Dict]:
    """Extract named entities from text using spaCy"""
    doc = nlp(text)
    entities = []
    
    for ent in doc.ents:
        entities.append({
            "text": ent.text,
            "label": ent.label_,
            "start": ent.start_char,
            "end": ent.end_char,
            "description": spacy.explain(ent.label_)
        })
    
    return entities

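def extract_relationships(text: str) -> List[Dict]:
    """Extract basic subject-verb-object relationships using dependency parsing"""
    doc = nlp(text)
    relationships = []
    
    for verb in doc:
        if verb.pos_ != 'VERB':
            continue
        # Subjects attached directly to the verb
        subjects = [t for t in verb.children if t.dep_ in ('nsubj', 'nsubjpass')]
        # Direct objects, plus objects of prepositions hanging off the verb
        objects = [t for t in verb.children if t.dep_ == 'dobj']
        for prep in verb.children:
            if prep.dep_ == 'prep':
                objects.extend(t for t in prep.children if t.dep_ == 'pobj')
        
        for subj in subjects:
            for obj in objects:
                relationships.append({
                    "subject": subj.text,
                    "predicate": verb.text,
                    "object": obj.text,
                    "dependency": subj.dep_,        # how the subject attaches to the verb
                    "object_dependency": obj.dep_   # 'dobj' or 'pobj'
                })
    
    return relationships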

# Process sample document
sample_text = sample_documents[0]['content']
entities = extract_entities(sample_text)
relationships = extract_relationships(sample_text)

print("Extracted Entities:")
for entity in entities:
    print(f"- {entity['text']} ({entity['label']}) - {entity['description']}")

print("\nExtracted Relationships:")
for rel in relationships[:5]:  # Show first 5
    print(f"- {rel['subject']} --{rel['dependency']}--> {rel['predicate']}")

In [None]:
# Create document nodes and entity relationships in Neo4j
def create_document_graph(doc_data: Dict, entities: List[Dict]):
    """Create document and entity nodes with relationships"""
    
    # Create document node
    doc_query = """
    MERGE (d:Document {id: $doc_id})
    SET d.title = $title,
        d.content = $content,
        d.created_at = datetime()
    """
    
    run_query(doc_query, {
        'doc_id': doc_data['id'],
        'title': doc_data['title'],
        'content': doc_data['content']
    })
    
    # Create entity nodes and relationships
    for entity in entities:
        entity_query = """
        MATCH (d:Document {id: $doc_id})
        MERGE (e:Entity {text: $entity_text})
        SET e.label = $entity_label,
            e.description = $entity_description
        MERGE (d)-[:CONTAINS_ENTITY {
            start_pos: $start_pos,
            end_pos: $end_pos
        }]->(e)
        """
        
        run_query(entity_query, {
            'doc_id': doc_data['id'],
            'entity_text': entity['text'],
            'entity_label': entity['label'],
            'entity_description': entity['description'],
            'start_pos': entity['start'],
            'end_pos': entity['end']
        })

# Process first document
create_document_graph(sample_documents[0], entities)
print(f"Created graph for document: {sample_documents[0]['title']}")

## Lesson 3: Vector Embeddings and Semantic Analysis

In [None]:
def generate_embeddings(texts: List[str]) -> np.ndarray:
    """Generate embeddings for a list of texts"""
    embeddings = embedding_model.encode(texts)
    return embeddings

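def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into overlapping chunks of whitespace-separated words"""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    
    words = text.split()
    chunks = []
    
    for i in range(0, len(words), chunk_size - overlap):
        chunks.append(' '.join(words[i:i + chunk_size]))
        if i + chunk_size >= len(words):
            break  # This window reached the end; a further chunk would only repeat the tail
    
    return chunks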

# Process all documents with embeddings
for doc in sample_documents:
    # Create text chunks
    chunks = chunk_text(doc['content'], chunk_size=100, overlap=20)
    
    # Generate embeddings for chunks
    chunk_embeddings = generate_embeddings(chunks)
    
    # Store chunks and embeddings in Neo4j
    for i, (chunk, embedding) in enumerate(zip(chunks, chunk_embeddings)):
        chunk_query = """
        MATCH (d:Document {id: $doc_id})
        MERGE (c:Chunk {id: $chunk_id})
        SET c.text = $chunk_text,
            c.position = $position,
            c.embedding = $embedding
        MERGE (d)-[:HAS_CHUNK]->(c)
        """
        
        run_query(chunk_query, {
            'doc_id': doc['id'],
            'chunk_id': f"{doc['id']}_chunk_{i}",
            'chunk_text': chunk,
            'position': i,
            'embedding': embedding.tolist()
        })
    
    print(f"Processed {len(chunks)} chunks for document: {doc['title']}")

In [None]:
# Create vector search index
index_query = """
CREATE VECTOR INDEX chunk_embeddings IF NOT EXISTS
FOR (c:Chunk) ON (c.embedding)
OPTIONS {
  indexConfig: {
    `vector.dimensions`: 384,
    `vector.similarity_function`: 'cosine'
  }
}
"""

try:
    run_query(index_query)
    print("Vector index created successfully")
except Exception as e:
    print(f"Could not create vector index: {e}")

In [None]:
# Semantic search function
def semantic_search(query_text: str, top_k: int = 3) -> List[Dict]:
    """Perform semantic search using vector similarity"""
    # Generate query embedding
    query_embedding = generate_embeddings([query_text])[0]
    
    # Search similar chunks
    search_query = """
    CALL db.index.vector.queryNodes('chunk_embeddings', $top_k, $query_embedding)
    YIELD node AS chunk, score
    MATCH (d:Document)-[:HAS_CHUNK]->(chunk)
    RETURN d.title as document_title, 
           chunk.text as chunk_text,
           score,
           chunk.position as position
    ORDER BY score DESC
    """
    
    results = run_query(search_query, {
        'query_embedding': query_embedding.tolist(),
        'top_k': top_k
    })
    
    return results

# Test semantic search
query = "AI research and neural networks"
search_results = semantic_search(query, top_k=3)

print(f"Search results for: '{query}'\n")
for i, result in enumerate(search_results, 1):
    print(f"{i}. Document: {result['document_title']}")
    print(f"   Score: {result['score']:.4f}")
    print(f"   Text: {result['chunk_text'][:100]}...\n")

## Lesson 4: Advanced Text Processing and Knowledge Extraction

In [None]:
# Advanced entity linking and co-occurrence analysis
def find_entity_cooccurrences() -> List[Dict]:
    """Find entities that frequently co-occur in documents"""
    cooccurrence_query = """
    MATCH (d:Document)-[:CONTAINS_ENTITY]->(e1:Entity),
          (d)-[:CONTAINS_ENTITY]->(e2:Entity)
    WHERE e1.text < e2.text  // Count each unordered pair once
    WITH e1, e2, count(DISTINCT d) as cooccurrence_count  // Repeated mentions in one document count once
    MERGE (e1)-[r:CO_OCCURS_WITH]-(e2)
    SET r.frequency = cooccurrence_count
    RETURN e1.text as entity1, e2.text as entity2, cooccurrence_count
    ORDER BY cooccurrence_count DESC
    """
    
    return run_query(cooccurrence_query)

# Create entity co-occurrence relationships
cooccurrences = find_entity_cooccurrences()
print("Entity Co-occurrences:")
for co in cooccurrences[:5]:
    print(f"- {co['entity1']} <-> {co['entity2']} (frequency: {co['cooccurrence_count']})")

In [None]:
# Create entity similarity based on embeddings
def create_entity_similarities():
    """Create entity similarity relationships based on context embeddings"""
    
    # Get all entities and their contexts
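    entity_query = """
    MATCH (d:Document)-[r:CONTAINS_ENTITY]->(e:Entity)
    // substring() takes (text, start, length); clamp start so it is never negative
    WITH e, d, r, CASE WHEN r.start_pos > 50 THEN r.start_pos - 50 ELSE 0 END AS ctx_start
    WITH e, collect(substring(d.content, ctx_start, r.end_pos + 50 - ctx_start)) as contexts
    RETURN e.text as entity_text, contexts
    """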
    
    entities_data = run_query(entity_query)
    
    # Generate embeddings for entity contexts
    entity_embeddings = {}
    for entity_data in entities_data:
        entity_text = entity_data['entity_text']
        contexts = entity_data['contexts']
        
        # Combine contexts and generate embedding
        combined_context = ' '.join(contexts)
        embedding = generate_embeddings([combined_context])[0]
        entity_embeddings[entity_text] = embedding
    
    # Calculate similarities and create relationships
    entities = list(entity_embeddings.keys())
    for i, entity1 in enumerate(entities):
        for entity2 in entities[i+1:]:
            # Calculate cosine similarity
            similarity = np.dot(
                entity_embeddings[entity1], 
                entity_embeddings[entity2]
            ) / (
                np.linalg.norm(entity_embeddings[entity1]) * 
                np.linalg.norm(entity_embeddings[entity2])
            )
            
            # Create similarity relationship if above threshold
            if similarity > 0.7:
                similarity_query = """
                MATCH (e1:Entity {text: $entity1}),
                      (e2:Entity {text: $entity2})
                MERGE (e1)-[r:SIMILAR_TO]-(e2)
                SET r.similarity = $similarity
                """
                
                run_query(similarity_query, {
                    'entity1': entity1,
                    'entity2': entity2,
                    'similarity': float(similarity)
                })

create_entity_similarities()
print("Created entity similarity relationships")

## Hands-on Exercise: Complete Knowledge Graph Construction

In [None]:
# Process all remaining documents
for doc in sample_documents[1:]:
    # Extract entities
    entities = extract_entities(doc['content'])
    
    # Create document graph
    create_document_graph(doc, entities)
    
    print(f"Processed document: {doc['title']} ({len(entities)} entities)")

# Recreate co-occurrences with all documents
cooccurrences = find_entity_cooccurrences()
print(f"\nTotal entity co-occurrences: {len(cooccurrences)}")

In [None]:
# Knowledge graph analysis and insights
def analyze_knowledge_graph():
    """Analyze the constructed knowledge graph"""
    
    # Document statistics
    doc_stats = run_query("""
    MATCH (d:Document)
    OPTIONAL MATCH (d)-[:CONTAINS_ENTITY]->(e:Entity)
    OPTIONAL MATCH (d)-[:HAS_CHUNK]->(c:Chunk)
    RETURN d.title as document,
           count(DISTINCT e) as entity_count,
           count(DISTINCT c) as chunk_count
    ORDER BY entity_count DESC
    """)
    
    print("Document Analysis:")
    for stat in doc_stats:
        print(f"- {stat['document']}: {stat['entity_count']} entities, {stat['chunk_count']} chunks")
    
    # Most connected entities
    entity_stats = run_query("""
    MATCH (e:Entity)
    OPTIONAL MATCH (e)-[r:CO_OCCURS_WITH]-(other)
    WITH e, count(r) as connection_count
    RETURN e.text as entity, e.label as type, connection_count
    ORDER BY connection_count DESC
    LIMIT 10
    """)
    
    print("\nMost Connected Entities:")
    for stat in entity_stats:
        print(f"- {stat['entity']} ({stat['type']}): {stat['connection_count']} connections")
    
    # Graph structure overview
    graph_stats = run_query("""
    MATCH (n)
    RETURN labels(n)[0] as node_type, count(n) as count
    ORDER BY count DESC
    """)
    
    print("\nGraph Structure:")
    for stat in graph_stats:
        print(f"- {stat['node_type']}: {stat['count']} nodes")

analyze_knowledge_graph()

In [None]:
# Test comprehensive search combining text and graph
def hybrid_search(query: str, top_k: int = 5):
    """Combine semantic search with graph traversal"""
    
    # First, find semantically similar chunks
    semantic_results = semantic_search(query, top_k)
    
    # Then, find related entities through graph traversal
    graph_query = """
    CALL db.index.vector.queryNodes('chunk_embeddings', $top_k, $query_embedding)
    YIELD node AS chunk, score
    MATCH (d:Document)-[:HAS_CHUNK]->(chunk)
    MATCH (d)-[:CONTAINS_ENTITY]->(e:Entity)
    OPTIONAL MATCH (e)-[:CO_OCCURS_WITH]-(related:Entity)
    RETURN DISTINCT d.title as document,
           e.text as entity,
           e.label as entity_type,
           collect(DISTINCT related.text) as related_entities,
           max(score) as relevance_score
    ORDER BY relevance_score DESC
    LIMIT $top_k
    """
    
    query_embedding = generate_embeddings([query])[0]
    graph_results = run_query(graph_query, {
        'query_embedding': query_embedding.tolist(),
        'top_k': top_k
    })
    
    return {
        'semantic_results': semantic_results,
        'graph_results': graph_results
    }

# Test hybrid search
hybrid_results = hybrid_search("machine learning and AI partnerships")

print("Hybrid Search Results:")
print("\nSemantic Matches:")
for result in hybrid_results['semantic_results'][:3]:
    print(f"- {result['document_title']}: {result['chunk_text'][:80]}... (score: {result['score']:.3f})")

print("\nGraph-Enhanced Results:")
for result in hybrid_results['graph_results'][:3]:
    related = ', '.join(result['related_entities'][:3]) if result['related_entities'] else 'None'
    print(f"- Entity: {result['entity']} ({result['entity_type']})")
    print(f"  Document: {result['document']}")
    print(f"  Related: {related}")
    print(f"  Score: {result['relevance_score']:.3f}\n")

## Module Summary and Next Steps

In this module, you learned to:
- Extract entities and relationships from unstructured text
- Create knowledge graphs from document collections
- Generate and use vector embeddings for semantic analysis
- Implement hybrid search combining text similarity and graph traversal
- Analyze knowledge graph structures

### Key Takeaways
- NLP techniques enable automated knowledge extraction from text
- Vector embeddings provide semantic understanding beyond keyword matching
- Graph structures reveal relationships and patterns in unstructured data
- Hybrid approaches combine the strengths of multiple search methods

### Next Module
Module 4: Graph Analytics - Learn how to apply graph algorithms for advanced analytics and insights.

In [None]:
# Cleanup (optional)
# driver.close()
print("Module 3 completed successfully!")