<a href="https://colab.research.google.com/github/nelslindahlx/NLP/blob/master/Python_Search_Engine_Outline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Code Summary

This Google Colab notebook demonstrates how to create and use a fact-based knowledge graph search engine. The search engine indexes the `.txt` files in a specified directory, extracts entities and relationships using spaCy, and builds a knowledge graph using NetworkX. Queries are matched on the named entities they contain, and the matching documents are returned ranked by how many of the query's entities each one is linked to.

#### Steps:

1. **Install Necessary Libraries**:
   - Install `spacy` and `networkx` packages.
   - Download the spaCy model `en_core_web_sm`.

2. **Import Libraries**:
   - Import required libraries: `os`, `spacy`, `networkx`, and `defaultdict` from `collections`.

3. **Define the Search Engine Class**:
   - Create the `FactBasedKnowledgeGraphSearchEngine` class with methods to initialize the engine, crawl the directory for text files, extract entities and relationships, index documents, and search for queries.

4. **Prepare Text Files**:
   - Use Colab's file upload widget to upload text files.
   - Save the uploaded files into a directory named `text_files`.

5. **Initialize the Search Engine**:
   - Initialize an instance of the search engine with the `text_files` directory.

6. **Search for a Query**:
   - Define a search query.
   - Perform the search using the search engine instance.
   - Print the search results.

#### Detailed Breakdown:

1. **Install Necessary Libraries**:
   - Install the required Python packages (`spacy` and `networkx`) and download the spaCy model for English language processing.

2. **Import Libraries**:
   - Import the necessary modules for file operations, natural language processing, graph operations, and default dictionary handling.

3. **Define the Search Engine Class**:
   - **Initialization (`__init__`)**: Initializes the search engine with a directory path, a NetworkX graph, a documents dictionary, and a spaCy NLP model. Calls the `_crawl_directory` method to index documents.
   - **Crawling Directory (`_crawl_directory`)**: Reads all text files in the specified directory, extracts content, and calls `_index_document` to index them.
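   - **Extracting Entities and Relationships (`_extract_entities_and_relationships`)**: Uses spaCy to extract named entities and simple subject-verb-object relationships from the document content. A relationship is recorded for each `attr` or `dobj` token whose head has an `nsubj` child to its left, and it is stored as a `(subject, object, verb)` tuple.
   - **Indexing Document (`_index_document`)**: Adds each entity to the knowledge graph as a node and links it to a document node named `doc_<id>` with a `contains` edge. Each relationship becomes an edge between its subject and object, labeled with the verb.
   - **Searching (`search`)**: Processes the search query to extract entities using spaCy, scores each document by the number of query entities linked to it in the knowledge graph, and returns a list of document filenames sorted from highest to lowest score. Entity matching is exact and case-sensitive, and a query with no recognized entities returns an empty list.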

4. **Prepare Text Files**:
   - Use the `files.upload()` method to upload text files directly into the Colab environment.
   - Move the uploaded files to a directory named `text_files`.

5. **Initialize the Search Engine**:
   - Create an instance of the `FactBasedKnowledgeGraphSearchEngine` with the `text_files` directory as input.

6. **Search for a Query**:
   - Define a search query string.
   - Use the `search` method of the search engine instance to perform the search.
   - Print the filenames of the documents that match the query.

This notebook walks through setting up and using a fact-based knowledge graph search engine, with one step per cell.
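
The relationship-extraction rule described in the breakdown can be illustrated without spaCy. The sketch below is a simplified stand-in, not the notebook's code: the `Token` class, the `extract_relationships` function, and the hand-annotated parse of the sample sentence are all assumptions made for illustration, and a real spaCy parse may label tokens differently.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Token:
    """Hypothetical stand-in for a parsed token (not a spaCy object)."""
    text: str
    dep: str             # dependency label, e.g. "nsubj" or "dobj"
    head: Optional[int]  # index of the head token; None for the root


def extract_relationships(tokens: List[Token]) -> List[Tuple[str, str, str]]:
    """Return (subject, object, verb) triples using the rule from the breakdown."""
    triples = []
    for tok in tokens:
        # Only direct objects and attributes can close a relationship.
        if tok.dep not in ("attr", "dobj") or tok.head is None:
            continue
        verb_idx = tok.head
        # Subjects are "nsubj" children that sit to the left of the verb.
        subjects = [
            t for i, t in enumerate(tokens)
            if i < verb_idx and t.head == verb_idx and t.dep == "nsubj"
        ]
        if subjects:
            triples.append((subjects[0].text, tok.text, tokens[verb_idx].text))
    return triples


# Hand-annotated parse of "Marie Curie discovered polonium" (assumed labels).
sentence = [
    Token("Marie", "compound", 1),
    Token("Curie", "nsubj", 2),
    Token("discovered", "ROOT", None),
    Token("polonium", "dobj", 2),
]
triples = extract_relationships(sentence)
print(triples)
```

Under these assumed labels the subject is the single token `Curie`, not the full name, because the rule works on individual tokens. Multi-word names enter the graph separately, through the named-entity list.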

Install Necessary Libraries

In [1]:
# Install necessary libraries
!pip install spacy networkx

# Download the spaCy model
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


Import Libraries

In [1]:
import os
import spacy
import networkx as nx
from collections import defaultdict

Define the Search Engine Class

In [2]:
class FactBasedKnowledgeGraphSearchEngine:
    def __init__(self, directory: str):
        """
        Initialize the search engine with a directory of text files.

        :param directory: Directory containing text files to index
        """
        self.directory = directory
        self.graph = nx.Graph()
        self.documents = {}
        self.nlp = spacy.load("en_core_web_sm")
        self._crawl_directory()

    def _crawl_directory(self):
        """
        Crawl the directory and index all text files.
        """
        for filename in os.listdir(self.directory):
            if filename.endswith(".txt"):
                file_path = os.path.join(self.directory, filename)
                try:
                    with open(file_path, 'r', encoding='utf-8') as file:
                        content = file.read()
                        doc_id = len(self.documents)
                        self.documents[doc_id] = filename
                        self._index_document(content, doc_id)
                except Exception as e:
                    print(f"Error reading file {file_path}: {e}")

    def _extract_entities_and_relationships(self, content: str):
        """
        Extract entities and relationships from the document content.

        :param content: Text content of the document
        :return: Tuple of entities and relationships
        """
        doc = self.nlp(content)
        entities = [(ent.text, ent.label_) for ent in doc.ents]
        relationships = []

        for sentence in doc.sents:
            for token in sentence:
                if token.dep_ in ("attr", "dobj"):
                    subject = [w for w in token.head.lefts if w.dep_ == "nsubj"]
                    if subject:
                        subject = subject[0]
                        relationships.append((subject.text, token.text, token.head.text))

        return entities, relationships

    def _index_document(self, content: str, doc_id: int):
        """
        Index the document by adding entities and relationships to the knowledge graph.

        :param content: Text content of the document
        :param doc_id: Unique identifier for the document
        """
        entities, relationships = self._extract_entities_and_relationships(content)
        for entity, label in entities:
            if not self.graph.has_node(entity):
                self.graph.add_node(entity, type=label)
            self.graph.add_edge(entity, f'doc_{doc_id}', type='contains')

        for sub, obj, verb in relationships:
            if not self.graph.has_node(sub):
                self.graph.add_node(sub, type='entity')
            if not self.graph.has_node(obj):
                self.graph.add_node(obj, type='entity')
            self.graph.add_edge(sub, obj, type=verb)

    def search(self, query: str):
        """
        Search for documents that match the query.

        :param query: Search query
        :return: List of document filenames that match the query
        """
        query_doc = self.nlp(query)
        query_entities = [ent.text for ent in query_doc.ents]

        if not query_entities:
            return []

        doc_scores = defaultdict(int)
        for entity in query_entities:
            if self.graph.has_node(entity):
                for neighbor in self.graph.neighbors(entity):
                    if neighbor.startswith('doc_'):
                        doc_id = int(neighbor.split('_')[1])
                        doc_scores[doc_id] += 1

        sorted_docs = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)
        return [self.documents[doc_id] for doc_id, _ in sorted_docs]
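
The scoring step in `search` can be tried in isolation. The following is a minimal sketch that swaps the NetworkX graph for a plain dictionary so it runs without any dependencies. The `neighbors` and `documents` data and the `rank_documents` helper are hypothetical and exist only for this illustration.

```python
from collections import defaultdict

# Hypothetical index: entity text -> neighbouring nodes (illustrative data only).
neighbors = {
    "Paris": {"doc_0", "doc_1", "France"},
    "France": {"doc_0", "Paris"},
    "Berlin": {"doc_2"},
}
documents = {0: "paris.txt", 1: "travel.txt", 2: "berlin.txt"}


def rank_documents(query_entities):
    """Count, per document, how many query entities link to it; best first."""
    scores = defaultdict(int)
    for entity in query_entities:
        # Unknown entities contribute nothing, as with graph.has_node().
        for node in neighbors.get(entity, ()):
            if node.startswith("doc_"):
                scores[int(node.split("_")[1])] += 1
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [documents[doc_id] for doc_id, _ in ranked]


results = rank_documents(["Paris", "France"])
print(results)
```

Here `paris.txt` ranks first because both query entities link to it, while `travel.txt` matches only one. As in the class above, lookups are exact, so `"paris"` in lowercase would match nothing.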

Prepare Text Files

In [None]:
import os

from google.colab import files

# Upload text files from your machine (Colab-only widget)
uploaded = files.upload()

# Move the uploaded files into a directory called 'text_files'
os.makedirs('text_files', exist_ok=True)
for filename in uploaded.keys():
    os.rename(filename, f'text_files/{filename}')
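
Outside Colab, `files.upload()` is not available. One possible alternative, sketched here with placeholder data, is to write a few sample files into `text_files` directly. The file names and sentences below are hypothetical and only serve to give the indexer something to read.

```python
from pathlib import Path

# Hypothetical sample documents; the names and sentences are placeholders.
samples = {
    "curie.txt": "Marie Curie discovered polonium in Paris.",
    "einstein.txt": "Albert Einstein developed the theory of relativity.",
}

text_dir = Path("text_files")
text_dir.mkdir(exist_ok=True)
for name, body in samples.items():
    # UTF-8 matches the encoding the search engine uses when reading files.
    (text_dir / name).write_text(body, encoding="utf-8")

created = sorted(p.name for p in text_dir.glob("*.txt"))
print(created)
```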

Initialize the Search Engine

In [None]:
# Initialize the search engine with the directory of text files
directory = 'text_files'
search_engine = FactBasedKnowledgeGraphSearchEngine(directory)

Search for a Query

In [None]:
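# Define your query. It must contain at least one named entity that spaCy
# recognizes (for example a person, organization, or place) and that also
# appears in the indexed files; otherwise search() returns an empty list.
# Replace this placeholder accordingly.
query = 'example search query'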

# Perform the search
results = search_engine.search(query)

# Print the search results
print(f"Search results for '{query}':")
for result in results:
    print(result)

Full Notebook Example

In [None]:
# Step 1: Install Necessary Libraries
!pip install spacy networkx
!python -m spacy download en_core_web_sm

# Step 2: Import Libraries
import os
import spacy
import networkx as nx
from collections import defaultdict

# Step 3: Define the Search Engine Class
class FactBasedKnowledgeGraphSearchEngine:
    def __init__(self, directory: str):
        self.directory = directory
        self.graph = nx.Graph()
        self.documents = {}
        self.nlp = spacy.load("en_core_web_sm")
        self._crawl_directory()

    def _crawl_directory(self):
        for filename in os.listdir(self.directory):
            if filename.endswith(".txt"):
                file_path = os.path.join(self.directory, filename)
                try:
                    with open(file_path, 'r', encoding='utf-8') as file:
                        content = file.read()
                        doc_id = len(self.documents)
                        self.documents[doc_id] = filename
                        self._index_document(content, doc_id)
                except Exception as e:
                    print(f"Error reading file {file_path}: {e}")

    def _extract_entities_and_relationships(self, content: str):
        doc = self.nlp(content)
        entities = [(ent.text, ent.label_) for ent in doc.ents]
        relationships = []

        for sentence in doc.sents:
            for token in sentence:
                if token.dep_ in ("attr", "dobj"):
                    subject = [w for w in token.head.lefts if w.dep_ == "nsubj"]
                    if subject:
                        subject = subject[0]
                        relationships.append((subject.text, token.text, token.head.text))

        return entities, relationships

    def _index_document(self, content: str, doc_id: int):
        entities, relationships = self._extract_entities_and_relationships(content)
        for entity, label in entities:
            if not self.graph.has_node(entity):
                self.graph.add_node(entity, type=label)
            self.graph.add_edge(entity, f'doc_{doc_id}', type='contains')

        for sub, obj, verb in relationships:
            if not self.graph.has_node(sub):
                self.graph.add_node(sub, type='entity')
            if not self.graph.has_node(obj):
                self.graph.add_node(obj, type='entity')
            self.graph.add_edge(sub, obj, type=verb)

    def search(self, query: str):
        query_doc = self.nlp(query)
        query_entities = [ent.text for ent in query_doc.ents]

        if not query_entities:
            return []

        doc_scores = defaultdict(int)
        for entity in query_entities:
            if self.graph.has_node(entity):
                for neighbor in self.graph.neighbors(entity):
                    if neighbor.startswith('doc_'):
                        doc_id = int(neighbor.split('_')[1])
                        doc_scores[doc_id] += 1

        sorted_docs = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)
        return [self.documents[doc_id] for doc_id, _ in sorted_docs]

# Step 4: Prepare Text Files
from google.colab import files

uploaded = files.upload()

# `os` is already imported in Step 2
os.makedirs('text_files', exist_ok=True)
for filename in uploaded.keys():
    os.rename(filename, f'text_files/{filename}')

# Step 5: Initialize the Search Engine
directory = 'text_files'
search_engine = FactBasedKnowledgeGraphSearchEngine(directory)

# Step 6: Search for a Query
# The query needs at least one named entity that spaCy recognizes; otherwise search() returns [].
query = 'example search query'
results = search_engine.search(query)

print(f"Search results for '{query}':")
for result in results:
    print(result)