<a href="https://colab.research.google.com/github/nelslindahlx/Random-Notebooks/blob/master/basic_search_engine.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Basic Search Engine in Google Colab
This notebook demonstrates how to build a basic search engine using a small dataset of documents. We will preprocess the documents, build an inverted index, and implement search functionality.

In [2]:
# Import necessary libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import defaultdict
import string
# Note: newer NLTK releases may additionally require nltk.download('punkt_tab') for word_tokenize
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Sample Documents
We will use a small set of documents for this demonstration.

In [3]:
# Sample documents
documents = {
    1: 'The quick brown fox jumps over the lazy dog',
    2: 'Never jump over the lazy dog quickly',
    3: 'Bright sunny day with clear blue sky',
    4: 'The quick brown fox and the bright blue sky',
}

## Preprocess the Documents
We will lowercase and tokenize each document, then remove punctuation and stopwords.

In [4]:
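# Preprocess documents
stop_words = set(stopwords.words('english'))

def preprocess(text):
    """Lowercase and tokenize text, then drop non-alphabetic tokens and stopwords."""
    tokens = word_tokenize(text.lower())
    # Keep purely alphabetic tokens; this drops punctuation and numbers
    tokens = [word for word in tokens if word.isalpha()]
    # Remove common English stopwords such as 'the' and 'over'
    tokens = [word for word in tokens if word not in stop_words]
    return tokens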

# Preprocess all documents
processed_docs = {doc_id: preprocess(text) for doc_id, text in documents.items()}
print('Processed Documents:')
print(processed_docs)

Processed Documents:
{1: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog'], 2: ['never', 'jump', 'lazy', 'dog', 'quickly'], 3: ['bright', 'sunny', 'day', 'clear', 'blue', 'sky'], 4: ['quick', 'brown', 'fox', 'bright', 'blue', 'sky']}
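
To make the three preprocessing steps easier to see in isolation, here is a minimal sketch that uses only the standard library. It is not the notebook's method: `simple_preprocess` and `TINY_STOP_WORDS` are hypothetical names, the regular expression stands in for `word_tokenize`, and the stopword list is a tiny illustrative subset rather than NLTK's full English list.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only;
# NLTK's English list is much longer.
TINY_STOP_WORDS = {"the", "over", "and", "with", "a", "an", "of"}

def simple_preprocess(text):
    """Lowercase the text, keep alphabetic runs only, and drop stopwords."""
    # Each run of ASCII letters becomes one token; punctuation and digits are skipped.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in TINY_STOP_WORDS]

print(simple_preprocess("The quick brown fox jumps over the lazy dog"))
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```

The regular expression splits contractions such as "don't" into separate pieces, which is one place where it may behave differently from `word_tokenize`. Neither approach normalizes word forms, so 'jumps' and 'jump' remain distinct tokens.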


## Build an Inverted Index
An inverted index maps each word to the IDs of the documents that contain it, so a lookup by word does not need to scan every document.

In [5]:
# Build an inverted index
inverted_index = defaultdict(list)

for doc_id, tokens in processed_docs.items():
    for token in tokens:
        if doc_id not in inverted_index[token]:
            inverted_index[token].append(doc_id)

print('Inverted Index:')
print(dict(inverted_index))

Inverted Index:
{'quick': [1, 4], 'brown': [1, 4], 'fox': [1, 4], 'jumps': [1], 'lazy': [1, 2], 'dog': [1, 2], 'never': [2], 'jump': [2], 'quickly': [2], 'bright': [3, 4], 'sunny': [3], 'day': [3], 'clear': [3], 'blue': [3, 4], 'sky': [3, 4]}
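
As a possible variation, the postings can be stored in sets instead of lists, which removes the need for the explicit `not in` duplicate check. The sketch below assumes the token lists printed in the "Processed Documents" output; `processed` and `set_index` are hypothetical names used only for this illustration.

```python
from collections import defaultdict

# Token lists copied from the "Processed Documents" output.
processed = {
    1: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog'],
    2: ['never', 'jump', 'lazy', 'dog', 'quickly'],
    3: ['bright', 'sunny', 'day', 'clear', 'blue', 'sky'],
    4: ['quick', 'brown', 'fox', 'bright', 'blue', 'sky'],
}

# A set ignores repeated additions, so each document ID is stored once per word.
set_index = defaultdict(set)
for doc_id, tokens in processed.items():
    for token in tokens:
        set_index[token].add(doc_id)

# Look up a word that is present and one that is absent.
# .get() avoids creating an empty entry for the missing word.
print(sorted(set_index.get('lazy', set())))  # [1, 2]
print(sorted(set_index.get('cat', set())))   # []
```

Sets do not preserve insertion order, so the results are sorted before printing.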


## Search Functionality
Implement a simple search function that preprocesses the query and returns the IDs of all documents containing at least one of the query terms (OR semantics).

In [6]:
# Search function
def search(query):
    query_tokens = preprocess(query)
    results = set()
    for token in query_tokens:
        if token in inverted_index:
            results.update(inverted_index[token])
    return results

# Test the search function
query = 'quick fox'
result_docs = search(query)
print(f"Documents containing '{query}': {result_docs}")

Documents containing 'quick fox': {1, 4}
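
The query 'quick fox' gives the same answer whether a document must contain any query term or all of them, so it does not show the difference between the two. The sketch below contrasts them on a query where they diverge. It assumes a subset of the postings printed in the "Inverted Index" output; `index`, `search_any`, and `search_all` are hypothetical names, and both functions take already-preprocessed tokens.

```python
# Subset of the postings from the "Inverted Index" output.
index = {
    'quick': [1, 4], 'brown': [1, 4], 'fox': [1, 4],
    'lazy': [1, 2], 'dog': [1, 2],
    'bright': [3, 4], 'blue': [3, 4], 'sky': [3, 4],
}

def search_any(query_tokens, index):
    """Return IDs of documents containing at least one token (OR semantics)."""
    results = set()
    for token in query_tokens:
        results.update(index.get(token, ()))
    return results

def search_all(query_tokens, index):
    """Return IDs of documents containing every token (AND semantics)."""
    if not query_tokens:
        return set()
    # An unknown token contributes an empty set, which empties the intersection.
    postings = [set(index.get(token, ())) for token in query_tokens]
    return set.intersection(*postings)

print(sorted(search_any(['fox', 'sky'], index)))  # [1, 3, 4]
print(sorted(search_all(['fox', 'sky'], index)))  # [4]
```

Only document 4 mentions both a fox and the sky, while documents 1 and 3 each match one of the two terms.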
