<a href="https://colab.research.google.com/github/nelslindahlx/Random-Notebooks/blob/master/enhanced_search_engine.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Enhanced Search Engine in Google Colab
This notebook builds a small search engine over four sample documents. We preprocess the documents, build an inverted index and a TF-IDF matrix, and implement a search function that ranks results by cosine similarity.

In [1]:
# Import necessary libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import string
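# Download the tokenizer models and the stopword list used below
# (recent NLTK releases may also need nltk.download('punkt_tab') for word_tokenize)
nltk.download('punkt')
nltk.download('stopwords')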

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

## Sample Documents
We will use a small set of documents for this demonstration.

In [2]:
# Sample documents
documents = {
    1: 'The quick brown fox jumps over the lazy dog',
    2: 'Never jump over the lazy dog quickly',
    3: 'Bright sunny day with clear blue sky',
    4: 'The quick brown fox and the bright blue sky',
}

## Preprocess the Documents
We will lowercase and tokenize each document, then remove punctuation (along with any other non-alphabetic tokens) and English stopwords.

In [3]:
# Preprocess documents
stop_words = set(stopwords.words('english'))
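def preprocess(text):
    """Lowercase and tokenize text, keeping only alphabetic, non-stopword tokens."""
    tokens = word_tokenize(text.lower())
    # isalpha() drops punctuation and any token containing digits
    tokens = [word for word in tokens if word.isalpha()]
    tokens = [word for word in tokens if word not in stop_words]
    return tokens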

# Preprocess all documents
processed_docs = {doc_id: preprocess(text) for doc_id, text in documents.items()}
print('Processed Documents:')
print(processed_docs)

Processed Documents:
{1: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog'], 2: ['never', 'jump', 'lazy', 'dog', 'quickly'], 3: ['bright', 'sunny', 'day', 'clear', 'blue', 'sky'], 4: ['quick', 'brown', 'fox', 'bright', 'blue', 'sky']}
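The filtering steps can be tried without NLTK. The following is a minimal sketch, not the notebook's method: `toy_preprocess` and `TOY_STOP_WORDS` are hypothetical names, the regular expression only approximates `word_tokenize`, and the stopword set is a tiny stand-in for NLTK's English list.

```python
import re

# Hypothetical, deliberately tiny stopword set; NLTK's English list is much longer.
TOY_STOP_WORDS = {"the", "over", "and", "with", "a", "is"}

def toy_preprocess(text):
    """Lowercase, split into tokens, drop non-alphabetic tokens and stopwords."""
    # Rough stand-in for word_tokenize: runs of word characters, or single punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # isalpha() removes punctuation marks and numbers
    tokens = [t for t in tokens if t.isalpha()]
    return [t for t in tokens if t not in TOY_STOP_WORDS]

print(toy_preprocess("The quick, brown fox jumps over 2 lazy dogs!"))
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dogs']
```

Neither this sketch nor `preprocess` applies stemming, which is why `jumps` and `jump`, and `quick` and `quickly`, remain separate terms in the processed documents above.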


## Build an Inverted Index and TF-IDF Matrix
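We will build an inverted index, which maps each term to the IDs of the documents that contain it, and a TF-IDF matrix with one row per document and one column per vocabulary term.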
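To make the role of an inverted index concrete, here is a minimal sketch of how a term-to-documents mapping can answer a Boolean AND query. The names `toy_docs`, `toy_index`, and `boolean_and` are hypothetical and not part of the notebook; the token lists are copied from the processed documents printed earlier, and sets are used instead of lists purely to simplify the intersection.

```python
from collections import defaultdict

# Token lists copied from the "Processed Documents" output
toy_docs = {
    1: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog'],
    2: ['never', 'jump', 'lazy', 'dog', 'quickly'],
    3: ['bright', 'sunny', 'day', 'clear', 'blue', 'sky'],
    4: ['quick', 'brown', 'fox', 'bright', 'blue', 'sky'],
}

# Map each term to the set of document IDs that contain it
toy_index = defaultdict(set)
for doc_id, tokens in toy_docs.items():
    for token in tokens:
        toy_index[token].add(doc_id)

def boolean_and(index, terms):
    """Return the sorted IDs of documents that contain every term."""
    postings = [index.get(term, set()) for term in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

print(boolean_and(toy_index, ['lazy', 'dog']))   # [1, 2]
print(boolean_and(toy_index, ['quick', 'sky']))  # [4]
print(boolean_and(toy_index, ['fox', 'sunny']))  # []
```

A lookup like this touches only the posting sets of the query terms rather than every document, which is the main reason to keep such an index.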

In [4]:
# Build an inverted index
inverted_index = defaultdict(list)

for doc_id, tokens in processed_docs.items():
    for token in tokens:
        if doc_id not in inverted_index[token]:
            inverted_index[token].append(doc_id)

# Build the TF-IDF matrix
corpus = [' '.join(tokens) for tokens in processed_docs.values()]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

print('Inverted Index:')
print(dict(inverted_index))
print('\nTF-IDF Matrix:')
print(tfidf_matrix.toarray())

Inverted Index:
{'quick': [1, 4], 'brown': [1, 4], 'fox': [1, 4], 'jumps': [1], 'lazy': [1, 2], 'dog': [1, 2], 'never': [2], 'jump': [2], 'quickly': [2], 'bright': [3, 4], 'sunny': [3], 'day': [3], 'clear': [3], 'blue': [3, 4], 'sky': [3, 4]}

TF-IDF Matrix:
[[0.         0.         0.3889911  0.         0.         0.3889911
  0.3889911  0.         0.49338588 0.3889911  0.         0.3889911
  0.         0.         0.        ]
 [0.         0.         0.         0.         0.         0.38274272
  0.         0.48546061 0.         0.38274272 0.48546061 0.
  0.48546061 0.         0.        ]
 [0.35745504 0.35745504 0.         0.4533864  0.4533864  0.
  0.         0.         0.         0.         0.         0.
  0.         0.35745504 0.4533864 ]
 [0.40824829 0.40824829 0.40824829 0.         0.         0.
  0.40824829 0.         0.         0.         0.         0.40824829
  0.         0.40824829 0.        ]]
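The columns of the matrix follow the vectorizer's alphabetically sorted vocabulary (`blue`, `bright`, `brown`, ..., `sunny`). As a sanity check, the weights can be reproduced by hand. The sketch below assumes `TfidfVectorizer`'s default settings (raw term counts, smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row); `tfidf_row` is a hypothetical helper, not part of scikit-learn.

```python
import math

# Token lists copied from the "Processed Documents" output
docs = {
    1: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog'],
    2: ['never', 'jump', 'lazy', 'dog', 'quickly'],
    3: ['bright', 'sunny', 'day', 'clear', 'blue', 'sky'],
    4: ['quick', 'brown', 'fox', 'bright', 'blue', 'sky'],
}
n_docs = len(docs)

# Document frequency: the number of documents containing each term
df = {}
for tokens in docs.values():
    for term in set(tokens):
        df[term] = df.get(term, 0) + 1

def tfidf_row(tokens):
    """Smoothed-IDF, L2-normalized TF-IDF weights for one document."""
    weights = {}
    for term in tokens:
        idf = math.log((1 + n_docs) / (1 + df[term])) + 1
        # Adding idf once per occurrence yields term count times idf
        weights[term] = weights.get(term, 0.0) + idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

row1 = tfidf_row(docs[1])
row4 = tfidf_row(docs[4])
print(round(row1['jumps'], 4), round(row1['quick'], 4))  # 0.4934 0.389
print(round(row4['quick'], 4))                           # 0.4082
```

In document 1, `jumps` appears in no other document and so receives a larger weight than terms shared with another document. Document 4 contains six terms that each occur in exactly two documents, so every weight equals 1/sqrt(6), about 0.4082.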


## Search Functionality with Ranking
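We implement a search function that preprocesses the query, maps it into the same TF-IDF space as the documents, and ranks the documents by cosine similarity. Documents with a similarity of zero are omitted from the results.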

In [5]:
# Search function with ranking
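def search(query):
    """Return (doc_id, score) pairs ranked by cosine similarity, highest first."""
    query_tokens = preprocess(query)
    query_vec = vectorizer.transform([' '.join(query_tokens)])
    cosine_similarities = cosine_similarity(query_vec, tfidf_matrix).flatten()
    ranked_docs = np.argsort(-cosine_similarities)

    # Rows of tfidf_matrix follow the insertion order of the documents dict
    doc_ids = list(documents.keys())
    results = []
    for doc_index in ranked_docs:
        # Skip documents that share no terms with the query
        if cosine_similarities[doc_index] > 0:
            # float() keeps the printed scores as plain Python floats
            results.append((doc_ids[doc_index], float(cosine_similarities[doc_index])))

    return results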

# Test the search function
query = 'quick fox'
result_docs = search(query)
print(f"Ranked documents for query '{query}': {result_docs}")

Ranked documents for query 'quick fox': [(4, 0.5773502691896258), (1, 0.5501164858565948)]
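The two scores can be checked by hand. Because every TF-IDF row is L2-normalized, cosine similarity reduces to a dot product. The sketch below assumes the query vector is normalized the same way, so its two equally weighted terms each get 1/sqrt(2); the document weights are copied from the matrix printed earlier, and `doc_weights`, `query_weights`, and `dot` are hypothetical names.

```python
import math

# Weights for 'quick' and 'fox' copied from the TF-IDF matrix output
doc_weights = {
    1: {'quick': 0.3889911, 'fox': 0.3889911},
    4: {'quick': 0.40824829, 'fox': 0.40824829},
}

# Both query terms have the same IDF, so after L2 normalization each weighs 1/sqrt(2)
query_weights = {'quick': 1 / math.sqrt(2), 'fox': 1 / math.sqrt(2)}

def dot(a, b):
    """Dot product of two sparse vectors stored as term-to-weight dicts."""
    return sum(weight * b.get(term, 0.0) for term, weight in a.items())

scores = {doc_id: dot(query_weights, w) for doc_id, w in doc_weights.items()}
for doc_id, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(doc_id, round(score, 4))
# 4 0.5774
# 1 0.5501
```

Document 4 ranks slightly higher because its weight is spread evenly over six equally common terms, whereas in document 1 the rarer term `jumps` takes a larger share of the normalized vector and leaves less for `quick` and `fox`.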
