<a href="https://colab.research.google.com/github/iamritikiit/Search-Engine-And-Retrieval/blob/main/Search_Engine_Implementation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Dataset collection

In [1]:
import os
from sklearn.datasets import fetch_20newsgroups

# Step 1: Load the training split, then keep only the first 100 documents
data = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
docs = data.data[:100]

# Step 2: Create a folder to save documents
folder_path = 'newsgroup_docs'
os.makedirs(folder_path, exist_ok=True)

# Step 3: Save each document as a separate .txt file
for i, doc in enumerate(docs):
    file_path = os.path.join(folder_path, f'doc_{i+1}.txt')
    with open(file_path, 'w', encoding='utf-8') as f:
        f.write(doc)

print(f"Saved {len(docs)} documents to the folder '{folder_path}'")


Saved 100 documents to the folder 'newsgroup_docs'


Loading the saved documents back from disk to inspect the dataset

In [2]:
import os

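def load_documents(folder):
    """Read every .txt file in `folder`; return (texts, filenames) in matching order."""
    docs = []
    filenames = []
    # Note: os.listdir returns entries in arbitrary order, not doc_1, doc_2, ...
    for file in os.listdir(folder):
        if file.endswith(".txt"):
            with open(os.path.join(folder, file), 'r', encoding='utf-8') as f:
                docs.append(f.read())
                filenames.append(file)
    return docs, filenames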

documents, doc_names = load_documents("newsgroup_docs")
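
Since the goal of this step is to understand the dataset, a quick size summary can help. The sketch below is an illustration and not part of the original notebook: `summarize_documents` is a hypothetical helper, and a small stand-in list is used in place of `documents` so the cell runs on its own. Counting empty documents is worth doing because stripping headers, footers, and quotes can leave some posts with no text.

```python
def summarize_documents(docs):
    """Return basic size statistics for a list of document strings."""
    # Whitespace splitting is a rough word count, not the TF-IDF tokenizer.
    word_counts = [len(doc.split()) for doc in docs]
    return {
        "num_docs": len(docs),
        "num_empty": sum(1 for n in word_counts if n == 0),
        "min_words": min(word_counts) if word_counts else 0,
        "max_words": max(word_counts) if word_counts else 0,
        "mean_words": sum(word_counts) / len(word_counts) if word_counts else 0.0,
    }

# Stand-in sample (assumption); in the notebook, pass `documents` instead.
sample_docs = ["the cat sat", "", "space shuttle launch was delayed again"]
summary = summarize_documents(sample_docs)
print(summary)
```

In the notebook itself, `summarize_documents(documents)` would report the same statistics for the 100 saved files.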


TF-IDF vectorization

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Ignore common English stop words when building the vocabulary
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(documents)
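
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on a toy corpus. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which is my understanding of the `TfidfVectorizer` defaults. The helper `tfidf_vectors` and the toy documents are hypothetical, and tokenization and stop-word removal are skipped.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Compute L2-normalized TF-IDF vectors for pre-tokenized documents."""
    n_docs = len(tokenized_docs)
    vocab = sorted({term for doc in tokenized_docs for term in doc})
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    # Smoothed inverse document frequency (assumed to match the library default).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)  # raw term counts; missing terms count as 0
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw))
        vectors.append([w / norm if norm else 0.0 for w in raw])
    return vocab, vectors

# Toy corpus (assumption), already split into tokens.
toy_docs = [["space", "shuttle", "launch"], ["space", "station"], ["hockey", "game"]]
vocab, vectors = tfidf_vectors(toy_docs)
for term, weight in zip(vocab, vectors[0]):
    print(f"{term:8s} {weight:.4f}")
```

In the first toy document, "space" gets a lower weight than "launch" because it also appears in a second document. Terms the document does not contain get a weight of zero.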


Function for query vectorization and search

In [4]:
from sklearn.metrics.pairwise import cosine_similarity

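def search(query, top_k=5):
    """Return the top_k (filename, cosine similarity) pairs for a query string."""
    # Project the query into the same TF-IDF space as the indexed documents
    query_vector = vectorizer.transform([query])
    similarity_scores = cosine_similarity(query_vector, tfidf_matrix).flatten()
    # argsort is ascending, so take the last top_k indices and reverse them
    top_indices = similarity_scores.argsort()[-top_k:][::-1]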
    return [(doc_names[i], similarity_scores[i]) for i in top_indices]
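
The ranking step can be sketched without any libraries to show what the function computes. This is an illustration on dense toy vectors, not the notebook's sparse-matrix implementation. The names `cosine`, `top_k_matches`, and the sample vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_k_matches(query_vec, doc_vecs, names, k=2):
    """Score every document against the query and return the k best pairs."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in zip(names, doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest score first
    return scored[:k]

# Toy document vectors (assumption): three documents over a three-term vocabulary.
doc_vecs = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
names = ["a.txt", "b.txt", "c.txt"]
ranked = top_k_matches([1.0, 0.0, 0.0], doc_vecs, names, k=2)
for name, score in ranked:
    print(f"{name}: {score:.4f}")
```

A query vector with no overlap with any document scores 0.0 against all of them, so the order of the returned items then carries no meaning.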


Search results

In [11]:
query = "virat kohli"
results = search(query)

print("Top matching documents:")
for name, score in results:
    print(f"{name} — Score: {score:.4f}")


Top matching documents:
doc_79.txt — Score: 0.0000
doc_29.txt — Score: 0.0000
doc_12.txt — Score: 0.0000
doc_5.txt — Score: 0.0000
doc_48.txt — Score: 0.0000
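
Every score above is 0.0000, which means no indexed document shares a term with the query, so the five names returned are an arbitrary tie-break and not real matches. The sketch below shows one way to detect and report this case. `unknown_terms` and `keep_matches` are hypothetical helpers, and the toy vocabulary stands in for the fitted `vectorizer.vocabulary_` mapping.

```python
def unknown_terms(query, vocabulary):
    """Return the lowercased query words that are missing from the vocabulary."""
    # Simple whitespace split; the vectorizer's own tokenizer may differ slightly.
    return [word for word in query.lower().split() if word not in vocabulary]

def keep_matches(results, min_score=0.0):
    """Drop (name, score) pairs whose score does not exceed min_score."""
    return [(name, score) for name, score in results if score > min_score]

# Stand-ins (assumptions): a tiny vocabulary and two zero-score results.
toy_vocabulary = {"space": 0, "shuttle": 1, "hockey": 2}
toy_results = [("doc_79.txt", 0.0), ("doc_29.txt", 0.0)]

missing = unknown_terms("virat kohli", toy_vocabulary)
matches = keep_matches(toy_results)

print("Terms not in vocabulary:", missing)
print(matches or "No documents share a term with the query.")
```

In the notebook, passing `vectorizer.vocabulary_` and the output of `search(query)` to these helpers would flag a query like this one before its results are shown as matches.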
