# Document Search Engine: Complete Example

This notebook demonstrates a complete implementation of a semantic document search engine using Ollama and FAISS.

## 1. Setup and Imports
First, let's import all the necessary libraries and check our environment.

In [1]:
import Ollama_utils as ou
import os
import time
import numpy as np
import matplotlib.pyplot as plt
import shutil
import requests
import json


# Check for Ollama
# Add auto-download capability for llama3 if no models are present
def download_llama3_model():
    """Download the llama3 model automatically if no models are installed."""
    print("🔄 No models found. Automatically downloading llama3 model (approx. 4GB)...")
    
    url = "http://localhost:11434/api/pull"
    payload = {"name": "llama3"}
    
    try:
        # Stream the response so progress can be reported while the pull runs
        with requests.post(url, json=payload, stream=True) as response:
            if response.status_code != 200:
                print(f"❌ Error downloading model: {response.status_code} - {response.text}")
                return False
            for line in response.iter_lines():
                if not line:
                    continue
                data = json.loads(line)
                # A failed pull is reported as an "error" field in the stream
                if "error" in data:
                    print(f"\n❌ Error downloading model: {data['error']}")
                    return False
                # The Ollama API documents "success" as the final status of a pull
                if data.get("status") == "success":
                    print("\n✅ Download complete!")
                    return True
                # Progress events carry byte counts in "total" and "completed"
                if data.get("total") and "completed" in data:
                    percent = (data["completed"] / data["total"]) * 100
                    print(f"Downloading: {percent:.1f}% complete", end="\r")
        # The stream ended without a final "success" status
        print("\n❌ Download stream ended before the pull finished")
        return False
    except Exception as e:
        print(f"❌ Error connecting to Ollama: {str(e)}")
        return False

# Check for models and download automatically if none exist
try:
    response = requests.get("http://localhost:11434/api/tags")
    if response.status_code == 200:
        models = response.json().get("models", [])
        
        if not models:
            print("No models detected. Starting automatic download...")
            success = download_llama3_model()
            if success:
                # Check available models again
                response = requests.get("http://localhost:11434/api/tags")
                if response.status_code == 200:
                    models = response.json().get("models", [])
                    print(f"Available models: {[model['name'] for model in models]}")
            else:
                print("⚠️ Failed to download llama3 model. This notebook requires a model to function properly.")
except Exception as e:
    print(f"❌ Error checking models: {str(e)}")
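
The progress handling in the download function can be tried without a running Ollama server. The following is a minimal sketch, assuming each streamed line is a JSON object with a `status` field and optional `total` and `completed` byte counts, which are the fields the cell above reads. The helper name `parse_pull_event` and the sample events are hypothetical and not part of `Ollama_utils`.

```python
import json

def parse_pull_event(line):
    """Turn one JSON line from a streamed pull response into (status, percent).

    percent is None when the event carries no usable byte counts.
    """
    event = json.loads(line)
    status = event.get("status", "")
    total = event.get("total")
    completed = event.get("completed")
    percent = None
    if total and completed is not None:
        percent = 100.0 * completed / total
    return status, percent

# Hypothetical sample events, shaped like the fields read in the cell above
sample_lines = [
    b'{"status": "pulling manifest"}',
    b'{"status": "downloading", "total": 4000, "completed": 1000}',
    b'{"status": "success"}',
]
for raw in sample_lines:
    print(parse_pull_event(raw))
```

Keeping the parsing in a small pure function makes it possible to test the progress logic separately from the network call.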




## 2. Create Sample Documents

Let's create a few sample documents to build our search engine.

In [2]:
# Create a directory for sample documents
sample_dir = "sample_docs"
os.makedirs(sample_dir, exist_ok=True)

# Sample document 1: Python Programming
python_doc = """
# Python Programming Guide

Python is a high-level, interpreted programming language known for its readability and versatility.

## Key Features

- Easy to learn and read
- Dynamically typed
- Garbage collection
- Comprehensive standard library
- Extensive third-party packages


## Popular Libraries

- NumPy: For numerical computing
- Pandas: For data analysis
- Matplotlib: For data visualization
- TensorFlow and PyTorch: For machine learning
- Django and Flask: For web development
"""

# Sample document 2: Machine Learning
ml_doc = """
# Introduction to Machine Learning

Machine learning is a subset of artificial intelligence that enables systems to learn from data and improve from experience.

## Types of Machine Learning

### Supervised Learning
Training on labeled data to predict outcomes for unseen data.
Examples: Classification, Regression

### Unsupervised Learning
Finding patterns or structures in unlabeled data.
Examples: Clustering, Dimensionality Reduction

### Reinforcement Learning
Learning through interactions with an environment to maximize rewards.
Examples: Game playing, Robotics

## Common Algorithms

- Linear Regression
- Logistic Regression
- Decision Trees
- Random Forests
- Support Vector Machines
- Neural Networks
- K-means Clustering

## Key Libraries in Python

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Example of a simple ML workflow
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
```
"""

# Sample document 3: Vector Search
vector_doc = """
# Guide to Vector Search

Vector search finds similar items by comparing their vector embeddings in high-dimensional space.

## Steps in Vector Search

1. Convert text to vector embeddings
2. Store embeddings in a vector index
3. Use similarity metrics to find nearest vectors

## Common Libraries

- FAISS (Facebook AI)
- Annoy (Spotify)
- Milvus and Weaviate (Cloud-native vector DBs)

## Similarity Metrics

- Cosine Similarity
- Euclidean Distance
- Dot Product
"""

## 3. Scan and Index Documents

Now let's save the sample documents to disk, scan the directory, and build a searchable index.

In [3]:
# Save the documents
with open(os.path.join(sample_dir, "python_guide.md"), "w") as f:
    f.write(python_doc)
    
with open(os.path.join(sample_dir, "machine_learning.md"), "w") as f:
    f.write(ml_doc)
    
with open(os.path.join(sample_dir, "vector_search.md"), "w") as f:
    f.write(vector_doc)
    
print(f"Created 3 sample documents in '{sample_dir}' directory")

Created 3 sample documents in 'sample_docs' directory


In [4]:
# Scan directory for documents
files = ou.scan_directory(sample_dir)
print(f"Found {len(files)} files to index:")
for file in files:
    print(f"- {os.path.basename(file)}")

Found 3 files to index:
- machine_learning.md
- python_guide.md
- vector_search.md
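
The implementation of `ou.scan_directory` is not shown in this notebook. As a rough illustration of what a directory scan of this kind could look like, here is a sketch that walks a folder and keeps files with selected extensions. The function name `scan_directory_sketch`, the extension list, and the sorted return order are all assumptions.

```python
import os
import tempfile

def scan_directory_sketch(root, extensions=(".md", ".txt")):
    """Return sorted paths of files under root whose extension is in extensions."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if os.path.splitext(name)[1].lower() in extensions:
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)

# Demonstrate on a throwaway directory so no real files are touched
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.md", "a.md", "ignore.bin"):
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write("demo")
    found = [os.path.basename(p) for p in scan_directory_sketch(tmp)]
    print(found)
```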


In [5]:
# Create a temporary directory for our index
index_dir = "example_index"
os.makedirs(index_dir, exist_ok=True)

# Define a progress callback
progress_data = []

def progress_callback(progress, message):
    progress_data.append((progress, message))
    print(f"Progress: {progress*100:.1f}% - {message}")

# Build the document index
index_path = os.path.join(index_dir, "faiss_index.bin")
metadata_path = os.path.join(index_dir, "metadata.pkl")

start_time = time.time()
success = ou.build_document_index(
    files,
    index_path=index_path,
    metadata_path=metadata_path,
    progress_callback=progress_callback
)
end_time = time.time()

if success:
    print(f"✅ Index built successfully in {end_time - start_time:.2f} seconds")
else:
    print("❌ Failed to build index")

Building index for 3 files
Processing 3 new files
Using 8 parallel workers
Saved to cache: index/cache/_app_sample_docs_python_guide.md_1747532595.4979596_488.pkl
Progress: 30.0% - Processed 1/3 files
Saved to cache: index/cache/_app_sample_docs_vector_search.md_1747532595.505401_467.pkl
Progress: 60.0% - Processed 2/3 files
Saved to cache: index/cache/_app_sample_docs_machine_learning.md_1747532595.5021625_1214.pkl
Progress: 90.0% - Processed 3/3 files
Progress: 95.0% - Building FAISS index
Creating FAISS index with 6 chunks (dim=768)
Creating new FAISS index
Progress: 100.0% - Index built successfully
✅ Indexing complete
✅ Index built successfully in 1.78 seconds
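
The log above reports 6 chunks for 3 files, so the documents are split into chunks before they are embedded. How `Ollama_utils` performs that split is not shown here. The sketch below illustrates one common approach, fixed-size word windows with overlap. The function name `chunk_text` and the `chunk_size` and `overlap` values are assumptions chosen for illustration.

```python
def chunk_text(text, chunk_size=50, overlap=10):
    """Split text into word-based chunks of chunk_size words, overlapping by overlap words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks

demo = " ".join(f"w{i}" for i in range(120))
chunks = chunk_text(demo, chunk_size=50, overlap=10)
print(len(chunks), [len(c.split()) for c in chunks])
```

Overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some words twice.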


In [6]:
# Define a function to display search results
def display_results(query, results):
    print(f"Query: '{query}'")
    print(f"Found {len(results)} results\n")
    
    for i, result in enumerate(results):
        print(f"Result {i+1}: {result['filename']} (Score: {result['score']:.3f})")
        print(f"Path: {result['file_path']}")
        print("Snippet:")
        # Truncate long snippets to 300 characters
        snippet = result['snippet']
        if len(snippet) > 300:
            snippet = snippet[:300] + "..."
        print(snippet)
        print("\n" + "-"*80 + "\n")

## 4. Basic Document Search

Let's perform some basic searches on our document index.

In [7]:
# Search query 1: Python programming
query1 = "What are the key features of Python?"
results1 = ou.search_documents(query1, top_k=2, index_path=index_path, metadata_path=metadata_path)

if isinstance(results1, list):
    display_results(query1, results1)
else:
    print(f"Error: {results1.get('error', 'Unknown error')}")

Query: 'What are the key features of Python?'
Found 2 results

Result 1: python_guide.md (Score: 0.677)
Path: sample_docs/python_guide.md
Snippet:

# Python Programming Guide

Python is a high-level, interpreted programming language known for its readability and versatility. ## Key Features

- Easy to learn and read
- Dynamically typed
- Garbage collection
- Comprehensive standard library
- Extensive third-party packages


## Popular Libraries...

--------------------------------------------------------------------------------

Result 2: machine_learning.md (Score: 0.334)
Path: sample_docs/machine_learning.md
Snippet:

### Unsupervised Learning
Finding patterns or structures in unlabeled data.
Examples: Clustering, Dimensionality Reduction

### Reinforcement Learning
Learning through interactions with an environment to maximize rewards.
Examples: Game playing, Robotics

## Common Algorithms

- Linear Regression
-...

--------------------------------------------------------------------------------

In [8]:
# Search query 2: Machine learning
query2 = "What is the difference between supervised and unsupervised learning?"
results2 = ou.search_documents(query2, top_k=2, index_path=index_path, metadata_path=metadata_path)

if isinstance(results2, list):
    display_results(query2, results2)
else:
    print(f"Error: {results2.get('error', 'Unknown error')}")

Query: 'What is the difference between supervised and unsupervised learning?'
Found 2 results

Result 1: machine_learning.md (Score: 0.440)
Path: sample_docs/machine_learning.md
Snippet:

# Introduction to Machine Learning

Machine learning is a subset of artificial intelligence that enables systems to learn from data and improve from experience.

## Types of Machine Learning

### Supervised Learning
Training on labeled data to predict outcomes for unseen data.
Examples: Classificat...

--------------------------------------------------------------------------------

Result 2: machine_learning.md (Score: 0.390)
Path: sample_docs/machine_learning.md
Snippet:

### Unsupervised Learning
Finding patterns or structures in unlabeled data.
Examples: Clustering, Dimensionality Reduction

### Reinforcement Learning
Learning through interactions with an environment to maximize rewards.
Examples: Game playing, Robotics

## Common Algorithms

- Linear Regression
-...

--------------------------------------------------------------------------------

In [9]:
# Search query 3: Vector search
query3 = "How does vector search work?"
results3 = ou.search_documents(query3, top_k=2, index_path=index_path, metadata_path=metadata_path)

if isinstance(results3, list):
    display_results(query3, results3)
else:
    print(f"Error: {results3.get('error', 'Unknown error')}")

Query: 'How does vector search work?'
Found 2 results

Result 1: vector_search.md (Score: 0.622)
Path: sample_docs/vector_search.md
Snippet:

# Guide to Vector Search

Vector search finds similar items by comparing their vector embeddings in high-dimensional space. ## Steps in Vector Search

1. Convert text to vector embeddings
2. Store embeddings in a vector index
3. Use similarity metrics to find nearest vectors

## Common Libraries

-...

--------------------------------------------------------------------------------

Result 2: machine_learning.md (Score: 0.233)
Path: sample_docs/machine_learning.md
Snippet:

### Unsupervised Learning
Finding patterns or structures in unlabeled data.
Examples: Clustering, Dimensionality Reduction

### Reinforcement Learning
Learning through interactions with an environment to maximize rewards.
Examples: Game playing, Robotics

## Common Algorithms

- Linear Regression
-...

--------------------------------------------------------------------------------
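
The notebook does not state which metric produces the scores above. Assuming they are cosine similarities between the query embedding and each chunk embedding, the ranking step can be sketched in plain Python as follows. The toy 3-dimensional vectors stand in for the 768-dimensional embeddings, and a FAISS index would perform the same comparison far more efficiently.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return the k (index, score) pairs with the highest cosine similarity."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy embeddings, made up for illustration only
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
]
query_vec = [1.0, 0.1, 0.0]
hits = top_k(query_vec, doc_vecs, k=2)
print(hits)
```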

## 5. Enhanced Search with Ollama

Now let's try enhancing our search queries using Ollama.

In [10]:
def enhanced_search(query, model="llama3"):
    print(f"Original query: '{query}'")
    
    # Enhance the query using Ollama
    enhanced_query = ou.query_ollama(
        f"""
        You are a helpful assistant designed to improve search queries for document retrieval.
        
        Your task is to rewrite the following user query to make it more descriptive and specific.
        Return ONLY the rewritten query — no explanations, no suggestions, and no other text.
        
        Query: "{query}"
        
        Rewritten Query:
        """,
        model=model
    )
    
    # Clean up the enhanced query (remove quotes, extra spaces, etc.)
    enhanced_query = enhanced_query.strip().replace('"', '')
    if '\n' in enhanced_query:
        enhanced_query = enhanced_query.split('\n')[0]
    
    print(f"Enhanced query: '{enhanced_query}'\n")
    
    # Search with the enhanced query
    results = ou.search_documents(
        enhanced_query, 
        top_k=2, 
        index_path=index_path, 
        metadata_path=metadata_path
    )
    
    if isinstance(results, list):
        display_results(enhanced_query, results)
    else:
        print(f"Error: {results.get('error', 'Unknown error')}")
    
    return enhanced_query, results
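
The cleanup step inside `enhanced_search` assumes the model returns a usable rewritten query. A slightly more defensive variant is sketched below: it keeps the first non-empty line of the reply and falls back to the original query when the reply is empty. The helper name `clean_rewritten_query` is hypothetical. Unlike the cell above, it strips only surrounding quotes rather than every double quote.

```python
def clean_rewritten_query(raw, fallback):
    """Keep the first non-empty line of an LLM reply, without surrounding quotes.

    Returns fallback when the reply contains no usable text.
    """
    for line in raw.splitlines():
        candidate = line.strip().strip("\"'").strip()
        if candidate:
            return candidate
    return fallback

reply = '\n  "Popular Python libraries for data analysis"\nExplanation: more specific.'
print(clean_rewritten_query(reply, "Python libraries"))
print(clean_rewritten_query("   \n", "Python libraries"))
```

Falling back to the original query avoids running a search on an empty string if the model returns nothing.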

In [11]:
# Try enhanced search with a vague query
vague_query = "Python libraries"
enhanced_query1, enhanced_results1 = enhanced_search(vague_query)

Original query: 'Python libraries'
Enhanced query: 'List of popular Python programming libraries or open-source packages for data analysis, machine learning, web development, etc.'

Query: 'List of popular Python programming libraries or open-source packages for data analysis, machine learning, web development, etc.'
Found 2 results

Result 1: python_guide.md (Score: 0.686)
Path: sample_docs/python_guide.md
Snippet:

# Python Programming Guide

Python is a high-level, interpreted programming language known for its readability and versatility. ## Key Features

- Easy to learn and read
- Dynamically typed
- Garbage collection
- Comprehensive standard library
- Extensive third-party packages


## Popular Libraries...

--------------------------------------------------------------------------------

Result 2: machine_learning.md (Score: 0.431)
Path: sample_docs/machine_learning.md
Snippet:

### Unsupervised Learning
Finding patterns or structures in unlabeled data.
Examples: Clustering, Di

## 6. Comparing Basic and Enhanced Search

Let's compare the results of basic search vs. enhanced search for the same query intent.

In [12]:
comparison_query = "ml in python"

print("BASIC SEARCH:\n" + "-"*80)
basic_results = ou.search_documents(comparison_query, top_k=2, index_path=index_path, metadata_path=metadata_path)
if isinstance(basic_results, list):
    display_results(comparison_query, basic_results)
else:
    print(f"Error: {basic_results.get('error', 'Unknown error')}")

print("\nENHANCED SEARCH:\n" + "-"*80)
enhanced_query4, enhanced_results4 = enhanced_search(comparison_query)

BASIC SEARCH:
--------------------------------------------------------------------------------
Query: 'ml in python'
Found 2 results

Result 1: python_guide.md (Score: 0.632)
Path: sample_docs/python_guide.md
Snippet:

# Python Programming Guide

Python is a high-level, interpreted programming language known for its readability and versatility. ## Key Features

- Easy to learn and read
- Dynamically typed
- Garbage collection
- Comprehensive standard library
- Extensive third-party packages


## Popular Libraries...

--------------------------------------------------------------------------------

Result 2: machine_learning.md (Score: 0.557)
Path: sample_docs/machine_learning.md
Snippet:

### Unsupervised Learning
Finding patterns or structures in unlabeled data.
Examples: Clustering, Dimensionality Reduction

### Reinforcement Learning
Learning through interactions with an environment to maximize rewards.
Examples: Game playing, Robotics

## Common Algorithms

- Linear Regression
-...
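
Reading two result listings side by side gets tedious, so a numeric comparison can help. The sketch below assumes each result is a dict with `filename` and `score` keys, as used by `display_results`. It reports how the best score per file changes between two searches. The helper names and the demo scores are made up for illustration and are not output from this notebook.

```python
def best_score_by_file(results):
    """Map each filename to the highest score among its results."""
    best = {}
    for r in results:
        name = r["filename"]
        best[name] = max(r["score"], best.get(name, float("-inf")))
    return best

def score_deltas(basic, enhanced):
    """Score change (enhanced minus basic) for files present in both result lists."""
    b = best_score_by_file(basic)
    e = best_score_by_file(enhanced)
    return {name: e[name] - b[name] for name in sorted(b.keys() & e.keys())}

# Made-up scores, for illustration only
basic_demo = [{"filename": "a.md", "score": 0.5}, {"filename": "b.md", "score": 0.25}]
enhanced_demo = [{"filename": "a.md", "score": 0.75}, {"filename": "b.md", "score": 0.5}]
print(score_deltas(basic_demo, enhanced_demo))
```

A higher score for a longer rewritten query does not by itself mean better relevance, so the ranking order is still worth checking by eye.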

## 7. Clean Up

Let's clean up our example files and directories.

In [13]:
# Remove temporary directories and files
if os.path.exists(sample_dir):
    shutil.rmtree(sample_dir)
    print(f"Removed '{sample_dir}' directory")
    
if os.path.exists(index_dir):
    shutil.rmtree(index_dir)
    print(f"Removed '{index_dir}' directory")
    
print("\nCleanup complete!")

Removed 'sample_docs' directory
Removed 'example_index' directory

Cleanup complete!


## Conclusion

This notebook demonstrated a complete implementation of a semantic document search engine using Ollama and FAISS. Key components included:

1. Document scanning and processing
2. Text chunking and embedding generation
3. Building a searchable vector index
4. Basic semantic search
5. Enhanced search with query improvement using Ollama

The enhanced search showed how an LLM can rewrite a short or vague query, such as "Python libraries", into a more descriptive one before retrieval, which can help surface more relevant results.

## Command to Run the Main Streamlit App

In [None]:
!streamlit run app.py


Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.

  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://172.17.0.2:8501
  External URL: http://50.237.13.62:8501
