# Lesson 40: Document search with embeddings activity

In this activity, you will build a simple document search system using word embeddings. It covers three topics:

1. **Document embeddings** - Create document vectors by averaging word embeddings
2. **Similarity search** - Find documents similar to a query
3. **Comparison** - Compare embedding-based search with TF-IDF search

## Notebook set-up

### Imports

In [None]:
import numpy as np
import gensim.downloader as api
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained GloVe embeddings (this may take a moment)
print('Loading GloVe embeddings...')
glove = api.load('glove-wiki-gigaword-100')
print(f'Loaded {len(glove)} word vectors of dimension {glove.vector_size}')

## 1. Sample document corpus

In [None]:
# Sample documents about different topics
documents = [
    'The quick brown fox jumps over the lazy dog',
    'Machine learning algorithms can classify images and text',
    'Neural networks are inspired by the human brain',
    'Python is a popular programming language for data science',
    'Deep learning has revolutionized computer vision',
    'Natural language processing helps computers understand text',
    'The cat sat on the mat near the window',
    'Artificial intelligence is transforming many industries',
    'Dogs and cats are popular household pets',
    'Data scientists use statistics and machine learning',
]

print(f'Corpus contains {len(documents)} documents')

## 2. Create document embeddings

### Task 1: Implement document embedding function

Complete the function below to create a document embedding by averaging the embeddings of all words in the document that appear in the embedding vocabulary. Words missing from the vocabulary are skipped.

**Hints:**
- Tokenize the document using `.lower().split()`
- Check if each word is in the embedding vocabulary using `word in glove`
- Get the embedding with `glove[word]`
- Average all valid word vectors using `np.mean(vectors, axis=0)`
- Return a zero vector if no words are found
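
As a warm-up that does not need the GloVe download, the averaging step can be sketched on a tiny hand-made vocabulary. The `toy_vectors` dictionary and its 3-dimensional values are made up for illustration. The real `glove` object supports the same `word in glove` and `glove[word]` lookups named in the hints.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings" (made-up numbers, not real GloVe values)
toy_vectors = {
    'cats': np.array([1.0, 0.0, 2.0]),
    'purr': np.array([3.0, 2.0, 0.0]),
}

tokens = 'Cats purr loudly'.lower().split()

# Keep only vectors for tokens found in the toy vocabulary ('loudly' is skipped)
found = [toy_vectors[token] for token in tokens if token in toy_vectors]

# axis=0 averages across words, leaving one value per embedding dimension
toy_doc_vector = np.mean(found, axis=0)
print(toy_doc_vector)  # one 3-dimensional vector for the whole document
```

Note that `axis=0` matters here: without it, `np.mean` would collapse everything into a single number instead of a vector.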

In [None]:
def get_document_embedding(document: str, embeddings) -> np.ndarray:
    '''Create document embedding by averaging word embeddings.
    
    TODO: Implement this function.
    '''
    
    # Step 1: Tokenize (lowercase and split)
    words = None  # YOUR CODE HERE
    
    # Step 2: Get embeddings for words that exist in vocabulary
    vectors = []

    for word in words:
        # YOUR CODE HERE - check if word is in embeddings and append vector
        pass
    
    # Step 3: Return average or zero vector
    if len(vectors) > 0:
        return None  # YOUR CODE HERE - return mean of vectors

    else:
        return np.zeros(embeddings.vector_size)

In [None]:
# Create embeddings for all documents
doc_embeddings = np.array([get_document_embedding(doc, glove) for doc in documents])
print(f'Document embeddings shape: {doc_embeddings.shape}')

## 3. Similarity search

### Task 2: Implement document search function

Complete the search function to find the most similar documents to a query.

**Hints:**
- Get the query embedding using `get_document_embedding`
- Compute cosine similarity with `cosine_similarity`
- Sort by similarity score and return top N results
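
The shape handling and sorting in these hints can be tried out on toy data first. The sketch below uses made-up 2-dimensional vectors (`toy_docs`, `toy_query`) rather than real document embeddings, so it only illustrates how `cosine_similarity` and `np.argsort` behave.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Made-up 2-dimensional vectors standing in for document embeddings
toy_docs = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])
toy_query = np.array([1.0, 0.0])

# cosine_similarity needs 2D input, so the query becomes a (1, 2) array
toy_scores = cosine_similarity(toy_query.reshape(1, -1), toy_docs)
print(toy_scores.shape)  # one row for the query, one column per document

# argsort is ascending, so reverse it to rank from most to least similar
toy_ranking = np.argsort(toy_scores[0])[::-1]
print(toy_ranking)
```

The result is a 2D array with a single row, which is why the provided Step 4 code indexes it as `similarities[0][idx]`.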

In [None]:
def search_documents(query: str, documents: list, doc_embeddings: np.ndarray, top_n: int = 3):
    '''Search for documents similar to query using embeddings.
    
    TODO: Implement this function.
    '''
    
    # Step 1: Get query embedding
    query_embedding = None  # YOUR CODE HERE
    
    # Step 2: Compute cosine similarity with all documents
    # Hint: cosine_similarity expects 2D arrays, so reshape query_embedding
    similarities = None  # YOUR CODE HERE
    
    # Step 3: Get indices sorted by similarity (descending)
    sorted_indices = None  # YOUR CODE HERE
    
    # Step 4: Return top N results as list of (document, score) tuples
    results = []

    for idx in sorted_indices[:top_n]:
        results.append((documents[idx], similarities[0][idx]))
    
    return results

In [None]:
# Test the search function
queries = [
    'artificial intelligence and deep neural networks',
    'pets and animals',
    'programming code'
]

print('Embedding-based search results:\n')

for query in queries:

    print(f'Query: "{query}"')
    results = search_documents(query, documents, doc_embeddings)

    for doc, score in results:
        print(f'  [{score:.3f}] {doc}')

    print()

## 4. Comparison with TF-IDF search

### Task 3: Compare embedding search with TF-IDF search

Run the TF-IDF search below and compare its results with those from the embedding-based search for the same queries.
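
One thing to watch for in the comparison is that TF-IDF only matches on shared terms. The sketch below is a small illustration on a two-document toy corpus. The names `toy_corpus`, `toy_tfidf`, and the query words are chosen for this example and are not part of the activity code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

toy_corpus = [
    'Dogs and cats are popular household pets',
    'Python is a popular programming language for data science',
]

toy_tfidf = TfidfVectorizer()
toy_matrix = toy_tfidf.fit_transform(toy_corpus)

# 'puppy' never appears in the toy corpus, so its TF-IDF vector is all zeros
unseen_scores = cosine_similarity(toy_tfidf.transform(['puppy']), toy_matrix)[0]

# 'pets' appears only in the first document
seen_scores = cosine_similarity(toy_tfidf.transform(['pets']), toy_matrix)[0]

print(unseen_scores)
print(seen_scores)
```

A query word that is absent from the fitted vocabulary scores zero against every document, even when a human would consider it closely related to one of them.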

In [None]:
# TF-IDF based search for comparison
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(documents)

def search_tfidf(query: str, top_n: int = 3):
    '''Search using TF-IDF similarity.'''

    query_vec = tfidf.transform([query])
    similarities = cosine_similarity(query_vec, tfidf_matrix)[0]
    sorted_indices = np.argsort(similarities)[::-1]
    
    results = []

    for idx in sorted_indices[:top_n]:
        results.append((documents[idx], similarities[idx]))

    return results

print('TF-IDF search results:\n')

for query in queries:

    print(f'Query: "{query}"')
    results = search_tfidf(query)

    for doc, score in results:
        print(f'  [{score:.3f}] {doc}')

    print()

## 5. Analysis questions

After completing the tasks above, answer these questions:

1. Which search method (embeddings vs TF-IDF) handles synonyms and related concepts better? Give an example.
2. For which kinds of queries do embeddings outperform TF-IDF, and vice versa?
3. What are the trade-offs between these two approaches in terms of speed, memory, and accuracy?
4. How might you combine both approaches for better search results?
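
As a possible starting point for question 4, the sketch below blends two score arrays with a weighted sum after rescaling each to the range 0 to 1. This is only one of several ways to combine rankings. The scores are made up, and `min_max_scale` and `alpha` are hypothetical names introduced for this example.

```python
import numpy as np

# Made-up scores for three documents from each method (not real results)
embedding_scores = np.array([0.80, 0.60, 0.40])
tfidf_scores = np.array([0.00, 0.50, 0.10])

def min_max_scale(scores: np.ndarray) -> np.ndarray:
    '''Rescale scores to [0, 1]; returns zeros if all scores are equal.'''
    score_range = scores.max() - scores.min()
    if score_range == 0:
        return np.zeros_like(scores)
    return (scores - scores.min()) / score_range

alpha = 0.5  # Hypothetical weight on the embedding score; tune for your data
hybrid_scores = (
    alpha * min_max_scale(embedding_scores)
    + (1 - alpha) * min_max_scale(tfidf_scores)
)

# Rank documents from highest to lowest combined score
hybrid_ranking = np.argsort(hybrid_scores)[::-1]
print(hybrid_ranking)
```

Rescaling first keeps one method from dominating simply because its raw scores sit in a higher range, which is common for averaged embeddings.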