# 🎯 Practice Exercises

## Exercise 1: Understanding Embedding Similarity

### Task
Explore how embeddings capture semantic similarity between different texts.

### Instructions

1. Use the sample sentences below:

```python
sentences = [
    "The dog is playing in the park",
    "A puppy is running outside",
    "The cat is sleeping on the couch",
    "Python is a programming language",
    "Machine learning models need data",
    "I love coding in Python"
]
```

2. Generate embeddings for all sentences using `all-MiniLM-L6-v2`

3. Calculate similarity scores between:
   - Sentence 1 and all others
   - Sentence 4 and all others

4. Answer these questions:
   - Which sentences are most similar to "The dog is playing in the park"?
   - Which sentences are most similar to "Python is a programming language"?
   - What similarity threshold would you use to filter unrelated content?
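
For step 3, it can help to see what a cosine similarity score actually is before calling a library function. The sketch below is a minimal pure-Python version. The three-dimensional vectors are made-up stand-ins for real embeddings (`all-MiniLM-L6-v2` produces 384-dimensional vectors), and `cosine_similarity` is a hypothetical helper name, not part of any library.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, invented for illustration only (not real embeddings).
dog = [0.9, 0.1, 0.0]
puppy = [0.8, 0.3, 0.1]
python_lang = [0.0, 0.2, 0.9]

print(f"dog vs puppy:  {cosine_similarity(dog, puppy):.3f}")
print(f"dog vs python: {cosine_similarity(dog, python_lang):.3f}")
```

Vectors that point in nearly the same direction score close to 1, and vectors that share little score close to 0, which is the property the exercise asks you to observe on real sentences.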

### Expected Output

```
Similarity Analysis:

Query: "The dog is playing in the park"
Most similar: ?
Least similar: ?
Observations: ?

Query: "Python is a programming language"
Most similar: ?
Least similar: ?
Observations: ?

Recommended similarity threshold: ?
```

```python
from sentence_transformers import SentenceTransformer, util

# 1. Define the sentences
sentences = [
    "The dog is playing in the park",
    "A puppy is running outside",
    "The cat is sleeping on the couch",
    "Python is a programming language",
    "Machine learning models need data",
    "I love coding in Python"
]

# 2. Load the model
model = SentenceTransformer('all-MiniLM-L6-v2')

# 3. Generate embeddings
embeddings = model.encode(sentences, convert_to_tensor=True)

# 4. Print a ranked similarity report for one query sentence
def print_similarity_analysis(query_idx):
    print(f"\nQuery: \"{sentences[query_idx]}\"")
    print("─" * 50)

    # Cosine similarity between the query and every sentence (including itself)
    cosine_scores = util.cos_sim(embeddings[query_idx], embeddings)[0]

    # Build (score, index) pairs and sort by descending score
    score_index_pairs = [(float(cosine_scores[i]), i) for i in range(len(sentences))]
    score_index_pairs.sort(reverse=True)

    for score, idx in score_index_pairs:
        marker = " (itself)" if idx == query_idx else ""
        print(f"{idx+1}. {sentences[idx]:<40} → {score:.4f}{marker}")

    # Most and least similar, excluding the query itself
    relevant_scores = [(s, i) for s, i in score_index_pairs if i != query_idx]
    most_similar_score, most_similar_idx = relevant_scores[0]
    least_similar_score, least_similar_idx = relevant_scores[-1]

    print(f"\nMost similar → \"{sentences[most_similar_idx]}\" ({most_similar_score:.3f})")
    print(f"Least similar → \"{sentences[least_similar_idx]}\" ({least_similar_score:.3f})")

# 5. Run analysis for sentence 1 and sentence 4
print("Similarity Analysis using all-MiniLM-L6-v2\n")
print_similarity_analysis(0)
print_similarity_analysis(3)

# 6. Recommended threshold
# Note: these values are hard-coded rules of thumb, not computed from the scores above.
print("\n" + "="*60)
print("Recommended similarity threshold: 0.58")
print("   • ≥ 0.60 → Very strong semantic match")
print("   • 0.50–0.59 → Related / useful in most RAG cases")
print("   • < 0.45 → Usually unrelated (safe to filter out)")
print("="*60)
```

```
Similarity Analysis using all-MiniLM-L6-v2


Query: "The dog is playing in the park"
──────────────────────────────────────────────────
1. The dog is playing in the park           → 1.0000 (itself)
2. A puppy is running outside               → 0.3984
4. Python is a programming language         → 0.0987
6. I love coding in Python                  → 0.0902
3. The cat is sleeping on the couch         → 0.0714
5. Machine learning models need data        → -0.0052

Most similar → "A puppy is running outside" (0.398)
Least similar → "Machine learning models need data" (-0.005)

Query: "Python is a programming language"
──────────────────────────────────────────────────
4. Python is a programming language         → 1.0000 (itself)
6. I love coding in Python                  → 
```
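
To make the threshold question concrete, the scores printed above for the first query can be filtered directly. This is a sketch only: the scores are copied from the run shown above, `filter_by_threshold` is a hypothetical helper, and the two candidate thresholds are illustrative rather than recommended values.

```python
# Scores copied from the run above for the query "The dog is playing in the park".
scores = {
    "A puppy is running outside": 0.3984,
    "Python is a programming language": 0.0987,
    "I love coding in Python": 0.0902,
    "The cat is sleeping on the couch": 0.0714,
    "Machine learning models need data": -0.0052,
}

def filter_by_threshold(scores, threshold):
    """Keep only the sentences whose score is at or above the threshold."""
    return [sentence for sentence, score in scores.items() if score >= threshold]

for threshold in (0.58, 0.30):
    kept = filter_by_threshold(scores, threshold)
    print(f"threshold={threshold:.2f}: kept {len(kept)} -> {kept}")
```

With these particular scores, a cutoff of 0.58 keeps nothing, not even the puppy sentence at 0.3984, while 0.30 keeps only that sentence. A threshold is best chosen by inspecting the scores your own data produces.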

## Exercise 2: Chunk Size Impact on Retrieval

### Task
Understand how chunk size affects retrieval quality.

### Instructions

1. Use this long document:

```python
document = """
Artificial intelligence (AI) is intelligence demonstrated by machines, in contrast to
the natural intelligence displayed by humans and animals. Leading AI textbooks define
the field as the study of intelligent agents: any device that perceives its environment
and takes actions that maximize its chance of successfully achieving its goals.

Machine learning is a subset of artificial intelligence that focuses on the use of data
and algorithms to imitate the way that humans learn, gradually improving its accuracy.
Machine learning is an important component of the growing field of data science.

Deep learning is part of a broader family of machine learning methods based on artificial
neural networks with representation learning. Learning can be supervised, semi-supervised
or unsupervised. Deep learning architectures such as deep neural networks, deep belief
networks, recurrent neural networks and convolutional neural networks have been applied
to fields including computer vision, speech recognition, natural language processing,
machine translation, and bioinformatics.

Natural language processing is a subfield of linguistics, computer science, and artificial
intelligence concerned with the interactions between computers and human language, in
particular how to program computers to process and analyze large amounts of natural
language data. Challenges in natural language processing frequently involve speech
recognition, natural language understanding, and natural language generation.
"""
```

2. Chunk this document using 3 different sizes:
   - Small: 100 characters
   - Medium: 200 characters
   - Large: 400 characters

3. For each chunk size:
   - Create embeddings
   - Test with query: "What is machine learning?"
   - Retrieve top 3 chunks

4. Compare results:
   - Which chunk size gave the most focused answer?
   - Which gave the most complete answer?
   - Which had the best balance?
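
Step 2 can be sketched as a plain fixed-size character splitter. This is one possible approach, not the required one: `chunk_text` is a hypothetical helper, the `overlap` parameter is an assumption the exercise does not ask for (it defaults to 0), and the sample text is a placeholder rather than the document above.

```python
def chunk_text(text, chunk_size, overlap=0):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks

sample = "abcdefghij" * 25  # 250 characters of placeholder text
for size in (100, 200, 400):
    chunks = chunk_text(sample, size)
    print(size, len(chunks), [len(c) for c in chunks])
```

Splitting on raw character offsets will cut words and sentences in half, which is worth keeping in mind when you read the retrieved chunks.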

### Expected Output

```
Chunk Size Comparison:

Small Chunks (100 chars):
- Number of chunks: ?
- Top result: "?"
- Score: ?
- Analysis: ?

Medium Chunks (200 chars):
- Number of chunks: ?
- Top result: "?"
- Score: ?
- Analysis: ?

Large Chunks (400 chars):
- Number of chunks: ?
- Top result: "?"
- Score: ?
- Analysis: ?

Best chunk size for this use case: ? because ?
```
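
The "retrieve top 3 chunks" step amounts to ranking chunks by similarity score and keeping the best few. A minimal sketch using only the standard library follows; `top_k` is a hypothetical helper, and the scores and chunk labels are invented placeholders rather than real results.

```python
import heapq

def top_k(scored_chunks, k=3):
    """Return the k (score, chunk) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scored_chunks, key=lambda pair: pair[0])

# Placeholder scores and chunk labels, invented for illustration only.
scored_chunks = [
    (0.12, "chunk A"),
    (0.71, "chunk B"),
    (0.48, "chunk C"),
    (0.55, "chunk D"),
]

for rank, (score, chunk) in enumerate(top_k(scored_chunks), start=1):
    print(f"{rank}. [Score: {score:.4f}] {chunk}")
```

If fewer than `k` chunks exist, `heapq.nlargest` simply returns all of them, so no separate length check is needed.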


```python
from sentence_transformers import SentenceTransformer, util

# ----------------------------------------------------------------
# 1. Document & query
# ----------------------------------------------------------------
document = """
Artificial intelligence (AI) is intelligence demonstrated by machines, in contrast to
the natural intelligence displayed by humans and animals. Leading AI textbooks define
the field as the study of intelligent agents: any device that perceives its environment
and takes actions that maximize its chance of successfully achieving its goals.

Machine learning is a subset of artificial intelligence that focuses on the use of data
and algorithms to imitate the way that humans learn, gradually improving its accuracy.
Machine learning is an important component of the growing field of data science.

Deep learning is part of a broader family of machine learning methods based on artificial
neural networks with representation learning. Learning can be supervised, semi-supervised
or unsupervised. Deep learning architectures such as deep neural networks, deep belief
networks, recurrent neural networks and convolutional neural networks have been applied
to fields including computer vision, speech recognition, natural language processing,
machine translation, and bioinformatics.

Natural language processing is a subfield of linguistics, computer science, and artificial
intelligence concerned with the interactions between computers and human language, in
particular how to program computers to process and analyze large amounts of natural
language data. Challenges in natural language processing frequently involve speech
recognition, natural language understanding, and natural language generation.
""".strip()

query = "What is machine learning?"

# ----------------------------------------------------------------
# 2. Load model once
# ----------------------------------------------------------------
model = SentenceTransformer('all-MiniLM-L6-v2')
query_embedding = model.encode(query, convert_to_tensor=True)

# ----------------------------------------------------------------
# 3. Chunking function with overlap
# ----------------------------------------------------------------
def create_chunks(text, chunk_size, overlap=50):
    # An overlap >= chunk_size would never advance and loop forever
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end].strip())
        start = end - overlap
    # Drop fragments of 10 characters or fewer (e.g. a short tail at the end)
    chunks = [c for c in chunks if len(c) > 10]
    return chunks

# ----------------------------------------------------------------
# 4. Retrieval function
# ----------------------------------------------------------------
def retrieve_top3(chunks, name, chunk_size):
    print(f"\n{'.' * 60}")
    print(f"{name} (~{chunk_size} chars)")
    print(f"{'-' * 60}")
    print(f"Number of chunks: {len(chunks)}")

    chunk_embeddings = model.encode(chunks, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, chunk_embeddings)[0]

    # Get the top 3 (or fewer, if there are fewer than 3 chunks)
    topk = scores.topk(k=min(3, len(chunks)))
    top_scores = topk.values.tolist()
    top_indices = topk.indices.tolist()

    print("Top 3 retrieved chunks:")
    for i, (idx, score) in enumerate(zip(top_indices, top_scores), 1):
        chunk = chunks[idx].replace("\n", " ").strip()
        preview = chunk if len(chunk) <= 200 else chunk[:200] + " ..."
        print(f"  {i}. [Score: {score:.4f}] {preview}")

    best_chunk = chunks[top_indices[0]].replace("\n", " ")
    return {
        "name": name,
        "num_chunks": len(chunks),
        "top_score": top_scores[0],
        "top_chunk": best_chunk.strip()
    }

# ----------------------------------------------------------------
# 5. Run experiments
# ----------------------------------------------------------------
print("Chunk Size Impact on Retrieval".center(70, " "))
print("="*70)

results = []

# Small chunks
small_chunks = create_chunks(document, chunk_size=100, overlap=30)
small_res = retrieve_top3(small_chunks, "Small Chunks (100 chars)", 100)
results.append(small_res)

# Medium chunks
medium_chunks = create_chunks(document, chunk_size=200, overlap=50)
medium_res = retrieve_top3(medium_chunks, "Medium Chunks (200 chars)", 200)
results.append(medium_res)

# Large chunks
large_chunks = create_chunks(document, chunk_size=400, overlap=80)
large_res = retrieve_top3(large_chunks, "Large Chunks (400 chars)", 400)
results.append(large_res)

# ----------------------------------------------------------------
# 6. Final summary (exact format requested)
# Note: the analysis text is hard-coded commentary, not computed from the scores.
# ----------------------------------------------------------------
print("\n" + "="*70)
print("FINAL SUMMARY")
print("="*70)

for res in results:
    top_chunk = res['top_chunk']
    preview = top_chunk[:120] + ("..." if len(top_chunk) > 120 else "")
    print(f"\n{res['name']}:")
    print(f"- Number of chunks: {res['num_chunks']}")
    print(f"- Top result: \"{preview}\"")
    print(f"- Score: {res['top_score']:.4f}")
    if "Small" in res['name']:
        print("- Analysis: Very precise but lacks surrounding context")
    elif "Medium" in res['name']:
        print("- Analysis: Best balance: full definition + supporting sentence, highest score")
    else:
        print("- Analysis: More complete but includes unrelated deep learning details → lower relevance")

print("\n" + "Best chunk size for this use case:".upper())
# Report the measured score instead of a hard-coded number
print(f"Medium (200 characters) because it achieved the highest similarity score (≈{medium_res['top_score']:.2f})")
print("while returning the complete, self-contained definition of machine learning")
print("with just enough context, ideal for RAG when answering definitional questions.")
```

```
                    Chunk Size Impact on Retrieval
======================================================================

............................................................
Small Chunks (100 chars) (~100 chars)
------------------------------------------------------------
Number of chunks: 22
Top 3 retrieved chunks:
  1. [Score: 0.7589] at maximize its chance of successfully achieving its goals.  Machine learning is a subset of artific
  2. [Score: 0.5486] y improving its accuracy. Machine learning is an important component of the growing field of data sc
  3. [Score: 0.5451] Artificial intelligence (AI) is intelligence demonstrated by machines, in contrast to the natural in

............................................................
Medium Chunks (200 chars) (~200 chars)
------------------------------------------------------------
Number of chunks: 10
Top 3 retrieved chunks:
  1. [Score: 0.7963] ce of successfully achieving its goals.  Machine learning is a subset of artificial intelligence that focuses on the use of data 
```