# 📘 Hugging Face Embeddings + Semantic Search (CPU Only)

## 📌 Purpose of this Notebook
This notebook demonstrates how to build **text embeddings, semantic similarity, and basic semantic search** using **FREE Hugging Face models**, without relying on paid APIs like OpenAI or Perplexity for embeddings.

It is designed for:
- Learning **RAG fundamentals**
- Semantic search experiments
- CPU-only systems
- Offline-friendly workflows (after the one-time model download)

---

## 🖥️ System Assumptions
This notebook is optimized for a **CPU-based laptop** with:
- Intel i5 (12th Gen or similar)
- 16 GB RAM
- Windows / Linux / macOS
- No GPU required

---

## 🧠 Embedding Model Used
### ✅ `all-mpnet-base-v2`

**Why this model?**
- High semantic accuracy
- Widely used as a general-purpose model for semantic search and retrieval
- Works well on CPU
- Higher embedding quality than `all-MiniLM-L6-v2`, at the cost of slower inference
- Free and open-source

**Embedding size:** 768 dimensions  
**One-time download:** ~400 MB (cached locally)

---

## 📦 Libraries Used
- `sentence-transformers`
- `langchain-community`
- `numpy`

No API keys are required.

---

## 🧩 What This Notebook Covers
1. Initializing Hugging Face embeddings
2. Generating embeddings for:
   - Single text
   - Multiple texts
3. Computing cosine similarity
4. Comparing sentence similarity
5. Implementing a simple semantic search function
6. Preparing the foundation for RAG systems

---

## 🚫 What This Notebook Does NOT Do
- ❌ No OpenAI embeddings
- ❌ No Perplexity embeddings (not supported)
- ❌ No video/image embeddings directly
- ❌ No vector database (FAISS/Chroma not included here)

---

## 🎥 Note on Video & Audio
For video/audio use cases:
- Use **Whisper** to convert audio/video → text
- Then embed the transcript using this same pipeline

Pipeline:  
Video / Audio  
→ Whisper (speech-to-text)  
→ Text chunks  
→ Hugging Face embeddings  
→ Semantic search / RAG
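
The "Text chunks" step in the pipeline above could look roughly like the sketch below. It is only one possible approach: `chunk_words`, `chunk_size`, and `overlap` are hypothetical names that do not appear elsewhere in this notebook, and splitting on whitespace with a fixed overlap is an assumption, not a recommendation from Whisper or Hugging Face.

```python
def chunk_words(text, chunk_size=200, overlap=40):
    """Split text into overlapping chunks of up to `chunk_size` words.

    Hypothetical helper: the name and default sizes are illustrative only.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Stand-in for a transcript; a real one would come from speech-to-text.
transcript = "a b c d e f g h i j"
print(chunk_words(transcript, chunk_size=4, overlap=1))
# ['a b c d', 'd e f g', 'g h i j']
```

Each chunk can then be passed to `embed_documents` like any other list of texts. The overlap keeps a sentence that straddles a chunk boundary searchable from both sides.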


---

## 🧪 Typical Use Cases
- Semantic search over documents
- RAG (Retrieval-Augmented Generation)
- Transcript search (lectures, meetings, videos)
- Text similarity analysis
- NLP experimentation without API costs

---

## 🚀 Next Possible Extensions
- Add FAISS or Chroma for vector storage
- Load PDFs and text files
- Combine with Perplexity or other LLMs for answers
- Add timestamps from Whisper transcripts
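
As a rough sketch of the timestamp idea, the code below groups timed transcript segments into larger chunks while keeping each chunk's start and end time. It assumes segments shaped like the `segments` list returned by the `openai-whisper` package (dicts with `start`, `end`, and `text`). The helper name `merge_segments`, the `max_chars` limit, and the sample segments are all made up for illustration.

```python
def merge_segments(segments, max_chars=200):
    """Group consecutive timed segments into chunks of at most max_chars.

    Each input segment is a dict with "start", "end" (seconds) and "text".
    A single segment longer than max_chars is kept as its own chunk.
    """
    chunks = []
    current = None
    for seg in segments:
        text = seg["text"].strip()
        fits = current is not None and len(current["text"]) + 1 + len(text) <= max_chars
        if fits:
            # Extend the open chunk and push its end time forward.
            current["text"] += " " + text
            current["end"] = seg["end"]
        else:
            if current is not None:
                chunks.append(current)
            current = {"start": seg["start"], "end": seg["end"], "text": text}
    if current is not None:
        chunks.append(current)
    return chunks


# Made-up segments in the assumed shape (not real Whisper output).
segments = [
    {"start": 0.0, "end": 2.5, "text": " Welcome to the lecture."},
    {"start": 2.5, "end": 5.0, "text": " Today we cover embeddings."},
    {"start": 5.0, "end": 9.0, "text": " Embeddings map text to vectors."},
]
for chunk in merge_segments(segments, max_chars=60):
    print(f'{chunk["start"]:.1f}-{chunk["end"]:.1f}s: {chunk["text"]}')
```

Embedding `chunk["text"]` and storing the times alongside it would let a search result point back to a position in the recording.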

---

## 📝 Reminder for Future Me
If something breaks:
1. Check that the intended Python environment (and notebook kernel) is active
2. Ensure `sentence-transformers` is installed
3. Make sure the model is fully downloaded
4. Restart the kernel and rerun all cells
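
The first two checks can be scripted. The sketch below uses only the standard library to report which of the libraries listed in this notebook are importable in the current kernel. `REQUIRED` and `check_environment` are hypothetical names, and the mapping from import names to pip distribution names is an assumption based on the library list above.

```python
import importlib.util
import sys
from importlib import metadata

# Import name -> pip distribution name for the libraries this notebook lists.
REQUIRED = {
    "numpy": "numpy",
    "sentence_transformers": "sentence-transformers",
    "langchain_community": "langchain-community",
}


def check_environment(required=REQUIRED):
    """Return {distribution name: version string, or None if not importable}."""
    report = {}
    for module_name, dist_name in required.items():
        if importlib.util.find_spec(module_name) is None:
            report[dist_name] = None
            continue
        try:
            report[dist_name] = metadata.version(dist_name)
        except metadata.PackageNotFoundError:
            # Importable, but no matching distribution metadata was found.
            report[dist_name] = "unknown"
    return report


print("Python:", sys.version.split()[0])
for name, version in check_environment().items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Running this in the notebook itself (rather than in a terminal) matters, because the kernel may use a different environment from the shell.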

---

## ✅ Status
✔ Stable  
✔ Free  
✔ Offline-friendly  
✔ CPU-optimized  

### Hugging Face Embeddings

In [2]:
import numpy as np
from langchain_community.embeddings import HuggingFaceEmbeddings

In [8]:
# Recommended Hugging Face embedding models (CPU-friendly):
# - all-mpnet-base-v2              → 768 dims | Best accuracy for semantic search & RAG (default)
# - all-MiniLM-L6-v2               → 384 dims | Faster & lighter, slightly lower accuracy
# - BAAI/bge-base-en-v1.5          → 768 dims | Strong retrieval-focused model
# - intfloat/e5-base-v2            → 768 dims | High-quality English embeddings
# - intfloat/multilingual-e5-base  → 768 dims | Multilingual support (100+ languages)

embeddings = HuggingFaceEmbeddings(
    model_name="all-mpnet-base-v2"
)
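
The cell above relies on default settings. A slightly more explicit setup, sketched below, pins the device to CPU and asks for unit-length vectors through the `model_kwargs` and `encode_kwargs` parameters of `HuggingFaceEmbeddings`. The helper names `build_embedding_config` and `load_embeddings` are hypothetical, and whether normalization changes anything depends on the model, since some models already return unit-length vectors.

```python
def build_embedding_config(model_name="all-mpnet-base-v2", device="cpu", normalize=True):
    """Collect keyword arguments for HuggingFaceEmbeddings in one place."""
    return {
        "model_name": model_name,
        # Forwarded to the underlying SentenceTransformer constructor.
        "model_kwargs": {"device": device},
        # Forwarded to SentenceTransformer.encode(); with unit-length
        # vectors the dot product equals the cosine similarity.
        "encode_kwargs": {"normalize_embeddings": normalize},
    }


def load_embeddings(**overrides):
    """Create the embedding model (downloads the weights on first use)."""
    # Imported lazily so the config helper works without LangChain installed.
    from langchain_community.embeddings import HuggingFaceEmbeddings

    return HuggingFaceEmbeddings(**build_embedding_config(**overrides))


print(build_embedding_config())
```

Calling `load_embeddings(model_name="all-MiniLM-L6-v2")` would swap in the lighter model from the list above without touching the other settings.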


In [9]:
## Example 1: Single text embeddings
single_text = "LangChain and RAG are amazing frameworks to work on"

single_embedding = embeddings.embed_query(single_text)

print("Vector length:", len(single_embedding))
print("Sample values:", single_embedding[:5])

Vector length: 768
Sample values: [0.002860752632841468, 0.05132891237735748, -0.04515519738197327, 0.015878301113843918, -0.024549704045057297]


In [7]:
## Example 2: Multiple text embeddings
texts = [
    "Python is a programming language",
    "LangChain is a framework for LLM applications",
    "Embeddings convert text to numbers",
    "Vectors can be compared for similarity"
]

text_embeddings = embeddings.embed_documents(texts)

print("Number of texts:", len(texts))
print("Embedding size:", len(text_embeddings[0]))

Number of texts: 4
Embedding size: 768


### Cosine Similarity with Hugging Face Embeddings

In [12]:
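def cosine_similarity(vec1, vec2):
    """Return the cosine of the angle between two vectors (range -1 to 1)."""
    dot = np.dot(vec1, vec2)
    norm1 = np.linalg.norm(vec1)
    norm2 = np.linalg.norm(vec2)
    # Assumes neither vector is all zeros; a zero norm would divide by zero.
    return dot / (norm1 * norm2)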
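
To build intuition for what this function returns, here is the same formula, cos θ = (u · v) / (‖u‖ ‖v‖), applied to tiny hand-made 2-D vectors. The sketch is restated without NumPy so it runs on its own. The function name `cosine` and the example vectors are illustrative and not part of the notebook.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length sequences of numbers."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine([1, 0], [2, 0]))   # same direction, different length -> 1.0
print(cosine([1, 0], [0, 3]))   # perpendicular -> 0.0
print(cosine([1, 0], [-4, 0]))  # opposite direction -> -1.0
```

The score depends only on direction, not on vector length, which is why it is a common choice for comparing embeddings.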

In [13]:
sentences = [
    "The cat sat on the mat",
    "A feline rested on the rug",
    "The dog played in the yard",
    "I love programming in Python",
    "Python is my favorite programming language"
]

sentence_embeddings = embeddings.embed_documents(sentences)
sentence_embeddings

[[-0.004790397826582193,
  -0.04454632103443146,
  -0.01294713094830513,
  -0.015809927135705948,
  0.016214989125728607,
  ... (output truncated; 5 vectors of 768 values each)

In [15]:
## Calculate the similarity between all the pairs

for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        sim = cosine_similarity(sentence_embeddings[i], sentence_embeddings[j])
        print(f"{sentences[i]}  <->  {sentences[j]}")
        print(f"Similarity: {sim:.3f}\n")

The cat sat on the mat  <->  A feline rested on the rug
Similarity: 0.690

The cat sat on the mat  <->  The dog played in the yard
Similarity: 0.044

The cat sat on the mat  <->  I love programming in Python
Similarity: 0.002

The cat sat on the mat  <->  Python is my favorite programming language
Similarity: 0.024

A feline rested on the rug  <->  The dog played in the yard
Similarity: 0.101

A feline rested on the rug  <->  I love programming in Python
Similarity: 0.027

A feline rested on the rug  <->  Python is my favorite programming language
Similarity: 0.020

The dog played in the yard  <->  I love programming in Python
Similarity: 0.030

The dog played in the yard  <->  Python is my favorite programming language
Similarity: 0.037

I love programming in Python  <->  Python is my favorite programming language
Similarity: 0.886



In [16]:
## Example - Semantic Search - Retrieve the most similar sentences

def semantic_search(query, documents, embeddings_model, top_k=3):
    """Return the top_k (score, document) pairs most similar to the query."""

    # Embed the query and the documents
    query_embedding = embeddings_model.embed_query(query)
    doc_embeddings = embeddings_model.embed_documents(documents)

    # Calculate the similarity score for each document
    scores = []
    for i, doc_emb in enumerate(doc_embeddings):
        score = cosine_similarity(query_embedding, doc_emb)
        scores.append((score, documents[i]))

    # Sort by similarity only, highest first (ties keep document order)
    scores.sort(key=lambda pair: pair[0], reverse=True)
    return scores[:top_k]
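
`semantic_search` re-embeds every document on each call, which is fine for a handful of sentences but wasteful when many queries hit the same corpus. One possible alternative, sketched below, embeds the documents once and scores a query with a single matrix-vector product. `InMemoryIndex` and `ToyModel` are hypothetical names. `ToyModel` is a stand-in embedder so the sketch runs without downloading a model, and it is not a replacement for a vector database such as FAISS or Chroma.

```python
import numpy as np


class InMemoryIndex:
    """Embed documents once, then answer many queries against them.

    Hypothetical helper. `embeddings_model` is any object exposing
    embed_documents(list_of_str) and embed_query(str), as used above.
    """

    def __init__(self, documents, embeddings_model):
        self.documents = list(documents)
        self.model = embeddings_model
        matrix = np.asarray(embeddings_model.embed_documents(self.documents), dtype=float)
        # Normalize rows once so a dot product equals cosine similarity.
        self.matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)

    def search(self, query, top_k=3):
        q = np.asarray(self.model.embed_query(query), dtype=float)
        q = q / np.linalg.norm(q)
        scores = self.matrix @ q                 # one score per document
        best = np.argsort(scores)[::-1][:top_k]  # indices, highest first
        return [(float(scores[i]), self.documents[i]) for i in best]


class ToyModel:
    """Stand-in embedder: counts of the letters "a" and "e", plus one."""

    def embed_query(self, text):
        return [text.count("a") + 1.0, text.count("e") + 1.0]

    def embed_documents(self, texts):
        return [self.embed_query(t) for t in texts]


index = InMemoryIndex(["banana", "eleven", "tree"], ToyModel())
print(index.search("bee", top_k=2))
```

With the real model, `InMemoryIndex(documents, embeddings)` would be built once and `index.search(query)` called per query. The scores are returned as plain Python floats.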


In [20]:
documents = [
    "LangChain is a framework for developing applications powered by language models",
    "Python is a high-level programming language",
    "Machine learning is a subset of artificial intelligence",
    "Embeddings convert text into numerical vectors",
    "The weather today is sunny and warm"
]

# query = "What is LangChain?"
query = "How is weather today?"


results = semantic_search(query, documents, embeddings)
print(results)
print(f"\nResults for: {query}")
for score, doc in results:
    print(f"Score: {score:.3f} | {doc}")

[(np.float64(0.6998282765530333), 'The weather today is sunny and warm'), (np.float64(0.02372433280134619), 'Embeddings convert text into numerical vectors'), (np.float64(0.019462305184657445), 'Machine learning is a subset of artificial intelligence')]

Results for: How is weather today?
Score: 0.700 | The weather today is sunny and warm
Score: 0.024 | Embeddings convert text into numerical vectors
Score: 0.019 | Machine learning is a subset of artificial intelligence
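
In the output above, only the first result is actually about the weather. The other two are returned because `top_k=3` always yields three items. A minimum-score cutoff is one simple way to drop such matches. In the sketch below, `filter_by_score` is a hypothetical helper and the threshold of 0.3 is an arbitrary assumption that would need tuning for each model and dataset.

```python
def filter_by_score(results, min_score=0.3):
    """Keep only (score, document) pairs whose score is at least min_score."""
    return [(score, doc) for score, doc in results if score >= min_score]


# Scores rounded from the output shown above.
results = [
    (0.700, "The weather today is sunny and warm"),
    (0.024, "Embeddings convert text into numerical vectors"),
    (0.019, "Machine learning is a subset of artificial intelligence"),
]
print(filter_by_score(results))
# [(0.7, 'The weather today is sunny and warm')]
```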
