<a href="https://colab.research.google.com/github/jeff-ai-ml/genai/blob/main/recommendation_with_date_and_title_VDB_chromadb_16_07_2025.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Semantic Search

This walkthrough shows how to use ChromaDB for semantic search by building a small news recommender that can filter results by category and date. To begin, install the required libraries:

In [1]:
!pip install chromadb sentence-transformers


In [2]:
import chromadb
from sentence_transformers import SentenceTransformer
from datetime import datetime

---

🚨 _Note: the above `pip install` is formatted for Jupyter notebooks. If running elsewhere you may need to drop the `!`._

---

## Initialize the ChromaDB Client

Create an in-memory ChromaDB client (recent versions no longer need a `Settings` object), then set up the collection and load the embedding model.

In [3]:
# Step 1: Initialize the ChromaDB client (in-memory; no Settings object needed)
client = chromadb.Client()

# Step 2: Set up the collection.
# Depending on the chromadb version, list_collections() returns Collection
# objects rather than name strings, so a membership test on the name can
# fail. get_or_create_collection avoids that check entirely.
# Cosine space makes "1 - distance" a cosine similarity at query time.
collection_name = "news_articles"
collection = client.get_or_create_collection(
    name=collection_name,
    metadata={"hnsw:space": "cosine"},
)

# Load Pre-trained Embedding Model
model = SentenceTransformer("all-MiniLM-L6-v2")

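The model loaded above turns each article into a vector (384 dimensions for `all-MiniLM-L6-v2`), and semantic search ranks the stored vectors by how close they are to the query vector. The sketch below illustrates that ranking step with plain Python. The three-dimensional vectors and the names `toy_docs`, `toy_query`, and `rank` are made up for illustration; they are not part of ChromaDB or of this notebook's data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return (doc_id, similarity) pairs, most similar first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy vectors standing in for real sentence embeddings
toy_docs = {
    "cricket_report": [0.9, 0.1, 0.0],
    "chip_market": [0.1, 0.9, 0.2],
}
toy_query = [0.8, 0.2, 0.1]

ranking = rank(toy_query, toy_docs)
print(ranking[0][0])  # id of the closest document
```

A vector database such as ChromaDB performs the same kind of comparison, but with an approximate nearest-neighbour index so that it does not have to scan every stored vector.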

In [4]:
def add_articles():
    """Function to add news articles with metadata to the database."""
    print("Enter news articles (type 'done' to finish):")
    articles = []
    categories = []
    dates = []

    while True:
        article = input("Article: ")
        if article.lower() == "done":
            break

        # Normalize the category so it matches the lowercased filter used when querying
        category = input("Category (e.g., technology, science, or type 'done' to finish): ").strip().lower()
        if category.lower() == "done":
            break

        date = input("Date (YYYY-MM-DD, or type 'done' to finish): ")
        if date.lower() == "done":
            break


        # Validate date
        try:
            date = datetime.strptime(date, "%Y-%m-%d").strftime("%Y-%m-%d")
        except ValueError:
            print("Invalid date format. Try again.")
            continue

        articles.append(article)
        categories.append(category)
        dates.append(date)

    if articles:
        embeddings = model.encode(articles).tolist()
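        # Offset IDs by the current collection size so a second call does not reuse earlier IDs
        start = collection.count()
        ids = [f"article_{start + i}" for i in range(len(articles))]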
        metadata = [{"category": cat, "date": date} for cat, date in zip(categories, dates)]
        collection.add(documents=articles, embeddings=embeddings, metadatas=metadata, ids=ids)
        print(f"{len(articles)} articles added to the collection.")
    else:
        print("No articles were added.")
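Chroma's range operators (`$gt`, `$gte`, `$lt`, `$lte`) compare numeric metadata values, so a date stored only as a `YYYY-MM-DD` string cannot be range-filtered inside the database. One option, sketched below, is to store an integer form of the date next to the string. The `date_int` field and the helpers `date_to_int` and `build_metadata` are hypothetical names introduced for this example, not something the notebook defines.

```python
from datetime import datetime


def date_to_int(date_str):
    """Validate a YYYY-MM-DD string and return it as an integer such as 20250714."""
    parsed = datetime.strptime(date_str, "%Y-%m-%d")  # raises ValueError on bad input
    return parsed.year * 10000 + parsed.month * 100 + parsed.day


def build_metadata(category, date_str):
    """Build a metadata dict with a normalized category and both date forms."""
    parsed = datetime.strptime(date_str, "%Y-%m-%d")
    return {
        "category": category.strip().lower(),
        "date": parsed.strftime("%Y-%m-%d"),  # zero-padded, human-readable
        "date_int": date_to_int(date_str),    # hypothetical numeric field for range filters
    }


example = build_metadata(" Sports ", "2025-07-14")
print(example)
```

Because the integer keeps year, month, and day in that order, numeric comparison gives the same ordering as the calendar.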

In [5]:
def recommend_articles():
    """Function to recommend articles based on user preferences and filters."""
    preference = input("Describe the type of news you're interested in: ")
    category_filter = input("Filter by category (or leave blank): ").lower().strip()
    date_filter = input("Filter by date range (YYYY-MM-DD to YYYY-MM-DD, or leave blank): ").strip()

    # Parse date range if provided
    start_date, end_date = None, None
    if date_filter:
        try:
            start_date_str, end_date_str = date_filter.split(" to ")
            start_date = datetime.strptime(start_date_str.strip(), "%Y-%m-%d").strftime("%Y-%m-%d")
            end_date = datetime.strptime(end_date_str.strip(), "%Y-%m-%d").strftime("%Y-%m-%d")
        except ValueError:
            # Reset both so a half-parsed range is never applied
            start_date, end_date = None, None
            print("Invalid date range format. Ignoring date filter.")

    if collection.count() == 0:
        print("No articles in the collection yet. Add some first.\n")
        return

    query_embedding = model.encode([preference]).tolist()[0]

    # Only the category goes into Chroma's where filter: a where dict takes a
    # single top-level condition, and $gte/$lte compare numbers, not strings.
    where_filter = {"category": category_filter} if category_filter else None

    # Fetch every candidate so the date filter below cannot starve the top 5
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=collection.count(),
        where=where_filter,
    )

    hits = []
    if results and results["documents"] and results["documents"][0]:
        for doc, score, meta in zip(results["documents"][0], results["distances"][0], results["metadatas"][0]):
            # ISO dates (YYYY-MM-DD) sort lexicographically, so string comparison is a valid range check
            if start_date and end_date and not (start_date <= meta["date"] <= end_date):
                continue
            hits.append((doc, score, meta))

    print("\nRecommended Articles:")
    if hits:
        for i, (doc, score, meta) in enumerate(hits[:5]):
            # 1 - distance equals cosine similarity only when the collection uses cosine space
            print(f"{i + 1}. {doc} | Relevance Score: {1 - score:.4f} | Category: {meta['category']} | Date: {meta['date']}")
    else:
        print("No articles found matching your criteria.")

    print()
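Chroma's `where` argument expects a single top-level condition, so several conditions have to be combined under `$and`, and its range operators work on numeric values. The sketch below shows one way to assemble such a filter. It assumes a hypothetical integer `date_int` metadata field (for example `20250714`) that the notebook does not store by default, and `build_where` is an illustrative helper, not a ChromaDB function.

```python
def build_where(category=None, start_int=None, end_int=None):
    """Assemble a Chroma-style where filter, or None when nothing is filtered."""
    conditions = []
    if category:
        conditions.append({"category": category})
    if start_int is not None:
        conditions.append({"date_int": {"$gte": start_int}})  # hypothetical numeric date field
    if end_int is not None:
        conditions.append({"date_int": {"$lte": end_int}})

    if not conditions:
        return None           # no filter at all
    if len(conditions) == 1:
        return conditions[0]  # a single condition needs no $and wrapper
    return {"$and": conditions}


where = build_where("sports", 20250601, 20250731)
print(where)
```

The result could then be passed as `where=` to `collection.query(...)`, provided the articles were added with a matching numeric field.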

In [6]:
def main():
    """Main function for the news recommendation system."""
    print("Welcome to the Enhanced News Recommendation System!")
    while True:
        print("\nOptions:")
        print("1. Add news articles")
        print("2. Get news recommendations")
        print("3. Exit")
        choice = input("Choose an option: ")

        if choice == "1":
            add_articles()
        elif choice == "2":
            recommend_articles()
        elif choice == "3":
            print("Goodbye!")
            break
        else:
            print("Invalid option. Please try again.")

# Run the application
if __name__ == "__main__":
    main()

Welcome to the Enhanced News Recommendation System!

Options:
1. Add news articles
2. Get news recommendations
3. Exit
Choose an option: 1
Enter news articles (type 'done' to finish):
Article: Jadeja's defiance in vain as England pull off dramatic win
Category (e.g., technology, science, or type 'done' to finish): sports
Date (YYYY-MM-DD, or type 'done' to finish): 2025-07-14
Article: Akash Deep ten-for seals statement win for India
Category (e.g., technology, science, or type 'done' to finish): sports
Date (YYYY-MM-DD, or type 'done' to finish): 2025-07-06
Article: Duckett 149 lays the foundation as England hunt down 371 at Headingley
Category (e.g., technology, science, or type 'done' to finish): sports
Date (YYYY-MM-DD, or type 'done' to finish): 2025-06-25
Article: Tensor Processing Unit (TPU) Market to Reach USD 31.60 Billion by 2032 
Category (e.g., technology, science, or type 'done' to finish): technology
Date (YYYY-MM-DD, or type 'done' to finish): 2025-04-07
Article: Netflix 