# Update Blog Data

This notebook demonstrates how to update the blog data and vector store when new blog posts are published. It uses the utility functions from `utils_data_loading.ipynb`.

In [1]:
import sys
import os
from pathlib import Path
from dotenv import load_dotenv

# Add the project root to the Python path
package_root = os.path.abspath(os.path.join(os.getcwd(), "../"))
print(f"Adding package root to sys.path: {package_root}")
if package_root not in sys.path:
    sys.path.append(package_root)


Adding package root to sys.path: /home/mafzaal/source/lets-talk/py-src


In [2]:
notebook_dir = os.getcwd()
print(f"Current notebook directory: {notebook_dir}")
# Change the working directory to the project root
project_root = os.path.abspath(os.path.join(os.getcwd(), "../../"))
print(f"Project root: {project_root}")
os.chdir(project_root)

Current notebook directory: /home/mafzaal/source/lets-talk/py-src/notebooks
Project root: /home/mafzaal/source/lets-talk
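
The two cells above locate the project root with fixed relative hops (`"../"` and `"../../"`), which only works when the notebook is started from the `notebooks` directory. As a minimal sketch of an alternative, the helper below walks upward until it finds a marker file. Both the function name `find_project_root` and the `pyproject.toml` marker are assumptions for illustration and are not part of this project.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "pyproject.toml") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found.

    `marker` is a hypothetical choice; any file that exists only at the
    project root would serve the same purpose.
    """
    start = start.resolve()
    # Check the starting directory first, then each ancestor in turn.
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No {marker!r} found at or above {start}")
```

With such a helper, `os.chdir(find_project_root(Path.cwd()))` would replace the hard-coded `"../../"` hop, under the assumption that the marker file exists at the root.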


## Update Blog Data Process

This process will:
1. Load existing blog posts
2. Process and update metadata
3. Split the documents into chunks
4. Create or update vector embeddings

In [3]:
import lets_talk.utils.blog as blog_utils
docs = blog_utils.load_blog_posts(data_dir="/home/mafzaal/source/mafzaal.github.io/posts", glob_pattern="index.md")




100%|██████████| 20/20 [00:00<00:00, 2876.16it/s]

Loaded 20 documents from /home/mafzaal/source/mafzaal.github.io/posts
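
The implementation of `load_blog_posts` is not shown in this notebook. As a rough sketch of what the `data_dir` and `glob_pattern` arguments select, the standard-library helper below lists every matching file under a directory. The name `find_post_files` is hypothetical, and the real loader may discover files differently.

```python
from pathlib import Path


def find_post_files(data_dir: str, glob_pattern: str = "index.md") -> list[Path]:
    """Return every file under `data_dir` matching `glob_pattern`, sorted by path.

    This only illustrates file discovery; it does not read or parse the posts.
    """
    return sorted(Path(data_dir).rglob(glob_pattern))
```

Comparing `len(find_post_files(...))` with the "Loaded N documents" message is a quick way to check that no post was skipped.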





In [4]:
for doc in docs:
    print(doc.metadata["source"])

/home/mafzaal/source/mafzaal.github.io/posts/introduction-to-ragas/index.md
/home/mafzaal/source/mafzaal.github.io/posts/generating-test-data-with-ragas/index.md
/home/mafzaal/source/mafzaal.github.io/posts/advanced-metrics-and-customization-with-ragas/index.md
/home/mafzaal/source/mafzaal.github.io/posts/building-research-agent/index.md
/home/mafzaal/source/mafzaal.github.io/posts/lets-talk-ai-chat-component/index.md
/home/mafzaal/source/mafzaal.github.io/posts/rss-feed-announcement/index.md
/home/mafzaal/source/mafzaal.github.io/posts/metric-driven-development/index.md
/home/mafzaal/source/mafzaal.github.io/posts/basic-evaluation-workflow-with-ragas/index.md
/home/mafzaal/source/mafzaal.github.io/posts/langchain-experience-csharp-perspective/index.md
/home/mafzaal/source/mafzaal.github.io/posts/evaluating-ai-agents-with-ragas/index.md
/home/mafzaal/source/mafzaal.github.io/posts/integrations-and-observability-with-ragas/index.md
/home/mafzaal/source/mafzaal.github.io/posts/building-f

In [5]:
docs_with_data = blog_utils.update_document_metadata(docs, data_dir_prefix="/home/mafzaal/source/mafzaal.github.io/posts/")

In [6]:
docs_with_data

[Document(metadata={'source': '/home/mafzaal/source/mafzaal.github.io/posts/introduction-to-ragas/index.md', 'url': 'https://thedataguy.pro/blog/introduction-to-ragas/', 'post_slug': 'introduction-to-ragas', 'post_title': 'Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications', 'cover_image': 'https://images.unsplash.com/photo-1593642634367-d91a135587b5?q=80&w=1770&auto=format&fit=crop&ixlib=rb-4.0.3', 'date': '2025-04-26T18:00:00-06:00', 'categories': ['AI', 'RAG', 'Evaluation', 'Ragas'], 'description': 'Explore the essential evaluation framework for LLM applications with Ragas. Learn how to assess performance, ensure accuracy, and improve reliability in Retrieval-Augmented Generation systems.', 'reading_time': '7', 'published': True, 'content_length': 6994}, page_content='---\ntitle: "Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications"\ndate: 2025-04-26T18:00:00-06:00\nlayout: blog\ndescription: "Explore the essenti
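
In the output above, `post_slug` and `url` appear to follow from the `source` path once `data_dir_prefix` is removed. The sketch below reproduces that mapping for illustration only. The function name `derive_post_fields` is hypothetical, the base URL is copied from the `url` values in the output, and the actual logic inside `update_document_metadata` may differ.

```python
from pathlib import PurePosixPath

# Assumption: base URL taken from the 'url' values shown in the output.
BASE_URL = "https://thedataguy.pro/blog/"


def derive_post_fields(source: str, data_dir_prefix: str) -> dict[str, str]:
    """Derive `post_slug` and `url` from a post's source path."""
    # Strip the data directory, leaving e.g. "introduction-to-ragas/index.md".
    relative = PurePosixPath(source).relative_to(data_dir_prefix)
    # The slug is the directory that holds index.md.
    post_slug = relative.parent.as_posix()
    return {"post_slug": post_slug, "url": f"{BASE_URL}{post_slug}/"}
```

The remaining fields, such as `post_title`, `date`, and `categories`, presumably come from each post's front matter and are not covered by this sketch.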

In [7]:
# Split the metadata-enriched documents into chunks
split_docs = blog_utils.split_documents(docs_with_data)

Split 20 documents into 227 chunks
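
Splitting turns 20 posts into 227 chunks, so each post contributes several chunks on average. How `split_documents` chooses chunk boundaries is not shown here. The following is a minimal character-based sketch of chunking with overlap. The function name `split_text` and the `chunk_size` and `chunk_overlap` defaults are assumptions, not the values the project uses.

```python
def split_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split `text` into chunks of at most `chunk_size` characters.

    Consecutive chunks share `chunk_overlap` characters so that a sentence
    cut at a boundary still appears whole in one of the two chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks
```

A document-level splitter would also copy the parent post's metadata onto every chunk, which is consistent with the chunk printed in the next cell carrying the same `url` and `post_title` as its post.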


In [8]:
split_docs[0]

Document(metadata={'source': '/home/mafzaal/source/mafzaal.github.io/posts/introduction-to-ragas/index.md', 'url': 'https://thedataguy.pro/blog/introduction-to-ragas/', 'post_slug': 'introduction-to-ragas', 'post_title': 'Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications', 'cover_image': 'https://images.unsplash.com/photo-1593642634367-d91a135587b5?q=80&w=1770&auto=format&fit=crop&ixlib=rb-4.0.3', 'date': '2025-04-26T18:00:00-06:00', 'categories': ['AI', 'RAG', 'Evaluation', 'Ragas'], 'description': 'Explore the essential evaluation framework for LLM applications with Ragas. Learn how to assess performance, ensure accuracy, and improve reliability in Retrieval-Augmented Generation systems.', 'reading_time': '7', 'published': True, 'content_length': 6994}, page_content='---\ntitle: "Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications"\ndate: 2025-04-26T18:00:00-06:00\nlayout: blog\ndescription: "Explore the essentia

In [9]:
vector_store = blog_utils.create_vector_store(split_docs, './db/vector_store_tdg_2')


## Testing the Vector Store

Let's test the vector store with a few queries to make sure it's working correctly.

In [10]:
# Create a retriever from the vector store
retriever = vector_store.as_retriever(search_kwargs={"k": 3})

# Test queries
test_queries = [
    "What is RAGAS?",
    "How to build research agents?",
    "What is metric driven development?",
    "Who is TheDataGuy?"
]

for query in test_queries:
    print(f"\nQuery: {query}")
    # Use a separate name so the loaded `docs` list is not overwritten
    results = retriever.invoke(query)
    print(f"Retrieved {len(results)} documents:")
    for i, doc in enumerate(results):
        title = doc.metadata.get("post_title", "Unknown")
        url = doc.metadata.get("url", "No URL")
        print(f"{i+1}. {title} ({url})")


Query: What is RAGAS?
Retrieved 3 documents:
1. Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications (https://thedataguy.pro/blog/introduction-to-ragas/)
2. Part 8: Building Feedback Loops with Ragas (https://thedataguy.pro/blog/building-feedback-loops-with-ragas/)
3. Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications (https://thedataguy.pro/blog/introduction-to-ragas/)

Query: How to build research agents?
Retrieved 3 documents:
1. Building a Research Agent with RSS Feed Support (https://thedataguy.pro/blog/building-research-agent/)
2. Building a Research Agent with RSS Feed Support (https://thedataguy.pro/blog/building-research-agent/)
3. Building a Research Agent with RSS Feed Support (https://thedataguy.pro/blog/building-research-agent/)

Query: What is metric driven development?
Retrieved 3 documents:
1. Metric-Driven Development: Make Smarter Decisions, Faster (https://thedataguy.pro/blog/metric-driven-develop
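
Because the store indexes chunks rather than whole posts, one post can fill several of the `k` slots, as the repeated titles in the output above show. When only the distinct posts matter, the results can be collapsed by URL. The sketch below uses a small stand-in `Doc` class so that it runs on its own. Both `Doc` and `unique_posts` are hypothetical names, and the only assumption about retrieved documents is that they expose a `metadata` dict.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a retrieved document; only `metadata` is used."""
    metadata: dict = field(default_factory=dict)


def unique_posts(results) -> list[tuple[str, str]]:
    """Return (title, url) pairs in ranking order, keeping each URL once."""
    seen = set()
    posts = []
    for doc in results:
        url = doc.metadata.get("url", "No URL")
        if url in seen:
            continue  # A higher-ranked chunk from this post was already kept.
        seen.add(url)
        posts.append((doc.metadata.get("post_title", "Unknown"), url))
    return posts
```

Applied to the results for "What is RAGAS?", this would leave two entries instead of three, since the first and third chunks belong to the same post.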

In [11]:
vector_store.client.close()