# Update Blog Data

This notebook demonstrates how to update the blog data and vector store when new blog posts are published. It relies on the utility functions in `lets_talk.utils.blog`, imported in the cells that follow as `blog_utils`.

In [24]:
import sys
import os
from pathlib import Path
from dotenv import load_dotenv

# Load environment variables from a .env file, if one is found
load_dotenv()

# Add the project root to the Python path
package_root = os.path.abspath(os.path.join(os.getcwd(), "../"))
print(f"Adding package root to sys.path: {package_root}")
if package_root not in sys.path:
    sys.path.append(package_root)


Adding package root to sys.path: /home/mafzaal/source


In [25]:
notebook_dir = os.getcwd()
print(f"Current notebook directory: {notebook_dir}")
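# Change the working directory to the project root.
# Note: the path is relative to the current working directory,
# so re-running this cell moves up two more levels.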
project_root = os.path.abspath(os.path.join(os.getcwd(), "../../"))
print(f"Project root: {project_root}")
os.chdir(project_root)

Current notebook directory: /home/mafzaal/source/lets-talk
Project root: /home/mafzaal


## Update Blog Data Process

This process will:
1. Load existing blog posts
2. Process and update metadata
3. Split the posts into chunks
4. Create or update vector embeddings

In [3]:
import lets_talk.utils.blog as blog_utils

docs = blog_utils.load_blog_posts(
    data_dir="/home/mafzaal/source/mafzaal.github.io/posts",
    glob_pattern="index.md",
)




100%|██████████| 20/20 [00:00<00:00, 3227.13it/s]

Loaded 20 documents from /home/mafzaal/source/mafzaal.github.io/posts





In [4]:
# Group the loaded documents by their source path
docs_by_source = {}
for doc in docs:
    source = doc.metadata.get("source", "unknown")
    if source not in docs_by_source:
        docs_by_source[source] = []
    docs_by_source[source].append(doc)
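
The same grouping can be written a little more compactly with `collections.defaultdict`. The sketch below is only an illustration: `Doc` is a hypothetical stand-in for the LangChain `Document` class, and the sample paths are made up, so that the snippet runs without the blog repository.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (not the real class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def group_by_source(documents):
    """Group documents by their 'source' metadata, defaulting to 'unknown'."""
    grouped = defaultdict(list)
    for doc in documents:
        grouped[doc.metadata.get("source", "unknown")].append(doc)
    return dict(grouped)


# Made-up sample data, for illustration only
sample_docs = [
    Doc("first post", {"source": "posts/a/index.md"}),
    Doc("second post", {"source": "posts/b/index.md"}),
    Doc("no source recorded"),
]
grouped_docs = group_by_source(sample_docs)
print(sorted(grouped_docs))
```

Converting back to a plain `dict` at the end avoids silently creating empty entries when a missing key is looked up later.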

In [5]:
docs_by_source['/home/mafzaal/source/mafzaal.github.io/posts/introduction-to-ragas/index.md']

[Document(metadata={'source': '/home/mafzaal/source/mafzaal.github.io/posts/introduction-to-ragas/index.md'}, page_content='---\ntitle: "Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications"\ndate: 2025-04-26T18:00:00-06:00\nlayout: blog\ndescription: "Explore the essential evaluation framework for LLM applications with Ragas. Learn how to assess performance, ensure accuracy, and improve reliability in Retrieval-Augmented Generation systems."\ncategories: ["AI", "RAG", "Evaluation","Ragas"]\ncoverImage: "https://images.unsplash.com/photo-1593642634367-d91a135587b5?q=80&w=1770&auto=format&fit=crop&ixlib=rb-4.0.3"\nreadingTime: 7\npublished: true\n---\n\nAs Large Language Models (LLMs) become fundamental components of modern applications, effectively evaluating their performance becomes increasingly critical. Whether you\'re building a question-answering system, a document retrieval tool, or a conversational agent, you need reliable metrics to assess how

In [6]:
docs_with_data = blog_utils.update_document_metadata(
    docs,
    data_dir_prefix="/home/mafzaal/source/mafzaal.github.io/posts/",
)

Skipping unpublished document: Zero-Shot RAG: Building Systems That Work Out-of-the-Box
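
The implementation of `update_document_metadata` is not shown in this notebook. As a rough idea of what such a step might involve, here is a minimal sketch that reads the front matter block of a post and skips entries marked `published: false`. The helper `parse_front_matter` and the sample posts are hypothetical, and the parser only handles flat `key: value` pairs, so treat it as an assumption rather than the actual logic.

```python
def parse_front_matter(text):
    """Return (metadata, body) for a Markdown string with a leading '---' block.

    Hypothetical helper: only flat `key: value` pairs are handled,
    and list values are kept as raw strings.
    """
    if not text.startswith("---\n"):
        return {}, text
    header, separator, body = text[4:].partition("\n---\n")
    if not separator:
        return {}, text
    metadata = {}
    for line in header.splitlines():
        key, colon, value = line.partition(":")
        if not colon:
            continue
        value = value.strip().strip('"')
        if value in ("true", "false"):
            value = value == "true"
        metadata[key.strip()] = value
    return metadata, body


# Made-up posts, for illustration only
posts = [
    '---\ntitle: "Published post"\npublished: true\n---\n\nHello',
    '---\ntitle: "Draft post"\npublished: false\n---\n\nWork in progress',
]

kept = []
for raw in posts:
    meta, body = parse_front_matter(raw)
    if not meta.get("published", False):
        print(f"Skipping unpublished document: {meta.get('title', 'Unknown')}")
        continue
    kept.append(meta)
```

A real implementation would more likely use a YAML parser so that lists such as `categories` become Python lists.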


In [7]:
docs_with_data

[Document(metadata={'source': '/home/mafzaal/source/mafzaal.github.io/posts/introduction-to-ragas/index.md', 'url': 'https://thedataguy.pro/blog/introduction-to-ragas/', 'post_slug': 'introduction-to-ragas', 'post_title': 'Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications', 'cover_image': 'https://images.unsplash.com/photo-1593642634367-d91a135587b5?q=80&w=1770&auto=format&fit=crop&ixlib=rb-4.0.3', 'date': '2025-04-26T18:00:00-06:00', 'categories': ['AI', 'RAG', 'Evaluation', 'Ragas'], 'description': 'Explore the essential evaluation framework for LLM applications with Ragas. Learn how to assess performance, ensure accuracy, and improve reliability in Retrieval-Augmented Generation systems.', 'reading_time': '7', 'published': True, 'content_length': 6994}, page_content='---\ntitle: "Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications"\ndate: 2025-04-26T18:00:00-06:00\nlayout: blog\ndescription: "Explore the essenti

In [8]:
split_docs = blog_utils.split_documents(docs)

Split 20 documents into 227 chunks
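
How `split_documents` chunks the text is not shown here. To illustrate the general idea of turning one long post into several overlapping chunks, the sketch below uses a plain fixed-size character splitter. The function name `split_text` and its `chunk_size` and `chunk_overlap` parameters are assumptions made for this example, not the project's settings.

```python
def split_text(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character chunks that overlap.

    Hypothetical helper: each chunk starts (chunk_size - chunk_overlap)
    characters after the previous one.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


example_chunks = split_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(example_chunks)
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of storing some text twice.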


In [9]:
split_docs[0]

Document(metadata={'source': '/home/mafzaal/source/mafzaal.github.io/posts/introduction-to-ragas/index.md', 'url': 'https://thedataguy.pro/blog/introduction-to-ragas/', 'post_slug': 'introduction-to-ragas', 'post_title': 'Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications', 'cover_image': 'https://images.unsplash.com/photo-1593642634367-d91a135587b5?q=80&w=1770&auto=format&fit=crop&ixlib=rb-4.0.3', 'date': '2025-04-26T18:00:00-06:00', 'categories': ['AI', 'RAG', 'Evaluation', 'Ragas'], 'description': 'Explore the essential evaluation framework for LLM applications with Ragas. Learn how to assess performance, ensure accuracy, and improve reliability in Retrieval-Augmented Generation systems.', 'reading_time': '7', 'published': True, 'content_length': 6994}, page_content='---\ntitle: "Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications"\ndate: 2025-04-26T18:00:00-06:00\nlayout: blog\ndescription: "Explore the essentia

In [32]:
from langchain.embeddings import init_embeddings
embedding_model = init_embeddings("ollama:snowflake-arctic-embed2:latest",base_url="http://host.docker.internal:11434")
embedding_model.embed_query("Hello, how are you?")

[-0.014671022,
 0.0056940024,
 -0.054714832,
 -0.059647933,
 -0.006816796,
 0.01716877,
 0.033712815,
 0.026671892,
 -0.01865606,
 -0.01148566,
 0.014141501,
 0.003039137,
 0.021281049,
 -0.012173388,
 -0.0008546505,
 0.054538127,
 -0.04232712,
 -0.009622793,
 -0.050947458,
 -0.009234576,
 -0.08137996,
 0.069304764,
 0.009315735,
 -0.0029559545,
 0.06958663,
 0.011407494,
 -0.032647233,
 -0.028725518,
 -0.017362446,
 -0.032384507,
 -0.020219686,
 0.036269415,
 0.03129496,
 -0.0826072,
 0.0051507647,
 -0.016931148,
 0.03616488,
 -0.05358736,
 -0.14692639,
 0.030965284,
 -0.02527961,
 -0.001543461,
 -0.021383347,
 0.026346583,
 0.06690938,
 -0.03615613,
 -0.027886346,
 -0.05272993,
 0.021408971,
 -0.019687392,
 0.015839016,
 0.043884728,
 -0.019894397,
 -0.010064689,
 -0.044956546,
 0.038188823,
 0.0068873484,
 -0.019878529,
 -0.055200793,
 -0.0016756806,
 0.04296426,
 0.05140831,
 -0.008155994,
 0.038085878,
 -0.08189059,
 0.10218666,
 0.017926894,
 0.007290556,
 -0.035657167,
 -0.02976

In [None]:
# vector_store = blog_utils.create_vector_store(docs, './db/vector_store_tdg_3')

from langchain.embeddings import init_embeddings
from langchain_qdrant import QdrantVectorStore

embedding_model = init_embeddings("ollama:snowflake-arctic-embed2:latest")

# Embed the chunks and upload them to the Qdrant collection
vector_store = QdrantVectorStore.from_documents(
    split_docs,
    embedding=embedding_model,  # type: ignore
    collection_name="the_data_guy_dev",
    url="http://127.0.0.1:6334",
    prefer_grpc=True,
)




In [18]:
vector_store = QdrantVectorStore.from_existing_collection(
    embedding=embedding_model,  # type: ignore
    collection_name="the_data_guy_dev",
    url="http://127.0.0.1:6334",
    prefer_grpc=True,
)


In [21]:
vector_store.similarity_search("What is the difference between a Data Engineer and a Data Scientist?", k=3)

[Document(metadata={'cover_image': 'https://images.unsplash.com/photo-1705484229341-4f7f7519b718?q=80&w=1740&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D', 'description': "Discover why controlling unique, high-quality data is your organization's most valuable competitive advantage in the AI era, and how a strategic approach to data ownership is becoming essential to business success.", 'date': '2025-04-15T00:00:00-06:00', 'post_title': 'Data is King: Why Your Data Strategy IS Your Business Strategy', 'content_length': 6197, 'source': '/home/mafzaal/source/mafzaal.github.io/posts/data-is-king/index.md', 'url': 'https://thedataguy.pro/blog/data-is-king/', 'reading_time': '3', 'published': True, 'post_slug': 'data-is-king', 'categories': ['AI', 'Strategy', 'Data'], '_id': '72c5318b-884b-48bb-b9ba-ed86fc7bc147', '_collection_name': 'the_data_guy_dev'}, page_content='---\nlayout: blog\ntitle: "Data is King: Why Your Data Strategy IS Your Busi
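
Conceptually, a similarity search embeds the query and ranks the stored chunks by how close their vectors are to it. The sketch below shows that idea with a brute-force cosine ranking over a toy index. The distance metric of the collection is not shown in this notebook, so cosine similarity is an assumption, and the vectors and labels are made up. A vector database uses an index rather than scanning every vector.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed, k=3):
    """Return the labels of the k vectors most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), label)
        for label, vector in indexed.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [label for _, label in scored[:k]]


# Made-up two-dimensional "embeddings", for illustration only
toy_index = {
    "chunk-a": [1.0, 0.0],
    "chunk-b": [0.0, 1.0],
    "chunk-c": [0.7, 0.7],
}
ranking = top_k([1.0, 0.1], toy_index, k=2)
print(ranking)
```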

## Testing the Vector Store

Let's test the vector store with a few queries to make sure it's working correctly.

In [22]:
# Create a retriever from the vector store
retriever = vector_store.as_retriever(search_kwargs={"k": 3})

# Test queries
test_queries = [
    "What is RAGAS?",
    "How to build research agents?",
    "What is metric driven development?",
    "Who is TheDataGuy?"
]

for query in test_queries:
    print(f"\nQuery: {query}")
    # Use a separate name so the `docs` list loaded earlier is not overwritten
    results = retriever.invoke(query)
    print(f"Retrieved {len(results)} documents:")
    for i, doc in enumerate(results):
        title = doc.metadata.get("post_title", "Unknown")
        url = doc.metadata.get("url", "No URL")
        print(f"{i+1}. {title} ({url})")


Query: What is RAGAS?
Retrieved 3 documents:
1. Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications (https://thedataguy.pro/blog/introduction-to-ragas/)
2. Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications (https://thedataguy.pro/blog/introduction-to-ragas/)
3. Part 8: Building Feedback Loops with Ragas (https://thedataguy.pro/blog/building-feedback-loops-with-ragas/)

Query: How to build research agents?
Retrieved 3 documents:
1. Building a Research Agent with RSS Feed Support (https://thedataguy.pro/blog/building-research-agent/)
2. Building a Research Agent with RSS Feed Support (https://thedataguy.pro/blog/building-research-agent/)
3. Building a Research Agent with RSS Feed Support (https://thedataguy.pro/blog/building-research-agent/)

Query: What is metric driven development?
Retrieved 3 documents:
1. Metric-Driven Development: Make Smarter Decisions, Faster (https://thedataguy.pro/blog/metric-driven-develop
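
Because the store indexes chunks rather than whole posts, the same post can appear more than once in a result list, as in the output above. If one line per post is preferred, the retrieved metadata can be collapsed by URL. The helper below is a hypothetical sketch that works on plain metadata dictionaries; with real results you would pass `[doc.metadata for doc in results]`.

```python
def unique_posts(metadata_list):
    """Collapse chunk metadata to one (title, url) pair per post, keeping rank order."""
    seen_urls = set()
    posts = []
    for metadata in metadata_list:
        url = metadata.get("url", "No URL")
        if url in seen_urls:
            continue
        seen_urls.add(url)
        posts.append((metadata.get("post_title", "Unknown"), url))
    return posts


# Metadata copied from the first query's output above
retrieved = [
    {
        "post_title": "Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications",
        "url": "https://thedataguy.pro/blog/introduction-to-ragas/",
    },
    {
        "post_title": "Part 1: Introduction to Ragas: The Essential Evaluation Framework for LLM Applications",
        "url": "https://thedataguy.pro/blog/introduction-to-ragas/",
    },
    {
        "post_title": "Part 8: Building Feedback Loops with Ragas",
        "url": "https://thedataguy.pro/blog/building-feedback-loops-with-ragas/",
    },
]
deduplicated = unique_posts(retrieved)
for rank, (title, url) in enumerate(deduplicated, start=1):
    print(f"{rank}. {title} ({url})")
```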

In [23]:
vector_store.client.close()