# Synthetic Data Generation

Please read the full article at [thedataguy.pro](https://thedataguy.pro/blog/evaluating-rag-systems-with-ragas/).



In [None]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

In [3]:
import nest_asyncio
nest_asyncio.apply()

In [4]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

# Download Sample Data

In [None]:
!mkdir -p data/

In [None]:
!curl https://raw.githubusercontent.com/MicrosoftDocs/dynamics-365-unified-operations-public/refs/heads/main/articles/fin-ops-core/dev-itpro/get-started/whats-new-platform-updates-10-0-24.md -o data/whats-new-platform-updates-10-0-24.md

In [6]:
!curl https://raw.githubusercontent.com/MicrosoftDocs/dynamics-365-unified-operations-public/refs/heads/main/articles/fin-ops-core/dev-itpro/get-started/whats-new-platform-updates-10-0-23.md -o data/whats-new-platform-updates-10-0-23.md

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3857  100  3857    0     0   2889      0  0:00:01  0:00:01 --:--:--  2891


In [7]:
!curl https://raw.githubusercontent.com/MicrosoftDocs/dynamics-365-unified-operations-public/refs/heads/main/articles/fin-ops-core/dev-itpro/get-started/whats-new-platform-updates-10-0-22.md -o data/whats-new-platform-updates-10-0-22.md

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  5863  100  5863    0     0  25983      0 --:--:-- --:--:-- --:--:-- 26057
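
The three downloads above differ only in the version string, so the URL and the local filename can be derived from the same value. The following is a minimal sketch using only the Python standard library; the helper names `build_download_plan` and `download_all` are hypothetical and not part of the original notebook. Deriving both names from one version string keeps one download from overwriting another.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Base URL copied from the curl commands in this notebook.
BASE_URL = (
    "https://raw.githubusercontent.com/MicrosoftDocs/"
    "dynamics-365-unified-operations-public/refs/heads/main/"
    "articles/fin-ops-core/dev-itpro/get-started/"
)


def build_download_plan(versions, data_dir="data"):
    """Return (url, local_path) pairs, one per platform-update version.

    Hypothetical helper: the remote and local filenames are built from the
    same version string, so they cannot drift apart.
    """
    plan = []
    for version in versions:
        filename = f"whats-new-platform-updates-{version}.md"
        plan.append((BASE_URL + filename, Path(data_dir) / filename))
    return plan


def download_all(versions, data_dir="data"):
    """Create the data directory and fetch every file in the plan."""
    Path(data_dir).mkdir(parents=True, exist_ok=True)
    for url, local_path in build_download_plan(versions, data_dir):
        urlretrieve(url, str(local_path))


# Print the planned local filenames without touching the network.
plan = build_download_plan(["10-0-22", "10-0-23", "10-0-24"])
for url, local_path in plan:
    print(local_path.name)
```

Calling `download_all(["10-0-22", "10-0-23", "10-0-24"])` would then fetch the files; it requires network access and is left out of the sketch for that reason.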


# Load the Documents and Generate the Test Set

In [20]:
from langchain_community.document_loaders import DirectoryLoader
path = "data/"
loader = DirectoryLoader(path, glob="*.md")
docs = loader.load()
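
Before relying on the loader, it can help to confirm which files the `*.md` pattern actually matches. This is a small sketch with `pathlib` only; `list_markdown_files` is a hypothetical helper, and it assumes the loader's `glob="*.md"` is applied non-recursively to the top level of `path`, as in the cell above. The demonstration runs against a throwaway directory rather than `data/`.

```python
import tempfile
from pathlib import Path


def list_markdown_files(path="data/"):
    """Return the sorted names of top-level .md files under `path`."""
    return sorted(p.name for p in Path(path).glob("*.md"))


# Demonstration against a temporary directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.md", "a.md", "notes.txt"):
        (Path(tmp) / name).write_text("# placeholder\n")
    found = list_markdown_files(tmp)

print(found)
```

In the notebook, `list_markdown_files("data/")` should list one file per downloaded platform update; a shorter list suggests that one download overwrote another.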

In [21]:
from ragas.testset import TestsetGenerator

# Initialize the generator with the LLM and embedding model.
# generator_llm and generator_embeddings are the wrappers defined in the earlier cell.
generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)

# Generate a synthetic test set of 10 samples from the loaded documents.
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/3 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/3 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/5 [00:00<?, ?it/s]

Property 'summary' already exists in node 'cfb8e5'. Skipping!
Property 'summary' already exists in node '2ba176'. Skipping!


Applying CustomNodeFilter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/9 [00:00<?, ?it/s]

Property 'summary_embedding' already exists in node 'cfb8e5'. Skipping!
Property 'summary_embedding' already exists in node '2ba176'. Skipping!


Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [24]:
dataset_df = dataset.to_pandas()
dataset_df

| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | Power BI work now with new update? | [title: Platform updates for version 10.0.22 o... | The Power BI embedded and Power BI.com integra... | single_hop_specifc_query_synthesizer |
| 1 | What change regarding jQWidgets is included in... | [title: Platform updates for version 10.0.22 o... | The platform updates for version 10.0.22 of fi... | single_hop_specifc_query_synthesizer |
| 2 | Moment update what do? | [title: Platform updates for version 10.0.22 o... | Open-source software update upgrade Moment and... | single_hop_specifc_query_synthesizer |
| 3 | Wher can I find informashun about featuers tha... | [Removed and deprecated platform features The ... | The Removed or deprecated platform features ar... | single_hop_specifc_query_synthesizer |
| 4 | what happen if breaking changes only affect co... | [Removed and deprecated platform features The ... | For breaking changes that affect only compilat... | single_hop_specifc_query_synthesizer |
| 5 | What happen with production environments when ... | [Removed and deprecated platform features The ... | For breaking changes that affect only compilat... | single_hop_specifc_query_synthesizer |
| 6 | What open-source software update was included ... | [<1-hop>\n\ntitle: Platform updates for versio... | Platform version 10.0.22 of finance and operat... | multi_hop_abstract_query_synthesizer |
| 7 | Wht is the open-source software update in plat... | [<1-hop>\n\ntitle: Platform updates for versio... | The open-source software update in platform ve... | multi_hop_abstract_query_synthesizer |
| 8 | what platform updates for finance and operatio... | [<1-hop>\n\ntitle: Platform updates for versio... | platform updates for finance and operations ap... | multi_hop_abstract_query_synthesizer |
| 9 | What new scenarios are enabled with Microsoft ... | [<1-hop>\n\ntitle: Platform updates for versio... | In version 10.0.22 of finance and operations a... | multi_hop_abstract_query_synthesizer |


In [26]:
dataset_df.columns

Index(['user_input', 'reference_contexts', 'reference', 'synthesizer_name'], dtype='object')
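
The `synthesizer_name` column records which query synthesizer produced each sample, so a quick tally shows the mix of single-hop and multi-hop questions. The sketch below uses only the standard library; `count_by_synthesizer` is a hypothetical helper, and the sample records are placeholders that mirror the synthesizer names shown in the output above (spelled exactly as they appear there).

```python
from collections import Counter


def count_by_synthesizer(records):
    """Count how many samples each synthesizer produced.

    `records` is a list of dicts that each carry a 'synthesizer_name' key.
    """
    return Counter(row["synthesizer_name"] for row in records)


# Placeholder records mirroring the ten rows displayed above.
sample_records = (
    [{"synthesizer_name": "single_hop_specifc_query_synthesizer"}] * 6
    + [{"synthesizer_name": "multi_hop_abstract_query_synthesizer"}] * 4
)

counts = count_by_synthesizer(sample_records)
for name, n in counts.most_common():
    print(f"{name}: {n}")
```

With the real data, the same function can be called as `count_by_synthesizer(dataset_df.to_dict("records"))`.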