NLTK Imports.

In [1]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/pratikmurali/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/pratikmurali/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

Load the Environment Variables

In [2]:
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

True

Create a Project Name

In [3]:
from uuid import uuid4
import os

os.environ["LANGSMITH_PROJECT"] = f"FDA-CYBERSECURITY-SDG-{uuid4().hex[0:8]}"

**Data Preparation**

First, we load the documents required to generate the dataset.

In [4]:
from langchain_community.document_loaders import DirectoryLoader, PyMuPDFLoader

# Directory that holds the source PDFs
path = "docs/cybersecurity"

# Map file extensions to their respective loaders
loaders = {
    '.pdf': PyMuPDFLoader,
}

# Create a DirectoryLoader for a specific file type
def create_directory_loader(file_type, directory_path):
    return DirectoryLoader(
        path=directory_path,
        glob=f"**/*{file_type}",
        loader_cls=loaders[file_type],
    )

# Create a DirectoryLoader for the PDFs and load them
pdf_loader = create_directory_loader('.pdf', path)

pdf_documents = pdf_loader.load()
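The `**/*.pdf` glob passed to the loader is meant to pick up PDFs at any depth under the directory. The sketch below illustrates that matching behaviour with `pathlib` alone, on a throwaway directory. The helper `find_files` and the file names are hypothetical, and the sketch assumes the loader resolves its pattern with pathlib-style globbing.

```python
import tempfile
from pathlib import Path


def find_files(root, pattern):
    """Return the sorted, root-relative paths of files matching a glob pattern."""
    root = Path(root)
    return sorted(
        p.relative_to(root).as_posix() for p in root.glob(pattern) if p.is_file()
    )


# Build a small throwaway tree: one PDF at the top level, one nested, one non-PDF.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "guidance").mkdir()
    (root / "top_level.pdf").touch()
    (root / "guidance" / "nested.pdf").touch()
    (root / "guidance" / "notes.txt").touch()

    matched = find_files(root, "**/*.pdf")

print(matched)
```

Both the top-level and the nested PDF match, while the `.txt` file is skipped.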

Next, we set up the LLM that will read these documents and generate a synthetic dataset.

In [5]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
from ragas.run_config import RunConfig

# Wrap the LangChain models for Ragas; RunConfig caps concurrency at 10 workers
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-mini"), run_config=RunConfig(max_workers=10))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings(), run_config=RunConfig(max_workers=10))


Next, we instantiate our ***Knowledge Graph***. The graph holds nodes connected by relationships (also known as "edges"). Together, these nodes and relationships define the knowledge graph and are used later to construct relevant questions and responses.

In [6]:
from ragas.testset.graph import KnowledgeGraph

kg = KnowledgeGraph()
kg

KnowledgeGraph(nodes: 0, relationships: 0)

The first step is to insert each loaded document into the graph as a node. This gives us a base to which we can apply transformations.

In [7]:
from ragas.testset.graph import Node, NodeType
for doc in pdf_documents:
    kg.nodes.append(
        Node(
            type=NodeType.DOCUMENT,
            properties={"page_content": doc.page_content, "document_metadata": doc.metadata}
        )
    )
kg

KnowledgeGraph(nodes: 109, relationships: 0)

### Default Transformations on the Knowledge Graph

Now, we'll apply the default transformations to our knowledge graph. These take the nodes currently in the graph and enrich them. Which transformations are applied depends on the corpus length; in our case they are:

1. Producing Summaries -> produces summaries of the documents
2. Extracting Headlines -> finds the headlines in each document
3. Theme Extractor -> extracts broad themes from the documents

Ragas then uses ***cosine similarity*** and heuristics on the embeddings of these outputs to construct relationships between the nodes.
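To make the relationship-building step concrete, here is a minimal sketch of linking nodes by cosine similarity. It is not the Ragas implementation: the function names, the toy three-dimensional vectors, and the 0.9 threshold are all assumptions chosen for illustration.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_relationships(embeddings, threshold=0.9):
    """Link every pair of nodes whose similarity meets the (assumed) threshold."""
    edges = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            edges.append((name_a, name_b, round(score, 3)))
    return edges


# Toy embeddings standing in for summary embeddings of three nodes.
toy_embeddings = {
    "page_1": [1.0, 0.0, 0.0],
    "page_2": [0.9, 0.1, 0.0],
    "page_3": [0.0, 1.0, 0.0],
}

edges = build_relationships(toy_embeddings)
print(edges)
```

Only `page_1` and `page_2` point in nearly the same direction, so they are the only pair that gets an edge. Because every pair of nodes is compared, the number of candidate relationships grows quadratically with the node count, which is one reason a graph of a few hundred nodes can end up with thousands of relationships.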

In [8]:
from ragas.testset.transforms import default_transforms, apply_transforms

transformer_llm = generator_llm
embedding_model = generator_embeddings

# Use a distinct name so the imported default_transforms function is not shadowed
transforms = default_transforms(documents=pdf_documents, llm=transformer_llm, embedding_model=embedding_model)
apply_transforms(kg, transforms)
kg

Applying HeadlineSplitter:   0%|          | 0/109 [00:00<?, ?it/s]          unable to apply transformation: 'headlines' property not found in this node

KnowledgeGraph(nodes: 198, relationships: 15055)

#### Save and Load the Knowledge Graph

Now we save the knowledge graph to disk and load it back, as follows.

In [9]:
kg.save("cybersecurity_guidelines_for_software_as_a_medical_device.json")
cybersecurity_guidelines_for_software_as_a_medical_device_kg = KnowledgeGraph.load("cybersecurity_guidelines_for_software_as_a_medical_device.json")
cybersecurity_guidelines_for_software_as_a_medical_device_kg

KnowledgeGraph(nodes: 198, relationships: 15055)

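#### Construct a Test Set Generator
Next, we construct a test set. Note that `generate_with_langchain_docs` builds its own knowledge graph from the documents rather than reusing the one saved above, which is why the transformations run again in the output below.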

In [10]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(pdf_documents, testset_size=10, run_config=RunConfig(max_workers=16))

Applying HeadlineSplitter:   0%|          | 0/109 [00:00<?, ?it/s]          unable to apply transformation: 'headlines' property not found in this node

KeyboardInterrupt: 

#### View the Generated Dataset as a Pandas Dataframe