# IG to Test Kit FULL pipeline

## Setup

### Importing Notebooks as Modules (from the [Jupyter Notebook Documentation](https://jupyter-notebook.readthedocs.io/en/4.x/examples/Notebook/rstversions/Importing%20Notebooks.html))

In [7]:
import inspect
import json

In [1]:
import io, os, sys, types
from IPython import get_ipython
from nbformat import current  # deprecated module; the loader below relies on its v3-style layout (worksheets, cell.input)
from IPython.core.interactiveshell import InteractiveShell


def find_notebook(fullname, path=None):
    """find a notebook, given its fully qualified name and an optional path

    This turns "foo.bar" into "foo/bar.ipynb"
    and tries turning "Foo_Bar" into "Foo Bar" if Foo_Bar
    does not exist.
    """
    name = fullname.rsplit('.', 1)[-1]
    if not path:
        path = ['']
    for d in path:
        nb_path = os.path.join(d, name + ".ipynb")
        if os.path.isfile(nb_path):
            return nb_path
        # let import Notebook_Name find "Notebook Name.ipynb"
        nb_path = nb_path.replace("_", " ")
        if os.path.isfile(nb_path):
            return nb_path
        

class NotebookLoader(object):
    """Module Loader for Jupyter Notebooks"""
    def __init__(self, path=None):
        self.shell = InteractiveShell.instance()
        self.path = path

    def load_module(self, fullname):
        """import a notebook as a module"""
        path = find_notebook(fullname, self.path)

        print("importing Jupyter notebook from %s" % path)

        # load the notebook object
        with io.open(path, 'r', encoding='utf-8') as f:
            nb = current.read(f, 'json')


        # create the module and add it to sys.modules
        # if name in sys.modules:
        #    return sys.modules[name]
        mod = types.ModuleType(fullname)
        mod.__file__ = path
        mod.__loader__ = self
        mod.__dict__['get_ipython'] = get_ipython
        sys.modules[fullname] = mod

        # extra work to ensure that magics that would affect the user_ns
        # actually affect the notebook module's ns
        save_user_ns = self.shell.user_ns
        self.shell.user_ns = mod.__dict__

        try:
            for cell in nb.worksheets[0].cells:
                if cell.cell_type == 'code' and cell.language == 'python':
                    # transform the input to executable Python
                    code = self.shell.input_transformer_manager.transform_cell(cell.input)
                    # run the code in the module
                    exec(code, mod.__dict__)
        finally:
            self.shell.user_ns = save_user_ns
        return mod
    

class NotebookFinder(object):
    """Module finder that locates Jupyter Notebooks"""
    def __init__(self):
        self.loaders = {}

    def find_module(self, fullname, path=None):
        nb_path = find_notebook(fullname, path)
        if not nb_path:
            return

        key = path
        if path:
            # lists aren't hashable
            key = os.path.sep.join(path)

        if key not in self.loaders:
            self.loaders[key] = NotebookLoader(path)
        return self.loaders[key]
    
sys.meta_path.append(NotebookFinder())


- use nbformat for read/write/validate public API
- use nbformat.vX directly to composing notebooks of a particular version

  from nbformat import current
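
The deprecation notice above points toward `nbformat`'s versioned API. As a rough sketch of what the loader's cell loop would look like against the v4 layout (a top-level `cells` list, each cell carrying `source`), the following reads a notebook using only the standard `json` module. `iter_code_cells` and the sample notebook are hypothetical and not part of `nbformat` or this pipeline; in practice `nbformat.read(f, as_version=4)` would be the usual entry point.

```python
import json


def iter_code_cells(nb_json_text):
    """Yield the source of each code cell in a v4-format notebook (hypothetical helper)."""
    nb = json.loads(nb_json_text)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # On disk, v4 stores source either as one string or as a list of lines.
        yield "".join(source) if isinstance(source, list) else source


# A tiny in-memory notebook, made up for illustration.
sample_notebook = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None, "source": ["x = 1\n", "y = x + 1"]},
    ],
})

code_cells = list(iter_code_cells(sample_notebook))
```

Each yielded string could then be passed through `transform_cell` and `exec`, as in `NotebookLoader.load_module`.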


## Text Extraction

### HTML to Markdown Conversion Using Markdownify (Langchain Tool)

In [None]:
import HTML_extractor

importing Jupyter notebook from HTML_extractor_pipeline.ipynb


USER_AGENT environment variable not set, consider setting it to identify your requests.


In [None]:
urls = [
    "https://hl7.org/fhir/uv/subscriptions-backport/STU1.1/components.html",
    "https://hl7.org/fhir/uv/subscriptions-backport/STU1.1/conformance.html",
    "https://hl7.org/fhir/uv/subscriptions-backport/STU1.1/OperationDefinition-backport-subscription-get-ws-binding-token.html", # negative
    "https://hl7.org/fhir/uv/subscriptions-backport/STU1.1/OperationDefinition-backport-subscription-events.html",
    "https://hl7.org/fhir/uv/subscriptions-backport/STU1.1/Bundle-r4-notification-empty.html", # negative
    "https://hl7.org/fhir/uv/subscriptions-backport/STU1.1/CapabilityStatement-backport-subscription-server-r4.html"
]

HTML_extractor.convert_urls_to_markdown(urls, output_dir="text_extraction/uv_subscriptions_backport/markdown")

Fetching pages: 100%|##########| 6/6 [00:00<00:00,  9.44it/s]

Created: components.md
Created: conformance.md
Created: OperationDefinition_backport_subscription_get_ws_binding_token.md
Created: OperationDefinition_backport_subscription_events.md
Created: Bundle_r4_notification_empty.md
Created: CapabilityStatement_backport_subscription_server_r4.md
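
The `Created:` lines above suggest that each output file is named after the last path segment of its URL, with hyphens replaced by underscores and `.html` swapped for `.md`. A minimal sketch of that mapping follows; `url_to_markdown_filename` is a hypothetical name, and the actual logic inside `HTML_extractor` may differ.

```python
import posixpath
from urllib.parse import urlparse


def url_to_markdown_filename(url):
    """Map a page URL to a Markdown filename (hypothetical helper)."""
    # Last path segment, e.g. "Bundle-r4-notification-empty.html"
    page = posixpath.basename(urlparse(url).path)
    stem, _ext = posixpath.splitext(page)
    return stem.replace("-", "_") + ".md"


example_url = (
    "https://hl7.org/fhir/uv/subscriptions-backport/STU1.1/"
    "Bundle-r4-notification-empty.html"
)
filename = url_to_markdown_filename(example_url)
```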





In [None]:
urls = [
    "https://hl7.org/fhir/us/davinci-pdex-plan-net/index.html",
    "https://hl7.org/fhir/us/davinci-pdex-plan-net/ChangeHistory.html",
    "https://hl7.org/fhir/us/davinci-pdex-plan-net/examples.html",
    "https://hl7.org/fhir/us/davinci-pdex-plan-net/implementation.html",
    "https://hl7.org/fhir/us/davinci-pdex-plan-net/profiles.html",
    "https://hl7.org/fhir/us/davinci-pdex-plan-net/artifacts.html",
    "https://hl7.org/fhir/us/davinci-pdex-plan-net/CapabilityStatement-plan-net.html"
]

HTML_extractor.convert_urls_to_markdown(urls, output_dir="text_extraction/PlanNet/site/markdown")

Fetching pages: 100%|##########| 7/7 [00:00<00:00, 10.78it/s]


Created: index.md
Created: ChangeHistory.md
Created: examples.md
Created: implementation.md
Created: profiles.md
Created: artifacts.md
Created: CapabilityStatement_plan_net.md


### Markdown Post-processing

In [None]:
import markdown_cleaner
markdown_cleaner.process_directory("text_extraction/uv_subscriptions_backport/markdown/", "checkpoints/post_processing/uv_subscriptions_backport/")

importing Jupyter notebook from markdown_cleaner_pipeline.ipynb
Found 6 markdown files in text_extraction/uv_subscriptions_backport/markdown/
Cleaned and saved: checkpoints/post_processing/uv_subscriptions_backport/conformance.md
Cleaned and saved: checkpoints/post_processing/uv_subscriptions_backport/OperationDefinition_backport_subscription_events.md
Cleaned and saved: checkpoints/post_processing/uv_subscriptions_backport/OperationDefinition_backport_subscription_get_ws_binding_token.md
Cleaned and saved: checkpoints/post_processing/uv_subscriptions_backport/components.md
Cleaned and saved: checkpoints/post_processing/uv_subscriptions_backport/Bundle_r4_notification_empty.md
Cleaned and saved: checkpoints/post_processing/uv_subscriptions_backport/CapabilityStatement_backport_subscription_server_r4.md

Processing complete: 6 files successfully cleaned, 0 failed
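
The log above follows a find, clean, save, and count pattern. The sketch below mirrors that flow under stated assumptions: `clean_directory` and `clean_markdown` are hypothetical names, and the cleaning rule (strip trailing spaces, collapse runs of blank lines) is only a stand-in for whatever `markdown_cleaner` actually does.

```python
import re
from pathlib import Path


def clean_markdown(text):
    """Stand-in cleaning rule: strip trailing spaces and collapse runs of blank lines."""
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    return re.sub(r"\n{3,}", "\n\n", text).strip() + "\n"


def clean_directory(input_dir, output_dir):
    """Clean every .md file in input_dir into output_dir; return (cleaned, failed) counts."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    cleaned = failed = 0
    for md_path in sorted(Path(input_dir).glob("*.md")):
        try:
            text = md_path.read_text(encoding="utf-8")
            (out / md_path.name).write_text(clean_markdown(text), encoding="utf-8")
            cleaned += 1
        except (OSError, UnicodeDecodeError):
            # Count the failure and keep going, as the summary line in the log implies.
            failed += 1
    return cleaned, failed
```

Returning the two counts lets the caller print a summary such as the "Processing complete" line without the helper owning any logging.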


## Requirements Extraction

### Prompt-based Requirement Extraction

In [2]:
import reqs_extraction_pipeline

importing Jupyter notebook from reqs_extraction_pipeline.ipynb


In [3]:
reqs_extraction_pipeline.process_markdown_content_for_incose_srs(
    'claude', 
    'checkpoints/post_processing',
    'checkpoints/requirements_extraction'
)

INFO:root:Starting processing with claude on directory: checkpoints/post_processing
INFO:root:Found 7 markdown files:
INFO:root:  - implementation.md
INFO:root:  - examples.md
INFO:root:  - profiles.md
INFO:root:  - ChangeHistory.md
INFO:root:  - artifacts.md
INFO:root:  - index.md
INFO:root:  - CapabilityStatement_plan_net.md
INFO:root:Organized 7 files into 6 processing groups
INFO:root:Processing combined group of 2 files
INFO:root:Split combined content into 2 chunks
INFO:root:Processing chunk 1/2 of combined files
INFO:httpx:HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
INFO:root:Processing chunk 2/2 of combined files
INFO:httpx:HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
INFO:root:Processing single file: examples.md
INFO:root:Split examples.md into 3 chunks using dynamic sizing
INFO:root:Processing chunk 1/3 of examples.md
INFO:httpx:HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 529 "
INFO:anthropic.

{'processed_files': ['profiles.md',
  'ChangeHistory.md',
  'examples.md',
  'implementation.md',
  'CapabilityStatement_plan_net.md',
  'index.md',
  'artifacts.md'],
 'srs_document': 'This content does not contain any explicit conformance language (SHALL, SHOULD, MAY, MUST, etc.) or testable requirements. The text describes profiles and change history but does not specify conformance requirements. Therefore, no INCOSE-style requirements can be extracted from this content.This content appears to be a change log and version history document that does not contain any explicit, testable requirements with conformance language (SHALL, SHOULD, MAY, MUST, etc.). While it references various changes and updates that were made across different versions, it does not specify new requirements - rather it documents the changes that were implemented.\n\nTherefore, I do not have any INCOSE-style requirements to extract from this particular content section.This content contains no explicit conformance
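
The model output above reports that several chunks contain no conformance language (SHALL, SHOULD, MAY, MUST). A cheap local check for those keywords can flag such chunks before or after the LLM call. This is a sketch only: `find_conformance_statements` is a hypothetical helper, the sentence splitter is deliberately naive, and the sample text is made up for illustration.

```python
import re

# Upper-case conformance verbs; negated forms come first so they match as a unit.
CONFORMANCE_PATTERN = re.compile(
    r"\b(SHALL NOT|SHOULD NOT|MUST NOT|SHALL|SHOULD|MUST|MAY)\b"
)


def find_conformance_statements(markdown_text):
    """Return (keyword, sentence) pairs for sentences containing a conformance verb."""
    # Naive split on ., ! or ? followed by whitespace; adequate for a pre-check.
    sentences = re.split(r"(?<=[.!?])\s+", markdown_text)
    hits = []
    for sentence in sentences:
        match = CONFORMANCE_PATTERN.search(sentence)
        if match:
            hits.append((match.group(1), sentence.strip()))
    return hits


# Illustrative text, not taken from any IG.
sample = (
    "This page lists the change history. "
    "Servers SHALL support the Subscription resource. "
    "Clients SHOULD NOT poll more than once per minute."
)
statements = find_conformance_statements(sample)
```

An empty result for a chunk would be consistent with the "no explicit conformance language" responses seen in `srs_document`.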

### RAG-based Requirement Extraction

This requirement extraction method differs from the first in how it builds its prompt. It runs a semantic search over a database of example FHIR IG text sections, each paired with the human-written requirements derived from it, to find the section(s) most similar to the input text. The retrieved pairs of IG text and requirement(s) are then supplied to the LLM as few-shot examples.

In [None]:
from rag_reqs_extraction import full_pass

full_pass("checkpoints/post_processing", "checkpoints/requirements_extraction/RAG")
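
The retrieval step described in this section can be sketched as follows. Everything here is an assumption made for illustration: a bag-of-words cosine similarity stands in for the real semantic search, the example store is made up, and `retrieve_examples` and `build_few_shot_prompt` are hypothetical names rather than functions from `rag_reqs_extraction`.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-cased token counts; a crude stand-in for an embedding vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_examples(query, examples, k=1):
    """Return the k (ig_text, requirements) pairs whose IG text is most similar to query."""
    query_vec = bag_of_words(query)
    ranked = sorted(
        examples,
        key=lambda ex: cosine_similarity(query_vec, bag_of_words(ex[0])),
        reverse=True,
    )
    return ranked[:k]


def build_few_shot_prompt(query, retrieved):
    """Format retrieved pairs as few-shot examples, ending with the new IG text."""
    parts = []
    for ig_text, requirements in retrieved:
        bullet_list = "\n".join("- " + req for req in requirements)
        parts.append("IG text:\n" + ig_text + "\nRequirements:\n" + bullet_list)
    parts.append("IG text:\n" + query + "\nRequirements:")
    return "\n\n".join(parts)


# Made-up example store of (IG text, human-written requirements) pairs.
example_store = [
    ("Servers SHALL return a Bundle of type searchset for search requests.",
     ["The server shall return a searchset Bundle in response to a search."]),
    ("Clients MAY cache directory data for up to 24 hours.",
     ["The client may cache directory data for at most 24 hours."]),
]
query = "A server SHALL return a searchset Bundle when a client performs a search."
retrieved = retrieve_examples(query, example_store, k=1)
prompt = build_few_shot_prompt(query, retrieved)
```

Swapping `bag_of_words` and `cosine_similarity` for an embedding model and a vector store would leave the prompt-building step unchanged.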

## Requirement Downselection

In [None]:
from requirement_downselect import full_pass as downselect_fp

downselect_fp(
    md_files=["checkpoints/requirements_extraction/claude_reqs_list_v1_20250429_081756.md"],
    rag_files=["checkpoints/requirements_extraction/RAG/plan_net_reqs.json"]
    )

## Test Plan Generation