# End-to-End AI-Assisted Software Development Workflow

This paper introduces an end-to-end AI-assisted software development workflow designed to minimize human oversight at each stage of the pipeline while enabling user intervention where necessary. The proposed workflow integrates high-level requirements, detailed design, code generation, and automated unit testing, ensuring that artifacts are well-documented at every stage. By defining a clear pipeline for human-machine interaction with specific goals, this approach addresses the limitations of current AI coding assistants, which often operate only at isolated stages of software development.

While the primary focus of this paper is on software development, the same workflow can be applied to a variety of other domains. The workflow leverages AI large language models (LLMs) to transform high-level task descriptions into detailed, pseudo-language low-level descriptions. This intermediate step allows developers to manually review pipeline artifacts and exercise control over the process at the pseudo-language level.

### Overview
Large Language Models (LLMs) are pattern inference machines trained on vast datasets. While they can generate plausible solutions, their effectiveness depends on how clearly the requirements are presented, which model is used, and how complex the problem is. Common challenges include:

- **Non-Typical Library Usage**: Users might want to use a library in a way that does not match the patterns the LLM was trained on. The LLM will then need specific guidance.
- **Version Mismatch**: LLMs may use the wrong version of a library or language, which can lead to compatibility issues or broken code.
- **Unique Applications**: LLMs are good at inferring common solutions, but a unique solution requires the developer to provide extensive detail.
- **Version of LLM**: The model version greatly affects output, and models have improved significantly in the past year. GPT-4o and GPT-3.5-turbo are used in this paper, and the last chapter compares their relative performance. Both models satisfied the example requirements, but GPT-4o was far superior.

These challenges emphasize the need for well-structured prompts, version control, and diligent review to maximize the benefits of AI-assisted coding.

In [27]:
%%writefile flowdiag.txt

┌─ #2 ────────┐       ┌─ #5 ────────┐       ┌─ #8 ────────┐       ┌─ #11 ──┐   
│ HiLevel Rqmt┌─►#4──►│ LoLevel Rqmt┌─►#7──►│     Code    ┌─►#10─►│UnitTest│   
└─────────────┘   ▲   └─────────────┘   ▲   └─────────────┘   ▲   └────────┘   
            ┌─ #1 └───────┐       ┌─ #6 └───────┐       ┌─ #9 └───────┐          
            │ PseudoC Tmpl│       │ Code Policy │       │ Test Policy │          
            └─────────────┘       └─────────────┘       └─────────────┘          

Overwriting flowdiag.txt


The software development workflow comprises eleven steps, as illustrated in the accompanying diagram. Each section of this paper provides a detailed explanation of a specific step:

1. **Pseudocode Template**  
2. **High-Level Requirements**  
3. **LLM API Interface**  
4. **High-Level to Low-Level Requirements Translation**  
5. **Low-Level Requirements**  
6. **Coding Policy**  
7. **Low-Level Requirements to Code Translation**  
8. **Code Listing**  
9. **Testing Policy**  
10. **Code to Unit Test Translation**  
11. **Unit Test Listing**  

For target applications involving the challenges described above, custom intervention may be required at step 5 to manually edit the *Low-Level Requirements*. While this workflow is demonstrated in a Jupyter Notebook with each step implemented manually, the process can be automated by defining the rules in a Makefile.
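
The Makefile idea rests on one rule: regenerate a target only when one of its inputs is newer. The following is a minimal Python sketch of that check, not part of the notebook. The helper name `is_stale` and the `PIPELINE` table are hypothetical; the file names mirror the ones used later in this paper.

```python
import os
from typing import Sequence


def is_stale(target: str, sources: Sequence[str]) -> bool:
    """Return True if `target` is missing or older than any of `sources`.

    Hypothetical helper illustrating the make-style dependency rule.
    """
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_mtime for src in sources)


# Each target lists the inputs it is built from (assumed layout of the pipeline).
PIPELINE = [
    ("requirements_lowlevel.yaml", ["requirements_highlevel.txt", "pseudocode_template.yaml"]),
    ("counter.py", ["requirements_lowlevel.yaml", "rules_python3.8.yaml"]),
    ("counter_test.py", ["counter.py", "rules_ptest.yaml"]),
]

if __name__ == "__main__":
    for target, sources in PIPELINE:
        # Skip stages whose inputs have not been produced yet.
        if all(os.path.exists(src) for src in sources) and is_stale(target, sources):
            print(f"{target} needs to be regenerated")
```

A real Makefile expresses the same table as rules; the sketch only shows the decision each rule makes.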

### 1. Pseudocode Template

In [28]:
%%writefile flowdiag.txt
┌─────────────┐       ┌─────────────┐       ┌─────────────┐       ┌────────┐
│ 2.HiLvlRqmt ┼─►#4──►│ 5.LoLvlRqmt ┼─►#7──►│   8.Code    ┼─►#10─►│11.uTest│
└─────────────┘   ▲   └─────────────┘   ▲   └─────────────┘   ▲   └────────┘
            ╔═ #1 ╚═══════╗       ┌─────┼───────┐       ┌─────┼───────┐          
            ║PSEUDO C TMPL║       │ 6.CodePolicy│       │ 9.TestPolicy│          
            ╚═════════════╝       └─────────────┘       └─────────────┘               

Overwriting flowdiag.txt


Current AI coding practices often rely on iterative prompting, where developers engage in repetitive back-and-forth interactions to craft the desired output. This approach can be inefficient, leading to frustration and an increased risk of overlooking critical details. An alternative solution involves the use of templates that encapsulate requirements, best practices, and goals within a single, monolithic prompt. The goals of the template are:

- **Higher Efficacy**: Reduces the friction in interactions, minimizing the need for iterative refinement and redundant instructions.
- **Leverage Familiar Paradigms**: By framing requirements in popular pseudocode formats, templates align closely with LLM training data, increasing the likelihood of accurate outputs.
- **Reusability**: Templates standardize interactions and can be reused across similar projects. This also promotes consistency across the pipeline.
- **Human Oversight**: Allows developers to create detailed requirements from high-level instructions and validate AI-generated results to ensure reliability and correctness.

LLMs can understand a large variety of pseudocode paradigms, but greater efficiency is gained by leveraging the paradigms they parse most reliably. Pseudocode paradigms that work well with LLMs have the following features:

- **Hierarchical Indentation**: Text format defines structure.
- **Popularity**: Paradigms that are sufficiently trained into the LLM.
- **Expressiveness**: Capable of handling complex logic.

The architecture, error handling, and implementation requirements allow fine-tuning of the code and provide workarounds for LLM hallucinations. The template minimizes *redundant or insufficient prompting*, reducing excessive iteration to achieve the desired output and leading to greater efficiency. The YAML template has placeholders for the following parameters used to compile the LLM prompt:

- **target_name**: Defines the file name for the code.
- **requirements**: Uses a *Step Action Table* and *Structured Function Documentation* that leverage paradigms easily understood by LLMs. The template is extended by adding additional functions and parameters as needed.
- **architecture**: Isolates the high-level structure out of the functional requirements section.
- **error_handling**: Allows fine-tuning of error handling—could include logging or debug print requirements.
- **impl_requirements**: Helpful for refining the implementation or workarounds for hallucinations.
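
As a rough illustration of how these placeholders can be checked before a requirements file moves on to code generation, the sketch below reports which sections are absent or empty. The function name `missing_sections` is hypothetical, and the document is assumed to be already parsed into a dictionary (for example with `yaml.safe_load`).

```python
from typing import Dict, List

# Section names taken from the template placeholders listed above.
TEMPLATE_SECTIONS = (
    "target_name",
    "requirements",
    "architecture",
    "error_handling",
    "impl_requirements",
)


def missing_sections(doc: Dict[str, object]) -> List[str]:
    """Return template sections that are absent or empty in `doc`."""
    missing = []
    for key in TEMPLATE_SECTIONS:
        value = doc.get(key)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(key)
    return missing


# Example: a partially filled requirements document.
example = {
    "target_name": "counter.py",
    "requirements": "- NAME: increment()\n",
    "architecture": "",
}
print(missing_sections(example))
```

An empty list means every section is present; whether each section is *correct* still needs the manual review described later.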

In [29]:
%%writefile pseudocode_template.yaml
Format software requirements to conform to the following rules: 
 - Output requirements in a YAML format using the following YAML template.
 - The template leverages `Step Action Table` and structured documentation standards. 
 - Make sure exception handling is detailed in the `Step Action Table`.
 - Only list in the `error_handling:` section items not covered in the `Step Action Table`
 - Ignore any assumptions from previous prompts

target_name: <Name of Target>
requirements: |
  - NAME: <function_name()>
    BRIEF: <outline description>
    PARAMETERS:
      - <param_name>
        - Type:  <type>
        - Validate: <define range>
        - Default: <default value>
        - Description: <Description>
    RETURN: 
      - <return_name>: <return value description>
    CONSTRAINTS: 
      - <List any constraints>
    STEP_ACTION_TABLE: 
      - STEP_1: |
          TITLE: <step_title>
          ACTION: <step action>
          INPUT: <step input>
          OUTPUT: <step output>
          NEXT: <next step>

architecture: |
  - <list all architecture requirements>

error_handling: |
  - <list detail for error handling here>

impl_requirements: |
  - <list implementation requirements here>

Overwriting pseudocode_template.yaml


### 2. High Level Requirements Prompt

In [30]:
%%writefile flowdiag.txt
╔═════════════╗       ┌─────────────┐       ┌─────────────┐       ┌────────┐
║ HI LEVEL REQ║─►#4──►│ 5.LoLvlRqmt ┼─►#7──►│   8.Code    ┼─►#10─►│11.uTest│
╚═════════════╝   ▲   └─────────────┘   ▲   └─────────────┘   ▲   └────────┘
            ┌─────┼───────┐       ┌─────┼───────┐       ┌─────┼───────┐          
            │ 1.pCodeTmpl │       │ 6.CodePolicy│       │ 9.TestPolicy│          
            └─────────────┘       └─────────────┘       └─────────────┘          

Overwriting flowdiag.txt


The following example shows how to leverage ChatGPT to fill out the pseudocode template. This first manual step converts high-level requirements into a verbose description of the code to be generated.

The following is a high-level description of a thread-safe counter. The description is used as a prompt to generate detailed requirements for automatic code generation. It contains enough detail that the resulting requirements need little to no hand editing. Iteratively refining the high-level prompt yields diminishing returns compared with directly editing the detailed requirements.

In [31]:
%%writefile requirements_highlevel.txt
 
REQUIREMENTS: 
- Set `target_name:` to `counter.py`
- Write requirements for methods that increment, decrement, and retrieve a counter value. 
- Initialize Counter to zero and value is always non-negative

ARCHITECTURE:  Detail in the template `architecture:` section the rules:
- Encapsulate functionality into a thread-safe class.
- Minimize performance impact under high concurrency.
- Handle edge cases and invalid operations predictably.

Overwriting requirements_highlevel.txt



### 3. LLM API

We use the OpenAI API to directly translate natural language into low-level requirements, code, or unit tests. The API offers the following advantages and settings:

- **Ignores History**: By ignoring history from previous prompts, the process becomes more repeatable.
- **Automation**: Automation can be enabled simply by creating Makefile rules to search the project for requirements files, detect dependencies, generate targets, and run unit tests.
- **Temperature**: Controls the creativity (randomness) of the solution. This workflow uses values between 0.1 and 1.0, with 0.1 as the default for the most repeatable results.
- **OPENAI_API_KEY**: An environment variable that contains your OpenAI API key.
- **System Role**: Must be set to define the flavor of the response. For this workflow, we specify a response from an expert in software development. 
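
A minimal sketch of how these settings come together before a request is made. The helper names `require_api_key` and `build_messages` are hypothetical; only the message structure (a list of `role`/`content` dictionaries) and the `OPENAI_API_KEY` variable come from the workflow described here. Checking the key up front is an assumption about good practice, not something the notebook does.

```python
import os
from typing import Dict, List, Mapping


def require_api_key(env: Mapping[str, str] = os.environ,
                    name: str = "OPENAI_API_KEY") -> str:
    """Return the API key, or raise early with a clear message if it is unset."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {name} is not set.")
    return key


def build_messages(system_role: str, prompt: str) -> List[Dict[str, str]]:
    """Build a history-free chat: one system message and one user message."""
    return [
        {"role": "system", "content": system_role},
        {"role": "user", "content": prompt},
    ]


messages = build_messages(
    "You are an expert in software development.",
    "Write requirements for a thread-safe counter.",
)
print([m["role"] for m in messages])
```

Because every request starts from a fresh message list, no earlier conversation can leak into the result, which is what makes the runs repeatable.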

The final chapter discusses the code generation performance differences between the GPT models `gpt-3.5-turbo` and `gpt-4o`. The following code has an option to use either model, so you can make the comparison yourself. The class and helper function below use the API to translate the prompt into a response string containing low-level requirements, code, or unit tests.

In [32]:
import os
import openai
import threading
import time
import re
import yaml
from typing import List, Tuple, Optional

ENCODING='utf-8'
MODEL_KEY_ENV_VARIABLE = "OPENAI_API_KEY"

class LlmClient:
    DEFAULT_TEMPERATURE = 0.1
    DEFAULT_MAX_TOKENS = 4000
    DEFAULT_MODEL = 'gpt-4o'     # Speed: 17/12/7 seconds
    # DEFAULT_MODEL = 'gpt-3.5-turbo'  # Speed: 5/4/4 seconds

    def __init__(self):
        self.client = openai.OpenAI(api_key=os.getenv(MODEL_KEY_ENV_VARIABLE))
        self.running = False  # True while the progress indicator should keep printing

    def _show_progress(self) -> None:
        """Print a period every second to indicate progress, with a counter."""
        seconds = 0
        print("\nProgress:")
        while self.running:
            seconds += 1
            print(f"\r{seconds:>3} {'.' * seconds}", end="", flush=True)
            time.sleep(1)

    def process_chat(self, messages: List[dict]) -> Tuple[Optional[str], Tuple[int, int, str]]:
        """Process a chat and return the generated content and token usage."""
        try:
            # Start progress indicator. The flag is set before the thread starts so
            # that a fast failure cannot be overwritten by the thread starting late.
            self.running = True
            threading.Thread(target=self._show_progress, daemon=True).start()
            response = self.client.chat.completions.create(
                messages=messages,
                max_tokens=self.DEFAULT_MAX_TOKENS,
                temperature=self.DEFAULT_TEMPERATURE,
                model=self.DEFAULT_MODEL
            )
            # Handle and return token usage if available
            self.running = False
            usage = getattr(response, 'usage', None)
            if usage:
                return response.choices[0].message.content, (usage.prompt_tokens, usage.completion_tokens, self.DEFAULT_MODEL)
            return response.choices[0].message.content, (0, 0, self.DEFAULT_MODEL)
        except Exception as e:
            self.running = False
            print(f"\nprocess_chat() An error occurred: {e}")
            return None, (0, 0, self.DEFAULT_MODEL)

def process_prompt(prompt: str, system_role: str) -> Tuple[Optional[str], str]:
    """Example usage of LlmClient with a given prompt."""
    messages = [
        {"role": "system", "content": system_role},
        {"role": "user", "content": prompt}
    ]
    client = LlmClient()
    result, tokens = client.process_chat(messages)
    if result:
        token_debug = f"# TOKENS: {tokens[0] + tokens[1]} (of:{client.DEFAULT_MAX_TOKENS}) = {tokens[0]} + {tokens[1]} (prompt+return) -- MODEL: {tokens[2]}"
        return result, token_debug
    return None, "Failed to generate response."

### 4. High-Level to Low-Level Requirements Translation

In [33]:
%%writefile flowdiag.txt
┌─────────────┐ ╔═══╗ ┌─────────────┐       ┌─────────────┐       ┌────────┐
│ 2.HiLvlRqmt ┼──►#4─►│ 5.LoLvlRqmt ┼─►#7──►│   8.Code    ┼─►#10─►│11.uTest│
└─────────────┘ ╚═▲═╝ └─────────────┘   ▲   └─────────────┘   ▲   └────────┘
            ┌─────┼───────┐       ┌─────┼───────┐       ┌─────┼───────┐          
            │ 1.pCodeTmpl │       │ 6.CodePolicy│       │ 9.TestPolicy│          
            └─────────────┘       └─────────────┘       └─────────────┘          

Overwriting flowdiag.txt



The target requirements are followed by the contents of `pseudocode_template.yaml` to create a prompt for ChatGPT. The call to the OpenAI API translates the prompt into detailed requirements.

In [34]:
FILE1 = "requirements_highlevel.txt"
FILE2 = "pseudocode_template.yaml"
FILE_PROMPT = "requirements_prompt.txt"
TARGET = "requirements_lowlevel.yaml"
SYSTEM_ROLE = "You are an expert in translating high-level software requirements into detailed low-level requirements sufficient to then be used as a prompt to auto-generate code."

try:
    result = None
    e_msg = "\nFailed to generate response."
    # Read both inputs with context managers so the files are closed promptly.
    with open(FILE1, 'r', encoding='utf-8') as f1, open(FILE2, 'r', encoding='utf-8') as f2:
        prompt = f1.read() + '\n' + f2.read()
    with open(FILE_PROMPT, 'w', encoding='utf-8') as out:
        out.write(SYSTEM_ROLE + '\n' + prompt)
    response, debug = process_prompt(prompt, SYSTEM_ROLE)
    if response:
        response = re.sub(r'```yaml', '', response)
        response = re.sub(r'```.*$', '', response)
        result = debug + '\n' +  response
except Exception as e:
    e_msg = f"An error occurred while processing files:\n  Input files: {FILE1}, {FILE2}\n  Output file: {FILE_PROMPT}\n  Error details: {e}"
if result is None:
    result = e_msg
    print(e_msg)
with open(TARGET, 'w', encoding='utf-8') as out:
    out.write(result)
print(f"\n Result Written to: {TARGET}\n")



Progress:
 13 .............
 Result Written to: requirements_lowlevel.yaml



### 5. Detailed Target Requirements Prompt

In [35]:
%%writefile flowdiag.txt
┌─────────────┐       ╔═════════════╗       ┌─────────────┐       ┌────────┐
│ 2.HiLvlRqmt ┼─►#4──►║ LO LEVEL REQ║──►#7─►│   8.Code    ┼─►#10─►│11.uTest│
└─────────────┘   ▲   ╚═════════════╝   ▲   └─────────────┘   ▲   └────────┘
            ┌─────┼───────┐       ┌─────┼───────┐       ┌─────┼───────┐          
            │ 1.pCodeTmpl │       │ 6.CodePolicy│       │ 9.TestPolicy│          
            └─────────────┘       └─────────────┘       └─────────────┘          

Overwriting flowdiag.txt


The generated detailed requirements depend on the LLM provider and model, but the template steers the response toward the desired pseudocode format, which keeps requirements in a standardized structure that aligns with LLM training data. The following response, `requirements_lowlevel.yaml`, was created with GPT-4o, chosen for its advanced reasoning capabilities and improved consistency compared to earlier models. This method for generating a detailed description has the following advantages:

- Generates well-defined, stable code.
- Variations are less dependent on the LLM model/version.
- Provides verbose material for inline code documentation.
- Allows for black-box generation of unit tests.

A detailed manual review and editing of the generated YAML requirements are required to ensure fulfillment of application requirements. Common issues to look for include misaligned indentation, incomplete parameter definitions, and incorrect mappings in the Step Action Table. This manual process requires the most human interaction. 
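
Part of this review can be assisted by a small lint pass. The sketch below is an illustration under stated assumptions, not part of the workflow: the function names are hypothetical, and the check treats all steps in the text as one table, so a file with several functions that each define `STEP_1` will hide some dangling links. It flags `NEXT:` targets that name no defined step and detects tab indentation, which YAML does not permit.

```python
import re
from typing import List


def check_step_links(requirements_text: str) -> List[str]:
    """Return NEXT targets that match no defined STEP_n (END is allowed)."""
    defined = set(
        re.findall(r"^[ \t]*-[ \t]*(STEP_\d+)[ \t]*:", requirements_text, flags=re.MULTILINE)
    )
    targets = re.findall(r"^[ \t]*NEXT:[ \t]*(\S+)", requirements_text, flags=re.MULTILINE)
    return [t for t in targets if t != "END" and t not in defined]


def has_tab_indentation(requirements_text: str) -> bool:
    """Return True if any line uses a tab in its leading indentation."""
    return any(
        re.match(r"[ ]*\t", line) is not None
        for line in requirements_text.splitlines()
    )


sample = """\
- STEP_1: |
    TITLE: Acquire Lock
    NEXT: STEP_2
- STEP_2: |
    TITLE: Increment Counter
    NEXT: STEP_4
"""
print(check_step_links(sample))     # STEP_4 is never defined
print(has_tab_indentation(sample))
```

A clean result does not replace the manual review; it only removes the mechanical part of it.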

If the design utilizes a common design pattern, such as Singleton or Factory, or heavily leverages a popular library like NumPy or Flask, then little editing will be needed. A novel algorithm will require detailed editing of the parameters and Step Action Table. The following code displays the auto-generated low-level prompt.



In [36]:
file_name = TARGET
try:
    with open(file_name, 'r', encoding='utf-8') as file:
        print(file.read())
except FileNotFoundError:
    print(f"The file {file_name} does not exist.")

# TOKENS: 1173 (of:4000) = 409 + 764 (prompt+return) -- MODEL: gpt-4o

target_name: counter.py
requirements: |
  - NAME: increment()
    BRIEF: Increment the counter value by one.
    PARAMETERS: []
    RETURN: 
      - new_value: The updated counter value after increment.
    CONSTRAINTS: 
      - Counter value must remain non-negative.
    STEP_ACTION_TABLE: 
      - STEP_1: |
          TITLE: Acquire Lock
          ACTION: Acquire a lock to ensure thread safety.
          INPUT: None
          OUTPUT: Lock acquired
          NEXT: STEP_2
      - STEP_2: |
          TITLE: Increment Counter
          ACTION: Increase the counter value by one.
          INPUT: Current counter value
          OUTPUT: Updated counter value
          NEXT: STEP_3
      - STEP_3: |
          TITLE: Release Lock
          ACTION: Release the lock after incrementing.
          INPUT: None
          OUTPUT: Lock released
          NEXT: END

  - NAME: decrement()
    BRIEF: Decrement the counter value by one

### 6. Coding Policies

In [37]:
%%writefile flowdiag.txt
┌─────────────┐       ┌─────────────┐       ┌─────────────┐       ┌────────┐
│ 2.HiLvlRqmt ┼─►#4──►│ 5.LoLvlRqmt ┼─►#7──►│   8.Code    ┼─►#10─►│11.uTest│
└─────────────┘   ▲   └─────────────┘   ▲   └─────────────┘   ▲   └────────┘
            ┌─────┼───────┐       ╔═════┼═══════╗       ┌─────┼───────┐          
            │ 1.pCodeTmpl │       ║6.CODE POLICY║       │ 9.TestPolicy│          
            └─────────────┘       ╚═════════════╝       └─────────────┘          

Overwriting flowdiag.txt


The detailed requirements prompt could contain all the rules for generating the target code, but pulling common rules into a central, reusable policy file lets the same policies be applied automatically and consistently at each stage of the development cycle, from requirements generation to final code output, with less redundancy and manual intervention. The advantages are:

- Reduced effort to write requirements.
- Consistency across the codebase derived from AI.
- Project policies are clearly defined in a central location.
- A suite of policies can scale to the size and type of project. A quick prototype would need different policies than a foundational framework.
- Policies isolate the target-language rules from the requirements, allowing the same requirements to be translated into different programming languages.

The file *rules_python3.8.yaml* defines the common high-level rules for code generation, targeting Python; the final section of this document demonstrates how these rules can be translated and tested with C++. The file has two variables:

- *role_system*: Provides the high-level context and sets the flavor of the response.
- *role_user*: Contains project high-level rules for code generation. It centralizes coding standards, compiler version, debugging, and error handling, ensuring code consistency. These rules are maintained through periodic reviews and updates, aligned with project milestones or significant changes in project scope.

In [38]:
%%writefile rules_python3.8.yaml
role_system: |
  An experienced Python developer skilled in translating software requirements into efficient, maintainable, and Python 3.8-compatible code. 
  Adheres to best practices such as PEP8, effective use of type hints, and clear documentation. 
  Implements Pythonic error handling and debugging techniques, ensuring clarity, reliability, and maintainability in all code produced.

role_user: |
  - Compatibility:
    - Generate code compatible with Python version 3.8.

  - Coding Standards:
    - Follow PEP8 standards.
    - Include type hints for all function arguments and return values.
    - Import and use generics from the `typing` module (e.g., `List`, `Dict`, `Tuple`) for type hinting.

  - Documentation:
    - Do not place `Example usage:` at the end of the response. Instead, include it inside a comment in the header.
    - Document the file header with best practices, including today's date.
    - Include explanations inside a comment in the header documentation.

  - Error Handling:
    - Use Pythonic error handling practices.
    - Include `try-except` blocks and raise appropriate built-in or custom exceptions.
    - Provide clear and informative error messages.

  - Debugging:
    - Use the `logging` module for error and debug information.
    - Include a method to print debug statements to stdout, enabled by a `debug_enable` boolean variable (default `False`).
    - Provide debug print statements that aid in troubleshooting and understanding code behavior.


Overwriting rules_python3.8.yaml


### 7. Python Translation Utility

In [39]:
%%writefile flowdiag.txt
┌─────────────┐       ┌─────────────┐ ╔═══╗ ┌─────────────┐       ┌────────┐
│ 2.HiLvlRqmt ┼─►#4──►│ 5.LoLvlRqmt ┼──►#7─►│   8.Code    ┼─►#10─►│11.uTest│
└─────────────┘   ▲   └─────────────┘ ╚═▲═╝ └─────────────┘   ▲   └────────┘
            ┌─────┼───────┐       ┌─────┼───────┐       ┌─────┼───────┐          
            │ 1.pCodeTmpl │       │ 6.CodePolicy│       │ 9.TestPolicy│          
            └─────────────┘       └─────────────┘       └─────────────┘          

Overwriting flowdiag.txt


The previously defined `LlmClient()` class, which acts as an interface to interact with the OpenAI API, is used by this code snippet to translate the YAML requirement files into target code or unit test software.

The following `process()` function is generic: it takes a list of key/prefix pairs, looks up each key in the YAML requirements file, and prepends the prefix to the value to compose the LLM prompt. Two wrapper functions supply the lists:

- **process_req_2_code()**: Requirements → Code translation 
  - Uses the key/prefix list: *key_prefix_pairs_req_2_code*

- **process_code_2_test()**: Code → Unit test translation
  - Uses the key/prefix list: *key_prefix_pairs_code_to_test*



In [40]:
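def extract_code_from_response(response: str) -> str:
    """Extract and clean code from a response string."""
    # Drop everything up to and including the opening fence.
    response = re.sub(r'^.*?```', '', response, flags=re.DOTALL)
    # Drop the closing fence and any commentary that follows it.
    response = re.sub(r'```.*', '', response, flags=re.DOTALL)
    # Replace the language tag left over from the opening fence with a comment
    # marker. Anchoring at the start of the string keeps later code lines that
    # happen to begin with "python" intact.
    response = re.sub(r'\Apython\b', '#', response)
    return response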

def process(rules_fname: str, req_fname: str, prompt_fname: str, 
            key_prefix_pairs: list, is_utest: bool) -> None:
    """
    Process the input YAML files to generate a code prompt and save results to specified files.
    Parameters:
        rules_fname (str): Path to the rules YAML file.
        req_fname (str): Path to the requirements YAML file.
        prompt_fname (str): File to save the generated code prompt.
        is_utest (bool): true if processing unit test translation
        key_prefix_pairs (list): List of tuples containing key-prefix pairs for generating prompts.
    """
    try:
        # EXTRACT RULES - a llm prompt must start with high level system role and user role
        with open(rules_fname, "r", encoding=ENCODING) as file:
            data = yaml.safe_load(file)
        prompt = [
            {"role": "system", "content": data["role_system"]},
            {"role": "user", "content": data["role_user"]}
        ]

        # EXTRACT REQUIREMENTS - from req YAML using `key_prefix_pairs` list 
        dest_fname = "unknown.py"
        with open(req_fname, "r", encoding=ENCODING) as file:
            arch = yaml.safe_load(file)
            code_fname = arch.get("target_name", dest_fname)
            for key, prefix in key_prefix_pairs:
                if key in arch:
                    prompt.append({"role": "user", "content": prefix + arch[key]})

        # CREATE TARGET FILE NAME - based on the target_name in requirements
        if not code_fname.endswith('.py'):
            e_msg = f"The target_name:{code_fname} file name must end with '.py'"
            raise ValueError(e_msg)            
        if is_utest:
            with open(code_fname, "r", encoding=ENCODING) as file:
                code = file.read()
            prefix = "Write Unit Test for the following code:\n"    
            prompt.append({"role": "user", "content": prefix + code})
            base_name = code_fname[:-3]  # Remove the .py extension
            dest_fname = f"{base_name}_test.py"
        else:
            dest_fname = code_fname

        # PROCESS - the chat prompt & save prompt to disk (for inspection)
        client = LlmClient()
        response, tokens = client.process_chat(prompt)
        with open(prompt_fname, 'w', encoding=ENCODING) as out:
            out.write(", ".join(map(str, prompt)).replace("\\n", "\n"))

        # CONSTRUCT - the result string
        result = f"# TOKENS: {tokens[0] + tokens[1]} (of:{client.DEFAULT_MAX_TOKENS}) = {tokens[0]} + {tokens[1]}(prompt+return) -- MODEL: {tokens[2]}"
        if response is None:
            result = "Failed to generate response."
        else:
            result += extract_code_from_response(response) 

    except Exception as e:
        result = (f"An error occurred while processing files:\n  Input files: {rules_fname}, {req_fname}\n  "
                  f"Output file: {dest_fname}\n  Error details: {e}")
        print(f"ERROR THROWN {result}")

    # Write the result to the destination file
    with open(dest_fname, 'w', encoding=ENCODING) as out:
        out.write(result)
    print(f"\n Result Written to: {dest_fname}\n")

def process_req_2_code(rules_fname: str, req_fname: str, prompt_fname: str) -> None:
    is_utest = False
    key_prefix_pairs_req_2_code = [
        ("target_name", "Code shall be saved in a file named:"),
        ("requirements", "Use the following requirements to write code:\n"),
        ("architecture", "Use the following architecture to implement code:\n"),
        ("interface", "Use the following interface implementation requirements:\n"),
        ("error_handling", "Use the following error handling requirements:\n"),
        ("impl_requirements", "Use these additional implementation requirements:\n")
    ]
    return process( rules_fname, req_fname, prompt_fname,
                   key_prefix_pairs_req_2_code, is_utest )

def process_code_2_test(rules_fname: str, req_fname: str, prompt_fname: str) -> None:
    is_utest = True
    key_prefix_pairs_code_to_test = [
        ("target_name", "Code shall be saved in a file named:"),
        ("test_requirements", "see the additional test requirements:\n")
    ]
    return process( rules_fname, req_fname, prompt_fname, 
                    key_prefix_pairs_code_to_test, is_utest )
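
To make the key/prefix mechanism concrete, the following sketch isolates the loop that turns requirement sections into chat messages and runs it on an in-memory dictionary. The helper name `compose_messages` is hypothetical; the prefixes are copied from the lists above, and the sample `arch` dictionary is made up for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def compose_messages(arch: Dict[str, str],
                     key_prefix_pairs: Sequence[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Turn each requirement section present in `arch` into one user message."""
    return [
        {"role": "user", "content": prefix + arch[key]}
        for key, prefix in key_prefix_pairs
        if key in arch  # Sections missing from the requirements are skipped.
    ]


pairs = [
    ("target_name", "Code shall be saved in a file named:"),
    ("requirements", "Use the following requirements to write code:\n"),
    ("interface", "Use the following interface implementation requirements:\n"),
]
arch = {"target_name": "counter.py", "requirements": "- NAME: increment()\n"}

for message in compose_messages(arch, pairs):
    print(repr(message["content"]))
```

The `interface` pair produces no message here because the sample requirements have no such section, which is why optional sections can be listed without breaking the prompt.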

### 8. Code Translation Result

In [41]:
%%writefile flowdiag.txt
┌─────────────┐       ┌─────────────┐       ╔═════════════╗       ┌────────┐
│ 2.HiLvlRqmt ┼─►#4──►│ 5.LoLvlRqmt ┼─►#7──►║    8.CODE   ║─►#10─►│11.uTest│
└─────────────┘   ▲   └─────────────┘   ▲   ╚═════════════╝   ▲   └────────┘
            ┌─────┼───────┐       ┌─────┼───────┐       ┌─────┼───────┐          
            │ 1.pCodeTmpl │       │ 6.CodePolicy│       │ 9.TestPolicy│          
            └─────────────┘       └─────────────┘       └─────────────┘          

Overwriting flowdiag.txt


The following code calls the LLM to generate the target code from the low-level requirements and the coding policy, then prints the result.


In [42]:
rules_fname = "rules_python3.8.yaml"
req_fname = "requirements_lowlevel.yaml"
prompt_fname = "counter_code_prompt.txt"
dest_fname = "counter.py"

process_req_2_code( rules_fname, req_fname, prompt_fname )
with open(dest_fname, "r") as file:
    content = file.read()
print(content)



Progress:
 17 .................
 Result Written to: counter.py

# TOKENS: 1879 (of:4000) = 1094 + 785(prompt+return) -- MODEL: gpt-4o#
"""
counter.py

This module provides a thread-safe counter class with methods to increment,
decrement, and retrieve the current counter value. The counter is initialized
to zero and ensures that it remains non-negative. Thread safety is achieved
using a threading lock to prevent race conditions during concurrent access.

Date: 2023-10-05

Features:
- Increment the counter value by one.
- Decrement the counter value by one, ensuring it does not go below zero.
- Retrieve the current counter value.

Error Handling:
- Handles lock acquisition failures by retrying or logging an error.
- Ensures that exceptions during operations do not leave the counter in an inconsistent state.

Debugging:
- Uses the logging module for error and debug information.
- Includes a method to print debug statements to stdout, enabled by a `debug_enable` boolean variable.
"""

imp

### 9. Unit Test Policy

In [43]:
%%writefile flowdiag.txt
┌─────────────┐       ┌─────────────┐       ┌─────────────┐       ┌────────┐
│ 2.HiLvlRqmt ┼─►#4──►│ 5.LoLvlRqmt ┼─►#7──►│   8.Code    ┼─►#10─►│11.uTest│
└─────────────┘   ▲   └─────────────┘   ▲   └─────────────┘   ▲   └────────┘
            ┌─────┼───────┐       ┌─────┼───────┐       ╔═════┼═══════╗          
            │ 1.pCodeTmpl │       │ 6.CodePolicy│       ║9.TEST POLICY║          
            └─────────────┘       └─────────────┘       ╚═════════════╝          

Overwriting flowdiag.txt


Unit tests are created by the same process. The following rules are combined with the code under test to form the LLM prompt that generates the unit tests.


In [44]:
%%writefile rules_ptest.yaml
role_system: |
  - You are a Python expert specializing in writing high-quality unit tests with the pytest framework.
    Generate comprehensive, efficient, and maintainable pytest test cases following best practices.
  - Write Python code that adheres to PEP8 standards and is compatible with Python 3.8.
  - Always include the import `from typing import Tuple` if type hints are required.

role_user: |
  - Do not place Explanation: at the end of the response. Instead, include them as header comments in the test case file.
  - Ensure test functions cover edge cases and all possible scenarios, including extreme and unexpected inputs.
  - Test the function's behavior across a wide range of possible inputs.
  - Proactively address edge cases the author might not have considered.
  - Ensure tests are deterministic and produce the same result when repeated under the same conditions.

  - Use pytest to generate unit tests:
    - Employ pytest fixtures appropriately to manage setup and teardown.
    - Use pytest parameterization to create concise, readable, and maintainable test cases.
    - Do not use pytest's mocking features.

  - Organize tests logically to enhance clarity and maintainability.
  - Write tests that are easy to read, with clean code and descriptive function names.
  - Document the file header with explanations, expectations, and high-level details about the test cases.


Overwriting rules_ptest.yaml


### 10. Unit Test Translation

In [45]:
%%writefile flowdiag.txt
┌─────────────┐       ┌─────────────┐       ┌─────────────┐ ╔═══╗ ┌─────────┐
│ 2.HiLvlRqmt ┼─►#4──►│ 5.LoLvlRqmt ┼─►#7──►│   8.Code    ┼─►#10─►│11.uTests│
└─────────────┘   ▲   └─────────────┘   ▲   └─────────────┘ ╚═▲═╝ └─────────┘
            ┌─────┼───────┐       ┌─────┼───────┐       ┌─────┼───────┐          
            │ 1.pCodeTmpl │       │ 6.CodePolicy│       │ 9.TestPolicy│          
            └─────────────┘       └─────────────┘       └─────────────┘          

Overwriting flowdiag.txt


The following code translates the generated code into unit tests according to the testing policy.

In [46]:
rules_fname = "rules_ptest.yaml"
req_fname = "requirements_lowlevel.yaml"
prompt_fname = "counter_unit_test_prompt.txt"

process_code_2_test( rules_fname, req_fname, prompt_fname)


Progress:
  9 .........
 Result Written to: counter_test.py
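
The body of `process_code_2_test` is not listed in this section. As a rough sketch of what such a translation step might involve, the snippet below assembles a chat-style message list from the two policy roles and the code under test. The helper name `build_test_messages`, the message layout, and the shortened sample strings are assumptions made for illustration; they are not the implementation used in this paper.

```python
from typing import Dict, List


def build_test_messages(rules: Dict[str, str], source_code: str) -> List[Dict[str, str]]:
    """Combine the test policy and the code under test into chat messages.

    `rules` is assumed to hold the `role_system` and `role_user` entries
    parsed from a policy file such as rules_ptest.yaml.
    """
    user_prompt = (
        rules["role_user"].rstrip()
        + "\n\nCode to be tested:\n"
        + source_code.rstrip()
        + "\n"
    )
    return [
        {"role": "system", "content": rules["role_system"].strip()},
        {"role": "user", "content": user_prompt},
    ]


# Hypothetical, shortened stand-ins for the policy file and the generated code.
sample_rules = {
    "role_system": "- You are a Python expert specializing in pytest.\n",
    "role_user": "- Use pytest fixtures and parameterization.\n",
}
sample_code = "class Counter:\n    pass\n"

messages = build_test_messages(sample_rules, sample_code)
for message in messages:
    print(message["role"], "->", len(message["content"]), "characters")
```

Keeping the policy in the system and user roles and appending the code last means the same policy file can be reused unchanged for any module that needs tests.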



### 11. Unit Test Code

In [47]:
%%writefile flowdiag.txt
┌─────────────┐       ┌─────────────┐       ┌─────────────┐       ╔════════╗
│ 2.HiLvlRqmt ┼─►#4──►│ 5.LoLvlRqmt ┼─►#7──►│   8.Code    ┼─►#10─►║11.UTEST║
└─────────────┘   ▲   └─────────────┘   ▲   └─────────────┘   ▲   ╚════════╝
            ┌─────┼───────┐       ┌─────┼───────┐       ┌─────┼───────┐          
            │ 1.pCodeTmpl │       │ 6.CodePolicy│       │ 9.TestPolicy│          
            └─────────────┘       └─────────────┘       └─────────────┘          

Overwriting flowdiag.txt


The following listing shows the generated unit test code.


In [48]:
dest_fname = "counter_test.py"
with open(dest_fname, "r") as file:
    content = file.read()
print(content)

# TOKENS: 1591 (of:4000) = 1037 + 554(prompt+return) -- MODEL: gpt-4o#
"""
test_counter.py

This module contains unit tests for the Counter class defined in counter.py.
The tests ensure that the Counter class behaves as expected, covering a range
of scenarios including normal operations, edge cases, and concurrent access.

Test Cases:
- Test initial counter value.
- Test incrementing the counter.
- Test decrementing the counter.
- Test decrementing the counter when the value is zero.
- Test concurrent increments and decrements to ensure thread safety.

Date: 2023-10-05
"""

import pytest
from counter import Counter
import threading

@pytest.fixture
def counter():
    """Fixture to create a new Counter instance for each test."""
    return Counter()

def test_initial_value(counter):
    """Test that the initial value of the counter is zero."""
    assert counter.get_value() == 0

def test_increment(counter):
    """Test incrementing the counter."""
    assert counter.increment() == 1
  

### 12. Unit Tests Results

In [49]:
%%writefile unittest.txt
(.venv) preact@oryx:~/sw/ai_sw_workflow$ pytest
======== test session starts ===================================
platform linux -- Python 3.8.10, pytest-8.3.3, pluggy-1.5.0
rootdir: /home/preact/sw/ai_sw_workflow
plugins: anyio-4.5.2
collected 8 items                                                                                                                         

counter_test.py ........                                    [100%]

============= 8 passed in 0.40s ==================================
(

Overwriting unittest.txt


### 13. Performance 'gpt-3.5-turbo' VS 'gpt-4o'

Both versions pass their unit tests without issue and meet the intent of the system requirements, but the `gpt-4o` code is far superior. The deficiencies in `3.5-turbo` are evident in the YAML `Low-Level Requirements` file and could be addressed either by adding explicit details to the `High-Level Requirements` or by directly editing the `Low-Level Requirements` YAML file. If edits are made to the `Low-Level Requirements`, the automatic translation from high-level to low-level requirements must be disabled to prevent overwriting your changes.

The `3.5-Turbo` model took only 13 seconds to process the pipeline while `4o` took twice as long, 26 seconds. The following table breaks down the pipeline timing:

| MODEL           | Total<br> Sec | Req<br> Sec | Code<br> Sec | Test<br> Sec |
| --------------- | ------------- | ----------- | ------------ | ------------ |
| `GPT-3.5-Turbo` | 13            | 5           | 4            | 4            |
| `GPT-4o`        | 26            | 17          | 12           | 7            |
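
How these timings were collected is not described here. One plausible way to gather per-stage figures of this kind is to wrap each pipeline stage in a wall-clock timer, as in the sketch below. The helper `time_stages` and the placeholder stages are hypothetical; in a real run each placeholder would be replaced by the corresponding requirements, code, or test translation call.

```python
import time
from typing import Callable, Dict, List, Tuple


def time_stages(stages: List[Tuple[str, Callable[[], None]]]) -> Dict[str, float]:
    """Run each stage once and return elapsed wall-clock seconds per stage."""
    timings: Dict[str, float] = {}
    for name, stage in stages:
        start = time.perf_counter()
        stage()
        timings[name] = time.perf_counter() - start
    # The total is the sum of the individual stage timings.
    timings["Total"] = sum(timings.values())
    return timings


# Placeholder stages: short sleeps stand in for the LLM calls.
pipeline = [
    ("Req", lambda: time.sleep(0.01)),
    ("Code", lambda: time.sleep(0.01)),
    ("Test", lambda: time.sleep(0.01)),
]

timings = time_stages(pipeline)
for name, seconds in timings.items():
    print(f"{name:<6}{seconds:.3f} s")
```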

LLMs are pattern inference machines. A key takeaway is that `GPT-4o` successfully infers our intent to use the 'Counting Semaphore' design pattern without requiring explicit instructions. Explicitly calling out the desired design pattern in the `High-Level Requirements` significantly improves the quality of the code generated by `3.5-Turbo`.

The following tables highlight the differences in the artifacts produced by the two models. The most notable differences are:

- **Documentation**: `GPT-4o` includes detailed and well-formatted docstrings, which are entirely missing from the `3.5-Turbo` version.
- **Thread Safety**: While both versions function correctly, `3.5-Turbo` unnecessarily exposes the counter, violating encapsulation principles. This reduces the robustness of the implementation.


**Low-Level Requirements Differences**

| **Aspect**              | **GPT-3.5-Turbo**                                     | **GPT-4o**                                             |
| ----------------------- | ----------------------------------------------------- | ------------------------------------------------------ |
| **Thread<br>Safety**       | Described in `architecture`, not implemented in steps | Locking is detailed in the `Action_Step_Table`         |
| **Method<br>Parameters**   | Includes parameters for `increment`/`decrement`       | No parameters; operates directly on encapsulated value |
| **Edge Case<br>Handling**  | General mentions in `error_handling`                  | Integrated into the method workflows                   |
| **Naming**              | Uses `retrieve()`                                     | Uses `get_value()`                                     |
| **Concurrency<br>Details** | Minimizes performance impact (general)                | Adds specifics (efficient locking mechanisms)          |

**Python Code Generation Differences**

| **Aspect**              | **GPT-3.5-Turbo**                                                                 | **GPT-4o**                                                                                |
|-------------------------|-----------------------------------------------------------------------------------|------------------------------------------------------------------------------------------|
| **Thread<br>Safety**       | Uses `threading.Lock` but passes a `counter_value` parameter, creating potential redundancy and misuse. | Uses `threading.Lock` consistently, with methods operating directly on the encapsulated counter value. |
| **Increment/<br>Decrement** | Passes `counter_value` in as a parameter, which adds complexity and exposes internal state. Exceptions are raised for invalid states. | Operates directly on the encapsulated value with no parameters, avoiding the problem seen in 3.5-Turbo. Handles edge cases (e.g., zero decrement) gracefully without exceptions. |
| **Edge Case<br>Handling**  | Raises exceptions for invalid operations like decrementing below zero.             | Treats edge cases predictably (e.g., decrementing at zero is a no-op), avoiding unnecessary exceptions. |
| **Code<br>Simplicity**     | Includes unnecessary parameters in methods, complicating the API and violating encapsulation.        | Clean, parameter-free methods ensure consistent and intuitive operation on encapsulated data. |
| **Retrieve/<br>Get Value**  | Uses `retrieve()` with thread safety but less standard naming.                     | Uses `get_value()` with thread safety, aligning with Python naming conventions.              |
| **Example<br>Usage**       | Demonstrates usage that passes the counter value to methods, contradicting encapsulation. | Demonstrates clean usage directly interacting with the counter object, reinforcing good encapsulation. |
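
To make the encapsulated style described in the `GPT-4o` column concrete, the following is a minimal sketch of a counter with that shape. It is not the generated `counter.py`; the method names `increment`, `decrement`, and `get_value` follow the unit test listing, while the return values and the no-op behavior at zero are assumptions based on the table above. The `worker` helper is hypothetical.

```python
import threading


class Counter:
    """Thread-safe counter that keeps its value private."""

    def __init__(self) -> None:
        self._value = 0
        self._lock = threading.Lock()

    def increment(self) -> int:
        """Add one and return the new value."""
        with self._lock:
            self._value += 1
            return self._value

    def decrement(self) -> int:
        """Subtract one unless already at zero; return the new value."""
        with self._lock:
            if self._value > 0:
                self._value -= 1
            return self._value

    def get_value(self) -> int:
        """Return the current value under the lock."""
        with self._lock:
            return self._value


def worker(shared: Counter, repeats: int) -> None:
    """Increment the shared counter `repeats` times."""
    for _ in range(repeats):
        shared.increment()


counter = Counter()
threads = [threading.Thread(target=worker, args=(counter, 1000)) for _ in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(counter.get_value())  # 8000: every increment happens under the lock
```

Because no method accepts a `counter_value` argument, callers cannot hand the object a stale or inconsistent value, which is the encapsulation concern raised for the `3.5-Turbo` version.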


---


