# Domain Specific Code Generation using the FormLang DSL

## Abstract

AI and LLM systems are being trained to perform a variety of complex tasks requiring expertise in many software tools, languages and technology stacks. Challenges in this process include acquiring large amounts of high-quality data, handling growing model parameter counts and the resulting compute demands and costs, and emerging data privacy and intellectual property concerns related to using third-party cloud services for model training.

In this work we attempt to harness and combine the benefits of abstraction and determinism provided by formal Domain Specific Languages (DSLs) with the innate ability of LLMs to learn new languages and their semantics. We propose a novel generation task called "Domain Specific Code Generation", which involves mapping user requests written in natural language to DSL code.

By utilizing a specially crafted DSL called `FormLang` as a case study, we attempt to lay the groundwork for methods of automated DSL dataset generation, training techniques and performance evaluation, with the end goal of creating an AI system capable of generating Web forms according to a user request. If successful and viable, these methods can be applied to other problem domains in which real-world processes can be formally modelled using DSLs, opening the door to the training of AI agents that understand these DSLs and can be engineered to perform tasks autonomously.

Our `FormLang` DSL allows expressing the semantics of Web-forms using a simplified syntax that does not require much, if any, Web-programming knowledge and expertise.

Given a user prompt in English describing the desired form and its fields, the LLM produces syntactically valid `FormLang` output, which is run through the accompanying FormLang parser and a hand-crafted React JSX compiler to output a final implementation of the form in JavaScript and React.

**(TBD)** The project includes a live demo which demonstrates the capabilities of the system.


## Referring to this work

If you use this work, please cite it as follows:

```bibtex
@misc{guyor2025dscodegenformlang,
      title={Domain Specific Code Generation using the FormLang DSL},
      author={Guy Or},
      year={2025}
}
```

The official repository of this work is hosted on GitHub at https://github.com/guyo13/Form-Lang.

## Project Goals

* Create an AI training pipeline for FormLang (implemented as a Jupyter notebook) which includes:
    * Automatic FormLang dataset generation using search algorithms and heuristics.
    * Baseline model selection and loading from Hugging Face.
    * Dataset preprocessing and loading.
    * Defining performance metrics for the model.
    * Model fine-tuning using the Transformers library.
    * Model adapter training using the PEFT library.
    * Model upload to Hugging Face Hub and example usage from the Hub.
    * **(TBD)** Export to ONNX using Optimum and run on-device using Transformers.js.
* Create the “FormLang” language:
    * Describe the problem domain.
    * Define a viable minimal syntax and semantics that are research-focused rather than completeness-focused.
    * **(TBD)** Implement a “JavaScript React” compile target.
* **(TBD)** Create a live demo website:
    * **(TBD)** Users input a prompt.
    * **(TBD)** A FormLang editor is populated with the AI’s code generation results.
    * **(TBD)** The form is rendered alongside the generated code.
* Discuss the project results:
    * Performance and user acceptability.
    * **(TBD)** Viability of the implemented methods for the Domain Specific Code Generation task and generalization to other domains.  
    * **(TBD)** Potential enhancements to the system.
    * **(TBD)** Possible research directions on how to learn from user data.


**(TBD)** - Features marked as TBD depend on the project's progress and timeline constraints, as well as on first proving the viability of the methods and system.

# Notebook Setup

## About the FormLang DSL and its codebase

The FormLang DSL is implemented using [Langium](https://langium.org/). Langium is a toolkit written in TypeScript that allows language engineers to create DSLs and quickly iterate over their development lifecycle.

Some of the many features offered by Langium are:

* Parser generation - Langium leverages [Chevrotain](https://chevrotain.io/docs/) to create a fast parser that parses the DSL code into an AST.
* TypeScript AST node generation - Langium automatically generates TS types that represent the AST of the language.
* Simplified validation rules - Langium has built-in support for writing and executing validation rules, with built-in error reporting.
* Language Server Protocol support - Langium implements the LSP, which enables code completion and syntax highlighting in many IDEs such as VSCode.

Since we are using Langium to parse the FormLang DSL, some of this project's tooling and algorithms are written in TypeScript.

To call our TypeScript code from Python we are using two distinct methods:

* [PythonMonkey](https://pythonmonkey.io/) - A Python package that implements FFI between Python and JavaScript.
* Local HTTP API - For cases in which the PythonMonkey FFI is too slow or buggy, we've implemented a local REST API server that exposes some of the functions we require (e.g. for parsing FormLang code to an AST).

### Codebase

The entire codebase is hosted on GitHub at [FormLang](https://github.com/guyo13/Form-Lang).

### Running from Colab

#### Installing PNPM and Bun

In [1]:
!npm install -g pnpm
!npm install -g bun@latest
!npm install -g corepack@latest

added 1 package in 2s

1 package is looking for funding
  run `npm fund` for details

added 5 packages in 5s

changed 1 package in 1s

In [2]:
!bun -v

1.2.2


#### Installing PythonMonkey

In [3]:
!pip install --quiet pythonmonkey

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for pminit (pyproject.toml) ... done


#### Installing Hugging Face libraries and Unsloth

In [4]:
!pip install --quiet datasets evaluate bitsandbytes unsloth

#### Installing dependencies for ROUGE score

In [5]:
!pip install --quiet nltk absl-py rouge_score

  Preparing metadata (setup.py) ... done
  Building wheel for rouge_score (setup.py) ... done


#### Installing Flash attention 2

In [6]:
!pip install flash-attn --quiet --no-build-isolation

  Preparing metadata (setup.py) ... done
  Building wheel for flash-attn (setup.py) ... done


#### Installing Optimum-Quanto

In [7]:
!pip install optimum-quanto --quiet

#### Clone the repo

In [8]:
!rm -rf Form-Lang

In [9]:
!git clone -b react_compiler https://github.com/guyo13/Form-Lang.git

Cloning into 'Form-Lang'...
remote: Enumerating objects: 721, done.
remote: Counting objects: 100% (141/141), done.
remote: Compressing objects: 100% (117/117), done.
remote: Total 721 (delta 101), reused 34 (delta 24), pack-reused 580 (from 1)
Receiving objects: 100% (721/721), 805.59 KiB | 2.46 MiB/s, done.
Resolving deltas: 100% (464/464), done.


#### Install project dependencies and build

In [10]:
!cd Form-Lang && corepack install && corepack pnpm install && corepack pnpm run gb

Adding pnpm@10.0.0+sha512.b8fef5494bd3fe4cbd4edabd0745df2ee5be3e4b0b8b08fa643aa3e4c6702ccc0f00d68fa8a8c9858a735a0032485a44990ed2810526c875e416f001b17df12b to the cache...
Lockfile is up to date, resolution step is skipped
Packages: +502
Progress: resolved 502, reused 0, downloaded 0, added 0

Update available! 10.0.0 → 10.4.0.
Changelog: https://github.com/pnpm/pnpm/releases/tag/v10.4.0
Run "corepack install

In [11]:
!cd Form-Lang/ml && pnpm install

Lockfile is up to date, resolution step is skipped
Packages: +69
Progress: resolved 69, reused 23, downloaded 46, added 69, done

dependencies:
+ express 4.21.2

Done in 1.1s


#### Run the FormLang HTTP API server

In [12]:
from subprocess import Popen, PIPE
import sys

def run_js_with_bun(js_file, cwd=None):
  """
  Runs a JavaScript file using Bun as a subprocess.

  Args:
    js_file: The path to the JavaScript file.
    cwd: The working directory for the child process.
         If None, the current working directory is used.
  """
  try:
    # Execute the JavaScript file using Bun with specified working directory
    process = Popen(['bun', js_file], stdout=PIPE, cwd=cwd)
    return process

  except FileNotFoundError:
    print("Error: Bun not found. Please ensure Bun is installed and in your PATH.")
  except Exception as e:
    import traceback
    traceback.print_exc()
    print("An unexpected error occurred:", e)

http_server = run_js_with_bun("lib_server.cjs", "Form-Lang/ml")
http_server

<Popen: returncode: None args: ['bun', 'lib_server.cjs']>
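
`Popen` returns as soon as the child process is spawned, so the server may not be listening yet when the next cell sends a request. A minimal sketch of a readiness check is shown below; `wait_for_port` is a hypothetical helper that is not part of the FormLang codebase, and it only assumes the server accepts TCP connections on the port used by the `curl` call (3000).

```python
import socket
import time


def wait_for_port(host, port, timeout=10.0, interval=0.2):
    """Return True once a TCP connection to (host, port) succeeds, False on timeout.

    Hypothetical helper for illustration; not part of the FormLang repository.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait a little and retry.
            time.sleep(interval)
    return False
```

Calling `wait_for_port("localhost", 3000)` before the first request would avoid a connection error if the Bun process needs a moment to start.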

In [18]:
!curl -d '{"sourceCode": "component hey{} form helloWorld {comp hey}"}' -H "content-type: application/json" "http://localhost:3000/compute/ast"

{"status":"ok","result":{"ast":"{\"$type\":\"Form\",\"name\":\"helloWorld\",\"component\":{\"$type\":\"FieldComponentDef\",\"componentId\":{\"$ref\":\"#/components@0\"},\"componentPropsKeys\":[],\"componentPropsValues\":[]},\"children\":[]}"}}

## Project imports

In [14]:
import asyncio
import time
import re
import functools
import json
from pprint import pprint
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM, GenerationConfig, QuantizedCacheConfig
from datasets import Dataset, load_dataset
from huggingface_hub import login
from tqdm.notebook import trange, tqdm
import evaluate
import pythonmonkey as pm
import pandas as pd
import requests
import torch
try:
    # If using locally from the `ml` folder
    formlang_lib = pm.require("../out/cjs/lib/index")
except Exception:
    # If using in Colab:
    formlang_lib = pm.require("./Form-Lang/out/cjs/lib/index")

## Login to Hugging Face Hub

In [15]:
login()


## Helper functions

### PythonMonkey helpers

These can be used to print JavaScript objects that live inside the PythonMonkey engine.

In [16]:
def js_dir(something):
    pm.globalThis.console.dir(something)

def js_log(something):
    pm.globalThis.console.log(something)

### Computing the AST

In order to check the model's output we will need to parse the output FormLang code and check it for errors. We will also use the AST to compute performance metrics. Our `get_ast` function will be used to send FormLang source code to our local HTTP API for parsing.  

In [17]:
def get_ast(code, shouldCheckErrors=False):
    """Send FormLang source code to the local HTTP API and return the parse result."""
    payload = {"sourceCode": code, "shouldCheckErrors": shouldCheckErrors}
    result = requests.post("http://localhost:3000/compute/ast", json=payload).json()
    if result["status"] != "ok":
        raise RuntimeError(f"AST computation failed: {result}")
    return result["result"]
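
As the sample `/compute/ast` response above suggests, the `ast` field is itself a JSON-encoded string, so it needs a second decoding step before it can be traversed. The sketch below decodes such a string and counts its `Form` and `Field` nodes; `count_nodes` is a hypothetical helper, and the sample string is a trimmed copy of the hello-world response.

```python
import json


def count_nodes(ast_node):
    """Count Form and Field nodes in a decoded FormLang AST (a dict).

    Hypothetical helper for illustration; relies only on the "$type" and
    "children" keys visible in the sample output.
    """
    counts = {"Form": 0, "Field": 0}
    stack = [ast_node]
    while stack:
        node = stack.pop()
        node_type = node.get("$type")
        if node_type in counts:
            counts[node_type] += 1
        # In the sample output only Form nodes carry a "children" list.
        stack.extend(node.get("children", []))
    return counts


# Trimmed copy of the `ast` string returned for the hello-world request.
sample_ast = '{"$type":"Form","name":"helloWorld","children":[]}'
print(count_nodes(json.loads(sample_ast)))
```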

# Dataset generation

## Generating the dataset

The `ProbabilisticSearchFormGenerator` is a class implementing a parameterized algorithm for randomly generating forms:

* Iterate depth-first, starting with an empty form definition.
* For each item taken off the frontier, if the item is a Field, append it to its parent Form.
* If the item is a Form, generate its children and add them to the frontier. Children are generated according to the following rules:
    * The child node is assigned $depth = parentDepth + 1$.
    * A random child Form is generated with probability $\alpha^{d}$, where $0 \leq \alpha < 1$ and $d$ is the depth of the child.
    * An optional parameter $D$ defines the maximum depth for nested forms: the probability of generating a nested Form is set to $0$ if its depth would be $D$.
    * A random child Field is generated with probability $1 - \alpha^{d}$.
    * The number of generated children is a random number in the range $[minChildren, maxChildren]$.
    * A generated Field contains a `state` with probability $\beta$.
      * A state definition is an `array` with probability $\gamma$.
        * The number of array elements is chosen at random from the integer interval $[amin, amax]$.
      * The state `type` is equally likely to be any of the supported built-in types.
      * A state definition contains a `default` value with probability $\delta$.
        * If the `type` is `string`, the default value is defined `as expression` with probability $\epsilon$; otherwise the probability is $1$.
* For each item taken off the frontier, choose a random component from a set of available components.
  * Choose at random the number of assigned component props from the range $[0, ComponentPropCount]$.
    * For each assigned prop, generate a random value that is defined `as expression` with probability $\epsilon$.

The algorithm randomly generates the tree structure of the form, which is then serialized into `FormLang` source code.
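
To make the tree-generation rules above more concrete, here is a simplified Python sketch. It is not the TypeScript `ProbabilisticSearchFormGenerator`: the function name, the dict layout and the default parameter values are all assumptions made for illustration, and it covers only the $\alpha$, $\beta$, $D$ and child-count rules (no arrays, defaults, components or props).

```python
import random


def generate_form(alpha=0.5, beta=0.5, max_depth=3,
                  min_children=1, max_children=4, rng=None):
    """Generate a random form tree as nested dicts (illustrative sketch only)."""
    rng = rng or random.Random()
    counter = 0

    def new_id():
        nonlocal counter
        counter += 1
        return f"n{counter}"

    root = {"kind": "form", "id": new_id(), "depth": 0, "children": []}
    frontier = [root]  # a LIFO stack gives depth-first order
    while frontier:
        form = frontier.pop()
        child_depth = form["depth"] + 1
        for _ in range(rng.randint(min_children, max_children)):
            # A nested form appears with probability alpha**depth,
            # and never once the child would sit at max_depth.
            p_form = 0.0 if child_depth >= max_depth else alpha ** child_depth
            if rng.random() < p_form:
                child = {"kind": "form", "id": new_id(),
                         "depth": child_depth, "children": []}
                frontier.append(child)
            else:
                # Otherwise generate a field, which has a state with probability beta.
                child = {"kind": "field", "id": new_id(),
                         "depth": child_depth,
                         "has_state": rng.random() < beta}
            form["children"].append(child)
    return root
```

Passing a seeded `random.Random` makes a run reproducible, and raising `alpha` or `max_depth` shifts the distribution toward deeper, more nested forms.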

The main advantages of using probabilistic search are its simplicity of implementation and the ability to create datasets representing different distributions by adjusting the generation hyperparameters.

We start by creating a [ProbabilisticSearchFormGenerator](https://github.com/guyo13/Form-Lang/blob/react_compiler/src/generation/formGen.ts#L102) object, which is implemented in JavaScript, using the default search hyperparameters and some hard-coded Component definitions.

Later on we will implement a random generator for the component definitions as well.

In [None]:
form_components = """
  component userDetailsContainer {}
  component formContainer {}
  component someOtherContainer {}
  component OtherContainer2 {}
"""
field_components = """
  component myTextBox {
    props {
      textColor
      textSize
      textWeight
      borderColor
    }
  }
  component myCheckbox {
    props {
      size
    }
  }
  component otherTextBox {}
  component counter {
    props {
      style
    }
  }
"""

form_gen = await formlang_lib.newFormGen(formlang_lib.DEFAULT_GENERATOR_HYPER_PARAMETERS, form_components, field_components)

### Create a data generation loop for the prompt data

#### Prompt-Data generation algorithm

Each training example consists of a randomly generated form from which we remove (or "mask") a random node, keeping track of the removed node's parent and siblings.

The goal is to create a prompt that provides the masked form and describes in English the node that needs to be added, along with its location in the form.

The algorithm is as follows:

* Generate a random Form - outputs $F$.
* Serialize the Form to FormLang code - outputs $s(F)$.
* Iterate Depth-First starting at the form's root.
  * For every node, the probability of removing it is $1 - \zeta^{depth(node)}$ with $0 < \zeta < 1$ if no node has been removed yet; otherwise the probability is $0$.
  * If a node is marked for removal, record its surrounding context and remove it from the tree - $F \to (F', N, ctx(F, N))$.
* Serialize the modified Form to FormLang code - outputs $s(F')$.
* Generate the prompt instruction parts:
  * Serialize the removed child node into plain English - outputs $eng(N)$.
  * Serialize the ids and node types (form/field) of the parent and sibling nodes into plain English - outputs $eng(ctx(F, N))$.
* Output a dict consisting of $s(F), s(F'), eng(N), eng(ctx(F, N))$.

The algorithm is implemented in TypeScript in the [generateRandomFormWithModification](https://github.com/guyo13/Form-Lang/blob/react_compiler/src/lib/generator.ts#L43) function.
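
The node-removal step can be sketched in Python as follows. This is a simplified illustration and not a port of `generateRandomFormWithModification`: the tree is assumed to be nested dicts with `id` and optional `children` keys, the function name and context layout are hypothetical, and returning `None` when no node is selected is an assumption, since the text does not say how that case is handled.

```python
import copy
import random


def mask_random_node(form, zeta=0.5, rng=None):
    """Remove at most one non-root node.

    Returns (modified_form, removed_node, context); the input is not mutated.
    """
    rng = rng or random.Random()
    modified = copy.deepcopy(form)
    stack = [(modified, None, 0)]  # (node, parent, depth), visited depth-first
    while stack:
        node, parent, depth = stack.pop()
        # Removal probability is 1 - zeta**depth, so the root (depth 0) is never removed.
        if parent is not None and rng.random() < 1 - zeta ** depth:
            siblings = [c["id"] for c in parent["children"] if c is not node]
            parent["children"] = [c for c in parent["children"] if c is not node]
            context = {"parent_id": parent["id"], "sibling_ids": siblings}
            return modified, node, context
        # Push children in reverse so they are visited in their original order.
        for child in reversed(node.get("children", [])):
            stack.append((child, node, depth + 1))
    return modified, None, None  # nothing was selected on this pass
```

Because the removal probability grows with depth, a smaller `zeta` makes shallow nodes more likely to be masked, while a `zeta` close to 1 pushes removals toward deeper nodes.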

We create a small dataset of 3000 random examples by using the JavaScript FormLang utilities library and the form generator object.

We copy from the form generator's output only the relevant fields which will make up the LLM prompt.

For each generated example we also compute the original form's AST and store it as part of the dataset.

In [None]:
# TODO - Refactor data generator pipeline as a Python class with easy interfaces

def concat_components_code_with_form_code(form_code, form_components, field_components):
    """Concatenates the form code with the given components code that are referenced inside of it."""
    return f"{form_components}\n{field_components}\n{form_code}"

def create_prompt_data(generation_result, form_components, field_components):
    data = {
        "originalFormCode": concat_components_code_with_form_code(generation_result["serializedForm"], form_components, field_components),
        "modifiedFormCode": concat_components_code_with_form_code(generation_result["serializedModifiedForm"], form_components, field_components),
        "removedNodeEnglish": str(generation_result["removedNodeEnglish"]),
        "removedNodeContextEnglish": str(generation_result["removedNodeContextEnglish"]),
    }
    return data

def create_random_data(form_gen, form_components, field_components, num_examples=3000):
    start_time = time.time()
    data = []
    for _ in range(num_examples):
        generation_result = formlang_lib.generateRandomFormWithModification(form_gen)
        prompt_data = create_prompt_data(generation_result, form_components, field_components)
        data.append(prompt_data)
    # Compute the AST of each example after all forms have been generated
    ast_time = time.time()
    for example in data:
        resp = get_ast(example["originalFormCode"], True)
        errors = resp.get("errors", [])
        if len(errors) > 0:
            print("Error parsing generated code", errors)
            raise RuntimeError("Error parsing generated code")
        example["originalFormAst"] = resp["ast"]
    end_time = time.time()
    print(f"Generation took {end_time - start_time} seconds. Total Ast generation time {end_time - ast_time} seconds")
    return pd.DataFrame(data)

In [None]:
examples = create_random_data(form_gen, form_components, field_components, 3000)
examples.head()

Generation took 48.5854971408844 seconds. Total Ast generation time 47.7074978351593 seconds


| | originalFormCode | modifiedFormCode | removedNodeEnglish | removedNodeContextEnglish | originalFormAst |
|---|---|---|---|---|---|
| 0 | `\n component userDetailsContainer {}\n compo...` | `\n component userDetailsContainer {}\n compo...` | `\ta field whose id is 'Z_' with state of type:...` | `* is a child of the form whose id is 'Bw'.\n* ...` | `{"$type":"Form","name":"Bw","component":{"$typ...` |
| 1 | `\n component userDetailsContainer {}\n compo...` | `\n component userDetailsContainer {}\n compo...` | `\ta field whose id is 'p_' with state of type:...` | `* is a child of the form whose id is 'Cw'.\n* ...` | `{"$type":"Form","name":"Cw","component":{"$typ...` |
| 2 | `\n component userDetailsContainer {}\n compo...` | `\n component userDetailsContainer {}\n compo...` | `\ta field whose id is 'F' using the component ...` | `* is a child of the form whose id is 'Vw'.\n* ...` | `{"$type":"Form","name":"Vw","component":{"$typ...` |
| 3 | `\n component userDetailsContainer {}\n compo...` | `\n component userDetailsContainer {}\n compo...` | `\t\ta field whose id is 'Pww_ww___w' using the...` | `* is a child of the form whose id is 'd'.\n* i...` | `{"$type":"Form","name":"Lw","component":{"$typ...` |
| 4 | `\n component userDetailsContainer {}\n compo...` | `\n component userDetailsContainer {}\n compo...` | `\ta field whose id is 'nw' with state of type:...` | `* is a child of the form whose id is 'a_'.\n* ...` | `{"$type":"Form","name":"a_","component":{"$typ...` |


#### Example generated FormLang code and its AST

In [None]:
examples["originalFormCode"][1132]

'\n  component userDetailsContainer {}\n  component formContainer {}\n  component someOtherContainer {}\n  component OtherContainer2 {}\n\n\n  component myTextBox {\n    props {\n      textColor\n      textSize\n      textWeight\n      borderColor\n    }\n  }\n  component myCheckbox {\n    props {\n      size\n    }\n  }\n  component otherTextBox {}\n  component counter {\n    props {\n      style\n    }\n  }\n\nform T_ {\n\tcomp userDetailsContainer \n\t\n\tform H {\n\t\tcomp userDetailsContainer \n\t\t\n\t\tfield Ew {\n\t\t\tcomp otherTextBox \n\t\t\t\n\t\t}\n\t\tfield Rw_w {\n\t\t\tcomp otherTextBox \n\t\t\t\n\t\t}\n\t\tfield F_ {\n\t\t\tstate number\n\t\t\tcomp myTextBox borderColor="#301410" textSize="(() => 97.65823852274963)()" as expression\n\t\t\t\n\t\t}\n\t}\n\n\tfield p___ {\n\t\tcomp otherTextBox \n\t\t\n\t}\n\tfield vw {\n\t\tstate string default "Frankie Kessler"\n\t\tcomp counter style="#5b426f"\n\t\t\n\t}\n\tfield f {\n\t\tcomp counter \n\t\t\n\t}\n\tfield q {\n\t\tstat

In [None]:
examples["originalFormAst"][1132]

'{"$type":"Form","name":"T_","component":{"$type":"FieldComponentDef","componentId":{"$ref":"#/components@0"},"componentPropsKeys":[],"componentPropsValues":[]},"children":[{"$type":"Form","name":"H","component":{"$type":"FieldComponentDef","componentId":{"$ref":"#/components@0"},"componentPropsKeys":[],"componentPropsValues":[]},"children":[{"$type":"Field","name":"Ew","component":{"$type":"FieldComponentDef","componentId":{"$ref":"#/components@6"},"componentPropsKeys":[],"componentPropsValues":[]}},{"$type":"Field","name":"Rw_w","component":{"$type":"FieldComponentDef","componentId":{"$ref":"#/components@6"},"componentPropsKeys":[],"componentPropsValues":[]}},{"$type":"Field","name":"F_","state":{"$type":"FieldStateDef","type":"number","isArray":false},"component":{"$type":"FieldComponentDef","componentId":{"$ref":"#/components@4"},"componentPropsKeys":[{"$type":"ComponentPropKey","key":"borderColor"},{"$type":"ComponentPropKey","key":"textSize"}],"componentPropsValues":[{"$type":"Va

### Creating the user prompts

For each example in our synthetic dataset, we create the user prompt which will be used to train and test the LLM.

The user prompt consists of the modified FormLang code taken from each of our dataset examples, along with the English description of the form element to add and its location (context) in the form.

In [None]:
def create_prompt(row):
    return (
            "```FormLang\n"
            f"{row['modifiedFormCode']}\n"
            "```\n"
            "The description of the form element you need to add:\n"
            f"{row['removedNodeEnglish']}\n"
            "The description of the context in the form where you should add the element:\n"
            f"The element to be added:\n{row['removedNodeContextEnglish']}\n"
           )

In [None]:
examples["userPrompt"] = examples.apply(create_prompt, axis=1)
examples.head()

| | originalFormCode | modifiedFormCode | removedNodeEnglish | removedNodeContextEnglish | originalFormAst | userPrompt |
|---|---|---|---|---|---|---|
| 0 | `\n component userDetailsContainer {}\n compo...` | `\n component userDetailsContainer {}\n compo...` | `\ta field whose id is 'Z_' with state of type:...` | `* is a child of the form whose id is 'Bw'.\n* ...` | `{"$type":"Form","name":"Bw","component":{"$typ...` | ```` ```FormLang\n\n component userDetailsContaine... ```` |
| 1 | `\n component userDetailsContainer {}\n compo...` | `\n component userDetailsContainer {}\n compo...` | `\ta field whose id is 'p_' with state of type:...` | `* is a child of the form whose id is 'Cw'.\n* ...` | `{"$type":"Form","name":"Cw","component":{"$typ...` | ```` ```FormLang\n\n component userDetailsContaine... ```` |
| 2 | `\n component userDetailsContainer {}\n compo...` | `\n component userDetailsContainer {}\n compo...` | `\ta field whose id is 'F' using the component ...` | `* is a child of the form whose id is 'Vw'.\n* ...` | `{"$type":"Form","name":"Vw","component":{"$typ...` | ```` ```FormLang\n\n component userDetailsContaine... ```` |
| 3 | `\n component userDetailsContainer {}\n compo...` | `\n component userDetailsContainer {}\n compo...` | `\t\ta field whose id is 'Pww_ww___w' using the...` | `* is a child of the form whose id is 'd'.\n* i...` | `{"$type":"Form","name":"Lw","component":{"$typ...` | ```` ```FormLang\n\n component userDetailsContaine... ```` |
| 4 | `\n component userDetailsContainer {}\n compo...` | `\n component userDetailsContainer {}\n compo...` | `\ta field whose id is 'nw' with state of type:...` | `* is a child of the form whose id is 'a_'.\n* ...` | `{"$type":"Form","name":"a_","component":{"$typ...` | ```` ```FormLang\n\n component userDetailsContaine... ```` |


#### Example user prompt vs. original code

In [None]:
print(examples['userPrompt'][0])

```FormLang

  component userDetailsContainer {}
  component formContainer {}
  component someOtherContainer {}
  component OtherContainer2 {}


  component myTextBox {
    props {
      textColor
      textSize
      textWeight
      borderColor
    }
  }
  component myCheckbox {
    props {
      size
    }
  }
  component otherTextBox {}
  component counter {
    props {
      style
    }
  }

form Bw {
	comp OtherContainer2 
	
	field G {
		comp myCheckbox size="'Hello'.toLowerCase()" as expression
		
	}
	form a {
		comp formContainer 
		
		field qw {
			state boolean[]
			comp otherTextBox 
			
		}
		field F__ {
			comp myTextBox 
			
		}
		field Y__ {
			state string default "'John Streich'" as expression
			comp myTextBox 
			
		}
		field S_w {
			state string
			comp myCheckbox size="#121310"
			
		}
		field N_ {
			state number default "0.17049162264892637" as expression
			comp counter 
			
		}
	}

	field pw_wwww_w__ {
		state boolean
		comp counter 
		
	}
	field J {
		state s

In [None]:
print(examples['originalFormCode'][0])


  component userDetailsContainer {}
  component formContainer {}
  component someOtherContainer {}
  component OtherContainer2 {}


  component myTextBox {
    props {
      textColor
      textSize
      textWeight
      borderColor
    }
  }
  component myCheckbox {
    props {
      size
    }
  }
  component otherTextBox {}
  component counter {
    props {
      style
    }
  }

form Bw {
	comp OtherContainer2 
	
	field G {
		comp myCheckbox size="'Hello'.toLowerCase()" as expression
		
	}
	form a {
		comp formContainer 
		
		field qw {
			state boolean[]
			comp otherTextBox 
			
		}
		field F__ {
			comp myTextBox 
			
		}
		field Y__ {
			state string default "'John Streich'" as expression
			comp myTextBox 
			
		}
		field S_w {
			state string
			comp myCheckbox size="#121310"
			
		}
		field N_ {
			state number default "0.17049162264892637" as expression
			comp counter 
			
		}
	}

	field pw_wwww_w__ {
		state boolean
		comp counter 
		
	}
	field Z_ {
		state string
		com

### Creating a Dataset on HuggingFace using `Datasets`

First, we manually created a Dataset repository on the Hugging Face Hub: [form-lang-examples](https://huggingface.co/datasets/guy-or/form-lang-examples).

We then convert our `examples` DataFrame into a Dataset object and upload it to the Hub.


In [None]:
ds = Dataset.from_pandas(examples)
ds.push_to_hub("guy-or/form-lang-examples", "3k_single_omission")
ds

Dataset({
    features: ['originalFormCode', 'modifiedFormCode', 'removedNodeEnglish', 'removedNodeContextEnglish', 'originalFormAst', 'userPrompt'],
    num_rows: 3000
})

# Zero-shot evaluation and Training using Llama 3.2-1B as a foundation model

We've selected Llama 3.2 1B Instruct as the foundation model for our experiments.

This choice is driven by the popularity of the model and the abundance of resources and examples available online. Specifically, the 1B version was chosen because it can run on consumer hardware and can be trained on free cloud resources.

Another important factor in this choice is Llama's large context size of 128K tokens. This allows us to experiment with large prompts and opens the door to building complex AI agents that can be fed `FormLang`'s compiler messages and user feedback and adjust their responses accordingly.

Successfully demonstrating a performance improvement after training the 1B version will justify the further time and cost investment required for training larger models.

## Performance metrics

Before attempting to train the model, we want to define performance metrics for the code generation task and evaluate the base model on some of our data in order to establish a performance baseline.

We will measure 3 metrics:

* Accuracy
* BLEU
* ROUGE

BLEU and ROUGE are standard metrics for measuring text generation performance by comparing the model's output to a desired output.

In the case of code generation, these metrics may not accurately reflect the quality of the model's output when applied to the source code alone. We will therefore apply them to three representations: the raw output of the model versus the expected one, the generated source code, and the serialized AST (in JSON format) of the generated code.

**(TBD)** Try CodeBLEU and `apted` for tree edit distance algorithms that might better reflect the model's performance.

## Loading the Dataset from HuggingFace using `Datasets`

We load the Dataset from the Hub using the `load_dataset` function. The dataset contains the raw user prompt and the expected outputs.

In the next steps we will use these fields to create a full prompt which includes our system prompt and finally tokenize the prompt in preparation for running the model.

In [19]:
ds_3k_single_omission = load_dataset("guy-or/form-lang-examples", "3k_single_omission")
ds_3k_single_omission_all = ds_3k_single_omission['train']
ds_3k_single_omission_all

Dataset({
    features: ['originalFormCode', 'modifiedFormCode', 'removedNodeEnglish', 'removedNodeContextEnglish', 'originalFormAst', 'userPrompt'],
    num_rows: 3000
})

### Dataset pre-processing

In order to create prompts that effectively describe the task to our Llama 3.2 model, we need to format our prompt in a way that the model was trained to respond to.

This is called a [Prompt Template](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1/#prompt-template), which consists of a **System prompt** which provides the initial context and guidelines for the model, followed by a **User prompt** containing the request and relevant data.

#### System prompt

Our System prompt instructs the LLM to generate the original form (the `originalFormCode` column) given the data present in the user prompt which we already created in the Dataset generation part.

In [20]:
def get_system_prompt():
    return (
        "You are a code generation AI assistant.\n"
        "Your job is to generate valid FormLang code according to the instructions given below:\n"
        "Inspect the following FormLang Form definition, the start of the code will be denoted with ```FormLang and its end with ``` .\n"
        "After inspection, complete the Form's code according to the given a description of a new form element and a description of its location in the form.\n"
        "You may assume that the new form element to be added is always either a 'form' or a 'field'.\n"
        "Your output must be valid and compiler-friendly FormLang code only.\n"
        "If you are unsure of the FormLang syntax, try to infer it from the form code which is given below as an input."
        "Your answer will be evaluated using an AST comparison of your code to the expected code.\n"
        "Assume that the input code is valid and requires no modification other than the NEW code you must generate.\n"
        "You must output plain FormLang code without any additional text or delimiters.\n"
        "You must not change any part of the original input code other than adding the required element.\n"
    )



#### Creating the full chat prompts

We load Llama's tokenizer using Hugging Face's `AutoTokenizer` class and leverage its [Chat templates](https://huggingface.co/docs/transformers/v4.48.2/en/chat_templating#introduction) support, which exposes a simple API for formatting our prompts into Llama's required Prompt Template.

We create a `fullPrompt` column to store the input prompt which contains the templated system and user prompts.

Since our model's generated text gets appended (according to the prompt template format) to the input prompt, we also create a `labels` column containing the final expected model output.

The special `<|eot_id|>` token signifies the End of Turn. In the Llama 3.1 to 3.3 family of models this token signals to the executor that the model has finished generating a response.
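
To make the layout concrete, the following minimal sketch builds a prompt and its label by hand. It is a simplification under stated assumptions: the turn markers follow the published Llama 3 prompt format, the function names `render_prompt` and `make_label` are hypothetical, the strings are placeholders, and the date header that `apply_chat_template` injects into the system turn is omitted.

```python
# Simplified, hand-written rendering of the Llama 3 chat layout.
# Assumption: the real template produced by apply_chat_template also adds a
# date header to the system turn, which is left out here.
BOT = "<|begin_of_text|>"
EOT = "<|eot_id|>"


def header(role):
    """Return the header that opens a turn for the given role."""
    return f"<|start_header_id|>{role}<|end_header_id|>\n\n"


def render_prompt(system_prompt, user_prompt, add_generation_prompt=True):
    """Concatenate the system and user turns into a single prompt string."""
    text = BOT
    text += header("system") + system_prompt + EOT
    text += header("user") + user_prompt + EOT
    if add_generation_prompt:
        # An open assistant header tells the model it is its turn to answer.
        text += header("assistant")
    return text


def make_label(full_prompt, expected_code):
    """The label is the prompt followed by the expected answer and End of Turn."""
    return f"{full_prompt}{expected_code}{EOT}"


# Placeholder prompts and code, for illustration only.
prompt = render_prompt("You generate FormLang code.", "Add a field named age.")
label = make_label(prompt, "form demo {\n\tcomp userDetailsContainer\n}")
print(label)
```

The label therefore shares its entire prefix with the prompt and differs only by the expected code and the closing `<|eot_id|>`.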

In [21]:
# Set the padding side to left for generation; later on, for training, we will use right
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct", padding_side="left")
tok.pad_token = "<|finetune_right_pad_id|>"

def get_chat_prompt(userPrompt):
    return [
        {"role": "system", "content": get_system_prompt()},
        {"role": "user", "content": userPrompt}]

def preprocess_func(example, add_generation_prompt=True):
    full_prompt = tok.apply_chat_template(get_chat_prompt(example["userPrompt"]), tokenize=False, add_generation_prompt=add_generation_prompt)
    expected_output = f"{full_prompt}{example['originalFormCode']}<|eot_id|>"
    return {"fullPrompt": full_prompt, "labels": expected_output}

# Currently unused
def tokenize_func(examples):
    return tok(examples["fullPrompt"], return_tensors="pt")

In [22]:
ds_3k_full_prompt = ds_3k_single_omission_all.map(preprocess_func, batched=False) # Our preprocess_func doesn't support batch operations
#ds_3k_full_prompt = ds_3k_full_prompt.map(tokenize_func, batched=False)
#ds_3k_full_prompt.set_format(type="torch", columns=["input_ids", "attention_mask"])
ds_3k_full_prompt

Dataset({
    features: ['originalFormCode', 'modifiedFormCode', 'removedNodeEnglish', 'removedNodeContextEnglish', 'originalFormAst', 'userPrompt', 'fullPrompt', 'labels'],
    num_rows: 3000
})

In [23]:
ds_3k_full_prompt[0]['fullPrompt'],ds_3k_full_prompt[0]['labels']

('<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 15 Feb 2025\n\nYou are a code generation AI assistant.\nYour job is to generate valid FormLang code according to the instructions given below:\nInspect the following FormLang Form definition, the start of the code will be denoted with ```FormLang and its end with ``` .\nAfter inspection, complete the Form\'s code according to the given a description of a new form element and a description of its location in the form.\nYou may assume that the new form element to be added is always either a \'form\' or a \'field\'.\nYour output must be valid and compiler-friendly FormLang code only.\nIf you are unsure of the FormLang syntax, try to infer it from the form code which is given below as an input.Your answer will be evaluated using an AST comparison of your code to the expected code.\nAssume that the input code is valid and requires no modification other than the NEW code you mu

#### Creating the Train Validation Test split

We allocate 80% of the data for training and 10% each for test and validation.

Since we are able to generate arbitrarily large amounts of data with different distributions, these percentages do not matter too much.

In [24]:
def create_train_validation_test_split(dataset, test_size):
    splits = dataset.train_test_split(test_size=test_size)
    test_splits = splits["test"].train_test_split(test_size=0.5) # Split so that test and validation sizes are equal
    return {"train": splits["train"], "test": test_splits["train"], "validation": test_splits["test"]}

ds_3k_splits = create_train_validation_test_split(ds_3k_full_prompt, 0.2)
ds_3k_train, ds_3k_test, ds_3k_val = ds_3k_splits["train"], ds_3k_splits["test"], ds_3k_splits["validation"]
ds_3k_train, ds_3k_test, ds_3k_val

(Dataset({
     features: ['originalFormCode', 'modifiedFormCode', 'removedNodeEnglish', 'removedNodeContextEnglish', 'originalFormAst', 'userPrompt', 'fullPrompt', 'labels'],
     num_rows: 2400
 }),
 Dataset({
     features: ['originalFormCode', 'modifiedFormCode', 'removedNodeEnglish', 'removedNodeContextEnglish', 'originalFormAst', 'userPrompt', 'fullPrompt', 'labels'],
     num_rows: 300
 }),
 Dataset({
     features: ['originalFormCode', 'modifiedFormCode', 'removedNodeEnglish', 'removedNodeContextEnglish', 'originalFormAst', 'userPrompt', 'fullPrompt', 'labels'],
     num_rows: 300
 }))
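
The split sizes follow directly from the two-stage procedure: 20% of 3,000 is 600 held-out examples, which are then halved into 300 test and 300 validation examples. The sketch below reproduces this arithmetic on plain Python indices. The function name `split_indices` is hypothetical, and the `seed` argument is an addition for reproducibility that is not part of the code above.

```python
import random


def split_indices(n, test_size, seed=0):
    """Shuffle range(n) and split it into train/test/validation index lists."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    n_holdout = round(n * test_size)  # first stage: hold out test_size of the data
    n_test = n_holdout // 2           # second stage: halve the held-out part
    holdout, train = indices[:n_holdout], indices[n_holdout:]
    return {
        "train": train,
        "test": holdout[:n_test],
        "validation": holdout[n_test:],
    }


splits = split_indices(3000, 0.2)
print({name: len(idx) for name, idx in splits.items()})
```

`Dataset.train_test_split` also accepts a `seed` argument, which would similarly keep the splits stable across notebook runs.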

## Utils

We create several utility functions for pre- and post-processing the prompt data.

### Extracting the model's response from the output text

Our model's raw output is a concatenation of the input and the generated text following Llama's Prompt Template semantics.

In order to extract the generated code we first need to identify the model's latest response by extracting all the text between the special `<|start_header_id|>assistant<|end_header_id|>` and `<|eot_id|>` tokens.

In [25]:
# prompt: Create a function that accepts the output text and extracts all the text starting at the LATEST OCCURRENCE of "<|start_header_id|>assistant<|end_header_id|>" and ends with "<|eot_id|>"

def exatract_latest_assistant_response(text):
    """
    Extracts the model's response from the output text.

    Args:
        text: The output text from the model.

    Returns:
        The extracted model response, or None if no response is found.
    """
    match = re.findall(r"<\|start_header_id\|>assistant<\|end_header_id\|>(.*?)<\|eot_id\|>", text, re.DOTALL)
    if match:
        return match[-1].strip()  # Return the latest occurrence
    else:
        return None
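
A quick check on a hand-written transcript illustrates the behaviour, including one edge case worth keeping in mind: if generation stops before the model emits `<|eot_id|>` (for example when the token limit is reached), the pattern finds no closed assistant turn and the result is `None`. The sketch re-declares the regular expression under the hypothetical name `latest_assistant_response` so that it runs on its own; the transcript is a made-up placeholder.

```python
import re

# Same pattern as above: a closed assistant turn, matched non-greedily.
ASSISTANT_TURN = re.compile(
    r"<\|start_header_id\|>assistant<\|end_header_id\|>(.*?)<\|eot_id\|>",
    re.DOTALL,
)


def latest_assistant_response(text):
    """Return the last closed assistant turn, or None if there is none."""
    matches = ASSISTANT_TURN.findall(text)
    return matches[-1].strip() if matches else None


# Placeholder transcript, for illustration only.
prompt_part = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nsystem text<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\nadd a field<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)
complete = prompt_part + "form demo {\n}<|eot_id|>"
truncated = prompt_part + "form demo {"  # generation cut off before End of Turn

print(repr(latest_assistant_response(complete)))
print(repr(latest_assistant_response(truncated)))
```

Callers therefore need to treat `None` as "no usable answer", which the evaluation code does by substituting an empty string.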


### Extracting FormLang code from LLM output

The LLM might ignore the instruction to output plain code and wrap its answer in a "```FormLang" delimiter anyway. In that case we try to extract the contents of the code block.

In [26]:
def extract_code_block(text, lang):
  """
  Extracts a code block from a string.

  Args:
    text: The string containing the code block.
    lang: The name of the language the code block is written in.

  Returns:
    The code block, or None if no code block is found.
  """
  if text is None:
    return None
  pattern = f"```{lang}\n(.*?)```"
  match = re.search(pattern, text, re.DOTALL)
  if match:
    return match.group(1).strip()
  return None

def extract_llm_formlang_output(text):
    # Strip leading and trailing whitespace and check for delimiters
    text = text.strip()
    assistant_response = exatract_latest_assistant_response(text)
    return extract_code_block(assistant_response, "FormLang") or assistant_response
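
The fallback logic can be exercised on a few hand-written responses. The sketch below is a standalone variant under the hypothetical name `code_block_or_raw`; it builds the three-backtick delimiter programmatically and uses placeholder FormLang-like text. It also shows one limitation of this approach: the language tag is matched case-sensitively, so a differently cased tag falls through to the raw response with its delimiters intact.

```python
import re

FENCE = "`" * 3  # three backticks, built programmatically for readability


def code_block_or_raw(response, lang="FormLang"):
    """Return the contents of a fenced block if present, else the raw response."""
    pattern = re.escape(FENCE + lang) + r"\n(.*?)" + re.escape(FENCE)
    match = re.search(pattern, response, re.DOTALL)
    return match.group(1).strip() if match else response.strip()


# Placeholder responses, for illustration only.
plain = "form demo {\n}"
fenced = f"{FENCE}FormLang\n{plain}\n{FENCE}"
lowercase = f"{FENCE}formlang\n{plain}\n{FENCE}"

print(code_block_or_raw(plain) == plain)      # plain code is returned unchanged
print(code_block_or_raw(fenced) == plain)     # delimiters are stripped
print(code_block_or_raw(lowercase) == plain)  # tag case differs, delimiters remain
```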

### Running inference on a single example

In [27]:
def clear_mem():
    """Clear unused GPU memory used by the python process"""
    import gc
    gc.collect()
    torch.cuda.empty_cache()

In [28]:
def run_inference(model, tok, example, cuda=True):
    """Runs pre-processing, inference and post-processing on a single example."""
    example_inputs = tok(example, return_tensors="pt")
    if cuda:
        example_inputs = example_inputs.to("cuda")
    with torch.no_grad():
      example_outputs = model.generate(**example_inputs, max_new_tokens=10000)
    generated_text = tok.decode(example_outputs[0], skip_special_tokens=False)
    del example_inputs
    del example_outputs
    generated_code = extract_llm_formlang_output(generated_text)
    ast = get_ast(generated_code)
    return {"generated_code": generated_code, "generated_text": generated_text, "ast": ast}

## Zero-shot evaluation of Llama 3.2 1B using Transformers and Evaluate

### Loading the metrics and creating evaluators

We load the evaluators for BLEU, ROUGE and Accuracy (using the "exact match" metric since our data is text).
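
As a rough illustration of what the "exact match" score measures, the sketch below computes the fraction of predictions that are character-for-character identical to their references. It is a plain-Python stand-in under the hypothetical name `exact_match_rate`, not the `evaluate` implementation, and the FormLang-like strings are placeholders.

```python
def exact_match_rate(predictions, references):
    """Fraction of predictions that equal their reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(predictions)


# Placeholder data: the first prediction differs from its reference only in
# indentation (two spaces instead of a tab).
references = ["form a {\n\tcomp x\n}", "form b {\n}"]
predictions = ["form a {\n  comp x\n}", "form b {\n}"]
print(exact_match_rate(predictions, references))
```

Because a difference in whitespace alone counts as a miss, this score tends to underestimate how often the generated code is semantically correct.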

In [29]:
def get_all_metrics():
    bleu_metric = evaluate.load("bleu", module_type="metric")
    rouge_metric = evaluate.load("rouge", module_type="metric")
    exact_match_metric = evaluate.load("exact_match", module_type="metric")
    all_metrics = evaluate.combine([bleu_metric, rouge_metric, exact_match_metric])
    return all_metrics

In [None]:
all_metrics = get_all_metrics()

To test the reliability of our metrics (specifically the BLEU and ROUGE scores), we check their values on a single example from the dataset in three ways:

First, between our prompt and the expected raw output of the model.

Second, between the modified form code and the original form code.

Third, between the ASTs of the modified form and the original form.

In [None]:
all_metrics.compute(references=[ds_3k_train[0]["labels"]], predictions=[ds_3k_train[0]["fullPrompt"]])

{'bleu': 0.8394570207692074,
 'precisions': [1.0, 1.0, 1.0, 1.0],
 'brevity_penalty': 0.8394570207692074,
 'length_ratio': 0.851063829787234,
 'translation_length': 440,
 'reference_length': 517,
 'rouge1': np.float64(0.9401459854014599),
 'rouge2': np.float64(0.9399707174231332),
 'rougeL': np.float64(0.9401459854014599),
 'rougeLsum': np.float64(0.9401459854014599),
 'exact_match': np.float64(0.0)}

In [None]:
all_metrics.compute(references=[ds_3k_train[0]["originalFormCode"]], predictions=[ds_3k_train[0]["modifiedFormCode"]])

{'bleu': 0.35590702586152256,
 'precisions': [1.0,
  0.9811320754716981,
  0.9807692307692307,
  0.9803921568627451],
 'brevity_penalty': 0.3611295508093596,
 'length_ratio': 0.4954128440366973,
 'translation_length': 54,
 'reference_length': 109,
 'rouge1': 0.6666666666666666,
 'rouge2': 0.6588235294117647,
 'rougeL': 0.6666666666666666,
 'rougeLsum': 0.6666666666666666,
 'exact_match': 0.0}

In [None]:
all_metrics.compute(references=[json.dumps(ds_3k_train[0]["originalFormAst"])],
                    predictions=[json.dumps(get_ast(ds_3k_train[0]["modifiedFormCode"]))])

{'bleu': 0.6076249287142229,
 'precisions': [0.9950248756218906,
  0.98,
  0.964824120603015,
  0.9595959595959596],
 'brevity_penalty': 0.6233564232002683,
 'length_ratio': 0.6790540540540541,
 'translation_length': 201,
 'reference_length': 296,
 'rouge1': np.float64(0.7945205479452055),
 'rouge2': np.float64(0.7887323943661971),
 'rougeL': np.float64(0.7945205479452055),
 'rougeLsum': np.float64(0.7945205479452055),
 'exact_match': np.float64(0.0)}

As we can see, apart from the `exact_match` score, our input is already very "similar" to our expected output: every comparison scores above 50% on both BLEU and ROUGE, with the sole exception of the BLEU score of the source code comparison (about 36%).

On one hand, this could mean that our trained model would have to exhibit scores very close to 100% in order to demonstrate a significant improvement. On the other hand, scores far below the ones above are probably very bad even if they are above 50%.

It also gives us a good intuition that comparing the source code rather than the ASTs is the better option, since the source code comparison has the lowest baseline similarity and therefore leaves the most room to show an improvement.

We might also need to adjust our data generation algorithm to remove more nodes from the form (instead of a single omission algorithm for our example dataset).
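
The baseline BLEU scores above are dominated by BLEU's brevity penalty: the n-gram precisions are close to 1 because the input is almost entirely contained in the expected output, so the score is mostly determined by how much shorter the input is. The sketch below recomputes the scores from the precisions and lengths reported in the outputs above, using the standard BLEU definition (the geometric mean of the n-gram precisions multiplied by `exp(1 - r/c)` when the candidate length `c` is shorter than the reference length `r`). The helper names are hypothetical.

```python
import math


def brevity_penalty(translation_length, reference_length):
    """Standard BLEU brevity penalty: 1 unless the candidate is shorter."""
    if translation_length >= reference_length:
        return 1.0
    return math.exp(1.0 - reference_length / translation_length)


def bleu_from_parts(precisions, translation_length, reference_length):
    """Combine n-gram precisions and lengths into a BLEU score."""
    if min(precisions) <= 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))
    return brevity_penalty(translation_length, reference_length) * geo_mean


# Values copied from the three metric outputs above.
raw_prompt = bleu_from_parts([1.0, 1.0, 1.0, 1.0], 440, 517)
source_code = bleu_from_parts(
    [1.0, 0.9811320754716981, 0.9807692307692307, 0.9803921568627451], 54, 109
)
ast = bleu_from_parts(
    [0.9950248756218906, 0.98, 0.964824120603015, 0.9595959595959596], 201, 296
)
print(raw_prompt, source_code, ast)
```

Removing more than one node per example would shorten the input relative to the expected output and thus lower this baseline.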

In order to get a better sense of the baseline similarity in our dataset, we plot a histogram of the BLEU and ROUGE scores applied to the AST and the prompts and compute the medians.

In [None]:
## TODO - Draw histogram plots and compute medians
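
One possible shape for this summary, sketched with the standard library only. It assumes the per-example scores have already been collected into a list of floats in the range 0 to 1. The function name `text_histogram` is hypothetical, and the `placeholder_scores` list is made-up data for illustration, not a result.

```python
import statistics


def text_histogram(scores, bins=5):
    """Count scores in [0, 1] into equal-width bins as (low, high, count) tuples."""
    counts = [0] * bins
    for score in scores:
        index = min(int(score * bins), bins - 1)  # a score of 1.0 goes in the last bin
        counts[index] += 1
    return [(i / bins, (i + 1) / bins, count) for i, count in enumerate(counts)]


# Placeholder values only; real per-example BLEU or ROUGE scores would go here.
placeholder_scores = [0.31, 0.36, 0.42, 0.58, 0.61, 0.64, 0.79, 0.84]

for low, high, count in text_histogram(placeholder_scores):
    print(f"[{low:.1f}, {high:.1f}) {'#' * count}")
print("median:", statistics.median(placeholder_scores))
```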

In [30]:
def evaluate_model(model, tokenizer, raw_text_evaluations, source_code_evaluations, ast_evaluations, dataset, batch_size=16):
  model.eval()
  all_predictions = []

  def evaluate_batch(batch):
    inputs = None
    outputs = None
    input_ids = None
    attention_mask = None
    try:
      inputs = tokenizer(batch["fullPrompt"], return_tensors="pt", padding=True, truncation=False).to("cuda") # TODO pad_to_multiple_of=1024
      input_ids = inputs["input_ids"]
      attention_mask = inputs["attention_mask"]
      with torch.no_grad():
          t0 = time.time()
          print("generating")
          outputs = model.generate(input_ids=input_ids, attention_mask=attention_mask)
          print(f"batch generated. took {time.time() - t0} seconds")
      # For the raw text evaluation we do skip special tokens because we want to be agnostic of any padding
      predictions = tokenizer.batch_decode(outputs, skip_special_tokens=True)
      all_predictions.extend(predictions)
      print("decoded")
      # Create the full expected text references by round-tripping the labels through the tokenizer, which strips the special tokens
      text_references = tokenizer.batch_decode(tokenizer(batch["labels"], return_tensors="pt", padding=True, truncation=False)["input_ids"], skip_special_tokens=True)
      raw_text_evaluations.add_batch(predictions=predictions, references=text_references)
      # Evaluate the source code output
      raw_predictions = tokenizer.batch_decode(outputs, skip_special_tokens=False) # We dont skip special tokens because we need them to know how to extract the LLMs latest response
      source_code_predictions = [extract_llm_formlang_output(output) or "" for output in raw_predictions]
      source_code_references = [original_code for original_code in batch["originalFormCode"]]
      source_code_evaluations.add_batch(predictions=source_code_predictions, references=source_code_references)
      # Evaluate the AST output - TODO Deal with errors
      #ast_predictions = [json.dumps(get_ast(code)) for code in form_lang_predictions]
      #ast_references = [ast for ast in batch["originalFormAst"]]
      #ast_evaluations.add_batch(predictions=ast_predictions, references=ast_references)
      print("added to eval")
    except Exception:  # a bare except would also swallow KeyboardInterrupt
      print("got exception")
      import traceback
      traceback.print_exc()
    finally:
      # Free GPU memory
      del inputs
      del outputs
      del input_ids
      del attention_mask

  for batch in tqdm(dataset.iter(batch_size)):
    evaluate_batch(batch)

  return raw_text_evaluations, source_code_evaluations, ast_evaluations, all_predictions


### Zero-shot evaluation with the original Meta version

We load the original Llama 3.2 1B model available from Meta and run an evaluation over our validation dataset.

In [31]:
llama3_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    attn_implementation="flash_attention_2"
)

##### Override default generation config

In [43]:
llama3_model.generation_config.pad_token_id = tok.pad_token_id
llama3_model.generation_config.temperature = 0.9
llama3_model.generation_config.max_new_tokens = 1024
print(llama3_model.generation_config)

GenerationConfig {
  "bos_token_id": 128000,
  "do_sample": true,
  "eos_token_id": [
    128001,
    128008,
    128009
  ],
  "max_new_tokens": 1024,
  "pad_token_id": 128004,
  "temperature": 0.9,
  "top_p": 0.9
}
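
With `do_sample` enabled, each token is drawn from a reshaped distribution rather than picked greedily, so repeated evaluations can produce different outputs and scores. The sketch below illustrates the two settings shown in the configuration, temperature scaling and nucleus (top-p) filtering, on a toy distribution. It is a simplified illustration with hypothetical function names, not the `transformers` implementation.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; a temperature below 1 sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


toy_logits = [2.0, 1.0, 0.0]  # toy values, not model outputs
print(softmax_with_temperature(toy_logits, 1.0))
print(softmax_with_temperature(toy_logits, 0.9))
print(top_p_filter([0.5, 0.3, 0.15, 0.05], 0.9))
```

For measurements that should be repeatable, greedy decoding (`do_sample=False`) or a fixed random seed are the usual options.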



##### [EXPERIMENT] Trying with static cache and torch compile

Only supported when `attn_implementation="sdpa"` - **slower than FA2**

In [73]:
#llama3_model.generation_config.cache_implementation = "static"

#llama3_model.forward = torch.compile(llama3_model.forward, mode="reduce-overhead", fullgraph=True)

##### [EXPERIMENT] Trying with Quantized Cache

**Increases latency relative to the default DynamicCache**

In [32]:
# Configure Quantized Cache
quantized_cache_config = QuantizedCacheConfig(
    nbits=4,  # Use 4-bit quantization
    q_group_size=64,  # Adjust if necessary
    residual_length=128,  # Adjust if necessary
)

#llama3_model.generation_config.cache_implementation = "quantized"
#llama3_model.generation_config.quantized_cache_config = quantized_cache_config

##### Trying to evaluate 50 randomly sampled examples from our validation dataset

In [33]:
first_examples = ds_3k_val.train_test_split(test_size=(50/300))["test"]

In [44]:
raw_text_evaluations, source_code_evaluations, ast_evaluations, all_predictions = evaluate_model(llama3_model, tok, get_all_metrics(), get_all_metrics(), get_all_metrics(), first_examples, batch_size=50)

generating
batch generated. took 146.80016207695007 seconds
decoded
added to eval


In [45]:
tok(all_predictions, return_tensors="pt", padding=True, truncation=False).input_ids.shape

torch.Size([50, 2247])

In [46]:
print(all_predictions[42])

<|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|><|finetune_right_pad_id|>

In [47]:
print(extract_llm_formlang_output(all_predictions[42]))

component userDetailsContainer {}
  component formContainer {}
  component someOtherContainer {}
  component OtherContainer2 {}

  component myTextBox {
    props {
      textColor
      textSize
      textWeight
      borderColor
    }
  }
  component myCheckbox {
    props {
      size
    }
  }
  component otherTextBox {}
  component counter {
    props {
      style
    }
  }
  component nTextBox {
    id
    type
    size
    textWeight
    borderColor
    props {
      value
    }
  }
  component formContainer {
    component nTextBox
  }

form cw {
  component userDetailsContainer
  component formContainer
}


In [48]:
print(extract_llm_formlang_output(first_examples[42]["labels"]))

component userDetailsContainer {}
  component formContainer {}
  component someOtherContainer {}
  component OtherContainer2 {}


  component myTextBox {
    props {
      textColor
      textSize
      textWeight
      borderColor
    }
  }
  component myCheckbox {
    props {
      size
    }
  }
  component otherTextBox {}
  component counter {
    props {
      style
    }
  }

form cw {
	comp userDetailsContainer 
	
	field n {
		state string
		comp myCheckbox size="(() => 15.712560583743162)()" as expression
		
	}
}


In [49]:
raw_text_evaluations.compute()

{'bleu': 0.04696241451534662,
 'precisions': [0.04815040417931251,
  0.047203941243130224,
  0.046548655615767084,
  0.04597442746466882],
 'brevity_penalty': 1.0,
 'length_ratio': 19.909283482497486,
 'translation_length': 751894,
 'reference_length': 37766,
 'rouge1': 0.17390459135466285,
 'rouge2': 0.16931847293661578,
 'rougeL': 0.17039640001023956,
 'rougeLsum': 0.17179978964118225,
 'exact_match': 0.0}

In [50]:
source_code_evaluations.compute()

{'bleu': 0.7617421558838426,
 'precisions': [0.8889019653995528,
  0.842092803030303,
  0.8055720919157042,
  0.7759281437125749],
 'brevity_penalty': 0.921028472363286,
 'length_ratio': 0.9239886907351023,
 'translation_length': 8497,
 'reference_length': 9196,
 'rouge1': 0.8250933090975863,
 'rouge2': 0.7676533409855832,
 'rougeL': 0.7654374401869593,
 'rougeLsum': 0.8246213946963713,
 'exact_match': 0.0}

In [None]:
# TODO
#ast_evaluations.compute()

In [None]:
#del llama3_model
#clear_mem()

### Zero-shot evaluation with Unsloth's 4-bit Quantized version of the model

In [None]:
unsloth4_llama3 = AutoModelForCausalLM.from_pretrained("unsloth/Llama-3.2-1B-bnb-4bit", device_map="cuda")

In [None]:
%%time
uns_raw_text_evaluations, uns_source_code_evaluations, uns_ast_evaluations, uns_all_predictions = evaluate_model(unsloth4_llama3, tok, get_all_metrics(), get_all_metrics(), get_all_metrics(), first_examples, batch_size=5)

generating
batch generated. took 47.3324236869812 seconds
decoded
added to eval
generating
batch generated. took 47.25919985771179 seconds
decoded
added to eval
generating


KeyboardInterrupt: 

In [None]:
del unsloth4_llama3
clear_mem()