# Domain Specific Code Generation using the FormLang DSL

## Abstract

AI and LLM systems are being trained to perform a variety of complex tasks requiring expertise in many software tools, languages and technology stacks. Challenges in this process include acquiring large amounts of quality data, handling growing model parameter counts, which increase compute demands and costs, and emerging data privacy and intellectual property concerns related to using third-party cloud services for model training.

In this work we attempt to harness and combine the benefits of Abstraction and Determinism provided by Formal Domain Specific Languages (DSLs) with the innate ability of LLMs to learn new languages and their semantics. We propose a novel generation task called "Domain Specific Code Generation", which involves mapping user requests written in natural language to DSL code.

By utilizing a specially crafted DSL called `FormLang` as a case study, we attempt to lay the groundwork for methods of automated DSL dataset generation, training techniques and performance evaluation, with the end goal of creating an AI system capable of generating Web forms according to a user request.

Our `FormLang` DSL allows expressing the semantics of Web-forms using a simplified syntax that does not require much, if any, Web-programming knowledge and expertise.

Given a user prompt in English describing the desired form and its fields, the LLM produces syntactically valid `FormLang` output. This output is run through the accompanying FormLang parser and a hand-crafted React JSX compiler to produce a final implementation of the form in JavaScript and React.

The project includes a live demo which demonstrates the capabilities of the system.


## Referring to this work

If you use this work, the following citation is preferred:

```bibtex
@misc{guyor2025dscodegenformlang,
      title={Domain Specific Code Generation using the FormLang DSL},
      author={Guy Or},
      year={2025}
}
```

The official repository of this work is hosted on GitHub at https://github.com/guyo13/Form-Lang **TBD** - Make the repo public

## Project Goals

* Define the Task of Domain Specific Code Generation.
* Create an AI training pipeline for FormLang (implemented as a Jupyter notebook) which includes:
    * Automatic FormLang Dataset generation using search algorithms and heuristics.
    * Baseline model selection and loading from Hugging Face.
    * Dataset Preprocessing and loading.
    * Defining performance KPIs for the system.
    * Model fine-tuning using the Transformers library.
    * Model Adapter training using the PEFT library.
    * Model upload to Hugging Face Hub and example usage from the Hub.
    *  **(TBD)** Export to ONNX using Optimum and run on-device using Transformers.js.
* Create the “FormLang” language:
    * Describe the problem domain.
    * Define a viable minimal syntax and semantics which are research focused rather than completeness focused.
    *  **(TBD)** Implement a “JavaScript React” compile target.
*  **(TBD)** Create a live demo website:
    * **(TBD)** Users input a prompt.
    * **(TBD)** A FormLang editor is populated with the AI’s code generation results.
    * **(TBD)** The form is rendered alongside the generated code.
*  Discuss the project results:
    * Performance and user acceptability.
    * **(TBD)** Viability of the implemented methods for the Domain Specific Code Generation task and generalization to other domains.  
    * **(TBD)** Potential enhancements to the system.
    * **(TBD)** Possible research directions on how to learn from user data.


**(TBD)** - Features marked as TBD depend on the project's progress and timeline constraints, as well as on proving the viability of the methods and system.

# Notebook Setup

## About the FormLang DSL and its tooling and codebase

The FormLang DSL is implemented using the [Langium](https://langium.org/) project. Langium is a toolkit written in TypeScript that allows language engineers to create DSLs and quickly iterate over their development lifecycle.

Some of the many features offered by Langium are:

* Parser generation - Langium leverages [Chevrotain](https://chevrotain.io/docs/) to create a fast parser that can parse the DSL code to AST.
* TypeScript AST nodes generation - Langium automatically generates TS types that represent the AST of the language.
* Simplified Validation rules - Langium has built-in support for writing and executing validation rules, with built-in error reporting.
* Language Server Protocol support - Langium implements the LSP, which enables code completion and syntax highlighting in many IDEs such as VSCode.

Since we are using Langium to parse the FormLang DSL, some of this project's tooling and algorithms are written in TypeScript.

To call our TypeScript code from Python we use two distinct methods:

* [PythonMonkey](https://pythonmonkey.io/) - A Python package that implements FFI between Python and JavaScript.
* Local HTTP API - For cases in which the PythonMonkey FFI is too slow, we've implemented a local REST API server that exposes some of the functions we require (e.g., for parsing FormLang code to an AST).

### Codebase

The entire codebase is hosted on GitHub at [FormLang](https://github.com/guyo13/Form-Lang).

### Running from Colab

#### Installing PNPM and Bun

In [1]:
!npm install -g pnpm
!npm install -g bun@latest
!npm install -g corepack@latest

added 1 package in 4s

1 package is looking for funding
  run `npm fund` for details

In [2]:
!bun -v

1.2.2


#### Installing PythonMonkey

In [3]:
!pip install --quiet pythonmonkey

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for pminit (pyproject.toml) ... done


#### Installing Hugging Face libraries and Unsloth

In [4]:
!pip install --quiet datasets evaluate bitsandbytes unsloth

#### Clone the repo

In [5]:
!rm -rf Form-Lang

In [6]:
!git clone -b react_compiler https://github.com/guyo13/Form-Lang.git

Cloning into 'Form-Lang'...
remote: Enumerating objects: 627, done.
remote: Counting objects: 100% (47/47), done.
remote: Compressing objects: 100% (40/40), done.
remote: Total 627 (delta 34), reused 7 (delta 7), pack-reused 580 (from 1)
Receiving objects: 100% (627/627), 532.63 KiB | 2.55 MiB/s, done.
Resolving deltas: 100% (397/397), done.


#### Install project dependencies and build

In [7]:
!cd Form-Lang && corepack install && corepack pnpm install && corepack pnpm run gb

Adding pnpm@10.0.0+sha512.b8fef5494bd3fe4cbd4edabd0745df2ee5be3e4b0b8b08fa643aa3e4c6702ccc0f00d68fa8a8c9858a735a0032485a44990ed2810526c875e416f001b17df12b to the cache...
Lockfile is up to date, resolution step is skipped
Packages: +502
Progress: resolved 502, reused 0, downloaded 0, added 0

Update available! 10.0.0 → 10.2.1.
Changelog: https://github.com/pnpm/pnpm/releases/tag/v10.2.1

In [8]:
!cd Form-Lang/ml && pnpm install

Lockfile is up to date, resolution step is skipped
Packages: +69
Progress: resolved 69, reused 23, downloaded 46, added 69, done

dependencies:
+ express 4.21.2

Done in 1.5s


#### Run the FormLang HTTP API server

In [9]:
from subprocess import Popen, PIPE
import sys

def run_js_with_bun(js_file, cwd=None):
  """
  Runs a JavaScript file using Bun as a subprocess.

  Args:
    js_file: The path to the JavaScript file.
    cwd: The working directory for the child process.
         If None, the current working directory is used.
  """
  try:
    # Execute the JavaScript file using Bun with specified working directory
    process = Popen(['bun', js_file], stdout=PIPE, cwd=cwd)
    return process

  except FileNotFoundError:
    print("Error: Bun not found. Please ensure Bun is installed and in your PATH.")
  except Exception as e:
    import traceback
    traceback.print_exc()
    print("An unexpected error occurred:", e)

http_server = run_js_with_bun("lib_server.cjs", "Form-Lang/ml")
http_server

<Popen: returncode: None args: ['bun', 'lib_server.cjs']>
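
Because `Popen` returns immediately, the server may not yet be listening when the next cell sends its first request. A minimal sketch of a readiness check follows, assuming the server listens on `localhost:3000` as in the `curl` call in the next cell. `wait_for_port` is a hypothetical helper and not part of the FormLang codebase.

```python
import socket
import time

def wait_for_port(host, port, timeout=10.0, interval=0.2):
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is accepting connections on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait a little and retry.
            time.sleep(interval)
    return False
```

In this notebook it could be called as `wait_for_port("localhost", 3000)` right after starting the server, before any request is sent.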

In [29]:
!curl -d '{"sourceCode": "component hey{} form helloWorld {comp hey}"}' -H "content-type: application/json" "http://localhost:3000/compute/ast"

{"status":"ok","result":{"ast":"{\"$type\":\"Form\",\"name\":\"helloWorld\",\"component\":{\"$type\":\"FieldComponentDef\",\"componentId\":{\"$ref\":\"#/components@0\"},\"componentPropsKeys\":[],\"componentPropsValues\":[]},\"children\":[]}"}}

## Project imports

In [28]:
import asyncio
import time
import re
import functools
from pprint import pprint
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
from datasets import Dataset, load_dataset
from huggingface_hub import login
import pythonmonkey as pm
import pandas as pd
import requests
import torch
# If using locally from the `ml` folder
# formlang_lib = pm.require("../out/cjs/lib/index")
# If using in Colab:
formlang_lib = pm.require("./Form-Lang/out/cjs/lib/index")

## Log in to Hugging Face Hub

In [20]:
login()

## Helper functions

### Python Monkey helpers

These can be used to print JavaScript objects that live inside the PythonMonkey engine.

In [12]:
def js_dir(something):
    pm.globalThis.console.dir(something)

def js_log(something):
    pm.globalThis.console.log(something)

### Computing the AST

In order to check the model's output we will need to parse the output FormLang code and check it for errors. We will also use the AST to compute performance metrics. Our `get_ast` function will be used to send FormLang source code to our local HTTP API for parsing.  

In [13]:
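def get_ast(code, shouldCheckErrors=False):
    """Send FormLang source code to the local HTTP API and return the parse result."""
    response = requests.post(
        "http://localhost:3000/compute/ast",
        json={"sourceCode": code, "shouldCheckErrors": shouldCheckErrors},
    )
    result = response.json()
    if result["status"] != "ok":
        raise RuntimeError(f"FormLang API request failed: {result}")
    return result["result"]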
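
The `ast` field in the response is itself a JSON-encoded string, as the `curl` output above shows, so it has to be decoded once more before it can be inspected. Below is a minimal sketch of walking the decoded tree and counting nodes by `$type`. `count_node_types` is a hypothetical helper, and the sample string is the AST returned for the `helloWorld` form above.

```python
import json
from collections import Counter

def count_node_types(node, counts=None):
    """Recursively count AST nodes by their `$type` property."""
    counts = Counter() if counts is None else counts
    if isinstance(node, dict):
        if "$type" in node:
            counts[node["$type"]] += 1
        for value in node.values():
            count_node_types(value, counts)
    elif isinstance(node, list):
        for item in node:
            count_node_types(item, counts)
    return counts

# AST string copied from the curl response above.
sample_ast = '{"$type":"Form","name":"helloWorld","component":{"$type":"FieldComponentDef","componentId":{"$ref":"#/components@0"},"componentPropsKeys":[],"componentPropsValues":[]},"children":[]}'
node_counts = count_node_types(json.loads(sample_ast))
print(dict(node_counts))  # {'Form': 1, 'FieldComponentDef': 1}
```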

# Dataset generation

## Generating the dataset

We start by creating a `ProbabilisticSearchFormGenerator` object, which is implemented in JavaScript, passing it the default search hyperparameters and some component definitions.
Later on we will implement a random generator for the component definitions as well.

In [42]:
form_components = """
  component userDetailsContainer {}
  component formContainer {}
  component someOtherContainer {}
  component OtherContainer2 {}
"""
field_components = """
  component myTextBox {
    props {
      textColor
      textSize
      textWeight
      borderColor
    }
  }
  component myCheckbox {
    props {
      size
    }
  }
  component otherTextBox {}
  component counter {
    props {
      style
    }
  }
"""

form_gen = await formlang_lib.newFormGen(formlang_lib.DEFAULT_GENERATOR_HYPER_PARAMETERS, form_components, field_components)

### Create a data generation loop for the prompt data

We create a small dataset of 3000 random examples by using the JavaScript FormLang utilities library and the form generator object.
We copy from the form generator's output only the relevant fields which will make up the LLM prompt.

In [14]:
# TODO - Refactor data generator pipeline as a Python class with easy interfaces

def concat_components_code_with_form_code(form_code, form_components, field_components):
    """Concatenates the form code with the given components code that are referenced inside of it."""
    return f"{form_components}\n{field_components}\n{form_code}"

def create_prompt_data(generation_result, form_components, field_components):
    data = {
        "originalFormCode": concat_components_code_with_form_code(generation_result["serializedForm"], form_components, field_components),
        "modifiedFormCode": concat_components_code_with_form_code(generation_result["serializedModifiedForm"], form_components, field_components),
        "removedNodeEnglish": str(generation_result["removedNodeEnglish"]),
        "removedNodeContextEnglish": str(generation_result["removedNodeContextEnglish"]),
    }
    return data

def create_random_data(form_gen, form_components, field_components, num_examples=3000):
    start_time = time.time()
    DATA = []
    for _ in range(num_examples):
        generation_result = formlang_lib.generateRandomFormWithModification(form_gen)
        prompt_data = create_prompt_data(generation_result, form_components, field_components)
        DATA.append(prompt_data)
    # Compute the AST of each example (one HTTP request per example)
    ast_time = time.time()
    for example in DATA:
        resp = get_ast(example["originalFormCode"], True)
        errors = resp.get('errors', [])
        if len(errors) > 0:
            print("Error parsing generated code", errors)
            raise RuntimeError("Error parsing generated code")
        example["originalFormAst"] = resp["ast"]
    end_time = time.time()
    print(f"Generation took {end_time - start_time} seconds. Total Ast generation time {end_time - ast_time} seconds")
    return pd.DataFrame(DATA)

In [45]:
examples = create_random_data(form_gen, form_components, field_components, 3000)
examples.head()

Generation took 48.5854971408844 seconds. Total Ast generation time 47.7074978351593 seconds


Unnamed: 0,originalFormCode,modifiedFormCode,removedNodeEnglish,removedNodeContextEnglish,originalFormAst
0,\n component userDetailsContainer {}\n compo...,\n component userDetailsContainer {}\n compo...,\ta field whose id is 'Z_' with state of type:...,* is a child of the form whose id is 'Bw'.\n* ...,"{""$type"":""Form"",""name"":""Bw"",""component"":{""$typ..."
1,\n component userDetailsContainer {}\n compo...,\n component userDetailsContainer {}\n compo...,\ta field whose id is 'p_' with state of type:...,* is a child of the form whose id is 'Cw'.\n* ...,"{""$type"":""Form"",""name"":""Cw"",""component"":{""$typ..."
2,\n component userDetailsContainer {}\n compo...,\n component userDetailsContainer {}\n compo...,\ta field whose id is 'F' using the component ...,* is a child of the form whose id is 'Vw'.\n* ...,"{""$type"":""Form"",""name"":""Vw"",""component"":{""$typ..."
3,\n component userDetailsContainer {}\n compo...,\n component userDetailsContainer {}\n compo...,\t\ta field whose id is 'Pww_ww___w' using the...,* is a child of the form whose id is 'd'.\n* i...,"{""$type"":""Form"",""name"":""Lw"",""component"":{""$typ..."
4,\n component userDetailsContainer {}\n compo...,\n component userDetailsContainer {}\n compo...,\ta field whose id is 'nw' with state of type:...,* is a child of the form whose id is 'a_'.\n* ...,"{""$type"":""Form"",""name"":""a_"",""component"":{""$typ..."


#### Example generated FormLang code and its AST

In [46]:
examples["originalFormCode"][1132]

'\n  component userDetailsContainer {}\n  component formContainer {}\n  component someOtherContainer {}\n  component OtherContainer2 {}\n\n\n  component myTextBox {\n    props {\n      textColor\n      textSize\n      textWeight\n      borderColor\n    }\n  }\n  component myCheckbox {\n    props {\n      size\n    }\n  }\n  component otherTextBox {}\n  component counter {\n    props {\n      style\n    }\n  }\n\nform T_ {\n\tcomp userDetailsContainer \n\t\n\tform H {\n\t\tcomp userDetailsContainer \n\t\t\n\t\tfield Ew {\n\t\t\tcomp otherTextBox \n\t\t\t\n\t\t}\n\t\tfield Rw_w {\n\t\t\tcomp otherTextBox \n\t\t\t\n\t\t}\n\t\tfield F_ {\n\t\t\tstate number\n\t\t\tcomp myTextBox borderColor="#301410" textSize="(() => 97.65823852274963)()" as expression\n\t\t\t\n\t\t}\n\t}\n\n\tfield p___ {\n\t\tcomp otherTextBox \n\t\t\n\t}\n\tfield vw {\n\t\tstate string default "Frankie Kessler"\n\t\tcomp counter style="#5b426f"\n\t\t\n\t}\n\tfield f {\n\t\tcomp counter \n\t\t\n\t}\n\tfield q {\n\t\tstat

In [47]:
examples["originalFormAst"][1132]

'{"$type":"Form","name":"T_","component":{"$type":"FieldComponentDef","componentId":{"$ref":"#/components@0"},"componentPropsKeys":[],"componentPropsValues":[]},"children":[{"$type":"Form","name":"H","component":{"$type":"FieldComponentDef","componentId":{"$ref":"#/components@0"},"componentPropsKeys":[],"componentPropsValues":[]},"children":[{"$type":"Field","name":"Ew","component":{"$type":"FieldComponentDef","componentId":{"$ref":"#/components@6"},"componentPropsKeys":[],"componentPropsValues":[]}},{"$type":"Field","name":"Rw_w","component":{"$type":"FieldComponentDef","componentId":{"$ref":"#/components@6"},"componentPropsKeys":[],"componentPropsValues":[]}},{"$type":"Field","name":"F_","state":{"$type":"FieldStateDef","type":"number","isArray":false},"component":{"$type":"FieldComponentDef","componentId":{"$ref":"#/components@4"},"componentPropsKeys":[{"$type":"ComponentPropKey","key":"borderColor"},{"$type":"ComponentPropKey","key":"textSize"}],"componentPropsValues":[{"$type":"Va

### Creating the user prompts

For each example in our synthetic dataset, we create the user prompt which will be used to train and test the LLM.

The user prompt consists of the modified FormLang code taken from each of our dataset examples, along with the English description of the form element to add and its location (context) in the form.

In [18]:
def create_prompt(row):
    return (
            "```FormLang\n"
            f"{row['modifiedFormCode']}\n"
            "```\n"
            "The description of the form element you need to add:\n"
            f"{row['removedNodeEnglish']}\n"
            "The description of the context in the form where you should add the element:\n"
            f"The element to be added:\n{row['removedNodeContextEnglish']}\n"
           )

In [49]:
examples["userPrompt"] = examples.apply(create_prompt, axis=1)
examples.head()

Unnamed: 0,originalFormCode,modifiedFormCode,removedNodeEnglish,removedNodeContextEnglish,originalFormAst,userPrompt
0,\n component userDetailsContainer {}\n compo...,\n component userDetailsContainer {}\n compo...,\ta field whose id is 'Z_' with state of type:...,* is a child of the form whose id is 'Bw'.\n* ...,"{""$type"":""Form"",""name"":""Bw"",""component"":{""$typ...",```FormLang\n\n component userDetailsContaine...
1,\n component userDetailsContainer {}\n compo...,\n component userDetailsContainer {}\n compo...,\ta field whose id is 'p_' with state of type:...,* is a child of the form whose id is 'Cw'.\n* ...,"{""$type"":""Form"",""name"":""Cw"",""component"":{""$typ...",```FormLang\n\n component userDetailsContaine...
2,\n component userDetailsContainer {}\n compo...,\n component userDetailsContainer {}\n compo...,\ta field whose id is 'F' using the component ...,* is a child of the form whose id is 'Vw'.\n* ...,"{""$type"":""Form"",""name"":""Vw"",""component"":{""$typ...",```FormLang\n\n component userDetailsContaine...
3,\n component userDetailsContainer {}\n compo...,\n component userDetailsContainer {}\n compo...,\t\ta field whose id is 'Pww_ww___w' using the...,* is a child of the form whose id is 'd'.\n* i...,"{""$type"":""Form"",""name"":""Lw"",""component"":{""$typ...",```FormLang\n\n component userDetailsContaine...
4,\n component userDetailsContainer {}\n compo...,\n component userDetailsContainer {}\n compo...,\ta field whose id is 'nw' with state of type:...,* is a child of the form whose id is 'a_'.\n* ...,"{""$type"":""Form"",""name"":""a_"",""component"":{""$typ...",```FormLang\n\n component userDetailsContaine...


##### Example User Prompt vs. original Code

In [50]:
print(examples['userPrompt'][0])

```FormLang

  component userDetailsContainer {}
  component formContainer {}
  component someOtherContainer {}
  component OtherContainer2 {}


  component myTextBox {
    props {
      textColor
      textSize
      textWeight
      borderColor
    }
  }
  component myCheckbox {
    props {
      size
    }
  }
  component otherTextBox {}
  component counter {
    props {
      style
    }
  }

form Bw {
	comp OtherContainer2 
	
	field G {
		comp myCheckbox size="'Hello'.toLowerCase()" as expression
		
	}
	form a {
		comp formContainer 
		
		field qw {
			state boolean[]
			comp otherTextBox 
			
		}
		field F__ {
			comp myTextBox 
			
		}
		field Y__ {
			state string default "'John Streich'" as expression
			comp myTextBox 
			
		}
		field S_w {
			state string
			comp myCheckbox size="#121310"
			
		}
		field N_ {
			state number default "0.17049162264892637" as expression
			comp counter 
			
		}
	}

	field pw_wwww_w__ {
		state boolean
		comp counter 
		
	}
	field J {
		state s

In [51]:
print(examples['originalFormCode'][0])


  component userDetailsContainer {}
  component formContainer {}
  component someOtherContainer {}
  component OtherContainer2 {}


  component myTextBox {
    props {
      textColor
      textSize
      textWeight
      borderColor
    }
  }
  component myCheckbox {
    props {
      size
    }
  }
  component otherTextBox {}
  component counter {
    props {
      style
    }
  }

form Bw {
	comp OtherContainer2 
	
	field G {
		comp myCheckbox size="'Hello'.toLowerCase()" as expression
		
	}
	form a {
		comp formContainer 
		
		field qw {
			state boolean[]
			comp otherTextBox 
			
		}
		field F__ {
			comp myTextBox 
			
		}
		field Y__ {
			state string default "'John Streich'" as expression
			comp myTextBox 
			
		}
		field S_w {
			state string
			comp myCheckbox size="#121310"
			
		}
		field N_ {
			state number default "0.17049162264892637" as expression
			comp counter 
			
		}
	}

	field pw_wwww_w__ {
		state boolean
		comp counter 
		
	}
	field Z_ {
		state string
		com

### Creating a Dataset on Hugging Face using `Datasets`

First, we manually created a Dataset repository on the Hugging Face Hub: [form-lang-examples](https://huggingface.co/datasets/guy-or/form-lang-examples).

We then convert our `examples` DataFrame into a Dataset object and upload it to the Hub.


In [60]:
ds = Dataset.from_pandas(examples)
ds.push_to_hub("guy-or/form-lang-examples", "3k_single_omission")
ds

Dataset({
    features: ['originalFormCode', 'modifiedFormCode', 'removedNodeEnglish', 'removedNodeContextEnglish', 'originalFormAst', 'userPrompt'],
    num_rows: 3000
})

# Zero-shot evaluation using Llama 3.2-1B

Before attempting to train the model, we want to evaluate its performance on some of our data in order to establish a baseline for performance.

We will measure 3 metrics:

* Accuracy
* BLEU
* ROUGE

BLEU and ROUGE are standard metrics for measuring text generation performance by comparing the model's output to a desired output. For code generation, these metrics may not accurately reflect the quality of the model's output when applied to the source code alone. We will therefore apply them both to the generated code itself and to its serialized AST (in JSON format).
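
To illustrate why the AST can be a more robust comparison target than raw text, the sketch below computes an exact-match accuracy over canonicalized AST JSON. It assumes that accuracy means an exact AST match, which this section does not define. `canonical_ast` and `ast_accuracy` are hypothetical helpers, and the tiny ASTs are made up for the example.

```python
import json

def canonical_ast(ast_json):
    """Re-serialize an AST JSON string with sorted keys and no extra whitespace."""
    return json.dumps(json.loads(ast_json), sort_keys=True, separators=(",", ":"))

def ast_accuracy(predicted, expected):
    """Fraction of predictions whose canonical AST equals the expected one."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    hits = sum(canonical_ast(p) == canonical_ast(e) for p, e in zip(predicted, expected))
    return hits / len(expected)

# Made-up ASTs for illustration only.
expected = [
    '{"$type":"Form","name":"a","children":[]}',
    '{"$type":"Form","name":"b","children":[]}',
]
predicted = [
    '{"name": "a", "$type": "Form", "children": []}',  # same AST, different key order and spacing
    '{"$type":"Form","name":"c","children":[]}',       # different form name
]
accuracy = ast_accuracy(predicted, expected)
print(accuracy)  # 0.5
```

A purely textual comparison would count the first pair as a mismatch, while the canonicalized comparison treats it as correct.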

### Loading the Dataset from Hugging Face using `Datasets`

We load the Dataset from the Hub using the `load_dataset` function. The dataset contains the raw user prompt and expected outputs.

In the next steps we will use these fields to create a full prompt which includes our system prompt and finally tokenize the prompt in preparation for running the model.

In [17]:
ds_3k_single_omission = load_dataset("guy-or/form-lang-examples", "3k_single_omission")
ds_3k_single_omission_all = ds_3k_single_omission['train']
ds_3k_single_omission_all

Dataset({
    features: ['originalFormCode', 'modifiedFormCode', 'removedNodeEnglish', 'removedNodeContextEnglish', 'originalFormAst', 'userPrompt'],
    num_rows: 3000
})

### Dataset pre-processing

In order to create prompts that effectively describe the task to our Llama 3.2 model, we need to format our prompt in a way that the model was trained to respond to.

This is called a [Prompt Template](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1/#prompt-template), which consists of a **System prompt** which provides the initial context and guidelines for the model, followed by a **User prompt** containing the request and relevant data.

#### System prompt

Our System prompt instructs the LLM to generate the original form (the `originalFormCode` column) given the data present in the user prompt which we already created in the Dataset generation part.

In [15]:
def get_system_prompt():
    return (
        "You are a code generation AI assistant.\n"
        "Your job is to generate valid FormLang code according to the instructions given below:\n"
            "Inspect the following FormLang Form definition, the start of the code will be denoted with ```FormLang and its end with ``` .\n"
            "After inspection, complete the Form's code according to the given a description of a new form element and a description of its location in the form.\n"
            "You may assume that the new form element to be added is always either a 'form' or a 'field'.\n"
            "Your output must be valid and compiler-friendly FormLang code only.\n"
            "If you are unsure of the FormLang syntax, try to infer it from the form code which is given below as an input."
            "Your answer will be evaluated using an AST comparison of your code to the expected code.\n"
            "Assume that the input code is valid and requires no modification other than the NEW code you must generate.\n"
            "You must output plain FormLang code without any additional text or delimiters.\n"
            "You must not change any part of the original input code other than adding the required element.\n"
    )



We load Llama's tokenizer using the Hugging Face `AutoTokenizer` class and leverage its [Chat templates](https://huggingface.co/docs/transformers/v4.48.2/en/chat_templating#introduction) support, which exposes a simple API for formatting our prompts into Llama's required Prompt Template.

In [27]:
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

def get_chat_prompt(userPrompt):
  return [
    {"role": "system", "content": get_system_prompt()},
    {"role": "user", "content": userPrompt}]

def preprocess_func(example, add_generation_prompt=True):
  return {"fullPrompt": tok.apply_chat_template(get_chat_prompt(example["userPrompt"]), tokenize=False, add_generation_prompt=add_generation_prompt)}

In [24]:
ds_3k_full_prompt = ds_3k_single_omission_all.map(preprocess_func, batched=False)
ds_3k_full_prompt

Dataset({
    features: ['originalFormCode', 'modifiedFormCode', 'removedNodeEnglish', 'removedNodeContextEnglish', 'originalFormAst', 'userPrompt', 'fullPrompt'],
    num_rows: 3000
})

In [26]:
ds_3k_full_prompt['fullPrompt'][0]

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 08 Feb 2025\n\nYou are a code generation AI assistant.\nYour job is to generate valid FormLang code according to the instructions given below:\nInspect the following FormLang Form definition, the start of the code will be denoted with ```FormLang and its end with ``` .\nAfter inspection, complete the Form\'s code according to the given a description of a new form element and a description of its location in the form.\nYou may assume that the new form element to be added is always either a \'form\' or a \'field\'.\nYour output must be valid and compiler-friendly FormLang code only.\nIf you are unsure of the FormLang syntax, try to infer it from the form code which is given below as an input.Your answer will be evaluated using an AST comparison of your code to the expected code.\nAssume that the input code is valid and requires no modification other than the NEW code you mus

## Utils

### Extracting FormLang code from LLM output

The LLM might ignore the instruction to output plain code and wrap its answer in a "```FormLang" delimiter. In that case we try to extract the fenced contents.

In [18]:
def extract_code_block(text, lang):
  """
  Extracts a code block from a string.

  Args:
    text: The string containing the code block.
    lang: The name of the language the code block is written in.

  Returns:
    The code block, or None if no code block is found.
  """
  pattern = f"```{lang}\n(.*?)```"
  match = re.search(pattern, text, re.DOTALL)
  if match:
    return match.group(1).strip()
  return None

def extract_llm_formlang_output(text):
    # Strip leading and trailing whitespace and check for delimiters
    text = text.strip()
    return extract_code_block(text, "FormLang") or text
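
The behaviour of these helpers on the main cases can be checked with a short, self-contained sketch. The helpers are restated so the snippet runs on its own, and the string `BODY` is a placeholder rather than real FormLang code.

```python
import re


def extract_code_block(text, lang):
    """Return the contents of the first ```<lang> block, or None if absent."""
    pattern = f"```{lang}\n(.*?)```"
    match = re.search(pattern, text, re.DOTALL)
    if match:
        return match.group(1).strip()
    return None


def extract_llm_formlang_output(text):
    """Prefer a fenced FormLang block; otherwise fall back to the stripped text."""
    text = text.strip()
    return extract_code_block(text, "FormLang") or text


# Case 1: the model wrapped its answer in the delimiter, with chatter around it.
fenced = "Here is the code:\n```FormLang\nBODY\n```\nDone."
# Case 2: the model followed the instruction and returned plain code.
plain = "  BODY  \n"
# Case 3: the model used a different fence label, so nothing is extracted.
other_fence = "```python\nBODY\n```"

print(extract_llm_formlang_output(fenced))       # BODY
print(extract_llm_formlang_output(plain))        # BODY
print(extract_llm_formlang_output(other_fence))  # the fenced text, unchanged
```

Two limitations follow from the fallback: a block fenced with any label other than `FormLang` is returned with its fence intact, and an empty `FormLang` block also falls through to the raw text because an empty string is falsy.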

## Load Llama 3.2 1B using Transformers Pipeline

In [32]:
tok(tok.apply_chat_template(get_chat_prompt(ds_3k_single_omission_all[0]["userPrompt"]), tokenize=False, add_generation_prompt=False, return_tensors="pt"))

{'input_ids': [128000, 128000, 128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 198, 15724, 2696, 25, 220, 2318, 13806, 220, 2366, 20, 271, 2675, 527, 264, 2082, 9659, 15592, 18328, 627, 7927, 2683, 374, 311, 7068, 2764, 3459, 27317, 2082, 4184, 311, 279, 11470, 2728, 3770, 512, 59633, 279, 2768, 3459, 27317, 3459, 7419, 11, 279, 1212, 315, 279, 2082, 690, 387, 3453, 9437, 449, 55375, 1876, 27317, 323, 1202, 842, 449, 55375, 16853, 6153, 26767, 11, 4686, 279, 3459, 596, 2082, 4184, 311, 279, 2728, 264, 4096, 315, 264, 502, 1376, 2449, 323, 264, 4096, 315, 1202, 3813, 304, 279, 1376, 627, 2675, 1253, 9855, 430, 279, 502, 1376, 2449, 311, 387, 3779, 374, 2744, 3060, 264, 364, 630, 6, 477, 264, 364, 2630, 24482, 7927, 2612, 2011, 387, 2764, 323, 19979, 22658, 3459, 27317, 2082, 1193, 627, 2746, 499, 527, 44003, 315, 279, 3459, 27317, 20047, 11, 1456, 311, 24499, 433, 505, 279, 1376, 2082, 902, 374, 2728, 3770, 439, 459, 1988, 7659, 414, 4320, 690, 387, 26126, 

In [23]:
ds_3k_single_omission_all.map(preprocess_func, batched=False)

Map:   0%|          | 0/3000 [00:00<?, ? examples/s]

TypeError: Provided `function` which is applied to all elements of table returns a variable of type <class 'torch.Tensor'>. Make sure provided `function` returns a variable of type `dict` (or a pyarrow table) to update the dataset or `None` if you are only interested in side effects.

In [38]:
unsloth4_llama3_pipeline = pipeline("text-generation", model="unsloth/Llama-3.2-1B-bnb-4bit", device_map="cuda", tokenizer=tok)

Unused kwargs: ['_load_in_4bit', '_load_in_8bit', 'quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
Device set to use cuda


In [32]:
llama3_pipeline = pipeline("text-generation", model="meta-llama/Llama-3.2-1B-Instruct", torch_dtype=torch.bfloat16, device_map="cuda")

config.json:   0%|          | 0.00/877 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/189 [00:00<?, ?B/s]

Device set to use cuda


In [None]:
%%time
# Pass a single user prompt string; get_chat_prompt expects text, not a dataset slice.
pred_example_0 = unsloth4_llama3_pipeline(get_chat_prompt(ds_3k_single_omission_train[0]["userPrompt"]), max_new_tokens=10**4)

In [None]:
%%time
# Pass the example's userPrompt field rather than the whole example dict.
llama3_pipeline(get_chat_prompt(ds_3k_single_omission_train[0]["userPrompt"]), max_new_tokens=10**4)

In [None]:
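# With chat input, "generated_text" is a list of messages; tokenize the assistant reply text.
tok(pred_example_0[0]["generated_text"][-1]["content"])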

In [None]:
pred_example_0_generated = pred_example_0[0]['generated_text'][2]['content']
print(extract_llm_formlang_output(pred_example_0_generated))
pprint(get_ast(extract_llm_formlang_output(pred_example_0_generated), True))
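
The index `[2]` above relies on the conversation holding exactly a system message, a user message, and the generated reply. As a sketch of a slightly more defensive lookup, the snippet below uses a mocked pipeline result (no model is run) and assumes the pipeline was called with a list of chat messages, in which case `generated_text` holds the whole conversation with the new assistant message last. The helper name `last_assistant_reply` is hypothetical.

```python
# Mocked result of a chat-style text-generation pipeline call for one conversation.
# The contents are placeholders, not real model output.
mock_prediction = [{
    "generated_text": [
        {"role": "system", "content": "<system prompt>"},
        {"role": "user", "content": "<user prompt>"},
        {"role": "assistant", "content": "<generated FormLang code>"},
    ]
}]


def last_assistant_reply(prediction):
    """Return the content of the final message, checking it is the assistant's."""
    messages = prediction[0]["generated_text"]
    reply = messages[-1]
    if reply["role"] != "assistant":
        raise ValueError(f"expected an assistant message, got {reply['role']!r}")
    return reply["content"]


print(last_assistant_reply(mock_prediction))
```

Taking the last message instead of a fixed position keeps the lookup valid if the prompt later gains or loses a message, for example when the system prompt is dropped.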

In [81]:
llama3_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct", torch_dtype=torch.bfloat16, device_map="cuda")

In [88]:
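# NOTE: this hand-written prompt deviates from Llama 3's chat format: it repeats
# <|begin_of_text|> (the tokenizer also prepends one, as the output below shows) and
# omits <|eot_id|> after the system message. tok.apply_chat_template avoids both.
example_inputs = tok(f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n{get_system_prompt()}\n<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n{ds_3k_single_omission_train[0]['userPrompt']}<|eot_id|>\n<|start_header_id|>assistant<|end_header_id|>", return_tensors="pt").to("cuda")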
with torch.no_grad():
  example_outputs = llama3_model.generate(**example_inputs, max_new_tokens=1000)
generated_text = tok.decode(example_outputs[0], skip_special_tokens=False)
del example_inputs
del example_outputs
generated_text

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


'<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>\nYou are a code generation AI assistant.\nYour job is to generate valid FormLang code according to the instructions given below:\nInspect the following FormLang Form definition, the start of the code will be denoted with ```FormLang and its end with ```.\nAfter inspection, complete the Form\'s code according to the given a description of a new form element and a description of its location in the form.\nYou may assume that the new form element to be added is always either a \'form\' or a \'field\'.\nYour output must be valid and compiler-friendly FormLang code only.\nIf you are unsure of the FormLang syntax, try to infer it from the form code which is given below as an input.Your answer will be evaluated using an AST comparison of your code to the expected code.\nAssume that the input code is valid and requires no modification other than the NEW code you must generate.\nYou must output plain FormLang code wit