# Domain Specific Code Generation using the FormLang DSL

## Abstract

AI and LLM systems are being trained to perform a variety of complex tasks requiring expertise in many software tools, languages and technology stacks. Challenges in this process include acquiring large amounts of high-quality data, handling growing model parameter counts, which drive up compute demands and costs, and emerging data privacy and intellectual property concerns related to using third-party cloud services for model training.

In this work we attempt to combine the abstraction and determinism provided by formal Domain Specific Languages (DSLs) with the innate ability of LLMs to learn new languages and their semantics. We propose a novel generation task called "Domain Specific Code Generation", which involves mapping user requests written in natural language to DSL code.

By utilizing a specially crafted DSL called `FormLang` as a case study, we attempt to lay the groundwork for methods of automated DSL dataset generation, training techniques and performance evaluation, with the end goal of creating an AI system capable of generating Web forms according to a user request.

Our `FormLang` DSL allows expressing the semantics of Web-forms using a simplified syntax that does not require much, if any, Web-programming knowledge and expertise.

Given a user prompt in English describing the desired form and its fields, the LLM produces syntactically valid `FormLang` output, which is run through the accompanying FormLang parser and a hand-crafted React JSX compiler to produce a final implementation of the form in JavaScript and React.

The project includes a live demo which demonstrates the capabilities of the system.


## Referring to this work

If you use this work, please cite it as follows:

```bibtex
@misc{guyor2025dscodegenformlang,
      title={Domain Specific Code Generation using the FormLang DSL}, 
      author={Guy Or},
      year={2025}
}
```

The official repository of this work is hosted on GitHub at https://github.com/guyo13/Form-Lang **TBD** - Make the repo public

## Project Goals

* Define the Task of Domain Specific Code Generation.
* Create an AI training pipeline for FormLang (implemented as a Jupyter notebook) which includes:
    * Automatic FormLang dataset generation using search algorithms and heuristics.
    * Baseline model selection and loading from Hugging Face.
    * Dataset Preprocessing and loading.
    * Defining performance KPIs for the system.
    * Model fine-tuning using the Transformers library.
    * Model adapter training using the PEFT library.
    * Model upload to Hugging Face Hub and example usage from the Hub.
    * **(TBD)** Export to ONNX using Optimum and run on-device using Transformers.js.
* Create the “FormLang” language:
    * Describe the problem domain.
    * Define a viable minimal syntax and semantics which are research focused rather than completeness focused.
    * **(TBD)** Implement a “JavaScript React” compile target.
* **(TBD)** Create a live demo website:
    * **(TBD)** Users input a prompt.
    * **(TBD)** A FormLang editor is populated with the AI’s code generation results.
    * **(TBD)** The form is rendered alongside the generated code.
* Discuss the project results:
    * Performance and user acceptability.
    * **(TBD)** Viability of the implemented methods for the Domain Specific Code Generation task and generalization to other domains.
    * **(TBD)** Potential enhancements to the system.
    * **(TBD)** Possible research directions on how to learn from user data.


**(TBD)** - Features marked as TBD depend on the project's progress and timeline constraints, as well as on proving the viability of the methods and system.

# Notebook Setup

## Project imports

In [1]:
import asyncio
import time
from transformers import pipeline
from huggingface_hub import login
import pythonmonkey as pm
import pandas as pd
import requests
formlang_lib = pm.require("../out/cjs/lib/index")

## Log in to the Hugging Face Hub

In [8]:
login()


## PythonMonkey helpers

In [2]:
def js_dir(something):
    pm.globalThis.console.dir(something)

def js_log(something):
    pm.globalThis.console.log(something)

# Dataset generation

### Example usage of FormLang library

In [6]:
form_components = """
  component userDetailsContainer {}
  component formContainer {}
  component someOtherContainer {}
  component OtherContainer2 {}
"""
field_components = """
  component myTextBox {
    props {
      textColor
      textSize
      textWeight
      borderColor
    }
  }
  component myCheckbox {
    props {
      size
    }
  }
  component otherTextBox {}
  component counter {
    props {
      style
    }
  }
"""

form_gen = await formlang_lib.newFormGen(formlang_lib.DEFAULT_GENERATOR_HYPER_PARAMETERS, form_components, field_components)

Validating model
Validating model


In [7]:
a = form_gen.generateForm()

In [8]:
js_dir(a)

{ '$type': 'Form',
  name: 'xw',
  component:
   { component:
      { '$type': 'ComponentDef',
        name: 'formContainer',
        '$cstNode': [Object],
        props: [],
        '$container': [Object],
        '$containerProperty': 'components',
        '$containerIndex': 1 },
     propAssignments: {} },
  children:
   [ { '$type': 'Field',
       name: 'c',
       component: [Object],
       state: null,
       depth: 1 },
     { '$type': 'Field',
       name: 'e',
       component: [Object],
       state: [Object],
       depth: 1 },
     { '$type': 'Form',
       name: 'A',
       component: [Object],
       children: [Array],
       depth: 1 },
     { '$ty

In [19]:
fl_obj = await formlang_lib.getFormLangStringParser()("""
component comp1 {}
component comp2 {}
form HelloWorld {
    comp comp1
    field MyField {
        comp comp2
    }
}
""")

Validating model


In [20]:
formlang_lib.hasErrors(fl_obj)

False

In [21]:
js_dir(fl_obj)

{ parseResult:
   { value:
      { '$type': 'Model',
        components: [Array],
        forms: [Array],
        '$cstNode': [Object],
        typeDefs: [],
        '$document': [Circular] },
     lexerErrors: [],
     lexerReport: { diagnostics: [] },
     parserErrors: [] },
  uri:
   l2 {
     scheme: 'file',
     authority: '',
     path: '/10.form',
     query: '',
     fragment: '',
     _formatted: 'file:///10.form',
     _fsPath: null },
  state: 6,
  references:
   [ { '$refNode': [Object],
       '$refText': 'comp1',
       ref: [Getter],
       '$nodeDescription': [Getter],
       error: [Getter],
       _ref: [Object],
       _nodeDescription: [Object] },
     { '$refNode': [Object],

In [22]:
formlang_lib.serializeAst(fl_obj.parseResult.value, formlang_lib.getServices().FormLang)

'{"$type":"Model","components":[{"$type":"ComponentDef","name":"comp1","props":[]},{"$type":"ComponentDef","name":"comp2","props":[]}],"forms":[{"$type":"Form","name":"HelloWorld","component":{"$type":"FieldComponentDef","componentId":{"$ref":"#/components@0"},"componentPropsKeys":[],"componentPropsValues":[]},"children":[{"$type":"Field","name":"MyField","component":{"$type":"FieldComponentDef","componentId":{"$ref":"#/components@1"},"componentPropsKeys":[],"componentPropsValues":[]}}]}],"typeDefs":[]}'
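The serialized AST above links each `component` to its definition through a `$ref` string such as `#/components@0`. The following is a minimal sketch of how such a reference could be followed in Python. It assumes, based only on the output shown here, that a reference is a `/`-separated path from the model root in which `name@N` means element `N` of the list property `name`. `resolve_ref` is a hypothetical helper and is not part of the FormLang library.

```python
import json
import re

# One path segment: a property name, optionally followed by "@<index>".
_SEGMENT = re.compile(r"^(?P<prop>[^@]+)(?:@(?P<index>\d+))?$")


def resolve_ref(root, ref):
    """Follow a '#/prop@index/...' reference starting at the model root (assumed format)."""
    if not ref.startswith("#/"):
        raise ValueError(f"Unsupported reference: {ref!r}")
    node = root
    for segment in ref[2:].split("/"):
        match = _SEGMENT.match(segment)
        if match is None:
            raise ValueError(f"Malformed segment {segment!r} in {ref!r}")
        node = node[match["prop"]]
        if match["index"] is not None:
            node = node[int(match["index"])]
    return node


# A trimmed copy of the serialized model shown above.
serialized = (
    '{"$type":"Model","components":['
    '{"$type":"ComponentDef","name":"comp1","props":[]},'
    '{"$type":"ComponentDef","name":"comp2","props":[]}],'
    '"forms":[{"$type":"Form","name":"HelloWorld",'
    '"component":{"componentId":{"$ref":"#/components@0"}},'
    '"children":[{"$type":"Field","name":"MyField",'
    '"component":{"componentId":{"$ref":"#/components@1"}}}]}]}'
)
model = json.loads(serialized)
form = model["forms"][0]
print(resolve_ref(model, form["component"]["componentId"]["$ref"])["name"])
```

Resolving references this way is only useful for inspecting the JSON from Python; inside the FormLang library the parser already exposes resolved references.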

### Generating the dataset

We start by creating a `ProbabilisticSearchFormGenerator` object, which is implemented in JavaScript, and provide it with the default search hyperparameters and some component definitions. Later on we will implement a random generator for the component definitions as well.

In [3]:
form_components = """
  component userDetailsContainer {}
  component formContainer {}
  component someOtherContainer {}
  component OtherContainer2 {}
"""
field_components = """
  component myTextBox {
    props {
      textColor
      textSize
      textWeight
      borderColor
    }
  }
  component myCheckbox {
    props {
      size
    }
  }
  component otherTextBox {}
  component counter {
    props {
      style
    }
  }
"""

form_gen = await formlang_lib.newFormGen(formlang_lib.DEFAULT_GENERATOR_HYPER_PARAMETERS, form_components, field_components)

#### Computing the expected AST

To measure the model's performance reliably, our KPI will be **AST-Accuracy**: the accuracy of the generated code's AST with respect to the expected code's AST.
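As a rough illustration of this KPI, the sketch below computes an exact-match accuracy over pairs of serialized ASTs. It assumes that ASTs are JSON strings like the serialized ASTs shown in this notebook, and that two ASTs count as equal when their parsed JSON values are equal, so key order inside an object does not matter. `ast_accuracy` and `parse_ast` are hypothetical helper names, and the project's actual metric may be defined differently.

```python
import json


def parse_ast(ast):
    """Parse a serialized AST (JSON string) into plain Python objects."""
    return json.loads(ast) if isinstance(ast, str) else ast


def ast_accuracy(generated_asts, expected_asts):
    """Fraction of examples whose generated AST equals the expected AST."""
    if len(generated_asts) != len(expected_asts):
        raise ValueError("Both lists must contain the same number of ASTs")
    if not expected_asts:
        return 0.0
    matches = sum(
        parse_ast(generated) == parse_ast(expected)
        for generated, expected in zip(generated_asts, expected_asts)
    )
    return matches / len(expected_asts)


# Same structure with different key order counts as a match.
expected = '{"$type":"Field","name":"x","state":{"isArray":true,"type":"string"}}'
same = '{"name":"x","$type":"Field","state":{"type":"string","isArray":true}}'
different = '{"$type":"Field","name":"y"}'
print(ast_accuracy([same, different], [expected, expected]))
```

Comparing parsed values rather than raw strings avoids penalizing outputs that differ only in serialization details such as key order or whitespace.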

In [18]:
def get_ast(form_code, form_components, field_components):
    # Ask the service at localhost:3000 for the serialized AST of the given code.
    payload = {"sourceCode": form_code, "formComponentsCode": form_components,
               "fieldComponentsCode": field_components}
    result = requests.post("http://localhost:3000/compute/ast", json=payload).json()
    if result["status"] != "ok":
        raise RuntimeError(f"AST computation failed: {result}")
    return result["result"]

We create a small dataset of 3000 random examples by using the JavaScript FormLang utilities library and the form generator object.
We copy from the form generator's output only the relevant fields which will make up the LLM prompt.

In [5]:
def copy_relevant_prompt_data(prompt_data):
    data = {
        # Copy each string into Python so that its memory is managed entirely by Python rather than by PythonMonkey
        "serializedForm": str(prompt_data["serializedForm"]),
        "serializedModifiedForm": str(prompt_data["serializedModifiedForm"]),
        "removedNodeEnglish": str(prompt_data["removedNodeEnglish"]),
        "removedNodeContextEnglish": str(prompt_data["removedNodeContextEnglish"]),
    }
    return data

def create_random_data(form_gen, form_components, field_components, num_examples=3000):
    start_time = time.time()
    data = []
    for _ in range(num_examples):
        prompt_data = formlang_lib.generateRandomFormPromptData(form_gen)
        data.append(copy_relevant_prompt_data(prompt_data))
    # Compute the AST of every example once all forms are generated (one request per example)
    ast_time = time.time()
    for example in data:
        example["ast"] = get_ast(example["serializedForm"], form_components, field_components)
    end_time = time.time()
    print(f"Generation took {end_time - start_time} seconds ast time {end_time - ast_time}")
    return pd.DataFrame(data)

In [19]:
examples = create_random_data(form_gen, form_components, field_components,3000)
examples.head()

Generation took 9.005642652511597 seconds ast time 8.696900129318237


Unnamed: 0,serializedForm,serializedModifiedForm,removedNodeEnglish,removedNodeContextEnglish,ast
0,form G {\n\tcomp userDetailsContainer \n\t\n\t...,form G {\n\tcomp userDetailsContainer \n\t\n\t...,\ta field whose id is 'Bww' using the componen...,* is a child of the form whose id is 'G'.\n* i...,"{""$type"":""Form"",""name"":""G"",""component"":{""$type..."
1,form nw {\n\tcomp formContainer \n\t\n\tfield ...,form nw {\n\tcomp formContainer \n\t\n\tfield ...,\ta field whose id is 'U' using the component ...,* is a child of the form whose id is 'nw'.\n* ...,"{""$type"":""Form"",""name"":""nw"",""component"":{""$typ..."
2,form Qw {\n\tcomp OtherContainer2 \n\t\n\tfiel...,form Qw {\n\tcomp OtherContainer2 \n\t\n\tfiel...,\ta field whose id is 'o___w' using the compon...,* is a child of the form whose id is 'Qw'.\n* ...,"{""$type"":""Form"",""name"":""Qw"",""component"":{""$typ..."
3,form rwwww {\n\tcomp formContainer \n\t\n\tfor...,form rwwww {\n\tcomp formContainer \n\t\n\tfor...,\t\t\ta field whose id is 'Q_w' with state of ...,* is a child of the form whose id is 't'.\n* i...,"{""$type"":""Form"",""name"":""rwwww"",""component"":{""$..."
4,form P {\n\tcomp someOtherContainer \n\t\n\tfi...,form P {\n\tcomp someOtherContainer \n\t\n\tfi...,\ta field whose id is 'c_ww' with state of typ...,* is a child of the form whose id is 'P'.\n* i...,"{""$type"":""Form"",""name"":""P"",""component"":{""$type..."


In [21]:
examples["serializedForm"][1093]

'form Yww {\n\tcomp OtherContainer2 \n\t\n\tfield g_wwwww {\n\t\tstate string[]\n\t\tcomp counter \n\t\t\n\t}\n}\n'

In [20]:
examples["ast"][1093]

'{"$type":"Form","name":"Yww","component":{"$type":"FieldComponentDef","componentId":{"$ref":"#/components@3"},"componentPropsKeys":[],"componentPropsValues":[]},"children":[{"$type":"Field","name":"g_wwwww","state":{"$type":"FieldStateDef","isArray":true,"type":"string"},"component":{"$type":"FieldComponentDef","componentId":{"$ref":"#/components@7"},"componentPropsKeys":[],"componentPropsValues":[]}}]}'

### Creating the prompt

We now create the user prompt which will be used to train the LLM.

We instruct the LLM to reconstruct the original form (the `serializedForm` column) given the modified form (`serializedModifiedForm`), an English description of the removed node (`removedNodeEnglish`) and a description of its surrounding context (`removedNodeContextEnglish`), which states the removed node's parent and siblings.
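To make the task concrete, the sketch below mimics the data-preparation idea in plain Python: it removes one randomly chosen node from a nested form and records the removed node's parent and remaining siblings. The dictionary-based form representation and the `remove_random_node` helper are assumptions made for illustration only. The actual generation is done by the JavaScript `generateRandomFormPromptData` function, whose internals are not shown here.

```python
import copy
import random


def remove_random_node(form, rng):
    """Return (modified_form, removed_node, context) for one randomly removed child."""
    modified = copy.deepcopy(form)  # leave the caller's form untouched
    # Collect every (parent, index) pair so that nested forms are candidates too.
    candidates = []
    stack = [modified]
    while stack:
        node = stack.pop()
        for index, child in enumerate(node.get("children", [])):
            candidates.append((node, index))
            stack.append(child)
    if not candidates:
        raise ValueError("The form has no child node to remove")
    parent, index = rng.choice(candidates)
    removed = parent["children"].pop(index)
    context = {
        "parent": parent["name"],
        "siblings": [child["name"] for child in parent["children"]],
    }
    return modified, removed, context


# A toy form: root 'G' with a field 'a' and a nested form 'B' holding field 'c'.
form = {"name": "G", "children": [
    {"name": "a"},
    {"name": "B", "children": [{"name": "c"}]},
]}
modified, removed, context = remove_random_node(form, random.Random(0))
print(removed["name"], context)
```

In this framing the modified form and the context description become the model input, and the untouched original form is the target the model should reproduce.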


In [30]:
def create_prompt(row):
    return ("Your job is to generate valid FormLang code according to the instructions given below:\n"
            "Inspect the following FormLang Form definition, the start of the code will be denoted with ```FormLang and its end with ``` .\n"
            "After inspection, complete the Form's code according to the given description of a new form element and a description of its location in the form.\n"
            "You may assume that the new form element to be added is always either a 'form' or a 'field'.\n"
            "Your output must be valid and compiler-friendly FormLang code only.\n"
            "If you are unsure of the FormLang syntax, try to infer it from the form code which is given below as an input.\n"
            "Your answer will be evaluated using an AST comparison of your code to the expected code.\n"
            "```FormLang\n"
            f"{row['serializedModifiedForm']}\n"
            "```\n"
            "The description of the form element you need to add:\n"
            f"{row['removedNodeEnglish']}\n"
            "The description of the context in the form where you should add the element:\n"
            f"The element to be added:\n{row['removedNodeContextEnglish']}\n"
           )

examples["userPrompt"] = examples.apply(create_prompt, axis=1)
examples.head()

Unnamed: 0,serializedForm,serializedModifiedForm,removedNodeEnglish,removedNodeContextEnglish,userPrompt
0,form p_ {\n\tcomp someOtherContainer \n\t\n\tf...,form p_ {\n\tcomp someOtherContainer \n\t\n\tf...,\t\ta field whose id is 'y' with state of type...,* is a child of the form whose id is 'a_w'.\n*...,Your job is to generate valid FormLang code ac...
1,form z_ {\n\tcomp userDetailsContainer \n\t\n\...,form z_ {\n\tcomp userDetailsContainer \n\t\n\...,\ta field whose id is 'm' using the component ...,* is a child of the form whose id is 'z_'.\n* ...,Your job is to generate valid FormLang code ac...
2,form C_ {\n\tcomp someOtherContainer \n\t\n\tf...,form C_ {\n\tcomp someOtherContainer \n\t\n\tf...,\ta field whose id is 'yw_' with state of type...,* is a child of the form whose id is 'C_'.\n* ...,Your job is to generate valid FormLang code ac...
3,form Fw {\n\tcomp OtherContainer2 \n\t\n\tfiel...,form Fw {\n\tcomp OtherContainer2 \n\t\n\tfiel...,\t\ta field whose id is 'Sww' with state of ty...,* is a child of the form whose id is 'v'.\n* i...,Your job is to generate valid FormLang code ac...
4,form lw {\n\tcomp OtherContainer2 \n\t\n\tfiel...,form lw {\n\tcomp OtherContainer2 \n\t\n\tfiel...,\ta field whose id is 'b' with state of type: ...,* is a child of the form whose id is 'lw'.\n* ...,Your job is to generate valid FormLang code ac...


In [31]:
print(examples['userPrompt'][0])

Your job is to generate valid FormLang code according to the instructions given below:
Inspect the following FormLang Form definition, the start of the code will be denoted with ```FormLang and its end with ``` .
After inspection, complete the Form's code according to the given description of a new form element and a description of its location in the form.
You may assume that the new form element to be added is always either a 'form' or a 'field'.
Your output must be valid and compiler-friendly FormLang code only.
If you are unsure of the FormLang syntax, try to infer it from the form code which is given below as an input.
Your answer will be evaluated using an AST comparison of your code to the expected code.
```FormLang
form p_ {
	comp someOtherContainer 
	
	field M_ {
		state boolean
		comp otherTextBox 
		
	}
	form a_w {
		comp someOtherContainer 
		
		field rw_ {
			state boolean[]
			comp otherTextBox 
			
		}
		field pww {
			state string default "Ian Dickens"
			comp counter 

In [36]:
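# [await get_ast(code) for code in examples["serializedForm"]]
# Note: this call awaits get_ast and passes only the form source, so it does not match the
# synchronous, three-argument get_ast defined above. The unresolved-reference errors in the
# output are consistent with no component definitions having been supplied.
ast1 = await get_ast(examples["serializedForm"][0])
ast1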

'{"$type":"Model","forms":[{"$type":"Form","name":"p_","component":{"$type":"FieldComponentDef","componentId":{"$error":"Could not resolve reference to ComponentDef named \'someOtherContainer\'."},"componentPropsKeys":[],"componentPropsValues":[]},"children":[{"$type":"Field","name":"M_","state":{"$type":"FieldStateDef","type":"boolean","isArray":false},"component":{"$type":"FieldComponentDef","componentId":{"$error":"Could not resolve reference to ComponentDef named \'otherTextBox\'."},"componentPropsKeys":[],"componentPropsValues":[]}},{"$type":"Form","name":"a_w","component":{"$type":"FieldComponentDef","componentId":{"$error":"Could not resolve reference to ComponentDef named \'someOtherContainer\'."},"componentPropsKeys":[],"componentPropsValues":[]},"children":[{"$type":"Field","name":"rw_","state":{"$type":"FieldStateDef","isArray":true,"type":"boolean"},"component":{"$type":"FieldComponentDef","componentId":{"$error":"Could not resolve reference to ComponentDef named \'otherTex