ad-llama 🦙

Use tagged template literals for structured Llama 2 inference locally in the browser via mlc-llm. Runs on Chromium browsers with WebGPU support.

Check out the playground, a simple D&D NPC generator, or a multi-step color generator built on solid-js. There's also a documentation page. Say hi in Discord!

Usage

npm install -S ad-llama

import { loadModel, ad, report } from 'ad-llama'

const loadedModel = await loadModel('Llama-2-7b-chat-hf-q4f32_1', report(console.info))
const { context, a } = ad(loadedModel)

const dm = context(
  'You are a dungeon master.',
  'Create an NPC based on the Dungeons and Dragons universe.'
)

const npc = dm`{
  "description": "${a('short description')}",
  "name": "${a('character name')}",
  "weapon": "${a('weapon')}",
  "class": "${a('primary class')}"
}`

const generatedNpc = await npc.collect(partial => {
  switch (partial.type) {
    case 'gen':
    case 'lit':
      console.info(partial.content)
      break
  }
})

console.log(generatedNpc)

For a more complicated example, including validation, retry logic, and transforms, check the Hacker News "Who's Hiring" example.

Generation

Each expression in the template literal is a new prompt with its own options. The prompt given for each expression is added to the system prompt and preprompt established in context, and prior completion text (literal parts as well as inferences) is appended to the LLM prompt as a partially completed assistant response (i.e. after [/INST]).
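To make that layout concrete, here is a minimal sketch of how the prompt for a single expression could be assembled in the Llama 2 chat format. The helper buildExpressionPrompt and its parameter names are hypothetical, and the exact whitespace ad-llama emits may differ; treat it as an illustration of the description above rather than the library's implementation.

```typescript
// Hypothetical helper, not part of ad-llama: illustrates the prompt layout only.
interface ExpressionPromptParts {
  system: string           // system prompt established in context
  preprompt: string        // preprompt established in context
  expressionPrompt: string // prompt for the current template expression
  completionSoFar: string  // literal parts and prior inferences, in template order
}

function buildExpressionPrompt(parts: ExpressionPromptParts): string {
  const { system, preprompt, expressionPrompt, completionSoFar } = parts
  // Llama 2 chat format: the system block sits inside the [INST] turn
  const instruction = `<<SYS>>\n${system}\n<</SYS>>\n\n${preprompt} ${expressionPrompt}`
  // Text after [/INST] reads as an assistant response that is already in progress
  return `[INST] ${instruction} [/INST] ${completionSoFar}`
}

// Prompt for the second expression ("name") of the NPC template,
// after a made-up description has already been generated.
const namePrompt = buildExpressionPrompt({
  system: 'You are a dungeon master.',
  preprompt: 'Create an NPC based on the Dungeons and Dragons universe.',
  expressionPrompt: 'Generate character name',
  completionSoFar: '{\n  "description": "A gruff dwarven smith",\n  "name": "',
})
```

Because the completion so far ends right at the opening quote of "name", the model continues the JSON value instead of starting a fresh answer.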

Each template expression can be configured independently: you can set a different temperature, token count, max length and more. Check the TemplateExpressionOptions type in ./src/types.ts for all options. a adds the preword to the expression prompt (by default "Generate"). To provide a naked prompt, use prompt; to change the preword, set the preword option. A plain string gets inserted as literal text, just like normal template literals.

template`{
  "name": "${characterName}",
  "description": "${a('clever description', {
    maxTokens: 1000,
    stops: ['\n']
  })}",
  "class": "${a('a primary class for the character')}"
}`
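As a rough sketch of the mechanics, a tag function receives the literal chunks and the interpolated expressions separately, which is what lets plain strings and generation expressions be told apart. The types and the template and a functions below are hypothetical stand-ins written for this illustration; the real definitions live in ./src/types.ts and may look different.

```typescript
// Hypothetical stand-ins for illustration; not ad-llama's real types.
type Options = { maxTokens?: number; stops?: string[]; preword?: string }
type GenExpr = { prompt: string; options: Options }
type Expr = string | GenExpr

type Part =
  | { type: 'lit'; content: string }
  | { type: 'gen'; prompt: string }

// `a` prepends the preword ("Generate" unless overridden) to the expression prompt.
function a(prompt: string, options: Options = {}): GenExpr {
  const preword = options.preword ?? 'Generate'
  return { prompt: `${preword} ${prompt}`, options }
}

// The tag function interleaves literal chunks with the expressions between them.
function template(literals: TemplateStringsArray, ...expressions: Expr[]): Part[] {
  const parts: Part[] = []
  literals.forEach((lit, i) => {
    if (lit.length > 0) parts.push({ type: 'lit', content: lit })
    const expr = expressions[i]
    if (expr === undefined) return
    if (typeof expr === 'string') {
      // Plain strings are inserted as literal text
      parts.push({ type: 'lit', content: expr })
    } else {
      // Expression objects become inference steps
      parts.push({ type: 'gen', prompt: expr.prompt })
    }
  })
  return parts
}

const characterName = 'Mira'
const parts = template`{"name": "${characterName}", "class": "${a('a primary class for the character')}"}`
// parts alternates literal text with one 'gen' step for the class
```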

Biased Sampling

You can modify the sampler for each expression, which lets you adjust the logits before sampling. You can, for example, only accept number characters for one expression, while avoiding specific strings in another. The main example here shows some of these options. You can build your own sampler, but some helper functions are exposed as sample to build them. The loaded model object has a bias field which configures the sampling: avoid and prefer adjust the relative likelihood of certain token ids, while reject sets the logits of the given ids to negative infinity and accept does the same for all other ids.

import { loadModel, ad, sample } from 'ad-llama'

const model = await loadModel(...)

const { bias } = model
const { oneOf, consistsOf } = sample

const { context, a } = ad(model)
const character = context('Create a character')

character`{
  "weapon": "${a('special weapon', {
    sampler: bias.prefer(oneOf(['Nun-chucks', 'Beam Cannon']), 10),
  })}",
  "age": ${a('age', {
    sampler: bias.accept(consistsOf(['0','1','2','3','4','5','6', '7', '8', '9'])),
    maxTokens: 3
  })}
}`
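The four bias operations can be pictured as simple transformations of a logits array. The functions below are hypothetical stand-ins on a toy four-token vocabulary: the real bias methods take the output of oneOf or consistsOf rather than raw ids, and whether prefer and avoid apply their weight additively, as assumed here, is not confirmed by this README.

```typescript
// Hypothetical stand-ins for the four bias operations, for illustration only.
type Logits = number[]

// reject: the given token ids can never be sampled
function reject(logits: Logits, ids: number[]): Logits {
  const banned = new Set(ids)
  return logits.map((logit, id) => (banned.has(id) ? -Infinity : logit))
}

// accept: only the given token ids can be sampled
function accept(logits: Logits, ids: number[]): Logits {
  const allowed = new Set(ids)
  return logits.map((logit, id) => (allowed.has(id) ? logit : -Infinity))
}

// prefer: raise the relative likelihood of the given ids (assumed additive weight)
function prefer(logits: Logits, ids: number[], weight: number): Logits {
  const boosted = new Set(ids)
  return logits.map((logit, id) => (boosted.has(id) ? logit + weight : logit))
}

// avoid: lower the relative likelihood without ruling the ids out entirely
function avoid(logits: Logits, ids: number[], weight: number): Logits {
  return prefer(logits, ids, -weight)
}

const logits: Logits = [1.0, 2.0, 0.5, 3.0] // toy vocabulary of four token ids
const onlyAllowed = accept(logits, [1, 2])  // ids 0 and 3 become -Infinity
const withoutThree = reject(logits, [3])    // id 3 becomes -Infinity
const nudged = prefer(logits, [0], 10)      // id 0 goes from 1.0 to 11.0
```

The difference matters in practice: reject and accept are hard constraints, while prefer and avoid only shift probabilities, so a strongly favored token can still win.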

As tokenizers are stateful, a simple encoding of just the provided strings won't necessarily produce tokens that would fit into the existing sequence. For example, with Llama 2's tokenizer foo and "foo" encode to completely different tokens:

encode('"foo"') // [376, 5431, 29908]
encode('foo')   // [7953]
decode([5431])  // 'foo'
decode([7953])  // 'foo'

oneOf and consistsOf try to figure out the relevant tokens for the provided strings within the context of the currently inferring expression.

consistsOf is for classes of characters - each sample will have the same token logits modified based on the given relevant tokens. So in consistsOf(['a','b']), every sample will have tokens for 'a' and 'b' modified.

oneOf is for strings - each sample has logits modified depending on the tokens which are still relevant given the already sampled tokens. For example, if we have oneOf(['ranger', 'wizard']) and we've already sampled 'w', the only next relevant tokens would be from 'izard'. If you want oneOf to stop at one of the choices, include the stop character (by default the next character after the expression), e.g. oneOf(['ranger"', 'wizard"']).

Even though reject(oneOf(['ranger', 'wizard'])) will never make it past the first token for either of the strings, giving the entire string still lets you target the correct tokens for completing those specific strings.
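The way oneOf narrows its choices as sampling proceeds can be sketched at the character level. This is a simplification: ad-llama works on token ids in the context of the current expression, and the function name remainingContinuations is hypothetical.

```typescript
// Hypothetical, character-level illustration; the library itself tracks token ids.
function remainingContinuations(choices: string[], sampledSoFar: string): string[] {
  return choices
    // keep only choices that are still consistent with what was sampled
    .filter(choice => choice.startsWith(sampledSoFar))
    // what is left to generate for each surviving choice
    .map(choice => choice.slice(sampledSoFar.length))
    // a fully sampled choice has nothing left to target
    .filter(rest => rest.length > 0)
}

const atStart = remainingContinuations(['ranger', 'wizard'], '')  // ['ranger', 'wizard']
const afterW = remainingContinuations(['ranger', 'wizard'], 'w')  // ['izard']
const offPath = remainingContinuations(['ranger', 'wizard'], 'x') // []
```

A consistsOf-style sampler would skip this bookkeeping entirely, since the same set of relevant tokens applies at every step regardless of what has been sampled.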

Validation

You can provide a validation function for an expression, along with a retry count. If validation fails, that expression is attempted again up to retries times, after which whatever was generated is taken. You can also transform the result of the expression generation (this happens whether validation passes or not).

validate?: {
  check?: (partial: string) => boolean
  transform?: (partial: string) => string
  retries: number
}
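The retry behaviour described above could look roughly like the following. This is a sketch under assumptions: generateWithValidation and the generate callback are hypothetical names, and the loop is written synchronously to keep it short, whereas real inference is asynchronous.

```typescript
// Hypothetical sketch of the retry logic; `generate` stands in for one inference
// of the expression and is not an ad-llama function.
interface Validate {
  check?: (partial: string) => boolean
  transform?: (partial: string) => string
  retries: number
}

function generateWithValidation(generate: () => string, validate: Validate): string {
  let result = generate()
  let retriesLeft = validate.retries
  // Retry while the check fails and retries remain
  while (validate.check !== undefined && !validate.check(result) && retriesLeft > 0) {
    result = generate()
    retriesLeft -= 1
  }
  // The transform runs whether or not validation passed
  return validate.transform ? validate.transform(result) : result
}

// Fake generator: fails the check once, then produces a number with trailing space
const outputs = ['not a number', '42 ']
let calls = 0
const age = generateWithValidation(
  () => outputs[Math.min(calls++, outputs.length - 1)] ?? '',
  { check: s => /^\d+\s*$/.test(s), transform: s => s.trim(), retries: 2 },
)
```

Here the first output fails the check, the single retry passes, and the transform trims the accepted value.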

Vite HMR

Waiting for models to reload can be tedious, even when they're cached. ad-llama should work with Vite HMR so the loaded models stay in memory. Put this in your source file to create an HMR boundary:

import { loadModel, ad, guessModelSpecFromPrebuiltId } from 'ad-llama'

// accept hot updates so this module becomes the HMR boundary
if (import.meta.hot) { import.meta.hot.accept() }

const loadedModel = await loadModel(guessModelSpecFromPrebuiltId('Llama-2-7b-chat-hf-q4f32_1'))

Build

pre-reqs
- emcc: https://emscripten.org/docs/getting_started/downloads.html

build the tvmjs dependency first

git clone https://github.com/gsuuon/ad-llama.git --recursive
cd ad-llama/3rdparty/relax/web

# source ~/emsdk/emsdk_env.sh
make
npm install
npm run build

Then, back in the repository root, run either npm run build or npm run dev (which watches src/ and serves public/).

Motivation

I was inspired by guidance but felt that tagged template literals were a better way to express structured inference. I also think grammar-based sampling is neat, and wanted to add a way to plug something like that into MLC infrastructure.

Todos

runs on Deno
can target cpu
repeatable subtemplates
template expressions can reference previously generated expressions
