# Format LLM Output to Generate JSON

## 1. Use Grammar Rules to Force LLM to Output JSON

### Prerequisites

Install [PyTorch](https://pytorch.org/get-started/locally/) and [huggingface-hub==0.23.0](https://pypi.org/project/huggingface-hub/) in your Python environment.

(I use an M1 MacBook, so I installed PyTorch with the command below. Adjust it for your environment.)

In [None]:
%pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
%pip install huggingface-hub==0.23.0

### 1st Option : llama.cpp

- Our first option is to use the [llama.cpp](https://github.com/ggerganov/llama.cpp/tree/master) library with a grammar file to generate the JSON output.
- This code may not run in an online notebook, because it requires cloning a Git repository and building a C++ program.

1. Clone the repository

In [2]:
!git clone https://github.com/ggerganov/llama.cpp.git

Cloning into 'llama.cpp'...
remote: Enumerating objects: 27435, done.
remote: Counting objects: 100% (8572/8572), done.
remote: Compressing objects: 100% (551/551), done.
remote: Total 27435 (delta 8280), reused 8089 (delta 8016), pack-reused 18863
Receiving objects: 100% (27435/27435), 49.72 MiB | 22.46 MiB/s, done.
Resolving deltas: 100% (19618/19618), done.


2. Run `make -j` in the `llama.cpp` directory to build the project.
    (Run `make LLAMA_CUDA=1` instead to build with CUDA support.)

In [37]:
%cd ./llama.cpp

/Users/abdullahguser/Desktop/my-learning/machine_learning/llama.cpp



In [40]:
!make -j

I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info: 
I UNAME_S:   Darwin
I UNAME_P:   arm
I UNAME_M:   arm64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DHAVE_BUGGY_APPLE_LINKER -DGGML_USE_ACCELERATE -DGGML_USE_BLAS -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_LLAMAFILE -DGGML_USE_METAL  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -Wunreachable-code-break -Wunreachable-code-return -Wdouble-promotion 
I CXXFLAGS:  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread   -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DHAVE_BUGGY_APPLE_LINKER -DGGML_USE_ACCELERATE -D

In [14]:
%cd ../

/Users/abdullahguser/Desktop/my-learning/machine_learning


3. Download `mistral-7b-instruct-v0.1.Q8_0.gguf` into the `./models` directory.

    (For other model files, see [huggingface.co/TheBloke](https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF).)

In [15]:
%cd ./llama.cpp/models/

/Users/abdullahguser/Desktop/my-learning/machine_learning/llama.cpp/models


In [16]:
!huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.1-GGUF mistral-7b-instruct-v0.1.Q8_0.gguf --local-dir . --local-dir-use-symlinks False

Downloading 'mistral-7b-instruct-v0.1.Q8_0.gguf' to '.huggingface/download/mistral-7b-instruct-v0.1.Q8_0.gguf.ab634d1d552dc60533e486cbcc73ad9b01358994dabf2453230e7fcd77308dc8.incomplete'
mistral-7b-instruct-v0.1.Q8_0.gguf: 100%|██| 7.70G/7.70G [04:12<00:00, 30.5MB/s]
Download complete. Moving file to mistral-7b-instruct-v0.1.Q8_0.gguf
mistral-7b-instruct-v0.1.Q8_0.gguf


In [17]:
%cd ../../

/Users/abdullahguser/Desktop/my-learning/machine_learning



5. llama.cpp uses [formal grammars](https://en.wikipedia.org/wiki/Formal_grammar) to constrain model output. Grammars are written in GBNF (GGML BNF) format. Create a grammar file in the `./grammars` directory. For example, to produce objects shaped like the TypeScript interface below, `answer.gbnf` contains the rules that follow it:

    ```
    interface answer {
        id: number;
        name: string;
    }
    ```

    ```
    root ::= answer
    answer ::= "{"   ws   "\"id\":"   ws   number   ","   ws   "\"name\":"   ws   string   "}"
    answerlist ::= "[]" | "["   ws   answer   (","   ws   answer)*   "]"
    string ::= "\""   ([^"]*)   "\""
    boolean ::= "true" | "false"
    ws ::= [ \t\n]*
    number ::= [0-9]+   "."?   [0-9]*
    stringlist ::= "["   ws   "]" | "["   ws   string   (","   ws   string)*   ws   "]"
    numberlist ::= "["   ws   "]" | "["   ws   number   (","   ws   number)*   ws   "]"
    ```

    - You can use [grammar.intrinsiclabs.ai](https://grammar.intrinsiclabs.ai/) to generate a grammar file for your custom schemas.
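To get a feel for what the `answer` rule accepts, the sketch below mirrors it with a Python regular expression. This is only an illustration: llama.cpp applies the grammar during sampling rather than as a check after the fact, and the names `ANSWER_RE` and `matches_answer_rule` are made up for this example.

```python
import re

# Each fragment mirrors one GBNF rule from answer.gbnf.
WS = r"[ \t\n]*"              # ws     ::= [ \t\n]*
NUMBER = r"[0-9]+\.?[0-9]*"   # number ::= [0-9]+ "."? [0-9]*
STRING = r'"[^"]*"'           # string ::= "\"" ([^"]*) "\""

# answer ::= "{" ws "\"id\":" ws number "," ws "\"name\":" ws string "}"
ANSWER_RE = re.compile(
    r"\{" + WS + r'"id":' + WS + NUMBER + "," + WS + r'"name":' + WS + STRING + r"\}"
)


def matches_answer_rule(text: str) -> bool:
    """Return True if the whole text is derivable from the `answer` rule."""
    return ANSWER_RE.fullmatch(text) is not None


print(matches_answer_rule('{"id": 3, "name": "Earth"}'))   # True
print(matches_answer_rule('{"name": "Earth", "id": 3}'))   # False: key order is fixed
print(matches_answer_rule('{"id": 3, "name": "Earth" }'))  # False: no ws before "}"
```

The last two cases show how strict the rule is: keys must appear in the order written, and whitespace is only allowed where `ws` appears.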

In [28]:
!echo "root ::= answer\n\
answer ::= \"{\"   ws   \"\\\"id\\\":\"   ws   number   \",\"   ws   \"\\\"name\\\":\"   ws   string   \"}\"\n\
answerlist ::= \"[]\" | \"[\"   ws   answer   (\",\"   ws   answer)*   \"]\"\n\
string ::= \"\\\"\"   ([^\"]*)   \"\\\"\"\n\
boolean ::= \"true\" | \"false\"\n\
ws ::= [ \\\t\\\n]*\n\
number ::= [0-9]+   \".\"?   [0-9]*\n\
stringlist ::= \"[\"   ws   \"]\" | \"[\"   ws   string   (\",\"   ws   string)*   ws   \"]\"\n\
numberlist ::= \"[\"   ws   \"]\" | \"[\"   ws   number   (\",\"   ws   number)*   ws   \"]\"" > ./llama.cpp/grammars/answer.gbnf
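The `echo` command depends on how the shell handles `\n` and nested escapes, which varies between shells. As a hedged alternative, the grammar file could be written from Python instead. `write_grammar` is a hypothetical helper, and only the rules reachable from `root` are included to keep the sketch short.

```python
from pathlib import Path
import tempfile

# Raw string so the backslashes reach the file exactly as GBNF expects them.
ANSWER_GBNF = r'''root ::= answer
answer ::= "{"   ws   "\"id\":"   ws   number   ","   ws   "\"name\":"   ws   string   "}"
string ::= "\""   ([^"]*)   "\""
ws ::= [ \t\n]*
number ::= [0-9]+   "."?   [0-9]*
'''


def write_grammar(path: Path, text: str = ANSWER_GBNF) -> Path:
    """Write the grammar to `path`, creating parent directories if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path


# Demonstrated on a temporary directory; in the notebook the target would be
# Path("./llama.cpp/grammars/answer.gbnf").
with tempfile.TemporaryDirectory() as tmp:
    target = write_grammar(Path(tmp) / "grammars" / "answer.gbnf")
    print(target.read_text(encoding="utf-8").splitlines()[0])  # root ::= answer
```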

6. Run the command below from the `llama.cpp` directory to generate JSON output.

In [None]:
%cd ./llama.cpp

In [43]:
!./llama-cli -m ./models/mistral-7b-instruct-v0.1.Q8_0.gguf -n 256 --grammar-file grammars/answer.gbnf -p "Q: Name the planets in the solar system? A:"

Log start
main: build = 3184 (9c77ec1d)
main: built with Apple clang version 15.0.0 (clang-1500.0.40.1) for arm64-apple-darwin23.5.0
main: seed  = 1718814861
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from ./models/mistral-7b-instruct-v0.1.Q8_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32         

In [None]:
%cd ../

### 2nd Option (Preferred) : use llama-index

- [llama-index](https://github.com/run-llama/llama_index) is a Python framework whose `LlamaCPP` integration wraps the llama.cpp Python bindings, so the same grammar-constrained generation can be driven from Python.

1. Install the Python libraries

In [None]:
%pip install llama-index==0.10.46
%pip install llama-index-embeddings-huggingface==0.2.2
%pip install llama-index-llms-llama-cpp==0.1.4

2. Download `mistral-7b-instruct-v0.1.Q8_0.gguf`.

In [None]:
!huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.1-GGUF mistral-7b-instruct-v0.1.Q8_0.gguf --local-dir . --local-dir-use-symlinks False

3. Define the JSON schema

In [44]:
import json

from llama_cpp.llama_grammar import LlamaGrammar
from pydantic import BaseModel


class Answer(BaseModel):
    id: int
    name: str


class Answers(BaseModel):
    answers: list[Answer]


# json.dumps yields valid JSON (str() would give a Python repr instead)
model_schema = json.dumps(Answers.model_json_schema())

# Convert the JSON schema into a llama.cpp grammar object
grammar = LlamaGrammar.from_json_schema(json_schema=model_schema)


from_string grammar:
Answer ::= [{] space Answer-id-kv [,] space Answer-name-kv [}] space 
space ::= space_49 
Answer-id-kv ::= ["] [i] [d] ["] space [:] space integer 
Answer-name-kv ::= ["] [n] [a] [m] [e] ["] space [:] space string 
integer ::= integer_15 space 
string ::= ["] string_50 ["] space 
answers ::= [[] space answers_11 []] space 
answers_7 ::= answers-item answers_10 
answers-item ::= Answer 
answers_9 ::= [,] space answers-item 
answers_10 ::= answers_9 answers_10 | 
answers_11 ::= answers_7 | 
answers-kv ::= ["] [a] [n] [s] [w] [e] [r] [s] ["] space [:] space answers 
char ::= [^"\] | [\] char_14 
char_14 ::= ["\/bfnrt] | [u] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] 
integer_15 ::= integer_16 integral-part 
integer_16 ::= [-] | 
integral-part ::= [0-9] | [1-9] integral-part_47 
integral-part_18 ::= [0-9] integral-part_46 
integral-part_19 ::= [0-9] integral-part_45 
integral-part_20 ::= [0-9] integral-part_44 
integral-part_21 ::= [0-9] integral-part_43 
integral
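A note on serializing the schema: `LlamaGrammar.from_json_schema` is given the schema as a JSON string, and Python's `str()` of a dict is not JSON. Patching the quotes only works by accident when the schema has no booleans, `None`, or apostrophes. The stand-alone sketch below uses a hand-written schema fragment (not the one pydantic generates) to show the difference.

```python
import json

# Hand-written schema fragment for illustration; real pydantic output differs.
schema = {
    "type": "object",
    "properties": {"id": {"type": "integer"}, "name": {"type": "string"}},
    "required": ["id", "name"],
    "additionalProperties": False,
}

# str() gives the Python repr: single quotes and a capitalized False.
patched = str(schema).replace("'", '"')
try:
    json.loads(patched)
    patched_is_valid = True
except json.JSONDecodeError:
    patched_is_valid = False

# json.dumps always emits valid JSON that round-trips to the same dict.
dumped = json.dumps(schema)
print(patched_is_valid)              # False
print(json.loads(dumped) == schema)  # True
```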

4. Define the LlamaCPP model

In [48]:
from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.llms.llama_cpp.llama_utils import (
    messages_to_prompt,
    completion_to_prompt,
)

# Path to the GGUF file downloaded in step 2
model_path = "./mistral-7b-instruct-v0.1.Q8_0.gguf"

llm = LlamaCPP(
    # Alternatively, pass model_url=... to download a model automatically
    model_path=model_path,
    temperature=0.1,
    max_new_tokens=256,
    # the loader log reports a 32768-token context for this model; 3900 is a conservative setting
    context_window=3900,
    # kwargs passed to __call__(); the grammar constrains sampling
    generate_kwargs={
        "grammar": grammar,
    },
    # kwargs passed to __init__(); set n_gpu_layers to at least 1 to use the GPU
    model_kwargs={"n_gpu_layers": 1},
    # transform inputs into the Llama 2 prompt format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)

llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from ./mistral-7b-instruct-v0.1.Q8_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.atte

5. Run a completion

In [49]:
response = llm.complete("Q: Name all of the planets in the solar system? A:")
print(response.text)

llama_tokenize_internal: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?

llama_print_timings:        load time =   18560.69 ms
llama_print_timings:      sample time =     249.72 ms /    92 runs   (    2.71 ms per token,   368.42 tokens per second)
llama_print_timings: prompt eval time =   18559.85 ms /    78 tokens (  237.95 ms per token,     4.20 tokens per second)
llama_print_timings:        eval time =   13849.23 ms /    91 runs   (  152.19 ms per token,     6.57 tokens per second)
llama_print_timings:       total time =   32819.50 ms /   169 tokens


{"answers":[{"id":1,"name":"Mercury"},{"id":2,"name":"Venus"},{"id":3,"name":"Earth"},{"id":4,"name":"Mars"},{"id":5,"name":"Jupiter"},{"id":6,"name":"Saturn"},{"id":7,"name":"Uranus"},{"id":8,"name":"Neptune"}]}
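Because the grammar fixes the shape of the output, the response can be parsed directly. A minimal sketch, using the text printed above as a literal in place of `response.text` so that it runs without the model:

```python
import json

# The completion printed above, pasted as a literal (stand-in for response.text).
response_text = (
    '{"answers":[{"id":1,"name":"Mercury"},{"id":2,"name":"Venus"},'
    '{"id":3,"name":"Earth"},{"id":4,"name":"Mars"},{"id":5,"name":"Jupiter"},'
    '{"id":6,"name":"Saturn"},{"id":7,"name":"Uranus"},{"id":8,"name":"Neptune"}]}'
)

data = json.loads(response_text)
names = [item["name"] for item in data["answers"]]
print(len(names), names[0], names[-1])  # 8 Mercury Neptune
```

If generation stops at `max_new_tokens` before the closing brace, `json.loads` will raise, so real code would wrap the parse in a `try`/`except`. With the pydantic models defined earlier, `Answers.model_validate_json(response.text)` would do the same parse and also validate the field types.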


### References

- [GBNF Guide](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md)

## 2. KOR : Structured Data from Text

### Prerequisites

- [kor==1.0.1](https://github.com/eyurtsev/kor)
- [openai==1.30.3](https://pypi.org/project/openai/)
- [langchain==0.2.1](https://pypi.org/project/langchain/)
- [huggingface-hub==0.23.0](https://pypi.org/project/huggingface-hub/)