Clean up README and exported functions #10

Merged · 3 commits · Jan 11, 2024
81 changes: 34 additions & 47 deletions README.md
@@ -8,6 +8,7 @@ Julia interface to
Meta's [LLaMA](https://arxiv.org/abs/2302.13971) (a large language
model).


## Installation

Press `]` at the Julia REPL to enter pkg mode, then:
@@ -18,84 +19,70 @@ add https://github.com/marcom/Llama.jl
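
If you prefer calling the Pkg API directly instead of pkg mode, the equivalent install is shown below (a minimal sketch using the same URL):

```julia
# Same installation via the Pkg API instead of pkg mode
using Pkg
Pkg.add(url="https://github.com/marcom/Llama.jl")
```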

The `llama_cpp_jll.jl` package used behind the scenes currently works
on Linux, Mac, and FreeBSD on i686, x86_64, and aarch64 (note: only
tested on x86_64-linux so far).
tested on x86_64-linux and aarch64-macos so far).

## Downloading the model weights

You will need a file with quantized model weights, see
[llama.cpp](https://github.com/ggerganov/llama.cpp) for instructions.
You will need a file with quantized model weights in the right format (GGUF).

You can either download the weights from the [HuggingFace Hub](https://huggingface.co) (search for "GGUF" to find weights in the right format) or convert them from the original PyTorch weights (see [llama.cpp](https://github.com/ggerganov/llama.cpp) for instructions).

Good weights to start with are the Dolphin-family fine-tuned weights, which are Apache 2.0 licensed and can be downloaded [here](https://huggingface.co/TheBloke/dolphin-2.6-mistral-7B-dpo-GGUF). Click on the tab "Files" and download one of the `*.gguf` files. We recommend the Q4_K_M version (~4.4GB).

The weights for OpenLLaMA, an open-source reproduction of Meta AI's
LLaMA, are freely available. They can be downloaded here in GGML
format (choose one of the .bin files):
https://huggingface.co/SlyEcho/open_llama_3b_v2_ggml
Newer releases may appear over time, so it is worth checking for updated versions.

Once you have a `url` link to a `.gguf` file, you can simply download it via:

```julia
using Llama

# Example for an Open-chat 7Bn parameter model (c. 4.4GB)
url = "https://huggingface.co/TheBloke/openchat-3.5-0106-GGUF/resolve/main/openchat-3.5-0106.Q4_K_M.gguf"
# Example for a 7Bn parameter model (c. 4.4GB)
url = "https://huggingface.co/TheBloke/dolphin-2.6-mistral-7B-dpo-GGUF/resolve/main/dolphin-2.6-mistral-7b-dpo.Q4_K_M.gguf"
model = download_model(url)
# Output: "models/openchat-3.5-0106.Q4_K_M.gguf"
# Output: "models/dolphin-2.6-mistral-7b-dpo.Q4_K_M.gguf"
```

You can use the `model` variable directly in the `run_*` functions, such as `run_server`.

## Simple HTTP Server
## Running example executables from llama.cpp

### Simple HTTP Server

Given a `model` file, you can run a simple HTTP server that provides both an in-browser chat interface and an OpenAI-compatible chat completion endpoint.
Server mode is the easiest way to get started with Llama.jl. It provides both an in-browser chat interface and an OpenAI-compatible chat completion endpoint (for packages like [PromptingTools.jl](https://github.com/svilupp/PromptingTools.jl)).

```julia
using Llama

# Use the `model` downloaded above
Llama.run_server(; model)
```
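
Once the server is running, the OpenAI-compatible endpoint can also be called programmatically. Below is a minimal sketch using HTTP.jl and JSON3.jl (neither is a dependency of Llama.jl, and the address is an assumption; use whatever host and port `run_server` reports on startup):

```julia
using HTTP, JSON3

# Assumed address; replace with the host/port printed by `run_server`
url = "http://127.0.0.1:8080/v1/chat/completions"

# Minimal OpenAI-style chat completion request
body = JSON3.write(Dict(
    "messages" => [Dict("role" => "user", "content" => "Say hello in one sentence.")],
))

resp = HTTP.post(url, ["Content-Type" => "application/json"], body)
reply = JSON3.read(resp.body)
println(reply.choices[1].message.content)
```
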
Explore other models on the [HuggingFace Hub](https://huggingface.co).

## REPL mode

The REPL mode is currently non-functional, but stay tuned!

## LibLlama
### Llama Text Generation

```julia
ctx = LlamaContext("./ggml-alpaca-7b-q4.bin")
```

### `generate`
using Llama

```julia
generate(ctx, "Write me a hello world in python") # => currently prints text to screen
```
s = run_llama(model="models/dolphin-2.6-mistral-7b-dpo.Q4_K_M.gguf", prompt="Hello")

### `logits`
# Provide additional arguments to llama.cpp (check the documentation for more details or the help text below)
s = run_llama(model="models/dolphin-2.6-mistral-7b-dpo.Q4_K_M.gguf", prompt="Hello", n_gpu_layers=0, args=`-n 16`)

```julia
logits(ctx) # => Vector{Float32}, length ctx.n_vocab
# print the help text with more options
run_llama(model="", prompt="", args=`-h`)
```

### `tokenize`
> [!TIP]
> If you're getting gibberish output, the model likely requires a "prompt template" (i.e., a specific structure for how you provide your instructions). Review the model's page on the HF Hub to see how to format prompts for it, or use the server instead.
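
For instance, Dolphin-family models are commonly described as using a ChatML-style template. A hypothetical sketch of wrapping an instruction before passing it to `run_llama` (verify the exact template on the model card):

```julia
using Llama

# ChatML-style wrapper (assumed template; confirm on the model's HF Hub page)
instruction = "Write a haiku about the Julia language."
prompt = """
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
$instruction<|im_end|>
<|im_start|>assistant
"""

s = run_llama(model="models/dolphin-2.6-mistral-7b-dpo.Q4_K_M.gguf", prompt=prompt)
```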

```julia
tokenize(ctx, "Hello world") # => Vector{Int32} (token_ids), variable length
```

## Running example executables from llama.cpp
### Interactive chat mode

```julia
using Llama

s = run_llama(model="./ggml-alpaca-7b-q4.bin", prompt="Hello", args=`-n 16`)
run_chat(model="models/dolphin-2.6-mistral-7b-dpo.Q4_K_M.gguf", prompt="Hello chat mode")
```

# use more threads
run_llama(model="./ggml-alpaca-7b-q4.bin", prompt="Hello", nthreads=4)
## REPL mode

# print the help text with more options
run_llama(model="", prompt="", args=`-h`)
```
The REPL mode is currently non-functional, but stay tuned!

### Interactive chat mode
## LibLlama

```julia
run_chat(model="./ggml-alpaca-7b-q4.bin", prompt="Hello chat mode", nthreads=4)
```
The `libllama` bindings are currently non-functional, but stay tuned!
15 changes: 11 additions & 4 deletions src/Llama.jl
@@ -1,8 +1,14 @@
module Llama

export run_llama, run_chat, run_server, download_model
export LlamaContext, embeddings, llama_eval, logits, tokenize,
token_to_str
# Use only these executables for now:
export run_llama, run_chat, run_server

export download_model

# Temporarily unexport as the low-level API is broken!
# export LlamaContext, embeddings, llama_eval, logits, tokenize,
# token_to_str


import llama_cpp_jll
import ReplMaker
@@ -11,7 +17,8 @@ import Downloads
include("../lib/LibLlama.jl")
import .LibLlama

__init__() = isdefined(Base, :active_repl) ? init_repl() : nothing
# Temporarily disable as the low-level API is broken!
# __init__() = isdefined(Base, :active_repl) ? init_repl() : nothing

include("utils.jl")
include("api.jl")
9 changes: 5 additions & 4 deletions src/repl.jl
@@ -16,6 +16,7 @@ function set_repl_llama(ctx::LlamaContext)
end

function repl_llama(s)
@warn "REPL Llama is not yet implemented. Please use `run_*` functions instead. See `?Llama.run_server` for more information."
# TODO
return s
end
@@ -26,9 +27,9 @@ function init_repl()
end
ReplMaker.initrepl(
repl_llama,
prompt_text = "LLaMA> ",
prompt_color = :blue,
start_key = '}',
mode_name = "LLaMA_mode",
prompt_text="LLaMA> ",
prompt_color=:blue,
start_key='}',
mode_name="LLaMA_mode",
)
end