14 changes: 6 additions & 8 deletions js/chat/README.md
@@ -1,11 +1,11 @@
-# Local Chat using Phi3, ONNX Runtime Web and WebGPU
+# Local Chatbot in the browser using Phi3, ONNX Runtime Web and WebGPU

This repository contains an example of running [Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) in your browser using [ONNX Runtime Web](https://github.com/microsoft/onnxruntime) with WebGPU.

You can try out the live demo [here](https://guschmue.github.io/ort-webgpu/chat/index.html).

-We keep this example simple and use the onnxruntime-web api directly without a
-higher level framework like [transformers.js](https://github.com/xenova/transformers.js).
+We keep this example simple and use the onnxruntime-web API directly. ONNX Runtime Web also powers
+higher-level frameworks like [transformers.js](https://github.com/xenova/transformers.js).
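
For illustration, a minimal sketch of what calling the onnxruntime-web API directly looks like; the model URL and the wasm fallback are assumptions for this sketch, not the repository's exact code:

```js
// Minimal sketch: create an inference session on WebGPU.
// Assumes an ES module build with onnxruntime-web installed via npm.
import * as ort from 'onnxruntime-web/webgpu';

async function createSession(modelUrl) {
  // Prefer WebGPU; fall back to wasm if the browser lacks WebGPU support.
  return await ort.InferenceSession.create(modelUrl, {
    executionProviders: ['webgpu', 'wasm'],
  });
}
```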

## Getting Started

@@ -42,13 +42,11 @@ Point your browser to http://localhost:8080/.

### The Phi3 ONNX Model

-The model used in this example is hosted on [Hugging Face](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx-web). It is slightly different than the ONNX model for CUDA or CPU:
+The model used in this example is hosted on [Hugging Face](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx-web). It is an ONNX version optimized specifically for the web, and it differs from the ONNX model for CUDA or CPU in a few ways:
1. The model output 'logits' is kept as float32 (even for float16 models) since JavaScript does not support float16.
2. Our WebGPU implementation uses the custom MultiHeadAttention operator instead of GroupQueryAttention.
3. Phi3 is larger than 2GB, so we need to use external data files. To keep them cacheable in the browser,
   both model.onnx and model.onnx.data are kept under 2GB (see the loading sketch below).
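
The list above translates roughly into session options like the following; this is a sketch under assumptions (the external-data option and file names follow the onnxruntime-web documentation and the model card, not this repository's code):

```js
import * as ort from 'onnxruntime-web/webgpu';

// Sketch: load the two <2GB files so the browser can cache each of them.
const session = await ort.InferenceSession.create('model.onnx', {
  executionProviders: ['webgpu'],
  // model.onnx references its weights through an external data file.
  externalData: [{ path: 'model.onnx.data', data: 'model.onnx.data' }],
});

// After running the session, the 'logits' output is a Float32Array even
// for a float16 model, because JavaScript has no float16 typed array.
```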

-The model was created using the [ONNX genai model builder](https://github.com/microsoft/onnxruntime-genai/tree/main/src/python/py/models).
-
-If you like to create the model yourself, you can use [Olive](https://github.com/microsoft/Olive/).
-An example how to create the model for ONNX Runtime Web with Olive can be found [here](https://github.com/microsoft/Olive/tree/main/examples/phi3).
+If you would like to optimize your own fine-tuned PyTorch Phi-3-mini model, you can use [Olive](https://github.com/microsoft/Olive/), which supports float data type conversion and the [ONNX genai model builder](https://github.com/microsoft/onnxruntime-genai/tree/main/src/python/py/models).
+An example of how to optimize the Phi-3-mini model for ONNX Runtime Web with Olive can be found [here](https://github.com/microsoft/Olive/tree/main/examples/phi3).
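
As a rough reference, the genai model builder is driven from the command line; the flags below, in particular the `-e web` target and `-p int4` precision, are assumptions to check against the builder's README rather than a verified recipe:

```sh
python -m onnxruntime_genai.models.builder \
  -m microsoft/Phi-3-mini-4k-instruct \
  -o ./phi3-mini-4k-instruct-onnx-web \
  -p int4 \
  -e web
```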