diff --git a/README.md b/README.md
index 5cc01bcc0..75c367e76 100644
--- a/README.md
+++ b/README.md
@@ -4,15 +4,6 @@
-
-
-
-
-
-
-
 Getting Started - Docs - Changelog - Bug reports - Discord
@@ -20,46 +11,99 @@
 > ⚠️ **Nitro is currently in Development**: Expect breaking changes and bugs!
-
 ## Features

 ### Supported features
-- Simple http webserver to do inference on triton (without triton client)
-- Upload inference result to s3 (txt2img)
 - GGML inference support (llama.cpp, etc...)

 ### TODO:
 - [ ] Local file server
 - [ ] Cache
-- [ ] Plugins support
+- [ ] Plugin support

-### Nitro Endpoints
+## Documentation

-```zsh
-WIP
+## About Nitro
+
+Nitro is a lightweight integration layer (and soon-to-be inference engine) for cutting-edge inference engines, making deployment of AI models easier than ever before!
+
+Zipped, the Nitro binary is only ~3 MB and has few to no dependencies (for example, CUDA is needed only if you use a GPU), which makes it well suited for any edge or server deployment 👍.
+
+### Repo Structure
+
+```
+.
+├── controllers
+├── docs
+├── llama.cpp   -> Upstream llama.cpp
+├── nitro_deps  -> Dependencies of the Nitro project as a sub-project
+└── utils
 ```
-## Documentation
+## Quickstart

-## Installation
+**Step 1: Download Nitro**

-WIP
+To use Nitro, download the released binaries from the release page below:

-## About Nitro
+[](https://github.com/janhq/nitro/releases)

-### Repo Structure
+After downloading the release, double-click on the Nitro binary.

-WIP
+**Step 2: Download a Model**

-### Architecture
-
+Download a LLaMA model to try out the llama.cpp integration. You can find a GGUF model on TheBloke's Hugging Face page below:
+
+[](https://huggingface.co/TheBloke)
+
+**Step 3: Run Nitro**

-### Contributing
+Double-click on Nitro to run it. After downloading your model, note the path where it is saved. Then, make an API call to load the model into Nitro:

-WIP
+```zsh
+curl -X POST 'http://localhost:3928/inferences/llamacpp/loadmodel' \
+  -H 'Content-Type: application/json' \
+  -d '{
+    "llama_model_path": "/path/to/your_model.gguf",
+    "ctx_len": 2048,
+    "ngl": 100,
+    "embedding": true
+  }'
+```
+
+`ctx_len` and `ngl` are typical llama.cpp parameters (the context length and the number of layers to offload to the GPU, respectively), and `embedding` determines whether the embedding endpoint is enabled; a sketch of an embedding call is included at the end of this README.
+
+**Step 4: Perform Inference on Nitro for the First Time**
+
+```zsh
+curl --location 'http://localhost:3928/inferences/llamacpp/chat_completion' \
+  --header 'Content-Type: application/json' \
+  --header 'Accept: text/event-stream' \
+  --header 'Access-Control-Allow-Origin: *' \
+  --data '{
+    "messages": [
+      {"content": "Hello there 👋", "role": "assistant"},
+      {"content": "Can you write a long story", "role": "user"}
+    ],
+    "stream": true,
+    "model": "gpt-3.5-turbo",
+    "max_tokens": 2000
+  }'
+```
+
+The Nitro server is compatible with the OpenAI API format, so you can expect the same output as the OpenAI ChatGPT API; a side-by-side example is included at the end of this README.
+
+## Compile from source
+
+To compile Nitro, please visit [Compile from source](docs/manual_install.md).
+
+### Architecture
+
+Nitro is an integration layer on top of cutting-edge inference engines. Its structure can be simplified as follows:
+
+![Architecture](docs/architecture.png)

 ### Contact
-- For support: please file a Github ticket
-- For questions: join our Discord [here](https://discord.gg/FTk2MvZwJH)
-- For long form inquiries: please email hello@jan.ai
+- For support, please file a GitHub ticket.
+- For questions, join our Discord [here](https://discord.gg/FTk2MvZwJH).
+- For long-form inquiries, please email hello@jan.ai.
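+
+### More examples
+
+If you prefer a terminal over double-clicking, you can also start the server directly. A minimal sketch, assuming the downloaded binary is named `nitro` and sits in the current directory:
+
+```zsh
+# Make the downloaded binary executable, then start the Nitro server.
+# The curl examples above assume it is listening on localhost:3928.
+chmod +x ./nitro
+./nitro
+```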
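+
+With `"embedding": true` passed to `loadmodel`, the server can also return embeddings. A minimal sketch, assuming an OpenAI-style embeddings route; the exact path and payload are assumptions here, so check the docs before relying on them:
+
+```zsh
+# Hypothetical embedding request in the OpenAI format; verify the route.
+curl -X POST 'http://localhost:3928/v1/embeddings' \
+  -H 'Content-Type: application/json' \
+  -d '{
+    "input": "Hello there 👋"
+  }'
+```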
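+
+Because the request format matches the OpenAI API, pointing an existing client at Nitro is mostly a matter of swapping the URL. For comparison, the equivalent hosted call (assuming an `OPENAI_API_KEY` environment variable is set) looks like this:
+
+```zsh
+# Same payload as Step 4, sent to OpenAI instead of the local Nitro server.
+curl 'https://api.openai.com/v1/chat/completions' \
+  -H 'Content-Type: application/json' \
+  -H "Authorization: Bearer $OPENAI_API_KEY" \
+  -d '{
+    "messages": [
+      {"content": "Hello there 👋", "role": "assistant"},
+      {"content": "Can you write a long story", "role": "user"}
+    ],
+    "stream": true,
+    "model": "gpt-3.5-turbo",
+    "max_tokens": 2000
+  }'
+```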
+
diff --git a/docs/architecture.png b/docs/architecture.png
new file mode 100644
index 000000000..15ad4208c
Binary files /dev/null and b/docs/architecture.png differ
diff --git a/README_temp.md b/docs/manual_install.md
similarity index 100%
rename from README_temp.md
rename to docs/manual_install.md