This repository was archived by the owner on Jul 4, 2025. It is now read-only.
Merged
25 commits
d96b995
add robots.txt + sitemaps
hahuyhoang411 Nov 24, 2023
564c713
add plugin sitemap
hahuyhoang411 Nov 24, 2023
60351e5
fix plugin sitemap
hahuyhoang411 Nov 24, 2023
7761be2
fix plugin sitemap version
hahuyhoang411 Nov 24, 2023
76f4a5a
update the plugin sitemap
hahuyhoang411 Nov 24, 2023
eb9196b
Merge branch 'main' into update-seo
hahuyhoang411 Nov 24, 2023
a98a2c9
update pal chat instruct
hahuyhoang411 Nov 27, 2023
76a6572
Merge branch 'main' into update-seo
hahuyhoang411 Nov 27, 2023
a665d3f
draft SEO content
hahuyhoang411 Nov 27, 2023
7e2e6a0
update jsonld for SEO
hahuyhoang411 Nov 27, 2023
f9f11c5
update description for pages on Nitro
hahuyhoang411 Nov 27, 2023
1c04018
fix the null description
hahuyhoang411 Nov 27, 2023
0de3d0f
Merge branch 'update-seo' into docs-fix-27-11
hahuyhoang411 Nov 27, 2023
5135656
add cpu_threads api + docs
hahuyhoang411 Nov 27, 2023
693ba46
fix table views in openai-python
hahuyhoang411 Nov 27, 2023
925a482
fix typo in Palchat
hahuyhoang411 Nov 27, 2023
5b432e8
succinct voice tone for features
hahuyhoang411 Nov 27, 2023
8245cb1
update the quickstart
hahuyhoang411 Nov 27, 2023
e6c5290
update the about
hahuyhoang411 Nov 27, 2023
ce8b162
update the embeddings
hahuyhoang411 Nov 27, 2023
d91670f
multithread update docs
hahuyhoang411 Nov 27, 2023
5f2d7c3
change palchat
hahuyhoang411 Nov 27, 2023
03e7a46
update openai-py
hahuyhoang411 Nov 27, 2023
0d5c546
update openai-node
hahuyhoang411 Nov 27, 2023
70bde1a
update openai-node docs
hahuyhoang411 Nov 27, 2023
24 changes: 0 additions & 24 deletions docs/docs/demos/chatbox-vid.mdx

This file was deleted.

63 changes: 0 additions & 63 deletions docs/docs/examples/chatbox.md

This file was deleted.

1 change: 1 addition & 0 deletions docs/docs/examples/jan.md
@@ -1,5 +1,6 @@
---
title: Nitro with Jan
description: Nitro integrates with Jan to enable a ChatGPT-like functional app, optimized for local AI.
---

You can effortlessly use Nitro through [Jan](https://jan.ai/), since Jan is fully integrated with all of Nitro's functions. With Jan, using Nitro is straightforward and requires no coding.
29 changes: 19 additions & 10 deletions docs/docs/examples/openai-node.md
@@ -1,9 +1,10 @@
---
title: Nitro with openai-node
description: Nitro integration guide for Node.js.
---

You can quickly migrate from the OpenAI API or Azure OpenAI to Nitro using your existing NodeJS code.
> The ONLY thing you need to do is to override `baseURL` in `openai` init with `Nitro` URL
> The **ONLY** thing you need to do is to override `baseURL` in `openai` init with `Nitro` URL
- NodeJS OpenAI SDK: https://www.npmjs.com/package/openai
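
As a minimal sketch of that override (assuming a local Nitro server on the default port 3928), only the `baseURL` differs from a stock setup:

```ts
import OpenAI from 'openai';

// A stock OpenAI setup would be: new OpenAI({ apiKey: 'sk-...' })
// For Nitro, only baseURL changes; the key is an unused placeholder.
const openai = new OpenAI({
  baseURL: 'http://localhost:3928/v1/',
  apiKey: 'sk-xxx',
});
```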

## Chat Completion
@@ -240,17 +241,23 @@ embedding();
</table>

## Audio
Coming soon

:::info Coming soon
:::

## How to reproduce
1. Step 1: Dependencies installation
```

**Step 1:** Dependencies installation

```bash
npm install --save openai typescript
# or
yarn add openai
```
2. Step 2: Fill `tsconfig.json`
```json

**Step 2:** Fill `tsconfig.json`

```js
{
  "compilerOptions": {
    "moduleResolution": "node",
@@ -263,7 +270,9 @@ yarn add openai
"lib": ["es2015"]
}
```
3. Step 3: Fill `index.ts` file with code
3. Step 4: Build with `npx tsc`
4. Step 5: Run the code with `node dist/index.js`
5. Step 6: Enjoy!

**Step 3:** Fill `index.ts` file with code.

**Step 4:** Build with `npx tsc`.

**Step 5:** Run the code with `node dist/index.js`.
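
For reference, a minimal `index.ts` along these lines should work (a sketch, not the doc's exact file; it assumes the default port, a model already loaded, and that Nitro accepts a placeholder model name):

```ts
import OpenAI from 'openai';

const openai = new OpenAI({
  baseURL: 'http://localhost:3928/v1/', // local Nitro server
  apiKey: 'sk-xxx', // placeholder; Nitro does not validate it
});

async function main() {
  const completion = await openai.chat.completions.create({
    model: 'gpt-3.5-turbo', // placeholder; Nitro serves the loaded local model
    messages: [{ role: 'user', content: 'Say this is a test' }],
  });
  console.log(completion.choices[0]?.message.content);
}

main();
```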
47 changes: 28 additions & 19 deletions docs/docs/examples/openai-python.md
@@ -1,10 +1,11 @@
---
title: Nitro with openai-python
description: Nitro integration guide for Python.
---


You can quickly migrate from the OpenAI API or Azure OpenAI to Nitro using your existing Python code.
> The ONLY thing you need to do is to override `baseURL` in `openai` init with `Nitro` URL
> The **ONLY** thing you need to do is to override `baseURL` in `openai` init with `Nitro` URL
- Python OpenAI SDK: https://pypi.org/project/openai/
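
As a minimal sketch of that override (assuming a local Nitro server on the default port 3928), only the `base_url` differs from a stock setup:

```python
from openai import OpenAI

# A stock OpenAI setup would be: OpenAI(api_key="sk-...")
# For Nitro, only base_url changes; the key is an unused placeholder.
client = OpenAI(
    base_url="http://localhost:3928/v1/",
    api_key="sk-xxx",
)
```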

## Chat Completion
@@ -22,7 +23,10 @@ import asyncio
from openai import AsyncOpenAI

# gets API Key from environment variable OPENAI_API_KEY
client = AsyncOpenAI(base_url="http://localhost:3928/v1/", api_key="sk-xxx")
client = AsyncOpenAI(
    base_url="http://localhost:3928/v1/",
    api_key="sk-xxx"
)


async def main() -> None:
@@ -74,22 +78,16 @@ asyncio.run(main())
```python
from openai import AzureOpenAI

openai.api_key = '...' # Default is environment variable AZURE_OPENAI_API_KEY
openai.api_key = '...' # Default is AZURE_OPENAI_API_KEY

client = AzureOpenAI(
    api_version=api_version,
    # https://learn.microsoft.com/en-us/azure/cognitive-services/openai/how-to/create-resource?pivots=web-portal#create-a-resource
    azure_endpoint="https://example-endpoint.openai.azure.com",
)

stream = client.chat.completions.create(
    model="deployment-name",  # e.g. gpt-35-instant
    messages=[
        {
            "role": "user",
            "content": "How do I output all files in a directory using Python?",
        },
    ],
    messages=[{"role": "user", "content": "Say this is a test"}],
    stream=True,
)
for part in stream:
@@ -115,11 +113,15 @@ import asyncio
from openai import AsyncOpenAI

# gets API Key from environment variable OPENAI_API_KEY
client = AsyncOpenAI(base_url="http://localhost:3928/v1/", api_key="sk-xxx")
client = AsyncOpenAI(base_url="http://localhost:3928/v1/",
                     api_key="sk-xxx")


async def main() -> None:
    embedding = await client.embeddings.create(input='Hello How are you?', model='text-embedding-ada-002')
    embedding = await client.embeddings.create(
        input='Hello How are you?',
        model='text-embedding-ada-002'
    )
    print(embedding)

asyncio.run(main())
@@ -140,7 +142,10 @@ client = AsyncOpenAI(api_key="sk-xxx")


async def main() -> None:
    embedding = await client.embeddings.create(input='Hello How are you?', model='text-embedding-ada-002')
    embedding = await client.embeddings.create(
        input='Hello How are you?',
        model='text-embedding-ada-002'
    )
    print(embedding)

asyncio.run(main())
@@ -173,13 +178,17 @@ print(embeddings)
</table>

## Audio
Coming soon

:::info Coming soon
:::

## How to reproduce
1. Step 1: Dependencies installation
```
**Step 1:** Dependencies installation.

```bash title="Install OpenAI"
pip install openai
```
3. Step 2: Fill `index.py` file with code
4. Step 3: Run the code with `python index.py`
5. Step 5: Enjoy!

**Step 2:** Fill `index.py` file with code.

**Step 3:** Run the code with `python index.py`.
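
For reference, a minimal `index.py` along these lines should work (a sketch, not the doc's exact file; it assumes the default port, a model already loaded, and that Nitro accepts a placeholder model name):

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:3928/v1/",  # local Nitro server
    api_key="sk-xxx",  # placeholder; Nitro does not validate it
)

completion = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder; Nitro serves the loaded local model
    messages=[{"role": "user", "content": "Say this is a test"}],
)
print(completion.choices[0].message.content)
```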
15 changes: 9 additions & 6 deletions docs/docs/examples/palchat.md
@@ -1,5 +1,6 @@
---
title: Nitro with Pal Chat
description: Nitro integration guide for mobile device usage.
---

This guide demonstrates how to use Nitro with Pal Chat, enabling local AI chat capabilities on mobile devices.
@@ -15,15 +16,15 @@ Pal is a mobile app available on the App Store. It offers a customizable chat pl
**1. Start Nitro server**

Open your terminal:
```
```bash title="Run Nitro"
nitro
```

**2. Download Model**

Use these commands to download and save the [Llama2 7B chat model](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/tree/main):

```bash
```bash title="Get a model"
mkdir model && cd model
wget -O llama-2-7b-model.gguf https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_K_M.gguf?download=true
```
@@ -34,7 +35,7 @@ wget -O llama-2-7b-model.gguf https://huggingface.co/TheBloke/Llama-2-7B-Chat-GG

To load the model, use the following command:

```
```bash title="Load model to the server"
curl http://localhost:3928/inferences/llamacpp/loadmodel \
-H 'Content-Type: application/json' \
-d '{
@@ -44,11 +45,13 @@ curl http://localhost:3928/inferences/llamacpp/loadmodel \
}'
```
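
Before wiring up Pal Chat, you can sanity-check the server with a quick request (a sketch; it assumes Nitro's OpenAI-compatible chat endpoint, and the prompt is arbitrary):

```bash title="Test the server"
curl http://localhost:3928/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```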

**4. Config Pal Chat**
**4. Configure Pal Chat**

In the `OpenAI API Key` field, just type any random text (e.g. key-xxxxxx).

Adjust the `provide custom host` setting under `advanced settings` in Pal Chat to connect with Nitro. Enter your LAN IPv4 address (It should be something like 192.xxx.x.xxx).
Adjust the `provide custom host` setting under `advanced settings` in Pal Chat with your LAN IPv4 address (a series of numbers like 192.xxx.x.xxx).

> For instruction read: [How to find your IP](https://support.microsoft.com/en-us/windows/find-your-ip-address-in-windows-f21a9bbc-c582-55cd-35e0-73431160a1b9)
> For instructions, see: [How to find your IP](https://support.microsoft.com/en-us/windows/find-your-ip-address-in-windows-f21a9bbc-c582-55cd-35e0-73431160a1b9)
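
For a quick terminal check (a sketch; interface names vary by machine):

```bash title="Find your LAN IP"
ipconfig                # Windows
ipconfig getifaddr en0  # macOS (Wi-Fi is usually en0)
hostname -I             # Linux
```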

![PalChat](img/pal.png)

3 changes: 2 additions & 1 deletion docs/docs/features/chat.md
@@ -1,10 +1,11 @@
---
title: Chat Completion
description: Inference engine for chat completion, the same as OpenAI's
---

The Chat Completion feature in Nitro provides a flexible way to interact with any local Large Language Model (LLM).

## Single Request Example
### Single Request Example

To send a single query to your chosen LLM, follow these steps:
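
The collapsed steps boil down to one HTTP call; as a sketch (assuming the default port and a previously loaded model):

```bash title="Single request"
curl http://localhost:3928/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```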

23 changes: 9 additions & 14 deletions docs/docs/features/cont-batch.md
@@ -1,20 +1,19 @@
---
title: Continuous Batching
description: Nitro's continuous batching combines multiple requests, enhancing throughput.
---

## What is continuous batching?
Continuous batching boosts throughput and minimizes latency in large language model (LLM) inference. This technique groups multiple inference requests, significantly improving GPU utilization.

Continuous batching is a powerful technique that significantly boosts throughput in large language model (LLM) inference while minimizing latency. This process dynamically groups multiple inference requests, allowing for more efficient GPU utilization.
**Key Advantages:**

## Why Continuous Batching?
- Increased Throughput.
- Reduced Latency.
- Efficient GPU Use.

Traditional static batching methods can lead to underutilization of GPU resources, as they wait for all sequences in a batch to complete before moving on. Continuous batching overcomes this by allowing new sequences to start processing as soon as others finish, ensuring more consistent and efficient GPU usage.
**Implementation Insight:**

## Benefits of Continuous Batching

- **Increased Throughput:** Improvement over traditional batching methods.
- **Reduced Latency:** Lower p50 latency, leading to faster response times.
- **Efficient Resource Utilization:** Maximizes GPU memory and computational capabilities.
To evaluate its effectiveness, compare continuous batching with traditional methods. For more details on benchmarking, refer to this [article](https://www.anyscale.com/blog/continuous-batching-llm-inference).

## How to use continuous batching
Nitro's `continuous batching` feature allows you to combine multiple requests for the same model execution, enhancing throughput and efficiency.
@@ -30,8 +29,4 @@ curl http://localhost:3928/inferences/llamacpp/loadmodel \
}'
```

For optimal performance, ensure that the `n_parallel` value is set to match the `thread_num`, as detailed in the [Multithreading](features/multi-thread.md) documentation.

### Benchmark and Compare

To understand the impact of continuous batching on your system, perform benchmarks comparing it with traditional batching methods. This [article](https://www.anyscale.com/blog/continuous-batching-llm-inference) will help you quantify improvements in throughput and latency.
For optimal performance, ensure that the `n_parallel` value is set to match the `thread_num`, as detailed in the [Multithreading](features/multi-thread.md) documentation.
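
To see the batching effect, one rough check is to fire several requests concurrently and compare wall-clock time against sending them one by one (a sketch; it assumes the default port, a loaded model, and `n_parallel` greater than 1):

```bash title="Concurrent requests"
for i in 1 2 3 4; do
  curl -s http://localhost:3928/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello"}]}' &
done
wait  # block until all background requests finish
```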