Add the tensorrt example yaml file, and update the install code #57

Merged
merged 4 commits on Jun 26, 2024
4 changes: 4 additions & 0 deletions docs/cortex-onnx.mdx
@@ -4,6 +4,10 @@ description: Onnx Architecture
slug: "cortex-onnx"
---

:::warning
🚧 Cortex is under construction.
:::

## Introduction
Cortex.onnx is a C++ inference library for Windows built on [onnxruntime-genai](https://github.com/microsoft/onnxruntime-genai), which uses DirectML for hardware acceleration. [DirectML](https://github.com/microsoft/DirectML) is a high-performance DirectX 12 library for machine learning that provides GPU acceleration across a wide range of hardware and drivers, including AMD, Intel, NVIDIA, and Qualcomm GPUs. Cortex.onnx integrates [onnxruntime-genai](https://github.com/microsoft/onnxruntime-genai) for inference and occasionally contributes changes upstream.

4 changes: 4 additions & 0 deletions docs/cortex-tensorrt-llm.mdx
@@ -4,6 +4,10 @@ description: NVIDIA TensorRT-LLM Architecture
slug: "cortex-tensorrt-llm"
---

:::warning
🚧 Cortex is under construction.
:::

## Introduction

[Cortex.tensorrt-llm](https://github.com/janhq/cortex.tensorrt-llm) is a C++ inference library for NVIDIA GPUs. It includes NVIDIA's [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) as a submodule for GPU-accelerated inference.
19 changes: 19 additions & 0 deletions docs/hardware.md
@@ -4,6 +4,10 @@ description: Get started quickly with Jan, a ChatGPT-alternative that runs on yo
slug: "/hardware"
---

:::warning
🚧 Cortex is under construction.
:::

To run LLMs on device, Cortex has the following hardware requirements:

:::info
@@ -16,6 +20,21 @@ These are the general hardware requirements for running Cortex on your system. P
- Windows 10 or higher.
- Ubuntu 12.04 and later.

## CPU
- Mac:
  - Supports both Intel and Apple Silicon CPUs.
- Windows & Linux:
  - Haswell processors (Q2 2013) and newer.
  - Tiger Lake (Q3 2020) and newer for Celeron and Pentium processors.

:::info
Cortex supports multiple CPU instruction sets: AVX, AVX2, and AVX-512.
:::
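One quick way to verify which of these instruction sets your CPU exposes is to inspect the CPU feature flags directly (a minimal sketch; the `/proc/cpuinfo` path assumes Linux, and the `sysctl` keys assume an Intel Mac):

```sh
# Linux: list the AVX-family flags the kernel reports for this CPU
grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep -i '^avx'

# macOS (Intel): AVX appears in machdep.cpu.features, AVX2 in leaf7_features
sysctl machdep.cpu.features machdep.cpu.leaf7_features | tr ' ' '\n' | grep -i avx
```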
## GPU
- At least 6 GB of VRAM is recommended when using NVIDIA, AMD, or Intel Arc GPUs.

:::info
Cortex currently supports NVIDIA GPU acceleration. (Support for other GPUs is coming soon!)
:::
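To confirm an NVIDIA card meets the 6 GB recommendation, `nvidia-smi` can report the installed VRAM (assumes the NVIDIA driver is already installed):

```sh
# Prints each GPU's name and total memory, e.g. "NVIDIA GeForce RTX 3060, 6144 MiB"
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
```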
## RAM (CPU Mode)

- 8GB for running up to 3B models.
4 changes: 2 additions & 2 deletions docs/installation/linux.mdx
@@ -110,7 +110,7 @@ Install NPM on your machine before proceeding with this step.

```sh
# Install globally on your system
-npm i -g @janhq/cortex
+npm i -g cortexso
```
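After the install completes, a quick sanity check confirms the CLI is on your `PATH` (this reuses the `cortex -h` command shown further down this page):

```sh
# Should print the Cortex CLI help text
cortex -h
```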
:::info
Cortex automatically detects your CPU and GPU, then downloads the build for the appropriate CPU instruction set and the dependencies required to optimize GPU performance.
@@ -160,5 +160,5 @@ cortex -h
Run the following command to uninstall Cortex globally on your machine:
```sh
# Uninstall globally on your system
-npm uninstall -g @janhq/cortex
+npm uninstall -g cortexso
```
4 changes: 2 additions & 2 deletions docs/installation/mac.mdx
@@ -66,7 +66,7 @@ Install NPM on your machine before proceeding with this step.

```sh
# Install globally on your system
-npm i -g @janhq/cortex
+npm i -g cortexso
```
:::info
Cortex automatically detects your CPU and GPU, then downloads the build for the appropriate CPU instruction set and the dependencies required to optimize GPU performance.
@@ -115,5 +115,5 @@ cortex -h
Run the following command to uninstall Cortex globally on your machine:
```sh
# Uninstall globally using NPM
-npm uninstall -g @janhq/cortex
+npm uninstall -g cortexso
```
6 changes: 3 additions & 3 deletions docs/installation/windows.mdx
@@ -88,7 +88,7 @@ Install NPM on your machine before proceeding with this step.

```sh
# Install globally on your system
-npm i -g @janhq/cortex
+npm i -g cortexso
```
:::info
Cortex automatically detects your CPU and GPU, then downloads the build for the appropriate CPU instruction set and the dependencies required to optimize GPU performance.
@@ -122,7 +122,7 @@ Install NPM on your machine before proceeding with this step.

```sh
# Install globally on your system
-npm i -g @janhq/cortex
+npm i -g cortexso
```
:::info
Cortex automatically detects your CPU and GPU, then downloads the build for the appropriate CPU instruction set and the dependencies required to optimize GPU performance.
@@ -171,5 +171,5 @@ cortex -h
Run the following command to uninstall Cortex globally on your machine:
```sh
# Uninstall globally on your system
-npm uninstall -g @janhq/cortex
+npm uninstall -g cortexso
```
63 changes: 63 additions & 0 deletions docs/model-operations.mdx
@@ -194,6 +194,69 @@ If a `model.yaml` is not available, Cortex autogenerates it from model metadata.

</TabItem>
<TabItem value="TensorRT-LLM" label="TensorRT-LLM">
**Example `model.yaml` for `llama3 8B-tensorrt-llm-windows-ada`**
```yaml
name: Llama 3
model: llama3:8B
version: 1

# files:

# Results Preferences
top_p: 0.95
temperature: 0.7
frequency_penalty: 0
presence_penalty: 0
max_tokens: 8192 # Infer from base config.json -> max_position_embeddings
stream: true # true | false

# Engine / Model Settings
engine: cortex.tensorrtllm
os: windows # from CI env var
gpu_arch: ada # from CI env var
quantization_method: awq # from CI env var
precision: int4 # from CI env var
tp: 1 # Tensor parallelism degree, from CI env var
trtllm_version: 0.9.0 # from CI env var
ctx_len: 8192 # Infer from base config.json -> max_position_embeddings
text_model: false # Fixed value - https://github.com/janhq/jan/blob/dev/extensions/tensorrt-llm-extension/resources/models.json#L41C7-L41C26
prompt_template: "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
# Prompt template: Can only be retrieved from instruct model
# - https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/blob/main/tokenizer_config.json#L2053
# - Requires jinja format parser
```
**Example `model.yaml` for `llama3 8B-tensorrt-llm-linux-ada`**

```yaml
name: Llama 3
model: llama3:8B
version: 1

# files:

# Results Preferences
top_p: 0.95
temperature: 0.7
frequency_penalty: 0
presence_penalty: 0
max_tokens: 8192 # Infer from base config.json -> max_position_embeddings
stream: true # true | false

# Engine / Model Settings
engine: cortex.tensorrtllm
os: linux # from CI env var
gpu_arch: ada # from CI env var
quantization_method: awq # from CI env var
precision: int4 # from CI env var
tp: 1 # Tensor parallelism degree, from CI env var
trtllm_version: 0.9.0 # from CI env var
ctx_len: 8192 # Infer from base config.json -> max_position_embeddings
text_model: false # Fixed value - https://github.com/janhq/jan/blob/dev/extensions/tensorrt-llm-extension/resources/models.json#L41C7-L41C26
prompt_template: "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
# Prompt template: Can only be retrieved from instruct model
# - https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/blob/main/tokenizer_config.json#L2053
# - Requires jinja format parser
```
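Both examples pin `gpu_arch: ada`, so the prebuilt engine only matches Ada Lovelace GPUs. One way to confirm the local card's architecture is its CUDA compute capability, which Ada reports as 8.9 (a sketch that assumes a recent NVIDIA driver, since older `nvidia-smi` builds lack the `compute_cap` query field):

```sh
# Ada Lovelace GPUs (RTX 40-series, L4, L40) report compute capability 8.9
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
```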
</TabItem>
</Tabs>

5 changes: 4 additions & 1 deletion docs/quickstart.md
@@ -58,8 +58,9 @@ cortex pull bartowski/Hermes-2-Theta-Llama-3-70B-GGUF
# Download a preconfigured model from https://huggingface.co/cortexhub
cortex pull llama3
```

:::info
Read more about [model operations](./model-operations).
:::

## Get Help

@@ -69,9 +70,11 @@ cortex
cortex CMD -h
```

:::info
Cortex is still in early development, so if you have any questions, please reach out to us:
- [GitHub](https://github.com/janhq/cortex)
- [Discord](https://discord.gg/YFKKeuVu)
:::

## Next Steps

3 changes: 3 additions & 0 deletions docs/telemetry.mdx
@@ -8,6 +8,9 @@ slug: "telemetry"
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

:::warning
🚧 Cortex is under construction.
:::

## Introduction
We collect anonymous usage data to enhance our product development.