### Multilingual LLM Orion-14B-Chat

https://huggingface.co/OrionStarAI/Orion-14B-Chat/tree/main

**Requires**
- *flash_attn-2.6.3*
- *>=30GB RAM*
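
The 30 GB figure is in line with what the weights alone occupy in `bfloat16`. A back-of-the-envelope sketch, assuming a nominal 14 billion parameters (read off the model name, not measured) and ignoring activations and the KV cache, which come on top:

```python
def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


# Assumption: 14e9 is the nominal size implied by "14B", not an exact count.
NOMINAL_PARAMS = 14e9

bf16_gb = weight_memory_gb(NOMINAL_PARAMS, 2)  # bfloat16: 2 bytes per parameter
fp32_gb = weight_memory_gb(NOMINAL_PARAMS, 4)  # float32: 4 bytes per parameter
print(f"bfloat16: ~{bf16_gb:.0f} GB, float32: ~{fp32_gb:.0f} GB")
```

Loading in `float32` would roughly double the requirement, which is one reason the loading cell in this notebook passes `torch_dtype=torch.bfloat16`.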

If the `flash_attn` build cannot find `nvcc`: I needed NVCC to build flash attention, but it seems that torch ships only a reduced set of CUDA libraries without the compiler. Installing the toolkit from conda-forge resolved the issue for me: `conda install -c conda-forge cudatoolkit-dev -y`

https://stackoverflow.com/questions/52731782/get-cuda-home-environment-path-pytorch

In [1]:
!nvcc --version  # Reports the nvcc visible inside the current environment only

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0
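
The same check can be scripted, for example to fail early before a long build. A minimal sketch using only the standard library; the helper names `parse_nvcc_release` and `installed_nvcc_release` are made up for this example, and the parsing assumes the `release X.Y` wording shown in the output above:

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_nvcc_release(version_text: str) -> Optional[str]:
    """Extract the CUDA release (e.g. '11.7') from `nvcc --version` output."""
    match = re.search(r"release\s+(\d+\.\d+)", version_text)
    return match.group(1) if match else None


def installed_nvcc_release() -> Optional[str]:
    """Return the release of the nvcc found on PATH, or None if there is none."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


print("nvcc release:", installed_nvcc_release())
```

It can be useful to compare the result with `torch.version.cuda`, the CUDA version PyTorch was built against, since a mismatch between the two is a common source of extension build problems.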


In [2]:
from huggingface_hub import snapshot_download

# Hugging Face repository ID of the model to download
repo_id = "OrionStarAI/Orion-14B-Chat"

# Download every file in the repo into the local Hugging Face cache;
# the return value is the path of the downloaded snapshot
local_dir = snapshot_download(repo_id=repo_id)

print(f"Model downloaded to: {local_dir}")


Fetching 31 files:   0%|          | 0/31 [00:00<?, ?it/s]

Model downloaded to: /home/loc/.cache/huggingface/hub/models--OrionStarAI--Orion-14B-Chat/snapshots/7aa75f1e0939fc082e67a7f58af7876907a1875e
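
To see how much disk space the snapshot takes, the files under the returned path can be summed. A small sketch with a hypothetical helper `directory_size_gb`; it assumes the usual cache layout in which snapshot entries are symlinks to the real files, and counts each at the size of its target:

```python
import os


def directory_size_gb(path: str) -> float:
    """Total size of all files under `path`, in decimal gigabytes."""
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                # os.path.getsize follows symlinks, so linked cache entries
                # are counted at the size of the file they point to.
                total_bytes += os.path.getsize(os.path.join(root, name))
            except OSError:
                # Skip broken links or files that vanished mid-walk.
                continue
    return total_bytes / 1e9
```

With the cell above already run, `print(f"{directory_size_gb(local_dir):.1f} GB")` reports the size of the download.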


**Notes:** *installing `flash_attn`*

- Install with a package manager: `pip install flash_attn` or `conda install -c conda-forge flash_attn`

- Install from the source repository:

```bash
git clone https://github.com/HazyResearch/flash-attention
cd flash-attention
pip install .

```
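
Whichever route is used, it is worth confirming that the package is visible from the notebook kernel and which version was installed, since this notebook lists 2.6.3 as the requirement. A minimal sketch using only the standard library; `installed_version` is a hypothetical helper, and it avoids importing the package so it also works on a machine without a GPU:

```python
import importlib.util
from importlib import metadata
from typing import Optional


def installed_version(module_name: str, dist_name: str) -> Optional[str]:
    """Return the installed distribution version, or None if the module is absent."""
    if importlib.util.find_spec(module_name) is None:
        return None
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # Importable, but no distribution metadata was found under that name.
        return "unknown"


print("flash_attn:", installed_version("flash_attn", "flash_attn"))
```

A result of `None` means the kernel cannot see the package, which usually points to an install into a different environment.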

*If the build fails with an error:*

If the error says that `g++` is missing, the `flash_attn` build needs the GNU C++ compiler and cannot find it on your system. Install `g++` and make sure it is on your `PATH`:


     
1. **Install `g++` (GNU C++ Compiler)**:
   Depending on your operating system, you'll need to install `g++` (>=6.0.0, <=11.5.0) to compile the required parts of the package. On **Ubuntu/Debian**:

   ```bash
   sudo apt update
   sudo apt install g++-11
   ```

2. **Verify `g++` Installation**:
   After installation, check that `g++` is available by running:
   ```bash
   g++ --version
   ```
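   You should see the version information of `g++`. If only the versioned package `g++-11` was installed, the binary may be named `g++-11` rather than `g++`, so try `g++-11 --version` as well.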

3. **Retry the Installation**:
   Once `g++` is installed and available, try installing the package again:
   ```bash
   pip install .
   ```
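
The version check from step 2 can also be automated against the range quoted in step 1. A sketch under stated assumptions: the bounds are simply the numbers from this note, the helper names are invented for the example, and the parsing expects GNU-style output where the version is the last `x.y.z` on the first line:

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple

Version = Tuple[int, int, int]

# Bounds quoted in step 1 above; they are this note's numbers, not verified here.
MIN_GXX: Version = (6, 0, 0)
MAX_GXX: Version = (11, 5, 0)


def parse_gxx_version(version_text: str) -> Optional[Version]:
    """Take the last x.y.z number on the first line of `g++ --version` output."""
    lines = version_text.splitlines()
    first_line = lines[0] if lines else ""
    found = re.findall(r"(\d+)\.(\d+)\.(\d+)", first_line)
    if not found:
        return None
    major, minor, patch = found[-1]
    return int(major), int(minor), int(patch)


def gxx_in_range(version: Version) -> bool:
    """True when the version lies within the inclusive range above."""
    return MIN_GXX <= version <= MAX_GXX


gxx = shutil.which("g++")
if gxx is None:
    print("g++ not found on PATH")
else:
    out = subprocess.run([gxx, "--version"], capture_output=True, text=True).stdout
    version = parse_gxx_version(out)
    print("g++ version:", version, "| in range:", version is not None and gxx_in_range(version))
```

Tuples compare element by element, so `(11, 4, 0)` falls inside the range while `(12, 1, 0)` does not.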

In [3]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation.utils import GenerationConfig

repo_id = "OrionStarAI/Orion-14B-Chat"

tokenizer = AutoTokenizer.from_pretrained(repo_id, use_fast=False, trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto",
                                             torch_dtype=torch.bfloat16, trust_remote_code=True)

model.generation_config = GenerationConfig.from_pretrained(repo_id)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [4]:
%%time
messages = [{"role": "user", "content": "Hello, what is your name? "}]
response = model.chat(tokenizer, messages, streaming=False)
print(response)

Hello! I am an AI language model, so I don't have a name like a person does. How can I assist you today?
CPU times: user 3.44 s, sys: 231 ms, total: 3.67 s
Wall time: 3.68 s


In [5]:
%%time
messages = [{"role": "user", "content": "こんにちは、お名前は何ですか？"}]
response = model.chat(tokenizer, messages, streaming=False)
print(response)

私は人工知能アシスタントで、名前はありません。あなたに会えて嬉しいです！
CPU times: user 2.38 s, sys: 15.9 ms, total: 2.39 s
Wall time: 2.38 s


In [6]:
%%time
messages = [{"role": "user", "content": "你好! 你叫什么名字!"}]
response = model.chat(tokenizer, messages, streaming=False)
print(response)

你好！我是一个人工智能助手，我没有具体的名字。有什么我可以帮你的吗？
CPU times: user 2.18 s, sys: 3.52 ms, total: 2.19 s
Wall time: 2.18 s


In [7]:
%%time
messages = [{"role": "user", "content": "안녕! 이름이 뭐예요!"}]
response = model.chat(tokenizer, messages, streaming=False)
print(response)

안녕하세요! 저는 인공지능 보조 도구로, 이름이 있는 것이 아니라 '아이다'라고 불립니다. 무엇을 도와드릴까요?
CPU times: user 3.5 s, sys: 3.28 ms, total: 3.51 s
Wall time: 3.5 s
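
The four cells above repeat the same pattern, so they can be folded into one loop that records the wall-clock time per language. A minimal sketch: `time_prompts` and `echo_chat` are hypothetical names, and `echo_chat` is only a stand-in so the example runs without the model loaded:

```python
import time
from typing import Callable, Dict, List, Tuple


def time_prompts(
    chat_fn: Callable[[str], str], prompts: Dict[str, str]
) -> List[Tuple[str, str, float]]:
    """Run each prompt through chat_fn and record the wall-clock seconds."""
    results = []
    for label, prompt in prompts.items():
        start = time.perf_counter()
        reply = chat_fn(prompt)
        elapsed = time.perf_counter() - start
        results.append((label, reply, elapsed))
    return results


# Stand-in so the sketch runs without the model; swap in the real call, e.g.
# chat_fn = lambda p: model.chat(tokenizer, [{"role": "user", "content": p}], streaming=False)
def echo_chat(prompt: str) -> str:
    return f"(reply to: {prompt})"


prompts = {
    "en": "Hello, what is your name? ",
    "ja": "こんにちは、お名前は何ですか？",
    "zh": "你好! 你叫什么名字!",
    "ko": "안녕! 이름이 뭐예요!",
}

for label, reply, seconds in time_prompts(echo_chat, prompts):
    print(f"[{label}] {seconds:.3f} s  {reply}")
```

Each call sends a fresh single-message history, matching the cells above, so the timings are not affected by earlier turns.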
