### Inference code for Llama 2

- References
    1. Llama 2 [GitHub repository](https://github.com/facebookresearch/llama)
    1. Llama 2 on [Hugging Face](https://huggingface.co/meta-llama)
    1. [Inference sample code](https://www.youtube.com/watch?v=OZbarkziC14)

#### Download Llama 2 (Official)

- Accept the license terms first
    1. Visit the Meta website and accept the license terms and acceptable use policy before submitting the request form.
    1. Requests are processed in 1-2 days.
    1. https://ai.meta.com/resources/models-and-libraries/llama-downloads

- Check your request status after submitting (this gated file is only accessible once access has been granted)
    1. https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/resolve/main/tokenizer_config.json
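
The status check can also be scripted. The following is a minimal sketch using only the standard library, assuming that the Hub answers with HTTP 200 once access is granted and with 401 or 403 otherwise; `describe_status` and `check_access` are hypothetical helper names, and the status mapping is an assumption rather than documented behaviour.

```python
import urllib.error
import urllib.request

# Gated file from the step above; any small file in the repository would do.
STATUS_URL = (
    "https://huggingface.co/meta-llama/Llama-2-7b-chat-hf"
    "/resolve/main/tokenizer_config.json"
)


def describe_status(code: int) -> str:
    """Map an HTTP status code to an access state (assumed mapping)."""
    if code == 200:
        return "granted"
    if code in (401, 403):
        return "not granted yet (or token missing/invalid)"
    return f"unexpected status {code}"


def check_access(token: str, url: str = STATUS_URL) -> str:
    """Request the gated file with a bearer token and report the access state."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return describe_status(response.status)
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; the code is still available.
        return describe_status(err.code)
```

Call `check_access("<your token>")` with a token from your Hugging Face account settings; the function performs one network request and returns a short description.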

#### Download Llama 2 (Unofficial)

- Click a link to download.
    1. `TheBloke/Llama-2-7b-Chat-GPTQ` (`/path-to/Llama-2-7b-Chat-GPTQ`), [Link](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GPTQ): requires GPU and CUDA
    1. `TheBloke/Llama-2-7B-Chat-GGML` (`/path-to/llama-2-7b-chat.ggmlv3.q4_0.bin`), [Link](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML): runs on CPU

In [None]:
!nvidia-smi

Wed Jul 19 09:46:14 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P0    22W / 300W |      0MiB / 16384MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

### Check your VRAM size

In [4]:
!nvidia-smi --query-gpu=memory.total --format=csv

memory.total [MiB]
6144 MiB
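
To use the VRAM figure in later cells, the same query can be run and parsed from Python. This is a minimal sketch, assuming `nvidia-smi` is on the `PATH` and prints a header line followed by one `<number> MiB` row per GPU, as in the output above; `parse_memory_total` and `query_memory_total` are hypothetical helper names.

```python
import subprocess


def parse_memory_total(csv_text: str) -> list[int]:
    """Return the total memory in MiB for each GPU in nvidia-smi CSV output."""
    lines = [line.strip() for line in csv_text.splitlines() if line.strip()]
    # The first non-empty line is the header ("memory.total [MiB]");
    # each remaining line looks like "6144 MiB".
    return [int(line.split()[0]) for line in lines[1:]]


def query_memory_total() -> list[int]:
    """Run the same nvidia-smi query as the cell above and parse its output."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_total(result.stdout)


# Parsing the output shown above, without needing a GPU:
print(parse_memory_total("memory.total [MiB]\n6144 MiB\n"))  # [6144]
```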


In [2]:
!pip install -q transformers accelerate sentencepiece




### HuggingFace Token

Generate an access token in your Hugging Face account settings:

https://huggingface.co/settings/tokens

In [None]:
!huggingface-cli login

        _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
        _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
        _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
        _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
        _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|
        To login, `huggingface_hub` now requires a token generated from https://huggingface.co/settings/tokens .
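
In a non-interactive session the token can be supplied from an environment variable instead of the prompt. A minimal sketch, assuming the token is stored in `HF_TOKEN` (or the older `HUGGING_FACE_HUB_TOKEN`) and that `huggingface_hub` is installed; `token_from_env` and `login_from_env` are hypothetical helper names.

```python
import os
from typing import Mapping, Optional


def token_from_env(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the first non-empty token found in the known variables."""
    for name in ("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN"):
        value = env.get(name, "").strip()
        if value:
            return value
    return None


def login_from_env() -> None:
    """Log in with the token from the environment, or fail with a clear message."""
    # Imported lazily so the helper above works without huggingface_hub.
    from huggingface_hub import login

    token = token_from_env()
    if token is None:
        raise RuntimeError("Set HF_TOKEN before running this cell.")
    login(token=token)
```

Keeping the token in the environment avoids pasting it into a notebook cell, where it would be saved with the file.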

## Solution 1. Load an official model

### Llama-2-7b-chat-hf
https://huggingface.co/meta-llama/Llama-2-7b-chat-hf

In [None]:
from transformers import AutoTokenizer
import transformers
import torch

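# Hub ID of the gated checkpoint; access must already be granted to your account.
model = "meta-llama/Llama-2-7b-chat-hf"

# use_auth_token=True reuses the token stored by `huggingface-cli login`.
tokenizer = AutoTokenizer.from_pretrained(
    model,
    use_auth_token=True,
)

# float16 halves the memory of float32 weights; device_map="auto" lets accelerate place them.
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)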

When you’re offline, reload your files from the directory where you saved them with `PreTrainedModel.from_pretrained()`:

```python
tokenizer = AutoTokenizer.from_pretrained("./your/path/bigscience_t0")
model = AutoModel.from_pretrained("./your/path/bigscience_t0")
```

## Helper function

In [None]:
def gen(x, max_length=200):
    sequences = pipeline(
        x,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        max_length=max_length,
    )

    return sequences[0]["generated_text"].replace(x, "")
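
Note that `str.replace` removes every occurrence of the prompt, including any copy the model repeats later in its answer. A minimal sketch of a stricter variant that strips only the leading copy; `strip_prompt` is a hypothetical helper name, not part of `transformers`.

```python
def strip_prompt(prompt: str, generated_text: str) -> str:
    """Remove the prompt only when it appears at the very start of the output."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    # Leave the text untouched if the pipeline did not echo the prompt.
    return generated_text


# The repeated question inside the answer is preserved:
print(strip_prompt("Any tips?\n", "Any tips?\nYou asked: Any tips?\nYes."))
```

Alternatively, the text-generation pipeline accepts `return_full_text=False`, which returns only the newly generated text.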

In [None]:
print(gen('I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?\n'))

I'm a big fan of crime dramas and thrillers, so here are a few more suggestions:
1. "The Sopranos" - This HBO series is widely considered one of the greatest TV shows of all time. It follows the life of Tony Soprano, a New Jersey mob boss, and explores themes of crime, loyalty, and family.
2. "The Wire" - This show explores the drug trade in Baltimore from multiple perspectives, including law enforcement, drug dealers, and politicians. It's known for its gritty realism and complex characters.
3. "Narcos" - This Netflix series tells the true story of Pablo Escobar, the infamous Colombian drug lord, and the DEA agents who


### Llama-2-13B

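Requires more GPU memory than the 16 GB V100 shown above, for example an A100: the 13B weights alone take about 26 GB in float16 (13 billion parameters × 2 bytes).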

https://huggingface.co/meta-llama/Llama-2-13b-chat-hf

In [None]:
from transformers import AutoTokenizer
import transformers
import torch

model = "meta-llama/Llama-2-13b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(
    model,
)

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)
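
The GPU requirement for the 13B model can be checked with simple arithmetic: the weights need roughly (number of parameters) × (bytes per parameter). A minimal sketch, assuming nominal parameter counts of 7 and 13 billion and ignoring activations and the KV cache, so real usage is higher; `weight_memory_gib` is a hypothetical helper name.

```python
GIB = 1024 ** 3  # bytes per GiB


def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return n_params * bits_per_param / 8 / GIB


# Nominal sizes (assumption); float16 is 16 bits, GPTQ 4-bit is 4 bits per weight.
for name, n_params in [("7B", 7e9), ("13B", 13e9)]:
    fp16 = weight_memory_gib(n_params, 16)
    int4 = weight_memory_gib(n_params, 4)
    print(f"{name}: float16 ~ {fp16:.1f} GiB, 4-bit ~ {int4:.1f} GiB")
```

Under these assumptions the 13B model needs roughly 24 GiB in float16, which exceeds a 16 GiB V100, while the 7B weights (about 13 GiB) fit.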

## Solution 2. Load a quantized unofficial model

### TheBloke/Llama-2-7b-Chat-GPTQ
https://huggingface.co/TheBloke/Llama-2-7b-Chat-GPTQ

Requires an NVIDIA GPU with CUDA installed.

In [None]:
!pip install auto-gptq

In [None]:
from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_name_or_path = "TheBloke/Llama-2-7b-Chat-GPTQ"
model_basename = "gptq_model-4bit-128g"

use_triton = False

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        model_basename=model_basename,
        use_safetensors=True,
        trust_remote_code=True,
        device="cuda:0",
        use_triton=use_triton,
        quantize_config=None)

"""
To download from a specific branch, use the revision parameter, as in this example:

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        revision="gptq-4bit-32g-actorder_True",
        model_basename=model_basename,
        use_safetensors=True,
        trust_remote_code=True,
        device="cuda:0",
        quantize_config=None)
"""

prompt = "Tell me about AI"
system_message = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."
prompt_template=f'''[INST] <<SYS>>
{system_message}
<</SYS>>

{prompt} [/INST]'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
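# do_sample=True is needed for temperature to take effect; without it decoding is greedy
output = model.generate(inputs=input_ids, do_sample=True, temperature=0.7, max_new_tokens=512)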
print(tokenizer.decode(output[0]))

# Inference can also be done using transformers' pipeline

# Prevent printing spurious transformers error when using pipeline with AutoGPTQ
logging.set_verbosity(logging.CRITICAL)

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,  # enable sampling so temperature and top_p are applied
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)

print(pipe(prompt_template)[0]['generated_text'])
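
The chat prompt used above follows a fixed layout: the system message sits between `<<SYS>>` and `<</SYS>>`, and the whole turn is wrapped in `[INST] ... [/INST]`. A minimal sketch that builds the same single-turn string for any pair of messages; `build_llama2_prompt` is a hypothetical helper name, and multi-turn conversations are not covered.

```python
def build_llama2_prompt(user_message: str, system_message: str) -> str:
    """Build a single-turn Llama 2 chat prompt in the layout used above."""
    return (
        f"[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )


print(build_llama2_prompt("Tell me about AI", "You are a helpful assistant."))
```

The returned string can be passed to `tokenizer(...)` or `pipe(...)` in place of `prompt_template`.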
