<a href="https://colab.research.google.com/github/jonbaer/googlecolab/blob/master/Llama3_1_S_v0_2_Inference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sample Inference Code for LLAMA3.1-S-v0.2: A Speech Multimodal Model That Natively Understands Audio and Text Input
<div class="align-center">
  <img src="https://raw.githubusercontent.com/janhq/llama3-s/main/images/llama-listen.jpg" width="200">
  <p><small>Image source: <a href="https://www.amazon.co.uk/When-Llama-Learns-Listen-Feelings/dp/1839237988">"When Llama Learns to Listen"</a></small></p>
</div>

## Install Dependencies

In [None]:
%%shell
pip install -q openai-whisper==20231117 IPython matplotlib vector_quantize_pytorch webdataset
pip install -q git+https://github.com/homebrewltd/WhisperSpeech.git
pip install -q -U transformers bitsandbytes

  Preparing metadata (setup.py) ... done

In [None]:
import torch
import torchaudio
from whisperspeech.vq_stoks import RQBottleneckTransformer
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
from huggingface_hub import hf_hub_download
import os

## Download two audio prompts: one asking the model to code a Python script, one asking it to write a story

In [None]:
%%shell
wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1xwVCMtfDb_eRhuSSSP-_6SAiClQNZ9xp' -O codeapythonscript.wav
wget --no-check-certificate 'https://drive.google.com/uc?export=download&id=1IShlXCiNrY0QBs7TeKxOH2zoh3IzXRrF' -O writeastory.wav

--2024-08-23 11:41:30--  https://docs.google.com/uc?export=download&id=1xwVCMtfDb_eRhuSSSP-_6SAiClQNZ9xp
Resolving docs.google.com (docs.google.com)... 74.125.197.100, 74.125.197.101, 74.125.197.138, ...
Connecting to docs.google.com (docs.google.com)|74.125.197.100|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://drive.usercontent.google.com/download?id=1xwVCMtfDb_eRhuSSSP-_6SAiClQNZ9xp&export=download [following]
--2024-08-23 11:41:30--  https://drive.usercontent.google.com/download?id=1xwVCMtfDb_eRhuSSSP-_6SAiClQNZ9xp&export=download
Resolving drive.usercontent.google.com (drive.usercontent.google.com)... 172.253.117.132, 2607:f8b0:400e:c03::84
Connecting to drive.usercontent.google.com (drive.usercontent.google.com)|172.253.117.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 60972 (60K) [audio/wav]
Saving to: ‘codeapythonscript.wav’


2024-08-23 11:41:33 (98.1 MB/s) - ‘codeapythonscript.wav’ saved [60972/60972]



## First, we need to convert the audio file to sound tokens

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
if not os.path.exists("whisper-vq-stoks-medium-en+pl-fixed.model"):
    hf_hub_download(
        repo_id="jan-hq/WhisperVQ",
        filename="whisper-vq-stoks-medium-en+pl-fixed.model",
        local_dir=".",
    )
vq_model = RQBottleneckTransformer.load_model(
        "whisper-vq-stoks-medium-en+pl-fixed.model"
    ).to(device)
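def audio_to_sound_tokens(audio_path, target_bandwidth=1.5, device=device):
    """Encode an audio file as a string of sound tokens for the LLM prompt."""
    # Note: target_bandwidth is accepted but not used by this function.
    vq_model.ensure_whisper(device)

    # Load the audio and resample to 16 kHz if needed.
    wav, sr = torchaudio.load(audio_path)
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, sr, 16000)
    # Encode to discrete codes without tracking gradients.
    with torch.no_grad():
        codes = vq_model.encode_audio(wav.to(device))
        codes = codes[0].cpu().tolist()

    # Render each code as a zero-padded token, e.g. 7 -> <|sound_0007|>.
    result = ''.join(f'<|sound_{num:04d}|>' for num in codes)
    return f'<|sound_start|>{result}<|sound_end|>'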

def audio_to_sound_tokens_transcript(audio_path, target_bandwidth=1.5, device=device):
    """Same as audio_to_sound_tokens, with the transcription marker prepended."""
    sound_tokens = audio_to_sound_tokens(audio_path, target_bandwidth, device)
    return f'<|reserved_special_token_69|>{sound_tokens}'
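
The prompt format produced by the two helpers above can be illustrated without loading the audio encoder. The following is a minimal sketch, assuming the codes are already available as a list of integers that fit in four digits. `format_sound_tokens` and `parse_sound_tokens` are hypothetical helper names introduced here for illustration; they are not part of WhisperSpeech or the model's tooling.

```python
import re

# Matches one sound token and captures its four-digit code.
SOUND_TOKEN = re.compile(r"<\|sound_(\d{4})\|>")


def format_sound_tokens(codes, transcribe=False):
    """Render integer codes in the same token format as the helpers above."""
    body = "".join(f"<|sound_{num:04d}|>" for num in codes)
    prompt = f"<|sound_start|>{body}<|sound_end|>"
    # The transcription marker, when requested, goes before <|sound_start|>.
    if transcribe:
        prompt = "<|reserved_special_token_69|>" + prompt
    return prompt


def parse_sound_tokens(prompt):
    """Recover the integer codes from a prompt string (handy for debugging)."""
    return [int(code) for code in SOUND_TOKEN.findall(prompt)]


example = format_sound_tokens([7, 42, 511])
print(example)
print(parse_sound_tokens(example))
```

Parsing a prompt back into codes is a quick way to check how many sound tokens an audio clip produced before sending it to the model.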

## Then, we can run inference on the model just like any other LLM

In [None]:
def setup_pipeline(model_path, use_4bit=False, use_8bit=False):
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    model_kwargs = {"device_map": "auto"}

    if use_4bit:
        model_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
        )
    elif use_8bit:
        # 8-bit loading takes no compute-dtype or double-quant options;
        # those settings exist only for 4-bit quantization.
        model_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_8bit=True,
        )
    else:
        model_kwargs["torch_dtype"] = torch.bfloat16

    model = AutoModelForCausalLM.from_pretrained(model_path, **model_kwargs)

    return pipeline("text-generation", model=model, tokenizer=tokenizer)

def generate_text(pipe, messages, max_new_tokens=64, temperature=0.0, do_sample=False):
    generation_args = {
        "max_new_tokens": max_new_tokens,
        "return_full_text": False,
        "do_sample": do_sample,
    }
    # temperature only applies when sampling, so pass it only in that case.
    if do_sample:
        generation_args["temperature"] = temperature

    output = pipe(messages, **generation_args)
    return output[0]['generated_text']

# Usage
llm_path = "homebrewltd/llama3.1-s-instruct-v0.2"
pipe = setup_pipeline(llm_path, use_8bit=True)

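
When choosing between `use_4bit`, `use_8bit`, and plain bfloat16 in `setup_pipeline`, a rough estimate of weight memory can help. The sketch below is back-of-the-envelope arithmetic only. It assumes roughly 8 billion parameters (an assumption based on the Llama 3.1 8B family, not something stated in this notebook) and ignores activations, the KV cache, and quantization overhead. `estimate_weight_gib` is a hypothetical helper.

```python
# Approximate storage cost per parameter for each loading mode.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "nf4": 0.5}


def estimate_weight_gib(num_params, precision):
    """Return the approximate size of the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[precision] / 2**30


# Assumption: about 8 billion parameters.
ASSUMED_PARAMS = 8e9

for precision in ("bf16", "int8", "nf4"):
    size = estimate_weight_gib(ASSUMED_PARAMS, precision)
    print(f"{precision}: ~{size:.1f} GiB of weights")
```

Actual GPU usage will be higher than these figures, so treat them as a lower bound when deciding whether a model fits on a given Colab runtime.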

### Code generation

In [None]:
# Usage
sound_tokens = audio_to_sound_tokens("codeapythonscript.wav")

messages = [
    {"role": "user", "content": sound_tokens},
]
generated_text = generate_text(pipe, messages)

print("-"*50)
print("# Model Output: ", generated_text)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


--------------------------------------------------
# Model Output:  Sure, here is a simple Python script that uses the `pandas` library to read a CSV file and then writes it to a SQL database using `sqlalchemy`.

```python
import pandas as pd
from sqlalchemy import create_engine

# Read the CSV file
df = pd.read_csv('your_file.csv')
```

### Story creation

In [None]:
# Usage
sound_tokens = audio_to_sound_tokens("writeastory.wav")

messages = [
    {"role": "user", "content": sound_tokens},
]
generated_text = generate_text(pipe, messages)

print("-"*50)
print("# Model Output: ", generated_text)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


--------------------------------------------------
# Model Output:  Once upon a time, in a small village nestled between the mountains and the sea, lived a young girl named Lily. She was known for her radiant smile and her kind heart. Lily had a unique gift; she could communicate with animals. This gift was a secret she kept hidden from the villagers, fearing they would not


### We can also make the model transcribe the audio by prepending the `<|reserved_special_token_69|>` token


In [None]:
sound_tokens = audio_to_sound_tokens_transcript("writeastory.wav")

messages = [
    {"role": "user", "content": sound_tokens},
]
generated_text = generate_text(pipe, messages)

print("-"*50)
print("# Model Output: ", generated_text)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


--------------------------------------------------
# Model Output:  Write a story
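
Because the two helper functions differ only in the prefix they add, an already-encoded prompt can be switched to transcription mode without re-encoding the audio. The following is a minimal sketch, assuming `sound_tokens` is a string in the format produced earlier. `as_transcription_prompt` and `build_messages` are hypothetical names introduced here for illustration.

```python
TRANSCRIBE_TOKEN = "<|reserved_special_token_69|>"


def as_transcription_prompt(sound_tokens):
    """Prepend the transcription marker unless it is already present."""
    if sound_tokens.startswith(TRANSCRIBE_TOKEN):
        return sound_tokens
    return TRANSCRIBE_TOKEN + sound_tokens


def build_messages(sound_tokens, transcribe=False):
    """Wrap a sound-token prompt in the single-turn chat format used above."""
    content = as_transcription_prompt(sound_tokens) if transcribe else sound_tokens
    return [{"role": "user", "content": content}]


# Placeholder prompt with a single sound token, for demonstration only.
demo = "<|sound_start|><|sound_0001|><|sound_end|>"
print(build_messages(demo, transcribe=True))
```

The resulting list can be passed to `generate_text(pipe, messages)` in the same way as in the cells above.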
