To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions in our documentation [here](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), and [how to save it](#Save).


### News


Unsloth now supports [gpt-oss RL](https://docs.unsloth.ai/new/gpt-oss-reinforcement-learning) with the fastest inference & lowest VRAM. Try our [new notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-GRPO.ipynb) which automatically creates kernels!

[Vision RL](https://docs.unsloth.ai/new/vision-reinforcement-learning-vlm-rl) is now supported! Train Qwen2.5-VL, Gemma 3 etc. with GSPO or GRPO.

Introducing Unsloth [Standby for RL](https://docs.unsloth.ai/basics/memory-efficient-rl): GRPO is now faster, uses 30% less memory with 2x longer context.

Unsloth now supports Text-to-Speech (TTS) models. Read our [guide here](https://docs.unsloth.ai/basics/text-to-speech-tts-fine-tuning).

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [1]:
%%capture
import os, re
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    import torch; v = re.match(r"[0-9\.]{3,}", str(torch.__version__)).group(0)
    xformers = "xformers==" + ("0.0.32.post2" if v == "2.8.0" else "0.0.29.post3")
    !pip install --no-deps bitsandbytes accelerate {xformers} peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf "datasets>=3.4.1,<4.0.0" "huggingface_hub>=0.34.0" hf_transfer
    !pip install --no-deps unsloth
!pip install transformers==4.55.4
!pip install --no-deps trl==0.22.2
!pip install snac
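The Colab branch of the install cell picks an xformers pin from the installed torch version. The selection logic can be sketched in isolation as below. `pick_xformers_pin` is a hypothetical helper name, not part of Unsloth, and the two pins are simply the ones hard-coded in the cell above.

```python
import re

def pick_xformers_pin(torch_version: str) -> str:
    """Return the xformers requirement string for a torch version string."""
    # Keep only the leading numeric part, e.g. "2.8.0+cu126" -> "2.8.0"
    match = re.match(r"[0-9\.]{3,}", torch_version)
    if match is None:
        raise ValueError(f"Unrecognised torch version: {torch_version!r}")
    version = match.group(0)
    # Same rule as the install cell: one pin for torch 2.8.0, another otherwise
    pin = "0.0.32.post2" if version == "2.8.0" else "0.0.29.post3"
    return "xformers==" + pin

print(pick_xformers_pin("2.8.0+cu126"))
```

Any torch version other than 2.8.0 falls through to the older pin, so it may be worth checking the result if your runtime ships a different torch build.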

### Unsloth

`FastModel` supports loading nearly any model now! This includes Vision and Text models!

Thank you to [Etherll](https://huggingface.co/Etherll) for creating this notebook!

In [2]:
from unsloth import FastLanguageModel
import torch

fourbit_models = [
    # 4bit dynamic quants for superior accuracy and low memory use
    "unsloth/gemma-3-4b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-12b-it-unsloth-bnb-4bit",
    "unsloth/gemma-3-27b-it-unsloth-bnb-4bit",
    # Qwen3 new models
    "unsloth/Qwen3-4B-unsloth-bnb-4bit",
    "unsloth/Qwen3-8B-unsloth-bnb-4bit",
    # Other very popular models!
    "unsloth/Llama-3.1-8B",
    "unsloth/Llama-3.2-3B",
    "unsloth/Llama-3.3-70B",
    "unsloth/mistral-7b-instruct-v0.3",
    "unsloth/Phi-4",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/orpheus-3b-0.1-ft",
    max_seq_length= 2048, # Choose any for long context!
    dtype = None, # Select None for auto detection
    load_in_4bit = False, # Select True for 4bit which reduces memory usage
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.9.11: Fast Llama patching. Transformers: 4.55.4.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors.index.json: 0.00B [00:00, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/1.61G [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.99G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/230 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/22.8M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/508 [00:00<?, ?B/s]

chat_template.jinja: 0.00B [00:00, ?B/s]

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 64, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.9.11 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
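The number of trainable parameters can be estimated by hand: a rank-`r` LoRA adapter on a linear layer with `in` inputs and `out` outputs adds `r * (in + out)` weights. The sketch below assumes Llama-3.2-3B-style layer shapes (hidden size 3072, key/value width 1024, MLP width 8192, 28 layers). These shapes are an assumption and are not stated in the notebook, although under them the total agrees with the trainable-parameter count that the trainer prints later.

```python
# Assumed layer shapes (in_features, out_features); not stated in the notebook
HIDDEN, KV_DIM, MLP_DIM, NUM_LAYERS = 3072, 1024, 8192, 28
RANK = 64  # the value of r passed to get_peft_model

TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}

def lora_param_count(rank, shapes, num_layers):
    """Sum r * (in + out) over every adapted layer, for all decoder layers."""
    per_layer = sum(rank * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * num_layers

trainable = lora_param_count(RANK, TARGET_SHAPES, NUM_LAYERS)
print(f"{trainable:,} trainable LoRA parameters")
```

Halving `r` halves this count, which is a quick way to reason about the memory cost of the suggested ranks.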


<a name="Data"></a>
### Data Prep  

We will use the `MrDragonFox/Elise` dataset, which is designed for training TTS models. Ensure that your dataset follows the required format: **text, audio** for single-speaker models or **source, text, audio** for multi-speaker models. You can modify this section to accommodate your own dataset, but keep this column structure, since the tokenization code below relies on these column names.
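To make the two layouts concrete, here is a minimal sketch of how a row becomes a text prompt, mirroring the rule used in the tokenization cell further down. The rows are made-up placeholders with no audio attached, and `build_text_prompt` is a hypothetical helper name.

```python
def build_text_prompt(example: dict) -> str:
    """Prefix the text with the speaker name when a "source" field exists."""
    if "source" in example:
        return f"{example['source']}: {example['text']}"
    return example["text"]

# Placeholder rows; a real dataset row would also carry an "audio" entry
single_speaker_row = {"text": "Hello there."}
multi_speaker_row = {"source": "speaker_a", "text": "Hello there."}

print(build_text_prompt(single_speaker_row))
print(build_text_prompt(multi_speaker_row))
```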

In [4]:
from datasets import load_dataset
dataset = load_dataset("MrDragonFox/Elise", split = "train")

README.md: 0.00B [00:00, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/328M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1195 [00:00<?, ? examples/s]

In [5]:
#@title Tokenization Function

import locale
import torchaudio.transforms as T
import os
import torch
from snac import SNAC
locale.getpreferredencoding = lambda: "UTF-8"
ds_sample_rate = dataset[0]["audio"]["sampling_rate"]

snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
snac_model = snac_model.to("cuda")
def tokenise_audio(waveform):
  waveform = torch.from_numpy(waveform).unsqueeze(0)
  waveform = waveform.to(dtype=torch.float32)
  resample_transform = T.Resample(orig_freq=ds_sample_rate, new_freq=24000)
  waveform = resample_transform(waveform)

  waveform = waveform.unsqueeze(0).to("cuda")

  #generate the codes from snac
  with torch.inference_mode():
    codes = snac_model.encode(waveform)

  all_codes = []
  for i in range(codes[0].shape[1]):
    all_codes.append(codes[0][0][i].item()+128266)
    all_codes.append(codes[1][0][2*i].item()+128266+4096)
    all_codes.append(codes[2][0][4*i].item()+128266+(2*4096))
    all_codes.append(codes[2][0][(4*i)+1].item()+128266+(3*4096))
    all_codes.append(codes[1][0][(2*i)+1].item()+128266+(4*4096))
    all_codes.append(codes[2][0][(4*i)+2].item()+128266+(5*4096))
    all_codes.append(codes[2][0][(4*i)+3].item()+128266+(6*4096))


  return all_codes

def add_codes(example):
    # Always initialize codes_list to None
    codes_list = None

    try:
        answer_audio = example.get("audio")
        # If there's a valid audio array, tokenise it
        if answer_audio and "array" in answer_audio:
            audio_array = answer_audio["array"]
            codes_list = tokenise_audio(audio_array)
    except Exception as e:
        print(f"Skipping row due to error: {e}")
        # Keep codes_list as None if we fail
    example["codes_list"] = codes_list

    return example

dataset = dataset.map(add_codes, remove_columns=["audio"])

tokeniser_length = 128256
start_of_text = 128000
end_of_text = 128009

start_of_speech = tokeniser_length + 1
end_of_speech = tokeniser_length + 2

start_of_human = tokeniser_length + 3
end_of_human = tokeniser_length + 4

start_of_ai = tokeniser_length + 5
end_of_ai =  tokeniser_length + 6
pad_token = tokeniser_length + 7

audio_tokens_start = tokeniser_length + 10

dataset = dataset.filter(lambda x: x["codes_list"] is not None)
dataset = dataset.filter(lambda x: len(x["codes_list"]) > 0)

def remove_duplicate_frames(example):
    vals = example["codes_list"]
    if len(vals) % 7 != 0:
        raise ValueError("Input list length must be divisible by 7")

    result = vals[:7]

    removed_frames = 0

    for i in range(7, len(vals), 7):
        current_first = vals[i]
        previous_first = result[-7]

        if current_first != previous_first:
            result.extend(vals[i:i+7])
        else:
            removed_frames += 1

    example["codes_list"] = result

    return example

dataset = dataset.map(remove_duplicate_frames)

tok_info = '''*** HERE you can modify the text prompt
If you are training a multi-speaker model (e.g., canopylabs/orpheus-3b-0.1-ft),
ensure that the dataset includes a "source" field and format the input accordingly:
- Single-speaker: f"{example['text']}"
- Multi-speaker: f"{example['source']}: {example['text']}"
'''
print(tok_info)

def create_input_ids(example):
    # Determine whether to include the source field
    text_prompt = f"{example['source']}: {example['text']}" if "source" in example else example["text"]

    text_ids = tokenizer.encode(text_prompt, add_special_tokens=True)
    text_ids.append(end_of_text)

    example["text_tokens"] = text_ids
    input_ids = (
        [start_of_human]
        + example["text_tokens"]
        + [end_of_human]
        + [start_of_ai]
        + [start_of_speech]
        + example["codes_list"]
        + [end_of_speech]
        + [end_of_ai]
    )
    example["input_ids"] = input_ids
    example["labels"] = input_ids
    example["attention_mask"] = [1] * len(input_ids)

    return example


dataset = dataset.map(create_input_ids, remove_columns=["text", "codes_list"])
columns_to_keep = ["input_ids", "labels", "attention_mask"]
columns_to_remove = [col for col in dataset.column_names if col not in columns_to_keep]

dataset = dataset.remove_columns(columns_to_remove)

config.json:   0%|          | 0.00/300 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/79.5M [00:00<?, ?B/s]



Map:   0%|          | 0/1195 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1195 [00:00<?, ? examples/s]

Filter:   0%|          | 0/1195 [00:00<?, ? examples/s]

Map:   0%|          | 0/1195 [00:00<?, ? examples/s]

*** HERE you can modify the text prompt
If you are training a multi-speaker model (e.g., canopylabs/orpheus-3b-0.1-ft),
ensure that the dataset includes a "source" field and format the input accordingly:
- Single-speaker: f"{example['text']}"
- Multi-speaker: f"{example['source']}: {example['text']}"



Map:   0%|          | 0/1195 [00:00<?, ? examples/s]
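The token layout produced by `tokenise_audio` may be easier to see on plain lists. Each frame takes one code from the first SNAC layer, two from the second and four from the third, and each of the seven slots is shifted into its own 4096-wide band above 128266. The sketch below uses made-up codes rather than real SNAC output. `interleave` and `deinterleave` are hypothetical names, and the latter mirrors what `redistribute_codes` does at inference time.

```python
AUDIO_BASE = 128266   # tokeniser_length + 10
BAND = 4096           # width of the band reserved for each of the 7 slots

def interleave(layer_1, layer_2, layer_3):
    """Flatten three code layers into 7 offset tokens per frame."""
    tokens = []
    for i in range(len(layer_1)):
        frame = [
            layer_1[i],
            layer_2[2 * i],
            layer_3[4 * i],
            layer_3[4 * i + 1],
            layer_2[2 * i + 1],
            layer_3[4 * i + 2],
            layer_3[4 * i + 3],
        ]
        tokens.extend(code + AUDIO_BASE + slot * BAND for slot, code in enumerate(frame))
    return tokens

def deinterleave(tokens):
    """Invert interleave: recover the three layers from the flat token list."""
    layer_1, layer_2, layer_3 = [], [], []
    for start in range(0, len(tokens) - len(tokens) % 7, 7):
        frame = [t - AUDIO_BASE - slot * BAND for slot, t in enumerate(tokens[start:start + 7])]
        layer_1.append(frame[0])
        layer_2.extend([frame[1], frame[4]])
        layer_3.extend([frame[2], frame[3], frame[5], frame[6]])
    return layer_1, layer_2, layer_3

# Two frames of made-up codes
l1, l2, l3 = [5, 9], [1, 2, 3, 4], [10, 11, 12, 13, 14, 15, 16, 17]
flat = interleave(l1, l2, l3)
print(len(flat), deinterleave(flat) == (l1, l2, l3))
```

Because every slot lives in its own band, the position of a token within a frame can be recovered from its value alone, which is why the frame length must stay a multiple of 7.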

<a name="Train"></a>
### Train the model
Now let's use the Hugging Face `Trainer`! More docs here: [Transformers docs](https://huggingface.co/docs/transformers/main_classes/trainer). We do 60 steps to speed things up; for a full run, set `num_train_epochs=1` and remove the `max_steps` argument.

**Note:** Using a `per_device_train_batch_size` > 1 may lead to errors in a multi-GPU setup. To avoid issues, ensure `CUDA_VISIBLE_DEVICES` is set to a single GPU (e.g., `CUDA_VISIBLE_DEVICES=0`).
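As a rough guide to how much data the 60-step run covers, the arithmetic can be worked through with the values used in this notebook (1,195 examples, batch size 1, 4 gradient accumulation steps, 1 GPU). The steps-per-epoch figure is an approximation, since the exact rounding depends on the trainer.

```python
import math

num_examples = 1195               # size of the train split loaded above
per_device_batch_size = 1
gradient_accumulation_steps = 4
num_gpus = 1
max_steps = 60

# Examples consumed per optimizer step
effective_batch = per_device_batch_size * gradient_accumulation_steps * num_gpus
# Examples seen over the whole short run
examples_seen = effective_batch * max_steps
fraction_of_epoch = examples_seen / num_examples
# Approximate optimizer steps needed to see every example once
steps_per_epoch = math.ceil(num_examples / effective_batch)

print(f"effective batch size: {effective_batch}")
print(f"examples seen in {max_steps} steps: {examples_seen} ({fraction_of_epoch:.0%} of the data)")
print(f"approximate steps for one full epoch: {steps_per_epoch}")
```

In other words, the quick run sees about a fifth of the dataset, so a full epoch should take roughly five times as long.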

In [6]:
from transformers import TrainingArguments,Trainer,DataCollatorForSeq2Seq
trainer = Trainer(
    model = model,
    train_dataset = dataset,
    args = TrainingArguments(
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 60,
        learning_rate = 2e-4,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

In [7]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
6.818 GB of memory reserved.


In [8]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,195 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 4 x 1) = 4
 "-____-"     Trainable parameters = 97,255,424 of 3,398,122,496 (2.86% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,5.1101
2,4.8689
3,4.9886
4,4.8521
5,5.0037
6,4.841
7,4.8163
8,4.9529
9,5.0102
10,4.6787


In [9]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

201.1533 seconds used for training.
3.35 minutes used for training.
Peak reserved memory = 7.684 GB.
Peak reserved memory for training = 0.866 GB.
Peak reserved memory % of max memory = 52.127 %.
Peak reserved memory for training % of max memory = 5.875 %.


<a name="Inference"></a>
### Inference
Let's run the model! You can change the prompts.



In [10]:
prompts = [
    "Hey there my name is Elise, <giggles> and I'm a speech generation model that can sound like a person.",
]

chosen_voice = "MrDragonFox/Elise" # Speaker prefix prepended to each prompt; set to None for a single-speaker model

In [11]:
#@title Run Inference


FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# Moving snac_model cuda to cpu
snac_model.to("cpu")

prompts_ = [(f"{chosen_voice}: " + p) if chosen_voice else p for p in prompts]

all_input_ids = []

for prompt in prompts_:
  input_ids = tokenizer(prompt, return_tensors="pt").input_ids
  all_input_ids.append(input_ids)

start_token = torch.tensor([[ 128259]], dtype=torch.int64) # Start of human
end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64) # End of text, End of human

all_modified_input_ids = []
for input_ids in all_input_ids:
  modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1) # SOH SOT Text EOT EOH
  all_modified_input_ids.append(modified_input_ids)

all_padded_tensors = []
all_attention_masks = []
max_length = max([modified_input_ids.shape[1] for modified_input_ids in all_modified_input_ids])
for modified_input_ids in all_modified_input_ids:
  padding = max_length - modified_input_ids.shape[1]
  # Use the pad token defined in the data prep section
  padded_tensor = torch.cat([torch.full((1, padding), pad_token, dtype=torch.int64), modified_input_ids], dim=1)
  attention_mask = torch.cat([torch.zeros((1, padding), dtype=torch.int64), torch.ones((1, modified_input_ids.shape[1]), dtype=torch.int64)], dim=1)
  all_padded_tensors.append(padded_tensor)
  all_attention_masks.append(attention_mask)


all_padded_tensors = torch.cat(all_padded_tensors, dim=0)
all_attention_masks = torch.cat(all_attention_masks, dim=0)

input_ids = all_padded_tensors.to("cuda")
attention_mask = all_attention_masks.to("cuda")
generated_ids = model.generate(
      input_ids=input_ids,
      attention_mask=attention_mask,
      max_new_tokens=1200,
      do_sample=True,
      temperature=0.6,
      top_p=0.95,
      repetition_penalty=1.1,
      num_return_sequences=1,
      eos_token_id=128258,
      use_cache=True,
  )
token_to_find = 128257
token_to_remove = 128258

token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)

if len(token_indices[1]) > 0:
    last_occurrence_idx = token_indices[1][-1].item()
    cropped_tensor = generated_ids[:, last_occurrence_idx+1:]
else:
    cropped_tensor = generated_ids

mask = cropped_tensor != token_to_remove

processed_rows = []

for row in cropped_tensor:
    masked_row = row[row != token_to_remove]
    processed_rows.append(masked_row)

code_lists = []

for row in processed_rows:
    row_length = row.size(0)
    new_length = (row_length // 7) * 7
    trimmed_row = row[:new_length]
    trimmed_row = [t - 128266 for t in trimmed_row]
    code_lists.append(trimmed_row)


def redistribute_codes(code_list):
  layer_1 = []
  layer_2 = []
  layer_3 = []
  for i in range((len(code_list)+1)//7):
    layer_1.append(code_list[7*i])
    layer_2.append(code_list[7*i+1]-4096)
    layer_3.append(code_list[7*i+2]-(2*4096))
    layer_3.append(code_list[7*i+3]-(3*4096))
    layer_2.append(code_list[7*i+4]-(4*4096))
    layer_3.append(code_list[7*i+5]-(5*4096))
    layer_3.append(code_list[7*i+6]-(6*4096))
  codes = [torch.tensor(layer_1).unsqueeze(0),
         torch.tensor(layer_2).unsqueeze(0),
         torch.tensor(layer_3).unsqueeze(0)]

  # codes = [c.to("cuda") for c in codes]
  audio_hat = snac_model.decode(codes)
  return audio_hat

my_samples = []
for code_list in code_lists:
  samples = redistribute_codes(code_list)
  my_samples.append(samples)
from IPython.display import display, Audio
if len(prompts) != len(my_samples):
  raise Exception("Number of prompts and samples do not match")
else:
  for i in range(len(my_samples)):
    print(prompts[i])
    samples = my_samples[i]
    display(Audio(samples.detach().squeeze().to("cpu").numpy(), rate=24000))
# Clean up to save RAM
del my_samples,samples

Hey there my name is Elise, <giggles> and I'm a speech generation model that can sound like a person.
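The post-processing between `model.generate` and SNAC decoding can be summarised on a single sequence: keep what follows the last start-of-speech token (128257), drop end-of-speech tokens (128258), trim to a whole number of 7-token frames, and subtract the audio offset (128266). The sketch below uses a plain Python list in place of a tensor and a made-up sequence. `extract_audio_codes` is a hypothetical name, and unlike the cell above it handles one sequence rather than a batch.

```python
START_OF_SPEECH = 128257
END_OF_SPEECH = 128258
AUDIO_BASE = 128266

def extract_audio_codes(generated):
    """Return offset-free audio codes, trimmed to complete 7-token frames."""
    # Keep only what follows the last start-of-speech token, if there is one
    if START_OF_SPEECH in generated:
        last = len(generated) - 1 - generated[::-1].index(START_OF_SPEECH)
        generated = generated[last + 1:]
    # Drop end-of-speech markers
    kept = [t for t in generated if t != END_OF_SPEECH]
    # A partial trailing frame cannot be decoded, so cut it off
    usable = (len(kept) // 7) * 7
    return [t - AUDIO_BASE for t in kept[:usable]]

# Made-up sequence: prompt tokens, start of speech, 9 audio tokens, end of speech
sequence = [128259, 128000, 42, 128009, 128260, 128261, START_OF_SPEECH]
sequence += [AUDIO_BASE + k for k in range(9)]
sequence += [END_OF_SPEECH]

print(extract_audio_codes(sequence))
```

Of the 9 audio tokens in the made-up sequence, only the first 7 survive, because the last 2 do not form a complete frame.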


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [12]:
model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/chat_template.jinja',
 'lora_model/tokenizer.json')

### Saving to float16

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [14]:
# Merge to 16bit
if True: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False:
    model.save_pretrained("model")
    tokenizer.save_pretrained("model")
if False:
    model.push_to_hub("hf/model", token = "")
    tokenizer.push_to_hub("hf/model", token = "")


Found HuggingFace hub cache directory: /root/.cache/huggingface/hub


Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

Checking cache directory for required files...
Successfully copied all 2 files from cache to model.


Unsloth: Preparing safetensor model files: 100%|██████████| 2/2 [00:00<00:00, 17810.21it/s]
Unsloth: Merging weights into 16bit: 100%|██████████| 2/2 [02:01<00:00, 60.52s/it]


Unsloth: Merge process complete.


### Load and use the saved model
You can load the saved model using `FastLanguageModel.from_pretrained` and then use it for inference.

In [15]:
from unsloth import FastLanguageModel

# Load the saved model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "model", # Path to your saved model
    max_seq_length= 2048, # Same as during training
    dtype = None, # Same as during training
    load_in_4bit = False, # Same as during training
)

# Enable native 2x faster inference
FastLanguageModel.for_inference(model)

# Example inference with the loaded model
prompt = "Tell me a short story about a dragon."

# Determine whether to include the source field
chosen_voice = None # None for single-speaker
prompt_ = (f"{chosen_voice}: " + prompt) if chosen_voice else prompt

input_ids = tokenizer(prompt_, return_tensors="pt").input_ids.to("cuda")

start_token = torch.tensor([[ 128259]], dtype=torch.int64).to("cuda") # Start of human
end_tokens = torch.tensor([[128009, 128260]], dtype=torch.int64).to("cuda") # End of text, End of human

modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1) # SOH SOT Text EOT EOH

generated_ids = model.generate(
    input_ids=modified_input_ids,
    max_new_tokens=1200,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.1,
    num_return_sequences=1,
    eos_token_id=128258,
    use_cache = True
)

# Decode the generated tokens
decoded_output = tokenizer.decode(generated_ids[0], skip_special_tokens=False)

print(decoded_output)

==((====))==  Unsloth 2025.9.11: Fast Llama patching. Transformers: 4.55.4.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

<custom_token_3><|begin_of_text|>Tell me a short story about a dragon.<|eot_id|><custom_token_4><custom_token_5><custom_token_1><custom_token_3929><custom_token_5371><custom_token_10025><custom_token_13341><custom_token_20475><custom_token_21625><custom_token_25721><custom_token_727><custom_token_7524><custom_token_8631><custom_token_14121><custom_token_19944><custom_token_23850><custom_token_28604><custom_token_4070><custom_token_6089><custom_token_11181><custom_token_13602><custom_token_17728><custom_token_22298><custom_token_26830><custom_token_3872><custom_token_4616><custom_token_10938><custom_token_14542><custom_token_18818><custom_token_20556><custom_token_27619><custom_token_2544><custom_token_5686><custom_token_12237><custom_token_14309><custom_token_20290><custom_token_20681><custom_token_27916><custom_token_721><custom_token_5803><custom_token_11532><custom_token_15073><custom_token_17303><custom_token_23265><custom_token_25151><custom_token_1456><custom_token_6388><custom_t

And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">Github</a> </i> ⭐️
</div>


# Task
Explain how to download the saved model and run it locally for voice cloning.

## Download the saved model

### Subtask:
Explain how to download the "model" directory created in the notebook to the local machine.


**Reasoning**:
The subtask is to explain how to download the "model" directory. This is a manual process within the Google Colab environment's file browser and does not require any code execution. Therefore, I will provide the instructions as a markdown cell and then finish the task.



To download the "model" directory created in the notebook to your local machine, follow these steps:

1.  **Locate the "model" directory:** In the file browser pane on the left side of your Google Colab environment, navigate to the root directory. You should see a directory named "model".
2.  **Right-click on the "model" directory:** A context menu will appear.
3.  **Select "Download" if it is offered:** The Colab file browser typically offers "Download" for individual files rather than directories. If the option is missing, compress the directory into a single archive first (for example `model.zip`) and download that file instead.
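If the file browser does not offer a download for the directory, one option is to zip it in a notebook cell first. The following is a minimal sketch using the standard library. `zip_directory` and `download_in_colab` are hypothetical helper names, and `google.colab.files.download` is only importable inside a Colab runtime.

```python
import shutil
from pathlib import Path

def zip_directory(directory: str, archive_stem: str) -> str:
    """Compress `directory` into `<archive_stem>.zip` and return the archive path."""
    source = Path(directory).resolve()
    if not source.is_dir():
        raise FileNotFoundError(f"No such directory: {source}")
    # root_dir/base_dir keep the top-level folder name inside the archive
    return shutil.make_archive(
        archive_stem, "zip", root_dir=str(source.parent), base_dir=source.name
    )

def download_in_colab(archive_path: str) -> None:
    """Trigger a browser download; only works inside a Colab runtime."""
    from google.colab import files  # not available outside Colab
    files.download(archive_path)
```

In a Colab cell this would be used as `download_in_colab(zip_directory("model", "model"))`. The merged 16-bit model is several gigabytes, so the browser download may take a while.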

## Set up a local python environment

### Subtask:
Instruct the user to install Python and the required libraries (unsloth, transformers, torch, etc.) in a local environment.


**Reasoning**:
Provide instructions for setting up a local Python environment and installing the necessary libraries for running the model.



In [17]:
# Install Python
print("First, ensure you have Python installed on your local machine. Python 3.8 or later is recommended.")
print("You can download Python from the official website: https://www.python.org/downloads/")

# Create a virtual environment
print("\nNext, create a virtual environment for this project. This helps manage dependencies and avoids conflicts with other Python projects.")
print("Open your terminal or command prompt and navigate to the directory where you downloaded the model.")
print("Run the following command to create a virtual environment (replace 'venv' with your desired environment name):")
print("python -m venv venv")

# Activate the virtual environment
print("\nActivate the virtual environment:")
print("On Windows: .\\venv\\Scripts\\activate")
print("On macOS/Linux: source venv/bin/activate")
print("You should see the name of your virtual environment in your terminal prompt, indicating it's active.")

# Install required libraries
print("\nNow, install the necessary libraries within the activated virtual environment:")
print("pip install unsloth transformers torch torchaudio snac bitsandbytes accelerate xformers")
print("This command installs the core libraries used in the notebook.")

First, ensure you have Python installed on your local machine. Python 3.8 or later is recommended.
You can download Python from the official website: https://www.python.org/downloads/

Next, create a virtual environment for this project. This helps manage dependencies and avoids conflicts with other Python projects.
Open your terminal or command prompt and navigate to the directory where you downloaded the model.
Run the following command to create a virtual environment (replace 'venv' with your desired environment name):
python -m venv venv

Activate the virtual environment:
On Windows: .\venv\Scripts\activate
On macOS/Linux: source venv/bin/activate
You should see the name of your virtual environment in your terminal prompt, indicating it's active.

Now, install the necessary libraries within the activated virtual environment:
pip install unsloth transformers torch torchaudio snac bitsandbytes accelerate xformers
This command installs the core libraries used in the notebook.
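Before running anything heavier, it can help to confirm that the environment actually contains the packages the inference script imports. The sketch below is one way to do that with the standard library. The package list is taken from the imports used in this notebook (including `snac` and `torchaudio`), and `missing_packages` is a hypothetical helper name.

```python
import importlib.util

# Top-level packages imported by the notebook's inference code
REQUIRED = ["unsloth", "transformers", "torch", "torchaudio", "snac"]

def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

This only checks that the packages can be found. It does not verify versions or that a CUDA-capable GPU is available.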


## Load the model locally

### Subtask:
Provide a code snippet showing how to load the saved model from the local directory using `FastLanguageModel.from_pretrained`.


**Reasoning**:
The subtask is to provide a code snippet to load the saved model locally. I will create a code block that imports `FastLanguageModel` and uses `from_pretrained` to load the model from the specified local directory, including comments as requested.



In [18]:
# load_model.py

# Import the necessary class from the unsloth library
from unsloth import FastLanguageModel

# Define the path to the locally saved model directory
# Replace "./model" with the actual path if your directory is elsewhere
model_path = "./model"

# Load the model and tokenizer using FastLanguageModel.from_pretrained
# Use the same parameters (max_seq_length, dtype, load_in_4bit) as during training
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_path,      # Path to your saved model
    max_seq_length = 2048,        # Same as during training
    dtype = None,                 # Same as during training (auto detection)
    load_in_4bit = False,         # Same as during training
)

# The model and tokenizer are now loaded and ready for inference.
# You can now proceed with the inference steps as shown in the original notebook.

print(f"Model and tokenizer loaded successfully from {model_path}")

==((====))==  Unsloth 2025.9.11: Fast Llama patching. Transformers: 4.55.4.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



Model and tokenizer loaded successfully from ./model
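If loading fails locally, a common cause is that the path does not point at the unzipped model directory. A quick sanity check before calling `from_pretrained` might look like the sketch below. It assumes the merged model was saved as a `config.json` plus one or more `*.safetensors` files, which matches the safetensors shards mentioned in the save log but is not guaranteed for every save method. `looks_like_saved_model` is a hypothetical helper name.

```python
from pathlib import Path

def looks_like_saved_model(model_dir: str) -> bool:
    """Heuristic check: the directory has a config and at least one weight shard."""
    path = Path(model_dir)
    has_config = (path / "config.json").is_file()
    has_weights = any(path.glob("*.safetensors"))
    return has_config and has_weights

model_path = "./model"  # same path as in the loading script above
if not looks_like_saved_model(model_path):
    print(f"{model_path} does not look like a saved model directory; check the path.")
```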


## Run inference locally

### Subtask:
Provide a code snippet demonstrating how to perform inference with the loaded model in the local environment, similar to the inference code in the notebook.


**Reasoning**:
The subtask is to provide a code snippet for local inference. This involves copying the inference code from the notebook, adapting it for a local environment by handling device placement (CPU/GPU), and ensuring all necessary components like `snac_model` are loaded. The code block will perform these steps.



In [None]:
# run_inference.py

import torch
import torchaudio.transforms as T
import os
from snac import SNAC
from unsloth import FastLanguageModel
from IPython.display import Audio, display

# Define the path to the locally saved model directory
model_path = "./model" # Replace with your actual model path

# Load the model and tokenizer
print(f"Loading model from {model_path}...")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_path,
    max_seq_length= 2048,
    dtype = None,
    load_in_4bit = False,
)
print("Model and tokenizer loaded.")

# Load the SNAC model
print("Loading SNAC model...")
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
# Move SNAC model to CPU as done in the notebook inference
snac_model.to("cpu")
print("SNAC model loaded.")

# Enable native 2x faster inference for the loaded model
FastLanguageModel.for_inference(model)
print("Fast inference enabled.")


# Define the special tokens used in data prep
tokeniser_length = 128256
start_of_text = 128000
end_of_text = 128009
start_of_speech = tokeniser_length + 1 # 128257
end_of_speech = tokeniser_length + 2 # 128258
start_of_human = tokeniser_length + 3 # 128259
end_of_human = tokeniser_length + 4 # 128260
start_of_ai = tokeniser_length + 5 # 128261
end_of_ai =  tokeniser_length + 6 # 128262
pad_token = tokeniser_length + 7 # 128263
audio_tokens_start = tokeniser_length + 10 # 128266


# Define the prompts and chosen voice
prompts = [
    "Hey there my name is Elise, <giggles> and I'm a speech generation model that can sound like a person.",
]

chosen_voice = "MrDragonFox/Elise" # Replace with the voice you want to clone, or None for single-speaker


# Prepare input_ids
prompts_ = [(f"{chosen_voice}: " + p) if chosen_voice else p for p in prompts]

all_input_ids = []

for prompt in prompts_:
  input_ids = tokenizer(prompt, return_tensors="pt").input_ids
  all_input_ids.append(input_ids)

start_token_tensor = torch.tensor([[ start_of_human]], dtype=torch.int64) # Start of human
end_tokens_tensor = torch.tensor([[end_of_text, end_of_human]], dtype=torch.int64) # End of text, End of human

all_modified_input_ids = []
for input_ids in all_input_ids:
  modified_input_ids = torch.cat([start_token_tensor, input_ids, end_tokens_tensor], dim=1) # SOH SOT Text EOT EOH
  all_modified_input_ids.append(modified_input_ids)

all_padded_tensors = []
all_attention_masks = []
max_length = max([modified_input_ids.shape[1] for modified_input_ids in all_modified_input_ids])
for modified_input_ids in all_modified_input_ids:
  padding = max_length - modified_input_ids.shape[1]
  padded_tensor = torch.cat([torch.full((1, padding), pad_token, dtype=torch.int64), modified_input_ids], dim=1)
  attention_mask = torch.cat([torch.zeros((1, padding), dtype=torch.int64), torch.ones((1, modified_input_ids.shape[1]), dtype=torch.int64)], dim=1)
  all_padded_tensors.append(padded_tensor)
  all_attention_masks.append(attention_mask)


all_padded_tensors = torch.cat(all_padded_tensors, dim=0)
all_attention_masks = torch.cat(all_attention_masks, dim=0)

# Move tensors to the GPU when one is available, otherwise keep them on the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
input_ids = all_padded_tensors.to(device)
attention_mask = all_attention_masks.to(device)
model.to(device) # Move model to device as well


print("Generating audio...")
# Generate tokens
generated_ids = model.generate(
      input_ids=input_ids,
      attention_mask=attention_mask,
      max_new_tokens=1200,
      do_sample=True,
      temperature=0.6,
      top_p=0.95,
      repetition_penalty=1.1,
      num_return_sequences=1,
      eos_token_id=end_of_speech, # Use the end of speech token
      use_cache = True
  )

# Post-process generated tokens to extract speech codes
token_to_find_start_speech = start_of_speech
token_to_remove_end_speech = end_of_speech

# Locate all start_of_speech and end_of_speech tokens in the batch (first/last are picked per row below)
start_indices = (generated_ids == token_to_find_start_speech).nonzero(as_tuple=True)
end_indices = (generated_ids == token_to_remove_end_speech).nonzero(as_tuple=True)

processed_rows = []

if len(start_indices[1]) > 0 and len(end_indices[1]) > 0:
    for i in range(generated_ids.shape[0]):
        # Find the first start token and last end token for the current row
        row_start_indices = (generated_ids[i] == token_to_find_start_speech).nonzero(as_tuple=True)[0]
        row_end_indices = (generated_ids[i] == token_to_remove_end_speech).nonzero(as_tuple=True)[0]

        if len(row_start_indices) > 0 and len(row_end_indices) > 0:
            first_start_idx = row_start_indices[0].item()
            last_end_idx = row_end_indices[-1].item()
            # Extract codes between start and end speech tokens
            cropped_tensor = generated_ids[i, first_start_idx + 1 : last_end_idx]

            # Trim to be divisible by 7
            trimmed_row = cropped_tensor[:(cropped_tensor.size(0) // 7) * 7]

            # Subtract audio_tokens_start offset
            trimmed_row = [t - audio_tokens_start for t in trimmed_row.tolist()]
            processed_rows.append(torch.tensor(trimmed_row)) # Convert back to tensor
        else:
            # If start or end token not found in a row, append empty list or handle as needed
            processed_rows.append(torch.tensor([])) # Append an empty tensor

else:
    print("Start or end speech token not found in generated IDs.")
    # Handle cases where tokens are not found, e.g., append empty lists or raise error


code_lists = processed_rows # Now code_lists contains tensors


def redistribute_codes(code_list):
  if len(code_list) == 0:
      return torch.tensor([]) # Return empty tensor if no codes

  layer_1 = []
  layer_2 = []
  layer_3 = []
  # Ensure code_list is a list or iterable for len()
  if isinstance(code_list, torch.Tensor):
      code_list = code_list.tolist()

  for i in range(len(code_list)//7):
    layer_1.append(code_list[7*i])
    layer_2.append(code_list[7*i+1]-4096)
    layer_3.append(code_list[7*i+2]-(2*4096))
    layer_3.append(code_list[7*i+3]-(3*4096))
    layer_2.append(code_list[7*i+4]-(4*4096))
    layer_3.append(code_list[7*i+5]-(5*4096))
    layer_3.append(code_list[7*i+6]-(6*4096))
  codes = [torch.tensor(layer_1).unsqueeze(0),
         torch.tensor(layer_2).unsqueeze(0),
         torch.tensor(layer_3).unsqueeze(0)]

  # codes = [c.to("cuda") for c in codes] # Keep on CPU for local inference
  audio_hat = snac_model.decode(codes)
  return audio_hat

my_samples = []
for code_list in code_lists:
  samples = redistribute_codes(code_list)
  my_samples.append(samples)

# Display audio (requires a suitable environment like Jupyter/Colab or saving to file)
# For a standard Python script, you might save the audio to a file instead.
# Example of saving to a WAV file:
# import torchaudio
# torchaudio.save("output.wav", samples.detach().squeeze().to("cpu"), 24000)
print("\nGenerated Audio:")
if len(prompts) != len(my_samples):
  print("Warning: Number of prompts and samples do not match.")
else:
  for i in range(len(my_samples)):
    print(f"Prompt: {prompts[i]}")
    samples = my_samples[i]
    if samples.numel() > 0:
        # In a standard Python script, you cannot directly display Audio like this.
        # You would typically save to a file and then play it using system tools.
        # For demonstration purposes, if running in an environment that supports display:
        # display(Audio(samples.detach().squeeze().to("cpu").numpy(), rate=24000))
        print("Audio generated. To play locally, save the tensor to a WAV file.")
        # Example save:
        # try:
        #     torchaudio.save(f"output_{i}.wav", samples.detach().squeeze().to("cpu"), 24000)
        #     print(f"Audio saved to output_{i}.wav")
        # except Exception as e:
        #     print(f"Could not save audio to file: {e}")
    else:
        print("No valid audio codes were generated for this prompt.")

# Clean up
del my_samples

Loading model from ./model...
==((====))==  Unsloth 2025.9.11: Fast Llama patching. Transformers: 4.55.4.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

**Reasoning**:
The previous code block successfully started loading the model and SNAC model. The next step is to finalize the inference process by generating the audio and providing instructions on how to save it locally since direct audio display is not supported in a standard Python script. The code block will complete the inference and provide saving instructions.
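
For environments where `torchaudio.save` is unavailable or fails, a waveform can also be written with Python's standard-library `wave` module. This is an alternative sketch, not the notebook's method: `write_wav_mono16` is a hypothetical helper, and it assumes a list of float samples in [-1.0, 1.0] at the 24 kHz rate used for the SNAC model.

```python
import struct
import wave


def write_wav_mono16(path, samples, sample_rate=24000):
    """Write a list of float samples in [-1.0, 1.0] to a mono 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))            # clip to the valid range
        frames += struct.pack("<h", int(round(s * 32767)))  # little-endian int16
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)           # mono
        wf.setsampwidth(2)           # 2 bytes per sample = 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(bytes(frames))
    return len(samples)
```

With the notebook's variables, a call would look like `write_wav_mono16("output_0.wav", samples.detach().flatten().tolist())`.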



In [1]:
# The previous code block started the loading and inference process.
# This block continues from there, assuming model, tokenizer, and snac_model are loaded.

# The generation and post-processing code from the previous block is included here
# to ensure the full inference pipeline is present in a single conceptual unit,
# even though the execution was split.

# Note: In a real script, the loading and inference would be in one continuous block.

# Define the special tokens again for clarity in this part of the script
tokeniser_length = 128256
start_of_text = 128000
end_of_text = 128009
start_of_speech = tokeniser_length + 1 # 128257
end_of_speech = tokeniser_length + 2 # 128258
start_of_human = tokeniser_length + 3 # 128259
end_of_human = tokeniser_length + 4 # 128260
start_of_ai = tokeniser_length + 5 # 128261
end_of_ai =  tokeniser_length + 6 # 128262
pad_token = tokeniser_length + 7 # 128263
audio_tokens_start = tokeniser_length + 10 # 128266


# Define the prompts and chosen voice again for clarity
prompts = [
    "Hey there my name is Elise, <giggles> and I'm a speech generation model that can sound like a person.",
]

chosen_voice = "MrDragonFox/Elise" # Replace with the voice you want to clone, or None for single-speaker


# Prepare input_ids - This part was done in the previous block but is included here
# for completeness of the inference process.
prompts_ = [(f"{chosen_voice}: " + p) if chosen_voice else p for p in prompts]

all_input_ids = []

for prompt in prompts_:
  input_ids = tokenizer(prompt, return_tensors="pt").input_ids
  all_input_ids.append(input_ids)

start_token_tensor = torch.tensor([[ start_of_human]], dtype=torch.int64) # Start of human
end_tokens_tensor = torch.tensor([[end_of_text, end_of_human]], dtype=torch.int64) # End of text, End of human

all_modified_input_ids = []
for input_ids in all_input_ids:
  modified_input_ids = torch.cat([start_token_tensor, input_ids, end_tokens_tensor], dim=1) # SOH SOT Text EOT EOH
  all_modified_input_ids.append(modified_input_ids)

all_padded_tensors = []
all_attention_masks = []
max_length = max([modified_input_ids.shape[1] for modified_input_ids in all_modified_input_ids])
for modified_input_ids in all_modified_input_ids:
  padding = max_length - modified_input_ids.shape[1]
  padded_tensor = torch.cat([torch.full((1, padding), pad_token, dtype=torch.int64), modified_input_ids], dim=1)
  attention_mask = torch.cat([torch.zeros((1, padding), dtype=torch.int64), torch.ones((1, modified_input_ids.shape[1]), dtype=torch.int64)], dim=1)
  all_padded_tensors.append(padded_tensor)
  all_attention_masks.append(attention_mask)


all_padded_tensors = torch.cat(all_padded_tensors, dim=0)
all_attention_masks = torch.cat(all_attention_masks, dim=0)

# Move tensors and model to the GPU when one is available, otherwise keep them on the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
input_ids = all_padded_tensors.to(device)
attention_mask = all_attention_masks.to(device)
model.to(device) # Ensure model is on the correct device


print("Generating audio...")
# Generate tokens
generated_ids = model.generate(
      input_ids=input_ids,
      attention_mask=attention_mask,
      max_new_tokens=1200,
      do_sample=True,
      temperature=0.6,
      top_p=0.95,
      repetition_penalty=1.1,
      num_return_sequences=1,
      eos_token_id=end_of_speech, # Use the end of speech token
      use_cache = True
  )

# Post-process generated tokens to extract speech codes
token_to_find_start_speech = start_of_speech
token_to_remove_end_speech = end_of_speech

# Locate all start_of_speech and end_of_speech tokens in the batch (first/last are picked per row below)
start_indices = (generated_ids == token_to_find_start_speech).nonzero(as_tuple=True)
end_indices = (generated_ids == token_to_remove_end_speech).nonzero(as_tuple=True)

processed_rows = []

if len(start_indices[1]) > 0 and len(end_indices[1]) > 0:
    for i in range(generated_ids.shape[0]):
        # Find the first start token and last end token for the current row
        row_start_indices = (generated_ids[i] == token_to_find_start_speech).nonzero(as_tuple=True)[0]
        row_end_indices = (generated_ids[i] == token_to_remove_end_speech).nonzero(as_tuple=True)[0]

        if len(row_start_indices) > 0 and len(row_end_indices) > 0:
            first_start_idx = row_start_indices[0].item()
            last_end_idx = row_end_indices[-1].item()
            # Extract codes between start and end speech tokens
            cropped_tensor = generated_ids[i, first_start_idx + 1 : last_end_idx]

            # Trim to be divisible by 7
            trimmed_row = cropped_tensor[:(cropped_tensor.size(0) // 7) * 7]

            # Subtract audio_tokens_start offset
            trimmed_row = [t - audio_tokens_start for t in trimmed_row.tolist()]
            processed_rows.append(torch.tensor(trimmed_row)) # Convert back to tensor
        else:
            # If start or end token not found in a row, append empty list or handle as needed
            processed_rows.append(torch.tensor([])) # Append an empty tensor

else:
    print("Start or end speech token not found in generated IDs.")
    # Handle cases where tokens are not found, e.g., append empty lists or raise error


code_lists = processed_rows # Now code_lists contains tensors


def redistribute_codes(code_list):
  if len(code_list) == 0:
      return torch.tensor([]) # Return empty tensor if no codes

  layer_1 = []
  layer_2 = []
  layer_3 = []
  # Ensure code_list is a list or iterable for len()
  if isinstance(code_list, torch.Tensor):
      code_list = code_list.tolist()

  for i in range(len(code_list)//7):
    layer_1.append(code_list[7*i])
    layer_2.append(code_list[7*i+1]-4096)
    layer_3.append(code_list[7*i+2]-(2*4096))
    layer_3.append(code_list[7*i+3]-(3*4096))
    layer_2.append(code_list[7*i+4]-(4*4096))
    layer_3.append(code_list[7*i+5]-(5*4096))
    layer_3.append(code_list[7*i+6]-(6*4096))
  codes = [torch.tensor(layer_1).unsqueeze(0),
         torch.tensor(layer_2).unsqueeze(0),
         torch.tensor(layer_3).unsqueeze(0)]

  # codes = [c.to("cuda") for c in codes] # Keep on CPU for local inference
  audio_hat = snac_model.decode(codes)
  return audio_hat

my_samples = []
for code_list in code_lists:
  samples = redistribute_codes(code_list)
  my_samples.append(samples)

# Display audio (requires a suitable environment like Jupyter/Colab or saving to file)
# For a standard Python script, you might save the audio to a file instead.
import torchaudio # Import torchaudio for saving

print("\nGenerated Audio:")
if len(prompts) != len(my_samples):
  print("Warning: Number of prompts and samples do not match.")
else:
  for i in range(len(my_samples)):
    print(f"Prompt: {prompts[i]}")
    samples = my_samples[i]
    if samples.numel() > 0:
        # Save to a WAV file
        try:
            output_filename = f"output_{i}.wav"
            torchaudio.save(output_filename, samples.detach().squeeze().to("cpu"), 24000)
            print(f"Audio saved to {output_filename}")
        except Exception as e:
            print(f"Could not save audio to file: {e}")
    else:
        print("No valid audio codes were generated for this prompt.")

# Clean up
del my_samples, samples

NameError: name 'tokenizer' is not defined

**Reasoning**:
The previous execution failed with a `NameError` because `tokenizer` was not defined when the block ran (`model` would have failed the same way). The block was written as a continuation of the earlier one, but that state was not available at execution time, so the block must be self-contained. I will regenerate the entire inference code block, including the model, tokenizer, and SNAC model loading steps, so that every name is defined before it is used.
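
One way to surface this failure mode early is to check for the required names before running the pipeline. A small sketch; the helper `missing_names` is hypothetical and not part of the notebook.

```python
def missing_names(namespace, required):
    """Return the required names that are not bound in the given namespace."""
    return [name for name in required if name not in namespace]


# Names the inference steps rely on; extend as needed.
REQUIRED = ("model", "tokenizer", "snac_model")

missing = missing_names(globals(), REQUIRED)
if missing:
    print("Reload before continuing; undefined names: " + ", ".join(missing))
```

A check like this gives one clear message up front instead of a `NameError` partway through the cell.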



In [2]:
# run_inference.py

import torch
import torchaudio.transforms as T
import os
from snac import SNAC
from unsloth import FastLanguageModel
# In a standard Python script, you typically don't use IPython.display directly.
# from IPython.display import Audio, display # Commented out for standard script compatibility
import torchaudio # Import torchaudio for saving audio

# Define the path to the locally saved model directory
model_path = "./model" # Replace with your actual model path

# Load the model and tokenizer
print(f"Loading model from {model_path}...")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_path,
    max_seq_length= 2048,
    dtype = None,
    load_in_4bit = False,
)
print("Model and tokenizer loaded.")

# Load the SNAC model
print("Loading SNAC model...")
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
# Move SNAC model to CPU as done in the notebook inference
snac_model.to("cpu")
print("SNAC model loaded.")

# Enable native 2x faster inference for the loaded model
FastLanguageModel.for_inference(model)
print("Fast inference enabled.")


# Define the special tokens used in data prep
tokeniser_length = 128256
start_of_text = 128000
end_of_text = 128009
start_of_speech = tokeniser_length + 1 # 128257
end_of_speech = tokeniser_length + 2 # 128258
start_of_human = tokeniser_length + 3 # 128259
end_of_human = tokeniser_length + 4 # 128260
start_of_ai = tokeniser_length + 5 # 128261
end_of_ai =  tokeniser_length + 6 # 128262
pad_token = tokeniser_length + 7 # 128263
audio_tokens_start = tokeniser_length + 10 # 128266


# Define the prompts and chosen voice
prompts = [
    "Hey there my name is Elise, <giggles> and I'm a speech generation model that can sound like a person.",
    "This is a test sentence for voice cloning.", # Added another prompt for testing
]

chosen_voice = "MrDragonFox/Elise" # Replace with the voice you want to clone, or None for single-speaker


# Prepare input_ids
prompts_ = [(f"{chosen_voice}: " + p) if chosen_voice else p for p in prompts]

all_input_ids = []

for prompt in prompts_:
  # Ensure tokenizer is defined and used correctly
  input_ids = tokenizer(prompt, return_tensors="pt").input_ids
  all_input_ids.append(input_ids)

start_token_tensor = torch.tensor([[ start_of_human]], dtype=torch.int64) # Start of human
end_tokens_tensor = torch.tensor([[end_of_text, end_of_human]], dtype=torch.int64) # End of text, End of human

all_modified_input_ids = []
for input_ids in all_input_ids:
  modified_input_ids = torch.cat([start_token_tensor, input_ids, end_tokens_tensor], dim=1) # SOH SOT Text EOT EOH
  all_modified_input_ids.append(modified_input_ids)

all_padded_tensors = []
all_attention_masks = []
max_length = max([modified_input_ids.shape[1] for modified_input_ids in all_modified_input_ids])
for modified_input_ids in all_modified_input_ids:
  padding = max_length - modified_input_ids.shape[1]
  padded_tensor = torch.cat([torch.full((1, padding), pad_token, dtype=torch.int64), modified_input_ids], dim=1)
  attention_mask = torch.cat([torch.zeros((1, padding), dtype=torch.int64), torch.ones((1, modified_input_ids.shape[1]), dtype=torch.int64)], dim=1)
  all_padded_tensors.append(padded_tensor)
  all_attention_masks.append(attention_mask)


all_padded_tensors = torch.cat(all_padded_tensors, dim=0)
all_attention_masks = torch.cat(all_attention_masks, dim=0)

# Move tensors and model to the GPU when one is available, otherwise keep them on the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
input_ids = all_padded_tensors.to(device)
attention_mask = all_attention_masks.to(device)
model.to(device) # Ensure model is on the correct device


print("Generating audio...")
# Generate tokens
generated_ids = model.generate(
      input_ids=input_ids,
      attention_mask=attention_mask,
      max_new_tokens=1200,
      do_sample=True,
      temperature=0.6,
      top_p=0.95,
      repetition_penalty=1.1,
      num_return_sequences=1,
      eos_token_id=end_of_speech, # Use the end of speech token
      use_cache = True
  )

# Post-process generated tokens to extract speech codes
token_to_find_start_speech = start_of_speech
token_to_remove_end_speech = end_of_speech

# Locate all start_of_speech and end_of_speech tokens in the batch (first/last are picked per row below)
start_indices = (generated_ids == token_to_find_start_speech).nonzero(as_tuple=True)
end_indices = (generated_ids == token_to_remove_end_speech).nonzero(as_tuple=True)

processed_rows = []

if len(start_indices[1]) > 0 and len(end_indices[1]) > 0:
    for i in range(generated_ids.shape[0]):
        # Find the first start token and last end token for the current row
        row_start_indices = (generated_ids[i] == token_to_find_start_speech).nonzero(as_tuple=True)[0]
        row_end_indices = (generated_ids[i] == token_to_remove_end_speech).nonzero(as_tuple=True)[0]

        if len(row_start_indices) > 0 and len(row_end_indices) > 0:
            first_start_idx = row_start_indices[0].item()
            last_end_idx = row_end_indices[-1].item()
            # Extract codes between start and end speech tokens
            cropped_tensor = generated_ids[i, first_start_idx + 1 : last_end_idx]

            # Trim to be divisible by 7
            trimmed_row = cropped_tensor[:(cropped_tensor.size(0) // 7) * 7]

            # Subtract audio_tokens_start offset
            trimmed_row = [t - audio_tokens_start for t in trimmed_row.tolist()]
            processed_rows.append(torch.tensor(trimmed_row)) # Convert back to tensor
        else:
            # If start or end token not found in a row, append empty list or handle as needed
            processed_rows.append(torch.tensor([])) # Append an empty tensor

else:
    print("Start or end speech token not found in generated IDs.")
    # Handle cases where tokens are not found, e.g., append empty lists or raise error


code_lists = processed_rows # Now code_lists contains tensors


def redistribute_codes(code_list):
  if len(code_list) == 0:
      return torch.tensor([]) # Return empty tensor if no codes

  layer_1 = []
  layer_2 = []
  layer_3 = []
  # Ensure code_list is a list or iterable for len()
  if isinstance(code_list, torch.Tensor):
      code_list = code_list.tolist()

  for i in range(len(code_list)//7):
    layer_1.append(code_list[7*i])
    layer_2.append(code_list[7*i+1]-4096)
    layer_3.append(code_list[7*i+2]-(2*4096))
    layer_3.append(code_list[7*i+3]-(3*4096))
    layer_2.append(code_list[7*i+4]-(4*4096))
    layer_3.append(code_list[7*i+5]-(5*4096))
    layer_3.append(code_list[7*i+6]-(6*4096))
  codes = [torch.tensor(layer_1).unsqueeze(0),
         torch.tensor(layer_2).unsqueeze(0),
         torch.tensor(layer_3).unsqueeze(0)]

  # codes = [c.to("cuda") for c in codes] # Keep on CPU for local inference
  audio_hat = snac_model.decode(codes)
  return audio_hat

my_samples = []
for code_list in code_lists:
  samples = redistribute_codes(code_list)
  my_samples.append(samples)

# Display audio (requires a suitable environment like Jupyter/Colab or saving to file)
# For a standard Python script, you might save the audio to a file instead.


print("\nGenerated Audio:")
if len(prompts) != len(my_samples):
  print("Warning: Number of prompts and samples do not match.")
else:
  for i in range(len(my_samples)):
    print(f"Prompt: {prompts[i]}")
    samples = my_samples[i]
    if samples.numel() > 0:
        # Save to a WAV file
        try:
            output_filename = f"output_{i}.wav"
            torchaudio.save(output_filename, samples.detach().squeeze().to("cpu"), 24000)
            print(f"Audio saved to {output_filename}")
        except Exception as e:
            print(f"Could not save audio to file: {e}")
    else:
        print("No valid audio codes were generated for this prompt.")

# Clean up
del my_samples
# Optional: delete model and tokenizer if you need to free up memory
# del model, tokenizer
# torch.cuda.empty_cache() # If using CUDA

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
Loading model from ./model...
==((====))==  Unsloth 2025.9.11: Fast Llama patching. Transformers: 4.55.4.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Model and tokenizer loaded.
Loading SNAC model...
SNAC model loaded.
Fast inference enabled.
Generating audio...

Generated Audio:
Prompt: Hey there my name is Elise, <giggles> and I'm a speech generation model that can sound like a person.
Could not save audio to file: Failed to create input filter: "time_base=1/24000:sample_rate=24000:sample_fmt=flt:channel_layout=0x0" (Invalid argument)
Prompt: This is a test sentence for voice cloning.
Could not save audio to file: Failed to create input filter: "time_base=1/24000:sample_rate=24000:sample_fmt=flt:channel_layout=0x0" (Invalid argument)


**Reasoning**:
The previous code block successfully ran the inference and attempted to save the audio, but failed with a `Failed to create input filter` error from torchaudio. This suggests an issue with how the audio tensor is formatted or interpreted by torchaudio's saving function in the current environment. I need to investigate the format of the `samples` tensor and ensure it's compatible with `torchaudio.save`. It should be a 2D tensor with shape `(num_channels, num_frames)`. The current output indicates the tensor might be `squeeze()`d to a 1D tensor before saving, which might be causing the issue. I will modify the saving line to ensure the tensor has a channel dimension before saving.
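
The shape issue can be reproduced without running the model. The sketch below assumes the SNAC decoder returns a tensor shaped `(batch, channel, time)`; the zero tensor is a stand-in for real audio, and `to_channels_first` is a hypothetical helper.

```python
import torch


def to_channels_first(audio):
    """Return a 2D (1, num_frames) CPU tensor from a mono waveform of any rank."""
    return audio.detach().to("cpu").reshape(1, -1)


# Stand-in for a decoded waveform, assumed shape (batch=1, channel=1, time=2400).
decoded = torch.zeros(1, 1, 2400)

print(decoded.squeeze().shape)            # 1D after squeeze(): no channel dimension left
print(to_channels_first(decoded).shape)   # 2D: (num_channels, num_frames)
```

`reshape(1, -1)` works for 1D, 2D, and 3D mono inputs alike, which avoids branching on `ndim`.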



In [3]:
# The previous code block completed the inference but failed to save the audio.
# This block will retry saving the audio with a corrected format.

import torchaudio # Ensure torchaudio is imported

# Assuming model, tokenizer, snac_model, prompts, my_samples are available from previous steps.
# Define the special tokens again for clarity
tokeniser_length = 128256
start_of_text = 128000
end_of_text = 128009
start_of_speech = tokeniser_length + 1 # 128257
end_of_speech = tokeniser_length + 2 # 128258
start_of_human = tokeniser_length + 3 # 128259
end_of_human = tokeniser_length + 4 # 128260
start_of_ai = tokeniser_length + 5 # 128261
end_of_ai =  tokeniser_length + 6 # 128262
pad_token = tokeniser_length + 7 # 128263
audio_tokens_start = tokeniser_length + 10 # 128266

# Define the prompts again for clarity
prompts = [
    "Hey there my name is Elise, <giggles> and I'm a speech generation model that can sound like a person.",
    "This is a test sentence for voice cloning.",
]


print("\nAttempting to save Generated Audio:")
if len(prompts) != len(my_samples):
  print("Warning: Number of prompts and samples do not match.")
else:
  for i in range(len(my_samples)):
    print(f"Prompt: {prompts[i]}")
    samples = my_samples[i]
    if samples.numel() > 0:
        # Save to a WAV file
        try:
            output_filename = f"output_{i}.wav"
            # torchaudio.save expects a 2D tensor shaped (num_channels, num_frames).
            # The decoded samples may carry extra singleton dimensions, e.g. (1, 1, num_frames),
            # so flatten everything into a single mono channel.
            audio_tensor_to_save = samples.detach().to("cpu").reshape(1, -1)

            torchaudio.save(output_filename, audio_tensor_to_save, 24000)
            print(f"Audio saved to {output_filename}")
        except Exception as e:
            print(f"Could not save audio to file: {e}")
    else:
        print("No valid audio codes were generated for this prompt.")

# Clean up
del my_samples
# Optional: delete model and tokenizer if you need to free up memory
# del model, tokenizer
# torch.cuda.empty_cache() # If using CUDA


Attempting to save Generated Audio:


NameError: name 'my_samples' is not defined

**Reasoning**:
The previous execution failed because `my_samples` was not defined when the block ran. As with `tokenizer` and `model`, the variables from the earlier block were not available. The entire inference process, including the generation of `my_samples`, has to live in this code block so that every variable is defined when it is needed. I will regenerate the complete inference code, incorporating the SNAC model loading and the corrected audio saving logic.
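
`my_samples` is produced by splitting each 7-code frame into three SNAC layers before decoding. That de-interleaving step, taken from `redistribute_codes`, can be sketched in plain Python as follows; `split_frames` is a hypothetical name, and the example frame values are made up for illustration.

```python
def split_frames(codes, codebook_size=4096):
    """Split a flat list of 7-code frames into three layers, as redistribute_codes does."""
    usable = len(codes) - len(codes) % 7   # drop any trailing partial frame
    layer_1, layer_2, layer_3 = [], [], []
    for i in range(0, usable, 7):
        f = codes[i:i + 7]
        layer_1.append(f[0])
        layer_2.append(f[1] - 1 * codebook_size)
        layer_3.append(f[2] - 2 * codebook_size)
        layer_3.append(f[3] - 3 * codebook_size)
        layer_2.append(f[4] - 4 * codebook_size)
        layer_3.append(f[5] - 5 * codebook_size)
        layer_3.append(f[6] - 6 * codebook_size)
    return layer_1, layer_2, layer_3


# One full frame plus one stray code that gets dropped.
frame = [5, 4096 + 1, 8192 + 2, 12288 + 3, 16384 + 4, 20480 + 5, 24576 + 6, 99]
print(split_frames(frame))
```

Each frame contributes one code to the first layer, two to the second, and four to the third, which is why the generated sequence is trimmed to a multiple of 7 first.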



In [4]:
# run_inference.py

import torch
import torchaudio.transforms as T
import os
from snac import SNAC
from unsloth import FastLanguageModel
# In a standard Python script, you typically don't use IPython.display directly.
# from IPython.display import Audio, display # Commented out for standard script compatibility
import torchaudio # Import torchaudio for saving audio

# Define the path to the locally saved model directory
model_path = "./model" # Replace with your actual model path

# Load the model and tokenizer
print(f"Loading model from {model_path}...")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_path,
    max_seq_length= 2048,
    dtype = None,
    load_in_4bit = False,
)
print("Model and tokenizer loaded.")

# Load the SNAC model
print("Loading SNAC model...")
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
# Move SNAC model to CPU as done in the notebook inference
snac_model.to("cpu")
print("SNAC model loaded.")

# Enable native 2x faster inference for the loaded model
FastLanguageModel.for_inference(model)
print("Fast inference enabled.")


# Define the special tokens used in data prep
tokeniser_length = 128256
start_of_text = 128000
end_of_text = 128009
start_of_speech = tokeniser_length + 1 # 128257
end_of_speech = tokeniser_length + 2 # 128258
start_of_human = tokeniser_length + 3 # 128259
end_of_human = tokeniser_length + 4 # 128260
start_of_ai = tokeniser_length + 5 # 128261
end_of_ai =  tokeniser_length + 6 # 128262
pad_token = tokeniser_length + 7 # 128263
audio_tokens_start = tokeniser_length + 10 # 128266


# Define the prompts and chosen voice
prompts = [
    "Hey there my name is Elise, <giggles> and I'm a speech generation model that can sound like a person.",
    "This is a test sentence for voice cloning.", # Added another prompt for testing
]

chosen_voice = "MrDragonFox/Elise" # Replace with the voice you want to clone, or None for single-speaker


# Prepare input_ids
prompts_ = [(f"{chosen_voice}: " + p) if chosen_voice else p for p in prompts]

all_input_ids = []

for prompt in prompts_:
  # Ensure tokenizer is defined and used correctly
  input_ids = tokenizer(prompt, return_tensors="pt").input_ids
  all_input_ids.append(input_ids)

start_token_tensor = torch.tensor([[ start_of_human]], dtype=torch.int64) # Start of human
end_tokens_tensor = torch.tensor([[end_of_text, end_of_human]], dtype=torch.int64) # End of text, End of human

all_modified_input_ids = []
for input_ids in all_input_ids:
  modified_input_ids = torch.cat([start_token_tensor, input_ids, end_tokens_tensor], dim=1) # SOH SOT Text EOT EOH
  all_modified_input_ids.append(modified_input_ids)

all_padded_tensors = []
all_attention_masks = []
max_length = max([modified_input_ids.shape[1] for modified_input_ids in all_modified_input_ids])
for modified_input_ids in all_modified_input_ids:
  padding = max_length - modified_input_ids.shape[1]
  padded_tensor = torch.cat([torch.full((1, padding), pad_token, dtype=torch.int64), modified_input_ids], dim=1)
  attention_mask = torch.cat([torch.zeros((1, padding), dtype=torch.int64), torch.ones((1, modified_input_ids.shape[1]), dtype=torch.int64)], dim=1)
  all_padded_tensors.append(padded_tensor)
  all_attention_masks.append(attention_mask)


all_padded_tensors = torch.cat(all_padded_tensors, dim=0)
all_attention_masks = torch.cat(all_attention_masks, dim=0)

# Move tensors and model to the available device (CUDA if present, otherwise CPU)
device = "cuda" if torch.cuda.is_available() else "cpu"
input_ids = all_padded_tensors.to(device)
attention_mask = all_attention_masks.to(device)
model.to(device) # Ensure model is on the correct device


print("Generating audio...")
# Generate tokens
generated_ids = model.generate(
      input_ids=input_ids,
      attention_mask=attention_mask,
      max_new_tokens=1200,
      do_sample=True,
      temperature=0.6,
      top_p=0.95,
      repetition_penalty=1.1,
      num_return_sequences=1,
      eos_token_id=end_of_speech, # Use the end of speech token
      use_cache = True
  )

# Post-process generated tokens to extract speech codes
token_to_find_start_speech = start_of_speech
token_to_remove_end_speech = end_of_speech

# Find the first occurrence of start_of_speech and the last of end_of_speech
start_indices = (generated_ids == token_to_find_start_speech).nonzero(as_tuple=True)
end_indices = (generated_ids == token_to_remove_end_speech).nonzero(as_tuple=True)

processed_rows = []

if len(start_indices[1]) > 0 and len(end_indices[1]) > 0:
    for i in range(generated_ids.shape[0]):
        # Find the first start token and last end token for the current row
        row_start_indices = (generated_ids[i] == token_to_find_start_speech).nonzero(as_tuple=True)[0]
        row_end_indices = (generated_ids[i] == token_to_remove_end_speech).nonzero(as_tuple=True)[0]

        if len(row_start_indices) > 0 and len(row_end_indices) > 0:
            first_start_idx = row_start_indices[0].item()
            last_end_idx = row_end_indices[-1].item()
            # Extract codes between start and end speech tokens
            cropped_tensor = generated_ids[i, first_start_idx + 1 : last_end_idx]

            # Trim to be divisible by 7
            trimmed_row = cropped_tensor[:(cropped_tensor.size(0) // 7) * 7]

            # Subtract audio_tokens_start offset
            trimmed_row = [t - audio_tokens_start for t in trimmed_row.tolist()]
            processed_rows.append(torch.tensor(trimmed_row)) # Convert back to tensor
        else:
            # If start or end token not found in a row, append empty list or handle as needed
            processed_rows.append(torch.tensor([])) # Append an empty tensor

else:
    print("Start or end speech token not found in generated IDs.")
    # Handle cases where tokens are not found, e.g., append empty lists or raise error


code_lists = processed_rows # Now code_lists contains tensors


def redistribute_codes(code_list):
  if len(code_list) == 0:
      return torch.tensor([]) # Return empty tensor if no codes

  layer_1 = []
  layer_2 = []
  layer_3 = []
  # Ensure code_list is a list or iterable for len()
  if isinstance(code_list, torch.Tensor):
      code_list = code_list.tolist()

  for i in range(len(code_list)//7):
    layer_1.append(code_list[7*i])
    layer_2.append(code_list[7*i+1]-4096)
    layer_3.append(code_list[7*i+2]-(2*4096))
    layer_3.append(code_list[7*i+3]-(3*4096))
    layer_2.append(code_list[7*i+4]-(4*4096))
    layer_3.append(code_list[7*i+5]-(5*4096))
    layer_3.append(code_list[7*i+6]-(6*4096))
  codes = [torch.tensor(layer_1).unsqueeze(0),
         torch.tensor(layer_2).unsqueeze(0),
         torch.tensor(layer_3).unsqueeze(0)]

  # codes = [c.to("cuda") for c in codes] # Keep on CPU for local inference
  audio_hat = snac_model.decode(codes)
  return audio_hat

my_samples = []
for code_list in code_lists:
  samples = redistribute_codes(code_list)
  my_samples.append(samples)

# Display audio (requires a suitable environment like Jupyter/Colab or saving to file)
# For a standard Python script, you might save the audio to a file instead.


print("\nGenerated Audio:")
if len(prompts) != len(my_samples):
  print("Warning: Number of prompts and samples do not match.")
else:
  for i in range(len(my_samples)):
    print(f"Prompt: {prompts[i]}")
    samples = my_samples[i]
    if samples.numel() > 0:
        # Save to a WAV file
        try:
            output_filename = f"output_{i}.wav"
            # Ensure the tensor is 2D: (num_channels, num_frames)
            audio_tensor_to_save = samples.detach().to("cpu")
            if audio_tensor_to_save.ndim == 1:
                 audio_tensor_to_save = audio_tensor_to_save.unsqueeze(0)

            torchaudio.save(output_filename, audio_tensor_to_save, 24000)
            print(f"Audio saved to {output_filename}")
        except Exception as e:
            print(f"Could not save audio to file: {e}")
    else:
        print("No valid audio codes were generated for this prompt.")

# Clean up
del my_samples
# Optional: delete model and tokenizer if you need to free up memory
# del model, tokenizer
# torch.cuda.empty_cache() # If using CUDA

Loading model from ./model...
==((====))==  Unsloth 2025.9.11: Fast Llama patching. Transformers: 4.55.4.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Model and tokenizer loaded.
Loading SNAC model...
SNAC model loaded.
Fast inference enabled.
Generating audio...

Generated Audio:
Prompt: Hey there my name is Elise, <giggles> and I'm a speech generation model that can sound like a person.
Could not save audio to file: Input Tensor has to be 2D.
Prompt: This is a test sentence for voice cloning.
Could not save audio to file: Input Tensor has to be 2D.


**Reasoning**:
The previous execution still failed to save the audio: the tensor passed to `torchaudio.save` is not 2D, despite the `unsqueeze(0)` call. That call only runs when `ndim == 1`, so a tensor that already has two dimensions in the wrong layout, or more than two dimensions (for example a batched decoder output), reaches `torchaudio.save` unchanged. I need to check the shape of `samples` explicitly before saving and ensure it is `(1, num_frames)` for mono audio. I will modify the saving logic to reshape the tensor if necessary.
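A minimal sketch of the shape normalization this step is after, assuming the decoded audio may carry extra singleton dimensions such as `(1, 1, num_frames)`. The helper name `to_mono_2d` is hypothetical; it is not part of the notebook, `snac`, or `torchaudio`.

```python
import torch

def to_mono_2d(audio: torch.Tensor) -> torch.Tensor:
    """Return `audio` as a (1, num_frames) tensor, or raise if it is not mono."""
    audio = audio.detach().to("cpu")
    # Drop every singleton dimension: (1, 1, N) -> (N,), (N, 1) -> (N,)
    audio = audio.squeeze()
    if audio.ndim != 1:
        # More than one non-singleton dimension: not a single mono clip
        raise ValueError(f"expected mono audio, got shape {tuple(audio.shape)}")
    # Add the channel dimension back: (N,) -> (1, N)
    return audio.unsqueeze(0)

print(to_mono_2d(torch.zeros(1, 1, 480)).shape)
```

Raising on anything that is not a single mono clip is a design choice for this sketch; it surfaces an unexpected shape immediately instead of silently skipping the file.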



In [5]:
# The previous code block ran the inference but failed to save the audio due to tensor shape.
# This block will retry saving the audio with explicit reshaping.

import torchaudio # Ensure torchaudio is imported

# Assuming model, tokenizer, snac_model, prompts, my_samples are available from previous steps.
# Define the special tokens again for clarity
tokeniser_length = 128256
start_of_text = 128000
end_of_text = 128009
start_of_speech = tokeniser_length + 1 # 128257
end_of_speech = tokeniser_length + 2 # 128258
start_of_human = tokeniser_length + 3 # 128259
end_of_human = tokeniser_length + 4 # 128260
start_of_ai = tokeniser_length + 5 # 128261
end_of_ai =  tokeniser_length + 6 # 128262
pad_token = tokeniser_length + 7 # 128263
audio_tokens_start = tokeniser_length + 10 # 128266

# Define the prompts again for clarity
prompts = [
    "Hey there my name is Elise, <giggles> and I'm a speech generation model that can sound like a person.",
    "This is a test sentence for voice cloning.",
]


print("\nAttempting to save Generated Audio:")
if len(prompts) != len(my_samples):
  print("Warning: Number of prompts and samples do not match.")
else:
  for i in range(len(my_samples)):
    print(f"Prompt: {prompts[i]}")
    samples = my_samples[i]
    if samples.numel() > 0:
        # Save to a WAV file
        try:
            output_filename = f"output_{i}.wav"
            # Drop singleton dims first (e.g. a (1, 1, num_frames) decoder output), then ensure shape (1, num_frames)
            audio_tensor_to_save = samples.detach().to("cpu").squeeze()
            if audio_tensor_to_save.ndim == 1:
                 audio_tensor_to_save = audio_tensor_to_save.unsqueeze(0)
            elif audio_tensor_to_save.ndim == 2 and audio_tensor_to_save.shape[0] != 1:
                 # If it's 2D but not (1, num_frames), assume it's (num_frames, 1) and transpose
                 audio_tensor_to_save = audio_tensor_to_save.T
            elif audio_tensor_to_save.ndim > 2:
                 print(f"Skipping saving for prompt {i} due to unexpected tensor shape: {audio_tensor_to_save.shape}")
                 continue # Skip saving this sample

            # Double check the shape before saving
            if audio_tensor_to_save.ndim == 2 and audio_tensor_to_save.shape[0] == 1:
                torchaudio.save(output_filename, audio_tensor_to_save, 24000)
                print(f"Audio saved to {output_filename}")
            else:
                 print(f"Skipping saving for prompt {i} as tensor could not be reshaped to (1, num_frames). Final shape: {audio_tensor_to_save.shape}")


        except Exception as e:
            print(f"Could not save audio to file: {e}")
    else:
        print("No valid audio codes were generated for this prompt.")

# Clean up
del my_samples
# Optional: delete model and tokenizer if you need to free up memory
# del model, tokenizer
# torch.cuda.empty_cache() # If using CUDA


Attempting to save Generated Audio:


NameError: name 'my_samples' is not defined

**Reasoning**:
The `NameError` occurs because `my_samples` no longer exists in the current execution context: the clean-up step at the end of the earlier inference cell ran `del my_samples`. I must include the entire inference process within this single code block so that all variables are available when needed. This involves loading the model, tokenizer, and SNAC model, performing the token generation and post-processing, and finally saving the audio with the corrected logic.



In [6]:
# run_inference.py

import torch
import torchaudio.transforms as T
import os
from snac import SNAC
from unsloth import FastLanguageModel
# In a standard Python script, you typically don't use IPython.display directly.
# from IPython.display import Audio, display # Commented out for standard script compatibility
import torchaudio # Import torchaudio for saving audio

# Define the path to the locally saved model directory
model_path = "./model" # Replace with your actual model path

# Load the model and tokenizer
print(f"Loading model from {model_path}...")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_path,
    max_seq_length= 2048,
    dtype = None,
    load_in_4bit = False,
)
print("Model and tokenizer loaded.")

# Load the SNAC model
print("Loading SNAC model...")
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
# Move SNAC model to CPU as done in the notebook inference
snac_model.to("cpu")
print("SNAC model loaded.")

# Enable native 2x faster inference for the loaded model
FastLanguageModel.for_inference(model)
print("Fast inference enabled.")


# Define the special tokens used in data prep
tokeniser_length = 128256
start_of_text = 128000
end_of_text = 128009
start_of_speech = tokeniser_length + 1 # 128257
end_of_speech = tokeniser_length + 2 # 128258
start_of_human = tokeniser_length + 3 # 128259
end_of_human = tokeniser_length + 4 # 128260
start_of_ai = tokeniser_length + 5 # 128261
end_of_ai =  tokeniser_length + 6 # 128262
pad_token = tokeniser_length + 7 # 128263
audio_tokens_start = tokeniser_length + 10 # 128266


# Define the prompts and chosen voice
prompts = [
    "Hey there my name is Elise, <giggles> and I'm a speech generation model that can sound like a person.",
    "This is a test sentence for voice cloning.", # Added another prompt for testing
]

chosen_voice = "MrDragonFox/Elise" # Replace with the voice you want to clone, or None for single-speaker


# Prepare input_ids
prompts_ = [(f"{chosen_voice}: " + p) if chosen_voice else p for p in prompts]

all_input_ids = []

for prompt in prompts_:
  # Ensure tokenizer is defined and used correctly
  input_ids = tokenizer(prompt, return_tensors="pt").input_ids
  all_input_ids.append(input_ids)

start_token_tensor = torch.tensor([[ start_of_human]], dtype=torch.int64) # Start of human
end_tokens_tensor = torch.tensor([[end_of_text, end_of_human]], dtype=torch.int64) # End of text, End of human

all_modified_input_ids = []
for input_ids in all_input_ids:
  modified_input_ids = torch.cat([start_token_tensor, input_ids, end_tokens_tensor], dim=1) # SOH SOT Text EOT EOH
  all_modified_input_ids.append(modified_input_ids)

all_padded_tensors = []
all_attention_masks = []
max_length = max([modified_input_ids.shape[1] for modified_input_ids in all_modified_input_ids])
for modified_input_ids in all_modified_input_ids:
  padding = max_length - modified_input_ids.shape[1]
  padded_tensor = torch.cat([torch.full((1, padding), pad_token, dtype=torch.int64), modified_input_ids], dim=1)
  attention_mask = torch.cat([torch.zeros((1, padding), dtype=torch.int64), torch.ones((1, modified_input_ids.shape[1]), dtype=torch.int64)], dim=1)
  all_padded_tensors.append(padded_tensor)
  all_attention_masks.append(attention_mask)


all_padded_tensors = torch.cat(all_padded_tensors, dim=0)
all_attention_masks = torch.cat(all_attention_masks, dim=0)

# Move tensors and model to the available device (CUDA if present, otherwise CPU)
device = "cuda" if torch.cuda.is_available() else "cpu"
input_ids = all_padded_tensors.to(device)
attention_mask = all_attention_masks.to(device)
model.to(device) # Ensure model is on the correct device


print("Generating audio...")
# Generate tokens
generated_ids = model.generate(
      input_ids=input_ids,
      attention_mask=attention_mask,
      max_new_tokens=1200,
      do_sample=True,
      temperature=0.6,
      top_p=0.95,
      repetition_penalty=1.1,
      num_return_sequences=1,
      eos_token_id=end_of_speech, # Use the end of speech token
      use_cache = True
  )

# Post-process generated tokens to extract speech codes
token_to_find_start_speech = start_of_speech
token_to_remove_end_speech = end_of_speech

# Find the first occurrence of start_of_speech and the last of end_of_speech
start_indices = (generated_ids == token_to_find_start_speech).nonzero(as_tuple=True)
end_indices = (generated_ids == token_to_remove_end_speech).nonzero(as_tuple=True)

processed_rows = []

if len(start_indices[1]) > 0 and len(end_indices[1]) > 0:
    for i in range(generated_ids.shape[0]):
        # Find the first start token and last end token for the current row
        row_start_indices = (generated_ids[i] == token_to_find_start_speech).nonzero(as_tuple=True)[0]
        row_end_indices = (generated_ids[i] == token_to_remove_end_speech).nonzero(as_tuple=True)[0]

        if len(row_start_indices) > 0 and len(row_end_indices) > 0:
            first_start_idx = row_start_indices[0].item()
            last_end_idx = row_end_indices[-1].item()
            # Extract codes between start and end speech tokens
            cropped_tensor = generated_ids[i, first_start_idx + 1 : last_end_idx]

            # Trim to be divisible by 7
            trimmed_row = cropped_tensor[:(cropped_tensor.size(0) // 7) * 7]

            # Subtract audio_tokens_start offset
            trimmed_row = [t - audio_tokens_start for t in trimmed_row.tolist()]
            processed_rows.append(torch.tensor(trimmed_row)) # Convert back to tensor
        else:
            # If start or end token not found in a row, append empty list or handle as needed
            processed_rows.append(torch.tensor([])) # Append an empty tensor

else:
    print("Start or end speech token not found in generated IDs.")
    # Handle cases where tokens are not found, e.g., append empty lists or raise error


code_lists = processed_rows # Now code_lists contains tensors


def redistribute_codes(code_list):
  if len(code_list) == 0:
      return torch.tensor([]) # Return empty tensor if no codes

  layer_1 = []
  layer_2 = []
  layer_3 = []
  # Ensure code_list is a list or iterable for len()
  if isinstance(code_list, torch.Tensor):
      code_list = code_list.tolist()

  for i in range(len(code_list)//7):
    layer_1.append(code_list[7*i])
    layer_2.append(code_list[7*i+1]-4096)
    layer_3.append(code_list[7*i+2]-(2*4096))
    layer_3.append(code_list[7*i+3]-(3*4096))
    layer_2.append(code_list[7*i+4]-(4*4096))
    layer_3.append(code_list[7*i+5]-(5*4096))
    layer_3.append(code_list[7*i+6]-(6*4096))
  codes = [torch.tensor(layer_1).unsqueeze(0),
         torch.tensor(layer_2).unsqueeze(0),
         torch.tensor(layer_3).unsqueeze(0)]

  # codes = [c.to("cuda") for c in codes] # Keep on CPU for local inference
  audio_hat = snac_model.decode(codes)
  return audio_hat

my_samples = []
for code_list in code_lists:
  samples = redistribute_codes(code_list)
  my_samples.append(samples)

# Display audio (requires a suitable environment like Jupyter/Colab or saving to file)
# For a standard Python script, you might save the audio to a file instead.


print("\nAttempting to save Generated Audio:")
if len(prompts) != len(my_samples):
  print("Warning: Number of prompts and samples do not match.")
else:
  for i in range(len(my_samples)):
    print(f"Prompt: {prompts[i]}")
    samples = my_samples[i]
    if samples.numel() > 0:
        # Save to a WAV file
        try:
            output_filename = f"output_{i}.wav"
            # Drop singleton dims first (e.g. a (1, 1, num_frames) decoder output), then ensure shape (1, num_frames)
            audio_tensor_to_save = samples.detach().to("cpu").squeeze()
            if audio_tensor_to_save.ndim == 1:
                 audio_tensor_to_save = audio_tensor_to_save.unsqueeze(0)
            elif audio_tensor_to_save.ndim == 2 and audio_tensor_to_save.shape[0] != 1:
                 # If it's 2D but not (1, num_frames), assume it's (num_frames, 1) and transpose
                 audio_tensor_to_save = audio_tensor_to_save.T
            elif audio_tensor_to_save.ndim > 2:
                 print(f"Skipping saving for prompt {i} due to unexpected tensor shape: {audio_tensor_to_save.shape}")
                 continue # Skip saving this sample

            # Double check the shape before saving
            if audio_tensor_to_save.ndim == 2 and audio_tensor_to_save.shape[0] == 1:
                torchaudio.save(output_filename, audio_tensor_to_save, 24000)
                print(f"Audio saved to {output_filename}")
            else:
                 print(f"Skipping saving for prompt {i} as tensor could not be reshaped to (1, num_frames). Final shape: {audio_tensor_to_save.shape}")


        except Exception as e:
            print(f"Could not save audio to file: {e}")
    else:
        print("No valid audio codes were generated for this prompt.")

# Clean up
del my_samples
# Optional: delete model and tokenizer if you need to free up memory
# del model, tokenizer
# torch.cuda.empty_cache() # If using CUDA

Loading model from ./model...
==((====))==  Unsloth 2025.9.11: Fast Llama patching. Transformers: 4.55.4.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



Model and tokenizer loaded.
Loading SNAC model...




SNAC model loaded.
Fast inference enabled.


RuntimeError: You can't move a model that has some modules offloaded to cpu or disk.

**Reasoning**:
The code failed at the explicit `model.to(device)` call. The error message indicates that some of the model's modules have been offloaded to CPU or disk by `accelerate`, and a model in that state cannot be moved with `.to()`. `FastLanguageModel.from_pretrained` from `unsloth` likely handles device placement during loading, especially when dealing with large models and limited VRAM. Removing the explicit `model.to(device)` call should resolve this conflict and allow inference to proceed with the model placed by `unsloth`/`accelerate`. I will regenerate the complete code block without the problematic line.
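One way to avoid moving the model at all is to send the inputs to wherever the model already lives. The sketch below assumes the device of the model's first parameter is the right target for the inputs; the helper `input_device_of` is hypothetical, and a small `torch.nn.Linear` stands in for the real model. If the first parameter is itself offloaded, this heuristic would not be enough.

```python
import torch

def input_device_of(model: torch.nn.Module) -> torch.device:
    """Device of the model's first parameter; inputs are sent there."""
    return next(model.parameters()).device

# Stand-in module for illustration only (not the TTS model from the notebook)
stand_in = torch.nn.Linear(4, 2)

# Move the inputs to the model's device instead of calling model.to(...)
batch = torch.ones(3, 4).to(input_device_of(stand_in))
print(stand_in(batch).shape)
```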



In [None]:
# run_inference.py

import torch
import torchaudio.transforms as T
import os
from snac import SNAC
from unsloth import FastLanguageModel
# In a standard Python script, you typically don't use IPython.display directly.
# from IPython.display import Audio, display # Commented out for standard script compatibility
import torchaudio # Import torchaudio for saving audio

# Define the path to the locally saved model directory
model_path = "./model" # Replace with your actual model path

# Load the model and tokenizer
print(f"Loading model from {model_path}...")
# FastLanguageModel.from_pretrained handles device placement
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_path,
    max_seq_length= 2048,
    dtype = None,
    load_in_4bit = False,
)
print("Model and tokenizer loaded.")

# Load the SNAC model
print("Loading SNAC model...")
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
# Move SNAC model to CPU as done in the notebook inference
snac_model.to("cpu")
print("SNAC model loaded.")

# Enable native 2x faster inference for the loaded model
FastLanguageModel.for_inference(model)
print("Fast inference enabled.")


# Define the special tokens used in data prep
tokeniser_length = 128256
start_of_text = 128000
end_of_text = 128009
start_of_speech = tokeniser_length + 1 # 128257
end_of_speech = tokeniser_length + 2 # 128258
start_of_human = tokeniser_length + 3 # 128259
end_of_human = tokeniser_length + 4 # 128260
start_of_ai = tokeniser_length + 5 # 128261
end_of_ai =  tokeniser_length + 6 # 128262
pad_token = tokeniser_length + 7 # 128263
audio_tokens_start = tokeniser_length + 10 # 128266


# Define the prompts and chosen voice
prompts = [
    "Hey there my name is Elise, <giggles> and I'm a speech generation model that can sound like a person.",
    "This is a test sentence for voice cloning.", # Added another prompt for testing
]

chosen_voice = "MrDragonFox/Elise" # Replace with the voice you want to clone, or None for single-speaker


# Prepare input_ids
prompts_ = [(f"{chosen_voice}: " + p) if chosen_voice else p for p in prompts]

all_input_ids = []

for prompt in prompts_:
  # Ensure tokenizer is defined and used correctly
  input_ids = tokenizer(prompt, return_tensors="pt").input_ids
  all_input_ids.append(input_ids)

start_token_tensor = torch.tensor([[ start_of_human]], dtype=torch.int64) # Start of human
end_tokens_tensor = torch.tensor([[end_of_text, end_of_human]], dtype=torch.int64) # End of text, End of human

all_modified_input_ids = []
for input_ids in all_input_ids:
  modified_input_ids = torch.cat([start_token_tensor, input_ids, end_tokens_tensor], dim=1) # SOH SOT Text EOT EOH
  all_modified_input_ids.append(modified_input_ids)

all_padded_tensors = []
all_attention_masks = []
max_length = max([modified_input_ids.shape[1] for modified_input_ids in all_modified_input_ids])
for modified_input_ids in all_modified_input_ids:
  padding = max_length - modified_input_ids.shape[1]
  padded_tensor = torch.cat([torch.full((1, padding), pad_token, dtype=torch.int64), modified_input_ids], dim=1)
  attention_mask = torch.cat([torch.zeros((1, padding), dtype=torch.int64), torch.ones((1, modified_input_ids.shape[1]), dtype=torch.int64)], dim=1)
  all_padded_tensors.append(padded_tensor)
  all_attention_masks.append(attention_mask)


all_padded_tensors = torch.cat(all_padded_tensors, dim=0)
all_attention_masks = torch.cat(all_attention_masks, dim=0)

# Move input tensors to the available device (CUDA if present, otherwise CPU)
# Removed model.to(device) as accelerate handles it
device = "cuda" if torch.cuda.is_available() else "cpu"
input_ids = all_padded_tensors.to(device)
attention_mask = all_attention_masks.to(device)


print("Generating audio...")
# Generate tokens
generated_ids = model.generate(
      input_ids=input_ids,
      attention_mask=attention_mask,
      max_new_tokens=1200,
      do_sample=True,
      temperature=0.6,
      top_p=0.95,
      repetition_penalty=1.1,
      num_return_sequences=1,
      eos_token_id=end_of_speech, # Use the end of speech token
      use_cache = True
  )

# Post-process generated tokens to extract speech codes
token_to_find_start_speech = start_of_speech
token_to_remove_end_speech = end_of_speech

# Find the first occurrence of start_of_speech and the last of end_of_speech
start_indices = (generated_ids == token_to_find_start_speech).nonzero(as_tuple=True)
end_indices = (generated_ids == token_to_remove_end_speech).nonzero(as_tuple=True)

processed_rows = []

if len(start_indices[1]) > 0 and len(end_indices[1]) > 0:
    for i in range(generated_ids.shape[0]):
        # Find the first start token and last end token for the current row
        row_start_indices = (generated_ids[i] == token_to_find_start_speech).nonzero(as_tuple=True)[0]
        row_end_indices = (generated_ids[i] == token_to_remove_end_speech).nonzero(as_tuple=True)[0]

        if len(row_start_indices) > 0 and len(row_end_indices) > 0:
            first_start_idx = row_start_indices[0].item()
            last_end_idx = row_end_indices[-1].item()
            # Extract codes between start and end speech tokens
            cropped_tensor = generated_ids[i, first_start_idx + 1 : last_end_idx]

            # Trim to be divisible by 7
            trimmed_row = cropped_tensor[:(cropped_tensor.size(0) // 7) * 7]

            # Subtract audio_tokens_start offset
            trimmed_row = [t - audio_tokens_start for t in trimmed_row.tolist()]
            processed_rows.append(torch.tensor(trimmed_row)) # Convert back to tensor
        else:
            # If start or end token not found in a row, append empty list or handle as needed
            processed_rows.append(torch.tensor([])) # Append an empty tensor

else:
    print("Start or end speech token not found in generated IDs.")
    # Handle cases where tokens are not found, e.g., append empty lists or raise error


code_lists = processed_rows # Now code_lists contains tensors


def redistribute_codes(code_list):
  if len(code_list) == 0:
      return torch.tensor([]) # Return empty tensor if no codes

  layer_1 = []
  layer_2 = []
  layer_3 = []
  # Ensure code_list is a list or iterable for len()
  if isinstance(code_list, torch.Tensor):
      code_list = code_list.tolist()

  for i in range(len(code_list)//7):
    layer_1.append(code_list[7*i])
    layer_2.append(code_list[7*i+1]-4096)
    layer_3.append(code_list[7*i+2]-(2*4096))
    layer_3.append(code_list[7*i+3]-(3*4096))
    layer_2.append(code_list[7*i+4]-(4*4096))
    layer_3.append(code_list[7*i+5]-(5*4096))
    layer_3.append(code_list[7*i+6]-(6*4096))
  codes = [torch.tensor(layer_1).unsqueeze(0),
         torch.tensor(layer_2).unsqueeze(0),
         torch.tensor(layer_3).unsqueeze(0)]

  # codes = [c.to("cuda") for c in codes] # Keep on CPU for local inference
  audio_hat = snac_model.decode(codes)
  return audio_hat

my_samples = []
for code_list in code_lists:
  samples = redistribute_codes(code_list)
  my_samples.append(samples)

# Display audio (requires a suitable environment like Jupyter/Colab or saving to file)
# For a standard Python script, you might save the audio to a file instead.


print("\nAttempting to save Generated Audio:")
if len(prompts) != len(my_samples):
  print("Warning: Number of prompts and samples do not match.")
else:
  for i in range(len(my_samples)):
    print(f"Prompt: {prompts[i]}")
    samples = my_samples[i]
    if samples.numel() > 0:
        # Save to a WAV file
        try:
            output_filename = f"output_{i}.wav"
            # Drop singleton dims first (e.g. a (1, 1, num_frames) decoder output), then ensure shape (1, num_frames)
            audio_tensor_to_save = samples.detach().to("cpu").squeeze()
            if audio_tensor_to_save.ndim == 1:
                 audio_tensor_to_save = audio_tensor_to_save.unsqueeze(0)
            elif audio_tensor_to_save.ndim == 2 and audio_tensor_to_save.shape[0] != 1:
                 # If it's 2D but not (1, num_frames), assume it's (num_frames, 1) and transpose
                 audio_tensor_to_save = audio_tensor_to_save.T
            elif audio_tensor_to_save.ndim > 2:
                 print(f"Skipping saving for prompt {i} due to unexpected tensor shape: {audio_tensor_to_save.shape}")
                 continue # Skip saving this sample

            # Double check the shape before saving
            if audio_tensor_to_save.ndim == 2 and audio_tensor_to_save.shape[0] == 1:
                torchaudio.save(output_filename, audio_tensor_to_save, 24000)
                print(f"Audio saved to {output_filename}")
            else:
                 print(f"Skipping saving for prompt {i} as tensor could not be reshaped to (1, num_frames). Final shape: {audio_tensor_to_save.shape}")


        except Exception as e:
            print(f"Could not save audio to file: {e}")
    else:
        print("No valid audio codes were generated for this prompt.")

# Clean up
del my_samples
# Optional: delete model and tokenizer if you need to free up memory
# del model, tokenizer
# torch.cuda.empty_cache() # If using CUDA

Loading model from ./model...
==((====))==  Unsloth 2025.9.11: Fast Llama patching. Transformers: 4.55.4.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Model and tokenizer loaded.
Loading SNAC model...
SNAC model loaded.
Fast inference enabled.
Generating audio...


## Run inference locally

### Subtask:
Provide a code snippet demonstrating how to perform inference with the loaded model in the local environment, similar to the inference code in the notebook.


**Reasoning**:
Provide the complete code snippet for `run_inference.py` including all necessary imports, model loading, SNAC model loading, token definitions, prompt processing, token generation, post-processing, and audio saving, following the instructions provided in the subtask description. This requires combining all previous code logic into a single, self-contained script.
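Of the steps listed, the post-processing is the least obvious, so here is a small pure-Python sketch of it. It mirrors the layout used by `redistribute_codes` in the script: each audio frame is 7 tokens, position `k` in a frame carries an offset of `k * 4096`, and the positions map to three code layers. The names `split_layers`, `CODEBOOK_SIZE`, and `FRAME` are illustrative and not part of the notebook.

```python
CODEBOOK_SIZE = 4096  # per-position offset subtracted in the script
FRAME = 7             # tokens per audio frame

def split_layers(codes):
    """Split a flat list of offset codes into three layers (1, 2 and 4 codes per frame)."""
    usable = len(codes) - len(codes) % FRAME  # drop a trailing partial frame
    layer_1, layer_2, layer_3 = [], [], []
    # Position within the frame -> layer that receives the code
    targets = [layer_1, layer_2, layer_3, layer_3, layer_2, layer_3, layer_3]
    for idx in range(usable):
        pos = idx % FRAME
        targets[pos].append(codes[idx] - pos * CODEBOOK_SIZE)
    return layer_1, layer_2, layer_3

# One full frame plus one stray token that gets trimmed
frame = [5, 4096 + 6, 2 * 4096 + 7, 3 * 4096 + 8, 4 * 4096 + 9, 5 * 4096 + 10, 6 * 4096 + 11]
print(split_layers(frame + [123]))
```

In the full script each layer is then wrapped in a tensor with a leading batch dimension and passed to `snac_model.decode`.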



In [1]:
# run_inference.py

import torch
import torchaudio.transforms as T
import os
from snac import SNAC
from unsloth import FastLanguageModel
# In a standard Python script, you typically don't use IPython.display directly.
# from IPython.display import Audio, display # Commented out for standard script compatibility
import torchaudio # Import torchaudio for saving audio

# Define the path to the locally saved model directory
model_path = "./model" # Replace with your actual model path

# Load the model and tokenizer
print(f"Loading model from {model_path}...")
# FastLanguageModel.from_pretrained handles device placement
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_path,
    max_seq_length= 2048,
    dtype = None,
    load_in_4bit = False,
)
print("Model and tokenizer loaded.")

# Load the SNAC model
print("Loading SNAC model...")
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
# Move SNAC model to CPU as done in the notebook inference
snac_model.to("cpu")
print("SNAC model loaded.")

# Enable native 2x faster inference for the loaded model
FastLanguageModel.for_inference(model)
print("Fast inference enabled.")


# Define the special tokens used in data prep
tokeniser_length = 128256
start_of_text = 128000
end_of_text = 128009
start_of_speech = tokeniser_length + 1 # 128257
end_of_speech = tokeniser_length + 2 # 128258
start_of_human = tokeniser_length + 3 # 128259
end_of_human = tokeniser_length + 4 # 128260
start_of_ai = tokeniser_length + 5 # 128261
end_of_ai =  tokeniser_length + 6 # 128262
pad_token = tokeniser_length + 7 # 128263
audio_tokens_start = tokeniser_length + 10 # 128266


# Define the prompts and chosen voice
prompts = [
    "Hey there my name is Elise, <giggles> and I'm a speech generation model that can sound like a person.",
    "This is a test sentence for voice cloning.", # Added another prompt for testing
]

chosen_voice = "MrDragonFox/Elise" # Replace with the voice you want to clone, or None for single-speaker


# Prepare input_ids
prompts_ = [(f"{chosen_voice}: " + p) if chosen_voice else p for p in prompts]

all_input_ids = []

for prompt in prompts_:
  # Ensure tokenizer is defined and used correctly
  input_ids = tokenizer(prompt, return_tensors="pt").input_ids
  all_input_ids.append(input_ids)

start_token_tensor = torch.tensor([[ start_of_human]], dtype=torch.int64) # Start of human
end_tokens_tensor = torch.tensor([[end_of_text, end_of_human]], dtype=torch.int64) # End of text, End of human

all_modified_input_ids = []
for input_ids in all_input_ids:
  modified_input_ids = torch.cat([start_token_tensor, input_ids, end_tokens_tensor], dim=1) # SOH SOT Text EOT EOH
  all_modified_input_ids.append(modified_input_ids)

all_padded_tensors = []
all_attention_masks = []
max_length = max([modified_input_ids.shape[1] for modified_input_ids in all_modified_input_ids])
for modified_input_ids in all_modified_input_ids:
  padding = max_length - modified_input_ids.shape[1]
  padded_tensor = torch.cat([torch.full((1, padding), pad_token, dtype=torch.int64), modified_input_ids], dim=1)
  attention_mask = torch.cat([torch.zeros((1, padding), dtype=torch.int64), torch.ones((1, modified_input_ids.shape[1]), dtype=torch.int64)], dim=1)
  all_padded_tensors.append(padded_tensor)
  all_attention_masks.append(attention_mask)


all_padded_tensors = torch.cat(all_padded_tensors, dim=0)
all_attention_masks = torch.cat(all_attention_masks, dim=0)

# Move input tensors to the GPU if one is available, otherwise keep them on the CPU
# model.to(device) is not called here because from_pretrained already handled device placement
device = "cuda" if torch.cuda.is_available() else "cpu"
input_ids = all_padded_tensors.to(device)
attention_mask = all_attention_masks.to(device)


print("Generating audio...")
# Generate tokens
generated_ids = model.generate(
      input_ids=input_ids,
      attention_mask=attention_mask,
      max_new_tokens=1200,
      do_sample=True,
      temperature=0.6,
      top_p=0.95,
      repetition_penalty=1.1,
      num_return_sequences=1,
      eos_token_id=end_of_speech, # Use the end of speech token
      use_cache = True
  )

# Post-process generated tokens to extract speech codes
token_to_find_start_speech = start_of_speech
token_to_remove_end_speech = end_of_speech

# Check whether any start_of_speech / end_of_speech tokens appear anywhere in the batch
start_indices = (generated_ids == token_to_find_start_speech).nonzero(as_tuple=True)
end_indices = (generated_ids == token_to_remove_end_speech).nonzero(as_tuple=True)

processed_rows = []

if len(start_indices[1]) > 0 and len(end_indices[1]) > 0:
    for i in range(generated_ids.shape[0]):
        # Find the first start token and last end token for the current row
        row_start_indices = (generated_ids[i] == token_to_find_start_speech).nonzero(as_tuple=True)[0]
        row_end_indices = (generated_ids[i] == token_to_remove_end_speech).nonzero(as_tuple=True)[0]

        if len(row_start_indices) > 0 and len(row_end_indices) > 0:
            first_start_idx = row_start_indices[0].item()
            last_end_idx = row_end_indices[-1].item()
            # Extract codes between start and end speech tokens
            cropped_tensor = generated_ids[i, first_start_idx + 1 : last_end_idx]

            # Trim to be divisible by 7
            trimmed_row = cropped_tensor[:(cropped_tensor.size(0) // 7) * 7]

            # Subtract audio_tokens_start offset
            trimmed_row = [t - audio_tokens_start for t in trimmed_row.tolist()]
            processed_rows.append(torch.tensor(trimmed_row)) # Convert back to tensor
        else:
            # If start or end token not found in a row, append empty list or handle as needed
            processed_rows.append(torch.tensor([])) # Append an empty tensor

else:
    print("Start or end speech token not found in generated IDs.")
    # Handle cases where tokens are not found, e.g., append empty lists or raise error


code_lists = processed_rows # Now code_lists contains tensors


def redistribute_codes(code_list):
  if len(code_list) == 0:
      return torch.tensor([]) # Return empty tensor if no codes

  layer_1 = []
  layer_2 = []
  layer_3 = []
  # Ensure code_list is a list or iterable for len()
  if isinstance(code_list, torch.Tensor):
      code_list = code_list.tolist()

  for i in range(len(code_list)//7):
    layer_1.append(code_list[7*i])
    layer_2.append(code_list[7*i+1]-4096)
    layer_3.append(code_list[7*i+2]-(2*4096))
    layer_3.append(code_list[7*i+3]-(3*4096))
    layer_2.append(code_list[7*i+4]-(4*4096))
    layer_3.append(code_list[7*i+5]-(5*4096))
    layer_3.append(code_list[7*i+6]-(6*4096))
  codes = [torch.tensor(layer_1).unsqueeze(0),
         torch.tensor(layer_2).unsqueeze(0),
         torch.tensor(layer_3).unsqueeze(0)]

  # codes = [c.to("cuda") for c in codes] # Keep on CPU for local inference
  audio_hat = snac_model.decode(codes)
  return audio_hat

my_samples = []
for code_list in code_lists:
  samples = redistribute_codes(code_list)
  my_samples.append(samples)

# Display audio (requires a suitable environment like Jupyter/Colab or saving to file)
# For a standard Python script, you might save the audio to a file instead.


print("\nAttempting to save Generated Audio:")
if len(prompts) != len(my_samples):
  print("Warning: Number of prompts and samples do not match.")
else:
  for i in range(len(my_samples)):
    print(f"Prompt: {prompts[i]}")
    samples = my_samples[i]
    if samples.numel() > 0:
        # Save to a WAV file
        try:
            output_filename = f"output_{i}.wav"
            # Ensure the tensor is 2D with shape (num_channels, num_frames)
            audio_tensor_to_save = samples.detach().to("cpu")
            if audio_tensor_to_save.ndim == 1:
                 audio_tensor_to_save = audio_tensor_to_save.unsqueeze(0)
            elif audio_tensor_to_save.ndim == 2 and audio_tensor_to_save.shape[0] != 1:
                 # If it's 2D but not (1, num_frames), assume it's (num_frames, 1) and transpose
                 audio_tensor_to_save = audio_tensor_to_save.T
            elif audio_tensor_to_save.ndim > 2:
                 print(f"Skipping saving for prompt {i} due to unexpected tensor shape: {audio_tensor_to_save.shape}")
                 continue # Skip saving this sample

            # Double check the shape before saving
            if audio_tensor_to_save.ndim == 2 and audio_tensor_to_save.shape[0] == 1:
                torchaudio.save(output_filename, audio_tensor_to_save, 24000)
                print(f"Audio saved to {output_filename}")
            else:
                 print(f"Skipping saving for prompt {i} as tensor could not be reshaped to (1, num_frames). Final shape: {audio_tensor_to_save.shape}")


        except Exception as e:
            print(f"Could not save audio to file: {e}")
    else:
        print("No valid audio codes were generated for this prompt.")

# Clean up
del my_samples
# Optional: delete model and tokenizer if you need to free up memory
# del model, tokenizer
# torch.cuda.empty_cache() # If using CUDA

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
Loading model from ./model...
==((====))==  Unsloth 2025.9.11: Fast Llama patching. Transformers: 4.55.4.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Model and tokenizer loaded.
Loading SNAC model...
SNAC model loaded.
Fast inference enabled.
Generating audio...

Attempting to save Generated Audio:
Prompt: Hey there my name is Elise, <giggles> and I'm a speech generation model that can sound like a person.
Skipping saving for prompt 0 due to unexpected tensor shape: torch.Size([1, 1, 241664])
Prompt: This is a test sentence for voice cloning.
Skipping saving for prompt 1 due to unexpected tensor shape: torch.Size([1, 1, 63488])


**Reasoning**:
The previous attempt skipped saving because the decoded audio had shape `(1, 1, num_frames)`, while `torchaudio.save` expects a 2D tensor of shape `(num_channels, num_frames)`. `redistribute_codes` returns the SNAC decoder output unchanged, so the extra leading dimension comes from there. Update the saving logic to handle the `(1, 1, num_frames)` case explicitly by squeezing the tensor down to `(1, num_frames)`.
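
The reshaping described above can be sketched in isolation. This is a minimal example, assuming mono audio; `to_channels_by_frames` is a hypothetical helper name, and the zero tensor only stands in for real decoder output.

```python
import torch


def to_channels_by_frames(audio: torch.Tensor) -> torch.Tensor:
    """Return mono audio as a 2D (1, num_frames) tensor (hypothetical helper)."""
    audio = audio.detach().cpu()
    if audio.ndim == 3 and audio.shape[0] == 1:
        audio = audio.squeeze(0)    # (1, 1, N) -> (1, N)
    elif audio.ndim == 1:
        audio = audio.unsqueeze(0)  # (N,) -> (1, N)
    if audio.ndim != 2 or audio.shape[0] != 1:
        raise ValueError(f"expected mono audio, got shape {tuple(audio.shape)}")
    return audio


# Stand-in with the same shape as the second sample reported in the log above.
decoded = torch.zeros(1, 1, 63488)
print(to_channels_by_frames(decoded).shape)  # torch.Size([1, 63488])
```

Raising on any other shape, rather than silently skipping the sample, is a design choice of this sketch; the script below prints a message and continues instead.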



In [2]:
# run_inference.py

import torch
import torchaudio  # used to save the generated audio as WAV files
from snac import SNAC
from unsloth import FastLanguageModel
# IPython.display is not imported: a standalone script saves audio to disk instead of displaying it.

# Define the path to the locally saved model directory
model_path = "./model" # Replace with your actual model path

# Load the model and tokenizer
print(f"Loading model from {model_path}...")
# FastLanguageModel.from_pretrained handles device placement
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_path,
    max_seq_length= 2048,
    dtype = None,
    load_in_4bit = False,
)
print("Model and tokenizer loaded.")

# Load the SNAC model
print("Loading SNAC model...")
snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
# Move SNAC model to CPU as done in the notebook inference
snac_model.to("cpu")
print("SNAC model loaded.")

# Enable native 2x faster inference for the loaded model
FastLanguageModel.for_inference(model)
print("Fast inference enabled.")


# Define the special tokens used in data prep
tokeniser_length = 128256
start_of_text = 128000
end_of_text = 128009
start_of_speech = tokeniser_length + 1 # 128257
end_of_speech = tokeniser_length + 2 # 128258
start_of_human = tokeniser_length + 3 # 128259
end_of_human = tokeniser_length + 4 # 128260
start_of_ai = tokeniser_length + 5 # 128261
end_of_ai =  tokeniser_length + 6 # 128262
pad_token = tokeniser_length + 7 # 128263
audio_tokens_start = tokeniser_length + 10 # 128266


# Define the prompts and chosen voice
prompts = [
    "Hey there my name is Elise, <giggles> and I'm a speech generation model that can sound like a person.",
    "This is a test sentence for voice cloning.", # Added another prompt for testing
]

chosen_voice = "MrDragonFox/Elise" # Replace with the voice you want to clone, or None for single-speaker


# Prepare input_ids
prompts_ = [(f"{chosen_voice}: " + p) if chosen_voice else p for p in prompts]

all_input_ids = []

for prompt in prompts_:
  # Ensure tokenizer is defined and used correctly
  input_ids = tokenizer(prompt, return_tensors="pt").input_ids
  all_input_ids.append(input_ids)

start_token_tensor = torch.tensor([[ start_of_human]], dtype=torch.int64) # Start of human
end_tokens_tensor = torch.tensor([[end_of_text, end_of_human]], dtype=torch.int64) # End of text, End of human

all_modified_input_ids = []
for input_ids in all_input_ids:
  modified_input_ids = torch.cat([start_token_tensor, input_ids, end_tokens_tensor], dim=1) # SOH SOT Text EOT EOH
  all_modified_input_ids.append(modified_input_ids)

all_padded_tensors = []
all_attention_masks = []
max_length = max([modified_input_ids.shape[1] for modified_input_ids in all_modified_input_ids])
for modified_input_ids in all_modified_input_ids:
  padding = max_length - modified_input_ids.shape[1]
  padded_tensor = torch.cat([torch.full((1, padding), pad_token, dtype=torch.int64), modified_input_ids], dim=1)
  attention_mask = torch.cat([torch.zeros((1, padding), dtype=torch.int64), torch.ones((1, modified_input_ids.shape[1]), dtype=torch.int64)], dim=1)
  all_padded_tensors.append(padded_tensor)
  all_attention_masks.append(attention_mask)


all_padded_tensors = torch.cat(all_padded_tensors, dim=0)
all_attention_masks = torch.cat(all_attention_masks, dim=0)

# Move input tensors to the GPU if one is available, otherwise keep them on the CPU
# model.to(device) is not called here because from_pretrained already handled device placement
device = "cuda" if torch.cuda.is_available() else "cpu"
input_ids = all_padded_tensors.to(device)
attention_mask = all_attention_masks.to(device)


print("Generating audio...")
# Generate tokens
generated_ids = model.generate(
      input_ids=input_ids,
      attention_mask=attention_mask,
      max_new_tokens=1200,
      do_sample=True,
      temperature=0.6,
      top_p=0.95,
      repetition_penalty=1.1,
      num_return_sequences=1,
      eos_token_id=end_of_speech, # Use the end of speech token
      use_cache = True
  )

# Post-process generated tokens to extract speech codes
token_to_find_start_speech = start_of_speech
token_to_remove_end_speech = end_of_speech

# Check whether any start_of_speech / end_of_speech tokens appear anywhere in the batch
start_indices = (generated_ids == token_to_find_start_speech).nonzero(as_tuple=True)
end_indices = (generated_ids == token_to_remove_end_speech).nonzero(as_tuple=True)

processed_rows = []

if len(start_indices[1]) > 0 and len(end_indices[1]) > 0:
    for i in range(generated_ids.shape[0]):
        # Find the first start token and last end token for the current row
        row_start_indices = (generated_ids[i] == token_to_find_start_speech).nonzero(as_tuple=True)[0]
        row_end_indices = (generated_ids[i] == token_to_remove_end_speech).nonzero(as_tuple=True)[0]

        if len(row_start_indices) > 0 and len(row_end_indices) > 0:
            first_start_idx = row_start_indices[0].item()
            last_end_idx = row_end_indices[-1].item()
            # Extract codes between start and end speech tokens
            cropped_tensor = generated_ids[i, first_start_idx + 1 : last_end_idx]

            # Trim to be divisible by 7
            trimmed_row = cropped_tensor[:(cropped_tensor.size(0) // 7) * 7]

            # Subtract audio_tokens_start offset
            trimmed_row = [t - audio_tokens_start for t in trimmed_row.tolist()]
            processed_rows.append(torch.tensor(trimmed_row)) # Convert back to tensor
        else:
            # If start or end token not found in a row, append empty list or handle as needed
            processed_rows.append(torch.tensor([])) # Append an empty tensor

else:
    print("Start or end speech token not found in generated IDs.")
    # Handle cases where tokens are not found, e.g., append empty lists or raise error


code_lists = processed_rows # Now code_lists contains tensors


def redistribute_codes(code_list):
  if len(code_list) == 0:
      return torch.tensor([]) # Return empty tensor if no codes

  layer_1 = []
  layer_2 = []
  layer_3 = []
  # Ensure code_list is a list or iterable for len()
  if isinstance(code_list, torch.Tensor):
      code_list = code_list.tolist()

  for i in range(len(code_list)//7):
    layer_1.append(code_list[7*i])
    layer_2.append(code_list[7*i+1]-4096)
    layer_3.append(code_list[7*i+2]-(2*4096))
    layer_3.append(code_list[7*i+3]-(3*4096))
    layer_2.append(code_list[7*i+4]-(4*4096))
    layer_3.append(code_list[7*i+5]-(5*4096))
    layer_3.append(code_list[7*i+6]-(6*4096))
  codes = [torch.tensor(layer_1).unsqueeze(0),
         torch.tensor(layer_2).unsqueeze(0),
         torch.tensor(layer_3).unsqueeze(0)]

  # codes = [c.to("cuda") for c in codes] # Keep on CPU for local inference
  audio_hat = snac_model.decode(codes)
  return audio_hat

my_samples = []
for code_list in code_lists:
  samples = redistribute_codes(code_list)
  my_samples.append(samples)

# Display audio (requires a suitable environment like Jupyter/Colab or saving to file)
# For a standard Python script, you might save the audio to a file instead.


print("\nAttempting to save Generated Audio:")
if len(prompts) != len(my_samples):
  print("Warning: Number of prompts and samples do not match.")
else:
  for i in range(len(my_samples)):
    print(f"Prompt: {prompts[i]}")
    samples = my_samples[i]
    if samples.numel() > 0:
        # Save to a WAV file
        try:
            output_filename = f"output_{i}.wav"
            # Ensure the tensor is 2D with shape (num_channels, num_frames)
            audio_tensor_to_save = samples.detach().to("cpu")

            # Handle the (1, 1, num_frames) case by squeezing
            if audio_tensor_to_save.ndim == 3 and audio_tensor_to_save.shape[0] == 1 and audio_tensor_to_save.shape[1] == 1:
                # Drop the leading batch dimension: (1, 1, num_frames) -> (1, num_frames)
                audio_tensor_to_save = audio_tensor_to_save.squeeze(0)

            if audio_tensor_to_save.ndim == 1:
                 audio_tensor_to_save = audio_tensor_to_save.unsqueeze(0)
            elif audio_tensor_to_save.ndim == 2 and audio_tensor_to_save.shape[0] != 1:
                 # If it's 2D but not (1, num_frames), assume it's (num_frames, 1) and transpose
                 audio_tensor_to_save = audio_tensor_to_save.T
            elif audio_tensor_to_save.ndim > 2:
                 print(f"Skipping saving for prompt {i} due to unexpected tensor shape: {audio_tensor_to_save.shape}")
                 continue # Skip saving this sample

            # Double check the shape before saving
            if audio_tensor_to_save.ndim == 2 and audio_tensor_to_save.shape[0] == 1:
                torchaudio.save(output_filename, audio_tensor_to_save, 24000)
                print(f"Audio saved to {output_filename}")
            else:
                 print(f"Skipping saving for prompt {i} as tensor could not be reshaped to (1, num_frames). Final shape: {audio_tensor_to_save.shape}")


        except Exception as e:
            print(f"Could not save audio to file: {e}")
    else:
        print("No valid audio codes were generated for this prompt.")

# Clean up
del my_samples
# Optional: delete model and tokenizer if you need to free up memory
# del model, tokenizer
# torch.cuda.empty_cache() # If using CUDA

Loading model from ./model...
==((====))==  Unsloth 2025.9.11: Fast Llama patching. Transformers: 4.55.4.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Model and tokenizer loaded.
Loading SNAC model...
SNAC model loaded.
Fast inference enabled.
Generating audio...

Attempting to save Generated Audio:
Prompt: Hey there my name is Elise, <giggles> and I'm a speech generation model that can sound like a person.
Audio saved to output_0.wav
Prompt: This is a test sentence for voice cloning.
Audio saved to output_1.wav


## Summary:

### Key Findings

*   The saved model directory can be downloaded from Google Colab to a local machine using the file browser's download option.
*   Setting up a local Python environment involves installing Python (version 3.8 or later recommended), creating and activating a virtual environment, and installing the required libraries with pip: `unsloth`, `transformers`, `torch`, `bitsandbytes`, `accelerate`, and `xformers`, plus `snac` and `torchaudio`, which `run_inference.py` imports.
*   The saved model can be loaded locally using `unsloth.FastLanguageModel.from_pretrained`, specifying the path to the model directory and the same parameters used during training (`max_seq_length`, `dtype`, `load_in_4bit`).
*   Performing local inference requires loading both the fine-tuned language model and the SNAC decoder model.
*   The SNAC decoder returns audio as a 3D tensor of shape `(1, 1, num_frames)`. It must be reduced to the 2D shape `(1, num_frames)`, that is `(num_channels, num_frames)`, before `torchaudio.save` can write it as a WAV file.
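
Before running the script locally, it can help to confirm that the environment described above is complete. The following is a minimal sketch using only the standard library. The `REQUIRED` list and the `missing_packages` name are assumptions for illustration, and the check only tests whether each package can be found, not whether its version is compatible.

```python
from importlib.util import find_spec

# Import names used by run_inference.py plus the libraries listed above (assumed list).
REQUIRED = [
    "torch", "torchaudio", "snac", "unsloth",
    "transformers", "bitsandbytes", "accelerate", "xformers",
]


def missing_packages(names=REQUIRED):
    """Return the names that cannot be found in the current Python environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print(f"Missing: {', '.join(missing)}")
    else:
        print("All packages found.")
```

`find_spec` locates a top-level package without importing it, so the check stays fast even when heavy libraries such as `torch` are installed.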

### Insights or Next Steps

*   The provided Python script (`run_inference.py`) serves as a complete example for running voice cloning inference locally, including model loading, token generation, and audio saving.
*   Users should ensure their local environment has sufficient resources (CPU/GPU and RAM) to load and run the models, especially for larger models or longer audio generations.
