In [1]:
!pip install unsloth
!pip install --upgrade --no-deps "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"


Collecting unsloth@ git+https://github.com/unslothai/unsloth.git (from unsloth[colab-new]@ git+https://github.com/unslothai/unsloth.git)
  Cloning https://github.com/unslothai/unsloth.git to /tmp/pip-install-833fbqc7/unsloth_035b5e70128d44fbb4c473a41912870b
  Running command git clone --filter=blob:none --quiet https://github.com/unslothai/unsloth.git /tmp/pip-install-833fbqc7/unsloth_035b5e70128d44fbb4c473a41912870b
  Resolved https://github.com/unslothai/unsloth.git to commit a2f4c9793ecf829ede2cb64f2ca7a909ce3b0884
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone


In [2]:
from unsloth import FastLanguageModel
import torch

# Set environment variables to enable detailed CUDA debugging
%env CUDA_LAUNCH_BLOCKING=1
%env TORCH_USE_CUDA_DSA=1

# Set the desired sequence length and model parameters
max_seq_length = 512  # Adjusted for testing
dtype = torch.float16  # Float16 for Tesla T4
load_in_4bit = False  # Temporarily disable 4-bit quantization for testing simplicity

# Load the TinyLlama model with extended context
model_name = "unsloth/tinyllama"
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,  # Disabling 4-bit for simplicity
)

# Enable fast inference mode for efficient generation
FastLanguageModel.for_inference(model)

# Prepare input for testing
input_text = "Translate to French: The cat sat on the mat."

# Tokenize input text
inputs = tokenizer(
    [input_text],
    return_tensors="pt",
    max_length=max_seq_length,
    truncation=True,
    padding="max_length",
    add_special_tokens=True
).to("cuda")

# Manually add attention mask to ensure proper handling of padding tokens
inputs['attention_mask'] = torch.ones_like(inputs['input_ids'], device="cuda")

# Generate text with adjusted parameters to improve output quality
outputs = model.generate(
    input_ids=inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    max_new_tokens=32,  # Control the maximum length of the output
    repetition_penalty=1.2,  # Apply a penalty to reduce repetition
    temperature=0.7,  # Lower temperature for more focused generation
    top_p=0.9,  # Top-p sampling for more diverse yet coherent results
    use_cache=True  # Use cache for more efficient inference
)

# Decode the generated tokens back into text
generated_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(f"Model Response: {generated_text[0]}")

# Optionally check GPU memory usage after the process
!nvidia-smi


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
env: CUDA_LAUNCH_BLOCKING=1
env: TORCH_USE_CUDA_DSA=1
==((====))==  Unsloth 2024.9.post4: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Model Response: Translate to French: The cat sat on the mat.
Memento MMRRomani.
The nd theater, and ateater theater.
Into theater.
theatre of
Mon Oct 14 06:38:35 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+---