# 🎤 Test Model with Custom Audio

Now that we've successfully fine-tuned our Audio-Language model, it's time to test it with your own speech.

This notebook will guide you through:
1. Recording audio on Mac (Voice Memos)
2. Converting to WAV format
3. Resampling to 16kHz
4. Running inference on your custom audio

## 📝 Step 1: Recording Your Audio

**On Mac:**
1. Open **Voice Memos** app (Launchpad → Voice Memos)
2. Click the red record button
3. Speak clearly: "The quick brown fox jumps over the lazy dog"
4. Stop recording
5. Right-click → Export → Save as `my_recording.m4a`

**Alternative: Use FFmpeg to convert M4A to WAV**
```bash
# If you have FFmpeg installed (-ar 16000 resamples to 16 kHz, -ac 1 downmixes to mono):
ffmpeg -i my_recording.m4a -ar 16000 -ac 1 my_recording.wav
```

**Or use an online converter**: https://audio.online-convert.com/convert-to-wav

Place your audio file in the same directory as this notebook.
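
If you prefer to stay inside the notebook, the FFmpeg conversion shown above can also be driven from Python. This is a minimal sketch rather than part of the original workflow: the helper names `build_ffmpeg_command` and `convert_to_wav` are hypothetical, and the `-y` flag (overwrite the output without prompting) is an addition so the call cannot block on a prompt.

```python
import shutil
import subprocess
from pathlib import Path


def build_ffmpeg_command(src: str, dst: str, sample_rate: int = 16000) -> list:
    """Mirror the shell command: resample to `sample_rate` and downmix to mono."""
    # "-y" is an assumption added here; the shell command in the text omits it.
    return ["ffmpeg", "-y", "-i", src, "-ar", str(sample_rate), "-ac", "1", dst]


def convert_to_wav(src: str, dst: str, sample_rate: int = 16000) -> bool:
    """Run FFmpeg if it is installed and `src` exists; return True on success."""
    if shutil.which("ffmpeg") is None or not Path(src).is_file():
        return False
    result = subprocess.run(
        build_ffmpeg_command(src, dst, sample_rate),
        capture_output=True,
        check=False,
    )
    return result.returncode == 0


# Returns False (instead of raising) when FFmpeg or the input file is missing.
converted = convert_to_wav("my_recording.m4a", "my_recording.wav")
print("Converted:", converted)
```

Returning a boolean keeps the helper safe to call on machines without FFmpeg; in that case you can still load the `.m4a` file directly in Step 3.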

## 📦 Step 2: Install Dependencies

In [1]:
# Uncomment if needed:
# !pip install torchaudio librosa soundfile

import sys
import torch
import torchaudio
import librosa
import soundfile as sf
from io import BytesIO
import IPython.display as ipd

print("✅ Dependencies loaded")

✅ Dependencies loaded


In [15]:
import warnings
import logging
import transformers

# 1. Suppress Python Warnings
# "PySoundFile failed" (M4A fallback warning)
warnings.filterwarnings("ignore", category=UserWarning, module="librosa")
warnings.filterwarnings("ignore", category=FutureWarning, module="librosa")

# "do_sample" vs "temperature" conflicts in generation
warnings.filterwarnings("ignore", message=".*do_sample.*")

# "trust_remote_code" argument warnings
warnings.filterwarnings("ignore", message=".*trust_remote_code.*")

# "copying from a non-meta parameter" (The scary looking one that is actually harmless)
warnings.filterwarnings("ignore", message=".*non-meta parameter.*")

# 2. Suppress Transformers Logging (The "Loading checkpoint..." spam)
transformers.logging.set_verbosity_error()

print("✅ Warnings suppressed. Ready for clean inference.")



## 🎵 Step 3: Load and Process Your Audio

We'll:
1. Load your audio file (handles various formats)
2. Resample to 16kHz (model's training rate)
3. Convert to mono if needed
4. Save to in-memory BytesIO object
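
The mono conversion in the list above is handled by `librosa.load(..., mono=True)` in this notebook. To illustrate what a mono downmix does, here is a small NumPy sketch that averages the channels of a synthetic stereo signal. The `to_mono` helper and the 440 Hz test tone are assumptions made for illustration; they are not taken from the notebook.

```python
import numpy as np


def to_mono(audio: np.ndarray) -> np.ndarray:
    """Average the channels of a (channels, samples) array; pass 1-D audio through."""
    if audio.ndim == 1:
        return audio
    return audio.mean(axis=0)


sample_rate = 48000
t = np.arange(sample_rate) / sample_rate            # one second of timestamps
left = np.sin(2 * np.pi * 440 * t).astype(np.float32)
right = np.zeros_like(left)                         # silent right channel
stereo = np.stack([left, right])                    # shape: (2, 48000)

mono = to_mono(stereo)
print(mono.shape)                                   # (48000,)
print(f"{len(mono) / sample_rate:.2f} seconds")     # duration is unchanged
```

Averaging halves the amplitude of a signal that is present in only one channel, which is why mono recordings can sound slightly quieter than the stereo original.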

In [16]:
# Path to your audio file
AUDIO_FILE = "kulsoom_test.m4a"  # Change this to your file name

print(f"📂 Loading audio from: {AUDIO_FILE}")

# Load audio with librosa (handles many formats)
audio_array, original_sr = librosa.load(AUDIO_FILE, sr=None, mono=True)

print(f"   Original sample rate: {original_sr} Hz")
print(f"   Duration: {len(audio_array)/original_sr:.2f} seconds")
print(f"   Shape: {audio_array.shape}")

# Listen to original audio
print("\n🎧 Your audio:")
ipd.display(ipd.Audio(audio_array, rate=original_sr))

📂 Loading audio from: kulsoom_test.m4a
   Original sample rate: 48000 Hz
   Duration: 14.36 seconds
   Shape: (689088,)

🎧 Your audio:




## 🔄 Step 4: Resample to 16kHz

The model was trained on 16kHz audio, so we need to resample.
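
As a quick sanity check, you can predict how many samples the resampled signal should contain. The sketch below assumes the resampler returns `ceil(n * target_sr / orig_sr)` samples; other resamplers may differ by a sample. The helper name `resampled_length` is hypothetical.

```python
def resampled_length(num_samples: int, orig_sr: int, target_sr: int) -> int:
    """Expected length after resampling, using integer ceiling division."""
    return (num_samples * target_sr + orig_sr - 1) // orig_sr


# Values from the recording loaded in Step 3: 689088 samples at 48000 Hz.
n_out = resampled_length(689088, 48000, 16000)
print(n_out)                    # 229696
print(f"{n_out / 16000:.2f}")   # 14.36 (the duration in seconds is preserved)
```

Going from 48 kHz to 16 kHz keeps one third of the samples, so the array shrinks while the duration stays the same.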

In [17]:
TARGET_SR = 16000

if original_sr != TARGET_SR:
    print(f"🔄 Resampling from {original_sr}Hz to {TARGET_SR}Hz...")
    
    # Method 1: Using torchaudio (faster)
    audio_tensor = torch.from_numpy(audio_array).unsqueeze(0)  # Add channel dim
    resampler = torchaudio.transforms.Resample(
        orig_freq=original_sr,
        new_freq=TARGET_SR
    )
    audio_resampled = resampler(audio_tensor).squeeze(0).numpy()
    
    # Method 2: Using librosa (alternative)
    # audio_resampled = librosa.resample(audio_array, orig_sr=original_sr, target_sr=TARGET_SR)
    
    print(f"✅ Resampled shape: {audio_resampled.shape}")
else:
    print(f"✅ Already at {TARGET_SR}Hz, no resampling needed")
    audio_resampled = audio_array

# Listen to resampled audio
print("\n🎧 Resampled audio (16kHz):")
ipd.display(ipd.Audio(audio_resampled, rate=TARGET_SR))

🔄 Resampling from 48000Hz to 16000Hz...
✅ Resampled shape: (229696,)

🎧 Resampled audio (16kHz):


## 💾 Step 5: Convert to BytesIO (In-Memory)

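Writing the audio to an in-memory WAV lets us hand it to any API that expects a file-like object, without saving temporary files. The inference code in Step 7 passes the `audio_resampled` array to the feature extractor directly, so this step is only needed if you want the WAV bytes.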

In [18]:
# Create BytesIO object
audio_bytes_io = BytesIO()

# Write audio to BytesIO as WAV
sf.write(audio_bytes_io, audio_resampled, TARGET_SR, format='WAV')

# Reset pointer to beginning
audio_bytes_io.seek(0)

print("✅ Audio saved to in-memory BytesIO object")
print(f"   Size: {len(audio_bytes_io.getvalue())} bytes")

# Alternative: Save to file if needed
# sf.write("processed_audio.wav", audio_resampled, TARGET_SR)
# print("✅ Also saved to processed_audio.wav")

✅ Audio saved to in-memory BytesIO object
   Size: 459436 bytes
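
The reported size is consistent with a 44-byte WAV header followed by 229696 samples stored as 16-bit integers (2 bytes each). The sketch below reproduces that arithmetic with Python's standard `wave` module instead of `soundfile`. It assumes 16-bit mono PCM, which appears to be what `sf.write` produced here; the helper name `pcm16_wav_bytes` is hypothetical, and silence stands in for the real audio.

```python
import io
import wave


def pcm16_wav_bytes(num_samples: int, sample_rate: int = 16000) -> bytes:
    """Write `num_samples` of silence as 16-bit mono PCM WAV into memory."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)          # mono
        wav_file.setsampwidth(2)          # 2 bytes per sample (16-bit)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * num_samples)
    return buffer.getvalue()


data = pcm16_wav_bytes(229696)
print(len(data))                          # 459436 = 44 + 229696 * 2

# Reading the bytes back works like reading a file on disk.
with wave.open(io.BytesIO(data), "rb") as wav_file:
    print(wav_file.getframerate(), wav_file.getnframes())   # 16000 229696
```

If the WAV really is 16-bit PCM, the floating-point samples are quantized when written, so the in-memory copy is not bit-identical to `audio_resampled`.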


## 🤖 Step 6: Load Model and Run Inference

In [19]:
# Setup fork path
FORK_PATH = "./transformers_fork/src"
if FORK_PATH not in sys.path:
    sys.path.insert(0, FORK_PATH)

from transformers import (
    Qwen2VLForConditionalGeneration,
    AutoTokenizer,
    WhisperFeatureExtractor
)

print("✅ Transformers loaded from fork")

✅ Transformers loaded from fork


In [None]:
# Model configuration
MODEL_PATH = "./kulsoom-abdullah/Qwen2-Audio-7B-Transcription"
AUDIO_TOKEN_ID = 151657
NUM_AUDIO_TOKENS = 1500

print(f"📥 Loading model from: {MODEL_PATH}")

# Load model
model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
model.eval()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
tokenizer.pad_token_id = 151643
tokenizer.eos_token_id = 151645

# Load feature extractor
feature_extractor = WhisperFeatureExtractor.from_pretrained(
    "openai/whisper-large-v3-turbo"
)

print("✅ Model loaded and ready!")

📥 Loading model from: ./stage2_full_bulletproof/merged_model
🎧 Grafting Audio Encoder: openai/whisper-large-v3-turbo...
✅ Audio components initialized: 1280 -> 3584


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

✅ Model loaded and ready!
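
The value `NUM_AUDIO_TOKENS = 1500` in the configuration above appears to match the number of frames a Whisper encoder emits for one 30-second window. The notebook does not state this explicitly, so treat the following as a sketch of the likely arithmetic, based on Whisper's usual settings (16 kHz input, a hop of 160 samples between mel frames, and a stride-2 convolution in the encoder). The function name is hypothetical.

```python
def whisper_encoder_frames(
    window_seconds: int = 30,
    sample_rate: int = 16000,
    hop_length: int = 160,
    encoder_stride: int = 2,
) -> int:
    """Frames produced by a Whisper-style encoder for one padded window."""
    mel_frames = window_seconds * sample_rate // hop_length   # 3000 mel frames
    return mel_frames // encoder_stride                       # halved by the encoder


print(whisper_encoder_frames())  # 1500
```

Under this reading, every clip is padded to 30 seconds before encoding, which is why the placeholder count is fixed rather than scaled to the clip's actual duration.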


## 🚀 Step 7: Generate Transcription

In [21]:
print("🎯 Running inference on your custom audio...\n")

# Prepare audio features
inputs = feature_extractor(
    audio_resampled,
    sampling_rate=TARGET_SR,
    return_tensors="pt"
)
input_features = inputs.input_features.to(model.device).to(torch.bfloat16)

# Build prompt with audio tokens
audio_tokens = [AUDIO_TOKEN_ID] * NUM_AUDIO_TOKENS
input_ids_audio = torch.tensor([audio_tokens], device=model.device)

# Tokenize prompt parts
p1 = tokenizer.encode(
    "<|im_start|>user\n<|audio_bos|>",
    add_special_tokens=False,
    return_tensors="pt"
).to(model.device)

p2 = tokenizer.encode(
    "<|audio_eos|>\nTranscribe this audio.<|im_end|>\n<|im_start|>assistant\n",
    add_special_tokens=False,
    return_tensors="pt"
).to(model.device)

# Combine
input_ids = torch.cat([p1, input_ids_audio, p2], dim=1)
attention_mask = torch.ones_like(input_ids)

# Generate transcription
with torch.no_grad():
    generated_ids = model.generate(
        input_ids=input_ids,
        input_features=input_features,
        attention_mask=attention_mask,
        max_new_tokens=128,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

# Decode output
transcription = tokenizer.decode(
    generated_ids[0][input_ids.shape[1]:],
    skip_special_tokens=True
).strip()

print("="*80)
print("🎤 YOUR AUDIO")
print("="*80)
ipd.display(ipd.Audio(audio_resampled, rate=TARGET_SR))

print("\n" + "="*80)
print("📝 MODEL TRANSCRIPTION")
print("="*80)
print(f"\n{transcription}\n")
print("="*80)

🎯 Running inference on your custom audio...

🎤 YOUR AUDIO



📝 MODEL TRANSCRIPTION

MY NAME IS KUOSSUM ABDULLA AND I'M RECORDING THIS AUDIO TO TEST THIS WHISPER TO QWIN GRAFTEDE AUDIO TO TEXT MODEL
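
The recording used here lasts 14.36 seconds, so it fits inside a single 30-second window. Whisper-style feature extractors work on fixed 30-second windows and, as far as we know, truncate longer audio under their default settings. If you test with a longer recording, one option is to split it into windows and transcribe each one separately. The sketch below only computes the window boundaries; `chunk_bounds` is a hypothetical helper, not something the notebook defines.

```python
def chunk_bounds(num_samples: int, sample_rate: int = 16000, window_seconds: int = 30):
    """Return (start, end) sample indices that cover the audio in fixed windows."""
    window = sample_rate * window_seconds
    return [
        (start, min(start + window, num_samples))
        for start in range(0, num_samples, window)
    ]


print(chunk_bounds(229696))      # [(0, 229696)]
print(chunk_bounds(1_000_000))   # [(0, 480000), (480000, 960000), (960000, 1000000)]
```

Each `(start, end)` pair can be used to slice `audio_resampled` before passing the slice to the feature extractor. Hard cuts can split a word in half, so overlapping windows may work better in practice.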



## 🎨 Step 8: Try Different Instructions (Optional)

Because the model was trained on instruction-following data, it should respond to differently worded requests rather than only one fixed prompt.

In [22]:
def transcribe_with_instruction(instruction_text):
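    """Generate a response to `instruction_text` for the audio prepared in Step 7.

    Relies on the notebook globals `model`, `tokenizer`, and `input_features`.
    """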
    
    # Build prompt with custom instruction
    audio_tokens = [AUDIO_TOKEN_ID] * NUM_AUDIO_TOKENS
    input_ids_audio = torch.tensor([audio_tokens], device=model.device)
    
    p1 = tokenizer.encode(
        "<|im_start|>user\n<|audio_bos|>",
        add_special_tokens=False,
        return_tensors="pt"
    ).to(model.device)
    
    p2 = tokenizer.encode(
        f"<|audio_eos|>\n{instruction_text}<|im_end|>\n<|im_start|>assistant\n",
        add_special_tokens=False,
        return_tensors="pt"
    ).to(model.device)
    
    input_ids = torch.cat([p1, input_ids_audio, p2], dim=1)
    attention_mask = torch.ones_like(input_ids)
    
    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=input_ids,
            input_features=input_features,
            attention_mask=attention_mask,
            max_new_tokens=128,
            do_sample=False,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id
        )
    
    return tokenizer.decode(
        generated_ids[0][input_ids.shape[1]:],
        skip_special_tokens=True
    ).strip()

# Try different instructions
instructions = [
    "Transcribe this audio.",
    "What does the speaker say?",
    "Please transcribe the following audio.",
    "Convert this speech to text."
]

print("🔍 Testing different instructions:\n")
for instruction in instructions:
    result = transcribe_with_instruction(instruction)
    print(f"Instruction: {instruction}")
    print(f"Response: {result}")
    print("-" * 80)

🔍 Testing different instructions:

Instruction: Transcribe this audio.
Response: MY NAME IS KUOSSUM ABDULLA AND I'M RECORDING THIS AUDIO TO TEST THIS WHISPER TO QWIN GRAFTEDE AUDIO TO TEXT MODEL
--------------------------------------------------------------------------------
Instruction: What does the speaker say?
Response: The speaker says, "My name is KUOSEM ABDULLA and I'm recording this audio to test this WHISPER TO QWIN GRAFTEDE AUDIO TO TEXT MODEL."
--------------------------------------------------------------------------------
Instruction: Please transcribe the following audio.
Response: MY NAME IS KUOSSUM ABDULLA AND I'M RECORDING THIS AUDIO TO TEST THIS WHISPER TO QWIN GRAFTEDE AUDIO TO TEXT MODEL
--------------------------------------------------------------------------------
Instruction: Convert this speech to text.
Response: MY NAME IS KUOSSUM ABDULLA AND I'M RECORDING THIS AUDIO TO TEST THIS WHISPER TO QWIN GRAFTEDE AUDIO TO TEXT MODEL
--------------------------------------------------------------------------------
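
To compare responses more systematically than by eye, you can normalize the text and count word-level edits between two outputs. The sketch below applies a plain Levenshtein distance over words to two of the responses printed above. The helper names are hypothetical, and the quoted sentence from the "What does the speaker say?" response was extracted by hand.

```python
import re


def normalize(text: str) -> list:
    """Uppercase the text and keep only runs of letters, digits, and apostrophes."""
    return re.findall(r"[A-Z0-9']+", text.upper())


def word_edit_distance(a: list, b: list) -> int:
    """Levenshtein distance over words (substitutions, insertions, deletions)."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


plain = ("MY NAME IS KUOSSUM ABDULLA AND I'M RECORDING THIS AUDIO TO TEST THIS "
         "WHISPER TO QWIN GRAFTEDE AUDIO TO TEXT MODEL")
quoted = ("My name is KUOSEM ABDULLA and I'm recording this audio to test this "
          "WHISPER TO QWIN GRAFTEDE AUDIO TO TEXT MODEL.")

distance = word_edit_distance(normalize(plain), normalize(quoted))
print(distance, "of", len(normalize(plain)), "words differ")   # 1 of 21 words differ
```

Dividing the distance by the number of words in a trusted reference transcript would give a word error rate. No reference transcript is included in this notebook, so the example only compares two model outputs with each other.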

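## 🎯 Summary

1. ✅ Recorded custom audio
2. ✅ Processed to correct format (16kHz WAV)
3. ✅ Used BytesIO for in-memory handling
4. ✅ Ran inference with your trained model
5. ✅ Got a transcription that follows the spoken sentence closely, with the apparent errors limited to the name and a few unusual words (`KUOSSUM ABDULLA`, `QWIN GRAFTEDE`)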
