
# Multimodal Arabic Poetry Generation

This notebook sketches a multimodal pipeline that generates Arabic poetry from text, image, or audio input by conditioning a pretrained Arabic language model on features from each modality.

## 1. Setup and Installation

First, install the required libraries:

In [None]:
!pip install transformers datasets torch torchvision torchaudio
!pip install speechbrain audiomentations
!pip install arabic-reshaper python-bidi pytesseract
!pip install gtts nltk  # Used in the output rendering and evaluation sections
!pip install gradio  # For eventual demo

## 2. Data Loading and Preparation

The cell below loads three datasets from Hugging Face:
- Classical Arabic Poetry Dataset (text)
- Arabic Calligraphy (images)
- Common Voice Arabic (audio)

In [None]:
from datasets import load_dataset

# Load text and image datasets
poetry_dataset = load_dataset("arbml/ClassicalArabicPoetryDataset", split="train")
calligraphy_dataset = load_dataset("arbml/ArabicCalligraphy", split="train")

# Load audio dataset
audio_dataset = load_dataset("common_voice", "ar", split="train[:100]")

print(f"Loaded {len(poetry_dataset)} poetic verses")
print(f"Loaded {len(calligraphy_dataset)} calligraphy images")
print(f"Loaded {len(audio_dataset)} audio samples")
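
The cell above only loads the data. As a minimal sketch of one common (and optional) preparation step, the helper below strips diacritics and tatweel from a verse and collapses whitespace. The name `normalize_verse` is hypothetical, and whether diacritics should be removed at all depends on the tokenizer and on whether meter information needs to be preserved.

```python
import re
import unicodedata

TATWEEL = "\u0640"  # elongation character used for visual justification

def normalize_verse(verse: str) -> str:
    """Drop combining marks (harakat, shadda) and tatweel, collapse whitespace."""
    # Arabic diacritics are nonspacing marks (Unicode category "Mn")
    no_marks = "".join(ch for ch in verse if unicodedata.category(ch) != "Mn")
    no_tatweel = no_marks.replace(TATWEEL, "")
    return re.sub(r"\s+", " ", no_tatweel).strip()

# Example: a diacritized, elongated phrase reduced to its bare letters
print(normalize_verse("الحُبُّ  جميــل "))
```

Applied to a dataset, this could be mapped over the text column before tokenization; the column name would depend on the dataset's schema.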

## 3. Model Architecture

We'll combine three pretrained models:
- AraGPT2 for text generation
- CLIP for image understanding
- SpeechT5 (speech-to-text checkpoint) for audio encoding

In [None]:
from transformers import (
    AutoTokenizer, AutoModelForCausalLM,
    CLIPProcessor, CLIPModel,
    SpeechT5Processor, SpeechT5ForSpeechToText
)

# Text components
text_tokenizer = AutoTokenizer.from_pretrained("aubmindlab/aragpt2-base")
text_model = AutoModelForCausalLM.from_pretrained("aubmindlab/aragpt2-base")

# Image components
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# Audio components: the speech-to-text checkpoint has an encoder that consumes
# waveforms, whereas the text-to-speech checkpoint only encodes text
speech_processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_asr")
speech_model = SpeechT5ForSpeechToText.from_pretrained("microsoft/speecht5_asr")

## 4. Multimodal Fusion Layer

This layer combines features from different modalities:

In [None]:
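import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    def __init__(self, text_dim, image_dim, audio_dim):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, 512)
        self.image_proj = nn.Linear(image_dim, 512)
        self.audio_proj = nn.Linear(audio_dim, 512)
        # batch_first=True: inputs are (batch, sequence, features)
        self.attention = nn.MultiheadAttention(512, 8, batch_first=True)

    def forward(self, text_feats, image_feats, audio_feats):
        # Project all features to the same dimension
        text = self.text_proj(text_feats)
        image = self.image_proj(image_feats)
        audio = self.audio_proj(audio_feats)

        # Pooled features (e.g. from CLIP) are 2-D; add a length-1 sequence axis
        if image.dim() == 2:
            image = image.unsqueeze(1)
        if audio.dim() == 2:
            audio = audio.unsqueeze(1)

        # Concatenate along the sequence axis and apply self-attention
        combined = torch.cat([text, image, audio], dim=1)
        attn_output, _ = self.attention(combined, combined, combined)
        # Mean-pool over the sequence: one 512-dim vector per example
        return attn_output.mean(dim=1)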
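
To make the tensor shapes in this fusion step concrete, here is a small standalone sketch with toy dimensions. The sizes (batch of 2, width 16, 5 text tokens, 7 audio frames) are arbitrary assumptions chosen for readability, not values used by the models above.

```python
import torch
import torch.nn as nn

batch, dim = 2, 16
text = torch.randn(batch, 5, dim)    # 5 text tokens per example
image = torch.randn(batch, dim)      # one pooled vector per image
audio = torch.randn(batch, 7, dim)   # 7 audio frames per example

# Treat the pooled image vector as a length-1 sequence before concatenating
tokens = torch.cat([text, image.unsqueeze(1), audio], dim=1)

# batch_first=True makes the layer expect (batch, sequence, features)
attention = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
attended, weights = attention(tokens, tokens, tokens)

# Mean-pooling over the sequence axis gives one fused vector per example
pooled = attended.mean(dim=1)

print(tokens.shape, attended.shape, weights.shape, pooled.shape)
```

Every position attends to every other position regardless of modality, which is why the attention weights form a square matrix over the concatenated sequence.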

## 5. Training Pipeline

Custom training loop for multimodal inputs:

In [None]:
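import torch
import torch.nn as nn

# Initialize fusion layer (the SpeechT5 encoder emits 768-dim states)
fusion_layer = MultimodalFusion(text_dim=768, image_dim=512, audio_dim=768)
# Map the 512-dim fused vector into the language model's embedding space
prefix_proj = nn.Linear(512, text_model.config.n_embd)

# GPT-2 style tokenizers may lack a pad token; fall back to EOS
if text_tokenizer.pad_token is None:
    text_tokenizer.pad_token = text_tokenizer.eos_token

def train_multimodal(batch):
    # Process different modalities (audio: 16 kHz mono waveforms)
    text_inputs = text_tokenizer(batch["text"], return_tensors="pt", padding=True)
    image_inputs = clip_processor(images=batch["image"], return_tensors="pt")
    audio_inputs = speech_processor(audio=batch["audio"], sampling_rate=16000,
                                    return_tensors="pt", padding=True)

    # Get features from each modality
    text_feats = text_model(**text_inputs, output_hidden_states=True).hidden_states[-1]
    with torch.no_grad():  # CLIP and SpeechT5 are kept frozen in this sketch
        image_feats = clip_model.get_image_features(**image_inputs)
        audio_feats = speech_model.speecht5.encoder(
            input_values=audio_inputs["input_values"],
            attention_mask=audio_inputs.get("attention_mask"),
        ).last_hidden_state

    # Fuse features into one vector and use it as a prefix embedding
    # (one possible way to condition the language model on the fused input)
    fused_feats = fusion_layer(text_feats, image_feats, audio_feats)
    prefix = prefix_proj(fused_feats).unsqueeze(1)  # (batch, 1, n_embd)
    token_embeds = text_model.get_input_embeddings()(text_inputs["input_ids"])
    inputs_embeds = torch.cat([prefix, token_embeds], dim=1)

    # Build mask and labels; -100 excludes the prefix and padding from the loss
    ones = torch.ones_like(text_inputs["attention_mask"][:, :1])
    attention_mask = torch.cat([ones, text_inputs["attention_mask"]], dim=1)
    labels = text_inputs["input_ids"].masked_fill(text_inputs["attention_mask"] == 0, -100)
    labels = torch.cat([torch.full_like(ones, -100), labels], dim=1)

    # Teacher-forced language modeling loss on the poem text
    outputs = text_model(inputs_embeds=inputs_embeds,
                         attention_mask=attention_mask, labels=labels)
    return outputs.loss

# Training loop (simplified); train_loader is assumed to yield dicts with
# aligned "text", "image", and "audio" entries
params = (list(text_model.parameters()) + list(fusion_layer.parameters())
          + list(prefix_proj.parameters()))
optimizer = torch.optim.AdamW(params, lr=5e-5)

text_model.train()
for epoch in range(3):
    for batch in train_loader:
        loss = train_multimodal(batch)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    print(f"Epoch {epoch} Loss: {loss.item()}")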
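
The loop above iterates over a `train_loader` that the notebook never defines. As a minimal sketch, assuming each training sample is a dict with `"text"`, `"image"`, and `"audio"` keys (an assumption, since the three datasets loaded earlier are separate and not aligned), a plain-Python batcher could look like this. The name `iter_batches` is hypothetical.

```python
from typing import Dict, Iterable, Iterator, List

def iter_batches(samples: Iterable[dict], batch_size: int) -> Iterator[Dict[str, List]]:
    """Group aligned samples into dicts of lists, one list per modality."""
    batch = {"text": [], "image": [], "audio": []}
    for sample in samples:
        for key in batch:
            batch[key].append(sample[key])
        if len(batch["text"]) == batch_size:
            yield batch
            batch = {key: [] for key in batch}
    # Emit the final, possibly smaller, batch
    if batch["text"]:
        yield batch

# Toy samples with string stand-ins for images and waveforms
toy = [{"text": f"t{i}", "image": f"img{i}", "audio": f"wav{i}"} for i in range(5)]
train_batches = list(iter_batches(toy, batch_size=2))
print([len(b["text"]) for b in train_batches])
```

Materializing the generator with `list(...)` matters here: a bare generator is exhausted after the first epoch, whereas a list can be iterated once per epoch.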

## 6. Poetry Generation

Function to generate poetry from any input modality:

In [None]:
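import torch

def generate_poetry(input_text=None, input_image=None, input_audio=None):
    # Relies on fusion_layer and prefix_proj from the training section
    gen_kwargs = dict(max_new_tokens=100, num_beams=5, early_stopping=True)

    with torch.no_grad():
        if input_text is not None:
            # Text prompt: let the language model continue it directly
            inputs = text_tokenizer(input_text, return_tensors="pt")
            poetry_ids = text_model.generate(**inputs, **gen_kwargs)
            return text_tokenizer.decode(poetry_ids[0], skip_special_tokens=True)

        # Process the input modality and project it to the fused dimension
        if input_image is not None:
            inputs = clip_processor(images=input_image, return_tensors="pt")
            feats = fusion_layer.image_proj(clip_model.get_image_features(**inputs))
        elif input_audio is not None:
            inputs = speech_processor(audio=input_audio, sampling_rate=16000,
                                      return_tensors="pt")
            states = speech_model.speecht5.encoder(
                input_values=inputs["input_values"]).last_hidden_state
            feats = fusion_layer.audio_proj(states).mean(dim=1)
        else:
            raise ValueError("Provide input_text, input_image, or input_audio")

        # Generate text conditioned on a single prefix embedding
        prefix = prefix_proj(feats).unsqueeze(1)  # (1, 1, n_embd)
        poetry_ids = text_model.generate(
            inputs_embeds=prefix,
            attention_mask=torch.ones(prefix.shape[:2], dtype=torch.long),
            **gen_kwargs
        )
    return text_tokenizer.decode(poetry_ids[0], skip_special_tokens=True)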

# Example usage
print(generate_poetry(input_text="الغروب الجميل"))

## 7. Output Rendering

Convert generated text to multimedia formats:

In [None]:
from gtts import gTTS
from IPython.display import Audio
from arabic_reshaper import reshape
from bidi.algorithm import get_display

def text_to_speech(text, lang="ar"):
    # Synthesize the poem with gTTS and return a playable notebook widget
    tts = gTTS(text=text, lang=lang, slow=True)
    tts.save("poem.mp3")
    return Audio("poem.mp3")

def create_video(text, image_path):
    # Join letters into contextual forms, then reorder for right-to-left display
    reshaped_text = reshape(text)
    bidi_text = get_display(reshaped_text)

    # Placeholder: drawing bidi_text onto the image at image_path and encoding
    # the frames (e.g. with OpenCV) is not implemented yet, so no file is written
    return "output.mp4"

# Generate multimedia output
poem = generate_poetry(input_text="الحب")
audio = text_to_speech(poem)
video = create_video(poem, "background.jpg")

## 8. Evaluation Metrics

Implement automatic evaluation metrics:

In [None]:
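from collections import Counter
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def evaluate_poetry(reference, generated):
    # BLEU score (smoothed, since short poems have few 4-gram matches)
    bleu = sentence_bleu([reference.split()], generated.split(),
                         smoothing_function=SmoothingFunction().method1)

    # Rhyme proxy (Arabic-specific): share of lines ending in the most
    # common final letter; 1.0 means every line ends in the same letter
    lines = [line for line in generated.split('\n') if line.strip()]
    last_letters = [line.split()[-1][-1] for line in lines]
    counts = Counter(last_letters)
    rhyme_score = counts.most_common(1)[0][1] / len(last_letters) if last_letters else 0.0

    return {"bleu": bleu, "rhyme": rhyme_score}

# Example evaluation
poem = "يا طائر الغرد في الأيك\nقد كنت أهواك من قديم"
# Placeholder reference: scoring a poem against itself is only a sanity check;
# substitute a held-out reference poem
reference_poem = poem
print(evaluate_poetry(reference_poem, poem))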
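
The rhyme check above works on line endings, which can be misleading when the text carries diacritics or tatweel after the final letter. As a rough sketch under that assumption, the helpers below take the last base letter of each line as the rhyme letter and measure how many lines share the most common one. The names `rhyme_letter` and `rhyme_consistency` are hypothetical, and a single final letter is only a crude proxy for classical rhyme.

```python
from collections import Counter

TATWEEL = "\u0640"  # counts as a letter in Unicode, so it is excluded explicitly

def rhyme_letter(line: str) -> str:
    """Return the last base letter of a line, skipping diacritics and tatweel."""
    # str.isalpha() is False for combining marks such as fatha or tanwin
    letters = [ch for ch in line if ch.isalpha() and ch != TATWEEL]
    return letters[-1] if letters else ""

def rhyme_consistency(poem: str) -> float:
    """Fraction of non-empty lines that end in the most common rhyme letter."""
    finals = [rhyme_letter(line) for line in poem.splitlines()]
    finals = [letter for letter in finals if letter]
    if not finals:
        return 0.0
    return Counter(finals).most_common(1)[0][1] / len(finals)

# Two lines ending in meem (one with tanwin) and one ending in kaf
sample = "قَدِيمٌ\nكريم\nالأيك"
print(rhyme_consistency(sample))
```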

## Next Steps

1. Implement reinforcement learning for 'arūḍ meter alignment
2. Add diacritization support using Mishkal
3. Scale up training using distributed computing
4. Develop Gradio demo interface

Note: Full training of this pipeline is GPU-intensive. A suggested setup is:
- 4x NVIDIA A100 GPUs
- Mixed precision training
- Dataset sharding
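
To illustrate the dataset sharding item, here is a minimal sketch of strided sharding, where each of `world_size` workers takes every `world_size`-th example starting at its own `rank`. The function name `shard` and the worker count are assumptions for illustration; Hugging Face `datasets` provides a comparable `Dataset.shard(num_shards=..., index=...)` method.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")

def shard(items: Sequence[T], rank: int, world_size: int) -> List[T]:
    """Return the examples assigned to worker `rank` out of `world_size` workers."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    # Strided slicing keeps shard sizes within one example of each other
    return list(items[rank::world_size])

# Split ten example indices across four workers
examples = list(range(10))
shards = [shard(examples, rank, 4) for rank in range(4)]
print(shards)
```

Each example lands in exactly one shard, so the workers together cover the dataset once per epoch without overlap.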