Classify sentiment from text + image using a novel cross-modal attention architecture — RoBERTa meets CLIP ViT.
Social media content is inherently multimodal — a sarcastic caption paired with a joyful image completely changes the sentiment. This project addresses that challenge by fusing language understanding (RoBERTa) with visual understanding (CLIP ViT) through a cross-modal attention mechanism that lets text tokens attend directly to image patch features.
Key results on MVSA-Single:
| Model | Accuracy | F1 (Macro) |
|---|---|---|
| Text-only (RoBERTa) | 72.1% | 70.8% |
| Image-only (CLIP) | 65.3% | 63.7% |
| Late Fusion (concat) | 74.6% | 73.2% |
| Ours (Cross-Attn) | 77.4% | 76.1% |
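For readers comparing the two columns: macro F1 is the unweighted mean of the per-class F1 scores, so the three classes count equally regardless of how many samples each has. The sketch below is a minimal pure-Python illustration, not the code in `src/training/evaluate.py`; the function names and the toy labels are assumptions made for this example.

```python
from collections import Counter

LABELS = ["negative", "neutral", "positive"]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of per-class F1 scores."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # gold class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(labels)


# Toy labels, invented for illustration only (not MVSA data).
y_true = ["positive", "positive", "neutral", "negative", "negative", "neutral"]
y_pred = ["positive", "neutral", "neutral", "negative", "positive", "neutral"]
print(f"accuracy={accuracy(y_true, y_pred):.3f}  macro_f1={macro_f1(y_true, y_pred):.3f}")
```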
Text ──► RoBERTa-base ──► Token Sequences [B, T, 768]
│
▼ (queries)
Cross-Modal Attention ──────────────────┐
▲ (keys/values) │
Image ──► CLIP ViT-B/32 ──► Patch Features [B, 197, 768] │
│ │
└──► Global Embedding [B, 512] │
│
Attended CLS [768] + Raw Text CLS [768] + CLIP [512] ◄─┘
│
Fusion MLP
│
Sentiment Logits [3]
(negative / neutral / positive)
Why cross-modal attention instead of simple concatenation?
Concatenation (late fusion) encodes each modality separately and combines only the pooled vectors, so the classifier never sees token-to-region interactions. Cross-attention lets the model learn which image regions are relevant to each word; for example, the word "beautiful" can attend to the sky region in a landscape photo.
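The data flow in the diagram can be sketched in PyTorch as follows. This is a rough illustration under stated assumptions, not the code in `src/model/classifier.py`: the class names, the number of heads, the residual connection with LayerNorm, the MLP hidden size, and the dropout rate are all made up for the example. Only the tensor shapes (`[B, T, 768]`, `[B, 197, 768]`, `[B, 512]`, 3 logits) come from the diagram, and random tensors stand in for the RoBERTa and CLIP outputs.

```python
import torch
import torch.nn as nn


class CrossModalAttentionSketch(nn.Module):
    """Text tokens (queries) attend to image patch features (keys/values)."""

    def __init__(self, dim=768, num_heads=8):  # num_heads is an assumption
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)  # residual + norm is an assumption

    def forward(self, text_tokens, image_patches):
        attended, weights = self.attn(
            query=text_tokens, key=image_patches, value=image_patches
        )
        # weights: [B, T, P], averaged over heads by default
        return self.norm(text_tokens + attended), weights


class FusionHeadSketch(nn.Module):
    """Concatenate attended CLS, raw text CLS, and CLIP embedding, then classify."""

    def __init__(self, text_dim=768, clip_dim=512, hidden=512, num_classes=3):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * text_dim + clip_dim, hidden),  # 768 + 768 + 512 = 2048
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, attended_cls, raw_cls, clip_global):
        fused = torch.cat([attended_cls, raw_cls, clip_global], dim=-1)
        return self.mlp(fused)


# Random stand-ins for encoder outputs, using the shapes from the diagram.
B, T, P = 2, 16, 197
text_tokens = torch.randn(B, T, 768)    # RoBERTa token sequence
image_patches = torch.randn(B, P, 768)  # CLIP patch features
clip_global = torch.randn(B, 512)       # CLIP global embedding

cross_attn = CrossModalAttentionSketch()
head = FusionHeadSketch()

attended, attn_weights = cross_attn(text_tokens, image_patches)
logits = head(attended[:, 0], text_tokens[:, 0], clip_global)
print(attended.shape, attn_weights.shape, logits.shape)
```

Each row of `attn_weights` is a distribution over image patches for one text token, which is the quantity a heatmap over patches would visualize.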
multimodal-sentiment/
├── src/
│ ├── model/
│ │ └── classifier.py # CrossModalAttention + MultimodalSentimentClassifier
│ ├── data/
│ │ ├── dataset.py # PyTorch Dataset + DataLoader builder
│ │ └── preprocessing.py # Tweet cleaning + MVSA label parser
│ ├── training/
│ │ ├── trainer.py # Training loop with MLflow + early stopping
│ │ └── evaluate.py # Accuracy, F1, confusion matrix
│ ├── api/
│ │ ├── main.py # FastAPI inference server
│ │ └── schemas.py # Pydantic request/response models
│ └── utils/
│ └── visualize.py # Attention heatmap + token importance plots
├── app/
│ └── demo.py # Gradio interactive demo
├── configs/
│ └── config.yaml # All hyperparameters
├── examples/
│ └── inference.py # CLI quickstart
├── tests/
│ ├── test_model.py
│ └── test_api.py
├── Dockerfile
└── docker-compose.yml
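The tree above describes `trainer.py` as a training loop with early stopping. As a rough idea of what patience-based early stopping looks like, here is a minimal sketch. The `EarlyStopping` class, its `patience` and `min_delta` parameters, and the validation scores are hypothetical and are not taken from the repository.

```python
class EarlyStopping:
    """Stop when a validation metric (higher is better) stops improving."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # minimum gain that counts as improvement
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, metric):
        """Record one epoch's metric; return True when training should stop."""
        if metric > self.best + self.min_delta:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation F1 scores, one per epoch, for illustration only.
val_f1_per_epoch = [0.61, 0.68, 0.70, 0.69, 0.70, 0.66]

stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, val_f1 in enumerate(val_f1_per_epoch, start=1):
    if stopper.step(val_f1):
        stopped_at = epoch
        break

print(f"stopped at epoch {stopped_at}, best val F1 {stopper.best:.2f}")
```

In a real loop the model checkpoint would typically be saved whenever `best` improves, so the weights from the best epoch can be restored after stopping.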
git clone https://github.com/maazhaider11/multimodal-sentiment.git
cd multimodal-sentiment
pip install -r requirements.txt
Register at mvsa.nlpir.org and download the dataset, then:
mkdir -p data/images
# Place mvsa_single.zip in data/
python -c "
from src.data.preprocessing import download_mvsa_single, build_labels_csv
download_mvsa_single('data')
build_labels_csv('data/mvsa_single', 'data/labels.csv')
"
python -c "
import yaml, torch
from src.data.dataset import build_dataloaders
from src.model.classifier import MultimodalSentimentClassifier
from src.training.trainer import train
cfg = yaml.safe_load(open('configs/config.yaml'))
loaders = build_dataloaders(**cfg['data'])
model = MultimodalSentimentClassifier(**cfg['model'])
train(model, loaders['train'], loaders['val'], cfg['training'])
"
MLflow dashboard → mlflow ui → http://localhost:5000
python examples/inference.py --text "Absolutely love this sunset!" --image path/to/image.jpg
Output:
=============================================
Text : Absolutely love this sunset!
Modality : multimodal
─────────────────────────────────────────────
😞 negative 0.031 █
😐 neutral 0.112 ███
😊 positive 0.857 █████████████████████████ ◀ PREDICTED
=============================================
uvicorn src.api.main:app --reload --port 8000
curl -X POST http://localhost:8000/predict \
-H "Content-Type: application/json" \
-d '{"text": "Best day of my life!", "image_url": "https://example.com/photo.jpg"}'
{
"label": "positive",
"confidence": 0.913,
"probabilities": {"negative": 0.031, "neutral": 0.056, "positive": 0.913},
"text": "Best day of my life!",
"modality": "multimodal"
}
python app/demo.py
# → http://localhost:7860
docker-compose up --build
- API: http://localhost:8000
- Demo: http://localhost:7860
- MLflow: http://localhost:5000
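With the API listening on port 8000, the `/predict` endpoint can also be called from Python. The sketch below uses only the standard library and mirrors the curl example and JSON response shown in this README; the helper names (`build_request`, `parse_prediction`, `predict`) and the sanity checks are assumptions for illustration, not part of the project. `predict` needs a running server, so the demo line only parses the sample response.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/predict"


def build_request(text, image_url, url=API_URL):
    """Build the same POST request as the curl example."""
    body = json.dumps({"text": text, "image_url": image_url}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_prediction(raw_json):
    """Return (label, confidence) after basic consistency checks."""
    result = json.loads(raw_json)
    probs = result["probabilities"]
    if abs(sum(probs.values()) - 1.0) > 1e-2:
        raise ValueError("probabilities do not sum to 1")
    if max(probs, key=probs.get) != result["label"]:
        raise ValueError("label does not match the highest probability")
    return result["label"], result["confidence"]


def predict(text, image_url):
    """Send the request to a running server and parse the reply."""
    with urllib.request.urlopen(build_request(text, image_url), timeout=30) as resp:
        return parse_prediction(resp.read().decode("utf-8"))


# Parse the sample response from this README without contacting a server.
sample = (
    '{"label": "positive", "confidence": 0.913, '
    '"probabilities": {"negative": 0.031, "neutral": 0.056, "positive": 0.913}, '
    '"text": "Best day of my life!", "modality": "multimodal"}'
)
print(parse_prediction(sample))
```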
Inspect what the model is looking at when making predictions:
from src.utils.visualize import plot_cross_attention, plot_token_importance
logits, attn_weights = model(..., return_attn=True)
# Heatmap: which image patches the [CLS] text token attends to
plot_cross_attention(image, attn_weights, tokens)
# Bar chart: which words are most important
plot_token_importance(tokens, attn_weights)
pytest tests/ -v

| Component | Library |
|---|---|
| Text Encoder | RoBERTa-base (HuggingFace Transformers) |
| Image Encoder | CLIP ViT-B/32 (OpenAI) |
| Framework | PyTorch 2.2 |
| Experiment Tracking | MLflow |
| REST API | FastAPI + Uvicorn |
| Demo UI | Gradio |
| Containerization | Docker + Docker Compose |
MIT © Maaz Haider