🧠 Multimodal Sentiment Analyzer

Classify sentiment from text + image using a novel cross-modal attention architecture — RoBERTa meets CLIP ViT.

Overview

Social media content is inherently multimodal — a sarcastic caption paired with a joyful image completely changes the sentiment. This project addresses that challenge by fusing language understanding (RoBERTa) with visual understanding (CLIP ViT) through a cross-modal attention mechanism that lets text tokens attend directly to image patch features.

Key results on MVSA-Single:

Model	Accuracy	F1 (Macro)
Text-only (RoBERTa)	72.1%	70.8%
Image-only (CLIP)	65.3%	63.7%
Late Fusion (concat)	74.6%	73.2%
Ours (Cross-Attn)	77.4%	76.1%
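
The table reports both accuracy and macro-averaged F1. As a reminder of how the two can diverge on imbalanced data, here is a minimal, dependency-free sketch. The function names and the toy labels are illustrative assumptions; they are not taken from src/training/evaluate.py and the numbers are not MVSA results.

```python
from collections import Counter

LABELS = ("negative", "neutral", "positive")


def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # Convention used here: a class with no true or predicted examples scores 0.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(labels)


# Toy data (made up): a classifier that always answers "positive".
y_true = ["positive"] * 4 + ["neutral", "negative"]
y_pred = ["positive"] * 6

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

On this toy input the accuracy is 4/6 (about 0.667) while macro F1 is only 0.8/3 (about 0.267), because the two minority classes each contribute an F1 of zero.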

Architecture

Text ──► RoBERTa-base ──► Token Sequences [B, T, 768]
                                    │
                                    ▼ (queries)
                         Cross-Modal Attention ──────────────────┐
                                    ▲ (keys/values)               │
Image ──► CLIP ViT-B/32 ──► Patch Features [B, 50, 768]          │
                    │                                             │
                    └──► Global Embedding [B, 512]               │
                                                                  │
         Attended CLS [768] + Raw Text CLS [768] + CLIP [512]  ◄─┘
                                    │
                               Fusion MLP
                                    │
                          Sentiment Logits [3]
                    (negative / neutral / positive)

Why cross-modal attention instead of simple concatenation?
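Concatenation fuses the modalities only after each encoder has pooled its input into a single vector, so the classifier never sees which word relates to which image region. Cross-attention lets the model explicitly learn which image regions are relevant to each word. For example, the word "beautiful" should attend to the sky region in a landscape photo.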
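
To make the mechanism concrete, the following is a minimal single-head sketch of scaled dot-product cross-attention with text tokens as queries and image patches as keys and values. It is written in plain Python for readability and is not the repository's CrossModalAttention, which presumably adds learned projections and multiple heads in PyTorch; the function names and toy vectors are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_tokens, image_patches):
    """Single-head scaled dot-product attention with text as queries.

    text_tokens:   T vectors of size d (queries)
    image_patches: P vectors of size d (keys and values)
    Returns (attended, weights): T vectors of size d and a T x P weight matrix.
    Learned Q/K/V projections are omitted to keep the sketch short.
    """
    d = len(text_tokens[0])
    attended, weights = [], []
    for q in text_tokens:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in image_patches
        ]
        w = softmax(scores)
        # Weighted sum of patch vectors: one image-aware vector per text token.
        out = [sum(wj * v[i] for wj, v in zip(w, image_patches)) for i in range(d)]
        attended.append(out)
        weights.append(w)
    return attended, weights


# Toy example: 2 text tokens, 3 image patches, d = 2.
text = [[1.0, 0.0], [0.0, 1.0]]
patches = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
attended, weights = cross_attention(text, patches)
```

Each row of weights sums to 1 and says how strongly one text token attends to each patch. In the toy input, the first token puts most of its weight on the first patch and the second token on the second patch.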

Project Structure

multimodal-sentiment/
├── src/
│   ├── model/
│   │   └── classifier.py       # CrossModalAttention + MultimodalSentimentClassifier
│   ├── data/
│   │   ├── dataset.py          # PyTorch Dataset + DataLoader builder
│   │   └── preprocessing.py   # Tweet cleaning + MVSA label parser
│   ├── training/
│   │   ├── trainer.py          # Training loop with MLflow + early stopping
│   │   └── evaluate.py         # Accuracy, F1, confusion matrix
│   ├── api/
│   │   ├── main.py             # FastAPI inference server
│   │   └── schemas.py          # Pydantic request/response models
│   └── utils/
│       └── visualize.py        # Attention heatmap + token importance plots
├── app/
│   └── demo.py                 # Gradio interactive demo
├── configs/
│   └── config.yaml             # All hyperparameters
├── examples/
│   └── inference.py            # CLI quickstart
├── tests/
│   ├── test_model.py
│   └── test_api.py
├── Dockerfile
└── docker-compose.yml

Quickstart

1. Install

git clone https://github.com/maazhaider11/multimodal-sentiment.git
cd multimodal-sentiment
pip install -r requirements.txt

2. Prepare Data (MVSA-Single)

Register at mvsa.nlpir.org and download the dataset, then:

mkdir -p data/images
# Place mvsa_single.zip in data/
python -c "
from src.data.preprocessing import download_mvsa_single, build_labels_csv
download_mvsa_single('data')
build_labels_csv('data/mvsa_single', 'data/labels.csv')
"
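
For readers wondering what the tweet-cleaning step might involve, here is a small sketch. The rules, placeholder tokens, and the name clean_tweet are hypothetical; the actual behavior of src/data/preprocessing.py is not shown in this README and may differ.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Normalize a raw tweet before tokenization (hypothetical rules)."""
    text = URL_RE.sub("<url>", text)      # replace links with a placeholder
    text = MENTION_RE.sub("@user", text)  # anonymize user mentions
    text = text.replace("#", "")          # keep hashtag words, drop the symbol
    return WHITESPACE_RE.sub(" ", text).strip()


cleaned = clean_tweet("RT @bob: Love  this #sunset! https://t.co/abc")
```

Here cleaned is "RT @user: Love this sunset! <url>". Whether to keep, replace, or drop URLs and mentions is a design choice that should match how the text encoder was fine-tuned.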

3. Train

python -c "
import yaml, torch
from src.data.dataset import build_dataloaders
from src.model.classifier import MultimodalSentimentClassifier
from src.training.trainer import train

with open('configs/config.yaml') as f:
    cfg = yaml.safe_load(f)
loaders = build_dataloaders(**cfg['data'])
model = MultimodalSentimentClassifier(**cfg['model'])
train(model, loaders['train'], loaders['val'], cfg['training'])
"

To view the MLflow dashboard, run mlflow ui and open http://localhost:5000.
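
The training loop is described as using early stopping. The sketch below shows one common way to implement it, tracking a validation metric where higher is better. The class name, the patience and min_delta parameters, and the simulated scores are assumptions for illustration; the logic in src/training/trainer.py and the keys in configs/config.yaml may differ.

```python
class EarlyStopping:
    """Signal a stop when the monitored metric has not improved for `patience` epochs."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, metric: float) -> bool:
        """Record one epoch's validation metric. Returns True when training should stop."""
        if metric > self.best + self.min_delta:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Simulated validation macro-F1 per epoch (made-up numbers).
history = [0.61, 0.68, 0.70, 0.69, 0.70, 0.69, 0.72]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, val_f1 in enumerate(history, start=1):
    if stopper.step(val_f1):
        stopped_at = epoch
        break
```

With a patience of 3, this run stops at epoch 6 and never reaches the better score at epoch 7, which illustrates the trade-off: a small patience saves compute but can cut off a late improvement.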

4. Run Inference (CLI)

python examples/inference.py --text "Absolutely love this sunset!" --image path/to/image.jpg

Output:

=============================================
  Text     : Absolutely love this sunset!
  Modality : multimodal
─────────────────────────────────────────────
  😞 negative  0.031  █
  😐 neutral   0.112  ███
  😊 positive  0.857  █████████████████████████  ◀ PREDICTED
=============================================

5. Start API Server

uvicorn src.api.main:app --reload --port 8000

curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"text": "Best day of my life!", "image_url": "https://example.com/photo.jpg"}'

{
  "label": "positive",
  "confidence": 0.913,
  "probabilities": {"negative": 0.031, "neutral": 0.056, "positive": 0.913},
  "text": "Best day of my life!",
  "modality": "multimodal"
}
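
The same request can be issued from Python. This is a minimal client sketch using only the standard library; the helper names are hypothetical, and treating image_url as optional is an assumption about the request schema rather than something this README confirms.

```python
import json
import urllib.request


def build_predict_request(text, image_url=None, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for the /predict endpoint."""
    payload = {"text": text}
    # Assumption: the server accepts text-only requests when image_url is omitted.
    if image_url is not None:
        payload["image_url"] = image_url
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text, image_url=None, base_url="http://localhost:8000", timeout=30):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_predict_request(text, image_url, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server from this step running, calling predict("Best day of my life!", "https://example.com/photo.jpg") should return a dict with the fields shown in the JSON response above.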

6. Gradio Demo

python app/demo.py
# → http://localhost:7860

7. Docker (full stack)

docker-compose up --build

Attention Visualization

Inspect what the model is looking at when making predictions:

from src.utils.visualize import plot_cross_attention, plot_token_importance

logits, attn_weights = model(..., return_attn=True)

# Heatmap: which image patches the [CLS] text token attends to
plot_cross_attention(image, attn_weights, tokens)

# Bar chart: which words are most important
plot_token_importance(tokens, attn_weights)
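
A heatmap over image patches requires turning a flat row of attention weights back into a 2D grid. The sketch below shows that reshaping step in plain Python. It assumes the weights for one text token are available as a flat sequence over image tokens, with the ViT class token at index 0; the actual shape of attn_weights and the internals of plot_cross_attention are not documented here, and patch_grid is a hypothetical helper.

```python
import math


def patch_grid(token_attention, has_cls_patch=True):
    """Reshape one row of attention weights into a square patch grid.

    token_attention: weights from one text token to every image token.
    has_cls_patch:   if True, index 0 is the ViT class token and is dropped.
    """
    weights = list(token_attention[1:] if has_cls_patch else token_attention)
    side = math.isqrt(len(weights))
    if side * side != len(weights):
        raise ValueError(f"{len(weights)} patch weights do not form a square grid")
    # ViT patches are ordered row by row, left to right, top to bottom.
    return [weights[r * side:(r + 1) * side] for r in range(side)]


# Toy input: one class-token weight followed by 9 patch weights (a 3 x 3 grid).
grid = patch_grid([0.5] + [float(i) for i in range(9)])
```

The resulting grid can then be upsampled to the image resolution and overlaid on the input picture.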

Run Tests

pytest tests/ -v

Tech Stack

Component	Library
Text Encoder	RoBERTa-base (HuggingFace Transformers)
Image Encoder	CLIP ViT-B/32 (OpenAI)
Framework	PyTorch 2.2
Experiment Tracking	MLflow
REST API	FastAPI + Uvicorn
Demo UI	Gradio
Containerization	Docker + Docker Compose
