maazhaider11/SentiVision

🧠 Multimodal Sentiment Analyzer

Classify sentiment from text + image using a novel cross-modal attention architecture — RoBERTa meets CLIP ViT.


Overview

Social media content is inherently multimodal: a sarcastic caption paired with a joyful image can flip the sentiment that either one conveys alone. This project addresses that challenge by fusing language understanding (RoBERTa) with visual understanding (CLIP ViT) through a cross-modal attention mechanism that lets text tokens attend directly to image patch features.

Key results on MVSA-Single:

Model                   Accuracy   F1 (Macro)
Text-only (RoBERTa)     72.1%      70.8%
Image-only (CLIP)       65.3%      63.7%
Late Fusion (concat)    74.6%      73.2%
Ours (Cross-Attn)       77.4%      76.1%
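
For readers comparing the two columns: accuracy is the fraction of correct predictions, while macro F1 averages the per-class F1 scores with equal weight, so a rare class counts as much as a common one. The following is a minimal sketch in plain Python. The function names and the toy labels are illustrative assumptions (not MVSA data), and the repository's evaluate.py may compute these metrics differently.

```python
from typing import Sequence

LABELS = ("negative", "neutral", "positive")


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that match the gold label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in LABELS:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); treated as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels for illustration only.
gold = ["positive", "positive", "neutral", "negative", "positive", "neutral"]
pred = ["positive", "neutral", "neutral", "negative", "positive", "positive"]
print(f"accuracy={accuracy(gold, pred):.3f}  macro_f1={macro_f1(gold, pred):.3f}")
```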

Architecture

Text ──► RoBERTa-base ──► Token Sequences [B, T, 768]
                                    │
                                    ▼ (queries)
                         Cross-Modal Attention ──────────────────┐
                                    ▲ (keys/values)               │
Image ──► CLIP ViT-B/32 ──► Patch Features [B, 50, 768]          │
                    │                                             │
                    └──► Global Embedding [B, 512]               │
                                                                  │
         Attended CLS [768] + Raw Text CLS [768] + CLIP [512]  ◄─┘
                                    │
                               Fusion MLP
                                    │
                          Sentiment Logits [3]
                    (negative / neutral / positive)

Why cross-modal attention instead of simple concatenation?
Late fusion by concatenation encodes each modality independently and joins only the pooled vectors at the end, so individual words and image regions never interact. Cross-attention lets each text token weight the image patches that matter for it: for example, the word "beautiful" can attend to the sky region in a landscape photo.
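
To make the data flow in the diagram concrete, here is a minimal PyTorch sketch of text-to-image cross-attention followed by the fusion step. It is not the repository's classifier.py: the class name, the number of heads, and the MLP hidden size are assumptions, and random tensors stand in for the RoBERTa and CLIP outputs.

```python
import torch
import torch.nn as nn


class CrossModalFusionSketch(nn.Module):
    """Text tokens query image patches; the result is fused with pooled features."""

    def __init__(self, dim: int = 768, clip_dim: int = 512,
                 num_heads: int = 8, num_classes: int = 3) -> None:
        super().__init__()
        # num_heads and the MLP width below are illustrative choices.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Input: attended CLS [768] + raw text CLS [768] + CLIP global embedding [512]
        self.fusion_mlp = nn.Sequential(
            nn.Linear(2 * dim + clip_dim, 512),
            nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, text_tokens, patch_feats, clip_global):
        # Queries come from text; keys and values come from image patches.
        attended, attn_weights = self.cross_attn(
            query=text_tokens, key=patch_feats, value=patch_feats
        )
        # Position 0 of the text sequence is taken as the CLS-style summary token.
        fused = torch.cat([attended[:, 0], text_tokens[:, 0], clip_global], dim=-1)
        return self.fusion_mlp(fused), attn_weights


# Toy sizes; real sequence lengths depend on the tokenizer and the vision encoder.
B, T, P = 2, 16, 10
model = CrossModalFusionSketch().eval()
with torch.no_grad():
    logits, attn = model(
        torch.randn(B, T, 768), torch.randn(B, P, 768), torch.randn(B, 512)
    )
print(logits.shape, attn.shape)
```

The returned attention weights have shape [B, T, P] (averaged over heads), so row t tells how strongly text token t attends to each image patch.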


Project Structure

multimodal-sentiment/
├── src/
│   ├── model/
│   │   └── classifier.py       # CrossModalAttention + MultimodalSentimentClassifier
│   ├── data/
│   │   ├── dataset.py          # PyTorch Dataset + DataLoader builder
│   │   └── preprocessing.py   # Tweet cleaning + MVSA label parser
│   ├── training/
│   │   ├── trainer.py          # Training loop with MLflow + early stopping
│   │   └── evaluate.py         # Accuracy, F1, confusion matrix
│   ├── api/
│   │   ├── main.py             # FastAPI inference server
│   │   └── schemas.py          # Pydantic request/response models
│   └── utils/
│       └── visualize.py        # Attention heatmap + token importance plots
├── app/
│   └── demo.py                 # Gradio interactive demo
├── configs/
│   └── config.yaml             # All hyperparameters
├── examples/
│   └── inference.py            # CLI quickstart
├── tests/
│   ├── test_model.py
│   └── test_api.py
├── Dockerfile
└── docker-compose.yml

Quickstart

1. Install

git clone https://github.com/maazhaider11/SentiVision.git multimodal-sentiment
cd multimodal-sentiment
pip install -r requirements.txt

2. Prepare Data (MVSA-Single)

Register at mvsa.nlpir.org and download the dataset, then:

mkdir -p data/images
# Place mvsa_single.zip in data/
python -c "
from src.data.preprocessing import download_mvsa_single, build_labels_csv
download_mvsa_single('data')
build_labels_csv('data/mvsa_single', 'data/labels.csv')
"

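The project structure lists tweet cleaning as part of preprocessing.py. As a rough idea of what such a step often looks like, here is a small sketch using only the standard library. The function name and the specific rules (drop links, mask user handles, keep hashtag words) are assumptions for illustration, not the repository's actual implementation.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Normalize a raw tweet before tokenization (hypothetical rules)."""
    text = URL_RE.sub("", text)            # drop links
    text = MENTION_RE.sub("@user", text)   # mask user handles
    text = text.replace("#", "")           # keep the hashtag word, drop the symbol
    return WHITESPACE_RE.sub(" ", text).strip()


print(clean_tweet("@bob Loving this #sunset!   https://t.co/abc123"))
```
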
3. Train

python -c "
import yaml
from src.data.dataset import build_dataloaders
from src.model.classifier import MultimodalSentimentClassifier
from src.training.trainer import train

# Load all hyperparameters from the YAML config
with open('configs/config.yaml') as f:
    cfg = yaml.safe_load(f)

# Build the data loaders and the model, then start training
loaders = build_dataloaders(**cfg['data'])
model = MultimodalSentimentClassifier(**cfg['model'])
train(model, loaders['train'], loaders['val'], cfg['training'])
"

MLflow dashboard: run mlflow ui, then open http://localhost:5000
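
The project structure mentions early stopping in trainer.py. The sketch below shows the general idea in plain Python: stop once a validation metric has failed to improve for a set number of epochs. The class name, the patience and min_delta parameters, and the metric values are all assumptions for illustration and do not come from the repository or its config.yaml.

```python
class EarlyStopping:
    """Signal a stop when a validation metric stalls for `patience` epochs."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, metric: float) -> bool:
        """Record one epoch's metric (higher is better); return True to stop."""
        if metric > self.best + self.min_delta:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Illustrative validation F1 values, not real training results.
stopper = EarlyStopping(patience=2)
for epoch, val_f1 in enumerate([0.61, 0.68, 0.70, 0.69, 0.70, 0.66], start=1):
    if stopper.step(val_f1):
        print(f"stopping after epoch {epoch}, best val F1 = {stopper.best:.2f}")
        break
```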

4. Run Inference (CLI)

python examples/inference.py --text "Absolutely love this sunset!" --image path/to/image.jpg

Output:

=============================================
  Text     : Absolutely love this sunset!
  Modality : multimodal
─────────────────────────────────────────────
  😞 negative  0.031  █
  😐 neutral   0.112  ███
  😊 positive  0.857  █████████████████████████  ◀ PREDICTED
=============================================
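
The three scores in the output sum to 1, which is consistent with a softmax over the three sentiment logits from the architecture diagram. A minimal sketch of that last step in plain Python follows. The function names and the logit values are hypothetical, and the repository may compute probabilities differently (for example with torch.softmax).

```python
import math
from typing import Dict, List, Sequence, Tuple

LABELS = ("negative", "neutral", "positive")


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits: Sequence[float]) -> Tuple[str, Dict[str, float]]:
    """Map raw logits to (predicted label, per-class probabilities)."""
    scores = dict(zip(LABELS, softmax(logits)))
    return max(scores, key=scores.get), scores


# Hypothetical logits, chosen only for illustration.
label, scores = predict_label([-1.2, 0.1, 2.1])
print(label, {name: round(p, 3) for name, p in scores.items()})
```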

5. Start API Server

uvicorn src.api.main:app --reload --port 8000

With the server running, send a request from a second terminal:

curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"text": "Best day of my life!", "image_url": "https://example.com/photo.jpg"}'

Response:

{
  "label": "positive",
  "confidence": 0.913,
  "probabilities": {"negative": 0.031, "neutral": 0.056, "positive": 0.913},
  "text": "Best day of my life!",
  "modality": "multimodal"
}
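
The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl call: the endpoint, the text and image_url fields, and the JSON content type are taken from the example, while the helper names and the timeout are assumptions.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/predict"  # same endpoint as the curl example


def build_request(text: str, image_url: str, url: str = API_URL) -> urllib.request.Request:
    """Build a JSON POST request matching the curl example."""
    payload = json.dumps({"text": text, "image_url": image_url}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text: str, image_url: str, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON body; the server must be running."""
    request = build_request(text, image_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not touch the network.
req = build_request("Best day of my life!", "https://example.com/photo.jpg")
print(req.get_method(), req.full_url)
```

With the server from this step running, predict("Best day of my life!", "https://example.com/photo.jpg") should return a dictionary with the fields shown in the response.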

6. Gradio Demo

python app/demo.py
# → http://localhost:7860

7. Docker (full stack)

docker-compose up --build

Attention Visualization

Inspect what the model is looking at when making predictions:

from src.utils.visualize import plot_cross_attention, plot_token_importance

logits, attn_weights = model(..., return_attn=True)

# Heatmap: which image patches the [CLS] text token attends to
plot_cross_attention(image, attn_weights, tokens)

# Bar chart: which words are most important
plot_token_importance(tokens, attn_weights)
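
To draw a heatmap, the attention one text token pays to the image tokens has to be laid out as a 2D patch grid. The sketch below shows that reshaping step in plain Python. It assumes the first image token is the class token and that patches are stored in row-major order, as in ViT-style encoders. The function name is hypothetical, and plot_cross_attention may handle this differently.

```python
import math
from typing import List, Sequence


def patch_attention_grid(weights: Sequence[float], has_cls: bool = True) -> List[List[float]]:
    """Reshape one token's attention over image tokens into a square patch grid."""
    patches = list(weights[1:] if has_cls else weights)
    side = math.isqrt(len(patches))
    if side * side != len(patches):
        raise ValueError(f"{len(patches)} patch weights do not form a square grid")
    # Renormalize so the grid sums to 1 after dropping the class token.
    total = sum(patches) or 1.0
    return [
        [w / total for w in patches[row * side:(row + 1) * side]]
        for row in range(side)
    ]


# Toy input: one class token followed by 3x3 patches with uniform weights.
grid = patch_attention_grid([0.1] * 10)
print(len(grid), len(grid[0]))
```

The resulting grid can then be upsampled to the image resolution and overlaid on the picture.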

Run Tests

pytest tests/ -v

Tech Stack

Component             Library
Text Encoder          RoBERTa-base (HuggingFace Transformers)
Image Encoder         CLIP ViT-B/32 (OpenAI)
Framework             PyTorch 2.2
Experiment Tracking   MLflow
REST API              FastAPI + Uvicorn
Demo UI               Gradio
Containerization      Docker + Docker Compose

License

MIT © Maaz Haider
