Implementation of Perez et al. (2017) with an interactive Streamlit application covering three use cases: visual question answering on Sort-of-CLEVR and CLEVR, and artistic style transfer via Conditional Instance Normalisation.
A common challenge in deep learning is conditioning — adapting a network's behavior based on external information (a question, a style, a class label).
FiLM addresses this with a simple and general idea: instead of concatenating the conditioning context to the network's inputs, a separate network transforms it into scale and shift parameters γ and β that directly modulate the CNN feature maps:
- γ amplifies, reduces or suppresses a feature map
- β shifts activations up or down
- Both are produced by a lightweight network (the FiLM generator) from the conditioning input (e.g. the question)
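As a minimal illustration (a NumPy sketch, not the repo's PyTorch code), feature-wise modulation is just a per-channel affine transform of the feature maps:

```python
import numpy as np

def film(features, gamma, beta):
    """Feature-wise linear modulation: scale and shift each channel."""
    # features: (C, H, W); gamma, beta: (C,) produced by the FiLM generator
    return gamma[:, None, None] * features + beta[:, None, None]

feats = np.ones((4, 8, 8))                   # toy feature maps, all ones
gamma = np.array([2.0, 1.0, 0.0, -1.0])      # amplify / keep / suppress / flip
beta  = np.array([0.5, 0.0, 0.0, 0.0])       # shift the first channel up

out = film(feats, gamma, beta)
# e.g. channel 0 becomes 2.5 everywhere, channel 2 is zeroed out
```

Because γ can be zero or negative, the generator can gate features off entirely or invert them, which is what gives FiLM its expressive power despite being a simple affine operation.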
In practice:

```bash
pip install -r requirements.txt
python -m streamlit run app.py
```

Everything runs from the interface; no further terminal interaction is needed.
Data and model weights are not included in the repo (too large). They are hosted on Google Drive and can be downloaded directly from the app.
| Dataset | Size | How to get it |
|---|---|---|
| Sort-of-CLEVR | ~200 MB | Button in the app |
| Style Transfer | ~400 MB | Button in the app |
| CLEVR VQA | ~18 GB | Manual (see below) |
A 2D Kaggle dataset of colored shapes with 11 answer classes. Each question is encoded as a 10-dimensional vector and passed to the FiLM generator, which modulates the CNN feature maps. You can train from scratch, load a pretrained model, and test visually on generated scenes.
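To make that pipeline concrete, here is a hypothetical NumPy sketch (the actual model lives in `sortofclevr/`; the generator shape and initialisation below are illustrative assumptions): a linear FiLM generator maps the 10-dimensional question encoding to one (γ, β) pair per feature channel:

```python
import numpy as np

rng = np.random.default_rng(0)
num_channels, q_dim = 4, 10                  # toy channel count; 10-dim question encoding

# Hypothetical linear FiLM generator: question vector -> (gamma, beta) per channel.
# Biases initialise gamma near 1 and beta near 0, i.e. close to an identity modulation.
W = rng.normal(scale=0.1, size=(2 * num_channels, q_dim))
b = np.concatenate([np.ones(num_channels), np.zeros(num_channels)])

q = np.zeros(q_dim)
q[3] = 1.0                                   # a one-hot-style question encoding
gamma, beta = np.split(W @ q + b, 2)         # generator output, split into gamma and beta

feats = rng.normal(size=(num_channels, 6, 6))  # CNN feature maps for the scene
modulated = gamma[:, None, None] * feats + beta[:, None, None]
```

Different questions produce different (γ, β) pairs, so the same CNN computes a question-specific function of the image.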
Full implementation of the paper's architecture. Since the dataset is ~18 GB, interactive training is not available; instead, the app displays the learning curves from our own run (~40k iterations).
To reproduce:

```bash
# Preprocess questions
python -m clevr.scripts.preprocess_questions \
    --input_questions_json CLEVR_v1.0/questions/CLEVR_train_questions.json \
    --output_h5_file clevr/data/train_questions.h5 \
    --output_vocab_json clevr/data/vocab.json

# Extract ResNet101 features
python -m clevr.scripts.extract_features \
    --data-dir clevr/data/clevr --split train

# Train
python -m clevr.scripts.train_model --model_type FiLM \
    --checkpoint_path clevr/data/film_checkpoint.pth \
    --batch_size 64 --num_iterations 100000 --loader_num_workers 0
```

Implementation of Ghiasi et al. (2017): the same FiLM conditioning idea applied to artistic style via Conditional Instance Normalisation. 6 styles are available with interactive inference from the app.
| Dataset | Validation Accuracy |
|---|---|
| Sort-of-CLEVR | ~94% (10 epochs) |
| CLEVR VQA (our run, 40k iters) | ~51% |
The gap with the paper's reported CLEVR VQA accuracy is mainly due to a reduced `hidden_dim` (256 vs. 4096) and a limited number of training iterations.
```
FiLMProjet/
├── app.py
├── pages/
│   ├── 0_Présentation.py
│   ├── 1_Sort_of_CLEVR.py
│   ├── 2_CLEVR_VQA.py
│   └── 3_Style_Transfer.py
├── sortofclevr/      # dataset, model, training
├── style_transfer/   # dataset, model, training
├── clevr/
│   ├── core/         # data, embedding, preprocess, utils
│   ├── models/       # film_net, film_gen, baselines, layers
│   ├── scripts/      # train, preprocess, extract features
│   └── data/         # vocab, h5 questions, result logs
├── assets/
└── requirements.txt
```
- Perez et al. (2017) — FiLM: Visual Reasoning with a General Conditioning Layer
- Johnson et al. (2017) — CLEVR: A Diagnostic Dataset for Visual Reasoning
- Ghiasi et al. (2017) — Exploring the structure of a real-time, arbitrary neural artistic stylization network
- Original FiLM codebase: github.com/ethanjperez/film
- Iliès Chenene
- Valentin Porlier