Commit
Showing 28 changed files with 4,812 additions and 2 deletions.

@@ -99,3 +99,5 @@ ENV/

# mypy
.mypy_cache/

**.DS_Store

@@ -1,2 +1,117 @@
# Multimodal Variational Autoencoder
A PyTorch implementation of *Multimodal Generative Models for Scalable Weakly-Supervised Learning* (https://arxiv.org/abs/1802.05335).

## Setup/Installation

Create a new conda environment and install the necessary dependencies. See [here](https://www.pyimagesearch.com/2017/03/27/how-to-install-dlib/) for more details on installing `dlib`.
```
conda create -n multimodal python=2.7 anaconda
# activate the environment
source activate multimodal
# install pytorch
conda install pytorch torchvision -c pytorch
pip install tqdm
pip install scikit-image
pip install opencv-python
pip install imutils
# install dlib
brew install cmake
brew install boost
pip install dlib
```
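
As an optional sanity check (this step is not in the original instructions), confirm that the core packages import cleanly in the new environment:

```
from __future__ import print_function
import torch, torchvision, cv2, skimage, dlib, tqdm, imutils
print('PyTorch', torch.__version__)
```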

Some additional setup is needed for the CelebA-related datasets. Download the aligned-and-cropped version [here](http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html), along with the annotation files. For the computer vision experiment, we precompute several transformed versions of the CelebA images. The dlib model we use to extract landmarks comes from a PyImageSearch tutorial; you can download it [here](https://www.pyimagesearch.com/2017/04/03/facial-landmarks-dlib-opencv-python/). After downloading CelebA, try the following:
```
cd vision
# assuming CelebA images are stored in ./data/images
python setup.py grayscale ./data/images ./data/grayscale
python setup.py edge ./data/images ./data/edge
python setup.py mask ./data/images ./data/mask
```
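
To make the preprocessing concrete, here is a sketch of the kind of per-image transform these commands produce. It uses standard OpenCV calls and is only an illustration, not the repository's `setup.py`:

```
import os
import cv2

def preprocess(src_path, out_dir, mode='edge'):
    """Write a grayscale or edge-map version of one CelebA image."""
    img = cv2.imread(src_path)                    # BGR uint8 image
    out = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)   # grayscale modality
    if mode == 'edge':
        out = cv2.Canny(out, 100, 200)            # edge-map modality
    cv2.imwrite(os.path.join(out_dir, os.path.basename(src_path)), out)
```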

## Example Experiments
This repository contains a subset of the experiments mentioned in the [paper](https://arxiv.org/abs/1802.05335). In each folder, there are 3 scripts that one can run: `train.py` to fit the MVAE; `sample.py` to (conditionally) reconstruct from samples in the latent space; and `loglike.py` to compute the marginal log likelihood `log p(x)` using `q(z|x,y)` as the inference network.
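
As described in the paper, the joint inference network is a product of experts: the unimodal Gaussian posteriors are multiplied together with the prior. A minimal sketch of that combination (illustrative code, not the repository's exact implementation):

```
import torch

def product_of_experts(mu, logvar, eps=1e-8):
    """Combine Gaussian experts N(mu_i, var_i) with the prior N(0, I).
    mu, logvar: tensors of shape (n_experts, batch_size, n_latents)."""
    # prepend the prior expert (zero mean, unit variance)
    mu = torch.cat([torch.zeros_like(mu[:1]), mu], dim=0)
    logvar = torch.cat([torch.zeros_like(logvar[:1]), logvar], dim=0)
    var = torch.exp(logvar) + eps
    precision = 1.0 / var                                  # T_i = 1 / sigma_i^2
    joint_mu = torch.sum(mu * precision, dim=0) / torch.sum(precision, dim=0)
    joint_var = 1.0 / torch.sum(precision, dim=0)
    return joint_mu, torch.log(joint_var + eps)
```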

By default, we anneal KL from 0 to 1. The user can customize the learning rate (`--lr`), the number of latent dimensions (`--n-latents`), the annealing rate (`--annealing-epochs`), etc. from the command line. Notably, the user can set `lambda_image` and `lambda_text`, which balance the reconstruction terms; this tends to be important in practice. Training the model will save weights to the filesystem. Run `python train.py -h` for details.
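
Putting these pieces together, the training objective is a sum of per-modality reconstruction terms weighted by the lambdas, plus a KL term scaled by the annealing factor. A minimal sketch under those assumptions (the loss choices and argument names are illustrative, not the repository's exact code):

```
import torch
import torch.nn.functional as F

def elbo_loss(recon_image, image, recon_text, text, mu, logvar,
              lambda_image=1.0, lambda_text=1.0, annealing_factor=1.0):
    # reconstruction terms, one per modality (pick losses matching the decoders)
    image_nll = F.binary_cross_entropy(recon_image, image, reduction='sum')
    text_nll = F.cross_entropy(recon_text, text, reduction='sum')
    # KL(q(z|x,y) || N(0, I)) for a diagonal Gaussian posterior
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return lambda_image * image_nll + lambda_text * text_nll + annealing_factor * kl
```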

![experiment-reconstructions](./static/reconstructions1.png)

### MNIST
Treat images as one modality and the label (an integer from 0 to 9) as a second.

```
cd mnist
CUDA_VISIBLE_DEVICES=0 python train.py --lambda-text 50. --cuda
# model is stored in ./trained_models
CUDA_VISIBLE_DEVICES=0 python sample.py ./trained_models/model_best.pth.tar --cuda
# you can also condition on the label
CUDA_VISIBLE_DEVICES=0 python sample.py ./trained_models/model_best.pth.tar --condition-on-text 5 --cuda
```

### FashionMNIST
Very similar to MNIST, except the labels correspond to categories of fashion items.

```
cd fashionmnist
CUDA_VISIBLE_DEVICES=0 python train.py --lambda-text 50. --cuda
# model is stored in ./trained_models
CUDA_VISIBLE_DEVICES=0 python sample.py ./trained_models/model_best.pth.tar --cuda
# you can also condition on the label
CUDA_VISIBLE_DEVICES=0 python sample.py ./trained_models/model_best.pth.tar --condition-on-text 1 --cuda
```

### MultiMNIST
Again, an MNIST derivative, except each image contains up to 4 digits in fixed locations. The second modality is a string of digits representing the character(s) in the image. We employ an RNN in the label inference network `q(z|y)`, sketched below.
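
An RNN-based label encoder of this kind might look roughly as follows (an illustrative sketch; the module and argument names are assumptions, not the repository's exact code):

```
import torch.nn as nn

class TextEncoder(nn.Module):
    """Embed a digit string, run a GRU, and map the final hidden state
    to Gaussian parameters over the latent variables."""
    def __init__(self, n_latents, n_characters=10, embed_dim=64, hidden_dim=128):
        super(TextEncoder, self).__init__()
        self.embed = nn.Embedding(n_characters, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.mu = nn.Linear(hidden_dim, n_latents)
        self.logvar = nn.Linear(hidden_dim, n_latents)

    def forward(self, text):
        # text: LongTensor of shape (batch_size, sequence_length)
        embedded = self.embed(text)
        _, hidden = self.gru(embedded)
        hidden = hidden[-1]                       # (batch_size, hidden_dim)
        return self.mu(hidden), self.logvar(hidden)
```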

```
cd multimnist
CUDA_VISIBLE_DEVICES=0 python train.py --lambda-text 10. --cuda
# model is stored in ./trained_models
CUDA_VISIBLE_DEVICES=0 python sample.py ./trained_models/model_best.pth.tar --cuda
# you can also condition on the digits
CUDA_VISIBLE_DEVICES=0 python sample.py ./trained_models/model_best.pth.tar --condition-on-text 1773 --cuda
```

### CelebA
Treat images of celebrity faces as one modality and 18 attributes pertaining to the celebrity (e.g. gender, hair color) as a second modality.

```
cd celeba
CUDA_VISIBLE_DEVICES=0 python train.py --lambda-attrs 10. --cuda
# model is stored in ./trained_models
CUDA_VISIBLE_DEVICES=0 python sample.py ./trained_models/model_best.pth.tar --cuda
# you can also condition on the attribute
CUDA_VISIBLE_DEVICES=0 python sample.py ./trained_models/model_best.pth.tar --condition-on-attrs Male --cuda
```

### CelebA-19
Similar to CelebA, except we treat each attribute as its own expert in the product-of-experts. Here we begin to explore more than 2 modalities. See the code for an example of the MVAE training paradigm (mentioned in the paper) of subsampling multimodal ELBO terms.

```
cd celeba19
CUDA_VISIBLE_DEVICES=0 python train.py --lambda-attrs 10. --approx-m 1 --cuda
```

Here `--approx-m` sets the number of ELBO terms to sample beyond the complete and individual terms.
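
One way to realize this subsampling (an illustrative sketch, not the repository's exact code) is to draw `k` extra random subsets of the modalities each training step and add an ELBO term for each, alongside the complete and individual terms:

```
import random

def sample_elbo_subsets(modalities, k):
    """Return the complete set, every singleton, and k random subsets of
    size >= 2 (assumes at least 3 modalities). Illustrative only."""
    subsets = [tuple(modalities)] + [(m,) for m in modalities]
    for _ in range(k):
        size = random.randint(2, len(modalities) - 1)
        subsets.append(tuple(random.sample(modalities, size)))
    return subsets
```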

### Computer Vision Transformations
We learn a series of image processing transformations (i.e. colorization, image completion, edge detection, watermark removal, and facial landmark segmentation) as modalities. We curate a dataset by applying off-the-shelf tools to CelebA. For simplicity, in this implementation we only include the complete ELBO term (using all 6 modalities) and the 6 individual ELBO terms as the objective (in other words, `k = 0`). One can also subsample more ELBO terms to better approximate the true MVAE objective (as in `/celeba19/train.py`).

```
cd vision
CUDA_VISIBLE_DEVICES=0 python train.py --cuda
# model is stored in ./trained_models
CUDA_VISIBLE_DEVICES=0 python sample.py ./trained_models/model_best.pth.tar --cuda
# this will reconstruct all the modalities from the image
CUDA_VISIBLE_DEVICES=0 python sample.py ./trained_models/model_best.pth.tar --condition-file <path_to_file> --condition-type image --cuda
# we can also go in the other direction
CUDA_VISIBLE_DEVICES=0 python sample.py ./trained_models/model_best.pth.tar --condition-file <path_to_file> --condition-type watermark --cuda
```

![vision-reconstructions](./static/reconstructions2.png)

## Questions?
Please report any bugs and I will get to them ASAP. For any additional questions, feel free to email me@mikehwu.com.

@@ -0,0 +1,152 @@
from __future__ import division
from __future__ import print_function
from __future__ import absolute_import

import os
import sys
import copy
import random
import numpy as np
import numpy.random as npr
from PIL import Image
from random import shuffle
from scipy.misc import imresize

import torch
from torch.utils.data.dataset import Dataset

VALID_PARTITIONS = {'train': 0, 'val': 1, 'test': 2}
# go from label index to interpretable index
ATTR_TO_IX_DICT = {'Sideburns': 30, 'Black_Hair': 8, 'Wavy_Hair': 33, 'Young': 39, 'Heavy_Makeup': 18,
                   'Blond_Hair': 9, 'Attractive': 2, '5_o_Clock_Shadow': 0, 'Wearing_Necktie': 38,
                   'Blurry': 10, 'Double_Chin': 14, 'Brown_Hair': 11, 'Mouth_Slightly_Open': 21,
                   'Goatee': 16, 'Bald': 4, 'Pointy_Nose': 27, 'Gray_Hair': 17, 'Pale_Skin': 26,
                   'Arched_Eyebrows': 1, 'Wearing_Hat': 35, 'Receding_Hairline': 28, 'Straight_Hair': 32,
                   'Big_Nose': 7, 'Rosy_Cheeks': 29, 'Oval_Face': 25, 'Bangs': 5, 'Male': 20, 'Mustache': 22,
                   'High_Cheekbones': 19, 'No_Beard': 24, 'Eyeglasses': 15, 'Bags_Under_Eyes': 3,
                   'Wearing_Necklace': 37, 'Wearing_Lipstick': 36, 'Big_Lips': 6, 'Narrow_Eyes': 23,
                   'Chubby': 13, 'Smiling': 31, 'Bushy_Eyebrows': 12, 'Wearing_Earrings': 34}
# we only keep 18 of the more visually distinctive features
# See [1] Perarnau, Guim, et al. "Invertible conditional GANs for
# image editing." arXiv preprint arXiv:1611.06355 (2016).
ATTR_IX_TO_KEEP = [4, 5, 8, 9, 11, 12, 15, 17, 18, 20, 21, 22, 26, 28, 31, 32, 33, 35]
IX_TO_ATTR_DICT = {v: k for k, v in ATTR_TO_IX_DICT.iteritems()}
N_ATTRS = len(ATTR_IX_TO_KEEP)
ATTR_TO_PLOT = ['Heavy_Makeup', 'Male', 'Mouth_Slightly_Open', 'Smiling', 'Wavy_Hair']


class CelebAttributes(Dataset):
    """Define dataset of images of celebrities and attributes.
    The user needs to have pre-defined the Anno and Eval folder from
    http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html
    @param partition: string
                      train|val|test [default: train]
                      See VALID_PARTITIONS global variable.
    @param data_dir: string
                     path to root of dataset images [default: ./data]
    @param image_transform: ?torchvision.Transforms
                            optional function to apply to training inputs
    @param attr_transform: ?torchvision.Transforms
                           optional function to apply to training outputs
    """
    def __init__(self, partition='train', data_dir='./data',
                 image_transform=None, attr_transform=None):
        self.partition = partition
        self.image_transform = image_transform
        self.attr_transform = attr_transform
        self.data_dir = data_dir
        assert partition in VALID_PARTITIONS.keys()
        self.image_paths = load_eval_partition(partition, data_dir=data_dir)
        self.attr_data = load_attributes(self.image_paths, partition,
                                         data_dir=data_dir)
        self.size = int(len(self.image_paths))

    def __getitem__(self, index):
        """
        Args:
            index (int): Index
        Returns:
            tuple: (image, attr) where attr is the attribute vector.
        """
        image_path = os.path.join(self.data_dir, 'img_align_celeba',
                                  self.image_paths[index])
        attr = self.attr_data[index]
        image = Image.open(image_path).convert('RGB')

        if self.image_transform is not None:
            image = self.image_transform(image)

        if self.attr_transform is not None:
            attr = self.attr_transform(attr)

        return image, attr

    def __len__(self):
        return self.size


def load_eval_partition(partition, data_dir='./data'):
    """After downloading the dataset, we can load a subset for
    training or testing.
    @param partition: string
                      which subset to use (train|val|test)
    @param data_dir: string [default: ./data]
                     where the images are saved
    """
    eval_data = []
    with open(os.path.join(data_dir, 'Eval/list_eval_partition.txt')) as fp:
        rows = fp.readlines()
        for row in rows:
            path, label = row.strip().split(' ')
            label = int(label)
            if label == VALID_PARTITIONS[partition]:
                eval_data.append(path)
    return eval_data


def load_attributes(paths, partition, data_dir='./data'):
    """Load the attributes into a torch tensor.
    @param paths: list of strings
                  image paths belonging to the chosen partition
    @param partition: string
                      which subset to use (train|val|test)
    @param data_dir: string [default: ./data]
                     where the images are saved
    """
    if os.path.isfile(os.path.join(data_dir, 'Anno/attr_%s.npy' % partition)):
        attr_data = np.load(os.path.join(data_dir, 'Anno/attr_%s.npy' % partition))
    else:
        attr_data = []
        with open(os.path.join(data_dir, 'Anno/list_attr_celeba.txt')) as fp:
            rows = fp.readlines()
            for ix, row in enumerate(rows[2:]):
                row = row.strip().split()
                path, attrs = row[0], row[1:]
                if path in paths:
                    attrs = np.array(attrs).astype(int)
                    attrs[attrs < 0] = 0
                    attr_data.append(attrs)
        attr_data = np.vstack(attr_data).astype(np.int64)
    attr_data = torch.from_numpy(attr_data).float()
    return attr_data[:, ATTR_IX_TO_KEEP]


def tensor_to_attributes(tensor):
    """Use this for the <attr_transform>.
    @param tensor: PyTorch Tensor
                   D dimensional tensor of attribute probabilities
    @return attributes: list of strings
    """
    attrs = []
    n = tensor.size(0)
    tensor = torch.round(tensor)
    for i in xrange(n):
        if tensor[i] > 0.5:
            attr = IX_TO_ATTR_DICT[ATTR_IX_TO_KEEP[i]]
            attrs.append(attr)
    return attrs
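

# Illustrative usage sketch (added for exposition; not part of the original file).
# Assumes CelebA lives under ./data as described in the README; the torchvision
# transforms below are one reasonable choice, not the repository's exact settings.
if __name__ == '__main__':
    from torch.utils.data import DataLoader
    from torchvision import transforms

    image_transform = transforms.Compose([transforms.CenterCrop(64),
                                          transforms.ToTensor()])
    dataset = CelebAttributes(partition='train', data_dir='./data',
                              image_transform=image_transform)
    loader = DataLoader(dataset, batch_size=128, shuffle=True)
    image, attr = dataset[0]  # image tensor and 18-dim attribute vector
    print('dataset size: %d, attribute dims: %d' % (len(dataset), attr.size(0)))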