A hierarchical generative model for aligning video segments with corresponding text descriptions without manual supervision.

This repository contains the source code and datasets for the unsupervised alignment of natural language instructions with corresponding video segments. The underlying models are described in our AAAI-2014 paper [1].


Data:
=====

The directory "dataset_aaai14" contains the datasets used for our AAAI-14 paper. There are 3 directories: protocols, video_blobs, and vision_3d_coord.

Directory "protocols" - contains the raw text and parses for 3 wetlab protocols: CELL, LLGM, and YPAD.

Directory "video_blobs" - contains annotations of the sets of blobs touched by hands in the videos. There are 6 videos in total (2 per protocol). Files with "vision" in their names were generated by automated tracking; files without "vision" in their names (written in upper case) contain manual tracking data annotated with Anvil.

Directory "vision_3d_coord" - contains the X, Y, Z coordinates of the tracked objects.
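
For convenience, here is a minimal Python sketch for walking this layout. The on-disk file formats are not documented above, so the coordinate parsing (one whitespace-separated X, Y, Z triple per line) is only an assumption for illustration:

    import os

    DATA_ROOT = "dataset_aaai14"

    def list_dataset_files(root=DATA_ROOT):
        # Group files by the three subdirectories described above.
        layout = {}
        for subdir in ("protocols", "video_blobs", "vision_3d_coord"):
            path = os.path.join(root, subdir)
            layout[subdir] = sorted(os.listdir(path)) if os.path.isdir(path) else []
        return layout

    def load_3d_coords(filename):
        # Read one tracked-object file as a list of (x, y, z) floats.
        # ASSUMPTION: one whitespace-separated coordinate triple per line.
        coords = []
        with open(filename) as f:
            for line in f:
                parts = line.split()
                if len(parts) >= 3:
                    coords.append(tuple(float(v) for v in parts[:3]))
        return coords

    if __name__ == "__main__":
        for subdir, files in list_dataset_files().items():
            print(subdir, len(files), "files")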


Code:
=====

This implementation includes the source code for the 4 alignment models described in [1]. Each script takes two boolean command-line arguments: the first selects the non-monotonic variant (True) or the monotonic variant (False), and the second selects the manually annotated Anvil data (True) or the automated vision tracking data (False). The instructions for running each method are given below.


Run HMM1 model (monotonic)
==========================

On Anvil data:
> python hmm_multivideos.py False True

On Vision data:
> python hmm_multivideos.py False False
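
The monotonic constraint in HMM1 means the aligned instruction index can only stay the same or move forward as the video progresses. The following is a small, self-contained sketch of that idea as a Viterbi-style dynamic program; it is an illustration of the alignment constraint, not the repository's implementation, and log_emission is a hypothetical stand-in for the model's emission log-probabilities:

    def monotonic_viterbi(log_emission):
        # log_emission[t][i] = log P(video segment t | instruction i).
        # Returns the best segment-to-instruction assignment in which the
        # instruction index never decreases (here: stays or advances by one).
        T = len(log_emission)      # number of video segments
        M = len(log_emission[0])   # number of instructions
        NEG_INF = float("-inf")
        score = [[NEG_INF] * M for _ in range(T)]
        back = [[0] * M for _ in range(T)]
        score[0][0] = log_emission[0][0]  # alignment starts at instruction 0
        for t in range(1, T):
            for i in range(M):
                stay = score[t - 1][i]
                advance = score[t - 1][i - 1] if i > 0 else NEG_INF
                score[t][i] = max(stay, advance) + log_emission[t][i]
                back[t][i] = i if stay >= advance else i - 1
        alignment = [M - 1]  # alignment ends at the last instruction
        for t in range(T - 1, 0, -1):
            alignment.append(back[t][alignment[-1]])
        return alignment[::-1]

    # Toy example: 4 segments, 2 instructions -> [0, 0, 1, 1]
    toy = [[-0.1, -2.0], [-0.2, -1.5], [-1.8, -0.3], [-2.2, -0.1]]
    print(monotonic_viterbi(toy))

HMM2 (below) drops this monotonicity, so the instruction index may also move backward between segments.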


Run HMM2 (non-monotonic)
========================

On Anvil data:
> python hmm_multivideos.py True True

On Vision data:
> python hmm_multivideos.py True False


Run LHMM1 (monotonic, unobserved)
=================================

On Anvil data:
> python latent_hmm_multivideos.py False True

On Vision data:
> python latent_hmm_multivideos.py False False

Run LHMM2 (non-monotonic, unobserved)
=====================================

On Anvil data:
> python latent_hmm_multivideos.py True True

On Vision data:
> python latent_hmm_multivideos.py True False
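
To run all 8 model/data combinations in one pass, a small wrapper like the sketch below works; the script names and flag order are copied verbatim from the commands above:

    import subprocess

    CONFIGS = [
        ("HMM1 (monotonic)",      "hmm_multivideos.py",        "False"),
        ("HMM2 (non-monotonic)",  "hmm_multivideos.py",        "True"),
        ("LHMM1 (monotonic)",     "latent_hmm_multivideos.py", "False"),
        ("LHMM2 (non-monotonic)", "latent_hmm_multivideos.py", "True"),
    ]

    for name, script, nonmonotonic_flag in CONFIGS:
        for data_flag, data_name in (("True", "Anvil"), ("False", "Vision")):
            print("Running %s on %s data" % (name, data_name))
            subprocess.call(["python", script, nonmonotonic_flag, data_flag])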


NOTE:
======

We extended the generative alignment model with a discriminative latent-variable CRF model in [2]. Please note that this repository does NOT include the source code for the discriminative models; we will release that code separately.


References:
==========
[1] Unsupervised Alignment of Natural Language Instructions with Video Segments, Iftekhar Naim, Young Chol Song, Qiguang Liu, Henry Kautz, Jiebo Luo, and Daniel Gildea, in Proceedings of AAAI, 2014.

[2] Discriminative Unsupervised Alignment of Natural Language Instructions with Corresponding Video Segments, Iftekhar Naim, Young C. Song, Qiguang Liu, Liang Huang, Henry Kautz, Jiebo Luo, and Daniel Gildea, in Proceedings of NAACL, 2015.
