A hierarchical generative model for aligning video segments with corresponding text descriptions without manual supervision.

This repository contains the source code and datasets for the unsupervised alignment of natural language instructions with corresponding video segments. The underlying models are described in our AAAI-2014 paper [1].


Data:
=====

The directory "dataset_aaai14" contains the datasets used for our AAAI-14 paper. There are 3 directories: protocols, video_blobs, and vision_3d_coord.

Directory "protocols" - contains the raw text and parses for 3 wetlab protocols: CELL, LLGM, and YPAD.

Directory "video_blobs" - contains annotations of the sets of blobs touched by hands in the videos. There are 6 videos in total (2 per protocol). Files with "vision" in their names were generated by automated tracking; files without "vision" in their names (written in upper case) contain manual tracking data annotated with Anvil.

Directory "vision_3d_coord" - contains the X, Y, Z coordinates of the tracked objects.
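
For convenience, here is a minimal Python sketch for walking this layout. The on-disk file formats are not documented above, so the coordinate parsing (one whitespace-separated X, Y, Z triple per line) is only an assumption for illustration:

    import os

    DATA_ROOT = "dataset_aaai14"

    def list_dataset_files(root=DATA_ROOT):
        # Group files by the three subdirectories described above.
        layout = {}
        for subdir in ("protocols", "video_blobs", "vision_3d_coord"):
            path = os.path.join(root, subdir)
            layout[subdir] = sorted(os.listdir(path)) if os.path.isdir(path) else []
        return layout

    def load_3d_coords(filename):
        # Read one tracked-object file as a list of (x, y, z) floats.
        # ASSUMPTION: one whitespace-separated coordinate triple per line.
        coords = []
        with open(filename) as f:
            for line in f:
                parts = line.split()
                if len(parts) >= 3:
                    coords.append(tuple(float(v) for v in parts[:3]))
        return coords

    if __name__ == "__main__":
        for subdir, files in list_dataset_files().items():
            print(subdir, len(files), "files")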


Code:
=====

This implementation includes the source code for the 4 alignment models described in [1]. Each script takes two boolean command-line arguments: the first selects the non-monotonic variant (True) or the monotonic variant (False), and the second selects the manually annotated Anvil data (True) or the automated vision tracking data (False). The instructions for running each method are given below.


Run HMM1 model (monotonic)
==========================

On Anvil data:
> python hmm_multivideos.py False True

On Vision data:
> python hmm_multivideos.py False False
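
The monotonic constraint in HMM1 means the aligned instruction index can only stay the same or move forward as the video progresses. The following is a small, self-contained sketch of that idea as a Viterbi-style dynamic program; it is an illustration of the alignment constraint, not the repository's implementation, and log_emission is a hypothetical stand-in for the model's emission log-probabilities:

    def monotonic_viterbi(log_emission):
        # log_emission[t][i] = log P(video segment t | instruction i).
        # Returns the best segment-to-instruction assignment in which the
        # instruction index never decreases (here: stays or advances by one).
        T = len(log_emission)      # number of video segments
        M = len(log_emission[0])   # number of instructions
        NEG_INF = float("-inf")
        score = [[NEG_INF] * M for _ in range(T)]
        back = [[0] * M for _ in range(T)]
        score[0][0] = log_emission[0][0]  # alignment starts at instruction 0
        for t in range(1, T):
            for i in range(M):
                stay = score[t - 1][i]
                advance = score[t - 1][i - 1] if i > 0 else NEG_INF
                score[t][i] = max(stay, advance) + log_emission[t][i]
                back[t][i] = i if stay >= advance else i - 1
        alignment = [M - 1]  # alignment ends at the last instruction
        for t in range(T - 1, 0, -1):
            alignment.append(back[t][alignment[-1]])
        return alignment[::-1]

    # Toy example: 4 segments, 2 instructions -> [0, 0, 1, 1]
    toy = [[-0.1, -2.0], [-0.2, -1.5], [-1.8, -0.3], [-2.2, -0.1]]
    print(monotonic_viterbi(toy))

HMM2 (below) drops this monotonicity, so the instruction index may also move backward between segments.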


Run HMM2 (non-monotonic)
========================

On Anvil data:
> python hmm_multivideos.py True True

On Vision data:
> python hmm_multivideos.py True False


Run LHMM1 (monotonic, unobserved)
=================================

On Anvil data:
> python latent_hmm_multivideos.py False True

On Vision data:
> python latent_hmm_multivideos.py False False

Run LHMM2 (non-monotonic, unobserved)
=====================================

On Anvil data:
> python latent_hmm_multivideos.py True True

On Vision data:
> python latent_hmm_multivideos.py True False
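
To run all 8 model/data combinations in one pass, a small wrapper like the sketch below works; the script names and flag order are copied verbatim from the commands above:

    import subprocess

    CONFIGS = [
        ("HMM1 (monotonic)",      "hmm_multivideos.py",        "False"),
        ("HMM2 (non-monotonic)",  "hmm_multivideos.py",        "True"),
        ("LHMM1 (monotonic)",     "latent_hmm_multivideos.py", "False"),
        ("LHMM2 (non-monotonic)", "latent_hmm_multivideos.py", "True"),
    ]

    for name, script, nonmonotonic_flag in CONFIGS:
        for data_flag, data_name in (("True", "Anvil"), ("False", "Vision")):
            print("Running %s on %s data" % (name, data_name))
            subprocess.call(["python", script, nonmonotonic_flag, data_flag])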


NOTE:
======

We extended the generative alignment model with a discriminative latent-variable CRF model in [2]. Please note that this repository does NOT include the source code for the discriminative models; we will release that code separately.


References:
==========
[1] Unsupervised Alignment of Natural Language Instructions with Video Segments, Iftekhar Naim, Young Chol Song, Qiguang Liu, Henry Kautz, Jiebo Luo, and Daniel Gildea, in Proceedings of AAAI, 2014.

[2] Discriminative Unsupervised Alignment of Natural Language Instructions with Corresponding Video Segments, Iftekhar Naim, Young C. Song, Qiguang Liu, Liang Huang, Henry Kautz, Jiebo Luo, and Daniel Gildea, in Proceedings of NAACL, 2015.
