## 2020 Readme

TRECVID-VTT

Video-to-text dataset

This dataset contains short videos (ranging from 3 to 10 seconds) from the TRECVID VTT task, covering 2016 to 2019. There are 7485 videos with captions. Each video has between 2 and 5 captions, written by dedicated annotators. The videos have IDs from 1 to 7485. The videos are given in the following format:

1. There are 6475 URLs from Twitter Vine. These URLs are in the vtt_video_urls.txt file. Each line in the file gives the Video ID followed by its URL.
2. There are 1010 video files available to download from our Flickr dataset. These videos are in the webm format and have the Creative Commons License.
3. The file vtt_ground_truth.txt contains the captions for all videos. Each line in the file gives the Video ID followed by its caption.
4. Videos with IDs from 1 to 3528 have between 2 and 5 captions. Videos from 3529 to 7485 have 5 captions each.
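
As a minimal sketch of how the caption file described above could be read, the following groups captions by video ID. It assumes each line is a video ID, a single space, and then the caption (the same split used later in this notebook). The function name `group_captions` and the sample lines are illustrative only and are not part of the dataset.

```python
from collections import defaultdict

def group_captions(lines):
    """Group captions by video ID from lines of the form '<video_id> <caption>'."""
    captions = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        vid, caption = line.split(' ', 1)  # split only at the first space
        captions[int(vid)].append(caption)
    return dict(captions)

# Toy lines for illustration only (not copied from vtt_ground_truth.txt)
sample_lines = [
    "1 a man sings in a car\n",
    "1 a person is singing while driving\n",
    "2 a dog runs on a beach\n",
]
captions_by_video = group_captions(sample_lines)
print({vid: len(caps) for vid, caps in captions_by_video.items()})
```

To use it on the real file, pass the open file object instead of `sample_lines`.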

If you use this dataset, please cite the following paper:

@inproceedings{awad2019trecvid,
  title={Trecvid 2019: An evaluation campaign to benchmark video activity detection, video captioning and matching, and video search \& retrieval},
  author={Awad, George and Butt, Asad and Curtis, Keith and Lee, Yooyoung and Fiscus, Jonathan and Godil, Afzal and Delgado, Andrew and others},
  year={2019}
}

## Generate corpus and ground-truth references of released videos

### Corpus file contents
0. train_data: captions and indices of training videos in the format [corpus_widxs, vidxs, corpus_pidxs, corpus], where:
    - corpus_widxs is a list of lists with the indices of the caption words in the vocabulary
    - vidxs is a list of indices of video features in the features file
    - corpus_pidxs is a list of lists with the indices of the POS tags in the POS tagging vocabulary
    - corpus is the list of raw caption strings
1. val_data: same format as train_data. We created this set by randomly sampling 20% of the videos in the tv2020 data released for training.
2. test_data: **None**. The videos for testing have not been released yet.
3. vocabulary: in the format {'word': count}.
4. idx2word: the vocabulary in the format {idx: 'word'}.
5. word_embeddings: the vectors of each word. The i-th row is the word vector of the i-th word in the vocabulary.
6. idx2pos: the vocabulary of POS tags in the format {idx: 'POSTAG'}.
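
The layout above can be illustrated with a small round trip through `pickle`. This is only a sketch: every value in `toy_corpus` is made up, the embeddings are a plain list of lists rather than a NumPy array, and the fourth element of each split (raw captions) follows the pickling cell later in this notebook.

```python
import pickle

# Toy stand-in for the corpus list; all values are invented for illustration.
toy_corpus = [
    [[[2, 3, 0]], [0], [[2, 3, 0]], ['a man']],   # train_data
    [[[3, 2, 0]], [1], [[3, 2, 0]], ['man a']],   # val_data
    None,                                         # test_data
    {'a': 1, 'man': 1},                           # vocabulary
    {0: '<eos>', 1: '<unk>', 2: 'a', 3: 'man'},   # idx2word
    [[0.0] * 300 for _ in range(4)],              # word_embeddings (one row per index)
    {0: 'eos', 1: 'unk', 2: 'DT', 3: 'NN'},       # idx2pos
]
blob = pickle.dumps(toy_corpus)

# Unpack in the documented order
(train_data, val_data, test_data,
 vocabulary, idx2word, word_embeddings, idx2pos) = pickle.loads(blob)

corpus_widxs, vidxs, corpus_pidxs = train_data[0], train_data[1], train_data[2]
caption = ' '.join(idx2word[i] for i in corpus_widxs[0])
tags = [idx2pos[i] for i in corpus_pidxs[0]]
print(vidxs[0], caption, tags)
```

Indexing `train_data` by position rather than unpacking it keeps the sketch valid whether a split holds three or four lists.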

### Generate split for training and validation

In [1]:
train_corpus = []
with open('../../../data/TRECVID/tv2020/train/vtt_ground_truth.txt') as f:
    vidxs, corpus = zip(*[l.strip().split(' ', 1) for l in f.readlines()])

print('count of videos: ', len(set(vidxs)))    
print('original count of pairs: ', len(vidxs))
    
# Collect the IDs of videos whose file is empty (0 bytes)
import os
videos_dir_path = '../../../data/TRECVID/tv2020/train/videos'
vidxs_to_delete = [int(filename[6:filename.index('.')]) for filename in os.listdir(videos_dir_path) if not os.path.getsize(os.path.join(videos_dir_path, filename))]

import random
unique_vidxs = set([int(vidx) for vidx in vidxs]) - set(vidxs_to_delete)
random.seed(42)
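# Note: the sample is drawn from `vidxs` (one string ID per caption), not from `unique_vidxs`,
# which only sets the sample size, so the IDs in `vidxs_to_delete` are not excluded at this step.
sampled_vidxs = random.sample(vidxs, len(unique_vidxs)//5) # sample size: 20% of the usable video count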

train_vidxs, train_corpus = zip(*[(int(vidx)-1, corpus[i]) for (i, vidx) in enumerate(vidxs) if not vidx in sampled_vidxs])
valid_vidxs, valid_corpus = zip(*[(int(vidx)-1, corpus[i]) for (i, vidx) in enumerate(vidxs) if vidx in sampled_vidxs])

print('\nMy split:')
print('count of training pairs: ', len(train_vidxs))
print('count of validation pairs: ', len(valid_vidxs))

count of videos:  7485
original count of pairs:  28183

My split:
count of training pairs:  22491
count of validation pairs:  5692
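
The cell above samples from the caption-level list of IDs. As an alternative sketch (not the split used to produce the counts above), the validation set could be drawn from the unique video IDs so that exactly the requested fraction of videos is held out. The function name `split_by_video` and the toy data are hypothetical.

```python
import random

def split_by_video(vidxs, captions, val_fraction=0.2, seed=42):
    """Split (video ID, caption) pairs so that all captions of a video share a split."""
    unique_ids = sorted(set(vidxs))  # sorted so the sample is reproducible
    rng = random.Random(seed)        # local generator, leaves global state untouched
    n_val = int(len(unique_ids) * val_fraction)
    val_ids = set(rng.sample(unique_ids, n_val))
    train = [(v, c) for v, c in zip(vidxs, captions) if v not in val_ids]
    valid = [(v, c) for v, c in zip(vidxs, captions) if v in val_ids]
    return train, valid

# Toy data: 5 videos with 2 captions each
vidxs = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
captions = [f'caption {i}' for i in range(len(vidxs))]
train, valid = split_by_video(vidxs, captions)
print(len(train), len(valid))
```

Using a set for `val_ids` also makes the membership tests constant time, which matters little for this dataset but avoids a quadratic scan.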


### Discard videos without features (video files that have problems)
**IMPORTANT**: I don't need to remove the features from the .h5 file. I only need to remove the corresponding indices and captions from the corpus.

In [2]:
import h5py
with h5py.File('../../../data/TRECVID/tv2020/features/features_linspace16_20-cnn_globals-cnn_sem_globals-cnn_features-c3d_features-eco_globals.h5', 'r') as feats_file:  # read-only: nothing is written here
    dataset = feats_file['TRECVID-2020']
    print(dataset.keys())
    cnn_fts = dataset['cnn_features'][...]
    print(cnn_fts.shape)
    
import numpy as np
vidxs_to_discard = []
for i, v in enumerate(cnn_fts):
    if np.all(v == np.zeros((20,2048))):
        vidxs_to_discard.append(i)
print(len(vidxs_to_discard), ' videos to discard')

train_vidxs, train_corpus = zip(*[(vidx, train_corpus[i]) for (i, vidx) in enumerate(train_vidxs) if vidx not in vidxs_to_discard])
valid_vidxs, valid_corpus = zip(*[(vidx, valid_corpus[i]) for (i, vidx) in enumerate(valid_vidxs) if vidx not in vidxs_to_discard])

print('count of training pairs: ', len(train_vidxs))
print('count of validation pairs: ', len(valid_vidxs))

<KeysViewHDF5 ['c3d_features', 'cnn_features', 'cnn_globals', 'cnn_sem_globals', 'count_features', 'eco_globals', 'frames_tstamp']>
(7485, 20, 2048)
254  videos to discard
count of training pairs:  21956
count of validation pairs:  5634


### Get pretrained embeddings

In [3]:
import os
import numpy as np

wordvectors = {}
# with open('./glove.42B.300d.txt') as f:
with open('./glove.6B.300d.txt') as f:
    for line in f:
        s = line.strip().split(' ')
        if len(s) == 301:
            wordvectors[s[0]] = np.array(s[1:], dtype=float)
    print(len(wordvectors))

400000


### Determine the vocabulary from train split

In [4]:
import nltk
nltk.download('punkt')

vocab, total_len = {}, 0
for cap in train_corpus:
    tokens = nltk.word_tokenize(cap.lower())
    total_len += len(tokens)
    for w in tokens:
        vocab[w] = vocab.get(w, 0) + 1  # count occurrences without a bare except

print('Avg. count of words per caption:', total_len/len(train_corpus))
print('Count of unique words: ', len(vocab))
            
to_del = []
for w in vocab.keys():
    if not w in wordvectors:
        to_del.append(w)
        print('missing word: {}'.format(w))

print('count of missing words: ', len(to_del))
        
for w in to_del:
    del vocab[w]
        
idx2word = {idx: word for idx, word in enumerate(['<eos>', '<unk>'] + list(vocab.keys()))}
word2idx = {word: idx for idx, word in enumerate(['<eos>', '<unk>'] + list(vocab.keys()))}
EOS, UNK = 0, 1

len(vocab), len(idx2word), len(word2idx)

word_embeddings = np.zeros((len(idx2word), 300))
for idx, word in idx2word.items():
    if idx == EOS:
        word_embeddings[idx] = wordvectors['eos']
    elif idx == UNK:
        word_embeddings[idx] = wordvectors['unk']
    else:
        word_embeddings[idx] = wordvectors[word]

[nltk_data] Downloading package punkt to /home/jeperez/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Avg. count of words per caption: 18.899161960284204
Count of unique words:  11634
missing word: toddles
missing word: jellabiya
missing word: gesticulates
missing word: socket-outlet
missing word: laundromatic
missing word: outstretching
missing word: peruke
missing word: mtv-studio
missing word: nba-playoff
missing word: france/eu
missing word: mewls
missing word: tv-studio
missing word: cacadoo
missing word: air-kiss
missing word: grining
missing word: puppie
missing word: twerking
missing word: flic-flacs
missing word: hoody
missing word: rollick
missing word: punchbag
missing word: york-player
missing word: bar-keeper
missing word: overlarge
missing word: muscly
missing word: sidewards
missing word: selfie
missing word: one-move
missing word: afflick
missing word: fur-trimmed
missing word: mid-court
missing word: carooms
missing word: face.
missing word: martin-lewis
missing word: break-dance
missing word: head-butt
missing word: venue..
missing word: flip-phone
missing word: sprin

### Determine POS-tagging vocabulary from train split

In [5]:
import nltk

pos_vocab = {}
pos_unique_words = {}
for cap in train_corpus:
    for word, tag in nltk.pos_tag(nltk.word_tokenize(cap.lower())):
        # count tag occurrences and, per tag, the words that received it
        pos_vocab[tag] = pos_vocab.get(tag, 0) + 1
        tag_words = pos_unique_words.setdefault(tag, {})
        tag_words[word] = tag_words.get(word, 0) + 1

print('Unique words per tag:')
print('\n'.join([f' {k}:\t{len(words)}' for k, words in pos_unique_words.items()]))
            
idx2pos = {idx: tag for idx, tag in enumerate(['eos', 'unk'] + list(pos_vocab.keys()))}
pos2idx = {tag: idx for idx, tag in enumerate(['eos', 'unk'] + list(pos_vocab.keys()))}
EOS, UNK = 0, 1
print(len(idx2pos))

Unique words per tag:
 DT:	23
 NN:	5352
 VBZ:	939
 IN:	126
 PRP$:	7
 ,:	1
 CD:	104
 NNS:	2165
 VBP:	740
 JJ:	2816
 VBG:	1036
 TO:	2
 VB:	941
 RP:	28
 .:	3
 CC:	19
 RB:	522
 POS:	4
 PRP:	24
 VBN:	767
 VBD:	502
 WRB:	7
 WDT:	6
 (:	1
 ):	1
 PDT:	14
 JJR:	52
 WP:	7
 MD:	14
 ``:	1
 '':	5
 WP$:	1
 JJS:	16
 ::	5
 UH:	3
 EX:	2
 #:	1
 FW:	16
 RBR:	16
 NNP:	12
 $:	2
 RBS:	1
44


### Determine Universal POS-tagging from train split

In [6]:
import nltk
nltk.download('universal_tagset')

upos_vocab = {}
upos_unique_words = {}
for cap in train_corpus:
    for word, tag in nltk.pos_tag(nltk.word_tokenize(cap.lower()), tagset='universal'):
        # count tag occurrences and, per tag, the words that received it
        upos_vocab[tag] = upos_vocab.get(tag, 0) + 1
        tag_words = upos_unique_words.setdefault(tag, {})
        tag_words[word] = tag_words.get(word, 0) + 1

print('Unique words per universal tag:')
print('\n'.join([f' {k}:\t{len(words)}' for k, words in upos_unique_words.items()]))
            
idx2upos = {idx: word for idx, word in enumerate(['eos', 'unk'] + list(upos_vocab.keys()))}
upos2idx = {word: idx for idx, word in enumerate(['eos', 'unk'] + list(upos_vocab.keys()))}
print(len(idx2upos))

[nltk_data] Downloading package universal_tagset to
[nltk_data]     /home/jeperez/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


Unique words per universal tag:
 DET:	40
 NOUN:	7316
 VERB:	4038
 ADP:	126
 PRON:	38
 .:	20
 NUM:	104
 ADJ:	2878
 PRT:	34
 CONJ:	19
 ADV:	542
 X:	19
14


### Generate ground-truth references files

In [7]:
with open('../results/TRECVID-2020_val_references.txt', 'w') as f:
    for vidx, cap in zip(valid_vidxs, valid_corpus):
        f.write('{}\t{}\n'.format(vidx, cap))

#### Test reference files (generated after the ground truth was released)

In [3]:
import json
with open('../../../data/TRECVID/tv2020/test/Description_Generation_Groundtruth/vtt.cider.gt.json') as f:
    gt = json.load(f)

with open('../results/TRECVID-2020_test_references.txt', 'w') as f:
    for row in gt:
        f.write('{}\t{}\n'.format(row['image_id'], row['caption']))

### Generate corpus.pkl file

In [8]:
import pickle

train_corpus_widxs = [[word2idx[w] if w in vocab else UNK for w in nltk.word_tokenize(cap.lower())] + [EOS] for cap in train_corpus]
valid_corpus_widxs = [[word2idx[w] if w in vocab else UNK for w in nltk.word_tokenize(cap.lower())] + [EOS] for cap in valid_corpus]

train_corpus_pidxs = [[pos2idx[w[1]] if w[1] in pos_vocab else UNK for w in nltk.pos_tag(nltk.word_tokenize(cap.lower()))] + [EOS] for cap in train_corpus]
valid_corpus_pidxs = [[pos2idx[w[1]] if w[1] in pos_vocab else UNK for w in nltk.pos_tag(nltk.word_tokenize(cap.lower()))] + [EOS] for cap in valid_corpus]

assert len(train_corpus_widxs) == len(train_vidxs) and len(train_vidxs) == len(train_corpus_pidxs) and len(train_vidxs) == len(train_corpus), f'{len(train_vidxs)}, {len(train_corpus_widxs)}, {len(train_corpus_pidxs)}, {len(train_corpus)}'
assert len(valid_corpus_widxs) == len(valid_vidxs) and len(valid_vidxs) == len(valid_corpus_pidxs) and len(valid_vidxs) == len(valid_corpus), f'{len(valid_vidxs)}, {len(valid_corpus_widxs)}, {len(valid_corpus_pidxs)}, {len(valid_corpus)}'

train_data = [train_corpus_widxs, train_vidxs, train_corpus_pidxs, train_corpus]
valid_data = [valid_corpus_widxs, valid_vidxs, valid_corpus_pidxs, valid_corpus]

with open('../../../data/TRECVID/tv2020/trecvid_tv2020_corpus_pos.pkl', 'wb') as outfile:
    pickle.dump([train_data, valid_data, None, vocab, idx2word, word_embeddings, idx2pos], outfile)
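
The word-encoding step in the cell above can be sketched in isolation as follows. This is a simplified illustration: the vocabulary `toy_vocab` is made up, `str.split` stands in for `nltk.word_tokenize`, and the helper name `encode` is hypothetical.

```python
EOS, UNK = 0, 1

# Toy vocabulary for illustration; the real one is built from the training captions.
toy_vocab = ['a', 'man', 'sings', 'in', 'car']
word2idx = {w: i for i, w in enumerate(['<eos>', '<unk>'] + toy_vocab)}

def encode(caption):
    """Map a caption to word indices, using UNK for unknown words and appending EOS."""
    tokens = caption.lower().split()  # stand-in for nltk.word_tokenize
    return [word2idx.get(w, UNK) for w in tokens] + [EOS]

print(encode('A man sings in a car'))  # [2, 3, 4, 5, 2, 6, 0]
print(encode('A man twerking'))        # [2, 3, 1, 0]
```

Words that were dropped from the vocabulary for lacking a pretrained vector (such as "twerking" in the list of missing words above) map to `UNK` in the same way.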

In [9]:
print(train_vidxs[0])
print(train_corpus[0])
nltk.pos_tag(nltk.word_tokenize(train_corpus[0]))

0
a man sings in a car


[('a', 'DT'),
 ('man', 'NN'),
 ('sings', 'VBZ'),
 ('in', 'IN'),
 ('a', 'DT'),
 ('car', 'NN')]

In [10]:
print(valid_vidxs[0])
print(valid_corpus[0])
valid_corpus_pidxs[0]

6
a man is scared by a picture of a boy on a mirror in a bathroom


[2, 3, 4, 21, 5, 2, 3, 5, 2, 3, 5, 2, 3, 5, 2, 3, 0]
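
To read the index list above, it can be mapped back to tags. The sketch below uses a partial `idx2pos` that assumes indices follow the order in which tags are listed in the "Unique words per tag" output (after the two reserved entries), which gives DT=2, NN=3, VBZ=4, IN=5 and VBN=21. Only the entries needed for this caption are included.

```python
# Partial mapping, assumed from the order of tags printed earlier in this notebook
idx2pos_partial = {0: 'eos', 1: 'unk', 2: 'DT', 3: 'NN', 4: 'VBZ', 5: 'IN', 21: 'VBN'}

pidxs = [2, 3, 4, 21, 5, 2, 3, 5, 2, 3, 5, 2, 3, 5, 2, 3, 0]
caption = 'a man is scared by a picture of a boy on a mirror in a bathroom'

decoded = [idx2pos_partial[i] for i in pidxs]

# Pair each token with its tag; the final index is the end-of-sequence marker
for token, tag in zip(caption.split() + ['<eos>'], decoded):
    print(f'{token}\t{tag}')
```

Under that assumption the sequence decodes to DT NN VBZ VBN IN followed by four repetitions of "DT NN" separated by IN, then the end marker, which matches the structure of the caption.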