## Generate corpus and ground-truth references of released videos

### Corpus file contents
0. train_data: captions and indexes of the training videos in the format [corpus_widxs, vidxs, corpus_pidxs, corpus], where:
    - corpus_widxs is a list of lists with the vocabulary index of each word in a caption
    - vidxs is a list of indexes of video features in the features file
    - corpus_pidxs is a list of lists with the index of each POS tag in the POS-tagging vocabulary
    - corpus is the list of raw caption strings
1. val_data: same format as train_data.
2. test_data: same format as train_data.
3. vocabulary: word counts from the training split, in the format {'word': count}.
4. idx2word: the vocabulary in the format {idx: 'word'}.
5. word_embeddings: the word vectors. The i-th row is the vector of the word idx2word[i].
6. idx2pos: the POS-tagging vocabulary in the format {idx: 'POSTAG'}.
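
As a rough illustration of the layout listed above, the sketch below unpacks a pickle that holds those seven items in that order. The helper name `load_corpus` and the dictionary keys it returns are assumptions introduced for this example; they do not appear in the notebook.

```python
import pickle


def load_corpus(path):
    """Unpack a corpus pickle with the seven-item layout (hypothetical helper)."""
    with open(path, 'rb') as f:
        (train_data, val_data, test_data,
         vocabulary, idx2word, word_embeddings, idx2pos) = pickle.load(f)
    # Return the items by name so callers do not depend on positional order.
    return {
        'train_data': train_data,
        'val_data': val_data,
        'test_data': test_data,
        'vocabulary': vocabulary,
        'idx2word': idx2word,
        'word_embeddings': word_embeddings,
        'idx2pos': idx2pos,
    }
```

Returning a dictionary is only a convenience; unpacking the list directly works just as well.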

### Generate splits for training, validation, and testing

In [3]:
import json

# Load every (video_id, caption) pair from the MSR-VTT annotation file.
with open('../../../data/MSR-VTT/v1/all_videodatainfo.json') as f:
    vidxs, corpus = zip(*[(s['video_id'], s['caption']) for s in json.load(f)['sentences']])
    # Video ids have the form 'video123'; keep only the numeric part.
    vidxs = [int(s[5:]) for s in vidxs]

# Split by video id: up to 6512 train, 6513 to 7009 validation, 7010 and above test.
train_vidxs, train_corpus = zip(*[(vidx, corpus[i]) for (i, vidx) in enumerate(vidxs) if vidx <= 6512])
valid_vidxs, valid_corpus = zip(*[(vidx, corpus[i]) for (i, vidx) in enumerate(vidxs) if 6513 <= vidx <= 7009])
test_vidxs, test_corpus = zip(*[(vidx, corpus[i]) for (i, vidx) in enumerate(vidxs) if vidx >= 7010])

print('count of training pairs: ', len(train_vidxs))
print('count of validation pairs: ', len(valid_vidxs))
print('count of testing pairs: ', len(test_vidxs))

count of training pairs:  130260
count of validation pairs:  9940
count of testing pairs:  59800
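
The id thresholds used in the cell above can also be written as a small function, which makes the boundaries easier to check. This is a minimal sketch; the names `split_of`, `split_pairs`, `TRAIN_MAX_ID`, and `VALID_MAX_ID` are assumptions introduced here, while the threshold values are the ones the notebook uses.

```python
TRAIN_MAX_ID = 6512   # last video id assigned to training (from the cell above)
VALID_MAX_ID = 7009   # last video id assigned to validation (from the cell above)


def split_of(vidx):
    """Return the split name for a numeric video id (hypothetical helper)."""
    if vidx <= TRAIN_MAX_ID:
        return 'train'
    if vidx <= VALID_MAX_ID:
        return 'valid'
    return 'test'


def split_pairs(pairs):
    """Group (vidx, caption) pairs into train/valid/test lists."""
    out = {'train': [], 'valid': [], 'test': []}
    for vidx, caption in pairs:
        out[split_of(vidx)].append((vidx, caption))
    return out
```

The printed counts are consistent with 20 captions per video: 6513 × 20 = 130260, 497 × 20 = 9940, and 2990 × 20 = 59800.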


### Get pretrained embeddings

In [4]:
import os
import numpy as np

wordvectors = {}
# with open('./glove.42B.300d.txt') as f:
with open('./glove.6B.300d.txt') as f:
    for line in f:
        s = line.strip().split(' ')
        if len(s) == 301:
            wordvectors[s[0]] = np.array(s[1:], dtype=float)
    print(len(wordvectors))

400000
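
To make the parsing step concrete, here is a small sketch that applies the same length check to in-memory lines instead of the GloVe file. The function name `parse_glove_lines` and the three-dimensional toy vectors are assumptions for illustration, and plain Python lists stand in for the NumPy arrays used in the cell above.

```python
def parse_glove_lines(lines, dim=300):
    """Parse 'word v1 ... v_dim' lines, skipping lines with the wrong length."""
    vectors = {}
    for line in lines:
        parts = line.strip().split(' ')
        # A valid line has the word followed by exactly `dim` numbers.
        if len(parts) == dim + 1:
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Toy input: two valid 3-d entries and one malformed line that is skipped.
sample = ["cat 0.1 0.2 0.3\n", "dog 0.4 0.5 0.6\n", "broken 0.7\n"]
toy = parse_glove_lines(sample, dim=3)
print(sorted(toy))  # ['cat', 'dog']
```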


### Determine the vocabulary from train split

In [5]:
import nltk
nltk.download('punkt')

vocab, total_len = {}, 0
for cap in train_corpus:
    tokens = nltk.word_tokenize(cap.lower())
    total_len += len(tokens)
    for w in tokens:
        # Count the word; dict.get avoids a bare try/except.
        vocab[w] = vocab.get(w, 0) + 1

print('Avg. count of words per caption:', total_len/len(train_corpus))
print('Count of unique words: ', len(vocab))

to_del = []
for w in vocab:
    if w not in wordvectors:
        to_del.append(w)
        print('missing word: {}'.format(w))

print('count of missing words: ', len(to_del))
        
for w in to_del:
    del vocab[w]
        
idx2word = {idx: word for idx, word in enumerate(['<eos>', '<unk>'] + list(vocab.keys()))}
word2idx = {word: idx for idx, word in enumerate(['<eos>', '<unk>'] + list(vocab.keys()))}
EOS, UNK = 0, 1

print(len(vocab), len(idx2word), len(word2idx))

word_embeddings = np.zeros((len(idx2word), 300))
for idx, word in idx2word.items():
    if idx == EOS:
        word_embeddings[idx] = wordvectors['eos']
    elif idx == UNK:
        word_embeddings[idx] = wordvectors['unk']
    else:
        word_embeddings[idx] = wordvectors[word]

[nltk_data] Downloading package punkt to /home/jeperez/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Avg. count of words per caption: 9.27392906494703
Count of unique words:  23527
missing word: scating
missing word: dbz
missing word: kamehame
missing word: krillin
missing word: charachters
missing word: ball-z
missing word: mini-motorbike
missing word: minibikes
missing word: mini-motorcycles
missing word: reult
missing word: potaote
missing word: godaddycom
missing word: origomi
missing word: intermittendly
missing word: palying
missing word: skarlet
missing word: skarlett
missing word: comdey
missing word: circket
missing word: mupphets
missing word: hiliary
missing word: superreaders
missing word: conversating
missing word: dutchsinse
missing word: soilder
missing word: vieo
missing word: stegasaurus
missing word: livecast
missing word: somethign
missing word: expresion
missing word: machinama
missing word: flyroll
missing word: demostrating
missing word: evercise
missing word: feattured
missing word: whilea
missing word: animtated
missing word: lapras
missing word: wiscash
...

missing word: carpentery
missing word: wodden
missing word: peticular
missing word: astronout
missing word: explaying
missing word: parotand
missing word: squarreled
missing word: uppehand
missing word: superbehind
missing word: super-comfortable
missing word: accesssory
missing word: scenefirston
missing word: catche
missing word: eache
missing word: dummie
missing word: snow-field
missing word: walktor
missing word: surrounder
missing word: whits
missing word: chappathi
missing word: indain
missing word: receipies
missing word: taliking
missing word: friens
missing word: mayonaise
missing word: drived
missing word: darker-blue
missing word: merbrane
missing word: membrance
missing word: vegitable
missing word: kitchenthere
missing word: coriandrum
missing word: corriander
missing word: theling
missing word: inhome
missing word: kitches
missing word: completly
missing word: instuction
missing word: shrredded
missing word: meadowswhere
missing word: gold-framed
missing word: cuite
...

missing word: comiing
missing word: agter
missing word: sied
missing word: driving/
missing word: burgandy
missing word: apeak
missing word: thin-trunked
missing word: wearring
missing word: chapatti
missing word: puppiesa
missing word: veternarian
missing word: miriah
missing word: unboard
missing word: flighta
missing word: happeneded
missing word: walhberg
missing word: whalburg
missing word: famiily
missing word: wowedres
missing word: arow
missing word: adveertisement
missing word: -china
missing word: factroy
missing word: reparepair
missing word: rhiana
missing word: feamle
missing word: rihanni
missing word: crawing
missing word: gold-painted
missing word: unflavoured
missing word: surgar
missing word: benefical
missing word: paedialyte
missing word: pedalyte
missing word: verterinarian
missing word: vaselene
missing word: pediayte
missing word: inverviewed
missing word: tecnique
missing word: vehiclesviolence
missing word: fiaghting
missing word: fliping
missing word: writting
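
The vocabulary step above (count words, drop those without a pretrained vector, then index the rest after the two special tokens) can be sketched on toy data as follows. The function name `build_vocab` and the pre-tokenized toy captions are assumptions; the notebook itself tokenizes with `nltk.word_tokenize`.

```python
from collections import Counter


def build_vocab(tokenized_captions, known_words, specials=('<eos>', '<unk>')):
    """Count words, keep those with an embedding, and build index maps."""
    counts = Counter(w for tokens in tokenized_captions for w in tokens)
    # Keep only words that have a pretrained vector.
    vocab = {w: c for w, c in counts.items() if w in known_words}
    # Special tokens take indexes 0 and 1, as in the notebook.
    words = list(specials) + list(vocab)
    idx2word = dict(enumerate(words))
    word2idx = {w: i for i, w in idx2word.items()}
    return vocab, idx2word, word2idx


captions = [['a', 'man', 'palying'], ['a', 'dog']]
vocab, idx2word, word2idx = build_vocab(captions, known_words={'a', 'man', 'dog'})
print(vocab)  # {'a': 2, 'man': 1, 'dog': 1}
```

In this toy run the misspelled `palying` is dropped, just as the misspellings listed above are dropped from the real vocabulary.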

### Determine POS-tagging vocabulary from train split

In [6]:
import nltk

pos_vocab = {}
for cap in train_corpus:
    for tag in nltk.pos_tag(nltk.word_tokenize(cap.lower())):
        # Start each tag at 1 on first sight (the original started at 0, undercounting by one).
        pos_vocab[tag[1]] = pos_vocab.get(tag[1], 0) + 1
            
idx2pos = {idx: word for idx, word in enumerate(['eos', 'unk'] + list(pos_vocab.keys()))}
pos2idx = {word: idx for idx, word in enumerate(['eos', 'unk'] + list(pos_vocab.keys()))}
EOS, UNK = 0, 1

### Generate ground-truth references files

In [7]:
with open('../results/MSR-VTT_val_references.txt', 'w') as f:
    for vidx, cap in zip(valid_vidxs, valid_corpus):
        f.write('{}\t{}\n'.format(vidx, cap.lower()))
        
with open('../results/MSR-VTT_test_references.txt', 'w') as f:
    for vidx, cap in zip(test_vidxs, test_corpus):
        f.write('{}\t{}\n'.format(vidx, cap.lower()))
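
The reference files written above hold one `vidx<TAB>caption` pair per line, so the same video id appears on several lines. A possible way to read such a file back, grouping captions by video, is sketched below; the helper name `read_references` is an assumption, not something the notebook defines.

```python
from collections import defaultdict


def read_references(path):
    """Read 'vidx<TAB>caption' lines into {vidx: [caption, ...]} (hypothetical helper)."""
    refs = defaultdict(list)
    with open(path) as f:
        for line in f:
            # Split on the first tab only, in case a caption contains a tab.
            vidx, caption = line.rstrip('\n').split('\t', 1)
            refs[int(vidx)].append(caption)
    return dict(refs)
```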

### Generate corpus.pkl file

In [8]:
import pickle

train_corpus_widxs = [[word2idx[w] if w in vocab else UNK for w in nltk.word_tokenize(cap.lower())] + [EOS] for cap in train_corpus]
valid_corpus_widxs = [[word2idx[w] if w in vocab else UNK for w in nltk.word_tokenize(cap.lower())] + [EOS] for cap in valid_corpus]
test_corpus_widxs = [[word2idx[w] if w in vocab else UNK for w in nltk.word_tokenize(cap.lower())] + [EOS] for cap in test_corpus]

train_corpus_pidxs = [[pos2idx[w[1]] if w[1] in pos_vocab else UNK for w in nltk.pos_tag(nltk.word_tokenize(cap.lower()))] + [EOS] for cap in train_corpus]
valid_corpus_pidxs = [[pos2idx[w[1]] if w[1] in pos_vocab else UNK for w in nltk.pos_tag(nltk.word_tokenize(cap.lower()))] + [EOS] for cap in valid_corpus]
test_corpus_pidxs = [[pos2idx[w[1]] if w[1] in pos_vocab else UNK for w in nltk.pos_tag(nltk.word_tokenize(cap.lower()))] + [EOS] for cap in test_corpus]

assert len(train_corpus_widxs) == len(train_vidxs) and len(train_vidxs) == len(train_corpus_pidxs) and len(train_vidxs) == len(train_corpus), f'{len(train_vidxs)}, {len(train_corpus_widxs)}, {len(train_corpus_pidxs)}, {len(train_corpus)}'
assert len(valid_corpus_widxs) == len(valid_vidxs) and len(valid_vidxs) == len(valid_corpus_pidxs) and len(valid_vidxs) == len(valid_corpus), f'{len(valid_vidxs)}, {len(valid_corpus_widxs)}, {len(valid_corpus_pidxs)}, {len(valid_corpus)}'
assert len(test_corpus_widxs) == len(test_vidxs) and len(test_vidxs) == len(test_corpus_pidxs) and len(test_vidxs) == len(test_corpus), f'{len(test_vidxs)}, {len(test_corpus_widxs)}, {len(test_corpus_pidxs)}, {len(test_corpus)}'

train_data = [train_corpus_widxs, train_vidxs, train_corpus_pidxs, train_corpus]
valid_data = [valid_corpus_widxs, valid_vidxs, valid_corpus_pidxs, valid_corpus]
test_data = [test_corpus_widxs, test_vidxs, test_corpus_pidxs, test_corpus]

with open('../../../data/MSR-VTT/v1/my_msrvtt_corpus.pkl', 'wb') as outfile:
    pickle.dump([train_data, valid_data, test_data, vocab, idx2word, word_embeddings, idx2pos], outfile)
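
The index lists built in this cell follow one rule: map each token to its index, fall back to `UNK` for unseen tokens, and append `EOS`. A minimal sketch of that rule is shown below; `encode` and `decode` are hypothetical helper names, and `dict.get` is used as a near-equivalent of the `w in vocab` check in the cell above.

```python
def encode(tokens, token2idx, unk=1, eos=0):
    """Map tokens to indexes, using `unk` for unknown tokens, and append `eos`."""
    return [token2idx.get(t, unk) for t in tokens] + [eos]


def decode(idxs, idx2token):
    """Map indexes back to tokens (unknown words come back as the unk token)."""
    return [idx2token[i] for i in idxs]


word2idx = {'<eos>': 0, '<unk>': 1, 'a': 2, 'man': 3}
idx2word = {i: w for w, i in word2idx.items()}
ids = encode(['a', 'man', 'palying'], word2idx)
print(ids)                    # [2, 3, 1, 0]
print(decode(ids, idx2word))  # ['a', 'man', '<unk>', '<eos>']
```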

# END
From here on, the cells are only experiments for extracting examples and the like.

In [17]:
patterns_to_search = ["DT NN VBZ VBG RB IN DT NN"]

pos_templates = [" ".join([idx2pos[pidx] for pidx in c]) for c in train_corpus_pidxs]
matchs = [(train_vidxs[i], s, train_corpus[i]) for (i, s) in enumerate(pos_templates) if all([(p in s) for p in patterns_to_search])]
matchs

[(1968,
  'DT NN VBZ VBG RB IN DT NN IN DT NN IN NN eos',
  'a guy is flipping around on a trampoline into a pile of foam'),
 (1968,
  'DT NN VBZ VBG RB IN DT NN IN DT NN IN NN eos',
  'a guy is flipping around on a trampoline into a pile of foam'),
 (2402,
  'DT NN VBZ VBG RB IN DT NN NN eos',
  'a man is dancing awkwardly on the dance floor'),
 (2747,
  'DT NN VBZ VBG RB IN DT NN eos',
  'a man is dancing along with a music'),
 (2854,
  'DT NN VBZ VBG RB IN DT NN NN IN DT NN VBZ NNS eos',
  'a child is driving away in a toy car while a man drinks starbucks'),
 (5284,
  'DT NN VBZ VBG RB IN DT NN VBG PRP$ NNS eos',
  'a man is standing outside in the snow brushing his teeth'),
 (5592, 'DT NN VBZ VBG RB IN DT NN eos', 'a woman is shying away from a kiss'),
 (3928,
  'DT NN VBZ VBG RB IN DT NN NN CC VBG IN DT eos',
  'a man is diving back in a training room and talking about that'),
 (5324,
  'DT NN VBZ VBG RB IN DT NN eos',
  'a man is flipping back on a trampoline'),
 (615,
  'DT NN N...
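
Note that the search above uses a plain substring test, so a pattern ending in `NN` also matches a template that continues with `NNS` or `NNP`. If whole-tag matching is wanted, one option is to compare tag sequences instead of strings. The sketch below is an alternative, not what the notebook does, and `contains_pattern` is a hypothetical name.

```python
def contains_pattern(template, pattern):
    """True if the tags of `pattern` occur as a contiguous run of whole tags in `template`."""
    t, p = template.split(), pattern.split()
    return any(t[i:i + len(p)] == p for i in range(len(t) - len(p) + 1))


pattern = 'DT NN VBZ VBG RB IN DT NN'
print(contains_pattern('DT NN VBZ VBG RB IN DT NN eos', pattern))   # True
print(contains_pattern('DT NN VBZ VBG RB IN DT NNS eos', pattern))  # False
print(pattern in 'DT NN VBZ VBG RB IN DT NNS eos')                  # True (substring test)
```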

In [10]:
d = {}
for i, t in enumerate(pos_templates):
    if not t in d:
        d[t] = [i]
    else:
        d[t].append(i)
for tags, idxs in d.items():
    if len(idxs)>1:
        print(f'{tags}\n{idxs}\n')

DT NN NNS VBZ IN DT NN NN IN DT NN NN eos
[0, 18]

DT NN NN VBZ IN IN IN DT NN NN eos
[1, 19]

DT NN VBZ VBG IN DT NN eos
[2, 130, 382, 491, 1025, 1348, 1364, 1445, 1449, 1724, 1746, 1805, 1901, 1911, 1918, 2402, 2403, 2790, 2943, 3041, 3224, 3487, 3499, 3608, 3748, 3785, 3889, 3902, 3927, 3949, 4102, 4104, 4107, 4108, 4113, 4171, 4393, 4502, 4505, 4521, 4522, 4527, 4538, 4561, 4602, 4625, 4662, 4763, 4768, 4770, 4852, 5047, 5100, 5226, 5268, 5452, 5458, 5584, 5669, 6168, 6506, 6524, 6591, 7043, 7044, 7070, 7264, 7523, 7529, 7661, 7662, 7679, 7687, 7726, 7766, 7806, 7828, 7842, 7902, 7946, 7962, 7976, 7977, 8409, 8418, 8501, 8705, 8804, 8819, 8862, 9265, 9273, 9345, 9367, 9565, 9583, 9686, 9698, 9699, 10005, 10015, 10200, 10465, 10486, 10711, 10727, 10733, 10738, 10802, 10865, 11063, 11069, 11071, 11085, 11161, 11521, 11581, 11665, 11666, 11701, 11805, 11846, 11873, 11985, 11999, 12226, 12342, 12447, 12467, 12523, 12673, 12811, 12843, 12882, 12967, 13087, 13144, 13212, 13269, 13365, ...


DT NN NN VBG NN IN DT NN NN eos
[8168, 44722]

NN IN NNS IN NN NN eos
[8169, 23215, 75225, 80887]

NNS VBP VBG TO DT JJ NN NN eos
[8172, 37313]

DT NN VBZ VBG DT JJ eos
[8183, 10008, 13762, 15043, 25702, 28523, 28628, 41000, 49327, 53550, 56444, 61661, 63462, 76620, 76640, 76660, 81882, 87952, 88201]

DT NN IN JJ NNS VBG eos
[8184, 18906, 19420, 24703, 30424, 58662, 71746, 74203, 74501, 83000]

DT JJ NN IN DT JJ NN NN NN eos
[8187, 71365, 85166]

JJ NNS IN JJ NNS eos
[8189, 13351, 47385, 47398, 62987, 62996, 68497]

NNS IN DT JJ NN NN VBP VBN eos
[8191, 8199]

DT NN VBP VBG IN eos
[8193, 8198, 76860]

DT NN VBZ TO VB WRB TO VB DT NN IN JJ NNS eos
[8200, 8218]

DT NN VBZ VBG WRB TO VB NNS eos
[8202, 31125, 38928, 44371, 53021, 57842, 57846, 76116, 121370]

DT NN VBZ NN VBG eos
[8206, 8217, 47281]

NN VBG IN WRB TO VB NNS eos
[8211, 47799]

NN VBG WRB TO VB eos
[8214, 28911, 44916, 44932, 76175]

NN NN IN DT NN IN DT NN eos
[8215, 8219, 58792, 77946]

DT NN VBZ VBG WRB TO VB DT NN IN NN...


DT NN VBZ VBG PRP$ JJ eos
[23240, 82547]

DT NN NNS IN NNS eos
[23241, 36516, 40983, 64827, 64839, 66290, 82228, 84682]

DT NN VBG IN DT NN NN IN NN eos
[23250, 121428]

NN NN VBZ VBG TO DT NN eos
[23255, 23259, 57510, 57516, 57518, 85191]

NN VBG NNS IN NN NN eos
[23256, 44713, 44718, 78055]

DT NN VBZ IN NNS IN NN eos
[23257, 37774, 66565, 66566, 88666]

DT NN VBZ DT NN IN IN DT NN IN DT NN eos
[23260, 89184]

DT NN NN VBZ VBG DT JJ NN IN DT NN eos
[23264, 102173]

DT NN NN VBZ RP IN DT NN eos
[23267, 50395]

NN IN DT NN VBG DT NN NN eos
[23275, 39718, 59091, 59097, 60133, 82337]

IN DT NN NN VB DT NN VBZ VBG DT NN eos
[23277, 72470]

DT NN VBZ IN PRP$ NNS eos
[23283, 27491, 35063, 43525, 43887, 43893, 53131, 65006]

DT NN VBZ VBG IN DT NN JJ NN eos
[23284, 100787]

DT NN VBZ IN DT NN JJ NN eos
[23285, 116489]

DT NN VBZ IN DT RB JJ NN eos
[23290, 72252, 72258]

DT NN VBG IN CD NN NNS eos
[23309, 37605]

DT JJ NN IN DT NNS eos
[23323, 23339, 34788, 84703]

NN NN NNS VBP VBG DT NN eo...


DT NN VBG RP RB IN DT NN eos
[44301, 65540, 65558]

DT NN IN NNS IN DT JJ JJ NN IN RB PRP IN DT NN eos
[44321, 44336, 44337]

NN VBG NN IN NN IN DT NN eos
[44330, 67310]

NN VBZ VBG DT NN CC VBG IN DT eos
[44332, 48537, 50675]

DT NN VBZ VBG NNS TO VBG NNS eos
[44346, 44356]

DT NN VBZ VBG NNS TO VB eos
[44369, 65002, 85825]

NN IN DT NN WDT VBZ TO VB VBN eos
[44390, 44397]

NN VBZ VBG NN CC VBG IN DT eos
[44392, 51117, 75852]

DT NN NN NN NN CC NN NN JJ NNS VBZ RP eos
[44404, 44418]

DT NN IN DT NN NNS JJ NN eos
[44406, 59721]

NN NN NN NN IN NNS eos
[44415, 44419]

NN VBZ DT NN JJ eos
[44416, 76853]

NNS VBN IN VBG NNS eos
[44423, 85383, 86625]

CD NNS IN DT NN NN IN DT NN NNS eos
[44427, 111372]

CD JJ NNS VBP IN NN IN DT NN eos
[44432, 72877]

NN IN DT NN VBG DT NN IN DT NN eos
[44457, 44459, 51458]

DT NNS VBP VBG VBG eos
[44464, 87368]

EX VBP CD NNS IN DT NN NN eos
[44467, 46111, 47326, 86650]

NN VBZ VBG VBN IN DT NN NN IN DT NN IN JJ NN eos
[44490, 44497]

DT NN VBZ VBG WRB T...


DT NNS VBG NN VBG eos
[79473, 79478]

DT NN JJ NN VBZ VBG IN DT NN eos
[79536, 99140, 120713]

NN IN JJ NNS VBG NN eos
[79547, 79558]

NNS VBP JJ IN NN IN DT NN NN NN eos
[79549, 79559]

DT JJ NNS VBG JJ NNS CC VBG eos
[79550, 79556]

DT NN VBZ PRP$ NNS VBN eos
[79566, 79576, 79577]

DT NN IN WP VBZ VBN IN NN eos
[79584, 79598]

DT NN VBZ IN PRP VBZ DT JJ NN NN eos
[79623, 113892]

NN VBZ VBG DT VBG NN NN eos
[79632, 79639, 106973]

DT NN NN VBZ DT NN TO VB eos
[79646, 79658]

DT VBG WRB TO VB DT NN eos
[79720, 79739]

DT NN VBZ DT JJ NN TO VB DT NN eos
[79734, 98851]

DT NNS VBD PRP$ NN eos
[79755, 79759]

NNS VBG RB IN CD NN eos
[79771, 79777]

DT NN CC VBD JJ NNS VB DT JJ NN eos
[79774, 79778, 79779]

DT JJ NN NN VBZ VBG IN DT NNS eos
[79809, 120960]

NNS VBG TO DT NN NN NN eos
[79814, 79819]

DT NN CC NN NN CC NN IN NN NN eos
[79820, 79839]

CD NNS NN IN DT NN NN eos
[79837, 86355]

NNS IN NNS VBG JJ NNS eos
[79848, 79858]

CD NN CC CD NN NN DT JJ eos
[79877, 79879]

DT NN VBG TO ...
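
The grouping done in this cell (template string to the list of caption indexes that share it) can be written more compactly with `collections.defaultdict`. This is a sketch on toy templates; `group_by_template` is a hypothetical name.

```python
from collections import defaultdict


def group_by_template(templates):
    """Map each POS template to the list of caption indexes that use it."""
    groups = defaultdict(list)
    for i, template in enumerate(templates):
        groups[template].append(i)
    return dict(groups)


toy_templates = ['DT NN VBZ eos', 'NN VBG eos', 'DT NN VBZ eos']
groups = group_by_template(toy_templates)
shared = {t: idxs for t, idxs in groups.items() if len(idxs) > 1}
print(shared)  # {'DT NN VBZ eos': [0, 2]}
```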

In [15]:
[k for k in d.keys() if len(d[k])==1]

['DT NN VBZ DT NN NN VBD IN NN NN DT NN eos',
 'DT NN VBZ NN CC NNS PRP eos',
 'DT NN IN DT NN NN NN NN VBZ VBN eos',
 'DT JJ NN IN DT JJ JJ VBG IN DT NN NN eos',
 'NN CC NN IN DT NN NN NN NN eos',
 'NN NN NN NN JJ NN NNS IN eos',
 'DT NN VBZ PRP$ NN VBD IN DT NN NN IN DT NN NN eos',
 'DT NN NN DT NN IN DT NN IN DT NN NN CC RB VBZ VBG PRP eos',
 'DT NN VBZ DT JJ NN IN CD NNS IN DT NN NN eos',
 'CD NNS VBG CC DT NN VBG TO VB eos',
 'NN VBZ JJ NN IN NN CC VBZ PRP IN DT NN eos',
 'CD NNS VBN IN DT JJ eos',
 'JJ NNS IN NN VBP VBG IN DT JJ NN NN NN eos',
 'DT NN VBZ WRB TO VB DT JJ IN DT JJ NN eos',
 'DT NN VBZ VBG DT NN NN IN DT JJ eos',
 'DT NN VBZ VBG DT IN NN IN NNS VBN TO NN CC DT JJ eos',
 'CD NNS RB IN DT JJ eos',
 'CD NNS VBG CD DT eos',
 'CD NNS NN PRP RP eos',
 'DT NN VBG IN DT eos',
 'DT NN VBZ JJ IN DT IN DT NN VBZ PRP$ eos',
 'JJ NNS VBP VBG IN DT JJ NN IN DT NN IN DT NN NN eos',
 'NNS VBP NN IN VB RB eos',
 'DT NN IN NNS VBP IN NNS eos',
 'CD NNS CC NN NN IN DT JJ NN NN eos',
 ...


In [14]:
repeated_templates = [k for k in d.keys() if len(d[k]) > 1]
# sorted() is ascending, so low indexes hold the least frequent repeated templates.
by_frequency = sorted(repeated_templates, key=lambda k: len(d[k]))
print(f"count of templates: {len(d)}")
print(f"count of templates that appear at least 2 times: {len(repeated_templates)}")
print(f"template at index 1000 (ascending frequency): {by_frequency[1000]}")

count of templates: 69861
count of templates that appear at least 2 times: 10973
template at index 1000 (ascending frequency): NN MD VBG PRP RB VBP VBN NN NN VBZ IN eos
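
If the goal is the template that occurs most often, `collections.Counter.most_common` gives a frequency ranking directly. The sketch below runs on toy templates; in the notebook, the input would be the `pos_templates` list, which is an assumption about how it would be wired in.

```python
from collections import Counter

toy_templates = [
    'DT NN VBZ eos', 'DT NN VBZ eos', 'NN VBG eos',
    'DT NN VBZ eos', 'NN VBG eos', 'CD NNS eos',
]

counts = Counter(toy_templates)
# Templates that appear at least twice.
repeated = [t for t, c in counts.items() if c > 1]
# most_common(1) returns a list with the single most frequent (template, count) pair.
top_template, top_count = counts.most_common(1)[0]

print(len(counts), len(repeated))  # 3 2
print(top_template, top_count)     # DT NN VBZ eos 3
```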
