Every word should have an individual time span. #9

gordon-lim · 2021-08-14T09:11:39Z

I quote from the paper: "Note that µ assigns each ai to exactly one mj, but mj can match to zero or multiple words in a", and I refer you to the formal definition of how the start_time and end_time are derived for a word mj. It stands to reason that every individual word should have its own time span. However, in the sample data

{ dataset_id: 'mscoco_val2017', image_id: '137576', annotator_id: 93, caption: 'In this image there are group of cows standing and eating th...', timed_caption: [{'utterance': 'In this', 'start_time': 0.0, 'end_time': 0.4}, ...], traces: [[{'x': 0.2086, 'y': -0.0533, 't': 0.022}, ...], ...], voice_recording: 'coco_val/coco_val_137576_93.ogg' }

it is shown that the two words 'In this' share the same time window, which is contrary to how the paper describes start_time and end_time being assigned, that is, to individual words.

Yet, it is correct in other parts of the dataset such as the following

{ dataset_id: ADE20k, image_id: ADE_val_00000175, annotator_id: 125, caption: In this image on the left side I can see a bed and a window...., timed_caption: [{'utterance': 'In', 'start_time': 0.0, 'end_time': 0.0}, {'utterance': 'this', 'start_time': 0.0, 'end_time': 0.8}, ...], traces: [[{'x': 0.6408, 'y': 0.1371, 't': 0.013}, ...], ...], voice_recording: ade20k_validation/ade20k_validation_ADE_val_00000175_125.ogg }

where 'In' and 'this' each have their own start_time and end_time.

I would appreciate it if you could help shed light on this disparity. Sorry if this is a mistake on my part, and thank you for reading my message.

Yours Sincerely,
Gordon

jponttuset · 2021-08-16T13:29:45Z

Hi @gordon-lim,

Thanks for your interest in the data.

The ASR we used (like the majority, I'd say) works with "utterances" as its smallest unit, not words. Sometimes, therefore, it recognizes "In this" as a single unit, as happened in the example you pointed out. The wording in this fragment of the paper is indeed not precise: it should say utterance instead of word.

The bottom line is that we rely on the ASR to segment the sentences, and we have no way of dividing them further, as we wouldn't know which timestamps to assign to the individual words.
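For anyone who still needs word-level entries, one possible workaround is sketched below. It is only an illustration, not part of the released tooling: the helper name `words_with_spans` and the `shares_span` flag are hypothetical. It splits each utterance on whitespace and lets every word inherit the utterance's full time span, so no new timestamps are invented.

```python
from typing import Dict, List, Union

# One entry of `timed_caption`, as shown in the sample data above.
Utterance = Dict[str, Union[str, float]]


def words_with_spans(timed_caption: List[Utterance]) -> List[dict]:
    """Expand utterances into words (hypothetical helper).

    Each word inherits the start_time/end_time of the utterance it came
    from. `shares_span` marks words whose utterance held several words,
    i.e. words whose individual timing is unknown.
    """
    words = []
    for segment in timed_caption:
        tokens = str(segment["utterance"]).split()
        for token in tokens:
            words.append({
                "word": token,
                "start_time": segment["start_time"],
                "end_time": segment["end_time"],
                "shares_span": len(tokens) > 1,
            })
    return words


# The multi-word utterance from the mscoco_val2017 sample.
sample = [{"utterance": "In this", "start_time": 0.0, "end_time": 0.4}]
for entry in words_with_spans(sample):
    print(entry)
```

Downstream code can then decide how to treat words with `shares_span` set, for example by skipping them when exact word timing matters.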

I hope this helps.

Best,

gordon-lim · 2021-08-16T13:35:15Z

Thank you very much for the clarification.

jponttuset added the question Further information is requested label Aug 16, 2021

jponttuset self-assigned this Aug 16, 2021

gordon-lim closed this as completed Aug 16, 2021
