Every word should have an individual time span. #9

Closed
gordon-lim opened this issue Aug 14, 2021 · 2 comments

@gordon-lim

I quote from the paper: "Note that µ assigns each a_i to exactly one m_j, but m_j can match to zero or multiple words in a", and refer you to the formal definition of how start_time and end_time are derived for a word m_j. It stands to reason that every individual word should have its own time span. However, in the sample data

{
  dataset_id: 'mscoco_val2017',
  image_id: '137576',
  annotator_id: 93,
  caption: 'In this image there are group of cows standing and eating th...',
  timed_caption: [{'utterance': 'In this', 'start_time': 0.0, 'end_time': 0.4}, ...],
  traces: [[{'x': 0.2086, 'y': -0.0533, 't': 0.022}, ...], ...],
  voice_recording: 'coco_val/coco_val_137576_93.ogg'
}

the two words 'In this' share a single time window, which is contrary to how the paper says start_time and end_time are assigned, namely to individual words.

Yet other parts of the dataset are consistent with the paper, such as the following

{
  dataset_id: ADE20k,
  image_id: ADE_val_00000175,
  annotator_id: 125,
  caption: In this image on the left side I can see a bed and a window....,
  timed_caption: [{'utterance': 'In', 'start_time': 0.0, 'end_time': 0.0}, {'utterance': 'this', 'start_time': 0.0, 'end_time': 0.8}, ...],
  traces: [[{'x': 0.6408, 'y': 0.1371, 't': 0.013}, ...], ...],
  voice_recording: ade20k_validation/ade20k_validation_ADE_val_00000175_125.ogg
}

where 'In' and 'this' each have their own start_time and end_time.
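
To see how often this happens, one could scan the timed_caption entries for utterances that contain more than one word. This is only a minimal sketch, assuming each annotation is a dict shaped like the samples above; the helper name `multi_word_utterances` is hypothetical and not part of the dataset tooling.

```python
def multi_word_utterances(annotation):
    """Return the timed_caption entries whose utterance holds more than one word."""
    return [
        entry
        for entry in annotation["timed_caption"]
        if len(entry["utterance"].split()) > 1
    ]


# Records abbreviated from the two samples quoted above (other fields omitted).
coco_sample = {
    "timed_caption": [
        {"utterance": "In this", "start_time": 0.0, "end_time": 0.4},
    ]
}
ade_sample = {
    "timed_caption": [
        {"utterance": "In", "start_time": 0.0, "end_time": 0.0},
        {"utterance": "this", "start_time": 0.0, "end_time": 0.8},
    ]
}

print(multi_word_utterances(coco_sample))  # the single 'In this' entry
print(multi_word_utterances(ade_sample))   # an empty list
```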

I would appreciate it if you could help shed light on this discrepancy. Sorry if this is a mistake on my part, and thank you for reading my message.

Yours Sincerely,
Gordon

@jponttuset
Member

Hi @gordon-lim,

Thanks for your interest in the data.

The ASR system we used (like most, I'd say) works with "utterances" as its smallest unit, not words. Sometimes, therefore, it recognizes "In this" as a single unit, as happened in the example you pointed out. The wording in this fragment of the paper is indeed not precise: it should say utterance instead of word.

The bottom line is that we rely on the ASR to segment the sentences, and we have no way of dividing an utterance further, as we wouldn't know which timestamps to assign to its individual words.
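
For anyone who needs one entry per word anyway, a possible consumer-side workaround is to let every word inherit the full time span of its utterance, since no finer timing is available. The following is a minimal sketch under that assumption, not how the dataset was built; the helper name `words_with_spans` is hypothetical and not part of the released code.

```python
def words_with_spans(timed_caption):
    """Split each utterance on whitespace; every word keeps the utterance's span."""
    words = []
    for entry in timed_caption:
        for word in entry["utterance"].split():
            words.append(
                {
                    "word": word,
                    "start_time": entry["start_time"],
                    "end_time": entry["end_time"],
                }
            )
    return words


# The multi-word utterance from the mscoco_val2017 sample in this thread.
timed_caption = [{"utterance": "In this", "start_time": 0.0, "end_time": 0.4}]
for item in words_with_spans(timed_caption):
    print(item)
```

Both words end up with the same span, 0.0 to 0.4. The spans overlap rather than partition the utterance, which reflects that the word boundary inside it is unknown.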

I hope this helps.

Best,

jponttuset added the question (Further information is requested) label Aug 16, 2021
jponttuset self-assigned this Aug 16, 2021
@gordon-lim
Author

Thank you very much for the clarification.
