Can the author confirm how recall is computed for both text-to-image and image-to-text retrieval, given that there are five captions per image?
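
To make the question concrete, here is a minimal sketch of one common convention. It is an assumption on my part, not necessarily what the paper does. For image-to-text, a query image counts as a hit if any one of its captions appears in the top K. For text-to-image, each caption is a separate query with a single ground-truth image. The names `recall_at_k`, `sim`, and `caption_to_image` are hypothetical.

```python
import numpy as np


def recall_at_k(sim, caption_to_image, k):
    """Return (image_to_text_recall, text_to_image_recall) at cutoff k.

    sim: array of shape (num_images, num_captions) with similarity scores.
    caption_to_image: caption_to_image[j] is the index of the image that
        caption j describes (hypothetical layout, e.g. 5 captions per image).
    """
    sim = np.asarray(sim, dtype=float)
    caption_to_image = np.asarray(caption_to_image)
    num_images = sim.shape[0]

    # Image-to-text: rank all captions for each image and count a hit
    # if ANY of the image's own captions is among the top k (assumed convention).
    top_caps = np.argsort(-sim, axis=1)[:, :k]            # (num_images, k)
    owners = caption_to_image[top_caps]                   # image index of each retrieved caption
    i2t_hits = (owners == np.arange(num_images)[:, None]).any(axis=1)

    # Text-to-image: rank all images for each caption and count a hit
    # if the caption's single ground-truth image is among the top k.
    top_imgs = np.argsort(-sim.T, axis=1)[:, :k]          # (num_captions, k)
    t2i_hits = (top_imgs == caption_to_image[:, None]).any(axis=1)

    return float(i2t_hits.mean()), float(t2i_hits.mean())
```

Under this convention the image-to-text score is averaged over images and the text-to-image score over captions, so the two directions use different denominators. If the paper instead requires all five captions to be retrieved, or averages per caption in both directions, the reported numbers would not be directly comparable.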