Hello,
This is a minor question about the code; I want to make sure I'm not missing any subtleties.
At L354 of model.py, there is the final step that extracts the text features:
```python
# x.shape = [batch_size, n_ctx, transformer.width]
x = x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ self.text_projection
```
`self.text_projection` is the final projection that produces the text features.
`text.argmax(dim=-1)` finds the position of the EOT token in each sequence (it has the highest token id), so this indexing picks the features of the EOT token.
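To make my reading concrete, here is a minimal toy sketch of how I understand the current indexing. All sizes and token ids below are made up for illustration; I only follow the convention that EOT has the highest id in the vocabulary.

```python
import torch

# Toy sizes, purely for illustration (the real model uses n_ctx=77 and a larger width)
batch_size, n_ctx, width = 2, 5, 3
x = torch.randn(batch_size, n_ctx, width)

# Fake token ids; EOT has the highest id, so argmax over the sequence finds its position
text = torch.tensor([
    [49406, 320, 49407, 0, 0],    # EOT at position 2
    [49406, 320, 589, 49407, 0],  # EOT at position 3
])

eot_pos = text.argmax(dim=-1)     # tensor([2, 3]): one position per sequence

# The indexing in model.py: batch index i is paired with eot_pos[i],
# so one width-vector is selected per sequence
pooled = x[torch.arange(batch_size), eot_pos]
print(pooled.shape)               # torch.Size([2, 3]) = [batch_size, width]
```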
Why is there a `torch.arange(x.shape[0])`? Couldn't it simply be `x[:, text.argmax(dim=-1)]`?
Thanks for the work, the code, and the model.