I think it would be nice to generally have some better infrastructure/mechanism for getting values to attention processors.
At the moment the main means of transportation seems to be cross_attention_kwargs but they are finally expanded as function arguments and have to match ALL function signatures that receive them. That's why there is a lot of dict gymnastics before the call. Just further passing the dict via reference where every processor may pick what it needs not bothering with other items would be more flexible imo.
In my mind pipeline and processors should be the malleable 'loose ends' to modify and swap with infrastructure in between rarely in need of adjustment.