Why wasn't the question masked in the first stage? Is the loss computed over both the question tokens and the answer tokens?