progress 2020-07-11 #2
Comments
Yes, I think this is worth thinking about. One thing is that, under the assumption of random responding, our all-endorse questions are naturally less diagnostic than the none-endorse ones simply because of the structure of the anchors (2/4 responses are permissible for the BISBAS and SHAPS, compared with 1/4 for the GAD7 and SUSD). I would be interested in exploring the idea that the diagnosticity of an infrequency check is, to a first approximation, a function of (a) the number of permissible responses, and (b) the complexity (operationalised as number of words) of the questionnaire.
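To make point (a) concrete, here's a minimal sketch of the pass probability under uniform random responding on a 4-point scale (an assumption for illustration only; the anchor counts are the ones described above):

```python
# Illustrative only: probability that a uniformly random responder
# "passes" an infrequency check, as a function of how many of the
# 4 anchors count as permissible (assumes a 4-point response scale).

def pass_prob(n_permissible, n_anchors=4, n_items=1):
    """P(a random responder passes all n_items checks of this type)."""
    return (n_permissible / n_anchors) ** n_items

print(pass_prob(2))  # all-endorse style (2/4 permissible): 0.5 per item
print(pass_prob(1))  # none-endorse style (1/4 permissible): 0.25 per item
print(pass_prob(2, n_items=3), pass_prob(1, n_items=3))  # gap widens with more items
```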
It can't hurt to calculate things like split-half consistency, even if we expect it to be less robust here; as you say, we can still compute it and compare. I agree that we probably don't have the right style of items for the synonym/antonym approach to be useful.
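For reference, a minimal split-half sketch with the Spearman-Brown correction, assuming item-level scores are available as a subjects × items array for one subscale (`item_scores` is a hypothetical name); with few items per subscale the estimate will be noisy, as noted:

```python
import numpy as np

def split_half(item_scores):
    """Odd/even split-half consistency with the Spearman-Brown correction.

    item_scores: (n_subjects, n_items) array for a single subscale.
    """
    odd = item_scores[:, 0::2].sum(axis=1)
    even = item_scores[:, 1::2].sum(axis=1)
    r = np.corrcoef(odd, even)[0, 1]
    return 2 * r / (1 + r)
```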
It's an interesting point. My two responses are (1) that the ROC curves do away with the problem of thresholding, since they tell us how good any given threshold will be, and (2) that if we really want to set an arbitrary threshold we could do it on the basis of distribution quantiles, e.g., how well do we do if we exclude everyone above the 90th quantile on entropy?
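A minimal sketch of both ideas, with placeholder `entropy` scores and placeholder careless-responder labels standing in for the real data (so the numbers mean nothing; only the mechanics are illustrated):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
entropy = rng.normal(size=500)            # placeholder per-subject response entropy
is_random = rng.integers(0, 2, size=500)  # placeholder labels for careless responding

# (1) ROC: performance of every possible threshold at once
fpr, tpr, thresholds = roc_curve(is_random, entropy)
print("AUC:", roc_auc_score(is_random, entropy))

# (2) Quantile rule: exclude everyone above the 90th quantile on entropy
cutoff = np.quantile(entropy, 0.90)
print("excluded:", int(np.sum(entropy > cutoff)), "of", len(entropy))
```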
To summarize our conversation from earlier:
Some points for discussion after looking through the data today:
Infrequency thresholds: we may want to think about the consistency between the different infrequency items, which is lower than what would be expected under pure random responding. Obviously pure random responding is an unrealistic assumption, but I'm wondering if there's anything else to say about those items (e.g. all-endorse items are somehow less discriminative?).
Additional survey metrics: there are some recommended survey quality metrics I have not yet implemented as they are somewhat challenging for our dataset. A metric like internal (split-half) consistency is possibly less robust in our case where we have few items per subscale. Similarly, it's not clear if we have enough items to compute consistency via "psychometric synonyms/antonyms". It doesn't seem crucial to me to compute all of these survey metrics as they're not the crux of the paper -- that said, if there's an easy way to compute these it'd be interesting to compare them to behavioral metrics (re: Major Point progress 2020-07-11 #2, behavior =/= survey thresholding).
Thresholding non-behavior metrics: it is somewhat clearer what the anchor point is for thresholding behavior (i.e. chance). It's less clear for other metrics (total experiment duration, entropy, Mahalanobis D). It's possible the literature has some recommendations; short of that, we'll want to think about a sensible rule.
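On Mahalanobis D specifically, one convention from the multivariate-outlier literature is to compare the squared distance against a chi-square quantile with df equal to the number of variables; a minimal sketch of that rule (the alpha choice, e.g. .001, is still arbitrary, and `X` is a hypothetical subjects × variables score matrix):

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_flags(X, alpha=0.001):
    """Flag rows whose squared Mahalanobis distance from the sample mean
    exceeds the chi-square (1 - alpha) quantile with df = n_variables.

    X: (n_subjects, n_variables) matrix of scores.
    """
    diff = X - X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
    cutoff = chi2.ppf(1 - alpha, df=X.shape[1])
    return d2 > cutoff, d2, cutoff
```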