
progress 2020-07-11 #2

Closed · szorowi1 opened this issue Jul 11, 2020 · 2 comments

Comments
@szorowi1
Contributor

Some points for discussion after looking through the data today:

  • Infrequency thresholds: we may want to think about the consistency between the different infrequency items, which is lower than what would be expected under pure random responding. Obviously, pure random responding is an unrealistic assumption, but I'm wondering if there's anything else to say about those items (e.g., are the all-endorse items somehow less discriminative?).

  • Additional survey metrics: there are some recommended survey quality metrics I have not yet implemented as they are somewhat challenging for our dataset. A metric like internal (split-half) consistency is possibly less robust in our case where we have few items per subscale. Similarly, it's not clear if we have enough items to compute consistency via "psychometric synonyms/antonyms". It doesn't seem crucial to me to compute all of these survey metrics as they're not the crux of the paper -- that said, if there's an easy way to compute these it'd be interesting to compare them to behavioral metrics (re: Major Point #2, behavior =/= survey thresholding).

  • Thresholding non-behavior metrics: it is somewhat clearer what the anchor points are for thresholding behavior (i.e., chance). It's somewhat less clear for other metrics (total experiment duration, entropy, Mahalanobis D). The literature may have some recommendations; short of that, we'll want to think about a sensible rule.

@danielbrianbennett
Contributor

danielbrianbennett commented Jul 12, 2020

Infrequency thresholds: we may want to think about the consistency between the different infrequency items, which is lower than what would be expected under pure random responding. Obviously, pure random responding is an unrealistic assumption, but I'm wondering if there's anything else to say about those items (e.g., are the all-endorse items somehow less discriminative?).

Yes, I think this is worth thinking about. One thing is that under the assumption of random responding, our all-endorse questions are naturally less diagnostic than the none-endorse ones, simply because of the structure of the anchors (2/4 responses are permissible for the BISBAS and SHAPS, compared with 1/4 for the GAD7 and SUSD). I would be interested in exploring the idea that the diagnosticity of an infrequency check is, to a first approximation, a function of (a) the number of permissible responses, and (b) the complexity (operationalised as the number of words) of the questionnaire.
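For a rough sense of that baseline, here is a small sketch (the 4-point scale and the 2/4 vs. 1/4 permissible-response counts come from the description above; everything else is illustrative, not taken from the actual surveys) of how often a uniformly random responder would slip past each check and how often the checks would agree:

```python
# Minimal sketch: expected pass rates for infrequency checks under
# uniformly random responding on a 4-point scale. "Permissible" counts
# follow the comment above; independence across items is assumed.
from itertools import combinations

n_options = 4
permissible = {"BISBAS": 2, "SHAPS": 2, "GAD7": 1, "SUSD": 1}

# Per-item probability that a random responder slips past the check.
pass_prob = {item: k / n_options for item, k in permissible.items()}
print(pass_prob)  # {'BISBAS': 0.5, 'SHAPS': 0.5, 'GAD7': 0.25, 'SUSD': 0.25}

# Probability a random responder passes every check.
p_all = 1.0
for p in pass_prob.values():
    p_all *= p
print(f"P(pass all checks | random responding) = {p_all:.4f}")  # 0.0156

# Expected pairwise agreement (both pass or both fail) under random responding.
for (a, pa), (b, pb) in combinations(pass_prob.items(), 2):
    agree = pa * pb + (1 - pa) * (1 - pb)
    print(f"{a} vs {b}: expected agreement = {agree:.3f}")
```

Under these assumptions the all-endorse items on their own let half of random responders through, which is at least consistent with them looking less discriminative.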

Additional survey metrics: there are some recommended survey quality metrics I have not yet implemented as they are somewhat challenging for our dataset. A metric like internal (split-half) consistency is possibly less robust in our case where we have few items per subscale. Similarly, it's not clear if we have enough items to compute consistency via "psychometric synonyms/antonyms". It doesn't seem crucial to me to compute all of these survey metrics as they're not the crux of the paper -- that said, if there's an easy way to compute these it'd be interesting to compare them to behavioral metrics (re: Major Point #2, behavior =/= survey thresholding).

It can't hurt to calculate things like split-half consistency, even if we expect them to be less robust - as you say, we can still compute them and compare. I agree that we probably don't have the right style of items for the synonym/antonym approach to be useful.
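If it helps, a minimal split-half sketch could look like the following (the data are simulated and the column names hypothetical; it assumes a participants × items table for one subscale), with the usual Spearman-Brown correction:

```python
# Minimal split-half sketch on simulated data, not tied to the project's files.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Stand-in for one subscale: 200 participants x 6 items on a 0-3 scale.
items = pd.DataFrame(rng.integers(0, 4, size=(200, 6)),
                     columns=[f"item{i}" for i in range(6)])

# Odd/even split of the items; with only a few items per subscale this
# estimate will be noisy, which is the caveat raised above.
half_a = items.iloc[:, ::2].sum(axis=1)
half_b = items.iloc[:, 1::2].sum(axis=1)

r = half_a.corr(half_b)           # correlation between the two half-scores
split_half = 2 * r / (1 + r)      # Spearman-Brown correction to full length
print(f"r = {r:.3f}, Spearman-Brown split-half = {split_half:.3f}")
```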

Thresholding non-behavior metrics: it is somewhat clearer what the anchor points are for thresholding behavior (i.e., chance). It's somewhat less clear for other metrics (total experiment duration, entropy, Mahalanobis D). The literature may have some recommendations; short of that, we'll want to think about a sensible rule.

It's an interesting point. My two responses are (1) that the ROC curves do away with the problem of thresholding, since they tell us how good any given threshold will be, and (2) that if we really want to set an arbitrary threshold we could do it on the basis of distribution quantiles; e.g., how well do we do if we exclude everyone above the 90th percentile on entropy?
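A sketch of both ideas on simulated data (the entropy values and "low-effort" labels below are made up; in practice the labels would come from the infrequency checks and the entropy from the real response data):

```python
# Minimal sketch: (1) ROC over a continuous metric, (2) a 90th-percentile cut.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
n = 500
low_effort = rng.random(n) < 0.10                   # ~10% flagged by the surveys
entropy = np.where(low_effort,
                   rng.normal(1.8, 0.2, n),         # random-ish responders
                   rng.normal(1.2, 0.3, n))         # attentive responders

# (1) ROC: how well entropy separates flagged from unflagged participants
# across every possible threshold.
fpr, tpr, thresholds = roc_curve(low_effort, entropy)
print(f"AUC = {roc_auc_score(low_effort, entropy):.3f}")

# (2) Quantile rule: exclude everyone above the 90th percentile on entropy.
cut = np.quantile(entropy, 0.90)
excluded = entropy > cut
hit_rate = (excluded & low_effort).mean() / low_effort.mean()
false_alarm = (excluded & ~low_effort).mean() / (~low_effort).mean()
print(f"cut = {cut:.2f}: hit rate = {hit_rate:.2f}, "
      f"false-alarm rate = {false_alarm:.2f}")
```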

@szorowi1
Contributor Author

To summarize our conversation from earlier:

  1. We will think some more about what might be driving differences in the efficacy of screening items (e.g. survey difficulty, response scales, anchor placement relative to modal response).

  2. Will compute additional metrics.

  3. Will leave metrics as continuous/ordinal. Will compute item similarity.

  4. Will compute the attenuation of correlations as a function of the fraction of flagged participants removed (see the sketch after this list).

  5. Will bolster arguments through simulations (esp. sample size and low-effort base rate).
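Regarding point 4, a minimal sketch of the attenuation analysis on simulated data (all names, the quality ranking, and the data-generating assumptions here are illustrative, not from the project):

```python
# Minimal sketch: survey-behavior correlation as an increasing fraction of
# flagged participants is removed. Simulated data only.
import numpy as np

rng = np.random.default_rng(0)
n = 400
flagged = rng.random(n) < 0.15                      # ~15% flagged participants

# Simulated survey-behavior relationship, absent for flagged participants.
survey = rng.normal(0, 1, n)
behavior = 0.5 * survey + rng.normal(0, 1, n)
behavior[flagged] = rng.normal(0, 1, flagged.sum())

# Order flagged participants from worst to best on some quality metric
# (here just noise plus the flag itself, purely for illustration).
badness = rng.random(n) + 0.5 * flagged
flagged_idx = np.where(flagged)[0]
flagged_idx = flagged_idx[np.argsort(-badness[flagged_idx])]

# Correlation after removing an increasing fraction of flagged participants.
for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
    n_drop = int(round(frac * flagged_idx.size))
    keep = np.ones(n, dtype=bool)
    keep[flagged_idx[:n_drop]] = False
    r = np.corrcoef(survey[keep], behavior[keep])[0, 1]
    print(f"removed {frac:.0%} of flagged participants: r = {r:.3f}")
```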
