
progress 2020-07-11 #2

Closed · szorowi1 opened this issue Jul 11, 2020 · 2 comments

Comments
@szorowi1
Contributor

Some points for discussion after looking through the data today:

  • Infrequency thresholds: we may want to think about the consistency between the different infrequency items, which is lower than what would be expected under pure random responding. Obviously, pure random responding is an unrealistic assumption, but I'm wondering if there's anything else to say about those items (e.g., are the all-endorse items somehow less discriminative?).

  • Additional survey metrics: there are some recommended survey quality metrics I have not yet implemented as they are somewhat challenging for our dataset. A metric like internal (split-half) consistency is possibly less robust in our case where we have few items per subscale. Similarly, it's not clear if we have enough items to compute consistency via "psychometric synonyms/antonyms". It doesn't seem crucial to me to compute all of these survey metrics as they're not the crux of the paper -- that said, if there's an easy way to compute these it'd be interesting to compare them to behavioral metrics (re: Major Point #2, behavior =/= survey thresholding).

  • Thresholding non-behavior metrics: it is somewhat clearer what the anchor points are for thresholding behavior (i.e., chance). It's somewhat less clear for other metrics (total experiment duration, entropy, Mahalanobis D). The literature may have some recommendations; short of that, we'll want to think about a sensible rule.

@danielbrianbennett
Contributor

danielbrianbennett commented Jul 12, 2020

Infrequency thresholds: we may want to think about the consistency between the different infrequency items, which is lower than what would be expected under pure random responding. Obviously, pure random responding is an unrealistic assumption, but I'm wondering if there's anything else to say about those items (e.g., are the all-endorse items somehow less discriminative?).

Yes, I think this is worth thinking about. One thing is that under the assumption of random responding, our all-endorse questions are naturally less diagnostic than the none-endorse ones, simply because of the structure of the anchors (2/4 responses are permissible for the BISBAS and SHAPS, compared with 1/4 for the GAD7 and SUSD). I would be interested in exploring the idea that the diagnosticity of an infrequency check is, to a first approximation, a function of (a) the number of permissible responses, and (b) the complexity (operationalised as the number of words) of the questionnaire.
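For a rough sense of that baseline, here is a small sketch (the 4-point scale and the 2/4 vs. 1/4 permissible-response counts come from the description above; everything else is illustrative, not taken from the actual surveys) of how often a uniformly random responder would slip past each check and how often the checks would agree:

```python
# Minimal sketch: expected pass rates for infrequency checks under
# uniformly random responding on a 4-point scale. "Permissible" counts
# follow the comment above; independence across items is assumed.
from itertools import combinations

n_options = 4
permissible = {"BISBAS": 2, "SHAPS": 2, "GAD7": 1, "SUSD": 1}

# Per-item probability that a random responder slips past the check.
pass_prob = {item: k / n_options for item, k in permissible.items()}
print(pass_prob)  # {'BISBAS': 0.5, 'SHAPS': 0.5, 'GAD7': 0.25, 'SUSD': 0.25}

# Probability a random responder passes every check.
p_all = 1.0
for p in pass_prob.values():
    p_all *= p
print(f"P(pass all checks | random responding) = {p_all:.4f}")  # 0.0156

# Expected pairwise agreement (both pass or both fail) under random responding.
for (a, pa), (b, pb) in combinations(pass_prob.items(), 2):
    agree = pa * pb + (1 - pa) * (1 - pb)
    print(f"{a} vs {b}: expected agreement = {agree:.3f}")
```

Under these assumptions the all-endorse items on their own let half of random responders through, which is at least consistent with them looking less discriminative.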

Additional survey metrics: there are some recommended survey quality metrics I have not yet implemented as they are somewhat challenging for our dataset. A metric like internal (split-half) consistency is possibly less robust in our case where we have few items per subscale. Similarly, it's not clear if we have enough items to compute consistency via "psychometric synonyms/antonyms". It doesn't seem crucial to me to compute all of these survey metrics as they're not the crux of the paper -- that said, if there's an easy way to compute these it'd be interesting to compare them to behavioral metrics (re: Major Point #2, behavior =/= survey thresholding).

It can't hurt to calculate things like split-half consistency, even if we expect them to be less robust - as you say, we can still compute them and compare. I agree that we probably don't have the right style of items for the synonym/antonym approach to be useful.
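If it helps, a minimal split-half sketch could look like the following (the data are simulated and the column names hypothetical; it assumes a participants × items table for one subscale), with the usual Spearman-Brown correction:

```python
# Minimal split-half sketch on simulated data, not tied to the project's files.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Stand-in for one subscale: 200 participants x 6 items on a 0-3 scale.
items = pd.DataFrame(rng.integers(0, 4, size=(200, 6)),
                     columns=[f"item{i}" for i in range(6)])

# Odd/even split of the items; with only a few items per subscale this
# estimate will be noisy, which is the caveat raised above.
half_a = items.iloc[:, ::2].sum(axis=1)
half_b = items.iloc[:, 1::2].sum(axis=1)

r = half_a.corr(half_b)           # correlation between the two half-scores
split_half = 2 * r / (1 + r)      # Spearman-Brown correction to full length
print(f"r = {r:.3f}, Spearman-Brown split-half = {split_half:.3f}")
```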

Thresholding non-behavior metrics: it is somewhat clearer what the anchor points are for thresholding behavior (i.e., chance). It's somewhat less clear for other metrics (total experiment duration, entropy, Mahalanobis D). The literature may have some recommendations; short of that, we'll want to think about a sensible rule.

It's an interesting point. My two responses are (1) that the ROC curves do away with the problem of thresholding, since they tell us how good any given threshold will be, and (2) that if we really want to set an arbitrary threshold we could do it on the basis of distribution quantiles; e.g., how well do we do if we exclude everyone above the 90th percentile on entropy?
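A sketch of both ideas on simulated data (the entropy values and "low-effort" labels below are made up; in practice the labels would come from the infrequency checks and the entropy from the real response data):

```python
# Minimal sketch: (1) ROC over a continuous metric, (2) a 90th-percentile cut.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
n = 500
low_effort = rng.random(n) < 0.10                   # ~10% flagged by the surveys
entropy = np.where(low_effort,
                   rng.normal(1.8, 0.2, n),         # random-ish responders
                   rng.normal(1.2, 0.3, n))         # attentive responders

# (1) ROC: how well entropy separates flagged from unflagged participants
# across every possible threshold.
fpr, tpr, thresholds = roc_curve(low_effort, entropy)
print(f"AUC = {roc_auc_score(low_effort, entropy):.3f}")

# (2) Quantile rule: exclude everyone above the 90th percentile on entropy.
cut = np.quantile(entropy, 0.90)
excluded = entropy > cut
hit_rate = (excluded & low_effort).mean() / low_effort.mean()
false_alarm = (excluded & ~low_effort).mean() / (~low_effort).mean()
print(f"cut = {cut:.2f}: hit rate = {hit_rate:.2f}, "
      f"false-alarm rate = {false_alarm:.2f}")
```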

@szorowi1
Contributor Author

To summarize our conversation from earlier:

  1. We will think some more about what might be driving differences in the efficacy of screening items (e.g. survey difficulty, response scales, anchor placement relative to modal response).

  2. Will compute additional metrics.

  3. Will leave metrics as continuous/ordinal. Will compute item similarity.

  4. Will compute the attenuation of correlations as a function of the fraction of flagged participants removed (see the sketch after this list).

  5. Will bolster arguments through simulations (esp. sample size and low-effort base rate).
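Regarding point 4, a minimal sketch of the attenuation analysis on simulated data (all names, the quality ranking, and the data-generating assumptions here are illustrative, not from the project):

```python
# Minimal sketch: survey-behavior correlation as an increasing fraction of
# flagged participants is removed. Simulated data only.
import numpy as np

rng = np.random.default_rng(0)
n = 400
flagged = rng.random(n) < 0.15                      # ~15% flagged participants

# Simulated survey-behavior relationship, absent for flagged participants.
survey = rng.normal(0, 1, n)
behavior = 0.5 * survey + rng.normal(0, 1, n)
behavior[flagged] = rng.normal(0, 1, flagged.sum())

# Order flagged participants from worst to best on some quality metric
# (here just noise plus the flag itself, purely for illustration).
badness = rng.random(n) + 0.5 * flagged
flagged_idx = np.where(flagged)[0]
flagged_idx = flagged_idx[np.argsort(-badness[flagged_idx])]

# Correlation after removing an increasing fraction of flagged participants.
for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
    n_drop = int(round(frac * flagged_idx.size))
    keep = np.ones(n, dtype=bool)
    keep[flagged_idx[:n_drop]] = False
    r = np.corrcoef(survey[keep], behavior[keep])[0, 1]
    print(f"removed {frac:.0%} of flagged participants: r = {r:.3f}")
```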
