Set `serum_id` to `lot_number` for CDC titer imports #126

huddlej · 2022-09-08T21:38:17Z

Current behavior

We currently ingest CDC titer data into fauna using the `sr_ferret` column as the `serum_id` of each measurement.

Expected behavior

However, the values in the `sr_ferret` column are not what the CDC uses to discuss these measurements, so reporting these values has no meaning for them. Instead, the CDC uses the `sr_lot` column to discuss measurements. We currently map the `sr_lot` column to a field in the tdb database called `lot_number`.

Possible solution

One solution would be to set the `serum_id` to the `sr_lot` column value in the CDC tdb upload script instead of using the `sr_ferret` column. The simplest way to make this change might be to change `sr_ferret` to `sr_lot` in the mapping of columns for CDC data.
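To make the proposal concrete, here is a minimal sketch of what swapping the source column could look like. The dictionary layout (fauna field as key, CDC TSV column as value), the helper `apply_column_map`, and the sample values are assumptions for illustration only; the actual column map in the upload script may be structured differently.

```python
# Hypothetical column maps: fauna field -> CDC TSV column.
# Only sr_ferret, sr_lot, serum_id, and lot_number come from this issue.
CURRENT_COLUMN_MAP = {"serum_id": "sr_ferret", "lot_number": "sr_lot"}
PROPOSED_COLUMN_MAP = {"serum_id": "sr_lot", "lot_number": "sr_lot"}


def apply_column_map(row, column_map):
    """Build a measurement dict by reading each field from its mapped TSV column."""
    return {field: row.get(column) for field, column in column_map.items()}


# Made-up values for a single TSV row.
row = {"sr_ferret": "F1234", "sr_lot": "LOT-42"}

current = apply_column_map(row, CURRENT_COLUMN_MAP)
proposed = apply_column_map(row, PROPOSED_COLUMN_MAP)

print(current["serum_id"], proposed["serum_id"])
```

Under these assumptions, the only difference between the two maps is which TSV column feeds `serum_id`; `lot_number` keeps reading from `sr_lot` either way.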

Interestingly, there is some code in the upload script that effectively maps the `lot_number` to the `serum_id`, but that mapping only happens when the `serum_id` isn't already set.
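A rough sketch of that kind of fallback, to show why it never applies to CDC data today. This is not the upload script's actual code; the function name `fill_serum_id` and the sample values are hypothetical.

```python
def fill_serum_id(measurement):
    """Return a copy where serum_id falls back to lot_number only if serum_id is unset."""
    filled = dict(measurement)
    if not filled.get("serum_id") and filled.get("lot_number"):
        filled["serum_id"] = filled["lot_number"]
    return filled


# serum_id was already populated (e.g. from sr_ferret), so the fallback is skipped.
already_set = fill_serum_id({"serum_id": "F1234", "lot_number": "LOT-42"})

# serum_id is missing, so the lot number is used instead.
unset = fill_serum_id({"serum_id": None, "lot_number": "LOT-42"})

print(already_set["serum_id"], unset["serum_id"])
```

Because the column map always fills `serum_id` first, a fallback like this would only take effect if that mapping were removed or changed.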


joverlee521 · 2022-09-08T22:24:41Z

One small issue with changing the underlying data for `serum_id` is that it is one of the index fields for generating the record index. We need to remember to delete the records already in fauna and re-upload them with the new `serum_id` so that we don't have duplicate titers.
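A toy illustration of the duplicate problem, assuming the record index is built by combining a handful of fields. The field list below is hypothetical apart from `serum_id`, and the in-memory dict only stands in for the database table.

```python
# Hypothetical index fields; only serum_id is confirmed by this thread.
INDEX_FIELDS = ("virus_strain", "serum_strain", "serum_id", "assay_date")


def record_index(record):
    """Combine the index fields into a single hashable key."""
    return tuple(record[field] for field in INDEX_FIELDS)


def upsert(table, record):
    """Insert the record, overwriting any existing record with the same index."""
    table[record_index(record)] = record


table = {}
titer = {
    "virus_strain": "A/Example/1/2020",   # made-up values
    "serum_strain": "A/Example/2/2019",
    "serum_id": "F1234",
    "assay_date": "2020-01-15",
    "titer": "160",
}

upsert(table, titer)                              # first upload
upsert(table, titer)                              # re-upload, same index: overwritten
upsert(table, {**titer, "serum_id": "LOT-42"})    # new serum_id: new index, duplicate titer

print(len(table))
```

The same measurement ends up stored twice once its `serum_id` changes, which is why the old records would need to be deleted before re-uploading.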

Another issue is whether we need to change the `serum_id` for older titer measurements. The mapping of columns for CDC data only applies to their new data dump, which only goes back to September 2019. I would have to do some digging to figure out how the `serum_id` was set for the older titers.

Stepping back a bit, I think the current values of `serum_id` and `lot_number` are accurate for our data model. If the CDC wants to use the lot number for discussions, then maybe we can just include this column in our downloads from fauna instead of overhauling the data model.

huddlej · 2022-09-08T22:45:02Z

Good points! I hadn't thought about repopulating the database. I was thinking more about updating the mapping now so future imports (new records) use the lot number instead. Would that kind of change disrupt uploads of new data, though, because the uploads work from the full CDC database TSV and all older records would get an updated index? If so, would it be possible to do a one-time delete of everything from 2019-onward and a fresh upload that maps the lot number to the serum id column? Then subsequent updates wouldn't change the index, right?

Surfacing `lot_number` in addition to `serum_id` sounds much easier technically, although this means adding a column that only the CDC will use and that `serum_id` remains meaningless for them. This becomes a new field to display in the measurements tooltips, a filter and group-by column to add to the measurements config, etc. It's not a big deal, but it reduces the value of the original serum id.

The issue of recreating fauna from older data could be important to figure out eventually, if we really hope to deprecate fauna in favor of a cloud-based file store solution...but this probably isn't the place to discuss that undertaking. 😅

joverlee521 · 2022-09-09T18:04:18Z

> If so, would it be possible to do a one-time delete of everything from 2019-onward and a fresh upload that maps the lot number to the serum id column? Then subsequent updates wouldn't change the index, right?

Yup, I can track down the date that we started using the new database dump and delete records in `cdc_tdb/flu` starting from that date. Then we can re-upload all records with the changed serum id.

> Surfacing `lot_number` in addition to `serum_id` sounds much easier technically, although this means adding a column that only the CDC will use and that `serum_id` remains meaningless for them.

Oh right! The `lot_number` column only exists in the CDC database and does not exist for the other CCs. That would mean we would have to special-case the columns to use in the `tdb/download` script based on the database. Hmm, I think your original proposal is better 🤔

I did a little more digging into the fauna/tdb code, and the parse step automatically assigns the `ferret_id` to the `serum_id`, so we may also want to remove the `ferret_id` from the column map.

joverlee521 · 2022-09-12T20:38:50Z

> Yup, I can track down the date that we started using the new database dump and delete records in `cdc_tdb/flu` starting from that date.

Never mind, I just realized the start date of the new database dump doesn't matter because the data contains tests from 2019. Like you said, we can delete records based on `assay_date`. The earliest `assay_date` included in the database dump is 2019-09-03.
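As a sketch of the selection logic only (not the actual database call), the records to delete could be picked like this. It assumes `assay_date` is stored as a complete `YYYY-MM-DD` string, which may not hold for every record; the helper name and sample records are made up.

```python
from datetime import date

# Earliest assay_date in the new CDC database dump, per the comment above.
CUTOFF = date(2019, 9, 3)


def split_by_assay_date(records, cutoff=CUTOFF):
    """Split records into (keep, delete); delete holds those on or after the cutoff.

    Assumes every record has a complete ISO-formatted assay_date.
    """
    keep, delete = [], []
    for record in records:
        assay_date = date.fromisoformat(record["assay_date"])
        (delete if assay_date >= cutoff else keep).append(record)
    return keep, delete


# Made-up records for illustration.
records = [
    {"index": "a", "assay_date": "2019-09-02"},
    {"index": "b", "assay_date": "2019-09-03"},
    {"index": "c", "assay_date": "2021-03-10"},
]

keep, delete = split_by_assay_date(records)
print([r["index"] for r in keep], [r["index"] for r in delete])
```

The comparison is inclusive, so measurements assayed on 2019-09-03 itself are deleted and re-uploaded along with everything later.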

huddlej added the bug label Sep 8, 2022

huddlej self-assigned this Nov 2, 2022

joverlee521 self-assigned this Nov 17, 2022

joverlee521 mentioned this issue Nov 18, 2022

Fix cdc uploads #131

Merged

joverlee521 mentioned this issue Oct 25, 2023

Revisit tdb/upload's index_fields #144

Open

joverlee521 closed this as completed in 7450be6 Oct 26, 2023
