serum_passage_category should be set to "egg" instead of "cell" for CDC human pool data like "L21/22 H3-EGG HUMAN POOL" #129

Closed
huddlej opened this issue Oct 26, 2022 · 7 comments
Assignees
Labels
bug Something isn't working

Comments

@huddlej
Contributor

huddlej commented Oct 26, 2022

Current Behavior

Human pool titers represent measurements for people vaccinated with either cell-passaged or egg-passaged vaccine strains. Data from the CDC encode this passage status in the serum id with names like L21/22 H3-EGG HUMAN POOL. However, egg-passaged data currently appear in the cell-passaged downloads from fauna. For example, the following command lists egg-passaged records for H3N2 in the cell-passaged file:

grep H3-EGG data/h3n2/who_cell_fra_titers.tsv

Expected behavior

These egg-passaged data should only appear in the corresponding egg-passaged titer file (e.g., data/h3n2/who_egg_fra_titers.tsv for the example above). The serum_passage_category of these records should be set to egg instead of cell.
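
A rough way to check this expectation from Python rather than grep is sketched below. It assumes the titer files are plain tab-separated text and that the egg marker appears literally (e.g., H3-EGG) somewhere on the line; find_egg_pool_lines is a hypothetical helper for illustration, not part of fauna.

```python
def find_egg_pool_lines(lines, marker="H3-EGG"):
    """Return the lines (without trailing newline) that contain the egg marker.

    `lines` can be any iterable of strings, such as an open file handle.
    The default marker is an assumption taken from the H3N2 example above.
    """
    return [line.rstrip("\n") for line in lines if marker in line]
```

To use it, open data/h3n2/who_cell_fra_titers.tsv and pass the file handle to find_egg_pool_lines. Once the bug is fixed, the result for the cell-passaged file should be an empty list.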

Possible solution

We may need to check each measurement's serum id for the appearance of "egg" and override the inferred serum passage status based on what we find. For example, similar logic already exists to set the "host" for each measurement based on its serum id. There might be a cleaner fauna-style way to implement this check, though.

@huddlej huddlej added the bug Something isn't working label Oct 26, 2022
@huddlej
Contributor Author

huddlej commented Oct 26, 2022

@joverlee521 Maybe we can work on this together? It seems like a good opportunity for me to learn more about fauna's internal workings...

@joverlee521
Contributor

Here's the current parsing of the serum passage category for CDC titers:

  1. The original sr_passage column in the CDC TSV is mapped to serum_antigen_passage.
  2. Within tdb/cdc_upload, the serum_antigen_passage column is used to infer serum_passage_category.
  3. The format_passage method is inherited from vdb/flu_upload, which uses a series of regexes to parse the passage category.

We can special-case the human pool titers and use the lot_number to format the serum_passage_category. (lot_number is the column that contains names like 21/22 H3-EGG HUMAN POOL, because the serum_id formatting happens after the serum passage formatting.)
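
As a minimal sketch of that special case, assuming each measurement is a dict with a lot_number key and that pool names follow the 21/22 H3-EGG HUMAN POOL pattern: the helper names and regexes below are hypothetical and are not fauna's actual API.

```python
import re

# Assumption: human pool lot numbers contain the words "HUMAN POOL" and an
# explicit EGG or CELL label, as in "21/22 H3-EGG HUMAN POOL".
HUMAN_POOL = re.compile(r"human\s+pool", re.IGNORECASE)
PASSAGE_LABEL = re.compile(r"\b(egg|cell)\b", re.IGNORECASE)


def human_pool_passage_category(lot_number):
    """Return 'egg' or 'cell' for a human pool lot number, otherwise None."""
    if not lot_number or not HUMAN_POOL.search(lot_number):
        return None
    match = PASSAGE_LABEL.search(lot_number)
    return match.group(1).lower() if match else None


def apply_human_pool_override(meas):
    """Overwrite serum_passage_category in place when the lot number is explicit.

    Measurements that are not human pools, or that carry no explicit label,
    keep whatever category was inferred earlier.
    """
    category = human_pool_passage_category(meas.get("lot_number"))
    if category is not None:
        meas["serum_passage_category"] = category
    return meas
```

Returning None for anything that is not a labeled human pool leaves the regex-based inference untouched for all other sera.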

@huddlej
Contributor Author

huddlej commented Oct 28, 2022

Thank you for laying out the steps so clearly, @joverlee521! Special casing the human pool titers sounds reasonable. Would that logic live in the format_passage function?

@joverlee521
Contributor

> Special casing the human pool titers sounds reasonable. Would that logic live in the format_passage function?

Hmm, I'm a little hesitant to make format_passage any more complicated 😅
Maybe we can just keep all the human pool specific logic in one place within tdb/cdc_upload:

diff --git a/tdb/cdc_upload.py b/tdb/cdc_upload.py
index 3a007c2..7aa6b3d 100644
--- a/tdb/cdc_upload.py
+++ b/tdb/cdc_upload.py
@@ -72,6 +72,7 @@ class cdc_upload(upload):
                 self.test_virus_strains.add(meas['virus_strain'])
             if "Human" in meas['serum_id']:
                 meas['serum_host'] = 'human'
+                self.format_passage(meas, 'serum_id', 'serum_passage_category')
             self.rethink_io.check_optional_attributes(meas, self.optional_fields)
             self.remove_fields(meas)
         if len(self.new_different_date_format) > 0:

@huddlej
Contributor Author

huddlej commented Nov 1, 2022

I know what you mean! That function is among the hairier I've seen in this repo. If we start getting human data from other CCs, though, would you want to encode the human-specific parsing in each respective upload script? Or just refactor any shared parsing logic into a new function when we need to?

@joverlee521
Contributor

Yup, I would want to keep the human-specific parsing in each respective upload script because I'm expecting each CC to provide the data in a different format. If any parsing logic can be shared, we can refactor it into a new function.

@huddlej
Contributor Author

huddlej commented Nov 1, 2022

Sounds good to me!

@joverlee521 joverlee521 mentioned this issue Nov 18, 2022
3 tasks