Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Import additional columns from CR6 #290

Merged
merged 13 commits into from
Sep 24, 2021
Merged

Import additional columns from CR6 #290

merged 13 commits into from
Sep 24, 2021

Conversation

naity
Copy link
Contributor

@naity naity commented Sep 6, 2021

Closes #279.

@grst grst linked an issue Sep 6, 2021 that may be closed by this pull request
@grst
Copy link
Collaborator

grst commented Sep 7, 2021

Thanks for putting this together! LMK if you need any input!

@naity
Copy link
Contributor Author

naity commented Sep 9, 2021

Thanks for putting this together! LMK if you need any input!

The added columns caused some current tests to fail. Should we only include fwr/cdr1,2 columns for CR6 outputs or include them regardless and use nan as values for non-CR6 outputs? Thank you!

scirpy/io/_io.py Outdated
Comment on lines 182 to 183
fwrs = [f"fwr{i}" if i < 5 else f"fwr{i-4}_nt" for i in range(1, 9)]
cdrs = [f"cdr{i}" if i < 3 else f"cdr{i-2}_nt" for i in range(1, 5)]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code has very unclear semantics... Please either use separate loops for nt and aa or take a look at
itertools.product.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update, thanks for the suggestion!

@grst
Copy link
Collaborator

grst commented Sep 9, 2021

Should we only include fwr/cdr1,2 columns for CR6 outputs or include them regardless and use nan as values for non-CR6 outputs?

Please only include those columns if they are present in the CR output.

@naity
Copy link
Contributor Author

naity commented Sep 12, 2021

Should we only include fwr/cdr1,2 columns for CR6 outputs or include them regardless and use nan as values for non-CR6 outputs?

Please only include those columns if they are present in the CR output.

Sounds good, thanks! Please let me know if there are additional changes that need to be made.

Copy link
Collaborator

@grst grst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noticed a small thing about the field names!

And thanks a lot for adding tests :)

scirpy/io/_io.py Outdated
Comment on lines 145 to 146
chain[col] = cell[col].get("aa_seq") if cell[col] else None
chain[col + "_nt"] = cell[col].get("nt_seq") if cell[col] else None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the rearrangement standard

the fields are cdr1 for nt and cdr1_aa for aa

@grst
Copy link
Collaborator

grst commented Sep 13, 2021

One more thing I was thinking about:
The CDR3 column of cellranger is called junction according to the AIRR standard because it contains the two conserved C and W/F residues. Now that there is also CDR1 and 2 it might be confusing that CDR3 is missing.

Do you think it might make sense to automatically add also a CDR3 column that is automatically generated as

chain['cdr3_aa'] = chain['junction_aa'][1:-1]
chain['cdr3'] = chain['junction'][3:-3]

?

ping @zktuong, (How) did you deal with that in dandelion?

@zktuong
Copy link
Contributor

zktuong commented Sep 13, 2021

One more thing I was thinking about:
The CDR3 column of cellranger is called junction according to the AIRR standard because it contains the two conserved C and W/F residues. Now that there is also CDR1 and 2 it might be confusing that CDR3 is missing.

Do you think it might make sense to automatically add also a CDR3 column that is automatically generated as

chain['cdr3_aa'] = chain['junction_aa'][1:-1]
chain['cdr3'] = chain['junction'][3:-3]

?

ping @zktuong, (How) did you deal with that in dandelion?

Hmm i don't parse that - i just use the junction column. Your suggestion sounds sensible and clean though.

@grst
Copy link
Collaborator

grst commented Sep 13, 2021

@zktuong, thanks for your quick response! In any case, having an additional cdr3 column should not interfere with dandelion when running to_dandelion?

@naity, do you want to give it a go to additionally add the cdr3 column as described above?

@zktuong
Copy link
Contributor

zktuong commented Sep 13, 2021

yup should be no problem

@naity
Copy link
Contributor Author

naity commented Sep 13, 2021

@zktuong, thanks for your quick response! In any case, having an additional cdr3 column should not interfere with dandelion when running to_dandelion?

@naity, do you want to give it a go to additionally add the cdr3 column as described above?

Sure, I fixed the column names and added trimmed cdr3 as suggested. But the added cdr3 and cdr3aa seemed to cause the following test to fail:

    def test_convert_dandelion(anndata_from_10x_sample):
        """Test dandelion round-trip conversion"""
        anndata = anndata_from_10x_sample
        ddl = to_dandelion(anndata)
        anndata2 = from_dandelion(ddl)
    
        ir_objs1 = to_airr_cells(anndata)
        ir_objs2 = to_airr_cells(anndata2)
    
        assert len(ir_objs1) == len(ir_objs2) == anndata.shape[0]
    
        # Frame-level comparison is not possible as the "extra chains" field is different
        # due to 'sequence_id' being different.
        for ir_obj1, ir_obj2 in zip(ir_objs1, ir_objs2):
            assert len(ir_obj1.chains) == len(ir_obj2.chains)
            for tmp_chain1, tmp_chain2 in zip(ir_obj1.chains, ir_obj2.chains):
                # this field is expected to be different
                del tmp_chain1["sequence_id"]
                del tmp_chain2["sequence_id"]
>               assert tmp_chain1 == tmp_chain2
E               AssertionError: assert {'c_call': 'IGKC',\n 'cdr3': 'CAGCAGTATGGTAGCTCACTTACGTGGACG',\n 'cdr3_aa': 'QQYGSSLTWT',\n 'consensus_count': 5260.0,\n 'd_call': None,\n 'd_cigar': None,\n 'duplicate_count': 41.0,\n 'germline_alignment': None,\n 'j_call': 'IGKJ1',\n 'j_cigar': None,\n 'junction': 'TGTCAGCAGTATGGTAGCTCACTTACGTGGACGTTC',\n 'junction_aa': 'CQQYGSSLTWTF',\n 'locus': 'IGK',\n 'productive': 'True',\n 'rev_comp': None,\n 'sequence': None,\n 'sequence_alignment': None,\n 'v_call': 'IGKV3-20',\n 'v_cigar': None} == {'c_call': 'IGKC',\n 'consensus_count': 5260.0,\n 'd_call': None,\n 'd_cigar': None,\n 'duplicate_count': 41.0,\n 'germline_alignment': None,\n 'j_call': 'IGKJ1',\n 'j_cigar': None,\n 'junction': 'TGTCAGCAGTATGGTAGCTCACTTACGTGGACGTTC',\n 'junction_aa': 'CQQYGSSLTWTF',\n 'locus': 'IGK',\n 'productive': 'True',\n 'rev_comp': None,\n 'sequence': None,\n 'sequence_alignment': None,\n 'v_call': 'IGKV3-20',\n 'v_cigar': None}
E                 Common items:
E                 {'c_call': 'IGKC',
E                  'consensus_count': 5260.0,
E                  'd_call': None,
E                  'd_cigar': None,
E                  'duplicate_count': 41.0,
E                  'germline_alignment': None,
E                  'j_call': 'IGKJ1',
E                  'j_cigar': None,
E                  'junction': 'TGTCAGCAGTATGGTAGCTCACTTACGTGGACGTTC',
E                  'junction_aa': 'CQQYGSSLTWTF',
E                  'locus': 'IGK',
E                  'productive': 'True',
E                  'rev_comp': None,
E                  'sequence': None,
E                  'sequence_alignment': None,
E                  'v_call': 'IGKV3-20',
E                  'v_cigar': None}
E                 Left contains 2 more items:
E                 {'cdr3': 'CAGCAGTATGGTAGCTCACTTACGTGGACG', 'cdr3_aa': 'QQYGSSLTWT'}
E                 Full diff:
E                   {
E                    'c_call': 'IGKC',
E                 +  'cdr3': 'CAGCAGTATGGTAGCTCACTTACGTGGACG',
E                 +  'cdr3_aa': 'QQYGSSLTWT',
E                    'consensus_count': 5260.0,
E                    'd_call': None,
E                    'd_cigar': None,
E                    'duplicate_count': 41.0,
E                    'germline_alignment': None,
E                    'j_call': 'IGKJ1',
E                    'j_cigar': None,
E                    'junction': 'TGTCAGCAGTATGGTAGCTCACTTACGTGGACGTTC',
E                    'junction_aa': 'CQQYGSSLTWTF',
E                    'locus': 'IGK',
E                    'productive': 'True',
E                    'rev_comp': None,
E                    'sequence': None,
E                    'sequence_alignment': None,
E                    'v_call': 'IGKV3-20',
E                    'v_cigar': None,
E                   }

scirpy/tests/test_io.py:238: AssertionError

@grst
Copy link
Collaborator

grst commented Sep 14, 2021

that looks alright, I think you can just update the test cases.

@naity
Copy link
Contributor Author

naity commented Sep 14, 2021

that looks alright, I think you can just update the test cases.

shall we add cdr3 and cdr3_aa by default or only if cdr1 and cdr2 exist? If the former, test_workflow would need to be modified as well.

@grst
Copy link
Collaborator

grst commented Sep 15, 2021

I would say always.

But, maybe then we should implement an include_fields like in read_airr?
https://github.com/icbi-lab/scirpy/blob/2f02f16aa4e03b99c814515bcac70cbf7c251f08/scirpy/io/_io.py#L340

The behaviour would be that when converting the chain dicts to the anndata object, per default, only the columns defined here
https://github.com/icbi-lab/scirpy/blob/2f02f16aa4e03b99c814515bcac70cbf7c251f08/scirpy/io/_io.py#L36-L47

would get imported. If the user wants all attributes (including cdr1, 2, 3), they can set include_fields=None.

The idea is to avoid cluttering anndata with too many columns that are not needed for the most common use-cases.
What do you think?

That might be too much to ask from you though - I can always take over this PR if you prefer.

@naity
Copy link
Contributor Author

naity commented Sep 16, 2021

I would say always.

But, maybe then we should implement an include_fields like in read_airr?
https://github.com/icbi-lab/scirpy/blob/2f02f16aa4e03b99c814515bcac70cbf7c251f08/scirpy/io/_io.py#L340

The behaviour would be that when converting the chain dicts to the anndata object, per default, only the columns defined here
https://github.com/icbi-lab/scirpy/blob/2f02f16aa4e03b99c814515bcac70cbf7c251f08/scirpy/io/_io.py#L36-L47

would get imported. If the user wants all attributes (including cdr1, 2, 3), they can set include_fields=None.

The idea is to avoid cluttering anndata with too many columns that are not needed for the most common use-cases.
What do you think?

That might be too much to ask from you though - I can always take over this PR if you prefer.

Yes, that's a great idea. I added an include_fields parameter to 10x functions but it seems to cause several tests to fail including test_ir_objs_roundtrip_conversion. I'll try to sort them out and please let me know if you have any suggestions.

Additionally, should we verify that chain['junction_aa'] starts with "C" and ends with "W/F" like before trimming?

@grst
Copy link
Collaborator

grst commented Sep 16, 2021

Additionally, should we verify that chain['junction_aa'] starts with "C" and ends with "W/F" like before trimming?

Not sure... In the example file I looked at, indeed all sequences start with C, but not all of them finish with [WF]. Here are a few:

CYSHSPTSMWVS
CARRGAGTTLCG
CTDLLRLAYTAWG
CATRLTRHLVRLTT
CQQYYSTPPI
CARWITGTPSL
CARDLKDIVLMVYASRLGGTLTT

Maybe @zktuong can help us again? Would it make sense to trim these sequences to [1:-1] as well? Or just set CDR3 to None?

@zktuong
Copy link
Contributor

zktuong commented Sep 16, 2021

Additionally, should we verify that chain['junction_aa'] starts with "C" and ends with "W/F" like before trimming?

Not sure... In the example file I looked at, indeed all sequences start with C, but not all of them finish with [WF]. Here are a few:

CYSHSPTSMWVS
CARRGAGTTLCG
CTDLLRLAYTAWG
CATRLTRHLVRLTT
CQQYYSTPPI
CARWITGTPSL
CARDLKDIVLMVYASRLGGTLTT

Maybe @zktuong can help us again? Would it make sense to trim these sequences to [1:-1] as well? Or just set CDR3 to None?

Are those sequences flagged as out of frame, not full V -> J, contains stop codon or not productive?

@grst
Copy link
Collaborator

grst commented Sep 16, 2021

Are those sequences flagged as out of frame, not full V -> J, contains stop codon or not productive?

they are flagged as non-productive. So usually they get filtered out anyway.

@zktuong
Copy link
Contributor

zktuong commented Sep 16, 2021

Are those sequences flagged as out of frame, not full V -> J, contains stop codon or not productive?

they are flagged as non-productive. So usually they get filtered out anyway.

yea then skipping trimming and switching to None/"" should be fine.

@grst
Copy link
Collaborator

grst commented Sep 16, 2021

ok, perfect. None in that case!

@naity
Copy link
Contributor Author

naity commented Sep 17, 2021

HI @grst, I am having trouble with this error, could you please help take a look?

___________________________________________________________________ test_merge_airr_chains_identity ___________________________________________________________________

    def test_merge_airr_chains_identity():
        """Test that mergeing adata with itself results in the identity"""
        adata1 = read_10x_vdj(TESTDATA / "10x/filtered_contig_annotations.csv")
        adata2 = adata1.copy()
        obs_expected = adata1.obs.copy()
>       merge_airr_chains(adata1, adata2)

scirpy/tests/test_preprocessing.py:21: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
scirpy/io/_util.py:67: in check_wrapper
    return f(*args, **kwargs)
scirpy/_preprocessing/_merge_adata.py:47: in merge_airr_chains
    ir_objs1 = to_airr_cells(adata)
scirpy/io/_util.py:67: in check_wrapper
    return f(*args, **kwargs)
scirpy/io/_convert_anndata.py:119: in to_airr_cells
    tmp_ir_cell[col] = row[col]
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = AirrCell AAACCTGGTCCGTTAA-1 with 0 chains, k = 'multi_chain', v = True

    def __setitem__(self, k, v) -> None:
        if k == "multi_chain":
            v = _is_true2(v)
        try:
            existing_value = self._cell_attrs[k]
            if existing_value != v and not _is_na2(existing_value):
>               raise ValueError(
                    "Cell-level attributes differ between different chains. "
                    f"Already present: `{existing_value}`. Tried to add `{v}`."
                )
E               ValueError: Cell-level attributes differ between different chains. Already present: `False`. Tried to add `True`.

scirpy/io/_datastructures.py:113: ValueError

It might result from

        if barcode not in airr_cells:
            ir_obj = AirrCell(
                barcode,
                logger=logger,
                cell_attribute_fields=["is_cell", "high_confidence"],
            )
            airr_cells[barcode] = ir_obj
        else:
            ir_obj = airr_cells[barcode]

as "multi_chain" is not included in cell_attribute_fields by default in contrast to read_airr.

@grst
Copy link
Collaborator

grst commented Sep 17, 2021

multi_chain should definitely be a cell-level attribute. If that doesn't help, I need to take a closer look next week!

@naity
Copy link
Contributor Author

naity commented Sep 17, 2021

multi_chain should definitely be a cell-level attribute. If that doesn't help, I need to take a closer look next week!

Sounds good, it would be great if you could take a look at it. Thanks!

@grst
Copy link
Collaborator

grst commented Sep 20, 2021

@naity, just to be sure: Did you push all your changes? I'll hopefully find some time later today to debug this.

@naity
Copy link
Contributor Author

naity commented Sep 20, 2021

@naity, just to be sure: Did you push all your changes? I'll hopefully find some time later today to debug this.

Yes, I have.

1. When include_fields=None, the multi_chain value has not been
   transferred properly to adata.obs, as it had been overwritten.
   Rearranged the order of statements to fix it. This prevented
   bug 2 to surface.
2. Empty Airr cells got initialized with multi_chain=False. This
   led to a conflict due to 'inconsistent cell-level attributes'
   as soon as chains got added, even if all added chains correctly
   had multi_chain=True. Now set it to None initially, as None
   gets ignored when merging cell-level attributes.
@grst
Copy link
Collaborator

grst commented Sep 23, 2021

Will take a final look tomorrow and then merge.

@grst grst merged commit 113ee73 into scverse:master Sep 24, 2021
@grst
Copy link
Collaborator

grst commented Sep 24, 2021

Thanks again @naity and @zktuong!

@naity
Copy link
Contributor Author

naity commented Sep 24, 2021

Thanks again @naity and @zktuong!

Awesome, thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Compatibility with Cell Ranger 6
3 participants