Import additional columns from CR6 #290

naity · 2021-09-06T02:37:41Z

Closes #279.

grst · 2021-09-07T07:08:04Z

Thanks for putting this together! LMK if you need any input!

naity · 2021-09-09T00:39:31Z

Thanks for putting this together! LMK if you need any input!

The added columns caused some current tests to fail. Should we only include fwr/cdr1,2 columns for CR6 outputs or include them regardless and use nan as values for non-CR6 outputs? Thank you!

grst · 2021-09-09T07:52:13Z

scirpy/io/_io.py

+            fwrs = [f"fwr{i}" if i < 5 else f"fwr{i-4}_nt" for i in range(1, 9)]
+            cdrs = [f"cdr{i}" if i < 3 else f"cdr{i-2}_nt" for i in range(1, 5)]


This code has very unclear semantics... Please either use separate loops for nt and aa or take a look at
itertools.product.

Update, thanks for the suggestion!

grst · 2021-09-09T07:58:29Z

Should we only include fwr/cdr1,2 columns for CR6 outputs or include them regardless and use nan as values for non-CR6 outputs?

Please only include those columns if they are present in the CR output.

naity · 2021-09-12T23:56:36Z

Should we only include fwr/cdr1,2 columns for CR6 outputs or include them regardless and use nan as values for non-CR6 outputs?

Please only include those columns if they are present in the CR output.

Sounds good, thanks! Please let me know if there are additional changes that need to be made.

grst

Just noticed a small thing about the field names!

And thanks a lot for adding tests :)

grst · 2021-09-13T08:57:24Z

scirpy/io/_io.py

+                chain[col] = cell[col].get("aa_seq") if cell[col] else None
+                chain[col + "_nt"] = cell[col].get("nt_seq") if cell[col] else None


According to the rearrangement standard

the fields are cdr1 for nt and cdr1_aa for aa

grst · 2021-09-13T09:06:58Z

One more thing I was thinking about:
The CDR3 column of cellranger is called junction according to the AIRR standard because it contains the two conserved C and W/F residues. Now that there is also CDR1 and 2 it might be confusing that CDR3 is missing.

Do you think it might make sense to automatically add also a CDR3 column that is automatically generated as

chain['cdr3_aa'] = chain['junction_aa'][1:-1]
chain['cdr3'] = chain['junction'][3:-3]

?

ping @zktuong, (How) did you deal with that in dandelion?

zktuong · 2021-09-13T09:19:00Z

One more thing I was thinking about:
The CDR3 column of cellranger is called junction according to the AIRR standard because it contains the two conserved C and W/F residues. Now that there is also CDR1 and 2 it might be confusing that CDR3 is missing.

Do you think it might make sense to automatically add also a CDR3 column that is automatically generated as
chain['cdr3_aa'] = chain['junction_aa'][1:-1]
chain['cdr3'] = chain['junction'][3:-3]
?

ping @zktuong, (How) did you deal with that in dandelion?

Hmm i don't parse that - i just use the junction column. Your suggestion sounds sensible and clean though.

grst · 2021-09-13T09:38:09Z

@zktuong, thanks for your quick response! In any case, having an additional cdr3 column should not interfere with dandelion when running to_dandelion?

@naity, do you want to give it a go to additionally add the cdr3 column as described above?

zktuong · 2021-09-13T10:15:09Z

yup should be no problem

naity · 2021-09-13T15:24:30Z

@zktuong, thanks for your quick response! In any case, having an additional cdr3 column should not interfere with dandelion when running to_dandelion?

@naity, do you want to give it a go to additionally add the cdr3 column as described above?

Sure, I fixed the column names and added trimmed cdr3 as suggested. But the added cdr3 and cdr3aa seemed to cause the following test to fail:

    def test_convert_dandelion(anndata_from_10x_sample):
        """Test dandelion round-trip conversion"""
        anndata = anndata_from_10x_sample
        ddl = to_dandelion(anndata)
        anndata2 = from_dandelion(ddl)
    
        ir_objs1 = to_airr_cells(anndata)
        ir_objs2 = to_airr_cells(anndata2)
    
        assert len(ir_objs1) == len(ir_objs2) == anndata.shape[0]
    
        # Frame-level comparison is not possible as the "extra chains" field is different
        # due to 'sequence_id' being different.
        for ir_obj1, ir_obj2 in zip(ir_objs1, ir_objs2):
            assert len(ir_obj1.chains) == len(ir_obj2.chains)
            for tmp_chain1, tmp_chain2 in zip(ir_obj1.chains, ir_obj2.chains):
                # this field is expected to be different
                del tmp_chain1["sequence_id"]
                del tmp_chain2["sequence_id"]
>               assert tmp_chain1 == tmp_chain2
E               AssertionError: assert {'c_call': 'IGKC',\n 'cdr3': 'CAGCAGTATGGTAGCTCACTTACGTGGACG',\n 'cdr3_aa': 'QQYGSSLTWT',\n 'consensus_count': 5260.0,\n 'd_call': None,\n 'd_cigar': None,\n 'duplicate_count': 41.0,\n 'germline_alignment': None,\n 'j_call': 'IGKJ1',\n 'j_cigar': None,\n 'junction': 'TGTCAGCAGTATGGTAGCTCACTTACGTGGACGTTC',\n 'junction_aa': 'CQQYGSSLTWTF',\n 'locus': 'IGK',\n 'productive': 'True',\n 'rev_comp': None,\n 'sequence': None,\n 'sequence_alignment': None,\n 'v_call': 'IGKV3-20',\n 'v_cigar': None} == {'c_call': 'IGKC',\n 'consensus_count': 5260.0,\n 'd_call': None,\n 'd_cigar': None,\n 'duplicate_count': 41.0,\n 'germline_alignment': None,\n 'j_call': 'IGKJ1',\n 'j_cigar': None,\n 'junction': 'TGTCAGCAGTATGGTAGCTCACTTACGTGGACGTTC',\n 'junction_aa': 'CQQYGSSLTWTF',\n 'locus': 'IGK',\n 'productive': 'True',\n 'rev_comp': None,\n 'sequence': None,\n 'sequence_alignment': None,\n 'v_call': 'IGKV3-20',\n 'v_cigar': None}
E                 Common items:
E                 {'c_call': 'IGKC',
E                  'consensus_count': 5260.0,
E                  'd_call': None,
E                  'd_cigar': None,
E                  'duplicate_count': 41.0,
E                  'germline_alignment': None,
E                  'j_call': 'IGKJ1',
E                  'j_cigar': None,
E                  'junction': 'TGTCAGCAGTATGGTAGCTCACTTACGTGGACGTTC',
E                  'junction_aa': 'CQQYGSSLTWTF',
E                  'locus': 'IGK',
E                  'productive': 'True',
E                  'rev_comp': None,
E                  'sequence': None,
E                  'sequence_alignment': None,
E                  'v_call': 'IGKV3-20',
E                  'v_cigar': None}
E                 Left contains 2 more items:
E                 {'cdr3': 'CAGCAGTATGGTAGCTCACTTACGTGGACG', 'cdr3_aa': 'QQYGSSLTWT'}
E                 Full diff:
E                   {
E                    'c_call': 'IGKC',
E                 +  'cdr3': 'CAGCAGTATGGTAGCTCACTTACGTGGACG',
E                 +  'cdr3_aa': 'QQYGSSLTWT',
E                    'consensus_count': 5260.0,
E                    'd_call': None,
E                    'd_cigar': None,
E                    'duplicate_count': 41.0,
E                    'germline_alignment': None,
E                    'j_call': 'IGKJ1',
E                    'j_cigar': None,
E                    'junction': 'TGTCAGCAGTATGGTAGCTCACTTACGTGGACGTTC',
E                    'junction_aa': 'CQQYGSSLTWTF',
E                    'locus': 'IGK',
E                    'productive': 'True',
E                    'rev_comp': None,
E                    'sequence': None,
E                    'sequence_alignment': None,
E                    'v_call': 'IGKV3-20',
E                    'v_cigar': None,
E                   }

scirpy/tests/test_io.py:238: AssertionError

grst · 2021-09-14T07:38:12Z

that looks alright, I think you can just update the test cases.

naity · 2021-09-14T14:49:21Z

that looks alright, I think you can just update the test cases.

shall we add cdr3 and cdr3_aa by default or only if cdr1 and cdr2 exist? If the former, test_workflow would need to be modified as well.

grst · 2021-09-15T06:51:09Z

I would say always.

But, maybe then we should implement an include_fields like in read_airr?
https://github.com/icbi-lab/scirpy/blob/2f02f16aa4e03b99c814515bcac70cbf7c251f08/scirpy/io/_io.py#L340

The behaviour would be that when converting the chain dicts to the anndata object, per default, only the columns defined here
https://github.com/icbi-lab/scirpy/blob/2f02f16aa4e03b99c814515bcac70cbf7c251f08/scirpy/io/_io.py#L36-L47

would get imported. If the user wants all attributes (including cdr1, 2, 3), they can set include_fields=None.

The idea is to avoid cluttering anndata with too many columns that are not needed for the most common use-cases.
What do you think?

That might be too much to ask from you though - I can always take over this PR if you prefer.

naity · 2021-09-16T01:17:57Z

I would say always.

But, maybe then we should implement an include_fields like in read_airr?
https://github.com/icbi-lab/scirpy/blob/2f02f16aa4e03b99c814515bcac70cbf7c251f08/scirpy/io/_io.py#L340

The behaviour would be that when converting the chain dicts to the anndata object, per default, only the columns defined here
https://github.com/icbi-lab/scirpy/blob/2f02f16aa4e03b99c814515bcac70cbf7c251f08/scirpy/io/_io.py#L36-L47

would get imported. If the user wants all attributes (including cdr1, 2, 3), they can set include_fields=None.

The idea is to avoid cluttering anndata with too many columns that are not needed for the most common use-cases.
What do you think?

That might be too much to ask from you though - I can always take over this PR if you prefer.

Yes, that's a great idea. I added an include_fields parameter to 10x functions but it seems to cause several tests to fail including test_ir_objs_roundtrip_conversion. I'll try to sort them out and please let me know if you have any suggestions.

Additionally, should we verify that chain['junction_aa'] starts with "C" and ends with "W/F" like before trimming?

grst · 2021-09-16T06:55:55Z

Additionally, should we verify that chain['junction_aa'] starts with "C" and ends with "W/F" like before trimming?

Not sure... In the example file I looked at, indeed all sequences start with C, but not all of them finish with [WF]. Here are a few:

CYSHSPTSMWVS
CARRGAGTTLCG
CTDLLRLAYTAWG
CATRLTRHLVRLTT
CQQYYSTPPI
CARWITGTPSL
CARDLKDIVLMVYASRLGGTLTT

Maybe @zktuong can help us again? Would it make sense to trim these sequences to [1:-1] as well? Or just set CDR3 to None?

zktuong · 2021-09-16T08:31:55Z

Additionally, should we verify that chain['junction_aa'] starts with "C" and ends with "W/F" like before trimming?

Not sure... In the example file I looked at, indeed all sequences start with C, but not all of them finish with [WF]. Here are a few:
CYSHSPTSMWVS
CARRGAGTTLCG
CTDLLRLAYTAWG
CATRLTRHLVRLTT
CQQYYSTPPI
CARWITGTPSL
CARDLKDIVLMVYASRLGGTLTT
Maybe @zktuong can help us again? Would it make sense to trim these sequences to [1:-1] as well? Or just set CDR3 to None?

Are those sequences flagged as out of frame, not full V -> J, contains stop codon or not productive?

grst · 2021-09-16T08:37:01Z

Are those sequences flagged as out of frame, not full V -> J, contains stop codon or not productive?

they are flagged as non-productive. So usually they get filtered out anyway.

zktuong · 2021-09-16T08:41:56Z

Are those sequences flagged as out of frame, not full V -> J, contains stop codon or not productive?

they are flagged as non-productive. So usually they get filtered out anyway.

yea then skipping trimming and switching to None/"" should be fine.

grst · 2021-09-16T08:43:26Z

ok, perfect. None in that case!

naity · 2021-09-17T16:03:24Z

HI @grst, I am having trouble with this error, could you please help take a look?

___________________________________________________________________ test_merge_airr_chains_identity ___________________________________________________________________

    def test_merge_airr_chains_identity():
        """Test that mergeing adata with itself results in the identity"""
        adata1 = read_10x_vdj(TESTDATA / "10x/filtered_contig_annotations.csv")
        adata2 = adata1.copy()
        obs_expected = adata1.obs.copy()
>       merge_airr_chains(adata1, adata2)

scirpy/tests/test_preprocessing.py:21: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
scirpy/io/_util.py:67: in check_wrapper
    return f(*args, **kwargs)
scirpy/_preprocessing/_merge_adata.py:47: in merge_airr_chains
    ir_objs1 = to_airr_cells(adata)
scirpy/io/_util.py:67: in check_wrapper
    return f(*args, **kwargs)
scirpy/io/_convert_anndata.py:119: in to_airr_cells
    tmp_ir_cell[col] = row[col]
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = AirrCell AAACCTGGTCCGTTAA-1 with 0 chains, k = 'multi_chain', v = True

    def __setitem__(self, k, v) -> None:
        if k == "multi_chain":
            v = _is_true2(v)
        try:
            existing_value = self._cell_attrs[k]
            if existing_value != v and not _is_na2(existing_value):
>               raise ValueError(
                    "Cell-level attributes differ between different chains. "
                    f"Already present: `{existing_value}`. Tried to add `{v}`."
                )
E               ValueError: Cell-level attributes differ between different chains. Already present: `False`. Tried to add `True`.

scirpy/io/_datastructures.py:113: ValueError

It might result from

        if barcode not in airr_cells:
            ir_obj = AirrCell(
                barcode,
                logger=logger,
                cell_attribute_fields=["is_cell", "high_confidence"],
            )
            airr_cells[barcode] = ir_obj
        else:
            ir_obj = airr_cells[barcode]

as "multi_chain" is not included in cell_attribute_fields by default in contrast to read_airr.

grst · 2021-09-17T16:12:29Z

multi_chain should definitely be a cell-level attribute. If that doesn't help, I need to take a closer look next week!

naity · 2021-09-17T17:05:31Z

multi_chain should definitely be a cell-level attribute. If that doesn't help, I need to take a closer look next week!

Sounds good, it would be great if you could take a look at it. Thanks!

grst · 2021-09-20T06:48:07Z

@naity, just to be sure: Did you push all your changes? I'll hopefully find some time later today to debug this.

naity · 2021-09-20T16:21:14Z

@naity, just to be sure: Did you push all your changes? I'll hopefully find some time later today to debug this.

Yes, I have.

1. When include_fields=None, the multi_chain value has not been transferred properly to adata.obs, as it had been overwritten. Rearranged the order of statements to fix it. This prevented bug 2 to surface. 2. Empty Airr cells got initialized with multi_chain=False. This led to a conflict due to 'inconsistent cell-level attributes' as soon as chains got added, even if all added chains correctly had multi_chain=True. Now set it to None initially, as None gets ignored when merging cell-level attributes.

grst · 2021-09-23T17:02:05Z

Will take a final look tomorrow and then merge.

grst · 2021-09-24T06:51:06Z

Thanks again @naity and @zktuong!

naity · 2021-09-24T14:20:04Z

Thanks again @naity and @zktuong!

Awesome, thank you!

naity added 5 commits September 5, 2021 13:57

import addtional CR6 columns

c768265

import additional CR6 columns

33e7fb7

import additional CR6 columns

eaadcc6

import additional CR6 columns

c243f03

import additional CR6 columns

789f5c3

grst linked an issue Sep 6, 2021 that may be closed by this pull request

Compatibility with Cell Ranger 6 #279

Closed

grst requested changes Sep 9, 2021

View reviewed changes

naity added 2 commits September 12, 2021 16:30

09122021

bc2fdcd

09122021

d1e7607

grst requested changes Sep 13, 2021

View reviewed changes

09172021

0fcbe98

grst added 5 commits September 23, 2021 09:52

Merge branch 'fix-airrcell' into import_cr6

425ff88

Fix that empty columns were added during merge_anndata

318e97b

Move cdr3_from_junction into separate function

eeffb69

Fix typo in docstring

c23d58a

grst approved these changes Sep 23, 2021

View reviewed changes

grst merged commit 113ee73 into scverse:master Sep 24, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Import additional columns from CR6 #290

Import additional columns from CR6 #290

naity commented Sep 6, 2021 •

edited by grst

Loading

grst commented Sep 7, 2021

naity commented Sep 9, 2021

grst Sep 9, 2021

naity Sep 12, 2021

grst commented Sep 9, 2021

naity commented Sep 12, 2021 •

edited

Loading

grst left a comment

grst Sep 13, 2021

grst commented Sep 13, 2021

zktuong commented Sep 13, 2021

grst commented Sep 13, 2021

zktuong commented Sep 13, 2021

naity commented Sep 13, 2021

grst commented Sep 14, 2021

naity commented Sep 14, 2021

grst commented Sep 15, 2021

naity commented Sep 16, 2021

grst commented Sep 16, 2021

zktuong commented Sep 16, 2021

grst commented Sep 16, 2021

zktuong commented Sep 16, 2021

grst commented Sep 16, 2021

naity commented Sep 17, 2021 •

edited

Loading

grst commented Sep 17, 2021

naity commented Sep 17, 2021

grst commented Sep 20, 2021

naity commented Sep 20, 2021

grst commented Sep 23, 2021

grst commented Sep 24, 2021

naity commented Sep 24, 2021

		fwrs = [f"fwr{i}" if i < 5 else f"fwr{i-4}_nt" for i in range(1, 9)]
		cdrs = [f"cdr{i}" if i < 3 else f"cdr{i-2}_nt" for i in range(1, 5)]

		chain[col] = cell[col].get("aa_seq") if cell[col] else None
		chain[col + "_nt"] = cell[col].get("nt_seq") if cell[col] else None

Import additional columns from CR6 #290

Import additional columns from CR6 #290

Conversation

naity commented Sep 6, 2021 • edited by grst Loading

grst commented Sep 7, 2021

naity commented Sep 9, 2021

grst Sep 9, 2021

Choose a reason for hiding this comment

naity Sep 12, 2021

Choose a reason for hiding this comment

grst commented Sep 9, 2021

naity commented Sep 12, 2021 • edited Loading

grst left a comment

Choose a reason for hiding this comment

grst Sep 13, 2021

Choose a reason for hiding this comment

grst commented Sep 13, 2021

zktuong commented Sep 13, 2021

grst commented Sep 13, 2021

zktuong commented Sep 13, 2021

naity commented Sep 13, 2021

grst commented Sep 14, 2021

naity commented Sep 14, 2021

grst commented Sep 15, 2021

naity commented Sep 16, 2021

grst commented Sep 16, 2021

zktuong commented Sep 16, 2021

grst commented Sep 16, 2021

zktuong commented Sep 16, 2021

grst commented Sep 16, 2021

naity commented Sep 17, 2021 • edited Loading

grst commented Sep 17, 2021

naity commented Sep 17, 2021

grst commented Sep 20, 2021

naity commented Sep 20, 2021

grst commented Sep 23, 2021

grst commented Sep 24, 2021

naity commented Sep 24, 2021

naity commented Sep 6, 2021 •

edited by grst

Loading

naity commented Sep 12, 2021 •

edited

Loading

naity commented Sep 17, 2021 •

edited

Loading