Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cannot convert output from Scirpy to dandelion #343

Closed
sbenjamaporn opened this issue Jun 7, 2022 · 19 comments · Fixed by #347
Closed

Cannot convert output from Scirpy to dandelion #343

sbenjamaporn opened this issue Jun 7, 2022 · 19 comments · Fixed by #347
Labels
bug Something isn't working
Projects

Comments

@sbenjamaporn
Copy link

Description of the bug

I try to convert the AnnData after clonal assignment by Scirpy into dandelion format. The result showed that "field productive has invalid bool T + T". I would like to convert it for updating germline sequence of each BCR sequences using dandelion because I did not find this function in Scirpy. However, if you could suggest the other ways. Feel free to let me know.

Minimal reproducible example

import scirpy as ir

ABC_irdata_exclude_orphan_dandelion = ir.io.to_dandelion(ABC_irdata_exclude_orphan)
ABC_irdata_exclude_orphan_dandelion

The error message produced by the code above

~/.conda/envs/dandelion/lib/python3.8/site-packages/airr/schema.py in validate_row(self, row)
    276                 if spec == 'number':  self.to_float(row[f], validate=True)
    277             except ValidationError as e:
--> 278                 raise ValidationError('field %s has %s' %(f, e))
    279 
    280         return True

ValidationError: field productive has invalid bool T + T

Version information

versions

@sbenjamaporn sbenjamaporn added the bug Something isn't working label Jun 7, 2022
@grst
Copy link
Collaborator

grst commented Jun 7, 2022

Hi @sbenjamaporn,

thanks for reporting this issue.
I'm not yet sure what could be the problem. Could you please report the following:

  • the entire stack trace of the error message (not just the last part as above)
  • the result of ABC_irdata_exclude_orphan.obs.columns

Thanks,
Gregor

@sbenjamaporn
Copy link
Author

sbenjamaporn commented Jun 7, 2022

Thanks for your prompt response!

This is my results attached as a pdf file.

Scirpy_to_Dandelion_error.pdf

@grst
Copy link
Collaborator

grst commented Jun 8, 2022

@sbenjamaporn, thanks for the stacktrace! Regarding my second request, you checked adata.columns instead of adata.obs.columns. Could you please send me the result of the latter?

@zktuong, according to the stacktrace, the error occurs within Dandelion. It could theoretically be that there's a problem with scirpy's output, but I have absolutely no idea where the T + T should come from. Do you have an idea what could be the problem?

@sbenjamaporn
Copy link
Author

Sorry for missunderstanding.
This is the result of adata.obs.columns
6BCCCB0A-F5FB-4E1E-8940-F547F0423630

@zktuong
Copy link
Contributor

zktuong commented Jun 8, 2022

hi @grst @sbenjamaporn

I think it's got to do with a malformed entries in productive columns in the adata.obs that airrCell is getting.

So @sbenjamaporn, can you check what's the unique values for:
scirpy's columns:
['IR_VDJ_1_productive', 'IR_VDJ_2_productive', 'IR_VJ_1_productive', 'IR_VJ_2_productive']
There shouldn't be any T + T in these

I think the following are dandelion's columns:
['productive_VDJ', 'productive_VJ', 'productive', 'productive_summary']
There would be T + T in these. But these columns would be ignored by scirpy when converting to an airr table?

I don't think dandelion use any of these columns when converting (it just refreshes based on the airr table that scirpy produces)

I wonder if the round trip of scirpy <-> dandelion somehow got the columns confused?

@sbenjamaporn
Copy link
Author

Dear @zktuong @grst

Yes, ['IR_VDJ_1_productive', 'IR_VDJ_2_productive', 'IR_VJ_1_productive', 'IR_VJ_2_productive'] show only "TRUE or None" output. I agree with you about the convertion between them make it confuse. So, we could not convert scirpy result to dandelion, right?

@grst
Copy link
Collaborator

grst commented Jun 8, 2022

What's also weird is that in addition to the IR_V(D)J_1/2_productive you also have a productive column. What does this one contain?

"productive" should be a chain-level attribute (i.e only available as IR_V(D)J_1/2_productive).

Could you please describe

  • how you initially load the data into scirpy (ir.io.read_10x_vdj, ir.io.read_airr, etc)?
  • if you do any additional conversions between dandelion and scirpy before this error occurs?

@zktuong
Copy link
Contributor

zktuong commented Jun 8, 2022

Hmm it looks like those additional productive columns are from dandelion.

Can you try and remove every column from “clone_id_by_size” onwards and see if there’s still the issue of conversion?

@sbenjamaporn
Copy link
Author

@grst,
The productive column shows "T + T" . I preprocessed my scBCR sequences from Dandelion and then use "ddl.to_scirpy". After that, I define clone by Scirpy. Now, I would like to convert it back in order to update germline to study mutational analysis in Dandelion.

@sbenjamaporn
Copy link
Author

Dear @grst,

Is there any ways that Scirpy could give full information as an AIRR standard ? ( I try ir.io.write_airr, but it did not create full information)

@sbenjamaporn
Copy link
Author

@zktuong Thanks for suggestion, I will try!

@grst
Copy link
Collaborator

grst commented Jun 8, 2022

Hmm it looks like those additional productive columns are from dandelion.

It seems the actual problem is in the conversion from dandelion to scirpy. @zktuong ddl.to_scirpy is just calling scirpy code?

Is there any ways that Scirpy could give full information as an AIRR standard ? ( I try ir.io.write_airr, but it did not create full information)

Happy to discuss this. Could you please open a separate issue and describe what's missing?

@zktuong
Copy link
Contributor

zktuong commented Jun 9, 2022

It seems the actual problem is in the conversion from dandelion to scirpy. @zktuong ddl.to_scirpy is just calling scirpy code?

That's right. it's just a wrapper to call ir.io.from_dandelion.

A small update on this - the issue seems to lie in:

# works ok
irdata = ddl.to_scirpy(vdj) # or ir.io.from_dandelion(vdj)
vdj2 = ir.io.to_dandelion(irdata)

# same issue with ValidationError: field productive has invalid bool T + T appears
irdata = ddl.to_scirpy(vdj, transfer = True)  # or ir.io.from_dandelion(vdj, transfer = True)
vdj2 = ir.io.to_dandelion(irdata)

@zktuong
Copy link
Contributor

zktuong commented Jun 9, 2022

ok. the 'issue' is with line 273:

def to_airr_records(self) -> Iterable[dict]:
"""Iterate over chains as AIRR-Rearrangent compliant dictonaries.
Each dictionary will also include the cell-level information.
Yields
------
Dictionary representing one row of a AIRR rearrangement table
"""
for tmp_chain in self.chains:
chain = AirrCell.empty_chain_dict()
# add the actual data
chain.update(tmp_chain)
# add cell-level attributes
chain.update(self)

because dandelion's productive column in the metadata will update the productive key that scirpy was making from to_airr_cells because it appears later

scirpy/scirpy/io/_io.py

Lines 726 to 739 in 2c5b99e

airr_cells = to_airr_cells(adata)
contig_dicts = {}
for tmp_cell in airr_cells:
for i, chain in enumerate(tmp_cell.to_airr_records(), start=1):
# dandelion-specific modifications
chain.update(
{
"sequence_id": f"{tmp_cell.cell_id}_contig_{i}",
}
)
contig_dicts[chain["sequence_id"]] = chain
data = pd.DataFrame.from_dict(contig_dicts, orient="index")

Can confirm that if just change the name away from productive on dandelion's side, it resolves this.

@sbenjamaporn if you just rename the current productive column to productive_status:

adata.obs.rename(columns={'productive':'productive_status'}, inplace=True)

you should be able to do the transfer.

I will action this on dandelion's side to rename productive to productive_status.

@sbenjamaporn
Copy link
Author

@zktuong Thanks, It works now!.

I have a more question during update germline sequence by update_germline. I have many samples to update. Should the fasta file be "tigger_heavy_igblast_db-pass_genotype.fasta" ? ( I also got the error in this case) or manually specify in each sample ?

OSError: Environmental variable GERMLINE must be set. Otherwise, please provide path to folder containing germline IGHV, IGHD, and IGHJ fasta files.

@sbenjamaporn
Copy link
Author

@grst
Of course!

@grst
Copy link
Collaborator

grst commented Jun 9, 2022

Thanks a lot @zktuong! LMK once you have a release including the fix, then I'll pin the latest version of dandelion.

@zktuong
Copy link
Contributor

zktuong commented Jun 10, 2022

@zktuong Thanks, It works now!.

I have a more question during update germline sequence by update_germline. I have many samples to update. Should the fasta file be "tigger_heavy_igblast_db-pass_genotype.fasta" ? ( I also got the error in this case) or manually specify in each sample ?

OSError: Environmental variable GERMLINE must be set. Otherwise, please provide path to folder containing germline IGHV, IGHD, and IGHJ fasta files.

Let's follow up over on dandelion's side:
zktuong/dandelion#153

@zktuong
Copy link
Contributor

zktuong commented Jun 27, 2022

Hi @grst,

i've just release sc-dandelion==0.2.3 (it's actually 0.2.2 but i thought my upload went wrong)
Need to wait for pypi/warehouse#11696 to be fixed before making any changes here though... =(

@grst grst added this to In progress in scirpy-dev Jun 28, 2022
@grst grst closed this as completed in #347 Jun 28, 2022
scirpy-dev automation moved this from In progress to Done Jun 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
scirpy-dev
  
Done
Development

Successfully merging a pull request may close this issue.

3 participants