Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sample_modifiers applied to subsamples #435

Closed
johnyaku opened this issue Mar 12, 2023 · 12 comments
Closed

sample_modifiers applied to subsamples #435

johnyaku opened this issue Mar 12, 2023 · 12 comments

Comments

@johnyaku
Copy link

When a subsample_table is merged into the sample_table the new columns resulting from the merge are list columns.
This makes sense, as part of the motivation for subsample tables is to cater for situations where there a single sample has multiple values for a given property (column).

However, applying sample_modifiers such as derive breaks on these list columns.

More natural behaviour would be to iteratively apply the sample modifier rules to all elements in the list.
Alternatively, perhaps single item lists could be "unlisted" to become normal columns, and thus subject to sample_modifiers.

Not sure if this a feature request or a "how to" question...

@johnyaku
Copy link
Author

The motivation for wanting to be able to do this relates to issue #434.

Specifically, I want to be able to separate sample "properties" from "data paths".
I was hoping that putting one or the other in a subsample table would do the trick, but the single item lists are breaking things.

If this is actually a feature request (ie, there is no easy way to do this atm) then can I propose the idea of a "merge table", as similar to a subsample table in that it gets merged into the sample_table but where the expectation is that there will be no more than one row per sample.

@johnyaku
Copy link
Author

Digging deeper, it looks like that this is a feature request.

But a simple implementation (rather than introducing a merge_table item) would be to allow the specification of more than one sample_table as a list, as can now be done with subsample_table. There would need to be a common index, of course, but the tables could then be merged.

@nsheff
Copy link
Contributor

nsheff commented Mar 13, 2023

This makes a lot of sense to me, and I like the idea. I also agree that just making sample_table accept either a single table or a list of tables would give you the machinery that you need to separate your sample tables into two. It also wouldn't be too hard to implement.

But as I think about this more, I'm not certain this is better than the way I've typically approached this in the past: which is to use a combined append/derived attribute. With this approach, I simply don't put the "data paths" into the sample table at all, I only put them into the project config file. You can do 1) use append to create a new column; and then 2) use derive to turn it into a path.

This way, the paths are solely encoded in derived attribute modifiers, and the sample table has no paths. I discuss this a little bit here:

https://pep.databio.org/en/latest/howto_eliminate_paths/

You can make it even more powerful in this example by not including the file_path in the sample table, but adding it in using append (or even using imply if it varies across the table).

@johnyaku
Copy link
Author

Thanks! Your suggestion works beautifully.

For anyone stumbling on this thread, my toy example looks like this:

sample_name,slide,area
CID_1234,V11Y24-111,A1
CID_4321,V11Y24-111,B1

.. saved as samples.csv. Then the config file looks like this:

name: peptest

pep_version: 2.1.0
sample_table: "samples.csv"

sample_modifiers:
    append:
        image: "_img"
        fastq: "_fq"
    derive:
        attributes: [image,fastq]
        sources:
            _img: data/{slide}_{area}.jpg
            _fq: data/{sample_name}.fastq.gz

If I get time I'll make a PR to extend the documentation to show how to compbine appened and derive. The linked document above (which actually sent me down this line of thought) uses imply plus derive. I guess if I was more familar with the PEP specification the extension to append plus derive would have been obvious, but it might still be worth spelling it out for slow-pokes like me!

@johnyaku
Copy link
Author

johnyaku commented Mar 20, 2023

This is working great when the sample table is indexed by sampe_name, but at least in some cases it fails with a non-standard index.

Here is my reprex ...

First, samples.csv ...

CAPTURE_ID,SAMPLE_ID
BH_pool1,CT_1126
BH_pool1,CT_1128
BH_pool1,CT_1108
BH_pool2,CT_167
BH_pool2,CT_1134
BH_pool2,CT_174
BH_pool3,CT_1194
BH_pool3,CT_1155
BH_pool3,CT_1152
BH_pool4,CT_1135
BH_pool4,CT_1185
BH_pool4,CT_1209
BH_pool5,CT_1147
BH_pool5,CT_1178
BH_pool5,CT_1157
BH_pool6,CT_1206
BH_pool6,CT_1205
BH_pool6,CT_1227
BH_pool8,CT_1213
BH_pool7,CT_1210
BH_pool7,CT_1201
BH_pool7,CT_1200
BH_pool8,CT_1203
BH_pool8,CT_1216
BH_pool8,CT_1214

.. and my config.yaml ...

name: reprex

pep_version: 2.1.0
sample_table: samples.csv
sample_table_index: SAMPLE_ID

sample_modifiers:
  append:
    cell_barcode_file: _bc
    cell_bam_file: _bam
    genotype_cel_file: _cel
  derive:
    attributes: [cell_barcode_file,cell_bam_file,genotype_cel_file]
    sources:
      _bc  : data/counts/{CAPTURE_ID}/GE/{CAPTURE_ID}/outs/per_sample_outs/{CAPTURE_ID}/count/sample_feature_bc_matrix/barcodes.tsv.gz
      _bam : data/counts/{CAPTURE_ID}/GE/{CAPTURE_ID}/outs/per_sample_outs/{CAPTURE_ID}/count/sample_alignments.bam
      _cel : data/snp/SNP_{SAMPLE_ID}.CEL

I then try to load the sample sheet with ...

import peppy
pep=peppy.Project("config.yaml")

... and I get the following traceback:

Traceback (most recent call last):
  File "/path/to/lib/python3.10/site-packages/attmap/pathex_attmap.py", line 39, in __getattr__
    v = super(PathExAttMap, self).__getattribute__(item)
AttributeError: 'Sample' object has no attribute 'sample_name'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/path/to/lib/python3.10/site-packages/attmap/ordattmap.py", line 46, in __getitem__
    return super(OrdAttMap, self).__getitem__(item)
KeyError: 'sample_name'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/path/to/lib/python3.10/site-packages/attmap/pathex_attmap.py", line 42, in __getattr__
    return self.__getitem__(item, expand)
  File "/path/to/lib/python3.10/site-packages/attmap/pathex_attmap.py", line 59, in __getitem__
    v = super(PathExAttMap, self).__getitem__(item)
  File "/path/to/lib/python3.10/site-packages/attmap/ordattmap.py", line 48, in __getitem__
    return AttMap.__getitem__(self, item)
  File "/path/to/lib/python3.10/site-packages/attmap/attmap.py", line 32, in __getitem__
    return self.__dict__[item]
KeyError: 'sample_name'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/path/to/pep-reprex/./pep_check.py", line 20, in <module>
    pep=peppy.Project(config_file)
  File "/path/to/lib/python3.10/site-packages/peppy/project.py", line 160, in __init__
    self.create_samples(modify=False if self[SAMPLE_TABLE_FILE_KEY] else True)
  File "/path/to/lib/python3.10/site-packages/peppy/project.py", line 259, in create_samples
    self.modify_samples()
  File "/path/to/lib/python3.10/site-packages/peppy/project.py", line 438, in modify_samples
    self.attr_derive()
  File "/path/to/lib/python3.10/site-packages/peppy/project.py", line 830, in attr_derive
    "Deriving '{}' attribute for '{}'".format(attr, sample.sample_name)
  File "/path/to/lib/python3.10/site-packages/attmap/pathex_attmap.py", line 46, in __getattr__
    raise AttributeError(item)
AttributeError: sample_name

If I rename the SAMPLE_ID column in sampes.csv to sample_name and then use the derive modifier to make SAMPLE_ID from sample_name then it works, but that kind of defeats the purpose of sample_table_index, no?

@johnyaku johnyaku reopened this Mar 20, 2023
@nsheff
Copy link
Contributor

nsheff commented Mar 23, 2023

Indeed. in fact I think to use it with subsamples you will also need to specify subsample_table_index in addition to sample_table_index...

@johnyaku
Copy link
Author

Thanks @nsheff

Can you please elaborate a little?

Here's an even simpler reprex for you...

Take this example. When I try to load this with peppy.Project() everything works as advertised.

Now add the following:

sample_modifiers:
  append:
    a: x
  derive:
    attributes: [a]
    sources:
      x: y/{sample_id}

Load this with peppy.Project() I get the following traceback:

Traceback (most recent call last):
  File "/path/to/lib/python3.11/site-packages/attmap/pathex_attmap.py", line 39, in __getattr__
    v = super(PathExAttMap, self).__getattribute__(item)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'Sample' object has no attribute 'sample_name'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/path/to/lib/python3.11/site-packages/attmap/ordattmap.py", line 46, in __getitem__
    return super(OrdAttMap, self).__getitem__(item)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: 'sample_name'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/path/to/lib/python3.11/site-packages/attmap/pathex_attmap.py", line 42, in __getattr__
    return self.__getitem__(item, expand)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/lib/python3.11/site-packages/attmap/pathex_attmap.py", line 59, in __getitem__
    v = super(PathExAttMap, self).__getitem__(item)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/lib/python3.11/site-packages/attmap/ordattmap.py", line 48, in __getitem__
    return AttMap.__getitem__(self, item)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/lib/python3.11/site-packages/attmap/attmap.py", line 32, in __getitem__
    return self.__dict__[item]
           ~~~~~~~~~~~~~^^^^^^
KeyError: 'sample_name'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/path/to/lib/python3.11/site-packages/peppy/project.py", line 163, in __init__
    self.create_samples(modify=False if self[SAMPLE_TABLE_FILE_KEY] else True)
  File "/path/to/lib/python3.11/site-packages/peppy/project.py", line 262, in create_samples
    self.modify_samples()
  File "/path/to/lib/python3.11/site-packages/peppy/project.py", line 441, in modify_samples
    self.attr_derive()
  File "/path/to/lib/python3.11/site-packages/peppy/project.py", line 839, in attr_derive
    "Deriving '{}' attribute for '{}'".format(attr, sample.sample_name)
                                                    ^^^^^^^^^^^^^^^^^^
  File "/path/to/lib/python3.11/site-packages/attmap/pathex_attmap.py", line 46, in __getattr__
    raise AttributeError(item)
AttributeError: sample_name

Rather than sample.sample_name perhaps this should be sample.st_index? I don't understand the code base well enough yet to offer a PR, but this looks like a bug to me.

@johnyaku
Copy link
Author

FWIW, ammending project_config.yaml to the following kind of "works", but is more than a little clunky.

pep_version: "2.1.0"
sample_table: sample_table.csv
sample_table_index: sample_name

sample_modifiers:
  append:
    sample_id: sid
    a: x
  derive:
    attributes: [sample_id, a]
    sources:
      sid: "{sample_name}"
      x: y/{sample_id}

For context, our motivation for not simply using sample_name for the index is that we frequently multiplex samples into a "pool" which is sampled together and then demultiplexed back into individual samples. Some of our workflows operate on pools, whereas others operate on individual samples. We would like to use the term "sample" consistently to refer to the individual biological samples, as distinct from a "pool" or "capture". Being forced to use sample_name to index a table of "pools" breaks this semantic consistency.

@nsheff
Copy link
Contributor

nsheff commented Mar 27, 2023

Yes, you found a bug, and I have fixed it locally. Will push soon. Thanks!

@johnyaku
Copy link
Author

Thanks! :)

@nsheff
Copy link
Contributor

nsheff commented Mar 27, 2023

Try version 0.35.5 (now released) and see if this solves it for you.

@johnyaku
Copy link
Author

Thanks Nathan, I have tested this on my original reprex and it solves the problem.

Thanks for fixing this so quickly!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants