Commit

typos
nsheff committed May 5, 2017
1 parent 21ae957 commit f586a86
Showing 1 changed file with 4 additions and 2 deletions.
6 changes: 4 additions & 2 deletions doc/source/implied-columns.rst
@@ -3,7 +3,7 @@
Implied columns
=============================================

At some point, you may have a situation where you need a single sample attribute (or column) to populate several different pipeline arguments with differnent values. In other words, the value of a given attribute may **imply** values for other attributes. It would be nice if you didn't have to enumerate all of these secondary, implied attributes, and could instead just infer them from the value of the original attribute. For example, if my `organism` attribute is ``human``, this implies a few other secondary attributes: I want to set ``genome`` to ``hg38`` and ``macs_genome_size`` to ``hs``. Of course, I could just define columns called ``genome`` and ``macs_genome_size``, but these would be invariant so it feels inefficient and unweildy; then, changing the aligned genome would require changing the sample annotation sheet.
At some point, you may have a situation where you need a single sample attribute (or column) to populate several different pipeline arguments with different values. In other words, the value of a given attribute may **imply** values for other attributes. It would be nice if you didn't have to enumerate all of these secondary, implied attributes, and could instead just infer them from the value of the original attribute. For example, if my ``organism`` attribute is ``human``, this implies a few other secondary attributes (which may be project-specific): for one project, I want to set ``genome`` to ``hg38`` and ``macs_genome_size`` to ``hs``. Of course, I could just define columns called ``genome`` and ``macs_genome_size``, but these would be invariant, so it feels inefficient and unwieldy; and then, changing the aligned genome would require changing the sample annotation sheet (every sample, in fact). You can do this with looper, of course, but a better way would be to handle these things at the project level.

As a more elegant alternative, Looper offers a ``project_config`` section called ``implied_columns``. Instead of hard-coding ``genome`` and ``macs_genome_size`` in the sample annotation sheet, you can simply specify that the attribute ``organism`` **implies** additional attribute-value pairs (which may vary by sample based on the value of the ``organism`` attribute). This lets you specify the genome, transcriptome, genome size, and other similar variables all in your project configuration file.

@@ -20,4 +20,6 @@ To do this, just add an ``implied_columns`` section to your project_config.yaml
genome: "mm10"
macs_genome_size: "mm"
There are 3 layers in the ``implied_columns`` hierarcy. The first layer, (sub-values under ``implied_columns``; here, ``organism``), are primary columns from which new attributes will rely. The second layer (here, ``human`` or ``mouse``) are possible values your samples may take in the primary column. The third layer (``genome`` and ``macs_genome_size``) are the key-value pair of new, implied columns for any samples with the required value for that primary column. In this example, any samples with organism set to "human" will automatically also have attributes for genome (hg38) and for macs_genome_size (hs). Any samples with organism set to "mouse" will have the corresponding values. A sample with organism set to ``frog`` would lack attributes for ``genome`` and ``macs_genome_size``, since those columns are not implied by ``frog``.
There are 3 layers in the ``implied_columns`` hierarchy. The first layer (sub-values under ``implied_columns``; here, ``organism``) lists the primary columns on which new attributes will rely. The second layer (here, ``human`` or ``mouse``) lists the possible values your samples may take in the primary column. The third layer (``genome`` and ``macs_genome_size``) holds the key-value pairs of new, implied columns for any samples with the required value for that primary column. In this example, any samples with ``organism`` set to ``human`` will automatically also have attributes for ``genome`` (``hg38``) and for ``macs_genome_size`` (``hs``). Any samples with ``organism`` set to ``mouse`` will have the corresponding values. A sample with ``organism`` set to ``frog`` would lack attributes for ``genome`` and ``macs_genome_size``, since those columns are not implied by ``frog``.

This system essentially lets you set global, species-level attributes at the project level instead of duplicating that information for every sample that belongs to a species. Even better, it's generic -- so you can do this for any subdivision of samples (just replace ``organism`` with whatever you like). This makes your project more portable and does a better job of conceptually separating sample attributes from project attributes; after all, a reference genome assembly is really not an inherent property of a sample, but of a sample with respect to a particular project or alignment.
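
The three-layer lookup described above can be sketched in a few lines of Python. This is only an illustration of the idea, not looper's actual implementation: the function name ``apply_implied_columns`` and the plain-dict representation of a sample are assumptions made for the example, as is the choice to let implied values overwrite any attribute of the same name that the sample already has.

```python
def apply_implied_columns(sample, implied_columns):
    """Return a copy of `sample` with implied attributes added.

    Hypothetical helper for illustration; not part of looper's API.
    `sample` is a dict of attribute -> value (one row of the annotation sheet).
    `implied_columns` mirrors the three-layer config section:
    primary column -> possible value -> {implied attribute: value}.
    """
    result = dict(sample)  # leave the caller's dict untouched
    for primary, value_map in implied_columns.items():
        value = sample.get(primary)
        # A value with no entry (e.g. "frog") implies nothing.
        implied = value_map.get(value, {})
        # Assumption: implied values overwrite existing ones of the same name.
        result.update(implied)
    return result


# The same structure as the YAML example, written as a Python dict.
implied_columns = {
    "organism": {
        "human": {"genome": "hg38", "macs_genome_size": "hs"},
        "mouse": {"genome": "mm10", "macs_genome_size": "mm"},
    }
}

human = apply_implied_columns(
    {"sample_name": "s1", "organism": "human"}, implied_columns
)
frog = apply_implied_columns(
    {"sample_name": "s2", "organism": "frog"}, implied_columns
)
```

Here ``human`` gains ``genome`` and ``macs_genome_size``, while ``frog`` is returned with only its original attributes, matching the behavior described for an unlisted value.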
