Skip to content

Saddle cli#11

Merged
nvictus merged 20 commits intoopen2c:masterfrom
dekkerlab:saddle_cli
Feb 25, 2018
Merged

Saddle cli#11
nvictus merged 20 commits intoopen2c:masterfrom
dekkerlab:saddle_cli

Conversation

@sergpolly
Copy link
Member

saddle cli conversation starter ...

There are too many things going on at the same time, for me alone to decide everything.

  • I tried my best to add some cross-validation to make sure cooler, track and expected are compatible between each other.
  • took the actual working part from Nezar's and Ankita's implementations https://github.com/nandankita/labUtilityTools
  • did some effort to produce sensible text-outputs - needs to be finalized a bit
  • savefig to generate an actual saddleplot. no interactive plots for now
  • already using chrom-chrom pairs trans expected, not quite ready for chromosomes-arms (is that even a thing?) ...
  • fetcher/getter functions updated accordingly, sometimes switching from groupby to using pandas indexing

TODO:

  • update help (it's a mess now)
  • improve output (text and savefig)
  • chrom arms ?????

IMPORTANT:

added some fixes into saddle.py

  • 1 outlier bin was missing from saddledata matrix
  • saddleplot was misaligned bin-wise between the heatmap and hist

before it looked like that:
download-9

now it looks this way:
finale

@sergpolly sergpolly requested review from golobor and nvictus February 22, 2018 21:53
Copy link
Member

@nvictus nvictus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only urgent thing is the output of the digitized track and bin edges. I think it would make more sense to save them as a BED-like file, and maybe optionally the saddledata as a npy.

# # ...
# from bioframe.io import formats
# from bioframe.schemas import SCHEMAS
# formats.read_table(track_path, schema=SCHEMAS["bed4"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, these types of files with non-overlapping segmentations of quantitative data are called bedGraph, so it would be schema='bedGraph'. But bioframe isn't ready to be a dependency yet, though soon it should.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mentioned that we might switch to bioframe in the future for input validation etc.
corrected schema reference.

# # just like in diamond_insulation:
if output is not None:
# output saddledata (square matrix):
pd.DataFrame(saddledata).to_csv(output+".saddledata.tsv",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine. An alternative way to do the same is np.savetxt, but even less flexible. You could also output .npy if that extension is used instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see reply for the next comment

index=False,
header=False,
na_rep='nan')
# output digitized track:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't we output the digitized track and binedges as a single bedGraph file?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, the idea here was to output data needed to plot saddleplot from scratch, e.g.,
provide necessary inputs for the saddle.saddleplot function: binedges, digitized, saddledata.

100% agree we should store them as a single container, rather than 3 text files ...

also digitized by its nature is a begGraph, in fact, it is the same exact bedGraph that users would provide as TRACK_PATH, with actual values replaced with digitized indices.
However, binedges is simply an array of edges for those digitized indices, e.g..,if binedges=[-0.1,-0.05,0,...], then values from track that are <-0.1 would assigned digitized index 0, values between -0.1 and -0.05 are assigned digitized index 1, etc.
So, how exactly would binedges fit into bedGraph schema ?

Can we just output binedges, digitized, saddledata as a single npz container ?
Potentially adding a 4th thing there: actual 1D histogram from saddleplot in a form of bins and counts (I think we'd need new bins , again , since binedges in this case are made up of digitized values, not the initial values from the track)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

about bedGraph:
is the original track:

   chrom   start      end    eigen
0  chr10       1   200000  0.02511
1  chr10  200001   400000  0.04261
2  chr10  400001   600000  0.04836
3  chr10  600001   800000  0.05034
4  chr10  800001  1000000  0.05601
...

then bedGraph-like data structure that includes both original 'eigen' and digitized one would look like:

    chrom    start      end    eigen  digitized.eigen
0   chr10        1   200000  0.02511               13
1   chr10   200001   400000  0.04261               15
2   chr10   400001   600000  0.04836               15
3   chr10   600001   800000  0.05034               16
4   chr10   800001  1000000  0.05601               16
5   chr10  1000001  1200000  0.03689               14
6   chr10  1200001  1400000 -0.01956                7
7   chr10  1400001  1600000 -0.03823                5
8   chr10  1600001  1800000 -0.04384                5
9   chr10  1800001  2000000 -0.04233                5
...

so what exactly do we want to store? 100% bedGraph compliant thing:

chrom    start      end  digitized.eigen
chr10        1   200000               13
chr10   200001   400000               15
chr10   400001   600000               15
chr10   600001   800000               16
...

or a dataframe with eigen and digitized.eigen (which isn't a bedGraph anymore).

we could also add binedges here in a form of another start_eigen, stop_eigen columns:

    chrom    start      end    eigen  digitized.eigen  start_eigen  stop_eigen
0   chr10        1   200000  0.02511               13     0.023993    0.032685
1   chr10   200001   400000  0.04261               15     0.041377    0.050069
2   chr10   400001   600000  0.04836               15     0.041377    0.050069
3   chr10   600001   800000  0.05034               16     0.050069    0.058761
4   chr10   800001  1000000  0.05601               16     0.050069    0.058761
5   chr10  1000001  1200000  0.03689               14     0.032685    0.041377
6   chr10  1200001  1400000 -0.01956                7    -0.028160   -0.019468
7   chr10  1400001  1600000 -0.03823                5    -0.045544   -0.036852
8   chr10  1600001  1800000 -0.04384                5    -0.045544   -0.036852
9   chr10  1800001  2000000 -0.04233                5    -0.045544   -0.036852
10  chr10  2000001  2200000 -0.04399                5    -0.045544   -0.036852
11  chr10  2200001  2400000 -0.04955                4    -0.054236   -0.045544
12  chr10  2400001  2600000 -0.04825                4    -0.054236   -0.045544

thus binedges would be implicitly included here...

array([-0.0803127 , -0.07162058, -0.06292847, -0.05423635, -0.04554423,
       -0.03685212, -0.02816   , -0.01946788, -0.01077577, -0.00208365,
        0.00660847,  0.01530058,  0.0239927 ,  0.03268482,  0.04137693,
        0.05006905,  0.05876117,  0.06745328,  0.0761454 ])

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting... Let's put it on hold and for now just output binedges, hist_counts and saddledata to a single npz. The saddleplot function could probably be modified to take pre-histogrammed counts as input. Obtaining digitized tracks is a separate concern.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

The only thing though, is that digitized is a dict internally, and when savez-ed it is being wrapped into an ndarray with shape () - so you cannot extract that dict by simply saying npz['digitized'][0] (I couldn't at least), so https://stackoverflow.com/questions/24565916/why-is-numpy-shape-empty suggest using npz['digitized'].item().
I've documented that stuff in the output option help message.

try:
import matplotlib as mpl
# savefig only for now:
mpl.use('Agg')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! This should be the default, especially if running on headless computers.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, it was simply copy/pasted from your cooler_show implementation: https://github.com/mirnylab/cooler/blob/master/cooler/cli/show.py

@sergpolly
Copy link
Member Author

it seems to be mostly done for the initial iteration.

I added one thing though, - computation of the compartment strength, it's not very flexible but experimental people are asking for it... very much so.
Also given the fact that we're dumping saddledata as npz, it would be harder for R-people to custom process the data.

Let me know if you think that the naive strength calculation is too misleading, or not robust enough even to be included in this initial iteration.

@nvictus nvictus merged commit 0900446 into open2c:master Feb 25, 2018
@sergpolly sergpolly deleted the saddle_cli branch February 26, 2018 00:02
sergpolly pushed a commit that referenced this pull request Oct 21, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants