-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Strand-aware expand (bedtools slop) #144
Comments
thanks for the suggestion! any suggestions on the API? I wonder if we're not overloading some of the core bioframe functions with arguments. E.g. the current implementation of Relatedly, any thoughts on the balance between keeping the functions lean & encourage users to compose functions to achieve their goals, vs. adding this sort of functionality? (which could, e.g., also be implemented as two calls to expand). |
Probably a boolean flag, similar to bedtools. Alternatively, more options for direction?
I see your point with However, I don't think this is an issue here. It's one argument and it's quite clear how it interacts with the other arguments. As an argument from prior art: bedtools does have a pretty nice interface. I don't think you can go that wrong copying it.
Ha, tough question. I think it always requires a lot of thought. It's especially hard in a language like python or R where the language itself is slow, so we actually need to wrap quite a lot of operations in a small number of calls. Some relevant points here:
I don't think this one is that easy to accomplish. My initial thought here using just the pandas and bioframe APIs would be: split by strand, apply (Oh, you may also need to make sure you don't expand any intervals over chromosome bounds, but that might be a separate issue.)
The logic for doing this in a strand aware way could look something like: from typing import Literal
from numba import njit
import numpy as np, pandas as pd
def expand(df: pd.DataFrame, amount: int, strand: Literal["+", "-"]) -> pd.DataFrame:
result = df.copy()
left, right = result["start"].values, result["end"].values
result["start"], result["end"] = _expand(left, right, amount, (df["strand"] == strand).values)
return result
@njit
def _expand(left, right, amount, strand):
N = len(left)
for i in range(N):
if strand[i]:
right[i] += amount
else:
left[i] -= amount
return left, right This is going to be much more efficient that the pandas/ bioframe solution I thought up. |
Discussing this afternoon, we see this as an opportunity to come up with a good semantics for strand-awareness, which was also recently added to Two proposals came up:
This convention could apply to P.S. Thank you for getting the ball rolling in #152! |
For (1): I also personally find But also bedtools seems kinda inconsistent about this stuff, suggesting it's hard to decide. Could be worth asking the authors what they prefer these days? I would like to pitch the other direction of (2): def expand(
df: pd.DataFrame,
both: int | None = None,
*,
left: int | None = None,
right: int | None = None,
strand_aware: bool = False,
) -> pd.DataFrame:
if both is not None and (left is not None or right is not None):
raise ValueError("Cannot specify both and left/right")
... This collapses it to one argument, instead of two and lets you do things like def closest(
df1, df2, *, left: bool = True, right: bool = True, ...
) -> pd.DataFrame:
Thanks to @emdann for coming up with the elegant solution! |
One of the solutions might be defining the "search space" for bioframe.closest(df_oriented, df,
overlaps: bool | True, # whether to expand the search space to overlapping regions
upstream: bool | True, # -"- expand the search space to upstream
downstream: bool | True, # -"- downstream
direction_col=None) bioframe.expand(df,
pad: int, # The amount by which the intervals are additively expanded.
upstream: bool | True, # -"- expand to upstream
downstream: bool | True, # -"- downstream
direction_col=None
) |
another issue related to some ideas:
|
I don't think that it's super important that argument name is consistent with argument type across functions. Especially for a concept as deeply ingrained into the core data structure as upstream/ downstream, I think there will be many different kinds of values that could be appropriate for different functions.
Going a bit further on the idea of having # Expand upstream by half the gene length
bf.expand(genes, upstream=((genes["stop"] - genes["start"]) * 0.5).round()) Longer example: Cut out first exonfirst_exon = (
genes
.join(exons, how="left", on="gene_id")
.groupby("gene_id")
.sort_values("start")
.first()
)
# Assuming first_exon is the same order as genes
bf.expand(
genes,
upstream=first_exon["start"] - first_exon["stop"],
) This could cover the |
I hate to bikeshed more on this, but I think it's important to get this right, because semantics of "direction" (absolute/reference orientation or relative to a supplied orientation per record) and "orientation" (i.e. strandedness of a feature) can get extremely confusing for end users. I think when speaking of the direction of an operation, e.g. the search direction of
However, a nice alternative is to not use specifiers at all and accept a tuple ( bioframe.expand(df, 5) # symmetric
bioframe.expand(df, df["slop"]) # symmetric, vectorized
bioframe.expand(df, (6, 5))
bioframe.expand(df, (rev_array, fwd_array))
# single-sided
bioframe.expand(df, (rev, 0))
bioframe.expand(df, (0, fwd)) I suggest that the orientation that defines the direction of an operation not be called bioframe.expand(df, (6, 5), along="strand") where the convention would be to map features with The bioframe.expand(df, (rev_slops, fwd_slops), along=strands) I would finally advocate to move the "scale" use case to a new function. Thoughts? Thanks to @manzt for the tuple suggestion! |
If by 5 and 3 you mean 5-prime and 3-prime ends, shouldn't it be the opposite? E.g. 5-prime on the positive strand is on the left. |
Or is it just "random" numbers that confused me because they match the DNA ends? :) |
Oh man, yeah, 5 and 3 were just random... Bad choice, low on caffeine :) |
Perhaps this has a nice mapping to |
I've updated @emdann's draft to introduce a new function Like the proposal above, That means that a single shift value will translate intervals by moving the leading and trailing bounds the same amount in the same direction, while pairs of opposing signs will shrink or expand intervals. The |
Towards getting TSS regions from genomic annotations: it would be nice if
expand
could be strand aware.Right now,
expand
only lets you choose "left", "right", or "both". I would like to be able for this to take into account strand direction, similar tobedtools slop
's orbedtools flank
's-s
argument allows.Related to:
The text was updated successfully, but these errors were encountered: