Skip to content

Commit

Permalink
Merge branch 'master' of github.com:maxibor/pydamage
Browse files Browse the repository at this point in the history
  • Loading branch information
maxibor committed Apr 23, 2021
2 parents bc24033 + 93d70a4 commit 4904994
Show file tree
Hide file tree
Showing 27 changed files with 379 additions and 198 deletions.
4 changes: 3 additions & 1 deletion .github/workflows/pydamage_ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -36,4 +36,6 @@ jobs:
- name: Check pydamage on test data
shell: bash -l {0}
run: |
pydamage --verbose tests/data/aligned.bam
pydamage analyze --verbose tests/data/aligned.bam
pydamage filter pydamage_results/pydamage_results.csv
pydamage cite
20 changes: 18 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -52,7 +52,7 @@ pip install -e .
## Quick start

```bash
pydamage aligned.bam
pydamage analyze aligned.bam
```

## CLI help
Expand All @@ -65,4 +65,20 @@ pydamage --help

## Documentation

[pydamage.readthedocs.io](https://pydamage.readthedocs.io)
[pydamage.readthedocs.io](https://pydamage.readthedocs.io)

## Cite

```
@article{Borry2021_pydamage,
author = {Borry, Maxime and Huebner, Alexander and Rohrlach, Adam B and Warinner, Christina G},
doi = {10.1101/2021.03.24.436838},
elocation-id = {2021.03.24.436838},
eprint = {https://www.biorxiv.org/content/early/2021/03/24/2021.03.24.436838.full.pdf},
journal = {bioRxiv},
publisher = {Cold Spring Harbor Laboratory},
title = {PyDamage: automated ancient damage identification and estimation for contigs in ancient DNA de novo assembly},
url = {https://www.biorxiv.org/content/early/2021/03/24/2021.03.24.436838},
year = {2021}
}
```
4 changes: 3 additions & 1 deletion conda/meta.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -21,14 +21,16 @@ requirements:
- python>=3.6
- click>=7.0
- numpy>=1.17.5
- pandas>=1.0.0
- pandas>=1.1.0
- patsy>=0.5.1
- pip>=20.0.2
- bioconda::pysam>=0.15.2
- scipy>=1.4.1
- statsmodels>=0.11.0
- matplotlib>=3.1.1
- tqdm>=4.45.0
- biopython>=1.78
- kneed>=0.7.0

test:
commands:
Expand Down
Binary file modified docs/img/logo.png
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
2 changes: 1 addition & 1 deletion docs/source/CLI.rst
Original file line number Diff line number Diff line change
Expand Up @@ -11,4 +11,4 @@ The list of arguments of options is detailed below

.. click:: pydamage.cli:cli
:prog: pydamage
:show-nested:
:nested: full
1 change: 1 addition & 0 deletions docs/source/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@ __ homepage_
API
CLI
output
troubleshooting



Expand Down
11 changes: 9 additions & 2 deletions docs/source/output.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,9 @@

Pydamage generates both a tabular and a visual output.

The tabular output is a comma-separated file (`.csv`) with the following columns, for each analysed reference:
The tabular outputs are comma-separated file (`.csv`) with the following columns, for each analysed reference:

### `pydamage_results.csv`

* `reference`: name of the reference genome/contig
* `pred_accuracy`: Predicted accuracy of Pydamage prediction, from the GLM modelling
Expand All @@ -22,7 +24,12 @@ The tabular output is a comma-separated file (`.csv`) with the following columns
* `CtoT-N`: Proportion of CtoT substitutions observed at position `N` from 5' end
* `GtoA-N`: Proportion of GtoA substitutions observed at position `N` from 5'

> To select contigs/references with damage, you will most likely want to look at two columns: `pred_accuracy > 0.9` and `qvalue <= 0.05`

### `pydamage_filtered_results.csv`

Same file as above, but with contigs filtered with `qvalue <= 0.05` and `pred_accuracy >= threshold` with the filtering threshold determined with the [kneedle](https://ieeexplore.ieee.org/document/5961514) method.

### Plots

The visual output are PNG files, one per reference contig. They show the frequency of observed C to T, and G to A transition at the 5' end of the sequencing data and overlay it with the fitted models for both the null and the damage model, including 95% confidence intervals. Furthermore, it provides a "residuals versus fitted" plot to help evaluate the fit of the pydamage damage model. Finally, the plot contains informtion on the average coverage along the reference and the p-value calculated from the likelihood-ratio test-statistic using a chi-squared distribution.

Expand Down
11 changes: 11 additions & 0 deletions docs/source/troubleshooting.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
# Troubleshooting

## My alignment files don't have a MD tag

You can use [samtools calmd](http://www.htslib.org/doc/samtools-calmd.html) to set the MD tag

Example:

```bash
samtools calmd -b alignment.bam reference.fasta > aln.bam
```
3 changes: 2 additions & 1 deletion environment.yml
Original file line number Diff line number Diff line change
Expand Up @@ -14,5 +14,6 @@ dependencies:
- statsmodels>=0.11.0
- matplotlib>=3.1.1
- tqdm>=4.45.0
- sphinx-click>=2.3.2
- kneed>=0.7.0
- sphinx-click>=0.7.0
- biopython>=1.78
4 changes: 2 additions & 2 deletions pydamage/__init__.py
Original file line number Diff line number Diff line change
@@ -1,2 +1,2 @@
__version__ = "0.50"
from pydamage.main import analyze
__version__ = "0.60"
from pydamage.main import pydamage_analyze
8 changes: 5 additions & 3 deletions pydamage/accuracy_model.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,14 +2,16 @@
import pandas as pd
import numpy as np
import pickle
import gzip


def load_model():
"""Returns the pmml model"""
stream = pkg_resources.resource_stream(
__name__, "models/accuracy_model_v2_python.pickle"
model_path = pkg_resources.resource_stream(
__name__, "models/accuracy_model_v2_python.pickle.gz"
)
return pickle.load(stream)
with gzip.open(model_path, "rb") as mod:
return pickle.load(mod)


def prepare_data(pd_df):
Expand Down
15 changes: 15 additions & 0 deletions pydamage/citation.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
def get_citation():
BIB = """
@article{Borry2021_pydamage,
author = {Borry, Maxime and Huebner, Alexander and Rohrlach, Adam B and Warinner, Christina G},
doi = {10.1101/2021.03.24.436838},
elocation-id = {2021.03.24.436838},
eprint = {https://www.biorxiv.org/content/early/2021/03/24/2021.03.24.436838.full.pdf},
journal = {bioRxiv},
publisher = {Cold Spring Harbor Laboratory},
title = {PyDamage: automated ancient damage identification and estimation for contigs in ancient DNA de novo assembly},
url = {https://www.biorxiv.org/content/early/2021/03/24/2021.03.24.436838},
year = {2021}
}
"""
print(BIB)
104 changes: 82 additions & 22 deletions pydamage/cli.py
Original file line number Diff line number Diff line change
@@ -1,53 +1,113 @@
#!/usr/bin/env python3

import click
from pydamage.main import analyze, analyze_group
from pydamage.main import pydamage_analyze
from pydamage.citation import get_citation
from pydamage.kneedle import apply_filter
from pydamage import __version__
from collections import OrderedDict

class NaturalOrderGroup(click.Group):

@click.command()
def __init__(self, name=None, commands=None, **attrs):
super(NaturalOrderGroup, self).__init__(
name=name, commands=None, **attrs)
if commands is None:
commands = OrderedDict()
elif not isinstance(commands, OrderedDict):
commands = OrderedDict(commands)
self.commands = commands

def list_commands(self, ctx):
return self.commands.keys()

@click.group(cls=NaturalOrderGroup)
@click.version_option(__version__)
@click.pass_context
@click.option(
"-o",
"--outdir",
type=click.Path(writable=True, dir_okay=True),
default="pydamage_results",
show_default=True,
help="Output directory",
)
def cli(ctx, outdir):
"""\b
PyDamage: Damage parameter estimation for ancient DNA
Author: Maxime Borry
Contact: <borry[at]shh.mpg.de>
Homepage & Documentation: github.com/maxibor/pydamage
"""

ctx.ensure_object(dict)

ctx.obj["outdir"] = outdir
pass


@cli.command()
@click.pass_context
@click.argument("bam", type=click.Path(exists=True))
@click.option(
"-w",
"--wlen",
default=35,
type=int,
show_default=True,
help="Window length for damage modeling",
help="Window length (in bp) for damage modeling",
)
@click.option(
"-p",
"--process",
default=2,
type=int,
show_default=True,
help="Number of processes",
help="Number of processes for parallel computing",
)
@click.option("-s", "--show_al", is_flag=True, help="Show alignments representations")
@click.option("-pl", "--plot", is_flag=True, help="Make the damage plots")
@click.option("--verbose", is_flag=True, help="Verbose mode")
@click.option(
"-o",
"--outdir",
type=click.Path(writable=True, dir_okay=True),
default="pydamage_results",
show_default=True,
help="Output directory",
"-s", "--show_al", is_flag=True, help="Display alignments representations"
)
@click.option("--force", is_flag=True, help="Force overwriting of results directory")
@click.option("--group", is_flag=True, help="Group references together for analyis")
def cli(no_args_is_help=True, **kwargs):
@click.option("-pl", "--plot", is_flag=True, help="Write damage fitting plots to disk")
@click.option("-vv", "--verbose", is_flag=True, help="Verbose mode")
@click.option(
"-f", "--force", is_flag=True, help="Force overwriting of results directory"
)
@click.option(
"-g",
"--group",
is_flag=True,
help="Use entire BAM file as single reference for analyis "
"(ignore reference headers)",
)
def analyze(ctx, no_args_is_help=True, **kwargs):
"""\b
PyDamage: Damage parameter estimation for ancient DNA
Author: Maxime Borry
Contact: <borry[at]shh.mpg.de>
Homepage & Documentation: github.com/maxibor/pydamage
Run the PyDamage analysis
BAM: path to BAM/SAM/CRAM alignment file. MD tags need to be set.
"""
pydamage_analyze(**kwargs, **ctx.obj)

BAM: path to BAM/SAM/CRAM alignment file

@cli.command()
@click.pass_context
@click.argument("csv", type=click.Path(exists=True))
def filter(ctx, no_args_is_help=True, **kwargs):
"""\b
Filter PyDamage results with optimal pred_accuracy threshold selection
CSV: path to PyDamage result file
"""
analyze(**kwargs)
print(kwargs, ctx.obj)

apply_filter(**kwargs, **ctx.obj)

@cli.command()
def cite():
"""Get pydamage citation in bibtex format
"""
get_citation()

if __name__ == "__main__":
cli()
23 changes: 7 additions & 16 deletions pydamage/damage.py
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,7 @@ def __init__(self, reference, al_handle):
al_handle(pysam.AlignmentFile)
"""
self.alignments = al_handle.fetch(reference)

self.reference = reference
# self.alignments = al_file

# def __repr__(self):
Expand Down Expand Up @@ -45,14 +45,13 @@ def get_damage(self, wlen, show_al):
all_bases = []
for al in self.alignments:
if al.is_reverse is False and al.is_unmapped is False:
cigar = al.cigartuples
ref = al.get_reference_sequence()
quer = al.query_sequence

all_damage = damage_al(
reference=ref,
query=quer,
cigartuple=cigar,
reference=al.get_reference_sequence(),
read_name=al.query_name,
ref_name=self.reference,
query=al.query_sequence,
cigartuple=al.cigartuples,
wlen=wlen,
show_al=show_al,
)
Expand Down Expand Up @@ -205,15 +204,7 @@ def get_damage_group(ref, bam, mode, wlen, show_al, process):


def test_damage_group(
ct_data,
ga_data,
cc_data,
all_bases,
nb_reads_aligned,
cov,
reflen,
wlen,
verbose
ct_data, ga_data, cc_data, all_bases, nb_reads_aligned, cov, reflen, wlen, verbose
):
"""Performs damage test
Expand Down
15 changes: 15 additions & 0 deletions pydamage/exceptions.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
class Error(Exception):
"""Base class for exceptions in this module."""

pass


class AlignmentFileError(Error):
"""Exception raised for errors in validating the dataset agains the standards.
Attributes:
message -- explanation of the error
"""

def __init__(self, message):
self.message = message

0 comments on commit 4904994

Please sign in to comment.