legpipe2: a SNP calling pipeline

The SNP calling pipeline used at CREA-ZA (Lodi, Italy). Its main application is legumes (diploid and tetraploid) and restricted sequencing (mainly Genotyping By Sequencing). You can use it for everything else, though.

Main features

clean separation into modules (alignment, trimming, SNP calling, ...)
each module can be (re)executed easily and carries its own configuration
easy to hack. You don't like a module and want to improve it? Just go to the corresponding file
internally uses the GVCF workflow from the GATK/HaplotypeCaller suite, allowing for low-memory, fast calling
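
As a rough illustration of the GVCF workflow mentioned above, the sketch below builds the three GATK command lines (per-sample HaplotypeCaller in GVCF mode, GenomicsDBImport, GenotypeGVCFs) as argument lists. It is not the code used inside legpipe2: the function names, file names and the interval are hypothetical, and only commonly documented GATK4 options are used.

```python
from typing import List


def haplotypecaller_cmd(reference: str, bam: str, gvcf: str, ploidy: int = 2) -> List[str]:
    """Per-sample calling in GVCF mode; ploidy would be 4 for tetraploids."""
    return [
        "gatk", "HaplotypeCaller",
        "-R", reference,
        "-I", bam,
        "-O", gvcf,
        "--sample-ploidy", str(ploidy),
        "-ERC", "GVCF",
    ]


def genomicsdbimport_cmd(gvcfs: List[str], workspace: str, interval: str) -> List[str]:
    """Merge the per-sample GVCFs into a GenomicsDB workspace."""
    cmd = [
        "gatk", "GenomicsDBImport",
        "--genomicsdb-workspace-path", workspace,
        "-L", interval,
    ]
    for gvcf in gvcfs:
        cmd += ["-V", gvcf]
    return cmd


def genotypegvcfs_cmd(reference: str, workspace: str, vcf: str) -> List[str]:
    """Joint genotyping of everything stored in the workspace."""
    return [
        "gatk", "GenotypeGVCFs",
        "-R", reference,
        "-V", "gendb://" + workspace,
        "-O", vcf,
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    samples = ["s1", "s2"]
    for s in samples:
        print(" ".join(haplotypecaller_cmd("ref.fa", s + ".bam", s + ".g.vcf.gz")))
    print(" ".join(genomicsdbimport_cmd([s + ".g.vcf.gz" for s in samples], "gvcf_db", "chr1")))
    print(" ".join(genotypegvcfs_cmd("ref.fa", "gvcf_db", "calls.vcf.gz")))
```

Each sample is processed independently in the first step, which is what keeps memory usage low; only the last step looks at all samples together.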

Status

Usable.

Dependencies

Legpipe2 internally uses many other tools. The script setup.sh installs everything starting from a clean Ubuntu machine (here "Ubuntu" means: apt package manager, sudo for installation privileges, basic Unix shell). Even if you are not using setup.sh, you should read it and use it as a guideline for setting up your machine.

Using Legpipe2

Since it's python3 all the way down, you just have to clone the repo. Then run:

python3 /path/to/legpipe2.py your_config_file.ini

The configuration file is the core of the pipeline, since it dictates what steps are going to be executed and on what data. A lot of interesting details are provided in the sample_config.ini example file.
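
To give an idea of how a configuration file can drive which steps are executed, here is a minimal sketch based on Python's configparser. The section names (trim, align, call) and the run key are assumptions made up for this example; the real options are the ones documented in sample_config.ini.

```python
import configparser

# Hypothetical configuration: one section per module, each with a "run" switch.
SAMPLE_CONFIG = """
[trim]
run = yes

[align]
run = no

[call]
run = yes
"""


def enabled_modules(config_text: str) -> list:
    """Return the sections whose 'run' option is true, in file order."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return [
        section
        for section in parser.sections()
        if parser.getboolean(section, "run", fallback=False)
    ]


if __name__ == "__main__":
    print(enabled_modules(SAMPLE_CONFIG))  # ['trim', 'call']
```

A section without the run key is treated as disabled here, so a module is never executed by accident.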

Main inspirations

TODO

Big chunks

Nothing :)

Functionalities

I'd like to implement the following checks and features to ensure that the pipeline fails gracefully when something goes wrong:

support both .fa.gz and .fasta.gz genome files (see in particular genome indexing module)
check if the reference genome is gzipped and not bgzipped
the indexing module needs better management of the logs
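
The first two checks could look something like the sketch below. It is only a possible approach, not code from the pipeline, and the function names are hypothetical. The gzip/bgzip distinction relies on the BGZF format: a bgzipped file is a valid gzip file whose header carries an extra subfield with identifier "BC" and length 2.

```python
import struct
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"
FEXTRA = 0x04  # gzip header flag: an "extra" field is present


def compression_kind(path) -> str:
    """Return 'bgzip', 'gzip' or 'plain' by inspecting the file header."""
    with open(path, "rb") as handle:
        header = handle.read(12)
        if header[:2] != GZIP_MAGIC:
            return "plain"
        if len(header) < 12 or not header[3] & FEXTRA:
            return "gzip"
        (xlen,) = struct.unpack("<H", header[10:12])
        extra = handle.read(xlen)
    # Walk the extra subfields looking for the BGZF marker ('B', 'C', length 2).
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2 = extra[pos], extra[pos + 1]
        (slen,) = struct.unpack("<H", extra[pos + 2:pos + 4])
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return "bgzip"
        pos += 4 + slen
    return "gzip"


def genome_stem(path) -> str:
    """Strip the FASTA extension, accepting both .fa(.gz) and .fasta(.gz)."""
    name = Path(path).name
    for suffix in (".fasta.gz", ".fa.gz", ".fasta", ".fa"):
        if name.endswith(suffix):
            return name[: -len(suffix)]
    raise ValueError("unsupported genome file extension: " + name)
```

With these two helpers the indexing module could stop early with a clear message when the reference is compressed with plain gzip, instead of failing later inside an external tool.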

Bugs

None known :)

Wishlist

Stuff that it would be nice to have, once the above blocks are empty:

export to conf file all the commands, especially from align/filter
we are using both samtools and bcftools. Is it really necessary?
the call module does a lot of work. It could probably be split into three modules (HaplotypeCaller, GenomicsDBImport and GenotypeGVCFs). On the other hand, I'm not sure how common it is to run each step separately
implement a test for required software (maybe module-based, like what is done for the interpolate() and validate() functions). Also check for Python 3.6+
post calling filtering from UGbS-Flex:
- [optional] Remove adjacent SNPs
- [optional] Consolidate SNPs
- [optional] Select SNPs based on parental scores (only applies to some mapping populations)
- [optional] Remove cosegregating SNP markers
remove any subprocess.run(..., shell=True)
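
Two of the items above (testing for required software and dropping shell=True) can be sketched with the standard library alone. This is only a suggestion of how it might be done; the helper names and the tool list are hypothetical and are not taken from the pipeline.

```python
import shutil
import subprocess
import sys


def missing_tools(tools):
    """Return the executables from 'tools' that are not found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def python_is_recent(minimum=(3, 6)):
    """True if the running interpreter is at least the given version."""
    return sys.version_info >= minimum


def run(cmd):
    """Run a command given as a list of arguments, without going through a shell.

    Passing a list avoids quoting problems with file names; check=True raises
    CalledProcessError on a non-zero exit status.
    """
    return subprocess.run(
        cmd,
        check=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # works on Python 3.6, unlike text=True
    )


if __name__ == "__main__":
    if not python_is_recent():
        sys.exit("Python 3.6+ is required")
    # Hypothetical list: each module could declare its own requirements.
    missing = missing_tools(["samtools", "bcftools", "gatk"])
    if missing:
        sys.exit("Missing required software: " + ", ".join(missing))
    print(run(["samtools", "--version"]).stdout.splitlines()[0])
```

Commands that currently rely on shell features such as pipes or redirection would need a bit more work, for example by chaining two subprocess calls or writing to a file handle passed as stdout.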
