I’d recommend using either the “Flo” pipeline (https://github.com/wurmlab/flo) or Liftoff (https://github.com/agshumate/Liftoff) instead.
This Nextflow pipeline was never really that robust.
See here for more details if interested: https://genomic.social/@photocyte/112255774455268103
A Nextflow pipeline to lift over GFF files using the UCSC liftOver tools.
Unlike many other liftOver pipelines, which use pre-computed liftover files (e.g. from UCSC), this script generates a custom liftover file by performing blat alignment of the provided "old" and "new" FASTA files.
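To make concrete what the custom liftover (chain) file does, here is a minimal Python sketch of mapping one coordinate through a single chain. It is not code from this pipeline: `lift_position` and `EXAMPLE_CHAIN` are hypothetical names, the example chain is made up, and the sketch assumes one chain with both strands `+`. It follows the UCSC chain format, where the target columns describe the "old" sequence and the query columns describe the "new" one.

```python
# Hypothetical example chain: 100 aligned bases, then a gap of 5 bp in the
# old sequence and 10 bp in the new sequence, then 195 aligned bases.
EXAMPLE_CHAIN = """\
chain 1000 scaf1 300 + 0 300 scaf1 305 + 0 305 1
100 5 10
195
"""


def lift_position(chain_text, pos):
    """Map a 0-based position on the old (target) sequence to the new (query) one.

    Returns None when the position falls in a gap or outside the chain.
    """
    lines = [ln.split() for ln in chain_text.strip().splitlines()]
    header = lines[0]
    # Header: chain score tName tSize tStrand tStart tEnd qName qSize qStrand qStart qEnd id
    if header[4] != "+" or header[9] != "+":
        raise ValueError("this sketch only handles +/+ chains")
    t, q = int(header[5]), int(header[10])
    for fields in lines[1:]:
        size = int(fields[0])
        if t <= pos < t + size:
            # Inside an ungapped block: the offset within the block is preserved.
            return q + (pos - t)
        if len(fields) == 3:
            # Skip the block plus the gaps that follow it (dt on old, dq on new).
            t += size + int(fields[1])
            q += size + int(fields[2])
    return None


if __name__ == "__main__":
    for old_pos in (50, 102, 105):
        print(old_pos, lift_position(EXAMPLE_CHAIN, old_pos))
```

A feature is lifted cleanly only when its coordinates land inside aligned blocks, so anything that falls in a gap between blocks has no direct mapping.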
- Inspired by: doSameSpeciesLiftOver.pl
- And this: using-liftover-to-convert-genome-assembly-coordinates/
- Also this: flo
- This is a nice general overview: Griffith Lab - Introduction to liftover tools
- Install Miniconda3
- Setup conda environment
conda create --name doSameSpeciesLiftOver
conda activate doSameSpeciesLiftOver
- Install conda dependencies:
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority false
conda install nextflow graphviz
- The `doSameSpeciesLiftOver.nf` script will dynamically install the rest of the conda dependencies as needed, but the dependencies can be preinstalled if you'd like. Install them using the line below, then either delete the conda directives from the `doSameSpeciesLiftOver.nf` script or set the `totalCondaEnvPath` parameter to an environment with the dependencies.
conda install ucsc-fatotwobit blat ucsc-fasplit ucsc-liftup \
ucsc-axtchain ucsc-chainmergesort ucsc-chainsplit ucsc-chainsort \
seqkit ucsc-chainnet ucsc-netchainsubset ucsc-liftover \
genometools-genometools gffutils
nextflow run doSameSpeciesLiftOver.nf \
-resume \
--gff examples/Homo_sapiens.GRCh38_chr6-subset.84.gff3 \
--oldFasta examples/hg38-chr6-subseq1.fa \
--newFasta examples/hg38-chr6-subseq2.fa \
-with-trace examples/trace.txt \
-with-report examples/report.html \
-with-dag examples/flowchart.svg
(The example currently takes about 2 minutes to run, with roughly 50% of the computational time spent on conda installing things and the other 50% on the blat alignment.)
- Splitting FASTA files to a sub-record level (e.g. splitting a record every 4999 bp with 1500 bp overlaps by using non-default values for `params.splitDepth`, `params.splitSize`, `params.recordSplit`, and `params.extraBases`) leads to an incorrect liftover file. I believe the Nextflow logic is correct, so the problem is something I don't quite understand about blat, chain files, and/or liftover files.
- blat isn't multithreaded. pblat could be used instead.
- Without sub-record splitting and without multithreading, the script can't take advantage of parallel computing resources very well (it only parallelizes at 1 blat alignment process per FASTA record). So, a whole-genome liftover takes a very long time. The script is probably best used in a genome assembly tweaking context, where two versions of a single scaffold can be compared and lifted over. The script runs pretty fast in that context.
- The script could be changed to compare and lift over only the scaffolds which have a matching record ID (e.g., in different genome assembly versions where the record IDs are the same between the two versions), and to use a hash up front to confirm that the two scaffolds differ before using blat to align them, but this is not implemented.
- Features which traverse "N" gaps are oftentimes not lifted over properly. This happens even with no sub-record splitting and with the `-extendThroughN` parameter for blat. The `rescue_unlifted_features` node attempts to fix some of the more trivial edge cases, but doesn't work super well.
- CrossMap could be used to lift over things like .bam and .vcf files, but it was quite buggy for me and I couldn't get it to work. There is still a vestigial node, `crossmap_liftover`, partially implementing this.
- blat supports repeat-aware alignment (using the `-mask=` parameter). Including such repeat information would presumably make the alignment faster, but this is not implemented. Note that the `constructOocFile` node and the `-ooc` parameter in blat do a sort of simple repeat annotation.
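
The unimplemented idea in the list above (only aligning scaffolds whose record IDs match but whose sequences differ) could look roughly like the following Python sketch. It is an illustration, not part of the pipeline: the function names are hypothetical, the record ID is assumed to be the first word of the FASTA header, and sequences are compared case-insensitively, so differences in soft-masking alone would not trigger an alignment.

```python
import hashlib


def read_fasta(text):
    """Parse FASTA text into {record_id: sequence}.

    Assumption: the record ID is the first whitespace-delimited word of the header.
    """
    records, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            current = words[0] if words else ""
            records[current] = []
        elif current is not None:
            # Upper-case so soft-masked (lower-case) bases compare as equal.
            records[current].append(line.upper())
    return {rid: "".join(parts) for rid, parts in records.items()}


def seq_digest(seq):
    """Return a SHA-256 hex digest of the sequence string."""
    return hashlib.sha256(seq.encode("ascii")).hexdigest()


def records_needing_alignment(old_fasta_text, new_fasta_text):
    """List record IDs present in both FASTA files whose sequences differ."""
    old = read_fasta(old_fasta_text)
    new = read_fasta(new_fasta_text)
    shared = sorted(set(old) & set(new))
    return [rid for rid in shared if seq_digest(old[rid]) != seq_digest(new[rid])]


if __name__ == "__main__":
    old_text = ">scaf1 v1\nACGT\nACGT\n>scaf2\nGGGG\n>scaf3\nTTTT\n"
    new_text = ">scaf1\nACGTACGT\n>scaf2\nGGGA\n>scaf4\nCCCC\n"
    print(records_needing_alignment(old_text, new_text))
```

Records that appear in only one of the two files are ignored here. Whether they should fall back to an all-versus-all blat alignment is a separate design decision that this sketch does not address.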