pod5 subset job with double free or corruption error #87

Closed
antoinefelden opened this issue Nov 14, 2023 · 12 comments

Comments

@antoinefelden

I’m running into an error that I haven’t managed to debug so far. The job runs fine until the pod5 subset command, which starts, runs for a few seconds, and then stops writing files. The errors are not always exactly the same, nor do they happen at the same point in the subsetting process. Here is a representative example; the “double free” error is recurring:

Subsetting:  0%|     | 0/497 [00:00<?, ?Files/s]
Subsetting:  2%|2     | 11/497 [00:00<00:04, 100.57Files/s]tcache_thread_shutdown(): unaligned tcache chunk detected
corrupted size vs. prev_size while consolidating

Subsetting:  4%|4     | 22/497 [00:00<00:06, 72.94Files/s] 
Subsetting:  7%|6     | 33/497 [00:00<00:06, 72.09Files/s]
Subsetting:  9%|8     | 44/497 [00:00<00:06, 70.97Files/s]double free or corruption (fasttop)

Subsetting: 10%|#     | 51/497 [00:19<02:54, 2.55Files/s]

Both errors seem to be related to memory/pointers in the pod5 subset command. I’ve also tried the script with a newer version of Python, but it returns POD5 has encountered an error: 'libffi.so.7: cannot open shared object file: No such file or directory.
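
For reference, that libffi message usually means the Python build in use was linked against a libffi shared library that is not on the loader path. A minimal check, assuming the same modules are loaded as in the job, is to ask which libffi the compiled ctypes extension expects:

# Locate the compiled _ctypes extension module, then list the libffi it links against.
python3 -c "import _ctypes; print(_ctypes.__file__)"
ldd "$(python3 -c 'import _ctypes; print(_ctypes.__file__)')" | grep -i ffi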

Here is the script I’m running:

#!/bin/bash
#SBATCH --partition=bigmem
#SBATCH --cpus-per-task=24
#SBATCH --mem=512G
#SBATCH --time=10-0:00:00
#SBATCH --ntasks=1
#SBATCH --job-name=1_pod5_split
#SBATCH -o /nfs/scratch/feldenan/%J.out
#SBATCH -e /nfs/scratch/feldenan/%J.err

set -o history -o histexpand

task_ID=${SLURM_JOB_NAME}_${SLURM_JOB_ID}
echo ${task_ID}

output_dir=/nfs/scratch/feldenan/$task_ID
mkdir -p $output_dir
cd $output_dir

module load GCCcore/9.3.0
module load Python/3.8.2
#module load GCCcore/11.3.0
#module load Python/3.10.4
module load GCC/10.3.0
module load OpenMPI/4.1.1
module load R/4.0.0
pip install pod5

export POD5_DEBUG=1  # exported so it is visible to the pod5 commands below
LIB=Varroa_gDNA; POD5=/nfs/scratch/feldenan/Nanopore/01_data/Varroa_gDNA/B_clean/20231031_1641_MN45095_FAU99644_84b8b260/pod5_skip

echo $LIB

mkdir -p ./${LIB}_pod5_split

pod5 merge $POD5/*.pod5 -o ./$LIB.pod5
pod5 view ./$LIB.pod5 --threads 24 --include "read_id, channel" --output ./$LIB'_summary.tsv'
pod5 subset ./$LIB.pod5 --threads 24 --summary ./$LIB'_summary.tsv' --columns channel --output ./$LIB'_pod5_split'

Any idea where the error could come from? I can post more output if needed. I've tried to run it a fair few times now; it's interesting how the job stops at slightly different stages, sometimes with other error messages popping up (e.g. malloc(): unaligned fastbin chunk detected).

@0x55555555
Collaborator

Hi @antoinefelden,

Can you confirm what version of pod5 you are using?

What version of Python are you using when you get the original tcache_thread_shutdown(): crash?

What environment is the command running in (OS, architecture, etc.)?

  • George
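
A few commands along these lines collect the requested details (a sketch; it assumes pip points at the same Python environment the job loads):

# Installed pod5 package version, interpreter version, OS release and architecture.
pip show pod5
python3 --version
cat /etc/os-release
uname -m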

@antoinefelden
Author

Thanks George, I'm working on a cluster with the following specs:

NAME="Rocky Linux"
VERSION="8.8 (Green Obsidian)"
Architecture: x86_64

pod5 is the latest version, i.e. 0.3.0

Here is my $PATH as well:
/home/software/EasyBuild/software/Python/3.8.2-GCCcore-9.3.0/bin:/home/software/EasyBuild/software/XZ/5.2.5-GCCcore-9.3.0/bin:/home/software/EasyBuild/software/SQLite/3.31.1-GCCcore-9.3.0/bin:/home/software/EasyBuild/software/Tcl/8.6.10-GCCcore-9.3.0/bin:/home/software/EasyBuild/software/ncurses/6.2-GCCcore-9.3.0/bin:/home/software/EasyBuild/software/bzip2/1.0.8-GCCcore-9.3.0/bin:/home/software/EasyBuild/software/binutils/2.34-GCCcore-9.3.0/bin:/home/software/EasyBuild/software/GCCcore/9.3.0/bin:/nfs/home/feldenan/bin:/nfs/home/feldenan/.local/bin:/home/software/vuwrc/utils:/home/software/apps/local/bin:/opt/ohpc/pub/mpi/libfabric/1.13.0/bin:/opt/ohpc/pub/mpi/ucx-ohpc/1.11.2/bin:/opt/ohpc/pub/libs/hwloc/bin:/opt/ohpc/pub/mpi/openmpi4-gnu9/4.1.1/bin:/opt/ohpc/pub/compiler/gcc/9.4.0/bin:/opt/ohpc/pub/utils/prun/2.2:/opt/ohpc/pub/utils/autotools/bin:/opt/ohpc/pub/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/pbs/bin

Let me know if any other detail would be helpful!

@0x55555555
Collaborator

Hi @antoinefelden,

We have a new release, https://github.com/nanoporetech/pod5-file-format/releases/tag/0.3.1, which has fixes in the subsetting area that could address this.

If this doesn't fix the issue, I'd ideally like to try to reproduce it under a debugger (or in my own setup) to see what's going on.

Thanks,

  • George
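
As a sketch of the debugger route (assuming the paths and variables from the script above, and that gdb is available on the node): since the pod5 entry point is a Python console script, gdb has to be attached to the interpreter itself.

# Run the failing subset step under gdb and print a native backtrace
# when glibc aborts on the heap corruption.
gdb -batch -ex run -ex bt --args python3 "$(which pod5)" subset ./$LIB.pod5 \
    --threads 24 --summary ./$LIB'_summary.tsv' --columns channel \
    --output ./$LIB'_pod5_split'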

@antoinefelden
Author

Thanks, I've downloaded the tarball, but from there I'm not sure how to run version 0.3.1: pip only seems to install 0.3.0, and adding the directory to PATH produced the error "export: [...]/pod5-file-format/pod5-file-format-0.3.1': not a valid identifier".
Do you have some guidance on how to run version 0.3.1? Are those the instructions in install.srt? Thanks - sorry for the newbie question!

@0x55555555
Collaborator

Hmm, that is strange; 0.3.1 is on PyPI: https://pypi.org/project/pod5/. Can you install from there?
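
If pip keeps resolving to the already-installed copy, pinning the release explicitly is another option (a sketch, targeting the user site-packages mentioned in the warning above):

# Force the 0.3.1 release into the user site-packages, replacing 0.3.0.
python3.8 -m pip install --user --upgrade pod5==0.3.1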

@antoinefelden
Author

I think pip doesn't overwrite the previous version I have? I get the warning "Requirement already satisfied: pod5 in /nfs/home/feldenan/.local/lib/python3.8/site-packages (0.3.0)"

@0x55555555
Collaborator

Interesting. What is the exact command you're running to try to install it?

@antoinefelden
Author

I've tried pip install pod5 and python3.8 -m pip install pod5
Both return

Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: pod5 in /nfs/home/feldenan/.local/lib/python3.8/site-packages (0.3.0)

So I've removed everything pod5-related in /nfs/home/feldenan/.local/lib/python3.8/site-packages and reinstalled pod5 - it seems to run now! However, the output file looks a bit funky - should I be worried?
352232.err.txt

@0x55555555
Collaborator

Can you try:

> pip install --upgrade pod5

That should kick it into installing the latest version.

  • George

@antoinefelden
Author

antoinefelden commented Nov 15, 2023

Thanks for the pip tip - it now seems to run fine, even on the full 70 GB dataset. The error log is, however, a bit weird, with lots of line returns and "[A", as attached in my previous comment. Is that normal?
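
Those "[A" sequences are most likely ANSI cursor-up escape codes from the progress bars; when stderr goes to a file instead of a terminal they are written out verbatim rather than moving the cursor, so the attached log looks mangled even though the run itself is unaffected. If a clean log is needed, something like this strips them (a sketch using GNU sed; the filename matches the earlier attachment):

# Remove ANSI escape sequences and carriage returns from the captured stderr log.
sed -e 's/\x1b\[[0-9;]*[A-Za-z]//g' -e 's/\r/\n/g' 352232.err.txt > 352232.err.clean.txt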

@0x55555555
Collaborator

That is a bit odd... I'll take a look internally now, but it sounds like the subset itself is working?

Thanks,

  • George

@antoinefelden
Author

Yes, the output itself has no visible issue. Thanks for helping!
