
Errors when converting fast5 to slow5 #89

Closed
loganylchen opened this issue Mar 8, 2023 · 16 comments

@loganylchen

Hi @hasindu2008 ,

Me again. Recently I have been investigating some public Nanopore DRS data. Following your suggestions, I first converted the raw FAST5 files into SLOW5, but I ran into an error while doing so.

[screenshot of the f2s error messages]

# cmd
slow5tools f2s data/hek293tMettl3KORep3/fast5/workspace -d results/slow5/hek293tMettl3KORep3 -p 8 2>logs/fast5toslow5/hek293tMettl3KORep3.log

I should add that when I used Guppy to re-basecall the FAST5 files, Guppy also raised errors saying it could not read the dataset for read_id:xxxx. I suspect this is caused by the raw FAST5 files themselves, so I instead ran the conversion on the FAST5 files that Guppy generated (using --fast5_out during basecalling).

Could I ignore these erroneous reads or FAST5 files during the conversion, or is there a better way to fix this kind of issue?

Best

@hasindu2008
Owner

Hi, can I know the original source of this data?

@hasindu2008
Owner

hasindu2008 commented Mar 8, 2023

@loganylchen

If it is from https://github.com/GoekeLab/sg-nex-data, may I suggest downloading directly from the BLOW5 bucket here: http://sg-nex-data-blow5.s3-website-ap-southeast-1.amazonaws.com?

There were quite a few FAST5 corruptions in some datasets, and many inconsistencies in the original FAST5 files. I had to do the conversion to BLOW5 in a convoluted way that also needed some manual curation, so you could save a lot of time by downloading them directly.

Thanks to cying111, jonathangoeke, and AWS, those datasets are hosted as BLOW5 files on the AWS public bucket.

@loganylchen
Author

Hi @hasindu2008 ,

I think some of them come from the SG-NEx project, but some do not. The one I used is HEK293T-Mettl3-KO-rep2, which is not a standard cell condition.

@hasindu2008
Owner

@loganylchen

The behaviour of slow5tools is to be conservative and error out when a FAST5 file is corrupted. There is no option at the moment to ignore corrupted reads: first, too much implementation effort is required; second, such corrupted files should not have been written in the first place; and third, a two-step process is better in cases like this (first run some sort of sanitiser/cleaner to get rid of corrupted reads, then run slow5tools to convert).

When I was converting the SG-NEx datasets, what I did was open each HDF5 file in HDFView, manually locate the affected read, and delete the entry from the FAST5 file. Luckily, only around 50 such manual clean-ups had to be done.
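That manual HDFView step could also be scripted with h5py; here is only a sketch, where the read ID and filename are placeholders and the multi-read FAST5 layout (one top-level read_<uuid> group per read) is assumed. Note that HDF5 does not reclaim the freed space until the file is repacked with h5repack.

```python
# Sketch: delete a known-bad read group from a multi-read FAST5 file.
# The read ID below is a placeholder for one reported by f2s or Guppy.
import h5py

def drop_read(fast5_path, read_id):
    """Remove the HDF5 group for read_id from a multi-read FAST5 file, in place."""
    with h5py.File(fast5_path, "r+") as f:
        group = f"read_{read_id}"
        if group in f:
            del f[group]  # unlinks the group; run h5repack afterwards to reclaim space
            return True
        return False

# drop_read("batch_0.fast5", "00000000-0000-0000-0000-000000000000")
```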

But it seems that running Guppy fast basecalling with --fast5_out removes such corrupted reads? Did it work for you? If so, run slow5tools stats <file.blow5> on the merged final BLOW5 file and check whether the total number of reads matches the total number of reads in the basecalled reads/FAST5 files.
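One rough way to get the FAST5 side of that count comparison, assuming h5py is available and the standard multi-read FAST5 layout, is to total the read groups across the FAST5 directory and compare against the read count printed by slow5tools stats (the path below is illustrative):

```python
# Sketch: count reads across a directory of multi-read FAST5 files,
# to compare against the read count reported by `slow5tools stats`.
import glob
import os
import h5py

def count_fast5_reads(fast5_dir):
    total = 0
    for path in glob.glob(os.path.join(fast5_dir, "*.fast5")):
        with h5py.File(path, "r") as f:
            # In multi-read FAST5, each top-level "read_<uuid>" group is one read.
            total += sum(1 for k in f if k.startswith("read_"))
    return total

# print(count_fast5_reads("results/fast5/hek293tMettl3KORep3"))
```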

@loganylchen
Author

Oh! Thanks for your suggestions; I may do it your way.

I tried Guppy with --fast5_out, but the errors are still there. I think Guppy doesn't remove the corrupted reads but retains them without basecall information. It would be my best luck if I could directly use your prepared BLOW5 files for the SG-NEx project, but for the other datasets I may have to do it myself in such a painful way.

@hasindu2008
Owner

@loganylchen But given that this seems to be a common problem, I wonder whether we should handle it programmatically. @Psy-Fer and @hiruna72, any thoughts on this?

@Psy-Fer
Collaborator

Psy-Fer commented Mar 8, 2023

By default it should fail and show the error, but a flag could allow slow5tools to skip corrupted reads. If we can collect the read IDs of corrupted reads as we go and dump them somewhere for later investigation, even better.

At the very least, people can use this method to clean up their fast5 files.
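In the meantime, a pass along these lines can approximate that clean-up outside slow5tools. This is only a sketch: it assumes h5py and the multi-read FAST5 layout, and treats any exception while reading a read's Raw/Signal dataset as corruption.

```python
# Sketch: scan multi-read FAST5 files, try to read each read's raw signal,
# and log the read IDs (and file paths) that fail for later investigation.
import glob
import os
import h5py

def find_bad_reads(fast5_dir, log_path="bad_reads.tsv"):
    bad = []
    for path in glob.glob(os.path.join(fast5_dir, "*.fast5")):
        with h5py.File(path, "r") as f:
            for group in f:
                if not group.startswith("read_"):
                    continue
                try:
                    # Force the actual dataset read; corruption often only
                    # surfaces when the signal bytes are decoded.
                    _ = f[group]["Raw/Signal"][()]
                except Exception:
                    bad.append((group[len("read_"):], path))
    with open(log_path, "w") as out:
        for read_id, path in bad:
            out.write(f"{read_id}\t{path}\n")
    return bad
```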

@hasindu2008
Owner

@hiruna72 can comment on how difficult it would be to implement this option (in a separate branch called "bad5" for now).

@hiruna72
Collaborator

hiruna72 commented Mar 8, 2023

Yes, a fix should be possible without much difficulty; I will post here once it is implemented. @Psy-Fer, yes, that is a good idea: a file with each corrupted read_id and the original file path can be written to the output.

@Psy-Fer
Collaborator

Psy-Fer commented Mar 8, 2023

Any ideas on what the flag could be? Perhaps --bad5 ['error', 'skip', 'log'], where error is the default and errors out on a bad FAST5 record, skip just skips them (and maybe dumps them to stderr), and log writes them to a file? Or something like that.

@hiruna72
Collaborator

@loganylchen The dataset you sent us is still downloading and will take about five more days.
If you have time, can you build from the bad_fast5 branch and run f2s on the failed dataset? This time use --bad5 1 to skip corrupted files.

Thanks.

@loganylchen
Author

@hiruna72 I've tried the bad_fast5 branch with --bad5 1, and it works: it goes through all the FAST5 files and completes the conversion. Do you need more testing before merging it into the main branch?

@hasindu2008
Owner

hasindu2008 commented Mar 17, 2023

@loganylchen The issue with Hiruna's method is that if a multi-FAST5 file has at least one bad read, it discards the whole multi-FAST5 file, which is not optimal.

If you are unhappy about losing those reads (I am), I also recently wrote a script (albeit a very inefficient one) for handling this.

Two scripts are involved:

  1. https://github.com/hasindu2008/slow5tools/blob/master/scripts/mixed-multi-fast5-to-blow5.sh
  2. attached below (rename to a .sh extension):
    clean-crappy-single-fast5-aggressive2.sh.txt

Now, on the input multi-FAST5 directory (20190805_Sho_M3-1/ in this case):

./mixed-multi-fast5-to-blow5.sh 20190805_Sho_M3-1/ # run this and expect it to fail
rm -r tmp_blow5/ tmp_single_fast5/ # remove the temporary files we do not need
mv tmp_fast5 fast5 # contains the converted single FAST5 files, classified by run ID
./clean-crappy-single-fast5-aggressive2.sh # a nasty script that moves the corrupted single FAST5 files to a quarantine directory and runs f2s on the rest

@hiruna72, what happens if we convert to single FAST5 files and run your fix on those? Will it delete the resultant BLOW5 file for the whole process?
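The quarantine step of that second script could be sketched in Python with h5py along these lines. The directory names are illustrative, and the sketch assumes the multi-read layout (single-read FAST5 files nest the signal under Raw/Reads instead); any exception while reading a file counts as corruption.

```python
# Sketch: move FAST5 files whose raw signal cannot be read into a quarantine
# directory, so f2s can be run cleanly on whatever remains.
import glob
import os
import shutil
import h5py

def quarantine_bad_fast5(fast5_dir, quarantine_dir="quarantine"):
    os.makedirs(quarantine_dir, exist_ok=True)
    moved = []
    for path in glob.glob(os.path.join(fast5_dir, "*.fast5")):
        try:
            with h5py.File(path, "r") as f:
                # Multi-read layout: one top-level read_<uuid> group per read.
                for group in (k for k in f if k.startswith("read_")):
                    _ = f[group]["Raw/Signal"][()]  # force the dataset read
        except Exception:
            shutil.move(path, os.path.join(quarantine_dir, os.path.basename(path)))
            moved.append(path)
    return moved
```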

@hiruna72
Collaborator

No, only the bad FAST5 file(s) will be skipped; the output BLOW5 of the process will not be deleted.

@loganylchen
Author

Thanks @hasindu2008. I have not yet checked how many reads are skipped during the conversion. If not too many, I think it will be fine; otherwise, I will keep your scripts in case I need them someday.

And thanks, @hiruna72. You guys are really nice.

@hasindu2008
Owner

You are welcome. I will close the issue; feel free to reopen if you need anything more.
