[BUG] Multithreading issue with anvi-run-hmms #1564
I guess I see it better now. Here is the output from a single thread:
There are multiple of these in the multithreaded mode, of course:
And this is how they look:
So basically all hits were in one of the chunks of the input FASTA file, and the problem probably occurs somewhere while merging all these four into a single file.
OK. This is all happening here:

```python
def append_to_main_table_file(self, merged_file_buffer, table_output_file, buffer_write_lock):
    """Append table output to the main file.

    FIXME In addition to appending, this function also pre-processes the data, which should not
          be done here. That qualifies as hmmer output parsing, and should be in
          anvio/parsers/hmmer.py.
    """
    detected_non_ascii = False
    lines_with_non_ascii = []
    clip_description_index = None
    clip_index_found = False

    with open(table_output_file, 'rb') as hmm_hits_file:
        line_counter = 0
        for line_bytes in hmm_hits_file:
            line_counter += 1
            line = line_bytes.decode('ascii', 'ignore')

            if not len(line) == len(line_bytes):
                lines_with_non_ascii.append(line_counter)
                detected_non_ascii = True

            if line.startswith('#'):
                if not clip_index_found and line.find('description') != -1:
                    # This parser removes the description column from the data
                    clip_description_index = line.find('description')
                    clip_index_found = True
                continue

            with buffer_write_lock:
                merged_file_buffer.write('\t'.join(line[:clip_description_index].split()) + '\n')

    if detected_non_ascii:
        self.run.warning("Just a heads-up, Anvi'o HMMer parser detected non-ascii characters while processing "
                         "the file '%s' and cleared them. Here are the line numbers with non-ascii characters: %s. "
                         "You may want to check those lines with a command like \"awk 'NR==<line number>' <file path> | cat -vte\"." %
                         (table_output_file, ", ".join(map(str, lines_with_non_ascii))))
```

Only when it is running as a single thread does this function read this line:
as this:
What is wrong with this function is beyond me. It is the single-threaded mode that ruins the output. But multi-threaded mode gives the error :/
:((( OK. There are way too many issues here. But the end-of-line discrepancies are related to this piece of code:

```python
if line.startswith('#'):
    if not clip_index_found and line.find('description') != -1:
        # This parser removes the description column from the data
        clip_description_index = line.find('description')
        clip_index_found = True
```

The lines that follow this section are then clipped using `clip_description_index`. This is where this code is cutting every following hit line:

Admittedly, this garbage of a file format is not making anything easier.
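To make the failure mode concrete, here is a minimal, self-contained sketch. The header and hit lines below are made up (they only mimic the fixed-width layout of HMMER's table output), but they show how clipping a hit line at a character offset computed from a *different* file's header silently truncates fields:

```python
# Made-up lines mimicking HMMER's tabular output: the 'description' column
# starts at a different character offset in each file, because HMMER pads
# columns to fit whatever names appear in that particular run.
header_short = "# target  query  e-value  description"
header_long  = "# contig_with_a_very_long_name  query  e-value  description"

# A hit line padded to match header_long:
hit_long = "c_000000000001_split_00001      PF001  1e-30    Ribosomal protein"

# The buggy logic computes ONE clip index, from whichever header it saw first:
clip_index = header_short.find('description')        # 26 in this sketch
clip_index_long = header_long.find('description')    # 48 in this sketch

# Clipping the long-format hit line at the short-format index cuts it mid-field:
print('\t'.join(hit_long[:clip_index].split()))
# -> 'c_000000000001_split_00001' -- the query and e-value fields are lost
```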
The long story short so far:
OK. One thing learned. This only happens as a result of a perfect storm: HMMER sets the locations for each output field as a function of the first line. For one to get this STUPID error, all these must happen at the same time:
A very sophisticated bug due to some 'fancy' code someone wrote to make things look 'better' that took hours from my life, and I am not even close to a solution.
I still have no idea why there is a difference between single-threaded and multi-threaded modes. But what I know is, when you run HMMER 4 different times, it sets the distances between different columns of its output table differently. Go figure:
Mind you, three out of four of these have no hits. But in multi-threaded mode, our code finds the position of the `description` column in only one of these files and applies it to all of them. The solution is to set 'query name' in FASTA files to something that is very short. This is the kind of crap 2020 asks us to deal with.
To prove this, I created a file with the contig names above:
Exported them:
Renamed the contigs names to something shorter:
Generated a contigs db:
Ran HMMs:
and OF COURSE everything worked:
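For readers who want to reproduce the renaming step, here is a minimal sketch in plain Python. The file paths and the `c_%012d` naming scheme are illustrative choices (in practice anvi'o ships `anvi-script-reformat-fasta` for this purpose):

```python
# Minimal sketch: rewrite FASTA deflines to short, uniform names.
def simplify_deflines(in_path, out_path):
    counter = 0
    with open(in_path) as infile, open(out_path, 'w') as outfile:
        for line in infile:
            if line.startswith('>'):
                counter += 1
                outfile.write('>c_%012d\n' % counter)
            else:
                outfile.write(line)

simplify_deflines('contigs.fa', 'contigs-renamed.fa')
```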
A bullet-proof solution to this is the following:
But the problem with this solution is that it would add computational overhead to 99% of anvi'o projects for no reason. This problem only applies to those who do not follow our advice regarding the use of short, simple deflines in FASTA files. So I will not solve this, but will add a warning to the cryptic error message so people know what might have gone wrong.
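For what it's worth, one position-independent way to parse such lines (a sketch of the general idea only, not necessarily the bullet-proof solution referenced above) is to ignore character offsets entirely and split each data line on whitespace a bounded number of times, so a free-text description containing spaces cannot bleed into the data fields. The field count of 18 assumes the amino-acid table format discussed later in this thread:

```python
# Sketch: parse a HMMER table line without relying on column offsets.
# NUM_FIELDS=18 assumes the amino-acid table format; the trailing free-text
# description (which may contain spaces) is dropped.
NUM_FIELDS = 18

def parse_tblout_line(line):
    if line.startswith('#'):
        return None  # header/comment line
    fields = line.split(maxsplit=NUM_FIELDS)
    return fields[:NUM_FIELDS]
```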
This is insane. :( And it reminds me that we need to stop doing 'fancy' parsing of the HMMER output in the driver and move that to the parser code. I was planning to fix that after I finish with updating the multi-threading part of the code. In general, lots of things need to be fixed with HMMER. :(
Maybe in the future we should switch from the table output and work directly with the raw search output. But I am closing this issue now after wasting my entire day on it rather than working on our more time-critical things :/
Sorry this happened to you @meren. I wrote this piece of garbage:

```python
if line.startswith('#'):
    if not clip_index_found and line.find('description') != -1:
        # This parser removes the description column from the data
        clip_description_index = line.find('description')
        clip_index_found = True
```

Apologies it has backfired. Adding this was a necessary evil, as the loop used to look like this:

```python
line_counter = 0
for line_bytes in hmm_hits_file:
    line_counter += 1
    line = line_bytes.decode('ascii', 'ignore')

    if not len(line) == len(line_bytes):
        lines_with_non_ascii.append(line_counter)
        detected_non_ascii = True

    if line.startswith('#'):
        continue

    with buffer_write_lock:
        merged_file_buffer.write('\t'.join(line.split()[0:18]) + '\n')
```

The hard-coded 18 worked for us for years, but only by miracle. IIRC 18 was the number of columns for amino acids, but for nucleotides the number is something like 15 or 16. Yet amazingly, Python does not register this as an error:

```python
>>> x = [0, 1, 2]
>>> x[:100000]
[0, 1, 2]
```

So no problem. Yet under some circumstances the number of important fields exceeds 18, which I discovered when using HMMER to integrate InteracDome into anvi'o (#1472). So I implemented what we are dealing with now. Of course I did not expect HMMER would misformat its own output file.
I agree. It should be moved to parser modules and parsed with pandas.
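A sketch of what that could look like; the function name, the splitting rule, and the assumption of 18 data fields are all illustrative choices here, not the eventual anvi'o parser:

```python
import pandas as pd

# Sketch: load a HMMER table output file into a DataFrame. We split each
# data line ourselves (maxsplit) because the final 'description' field may
# contain spaces; NUM_FIELDS=18 again assumes the amino-acid format.
NUM_FIELDS = 18

def tblout_to_dataframe(path):
    rows = []
    with open(path) as f:
        for line in f:
            if line.startswith('#'):
                continue
            fields = line.split(maxsplit=NUM_FIELDS)
            rows.append(fields[:NUM_FIELDS])
    return pd.DataFrame(rows)

df = tblout_to_dataframe('hmm.table')  # 'hmm.table' as named in this issue
```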
No worries at all, @ekiefl. It wasn't your fault, as you couldn't foresee that the HMMER output was not as consistent as it seemed. I realized that it works perfectly in most cases, but when we use contig names as targets, contigs databases with long, variable-length contig names fail for HMMs that do not target amino acid sequences. If we ever want to change this, I think we shouldn't bother with the table output, but work directly with the raw search output, which is full of information and seemed easier to parse.
Error tolerance in programming languages is the absolute path to hell paved with good intentions.
Short description of the problem
anvi-run-hmms works well on a contigs database as a single thread, but fails when running in multiple threads.
anvi'o version
This happens with the main branch.
Detailed description of the issue
Running it with a single thread, everything is fine:
Running it with multiple threads:
Indeed, there is something wrong with the `hmm.table` file that is produced. This is how it looks with a single thread:

And this is the same file after HMMs were run with multiple threads:
Look at the end of lines o_O.
This must be due to a race condition, or some files not being closed. Very disturbing.
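For readers picturing the suspected race condition, below is a minimal sketch of the lock-guarded merge pattern in question, with made-up worker logic and data; note that when every write is guarded by the lock, interleaved writes cannot occur, which is a hint that the corruption described above comes from parsing rather than I/O:

```python
import io
import threading

# Sketch of the merge pattern under suspicion, with made-up data: several
# worker threads append parsed rows to one shared in-memory buffer.
merged_file_buffer = io.StringIO()
buffer_write_lock = threading.Lock()

def append_chunk(lines):
    for line in lines:
        if line.startswith('#'):
            continue
        # Every write happens under the lock: whole lines from different
        # threads can interleave, but bytes within a line cannot.
        with buffer_write_lock:
            merged_file_buffer.write('\t'.join(line.split()) + '\n')

chunks = [['# header\n', 'c_1  PF001  1e-30\n'],
          ['c_2  PF002  2e-10\n']]
threads = [threading.Thread(target=append_chunk, args=(chunk,)) for chunk in chunks]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(merged_file_buffer.getvalue())
```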
Files to reproduce
The contigs database to reproduce this is here:
https://www.dropbox.com/s/bdzvlfqxngnw00u/CONTIGS.db.gz
It would be great if someone could run the following two commands and comment if they get the same error: