Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Multithreading issue with anvi-run-hmms #1564

Closed
meren opened this issue Nov 22, 2020 · 12 comments
Closed

[BUG] Multithreading issue with anvi-run-hmms #1564

meren opened this issue Nov 22, 2020 · 12 comments

Comments

@meren
Copy link
Member

meren commented Nov 22, 2020

Short description of the problem

anvi-run-hmms works well on a contigs database as a single thread, but fails when running in multiple threads.

anvi'o version

This happens with the main branch.

Detailed description of the issue

Running it with a single thread, everything is fine:

anvi-run-hmms -c CONTIGS.db -I Ribosomal_RNA_23S

HMM sources ..................................: Ribosomal_RNA_23S
Alphabet/context target found ................: RNA:CONTIG

HMM Profiling for Ribosomal_RNA_23S
===============================================
(...)
Number of raw hits ...........................: 8
Number of weak hits removed ..................: 0
Number of hits in annotation dict  ...........: 8
Pruned .......................................: 3 out of 8 hits were removed due to redundancy
Gene calls added to db .......................: 5 (from source "Ribosomal_RNA_23S")

✓ anvi-run-hmms took 0:01:47.840152

Running it with multiple threads:

anvi-run-hmms -c CONTIGS.db -T 4 -I Ribosomal_RNA_23S --just-do-it

HMM sources ..................................: Ribosomal_RNA_23S
Alphabet/context target found ................: RNA:CONTIG

HMM Profiling for Ribosomal_RNA_23S
===============================================
(...)
Number of raw hits ...........................: 8

✖ anvi-run-hmms encountered an error after 0:00:45.813513


Config Error: Number of column names declared (15) differs from the number of columns found
              (16) in the matrix
              ('/var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmpe_h6eje3/hmm.table') :/

Indeed, there is something wrong with the hmm.table file that is produced. This is how it looks with single thread:

cat /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmp2sl4ixmm/hmm.table
23S_rRNA_bac	-	215_10M_human_filteredcontig37	-	3	1185	1619	2816	1617	2818	2906	+	2.7e-278	913.6	27.9
23S_rRNA_bac	-	215_10M_human_filteredcontig414	-	3	1265	1270	4	1272	1	2906	-	1.1e-258	848.4	40.
23S_rRNA_bac	-	215_10M_human_filteredcontig422	-	4	2111	2097	3	2100	1	2906	-	0	1567.7	63.
23S_rRNA_arc	-	215_10M_human_filteredcontig422	-	17	2204	2083	5	2101	2	2978	-	1.8e-269	884.1	61.
23S_rRNA_bac	-	215_10M_human_filteredcontig504	-	5	2383	2383	3	2387	1	2906	-	0	1704.1	37.
23S_rRNA_arc	-	215_10M_human_filteredcontig504	-	15	2450	2372	20	2385	1	2978	-	1.8e-301	991.2	34.
23S_rRNA_bac	-	215_10M_human_filteredcontig515	-	11	2886	3365	317	3374	299	2906	-	0	2084.4	109.
23S_rRNA_arc	-	215_10M_human_filteredcontig515	-	17	2963	3358	317	3375	302	2978	-	0	1208.9	107.

And this is the same file after HMMs were run with multiple threads:

23S_rRNA_bac	-	215_10M_human_filteredcontig37	-	3	1185	1619	2816	1617	2818	2906	+	2.7e-278	913.6	27.9	-
23S_rRNA_bac	-	215_10M_human_filteredcontig414	-	3	1265	1270	4	1272	1	2906	-	1.1e-258	848.4	40.1	-
23S_rRNA_bac	-	215_10M_human_filteredcontig422	-	4	2111	2097	3	2100	1	2906	-	0	1567.7	63.7	-
23S_rRNA_arc	-	215_10M_human_filteredcontig422	-	17	2204	2083	5	2101	2	2978	-	1.8e-269	884.1	61.7	-
23S_rRNA_bac	-	215_10M_human_filteredcontig504	-	5	2383	2383	3	2387	1	2906	-	0	1704.1	37.2	-
23S_rRNA_arc	-	215_10M_human_filteredcontig504	-	15	2450	2372	20	2385	1	2978	-	1.8e-301	991.2	34.9	-
23S_rRNA_bac	-	215_10M_human_filteredcontig515	-	11	2886	3365	317	3374	299	2906	-	0	2084.4	109.2	-
23S_rRNA_arc	-	215_10M_human_filteredcontig515	-	17	2963	3358	317	3375	302	2978	-	0	1208.9	107.7	-

Look at the end of lines o_O.

This must be due to a race condition, or some files not being closed. Very disturbing.

Files to reproduce

The contigs database to reproduce this is here:

https://www.dropbox.com/s/bdzvlfqxngnw00u/CONTIGS.db.gz

It would be great if someone run the following two commands and comment if they get the same error:

anvi-run-hmms -c CONTIGS.db --just-do-it -I Ribosomal_RNA_23S

anvi-run-hmms -c CONTIGS.db --just-do-it -I Ribosomal_RNA_23S -T 4
@meren
Copy link
Member Author

meren commented Nov 22, 2020

I guess I see it better now. Here is the output from a single thread:

$ cat /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmp2sl4ixmm/RNA_contig_sequences.fa.0_table

# target name        accession  query name                    accession  hmmfrom hmm to alifrom  ali to envfrom  env to  modlen strand   E-value  score  bias  description of target
#------------------- ----------          -------------------- ---------- ------- ------- ------- ------- ------- ------- ------- ------ --------- ------ ----- ---------------------
23S_rRNA_bac         -          215_10M_human_filteredcontig37 -                3    1185    1619    2816    1617    2818    2906    +    2.7e-278  913.6  27.9  -
23S_rRNA_bac         -          215_10M_human_filteredcontig414 -                3    1265    1270       4    1272       1    2906    -    1.1e-258  848.4  40.1  -
23S_rRNA_bac         -          215_10M_human_filteredcontig422 -                4    2111    2097       3    2100       1    2906    -           0 1567.7  63.7  -
23S_rRNA_arc         -          215_10M_human_filteredcontig422 -               17    2204    2083       5    2101       2    2978    -    1.8e-269  884.1  61.7  -
23S_rRNA_bac         -          215_10M_human_filteredcontig504 -                5    2383    2383       3    2387       1    2906    -           0 1704.1  37.2  -
23S_rRNA_arc         -          215_10M_human_filteredcontig504 -               15    2450    2372      20    2385       1    2978    -    1.8e-301  991.2  34.9  -
23S_rRNA_bac         -          215_10M_human_filteredcontig515 -               11    2886    3365     317    3374     299    2906    -           0 2084.4 109.2  -
23S_rRNA_arc         -          215_10M_human_filteredcontig515 -               17    2963    3358     317    3375     302    2978    -           0 1208.9 107.7  -
#
# Program:         hmmscan
# Version:         3.2.1 (June 2018)
# Pipeline mode:   SCAN
# Query file:      /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmp2sl4ixmm/RNA_contig_sequences.fa.0
# Target file:     /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmp0a37vifu/Ribosomal_RNA_23S.hmm
# Option settings: nhmmscan -o /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmp2sl4ixmm/RNA_contig_sequences.fa.0_output --tblout /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmp2sl4ixmm/RNA_contig_sequences.fa.0_table --cut_ga --cpu 1 /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmp0a37vifu/Ribosomal_RNA_23S.hmm /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmp2sl4ixmm/RNA_contig_sequences.fa.0
# Current dir:     /Users/meren/Downloads/P215-MERGED
# Date:            Sun Nov 22 14:58:01 2020
# [ok]

There are multiple of these in the multithreaded mode, of course:

ls -l /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmpvxchsekk/*_table
-rw-r--r-- 1 meren staff 1180 Nov 22 14:46 /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmpvxchsekk/RNA_contig_sequences.fa.0_table
-rw-r--r-- 1 meren staff 1188 Nov 22 14:46 /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmpvxchsekk/RNA_contig_sequences.fa.1_table
-rw-r--r-- 1 meren staff 2499 Nov 22 14:46 /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmpvxchsekk/RNA_contig_sequences.fa.2_table
-rw-r--r-- 1 meren staff 1186 Nov 22 14:46 /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmpvxchsekk/RNA_contig_sequences.fa.3_table

And this is how the look like:

/var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmpvxchsekk/RNA_contig_sequences.fa.0_table

# target name        accession  query name                    accession  hmmfrom hmm to alifrom  ali to envfrom  env to  modlen strand   E-value  score  bias  description of target
#------------------- ----------          -------------------- ---------- ------- ------- ------- ------- ------- ------- ------- ------ --------- ------ ----- ---------------------
#
# Program:         hmmscan
# Version:         3.2.1 (June 2018)
# Pipeline mode:   SCAN
# Query file:      /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmpvxchsekk/RNA_contig_sequences.fa.0
# Target file:     /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmp6aq2gbjs/Ribosomal_RNA_23S.hmm
# Option settings: nhmmscan -o /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmpvxchsekk/RNA_contig_sequences.fa.0_output --tblout /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmpvxchsekk/RNA_contig_sequences.fa.0_table --cut_ga --cpu 1 /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmp6aq2gbjs/Ribosomal_RNA_23S.hmm /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmpvxchsekk/RNA_contig_sequences.fa.0
# Current dir:     /Users/meren/Downloads/P215-MERGED
# Date:            Sun Nov 22 14:46:44 2020
# [ok]


/var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmpvxchsekk/RNA_contig_sequences.fa.1_table

# target name        accession  query name                        accession  hmmfrom hmm to alifrom  ali to envfrom  env to  modlen strand   E-value  score  bias  description of target
#------------------- ----------              -------------------- ---------- ------- ------- ------- ------- ------- ------- ------- ------ --------- ------ ----- ---------------------
#
# Program:         hmmscan
# Version:         3.2.1 (June 2018)
# Pipeline mode:   SCAN
# Query file:      /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmpvxchsekk/RNA_contig_sequences.fa.1
# Target file:     /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmp6aq2gbjs/Ribosomal_RNA_23S.hmm
# Option settings: nhmmscan -o /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmpvxchsekk/RNA_contig_sequences.fa.1_output --tblout /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmpvxchsekk/RNA_contig_sequences.fa.1_table --cut_ga --cpu 1 /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmp6aq2gbjs/Ribosomal_RNA_23S.hmm /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmpvxchsekk/RNA_contig_sequences.fa.1
# Current dir:     /Users/meren/Downloads/P215-MERGED
# Date:            Sun Nov 22 14:46:42 2020
# [ok]


/var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmpvxchsekk/RNA_contig_sequences.fa.2_table

# target name        accession  query name                        accession  hmmfrom hmm to alifrom  ali to envfrom  env to  modlen strand   E-value  score  bias  description of target
#------------------- ----------              -------------------- ---------- ------- ------- ------- ------- ------- ------- ------- ------ --------- ------ ----- ---------------------
23S_rRNA_bac         -          215_10M_human_filteredcontig37 -                3    1185    1619    2816    1617    2818    2906    +    2.7e-278  913.6  27.9  -
23S_rRNA_bac         -          215_10M_human_filteredcontig414 -                3    1265    1270       4    1272       1    2906    -    1.1e-258  848.4  40.1  -
23S_rRNA_bac         -          215_10M_human_filteredcontig422 -                4    2111    2097       3    2100       1    2906    -           0 1567.7  63.7  -
23S_rRNA_arc         -          215_10M_human_filteredcontig422 -               17    2204    2083       5    2101       2    2978    -    1.8e-269  884.1  61.7  -
23S_rRNA_bac         -          215_10M_human_filteredcontig504 -                5    2383    2383       3    2387       1    2906    -           0 1704.1  37.2  -
23S_rRNA_arc         -          215_10M_human_filteredcontig504 -               15    2450    2372      20    2385       1    2978    -    1.8e-301  991.2  34.9  -
23S_rRNA_bac         -          215_10M_human_filteredcontig515 -               11    2886    3365     317    3374     299    2906    -           0 2084.4 109.2  -
23S_rRNA_arc         -          215_10M_human_filteredcontig515 -               17    2963    3358     317    3375     302    2978    -           0 1208.9 107.7  -
#
# Program:         hmmscan
# Version:         3.2.1 (June 2018)
# Pipeline mode:   SCAN
# Query file:      /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmpvxchsekk/RNA_contig_sequences.fa.2
# Target file:     /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmp6aq2gbjs/Ribosomal_RNA_23S.hmm
# Option settings: nhmmscan -o /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmpvxchsekk/RNA_contig_sequences.fa.2_output --tblout /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmpvxchsekk/RNA_contig_sequences.fa.2_table --cut_ga --cpu 1 /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmp6aq2gbjs/Ribosomal_RNA_23S.hmm /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmpvxchsekk/RNA_contig_sequences.fa.2
# Current dir:     /Users/meren/Downloads/P215-MERGED
# Date:            Sun Nov 22 14:46:59 2020
# [ok]


/var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmpvxchsekk/RNA_contig_sequences.fa.3_table

# target name        accession  query name                       accession  hmmfrom hmm to alifrom  ali to envfrom  env to  modlen strand   E-value  score  bias  description of target
#------------------- ----------             -------------------- ---------- ------- ------- ------- ------- ------- ------- ------- ------ --------- ------ ----- ---------------------
#
# Program:         hmmscan
# Version:         3.2.1 (June 2018)
# Pipeline mode:   SCAN
# Query file:      /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmpvxchsekk/RNA_contig_sequences.fa.3
# Target file:     /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmp6aq2gbjs/Ribosomal_RNA_23S.hmm
# Option settings: nhmmscan -o /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmpvxchsekk/RNA_contig_sequences.fa.3_output --tblout /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmpvxchsekk/RNA_contig_sequences.fa.3_table --cut_ga --cpu 1 /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmp6aq2gbjs/Ribosomal_RNA_23S.hmm /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmpvxchsekk/RNA_contig_sequences.fa.3
# Current dir:     /Users/meren/Downloads/P215-MERGED
# Date:            Sun Nov 22 14:46:49 2020
# [ok]

So basically all hits were in one of the chunks of the input FASTA file, and probably somehow merging all these four into a single hmm.table output in the temp directory is what went wrong.

@meren meren self-assigned this Nov 22, 2020
@meren meren added this to the v7 milestone Nov 22, 2020
@meren
Copy link
Member Author

meren commented Nov 22, 2020

OK. This is all happening here:

    def append_to_main_table_file(self, merged_file_buffer, table_output_file, buffer_write_lock):
        """Append table output to the main file.

        FIXME In addition to appending, this function also pre-processes the data, which should not
        be done here. That qualifies as hmmer output parsing, and should be in
        anvio/parsers/hmmer.py.
        """

        detected_non_ascii = False
        lines_with_non_ascii = []
        clip_description_index = None
        clip_index_found = False

        with open(table_output_file, 'rb') as hmm_hits_file:
            line_counter = 0
            for line_bytes in hmm_hits_file:
                line_counter += 1
                line = line_bytes.decode('ascii', 'ignore')

                if not len(line) == len(line_bytes):
                    lines_with_non_ascii.append(line_counter)
                    detected_non_ascii = True

                if line.startswith('#'):
                    if not clip_index_found and line.find('description') != -1:
                        # This parser removes the description column from the data
                        clip_description_index = line.find('description')
                        clip_index_found = True

                    continue

                with buffer_write_lock:
                    merged_file_buffer.write('\t'.join(line[:clip_description_index].split()) + '\n')

        if detected_non_ascii:
            self.run.warning("Just a heads-up, Anvi'o HMMer parser detected non-ascii characters while processing "
                             "the file '%s' and cleared them. Here are the line numbers with non-ascii characters: %s. "
                             "You may want to check those lines with a command like \"awk 'NR==<line number>' <file path> | cat -vte\"." %
                                                 (table_output_file, ", ".join(map(str, lines_with_non_ascii))))

Only when it is running as a single thread, this function reads this line:

23S_rRNA_bac         -          215_10M_human_filteredcontig414 -                3    1265    1270       4    1272       1    2906    -    1.1e-258  848.4  40.1  -

as this:

23S_rRNA_bac	-	215_10M_human_filteredcontig414	-	3	1265	1270	4	1272	1	2906	-	1.1e-258	848.4	40.

What is wrong with this function is beyond me. It is the single-threaded mode that ruins the output. But multi-threaded mode gives the error :/

@meren
Copy link
Member Author

meren commented Nov 22, 2020

:(((

OK. There are way too many issues here. But the end-of-line discrepancies is related to this piece of code:

                if line.startswith('#'):
                    if not clip_index_found and line.find('description') != -1:
                        # This parser removes the description column from the data
                        clip_description_index = line.find('description')
                        clip_index_found = True

The lines follow this section uses clip_description_index to trim the line, but OF COURSE this assumes that the where the word 'description' is found will also match to the beginning of the description in the lines for the hits. Bad assumption.

This is where this code is cutting every following line in hit lines:

image

Admittedly, this garbage of a file format is not making anything easier.

@meren
Copy link
Member Author

meren commented Nov 22, 2020

The long story short so far;

  • Multi-threaded mode fails to cut the description line properly, hence fails. This needs fixing.
  • Single-threaded mode cuts it badly (by removing ACTUAL data from the file), but things go smoothly anyway. This needs fixing, too.
  • Why is there a difference between multi-threaded and single-threaded behavior is still not clear to me.

@meren
Copy link
Member Author

meren commented Nov 22, 2020

OK. One thing learned. This only happens as a result of a perfect storm:

image

HMMER sets the locations for each output field as a function of the first line. For one to get this STUPID error, all these must happen at the same time:

  • The FASTA file you used to generate the contigs database have variable contig name lengths,
  • You are using an HMM that runs on CONTIGS and not GENES (because genes have gene caller ids that are way too short to push things too far in the hits line, but in the case of HMMs running on contigs, contig names can extend the default space given for the query_name field in the output).
  • AND if the first contig with a hit has a short name compared to other contigs so HMMER sets the width of everything based on that and in the other lines the things are pushed out of boundaries like shown in the screenshot above.

A very sophisticated bug due to some 'fancy' code someone wrote to make things look 'better' that took hours from my life and I am not even close to a solution.

@meren
Copy link
Member Author

meren commented Nov 22, 2020

I still have no idea why there is a difference between single-threaded and multi-threaded modes. But what I know is, when you run HMMER 4 different times, it sets the distances between different columns of its output table differently. Go figure:

grep 'target' /var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmpa84j_czg/RNA_contig_sequences.fa.*table
/var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmpa84j_czg/RNA_contig_sequences.fa.0_table:# target name        accession  query name                    accession  hmmfrom hmm to alifrom  ali to envfrom  env to  modlen strand   E-value  score  bias  description of target
/var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmpa84j_czg/RNA_contig_sequences.fa.1_table:# target name        accession  query name                        accession  hmmfrom hmm to alifrom  ali to envfrom  env to  modlen strand   E-value  score  bias  description of target
/var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmpa84j_czg/RNA_contig_sequences.fa.2_table:# target name        accession  query name                        accession  hmmfrom hmm to alifrom  ali to envfrom  env to  modlen strand   E-value  score  bias  description of target
/var/folders/x5/gt4031w53fs63csv1fp0r_3w0000gn/T/tmpa84j_czg/RNA_contig_sequences.fa.3_table:# target name        accession  query name                       accession  hmmfrom hmm to alifrom  ali to envfrom  env to  modlen strand   E-value  score  bias  description of target

Mind you, three out of four of these have no hits. But in multi-threaded mode, our code finds description here, and cuts the next lines from the wrong place, failing to exclude descriptions:

image

The solution is to set 'query name' in FASTA files to something that is very short. This is the kind of crap 2020 asks us to deal with.

@meren
Copy link
Member Author

meren commented Nov 22, 2020

The solution is to set 'query name' in FASTA files to something that is very short.

To prove this, I created a file with the contig names above:

215_10M_human_filteredcontig37
215_10M_human_filteredcontig414
215_10M_human_filteredcontig422
215_10M_human_filteredcontig504
215_10M_human_filteredcontig515

Exported them:

anvi-export-contigs -c CONTIGS.db \
                    --contigs-of-interest contigs.txt \
                    -o contigs.fa

Renamed the contigs names to something shorter:

anvi-script-reformat-fasta contigs.fa --simplify-names -o contigs_better.fa

Generated a contigs db:

anvi-gen-contigs-database -f contigs_better.fa -o NEW-CONTIGS.db

Run HMMs:

anvi-run-hmms -c NEW-CONTIGS.db -I Ribosomal_RNA_23S --debug

and OF COURSE everything worked:

image

@meren
Copy link
Member Author

meren commented Nov 22, 2020

A bullet-proof solution to this is the following:

  • Add a simplify_names=False flag for utils.export_sequences_from_contigs_db and return a names conversion dictionary,
  • Find each instance in the codebase where HMMer class is inherited,
  • Upstream of that code, find export_sequences_from_contigs_db for *:CONTIG type HMMs and call the function with simplify_names=True,
  • Downstream of that code, update the results dict using the name conversion dictionary.

But the problem with this solution is that it will take computational overhead for 99% of anvi'o projects for no reason. This problem only applies to those who do not follow our advice regarding the use of anvi-script-reformat-fasta to simplify their contig names.

So I will not solve this, but add a warning to the cryptic error message so people know what might have gone wrong.

@ivagljiva
Copy link
Contributor

This is insane. :( And it reminds me that we need to stop doing 'fancy' parsing of the HMMER output in the driver and move that to the parser code. I was planning to fix that after I finish with updating the multi-threading part of the code. In general, lots of things need to be fixed with HMMER. :(

This is the kind of crap 2020 asks us to deal with.

image

@meren
Copy link
Member Author

meren commented Nov 23, 2020

Maybe in the future we should switch from work with *_ouptut files to *_table files.

But I am closing this issue now after wasting my entire day on it rather than working on our more time-critical things :/

@meren meren closed this as completed Nov 23, 2020
@ekiefl
Copy link
Contributor

ekiefl commented Nov 23, 2020

Sorry this happened to you @meren. I wrote this piece of garbage:

                if line.startswith('#'):
                    if not clip_index_found and line.find('description') != -1:
                        # This parser removes the description column from the data
                        clip_description_index = line.find('description')
                        clip_index_found = True

Apologies it has backfired. Adding this was a necessary evil, as the loop used to look like this:

            line_counter = 0
            for line_bytes in hmm_hits_file:
                line_counter += 1
                line = line_bytes.decode('ascii', 'ignore')

                if not len(line) == len(line_bytes):
                    lines_with_non_ascii.append(line_counter)
                    detected_non_ascii = True

                if line.startswith('#'):
                    continue

                with buffer_write_lock:
                    merged_file_buffer.write('\t'.join(line.split()[0:18]) + '\n')

The hard-coded 18 worked for us for years, but only by miracle. IIRC 18 was the number of columns for amino acids, but for nucleotides the number is something like 15 or 16. Yet amazingly, python does not register this an error:

>>> x = [0,1,2]
>>> x[:100000]
[0,1,2]

So no problem. Yet under some circumstances, the number of important fields exceeds 18, which I discovered when using HMMER for integrating Interacdome into anvi'o (#1472). So I implemented what we are now currently dealing with. Of course I did not expect HMMER would misformat its own output file.

And it reminds me that we need to stop doing 'fancy' parsing of the HMMER output in the driver and move that to the parser code.

I agree. It should be moved to parser modules and parsed with pandas.

@meren
Copy link
Member Author

meren commented Nov 23, 2020

Apologies it has backfired. Adding this was a necessary evil

No worries at all, @ekiefl. It wasn't your fault as you couldn't foresee that the HMMER output was not as consistent as it seemed. I realized that it works perfectly in most cases but when we use contig names as target, then those contigs databases with long and variable length contig names fail for HMMs that do not target AA sequences.

If we ever want to change this, I think we shouldn't bother with the table output, but work directly with the raw search output that is full of information that seemed easier to parse.

The hard-coded 18 worked for us for years, but only by miracle

Error tolerance in programming languages is the absolute path to hell paved with good intentions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants