6 Motif analysis does not work (bug) #40

Joseph-Vineland · 2021-07-15T18:30:10Z

Hello, Step 6 Motif finding does not work, even with the provided example data and commands.

Motif analyis converts patterns detected by DNABERT to actionable biological insights. I really hope this bug can be fixed, becasue DNABERT is working out great so far, I don’t want to be stopped at the crucial last step. @Zhihan1996

The problem has also been identified as a bug here: #2

(Fixing the issue with the header lines only results in other error messages, and fixing those errors causes even more confusing error messages.)

python find_motifs.py \
    --data_dir $DATA_PATH \
    --predict_dir $PREDICTION_PATH \
    --window_size 24 \
    --min_len 5 \
    --pval_cutoff 0.005 \
    --min_n_motif 3 \
    --align_all_ties \
    --save_file_dir $MOTIF_PATH \
    --verbose

 File "find_motifs.py", line 110, in <module>
    main()
  File "find_motifs.py", line 86, in main
    dev = pd.read_csv(os.path.join(args.data_dir,"dev.tsv"),sep='\t',header=None)
  File "/home/joneill/anaconda3/envs/dnabert/lib/python3.6/site-packages/pandas/io/parsers.py", line 688, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/joneill/anaconda3/envs/dnabert/lib/python3.6/site-packages/pandas/io/parsers.py", line 460, in _read
    data = parser.read(nrows)
  File "/home/joneill/anaconda3/envs/dnabert/lib/python3.6/site-packages/pandas/io/parsers.py", line 1198, in read
    ret = self._engine.read(nrows)
  File "/home/joneill/anaconda3/envs/dnabert/lib/python3.6/site-packages/pandas/io/parsers.py", line 2157, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 862, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 918, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 905, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2042, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 2, saw 2

The text was updated successfully, but these errors were encountered:

Joseph-Vineland · 2021-07-15T19:24:39Z

I modified the script 'find_motifs.py', and now it runs without error. See below. I also edited the example dev.tsv file so the header is tab-seperated not space-seperated.

However, no motifs are output for the provided example data/commands.

import os
import pandas as pd
import numpy as np
import argparse
import motif_utils as utils


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--data_dir",
        default=None,
        type=str,
        required=True,
        help="The input data dir. Should contain the sequence+label .tsv files (or other data files) for the task.",
    )

    parser.add_argument(
        "--predict_dir",
        default=None,
        type=str,
        required=True,
        help="Path where the attention scores were saved. Should contain both pred_results.npy and atten.npy",
    )

    parser.add_argument(
        "--window_size",
        default=24,
        type=int,
        help="Specified window size to be final motif length",
    )

    parser.add_argument(
        "--min_len",
        default=5,
        type=int,
        help="Specified minimum length threshold for contiguous region",
    )

    parser.add_argument(
        "--pval_cutoff",
        default=0.005,
        type=float,
        help="Cutoff FDR/p-value to declare statistical significance",
    )

    parser.add_argument(
        "--min_n_motif",
        default=3,
        type=int,
        help="Minimum instance inside motif to be filtered",
    )

    parser.add_argument(
        "--align_all_ties",
        action='store_true',
        help="Whether to keep all best alignments when ties encountered",
    )

    parser.add_argument(
        "--save_file_dir",
        default='.',
        type=str,
        help="Path to save outputs",
    )

    parser.add_argument(
        "--verbose",
        action='store_true',
        help="Verbosity controller",
    )

    parser.add_argument(
        "--return_idx",
        action='store_true',
        help="Whether the indices of the motifs are only returned",
    )

    # TODO: add the conditions
    args = parser.parse_args()

    atten_scores = np.load(os.path.join(args.predict_dir,"atten.npy"))
    pred = np.load(os.path.join(args.predict_dir,"pred_results.npy"))
    dev = pd.read_csv(os.path.join(args.data_dir,"dev.tsv"),sep='\t',header=0)
    #dev.columns = ['sequence','label']
    #dev['seq'] = dev['sequence']
    dev['sequence'] = dev['sequence'].apply(utils.kmer2seq)
    dev_pos = dev[dev['label'] == 1]
    dev_neg = dev[dev['label'] == 0]
    #print(dev_pos)
    #print(dev_neg)
    pos_atten_scores = atten_scores[dev_pos.index.values]
    neg_atten_scores = atten_scores[dev_neg.index.values]
    assert len(dev_pos) == len(pos_atten_scores)

    #print(dev_pos['sequence'])


    # run motif analysis
    merged_motif_seqs = utils.motif_analysis(dev_pos['sequence'],
                                        dev_neg['sequence'],
                                        pos_atten_scores,
                                        window_size = args.window_size,
                                        min_len = args.min_len,
                                        pval_cutoff = args.pval_cutoff,
                                        min_n_motif = args.min_n_motif,
                                        align_all_ties = args.align_all_ties,
                                        save_file_dir = args.save_file_dir,
                                        verbose = args.verbose
                                        #return_idx  = args.return_idx
                                    )

if __name__ == "__main__":
    main()

Joseph-Vineland · 2021-07-16T15:57:37Z

FYI - After editing the script as per above,the motif finding procedure worked on our data and identified motifs. GREAT! Thank you very much for the awesome tool! However, if I set the --max_seq_length to 56 instead of 64 during the (4) prediction and (5) visualization steps, then the following eror is thrown:

python find_motifs.py \
>     --data_dir $DATA_PATH \
>     --predict_dir $PREDICTION_PATH \
>     --window_size 24 \
>     --min_len 5 \
>     --pval_cutoff 0.05 \
>     --min_n_motif 3 \
>     --align_all_ties \
>     --save_file_dir $MOTIF_PATH \
>     --verbose
*** Begin motif analysis ***
* pos_seqs: 17938; neg_seqs: 18061
* Finding high attention motif regions
* Filtering motifs by hypergeometric test
Traceback (most recent call last):
  File "find_motifs.py", line 116, in <module>
    main()
  File "find_motifs.py", line 111, in main
    verbose = args.verbose
  File "/home/joneill/DNABERT/motif/motif_utils.py", line 514, in motif_analysis
    **kwargs)
  File "/home/joneill/DNABERT/motif/motif_utils.py", line 233, in filter_motifs
    pvals = motifs_hypergeom_test(pos_seqs, neg_seqs, motifs, **kwargs)
  File "/home/joneill/DNABERT/motif/motif_utils.py", line 196, in motifs_hypergeom_test
    motif_count_all = count_motif_instances(pos_seqs+neg_seqs, motifs, allow_multi_match=allow_multi_match)
  File "/home/joneill/DNABERT/motif/motif_utils.py", line 152, in count_motif_instances
    matches = sorted(map(itemgetter(1), A.iter(seq)))
AttributeError: Not an Aho-Corasick automaton yet: call add_word to add some keys and call make_automaton to convert the trie to an automaton.

lhy0322 · 2021-08-09T05:13:47Z

我修改了脚本'find_motifs.py'，现在它运行没有错误。见下文。我还编辑了示例 dev.tsv 文件，因此标题是制表符分隔而不是空格分隔的。

但是，提供的示例数据/命令不会输出任何主题。

import os
import pandas as pd
import numpy as np
import argparse
import motif_utils as utils


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--data_dir",
        default=None,
        type=str,
        required=True,
        help="The input data dir. Should contain the sequence+label .tsv files (or other data files) for the task.",
    )

    parser.add_argument(
        "--predict_dir",
        default=None,
        type=str,
        required=True,
        help="Path where the attention scores were saved. Should contain both pred_results.npy and atten.npy",
    )

    parser.add_argument(
        "--window_size",
        default=24,
        type=int,
        help="Specified window size to be final motif length",
    )

    parser.add_argument(
        "--min_len",
        default=5,
        type=int,
        help="Specified minimum length threshold for contiguous region",
    )

    parser.add_argument(
        "--pval_cutoff",
        default=0.005,
        type=float,
        help="Cutoff FDR/p-value to declare statistical significance",
    )

    parser.add_argument(
        "--min_n_motif",
        default=3,
        type=int,
        help="Minimum instance inside motif to be filtered",
    )

    parser.add_argument(
        "--align_all_ties",
        action='store_true',
        help="Whether to keep all best alignments when ties encountered",
    )

    parser.add_argument(
        "--save_file_dir",
        default='.',
        type=str,
        help="Path to save outputs",
    )

    parser.add_argument(
        "--verbose",
        action='store_true',
        help="Verbosity controller",
    )

    parser.add_argument(
        "--return_idx",
        action='store_true',
        help="Whether the indices of the motifs are only returned",
    )

    # TODO: add the conditions
    args = parser.parse_args()

    atten_scores = np.load(os.path.join(args.predict_dir,"atten.npy"))
    pred = np.load(os.path.join(args.predict_dir,"pred_results.npy"))
    dev = pd.read_csv(os.path.join(args.data_dir,"dev.tsv"),sep='\t',header=0)
    #dev.columns = ['sequence','label']
    #dev['seq'] = dev['sequence']
    dev['sequence'] = dev['sequence'].apply(utils.kmer2seq)
    dev_pos = dev[dev['label'] == 1]
    dev_neg = dev[dev['label'] == 0]
    #print(dev_pos)
    #print(dev_neg)
    pos_atten_scores = atten_scores[dev_pos.index.values]
    neg_atten_scores = atten_scores[dev_neg.index.values]
    assert len(dev_pos) == len(pos_atten_scores)

    #print(dev_pos['sequence'])


    # run motif analysis
    merged_motif_seqs = utils.motif_analysis(dev_pos['sequence'],
                                        dev_neg['sequence'],
                                        pos_atten_scores,
                                        window_size = args.window_size,
                                        min_len = args.min_len,
                                        pval_cutoff = args.pval_cutoff,
                                        min_n_motif = args.min_n_motif,
                                        align_all_ties = args.align_all_ties,
                                        save_file_dir = args.save_file_dir,
                                        verbose = args.verbose
                                        #return_idx  = args.return_idx
                                    )

if __name__ == "__main__":
    main()

Hi, After I modified it according to your code.
`
*** Begin motif analysis ***

pos_seqs: 9674; neg_seqs: 9709
Finding high attention motif regions
Filtering motifs by hypergeometric test
motif CACCT: N=19383; K=9674; n=3316; x=1956; p=7.539397922768265e-31
Merging similar motif instances
Making fixed_length window = 24
Removing motifs with less than 3 instances
Saving outputs to directory
`
but no files were saved

Joseph-Vineland · 2021-08-09T14:22:41Z

@lhy0322 For me, there were also no files saved when using the provided example data.

When I used my own data, motif files were saved.

Joseph-Vineland · 2021-08-15T01:19:11Z

If you receive the error:

AttributeError: Not an Aho-Corasick automaton yet: call add_word to add some keys and call make_automaton to convert the trie to an automaton.

I think it means: "No Motifs found , no high-attention regions" for your data.

jerryji1993 · 2021-08-26T15:36:46Z

Hi @Joseph-Vineland and @lhy0322,

Sorry about the bugs and delay in my response and thank you very much for reporting. We have uploaded new test data and performed general bug fixes on different modules of this framework. Please try again with the updated code and let me know if any bug still happens.

Thanks,
Jerry

jerryji1993 · 2021-08-26T15:37:06Z

Going to close this one for now. Feel free to reopen if there's any additional questions.

jerryji1993 closed this as completed Aug 26, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

6 Motif analysis does not work (bug) #40

6 Motif analysis does not work (bug) #40

Joseph-Vineland commented Jul 15, 2021 •

edited

Joseph-Vineland commented Jul 15, 2021 •

edited

Joseph-Vineland commented Jul 16, 2021 •

edited

lhy0322 commented Aug 9, 2021

Joseph-Vineland commented Aug 9, 2021

Joseph-Vineland commented Aug 15, 2021 •

edited

jerryji1993 commented Aug 26, 2021

jerryji1993 commented Aug 26, 2021

6 Motif analysis does not work (bug) #40

6 Motif analysis does not work (bug) #40

Comments

Joseph-Vineland commented Jul 15, 2021 • edited

Joseph-Vineland commented Jul 15, 2021 • edited

Joseph-Vineland commented Jul 16, 2021 • edited

lhy0322 commented Aug 9, 2021

Joseph-Vineland commented Aug 9, 2021

Joseph-Vineland commented Aug 15, 2021 • edited

jerryji1993 commented Aug 26, 2021

jerryji1993 commented Aug 26, 2021

Joseph-Vineland commented Jul 15, 2021 •

edited

Joseph-Vineland commented Jul 15, 2021 •

edited

Joseph-Vineland commented Jul 16, 2021 •

edited

Joseph-Vineland commented Aug 15, 2021 •

edited