Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

6 Motif analysis does not work (bug) #40

Closed
Joseph-Vineland opened this issue Jul 15, 2021 · 7 comments
Closed

6 Motif analysis does not work (bug) #40

Joseph-Vineland opened this issue Jul 15, 2021 · 7 comments

Comments

@Joseph-Vineland
Copy link

Joseph-Vineland commented Jul 15, 2021

Hello, Step 6 Motif finding does not work, even with the provided example data and commands.

Motif analyis converts patterns detected by DNABERT to actionable biological insights. I really hope this bug can be fixed, becasue DNABERT is working out great so far, I don’t want to be stopped at the crucial last step. @Zhihan1996

The problem has also been identified as a bug here: #2

(Fixing the issue with the header lines only results in other error messages, and fixing those errors causes even more confusing error messages.)

python find_motifs.py \
    --data_dir $DATA_PATH \
    --predict_dir $PREDICTION_PATH \
    --window_size 24 \
    --min_len 5 \
    --pval_cutoff 0.005 \
    --min_n_motif 3 \
    --align_all_ties \
    --save_file_dir $MOTIF_PATH \
    --verbose
 File "find_motifs.py", line 110, in <module>
    main()
  File "find_motifs.py", line 86, in main
    dev = pd.read_csv(os.path.join(args.data_dir,"dev.tsv"),sep='\t',header=None)
  File "/home/joneill/anaconda3/envs/dnabert/lib/python3.6/site-packages/pandas/io/parsers.py", line 688, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/joneill/anaconda3/envs/dnabert/lib/python3.6/site-packages/pandas/io/parsers.py", line 460, in _read
    data = parser.read(nrows)
  File "/home/joneill/anaconda3/envs/dnabert/lib/python3.6/site-packages/pandas/io/parsers.py", line 1198, in read
    ret = self._engine.read(nrows)
  File "/home/joneill/anaconda3/envs/dnabert/lib/python3.6/site-packages/pandas/io/parsers.py", line 2157, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 862, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 918, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 905, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2042, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 2, saw 2
@Joseph-Vineland
Copy link
Author

Joseph-Vineland commented Jul 15, 2021

I modified the script 'find_motifs.py', and now it runs without error. See below. I also edited the example dev.tsv file so the header is tab-seperated not space-seperated.

However, no motifs are output for the provided example data/commands.

import os
import pandas as pd
import numpy as np
import argparse
import motif_utils as utils


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--data_dir",
        default=None,
        type=str,
        required=True,
        help="The input data dir. Should contain the sequence+label .tsv files (or other data files) for the task.",
    )

    parser.add_argument(
        "--predict_dir",
        default=None,
        type=str,
        required=True,
        help="Path where the attention scores were saved. Should contain both pred_results.npy and atten.npy",
    )

    parser.add_argument(
        "--window_size",
        default=24,
        type=int,
        help="Specified window size to be final motif length",
    )

    parser.add_argument(
        "--min_len",
        default=5,
        type=int,
        help="Specified minimum length threshold for contiguous region",
    )

    parser.add_argument(
        "--pval_cutoff",
        default=0.005,
        type=float,
        help="Cutoff FDR/p-value to declare statistical significance",
    )

    parser.add_argument(
        "--min_n_motif",
        default=3,
        type=int,
        help="Minimum instance inside motif to be filtered",
    )

    parser.add_argument(
        "--align_all_ties",
        action='store_true',
        help="Whether to keep all best alignments when ties encountered",
    )

    parser.add_argument(
        "--save_file_dir",
        default='.',
        type=str,
        help="Path to save outputs",
    )

    parser.add_argument(
        "--verbose",
        action='store_true',
        help="Verbosity controller",
    )

    parser.add_argument(
        "--return_idx",
        action='store_true',
        help="Whether the indices of the motifs are only returned",
    )

    # TODO: add the conditions
    args = parser.parse_args()

    atten_scores = np.load(os.path.join(args.predict_dir,"atten.npy"))
    pred = np.load(os.path.join(args.predict_dir,"pred_results.npy"))
    dev = pd.read_csv(os.path.join(args.data_dir,"dev.tsv"),sep='\t',header=0)
    #dev.columns = ['sequence','label']
    #dev['seq'] = dev['sequence']
    dev['sequence'] = dev['sequence'].apply(utils.kmer2seq)
    dev_pos = dev[dev['label'] == 1]
    dev_neg = dev[dev['label'] == 0]
    #print(dev_pos)
    #print(dev_neg)
    pos_atten_scores = atten_scores[dev_pos.index.values]
    neg_atten_scores = atten_scores[dev_neg.index.values]
    assert len(dev_pos) == len(pos_atten_scores)

    #print(dev_pos['sequence'])


    # run motif analysis
    merged_motif_seqs = utils.motif_analysis(dev_pos['sequence'],
                                        dev_neg['sequence'],
                                        pos_atten_scores,
                                        window_size = args.window_size,
                                        min_len = args.min_len,
                                        pval_cutoff = args.pval_cutoff,
                                        min_n_motif = args.min_n_motif,
                                        align_all_ties = args.align_all_ties,
                                        save_file_dir = args.save_file_dir,
                                        verbose = args.verbose
                                        #return_idx  = args.return_idx
                                    )

if __name__ == "__main__":
    main()

@Joseph-Vineland
Copy link
Author

Joseph-Vineland commented Jul 16, 2021

FYI - After editing the script as per above,the motif finding procedure worked on our data and identified motifs. GREAT! Thank you very much for the awesome tool! However, if I set the --max_seq_length to 56 instead of 64 during the (4) prediction and (5) visualization steps, then the following eror is thrown:

python find_motifs.py \
>     --data_dir $DATA_PATH \
>     --predict_dir $PREDICTION_PATH \
>     --window_size 24 \
>     --min_len 5 \
>     --pval_cutoff 0.05 \
>     --min_n_motif 3 \
>     --align_all_ties \
>     --save_file_dir $MOTIF_PATH \
>     --verbose
*** Begin motif analysis ***
* pos_seqs: 17938; neg_seqs: 18061
* Finding high attention motif regions
* Filtering motifs by hypergeometric test
Traceback (most recent call last):
  File "find_motifs.py", line 116, in <module>
    main()
  File "find_motifs.py", line 111, in main
    verbose = args.verbose
  File "/home/joneill/DNABERT/motif/motif_utils.py", line 514, in motif_analysis
    **kwargs)
  File "/home/joneill/DNABERT/motif/motif_utils.py", line 233, in filter_motifs
    pvals = motifs_hypergeom_test(pos_seqs, neg_seqs, motifs, **kwargs)
  File "/home/joneill/DNABERT/motif/motif_utils.py", line 196, in motifs_hypergeom_test
    motif_count_all = count_motif_instances(pos_seqs+neg_seqs, motifs, allow_multi_match=allow_multi_match)
  File "/home/joneill/DNABERT/motif/motif_utils.py", line 152, in count_motif_instances
    matches = sorted(map(itemgetter(1), A.iter(seq)))
AttributeError: Not an Aho-Corasick automaton yet: call add_word to add some keys and call make_automaton to convert the trie to an automaton.

@lhy0322
Copy link

lhy0322 commented Aug 9, 2021

我修改了脚本'find_motifs.py',现在它运行没有错误。见下文。我还编辑了示例 dev.tsv 文件,因此标题是制表符分隔而不是空格分隔的。

但是,提供的示例数据/命令不会输出任何主题。

import os
import pandas as pd
import numpy as np
import argparse
import motif_utils as utils


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--data_dir",
        default=None,
        type=str,
        required=True,
        help="The input data dir. Should contain the sequence+label .tsv files (or other data files) for the task.",
    )

    parser.add_argument(
        "--predict_dir",
        default=None,
        type=str,
        required=True,
        help="Path where the attention scores were saved. Should contain both pred_results.npy and atten.npy",
    )

    parser.add_argument(
        "--window_size",
        default=24,
        type=int,
        help="Specified window size to be final motif length",
    )

    parser.add_argument(
        "--min_len",
        default=5,
        type=int,
        help="Specified minimum length threshold for contiguous region",
    )

    parser.add_argument(
        "--pval_cutoff",
        default=0.005,
        type=float,
        help="Cutoff FDR/p-value to declare statistical significance",
    )

    parser.add_argument(
        "--min_n_motif",
        default=3,
        type=int,
        help="Minimum instance inside motif to be filtered",
    )

    parser.add_argument(
        "--align_all_ties",
        action='store_true',
        help="Whether to keep all best alignments when ties encountered",
    )

    parser.add_argument(
        "--save_file_dir",
        default='.',
        type=str,
        help="Path to save outputs",
    )

    parser.add_argument(
        "--verbose",
        action='store_true',
        help="Verbosity controller",
    )

    parser.add_argument(
        "--return_idx",
        action='store_true',
        help="Whether the indices of the motifs are only returned",
    )

    # TODO: add the conditions
    args = parser.parse_args()

    atten_scores = np.load(os.path.join(args.predict_dir,"atten.npy"))
    pred = np.load(os.path.join(args.predict_dir,"pred_results.npy"))
    dev = pd.read_csv(os.path.join(args.data_dir,"dev.tsv"),sep='\t',header=0)
    #dev.columns = ['sequence','label']
    #dev['seq'] = dev['sequence']
    dev['sequence'] = dev['sequence'].apply(utils.kmer2seq)
    dev_pos = dev[dev['label'] == 1]
    dev_neg = dev[dev['label'] == 0]
    #print(dev_pos)
    #print(dev_neg)
    pos_atten_scores = atten_scores[dev_pos.index.values]
    neg_atten_scores = atten_scores[dev_neg.index.values]
    assert len(dev_pos) == len(pos_atten_scores)

    #print(dev_pos['sequence'])


    # run motif analysis
    merged_motif_seqs = utils.motif_analysis(dev_pos['sequence'],
                                        dev_neg['sequence'],
                                        pos_atten_scores,
                                        window_size = args.window_size,
                                        min_len = args.min_len,
                                        pval_cutoff = args.pval_cutoff,
                                        min_n_motif = args.min_n_motif,
                                        align_all_ties = args.align_all_ties,
                                        save_file_dir = args.save_file_dir,
                                        verbose = args.verbose
                                        #return_idx  = args.return_idx
                                    )

if __name__ == "__main__":
    main()

Hi, After I modified it according to your code.
`
*** Begin motif analysis ***

pos_seqs: 9674; neg_seqs: 9709
Finding high attention motif regions
Filtering motifs by hypergeometric test
motif CACCT: N=19383; K=9674; n=3316; x=1956; p=7.539397922768265e-31
Merging similar motif instances
Making fixed_length window = 24
Removing motifs with less than 3 instances
Saving outputs to directory
`
but no files were saved

@Joseph-Vineland
Copy link
Author

@lhy0322 For me, there were also no files saved when using the provided example data.

When I used my own data, motif files were saved.

@Joseph-Vineland
Copy link
Author

Joseph-Vineland commented Aug 15, 2021

If you receive the error:

AttributeError: Not an Aho-Corasick automaton yet: call add_word to add some keys and call make_automaton to convert the trie to an automaton.

I think it means: "No Motifs found , no high-attention regions" for your data.

@jerryji1993
Copy link
Owner

Hi @Joseph-Vineland and @lhy0322,

Sorry about the bugs and delay in my response and thank you very much for reporting. We have uploaded new test data and performed general bug fixes on different modules of this framework. Please try again with the updated code and let me know if any bug still happens.

Thanks,
Jerry

@jerryji1993
Copy link
Owner

Going to close this one for now. Feel free to reopen if there's any additional questions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants