Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BINDetect not reading motif files correctly #78

Closed
connorrogerson opened this issue Jun 24, 2021 · 4 comments
Closed

BINDetect not reading motif files correctly #78

connorrogerson opened this issue Jun 24, 2021 · 4 comments
Labels
bug Something isn't working

Comments

@connorrogerson
Copy link

connorrogerson commented Jun 24, 2021

Hi team,

I'm recently re-running some analyses and BINDetect is not reading the motif file correctly.

My original run was using the JASPAR2020 motif database. If we head the file:

MA0006.1_Ahr::Arnt
0.1250 0.3333 0.0834 0.4583
0.0000 0.0000 0.9582 0.0417
0.0000 0.9582 0.0000 0.0417
0.0000 0.0000 0.9582 0.0417
0.0000 0.0000 0.0000 0.9999
0.0000 0.0000 0.9999 0.0000
MA0854.1_Alx1
0.1900 0.4400 0.1700 0.2000
0.0700 0.2100 0.6500 0.0700
0.3800 0.3300 0.1400 0.1500
0.4300 0.0900 0.3100 0.1700
0.0500 0.4000 0.0200 0.5300
0.0100 0.0700 0.0000 0.9200
0.9899 0.0000 0.0101 0.0000
0.9800 0.0000 0.0100 0.0100
0.0100 0.0100 0.0000 0.9800
0.0000 0.0101 0.0000 0.9899
0.9200 0.0000 0.0700 0.0100
0.5300 0.0200 0.4000 0.0500
0.1500 0.2100 0.1100 0.5300
0.2300 0.3200 0.1600 0.2900
0.3939 0.3131 0.1212 0.1717
0.2400 0.2900 0.2300 0.2400
0.1500 0.4000 0.1400 0.3100

However, I've been trying to get TOBIAS to read the motif database that comes with Gimmemotifs (https://github.com/vanheeringen-lab/gimmemotifs), just for consistency. If we head the file:

GM.5.0.Sox.0001
0.7213 0.0793 0.1103 0.0891
0.9259 0.0072 0.0062 0.0607
0.0048 0.9203 0.0077 0.0672
0.9859 0.0030 0.0030 0.0081
0.9778 0.0043 0.0128 0.0051
0.1484 0.0050 0.0168 0.8299
GM.5.0.Homeodomain.0001
0.8870 0.0000 0.0178 0.0951
0.1156 0.2033 0.6629 0.0181
0.0017 0.7452 0.0809 0.1722
0.0011 0.0003 0.0003 0.9983
0.0026 0.0141 0.9721 0.0111
0.0000 0.0189 0.0054 0.9758
0.0006 0.9983 0.0006 0.0006
0.9170 0.0140 0.0046 0.0644
0.2228 0.2421 0.3300 0.2051
0.3621 0.1054 0.2208 0.3116
0.5727 0.0104 0.1741 0.2428
#GM.5.0.Mixed.0001 EGR1_disc6 EGR1_GM12878_encode-Myers_seq_hsa_v041610.1_r1:AlignACE#2#Intergenic;;SRF_disc2 SRF_HepG2_encode-Myers_seq_hsa_v041610.1_r1:AlignACE#3#Intergenic
GM.5.0.Mixed.0001
0.0061 0.3310 0.6435 0.0195
0.1550 0.3038 0.3952 0.1460
0.1556 0.2468 0.4775 0.1201
0.1152 0.2145 0.6121 0.0581
0.0050 0.3077 0.6282 0.0591
0.0027 0.4947 0.4958 0.0069
0.0994 0.4565 0.3690 0.0751
0.1759 0.3192 0.4488 0.0561
0.1084 0.2560 0.5840 0.0516
0.0031 0.1431 0.8534 0.0004
0.0168 0.5105 0.4666 0.0061

When I try and run this, I get the errror:

# Command line call: TOBIAS BINDetect --motifs /rds/user/cjr78/hpc-work/Motifs/gimme.vertebrate.v5.0.pfm --signals /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_7_8_9_corrected.bw /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_1_2_3_corrected.bw --genome /rds/user/cjr78/hpc-work/References/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa --peaks /rds/user/cjr78/hpc-work/ATAC/DEseq2/Alluvial_open_activated.bed --outdir /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/ --cores 4

# ----- Input parameters -----
# signals:	['/rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_7_8_9_corrected.bw', '/rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_1_2_3_corrected.bw']
# peaks:	/rds/user/cjr78/hpc-work/ATAC/DEseq2/Alluvial_open_activated.bed
# motifs:	['/rds/user/cjr78/hpc-work/Motifs/gimme.vertebrate.v5.0.pfm']
# genome:	/rds/user/cjr78/hpc-work/References/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa
# cond_names:	['merged_7_8_9_corrected', 'merged_1_2_3_corrected']
# peak_header:	None
# naming:	name_id
# motif_pvalue:	0.0001
# bound_pvalue:	0.001
# pseudo:	None
# time_series:	False
# skip_excel:	False
# output_peaks:	None
# prefix:	bindetect
# outdir:	/rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated
# cores:	4
# split:	100
# verbosity:	3


# ----- Output files -----
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/beds/*_merged_7_8_9_corrected_bound.bed
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/beds/*_merged_7_8_9_corrected_unbound.bed
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/beds/*_merged_1_2_3_corrected_bound.bed
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/beds/*_merged_1_2_3_corrected_unbound.bed
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/beds/*_all.bed
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/plots/*_log2fcs.pdf
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/*_overview.txt
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/*_overview.xlsx
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/bindetect_distances.txt
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/bindetect_results.txt
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/bindetect_results.xlsx
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/bindetect_figures.pdf


2021-06-24 10:12:53 (76618) [INFO]	----- Processing input data -----
2021-06-24 10:12:53 (76618) [INFO]	Checking reading/writing of files
2021-06-24 10:12:53 (76618) [INFO]	Reading peaks
2021-06-24 10:12:54 (76618) [INFO]	- Found 1213 regions in input peaks
2021-06-24 10:12:54 (76618) [INFO]	- Merged to 1213 regions
2021-06-24 10:12:54 (76618) [INFO]	Checking for match between --peaks and --fasta/--signals boundaries
2021-06-24 10:12:54 (76618) [INFO]	- Comparing peaks to /rds/user/cjr78/hpc-work/References/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa
2021-06-24 10:12:54 (76618) [INFO]	- Comparing peaks to /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_7_8_9_corrected.bw
2021-06-24 10:12:54 (76618) [INFO]	- Comparing peaks to /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_1_2_3_corrected.bw
2021-06-24 10:12:54 (76618) [INFO]	Estimating GC content from peak sequences
2021-06-24 10:12:57 (76618) [INFO]	- GC content estimated at 51.22%
2021-06-24 10:12:57 (76618) [INFO]	Reading motifs from file
Traceback (most recent call last):
  File "/home/cjr78/miniconda3/envs/seq/bin/TOBIAS", line 33, in <module>
    sys.exit(load_entry_point('tobias==0.12.10', 'console_scripts', 'TOBIAS')())
  File "/home/cjr78/miniconda3/envs/seq/lib/python3.7/site-packages/tobias/TOBIAS.py", line 154, in main
    args.func(args)		
  File "/home/cjr78/miniconda3/envs/seq/lib/python3.7/site-packages/tobias/tools/bindetect.py", line 223, in run_bindetect
    motif_list += MotifList().from_file(f)  #List of OneMotif objects
  File "/home/cjr78/miniconda3/envs/seq/lib/python3.7/site-packages/tobias/utils/motifs.py", line 241, in from_file
    for m in motifs.parse(f, file_format):
  File "/home/cjr78/miniconda3/envs/seq/lib/python3.7/site-packages/Bio/motifs/__init__.py", line 116, in parse
    return jaspar.read(handle, fmt)
  File "/home/cjr78/miniconda3/envs/seq/lib/python3.7/site-packages/Bio/motifs/jaspar/__init__.py", line 164, in read
    record = _read_jaspar(handle)
  File "/home/cjr78/miniconda3/envs/seq/lib/python3.7/site-packages/Bio/motifs/jaspar/__init__.py", line 311, in _read_jaspar
    counts[nucleotides[row_count]] = [float(x) for x in words]
  File "/home/cjr78/miniconda3/envs/seq/lib/python3.7/site-packages/Bio/motifs/jaspar/__init__.py", line 311, in <listcomp>
    counts[nucleotides[row_count]] = [float(x) for x in words]
ValueError: could not convert string to float: '#GM.5.0.Mixed.0001'

I edited the motif file to remove # items. Head of the edited motif file:

GM.5.0.Sox.0001
0.7213 0.0793 0.1103 0.0891
0.9259 0.0072 0.0062 0.0607
0.0048 0.9203 0.0077 0.0672
0.9859 0.0030 0.0030 0.0081
0.9778 0.0043 0.0128 0.0051
0.1484 0.0050 0.0168 0.8299
GM.5.0.Homeodomain.0001
0.8870 0.0000 0.0178 0.0951
0.1156 0.2033 0.6629 0.0181
0.0017 0.7452 0.0809 0.1722
0.0011 0.0003 0.0003 0.9983
0.0026 0.0141 0.9721 0.0111
0.0000 0.0189 0.0054 0.9758
0.0006 0.9983 0.0006 0.0006
0.9170 0.0140 0.0046 0.0644
0.2228 0.2421 0.3300 0.2051
0.3621 0.1054 0.2208 0.3116
0.5727 0.0104 0.1741 0.2428
GM.5.0.Mixed.0001
0.0061 0.3310 0.6435 0.0195
0.1550 0.3038 0.3952 0.1460
0.1556 0.2468 0.4775 0.1201
0.1152 0.2145 0.6121 0.0581
0.0050 0.3077 0.6282 0.0591
0.0027 0.4947 0.4958 0.0069
0.0994 0.4565 0.3690 0.0751
0.1759 0.3192 0.4488 0.0561
0.1084 0.2560 0.5840 0.0516
0.0031 0.1431 0.8534 0.0004
0.0168 0.5105 0.4666 0.0061

However when I try and run BINDetect, I get this error:

# Command line call: TOBIAS BINDetect --motifs /rds/user/cjr78/hpc-work/Motifs/gimme.vertebrate.v5.0.tobias.pfm --signals /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_7_8_9_corrected.bw /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_1_2_3_corrected.bw --genome /rds/user/cjr78/hpc-work/References/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa --peaks /rds/user/cjr78/hpc-work/ATAC/DEseq2/Alluvial_open_activated.bed --outdir /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/ --cores 4

# ----- Input parameters -----
# signals:      ['/rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_7_8_9_corrected.bw', '/rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_1_2_3_corrected.bw']
# peaks:        /rds/user/cjr78/hpc-work/ATAC/DEseq2/Alluvial_open_activated.bed
# motifs:       ['/rds/user/cjr78/hpc-work/Motifs/gimme.vertebrate.v5.0.tobias.pfm']
# genome:       /rds/user/cjr78/hpc-work/References/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa
# cond_names:   ['merged_7_8_9_corrected', 'merged_1_2_3_corrected']
# peak_header:  None
# naming:       name_id
# motif_pvalue: 0.0001
# bound_pvalue: 0.001
# pseudo:       None
# time_series:  False
# skip_excel:   False
# output_peaks: None
# prefix:       bindetect
# outdir:       /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated
# cores:        4
# split:        100
# verbosity:    3


# ----- Output files -----
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/beds/*_merged_7_8_9_corrected_bound.bed
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/beds/*_merged_7_8_9_corrected_unbound.bed
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/beds/*_merged_1_2_3_corrected_bound.bed
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/beds/*_merged_1_2_3_corrected_unbound.bed
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/beds/*_all.bed
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/plots/*_log2fcs.pdf
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/*_overview.txt
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/*_overview.xlsx
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/bindetect_distances.txt
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/bindetect_results.txt
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/bindetect_results.xlsx
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/bindetect_figures.pdf


2021-06-24 11:29:49 (171212) [INFO]     ----- Processing input data -----
2021-06-24 11:29:49 (171212) [INFO]     Checking reading/writing of files
2021-06-24 11:29:49 (171212) [INFO]     Reading peaks
2021-06-24 11:29:49 (171212) [INFO]     - Found 1213 regions in input peaks
2021-06-24 11:29:49 (171212) [INFO]     - Merged to 1213 regions
2021-06-24 11:29:49 (171212) [INFO]     Checking for match between --peaks and --fasta/--signals boundaries
2021-06-24 11:29:49 (171212) [INFO]     - Comparing peaks to /rds/user/cjr78/hpc-work/References/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa
2021-06-24 11:29:49 (171212) [INFO]     - Comparing peaks to /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_7_8_9_corrected.bw
2021-06-24 11:29:49 (171212) [INFO]     - Comparing peaks to /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_1_2_3_corrected.bw
2021-06-24 11:29:49 (171212) [INFO]     Estimating GC content from peak sequences
2021-06-24 11:29:51 (171212) [INFO]     - GC content estimated at 51.22%
2021-06-24 11:29:51 (171212) [INFO]     Reading motifs from file
2021-06-24 11:29:52 (171212) [INFO]     - Read 5531 motifs
2021-06-24 11:29:52 (171212) [WARNING]  The motif output names (as given by --naming) are not unique.
2021-06-24 11:29:52 (171212) [WARNING]  The following names occur more than once: ['_']
2021-06-24 11:29:52 (171212) [WARNING]  These motifs will be renamed with '_1', '_2' etc. To prevent this renaming, please make the names of the input --motifs unique
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

Somehow it reads over 5000 motifs? Any ideas would be appreciated!

@connorrogerson
Copy link
Author

connorrogerson commented Jun 24, 2021

Tried it with a meme format motif database and receive this error:

TOBIAS BINDetect --motifs /rds/user/cjr78/hpc-work/Motifs/gimme.vertebrate.v5.0.meme --signals /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_7_8_9_corrected.bw /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_1_2_3_corrected.bw --genome /rds/user/cjr78/hpc-work/References/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa --peaks /rds/user/cjr78/hpc-work/ATAC/DEseq2/Alluvial_open_activated.bed --outdir /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/

# TOBIAS 0.12.1 BINDetect (run started 2021-06-24 13:10:49.881592)
# Working directory: /rds/user/cjr78/hpc-work/ATAC/tobias
# Command line call: TOBIAS BINDetect --motifs /rds/user/cjr78/hpc-work/Motifs/gimme.vertebrate.v5.0.meme --signals /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_7_8_9_corrected.bw /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_1_2_3_corrected.bw --genome /rds/user/cjr78/hpc-work/References/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa --peaks /rds/user/cjr78/hpc-work/ATAC/DEseq2/Alluvial_open_activated.bed --outdir /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/

# ----- Input parameters -----
# signals:      ['/rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_7_8_9_corrected.bw', '/rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_1_2_3_corrected.bw']
# peaks:        /rds/user/cjr78/hpc-work/ATAC/DEseq2/Alluvial_open_activated.bed
# motifs:       ['/rds/user/cjr78/hpc-work/Motifs/gimme.vertebrate.v5.0.meme']
# genome:       /rds/user/cjr78/hpc-work/References/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa
# cond_names:   ['merged_7_8_9_corrected', 'merged_1_2_3_corrected']
# peak_header:  None
# naming:       name_id
# motif_pvalue: 0.0001
# bound_pvalue: 0.001
# pseudo:       None
# time_series:  False
# skip_excel:   False
# output_peaks: None
# prefix:       bindetect
# outdir:       /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated
# cores:        1
# split:        100
# verbosity:    3


# ----- Output files -----
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/beds/*_merged_7_8_9_corrected_bound.bed
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/beds/*_merged_7_8_9_corrected_unbound.bed
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/beds/*_merged_1_2_3_corrected_bound.bed
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/beds/*_merged_1_2_3_corrected_unbound.bed
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/beds/*_all.bed
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/plots/*_log2fcs.pdf
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/*_overview.txt
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/*/*_overview.xlsx
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/bindetect_distances.txt
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/bindetect_results.txt
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/bindetect_results.xlsx
# /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/bindetect_figures.pdf


2021-06-24 13:10:49 (213198) [INFO]     ----- Processing input data -----
2021-06-24 13:10:49 (213198) [INFO]     Checking reading/writing of files
2021-06-24 13:10:50 (213198) [INFO]     Reading peaks
2021-06-24 13:10:50 (213198) [INFO]     - Found 1213 regions in input peaks
2021-06-24 13:10:50 (213198) [INFO]     - Merged to 1213 regions
2021-06-24 13:10:50 (213198) [INFO]     Checking for match between --peaks and --fasta/--signals boundaries
2021-06-24 13:10:50 (213198) [INFO]     - Comparing peaks to /rds/user/cjr78/hpc-work/References/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa
2021-06-24 13:10:50 (213198) [INFO]     - Comparing peaks to /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_7_8_9_corrected.bw
2021-06-24 13:10:50 (213198) [INFO]     - Comparing peaks to /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_1_2_3_corrected.bw
2021-06-24 13:10:50 (213198) [INFO]     Estimating GC content from peak sequences
2021-06-24 13:10:57 (213198) [INFO]     - GC content estimated at 51.22%
2021-06-24 13:10:57 (213198) [INFO]     Reading motifs from file
Traceback (most recent call last):
  File "/home/cjr78/miniconda3/envs/seq/bin/TOBIAS", line 33, in <module>
    sys.exit(load_entry_point('tobias==0.12.1', 'console_scripts', 'TOBIAS')())
  File "/home/cjr78/miniconda3/envs/seq/lib/python3.7/site-packages/tobias/TOBIAS.py", line 154, in main
    args.func(args)
  File "/home/cjr78/miniconda3/envs/seq/lib/python3.7/site-packages/tobias/tools/bindetect.py", line 212, in run_bindetect
    motif_list += MotifList().from_file(f)  #List of OneMotif objects
  File "/home/cjr78/miniconda3/envs/seq/lib/python3.7/site-packages/tobias/utils/motifs.py", line 201, in from_file
    self[-1].n = int(info["nsites"]) #overwrites OneMotif default
ValueError: invalid literal for int() with base 10: '1000.0'

@msbentsen
Copy link
Member

Hi, thank you for this issue! The first errors with gimme.vertebrate.v5.0.tobias.pfm arise because BINDetect expects either MEME or JASPAR-style motif file. I am not sure what this format is:

MA0006.1_Ahr::Arnt
0.1250 0.3333 0.0834 0.4583
0.0000 0.0000 0.9582 0.0417
0.0000 0.9582 0.0000 0.0417
0.0000 0.0000 0.9582 0.0417
0.0000 0.0000 0.0000 0.9999
0.0000 0.0000 0.9999 0.0000

I would love to add the option to read more formats, but I haven't been able to find a good source for the file format convention. Sometimes .pfm looks like the above, sometimes it looks like:

SIX5_disc1 SIX5_GM12878_encode-Myers_seq_hsa_r1:MEME#1#Intergenic
G 0.008511 0.004255 0.987234 0.000000
A 0.902127 0.012766 0.038298 0.046809
R 0.455319 0.072340 0.344681 0.127660
W 0.251064 0.085106 0.085106 0.578724
T 0.000000 0.046809 0.012766 0.940425
G 0.000000 0.000000 1.000000 0.000000
T 0.038298 0.021277 0.029787 0.910638
A 0.944681 0.004255 0.051064 0.000000
G 0.000000 0.000000 1.000000 0.000000
T 0.000000 0.000000 0.012766 0.987234

So that is unfortunately a bit hard....

But for the error with the .meme file, this is definitely a bug from my side :-) In the letter-probability matrix-line, you probably have a float number nsites:
letter-probability matrix: nsites= 1000.0 vs. letter-probability matrix: nsites= 1000

And the error arises because this can't be directly transformed into int. So changing nsites to an integer will also clear the error. I have made a small fix on the development branch (e5251cc), so you might also fetch and install that. I like to collect a few bugfixes before I release a new version, so I can't say when it will be officially 'out'. But I hope this helps you out!

Mette

@msbentsen msbentsen added the bug Something isn't working label Jun 25, 2021
@connorrogerson
Copy link
Author

Hi I'm getting another error unfortunately.

This is my meme file:

MEME version 3.0

ALPHABET= ACGT

strands: + -

Background letter frequencies
A 0.25 C 0.25 G 0.25 T 0.25
MOTIF GM.5.0.Sox.0001
letter-probability matrix: alength= 4 w= 6 nsites= 1000.0 E= 0
0.7213 0.0793 0.1103 0.0891
0.9259 0.0072 0.0062 0.0607
0.0048 0.9203 0.0077 0.0672
0.9859 0.003 0.003 0.0081
0.9778 0.0043 0.0128 0.0051
0.1484 0.005 0.0168 0.8299
MOTIF GM.5.0.Homeodomain.0001
letter-probability matrix: alength= 4 w= 11 nsites= 999.9 E= 0
0.887 0 0.0178 0.0951
0.1156 0.2033 0.6629 0.0181
0.0017 0.7452 0.0809 0.1722
0.0011 0.0003 0.0003 0.9983
0.0026 0.0141 0.9721 0.0111
0 0.0189 0.0054 0.9758
0.0006 0.9983 0.0006 0.0006
0.917 0.014 0.0046 0.0644
0.2228 0.2421 0.33 0.2051
0.3621 0.1054 0.2208 0.3116
0.5727 0.0104 0.1741 0.2428
MOTIF GM.5.0.Mixed.0001
letter-probability matrix: alength= 4 w= 11 nsites= 1000.1 E= 0
0.0061 0.331 0.6435 0.0195
0.155 0.3038 0.3952 0.146
0.1556 0.2468 0.4775 0.1201
0.1152 0.2145 0.6121 0.0581
0.005 0.3077 0.6282 0.0591
0.0027 0.4947 0.4958 0.0069
0.0994 0.4565 0.369 0.0751
0.1759 0.3192 0.4488 0.0561
0.1084 0.256 0.584 0.0516
0.0031 0.1431 0.8534 0.0004
0.0168 0.5105 0.4666 0.0061

The run output is:
TOBIAS BINDetect --motifs /rds/user/cjr78/hpc-work/Motifs/gimme.vertebrate.v5.0.meme --signals /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_7_8_9_corrected.bw /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_1_2_3_corrected.bw --genome /rds/user/cjr78/hpc-work/References/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa --peaks /rds/user/cjr78/hpc-work/ATAC/DEseq2/Alluvial_open_activated.bed --outdir /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/

TOBIAS 0.12.12 BINDetect (run started 2021-06-25 16:50:29.791082)

Working directory: /rds/user/cjr78/hpc-work/ATAC/tobias

Command line call: TOBIAS BINDetect --motifs /rds/user/cjr78/hpc-work/Motifs/gimme.vertebrate.v5.0.meme --signals /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_7_8_9_corrected.bw /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_1_2_3_corrected.bw --genome /rds/user/cjr78/hpc-work/References/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa --peaks /rds/user/cjr78/hpc-work/ATAC/DEseq2/Alluvial_open_activated.bed --outdir /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/

----- Input parameters -----

signals: ['/rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_7_8_9_corrected.bw', '/rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_1_2_3_corrected.bw']

peaks: /rds/user/cjr78/hpc-work/ATAC/DEseq2/Alluvial_open_activated.bed

motifs: ['/rds/user/cjr78/hpc-work/Motifs/gimme.vertebrate.v5.0.meme']

genome: /rds/user/cjr78/hpc-work/References/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa

cond_names: ['merged_7_8_9_corrected', 'merged_1_2_3_corrected']

peak_header: None

naming: name_id

motif_pvalue: 0.0001

bound_pvalue: 0.001

pseudo: None

time_series: False

skip_excel: False

output_peaks: None

prefix: bindetect

outdir: /rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated

cores: 1

split: 100

verbosity: 3

----- Output files -----

/rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated//beds/_merged_7_8_9_corrected_bound.bed

/rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated//beds/_merged_7_8_9_corrected_unbound.bed

/rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated//beds/_merged_1_2_3_corrected_bound.bed

/rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated//beds/_merged_1_2_3_corrected_unbound.bed

/rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated//beds/_all.bed

/rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated//plots/_log2fcs.pdf

/rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated//_overview.txt

/rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated//_overview.xlsx

/rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/bindetect_distances.txt

/rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/bindetect_results.txt

/rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/bindetect_results.xlsx

/rds/user/cjr78/hpc-work/ATAC/tobias/Alluvial_gimmemotifs/Open_activated/bindetect_figures.pdf

2021-06-25 16:50:29 (127630) [INFO] ----- Processing input data -----
2021-06-25 16:50:29 (127630) [INFO] Checking reading/writing of files
2021-06-25 16:50:29 (127630) [INFO] Reading peaks
2021-06-25 16:50:29 (127630) [INFO] - Found 1213 regions in input peaks
2021-06-25 16:50:29 (127630) [INFO] - Merged to 1213 regions
2021-06-25 16:50:29 (127630) [INFO] Checking for match between --peaks and --fasta/--signals boundaries
2021-06-25 16:50:29 (127630) [INFO] - Comparing peaks to /rds/user/cjr78/hpc-work/References/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa
2021-06-25 16:50:29 (127630) [INFO] - Comparing peaks to /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_7_8_9_corrected.bw
2021-06-25 16:50:29 (127630) [INFO] - Comparing peaks to /rds/user/cjr78/hpc-work/ATAC/tobias/ATAC_peaks/FootprintScores/merged_1_2_3_corrected.bw
2021-06-25 16:50:29 (127630) [INFO] Estimating GC content from peak sequences
2021-06-25 16:50:30 (127630) [INFO] - GC content estimated at 51.22%
2021-06-25 16:50:30 (127630) [INFO] Reading motifs from file
ERROR: No matrix could be read for motif 'GM.5.0.Sox.0001 ' - please check the format of the input motif file.

@msbentsen
Copy link
Member

Hi Connor,

This was a small oversight from my end - I falsely assumed there would always be at least one line between two motifs, e.g.:

MEME version 3.0

ALPHABET= ACGT

strands: + -

Background letter frequencies
A 0.25 C 0.25 G 0.25 T 0.25

MOTIF GM.5.0.Sox.0001
letter-probability matrix: alength= 4 w= 6 nsites= 1000.0 E= 0
0.7213 0.0793 0.1103 0.0891
0.9259 0.0072 0.0062 0.0607
0.0048 0.9203 0.0077 0.0672
0.9859 0.003 0.003 0.0081
0.9778 0.0043 0.0128 0.0051
0.1484 0.005 0.0168 0.8299

MOTIF GM.5.0.Homeodomain.0001
letter-probability matrix: alength= 4 w= 11 nsites= 999.9 E= 0
0.887 0 0.0178 0.0951
0.1156 0.2033 0.6629 0.0181
0.0017 0.7452 0.0809 0.1722
0.0011 0.0003 0.0003 0.9983
0.0026 0.0141 0.9721 0.0111
0 0.0189 0.0054 0.9758
0.0006 0.9983 0.0006 0.0006
0.917 0.014 0.0046 0.0644
0.2228 0.2421 0.33 0.2051
0.3621 0.1054 0.2208 0.3116
0.5727 0.0104 0.1741 0.2428

MOTIF GM.5.0.Mixed.0001
letter-probability matrix: alength= 4 w= 11 nsites= 1000.1 E= 0
0.0061 0.331 0.6435 0.0195
0.155 0.3038 0.3952 0.146
0.1556 0.2468 0.4775 0.1201
0.1152 0.2145 0.6121 0.0581
0.005 0.3077 0.6282 0.0591
0.0027 0.4947 0.4958 0.0069
0.0994 0.4565 0.369 0.0751
0.1759 0.3192 0.4488 0.0561
0.1084 0.256 0.584 0.0516
0.0031 0.1431 0.8534 0.0004
0.0168 0.5105 0.4666 0.0061

The above motifs work with version 0.12.11, and the file without spaces works with the current 0.12.12 (in progress, might include more fixes in the future) on the dev branch. Thanks for letting me know of these issues!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants