Error when checking input: wrong input shape #8

Closed · VGalata opened this issue Aug 16, 2021 · 10 comments

VGalata commented Aug 16, 2021

Hello,

First of all, thanks for the great tool!

Unfortunately, I have encountered an issue when trying to run PACIFIC on one of our samples:

Error message:

ValueError: Error when checking input: expected embedding_1_input to have shape (142,) but got array with shape (1,)

Command line:

python /path/to/PACIFIC.py -i my.fq -m path/to/pacific.h5 -t /path/to/model/tokenizer.01.pacific_9mers.pickle -l /path/to/model/label_maker.01.pacific_9mers.pickle -o output/folder/ -f fastq

The input FASTQ file contains both paired-end and single-end reads (all three FASTQ files were simply concatenated).

Version: 075fb55

Conda environment YAML:

channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - python=3.7.6
  - numpy=1.18.1
  - tensorflow=2.2.0
  - keras=2.3.1
  - pandas=1.0.1
  - scikit-learn=0.21.3
  - biopython=1.76

Thanks in advance!

Best,
Valentina

hp2048 (Contributor) commented Aug 17, 2021

Hi Valentina,
Thank you for your kind words. It is difficult to troubleshoot from the error alone; it seems to stem from one of the libraries rather than from PACIFIC.py itself. Could you please provide the entire stderr/stdout from the run, in case it contains more debugging information? Alternatively, if you feel comfortable doing so, you could share your my.fq file with us and we can try to reproduce the error locally.
I will consult with @pabloacera as well to see if he has more insight into the problem.
Cheers
Hardip

VGalata (Author) commented Aug 17, 2021

Dear Hardip,

Thanks a lot for the quick reply!

I have attached the log file: it is the complete stderr/stdout of the executed command; I only truncated some file paths and changed the sample ID. Unfortunately, I cannot share the FASTQ file, as this is an unpublished dataset and we are not allowed to share the data yet.

Attachment: pacific.log

Best,
Valentina

hp2048 (Contributor) commented Aug 17, 2021

Hi Valentina,
The error occurs between sequences 1100000 and 1200000 in your input file, at predictions = model.predict(np.array(kmer_sequences)) in the PACIFIC.py code, so there may be a sequence-related issue. PACIFIC creates an array of overlapping 9-mers for each sequence, extending the sequence up to 150 bp if it is shorter; a 150 bp sequence yields 150 - 9 + 1 = 142 overlapping 9-mers, hence the 142 in the error message.

However, it looks like for one of the sequences the model is receiving only one item instead of 142.
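
For reference, the 9-mer construction is conceptually something like the following (a minimal sketch, not the exact PACIFIC code; kmerize is a hypothetical name):

# Minimal sketch of overlapping k-mer construction (hypothetical helper,
# not the exact PACIFIC code). A 150 bp read yields 150 - 9 + 1 = 142
# overlapping 9-mers, which is where the 142 in the error comes from.
def kmerize(sequence, k=9):
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

assert len(kmerize("A" * 150)) == 142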

For further troubleshooting, could you please isolate the reads in question:

head -n 4800004 my.fq | tail -n 400004 >debug.fq

Could you please check whether the new debug.fq has any sequences that are zero length or look odd? If the file looks fine, then to narrow down the problem, could you run PACIFIC.py with debug.fq as input and set --chunk_size 100? This will process 100 reads at a time and allow us to narrow down the reads that may be causing the issue.

Cheers
Hardip

VGalata (Author) commented Aug 17, 2021

Hi Hardip,

I created the FASTQ file as suggested, looked at some of the sequences, and computed the min/max sequence length:

awk 'NR%4==2' debug.fq | sort -R | head -n 1000 | less
awk 'NR%4==2' debug.fq | awk '{ print length($0) }' | sort -n | head -n 1 # 40
awk 'NR%4==2' debug.fq | awk '{ print length($0) }' | sort -n | tail -n 1 # 150
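
The same check in Python with Biopython (a quick sketch; it assumes debug.fq parses cleanly as FASTQ):

# Quick sketch of the same length check with Biopython
# (assumes debug.fq is valid FASTQ).
from Bio import SeqIO

lengths = [len(rec.seq) for rec in SeqIO.parse("debug.fq", "fastq")]
print("min:", min(lengths), "max:", max(lengths))
print("zero-length reads:", sum(1 for n in lengths if n == 0))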

I did not see anything suspicious.

So I ran the tool on the read subset using different values for --chunk_size. However, the error is reported for a different chunk depending on the chunk size used:

  • --chunk_size 100: reads 85900 to 86000
  • --chunk_size 10: reads 20 to 30
  • --chunk_size 5: reads 10 to 15

For --chunk_size 10 I extracted the reads around the failing subset (20 to 30):

awk 'NR%4==2' debug.fq | head -n 30 | tail -n 20
TTATTTCTAAAGAGCGATACTTAAGCACTTACATGATTATTATAAAACTCGAAAAAGCCTAGATAA
CTGCAAAGTCTTGTCGCGTCATTCAGTGCGCTCGTGCCGGTTATTACGGTGTGCGCGGTGATCATTGCGGTCGTGGTCATTTCCGGCGGGAAGGGACAGACGGCAGAATTTGCGCCGGAGGCATCGCTGCTGCGTCCGA
CTCCCGGCTTGCCGCAGTCTGCACGGATCGCGTGCCGCCGTCCACGGCTGCCGGCGCGATATTCGGAGCAGGCAGAGCGGATTCCTTTGACGCAGGCTGTACTGCCGCCTTTCGATTGGGGTTGACGACTGCCATTTTTTTGAATGCCGC
GCGTCACATCGCGGAAGATACCCGAGAGGCGCCACATATCCTGATCCTCAAGATAGCTGCCCGAGCAGTAGCGGC
ATCAGCAAATCCGTCTCAGACGGCAAGACCTACACAAAGGTCGAACGCCTTGATACCGAGGGCAGAATACATGAGATA
TTGCCGGTCTTGTCCATCAGGCGCATCGAGCAGGTGAGATGAATAGCTACCGGGGTGTCGGTCTGGTGGAAATCCAGCCTGTCGGCCAGGAAACCAAGGATGAACTCCGCCGGTTCGTAGAGCCGCATCTTCTTGATCTTGT
AGCTTCTGTACGTTCTGGGATTACACGATCCTGCACCGGATGCGCGATTACGAC
AACATAAGCCCGAAAAATCTTGTGATTTTAATCGGAACGAATGACATCGGCAT
AGGAAGCCCTGCGCTGGTGCGCTGAGGTGTTCCATGCCCTGGCAGCTCTGCTCAAGAGTCGCGGCCTGGCCACCTCTGTGGGCGATGAGGGCGGCTTTGCCCCCAACCTGTCCAGCGATGAAGAGACCATCGAGACCATTCTGGAA
CGCGGCTGCTTGATCGCGCTCTTGACCGTCTTTTCCGTGATCTCGTTGAATTCCACGCGGCATTCGCTTTCCGGGTCGATGCCCAGAATGGTCGCCAGATGCCACGAAATCGCCTCGCCCTCGCGGTCAGGGTCC
GCAGGCCGCGGCCAGCCGTCCGGCGAAAAGCTCTCCGGCCTCGTTGCGGGATCGGGCCAGCGTTTGCAGACCGCTTTTGTCTGCCGGACGAAGCAGAAAGGAAATTTCTTCCCGGTTGAGCAGTCCGCCGGGGAATGTCGATTCGAGCGT
ATCAAAAATAGGCGAAAAATTAAAGGGTGGAGCGAAAATATAGAAAAGGCAACTGGCAGTTGCTTGCTAGTAAACTAAATGTGGTT
CATAGAGTTCTTTTTCCCTGCCGGGGTAGTATATCCATCCAGAAAAGATCTGGACACGATCAACAAACGCCCTCTCAAATCTTTGCTCACTTTCGATACCTGTATGATATGAGTCGGCATCAAGGTT
CTGGCGAAAACCGCAGTCCAACTACCCATACGATTTCCTGTCCGCATGCAAGCACCGGCCATCCTGCCCGTTTGGTCCT
GGCGAGTTGATCGCCTTCGTGATCGGTTGGGATCTGATTCTGGAATACGCGTTGCAGGCGGCCACCGTGTCCGCCGGCTGGTCGGGCTATTTCAACAAGCTGTTGGAAGGCTTCGGCCTGCATCTGCCGGTCGAACTGACCGCCGCAT
TTACAAAGCAAAAGATTATCAGGACATGAGTAGAAAAGCTGCAAATATTATTTCAGCACAGATTATTATGAAACCAAACTGTGTGCTTGGACTTGCCACAGGTTCTTCCCCAGTCGGAACTTACAAACAATTGATCGAATGGTATAAAAA
CAACAGATCTTCTTCATTAAAATCATCTGGGCTGTCAGAATCATCACTGTCACAAAGACGCCACACAGAAACATTATCCATGAACCATGCCTGCCGTTTTAAAGTATTAACGATCCGACCGAACAAACTGAAATCCCTCACCCACCGCTT
CCGAAACGGCATCCTCCATCCACTCGCGGCTTGCGATATCCAGATGGTTTGTGGGCTCGTCGAGGATGAGCAGGTTAATGTCGCTGCCCATGAGCATACAGAGCCTCAAACGGCTCTTCTCGCCGCCGGAAAGCGCTCCGAC
CGGCGTCCGTGGGCACCATGGCGTCAAGGCGGCGAAGGCACACGGTTGCGCGGTTGGCGACCATGCGCTCCGTGGGATATTGAAAGAGGTTCTC
GGAGACCGTCGCCAAGGTGCTCTCGAACGCCCGCGCCCTGAGTCCCCACCGCATGGATGTGCATGAGCTCGCCGTGGATGGCTGCGAGCTCACTCTCATCGACGACTCCT

Maybe the issue is with how the chunks are being created and processed?

Best,
Valentina

hp2048 (Contributor) commented Aug 17, 2021

Hi Valentina,
PACIFIC creates a subset of reads based on the chunk size, i.e. if the chunk size is set to 100, it builds a list of 100 sequences to process at a time. For debug.fq with chunk size 100, the failure was for sequences between 85900 and 86000. I think this is a bug in our code where we are not handling an edge case properly. We will use the 20 sequences you pasted above and try to fix the code. Thank you for your patience.
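
Conceptually, the chunking works along these lines (a simplified sketch, not the actual PACIFIC implementation; iter_chunks is a hypothetical name):

# Simplified sketch of chunked processing (not the actual PACIFIC code):
# reads are consumed in groups of chunk_size, and each group goes through
# tokenisation and a single model.predict() call.
def iter_chunks(reads, chunk_size=100):
    for start in range(0, len(reads), chunk_size):
        yield reads[start:start + chunk_size]
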
Cheers

pabloacera (Contributor) commented:

Hi Valentina, thanks for using our software. Indeed, there was a bug that was triggered when all reads in a chunk were discarded. I have fixed it, so you may want to download the updated PACIFIC.py and give it a go. I also noticed that your reads have very variable lengths; please note that, for now, we do not predict the origin of reads shorter than 150 bp.
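
The crash happened because an empty chunk reached the model; the fix amounts to a guard of this kind (an illustrative sketch, not the literal patch; predict_chunk is a hypothetical name):

import numpy as np

# Illustrative guard (not the literal patch): if every read in a chunk is
# discarded (e.g. all shorter than 150 bp), the resulting array no longer
# has the expected (n, 142) shape, so model.predict() must not be called.
def predict_chunk(model, tokenised_reads):
    if not tokenised_reads:
        return None  # the whole chunk was discarded; skip it
    return model.predict(np.array(tokenised_reads))
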
Hope it works now, thanks!!

VGalata (Author) commented Aug 18, 2021

Dear @pabloacera,

Thank you for the quick fix! I will try the updated version today.

Regarding the read lengths: yes, they vary because the reads have been preprocessed (quality and adapter trimming). I did not know about the minimum length constraint - thank you for pointing that out!

hp2048 (Contributor) commented Aug 18, 2021

@pabloacera: Thank you for the fix. Would it be possible to extend reads with Ns up to 150 bp and run the prediction? This could be done at the user end, or perhaps within PACIFIC.
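
At the user end it could look something like this (a Biopython sketch; the filenames and the Phred quality assigned to the padded bases are assumptions):

# Sketch of user-side padding of short reads to 150 bp with Ns
# (filenames and the Phred quality of the padded bases are assumptions).
from Bio import SeqIO
from Bio.Seq import Seq

def pad_record(rec, target=150):
    pad = target - len(rec.seq)
    if pad > 0:
        quals = rec.letter_annotations["phred_quality"] + [2] * pad
        rec.letter_annotations = {}  # must be emptied before resizing seq
        rec.seq = Seq(str(rec.seq) + "N" * pad)
        rec.letter_annotations["phred_quality"] = quals
    return rec

padded = (pad_record(rec) for rec in SeqIO.parse("my.fq", "fastq"))
SeqIO.write(padded, "my.padded.fq", "fastq")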

pabloacera (Contributor) commented:

Hi, that is possible to do, but we would have to see how it affects the predictions of the model.
Cheers.

VGalata (Author) commented Aug 18, 2021

I can confirm that the bug has been fixed: the test sample now runs through.
I also second the suggestion to add a parameter for the minimum read length.

Thanks again for fixing the issue so quickly!
