Error when checking input: wrong input shape #8

Closed · VGalata opened this issue Aug 16, 2021 · 10 comments

VGalata commented Aug 16, 2021

Hello,

First of all, thanks for the great tool!

Unfortunately, I have encountered an issue when trying to run PACIFIC on one of our samples:

Error message:

ValueError: Error when checking input: expected embedding_1_input to have shape (142,) but got array with shape (1,)

Command line:

python /path/to/PACIFIC.py -i my.fq -m path/to/pacific.h5 -t /path/to/model/tokenizer.01.pacific_9mers.pickle -l /path/to/model/label_maker.01.pacific_9mers.pickle -o output/folder/ -f fastq

The input FASTQ file contains both paired-end and single-end reads (all three FASTQ files were simply concatenated).

Version: 075fb55

Conda environment YAML:

channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - python=3.7.6
  - numpy=1.18.1
  - tensorflow=2.2.0
  - keras=2.3.1
  - pandas=1.0.1
  - scikit-learn=0.21.3
  - biopython=1.76

Thanks in advance!

Best,
Valentina

hp2048 (Contributor) commented Aug 17, 2021

Hi Valentina,
Thank you for your kind words. It is difficult to troubleshoot from the error alone; it seems to stem from one of the libraries rather than from PACIFIC.py itself. Could you please provide the entire stderr/stdout from the run, in case it contains more debugging information? Alternatively, if you feel comfortable doing so, you could share your my.fq file with us and we can try to reproduce the error locally.
I will consult with @pabloacera as well to see if he has more insight into the problem.
Cheers
Hardip

VGalata (Author) commented Aug 17, 2021

Dear Hardip,

Thanks a lot for the quick reply!

I have attached the log file: it is the complete stderr/stdout of the executed command; I only truncated some file paths and changed the sample ID. Unfortunately, I cannot share the FASTQ file, as this is an unpublished dataset and we are not allowed to share the data yet.

Attachment: pacific.log

Best,
Valentina

hp2048 (Contributor) commented Aug 17, 2021

Hi Valentina,
The error occurs between sequences 1100000 and 1200000 in your input file, at predictions = model.predict(np.array(kmer_sequences)) in the PACIFIC.py code, so there may be a sequence-related issue. PACIFIC creates an array of overlapping 9-mers for each sequence, extending the sequence up to 150 bp if it is shorter; a 150 bp sequence yields 150 - 9 + 1 = 142 overlapping 9-mers, hence the 142 in the error message.

However, it looks like for one of the sequences the model is receiving only one item instead of 142.
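
For reference, the 9-mer construction is conceptually something like the following (a minimal sketch, not the exact PACIFIC code; kmerize is a hypothetical name):

# Minimal sketch of overlapping k-mer construction (hypothetical helper,
# not the exact PACIFIC code). A 150 bp read yields 150 - 9 + 1 = 142
# overlapping 9-mers, which is where the 142 in the error comes from.
def kmerize(sequence, k=9):
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

assert len(kmerize("A" * 150)) == 142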

For further troubleshooting, could you please isolate the reads in question:

head -n 4800004 my.fq | tail -n 400004 >debug.fq

Could you please check whether the new debug.fq has any sequences that are zero length or look odd? If the file looks fine, then to narrow down the problem, could you run PACIFIC.py with debug.fq as input and set --chunk_size 100? This will process 100 reads at a time and allow us to narrow down the reads that may be causing the issue.

Cheers
Hardip

VGalata (Author) commented Aug 17, 2021

Hi Hardip,

I created the FASTQ file as suggested, looked at some of the sequences, and computed the min/max sequence length:

awk 'NR%4==2' debug.fq | sort -R | head -n 1000 | less
awk 'NR%4==2' debug.fq | awk '{ print length($0) }' | sort -n | head -n 1 # 40
awk 'NR%4==2' debug.fq | awk '{ print length($0) }' | sort -n | tail -n 1 # 150
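
The same check in Python with Biopython (a quick sketch; it assumes debug.fq parses cleanly as FASTQ):

# Quick sketch of the same length check with Biopython
# (assumes debug.fq is valid FASTQ).
from Bio import SeqIO

lengths = [len(rec.seq) for rec in SeqIO.parse("debug.fq", "fastq")]
print("min:", min(lengths), "max:", max(lengths))
print("zero-length reads:", sum(1 for n in lengths if n == 0))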

I did not see anything suspicious.

So I ran the tool on the read subset using different values for --chunk_size. However, the error is reported for a different chunk depending on the chunk size used:

  • --chunk_size 100: reads 85900 to 86000
  • --chunk_size 10: reads 20 to 30
  • --chunk_size 5: reads 10 to 15

For --chunk_size 10 I extracted the reads around the failing subset (20 to 30):

awk 'NR%4==2' debug.fq | head -n 30 | tail -n 20
TTATTTCTAAAGAGCGATACTTAAGCACTTACATGATTATTATAAAACTCGAAAAAGCCTAGATAA
CTGCAAAGTCTTGTCGCGTCATTCAGTGCGCTCGTGCCGGTTATTACGGTGTGCGCGGTGATCATTGCGGTCGTGGTCATTTCCGGCGGGAAGGGACAGACGGCAGAATTTGCGCCGGAGGCATCGCTGCTGCGTCCGA
CTCCCGGCTTGCCGCAGTCTGCACGGATCGCGTGCCGCCGTCCACGGCTGCCGGCGCGATATTCGGAGCAGGCAGAGCGGATTCCTTTGACGCAGGCTGTACTGCCGCCTTTCGATTGGGGTTGACGACTGCCATTTTTTTGAATGCCGC
GCGTCACATCGCGGAAGATACCCGAGAGGCGCCACATATCCTGATCCTCAAGATAGCTGCCCGAGCAGTAGCGGC
ATCAGCAAATCCGTCTCAGACGGCAAGACCTACACAAAGGTCGAACGCCTTGATACCGAGGGCAGAATACATGAGATA
TTGCCGGTCTTGTCCATCAGGCGCATCGAGCAGGTGAGATGAATAGCTACCGGGGTGTCGGTCTGGTGGAAATCCAGCCTGTCGGCCAGGAAACCAAGGATGAACTCCGCCGGTTCGTAGAGCCGCATCTTCTTGATCTTGT
AGCTTCTGTACGTTCTGGGATTACACGATCCTGCACCGGATGCGCGATTACGAC
AACATAAGCCCGAAAAATCTTGTGATTTTAATCGGAACGAATGACATCGGCAT
AGGAAGCCCTGCGCTGGTGCGCTGAGGTGTTCCATGCCCTGGCAGCTCTGCTCAAGAGTCGCGGCCTGGCCACCTCTGTGGGCGATGAGGGCGGCTTTGCCCCCAACCTGTCCAGCGATGAAGAGACCATCGAGACCATTCTGGAA
CGCGGCTGCTTGATCGCGCTCTTGACCGTCTTTTCCGTGATCTCGTTGAATTCCACGCGGCATTCGCTTTCCGGGTCGATGCCCAGAATGGTCGCCAGATGCCACGAAATCGCCTCGCCCTCGCGGTCAGGGTCC
GCAGGCCGCGGCCAGCCGTCCGGCGAAAAGCTCTCCGGCCTCGTTGCGGGATCGGGCCAGCGTTTGCAGACCGCTTTTGTCTGCCGGACGAAGCAGAAAGGAAATTTCTTCCCGGTTGAGCAGTCCGCCGGGGAATGTCGATTCGAGCGT
ATCAAAAATAGGCGAAAAATTAAAGGGTGGAGCGAAAATATAGAAAAGGCAACTGGCAGTTGCTTGCTAGTAAACTAAATGTGGTT
CATAGAGTTCTTTTTCCCTGCCGGGGTAGTATATCCATCCAGAAAAGATCTGGACACGATCAACAAACGCCCTCTCAAATCTTTGCTCACTTTCGATACCTGTATGATATGAGTCGGCATCAAGGTT
CTGGCGAAAACCGCAGTCCAACTACCCATACGATTTCCTGTCCGCATGCAAGCACCGGCCATCCTGCCCGTTTGGTCCT
GGCGAGTTGATCGCCTTCGTGATCGGTTGGGATCTGATTCTGGAATACGCGTTGCAGGCGGCCACCGTGTCCGCCGGCTGGTCGGGCTATTTCAACAAGCTGTTGGAAGGCTTCGGCCTGCATCTGCCGGTCGAACTGACCGCCGCAT
TTACAAAGCAAAAGATTATCAGGACATGAGTAGAAAAGCTGCAAATATTATTTCAGCACAGATTATTATGAAACCAAACTGTGTGCTTGGACTTGCCACAGGTTCTTCCCCAGTCGGAACTTACAAACAATTGATCGAATGGTATAAAAA
CAACAGATCTTCTTCATTAAAATCATCTGGGCTGTCAGAATCATCACTGTCACAAAGACGCCACACAGAAACATTATCCATGAACCATGCCTGCCGTTTTAAAGTATTAACGATCCGACCGAACAAACTGAAATCCCTCACCCACCGCTT
CCGAAACGGCATCCTCCATCCACTCGCGGCTTGCGATATCCAGATGGTTTGTGGGCTCGTCGAGGATGAGCAGGTTAATGTCGCTGCCCATGAGCATACAGAGCCTCAAACGGCTCTTCTCGCCGCCGGAAAGCGCTCCGAC
CGGCGTCCGTGGGCACCATGGCGTCAAGGCGGCGAAGGCACACGGTTGCGCGGTTGGCGACCATGCGCTCCGTGGGATATTGAAAGAGGTTCTC
GGAGACCGTCGCCAAGGTGCTCTCGAACGCCCGCGCCCTGAGTCCCCACCGCATGGATGTGCATGAGCTCGCCGTGGATGGCTGCGAGCTCACTCTCATCGACGACTCCT

Maybe the issue is with how the chunks are being created and processed?

Best,
Valentina

hp2048 (Contributor) commented Aug 17, 2021

Hi Valentina,
PACIFIC creates a subset of reads based on the chunk size, i.e. if the chunk size is set to 100, it builds a list of 100 sequences to process at a time. For debug.fq with chunk size 100, the failure was for sequences between 85900 and 86000. I think this is a bug in our code where we are not handling an edge case properly. We will use the 20 sequences you pasted above and try to fix the code. Thank you for your patience.
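
Conceptually, the chunking works along these lines (a simplified sketch, not the actual PACIFIC implementation; iter_chunks is a hypothetical name):

# Simplified sketch of chunked processing (not the actual PACIFIC code):
# reads are consumed in groups of chunk_size, and each group goes through
# tokenisation and a single model.predict() call.
def iter_chunks(reads, chunk_size=100):
    for start in range(0, len(reads), chunk_size):
        yield reads[start:start + chunk_size]
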
Cheers

pabloacera (Contributor) commented:

Hi Valentina, thanks for using our software. Indeed, there was a bug that was triggered when all reads in a chunk were discarded. I have fixed it, so you may want to download the updated PACIFIC.py and give it a go. I also noticed that your reads have very variable lengths; please note that, for now, we do not predict the origin of reads shorter than 150 bp.
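
The crash happened because an empty chunk reached the model; the fix amounts to a guard of this kind (an illustrative sketch, not the literal patch; predict_chunk is a hypothetical name):

import numpy as np

# Illustrative guard (not the literal patch): if every read in a chunk is
# discarded (e.g. all shorter than 150 bp), the resulting array no longer
# has the expected (n, 142) shape, so model.predict() must not be called.
def predict_chunk(model, tokenised_reads):
    if not tokenised_reads:
        return None  # the whole chunk was discarded; skip it
    return model.predict(np.array(tokenised_reads))
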
Hope it works now, thanks!!

VGalata (Author) commented Aug 18, 2021

Dear @pabloacera,

Thank you for the quick fix! I will try the updated version today.

Regarding the read lengths: yes, they vary because the reads have been preprocessed (quality and adapter trimming). I did not know about the minimum length constraint - thank you for pointing that out!

hp2048 (Contributor) commented Aug 18, 2021

@pabloacera: Thank you for the fix. Would it be possible to extend reads with Ns up to 150 bp and run the prediction? This could be done at the user end, or perhaps within PACIFIC.
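
At the user end it could look something like this (a Biopython sketch; the filenames and the Phred quality assigned to the padded bases are assumptions):

# Sketch of user-side padding of short reads to 150 bp with Ns
# (filenames and the Phred quality of the padded bases are assumptions).
from Bio import SeqIO
from Bio.Seq import Seq

def pad_record(rec, target=150):
    pad = target - len(rec.seq)
    if pad > 0:
        quals = rec.letter_annotations["phred_quality"] + [2] * pad
        rec.letter_annotations = {}  # must be emptied before resizing seq
        rec.seq = Seq(str(rec.seq) + "N" * pad)
        rec.letter_annotations["phred_quality"] = quals
    return rec

padded = (pad_record(rec) for rec in SeqIO.parse("my.fq", "fastq"))
SeqIO.write(padded, "my.padded.fq", "fastq")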

pabloacera (Contributor) commented:

Hi, that is possible to do, but we would have to see how it affects the predictions of the model.
Cheers.

VGalata (Author) commented Aug 18, 2021

I can confirm that the bug has been fixed: the test sample now runs through.
I also second the suggestion to add a parameter for the minimum read length.

Thanks again for fixing the issue so quickly!
