Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Large difference in quality of the assembly using different filtering methods #46

Open
valery-shap opened this issue Dec 30, 2021 · 3 comments

Comments

@valery-shap
Copy link

Hello, Narciso,

The tool is great! thank you again)
One question to discuss.
When I'm using filtering by default (prinseq), I have near thousand of contigs and the warning from Spades:
Too many erroneous kmers, the estimates might be unreliable
I I use --filtering trimmomatic, I have 111 contigs and no warnings.
How could be so? What is the exact difference between these methods?
The isolate is the same, the version is old 1.2.1

And Happy New Year!

Best regards,
Valery

@valery-shap valery-shap changed the title Large difference in quality of the assembly depends on choosing filtering method Large difference in quality of the assembly using different filtering methods Dec 30, 2021
@nmquijada
Copy link
Owner

nmquijada commented Jan 12, 2022

Dear Valery,

Happy new year too!
Sorry for the late reply, I took some days off.

Wow that's a huge difference! The tormes version does not matter it is not the last one, as the code for those tools did not change.
I have been using Prinseq a lot in the past with good results. Since many years I switched to other trimming methods (Trimmomatic, Sickle, TrimGalore, etc.) as they allow the use of compressed (".gz") files. When using Prinseq in Tormes, it takes some time to decompress the reads and compress them again.
Prinseq and trimmomatic have many ways to be used. In Tormes, I chose to trim from the right part of the sequence all those nucleotides that do not reach a certain quality threshold (specified by -q/--quality). In the case of Prinseq, the quality is calculated as a mean of 15 bp windows starting from the last nucleotide. Apart for the algorithm they use for trimming, this is the main difference used in the code. You can read more about how they work and further options in their main webpages:

If there are specific options you would like to test, let me know and we can arrange so ;)

You can also use already quality-trimmed reads in tormes. Just include them in your metadata file and reduce the -q/--quality value and probably the --min_len to ensure that all of them pass the filter. We are preparing a big new release of Tormes (will take a while) that will include the none option for trimming and further trimming methods to chose.
I will perform some testing on prinseq... as I never got that issue.

I would recommend you to use some read quality control software, such as FastQC and MultiQC to visualize the quality of your reads. These two tools generate very nice html reports with all the quality information of your reads, adapter content, etc. This will be included in the next release of tormes, but you can run it on your system as:

#Install via conda
conda install -c bioconda fastqc
conda install -c bioconda -c conda-forge multiqc

# Run fastqc for all the reads in a file
mkdir fastqc
for i in $(ls my-reads/*gz | sed "s/.*\///"); do fastqc my-reads/${i} -o fastqc -t 24; done

# Run multiqc over all the outputs to have a combined report
mkdir multiqc
multiqc fastqc/*_R1* -o multiqc
multiqc fastqc/*_R2* -o multiqc

Just wrote the code as an example. Please modify names/path accordingly to your system

Hope this helps!
Best,
Narciso

@valery-shap
Copy link
Author

I could send the reads of this sample, before I had some strange samples with more that 700 contigs too but didn't check them again with shovill for example.
Now I've one more sample that looks like this and this choice of trimming changed even mlst type! so I suppose that raw reads are very contaminated and bad quality and with trimmomatic the mlst type is right, when I've changed to default with additional prinseq it showed wrong type.
A lot of thanks!
Valery

@nmquijada
Copy link
Owner

Hi Valery,

Sorry for the late reply.... we are under some deadlines and my time is limited these days...
This is definitely a bad behaviour of prinseq! Have you tried using prinseq but increasing the quality threshold by using the -q/--quality flag? (defalut is 25)
It looks like very contaminated or bad quality set of reads. You should always rely on assembly statistics (number of contigs, N50, taxonomic classification of the isolate (shoudn't be a mix), etc.) before trusting results as mlst. We allowed mlst and other gene screening tools to be run over the contigs automatically, and leave to the user the decision to use this results or not instead of discarding everything. However, I think a WARNING in this case would be beneficial... will include it!

Just curious, have you checked the taxonomy of your reads to discard is a contaminated isolate?

kraken2 --paired Read_1.fastq.gz Read_2.fastq.gz --db /PATH/TO/KRAKENDB --threads <threads> --report Sample.kraken-report.txt &>>/dev/null

Did you also check the quality with the FASTQC and MULTIQC tools I told you before?

We are working in a big release of TORMES, and will definitely investigate this issue with prinseq in detail. In any case, if you get better QC with other tools not in tormes, such as TrimGallore and/or shovill, you could always import the genome to tormes for the downstream analysis.

Let me know if I can help you with this!
Best,
Narciso

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants