file_suffix flag #22

Closed
Biofarmer opened this issue Apr 6, 2022 · 24 comments

Comments

@Biofarmer

Biofarmer commented Apr 6, 2022

Dear Gunc Team,

I am using GUNC v1.0.5 and have a question about --file_suffix. The suffix of my input files is .fna, and some genomes from NCBI contain .fna in the middle of their names. If I provide --input_dir and --file_suffix .fna, will GUNC handle genomes that contain .fna in the middle of their names correctly? To avoid the problem, I provide --input_file with the path of each genome instead. Is --file_suffix .fna still needed when --input_file is provided? Or do you have any other suggestions?

Many thanks
Wang

@Biofarmer
Author

In addition, can compressed FASTA files (.fna.gz) be used directly by GUNC?
Thanks

@fullama
Contributor

fullama commented Apr 7, 2022

> If I provide --input_dir and --file_suffix .fna, will GUNC handle genomes that contain .fna in the middle of their names correctly?

GUNC takes everything before the first occurrence of .fna in the file name and uses that as the sample name in the output.

> Is --file_suffix .fna still needed when --input_file is provided?

No, but the sample names in your output will then contain the suffix. The suffix is only there so that GUNC can remove it from the input file name.

> Can compressed FASTA files (.fna.gz) be used directly by GUNC?

Yes.

If you have any other questions, let me know!
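
The prefix rule described in this reply can be illustrated with a minimal Python sketch. The helper `sample_name` and the file names are hypothetical and this is not GUNC's own code; it only mirrors the stated behaviour of keeping everything before the first occurrence of the suffix.

```python
def sample_name(file_name: str, suffix: str) -> str:
    """Hypothetical helper: return the part of file_name before the first suffix."""
    head, found, _rest = file_name.partition(suffix)
    # If the suffix never occurs, keep the full file name unchanged.
    return head if found else file_name


# A name with ".fna" in the middle is cut at the first occurrence, not the last.
print(sample_name("GCF_000001.fna_assembly.fna", ".fna"))
print(sample_name("genome_A.fna.gz", ".fna"))
```

Under this rule, two files that share the same text before the first .fna would map to the same sample name, so it may be worth checking a file list for such collisions before a run.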

@fullama fullama closed this as completed Apr 7, 2022
@Biofarmer
Author

Hi, many thanks for your reply. So the only function of --file_suffix is to set the sample name in the output, and it has no effect on the detection of chimerism in genomes. Do I understand correctly?

@fullama
Contributor

fullama commented Apr 7, 2022

Correct! :)

@Biofarmer
Author

Okay, many thanks.

@Biofarmer
Author

Sorry, one further question about the databases: which one, proGenomes or GTDB, is generally recommended for detecting chimerism in genomes?
Thanks

@defleury

defleury commented Apr 8, 2022

Hi @Biofarmer !

Both databases work fine; we found little difference in accuracy. However, since the GUNC DB based on proGenomes is smaller, it is faster to run, so we use it by default.

@Biofarmer
Author

Okay, it is good to know. Thanks

@Biofarmer
Author

Hi, I am running GUNC on 10000 genomes with 5 threads. It has been running for 8 days and is now at the "Running Diamond" stage. There is no "diamond_output" folder in the output directory. Is that normal?
In addition, is it possible to check the progress of diamond? I want to know how much longer I need to wait; if it will take much longer, I may kill the job and rerun it with more threads or split the genomes into parts.
PS: I previously ran another 10000 genomes with 10 threads, which finished in 3 days.
Thanks

@fullama
Contributor

fullama commented Apr 19, 2022

There is currently no way of seeing the progress of diamond, and the run time can vary depending on the input. But maybe you just want to run them in smaller batches? If you are running so many genomes at once, I would increase both CPUs and memory.
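
One way to prepare smaller batches is sketched below in Python. It assumes that --input_file accepts a text file with one FASTA path per line, as the earlier question in this thread suggests; the function name, the batch file naming and the dummy paths are all hypothetical.

```python
import tempfile
from pathlib import Path


def write_batches(genome_paths, batch_size, list_dir):
    """Split genome paths into batches and write one path-list file per batch."""
    list_dir = Path(list_dir)
    list_dir.mkdir(parents=True, exist_ok=True)
    list_files = []
    for start in range(0, len(genome_paths), batch_size):
        batch = genome_paths[start:start + batch_size]
        # Hypothetical naming scheme: batch_0000.txt, batch_0001.txt, ...
        list_file = list_dir / f"batch_{start // batch_size:04d}.txt"
        list_file.write_text("\n".join(str(p) for p in batch) + "\n")
        list_files.append(list_file)
    return list_files


# Example with 10 dummy paths split into batches of 4.
paths = [f"/data/genomes/g{i}.fna" for i in range(10)]
files = write_batches(paths, 4, tempfile.mkdtemp())
print([f.name for f in files])
```

Each list file could then be passed to its own GUNC job, which also makes it easier to rerun only a batch that failed or was killed.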

@Biofarmer
Author

Hi, thanks for your reply. In addition, if I understand correctly, the gene call files from all input genomes are merged. If so, does the label of each contig (the text after ">" but before the first space, which prodigal uses as the gene ID) need to be unique across all input genomes? Or does it not matter? Thanks

@fullama
Contributor

fullama commented Apr 19, 2022

It shouldn't matter. They are merged but are tagged with the name of the genome file, so they can be separated after diamond has run.
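
The idea of tagging merged sequences with their genome name can be sketched as follows. This is an illustration only, not GUNC's implementation: the separator `::`, the function names and the IDs are placeholders chosen for the example.

```python
SEP = "::"  # placeholder separator, chosen only for this illustration


def tag(genome: str, contig_id: str) -> str:
    """Prefix a contig ID with the genome it came from."""
    return f"{genome}{SEP}{contig_id}"


def untag(tagged_id: str):
    """Split a tagged ID back into (genome, contig_id) at the first separator."""
    genome, _, contig_id = tagged_id.partition(SEP)
    return genome, contig_id


# Two genomes that both contain a contig called "contig_1".
merged = [tag("genomeA", "contig_1"), tag("genomeB", "contig_1")]

# After the search step, hits can be grouped back by genome.
by_genome = {}
for tagged in merged:
    genome, contig_id = untag(tagged)
    by_genome.setdefault(genome, []).append(contig_id)
print(by_genome)
```

Because the genome name is part of every merged ID, identical contig labels in different genomes stay distinguishable, as long as the separator itself never appears in a genome name.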

@Biofarmer
Author

Biofarmer commented Apr 19, 2022

Okay, that's great. Thanks. The merged gene call file is intermediate, so it is deleted once the run has finished and cannot be seen, right?

@Biofarmer
Author

Biofarmer commented Apr 19, 2022

In addition, the --temp_dir directory is by default the current working directory. If I submit several jobs at once from the same working directory with different output directories, will GUNC select the right temporary files for each job? Do the temporary files of each job have different names? Thanks

@Biofarmer
Author

Hi, may I ask for answers to the questions above?

@fullama
Contributor

fullama commented May 12, 2022

I think it would be fine, but try it out to be sure.
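
If in doubt, one way to sidestep the question is to give every job its own temporary directory through --temp_dir. The Python sketch below only builds the argument list and does not run anything. The `gunc run` subcommand and the `--out_dir` flag are written from memory and should be treated as assumptions to verify against `gunc run --help`; the function name and paths are hypothetical.

```python
import tempfile
from pathlib import Path


def job_args(input_dir: str, out_dir: str, temp_root: str = ".") -> list:
    """Build an argument list with a fresh temporary directory for one job."""
    # mkdtemp creates a new, uniquely named directory on every call.
    temp_dir = tempfile.mkdtemp(prefix="gunc_tmp_", dir=temp_root)
    return [
        "gunc", "run",  # assumed subcommand; verify with the installed version
        "--input_dir", input_dir,
        "--file_suffix", ".fna",
        "--temp_dir", temp_dir,
        "--out_dir", out_dir,  # assumed flag name; check `gunc run --help`
    ]


# Two jobs started from the same place still get separate temp directories.
root = tempfile.gettempdir()
job_a = job_args("genomes_part1/", "out_part1/", temp_root=root)
job_b = job_args("genomes_part2/", "out_part2/", temp_root=root)
print(job_a[job_a.index("--temp_dir") + 1])
print(job_b[job_b.index("--temp_dir") + 1])
```

The resulting lists could be handed to `subprocess.run` or written into a job script. The temporary directories are not cleaned up here, so they would need to be removed once each job has finished.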

@Biofarmer
Author

Hi, many thanks for your confirmation.

@Biofarmer
Author

Biofarmer commented May 26, 2022

Hi, I just used GUNC to check a genome from NCBI, GCF_902703415.1_Combinated_assembly_ONT_-_Illumina_genomic.fna, and it failed. However, when I renamed the genome so that it no longer contained _-_, GUNC worked with every one of the following names:
GCF_902703415.1_Combinated_assembly_ONT_---_Illumina_genomic.fna
GCF_902703415.1_Combinated_assembly_ONT_--_Illumina_genomic.fna
GCF_902703415.1_Combinated_assembly_ONT__Illumina_genomic.fna
GCF_902703415.1_Combinated_assembly_ONT-Illumina_genomic.fna
GCF_902703415.1_Combinated_assembly_ONT_Illumina_genomic.fna
May I ask why only the genome name containing _-_ did not work with GUNC?

GUNC also ran for genome GCF_902109435.1_40087_F01_genomic.fna, but there were no values in the output. Is that due to the small size of the genome?

Thanks

@Biofarmer
Author

Another question: there is a slight difference in n_genes_mapped in the output between running one genome individually (--input_fasta genome.fna) and running a few genomes together (--input_dir genome_folder/ --file_suffix .fna). Is this normal, and if so, why?
Thanks

@fullama
Contributor

fullama commented May 30, 2022

Samples with _-_ don't work because GUNC uses it internally as a delimiter to label sequences with the sample's name when merging them together. I didn't think anyone would ever use _-_ in a sample name. I'll see if I can change it for the next version.
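
Until that changes, a pre-flight check on file names can catch the problem before a long run. The Python sketch below is a hypothetical helper, not part of GUNC: it lists names that contain the delimiter and proposes a renamed version, here by collapsing `_-_` to `_`, which matches one of the renamed variants reported to work above.

```python
from pathlib import Path

RESERVED = "_-_"  # the delimiter named in the comment above


def conflicting_names(paths):
    """Return the file names that contain the reserved delimiter."""
    return [Path(p).name for p in paths if RESERVED in Path(p).name]


def safe_name(file_name: str, replacement: str = "_") -> str:
    """Hypothetical rename rule: remove every occurrence of the delimiter."""
    if RESERVED in replacement:
        raise ValueError("replacement must not contain the reserved delimiter")
    # Repeat until stable, since one pass over "_-_-_" would leave "_-_" behind.
    while RESERVED in file_name:
        file_name = file_name.replace(RESERVED, replacement)
    return file_name


names = [
    "GCF_902703415.1_Combinated_assembly_ONT_-_Illumina_genomic.fna",
    "GCF_902109435.1_40087_F01_genomic.fna",
]
for name in conflicting_names(names):
    print(name, "->", safe_name(name))
```

Renaming changes the sample name that appears in the output, so a mapping from the new names back to the original accessions is worth keeping.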

@fullama
Contributor

fullama commented May 30, 2022

> Another question: there is a slight difference in n_genes_mapped in the output between running one genome individually (--input_fasta genome.fna) and running a few genomes together (--input_dir genome_folder/ --file_suffix .fna). Is this normal, and if so, why?

Can you give an example of where the output differs, so I can look more closely?

@fullama fullama reopened this May 30, 2022
@Biofarmer
Author

> Samples with _-_ don't work because GUNC uses it internally as a delimiter to label sequences with the sample's name when merging them together. I didn't think anyone would ever use _-_ in a sample name. I'll see if I can change it for the next version.

Thanks for the reply.

@Biofarmer
Author

> > Another question: there is a slight difference in n_genes_mapped in the output between running one genome individually (--input_fasta genome.fna) and running a few genomes together (--input_dir genome_folder/ --file_suffix .fna). Is this normal, and if so, why?
>
> Can you give an example of where the output differs, so I can look more closely?

Hi, take the genomes from NCBI (GCF_902703415.1 and GCF_900232175.1) as an example: n_genes_mapped is 5145 and 6693 when they are tested together, and 5157 and 6740 respectively when they are tested individually. The overall conclusion is almost the same.
By the way, is the result of pass.GUNC based on the third decimal place of clade_separation_score? I see genomes reported with the same clade_separation_score of 0.45 that are sometimes given False and sometimes True for pass.GUNC. So I am wondering whether clade_separation_score is only reported to two decimal places, while the judgement for pass.GUNC is based on at least three decimal places.

@fullama
Contributor

fullama commented May 31, 2023

> Hi, take the genomes from NCBI (GCF_902703415.1 and GCF_900232175.1) as an example: n_genes_mapped is 5145 and 6693 when they are tested together, and 5157 and 6740 respectively when they are tested individually. The overall conclusion is almost the same.

This will be fixed in the next version of GUNC.

> Is the result of pass.GUNC based on the third decimal place of clade_separation_score?

Yes. The output will also be amended to include more decimal places in the next version.
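
The effect can be reproduced with a small Python sketch. The cutoff of 0.45 and the "at or below passes" rule are assumptions made for this illustration, and the two scores are invented values, not GUNC output.

```python
CUTOFF = 0.45  # assumed pass/fail cutoff for clade_separation_score


def shown(css: float) -> str:
    """Format a score the way a two-decimal report column would show it."""
    return f"{css:.2f}"


def passes(css: float) -> bool:
    """Assumed rule: a genome passes when its full-precision score is at or below the cutoff."""
    return css <= CUTOFF


# Two hypothetical scores that look identical after rounding.
for css in (0.4462, 0.4538):
    print(shown(css), passes(css))
```

Both scores are displayed as 0.45, yet only the first is at or below the cutoff, which is consistent with seeing both True and False next to the same rounded value.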

@fullama fullama closed this as completed May 31, 2023