Not clear how to make a custom DB #112

Open
iowa69 opened this issue Jul 22, 2022 · 3 comments
Labels
enhancement New feature or request

Comments

@iowa69

iowa69 commented Jul 22, 2022

How do I make/use a custom DB?

There are some input files that are not specified

% mob_cluster --mode build -f new_plasmids.fasta -p new_plasmids_mobtyper_report.txt -t new_plasmids_host_taxonomy.txt --outdir output_directory

-f new_plasmids.fasta MY PLASMIDS
-p new_plasmids_mobtyper_report.txt MOBTYPER OUTPUT
-t new_plasmids_host_taxonomy.txt WHAT IS THIS?

Then, how do I specify the folder where the new DB is created?

Thanks

@kbessonov1984
Collaborator

kbessonov1984 commented Jul 26, 2022

Hello,

Some tips on how to build a custom database, or add sequences to an existing one, are in #20; note that they were written for previous MOB-suite versions.

The -t option expects a tab-delimited file with 2 columns: a unique sample_id (e.g., an accession or other identifier) and the host organism in which the plasmid was identified. If the host organism is not known, that field can be left blank or marked as unknown or Unknown.
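
As a rough illustration of the two-column layout described above, a taxonomy file could be generated like this. The header names (`sample_id`, `organism`), the file name `taxonomy.txt`, the example accessions, and the exact spelling of the organism name are assumptions made for the sketch, not documented requirements of mob_cluster.

```shell
# Write a two-column, tab-delimited taxonomy file: sample_id <TAB> host organism.
# Header names, file name and accessions are illustrative assumptions.
out="taxonomy.txt"
printf 'sample_id\torganism\n' > "$out"
for acc in NZ_CP070412 NZ_CP070417 NZ_CP079127; do
    # Use the host species when it is known, otherwise write "Unknown"
    printf '%s\t%s\n' "$acc" "Klebsiella pneumoniae" >> "$out"
done
# Sanity check: report any line that does not have exactly two tab-separated fields
awk -F'\t' 'NF != 2 { print "bad line " NR ": " $0; bad = 1 } END { exit bad }' "$out"
```

Writing the tabs with `printf` avoids the common problem of an editor silently replacing them with spaces.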

The -p option takes a single mob-typer report with a single header. Some information is extracted from that report, and taxonomy is appended from the taxonomy file specified by the -t argument.

The --outdir option specifies the directory where the new database is written. With the -d option you can then point other MOB-suite modules at this directory to use it as a custom database.

Hope this helps. Sorry for the response delay

@iowa69
Author

iowa69 commented Jul 27, 2022

Thanks for the help,

I have a multi-fasta database (new_database.fasta) with some plasmids in it.

I have formatted the taxonomy file as shown below, tab-delimited. The plasmids are from Klebsiella pneumoniae, but I cannot figure out what to type in the organism column: (PROBLEM1)

sample_id   organism
NZ_CP070412 Unknown
NZ_CP070417 Unknown
NZ_CP079127 Unknown

After that, using "sample_mobtyper_results.txt", I ran mob_cluster, with everything in a folder on my desktop:

$ mob_cluster --mode build -f new_database1.fasta -p sample_mobtyper_results.txt -t taxonomy.txt --outdir new_databse1 --num_threads 12

In this case I obtain 10 files in the output directory:

clusters.txt
references_updated.fasta
references_updated.fasta.msh
references_updated.fasta.ndb
references_updated.fasta.nhr
references_updated.fasta.nin
references_updated.fasta.not
references_updated.fasta.nsq
references_updated.fasta.ntf
references_updated.fasta.nto

At this point I tried to use mob_recon, but it does not produce an output: (PROBLEM2)

$ mob_recon -o prova -i KP1057_ST512_KPC-3.fasta -s KP1057 -d new_databse1/

It re-installs the default NCBI database and overwrites the clusters.txt file.

I then re-specified the two files as described in the issue page, but the output did not change, so I replaced the clusters.txt file in the conda folder.

$ mob_recon -n 12 -o prova577c -i KP577_Complete.fasta -s KP577c --plasmid_db new_database/references_updated.fasta --plasmid_mash_db new_database/references_updated.fasta.msh

At this point the output changes, but I see that the program assigns contigs that are part of the plasmids (I am checking sequences I know) to the chromosome and excludes them. The program is not able to finish the analysis correctly.

Is what I have done correct? It is a bit tricky and it is not explained in this way, so maybe I am doing something wrong.

Any suggestions on how to increase accuracy in mob_recon? I can give you my inputs and database if that would help.

Thanks for all,

GL

@kbessonov1984
Collaborator

kbessonov1984 commented Jul 28, 2022

Good day,

Indeed, there are some practical aspects that need to be clarified: mob-cluster is scarcely documented because it was designed mainly for internal use.

The plasmid database build command is correct and generated all the necessary files. The most important ones are references_updated.fasta, its mash sketch references_updated.fasta.msh, and clusters.txt. Importantly, clusters.txt contains all the metadata on your custom plasmid sequences. Also remember that MOB-suite uses numerous databases, including databases of replicons, relaxases, repetitive elements and others. In your use case you just want to swap the plasmid database for a new one and leave all the other databases intact. The mob_recon command that you've specified needs an additional argument, --plasmid_meta, pointing to the new clusters.txt file from your custom database (i.e. --plasmid_meta new_database/clusters.txt).
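
Putting the three files together, a small pre-flight check along these lines may help before running mob_recon. It only verifies that the custom database directory contains the files named above and then prints the full command for review; the function name `check_custom_db` is hypothetical, and the sample and file names are the ones used earlier in this thread.

```shell
# Verify that a custom plasmid database directory holds the three files
# named above before pointing mob_recon at them.
check_custom_db() {
    dir="$1"
    for f in references_updated.fasta references_updated.fasta.msh clusters.txt; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            return 1
        fi
    done
}

db_dir="new_database"   # the directory given to mob_cluster via --outdir
if check_custom_db "$db_dir"; then
    # Print the command rather than run it, so the paths can be reviewed first
    echo mob_recon -n 12 -o prova577c -i KP577_Complete.fasta -s KP577c \
        --plasmid_db "$db_dir/references_updated.fasta" \
        --plasmid_mash_db "$db_dir/references_updated.fasta.msh" \
        --plasmid_meta "$db_dir/clusters.txt"
fi
```

Passing the three plasmid files explicitly leaves the replicon, relaxase and other default databases untouched.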

A given contig is assigned to the plasmid molecule type if it is a member of a known primary_cluster_id MOB-plasmid cluster defined in clusters.txt and is supported by a MASH distance lower than or equal to 0.06. In addition, the presence of relaxase and replicon sequences on a contig that is not a member of any known MOB-cluster should also result in that contig being classified as plasmid. Other criteria include the presence of repetitive elements and contig circularity status (i.e. contigs marked as circular by the Unicycler assembler or that could be circularized by circularize()). You can force circularization, and hence improve your chances of classifying a given contig as plasmid, by using the -c argument. Finally, there is the optional --genome_filter_db_prefix argument, which lets you specify the path to a reference set of closed genomes that is used to filter out problematic chromosomal sequences, improving contig classification accuracy, especially for chromosomally integrated plasmids.
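
The 0.06 distance cut-off can be explored by hand. The sketch below filters the tab-delimited table that `mash dist` prints (reference, query, distance, p-value, shared hashes) and keeps rows at or below the threshold. It only illustrates the distance criterion and is not MOB-suite's own classification code; the function name `filter_mash_hits`, the file name `contigs.fasta` and the sample rows are made up for the example.

```shell
# Keep mash dist rows whose distance (column 3) is <= the cut-off (default 0.06).
filter_mash_hits() {
    awk -F'\t' -v max="${1:-0.06}" '$3 + 0 <= max + 0'
}

# Typical use, assuming mash is installed and contigs.fasta is your assembly
# (-i asks mash to sketch each contig separately rather than the whole file):
#   mash dist -i new_database/references_updated.fasta.msh contigs.fasta | filter_mash_hits
# Made-up rows, for illustration only:
printf 'refA\tcontig1\t0.012\t0\t600/1000\nrefB\tcontig2\t0.25\t1e-5\t3/1000\n' | filter_mash_hits
```

A contig with no reference at or below the cut-off in your custom database would not satisfy the cluster-membership criterion described above.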

I am curious why your expected contigs were classified as chromosomal and not plasmid. Do they have any replicon or relaxase sequences on them? Check by BLAST against the rep.dna.fas and mob.proteins.faa databases, or check the respective columns of the mob-typer output.
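
A minimal sketch of that BLAST check, assuming BLAST+ is installed. The location of rep.dna.fas and mob.proteins.faa depends on where your MOB-suite databases were installed, so the directory argument is a placeholder; the function name, output file names and the e-value of 1e-5 are arbitrary choices for illustration.

```shell
# Screen contigs for replicon and relaxase hits with BLAST+.
# db_dir is a placeholder: set it to wherever your MOB-suite databases live.
screen_contigs() {
    contigs="$1"; db_dir="$2"
    command -v blastn >/dev/null 2>&1 || { echo "BLAST+ not found on PATH" >&2; return 1; }
    # Replicons are nucleotide sequences, so compare nucleotide to nucleotide
    blastn -query "$contigs" -subject "$db_dir/rep.dna.fas" -outfmt 6 -evalue 1e-5 > rep_hits.tsv
    # Relaxases are proteins, so let blastx translate the contigs
    blastx -query "$contigs" -subject "$db_dir/mob.proteins.faa" -outfmt 6 -evalue 1e-5 > mob_hits.tsv
    # Line counts give the number of hits in each tabular report
    wc -l rep_hits.tsv mob_hits.tsv
}
# Example call (paths are hypothetical):
#   screen_contigs problem_contigs.fasta /path/to/mob_suite/databases
```

Empty output files would suggest that the contigs carry no recognizable replicon or relaxase.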

Have you BLASTed those problematic contigs against the entire nucleotide collection (https://blast.ncbi.nlm.nih.gov/Blast.cgi), and did you get any plasmid hits?
What results do you get when using the default MOB-suite plasmid database?
Are your known plasmid contigs also classified as chromosomal with the default database, or only with your custom one?
It might take longer, but you can also add your sequences to the existing plasmid database and see what results you get.

The database initialization routine is triggered if the --database_directory parameter is specified AND the new database path does not contain the status.txt file. This scenario triggers the mob_init routine, which initializes all databases, including a lengthy taxonomy initialization for the ETE3 library (stored in taxa.sql) to support host-range module predictions.
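
The condition can be checked before running anything. This sketch only tests for status.txt in a directory; it assumes that -d is the short form of --database_directory, and the function name `needs_init` is hypothetical.

```shell
# Report whether pointing -d / --database_directory at this path would
# trigger the full initialization described above (no status.txt present).
needs_init() {
    [ ! -f "$1/status.txt" ]
}

db_dir="new_databse1"   # directory name as used earlier in this thread
if needs_init "$db_dir"; then
    echo "$db_dir has no status.txt: using it as the database directory would start mob_init"
else
    echo "$db_dir looks initialized"
fi
```

A directory produced by mob_cluster alone would be reported as needing initialization, which is consistent with the re-install seen above.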

jrober84 added the enhancement label Jun 8, 2023