How to create custom databases #10
Comments
Hi, thanks for using StrainScan! And yes, if you want to build a custom database, you can download all available genomes for the target species from NCBI and use them as input for StrainScan_build.py. If there are too many genomes (e.g. > 2000 or even > 5000 strain genomes), the build may require a lot of computational resources. In that case, you may consider using only complete genomes as input to StrainScan_build.py. |
Thanks for the help! Yes, there are over 80,000 genomes for S. aureus on NCBI, but filtering for complete genomes brings this down to 1600. Regarding the -i input option for strainscan_build, can I provide a directory containing the FASTA files? |
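The "filter for complete genomes" step can be sketched in Python. This is only an illustration, not part of StrainScan: it assumes you have a local copy of an NCBI assembly summary table (tab-separated, with a `#`-prefixed header line naming columns such as `assembly_accession` and `assembly_level`), and the function name `complete_genome_rows` is made up for this example.

```python
import csv


def complete_genome_rows(summary_text):
    """Return the rows whose assembly_level is exactly 'Complete Genome'.

    Assumes an NCBI-style assembly summary: tab-separated data lines and a
    '#'-prefixed header line that starts with 'assembly_accession'.
    """
    header = None
    data_lines = []
    for line in summary_text.splitlines():
        if line.startswith("#"):
            # Comment lines are skipped, except the one carrying column names.
            stripped = line.lstrip("#").strip()
            if stripped.startswith("assembly_accession"):
                header = stripped.split("\t")
            continue
        if line.strip():
            data_lines.append(line)
    if header is None:
        raise ValueError("no header line starting with 'assembly_accession' found")

    # QUOTE_NONE: organism names may contain quote characters.
    reader = csv.DictReader(
        data_lines, fieldnames=header, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return [row for row in reader if row.get("assembly_level") == "Complete Genome"]


if __name__ == "__main__":
    # Tiny made-up table for demonstration; real tables have many more columns.
    demo = (
        "#assembly_accession\torganism_name\tassembly_level\n"
        "ACC_1\tStaphylococcus aureus\tComplete Genome\n"
        "ACC_2\tStaphylococcus aureus\tContig\n"
    )
    for row in complete_genome_rows(demo):
        print(row["assembly_accession"])
```

The returned rows can then be used to decide which genome files to download and place in the input directory.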
Yes, you can use the command Please note: if you installed StrainScan through Bioconda, you can run the command below to use the latest GitHub version, which has more new features. (Note: run it inside the Bioconda environment you built.)
|
Thanks, I did install through bioconda so I will use those commands. |
I have tried running StrainScan_build.py but I get this error:
|
Never mind, I got around it by running strainscan_build instead of StrainScan_build.py. |
Sorry, is this the correct way to build the database:
It seems to have worked; however, when I run strainscan I get this error:
|
Hi, First, in your case, the construction script (StrainScan_build.py) fails because of a local library problem. I will check it. To use the "-b" parameter, I suggest using the latest version of StrainScan, as I mentioned above in #10 (comment). Given the code in #10 (comment), you can try For the "FileNotFoundError", it results from files missing in the constructed database. Can you check the files in the Tree_database folder and show them to me? |
Kmwe_sets_l2 is empty |
This shows that the constructed database is incomplete, which is why the identification program failed. Can you check the log of the database construction script? I think there may have been an error at this step. |
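A quick way to spot an incomplete build is to scan the database folder for empty files and empty subdirectories. The sketch below is a generic check, not a StrainScan feature; the function name `find_empty_entries` is hypothetical, and it makes no assumption about which files a complete database should contain.

```python
from pathlib import Path


def find_empty_entries(db_dir):
    """Return sorted relative paths of empty files and empty directories.

    Directories are reported with a trailing '/' so they are easy to tell
    apart from zero-byte files.
    """
    root = Path(db_dir)
    empty = []
    for path in root.rglob("*"):
        if path.is_dir():
            # A directory with no children at all counts as empty.
            if not any(path.iterdir()):
                empty.append(path.relative_to(root).as_posix() + "/")
        elif path.is_file() and path.stat().st_size == 0:
            empty.append(path.relative_to(root).as_posix())
    return sorted(empty)


if __name__ == "__main__":
    import sys

    # Usage: python check_db.py <database directory>
    for entry in find_empty_entries(sys.argv[1]):
        print("empty:", entry)
```

An empty entry does not prove the build failed, but it is a reasonable first thing to report alongside the build log.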
How do I get the log file? Was this the correct command for database generation?
|
Oh, I may know the reason. "strainscan_build" doesn't support ".gz" format genomes, while the latest "StrainScan_build.py" does. If you still want to use "strainscan_build" to build the database, you need to decompress all genomes and feed the decompressed genomes to "strainscan_build". Once I fix the "StrainScan_build.py" problem in your case, I will let you know. Also, I will upload the latest version to Bioconda asap to avoid these problems. |
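For anyone who needs to decompress a folder of ".gz" genomes first, here is a minimal Python sketch using only the standard library. The function name `decompress_genomes` and the directory layout are assumptions for illustration; the original compressed files are left untouched.

```python
import gzip
import shutil
from pathlib import Path


def decompress_genomes(src_dir, dest_dir):
    """Decompress every '*.gz' file in src_dir into dest_dir.

    'genome.fna.gz' becomes 'genome.fna'. Returns the list of written paths.
    """
    src = Path(src_dir)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    written = []
    for gz_path in sorted(src.glob("*.gz")):
        out_path = dest / gz_path.name[: -len(".gz")]
        # Stream the data so large genomes are never fully loaded into memory.
        with gzip.open(gz_path, "rb") as fin, open(out_path, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        written.append(out_path)
    return written


if __name__ == "__main__":
    import sys

    # Usage: python decompress.py <dir with .gz genomes> <output dir>
    for path in decompress_genomes(sys.argv[1], sys.argv[2]):
        print("wrote", path)
```

The output directory can then be given to the build script as the genome input directory.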
I decompressed all the fna files but nothing is generated. I ran this code
All of the fna files are printed on the screen, but nothing is generated in the database directory. |
Then this is a little strange. Can you send me several genomes (10-20 genomes would be ideal) you used to build the database? Then I can test the program to discover the potential reason for this problem. Thanks! |
5 of the genomes can be found in this link: |
OK, just pointing to the directory seems to be working now. I will see how it goes. |
So it ran for a few hours then stopped with this message:
|
Hi, This seems to be a bug in the k-mer tree indexing step. We will check it. In addition, I will build the S. aureus database (with all complete genomes from NCBI) and make it publicly accessible via a link, so you can use that database directly without building your own. I will let you know when this is done (it may take 1~2 days due to my workload and the program's running time). Note: The Bioconda version is older than the current version on GitHub, so it may contain more bugs. If this is urgent, you may consider trying the GitHub version (see the install manual below). |
Ok great thank you for that! |
OK. It seems the ".yaml" file is missing some required packages. I will check and update it later. For this error, you can fix it by installing the package. |
Yep I installed psutil with pip and strainscan_build has been running smoothly overnight. |
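To catch a missing package before starting a long build, the active environment can be checked up front. This is a small sketch, not part of StrainScan; `missing_modules` is a hypothetical helper, and psutil is the only package name taken from this thread, so extend the list with whatever your own error messages report.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found.

    Only top-level names are supported (e.g. 'psutil', not 'pkg.sub').
    Nothing is imported; the check only looks the module up.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # psutil is the package that was missing in this thread.
    missing = missing_modules(["psutil"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages were found.")
```

Run it inside the same environment you use for the build, since each environment has its own set of installed packages.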
The good news is the database build worked smoothly. However, I am now trying to run StrainScan.py but I am getting this error:
|
Hi, We tested the program with our constructed Sau database, and it worked well on our end, so it's hard to debug without the input fastq data that triggers the error. Would you mind providing the fastq data to us for debugging? Thanks! |
Here is a link to forward read: And reverse read: |
Hi, I just uploaded my newly constructed Sau database to Zenodo (https://zenodo.org/record/8369285). You can download it for your analysis. I also tested your data with the command: Please note: you should clone the latest code from GitHub (we made some tiny modifications this morning) and then run the program. Update: The database is also available via Google Drive. Please check the README file for more details. |
Thank you, I downloaded the database you built and updated to the latest code, and it seems to be working smoothly now. |
Thanks for the program.
I'm not sure how to go about building a custom database. For example, I am interested in looking at S. aureus strains in my metagenome samples, but the database build directions aren't really clear. Do I need to download all the available genomes for S. aureus from NCBI and then use them as input for StrainScan_build.py?