Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to create custom databases #10

Closed
mradz19 opened this issue Sep 19, 2023 · 27 comments
Closed

How to create custom databases #10

mradz19 opened this issue Sep 19, 2023 · 27 comments

Comments

@mradz19
Copy link

mradz19 commented Sep 19, 2023

Thanks for the program.

I'm not sure how to go about building a custom database, for example I am interested in looking at S. aureus strains in my metagenome samples but the database build directions aren't really clear. Do I need to download all the available genomes for S.aureus from NCBI then use them as input for StrainScan_build.py?

@liaoherui
Copy link
Owner

Hi, thanks for using StrainScan!

And yes! If you want buid a custom database, then you can download all available genomes for targeted species from NCBI and then use them as input for StrainScan_build.py. If there are too many genomes (e.g. > 2000 or even > 5000 strain genomes), then it may require a lot of computational resource. In this case, you may consider using complete genomes as input to StrainScan_build.py

@mradz19
Copy link
Author

mradz19 commented Sep 19, 2023

Thanks for the help! yes there are over 80,000 genomes for S.aureus on ncbi, but filtering for complete brings this down to 1600. Regarding the -i input option for strainscan_build, can I provide a directory containing the fasta files?

@liaoherui
Copy link
Owner

liaoherui commented Sep 19, 2023

Yes, you can use the command python StrainScan_build.py -i <Input_Genomes> -o <Database_Dir> (<Input_Genomes> refers to the directory containing the fasta files) to build your custom database.

Please note that: If you install StrainScan through Bioconda, then you can run the command below to use the latest GitHub version, which has more new features. (Note: you should run the code under the Bioconda environment you have built)

git clone https://github.com/liaoherui/StrainScan.git
cd StrainScan
chmod 755 library/jellyfish-linux
chmod 755 library/dashing_s128
python StrainScan.py -h

@mradz19
Copy link
Author

mradz19 commented Sep 19, 2023

Thanks, I did install through bioconda so I will use those commands.

@mradz19
Copy link
Author

mradz19 commented Sep 20, 2023

I have tried running Strainscan_build.py but i get this error:

Traceback (most recent call last):
  File "StrainScan/StrainScan_build.py", line 8, in <module>
    from library import Cluster,Unique_kmer_detect_direct,Build_kmer_sets_unique_region_lasso_test_allinone_sp,select_rep,Build_overlap_matrix_sp,Build_overlap_matrix_sp_jellyfish,Build_tree,Build_tree_mem
  File "/data/strainscan/StrainScan/library/Unique_kmer_detect_direct.py", line 5, in <module>
    import seqpy
ModuleNotFoundError: No module named 'seqpy'

@mradz19
Copy link
Author

mradz19 commented Sep 20, 2023

Nevermind i got around it by running strainscan_build instead of StrainScan_build.py

@mradz19
Copy link
Author

mradz19 commented Sep 20, 2023

Sorry is this the correct way to build the database:

strainscan_build -i ../staph/*fna.gz -o database

It seems to have worked, however when i run strainscan i get this error:

strainscan -i ../strainphlan_analysis/fastq/CW2_S5_clean_unmapped_R1.fastq.gz -d database/ -b 1 -o cw2_out
usage: StrainScan.py [-h] -i INPUT_FQ -d DB_DIR [-o OUT_DIR] [-k KSIZE]
                     [-l LDEP] [-s MSN]
StrainScan.py: error: unrecognized arguments: -b 1
(strainscan) [mradzieta@sc-prod-01 strainscan]$ strainscan -i ../strainphlan_analysis/fastq/CW2_S5_clean_unmapped_R1.fastq.gz -d database/ -o cw2_out
Traceback (most recent call last):
  File "/kusers/ancillary/anaconda3/envs/strainscan/bin/strainscan", line 10, in <module>
    sys.exit(main())
  File "/kusers/ancillary/anaconda3/envs/strainscan/lib/python3.7/site-packages/StrainScan/StrainScan.py", line 89, in main
    cls_dict=identify.identify_cluster(in_fq,db_dir+'/Tree_database',[0.1,0.4,1])
  File "/data/anaconda3/envs/strainscan/lib/python3.7/site-packages/StrainScan/library/identify.py", line 373, in identify_cluster
    tree, GCF = read_tree_structure(db_dir)
  File "/data/anaconda3/envs/strainscan/lib/python3.7/site-packages/StrainScan/library/identify.py", line 13, in read_tree_structure
    f = open(db_dir+"/tree_structure.txt", "r")
FileNotFoundError: [Errno 2] No such file or directory: 'database//Tree_database/tree_structure.txt'

@liaoherui
Copy link
Owner

Hi,

First, in your case, the construction script (StrainScan_build.py) receives errors due to a local library problem. I will check it.

To use "-b" parameter, you are suggested to use the latest version of StrainScan as I mentioned above #10 (comment). Given the code in #10 (comment), you can try python StrainScan.py -i <Input_reads> -d <Database_Dir> -o <Output_Dir> .

For the "FileNotFoundError", it is a result of file missing for the constructed database. Can you check the files in Tree_database folder and show them to me?

@mradz19
Copy link
Author

mradz19 commented Sep 20, 2023

ls database/Tree_database/
nodes_kmer  overlap  test
ls database/Cluster_Result/
distance_matrix_rebuild.txt  distance_matrix.txt  hclsMap_95_recls.txt  hclsMap_95_Rep.txt  hclsMap_95.txt  Other_Strain_CN.txt
ls database/Kmer_Sets_L2/

Kmwe_sets_l2 is empty

@liaoherui
Copy link
Owner

It shows that the constructed database is not complete and thus the identification program failed. Can you check the log of the database construction script? I think there may be an error for this step.

@mradz19
Copy link
Author

mradz19 commented Sep 20, 2023

How do I get the log file?

Was this the correct command for database generation?

strainscan_build -i ../staph/*fna.gz -o database

@liaoherui
Copy link
Owner

Oh, I may know the reason. "strainscan_build" doesn't support the construction of ".gz" format genomes while the latest "StrainScan_build.py" has this function. If you still want to use "strainscan_build" to build the database, then you need to decompress all genomes and fed these decompressed genomes to "strainscan_build".

If I fixed the problem of "StrainScan_build.py" in your case, I would let you know. Also, I will upload the latest version to bioconda asap to avoild these problems.

@mradz19
Copy link
Author

mradz19 commented Sep 20, 2023

I decompressed all the fna files but nothing is generated, i ran this code

strainscan_build -i ../staph/*.fna -o database

And all of the fna files are printed on the screen but nothing is generated in the database directory

@liaoherui
Copy link
Owner

Then this is a little strange. Can you send me several genomes (10-20 genomes would be ideal) you used to build the database? Then I can test the program to discover the potential reason for this problem. Thanks!

@mradz19
Copy link
Author

mradz19 commented Sep 20, 2023

5 of the genomes can be found in this link:

https://we.tl/t-4uRzqmYeYA

@liaoherui
Copy link
Owner

Hi,

I tested the program with the genomes you sent to me, and it worked well on my end (see below). The command in "build.sh" is: strainscan_build -i ref -o sau
image

Note: you should use strainscan_build -i ../staph/ -o database rather than strainscan_build -i ../staph/*.fna -o database

@mradz19
Copy link
Author

mradz19 commented Sep 20, 2023

Ok just directing to the directory seems to be working now. I will see how it goes

@mradz19
Copy link
Author

mradz19 commented Sep 20, 2023

So it ran for a few hours then stopped with this message:

Traceback (most recent call last):
  File "/kusers/ancillary/anaconda3/envs/strainscan/bin/strainscan_build", line 10, in <module>
    sys.exit(main())
  File "/kusers/ancillary/anaconda3/envs/strainscan/lib/python3.7/site-packages/StrainScan/StrainScan_build.py", line 161, in main
    Build_tree.build_tree([cls_res+'/distance_matrix.txt',cls_res+'/hclsMap_95_recls.txt',out_dir+'/Tree_database',31,params])
  File "/data/anaconda3/envs/strainscan/lib/python3.7/site-packages/StrainScan/library/Build_tree.py", line 287, in build_tree
    cls_dist, mapping, tree, depths, depths_mapping = hierarchy(fna_mapping, dist)
  File "/data/anaconda3/envs/strainscan/lib/python3.7/site-packages/StrainScan/library/Build_tree.py", line 59, in hierarchy
    mapping[i] -= 2
  File "/kusers/ancillary/anaconda3/envs/strainscan/lib/python3.7/site-packages/bidict/_mut.py", line 78, in __setitem__
    self._put(key, val, self.on_dup)
  File "/kusers/ancillary/anaconda3/envs/strainscan/lib/python3.7/site-packages/bidict/_base.py", line 219, in _put
    dedup_result = self._dedup_item(key, val, on_dup)
  File "/kusers/ancillary/anaconda3/envs/strainscan/lib/python3.7/site-packages/bidict/_base.py", line 254, in _dedup_item
    raise KeyAndValueDuplicationError(key, val)
bidict.KeyAndValueDuplicationError: (303, 35)

@liaoherui
Copy link
Owner

liaoherui commented Sep 20, 2023

Hi,

This seems a bug for the k-mer tree indexing step. We will check it. In addition, I will build the S. aureus database (with all complete genomes from NCBI) and then make the database publicly accessible (via a given link). In this case, you can use that database directly without building your own custom database. I will let you know if this is done (may require 1~2 days due to my workload and the program running time).

Note: The Bioconda version is relatively older than the current version on the GitHub. Thus, there could be more potential bugs when using the program. If you are urgent, you may consider trying the GitHub version (see the install mannual below).

image

@mradz19
Copy link
Author

mradz19 commented Sep 20, 2023

Ok great thank you for that!

@liaoherui
Copy link
Owner

OK. It seems the ".yaml" file misses some required packages. Will check and update it later.

For this error, you can fix it by installing the package. conda install -c conda-forge psutil

@mradz19
Copy link
Author

mradz19 commented Sep 20, 2023

Yep I installed psutil with pip and strainscan_build has been running smoothly overnight.

@mradz19
Copy link
Author

mradz19 commented Sep 21, 2023

Good news is the database build worked smoothly. However I am now trying to run StrainScan.py but I am getting this error:

Traceback (most recent call last):
  File "StrainScan.py", line 275, in <module>
    sys.exit(main())
  File "StrainScan.py", line 196, in main
    cls_dict=identify.identify_cluster(in_fq,db_dir+'/Tree_database',[0.1,0.4,1])
  File "/data/strainscan/StrainScan/library/identify.py", line 419, in identify_cluster
    search(pending, match_results, db_dir, valid_kmers, length, cov, abundance, cov_cutoff, ab_cutoff, results, leaves, res_temp, tree, overlapping_info)
  File "/data/strainscan/StrainScan/library/identify.py", line 258, in search
    print("parent node: %d ->"%tree.parent(group[0].identifier).identifier)
IndexError: list index out of range

@liaoherui
Copy link
Owner

Hi,

We tested the program with our constructed Sau database, and it worked well on our end. Thus, it's hard to debug without the input fastq data that brings the error. In this case, would you mind providing the fastq data to us for debugging? Thanks!

@mradz19
Copy link
Author

mradz19 commented Sep 22, 2023

Here is a link to forward read:
https://we.tl/t-8KkNR7hBBb

And reverse read:
https://we.tl/t-2GjdLkATdy

@liaoherui
Copy link
Owner

liaoherui commented Sep 22, 2023

Hi,

I just uploaded my newly constructed Sau database to Zenodo (https://zenodo.org/record/8369285). You can download it for your analysis. I also tested your data with the command: python StrainScan.py -i data_user/CW2_S5_clean_unmapped_R1.fastq.gz -j data_user/CW2_S5_clean_unmapped_R2.fastq.gz -d DB_Sau -o CW2_S5 And it works well without any errors on my end.

Please note: You are supposed to clone the latest code in GitHub (we made some tiny modifications this morning), and then run the program.

Update: The database is also available via Google drive. Please check README file for more details.

@mradz19
Copy link
Author

mradz19 commented Sep 25, 2023

Thank you, I downloaded your made database and updated to the latest code and it seems to be working smoothly now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants