How to create custom databases #10

mradz19 · 2023-09-19T05:11:50Z

Thanks for the program.

I'm not sure how to go about building a custom database, for example I am interested in looking at S. aureus strains in my metagenome samples but the database build directions aren't really clear. Do I need to download all the available genomes for S.aureus from NCBI then use them as input for StrainScan_build.py?

liaoherui · 2023-09-19T05:24:53Z

Hi, thanks for using StrainScan!

And yes! If you want buid a custom database, then you can download all available genomes for targeted species from NCBI and then use them as input for StrainScan_build.py. If there are too many genomes (e.g. > 2000 or even > 5000 strain genomes), then it may require a lot of computational resource. In this case, you may consider using complete genomes as input to StrainScan_build.py

mradz19 · 2023-09-19T05:33:21Z

Thanks for the help! yes there are over 80,000 genomes for S.aureus on ncbi, but filtering for complete brings this down to 1600. Regarding the -i input option for strainscan_build, can I provide a directory containing the fasta files?

liaoherui · 2023-09-19T05:36:46Z

Yes, you can use the command python StrainScan_build.py -i <Input_Genomes> -o <Database_Dir> (<Input_Genomes> refers to the directory containing the fasta files) to build your custom database.

Please note that: If you install StrainScan through Bioconda, then you can run the command below to use the latest GitHub version, which has more new features. (Note: you should run the code under the Bioconda environment you have built)

git clone https://github.com/liaoherui/StrainScan.git
cd StrainScan
chmod 755 library/jellyfish-linux
chmod 755 library/dashing_s128
python StrainScan.py -h

mradz19 · 2023-09-19T05:38:08Z

Thanks, I did install through bioconda so I will use those commands.

mradz19 · 2023-09-20T01:54:05Z

I have tried running Strainscan_build.py but i get this error:

Traceback (most recent call last):
  File "StrainScan/StrainScan_build.py", line 8, in <module>
    from library import Cluster,Unique_kmer_detect_direct,Build_kmer_sets_unique_region_lasso_test_allinone_sp,select_rep,Build_overlap_matrix_sp,Build_overlap_matrix_sp_jellyfish,Build_tree,Build_tree_mem
  File "/data/strainscan/StrainScan/library/Unique_kmer_detect_direct.py", line 5, in <module>
    import seqpy
ModuleNotFoundError: No module named 'seqpy'

mradz19 · 2023-09-20T01:57:54Z

Nevermind i got around it by running strainscan_build instead of StrainScan_build.py

mradz19 · 2023-09-20T02:02:14Z

Sorry is this the correct way to build the database:

strainscan_build -i ../staph/*fna.gz -o database

It seems to have worked, however when i run strainscan i get this error:

strainscan -i ../strainphlan_analysis/fastq/CW2_S5_clean_unmapped_R1.fastq.gz -d database/ -b 1 -o cw2_out
usage: StrainScan.py [-h] -i INPUT_FQ -d DB_DIR [-o OUT_DIR] [-k KSIZE]
                     [-l LDEP] [-s MSN]
StrainScan.py: error: unrecognized arguments: -b 1
(strainscan) [mradzieta@sc-prod-01 strainscan]$ strainscan -i ../strainphlan_analysis/fastq/CW2_S5_clean_unmapped_R1.fastq.gz -d database/ -o cw2_out
Traceback (most recent call last):
  File "/kusers/ancillary/anaconda3/envs/strainscan/bin/strainscan", line 10, in <module>
    sys.exit(main())
  File "/kusers/ancillary/anaconda3/envs/strainscan/lib/python3.7/site-packages/StrainScan/StrainScan.py", line 89, in main
    cls_dict=identify.identify_cluster(in_fq,db_dir+'/Tree_database',[0.1,0.4,1])
  File "/data/anaconda3/envs/strainscan/lib/python3.7/site-packages/StrainScan/library/identify.py", line 373, in identify_cluster
    tree, GCF = read_tree_structure(db_dir)
  File "/data/anaconda3/envs/strainscan/lib/python3.7/site-packages/StrainScan/library/identify.py", line 13, in read_tree_structure
    f = open(db_dir+"/tree_structure.txt", "r")
FileNotFoundError: [Errno 2] No such file or directory: 'database//Tree_database/tree_structure.txt'

liaoherui · 2023-09-20T02:10:52Z

Hi,

First, in your case, the construction script (StrainScan_build.py) receives errors due to a local library problem. I will check it.

To use "-b" parameter, you are suggested to use the latest version of StrainScan as I mentioned above #10 (comment). Given the code in #10 (comment), you can try python StrainScan.py -i <Input_reads> -d <Database_Dir> -o <Output_Dir> .

For the "FileNotFoundError", it is a result of file missing for the constructed database. Can you check the files in Tree_database folder and show them to me?

mradz19 · 2023-09-20T02:13:15Z

ls database/Tree_database/
nodes_kmer  overlap  test

ls database/Cluster_Result/
distance_matrix_rebuild.txt  distance_matrix.txt  hclsMap_95_recls.txt  hclsMap_95_Rep.txt  hclsMap_95.txt  Other_Strain_CN.txt

ls database/Kmer_Sets_L2/

Kmwe_sets_l2 is empty

liaoherui · 2023-09-20T02:17:10Z

It shows that the constructed database is not complete and thus the identification program failed. Can you check the log of the database construction script? I think there may be an error for this step.

mradz19 · 2023-09-20T02:18:39Z

How do I get the log file?

Was this the correct command for database generation?

strainscan_build -i ../staph/*fna.gz -o database

liaoherui · 2023-09-20T02:25:08Z

Oh, I may know the reason. "strainscan_build" doesn't support the construction of ".gz" format genomes while the latest "StrainScan_build.py" has this function. If you still want to use "strainscan_build" to build the database, then you need to decompress all genomes and fed these decompressed genomes to "strainscan_build".

If I fixed the problem of "StrainScan_build.py" in your case, I would let you know. Also, I will upload the latest version to bioconda asap to avoild these problems.

mradz19 · 2023-09-20T02:30:54Z

I decompressed all the fna files but nothing is generated, i ran this code

strainscan_build -i ../staph/*.fna -o database

And all of the fna files are printed on the screen but nothing is generated in the database directory

liaoherui · 2023-09-20T02:33:29Z

Then this is a little strange. Can you send me several genomes (10-20 genomes would be ideal) you used to build the database? Then I can test the program to discover the potential reason for this problem. Thanks!

mradz19 · 2023-09-20T02:45:54Z

5 of the genomes can be found in this link:

https://we.tl/t-4uRzqmYeYA

liaoherui · 2023-09-20T02:55:48Z

Hi,

I tested the program with the genomes you sent to me, and it worked well on my end (see below). The command in "build.sh" is: strainscan_build -i ref -o sau

Note: you should use strainscan_build -i ../staph/ -o database rather than strainscan_build -i ../staph/*.fna -o database

mradz19 · 2023-09-20T03:30:42Z

Ok just directing to the directory seems to be working now. I will see how it goes

mradz19 · 2023-09-20T05:47:55Z

So it ran for a few hours then stopped with this message:

Traceback (most recent call last):
  File "/kusers/ancillary/anaconda3/envs/strainscan/bin/strainscan_build", line 10, in <module>
    sys.exit(main())
  File "/kusers/ancillary/anaconda3/envs/strainscan/lib/python3.7/site-packages/StrainScan/StrainScan_build.py", line 161, in main
    Build_tree.build_tree([cls_res+'/distance_matrix.txt',cls_res+'/hclsMap_95_recls.txt',out_dir+'/Tree_database',31,params])
  File "/data/anaconda3/envs/strainscan/lib/python3.7/site-packages/StrainScan/library/Build_tree.py", line 287, in build_tree
    cls_dist, mapping, tree, depths, depths_mapping = hierarchy(fna_mapping, dist)
  File "/data/anaconda3/envs/strainscan/lib/python3.7/site-packages/StrainScan/library/Build_tree.py", line 59, in hierarchy
    mapping[i] -= 2
  File "/kusers/ancillary/anaconda3/envs/strainscan/lib/python3.7/site-packages/bidict/_mut.py", line 78, in __setitem__
    self._put(key, val, self.on_dup)
  File "/kusers/ancillary/anaconda3/envs/strainscan/lib/python3.7/site-packages/bidict/_base.py", line 219, in _put
    dedup_result = self._dedup_item(key, val, on_dup)
  File "/kusers/ancillary/anaconda3/envs/strainscan/lib/python3.7/site-packages/bidict/_base.py", line 254, in _dedup_item
    raise KeyAndValueDuplicationError(key, val)
bidict.KeyAndValueDuplicationError: (303, 35)

liaoherui · 2023-09-20T06:05:09Z

Hi,

This seems a bug for the k-mer tree indexing step. We will check it. In addition, I will build the S. aureus database (with all complete genomes from NCBI) and then make the database publicly accessible (via a given link). In this case, you can use that database directly without building your own custom database. I will let you know if this is done (may require 1~2 days due to my workload and the program running time).

Note: The Bioconda version is relatively older than the current version on the GitHub. Thus, there could be more potential bugs when using the program. If you are urgent, you may consider trying the GitHub version (see the install mannual below).

mradz19 · 2023-09-20T06:42:43Z

Ok great thank you for that!

liaoherui · 2023-09-20T07:00:15Z

OK. It seems the ".yaml" file misses some required packages. Will check and update it later.

For this error, you can fix it by installing the package. conda install -c conda-forge psutil

mradz19 · 2023-09-20T23:01:01Z

Yep I installed psutil with pip and strainscan_build has been running smoothly overnight.

mradz19 · 2023-09-21T23:27:19Z

Good news is the database build worked smoothly. However I am now trying to run StrainScan.py but I am getting this error:

Traceback (most recent call last):
  File "StrainScan.py", line 275, in <module>
    sys.exit(main())
  File "StrainScan.py", line 196, in main
    cls_dict=identify.identify_cluster(in_fq,db_dir+'/Tree_database',[0.1,0.4,1])
  File "/data/strainscan/StrainScan/library/identify.py", line 419, in identify_cluster
    search(pending, match_results, db_dir, valid_kmers, length, cov, abundance, cov_cutoff, ab_cutoff, results, leaves, res_temp, tree, overlapping_info)
  File "/data/strainscan/StrainScan/library/identify.py", line 258, in search
    print("parent node: %d ->"%tree.parent(group[0].identifier).identifier)
IndexError: list index out of range

liaoherui · 2023-09-22T04:04:28Z

Hi,

We tested the program with our constructed Sau database, and it worked well on our end. Thus, it's hard to debug without the input fastq data that brings the error. In this case, would you mind providing the fastq data to us for debugging? Thanks!

mradz19 · 2023-09-22T04:19:15Z

Here is a link to forward read:
https://we.tl/t-8KkNR7hBBb

And reverse read:
https://we.tl/t-2GjdLkATdy

liaoherui · 2023-09-22T08:02:04Z

Hi,

I just uploaded my newly constructed Sau database to Zenodo (https://zenodo.org/record/8369285). You can download it for your analysis. I also tested your data with the command: python StrainScan.py -i data_user/CW2_S5_clean_unmapped_R1.fastq.gz -j data_user/CW2_S5_clean_unmapped_R2.fastq.gz -d DB_Sau -o CW2_S5 And it works well without any errors on my end.

Please note: You are supposed to clone the latest code in GitHub (we made some tiny modifications this morning), and then run the program.

Update: The database is also available via Google drive. Please check README file for more details.

mradz19 · 2023-09-25T02:46:07Z

Thank you, I downloaded your made database and updated to the latest code and it seems to be working smoothly now.

mradz19 closed this as completed Sep 25, 2023

NamenloseHeldin mentioned this issue Oct 6, 2023

How to create custom databases #11

Open

wangqi0000 mentioned this issue Oct 10, 2023

IndexError: list index out of range #13

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to create custom databases #10

How to create custom databases #10

mradz19 commented Sep 19, 2023

liaoherui commented Sep 19, 2023

mradz19 commented Sep 19, 2023

liaoherui commented Sep 19, 2023 •

edited

Loading

mradz19 commented Sep 19, 2023

mradz19 commented Sep 20, 2023

mradz19 commented Sep 20, 2023

mradz19 commented Sep 20, 2023

liaoherui commented Sep 20, 2023

mradz19 commented Sep 20, 2023

liaoherui commented Sep 20, 2023

mradz19 commented Sep 20, 2023

liaoherui commented Sep 20, 2023

mradz19 commented Sep 20, 2023

liaoherui commented Sep 20, 2023

mradz19 commented Sep 20, 2023

liaoherui commented Sep 20, 2023

mradz19 commented Sep 20, 2023

mradz19 commented Sep 20, 2023

liaoherui commented Sep 20, 2023 •

edited

Loading

mradz19 commented Sep 20, 2023

liaoherui commented Sep 20, 2023

mradz19 commented Sep 20, 2023

mradz19 commented Sep 21, 2023

liaoherui commented Sep 22, 2023

mradz19 commented Sep 22, 2023

liaoherui commented Sep 22, 2023 •

edited

Loading

mradz19 commented Sep 25, 2023

How to create custom databases #10

How to create custom databases #10

Comments

mradz19 commented Sep 19, 2023

liaoherui commented Sep 19, 2023

mradz19 commented Sep 19, 2023

liaoherui commented Sep 19, 2023 • edited Loading

mradz19 commented Sep 19, 2023

mradz19 commented Sep 20, 2023

mradz19 commented Sep 20, 2023

mradz19 commented Sep 20, 2023

liaoherui commented Sep 20, 2023

mradz19 commented Sep 20, 2023

liaoherui commented Sep 20, 2023

mradz19 commented Sep 20, 2023

liaoherui commented Sep 20, 2023

mradz19 commented Sep 20, 2023

liaoherui commented Sep 20, 2023

mradz19 commented Sep 20, 2023

liaoherui commented Sep 20, 2023

mradz19 commented Sep 20, 2023

mradz19 commented Sep 20, 2023

liaoherui commented Sep 20, 2023 • edited Loading

mradz19 commented Sep 20, 2023

liaoherui commented Sep 20, 2023

mradz19 commented Sep 20, 2023

mradz19 commented Sep 21, 2023

liaoherui commented Sep 22, 2023

mradz19 commented Sep 22, 2023

liaoherui commented Sep 22, 2023 • edited Loading

mradz19 commented Sep 25, 2023

liaoherui commented Sep 19, 2023 •

edited

Loading

liaoherui commented Sep 20, 2023 •

edited

Loading

liaoherui commented Sep 22, 2023 •

edited

Loading