Error in step: Using the VFDB Virulence Factor Database with SRST2 #59

schultzm · 2016-03-22T03:35:24Z

Error in step: Using the VFDB Virulence Factor Database with SRST2

1: The URL to the VFDB has changed and, using wget, should now be:

wget http://www.mgc.ac.cn/VFs/Down/VFDB_setB_nt.fas.gz

2: We need three extra steps for the database parsing scripts to work.

i) gunzip VFDB_setB_nt.fas.gz
ii) mv VFDB_setB_nt.fas VFDB_setB_nt.fas.ffn
iii) Save the following code as convert.py and run it with python convert.py. It removes the "(gi:xxxxx)" part of each sequence header and replaces commas with '_', because downstream processing splits on commas:

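# Strip the "(gi:xxxxx)" token from each FASTA header and replace commas with '_'.
with open("VFDB_setB_nt.fas.ffn", "r") as input_handle:
    with open("VFDB_setB_nt.fas_corrected.ffn", "w") as output_handle:
        for line in input_handle:
            if '(gi:' in line:
                # Put a space before "(gi:" so it splits off as its own field.
                line = line.replace('(gi:', ' (gi:').replace(',', '_')
                fields = line.split()
                # fields[0] is the sequence ID, fields[1] is the "(gi:xxxxx)" token.
                output_handle.write(fields[0] + ' ' + ' '.join(fields[2:]) + '\n')
            else:
                output_handle.write(line)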
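
For anyone who would rather do steps i) to iii) in one go, here is a rough sketch that reads the gzipped download directly and writes the cleaned file. It is not part of SRST2; the names clean_header and prepare_vfdb are made up for this example, and the sample header is invented purely to show the transformation. It assumes the "(gi:xxxxx)" tag contains no closing parenthesis. Unlike the script above, it replaces commas in every header line, not only in headers that carry a "(gi:" tag.

```python
import gzip
import re

# Matches a "(gi:12345)" token plus any whitespace directly before it.
GI_TAG = re.compile(r"\s*\(gi:[^)]*\)")


def clean_header(line):
    """Return a FASTA header without its "(gi:...)" tag and without commas.

    Lines that do not start with ">" (sequence lines) are returned unchanged.
    """
    if not line.startswith(">"):
        return line
    return GI_TAG.sub("", line, count=1).replace(",", "_")


def prepare_vfdb(gz_path, out_path):
    """Decompress gz_path (hypothetical helper) and write cleaned FASTA to out_path."""
    with gzip.open(gz_path, "rt") as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(clean_header(line))


# Invented header for illustration only; not a real VFDB record.
example = ">VFG000001(gi:12345) (abcA) example protein, subunit A [Example species]\n"
print(clean_header(example), end="")
# prints: >VFG000001 (abcA) example protein_ subunit A [Example species]
```

With the file from step 1 in the current directory, calling prepare_vfdb("VFDB_setB_nt.fas.gz", "VFDB_setB_nt.fas_corrected.ffn") should produce the same output file name as convert.py, without the separate gunzip and mv steps.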

Without removing the text "(gi:xxxxx)" from the header, this next step will not work:

python VFDB_cdhit_to_csv.py --cluster_file Clostridium_cdhit90.clstr --infile Clostridium.fsa --outfile Clostridium_cdhit90.csv

Also, there is an outdated comment on line 46 of the script VFDB_cdhit_to_csv.py (the ID pattern should read "VFGxxxxxx"):

schultzm@x:gene_dbs $ grep 'the unique ID R0xxx' ~/VFDB_cdhit_to_csv.py -n
46:         database[ClusterNr].append(seqID) # for virulence gene DB, this is the unique ID R0xxx


katholt · 2016-03-22T05:17:03Z

Would be great if you could just put your edits into the original script and push the changes.

On 22 March 2016 at 3:36:04 pm, Mark Schultz (notifications@github.com) wrote:

Actually, it's a bit more than the above. Need to remove the whole "(gi:xxxxxxxx)" bit. Here's how I did it using python:

with open("VFDB_setB_nt.fas.ffn", "r") as input_handle:
    with open("VFDB_setB_nt.fas.ffn_edit", "w") as output_handle:
        for line in input_handle:
            if ' (gi:' in line:
                line = line.split()
                output_handle.write(line[0]+' '+' '.join(line[2:])+'\n')
            else:
                output_handle.write(line)


rrwick · 2016-06-28T07:34:46Z

Fixed in 5b1639b - thanks!

aphayt mentioned this issue May 18, 2016

UnboundLocalError for VFDB_cdhit_to_csv.py #52

Closed

rrwick self-assigned this Jun 27, 2016

rrwick closed this as completed Jun 28, 2016
