# Process phage reference genome

Phage genomes were downloaded from NCBI GenBank using the search keywords [phage] AND [pseudomonas].

This search returned 1,950 records (as of 15 December 2020).

Manual inspection shows that the download also includes phages of other bacteria, such as Salmonella and E. coli. This notebook removes those FASTA entries by keeping only entries whose description contains the keyword "pseudomonas", then saves the cleaned version.

In [1]:
from Bio import SeqIO
from core_acc_modules import paths_phage

In [2]:
# Keep only entries whose description contains the keyword "pseudomonas"
cleaned_records = []
keyword = "pseudomonas"
for record in SeqIO.parse(paths_phage.RAW_PHAGE_REF, "fasta"):
    description = record.description.lower()
    # Print every entry (kept or not) so the filter can be checked by eye
    print(f"{record.id} {description} {len(record)}")
    if keyword in description:
        cleaned_records.append(record)

NC_028999.1 nc_028999.1 pseudomonas phage phipa3, complete genome 309208
HQ630627.1 hq630627.1 pseudomonas phage phipa3, complete genome 309208
CP019649.1 cp019649.1 salmonella enterica subsp. enterica serovar typhimurium var. monophasic 4,5,12:i:- strain tw-stm6 chromosome, complete genome 4999862
MT133560.1 mt133560.1 pseudomonas phage fnug, complete genome 278899
MK599315.1 mk599315.1 pseudomonas phage pa1c, complete genome 304671
MH725810.1 mh725810.1 pseudomonas phage payy-2, complete genome 92348
MF974178.1 mf974178.1 pseudomonas phage ys35, complete genome 93296
NC_016765.1 nc_016765.1 pseudomonas phage vb_paes_pmg1, complete genome 54024
HQ711985.1 hq711985.1 pseudomonas phage vb_paes_pmg1, complete genome 54024
NC_031063.1 nc_031063.1 pseudomonas phage pev2, complete genome 72697
KU948710.1 ku948710.1 pseudomonas phage pev2, complete genome 72697
NC_020083.1 nc_020083.1 serratia phage phimam1, complete genome 157834
JX878496.1 jx878496.1 serratia phage phimam1, complete genome

MT094431.1 mt094431.1 pseudomonas phage bim bv-46, complete genome 38860
MK511036.1 mk511036.1 pseudomonas phage vb_pae_br327a, partial genome 37474
MK511018.1 mk511018.1 pseudomonas phage vb_pae_cf28a, partial genome 37395
MK511015.1 mk511015.1 pseudomonas phage vb_pae_br313c, partial genome 33113
MK511004.1 mk511004.1 pseudomonas phage vb_pae_cf74b, partial genome 34669
NC_018279.1 nc_018279.1 salmonella phage vb_soss_oslo, complete genome 49116
MH517022.1 mh517022.1 acinetobacter phage sh-ab 15599, complete genome 143204
AB008550.1 ab008550.1 pseudomonas phage phictx dna, complete genome 35580
JQ806764.1 jq806764.1 salmonella phage vb_soss_oslo, complete genome 49116
AF165214.2 af165214.2 bacteriophage d3, complete genome 56426
NC_007810.1 nc_007810.1 pseudomonas phage f8, complete genome 66015
NC_003278.1 nc_003278.1 pseudomonas phage phictx, complete genome 35580
MN850614.1 mn850614.1 escherichia phage adrianh, complete genome 88226
NC_041880.1 nc_041880.1 pseudomonas phage phipmw

NC_042091.1 nc_042091.1 pseudomonas phage nickie, complete genome 112225
NC_042054.1 nc_042054.1 pseudomonas phage vb_psym_kil4, complete genome 92816
NC_041968.1 nc_041968.1 pseudomonas phage vb_paem_g1, complete genome 87646
NC_041953.1 nc_041953.1 pseudomonas phage pamx25, complete genome 57899
MG432151.1 mg432151.1 pseudomonas phage delta, complete genome 45970
MG948468.1 mg948468.1 pantoea phage vb_pags_vid5, complete genome 61437
NC_031014.1 nc_031014.1 pseudomonas phage andromeda, complete genome 40008
NC_030934.1 nc_030934.1 pseudomonas phage vb_psym_kil1, complete genome 90552
NC_028931.1 nc_028931.1 pseudomonas phage pamx28, partial genome 55108
NC_028879.1 nc_028879.1 pseudomonas phage pamx42, complete genome 43225
NC_028809.1 nc_028809.1 pseudomonas phage pamx74, partial genome 58637
NC_028770.1 nc_028770.1 pseudomonas phage pamx11, complete genome 59878
NC_026602.1 nc_026602.1 pseudomonas phage vb_paep_pao1_ab05, complete genome 43639
NC_026599.1 nc_026599.1 pseudomonas ph

MH179473.2 mh179473.2 aeromonas phage 25ahydr2pp, complete genome 42696
MH593832.1 mh593832.1 phage sp. isolate ctcj9, complete genome 40520
MH179477.1 mh179477.1 aeromonas phage 60ahydr15pp, complete genome 165795
M11912.1 m11912.1 bacteriophage pf3 from pseudomonas aeruginosa (nijmegen strain), complete genome 5833
M19377.1 m19377.1 bacteriophage pf3 from pseudomonas aeruginosa (new york strain), complete genome 5833
MK774614.2 mk774614.2 aeromonas phage cf8, complete genome 238150
AY576273.1 ay576273.1 alphaproteobacteria phage phijl001, complete genome 63649
NC_027118.1 nc_027118.1 vibrio phage phivc8, complete genome 39422
JF712866.1 jf712866.1 vibrio phage phivc8, complete genome 39422
DI373498.1 di373498.1 kr 1020130142837-a/12: bacteriophage of pseudomonas aeruginosa and uses thereof 187
DI373497.1 di373497.1 kr 1020130142837-a/11: bacteriophage of pseudomonas aeruginosa and uses thereof 217
DI373496.1 di373496.1 kr 1020130142837-a/10: bacteriophage of pseudomonas aeruginosa an

In [3]:
# Write the cleaned FASTA records to file
SeqIO.write(cleaned_records, paths_phage.PHAGE_REF, "fasta")

1519
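
`SeqIO.write` returns the number of records it wrote, so 1,519 entries passed the keyword filter. To review what was dropped, one option is to tally the dropped entries by the first word of their description. The following is a minimal sketch, not part of the original notebook: `first_word_after_id` and `tally_dropped` are hypothetical helpers, and the sample pairs are a handful of lines copied from the printed output above rather than the full download.

```python
from collections import Counter

KEYWORD = "pseudomonas"  # same keyword as the filter cell above

# (record id, lowercased description) pairs copied from the printed output above.
# Illustrative sample only; this is not the full download.
sample_entries = [
    ("NC_028999.1", "nc_028999.1 pseudomonas phage phipa3, complete genome"),
    ("CP019649.1", "cp019649.1 salmonella enterica subsp. enterica serovar "
                   "typhimurium var. monophasic 4,5,12:i:- strain tw-stm6 "
                   "chromosome, complete genome"),
    ("NC_020083.1", "nc_020083.1 serratia phage phimam1, complete genome"),
    ("NC_018279.1", "nc_018279.1 salmonella phage vb_soss_oslo, complete genome"),
    ("AF165214.2", "af165214.2 bacteriophage d3, complete genome"),
    ("MN850614.1", "mn850614.1 escherichia phage adrianh, complete genome"),
    ("M11912.1", "m11912.1 bacteriophage pf3 from pseudomonas aeruginosa "
                 "(nijmegen strain), complete genome"),
]


def first_word_after_id(record_id, description):
    """Return the first word of a description, skipping a leading record id."""
    text = description.lower()
    rid = record_id.lower()
    if text.startswith(rid):
        text = text[len(rid):]
    words = text.split()
    return words[0] if words else "(empty)"


def tally_dropped(entries, keyword=KEYWORD):
    """Count entries that fail the keyword filter, grouped by first word."""
    counts = Counter()
    for record_id, description in entries:
        if keyword not in description.lower():
            counts[first_word_after_id(record_id, description)] += 1
    return counts


dropped_counts = tally_dropped(sample_entries)
for first_word, count in dropped_counts.most_common():
    print(first_word, count)
```

On the real data, the same helper could be fed `(record.id, record.description)` for each record from `SeqIO.parse`. A tally like this also surfaces entries such as AF165214.2 (bacteriophage d3), whose description names no host, so a description-only match cannot classify them either way and they may merit a manual check.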