# Retrieval of HIV sequences from Los Alamos database

In short: download the sequences from the Los Alamos (LANL) HIV database and merge them into one file per protein.

In [1]:
# Los Alamos output can only be generated per protein (since we require protein sequences, not DNA)
# Therefore these lists are needed for later looping through all .fasta files
HIVproteome = ['env', 'gag', 'pol', 'tat', 'rev', 'nef', 'vpr', 'vpu', 'vif']
%store HIVproteome
HIV2proteome = ['env', 'gag', 'pol', 'tat', 'rev', 'nef', 'vpr', 'vpx', 'vif']
%store HIV2proteome

Stored 'HIVproteome' (list)
Stored 'HIV2proteome' (list)


# HIV1

For proteinX in HIVproteome:

https://www.hiv.lanl.gov/content/sequence/NEWALIGN/align.html

Alignment type: Subtype reference  
Organism: HIV-1/SIVcpz  
Region: pre-defined proteinX  
Subtype: All M group (A-K + Recombinants)  
DNA/Protein: Protein  
Year: 2010  
Format: Fasta

Output files:    "HIV1_REF_2010_proteinX_PRO.fasta"

In [None]:
%store -r HIVproteome

HIV1AKletters = 'ABCDEFGHIJK' # for selecting only A-K strains within M group

for protein in HIVproteome: # loop through LANL download files for each protein (list restored via %store -r above)
    HIV1Reflist = [""] # initiate
    with open('/Users/pcevaal/Desktop/TheoreticalBiol/LANL/HIV1_REF_2010_%s_PRO.fasta' %protein,'r') as f:
        flist = f.read().split('>') # split file into item for each reference strain 
        for seq in flist:
            unaligned = seq.replace("-","") # aligned sequences contain - character which interferes with peptide generation
            lst = unaligned.split('\n') # split strain name from protein sequence
            if len(lst[0]) > 4 and lst[0][4] in HIV1AKletters: # 5th character of the name (after "Ref.") is the subtype; the length check skips the empty item before the first >
                HIV1Reflist.append(unaligned) # append unaligned fasta sequence to total list
    HIV1AK = ">".join(HIV1Reflist) # reorganise total list into fasta format, separating all strains with >
    with open('/Users/pcevaal/Desktop/TheoreticalBiol/HIV1AK_%s.fasta.txt' %protein,'w') as w:
        w.write(HIV1AK) # produce new output file

Output files: "HIV1AK_proteinX.fasta.txt"

Manual changes (sorry for it being manual):  
 
Rename the falsely duplicated reference strain names: the first occurrence of Ref.F2.CM.95.95 becomes Ref.F2.CM.95.9a(5), and the first occurrence of Ref.H.BE.93.VI9 becomes Ref.H.BE.93.VIa(9).
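
The manual renaming could probably be scripted. The following is only a minimal sketch, assuming the two names listed above are prefixes of the full header lines in the fasta files. The helper names `rename_first_header` and `apply_renames` are hypothetical, and the demo headers and sequences are made up.

```python
def rename_first_header(fasta_text, old_prefix, new_prefix):
    """Rename only the first FASTA header that starts with old_prefix."""
    lines = fasta_text.split('\n')
    for i, line in enumerate(lines):
        if line.startswith('>' + old_prefix):
            # keep whatever follows the prefix in the original header
            lines[i] = '>' + new_prefix + line[len(old_prefix) + 1:]
            break  # later occurrences stay untouched
    return '\n'.join(lines)

# Renames taken from the manual-changes note above
RENAMES = [('Ref.F2.CM.95.95', 'Ref.F2.CM.95.9a(5)'),
           ('Ref.H.BE.93.VI9', 'Ref.H.BE.93.VIa(9)')]

def apply_renames(fasta_text, renames=RENAMES):
    """Apply every (old, new) rename once, in order."""
    for old, new in renames:
        fasta_text = rename_first_header(fasta_text, old, new)
    return fasta_text

# Toy input: header suffixes and sequences are invented for illustration
demo = ">Ref.F2.CM.95.95_x\nMKV\n>Ref.F2.CM.95.95_y\nMRV\n"
print(apply_renames(demo))
```

Applied to the contents of each "HIV1AK_proteinX.fasta.txt" file before writing, this would leave the second occurrence of each name unchanged, as in the manual edit.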

# HIV2

For proteinX in HIV2proteome:

https://www.hiv.lanl.gov/content/sequence/NEWALIGN/align.html

Alignment type: Web   
Organism: HIV-2/SIVsmm    
Region: pre-defined proteinX  
Subtype: All M group (A-K + Recombinants)    
DNA/Protein: Protein  
Year: 2016  
Format: Fasta

Output files:    "HIV2_ALL_2016_proteinX_PRO.fasta"

This yields 128 sequences (the Web alignment contains all known sequences). The reference strains for HIV2A and HIV2B are taken from Machado 2014 (see HIV2Refnames below).

In [25]:
%store -r HIV2proteome

HIV2Refnames = ['A.GM.87.D194.J04542','A.GH.x.GH1.M30895','A.CI.88.UC2.U38293','A.DE.x.BEN.M30502','B.CI.88.UC1.L07625','B.GH.86.D205_ALT.X61240','B.CI.x.EHO.U27200','B.JP.01.IMCJ_KR020_1.AB100245','B.CI.x.20_56.AB485670']

for protein in HIV2proteome: # highly similar to above code for HIV1
    HIV2Reflist = [""]
    with open('/Users/pcevaal/Desktop/TheoreticalBiol/LANL/HIV2_ALL_2016_%s_PRO.fasta' %protein,'r') as f:
        flist = f.read().split('>')
        for seq in flist:
            unaligned = seq.replace("-","")
            lst = unaligned.split('\n')
            if lst[0] in HIV2Refnames:
                HIV2Reflist.append(unaligned) 
    HIV2AB = ">".join(HIV2Reflist)
    if protein == 'vpx': # HIV2 contains VPX protein instead of HIV1 VPU, but in order to allow automated processing, we combine these
        with open('/Users/pcevaal/Desktop/TheoreticalBiol/HIV2_AB_2016_vpu_PRO.fasta','w') as w:
            w.write(HIV2AB) # save HIV2 VPX file as HIV2 "VPU" 
    else:  # the other proteins are processed normally
        with open('/Users/pcevaal/Desktop/TheoreticalBiol/HIV2_AB_2016_%s_PRO.fasta' %protein,'w') as w:
            w.write(HIV2AB)

Output files: "HIV2_AB_2016_proteinX_PRO.fasta"  
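
Since the reference strains are selected by exact match on the header line, a name that differs slightly (for example through trailing whitespace) would be dropped silently. A minimal sketch of a check for this, assuming each header line consists of the strain name only. The helper names `fasta_headers` and `missing_references` are hypothetical, and the sequences in the demo are made up.

```python
def fasta_headers(fasta_text):
    """Return the header (without '>') of every record in a FASTA string."""
    return [line[1:].strip() for line in fasta_text.split('\n')
            if line.startswith('>')]

def missing_references(fasta_text, ref_names):
    """List the reference names that do not occur as a header."""
    found = set(fasta_headers(fasta_text))
    return [name for name in ref_names if name not in found]

# Toy input: two of the three wanted strains are present
demo = ">A.GM.87.D194.J04542\nMEPG\n>B.CI.x.EHO.U27200\nMDPR\n"
wanted = ['A.GM.87.D194.J04542', 'B.CI.x.EHO.U27200', 'A.DE.x.BEN.M30502']
print(missing_references(demo, wanted))  # ['A.DE.x.BEN.M30502']
```

Running this on the contents of each "HIV2_AB_2016_proteinX_PRO.fasta" file with HIV2Refnames should give an empty list if all nine reference strains were found.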

Add the HIV2 sequences to the HIV1 files, resulting in one general HIV file per protein that contains reference sequences from the different HIV1 and HIV2 groups.  
NB: HIV2 has a VPX protein where HIV1 has VPU. Both are accessory proteins and are therefore combined: the HIV2 VPX file is saved under the name VPU so that the fasta files can be joined.

In [26]:
%store -r HIVproteome
for protein in HIVproteome: # HIV2 VPX has been renamed VPU, so we can now loop through HIV(1)proteome
    HIV12 = "" # initiate
    with open('/Users/pcevaal/Desktop/TheoreticalBiol/HIV1AK_%s.fasta.txt' %protein, 'r') as f:
        data = f.read() # open HIV1 protein sequences
        HIV12 = HIV12 + data # append to merged HIV1/2 file
    with open('/Users/pcevaal/Desktop/TheoreticalBiol/HIV2_AB_2016_%s_PRO.fasta' %protein,'r') as g:
        data = g.read() # open HIV2 protein sequences
        HIV12 = HIV12 + "\n" + data # append to merged HIV1/2 file
    with open('/Users/pcevaal/Desktop/TheoreticalBiol/HIV12_%s.fasta' %protein, 'w') as w:
        w.write(HIV12) # save merged files for netMHCpan input

Output files: "HIV12_proteinX.fasta"