# Program name: ReadSIFTS_FilterPDBDIMER.ipynb 
This notebook documents the detailed steps to obtain bacterial and human homodimer/heterodimer structures from the PDB. It can also be adapted to other multimers or other species in the PDB.

It starts by reading two (relatively "stable") SIFTS files, *pdb_chain_uniprot.csv* and *pdb_chain_taxonomy.csv*; retrieves the experimental method and resolution from the PDB API services (*REST*); checks chain length and taxonomy information; and lastly checks the chains in the first bioassembly by reading the header of the PDB file.

>- Run with python3.9
>- Install pandas, biopython, pypdb

Author: Haiqing Zhao, Honig Lab at Columbia University
<br>Related other programs: ReadPDBreport_FindHeterodimer.py
<br>Created: 08/11/2022
<br>Last Update: 09/09/2023

### Selecting conditions for a pure hetero- or homo-dimer of bacteria or human:
1. All included proteins must come from one bacterial species or from human (taxonomy ID 9606) [using files *SIFTS_pdb_chain_taxonomy.csv*; *NCBI_taxonomy_bacteria_full_result.txt*];
2. Number of chains = 2; one Uniprot protein for a homodimer, two Uniprot proteins for a heterodimer; sequence length(s) > 30 aa;
3. Resolution ≤ 4.0 Å (X-ray) or ≤ 4.5 Å (EM);
4. Select BioAssembly 1 with 2 chains, defined as dimeric by the author or by software;
5. Remove redundant dimers (same Uniprot pair) by keeping the structure with a significantly longer total length (significant: > 2-fold longer than the others) or, otherwise, the better resolution;
6. => *human_PDB_g2chains_hetero2uni_g30aa_reslgood_BU1_2chains_nochim_nonredun_2uni_05082023.csv*
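
Condition 5 can be sketched as a small pairwise rule. This is a minimal illustration, assuming each candidate is a tuple of PDB ID, resolution, and total resolved length; the function name `pick_representative`, the `fold` parameter, and the example entries are hypothetical and not part of the notebook.

```python
def pick_representative(current, candidate, fold=2.0):
    """Return the entry to keep for one Uniprot pair.

    Each entry is a tuple (pdb_id, resolution_in_angstrom, total_resolved_length).
    """
    _, cur_resl, cur_len = current
    _, cand_resl, cand_len = candidate
    if cand_len > fold * cur_len:   # candidate is significantly longer
        return candidate
    if cand_resl < cur_resl:        # otherwise prefer the better resolution
        return candidate
    return current

# Hypothetical entries that all map to the same Uniprot pair
kept = ("xxxa", 2.5, 300)
for entry in [("xxxb", 1.8, 310), ("xxxc", 3.5, 700), ("xxxd", 3.0, 720)]:
    kept = pick_representative(kept, entry)
print(kept)
```

Because the rule is applied pairwise in input order, the surviving entry can depend on the order in which structures are visited.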



In [281]:
import pandas as pd
import pypdb
from Bio.PDB.PDBParser import PDBParser
import Bio.PDB as PDB

localpathprefix = '/Users/hz2592/OneDrive - cumc.columbia.edu/WORK'

First, read the two SIFTS files, *pdb_chain_uniprot.csv* and *pdb_chain_taxonomy.csv*, and filter by Uniprot and taxonomy.

The two SIFTS files are downloaded from https://www.ebi.ac.uk/pdbe/docs/sifts/quick.html

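Remember to pre-process the SIFTS taxonomy file. The following command replaces the fourth comma on each line (a comma inside the SCIENTIFIC_NAME field) with a hyphen, so that fields are not misread: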
>`sed "s/,/\-/4" pdb_chain_taxonomy.csv > pdb_chain_taxonomy.fixed.csv`
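
For readers without `sed`, the same substitution can be sketched in Python. The helper name `replace_fourth_comma` and the example row are hypothetical; like the `sed` command, it only touches the fourth comma on a line and leaves any further commas in place.

```python
def replace_fourth_comma(line, repl="-"):
    """Replace only the 4th comma in a line, mirroring sed 's/,/-/4'."""
    parts = line.split(",", 4)        # at most 5 pieces: split at the first 4 commas
    if len(parts) < 5:                # fewer than 4 commas: nothing to replace
        return line
    return ",".join(parts[:4]) + repl + parts[4]

# Hypothetical row: the scientific name itself contains a comma
row = "1abc,A,12345,Genus species, strain X"
fixed = replace_fourth_comma(row)
print(fixed)
```

Applying this line by line while copying the file to *pdb_chain_taxonomy.fixed.csv* should give the same result as the shell command.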

In [38]:
# Load the two SIFTS pdb_chain files in two  dictionaries
### This block takes 7-8 min to run. 

SIFTS_pdb_chain_uniprot=localpathprefix+'/SIFTS_202111/pdb_chain_uniprot.csv'
SIFTS_pdb_chain_taxonomy=localpathprefix+'/SIFTS_202111/pdb_chain_taxonomy.csv'
#head pdb_chain_taxonomy.csv
#PDB,CHAIN,TAX_ID,SCIENTIFIC_NAME
#head pdb_chain_uniprot.csv
#PDB,CHAIN,SP_PRIMARY,RES_BEG,RES_END,PDB_BEG,PDB_END,SP_BEG,SP_END

df_SIFTS_chain_Uni = pd.read_csv(SIFTS_pdb_chain_uniprot,dtype=object,skiprows=1,usecols=['PDB','CHAIN','SP_PRIMARY','SP_BEG','SP_END']).fillna(0)
df_SIFTS_chain_Uni = df_SIFTS_chain_Uni.rename(columns={'SP_PRIMARY': 'Uniprot'})

dic_chain_Unis,dic_PDB_Chain_Unis={},{}
for index, row in df_SIFTS_chain_Uni.iterrows():
    pdb=row['PDB']
    chainlength=int(row['SP_END'])-int(row['SP_BEG'])
    if pdb in dic_PDB_Chain_Unis:
        if row['CHAIN'] not in dic_PDB_Chain_Unis[pdb]:
            dic_PDB_Chain_Unis[pdb].update({row['CHAIN']:[row['Uniprot'],chainlength]})
        elif row['Uniprot'] not in dic_PDB_Chain_Unis[pdb][row['CHAIN']]:
            dic_PDB_Chain_Unis[pdb][row['CHAIN']].extend([row['Uniprot'],chainlength])
    else:
        dic_PDB_Chain_Unis[pdb]={row['CHAIN']:[row['Uniprot'],chainlength]}

dic_PDB_Taxs={}
df_SIFTS_chain_Tax = pd.read_csv(SIFTS_pdb_chain_taxonomy,dtype=object,skiprows=1,usecols=['PDB','CHAIN','TAX_ID']) #.fillna(0)
for index, row in df_SIFTS_chain_Tax.iterrows():
    pdb=row['PDB']
    if pdb in dic_PDB_Taxs:
        if row['TAX_ID'] not in dic_PDB_Taxs[pdb]:
            dic_PDB_Taxs[pdb].append(row['TAX_ID'])
    else:
        dic_PDB_Taxs[pdb]=[row['TAX_ID']]



Note: because all we require is that the chains of one PDB entry come from the same species, we check the taxonomy at the PDB level. One scenario that we do not take into account is where SIFTS_chain_Tax is missing the information for one chain. For example, for 3pse, SIFTS_chain_Tax only contains chain B.
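
One way to spot such entries is to compare the chains listed in the two SIFTS tables. The sketch below uses toy DataFrames with made-up IDs (`aaaa`, `bbbb`) in place of the real files; it assumes both tables carry `PDB` and `CHAIN` columns, as in the file headers shown in the next code cell.

```python
import pandas as pd

# Toy stand-ins for the two SIFTS tables (hypothetical rows)
uni = pd.DataFrame({"PDB": ["aaaa", "aaaa", "bbbb", "bbbb"],
                    "CHAIN": ["A", "B", "A", "B"]})
tax = pd.DataFrame({"PDB": ["aaaa", "bbbb", "bbbb"],
                    "CHAIN": ["B", "A", "B"],
                    "TAX_ID": ["9606", "562", "562"]})

# Left-join the chain list from the Uniprot table onto the taxonomy table;
# chains present only on the left have no taxonomy record.
merged = uni.drop_duplicates().merge(
    tax[["PDB", "CHAIN"]].drop_duplicates(),
    on=["PDB", "CHAIN"], how="left", indicator=True)
missing = merged[merged["_merge"] == "left_only"]
incomplete_pdbs = sorted(missing["PDB"].unique())
print(incomplete_pdbs)
```

Entries reported this way could be reviewed by hand or excluded before the PDB-level taxonomy check.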

In [None]:
## Pre-filtering step: Define bacteria and human taxonomy 
#### Skip three blocks if already provided with the categorizedPDB_goodresolution file. 
#### This block takes 15-20min to run. To run, switch to code mode.
#### Full bacteria taxID list is downloaded from https://www.ncbi.nlm.nih.gov/taxonomy?term=txid2%5BSubtree%5D&cmd=DetailsSearch

print("Number of PDBs:",len(dic_PDB_Taxs))
bacteria_taxlistfile = '/Users/hz2592/Dropbox/Methods/bacteria_CSV/NCBI_taxonomy_bacteria_full_result.txt'
with open(bacteria_taxlistfile, "r") as filestream:
    # a set makes the membership tests in the loop below O(1) instead of scanning a list
    bacteria_taxlist = {line.rstrip() for line in filestream}

bacteria_PDBs=[];human_PDBs=[]
for i in dic_PDB_Taxs:
    if len(dic_PDB_Taxs[i]) ==1:
        if dic_PDB_Taxs[i][0] in bacteria_taxlist:
            bacteria_PDBs.append(i)
        if dic_PDB_Taxs[i][0]== '9606':
            human_PDBs.append(i)


In [None]:
## Pre-filtering step: from ≥2-chain PDB, define bacteria heteroPDB (2uni), bacteria homoPDB (1uni); same for human; select len ≥ 30aa
#### Skip two blocks if already provided with the categorizedPDB_goodresolution file. 
#### This block takes 1min to run. To run, switch to code mode.

bacteria_homoPDBs=[];bacteria_heteroPDBs=[]
for pdb in bacteria_PDBs:
    if pdb in dic_PDB_Chain_Unis:
        if len(dic_PDB_Chain_Unis[pdb]) >= 2: 
            unis = [chainrecord[0] for chainrecord in dic_PDB_Chain_Unis[pdb].values()]
            lengths = [chainrecord[1] for chainrecord in dic_PDB_Chain_Unis[pdb].values()]
            if len(set(unis)) == 1 and all([length >= 30 for length in lengths]):
                bacteria_homoPDBs.append(pdb)
            if len(set(unis)) == 2 and all([length >= 30 for length in lengths]):
                bacteria_heteroPDBs.append(pdb)
print('Number of bacteria_homoPDBs, bacteria_heteroPDBs:',len(bacteria_homoPDBs), len(bacteria_heteroPDBs))

human_homoPDBs=[];human_heter2merPDBs=[]
for pdb in human_PDBs:
    if pdb in dic_PDB_Chain_Unis:
        if len(dic_PDB_Chain_Unis[pdb]) >= 2: 
            unis = [chainrecord[0] for chainrecord in dic_PDB_Chain_Unis[pdb].values()]
            lengths = [chainrecord[1] for chainrecord in dic_PDB_Chain_Unis[pdb].values()]
            if len(set(unis)) == 1 and all([length >= 30 for length in lengths]):
                human_homoPDBs.append(pdb)
            if len(set(unis)) == 2 and all([length >= 30 for length in lengths]):
                human_heter2merPDBs.append(pdb)
print('Number of human_homoPDBs, human_heter2merPDBs:',len(human_homoPDBs),len(human_heter2merPDBs))
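
The bacterial and human loops above apply the same rule, so it can be sketched as one helper. This is an illustrative refactoring, not the notebook's code: `classify_by_uniprot` and the toy `chain_map` are hypothetical, and the map is assumed to have the same layout as `dic_PDB_Chain_Unis` ({pdb: {chain: [uniprot, length]}}).

```python
def classify_by_uniprot(pdb_ids, chain_map, min_len=30):
    """Split entries with >= 2 chains into homo (1 Uniprot) and hetero (2 Uniprot)."""
    homo, hetero = [], []
    for pdb in pdb_ids:
        chains = chain_map.get(pdb, {})
        if len(chains) < 2:
            continue
        # every chain must reach the minimum mapped length
        if not all(rec[1] >= min_len for rec in chains.values()):
            continue
        unis = {rec[0] for rec in chains.values()}
        if len(unis) == 1:
            homo.append(pdb)
        elif len(unis) == 2:
            hetero.append(pdb)
    return homo, hetero

# Toy data with made-up IDs and accessions
chain_map = {
    "aaaa": {"A": ["P00001", 120], "B": ["P00001", 120]},
    "bbbb": {"A": ["P00001", 200], "B": ["P00002", 25]},   # one chain too short
    "cccc": {"A": ["P00001", 200], "B": ["P00002", 90]},
    "dddd": {"A": ["P00003", 50]},                         # single chain
}
homo, hetero = classify_by_uniprot(["aaaa", "bbbb", "cccc", "dddd", "eeee"], chain_map)
print(homo, hetero)
```

With such a helper, the bacterial and human lists would each come from a single call on `bacteria_PDBs` or `human_PDBs`.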


In [None]:
## Pre-filtering step: from above, generate good-resolution lists for human/bacteria_hetero/homo: X-ray/EM ≤ 4/4.5 Å
#### Write categorizedPDB_goodresolution files
#### Skip to next block if already provided with the filtered_resolution file. 
#### This block takes 40-45 min to run. To run, switch to code mode.

def get_expmethod_reslution(pdbid):
    # Return (experimental_method, resolution) for a protein-only X-ray/EM entry, else (None, None)
    PDB_info = pypdb.get_entity_info(pdbid)
    if PDB_info is None:
        print('Problematic PDB:',pdbid)
        return None,None
    try:
        PDB_info = PDB_info['rcsb_entry_info']
        if PDB_info['selected_polymer_entity_types']=='Protein (only)' and (PDB_info['experimental_method'] in ['X-ray','ELECTRON MICROSCOPY','EM']) and PDB_info['resolution_combined'] is not None:
            return PDB_info['experimental_method'],PDB_info['resolution_combined'][0]
        return None,None
    except (KeyError, TypeError):
        print('KeyError or TypeError:',pdbid)
        return None, None

def select_goodresl_PDB_writefile(inputPDBlist,outfile):
    outPDB_resl={}
    for pdb in inputPDBlist:
        method,resolution = get_expmethod_reslution(pdb)
        if (method == 'X-ray' and resolution <= 4) or (method == 'ELECTRON MICROSCOPY' and resolution <= 4.5) or (method == 'EM' and resolution <= 4.5):
            outPDB_resl.update({pdb:resolution})   
    with open(outfile, 'w') as f:
        f.write("%s,%s\n"%("PDBID","Resolution"))
        for key in outPDB_resl.keys():
            f.write("%s,%s\n"%(key,outPDB_resl[key]))
    print("Selected PDB number from/to:",len(inputPDBlist),len(outPDB_resl))
    return

bacteria_homoPDBs_g2chains_resl_file=localpathprefix+'/SIFTS_202111/PDB_resl/bacteria_homoPDBs_g2chains_g30aa_reslgood_05092023.csv'
human_homoPDBs_g2chains_resl_file=localpathprefix+'/SIFTS_202111/PDB_resl/human_homoPDBs_g2chains_g30aa_reslgood_05092023.csv'
human_heter2merPDBs_g2chains_resl_file=localpathprefix+'/SIFTS_202111/PDB_resl/human_heter2merPDBs_g2chains_g30aa_reslgood_05092023.csv'
bacteria_heter2merPDBs_g2chains_resl_file=localpathprefix+'/SIFTS_202111/PDB_resl/bacteria_heter2merPDBs_g2chains_g30aa_reslgood_05092023.csv'

select_goodresl_PDB_writefile(bacteria_homoPDBs,bacteria_homoPDBs_g2chains_resl_file)
select_goodresl_PDB_writefile(human_homoPDBs,human_homoPDBs_g2chains_resl_file)
select_goodresl_PDB_writefile(human_heter2merPDBs,human_heter2merPDBs_g2chains_resl_file)
select_goodresl_PDB_writefile(bacteria_heteroPDBs,bacteria_heter2merPDBs_g2chains_resl_file)
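
The resolution test used above can also be written as a lookup table, which keeps the per-method cutoffs in one place. This is a sketch only; `RESOLUTION_CUTOFF` and `passes_resolution` are hypothetical names, and the cutoffs repeat the values used in `select_goodresl_PDB_writefile`.

```python
# Cutoffs in angstrom, keyed by the method labels checked in the block above
RESOLUTION_CUTOFF = {"X-ray": 4.0, "EM": 4.5, "ELECTRON MICROSCOPY": 4.5}

def passes_resolution(method, resolution, cutoffs=RESOLUTION_CUTOFF):
    """True if the method is known and the resolution is at or below its cutoff."""
    if method is None or resolution is None:
        return False
    cutoff = cutoffs.get(method)
    return cutoff is not None and resolution <= cutoff

print(passes_resolution("X-ray", 3.2), passes_resolution("EM", 4.8))
```

Entries whose method is not in the table, such as NMR, are rejected regardless of the reported value.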


In [164]:
# Load saved categorizedPDB_goodresolution files and read to dictionaries.

bacteria_homoPDBs_g2chains_resl_file=localpathprefix+'/SIFTS_202111/PDB_resl/bacteria_homoPDBs_g2chains_g30aa_reslgood_05092023.csv'
human_homoPDBs_g2chains_resl_file=localpathprefix+'/SIFTS_202111/PDB_resl/human_homoPDBs_g2chains_g30aa_reslgood_05092023.csv'
human_heter2merPDBs_g2chains_resl_file=localpathprefix+'/SIFTS_202111/PDB_resl/human_heter2merPDBs_g2chains_g30aa_reslgood_05092023.csv'
bacteria_heter2merPDBs_g2chains_resl_file=localpathprefix+'/SIFTS_202111/PDB_resl/bacteria_heter2merPDBs_g2chains_g30aa_reslgood_05092023.csv'

human_heter2merPDBs_resl = pd.read_csv(human_heter2merPDBs_g2chains_resl_file).set_index('PDBID').to_dict()['Resolution']
bacteria_heter2merPDBs_resl = pd.read_csv(bacteria_heter2merPDBs_g2chains_resl_file).set_index('PDBID').to_dict()['Resolution']
bacteria_homoPDBs_resl = pd.read_csv(bacteria_homoPDBs_g2chains_resl_file).set_index('PDBID').to_dict()['Resolution']
human_homoPDBs_resl = pd.read_csv(human_homoPDBs_g2chains_resl_file).set_index('PDBID').to_dict()['Resolution']

dict_PDB_resl = bacteria_homoPDBs_resl | human_homoPDBs_resl | human_heter2merPDBs_resl | bacteria_heter2merPDBs_resl


In [165]:
# Select PDBs with only TWO chains for a pure dimer list

bacteria_homoPDBs_2chains=[]
for pdb in bacteria_homoPDBs_resl:
    if pdb in dic_PDB_Chain_Unis:
        if len(dic_PDB_Chain_Unis[pdb]) == 2: 
            bacteria_homoPDBs_2chains.append(pdb)

human_homoPDBs_2chains=[]
for pdb in human_homoPDBs_resl:
    if pdb in dic_PDB_Chain_Unis:
        if len(dic_PDB_Chain_Unis[pdb]) == 2: 
            human_homoPDBs_2chains.append(pdb)

human_heter2merPDBs_2chains=[]
for pdb in human_heter2merPDBs_resl:
    if pdb in dic_PDB_Chain_Unis:
        if len(dic_PDB_Chain_Unis[pdb]) == 2: 
            human_heter2merPDBs_2chains.append(pdb)

bacteria_heter2merPDBs_2chains=[]
for pdb in bacteria_heter2merPDBs_resl:
    if pdb in dic_PDB_Chain_Unis:
        if len(dic_PDB_Chain_Unis[pdb]) == 2: 
            bacteria_heter2merPDBs_2chains.append(pdb)

print('Number of bacteria_homoPDBs_2chains, bacteria_heter2merPDBs_2chains: ',len(bacteria_homoPDBs_2chains),len(bacteria_heter2merPDBs_2chains))
print('Number of human_homoPDBs_2chains, human_heter2merPDBs_2chains: ',len(human_homoPDBs_2chains),len(human_heter2merPDBs_2chains))

human_heter2merPDBs_g2chains = human_heter2merPDBs_resl.keys()
print('Number of human_heter2merPDBs_g2chains: ',len(human_heter2merPDBs_g2chains))


Number of bacteria_homoPDBs_2chains, bacteria_heter2merPDBs_2chains:  16166 651
Number of human_homoPDBs_2chains, human_heter2merPDBs_2chains:  9455 1202
Number of human_heter2merPDBs_g2chains:  2637


### Check bio-molecule 1 (bio-unit 1, bio-assembly 1)
At present, neither the REST nor the GraphQL PDB API service provides the exact chain IDs that are in the first bio-assembly. 
The next block is written to do this job by reading the header of the PDB file. The REMARK 350 section records the author- or software-defined bio-assembly and describes which transformations are applied to which chains to generate it. 

Note that the number of "APPLY"s is equal to the number of MODELs in the pdb1 file. When, for example, two APPLYs of transformation occur in bio-assembly 1, one should calculate IFC across models, since the models together are considered one bio-assembly. 
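
The REMARK 350 layout described here can be illustrated on a few header lines held in memory. This is a minimal sketch on a made-up header, not the notebook's reader: `parse_biomolecule1` is a hypothetical name, and counting `BIOMT1` rows as the number of transformation operators is an assumption about the record layout.

```python
def parse_biomolecule1(header_lines):
    """Return (author_state, software_state, chain_ids, n_operators) for BIOMOLECULE 1."""
    author = software = None
    chains, n_operators, current = [], 0, None
    for line in header_lines:
        if not line.startswith("REMARK 350"):
            continue
        body = line[10:].strip()
        if body.startswith("BIOMOLECULE:"):
            current = body.split(":", 1)[1].strip()
            continue
        if current != "1":
            continue                      # ignore every other bio-assembly
        if body.startswith("AUTHOR DETERMINED BIOLOGICAL UNIT:"):
            author = body.split(":", 1)[1].strip()
        elif body.startswith("SOFTWARE DETERMINED QUATERNARY STRUCTURE:"):
            software = body.split(":", 1)[1].strip()
        elif "CHAINS:" in body:
            chains += [c.strip() for c in body.split(":", 1)[1].split(",") if c.strip()]
        elif body.startswith("BIOMT1"):
            n_operators += 1              # one BIOMT1 row per transformation operator
    return author, software, chains, n_operators

# Made-up header fragment with two bio-assemblies
header = """\
REMARK 350 BIOMOLECULE: 1
REMARK 350 AUTHOR DETERMINED BIOLOGICAL UNIT: DIMERIC
REMARK 350 SOFTWARE DETERMINED QUATERNARY STRUCTURE: DIMERIC
REMARK 350 APPLY THE FOLLOWING TO CHAINS: A, B
REMARK 350   BIOMT1   1  1.000000  0.000000  0.000000        0.00000
REMARK 350   BIOMT2   1  0.000000  1.000000  0.000000        0.00000
REMARK 350   BIOMT3   1  0.000000  0.000000  1.000000        0.00000
REMARK 350 BIOMOLECULE: 2
REMARK 350 AUTHOR DETERMINED BIOLOGICAL UNIT: DIMERIC
REMARK 350 APPLY THE FOLLOWING TO CHAINS: C, D
REMARK 350   BIOMT1   1  1.000000  0.000000  0.000000        0.00000
""".splitlines()

result = parse_biomolecule1(header)
print(result)
```

An operator count above one would flag assemblies in which copies of a chain are generated by symmetry rather than listed as separate chains.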

To run the block below, first download the required PDB files to the local machine. 
>Run command to download PDB:
>`wget http://www.rcsb.org/pdb/files/XXXX.pdb`


In [282]:
# Define function to read PDB header to obtain BU1 author_bu1, software_bu1, chainIDs and their resolved seq_length information
#### Select PDBs with only 2chains in biounit 1
#### Skip three blocks if provided with PDBread files.

def get_auth_software_bu1(inputPDBs):
    PDBdb_dir='/Volumes/bh_lab_data/shares/databases/pdb/';dic_output={}
    for pdb in inputPDBs:
        auth_oligstate_biomol1=None;software_oligstate_biomol1=None;chains_biomol1=[]
        biomol=None  # reset per entry so a value from the previous PDB cannot carry over
        try:
            pdbfile=open(PDBdb_dir+pdb+'.pdb','r')
            #pdbfile = open('/Volumes/bh_lab_data/shares/databases/pdb/1axi.pdb','r')
        except FileNotFoundError:
            print('Not available complexes; Skip',pdb)
            continue
        with pdbfile:  # make sure the file handle is closed after reading the header
            for line in pdbfile:
                if "REMARK 350 BIOMOLECULE:" in line:
                    if line.split(':')[1].strip()=="1":
                        biomol=1
                    else:
                        # a later bio-assembly starts here; stop reading
                        biomol=line.split(':')[1].strip()
                        break
                if "REMARK 350 AUTHOR DETERMINED BIOLOGICAL UNIT:" in line:
                    if biomol==1:
                        auth_oligstate_biomol1=line.split(':')[1].strip()
                if "REMARK 350 SOFTWARE DETERMINED QUATERNARY STRUCTURE:" in line:
                    if biomol==1:
                        software_oligstate_biomol1=line.split(':')[1].strip()
                # covers "APPLY THE FOLLOWING TO CHAINS:", "IN ADDITION APPLY ...:" and "AND CHAINS:" lines
                if line.startswith('REMARK 350') and 'CHAINS:' in line:
                    if biomol==1:
                        # tolerate a trailing comma on chain lists that continue on the next line
                        chains_biomol1 += [c.strip() for c in line.split(':')[1].split(',') if c.strip()]
                if "REMARK 500" in line: 
                    break
        if all(i in dic_PDB_Chain_Unis[pdb].keys() for i in chains_biomol1) and len(chains_biomol1)==2:
            structure = PDBParser().get_structure('', PDBdb_dir+pdb+'.pdb')
            reslv_seqs_biomol1 = []
            for i in chains_biomol1:
                # count resolved amino-acid residues of this chain in the first model
                reslv_seq = len([res for res in structure[0][i].get_residues() if PDB.is_aa(res)])
                reslv_seqs_biomol1.append(reslv_seq)
            dic_output.update({pdb:[auth_oligstate_biomol1,software_oligstate_biomol1,chains_biomol1,reslv_seqs_biomol1]})
    return dic_output


In [None]:
# Get BU1's Author-oligomer, software-oligomer, chainIDs
#### bac_homo takes 6-7min; human_homo takes 2-3min; 
#### human heterodimers takes 8min; bac_homo takes 50min. 

human_heterPDBs_2chains_dic_auth_sotware_bu1_2chains = get_auth_software_bu1(human_heter2merPDBs_2chains)
bacteria_heterPDBs_2chains_dic_auth_sotware_bu1_2chains = get_auth_software_bu1(bacteria_heter2merPDBs_2chains)
human_homoPDBs_2chains_dic_auth_sotware_bu1_2chains = get_auth_software_bu1(human_homoPDBs_2chains)
bacteria_homoPDBs_2chains_dic_auth_sotware_bu1_2chains = get_auth_software_bu1(bacteria_homoPDBs_2chains)



In [None]:
# Optional test for special PDB cases

test_list =['1A22','2D7T','4NUG','4NUJ','4NWU','4O5L','5EWI','5IDN','5M9O','5SZH','5SZI','5SZJ','5VSI','5W6C','5WHJ','6D4P','6DC4','6TX3','6U07','6W2L','6XI7','6ZBK','6ZUE','7E2I','7F6L','7FGM','7MNR','7MON','7NJ1','7PW9','7S0U','7SCW','7SHQ','7TYR','7UZU','7V8F','7VG7','7VV9','7VVB','8A8M']
for pdb in test_list:
    pdb=pdb.lower()
    try:
        dic_PDB_Chain_Unis[pdb]
    except KeyError:
        print("Not in SIFTS_pdbchain_Uniprot:",pdb)
    try:
        dic_PDB_Taxs[pdb]
    except KeyError:
        print("Not in SIFTS_pdbchain_Tax:",pdb)


In [None]:
# Write PDB-head-read information to files

def writeReslSeq_file(inputPDBdict,outfile):
    with open(outfile, 'w') as f:
        f.write("%s,%s,%s,%s,%s,%s,%s\n"%("PDBID","Auth_bu1","Software_bu1","Chain1","Chain2","rslvSeq1","rslvSeq2"))
        for key in inputPDBdict.keys():
            auth_bu1,software_bu1,chains,rslvSeqs = inputPDBdict[key]
            # rslvSeq2 is the resolved length of the second chain (index 1)
            f.write("%s,%s,%s,%s,%s,%s,%s\n"%(key,auth_bu1,software_bu1,chains[0],chains[1],rslvSeqs[0],rslvSeqs[1]))
    return

#Auth_bu1,Software_bu1,Resolution,Chain1,Chain2,Tax1,Tax2,Uni1,Uni2,rslvSeq1,rslvSeq2
#{pdb:[auth_oligstate_biomol1,software_oligstate_biomol1,chains_biomol1,reslv_seqs_biomol1]}

bacteria_homoPDBs_2chains_readPDB_file=localpathprefix+'/SIFTS_202111/PDB_resl/bacteria_homoPDBs_2chains_g30aa_reslgood_PDBread_BU1_05102023.csv'
bacteria_heteroPDBs_2chains_readPDB_file=localpathprefix+'/SIFTS_202111/PDB_resl/bacteria_heteroPDBs_2chains_g30aa_reslgood_PDBread_BU1_05102023.csv'
human_heteroPDBs_2chains_readPDB_file=localpathprefix+'/SIFTS_202111/PDB_resl/human_heteroPDBs_2chains_g30aa_reslgood_PDBread_BU1_05102023.csv'
human_homoPDBs_2chains_readPDB_file=localpathprefix+'/SIFTS_202111/PDB_resl/human_homoPDBs_2chains_g30aa_reslgood_PDBread_BU1_05102023.csv'

writeReslSeq_file(human_heterPDBs_2chains_dic_auth_sotware_bu1_2chains,human_heteroPDBs_2chains_readPDB_file)
writeReslSeq_file(human_homoPDBs_2chains_dic_auth_sotware_bu1_2chains,human_homoPDBs_2chains_readPDB_file)
writeReslSeq_file(bacteria_homoPDBs_2chains_dic_auth_sotware_bu1_2chains,bacteria_homoPDBs_2chains_readPDB_file)
writeReslSeq_file(bacteria_heterPDBs_2chains_dic_auth_sotware_bu1_2chains,bacteria_heteroPDBs_2chains_readPDB_file)
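
As an alternative to hand-formatted strings, the same table can be sketched with the standard `csv` module, which handles quoting if a field ever contains a comma. The function `write_bu1_table` and the example record are hypothetical; the record layout follows the dictionary returned by `get_auth_software_bu1`. Note that `csv.writer` writes `None` as an empty field rather than the text "None".

```python
import csv
import io

def write_bu1_table(records, handle):
    """Write {pdb: [auth, software, [chain1, chain2], [len1, len2]]} as CSV rows."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["PDBID", "Auth_bu1", "Software_bu1",
                     "Chain1", "Chain2", "rslvSeq1", "rslvSeq2"])
    for pdb, (auth, software, chains, lengths) in records.items():
        writer.writerow([pdb, auth, software,
                         chains[0], chains[1], lengths[0], lengths[1]])

# Made-up record; an in-memory buffer stands in for the output file
records = {"aaaa": ["DIMERIC", None, ["A", "B"], [120, 98]]}
buffer = io.StringIO()
write_bu1_table(records, buffer)
text = buffer.getvalue()
print(text)
```

When writing to disk, opening the file with `newline=''` is the usual companion to `csv.writer`.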


In [None]:
# Remove PPI Uniprot-pair redundancy; remove chimeric proteins; remove short proteins
## by keeping the structure that has significantly-LONG sequence or better resolution  

df_SIFTS_chain_Tax['PDBchain'] = df_SIFTS_chain_Tax['PDB'].astype(str) + df_SIFTS_chain_Tax['CHAIN'].astype(str)
# a set keeps the per-chain membership tests below fast
SIFTS_chain_tax_avail = set(df_SIFTS_chain_Tax['PDBchain'])

def remove_redundancy_dimer(PDBs_2chains_dic_auth_sotware_bu1_2chains,olig='hetero'):
    PDBs_2chains_bu1_2chains_nonredun={};chimera_list=[]
    for pdb in PDBs_2chains_dic_auth_sotware_bu1_2chains.keys():
        if 'DIMERIC' not in PDBs_2chains_dic_auth_sotware_bu1_2chains[pdb][0:2]: continue
        [chain1,chain2]=PDBs_2chains_dic_auth_sotware_bu1_2chains[pdb][2][0:2]
        [len1,len2]=PDBs_2chains_dic_auth_sotware_bu1_2chains[pdb][3][0:2]
        totallen = len1+len2
        uni1,uni2=dic_PDB_Chain_Unis[pdb][chain1][0],dic_PDB_Chain_Unis[pdb][chain2][0]
        unis=[uni1,uni2]
        if (len(dic_PDB_Chain_Unis[pdb][chain1]) != 2) or (len(dic_PDB_Chain_Unis[pdb][chain2]) != 2): 
            print("Chimera chain found in",pdb,'; SKIP');chimera_list.append(pdb)
            continue
        # confirm that both chains have an NCBI taxonomy ID in SIFTS_chain_tax
        if (pdb+chain1 not in SIFTS_chain_tax_avail) or (pdb+chain2 not in SIFTS_chain_tax_avail):
            print ("Tax not covered in SIFTS_chain_tax:",pdb,'; SKIP')
            continue
        if olig=='hetero':
            if chain1 == chain2 or uni1==uni2: 
                print ("Identical_chainID or HOMOdimer in ",pdb,chain1,chain2,uni1,uni2,'; SKIP' )
                continue
        else:
            if chain1 == chain2 or uni1!=uni2: 
                print ("Identical_chainID or HETEROdimer in ",pdb,chain1,chain2,uni1,uni2,'; SKIP')
                continue
        if len1 < 30 or len2 < 30: 
            print ("Resolved seq len shorter than 30: ",pdb,chain1,chain2,len1,len2,'; SKIP' )
            continue
        resl = dict_PDB_resl[pdb]
        sorted_unis = tuple(sorted(unis))
        if sorted_unis in PDBs_2chains_bu1_2chains_nonredun:
            if totallen > 2*PDBs_2chains_bu1_2chains_nonredun[sorted_unis][2]:
                PDBs_2chains_bu1_2chains_nonredun.update({sorted_unis:[pdb,resl,totallen]})
            elif dict_PDB_resl[pdb] < PDBs_2chains_bu1_2chains_nonredun[sorted_unis][1]:
                PDBs_2chains_bu1_2chains_nonredun.update({sorted_unis:[pdb,resl,totallen]})
        else:
            PDBs_2chains_bu1_2chains_nonredun.update({sorted_unis:[pdb,resl,totallen]})
    return PDBs_2chains_bu1_2chains_nonredun 


human_heterPDBs_2chains_bu1DI_2chains_nonredun_dic = remove_redundancy_dimer(human_heterPDBs_2chains_dic_auth_sotware_bu1_2chains)
print("Number of human_heterPDBs_2chains_bu1DI_2chains_nonredun_dic:",len(human_heterPDBs_2chains_bu1DI_2chains_nonredun_dic))
human_heterPDBs_2chains_bu1DI_2chains_nonredun_list = [i[0] for i in human_heterPDBs_2chains_bu1DI_2chains_nonredun_dic.values()]

bacteria_heterPDBs_2chains_bu1DI_2chains_nonredun_dic = remove_redundancy_dimer(bacteria_heterPDBs_2chains_dic_auth_sotware_bu1_2chains)
print("Number of bacteria_heterPDBs_2chains_bu1DI_2chains_nonredun_dic:",len(bacteria_heterPDBs_2chains_bu1DI_2chains_nonredun_dic))
bacteria_heterPDBs_2chains_bu1DI_2chains_nonredun_list = [i[0] for i in bacteria_heterPDBs_2chains_bu1DI_2chains_nonredun_dic.values()]

human_homoPDBs_2chains_bu1DI_2chains_nonredun_dic = remove_redundancy_dimer(human_homoPDBs_2chains_dic_auth_sotware_bu1_2chains,'homo')
print("Number of human_homoPDBs_2chains_bu1DI_2chains_nonredun_dic:",len(human_homoPDBs_2chains_bu1DI_2chains_nonredun_dic))
human_homoPDBs_2chains_bu1DI_2chains_nonredun_list = [i[0] for i in human_homoPDBs_2chains_bu1DI_2chains_nonredun_dic.values()]

bacteria_homoPDBs_2chains_bu1DI_2chains_nonredun_dic = remove_redundancy_dimer(bacteria_homoPDBs_2chains_dic_auth_sotware_bu1_2chains,'homo')
print("Number of bacteria_homoPDBs_2chains_bu1DI_2chains_nonredun_dic:",len(bacteria_homoPDBs_2chains_bu1DI_2chains_nonredun_dic))
bacteria_homoPDBs_2chains_bu1DI_2chains_nonredun_list = [i[0] for i in bacteria_homoPDBs_2chains_bu1DI_2chains_nonredun_dic.values()]


In [278]:
# Write the final selected, categorized PDBDIMER lists into csv files

def write_finalfile(selectedlist,input_2chains_nonredun_dic_auth_sotware_bu1,outputfile):
    with open(outputfile, 'w') as f:
        column="PDBID,Auth_bu1,Software_bu1,Resolution,Chain1,Chain2,Tax1,Tax2,Uni1,Uni2,rslvSeq1,rslvSeq2"
        f.write("%s\n"%(column))
        for pdb in sorted(selectedlist):
            [chain1,chain2] = input_2chains_nonredun_dic_auth_sotware_bu1[pdb][2][0:2]
            [seq1,seq2] = input_2chains_nonredun_dic_auth_sotware_bu1[pdb][3][0:2]
            uni1,uni2 = dic_PDB_Chain_Unis[pdb][chain1][0],dic_PDB_Chain_Unis[pdb][chain2][0]
            f.write("%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s\n"%(pdb,input_2chains_nonredun_dic_auth_sotware_bu1[pdb][0],input_2chains_nonredun_dic_auth_sotware_bu1[pdb][1],\
                dict_PDB_resl[pdb],chain1,chain2,dic_PDB_Taxs[pdb][0],dic_PDB_Taxs[pdb][0],uni1,uni2,seq1,seq2))
    return

human_heteroPDBs_2chains_resl_bu1DI_nonredun_file = localpathprefix+'/PDB_MI/HumanPDB_Hetero2mer/human_PDB_2chains_hetero2uni_g30aa_reslgood_BU1DI_2chains_nochim_nonredun_2uni_05112023.csv'
write_finalfile(human_heterPDBs_2chains_bu1DI_2chains_nonredun_list,human_heterPDBs_2chains_dic_auth_sotware_bu1_2chains,human_heteroPDBs_2chains_resl_bu1DI_nonredun_file)

bacteria_heteroPDBs_2chains_resl_nonredun_bu1DI_file = localpathprefix+'/PDB_MI/BacteriaPDB_Hetero2mer/bacteria_PDB_2chains_hetero2uni_g30aa_reslgood_BU1DI_2chains_nochim_nonredun_2uni_05112023.csv'
write_finalfile(bacteria_heterPDBs_2chains_bu1DI_2chains_nonredun_list,bacteria_heterPDBs_2chains_dic_auth_sotware_bu1_2chains,bacteria_heteroPDBs_2chains_resl_nonredun_bu1DI_file)

human_homoPDBs_2chains_resl_nonredun_bu1DI_file = localpathprefix+'/PDB_MI/HumanPDB_Homo2mer/human_PDB_2chains_1uni_g30aa_reslgood_BU1DI_2chains_nochim_nonredun_1uni_05112023.csv'
write_finalfile(human_homoPDBs_2chains_bu1DI_2chains_nonredun_list,human_homoPDBs_2chains_dic_auth_sotware_bu1_2chains,human_homoPDBs_2chains_resl_nonredun_bu1DI_file)

bacteria_homoPDBs_2chains_resl_nonredun_bu1DI_file = localpathprefix+'/PDB_MI/BacteriaPDB_Homo2mer/bacteria_PDB_2chains_1uni_g30aa_reslgood_BU1DI_2chains_nochim_nonredun_1uni_05112023.csv'
write_finalfile(bacteria_homoPDBs_2chains_bu1DI_2chains_nonredun_list,bacteria_homoPDBs_2chains_dic_auth_sotware_bu1_2chains,bacteria_homoPDBs_2chains_resl_nonredun_bu1DI_file)
