NAfragDB: a multi-purpose structural database of nucleic-acid/protein complexes for advanced users
===================

A. Moniot¹

S. J. de Vries², D. W. Ritchie¹, I. Chauvot-de-Beauchêne¹

1. University of Lorraine, CNRS, Inria, LORIA, 54000 Nancy, France
2. University of Paris Diderot, INSERM, 75013 Paris, France

## Introduction

The idea is to create a database that supports customisable queries.

In the PDB:
- 124282 structures
- of which 7547 are NA-protein complexes

NAfragDB is a pipeline to:
--------------------------------------------------------------------------
1. Clean up and parse all NA-protein structures from the PDB into ensembles of small information units in a single file.

2. Search for sets of NA-protein structures with highly customisable combinations of criteria.

3. Create RNA/DNA 3D fragment libraries extracted from those sets of structures.

4. Perform statistics on customised features of such libraries.


![](GGMM_global.png)

### 1. Parsing and storing information in a JSON file

![](GGMM_global1.png)

<pre><code>
"1B7F": {
    "method": "x-ray diffraction",
    "resolution": 2.6,
    "Nmodels": 1,
    "hetnames": {},
    "nachains": [
      "P",
      "Q"
    ],
    "protchain": [
      "A",
      "B"
    ],
    "sequence": {
      "chain_P": "GUUGUUUUUUUU",
      "chain_Q": "GUUGUUUUUUUU"
    },
    "canonized": {},
    "missing_atoms": {
      "chain_P": {},
      "chain_Q": {}
    },
    "interface_protein": {
      "model_1": {
        "chain_P": {
          "res_8": {
            "sug": 3.36,
            "base": 2.65,
            "ph": 3.67
          },
          "res_10": {
            "sug": 3.3,
            "base": 2.48,
            "ph": 3.21
          },
          "res_7": {
            "base": 2.58,
            "ph": 3.84
          },
          ...
      }
    },
    "intraRNA_hb": {
      "chain_P": {
        "res_1": {
          "sug": {
            "n+1": 1.0
          }
        },
        "res_3": {
          "ph": {
            "n+1": 0.375
          }
        },
        ...
      }
    },
    "stacking": {
      "chain_P": {
        "res_7": {
          "n+1": 1
        },
        "res_8": {
          "n-1": 1
        }
      },
      "chain_Q": {
        "res_7": {
          "n+1": 1
        },
        "res_8": {
          "n-1": 1
        }
      }
    },
    "ss": {
      "chain_P": {
        "res_2": [
          "S",
          12,
          12
        ],
        "res_3": [
          "S",
          8,
          12
        ],
        ...
    },
    "bptype": {
      "chain_P": {},
      "chain_Q": {
        "res_10": [
          "Platform"
        ],
        "res_9": [
          "Platform"
        ]
      }
    },
    "interface_hetatoms": {},
    "breaks": {
      "chain_P": null,
      "chain_Q": null
    }
  }
  </code></pre>
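
As a rough illustration of how such an entry can be consumed, the sketch below embeds a trimmed copy of the 1B7F record shown above and reports, for each interface nucleotide of chain P, the part (sugar, base or phosphate) with the smallest recorded value. Treating these values as distances to the protein is an assumption, and `closest_parts` is a hypothetical helper, not part of NAfragDB.

```python
import json

# Trimmed copy of the 1B7F entry shown above (only the fields used here).
ENTRY_JSON = """
{
  "1B7F": {
    "resolution": 2.6,
    "nachains": ["P", "Q"],
    "interface_protein": {
      "model_1": {
        "chain_P": {
          "res_8": {"sug": 3.36, "base": 2.65, "ph": 3.67},
          "res_10": {"sug": 3.3, "base": 2.48, "ph": 3.21},
          "res_7": {"base": 2.58, "ph": 3.84}
        }
      }
    }
  }
}
"""


def closest_parts(pdb_info, chain_key, model="model_1"):
    """Map each interface residue id to the (part, value) pair with the smallest value.

    Hypothetical helper: the values are assumed to be distances to the protein.
    """
    interface = pdb_info["interface_protein"][model][chain_key]
    closest = {}
    for res_key, parts in interface.items():
        resid = int(res_key.split("_")[1])  # "res_8" -> 8
        part = min(parts, key=parts.get)    # part with the smallest value
        closest[resid] = (part, parts[part])
    return closest


structures = json.loads(ENTRY_JSON)
contacts = closest_parts(structures["1B7F"], "chain_P")
for resid in sorted(contacts):
    print(resid, *contacts[resid])
```

Keeping the residue ids as integers makes it easy to sort them and to look for consecutive stretches later on.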

![](GGMM_pipeline.png)

1. Xiang-Jun Lu & Wilma K. Olson (2003). ‘3DNA: a software package for the analysis, rebuilding and visualization of three-dimensional nucleic acid structures’, Nucleic Acids Res. 31(17), 5108-21. 

In [3]:
cd example/presentation

In [4]:
pwd

/home/amoniot/git/NAfragDB/example/presentation


In [5]:
echo -e '1AQ4\n1ASY\n1B7F' > list_pdb

./../../create_database.sh list_pdb rna

/home/amoniot/git/NAfragDB
---------------------------- Download PDBs
downloading 1AQ4
downloading 1ASY
downloading 1B7F
-------------------------- check pdb
-------------------------------1AQ4
-------------------------------1ASY
-------------------------------1B7F
-------------------------- detect NA - protein interface 
1AQ4
1ASY
1B7F
---------------------------------  parse initial pdb
clean_rna.json does not exist
process 1AQ4
process 1ASY
process 1B7F
clean_rna.json dumped
------------------------------------ cleanPDB//1AQ4R-1.pdb R
['cleanPDB//1AQ4R-1.pdb']
run pdbcomplete
read_pdb
input contains nucleic acids
chain R
{'   3 ': [None]}
------------------------------------ cleanPDB//1AQ4S-1.pdb S
['cleanPDB//1AQ4S-1.pdb']
run pdbcomplete
read_pdb
input contains nucleic acids
chain S
{'   3 ': [None]}
------------------------------------ cleanPDB//1ASYR-1.pdb R
['cleanPDB//1ASYR-1.pdb']
run pdbcomplete
read_pdb
replaces H2U by U
replaces H2U by U
replaces H2U by U
replaces H2U by U


### 2. Using the JSON file for queries

![](GGMM_global2.png)

#### Query to find every single-stranded nucleotide interacting with the protein

``` python
# pdb_info is the entry of one structure in the JSON file and chain_id is
# one of its chain keys (e.g. "chain_P").

# Secondary-structure codes treated as single-stranded
ss_set = {"L", "T", "S", "J", "B", "I"}

ss = pdb_info["ss"][chain_id]
interf_prot = pdb_info["interface_protein"]["model_1"][chain_id]

# Residue ids of the interface nucleotides ("res_8" -> "8")
nucl_interf = [key.split("_")[1] for key in interf_prot]
# Residue ids of the single-stranded nucleotides
nucl_ss = [pdb_info["mapping"][chain_id][n.split("_")[1]] for n in ss if ss[n][0] in ss_set]
# Residue ids of the single-stranded interface nucleotides
nucl_interf_ss = [int(n) for n in nucl_interf if n in nucl_ss]
```

In [6]:
pwd

/home/amoniot/git/NAfragDB/example/presentation


In [7]:
cd ..
cd ..

In [8]:
cd create_benchmark

In [23]:
#Change to your complete path

SCRIPTS=/home/amoniot/git/NAfragDB/create_benchmark/

In [24]:
#Change to your complete paths

python3 get_benchmark_ss_ds.py /home/amoniot/git/NAfragDB/example/presentation/structures.json run_ss /home/amoniot/git/NAfragDB/example/presentation

Download of 1AQ4.pdb1.gz is a success
rm: cannot remove '1AQ4.pdb1.gz': No such file or directory
Error on PDB 1AQ4
Done for 1AQ4


Download of 1ASY.pdb1.gz is a success
rm: cannot remove '1ASY.pdb1.gz': No such file or directory
Error on PDB 1ASY
Done for 1ASY


Download of 1B7F.pdb1.gz is a success
rm: cannot remove '1B7F.pdb1.gz': No such file or directory
Download of 1B7F.pdb2.gz is a success
rm: cannot remove '1B7F.pdb2.gz': No such file or directory
Error on PDB 1B7F
Done for 1B7F




![](GGMM_1B7F.png)


Queries can combine many parameters:

- secondary structure
- whether a nucleotide is at the interface or not
- what is at the interface (sugar, phosphate or base)
- the length in nucleotides
- the number of NA chains
- the resolution of the complex
- ...
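
A structure-level filter combining a few of these parameters might look like the sketch below. The toy `structures` dictionary mimics the layout of the JSON file; the entry `XXXX` and all thresholds are invented for illustration, and `select_structures` is a hypothetical helper rather than an NAfragDB function.

```python
def select_structures(structures, max_resolution=3.0, max_na_chains=2, part="base"):
    """Return the ids of the structures passing all three criteria (hypothetical helper)."""
    selected = []
    for pdb_id, info in structures.items():
        resolution = info.get("resolution")
        # Assumption: entries without a resolution value are skipped.
        if resolution is None or resolution > max_resolution:
            continue
        if len(info["nachains"]) > max_na_chains:
            continue
        # Does any nucleotide of any chain contact the protein through `part`?
        chains = info["interface_protein"].get("model_1", {})
        if any(part in parts for chain in chains.values() for parts in chain.values()):
            selected.append(pdb_id)
    return sorted(selected)


# Toy data in the layout of the JSON file; "XXXX" is an invented entry.
structures = {
    "1B7F": {
        "resolution": 2.6,
        "nachains": ["P", "Q"],
        "interface_protein": {"model_1": {"chain_P": {"res_7": {"base": 2.58, "ph": 3.84}}}},
    },
    "XXXX": {
        "resolution": 3.5,
        "nachains": ["A"],
        "interface_protein": {"model_1": {"chain_A": {"res_1": {"ph": 3.9}}}},
    },
}

strict = select_structures(structures)
relaxed = select_structures(structures, max_resolution=4.0, part="ph")
print(strict)   # ['1B7F']
print(relaxed)  # ['1B7F', 'XXXX']
```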

#### Query to find only loops of 5 nucleotides interacting with the protein

``` python
# pdb_info is the entry of one structure in the JSON file and chain_id is
# one of its chain keys (e.g. "chain_P").

# Keep only loop nucleotides (code "L")
ss_set = {"L"}

ss = pdb_info["ss"][chain_id]
interf_prot = pdb_info["interface_protein"]["model_1"][chain_id]

# Residue ids of the interface nucleotides ("res_8" -> "8")
nucl_interf = [key.split("_")[1] for key in interf_prot]
# Residue ids of the loop nucleotides
nucl_l = [pdb_info["mapping"][chain_id][n.split("_")[1]] for n in ss if ss[n][0] in ss_set]
# Sorted residue ids of the loop nucleotides at the interface
nucl_interf_l = sorted(int(n) for n in nucl_interf if n in nucl_l)

# Collect every window of 5 consecutive residue ids
result = set()
for i in range(len(nucl_interf_l) - 4):
    if nucl_interf_l[i + 4] - nucl_interf_l[i] == 4:
        # Tuples are hashable, so they can be stored in a set
        result.add(tuple(nucl_interf_l[i:i + 5]))

result = sorted(result)
```

#### Query to find every single-stranded segment of at least 7 nucleotides in which at most 2 consecutive nucleotides are not in contact with the protein
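
One possible reading of this query is sketched below. It takes the residue ids of the single-stranded nucleotides and of the interface nucleotides (for instance the `nucl_ss` and `nucl_interf` lists of the first query, converted to integers) and returns the qualifying segments. The function names are hypothetical, and trimming each segment so that it starts and ends with a nucleotide in contact is an assumption about the intended definition.

```python
def consecutive_runs(resids):
    """Split residue ids into runs of consecutive integers."""
    runs, run = [], []
    for r in sorted(set(resids)):
        if run and r != run[-1] + 1:
            runs.append(run)
            run = []
        run.append(r)
    if run:
        runs.append(run)
    return runs


def contact_segments(ss_resids, interface_resids, min_len=7, max_gap=2):
    """Single-stranded segments of >= min_len nucleotides in which no more than
    max_gap consecutive nucleotides are out of contact (hypothetical helper)."""
    interface = set(interface_resids)
    segments = []
    for run in consecutive_runs(ss_resids):
        start = last = None  # indices of the first and latest contact in the segment
        for i, r in enumerate(run):
            if r not in interface:
                continue
            if start is None:
                start = i
            elif i - last - 1 > max_gap:
                # Too many non-contact nucleotides in a row: close the segment
                if last - start + 1 >= min_len:
                    segments.append(run[start:last + 1])
                start = i
            last = i
        if start is not None and last - start + 1 >= min_len:
            segments.append(run[start:last + 1])
    return segments


# Toy example: residues 1-20 are single-stranded, some of them are in contact.
ss_resids = list(range(1, 21))
interface_resids = [1, 2, 4, 5, 8, 9, 13, 14, 15, 16, 17, 18, 20]
segments = contact_segments(ss_resids, interface_resids)
for seg in segments:
    print(seg[0], seg[-1], len(seg))
```

In the toy example, residues 10 to 12 form a stretch of three nucleotides without contact, so the chain is split into two segments (1-9 and 13-20).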

### 3. Creation of the fragment libraries

![](GGMM_global3.png)

<pre><code>  "AAA": {
    "1": {
      "chain": "R",
      "clust0.2": 1,
      "clust0.2_center": true,
      "clust1.0": 4,
      "clust1.0_center": true,
      "clust2.0": 3,
      "clust2.0_center": true,
      "indices": [
        3,
        4,
        5
      ],
      "model": 1,
      "resid": [
        "5",
        "6",
        "7"
      ],
      "seq": "GAG",
      "structure": "1AQ4"
    },
    </code></pre>
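
To illustrate how such a library entry can be used, the sketch below picks out the fragments flagged as the centre of their cluster at a given threshold. The toy `library` reuses fields of the entry above; the second fragment is invented, `cluster_centers` is a hypothetical helper, and reading `clust1.0` as a cluster id at a threshold of 1.0 is an assumption.

```python
def cluster_centers(library, motif, threshold="1.0"):
    """Return {cluster id: fragment key} for the fragments flagged as cluster centres."""
    centers = {}
    for frag_key, frag in library[motif].items():
        if frag.get(f"clust{threshold}_center"):
            centers[frag[f"clust{threshold}"]] = frag_key
    return centers


library = {
    "AAA": {
        "1": {"chain": "R", "clust1.0": 4, "clust1.0_center": True,
              "model": 1, "resid": ["5", "6", "7"], "structure": "1AQ4"},
        # Invented second fragment: same cluster, but not its centre.
        "2": {"chain": "S", "clust1.0": 4, "clust1.0_center": False,
              "model": 1, "resid": ["5", "6", "7"], "structure": "1AQ4"},
    }
}

centers = cluster_centers(library, "AAA")
print(centers)  # {4: '1'}
```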

In [1]:
pwd

/home/amoniot/Documents/NAfragDB-master


In [104]:
cd ..
cd example/presentation

In [110]:
ln -s ./../templates .
ln -s ./../data .

ln: failed to create symbolic link './data': File exists

In [None]:
./../../create_frag_library.sh rna

### 4. Statistics on libraries

![](GGMM_global4.png)

In [None]:
pwd

In [200]:
cd ..

In [2]:
cd search_frag_library

In [5]:
python3 statistic.py

Nb of fragments in ds : 117035
Nb of fragments in ss : 29291
Percentage in contact with base : ss 75.26%, ds 58.21%, p-value 0.000000
Percentage in contact with ph : ss 85.57%, ds 87.79%, p-value 0.000000
Percentage in contact with sugar : ss 80.91%, ds 73.62%, p-value 0.000000


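``` python
import json

import numpy as np
from scipy.stats import fisher_exact

# query, chainschema, is_ss, is_ds and contact_parts are NAfragDB helpers
# that are not shown in this excerpt.
fragments = np.load("fragments_clust-aa_missing.npy")
with open("chainsmodel_frag_light.json") as f:
    chaindata = json.load(f)

# Selections of the single-stranded (ss) and double-stranded (ds) fragments
ss = query(chaindata, chainschema, fragments, is_ss, "ss", part=None)
ds = query(chaindata, chainschema, fragments, is_ds, "ss", part=None)

# Total number of fragments in each state (assumed definition: ss_all and
# ds_all are not defined in the original excerpt)
ss_all = len(fragments[ss])
ds_all = len(fragments[ds])

# Contact results for each nucleotide part and each state
counts = {}
for part in ["ph", "sug", "base"]:
    for statename, state in zip(["ss", "ds"], [ss, ds]):
        counts[statename, part] = np.array(
            query(chaindata, chainschema, fragments[state], contact_parts, "interface_protein", part))

# 2x2 contingency table: rows = ss / ds, columns = base contact / no base contact
a = sum(counts["ss", "base"])
b = ss_all - a
c = sum(counts["ds", "base"])
d = ds_all - c
table = [[a, b], [c, d]]

print("Percentage in contact with base : ss {0:.2f}%, ds {1:.2f}%, p-value {2:.6f}".format(
    100 * a / ss_all, 100 * c / ds_all, fisher_exact(table)[1]))
```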
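
The same 2×2 test can be repeated for each nucleotide part. The sketch below is self-contained so that it can be run without the NAfragDB data: it uses invented toy counts (not NAfragDB results) and a small standard-library implementation of the two-sided Fisher exact test in place of `scipy.stats.fisher_exact`. The function name and all numbers are assumptions made for illustration.

```python
from math import comb


def fisher_two_sided(table):
    """Two-sided Fisher exact p-value for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more probable than the observed one
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))


# Toy counts: (fragments in contact, total fragments) for ss and ds.
toy = {
    "base": {"ss": (30, 40), "ds": (20, 40)},
    "ph": {"ss": (34, 40), "ds": (35, 40)},
}

for part, state_counts in toy.items():
    (n_contact_ss, n_ss), (n_contact_ds, n_ds) = state_counts["ss"], state_counts["ds"]
    table = [[n_contact_ss, n_ss - n_contact_ss], [n_contact_ds, n_ds - n_contact_ds]]
    print("Percentage in contact with {0} : ss {1:.2f}%, ds {2:.2f}%, p-value {3:.6f}".format(
        part, 100 * n_contact_ss / n_ss, 100 * n_contact_ds / n_ds, fisher_two_sided(table)))
```

With libraries as large as the ones reported above, the test is very sensitive, so the percentages themselves are worth reading alongside the p-value.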

## Conclusion

![](GGMM_global5.png)