User manual for ProtNAff

Protein-bound Nucleic Acid filters and fragment libraries (protNAff) is a tool to create filters that select structures of protein-nucleic acid complexes from the PDB and to build libraries of protein-bound RNA fragments.

This document explains how to use protNAff and what can be done with it.

To summarize, protNAff is a pipeline to:

1. Clean up and parse NA-protein structures from the PDB into ensembles of small information units in a single file.
2. Search for sets of NA-protein structures with highly customisable combinations of criteria.
3. Create RNA/DNA 3D fragment libraries extracted from those sets of structures.
4. Perform statistics on customised features of such libraries.

Step 1 is necessary; steps 2 and 3 can be done independently; step 4 can be done only after step 3. The output of step 1 (for the PDB at a certain time) can now be downloaded from https://zenodo.org/record/6475637#.YmK3MFxByV4 so that steps 2, 3 and/or 4 can be run directly.

Creation of the structures database (step 1)

The first step is to create structures.json, a JSON file containing information obtained by parsing 3D structures downloaded from the PDB. The list of PDB entries to consider is provided by the user. It can be obtained by using the "advanced search" page of rcsb.org and selecting all structures that contain both protein and RNA/DNA, or a subset of those by using the available filters from RCSB (such as specific protein families or organisms). The information contained in the created database (structures.json) is either obtained by running the DSSR tool of 3DNA or computed by protNAff directly. Examples of usage of this database are given below.

Usage of filters (step 2)

The filters are the most customizable part: you can create the filters you need. Examples of filters are given in the protNAff paper, and more examples are given in the filters folder. In particular, see the explanation_filters.ipynb notebook for a detailed explanation of how to build filters.

Some filters are Python scripts, such as filter_no_modified.py, which returns a JSON file, or filter_ss.py, which prints the single-stranded nucleotides per PDB id. Others are Jupyter notebooks, such as filtering-clustering.ipynb. The former must be run on your machine after installation of the proper conda environment, while the latter can be run through a Colab session (see the Installation section below).

The detailed information contained in structures.json is:

Per PDB structure:
- Experimental method;
- Resolution (if X-ray);
- Name of NA chains and protein chains;
- Number of models (for NMR);
- Name of cofactors.
Per nucleic acid chain:
- Position of breaks within the backbone;
- Sequence.
Per nucleotide:
- All H-bonds made with the protein, with for each H-bond (i) the amino-acid type and atom, (ii) the nucleotide sub-part (phosphate group/sugar/base), (iii) the H-bond distance, and (iv) whether 3DNA considers the H-bond donor or acceptor "questionable";
- All H-bonds made with another nucleic acid, with (i) the position of the other nucleotide in the sequence (n-2, n-1, n+1, n+2 or other), (ii) the nucleotide sub-part, (iii) the H-bond distance, and (iv) whether the H-bond donor or acceptor is considered "questionable" by DSSR;
- Total number of H-bonds with protein for each sub-part (phosphate group/sugar/base), with a 0.5 weighting for questionable H-bonds according to 3DNA;
- Base-pairing types it is involved in;
- Initial name of the residue in the PDB file (if canonized residue);
- Minimal distance of each sub-part to the protein and to cofactors, if < 5 Å;
- Parts that had missing atoms in the initial PDB file;
- Secondary structure (terminal single-stranded parts, hairpin loop, internal loop, junction, double-stranded);
- Presence of a stacking interaction with nucleotides at position n-2, n-1, n+1, n+2 or any position in sequence.
Per fragment:
- Name of the PDB structure it is extracted from;
- Model index (for multi-model PDB structures);
- Name of the PDB chain;
- Residue indices in the PDB file;
- Initial sequence;
- Which part of which nucleotide had missing atoms (if any);
- Whether the fragment is a cluster prototype for the different clustering thresholds (0.2 Å, 1 Å and 3 Å in the current implementation);
- Index of the cluster it belongs to, for the different thresholds.

All this information can be used and combined to filter the structures or the fragments and create a set suitable for a given application. It is also possible to use the 3DNA outputs directly, as is done in filter_hairpin.py.
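
As a rough illustration of how several criteria can be combined, the sketch below filters a small in-memory dictionary on experimental method, resolution and a weighted H-bond count (0.5 for questionable H-bonds, as described above). The field names (`method`, `resolution`, `hbonds`, `questionable`), the function names and the toy identifiers are assumptions made for this example; they are not the actual structures.json schema, for which explanation_filters.ipynb remains the reference.

```python
def weighted_hbonds(hbonds):
    """Count H-bonds, giving weight 0.5 to those flagged as questionable."""
    return sum(0.5 if hb["questionable"] else 1.0 for hb in hbonds)


def select_structures(structures, max_resolution, min_hbonds):
    """Keep X-ray entries under a resolution cutoff with enough H-bonds."""
    selected = []
    for pdb_id, entry in structures.items():
        if entry["method"] != "x-ray":
            continue  # method is tested first, so a missing resolution is never compared
        if entry["resolution"] > max_resolution:
            continue
        if weighted_hbonds(entry["hbonds"]) < min_hbonds:
            continue
        selected.append(pdb_id)
    return sorted(selected)


# Toy data with hypothetical identifiers and field names.
toy = {
    "XXX1": {"method": "x-ray", "resolution": 2.0,
             "hbonds": [{"questionable": False}, {"questionable": True}]},
    "XXX2": {"method": "x-ray", "resolution": 3.5,
             "hbonds": [{"questionable": False}, {"questionable": False}]},
    "XXX3": {"method": "nmr", "resolution": None,
             "hbonds": [{"questionable": False}]},
}

print(select_structures(toy, max_resolution=3.0, min_hbonds=1.5))
```

Each criterion is a separate early exit, so adding or removing a condition does not affect the others.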

Creation of fragment libraries (step 3)

This step creates a 3D fragment library from the set of structures created in the previous step. The clustering groups the fragments so that each fragment lies within a maximal RMSD of its cluster center. This can be done by two methods:

"fastclust", which is fast but non-deterministic (dependent on the order of the fragments) and does not minimize the total number of clusters.
"radius", which is slow but deterministic and minimizes the number of clusters.

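The order dependence of a fast greedy scheme can be seen in the sketch below. It is a generic "leader" clustering written for illustration, not the protNAff implementation of fastclust, and it assumes the fragments are already superimposed, so the RMSD is computed without fitting. The function names and the toy coordinates are hypothetical.

```python
import numpy as np


def rmsd(a, b):
    """RMSD between two (n_atoms, 3) coordinate arrays, without superposition."""
    return float(np.sqrt(((a - b) ** 2).sum(axis=1).mean()))


def greedy_cluster(coords, cutoff):
    """Assign each fragment to the first center within `cutoff`, else open a new cluster.

    Returns (centers, labels): the indices of the cluster centers and,
    for each fragment, the index of the cluster it was assigned to.
    """
    centers, labels = [], []
    for i, frag in enumerate(coords):
        for k, c in enumerate(centers):
            if rmsd(frag, coords[c]) <= cutoff:
                labels.append(k)
                break
        else:
            # No existing center is close enough: this fragment becomes a center.
            centers.append(i)
            labels.append(len(centers) - 1)
    return centers, labels


# Three one-atom "fragments" on a line at x = 0, 1 and 2 (toy data).
frags = np.array([[[0.0, 0.0, 0.0]], [[1.0, 0.0, 0.0]], [[2.0, 0.0, 0.0]]])

print(greedy_cluster(frags, cutoff=1.0))             # input order 0, 1, 2
print(greedy_cluster(frags[[1, 0, 2]], cutoff=1.0))  # middle fragment first
```

With a cutoff of 1.0, the first ordering yields two clusters while the second yields a single one, which is why a greedy method neither guarantees a unique result nor a minimal number of clusters.
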
Statistics (step 4)

ProtNAff makes it possible to compute a wide range of statistics on the structures database and on the fragment libraries. Examples of statistics from the protNAff paper are provided in the notebooks listed below (Testing and Examples section).
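
Many such statistics come down to counting a feature over nucleotides or fragments. The sketch below tallies secondary-structure labels from a toy list; the record layout and the `ss` key are assumptions for illustration and do not reflect the format used by the notebooks.

```python
from collections import Counter

# Toy per-nucleotide records (hypothetical layout and key name).
nucleotides = [
    {"ss": "double-stranded"},
    {"ss": "hairpin loop"},
    {"ss": "double-stranded"},
    {"ss": "junction"},
]

# Absolute counts per label, then the fraction of nucleotides in each class.
counts = Counter(nt["ss"] for nt in nucleotides)
fractions = {label: n / len(nucleotides) for label, n in counts.items()}

print(counts.most_common(1), fractions["double-stranded"])
```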

Installation

Installation instructions are given in INSTALLATION.md.

Alternatively, you can run the protNAff filtering and clustering web server by clicking here.

Testing and Examples

There are several notebooks to help you to understand ProtNAff. In order to launch one, do:

conda activate protnaff
jupyter notebook NOTEBOOK-FILE.ipynb

List of notebooks:

The example notebook helps you create a small database and your first fragment library. At the end of the notebook, a graph is created so that you can check your installation by comparing it to the graph in the next notebook.
The test notebook creates the same graph as should be obtained at the end of the example notebook: if both graphs are identical, the installation went fine.
The data_protnaff notebook creates the JSON files containing the data used for analyses in the protNAff paper.
The figures_protnaff notebook uses the JSON files created by data_protnaff.ipynb to create the figures in the protNAff paper.
The figures_dna_protnaff notebook creates the same figures as previously but for DNA instead of RNA.
The filtering and clustering notebook shows you how to perform custom filtering and clustering. In the notebook, you can select and run custom filter and clustering methods among the provided examples. You can also write your own filter or clustering method. You can also run this notebook as a web server in Google Colab by clicking here. This does not require protNAff to be installed.

Description of the main scripts

The main protNAff scripts:

create_database.sh to create the structures.json file.
- input: a user-given list of PDB ids of NA-protein complexes and the type of nucleic acids you work on (rna/dna). Optional: a threshold for the NA-protein/cofactor contact distance.
- output: the structures.json file
- usage: ./create_database.sh --rna/dna -t
create_frag_library.sh to create the fragment libraries.
- input: structures.json, and the type of nucleic acids you work on (rna/dna)
- output: fragments.json and fragments_clust.json
- usage: ./create_frag_library.sh --rna/dna

Some useful scripts:

create_frag_library/npy2pdb.py converts an npy matrix of 3D coordinates into pdb files.
- input: the npy matrix, and a pdb template (with the same atoms in same order). Optional: an index or a list of indices of the structures to be converted into PDB.
- output: a multi-model pdb file
- usage: ./create_frag_library/npy2pdb.py filename.npy template.pdb [--list ] [--index ] > filename.pdb
create_frag_library/pdb2npy.py converts pdb files into a npy matrix.
- input: a text file containing a list of paths to the pdb files to be converted into npy
- output: a npy matrix
- usage: create_frag_library/pdb2npy.py filename.list --list --outp filename.npy
create_frag_library/reduce.py reduces all-atom RNA/DNA pdb files into the ATTRACT coarse-grained representation.
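
To make the data layout handled by the conversion scripts more concrete, the sketch below builds a coordinate matrix with NumPy, saves and reloads it, and selects a subset of structures by index. The shape (n_structures, n_atoms, 3) and the file name are assumptions made for this illustration; the sketch does not call npy2pdb.py or pdb2npy.py.

```python
import os
import tempfile

import numpy as np

# Toy matrix: 4 structures, 5 atoms each, xyz coordinates (assumed layout).
coords = np.arange(4 * 5 * 3, dtype=float).reshape(4, 5, 3)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_fragments.npy")  # hypothetical file name
    np.save(path, coords)
    loaded = np.load(path)  # fully read into memory before the directory is removed

# Pick structures 0 and 2, similar in spirit to passing a list of indices.
subset = loaded[[0, 2]]
print(loaded.shape, subset.shape)
```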

Work in progress

The main idea of protNAff is to provide a highly versatile pipeline to cover as many usages as possible. This is intended to be a dynamic collaborative work:

- If you need a specific feature that you can't find or don't know how to add in the current pipeline, please contact us, and we will do our best to include it.

- If you added some feature that you think can be useful to others, please feel free to propose a pull request.
