
Toolkit manual


Scripts

To improve compatibility, MineProt scripts are written in native Python 3 as far as possible, but they still depend on the following third-party packages:

numpy
requests
biopython

Please run pip3 install -r toolkit/scripts/requirements.txt to install them.

Operations are more convenient in a Bash shell, as several shell scripts that integrate these functions are provided.

alphafold/transform.py

Description

Preprocess your AlphaFold predictions for curation: pick out top-1 model and
generate .cif for visualization.

optional arguments:
  -h, --help  show this help message and exit
  -i I        Path to input folder. THIS ARGUMENT IS MANDATORY.
  -o O        Path to output folder. THIS ARGUMENT IS MANDATORY.
  -n N        Naming mode: 0: Use prefix; 1: Use name in .a3m; 2: Auto rename;
              3: Customize name.
  --url URL   URL of PDB2CIF API.

Workflow

  1. Make temporary folder MP-Temp-XXXX;

  2. Scan the -i directory and locate result folders;

    AlphaFold generates one subdirectory per prediction in its output_dir. Please make sure your -i matches your AlphaFold's --output_dir.

  3. For each result folder:

    • Copy ranked_0.pdb to MP-Temp-XXXX;
    • Add the text #Added by MineProt toolkit to the top of msas/bfd_uniref_hits.a3m (or msas/bfd_uniclust_hits.a3m for older versions) and modify its sequence identifiers to UniProt protein accessions before copying it to MP-Temp-XXXX;
    • Read ranking_debug.json to locate the PKL file of the best model, and export vital model scores from it into a JSON file in MP-Temp-XXXX (see the sketch after this list);
  4. Rename all files in MP-Temp-XXXX according to the -n parameter and move them to the -o directory;

  5. Generate CIF files using MineProt PDB2CIF API.
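
As a reference for step 3, the score extraction can be sketched in a few lines of Python. This is only an illustration, assuming the standard AlphaFold output layout (a ranking_debug.json with an "order" list and matching result_model_*.pkl files); the keys actually exported by transform.py may differ.

    import json
    import pickle
    from pathlib import Path

    result_dir = Path("prediction_1")  # one AlphaFold result folder (hypothetical name)

    # ranking_debug.json lists model names in descending rank order
    ranking = json.loads((result_dir / "ranking_debug.json").read_text())
    best_model = ranking["order"][0]

    # the matching pickle holds the per-model scores of the best model
    with open(result_dir / f"result_{best_model}.pkl", "rb") as fh:
        result = pickle.load(fh)

    scores = {
        "max_pae": float(result["max_predicted_aligned_error"]),
        "pae": result["predicted_aligned_error"].tolist(),
        "plddt": result["plddt"].tolist(),
        "ptm": float(result["ptm"]),
    }
    (result_dir / "scores.json").write_text(json.dumps(scores))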

Output

The -o directory will have the following structure:

<output_dir>/
    <protein1>.a3m <protein1>.json <protein1>.pdb <protein1>.cif
    <protein2>.a3m <protein2>.json <protein2>.pdb <protein2>.cif
    ...

For each protein, the contents of the output files are as follows:

  • .a3m - MSA file generated by AI system

  • .json - Model scores (a reading sketch follows this list)

    {
      "max_pae": maximum predicted aligned error
      "pae": [[predicted aligned error],[...],...]
      "plddt": predicted lDDT-Cα
      "ptm": pTM score
    }
  • .pdb - Structure file generated by AI system

  • .cif - Structure file generated by MineProt PDB2CIF API, supporting Mol* visualization in the style of AlphaFold DB
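
The scores file can then be consumed directly by downstream tools. A minimal reading sketch, assuming numpy (already a toolkit dependency) and a hypothetical output file protein1.json:

    import json
    import numpy as np

    with open("protein1.json") as fh:  # hypothetical preprocessed output
        scores = json.load(fh)

    print("pTM:", scores["ptm"])
    print("mean pLDDT:", np.mean(scores["plddt"]))
    print("PAE matrix shape:", np.array(scores["pae"]).shape)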

colabfold/transform.py

Description

Preprocess your ColabFold predictions for curation: pick out top-1 model and
generate .cif for visualization.

optional arguments:
  -h, --help  show this help message and exit
  -i I        Path to input folder. THIS ARGUMENT IS MANDATORY.
  -o O        Path to output folder. THIS ARGUMENT IS MANDATORY.
  -n N        Naming mode: 0: Use prefix; 1: Use name in .a3m; 2: Auto rename;
              3: Customize name.
  -z          Unzip results.
  -r          Use relaxed results.
  --url URL   URL of PDB2CIF API.

-n 1 is especially useful when you employ colabfold_search and colabfold_batch to generate large-scale structure predictions, because colabfold_search names its output A3M files <order>.a3m by default.

Workflow

  1. Make temporary folder MP-Temp-XXXX;

  2. Scan the -i directory, locate target A3M, PDB and JSON files and copy them to MP-Temp-XXXX;

    Please make sure your -z and -r match your ColabFold's --zip and --relax.

  3. Rename all files in MP-Temp-XXXX according to the -n parameter and move them to the -o directory;

  4. Generate CIF files using MineProt PDB2CIF API.

Output

The same as the output of alphafold/transform.py.

import2es.py

Description

Import your proteins into Elasticsearch.

optional arguments:
  -h, --help         show this help message and exit
  -i I               Path to your preprocessed A3M files. THIS ARGUMENT IS MANDATORY.
  -n N               Elasticsearch index name.
  -a                 Annotate proteins using UniProt API.
  -f                 Force overwrite.
  -t T               Threads to use.
  --max-msa MAX_MSA  Max number of msas to use for annotation.
  --url URL          URL of MineProt Elasticsearch API.

Workflow

  1. If -n is unset, use basename of -i as -n;

    Please ensure that your target repository name adheres to the naming standards of Elasticsearch indices.

  2. Read each PDB file in the -i directory and get its basename and contents;

  3. For each PDB file:

    • read the related JSON file and generate a JSON request body:

      {
          "name": PDB basename
          "score": model score (Usually pLDDT)
          "seq": protein sequence
          "anno": 
          {
              "homolog": homolog accession
              "database": annotation database (UniProtKB/UniParc)
              "description": []
          }
      }

      Please note that your PDB basename will be recognized as the protein name.

    • If -a is set, read the related A3M file, enumerate its sequence identifiers and send them to the UniProt API until a UniProtKB/UniParc-registered homolog is found (see the sketch after this workflow):

      GET https://www.ebi.ac.uk/proteins/api/proteins/<identifier>
      

      Set {anno{homolog}} to the found homolog, and extract its UniProt, GO & InterPro annotations into {anno{description}};

    • Send the request to the MineProt Elasticsearch API (--url):

      POST <URL>/<N>/add/<base64-encoded_A3M_basename>
      

      The Elasticsearch API is usually located at /api/es of your MineProt site.
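
The homolog lookup and the final request of this workflow could be reproduced roughly as follows, using the requests package (already a toolkit dependency). This is a simplified sketch: the URL, index name and response handling are assumptions, and import2es.py parses the UniProt response in more detail.

    import base64
    import requests

    ES_URL = "http://localhost/api/es"  # --url, assuming the default API location
    INDEX = "my_repo"                   # -n

    def find_homolog(identifiers):
        """Return the first UniProtKB/UniParc-registered identifier and its record, or (None, None)."""
        for acc in identifiers:
            r = requests.get(f"https://www.ebi.ac.uk/proteins/api/proteins/{acc}")
            if r.status_code == 200:
                return acc, r.json()
        return None, None

    def add_protein(basename, body):
        """POST one protein document to the MineProt Elasticsearch API."""
        encoded = base64.b64encode(basename.encode()).decode()
        r = requests.post(f"{ES_URL}/{INDEX}/add/{encoded}", json=body)
        r.raise_for_status()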

import2repo.py

Description

Import your proteins into repository.

options:
  -h, --help  show this help message and exit
  -i I        Path to your preprocessed files (A3M, PDB, CIF & JSON). THIS ARGUMENT IS MANDATORY.
  -n N        MineProt repository name.
  -m M        Upload mode: 0: all files; 1: without A3M; 2: only PDB & JSON; 3: only PDB.
  -f          Force overwrite.
  -z          Compress files.
  --url URL   URL of MineProt import2repo API.

Workflow

  1. If -n is unset, use basename of -i as -n;

    Please ensure that your target repository name adheres to the naming standards of Elasticsearch indices.

  2. Enumerate each file in the -i directory and send its basename to the MineProt import2repo check API (sketched below):

    GET <URL>/check.php?repo=<N>&name=<file_basename>
    
  3. Skip files that already exist in the target repository, and import the remaining files according to -m via the MineProt import2repo API.

    The import2repo API is usually located at /api/import2repo/ of your MineProt site.
    If -f is set, all files will be imported without checking. If -z is set, all files will be gzip-compressed.
    In fact, the MineProt application works as usual with only PDB or even FCZ files in repositories. Due to open-source licensing issues, the MineProt toolkit does not support processing the FCZ format. Given the remarkable compression power of foldcomp, we recommend using rsync to upload FCZ files directly.
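
For reference, the existence check of step 2 amounts to a single GET request. A minimal sketch with requests, assuming the default API location and a plain-text response that can be compared directly (the actual response handling in import2repo.py may differ):

    import requests

    API = "http://localhost/api/import2repo"  # --url, assuming the default API location
    REPO = "my_repo"                          # -n

    def already_imported(basename):
        """Ask the check API whether a file with this basename exists in the repository."""
        r = requests.get(f"{API}/check.php", params={"repo": REPO, "name": basename})
        r.raise_for_status()
        return r.text.strip() == "1"  # assumed response format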

*/import.sh

These shell scripts integrate their corresponding <AI_system>/transform.py with import2es.py and import2repo.py. Please try <AI_system>/import.sh --help for their arguments.

export.py

Description

Export data from MineProt Search Page.

positional arguments:
  url         Search URL. THIS ARGUMENT IS MANDATORY.
              e.g. http://mineprot-demo.bmeonline.cn/search.php?search=dicer

options:
  -h, --help  show this help message and exit
  -o O        Path to output folder

Workflow

  1. Download result.json to -o from the MineProt ES Search API behind the Search Page;
  2. Read result.json, make the necessary directories in -o according to _index (MineProt repositories), and download the search result files to the corresponding directories (see the sketch below).
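
A rough sketch of step 2, assuming result.json follows the usual Elasticsearch search-response layout with hits.hits[] entries carrying an _index field (the exact fields and download logic in export.py may differ):

    import json
    from pathlib import Path

    out = Path("export_out")  # -o (hypothetical path)
    result = json.loads((out / "result.json").read_text())

    # create one folder per repository (_index) referenced by the search results
    for hit in result["hits"]["hits"]:  # assumed Elasticsearch response layout
        repo_dir = out / hit["_index"]
        repo_dir.mkdir(parents=True, exist_ok=True)
        # the per-protein .a3m/.json/.pdb/.cif files would then be downloaded into repo_dir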

Output

The -o directory will have the following structure:

<output_dir>/
   result.json
   <repo1>/
      <protein1>.a3m <protein1>.json <protein1>.pdb <protein1>.cif
      <protein2>.a3m <protein2>.json <protein2>.pdb <protein2>.cif
   <repo2>/
      <protein3>.a3m <protein3>.json <protein3>.pdb <protein3>.cif
      ...
   ...

pre4beacon.py

Description

Prepare metadata for 3D-Beacons client.

optional arguments:
  -h, --help         show this help message and exit
  -i I               Path to your MineProt repo. THIS ARGUMENT IS MANDATORY.
  -o O               Path to your data directory for 3D-Beacons. THIS ARGUMENT
                     IS MANDATORY.
  -t T               Threads to use.
  --max-msa MAX_MSA  Max number of msas to use for mapping.

Workflow

  1. Make necessary directories in -o;

    mkdir -p /path/to/3d-beacons-client/data/{pdb,cif,metadata,index}
  2. For each PDB file in -i (usually MineProt repository):

    • read the related JSON file to get the pLDDT;

    • read the related A3M file to get the mapping UniProt accession, check whether it meets the requirements of the 3D-Beacons client, pull its sequence and make a pairwise alignment (see the sketch after this list);

    • generate JSON in /path/to/3d-beacons-client/data/metadata:

      {
         "mappingAccession": UniProt accession,
         "mappingAccessionType": "uniprot",
         "start": start site of pairwise alignment,
         "end": end site of pairwise alignment,
         "modelCategory": "Ab initio",
         "modelType": "single",
         "confidenceType": "pLDDT",
         "confidenceAvgLocalScore": pLDDT,
         "createdDate": created date of PDB file,
         "sequenceIdentity": identity of pairwise alignment,
         "coverage": coverage of pairwise alignment,
      }
    • copy the PDB file to /path/to/3d-beacons-client/data/pdb.
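
The pairwise alignment of step 2 can be reproduced with biopython (already a toolkit dependency). The sketch below is only one plausible interpretation of how start/end, sequenceIdentity and coverage map onto the alignment; pre4beacon.py may compute them differently.

    from Bio import Align

    def map_to_uniprot(model_seq, uniprot_seq):
        """Globally align the model sequence to the UniProt sequence and derive
        the start/end, identity and coverage fields of the metadata JSON."""
        aligner = Align.PairwiseAligner()
        aligner.mode = "global"
        aln = aligner.align(model_seq, uniprot_seq)[0]

        # aligned blocks on the UniProt sequence (0-based, end-exclusive)
        uniprot_blocks = aln.aligned[1]
        start, end = int(uniprot_blocks[0][0]) + 1, int(uniprot_blocks[-1][1])

        matches = sum(1 for a, b in zip(str(aln[0]), str(aln[1])) if a == b and a != "-")
        identity = matches / len(model_seq)
        coverage = (end - start + 1) / len(uniprot_seq)
        return start, end, identity, coverage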

You can run docker-compose exec cli snakemake --cores=2 to process your models for the 3D-Beacons client once this script is done. It is also recommended to directly copy the CIF files in your MineProt repository to /path/to/3d-beacons-client/data/cif and run the CLI commands manually.

Plugins

MineProt plugins are developed to extend the functionality of your protein server.

SequenceServer

sequenceserver.js is designed to link SequenceServer hits to the MineProt search interface, which enables users to search their proteins via BLAST. Please note that the sequence identifiers in your BLAST database should match the protein names in your MineProt repositories and, if you have more than one repository, be suffixed with "|repo=XXX" for distinction. You can use msa2seq.sh to generate FASTA files for SequenceServer deployment.
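
For example, with two hypothetical repositories repo1 and repo2, the FASTA identifiers could look like this (the sequences are placeholders):

    >protein1|repo=repo1
    MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ
    >protein2|repo=repo2
    MSVLTPLLLRGLTGSARRLPVPRAKIHSL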

Usage

TamperMonkey plugin

  1. Install the TamperMonkey extension in your browser;
  2. Edit sequenceserver.js:
    • Fill in the URL pattern of the target SequenceServer app after // @include, for example:

      // @include *://localhost:4567/*
    • Fill in the URL of your MineProt site as the first argument of MainFunction(), for example:

      (function () {
         MainFunction("http://localhost", 0);
      })();
  3. Add the edited script directly to TamperMonkey.

Embedded

  1. Fill in the URL of your MineProt site as the first argument of MainFunction() in sequenceserver.js;
  2. Add the edited script to views/report.erb of your SequenceServer in a <script>...</script> tag;
  3. Restart your SequenceServer, and all users can directly access the MineProt search interface of each BLAST hit.

Demo

Please note the extra search button.

AlphaFill

alphafill.js is designed to send CIF files directly from the MineProt Search Page to the AlphaFill online annotation service.

Usage

TamperMonkey plugin

  1. Install the TamperMonkey extension in your browser;
  2. Edit alphafill.js:
    • Fill in the URL pattern of your MineProt Search Page after // @include, for example:

      // @include *://localhost/search.php*
  3. Add the edited script directly to TamperMonkey.

Embedded

Add this script to web/search.php of your MineProt in a <script>...</script> tag.