In [1]:
import os

from smolagents import ToolCollection, InferenceClientModel, OpenAIServerModel, ToolCallingAgent
from mcp import StdioServerParameters

In [2]:
# Read the API key from a local file; the context manager closes the handle.
with open('secret.txt', 'r') as f:
    os.environ['NEBIUS_API_KEY'] = f.read().strip()

In [3]:
server = StdioServerParameters(
    command="docker",
    args=["exec", "-i", "youthful_ride", "node", "/app/build/index.js"]
)

In [7]:
MODEL = "Qwen/Qwen3-235B-A22B-Instruct-2507"

model = OpenAIServerModel(
    model_id=MODEL,
    api_key=os.environ["NEBIUS_API_KEY"],
    api_base="https://api.studio.nebius.com/v1/",
    temperature=0,
)

In [6]:
with ToolCollection.from_mcp(
    server_parameters=server,
    trust_remote_code=True,
    structured_output=False
) as tools:
    print([t.name for t in tools.tools])  # list the tools the server exposes

['search_proteins', 'get_protein_info', 'search_by_gene', 'get_protein_sequence', 'get_protein_features', 'compare_proteins', 'get_protein_homologs', 'get_protein_orthologs', 'get_phylogenetic_info', 'get_protein_structure', 'get_protein_domains_detailed', 'get_protein_variants', 'analyze_sequence_composition', 'get_protein_pathways', 'get_protein_interactions', 'search_by_function', 'search_by_localization', 'batch_protein_lookup', 'advanced_search', 'search_by_taxonomy', 'get_external_references', 'get_literature_references', 'get_annotation_confidence', 'export_protein_data', 'validate_accession', 'get_taxonomy_info']


In [43]:
SYSTEM_PROMPT = """
You are a **bioinformatics research agent** connected exclusively to the **Augmented-Nature-UniProt-MCP-Server**.
Your task is to retrieve and synthesize comprehensive information about a given protein — based only on UniProt data —
using the tools provided by this MCP server.

The user will provide **only a protein name, gene name, synonym, or UniProt accession ID** (for example, “NRF2”, “NFE2L2”, or “Q16236”).
You must query UniProt via the available tools (e.g., `search_proteins`, `get_protein_info`, `get_protein_features`, `get_protein_sequence`)
and return a structured, well-written, scientific text describing the protein.

If any data are missing, clearly state that they are not available in UniProt.
Do not invent or infer facts beyond UniProt annotations.
Return your answer **only as text** — not as JSON.

---

### 📘 Structure and Content of the Response

Format your output as a structured scientific text with section headings (`###`), lists, and tables where appropriate.
Highlight sequence intervals, amino acid positions, and modifications clearly.

Your report **must include all the following sections**, even if some are empty.
Strictly follow the schema provided. Your answer must have all these sections. 
If you find any additional information you consider valuable, include it in the report.
The section on the relation to aging (Longevity Association) is very important.

---

#### **1. Gene / Protein Name / ID**

* Primary protein name and UniProt Accession ID.
* Gene name encoding the protein.
* All known synonyms and alternative names.
* Organism (species) of origin.
* List isoforms, if available.

---

#### **2. Protein / DNA Sequence**

* Link to amino acid sequence (DO NOT print the sequence itself!).
* Length, molecular mass, isoelectric point.
* All known isoforms (canonical and alternative).
* If a specific interval is provided (e.g., residues 100–200), extract and display that fragment.
* Indicate functional or domain regions located within that interval.

---

#### **3. Interval in Sequence**

* Identify notable regions of the protein: motifs, domains, active sites, binding regions, or signal sequences.
* For each interval, specify start and end positions, domain/motif name, and biological function.
* If a specific interval is requested, explain its biological role or structural relevance.

---

#### **4. Function (Text Format)**

* Describe in detail the biological and molecular functions of the protein.
* Include its role in cellular pathways, molecular mechanisms, and known interaction partners.
* List Gene Ontology (GO) annotations: Molecular Function, Biological Process, Cellular Component.
* For enzymes, include catalytic activity and substrates/products.
* Mention any known interacting proteins if annotated.

---

#### **5. Modification Effects**

* List all annotated **post-translational modifications** (PTMs): phosphorylation, acetylation, ubiquitination, etc.
* For each modification, include:

  * the type of modification;
  * the modified residue and position (e.g., Ser40);
  * the functional consequence (e.g., “promotes dissociation from KEAP1”).
* If natural variants or mutations are annotated, include them and describe their impact if available.

---

#### **6. Longevity Association**

* Identify any UniProt-annotated information relating the protein to **longevity, aging, oxidative stress, or lifespan regulation**.
* Describe how the protein contributes to stress resistance, repair mechanisms, or metabolic adaptation.
* If no longevity connection is annotated, explicitly state: *“No known association with longevity is reported in UniProt.”*

---

#### **7. Evolutionary Conservation**

* Describe how evolutionarily conserved this protein is.
* List known **orthologs** in other species with UniProt IDs and approximate sequence identity.
* Mention **paralogs** within the same organism, if any.
* Identify which motifs or domains are most conserved.

---

#### **8. Orthologs and Paralogs Across Species**

* Provide examples of orthologs and functional analogs (e.g., SKN-1 in *C. elegans*).
* State the degree of sequence identity, main similarities, and conserved regions.
* If paralogs perform distinct functions, summarize these functional differences.

---

#### **9. Known Genetic Interventions**

* List any annotated **experimental manipulations** (knock-out, knock-in, overexpression, RNA interference, etc.).
* Summarize reported phenotypic effects or changes in activity.
* If no such experiments are annotated in UniProt, explicitly say so.

---

#### **10. Mutant Strains Data**

* Describe known mutant strains associated with this gene or protein.
* Include observed phenotypes (e.g., increased stress sensitivity, altered transcriptional regulation).
* Indicate the model organism (e.g., *Mus musculus*, *Drosophila melanogaster*, *C. elegans*).

---

#### **11. Small Molecule Binding Data**

* List annotated small molecule or protein binding sites.
* For each ligand, specify:

  * the molecule’s name;
  * the amino acid positions involved in binding;
  * the interaction type (inhibition, activation, etc.);
  * the effect on the protein’s activity.

---

#### **12. Tunable Coarse-Graining**

* Summarize the information at different levels of abstraction:

  * individual amino acids or motifs;
  * structural domains (e.g., Neh2, bZIP);
  * domain families;
  * broader protein classes (e.g., basic leucine zipper transcription factors).
* Explain how the protein’s function or interactions can be understood at each level.

---

#### **13. Summary**

* Provide a concise synthesis of key information:

  * the main biological role of the protein,
  * major structural or functional domains,
  * key modifications or interactions,
  * relevance to stress response, signaling, or longevity, if annotated.

---

### ⚙️ Style and Output Requirements

* Use **scientific but readable** English.
* Follow the order and section titles above.
* If a section lacks data, explicitly state: “Data not available in UniProt.”
"""

In [68]:
PROTEIN = "APOE"

USER_PROMPT = f"""
Return the data for the human protein: {PROTEIN}
To keep the responses small, do not request fields during the search that will not help you fetch the data.
Stick to the following schema:
#### **1. Gene / Protein Name / ID**
#### **2. Protein / DNA Sequence** (print only link to the sequence, DO NOT print the whole sequence)
#### **3. Interval in Sequence**
#### **4. Natural Variants**
#### **5. Evolutionary Conservation**
#### **6. Orthologs and Paralogs Across Species**
# <-- the following sections must cover the human protein as well as its variants, orthologs, and paralogs -->
#### **7. Function (Text Format)**
#### **8. Modification Effects**
#### **9. Longevity Association**
#### **10. Known Genetic Interventions**
#### **11. Mutant Strains Data**
#### **12. Small Molecule Binding Data**
#### **13. Tunable Coarse-Graining**
#### **14. Summary**
"""

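Writing the section numbers by hand makes it easy to skip or repeat one whenever the schema changes. A minimal sketch that generates the headings from a list instead is shown here; `SECTIONS` and `build_user_prompt` are illustrative names that do not exist in this notebook, and the titles are copied from the schema in the user prompt.

```python
# Hypothetical helper: number the section headings automatically so the
# schema in the user prompt stays consecutive when sections are added or removed.
SECTIONS = [
    "Gene / Protein Name / ID",
    "Protein / DNA Sequence",
    "Interval in Sequence",
    "Natural Variants",
    "Evolutionary Conservation",
    "Orthologs and Paralogs Across Species",
    "Function (Text Format)",
    "Modification Effects",
    "Longevity Association",
    "Known Genetic Interventions",
    "Mutant Strains Data",
    "Small Molecule Binding Data",
    "Tunable Coarse-Graining",
    "Summary",
]


def build_user_prompt(protein: str, sections: list[str] = SECTIONS) -> str:
    """Return a user prompt whose headings are numbered 1..len(sections)."""
    headings = "\n".join(
        f"#### **{i}. {title}**" for i, title in enumerate(sections, start=1)
    )
    return (
        f"Return the data for the human protein: {protein}\n"
        "Stick to the following schema:\n"
        f"{headings}\n"
    )


prompt = build_user_prompt("APOE")
print(prompt)
```

With this approach, dropping or reordering a section only requires editing the list; the numbering follows automatically.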

In [54]:
agent.prompt_templates.keys()

dict_keys(['system_prompt', 'planning', 'managed_agent', 'final_answer'])

In [69]:
with ToolCollection.from_mcp(
    server_parameters=server,
    trust_remote_code=True,
    structured_output=False
) as tools:
    agent = ToolCallingAgent(
        tools=[*tools.tools],
        add_base_tools=False,
        model=model,
        max_steps=3
    )
    agent.prompt_templates["system_prompt"] = SYSTEM_PROMPT

    try:
        out = agent.run(USER_PROMPT)
        print(out)
    except Exception as e:
        print(f"ERROR: {e}")

#### **1. Gene / Protein Name / ID**
- **Gene Name**: APOE (Apolipoprotein E)  
- **Protein Name**: Apolipoprotein E  
- **UniProtKB ID**: P02649 (reviewed entry for human APOE; note: search results returned TrEMBL entries, but P02649 is the canonical reviewed entry)  
- **Gene ID (NCBI)**: 348  
- **RefSeq Protein ID**: NP_000032.1  

---

#### **2. Protein / DNA Sequence** (print only link to the sequence, DO NOT print the whole sequence)
- **Protein Sequence (UniProt)**: [https://www.uniprot.org/uniprotkb/P02649/entry](https://www.uniprot.org/uniprotkb/P02649/entry)  
- **DNA Sequence (RefSeq)**: [https://www.ncbi.nlm.nih.gov/nuccore/NM_000041.4](https://www.ncbi.nlm.nih.gov/nuccore/NM_000041.4)  

---

#### **3. Interval in Sequence**
- **Signal Peptide**: 1–18  
- **Mature Protein Chain**: 19–317 (Apolipoprotein E)  
- **Coiled-Coil Region**: 244–278  
- **Receptor-Binding Region**: ~130–160 (well-documented functional domain; inferred from literature)  
- **Lipid-Binding Domain**
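
The report printed above ends in section 3, so most of the requested sections are missing. A small check like the following could flag that before the text is used further. It is only a sketch: `missing_sections` is a hypothetical helper, and the heading pattern `#### **N. Title**` is an assumption based on the schema in the prompts.

```python
import re

# Matches headings of the form "#### **3. Interval in Sequence**"; anything
# after the closing "**" (such as a parenthetical note) is ignored.
HEADING_RE = re.compile(r"^#### \*\*\d+\.\s*(.+?)\*\*", re.MULTILINE)


def missing_sections(report: str, required: list[str]) -> list[str]:
    """Return the required section titles that do not appear in the report."""
    found = {title.strip() for title in HEADING_RE.findall(report)}
    return [title for title in required if title not in found]


required = ["Gene / Protein Name / ID", "Interval in Sequence", "Summary"]
sample = (
    "#### **1. Gene / Protein Name / ID**\n"
    "- **Gene Name**: APOE\n"
    "#### **3. Interval in Sequence**\n"
    "- **Signal Peptide**: 1–18\n"
)
print(missing_sections(sample, required))  # ['Summary']
```

Titles are compared by name rather than by number, so the check keeps working if the section numbering changes.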

### Test

In [73]:
%load_ext autoreload
%autoreload 2

In [75]:
from uniprot import run_query
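
The `uniprot` module imported here is not shown in this notebook. One plausible shape for `run_query`, assuming it simply wraps the agent cell above in a function, is sketched below. The keyword parameters (`server`, `model`, `system_prompt`, `make_user_prompt`) are illustrative; the real function evidently takes only the protein name and presumably gets the rest from module-level settings.

```python
from typing import Any, Callable


def run_query(
    protein: str,
    *,
    server: Any,
    model: Any,
    system_prompt: str,
    make_user_prompt: Callable[[str], str],
    max_steps: int = 3,
) -> str:
    """Hypothetical wrapper: run the UniProt agent once for one protein."""
    # Imported lazily so that defining the function does not require smolagents.
    from smolagents import ToolCallingAgent, ToolCollection

    # Same calls and arguments as in the agent cell of this notebook.
    with ToolCollection.from_mcp(
        server_parameters=server,
        trust_remote_code=True,
        structured_output=False,
    ) as tools:
        agent = ToolCallingAgent(
            tools=[*tools.tools],
            add_base_tools=False,
            model=model,
            max_steps=max_steps,
        )
        agent.prompt_templates["system_prompt"] = system_prompt
        return str(agent.run(make_user_prompt(protein)))
```

Keeping the MCP connection inside the function means every call opens and closes its own session with the server, which matches how the cells above use the `with` block.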

In [76]:
out = run_query('OCT4')

In [77]:
print(out)

#### **1. Gene / Protein Name / ID**
- **Gene Name**: *POU5F1*  
- **Protein Name**: POU domain, class 5, transcription factor 1 (OCT4)  
- **UniProt ID**: POU5F1_HUMAN (Primary reviewed entry; note: the search returned TrEMBL entries D5K9R8/D5K9Q3, but the canonical entry is **P04629** in UniProtKB/Swiss-Prot)  
- **Alternative Names**: OCT3, OTF-3, Oct-3/4  

> ✅ *Note: The TrEMBL entries (D5K9R8, D5K9Q3) are unreviewed and likely pseudogene translations. The canonical human OCT4 is **P04629** (UniProtKB/Swiss-Prot).*

---

#### **2. Protein / DNA Sequence**
- **Protein Sequence Link**: [https://www.uniprot.org/uniprotkb/P04629/entry#sequences](https://www.uniprot.org/uniprotkb/P04629/entry#sequences)  
- **DNA Sequence (mRNA) Link**: [https://www.ncbi.nlm.nih.gov/nuccore/NM_002698.5](https://www.ncbi.nlm.nih.gov/nuccore/NM_002698.5)  

---

#### **3. Interval in Sequence**
- **POU-Specific Domain**: 138–212  
- **Homeobox Domain**: 228–288  
- **DNA-Binding Region (Homeobox)**: 230–