Protein Synthesis AI Agent Prototype

An AI-powered web application that retrieves protein information from UniProt, finds related publications, extracts and summarizes the 'Materials and Methods' section for protein synthesis, and verifies the extracted protocols against a reference dataset.

Tech Stack

Backend: FastAPI (Python)
Frontend: Next.js (TypeScript, React, Tailwind CSS)
AI Models: Ollama (local LLM inference)
NLP: scispaCy, spaCy
APIs: UniProt REST API, Perplexity.ai (primary), PubMed/PMC (NCBI E-utilities), Semantic Scholar (fallback)

Features

🔍 Search proteins in UniProt database
📚 Retrieve related publications from the Perplexity Search API (academic mode, domain-filtered), with PubMed/PMC and Semantic Scholar as fallbacks
🤖 Extract synthesis protocols using local AI models (Ollama)
📊 Entity extraction (chemicals, equipment, conditions)
✅ Protocol verification against reference dataset
📈 Verification dashboard with accuracy metrics

Security Features

This project was built with security and robustness in mind:

Rate Limiting: All API endpoints are rate-limited with slowapi to mitigate abuse and denial-of-service attempts.
Payload Limits: Data-heavy endpoints enforce a maximum request size to bound resource use.
CORS Configuration: API calls are restricted to approved origins (ALLOWED_ORIGINS).
Secret Management: In production, sensitive keys are read from environment variables and kept out of the codebase.
Dependency Management: No known vulnerabilities in the major dependency versions.
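
As a rough illustration of the CORS and payload-limit items above, the sketch below parses an origin allow-list and checks a request size. It assumes ALLOWED_ORIGINS is a comma-separated environment variable and uses a hypothetical 1 MB limit; this README confirms neither detail, and the helper names are illustrative rather than taken from the backend.

```python
import os

# Hypothetical limit for illustration; the project's real limit is not documented here.
MAX_PAYLOAD_BYTES = 1_000_000


def parse_allowed_origins(raw: str) -> list[str]:
    """Split a comma-separated origin list, dropping blanks and trailing slashes."""
    origins = []
    for part in raw.split(","):
        origin = part.strip().rstrip("/")
        if origin:
            origins.append(origin)
    return origins


def is_origin_allowed(origin: str, allowed: list[str]) -> bool:
    """Exact-match check; an empty allow-list rejects every origin."""
    return origin.rstrip("/") in allowed


def payload_within_limit(body: bytes, limit: int = MAX_PAYLOAD_BYTES) -> bool:
    """Return True when the request body fits within the configured limit."""
    return len(body) <= limit


# Assumed format: ALLOWED_ORIGINS="http://localhost:3000,https://your-frontend-domain.com"
allowed_origins = parse_allowed_origins(
    os.environ.get("ALLOWED_ORIGINS", "http://localhost:3000")
)
```

A list produced this way could be passed to whatever CORS mechanism the backend uses; exact matching avoids accidentally allowing look-alike domains.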

Prerequisites

System Requirements

Python 3.9+
Node.js 18+ and npm
At least 8GB RAM (16GB+ recommended for larger models)
Optional: GPU for faster AI inference

Ollama Installation

The application uses Ollama for local AI model inference. You must install and configure Ollama before running the application.

Install Ollama

macOS/Linux:

```
curl -fsSL https://ollama.ai/install.sh | sh
```

Windows: Download from https://ollama.ai/download

Download AI Models

Choose one of the following models based on your system resources:

```
# Recommended: Balanced performance (requires ~8GB RAM)
ollama pull llama3:8b

# Alternative: Faster, smaller model (requires ~6GB RAM)
ollama pull mistral:7b

# Best quality: Larger model (requires ~40GB RAM/VRAM)
ollama pull llama3:70b
```

Verify Ollama Installation

```
# Check if Ollama is running
ollama list

# Test the model
ollama run llama3:8b "Hello, world!"
```

The Ollama service typically runs on http://localhost:11434. Make sure it's running before starting the backend.
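
The same check can be done from Python before the backend starts. The sketch below queries the /api/tags endpoint of the Ollama service and lists the installed models; the function names are illustrative and not part of this project.

```python
import json
import urllib.error
import urllib.request

OLLAMA_BASE_URL = "http://localhost:11434"


def parse_model_names(tags_payload: dict) -> list[str]:
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in tags_payload.get("models", []) if "name" in m]


def list_ollama_models(base_url: str = OLLAMA_BASE_URL, timeout: float = 3.0):
    """Return installed model names, or None if the service cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return parse_model_names(json.load(resp))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply all count as "not available".
        return None
```

Calling list_ollama_models() and checking that the result is not None and contains the configured model (for example llama3:8b) covers both "is Ollama running" and "is the model downloaded".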

Setup Instructions

Backend Setup

Create and activate a virtual environment from the project root:
```
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

Navigate to the backend directory:
```
cd backend
```

Install dependencies:
```
pip install -r requirements.txt
```

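Download the spaCy model:
```
python -m spacy download en_core_web_sm
```

Optionally install the scispaCy model. It is not served by the spacy download command, so install it from its release archive:
```
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_core_sci_sm-0.5.3.tar.gz
```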

Create environment file:
```
# Create .env file in backend directory
touch .env
```
Edit .env and set:
```
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llama3:8b
PERPLEXITY_API_KEY=your_perplexity_api_key_here
UNPAYWALL_EMAIL=your_email@example.com
```

Getting a Perplexity API Key:
- Sign up at https://www.perplexity.ai/
- Navigate to Account Settings → API
- Generate an API key
- Copy the key to your .env file

Note: The application will fall back to PubMed/Semantic Scholar if the Perplexity API key is not configured.

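
The fallback described in the note above can be pictured with a small sketch that reads the same variables as the .env file. The function names and the source labels ("perplexity", "pubmed", "semantic_scholar") are hypothetical; the backend's actual configuration code may look different.

```python
import os


def publication_sources(env=None) -> list[str]:
    """Return the source order: Perplexity first only when an API key is configured."""
    env = os.environ if env is None else env
    sources = []
    if env.get("PERPLEXITY_API_KEY", "").strip():
        sources.append("perplexity")
    # PubMed/PMC and Semantic Scholar are always available as fallbacks.
    sources += ["pubmed", "semantic_scholar"]
    return sources


def ollama_settings(env=None) -> dict:
    """Read the Ollama settings, defaulting to the values shown in the .env example."""
    env = os.environ if env is None else env
    return {
        "base_url": env.get("OLLAMA_BASE_URL", "http://localhost:11434"),
        "model": env.get("OLLAMA_MODEL", "llama3:8b"),
    }
```

Passing a plain dict instead of os.environ makes the behaviour easy to check without touching the real environment.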
Run the backend:
```
uvicorn main:app --reload
```
The backend will be available at http://localhost:8000 and the API documentation at http://localhost:8000/docs.

Frontend Setup

Navigate to frontend directory:
```
cd frontend
```
Install dependencies:
```
npm install
```

Create environment file (optional):

```
# Create .env.local if you need to customize API URL
echo "NEXT_PUBLIC_API_URL=http://localhost:8000" > .env.local
```

Run the development server:
```
npm run dev
```
Frontend will be available at http://localhost:3000

Usage

Start Ollama (if not running automatically):
```
ollama serve
```

Start the backend:

```
cd backend
source ../venv/bin/activate
uvicorn main:app --reload
```

Start the frontend:
```
cd frontend
npm run dev
```
Open the application: Navigate to http://localhost:3000 in your browser
Search for a protein:
- Type a protein name (e.g., "hemoglobin", "insulin")
- Select from the dropdown
- View related publications
- Click "Extract Protocol" to extract synthesis methods
View verification dashboard:
- Click the "Verification" tab
- View accuracy metrics and validation results

API Endpoints

Protein Search

```
GET /protein/search?query={protein_name}
```
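
A minimal client-side sketch for this endpoint is shown below. It only builds the request URL, percent-encoding the query so that names with spaces are safe; the helper name is illustrative, and the response format is not documented in this README.

```python
from urllib.parse import urlencode

API_BASE_URL = "http://localhost:8000"


def protein_search_url(query: str, base_url: str = API_BASE_URL) -> str:
    """Build the GET /protein/search URL with the query safely encoded."""
    return f"{base_url}/protein/search?{urlencode({'query': query})}"
```

The resulting URL can be fetched with any HTTP client, for example urllib.request.urlopen, once the backend is running.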

Publications

```
GET /publications/{uniprot_id}?protein_name={name}&methodology_focus={purification|synthesis|expression|general}
```

Uses Perplexity.ai as the primary source and falls back to PubMed/PMC and Semantic Scholar if no results are found. The methodology_focus parameter is optional and defaults to purification.
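
The request format and the fallback order can be sketched as follows. This is an assumption-laden illustration, not the backend's implementation: the fetcher callables are stand-ins for the real Perplexity, PubMed/PMC, and Semantic Scholar clients, and treating a failing source like an empty one is a design choice made here for simplicity.

```python
from typing import Callable, Iterable
from urllib.parse import quote, urlencode

API_BASE_URL = "http://localhost:8000"
METHODOLOGY_FOCUS = ("purification", "synthesis", "expression", "general")

Fetcher = Callable[[str], list]  # hypothetical: takes a protein name, returns publications


def publications_url(uniprot_id: str, protein_name: str,
                     methodology_focus: str = "purification",
                     base_url: str = API_BASE_URL) -> str:
    """Build the GET /publications/{uniprot_id} URL, validating the focus value."""
    if methodology_focus not in METHODOLOGY_FOCUS:
        raise ValueError(f"methodology_focus must be one of {METHODOLOGY_FOCUS}")
    params = urlencode({"protein_name": protein_name,
                        "methodology_focus": methodology_focus})
    return f"{base_url}/publications/{quote(uniprot_id, safe='')}?{params}"


def first_source_with_results(protein_name: str,
                              sources: Iterable[tuple[str, Fetcher]]):
    """Try each (name, fetcher) in order; return the first non-empty result set."""
    for name, fetch in sources:
        try:
            results = fetch(protein_name)
        except Exception:
            # Assumption: a source that errors is skipped, like one with no results.
            continue
        if results:
            return name, results
    return None, []
```

Keeping the source order in a plain list makes it straightforward to drop Perplexity when no API key is configured.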

Extract Methods

```
POST /extract_methods
Body: {
  "publication_text": "...",
  "protein_name": "..."
}
```
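
As a hedged example of calling this endpoint from Python, the sketch below builds the JSON POST request using only the standard library. The helper name is illustrative, and the shape of the response is not documented in this README, so it is not parsed here.

```python
import json
import urllib.request

API_BASE_URL = "http://localhost:8000"


def build_extract_methods_request(publication_text: str, protein_name: str,
                                  base_url: str = API_BASE_URL) -> urllib.request.Request:
    """Build (but do not send) the POST /extract_methods request."""
    body = json.dumps({
        "publication_text": publication_text,
        "protein_name": protein_name,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/extract_methods",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending it is a matter of passing the request to urllib.request.urlopen with a generous timeout, since extraction runs through a local LLM. The other POST endpoints follow the same pattern with their own body fields.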

Extract Entities

```
POST /extract_entities
Body: {
  "text": "..."
}
```

Summarize Protocol

```
POST /summarize_protocol
Body: {
  "extracted_methods": "..."
}
```

Verify Protocol

```
POST /verify_protocol
Body: {
  "ai_protocol": "...",
  "protein_name": "...",
  "uniprot_id": "..."
}
```

Verification Report

```
GET /verification/report
```

Full API documentation available at http://localhost:8000/docs when backend is running.

Reference Dataset

The reference dataset (backend/data/reference.csv) contains 20 predefined proteins with validated synthesis protocols, including:

Hemoglobin, Insulin, GFP, Lysozyme, Myoglobin
Albumin, Cytochrome C, Trypsin, Collagen, Fibrinogen
Actin, Tubulin, Catalase, Peroxidase, Lactate dehydrogenase
Ribonuclease, Chymotrypsin, Elastase, Carbonic anhydrase
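
To give a feel for how a protocol might be compared with this dataset, here is a minimal sketch. The column names (protein_name, protocol) are hypothetical, since the layout of reference.csv is not described here, and difflib's character-level ratio is only a stand-in for whatever scoring the verification service actually uses.

```python
import csv
from difflib import SequenceMatcher


def load_reference(csv_file) -> dict[str, str]:
    """Map lower-cased protein name to its reference protocol text.

    Assumes hypothetical columns 'protein_name' and 'protocol'.
    """
    reader = csv.DictReader(csv_file)
    return {row["protein_name"].strip().lower(): row["protocol"] for row in reader}


def similarity(ai_protocol: str, reference_protocol: str) -> float:
    """Case-insensitive similarity in [0, 1]; a stand-in metric, not the project's."""
    return SequenceMatcher(None, ai_protocol.lower(), reference_protocol.lower()).ratio()


def verify(ai_protocol: str, protein_name: str, reference: dict[str, str]):
    """Return a similarity score, or None when the protein has no reference entry."""
    expected = reference.get(protein_name.strip().lower())
    if expected is None:
        return None
    return similarity(ai_protocol, expected)
```

With the real file, load_reference would be called on open("backend/data/reference.csv", newline="") after adjusting the column names to match.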

Testing

End-to-End Testing

Run the end-to-end test script:

```
cd backend
python tests/test_e2e.py
```

This will test the full pipeline for sample proteins and log performance metrics.

Performance Metrics

The system validates:

Response time < 5s per step
Extraction accuracy >= 70%
End-to-end success >= 80%
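
These targets can be expressed as a simple check. The sketch below is illustrative only: the RunMetrics container and its field names are hypothetical, and the way tests/test_e2e.py actually records its measurements is not shown in this README.

```python
from dataclasses import dataclass

# Thresholds taken from the list above.
MAX_STEP_SECONDS = 5.0
MIN_EXTRACTION_ACCURACY = 0.70
MIN_E2E_SUCCESS_RATE = 0.80


@dataclass
class RunMetrics:
    step_seconds: list[float]      # wall-clock time of each pipeline step
    extraction_accuracy: float     # fraction in [0, 1]
    e2e_success_rate: float        # fraction in [0, 1]


def check_targets(m: RunMetrics) -> dict[str, bool]:
    """Report which of the three performance targets a run meets."""
    return {
        "response_time": all(s < MAX_STEP_SECONDS for s in m.step_seconds),
        "extraction_accuracy": m.extraction_accuracy >= MIN_EXTRACTION_ACCURACY,
        "e2e_success": m.e2e_success_rate >= MIN_E2E_SUCCESS_RATE,
    }
```

Note that the response-time target is a strict inequality, so a step taking exactly 5 seconds fails, while the two rate targets are inclusive.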

Troubleshooting

Ollama Connection Issues

If you see "Ollama service not available" errors:

Check if Ollama is running:
```
curl http://localhost:11434/api/tags
```
Start Ollama manually:
```
ollama serve
```
Verify model is downloaded:
```
ollama list
```

scispaCy Model Not Found

If en_core_sci_sm is not available, the application will fall back to en_core_web_sm. To install scispaCy models:

```
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_core_sci_sm-0.5.3.tar.gz
```
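
The fallback behaviour described above can be sketched like this. It relies on spacy.load raising OSError when a model package is not installed; the function name and the injectable loader parameter are illustrative, and the backend's own loading code may differ.

```python
def load_nlp(model_names=("en_core_sci_sm", "en_core_web_sm"), loader=None):
    """Return (model_name, pipeline) for the first model that loads."""
    if loader is None:
        import spacy  # imported lazily so the sketch can be exercised with a stub loader
        loader = spacy.load
    last_error = None
    for name in model_names:
        try:
            return name, loader(name)
        except OSError as exc:
            # spaCy raises OSError when the requested model is not installed.
            last_error = exc
    raise RuntimeError(f"No spaCy model available from {model_names}") from last_error
```

Returning the model name alongside the pipeline lets the caller log which model is in use, which helps when entity extraction quality differs between the two.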

Frontend API Connection Issues

Ensure backend is running on http://localhost:8000
Check CORS settings in backend/main.py
Verify NEXT_PUBLIC_API_URL in frontend .env.local if using custom URL

Project Structure

```
Invictus-plan/
├── backend/
│   ├── main.py              # FastAPI application
│   ├── requirements.txt     # Python dependencies
│   ├── .env.example         # Environment template
│   ├── services/            # Business logic
│   │   ├── uniprot.py       # UniProt API integration
│   │   ├── publications.py  # Publication retrieval (Perplexity/PubMed/Semantic Scholar)
│   │   ├── perplexity.py    # Perplexity.ai API integration
│   │   ├── extraction.py    # Text extraction and cleaning
│   │   ├── ai_models.py     # Ollama integration
│   │   └── verification.py  # Protocol verification
│   ├── data/
│   │   └── reference.csv    # Reference dataset
│   └── tests/
│       └── test_e2e.py      # End-to-end tests
├── frontend/
│   ├── app/                 # Next.js app directory
│   ├── components/          # React components
│   ├── lib/                 # API client
│   └── hooks/               # React hooks
└── README.md                # This file
```

Development

Backend Development

FastAPI auto-reloads on code changes when using --reload
API documentation available at /docs
Environment variables loaded from .env

Frontend Development

Next.js hot-reloads on code changes
TypeScript for type safety
Tailwind CSS for styling

GitHub Setup

Initial Setup

Initialize Git repository (if not already done):

```
git init
git add .
git commit -m "Initial commit"
```

Create GitHub repository:
- Go to GitHub and create a new repository
- Don't initialize with README (you already have one)

Connect and push:

```
git remote add origin https://github.com/yourusername/invictus-plan.git
git branch -M main
git push -u origin main
```

Environment Variables

Important: Never commit .env files! They are already in .gitignore.

Backend: Copy backend/env.template to backend/.env and fill in your values
Frontend: Copy frontend/env.template to frontend/.env.local and fill in your values

See the template files for required environment variables.

Deployment

Quick Start

For detailed deployment instructions, see DEPLOYMENT.md.

Frontend (Vercel)

Push your code to GitHub
Go to vercel.com and import your repository
Set root directory to frontend
Add environment variable: NEXT_PUBLIC_API_URL=https://your-backend-domain.com
Deploy!

Backend Options

Recommended platform:

Railway (easiest, good free tier, Docker support) - See backend/railway.json and RAILWAY_DEPLOYMENT.md

Alternative platforms:

Fly.io (global edge, generous free tier)
DigitalOcean (reliable, paid)

Docker deployment (universal):

```
cd backend
docker build -t invictus-backend .
docker run -p 8000:8000 --env-file .env invictus-backend
```

See RAILWAY_DEPLOYMENT.md for detailed Railway deployment instructions, or DEPLOYMENT.md for other platform options.

Security Checklist

Before deploying to production:

All .env files are in .gitignore (already done)
Environment variables are set in hosting platform (not in code)
CORS is configured to only allow your frontend domain
API keys are rotated and secure
HTTPS is enabled (automatic on most platforms)
Security headers are configured (see frontend/vercel.json)
Rate limits are reviewed for expected production traffic (slowapi is already applied to the API endpoints)
Monitoring/alerting is set up

See DEPLOYMENT.md for comprehensive security guidelines.

License

This is a prototype project for demonstration purposes.

Acknowledgments

UniProt for protein database
Perplexity.ai for intelligent publication search (primary)
PubMed/PMC (NCBI E-utilities) for publication access (fallback)
Semantic Scholar for publication fallback access
Unpaywall API for open access PDF retrieval
Ollama for local LLM inference
scispaCy for biomedical NLP
