An AI-powered web application that retrieves protein information from UniProt, finds related publications, extracts and summarizes the 'Materials and Methods' section for protein synthesis, and verifies accuracy on a reference dataset.
- Backend: FastAPI (Python)
- Frontend: Next.js (TypeScript, React, Tailwind CSS)
- AI Models: Ollama (local LLM inference)
- NLP: scispaCy, spaCy
- APIs: UniProt REST API, Perplexity.ai (primary), PubMed/PMC (NCBI E-utilities), Semantic Scholar (fallback)
- Search proteins in the UniProt database
- Retrieve related publications from the Perplexity Search API (academic mode, domain-filtered), with PubMed/PMC and Semantic Scholar as fallbacks
- Extract synthesis protocols using local AI models (Ollama)
- Entity extraction (chemicals, equipment, conditions)
- Protocol verification against the reference dataset
- Verification dashboard with accuracy metrics
This project was built with security and robustness in mind:
- Rate Limiting: All API endpoints use stringent rate limiting (slowapi) to prevent abuse and DoS attacks.
- Payload Limits: Sensible maximum request lengths on data-heavy endpoints enforce resource boundaries.
- CORS Configuration: Restricts API calls to approved origins (ALLOWED_ORIGINS).
- Secret Management: Production environment variables keep sensitive keys decoupled from the codebase.
- Robust Dependency Management: No known vulnerabilities across major dependency versions.
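As an illustration of the CORS item above, here is a minimal sketch of how an ALLOWED_ORIGINS value could be turned into an origin allow-list. It assumes the variable holds a comma-separated list of origins; that format, the default value, and the helper names `parse_allowed_origins` and `is_origin_allowed` are hypothetical and may not match what backend/main.py actually does.

```python
import os


def parse_allowed_origins(raw: str) -> list:
    """Split a comma-separated origin list (assumed format), dropping blanks,
    trailing slashes, and duplicates while keeping the original order."""
    origins = []
    for item in raw.split(","):
        origin = item.strip().rstrip("/")
        if origin and origin not in origins:
            origins.append(origin)
    return origins


def is_origin_allowed(origin: str, allowed: list) -> bool:
    """Exact-match check; no wildcard support in this sketch."""
    return origin.rstrip("/") in allowed


if __name__ == "__main__":
    # The localhost default is an assumption for local development only.
    raw = os.environ.get("ALLOWED_ORIGINS", "http://localhost:3000")
    print(parse_allowed_origins(raw))
```

The resulting list is the kind of value a CORS middleware would typically receive as its set of permitted origins.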
- Python 3.9+
- Node.js 18+ and npm
- At least 8GB RAM (16GB+ recommended for larger models)
- Optional: GPU for faster AI inference
The application uses Ollama for local AI model inference. You must install and configure Ollama before running the application.
macOS/Linux:
curl -fsSL https://ollama.ai/install.sh | sh
Windows: Download from https://ollama.ai/download
Choose one of the following models based on your system resources:
# Recommended: Balanced performance (requires ~8GB RAM)
ollama pull llama3:8b
# Alternative: Faster, smaller model (requires ~6GB RAM)
ollama pull mistral:7b
# Best quality: Larger model (requires ~40GB RAM/VRAM)
ollama pull llama3:70b
# Check if Ollama is running
ollama list
# Test the model
ollama run llama3:8b "Hello, world!"
The Ollama service typically runs on http://localhost:11434. Make sure it's running before starting the backend.
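The same check can be done programmatically before the backend starts. The sketch below queries Ollama's `/api/tags` endpoint (the one used in the troubleshooting steps) with the standard library only. The function names are hypothetical, and the response shape (a `models` list whose entries carry a `name`) is what Ollama is expected to return; treat it as an assumption if your version differs.

```python
import json
import urllib.error
import urllib.request


def model_names(tags_payload: dict) -> list:
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in tags_payload.get("models", []) if "name" in m]


def ollama_has_model(base_url: str, model: str, timeout: float = 3.0) -> bool:
    """Return True if Ollama answers and lists the requested model."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Service not reachable, timed out, or returned something that is not JSON.
        return False
    return model in model_names(payload)


if __name__ == "__main__":
    print(ollama_has_model("http://localhost:11434", "llama3:8b"))
```

A `False` result means either the service is down or the model has not been pulled yet; `ollama list` distinguishes the two cases.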
- Navigate to backend directory:
cd backend
- Create and activate virtual environment:
# From project root
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Download spaCy model:
python -m spacy download en_core_web_sm
# Optional: scispaCy model (installed with pip from its release archive)
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_core_sci_sm-0.5.3.tar.gz
- Create environment file:
# Create .env file in backend directory
touch .env
Edit .env and set:
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=llama3:8b
PERPLEXITY_API_KEY=your_perplexity_api_key_here
UNPAYWALL_EMAIL=your_email@example.com
Getting a Perplexity API Key:
- Sign up at https://www.perplexity.ai/
- Navigate to Account Settings → API
- Generate an API key
- Copy the key to your .env file
Note: The application will fall back to PubMed/Semantic Scholar if the Perplexity API key is not configured.
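To make the role of these variables concrete, here is a minimal sketch of reading them with defaults and deciding whether Perplexity is usable. The `Settings` class and `load_settings` function are hypothetical names, and the defaults simply mirror the example values above; the backend's real configuration code may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    ollama_base_url: str
    ollama_model: str
    perplexity_api_key: Optional[str]
    unpaywall_email: Optional[str]

    @property
    def use_perplexity(self) -> bool:
        # Without a key, publication search relies on the fallback sources.
        return self.perplexity_api_key is not None


def load_settings(env: Mapping[str, str]) -> Settings:
    def optional(name: str) -> Optional[str]:
        # Treat missing and blank values the same way.
        value = env.get(name, "").strip()
        return value or None

    return Settings(
        ollama_base_url=(optional("OLLAMA_BASE_URL") or "http://localhost:11434").rstrip("/"),
        ollama_model=optional("OLLAMA_MODEL") or "llama3:8b",
        perplexity_api_key=optional("PERPLEXITY_API_KEY"),
        unpaywall_email=optional("UNPAYWALL_EMAIL"),
    )


if __name__ == "__main__":
    settings = load_settings(os.environ)
    # Print only non-secret fields.
    print(settings.ollama_base_url, settings.ollama_model, settings.use_perplexity)
```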
- Run the backend:
uvicorn main:app --reload
Backend will be available at http://localhost:8000
API documentation at http://localhost:8000/docs
- Navigate to frontend directory:
cd frontend
- Install dependencies:
npm install
- Create environment file (optional):
# Create .env.local if you need to customize API URL
echo "NEXT_PUBLIC_API_URL=http://localhost:8000" > .env.local
- Run the development server:
npm run dev
Frontend will be available at http://localhost:3000
- Start Ollama (if not running automatically):
ollama serve
- Start the backend:
cd backend
source ../venv/bin/activate
uvicorn main:app --reload
- Start the frontend:
cd frontend
npm run dev
- Open the application: Navigate to http://localhost:3000 in your browser
- Search for a protein:
  - Type a protein name (e.g., "hemoglobin", "insulin")
  - Select from the dropdown
  - View related publications
  - Click "Extract Protocol" to extract synthesis methods
- View verification dashboard:
  - Click the "Verification" tab
  - View accuracy metrics and validation results
GET /protein/search?query={protein_name}
GET /publications/{uniprot_id}?protein_name={name}&methodology_focus={purification|synthesis|expression|general}
Uses Perplexity.ai as primary source, falls back to PubMed/PMC and Semantic Scholar if no results found. Supports methodology focus (default: purification).
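The fallback order described above can be sketched as a simple chain that tries each source in turn and stops at the first one that returns results. This is an illustration only: `first_nonempty` and the stub sources are hypothetical, and skipping a source that raises an error is an assumption rather than confirmed behavior of services/publications.py.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# A source takes a protein name and returns a list of publication records.
Source = Callable[[str], List[dict]]


def first_nonempty(
    protein_name: str, sources: Iterable[Tuple[str, Source]]
) -> Tuple[Optional[str], List[dict]]:
    """Return (source_name, results) for the first source with any results."""
    for name, fetch in sources:
        try:
            results = fetch(protein_name)
        except Exception:
            # Assumption: a failing source is treated like an empty one.
            continue
        if results:
            return name, results
    return None, []


if __name__ == "__main__":
    # Stub sources stand in for the real Perplexity / PubMed / Semantic Scholar clients.
    chain = [
        ("perplexity", lambda name: []),
        ("pubmed", lambda name: [{"title": f"{name} purification"}]),
        ("semantic_scholar", lambda name: []),
    ]
    print(first_nonempty("insulin", chain))
```

Passing the sources in as callables keeps the ordering explicit and makes the chain easy to test without network access.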
POST /extract_methods
Body: {
"publication_text": "...",
"protein_name": "..."
}
POST /extract_entities
Body: {
"text": "..."
}
POST /summarize_protocol
Body: {
"extracted_methods": "..."
}
POST /verify_protocol
Body: {
"ai_protocol": "...",
"protein_name": "...",
"uniprot_id": "..."
}
GET /verification/report
Full API documentation available at http://localhost:8000/docs when backend is running.
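For scripted access, the endpoints above can be called with the Python standard library. The sketch below only builds the requests; the helper names are hypothetical, the base URL assumes a local backend, and the accession in the usage line is just an example value. The response format is not specified here, so it is left unparsed.

```python
import json
import urllib.parse
import urllib.request

# Values taken from the methodology_focus options listed above.
FOCUS_VALUES = {"purification", "synthesis", "expression", "general"}


def publications_url(base_url: str, uniprot_id: str, protein_name: str,
                     methodology_focus: str = "purification") -> str:
    """Build the GET /publications/{uniprot_id} URL with its query string."""
    if methodology_focus not in FOCUS_VALUES:
        raise ValueError(f"unsupported methodology_focus: {methodology_focus!r}")
    query = urllib.parse.urlencode(
        {"protein_name": protein_name, "methodology_focus": methodology_focus}
    )
    return f"{base_url}/publications/{urllib.parse.quote(uniprot_id)}?{query}"


def extract_methods_request(base_url: str, publication_text: str,
                            protein_name: str) -> urllib.request.Request:
    """Build the POST /extract_methods request with a JSON body."""
    body = json.dumps(
        {"publication_text": publication_text, "protein_name": protein_name}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/extract_methods",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    print(publications_url("http://localhost:8000", "P69905", "hemoglobin"))
    req = extract_methods_request("http://localhost:8000", "Cells were lysed...", "hemoglobin")
    # Sending requires a running backend:
    # with urllib.request.urlopen(req) as resp: print(resp.read().decode())
    print(req.get_method(), req.full_url)
```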
The reference dataset (backend/data/reference.csv) contains 20 predefined proteins with validated synthesis protocols, including:
- Hemoglobin, Insulin, GFP, Lysozyme, Myoglobin
- Albumin, Cytochrome C, Trypsin, Collagen, Fibrinogen
- Actin, Tubulin, Catalase, Peroxidase, Lactate dehydrogenase
- Ribonuclease, Chymotrypsin, Elastase, Carbonic anhydrase
Run the end-to-end test script:
cd backend
python tests/test_e2e.py
This will test the full pipeline for sample proteins and log performance metrics.
The system validates:
- Response time < 5s per step
- Extraction accuracy >= 70%
- End-to-end success >= 80%
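The three targets above can be expressed as a small pass/fail check. This is a sketch under assumptions: the `RunMetrics` structure and its field names are hypothetical, rates are taken to be fractions between 0 and 1, and tests/test_e2e.py may compute and report these values differently.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RunMetrics:
    step_seconds: List[float]      # wall-clock time of each pipeline step
    extraction_accuracy: float     # fraction in [0, 1]
    e2e_success_rate: float        # fraction in [0, 1]


def check_targets(metrics: RunMetrics) -> Dict[str, bool]:
    """Compare measured metrics against the documented thresholds."""
    return {
        "response_time": all(t < 5.0 for t in metrics.step_seconds),
        "extraction_accuracy": metrics.extraction_accuracy >= 0.70,
        "end_to_end_success": metrics.e2e_success_rate >= 0.80,
    }


if __name__ == "__main__":
    sample = RunMetrics(step_seconds=[1.2, 3.4, 4.9],
                        extraction_accuracy=0.75,
                        e2e_success_rate=0.8)
    print(check_targets(sample))
```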
If you see "Ollama service not available" errors:
- Check if Ollama is running:
curl http://localhost:11434/api/tags
- Start Ollama manually:
ollama serve
- Verify model is downloaded:
ollama list
If en_core_sci_sm is not available, the application will fall back to en_core_web_sm. To install scispaCy models:
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.3/en_core_sci_sm-0.5.3.tar.gz
- Ensure backend is running on http://localhost:8000
- Check CORS settings in backend/main.py
- Verify NEXT_PUBLIC_API_URL in frontend .env.local if using custom URL
Invictus-plan/
├── backend/
│   ├── main.py              # FastAPI application
│   ├── requirements.txt     # Python dependencies
│   ├── .env.example         # Environment template
│   ├── services/            # Business logic
│   │   ├── uniprot.py       # UniProt API integration
│   │   ├── publications.py  # Publication retrieval (Perplexity/PubMed/Semantic Scholar)
│   │   ├── perplexity.py    # Perplexity.ai API integration
│   │   ├── extraction.py    # Text extraction and cleaning
│   │   ├── ai_models.py     # Ollama integration
│   │   └── verification.py  # Protocol verification
│   ├── data/
│   │   └── reference.csv    # Reference dataset
│   └── tests/
│       └── test_e2e.py      # End-to-end tests
├── frontend/
│   ├── app/                 # Next.js app directory
│   ├── components/          # React components
│   ├── lib/                 # API client
│   └── hooks/               # React hooks
└── README.md                # This file
- FastAPI auto-reloads on code changes when using --reload
- API documentation available at /docs
- Environment variables loaded from .env
- Next.js hot-reloads on code changes
- TypeScript for type safety
- Tailwind CSS for styling
- Initialize Git repository (if not already done):
git init
git add .
git commit -m "Initial commit"
- Create GitHub repository:
  - Go to GitHub and create a new repository
  - Don't initialize with README (you already have one)
- Connect and push:
git remote add origin https://github.com/yourusername/invictus-plan.git
git branch -M main
git push -u origin main
Important: Never commit .env files! They are already in .gitignore.
- Backend: Copy backend/env.template to backend/.env and fill in your values
- Frontend: Copy frontend/env.template to frontend/.env.local and fill in your values
See the template files for required environment variables.
For detailed deployment instructions, see DEPLOYMENT.md.
- Push your code to GitHub
- Go to vercel.com and import your repository
- Set root directory to frontend
- Add environment variable: NEXT_PUBLIC_API_URL=https://your-backend-domain.com
- Deploy!
Recommended platform:
- Railway (easiest, good free tier, Docker support). See backend/railway.json and RAILWAY_DEPLOYMENT.md
Alternative platforms:
- Fly.io (global edge, generous free tier)
- DigitalOcean (reliable, paid)
Docker deployment (universal):
cd backend
docker build -t invictus-backend .
docker run -p 8000:8000 --env-file .env invictus-backend
See RAILWAY_DEPLOYMENT.md for detailed Railway deployment instructions, or DEPLOYMENT.md for other platform options.
Before deploying to production:
- All .env files are in .gitignore (already done)
- Environment variables are set in hosting platform (not in code)
- CORS is configured to only allow your frontend domain
- API keys are rotated and secure
- HTTPS is enabled (automatic on most platforms)
- Security headers are configured (see frontend/vercel.json)
- Rate limiting is considered (add if needed)
- Monitoring/alerting is set up
- Monitoring/alerting is set up
See DEPLOYMENT.md for comprehensive security guidelines.
This is a prototype project for demonstration purposes.
- UniProt for protein database
- Perplexity.ai for intelligent publication search (primary)
- PubMed/PMC (NCBI E-utilities) for publication access (fallback)
- Semantic Scholar for publication fallback access
- Unpaywall API for open access PDF retrieval
- Ollama for local LLM inference
- scispaCy for biomedical NLP