A powerful Single Page Application (SPA) parser designed to "soak up the juice of the whole project": it captures a project's HTML, logic, and data so they can be fed to AI models and used to build vector databases.
- 🌐 SPA Support: Navigate and parse dynamic single-page applications with authentication
- 🔐 Smart Authentication: Automatic login with fallback to manual intervention
- 📄 Resume Functionality: Skip already parsed pages and continue from where you left off
- 🎯 Comprehensive Content Extraction: Extract HTML, JavaScript, CSS, images, and links
- 🤖 AI Integration: Process content using Ollama AI models (Mistral, Nomic embeddings)
- 📊 Vector Database: Create embeddings and vector representations for semantic search
- 📝 Detailed Logging: Structured logging with specialized parser methods
- ⚙️ Configuration Management: Easy config management with CLI tools
# Clone the repository
git clone <repository-url>
cd spa-parser
# Install dependencies with Bun
bun install
# Make sure Ollama is running
ollama serve

{
  "target": {
    "url": "https://demo.mirasoft.io",
    "authentication": {
      "required": true,
      "username": "mirabot",
      "password": "123123123",
      "loginUrl": "https://demo.mirasoft.io/login"
    }
  },
  "parsing": {
    "depth": {
      "maxPages": 100,
      "forceReparse": false
    },
    "extractContent": {
      "html": true,
      "javascript": true,
      "css": true,
      "images": true,
      "links": true
    }
  }
}

Use the built-in config manager:
# Check current settings
bun manage-config.ts status
# Enable force reparse (reparse all pages)
bun manage-config.ts force-reparse on
# Disable force reparse (skip already parsed)
bun manage-config.ts force-reparse off
# Set max pages limit
bun manage-config.ts max-pages 50

# Run the parser
bun run index.ts

The parser automatically:
- ✅ Loads previously parsed URLs from existing results
- ✅ Skips already processed pages (saves time!)
- ✅ Continues parsing from unvisited routes
- ✅ Generates progress reports
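
The resume behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the parser's actual code: the `ParseResult` shape is reduced to the `pages[].url` field shown in the result JSON, the helper names `collectVisited` and `selectPending` are hypothetical, and treating `maxPages` as a cap on the pending list is an assumption.

```typescript
// Shape of a saved result file, reduced to the fields this sketch needs.
interface ParseResult {
  pages: { url: string }[];
}

// Collect every URL that appears in previously saved results.
function collectVisited(results: ParseResult[]): Set<string> {
  const visited = new Set<string>();
  for (const result of results) {
    for (const page of result.pages) {
      visited.add(page.url);
    }
  }
  return visited;
}

// Decide which discovered routes still need parsing.
// With forceReparse enabled nothing is skipped; maxPages caps the batch
// (assumption: the real parser may apply the limit differently).
function selectPending(
  discovered: string[],
  visited: Set<string>,
  forceReparse: boolean,
  maxPages: number,
): string[] {
  const pending = forceReparse
    ? discovered
    : discovered.filter((url) => !visited.has(url));
  return pending.slice(0, maxPages);
}
```

With `forceReparse` set to `false`, a route that already appears in an earlier result file is dropped from the pending list; switching it to `true` returns every discovered route again.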
- Automatic Login: Tries to log in automatically with the configured credentials
- Manual Fallback: Opens browser window for manual login if auto-login fails
- Smart Detection: Detects successful authentication and continues parsing
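
The automatic-then-manual login flow can be sketched as a small piece of control logic. The sketch assumes the two login steps are supplied as async callbacks that resolve to `true` on success; `LoginStep`, `AuthResult`, and `authenticate` are hypothetical names, and the real parser presumably drives a browser inside those steps.

```typescript
// A login step resolves to true when the session is authenticated.
type LoginStep = () => Promise<boolean>;

interface AuthResult {
  authenticated: boolean;
  method: "auto" | "manual" | "none";
}

// Try the automatic login first; fall back to the manual step if it
// fails or throws. Both steps are injected so the flow can be tested
// without a browser.
async function authenticate(
  autoLogin: LoginStep,
  manualLogin: LoginStep,
): Promise<AuthResult> {
  try {
    if (await autoLogin()) {
      return { authenticated: true, method: "auto" };
    }
  } catch {
    // A thrown error is treated the same as a failed automatic login.
  }
  const manualOk = await manualLogin();
  return manualOk
    ? { authenticated: true, method: "manual" }
    : { authenticated: false, method: "none" };
}
```

Keeping the fallback decision separate from the browser automation makes it easy to check that a failed or crashing auto-login always reaches the manual step.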
parsed_data/
├── parse-result-2025-10-09T19-50-44-332Z.json   # Complete parsed data
├── parse-result-2025-10-09T19-50-44-332Z.md     # Markdown summary
└── logs/
    └── parser-2025-10-09.log                    # Detailed logs
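
The result file names above look like ISO 8601 timestamps with the characters `:` and `.` replaced by `-`. A minimal sketch of that naming scheme, assuming this is how the names are derived (`resultBaseName` is a hypothetical helper):

```typescript
// Build the base name shared by the .json and .md result files.
// ":" and "." from the ISO timestamp are replaced because they are
// awkward or invalid in file names on some platforms.
function resultBaseName(timestamp: Date): string {
  const safe = timestamp.toISOString().replace(/[:.]/g, "-");
  return `parse-result-${safe}`;
}
```

For the timestamp `2025-10-09T19:50:44.332Z` this yields `parse-result-2025-10-09T19-50-44-332Z`, matching the listing.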
{
  "timestamp": "2025-10-09T19:50:44.332Z",
  "summary": {
    "totalPages": 100,
    "totalUrls": 150,
    "successRate": 100,
    "duration": "18 minutes"
  },
  "pages": [
    {
      "url": "https://demo.mirasoft.io/#/dashboard",
      "title": "Dashboard - Mirasoft",
      "content": {
        "html": "...",
        "javascript": "...",
        "css": "...",
        "links": [...],
        "images": [...]
      },
      "metadata": {
        "parseTime": "2025-10-09T19:51:15.123Z",
        "processingTime": 1250
      }
    }
  ]
}

The parser intelligently discovers SPA routes:
- 🔍 Common route patterns (#/dashboard, #/admin, etc.)
- 🖱️ Interactive element discovery (buttons, links, menus)
- 📜 Dynamic content loading with scrolling and hover actions
- 🔗 Hash-based and modern routing support
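
As an illustration of the hash-based part of route discovery, the sketch below turns the `href` values collected from a page into a de-duplicated list of same-origin hash routes. It is not the parser's actual discovery code: `discoverHashRoutes` is a hypothetical helper, and collecting the `href` values from the live page (clicks, scrolling, hover) is left out.

```typescript
// Resolve raw href values against the base URL and keep only
// same-origin hash routes such as "#/dashboard".
function discoverHashRoutes(baseUrl: string, hrefs: string[]): string[] {
  const base = new URL(baseUrl);
  const routes = new Set<string>();
  for (const href of hrefs) {
    let resolved: URL;
    try {
      resolved = new URL(href, base);
    } catch {
      continue; // skip values that are not valid URLs
    }
    if (resolved.origin !== base.origin) continue; // external link
    if (!resolved.hash.startsWith("#/")) continue; // plain anchor, not a route
    routes.add(resolved.origin + resolved.pathname + resolved.hash);
  }
  return Array.from(routes);
}
```

Using a `Set` keeps the first-seen order while removing duplicates, so a menu that links to the same route several times contributes it only once.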
- Content Analysis: Uses Mistral model for content categorization
- Embedding Generation: Creates vector embeddings with Nomic
- Semantic Processing: Extracts structured data from unstructured content
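
To make the embedding step concrete, here is a minimal sketch that requests an embedding from a local Ollama instance and compares two vectors. It assumes Ollama's `/api/embeddings` endpoint on `localhost:11434` and the `nomic-embed-text` model; the helper names `embed` and `cosineSimilarity` are hypothetical, and how the parser itself calls Ollama is not shown in this README.

```typescript
// Assumed local Ollama address (the default port).
const OLLAMA_URL = "http://localhost:11434";

// Request an embedding vector for a piece of text from Ollama.
async function embed(text: string, model = "nomic-embed-text"): Promise<number[]> {
  const response = await fetch(`${OLLAMA_URL}/api/embeddings`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, prompt: text }),
  });
  if (!response.ok) {
    throw new Error(`Ollama returned HTTP ${response.status}`);
  }
  const data = (await response.json()) as { embedding: number[] };
  return data.embedding;
}

// Cosine similarity between two embedding vectors, the usual score
// for semantic search over stored page embeddings.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid division by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

A semantic search would embed the query with `embed`, then rank stored page vectors by `cosineSimilarity` against it.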
- Smart Resume: Checks existing JSON results and continues from unvisited URLs
- Progress Tracking: Maintains visited URL state across runs
- Efficient Processing: Avoids re-parsing already processed content
Recent successful run:
- ✅ 100 pages parsed successfully
- ⏱️ 18 minutes total duration
- 📁 1.1GB of extracted data
- 🎯 0 errors encountered
- 🚀 5.5 pages/minute average speed
🚨 ВНИМАНИЕ! Нужен ручной логин! ("ATTENTION! Manual login required!")
- Check credentials in config.json
- Use the opened browser window for manual login
- Wait for the success confirmation message
# Reduce max pages for large sites
bun manage-config.ts max-pages 50

# Enable force reparse to start fresh
bun manage-config.ts force-reparse on

- Bun Runtime: v1.2.23+
- Ollama: Running on localhost:11434 with Mistral model
- Chromium Browser: Auto-installed by Playwright
- Node.js: v18+ (for compatibility)
ollama pull mistral:latest
ollama pull nomic-embed-text:latest

The parser includes comprehensive logging:
- Parse Progress: Real-time URL processing updates
- Authentication Flow: Detailed login attempt tracking
- Error Handling: Structured error reporting
- Performance Metrics: Timing and success rate tracking
- Daily Rotation: Organized log files by date
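
A structured, date-rotated log of this kind could look roughly like the sketch below. The entry fields, the JSON-lines format, and the helper names `formatLogEntry` and `logFileName` are assumptions for illustration; only the `parser-YYYY-MM-DD.log` file name pattern comes from this README, and using the UTC date is also an assumption.

```typescript
type LogLevel = "info" | "warn" | "error";

// Serialize one log entry as a single JSON line.
function formatLogEntry(
  level: LogLevel,
  event: string,
  fields: Record<string, unknown>,
  now: Date,
): string {
  return JSON.stringify({ timestamp: now.toISOString(), level, event, ...fields });
}

// Daily rotation: one file per UTC date, e.g. parser-2025-10-09.log.
function logFileName(now: Date): string {
  return `parser-${now.toISOString().slice(0, 10)}.log`;
}
```

One entry per line keeps the log easy to filter by `event` or `level` with standard tools.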
Perfect for:
- Incremental Parsing: Regular content updates
- Monitoring Changes: Track website modifications over time
- Data Pipeline: Feed parsed content to ML/AI systems
- Content Analysis: Systematic web application exploration
For issues with parsing specific applications or configuration questions, check the logs in logs/parser-[date].log for detailed debugging information.
Made with ❤️ for comprehensive web application analysis and AI data preparation