odilovicc/spa-parser

🕷️ SPA Parser - Comprehensive Web Application Content Extractor

A Single Page Application (SPA) parser designed to "soak up the juice of the whole project": it extracts HTML, logic, and data so the content can be fed to AI models and used to build vector databases.

🚀 Features

  • 🌐 SPA Support: Navigate and parse dynamic single-page applications with authentication
  • 🔐 Smart Authentication: Automatic login with fallback to manual intervention
  • 📄 Resume Functionality: Skip already parsed pages and continue from where you left off
  • 🎯 Comprehensive Content Extraction: Extract HTML, JavaScript, CSS, images, and links
  • 🤖 AI Integration: Process content using Ollama AI models (Mistral, Nomic embeddings)
  • 📊 Vector Database: Create embeddings and vector representations for semantic search
  • 📝 Detailed Logging: Structured logging with specialized parser methods
  • ⚙️ Configuration Management: Easy config management with CLI tools

📦 Installation

# Clone the repository
git clone <repository-url>
cd spa-parser

# Install dependencies with Bun
bun install

# Make sure Ollama is running
ollama serve

⚙️ Configuration

Main Config (config.json)

{
  "target": {
    "url": "https://demo.mirasoft.io",
    "authentication": {
      "required": true,
      "username": "mirabot",
      "password": "123123123",
      "loginUrl": "https://demo.mirasoft.io/login"
    }
  },
  "parsing": {
    "depth": {
      "maxPages": 100,
      "forceReparse": false
    },
    "extractContent": {
      "html": true,
      "javascript": true,
      "css": true,
      "images": true,
      "links": true
    }
  }
}
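
The shape of config.json above can be captured as a type and checked before a run starts. The following is a minimal sketch, not the parser's actual loader: `ParserConfig` and `parseConfig` are hypothetical names, and the validation rules are assumptions inferred from the sample file.

```typescript
// Hypothetical types mirroring the sample config.json (not taken from the repository).
interface ParserConfig {
  target: {
    url: string;
    authentication: {
      required: boolean;
      username: string;
      password: string;
      loginUrl: string;
    };
  };
  parsing: {
    depth: { maxPages: number; forceReparse: boolean };
    extractContent: Record<"html" | "javascript" | "css" | "images" | "links", boolean>;
  };
}

// Parse the config text and fail early on a few common mistakes.
function parseConfig(jsonText: string): ParserConfig {
  const config = JSON.parse(jsonText) as ParserConfig;

  // The URL constructor throws a TypeError on a malformed URL.
  new URL(config.target.url);

  const { maxPages } = config.parsing.depth;
  if (!Number.isInteger(maxPages) || maxPages <= 0) {
    throw new Error(`maxPages must be a positive integer, got ${maxPages}`);
  }

  const auth = config.target.authentication;
  if (auth.required && !auth.loginUrl) {
    throw new Error("authentication.required is true but loginUrl is missing");
  }
  return config;
}
```

Validating up front means a typo in the target URL or page limit surfaces immediately rather than partway through a long crawl.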

Configuration Management

Use the built-in config manager:

# Check current settings
bun manage-config.ts status

# Enable force reparse (reparse all pages)
bun manage-config.ts force-reparse on

# Disable force reparse (skip already parsed)
bun manage-config.ts force-reparse off

# Set max pages limit
bun manage-config.ts max-pages 50
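
For orientation, the commands above could map onto the `parsing.depth` settings roughly as follows. This is only a sketch under the assumption that each command edits one field; `applyCommand` and `DepthSettings` are hypothetical names, and the real manage-config.ts may behave differently.

```typescript
// Hypothetical model of the `parsing.depth` section of config.json.
type DepthSettings = { maxPages: number; forceReparse: boolean };

// Apply one CLI command (e.g. ["force-reparse", "on"]) and return the updated settings.
// The input object is never mutated.
function applyCommand(depth: DepthSettings, args: string[]): DepthSettings {
  const [command, value] = args;
  switch (command) {
    case "status":
      return depth; // nothing to change, the caller just prints the settings
    case "force-reparse":
      if (value !== "on" && value !== "off") {
        throw new Error('force-reparse expects "on" or "off"');
      }
      return { ...depth, forceReparse: value === "on" };
    case "max-pages": {
      const n = Number(value);
      if (!Number.isInteger(n) || n <= 0) {
        throw new Error("max-pages expects a positive integer");
      }
      return { ...depth, maxPages: n };
    }
    default:
      throw new Error(`unknown command: ${command}`);
  }
}
```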

🎯 Usage

Basic Parsing

# Run the parser
bun run index.ts

Resume Functionality

The parser automatically:

  • ✅ Loads previously parsed URLs from existing results
  • ✅ Skips already processed pages (saves time!)
  • ✅ Continues parsing from unvisited routes
  • ✅ Generates progress reports
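
The resume steps listed above can be sketched as two small functions. This assumes earlier result files have already been read and parsed, and that each one has a `pages` array with a `url` per entry (as in the JSON output format); `collectVisited` and `pendingUrls` are hypothetical names, not the parser's own.

```typescript
// Minimal view of a previous parse-result file: only the URLs matter for resuming.
interface PreviousResult {
  pages: { url: string }[];
}

// Gather every URL that appears in any earlier result file.
function collectVisited(results: PreviousResult[]): Set<string> {
  const visited = new Set<string>();
  for (const result of results) {
    for (const page of result.pages) {
      visited.add(page.url);
    }
  }
  return visited;
}

// Decide what to parse in this run: skip visited URLs unless forceReparse is on,
// and never queue more than maxPages.
function pendingUrls(
  discovered: string[],
  visited: Set<string>,
  forceReparse: boolean,
  maxPages: number,
): string[] {
  const unique = Array.from(new Set(discovered)); // drop duplicates, keep order
  const queue = forceReparse ? unique : unique.filter((url) => !visited.has(url));
  return queue.slice(0, maxPages);
}
```

With `forceReparse` set to `false`, a URL already present in an earlier result is dropped from the queue; with `true`, the visited set is ignored and everything is parsed again.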

Authentication Flow

  1. Automatic Login: Tries to log in automatically with the configured credentials
  2. Manual Fallback: Opens browser window for manual login if auto-login fails
  3. Smart Detection: Detects successful authentication and continues parsing
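
The three steps above amount to a try-then-fall-back sequence. The sketch below shows only that control flow; the browser work is injected through a hypothetical `AuthSteps` interface, so none of these names reflect the parser's real internals.

```typescript
type AuthOutcome = "auto" | "manual" | "failed";

// Hypothetical hooks; in a real run these would drive the browser.
interface AuthSteps {
  tryAutoLogin: () => Promise<void>;       // fill in and submit the login form
  waitForManualLogin: () => Promise<void>; // resolve once the user has finished in the browser window
  isAuthenticated: () => Promise<boolean>; // e.g. check that the login page is gone
}

async function authenticate(steps: AuthSteps): Promise<AuthOutcome> {
  try {
    await steps.tryAutoLogin();
    if (await steps.isAuthenticated()) return "auto";
  } catch {
    // Auto-login threw (missing form, timeout, ...): fall through to manual login.
  }
  await steps.waitForManualLogin();
  return (await steps.isAuthenticated()) ? "manual" : "failed";
}
```

Keeping the authentication check separate from both login paths lets the same detection logic confirm success regardless of how the session was established.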

📊 Output Structure

parsed_data/
├── parse-result-2025-10-09T19-50-44-332Z.json  # Complete parsed data
├── parse-result-2025-10-09T19-50-44-332Z.md    # Markdown summary
└── logs/
    └── parser-2025-10-09.log                   # Detailed logs
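
The file names in the tree above look like an ISO 8601 timestamp with `:` and `.` replaced by `-`. Assuming that is how they are derived (the source does not say), a helper could produce all three paths from one `Date`; `resultFileNames` is a hypothetical name.

```typescript
// Build output paths from a single timestamp.
// Assumption: names are derived from Date.toISOString(), which is always UTC.
function resultFileNames(now: Date): { json: string; markdown: string; log: string } {
  const iso = now.toISOString();            // e.g. 2025-10-09T19:50:44.332Z
  const stamp = iso.replace(/[:.]/g, "-");  // ":" and "." are awkward in file names
  const day = iso.slice(0, 10);             // YYYY-MM-DD for the daily log file
  return {
    json: `parsed_data/parse-result-${stamp}.json`,
    markdown: `parsed_data/parse-result-${stamp}.md`,
    log: `parsed_data/logs/parser-${day}.log`,
  };
}
```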

JSON Output Format

{
  "timestamp": "2025-10-09T19:50:44.332Z",
  "summary": {
    "totalPages": 100,
    "totalUrls": 150,
    "successRate": 100,
    "duration": "18 minutes"
  },
  "pages": [
    {
      "url": "https://demo.mirasoft.io/#/dashboard",
      "title": "Dashboard - Mirasoft",
      "content": {
        "html": "...",
        "javascript": "...",
        "css": "...",
        "links": [...],
        "images": [...]
      },
      "metadata": {
        "parseTime": "2025-10-09T19:51:15.123Z",
        "processingTime": 1250
      }
    }
  ]
}
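
When consuming a result file from other code, it helps to type it. The interfaces below are inferred from the sample above, so treat them as assumptions: the element shapes of `links` and `images` are not shown and are left as `unknown`, and the unit of `processingTime` is not stated. `pageStats` is a hypothetical helper.

```typescript
// Types inferred from the sample output (assumptions, not an official schema).
interface ParsedPage {
  url: string;
  title: string;
  content: {
    html: string;
    javascript: string;
    css: string;
    links: unknown[];  // element shape is elided in the sample
    images: unknown[]; // element shape is elided in the sample
  };
  metadata: { parseTime: string; processingTime: number };
}

interface ParseResultFile {
  timestamp: string;
  summary: { totalPages: number; totalUrls: number; successRate: number; duration: string };
  pages: ParsedPage[];
}

// Report the page count, mean processingTime, and the slowest page.
function pageStats(result: ParseResultFile): {
  pages: number;
  avgProcessingTime: number;
  slowestUrl: string | null;
} {
  let total = 0;
  let slowest: ParsedPage | null = null;
  for (const page of result.pages) {
    total += page.metadata.processingTime;
    if (slowest === null || page.metadata.processingTime > slowest.metadata.processingTime) {
      slowest = page;
    }
  }
  return {
    pages: result.pages.length,
    avgProcessingTime: result.pages.length > 0 ? total / result.pages.length : 0,
    slowestUrl: slowest ? slowest.url : null,
  };
}
```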

🛠️ Advanced Features

SPA Route Discovery

The parser intelligently discovers SPA routes:

  • 🔍 Common route patterns (#/dashboard, #/admin, etc.)
  • 🖱️ Interactive element discovery (buttons, links, menus)
  • 📜 Dynamic content loading with scrolling and hover actions
  • 🔗 Hash-based and modern routing support

AI Integration

  • Content Analysis: Uses Mistral model for content categorization
  • Embedding Generation: Creates vector embeddings with Nomic
  • Semantic Processing: Extracts structured data from unstructured content

Resumable Parsing

  • Smart Resume: Checks existing JSON results and continues from unvisited URLs
  • Progress Tracking: Maintains visited URL state across runs
  • Efficient Processing: Avoids re-parsing already processed content

📈 Performance Stats

Recent successful run:

  • 100 pages parsed successfully
  • ⏱️ 18 minutes total duration
  • 📁 1.1GB of extracted data
  • 🎯 0 errors encountered
  • 🚀 About 5.5 pages/minute average speed (100 pages in roughly 18 minutes)

🔧 Troubleshooting

Authentication Issues

If the parser prints "🚨 ВНИМАНИЕ! Нужен ручной логин!" ("Attention! Manual login required!"):
  • Check credentials in config.json
  • Use the opened browser window for manual login
  • Wait for the success confirmation message

Memory Issues

# Reduce max pages for large sites
bun manage-config.ts max-pages 50

Resume Not Working

# Enable force reparse to start fresh
bun manage-config.ts force-reparse on

📋 Requirements

  • Bun Runtime: v1.2.23+
  • Ollama: Running on localhost:11434 with Mistral model
  • Chromium Browser: Auto-installed by Playwright
  • Node.js: v18+ (for compatibility)

Required Ollama Models

ollama pull mistral:latest
ollama pull nomic-embed-text:latest
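
Before a long run it can be worth confirming that both models are installed. Ollama lists local models at `GET /api/tags`, returning `{ "models": [{ "name": "mistral:latest", ... }] }`. The helper below works on that already-parsed response; `missingModels` is a hypothetical name and is not part of the parser.

```typescript
// Subset of the /api/tags response that this check needs.
interface TagsResponse {
  models: { name: string }[];
}

// Return the required models that are not installed locally.
// A requirement without a tag (e.g. "mistral") matches any installed tag of that model.
function missingModels(tags: TagsResponse, required: string[]): string[] {
  const installed = tags.models.map((m) => m.name);
  return required.filter((wanted) => {
    const hasTag = wanted.indexOf(":") !== -1;
    return !installed.some((name) =>
      hasTag ? name === wanted : name === wanted || name.split(":")[0] === wanted,
    );
  });
}
```

An empty result means every required model is present; any name returned can be fetched with the corresponding `ollama pull` command.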

🎨 Logging Features

The parser includes comprehensive logging:

  • Parse Progress: Real-time URL processing updates
  • Authentication Flow: Detailed login attempt tracking
  • Error Handling: Structured error reporting
  • Performance Metrics: Timing and success rate tracking
  • Daily Rotation: Organized log files by date

🔄 Continuous Integration

Perfect for:

  • Incremental Parsing: Regular content updates
  • Monitoring Changes: Track website modifications over time
  • Data Pipeline: Feed parsed content to ML/AI systems
  • Content Analysis: Systematic web application exploration

📞 Support

For issues with parsing specific applications or configuration questions, check parsed_data/logs/parser-[date].log for detailed debugging information.


Made with ❤️ for comprehensive web application analysis and AI data preparation

About

SPA Parser: A robust Bun-based tool for deeply extracting HTML, JS, CSS, and assets from authenticated Single Page Applications (SPAs). Features smart auth, resumable parsing, AI content analysis with Ollama (Mistral/Nomic), and vector embeddings for semantic search—ideal for AI data pipelines and web monitoring.
