Link Validator

A tool and GitHub Action to check for broken links (4xx, 5xx status codes) on websites. It supports both sitemap-based checking and web crawling.

Features

  • Sitemap Support: Check links from XML sitemaps
  • Website Crawling: Recursively crawl websites to discover links
  • Concurrent Processing: Configurable concurrent request limits for performance
  • Flexible Configuration: Support for both command-line flags and environment variables
  • Pattern Exclusion: Exclude URLs using regex patterns
  • GitHub Action Integration: Built-in support for GitHub Actions with proper outputs
  • Dynamic URL Resolution: Intelligent base URL detection using HTTP Content-Type headers
  • Comprehensive Reporting: Detailed results with status codes, errors, and timing information
  • Help and Version Support: Built-in help and version information

Installation & Usage

GitHub Action

Use directly in your GitHub workflows:

- name: Check links
  uses: joshbeard/gh-action-link-checker@v1
  with:
    sitemap-url: 'https://example.com/sitemap.xml'

Docker Image

Available on GitHub Container Registry and Docker Hub:

# From GitHub Container Registry
docker run --rm ghcr.io/joshbeard/link-checker:latest \
  --sitemap-url https://example.com/sitemap.xml

# From Docker Hub
docker run --rm joshbeard/link-checker:latest \
  --sitemap-url https://example.com/sitemap.xml

Binary Releases

Download pre-built binaries from GitHub Releases:

curl -L https://github.com/joshbeard/gh-action-link-checker/releases/latest/download/link-checker-linux-amd64 -o link-checker
chmod +x link-checker
./link-checker --sitemap-url https://example.com/sitemap.xml

Getting Help

# Show help information
./link-checker --help

# Show version information
./link-checker --version

Examples

GitHub Action - Sitemap

name: Check Links
on:
  schedule:
    - cron: '0 0 * * 0'  # Weekly on Sunday
  workflow_dispatch:

jobs:
  link-check:
    runs-on: ubuntu-latest
    steps:
      - name: Check links from sitemap
        uses: joshbeard/gh-action-link-checker@v1
        with:
          sitemap-url: 'https://example.com/sitemap.xml'
          timeout: 30
          max-concurrent: 10
          exclude-patterns: '.*\.pdf$,.*example\.com.*'

GitHub Action - Web Crawling

name: Check Links
on:
  push:
    branches: [main]

jobs:
  link-check:
    runs-on: ubuntu-latest
    steps:
      - name: Check links by crawling
        uses: joshbeard/gh-action-link-checker@v1
        with:
          base-url: 'https://example.com'
          max-depth: 3
          timeout: 30
          max-concurrent: 5
          fail-on-error: true

GitLab CI

link-check:
  stage: test
  image: ghcr.io/joshbeard/link-checker:latest
  script:
    - link-checker --sitemap-url https://example.com/sitemap.xml --timeout 30 --max-concurrent 5
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH

Docker with Custom Configuration

docker run --rm ghcr.io/joshbeard/link-checker:latest \
  --base-url https://example.com \
  --max-depth 2 \
  --timeout 60 \
  --exclude-patterns ".*\.pdf$,.*\.zip$" \
  --verbose

Complete GitHub Action with Error Handling

name: Link Checker
on:
  schedule:
    - cron: '0 2 * * 1'  # Weekly on Monday at 2 AM
  workflow_dispatch:

jobs:
  check-links:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Check website links
        id: link-check
        uses: joshbeard/gh-action-link-checker@v1
        with:
          sitemap-url: 'https://example.com/sitemap.xml'
          timeout: 30
          user-agent: 'MyBot/1.0'
          exclude-patterns: '.*\.pdf$,.*\.zip$,.*example\.com.*'
          max-concurrent: 10
          fail-on-error: false

      - name: Report broken links if found
        if: steps.link-check.outputs.broken-links-count > 0
        uses: actions/github-script@v7
        with:
          script: |
            const brokenLinks = JSON.parse('${{ steps.link-check.outputs.broken-links }}');
            const count = '${{ steps.link-check.outputs.broken-links-count }}';

            let comment = `## πŸ”— Link Check Results\n\n`;
            comment += `Found ${count} broken link(s):\n\n`;

            brokenLinks.forEach(link => {
              comment += `- ❌ [${link.url}](${link.url}) - ${link.error}\n`;
            });

            console.log(comment);

Configuration

Inputs (GitHub Action)

All inputs are optional.

sitemap-url        URL to sitemap.xml to check links from
base-url           Base URL to crawl for links (used if sitemap-url not provided)
max-depth          Maximum crawl depth when using base-url (default 3)
timeout            Request timeout in seconds (default 30)
user-agent         User agent string for requests (default "GitHub-Action-Link-Checker/1.0")
exclude-patterns   Comma-separated list of URL patterns to exclude (regex supported)
fail-on-error      Whether to fail the action if broken links are found (default true)
max-concurrent     Maximum number of concurrent requests (default 10)
verbose            Show detailed output for each link checked (default false)

Command Line Flags

When using the binary or Docker image, use these flags:

-sitemap-url string       URL to sitemap.xml
-base-url string          Base URL to crawl
-max-depth int            Maximum crawl depth (default 3)
-timeout int              Request timeout in seconds (default 30)
-user-agent string        User agent string (default "GitHub-Action-Link-Checker/1.0")
-exclude-patterns string  Comma-separated exclude patterns
-max-concurrent int       Max concurrent requests (default 10)
-fail-on-error           Exit with error code if broken links found (default true)
-verbose                 Show detailed output
-help                    Show help information
-version                 Show version information

Environment Variables

The tool supports environment variables (primarily for GitHub Action integration):

INPUT_SITEMAP_URL         URL of the sitemap to check
INPUT_BASE_URL            Base URL to start crawling from
INPUT_MAX_DEPTH           Maximum crawl depth (default: 3)
INPUT_TIMEOUT             Request timeout in seconds (default: 30)
INPUT_USER_AGENT          User agent string (default: Link-Validator/1.0)
INPUT_EXCLUDE_PATTERNS    Comma-separated regex patterns to exclude URLs
INPUT_FAIL_ON_ERROR       Exit with error code if broken links found (default: true)
INPUT_MAX_CONCURRENT      Maximum concurrent requests (default: 10)
INPUT_VERBOSE             Enable verbose output (default: false)

Note: Command line flags take precedence over environment variables.
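
For illustration, here is a minimal Go sketch of that precedence: an explicitly set flag wins, then the INPUT_* environment variable, then the documented default. The function is hypothetical and not the tool's actual implementation:

// Illustrative only: resolve the request timeout using the documented
// precedence: explicit --timeout flag > INPUT_TIMEOUT env var > default 30.
package main

import (
    "flag"
    "fmt"
    "os"
    "strconv"
)

func main() {
    timeout := flag.Int("timeout", 0, "request timeout in seconds")
    flag.Parse()

    resolved := 30 // documented default
    if v := os.Getenv("INPUT_TIMEOUT"); v != "" {
        if n, err := strconv.Atoi(v); err == nil {
            resolved = n
        }
    }
    // A flag that was explicitly set on the command line wins over the
    // environment variable; flag.Visit only visits flags that were set.
    flag.Visit(func(f *flag.Flag) {
        if f.Name == "timeout" {
            resolved = *timeout
        }
    })
    fmt.Println("timeout:", resolved)
}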

Outputs (GitHub Action)

Output                 Description
broken-links-count     Number of broken links found
broken-links           JSON array of broken links with details
total-links-checked    Total number of links checked

Advanced Usage

Using Environment Variables

You can use environment variables instead of command line flags:

# Check links from sitemap using environment variables
INPUT_SITEMAP_URL=https://example.com/sitemap.xml ./link-checker

# Crawl website using environment variables
INPUT_BASE_URL=https://example.com INPUT_MAX_DEPTH=2 INPUT_VERBOSE=true ./link-checker

Exclude Patterns

You can exclude URLs using regex patterns:

with:
  exclude-patterns: '.*\.pdf$,.*\.zip$,.*example\.com.*,.*#.*'

This will exclude:

  • PDF files
  • ZIP files
  • Any URLs containing "example.com"
  • Any URLs with fragments (anchors)
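
For reference, a short Go sketch of how comma-separated exclude patterns of this form can be compiled and matched against URLs. It mirrors the documented semantics (each entry is a regular expression) and is not the tool's actual code:

// Illustrative only: split comma-separated patterns, compile each as a
// regular expression, and report whether a URL matches any of them.
package main

import (
    "fmt"
    "regexp"
    "strings"
)

func main() {
    patterns := `.*\.pdf$,.*\.zip$,.*example\.com.*,.*#.*`

    var excludes []*regexp.Regexp
    for _, p := range strings.Split(patterns, ",") {
        excludes = append(excludes, regexp.MustCompile(strings.TrimSpace(p)))
    }

    isExcluded := func(url string) bool {
        for _, re := range excludes {
            if re.MatchString(url) {
                return true
            }
        }
        return false
    }

    fmt.Println(isExcluded("https://example.org/files/manual.pdf")) // true  (matches .*\.pdf$)
    fmt.Println(isExcluded("https://example.org/about"))            // false (matches nothing)
}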

Rate Limiting

Control concurrent requests to be respectful to target servers:

with:
  max-concurrent: 5  # Only 5 concurrent requests
  timeout: 60        # 60 second timeout per request

Verbose Output

Enable detailed output to see each link as it's being checked:

with:
  verbose: true

This will show output like:

βœ… [1/111] https://example.com/page1 (Status: 200, Duration: 45ms)
❌ [2/111] https://example.com/broken (Status: 404, Duration: 23ms)
πŸ”„ [3/111] https://example.com/redirect (Status: 301, Duration: 67ms)

Status emojis:

  • βœ… Success (2xx)
  • πŸ”„ Redirect (3xx)
  • ❌ Client Error (4xx)
  • πŸ’₯ Server Error (5xx)
  • ❓ Unknown/Error

Development

Building

go mod tidy
go build -o link-checker ./cmd/link-checker

Or use the Makefile:

make build    # Build the binary
make test     # Run tests
make help     # See all available targets

Testing

Run the test suite:

go test ./...              # Run all tests
go test ./... -cover       # Run with coverage
go test ./... -v           # Verbose output

Test Coverage

The project maintains high test coverage. To generate a coverage report:

go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out -o coverage.html

Dynamic URL Resolution

The link checker uses intelligent URL resolution to properly handle relative links on web pages:

  1. HTML Base Tag Detection: If a page contains a <base href="..."> tag, it uses that as the base URL for resolving relative links.

  2. Dynamic Content-Type Analysis: When no base tag is present, the tool makes HTTP HEAD requests to determine if a URL represents a file or directory based on the Content-Type header:
    • Directory-like content (text/html, application/json, application/xml): Treats the URL as a directory for relative link resolution
    • File-like content (application/pdf, image/*, audio/*, video/*, etc.): Uses the parent directory for relative link resolution

  3. Extension-based Fallback: If HTTP detection fails, falls back to file extension analysis to determine URL type.
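
A condensed Go sketch of that decision (HEAD request, Content-Type check, then extension fallback) is shown below. It illustrates the documented behavior and is not the tool's actual implementation:

// Illustrative only: choose the base URL used to resolve relative links.
package main

import (
    "fmt"
    "net/http"
    "net/url"
    "path"
    "strings"
)

// directoryLike reports whether a Content-Type suggests directory-like
// content (HTML, JSON, XML) rather than a downloadable file.
func directoryLike(contentType string) bool {
    for _, t := range []string{"text/html", "application/json", "application/xml"} {
        if strings.HasPrefix(contentType, t) {
            return true
        }
    }
    return false
}

// baseFor returns the URL that relative links on pageURL resolve against.
// (Step 1, the <base href> tag, is handled during HTML parsing and omitted here.)
func baseFor(pageURL string) string {
    u, err := url.Parse(pageURL)
    if err != nil {
        return pageURL
    }
    // Step 2: HEAD request; directory-like content keeps the URL as the base.
    if resp, err := http.Head(pageURL); err == nil {
        resp.Body.Close()
        if directoryLike(resp.Header.Get("Content-Type")) {
            return pageURL
        }
        // File-like content: resolve against the parent directory.
        u.Path = path.Dir(u.Path) + "/"
        return u.String()
    }
    // Step 3: extension-based fallback when the HEAD request fails.
    if path.Ext(u.Path) != "" {
        u.Path = path.Dir(u.Path) + "/"
    }
    return u.String()
}

func main() {
    fmt.Println(baseFor("https://example.com/docs/guide.pdf"))
}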

License

MIT License - see LICENSE file for details.
