A tool and GitHub Action to check for broken links (4xx, 5xx status codes) on websites. It supports both sitemap-based checking and web crawling.
- Sitemap Support: Check links from XML sitemaps
- Website Crawling: Recursively crawl websites to discover links
- Concurrent Processing: Configurable concurrent request limits for performance
- Flexible Configuration: Support for both command-line flags and environment variables
- Pattern Exclusion: Exclude URLs using regex patterns
- GitHub Action Integration: Built-in support for GitHub Actions with proper outputs
- Dynamic URL Resolution: Intelligent base URL detection using HTTP Content-Type headers
- Comprehensive Reporting: Detailed results with status codes, errors, and timing information
- Help and Version Support: Built-in help and version information
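A link counts as broken when its response carries a 4xx or 5xx status code (or the request fails outright). A minimal sketch of that check, assuming illustrative names rather than the tool's internal API:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// checkLink issues a GET request and reports whether the response status
// falls in the 4xx/5xx range. Function and variable names are illustrative,
// not the tool's internal API.
func checkLink(client *http.Client, url string) (int, bool, error) {
	resp, err := client.Get(url)
	if err != nil {
		return 0, true, err // network/transport errors count as broken
	}
	defer resp.Body.Close()
	return resp.StatusCode, resp.StatusCode >= 400, nil
}

func main() {
	client := &http.Client{Timeout: 30 * time.Second}
	status, broken, err := checkLink(client, "https://example.com")
	fmt.Println(status, broken, err)
}
```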
Use directly in your GitHub workflows:
```yaml
- name: Check links
  uses: joshbeard/gh-action-link-checker@v1
  with:
    sitemap-url: 'https://example.com/sitemap.xml'
```
Available on GitHub Container Registry and Docker Hub:
```shell
# From GitHub Container Registry
docker run --rm ghcr.io/joshbeard/link-checker:latest \
  --sitemap-url https://example.com/sitemap.xml

# From Docker Hub
docker run --rm joshbeard/link-checker:latest \
  --sitemap-url https://example.com/sitemap.xml
```
Download pre-built binaries from GitHub Releases:
```shell
curl -L https://github.com/joshbeard/gh-action-link-checker/releases/latest/download/link-checker-linux-amd64 -o link-checker
chmod +x link-checker
./link-checker --sitemap-url https://example.com/sitemap.xml

# Show help information
./link-checker --help

# Show version information
./link-checker --version
```
```yaml
name: Check Links

on:
  schedule:
    - cron: '0 0 * * 0'  # Weekly on Sunday
  workflow_dispatch:

jobs:
  link-check:
    runs-on: ubuntu-latest
    steps:
      - name: Check links from sitemap
        uses: joshbeard/gh-action-link-checker@v1
        with:
          sitemap-url: 'https://example.com/sitemap.xml'
          timeout: 30
          max-concurrent: 10
          exclude-patterns: '.*\.pdf$,.*example\.com.*'
```
```yaml
name: Check Links

on:
  push:
    branches: [main]

jobs:
  link-check:
    runs-on: ubuntu-latest
    steps:
      - name: Check links by crawling
        uses: joshbeard/gh-action-link-checker@v1
        with:
          base-url: 'https://example.com'
          max-depth: 3
          timeout: 30
          max-concurrent: 5
          fail-on-error: true
```
```yaml
link-check:
  stage: test
  image: ghcr.io/joshbeard/link-checker:latest
  script:
    - link-checker --sitemap-url https://example.com/sitemap.xml --timeout 30 --max-concurrent 5
  rules:
    - if: $CI_PIPELINE_SOURCE == "schedule"
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
```
```shell
docker run --rm ghcr.io/joshbeard/link-checker:latest \
  --base-url https://example.com \
  --max-depth 2 \
  --timeout 60 \
  --exclude-patterns ".*\.pdf$,.*\.zip$" \
  --verbose
```
```yaml
name: Link Checker

on:
  schedule:
    - cron: '0 2 * * 1'  # Weekly on Monday at 2 AM
  workflow_dispatch:

jobs:
  check-links:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Check website links
        id: link-check
        uses: joshbeard/gh-action-link-checker@v1
        with:
          sitemap-url: 'https://example.com/sitemap.xml'
          timeout: 30
          user-agent: 'MyBot/1.0'
          exclude-patterns: '.*\.pdf$,.*\.zip$,.*example\.com.*'
          max-concurrent: 10
          fail-on-error: false

      - name: Comment on PR if broken links found
        if: steps.link-check.outputs.broken-links-count > 0
        uses: actions/github-script@v7
        with:
          script: |
            const brokenLinks = JSON.parse('${{ steps.link-check.outputs.broken-links }}');
            const count = '${{ steps.link-check.outputs.broken-links-count }}';

            let comment = `## 🔗 Link Check Results\n\n`;
            comment += `Found ${count} broken link(s):\n\n`;

            brokenLinks.forEach(link => {
              comment += `- ❌ [${link.url}](${link.url}) - ${link.error}\n`;
            });

            console.log(comment);
```
| Input | Description | Required | Default |
|---|---|---|---|
| `sitemap-url` | URL to sitemap.xml to check links from | No | - |
| `base-url` | Base URL to crawl for links (used if `sitemap-url` not provided) | No | - |
| `max-depth` | Maximum crawl depth when using `base-url` | No | `3` |
| `timeout` | Request timeout in seconds | No | `30` |
| `user-agent` | User agent string for requests | No | `GitHub-Action-Link-Checker/1.0` |
| `exclude-patterns` | Comma-separated list of URL patterns to exclude (regex supported) | No | - |
| `fail-on-error` | Whether to fail the action if broken links are found | No | `true` |
| `max-concurrent` | Maximum number of concurrent requests | No | `10` |
| `verbose` | Show detailed output for each link checked | No | `false` |
When using the binary or Docker image, use these flags:
```text
-sitemap-url string       URL to sitemap.xml
-base-url string          Base URL to crawl
-max-depth int            Maximum crawl depth (default 3)
-timeout int              Request timeout in seconds (default 30)
-user-agent string        User agent string (default "GitHub-Action-Link-Checker/1.0")
-exclude-patterns string  Comma-separated exclude patterns
-max-concurrent int       Max concurrent requests (default 10)
-fail-on-error            Exit with error code if broken links found (default true)
-verbose                  Show detailed output
-help                     Show help information
-version                  Show version information
```
The tool supports environment variables (primarily for GitHub Action integration):
```text
INPUT_SITEMAP_URL       URL of the sitemap to check
INPUT_BASE_URL          Base URL to start crawling from
INPUT_MAX_DEPTH         Maximum crawl depth (default: 3)
INPUT_TIMEOUT           Request timeout in seconds (default: 30)
INPUT_USER_AGENT        User agent string (default: Link-Validator/1.0)
INPUT_EXCLUDE_PATTERNS  Comma-separated regex patterns to exclude URLs
INPUT_FAIL_ON_ERROR     Exit with error code if broken links found (default: true)
INPUT_MAX_CONCURRENT    Maximum concurrent requests (default: 10)
INPUT_VERBOSE           Enable verbose output (default: false)
```
Note: Command line flags take precedence over environment variables.
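One common way to get that precedence in Go is to use the environment value only as the flag's default, so an explicit flag always overrides it. A minimal sketch, assuming this pattern (the tool's actual wiring may differ):

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// envOrDefault returns the INPUT_* environment value if set, otherwise the
// built-in default. Passing it as the flag default means a flag given on the
// command line always wins. Illustrative only.
func envOrDefault(key, def string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return def
}

func main() {
	sitemapURL := flag.String("sitemap-url", envOrDefault("INPUT_SITEMAP_URL", ""), "URL to sitemap.xml")
	flag.Parse()
	fmt.Println("sitemap-url:", *sitemapURL)
}
```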
| Output | Description |
|---|---|
| `broken-links-count` | Number of broken links found |
| `broken-links` | JSON array of broken links with details |
| `total-links-checked` | Total number of links checked |
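Judging from the workflow example above, each entry in `broken-links` carries at least `url` and `error` fields. A hedged sketch of parsing that output in Go; the `status` field and the exact schema are assumptions, not documented output:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BrokenLink mirrors the fields used in the workflow example (url, error).
// Status is an assumption based on the per-link status codes the tool reports
// and may not match the actual output schema.
type BrokenLink struct {
	URL    string `json:"url"`
	Status int    `json:"status"`
	Error  string `json:"error"`
}

func main() {
	raw := `[{"url":"https://example.com/broken","status":404,"error":"Not Found"}]`
	var links []BrokenLink
	if err := json.Unmarshal([]byte(raw), &links); err != nil {
		panic(err)
	}
	for _, l := range links {
		fmt.Printf("%s -> %d (%s)\n", l.URL, l.Status, l.Error)
	}
}
```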
You can use environment variables instead of command line flags:
```shell
# Check links from sitemap using environment variables
INPUT_SITEMAP_URL=https://example.com/sitemap.xml ./link-checker

# Crawl website using environment variables
INPUT_BASE_URL=https://example.com INPUT_MAX_DEPTH=2 INPUT_VERBOSE=true ./link-checker
```
You can exclude URLs using regex patterns:
```yaml
with:
  exclude-patterns: '.*\.pdf$,.*\.zip$,.*example\.com.*,.*#.*'
```
This will exclude:
- PDF files
- ZIP files
- Any URLs containing "example.com"
- Any URLs with fragments (anchors)
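The patterns are ordinary regular expressions, so in Go the matching step boils down to splitting on commas, compiling each piece, and skipping any URL that matches. A minimal sketch, assuming illustrative helper names:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// compilePatterns splits the comma-separated pattern string and compiles each
// piece as a regular expression. Illustrative; the tool's own parsing may
// differ in details such as whitespace handling.
func compilePatterns(s string) ([]*regexp.Regexp, error) {
	var out []*regexp.Regexp
	for _, p := range strings.Split(s, ",") {
		p = strings.TrimSpace(p)
		if p == "" {
			continue
		}
		re, err := regexp.Compile(p)
		if err != nil {
			return nil, err
		}
		out = append(out, re)
	}
	return out, nil
}

// excluded reports whether a URL matches any of the compiled patterns.
func excluded(url string, patterns []*regexp.Regexp) bool {
	for _, re := range patterns {
		if re.MatchString(url) {
			return true
		}
	}
	return false
}

func main() {
	patterns, _ := compilePatterns(`.*\.pdf$,.*\.zip$,.*example\.com.*,.*#.*`)
	fmt.Println(excluded("https://example.com/doc.pdf", patterns)) // true
}
```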
Control concurrent requests to be respectful to target servers:
```yaml
with:
  max-concurrent: 5   # Only 5 concurrent requests
  timeout: 60         # 60 second timeout per request
```
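Bounding concurrency like this is typically done with a worker pool or a semaphore. A minimal sketch using a buffered channel as the semaphore; this illustrates the idea under assumed names, not the tool's internals:

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

func main() {
	urls := []string{"https://example.com/", "https://example.com/a", "https://example.com/b"}
	maxConcurrent := 5
	client := &http.Client{Timeout: 60 * time.Second}

	sem := make(chan struct{}, maxConcurrent) // at most maxConcurrent requests in flight
	var wg sync.WaitGroup
	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it when done
			resp, err := client.Get(u)
			if err != nil {
				fmt.Println(u, "error:", err)
				return
			}
			resp.Body.Close()
			fmt.Println(u, resp.StatusCode)
		}(u)
	}
	wg.Wait()
}
```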
Enable detailed output to see each link as it's being checked:
```yaml
with:
  verbose: true
```
This will show output like:
```text
✅ [1/111] https://example.com/page1 (Status: 200, Duration: 45ms)
❌ [2/111] https://example.com/broken (Status: 404, Duration: 23ms)
🔄 [3/111] https://example.com/redirect (Status: 301, Duration: 67ms)
```
Status emojis:
- ✅ Success (2xx)
- 🔄 Redirect (3xx)
- ❌ Client Error (4xx)
- 💥 Server Error (5xx)
- ❓ Unknown/Error
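The category is simply derived from the status code range. A small sketch of that mapping, assuming illustrative names and symbols rather than the tool's actual code:

```go
package main

import "fmt"

// statusEmoji maps an HTTP status code to the category used in the verbose
// output above. Illustrative mapping only.
func statusEmoji(status int) string {
	switch {
	case status >= 200 && status < 300:
		return "✅" // success
	case status >= 300 && status < 400:
		return "🔄" // redirect
	case status >= 400 && status < 500:
		return "❌" // client error
	case status >= 500 && status < 600:
		return "💥" // server error
	default:
		return "❓" // unknown/error
	}
}

func main() {
	for _, s := range []int{200, 301, 404, 503, 0} {
		fmt.Println(s, statusEmoji(s))
	}
}
```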
```shell
go mod tidy
go build -o link-checker ./cmd/link-checker
```
Or use the Makefile:
```shell
make build   # Build the binary
make test    # Run tests
make help    # See all available targets
```
Run the test suite:
```shell
go test ./...          # Run all tests
go test ./... -cover   # Run with coverage
go test ./... -v       # Verbose output
```
The project maintains high test coverage. To generate a coverage report:
```shell
go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out -o coverage.html
```
The link checker uses intelligent URL resolution to properly handle relative links on web pages:
1. HTML Base Tag Detection: If a page contains a `<base href="...">` tag, it uses that as the base URL for resolving relative links.
2. Dynamic Content-Type Analysis: When no base tag is present, the tool makes HTTP HEAD requests to determine whether a URL represents a file or a directory based on the Content-Type header:
   - Directory-like content (`text/html`, `application/json`, `application/xml`): treats the URL as a directory for relative link resolution
   - File-like content (`application/pdf`, `image/*`, `audio/*`, `video/*`, etc.): uses the parent directory for relative link resolution
3. Extension-based Fallback: If HTTP detection fails, the tool falls back to file extension analysis to determine the URL type.
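A hedged sketch of the Content-Type step: issue a HEAD request, treat HTML/JSON/XML responses as directory-like (resolve relative links against the URL itself), and treat everything else as file-like (resolve against the parent path). Function names are illustrative, and the sketch omits the base-tag and extension-fallback steps:

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"path"
	"strings"
	"time"
)

// baseForRelativeLinks decides which URL relative links should be resolved
// against, based on the Content-Type reported by a HEAD request. Illustrative
// sketch; the tool also honors <base href> and falls back to extension
// analysis when the HEAD request fails.
func baseForRelativeLinks(client *http.Client, raw string) (string, error) {
	resp, err := client.Head(raw)
	if err != nil {
		return "", err
	}
	resp.Body.Close()

	ct := resp.Header.Get("Content-Type")
	switch {
	case strings.HasPrefix(ct, "text/html"),
		strings.HasPrefix(ct, "application/json"),
		strings.HasPrefix(ct, "application/xml"):
		return raw, nil // directory-like: resolve relative to the URL itself
	default:
		u, err := url.Parse(raw)
		if err != nil {
			return "", err
		}
		u.Path = path.Dir(u.Path) + "/" // file-like: use the parent directory
		return u.String(), nil
	}
}

func main() {
	client := &http.Client{Timeout: 10 * time.Second}
	base, err := baseForRelativeLinks(client, "https://example.com/docs/guide.pdf")
	fmt.Println(base, err)
}
```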
MIT License - see LICENSE file for details.