RFC2617 (BASIC and DIGEST Auth)
To use the RFC2617 credential, supply a domain, realm, login (the username), and password.
In Heritrix, RFC2617 authentication is triggered by a 401 (Unauthorized) response. Heritrix builds a key from the domain plus the realm and looks it up in its Credential Store. If a match is found, the credential is loaded into the CrawlURI and the CrawlURI is marked for immediate retry.
When the CrawlURI is retried, the found credentials are added to the request. If the request succeeds with a 200 response code, the credentials are promoted to the CrawlServer, and all subsequent requests made against that CrawlServer preemptively volunteer the credential. If the retried request fails with a 401 response code, the URI is not retried again.
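The lookup, retry, and promotion steps described above can be sketched as a small simulation. This is not Heritrix code: the `Credential`, `Server`, and `fetch_with_auth` names are hypothetical, the Credential Store is modeled as a plain dictionary keyed by `(domain, realm)`, and `send` stands in for whatever actually performs the HTTP request.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Credential:
    """Hypothetical stand-in for one RFC2617 credential entry."""
    domain: str
    realm: str
    login: str
    password: str


@dataclass
class Server:
    """Hypothetical stand-in for a CrawlServer: remembers credentials that worked."""
    promoted: dict = field(default_factory=dict)  # domain -> Credential


def fetch_with_auth(domain, realm, store, server, send):
    """Return the final HTTP status code for one URI.

    `store` maps (domain, realm) to a Credential.
    `send(credential_or_None)` is an assumed callable that performs the
    request and returns a status code.
    """
    # Preemptively volunteer a credential already promoted for this server.
    status = send(server.promoted.get(domain))
    if status != 401:
        return status

    # On 401, look up the store using the domain-plus-realm key.
    cred = store.get((domain, realm))
    if cred is None:
        return status  # no match: nothing to retry with

    # Retry once with the found credential attached.
    status = send(cred)
    if status == 200:
        server.promoted[domain] = cred  # promote on success
    return status  # a second 401 is final: no further retry
```

In this sketch the first request to a protected server costs two round trips (the challenge and the retry), while later requests to the same server cost one, because the promoted credential is sent up front.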
The domain is the canonical root URI of RFC2617; it is the CrawlServer name or URI authority (the host name, plus the port if other than port 80). Examples of domains are: 'www.archive.org' or 'www.archive.org:8080'.
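As an illustration of the "host plus port if other than port 80" rule, the following sketch derives a domain string from a URL. The `credential_domain` helper is hypothetical and only mirrors the rule as stated here; how Heritrix itself treats other default ports, such as 443 for https, is not covered by this page, so the sketch makes no special case for them.

```python
from urllib.parse import urlsplit


def credential_domain(url: str) -> str:
    """Return the host, with ':port' appended only when an explicit port other than 80 is given."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # urlsplit lowercases the host name
    port = parts.port            # None when the URL carries no explicit port
    if port is None or port == 80:
        return host
    return f"{host}:{port}"
```

For example, `credential_domain("http://www.archive.org:8080/index.html")` yields `www.archive.org:8080`, matching the second example domain above.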
A realm is defined in the RFC2617 faq. The realm string must exactly match the realm name presented in the authentication challenge served by the web server.
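To find the exact realm string a server presents, inspect the `WWW-Authenticate` header of its 401 response. The sketch below pulls the realm out of such a header value and compares it to a configured realm. The `challenge_realm` and `realm_matches` helpers are hypothetical, and the pattern is deliberately simple: it assumes the realm is a quoted string without escaped quotes.

```python
import re

# Matches realm="..." in a Basic or Digest challenge (no escaped quotes handled).
_REALM = re.compile(r'realm="([^"]*)"', re.IGNORECASE)


def challenge_realm(header_value):
    """Return the realm from a WWW-Authenticate header value, or None if absent."""
    match = _REALM.search(header_value)
    return match.group(1) if match else None


def realm_matches(configured, header_value):
    """True only when the configured realm equals the challenge realm exactly."""
    return challenge_realm(header_value) == configured
```

Because the comparison is exact, a configured realm of `myrealm` would not match a challenge of `Basic realm="MyRealm"`.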
An RFC2617 credential configuration is illustrated below.
<bean id="credential"
      class="org.archive.modules.credential.HttpAuthenticationCredential">
  <property name="domain">
    <value>domain</value>
  </property>
  <property name="realm">
    <value>myrealm</value>
  </property>
  <property name="login">
    <value>mylogin</value>
  </property>
  <property name="password">
    <value>mypassword</value>
  </property>
</bean>
Note
- Only one realm per credential domain is allowed. See Logging in (HTTP POST, Basic Auth, etc.) for more information.