Refactoring: Access Check and Exclusion
(This is still a pile of random thoughts; I'd welcome comments and clean-ups. Kenji)
There are two frameworks for filtering out captures: one for ResourceIndex/CaptureSearchResult and another for CDXServer/CDXLine. As we plan to consolidate the index implementation into CDXServer, I focus on the CDXServer version of capture filtering here. In order to reuse filtering components originally written for the ResourceIndex framework, there is some dirty bridge work in CDXServer/CDXLine. That is one area needing clean-up.
Two kinds of filtering are recognized to date:
- scope filtering (for hosting multiple collections on top of a single Wayback index)
- access control (for prohibiting playback of certain captures, often controlled by an external database of filtering rules)
Scope filtering is currently tightly coupled with CompositeAccessPoint and is specific to the use case at Internet Archive. Access control can change the visibility of CDX fields as well as filter out CDX lines altogether.
The key difference between the two is:
- scope filtering is silent; there is no need to tell the user that captures are being filtered out, whereas
- access control needs to communicate what filtering took effect and how (e.g. "excluded by robots.txt"). This communication is not well implemented in my opinion (more later).
Another (minor) difference is:
- scope filtering is usually statically configured, whereas
- access control often varies by client (username, IP address, etc.)
Description of classes involved.
AuthChecker is an interface that acts as a factory of CDXAccessFilter. CDXServer is configured with an implementation of this interface at startup. CDXAccessFilter is a per-session filtering object.
AuthChecker also grants permissions to AuthToken. The primary implementation, PrivTokenAuthChecker, is configured with a list of pre-defined tokens, and determines whether a user (subject; represented by AuthToken) has certain permissions.
The mix-up of these two functions is a legacy of the stand-alone CDXServer implementation. A more common approach is to have a separate authentication/authorization component, and let other parts of the application consult the subject object for user permissions. With this architecture, we can remove the isAllUrlAccessAllowed and isAllCdxFieldAccessAllowed methods from AuthChecker. It is harder to reuse authentication/authorization functionality in Wayback because of this mix-up.
The getPublicCdxFields method appears to be unused; I don't know why it must be part of the AuthChecker interface. PrivTokenAuthChecker has setPublicCdxFields(String), which updates the publicCdxFormat property with a FieldSplitFormat object. There is no setPublicCdxFormat(FieldSplitFormat) method.
getPublicCdxFormat is used by the CDXServer.writeCdxResponse method:
if (!authChecker.isAllCdxFieldAccessAllowed(authToken)) {
    outputFields = this.authChecker.getPublicCdxFormat();
}
This property could be moved to CDXServer.
WaybackAPAuthChecker has been superseded by AccessPointAuthChecker, and there is no known user of WaybackAPAuthChecker currently. Its base class WaybackAuthChecker has no other sub-classes. These two classes can be removed.
The name of the AuthToken class indicates a close tie to PrivTokenAuthChecker. It is more like the Subject class defined by JAAS.
The authToken field is the name of a subject (JAAS allows for multiple identities, represented by the sub-object Principal). cachedAllUrlAllow, cachedAllCdxAllow and ignoreRobots are permissions.
AuthToken is abused to pass AccessPoint to AuthChecker. Its sole sub-class APContextAuthToken saves the AccessPoint object passed to its constructor for later use by an AuthChecker implementation (e.g. AccessPointAuthChecker) to build a collection-specific CDXAccessFilter. With the introduction of the AccessPoint.createExclusionFilter method, this method has become the standard way of instantiating ExclusionFilter. We should add a CollectionContext (or CDX Server equivalent of it) parameter to CDXServer.getCdx to make this confusing trick unnecessary.
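A rough sketch of what passing the collection context explicitly might look like. None of this is the actual Wayback API: CDXLine, LineFilter, CollectionContext, AuthToken and CdxServerSketch below are simplified stand-ins written for this note, and the getCdx signature is only an assumption about the shape of the change.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the real CDXLine; only the fields this sketch needs.
record CDXLine(String urlKey, String timestamp) {}

// Hypothetical stand-in for the exclusion filter built per collection.
interface LineFilter {
    boolean include(CDXLine line);
}

// Hypothetical stand-in for CollectionContext: it supplies the filter,
// which is the job APContextAuthToken currently does via AccessPoint.
interface CollectionContext {
    LineFilter createExclusionFilter();
}

// AuthToken only identifies the user; it carries no AccessPoint.
final class AuthToken {
    final String name;

    AuthToken(String name) {
        this.name = name;
    }
}

final class CdxServerSketch {
    private final List<CDXLine> index;

    CdxServerSketch(List<CDXLine> index) {
        this.index = List.copyOf(index);
    }

    // The collection context is an explicit parameter of the query.
    // The token is unused in this sketch; permission checks would consult it.
    List<CDXLine> getCdx(String urlKey, AuthToken token, CollectionContext context) {
        LineFilter filter = context.createExclusionFilter();
        List<CDXLine> result = new ArrayList<>();
        for (CDXLine line : index) {
            if (line.urlKey().equals(urlKey) && filter.include(line)) {
                result.add(line);
            }
        }
        return result;
    }
}
```

With this shape, the same AuthToken can be reused across collections, and the filter is chosen by whoever owns the collection configuration.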
The setAllCdxFieldsAllow() and setIgnoreRobots(boolean) methods are worth a special note. These methods are used by EmbeddedCDXServerIndex to configure APContextAuthToken for internal use of CDXServer.
CDXAccessFilter is an interface for the per-session filtering object. It defines two methods:
boolean includeUrl(String urlKey, String originalUrl)
boolean includeCapture(CDXLine line)
includeUrl is called (by CDXServer.getCdx(CDXQuery, AuthToken)) just once per URL, before any calls to includeCapture, to check for per-URL filtering. This method exists so that per-URL exclusion can be detected quickly, even before loading the first line of CDX.
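The call order described above can be sketched as follows. This is a minimal illustration, not the actual CDXServer code: CDXLine is a stand-in record, the two interface methods are the ones listed above, and queryCaptures is a hypothetical helper that plays the role of getCdx.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the real CDXLine; only the fields this sketch needs.
record CDXLine(String urlKey, String originalUrl, String timestamp) {}

// The two methods named in this section.
interface CDXAccessFilter {
    boolean includeUrl(String urlKey, String originalUrl);

    boolean includeCapture(CDXLine line);
}

final class AccessFilterCallOrder {
    // Hypothetical helper mimicking the order of calls:
    // includeUrl once, then includeCapture for each matching line.
    static List<CDXLine> queryCaptures(String urlKey, String originalUrl,
            List<CDXLine> index, CDXAccessFilter filter) {
        List<CDXLine> result = new ArrayList<>();
        if (!filter.includeUrl(urlKey, originalUrl)) {
            return result; // per-URL exclusion: no CDX line is read at all
        }
        for (CDXLine line : index) {
            if (line.urlKey().equals(urlKey) && filter.includeCapture(line)) {
                result.add(line);
            }
        }
        return result;
    }
}
```

Note that an excluded URL and a URL with no captures both produce an empty list here, so the caller cannot tell the two cases apart from the return value alone.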
Another purpose of this method is to communicate the act of filtering. If this method returns false, CDXServer silently returns an empty result; there is no way to tell whether the URL has never been captured or has been excluded on a per-URL basis. To communicate the act of filtering, AccessCheckFilter (the primary implementation of CDXAccessFilter) throws RuntimeIOException wrapping an instance of AccessControlException, which carries more information on the type of filtering applied. Wayback defines four sub-classes of AccessControlException:
- AdministrativeAccessControlException - excluded by other (possibly manually set up) policy rules
- RobotAccessControlException - excluded by robots.txt rules
- RobotNotAvailableException - unused in CDXServer
- RobotTimeOutAccessControlException - unused in CDXServer
EmbeddedCDXServerIndex.doQuery catches RuntimeIOException and re-throws the inner AccessControlException. We should be able to make AuthChecker and CDXAccessFilter throw AccessControlException directly.
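To make the difference concrete, here is a small sketch contrasting the wrap-and-unwrap pattern with a filter that declares the exception itself. The exception classes are simplified stand-ins (the real ones carry more state), and includeUrlWrapped, query and includeUrlDirect are hypothetical names used only for this illustration.

```java
// Stand-ins for the Wayback exception types.
class AccessControlException extends Exception {
    AccessControlException(String message) {
        super(message);
    }
}

class RobotAccessControlException extends AccessControlException {
    RobotAccessControlException(String message) {
        super(message);
    }
}

// Stand-in for RuntimeIOException: an unchecked wrapper.
class RuntimeIOException extends RuntimeException {
    RuntimeIOException(Throwable cause) {
        super(cause);
    }
}

final class UnwrapSketch {
    // Hypothetical filter call that signals exclusion by wrapping.
    static boolean includeUrlWrapped(String urlKey) {
        if (urlKey.startsWith("com,blocked)")) {
            throw new RuntimeIOException(
                new RobotAccessControlException("excluded by robots.txt"));
        }
        return true;
    }

    // What a caller has to do with the wrapped form: catch and unwrap.
    static boolean query(String urlKey) throws AccessControlException {
        try {
            return includeUrlWrapped(urlKey);
        } catch (RuntimeIOException e) {
            if (e.getCause() instanceof AccessControlException) {
                throw (AccessControlException) e.getCause();
            }
            throw e;
        }
    }

    // The suggested alternative: the filter declares the exception itself.
    static boolean includeUrlDirect(String urlKey) throws AccessControlException {
        if (urlKey.startsWith("com,blocked)")) {
            throw new RobotAccessControlException("excluded by robots.txt");
        }
        return true;
    }
}
```

The direct form lets the compiler enforce that callers handle exclusion, and removes the instanceof check on the cause.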
The semantics of the first two exceptions are loose. AccessCheckFilter throws AdministrativeAccessControlException when whatever ExclusionFilter is given as its adminFilter parameter returns a value other than FILTER_INCLUDE. Similarly, it throws RobotAccessControlException whenever its robotsFilter returns a non-FILTER_INCLUDE value. As such, AccessCheckFilter is not extensible to allow for other types of exclusions. For this reason, recent changes (at IA) are moving away from this approach: the new AccessPointAuthChecker creates AccessCheckFilter with just one ExclusionFilter object, returned by AccessPoint.createExclusionFilter(), and the new CompositeExclusionFilterFactory allows for configuring multiple exclusion filters.
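The idea of folding several exclusion filters into one can be sketched like this. It is not the actual CompositeExclusionFilterFactory: Capture stands in for CaptureSearchResult, the ExclusionFilter interface and its constant values are assumptions made for this sketch, and CompositeExclusionFilter is a hypothetical class.

```java
import java.util.List;

// Stand-in for CaptureSearchResult; only the fields this sketch needs.
record Capture(String urlKey, String timestamp) {}

// Stand-in for ExclusionFilter; the constant values are assumptions.
interface ExclusionFilter {
    int FILTER_INCLUDE = 0;
    int FILTER_EXCLUDE = 1;

    int filterObject(Capture capture);
}

// Hypothetical composite: a capture is included only if every member
// filter includes it.
final class CompositeExclusionFilter implements ExclusionFilter {
    private final List<ExclusionFilter> filters;

    CompositeExclusionFilter(List<ExclusionFilter> filters) {
        this.filters = List.copyOf(filters);
    }

    @Override
    public int filterObject(Capture capture) {
        for (ExclusionFilter filter : filters) {
            int verdict = filter.filterObject(capture);
            if (verdict != FILTER_INCLUDE) {
                return verdict; // the first non-include verdict wins
            }
        }
        return FILTER_INCLUDE;
    }
}
```

Because the composite is itself an ExclusionFilter, AccessCheckFilter only needs to accept a single filter, which matches the direction described above.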
ExclusionFilter originates from the ResourceIndex/CaptureSearchResult framework. AccessCheckFilter uses it so that existing exclusion filters (most notably OracleExclusionFilter and StaticMapExclusionFilter) can be reused with CDXServer.
As its filterObject method needs a CaptureSearchResult as an argument, AccessCheckFilter creates a temporary wrapper CaptureSearchResult object for every CDXLine (CDXLine cannot implement CaptureSearchResult, as it is not an interface). This is very inefficient.
Older code did not have this issue because filtering was run in CDXToCaptureSearchResultsWriter, which already had CaptureSearchResult objects. This old method is still supported, but strongly discouraged, as it screws up the CDX query result if exclusion and collapsing are combined. CDXWriter should focus on converting CDXLine to the final output. The fact that other CDXWriter implementations, like PlainTextWriter, do not implement exclusion supports this argument.
I'd suggest re-implementing exclusion filters natively with CDX Server interfaces.
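As an illustration of what "native" could mean, here is a sketch of an exclusion filter written directly against the CDXAccessFilter methods, so that no wrapper object is created per line. CDXLine is a stand-in record, and PrefixExclusionAccessFilter with its prefix-matching rule is a hypothetical example, not a port of any existing Wayback filter.

```java
import java.util.Set;

// Stand-in for the real CDXLine; only the fields this sketch needs.
record CDXLine(String urlKey, String originalUrl, String timestamp) {}

// The two methods of the per-session filter interface.
interface CDXAccessFilter {
    boolean includeUrl(String urlKey, String originalUrl);

    boolean includeCapture(CDXLine line);
}

// Hypothetical native filter: excludes any URL key that starts with
// one of the configured prefixes.
final class PrefixExclusionAccessFilter implements CDXAccessFilter {
    private final Set<String> excludedPrefixes;

    PrefixExclusionAccessFilter(Set<String> excludedPrefixes) {
        this.excludedPrefixes = Set.copyOf(excludedPrefixes);
    }

    @Override
    public boolean includeUrl(String urlKey, String originalUrl) {
        for (String prefix : excludedPrefixes) {
            if (urlKey.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    @Override
    public boolean includeCapture(CDXLine line) {
        // Works on the CDXLine directly; no temporary
        // CaptureSearchResult-style wrapper is allocated.
        return includeUrl(line.urlKey(), line.originalUrl());
    }
}
```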
Adopt design pattern from JAAS:
- AuthToken as Subject
- AuthChecker as LoginModule (or LoginContext)
This implies:
- move permission attributes and test methods from AuthChecker to AuthToken
- add a new method to the AuthChecker interface that corresponds to LoginModule#initialize and LoginModule#login
- move code related to cookie-based authorization from CDXServer to PrivTokenAuthChecker (e.g. the constant CDX_AUTH_TOKEN, the cookieAuthToken field and its getter/setter, the extractAuthToken method, etc.)
- remove the createAccessFilter method from AuthChecker, and embed the essential portion of its functionality in CDXServer
Adopting full JAAS would make Wayback configuration way too complex without clear benefits. Unless other requirements arise, a simplified framework would suffice. Alignment with the JAAS design makes it easier to understand and extend.
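A minimal sketch of the simplified, JAAS-aligned split, under assumptions of my own: the Permission enum, the login method name and TokenListAuthChecker are all hypothetical, and the AuthToken and AuthChecker shown here are illustrative stand-ins for the proposed shape rather than the existing classes.

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical permission names, modeled on the attributes AuthToken
// carries today.
enum Permission { ALL_URL_ACCESS, ALL_CDX_FIELD_ACCESS, IGNORE_ROBOTS }

// AuthToken in the role of a JAAS Subject: it answers permission
// questions itself.
final class AuthToken {
    private final String name;
    private final EnumSet<Permission> permissions = EnumSet.noneOf(Permission.class);

    AuthToken(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }

    void grant(Permission permission) {
        permissions.add(permission);
    }

    boolean isAllowed(Permission permission) {
        return permissions.contains(permission);
    }
}

// AuthChecker in the role of a LoginModule: it only populates the subject.
interface AuthChecker {
    void login(AuthToken token); // hypothetical method name
}

// A token-list based checker in the spirit of PrivTokenAuthChecker.
final class TokenListAuthChecker implements AuthChecker {
    private final Map<String, Set<Permission>> grants;

    TokenListAuthChecker(Map<String, Set<Permission>> grants) {
        this.grants = Map.copyOf(grants);
    }

    @Override
    public void login(AuthToken token) {
        for (Permission permission : grants.getOrDefault(token.getName(), Set.of())) {
            token.grant(permission);
        }
    }
}
```

After login, code such as the CDX response writer would ask the token (for example token.isAllowed(Permission.ALL_CDX_FIELD_ACCESS)) instead of calling back into AuthChecker.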
Update: recent change at IA is a move in this direction: df182ec
- remove CDXToCaptureSearchResultsWriter#setExclusionFilter and #getExclusionFilter
- add more arguments to ContextExclusionFilterFactory#getExclusionFilter
AuthToken
See iipc/openwayback#290 for requirements on exclusion based on the client's IP address. While I (Kenji) suggested WaybackRequest as an additional argument, it should not be part of the CDX server API. Use of an extended AuthToken carrying the client's IP address would meet the requirements.
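One possible shape for such an extended token is sketched below. Everything here is hypothetical: the AuthToken stand-in is reduced to a single field, ClientAddressAuthToken is an invented name, and the string-prefix match is only a placeholder for whatever address-matching rule the exclusion policy would really use.

```java
// Stand-in for the existing AuthToken, reduced to its name.
class AuthToken {
    private final String authToken;

    AuthToken(String authToken) {
        this.authToken = authToken;
    }

    String getAuthToken() {
        return authToken;
    }
}

// Hypothetical subclass carrying the client's IP address, so that
// WaybackRequest does not have to enter the CDX server API.
final class ClientAddressAuthToken extends AuthToken {
    private final String clientAddress;

    ClientAddressAuthToken(String authToken, String clientAddress) {
        super(authToken);
        this.clientAddress = clientAddress;
    }

    String getClientAddress() {
        return clientAddress;
    }
}

final class IpExclusionSketch {
    // Placeholder rule: access is allowed only to clients whose address
    // starts with the given prefix. Real matching would likely use CIDR
    // ranges rather than string prefixes.
    static boolean isAllowed(AuthToken token, String allowedPrefix) {
        if (token instanceof ClientAddressAuthToken) {
            String address = ((ClientAddressAuthToken) token).getClientAddress();
            return address.startsWith(allowedPrefix);
        }
        return false; // no address known: treat as not allowed
    }
}
```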
Copyright © 2005-2022 [tonazol](http://netpreserve.org/). CC-BY. https://github.com/iipc/openwayback.wiki.git