GitHub - manwar/URL-Transform: perform URL transformations in various document types

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commits
lib/URL		lib/URL
t		t
.gitignore		.gitignore
Build.PL		Build.PL
Changes		Changes
MANIFEST		MANIFEST
MANIFEST.SKIP		MANIFEST.SKIP
README		README

Repository files navigation

NAME
    URL::Transform - perform URL transformations in various document types

SYNOPSIS
        my $output;
        my $urlt = URL::Transform->new(
            'document_type'      => 'text/html;charset=utf-8',
            'content_encoding'   => 'gzip',
            'output_function'    => sub { $output .= "@_" },
            'transform_function' => sub { return (join '|', @_) },
        );
        $urlt->parse_file($Bin.'/data/URL-Transform-01.html');

        print "and this is the output: ", $output;

DESCRIPTION
    URL::Transform is a generic module to perform an url transformation in a
    documents. Accepts callback function using which the url link can be
    changed.

    There are different modules to handle different document types, elements
    or attributes:

    `text/html', `text/vnd.wap.wml', `application/xhtml+xml',
    `application/vnd.wap.xhtml+xml'
        URL::Transform::using::HTML::Parser, URL::Transform::using::XML::SAX
        (incomplete was used only to benchmark)

    `text/css'
        URL::Transform::using::CSS::RegExp

    `text/html/meta-content'
        URL::Transform::using::HTML::Meta

    `application/x-javascript'
        URL::Transform::using::Remove

    By passing `parser' option to the `URL::Transform->new()' constructor
    you can set what library will be used to parse and execute the output
    and transform functions. Note that the elements inside for example
    `text/html' that are of a different type will be transformed via
    default_for($document_type) modules.

    `transform_function' is called with following arguments:

        transform_function->(
            'tag_name'       => 'img',
            'attribute_name' => 'src',
            'url'            => 'http://search.cpan.org/s/img/cpan_banner.png',
        );

    and must return (un)modified url as the return value.

    `output_function' is called with (already modified) document chunk for
    outputting.

PROPERTIES
        content_encoding
        document_type
        parser
        transform_function
        output_function

    parser
        For HTML/XML can be HTML::Parser, XML::SAX

    document_type
            text/html - default

    transform_function
        Function that will be called to make the transformation. The
        function will receive one argument - url text.

    output_function
        Reference to function that will receive resulting output. The
        default one is to use print.

    content_encoding
        Can be set to `gzip' or `deflate'. By default it is `undef', so
        there is no content encoding.

METHODS
  new
    Object constructor.

    Requires `transform_function' a CODE ref argument.

    The rest of the arguments are optional. Here is the list with defaults:

        document_type       => 'text/html;charset=utf-8',
        output_function     => sub { print @_ },
        parser              => 'HTML::Parser',
        content_encoding    => undef,

  default_for($document_type)
    Returns default parser for a supplied $document_type.

    Can be used also as a set function with additional argument - parser
    name.

    If called as object method set the default parser for the object. If
    called as module function set the default parser for a whole module.

  parse_string($string)
    Submit document as a string for parsing.

    This some function must be implemented by helper parsing classes.

  parse_chunk($chunk)
    Submit chunk of a document for parsing.

    This some function should be implemented by helper parsing classes.

  can_parse_chunks
    Return true/false if the parser can parse in chunks.

  parse_file($file_name)
    Submit file for parsing.

    This some function should be implemented by helper parsing classes.

  link_tags
        # To simplify things, reformat the %HTML::Tagset::linkElements
        # hash so that it is always a hash of hashes.

    # Construct a hash of tag names that may have links.

  js_attributes
    # Construct a hash of all possible JavaScript attribute names

  decode_string($string)
    Will return decoded string suitable for parsing. Decoding is chosen
    according to the $self->content_encoding.

    Decoding is run automatically for every chunk/string/file.

  encode_string($string)
    Will return encoded string. Encoding is chosen according to the
    $self->content_encoding.

    NOTE if you want to have your content encoded back to the
    $self->content_encoding you will have to run this method in your code.
    Argument to the `output_function()' are always plain text.

  get_supported_content_encodings()
    Returns hash reference of supported content encodings.

benchmarks
        Benchmark: timing 10000 iterations of HTML::Parser    , XML::LibXML::SAX, XML::SAX::PurePerl...
        HTML::Parser      :  3 wallclock secs ( 2.41 usr +  0.04 sys =  2.45 CPU) @ 4081.63/s (n=10000)
        XML::LibXML::SAX  : 29 wallclock secs (27.22 usr +  0.11 sys = 27.33 CPU) @ 365.90/s (n=10000)
        XML::SAX::PurePerl: 192 wallclock secs (180.62 usr +  0.50 sys = 181.12 CPU) @ 55.21/s (n=10000)

TODO
    There are urls in `pics' meta tag: `<meta http-equiv="pics-label"
    content=" ...'. See http://www.w3.org/PICS/.

SEE ALSO
    HTML::Parser, URL::Transform::using::HTML::Parser

AUTHOR
    Jozef Kutej `<jkutej at cpan.org>'

LICENSE AND COPYRIGHT
    This program is free software; you can redistribute it and/or modify it
    under the terms of either: the GNU General Public License as published
    by the Free Software Foundation; or the Artistic License.

    See http://dev.perl.org/licenses/ for more information.