NAME
URL::Transform - perform URL transformations in various document types
SYNOPSIS
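use FindBin qw($Bin);    # provides $Bin, the directory of the running script
use URL::Transform;

my $output;
my $urlt = URL::Transform->new(
    'document_type'      => 'text/html;charset=utf-8',
    'content_encoding'   => 'gzip',
    # collect every transformed chunk
    'output_function'    => sub { $output .= "@_" },
    # join the callback arguments with '|' to make them visible in the output
    'transform_function' => sub { return (join '|', @_) },
);
$urlt->parse_file($Bin.'/data/URL-Transform-01.html');
print "and this is the output: ", $output;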
DESCRIPTION
URL::Transform is a generic module for performing url transformations in
documents. It accepts a callback function through which each url can be
changed.
There are different modules to handle different document types, elements
or attributes:
`text/html', `text/vnd.wap.wml', `application/xhtml+xml',
`application/vnd.wap.xhtml+xml'
    URL::Transform::using::HTML::Parser, URL::Transform::using::XML::SAX
    (incomplete; it was used only for benchmarking)
`text/css'
    URL::Transform::using::CSS::RegExp
`text/html/meta-content'
    URL::Transform::using::HTML::Meta
`application/x-javascript'
    URL::Transform::using::Remove
By passing the `parser' option to the `URL::Transform->new()' constructor
you can choose which library is used to parse the document and to call
the output and transform functions. Note that elements of a different
type embedded in, for example, a `text/html' document are transformed by
the default_for($document_type) modules.
`transform_function' is called with the following arguments:
    transform_function->(
        'tag_name'       => 'img',
        'attribute_name' => 'src',
        'url'            => 'http://search.cpan.org/s/img/cpan_banner.png',
    );
and must return the (un)modified url as its return value.
`output_function' is called with each (already modified) document chunk
for outputting.
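A minimal sketch of such a callback, assuming a made-up proxy prefix.
`rewrite_url' and `http://proxy.example/' are hypothetical names used only
for illustration and are not part of URL::Transform. It rewrites the urls
of `img' tags and returns all other urls unchanged:

```perl
use strict;
use warnings;

# Assumption: an invented prefix, only here to make the rewrite visible.
my $proxy_prefix = 'http://proxy.example/fetch?u=';

# Hypothetical callback taking the named arguments described above.
sub rewrite_url {
    my %args = @_;    # tag_name, attribute_name, url

    # Leave everything except <img> urls untouched.
    return $args{url}
        unless defined $args{tag_name} && $args{tag_name} eq 'img';

    return $proxy_prefix . $args{url};
}

# Call it the same way the module is documented to call it.
my $new_url = rewrite_url(
    'tag_name'       => 'img',
    'attribute_name' => 'src',
    'url'            => 'http://search.cpan.org/s/img/cpan_banner.png',
);
print "$new_url\n";
```

A sub like this could be passed to the constructor as
`transform_function' => \&rewrite_url.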
PROPERTIES
content_encoding
document_type
parser
transform_function
output_function
parser
    For HTML/XML this can be HTML::Parser or XML::SAX.
document_type
    text/html - default
transform_function
    Function that will be called to make the transformation. The
    function receives the named arguments `tag_name', `attribute_name'
    and `url' and must return the (un)modified url.
output_function
    Reference to a function that will receive the resulting output. The
    default is to use print.
content_encoding
    Can be set to `gzip' or `deflate'. By default it is `undef', so
    there is no content encoding.
METHODS
new
Object constructor.
Requires the `transform_function' argument, a CODE ref.
The rest of the arguments are optional. Here is the list with defaults:
    document_type    => 'text/html;charset=utf-8',
    output_function  => sub { print @_ },
    parser           => 'HTML::Parser',
    content_encoding => undef,
default_for($document_type)
Returns default parser for a supplied $document_type.
Can also be used as a setter by passing an additional argument, the
parser name.
If called as an object method, it sets the default parser for that
object. If called as a module function, it sets the default parser for
the whole module.
parse_string($string)
Submit document as a string for parsing.
This function must be implemented by helper parsing classes.
parse_chunk($chunk)
Submit chunk of a document for parsing.
This function should be implemented by helper parsing classes.
can_parse_chunks
Returns true if the parser can parse in chunks, false otherwise.
parse_file($file_name)
Submit file for parsing.
This function should be implemented by helper parsing classes.
link_tags
# To simplify things, reformat the %HTML::Tagset::linkElements
# hash so that it is always a hash of hashes.
# Construct a hash of tag names that may have links.
js_attributes
# Construct a hash of all possible JavaScript attribute names
decode_string($string)
Will return decoded string suitable for parsing. Decoding is chosen
according to the $self->content_encoding.
Decoding is run automatically for every chunk/string/file.
encode_string($string)
Will return encoded string. Encoding is chosen according to the
$self->content_encoding.
NOTE: if you want your content encoded back to the
$self->content_encoding, you have to call this method in your own code.
Arguments passed to `output_function()' are always plain text.
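As a rough sketch of that last step: the chunks collected by an
`output_function' are plain text, so they have to be compressed again
before being sent with a gzip content encoding. The example below does not
load URL::Transform and uses the core module IO::Compress::Gzip as a
stand-in; with a real object the documented call would be
$urlt->encode_string($output). The sample chunks are made up.

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Plain-text chunks, collected the way an output_function would receive them.
my $output  = '';
my $collect = sub { $output .= join '', @_ };
$collect->('<html><body>');
$collect->('<img src="http://example.com/a.png">');
$collect->('</body></html>');

# Re-encode the complete document once all chunks have arrived.
my $gzipped;
gzip(\$output => \$gzipped) or die "gzip failed: $GzipError";

# Decode it again to confirm the round trip preserves the document.
my $plain;
gunzip(\$gzipped => \$plain) or die "gunzip failed: $GunzipError";
print "round trip ok\n" if $plain eq $output;
```

Encoding the whole document at the end keeps the sketch simple; streaming
compression of individual chunks is a separate design choice that is not
shown here.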
get_supported_content_encodings()
Returns hash reference of supported content encodings.
benchmarks
Benchmark: timing 10000 iterations of HTML::Parser, XML::LibXML::SAX, XML::SAX::PurePerl...
HTML::Parser      :   3 wallclock secs (  2.41 usr + 0.04 sys =   2.45 CPU) @ 4081.63/s (n=10000)
XML::LibXML::SAX  :  29 wallclock secs ( 27.22 usr + 0.11 sys =  27.33 CPU) @  365.90/s (n=10000)
XML::SAX::PurePerl: 192 wallclock secs (180.62 usr + 0.50 sys = 181.12 CPU) @   55.21/s (n=10000)
TODO
There are urls in `pics' meta tag: `<meta http-equiv="pics-label"
content=" ...'. See http://www.w3.org/PICS/.
SEE ALSO
HTML::Parser, URL::Transform::using::HTML::Parser
AUTHOR
Jozef Kutej `<jkutej at cpan.org>'
LICENSE AND COPYRIGHT
This program is free software; you can redistribute it and/or modify it
under the terms of either: the GNU General Public License as published
by the Free Software Foundation; or the Artistic License.
See http://dev.perl.org/licenses/ for more information.