<?xml version="1.0"?>
<!-- $Id: /xmltwig/trunk/faq.xml 33 2008-04-30T08:03:41.004487Z mrodrigu $ -->
<title>XML::Twig FAQ</title>
<author>Michel Rodriguez</author>
<p>FAQ created by Michel Rodriguez</p>
<p>Thanks to the numerous users of XML::Twig for their questions and suggestions, and to Walter Pienciak for letting me
mirror this FAQ on the IEEE website</p>
<overview><p>This FAQ contains information on XML::Twig, a perl module used to process XML documents.
Please direct all corrections and additions to <a href=""></a>. </p>
<p>This FAQ can be found on the Web at <a href=""></a>.</p>
<p><a name="mlist"></a>Information in this FAQ is based mainly on questions posted to the Perl XML email list. To join, send an email to <a href=""></a> with the message:
<b>SUBSCRIBE Perl-XML</b>.</p>
<p>This FAQ was generated using a Perl script (using XML::Twig ;--) and an XML file. The script is at <a href=""></a>. The XML source is at
<a href=""></a>. To generate the XML::Twig FAQ, run <B>twig_faq faq.xml</B> which prints the HTML to STDOUT.
<q id="1">
<question>I know what a twig is but what is that XML thing anyway?</question>
<answer>OK, time for a quick list of XML links:
<ul><li><a href="">The W3C XML page</a></li>
<li><a href="">The XML Cover Pages</a></li>
<li><a href="">The Perl XML FAQ</a></li>
<li><a href="">Kip Hampton's Perl and XML column</a></li>
<q id="2">
<question>Where can I get the latest version of XML::Twig?</question>
<answer>The latest stable version:
<ul><li><a href="">CPAN</a></li>
<li><a href="">The Twig Homepage</a></li>
<li><a href="">The Twig Homepage (mirror hosted by the IEEE)</a></li>
The latest development version:
<ul><li><a href="">The Twig Homepage</a></li>
<li><a href="">The Twig Homepage (mirror hosted by the IEEE)</a></li>
<q id="3">
<question>Where is the documentation?</question>
<answer><p>Development version:
<a href="">html</a> /
<a href="">text</a></p>
<p>Stable version:
<a href="">html</a> /
<a href="">text</a>
<p>You can also type <tt>perldoc XML::Twig</tt> once you have installed the module
or look at the <a href="">XML::Twig Quick Reference</a>,
or go to <a href=""></a> for more information, including a
<a href="">tutorial</a>.</p>
<q id="9">
<question>How is XML::Twig supported?</question>
<answer><p>Twig is supported through email <a href=""></a>
and through the <a href="#mlist">Perl-XML mailing list</a>.</p>
<p>You are encouraged to report bugs using RT at <a href=""></a>.</p>
<p>Please send the following configuration information when you describe a bug:</p>
<li>version of perl (<tt>perl -v</tt>),</li>
<li>version of <tt>expat</tt> (see below),</li>
<li>version of XML::Parser (<tt>perl -MXML::Parser -le'print $XML::Parser::VERSION'</tt>),</li>
<li>version of XML::Twig (<tt>perl -MXML::Twig -le'print $XML::Twig::VERSION'</tt>).</li>
<p>Finding the version of <tt>expat</tt> that you are running can be a bit tricky, but it is
important information. Here is how you can get it:</p>
<p>First, if you are using a version of XML::Parser lower than 2.30, then you don't need to mention
<tt>expat</tt>'s version: XML::Parser comes with
its own version of <tt>expat</tt> (it is old though, so you might want to upgrade: first grab
<tt><a href="">expat</a></tt> and install it, then install
a recent version of XML::Parser).</p>
<p>If you are using XML::Parser 2.30 or above, run <tt>xmlwf -v</tt>. If you are lucky this will
give you the version of expat. If <tt>xmlwf</tt> exists but
does not like the <tt>-v</tt> option, then you are most likely running expat 1.95.2. If
<tt>xmlwf</tt> is not installed on your system (which can be the case if you did not install
<tt>expat</tt> yourself but use the one provided with your OS) then (on *nix) you can look for in your library path (using for example <tt>slocate</tt>). is expat 1.95.2, is expat 1.95.4 (in which case you should
upgrade, expat 1.95.4 is not compatible with XML::Twig, is expat 1.95.5 or
<p>This information will help me a lot in figuring out what causes the problem.</p>
<q id="4">
<question>What is XML::Twig used for anyway?</question>
<answer><p>I use XML::Twig for all sorts of XML processing: I use it to extract data from XML documents, to update documents from one DTD to another, to convert them to HTML and to extract/store/process data to and from various databases.</p></answer>
<q id="5">
<question>Why should I use XML::Twig?</question>
<answer><p>The main purpose of XML::Twig is to allow you to process XML documents that might be too big to fit in memory (with XML::DOM for example). If you are in that case but don't really like stream-oriented processing, then XML::Twig allows you to use a mixed stream/tree model, where you can process sub-documents as trees and then flush them to free the memory.</p>
<p>In addition it is designed to be easy to use, masking some of the most annoying quirks of XML and XML::Parser, such as whitespace management and encodings (see below).</p>
<p>The main drawback of XML::Twig is that it is not XML::DOM! It does not have a standard interface (feel free to add one ;--) nor does it interface with XML::SAX, although as of version 3.05 it does export SAX streams.</p>
<p>Using the twig_roots option also lets you process (using the tree interface) only the parts of the documents you are interested in, something that can speed up your scripts tremendously.</p></answer>
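<p>A minimal sketch of the twig_roots approach described above. The sample document, the <tt>info</tt> and <tt>skip</tt> element names and the <tt>@found</tt> array are made up for the illustration; only <tt>info</tt> elements are built as trees, and each one is purged once it has been handled:</p>

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# hypothetical sample document: only the info elements are of interest
my $xml = '<doc><skip>ignored</skip><info>first</info>'
        . '<skip>ignored</skip><info>second</info></doc>';

my @found;
my $twig = XML::Twig->new(
    twig_roots => { info => sub { my( $t, $info)= @_;
                                   push @found, $info->text;
                                   $t->purge;   # free what has already been processed
                                 },
                  },
);
$twig->parse( $xml);

print "$_\n" for @found;
```

<p>The handler receives the twig and the element, exactly like a regular twig handler; the elements outside the roots never make it into the tree.</p>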
<q id="23">
What are the alternatives to XML::Twig?
<p>The <a href="">Perl-XML FAQ</a> lists
quite a few other modules that can be used to process XML.</p>
<p>When deciding which module to choose for any slightly complex processing
of XML, I would advise you to also have a look at
<a href="">XML::LibXML</a>. Here is a
quick comparison of the 2 modules.</p>
<p>XML::LibXML, actually <a href="">libxml2</a>
on which it is based, sticks to the standards,
and implements a good number of them in a rather strict way: XML, XPath, DOM,
RelaxNG, I must be forgetting a couple (XInclude?). It is fast and rather
frugal memory-wise.</p>
<p>XML::Twig is older: when I started writing it XML::Parser/expat was the only
game in town. It implements XML and that's about it (plus a subset of XPath,
and you can use XML::Twig::XPath if you have XML::XPath installed for full
support). It is slower and requires more memory for a full tree than
XML::LibXML. On the plus side (yes, there is a plus side!) it lets you process
a big document in chunks, and thus lets you tackle documents that couldn't be
loaded in memory by XML::LibXML, and it offers a lot (and I mean a LOT!) of
higher-level methods, for everything, from adding structure to "low-level" XML,
to shortcuts for XHTML conversions and more. It also DWIMs quite a bit, getting
comments and non-significant whitespaces out of the way but preserving them in
the output for example. As it does not stick to the DOM, it also usually leads
to shorter code than in XML::LibXML.</p>
<p>Beyond the pure features of the 2 modules, XML::LibXML seems to be preferred by
"XML-purists", while XML::Twig seems to be more used by Perl Hackers who have
to deal with XML. As you have noted, XML::Twig also comes with quite a lot of
docs, but I am sure if you ask for help about XML::LibXML here or on Perlmonks
you will get answers.</p>
<p>Note that it is actually quite hard for me to compare the 2 modules: on one hand
I know XML::Twig inside-out and I can get it to do pretty much anything I need
to (or I improve it ;--), while I have a very basic knowledge of XML::LibXML.
So feature-wise, I'd rather use XML::Twig ;--). On the other hand, I am
painfully aware of some of the deficiencies, potential bugs and plain ugly code
that lurk in XML::Twig, even though you are unlikely to be affected by them
(unless for example you need to change the DTD of a document programmatically),
while I haven't looked much into XML::LibXML so it still looks shiny and clean
to me.</p>
<p>That said, if you need to process a document that is too big to fit in memory
and XML::Twig is too slow for you, my reluctant advice would be to use "bare"
XML::Parser. It won't be as easy to use as XML::Twig: basically with XML::Twig
you trade some speed (depending on what you do from a factor 3 to... none)
for ease-of-use, but it will be easier IMHO than using SAX (albeit not
standard), and at this point a LOT faster (see the last test in
<a href="">simple benchmark</a>).</p>
<q id="6">
<question>My XML documents/data are produced by tools that do not grok Unicode, will XML::Twig help me there?</question>
<answer><p>Yes. If you use the KeepEncoding option when you create a twig, all PCDATA (character data) will be returned as-is. Don't forget to give the encoding, either in the XML declaration or when you create the twig, or the parser will die on you. You can also process your document as UTF-8 internally and use the <tt>output_encoding</tt> option (XML::Twig version 3.05 and above) to convert the output to your favourite encoding.</p></answer>
<q id="7">
<question>What's that whitespace management thing?</question>
<answer><p>XML parsers are required by the standard to pass ALL data outside the markup to the calling application. Most of the time this is not desirable. By default XML::Twig discards those pesky \n (in fact XML::Twig discards all element contents that contain only whitespace). This can be changed at twig level.</p></answer>
<q id="8">
<question>What's the expansion factor from an XML document to a twig?</question>
<answer><p>If you load the entire document in a twig the expansion factor is about 13 (the 900K file used for the benchmark takes about 11M). Of course if you flush the document as you're parsing then it will be <b>much</b> less!</p></answer>
<q id="10">
<question>I have that huge XML document, but I only want to extract information from a couple of elements, can XML-Twig help me there?</question>
<answer><p>Oddly enough yes! Create the twig using the <tt>twig_roots</tt> option and the tree will be built only for those elements. <br/>Example:<code>
my $twig= XML::Twig->new( twig_roots =&gt; { info =&gt; \&amp;process_info });</code></p></answer>
<q id="11">
<question>I process lots of XML documents in batch and there seems to
be a memory leak in XML::Twig, any fix for that?</question>
<answer><p>Yes, since version 3.00, XML::Twig has a <tt>dispose</tt> method that completely releases a twig.
With earlier versions you can release it yourself by doing:
undef $t->{twig};
undef $t->{twig_root}->{twig};
undef $t->{twig_parser};
<p>The easiest method though, if you are using perl 5.6.0 and above, is to install the
<a href="">WeakRef</a> module, which fixes the memory leak</p>
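<p>A minimal sketch of a batch loop that calls <tt>dispose</tt> on each twig once it is done with it. The in-memory documents and the <tt>title</tt> element are made up for the illustration; a real batch would more likely use <tt>parsefile</tt> on a list of files:</p>

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# hypothetical batch of documents, kept in memory to keep the example self-contained
my @docs = ( '<doc><title>one</title></doc>',
             '<doc><title>two</title></doc>',
           );

my @titles;
foreach my $xml (@docs)
  { my $t= XML::Twig->new;
    $t->parse( $xml);
    push @titles, $t->root->first_child_text( 'title');
    $t->dispose;   # release the twig before moving on to the next document
  }

print join( ', ', @titles), "\n";
```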
<q id="12"><question>How can I install XML::Twig on Windows?</question>
<answer><p>XML::Twig might be available as a ppm either from <a href="">Activestate</a>
or from another repository (see <a href="">Using PPM to install modules</a> for more information about ppm and for a list of repositories).</p>
<p>If it is not available, or if you want to use the development version, you can just uncompress the distribution file (<tt>XML::Twig-x.xx.tar.gz</tt>) and copy the <tt></tt> into the <tt>C:\Perl\site\lib\xml</tt> directory, alongside <tt></tt>. Of course if you use <a href="">Cygwin</a> you can install the module with the usual <tt>perl Makefile.PL; make; make test; make install</tt> incantation. You might need to download <a href="">nmake</a>.</p>
<p>Alternatively <a href="">KobeSearch</a> lists PPMs for the module</p> </answer>
<q id="17">
<p>I am having a problem installing MythTV on RedHat 9.0:</p>
<p>When I attempt to do an install XML::Twig in CPAN It goes through its
install, but then states: <tt>Weak references are not implemented</tt></p>
<answer>You need to upgrade the <tt>Scalar::Util</tt> module, from CPAN. Then re-run the install
from scratch (doing the <tt>perl Makefile.PL; make; make test; make install</tt> dance, or
cleaning up the CPAN/CPANPLUS cache, I suspect you have to exit the shell and launch it again
for this to work).</answer>
<q id="16">
<question><p>I seem to be having a spot of trouble getting XML::Twig 3.08 to compile
and install on a SuSE 8.1/RedHat 8.0 system.</p>
<p>Here is the result of <tt>make test</tt>:</p>
<code><![CDATA[make test
undefined entity at line 4, column 13, byte 77:
<!DOCTYPE doc SYSTEM "t/dummy.dtd">
<elt1>toto &ent1;</elt1>
<elt2>tata &ent2;</elt2>
<elt3>tutu &ent3;</elt3>
at /usr/lib/perl5/site_perl/5.8.0/i586-linux-thread-multi/XML/ line 185
Test returned status 255 (wstat 65280, 0xff00)
DIED. FAILED tests 1-6
Failed 6/6 tests, 0.00% okay
Test returned status 255 (wstat 65280, 0xff00)
DIED. FAILED tests 1-3
Failed 3/3 tests, 0.00% okay
t/test_twig_roots.........ok t/test_xpath_cond.........ok
Failed Test Stat Wstat Total Fail Failed List of Failed
t/test_entities.t 255 65280 6 6 100.00% 1-6
t/test_spaces.t 255 65280 3 3 100.00% 1-3
Failed 2/18 test scripts, 88.89% okay. 9/400 subtests failed, 97.75% okay.
make: *** [test_dynamic] Error 29]]></code>
<answer><p>The problem is an incompatibility between XML::Twig and the
version of the libexpat library that comes with RH 8.0 / Suse 8.1. (1.95.4)
If you upgrade to XML::Twig 3.08 or later and to the latest version of libexpat you should not
get the problem anymore.</p>
<p>You can get the latest version of libexpat on sourceforge: <a href=""></a></p></answer>
<q id="21">
<question><p>Setting $SIG{__DIE__} breaks parse()</p>
<p>The problem can be narrowed down to:</p>
<code><![CDATA[#!/usr/bin/perl -w
use strict;
use XML::Twig;
local $SIG{__DIE__} = sub {
  my $msg = shift;
  print STDERR "dying! $msg\n"; exit 1;
};
new XML::Twig()->parse('<a />');]]></code>
<answer>This is a bug in XML::Parser. Upgrading to XML::Parser 2.34 or above solves the problem.
See the <a href="">bug report on RT</a>.
<q id="20">
<question>It looks like I can only print a twig (or an element) to STDOUT, how do I
redirect the output to a file?</question>
<answer><p>You can pass a filehandle to <tt>print</tt>:</p>
<pre><tt> open( FH, ">output.xml") or die "cannot open output.xml: $!";
$twig->print( \*FH);</tt></pre>
<q id="13"><question>For logging purposes I would like XML::Twig to report line/column number in the
original file</question>
<answer><p>Use <tt>start_tag_handlers</tt> to grab the line and column number through the parser object and
store them in private attributes (attributes whose name starts with a # are not output by XML::Twig):</p>
<code>#!/usr/bin/perl -w
use strict;
use XML::Twig;
my $t=XML::Twig->new( start_tag_handlers =>
                        { # called when the start tag for elt is parsed
                          # use '#ELT' or _all_ to call the handler for all elements
                          elt => sub { my( $t, $elt)= @_;
                                       $elt->set_att( '#line' => $t->current_line);
                                     },
                        },
                      twig_handlers =>
                        { # called when elt is completely parsed
                          elt => sub { my( $t, $elt)= @_;
                                       print "error in elt starting line ",
                                             $elt->att( '#line'), "\n"
                                         if( $elt->has_child( 'subelt[@error]'));
                                     },
                        },
                    );
$t->parsefile( "test_track_line_number.xml");</code>
<p>will parse <tt>test_track_line_number.xml</tt> that looks like:</p>
<doc>
  <elt>
    <subelt>text 1</subelt>
    <subelt>text 2</subelt>
    <subelt>text 3</subelt>
  </elt>
  <elt>
    <subelt>text 1</subelt>
    <subelt error="yes">text 2</subelt>
    <subelt>text 3</subelt>
  </elt>
</doc>
<p>and will output: <tt>error in elt starting line 7</tt></p></answer>
<q id="14"><question>How do I include bits of (possibly not well-formed) HTML in an XML document and
use them to generate HTML?</question>
<answer><p>You can wrap the HTML in a CDATA section, which will prevent the parser from
looking into the data. Then use a twig_handler on CDATA to process those sections.
Use the <tt>set_asis</tt> method to get those sections to be output without
being "XML escaped" (XML::Twig 3.05 and above)</p>
#!/usr/bin/perl -w
use strict;
use XML::Twig;
my $t= XML::Twig->new( twig_handlers => { '#CDATA' => sub { $_->set_asis; } });
$t->parse( \*DATA);
<!-- embedded HTML, note the un-closed <br> tag -->
<p>will output (comment stripped for conciseness):</p>
<p>Note that the CDATA section will not protect you from encoding problems, so if the included text is likely to
be in a different encoding than the main document you will have to do some encoding conversion before including it.</p></answer>
<q id="15">
<question><p>In which order are handlers called?</p>
<p>I have this simple Perl script that parses an XML document. The XML document uses the following DTD:</p>
<markup><![CDATA[<!ELEMENT doc (title, elt+)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT elt (#PCDATA|subelt)*>
<!ELEMENT subelt (#PCDATA)>]]></markup>
<p>I've noticed the following: although the element 'doc' is the root,
XML::Twig calls its handler last. All the elements 'title' and 'elt'
are processed in correct sequence. Why? The element 'doc' handler should be
called the first and not the last.</p>
<p>Is the element's handler called on the opening tag OR on the closing tag?</p>
<answer><p>Element handlers are called on the closing tag, as it is the only time
when the entire element has been parsed. The handler is
called as soon as the element has been completely parsed, which is when
its end tag has been parsed.</p>
<p>This indeed leads to handlers for the inner elements to be called before
the ones to the outer elements: here the handler on 'doc' will be
called after the handlers on 'title' and 'elt'.</p>
<p>This example will show you in which order the handlers are called:</p>
<code><![CDATA[#!/usr/bin/perl -w -l
use strict;
use XML::Twig;
# the handler is called for every element, once its end tag has been parsed
my $t= XML::Twig->new( twig_handlers => { '_all_' => sub { print "handler for ", $_->att( 'id'); } },
                       error_context => 1,
                     );
$t->parse( \*DATA);
__DATA__
<doc id="doc">
  <title id="title">title</title>
  <elt id="elt_1">
    <subelt id="subelt_1">subelt</subelt>
    <subelt id="subelt_2">subelt</subelt>
  </elt>
  <elt id="elt_2">element 2</elt>
</doc>]]></code>
<q id="17">
<question>Any neat trick to increase the performance of XML::Twig?</question>
<answer><p>Tom Anderson from tomacorp released an interesting article:
<a href="">Performance Comparison
Between SAX XML::Filter::Dispatcher and XML::Twig</a>. He notes:</p>
<blockquote><i>I learned an interesting performance optimization when writing
the anonymous subs for XML::Twig. These subs should not uselessly return
a long string. Processing this string can increase processing time by 50%
in this example. This is why the start_tag_handlers return the value 1</i></blockquote>
<p>Using this trick led to a 4x speedup on my first attempt at speeding up Tom's example!</p>
<p>Thanks Tom!</p>
<q id="19">
<question>I need to process XML documents. The problem is that there are several of them, so the
parser dies after the first one, with a message telling me that there is junk after the
end of the document. Is there any way I could trick the parser into believing they are
all part of a single document?</question>
<answer><p>You can open the input file as a pipe, first <tt>echo</tt>-ing an open tag, then getting
the input from wherever you get it, then <tt>echo</tt>-ing a close tag:</p>
<code><![CDATA[#!/usr/bin/perl -w
use strict;
use XML::Twig;
# here we have a very simple generator, but it could be any process that
# generates a stream of XML documents
my $xml_generator= q{echo '<doc>doc1</doc><doc>doc2</doc>'};
my $wrap= 'docs';
# this is where it all happens:
# the pipe at the end of the "file name" means that the name is a
# shell command, that will be executed then piped to the filehandle
open( IN, qq{echo '<$wrap>'; $xml_generator; echo '</$wrap>' |})
or die "error opening xml_generator: $!";
my $i=1;
my $t= XML::Twig->new( twig_handlers => {
           doc => sub { print "document $i: ", $_->sprint, "\n";
                        $i++;          # count the documents
                        $_[0]->purge;  # to get the memory back
                      },
         });
$t->parse( \*IN);
close IN or die "error during the execution of xml_generator: $!";]]></code>
<q id="20">
<question>How to stop processing the document when a certain condition is met?</question>
<p>There are 2 ways to do this:</p>
<ul><li>use <a href=""><tt>$twig->finish</tt></a>,
which will still parse (quickly) the file but without doing any processing on it.</li>
<li>wrap the <tt>$twig->parse</tt> in an eval, and <tt>die</tt> when you find the element you are interested in:
<code><![CDATA[#!/usr/bin/perl -w
use strict;
use XML::Twig;
# the handler dies on the first e element, the eval catches the exception
my $t= XML::Twig->new( twig_handlers =>{ e => sub { print $_->id, "\n"; die 0; }, });
eval { $t->parse( q{<doc>toto<e id="tata"/>tata<e id="titi"/></doc>});};
print "done\n";]]></code>
<p><b>update</b>: there is now a third method: <a href=""><tt>$twig->finish_now</tt></a>.
This method is, as you might have guessed, a little more imperative than <tt>finish</tt>: while <tt>finish</tt> still finishes parsing the XML, and
dies if it isn't well-formed, <tt>finish_now</tt> just aborts the parsing and returns right away.</p>
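<p>A minimal sketch of the <tt>finish_now</tt> variant, reusing the sample document from the <tt>eval</tt>/<tt>die</tt> example. The <tt>@ids</tt> array is only there to record which elements the handler actually saw:</p>

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

my @ids;
my $t= XML::Twig->new( twig_handlers =>
         { e => sub { my( $twig, $e)= @_;
                      push @ids, $e->id;
                      $twig->finish_now;   # abort the parse, no eval needed
                    },
         },
       );
$t->parse( q{<doc>toto<e id="tata"/>tata<e id="titi"/></doc>});

print "stopped after: @ids\n";
```

<p>Execution resumes right after the call to <tt>parse</tt>, so the second <tt>e</tt> element is never handled.</p>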
<q id="21">
<question><p>When I re-use a twig to parse another document within a handler, I get a mysterious
<tt>calling depth after parsing is finished...</tt> error. What does it mean?</p>
<p>My code:</p>
<code><![CDATA[ my $t=XML::Twig->new( twig_handlers => { include => \&include })
                ->parsefile( "main_file.xml");
 sub include
   { my( $t, $include)= @_;
     $t->parsefile( $include->att( 'src'));   # re-uses the twig that is still parsing
   }]]></code>
<answer><p>Indeed you cannot re-use the twig object to parse another document. Contrary to most other modules (XML::Parser, XML::LibXML...), the twig is both the parser _and_ the parsed document. You can re-use the object if you parse several documents sequentially, but you cannot re-use it within a parse. So in your case you have to create a new XML::Twig object.</p>
<p>The reason for this is simple: incompetence. Mine. I wasn't very familiar with OO when I started writing the module, back in 1998, and I completely missed the object factory construct. Sorry.</p>
<p>Note that in version 3.22 and up the error message is hopefully more explicit:
<tt>cannot reuse a twig that is already parsing</tt>.</p>
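<p>A minimal sketch of the fix, creating a new twig inside the handler. To keep it self-contained the included documents live in a hypothetical <tt>%source</tt> hash and are parsed with <tt>parse</tt>; with real files the handler would call <tt>parsefile</tt> on the new twig instead. Copying the text into the <tt>include</tt> element is just one possible way to use the result:</p>

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# hypothetical stand-in for the files referenced by the src attributes
my %source= ( 'part.xml' => '<part>included text</part>');

my $main= XML::Twig->new( twig_handlers => { include => \&include });
$main->parse( '<doc><include src="part.xml"/></doc>');
my $result= $main->root->sprint;
print "$result\n";

sub include
  { my( $t, $include)= @_;
    my $sub= XML::Twig->new;                     # a fresh twig, NOT $t
    $sub->parse( $source{ $include->att( 'src')});
    $include->set_text( $sub->root->text);       # copy the included text over
  }
```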
<q id="22">
<question>I want to output the XML with the same format (indentation and line returns) as the
input file. I have tried <tt>pretty_print</tt> but I cannot get what I want.</question>
<p>You can get the same formatting as in the original file by using the <tt>keep_spaces =&gt; 1</tt> option when you create the twig. Note that this will create <tt>#PCDATA</tt> (text) elements that contain the whitespaces in your tree.</p>
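<p>A minimal sketch comparing the default behaviour with <tt>keep_spaces =&gt; 1</tt>. The sample document and the variable names are made up for the illustration:</p>

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# hypothetical input with indentation and line returns
my $xml= "<doc>\n  <elt>text</elt>\n</doc>";

# default: whitespace-only content between elements is discarded
my $default= XML::Twig->new;
$default->parse( $xml);
my $compact= $default->root->sprint;

# keep_spaces => 1: the whitespace is kept in the tree as #PCDATA elements
my $keep= XML::Twig->new( keep_spaces => 1);
$keep->parse( $xml);
my $verbatim= $keep->root->sprint;

print "default:\n$compact\n";
print "keep_spaces:\n$verbatim\n";
```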
<q id="23">
<question>What does the error message <tt>*** glibc detected *** double free or corruption (!prev):</tt> mean,
and how do I get rid of it?</question>
<answer><p>You are using the UTF8 perlIO layer on your input stream, usually because the environment
variable <tt>PERL_UNICODE</tt> or the <tt>-C</tt> option include <tt>D</tt>. This causes
problems when reading from a pipe, due to a flaw in IO::Handle, used in XML::Parser in this case.</p>
<p>The workaround is to remove the <tt>D</tt> option, by setting <tt>PERL_UNICODE</tt> or using <tt>-C</tt>
with a value that does not include <tt>D</tt>.</p>
<p>More info at <a href=""></a>.</p>
<q id="24">
<question>I want to pass additional arguments to XML::Twig handlers, not just the twig and the element, and I'd rather not use
global variables. Can I do this?</question>
<answer><p>Sure, use a closure:</p>
my @additional_args= more_args();
my $t=XML::Twig->new( twig_handlers => { foo => sub { bar( @_, @additional_args) } });
sub bar
  { my( $t, $foo, @more_args)= @_;
    # process $foo here, using @more_args
  }
<p>A good explanation of what closures are can be found in <a href="">Achieving
<copyright>Copyright (c)2000-2008 Michel Rodriguez. All rights reserved. Permission is hereby granted to freely distribute this document provided that all credits and copyright notices are retained.</copyright>