Extract data from XML sources.
The following data will be used for all the basic and advanced examples.
<?xml version="1.0" encoding="UTF-8"?>
<Persons>
<Person>
<Name>Anna</Name>
<Surname>Adams</Surname>
<Email>anna.adams@example.com</Email>
<Addresses>
<Address Type="Home">
<Name>Rocky Row</Name>
<Postcode>6181</Postcode>
</Address>
<Address Type="Work">
<Name>Round Valley</Name>
<Postcode>6781</Postcode>
</Address>
</Addresses>
</Person>
<Person>
<Name>Bob</Name>
<Surname>Brown</Surname>
<Email>bob.brown@example.com</Email>
<Addresses>
<Address Type="Home">
<Name>Stony Boulevard</Name>
<Postcode>8276</Postcode>
</Address>
</Addresses>
</Person>
<Person>
<Name>Charles</Name>
<Surname>Cooper</Surname>
<Email>N/A</Email>
<Addresses>
<Address Type="Home">
<Name>Lazy Fawn Mount</Name>
<Postcode>9828</Postcode>
</Address>
<Address Type="Work">
<Name>High Zephyr Impasse</Name>
<Postcode>8918</Postcode>
</Address>
</Addresses>
</Person>
</Persons>
Examples of how to use the class are provided in this document, along with an explanation of available options.
Given the more complex nature of the XML format, at least one map/data handler pair is required to extract data.
Each element configuration must be an associative array, with the absolute XPath of the element as key and an array containing the map/data handler pair as value.
'/xpath' => array(
'map' => array(
// ...
),
'handler' => function ($element, array $properties, &$data) {
// ...
},
),
The map must be an associative array with each property name as key and the respective relative XPath as value.
The data handler should be of the type Closure and have the following signature:
/**
* @param string $element Absolute XPath of the current XML element
* @param array $properties Associative array with extracted properties
* @param mixed $data User data
*/
$handler = function ($element, array $properties, &$data) {
// Implementation
};
TIP: User data is passed to the handler by reference.
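To make the relationship between the element XPath, the map and the handler more concrete, here is a minimal sketch that applies a configuration of the same shape using PHP's built-in DOMDocument and DOMXPath. The runConfig() function and the inline XML are hypothetical and only illustrate the idea; this is not how the Essence class is implemented, and it loads the whole document into memory, which the library may not do.

```php
<?php
// Hypothetical driver (not part of Essence): for every element matched by a
// configuration key, evaluate each map expression relative to that element
// and pass the resulting properties to the handler.
function runConfig(DOMDocument $doc, array $config, &$data)
{
    $xpath = new DOMXPath($doc);

    foreach ($config as $element => $pair) {
        $handler = $pair['handler'];

        foreach ($xpath->query($element) as $node) {
            $properties = array();

            foreach ($pair['map'] as $name => $expression) {
                // The second argument makes the expression relative to $node
                $properties[$name] = $xpath->evaluate($expression, $node);
            }

            // $data is handed over by reference, so the handler can change it
            $handler($element, $properties, $data);
        }
    }
}

$doc = new DOMDocument();
$doc->loadXML('<Persons><Person><Name>Anna</Name></Person><Person><Name>Bob</Name></Person></Persons>');

$config = array(
    '/Persons/Person' => array(
        'map' => array(
            'name' => 'string(Name)',
        ),
        'handler' => function ($element, array $properties, &$data) {
            // Collect every extracted name in the user data
            $data[] = $properties['name'];
        },
    ),
);

$names = array();
runConfig($doc, $config, $names);

var_dump($names); // contains "Anna" and "Bob"
```

Because the third handler parameter is declared with &, the names collected inside the handler are still present in $names after the call.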
For some XML documents, a namespace needs to be registered in order to parse the data properly.
$namespaces = array(
'atom' => 'http://www.w3.org/2005/Atom',
);
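As a rough illustration of why registration matters, the sketch below uses PHP's own DOMXPath rather than the Essence class. The inline Atom feed is made up for the example. In XPath 1.0, an unprefixed name only matches elements in no namespace, so a document with a default namespace needs a registered prefix.

```php
<?php
$xml = '<?xml version="1.0" encoding="UTF-8"?>'
    . '<feed xmlns="http://www.w3.org/2005/Atom"><title>Example</title></feed>';

$doc = new DOMDocument();
$doc->loadXML($xml);

$xpath = new DOMXPath($doc);

// Unprefixed names do not match elements in the Atom namespace
$without = $xpath->evaluate('string(/feed/title)');

// After registering a prefix for the namespace URI, the elements match
$xpath->registerNamespace('atom', 'http://www.w3.org/2005/Atom');
$with = $xpath->evaluate('string(/atom:feed/atom:title)');

var_dump($without); // string(0) ""
var_dump($with);    // string(7) "Example"
```

With a namespace registered under the atom prefix, configuration keys and map values would presumably use the same prefixed form, e.g. /atom:feed/atom:title.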
<?php
require 'vendor/autoload.php';
use Impensavel\Essence\EssenceException;
use Impensavel\Essence\XML;
$config = array(
'/Persons/Person' => array(
'map' => array(
'name' => 'string(Name)',
'surname' => 'string(Surname)',
'email' => 'string(Email)',
),
'handler' => function ($element, array $properties, &$data) {
var_dump($properties);
},
),
);
$namespaces = array();
try
{
$essence = new XML($config, $namespaces);
$essence->extract(new SplFileInfo('input.xml'));
} catch (EssenceException $e) {
// Handle exceptions
}
The extract() method allows consuming XML data from a few input types.
Currently supported are string, resource (normally the result of a fopen()) and SplFileInfo.
$input = <<< EOT
<?xml version="1.0" encoding="UTF-8"?>
<Persons>
<!-- data -->
</Persons>
EOT;
$essence->extract($input);
$input = fopen('input.xml', 'r');
$essence->extract($input);
$input = new SplFileInfo('input.xml');
$essence->extract($input);
The extract() method has a few options that can be used to handle different situations.
The encoding option is set to UTF-8 by default and it should remain so in normal circumstances.
To use the encoding defined in the document, set the value to null; set it to another encoding when appropriate.
$essence->extract($input, array(
'encoding' => 'ISO-8859-1',
));
By default, the options value is set to LIBXML_PARSEHUGE.
For extra parsing configurations, like loading an external subset, use a bitmask.
$essence->extract($input, array(
'options' => LIBXML_PARSEHUGE|LIBXML_DTDLOAD,
));
Refer to the PHP documentation for the complete list of supported LIBXML_* constants.
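For readers less familiar with bitmasks, the short sketch below shows how LIBXML_* flags are combined and checked in plain PHP. It does not call the library; the variable names are only for illustration.

```php
<?php
// Combine several flags into one value with bitwise OR
$options = LIBXML_PARSEHUGE | LIBXML_DTDLOAD;

// Check whether a given flag is part of the value with bitwise AND
$loadsDtd = ($options & LIBXML_DTDLOAD) === LIBXML_DTDLOAD;
$noBlanks = ($options & LIBXML_NOBLANKS) === LIBXML_NOBLANKS;

var_dump($loadsDtd); // bool(true)
var_dump($noBlanks); // bool(false)
```

Note that passing a custom options value replaces the default, so LIBXML_PARSEHUGE has to be included explicitly if it is still wanted.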
By default, the handler only has access to the data being extracted, but sometimes access to other data might be necessary.
To solve this, user data can be passed as a third argument to the extract() method.
// The second argument holds the extract() options (e.g. encoding)
$options = array(
// ...
);
$data = array(
// ...
);
$essence->extract($input, $options, $data);
TIP: The user data is passed by reference, which means that it can be modified by the handler, if needed.
In this section we will cover two advanced use cases.
Sometimes, it might be necessary to skip to a specific element if a pre-condition fails. A reason for this would be that there's no point in storing data from a child node if the parent data wasn't saved.
In the XML above, the third Person element has an invalid email (N/A), and only valid/complete data from the set should be extracted.
The following configuration takes care of that:
$config = array(
'/Persons/Person' => array(
'map' => array(
'name' => 'string(Name)',
'surname' => 'string(Surname)',
'email' => 'string(Email)',
),
'handler' => function ($element, array $properties, &$data) {
// Skip to the next /Persons/Person element if the email is invalid
if (filter_var($properties['email'], FILTER_VALIDATE_EMAIL) === false) {
return '/Persons/Person';
}
// Do something with the data, otherwise
},
),
'/Persons/Person/Addresses/Address' => array(
'map' => array(
'type' => 'string(@Type)',
'address' => 'string(Name)',
'postcode' => 'string(Postcode)',
),
'handler' => function ($element, array $properties, &$data) {
// Do something with the data
},
),
);
In other words, the handler we're in must return the absolute XPath of the element we want to skip to.
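The effect of such a skip can be approximated with plain DOMXPath, as in the sketch below. This is a simplified stand-in and not the library's implementation: the continue statement plays the role of returning '/Persons/Person' from the handler, and the inline XML is a shortened version of the example data.

```php
<?php
$xml = <<<'XML'
<Persons>
<Person>
<Name>Bob</Name>
<Email>bob.brown@example.com</Email>
<Addresses><Address Type="Home"><Name>Stony Boulevard</Name></Address></Addresses>
</Person>
<Person>
<Name>Charles</Name>
<Email>N/A</Email>
<Addresses><Address Type="Home"><Name>Lazy Fawn Mount</Name></Address></Addresses>
</Person>
</Persons>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);

$xpath = new DOMXPath($doc);
$stored = array();

foreach ($xpath->query('/Persons/Person') as $person) {
    $email = $xpath->evaluate('string(Email)', $person);

    // Invalid email: move on to the next Person, ignoring its addresses
    if (filter_var($email, FILTER_VALIDATE_EMAIL) === false) {
        continue;
    }

    // Only reached when the parent Person is valid
    foreach ($xpath->query('Addresses/Address', $person) as $address) {
        $stored[] = $xpath->evaluate('string(Name)', $address);
    }
}

var_dump($stored); // only "Stony Boulevard" is kept
```

The address of the Person with the N/A email never reaches the inner loop, which mirrors the address handler not being called after a skip.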
In order to keep track of node/element relations, we can store data from one handler and retrieve it from another.
$config = array(
'/Persons/Person' => array(
'map' => array(
'name' => 'string(Name)',
'surname' => 'string(Surname)',
'email' => 'string(Email)',
),
'handler' => function ($element, array $properties, &$data) {
// Store data using a Laravel Person model
$person = Person::create($properties);
// Return the last inserted id
return $person->id;
},
),
'/Persons/Person/Addresses/Address' => array(
'map' => array(
// Use the last inserted Person id set from
// the other handler to make the relation
'person_id' => '#/Persons/Person',
'type' => 'string(@Type)',
'address' => 'string(Name)',
'postcode' => 'string(Postcode)',
),
'handler' => function ($element, array $properties, &$data) {
// Store data using a Laravel Address model
Address::create($properties);
},
),
);
When a handler returns, any value that cannot be mapped to an absolute XPath (which would trigger a skip instead) is stored. The previous value is overwritten each time the handler returns.
By prefixing a # to the absolute XPath of a mapped element (e.g. #/Persons/Person) in a map property value, the stored value registered to that element XPath will be used instead.
An EssenceException will be thrown if the XPath is not registered.
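One way to picture the lookup described above is a registry keyed by element XPath. The resolveStored() function below is hypothetical, and RuntimeException stands in for EssenceException so the sketch runs without the library; the real internals may differ.

```php
<?php
// Hypothetical lookup (not part of Essence): resolve a "#/xpath" map value
// against values previously returned by other handlers.
function resolveStored($expression, array $stored)
{
    // Drop the leading # to get the element XPath
    $key = substr($expression, 1);

    if (!array_key_exists($key, $stored)) {
        throw new RuntimeException('XPath not registered: ' . $key);
    }

    return $stored[$key];
}

// Value returned by the /Persons/Person handler, e.g. a last inserted id
$stored = array('/Persons/Person' => 42);

$personId = resolveStored('#/Persons/Person', $stored);

try {
    resolveStored('#/Persons/Unknown', $stored);
    $failed = false;
} catch (RuntimeException $e) {
    $failed = true;
}

var_dump($personId); // int(42)
var_dump($failed);   // bool(true)
```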
To extract data from an XML document, we use XPath expressions to map the document structure by element and by property.
Only XPath 1.0 is supported.
Configuration keys should always have the absolute XPath to the element we want to extract data from.
To extract data from Person elements, the configuration should be:
$config = array(
'/Persons/Person' => array(
// ...
),
);
Map values should always be XPath expressions relative to the current element, unless we want to retrieve stored element data.
To get the Name property of a /Persons/Person element, the configuration should be:
$config = array(
'/Persons/Person' => array(
'map' => array(
'name' => 'string(Name)',
),
// Data handler
),
);
Values should be cast to a type when mapping element properties, unless there's a reason to work with a DOMNodeList instead.
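The difference between cast and uncast expressions can be seen with PHP's DOMXPath::evaluate(), which is used here on its own as an illustration; the inline XML is a trimmed copy of the example data.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML(
    '<Persons><Person><Name>Anna</Name><Addresses><Address Type="Home">'
    . '<Postcode>6181</Postcode></Address></Addresses></Person></Persons>'
);

$xpath = new DOMXPath($doc);
$person = $xpath->query('/Persons/Person')->item(0);

// XPath 1.0 functions such as string() and number() yield scalar values
$name = $xpath->evaluate('string(Name)', $person);
$postcode = $xpath->evaluate('number(Addresses/Address/Postcode)', $person);

// A bare location path yields a DOMNodeList
$raw = $xpath->evaluate('Name', $person);

var_dump($name);                        // string(4) "Anna"
var_dump($postcode);                    // float(6181)
var_dump($raw instanceof DOMNodeList);  // bool(true)
```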
The dump() method was added in version 3.0.0 to make things a bit easier when mapping.
This method returns an array with all the XPaths and occurrence count of an XML input.
<?php
require 'vendor/autoload.php';
use Impensavel\Essence\EssenceException;
use Impensavel\Essence\XML;
try
{
$essence = new XML;
$paths = $essence->dump(new SplFileInfo('input.xml'));
var_dump($paths);
} catch (EssenceException $e) {
// Handle exceptions
}
Using the code above to dump the example XML data, we get the following output:
array(9) {
["Persons"]=>
int(1)
["Persons/Person"]=>
int(3)
["Persons/Person/Name"]=>
int(3)
["Persons/Person/Surname"]=>
int(3)
["Persons/Person/Email"]=>
int(3)
["Persons/Person/Addresses"]=>
int(3)
["Persons/Person/Addresses/Address"]=>
int(5)
["Persons/Person/Addresses/Address/Name"]=>
int(5)
["Persons/Person/Addresses/Address/Postcode"]=>
int(5)
}
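A path count of this kind can be approximated with PHP's XMLReader, as in the sketch below. It is an independent illustration, not the code behind dump(), and it uses a shortened inline document instead of input.xml.

```php
<?php
$xml = '<Persons><Person><Name>Anna</Name></Person>'
    . '<Person><Name>Bob</Name></Person></Persons>';

$reader = new XMLReader();
$reader->XML($xml);

$stack = array(); // element names from the root down to the current element
$paths = array(); // path => occurrence count

while ($reader->read()) {
    if ($reader->nodeType !== XMLReader::ELEMENT) {
        continue;
    }

    // Trim the stack to the current depth, then add the current element
    $stack = array_slice($stack, 0, $reader->depth);
    $stack[] = $reader->name;

    $path = implode('/', $stack);
    $paths[$path] = isset($paths[$path]) ? $paths[$path] + 1 : 1;
}

$reader->close();

var_dump($paths);
```

For the shortened document, this yields one Persons entry and two entries each for Persons/Person and Persons/Person/Name.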
Sometimes it may be easier to have a DOMNodeList and work with it, instead of having to set a new element map and data handler.
Since version 2.1.0, a helper method has been added to convert DOMNodeList objects into array types.
This static method converts a DOMNodeList object into an indexed array (by default), or to an associative one when the second argument is true.
By default, node attributes are not included in the array. To include them, set the value of the third argument to true.
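To give a feel for what such a conversion involves, the sketch below walks a DOMNodeList by hand. The nodeListToArray() function is hypothetical, handles only one level of child elements, produces indexed output only, and is not the library's DOMNodeListToArray() method.

```php
<?php
// Hypothetical, simplified converter (not part of Essence)
function nodeListToArray(DOMNodeList $nodes, $attributes = false)
{
    $result = array();

    foreach ($nodes as $node) {
        if (!$node instanceof DOMElement) {
            continue;
        }

        $item = array();

        // Optionally collect node attributes under the @ key
        if ($attributes && $node->hasAttributes()) {
            foreach ($node->attributes as $attribute) {
                $item['@'][$attribute->name] = $attribute->value;
            }
        }

        // Append the text of each child element by position
        foreach ($node->childNodes as $child) {
            if ($child instanceof DOMElement) {
                $item[] = $child->textContent;
            }
        }

        $result[] = $item;
    }

    return $result;
}

$doc = new DOMDocument();
$doc->loadXML(
    '<Addresses><Address Type="Home"><Name>Rocky Row</Name>'
    . '<Postcode>6181</Postcode></Address></Addresses>'
);

$xpath = new DOMXPath($doc);
$addresses = nodeListToArray($xpath->query('/Addresses/Address'), true);

var_dump($addresses);
```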
<?php
require 'vendor/autoload.php';
use Impensavel\Essence\EssenceException;
use Impensavel\Essence\XML;
$config = array(
'/Persons/Person' => array(
'map' => array(
'name' => 'string(Name)',
'surname' => 'string(Surname)',
'email' => 'string(Email)',
'addresses' => 'Addresses',
),
'handler' => function ($element, array $properties, &$data) {
// Set to true to return an associative array instead of an indexed one
$associative = false;
// Include node attributes
$attributes = true;
foreach ($properties as $name => $value) {
if ($value instanceof DOMNodeList) {
$properties[$name] = XML::DOMNodeListToArray($value, $associative, $attributes);
}
}
var_dump($properties);
},
),
);
try
{
$essence = new XML($config);
$essence->extract(new SplFileInfo('input.xml'));
} catch (EssenceException $e) {
// Handle exceptions
}
Indexed array with node attributes (@ key) for the addresses element:
array(4) {
["name"]=>
string(4) "Anna"
["surname"]=>
string(5) "Adams"
["email"]=>
string(22) "anna.adams@example.com"
["addresses"]=>
array(1) {
[0]=>
array(2) {
[0]=>
array(3) {
["@"]=>
array(1) {
["Type"]=>
string(4) "Home"
}
[0]=>
string(9) "Rocky Row"
[1]=>
string(4) "6181"
}
[1]=>
array(3) {
["@"]=>
array(1) {
["Type"]=>
string(4) "Work"
}
[0]=>
string(12) "Round Valley"
[1]=>
string(4) "6781"
}
}
}
}
Associative array with node attributes (@ key) for the addresses element:
array(4) {
["name"]=>
string(4) "Anna"
["surname"]=>
string(5) "Adams"
["email"]=>
string(22) "anna.adams@example.com"
["addresses"]=>
array(1) {
[0]=>
array(1) {
["Address"]=>
array(2) {
[0]=>
array(3) {
["@"]=>
array(1) {
["Type"]=>
string(4) "Home"
}
["Name"]=>
array(1) {
[0]=>
string(9) "Rocky Row"
}
["Postcode"]=>
array(1) {
[0]=>
string(4) "6181"
}
}
[1]=>
array(3) {
["@"]=>
array(1) {
["Type"]=>
string(4) "Work"
}
["Name"]=>
array(1) {
[0]=>
string(12) "Round Valley"
}
["Postcode"]=>
array(1) {
[0]=>
string(4) "6781"
}
}
}
}
}
}