ppajer/DOMExtractor

PHP-DOM-Extractor

A PHP library for extracting data from an HTML DOM document into any user-defined data structure, based on custom extraction rules.

Usage

Install

Download the repo and install via Composer, or manually download and include the class in your project. Note: this package requires ivopetkov/html5-dom-document-php to process HTML5 documents. If you're installing manually, you will need to manage this dependency yourself.

Defining extraction rules

Rules are plain PHP arrays that tell the extractor where to find each value. Each rule consists of a key under which the output is stored and a CSS selector that matches the required element. By default the element's text content is returned, unless you specify an attribute to return instead. All instruction keys for the extractor are prefixed with an @ and are omitted from the output.

Basic query & attributes

The package uses CSS selector syntax for getting values from document nodes, including text and attribute nodes. The most basic rule could be written as:

array(
	'exampleKey' => array(
		'@selector' => 'title'
	)
)

// Will return:

array(
	'exampleKey' => 'Example Title'
)

If the data you're looking for is inside an element attribute, specify the attribute name in the selector after an @ sign.

array(
	'exampleKey' => array(
		'@selector' => 'h1@class'
	)
)

// Will return: 

array(
	'exampleKey' => 'h1 green-text site-heading'
)
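
To make the selector@attribute convention concrete, here is a minimal sketch of how such a string could be split into its CSS part and its attribute part. The function splitSelector is hypothetical and is not part of DOM_Extractor; the library's actual parsing may differ. It assumes the attribute name follows the last @ in the string.

```php
<?php
// Hypothetical helper (not part of DOM_Extractor): splits a rule selector
// such as 'h1@class' into its CSS selector and optional attribute name.
function splitSelector(string $selector): array
{
    // Assumption: the attribute name follows the last '@' in the string.
    $at = strrpos($selector, '@');

    if ($at === false) {
        // No '@' present: the rule asks for the element's text content.
        return ['css' => $selector, 'attribute' => null];
    }

    return [
        'css'       => substr($selector, 0, $at),
        'attribute' => substr($selector, $at + 1),
    ];
}

$textRule      = splitSelector('title');    // text content of <title>
$attributeRule = splitSelector('h1@class'); // class attribute of <h1>
```

With this reading, 'title' has no attribute part and yields the element text, while 'h1@class' targets the class attribute of the matched h1.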

Lists & nested data

If you need to parse multiple values for a single key, or look for nested data, you can use the @each instruction, and nest as many levels of instructions as your memory limit allows:

array(
	'exampleKey' => array(
		'@selector' => '.some-list-item',
		'@each' => array(
			'listItemTitle' => array(
				'@selector' => 'h3'
			),
			'listItemLink' => array(
				'@selector' => 'a@href'
			),
			'listItemImages' => array(
				'@selector' => '.carousel-item',
				'@each' => array(
					'src' => array(
						'@selector' => 'img@src'
					)
				)
			)
		)
	)
)

This will return an array where exampleKey is an array containing arrays of data about the individual items in the list: in this example, the text content of each h3 tag, the href attribute of each a element, and the src attribute of every img element.

array(
	'exampleKey' => array(
		array(
			'listItemTitle' => 'Some title',
			'listItemLink' => 'https://...',
			'listItemImages' => array(
				array('src' => 'https://...'),
				array('src' => 'https://...'),
				...
			)
		),
		array(
			'listItemTitle' => 'Some other title',
			'listItemLink' => 'https://...',
			'listItemImages' => array(
				array('src' => 'https://...'),
				array('src' => 'https://...'),
				...
			)
		),
		...
	)
)
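
Because the result is an ordinary nested array, it can be processed with standard PHP array functions. The sketch below uses sample data in the shape shown above (the example.com URLs are placeholders, not real output) and flattens each item's image list into a plain list of src strings.

```php
<?php
// Sample data in the documented output shape; all values are placeholders.
$data = [
    'exampleKey' => [
        [
            'listItemTitle'  => 'Some title',
            'listItemLink'   => 'https://example.com/a',
            'listItemImages' => [
                ['src' => 'https://example.com/a-1.jpg'],
                ['src' => 'https://example.com/a-2.jpg'],
            ],
        ],
        [
            'listItemTitle'  => 'Some other title',
            'listItemLink'   => 'https://example.com/b',
            'listItemImages' => [
                ['src' => 'https://example.com/b-1.jpg'],
            ],
        ],
    ],
];

$summary = [];
foreach ($data['exampleKey'] as $item) {
    $summary[] = [
        'title'  => $item['listItemTitle'],
        'link'   => $item['listItemLink'],
        // array_column pulls the 'src' value out of every image entry.
        'images' => array_column($item['listItemImages'], 'src'),
    ];
}
```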

Setting up the rules

Once your rules are ready, pass them either to the constructor as its first argument or to an existing instance by calling setRules. For convenience, the extractor also accepts its rules as a JSON string or as the path to an external JSON file.

$rules = /* array or JSON string or file path */;

// Constructor 
$extractor = new DOM_Extractor($rules);

// OR Instance
$extractor = new DOM_Extractor();
$extractor->setRules($rules);
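
If you prefer to keep rules in JSON, one way to produce the JSON string is to encode an existing rules array. This is a sketch under the assumption that the JSON form mirrors the array structure key for key; the rule names pageTitle and links are made up for illustration.

```php
<?php
// Illustrative rules; the key names 'pageTitle' and 'links' are made up.
$rules = [
    'pageTitle' => ['@selector' => 'title'],
    'links'     => [
        '@selector' => '.some-list-item',
        '@each'     => [
            'href' => ['@selector' => 'a@href'],
        ],
    ],
];

// Encode to a JSON string (assumed to mirror the array structure).
$json = json_encode($rules, JSON_PRETTY_PRINT | JSON_THROW_ON_ERROR);

// Decoding with associative arrays enabled restores the original rules,
// so nothing is lost in the round trip.
$decoded = json_decode($json, true, 512, JSON_THROW_ON_ERROR);
```

The resulting string can be passed as-is or saved to a file whose path is passed instead.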

Loading the document

Once everything is set, you are ready to load the document to parse and start extraction. As with passing the rules, here too you have the option of using the constructor's second argument or the dedicated load method.

$html = file_get_contents('https://...');

// Constructor 
$extractor = new DOM_Extractor($rules, $html);

// OR Instance
$extractor = new DOM_Extractor();
$extractor->load($html);
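
Note that file_get_contents returns false on failure, so it can be worth checking the result before handing it to the extractor. A minimal sketch, using a hypothetical helper named readHtml that is not part of the library:

```php
<?php
// Hypothetical helper (not part of DOM_Extractor): reads HTML from a URL
// or file path and fails loudly instead of returning false.
function readHtml(string $source): string
{
    // '@' suppresses the PHP warning; failure is handled explicitly below.
    $html = @file_get_contents($source);

    if ($html === false || trim($html) === '') {
        throw new RuntimeException("Could not read HTML from {$source}");
    }

    return $html;
}
```

The returned string can then be passed to the constructor or to load in the same way as $html above.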

Complete example

$rules = 'some/path/to/rules.json';
$html = file_get_contents('https://...');

// Constructor method
$extractor = new DOM_Extractor($rules, $html);
$data = $extractor->parse();

// Instance method
$extractor = new DOM_Extractor;
$extractor->setRules($rules);
$extractor->load($html);
$data = $extractor->parse();

// Also supports method chaining:
$extractor = new DOM_Extractor();
$data = $extractor->setRules($rules)->load($html)->parse();