Inspired by Facebook's link-sharing flow, this class attempts to parse an (X)HTML document and retrieve its meta-information. I emphasize "attempts": (X)HTML documents are exceptionally tough to parse, and data is often lost because of how the delivered content is structured.
Sample output:

Array
(
    [base] => http://www.bbc.com/
    [favicon] => http://www.bbc.co.uk/favicon.ico
    [meta] => Array
        (
            [description] => Breaking news, sport, ...
            [keywords] => Array
                (
                    [0] => BBC
                    [1] => bbc.co.uk
                    ...
                    [...] => BBCi
                )
        )
    [images] => Array
        (
            [0] => http://sa.bbc.co.uk/bbc/bbc/s?name=home.page&geo_edition=us&ml_name=barlesque&app_type=web&language=en-GB&ml_version=0.6.3
            [1] => http://static.bbc.co.uk/frameworks/barlesque/1.21.3/desktop/3/img/blocks/light.png
            [2] => http://static.bbc.co.uk/wwhomepage-3.5/ic/news/432-259/57632000/jpg/_57632639_013603124-1.jpg
            [3] => http://static.bbc.co.uk/wwhomepage-3.5/ic/news/432-259/57626000/jpg/_57626527_57626526.jpg
            ...
            [...] => http://me.effectivemeasure.net/em_image
        )
    [openGraph] => Array
        (
            [title] => BBC - Homepage
            [type] => website
            [image] => http://static.bbc.co.uk/wwhomepage-3.5/1.0.29/img/iphone.png
            [url] => http://www.bbc.co.uk/
        )
    [title] => BBC - Homepage
    [url] => http://www.bbc.com/
)
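An array shaped like the sample output can be reduced to the handful of fields a link-sharing preview typically needs. The sketch below is only an illustration: `buildPreview()` is a hypothetical helper and not part of PHP-MetaParser, and it assumes that keys which could not be parsed are simply absent from the array, which the source does not confirm.

```php
<?php

// Hypothetical helper (not part of PHP-MetaParser): picks preview fields
// from a details array, preferring Open Graph values when they exist.
function buildPreview(array $details): array
{
    $og     = $details['openGraph'] ?? [];
    $images = $details['images'] ?? [];

    return [
        // Fall back to the document <title> when og:title is missing.
        'title'       => $og['title'] ?? $details['title'] ?? '',
        'description' => $details['meta']['description'] ?? '',
        // Fall back to the first image found on the page, if any.
        'image'       => $og['image'] ?? ($images[0] ?? null),
        'url'         => $og['url'] ?? $details['url'] ?? '',
    ];
}

// Usage with a few values taken from the sample output.
$preview = buildPreview([
    'title'     => 'BBC - Homepage',
    'url'       => 'http://www.bbc.com/',
    'openGraph' => ['url' => 'http://www.bbc.co.uk/'],
    'images'    => ['http://static.bbc.co.uk/frameworks/barlesque/1.21.3/desktop/3/img/blocks/light.png'],
]);
print_r($preview);
```

Preferring the Open Graph values mirrors what a page author explicitly asked to be shared; the fallbacks only matter for pages that publish no such tags.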
The following code uses the PHP-Curler class to fetch the BBC site, store its content, and pass it to a MetaParser instance. The URL is passed as well so that any relative paths (favicons, images) can be rewritten as absolute URLs based on the location of the document that was parsed.
<?php

// booting
require_once APP . '/vendors/PHP-Curler/Curler.class.php';
require_once APP . '/vendors/PHP-MetaParser/MetaParser.class.php';

// curling
$curler = new Curler();
$url = 'http://www.bbc.com/';
$body = $curler->get($url);

// parsing
$parser = new MetaParser($body, $url);
print_r($parser->getDetails());
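To illustrate why the document URL is needed, here is a minimal sketch of how a relative favicon or image path might be turned into an absolute URL. `resolveUrl()` is a hypothetical function written for this example; it is not MetaParser's actual implementation, and it deliberately skips `../` normalization and query-only references.

```php
<?php

// Hypothetical helper (not MetaParser's own code): resolves $path against
// the URL of the document it was found in.
function resolveUrl(string $base, string $path): string
{
    // A path that already carries a scheme is absolute; return it unchanged.
    if (is_string(parse_url($path, PHP_URL_SCHEME))) {
        return $path;
    }

    $parts  = parse_url($base);
    $scheme = $parts['scheme'] ?? 'http';
    $port   = isset($parts['port']) ? ':' . $parts['port'] : '';
    $origin = $scheme . '://' . ($parts['host'] ?? '') . $port;

    // Protocol-relative reference: reuse the document's scheme.
    if (strpos($path, '//') === 0) {
        return $scheme . ':' . $path;
    }

    // Root-relative reference: attach to the document's origin.
    if (strpos($path, '/') === 0) {
        return $origin . $path;
    }

    // Document-relative reference: attach to the document's directory.
    $basePath = $parts['path'] ?? '/';
    $dir      = substr($basePath, 0, strrpos($basePath, '/') + 1);

    return $origin . $dir . $path;
}

echo resolveUrl('http://www.bbc.com/', '/favicon.ico'), "\n";
echo resolveUrl('http://www.bbc.com/news/index.html', 'img/logo.png'), "\n";
```

Without the document URL there is nothing to resolve such paths against, which is why the example passes `$url` alongside `$body`.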