OpenAlboPretorio is a scraper library able to extract data from the Albo Pretorio Web archive present on Italian city institutional websites

OpenAlboPretorio

OpenAlboPretorio is a scraper library that extracts data from the Albo Pretorio Web archive published on the institutional websites of Italian cities. The Albo Pretorio is the public archive containing all the acts that concern the administrative life of the Municipality. Extracted data can be exported in JSON (default) and Feed (rss2, atom) formats.

Installation

Clone the GitHub repository to your machine:

$ git clone https://github.com/gpirrotta/OpenAlboPretorio.git
$ cd OpenAlboPretorio

Then run these two commands to download Composer and install the dependencies:

$ wget http://getcomposer.org/composer.phar
$ php composer.phar install

Now you can add the autoloader, and you will have access to the library:

<?php

require 'vendor/autoload.php';

You're done.

Usage

The OpenAlboPretorio class is the entry point of the library.

<?php

    $albo = new OpenAlboPretorio();
    $results = $albo->city(AlboPretorioScraperFactory::TERME_VIGLIATORE)
                    ->open();

    print $results;  // JSON format by default

You can also customize the scraper manually:

<?php

    $scraper = new BarcellonaPGScraper(new BarcellonaPGMasterPageScraper(), new BarcellonaPGDetailPageScraper());
    // you can also customize the Master and Detail scraper objects, e.g. by using different HttpAdapter objects

    $scraper->setItemType(BarcellonaPGScraper::TIPOLOGIA_DETERMINAZIONE_DEL_SINDACO);
    $formatter = new FeedFormatter(FeedFormatter::ATOM_FEED_TYPE);  // Atom feed; RSS2 is the default feed type

    $albo = new OpenAlboPretorio();
    $results = $albo->scrapeUsing($scraper)
                    ->formatUsing($formatter)
                    ->maxNumberItems(10)
                    ->open();

    print $results;

The OpenAlboPretorio API

  • city($city): set the city to scrape. The $city parameter is the ID of the city whose Web page is scraped.

Example:

<?php

 $albo->city(AlboPretorioScraperFactory::TERME_VIGLIATORE);

As an alternative to the city method, you can set your own customized scraper with:

  • scrapeUsing(AlboPretorioScraperInterface $scraper): set the scraper to use.

You can customize the scraped results with:

  • maxNumberItems($maxNumberItems): set the maximum number of items to scrape;

  • formatUsing(FormatterInterface $formatter): set the output format. JSON is the default, but you can also choose a Feed format (rss2, atom);

  • open(): start the scraping and return the scraped data.
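The chaining style of these methods can be illustrated with a stand-alone sketch. The SketchAlbo class and the closure-based scraper and formatter below are hypothetical stand-ins, not the library's classes; the sketch only assumes that each configuration method returns $this and that open() does the actual work.

```php
<?php

// Hypothetical stand-in for OpenAlboPretorio: NOT the library's implementation.
class SketchAlbo
{
    private $scraper;         // callable returning an array of items
    private $formatter;       // callable turning the items into a string
    private $maxNumberItems;  // null means "no limit"

    public function __construct()
    {
        // Assumed default, mirroring the README: JSON output.
        $this->formatter = function (array $items) {
            return json_encode($items);
        };
    }

    public function scrapeUsing($scraper)
    {
        $this->scraper = $scraper;
        return $this; // returning $this is what makes chaining possible
    }

    public function formatUsing($formatter)
    {
        $this->formatter = $formatter;
        return $this;
    }

    public function maxNumberItems($maxNumberItems)
    {
        $this->maxNumberItems = $maxNumberItems;
        return $this;
    }

    public function open()
    {
        if ($this->scraper === null) {
            throw new LogicException('No scraper configured.');
        }
        $items = call_user_func($this->scraper);
        if ($this->maxNumberItems !== null) {
            $items = array_slice($items, 0, $this->maxNumberItems);
        }
        return call_user_func($this->formatter, $items);
    }
}

// Fake scraper returning fixed data instead of fetching a Web page.
$fakeScraper = function () {
    return [
        ['title' => 'Act 1'],
        ['title' => 'Act 2'],
        ['title' => 'Act 3'],
    ];
};

$albo = new SketchAlbo();
$results = $albo->scrapeUsing($fakeScraper)
                ->maxNumberItems(2)
                ->open();

print $results . "\n"; // prints [{"title":"Act 1"},{"title":"Act 2"}]
```

Because every setter returns the same object, the order of the configuration calls does not matter as long as open() comes last.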

Scraper

Currently the following scrapers are implemented:

Formatter

Formatters available:

  • JSONFormatter: (default) formats the scraped results in JSON format;

  • FeedFormatter($type): formats the scraped results in Feed format; type parameter can be one of these:

    • FeedFormatter::RSS_FEED_TYPE (default)
    • FeedFormatter::ATOM_FEED_TYPE
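As a rough idea of what adding a formatter could involve, here is a self-contained sketch. The SketchFormatterInterface interface and its format() method are assumptions made for illustration; the library's FormatterInterface may declare different methods, so check the source before implementing it.

```php
<?php

// Hypothetical interface: the method name format() is an assumption.
interface SketchFormatterInterface
{
    public function format(array $items);
}

// JSON output, comparable in spirit to the default formatter.
class SketchJsonFormatter implements SketchFormatterInterface
{
    public function format(array $items)
    {
        return json_encode($items);
    }
}

// An additional, purely illustrative CSV formatter.
class SketchCsvFormatter implements SketchFormatterInterface
{
    public function format(array $items)
    {
        $lines = [];
        foreach ($items as $item) {
            $fields = [];
            foreach ($item as $value) {
                // Quote every field and double any embedded quotes.
                $fields[] = '"' . str_replace('"', '""', (string) $value) . '"';
            }
            $lines[] = implode(',', $fields);
        }
        return implode("\n", $lines);
    }
}

// Sample items; the keys 'title' and 'url' are made up for this sketch.
$items = [
    ['title' => 'Act "A"', 'url' => 'http://example.org/1'],
    ['title' => 'Act B', 'url' => 'http://example.org/2'],
];

$json = (new SketchJsonFormatter())->format($items);
$csv = (new SketchCsvFormatter())->format($items);

print $csv . "\n";
```

Keeping the formatter behind a small interface means the scraping code never needs to know which output format was chosen.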

Extending the OpenAlboPretorio project

If you want to extend the OpenAlboPretorio project for your city, you have to implement the AlboPretorioScraperInterface interface.

Generally, scraping an Albo Pretorio Web page means extracting data from two pages:

  1. the Master page - the Web page listing all the Albo Pretorio items, i.e. all the administrative acts of the Municipality, where you can find a summary of the most recently published items, including the URL of each item;

  2. the Detail page - the Web page of a single item, where you can find the details of that administrative act.

To handle the scraping logic described above, the OpenAlboPretorio library provides the AbstractMasterDetailTemplateScraper abstract class, which implements the AlboPretorioScraperInterface interface.

The abstract class uses the following scraper interfaces:

  • MasterPageScraperInterface - retrieves the list of item URLs to scrape;
  • DetailPageScraperInterface - retrieves the details of a single administrative act, i.e. an item; it must return an AlboPretorioItem object.

If the extraction logic of your Albo Pretorio Web page differs from the Master-Detail pattern, you are free to implement the one that meets your needs.
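The Master-Detail flow described above can be sketched in a few lines. All the names below (SketchMasterPageScraper, SketchDetailPageScraper, SketchMasterDetailScraper, scrapeUrls(), scrapeItem(), scrape()) are hypothetical and only mirror the roles of the library's interfaces; the real method signatures may differ. The in-memory fakes stand in for pages that would normally be fetched over HTTP.

```php
<?php

// Hypothetical counterpart of the Master page role.
interface SketchMasterPageScraper
{
    // Returns the list of item URLs found on the master page.
    public function scrapeUrls();
}

// Hypothetical counterpart of the Detail page role.
interface SketchDetailPageScraper
{
    // Returns the data of the single item found at $url.
    public function scrapeItem($url);
}

class SketchMasterDetailScraper
{
    private $master;
    private $detail;

    public function __construct(SketchMasterPageScraper $master, SketchDetailPageScraper $detail)
    {
        $this->master = $master;
        $this->detail = $detail;
    }

    // Fixed two-step flow: collect the URLs first, then visit each detail page.
    public function scrape($maxNumberItems = null)
    {
        $urls = $this->master->scrapeUrls();
        if ($maxNumberItems !== null) {
            $urls = array_slice($urls, 0, $maxNumberItems);
        }
        $items = [];
        foreach ($urls as $url) {
            $items[] = $this->detail->scrapeItem($url);
        }
        return $items;
    }
}

// In-memory fakes so the sketch runs without network access.
class FakeMaster implements SketchMasterPageScraper
{
    public function scrapeUrls()
    {
        return ['http://example.org/act/1', 'http://example.org/act/2'];
    }
}

class FakeDetail implements SketchDetailPageScraper
{
    public function scrapeItem($url)
    {
        return ['url' => $url, 'title' => 'Act ' . basename($url)];
    }
}

$scraper = new SketchMasterDetailScraper(new FakeMaster(), new FakeDetail());
$items = $scraper->scrape(1);

print $items[0]['title'] . "\n"; // prints Act 1
```

Limiting the URL list before visiting the detail pages avoids fetching pages whose items would be discarded anyway.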

Requirements

  • PHP >= 5.4

Running the Tests

$ phpunit

Demo

TODO

  • Improve test coverage
  • Add scrapers
  • Add formatters

Credits

License

OpenAlboPretorio is released under the MIT License. See the bundled LICENSE file for details.