web.instata

Turn your plain old tabular data (POTD) into Web data with web.instata: it takes a CSV file as input and generates an HTML document in which the data items are marked up with Schema.org terms.

                       +--------------------+
+-------+              |                    |            +--------------+
|  CSV  |              |                    |            |              |
|-------|              |                    |            |   HTML5      |
|       | +----------->|     web.instata    |+---------> |              |
|       |              |                    |            |   Schema.org |
|       |              |                    |            |              |
+-------+              |                    |            +--------------+
                       +--------------------+

Note: web.instata only works for CSV files that use Schema.org types or properties as column names.

Usage

Simple publishing

To publish an HTML+microdata document from a CSV file:

python web.instata.py -p {path to CSV file} {base URI for publishing}

Example:

python web.instata.py -p test/potd_0.csv http://example.org/instata/potd_0

... and you should see the following on the command line:

[web.instata] processing [test/potd_0.csv] with base URI [http://example.org/instata/potd_0] 
[web.instata] loading DBpedia2Schema.org mapping ...
[web.instata] got DBpedia2Schema.org mapping!
[web.instata] trying to find a match for http://schema.org/Recipe
[web.instata] trying to find a match for http://schema.org/publishDate
[web.instata] trying to find a match for http://schema.org/name
[web.instata] trying to find a match for http://schema.org/author
[web.instata] match(es) found: {'http://schema.org/author': ('http://www.w3.org/2002/07/owl#equivalentProperty', 'http://dbpedia.org/ontology/author')}
[web.instata] result is now available at [output/potd_0.html]

As a result of the above command, an HTML+microdata document potd_0.html is created that should look like the following:

[Screenshot: example output]

The generated HTML document, potd_0.html, contains Schema.org terms marked up in microdata as follows:

<table id="instatable">
	<thead>
		<tr itemscope itemtype="http://purl.org/NET/schema-org-csv#HeaderRow">
			<th itemscope itemtype="http://schema.org/Thing" itemid="http://example.org/instata/potd_0#row:1,col:1">Recipe</th>
			<th itemscope itemtype="http://schema.org/Thing" itemid="http://example.org/instata/potd_0#row:1,col:2">name</th>
			...
		</tr>
	</thead>
	<tbody>
		<tr itemscope itemtype="http://schema.org/Recipe" itemid="http://example.org/instata/potd_0#row:2">
			<td><a href="http://example.org/instata/potd_0#row:2" itemprop="http://schema.org/url">bb</a></td>
			<td itemprop="http://schema.org/name">Mom's World Famous Banana Bread</td>
			<td itemprop="http://schema.org/author">John Smith</td>
			<td itemprop="http://schema.org/publishDate">May 8, 2009</td>
		</tr>
		...
	</tbody>
</table>	
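The row markup above can be approximated with a few lines of Python. This is only a sketch under stated assumptions, not the code web.instata itself uses (the README says its output comes from templates): the function name `render_row` and its parameters are hypothetical, and the URI pattern `{base URI}#row:{n}` is read off the sample output.

```python
from html import escape

SCHEMA = "http://schema.org/"


def render_row(base_uri, row_number, item_type, label, properties):
    """Render one CSV data row as a microdata <tr> element (hypothetical helper).

    label is the text of the first column; properties is a list of
    (Schema.org property name, cell text) pairs for the remaining columns.
    """
    row_uri = escape(f"{base_uri}#row:{row_number}")
    lines = [f'<tr itemscope itemtype="{SCHEMA}{escape(item_type)}" itemid="{row_uri}">']
    # The first cell links to the row's own URI, as in the sample output.
    lines.append(
        f'\t<td><a href="{row_uri}" itemprop="{SCHEMA}url">'
        f"{escape(label, quote=False)}</a></td>"
    )
    for prop, text in properties:
        lines.append(
            f'\t<td itemprop="{SCHEMA}{escape(prop)}">{escape(text, quote=False)}</td>'
        )
    lines.append("</tr>")
    return "\n".join(lines)


print(render_row(
    "http://example.org/instata/potd_0", 2, "Recipe", "bb",
    [("name", "Mom's World Famous Banana Bread"), ("author", "John Smith")],
))
```

Cell text is escaped before it is written into the markup, so values containing `<` or `&` cannot break the table.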

Configuration-based publishing

A more flexible, though slightly more complex, option is to use a web.instata configuration file, which specifies the input and output as well as schema matching options. The configuration file is written in Turtle.

To publish an HTML+microdata document from a CSV file using a configuration file:

python web.instata.py -c {path to configuration file}

Example:

python web.instata.py -c web.instata.config

... where a configuration file looks as follows:

@prefix dc: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix c: <#> .

c:default-config	
	# publishing options
	c:csv_input			"test/potd_0.csv" ;
	c:output_base_uri	<http://example.org/instata/potd_0> ;
	c:schema_matching	"dbpedia-2011-07-31.rdf" ; 

	# directory and file options
	c:templates_dir		"templates/" ;
	c:mappings_dir		"mappings/" ;
	c:output_dir		"output/" ;
	c:base_template		"base.tpl" ;
	c:base_style_file	"web.instata-style.css" ;

	# metadata about the config file
	dc:title		"The default configuration for web.instata" ;
	dc:modified		"2011-08-01"^^xsd:date ;
	dc:creator		<http://sw-app.org/mic.xhtml#i> ;
.

Note that the configuration file lets you specify one or more schema matchings (via c:schema_matching) and customise the output (via c:base_template and c:base_style_file). The last block (metadata) is included for completeness only and is currently not used by web.instata; you may remove it if you want.
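To illustrate which parts of the configuration carry options, here is a minimal sketch that pulls the `c:` settings out of such a file with a regular expression. It is not a Turtle parser and not how web.instata reads its configuration (the README does not say how it does); the function name `read_options` is hypothetical, and a proper RDF library would be the robust choice for real use.

```python
import re

# Matches `c:option "literal"` or `c:option <uri>`.
# The leading \b keeps `dc:title` and friends from matching.
OPTION = re.compile(r'\bc:(\w+)\s+(?:"([^"]*)"|<([^>]*)>)')


def read_options(turtle_text):
    """Collect c: options into a dict of lists (hypothetical helper).

    Values are kept in lists because an option such as
    c:schema_matching may be given more than once.
    """
    options = {}
    for line in turtle_text.splitlines():
        if line.strip().startswith("#"):
            continue  # skip comment lines
        for key, literal, uri in OPTION.findall(line):
            options.setdefault(key, []).append(literal or uri)
    return options


config = """
@prefix dc: <http://purl.org/dc/terms/> .
@prefix c: <#> .

c:default-config
	# publishing options
	c:csv_input "test/potd_0.csv" ;
	c:output_base_uri <http://example.org/instata/potd_0> ;
	c:schema_matching "dbpedia-2011-07-31.rdf" ;
	dc:title "The default configuration for web.instata" ;
.
"""
print(read_options(config))
```

The metadata properties in the `dc:` namespace are ignored, which matches the note that web.instata does not use them.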

Validation of input

To check whether the input CSV file uses Schema.org terms as column headings:

python web.instata.py -v {path to CSV file} {base URI for publishing}

Example:

python web.instata.py -v test/potd_0.csv http://example.org/instata/potd_0

... and you should see the following on the command line:

[web.instata] validating schema ...
[web.instata] all column headings in the input file test/potd_0.csv seem to be valid Schema.org terms :)
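The idea behind this check can be sketched as follows. This is an illustration only: the README does not describe how web.instata decides whether a heading is a Schema.org term, so the small `KNOWN_TERMS` set (taken from the headings in the example log) and the function name `unknown_headings` are assumptions, as is the sample CSV header line.

```python
import csv
import io

# Tiny stand-in vocabulary built from the headings in the example;
# a real check would need the full list of Schema.org terms.
KNOWN_TERMS = {"Recipe", "name", "author", "publishDate"}


def unknown_headings(csv_text, known_terms=KNOWN_TERMS):
    """Return the column headings that are not in known_terms (hypothetical helper)."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    return [h.strip() for h in header if h.strip() not in known_terms]


sample = 'Recipe,name,author,publishDate\nbb,Banana Bread,John Smith,"May 8, 2009"\n'
problems = unknown_headings(sample)
if problems:
    print("not recognised:", ", ".join(problems))
else:
    print("all column headings look like known terms")
```

Only the first row is inspected, since the column names are what must match Schema.org types or properties.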

Data dump

To get an RDF/Turtle data dump from a CSV file:

python web.instata.py -d {path to CSV file} {base URI for publishing}

Example:

python web.instata.py -d test/potd_0.csv http://example.org/instata/potd_0

... and you should see something like the following on the command line:

@prefix dc: <http://purl.org/dc/terms/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix scsv: <http://purl.org/NET/schema-org-csv#> .

<http://example.org/instata/potd_0#table> a <http://purl.org/NET/schema-org-csv#Table>;
    scsv:row <http://example.org/instata/potd_0#row:1>,
        <http://example.org/instata/potd_0#row:2>,
        <http://example.org/instata/potd_0#row:3>;
    dc:source <http://example.org/instata/potd_0>;
    dc:title "potd_0" .

<http://example.org/instata/potd_0#row:1> a <http://purl.org/NET/schema-org-csv#HeaderRow>;
    scsv:cell <http://example.org/instata/potd_0#row:1,col:1>,
        <http://example.org/instata/potd_0#row:1,col:2>,
        <http://example.org/instata/potd_0#row:1,col:3>,
        <http://example.org/instata/potd_0#row:1,col:4>;
    dc:title "header" .
	
<http://example.org/instata/potd_0#row:1,col:1> dc:title "Recipe" .

<http://example.org/instata/potd_0#row:2> a <http://purl.org/NET/schema-org-csv#Row>;
    scsv:cell <http://example.org/instata/potd_0#row:2,col:1>,
        <http://example.org/instata/potd_0#row:2,col:2>,
        <http://example.org/instata/potd_0#row:2,col:3>,
        <http://example.org/instata/potd_0#row:2,col:4>;
    dc:title "row 2" .

<http://example.org/instata/potd_0#row:2,col:1> a <http://schema.org/Recipe>;
    rdf:value "bb" .
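The dump identifies every row and cell with a fragment URI built from the base URI. The sketch below reproduces that naming scheme as it appears in the sample output; the helper names are hypothetical and the code is a reconstruction from the dump above, not web.instata's own implementation.

```python
SCSV = "http://purl.org/NET/schema-org-csv#"


def row_uri(base_uri, row):
    """URI of a row, e.g. <base>#row:2 (rows are counted from 1)."""
    return f"{base_uri}#row:{row}"


def cell_uri(base_uri, row, col):
    """URI of a cell, e.g. <base>#row:2,col:1 (columns are counted from 1)."""
    return f"{base_uri}#row:{row},col:{col}"


def row_turtle(base_uri, row, n_cols):
    """Turtle description of one row; row 1 is treated as the header row."""
    kind = "HeaderRow" if row == 1 else "Row"
    title = "header" if row == 1 else f"row {row}"
    cells = ",\n        ".join(
        f"<{cell_uri(base_uri, row, col)}>" for col in range(1, n_cols + 1)
    )
    return (
        f"<{row_uri(base_uri, row)}> a <{SCSV}{kind}>;\n"
        f"    scsv:cell {cells};\n"
        f'    dc:title "{title}" .'
    )


print(row_turtle("http://example.org/instata/potd_0", 2, 4))
```

The printed statements rely on the `scsv:` and `dc:` prefixes being declared at the top of the dump, as in the output shown above.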

Kudos

Thanks to asciiflow.com for providing a useful tool.

To do

  • DONE: use Bottle as templating system for output
  • DONE: use DBpedia2Schema.org mapping to enrich output (related link, etc.)
  • Use the JS dump from Schema.RDF.org to check if term exists
  • Provide new option -v to check input data
  • Provide new option -d to create data dump in RDF

License

This software is Public Domain.