First version of Top100Extract
fsteggink committed Jan 15, 2018
1 parent 5b24230 commit a77f3a5
Showing 17 changed files with 3,158 additions and 0 deletions.
54 changes: 54 additions & 0 deletions brt/top100nl/etl/README
@@ -0,0 +1,54 @@
Loading TOP100NL with the Stetl (www.stetl.org) ETL framework.
by: Just van den Broecke,
GFS and XSLT by Frank Steggink

This directory contains the ETL configuration and command for using Stetl to write
TOP100NL from the source GML files to various outputs.
The default output is PostGIS, but because output goes through ogr2ogr it can be
any output that ogr2ogr supports, e.g. SHP, GeoJSON or GeoPackage, and in theory also e.g. Oracle.

To use Stetl, the external GitHub submodule externals/stetl
must be present.

When cloning the GitHub repository, Stetl is included as follows:
git clone --recursive https://github.com/nlextract/NLExtract.git
Stetl then comes along and need not be installed separately; only the Stetl dependencies must be installed.

Installing the Stetl dependencies:
http://www.stetl.org/en/latest/install.html

More about Stetl: http://stetl.org

Command
-------

./etl-top100nl.sh
Windows: etl-top100nl.cmd

Uses the default options (database parameters etc.) from options/default.args.

Stetl configuration; need not be changed unless e.g. a different output is desired:
conf/etl-top100nl-v1.1.cfg

Options/arguments
-----------------

A number of options can be overridden in 2 ways:

1- Implicitly: override the default options (database parameters etc.) with your own local file named after
the local hostname: options/<your host name>.args

2- Explicitly on the command line via ./etl-top100nl.sh <my options file>.args
etl-top100nl.cmd <my options file>.args

If method 2 is used, it takes precedence over method 1 and the default options!
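
The precedence between the default options file, the host-based file and the command-line file can be sketched in Python. This is only an illustration of the rules in this README, not code from NLExtract or Stetl: the function name resolve_options_file and its parameters are hypothetical. It follows the shell script in ignoring a command-line file that does not exist (the .cmd script does not perform that check).

```python
import os


def resolve_options_file(options_dir, hostname, cli_arg=None):
    """Pick the options file to use; later rules override earlier ones.

    Hypothetical helper that mirrors the precedence described in the README.
    """
    # Start with the defaults shipped with the ETL.
    chosen = os.path.join(options_dir, "default.args")

    # Method 1 (implicit): a host-specific file overrides the defaults,
    # but only if it actually exists.
    host_file = os.path.join(options_dir, hostname + ".args")
    if os.path.isfile(host_file):
        chosen = host_file

    # Method 2 (explicit): an existing file named on the command line
    # overrides both the host-specific file and the defaults.
    if cli_arg and os.path.isfile(cli_arg):
        chosen = cli_arg

    return chosen
```

A caller would typically pass socket.gethostname() as the hostname and the first command-line argument, if any, as cli_arg.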

Database mapping
----------------
gfs/top100-v1.1.gfs is the GDAL/OGR "GFS Template" and determines the mapping of GML elements/attributes
to PostGIS column names. If needed, create your own GFS file and specify it in your
options/<your host name>.args, e.g. gfs_template=gfs/mijntop100.gfs

TODO
----
* GUI
118 changes: 118 additions & 0 deletions brt/top100nl/etl/conf/etl-top100nl-v1.1.cfg
@@ -0,0 +1,118 @@
# Example of process-chains for extracting TOP100NL source data from GML to PostGIS.
# A Chain is a series of Components: one Input, zero or more Filters and one Output.
# The output of a Component is connected to the input of the next Component (except for
# the final Output Component, which writes to the final destination, e.g. Postgres).
#
# Currently 3 chains are executed in the following order:
# - SQL pre: DB initialization, delete tables, create schema
# - Main ETL chain, consists of the following components:
#   1. input_zip_file: reads files from input ZIP file(s)
#   2. extract_zip_file: extracts a GML file from a ZIP file
#   3. parse_gml_file: parses elements from a GML file
#   4. xml_assembler: assembles feature elements into smaller (etree) docs
#   5. transformer_xslt: transforms each (etree) doc
#   6. packet_writer: writes the transformed GML document to a file
#   7. output_ogr2ogr: output using ogr2ogr, input is a transformed GML file, output can be any OGR output
# - SQL post: remove duplicates, update multi-attributes
#
# Any substitutable values are specified in curly brackets e.g. {password}.
# Actual values can be passed as args to Stetl main.py or as arguments from a wrapper program
# like top100extract.py to etl.py. Here are the 3 chains:

[etl]
chains = input_sql_pre|schema_name_filter|output_postgres,
         input_zip_file|extract_zip_file|parse_gml_file|xml_assembler|transformer_xslt|packet_writer|output_ogr2ogr,
         input_sql_post|schema_name_filter|output_postgres

# Pre SQL file inputs to be executed
[input_sql_pre]
class = inputs.fileinput.StringFileInput
file_path = sql/drop-tables-v1.1.sql,sql/create-schema.sql

# Post SQL file inputs to be executed
[input_sql_post]
class = inputs.fileinput.StringFileInput
file_path = sql/delete-duplicates-v1.1.sql,sql/update-multiattributes-v1.1.sql

# Generic filter to substitute Python-format string values like {schema} in string
[schema_name_filter]
class = filters.stringfilter.StringSubstitutionFilter
# format args {schema} is schema name
format_args = schema:{schema}

[output_postgres]
class = outputs.dboutput.PostgresDbOutput
database = {database}
host = {host}
port = {port}
user = {user}
password = {password}
schema = {schema}

# The source input ZIP-file(s) from dir, producing 'records' with ZIP file name and inner file names
[input_zip_file]
class=inputs.fileinput.ZipFileInput
file_path = {input_dir}
filename_pattern = *.[zZ][iI][pP]
name_filter=*.[gG][mM][lL]

# Filter to extract a ZIP file one by one to a temporary location
[extract_zip_file]
class=filters.zipfileextractor.ZipFileExtractor
file_path = {temp_dir}/fromzip-tmp.gml

# Reads the extracted GML file, producing FeatureMember elements
[parse_gml_file]
class = filters.xmlelementreader.XmlElementReader
element_tags = FeatureMember

# Assembles FeatureMember elements into etree docs, each with up to "max_elements" elements
[xml_assembler]
class = filters.xmlassembler.XmlAssembler
max_elements = {max_features}
container_doc = <?xml version="1.0" encoding="UTF-8"?>
    <top100nl:FeatureCollectionTop100
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns:top100nl="http://register.geostandaarden.nl/gmlapplicatieschema/top100nl/1.1.0"
        xmlns:gml="http://www.opengis.net/gml/3.2"
        xsi:schemaLocation="http://register.geostandaarden.nl/gmlapplicatieschema/top100nl/1.1.0 top100nl.xsd"
        gml:id="Top100NL_FC">
    </top100nl:FeatureCollectionTop100>
element_container_tag = FeatureCollectionTop100

# Transforms into simple/flat feature data (single geometry per feature type, single attrs)
[transformer_xslt]
class = filters.xsltfilter.XsltFilter
script = xsl/top100-split_v1.1.xsl

# Writes the payload of a packet as a string to a file
[packet_writer]
class = filters.packetwriter.PacketWriter
file_path = {temp_dir}/top100-tmp.gml

# The ogr2ogr command-line, may use any output here, as long as
# the input is a GML file. The "temp_file" is where etree-docs
# are saved. It has to be the same file as in the ogr2ogr command.
# TODO: find a way to use a GML-stream through stdin to ogr2ogr
[output_ogr2ogr]
class = outputs.execoutput.Ogr2OgrExecOutput
# destination format: OGR vector format name
dest_format = PostgreSQL
# destination datasource: name of datasource
dest_data_source = "PG:dbname={database} host={host} port={port} user={user} password={password} active_schema={schema}"
# layer creation options will only be added to ogr2ogr on first run
lco = -lco LAUNDER=YES -lco PRECISION=NO
# spatial_extent, translates to -spat xmin ymin xmax ymax
spatial_extent = {spatial_extent}
# gfs template
gfs_template = gfs/top100-v1.1.gfs
# miscellaneous ogr2ogr options
options = -append -gt 65536 {multi_opts} --config PG_USE_COPY NO
# cleanup input?
cleanup_input = True

# Validator for XML
[xml_schema_validator]
class = filters.xmlvalidator.XmlSchemaValidator
xsd = http://register.geostandaarden.nl/gmlapplicatieschema/top100nl/1.1.0/top100nl.xsd
enabled = False
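
The chain notation in the `[etl]` section of this configuration (chains separated by commas, components separated by `|`) and the `{name}` placeholders can be illustrated with a short Python sketch. It is not Stetl's own code: `parse_chains` and `substitute` are hypothetical helpers, the embedded configuration is a trimmed copy of the file above, and the argument value `top100nl` is an assumed example.

```python
import configparser

# Trimmed copy of the configuration above; continuation lines are indented
# so that configparser treats them as part of the "chains" value.
CFG_TEXT = """
[etl]
chains = input_sql_pre|schema_name_filter|output_postgres,
    input_zip_file|extract_zip_file|parse_gml_file|xml_assembler|transformer_xslt|packet_writer|output_ogr2ogr,
    input_sql_post|schema_name_filter|output_postgres

[output_postgres]
database = {database}
schema = {schema}
"""


def parse_chains(parser):
    """Split the 'chains' value into a list of chains, each a list of component names."""
    raw = parser.get("etl", "chains")
    return [
        [name.strip() for name in chain.split("|")]
        for chain in raw.split(",")
        if chain.strip()
    ]


def substitute(value, args):
    """Fill {name} placeholders using Python string formatting."""
    return value.format(**args)


# interpolation=None keeps configparser from interpreting '%' in values.
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(CFG_TEXT)

chains = parse_chains(parser)
for chain in chains:
    # The first component is the Input, the last the Output, the rest are Filters.
    print(chain[0], "->", chain[-1], "with", len(chain) - 2, "filter(s)")

# Assumed example value for the {schema} placeholder.
print(substitute(parser.get("output_postgres", "schema"), {"schema": "top100nl"}))
```

Each section named in a chain (for example `[parse_gml_file]`) then supplies the `class` and parameters for that component.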
37 changes: 37 additions & 0 deletions brt/top100nl/etl/etl-top100nl.cmd
@@ -0,0 +1,37 @@
:: ETL for TOP100NL GML using Stetl.
::
:: This is a front-end/wrapper batch script that ultimately calls Stetl with a configuration
:: (etl-top100nl-v1.1.cfg) and parameters (options\myoptions.args). This script is
:: based on the shell script ../../../brk/etl-brk.sh.
::
:: Author: Frank Steggink
@echo off

setlocal

:: Use the Stetl shipped with NLExtract (in theory this can also be a Stetl installed via pip install stetl)
if "%STETL_HOME%"=="" (
    set STETL_HOME=../../../externals/stetl
)

:: Needed for imports
if "%PYTHONPATH%"=="" (
    set PYTHONPATH=%STETL_HOME%
) else (
    set PYTHONPATH=%STETL_HOME%;%PYTHONPATH%
)

:: Default arguments/options
set options_file=options\default.args

:: Optionally override the default options file with a host-based options file
:: options\<hostname>.args.
if exist options\%COMPUTERNAME%.args set options_file=options\%COMPUTERNAME%.args

:: Optionally override via the command line: etl-top100nl.cmd <my options file>
if not "%~1"=="" set options_file=%1

:: Final command. Can also simply be "stetl -c conf\etl-top100nl-v1.1.cfg -a ..." if Stetl is installed
python %STETL_HOME%\stetl\main.py -c conf\etl-top100nl-v1.1.cfg -a %options_file%

endlocal
37 changes: 37 additions & 0 deletions brt/top100nl/etl/etl-top100nl.sh
@@ -0,0 +1,37 @@
#!/bin/bash
#
# ETL for TOP100NL GML using Stetl.
#
# This is a front-end/wrapper shell script that ultimately calls Stetl with a configuration
# (etl-top100nl-v1.1.cfg) and parameters (options/myoptions.args).
#
# Author: Just van den Broecke
#

# Use the Stetl shipped with NLExtract (in theory this can also be a Stetl installed via pip install stetl)
if [ -z "$STETL_HOME" ]; then
    STETL_HOME=../../../externals/stetl
fi

# Needed for imports
if [ -z "$PYTHONPATH" ]; then
    export PYTHONPATH=$STETL_HOME
else
    export PYTHONPATH=$STETL_HOME:$PYTHONPATH
fi

# Default arguments/options
options_file=options/default.args

# Optionally override the default options file with a host-based file options/<your hostname>.args
# To add your local host, add <your hostname>.args to the options directory
host_options_file=options/`hostname`.args

[ -f "$host_options_file" ] && options_file=$host_options_file

# Optionally override via the command line: etl-top100nl.sh <my options file>
[ -f "$1" ] && options_file=$1

# Final command. Can also simply be "stetl -c conf/etl-top100nl-v1.1.cfg -a ..." if Stetl is installed
# python $STETL_HOME/stetl/main.py -c conf/etl-top100nl-v1.1.cfg -a "$pg_options temp_dir=temp max_features=$max_features gml_files=$gml_files $multi $spatial_extent"
python "$STETL_HOME/stetl/main.py" -c conf/etl-top100nl-v1.1.cfg -a "$options_file"
