Playground & Import UI demo scenario: Provto & OpenElec


Datacore Playground

An Open Linked Data Cloud: what's that? A picture is worth a thousand words!

=> Welcome to the Datacore Playground!

https://data.ozwillo-dev.eu/dc-ui/index.html

It's the visible part of the iceberg, the part that application providers (that's you!) can use and test. Historically it was built on Swagger (scroll down), but we've moved toward something more specific and easier to use, which starts by describing what it does (scroll up):

=> data Resources, and the data models that describe them, because if you don't know what's in it, you can't query it!

So what's a data Resource?

For instance, a city:

pli:city: it has type(s), a name, an Italian ISTAT id, and it's in a country (Italia), i.e. linked to it, and by clicking on the link you can go to it => click on it, this is a country, it has a name

Let's see another example, companies:

co:company_0: another kind of Resource; it has a city and an Italian kind of activity (ATECO), and by clicking on ":" you can get all companies that have the same kind of activity, ex. "Lavori_generali_di_costruzione_di_edifici" => click on it, here there are two construction companies

Models:

Models are themselves persisted as Resources (metamodel), so you can have a look at them in the exact same way => click on dcmo:model_0

You're in a Resource (click pli:city_0) and you wonder which fields are available on this kind of data? You only have to click on its type (click type) and here are all the fields and their Data Types.

Mixins: are also persisted the same way. Mixins are reusable parts of types that you can build models with; they also represent technical or business features.

You're looking for all Resources that can be put on a map? Query models for those that have the pl:place_0 mixin (which contains the pl:shape field, i.e. WKS localization information).

More

i18n: a list of translation objects; can be skipped (lookup is on value and not language) => click

RDF: "standard" => click

auth: same as the Portal

Datacore Import UI

Tutorial: try importing OpenElec data

=> click on "import" (by default, imports OpenElec)

At the bottom, there is a Datacore playground where imported resources can be browsed.

=> click on "go" for GET /dc/type/elec:Elector_0 : here are all Electors ; then find one whose "elec_Elector:street" is "...30" and click on ":" : here are all people living in the same street ; have a look at their name : they belong to the same family, actually in the same house !

Above, it displays: the parsed CSV, the built Resources or their parsing errors (missingIdFields...) or warnings if any, and the POSTed Resources or their Datacore errors. This is displayed for Data Resources, or, if models are broken, for Model Resources.

It all starts with (your) data

To add your data, you have to define models (or map to them)

=> you can do both using Import UI ! https://data.ozwillo-dev.eu/dc-ui/import/index.html

It all starts with a flat export of all your data, containing your leaf, core business data (ex. Elector), but also classifying, colder data (ex. Electoral List). Think of it as a view on ALL of your data.

=> have a look at OpenElec sample data https://data.ozwillo-dev.eu/dc-ui/import/samples/openelec/electeur_v26010_sample.csv

  • Don't start without data: models need data, else your models will be empty shells that will never be able to import any of your real data. And after each change to your model CSV, run the import of both your models AND your data; otherwise, if a later import of your data fails, you won't know which change caused it.

  • Don't import each of your database tables separately: it negates Datacore's flexibility in defining models and is overall much harder (to link them, to evolve them towards a better structure). You need a SINGLE CSV file for all your tables. That's because for each line, the Datacore Import tool imports one Resource of each type. For instance, if you have Elector & Street tables, do select * from elector inner join street on elector.streetHeLivesInId = street.id , then for each line the Datacore Import tool will import one Street and one Elector and link the Elector to the Street (and if the Street already exists it will be reused & merged). See the sketch after this list.

  • Start with your most important tables, i.e. your tables about your core business. You may have tables about city & country like a lot of databases, but that's probably not your core business, which is rather electors, urban areas, request forms...

  • Start with your root concepts, i.e. those that don't depend on any other; they are usually easier to model, ex. Country rather than City if your model manages both.

  • At first (to design the right model), you only need a meaningful extract of your data, not all of it. 50 or even 5 lines may be enough if they're meaningful enough.

  • There are a lot of ways to get CSV out of SQL: using a web or desktop database management tool, using an ETL such as Talend, or using online tools such as http://codebeautify.org/sql-to-csv-converter . OpenOffice can convert SQL to CSV by connecting to some databases, see http://grasswiki.osgeo.org/wiki/Openoffice.org_with_SQL_Databases , and the same goes for Excel, see http://jmerrell.com/2011/06/30/connect-excel-external-data-source/ .

  • The CSV separator is ',' (comma) and not ';' (semicolon). Beware, Excel produces ';', so rather use Open/LibreOffice Calc or a Google Docs sheet (or see how to configure Excel). Also beware that importing a CSV (that comes from an export of your DB) in Excel or Open/LibreOffice sometimes mixes several such separators at once, which will corrupt your data.

  • Beware of upper vs lower case, especially since these programs can easily capitalize first letters: the model CSV title line is case-insensitive, but everything else is not!
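To tie the last few points together, here is a minimal sketch of producing that single, comma-separated CSV out of a SQL join. It uses SQLite and the hypothetical elector/street schema from the join example above; the database file, table and column names are illustrative, so adapt the connection and the query to your own database.

```python
# Minimal sketch: export one joined, comma-separated CSV for the Import UI.
# "openelec.db", "elector", "street" and the column names are illustrative.
import csv
import sqlite3

conn = sqlite3.connect("openelec.db")
cur = conn.execute(
    "SELECT * FROM elector "
    "INNER JOIN street ON elector.streetHeLivesInId = street.id"
)

with open("export.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)                            # ',' separator by default
    writer.writerow(col[0] for col in cur.description)  # title line = column names
    writer.writerows(cur)                               # one line per elector
```

Each exported line then lets the Import tool build one Resource of each type (ex. one Elector and one Street) and link them, Streets appearing on several lines being reused & merged.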

Both phases consist of (almost) all of the following substeps.

Phase overview

All phases are about defining models for your data and importing both, but each has a different purpose.

  • first, in Phase 1, you define from scratch trivial models (obvious ones, with the same names and fields) around your data. Its purpose is to focus on what you know best, your business, and translate it to Datacore. The outcome is imported but unlinked, siloed sample data.
  • secondly, in Phase 2, you reconcile those trivial models with already existing Datacore models, in a series of refactoring steps (renaming and / or inheriting, refining) that each bring your models closer to those that already exist, for each model successively. The outcome is linked sample data.
  • finally, in Phase 3, you properly publish your models and data, in a dedicated versioned project that clients access from the corresponding (unversioned) facade project in read/write and from oasis.main in read-only. The outcome is linked, complete, optimized, evolution-proof data that is available to consumers.

Phase 1 - define models from scratch around your data

In Phase 1, you define from scratch the trivial models for your data, by grouping fields about the same thing together (i.e. methodology step 1).

Download and open the model template using OpenOffice or as CSV and save it in a new file. For now, don't change this file's first three lines (title, documentation and defaults) and start working below.

=> have a look at OpenElec trivial models https://data.ozwillo-dev.eu/dc-ui/import/samples/openelec/oasis-donnees-metiers-openelec.csv

Copy and transpose the column title line of your data into the "Internal field name" column (below its first three lines: title, documentation and defaults), then for each one fill in the columns: "Field name", "Data type" (ex. String, see the available ones on the Playground) and "Mixin" (where it resides / can be grouped).

Group each piece of data into a "Mixin" (first column of the CSV template), for example for a database:

| Museum name | Open hour | zip code | city |
|---|---|---|---|
| Museum of london | 8h-18h | 12345 | London |

Here we can group zip code and city in a City mixin, and Museum name and Open hour in a Museum mixin. In the CSV file:

| Mixin | Field name | OLD Field name | Data type | Precision | Description | Internal field name |
|---|---|---|---|---|---|---|
| Museum | name | (blank, not used) | String | (blank, not used) | name of museum | Museum name |
| Museum | openhour | (blank, not used) | String | (blank, not used) | open hour of museum | Open hour |
| City | zipcode | (blank, not used) | Integer | (blank, not used) | zipcode | zip code |
| City | name | (blank, not used) | String | (blank, not used) | name of city | city |

Don't try to link data or build ids immediately; proceed step by step.

Then import both model and data

In this step you will get errors; this is normal, we will fix them later.

=> select your model file. => select your data file. => click on "import" (by default, imports OpenElec)

At the bottom, there is a Datacore playground where imported resources can be browsed. Above, it displays: the parsed CSV, the built Resources or their parsing errors (missingIdFields...) or warnings if any, and the POSTed Resources or their Datacore errors. This is displayed for Data Resources, or, if models are broken, for Model Resources.

Now you normally have lots of errors (for example, id not defined); this is not a problem, we will fix them all.

Define Resource ID building

To be created, a Resource needs a unique URI, that is http://data.oasis-eu.org/dc/type/TYPE/ID , where TYPE is its concrete model type. So you need to define in the model import configuration how IDs are built for each Resource type.

Fill the indexInId column (see below) to build IDs (ex. FR/75/Paris) out of other fields (ex. values FR, 75, Paris).

  • Using Resource-typed fields (ex. geo:City_0 model's geo_City:country link of type geo_Country_0) for this purpose means using the linked Resource's own ID (ex. FR).
  • If some values are missing, you can put them in the defaultValue column to temporarily make it work.

NB. for now that's the only way to build IDs, besides hacking the import javascript.

With our example, we can use the name of the Museum as its id, and the zipcode and name of the city as the City's id.

In the model CSV file:

| Mixin | Field name | OLD Field name | Data type | Precision | Description | Internal field name | ... | indexInId |
|---|---|---|---|---|---|---|---|---|
| Museum | name | (blank, not used) | String | (blank, not used) | name of museum | Museum name | ... | 0 |
| Museum | openhour | (blank, not used) | String | (blank, not used) | open hour of museum | Open hour | ... | (blank) |
| City | zipcode | (blank, not used) | Integer | (blank, not used) | zipcode | zip code | ... | 0 |
| City | name | (blank, not used) | String | (blank, not used) | name of city | city | ... | 1 |
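To make the resulting IDs concrete, here is a rough sketch of how they are assembled out of the indexInId columns, assuming (as in the FR/75/Paris example above) that the flagged field values are concatenated with '/' in ascending indexInId order; the model type names follow the illustrative museum example, and real IDs would additionally be URL-encoded.

```python
# Minimal sketch of ID building from indexInId (assumption: values joined by '/'
# in ascending indexInId order, as in the FR/75/Paris example).
BASE = "http://data.oasis-eu.org/dc/type"

def build_uri(model_type, row, index_in_id):
    """index_in_id maps internal field name -> its indexInId value."""
    ordered = sorted(index_in_id, key=index_in_id.get)
    resource_id = "/".join(str(row[field]) for field in ordered)
    return f"{BASE}/{model_type}/{resource_id}"

row = {"Museum name": "Museum of london", "Open hour": "8h-18h",
       "zip code": 12345, "city": "London"}

print(build_uri("Museum_0", row, {"Museum name": 0}))
# http://data.oasis-eu.org/dc/type/Museum_0/Museum of london
print(build_uri("City_0", row, {"zip code": 0, "city": 1}))
# http://data.oasis-eu.org/dc/type/City_0/12345/London
```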

Now we can link city and museum.

Define Resource linking

There are two ways: autolinking and ID-based linking.

In each data line, Datacore autolinks Resources together (fills their Resource-typed fields with URIs) if there is a single candidate concrete type for the link type (Resource-typed field).

Otherwise (ex. geo_City:parent being of type o:Ancestor_0 which is shared by geo:Department_0, geo:Region_0, geo:Country_0 types), you must specify linked Resource ID fields using dotted path (ex. elec_Elector_0:birthCity.elec_City_0:INSEECode and elec_Elector_0:movedFromCity.elec_City_0:INSEECode) in additional Field name columns.

Now, in our sample, we can autolink city and museum. In the model CSV file:

| Mixin | Field name | OLD Field name | Data type | Precision | Description | Internal field name | ... | indexInId |
|---|---|---|---|---|---|---|---|---|
| Museum | name | (blank, not used) | String | (blank, not used) | name of museum | Museum name | ... | 0 |
| Museum | openhour | (blank, not used) | String | (blank, not used) | open hour of museum | Open hour | ... | (blank) |
| City | zipcode | (blank, not used) | Integer | (blank, not used) | zipcode | zip code | ... | 0 |
| City | name | (blank, not used) | String | (blank, not used) | name of city | city | ... | 1 |
| Museum | city | (blank, not used) | City | (blank, not used) | autolink | (blank, this is a link) | ... | (blank) |
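To visualize what this yields, here is a rough sketch of the two Resources that the sample line above would produce once autolinked, assuming "@id" holds the Resource URI and using unprefixed, illustrative field names (real models would use prefixed names such as geoci:name):

```python
# Minimal sketch of the autolinking outcome for one CSV line (illustrative names).
city = {
    "@id": "http://data.oasis-eu.org/dc/type/City_0/12345/London",
    "zipcode": 12345,
    "name": "London",
}
museum = {
    "@id": "http://data.oasis-eu.org/dc/type/Museum_0/Museum of london",
    "name": "Museum of london",
    "openhour": "8h-18h",
    # Resource-typed field: filled by autolinking with the single candidate City's URI
    "city": city["@id"],
}
```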

Phase 2 - reconcile your models with existing ones

Once Phase 1 is done, in Phase 2, you reconcile your models with existing ones, and make them follow required rules and best practices. This is done by changing and refining your models iteratively.

In short, at the end, for instance your locationCity field must not be the string "Paris", but the linked Resource "http://data.ozwillo.com/dc/type/geocifr:Commune_0/FR/FR-75/Paris". And if you have defined a "City" Model in Phase 1, it must be replaced by "geoci:City_0" (or by its country-specific version ex. "geocifr:Commune_0" ; or keep a business specific version mybusiness:City_0 but let it inherit from said "geoci:City_0" or "geocifr:Commune_0", see below).

To achieve that, the main method is to replace your Models and fields with corresponding existing ones, or (for instance if your Models have more fields than the existing ones) to let your Models inherit from existing ones (by setting their Has Mixins column) and only keep the fields that don't exist yet. Additionally, just as your inheriting Model becomes an alias for the inherited existing one in this last case, you can keep your fields even if they already exist, on the condition that you store them in existing fields (by setting their aliasedStorageNames column).

Replacing Models and fields is done by replacing their lines in your Model CSV with the corresponding lines taken from the Master Model CSV (ask Bruno for it), and replacing references to them in other lines (ex. replacing Resource types in the Data Type column).

In practice, this is done by first creating a new version of the Master Model CSV (download it from the Import UI and add "_yourproject" at the end of its name), copying/pasting your lines at the end of it, and then replacing their references to existing Models and fields (ex. replacing Resource types in the Data Type column). You can also do it the other way around, by replacing Model and field lines in your Model CSV with the corresponding lines taken from said Master Model CSV and then replacing references to them in other lines, but you will still have to create said new version of the Master Model CSV at the end.

Here are some more techniques allowing to refine your Models:

  • rename and mutualize some fields in commonly inherited mixins
  • define or refine / change prefixes for each Field name (ex. geo_City:name or geoi:name) and Mixin (ex. geo:City or geoi:City)
  • do previous steps over and over, add features (see below), try to import more data...

Rules & best practices to be followed (Phase 2)

  • each mixin must have its own prefix, which must be concise, lower case, alphanumeric only (no dot, underscore...), and be an acronym in English that means something anybody can understand. The mixin suffix must be the best possible business-specific name, in the most appropriate language, and does not have to be unique. The mixin name ends with an underscore followed by the model major version. Ex. geoci:City_0
  • each field name is unique (save if it is inherited / overridden) and reuses its model's prefix. Ex. geoci:name.
  • each field must describe what is allowed and the rules that decide it (for now written in human language), and provide one or two examples in the Description column.
  • inherited fields (defined in an inherited mixin) can be overridden, but only (if semantically allowed and) in a more specific manner that is still consistent with the overridden field. Ex. a date field can't become an int, and incompatible rules written in Description are forbidden.
  • each mixin must have the odisp:Displayable mixin (or inherit from it) and define (in its Description) the rule that builds its odisp:name field, which allows displaying an informative title about it and is a kind of more human-readable URI.

Phase 3 - ask for dedicated project(s) and fill them

  • finally, in Phase 3, you properly publish your models and data, in a dedicated versioned project that clients access from the corresponding (unversioned) facade project in read/write and from oasis.main in read-only. The outcome is linked, complete, optimized, evolution-proof data that is available to consumers.

Phase 3 - Define indexes

Define indexes on fields that must be queryable, by setting a non-0 queryLimit column value, ex. 100. Then check that the index is used by queries on this field: write such a query, click on the '?' button in the Playground and check that the debug JSON output looks like { debug: { queryExplain: { cursor: BtreeCursor _YOUR_FIELD_NAME_1 ... }.
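If you have many fields to check, a small helper like this one can be pointed at the debug output copied from the Playground's '?' button. The nesting beyond debug.queryExplain.cursor and the strict-JSON form are assumptions based on the excerpt above (the real output may be Mongo-style relaxed JSON, in which case just eyeball it).

```python
# Minimal sketch: check copied '?' debug output for index usage.
# Assumption: {"debug": {"queryExplain": {"cursor": "BtreeCursor _<field>_1 ..."}}}
import json

def uses_index(debug_json: str, field_name: str) -> bool:
    cursor = json.loads(debug_json)["debug"]["queryExplain"]["cursor"]
    return cursor.startswith("BtreeCursor") and field_name in cursor

sample = '{"debug": {"queryExplain": {"cursor": "BtreeCursor _elec_Elector:street_1"}}}'
print(uses_index(sample, "elec_Elector:street"))  # True
```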

Phase 3 - Freeze, publish model & data

More details

CSV model definition reference:

See reference in model template: open it using OpenOffice or as CSV and find feature name (first line), documentation (second line) and default value if any (third line).

More about CSV model definition (incomplete):

  • ID (of Resources within their Model) : built out of fields (indexInId column) ; LATER generated, or custom in Javascript
  • linking : specify linked Resource type (Data type column), then if autolinking between Resources parsed within each row is not enough, specify linked Resource ID fields using dotted path (ex. elec_Elector_0:birthCity.elec_City_0:INSEECode) in additional Field name columns
  • Internal field name : if not provided will use Field name. Resource-typed fields don't need Internal field name because their value is filled by linking (auto, id-based...). As an exception, (Datacore or external Web) Resource URLs are accepted as value.
  • default value, ex. for an id field when there is none, to ease import
  • merge across lines : when the same Resource (same URI ID) is parsed in several different lines, its values are the first non-empty, non-default values, else default if any (see above), else empty (i.e. not sent to Datacore), save for "multiple" fields (see below).
  • multiple, for list values : when the same Resource (same URI ID) is parsed in several different lines, the value of a "multiple" field is the list of its values across all lines
  • Mixin : ex. geo:City_0 or geoi:City_0. If not prefixed, a default domain prefix is used instead. If not suffixed by major version number, '_0' is added instead.
  • Field name : ex. geo_City:name or geoi:name. If not prefixed, a model field prefix is generated using a default domain prefix and used instead.
  • Data type : string, date (ISO8601 format ex. 2014-11-20T16:47:56.761+01:00, uses moment.js), list, boolean (only the string "true" means true), int, float, long, double, or any model type (ex. geo:City_0 or geoi:City_0)
  • Is Mixin : i.e. it can't be instantiated on its own
  • Has Mixins : comma-separated values, values being mixin or model type (defined in Mixin column in another line)
  • ancestors : having the o:Ancestor_0 mixin (to be defined with list (multiple) field o:ancestors of type o:Ancestor_0) triggers this field to be filled by the list of ancestors i.e. linked Resources used to build its id (Resource-typed fields having a valid indexInId) and their own ancestors
  • Documentation column : the mixin's own. Description & Precision columns are merged in the field's dcmf:documentation
  • i18n : recognized data type for fields that need translations: start the first language/value couple without an index. The language can be set in the defaultValue column. Sample (see also the sketch after this list):
    | Field name | Data type |
    |---|---|
    | nace:label | i18n |
    | nace:label | map |
    | nace:label.l | String |
    | nace:label.v | String |
    | nace:label.0 | map |
    | nace:label.0.l | String |
    | nace:label.0.v | String |
  • jsFunctionToEval column: ex. hashCodeId (produced hashes are always the same ; or also hashids.encode), generateUuid (to be used with queryBeforeCreate)
  • queryLimit: set to more than 0 ex. 100 to define an index on this field
  • aliasedStorageNames: comma-separated list of existing inherited fields to store this field in
  • queryBeforeCreate: set to true to allow generateUuid as jsFunctionToEval
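As a concrete illustration of the i18n layout above, a translated field value is a list of translation objects whose 'l'/'v' subfields carry the language and the value, as in the nace:label example; the exact wire format shown here is an assumption:

```python
# Minimal sketch of an i18n field value: a list of {language, value} objects
# matching the .l / .v subfields of the nace:label sample above.
nace_label = [
    {"l": "it", "v": "Lavori generali di costruzione di edifici"},
    {"l": "en", "v": "General construction of buildings"},
]
```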

(TO BE ENRICHED) More...

  • error types : missingIdFields, cantFindAncestorAmongParsedResources, ancestorHasNotAncestorMixin, ancestorHasNotDefinedAncestors, modelWasThoughtToBeMixinAndMustBeReparsed, missingSubFields, missingReferencedMixins, unknownReferencedMixin
  • warning types : noColumnForFieldInternalName, noConfForId

(TO BE ENRICHED) Good practices :

  • A MODEL WITH NO DATA IS A BROKEN MODEL. Start from the data, not the model : get some real data first, then design a trivial model for it (put column names in "Internal field name" column, autolinking), then link it besides autolinking (id building, id-based linking), then refine it (rename and mutualize stuff in mixins, complex fields ex. list & i18n), finally integrate it with existing models & data.
  • A BROKEN MODEL IS USELESS. Always check that your import model still works : when you start working, when you stop working, when you make a change. It's very easy to do. "Works" doesn't mean without Resource errors, but without model errors and with at least some Resources being created.
  • Description : put useful info: sample values, what, how, explanation of configuration (ex. how indexInId and default are used to build ids), how it is planned to evolve...
  • if some existing models & their data are missing in the Datacore, that shouldn't stop you from designing your own version of them; it should even be trivial since you already have their data among your own. Later it won't be any harder to unify and synchronize both. So if the city & country model is not ready yet, just look at the OpenElec import model: it defines its own city & country models.

(TO BE ENRICHED) Upcoming:

  • id defined using fields of linked resources, and not only linked resource full ids (besides local fields)
  • ancestors defined independently from id (indexInId)
  • scripted ids : for now ids must be built by client apps, and import only supports building ids out of fields (using indexInId). LATER ids may become scripted in JavaScript, in (user-forked) import tool, and / or in server-side script engine executing model.idJsScript, and provide useful functions (hash(), random()...)