mapping XML dataset to FDP datasets #67
Comments
@HimmelStein great question. It would be great if you could:
The files in the dataset are as follows: c!SEC1_E!en!0.xml. Each 'nmc-item alias="", id=""' starts a small table (at most 3 rows) with bud-remark and bud-legal. The content of c!SEC10_E!en!0.xml is as follows: http://wenxion.net/cyc/c!SEC10_E!en!0.xml
@HimmelStein that's great. Could you try prepping some sample CSV illustrating what tables you think would get generated out of this, and then either paste the raw CSV sample or, better, post it to a gist or somewhere online and drop in the DataPipes link: http://datapipes.okfnlabs.org/ (see http://okfnlabs.org/blog/2013/12/05/view-csv-with-data-pipes.html)
For what it's worth, there's a parser for this that I started hacking up a while back here: https://github.com/civicdataeu/eu-budget-scraper - unfortunately, the XML is very complex and I found it impossible to make a good CSV equivalent without answering some additional questions about the structure of the EU budget process. It's also made harder by the fact that this XML looks to be made as a pre-print processing stage, not an accounting document. It would be really cool to get this right, though!
Here is the web link
@HimmelStein thank you - btw here's the datapipes version (I've also used webshot to get a screen grab of that and inlined below): http://datapipes.okfnlabs.org/csv/head%C2%A0400/html?url=http://wenxion.net/cyc/c!SEC10_E!en!0.csv Questions:
@HimmelStein any thoughts on the questions here?
This is due to the structure of the XML file. Think of an XML file as a book with chapters and sections. The financial data appears only in the sections. If we map each chapter and each section to a CSV line, the chapter lines have no financial data and therefore appear empty.
They are attributes in the XML file.
Yes, we can work on it and get it right, ideally with support from domain experts.
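The chapter/section mapping described in this comment can be sketched roughly as follows. This is only an illustration under assumptions: the element names (budget, chapter, section) and the attribute names (id, title, amount) are hypothetical stand-ins, not the names used in the real EU budget XML, which is far more deeply nested.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, simplified XML. The real EU budget files use different
# element and attribute names.
SAMPLE = """
<budget>
  <chapter id="1" title="Administration">
    <section id="1.1" title="Staff" amount="100"/>
    <section id="1.2" title="Buildings" amount="40"/>
  </chapter>
</budget>
"""

FIELDS = ["level", "id", "title", "amount"]


def flatten(xml_text):
    """Yield one row per chapter and one per section, in document order."""
    root = ET.fromstring(xml_text)
    for chapter in root.iter("chapter"):
        # Chapters carry no financial data, so the amount column stays empty.
        yield {"level": "chapter", "id": chapter.get("id"),
               "title": chapter.get("title"), "amount": ""}
        for section in chapter.iter("section"):
            # Values come from XML attributes, as noted in the comment above.
            yield {"level": "section", "id": section.get("id"),
                   "title": section.get("title"),
                   "amount": section.get("amount")}


def to_csv(rows):
    """Serialise the flattened rows to CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = list(flatten(SAMPLE))
csv_text = to_csv(rows)
print(csv_text)
```

The chapter row ends with an empty amount cell, which is the kind of "empty" line seen in the sample CSV. Dropping such rows, or moving the chapter title into a column on each section row, would be alternative designs.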
Yes, that looks great. Could I suggest you start a repo at https://github.com/os-data so we can start putting some of this data in there and creating a Fiscal Data Package? I'm cc'ing @danfowler who can get you set up.
Sure, my pleasure.
@rgrp I've created an empty repo here: https://github.com/os-data/eu-budget-2014
@danfowler @HimmelStein has access to the os-data org in the
Pushed there, without a datapackage.json file.
Hi @HimmelStein, thanks for pushing! This looks great so far. Looking through the data, I have the following comments:
Please let me know if that helps, and how I can help you move this forward. First three lines of
Thanks very much for your comments!
Hi, here is a CSV file transformed from http://www.apa.sk/en/index.php?offset=0&language=en&navID=50&euod=&eudo=&order=& and pushed to https://github.com/os-data/Beneficiaries-From-EAGF-and-the-EAFRD-in-Slovakia-2014 This dataset should be simpler than the EU2014 dataset above. Could you provide its datapackage.json file? http://datapipes.okfnlabs.org/csv/html/?url=https%3A%2F%2Fraw.githubusercontent.com%2Fos-data%2FBeneficiaries-From-EAGF-and-the-EAFRD-in-Slovakia-2014%2Fmaster%2Fdata%2FSLOV2014EA_HTML.csv
Hi @HimmelStein, I see lots of duplicate rows for the CSV file pushed to
Hi @HimmelStein,
Great!
@danfowler @HimmelStein can we close this?
@rgrp @HimmelStein I think we can. Any further questions specifically about the dataset should go into the issue tracker for it: https://github.com/os-data/eu-budget-2014/issues
An XML file may contain several tables, e.g. http://open-data.europa.eu/data/dataset/budget-of-the-european-union-2014
Should each table in the XML be mapped to its own FDP dataset (with .csv + datapackage.json),
or should all of these tables be mapped to one big sparse FDP dataset?
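To make the two options concrete, here is a minimal sketch of what "one big sparse dataset" would mean compared with keeping the tables separate. The table names, column names, and values are invented for illustration and do not come from the EU budget files. The sketch does not say which layout the Fiscal Data Package specification prefers.

```python
# Hypothetical tables extracted from a single XML file.
# Keeping this dict as-is corresponds to option 1: one CSV per table.
tables = {
    "expenditure": [{"item": "A", "commitments": "10", "payments": "8"}],
    "revenue": [{"item": "B", "amount": "5"}],
}


def merge_sparse(tables):
    """Option 2: merge all tables into one table over the union of columns.

    Cells a table does not define are left as empty strings, which is
    what makes the merged table sparse.
    """
    columns = ["table"]
    for rows in tables.values():
        for row in rows:
            for key in row:
                if key not in columns:
                    columns.append(key)

    merged = []
    for name, rows in tables.items():
        for row in rows:
            full = {col: "" for col in columns}
            full["table"] = name  # remember which source table the row came from
            full.update(row)
            merged.append(full)
    return columns, merged


columns, merged = merge_sparse(tables)
for row in merged:
    print([row[col] for col in columns])
```

The more the tables differ in their columns, the more empty cells the merged layout accumulates, which is the trade-off the question is asking about.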