CSV source for AMSE #99
Comments
I also looked at various CSV datasets at mobilithek.info and sorted my findings into the following groups.
Suitable ones, no obvious errors in the data:
Suitable one, comparatively challenging to work with:
This dataset has uninteresting lines at the top and bottom, and its columns carry either two labels or none. It also contains uninteresting columns at the front and redundant columns at the end that merely sum up previous columns.
Probably too challenging:
Those datasets are similar to the challenging one but additionally have multi-indexed columns. To reach a database-compatible format, those column indices would ideally need to be converted into their own separate columns.
Strange findings, not suitable:
My overall impression when looking through the CSV datasets at mobilithek.info was that almost all of them are formatted properly (i.e. first row as header, the rest being the actual data). The exceptions were the GENESIS tables from the Statistisches Bundesamt, where top and bottom lines need to be trimmed and multi-indices may occur in both rows and columns.
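To make the trimming and multi-index problem concrete, here is a minimal Python sketch of one way to handle it. It is not based on any actual mobilithek.info or GENESIS file: the sample data, the semicolon delimiter, the number of lines to skip, and the function name `to_long_format` are all assumptions for illustration.

```python
import csv

# Hypothetical sample mimicking the layout described above: a title line,
# two header rows (multi-indexed columns), data rows, and two footer lines.
SAMPLE = """\
Example table title
;2020;2020;2021;2021
Region;Cars;Buses;Cars;Buses
Bayern;10;2;11;3
Köln;7;1;8;1
__________
Source: made-up illustration
"""


def to_long_format(text, skip_top=1, skip_bottom=2, id_columns=1, delimiter=";"):
    """Trim surrounding lines and turn two header rows into separate columns."""
    lines = text.splitlines()
    # Drop the uninteresting lines at the top and bottom.
    body = lines[skip_top:len(lines) - skip_bottom]
    rows = list(csv.reader(body, delimiter=delimiter))
    upper, lower, data = rows[0], rows[1], rows[2:]
    records = []
    for row in data:
        # The leading columns identify the row and are copied to every record.
        ids = dict(zip(lower[:id_columns], row[:id_columns]))
        for col in range(id_columns, len(row)):
            # Each value column becomes one record; both header levels
            # end up as ordinary columns ("group" and "measure").
            records.append(
                {**ids, "group": upper[col], "measure": lower[col], "value": row[col]}
            )
    return records


for record in to_long_format(SAMPLE)[:2]:
    print(record)
```

The result is one record per value with both header levels as plain columns, which maps directly onto a database table. In a real file the skip counts would have to be determined per dataset.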
Really cool feedback, I looked at some and I think these are great:
In theory I like the DESTATIS one for being a bit more complicated, but it seems the data is available from their website in a better format, as I found out when I flamed them on Twitter: https://twitter.com/destatis/status/1551510399920623620. So using that one is nice in theory but would force us to explain why we did not use it as intended. There is a story there about it being too complicated to find out how it is intended to be used, but I am not sure we want to tell that. On the other hand, I like the easy one with the Haltestellen (stops): it has a mobility slant and it is very basic. That will make modeling a pipeline easier, but therefore probably also make the data quality harder to differentiate. What I like about both is that they contain non-ASCII characters (the German umlauts); we'll need to find a good way to deal with that.
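One possible way to deal with the umlauts is to try a short list of encodings when reading the raw bytes. The sketch below is only an assumption about what might work: I have not checked which encodings the datasets actually use, and the function name `decode_csv_bytes` and the candidate list are made up for illustration.

```python
def decode_csv_bytes(raw, candidates=("utf-8-sig", "cp1252")):
    """Decode raw CSV bytes, returning the text and the encoding that worked."""
    for encoding in candidates:
        try:
            # "utf-8-sig" also strips a leading byte order mark if present.
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this never fails,
    # but the result may be wrong for other encodings.
    return raw.decode("latin-1"), "latin-1"


text, used = decode_csv_bytes("Haltestelle;Köln".encode("cp1252"))
print(used, text)
```

Trying UTF-8 first is deliberate: bytes written in a legacy single-byte encoding rarely form valid UTF-8, so a successful UTF-8 decode is a fairly strong signal, while the reverse is not true.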
Actually, looking at it again, the Haltestellen dataset also comes with two more (older) versions with different headers, and it contains location data. I like both aspects, not for the experiment but for later on, when AMSE students work with the data. We can provide a jv project that produces a combined SQLite db and have a cool, nontrivial example that the students can build some nice projects with.
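To illustrate what combining versions with different headers into one SQLite db could involve, here is a rough Python sketch (not a jv project). The header names, the alias mapping, the `stops` table, and the sample rows are all hypothetical; the real Haltestellen files would need their own mapping.

```python
import csv
import io
import sqlite3

# Hypothetical headers of an older and a newer file, mapped to one schema.
COLUMN_ALIASES = {
    "Haltestelle": "name", "name": "name",
    "Breite": "latitude", "lat": "latitude",
    "Laenge": "longitude", "lon": "longitude",
}

OLD_CSV = "Haltestelle;Breite;Laenge\nMünster Hbf;51.9566;7.6350\n"
NEW_CSV = "name;lat;lon\nKöln Hbf;50.9430;6.9589\n"


def load_into_sqlite(conn, csv_texts, delimiter=";"):
    """Insert rows from several differently labeled CSV texts into one table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stops "
        "(name TEXT, latitude REAL, longitude REAL, source TEXT)"
    )
    for source, text in csv_texts.items():
        for row in csv.DictReader(io.StringIO(text), delimiter=delimiter):
            # Rename known columns to the common schema, drop the rest.
            unified = {COLUMN_ALIASES[k]: v for k, v in row.items() if k in COLUMN_ALIASES}
            # float() assumes a decimal point; a decimal comma would need handling.
            conn.execute(
                "INSERT INTO stops VALUES (?, ?, ?, ?)",
                (unified["name"], float(unified["latitude"]),
                 float(unified["longitude"]), source),
            )
    conn.commit()


conn = sqlite3.connect(":memory:")
load_into_sqlite(conn, {"old": OLD_CSV, "new": NEW_CSV})
print(conn.execute("SELECT name, source FROM stops ORDER BY source").fetchall())
```

Keeping a `source` column makes it possible to tell afterwards which version a row came from, which could be useful if the versions overlap.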
User Story
User Acceptance Criteria
Notes
Definitions of Done