CSV source for AMSE #99

Closed
4 tasks
rhazn opened this issue Jan 17, 2023 · 4 comments
Labels
enhancement New feature or request
@rhazn
Contributor

rhazn commented Jan 17, 2023

User Story

  1. As an {AMSE teacher}
  2. I want {to have an interesting CSV source}
  3. So that {I can use it in AMSE}

User Acceptance Criteria

  • A non-trivial CSV file is chosen for AMSE, ideally with multiple data types, many columns, and a large number of records. The file should include some errors and data we can improve with Jayvee.
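
To compare candidates against this criterion, a quick profile of each file could help. The following is only a sketch using the Python standard library: `infer_type`, `profile_csv`, and the sample rows are hypothetical and not taken from any mobilithek offer, and the type guessing is deliberately crude.

```python
import csv
import io
from collections import Counter


def infer_type(value: str) -> str:
    """Classify a single cell as 'empty', 'integer', 'decimal', or 'text'."""
    text = value.strip()
    if text == "":
        return "empty"
    try:
        int(text)
        return "integer"
    except ValueError:
        pass
    try:
        float(text.replace(",", "."))  # tolerate German decimal commas
        return "decimal"
    except ValueError:
        return "text"


def profile_csv(stream, delimiter=","):
    """Return (row_count, {column: Counter of inferred cell types})."""
    reader = csv.DictReader(stream, delimiter=delimiter)
    profile = {name: Counter() for name in reader.fieldnames}
    row_count = 0
    for row in reader:
        row_count += 1
        for name in reader.fieldnames:
            # Short rows yield None for missing cells; treat them as empty.
            profile[name][infer_type(row[name] or "")] += 1
    return row_count, profile


# Hypothetical sample data, invented for illustration only.
sample = io.StringIO(
    "station;count;share\n"
    "Hauptbahnhof;12;0,5\n"
    "Südstadt;;0,25\n"
    "Mitte;n/a;1\n"
)
rows, profile = profile_csv(sample, delimiter=";")
```

A column whose counter mixes `integer`, `empty`, and `text` (like `count` here) would be a hint that the file contains the kind of errors we want to clean up.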

Notes

  • From MA thesis work we have some insights into mobilithek data; the top offers with CSV files are attached as a CSV file
  • Suggestions from the student:
    - "Radmonitore" (bicycle monitors): https://mobilithek.info/offers/-2042164558385113078
      - 11 bicycle monitors count the bicycles passing by in Rostock in 15-minute intervals
      - 123 MB
      - Caution: the overall time span does not always seem to be the same for all monitors (some 2013-2020, some only 2022, ...)
      - The locations are described in more detail in a small CSV file (1.7 KB)
      - Other, possibly interesting, files are included as well
    - "Pünktlichkeit" (punctuality): https://mobilithek.info/offers/-3073323302501876779
      - Punctuality of train lines in percent for regional rail passenger transport in the Schleswig-Holstein area, broken down by year and month
      - 178 KB
      - From Jan 2010 to Jan 2022
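
The caveat about the bicycle monitors covering different time spans could be verified with a small script before committing to that dataset. This is a minimal sketch: the column names `monitor` and `timestamp`, the timestamp format, and the sample rows are assumptions, since the actual file layout is not described here.

```python
import csv
import io
from datetime import datetime


def time_span_per_monitor(stream, id_column, time_column, time_format):
    """Map each monitor id to its (earliest, latest) timestamp."""
    spans = {}
    for row in csv.DictReader(stream):
        stamp = datetime.strptime(row[time_column], time_format)
        first, last = spans.get(row[id_column], (stamp, stamp))
        spans[row[id_column]] = (min(first, stamp), max(last, stamp))
    return spans


# Hypothetical rows; real column names and timestamp format are assumptions.
sample = io.StringIO(
    "monitor,timestamp,count\n"
    "A,2013-01-01 00:00,4\n"
    "A,2020-12-31 23:45,7\n"
    "B,2022-03-01 00:15,2\n"
    "B,2022-03-01 00:30,5\n"
)
spans = time_span_per_monitor(sample, "monitor", "timestamp", "%Y-%m-%d %H:%M")
years = {monitor: (first.year, last.year) for monitor, (first, last) in spans.items()}
```

Monitors whose spans differ a lot would need either filtering or an explicit note in the exercise description.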

Definitions of Done

  • A PR has been opened and accepted
  • All user acceptance criteria are met
  • All tests are passing
@rhazn rhazn added the enhancement New feature or request label Jan 17, 2023
@rhazn rhazn added this to the AMSE SS23 milestone Jan 17, 2023
@felix-oq
Contributor

I also looked at various CSV datasets at mobilithek.info. I categorized my findings into different groups:

Suitable ones, no obvious errors in the data:

Suitable one, comparatively challenging to work with:

This dataset has uninteresting top / bottom lines and two labels or missing labels per column. It also contains uninteresting columns at the front and redundant columns at the end that sum up previous columns.
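
To make the cleanup effort concrete, trimming such a table could look roughly like this. It is a sketch only: `trim_table`, its parameters, and the sample table are hypothetical, and it does not deal with the doubled or missing column labels.

```python
import csv


def trim_table(lines, skip_top, skip_bottom, drop_left, drop_right, delimiter=";"):
    """Cut metadata lines and unwanted columns away from a CSV-like table."""
    body = lines[skip_top:len(lines) - skip_bottom]
    rows = csv.reader(body, delimiter=delimiter)
    return [row[drop_left:len(row) - drop_right] for row in rows]


# Hypothetical table: two metadata lines on top, two at the bottom,
# a code column in front and a redundant sum column at the end.
raw = """Table 123: hypothetical example
created: 2023-01-19
code;region;cars;bikes;total
01;North;10;5;15
02;South;20;8;28
__________
Source: hypothetical"""

table = trim_table(raw.splitlines(), skip_top=2, skip_bottom=2, drop_left=1, drop_right=1)
```

The number of lines and columns to cut has to be found by inspecting the file, which is part of what makes this dataset comparatively challenging.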

Probably too challenging:

Those datasets are similar to the challenging one but additionally have multi-indexed columns. Ideally, for achieving a database-compatible format, those column indices would need to be converted into their own separate columns.
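
As an illustration of what converting column indices into their own columns means, here is a minimal sketch for a two-level header. `unpivot` and the sample values are hypothetical; it assumes blank cells in the top header row belong to the label on their left.

```python
def unpivot(header_top, header_sub, rows, key_columns=1):
    """Turn a two-level column header into long-format records."""
    # Forward-fill blank top-level labels, since spanning headers leave gaps.
    filled, current = [], ""
    for label in header_top:
        current = label or current
        filled.append(current)
    records = []
    for row in rows:
        keys = tuple(row[:key_columns])
        for index in range(key_columns, len(row)):
            records.append(keys + (filled[index], header_sub[index], row[index]))
    return records


# Hypothetical table with years as the top level and vehicle types below.
header_top = ["", "2021", "", "2022", ""]
header_sub = ["region", "cars", "bikes", "cars", "bikes"]
rows = [["North", "10", "5", "12", "6"]]
records = unpivot(header_top, header_sub, rows)
```

Each resulting record carries the former column indices (year, vehicle type) as regular values, which maps directly onto a database table.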

Strange findings, not suitable:

@felix-oq
Contributor

My overall impression when looking through the CSV datasets at mobilithek.info was that almost all of them are formatted properly (i.e. first row as header, the rest being the actual data). The exceptions were the GENESIS tables from Statistisches Bundesamt, where top and bottom lines need to be trimmed and multi-indices may occur in both rows and columns.

@rhazn
Contributor Author

rhazn commented Jan 19, 2023

Really cool feedback, I looked at some and I think these are great:

Suitable ones, no obvious errors in the data:

Suitable one, comparatively challenging to work with:

In theory I like the DESTATIS one for being a bit more complicated, but it seems the data is available from their website in a better format, as I found out when I flamed them on Twitter: https://twitter.com/destatis/status/1551510399920623620.

So using that one is theoretically nice but will force us to talk about why we did not use it as intended. There is a story there about it being too complicated to find out how it is intended but I am not sure we want to tell that.

On the other hand, I like the easy one with the Haltestellen, it has a mobility slant and it is very basic. That will make modeling a pipeline easier but therefore probably also make the data quality harder to differentiate.

What I like about both is that they contain non-ASCII characters (the German umlauts); we'll need to find a good way to deal with that.
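
One possible way to deal with the umlauts on the reading side is to try a list of candidate encodings in order. This is a sketch under assumptions: `decode_csv_bytes` is hypothetical, the sample header is invented, and I have not checked which encoding the actual files use.

```python
def decode_csv_bytes(data, encodings=("utf-8-sig", "cp1252")):
    """Decode bytes with the first encoding that succeeds; return (text, encoding)."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Hypothetical header line containing umlauts and a sharp s.
header = "Haltestelle;Straße;Länge\n"
utf8_text, utf8_used = decode_csv_bytes(header.encode("utf-8"))
legacy_text, legacy_used = decode_csv_bytes(header.encode("cp1252"))
```

The order matters: a single-byte encoding such as cp1252 accepts almost any input, so it only makes sense as the last fallback.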

@rhazn
Contributor Author

rhazn commented Jan 19, 2023

Actually, looking at it again, the Haltestellen dataset also comes with two more (older) files with different headers, and it contains location data. I like both aspects, not for the experiment but for having AMSE students work with the data later on. We can provide a jv project that produces a combined SQLite db, giving us a cool, nontrivial example that students can build some nice projects with.
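
To sketch what combining files with different headers into one SQLite db involves, here is a rough Python illustration (not a Jayvee model). The table name `stops`, the column names, the header mappings, and the sample rows are all hypothetical; the real headers of the Haltestellen files would need to be looked up.

```python
import csv
import io
import sqlite3


def load_into_sqlite(connection, sources, target_columns):
    """Insert rows from several CSV sources into one shared table.

    `sources` is a list of (stream, mapping) pairs, where mapping translates
    that file's header names into the shared target column names.
    """
    columns = ", ".join(f'"{name}"' for name in target_columns)
    placeholders = ", ".join("?" for _ in target_columns)
    connection.execute(f"CREATE TABLE IF NOT EXISTS stops ({columns})")
    for stream, mapping in sources:
        for row in csv.DictReader(stream):
            renamed = {mapping[key]: value for key, value in row.items() if key in mapping}
            values = [renamed.get(name) for name in target_columns]
            connection.execute(f"INSERT INTO stops ({columns}) VALUES ({placeholders})", values)
    connection.commit()


# Hypothetical files: a newer and an older export with different headers.
newer = io.StringIO("name,latitude,longitude\nHauptbahnhof,49.45,11.08\n")
older = io.StringIO("NAME,LAT,LON\nPlärrer,49.44,11.06\n")

connection = sqlite3.connect(":memory:")
load_into_sqlite(
    connection,
    [
        (newer, {"name": "name", "latitude": "lat", "longitude": "lon"}),
        (older, {"NAME": "name", "LAT": "lat", "LON": "lon"}),
    ],
    ["name", "lat", "lon"],
)
stored = connection.execute("SELECT name, lat, lon FROM stops ORDER BY name").fetchall()
```

The per-file mapping is the interesting part for students: it forces a decision on which columns of the older exports correspond to which columns of the current one.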

2 participants