CSV source for AMSE #99

rhazn · 2023-01-17T14:01:02Z

User Story

As a {AMSE teacher}
I want {to have an interesting CSV source}
So that {I can use it in AMSE}

User Acceptance Criteria

A non-trivial CSV file is chosen for AMSE, ideally with multiple data types, many columns, and a large number of records. The file should include some errors and data that we can improve with Jayvee.

Notes

From MA thesis work we have some insights into Mobilithek data; the top offers with CSV files are attached as CSV.
Suggestions from the student:

- "Radmonitore": https://mobilithek.info/offers/-2042164558385113078
  - 11 bicycle monitors count the bicycles passing by in Rostock in 15-minute intervals
  - 123 MB
  - Caution: the overall time range does not seem to be the same for all monitors (some cover 2013-2020, some only 2022, ...)
  - A small CSV file (1.7 KB) describes the locations in more detail
  - Other, possibly interesting, files are also included
- "Pünktlichkeit": https://mobilithek.info/offers/-3073323302501876779
  - Punctuality of train lines in percent for regional rail passenger transport in the Schleswig-Holstein area, broken down by year and month
  - 178 KB
  - From Jan 2010 to Jan 2022

Definitions of Done

A PR has been opened and accepted
All user acceptance criteria are met
All tests are passing

felix-oq · 2023-01-19T09:01:38Z

I also looked at various CSV datasets at mobilithek.info. I categorized my findings into different groups:

Suitable ones, no obvious errors in the data:

Liste der Ladesäulen (aggregiert), Deutschland: E-Ladesäulen
- Locations and metadata of charging stations in Germany
Weltweit: Flughäfen
- A list of all airports worldwide
Haltestellen: IDs und Geodaten
- Stops of public transport in Nuremberg (hence with regional reference)
Haltestellendaten
- Train stops in Germany

Suitable one, comparatively challenging to work with:

Personenkraftwagen: Kreise, Stichtag, Kraftstoffarten, Emissionsgruppen
- Statistics on emissions of passenger vehicles

This dataset has irrelevant lines at the top and bottom, and its columns carry either two labels or none. It also contains irrelevant columns at the front and redundant columns at the end that sum up previous columns.
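A rough sketch of the cleanup this would need, in plain Python rather than Jayvee. The sample text, the `;` delimiter, and the line and column positions are made up for illustration and are not taken from the actual dataset.

```python
import csv
import io


def trim_table(text, skip_top, skip_bottom, drop_columns, delimiter=";"):
    """Drop preamble/footer lines and unwanted columns from a CSV export."""
    lines = text.splitlines()
    # Keep only the lines between the preamble and the footer.
    body = lines[skip_top:len(lines) - skip_bottom]
    reader = csv.reader(io.StringIO("\n".join(body)), delimiter=delimiter)
    # Remove columns by position, e.g. a leading id or a trailing total.
    return [
        [cell for i, cell in enumerate(row) if i not in drop_columns]
        for row in reader
    ]


# Hypothetical sample: one title line, two footer lines, id and total columns.
sample = (
    "Some table title\n"
    "id;district;petrol;diesel;total\n"
    "1;A;10;5;15\n"
    "2;B;7;3;10\n"
    "____\n"
    "(c) footer\n"
)
rows = trim_table(sample, skip_top=1, skip_bottom=2, drop_columns={0, 4})
print(rows)
```

The double or missing column labels would still need to be handled separately, for example by supplying the header names by hand.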

Probably too challenging:

Unfallbeteiligte, Hauptverursacher des Unfalls: Deutschland, Monate, Geschlecht, Altersgruppen, Art der Verkehrsbeteiligung, Unfallkategorie, Ortslage
- Statistics on people involved in accidents
Unternehmen, Beförderte Personen, Personenkilometer (Personenverkehr mit Bussen und Bahnen): Deutschland, Quartale, Verkehrsart
- Statistics on passenger transportation
Kraftfahrzeugbestand nach Kraftfahrzeugarten - Stichtag 01.01. - regionale Ebenen
- Statistics on motor vehicles

Those datasets are similar to the challenging one but additionally have multi-indexed columns. To reach a database-compatible format, those column indices would ideally be converted into their own separate columns.
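To make the conversion concrete, here is a minimal sketch in Python, assuming a table with exactly two header rows where repeated top-level labels are left blank. The sample values and the function name are hypothetical; the real tables may be laid out differently.

```python
def unpivot_two_level_header(rows, id_columns=1):
    """Turn a table with two header rows into long-format records."""
    top, sub, *data = rows
    # Exports often leave repeated top-level labels blank; fill them forward.
    filled, last = [], ""
    for label in top:
        last = label or last
        filled.append(last)
    records = []
    for row in data:
        ids = tuple(row[:id_columns])
        # Each value cell becomes one record with both header levels as columns.
        for col in range(id_columns, len(row)):
            records.append(ids + (filled[col], sub[col], row[col]))
    return records


# Hypothetical sample: years as the top level, transport mode as the sub level.
table = [
    ["", "2021", "", "2022", ""],
    ["region", "bus", "rail", "bus", "rail"],
    ["North", "10", "20", "11", "21"],
]
records = unpivot_two_level_header(table)
for record in records:
    print(record)
```

Each resulting record has the shape (region, year, mode, value), which maps directly onto a single database table.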

Strange findings, not suitable:

Verkehrszählung und multi-object-tracking Metadaten im nicht-motorisierten Individualverkehr
- Uses JSON syntax for values in columns
Kraftfahrzeugbestand nach Kraftfahrzeugarten - Stichtag 01.01. - regionale Tiefe: Kreise und krfr. Städte
- Similar to the datasets in the too-challenging group, but the indentation of values in a particular column encodes semantic meaning
  - I.e. no indentation -> country, single indentation -> federal state, double indentation -> region, triple indentation -> district
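For illustration, a small Python sketch of how the indentation could be expanded into explicit columns. It assumes two spaces per indentation level and at most four levels; the actual indent width in the file is not verified, and the sample labels are made up.

```python
LEVELS = ["country", "federal_state", "region", "district"]


def flatten_indented(labels, indent="  "):
    """Expand an indentation-encoded hierarchy into explicit columns."""
    path = [None] * len(LEVELS)
    result = []
    for label in labels:
        stripped = label.lstrip(" ")
        # Number of leading indent units determines the hierarchy level.
        depth = (len(label) - len(stripped)) // len(indent)
        path[depth] = stripped
        # Deeper levels from an earlier row no longer apply.
        for i in range(depth + 1, len(LEVELS)):
            path[i] = None
        result.append(dict(zip(LEVELS, path)))
    return result


# Hypothetical sample values for the indented column.
labels = [
    "Deutschland",
    "  Bayern",
    "    Mittelfranken",
    "      Erlangen",
    "  Berlin",
]
flat = flatten_indented(labels)
print(flat[3])
print(flat[4])
```

Every row then carries its full path, so the hierarchy survives once the leading whitespace is stripped.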

felix-oq · 2023-01-19T09:09:16Z

My overall impression when looking through the CSV datasets at mobilithek.info was that almost all of them are formatted properly (i.e. first row as header, the rest being the actual data). The exceptions were the GENESIS tables from Statistisches Bundesamt, where top and bottom lines need to be trimmed and multi-indices may occur in both rows and columns.

rhazn · 2023-01-19T09:26:59Z

Really cool feedback, I looked at some and I think these are great:

Suitable ones, no obvious errors in the data:

Haltestellendaten

Train stops in Germany

Suitable one, comparatively challenging to work with:

Personenkraftwagen: Kreise, Stichtag, Kraftstoffarten, Emissionsgruppen

Statistics on emissions of passenger vehicles

In theory I like the DESTATIS one for being a bit more complicated, but it seems the data is available from their website in a better format, as I found out when I flamed them on Twitter: https://twitter.com/destatis/status/1551510399920623620.

So using that one is theoretically nice but will force us to talk about why we did not use it as intended. There is a story there about it being too complicated to find out how it is intended but I am not sure we want to tell that.

On the other hand, I like the easy one with the Haltestellen, it has a mobility slant and it is very basic. That will make modeling a pipeline easier but therefore probably also make the data quality harder to differentiate.

What I like in both is that they contain non-ASCII characters (the German umlauts); we'll need to find a good way to deal with that.
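One possible way to handle this, sketched in Python: try a short list of candidate encodings and keep the first that decodes cleanly. The candidate list is an assumption; the actual encodings of the files have not been checked.

```python
def decode_csv_bytes(raw, encodings=("utf-8-sig", "cp1252")):
    """Decode raw file bytes with the first candidate encoding that fits."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            # Not valid in this encoding; try the next candidate.
            continue
    raise ValueError("none of the candidate encodings matched")


# The same umlaut is stored differently depending on the file's encoding.
text_a, used_a = decode_csv_bytes("Nürnberg".encode("utf-8"))
text_b, used_b = decode_csv_bytes("Nürnberg".encode("cp1252"))
print(text_a, used_a)
print(text_b, used_b)
```

UTF-8 goes first because single-byte encodings such as cp1252 accept almost any byte sequence and would silently turn UTF-8 umlauts into garbage.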

rhazn · 2023-01-19T09:30:40Z

Actually, looking at it again, the Haltestellen dataset also has two more (older) versions with different headers, and it contains location data. I like both, not for the experiment but for later on, when AMSE students work with the data. We can provide a Jayvee project that produces a combined SQLite DB and have a cool, nontrivial example, so the students can build some nice projects with it.
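As a rough picture of what such a combined database could look like, here is a Python sketch (not Jayvee) that maps differing headers onto one schema. All header names, the alias mapping, the table layout, and the sample rows are hypothetical; the real files would need their own mapping.

```python
import csv
import io
import sqlite3

# Hypothetical mapping from the differing file headers to unified column names.
ALIASES = {
    "Name": "name", "NAME": "name",
    "Breite": "lat", "Latitude": "lat",
    "Laenge": "lon", "Longitude": "lon",
}


def load_into_sqlite(conn, csv_text, aliases, delimiter=";"):
    """Insert the rows of one CSV file into the shared stops table."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    for row in reader:
        # Rename known columns and ignore everything else.
        unified = {aliases[k]: v for k, v in row.items() if k in aliases}
        conn.execute(
            "INSERT INTO stops (name, lat, lon) VALUES (:name, :lat, :lon)",
            unified,
        )


# Made-up sample files with different headers and column order.
old_file = "NAME;Breite;Laenge\nErlangen;49.59;11.00\n"
new_file = "Name;Longitude;Latitude;Extra\nFürth Hbf;10.99;49.47;x\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stops (name TEXT, lat REAL, lon REAL)")
for text in (old_file, new_file):
    load_into_sqlite(conn, text, ALIASES)
conn.commit()

stops = conn.execute("SELECT name, lat, lon FROM stops ORDER BY name").fetchall()
print(stops)
```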

rhazn added the enhancement New feature or request label Jan 17, 2023

rhazn added this to the AMSE SS23 milestone Jan 17, 2023

rhazn mentioned this issue Jan 17, 2023

Additional CSV features depending on chosen Dataset #100

Closed

4 tasks

felix-oq closed this as completed Mar 28, 2023
