
Feature Request: Single CSV-File with current state #369

Closed
alexgit2k opened this issue Jan 6, 2021 · 28 comments
Labels
feedback (feedback and questions)

Comments

@alexgit2k

alexgit2k commented Jan 6, 2021

I'm calculating the current 7-day incidence rate per 100,000 using cases-rki-by-ags.csv and cases-rl-crowdsource-by-ags.csv. For that I have to download and parse two 0.5 MB files. It would be better to have an aggregated file with the current values (population data would also be great). Then only a file of a few KB would be needed to get the current state of any area.

For example:

Area;AGS;Population;TotalCasesRKI;TotalCasesCrowdsource;Cases7DaysRKI;Cases7DaysCrowdsource
Berlin;11000;3769000;99525;100771;5031;5389
...

Note: population data can be found here: https://www.destatis.de/DE/Themen/Laender-Regionen/Regionales/Gemeindeverzeichnis/Administrativ/04-kreise.html
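
A minimal sketch of what this currently looks like on the consumer side, assuming (as later comments in this thread suggest) that cases-rki-by-ags.csv has a time_iso8601 timestamp column, one row per report day, and one cumulative-count column per AGS; the AGS and population figure are taken from the example above, not from repository documentation:

    # Compute the current 7-day incidence per 100,000 for one AGS from
    # cases-rki-by-ags.csv (assumed layout: cumulative counts, one column per AGS).
    import pandas as pd

    AGS = "11000"            # Berlin
    POPULATION = 3_769_000   # from the destatis table linked above

    df = pd.read_csv("cases-rki-by-ags.csv", index_col="time_iso8601", parse_dates=True)
    cases_last_7_days = df[AGS].iloc[-1] - df[AGS].iloc[-8]
    print(cases_last_7_days / POPULATION * 100_000)  # current 7-day incidence

With an aggregated latest-state file, this download-and-diff step would collapse into reading a single small CSV.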

@jgehrcke
Owner

jgehrcke commented Jan 8, 2021

Having a "current" / "latest" aggregate data file is indeed an interesting proposal!

jgehrcke added the "feedback" label on Jan 8, 2021
@mathiasflick

If you are really willing to go down this path, I could provide you with at least the most important "building blocks" (i.e. a Jupyter notebook with the basic steps, and a CSV regarding 'Einwohner' (population) from the source mentioned in the initial request). Just let me know ...
The basic goal of the Jupyter notebook was (based on your cases-rki-by-ags.csv) to compute historical seven-day incidences up to the present in order to show graphical trends for all the AGS entities and, as a final step, produce an interactive (Bokeh) map showing the current incidences.
The data source regarding 'Einwohner' is a bit tricky - it is not really friendly with regard to automatic processing (or I was just too lazy to try ...), so I manually purged it of summary and descriptive rows. As it is pretty stable (my guess is that it is updated only once a year), this is only a one-shot task.
Greetings from Cologne
Mathias
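
As a rough illustration of the notebook step described above (not the actual notebook), a seven-day-incidence trend per AGS can be sketched like this, assuming the cases-rki-by-ags.csv layout from the previous sketch and the population figure from the example in the initial request:

    import pandas as pd

    cases = pd.read_csv("cases-rki-by-ags.csv", index_col="time_iso8601", parse_dates=True)
    ags = "11000"
    population = 3_769_000  # Berlin, illustrative figure from the example above

    new_cases_per_day = cases[ags].diff()  # daily increments from cumulative counts
    seven_day_incidence = new_cases_per_day.rolling(7).sum() / population * 100_000
    seven_day_incidence.plot(title=f"7-day incidence, AGS {ags}")  # historical trend (in a notebook)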

@jgehrcke
Owner

@mathiasflick thanks for the feedback and proposal! I added population data a couple of days ago with this patch: #383

As you can see, I tried fairly hard to keep this as transparent as possible with respect to the actual sources and the data processing flow. Because -- you're right -- this is a little tricky :-). Happy with the outcome, though!

@alexgit2k
Author

@jgehrcke great! So now the current state should be the easy part ;-)

@mathiasflick

Perfect! Thank you again for your valuable work! This will reduce the required data sources for my analysis by one ...
I double-checked the population figures and it is - of course - a perfect match!
Two remarks, though:
(1) ags.json contains an outdated entity ('LK Göttingen (alt)', AGS=3152) which has no 'population' element.
(2) For interested users: don't sum up the 'population' elements to get the total value for Germany, as ags.json contains the whole of Berlin (AGS=11000) PLUS the respective sub-entities (AGS 11001 ff.)!
Greetings from Cologne
Mathias

@jgehrcke
Owner

jgehrcke commented Jan 14, 2021

@mathiasflick thanks for having a deep look! Validation is always super valuable -- I was actually hoping for an additional pair of eyeballs 👀. You've built the checksum just like me -- great :) And you've found the two key aspects to keep in mind. Also see this part of the code:

        # Skip those entries that do not have the population key set (expected
        # for AGS 3152 "LK Göttingen (alt)". Minus the total for Berlin, because
        # there's a double count.
        checksum = (
            sum(
                v["population"]
                for k, v in AGS_PROPERTY_MAP.items()
                if "population" in v
            )
            - AGS_PROPERTY_MAP["11000"]["population"]
        )

About AGS 3152...

ags.json contains an outdated entity ('LK Göttingen (alt)', AGS=3152) which has no 'population' element.

It popped up on one dark day in April 2020 in the RKI database 🤷

See 1b36532 -- I had to add it for compatibility, and annotated it with the "(alt)". You wouldn't believe such messy things could happen in the RKI database, but I suppose these are the things that happen in a real-world data pipeline :).

@jgehrcke
Owner

jgehrcke commented Jan 14, 2021

@jgehrcke great! So now the current state should be the easy part ;-)

@alexgit2k yes, yes! I do hope I'll get to that soon. Step by step. Thanks again for being here and leaving feedback. This is a side/hobby project and it's really lovely to see feedback like in this thread, and contributions in general.

@alexgit2k
Author

Pull request #437: Add 7-day incidence data files, and reorganize code a bit

jgehrcke added a commit that referenced this issue Jan 18, 2021
Add tooling for generating more-data/latest-aggregate.csv (for #369), code cleanup, misc
@jgehrcke
Owner

With #442 we're basically there. Would appreciate another pair of eyes, and feedback!

I will also review things again and see whether the automatic updates look good over the next few days.

@jgehrcke
Owner

The first (manually triggered) automatic update looks good, I think.

Here's what's new (super high level summary):

Let me know what you think!

@mathiasflick

Had a short look at your code -> very nice and dandy! But I will stick to a kind of black-box testing. I'm coming from the old days and sometimes program in Python as if it were C or, even worse, Fortran :-) ...
A systematic check will follow soon, but here is a first finding:
Remark: I am only talking about (1) cases-rki-by-ags.csv and (2) more-data/7di-rki-by-ags.csv.
I compute the 7di (seven-day incidence) from (1) using the same population ('Einwohner') figures as you do.
My observation (at first glance) is that your 7di values are one day late (neglecting the very last day - yesterday).
Example: Berlin (11000): "my" value 137.83 (latest report); "my" value 167.76 (yesterday's report); "your" value (in (2) as of today) 166.51
I checked it for approx. 5 to 6 other AGS entities - it seems to be systematic.
Is this done intentionally?
Greetings from Cologne
Mathias

@jgehrcke
Owner

@mathiasflick a quick thank you for your review! This is great. I'll try to find time soon (today?) to look at the details of what you've written here, and to double-check things.

@jgehrcke
Owner

I am not actually sure what specifically the problem is that you seem to have identified, @mathiasflick :).

When the last data point in cases-rki-by-ags.csv is at 2021-01-24T17:00:00+0000, then the last data point in 7di-rki-by-ags.csv is at the same time. That's good and expected (that these timestamps match).

At https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Fallzahlen.html, for all of Germany, the RKI reports a 7di of 111 for "today" (as of 25.1.2021, 00:00, updated online at 08:30). That timestamp actually matches 2021-01-24T17:00:00+0000 (they mean the "same": the state at the end of the day of January 24).

The 7di value in 7di-rki-by-ags.csv is 115. On the RKI website it's 111. I think that counts as 'matching' values.

I had also double-checked against other sources (including for individual counties).

This is not to say that I think you're wrong, but I would actually love to understand better what you noticed! :)

@mathiasflick

Ok, I see - I wasn't specific enough on my first try - just a lack of time ...
Let me explain: we are computing the 7DI based on exactly the same data (ok, rounding in the different implementations will most likely differ).
Comparing your results and my results for a few subsequent days for one entity makes it clear - I just took Berlin (11000):

Timestamp    "Your 7DI"   "My 7DI"
2021-01-20   140.14       131.5
2021-01-21   131.15       122.2
2021-01-22   121.87       114.2
2021-01-23   114.20       113.3
2021-01-24   113.14       109.8   (for both of us this is the latest available data - just minutes old ...)

Guess what your result for Berlin will be tomorrow: approx. 110 (109.8 = "my" value for today)! You can see the one-day offset between "your" and "my" values. Without looking at your code, there seems to be a problem with the .rolling() window ...
Just a guess, though ...
Greetings from Cologne
Mathias
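
A sketch of the comparison described above, putting a self-computed Berlin series next to the published 11000_7di column; the file paths, the shared time_iso8601 index column, and the population figure (taken from a later comment in this thread) are assumptions:

    import pandas as pd

    POP_BERLIN = 3_669_491

    cases = pd.read_csv("cases-rki-by-ags.csv", index_col="time_iso8601", parse_dates=True)
    published = pd.read_csv("more-data/7di-rki-by-ags.csv", index_col="time_iso8601", parse_dates=True)

    own_7di = cases["11000"].diff().rolling(7).sum() / POP_BERLIN * 100_000
    comparison = pd.DataFrame({"own": own_7di.round(2), "published": published["11000_7di"]})
    print(comparison.tail(6))  # a systematic one-day offset shows up as shifted rows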

@mathiasflick

As I predicted last night:
Berlin 7DI for 2021-01-25
"Your" value: 110.96
"My" value: 101.7
RKI dashboard: 101.7

My conclusion: your computation somehow does not include the very last day (which is nevertheless in your database ...)!
Reminder: This is based on a comparison between (1) cases-rki-by-ags.csv and (2) more-data/7di-rki-by-ags.csv.
Greetings from Cologne
Mathias

@jgehrcke
Owner

@mathiasflick thank you so much for this quick and detailed response! I should have investigated immediately, but I have been moving to another city this week and am super occupied with that, at every time of the day. I'll try to get to the bottom of this asap, but probably not before next week (which is super sad).

@alexgit2k
Author

alexgit2k commented Feb 1, 2021

First, thanks @jgehrcke for your integration! It looks like @mathiasflick is right and you are not considering the last day. If I calculate 7 days back from the day before yesterday, I get (nearly) your values:

  • Berlin: 86.3 (23.-30.1., cases-rki-by-ags.csv) vs. 86 (rki_cases_7di, latest-aggregate.csv)
  • Munich: 61.6 (23.-30.1., cases-rki-by-ags.csv) vs. 61 (rki_cases_7di, latest-aggregate.csv)

Can you please use at least one decimal place for the 7-day incidence, so we have the same format as the RKI uses?

  • RKI (https://corona.rki.de/)
    • Berlin: 85.4
    • Munich: 60.6
  • cases-rki-by-ags.csv (Updated 9:30 CET), 24.-31.1.
    • Berlin: 119,943 - 116,809 = 3,134 / 3,669,491 * 100,000 = 85.4
    • Munich: 51,666 - 50,767 = 899 / 1,484,226 * 100,000 = 60.6
  • latest-aggregate.csv (Updated 16:19 CET)
    • Berlin: rki_cases_7di = 86
    • Munich: rki_cases_7di = 61
  • cases-rki-by-ags.csv (Updated 9:30 CET), 23.-30.1.
    • Berlin: 119,920 - 116,754 = 3,166 / 3,669,491 * 100,000 = 86.3
    • Munich: 51,621 - 50,707 = 914 / 1,484,226 * 100,000 = 61.6
  • 7di-rki-by-ags.csv (Updated 16:19 CET), 31.1. but actually 30.1.
    • Berlin: 86.24
    • Munich: 61.54

So it looks like for latest-aggregate.csv you are using the last line of 7di-rki-by-ags.csv with the decimal places stripped. But my calculations are slightly different:

  • Berlin: 86.3 (23.-30.1., cases-rki-by-ags.csv) vs. 86.24 (31.1. but actually 30.1., 7di-rki-by-ags.csv)
  • Munich: 61.6 (23.-30.1., cases-rki-by-ags.csv) vs. 61.54 (31.1. but actually 30.1., 7di-rki-by-ags.csv)
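
The manual check above, restated as plain arithmetic (the numbers are the ones quoted in this comment):

    # 7-day incidence = (cumulative cases now - cumulative cases 7 days earlier) / population * 100,000
    berlin = (119_943 - 116_809) / 3_669_491 * 100_000   # 24.-31.1. -> ~85.4
    munich = (51_666 - 50_767) / 1_484_226 * 100_000     # 24.-31.1. -> ~60.6
    print(round(berlin, 1), round(munich, 1))             # 85.4 60.6, matching the RKI dashboard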

@jgehrcke
Owner

jgehrcke commented Feb 2, 2021

Still overwhelmed with personal and work things for now, but I will get back to this as soon as I can. Thanks for the input, it is so much appreciated. I'd of course also be super happy for someone else to dig into the code -- the more we can collaborate on this, the better things will be after all :).

@mathiasflick

You finally convinced me to have a closer look at the code and - after painful code inspection and some practical experimentation - I was able to track down the problem to the last 7 lines of code in "lib/tsmath.py".
In order to protect you from an incompetent pull request (and frankly speaking, this is not the final solution, as it does not cover the Risklayer path), I reproduce the two lines of code here that did the job:

######## EMERGENCY EXIT FOR RKI ####################################

output_series = change_per_day.rolling(7).sum()

return output_series

######## END OF EMERGENCY EXIT #####################################

These lines are placed just before the '1H' resampling takes place. Executing "tools/7di.py" after this "fix" delivers EXACTLY my values for the 7DI on the correct days (including the very last one), PLUS those values are EXACTLY (except for minimal rounding) the values published in the RKI dashboard.

Greetings from Cologne
Mathias
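
For readers who have not opened lib/tsmath.py: a hypothetical sketch of the shape of such a helper (not the repository's actual code), showing where the early return sits relative to the resampling step:

    import pandas as pd

    def seven_day_sum(cumulative_daily: pd.Series) -> pd.Series:
        """Rolling 7-day sum of new cases, derived from a cumulative daily series."""
        change_per_day = cumulative_daily.diff()
        # "Emergency exit": returning here skips the subsequent '1H' resampling step.
        return change_per_day.rolling(7).sum()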

@jgehrcke
Owner

jgehrcke commented Feb 4, 2021

This is amazing, thank you for the debugging effort @mathiasflick -- it will save me quite a bit of time. I want to say sorry again for my silence here. It turns out that moving plus a full-time job leaves little room to breathe. Will report back.

jgehrcke added a commit that referenced this issue Feb 8, 2021
jgehrcke added a commit that referenced this issue Feb 8, 2021
@jgehrcke
Owner

jgehrcke commented Feb 8, 2021

My goal was to take the time and thoroughly understand why the resampling step changed behavior and numbers in the way it did! But I had to accept that these days I can't get the focus time.

Now that so much time has passed, I feel like I have abused your patience already -- I've now simply removed this additional processing step in #553. I would appreciate it if you keep an eye on the next few updates. I hope this is an improvement, after all.

Can you please use at least one decimal place for the 7-day incidence

You refer to latest-aggregate.csv, @alexgit2k?

Well. Maybe : ) Hope you agree that 1/10th of the incidence is far below any systematic and statistical error haha. I think I stripped things to integer for latest-aggregate.csv so that the diff in GitHub reads nicely!

@alexgit2k
Author

alexgit2k commented Feb 8, 2021

The 7-day-incidence calculations are now correct; however, the rounding is not:
44.47 (7di-rki-by-ags.csv, 9162_7di) is rounded to 44 (latest-aggregate.csv, rki_cases_7di)

  • RKI (https://corona.rki.de/)
    • Berlin: 71.4
    • Munich: 44.5
  • cases-rki-by-ags.csv (Updated 9:21 CET), 31.1.-7.2.
    • Berlin: 122,606 - 119,986 = 2,620 / 3,669,491 * 100,000 = 71.40 ✔️
    • Munich: 52,320 - 51,660 = 660 / 1,484,226 * 100,000 = 44.47 ✔️
  • latest-aggregate.csv (Updated 21:17 CET)
    • Berlin: rki_cases_7di = 71 ✔️
    • Munich: rki_cases_7di = 44
  • 7di-rki-by-ags.csv (Updated 21:17 CET)
    • Berlin: 11000_7di = 71.40 ✔️
    • Munich: 9162_7di = 44.47 ✔️

Can you please use at least one decimal place for the 7-day incidence

You refer to latest-aggregate.csv, @alexgit2k?

Well. Maybe : ) Hope you agree that 1/10th of the incidence is far below any systematic and statistical error haha. I think I stripped things to integer for latest-aggregate.csv so that the diff in GitHub reads nicely!

Yes, I'm talking about latest-aggregate.csv; 7di-rl-by-ags.csv and 7di-rki-by-ags.csv already contain 2 decimal places.

I know that it is only 1/10th, but as said, the RKI also uses one decimal place, so this would make it easier to compare.
I also see your GitHub-diff argument. However, I think that maybe not many people will use that diff.
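
One way to emit a fixed number of decimal places when writing the aggregate CSV with pandas (a sketch, not necessarily how the repository does it; the column names mirror the ones discussed above):

    import pandas as pd

    df = pd.DataFrame({"ags": ["11000", "9162"], "rki_cases_7di": [71.4012, 44.4676]})
    df.to_csv("latest-aggregate.csv", index=False, float_format="%.2f")
    # rki_cases_7di is then written as 71.40 and 44.47 instead of truncated integers.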

@mathiasflick

Just checked the outcome of the updated computation of 7di-rki-by-ags.csv. Perfect match - including the rounding to two decimal places ;-)!
Well done!
Regarding the resampling and windowing: the pandas documentation in this particular area is far from precise or comprehensive - in some respects it is even ambiguous or wrong. A good deal of trial and error is yet to come ...
Thank you and greetings from Cologne
Mathias

jgehrcke added a commit that referenced this issue Feb 9, 2021
jgehrcke added a commit that referenced this issue Feb 9, 2021
aggregate: 7di with 2 digits (see #369)
@jgehrcke
Owner

jgehrcke commented Feb 9, 2021

Thank you again @mathiasflick for your careful review, and for taking the time to leave this kind of feedback. It feels like collaboration. In the future, please don't hesitate to share ideas etc.

Regarding the resampling and windowing: the pandas documentation in this particular area is far from precise or comprehensive - in some respects it is even ambiguous or wrong.

Yes. I love pandas for what it does, for how robust it already is, and for how well documented it is. But for sure there are weaknesses in the docs, and oftentimes you really need to play with data examples a lot before grasping the subtleties of the behavior.

An interesting line of thinking is whether the goal is to calculate the exact same numbers that everybody else has or whether it makes sense to ... well. :)

@jgehrcke
Owner

jgehrcke commented Feb 9, 2021

@alexgit2k also a big thanks to you, for doing another careful review with a keen eye for detail!

In this context, I think it would even be fair to bluntly strip (rather than properly round) a 44.7 down to 44. But, conventionally, it is indeed correct to round a 44.47 down to 44. In any case, because you provide such lovely feedback, I chose not to care too much about the diff view and to satisfy your "easier to compare" argument. I have just landed #559. Now the latest-aggregate.csv file contains the 7di values with two decimal places.

Interestingly, the diff view is still perfectly fine, showing the specific per-row changes in darker green:

[Screenshot from 2021-02-09: GitHub diff view of latest-aggregate.csv with per-row changes highlighted]

@jgehrcke
Owner

jgehrcke commented Feb 9, 2021

So, please @mathiasflick and @alexgit2k, let's keep the feedback coming. Contributions are very welcome!

I am closing this issue now; let's carry any outstanding items over to new, more focused issues, if possible.

@alexgit2k
Author

I can confirm that the numbers are correct now. Thank you very much!

I have integrated your data source into a small Windows application: https://github.com/alexgit2k/corona-info

@jgehrcke
Owner

Thanks for the feedback @alexgit2k, and it's great to see downstream work!
