
Feature Request: Single CSV-File with current state #369

Closed
alexgit2k opened this issue Jan 6, 2021 · 28 comments
Labels
feedback (feedback and questions)

Comments

@alexgit2k

alexgit2k commented Jan 6, 2021

I'm calculating the current 7-day incidence rate per 100,000 using cases-rki-by-ags.csv and cases-rl-crowdsource-by-ags.csv. For that I have to download and parse two 0.5 MB files. It would be better to have an aggregated file with the current values (population data would also be great). Then only a file of a few KB would be needed to get the current state of any area.

For example:

Area;AGS;Population;TotalCasesRKI;TotalCasesCrowdsource;Cases7DaysRKI;Cases7DaysCrowdsource
Berlin;11000;3769000;99525;100771;5031;5389
...

Note: population data can be found here: https://www.destatis.de/DE/Themen/Laender-Regionen/Regionales/Gemeindeverzeichnis/Administrativ/04-kreise.html
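
A minimal sketch of what this currently looks like on the consumer side, assuming (as later comments in this thread suggest) that cases-rki-by-ags.csv has a time_iso8601 timestamp column, one row per report day, and one cumulative-count column per AGS; the AGS and population figure are taken from the example above, not from repository documentation:

    # Compute the current 7-day incidence per 100,000 for one AGS from
    # cases-rki-by-ags.csv (assumed layout: cumulative counts, one column per AGS).
    import pandas as pd

    AGS = "11000"            # Berlin
    POPULATION = 3_769_000   # from the destatis table linked above

    df = pd.read_csv("cases-rki-by-ags.csv", index_col="time_iso8601", parse_dates=True)
    cases_last_7_days = df[AGS].iloc[-1] - df[AGS].iloc[-8]
    print(cases_last_7_days / POPULATION * 100_000)  # current 7-day incidence

With an aggregated latest-state file, this download-and-diff step would collapse into reading a single small CSV.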

@jgehrcke
Owner

jgehrcke commented Jan 8, 2021

Having a "current" / "latest" aggregate data file is indeed an interesting proposal!

jgehrcke added the "feedback" label on Jan 8, 2021
@mathiasflick

If you are really willing to go down this path, I could provide you with at least the most important "building blocks" (i.e. a Jupyter notebook with the basic steps, and a CSV regarding 'Einwohner' (population) from the source mentioned in the initial request). Just let me know ...
The basic goal of the Jupyter notebook was (based on your cases-rki-by-ags.csv) to compute historical seven-day incidences up to the present in order to show graphical trends for all the AGS entities and, as a final step, produce an interactive (Bokeh) map showing the current incidences.
The data source regarding 'Einwohner' is a bit tricky - it is not really friendly with regard to automatic processing (or I was just too lazy to try ...), so I manually purged it of summary and descriptive rows. As it is pretty stable (my guess is that it is updated only once a year), this is only a one-shot task.
Greetings from Cologne
Mathias
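
As a rough illustration of the notebook step described above (not the actual notebook), a seven-day-incidence trend per AGS can be sketched like this, assuming the cases-rki-by-ags.csv layout from the previous sketch and the population figure from the example in the initial request:

    import pandas as pd

    cases = pd.read_csv("cases-rki-by-ags.csv", index_col="time_iso8601", parse_dates=True)
    ags = "11000"
    population = 3_769_000  # Berlin, illustrative figure from the example above

    new_cases_per_day = cases[ags].diff()  # daily increments from cumulative counts
    seven_day_incidence = new_cases_per_day.rolling(7).sum() / population * 100_000
    seven_day_incidence.plot(title=f"7-day incidence, AGS {ags}")  # historical trend (in a notebook)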

@jgehrcke
Owner

@mathiasflick thanks for the feedback and proposal! I added population data a couple of days ago with this patch: #383

As you can see, I tried fairly hard to keep this as transparent as possible with respect to the actual sources and the data processing flow. Because -- you're right -- this is a little tricky :-). Happy with the outcome, though!

@alexgit2k
Author

@jgehrcke great! So now the current state should be the easy part ;-)

@mathiasflick

Perfect! Thank you again for your valuable work! This will reduce the required data sources for my analysis by one ...
I double-checked the population figures and it is - of course - a perfect match!
Two remarks, though:
(1) ags.json contains an outdated entity ('LK Göttingen (alt)', AGS=3152) which has no 'population' element.
(2) For interested users: don't sum up the 'population' elements to get the total value for Germany, as ags.json contains the whole of Berlin (AGS=11000) PLUS the respective sub-entities (AGS 11001 ff.)!
Greetings from Cologne
Mathias

@jgehrcke
Owner

jgehrcke commented Jan 14, 2021

@mathiasflick thanks for having a deep look! Validation is always super valuable -- I was actually hoping for an additional pair of eyeballs 👀. You've built the checksum just like me -- great :) And you've found the two key aspects to keep in mind. Also see this part of the code:

        # Skip those entries that do not have the population key set (expected
        # for AGS 3152 "LK Göttingen (alt)". Minus the total for Berlin, because
        # there's a double count.
        checksum = (
            sum(
                v["population"]
                for k, v in AGS_PROPERTY_MAP.items()
                if "population" in v
            )
            - AGS_PROPERTY_MAP["11000"]["population"]
        )

About AGS 3152...

ags.json contains an outdated entity ('LK Göttingen (alt)', AGS=3152) which has no 'population' element.

It popped up on one dark day in April 2020 in the RKI database 🤷

See 1b36532 -- I had to add it for compatibility, and annotated it with the "(alt)". You wouldn't believe such messy things could happen in the RKI database, but I suppose these are the things that happen in a real-world data pipeline :).

@jgehrcke
Owner

jgehrcke commented Jan 14, 2021

@jgehrcke great! So now the current state should be the easy part ;-)

@alexgit2k yes, yes! I do hope I'll get to that soon. Step by step. Thanks again for being here and leaving feedback. This is a side/hobby project and it's really lovely to see feedback like in this thread, and contributions in general.

@alexgit2k
Author

Pull request #437: Add 7-day incidence data files, and reorganize code a bit

jgehrcke added a commit that referenced this issue Jan 18, 2021
Add tooling for generating more-data/latest-aggregate.csv (for #369), code cleanup, misc
@jgehrcke
Owner

With #442 we're basically there. Would appreciate another pair of eyes, and feedback!

I will also review things again and see whether the automatic updates look good over the next few days.

@jgehrcke
Owner

The first (manually triggered) automatic update looks good, I think.

Here's what's new (super high level summary):

Let me know what you think!

@mathiasflick

Had a short look at your code -> very nice and dandy! But I will stick to a kind of black-box testing. I'm coming from the old days and sometimes program in Python as if it were C or, even worse, Fortran :-) ...
A systematic check will follow soon, but here is a first finding:
Remark: I am only talking about (1) cases-rki-by-ags.csv and (2) more-data/7di-rki-by-ags.csv.
I compute the 7di (seven-day incidence) from (1) using the same population ('Einwohner') figures as you do.
My observation (at first glance) is that your 7di values are one day late (neglecting the very last day - yesterday).
Example: Berlin (11000): "my" value 137.83 (latest report); "my" value 167.76 (yesterday's report); "your" value (in (2) as of today) 166.51
I checked it for approx. 5 to 6 other AGS entities - it seems to be systematic.
Is this done intentionally?
Greetings from Cologne
Mathias

@jgehrcke
Owner

@mathiasflick a quick thank you for your review! This is great. I'll try to find time soon (today?) to look at the details of what you've written here, and to double-check things.

@jgehrcke
Owner

I am not actually sure what specifically the problem is that you seem to have identified, @mathiasflick :).

When the last data point in cases-rki-by-ags.csv is at 2021-01-24T17:00:00+0000, then the last data point in 7di-rki-by-ags.csv is at the same time. That's good and expected (that these timestamps match).

At https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Fallzahlen.html, for all of Germany, the RKI reports a 7di of 111 for "today" (as of 25.1.2021, 00:00, updated online at 08:30). That timestamp actually matches 2021-01-24T17:00:00+0000 (they mean the "same": the state at the end of the day of January 24).

The 7di value in 7di-rki-by-ags.csv is 115. On the RKI website it's 111. I think that counts as 'matching' values.

I had also double-checked against other sources (including for individual counties).

This is not to say that I think you're wrong, but I would actually love to understand better what you noticed! :)

@mathiasflick

Ok, I see - I wasn't specific enough on my first try - just a lack of time ...
Let me explain: we are computing the 7DI based on exactly the same data (ok, rounding in the different implementations will most likely differ).
Comparing your results and my results for a few subsequent days for one entity makes it clear - I just took Berlin (11000):

Timestamp    "Your 7DI"   "My 7DI"
2021-01-20   140.14       131.5
2021-01-21   131.15       122.2
2021-01-22   121.87       114.2
2021-01-23   114.20       113.3
2021-01-24   113.14       109.8   (for both of us this is the latest available data - just minutes old ...)

Guess what your result for Berlin will be tomorrow: approx. 110 (109.8 = "my" value for today)! You can see the one-day offset between "your" and "my" values. Without looking at your code, there seems to be a problem with the .rolling() window ...
Just a guess, though ...
Greetings from Cologne
Mathias
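
A sketch of the comparison described above, putting a self-computed Berlin series next to the published 11000_7di column; the file paths, the shared time_iso8601 index column, and the population figure (taken from a later comment in this thread) are assumptions:

    import pandas as pd

    POP_BERLIN = 3_669_491

    cases = pd.read_csv("cases-rki-by-ags.csv", index_col="time_iso8601", parse_dates=True)
    published = pd.read_csv("more-data/7di-rki-by-ags.csv", index_col="time_iso8601", parse_dates=True)

    own_7di = cases["11000"].diff().rolling(7).sum() / POP_BERLIN * 100_000
    comparison = pd.DataFrame({"own": own_7di.round(2), "published": published["11000_7di"]})
    print(comparison.tail(6))  # a systematic one-day offset shows up as shifted rows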

@mathiasflick

As I predicted last night:
Berlin 7DI for 2021-01-25
"Your" value: 110.96
"My" value: 101.7
RKI dashboard: 101.7

My conclusion: your computation somehow does not include the very last day (which is nevertheless in your database ...)!
Reminder: This is based on a comparison between (1) cases-rki-by-ags.csv and (2) more-data/7di-rki-by-ags.csv.
Greetings from Cologne
Mathias

@jgehrcke
Owner

@mathiasflick thank you so much for this quick and detailed response! I should have investigated immediately, but I have been moving to another city this week and am super occupied with that, at every time of the day. I'll try to get to the bottom of this asap, but probably not before next week (which is super sad).

@alexgit2k
Author

alexgit2k commented Feb 1, 2021

First, thanks @jgehrcke for your integration! It looks like @mathiasflick is right and you are not considering the last day. If I calculate 7 days back from the day before yesterday, I get (nearly) your values:

  • Berlin: 86.3 (23.-30.1., cases-rki-by-ags.csv) vs. 86 (rki_cases_7di, latest-aggregate.csv)
  • Munich: 61.6 (23.-30.1., cases-rki-by-ags.csv) vs. 61 (rki_cases_7di, latest-aggregate.csv)

Can you please use at least one decimal place for the 7-day incidence, so we have the same format as the RKI uses?

  • RKI (https://corona.rki.de/)
    • Berlin: 85.4
    • Munich: 60.6
  • cases-rki-by-ags.csv (Updated 9:30 CET), 24.-31.1.
    • Berlin: 119,943 - 116,809 = 3,134 / 3,669,491 * 100,000 = 85.4
    • Munich: 51,666 - 50,767 = 899 / 1,484,226 * 100,000 = 60.6
  • latest-aggregate.csv (Updated 16:19 CET)
    • Berlin: rki_cases_7di = 86
    • Munich: rki_cases_7di = 61
  • cases-rki-by-ags.csv (Updated 9:30 CET), 23.-30.1.
    • Berlin: 119,920 - 116,754 = 3,166 / 3,669,491 * 100,000 = 86.3
    • Munich: 51,621 - 50,707 = 914 / 1,484,226 * 100,000 = 61.6
  • 7di-rki-by-ags.csv (Updated 16:19 CET), 31.1. but actually 30.1.
    • Berlin: 86.24
    • Munich: 61.54

So it looks like for latest-aggregate.csv you are using the last line of 7di-rki-by-ags.csv with the decimal places stripped. But my calculations are slightly different:

  • Berlin: 86.3 (23.-30.1., cases-rki-by-ags.csv) vs. 86.24 (31.1. but actually 30.1., 7di-rki-by-ags.csv)
  • Munich: 61.6 (23.-30.1., cases-rki-by-ags.csv) vs. 61.54 (31.1. but actually 30.1., 7di-rki-by-ags.csv)
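
The manual check above, restated as plain arithmetic (the numbers are the ones quoted in this comment):

    # 7-day incidence = (cumulative cases now - cumulative cases 7 days earlier) / population * 100,000
    berlin = (119_943 - 116_809) / 3_669_491 * 100_000   # 24.-31.1. -> ~85.4
    munich = (51_666 - 50_767) / 1_484_226 * 100_000     # 24.-31.1. -> ~60.6
    print(round(berlin, 1), round(munich, 1))             # 85.4 60.6, matching the RKI dashboard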

@jgehrcke
Owner

jgehrcke commented Feb 2, 2021

Still overwhelmed with personal and work things for now, but I will get back to this as soon as I can. Thanks for the input, it is so much appreciated. I'd of course also be super happy for someone else to dig into the code -- the more we can collaborate on this, the better things will be after all :).

@mathiasflick

You finally convinced me to have a closer look at the code and - after painful code inspection and some practical experimentation - I was able to track down the problem to the last 7 lines of code in "lib/tsmath.py".
In order to protect you from an incompetent pull request (and frankly speaking, this is not the final solution, as it does not cover the Risklayer path), I reproduce the two lines of code here that did the job:

######## EMERGENCY EXIT FOR RKI ####################################

output_series = change_per_day.rolling(7).sum()

return output_series

######## END OF EMERGENCY EXIT #####################################

These lines are placed just before the '1H' resampling takes place. Executing "tools/7di.py" after this "fix" delivers EXACTLY my values for the 7DI on the correct days (including the very last one), PLUS those values are EXACTLY (except for minimal rounding) the values published in the RKI dashboard.

Greetings from Cologne
Mathias
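
For readers who have not opened lib/tsmath.py: a hypothetical sketch of the shape of such a helper (not the repository's actual code), showing where the early return sits relative to the resampling step:

    import pandas as pd

    def seven_day_sum(cumulative_daily: pd.Series) -> pd.Series:
        """Rolling 7-day sum of new cases, derived from a cumulative daily series."""
        change_per_day = cumulative_daily.diff()
        # "Emergency exit": returning here skips the subsequent '1H' resampling step.
        return change_per_day.rolling(7).sum()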

@jgehrcke
Owner

jgehrcke commented Feb 4, 2021

This is amazing, thank you for the debugging effort @mathiasflick -- it will save me quite a bit of time. I want to say sorry again for my silence here. It turns out that moving plus a full-time job leaves little room to breathe. Will report back.

jgehrcke added a commit that referenced this issue Feb 8, 2021
jgehrcke added a commit that referenced this issue Feb 8, 2021
@jgehrcke
Owner

jgehrcke commented Feb 8, 2021

My goal was to take the time and thoroughly understand why the resampling step changed behavior and numbers in the way it did! But I had to accept that these days I can't get the focus time.

Now that so much time has passed, I feel like I have abused your patience already -- I've now simply removed this additional processing step in #553. I would appreciate it if you keep an eye on the next few updates. I hope this is an improvement, after all.

Can you please use at least one decimal place for the 7-day incidence

You refer to latest-aggregate.csv, @alexgit2k?

Well. Maybe : ) Hope you agree that 1/10th of the incidence is far below any systematic and statistical error haha. I think I stripped things to integer for latest-aggregate.csv so that the diff in GitHub reads nicely!

@alexgit2k
Author

alexgit2k commented Feb 8, 2021

The 7-day-incidence calculations are now correct; however, the rounding is not:
44.47 (7di-rki-by-ags.csv, 9162_7di) is rounded to 44 (latest-aggregate.csv, rki_cases_7di)

  • RKI (https://corona.rki.de/)
    • Berlin: 71.4
    • Munich: 44.5
  • cases-rki-by-ags.csv (Updated 9:21 CET), 31.1.-7.2.
    • Berlin: 122,606 - 119,986 = 2,620 / 3,669,491 * 100,000 = 71.40 ✔️
    • Munich: 52,320 - 51,660 = 660 / 1,484,226 * 100,000 = 44.47 ✔️
  • latest-aggregate.csv (Updated 21:17 CET)
    • Berlin: rki_cases_7di = 71 ✔️
    • Munich: rki_cases_7di = 44
  • 7di-rki-by-ags.csv (Updated 21:17 CET)
    • Berlin: 11000_7di = 71.40 ✔️
    • Munich: 9162_7di = 44.47 ✔️

Can you please use at least one decimal place for the 7-day incidence

You refer to latest-aggregate.csv, @alexgit2k?

Well. Maybe : ) Hope you agree that 1/10th of the incidence is far below any systematic and statistical error haha. I think I stripped things to integer for latest-aggregate.csv so that the diff in GitHub reads nicely!

Yes, I'm talking about latest-aggregate.csv; 7di-rl-by-ags.csv and 7di-rki-by-ags.csv already contain 2 decimal places.

I know that it is only 1/10th, but as said, the RKI also uses one decimal place, so this would make it easier to compare.
I also see your GitHub-diff argument. However, I think that maybe not many people will use that diff.
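
One way to emit a fixed number of decimal places when writing the aggregate CSV with pandas (a sketch, not necessarily how the repository does it; the column names mirror the ones discussed above):

    import pandas as pd

    df = pd.DataFrame({"ags": ["11000", "9162"], "rki_cases_7di": [71.4012, 44.4676]})
    df.to_csv("latest-aggregate.csv", index=False, float_format="%.2f")
    # rki_cases_7di is then written as 71.40 and 44.47 instead of truncated integers.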

@mathiasflick

Just checked the outcome of the updated computation of 7di-rki-by-ags.csv. Perfect match - including the rounding to two decimal places ;-)!
Well done!
Regarding the resampling and windowing: the pandas documentation in this particular area is far from precise or comprehensive - in some respects it is even ambiguous or wrong. A good deal of trial and error is yet to come ...
Thank you and greetings from Cologne
Mathias

jgehrcke added a commit that referenced this issue Feb 9, 2021
jgehrcke added a commit that referenced this issue Feb 9, 2021
aggregate: 7di with 2 digits (see #369)
@jgehrcke
Owner

jgehrcke commented Feb 9, 2021

Thank you again @mathiasflick for your careful review, and for taking the time to leave this kind of feedback. It feels like collaboration. In the future, please don't hesitate to share ideas etc.

Regarding the resampling and windowing: the pandas documentation in this particular area is far from precise or comprehensive - in some respects it is even ambiguous or wrong.

Yes. I love pandas for what it does, for how robust it already is, and for how well documented it is. But for sure there are weaknesses in the docs, and oftentimes you really need to play with data examples a lot before grasping the subtleties of the behavior.

An interesting line of thinking is whether the goal is to calculate the exact same numbers that everybody else has or whether it makes sense to ... well. :)

@jgehrcke
Owner

jgehrcke commented Feb 9, 2021

@alexgit2k also a big thanks to you, for doing another careful review with a keen eye for detail!

In this context, I think it would even be fair to bluntly strip (rather than properly round) a 44.7 down to 44. But, conventionally, it is indeed correct to round a 44.47 down to 44. In any case, because you provide such lovely feedback, I chose not to care too much about the diff view and to satisfy your "easier to compare" argument. I have just landed #559. Now the latest-aggregate.csv file contains the 7di values with two decimal places.

Interestingly, the diff view is still perfectly fine, showing the specific per-row changes in darker green:

[Screenshot from 2021-02-09: GitHub diff view of latest-aggregate.csv with per-row changes highlighted]

@jgehrcke
Owner

jgehrcke commented Feb 9, 2021

So, please @mathiasflick and @alexgit2k, let's keep the feedback coming. Contributions are very welcome!

I am closing this issue now; let's carry any outstanding items over to new, more focused issues, if possible.

@alexgit2k
Author

I can confirm that the numbers are correct now. Thank you very much!

I have integrated your data source into a small Windows application: https://github.com/alexgit2k/corona-info

@jgehrcke
Owner

Thanks for the feedback @alexgit2k, and it's great to see downstream work!
