Feature Request: Single CSV-File with current state #369
Comments
Having a "current" / "latest" aggregate data file is indeed an interesting proposal! |
If you are really willing to go down this path I could provide you at least the most important "building blocks" (i.e. Jupyter Notebook with the basic steps, csv regarding 'Einwohner' from the source mentioned in the initial request). Just let me know ... |
@mathiasflick thanks for the feedback and proposal! I added population data a couple of days ago with this patch: #383 As you can see I was trying more or less hard to keep this as transparent as possible, with respect to the actual sources and the data processing flow. Because -- you're right -- this is a little tricky :-). Happy with the outcome, though! |
@jgehrcke great! So now the current state should be the easy part ;-) |
Perfect! Thank you again for your valuable work! Will reduce the required data sources for my analysis by one ... |
@mathiasflick thanks for having a deep look! Validation is always super valuable -- I was actually hoping for an additional pair of eyeballs 👀. You've built the checksum, just like me -- great :) And you've found the two key aspects to keep in mind. Also see this part of the code:

```python
# Skip those entries that do not have the population key set (expected
# for AGS 3152 "LK Göttingen (alt)"). Minus the total for Berlin, because
# there's a double count.
checksum = (
    sum(
        v["population"]
        for k, v in AGS_PROPERTY_MAP.items()
        if "population" in v
    )
    - AGS_PROPERTY_MAP["11000"]["population"]
)
```

About AGS 3152... It popped up on one dark day in April 2020 in the RKI database 🤷 See 1b36532 -- I had to add it for compatibility, and annotated it with "(alt)". You wouldn't believe such messy things happen in the RKI database, but I suppose these are the things that happen in a real-world data pipeline :). |
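For readers who want to poke at that checksum logic, here is a tiny self-contained sketch of the same idea -- the `AGS_PROPERTY_MAP` entries below are made-up stand-ins, not the real mapping or population figures from the repository:

```python
# Toy stand-in for AGS_PROPERTY_MAP -- population numbers are invented
# for illustration only.
AGS_PROPERTY_MAP = {
    "11000": {"population": 3_600_000},  # Berlin, total (double-counted)
    "11001": {"population": 3_600_000},  # Berlin districts summed, stand-in
    "03152": {},                         # LK Göttingen (alt): no population key
    "09162": {"population": 1_500_000},  # another county, stand-in value
}

# Skip entries without the population key, then subtract Berlin's total
# because its districts are counted separately (the double count).
checksum = (
    sum(v["population"] for v in AGS_PROPERTY_MAP.values() if "population" in v)
    - AGS_PROPERTY_MAP["11000"]["population"]
)
print(checksum)  # 5100000
```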
@alexgit2k yes yes! I do hope I'll get to that soon. Step by step. Thanks again for being here and leaving feedback etc. This is a side/hobby project and it's really lovely to see feedback like in this thread here, and contributions in general. |
Pull-request #437: Add 7-day incidence data files, and reorganize code a bit |
Add tooling for generating more-data/latest-aggregate.csv (for #369), code cleanup, misc
With #442 we're basically there. Would appreciate another pair of eyes, and feedback! Will also review things again, and see if the automatic updates look good within the next few days. |
The first (manually triggered) automatic update looks good I think. Here's what's new (super high level summary):
Let me know what you think! |
Had a short look at your code -> very nice and dandy! But I will stick to a kind of black-box testing. I'm coming from the old days and sometimes program in Python as if it were C or, even worse, Fortran :-) ... |
@mathiasflick a quick thank you for your review! This is great. I'll try to find time soon (today?) to look at the details of what you've written down here, and to double-check things. |
I am not actually sure what specifically the problem is that you seem to have identified, @mathiasflick :). When the last data point in […]. At https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Fallzahlen.html, for all of Germany, the RKI reports a 7di of […]. The 7di value in […]. I had done double-checking with other sources, too (also for individual counties). This is not to say that I think you're wrong, but I would actually love to understand better what you noticed! :) |
Ok, I see -- I wasn't specific enough with my first try, just lack of time ... Timestamp  "Your 7DI"  "My 7DI" |
As I predicted last night -- my conclusion: your computation somehow does not include the very last day (which is nevertheless in your database ...)! |
@mathiasflick thank you so much for this quick and detailed response! I should have investigated immediately, but I have been moving to another city this week and am super occupied with that, at every time of the day. I'll try to get to the bottom of this asap, but probably not before next week (which is super sad). |
First, thanks @jgehrcke for your integration! It looks like @mathiasflick is right and you are not considering the last day. If I calculate 7 days back from the day before yesterday, I get (nearly) your values:
Can you please use at least one decimal place for the 7-day incidence, so that we have the same format as the RKI uses.
So it looks like for latest-aggregate.csv you are using the last line of 7di-rki-by-ags.csv with stripped decimal places. But my calculations are slightly different: |
Still overwhelmed with personal and work things for now, but I will get back to this as soon as I can. Thanks for the input, this is so much appreciated. I'd of course also be super happy for someone else to dig into code -- the more we can collaborate on this the better things will be after all :). |
You finally convinced me to have a closer look into the code and -- after painful code inspection and some practical experimentation -- I was able to track down the problem to the last 7 lines of code in "lib/tsmath.py".
These lines are placed just before the '1H' resampling takes place. Executing "tools/7di.py" after this "fix" delivers EXACTLY my values for the 7DI for the correct days (including the very last one), PLUS those values are EXACTLY (except for minimal rounding) the values published in the RKI dashboard. Greetings from Cologne |
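For intuition (this is a toy illustration, not the code that was actually removed): upsampling a daily series to hourly stops at the last existing timestamp, so the final day contributes only a single hourly sample, and any window-based aggregation over the hourly series gives that day almost no weight:

```python
import pandas as pd

# Three daily data points at midnight.
idx = pd.date_range("2021-03-01", periods=3, freq="D")
daily = pd.Series([10, 20, 30], index=idx)

# Upsample to hourly ("1h"; the '1H' spelling is deprecated in recent pandas).
hourly = daily.resample("1h").ffill()

# 24 samples for each full day, but only ONE for the last day (its midnight
# value) -- resampling does not extend past the last existing timestamp.
print(len(hourly))  # 49
```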
This is amazing, thank you for the debugging effort @mathiasflick -- will save me quite a bit of time. Want to say sorry again for my silence here. Turns out that moving and a full time job leave little room to breathe. Will report back. |
7di: remove hourly resample, see #369
My goal was to take the time and thoroughly understand why the resampling step changed behavior and numbers the way it did! But I had to accept that these days I can't get the focus time. Now that so much time has passed I felt like I had abused your patience already -- I've now simply removed this additional processing step in #553. Would appreciate it if you kept an eye on the following updates. Hope this is an improvement, after all.
You refer to […]? Well. Maybe :) Hope you agree that 1/10th of the incidence is far below any systematic and statistical error haha. I think I stripped things to integer for […] |
The 7-day-incidence calculations are now correct; however, the rounding is not:
Yes, I'm talking about […]. I know that it is only 1/10th, but as said, the RKI also uses one decimal place. So this would make it easier to compare. |
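Just to illustrate the difference with hypothetical values (plain Python, not the repo's code) -- truncating to integer versus rounding to the one decimal place the RKI uses:

```python
# Hypothetical 7di values.
values = [64.97, 65.04, 119.96]

truncated = [int(v) for v in values]     # stripping the decimals entirely
rounded = [round(v, 1) for v in values]  # RKI-style: one decimal place

print(truncated)  # [64, 65, 119]
print(rounded)    # [65.0, 65.0, 120.0]
```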
Just checked the outcome of the updated computation of 7di-rki-by-ags.csv. Perfect match - including the rounding to two decimal places ;-)! |
aggregate: 7di with 2 digits (see #369)
Thank you again @mathiasflick for your careful review, and for taking the time to leave this kind of feedback. Feels like collaboration. In the future, please don't hesitate to share ideas etc.
Yes. I love pandas for what it does, for how robust it already is, and for how well-documented it is. But for sure, there are weaknesses in the docs, and oftentimes you really need to play with data examples a lot before grasping the subtleties of its behavior. An interesting line of thinking is whether the goal is to calculate the exact same numbers that everybody else has, or whether it makes sense to ... well. :) |
@alexgit2k also big thanks to you, for doing another careful review with a keen eye for detail! In this context, I think it's fair to ignorantly strip (not properly round) even a […]. Interestingly, the diff view is still perfectly fine, showing the specific per-row changes in darker green: |
So, please @mathiasflick and @alexgit2k, let's keep the feedback coming. Contributions are very welcome! I am closing this issue now; let's carry any outstanding items to new, more focused issues, if possible. |
I can confirm that the numbers are correct now. Thank you very much! I have integrated your data source in a small Windows application: https://github.com/alexgit2k/corona-info |
Thanks for the feedback @alexgit2k and it's great to see downstream work! |
I'm calculating the current 7-day incidence rate per 100,000 using cases-rki-by-ags.csv and cases-rl-crowdsource-by-ags.csv. For that I have to download and parse two 0.5 MB files. It would be better to have an aggregated file with the current values (population data would also be great). Then only a file of a few KB would be needed for the current state of any area.
For example:
Note: population data can be found here: https://www.destatis.de/DE/Themen/Laender-Regionen/Regionales/Gemeindeverzeichnis/Administrativ/04-kreise.html
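The concrete example in the original comment did not survive extraction; purely as an illustration, an aggregate file of the requested kind might look something like this (the column names and numbers are hypothetical, not the actual latest-aggregate.csv schema):

```csv
ags,county_name,population,total_cases,total_deaths,7di
09162,München,1488202,64723,1312,88
11000,Berlin,3669491,172988,3034,102
```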