Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discrepancy in Deaths - Gehrcke vs. Risklayer #562

Open
bdiv opened this issue Feb 9, 2021 · 5 comments
Open

Discrepancy in Deaths - Gehrcke vs. Risklayer #562

bdiv opened this issue Feb 9, 2021 · 5 comments

Comments

@bdiv
Copy link

bdiv commented Feb 9, 2021

Hello Dr. Gehrcke,
sorry to bother you again.

In my data I noticed a discrepancy between your aggregation and the Risklayer Chronik.

Data for all of Germany

Date RL Deaths Gehrcke
05-02-21 61961 61651
06-02-21 62128 61955
07-02-21 62597 62126
08-02-21 62630 62601

Image References:
image
Source
image
Source

This time there is no constant temporal shift.

Looking at your code I'm not quite sure how you generate the data and where this difference could come from . Do you use the Main Page and simply append the data from the current day to your files? Would you share the source URL for clarity?

If you take the "current" values from "Haupt" then the discrepancy probably comes from the time when counts for a day are finalised compared to when you sample them. Wouldn't the "Chronik" be a more stable ressource in that case?

@jgehrcke
Copy link
Owner

jgehrcke commented Feb 9, 2021

Thanks! I didn't even get to reading this carefully, but I already want to quickly respond to

Do you use the Main Page and simply append the data from the current day to your files? Would you share the source URL for clarity?

This is currently being used:

export RISKLAYER_HISTORY_DEATHS_CSV_URL="https://docs.google.com/spreadsheets/d/e/2PACX-1vRTkw2_oVkpZ9-WQk-BRf4Pgam9aRmH62uCUr9FiY0Uxv5ixtDhwSsecc_QMrfrD4ncHsCAua2f0TJh/pub?output=csv"

@bdiv
Copy link
Author

bdiv commented Feb 10, 2021

Thank you for providing the Link! I looked into it a bit and these are my findings:

Time shifted values

image
image

  • leaving out the decrease in deaths from 6th to 7th of February since it's probably due to corrections in risklayer
  • the 9th is missing in deaths-rl-crowdsource-by-ags.csv
  • values are generally shifted backwards one day
  • interestingly this doesn't happen in cases-rl-crowdsource-by-ags.csv (see screenshot below) different source format perhaps since the extraction code for deaths and cases are the same if I recall correctly?
    image

Discrepancy between the source of deaths-rl-crowdsource-by-ags.csv and the Chronik of the Risklayer Sheet

  • i can align both datasources using my code 1:
    image
  • comparison to the Risklayer-Chronik:
    image

Conclusion

Appendix

1: Alignment Code (sorry not that pretty but it does it's job)

import pandas as pd
gehrcke = pd.read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRTkw2_oVkpZ9-WQk-BRf4Pgam9aRmH62uCUr9FiY0Uxv5ixtDhwSsecc_QMrfrD4ncHsCAua2f0TJh/pub?output=csv")
gehrcke.head(1)[["GEN","AGS","05.02.2021","06.02.2021","07.02.2021","08.02.2021","09.02.2021","10.02.2021","current","current_time"]]
g = gehrcke[["05.02.2021","06.02.2021","07.02.2021","08.02.2021","09.02.2021","10.02.2021"]]
rl = pd.read_csv("https://docs.google.com/spreadsheets/d/1wg-s4_Lz2Stil6spQEYFdZaBEp8nWW26gVyfHqvcl8s/gviz/tq?tqx=out:csv&sheet=Chronik&range=AW1:CB437")
rl = rl.drop(rl.index[0:18]) # drop the aggregated state rows, leave only counties
rl = list(rl[["05.02","06.02","07.02","08.02","09.02","10.02"]].sum().values) # aggregate values for germany
t = pd.DataFrame(dict(  dates = g.sum().index,
                        gehrcke = g.sum().values,
                        rl = rl ))
t["diff"] = t.gehrcke - t.rl
print(t)

@jgehrcke
Copy link
Owner

jgehrcke commented Feb 12, 2021

Still didn't get time to process this really, but it looks like you found that there's a discrepancy between the actual Google sheet you're looking at and and the Google sheet CSV export I am using. I remember that Tagesspiegel is using the same Google sheet CSV export. Maybe the sheet behind the CSV export is a different one, and they are both manually maintained / "kept in sync"?

@jgehrcke
Copy link
Owner

Could you maybe also provide the source url for cases-rl-crowdsource-by-ags.csv ?

Sorry for replying to late to that specific question. This is

export RISKLAYER_HISTORY_CASES_CSV_URL="https://docs.google.com/spreadsheets/d/e/2PACX-1vTiKkV3Iy-BsShsK3DSUeO9Gpen7VwsXM_haCOc8avj1PeoCIWqL4Os-Uza3jWMEUgmTrEizEV-Itq5/pub?output=csv"

@jgehrcke
Copy link
Owner

Did we make any further progress here understanding where the differences come from between the CSV exports used here, and the web view of the RL Google Sheet? Hm.

@bdiv would you say that these differences are cosmetic and negligible or significant? Honest opinion, please! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants