Differences with NYState Calculation #11

Closed
dmadeka opened this issue Apr 2, 2020 · 11 comments

@dmadeka
dmadeka commented Apr 2, 2020

If I look at the daily figures from nyc.gov:

It reports 48462 total cases, whereas NYState's official NYC count is 51809.

What's the source of the discrepancy? I've been noticing it for a few days.

@ptulin
ptulin commented Apr 2, 2020 via email

@dmadeka
Author

dmadeka commented Apr 2, 2020

They don't seem to have matched for days, though.

This is NYState's historical estimates vs NYC historical estimates.

@ptulin
ptulin commented Apr 2, 2020

They will not match until after 7 pm, and only for a moment; then they will need to be corrected again the next day.

@psylum
psylum commented Apr 3, 2020

I don't think this is a timing issue. The data shown on the NYS site usually matches what is presented during a Governor briefing. For the past 3 days, the number shown for NYC during the noonish briefing has been ~2000 higher than the evening NYC number.

@dmadeka
Author

dmadeka commented Apr 3, 2020

I'm with @psylum: the afternoon numbers seem higher, and the implied growth rates are very different. Here's a bar chart from the NYC data (I took the last three points and added them to the 31st).

[image: bar chart from the NYC data]

Whereas on the wiki, NYState has an implied growth rate of 13-10% over the last few days. There seem to be big differences.
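For anyone wanting to compare the two sources on the same footing, here is a minimal sketch of how an implied day-over-day growth rate can be computed from cumulative totals. The function name and the sample totals are hypothetical and chosen only for illustration; they are not NYC or NYS figures.

```python
def daily_growth_rates(cumulative):
    """Return day-over-day growth of a cumulative series, as fractions."""
    return [
        (today - prev) / prev
        for prev, today in zip(cumulative, cumulative[1:])
    ]


# Hypothetical cumulative totals for three consecutive days (not real data).
example_totals = [1000, 1130, 1243]

rates = daily_growth_rates(example_totals)
print([f"{r:.1%}" for r in rates])
```

Running the same function on each source's cumulative series makes differences in growth rate visible even when the absolute totals are offset.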

@DTPOTO
DTPOTO commented Apr 3, 2020

Hello all, please review the issue thread started when the NYC Health Dept began using GitHub as data storage for their web page ("Counts vary differently from Yesterday"). At the same time as the switch to GitHub, the Health Dept changed its reporting methodology, using "Diagnosis Date" instead of "Reporting Date". I am sure the State Health Department is stuck with just the "Reporting Date" because it is collecting from too many different sources. The City is now attempting to show NEW cases as of the date of diagnosis. The original diagnosis occurs when the doctor suspects the patient has the virus and orders the TEST. The lab provides data on the Reporting Date, and lab results may take 3 to 14 days (OUCH). I have looked at the LAG time between Diagnosis Date and Reporting Date; see here.
I am using the level of back-dating revisions as of the Diagnosis Date as a surrogate for lab-results lag time. The cumulative graph suggests that 3 days back is under-reported by half and that 4 days back is under-reported by a third. At the current lab turnaround rate, it takes a week before you have a handle on today's real number.
This can be unsettling if you are only focused on yesterday's new cases. All new cases being reported are OLD news (whether coming from the state or the city). In reality, using the diagnosis date may be the better method for predicting the APEX. But changing the reporting methodology without an adequate explanation sows the seeds of distrust and certainly undermines everyone's predictive models.
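As a rough illustration of the lag estimate above, the sketch below scales a provisional count by an assumed completeness for its age. It reads "under-reported by half" as 1/2 of eventual cases visible at 3 days back, and "under-reported by a third" as 2/3 visible at 4 days back; that reading, the function name, and the sample counts are assumptions, and no figures for other lags are given in this thread.

```python
from fractions import Fraction

# Assumed completeness by lag (days back), taken from the estimate above.
# Lags not mentioned in the thread are deliberately left out.
COMPLETENESS_BY_LAG = {3: Fraction(1, 2), 4: Fraction(2, 3)}


def estimate_final_count(reported, lag_days):
    """Scale a provisional count up by the assumed completeness for its lag.

    Returns None when no completeness estimate exists for that lag.
    """
    completeness = COMPLETENESS_BY_LAG.get(lag_days)
    if completeness is None:
        return None
    return float(Fraction(reported) / completeness)


# Hypothetical provisional counts, for illustration only.
print(estimate_final_count(1000, 3))
print(estimate_final_count(1000, 4))
print(estimate_final_count(1000, 1))
```

This is only a back-of-the-envelope correction; the completeness fractions would themselves shift as lab turnaround times change.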

@madeka @ptulin @psylum @mmontesanonyc

@speedplane
This is causing a whole lot of confusion. It’s the responsibility of NYC to make these data differences crystal clear, and to provide both sets of data.

@speedplane
Also @DTPOTO ... why would there be more total positive tests in the reporting date methodology? Is that just due to the time of day the results are generated?

@DTPOTO
DTPOTO commented Apr 3, 2020

> Also @DTPOTO ... why would there be more total positive tests in the reporting date methodology? Is that just due to the time of day the results are generated?

I agree the NYC Health Dept should supply both data sets. The time of day has a minor impact, more so when you are using the Reporting Date methodology. The Reporting Date methodology shows higher numbers because it is focused on the current date (today). The data files are being restated by BACK-DATING. It's a little like the government revising last month's unemployment number. The TOTAL number of cases is identical; it is just a matter of when they are reported. @joansobo demonstrated that the total cases were the same, and was able to calculate new REPORTED cases by looking at Case-Hosp-Deaths.csv over two different days. The issue is daily restatement. Getting the latest version of Case-Hosp-Deaths.csv may be your best bet in terms of predictive modeling. I don't like either, but we may get that clarity or better information in a timely way.
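A minimal sketch of the two-snapshot comparison described above: load the file as downloaded on two different days, then take the per-date differences. The column names (`date`, `new_cases`), the function names, and the sample rows are hypothetical; the real layout of Case-Hosp-Deaths.csv should be checked before adapting this.

```python
import csv
import io


def load_cases_by_date(csv_text, date_col="date", cases_col="new_cases"):
    """Map each diagnosis date to its case count for one snapshot."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[date_col]: int(row[cases_col]) for row in reader}


def restatement(older, newer):
    """Return per-date changes between snapshots and the total newly reported."""
    all_dates = sorted(set(older) | set(newer))
    changes = {d: newer.get(d, 0) - older.get(d, 0) for d in all_dates}
    changes = {d: delta for d, delta in changes.items() if delta != 0}
    return changes, sum(changes.values())


# Hypothetical snapshots of the same file taken on consecutive days.
snapshot_day1 = "date,new_cases\n2020-03-30,100\n2020-03-31,60\n"
snapshot_day2 = "date,new_cases\n2020-03-30,120\n2020-03-31,90\n2020-04-01,40\n"

changes, newly_reported = restatement(
    load_cases_by_date(snapshot_day1),
    load_cases_by_date(snapshot_day2),
)
print(changes)         # back-dated revisions plus the newest date
print(newly_reported)  # cases added between the two snapshots
```

The total of the per-date changes corresponds to a "reporting date" style daily count, while the individual changes show how far back the restatement reaches.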

@mmontesanonyc
Contributor

Data from NYC and NYS will always be different for a number of reasons, including the time of day the dataset is cut, de-duplication procedures that differ between the agencies, and data cleaning and QA procedures.

7 participants
@speedplane @ptulin @psylum @dmadeka @mmontesanonyc @DTPOTO and others