
2017-2021 SVI for ZCTAs #12

Closed · usamabilal opened this issue Jan 12, 2023 · 12 comments
@usamabilal

  • years: 2017-2021
  • geo-unit: zcta
  • what areas you are interested in: Pennsylvania

Thanks!

@heli-xu (Owner) commented Feb 10, 2023

Hi Usama, thanks for your patience. I'm attaching a zip file with data, a report, and some documentation. Please refer to the README in the zip file for more detailed information. I'd appreciate any suggestions and feedback from you and your team. Thanks!

2017to2021_PA_zcta_SVI.zip

@usamabilal (Author)

Thanks so much, Heli!! I'll review and let you know how things go.

@usamabilal (Author)

After reviewing, this looks great. I really like the validation. I understand that part of the differences in the validation stem from potential differences in aggregation from CT to ZCTA.
Let me know if my understanding below is correct:

  • Heli's SVI (henceforth hSVI) downloads data (using tidycensus) directly at the ZCTA level (geography = "zcta") and then follows the same procedure as CDC's SVI (henceforth cSVI).
  • To validate, you first aggregate cSVI from CT to ZCTA. To do this, there would be two options (see the sketch after this list):
    1. sum the E_ variables and take the mean of the EP_ variables, then compute percentiles and follow the regular SVI calculation
    2. take the mean of the percentiles
  • In the section "Aggregating ct data to ZCTA level" I understood you are doing 1, but in the code for "Percentile ranking (“RPL_xx”) by theme" I see option 2. Which one is happening?
  • Then, once it is aggregated, you just compare hSVI to cSVI.
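A minimal sketch of the two options, assuming a tract-level cSVI table `ct_svi`, a tract-to-ZCTA crosswalk `xwalk`, and illustrative column names (`GEOID`, `ZCTA`, `E_POV`, `EP_POV`, `RPL_THEMES` follow the CDC naming pattern but are assumptions here, not the repo's actual code):

```r
library(dplyr)

# joined: one row per tract-ZCTA pair (a tract can map to several ZCTAs)
joined <- left_join(ct_svi, xwalk, by = "GEOID")

# Option 1: aggregate the raw inputs, then re-rank at the ZCTA level.
opt1 <- joined |>
  group_by(ZCTA) |>
  summarise(
    E_POV  = sum(E_POV, na.rm = TRUE),   # counts (E_): sum
    EP_POV = mean(EP_POV, na.rm = TRUE)  # percentages (EP_): mean
  ) |>
  # ...then recompute percentile ranks with the CDC-style formula
  # (rank - 1) / (n - 1):
  mutate(EPL_POV = (rank(EP_POV, ties.method = "min") - 1) / (n() - 1))

# Option 2: keep cSVI's tract-level percentile ranks and just average
# them within each ZCTA.
opt2 <- joined |>
  group_by(ZCTA) |>
  summarise(RPL_THEMES = mean(RPL_THEMES, na.rm = TRUE))
```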

A few notes (regardless of the 1 vs 2 thing above):

  • To me, the "real" validation is what you do in "SVI calculation and validation", which shows that "my code returns the same thing as CDC's (roughly, with very minor differences)".
  • The ZCTA vs CT validation, while nice, may be complicated to actually conduct properly. I say this because if one CT has 3 people and another CT has 1,000 people (and they are the only component CTs of a specific ZCTA), then a mean of EP_s would give the same weight to each one, while the second CT should have about 300 times the weight (see the weighted sketch after this list). Moreover, since CTs are not perfectly nested in ZCTAs (that is, a CT may be in more than one ZCTA), these weights would be ZCTA-specific.
  • In other words: I'd keep the validation to making sure that when you use this at the scales CDC has worked at (county and tract), the results are as expected. It'd be good to replicate this at the county level and compare hSVI with cSVI at the county level.
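On the weighting point, a hedged sketch of what a population-weighted aggregation could look like (reusing the hypothetical `joined` table from above and an assumed tract-population column `E_TOTPOP`; properly handling tracts that straddle ZCTAs would need crosswalk-specific overlap weights, which this ignores):

```r
# Population-weighted mean of an EP_ variable within each ZCTA, so a
# 1,000-person tract outweighs a 3-person tract by roughly 300 to 1.
opt1_weighted <- joined |>
  group_by(ZCTA) |>
  summarise(EP_POV = weighted.mean(EP_POV, w = E_TOTPOP, na.rm = TRUE))
```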

Thanks again!!

heli-xu referenced this issue in heli-xu/svi-calculation Feb 21, 2023
@heli-xu (Owner) commented Feb 21, 2023

Hi Usama, thanks for the feedback!

In the section "Aggregating ct data to ZCTA level" I understood you are doing 1, but in the code for "Percentile ranking (“RPL_xx”) by theme" I see option 2. Which one is happening?

You're right about how hSVI works, including the part where I used two aggregating methods. I summed the E_ variables and averaged the EP_ variables without further computing percentiles and SVI, and I also took the mean of the percentiles separately. The purpose was to look not only at the aggregated cSVI, but also at the individual variables in terms of their correlation with our calculation results. So by your standard, I was using option 2 for cSVI aggregation from CT to ZCTA, and additionally (part of) option 1 for variable aggregation from CT to ZCTA. I'd be happy to do option 1 for cSVI aggregation too if you'd like.

the ZCTA vs CT validation, while nice, may be complicated to actually conduct properly.

I completely agree with you about how tricky ZCTA vs CT validation can be, and the point about the ZCTA-specific weights makes a lot of sense. I got quite frustrated while trying to do the aggregation, but wanted to include the results and hear your thoughts.

It'd be good to replicate this at the county level and compare hSVI with cSVI at the county level

Here is a new report where I added the comparison between hSVI and cSVI at the county level (2018, 2020) and census tract level (2020).

Thanks again for your time and advice, and please let me know if you have other questions/suggestions.

@usamabilal (Author)

Thank you! I now get it: "method" 1 for comparing variables and "method" 2 for comparing the SVI itself. Part of the issue may be that an aggregation of percentiles is not comparable with aggregating variables and then creating percentiles. This is known as the STA vs ATS dilemma: summarize (aggregate) then analyze (compute percentiles) = STA, versus analyze (compute percentiles) then summarize (aggregate) = ATS. Your approach for validating the SVI is ATS (you first calculate percentiles and then aggregate by taking the mean of the percentiles).
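A toy illustration of the STA vs ATS gap, with made-up numbers (two ZCTAs, three tracts each); the two orders of operation can even reverse the ranking:

```r
df <- data.frame(
  zcta = c("A", "A", "A", "B", "B", "B"),
  ep   = c(10, 11, 100, 35, 36, 37)  # a hypothetical EP_ percentage
)

# ATS: analyze first (percentile-rank the tracts), then summarize (mean)
df$pctile <- (rank(df$ep, ties.method = "min") - 1) / (nrow(df) - 1)
ats <- tapply(df$pctile, df$zcta, mean)  # A = 0.40, B = 0.60

# STA: summarize first (mean EP_ per ZCTA), then analyze (rank)
sta_ep <- tapply(df$ep, df$zcta, mean)  # A = 40.3, B = 36.0
sta <- (rank(sta_ep, ties.method = "min") - 1) / (length(sta_ep) - 1)
# sta: A = 1, B = 0 -- ATS ranks B above A, while STA ranks A above B
```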

County-level validation looks great. I think CT (the usual acronym for tracts) and CTY (the usual acronym for counties) validation is all you need to ensure you are doing the right thing.

Now one last thing: I do observe a few very minor differences in both CT and CTY. What do you attribute them to?

@heli-xu (Owner) commented Feb 22, 2023

Good to know. Thank you very much! Indeed a dilemma...

For the minor differences, I think they may be due to the number of decimal places in the EP_ variables (percentages). The CDC version keeps one decimal place, whereas ours keeps more because I didn't specify it in the function (at the time I preferred to preserve as much information as possible). Here is a report with more details and some examples. I'd appreciate your insight, and we could adjust the function to make it more consistent with CDC's data if needed.
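A minimal sketch of the rounding fix being discussed (assuming a dplyr pipeline over a table `svi` with EP_ columns; the published CDC tables keep one decimal place for percentages):

```r
library(dplyr)

# Round every EP_ percentage to one decimal place before ranking, to
# match the precision of the published CDC tables.
svi <- svi |>
  mutate(across(starts_with("EP_"), ~ round(.x, 1)))
```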

Thanks again for your help!

@usamabilal (Author)

Great! It'd be good to try to "fully replicate" their approach by matching their number of decimals. Interesting that they don't include the caveat in the 2020 documentation...

@heli-xu (Owner) commented Feb 24, 2023

Sounds good! This is a report where I used the updated function (with matching decimal places) to reproduce CDC SVI. Thanks again for your input!

heli-xu referenced this issue in heli-xu/svi-calculation Feb 24, 2023
new get_svi() rounded EP_var and remove TOTPOP = 0 during ranking;
refer to previous two commits on this file (forgot to link issue there)
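A rough sketch of what that commit message describes (not the actual `get_svi()` internals; `svi_raw`, `E_TOTPOP`, and `EP_POV` are assumed names):

```r
library(dplyr)

svi_ranked <- svi_raw |>
  mutate(across(starts_with("EP_"), ~ round(.x, 1))) |>  # rounded EP_ vars
  filter(E_TOTPOP > 0) |>  # drop zero-population rows before ranking
  mutate(EPL_POV = (rank(EP_POV, ties.method = "min") - 1) / (n() - 1))
```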
@heli-xu (Owner) commented Feb 24, 2023

If this looks good to you, I'll redo the zcta SVI (2017-2021, PA) using the new function and send them again.

@usamabilal (Author)

Perfect!! Validation is 100% on point, so let's redo them. Thanks!

@heli-xu (Owner) commented Feb 28, 2023

Sounds great! I'm attaching a zip folder with 5 updated tables of zcta-level SVI and a folder of CDC SVI tables and documentation for your reference (same as previously uploaded). I'd appreciate any further questions/suggestions. If they look good to you, please feel free to close the issue. Thanks again for your help with improving the result!

pa_zcta_svi_2017to2021_updated.zip

@usamabilal (Author)

Thanks! All looks good, closing.

@heli-xu heli-xu transferred this issue from heli-xu/svi-calculation Aug 18, 2023