# Modeling Health Data with Other Kinds of Classification

**Complete by: Tuesday 11 Nov. at class time**  
Data: <https://jrladd.com/CIS241/data/PLACES_PA_OH_2024_12.csv>

What does it mean for a *place* to be healthy? What can we learn about the overall health of people in the United States by looking at health by city and locality? These are the questions that the CDC's [PLACES](https://www.cdc.gov/places/about/index.html) project attempts to answer. This project began in 2015 as the "500 Cities Project": an annual survey of the 500 largest cities in the US. By surveying places every year, the CDC hopes to better understand long-term health among Americans all across the country.

Since the start of the COVID-19 pandemic, data analysis focused on health and healthcare has been at the center of public conversation. Good data analysis in this domain is essential for public health in the 21st century. In this week's workshop, you'll examine recent data from the PLACES project to better understand Americans' overall health.

We'll use more recent data that surveys people by county instead of city. For use in JupyterHub, I've filtered the [original dataset](https://data.cdc.gov/500-Cities-Places/PLACES-Local-Data-for-Better-Health-County-Data-20/swc5-untb/about_data) to include only Pennsylvania and Ohio counties. We're going to use this data to compare the relative health of these two neighboring states. 

***For this workshop, you will make a coherent and readable report. As you go through each section, make sure you are explaining, describing and interpreting all code, visualizations, and statistical output. The goal is that a reader with no previous knowledge of the data should be able to understand your findings.***

## Instructions

Your report should have the following sections. *You may re-use what you did last week for Sections 1, 2, & 3, but consider using this as an opportunity to improve on what you did in the last assignment*.

1. **Data Description and Ethical Considerations**: Read in the data and describe its rows and columns. Now that we've discussed ethics as a class, use the frameworks you know to give an overview of important ethical considerations in this dataset. You might find it helpful to review the [project website](https://www.cdc.gov/places/index.html) as well as the [data dictionary](https://data.cdc.gov/500-Cities-Places/PLACES-and-500-Cities-Data-Dictionary/m35w-spkz/about_data) that the CDC provides.
2. **Data Wrangling**: You'll notice right away that this dataset is ***not*** in tidy format. To get it ready for analysis, you'll need to [pivot](https://jrladd.com/CIS241/guides/pandas.html#switch-rows-and-columns-with-pivot-and-melt) the data. Specifically, you'll need to use the `.pivot_table()` method, which both pivots the data and applies an *aggregation function*, to get the mean of the values you're pivoting. (Keep "StateAbbr" and "LocationName" as the index. Use "MeasureId" as columns and "Data_Value" as values.) You will also need to take other wrangling steps once you've pivoted the data.
3. **Exploratory Data Analysis**: Run some summary statistics and create at least 3 well-described visualizations to better understand your dataset and the potential predictors for your model.
4. **First Model and Validation**: Run one of the three additional classification models we learned about (KNN, Random Forest, or Naive Bayes). Pay special attention to the principles of **model selection** that we discussed in class: only certain models will be appropriate for this data. Use the same target variable (the state) that you did last week. Assess your model with the standard metrics. How did your model perform? Make sure you fully interpret the necessary validation steps.
5. **Second Model and Validation**: Run a different one of the three additional classification models we learned about (KNN, Random Forest, or Naive Bayes). Pay special attention to the principles of **model selection** that we discussed in class: only certain models will be appropriate for this data. Use the same target variable (the state) that you did last week. Assess your model with the standard metrics. How did your model perform? Make sure you fully interpret the necessary validation steps.
6. **Conclusion**: Write a paragraph explaining your findings and pointing out any takeaways from your modeling. How did the three models (Logistic Regression from last week, and the two you chose this week) perform on the PLACES data? Is there one model that you thought performed better than the others, and why do you think that is? What models or steps would you recommend for someone who wants to accurately predict the state based on these health statistics?
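
If the pivot in Section 2 is giving you trouble, here is a minimal sketch of how `.pivot_table()` behaves on a tiny, made-up table. Only the four column names come from the instructions above. The counties, measure IDs, and numbers are placeholders chosen for illustration and are not taken from the real file. Flattening the result with `.reset_index()` is just one possible follow-up step, and your own wrangling will likely need more.

```python
import pandas as pd

# Toy long-format data standing in for the PLACES file.
# All values here are invented for illustration only.
long_df = pd.DataFrame({
    "StateAbbr": ["PA", "PA", "PA", "OH", "OH", "OH"],
    "LocationName": ["Adams", "Adams", "Adams", "Allen", "Allen", "Allen"],
    "MeasureId": ["OBESITY", "OBESITY", "SLEEP", "OBESITY", "SLEEP", "SLEEP"],
    "Data_Value": [30.0, 32.0, 35.0, 36.0, 38.0, 40.0],
})

# Pivot so each county is one row and each measure is one column.
# Repeated (county, measure) pairs are averaged by aggfunc="mean".
wide_df = long_df.pivot_table(
    index=["StateAbbr", "LocationName"],
    columns="MeasureId",
    values="Data_Value",
    aggfunc="mean",
)

# Turn the index levels back into ordinary columns and drop the
# leftover "MeasureId" label on the column axis.
wide_df = wide_df.reset_index()
wide_df.columns.name = None

print(wide_df)
```

In this toy table, the two "OBESITY" rows for Adams collapse into a single averaged value, which is the job of the aggregation function in the real dataset too.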