<br>

## 👨‍👩‍👧‍👦 **SOCIODEMOGRAPHIC AND HEALTH RESOURCE DATA** 👨‍👩‍👧‍👦

**REGULARIZED LINEAR REGRESSION PROJECT**

## **INDEX**
- **STEP 1: PROBLEM DEFINITION AND DATA COLLECTION**
- **STEP 2: DATA EXPLORATION AND CLEANING**
- **STEP 3: UNIVARIATE VARIABLE ANALYSIS**
- **STEP 4: MULTIVARIATE VARIABLE ANALYSIS**
- **STEP 5: FEATURE ENGINEERING**
- **STEP 6: FEATURE SELECTION**
- **STEP 7: MACHINE LEARNING**
- **STEP 8: CONCLUSIONS**

<br>

### **STEP 1: PROBLEM DEFINITION AND DATA COLLECTION**

- 1.1. Problem Definition
- 1.2. Library Importing
- 1.3. Data Collection

<br>

**1.1. PROBLEM DEFINITION**

**Objective:**

To identify the relationship between sociodemographic factors and the availability of health resources at the county level in the United States.

**Research Questions:**

1. **Sociodemographic Factors and Health Resource Availability:**
* What sociodemographic factors are most strongly associated with the availability of health resources in U.S. counties?
* Do income, education, and poverty rates influence access to healthcare services?

2. **Demographic Disparities in Health Resource Distribution:**
* Are health resources distributed equitably across counties with diverse racial, ethnic, and age demographics?
* How does family structure impact the allocation of health resources?

3. **Geographic and Economic Influences on Health Resource Availability:**
* Do economic factors, such as income and unemployment rates, affect the availability of health resources in U.S. counties?
* Are rural counties less likely to have adequate health resources compared to urban areas?

4. **Predictive Modeling of Health Resource Distribution:**
* Can statistical modeling techniques, such as regularized linear regression, accurately predict the distribution of health resources across U.S. counties?
* Which sociodemographic and economic factors have the strongest predictive power in determining health resource availability?
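
For reference, "regularized linear regression" in question 4 usually refers to penalized least-squares objectives such as the two below. This is a sketch that follows the conventions scikit-learn documents for `Lasso` and `Ridge`; the symbols $X$ (feature matrix), $y$ (target), $w$ (coefficients), $n$ (number of counties), and $\alpha$ (penalty strength) are notation introduced here, not names from the dataset.

$$
\text{Lasso:}\quad \min_{w}\ \frac{1}{2n}\lVert y - Xw\rVert_2^2 + \alpha\lVert w\rVert_1
\qquad\qquad
\text{Ridge:}\quad \min_{w}\ \lVert y - Xw\rVert_2^2 + \alpha\lVert w\rVert_2^2
$$

The $\ell_1$ penalty can shrink some coefficients exactly to zero, which is why Lasso is often used when the goal is also to identify the most relevant factors.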

**Data:**

The analysis will utilize county-level sociodemographic and health resource data from the United States. This data will include variables such as:

* **Sociodemographic Factors:**
   * Population demographics (age, sex, race, ethnicity)
   * Income levels
   * Education levels
   * Poverty rates
   * Family structure
* **Health Resource Availability:**
   * Number of healthcare providers
   * Hospital bed capacity
   * Access to primary care
   * Availability of specialized services

By addressing these research questions, we aim to gain a deeper understanding of the factors that influence health resource availability and identify potential disparities in access to healthcare across U.S. counties.

<br>

**1.2. LIBRARY IMPORTING**

In [4]:
import pandas as pd 
import numpy as np 
import seaborn as sns 
import matplotlib.pyplot as plt 
import json
import joblib
import os
import math
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler
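# f_regression is the univariate F-test scorer for a continuous target; f_classif assumes class labels
from sklearn.feature_selection import f_classif, f_regression, SelectKBest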
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from pickle import dump
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

<br>

**1.3. DATA COLLECTION**

In [5]:
pd.options.display.max_columns=None
df = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/regularized-linear-regression-project-tutorial/main/demographic_health_data.csv")
df.head()

Unnamed: 0,fips,TOT_POP,0-9,0-9 y/o % of total pop,19-Oct,10-19 y/o % of total pop,20-29,20-29 y/o % of total pop,30-39,30-39 y/o % of total pop,40-49,40-49 y/o % of total pop,50-59,50-59 y/o % of total pop,60-69,60-69 y/o % of total pop,70-79,70-79 y/o % of total pop,80+,80+ y/o % of total pop,White-alone pop,% White-alone,Black-alone pop,% Black-alone,Native American/American Indian-alone pop,% NA/AI-alone,Asian-alone pop,% Asian-alone,Hawaiian/Pacific Islander-alone pop,% Hawaiian/PI-alone,Two or more races pop,% Two or more races,POP_ESTIMATE_2018,N_POP_CHG_2018,GQ_ESTIMATES_2018,R_birth_2018,R_death_2018,R_NATURAL_INC_2018,R_INTERNATIONAL_MIG_2018,R_DOMESTIC_MIG_2018,R_NET_MIG_2018,Less than a high school diploma 2014-18,High school diploma only 2014-18,Some college or associate's degree 2014-18,Bachelor's degree or higher 2014-18,Percent of adults with less than a high school diploma 2014-18,Percent of adults with a high school diploma only 2014-18,Percent of adults completing some college or associate's degree 2014-18,Percent of adults with a bachelor's degree or higher 2014-18,POVALL_2018,PCTPOVALL_2018,PCTPOV017_2018,PCTPOV517_2018,MEDHHINC_2018,CI90LBINC_2018,CI90UBINC_2018,Civilian_labor_force_2018,Employed_2018,Unemployed_2018,Unemployment_rate_2018,Median_Household_Income_2018,Med_HH_Income_Percent_of_State_Total_2018,Active Physicians per 100000 Population 2018 (AAMC),Total Active Patient Care Physicians per 100000 Population 2018 (AAMC),Active Primary Care Physicians per 100000 Population 2018 (AAMC),Active Patient Care Primary Care Physicians per 100000 Population 2018 (AAMC),Active General Surgeons per 100000 Population 2018 (AAMC),Active Patient Care General Surgeons per 100000 Population 2018 (AAMC),Total nurse practitioners (2019),Total physician assistants (2019),Total Hospitals (2019),Internal Medicine Primary Care (2019),Family Medicine/General Practice Primary Care (2019),Total Specialist Physicians (2019),ICU Beds_x,Total Population,Population Aged 60+,Percent of Population Aged 60+,COUNTY_NAME,STATE_NAME,STATE_FIPS,CNTY_FIPS,county_pop2018_18 and older,anycondition_prevalence,anycondition_Lower 95% CI,anycondition_Upper 95% CI,anycondition_number,Obesity_prevalence,Obesity_Lower 95% CI,Obesity_Upper 95% CI,Obesity_number,Heart disease_prevalence,Heart disease_Lower 95% CI,Heart disease_Upper 95% CI,Heart disease_number,COPD_prevalence,COPD_Lower 95% CI,COPD_Upper 95% CI,COPD_number,diabetes_prevalence,diabetes_Lower 95% CI,diabetes_Upper 95% CI,diabetes_number,CKD_prevalence,CKD_Lower 95% CI,CKD_Upper 95% CI,CKD_number,Urban_rural_code
0,1001,55601,6787,12.206615,7637,13.735364,6878,12.370281,7089,12.749771,7582,13.636445,7738,13.917016,5826,10.478229,4050,7.284042,2014,3.622237,42660,76.725239,10915,19.630942,267,0.480207,681,1.224798,62,0.111509,1016,1.827305,55601,158,455,11.8,9.6,2.2,0.0,0.7,0.6,4204,12119,10552,10291,11.3,32.6,28.4,27.7,7587,13.8,19.3,19.5,59338,53628,65048,25957,25015,942,3.6,59338,119.0,217.1,196.7,77.2,71.2,7.6,6.9,28.859137,6.085786,1.148905,25.992561,21.249061,72.142154,6,55036,10523,19.1,Autauga,Alabama,1,1,42438,47.6,45.4,49.4,20181,35.8,34.2,37.3,15193,7.9,7.2,8.7,3345,8.6,7.3,9.9,3644,12.9,11.9,13.8,5462,3.1,2.9,3.3,1326,3
1,1003,218022,24757,11.355276,26913,12.344167,23579,10.814964,25213,11.564429,27338,12.539102,29986,13.753658,29932,13.72889,20936,9.602701,9368,4.296814,190301,87.285228,19492,8.940382,1684,0.772399,2508,1.150343,146,0.066966,3891,1.784682,218022,5403,2190,10.5,10.3,0.1,0.5,24.3,24.8,14310,40579,46025,46075,9.7,27.6,31.3,31.3,21069,9.8,13.9,13.1,57588,54437,60739,93849,90456,3393,3.6,57588,115.5,217.1,196.7,77.2,71.2,7.6,6.9,113.162114,23.863512,4.505074,101.92173,83.321572,282.882982,51,203360,53519,26.3,Baldwin,Alabama,1,3,170912,40.2,38.2,42.3,68790,29.7,28.4,31.0,50761,7.8,7.0,8.7,13414,8.6,7.2,10.1,14692,12.0,11.0,13.1,20520,3.2,3.0,3.5,5479,4
2,1005,24881,2732,10.980266,2960,11.896628,3268,13.13452,3201,12.865239,3074,12.354809,3278,13.174712,3076,12.362847,2244,9.01893,1048,4.212049,12209,49.069571,12042,48.398376,164,0.659137,113,0.454162,46,0.18488,307,1.233873,24881,-277,2820,10.4,12.9,-2.5,0.5,-9.1,-8.6,4901,6486,4566,2220,27.0,35.7,25.1,12.2,6788,30.9,43.9,36.7,34382,31157,37607,8373,7940,433,5.2,34382,68.9,217.1,196.7,77.2,71.2,7.6,6.9,12.914231,2.72334,0.514126,11.631462,9.508784,32.283033,5,26201,6150,23.5,Barbour,Alabama,1,5,19689,57.5,55.6,59.1,11325,40.7,39.5,41.9,8013,11.0,10.1,11.8,2159,12.1,10.7,13.3,2373,19.7,18.6,20.6,3870,4.5,4.2,4.8,887,6
3,1007,22400,2456,10.964286,2596,11.589286,3029,13.522321,3113,13.897321,3038,13.5625,3115,13.90625,2545,11.361607,1723,7.691964,785,3.504464,17211,76.834821,4770,21.294643,98,0.4375,53,0.236607,26,0.116071,242,1.080357,22400,-155,2151,11.1,11.4,-0.3,0.4,-7.0,-6.6,2650,7471,3846,1813,16.8,47.3,24.4,11.5,4400,21.8,27.8,26.3,46064,41283,50845,8661,8317,344,4.0,46064,92.3,217.1,196.7,77.2,71.2,7.6,6.9,11.626493,2.451783,0.46286,10.471635,8.560619,29.063942,0,22580,4773,21.1,Bibb,Alabama,1,7,17813,51.6,49.6,53.4,9190,38.7,37.4,40.2,6894,8.6,7.9,9.3,1533,10.0,8.8,11.3,1789,14.1,13.2,14.9,2511,3.3,3.1,3.6,595,2
4,1009,57840,7095,12.266598,7570,13.087828,6742,11.656293,6884,11.901798,7474,12.921853,7844,13.561549,6965,12.04184,4931,8.525242,2335,4.036999,55456,95.878285,950,1.642462,378,0.653527,185,0.319848,70,0.121024,801,1.384855,57840,13,489,11.8,11.4,0.3,0.1,-0.2,0.0,7861,13489,13267,5010,19.8,34.0,33.5,12.6,7527,13.2,18.0,15.5,50412,46157,54667,25006,24128,878,3.5,50412,101.1,217.1,196.7,77.2,71.2,7.6,6.9,30.021267,6.330854,1.195171,27.039257,22.10474,75.047251,6,57667,13600,23.6,Blount,Alabama,1,9,44448,46.3,44.3,48.4,20584,34.0,32.4,35.5,15112,9.2,8.4,10.1,4101,10.5,9.1,12.0,4661,13.5,12.6,14.5,6017,3.4,3.2,3.7,1507,2


In [6]:
# Export the DataFrame to a local CSV file, creating the target folder if it does not exist yet

os.makedirs("../data/raw", exist_ok=True)
df.to_csv("../data/raw/medical_insurance_data.csv", index=False)
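
Since the data is read from a URL and then saved locally, later runs could reuse the saved copy instead of downloading again. The sketch below shows one way to do that; `load_county_data` and its `cache_path` parameter are hypothetical names introduced for illustration, not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_county_data(source: str, cache_path: str) -> pd.DataFrame:
    """Return the dataset, preferring a local copy over a fresh download.

    Hypothetical helper: `source` may be a URL or a file path.
    """
    cache = Path(cache_path)
    if cache.exists():
        # A previous run already stored the file, so read it from disk.
        return pd.read_csv(cache)

    # First run: read from the source and keep a local copy for next time.
    data = pd.read_csv(source)
    cache.parent.mkdir(parents=True, exist_ok=True)
    data.to_csv(cache, index=False)
    return data
```

It would be called with the URL from the cell above as `source` and the local CSV path as `cache_path`.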

<br>

### **STEP 2: DATA EXPLORATION AND CLEANING**

- 2.1. Exploration: Understanding the Features
- 2.2. Identifying null values in each feature
- 2.3. Eliminating Duplicates
- 2.4. Eliminating Irrelevant Information

**1. GEOGRAPHIC IDENTIFICATION**

* `fips`: FIPS code for the county
* `STATE_FIPS`: FIPS code for the state
* `Urban_rural_code`: Urban/rural area classification

**2. DEMOGRAPHICS AND AGE**

* `TOT_POP`: Total population
* **Age Groups:**
    * `0-9`
    * `19-Oct` (the 10-19 age group; the column name appears in this garbled form in the `df.head()` output)
    * `20-29`
    * `30-39`
    * `40-49`
    * `50-59`
    * `60-69`
    * `70-79`
    * `80+`
    * Percentage of the population in each age group (`0-9 y/o % of total pop`, etc.)
* **Elderly Population:**
    * `Population Aged 60+`
    * `Percent of Population Aged 60+`
* `county_pop2018_18 and older`: Population aged 18+ in 2018
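
In the `df.head()` output from section 1.3, the count column for the 10-19 age group is labeled `19-Oct`, which looks like a spreadsheet date conversion of `10-19`. A minimal sketch of renaming it, assuming that reading is correct; the small frame below copies two rows and three columns from that output so the example runs on its own.

```python
import pandas as pd

# Two rows copied from the df.head() output; only three columns are kept.
sample = pd.DataFrame({
    "0-9": [6787, 24757],
    "19-Oct": [7637, 26913],
    "20-29": [6878, 23579],
})

# Assumption: "19-Oct" is the 10-19 age-group count under a garbled name.
sample = sample.rename(columns={"19-Oct": "10-19"})
print(list(sample.columns))
```

The same `rename` call could be applied to `df` so that the age-group columns follow one naming pattern.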

**3. RACE AND ETHNICITY**

* **Population by Racial Group:**
    * `White-alone pop`
    * `Black-alone pop`
    * `Native American/American Indian-alone pop`
    * `Asian-alone pop`
    * `Hawaiian/Pacific Islander-alone pop`
    * `Two or more races pop`
* **Percentage by Racial Group:**
    * `% White-alone`
    * `% Black-alone`
    * `% NA/AI-alone`
    * `% Asian-alone`
    * `% Hawaiian/PI-alone`
    * `% Two or more races`
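
The percentage columns appear to be derived directly from the count columns and `TOT_POP`, which would make each pair redundant for modeling. The sketch below checks this on three rows copied from the `df.head()` output; that the relationship holds for every county is an assumption to confirm on the full `df`.

```python
import numpy as np
import pandas as pd

# Three rows copied from the df.head() output in section 1.3.
sample = pd.DataFrame({
    "TOT_POP": [55601, 218022, 24881],
    "White-alone pop": [42660, 190301, 12209],
    "% White-alone": [76.725239, 87.285228, 49.069571],
})

# Recompute the share from the raw count and compare with the stored value.
recomputed = sample["White-alone pop"] / sample["TOT_POP"] * 100
matches = np.isclose(recomputed, sample["% White-alone"], atol=1e-4)
print(matches.all())
```

If the check passes on the full dataset, keeping only the percentage columns avoids feeding the model two versions of the same information.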

**4. POPULATION CHANGE AND MIGRATION**

* `N_POP_CHG_2018`: Numeric change in resident population (2017-2018)
* `GQ_ESTIMATES_2018`: Group quarters population estimate (2018)
* **Birth, Death, and Migration Rates:**
    * `R_birth_2018`
    * `R_death_2018`
    * `R_NATURAL_INC_2018`
    * `R_INTERNATIONAL_MIG_2018`
    * `R_DOMESTIC_MIG_2018`
    * `R_NET_MIG_2018`

**5. EDUCATION**

* **Education Levels:**
    * `Less than a high school diploma 2014-18`
    * `High school diploma only 2014-18`
    * `Some college or associate's degree 2014-18`
    * `Bachelor's degree or higher 2014-18`
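* **Corresponding Percentages:**
    * `Percent of adults with less than a high school diploma 2014-18`
    * `Percent of adults with a high school diploma only 2014-18`
    * `Percent of adults completing some college or associate's degree 2014-18`
    * `Percent of adults with a bachelor's degree or higher 2014-18`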
**6. POVERTY AND INCOME**

* `POVALL_2018`: Estimated number of people of all ages in poverty (2018)
* **Poverty Percentages:**
    * `PCTPOVALL_2018`: Percentage of people in poverty (2018)
    * `PCTPOV017_2018`: Percentage of people aged 0-17 in poverty (2018)
    * `PCTPOV517_2018`: Percentage of children aged 5-17 in poverty (2018)
* **Household Income:**
    * `MEDHHINC_2018`: Median household income estimate (2018)
    * `CI90LBINC_2018`: 90% confidence interval for median household income (lower bound)
    * `CI90UBINC_2018`: 90% confidence interval for median household income (upper bound)
    * `Med_HH_Income_Percent_of_State_Total_2018`: County median household income as a percent of state median (2018)
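
In the five rows printed in section 1.3, `MEDHHINC_2018` and `Median_Household_Income_2018` hold identical values. The sketch below shows one way to flag such duplicated columns before modeling; the values are copied from that output, and whether the two columns match across the full dataset is an assumption to verify on `df`.

```python
import pandas as pd

# Values copied from the five rows of the df.head() output.
sample = pd.DataFrame({
    "MEDHHINC_2018": [59338, 57588, 34382, 46064, 50412],
    "Median_Household_Income_2018": [59338, 57588, 34382, 46064, 50412],
    "CI90LBINC_2018": [53628, 54437, 31157, 41283, 46157],
})

# Transpose so each column becomes a row, then mark rows that repeat an earlier one.
duplicated_mask = sample.T.duplicated().to_numpy()
duplicate_columns = sample.columns[duplicated_mask].tolist()
print(duplicate_columns)
```

Columns reported this way are candidates for removal in step 2.4, since they add no information beyond the first copy.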

**7. EMPLOYMENT**

* `Civilian_labor_force_2018`: Civilian labor force annual average
* `Employed_2018`: Number of employed individuals (annual average)
* `Unemployed_2018`: Number of unemployed individuals (annual average)
* `Unemployment_rate_2018`: Unemployment rate
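
These four columns look tightly linked: the labor force seems to equal employed plus unemployed, and the rate seems to be the unemployed share of the labor force. A sketch that checks both relationships on the five rows from the `df.head()` output; treating them as exact for all counties is an assumption.

```python
import pandas as pd

# Values copied from the five rows of the df.head() output.
sample = pd.DataFrame({
    "Civilian_labor_force_2018": [25957, 93849, 8373, 8661, 25006],
    "Employed_2018": [25015, 90456, 7940, 8317, 24128],
    "Unemployed_2018": [942, 3393, 433, 344, 878],
    "Unemployment_rate_2018": [3.6, 3.6, 5.2, 4.0, 3.5],
})

# Check 1: employed + unemployed should reproduce the labor force.
labor_force_ok = (
    sample["Employed_2018"] + sample["Unemployed_2018"]
    == sample["Civilian_labor_force_2018"]
).all()

# Check 2: unemployed / labor force, as a percentage rounded to one decimal.
recomputed_rate = (
    sample["Unemployed_2018"] / sample["Civilian_labor_force_2018"] * 100
).round(1)
max_gap = (recomputed_rate - sample["Unemployment_rate_2018"]).abs().max()
print(labor_force_ok, max_gap)
```

When both checks hold, the rate already summarizes the three count columns relative to county size.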

**8. HEALTH RESOURCES AND INFRASTRUCTURE**

* **Health Professionals per 100,000 Population (2018):**
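    * `Active Physicians per 100000 Population 2018 (AAMC)`: Active physicians
    * `Total Active Patient Care Physicians per 100000 Population 2018 (AAMC)`: Active patient care physicians
    * `Active Primary Care Physicians per 100000 Population 2018 (AAMC)`: Active primary care physicians
    * `Active Patient Care Primary Care Physicians per 100000 Population 2018 (AAMC)`: Active patient care primary care physicians
    * `Active General Surgeons per 100000 Population 2018 (AAMC)`: Active general surgeons
    * `Active Patient Care General Surgeons per 100000 Population 2018 (AAMC)`: Active patient care general surgeons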
* **Nursing and medical assistance professionals (2019):**
    * `Total nurse practitioners (2019)`
    * `Total physician assistants (2019)`
* **Hospital infrastructure:**
    * `Total Hospitals (2019)`: Total number of hospitals
    * `ICU Beds_x`: Number of ICU beds per county
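
In the five rows printed in section 1.3, every AAMC per-100,000 column carries the same value for all Alabama counties (for example, 217.1 active physicians), which suggests these rates may be reported at state level rather than county level. The sketch below shows how that could be checked by counting distinct values per state; the rows are copied from that output, and whether the pattern holds for the full dataset is an assumption to verify on `df`.

```python
import pandas as pd

# Values copied from the five Alabama rows of the df.head() output.
sample = pd.DataFrame({
    "STATE_NAME": ["Alabama"] * 5,
    "COUNTY_NAME": ["Autauga", "Baldwin", "Barbour", "Bibb", "Blount"],
    "Active Physicians per 100000 Population 2018 (AAMC)": [217.1] * 5,
    "Total Hospitals (2019)": [1.148905, 4.505074, 0.514126, 0.46286, 1.195171],
})

value_cols = [
    "Active Physicians per 100000 Population 2018 (AAMC)",
    "Total Hospitals (2019)",
]

# One distinct value per state means the column does not vary between counties.
distinct_per_state = sample.groupby("STATE_NAME")[value_cols].nunique()
print(distinct_per_state)
```

A column with a single distinct value per state cannot explain differences between counties of the same state, which matters when choosing a county-level target.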

**9. HEALTH CONDITIONS PREVALENCE**

* **General health conditions:**
    * `anycondition_prevalence`
    * `anycondition_Lower 95% CI`
    * `anycondition_Upper 95% CI`
    * `anycondition_number`
* **Obesity prevalence:**
    * `Obesity_prevalence`
    * `Obesity_Lower 95% CI`
    * `Obesity_Upper 95% CI`
    * `Obesity_number`
* **Other conditions** (each with a prevalence rate, lower and upper 95% CI bounds, and a population count, following the same column pattern as above):
    * `Heart disease_prevalence`
    * `COPD_prevalence`
    * `diabetes_prevalence`
    * `CKD_prevalence`
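
Each `*_prevalence` column appears to be the matching `*_number` column divided by the adult population (`county_pop2018_18 and older`). The sketch below tests that reading on the five rows from the `df.head()` output; the relationship is inferred from those rows, not documented in the source, and the 0.1-point tolerance is an assumption that allows for rounding.

```python
import pandas as pd

# Values copied from the five rows of the df.head() output.
sample = pd.DataFrame({
    "county_pop2018_18 and older": [42438, 170912, 19689, 17813, 44448],
    "anycondition_number": [20181, 68790, 11325, 9190, 20584],
    "anycondition_prevalence": [47.6, 40.2, 57.5, 51.6, 46.3],
})

# Recompute prevalence as a percentage of the adult population.
recomputed = (
    sample["anycondition_number"] / sample["county_pop2018_18 and older"] * 100
)
gap = (recomputed - sample["anycondition_prevalence"]).abs()
print(gap.round(3).tolist())
```

If the gaps stay within rounding error on the full dataset, the prevalence, count, and confidence-interval columns for a condition carry largely the same signal, so using several of them as predictors of one another would leak the target.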