# US county-level data analysis solution

### US county-level sociodemographic and health resource data (2018-2019)

Sociodemographic and health resource data have been collected by county in the United States, and we want to find out whether there is any **relationship between health resources and sociodemographic data.**

To do this, choose a health-related target variable for the analysis.




#### Step 3: Build a linear regression model in Python

Start by implementing a linear regression model and analyzing the results. Then, using the same data and default attributes, build a Lasso model and compare its results with the baseline linear regression.

Analyze how $R^2$ evolves as the Lasso regularization hyperparameter (`alpha`) changes (for example, start from a value of 0.0 and work up to 20). Plot these values in a line diagram.
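The sweep described above can be sketched as follows. This is a minimal illustration, not the project's actual analysis: it uses a synthetic dataset from `make_regression` as a stand-in for the county data, the grid of 25 `alpha` values is an arbitrary choice, and `plot_r2_curve` is a hypothetical helper name. The grid starts just above 0 because scikit-learn advises against `Lasso(alpha=0)`, which is equivalent to ordinary least squares.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the county data (assumption: not the real dataset)
X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Alpha grid from just above 0 up to 20
alphas = np.linspace(0.01, 20.0, 25)
r2_scores = []
for alpha in alphas:
    model = Lasso(alpha=alpha, max_iter=10_000)
    model.fit(X_train, y_train)
    r2_scores.append(r2_score(y_test, model.predict(X_test)))


def plot_r2_curve(alphas, r2_scores):
    """Draw test-set R2 against alpha as a line diagram (needs matplotlib)."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot(alphas, r2_scores, marker="o")
    ax.set_xlabel("alpha")
    ax.set_ylabel("R2 on test set")
    ax.set_title("Lasso: R2 vs. alpha")
    return fig
```

Calling `plot_r2_curve(alphas, r2_scores)` in a notebook cell would render the line diagram; with the real data, `X_train`, `X_test`, `y_train`, and `y_test` would come from the processed train and test files instead.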

#### Step 4: Optimize the previous linear regression model using Python

After training the Lasso model, if the results are not satisfactory, optimize it using one of the techniques seen above.
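One possible way to optimize the Lasso model is a cross-validated search over `alpha`. The sketch below assumes that this is an acceptable technique for the exercise (the original text does not say which techniques it has in mind), uses synthetic data rather than the county dataset, and picks an arbitrary logarithmic grid of candidate values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the county data (assumption: not the real dataset)
X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Candidate regularization strengths (illustrative grid)
alphas = np.logspace(-3, 2, 30)

# 5-fold cross-validation on the training set picks the alpha with the lowest error
search = LassoCV(alphas=alphas, cv=5, max_iter=10_000)
search.fit(X_train, y_train)

best_alpha = search.alpha_
test_r2 = search.score(X_test, y_test)
print(f"Best alpha: {best_alpha:.4f}")
print(f"R2 on test set: {test_r2:.4f}")
```

Because the search only sees the training rows, the test set stays untouched until the final `score` call.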

Data Dictionary:  
This dataset contains a large number of variables, which are defined in the [data_dictionary](/workspaces/Project_ML_Linear_Regression_regularized_VL/data_dictionary.xlsx).

Or you can view it below:

In [2]:
import pandas as pd
from IPython.display import display, HTML

# Read the Excel file
data_dict = pd.read_excel("data_dictionary.xlsx", engine="openpyxl")

# Render the whole table as HTML without truncating rows or columns
html = data_dict.to_html(max_rows=None, max_cols=None)

# Using IPython.display together with to_html() shows the full table
# for this cell only, without changing pandas' global display settings.

# CSS style to left-align the table
css = """
<style>
    table { text-align: left !important; }
    th, td { text-align: left !important; padding: 4px; }
</style>
"""

# Display the table with the style applied
display(HTML(css + html))


Unnamed: 0,Feature,Unnamed: 1,Unnamed: 2
0,fips,FIPS Code for the County,Column1
1,TOT_POP,Total Population,This data as well as all Age and Race data is pulled from the 2019 Population Estimates of the US Census
2,0-9,Population aged 0-9,All of the other age columns are the same but with varying age
3,0-9 y/o % of total pop,% of the population aged 0-9,
4,10-19',,
5,10-19 y/o % of total pop,,
6,20-29,,
7,20-29 y/o % of total pop,,
8,30-39,,
9,30-39 y/o % of total pop,,


---

#### 🟠Step 1: Loading the CSV into a Python dataset

>The dataset can be found in this project folder under the name demographic_health_data.csv. You can load it into the code directly from the link:
>
>https://raw.githubusercontent.com/4GeeksAcademy/regularized-linear-regression-project-tutorial/main/demographic_health_data.csv
>
>Or download it and add it by hand in your repository. 

---

In [6]:
import pandas as pd

# Load the dataset directly from the URL
url = "https://raw.githubusercontent.com/4GeeksAcademy/regularized-linear-regression-project-tutorial/main/demographic_health_data.csv"
df_raw = pd.read_csv(url)

# Look at a random sample of the DataFrame
df_raw.sample(10,random_state=2025)

Unnamed: 0,fips,TOT_POP,0-9,0-9 y/o % of total pop,19-Oct,10-19 y/o % of total pop,20-29,20-29 y/o % of total pop,30-39,30-39 y/o % of total pop,...,COPD_number,diabetes_prevalence,diabetes_Lower 95% CI,diabetes_Upper 95% CI,diabetes_number,CKD_prevalence,CKD_Lower 95% CI,CKD_Upper 95% CI,CKD_number,Urban_rural_code
120,5023,24965,2423,9.705588,2757,11.043461,2332,9.341078,2541,10.17825,...,2302,14.6,13.3,15.8,2946,3.9,3.6,4.2,790,6
686,17187,17032,2047,12.018553,2427,14.249648,2193,12.875763,1901,11.161343,...,1175,11.5,10.5,12.5,1525,3.2,3.0,3.5,430,6
3098,55105,163129,19981,12.248589,21849,13.393695,19982,12.249202,20353,12.476629,...,8680,10.4,9.5,11.2,13020,3.0,2.8,3.3,3784,4
1512,29063,12630,1193,9.445764,1189,9.414093,1835,14.528899,1942,15.376089,...,984,12.0,11.1,13.0,1258,3.1,2.9,3.3,322,4
2836,51036,6941,536,7.72223,636,9.162945,734,10.574845,670,9.652788,...,600,17.7,16.7,18.7,1043,4.3,4.1,4.6,254,2
4,1009,57840,7095,12.266598,7570,13.087828,6742,11.656293,6884,11.901798,...,4661,13.5,12.6,14.5,6017,3.4,3.2,3.7,1507,2
2559,48077,10456,1041,9.956006,1271,12.1557,998,9.544759,1070,10.233359,...,771,13.5,12.3,14.6,1125,3.5,3.3,3.8,294,4
2727,48413,2895,339,11.709845,490,16.925734,293,10.120898,384,13.264249,...,154,14.2,13.3,15.1,306,3.4,3.2,3.7,74,6
2311,44005,82542,7243,8.774927,9194,11.138572,10520,12.745027,9527,11.542003,...,4598,9.3,8.4,10.4,6410,2.8,2.6,3.1,1922,2
1306,26155,68192,7384,10.82825,8728,12.799155,8258,12.109925,7597,11.140603,...,5037,11.0,10.1,11.9,5936,3.1,2.9,3.4,1682,5


---

#### 🟠Step 2: Perform a full EDA (Exploratory Data Analysis)

>This second step is vital to ensure that we keep the variables that are strictly necessary and eliminate those that are not relevant or do not provide information. Use the example Notebook we worked on and adapt it to this use case.
>
>Be sure to conveniently divide the data set into train and test as we have seen in previous lessons.

---

> **⭐EXPLORING THE DATA**


In [7]:
df_raw.shape  # shape is an attribute (a tuple), so it takes no parentheses
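A first pass over a freshly loaded table usually checks its dimensions, duplicated rows, missing values, and columns that carry no information. The sketch below runs those checks on a tiny made-up frame; the values and the `constant_col` and `with_gaps` columns are hypothetical and are not taken from the county CSV.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame (hypothetical values, not rows from the county CSV)
df = pd.DataFrame({
    "fips": [1001, 1003, 1003, 1005],
    "TOT_POP": [55601, 218022, 218022, 24881],
    "constant_col": [1, 1, 1, 1],
    "with_gaps": [0.5, 0.2, 0.2, np.nan],
})

n_rows, n_cols = df.shape                   # shape is an attribute, not a method
n_duplicates = int(df.duplicated().sum())   # rows that repeat an earlier row
missing_per_column = df.isna().sum()        # NaN count per column
# Columns with a single distinct value cannot help a regression model
constant_columns = [c for c in df.columns if df[c].nunique(dropna=False) == 1]

print(f"Rows: {n_rows}, columns: {n_cols}")
print(f"Duplicated rows: {n_duplicates}")
print(f"Constant columns: {constant_columns}")
```

On the real data the same calls would be applied to `df_raw` in place of `df`.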

> **⭐CLEANING**


In [None]:
total_data = df_raw.drop_duplicates().reset_index(drop = True)
total_data.head()

Unnamed: 0,fips,TOT_POP,0-9,0-9 y/o % of total pop,19-Oct,10-19 y/o % of total pop,20-29,20-29 y/o % of total pop,30-39,30-39 y/o % of total pop,...,COPD_number,diabetes_prevalence,diabetes_Lower 95% CI,diabetes_Upper 95% CI,diabetes_number,CKD_prevalence,CKD_Lower 95% CI,CKD_Upper 95% CI,CKD_number,Urban_rural_code
0,1001,55601,6787,12.206615,7637,13.735364,6878,12.370281,7089,12.749771,...,3644,12.9,11.9,13.8,5462,3.1,2.9,3.3,1326,3
1,1003,218022,24757,11.355276,26913,12.344167,23579,10.814964,25213,11.564429,...,14692,12.0,11.0,13.1,20520,3.2,3.0,3.5,5479,4
2,1005,24881,2732,10.980266,2960,11.896628,3268,13.13452,3201,12.865239,...,2373,19.7,18.6,20.6,3870,4.5,4.2,4.8,887,6
3,1007,22400,2456,10.964286,2596,11.589286,3029,13.522321,3113,13.897321,...,1789,14.1,13.2,14.9,2511,3.3,3.1,3.6,595,2
4,1009,57840,7095,12.266598,7570,13.087828,6742,11.656293,6884,11.901798,...,4661,13.5,12.6,14.5,6017,3.4,3.2,3.7,1507,2


In [None]:
from sklearn.preprocessing import StandardScaler

data_types = total_data.dtypes
numeric_columns = [c for c in list(data_types[data_types != "object"].index) if c != "Heart disease_number"]

scaler = StandardScaler()
norm_features = scaler.fit_transform(total_data[numeric_columns])

# Create a new DataFrame with the scaled numerical variables
total_data_scal = pd.DataFrame(norm_features, index = total_data.index, columns = numeric_columns)
total_data_scal["Heart disease_number"] = total_data["Heart disease_number"]
total_data_scal.head()

Unnamed: 0,fips,TOT_POP,0-9,0-9 y/o % of total pop,19-Oct,10-19 y/o % of total pop,20-29,20-29 y/o % of total pop,30-39,30-39 y/o % of total pop,...,diabetes_prevalence,diabetes_Lower 95% CI,diabetes_Upper 95% CI,diabetes_number,CKD_prevalence,CKD_Lower 95% CI,CKD_Upper 95% CI,CKD_number,Urban_rural_code,Heart disease_number
0,-1.940874,-0.145679,-0.142421,0.158006,-0.135556,0.573496,-0.153144,0.02761,-0.139384,0.588469,...,-0.063696,-0.07172,-0.089834,-0.129902,-0.609615,-0.582796,-0.669652,-0.147523,-1.082865,3345
1,-1.940742,0.341296,0.287476,-0.242861,0.320383,-0.193107,0.183774,-0.469965,0.23062,-0.1103,...,-0.394103,-0.4149,-0.337677,0.376251,-0.433549,-0.393279,-0.343373,0.389791,-0.420704,13414
2,-1.94061,-0.237785,-0.239429,-0.419441,-0.246181,-0.439718,-0.225971,0.272104,-0.218759,0.656538,...,2.432709,2.483064,2.317776,-0.183415,1.855312,1.880929,1.777443,-0.204321,0.903618,2159
3,-1.940478,-0.245223,-0.246032,-0.426966,-0.254791,-0.609076,-0.230792,0.396168,-0.220555,1.264959,...,0.376846,0.423984,0.299632,-0.229096,-0.257483,-0.203761,-0.180233,-0.2421,-1.745026,1533
4,-1.940346,-0.138966,-0.135053,0.186249,-0.13714,0.216679,-0.155888,-0.200808,-0.14357,0.088582,...,0.156575,0.195197,0.158008,-0.111247,-0.081417,-0.014244,-0.017093,-0.124105,-1.745026,4101
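The cell above fits `StandardScaler` on every row before the train/test split, so the scaling statistics also include the test rows. An alternative ordering, sketched here on synthetic data under the assumption that avoiding this kind of leakage matters for the exercise, is to split first and fit the scaler on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and target (assumption: stand-ins for the county columns)
rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)

# Split first, so the test rows never influence the scaling statistics
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # mean and std learned from train only
X_test_scaled = scaler.transform(X_test)        # test rows reuse the train statistics
```

The scaled test features will generally not have exactly zero mean, which is expected: they are expressed in units of the training distribution.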


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_regression

X = total_data_scal.drop(columns=["Heart disease_number"])
y = total_data_scal["Heart disease_number"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
train_indices = list(X_train.index)
test_indices = list(X_test.index)

k = int(len(X_train.columns) * 0.3)
selection_model = SelectKBest(score_func = f_regression, k = k)
selection_model.fit(X_train, y_train)
ix = selection_model.get_support()

X_train_sel = pd.DataFrame(selection_model.transform(X_train), columns = X_train.columns.values[ix])
X_test_sel = pd.DataFrame(selection_model.transform(X_test), columns = X_test.columns.values[ix])

X_train_sel.head()

Unnamed: 0,TOT_POP,0-9,19-Oct,20-29,30-39,40-49,50-59,60-69,70-79,80+,...,Family Medicine/General Practice Primary Care (2019),Total Specialist Physicians (2019),Total Population,Population Aged 60+,county_pop2018_18 and older,anycondition_number,Obesity_number,COPD_number,diabetes_number,CKD_number
0,-0.232556,-0.227731,-0.234284,-0.232951,-0.226353,-0.231316,-0.229599,-0.233425,-0.23468,-0.23442,...,-0.212643,-0.20859,-0.231195,-0.229737,-0.233171,-0.23437,-0.232975,-0.223516,-0.218609,-0.219329
1,-0.158676,-0.178665,-0.180166,-0.188266,-0.17507,-0.161168,-0.134688,-0.105618,-0.11927,-0.091822,...,-0.11668,-0.11085,-0.150293,-0.098866,-0.152859,-0.142645,-0.155304,-0.11008,-0.131449,-0.130962
2,-0.199114,-0.211128,-0.195138,-0.166782,-0.195036,-0.194045,-0.199725,-0.219256,-0.222207,-0.205154,...,-0.192263,-0.217668,-0.197005,-0.216056,-0.195125,-0.193205,-0.201976,-0.193106,-0.189197,-0.206391
3,-0.036595,-0.037734,-0.017077,-0.057986,-0.052252,-0.033158,-0.020228,-0.032603,-0.023876,-0.046224,...,0.062458,-0.107888,-0.03694,-0.030034,-0.039882,-0.003321,0.006163,-0.007077,-0.047515,-0.045054
4,0.090839,0.09468,0.101662,0.056721,0.042392,0.068095,0.101699,0.144664,0.140685,0.166099,...,0.274818,0.194913,0.097767,0.161314,0.088485,0.165555,0.18274,0.265603,0.12304,0.132454


In [None]:
X_test_sel.head()

Unnamed: 0,TOT_POP,0-9,19-Oct,20-29,30-39,40-49,50-59,60-69,70-79,80+,...,Family Medicine/General Practice Primary Care (2019),Total Specialist Physicians (2019),Total Population,Population Aged 60+,county_pop2018_18 and older,anycondition_number,Obesity_number,COPD_number,diabetes_number,CKD_number
0,-0.285286,-0.285362,-0.294836,-0.269566,-0.258568,-0.268541,-0.289649,-0.312989,-0.316763,-0.286734,...,-0.303292,-0.285225,-0.284324,-0.308211,-0.283698,-0.302439,-0.302292,-0.324038,-0.27629,-0.281172
1,0.496553,0.433072,0.39217,0.544659,0.453677,0.39148,0.499744,0.668639,0.716353,0.476084,...,0.853184,0.424904,0.477184,0.620724,0.517408,0.52736,0.516364,0.443806,0.418504,0.454092
2,-0.260191,-0.255123,-0.265837,-0.246628,-0.234723,-0.240703,-0.264552,-0.289867,-0.289846,-0.290962,...,-0.277451,-0.261868,-0.257294,-0.287868,-0.259943,-0.249299,-0.259877,-0.225107,-0.22597,-0.242229
3,0.039389,0.058341,0.059701,-0.018647,0.003236,0.030594,0.074401,0.091003,0.060721,0.005012,...,0.197267,0.130719,0.036299,0.055281,0.031494,0.035274,0.026108,0.136643,0.003409,0.022352
4,0.364272,0.281232,0.323623,0.525353,0.29599,0.288317,0.298029,0.461297,0.49776,0.351393,...,0.659217,0.305024,0.336581,0.423969,0.390596,0.271127,0.273318,0.329669,0.25662,0.334804


In [None]:
import os

X_train_sel["Heart disease_number"] = list(y_train)
X_test_sel["Heart disease_number"] = list(y_test)

# Create the output directory first so that to_csv does not raise OSError
os.makedirs("../data/processed", exist_ok = True)
X_train_sel.to_csv("../data/processed/clean_train.csv", index = False)
X_test_sel.to_csv("../data/processed/clean_test.csv", index = False)

In [None]:
total_data = pd.concat([X_train_sel, X_test_sel])
total_data.head()

Unnamed: 0,TOT_POP,0-9,19-Oct,20-29,30-39,40-49,50-59,60-69,70-79,80+,...,Total Specialist Physicians (2019),Total Population,Population Aged 60+,county_pop2018_18 and older,anycondition_number,Obesity_number,COPD_number,diabetes_number,CKD_number,Heart disease_number
0,-0.232556,-0.227731,-0.234284,-0.232951,-0.226353,-0.231316,-0.229599,-0.233425,-0.23468,-0.23442,...,-0.20859,-0.231195,-0.229737,-0.233171,-0.23437,-0.232975,-0.223516,-0.218609,-0.219329,2072
1,-0.158676,-0.178665,-0.180166,-0.188266,-0.17507,-0.161168,-0.134688,-0.105618,-0.11927,-0.091822,...,-0.11085,-0.150293,-0.098866,-0.152859,-0.142645,-0.155304,-0.11008,-0.131449,-0.130962,3796
2,-0.199114,-0.211128,-0.195138,-0.166782,-0.195036,-0.194045,-0.199725,-0.219256,-0.222207,-0.205154,...,-0.217668,-0.197005,-0.216056,-0.195125,-0.193205,-0.201976,-0.193106,-0.189197,-0.206391,2222
3,-0.036595,-0.037734,-0.017077,-0.057986,-0.052252,-0.033158,-0.020228,-0.032603,-0.023876,-0.046224,...,-0.107888,-0.03694,-0.030034,-0.039882,-0.003321,0.006163,-0.007077,-0.047515,-0.045054,5484
4,0.090839,0.09468,0.101662,0.056721,0.042392,0.068095,0.101699,0.144664,0.140685,0.166099,...,0.194913,0.097767,0.161314,0.088485,0.165555,0.18274,0.265603,0.12304,0.132454,8686


In [None]:
X_test_sel.head()

Unnamed: 0,TOT_POP,0-9,19-Oct,20-29,30-39,40-49,50-59,60-69,70-79,80+,...,Total Specialist Physicians (2019),Total Population,Population Aged 60+,county_pop2018_18 and older,anycondition_number,Obesity_number,COPD_number,diabetes_number,CKD_number,Heart disease_number
0,-0.285286,-0.285362,-0.294836,-0.269566,-0.258568,-0.268541,-0.289649,-0.312989,-0.316763,-0.286734,...,-0.285225,-0.284324,-0.308211,-0.283698,-0.302439,-0.302292,-0.324038,-0.27629,-0.281172,698
1,0.496553,0.433072,0.39217,0.544659,0.453677,0.39148,0.499744,0.668639,0.716353,0.476084,...,0.424904,0.477184,0.620724,0.517408,0.52736,0.516364,0.443806,0.418504,0.454092,13982
2,-0.260191,-0.255123,-0.265837,-0.246628,-0.234723,-0.240703,-0.264552,-0.289867,-0.289846,-0.290962,...,-0.261868,-0.257294,-0.287868,-0.259943,-0.249299,-0.259877,-0.225107,-0.22597,-0.242229,1768
3,0.039389,0.058341,0.059701,-0.018647,0.003236,0.030594,0.074401,0.091003,0.060721,0.005012,...,0.130719,0.036299,0.055281,0.031494,0.035274,0.026108,0.136643,0.003409,0.022352,6739
4,0.364272,0.281232,0.323623,0.525353,0.29599,0.288317,0.298029,0.461297,0.49776,0.351393,...,0.305024,0.336581,0.423969,0.390596,0.271127,0.273318,0.329669,0.25662,0.334804,11305


## Logistic Regression model

In [None]:
train_data = pd.read_csv("../data/processed/clean_train.csv")
test_data = pd.read_csv("../data/processed/clean_test.csv")

train_data.head()

Unnamed: 0,TOT_POP,0-9,19-Oct,20-29,30-39,40-49,50-59,60-69,70-79,80+,...,Total Specialist Physicians (2019),Total Population,Population Aged 60+,county_pop2018_18 and older,anycondition_number,Obesity_number,COPD_number,diabetes_number,CKD_number,Heart disease_number
0,-0.232556,-0.227731,-0.234284,-0.232951,-0.226353,-0.231316,-0.229599,-0.233425,-0.23468,-0.23442,...,-0.20859,-0.231195,-0.229737,-0.233171,-0.23437,-0.232975,-0.223516,-0.218609,-0.219329,2072
1,-0.158676,-0.178665,-0.180166,-0.188266,-0.17507,-0.161168,-0.134688,-0.105618,-0.11927,-0.091822,...,-0.11085,-0.150293,-0.098866,-0.152859,-0.142645,-0.155304,-0.11008,-0.131449,-0.130962,3796
2,-0.199114,-0.211128,-0.195138,-0.166782,-0.195036,-0.194045,-0.199725,-0.219256,-0.222207,-0.205154,...,-0.217668,-0.197005,-0.216056,-0.195125,-0.193205,-0.201976,-0.193106,-0.189197,-0.206391,2222
3,-0.036595,-0.037734,-0.017077,-0.057986,-0.052252,-0.033158,-0.020228,-0.032603,-0.023876,-0.046224,...,-0.107888,-0.03694,-0.030034,-0.039882,-0.003321,0.006163,-0.007077,-0.047515,-0.045054,5484
4,0.090839,0.09468,0.101662,0.056721,0.042392,0.068095,0.101699,0.144664,0.140685,0.166099,...,0.194913,0.097767,0.161314,0.088485,0.165555,0.18274,0.265603,0.12304,0.132454,8686


In [None]:
X_train = train_data.drop(["Heart disease_number"], axis = 1)
y_train = train_data["Heart disease_number"]
X_test = test_data.drop(["Heart disease_number"], axis = 1)
y_test = test_data["Heart disease_number"]

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [None]:
print(f"Intercept (a): {model.intercept_}")
print(f"Coefficients: {model.coef_}")

Intercept (a): [-0.32477388 -0.32512664 -0.32345778 ... -2.27977069 -2.24523414
 -2.16810319]
Coefficients: [[-0.0716679  -0.06694374 -0.07335034 ... -0.11947433 -0.08328931
  -0.08395662]
 [-0.07180893 -0.067598   -0.07399394 ... -0.11947201 -0.08315567
  -0.08361624]
 [-0.07150937 -0.06695063 -0.07336974 ... -0.11920636 -0.08302182
  -0.08364371]
 ...
 [ 0.21757     0.26788751  0.31085772 ...  0.27388659 -0.08404334
   0.31300761]
 [ 0.19507433  0.14462827  0.10185451 ...  0.23773847  0.11894147
   0.30174769]
 [ 0.22991076  0.1045326   0.12900637 ...  0.03806663  0.41323692
   0.28777103]]


In [None]:
y_pred = model.predict(X_test)
y_pred

array([ 1072,  8689,  1072,  8689,  7128,  1072,  1072,  1072,  1072,
        1072,  1072,  1072,  3376,  1072,  7128,  1072, 75432,  1072,
        1072,  1072,  1072,  1072,  1072,  1072, 40686,  1072,  1072,
        1072,  1072,  1072,  1072,  1072,  1072,  1072,  7128,  1072,
        1072,  1072,  1072,  1072,  1072,  7128,  1072,  1072,  1072,
        1072,  1072,  7011,  1072,  1072, 32863,  1072,  1072,  1072,
        8506,  1072, 31550,  1072,  1072,  1072,  7128,  1072,  1072,
        8506,  1072,  3376,  1072,  1072,  1072, 38899,  1072,  1072,
        1072,  1072,  1072,  1448, 32863,  1072,  1072,  1072,  1072,
        3376, 12367,  1072,  1072,  1072, 32828,  1072,  1072,  7128,
        1072,  1072,  1072,  1072,  1072,  1072,  1072,  1072,  8506,
        1072, 76128,  8689,  1072,  1072,  1072,  1072,  1072,  1072,
        7128, 23631,  1072,  7011, 25091,  3376,  1072, 16376,  1072,
        1072,  7128,  1072,  1072,  1072,  1072,  1072,  1072,  1072,
        3376,  1072,

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

print(f"MSE: {mean_squared_error(y_test, y_pred)}")
print(f"R2 Score: {r2_score(y_test, y_pred)}")

MSE: 22887128.205414012
R2 Score: 0.7385089576924445
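`Heart disease_number` is a continuous count, whereas `LogisticRegression` is a classifier: it treats every distinct value in `y_train` as a class label, so its predictions can only be values already seen during training. The baseline that Step 3 describes is a regressor. A minimal sketch with `LinearRegression` follows, run on synthetic data rather than the processed county files, so its numbers are not comparable with the output above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the processed train/test data (assumption)
X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Ordinary least squares baseline
baseline = LinearRegression()
baseline.fit(X_train, y_train)
y_pred = baseline.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Intercept (a): {baseline.intercept_:.3f}")
print(f"Number of coefficients: {baseline.coef_.shape[0]}")
print(f"MSE: {mse:.3f}")
print(f"R2 Score: {r2:.3f}")
```

With a regressor there is one intercept and one coefficient per feature, instead of one row of coefficients per class.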


## Model optimization

In [None]:
from sklearn.linear_model import Lasso

alpha = 1.0
lasso_model = Lasso(alpha = alpha)

# Training the model
lasso_model.fit(X_train, y_train)

# We evaluate the performance of the model on the test data
score = lasso_model.score(X_test, y_test)
print("Coefficients:", lasso_model.coef_)
print("R2 score:", score)

Coefficients: [ 5103.56606854  1192.14607797 -1921.90787729  -804.66413704
  -565.56094295  4161.43524651   552.93901319 -1080.72356488
  3459.52199626  1245.55139019   999.98373671 -5424.05510818
   198.01474247  -841.91968637  -371.17714777  2792.66368052
   324.15462356    19.27250203 -1918.35380595    88.70731834
   492.63537754  -461.24851762  -854.99744188 -2893.16049233
  3359.53043536   204.58782867  1925.46994753  2907.60993035
  5383.37174712   819.64992462 -2329.43640877]
R2 score: 0.9978911816625889
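Lasso can shrink some coefficients exactly to zero, and its coordinate descent solver may need more iterations than the default to converge. The sketch below, on synthetic data with an arbitrary `max_iter` value, shows how to count the surviving coefficients and read the number of iterations actually used.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data with only a few informative features (assumption)
X, y = make_regression(n_samples=300, n_features=30, n_informative=5,
                       noise=10.0, random_state=1)

# A higher iteration cap gives coordinate descent more room to converge
lasso = Lasso(alpha=1.0, max_iter=50_000)
lasso.fit(X, y)

n_nonzero = int(np.count_nonzero(lasso.coef_))  # features kept by the model
n_iterations = lasso.n_iter_                    # iterations the solver used
print(f"Non-zero coefficients: {n_nonzero} of {lasso.coef_.size}")
print(f"Iterations used: {n_iterations}")
```

If the iterations used reach the cap, raising `max_iter` further or loosening `tol` are the usual knobs to try.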




In [None]:
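import os
from pickle import dump

os.makedirs("../models", exist_ok=True)  # create the directory if it is missing
with open("../models/lasso_alpha-1.0.sav", "wb") as f:  # closes the file after writing
    dump(lasso_model, f)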
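A saved model is only useful if it can be restored later. The round trip can be sketched as follows, using a toy Lasso model and a temporary directory; the file name and the four-point dataset are illustrative assumptions, not part of the project.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import Lasso

# Toy model fitted on four points (assumption: stands in for lasso_model)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
model = Lasso(alpha=0.1).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "lasso_alpha-0.1.sav")
    # Save the fitted model
    with open(path, "wb") as f:
        pickle.dump(model, f)
    # Load it back, as a later session would
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions
same_predictions = np.allclose(model.predict(X), restored.predict(X))
print(f"Predictions match after reload: {same_predictions}")
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.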