# Proyecto Modelos Lineales Regularizados - Luis Alpizar

## Datos socio demográficos y de recursos de salud a nivel de condado de EE. UU. (2018-2019)

Se han recopilado datos socio demográficos y de recursos de salud por condado en los Estados Unidos y queremos descubrir si existe alguna relación entre los recursos sanitarios y los datos socio demográficos.

Para ello, es necesario que establezcas una variable objetivo (relacionada con la salud) para llevar a cabo el análisis

## Paso 0: Importar Librerias

In [1]:
# Importacion de Librerias
# Registro
import logging

# Manejo de Datos y Análisis
import pandas as pd
import numpy as np

# Visualización de Datos
import seaborn as sns
import matplotlib.pyplot as plt

#Preprocesamiento
from sklearn.model_selection import train_test_split # División de Datos Test / Train
from sklearn.preprocessing import StandardScaler # Escalado de Datos


#Pipelines
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Machine Learning
from sklearn.linear_model import LogisticRegression #Modelado
#Metricas
from sklearn.metrics import (
    mean_squared_error,
    r2_score,
)
from sklearn.feature_selection import( #Seleccionar las características más relevantes
     SelectKBest,
     f_regression
     )

from sklearn.linear_model import Lasso #Regularización

# Configuración del registro
logger = logging.getLogger()
logger.setLevel(logging.ERROR)



## Paso 1: Carga del dataset

In [2]:
raw = ('https://raw.githubusercontent.com/4GeeksAcademy/regularized-linear-regression-project-tutorial/main/demographic_health_data.csv')
df_raw = pd.read_csv(raw, sep=',')
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3140 entries, 0 to 3139
Columns: 108 entries, fips to Urban_rural_code
dtypes: float64(61), int64(45), object(2)
memory usage: 2.6+ MB


In [3]:
df_raw.isna().sum()
df_raw.isnull().sum()

fips                      0
TOT_POP                   0
0-9                       0
0-9 y/o % of total pop    0
19-Oct                    0
                         ..
CKD_prevalence            0
CKD_Lower 95% CI          0
CKD_Upper 95% CI          0
CKD_number                0
Urban_rural_code          0
Length: 108, dtype: int64


## 2. Data preprocessing

In [4]:
df_baking = df_raw.copy()
df_baking['data_types = df_baking.dtypes'] = df_baking['COUNTY_NAME'].astype('category')
df_baking['STATE_NAME'] = df_baking['STATE_NAME'].astype('category')
df_baking.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3140 entries, 0 to 3139
Columns: 109 entries, fips to data_types = df_baking.dtypes
dtypes: category(2), float64(61), int64(45), object(1)
memory usage: 2.7+ MB


In [14]:
display(df_baking.describe(include='number').T)
display(df_baking.describe(include='category').T)

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
fips,3140.0,30401.640764,15150.559265,1001.0,18180.500000,29178.000000,45081.50000,5.604500e+04
TOT_POP,3140.0,104189.412420,333583.395432,88.0,10963.250000,25800.500000,67913.00000,1.010552e+07
0-9,3140.0,12740.302866,41807.301846,0.0,1280.500000,3057.000000,8097.00000,1.208253e+06
0-9 y/o % of total pop,3140.0,11.871051,2.124081,0.0,10.594639,11.802727,12.95184,2.546068e+01
19-Oct,3140.0,13367.976752,42284.392134,0.0,1374.500000,3274.000000,8822.25000,1.239139e+06
...,...,...,...,...,...,...,...,...
CKD_prevalence,3140.0,3.446242,0.568059,1.8,3.100000,3.400000,3.80000,6.200000e+00
CKD_Lower 95% CI,3140.0,3.207516,0.527740,1.7,2.900000,3.200000,3.50000,5.800000e+00
CKD_Upper 95% CI,3140.0,3.710478,0.613069,1.9,3.300000,3.700000,4.10000,6.600000e+00
CKD_number,3140.0,2466.234076,7730.422067,3.0,314.750000,718.000000,1776.25000,2.377660e+05


Unnamed: 0,count,unique,top,freq
COUNTY_NAME,3140,1841,Washington,31
STATE_NAME,3140,51,Texas,254


### Escalado de datos numericos

In [5]:
df_numeric = df_baking.select_dtypes(include=['number']).drop(columns=["Heart disease_number"])

scaler = StandardScaler()
df_standarized = scaler.fit_transform(df_numeric)

df_standarized = pd.DataFrame(df_standarized, columns=df_numeric.columns)
df_standarized["Heart disease_number"] = df_baking["Heart disease_number"].values
#df_standarized["COUNTY_NAME"] = df_baking["COUNTY_NAME"].values
#df_standarized["STATE_NAME"] = df_baking["STATE_NAME"].values



In [6]:
df = df_standarized.copy()


## 3. Exploratory Data Analysis

In [7]:
#División de los datos en entrenamiento y prueba
df_train, df_test = train_test_split(df, random_state=2025,test_size=0.2)
df_train = df_train.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)


In [52]:
display(df_train.describe(include='number').T)
display(df_train.describe(include='category').T)

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
fips,2512.0,0.005484,1.001834,-1.940742,-0.751850,-0.081174,1.031008,1.692838
TOT_POP,2512.0,0.002685,1.004850,-0.312120,-0.279302,-0.233430,-0.105110,29.986269
0-9,2512.0,0.003823,1.008623,-0.304787,-0.274202,-0.230446,-0.105693,28.600342
0-9 y/o % of total pop,2512.0,0.005208,1.007745,-5.589683,-0.593781,-0.029668,0.517635,6.398903
19-Oct,2512.0,0.003450,1.006547,-0.316195,-0.283595,-0.238116,-0.100537,28.993352
...,...,...,...,...,...,...,...,...
CKD_Lower 95% CI,2512.0,-0.011000,1.004501,-2.667486,-0.772313,-0.014244,0.554308,4.913206
CKD_Upper 95% CI,2512.0,-0.009755,1.006261,-2.790468,-0.669652,-0.017093,0.635465,4.713958
CKD_number,2512.0,0.000544,0.994752,-0.318692,-0.277808,-0.224892,-0.087426,30.443001
Urban_rural_code,2512.0,-0.010544,1.003597,-2.407187,-1.082865,0.241457,0.903618,0.903618


Unnamed: 0,count,unique,top,freq
COUNTY_NAME,2512,1558,Washington,25
STATE_NAME,2512,51,Texas,201


Analisis univariado

In [None]:
df_train.hist(bins=30, figsize=(10, 8))
plt.show()

- Análisis univariado de variables categóricas

In [None]:
fig, ax = plt.subplots(2, 1, figsize=(10, 8))  # Ajustar el tamaño de la figura


sns.countplot(data=df_train, y='COUNTY_NAME', ax=ax[0])
sns.countplot(data=df_train, y='STATE_NAME', ax=ax[1])

plt.tight_layout()
plt.show()

- Analisis bivariado

In [None]:
sns.pairplot(data=df_train, corner=True)
plt.tight_layout()
plt.show()

- Correalaciones

In [None]:
sns.heatmap(data=df_train.select_dtypes('number').corr(),vmin=-1,vmax=1,cmap='RdBu',annot=True)
plt.show()

### 3. Machine Learning

In [8]:
X_train = df_train.drop(['Heart disease_number'],axis=1)
y_train = df_train['Heart disease_number']

X_test = df_train.drop(['Heart disease_number'],axis=1)
y_test = df_train['Heart disease_number']

In [9]:
k30 = int(len(X_train.columns) * 0.3)  #30% del total de variables en X_train.
selection_model = SelectKBest(score_func=f_regression, k=k30) #Selecciona las mejores variables basado en f_regression
selection_model.fit(X_train, y_train)
ix = selection_model.get_support() #Arreglo con las variables escogidas

In [10]:
X_train_best = pd.DataFrame(selection_model.transform(X_train),
                           columns=X_train.columns.values[ix])
X_test_best = pd.DataFrame(selection_model.transform(X_test),
                          columns=X_test.columns.values[ix])

In [11]:
X_train_best["Heart disease_number"] = list(y_train)
X_test_best["Heart disease_number"] = list(y_test)

In [12]:
df_clean = pd.concat([X_train_best,X_test_best])

In [13]:
X_train = X_train_best.drop(["Heart disease_number"], axis = 1)
y_train = X_train_best["Heart disease_number"]

X_test = X_test_best.drop(["Heart disease_number"], axis = 1)
y_test = X_test_best["Heart disease_number"]

In [14]:
logReg = LogisticRegression()
logReg.fit(X_train, y_train)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [15]:
y_hat = logReg.predict(X_test)
y_hat

array([  7128,    317,   3376, ..., 111793,   1253,   3376], shape=(2512,))

In [16]:
print(f"MSE: {mean_squared_error(y_test, y_hat)}")
print(f"R2 Score: {r2_score(y_test, y_hat)}")

MSE: 3285363.37977707
R2 Score: 0.9861292775537723


### 4. Regularizacion con Modelo Lasso

In [17]:
penalty = 1.0 #Penalizacion
lasso_logReg = Lasso(alpha = penalty)

lasso_logReg.fit(X_train, y_train)

  model = cd_fast.enet_coordinate_descent(


In [18]:
print(f"MSE: {mean_squared_error(y_test,y_hat)}")
print(f"R2 Score: {r2_score(y_train, y_hat)}")

MSE: 3285363.37977707
R2 Score: 0.9861292775537723
