<div style="position: absolute; top: 0; left: 0; font-family: 'Garamond'; font-size: 14px;">
    <a href="https://github.com/patriciaapenat" style="text-decoration: none; color: inherit;">Patricia Peña Torres</a>
</div>

<div align="center" style="font-family: 'Garamond'; font-size: 48px;">
    <strong>Final project, BRFSS-clustering</strong>
</div>

<div align="center" style="font-family: 'Garamond'; font-size: 36px;">
    <strong>Imputation of null values</strong>
</div>

__________________

<div style="font-family: 'Garamond'; font-size: 14px;">
    <normal>Because we are working with a large database, it is crucial to examine the documentation carefully. In this case, I relied on the database's official sources to explore the data. I started by reviewing the questionnaire (<a href="https://www.cdc.gov/brfss/questionnaires/pdf-ques/2022-BRFSS-Questionnaire-508.pdf" target="_blank">available here</a>), but the data match the codebook more closely (<a href="https://www.cdc.gov/brfss/annual_data/2022/zip/codebook22_llcp-v2-508.zip" target="_blank">available here</a>), which lists the codes associated with each question and response. This review is essential for understanding the questions asked, and it makes it easier to remove sections that are not relevant.
    In this notebook, I reviewed that document together with the generated text file that reports the null values. The goal was to shrink the dataset with a quick cleanup, removing surveys that were not completed and columns that we will not use.</normal>
</div>

<div style="font-family: 'Garamond'; font-size: 24px;">
    <strong>Package imports</strong>
</div>

In [1]:
%config IPCompleter.use_jedi = False

In [2]:
import os  # needed by guardar_info_nulos_en_txt (os.path.exists)
import pandas as pd
import findspark
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession, DataFrame, functions as F, Window
from pyspark.sql.functions import col, mean, when, lit, monotonically_increasing_id
from pyspark.ml.feature import Imputer

<div style="font-family: 'Garamond'; font-size: 24px;">
    <strong>Spark configuration</strong>
</div>

In [3]:
# If a SparkContext already exists, stop it before creating a new one
if 'sc' in locals() and sc:
    sc.stop()  # Stop the previous SparkContext

# Spark configuration
conf = (
    SparkConf()
    .setAppName("Proyecto_PatriciaA_Peña")  # Application name in Spark
    .setMaster("local[1]")  # Local mode with a single worker thread
    .set("spark.driver.host", "127.0.0.1")  # Driver host address
    .set("spark.executor.heartbeatInterval", "3600s")  # Executor heartbeat interval
    .set("spark.network.timeout", "7200s")  # Network timeout
    .set("spark.executor.memory", "10g")  # Memory allocated to each executor
    .set("spark.driver.memory", "10g")  # Memory allocated to the driver
)

# Create a new SparkContext with the configuration above
sc = SparkContext(conf=conf)

# SparkSession: the high-level interface for working with structured data in Spark
spark = (
    SparkSession.builder
    .appName("ProyectoF_PatriciaA_Peña")  # Application name in Spark
    .config("spark.sql.repl.eagerEval.enabled", True)  # Enable eager evaluation in the Spark SQL REPL
    .config("spark.sql.repl.eagerEval.maxNumRows", 1000)  # Maximum number of rows shown by eager evaluation
    .getOrCreate()  # Reuse the existing Spark session or create a new one
)

<div style="font-family: 'Garamond'; font-size: 16px;">
    <strong>Reading the file</strong>
</div>

In [4]:
# Read the CSV file (plain string: each "\\" is one backslash in the path)
df = spark.read.csv("C:\\Users\\patri\\OneDrive - UAB\\Documentos\\GitHub\\BRFSS-clustering\\datos\\BRFSS_Cleaner_2022.csv", header=True, inferSchema=True)

In [5]:
df = df.sample(withReplacement=False, fraction=0.45)  # A sample was needed because of hardware limitations
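
<div style="font-family: 'Garamond'; font-size: 14px;">
    <normal>Spark's <code>sample</code> without replacement keeps each row independently with probability <code>fraction</code>, so the result holds roughly, not exactly, 45% of the rows, and without a <code>seed</code> argument a different subset is drawn on every run. The plain-Python sketch below imitates that behaviour without Spark; <code>bernoulli_sample</code> and the seed value 42 are hypothetical choices made for illustration, not part of the original notebook.</normal>
</div>

```python
import random


def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed -> the same rows on every run
    return [row for row in rows if rng.random() < fraction]


rows = list(range(1000))
first = bernoulli_sample(rows, fraction=0.45, seed=42)
second = bernoulli_sample(rows, fraction=0.45, seed=42)

print(first == second)  # the two draws are identical because the seed is fixed
print(len(first))       # close to 450, but not guaranteed to be exactly 450
```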

In [6]:
# Cast every column to a numeric (double) type
for column_name in df.columns:
    df = df.withColumn(column_name, col(column_name).cast("double"))

In [7]:
def guardar_info_nulos_en_txt(df: DataFrame, archivo_nombre: str):
    # Path of the output text file
    file_path = f"C:\\Users\\patri\\OneDrive - UAB\\Documentos\\GitHub\\BRFSS-clustering\\tratamiento\\{archivo_nombre}.txt"

    # Never overwrite a file that already exists
    if os.path.exists(file_path):
        print(f"Warning! The file '{file_path}' already exists and was not overwritten. Please choose another name.")
        return
    
    # Open the file in write mode (creates a new file)
    with open(file_path, 'w') as file:
        # Write the number of null values per column
        for col_name in df.columns:
            null_count = df.filter(df[col_name].isNull() | (df[col_name] == "")).count()
            file.write(f"{col_name}: {null_count} null values\n")

    print(f"Null-value information for DataFrame '{archivo_nombre}' was saved to: {file_path}")
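
<div style="font-family: 'Garamond'; font-size: 14px;">
    <normal>The function above launches one <code>filter(...).count()</code> per column, which means one scan of the data for every column. The same report can in principle be gathered in a single pass that updates all counters at once. A minimal plain-Python sketch of that idea, using a hypothetical <code>count_nulls</code> helper over a list of dictionaries rather than a Spark DataFrame:</normal>
</div>

```python
def count_nulls(rows, columns):
    """Count missing values (None or empty string) per column in one pass."""
    counts = {name: 0 for name in columns}
    for row in rows:  # a single pass over the data
        for name in columns:
            value = row.get(name)
            if value is None or value == "":
                counts[name] += 1
    return counts


sample_rows = [
    {"GENHLTH": 3.0, "CHILDREN": None},
    {"GENHLTH": None, "CHILDREN": None},
    {"GENHLTH": 2.0, "CHILDREN": 1.0},
]
null_report = count_nulls(sample_rows, ["GENHLTH", "CHILDREN"])
print(null_report)  # {'GENHLTH': 1, 'CHILDREN': 2}
```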

<div style="font-family: 'Garamond'; font-size: 14px;">
    <normal>
We are working with a very large dataset, so we will reduce it in this notebook.</normal>
</div>

In [8]:
# Add an identifier column; the IDs are unique and increasing, but they are not consecutive row indices
df = df.withColumn("ID", monotonically_increasing_id())
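
<div style="font-family: 'Garamond'; font-size: 14px;">
    <normal><code>monotonically_increasing_id</code> guarantees unique, increasing IDs, but not consecutive ones: according to the Spark documentation, the current implementation puts the partition index in the upper 31 bits and the record number within the partition in the lower 33 bits. A minimal sketch of that layout in plain Python (<code>spark_like_id</code> is a hypothetical helper, not a Spark function):</normal>
</div>

```python
def spark_like_id(partition_index: int, row_in_partition: int) -> int:
    """Partition index in the upper bits, record number in the lower 33 bits."""
    return (partition_index << 33) + row_in_partition


# Two partitions with three rows each
ids = [spark_like_id(p, r) for p in range(2) for r in range(3)]
print(ids)  # [0, 1, 2, 8589934592, 8589934593, 8589934594]
```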

In [9]:
# Values to treat as NA
na_values = [9, 99]  # These are the codes for "Refused"

# Mark 9 and 99 as NA in every column except _AGE80 (a real age) and ID (the row identifier)
for col_name in df.columns:
    if col_name not in ("_AGE80", "ID"):
        df = df.withColumn(col_name, when(col(col_name).isin(na_values), None).otherwise(col(col_name)))
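
<div style="font-family: 'Garamond'; font-size: 14px;">
    <normal>Recoding 9 and 99 across every column is simple, but it also removes genuine answers in columns where 9 is a valid value. The sketch below illustrates both effects on a single row in plain Python. It assumes that the identifier column is excluded from the recoding and that <code>SLEPTIM1</code> records hours of sleep (so 9 would be a legitimate answer); <code>recode_refused</code> and the sample row are hypothetical.</normal>
</div>

```python
NA_CODES = {9, 99}
EXCLUDED = {"_AGE80", "ID"}  # assumption: columns that must keep their values


def recode_refused(row: dict) -> dict:
    """Replace the sentinel codes with None, except in excluded columns."""
    return {
        name: (None if name not in EXCLUDED and value in NA_CODES else value)
        for name, value in row.items()
    }


row = {"ID": 9, "_AGE80": 45.0, "GENHLTH": 9.0, "SLEPTIM1": 9.0, "EDUCA": 4.0}
recoded = recode_refused(row)
print(recoded)
# {'ID': 9, '_AGE80': 45.0, 'GENHLTH': None, 'SLEPTIM1': None, 'EDUCA': 4.0}
```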

In [10]:
# Add ID column
df = df.withColumn("ID", F.monotonically_increasing_id())

In [11]:
# Columns to impute with Imputer
columns_to_impute = [
    'NUMADULT', 'NUMMEN', 'NUMWOMEN', 'GENHLTH', 'PHYSHLTH', 'MENTHLTH', 'POORHLTH',
    'PRIMINSR', 'PERSDOC3', 'MEDCOST1', 'CHECKUP1', 'EXERANY2', 'SLEPTIM1', 'LASTDEN4',
    'RMVTETH4', 'CVDINFR4', 'CVDCRHD4', 'CVDSTRK3', 'ASTHMA3', 'CHCSCNC1', 'CHCOCNC1',
    'CHCCOPD3', 'ADDEPEV3', 'CHCKDNY2', 'HAVARTH4', 'DIABETE4', 'DIABAGE4', 'MARITAL',
    'EDUCA', 'RENTHOM1', 'VETERAN3', 'EMPLOY1', 'INCOME3', 'DEAF', 'BLIND', 'DECIDE', 
    'DIFFWALK', 'DIFFDRES', 'DIFFALON'
]


In [12]:
# Create the imputer (its default strategy is the column mean)
imputer = Imputer(
    inputCols=columns_to_impute,
    outputCols=[f"{c}_imputed" for c in columns_to_impute]  # Names of the imputed columns
)

In [13]:
# Fit the imputer and transform the DataFrame
imputer_model = imputer.fit(df)
df_imputed = imputer_model.transform(df)
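
<div style="font-family: 'Garamond'; font-size: 14px;">
    <normal>With its default settings, <code>Imputer</code> replaces each missing entry with the mean of the observed values in that column and writes the result to the output column. A plain-Python sketch of that computation for one column (<code>mean_impute</code> is a hypothetical helper). Note that for coded categorical answers the mean can be a value that is not a valid code, such as 2.0 between codes 1 and 3 below.</normal>
</div>

```python
def mean_impute(values):
    """Fill None entries with the mean of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


imputed_column = mean_impute([1.0, None, 3.0, None])
print(imputed_column)  # [1.0, 2.0, 3.0, 2.0]
```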

In [14]:
# Mean imputation for the CHILDREN column
mean_value = df.selectExpr('avg(CHILDREN) as mean_CHILDREN').collect()[0]['mean_CHILDREN']
df_imputed = df_imputed.fillna(mean_value, subset=['CHILDREN']).withColumn('CHILDREN_imputed', col('CHILDREN').cast('double'))

# First join (note: the *_imputed columns already exist in df_imputed, so this self-join duplicates them)
df_imputed = df_imputed.join(df_imputed.select('ID', *[f"{col}_imputed" for col in columns_to_impute]), 'ID', 'left')

In [15]:
# Gynecological health
conditions_mode = {
    "HADMAM": col("_SEX") == 1,
    "HOWLONG": col("_SEX") == 1,
    "CERVSCRN": col("_SEX") == 1,
    "CRVCLCNC": col("_SEX") == 1,
    "CRVCLPAP": col("_SEX") == 1,
    "CRVCLHPV": col("_SEX") == 1,
    "HADHYST2": col("_SEX") == 1,
    "PREGNANT": col("_SEX") == 1
}

# Compute the mode of each column
mode_values = {}
for col_name in conditions_mode.keys():
    mode_values[col_name] = df.groupBy(col_name).count().orderBy('count', ascending=False).first()[col_name]

# Apply the imputation conditions and join with the original DataFrame
for col_name, condition in conditions_mode.items():
    mode_value = mode_values[col_name]
    df_imputed = df_imputed.withColumn(col_name + '_imputed', when(condition, mode_value).otherwise(0))

df_imputed = df_imputed.join(
    df_imputed.select('ID', *[F.col(f"{col}_imputed").cast('double') for col in conditions_mode.keys()]).alias('imputed_df'),
    ['ID'],
    'left'
)
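
<div style="font-family: 'Garamond'; font-size: 14px;">
    <normal>The mode above is taken as the most frequent group returned by <code>groupBy(col_name).count()</code>. Spark treats null as a group of its own, so for a sparsely answered column the most frequent "value" can be null. The plain-Python sketch below reproduces the effect with <code>collections.Counter</code>; both helper functions and the sample answers are hypothetical.</normal>
</div>

```python
from collections import Counter


def mode_including_nulls(values):
    """Most frequent entry when None counts as a group of its own."""
    return Counter(values).most_common(1)[0][0]


def mode_excluding_nulls(values):
    """Most frequent entry among the observed (non-None) values only."""
    return Counter(v for v in values if v is not None).most_common(1)[0][0]


answers = [None, None, None, 1.0, 2.0, 2.0]
print(mode_including_nulls(answers))  # None
print(mode_excluding_nulls(answers))  # 2.0
```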

In [None]:
# Colorectal cancer
# Nulls are tested with isNull(): a None inside isin([...]) never matches null rows in Spark
conditions_mode = {
    'HADSIGM4': col("_AGE80") < 45,
    'COLNSIGM': (col("_AGE80") < 45) | (col("HADSIGM4").isin([1])) | (col("COLNSIGM").isin([1, 7, 9])) | (col("COLNSIGM").isNull()),
    'LASTSIG4': (col("_AGE80") < 45) | (col("HADSIGM4").isin([2, 7, 9])) | (col("HADSIGM4").isNull()) | (col("COLNSIGM").isin([9])) | (col("COLNSIGM").isNull()) | (col("SIGMTES1").isNotNull()),
    'COLNCNCR': col("_AGE80") < 45,
    'VIRCOLO1': (col("_AGE80") < 45) | (col("COLNCNCR").isin([2, 7, 9])) | (col("COLNCNCR").isNull()),
    'VCLNTES2': (col("_AGE80") < 45) | (col("COLNCNCR").isin([2, 7, 9])) | (col("COLNCNCR").isNull()) | (col("VIRCOLO1").isin([2, 7, 9])),
    'SMALSTOL': (col("_AGE80") < 45) | (col("COLNCNCR").isin([2, 7, 9])) | (col("COLNCNCR").isNull()),
    'STOLTEST': (col("_AGE80") < 45) | (col("COLNCNCR").isin([2, 7, 9])) | (col("COLNCNCR").isNull()) | (col("SMALSTOL").isin([2, 7, 9])) | (col("SMALSTOL").isNull()),
    'STOOLDN2': (col("_AGE80") < 45) | (col("COLNCNCR").isin([2, 7, 9])) | (col("COLNCNCR").isNull()),
    'BLDSTFIT': (col("_AGE80") < 45) | (col("COLNCNCR").isin([2, 7, 9])) | (col("COLNCNCR").isNull()) | (col("STOOLDN2").isin([2, 7, 9])) | (col("STOOLDN2").isNull()),
    'SDNATES1': (col("_AGE80") < 45) | (col("COLNCNCR").isin([2, 7, 9])) | (col("COLNCNCR").isNull()) | (col("STOOLDN2").isin([2, 7, 9])) | (col("STOOLDN2").isNull()),
}


# Compute the mode of each column
mode_values = {}
for col_name in conditions_mode.keys():
    mode_values[col_name] = df.groupBy(col_name).count().orderBy('count', ascending=False).first()[col_name]

# Apply the imputation conditions and join with the original DataFrame
for col_name, condition in conditions_mode.items():
    mode_value = mode_values[col_name]
    df_imputed = df_imputed.withColumn(col_name + '_imputed', when(condition, mode_value).otherwise(0))

df_imputed = df_imputed.join(
    df_imputed.select('ID', *[F.col(f"{col}_imputed").cast('double') for col in conditions_mode.keys()]).alias('imputed_df'),
    ['ID'],
    'left'
)
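
<div style="font-family: 'Garamond'; font-size: 14px;">
    <normal>In Spark SQL, comparing anything with null yields null rather than true, so a <code>None</code> placed inside <code>isin([...])</code> never selects null rows, and <code>when</code> treats a null condition as not matched. An explicit <code>isNull()</code> test is needed to catch them. The sketch below emulates that three-valued logic in plain Python; <code>sql_in</code>, <code>sql_or</code> and <code>is_null</code> are hypothetical helpers written for illustration, where <code>None</code> stands for SQL's "unknown".</normal>
</div>

```python
def sql_in(value, candidates):
    """SQL-style IN: returns True, False, or None (unknown)."""
    if value is None:
        return None
    non_null = [c for c in candidates if c is not None]
    if value in non_null:
        return True
    # A NULL among the candidates makes a non-match unknown rather than False
    return None if len(non_null) < len(candidates) else False


def sql_or(a, b):
    """SQL-style OR over True / False / None."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def is_null(value):
    return value is None


missing = None  # a row where the answer is null
print(sql_in(missing, [2, 7, 9, None]))                    # None: the row is not selected
print(sql_or(sql_in(missing, [2, 7, 9]), is_null(missing)))  # True: the row is selected
```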

In [None]:
# Smoking habits and lung cancer risk
conditions_mode = {
    'ECIGNOW2': col('ECIGNOW2').isNull(),
    'LCSFIRST': col('SMOKE100').isin([2, 7, 9]) | col('SMOKDAY2').isin([7, 9]),
    'LCSLAST': col('SMOKE100').isin([2, 7, 9]) | col('SMOKDAY2').isin([7, 9]) | col('LCSFIRST').isin([888]),
    'LCSNUMCG': col('SMOKE100').isin([2, 7, 9]) | col('SMOKDAY2').isin([7, 9]) | col('LCSFIRST').isin([888]),
    'LCSCTSC1': col('LCSCTSC1').isNull(),
    'LCSSCNCR': col('LCSCTSC1').isin([2, 7, 9]),
    'LCSCTWHN': col('LCSCTSC1').isin([2, 7, 9]) | col('LCSSCNCR').isin([2, 7, 9]),
    'HEATTBCO': col('HEATTBCO').isNull(),
    'STOPSMK2': (col('SMOKE100').isin([2, 7, 9])) | (col('SMOKDAY2').isin([1, 2, 7, 9])),
    'LASTSMK2': (col('SMOKE100').isin([2, 7, 9])) | (col('SMOKDAY2').isin([1, 2, 7, 9])) | (col('LASTSMK2').isNull()),
    'MENTCIGS': col('SMOKDAY2').isin([1, 2, 7, 9]),
    'MENTECIG': (col('SMOKDAY2').isin([1, 2, 7, 9])) | (col('ECIGNOW2').isin([1, 4, 7, 9])),
}

for col_name, condition in conditions_mode.items():
    # Note: despite the variable name, this is the column mean ("avg"), not the mode
    mode_value = df.groupBy().agg({col_name: "avg"}).collect()[0][0]
    if isinstance(condition, bool):  
        df_imputed = df_imputed.withColumn(col_name + '_imputed', lit(mode_value if condition else 0)) 
    else:
        df_imputed = df_imputed.withColumn(col_name + '_imputed', when(condition, mode_value).otherwise(0))

df_imputed = df_imputed.join(df_imputed.select('ID', *[F.col(f"{col}_imputed").cast('double') for col in conditions_mode.keys()]).alias('imputed_df'), ['ID'], 'left')

In [None]:
# Vaccination
conditions_mode = {
    'FLUSHOT7': col('FLUSHOT7').isNull(),
    'TETANUS1': col('TETANUS1').isNull(),
    'PNEUVAC4': col('PNEUVAC4').isNull(),
    'HIVTST7': col('HIVTST7').isNull(),
    'HIVRISK5': col('HIVRISK5').isNull(),
    'COVIDPOS': col('COVIDPOS').isNull(),
}

for col_name, condition in conditions_mode.items():
    mode_value = df.groupBy(col_name).count().orderBy('count', ascending=False).first()[col_name]
    df_imputed = df_imputed.withColumn(col_name + '_imputed', when(condition, mode_value).otherwise(0))

df_imputed = df_imputed.join(df_imputed.select('ID', *[F.col(f"{col}_imputed").cast('double') for col in conditions_mode.keys()]).alias('imputed_df'), ['ID'], 'left')

In [None]:
# Cancer, survivorship and pain management
conditions_mode = {
    'CNCRDIFF': col('CHCSCNC1').isin([2, 7, 9]) | col('CHCOCNC1').isin([2, 7, 9]) | col('CHCSCNC1').isNull() | col('CHCOCNC1').isNull(),
    'CSRVTRT3': col('CHCSCNC1').isin([2, 7, 9]) | col('CHCOCNC1').isin([2, 7, 9]) | col('CHCSCNC1').isNull() | col('CHCOCNC1').isNull(),
    'CSRVPAIN': col('CHCSCNC1').isin([2, 7, 9]) | col('CHCOCNC1').isin([2, 7, 9]) | col('CHCSCNC1').isNull() | col('CHCOCNC1').isNull(),
    'CSRVCTL2': col('CHCSCNC1').isin([2, 7, 9]) | col('CHCOCNC1').isin([2, 7, 9]) | col('CHCSCNC1').isNull() | col('CHCOCNC1').isNull() | col('CSRVPAIN').isin([2, 7, 9]) | col('CSRVPAIN').isNull(),
    'CSRVDOC1': col('CHCSCNC1').isin([2, 7, 9]) | col('CHCOCNC1').isin([2, 7, 9]) | col('CHCSCNC1').isNull() | col('CHCOCNC1').isNull() | col('CSRVTRT3').isin([1, 3, 4, 5, 7, 9]) | col('CSRVTRT3').isNull(),
    'CSRVSUM': col('CHCSCNC1').isin([2, 7, 9]) | col('CHCOCNC1').isin([2, 7, 9]) | col('CHCSCNC1').isNull() | col('CHCOCNC1').isNull() | col('CSRVTRT3').isin([1, 3, 4, 5, 7, 9]) | col('CSRVTRT3').isNull(),
    'CSRVRTRN': col('CHCSCNC1').isin([2, 7, 9]) | col('CHCOCNC1').isin([2, 7, 9]) | col('CHCSCNC1').isNull() | col('CHCOCNC1').isNull() | col('CSRVTRT3').isin([1, 3, 4, 5, 7, 9]) | col('CSRVTRT3').isNull(),
    'CSRVINST': col('CHCSCNC1').isin([2, 7, 9]) | col('CHCOCNC1').isin([2, 7, 9]) | col('CHCSCNC1').isNull() | col('CHCOCNC1').isNull() | col('CSRVTRT3').isin([1, 3, 4, 5, 7, 9]) | col('CSRVTRT3').isNull(),
    'CSRVINSR': col('CHCSCNC1').isin([2, 7, 9]) | col('CHCOCNC1').isin([2, 7, 9]) | col('CHCSCNC1').isNull() | col('CHCOCNC1').isNull() | col('CSRVTRT3').isin([1, 3, 4, 5, 7, 9]) | col('CSRVTRT3').isNull(),
    'CSRVDEIN': col('CHCSCNC1').isin([2, 7, 9]) | col('CHCOCNC1').isin([2, 7, 9]) | col('CHCSCNC1').isNull() | col('CHCOCNC1').isNull() | col('CSRVTRT3').isin([1, 3, 4, 5, 7, 9]) | col('CSRVTRT3').isNull(),
    'CSRVCLIN': col('CHCSCNC1').isin([2, 7, 9]) | col('CHCOCNC1').isin([2, 7, 9]) | col('CHCSCNC1').isNull() | col('CHCOCNC1').isNull() | col('CSRVTRT3').isin([1, 3, 4, 5, 7, 9]) | col('CSRVTRT3').isNull(),
    'CNCRAGE': col('CHCSCNC1').isin([2, 7, 9]) | col('CHCOCNC1').isin([2, 7, 9]) | col('CHCSCNC1').isNull() | col('CHCOCNC1').isNull() | col('CNCRDIFF').isin([7, 9]) | col('CNCRDIFF').isNull(),
    'CNCRTYP2': col('CHCSCNC1').isin([2, 7, 9]) | col('CHCOCNC1').isin([2, 7, 9]) | col('CHCSCNC1').isNull() | col('CHCOCNC1').isNull() | col('CNCRDIFF').isin([7, 9]) | col('CNCRDIFF').isNull()
}

for col_name, condition in conditions_mode.items():
    mode_value = df.groupBy(col_name).count().orderBy('count', ascending=False).first()[col_name]
    df_imputed = df_imputed.withColumn(col_name + '_imputed', when(condition, mode_value).otherwise(0))


df_imputed = df_imputed.join(df_imputed.select('ID', *[F.col(f"{col}_imputed").cast('double') for col in conditions_mode.keys()]).alias('imputed_df'), ['ID'], 'left')

In [None]:
# Prostate cancer
conditions_mode = {
    'PSATEST1': (col('_SEX') == 1) | (col('_AGE80') < 40),
    'PSASUGST': (col('_SEX') == 1) | (col('_AGE80') < 40),
    'PCSTALK1': (col('_SEX') == 1) | (col('_AGE80') < 40),
    'PCPSARS2': (col('_SEX') == 1) | (col('_AGE80') < 40) | col('PSATEST1').isin([2, 7, 9]) | col('PSATEST1').isNull(),
}

for col_name, condition in conditions_mode.items():
    mode_value = df.groupBy(col_name).count().orderBy('count', ascending=False).first()[col_name]
    df_imputed = df_imputed.withColumn(col_name + '_imputed', when(condition, mode_value).otherwise(0))


df_imputed = df_imputed.join(df_imputed.select('ID', *[F.col(f"{col}_imputed").cast('double') for col in conditions_mode.keys()]).alias('imputed_df'), ['ID'], 'left')

In [None]:
# Functional dependence
conditions_mode = {
    'CIMEMLOS': col('_AGE80') < 45,
    'CDHOUSE': (col('_AGE80') < 45) | col('CIMEMLOS').isin([2, 9]),
    'CDASSIST': (col('_AGE80') < 45) | col('CIMEMLOS').isin([2, 9]),
    'CDSOCIAL': (col('_AGE80') < 45) | col('CIMEMLOS').isin([2, 9]),
    'CDDISCUS': (col('_AGE80') < 45) | col('CIMEMLOS').isin([2, 9]),
    'CDHELP': (col('_AGE80') < 45) | col('CIMEMLOS').isin([2, 9]) | col('CDASSIST').isin([4, 5, 7, 9]),
}

for col_name, condition in conditions_mode.items():
    mode_value = df.groupBy(col_name).count().orderBy('count', ascending=False).first()[col_name]
    df_imputed = df_imputed.withColumn(col_name + '_imputed', when(condition, mode_value).otherwise(0))

df_imputed = df_imputed.join(df_imputed.select('ID', *[F.col(f"{col}_imputed").cast('double') for col in conditions_mode.keys()]).alias('imputed_df'), ['ID'], 'left')

In [None]:
# Childhood adversity and social determinants
conditions_mode = {
    'ACEDEPRS': col('ACEDEPRS').isNull(),
    'ACEDRINK': col('ACEDRINK').isNull(),
    'ACEDRUGS': col('ACEDRUGS').isNull(),
    'ACEPRISN': col('ACEPRISN').isNull(),
    'ACEDIVRC': col('ACEDIVRC').isNull(),
    'ACEPUNCH': col('ACEPUNCH').isNull(),
    'ACEHURT1': col('ACEHURT1').isNull(),
    'ACESWEAR': col('ACESWEAR').isNull(),
    'ACETOUCH': col('ACETOUCH').isNull(),
    'ACETTHEM': col('ACETTHEM').isNull(),
    'ACEHVSEX': col('ACEHVSEX').isNull(),
    'ACEADSAF': col('ACEADSAF').isNull(),
    'ACEADNED': col('ACEADNED').isNull(),
    'LSATISFY': col('LSATISFY').isNull(),
    'EMTSUPRT': col('EMTSUPRT').isNull(),
    'SDHISOLT': col('SDHISOLT').isNull(),
    'SDHTRNSP': col('SDHTRNSP').isNull(),
    'SDHEMPLY': col('SDHEMPLY').isNull(),
    'FOODSTMP': col('FOODSTMP').isNull(),
    'SDHBILLS': col('SDHBILLS').isNull(),
    'SDHUTILS': col('SDHUTILS').isNull(),
    'SDHFOOD1': col('SDHFOOD1').isNull(),
    'SDHSTRE1': col('SDHSTRE1').isNull(),
}

for col_name, condition in conditions_mode.items():
    mode_value = df.groupBy(col_name).count().orderBy('count', ascending=False).first()[col_name]
    df_imputed = df_imputed.withColumn(col_name + '_imputed', when(condition, mode_value).otherwise(0))


df_imputed = df_imputed.join(df_imputed.select('ID', *[F.col(f"{col}_imputed").cast('double') for col in conditions_mode.keys()]).alias('imputed_df'), ['ID'], 'left')

In [None]:
# Cannabis use
conditions_mode = {
    'MARIJAN1': col('MARIJAN1').isNull(),
    'MARJEAT': col('MARJEAT').isNull(),
    'MARJVAPE': col('MARJVAPE').isNull(),
    'MARJDAB': col('MARJDAB').isNull(),
    'MARJOTHR': col('MARJOTHR').isNull(),
    'MARJSMOK': (col('MARIJAN1').isin([77, 88, 99])) | (col('MARIJAN1').isNull())
}

for col_name, condition in conditions_mode.items():
    mode_value = df.groupBy(col_name).count().orderBy('count', ascending=False).first()[col_name]
    df_imputed = df_imputed.withColumn(col_name + '_imputed', when(condition, mode_value).otherwise(0))


df_imputed = df_imputed.join(df_imputed.select('ID', *[F.col(f"{col}_imputed").cast('double') for col in conditions_mode.keys()]).alias('imputed_df'), ['ID'], 'left')

In [None]:
# Respiratory health
conditions_mode = {
    "ASTHNOW": col("ASTHMA3").isin([2, 7, 9]),
    'CHCCOPD3': col('CHCCOPD3').isNull(),
    'COPDCOGH': col('COPDCOGH').isNull(),
    'COPDFLEM': col('COPDFLEM').isNull(),
    'COPDBRTH': col('COPDBRTH').isNull(),
    'COPDBTST': col('COPDBTST').isNull(),
    'ASTHMA3': col('ASTHMA3').isNull(),
}

for col_name, condition in conditions_mode.items():
    mode_value = df.groupBy(col_name).count().orderBy('count', ascending=False).first()[col_name]
    df_imputed = df_imputed.withColumn(col_name + '_imputed', when(condition, mode_value).otherwise(0))


df_imputed = df_imputed.join(df_imputed.select('ID', *[F.col(f"{col}_imputed").cast('double') for col in conditions_mode.keys()]).alias('imputed_df'), ['ID'], 'left')

In [None]:
# Mean imputation for the _BMI5 column
mean_value_bmi5 = df.selectExpr('avg(_BMI5) as mean_bmi5').collect()[0]['mean_bmi5']
df_imputed = df_imputed.fillna(mean_value_bmi5, subset=['_BMI5']).withColumn('_BMI5_imputed', col('_BMI5').cast('double'))

In [None]:
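# Alcohol consumption
# Nulls are tested with isNull(): a None inside isin([...]) never matches null rows in Spark
conditions_mode = {
    'ASBIALCH': col('CHECKUP1').isin([3, 4, 7, 8, 9]) | col('CHECKUP1').isNull(),
    'ASBIDRNK': col('CHECKUP1').isin([3, 4, 7, 8, 9]) | col('CHECKUP1').isNull(),
    'ASBIBING': col('CHECKUP1').isin([3, 4, 7, 8, 9]) | col('CHECKUP1').isNull(),
    'ASBIADVC': col('CHECKUP1').isin([3, 4, 7, 8, 9]) | col('CHECKUP1').isNull(),
    'ASBIRDUC': (col('CHECKUP1').isin([3, 4, 7, 8, 9]) | col('CHECKUP1').isNull() | 
                 col('ASBIALCH').isin([2, 7, 9]) | col('ASBIALCH').isNull() | 
                 col('ASBIDRNK').isin([2, 7, 9]) | col('ASBIDRNK').isNull() | 
                 col('ASBIBING').isin([2, 7, 9]) | col('ASBIBING').isNull())
}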

for col_name, condition in conditions_mode.items():
    mode_value = df.groupBy(col_name).count().orderBy('count', ascending=False).first()[col_name]
    df_imputed = df_imputed.withColumn(col_name + '_imputed', when(condition, mode_value).otherwise(0))


df_imputed = df_imputed.join(df_imputed.select('ID', *[F.col(f"{col}_imputed").cast('double') for col in conditions_mode.keys()]).alias('imputed_df'), ['ID'], 'left')

In [None]:
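# Firearms
# Nulls are tested with isNull(): a None inside isin([...]) never matches null rows in Spark
conditions_mode = {
    'FIREARM5': col('FIREARM5').isNull(),
    'GUNLOAD': col('FIREARM5').isin([2, 7, 9]) | col('FIREARM5').isNull(),
    'LOADULK2': (col('FIREARM5').isin([2, 7, 9]) | col('FIREARM5').isNull() | col('GUNLOAD').isin([2, 7, 9]) | col('GUNLOAD').isNull())
}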

for col_name, condition in conditions_mode.items():
    mode_value = df.groupBy(col_name).count().orderBy('count', ascending=False).first()[col_name]
    df_imputed = df_imputed.withColumn(col_name + '_imputed', when(condition, mode_value).otherwise(0))


df_imputed = df_imputed.join(df_imputed.select('ID', *[F.col(f"{col}_imputed").cast('double') for col in conditions_mode.keys()]).alias('imputed_df'), ['ID'], 'left')

In [None]:
# Race and health
conditions_mode = {
    'RRCLASS3': col('RRCLASS3').isNull(),
    'RRHCARE4': col('RRHCARE4').isNull(),
    'RRPHYSM2': col('RRPHYSM2').isNull()
}

for col_name, condition in conditions_mode.items():
    mode_value = df.groupBy(col_name).count().orderBy('count', ascending=False).first()[col_name]
    df_imputed = df_imputed.withColumn(col_name + '_imputed', when(condition, mode_value).otherwise(0))


df_imputed = df_imputed.join(df_imputed.select('ID', *[F.col(f"{col}_imputed").cast('double') for col in conditions_mode.keys()]).alias('imputed_df'), ['ID'], 'left')

In [None]:
# Clean up: drop every helper column whose name ends in "_imputed"
df_final = df_imputed.drop(*[col for col in df_imputed.columns if col.endswith("_imputed")])
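
<div style="font-family: 'Garamond'; font-size: 14px;">
    <normal>Dropping every <code>*_imputed</code> column leaves only the original columns in <code>df_final</code>. If the goal is to fill the gaps in the original columns, one common pattern is to coalesce each original value with its imputed counterpart before the helper columns are dropped (Spark offers <code>F.coalesce</code> for this). A plain-Python sketch under that assumption; the <code>coalesce</code> helper and the sample rows are invented for illustration.</normal>
</div>

```python
def coalesce(*values):
    """Return the first value that is not None, or None if all are missing."""
    for value in values:
        if value is not None:
            return value
    return None


rows = [
    {"GENHLTH": 3.0, "GENHLTH_imputed": 3.0},   # observed answer is kept
    {"GENHLTH": None, "GENHLTH_imputed": 2.5},  # gap is filled from the helper
]
final_rows = [{"GENHLTH": coalesce(r["GENHLTH"], r["GENHLTH_imputed"])} for r in rows]
print(final_rows)  # [{'GENHLTH': 3.0}, {'GENHLTH': 2.5}]
```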

In [None]:
df_final

In [None]:
guardar_info_nulos_en_txt(df_final, "nulos_4")

In [None]:
# war is over

___________

In [None]:
# Add ID column
df = df.withColumn("ID", F.monotonically_increasing_id())

# Columns to impute with Imputer
columns_to_impute = [
    'NUMADULT', 'NUMMEN', 'NUMWOMEN', 'GENHLTH', 'PHYSHLTH', 'MENTHLTH', 'POORHLTH',
    'PRIMINSR', 'PERSDOC3', 'MEDCOST1', 'CHECKUP1', 'EXERANY2', 'SLEPTIM1', 'LASTDEN4',
    'RMVTETH4', 'CVDINFR4', 'CVDCRHD4', 'CVDSTRK3', 'ASTHMA3', 'CHCSCNC1', 'CHCOCNC1',
    'CHCCOPD3', 'ADDEPEV3', 'CHCKDNY2', 'HAVARTH4', 'DIABETE4', 'DIABAGE4', 'MARITAL',
    'EDUCA', 'RENTHOM1', 'VETERAN3', 'EMPLOY1', 'INCOME3', 'DEAF', 'BLIND', 'DECIDE', 
    'DIFFWALK', 'DIFFDRES', 'DIFFALON'
]

# Create the imputer (its default strategy is the column mean)
imputer = Imputer(
    inputCols=columns_to_impute,
    outputCols=[f"{c}_imputed" for c in columns_to_impute]  # Names of the imputed columns
)

# Fit the imputer and transform the DataFrame
imputer_model = imputer.fit(df)
df_imputed = imputer_model.transform(df)

# Mean imputation for the CHILDREN column
mean_value = df.selectExpr('avg(CHILDREN) as mean_CHILDREN').collect()[0]['mean_CHILDREN']
df_imputed = df_imputed.fillna(mean_value, subset=['CHILDREN']).withColumn('CHILDREN_imputed', col('CHILDREN').cast('double'))

# First join
df_imputed = df_imputed.join(df_imputed.select('ID', *[f"{col}_imputed" for col in columns_to_impute]), 'ID', 'left')

# Gynecological health
conditions_mode = {
    "HADMAM": col("_SEX") == 1,
    "HOWLONG": col("_SEX") == 1,
    "CERVSCRN": col("_SEX") == 1,
    "CRVCLCNC": col("_SEX") == 1,
    "CRVCLPAP": col("_SEX") == 1,
    "CRVCLHPV": col("_SEX") == 1,
    "HADHYST2": col("_SEX") == 1,
    "PREGNANT": col("_SEX") == 1
}

for col_name, condition in conditions_mode.items():
    mode_value = df.groupBy().agg({col_name: "avg"}).collect()[0][0]
    df_imputed = df_imputed.withColumn(f"{col_name}_imputed", when(condition, 0).otherwise(mode_value))

# Second join
df_imputed = df_imputed.join(df_imputed.select('ID', *[f"{col}_imputed" for col in conditions_mode.keys()]), 'ID', 'left')

# Colorectal cancer
# Nulls are tested with isNull(): a None inside isin([...]) never matches null rows in Spark
conditions_mode = {
    'HADSIGM4': col("_AGE80") < 45,
    'COLNSIGM': (col("_AGE80") < 45) | (col("HADSIGM4").isin([1])) | (col("COLNSIGM").isin([1, 7, 9])) | (col("COLNSIGM").isNull()),
    'LASTSIG4': (col("_AGE80") < 45) | (col("HADSIGM4").isin([2, 7, 9])) | (col("HADSIGM4").isNull()) | (col("COLNSIGM").isin([9])) | (col("COLNSIGM").isNull()) | (col("SIGMTES1").isNotNull()),
    'COLNCNCR': col("_AGE80") < 45,
    'VIRCOLO1': (col("_AGE80") < 45) | (col("COLNCNCR").isin([2, 7, 9])) | (col("COLNCNCR").isNull()),
    'VCLNTES2': (col("_AGE80") < 45) | (col("COLNCNCR").isin([2, 7, 9])) | (col("COLNCNCR").isNull()) | (col("VIRCOLO1").isin([2, 7, 9])),
    'SMALSTOL': (col("_AGE80") < 45) | (col("COLNCNCR").isin([2, 7, 9])) | (col("COLNCNCR").isNull()),
    'STOLTEST': (col("_AGE80") < 45) | (col("COLNCNCR").isin([2, 7, 9])) | (col("COLNCNCR").isNull()) | (col("SMALSTOL").isin([2, 7, 9])) | (col("SMALSTOL").isNull()),
    'STOOLDN2': (col("_AGE80") < 45) | (col("COLNCNCR").isin([2, 7, 9])) | (col("COLNCNCR").isNull()),
    'BLDSTFIT': (col("_AGE80") < 45) | (col("COLNCNCR").isin([2, 7, 9])) | (col("COLNCNCR").isNull()) | (col("STOOLDN2").isin([2, 7, 9])) | (col("STOOLDN2").isNull()),
    'SDNATES1': (col("_AGE80") < 45) | (col("COLNCNCR").isin([2, 7, 9])) | (col("COLNCNCR").isNull()) | (col("STOOLDN2").isin([2, 7, 9])) | (col("STOOLDN2").isNull()),
}

for col_name, condition in conditions_mode.items():
    mode_value = df.groupBy().agg({col_name: "avg"}).collect()[0][0]
    df_imputed = df_imputed.withColumn(col_name + '_imputed', when(condition, 0).otherwise(mode_value))

# Third join
df_imputed = df_imputed.join(df_imputed.select('ID', *[f"{col}_imputed" for col in conditions_mode.keys()]), 'ID', 'left')

# Smoking habits and lung cancer risk
conditions_mode = {
    'ECIGNOW2': col('ECIGNOW2').isNull(),
    'LCSFIRST': col('SMOKE100').isin([2, 7, 9]) | col('SMOKDAY2').isin([7, 9]),
    'LCSLAST': col('SMOKE100').isin([2, 7, 9]) | col('SMOKDAY2').isin([7, 9]) | col('LCSFIRST').isin([888]),
    'LCSNUMCG': col('SMOKE100').isin([2, 7, 9]) | col('SMOKDAY2').isin([7, 9]) | col('LCSFIRST').isin([888]),
    'LCSCTSC1': col('LCSCTSC1').isNull(),
    'LCSSCNCR': col('LCSCTSC1').isin([2, 7, 9]),
    'LCSCTWHN': col('LCSCTSC1').isin([2, 7, 9]) | col('LCSSCNCR').isin([2, 7, 9]),
    'HEATTBCO': col('HEATTBCO').isNull(),
    'STOPSMK2': (col('SMOKE100').isin([2, 7, 9])) | (col('SMOKDAY2').isin([1, 2, 7, 9])),
    'LASTSMK2': (col('SMOKE100').isin([2, 7, 9])) | (col('SMOKDAY2').isin([1, 2, 7, 9])) | (col('LASTSMK2').isNull()),
    'MENTCIGS': col('SMOKDAY2').isin([1, 2, 7, 9]),
    'MENTECIG': (col('SMOKDAY2').isin([1, 2, 7, 9])) | (col('ECIGNOW2').isin([1, 4, 7, 9])),
}

for col_name, condition in conditions_mode.items():
    mode_value = df.groupBy().agg({col_name: "avg"}).collect()[0][0]
    if isinstance(condition, bool):  
        df_imputed = df_imputed.withColumn(col_name + '_imputed', lit(mode_value if condition else 0)) 
    else:
        df_imputed = df_imputed.withColumn(col_name + '_imputed', when(condition, mode_value).otherwise(0))

df_imputed = df_imputed.join(df_imputed.select('ID', *[F.col(f"{col}_imputed").cast('double') for col in conditions_mode.keys()]), 'ID', 'left')

# Cancer, survivorship and pain management
conditions_mode = {
    'CNCRDIFF': col('CHCSCNC1').isin([2, 7, 9]) | col('CHCOCNC1').isin([2, 7, 9]) | col('CHCSCNC1').isNull() | col('CHCOCNC1').isNull(),
    'CSRVTRT3': col('CHCSCNC1').isin([2, 7, 9]) | col('CHCOCNC1').isin([2, 7, 9]) | col('CHCSCNC1').isNull() | col('CHCOCNC1').isNull(),
    'CSRVPAIN': col('CHCSCNC1').isin([2, 7, 9]) | col('CHCOCNC1').isin([2, 7, 9]) | col('CHCSCNC1').isNull() | col('CHCOCNC1').isNull(),
    'CSRVCTL2': col('CHCSCNC1').isin([2, 7, 9]) | col('CHCOCNC1').isin([2, 7, 9]) | col('CHCSCNC1').isNull() | col('CHCOCNC1').isNull() | col('CSRVPAIN').isin([2, 7, 9]) | col('CSRVPAIN').isNull(),
    'CSRVDOC1': col('CHCSCNC1').isin([2, 7, 9]) | col('CHCOCNC1').isin([2, 7, 9]) | col('CHCSCNC1').isNull() | col('CHCOCNC1').isNull() | col('CSRVTRT3').isin([1, 3, 4, 5, 7, 9]) | col('CSRVTRT3').isNull(),
    'CSRVSUM': col('CHCSCNC1').isin([2, 7, 9]) | col('CHCOCNC1').isin([2, 7, 9]) | col('CHCSCNC1').isNull() | col('CHCOCNC1').isNull() | col('CSRVTRT3').isin([1, 3, 4, 5, 7, 9]) | col('CSRVTRT3').isNull(),
    'CSRVRTRN': col('CHCSCNC1').isin([2, 7, 9]) | col('CHCOCNC1').isin([2, 7, 9]) | col('CHCSCNC1').isNull() | col('CHCOCNC1').isNull() | col('CSRVTRT3').isin([1, 3, 4, 5, 7, 9]) | col('CSRVTRT3').isNull(),
    'CSRVINST': col('CHCSCNC1').isin([2, 7, 9]) | col('CHCOCNC1').isin([2, 7, 9]) | col('CHCSCNC1').isNull() | col('CHCOCNC1').isNull() | col('CSRVTRT3').isin([1, 3, 4, 5, 7, 9]) | col('CSRVTRT3').isNull(),
    'CSRVINSR': col('CHCSCNC1').isin([2, 7, 9]) | col('CHCOCNC1').isin([2, 7, 9]) | col('CHCSCNC1').isNull() | col('CHCOCNC1').isNull() | col('CSRVTRT3').isin([1, 3, 4, 5, 7, 9]) | col('CSRVTRT3').isNull(),
    'CSRVDEIN': col('CHCSCNC1').isin([2, 7, 9]) | col('CHCOCNC1').isin([2, 7, 9]) | col('CHCSCNC1').isNull() | col('CHCOCNC1').isNull() | col('CSRVTRT3').isin([1, 3, 4, 5, 7, 9]) | col('CSRVTRT3').isNull(),
    'CSRVCLIN': col('CHCSCNC1').isin([2, 7, 9]) | col('CHCOCNC1').isin([2, 7, 9]) | col('CHCSCNC1').isNull() | col('CHCOCNC1').isNull() | col('CSRVTRT3').isin([1, 3, 4, 5, 7, 9]) | col('CSRVTRT3').isNull(),
    'CNCRAGE': col('CHCSCNC1').isin([2, 7, 9]) | col('CHCOCNC1').isin([2, 7, 9]) | col('CHCSCNC1').isNull() | col('CHCOCNC1').isNull() | col('CNCRDIFF').isin([7, 9]) | col('CNCRDIFF').isNull(),
    'CNCRTYP2': col('CHCSCNC1').isin([2, 7, 9]) | col('CHCOCNC1').isin([2, 7, 9]) | col('CHCSCNC1').isNull() | col('CHCOCNC1').isNull() | col('CNCRDIFF').isin([7, 9]) | col('CNCRDIFF').isNull()
}

for col_name, condition in conditions_mode.items():
    mode_value = df.groupBy(col_name).count().orderBy('count', ascending=False).first()[col_name]
    df_imputed = df_imputed.withColumn(col_name + '_imputed', when(condition, mode_value).otherwise(0))

df_imputed = df_imputed.join(df_imputed.select('ID', *[F.col(f"{col}_imputed").cast('double') for col in conditions_mode.keys()]), 'ID', 'left')

# Vaccination
conditions_mode = {
    'FLUSHOT7': col('FLUSHOT7').isNull(),
    'TETANUS1': col('TETANUS1').isNull(),
    'PNEUVAC4': col('PNEUVAC4').isNull(),
    'HIVTST7': col('HIVTST7').isNull(),
    'HIVRISK5': col('HIVRISK5').isNull(),
    'COVIDPOS': col('COVIDPOS').isNull(),
}

for col_name, condition in conditions_mode.items():
    mode_value = df.groupBy(col_name).count().orderBy('count', ascending=False).first()[col_name]
    df_imputed = df_imputed.withColumn(col_name + '_imputed', when(condition, mode_value).otherwise(0))

df_imputed = df_imputed.join(df_imputed.select('ID', *[F.col(f"{col}_imputed").cast('double') for col in conditions_mode.keys()]), 'ID', 'left')

# Prostate cancer
conditions_mode = {
    'PSATEST1': (col('_SEX') == 1) | (col('_AGE80') < 40),
    'PSASUGST': (col('_SEX') == 1) | (col('_AGE80') < 40),
    'PCSTALK1': (col('_SEX') == 1) | (col('_AGE80') < 40),
    'PCPSARS2': (col('_SEX') == 1) | (col('_AGE80') < 40) | col('PSATEST1').isin([2, 7, 9]) | col('PSATEST1').isNull(),
}

for col_name, condition in conditions_mode.items():
    mode_value = df.groupBy(col_name).count().orderBy('count', ascending=False).first()[col_name]
    df_imputed = df_imputed.withColumn(col_name + '_imputed', when(condition, mode_value).otherwise(0))

df_imputed = df_imputed.join(df_imputed.select('ID', *[F.col(f"{col}_imputed").cast('double') for col in conditions_mode.keys()]), 'ID', 'left')

# Functional dependence
conditions_mode = {
    'CIMEMLOS': col('_AGE80') < 45,
    'CDHOUSE': (col('_AGE80') < 45) | col('CIMEMLOS').isin([2, 9]),
    'CDASSIST': (col('_AGE80') < 45) | col('CIMEMLOS').isin([2, 9]),
    'CDSOCIAL': (col('_AGE80') < 45) | col('CIMEMLOS').isin([2, 9]),
    'CDDISCUS': (col('_AGE80') < 45) | col('CIMEMLOS').isin([2, 9]),
    'CDHELP': (col('_AGE80') < 45) | col('CIMEMLOS').isin([2, 9]) | col('CDASSIST').isin([4, 5, 7, 9]),
}

for col_name, condition in conditions_mode.items():
    mode_value = df.groupBy(col_name).count().orderBy('count', ascending=False).first()[col_name]
    df_imputed = df_imputed.withColumn(col_name + '_imputed', when(condition, mode_value).otherwise(0))

df_imputed = df_imputed.join(df_imputed.select('ID', *[F.col(f"{col}_imputed").cast('double') for col in conditions_mode.keys()]), 'ID', 'left')

# Support network and caregiving
conditions_mode = {
    'CAREGIV1': col('CAREGIV1').isNull(),
    'CRGVREL4': col('CAREGIV1').isin([2, 8, 7, 9]),
    'CRGVLNG1': col('CAREGIV1').isin([2, 8, 7, 9]),
    'CRGVEXPT': col('CAREGIV1').isin([2, 8, 7, 9]),
    'CRGVPER1': col('CAREGIV1').isin([2, 8, 7, 9]),
    'CRGVHOU1': col('CAREGIV1').isin([2, 8, 7, 9]),
    'CRGVHRS1': col('CAREGIV1').isin([2, 8, 7, 9]),
    'CRGVPRB3': col('CAREGIV1').isin([2, 8, 7, 9]),
    'CRGVALZD': (col('CAREGIV1').isin([2, 8, 7, 9])) | (col('CRGVPRB3') == 5),
}

for col_name, condition in conditions_mode.items():
    mode_value = df.groupBy(col_name).count().orderBy('count', ascending=False).first()[col_name]
    df_imputed = df_imputed.withColumn(col_name + '_imputed', when(condition, mode_value).otherwise(0))

df_imputed = df_imputed.join(df_imputed.select('ID', *[F.col(f"{col}_imputed").cast('double') for col in conditions_mode.keys()]), 'ID', 'left')

# Childhood adversity and social determinants
conditions_mode = {
    'ACEDEPRS': col('ACEDEPRS').isNull(),
    'ACEDRINK': col('ACEDRINK').isNull(),
    'ACEDRUGS': col('ACEDRUGS').isNull(),
    'ACEPRISN': col('ACEPRISN').isNull(),
    'ACEDIVRC': col('ACEDIVRC').isNull(),
    'ACEPUNCH': col('ACEPUNCH').isNull(),
    'ACEHURT1': col('ACEHURT1').isNull(),
    'ACESWEAR': col('ACESWEAR').isNull(),
    'ACETOUCH': col('ACETOUCH').isNull(),
    'ACETTHEM': col('ACETTHEM').isNull(),
    'ACEHVSEX': col('ACEHVSEX').isNull(),
    'ACEADSAF': col('ACEADSAF').isNull(),
    'ACEADNED': col('ACEADNED').isNull(),
    'LSATISFY': col('LSATISFY').isNull(),
    'EMTSUPRT': col('EMTSUPRT').isNull(),
    'SDHISOLT': col('SDHISOLT').isNull(),
    'SDHTRNSP': col('SDHTRNSP').isNull(),
    'SDHEMPLY': col('SDHEMPLY').isNull(),
    'FOODSTMP': col('FOODSTMP').isNull(),
    'SDHBILLS': col('SDHBILLS').isNull(),
    'SDHUTILS': col('SDHUTILS').isNull(),
    'SDHFOOD1': col('SDHFOOD1').isNull(),
    'SDHSTRE1': col('SDHSTRE1').isNull(),
}

for col_name, condition in conditions_mode.items():
    mode_value = df.groupBy(col_name).count().orderBy('count', ascending=False).first()[col_name]
    df_imputed = df_imputed.withColumn(col_name + '_imputed', when(condition, mode_value).otherwise(0))

df_imputed = df_imputed.join(df_imputed.select('ID', *[F.col(f"{col}_imputed").cast('double') for col in conditions_mode.keys()]), 'ID', 'left')

# Cannabis use
conditions_mode = {
    'MARIJAN1': col('MARIJAN1').isNull(),
    'MARJEAT': col('MARJEAT').isNull(),
    'MARJVAPE': col('MARJVAPE').isNull(),
    'MARJDAB': col('MARJDAB').isNull(),
    'MARJOTHR': col('MARJOTHR').isNull(),
    'MARJSMOK': (col('MARIJAN1').isin([77, 88, 99])) | (col('MARIJAN1').isNull())
}

for col_name, condition in conditions_mode.items():
    mode_value = df.groupBy(col_name).count().orderBy('count', ascending=False).first()[col_name]
    df_imputed = df_imputed.withColumn(col_name + '_imputed', when(condition, mode_value).otherwise(0))

df_imputed = df_imputed.join(df_imputed.select('ID', *[F.col(f"{col}_imputed").cast('double') for col in conditions_mode.keys()]), 'ID', 'left')

# Respiratory health
conditions_mode = {
    "ASTHNOW": col("ASTHMA3").isin([2, 7, 9]),
    'CHCCOPD3': col('CHCCOPD3').isNull(),
    'COPDCOGH': col('COPDCOGH').isNull(),
    'COPDFLEM': col('COPDFLEM').isNull(),
    'COPDBRTH': col('COPDBRTH').isNull(),
    'COPDBTST': col('COPDBTST').isNull(),
    'ASTHMA3': col('ASTHMA3').isNull(),
}

for col_name, condition in conditions_mode.items():
    mode_value = df.groupBy(col_name).count().orderBy('count', ascending=False).first()[col_name]
    df_imputed = df_imputed.withColumn(col_name + '_imputed', when(condition, mode_value).otherwise(0))

df_imputed = df_imputed.join(df_imputed.select('ID', *[F.col(f"{col}_imputed").cast('double') for col in conditions_mode.keys()]), 'ID', 'left')

# Mean imputation for the _BMI5 column
mean_value_bmi5 = df.selectExpr('avg(_BMI5) as mean_bmi5').collect()[0]['mean_bmi5']
df_imputed = df_imputed.fillna(mean_value_bmi5, subset=['_BMI5']).withColumn('_BMI5_imputed', col('_BMI5').cast('double'))

# Alcohol use
# isin() never matches null, so a missing answer is tested explicitly with isNull().
checkup_skip = col('CHECKUP1').isin([3, 4, 7, 8, 9]) | col('CHECKUP1').isNull()
conditions_mode = {
    'ASBIALCH': checkup_skip,
    'ASBIDRNK': checkup_skip,
    'ASBIBING': checkup_skip,
    'ASBIADVC': checkup_skip,
    'ASBIRDUC': (checkup_skip |
                 col('ASBIALCH').isin([2, 7, 9]) | col('ASBIALCH').isNull() |
                 col('ASBIDRNK').isin([2, 7, 9]) | col('ASBIDRNK').isNull() |
                 col('ASBIBING').isin([2, 7, 9]) | col('ASBIBING').isNull())
}

for col_name, condition in conditions_mode.items():
    mode_value = df.groupBy(col_name).count().orderBy('count', ascending=False).first()[col_name]
    df_imputed = df_imputed.withColumn(col_name + '_imputed', when(condition, mode_value).otherwise(0))

df_imputed = df_imputed.join(df_imputed.select('ID', *[F.col(f"{col}_imputed").cast('double') for col in conditions_mode.keys()]), 'ID', 'left')

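# Firearms
# isin() never matches null, so a missing answer is tested explicitly with isNull().
firearm_skip = col('FIREARM5').isin([2, 7, 9]) | col('FIREARM5').isNull()
conditions_mode = {
    'FIREARM5': col('FIREARM5').isNull(),
    'GUNLOAD': firearm_skip,
    'LOADULK2': firearm_skip | col('GUNLOAD').isin([2, 7, 9]) | col('GUNLOAD').isNull()
}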

for col_name, condition in conditions_mode.items():
    mode_value = df.groupBy(col_name).count().orderBy('count', ascending=False).first()[col_name]
    df_imputed = df_imputed.withColumn(col_name + '_imputed', when(condition, mode_value).otherwise(0))

df_imputed = df_imputed.join(df_imputed.select('ID', *[F.col(f"{col}_imputed").cast('double') for col in conditions_mode.keys()]), 'ID', 'left')

# Race and health
conditions_mode = {
    'RRCLASS3': col('RRCLASS3').isNull(),
    'RRHCARE4': col('RRHCARE4').isNull(),
    'RRPHYSM2': col('RRPHYSM2').isNull()
}

for col_name, condition in conditions_mode.items():
    mode_value = df.groupBy(col_name).count().orderBy('count', ascending=False).first()[col_name]
    df_imputed = df_imputed.withColumn(col_name + '_imputed', when(condition, mode_value).otherwise(0))

df_imputed = df_imputed.join(df_imputed.select('ID', *[F.col(f"{col}_imputed").cast('double') for col in conditions_mode.keys()]), 'ID', 'left')

In [None]:
# Clean up: drop the helper "_imputed" columns
df_final = df_imputed.drop(*[col for col in df_imputed.columns if col.endswith("_imputed")])
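
The loops above share one pattern: find the most frequent value of a column and write it wherever a condition holds. The plain-Python sketch below models that idea outside Spark; it is not the notebook's code. The helper names `mode_ignoring_missing` and `impute_mode` are hypothetical, and the toy values are made up. It differs from the loops in two ways that may or may not match the intended design: the mode is computed over answered values only (a Spark `groupBy(...).count()` also counts the null group, so the most frequent "value" can be null), and rows where the condition is false keep their observed answer instead of receiving `0`.

```python
from collections import Counter


def mode_ignoring_missing(values):
    """Most frequent non-missing value, or None if every value is missing."""
    counts = Counter(v for v in values if v is not None)
    return counts.most_common(1)[0][0] if counts else None


def impute_mode(rows, column, condition):
    """Return new rows with `column` set to the mode wherever `condition(row)` is true."""
    mode_value = mode_ignoring_missing(row[column] for row in rows)
    return [
        {**row, column: mode_value} if condition(row) else dict(row)
        for row in rows
    ]


# Toy rows reusing a column name from the vaccination block; the values are invented.
rows = [
    {"ID": 0, "FLUSHOT7": 1},
    {"ID": 1, "FLUSHOT7": 2},
    {"ID": 2, "FLUSHOT7": 2},
    {"ID": 3, "FLUSHOT7": None},
]
imputed = impute_mode(rows, "FLUSHOT7", lambda row: row["FLUSHOT7"] is None)
```

In PySpark, keeping the observed answers would roughly correspond to ending the expression with `.otherwise(col(col_name))` rather than `.otherwise(0)`.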

<div style="font-family: 'Garamond'; font-size: 14px;">
    <normal>
        

This code is a PySpark workflow for handling missing values in a DataFrame of survey data on health, demographics, and social behaviors. Step by step:

1. **Generating a unique identifier for each row**: The first step adds an "ID" column to the original DataFrame (`df`). It assigns a unique identifier to each row, which makes it possible to track and join rows in later stages.

2. **Identifying columns with missing values**: Next, a list named `columns_to_impute` enumerates the columns that contain missing values and need imputation. They cover several health-related topics, such as chronic disease, access to medical care, and health behaviors.

3. **Creating an imputer**: An imputer object is created with the `Imputer` class from PySpark ML and configured to fill the missing values of the columns in `columns_to_impute` with a chosen imputation strategy.

4. **Imputing missing values**: The imputer is applied to the original DataFrame (`df`), producing a new DataFrame, `df_imputed`, in which the missing values have been replaced by imputed ones.

5. **Column-specific imputation**: Some columns need special treatment. For example, "CHILDREN" is imputed with the mean of that column rather than with the general strategy used before.

6. **Joining DataFrames**: `df_imputed` is joined with itself to attach the imputed columns to the main DataFrame, so that all imputed columns are available for later analysis.

7. **Imputation based on health conditions and behaviors**: Specific imputation strategies are applied to health-related data such as chronic disease, vaccination, and tobacco and alcohol use. They take into account the health context and the associated social behaviors.

8. **Imputation based on social and demographic conditions**: Similar imputation is carried out for social and demographic characteristics such as education, income, and social support, which can significantly influence a person's health and general well-being.

9. **Specific imputation for cannabis use and firearms**: Given their potential impact on health and public safety, missing values related to cannabis use and firearm possession are imputed with specific strategies.

10. **Improved data analysis and modeling**: Once imputation is complete, the resulting DataFrame (`df_imputed`) is ready for further analysis and modeling. Handling missing values in an appropriate and meaningful way improves the quality and integrity of the data, which leads to more precise and reliable results.
    </normal>
</div>
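
One detail matters for the condition-based steps. Spark SQL follows three-valued logic, so an `isin` test never evaluates to true for a missing value, even when `None` is one of the listed values, and `when` treats the resulting null as "condition not met". The sketch below models that behavior in plain Python. The function names `sql_in` and `skipped` are illustrative only and are not PySpark APIs.

```python
def sql_in(value, candidates):
    """Model SQL `value IN (candidates)`; returns True, False, or None (SQL NULL)."""
    if value is None:
        return None  # NULL IN (...) is NULL
    if any(c is not None and c == value for c in candidates):
        return True
    # No match: the result is NULL if the list contains NULL, otherwise False.
    return None if any(c is None for c in candidates) else False


def skipped(value, skip_codes):
    """True when the answer is one of the skip codes or is missing."""
    return sql_in(value, skip_codes) is True or value is None


# Listing None among the candidates does not make a missing answer match.
naive = sql_in(None, [2, 7, 9, None])
# Combining the membership test with an explicit null check does.
explicit = skipped(None, [2, 7, 9])
```

In PySpark the explicit form corresponds to an expression such as `col('X').isin([2, 7, 9]) | col('X').isNull()`, where `X` stands for the parent question.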

In [None]:
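# NOTE: df_final was built by dropping every column ending in "_imputed", so this name filter may select no columns.
df_final = df_final.select(*[col for col in df_final.columns if "imputed" in col])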

In [None]:
guardar_info_nulos_en_txt(df_final, "nulos_4")

In [None]:
df_imputed