# **Diploma Course in Machine Learning with Python**
## Assignment 03 – Session 05

## **Feature Scaling**


📊 Feature scaling is a fundamental step in data preprocessing, especially when working with algorithms that are sensitive to the magnitude of the features. Its goal is to ensure that all variables contribute equitably to the analysis, avoiding biases caused by differences in scale or units.


Applying different scaling techniques to different columns of a dataset, depending on the variable type or its sensitivity to outliers, is key to preparing data before applying models such as PCA, regression, or clustering.
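
As a minimal sketch of why magnitude matters, the example below compares Euclidean distances before and after standardization. The three rows are hypothetical values (a temperature-like column and a pressure-like column), not records from the dataset, and `dist` is an illustrative helper.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: column 0 is temperature-like, column 1 is pressure-like.
X = np.array([
    [60.0, 29.7],
    [61.0, 30.5],
    [90.0, 29.7],
])

def dist(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

# Raw distances: the temperature column dominates because its range is larger.
d01_raw = dist(X[0], X[1])
d02_raw = dist(X[0], X[2])

# After standardization both columns contribute on a comparable scale.
X_scaled = StandardScaler().fit_transform(X)
d01_scaled = dist(X_scaled[0], X_scaled[1])
d02_scaled = dist(X_scaled[0], X_scaled[2])

print(f"raw:    d(0,1)={d01_raw:.2f}  d(0,2)={d02_raw:.2f}")
print(f"scaled: d(0,1)={d01_scaled:.2f}  d(0,2)={d02_scaled:.2f}")
```

In the raw data the pressure difference between rows 0 and 1 barely registers; after scaling it weighs about as much as the temperature difference between rows 0 and 2.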

This assignment uses the **Austin Weather Dataset**.


![Austin Weather Illustration](https://www.austintexas.gov/sites/default/files/skyline2.png)




## Importing libraries and reading the CSV file


In [47]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.compose import ColumnTransformer

df = pd.read_csv("austin_weather.csv")
print(f'Dataset dimensions: Rows: {df.shape[0]}, Columns: {df.shape[1]} \n')
df.head(5)

Dataset dimensions: Rows: 1319, Columns: 21 



Unnamed: 0,Date,TempHighF,TempAvgF,TempLowF,DewPointHighF,DewPointAvgF,DewPointLowF,HumidityHighPercent,HumidityAvgPercent,HumidityLowPercent,...,SeaLevelPressureAvgInches,SeaLevelPressureLowInches,VisibilityHighMiles,VisibilityAvgMiles,VisibilityLowMiles,WindHighMPH,WindAvgMPH,WindGustMPH,PrecipitationSumInches,Events
0,2013-12-21,74,60,45,67,49,43,93,75,57,...,29.68,29.59,10,7,2,20,4,31,0.46,"Rain , Thunderstorm"
1,2013-12-22,56,48,39,43,36,28,93,68,43,...,30.13,29.87,10,10,5,16,6,25,0,
2,2013-12-23,58,45,32,31,27,23,76,52,27,...,30.49,30.41,10,10,10,8,3,12,0,
3,2013-12-24,61,46,31,36,28,21,89,56,22,...,30.45,30.3,10,10,7,12,4,20,0,
4,2013-12-25,58,50,41,44,40,36,86,71,56,...,30.33,30.27,10,10,7,10,2,16,T,


## Initial data review

In [48]:
d_type = df.dtypes # Data type
n_non_null = df.count() # Number of non-null values
n_unique = df.nunique() # Number of unique values
n_null = df.isnull().sum() # Number of null values
ratio_null = df.isnull().sum()/df.shape[0] # Fraction of null values

pd.DataFrame(
    {"d_type": d_type,
     "n_non_null": n_non_null,
     "n_unique": n_unique,
     "n_null": n_null,
     "ratio_null": ratio_null}
)

Unnamed: 0,d_type,n_non_null,n_unique,n_null,ratio_null
Date,object,1319,1319,0,0.0
TempHighF,int64,1319,74,0,0.0
TempAvgF,int64,1319,64,0,0.0
TempLowF,int64,1319,61,0,0.0
DewPointHighF,object,1319,64,0,0.0
DewPointAvgF,object,1319,66,0,0.0
DewPointLowF,object,1319,73,0,0.0
HumidityHighPercent,object,1319,58,0,0.0
HumidityAvgPercent,object,1319,69,0,0.0
HumidityLowPercent,object,1319,82,0,0.0
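
Several measurement columns appear as `object` even though they hold numbers, which usually means some cells contain text placeholders. The sketch below shows one way to list the values that block a numeric conversion. The `toy` frame and the helper `non_numeric_tokens` are hypothetical; the `"-"` and `"T"` placeholders mirror the ones this notebook handles in its conversion step.

```python
import pandas as pd

# Hypothetical toy frame mimicking the placeholders handled in this notebook.
toy = pd.DataFrame({
    "DewPointHighF": ["67", "43", "-", "36"],
    "PrecipitationSumInches": ["0.46", "0", "T", "-"],
})

def non_numeric_tokens(series):
    """Return the sorted distinct values that pd.to_numeric cannot parse."""
    parsed = pd.to_numeric(series, errors="coerce")
    # Keep values that were present originally but became NaN after parsing.
    bad = series[parsed.isna() & series.notna()]
    return sorted(bad.unique())

tokens = {col: non_numeric_tokens(toy[col]) for col in toy.columns}
print(tokens)
```

Running a check like this on the real DataFrame before converting makes it easier to decide how each placeholder should be treated.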


## Converting numeric columns before processing

In [49]:
# Replace "-" with NaN in all columns
df.replace("-", np.nan, inplace=True)

# Replace "T" (trace amount) in the "PrecipitationSumInches" column with 0.001
df["PrecipitationSumInches"] = df["PrecipitationSumInches"].replace("T", "0.001")
df["PrecipitationSumInches"] = pd.to_numeric(df["PrecipitationSumInches"], errors="coerce")

# Columns to exclude from the conversion
excluir = ['Date', 'Events']

# Convert the remaining columns to numeric
for col in df.columns:
    if col not in excluir:
        df[col] = pd.to_numeric(df[col])

# Verify the conversion
df.dtypes


Unnamed: 0,0
Date,object
TempHighF,int64
TempAvgF,int64
TempLowF,int64
DewPointHighF,float64
DewPointAvgF,float64
DewPointLowF,float64
HumidityHighPercent,float64
HumidityAvgPercent,float64
HumidityLowPercent,float64


## Identifying columns with outliers

1. All numeric columns of the DataFrame are traversed.
2. For each one, the outlier limits are computed and the values are checked against them.
3. A list is returned with the names of the columns that do contain outliers.


In [50]:
# Select the numeric columns of the DataFrame `df`
num_cols = df.select_dtypes(include=[np.number]).columns

# Initialize a list to store the names of the columns that contain outliers
cols_con_outliers = []

# Iterate over each numeric column
for c in num_cols:
    # Coerce the values to numeric (in case any strings remain) and drop null values
    s = pd.to_numeric(df[c], errors="coerce").dropna()

    # If the series is empty after cleaning, move on to the next column
    if s.empty:
        continue

    # Compute the first and third quartiles (Q1 and Q3)
    q1, q3 = s.quantile(0.25), s.quantile(0.75)

    # Compute the interquartile range (IQR)
    iqr = q3 - q1

    # If the IQR is zero, the middle 50% of the values are identical and the
    # 1.5 * IQR rule cannot be applied, so the column is skipped
    if iqr == 0:
        continue

    # Define the lower and upper limits used to detect outliers
    lim_inf, lim_sup = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    # Count how many values fall outside those limits
    n_out = int(((s < lim_inf) | (s > lim_sup)).sum())

    # If there is at least one outlier, add the column to the list
    if n_out > 0:
        cols_con_outliers.append(c)

print(f'Columns with outliers:\n {cols_con_outliers}')

Columns with outliers:
 ['TempHighF', 'TempAvgF', 'DewPointHighF', 'DewPointAvgF', 'HumidityHighPercent', 'HumidityAvgPercent', 'HumidityLowPercent', 'SeaLevelPressureHighInches', 'SeaLevelPressureAvgInches', 'SeaLevelPressureLowInches', 'VisibilityAvgMiles', 'WindHighMPH', 'WindAvgMPH', 'WindGustMPH', 'PrecipitationSumInches']
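
To make the 1.5 × IQR rule concrete, here is a small sketch on a toy series. The values are hypothetical (not taken from the dataset) and `iqr_bounds` is an illustrative helper, not a pandas function.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) limits of the k * IQR rule for a numeric Series."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy values with one clearly extreme observation.
s = pd.Series([10, 12, 13, 14, 15, 16, 18, 95])

lower, upper = iqr_bounds(s)
outliers = s[(s < lower) | (s > upper)]

print(f"limits: [{lower}, {upper}]")
print(f"outliers: {outliers.tolist()}")
```

Here Q1 = 12.75 and Q3 = 16.5, so the limits are 7.125 and 22.125 and only the value 95 falls outside them.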


## Defining the scaling type for each column

In [51]:
# Define the columns for each type of scaling

# Variables in similar ranges → standard scaling
cols_std = ['TempLowF', 'DewPointLowF', 'VisibilityHighMiles','VisibilityLowMiles']

# Percentage variables → scaling between 0 and 1
cols_minmax = ['HumidityHighPercent','HumidityAvgPercent', 'HumidityLowPercent']

# Variables with outliers → robust scaling
cols_robust =  ['TempHighF', 'TempAvgF', 'DewPointHighF', 'DewPointAvgF','SeaLevelPressureHighInches',
                'SeaLevelPressureAvgInches', 'SeaLevelPressureLowInches', 'VisibilityAvgMiles',
                'WindHighMPH', 'WindAvgMPH', 'WindGustMPH', 'PrecipitationSumInches']


print(f'Columns for standardization:\n {cols_std}\n')
print(f'Columns for min-max:\n {cols_minmax}\n')
print(f'Columns for RobustScaler:\n {cols_robust}')

Columns for standardization:
 ['TempLowF', 'DewPointLowF', 'VisibilityHighMiles', 'VisibilityLowMiles']

Columns for min-max:
 ['HumidityHighPercent', 'HumidityAvgPercent', 'HumidityLowPercent']

Columns for RobustScaler:
 ['TempHighF', 'TempAvgF', 'DewPointHighF', 'DewPointAvgF', 'SeaLevelPressureHighInches', 'SeaLevelPressureAvgInches', 'SeaLevelPressureLowInches', 'VisibilityAvgMiles', 'WindHighMPH', 'WindAvgMPH', 'WindGustMPH', 'PrecipitationSumInches']
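
As a rough guide to what each of the three scalers computes with its default settings, the sketch below reproduces them by hand on a hypothetical single feature and compares the results with scikit-learn. The array `x` is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Hypothetical single feature with one extreme value.
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# StandardScaler: (x - mean) / std, using the population standard deviation.
z_manual = (x - x.mean()) / x.std()

# MinMaxScaler: (x - min) / (max - min), which maps the feature onto [0, 1].
mm_manual = (x - x.min()) / (x.max() - x.min())

# RobustScaler: (x - median) / IQR, with IQR = Q3 - Q1 by default.
q1, q3 = np.percentile(x, [25, 75])
rb_manual = (x - np.median(x)) / (q3 - q1)

z_sklearn = StandardScaler().fit_transform(x)
mm_sklearn = MinMaxScaler().fit_transform(x)
rb_sklearn = RobustScaler().fit_transform(x)

print("standard matches:", np.allclose(z_manual, z_sklearn))
print("min-max matches: ", np.allclose(mm_manual, mm_sklearn))
print("robust matches:  ", np.allclose(rb_manual, rb_sklearn))
```

Because the median and IQR are barely affected by the extreme value, the robust version keeps the four ordinary points spread out, whereas min-max compresses them near 0.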


## Applying transformations to the columns with `ColumnTransformer`



1. Column groups are defined according to the most suitable type of scaling: standard, MinMax, and robust.
2. A `ColumnTransformer` is created that applies each scaling technique to its corresponding group and drops the columns not specified.
3. The transformer is fitted (`fit`) on the full dataset to learn the scaling parameters.
4. The dataset is transformed (`transform`) by applying the fitted scalers, producing a matrix with the scaled variables.
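
Since the conversion step turns `"-"` into `NaN`, it helps to know how the scalers treat missing values during `fit` and `transform`. The sketch below, on a hypothetical three-value column, suggests that the three scalers used here ignore `NaN` when learning their parameters and leave it in place in the output.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Hypothetical column with one missing value.
x = np.array([[1.0], [np.nan], [3.0]])

results = {}
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    # fit learns its statistics from the non-missing values only;
    # transform keeps NaN where the input had NaN.
    results[type(scaler).__name__] = scaler.fit(x).transform(x).ravel()

for name, values in results.items():
    print(name, values)
```

The scaled matrix can therefore still contain missing values, so any imputation has to be handled as a separate step.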


In [52]:
# Build the column-wise transformer

'''
We use ColumnTransformer to apply different transformations to different subsets
of columns in a single pass. Each tuple inside transformers=[...]
has the form (block_name, transformer, target_columns).

- ("std", StandardScaler(), cols_std)
  Applies standardization to the columns listed in cols_std
- ("minmax", MinMaxScaler(), cols_minmax)
  Applies Min–Max normalization to the columns in cols_minmax
- ("robust", RobustScaler(quantile_range=(25, 75)), cols_robust)
  Applies robust scaling to the columns in cols_robust

Additional `ColumnTransformer` parameters:

- remainder="drop"
  The columns not listed in the blocks above are dropped from the output.
  (Alternative: "passthrough" to let them through untransformed.)

- verbose_feature_names_out=False
  Keeps "clean" output names (e.g., `TempLowF`) instead of prefixing them with the block
  name (e.g., `std__TempLowF`).
'''

preprocesador = ColumnTransformer(
    transformers=[
        ("std", StandardScaler(), cols_std), # Applies standard scaling to cols_std
        ("minmax", MinMaxScaler(), cols_minmax), # Applies MinMax to cols_minmax
        ("robust", RobustScaler(quantile_range=(25, 75)), cols_robust), # Applies robust scaling to cols_robust
    ],
    remainder="drop",
    verbose_feature_names_out=False
)

# Fit on the ENTIRE dataset
preprocesador.fit(df)

# Transform with the learned parameters
X_escalado = preprocesador.transform(df)

## Rebuilding the DataFrame with the transformed variables


In [53]:
# Rebuild the DataFrame with the same column names

# Get the names of the columns transformed by the preprocessor
columnas_escaladas = preprocesador.get_feature_names_out()

# Convert the scaled NumPy array (X_escalado) into a pandas DataFrame
datos_escalado = pd.DataFrame(X_escalado, columns=columnas_escaladas, index=df.index)

# Show the scaled DataFrame with the transformed variables
datos_escalado.head(5)


Unnamed: 0,TempLowF,DewPointLowF,VisibilityHighMiles,VisibilityLowMiles,HumidityHighPercent,HumidityAvgPercent,HumidityLowPercent,TempHighF,TempAvgF,DewPointHighF,DewPointAvgF,SeaLevelPressureHighInches,SeaLevelPressureAvgInches,SeaLevelPressureLowInches,VisibilityAvgMiles,WindHighMPH,WindAvgMPH,WindGustMPH,PrecipitationSumInches
0,-1.050594,-0.4903,0.051499,-1.314512,0.888889,0.685714,0.566265,-0.45,-0.619048,0.05,-0.521739,-1.0,-1.684211,-1.6,-3.0,1.4,-0.333333,1.25,460.0
1,-1.473568,-1.417295,0.051499,-0.499747,0.888889,0.585714,0.39759,-1.35,-1.190476,-1.15,-1.086957,1.5,0.684211,-0.2,0.0,0.6,0.333333,0.5,0.0
2,-1.967038,-1.726293,0.051499,0.858194,0.619048,0.357143,0.204819,-1.25,-1.333333,-1.75,-1.478261,2.181818,2.578947,2.5,0.0,-1.0,-0.666667,-1.125,0.0
3,-2.037533,-1.849893,0.051499,0.043429,0.825397,0.414286,0.144578,-1.1,-1.285714,-1.5,-1.434783,2.181818,2.368421,1.95,0.0,-0.2,-0.333333,-0.125,0.0
4,-1.332577,-0.922897,0.051499,0.043429,0.777778,0.628571,0.554217,-1.25,-1.095238,-1.1,-0.913043,1.5,1.736842,1.8,0.0,-0.6,-1.0,-0.625,1.0
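
The value 460.0 in `PrecipitationSumInches` stands out. In the rows shown, a raw 0 maps to 0.0, the trace value 0.001 maps to 1.0, and 0.46 maps to 460.0, which is consistent with a fitted median of 0 and an IQR of 0.001. The sketch below reproduces that behaviour on a hypothetical toy column chosen to have those quartiles; it is not the real data.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical toy column: Q1 = 0, median = 0, Q3 = 0.001 (the trace replacement).
precip = np.array([[0.0], [0.0], [0.0], [0.001], [0.46]])

scaler = RobustScaler(quantile_range=(25, 75)).fit(precip)
scaled = scaler.transform(precip).ravel()

print("center:", scaler.center_[0])
print("scale: ", scaler.scale_[0])
print("scaled:", scaled)
```

When most days have no rain, the IQR is set by the value chosen to stand in for `"T"`, so the scaled magnitudes depend directly on that choice.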


In [54]:
# --- Reorder so the columns follow the original CSV order ---

# Build a list with the original columns that are also present in the scaled DataFrame
# This preserves the column order of the original CSV file in the scaled output
orden_original = [c for c in df.columns if c in datos_escalado.columns]

# Reorder the columns of the scaled DataFrame according to the original order
datos_escalado = datos_escalado[orden_original]

# Show the scaled DataFrame with the columns in the same order as the original CSV
datos_escalado.head(5)


Unnamed: 0,TempHighF,TempAvgF,TempLowF,DewPointHighF,DewPointAvgF,DewPointLowF,HumidityHighPercent,HumidityAvgPercent,HumidityLowPercent,SeaLevelPressureHighInches,SeaLevelPressureAvgInches,SeaLevelPressureLowInches,VisibilityHighMiles,VisibilityAvgMiles,VisibilityLowMiles,WindHighMPH,WindAvgMPH,WindGustMPH,PrecipitationSumInches
0,-0.45,-0.619048,-1.050594,0.05,-0.521739,-0.4903,0.888889,0.685714,0.566265,-1.0,-1.684211,-1.6,0.051499,-3.0,-1.314512,1.4,-0.333333,1.25,460.0
1,-1.35,-1.190476,-1.473568,-1.15,-1.086957,-1.417295,0.888889,0.585714,0.39759,1.5,0.684211,-0.2,0.051499,0.0,-0.499747,0.6,0.333333,0.5,0.0
2,-1.25,-1.333333,-1.967038,-1.75,-1.478261,-1.726293,0.619048,0.357143,0.204819,2.181818,2.578947,2.5,0.051499,0.0,0.858194,-1.0,-0.666667,-1.125,0.0
3,-1.1,-1.285714,-2.037533,-1.5,-1.434783,-1.849893,0.825397,0.414286,0.144578,2.181818,2.368421,1.95,0.051499,0.0,0.043429,-0.2,-0.333333,-0.125,0.0
4,-1.25,-1.095238,-1.332577,-1.1,-0.913043,-0.922897,0.777778,0.628571,0.554217,1.5,1.736842,1.8,0.051499,0.0,0.043429,-0.6,-1.0,-0.625,1.0


In [None]:
# Export the scaled DataFrame to CSV

datos_escalado.to_csv("data_scaled.csv")
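
One detail to keep in mind when reading the exported file back: by default `to_csv` writes the index as an unnamed first column, which `read_csv` then loads as `Unnamed: 0`. The sketch below shows the difference using an in-memory buffer and a hypothetical two-column frame instead of the real file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the scaled DataFrame.
toy = pd.DataFrame({"TempHighF": [-0.45, -1.35], "TempAvgF": [-0.62, -1.19]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# With index=False only the data columns are written.
no_index = io.StringIO()
toy.to_csv(no_index, index=False)
no_index.seek(0)
cols_no_index = list(pd.read_csv(no_index).columns)

print(cols_with_index)
print(cols_no_index)
```

Either passing `index=False` when writing or `index_col=0` when reading avoids the extra column; which one fits depends on whether the row index carries meaning.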