# Data preparation

This notebook documents the AI4I 2020 dataset, its initial loading, and the cleaning and preprocessing steps that precede modeling.

## Dataset description

Summary of columns, structure, and objectives: predictive maintenance of a milling machine, with thermal, speed, and wear sensors, plus binary and multi-class failure labels (`Machine failure` and the failure-mode flags `TWF`, `HDF`, `PWF`, `OSF`, `RNF`).

## Data loading

Initial load of the `ai4i2020.csv` file and first inspection (dimensions, types, and a sample of rows).

In [1]:
import pandas as pd
from pathlib import Path

data_path = Path('../data/ai4i2020.csv')
df = pd.read_csv(data_path)
df.head(5)

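|   | UDI | Product ID | Type | Air temperature [K] | Process temperature [K] | Rotational speed [rpm] | Torque [Nm] | Tool wear [min] | Machine failure | TWF | HDF | PWF | OSF | RNF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | M14860 | M | 298.1 | 308.6 | 1551 | 42.8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | L47181 | L | 298.2 | 308.7 | 1408 | 46.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 3 | L47182 | L | 298.1 | 308.5 | 1498 | 49.4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | L47183 | L | 298.2 | 308.6 | 1433 | 39.5 | 7 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | L47184 | L | 298.2 | 308.7 | 1408 | 40.0 | 9 | 0 | 0 | 0 | 0 | 0 | 0 |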


## Data cleaning

Handling of null values, review of column types, and detection of inconsistencies in numeric or categorical variables.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   UDI                      10000 non-null  int64  
 1   Product ID               10000 non-null  object 
 2   Type                     10000 non-null  object 
 3   Air temperature [K]      10000 non-null  float64
 4   Process temperature [K]  10000 non-null  float64
 5   Rotational speed [rpm]   10000 non-null  int64  
 6   Torque [Nm]              10000 non-null  float64
 7   Tool wear [min]          10000 non-null  int64  
 8   Machine failure          10000 non-null  int64  
 9   TWF                      10000 non-null  int64  
 10  HDF                      10000 non-null  int64  
 11  PWF                      10000 non-null  int64  
 12  OSF                      10000 non-null  int64  
 13  RNF                      10000 non-null  int64  
dtypes: float64(3), int64(9), object(2)
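
The `df.info()` output above reports no null values in any column, so the remaining cleaning work is mostly consistency checking. The following is a minimal sketch of such checks, run on a small hypothetical frame (`toy`, with made-up values and only a subset of the columns) rather than on the real file. Whether a row with a failure-mode flag set and `Machine failure == 0` is actually an error depends on how the labels are defined, so the last check only surfaces candidates for review.

```python
import pandas as pd

# Hypothetical toy frame with a subset of the AI4I 2020 columns.
# The values are illustrative only and are not taken from the dataset.
toy = pd.DataFrame({
    "UDI": [1, 2, 3, 3],
    "Type": ["M", "L", "L", "X"],
    "Torque [Nm]": [42.8, 46.3, None, 39.5],
    "Machine failure": [0, 1, 0, 0],
    "TWF": [0, 1, 0, 1],
    "HDF": [0, 0, 0, 0],
})

# 1. Null values per column.
null_counts = toy.isna().sum()

# 2. Repeated identifiers (UDI is assumed here to be unique per row).
duplicated_ids = int(toy["UDI"].duplicated().sum())

# 3. Category values outside the expected set of product types.
unexpected_types = set(toy["Type"]) - {"L", "M", "H"}

# 4. Rows where a failure-mode flag is set but the overall label is 0.
mode_cols = ["TWF", "HDF"]
inconsistent = toy[(toy[mode_cols].sum(axis=1) > 0) & (toy["Machine failure"] == 0)]

print(null_counts)
print("duplicated UDI values:", duplicated_ids)
print("unexpected Type values:", unexpected_types)
print(inconsistent)
```

On the real DataFrame the same expressions would be applied to `df`, with `mode_cols` extended to all five failure-mode columns.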

## Preprocessing

Scaling of the numeric variables and encoding of the categorical variables.

In [None]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Numeric sensor columns to standardize
numeric_cols = [
    'Air temperature [K]',
    'Process temperature [K]',
    'Rotational speed [rpm]',
    'Torque [Nm]',
    'Tool wear [min]'
]

# Categorical column to one-hot encode
categorical_cols = ['Type']

ohe = OneHotEncoder(handle_unknown='ignore')

# Columns not listed above (identifiers and failure labels) pass through unchanged
transformer = ColumnTransformer([
    ('num', StandardScaler(), numeric_cols),
    ('cat', ohe, categorical_cols)
], remainder='passthrough')

arr = transformer.fit_transform(df)

# ColumnTransformer may return a sparse matrix; convert it to a dense array if so
if hasattr(arr, "toarray"):
    arr = arr.toarray()
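
To make the layout of the transformed array easier to follow, here is a small sketch on a hypothetical three-column frame (`toy`, with made-up values). It assumes the same transformer structure as the cell above, with one difference: `sparse_threshold=0` is passed so that `ColumnTransformer` always returns a dense array, which is one way to avoid the `toarray()` guard. The output columns come in the order the transformers are listed (scaled numeric, then one-hot), followed by the passthrough columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; values are illustrative only.
toy = pd.DataFrame({
    "Torque [Nm]": [40.0, 42.0, 44.0, 46.0],
    "Type": ["L", "M", "H", "L"],
    "Machine failure": [0, 0, 1, 0],
})

toy_transformer = ColumnTransformer(
    [
        ("num", StandardScaler(), ["Torque [Nm]"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Type"]),
    ],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense NumPy array
)

out = toy_transformer.fit_transform(toy)

# Column 0: standardized torque (zero mean, unit variance).
scaled = out[:, 0].astype(float)
# Columns 1-3: one-hot columns, with categories sorted as H, L, M.
one_hot = out[:, 1:4].astype(float)
# Column 4: the passthrough label, unchanged.
passthrough = out[:, 4]

print(out.shape)
print(one_hot)
```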


## Saving the preprocessed dataset

Export the prepared DataFrame to `data/ai4i_clean.csv` for use in the modeling workflow.

In [None]:
cat_ohe = transformer.named_transformers_['cat']
categories = cat_ohe.categories_[0]   # sorted alphabetically: ['H', 'L', 'M']
cat_features = [f"Type_{cat}" for cat in categories]

# Passthrough columns keep their original relative order
remainder_cols = [c for c in df.columns if c not in numeric_cols + categorical_cols]

# Same order as the ColumnTransformer output: numeric, one-hot, remainder
feature_names = numeric_cols + cat_features + remainder_cols

df_prepared = pd.DataFrame(arr, columns=feature_names)

clean_path = Path('../data/ai4i_clean.csv')
df_prepared.to_csv(clean_path, index=False)
print("Saved:", clean_path)

Saved: ..\data\ai4i_clean.csv


In [7]:
print(df_prepared.head().to_string())

  Air temperature [K] Process temperature [K] Rotational speed [rpm] Torque [Nm] Tool wear [min] Type_H Type_L Type_M UDI Product ID Machine failure TWF HDF PWF OSF RNF
0           -0.952389                -0.94736               0.068185      0.2822       -1.695984    0.0    0.0    1.0   1     M14860               0   0   0   0   0   0
1           -0.902393               -0.879959              -0.729472    0.633308       -1.648852    0.0    1.0    0.0   2     L47181               0   0   0   0   0   0
2           -0.952389               -1.014761               -0.22745     0.94429        -1.61743    0.0    1.0    0.0   3     L47182               0   0   0   0   0   0
3           -0.902393                -0.94736              -0.590021   -0.048845       -1.586009    0.0    1.0    0.0   4     L47183               0   0   0   0   0   0
4           -0.902393               -0.879959              -0.729472    0.001313       -1.554588    0.0    1.0    0.0   5     L47184               0 