# Data preparation

This notebook describes the AI4I 2020 dataset, loads it, and walks through the cleaning and preprocessing steps that precede modeling.

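## Dataset description

Summary of columns, structure, and objectives: predictive maintenance of a milling machine, with temperature, rotational-speed, torque, and tool-wear readings, plus a binary failure label (`Machine failure`) and five failure-mode flags (`TWF`, `HDF`, `PWF`, `OSF`, `RNF`) that serve as the basis for a multiclass label.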

## Loading the data

Initial load of the `ai4i2020.csv` file and first inspection (dimensions, types, and a sample of rows).

In [8]:
import pandas as pd
from pathlib import Path

data_path = Path('../data/ai4i2020.csv')
df = pd.read_csv(data_path)
df.head(5)

Unnamed: 0,UDI,Product ID,Type,Air temperature [K],Process temperature [K],Rotational speed [rpm],Torque [Nm],Tool wear [min],Machine failure,TWF,HDF,PWF,OSF,RNF
0,1,M14860,M,298.1,308.6,1551,42.8,0,0,0,0,0,0,0
1,2,L47181,L,298.2,308.7,1408,46.3,3,0,0,0,0,0,0
2,3,L47182,L,298.1,308.5,1498,49.4,5,0,0,0,0,0,0
3,4,L47183,L,298.2,308.6,1433,39.5,7,0,0,0,0,0,0
4,5,L47184,L,298.2,308.7,1408,40.0,9,0,0,0,0,0,0


## Data cleaning

Handling of null values, review of column types, and detection of inconsistencies in numeric or categorical variables. The `df.info()` output below reports 10,000 rows with no null values in any of the 14 columns.

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   UDI                      10000 non-null  int64  
 1   Product ID               10000 non-null  object 
 2   Type                     10000 non-null  object 
 3   Air temperature [K]      10000 non-null  float64
 4   Process temperature [K]  10000 non-null  float64
 5   Rotational speed [rpm]   10000 non-null  int64  
 6   Torque [Nm]              10000 non-null  float64
 7   Tool wear [min]          10000 non-null  int64  
 8   Machine failure          10000 non-null  int64  
 9   TWF                      10000 non-null  int64  
 10  HDF                      10000 non-null  int64  
 11  PWF                      10000 non-null  int64  
 12  OSF                      10000 non-null  int64  
 13  RNF                      10000 non-null  int64  
dtypes: float64(3), int64(9), object(2)
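
`df.info()` covers nulls and types, but not duplicates or label inconsistencies. A minimal sketch of such a check follows, assuming the column names shown above. The helper `quality_report` and the demo frame are hypothetical illustrations, not part of `src/data_prep.py`, and the demo rows are not real AI4I records.

```python
import pandas as pd

FAILURE_FLAGS = ['TWF', 'HDF', 'PWF', 'OSF', 'RNF']

def quality_report(frame: pd.DataFrame) -> dict:
    """Return basic data-quality counts for an AI4I-style frame (hypothetical helper)."""
    flags_per_row = frame[FAILURE_FLAGS].sum(axis=1)
    any_flag = flags_per_row > 0
    failed = frame['Machine failure'] == 1
    return {
        'null_cells': int(frame.isna().sum().sum()),
        'duplicate_rows': int(frame.duplicated().sum()),
        # Rows where the binary label and the failure-mode flags disagree
        'label_flag_mismatch': int((any_flag != failed).sum()),
        # Rows with more than one failure mode flagged
        'multi_flag_rows': int((flags_per_row > 1).sum()),
    }

# Tiny made-up frame, used only to show the shape of the report
demo = pd.DataFrame({
    'Machine failure': [0, 1, 1, 0],
    'TWF': [0, 1, 0, 0],
    'HDF': [0, 1, 0, 0],
    'PWF': [0, 0, 0, 0],
    'OSF': [0, 0, 0, 0],
    'RNF': [0, 0, 0, 1],
})
report = quality_report(demo)
print(report)
```

On the real data, `quality_report(df)` would be the call. Non-zero mismatch or multi-flag counts are worth knowing before a single multiclass label is derived from the flags.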

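## Preprocessing

Scaling of the numeric variables and encoding of the categorical ones. The cell below only identifies the relevant columns; the transformations themselves are applied outside this notebook by the script `src/data_prep.py`.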

In [10]:
# This cell only identifies the key columns before modeling; no transformation is applied here
numeric_cols = [
    'Air temperature [K]',
    'Process temperature [K]',
    'Rotational speed [rpm]',
    'Torque [Nm]',
    'Tool wear [min]'
]

categorical_cols = ['Type']

print('Samples / columns:', df.shape)
print('Relevant dtypes:')
print(df[numeric_cols + categorical_cols].dtypes)


Samples / columns: (10000, 14)
Relevant dtypes:
Air temperature [K]        float64
Process temperature [K]    float64
Rotational speed [rpm]       int64
Torque [Nm]                float64
Tool wear [min]              int64
Type                        object
dtype: object
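
The scaling and encoding described above could look roughly like the sketch below. It is pandas-only and is not the code in `src/data_prep.py`. The definitions of `Temp_diff` and `Torque_per_rpm` are assumptions inferred from the column names in the cleaned CSV, and the function name `preprocess` is hypothetical. The demo uses the five rows printed by `df.head(5)`, so the scaled values will not match those in `ai4i_clean.csv`, which are computed over all 10,000 rows.

```python
import pandas as pd

NUMERIC_COLS = [
    'Air temperature [K]',
    'Process temperature [K]',
    'Rotational speed [rpm]',
    'Torque [Nm]',
    'Tool wear [min]',
]

def preprocess(frame: pd.DataFrame) -> pd.DataFrame:
    """Derive features, z-score the numeric columns, one-hot encode `Type`."""
    out = frame.copy()
    # Assumed definitions, inferred from the column names only
    out['Temp_diff'] = out['Process temperature [K]'] - out['Air temperature [K]']
    out['Torque_per_rpm'] = out['Torque [Nm]'] / out['Rotational speed [rpm]']

    to_scale = NUMERIC_COLS + ['Temp_diff', 'Torque_per_rpm']
    # Population standard deviation (ddof=0), the convention StandardScaler uses
    out[to_scale] = (out[to_scale] - out[to_scale].mean()) / out[to_scale].std(ddof=0)

    # One indicator column per category value found in `Type`
    return pd.get_dummies(out, columns=['Type'], dtype=float)

# The five rows shown by df.head(5) earlier in the notebook
sample = pd.DataFrame({
    'Type': ['M', 'L', 'L', 'L', 'L'],
    'Air temperature [K]': [298.1, 298.2, 298.1, 298.2, 298.2],
    'Process temperature [K]': [308.6, 308.7, 308.5, 308.6, 308.7],
    'Rotational speed [rpm]': [1551, 1408, 1498, 1433, 1408],
    'Torque [Nm]': [42.8, 46.3, 49.4, 39.5, 40.0],
    'Tool wear [min]': [0, 3, 5, 7, 9],
})
processed = preprocess(sample)
print(processed.columns.tolist())
```

Fitting the scaling statistics on the whole dataset is fine for EDA. For modeling, they should be fitted on the training split only, which is what a pipeline achieves.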


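## Saving the preprocessed dataset

Export the prepared DataFrame to `data/ai4i_clean.csv` for use in the modeling workflow. The export itself is performed by `src/data_prep.py`; the cell below only prints the equivalent call.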

In [11]:
from pathlib import Path

clean_path = Path('../data/ai4i_clean.csv')  # written by the `src/data_prep.py` script
print('To dump the DataFrame from here, this would be enough:')
# as_posix() keeps forward slashes, so the printed call can be pasted on any OS
print(f"df.to_csv('{clean_path.as_posix()}', index=False)")


To dump the DataFrame from here, this would be enough:
df.to_csv('../data/ai4i_clean.csv', index=False)
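
A short sketch of the export step, in case it helps to see why `index=False` matters: without it, the row index is written as an extra unnamed column and comes back as `Unnamed: 0` on reload. The helper `save_clean` and the demo file name are hypothetical, and the demo writes to a temporary directory rather than to `data/`.

```python
import tempfile
from pathlib import Path

import pandas as pd

def save_clean(frame: pd.DataFrame, path) -> Path:
    """Write a frame to CSV without the index, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(path, index=False)
    return path

with tempfile.TemporaryDirectory() as tmp:
    demo = pd.DataFrame({'UDI': [1, 2], 'Type': ['M', 'L']})
    written = save_clean(demo, Path(tmp) / 'data' / 'demo_clean.csv')
    # Read the file back while the temporary directory still exists
    restored = pd.read_csv(written)

print(restored.columns.tolist())
```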


In [12]:
df.head().to_string()

'   UDI Product ID Type  Air temperature [K]  Process temperature [K]  Rotational speed [rpm]  Torque [Nm]  Tool wear [min]  Machine failure  TWF  HDF  PWF  OSF  RNF\n0    1     M14860    M                298.1                    308.6                    1551         42.8                0                0    0    0    0    0    0\n1    2     L47181    L                298.2                    308.7                    1408         46.3                3                0    0    0    0    0    0\n2    3     L47182    L                298.1                    308.5                    1498         49.4                5                0    0    0    0    0    0\n3    4     L47183    L                298.2                    308.6                    1433         39.5                7                0    0    0    0    0    0\n4    5     L47184    L                298.2                    308.7                    1408         40.0                9                0    0    0    0    0    0'

In [None]:
import sys
import pandas as pd
from pathlib import Path
from subprocess import run, CalledProcessError

# Use the cleaned CSV produced by the script for EDA. If it is missing, run the data_prep script.
clean_path = Path('../data/ai4i_clean.csv')
if not clean_path.exists():
    print('ai4i_clean.csv not found; running the data_prep script in a subprocess...')
    try:
        # Assumes the repo root is the notebook's parent directory, so that `src` is importable
        run([sys.executable, '-m', 'src.data_prep'], check=True, cwd='..')
    except CalledProcessError as e:
        raise RuntimeError('Error running src.data_prep; run it from the repo root in a terminal') from e

df_clean = pd.read_csv(clean_path)
# For modeling we use the preprocessing pipeline in src.features (sklearn pipeline).
# For EDA here, work with df_clean directly.
df_clean.head()

Unnamed: 0,UDI,Product ID,Air temperature [K],Process temperature [K],Rotational speed [rpm],Torque [Nm],Tool wear [min],Machine failure,TWF,HDF,...,OSF,RNF,Temp_diff,Torque_per_rpm,Type_H,Type_L,Type_M,Tool_state_Medio,Tool_state_Nuevo,Tool_state_Viejo
0,1.0,M14860,-0.952389,-0.94736,0.068185,0.2822,-1.695984,0.0,0.0,0.0,...,0.0,0.0,0.498849,0.079191,0.0,0.0,1.0,0.0,1.0,0.0
1,2.0,L47181,-0.902393,-0.879959,-0.729472,0.633308,-1.648852,0.0,0.0,0.0,...,0.0,0.0,0.498849,0.677574,0.0,1.0,0.0,0.0,1.0,0.0
2,3.0,L47182,-0.952389,-1.014761,-0.22745,0.94429,-1.61743,0.0,0.0,0.0,...,0.0,0.0,0.398954,0.688185,0.0,1.0,0.0,0.0,1.0,0.0
3,4.0,L47183,-0.902393,-0.94736,-0.590021,-0.048845,-1.586009,0.0,0.0,0.0,...,0.0,0.0,0.398954,0.075735,0.0,1.0,0.0,0.0,1.0,0.0
4,5.0,L47184,-0.902393,-0.879959,-0.729472,0.001313,-1.554588,0.0,0.0,0.0,...,0.0,0.0,0.498849,0.171294,0.0,1.0,0.0,0.0,1.0,0.0


In [None]:
from sklearn.model_selection import train_test_split

# Use the cleaned dataframe produced by src.data_prep for splitting/EDA
df_with_labels = df_clean.copy()
target_cols = ['Machine failure', 'TWF', 'HDF', 'PWF', 'OSF', 'RNF']
failure_cols = ['TWF', 'HDF', 'PWF', 'OSF', 'RNF']
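df_with_labels['Failure_type'] = 'No failure'
# Collapse the five flags into one label; if several flags are set on a row,
# the last one in failure_cols wins.
for col in failure_cols:
    df_with_labels.loc[df_with_labels[col] == 1, 'Failure_type'] = col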

X = df_with_labels.drop(columns=['UDI', 'Product ID'] + target_cols + ['Failure_type'])
y_bin = df_with_labels['Machine failure'].astype(int)
y_multi = df_with_labels['Failure_type']

X_train_bin, X_test_bin, y_train_bin, y_test_bin = train_test_split(
    X, y_bin, test_size=0.2, stratify=y_bin, random_state=42)
X_train_multi, X_test_multi, y_train_multi, y_test_multi = train_test_split(
    X, y_multi, test_size=0.2, stratify=y_multi, random_state=42)

def describe_split(y_train, y_test, label):
    print(f'{label}: train {len(y_train)} | test {len(y_test)}')
    print('  train distribution:')
    print(y_train.value_counts().to_frame('count'))
    print('  test  distribution:')
    print(y_test.value_counts().to_frame('count'))
    print()

describe_split(y_train_bin, y_test_bin, 'Binary (Machine failure)')
describe_split(y_train_multi, y_test_multi, 'Multiclass (Failure_type)')

Binary (Machine failure): train 8000 | test 2000
  train distribution:
                 count
Machine failure       
0                 7729
1                  271
  test  distribution:
                 count
Machine failure       
0                 1932
1                   68

Multiclass (Failure_type): train 8000 | test 2000
  train distribution:
              count
Failure_type       
No failure     7722
HDF              85
OSF              78
PWF              66
TWF              34
RNF              15
  test  distribution:
              count
Failure_type       
No failure     1930
HDF              21
OSF              20
PWF              17
TWF               8
RNF               4
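
The counts above show that stratification kept the failure rate nearly identical in both splits: 271/8000 ≈ 3.39% in train and 68/2000 = 3.40% in test. A small sketch of that check follows. The helper `class_rates` is hypothetical, and the demo series are rebuilt from the printed binary counts instead of reusing the notebook variables.

```python
import pandas as pd

def class_rates(y_train: pd.Series, y_test: pd.Series) -> pd.DataFrame:
    """Class proportions in train and test, side by side."""
    return pd.DataFrame({
        'train': y_train.value_counts(normalize=True),
        'test': y_test.value_counts(normalize=True),
    }).fillna(0.0)  # a class absent from one split gets proportion 0

# Label series rebuilt from the binary counts printed above
y_train_demo = pd.Series([0] * 7729 + [1] * 271, name='Machine failure')
y_test_demo = pd.Series([0] * 1932 + [1] * 68, name='Machine failure')

rates = class_rates(y_train_demo, y_test_demo)
print((rates * 100).round(2))
```

In the notebook itself, `class_rates(y_train_bin, y_test_bin)` or `class_rates(y_train_multi, y_test_multi)` would be the equivalent calls. With only 4 `RNF` rows in the multiclass test set, per-class metrics for that class will be very noisy.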



## Training (NOT in the notebook)

Training and pipeline construction are handled by the `src.train` script, which avoids duplicating logic and prevents data leakage.
You can run training from a terminal with:
```
python -m src.train --target binary --model rf
```
To launch training from the notebook instead, you can use a shell-command cell (it takes longer):
```
!python -m src.train --target binary --model rf
```
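
To illustrate why keeping preprocessing inside a pipeline prevents leakage, here is a minimal sketch on synthetic data. It is not the implementation in `src.train`. The step names and the synthetic features are made up, and reading `--model rf` as a random forest is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: three numeric features and an imbalanced binary target
rng = np.random.default_rng(42)
X = rng.normal(loc=300.0, scale=2.0, size=(500, 3))
y = (X[:, 0] + rng.normal(scale=1.0, size=500) > 302.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# The scaler lives inside the pipeline, so fit() computes its statistics
# from the training split only; the test split is merely transformed.
model = Pipeline([
    ('scaler', StandardScaler()),
    ('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
])
model.fit(X_train, y_train)

scaler_mean = model.named_steps['scaler'].mean_
fitted_on_train_only = bool(np.allclose(scaler_mean, X_train.mean(axis=0)))
test_accuracy = model.score(X_test, y_test)
print('scaler fitted on train only:', fitted_on_train_only)
print('test accuracy:', test_accuracy)
```

Scaling the full dataset before splitting would instead let test-set statistics influence the features the model is trained on.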