# Class 4: One-Hot Encoding

How do we encode categorical variables?

In [1]:
import pandas as pd

In [13]:
df = pd.read_csv("datos/college_placement.csv")
df.head()

Unnamed: 0,College_ID,IQ,Prev_Sem_Result,CGPA,Academic_Performance,Internship_Experience,Extra_Curricular_Score,Communication_Skills,Projects_Completed,Placement
0,CLG0030,107,6.61,6.28,8,No,8,8,4,No
1,CLG0061,97,5.52,5.37,8,No,7,8,0,No
2,CLG0036,109,5.36,5.83,9,No,3,1,1,No
3,CLG0055,122,5.47,5.75,6,Yes,1,6,1,No
4,CLG0004,96,7.91,7.69,7,No,8,10,2,No


First, I look at the College_ID column. Is it useful for anything? No, it is just an identifier, so I drop it.

In [14]:
df = df.drop(columns=["College_ID"])
df.head()

Unnamed: 0,IQ,Prev_Sem_Result,CGPA,Academic_Performance,Internship_Experience,Extra_Curricular_Score,Communication_Skills,Projects_Completed,Placement
0,107,6.61,6.28,8,No,8,8,4,No
1,97,5.52,5.37,8,No,7,8,0,No
2,109,5.36,5.83,9,No,3,1,1,No
3,122,5.47,5.75,6,Yes,1,6,1,No
4,96,7.91,7.69,7,No,8,10,2,No
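
A quick way to sanity-check a decision like dropping College_ID is to count the distinct values per column. The sketch below rebuilds a few columns of the five rows shown above by hand (the name `sample` and the "never repeats" heuristic are my own assumptions, not part of the original notebook). On the full dataset the counts may look different, for example if the same college appears in several rows.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# A few columns of the five rows printed above, typed in by hand
# so the snippet runs without the CSV file.
sample = pd.DataFrame({
    "College_ID": ["CLG0030", "CLG0061", "CLG0036", "CLG0055", "CLG0004"],
    "IQ": [107, 97, 109, 122, 96],
    "Internship_Experience": ["No", "No", "No", "Yes", "No"],
    "Placement": ["No", "No", "No", "No", "No"],
})

# Number of distinct values in each column.
distinct = sample.nunique()

# Heuristic (an assumption, not a rule): a non-numeric column whose
# values never repeat behaves like a row identifier.
id_like = [
    col for col in sample.columns
    if not is_numeric_dtype(sample[col]) and distinct[col] == len(sample)
]

print(distinct)
print(id_like)
```

With only five rows this is a rough signal at best, so it is worth rerunning the same check on the whole DataFrame before dropping anything.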


Next, I see that the Internship_Experience column is categorical. It has two values: Yes and No. I apply One-Hot Encoding to it.
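
Instead of spotting categorical columns by eye, one option is to list every non-numeric column together with the values it takes. This is a minimal sketch on a hand-typed subset of the rows shown above. The names `sample`, `non_numeric_cols` and `levels` are illustrative and do not come from the original notebook.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Subset of the rows shown above (after dropping College_ID), typed in by hand.
sample = pd.DataFrame({
    "IQ": [107, 97, 109, 122, 96],
    "CGPA": [6.28, 5.37, 5.83, 5.75, 7.69],
    "Internship_Experience": ["No", "No", "No", "Yes", "No"],
    "Placement": ["No", "No", "No", "No", "No"],
})

# Columns that pandas does not treat as numeric are candidates for encoding.
non_numeric_cols = [col for col in sample.columns if not is_numeric_dtype(sample[col])]

# The distinct values of each candidate column, sorted for readability.
levels = {col: sorted(sample[col].unique()) for col in non_numeric_cols}

print(non_numeric_cols)
print(levels)
```

In this sample the check also flags Placement, which holds Yes/No strings just like Internship_Experience.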

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_cols = ['Internship_Experience']

# ColumnTransformer: encodes the categorical columns and passes the rest through
ct = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), categorical_cols)
    ],
    remainder='passthrough'  # keeps the other columns unchanged
)

# Fit the encoder and transform the data
X_transformed = ct.fit_transform(df)

# Get the names of the new columns
feature_names = ct.get_feature_names_out()

# Convert the result back into a DataFrame
encoded_df = pd.DataFrame(X_transformed, columns=feature_names)
encoded_df.head()


Unnamed: 0,cat__Internship_Experience_No,cat__Internship_Experience_Yes,remainder__IQ,remainder__Prev_Sem_Result,remainder__CGPA,remainder__Academic_Performance,remainder__Extra_Curricular_Score,remainder__Communication_Skills,remainder__Projects_Completed,remainder__Placement
0,1.0,0.0,107,6.61,6.28,8,8,8,4,No
1,1.0,0.0,97,5.52,5.37,8,7,8,0,No
2,1.0,0.0,109,5.36,5.83,9,3,1,1,No
3,0.0,1.0,122,5.47,5.75,6,1,6,1,No
4,1.0,0.0,96,7.91,7.69,7,8,10,2,No


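Done! Internship_Experience is now represented by two 0/1 columns, cat__Internship_Experience_No and cat__Internship_Experience_Yes. Not every column is numeric yet, though: remainder__Placement still holds the strings Yes and No, so it still has to be encoded separately.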