
# Pipelines Practice - Titanic Dataset

In the following dataset we will try to classify Titanic survivors based on their features.

## Part 1

Process the dataset: using pipelines, drop the columns you consider unnecessary and apply imputation and/or scaling where appropriate, as your knowledge and intuition suggest. Try to classify using KNN.

## Part 2

Reusing as much of the above as possible, evaluate another classifier. Try different transformation alternatives. Determine which option (transformations + classifier) gives the best results.

### "Titanic" Dataset


In [13]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Load the dataset
url = "https://raw.githubusercontent.com/PabloSoligo2014/3670-UNLaM-CD/refs/heads/main/datasets/titanic.csv"
df = pd.read_csv(url)

In [2]:
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [3]:
df.isnull().sum()/len(df)*100

| Column | Missing (%) |
|---|---|
| PassengerId | 0.0 |
| Survived | 0.0 |
| Pclass | 0.0 |
| Name | 0.0 |
| Sex | 0.0 |
| Age | 19.86532 |
| SibSp | 0.0 |
| Parch | 0.0 |
| Ticket | 0.0 |
| Fare | 0.0 |
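
The same percentages can also be computed with `isna().mean()`, because the mean of a boolean mask is the fraction of `True` values. The sketch below uses a made-up frame (`toy` and its values are illustrative assumptions, not Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, used only to illustrate the computation
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Mean of a boolean mask = fraction of missing values per column
missing_pct = toy.isna().mean() * 100
print(missing_pct.sort_values(ascending=False))
```

Sorting in descending order puts the columns that most need imputation (or dropping) at the top.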


In [4]:
# Drop columns that add no useful signal
df = df.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])

In [5]:
# Split predictors and target
X = df.drop("Survived", axis=1)
y = df["Survived"]

In [6]:
# Define numeric and categorical columns
num_cols = ["Age", "Fare", "SibSp", "Parch"]
cat_cols = ["Sex", "Embarked", "Pclass"]

In [7]:
# Pipelines for each data type
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])
cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])
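
To see what the numeric branch does, here is a minimal sketch on a single made-up column (`toy_num` is an assumption for illustration). The imputer learns the median of the observed values during `fit`, fills the gap with it, and the scaler then standardizes the completed column:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical single-column input with one missing value
toy_num = np.array([[1.0], [2.0], [np.nan], [9.0]])

toy_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
scaled = toy_pipeline.fit_transform(toy_num)

# Median learned from the observed values 1, 2 and 9
learned_median = toy_pipeline.named_steps["imputer"].statistics_[0]
print(learned_median)
print(scaled.ravel())
```

Because both statistics are learned in `fit`, calling `transform` on new data reuses the training median and scale instead of recomputing them.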

In [8]:
# Combined preprocessor
preprocessor = ColumnTransformer([
    ("num", num_pipeline, num_cols),
    ("cat", cat_pipeline, cat_cols)
])

In [9]:
# Full pipeline
model_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", KNeighborsClassifier(n_neighbors=5))
])
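
The value `n_neighbors=5` is just a starting point. As a hedged sketch of how it could be tuned, `GridSearchCV` can search over pipeline parameters using the `<step name>__<parameter>` syntax. The data here is synthetic (`X_demo`, `y_demo` and `demo_pipeline` are assumptions so the block runs offline) and the grid values are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data so the sketch runs without downloading anything
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", KNeighborsClassifier()),
])

# "<step name>__<parameter>" addresses a parameter of a pipeline step
param_grid = {"classifier__n_neighbors": [3, 5, 7, 9, 11]}

search = GridSearchCV(demo_pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 4))
```

Since the whole pipeline is refit inside each fold, imputation and scaling statistics are learned only from that fold's training part, which avoids leaking validation data into preprocessing.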

In [10]:
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the model
model_pipeline.fit(X_train, y_train)
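
The split above is random but not stratified, so the share of survivors in the test set can drift from the share in the full data. A possible alternative, sketched on made-up labels (`X_demo` and `y_demo` are illustrative assumptions), is to pass `stratify`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 negatives, 30 positives
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, both parts keep the 30% positive rate
print(y_tr.mean(), y_te.mean())
```

With the notebook's variables this would amount to adding `stratify=y` to the existing call; the reported metrics would then change, since the test set would contain different passengers.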

In [11]:
# Evaluate the model
y_pred = model_pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

# Show results
print(f"KNN model accuracy: {accuracy:.4f}")
print("\nClassification report:")
print(report)

KNN model accuracy: 0.8156

Classification report:
              precision    recall  f1-score   support

           0       0.82      0.88      0.85       105
           1       0.81      0.73      0.77        74

    accuracy                           0.82       179
   macro avg       0.81      0.80      0.81       179
weighted avg       0.82      0.82      0.81       179



In [14]:
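def make_preprocessor(num_imputer_strategy="median", scaler=None):
    # Create the default scaler per call so pipelines never share one estimator instance
    if scaler is None:
        scaler = StandardScaler()
    num_pipeline = Pipeline([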
        ("imputer", SimpleImputer(strategy=num_imputer_strategy)),
        ("scaler", scaler)
    ])
    cat_pipeline = Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore"))
    ])
    return ColumnTransformer([
        ("num", num_pipeline, num_cols),
        ("cat", cat_pipeline, cat_cols)
    ])

# Options to try
options = [
    {
        "name": "KNN + StandardScaler + median",
        "preprocessor": make_preprocessor("median", StandardScaler()),
        "classifier": KNeighborsClassifier(n_neighbors=5)
    },
    {
        "name": "RandomForest + StandardScaler + median",
        "preprocessor": make_preprocessor("median", StandardScaler()),
        "classifier": RandomForestClassifier(random_state=42)
    },
    {
        "name": "LogisticRegression + MinMaxScaler + mean",
        "preprocessor": make_preprocessor("mean", MinMaxScaler()),
        "classifier": LogisticRegression(max_iter=1000)
    },
    {
        "name": "GradientBoosting + StandardScaler + median",
        "preprocessor": make_preprocessor("median", StandardScaler()),
        "classifier": GradientBoostingClassifier(random_state=42)
    }
]
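
Another way to reuse an existing pipeline, offered here as a sketch rather than as the approach this notebook takes, is to copy it with `clone` and replace whole steps by name with `set_params`. The `base_pipeline` and `variant` names are hypothetical:

```python
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

base_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", KNeighborsClassifier(n_neighbors=5)),
])

# clone() returns an unfitted copy; set_params() replaces whole steps by name
variant = clone(base_pipeline).set_params(
    scaler=MinMaxScaler(),
    classifier=LogisticRegression(max_iter=1000),
)

print(type(variant.named_steps["classifier"]).__name__)
print(type(base_pipeline.named_steps["classifier"]).__name__)
```

The original pipeline is left untouched, so each variant can be fitted and compared independently.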

In [15]:
# Fit each option and compare on the held-out test set
results = []

for opt in options:
    pipe = Pipeline([
        ("preprocessor", opt["preprocessor"]),
        ("classifier", opt["classifier"])
    ])
    pipe.fit(X_train, y_train)
    y_pred = pipe.predict(X_test)
    report = classification_report(y_test, y_pred, output_dict=True)
    results.append({
        "name": opt["name"],
        "accuracy": report["accuracy"],
        "f1_score": report["weighted avg"]["f1-score"]
    })

# Show results
print("Comparison of classifiers and transformations:")
for r in results:
    print(f"{r['name']}: Accuracy={r['accuracy']:.4f}, F1={r['f1_score']:.4f}")

Comparison of classifiers and transformations:
KNN + StandardScaler + median: Accuracy=0.8156, F1=0.8140
RandomForest + StandardScaler + median: Accuracy=0.8212, F1=0.8208
LogisticRegression + MinMaxScaler + mean: Accuracy=0.7989, F1=0.7974
GradientBoosting + StandardScaler + median: Accuracy=0.8212, F1=0.8188
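
With 179 test rows, the gap between 0.8156 and 0.8212 corresponds to a single passenger (1/179 ≈ 0.0056), so one split is not enough to rank these options. A hedged sketch of a more stable comparison uses `cross_val_score`; the data below is synthetic and the two candidates are only examples (`X_demo`, `y_demo`, `candidates` and `cv_summary` are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; with the notebook's objects, pass X and y instead
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

candidates = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "RandomForest": RandomForestClassifier(random_state=42),
}

cv_summary = {}
for name, clf in candidates.items():
    pipe = Pipeline([("scaler", StandardScaler()), ("classifier", clf)])
    # Five accuracy scores, one per fold
    scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")
```

Reporting the fold-to-fold standard deviation next to the mean makes it easier to judge whether a difference between two options is larger than the noise.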
