**Assignment**

1. Remember to add your name to the title of the notebook
2. The goal is to explore models that underfit and overfit, and to deal with overfitting by using the techniques seen in class.


In [None]:
# Import needed libraries
import numpy as np
import sys, os, pdb
import pandas as pd
from matplotlib import pyplot as plt


Data:

The data consists of the gene expression profiles of several cells (coming from a patient).

A training dataset and a test dataset are already provided to you.

Each is organized as a matrix of cells x genes.

Given a cell, the goal is to predict the correct cell type based on the gene expression values for that sample.

In [None]:
# Change the working directory to DL-HW-1 (assumes `os` was already imported in an earlier cell)
target_dir = "DL-HW-1"

if os.path.isdir(target_dir):
    os.chdir(target_dir)
    print(f"Changed directory to: {os.getcwd()}")
    print("Directory contents:", os.listdir('.'))
else:
    raise FileNotFoundError(f"Directory '{target_dir}' does not exist. Current path: {os.getcwd()}")

In [None]:
# Load data

# Path to source batch
train_path = "train.pkl"
# Path to target batch
test_path = "test.pkl"
# Column containing cell-types
lname = "labels" 

train_batch = pd.read_pickle(train_path)
test_batch = pd.read_pickle(test_path)

In [None]:
train_batch

In [None]:
# Extract the common genes so that we can use the same network for both batches

common_genes = list(set(train_batch.columns).intersection(set(test_batch.columns)))
common_genes.sort()
train_batch = train_batch[list(common_genes)]
test_batch = test_batch[list(common_genes)]

train_mat = train_batch.drop(lname, axis=1)
train_labels = train_batch[lname]

test_mat = test_batch.drop(lname, axis=1)
test_labels = test_batch[lname]

# values are already normalized (ignore this)
mat = train_mat.values
mat_round = np.rint(mat)
error = np.mean(np.abs(mat - mat_round))


In [None]:
train_labels.unique()

## 2. Theoretical Questions (Q1-Q6)

### Q1: What type of problem are we solving?

**Answer:** A multiclass classification problem. Given the gene expression profile of a cell, predict its cell type.

### Q2: What is the input size (number of features)?

**Answer:** The number of genes common to the train and test sets (computed in the previous cells), that is, `train_mat.shape[1]`.

### Q3: How many neurons should the last layer have?

**Answer:** One neuron per class (cell type).

### Q4: Which activation function is most appropriate for the last layer?

**Answer:** Softmax, so that the outputs are probabilities that sum to 1.
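
As a small illustration of the answer to Q4, here is a minimal NumPy sketch of softmax. The `softmax` helper and the example logits are assumptions made for this illustration; they are not part of the notebook's models, which rely on the Keras `softmax` activation.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    # Subtracting the max does not change the result but avoids overflow in exp.
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

# Hypothetical raw scores (logits) for one cell over three cell types.
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)

print("probabilities:", probs)
print("sum:", probs.sum())
```

The largest logit receives the largest probability, and the outputs are all positive and sum to 1, which is what lets us read them as class probabilities.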


In [None]:
# Process labels: convert to integers and then to one-hot vectors
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

label_encoder = LabelEncoder()
train_labels_int = label_encoder.fit_transform(train_labels)
test_labels_int = label_encoder.transform(test_labels)

num_classes = len(label_encoder.classes_)
n_features = train_mat.shape[1]

train_labels_onehot = to_categorical(train_labels_int, num_classes)
test_labels_onehot = to_categorical(test_labels_int, num_classes)

print(f"Number of classes: {num_classes}")
print(f"Number of features: {n_features}")
print(f"Mapping: {dict(zip(label_encoder.classes_, range(num_classes)))}")


### Q5: How were the labels modified?

**Answer:** They were converted from strings to integers (LabelEncoder) and then one-hot encoded for use with categorical_crossentropy.

### Q6: Which loss function will be used?

**Answer:** Categorical cross-entropy, the standard loss for multiclass classification with one-hot encoded labels.
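
To make Q5 and Q6 concrete, the sketch below one-hot encodes two integer labels and evaluates the cross-entropy formula, the mean over samples of $-\sum_c y_c \log p_c$, by hand. The helpers `one_hot` and `categorical_crossentropy` and the toy predictions are assumptions for illustration; this follows the textbook formula and is not a copy of the Keras implementation.

```python
import numpy as np

def one_hot(labels_int, num_classes):
    """Turn integer class ids into one-hot rows."""
    return np.eye(num_classes)[labels_int]

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over samples of -sum(y_true * log(y_pred))."""
    # Clip to avoid log(0) when a predicted probability is exactly zero.
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

# Hypothetical example: two cells, three cell types.
y_true = one_hot(np.array([0, 2]), num_classes=3)
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6]])

loss = categorical_crossentropy(y_true, y_pred)
print(y_true)
print(f"loss = {loss:.4f}")
```

Because each one-hot row has a single 1, only the probability assigned to the true class contributes to the loss for that sample.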


## 3. Model Training

We will train 3 models without regularization:
1. Underfit: too few layers/neurons
2. OK: a reasonable architecture
3. Overfit: many layers/neurons

We will then apply regularization (L2, Dropout) to the model that overfits.
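
One simple way to compare the three regimes is the gap between final train and validation accuracy. The sketch below assumes a hypothetical helper, `summarize_history`, that takes a dictionary shaped like the `history.history` attribute returned by Keras `fit`. The numbers are made up for illustration and are not results from this notebook.

```python
def summarize_history(history_dict):
    """Return final train accuracy, final validation accuracy, and their gap."""
    train_acc = history_dict["accuracy"][-1]
    val_acc = history_dict["val_accuracy"][-1]
    return {"train_acc": train_acc, "val_acc": val_acc, "gap": train_acc - val_acc}

# Made-up per-epoch accuracies, for illustration only.
fake_history = {
    "accuracy": [0.60, 0.85, 0.99],
    "val_accuracy": [0.58, 0.75, 0.78],
}

summary = summarize_history(fake_history)
print(summary)
```

With a real run, one would pass `history.history` instead of `fake_history`. A large positive gap suggests overfitting, while low accuracy on both sets suggests underfitting.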


In [None]:
# Prepare data
from sklearn.model_selection import train_test_split

X_train_full = train_mat.values.astype('float32')
X_test = test_mat.values.astype('float32')
y_train_full = train_labels_onehot
y_test = test_labels_onehot

# Split train/validation
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.2, random_state=42,
    stratify=np.argmax(y_train_full, axis=1)
)

print(f"X_train: {X_train.shape}, X_val: {X_val.shape}, X_test: {X_test.shape}")


In [None]:
# Import TensorFlow/Keras
import tensorflow as tf
from tensorflow.keras import models, layers, regularizers

tf.random.set_seed(42)
print(f"TensorFlow version: {tf.__version__}")


### Model 1: Underfit


In [None]:
# Model with insufficient capacity
def create_underfit_model():
    model = models.Sequential([
        layers.Dense(16, activation='relu', input_shape=(n_features,)),
        layers.Dense(num_classes, activation='softmax')
    ], name='Underfit')
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model

model_underfit = create_underfit_model()
model_underfit.summary()


In [None]:
# Train the underfit model
history_underfit = model_underfit.fit(
    X_train, y_train, epochs=50, batch_size=32,
    validation_data=(X_val, y_val), verbose=0
)

print(f"Train acc: {history_underfit.history['accuracy'][-1]:.4f}")
print(f"Val acc: {history_underfit.history['val_accuracy'][-1]:.4f}")


### Model 2: Well Fitted


In [None]:
# Model with adequate capacity
def create_ok_model():
    model = models.Sequential([
        layers.Dense(128, activation='relu', input_shape=(n_features,)),
        layers.Dense(64, activation='relu'),
        layers.Dense(32, activation='relu'),
        layers.Dense(num_classes, activation='softmax')
    ], name='OK')
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model

model_ok = create_ok_model()
model_ok.summary()


In [None]:
# Train the OK model
history_ok = model_ok.fit(
    X_train, y_train, epochs=50, batch_size=32,
    validation_data=(X_val, y_val), verbose=0
)

print(f"Train acc: {history_ok.history['accuracy'][-1]:.4f}")
print(f"Val acc: {history_ok.history['val_accuracy'][-1]:.4f}")


### Model 3: Overfit


In [None]:
# Model with excessive capacity
def create_overfit_model():
    model = models.Sequential([
        layers.Dense(512, activation='relu', input_shape=(n_features,)),
        layers.Dense(512, activation='relu'),
        layers.Dense(256, activation='relu'),
        layers.Dense(256, activation='relu'),
        layers.Dense(128, activation='relu'),
        layers.Dense(num_classes, activation='softmax')
    ], name='Overfit')
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model

model_overfit = create_overfit_model()
model_overfit.summary()


In [None]:
# Train the overfit model
history_overfit = model_overfit.fit(
    X_train, y_train, epochs=50, batch_size=32,
    validation_data=(X_val, y_val), verbose=0
)

print(f"Train acc: {history_overfit.history['accuracy'][-1]:.4f}")
print(f"Val acc: {history_overfit.history['val_accuracy'][-1]:.4f}")
print(f"Gap: {history_overfit.history['accuracy'][-1] - history_overfit.history['val_accuracy'][-1]:.4f}")


### Visualization of Results


In [None]:
# Plot training curves (loss and accuracy for train and validation)
def plot_history(history, title):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
    ax1.plot(history.history['loss'], label='Train')
    ax1.plot(history.history['val_loss'], label='Val')
    ax1.set_title(f'{title} - Loss')
    ax1.set_xlabel('Epoch')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    
    ax2.plot(history.history['accuracy'], label='Train')
    ax2.plot(history.history['val_accuracy'], label='Val')
    ax2.set_title(f'{title} - Accuracy')
    ax2.set_xlabel('Epoch')
    ax2.legend()
    ax2.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

plot_history(history_underfit, 'Underfit Model')
plot_history(history_ok, 'OK Model')
plot_history(history_overfit, 'Overfit Model')


## 4. Regularization

We will apply L2 and Dropout to the model that overfits.


### L2 Regularization
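
Before building the model, it may help to see what the L2 term adds to the loss. In Keras, `regularizers.l2(l2_lambda)` contributes `l2_lambda * sum(w**2)` for each regularized kernel. The NumPy sketch below reproduces that arithmetic on a small weight matrix; the helper `l2_penalty` and the matrix `W` are hypothetical and used only for illustration.

```python
import numpy as np

def l2_penalty(weights, l2_lambda):
    """L2 penalty: l2_lambda times the sum of squared weights."""
    return l2_lambda * float(np.sum(np.square(weights)))

# Hypothetical 2x2 weight matrix for illustration.
W = np.array([[0.5, -1.0],
              [2.0,  0.0]])

penalty = l2_penalty(W, l2_lambda=0.001)
print(f"sum of squares = {np.sum(np.square(W))}")
print(f"penalty = {penalty:.6f}")
```

The penalty grows quadratically with each weight, so the single weight of 2.0 accounts for most of it. This is why minimizing the regularized loss pushes large weights toward smaller values.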


In [None]:
# Model with L2 regularization on every hidden layer
def create_l2_model(l2_lambda=0.001):
    model = models.Sequential([
        layers.Dense(512, activation='relu', kernel_regularizer=regularizers.l2(l2_lambda), input_shape=(n_features,)),
        layers.Dense(512, activation='relu', kernel_regularizer=regularizers.l2(l2_lambda)),
        layers.Dense(256, activation='relu', kernel_regularizer=regularizers.l2(l2_lambda)),
        layers.Dense(256, activation='relu', kernel_regularizer=regularizers.l2(l2_lambda)),
        layers.Dense(128, activation='relu', kernel_regularizer=regularizers.l2(l2_lambda)),
        layers.Dense(num_classes, activation='softmax')
    ], name=f'L2_{l2_lambda}')
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model

model_l2 = create_l2_model(0.001)
history_l2 = model_l2.fit(
    X_train, y_train, epochs=50, batch_size=32,
    validation_data=(X_val, y_val), verbose=0
)

print(f"L2 Model - Train acc: {history_l2.history['accuracy'][-1]:.4f}")
print(f"L2 Model - Val acc: {history_l2.history['val_accuracy'][-1]:.4f}")
print(f"L2 Model - Gap: {history_l2.history['accuracy'][-1] - history_l2.history['val_accuracy'][-1]:.4f}")


### Dropout Regularization
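
As a rough picture of what a `Dropout` layer does during training, the NumPy sketch below zeroes each activation with probability `rate` and rescales the survivors by `1 / (1 - rate)` (so-called inverted dropout), so the expected activation is unchanged. The helper `inverted_dropout` and the all-ones input are assumptions for illustration. At inference time the Keras layer simply passes its input through.

```python
import numpy as np

def inverted_dropout(activations, rate, rng):
    """Zero each unit with probability `rate`; rescale the rest by 1/(1-rate)."""
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)

# Hypothetical activations: 1000 samples, 64 units, all equal to 1.
activations = np.ones((1000, 64))
dropped = inverted_dropout(activations, rate=0.5, rng=rng)

print(f"fraction zeroed: {np.mean(dropped == 0):.3f}")
print(f"mean activation: {dropped.mean():.3f}")
```

With `rate=0.5`, roughly half of the units are zeroed and the remaining ones are doubled, which keeps the mean activation close to its original value of 1.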


In [None]:
# Model with Dropout after every hidden layer
def create_dropout_model(dropout_rate=0.5):
    model = models.Sequential([
        layers.Dense(512, activation='relu', input_shape=(n_features,)),
        layers.Dropout(dropout_rate),
        layers.Dense(512, activation='relu'),
        layers.Dropout(dropout_rate),
        layers.Dense(256, activation='relu'),
        layers.Dropout(dropout_rate),
        layers.Dense(256, activation='relu'),
        layers.Dropout(dropout_rate),
        layers.Dense(128, activation='relu'),
        layers.Dropout(dropout_rate),
        layers.Dense(num_classes, activation='softmax')
    ], name=f'Dropout_{dropout_rate}')
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model

model_dropout = create_dropout_model(0.5)
history_dropout = model_dropout.fit(
    X_train, y_train, epochs=50, batch_size=32,
    validation_data=(X_val, y_val), verbose=0
)

print(f"Dropout Model - Train acc: {history_dropout.history['accuracy'][-1]:.4f}")
print(f"Dropout Model - Val acc: {history_dropout.history['val_accuracy'][-1]:.4f}")
print(f"Dropout Model - Gap: {history_dropout.history['accuracy'][-1] - history_dropout.history['val_accuracy'][-1]:.4f}")


In [None]:
# Regularization comparison
plot_history(history_l2, 'L2 Regularization')
plot_history(history_dropout, 'Dropout Regularization')

print("\nComparison:")
print(f"No reg  - Gap: {history_overfit.history['accuracy'][-1] - history_overfit.history['val_accuracy'][-1]:.4f}")
print(f"L2      - Gap: {history_l2.history['accuracy'][-1] - history_l2.history['val_accuracy'][-1]:.4f}")
print(f"Dropout - Gap: {history_dropout.history['accuracy'][-1] - history_dropout.history['val_accuracy'][-1]:.4f}")


## 5. Evaluation on the Test Set

We will evaluate the best model on the test set.
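
One possible way to decide which model counts as "best" is to compare final validation accuracies and only then touch the test set, so that the test data plays no part in model selection. The sketch below assumes a hypothetical helper `pick_best` and made-up scores; in this notebook the real values would come from each model's `history.history['val_accuracy'][-1]`.

```python
def pick_best(val_scores):
    """Return the name of the model with the highest validation accuracy."""
    return max(val_scores, key=val_scores.get)

# Made-up validation accuracies, for illustration only.
val_scores = {"underfit": 0.70, "ok": 0.90, "overfit": 0.82}

best_name = pick_best(val_scores)
print(f"Best model by validation accuracy: {best_name}")
```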


In [None]:
# Evaluate on the test set (using the OK model as an example)
test_loss, test_acc = model_ok.evaluate(X_test, y_test, verbose=0)
print(f"Test Loss: {test_loss:.4f}")
print(f"Test Accuracy: {test_acc:.4f}")


In [None]:
# Confusion matrix
import seaborn as sns  # required by sns.heatmap below; not imported in any earlier cell
from sklearn.metrics import confusion_matrix, classification_report

y_pred = model_ok.predict(X_test)
y_pred_classes = np.argmax(y_pred, axis=1)
y_test_classes = np.argmax(y_test, axis=1)

cm = confusion_matrix(y_test_classes, y_pred_classes)

plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=label_encoder.classes_,
            yticklabels=label_encoder.classes_)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.tight_layout()
plt.show()

print("\nClassification Report:")
print(classification_report(y_test_classes, y_pred_classes, 
                          target_names=label_encoder.classes_))


## 6. Conclusions

### Summary of Results:

1. **Underfit model**: Has very little capacity (a single hidden layer with only 16 neurons). It cannot learn the complex patterns. Low performance on both train and validation.

2. **OK model**: Balanced architecture (3 hidden layers with 128-64-32 neurons). Good performance on train and validation with a reasonable gap.

3. **Overfit model**: Many layers and neurons (512-512-256-256-128). High performance on train but low on validation. The large gap indicates overfitting.

### Regularization:

- **L2**: Penalizes large weights. Reduces overfitting by pushing the weights toward small values.
- **Dropout**: Randomly deactivates neurons during training. Prevents co-adaptation and improves generalization.

Both techniques reduce the train-validation gap, improving the model's ability to generalize.

### Future Improvements:

- More training data
- Feature selection/engineering
- Try other architectures (ResNet, attention)
- Model ensembles
- Data augmentation specific to genomics
