# EMNIST Preprocessing

- Reshape each 784-pixel row into a 28x28 matrix
- Normalization (invert polarity and scale to [0, 1])
- Binarization with an Otsu threshold
- Extraction of the ground truth as plain text (using the EMNIST mapping)
- Use of the original split: 697,932 train / 116,323 test
- Feature extraction: 52 features (Valentin)
- PCA from 52 down to 35-40 dimensions (class imbalance compensated with class weights)
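
The last two items in the list (PCA and class weighting) do not appear in the cells shown here. As a rough sketch of what that stage could look like, the block below projects a placeholder feature matrix onto 35 principal components with plain NumPy and computes "balanced" class weights as `n_samples / (n_classes * class_count)`. The names `pca_fit_transform`, `balanced_class_weights`, `F` and `y_demo` are hypothetical, and the random data only stands in for the real 52 features.

```python
import numpy as np

def pca_fit_transform(F, n_components):
    """Project F (N, d) onto its first n_components principal axes."""
    Fc = F - F.mean(axis=0)  # center each feature
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, s, Vt = np.linalg.svd(Fc, full_matrices=False)
    explained = (s ** 2) / np.sum(s ** 2)  # explained variance ratio per axis
    return Fc @ Vt[:n_components].T, explained[:n_components]

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

rng = np.random.default_rng(0)
F = rng.normal(size=(1000, 52))                        # placeholder for the 52 features
y_demo = rng.choice(3, size=1000, p=[0.7, 0.2, 0.1])   # imbalanced toy labels

Z, explained = pca_fit_transform(F, n_components=35)
class_weights = balanced_class_weights(y_demo)

print(Z.shape)            # (1000, 35)
print(explained.sum())    # fraction of variance kept by 35 components
print(class_weights)      # rarer classes receive larger weights
```

The number of components (35 here) would normally be chosen from the cumulative explained variance on the real features rather than fixed in advance.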


## Imports

In [6]:
import numpy as np
import pandas as pd
from pathlib import Path
from skimage.filters import threshold_otsu

In [2]:
train_path = "../data/emnist-byclass-train.csv"
test_path = "../data/emnist-byclass-test.csv"

## Loading the EMNIST dataset
Each CSV file has **785 columns**:
- Column 0: label (numeric)
- Columns 1..784: grayscale pixel values in [0, 255]

In [4]:
df_train = pd.read_csv(train_path, header=None)
df_test  = pd.read_csv(test_path, header=None)

print('Train shape:', df_train.shape)
print('Test shape :', df_test.shape)
df_train.head()

Train shape: (697932, 785)
Test shape : (116323, 785)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,775,776,777,778,779,780,781,782,783,784
0,35,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,36,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,6,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,22,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Preprocessing function

This function performs:
1. Separation of labels (ground truth) and pixels
2. Reshape to 28x28 images
3. Polarity inversion and normalization to [0, 1]
4. Per-image Otsu binarization
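
To make steps 3 and 4 concrete, here is a minimal NumPy sketch of Otsu's method applied to one tiny image. The notebook itself relies on `skimage.filters.threshold_otsu`. The helper `otsu_threshold` below is a hypothetical re-implementation for illustration only and may differ from the library in binning details. The 4x4 `pixels` array is made-up data.

```python
import numpy as np

def otsu_threshold(values, nbins=256):
    """Return the threshold that maximizes the between-class variance."""
    values = np.asarray(values, dtype=np.float64).ravel()
    hist, bin_edges = np.histogram(values, bins=nbins)
    centers = (bin_edges[:-1] + bin_edges[1:]) / 2.0
    hist = hist.astype(np.float64)

    # Pixel counts of the low class (bins 0..k) and the high class (bins k..end).
    w_low = np.cumsum(hist)
    w_high = np.cumsum(hist[::-1])[::-1]

    # Mean intensity of each class for every candidate position.
    mean_low = np.cumsum(hist * centers) / np.maximum(w_low, 1e-12)
    mean_high = (np.cumsum((hist * centers)[::-1])
                 / np.maximum(w_high[::-1], 1e-12))[::-1]

    # Between-class variance for a split between bin k and bin k + 1.
    between = w_low[:-1] * w_high[1:] * (mean_low[:-1] - mean_high[1:]) ** 2
    return centers[np.argmax(between)]

# Toy image: bright strokes (about 200-255) on a dark background (about 0-10).
pixels = np.array([[0,  0, 200, 255],
                   [0, 10, 220, 255],
                   [0,  0, 210, 250],
                   [0,  5,   0,   0]], dtype=np.uint8)

inverted = (255 - pixels).astype(np.float32) / 255.0   # step 3: invert and scale
thresh = otsu_threshold(inverted)                      # step 4: one threshold per image
binary = (inverted >= thresh).astype(np.uint8)
print(binary)
```

With this toy input, inversion followed by `>= thresh` marks the background pixels as 1 and the stroke pixels as 0, which is worth keeping in mind when interpreting the binarized arrays.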

In [7]:
def preprocess_emnist(df):
    """Preprocess an EMNIST (ByClass) DataFrame.

    Parameters
    ----------
    df : pd.DataFrame
        DataFrame with shape (N, 785). Column 0 = label.

    Returns
    -------
    X_bin_flat : np.ndarray, shape (N, 784)
        Binarized images (0/1), flattened.
    X_bin_img : np.ndarray, shape (N, 28, 28)
        Binarized images (0/1) in 2D format.
    y : np.ndarray, shape (N,)
        Integer labels.
    """
    # 1. Split labels and pixels
    y = df.iloc[:, 0].astype(np.int64).values
    X = df.iloc[:, 1:].astype(np.uint8).values  # [0, 255]

    # 2. Reshape to (N, 28, 28)
    X_img = X.reshape(-1, 28, 28)

    # 3. Invert polarity and normalize
    #    (this assumes a white background with dark ink; the zero-valued
    #    border pixels in df_train.head() suggest the raw background may be
    #    dark instead, so verify the polarity before relying on this step)
    X_inverted = 255 - X_img
    X_norm = X_inverted.astype(np.float32) / 255.0  # [0, 1]

    # 4. Per-image Otsu binarization
    N = X_norm.shape[0]
    X_bin_img = np.empty_like(X_norm, dtype=np.uint8)

    for i in range(N):
        # Otsu threshold for image i
        thresh = threshold_otsu(X_norm[i])
        X_bin_img[i] = (X_norm[i] >= thresh).astype(np.uint8)

    # Flatten to (N, 784) for models that expect vectors
    X_bin_flat = X_bin_img.reshape(N, -1)

    return X_bin_flat, X_bin_img, y

## Run preprocessing on train and test

In [9]:
X_train_flat, X_train_img, y_train = preprocess_emnist(df_train)
X_test_flat,  X_test_img,  y_test  = preprocess_emnist(df_test)

print('Binarized train:', X_train_flat.shape, y_train.shape)
print('Binarized test :', X_test_flat.shape,  y_test.shape)

Binarized train: (697932, 784) (697932,)
Binarized test : (116323, 784) (116323,)


In [11]:
# Path to the EMNIST label -> ASCII mapping file.
# NOTE: the original cell never defined MAPPING_TXT (it failed with
# "NameError: name 'MAPPING_TXT' is not defined"). The file name below is an
# assumption modeled on the CSV paths; adjust it to the real location.
MAPPING_TXT = Path("../data/emnist-byclass-mapping.txt")

if MAPPING_TXT.exists():
    mapping_df = pd.read_csv(
        MAPPING_TXT,
        sep=' ',
        header=None,
        names=['label', 'ascii']
    )
    mapping_df['char'] = mapping_df['ascii'].apply(lambda x: chr(int(x)))
    label_to_char = dict(zip(mapping_df['label'], mapping_df['char']))

    # Convert y_train, y_test to text
    y_train_text = np.array([label_to_char[int(lbl)] for lbl in y_train])
    y_test_text  = np.array([label_to_char[int(lbl)] for lbl in y_test])

    print('Example numeric labels → text:')
    print(list(zip(y_train[:10], y_train_text[:10])))
else:
    print('WARNING: mapping file not found:', MAPPING_TXT)
    print('Only the numeric labels (y_train, y_test) will be available.')
    y_train_text = None
    y_test_text = None

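
The mapping is read as space-separated `label ascii` pairs, and each ASCII code is turned into a character with `chr`. The sketch below shows that conversion on three sample lines using only the standard library. The sample values and the name `label_to_char_demo` are illustrative and are not read from the real mapping file.

```python
import io

# Three sample lines in the "<label> <ascii code>" format.
sample_mapping = io.StringIO("0 48\n10 65\n36 97\n")

label_to_char_demo = {}
for line in sample_mapping:
    label, ascii_code = line.split()
    label_to_char_demo[int(label)] = chr(int(ascii_code))

print(label_to_char_demo)  # {0: '0', 10: 'A', 36: 'a'}
```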

In [None]:
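# Use a Path so that .mkdir() and the "/" operator work (a plain str has neither).
OUT_DIR = Path("/output")
OUT_DIR.mkdir(parents=True, exist_ok=True)

# Note: if y_train_text / y_test_text are None (mapping file not found), NumPy
# stores them as pickled object arrays, and np.load then needs allow_pickle=True.
np.savez_compressed(
    OUT_DIR / 'emnist_byclass_train_preprocessed.npz',
    X_flat=X_train_flat,
    X_img=X_train_img,
    y=y_train,
    y_text=y_train_text,
)

np.savez_compressed(
    OUT_DIR / 'emnist_byclass_test_preprocessed.npz',
    X_flat=X_test_flat,
    X_img=X_test_img,
    y=y_test,
    y_text=y_test_text,
)

print('Files saved to:', OUT_DIR)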
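
For reference, a saved `.npz` archive can be read back with `np.load`, which exposes each array under the keyword used when saving. The following sketch does a round trip in a temporary directory with small made-up arrays. The file name `demo_preprocessed.npz` and the `*_demo` variables are hypothetical.

```python
import tempfile
from pathlib import Path

import numpy as np

# Small stand-ins for the real arrays (4 random binary 28x28 images).
X_demo = np.random.default_rng(0).integers(0, 2, size=(4, 28, 28), dtype=np.uint8)
y_num_demo = np.array([35, 36, 6, 3])
y_text_demo = np.array(['Z', 'a', '6', '3'])   # toy text labels

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / 'demo_preprocessed.npz'
    np.savez_compressed(
        out_path,
        X_flat=X_demo.reshape(4, -1),
        X_img=X_demo,
        y=y_num_demo,
        y_text=y_text_demo,
    )

    # np.load returns a lazy NpzFile; read the arrays before the file is closed.
    with np.load(out_path) as data:
        keys = sorted(data.files)
        X_img_loaded = data['X_img']
        y_text_loaded = data['y_text']

print(keys)
print(X_img_loaded.shape, y_text_loaded)
```

String arrays such as `y_text_demo` load without pickling. Only object arrays (for example a stored `None`) require `allow_pickle=True`.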


In [None]:
# Show binarized images
import matplotlib.pyplot as plt

n_samples = 5
fig, axes = plt.subplots(1, n_samples, figsize=(10, 3))
for i, ax in enumerate(axes):
    ax.imshow(X_train_img[i], cmap='gray')
    title = str(y_train[i])
    if 'y_train_text' in globals() and y_train_text is not None:
        title += f" ({y_train_text[i]})"
    ax.set_title(title)
    ax.axis('off')
plt.tight_layout()
plt.show()