**Notes**
- .gz: compressed CSVs with no header, so I will need to provide column names from kddcup.names

In [1]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
adult = fetch_ucirepo(id=2) 
  
# data (as pandas dataframes) 
X = adult.data.features 
y = adult.data.targets 
  
# metadata 
print(adult.metadata) 
  
# variable information 
print(adult.variables) 

{'uci_id': 2, 'name': 'Adult', 'repository_url': 'https://archive.ics.uci.edu/dataset/2/adult', 'data_url': 'https://archive.ics.uci.edu/static/public/2/data.csv', 'abstract': 'Predict whether annual income of an individual exceeds $50K/yr based on census data. Also known as "Census Income" dataset. ', 'area': 'Social Science', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 48842, 'num_features': 14, 'feature_types': ['Categorical', 'Integer'], 'demographics': ['Age', 'Income', 'Education Level', 'Other', 'Race', 'Sex'], 'target_col': ['income'], 'index_col': None, 'has_missing_values': 'yes', 'missing_values_symbol': 'NaN', 'year_of_dataset_creation': 1996, 'last_updated': 'Tue Sep 24 2024', 'dataset_doi': '10.24432/C5XW20', 'creators': ['Barry Becker', 'Ronny Kohavi'], 'intro_paper': None, 'additional_info': {'summary': "Extraction was done by Barry Becker from the 1994 Census database.  A set of reasonably clean records was extracted using the fol

# Data Exploration

In [2]:
X.describe()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
count,48842.0,48842.0,48842.0,48842.0,48842.0,48842.0
mean,38.643585,189664.1,10.078089,1079.067626,87.502314,40.422382
std,13.71051,105604.0,2.570973,7452.019058,403.004552,12.391444
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117550.5,9.0,0.0,0.0,40.0
50%,37.0,178144.5,10.0,0.0,0.0,40.0
75%,48.0,237642.0,12.0,0.0,0.0,45.0
max,90.0,1490400.0,16.0,99999.0,4356.0,99.0


In [3]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       47879 non-null  object
 2   fnlwgt          48842 non-null  int64 
 3   education       48842 non-null  object
 4   education-num   48842 non-null  int64 
 5   marital-status  48842 non-null  object
 6   occupation      47876 non-null  object
 7   relationship    48842 non-null  object
 8   race            48842 non-null  object
 9   sex             48842 non-null  object
 10  capital-gain    48842 non-null  int64 
 11  capital-loss    48842 non-null  int64 
 12  hours-per-week  48842 non-null  int64 
 13  native-country  48568 non-null  object
dtypes: int64(6), object(8)
memory usage: 5.2+ MB


In [4]:
y.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   income  48842 non-null  object
dtypes: object(1)
memory usage: 381.7+ KB


In [5]:
y.describe()

Unnamed: 0,income
count,48842
unique,4
top,<=50K
freq,24720


In [6]:
y['income'].unique()

array(['<=50K', '>50K', '<=50K.', '>50K.'], dtype=object)

The labels appear with and without a trailing period, so '>50K' and '>50K.' both map to 1, and '<=50K' and '<=50K.' both map to 0.

In [7]:
for idx, value in y['income'].items():
    if value in ['<=50K', '<=50K.']:
        y.at[idx, 'income'] = 0
    else:
        y.at[idx, 'income'] = 1

In [8]:
y['income'].unique()

array([0, 1], dtype=object)

In [9]:
y['income'] = y['income'].astype(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  y['income'] = y['income'].astype(int)


In [10]:
y['income'].unique()

array([0, 1])
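The row-by-row loop works, but it leaves the column as `object` dtype, which is why the separate `astype(int)` cell is needed. A vectorized alternative is sketched below on a small stand-in frame (`y_demo` and its values are illustrative, not the notebook's data). Calling `.copy()` first is the usual way to avoid the SettingWithCopyWarning shown above.

```python
import pandas as pd

# Stand-in for y with the four raw label spellings seen above (illustrative data).
y_demo = pd.DataFrame({"income": ["<=50K", ">50K", "<=50K.", ">50K."]})

# Work on an explicit copy so the assignment below targets a real DataFrame.
y_demo = y_demo.copy()

# Strip the trailing period, then map the two remaining labels to 0/1.
y_demo["income"] = (
    y_demo["income"]
    .str.rstrip(".")
    .map({"<=50K": 0, ">50K": 1})
    .astype("int64")
)

print(y_demo["income"].tolist())
```

Any label outside the mapping would become NaN and make the `astype("int64")` call fail, which is a useful safeguard against unexpected spellings.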

In [11]:
y.describe()

Unnamed: 0,income
count,48842.0
mean,0.239282
std,0.426649
min,0.0
25%,0.0
50%,0.0
75%,0.0
max,1.0


In [12]:
y.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   income  48842 non-null  int64
dtypes: int64(1)
memory usage: 381.7 KB


In [13]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       47879 non-null  object
 2   fnlwgt          48842 non-null  int64 
 3   education       48842 non-null  object
 4   education-num   48842 non-null  int64 
 5   marital-status  48842 non-null  object
 6   occupation      47876 non-null  object
 7   relationship    48842 non-null  object
 8   race            48842 non-null  object
 9   sex             48842 non-null  object
 10  capital-gain    48842 non-null  int64 
 11  capital-loss    48842 non-null  int64 
 12  hours-per-week  48842 non-null  int64 
 13  native-country  48568 non-null  object
dtypes: int64(6), object(8)
memory usage: 5.2+ MB


In [14]:
num_rows_with_null = X.isnull().any(axis=1).sum()
print(num_rows_with_null)

1221


In [15]:
(X == '?').sum().sum() #sum per column, total across all columns 

np.int64(4262)

In [16]:
import pandas as pd

X = X.replace('?', pd.NA)

num_rows_with_null = X.isna().any(axis=1).sum()
print(num_rows_with_null)

3620


In [17]:
X['workclass'].unique()

array(['State-gov', 'Self-emp-not-inc', 'Private', 'Federal-gov',
       'Local-gov', <NA>, 'Self-emp-inc', 'Without-pay', 'Never-worked',
       nan], dtype=object)

In [18]:
#Dropping rows

mask = ~X.isna().any(axis=1)

#X.isna(): creates a dataframe of true/false values (true where a cell in X is missing, false otherwise)
#.any(axis=1): checks across each row (axis = 1 means across columns) 
# ~ is logical not in pandas 


X = X.loc[mask].copy()
y = y.loc[mask].copy()
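The boolean mask keeps X and y aligned because both are filtered with the same row selection. An equivalent sketch, assuming X and y share the same index, is to call `dropna()` on X and reuse the surviving index for y. The frames below are small illustrative stand-ins, not the notebook's data.

```python
import pandas as pd

# Illustrative stand-ins for X and y.
X_demo = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "?", None, "Private"],
})
y_demo = pd.DataFrame({"income": [0, 0, 1, 1]})

# Treat the '?' placeholder as missing, then drop incomplete rows.
X_clean = X_demo.replace("?", pd.NA).dropna()

# Select the same surviving rows from y by index label.
y_clean = y_demo.loc[X_clean.index]

print(X_clean.index.tolist())
```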

In [19]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Index: 45222 entries, 0 to 48841
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             45222 non-null  int64 
 1   workclass       45222 non-null  object
 2   fnlwgt          45222 non-null  int64 
 3   education       45222 non-null  object
 4   education-num   45222 non-null  int64 
 5   marital-status  45222 non-null  object
 6   occupation      45222 non-null  object
 7   relationship    45222 non-null  object
 8   race            45222 non-null  object
 9   sex             45222 non-null  object
 10  capital-gain    45222 non-null  int64 
 11  capital-loss    45222 non-null  int64 
 12  hours-per-week  45222 non-null  int64 
 13  native-country  45222 non-null  object
dtypes: int64(6), object(8)
memory usage: 5.2+ MB


In [20]:
for column in X.columns:
    if X[column].dtype == 'object':
        print(f"{column}: {len(X[column].unique())}")



workclass: 7
education: 16
marital-status: 7
occupation: 14
relationship: 6
race: 5
sex: 2
native-country: 41


## Options for Encoding Categorical Variables
1. One-Hot Encoding
   - Pros: no information loss
   - Cons: explodes feature size, leads to sparse data
2. Label Encoding
   - How it works: Assign each category an integer label
   - Pros: keeps dimensionality low
   - Cons: NN may interpret numbers as ordinal when in reality they have no mathematical relationship
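3. Target Encoding / Mean Encoding
   - How it works: Replaces each category with a statistic of the target computed over that category, most commonly the mean of the target
   - Pros: compact (one numeric column per feature)
   - Cons: different categories can collapse to similar values, and it can leak the target unless the statistics are computed on the training split only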
4. Embedding Layers
   - How it works: each category is mapped to a dense vector of learned weights
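To make the first three options concrete, here is a minimal sketch on a toy frame (the `train` frame and its values are made up for illustration). Embedding layers are covered later in the notebook, so they are left out here.

```python
import pandas as pd

# Toy training data (illustrative values only).
train = pd.DataFrame({
    "sex": ["Male", "Female", "Female", "Male"],
    "income": [1, 0, 1, 1],
})

# 1. One-hot: one indicator column per category.
one_hot = pd.get_dummies(train["sex"], prefix="sex")

# 2. Label encoding: an arbitrary integer code per category (sorted order here).
label_codes = train["sex"].astype("category").cat.codes

# 3. Target (mean) encoding: mean of the target within each category,
#    computed on training rows only to limit leakage.
target_means = train.groupby("sex")["income"].mean()
target_encoded = train["sex"].map(target_means)

print(one_hot.columns.tolist())
print(label_codes.tolist())
print(target_encoded.tolist())
```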

# Preprocessing
- We already handled missing data. Categorical columns are converted to integer IDs below and then handled inside the model, which will have an embedding layer.

In [32]:
from sklearn.model_selection import train_test_split
import numpy as np

cat_cols = [c for c in X.columns if X[c].dtype == 'object']
num_cols = X.select_dtypes(include=[np.number]).columns

X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full,
    test_size=0.2,
    random_state=42,
    stratify=y_train_full
)


#stratify on y so the class distribution is the same in the train, validation, and test splits
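Two successive 80/20 splits give roughly 64% train, 16% validation, and 20% test. The arithmetic can be checked with a small sketch. It assumes the held-out share is rounded up and the rest is kept, which is my understanding of how `train_test_split` sizes its splits. `split_sizes` is a helper made up for this check.

```python
import math

def split_sizes(n_rows, test_size):
    # Assumption: the held-out share is rounded up and the remainder is kept.
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test

n_total = 45222  # rows left after dropping rows with missing values
n_train_full, n_test = split_sizes(n_total, 0.2)
n_train, n_val = split_sizes(n_train_full, 0.2)

print(n_train, n_val, n_test)
```

The resulting training count (28941) agrees with the shape of the categorical ID matrix printed further down.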



                    

### Converting categorical columns to integer IDs

Embedding layers expect integer indices, so each category must be mapped to an integer, with 0 reserved for "unk" (unknown/unseen categories).

In [34]:
#first we build the mappings from the training dataset only
cat_maps = {}

for c in cat_cols:
    categories = X_train[c].unique()
    mapping = {"unk": 0}
    for index, value in enumerate(categories, start=1):
        mapping[value] = index
    cat_maps[c] = mapping

print(cat_maps)
    

{'workclass': {'unk': 0, 'Private': 1, 'State-gov': 2, 'Self-emp-not-inc': 3, 'Federal-gov': 4, 'Local-gov': 5, 'Self-emp-inc': 6, 'Without-pay': 7}, 'education': {'unk': 0, 'Bachelors': 1, 'HS-grad': 2, 'Masters': 3, 'Some-college': 4, '7th-8th': 5, 'Prof-school': 6, '11th': 7, '10th': 8, 'Assoc-voc': 9, '5th-6th': 10, '12th': 11, 'Assoc-acdm': 12, '9th': 13, 'Doctorate': 14, '1st-4th': 15, 'Preschool': 16}, 'marital-status': {'unk': 0, 'Married-civ-spouse': 1, 'Never-married': 2, 'Divorced': 3, 'Separated': 4, 'Widowed': 5, 'Married-spouse-absent': 6, 'Married-AF-spouse': 7}, 'occupation': {'unk': 0, 'Exec-managerial': 1, 'Adm-clerical': 2, 'Prof-specialty': 3, 'Sales': 4, 'Farming-fishing': 5, 'Machine-op-inspct': 6, 'Transport-moving': 7, 'Craft-repair': 8, 'Tech-support': 9, 'Other-service': 10, 'Protective-serv': 11, 'Priv-house-serv': 12, 'Handlers-cleaners': 13, 'Armed-Forces': 14}, 'relationship': {'unk': 0, 'Husband': 1, 'Unmarried': 2, 'Own-child': 3, 'Other-relative': 4, 'W

In [35]:
def map_categories(column, mapping):
    return column.map(mapping).fillna(0).astype("int64")

# Rows with missing values were already dropped from X, but
# map(mapping) still returns NaN for any category that is not in cat_maps,
# that is, one that appears in validation but was never seen in training.
# fillna(0) sends those unseen categories to the unk bucket.


# map_categories works on one column at a time,
# so the lambda applies it to every column in cat_cols with that column's own mapping

train_categories = X_train[cat_cols].apply(lambda column: map_categories(column, cat_maps[column.name]))
val_categories = X_val[cat_cols].apply(lambda column: map_categories(column, cat_maps[column.name]))

print("Train:\n", train_categories)
print("\nVal:\n", val_categories)

Train:
        workclass  education  marital-status  occupation  relationship  race  \
5479           1          1               1           1             1     1   
11287          2          2               2           2             2     2   
31907          1          3               1           3             1     1   
9695           3          4               1           4             1     1   
43382          1          4               2           2             3     1   
...          ...        ...             ...         ...           ...   ...   
13028          1          2               3           7             6     1   
38378          1          2               1           2             1     1   
30315          1          2               1           4             5     1   
11081          1          4               1           7             1     1   
23786          1          4               2          13             3     2   

       sex  native-country  
5479     1    
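The unk fallback is easiest to see on a tiny example. The sketch below uses plain Python instead of pandas, with made-up category lists; `build_mapping` and `encode` are illustrative helpers, not part of the notebook.

```python
def build_mapping(train_values):
    # ID 0 is reserved for unknown categories; real categories start at 1.
    mapping = {"unk": 0}
    for value in train_values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping

def encode(values, mapping):
    # Categories missing from the mapping fall back to the unk ID (0).
    return [mapping.get(value, 0) for value in values]

train_values = ["Private", "State-gov", "Private", "Self-emp-inc"]  # illustrative
val_values = ["State-gov", "Never-worked", "Private"]               # illustrative

mapping = build_mapping(train_values)
print(mapping)
print(encode(val_values, mapping))
```

Here "Never-worked" never appears in the training values, so it is encoded as 0. One caveat of this scheme is that a real category literally named "unk" would collide with the reserved key.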

### Scaling the numeric columns 

In [36]:
import numpy as np
from sklearn.preprocessing import StandardScaler

#StandardScaler centers each column to mean 0 and scales it to unit variance, returning float64
#PyTorch layers use torch.float32 weights by default, so the inputs are cast to float32
#float32 also halves memory use compared with float64, which usually speeds up training

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train[num_cols]).astype("float32")
X_val_scaled = scaler.transform(X_val[num_cols]).astype("float32")
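What `fit_transform` and `transform` do can be written out by hand. This is a sketch with NumPy on made-up values: the mean and standard deviation come from the training split only and are then reused for validation, which is what keeps validation statistics from leaking into preprocessing.

```python
import numpy as np

train_hours = np.array([[40.0], [50.0], [30.0], [40.0]])  # illustrative values
val_hours = np.array([[45.0], [60.0]])                     # illustrative values

# "fit": statistics come from the training split only.
mean = train_hours.mean(axis=0)
std = train_hours.std(axis=0)  # population std (ddof=0), the default in np.std

# "transform": the same statistics are reused for every split.
train_scaled = ((train_hours - mean) / std).astype("float32")
val_scaled = ((val_hours - mean) / std).astype("float32")

print(train_scaled.ravel(), val_scaled.ravel())
```

The scaled training column has mean 0 and standard deviation 1, while the validation column generally does not, and that is expected.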

### Building categorical ID matrices
This step converts the categorical columns into the shape that the embedding layers expect, which is (batch_size, num_cat_cols).

In [39]:
X_train_categories = train_categories.to_numpy(dtype="int64")
X_val_categories = val_categories.to_numpy(dtype="int64")

print(X_train_categories.shape) #(num_rows, num_cat_cols)
print(X_train_categories[:3]) #each row = one training example
                              #each column = one encoded ID for a categorical feature

(28941, 8)
[[1 1 1 1 1 1 1 1]
 [2 2 2 2 2 2 2 1]
 [1 3 1 3 1 1 1 2]]


### Make y float32
- We will be using nn.BCEWithLogitsLoss
- The model will output a float32 logit
- The loss compares that float32 logit against the target tensor (y), and if y is int, PyTorch will throw a dtype mismatch error 
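For intuition about what the loss computes from a raw logit, here is a plain-Python sketch of binary cross-entropy on logits. It uses the standard numerically stable rewrite `max(z, 0) - z*y + log(1 + exp(-|z|))`; PyTorch's internal implementation may differ in detail. Both function names are made up for this example.

```python
import math

def bce_with_logits(logit, target):
    # Numerically stable form: max(z, 0) - z*y + log(1 + exp(-|z|)).
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def bce_naive(logit, target):
    # Sigmoid first, then binary cross-entropy on the probability.
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

print(bce_with_logits(2.0, 1.0), bce_naive(2.0, 1.0))
print(bce_with_logits(100.0, 0.0))
```

The two versions agree for moderate logits. For an extreme logit such as 100 with target 0, the naive version fails because the sigmoid rounds to exactly 1.0 and `log(0)` is undefined, which is one reason to pass logits to the loss rather than probabilities.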

In [42]:
print(type(y_train))

<class 'pandas.core.frame.DataFrame'>


In [44]:
y_train_array = y_train.astype("float32").to_numpy()
y_val_array = y_val.astype("float32").to_numpy()

In [46]:
print(type(y_train_array))

<class 'numpy.ndarray'>


### The Embedding Layer
- Mathematically: a lookup table (a matrix of learnable weights)
- Shape: (num_categories, embedding_dim); the rows are categories (0, 1, 2, etc.) and the columns are the latent features for each category. Each categorical column gets its own embedding layer.
- The model expects a tensor of integer IDs (torch.int64, aka LongTensor) of shape (batch_size, num_cat_cols). Each row is a sample, and each column holds the integer IDs for one categorical feature.
- For each categorical column, the layer outputs a float32 embedding vector. The embeddings of all categorical columns are then concatenated.
- The embedding vectors themselves are learned during training
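The lookup-table view can be sketched with NumPy alone: an embedding lookup is row indexing into a weight matrix, followed by concatenation across columns. The table sizes and IDs below are illustrative, and the weights are random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# One table per categorical column: (vocab_size, embedding_dim).
# Sizes are illustrative, not the notebook's emb_dims.
tables = [rng.normal(size=(8, 3)), rng.normal(size=(3, 2))]

# Batch of integer IDs, shape (batch_size, num_cat_cols).
ids = np.array([[1, 2],
                [0, 1],
                [7, 1]])

# Lookup = row indexing; one (batch_size, emb_dim_i) block per column.
blocks = [table[ids[:, i]] for i, table in enumerate(tables)]

# Concatenate side by side -> (batch_size, sum of embedding dims).
x = np.concatenate(blocks, axis=1)
print(x.shape)
```

In PyTorch the tables are `nn.Embedding` weights, so the rows are updated by backpropagation instead of staying fixed.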


<br>
In short: column --> embedding vector 
<br>
The next step will be figuring out how many categories each column has, then picking embedding dimensions for each. 

### Vocab sizes 
The cleanest, most stable vocab size is just the mapping length, that is, the total number of unique integer IDs that can appear in a given column (including 0 for unk).

In [49]:
vocab_sizes = [len(cat_maps[c]) for c in cat_cols]

#next, we have a function that computes the embedding dimension
#in a way that balances model expressiveness with efficiency

def pick_emb_dim(vocab_size):
    return min(50, max(4, int(round(vocab_size**0.25 * 8))))

#Next we compute the embedding dimensions for each column
emb_dims = [(vocab_size, pick_emb_dim(vocab_size)) for vocab_size in vocab_sizes]

print("vocab_sizes", vocab_sizes)
print("emb_dims", emb_dims)

vocab_sizes [8, 17, 8, 15, 7, 6, 3, 42]
emb_dims [(8, 13), (17, 16), (8, 13), (15, 16), (7, 13), (6, 13), (3, 11), (42, 20)]


# Modeling
This part of the project is where the PyTorch philosophy shines.
- Nothing is hidden: I define how data is stored, accessed, batched, and shuffled.

For this, PyTorch gives two pieces:
1. Dataset: defines how to get a single sample, which lets us neatly package each sample as (x_cats, x_nums, y)
2. DataLoader: turns that Dataset into batches, handling shuffling, batching, and multiprocessing. Moving each batch to the GPU is still done manually in the training loop.

## Dataset and DataLoader 

In [84]:
import torch
from torch.utils.data import Dataset, DataLoader
import numpy as np

class TabularDS(Dataset):
    def __init__(self, X_cats, X_nums, y):
        self.X_cats = X_cats
        self.X_nums = X_nums
        self.y = y.reshape(-1, 1) #forces column vector (N,1)

    def __len__(self):
        return self.y.shape[0]

    def __getitem__(self, i):
        return (
            torch.tensor(self.X_cats[i], dtype=torch.long),
            torch.tensor(self.X_nums[i], dtype=torch.float32),
            torch.tensor(self.y[i], dtype=torch.float32),
        )
train_dataset = TabularDS(X_train_categories, X_train_scaled, y_train_array)
val_dataset = TabularDS(X_val_categories, X_val_scaled, y_val_array)

#smaller batch size for training batches helps the optimizer see more gradient noise,
#improving generalization
#in validation, we aren't updating any weights so higher batch size means faster evaluation
#in addition, we shuffle the training set so the model doesn't overfit to the order in the data
#we set shuffling to false for validation loader as it provides no benefit 

train_loader = DataLoader(train_dataset, batch_size=512, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=1024, shuffle=False)

## Model

In [98]:
import torch.nn as nn

class CatEmbMLP(nn.Module):
    #hidden lists the width of each hidden linear layer, e.g. hidden=(128, 64) means 128 units then 64
    def __init__(self, emb_dims, n_num, hidden=(256, 128, 64), p=0.01):
        super().__init__()
        #for each categorical column, build one embedding table
        #v = vocab size (number of categories)
        #d = embedding dimension (chosen earlier)
        self.embs = nn.ModuleList([nn.Embedding(v, d) for (v, d) in emb_dims])
        self.emb_drop = nn.Dropout(p)

        #emb_dims is a list of (vocab_size_i, emb_dim_i) tuples, one per categorical column
        #in_dim below is the total embedding output size plus the numeric features
        #n_num is the number of numerical columns
        in_dim = sum(d for _, d in emb_dims) + n_num
        layers = []
        for h in hidden:
            layers += [nn.Linear(in_dim, h), nn.ReLU(), nn.Dropout(p)]
            in_dim = h #reset the input dimensions 
        layers += [nn.Linear(in_dim, 1)] #binary logit
        self.mlp = nn.Sequential(*layers)

    #x_cat: categorical input tensor
        #shape: (batch_size, n_cat_cols)
        #dtype: torch.int64 (LongTensor)
    #x_num: numeric input tensor
        #shape: (batch_size, n_num_cols)
        #dtype: torch.float32
    def forward(self, x_cat, x_num):
        #x_cat[:, i] = all IDs for column i in the batch --> shape (batch_size,)
        #emb(x_cat[:, i]) = look up each IDs embedding vector --> shape (batch_size, emb_dim_i)
        #emb_list collects these embeddings in a list 
        emb_list = [emb(x_cat[:, i]) for i, emb in enumerate(self.embs)]
        #join the embedding vectors side by side --> (batch_size, sum_of_all_emb_dims)
        x = torch.cat(emb_list, dim=1)
        #applying dropout for regularization, to limit overfitting to rare categories
        x = self.emb_drop(x)
        #join with numeric features, shape (batch_size, sum_emb_dims + n_num) as defined in __init__
        x = torch.cat([x, x_num], dim=1)
        #forward pass through the mlp
        return self.mlp(x)

#seeing if we can take advantage of Apple Silicon GPU 
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
n_num = X_train_scaled.shape[1]
model = CatEmbMLP(emb_dims, n_num, hidden=(128, 64), p=0.1).to(device)
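To get a feel for the size of this model, the input width and parameter count can be worked out by hand from `emb_dims`, the six numeric columns, and `hidden=(128, 64)`. The sketch below is plain arithmetic and does not need PyTorch. It assumes embeddings have no bias, each Linear layer has weights plus biases, and ReLU and Dropout have no parameters.

```python
emb_dims = [(8, 13), (17, 16), (8, 13), (15, 16), (7, 13), (6, 13), (3, 11), (42, 20)]
n_num = 6            # numeric columns in X
hidden = (128, 64)   # as passed to CatEmbMLP above

# Embedding tables: one weight per (category, latent feature) pair.
emb_params = sum(v * d for v, d in emb_dims)

# MLP: each Linear(in, out) has in*out weights plus out biases.
in_dim = sum(d for _, d in emb_dims) + n_num
mlp_input_dim = in_dim
mlp_params = 0
for h in list(hidden) + [1]:   # final Linear outputs one logit
    mlp_params += in_dim * h + h
    in_dim = h

print(mlp_input_dim, emb_params, mlp_params, emb_params + mlp_params)
```

So the MLP sees a 121-wide input, and the whole model has roughly 26k parameters, most of them in the first linear layer.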

## Training Loop

### Improvements

**Run 1**
Epoch 001 | train 0.3054 | val 0.3064 | acc 0.861 | auc 0.916 | lr 2.0e-04
Epoch 002 | train 0.3034 | val 0.3067 | acc 0.861 | auc 0.916 | lr 2.0e-04
Epoch 003 | train 0.3048 | val 0.3070 | acc 0.860 | auc 0.915 | lr 2.0e-04
Epoch 004 | train 0.3039 | val 0.3071 | acc 0.860 | auc 0.915 | lr 2.0e-04
Epoch 005 | train 0.3039 | val 0.3073 | acc 0.860 | auc 0.915 | lr 1.0e-04
Epoch 006 | train 0.3034 | val 0.3073 | acc 0.860 | auc 0.915 | lr 1.0e-04
Epoch 007 | train 0.3038 | val 0.3073 | acc 0.860 | auc 0.915 | lr 1.0e-04
Epoch 008 | train 0.3027 | val 0.3074 | acc 0.859 | auc 0.915 | lr 1.0e-04
Epoch 009 | train 0.3022 | val 0.3074 | acc 0.860 | auc 0.915 | lr 5.0e-05
Epoch 010 | train 0.3024 | val 0.3074 | acc 0.860 | auc 0.915 | lr 5.0e-05
Epoch 011 | train 0.3028 | val 0.3075 | acc 0.859 | auc 0.915 | lr 5.0e-05
Early stopping at epoch 11 (best val loss: 0.3064)
Loaded best model (val loss = 0.3064 )

- Train loss ≈ val loss, and both are essentially flat across epochs, so the model isn't learning much more in this run.
- We could try making the model deeper/wider, a different LR scheduler (warmup + cosine, or OneCycleLR), adding learned normalization, and increasing embedding dims by 25-50%. Patience should only be increased after trying the capacity / LR changes.
- To compare, we could run XGBoost.

In [102]:
import torch
from torch import nn
from sklearn.metrics import roc_auc_score, accuracy_score
import numpy as np

#I had problems using mps so switched to cpu. 
device = torch.device("cpu")

model = model.to(device)

# Loss, Optimizer, Scheduler 
criterion = nn.BCEWithLogitsLoss() #binary cross entropy
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4, weight_decay=1e-4)

#if validation loss hasn't improved for more than 3 epochs in a row, multiply the learning rate by 0.1 (never below min_lr)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=3, min_lr=1e-5
)

# One training epoch
def train_one_epoch(model, loader, optimizer, criterion, device):
    model.train() #in PyTorch there is train and eval modes 
    total_loss, total_n = 0.0, 0
    for x_cat, x_num, yb in loader:
        #PyTorch tensors must live on the same device as the model to interact
        #non_blocking isn't really relevant unless using GPU
        x_cat = x_cat.to(device, non_blocking=True)
        x_num = x_num.to(device, non_blocking=True)
        yb    = yb.to(device, non_blocking=True) #shape (batch_size, 1)

        logits = model(x_cat, x_num) #calls the forward function of the model           
        loss = criterion(logits, yb) #computing loss

        #clears old gradients from previous step (PyTorch accumulates gradients)
        optimizer.zero_grad(set_to_none=True)
        #PyTorch computes gradient of loss w respect to each parameter by applying
        #the chain rule through the computation graph
        #every parameter at the end has a .grad tensor attached
        loss.backward()
        #rescales the gradients if their total norm exceeds 1.0, guarding against exploding gradients
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        #optimizer updates the parameters using the stored .grad values
        #for AdamW, this means adaptive per-parameter step sizes + decoupled weight decay
        # + bias correction before applying the update
        optimizer.step()

        batch_size = yb.size(0) #gives number of rows 
        total_loss += loss.item() * batch_size #sum of losses per batch
        total_n += batch_size
    return total_loss / total_n #epoch wide mean loss 

# Validation pass 
@torch.no_grad()
def evaluate(model, loader, criterion, device):
    model.eval() #put the model in eval mode 
    total_loss, total_n = 0.0, 0
    all_probs, all_targets = [], []

    for x_cat, x_num, yb in loader:
        x_cat = x_cat.to(device, non_blocking=True)
        x_num = x_num.to(device, non_blocking=True)
        yb    = yb.to(device, non_blocking=True)

        logits = model(x_cat, x_num)
        loss = criterion(logits, yb)

        bs = yb.size(0)
        total_loss += loss.item() * bs
        total_n += bs

        #converts to probabilities, moves off gpu if using it, and converts to numpy array
        #for scikit-learn metrics 
        probs = torch.sigmoid(logits).squeeze(1).cpu().numpy()
        targets = yb.squeeze(1).cpu().numpy() #makes it (batch_size,)
        all_probs.append(probs)
        all_targets.append(targets)

    all_probs   = np.concatenate(all_probs, axis=0).reshape(-1)     # 1-D
    all_targets = np.concatenate(all_targets, axis=0).reshape(-1)   # 1-D

    # (optional) mask out any non-finite values, just in case
    m = np.isfinite(all_probs) & np.isfinite(all_targets)
    all_probs, all_targets = all_probs[m], all_targets[m]

    # preds as integers (0/1)
    preds = (all_probs >= 0.5).astype(np.int32)

    acc = accuracy_score(all_targets, preds)

    # auc only if both classes present
    if np.unique(all_targets).size > 1:
        auc = roc_auc_score(all_targets, all_probs)
    else:
        auc = float("nan")

    return (total_loss / total_n), acc, auc

# Training loop with early stopping 
EPOCHS    = 200            
PATIENCE  = 10             
MIN_DELTA = 1e-4        

best_val = float("inf")
patience_ctr = 0
best_state = None

for epoch in range(1, EPOCHS + 1):
    tr_loss = train_one_epoch(model, train_loader, optimizer, criterion, device)
    va_loss, va_acc, va_auc = evaluate(model, val_loader, criterion, device)

    #step scheduler on val loss
    scheduler.step(va_loss)

    print(f"Epoch {epoch:03d} | "
          f"train {tr_loss:.4f} | val {va_loss:.4f} | acc {va_acc:.3f} | auc {va_auc:.3f} | "
          f"lr {optimizer.param_groups[0]['lr']:.1e}")

    # Early stopping
    if va_loss < best_val - MIN_DELTA:
        best_val = va_loss
        best_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
        patience_ctr = 0
    else:
        patience_ctr += 1
        if patience_ctr >= PATIENCE:
            print(f"Early stopping at epoch {epoch} (best val loss: {best_val:.4f})")
            break

# restoring best weights 
if best_state is not None:
    model.load_state_dict(best_state)
    model.to(device)
    print("Loaded best model (val loss =", f"{best_val:.4f}", ")")

Epoch 001 | train 0.3077 | val 0.3071 | acc 0.859 | auc 0.916 | lr 2.0e-04
Epoch 002 | train 0.3056 | val 0.3074 | acc 0.858 | auc 0.915 | lr 2.0e-04
Epoch 003 | train 0.3065 | val 0.3079 | acc 0.859 | auc 0.916 | lr 2.0e-04
Epoch 004 | train 0.3051 | val 0.3073 | acc 0.859 | auc 0.915 | lr 2.0e-04
Epoch 005 | train 0.3056 | val 0.3070 | acc 0.858 | auc 0.916 | lr 2.0e-04
Epoch 006 | train 0.3057 | val 0.3070 | acc 0.858 | auc 0.916 | lr 2.0e-04
Epoch 007 | train 0.3036 | val 0.3088 | acc 0.858 | auc 0.915 | lr 2.0e-04
Epoch 008 | train 0.3053 | val 0.3076 | acc 0.856 | auc 0.915 | lr 2.0e-04
Epoch 009 | train 0.3055 | val 0.3073 | acc 0.858 | auc 0.915 | lr 2.0e-05
Epoch 010 | train 0.3038 | val 0.3074 | acc 0.858 | auc 0.915 | lr 2.0e-05
Epoch 011 | train 0.3031 | val 0.3074 | acc 0.858 | auc 0.915 | lr 2.0e-05
Epoch 012 | train 0.3044 | val 0.3073 | acc 0.857 | auc 0.915 | lr 2.0e-05
Epoch 013 | train 0.3042 | val 0.3074 | acc 0.858 | auc 0.915 | lr 1.0e-05
Epoch 014 | train 0.3028 
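The early-stopping rule in the loop can be checked in isolation by replaying the validation losses from the Run 1 log through the same logic, with the same `PATIENCE = 10` and `MIN_DELTA = 1e-4`. `early_stop_epoch` is a small helper written for this check, not part of the training code.

```python
def early_stop_epoch(val_losses, patience, min_delta):
    """Return (stop_epoch, best_loss); stop_epoch is None if never triggered."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best, waited = loss, 0      # meaningful improvement: reset counter
        else:
            waited += 1                 # no improvement this epoch
            if waited >= patience:
                return epoch, best
    return None, best

# Validation losses copied from the Run 1 log.
run1 = [0.3064, 0.3067, 0.3070, 0.3071, 0.3073, 0.3073,
        0.3073, 0.3074, 0.3074, 0.3074, 0.3075]

print(early_stop_epoch(run1, patience=10, min_delta=1e-4))
```

The best loss is set at epoch 1 and never beaten by more than `min_delta`, so the counter reaches 10 at epoch 11, matching the "Early stopping at epoch 11" line in the log.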

## XGBoost Comparison

In [96]:
import xgboost as xgb
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score, accuracy_score

X_train_xgb = X_train.copy()
X_val_xgb = X_val.copy()

for categorical in cat_cols:
    X_train_xgb[categorical] = X_train_xgb[categorical].astype("category")
    #align validation categories to the training categories; values unseen in training become NaN
    X_val_xgb[categorical] = pd.Categorical(X_val_xgb[categorical], categories=X_train_xgb[categorical].cat.categories)

#Building DMatrices with native categorical support
dtrain = xgb.DMatrix(X_train_xgb, label=y_train, enable_categorical=True)
dval   = xgb.DMatrix(X_val_xgb,   label=y_val,   enable_categorical=True)

#Params (solid baseline)
params = {
    "objective": "binary:logistic",
    "eval_metric": "logloss",    
    "max_depth": 6,
    "eta": 0.03,              
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "lambda": 1.0,            
    "alpha": 0.0,              
    "tree_method": "hist",
    "verbosity": 0,
    "seed": 42,
}

#Train with early stopping on validation
evals = [(dval, "val")]
bst = xgb.train(
    params=params,
    dtrain=dtrain,
    num_boost_round=2000,
    evals=evals,
    early_stopping_rounds=50,   # stops when val metric doesn't improve
    verbose_eval=False,
)

#Evaluate
probs = bst.predict(dval)   # probabilities for positive class
preds = (probs >= 0.5).astype(int)

auc = roc_auc_score(y_val, probs)
acc = accuracy_score(y_val, preds)

print(f"AUC: {auc:.3f}  ACC: {acc:.3f}  Best trees: {bst.best_iteration}")

AUC: 0.929  ACC: 0.871  Best trees: 456


# Reflection

This project went really well in terms of building a full end-to-end pipeline. I was able to set up preprocessing, a custom Dataset and DataLoader, the model, the loss and optimizer, a training loop with early stopping and a learning rate scheduler, and then track metrics properly. Along the way, I developed a much clearer mental model of key PyTorch concepts like logits vs. probabilities, why BCEWithLogitsLoss is preferred, the role of .train() and .eval(), device transfers, batch sizing, and how to aggregate batch metrics. I also handled categorical embeddings correctly, with per-column vocabularies and UNK handling, and built some debugging intuition by catching NaNs during initialization and tracing them back to Apple GPU issues. On top of that, I established a strong baseline by running XGBoost and comparing it to the neural network.

What didn’t go so smoothly was that the validation loss plateaued at around 0.307, regardless of scheduler tweaks or learning rate changes. This turned out not to be a problem with optimization but more a sign of architectural capacity limits in the MLP. Using Apple’s MPS backend also caused NaN issues, which forced me back to CPU training for stability. I also leaned a little too quickly on Optuna, when in hindsight the flat validation curve indicated that I should have tried architectural changes or regularization adjustments first.

The main lessons I’m taking away are that preprocessing must be done carefully to avoid leakage, embeddings help compress sparse categorical data into useful representations, and schedulers should always monitor validation performance, not just training. I also learned it’s often better to slightly over-parameterize a network and regularize it than to under-parameterize and cap performance. XGBoost slightly outperformed the neural network here (validation AUC 0.929 vs. about 0.916, accuracy 0.871 vs. about 0.86), which is common in tabular settings because trees are good at capturing threshold-like interactions that a plain MLP can miss without more advanced architectures.

If I were to run this again, I would try a larger MLP with batch normalization and lighter dropout, experiment with more dynamic learning rate schedules like OneCycle, and consider using pos_weight in the loss, since only about 24% of the labels were positive before cleaning. I’d also look at engineered interaction features to see whether the neural net can learn from the kind of signal that boosts XGBoost. For a bigger leap, I’d try transformer-style tabular models like FT-Transformer and compare them against XGBoost on a larger dataset. Overall, the biggest win is that I now have a clean, reusable training loop, a dataset pipeline, a hyperparameter tuning scaffold, and a strong baseline comparison, all of which I can carry into my next project.
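As a starting point for the pos_weight idea, here is a rough sketch of how the value could be derived. It uses the count and mean from the `y.describe()` output taken before rows with missing values were dropped, so the numbers are approximate. In practice the ratio should be computed from `y_train` and passed to `nn.BCEWithLogitsLoss` as a tensor.

```python
n_rows = 48842            # count from y.describe()
positive_rate = 0.239282  # mean of the 0/1 income column from y.describe()

n_pos = round(n_rows * positive_rate)
n_neg = n_rows - n_pos

# Common heuristic: weight positives by the negative-to-positive ratio.
pos_weight = n_neg / n_pos
print(n_pos, n_neg, round(pos_weight, 2))
```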
