### Lab 3.1: Batching and Regularization

In this lab you will learn how to set up a dataset to be processed in batches, rather than processing the entire dataset in each training iteration, and explore neural network regularization.

In [1]:
import numpy as np
import torch

In [2]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
adult = fetch_ucirepo(id=2) 
  
# data (as pandas dataframes) 
X = adult.data.features 
y = adult.data.targets 
  
# metadata 
print(adult.metadata) 
  
# variable information 
print(adult.variables)

{'uci_id': 2, 'name': 'Adult', 'repository_url': 'https://archive.ics.uci.edu/dataset/2/adult', 'data_url': 'https://archive.ics.uci.edu/static/public/2/data.csv', 'abstract': 'Predict whether annual income of an individual exceeds $50K/yr based on census data. Also known as "Census Income" dataset. ', 'area': 'Social Science', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 48842, 'num_features': 14, 'feature_types': ['Categorical', 'Integer'], 'demographics': ['Age', 'Income', 'Education Level', 'Other', 'Race', 'Sex'], 'target_col': ['income'], 'index_col': None, 'has_missing_values': 'yes', 'missing_values_symbol': 'NaN', 'year_of_dataset_creation': 1996, 'last_updated': 'Tue Sep 24 2024', 'dataset_doi': '10.24432/C5XW20', 'creators': ['Barry Becker', 'Ronny Kohavi'], 'intro_paper': None, 'additional_info': {'summary': "Extraction was done by Barry Becker from the 1994 Census database.  A set of reasonably clean records was extracted using the fol

In [3]:
X.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country'],
      dtype='object')

In [4]:
y = y['income'].map({'<=50K':0,'<=50K.':0,'>50K':1,'>50K.':1})

Here I remove every row that has a missing feature value, dropping it from both the features and the labels.

In [5]:
bad = X.isna().any(axis=1)
X = X[~bad]
y = y[~bad]

Selecting only the numeric variables:

In [6]:
X = X[['age','fnlwgt','education-num','capital-gain','capital-loss','hours-per-week']]

In [7]:
y = y.values
X = X.values.astype('float64')

To make the learning algorithm work more smoothly, we will standardize each feature: subtract its mean and divide by its standard deviation.

Here `np.mean` and `np.std` calculate the mean and standard deviation, and `axis=0` tells NumPy to reduce over the rows, which gives one value per column (that is, per feature).

In [8]:
X -= np.mean(X,axis=0)
X /= np.std(X,axis=0)
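
As a quick check of what these two lines do, here is a minimal sketch on a small made-up array (the name `toy` and its values are illustrative and are not part of the dataset). After standardization, each column should have mean 0 and standard deviation 1.

```python
import numpy as np

# Hypothetical 4x2 array: rows are samples, columns are features.
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 30.0],
                [4.0, 40.0]])

col_mean = np.mean(toy, axis=0)  # one mean per column
col_std = np.std(toy, axis=0)    # one (population) standard deviation per column

# Broadcasting applies the per-column statistics to every row.
standardized = (toy - col_mean) / col_std

print(col_mean)
print(standardized.mean(axis=0))
print(standardized.std(axis=0))
```

Note that the two columns differ in scale by a factor of ten before standardization and become identical afterwards, which is the point of the rescaling.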

Now we will convert our `X` and `y` arrays to torch Tensors.

In [9]:
X = torch.tensor(X).float()
y = torch.tensor(y).long()

### Exercises

1. Divide the data into train and test splits.
2. Create a neural network for this dataset.
3. Use `TensorDataset` and `DataLoader` to batch the dataset during training.  
4. Use the `weight_decay` parameter of `optim.SGD` to introduce L2 regularization during training. Evaluate the effect of regularization on test set accuracy.
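
Before working through the exercises, it may help to see what `TensorDataset` and `DataLoader` produce. The sketch below uses small random tensors (`toy_X`, `toy_y`, and the batch size of 4 are assumptions chosen for illustration) and prints the shape of each batch.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Hypothetical toy data: 10 samples, 6 features, binary labels.
toy_X = torch.randn(10, 6)
toy_y = torch.randint(0, 2, (10,))

# TensorDataset pairs each row of toy_X with the matching entry of toy_y.
toy_data = TensorDataset(toy_X, toy_y)

# DataLoader groups the pairs into batches; shuffle=True reorders them every epoch.
toy_loader = DataLoader(toy_data, batch_size=4, shuffle=True)

batch_sizes = []
for X_batch, y_batch in toy_loader:
    # X_batch has shape (batch, 6) and y_batch has shape (batch,)
    batch_sizes.append(X_batch.shape[0])
    print(X_batch.shape, y_batch.shape)

# Number of batches per epoch, and the size of each batch.
print(len(toy_loader), batch_sizes)
```

With 10 samples and a batch size of 4, the last batch holds only the 2 remaining samples, because `drop_last` defaults to `False`.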

In [10]:
# divide data into training and testing set
n = X.shape[0]
n_train = int(n*0.8)
n_test = n - n_train

X_train = X[:n_train]
y_train = y[:n_train]
X_test = X[n_train:]
y_test = y[n_train:]

# create neural network for this dataset
from torch import nn

nn_model = nn.Sequential(
    nn.Linear(6, 100),
    nn.ReLU(),
    nn.Linear(100, 100),
    nn.ReLU(),
    nn.Linear(100, 2)
)

# use TensorDataset and DataLoader to batch dataset during training
from torch.utils.data import TensorDataset, DataLoader

train_data = TensorDataset(X_train, y_train)
train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
test_data = TensorDataset(X_test, y_test)
test_loader = DataLoader(test_data, batch_size=64, shuffle=False)

# use weight_decay parameter to optim.SGD to introduce L2 regularization during training
from torch import optim

optimizer = optim.SGD(nn_model.parameters(), lr=0.01, weight_decay=0.01)
loss_fn = nn.CrossEntropyLoss()

def train(model, train_loader, optimizer, loss_fn, n_epochs=100):
    for epoch in range(n_epochs):
        model.train()
        for X_batch, y_batch in train_loader:
            optimizer.zero_grad()
            y_pred = model(X_batch)
            loss = loss_fn(y_pred, y_batch)
            loss.backward()
            optimizer.step()
        # Note: this is the loss of the last mini-batch only, so it fluctuates from epoch to epoch.
        print(f'Epoch {epoch}, loss {loss.item()}')

train(nn_model, train_loader, optimizer, loss_fn)

Epoch 0, loss 0.2745862603187561
Epoch 1, loss 0.417532742023468
Epoch 2, loss 0.5407511591911316
Epoch 3, loss 0.29488298296928406
Epoch 4, loss 0.36891672015190125
Epoch 5, loss 0.42763209342956543
Epoch 6, loss 0.6220870614051819
Epoch 7, loss 0.22156564891338348
Epoch 8, loss 0.23471474647521973
Epoch 9, loss 0.473153293132782
Epoch 10, loss 0.25721150636672974
Epoch 11, loss 0.2935134172439575
Epoch 12, loss 0.1656373292207718
Epoch 13, loss 0.48936471343040466
Epoch 14, loss 0.27064546942710876
Epoch 15, loss 0.310878723859787
Epoch 16, loss 0.5345979332923889
Epoch 17, loss 0.26673561334609985
Epoch 18, loss 0.4345850944519043
Epoch 19, loss 0.31383857131004333
Epoch 20, loss 0.18812061846256256
Epoch 21, loss 0.49090737104415894
Epoch 22, loss 0.19499003887176514
Epoch 23, loss 0.30512845516204834
Epoch 24, loss 0.36596959829330444
Epoch 25, loss 0.3337234556674957
Epoch 26, loss 0.4680981934070587
Epoch 27, loss 0.2382657378911972
Epoch 28, loss 0.5252874493598938
Epoch 29, lo
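
The printed values jump around because each one is the loss on a single mini-batch of 64 samples, not a summary of the whole epoch. One possible variant, sketched here under assumptions, averages the loss over every sample seen in the epoch. The function name `train_with_mean_loss` and the synthetic `demo_*` data are hypothetical stand-ins, not part of the lab.

```python
import torch
from torch import nn, optim
from torch.utils.data import TensorDataset, DataLoader

def train_with_mean_loss(model, loader, optimizer, loss_fn, n_epochs=3):
    """Train the model and return the sample-weighted mean loss of each epoch."""
    history = []
    for epoch in range(n_epochs):
        model.train()
        running, seen = 0.0, 0
        for X_batch, y_batch in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(X_batch), y_batch)
            loss.backward()
            optimizer.step()
            # The loss is a mean over the batch, so weight it by the batch size.
            running += loss.item() * y_batch.size(0)
            seen += y_batch.size(0)
        history.append(running / seen)
        print(f'Epoch {epoch}, mean loss {history[-1]:.4f}')
    return history

# Synthetic stand-in data (an assumption, not the Adult dataset).
torch.manual_seed(0)
demo_X = torch.randn(256, 6)
demo_y = (demo_X[:, 0] > 0).long()
demo_loader = DataLoader(TensorDataset(demo_X, demo_y), batch_size=64, shuffle=True)

demo_model = nn.Sequential(nn.Linear(6, 16), nn.ReLU(), nn.Linear(16, 2))
demo_opt = optim.SGD(demo_model.parameters(), lr=0.1, weight_decay=0.01)
history = train_with_mean_loss(demo_model, demo_loader, demo_opt, nn.CrossEntropyLoss())
```

The returned `history` list can also be plotted to judge whether training has converged.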

In [11]:
# evaluate the effect of regularization on test set accuracy
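def test(model, test_loader):
    correct = 0
    total = 0
    # switch to evaluation mode (no effect for this model, but needed with dropout or batch norm)
    model.eval()
    with torch.no_grad():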
        for X_batch, y_batch in test_loader:
            y_pred = model(X_batch)
            _, predicted = torch.max(y_pred, 1)
            total += y_batch.size(0)
            correct += (predicted == y_batch).sum().item()
    return correct / total

print(test(nn_model, test_loader))

0.8141732283464567


Achieved an accuracy of 81.42% on the test set with regularization (`weight_decay=0.01`).
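
A single run cannot show the effect of regularization by itself; that requires comparing against a run with `weight_decay=0` and perhaps a few other values. The sketch below shows one way such a comparison could be organized. It uses synthetic data and a smaller network so that it runs quickly, and the helper names (`make_model`, `accuracy`) and the list of decay values are assumptions; its numbers say nothing about the Adult dataset.

```python
import torch
from torch import nn, optim
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)

# Synthetic stand-in for the standardized features and 0/1 labels (an assumption).
X_demo = torch.randn(500, 6)
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).long()
X_tr, y_tr = X_demo[:400], y_demo[:400]
X_te, y_te = X_demo[400:], y_demo[400:]
loader = DataLoader(TensorDataset(X_tr, y_tr), batch_size=64, shuffle=True)

def make_model():
    # Hypothetical small network; the lab's model is larger.
    return nn.Sequential(nn.Linear(6, 32), nn.ReLU(), nn.Linear(32, 2))

def accuracy(model, X, y):
    model.eval()
    with torch.no_grad():
        return (model(X).argmax(dim=1) == y).float().mean().item()

results = {}
for wd in [0.0, 0.01, 0.1]:
    torch.manual_seed(1)  # same initial weights and batch order for every setting
    model = make_model()
    opt = optim.SGD(model.parameters(), lr=0.1, weight_decay=wd)
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(5):
        model.train()
        for Xb, yb in loader:
            opt.zero_grad()
            loss_fn(model(Xb), yb).backward()
            opt.step()
    results[wd] = accuracy(model, X_te, y_te)

for wd, acc in results.items():
    print(f'weight_decay={wd}: test accuracy {acc:.4f}')
```

Resetting the seed before each run keeps the initialization and batch order fixed, so any difference in accuracy can be attributed to `weight_decay` alone.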