# Data Science for Business - Multilayer Perceptron (MLP) on Ames Housing with PyTorch

## Initialize notebook
Load the required packages and set up the workspace, e.g., suppress warnings and initialize the random number generator.

In [None]:
import warnings
warnings.simplefilter('ignore')

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import r2_score, root_mean_squared_error

import torch
import torch.nn as nn
import torch.optim as optim
from torchsummary import summary


In [None]:
torch.manual_seed(42)

## Problem description

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence. With 76 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this dataset challenges you to predict the final price of each home. More: <https://www.kaggle.com/c/house-prices-advanced-regression-techniques>


## Load data

Load the data from a CSV file. We will split it into training and test sets below.

In [None]:
data = pd.read_csv('https://raw.githubusercontent.com/olivermueller/ds4b-2024/refs/heads/main/Session_08/ameshousing.csv')

In [None]:
data.head()

## Prepare data

First, we will remove some columns that are not useful for our task.

In [None]:
data = data.drop(['YrSold', 'MoSold', 'SaleCondition', 'SaleType'], axis=1)

Next, we will split the data into features (*X*) and labels (*y*) and into training (*X_train, y_train*) and test (*X_test, y_test*) sets.

In [None]:
X = data.drop("SalePrice", axis=1)
y = data["SalePrice"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Finally, we will do some feature engineering. It is important to use only information from the training set for feature engineering, and then mechanically repeat these steps on the test set. Otherwise, information from the test set would leak into the model.

Typically, feature engineering depends strongly on the data type of the variables. Hence, we will first determine which variables are categorical and which are numerical. Subsequently, we will transform these variables separately.

In [None]:
categorical_features = X_train.select_dtypes(include='object').columns
numerical_features = X_train.select_dtypes(exclude='object').columns

The categorical variables must be transformed into numerical representations, e.g., by one-hot encoding them.

In [None]:
enc = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
enc.fit(X_train[categorical_features])

X_train_cat = enc.transform(X_train[categorical_features])
X_test_cat = enc.transform(X_test[categorical_features])

X_train_cat = pd.DataFrame(X_train_cat, columns=enc.get_feature_names_out(categorical_features))
X_test_cat = pd.DataFrame(X_test_cat, columns=enc.get_feature_names_out(categorical_features))

In [None]:
X_train_cat.head()

The numerical variables will be standardized, that is, we will subtract the mean and divide by the standard deviation. This is especially important for neural networks, as gradient-based training tends to converge faster and more reliably when all inputs have comparable magnitudes.

In [None]:
scaler = StandardScaler()
scaler.fit(X_train[numerical_features]) 

X_train_num = scaler.transform(X_train[numerical_features])
X_test_num = scaler.transform(X_test[numerical_features])

X_train_num = pd.DataFrame(X_train_num, columns=numerical_features)
X_test_num = pd.DataFrame(X_test_num, columns=numerical_features)

In [None]:
X_train_num.head()
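
The arithmetic behind standardization can be sketched by hand with NumPy. The numbers below are made up for illustration. The key point is that the mean and standard deviation come from the training values only and are then reused for the test values. `StandardScaler` uses the population standard deviation (`ddof=0`), which the sketch mirrors.

```python
import numpy as np

# Toy values (made up for illustration)
train_col = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
test_col = np.array([3.0, 6.0])

# Statistics are estimated on the training data only
mu = train_col.mean()          # 3.0
sigma = train_col.std(ddof=0)  # population standard deviation, sqrt(2)

train_scaled = (train_col - mu) / sigma
test_scaled = (test_col - mu) / sigma  # reuses the training statistics

print(train_scaled.mean(), train_scaled.std())
print(test_scaled)
```

After scaling, the training column has mean 0 and standard deviation 1. The test column generally does not, because it was scaled with the training statistics.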

Let's combine the engineered categorical and numerical variables again.

In [None]:
X_train = pd.concat([X_train_num, X_train_cat], axis=1)
X_test = pd.concat([X_test_num, X_test_cat], axis=1)

In [None]:
X_train
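
One detail worth knowing: `pd.concat(..., axis=1)` aligns rows by index label, not by position. The concatenation above works because both parts were rebuilt as DataFrames with a fresh default index. The following sketch, with made-up values and hypothetical column names, shows what can happen when the indices differ.

```python
import pandas as pd

# One part keeps a shuffled index (as after train_test_split), the other has a fresh one
num_part = pd.DataFrame({"LotArea": [0.1, -0.3]}, index=[5, 7])
cat_part = pd.DataFrame({"Street_Pave": [1.0, 0.0]})  # index 0, 1

# Index labels do not match, so pandas builds the union of both indices
misaligned = pd.concat([num_part, cat_part], axis=1)

# Resetting the index restores positional alignment
aligned = pd.concat([num_part.reset_index(drop=True), cat_part], axis=1)

print(misaligned.shape)  # (4, 2), with missing values
print(aligned.shape)     # (2, 2)
```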

## Neural Network

### Data loaders

The first thing we have to do is to transform the data into PyTorch tensors. This is done with the `torch.tensor` function.

In [None]:
# Convert to PyTorch tensors
X_train_tensor = torch.tensor(X_train.to_numpy(), dtype=torch.float32)
X_test_tensor = torch.tensor(X_test.to_numpy(), dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.to_numpy(), dtype=torch.float32).unsqueeze(1)
y_test_tensor = torch.tensor(y_test.to_numpy(), dtype=torch.float32).unsqueeze(1)
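
The `unsqueeze(1)` call turns the targets from shape `(N,)` into shape `(N, 1)`, which matches the shape of the network's output. The toy sketch below (made-up numbers, MSE written out by hand) illustrates why this matters: with mismatched shapes, broadcasting silently compares every prediction with every target.

```python
import torch

pred = torch.tensor([[1.0], [2.0], [3.0]])   # shape (3, 1), like the network output
target_flat = torch.tensor([1.0, 2.0, 3.0])  # shape (3,)
target_col = target_flat.unsqueeze(1)        # shape (3, 1)

# Shapes differ: (3, 1) - (3,) broadcasts to a (3, 3) matrix of pairwise differences
mse_broadcast = ((pred - target_flat) ** 2).mean()

# Shapes match: element-wise differences, as intended
mse_correct = ((pred - target_col) ** 2).mean()

print(mse_broadcast.item())  # about 1.33, although the predictions are perfect
print(mse_correct.item())    # 0.0
```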

Next, we create a data loader for the training data, using the `torch.utils.data.DataLoader` class. The data loader allows us to read the data in mini-batches during training and, with `shuffle=True`, in a new random order in every epoch. This is not strictly needed for our tiny dataset, but PyTorch is typically used with much larger datasets that do not fit into main memory.

In [None]:
batch_size = 32

train_loader = torch.utils.data.DataLoader(
    dataset=torch.utils.data.TensorDataset(X_train_tensor, y_train_tensor),
    batch_size=batch_size,
    shuffle=True
)
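
To get a feeling for what the data loader yields, here is a small sketch on random toy tensors (the sizes and the `toy_` names are assumptions for illustration). With 100 samples and a batch size of 32, one pass over the loader gives three full batches and one smaller final batch.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data: 100 samples with 5 features each
X_toy = torch.randn(100, 5)
y_toy = torch.randn(100, 1)

toy_loader = DataLoader(TensorDataset(X_toy, y_toy), batch_size=32, shuffle=True)

# One pass over the loader corresponds to one epoch
batch_sizes = [batch_X.shape[0] for batch_X, batch_y in toy_loader]
n_batches = len(toy_loader)

print(n_batches)    # 4
print(batch_sizes)  # [32, 32, 32, 4]
```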

### Network architecture

Now we specify the architecture of the neural network. We will use a simple feedforward neural network with one hidden layer. The input layer has as many neurons as we have features, and the output layer has a single neuron, as we want to do regression. The hidden layer has 256 neurons and uses the ReLU activation function.

To specify the architecture, we create a class that inherits from `torch.nn.Module` and has two standard methods. In the `__init__` method, we specify the types of layers we want to use as building blocks. The `forward` method specifies how these building blocks are connected.

In [None]:
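class MLPModel(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        # Building blocks: one hidden layer, one output layer, and the activation
        self.hidden = nn.Linear(input_dim, 256)
        self.output = nn.Linear(256, 1)
        self.relu = nn.ReLU()

    def forward(self, x):
        # Input -> hidden layer -> ReLU -> output layer (no activation, as this is regression)
        x = self.relu(self.hidden(x))
        x = self.output(x)
        return x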

Let's create an instance of the neural network and look at its architecture.

In [None]:
input_dim = X_train.shape[1]
model = MLPModel(input_dim)


In [None]:
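# The model lives on the CPU, so tell torchsummary to build its dummy input there as well
summary(model, input_size=(input_dim,), device='cpu')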
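
The parameter count reported in such a summary can be checked by hand: a linear layer with `n_in` inputs and `n_out` outputs has `n_in * n_out` weights plus `n_out` biases. The sketch below uses an equivalent `nn.Sequential` network and a hypothetical input dimension of 10 instead of the real number of features.

```python
import torch.nn as nn

toy_input_dim = 10  # hypothetical; the real value is X_train.shape[1]

# Same structure as the MLP above, written with nn.Sequential
toy_model = nn.Sequential(
    nn.Linear(toy_input_dim, 256),
    nn.ReLU(),
    nn.Linear(256, 1),
)

# Count all trainable weights and biases
n_params = sum(p.numel() for p in toy_model.parameters())

# Hidden layer: 10 * 256 weights + 256 biases; output layer: 256 weights + 1 bias
expected = (toy_input_dim * 256 + 256) + (256 * 1 + 1)

print(n_params, expected)  # 3073 3073
```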

### Training loop

We are now ready to train the neural network. We first specify the loss function, which is the mean squared error loss. We also specify the optimizer, which is the Adam optimizer. We then train the neural network for 100 epochs.

In [None]:
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

In [None]:
epochs = 100

for epoch in range(epochs):
    model.train()
    epoch_loss = 0
    for batch_X, batch_y in train_loader:
        optimizer.zero_grad()
        predictions = model(batch_X)
        loss = criterion(predictions, batch_y)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    
    if (epoch + 1) % 10 == 0:
        print(f'Epoch {epoch + 1}/{epochs}, Loss: {epoch_loss / len(train_loader):.4f}')
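
The four calls inside the inner loop follow a fixed pattern. The sketch below repeats that pattern on a made-up one-dimensional problem (`y = 2x + 1`) with a single linear layer. The `toy_` names and the learning rate of 0.1 are assumptions chosen for the toy problem, not recommendations for the housing data.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Toy data: a noiseless linear relationship
x = torch.linspace(-1, 1, 50).unsqueeze(1)
y = 2 * x + 1

toy_model = nn.Linear(1, 1)
toy_criterion = nn.MSELoss()
toy_optimizer = optim.Adam(toy_model.parameters(), lr=0.1)

losses = []
for step in range(200):
    toy_optimizer.zero_grad()              # clear gradients left over from the previous step
    loss = toy_criterion(toy_model(x), y)  # forward pass and loss computation
    loss.backward()                        # backpropagation: compute gradients
    toy_optimizer.step()                   # update weights using the gradients
    losses.append(loss.item())

print(f'First loss: {losses[0]:.4f}, last loss: {losses[-1]:.6f}')
```

Without `zero_grad()`, PyTorch would add the new gradients to the old ones, because `backward()` accumulates gradients by default.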


### Predictions and evaluation

Finally, we evaluate the neural network on the test set. To do so, we set the model to evaluation mode, so that layers such as dropout are deactivated, and disable gradient tracking with `torch.no_grad()`. We then compute the loss on the test set and print it.

In [None]:
# Evaluate the model
model.eval()
with torch.no_grad():
    y_pred = model(X_test_tensor)
    test_loss = criterion(y_pred, y_test_tensor)
    print(f'Test Loss: {test_loss.item():.4f}')


Note that the loss is the MSE, so it is expressed in squared units of the sale price. Let's compute R² and RMSE as well.

In [None]:
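# Convert the tensors to NumPy arrays before passing them to scikit-learn
y_true_np = y_test_tensor.numpy()
y_pred_np = y_pred.numpy()
r2_test = r2_score(y_true_np, y_pred_np)
rmse_test = root_mean_squared_error(y_true_np, y_pred_np)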
print('R2 on test set:', round(r2_test, 2))
print('RMSE on test set:', round(rmse_test, 2))
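
How the three metrics relate can be worked out by hand on a few made-up numbers: RMSE is the square root of the MSE, and R² compares the squared errors of the model with those of a baseline that always predicts the mean.

```python
import numpy as np

# Toy values (made up for illustration)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

mse = np.mean((y_true - y_hat) ** 2)            # 0.5
rmse = np.sqrt(mse)                             # about 0.707, in the units of y
ss_res = np.sum((y_true - y_hat) ** 2)          # 2.0, squared errors of the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 20.0, squared errors of the mean baseline
r2 = 1 - ss_res / ss_tot                        # 0.9

print(mse, rmse, r2)
```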

## Your turn!

The above RMSE is not very good. Try to improve it! You can try the following:

- Increase the number of epochs
- Increase the number of neurons in the hidden layer
- Add more hidden layers
- Add dropout layers
- ...
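
As a possible starting point, here is a sketch of a variant with two hidden layers and dropout. The class name `DeeperMLP`, the layer sizes, and the dropout rate of 0.2 are assumptions for illustration. Whether they actually improve the RMSE is something you need to check yourself.

```python
import torch
import torch.nn as nn

class DeeperMLP(nn.Module):
    """Hypothetical variant with two hidden layers and dropout."""

    def __init__(self, input_dim, hidden_dim=256, dropout=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(dropout),  # randomly zeroes activations during training
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim // 2, 1),
        )

    def forward(self, x):
        return self.net(x)

torch.manual_seed(0)
deeper_model = DeeperMLP(input_dim=10)  # 10 is a placeholder for X_train.shape[1]
x_demo = torch.randn(4, 10)

# In evaluation mode dropout is switched off, so repeated predictions are identical
deeper_model.eval()
with torch.no_grad():
    out_a = deeper_model(x_demo)
    out_b = deeper_model(x_demo)

print(out_a.shape, torch.equal(out_a, out_b))
```

Such a model can be used in place of `MLPModel` in the training loop above. Remember that `model.train()` switches dropout back on.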