# Solving Regression with a Multilayer Perceptron (MLP)

Using a simple neural network for regression tasks.

### Step 1: Establish the question

Fortunately, we worked with this dataset back in MC2. This time we'll ask a new question: __Can a simple neural network perform better than a linear regression model on the same evaluation metrics we used before (MSE and R_sq)?__

For simplicity, we will use a small multilayer perceptron (MLP) for the regression task.
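
For reference, the two metrics named in the question can be written as follows. These are the standard definitions, assuming $y_i$ is the true price of car $i$, $\hat{y}_i$ is the model's prediction, $\bar{y}$ is the mean of the true prices, and $n$ is the number of cars being evaluated:

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2}
```

A lower MSE and an R_sq closer to 1 both indicate a better fit.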

### Step 2: Environment Setup

Make sure you update your local copy of the repository by running `git pull`.

In [None]:
# install dependencies
%pip install pandas numpy matplotlib scikit-learn torch

Notice we have installed a new package, `torch` (PyTorch), which is used for building, training, and evaluating neural networks!

In [None]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import torch
import torch.nn as nn
import torch.optim as optim

### Step 3: Read and Clean the Data

In [None]:
# Read in the dataset with pd.read_csv (save it in a variable called df)
# Data is in ../data/imports-85.csv


In [None]:
# Check for missing values using df.isnull().sum().sum()


In [None]:
# If there are missing values, drop them using df.dropna(inplace=True)


In [None]:
# Print out the first few rows of the dataset using df.head()


In [None]:
# Print the column names using df.columns
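
The cleaning steps in this section can be sketched on a tiny in-memory table. The CSV text and its values below are made up for illustration; the real data lives in `../data/imports-85.csv` and will look different.

```python
import io
import pandas as pd

# Hypothetical stand-in for ../data/imports-85.csv (values are made up)
csv_text = """engine-size,price
130,13495
152,
109,13950
"""

df = pd.read_csv(io.StringIO(csv_text))

n_missing = df.isnull().sum().sum()  # total number of missing cells
print(f"Missing values before cleaning: {n_missing}")

df.dropna(inplace=True)              # drop every row that has a missing cell
print(df.head())
print(list(df.columns))
```

With the real file, replace `io.StringIO(csv_text)` with the path to the CSV.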


### Step 4: Data Exploration

We've already explored this dataset, so we can skip this step!

### Step 5: Building a Multilayer Perceptron

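As we reviewed in the slides, a multilayer perceptron is a simple ANN with an input layer, one or more hidden layers, and an output layer, where each layer is fully connected to the next.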

In [None]:
# Build the model with nn.Sequential (a container that is itself an nn.Module)
# you don't need to change anything here
model = nn.Sequential(
    nn.Linear(1, 128),   # Input layer to hidden layer (1 feature: engine-size)
    nn.ReLU(),           # Activation function
    nn.Linear(128, 64),  # Hidden layer to hidden layer
    nn.Linear(64, 1)     # Hidden layer to output layer
)
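
As a quick sanity check, here is a minimal sketch that rebuilds the same architecture, passes a dummy batch through it, and counts the trainable parameters. The batch size of 5 and the random input values are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

# Same architecture as the model cell in this step
model = nn.Sequential(
    nn.Linear(1, 128),
    nn.ReLU(),
    nn.Linear(128, 64),
    nn.Linear(64, 1),
)

# Dummy batch: 5 rows, 1 feature each (values are arbitrary)
dummy = torch.randn(5, 1)
out = model(dummy)
print(out.shape)  # one prediction per input row

# Count trainable parameters (weights plus biases of every Linear layer)
n_params = sum(p.numel() for p in model.parameters())
print(n_params)
```

The input must be 2-D with shape `(batch, 1)` because the first layer is `nn.Linear(1, 128)`.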

In [None]:
# Now we can train our model
# Define loss function and optimizer (you don't need to change these)
criterion = nn.MSELoss()  # Mean Squared Error for regression
optimizer = optim.Adam(model.parameters(), lr=0.001)
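
To see what `nn.MSELoss` computes, here is a minimal sketch on two made-up predictions. It compares the loss with the same quantity computed by hand.

```python
import torch
import torch.nn as nn

criterion = nn.MSELoss()

# Toy predictions and targets (made-up numbers), both with shape (2, 1)
pred = torch.tensor([[2.0], [4.0]])
target = torch.tensor([[1.0], [2.0]])

loss = criterion(pred, target)
manual = ((pred - target) ** 2).mean()  # mean of the squared errors, by hand

print(loss.item(), manual.item())
```

Predictions and targets should have the same shape, which is why the training loop later reshapes the targets with `.view(-1, 1)`.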


In [None]:
# Try predicting price from engine size

# Split the dataset into features and target variable
# Set the feature (engine-size) as X and the target (price) as y


In [None]:
# Before training, split the data into training and testing sets (X_train, X_test, y_train, y_test)


In [None]:
# Convert the data to PyTorch tensors
# Do this with something like: X_train = torch.tensor(X_train.values, dtype=torch.float32)
# (.values turns a pandas DataFrame/Series into a NumPy array first)
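
The split and the tensor conversion can be sketched together on a small made-up table. The numbers, `test_size=0.2`, and `random_state=42` are illustrative assumptions, not values prescribed by this notebook.

```python
import pandas as pd
import torch
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the cleaned DataFrame (values are made up)
df = pd.DataFrame({
    "engine-size": [97, 109, 130, 136, 152, 164, 181, 209, 98, 120],
    "price": [7000, 13950, 13495, 17450, 16500, 21000, 18000, 30760, 8000, 11000],
})

X = df[["engine-size"]]  # 2-D: (n_samples, 1), the shape nn.Linear(1, ...) expects
y = df["price"]          # 1-D: (n_samples,)

# test_size and random_state are illustrative choices
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# .values gives a NumPy array, which torch.tensor can convert directly
X_train_t = torch.tensor(X_train.values, dtype=torch.float32)
y_train_t = torch.tensor(y_train.values, dtype=torch.float32)
print(X_train_t.shape, y_train_t.shape)
```

Selecting the feature with double brackets (`df[["engine-size"]]`) keeps it 2-D, so no extra reshape is needed before it goes into the model.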


In [None]:
# Train the model in batches (most of the code is already there, we'll skip the implementation details)

batch_size = 32
num_epochs = 100
losses = []
for epoch in range(num_epochs):
    model.train()      # put the model in training mode
    batch_losses = []  # reset once per epoch so the mean covers every batch
    for i in range(0, len(YOUR_X_TRAIN), batch_size):
        # Get the batch data
        X_batch = YOUR_X_TRAIN[i:i+batch_size]
        y_batch = YOUR_Y_TRAIN[i:i+batch_size]

        # Clear gradients left over from the previous step
        optimizer.zero_grad()

        # Forward pass
        outputs = model(X_batch)

        # Compute loss
        loss = criterion(outputs, y_batch.view(-1, 1))

        # Backward pass and optimization
        loss.backward()
        optimizer.step()
        batch_losses.append(loss.item())

    # Record the mean batch loss for this epoch
    losses.append(np.mean(batch_losses))
    if (epoch+1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {losses[-1]:.4f}')


In [None]:
# Plot losses list to see your training loss
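
One possible way to draw the loss curve is sketched below. The `example_losses` values are made up; in the notebook you would pass the `losses` list filled in by the training loop.

```python
import matplotlib.pyplot as plt

# Hypothetical per-epoch losses; use the `losses` list from training instead
example_losses = [9.0, 5.5, 3.2, 2.1, 1.6, 1.4]

fig, ax = plt.subplots()
ax.plot(range(1, len(example_losses) + 1), example_losses)
ax.set_xlabel("Epoch")
ax.set_ylabel("Mean training loss (MSE)")
ax.set_title("Training loss per epoch")
```

In a notebook the figure renders when the cell finishes. A curve that falls and then flattens suggests training has mostly converged.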


In [None]:
# Test the model
model.eval() # sets model in eval mode
losses_test = []
with torch.no_grad():
    test_outputs = model(YOUR_X_TEST)
    test_loss = criterion(test_outputs, YOUR_Y_TEST.view(-1, 1))
    losses_test.append(test_loss.item())
    print(f'Test Loss: {test_loss.item():.4f}')


In [None]:
# Plot the curve learned by the MLP
line = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)
line_tensor = torch.tensor(line, dtype=torch.float32)
plt.scatter(YOUR_X_TRAIN, YOUR_Y_TRAIN, color='blue', label='Training Data')
plt.scatter(YOUR_X_TEST, YOUR_Y_TEST, color='orange', label='Testing Data')
plt.plot(line, model(line_tensor).detach().numpy(), color='red', label='MLP Prediction')
plt.xlabel('Engine Size')
plt.ylabel('Price')
plt.title('MLP Regression: Engine Size vs Price')
plt.legend()
plt.show()

In [None]:
# Get R-squared value
r_squared = r2_score(YOUR_Y_TEST.numpy(), test_outputs.numpy())
print(f'R-squared: {r_squared:.4f}')
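
To make the R-squared value easier to interpret, here is a minimal sketch that computes it by hand on made-up numbers and compares the result with `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy targets and predictions (made-up numbers)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.5, 9.5])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```

A value of 1 is a perfect fit, 0 means the model does no better than always predicting the mean, and negative values mean it does worse than that.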

#### __Check in:__ How did your MLP do? Why might it have performed better or worse than your previous linear regression model?

#### __Optional:__ Try to train an MLP on multiple variables. Does it perform better or worse than your previous model from MC2?
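
As a starting point for the optional exercise, here is a minimal sketch of the same network with a wider input layer. `n_features = 3`, the name `multi_model`, and the random stand-in data are hypothetical. The standardization step is an optional suggestion and is not something the notebook above does.

```python
import torch
import torch.nn as nn

n_features = 3  # hypothetical: number of numeric columns you select

# Same layer sizes as before; only the input width changes
multi_model = nn.Sequential(
    nn.Linear(n_features, 128),
    nn.ReLU(),
    nn.Linear(128, 64),
    nn.Linear(64, 1),
)

# Random stand-in for a batch of 4 rows with n_features columns
X_demo = torch.randn(4, n_features)

# Optional: standardize each column (zero mean, unit variance)
# so that no single feature dominates because of its scale
X_demo = (X_demo - X_demo.mean(dim=0)) / X_demo.std(dim=0)

print(multi_model(X_demo).shape)
```

With real data, compute the mean and standard deviation on the training set only and reuse them for the test set.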