<div class="alert alert-block alert-success">
    <b>ARTIFICIAL INTELLIGENCE (E016350A)</b> <br>
ALEKSANDRA PIZURICA <br>
GHENT UNIVERSITY <br>
AY 2024/2025 <br>
Assistant: Nicolas Vercheval
</div>

# Basic regression: Predict fuel efficiency

*Regression* and *classification* are both supervised learning tasks: the algorithms learn from labelled datasets and are used for prediction. The main difference is the type of the target. Regression algorithms predict continuous values such as price, salary, or age, so regression amounts to modelling the relationship between the dependent variable and the independent variables. Classification algorithms, on the other hand, predict discrete values such as Male or Female, True or False, Spam or Not Spam.
This notebook uses the classic [Auto MPG Dataset](https://archive.ics.uci.edu/ml/datasets/auto+mpg) and builds a model to predict the fuel efficiency of late-1970s and early-1980s automobiles.

In [3]:
import pathlib

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import seaborn as sns

In [4]:
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
print(tf.__version__)

2.16.2



## The Auto MPG dataset
The dataset is available from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/).


### Getting the data

First download the dataset using the `requests` module.

In [None]:
import requests
from io import StringIO
# We download the data using the requests module
request = requests.get("http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data")
# We wrap the downloaded text in a file-like object with StringIO
if request.status_code == 200:  # downloaded without errors
    file_str_io = StringIO(request.text)
else:
    print("Download the file manually and replace file_str_io with its path")

We will use the `pandas` library to work with the tabular data.

In [None]:
column_names = ['MPG','Cylinders','Displacement','Horsepower','Weight',
                'Acceleration', 'Model Year', 'Origin']
raw_dataset = pd.read_csv(file_str_io, names=column_names,
                      na_values = "?", comment='\t',
                      sep=" ", skipinitialspace=True)

dataset = raw_dataset.copy()
dataset.tail()

### Clean the data

The dataset contains a few unknown values.

In [None]:
dataset.isna().sum()

There are different ways to handle missing values, such as replacing them with the column average or with another substitute value. In our case, we simply drop the rows that contain missing values.

In [None]:
dataset = dataset.dropna()
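
As a rough illustration of the two options mentioned above, the sketch below compares dropping rows with mean imputation. The table `toy` and its values are made up for the example and are not taken from the Auto MPG data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with one missing Horsepower value
toy = pd.DataFrame({
    "MPG": [18.0, 25.0, 30.0, 22.0],
    "Horsepower": [130.0, np.nan, 70.0, 100.0],
})

# Option 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Option 2: replace the missing value with the column mean
imputed = toy.fillna({"Horsepower": toy["Horsepower"].mean()})

print(len(dropped))                  # 3 rows remain
print(imputed.loc[1, "Horsepower"])  # 100.0, the mean of 130, 70 and 100
```

Dropping is the simpler choice when only a few rows are affected, as is the case here.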

### Encoding categorical variables

Column `Origin` is a categorical variable rather than a numerical one: its integer codes identify the region the car comes from:
1. (USA)
2. (Europe)
3. (Japan)

In [None]:
# List unique values in a column (in 2 different ways)
print(dataset.Origin.unique())
print(dataset["Origin"].unique())

**TIP:** This is an extremely useful transformation in practice. We want to replace the column elements with other values depending on the mapping we pass.

In [None]:
dataset['Origin'] = dataset['Origin'].map({1: 'USA', 2: 'Europe', 3: 'Japan'})
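
One detail worth keeping in mind with `Series.map` and a dictionary: values that have no entry in the mapping become `NaN`. A small sketch with a made-up series (the code `4` is hypothetical and does not occur in the dataset):

```python
import pandas as pd

# Hypothetical codes; 4 is deliberately missing from the mapping
codes = pd.Series([1, 3, 2, 4])
names = codes.map({1: "USA", 2: "Europe", 3: "Japan"})

print(names.tolist()[:3])     # ['USA', 'Japan', 'Europe']
print(names.isna().tolist())  # [False, False, False, True]
```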

Next, we encode the categorical variable as a set of binary variables (also known as dummy variables). With one-hot encoding, a variable with $N$ categories is represented by $N$ binary variables, exactly one of which equals 1 in each row. We convert the `Origin` variable to a one-hot vector with `pd.get_dummies`.

In [None]:
dataset = pd.get_dummies(dataset, prefix='', prefix_sep='', dtype=int)
dataset.tail()
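
To make the effect of `pd.get_dummies` concrete, here is a minimal sketch on a made-up three-row table (the `Weight` values are invented for illustration):

```python
import pandas as pd

# Hypothetical toy table with one categorical column
toy = pd.DataFrame({
    "Weight": [3500, 2200, 2100],
    "Origin": ["USA", "Europe", "Japan"],
})

# Only the non-numeric column is expanded; numeric columns are kept as they are
encoded = pd.get_dummies(toy, prefix="", prefix_sep="", dtype=int)

print(encoded.columns.tolist())  # ['Weight', 'Europe', 'Japan', 'USA']
print(encoded.loc[0].tolist())   # [3500, 0, 0, 1]
```

Each row has exactly one `1` among the three new columns, which is what makes the encoding "one-hot".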

### Split the data into training and test sets

Now split the dataset into a training set and a test set.

Use the test set in the final evaluation of your models.

In [None]:
# We choose 80% of the data as the training data
train_dataset = dataset.sample(frac=0.8,random_state=0)

# The remaining 20% is used as the testing data
test_dataset = dataset.drop(train_dataset.index)
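
The `sample`/`drop` pattern above relies on the row index to keep the two sets disjoint. A small sanity check on a made-up ten-row table (not the real dataset) could look like this:

```python
import pandas as pd

# Hypothetical table with a default integer index 0..9
toy = pd.DataFrame({"x": range(10)})

train = toy.sample(frac=0.8, random_state=0)
test = toy.drop(train.index)  # everything that was not sampled

print(len(train), len(test))  # 8 2
# No row index appears in both sets
print(set(train.index) & set(test.index))  # set()
```

Note that this only works as intended when the index values are unique.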

### Inspect the data

Have a quick look at the joint distribution of a few pairs of columns from the training set.

Looking at the top row, it should be clear that the fuel efficiency (MPG) depends on each of the other plotted parameters. Looking at the other rows, it should be clear that these parameters are also strongly related to each other.

In [None]:
sns.pairplot(train_dataset[["MPG", "Cylinders", "Displacement", "Weight"]], diag_kind="kde")
plt.show()

Also look at the overall statistics and note how each feature covers a different range. We omit `MPG` as this is the target variable.

In [None]:
train_stats = train_dataset.astype(float).describe()
train_stats.pop("MPG")
train_stats = train_stats.transpose()
train_stats

### Extract the target variable

Separate the target value, the "label", from the features. This label is the value that you will train the model to predict.

So, we extract the target variable `MPG` from the data ([miles per gallon](https://www.carwow.co.uk/guides/running/what-is-mpg-0255)).

In [None]:
train_labels = train_dataset.pop("MPG")
test_labels = test_dataset.pop("MPG")

### Data standardization/normalization

It is good practice to normalize features that use different scales and ranges.

One reason this is important is that the features are multiplied by the model weights, so the scale of the outputs and the scale of the gradients are affected by the scale of the inputs.

Although a model might converge without feature normalization, normalization makes training much more stable.

In [None]:
train_stats

In [None]:
def norm(x):
  return (x.astype(float) - train_stats['mean']) / train_stats['std']
normed_train_data = norm(train_dataset)
normed_test_data = norm(test_dataset)

**IMPORTANT:** Note that we use the mean and standard deviation of the training set when we standardize both the training **and** the test data. This is important because we do not want to use any information from the test set when training the model: doing so introduces a bias and makes the test evaluation overly optimistic. Check the notebooks from the previous Lab session.
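
A minimal sketch of this rule on made-up numbers: the statistics are computed on the training table only and then reused for the test table. The `Weight` values below are hypothetical.

```python
import pandas as pd

# Hypothetical training and test tables with a single feature
train = pd.DataFrame({"Weight": [2000.0, 3000.0, 4000.0]})
test = pd.DataFrame({"Weight": [3500.0]})

# Statistics come from the training data only
mean = train.mean()
std = train.std()  # sample standard deviation (ddof=1), as reported by describe()

normed_train = (train - mean) / std
normed_test = (test - mean) / std  # the test data reuses the training statistics

print(normed_train["Weight"].tolist())  # [-1.0, 0.0, 1.0]
print(normed_test["Weight"].tolist())   # [0.5]
```

The standardized test values therefore do not need to have zero mean or unit variance themselves.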

## Regression model

### Defining the model

It is time to define our model. We will use the `Sequential` model, which represents a feed-forward neural network as a stack of layers. The output layer of this network has a single neuron that predicts the attribute `MPG`.

We select the mean squared error as the loss function.

Apart from the `Adam` optimizer, there are many others. For illustration, the `RMSprop` optimizer will be used here.

In [None]:
def build_model():
  model = keras.Sequential([
    layers.Input(shape=(len(train_dataset.keys()), )),
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(1)
  ])

  optimizer = tf.keras.optimizers.RMSprop(0.001)

  model.compile(loss='mse',
                optimizer=optimizer,
                metrics=['mae', 'mse'])
  return model

In [None]:
model = build_model()
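
The model above is compiled with the mean squared error as the loss and reports the mean absolute error as an additional metric. As a quick reminder of what these quantities measure, here is a sketch on three made-up predictions (the numbers are hypothetical):

```python
import numpy as np

# Hypothetical true MPG values and predictions
y_true = np.array([20.0, 30.0, 25.0])
y_pred = np.array([22.0, 27.0, 25.0])

errors = y_pred - y_true       # [ 2., -3.,  0.]
mse = np.mean(errors ** 2)     # (4 + 9 + 0) / 3
mae = np.mean(np.abs(errors))  # (2 + 3 + 0) / 3
rmse = np.sqrt(mse)            # same unit as the target (MPG)

print(round(mse, 3), round(mae, 3), round(rmse, 3))  # 4.333 1.667 2.082
```

The MSE penalizes large errors more strongly than the MAE, while the MAE and the RMSE are expressed in the same unit as the target.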

### Model summary

Using the `summary` method, we can look at an overview of the defined model.

In [None]:
model.summary()
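
The parameter counts in the summary can be checked by hand: a dense layer with $n$ inputs and $m$ units has $n \cdot m$ weights and $m$ biases. The sketch below assumes 9 input features (six numerical columns plus the three one-hot `Origin` columns); `dense_params` is a helper name introduced only for this example.

```python
def dense_params(n_inputs, n_units):
    """Number of trainable parameters of a dense layer: weights plus biases."""
    return n_inputs * n_units + n_units

n_features = 9             # assumed number of input columns after encoding
layer_sizes = [64, 64, 1]  # units of the three Dense layers

total = 0
n_inputs = n_features
for units in layer_sizes:
    count = dense_params(n_inputs, units)
    print(units, count)    # 64 640, then 64 4160, then 1 65
    total += count
    n_inputs = units       # the output of one layer feeds the next

print(total)               # 4865
```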

Before training, we can check that the model runs. We take the first $10$ examples from the training set and pass them through the network.

In [None]:
example_batch = normed_train_data[:10]
example_result = model.predict(example_batch)
example_result

In [None]:
example_result.shape

We get $10$ values as predictions.

### Training the model

We divide the training set into two new sets. With the first (80% of the original training set), the model optimizes its parameters through backpropagation; with the second, the validation set, we can evaluate its performance and tune the hyperparameters.

We will train the model for 200 epochs and record the training and validation metrics (MAE and MSE) during training. The `fit` method returns a `History` object that contains this data.

In [None]:
EPOCHS = 200

history = model.fit(
  normed_train_data, train_labels,
  epochs=EPOCHS, validation_split = 0.2, verbose=1)
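
According to the Keras documentation, `validation_split` takes the validation samples from the end of the provided arrays, before any shuffling. The sketch below mimics that slicing on a made-up list of ten sample indices; the exact rounding used for the split position is an assumption.

```python
import math

samples = list(range(10))  # hypothetical sample indices 0..9
validation_split = 0.2

# Assumed rule: the first (1 - validation_split) share is used for training
split_at = int(math.floor(len(samples) * (1.0 - validation_split)))

train_part = samples[:split_at]
val_part = samples[split_at:]  # the last 20% are held out for validation

print(train_part)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(val_part)    # [8, 9]
```

Because the rows here were already shuffled by `dataset.sample`, taking the last rows as the validation set is not a problem.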

In [None]:
plt.plot(history.epoch, history.history['mse'])
plt.plot(history.epoch, history.history['val_mse'])
plt.legend(['Training MSE', 'Validation MSE'])
plt.show()

The obtained data can be visualized using the `pandas` library.
`DataFrame` is a `pandas` type that represents tabular data.

In [None]:
hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch

# We take the last 5 rows
hist.tail()

We can see that 200 epochs are too many: letting the training run that long does not improve the model's validation error. We will repeat the optimization process, but this time we will use a technique called *early stopping*.

The idea is to define a stopping condition and halt the training once it is fulfilled. For example, if the value of `val_mse` does not improve for `k` consecutive epochs, it makes sense to stop training.

How do we determine the `k` parameter? It is a hyperparameter, just like the network architecture, the optimizer, etc.

More about `EarlyStopping` can be found [here](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/EarlyStopping).

In [None]:
model = build_model()

# Parameter `patience` is the number of consecutive epochs without improvement after which training stops.
# Parameter `monitor` is the quantity tracked from epoch to epoch (here the validation loss, which is the MSE).
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=10)

early_history = model.fit(normed_train_data, train_labels, 
                    epochs=EPOCHS, validation_split=0.2, verbose=1, 
                    callbacks=[early_stop])
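
The stopping rule can be sketched in plain Python. This is a simplified illustration of the patience idea, not the actual Keras implementation; the function name `early_stopping_epoch` and the loss values are made up for the example.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None  # the stopping condition was never met

# Hypothetical validation losses per epoch
val_losses = [10.0, 8.0, 7.0, 7.5, 7.2, 7.1, 6.9]

print(early_stopping_epoch(val_losses, patience=3))  # 6
print(early_stopping_epoch(val_losses, patience=5))  # None
```

With `patience=3` the training stops at epoch 6 and never reaches the better value at epoch 7, which shows why a patience that is too small can end training prematurely.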

In [None]:
plt.plot(early_history.epoch, early_history.history['mae'])
plt.plot(early_history.epoch, early_history.history['val_mae'])
plt.ylim([0, 10])
plt.legend(['Training MAE', 'Validation MAE'])
plt.xlabel('Epoch')
plt.ylabel('MAE [MPG]')
plt.show()

The graph shows that, on the validation set, the mean absolute error is about $2$ MPG. Whether this is good enough depends on the problem and the intended use case.

### Model evaluation

Next, we will look at how well the model generalizes on the test data.

So far, we have trained the model on a subset (80%) of the training set in order to have a validation set. There is no point in throwing away data (the 20% of the training data that ended up as validation data), so we will retrain the model on the entire training set before the test evaluation.

How many epochs should we train it for? One approach is to reuse the number of epochs at which early stopping halted the training. Here we use 90.

In [None]:
model = build_model()

early_stop_epochs = 90

final_history = model.fit(normed_train_data, train_labels, 
                    epochs=early_stop_epochs, verbose=1)
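
One possible way to pick this number from a recorded history, rather than reading it off the plot, is to take the epoch with the lowest validation loss. The list below is made up for illustration; in this notebook the values would come from `early_history.history['val_loss']`.

```python
# Hypothetical validation losses, one value per epoch
val_loss = [9.1, 7.4, 6.8, 6.5, 6.6, 6.9, 7.0]

# Index of the smallest value, converted to a 1-based epoch count
best_index = min(range(len(val_loss)), key=val_loss.__getitem__)
best_epoch = best_index + 1

print(best_epoch)  # 4
```

This is only a heuristic: when retraining on more data, the best number of epochs may shift somewhat.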

Finally, we can look at how our model behaves on the test set.

In [None]:
loss, mae, mse = model.evaluate(normed_test_data, test_labels, verbose=2)
rmse_test = np.sqrt(mse)

print("Testing set Mean Abs Error: {:5.2f} MPG".format(mae))
print("Testing set Root of the Mean Squared Error: {:5.2f} MPG".format(rmse_test))

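We expect the test error to be comparable to the validation error, and possibly lower, because the model has now been trained on more data. This is not guaranteed, since the test set is small and the result varies from run to run.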

### Predicting values in the future

Is it over? Yes and no, it depends on different things. If we are satisfied with this model and want to move it into production and use it, there is no point in throwing away the data in the test set.

It makes sense to retrain the model, now on all the data. How do we then evaluate that model? We do not evaluate it. If we conducted this process well, the error on the test set approximates the quality of the final model trained on the entire dataset.

This notebook is partially based on the official [Basic regression: Predict fuel efficiency](https://www.tensorflow.org/tutorials/keras/regression) TensorFlow tutorial.