# Regression Notebook: Predicting Grape Sugar Content Using the Living Optics Export Reader
This example trains and compares several machine learning regression models on .lo data.

## Steps
- Read the exported group from the data analysis tool
- Train regressors on features and labels split into training and test sets
- Compare model performance by visualising the results

## 1. Import libraries and Setup
We import the libraries needed for data handling, modeling, evaluation, and visualization.

In [None]:
# Basic tools
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List

# Machine learning models and evaluation
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error,
    mean_absolute_percentage_error, r2_score
)
from sklearn.cross_decomposition import PLSRegression
from sklearn.svm import SVR

# Dataset loader
from lo_dataset_reader import DatasetReader

## 2. Load Dataset
Load the grape dataset, which contains annotations, spectral data, and metadata such as sugar content and position.

In [None]:
# Load dataset using the custom DatasetReader class
path = "/path/to/Grapes-Dataset.zip"
reader = DatasetReader(dataset_path=path)

## 3. Convert Categorical Position to Numerical Features
The grape position is given as a string. This function converts it into a list of three numerical values: tray number, row index, and column index.

In [None]:
# Converting position to features (e.g. 'tray16-a1' -> [16, 0, 1])
def position_to_numeric(pos: str) -> List[int]:
    """
    Convert a position string (e.g. 'tray16-a1') into numeric features.
    """
    match = re.match(r"tray(\d+)-([a-z])(\d+)", pos)
    if match:
        tray, row, col = match.groups()
        return [int(tray), ord(row) - ord('a'), int(col)]
    return [0, 0, 0]
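
To make the encoding concrete, here is a minimal sketch that applies the same parsing rule to a few position strings. The strings `tray3-c12` and `unknown` are hypothetical inputs chosen for illustration; only `tray16-a1` appears in this notebook.

```python
import re
from typing import List


def position_to_numeric(pos: str) -> List[int]:
    """Same parsing rule as the cell above: 'tray<N>-<row letter><column>'."""
    match = re.match(r"tray(\d+)-([a-z])(\d+)", pos)
    if match:
        tray, row, col = match.groups()
        return [int(tray), ord(row) - ord("a"), int(col)]
    return [0, 0, 0]


# Hypothetical position strings, used only for illustration
examples = ["tray16-a1", "tray3-c12", "unknown"]
encoded = {pos: position_to_numeric(pos) for pos in examples}
for pos, features in encoded.items():
    print(pos, "->", features)
# tray16-a1 -> [16, 0, 1]
# tray3-c12 -> [3, 2, 12]
# unknown -> [0, 0, 0]
```

Note that a string that does not match the pattern falls back to `[0, 0, 0]`, so malformed positions are not flagged and can be mistaken for a genuine tray 0, row a, column 0 entry.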

## 4. Metadata Extraction: Sugar Content & Position
Loop through the dataset to extract grape positions and corresponding sugar content from the metadata.

In [None]:
positions, sugar_contents = [], []

# Extract position and sugar content for each annotated grape
for (info, scene, spectra, _), converted_spectra, annotations, *_ in reader:
    for ann in annotations:
        if ann['metadata'] and ann['category_name'] == 'grapes':
            meta = {item['field']: item['value'] for item in ann['metadata']}
            positions.append(meta['position'])
            sugar_contents.append(float(meta['sugar-content']))
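
The dictionary comprehension in the loop above flattens a list of field/value records into a single lookup. The sketch below shows that step in isolation on a hand-written annotation. The annotation and its values are hypothetical and only mirror the keys the loop reads (`category_name`, `metadata`, `field`, `value`); the real reader may return additional keys.

```python
# Hypothetical annotation, shaped like the entries the loop above reads
annotation = {
    "category_name": "grapes",
    "metadata": [
        {"field": "position", "value": "tray16-a1"},
        {"field": "sugar-content", "value": "18.5"},
    ],
}

# Flatten the list of field/value records into a single lookup dict
meta = {item["field"]: item["value"] for item in annotation["metadata"]}

position = meta["position"]
sugar = float(meta["sugar-content"])  # cast to float, as in the loop above
print(position, sugar)
```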

## 5. Train Regression Models and Evaluate
Train six different regression models and evaluate their performance using metrics like R2 score, MAE, MAPE, MSE, and RMSE.

In [None]:
# Prepare features and target
X = np.array([position_to_numeric(pos) for pos in positions])
y = np.array(sugar_contents)

# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define multiple regression models
models = [
    ("Linear Regression", LinearRegression()),
    ("Lasso", Lasso()),
    ("Ridge", Ridge()),
    ("Random Forest", RandomForestRegressor()),
    ("PLS Regression", PLSRegression()),
    ("SVR", SVR())
]

# Store evaluation metrics
metrics_df = { 'Model': [], 'R2': [], 'MAE': [], 'MAPE': [], 'MSE': [], 'RMSE': [] }
residuals_list = []

# Fit and evaluate each model
for name, model in models:
    model.fit(X_train, y_train)
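    # Flatten predictions: PLSRegression may return shape (n, 1), which would
    # broadcast against y_test and corrupt the residuals
    y_pred = np.ravel(model.predict(X_test))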

    metrics_df['Model'].append(name)
    metrics_df['R2'].append(r2_score(y_test, y_pred))
    metrics_df['MAE'].append(mean_absolute_error(y_test, y_pred))
    metrics_df['MAPE'].append(mean_absolute_percentage_error(y_test, y_pred))
    mse = mean_squared_error(y_test, y_pred)
    metrics_df['MSE'].append(mse)
    metrics_df['RMSE'].append(np.sqrt(mse))
    residuals_list.append(y_test - y_pred)

metrics_df = pd.DataFrame(metrics_df)
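
For readers without access to the grape dataset, the following is a minimal sketch of the same fit-and-evaluate loop on synthetic data. The feature matrix, the linear relationship, and the noise level are all assumptions made up for illustration, and the names `X_demo`, `y_demo`, `demo_models`, and `demo_metrics` are hypothetical. It uses only two of the six models to stay short.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the (tray, row, column) features and sugar labels
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 10, size=(200, 3)).astype(float)
y_demo = 15.0 + 0.5 * X_demo[:, 0] + rng.normal(0.0, 1.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_models = [
    ("Linear Regression", LinearRegression()),
    ("Random Forest", RandomForestRegressor(random_state=0)),
]

rows = []
for name, model in demo_models:
    model.fit(X_tr, y_tr)
    pred = np.ravel(model.predict(X_te))  # keep predictions 1-D
    mse = mean_squared_error(y_te, pred)
    rows.append({"Model": name, "R2": r2_score(y_te, pred), "RMSE": np.sqrt(mse)})

# Sort so the model with the lowest test error comes first
demo_metrics = pd.DataFrame(rows).sort_values("RMSE").reset_index(drop=True)
print(demo_metrics)
```

Fixing `random_state` on the random forest makes the table reproducible between runs, which is worth considering for the real models as well.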

## 6. Visualise Model Performance
Visualise how well each model performs with comparison plots including residuals, predicted vs actual, and metric bar charts.

In [None]:
plt.figure(figsize=(20, 16))

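# Actual vs Predicted
plt.subplot(2, 2, 1)
for name, model in models:
    # Models were already fitted in step 5; reuse them so the plot matches
    # the reported metrics (refitting changes Random Forest predictions)
    y_pred = np.ravel(model.predict(X_test))
    plt.scatter(y_test, y_pred, alpha=0.6, label=name)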
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'k--')
plt.title("Predicted vs Actual")
plt.xlabel("Actual Sugar Content")
plt.ylabel("Predicted")
plt.legend()

# Residual distributions
plt.subplot(2, 2, 2)
sns.boxplot(data=residuals_list)
plt.xticks(range(len(models)), metrics_df['Model'], rotation=45)
plt.title("Residuals Distribution")

# Normalised metrics (R2, MAPE)
plt.subplot(2, 2, 3)
norm = metrics_df.melt(id_vars='Model', value_vars=['R2', 'MAPE'], var_name='Metric', value_name='Value')
sns.barplot(data=norm, x='Model', y='Value', hue='Metric')
plt.title("Normalised Metrics")

# Raw metrics (MAE, MSE, RMSE)
plt.subplot(2, 2, 4)
non_norm = metrics_df.melt(id_vars='Model', value_vars=['MAE', 'MSE', 'RMSE'], var_name='Metric', value_name='Value')
sns.barplot(data=non_norm, x='Model', y='Value', hue='Metric')
plt.title("Raw Metrics")

plt.tight_layout()
plt.show()


## Model Performance Analysis
This section evaluates how well several regression models predict sugar content, using both plots and numerical metrics.

### Plot Discussion
1. Predicted vs Actual (Top Left): This scatter plot compares the predicted values to the actual sugar content. The diagonal dashed line represents perfect predictions (i.e., predicted = actual). Points closer to this line indicate better model performance.
    - Observation: Random Forest predictions are more closely clustered around the diagonal, suggesting it has learned the relationship between features and sugar content better than other models.

2. Residuals Distribution (Top Right): Boxplots of residuals (actual minus predicted values) help visualise the error spread for each model. A good model will have residuals centered around zero with minimal spread and few outliers.
    - Observation: Random Forest shows a tighter residual distribution with fewer extreme outliers compared to the other models.

3. Normalised Metrics (Bottom Left): This bar chart compares models on two unit-free metrics, the R2 score (coefficient of determination) and MAPE (Mean Absolute Percentage Error).
    - R2 score indicates the proportion of variance explained by the model (higher is better).
    - MAPE measures the average percentage error (lower is better).
    - Observation: Random Forest has the highest R2 score value and a relatively low MAPE, confirming strong predictive performance.

4. Raw Metrics (Bottom Right): This plot shows the raw evaluation metrics for each model:
    - MAE (Mean Absolute Error): Average of absolute errors.
    - MSE (Mean Squared Error): Average of squared errors (penalizes large errors).
    - RMSE (Root Mean Squared Error): Square root of MSE, interpretable in the same units as the target variable.
    - Observation: Again, Random Forest has the lowest MAE, MSE, and RMSE values, further indicating superior accuracy.

### Metric Definitions
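- R2 score (R-squared): Measures how well the variance in the dependent variable is explained by the model. The maximum is 1 (perfect fit); 0 corresponds to always predicting the mean, and the score can be negative on test data when the model does worse than that.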
- MAE (Mean Absolute Error): The average absolute difference between predicted and actual values. It gives an idea of the average prediction error.
- MSE (Mean Squared Error): Similar to MAE but squares the errors, making larger errors more significant.
- RMSE (Root Mean Squared Error): The square root of MSE. It retains the unit of the output variable and penalizes large errors.
- MAPE (Mean Absolute Percentage Error): The average of absolute percentage errors. It provides a sense of relative error size.
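
The definitions above can be checked by computing each metric by hand and comparing with scikit-learn. This is a minimal sketch on toy values chosen for illustration only; `y_true` and `y_hat` are hypothetical and not taken from the grape dataset.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error, mean_absolute_percentage_error,
    mean_squared_error, r2_score,
)

# Toy values for illustration only (not from the grape dataset)
y_true = np.array([16.0, 18.0, 20.0, 22.0])
y_hat = np.array([15.0, 19.0, 20.0, 24.0])

err = y_true - y_hat                          # residuals: [1, -1, 0, -2]
mae = np.mean(np.abs(err))                    # 1.0
mse = np.mean(err ** 2)                       # 1.5
rmse = np.sqrt(mse)                           # same units as the target
mape = np.mean(np.abs(err) / np.abs(y_true))  # a fraction, not a percentage
ss_res = np.sum(err ** 2)                     # 6.0
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 20.0
r2 = 1.0 - ss_res / ss_tot                    # 0.7

# The hand-computed values agree with scikit-learn's implementations
assert np.isclose(mae, mean_absolute_error(y_true, y_hat))
assert np.isclose(mse, mean_squared_error(y_true, y_hat))
assert np.isclose(mape, mean_absolute_percentage_error(y_true, y_hat))
assert np.isclose(r2, r2_score(y_true, y_hat))
```

Note that scikit-learn's `mean_absolute_percentage_error` returns a fraction (0.05 means 5%), so multiply by 100 when reporting it as a percentage.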

## Summary
- Extracted grape positions and sugar content labels from the metadata of the *.lo* dataset
- Compared six regression models for predicting sugar content from physical position
- Visualised predictions, residuals, and performance metrics in detail