## Regression with a Wild Blueberry Yield Dataset

**Dataset Description:**

- **id:** Unique identifier for each record in the dataset.
- **clonesize:** Size of the bee colony or hive.
- **honeybee:** Number of honeybees in the colony.
- **bumbles:** Count of bumblebees.
- **andrena:** Population of Andrena bees.
- **osmia:** Presence or count of Osmia bees.
- **MaxOfUpperTRange:** Maximum upper temperature range observed.
- **MinOfUpperTRange:** Minimum upper temperature range recorded.
- **AverageOfUpperTRange:** Average upper temperature range.
- **MaxOfLowerTRange:** Maximum lower temperature range observed.
- **MinOfLowerTRange:** Minimum lower temperature range recorded.
- **AverageOfLowerTRange:** Average lower temperature range.
- **RainingDays:** Number of days with rain.
- **AverageRainingDays:** Average number of rainy days.
- **fruitset:** Fruit set or the process of fruit development.
- **fruitmass:** Mass or weight of the fruit.
- **seeds:** Number of seeds in the fruit.
- **yield:** Overall yield or production.

## Evaluation
- Submissions will be evaluated using **Mean Absolute Error (MAE)**

In [12]:
import pandas as pd
import numpy as np

import optuna
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

from catboost import CatBoostRegressor

import optuna

import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv("../data/Regression with a Wild Blueberry Yield Dataset/train.csv")

In [3]:
df.head()

Unnamed: 0,id,clonesize,honeybee,bumbles,andrena,osmia,MaxOfUpperTRange,MinOfUpperTRange,AverageOfUpperTRange,MaxOfLowerTRange,MinOfLowerTRange,AverageOfLowerTRange,RainingDays,AverageRainingDays,fruitset,fruitmass,seeds,yield
0,0,25.0,0.5,0.25,0.75,0.5,69.7,42.1,58.2,50.2,24.3,41.2,24.0,0.39,0.425011,0.417545,32.460887,4476.81146
1,1,25.0,0.5,0.25,0.5,0.5,69.7,42.1,58.2,50.2,24.3,41.2,24.0,0.39,0.444908,0.422051,33.858317,5548.12201
2,2,12.5,0.25,0.25,0.63,0.63,86.0,52.0,71.9,62.0,30.0,50.8,24.0,0.39,0.552927,0.470853,38.341781,6869.7776
3,3,12.5,0.25,0.25,0.63,0.5,77.4,46.8,64.7,55.8,27.0,45.8,24.0,0.39,0.565976,0.478137,39.467561,6880.7759
4,4,25.0,0.5,0.25,0.63,0.63,77.4,46.8,64.7,55.8,27.0,45.8,24.0,0.39,0.579677,0.494165,40.484512,7479.93417


No missing data or data that is a object. Lame.

In [4]:
df.drop("id", axis=1, inplace=True)

In [None]:
CatBoostRegressor()

In [50]:
def objective(trial):
    X = df.drop("yield", axis=1)
    y = df["yield"]
    
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3)
    params = {
        "iterations": trial.suggest_int("iterations", 700, 1000),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 1),
        "depth": trial.suggest_int("depth", 4, 12),
        "l2_leaf_reg": trial.suggest_int("l2_leaf_reg", 10, 15)
    }

    regr = CatBoostRegressor(**params, early_stopping_rounds=200, verbose=False)
    regr.fit(X_train, y_train, verbose=False)
    y_preds = regr.predict(X_valid)

    return mean_absolute_error(y_valid, y_preds)

In [51]:
study = optuna.create_study(direction="minimize")

[I 2024-06-19 12:10:06,415] A new study created in memory with name: no-name-fc17eacc-449d-456f-94b6-deb245dba8d6


In [52]:
study.optimize(objective, n_trials=300, n_jobs=-1, show_progress_bar=True);

  0%|          | 0/300 [00:00<?, ?it/s]

[I 2024-06-19 12:10:12,213] Trial 3 finished with value: 380.1648293991629 and parameters: {'iterations': 705, 'learning_rate': 0.5395352416324891, 'depth': 4, 'l2_leaf_reg': 13}. Best is trial 3 with value: 380.1648293991629.
[I 2024-06-19 12:10:16,228] Trial 4 finished with value: 369.91900528565986 and parameters: {'iterations': 880, 'learning_rate': 0.36315336642019225, 'depth': 5, 'l2_leaf_reg': 10}. Best is trial 4 with value: 369.91900528565986.
[I 2024-06-19 12:10:17,915] Trial 2 finished with value: 368.22898155546505 and parameters: {'iterations': 743, 'learning_rate': 0.21948047714640734, 'depth': 8, 'l2_leaf_reg': 11}. Best is trial 2 with value: 368.22898155546505.
[I 2024-06-19 12:10:20,415] Trial 5 finished with value: 368.2735056521715 and parameters: {'iterations': 800, 'learning_rate': 0.18458492953770272, 'depth': 6, 'l2_leaf_reg': 15}. Best is trial 2 with value: 368.22898155546505.
[I 2024-06-19 12:10:20,555] Trial 0 finished with value: 422.96672975434814 and para

In [54]:
study.best_params

{'iterations': 737,
 'learning_rate': 0.056653525988900576,
 'depth': 6,
 'l2_leaf_reg': 12}

In [27]:
X = df.drop("yield", axis=1)
y = df["yield"]

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3)

In [46]:
regr = CatBoostRegressor(iterations=786, learning_rate=0.04240659370348328, depth=7, l2_leaf_reg=12)

In [47]:
regr.fit(X_train, y_train, plot=True, verbose=False, eval_set=(X_valid, y_valid));

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

In [48]:
y_preds = regr.predict(X_valid)

In [49]:
mean_absolute_error(y_valid, y_preds)

357.6225485116327