## Regression with a Wild Blueberry Yield Dataset

**Dataset Description:**

- **id:** Unique identifier for each record in the dataset.
- **clonesize:** Average size of the blueberry clones in the field.
- **honeybee:** Honeybee density in the field.
- **bumbles:** Bumblebee density in the field.
- **andrena:** Andrena bee density in the field.
- **osmia:** Osmia bee density in the field.
- **MaxOfUpperTRange:** Highest value of the upper band of daily air temperature during the bloom season.
- **MinOfUpperTRange:** Lowest value of the upper band of daily air temperature.
- **AverageOfUpperTRange:** Average of the upper band of daily air temperature.
- **MaxOfLowerTRange:** Highest value of the lower band of daily air temperature.
- **MinOfLowerTRange:** Lowest value of the lower band of daily air temperature.
- **AverageOfLowerTRange:** Average of the lower band of daily air temperature.
- **RainingDays:** Total number of days with rain during the bloom season.
- **AverageRainingDays:** Average number of rainy days over the bloom season.
- **fruitset:** Fruit set, the proportion of flowers that develop into fruit.
- **fruitmass:** Mass of the fruit.
- **seeds:** Number of seeds in the fruit.
- **yield:** Blueberry yield; this is the regression target.

## Evaluation
- Submissions will be evaluated using **Mean Absolute Error (MAE)**
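
MAE is the average absolute difference between the true and predicted yields. As a minimal sketch of what the metric computes, the snippet below implements it in plain Python; the function name `mean_absolute_error_manual` and the numbers are made up for illustration and are not part of the competition data.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Return the mean of |y_true - y_pred| over all records."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical yields and predictions, for illustration only.
y_true = [4500.0, 5500.0, 7000.0]
y_pred = [4400.0, 5800.0, 6900.0]

mae = mean_absolute_error_manual(y_true, y_pred)
print(mae)
```

Because every error counts linearly, a single large miss affects MAE less than it would affect a squared-error metric such as RMSE.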

In [4]:
import pandas as pd
import numpy as np

import optuna

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

from catboost import CatBoostRegressor

import matplotlib.pyplot as plt
import seaborn as sns

In [5]:
df = pd.read_csv("../data/Regression with a Wild Blueberry Yield Dataset/train.csv")

In [6]:
df.head()

| | id | clonesize | honeybee | bumbles | andrena | osmia | MaxOfUpperTRange | MinOfUpperTRange | AverageOfUpperTRange | MaxOfLowerTRange | MinOfLowerTRange | AverageOfLowerTRange | RainingDays | AverageRainingDays | fruitset | fruitmass | seeds | yield |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 25.0 | 0.5 | 0.25 | 0.75 | 0.5 | 69.7 | 42.1 | 58.2 | 50.2 | 24.3 | 41.2 | 24.0 | 0.39 | 0.425011 | 0.417545 | 32.460887 | 4476.81146 |
| 1 | 1 | 25.0 | 0.5 | 0.25 | 0.5 | 0.5 | 69.7 | 42.1 | 58.2 | 50.2 | 24.3 | 41.2 | 24.0 | 0.39 | 0.444908 | 0.422051 | 33.858317 | 5548.12201 |
| 2 | 2 | 12.5 | 0.25 | 0.25 | 0.63 | 0.63 | 86.0 | 52.0 | 71.9 | 62.0 | 30.0 | 50.8 | 24.0 | 0.39 | 0.552927 | 0.470853 | 38.341781 | 6869.7776 |
| 3 | 3 | 12.5 | 0.25 | 0.25 | 0.63 | 0.5 | 77.4 | 46.8 | 64.7 | 55.8 | 27.0 | 45.8 | 24.0 | 0.39 | 0.565976 | 0.478137 | 39.467561 | 6880.7759 |
| 4 | 4 | 25.0 | 0.5 | 0.25 | 0.63 | 0.63 | 77.4 | 46.8 | 64.7 | 55.8 | 27.0 | 45.8 | 24.0 | 0.39 | 0.579677 | 0.494165 | 40.484512 | 7479.93417 |


No missing data and no `object`-dtype columns, so there is nothing to impute or encode.
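
The notebook does not show how this was checked. One way to verify it is sketched below, assuming pandas; the helper name `summarize_quality` is hypothetical, and the small frame only reuses a few columns from the first two rows of the `df.head()` output so the snippet runs on its own. On the real data, the same helper would be called on `df`.

```python
import pandas as pd

# A few columns from the first two rows printed by df.head(),
# used here only so the example is self-contained.
sample = pd.DataFrame(
    {
        "clonesize": [25.0, 25.0],
        "honeybee": [0.5, 0.5],
        "fruitset": [0.425011, 0.444908],
        "yield": [4476.81146, 5548.12201],
    }
)


def summarize_quality(frame):
    """Return (total number of missing cells, list of object-dtype columns)."""
    missing_total = int(frame.isna().sum().sum())
    object_columns = frame.select_dtypes(include="object").columns.tolist()
    return missing_total, object_columns


missing_total, object_columns = summarize_quality(sample)
print(missing_total, object_columns)
```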

In [7]:
df.drop("id", axis=1, inplace=True)

In [30]:
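def objective(trial):
    # Separate the features from the target column.
    X = df.drop("yield", axis=1)
    y = df["yield"]

    # No random_state is set, so each trial is scored on a different random 70/30 split.
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3)

    # Search space for the CatBoost hyperparameters.
    params = {
        "iterations": trial.suggest_int("iterations", 700, 1000),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 1),
        "depth": trial.suggest_int("depth", 4, 12),
        "l2_leaf_reg": trial.suggest_int("l2_leaf_reg", 10, 15),
        "rsm": trial.suggest_float("rsm", 0, 1)
    }

    # early_stopping_rounds only takes effect when an eval_set is supplied to fit().
    regr = CatBoostRegressor(**params, early_stopping_rounds=200, verbose=False)
    regr.fit(X_train, y_train, eval_set=(X_valid, y_valid))
    y_preds = regr.predict(X_valid)

    # Optuna minimizes the validation MAE returned here.
    return mean_absolute_error(y_valid, y_preds)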
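
Because `train_test_split` is called inside `objective` without a `random_state`, every trial is scored on a different split, so part of the difference between trials is split noise rather than the effect of the hyperparameters. The sketch below, which assumes scikit-learn and NumPy and uses a toy array instead of the blueberry data, shows that fixing the seed makes the split repeatable. The names `X_demo`, `y_demo`, `SEED` and `split_once` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the real features and target.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

SEED = 42  # arbitrary value; any fixed integer gives a repeatable split


def split_once(seed):
    """Return X_train, X_valid, y_train, y_valid for a seeded 70/30 split."""
    return train_test_split(X_demo, y_demo, test_size=0.3, random_state=seed)


first = split_once(SEED)
second = split_once(SEED)

# With the same seed, both calls select exactly the same validation rows.
same_validation_rows = np.array_equal(first[1], second[1])
print(same_validation_rows)
```

One option is to create the split once, outside `objective`, and reuse it in every trial; another is to keep the random splits and accept noisier trial scores.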

In [31]:
study = optuna.create_study(direction="minimize")

[I 2024-06-19 14:59:54,157] A new study created in memory with name: no-name-b35bca50-a19b-422e-aa5a-34fa0f654c80


In [32]:
study.optimize(objective, n_trials=300, n_jobs=-1, show_progress_bar=True);

[I 2024-06-19 14:59:58,391] Trial 3 finished with value: 368.8831678037653 and parameters: {'iterations': 783, 'learning_rate': 0.3136721298414359, 'depth': 5, 'l2_leaf_reg': 15, 'rsm': 0.3762945191295267}. Best is trial 3 with value: 368.8831678037653.
[I 2024-06-19 15:00:00,494] Trial 0 finished with value: 416.7098969043111 and parameters: {'iterations': 889, 'learning_rate': 0.6248204807653425, 'depth': 9, 'l2_leaf_reg': 13, 'rsm': 0.15739883343120786}. Best is trial 3 with value: 368.8831678037653.
[I 2024-06-19 15:00:05,634] Trial 4 finished with value: 412.98062811788367 and parameters: {'iterations': 727, 'learning_rate': 0.683402117073987, 'depth': 7, 'l2_leaf_reg': 12, 'rsm': 0.8431603805091175}. Best is trial 3 with value: 368.8831678037653.
[I 2024-06-19 15:00:08,809] Trial 5 finished with value: 418.12772151091076 and parameters: {'iterations': 849, 'learning_rate': 0.8245992142260558, 'depth': 7, 'l2_leaf_reg': 11, 'rsm': 0.9164458767135771}. Best is trial 3 with value: 3

In [37]:
study.best_params

{'iterations': 884,
 'learning_rate': 0.04203896963266236,
 'depth': 6,
 'l2_leaf_reg': 14,
 'rsm': 0.707820607290803}

In [34]:
X = df.drop("yield", axis=1)
y = df["yield"]

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3)

In [38]:
regr = CatBoostRegressor(iterations=884, learning_rate=0.04203896963266236, depth=6, l2_leaf_reg=14, rsm=0.707820607290803)

In [39]:
regr.fit(X_train, y_train, plot=True, verbose=False, eval_set=(X_valid, y_valid));

In [40]:
y_preds = regr.predict(X_valid)

In [41]:
mean_absolute_error(y_valid, y_preds)

362.2823994000898
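
To put an MAE of about 362 in context, it helps to compare it with a naive baseline. Among constant predictions, the median of the target minimizes MAE, so a median baseline is a natural reference point. The sketch below only illustrates the computation: it uses the five `yield` values visible in the `df.head()` output, and the helper name `constant_baseline_mae` is hypothetical. For a meaningful comparison, the median would be taken from `y_train` and scored against `y_valid`.

```python
from statistics import median

# The five yield values shown in the df.head() output.
yields = [4476.81146, 5548.12201, 6869.7776, 6880.7759, 7479.93417]


def constant_baseline_mae(values):
    """MAE obtained by predicting the median of `values` for every record."""
    center = median(values)
    return sum(abs(v - center) for v in values) / len(values)


baseline_mae = constant_baseline_mae(yields)
print(f"Median-baseline MAE on the sample rows: {baseline_mae:.2f}")
```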