## Crop Yield Regression with AutoGluon


## Data Definition

### Column Definitions

- **`Year`**: The year in which the crop yield and the associated environmental conditions were recorded.
- **`Area`**: The country or geographical region where the crop was cultivated.
- **`Item`**: The type of crop (e.g., Maize, Potatoes, Rice).
- **`Yield`**: The crop yield value, typically measured in hectograms per hectare (hg/ha).
- **`Avg Rainfall`**: The average annual rainfall (in millimeters) for the region in that year.
- **`Pesticides`**: The amount of pesticide used (in kilograms per hectare).
- **`Temperature`**: The average annual temperature (in degrees Celsius) for the region in that year.
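
Because `Yield` is reported in hectograms per hectare, converting to tonnes per hectare can make individual values easier to read. The following is a minimal sketch; the helper name `hg_per_ha_to_t_per_ha` is illustrative and not part of the notebook. One hectogram is 0.1 kg, so 10,000 hg make one tonne.

```python
HG_PER_TONNE = 10_000  # 1 hg = 0.1 kg, so 1 t = 1,000 kg = 10,000 hg


def hg_per_ha_to_t_per_ha(yield_hg_per_ha: float) -> float:
    """Convert a yield from hectograms per hectare to tonnes per hectare."""
    return yield_hg_per_ha / HG_PER_TONNE


# First row of the dataset: maize in Albania, 1990, 36613 hg/ha.
maize_t_per_ha = hg_per_ha_to_t_per_ha(36613)
print(round(maize_t_per_ha, 4))
```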

In [1]:
# imports
from autogluon.tabular import TabularDataset, TabularPredictor
from sklearn.model_selection import train_test_split

In [3]:
# create dataset using TabularDataset - this allows for a connection to the TabularPredictor
# all Pandas functionality is still available

data = TabularDataset('Yield_df.csv')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28242 entries, 0 to 28241
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Unnamed: 0.1  28242 non-null  int64  
 1   Unnamed: 0    28242 non-null  int64  
 2   Area          28242 non-null  object 
 3   Item          28242 non-null  object 
 4   Year          28242 non-null  int64  
 5   Yield         28242 non-null  int64  
 6   Avg Rainfall  28242 non-null  int64  
 7   Pesticides    28242 non-null  float64
 8   Temperature   28242 non-null  float64
 9   k_labels      28242 non-null  int64  
 10  Cluster       28242 non-null  int64  
dtypes: float64(2), int64(7), object(2)
memory usage: 2.4+ MB


In [4]:
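# Preview the DataFrame (first and last five rows).
# `data.head` without parentheses only returns the bound method; use `data.head()` for the first five rows alone.
print(data)

       Unnamed: 0.1  Unnamed: 0      Area            Item  Year  Yield  \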
0                 0           0   Albania           Maize  1990  36613   
1                 1           1   Albania        Potatoes  1990  66667   
2                 2           2   Albania            Rice  1990  23333   
3                 3           3   Albania         Sorghum  1990  12500   
4                 4           4   Albania        Soybeans  1990   7000   
...             ...         ...       ...             ...   ...    ...   
28237         28237       28237  Zimbabwe            Rice  2013  22581   
28238         28238       28238  Zimbabwe         Sorghum  2013   3066   
28239         28239       28239  Zimbabwe        Soybeans  2013  13142   
28240         28240       28240  Zimbabwe  Sweet potatoes  2013  22222   
28241         28241       28241  Zimbabwe           Wheat  2013  22888   

       Avg Rainfall  Pesticides  Temperature  k_labels  Cluster  
0              

In [5]:
data['Yield'].value_counts()

Yield
10000     100
20000      98
100000     81
25000      37
23796      33
         ... 
146964      1
15591       1
50689       1
64106       1
22888       1
Name: count, Length: 11514, dtype: int64

In [6]:
# Share of rows per distinct yield value
data['Yield'].value_counts()/data.shape[0]

Yield
10000     0.003541
20000     0.003470
100000    0.002868
25000     0.001310
23796     0.001168
            ...   
146964    0.000035
15591     0.000035
50689     0.000035
64106     0.000035
22888     0.000035
Name: count, Length: 11514, dtype: float64

## Train/Test Split and Model Training

In [8]:
# Train/test split
# Random 80/20 split (not stratified: 'Yield' is a continuous regression target)
# AutoGluon fits and validates on train_df; test_df is held out as the unseen dataset
train_df, test_df = train_test_split(data, test_size=0.2, random_state=42)
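
The resulting split sizes can be checked by hand. The sketch below assumes that scikit-learn rounds a fractional `test_size` up to a whole number of rows and assigns the remainder to the training set; the row count is the one reported by `data.info()`.

```python
import math

n_rows = 28242     # total rows reported by data.info()
test_size = 0.2    # fraction passed to train_test_split

n_test = math.ceil(test_size * n_rows)  # fractional test sizes are rounded up (assumption stated above)
n_train = n_rows - n_test               # the remainder becomes the training set

print(n_train, n_test)
```

The training count agrees with the 20,333 training rows plus 2,260 validation rows that `predictor.info()` reports later in the notebook.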

In [9]:
train_df['Yield'].value_counts()/train_df.shape[0]

Yield
20000     0.003718
10000     0.003541
100000    0.002700
25000     0.001416
22814     0.001151
            ...   
32271     0.000044
69978     0.000044
69778     0.000044
21951     0.000044
88745     0.000044
Name: count, Length: 9962, dtype: float64

In [10]:
# test_df = unseen data
# Divide by the test set's own row count so the values are shares of test_df
test_df['Yield'].value_counts()/test_df.shape[0]

Yield
100000    0.003540
10000     0.003540
20000     0.002478
30000     0.001770
28645     0.001593
            ...   
76679     0.000177
64119     0.000177
112741    0.000177
31995     0.000177
29286     0.000177
Name: count, Length: 3776, dtype: float64

In [11]:
# Create the predictor; 'Yield' is the target column and models are saved under 'yield_predictors'
predictor = TabularPredictor(label='Yield', path='yield_predictors')

In [12]:
# observe the output
predictor.fit(train_df)

Verbosity: 2 (Standard Logging)
AutoGluon Version:  1.2
Python Version:     3.9.21
Operating System:   Windows
Platform Machine:   AMD64
Platform Version:   10.0.19045
CPU Count:          4
Memory Avail:       3.29 GB / 11.91 GB (27.7%)
Disk Space Avail:   16.34 GB / 237.23 GB (6.9%)
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets. Defaulting to `'medium'`...
	Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
	presets='experimental' : New in v1.2: Pre-trained foundation model + parallel fits. The absolute best accuracy without consideration for inference speed. Does not support GPU.
	presets='best'         : Maximize accuracy. Recommended for most users. Use in competitions and benchmarks.
	presets='high'         : Strong accuracy with fast inference speed.
	presets='good'         : Good accuracy with very fast inference speed.
	presets='medium'  

[1000]	valid_set's rmse: 10991.6
[2000]	valid_set's rmse: 10046.9
[3000]	valid_set's rmse: 9641.35
[4000]	valid_set's rmse: 9466.37
[5000]	valid_set's rmse: 9365.57
[6000]	valid_set's rmse: 9279.84
[7000]	valid_set's rmse: 9223.97
[8000]	valid_set's rmse: 9172.13
[9000]	valid_set's rmse: 9142.41
[10000]	valid_set's rmse: 9136.58


	-9136.4512	 = Validation score   (-root_mean_squared_error)
	22.53s	 = Training   runtime
	2.19s	 = Validation runtime
Fitting model: LightGBM ...


[1000]	valid_set's rmse: 9948.24
[2000]	valid_set's rmse: 9302.89
[3000]	valid_set's rmse: 9077.6
[4000]	valid_set's rmse: 8960.08
[5000]	valid_set's rmse: 8878.51
[6000]	valid_set's rmse: 8813.19
[7000]	valid_set's rmse: 8787.8
[8000]	valid_set's rmse: 8762.75
[9000]	valid_set's rmse: 8746.07
[10000]	valid_set's rmse: 8732.62


	-8728.6864	 = Validation score   (-root_mean_squared_error)
	19.95s	 = Training   runtime
	1.5s	 = Validation runtime
Fitting model: RandomForestMSE ...
	-9406.9977	 = Validation score   (-root_mean_squared_error)
	14.98s	 = Training   runtime
	0.17s	 = Validation runtime
Fitting model: CatBoost ...
	-10487.6315	 = Validation score   (-root_mean_squared_error)
	471.01s	 = Training   runtime
	0.07s	 = Validation runtime
Fitting model: ExtraTreesMSE ...
	-8903.2656	 = Validation score   (-root_mean_squared_error)
	5.38s	 = Training   runtime
	0.15s	 = Validation runtime
Fitting model: NeuralNetFastAI ...
	-11856.1449	 = Validation score   (-root_mean_squared_error)
	33.52s	 = Training   runtime
	0.06s	 = Validation runtime
Fitting model: XGBoost ...
	-9594.1366	 = Validation score   (-root_mean_squared_error)
	81.12s	 = Training   runtime
	1.23s	 = Validation runtime
Fitting model: NeuralNetTorch ...
	-12034.2755	 = Validation score   (-root_mean_squared_error)
	113.45s	 = Training   ru

[1000]	valid_set's rmse: 9168.91
[2000]	valid_set's rmse: 8999.18
[3000]	valid_set's rmse: 8956.04
[4000]	valid_set's rmse: 8926.66
[5000]	valid_set's rmse: 8908.19
[6000]	valid_set's rmse: 8899.07
[7000]	valid_set's rmse: 8895.06
[8000]	valid_set's rmse: 8891.06
[9000]	valid_set's rmse: 8888.26
[10000]	valid_set's rmse: 8886.51


	-8886.4457	 = Validation score   (-root_mean_squared_error)
	43.55s	 = Training   runtime
	1.63s	 = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
	Ensemble Weights: {'LightGBM': 0.36, 'LightGBMLarge': 0.24, 'RandomForestMSE': 0.16, 'ExtraTreesMSE': 0.08, 'XGBoost': 0.08, 'NeuralNetFastAI': 0.04, 'NeuralNetTorch': 0.04}
	-8439.1011	 = Validation score   (-root_mean_squared_error)
	0.03s	 = Training   runtime
	0.0s	 = Validation runtime
AutoGluon training complete, total runtime = 829.01s ... Best model: WeightedEnsemble_L2 | Estimated inference throughput: 473.6 rows/s (2260 batch size)
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("C:\Users\qs142\Downloads\yield_predictors")


<autogluon.tabular.predictor.predictor.TabularPredictor at 0x2c619d21a30>
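
The `WeightedEnsemble_L2` line in the log lists one weight per base model. Conceptually, such an ensemble predicts a weighted average of the base-model predictions. The sketch below illustrates that idea with the weights from the log; the per-model predictions are made-up numbers for a single row, and `weighted_average` is a hypothetical helper rather than AutoGluon's implementation.

```python
# Ensemble weights copied from the training log above.
weights = {
    "LightGBM": 0.36, "LightGBMLarge": 0.24, "RandomForestMSE": 0.16,
    "ExtraTreesMSE": 0.08, "XGBoost": 0.08,
    "NeuralNetFastAI": 0.04, "NeuralNetTorch": 0.04,
}

# Hypothetical base-model predictions (hg/ha) for one row, for illustration only.
base_predictions = {
    "LightGBM": 36000.0, "LightGBMLarge": 37000.0, "RandomForestMSE": 35000.0,
    "ExtraTreesMSE": 38000.0, "XGBoost": 36500.0,
    "NeuralNetFastAI": 40000.0, "NeuralNetTorch": 34000.0,
}


def weighted_average(preds: dict, w: dict) -> float:
    """Combine per-model predictions using the given weights."""
    total_weight = sum(w.values())
    return sum(w[name] * preds[name] for name in w) / total_weight


ensemble_prediction = weighted_average(base_predictions, weights)
print(round(sum(weights.values()), 2), round(ensemble_prediction, 2))
```

Since the weights sum to one, the ensemble value always lies between the smallest and largest base prediction.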

In [13]:
# summary
predictor.fit_summary()

If you only need to load model weights and optimizer state, use the safe `Learner.load` instead.
  warn("load_learner` uses Python's insecure pickle module, which can execute malicious arbitrary code when loading. Only load files you trust.\nIf you only need to load model weights and optimizer state, use the safe `Learner.load` instead.")


*** Summary of fit() ***
Estimated performance of each model:
                  model     score_val              eval_metric  pred_time_val    fit_time  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0   WeightedEnsemble_L2  -8439.101051  root_mean_squared_error       4.771522  311.962044                0.000998           0.027894            2       True         12
1              LightGBM  -8728.686367  root_mean_squared_error       1.498987   19.950128                1.498987          19.950128            1       True          4
2         LightGBMLarge  -8886.445712  root_mean_squared_error       1.627616   43.547876                1.627616          43.547876            1       True         11
3         ExtraTreesMSE  -8903.265559  root_mean_squared_error       0.152914    5.375193                0.152914           5.375193            1       True          7
4            LightGBMXT  -9136.451211  root_mean_squared_error       2.190140   22.525136         



{'model_types': {'KNeighborsUnif': 'KNNModel',
  'KNeighborsDist': 'KNNModel',
  'LightGBMXT': 'LGBModel',
  'LightGBM': 'LGBModel',
  'RandomForestMSE': 'RFModel',
  'CatBoost': 'CatBoostModel',
  'ExtraTreesMSE': 'XTModel',
  'NeuralNetFastAI': 'NNFastAiTabularModel',
  'XGBoost': 'XGBoostModel',
  'NeuralNetTorch': 'TabularNeuralNetTorchModel',
  'LightGBMLarge': 'LGBModel',
  'WeightedEnsemble_L2': 'WeightedEnsembleModel'},
 'model_performance': {'KNeighborsUnif': -65494.91830877946,
  'KNeighborsDist': -67553.16040684361,
  'LightGBMXT': -9136.451210758602,
  'LightGBM': -8728.686366931355,
  'RandomForestMSE': -9406.997665041043,
  'CatBoost': -10487.631530749648,
  'ExtraTreesMSE': -8903.265558845178,
  'NeuralNetFastAI': -11856.144881113696,
  'XGBoost': -9594.136610433166,
  'NeuralNetTorch': -12034.275458966007,
  'LightGBMLarge': -8886.445711588955,
  'WeightedEnsemble_L2': -8439.101051408727},
 'model_best': 'WeightedEnsemble_L2',
 'model_paths': {'KNeighborsUnif': ['KNeigh
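
The `model_performance` dictionary in the summary can be ranked directly. The scores are negated RMSE values, so a higher (less negative) score is better and a descending sort puts the best model first. The values below are copied from the summary and rounded to two decimals.

```python
# Validation scores (-RMSE) copied from fit_summary(), rounded to two decimals.
model_performance = {
    "KNeighborsUnif": -65494.92,
    "KNeighborsDist": -67553.16,
    "LightGBMXT": -9136.45,
    "LightGBM": -8728.69,
    "RandomForestMSE": -9407.00,
    "CatBoost": -10487.63,
    "ExtraTreesMSE": -8903.27,
    "NeuralNetFastAI": -11856.14,
    "XGBoost": -9594.14,
    "NeuralNetTorch": -12034.28,
    "LightGBMLarge": -8886.45,
    "WeightedEnsemble_L2": -8439.10,
}

# Higher score is better, so sort in descending order.
ranked = sorted(model_performance.items(), key=lambda kv: kv[1], reverse=True)

for name, score in ranked:
    print(f"{name:<22} RMSE = {-score:>10,.2f}")
```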

In [14]:
test_df.columns

Index(['Unnamed: 0.1', 'Unnamed: 0', 'Area', 'Item', 'Year', 'Yield',
       'Avg Rainfall', 'Pesticides', 'Temperature', 'k_labels', 'Cluster'],
      dtype='object')

In [15]:
# validate the model against unseen data
y_test = test_df["Yield"]
test_data = test_df.drop(columns=["Yield"])

In [16]:
y_pred = predictor.predict(test_data)



In [17]:
metrics = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)

In [18]:
metrics

{'root_mean_squared_error': -8191.731226940433,
 'mean_squared_error': -67104460.49443101,
 'mean_absolute_error': -3300.4337029628446,
 'r2': 0.9907488822937012,
 'pearsonr': 0.9953740409439149,
 'median_absolute_error': -978.65625}
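
AutoGluon returns error metrics with a negative sign so that a higher score is always better. Taking the absolute value recovers the usual magnitudes. The sketch below does this for the dictionary above and checks that the RMSE equals the square root of the MSE; the helper name `to_positive_errors` is illustrative.

```python
import math

# Held-out metrics copied from the output above.
metrics = {
    "root_mean_squared_error": -8191.731226940433,
    "mean_squared_error": -67104460.49443101,
    "mean_absolute_error": -3300.4337029628446,
    "r2": 0.9907488822937012,
    "pearsonr": 0.9953740409439149,
    "median_absolute_error": -978.65625,
}

# Metrics where lower is better; these are the ones reported with a flipped sign.
ERROR_METRICS = {
    "root_mean_squared_error",
    "mean_squared_error",
    "mean_absolute_error",
    "median_absolute_error",
}


def to_positive_errors(raw: dict) -> dict:
    """Return a copy with error metrics as positive magnitudes."""
    return {k: (abs(v) if k in ERROR_METRICS else v) for k, v in raw.items()}


readable = to_positive_errors(metrics)
rmse_from_mse = math.sqrt(readable["mean_squared_error"])
print(readable["root_mean_squared_error"], rmse_from_mse)
```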

In [19]:
# Feature Importance
importance = predictor.feature_importance(test_df)
importance

These features in provided data are not utilized by the predictor and will be ignored: ['Unnamed: 0', 'Cluster']
Computing feature importance via permutation shuffling for 8 features using 5000 rows with 5 shuffle sets...
	604.37s	= Expected runtime (120.87s per shuffle set)

| feature | importance | stddev | p_value | n | p99_high | p99_low |
|---|---|---|---|---|---|---|
| Item | 99841.313096 | 1039.673307 | 1.410798e-09 | 5 | 101982.014804 | 97700.611388 |
| Area | 39077.576355 | 785.715798 | 1.960192e-08 | 5 | 40695.376043 | 37459.776667 |
| Year | 12214.739314 | 349.555764 | 8.039633e-08 | 5 | 12934.479464 | 11494.999165 |
| k_labels | 11872.446126 | 471.018882 | 2.966626e-07 | 5 | 12842.280514 | 10902.611738 |
| Avg Rainfall | 7218.544209 | 111.421468 | 6.80958e-09 | 5 | 7447.962552 | 6989.125867 |
| Pesticides | 7004.537483 | 261.173204 | 2.315119e-07 | 5 | 7542.296721 | 6466.778246 |
| Unnamed: 0.1 | 6394.553706 | 297.702441 | 5.621029e-07 | 5 | 7007.527146 | 5781.580265 |
| Temperature | 5889.89317 | 225.770614 | 2.585657e-07 | 5 | 6354.757985 | 5425.028355 |
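
Permutation importance measures how much a model's error grows when one feature's values are shuffled, which breaks that feature's link to the target. The sketch below shows the idea in pure Python on a toy model. The data, the `toy_model` function and the helper names are invented for illustration and do not reproduce AutoGluon's implementation.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def toy_model(row):
    # Hypothetical fitted model: depends on feature 0 and ignores feature 1.
    return 3.0 * row[0]


def permutation_importance(rows, y_true, feature_idx, n_shuffles=5, seed=0):
    """Average increase in RMSE after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = rmse(y_true, [toy_model(r) for r in rows])
    increases = []
    for _ in range(n_shuffles):
        column = [r[feature_idx] for r in rows]
        rng.shuffle(column)
        # Rebuild each row with the shuffled value in place of the original one.
        shuffled_rows = [
            r[:feature_idx] + (v,) + r[feature_idx + 1:] for r, v in zip(rows, column)
        ]
        shuffled_score = rmse(y_true, [toy_model(r) for r in shuffled_rows])
        increases.append(shuffled_score - baseline)
    return sum(increases) / n_shuffles


rows = [(float(i), float(i % 3)) for i in range(50)]  # two toy features per row
y = [3.0 * r[0] for r in rows]                         # target depends only on feature 0

importance_f0 = permutation_importance(rows, y, feature_idx=0)
importance_f1 = permutation_importance(rows, y, feature_idx=1)
print(importance_f0 > importance_f1, importance_f1)
```

Read this way, the `importance` column above is the increase in RMSE (in hg/ha) when a feature is shuffled, averaged over the 5 shuffle sets.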


In [20]:
# Use case
# Adjust any of the feature values below and re-run to see how the predicted yield changes.
sample_yield = {
    "Area": "Nigeria",
    "Item": "Maize",
    "Year": 2012,
    "Avg Rainfall": 1200,    # mm per year
    "Pesticides": 85.0,
    "Temperature": 26.5,     # degrees Celsius
    "Unnamed: 0.1": 0,       # leftover index column that the predictor was trained on
    "k_labels": 1,
}

In [21]:
yield_data = TabularDataset([sample_yield])
predictor.predict(yield_data)



0    42279.128906
Name: Yield, dtype: float32

In [22]:
# How to load a model that already exists
from autogluon.tabular import TabularDataset, TabularPredictor

In [23]:
path_to_model = 'yield_predictors'

In [24]:
data = TabularDataset('Yield_df.csv')

Loaded data from: Yield_df.csv | Columns = 11 / 11 | Rows = 28242 -> 28242


In [26]:
predictor = TabularPredictor.load(path_to_model)

In [27]:
y_pred = predictor.predict(data)




In [28]:
y_true = data['Yield']

In [29]:
metrics = predictor.evaluate_predictions(y_true=y_true, y_pred=y_pred, auxiliary_metrics=True)
metrics

{'root_mean_squared_error': -4625.732457109106,
 'mean_squared_error': -21397400.764752645,
 'mean_absolute_error': -1603.4107837427118,
 'r2': 0.9970352649688721,
 'pearsonr': 0.9985243509410642,
 'median_absolute_error': -534.69677734375}
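
Because `data` is the full CSV, it includes the rows the models were trained on, so these scores are expected to look better than the held-out scores from cell 18. A quick comparison of the two RMSE magnitudes makes the gap explicit.

```python
# RMSE magnitudes copied from the two evaluate_predictions outputs in this notebook.
rmse_held_out = 8191.731226940433    # test_df only (rows unseen during training)
rmse_full_data = 4625.732457109106   # full dataset (includes the training rows)

ratio = rmse_held_out / rmse_full_data
print(f"Held-out RMSE is {ratio:.2f}x the full-dataset RMSE")
```

For judging how the model behaves on new data, the held-out figure is the more reliable of the two.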

In [30]:
predictor.info()



{'path': 'C:\\Users\\qs142\\Downloads\\yield_predictors',
 'label': 'Yield',
 'random_state': 0,
 'version': '1.2',
 'features': ['Unnamed: 0.1',
  'Area',
  'Item',
  'Year',
  'Avg Rainfall',
  'Pesticides',
  'Temperature',
  'k_labels'],
 'feature_metadata_in': <autogluon.common.features.feature_metadata.FeatureMetadata at 0x2c6335103a0>,
 'time_fit_preprocessing': 1.2446608543395996,
 'time_fit_training': 827.7689235210419,
 'time_fit_total': 829.0135843753815,
 'time_limit': None,
 'time_train_start': 1744773588.7150443,
 'num_rows_train': 20333,
 'num_cols_train': 8,
 'num_rows_val': 2260,
 'num_rows_test': None,
 'num_classes': None,
 'problem_type': 'regression',
 'eval_metric': 'root_mean_squared_error',
 'best_model': 'WeightedEnsemble_L2',
 'best_model_score_val': -8439.101051408727,
 'best_model_stack_level': 2,
 'num_models_trained': 12,
 'num_bag_folds': 0,
 'max_stack_level': 2,
 'max_core_stack_level': 1,
 'model_info': {'KNeighborsUnif': {'name': 'KNeighborsUnif',
   

In [31]:
mdl_name = 'LightGBM'
predictor.predict(data, model=mdl_name)

0        34711.085938
1        83349.132812
2        25566.878906
3         8858.184570
4         7235.341797
             ...     
28237    21016.519531
28238     1697.450562
28239    13293.725586
28240    22235.244141
28241    23819.083984
Name: Yield, Length: 28242, dtype: float32

### Metric Interpretation

The values below come from cell 29, which scores the full dataset, including the rows used for training. AutoGluon reports error metrics with a negative sign (higher is better); they are shown here as positive magnitudes. On the held-out test set (cell 18) the corresponding figures are RMSE 8,191.73, MAE 3,300.43 and R² 0.9907.

- **Root Mean Squared Error (RMSE)**: **4,625.73 hg/ha** – The square root of the mean squared error. Because errors are squared before averaging, large misses weigh more heavily than in the MAE, which is why the RMSE is nearly three times the MAE here.
- **Mean Squared Error (MSE)**: **21,397,400.76** – The average of the squared prediction errors, in (hg/ha)². It is large because of the squaring; its square root is the RMSE above.
- **Mean Absolute Error (MAE)**: **1,603.41 hg/ha** – On average, predictions deviate from the actual crop yield by about 1,603 hg/ha (roughly 0.16 t/ha).
- **Median Absolute Error**: **534.70 hg/ha** – Half of the predictions deviate from the true value by less than ~535 hg/ha. The median sitting well below the MAE indicates that a minority of large errors pulls the mean up.
- **R² (Coefficient of Determination)**: **0.9970** – The model explains **99.7%** of the variance in the target variable on this data.
- **Pearson Correlation (r)**: **0.9985** – Shows a **near-perfect linear correlation** between predicted and actual crop yield values.
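
For reference, the metrics in this list follow the standard definitions, sketched below in plain Python on a small made-up example. The function name `regression_metrics` and the toy values are illustrative only.

```python
import math
from statistics import mean, median


def regression_metrics(y_true, y_pred):
    """Compute common regression metrics from two equal-length sequences."""
    errors = [t - p for t, p in zip(y_true, y_pred)]
    abs_errors = [abs(e) for e in errors]
    squared_errors = [e ** 2 for e in errors]
    mse = mean(squared_errors)
    y_bar = mean(y_true)
    ss_res = sum(squared_errors)                    # residual sum of squares
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)  # total sum of squares
    return {
        "rmse": math.sqrt(mse),
        "mse": mse,
        "mae": mean(abs_errors),
        "median_ae": median(abs_errors),
        "r2": 1 - ss_res / ss_tot,
    }


# Toy values: three exact predictions and one that is off by 1.
toy_true = [1.0, 2.0, 3.0, 4.0]
toy_pred = [1.0, 2.0, 3.0, 5.0]
toy_metrics = regression_metrics(toy_true, toy_pred)
print(toy_metrics)
```

In the toy example a single miss leaves the median absolute error at zero while the MAE and RMSE are positive, the same pattern (median below MAE below RMSE) seen in the notebook's results.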

---

### Conclusion

The AutoGluon predictor, whose best model is the **WeightedEnsemble_L2** ensemble (with **LightGBM** carrying the largest weight, 0.36), demonstrates **strong predictive performance** on the crop yield dataset.

- Error metrics (RMSE, MAE, MedAE) are small relative to yields that typically run into the tens of thousands of hg/ha, reflecting a high degree of precision.
- The **R² score** confirms that the model captures nearly all variance in crop yield outcomes.
- The **Pearson correlation** further validates a strong linear relationship between actual and predicted values.

This level of performance suggests:
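- The data is **complete** (no missing values in any column), although the leftover index column `Unnamed: 0.1` was used as a feature and is worth dropping before retraining.
- There is **strong predictive signal** in the input features. Permutation importance ranks `Item` and `Area` far above the rest, followed by `Year` and `k_labels`; rainfall, pesticides, and temperature contribute less.
- The model **generalizes well**: on the held-out test set it reaches R² 0.9907 and RMSE 8,191.73, close to the validation RMSE of 8,439.10. The lower errors on the full dataset partly reflect that it includes the training rows.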


