# Crop Yield Prediction: PyTorch LSTM with Optuna (Part 5)

## Overview
This notebook trains a **Long Short-Term Memory (LSTM) network** to predict crop yields. LSTMs are recurrent neural networks designed to capture temporal dependencies in sequential data, which makes them better suited to time-series forecasting than standard feedforward networks.

## Methodology
1.  **Crop Selection:** Choose the specific crop to predict.
2.  **Feature Analysis:** Review the input variables.
3.  **Time-Series Split:** Divide data by year so the model is never trained on years later than the ones it is evaluated on.
4.  **Data Scaling & Sequence Generation:** Normalize features and create **3D sequences (Batch, Seq_Len, Features)** required for LSTM input. We use a **Sequence Length of 2**.
5.  **Baseline:** Compare against a simple guess (Last Year's Yield).
6.  **Initial Model:** Train a default LSTM model and check learning curves.
7.  **Optimization:** Use **Optuna** to automatically find the best LSTM architecture (layers, hidden size, dropout) and hyperparameters.
8.  **Final Evaluation:** Compare accuracy (RMSE) across all stages.

In [1]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import optuna
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.inspection import permutation_importance

# Optuna Visualization Tools
from optuna.visualization import plot_optimization_history
from optuna.visualization import plot_parallel_coordinate
from optuna.visualization import plot_slice
from optuna.visualization import plot_param_importances

# Set plotting style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")


Using device: cpu


### 1. Data Preparation and Crop Choice
We load the main dataset and identify the available crops. For this analysis, we focus specifically on **Rice**. We clean the data by removing columns related to other crops and deleting any rows where the target yield information is missing.

In [2]:
# Load dataset
df = pd.read_parquet('Parquet/XY_v2.parquet')

# --- LIST AVAILABLE CROPS ---
# Assumes targets start with 'Y_'
target_columns = [col for col in df.columns if col.startswith('Y_')]
available_crops = [col.replace('Y_', '') for col in target_columns]

print("--- Available Crops found in Dataset ---")
print(available_crops)
print("-" * 40)

# --- CONFIGURATION: SET CROP HERE ---
CHOSEN_CROP = 'rice'  # <--- CHANGE THIS to 'wheat', 'potatoes', etc. based on list above
# ------------------------------------

# Define Target and Dynamic Lag Features
TARGET_COL = f'Y_{CHOSEN_CROP}'
LAG_1_FEATURE = f'avg_yield_{CHOSEN_CROP}_1y'

if TARGET_COL not in df.columns:
    raise ValueError(f"Target {TARGET_COL} not found in dataset. Check spelling.")

print(f"Predicting Target: {TARGET_COL}")
print(f"Using Lag 1 Feature: {LAG_1_FEATURE}")

# Clean Missing Targets for the chosen crop
df_model = df.dropna(subset=[TARGET_COL])

print(f"Data Loaded. Rows with valid target: {len(df_model)}")

--- Available Crops found in Dataset ---
['bananas', 'barley', 'cassava_fresh', 'cucumbers_and_gherkins', 'maize_corn', 'oil_palm_fruit', 'other_vegetables_fresh_nec', 'potatoes', 'rice', 'soya_beans', 'sugar_beet', 'sugar_cane', 'tomatoes', 'watermelons', 'wheat']
----------------------------------------
Predicting Target: Y_rice
Using Lag 1 Feature: avg_yield_rice_1y
Data Loaded. Rows with valid target: 4729


### 2. Selecting Features, Splitting, and Sequence Generation
We identify the input variables and split the data by year to avoid data leakage: training uses years before 2014, validation uses 2014 to 2018, and testing uses 2019 onward.

**Crucially, for LSTMs:**
1. We scale the data (StandardScaler).
2. We transform the 2D feature matrix into **3D sequences**.  
   * **Sequence Length = 2**: We look at the data from the previous step and the current step to predict the target.
   * Input Shape: `(Batch_Size, Sequence_Length, Features)`.
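
The windowing step can be illustrated on a toy array. A sliding window treats adjacent rows as consecutive time steps, which only holds if neighbouring rows belong to the same region and to consecutive years. The sketch below assumes, without confirmation from this notebook, that the rows cover several regions and are sorted by region and then by year. The helper `windows_by_group` and its `groups` argument are hypothetical names introduced for illustration.

```python
import numpy as np

def windows_by_group(features, targets, groups, seq_len):
    """Build (N_windows, seq_len, n_features) windows that never cross a group boundary.

    Assumes rows are already sorted by group and then by time.
    """
    xs, ys = [], []
    for g in np.unique(groups):
        idx = np.flatnonzero(groups == g)  # row positions of this group
        for start in range(len(idx) - seq_len + 1):
            window = idx[start:start + seq_len]
            xs.append(features[window])
            ys.append(targets[window[-1]])  # target of the last step in the window
    return np.array(xs), np.array(ys)

# Toy data: two regions with three yearly rows each, two features per row.
features = np.arange(12, dtype=float).reshape(6, 2)
targets = np.array([10.0, 11.0, 12.0, 20.0, 21.0, 22.0])
groups = np.array(["A", "A", "A", "B", "B", "B"])

X_seq, y_seq = windows_by_group(features, targets, groups, seq_len=2)
print(X_seq.shape)  # (4, 2, 2)
print(y_seq)        # [11. 12. 21. 22.]
```

A plain sliding window over the same six rows would produce five windows, one of which would pair the last row of region A with the first row of region B.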

In [3]:
# --- IMPORTS (the other libraries were already imported in the first cell) ---
from sklearn.impute import SimpleImputer

# --- DROP UNWANTED COLUMNS ---
# Drop all columns that start with "avg_yield_" but do NOT match the chosen crop
cols_to_drop = [c for c in df_model.columns 
                if c.startswith("avg_yield_") and CHOSEN_CROP not in c]

df_model = df_model.drop(columns=cols_to_drop)

# --- FEATURE SELECTION ---
# Select independent variables (exclude 'Y_' columns and metadata)
feature_cols = [c for c in df_model.columns 
                if not c.startswith('Y_') and c not in ['area']]

# --- DISPLAY FEATURES TABLE ---
print(f"Total Features Used: {len(feature_cols)}")
print("-" * 30)
feature_preview = pd.DataFrame(feature_cols, columns=['Feature Name']).T
display(feature_preview)

# --- TIME-SERIES SPLIT ---
TRAIN_END_YEAR = 2014
VAL_END_YEAR = 2019

# 1. Training Set (< 2014)
mask_train = df_model['year'] < TRAIN_END_YEAR
X_train_raw = df_model[mask_train][feature_cols]
y_train = df_model[mask_train][TARGET_COL]

# 2. Validation Set (>= 2014 and < 2019)
mask_val = (df_model['year'] >= TRAIN_END_YEAR) & (df_model['year'] < VAL_END_YEAR)
X_val_raw = df_model[mask_val][feature_cols]
y_val = df_model[mask_val][TARGET_COL]

# 3. Test Set (>= 2019)
mask_test = df_model['year'] >= VAL_END_YEAR
X_test_raw = df_model[mask_test][feature_cols]
y_test = df_model[mask_test][TARGET_COL]

# --- IMPUTATION (Handle NaNs before scaling) ---
imputer = SimpleImputer(strategy='mean')
X_train_imputed = pd.DataFrame(imputer.fit_transform(X_train_raw), columns=feature_cols)
X_val_imputed = pd.DataFrame(imputer.transform(X_val_raw), columns=feature_cols)
X_test_imputed = pd.DataFrame(imputer.transform(X_test_raw), columns=feature_cols)

# --- SCALING (Required for LSTMs) ---
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train_imputed)
X_val = scaler.transform(X_val_imputed)
X_test = scaler.transform(X_test_imputed)

# --- LSTM SEQUENCE GENERATION ---
SEQ_LEN = 2

def create_sequences(data, targets, seq_len):
    """
    Creates sliding window sequences for LSTM input.
    Input Shape: (N, Features) -> Output Shape: (N - Seq_Len + 1, Seq_Len, Features)
    """
    xs, ys = [], []
    # We iterate such that we can form a sequence of length seq_len
    for i in range(len(data) - seq_len + 1):
        x = data[i:(i + seq_len)]
        # The target is the value corresponding to the LAST step in the sequence
        y = targets[i + seq_len - 1]
        xs.append(x)
        ys.append(y)
    return np.array(xs), np.array(ys)

# Create sequences for all splits
# Note: This slightly reduces the number of samples by (SEQ_LEN - 1)
X_train_seq, y_train_seq = create_sequences(X_train, y_train.values, SEQ_LEN)
X_val_seq, y_val_seq = create_sequences(X_val, y_val.values, SEQ_LEN)
X_test_seq, y_test_seq = create_sequences(X_test, y_test.values, SEQ_LEN)

# Convert to PyTorch Tensors (batch_first=True)
X_train_tensor = torch.tensor(X_train_seq, dtype=torch.float32).to(device)
y_train_tensor = torch.tensor(y_train_seq, dtype=torch.float32).view(-1, 1).to(device)

X_val_tensor = torch.tensor(X_val_seq, dtype=torch.float32).to(device)
y_val_tensor = torch.tensor(y_val_seq, dtype=torch.float32).view(-1, 1).to(device)
X_test_tensor = torch.tensor(X_test_seq, dtype=torch.float32).to(device)

print(f"\nLSTM Data Shapes (Batch, Seq_Len, Features):")
print(f"Training:   {X_train_seq.shape}")
print(f"Validation: {X_val_seq.shape}")
print(f"Testing:    {X_test_seq.shape}")

Total Features Used: 23
------------------------------


| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Feature Name | year | avg_yield_rice_1y | avg_yield_rice_3y | avg_yield_rice_5y | sum_rain_winter | sum_rain_spring | sum_rain_summer | sum_rain_autumn | sum_rain_annual | avg_solar_winter | ... | avg_solar_annual | avg_temp_winter | avg_temp_spring | avg_temp_summer | avg_temp_autumn | avg_temp_annual | pesticides_lag1 | fertilizer_lag1 | latitude | longitude |



LSTM Data Shapes (Batch, Seq_Len, Features):
Training:   (3578, 2, 23)
Validation: (574, 2, 23)
Testing:    (574, 2, 23)


### 3. Setting a Baseline
Before training a neural network, we create a simple baseline to measure success against. The baseline assumes that this year's yield will be exactly the same as last year's.

*Note: Because sequence generation removed the first `SEQ_LEN - 1` rows, we must align our baseline comparison to the remaining `y_test_seq`.*
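
The alignment described in the note can be checked on a small example. This is a minimal sketch with made-up yields for a single region; only `SEQ_LEN` is taken from the notebook, and the other names are illustrative.

```python
import numpy as np

SEQ_LEN = 2  # same window length as in the notebook

# Toy yields for one region, ordered by year (values are illustrative only).
yields = np.array([100.0, 110.0, 105.0, 120.0, 130.0])

# Last year's yield is the prediction for the current year.
lag_1 = np.concatenate(([np.nan], yields[:-1]))

# Sequence generation keeps targets from position SEQ_LEN - 1 onwards,
# so the baseline is sliced the same way before scoring.
y_true = yields[SEQ_LEN - 1:]
y_pred = lag_1[SEQ_LEN - 1:]

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(f"Persistence RMSE: {rmse:.2f}")  # Persistence RMSE: 10.61
```

Slicing both arrays from `SEQ_LEN - 1` scores the baseline on the same rows as the LSTM, so the two RMSE values are comparable.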

In [4]:
# Baseline: yield(t) = yield(t-1)
# The LSTM inputs are scaled, so for a meaningful RMSE the baseline reads the
# unscaled lag feature directly from the original dataframe, using the test
# indices that remain after sequence generation.

# Get original indices after sequence trimming
test_indices = y_test.index[SEQ_LEN - 1:]

# Get predictions (Last year's yield) for these indices
y_pred_baseline = df_model.loc[test_indices, LAG_1_FEATURE]
y_true_aligned = df_model.loc[test_indices, TARGET_COL]

# Clean NaNs for metric calculation
mask_valid = ~y_pred_baseline.isna() & ~y_true_aligned.isna()
y_test_clean = y_true_aligned[mask_valid]
y_pred_clean = y_pred_baseline[mask_valid]

rmse_baseline = np.sqrt(mean_squared_error(y_test_clean, y_pred_clean))
r2_baseline = r2_score(y_test_clean, y_pred_clean)

print(f"Baseline RMSE (Aligned): {rmse_baseline:.2f}")

Baseline RMSE (Aligned): 533.91


### 4. Initial Model Testing (LSTM)
We train a basic **LSTM** model using standard settings. We plot the training vs validation loss to check for overfitting or underfitting.
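
The learning curve can also be read programmatically by finding the epoch with the lowest validation RMSE and checking whether the curve has risen since. The sketch below assumes two per-epoch lists like the ones the training helper returns. The function `summarize_curve` and the numbers are hypothetical.

```python
def summarize_curve(train_rmse, val_rmse):
    """Return the best validation epoch, the train/val gap there, and an overfitting flag."""
    best_epoch = min(range(len(val_rmse)), key=val_rmse.__getitem__)
    gap = val_rmse[best_epoch] - train_rmse[best_epoch]
    # Validation error rising after its minimum while training error keeps
    # falling is the usual sign of overfitting.
    overfit_after_best = val_rmse[-1] > val_rmse[best_epoch]
    return best_epoch, gap, overfit_after_best

# Illustrative curves only (not results from this notebook).
train_curve = [900.0, 700.0, 550.0, 450.0, 380.0]
val_curve = [950.0, 760.0, 640.0, 660.0, 700.0]

epoch, gap, overfit = summarize_curve(train_curve, val_curve)
print(epoch, gap, overfit)  # 2 90.0 True
```

If both curves are still falling at the last epoch, the model is more likely undertrained than overfit, and more epochs or a larger learning rate may help.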

In [None]:
# --- DEFINE LSTM STRUCTURE ---
class LSTMRegressor(nn.Module):
    def __init__(self, input_dim, hidden_size, num_layers, dropout):
        super(LSTMRegressor, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        
        # LSTM Layer
        # batch_first=True means input shape is (batch, seq, feature)
        self.lstm = nn.LSTM(input_dim, hidden_size, num_layers, 
                            batch_first=True, dropout=dropout)
        
        # Fully Connected Layer (Output)
        self.fc = nn.Linear(hidden_size, 1)
        
    def forward(self, x):
        # Initialize hidden state and cell state with zeros on the same device as the input
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size, device=x.device)
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size, device=x.device)
        
        # Forward propagate LSTM
        # out: tensor of shape (batch_size, seq_length, hidden_size)
        out, _ = self.lstm(x, (h0, c0))
        
        # Decode the hidden state of the last time step
        out = self.fc(out[:, -1, :])
        return out

# --- TRAINING HELPER FUNCTION ---
# Architecture settings (layers, hidden size, dropout) belong to the model, not the training loop.
def train_model(model, X_t, y_t, X_v, y_v, lr=0.005, batch_size=16, epochs=200, verbose=True):
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)
    
    train_loader = DataLoader(TensorDataset(X_t, y_t), batch_size=batch_size, shuffle=True)
    
    train_losses = []
    val_losses = []
    
    for epoch in range(epochs):
        model.train()
        epoch_loss = 0
        for batch_X, batch_y in train_loader:
            optimizer.zero_grad()
            outputs = model(batch_X)
            loss = criterion(outputs, batch_y)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item() * batch_X.size(0)
            
        # Average the per-sample MSE over the epoch, then convert it to RMSE
        train_mse = epoch_loss / len(X_t)
        train_rmse = np.sqrt(train_mse)
        
        model.eval()
        with torch.no_grad():
            val_outputs = model(X_v)
            val_loss = criterion(val_outputs, y_v)
            val_rmse = np.sqrt(val_loss.item())
            
        train_losses.append(train_rmse)
        val_losses.append(val_rmse)
        
        if verbose and (epoch % 20 == 0 or epoch == epochs-1):
            print(f"Epoch {epoch}/{epochs} | Train RMSE: {train_rmse:.2f} | Val RMSE: {val_rmse:.2f}")
            
    return train_losses, val_losses

# --- INITIAL MODEL TRAINING ---
input_dim = X_train_seq.shape[2] # Number of features

# Initial Config
HIDDEN_SIZE = 32
NUM_LAYERS = 1
DROPOUT = 0.0

model_init = LSTMRegressor(input_dim, HIDDEN_SIZE, NUM_LAYERS, DROPOUT).to(device)

print("Training Initial LSTM Model...")
train_hist, val_hist = train_model(model_init, X_train_tensor, y_train_tensor, X_val_tensor, y_val_tensor)

# --- PLOT LEARNING CURVE ---
plt.figure(figsize=(10, 6))
plt.plot(train_hist, label='Training RMSE', color='blue')
plt.plot(val_hist, label='Validation RMSE', color='red')
plt.title(f'LSTM Learning Curve ({CHOSEN_CROP})', fontsize=15)
plt.xlabel('Epochs')
plt.ylabel('RMSE')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Evaluate on TEST Set
model_init.eval()
with torch.no_grad():
    y_pred_init_test = model_init(X_test_tensor).cpu().numpy().flatten()

rmse_init_test = np.sqrt(mean_squared_error(y_test_seq, y_pred_init_test))
r2_init_test = r2_score(y_test_seq, y_pred_init_test)

print(f"Initial LSTM Test RMSE: {rmse_init_test:.2f}")

Training Initial LSTM Model...
Epoch 0/200 | Train RMSE: 3827.83 | Val RMSE: 4578.60
Epoch 20/200 | Train RMSE: 3238.25 | Val RMSE: 3987.74
Epoch 40/200 | Train RMSE: 2726.66 | Val RMSE: 3458.20
Epoch 60/200 | Train RMSE: 2266.70 | Val RMSE: 2963.69
Epoch 80/200 | Train RMSE: 1874.99 | Val RMSE: 2523.29
Epoch 100/200 | Train RMSE: 1543.63 | Val RMSE: 2140.76
Epoch 120/200 | Train RMSE: 1267.10 | Val RMSE: 1814.72
Epoch 140/200 | Train RMSE: 1040.93 | Val RMSE: 1553.82


### 5. Tuning the Model (Optuna)
To improve performance, we use **Optuna** to find the best LSTM architecture. We run trials adjusting hidden size, number of layers, dropout rate, learning rate, and batch size.

In [6]:
# --- OPTUNA OBJECTIVE FUNCTION FOR LSTM ---
def objective(trial):
    # 1. Suggest Hyperparameters
    hidden_size = trial.suggest_int("hidden_size", 16, 64)
    num_layers = trial.suggest_int("num_layers", 1, 3)
    dropout = trial.suggest_float("dropout", 0.1, 0.3)
    lr = trial.suggest_float("lr", 0.0005, 0.0015)
    batch_size = trial.suggest_categorical("batch_size", [8, 16, 32])
    
    # nn.LSTM applies dropout only between stacked layers, so with num_layers=1
    # a non-zero value has no effect and triggers a warning; set it to 0 instead
    if num_layers == 1:
        dropout = 0.0

    # 2. Build Model
    model = LSTMRegressor(input_dim, hidden_size, num_layers, dropout).to(device)
    
    # 3. Setup Training
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)
    
    train_loader = DataLoader(TensorDataset(X_train_tensor, y_train_tensor), 
                              batch_size=batch_size, shuffle=True)
    
    # 4. Training Loop with Pruning
    epochs = 50  # Shorter epoch count for tuning speed
    for epoch in range(epochs):
        model.train()
        for batch_X, batch_y in train_loader:
            optimizer.zero_grad()
            outputs = model(batch_X)
            loss = criterion(outputs, batch_y)
            loss.backward()
            optimizer.step()
        
        # Evaluate on Validation
        model.eval()
        with torch.no_grad():
            val_pred = model(X_val_tensor)
            val_mse = criterion(val_pred, y_val_tensor).item()
            val_rmse = np.sqrt(val_mse)

        # Pruning check
        trial.report(val_rmse, epoch)
        if trial.should_prune():
            raise optuna.exceptions.TrialPruned()

    return val_rmse

# --- RUN OPTIMIZATION ---
study_name = f'{CHOSEN_CROP.capitalize()}_Yield_LSTM'
study = optuna.create_study(direction='minimize', study_name=study_name)
study.optimize(objective, n_trials=20)

print("\nBest Parameters found:")
print(study.best_params)

[I 2025-11-29 15:58:07,437] A new study created in memory with name: Rice_Yield_LSTM
[W 2025-11-29 15:58:33,885] Trial 0 failed with parameters: {'hidden_size': 19, 'num_layers': 2, 'dropout': 0.1225427497104476, 'lr': 0.0010015052772023086, 'batch_size': 8} because of the following error: KeyboardInterrupt().
Traceback (most recent call last):
  File "C:\Users\PavinP\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\optuna\study\_optimize.py", line 205, in _run_trial
    value_or_values = func(trial)
                      ^^^^^^^^^^^
  File "C:\Users\PavinP\AppData\Local\Temp\ipykernel_23504\3705250164.py", line 32, in objective
    loss.backward()
  File "C:\Users\PavinP\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\torch\_tensor.py", line 625, in backward
    torch.autograd.backward(
  File "C:\Users\PavinP\AppData\Local\Packages\Pyt

KeyboardInterrupt: 

### 6. Visualizing Optimization
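We plot the optimization history, a parallel-coordinate view of the hyperparameters, per-parameter slice plots, and hyperparameter importances to see how the search progressed and which settings mattered most.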

In [None]:
# --- OPTUNA VISUALIZATIONS ---
name = f"{CHOSEN_CROP.capitalize()}_Yield_LSTM"

# 1. Optimization History
fig = plot_optimization_history(study)
fig.update_layout(title=f'{name} Optimization History', width=900, height=500)
fig.show()

# 2. Parallel Coordinate (Hyperparameter Relationships)
fig = plot_parallel_coordinate(study)
fig.update_layout(title=f'{name} Parallel Coordinate Plot', width=900, height=500)
fig.show()

# 3. Slice Plot (Individual Parameter impact)
fig = plot_slice(study)
fig.update_layout(title=f'{name} Slice Plot', width=900, height=500)
fig.show()

# 4. Parameter Importance
try:
    fig = plot_param_importances(study)
    fig.update_layout(title=f'{name} Hyperparameter Importance', width=900, height=500)
    fig.show()
except (ValueError, RuntimeError) as e:
    print(f'Could not plot parameter importance: {e}')

### 7. Final Model Training
Using the best settings found during the tuning phase, we build the final LSTM model. We train it on the Training and Validation data combined, so that it sees as much history as possible before being evaluated on the Test data.

In [None]:
# 1. Combine Train + Validation for Final Training (Sequences)
X_train_full_seq = np.concatenate((X_train_seq, X_val_seq), axis=0)
y_train_full_seq = np.concatenate((y_train_seq, y_val_seq), axis=0)

X_train_full_tensor = torch.tensor(X_train_full_seq, dtype=torch.float32).to(device)
y_train_full_tensor = torch.tensor(y_train_full_seq, dtype=torch.float32).view(-1, 1).to(device)

# 2. Retrieve Best Params
bp = study.best_params

# Dropout has no effect with a single LSTM layer, so set it to 0 as in the objective
final_dropout = bp['dropout']
if bp['num_layers'] == 1:
    final_dropout = 0.0

# 3. Initialize Best Model
final_model = LSTMRegressor(
    input_dim,
    bp['hidden_size'], 
    bp['num_layers'], 
    final_dropout
).to(device)

# 4. Train on Full History
optimizer = optim.Adam(final_model.parameters(), lr=bp['lr'])
criterion = nn.MSELoss()
train_loader = DataLoader(TensorDataset(X_train_full_tensor, y_train_full_tensor), 
                          batch_size=bp['batch_size'], shuffle=True)

print("Training Final LSTM Model...")
final_model.train()
for epoch in range(150):
    for batch_X, batch_y in train_loader:
        optimizer.zero_grad()
        outputs = final_model(batch_X)
        loss = criterion(outputs, batch_y)
        loss.backward()
        optimizer.step()

# 5. Final Prediction on TEST Data
final_model.eval()
with torch.no_grad():
    y_pred_final_test = final_model(X_test_tensor).cpu().numpy().flatten()

rmse_final_test = np.sqrt(mean_squared_error(y_test_seq, y_pred_final_test))
r2_final_test = r2_score(y_test_seq, y_pred_final_test)

### 8. Results and Analysis
We evaluate the final performance on the Test data.
* **Comparison:** We check if the Tuned Model beats the Baseline and the Initial Model.
* **Trend Analysis:** We plot the predicted yields against actual yields over time.

In [None]:
# Calculate Improvement %
imp_final = (rmse_baseline - rmse_final_test) / rmse_baseline * 100

print("--- Final Performance Report (Test Set) ---")
print(f"Baseline RMSE (Aligned): {rmse_baseline:.2f}")
print(f"Initial LSTM Model RMSE: {rmse_init_test:.2f}")
print(f"Tuned LSTM Model RMSE:   {rmse_final_test:.2f} (Improvement: {imp_final:.2f}%)")
print(f"Tuned LSTM R2:           {r2_final_test:.4f}")

# --- PLOTTING RESULTS ---
fig, axes = plt.subplots(1, 3, figsize=(18, 6), sharey=True)

# Axis Limits
all_preds = np.concatenate([y_pred_clean, y_pred_init_test, y_pred_final_test])
all_true = np.concatenate([y_test_clean, y_test_seq, y_test_seq])
min_val, max_val = min(min(all_preds), min(all_true)), max(max(all_preds), max(all_true))

# 1. Baseline Plot
axes[0].scatter(y_test_clean, y_pred_clean, alpha=0.4, color='blue')
axes[0].plot([min_val, max_val], [min_val, max_val], 'r--', linewidth=2)
axes[0].set_title(f'Baseline Model\nRMSE: {rmse_baseline:.2f}')

# 2. Initial Model Plot
axes[1].scatter(y_test_seq, y_pred_init_test, alpha=0.4, color='orange')
axes[1].plot([min_val, max_val], [min_val, max_val], 'r--', linewidth=2)
axes[1].set_title(f'Initial LSTM Model\nRMSE: {rmse_init_test:.2f}')

# 3. Tuned Model Plot
axes[2].scatter(y_test_seq, y_pred_final_test, alpha=0.4, color='green')
axes[2].plot([min_val, max_val], [min_val, max_val], 'r--', linewidth=2)
axes[2].set_title(f'Tuned LSTM Model\nRMSE: {rmse_final_test:.2f}')

plt.suptitle(f'{CHOSEN_CROP.capitalize()} Yield: Performance Comparison (Actual vs Predicted)', fontsize=16)
plt.show()

In [None]:
# --- FULL TIMELINE PLOT ---
# To plot the full timeline, we generate sequences for the entire dataset.
# Missing values are imputed before scaling, exactly as for the train/val/test
# splits; otherwise NaNs would propagate into the predictions.
# Note: create_sequences trims the first (SEQ_LEN-1) rows.
X_all_imputed = pd.DataFrame(imputer.transform(df_model[feature_cols]), columns=feature_cols)
X_all_scaled = scaler.transform(X_all_imputed)
X_all_seq, _ = create_sequences(X_all_scaled, df_model[TARGET_COL].values, SEQ_LEN)
X_all_tensor = torch.tensor(X_all_seq, dtype=torch.float32).to(device)

final_model.eval()
with torch.no_grad():
    all_predictions = final_model(X_all_tensor).cpu().numpy().flatten()

# Align years and actuals with the predictions (drop the first SEQ_LEN-1 rows)
aligned_years = df_model['year'].values[SEQ_LEN - 1:]
aligned_actuals = df_model[TARGET_COL].values[SEQ_LEN - 1:]

# 2. Create DataFrame
df_full_trend = pd.DataFrame({
    'Year': aligned_years,
    'Actual': aligned_actuals,
    'Predicted': all_predictions
})

# Aggregate by Year
yearly_trend = df_full_trend.groupby('Year').mean()

# 3. Plotting
plt.figure(figsize=(14, 7))

# Plot Lines
plt.plot(yearly_trend.index, yearly_trend['Actual'], 
         marker='o', label='Actual Yield', linewidth=2, color='blue')
plt.plot(yearly_trend.index, yearly_trend['Predicted'], 
         marker='x', linestyle='--', label='Predicted Yield', linewidth=2, color='orange')

# Define Split Boundaries
MIN_YEAR = yearly_trend.index.min()
MAX_YEAR = yearly_trend.index.max()
train_boundary = TRAIN_END_YEAR - 0.5
val_boundary = VAL_END_YEAR - 0.5

# --- Highlight Periods ---
plt.axvspan(MIN_YEAR - 0.5, train_boundary, color='green', alpha=0.1, label=f'Train (<{TRAIN_END_YEAR})')
plt.axvspan(train_boundary, val_boundary, color='yellow', alpha=0.1, label=f'Validation ({TRAIN_END_YEAR}-{VAL_END_YEAR - 1})')
plt.axvspan(val_boundary, MAX_YEAR + 0.5, color='red', alpha=0.1, label=f'Test (>={VAL_END_YEAR})')

# Add Text Labels
y_max = yearly_trend['Actual'].max()
text_y = y_max * 1.05 

plt.text((MIN_YEAR + train_boundary)/2, text_y, 'TRAINING', ha='center', fontsize=12, fontweight='bold', color='green')
plt.text((train_boundary + val_boundary)/2, text_y, 'VALIDATION', ha='center', fontsize=12, fontweight='bold', color='#D4AC0D')
plt.text((val_boundary + MAX_YEAR)/2, text_y, 'TESTING', ha='center', fontsize=12, fontweight='bold', color='red')

# Final Formatting
plt.title(f'Full Timeline Analysis: Actual vs. Predicted Yield ({CHOSEN_CROP}) - LSTM', fontsize=16)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Yield (hg/ha)', fontsize=12)
plt.legend(loc='upper left')
plt.grid(True, alpha=0.3)
plt.xticks(np.arange(MIN_YEAR, MAX_YEAR + 1, 2))
plt.xlim(MIN_YEAR - 0.5, MAX_YEAR + 0.5)

plt.tight_layout()
plt.show()

* **Geographic Error:** We map the error rates by country.
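
The map is coloured by root mean squared percentage error (RMSPE): each error is divided by the actual value before squaring, then averaged per country and scaled to a percentage. A minimal sketch of that calculation on made-up rows (the country labels and values are illustrative, not results):

```python
import math
from collections import defaultdict

# (country, actual, predicted): illustrative values only.
rows = [
    ("A", 100.0, 90.0),
    ("A", 200.0, 220.0),
    ("B", 50.0, 50.0),
    ("B", 80.0, 60.0),
]

# Collect the squared relative error of every row, grouped by country.
squared_pct_errors = defaultdict(list)
for country, actual, predicted in rows:
    squared_pct_errors[country].append(((actual - predicted) / actual) ** 2)

# RMSPE = 100 * sqrt(mean of squared relative errors)
rmspe = {
    country: 100 * math.sqrt(sum(errs) / len(errs))
    for country, errs in squared_pct_errors.items()
}
print({c: round(v, 2) for c, v in rmspe.items()})  # {'A': 10.0, 'B': 17.68}
```

Because the error is relative, countries with very different yield levels can share one colour scale, which a plain RMSE map would not allow.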

In [None]:
# --- RE-CREATE COMPARISON DF WITH METADATA JOINED ---
# We need the original 'area' column. create_sequences was applied to X_test,
# whose rows correspond to df_model[mask_test], and it dropped the first
# (SEQ_LEN-1) rows of that split. The LSTM predictions therefore align with
# rows [SEQ_LEN-1:] of the test chunk.

mask_test = df_model['year'] >= VAL_END_YEAR
df_test_full = df_model[mask_test]

# Slice the df to match sequence output length
# create_sequences removes first (SEQ_LEN-1) rows
df_test_aligned = df_test_full.iloc[SEQ_LEN-1 : ].copy()

comparison_df = pd.DataFrame({
    'Actual_Value': y_test_seq, # Aligned targets
    'Predicted_Value': y_pred_final_test
}, index=df_test_aligned.index)

# Join metadata
comparison_df = comparison_df.join(df_test_aligned[['area', 'year']])
comparison_df = comparison_df[['year', 'area', 'Actual_Value', 'Predicted_Value']]

print("--- Actual vs. Predicted Test Set Results ---")
print(comparison_df.head())

In [None]:
import plotly.express as px

# Name Cleaning for Map Plotting
comparison_df['area'] = comparison_df['area'].replace({
    'United_States_of_America': 'United States',
    'United_Kingdom_of_Great_Britain_and_Northern_Ireland': 'United Kingdom',
    'Russian_Federation': 'Russia',
    'Viet_Nam': 'Vietnam',
    'Türkiye': 'Turkey',
    'Bolivia_(Plurinational_State_of)': 'Bolivia',
    'Iran_(Islamic_Republic_of)': 'Iran',
    "Lao_People's_Democratic_Republic": 'Laos',
    'China,_mainland': 'China',
    'China,_Taiwan_Province_of': 'Taiwan',
    "Democratic_People's_Republic_of_Korea": 'North Korea',
    'Republic_of_Korea': 'South Korea',
    'Côte_d\'Ivoire': "Cote d'Ivoire",
    'United_Republic_of_Tanzania': 'Tanzania',
    'Micronesia_(Federated_States_of)': 'Micronesia',
    'Venezuela_(Bolivarian_Republic_of)': 'Venezuela'
})

def plot_geographic_error(comparison_df):
    # Squared Error (for RMSE)
    comparison_df['Squared_Error'] = (comparison_df['Actual_Value'] - comparison_df['Predicted_Value']) ** 2
    # Squared Percentage Error (for RMSPE)
    epsilon = 1e-6 
    comparison_df['Squared_Percentage_Error'] = (
        (comparison_df['Actual_Value'] - comparison_df['Predicted_Value']) / 
        (comparison_df['Actual_Value'] + epsilon)
    ) ** 2

    # Aggregate Errors by Country
    rmse_df = (
        comparison_df.groupby('area')['Squared_Error']
        .mean().apply(np.sqrt).reset_index()
        .rename(columns={'area': 'Country', 'Squared_Error': 'RMSE'})
    )
    rmspe_df = (
        comparison_df.groupby('area')['Squared_Percentage_Error']
        .mean().apply(np.sqrt).multiply(100).reset_index()
        .rename(columns={'area': 'Country', 'Squared_Percentage_Error': 'RMSPE'})
    )
    ap_df = comparison_df.groupby('area')[['Actual_Value', 'Predicted_Value']].mean().reset_index()
    ap_df = ap_df.rename(columns={'area': 'Country'})

    # Merge stats
    error_stats = rmspe_df.merge(rmse_df, on='Country', how='left')
    error_stats = error_stats.merge(ap_df, on='Country', how='left') 

    # Plot
    fig = px.choropleth(
        error_stats,
        locations='Country',
        color='RMSPE',
        locationmode='country names',
        color_continuous_scale=['green', 'red'], 
        range_color=[0, 50], 
        title='Geographic Distribution of Prediction Error (RMSPE)',
        labels={'RMSPE': 'RMSPE (%)'},
        hover_name='Country',
        hover_data={'RMSPE': ':.2f', 'RMSE': ':.2f', 'Actual_Value': ':.2f', 'Predicted_Value': ':.2f'},
        projection='natural earth'
    )
    fig.update_layout(
        title_font_size=18,
        coloraxis_colorbar=dict(title='RMSPE (%)', orientation='h', len=0.5, yanchor='bottom', y=-0.12),
        geo=dict(showframe=False, showcoastlines=True, showcountries=True, countrycolor='black', bgcolor='lightgrey')
    )
    fig.show()

plot_geographic_error(comparison_df)

### 9. Key Factors (Feature Importance)
To estimate feature importance for an LSTM, we use **Permutation Importance** on the 2D input data. We wrap sequence generation and prediction in a single object so that sklearn can shuffle one feature column at a time; the wrapper then rebuilds the sequences, and the resulting increase in error measures how much the model relies on that feature.
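
The idea behind permutation importance can be shown without any library support. The sketch below uses a toy target and a stand-in `predict` function, both hypothetical, rather than the LSTM. It shuffles one column at a time and records how much the RMSE grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: the target depends strongly on column 0 and not at all on column 1.
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0]

def predict(features):
    """Stand-in for a trained model (hypothetical): it has learned y = 3 * x0."""
    return 3.0 * features[:, 0]

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

baseline = rmse(y, predict(X))
increase = []
for col in range(X.shape[1]):
    X_perm = X.copy()
    X_perm[:, col] = rng.permutation(X_perm[:, col])  # break the feature-target link
    increase.append(rmse(y, predict(X_perm)) - baseline)

print(increase)  # large increase for column 0, exactly 0.0 for column 1
```

With sequence inputs, the same shuffle is applied to the 2D matrix before the windows are rebuilt, so a shuffled feature is disturbed at every time step of every window.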

In [None]:
# --- WRAPPER FOR SKLEARN COMPATIBILITY WITH 2D INPUT TO 3D SEQUENCE ---
class LSTMSeqWrapper:
    """
    Wrapper to allow Sklearn's permutation_importance to work with the LSTM.
    It takes 2D data (N, Feats), converts it to 3D sequences (N-Seq+1, Seq, Feats) internally,
    and returns predictions aligned to the targets.
    """
    def __init__(self, model, device, seq_len):
        self.model = model
        self.device = device
        self.seq_len = seq_len

    def fit(self, X, y):
        pass # Already trained

    def predict(self, X):
        # X is 2D numpy array (N_samples, N_features)
        # Create sequences
        X_seq, _ = create_sequences(X, np.zeros(len(X)), self.seq_len)
        
        self.model.eval()
        with torch.no_grad():
            X_tensor = torch.tensor(X_seq, dtype=torch.float32).to(self.device)
            preds = self.model(X_tensor).cpu().numpy().flatten()
        return preds

# --- CALCULATE PERMUTATION IMPORTANCE ---
wrapped_model = LSTMSeqWrapper(final_model, device, SEQ_LEN)

# Use validation set (2D) for importance calculation
# We must align y_val to match the output of the wrapper (which trims SEQ_LEN-1)
y_val_aligned = y_val.values[SEQ_LEN - 1:]

results = permutation_importance(
    wrapped_model, 
    X_val, # Pass 2D data
    y_val_aligned, 
    scoring='neg_root_mean_squared_error', 
    n_repeats=5, 
    random_state=42
)

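# --- PROCESS RESULTS ---
# With a negated-RMSE scorer, importances_mean is the increase in RMSE after shuffling.
# Negative values (shuffling did not hurt) are kept as-is rather than made positive.
importance_means = results.importances_mean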
feature_names = np.array(feature_cols)

# Create DataFrame
fi_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importance_means
}).sort_values(by='Importance', ascending=False).reset_index(drop=True)

# Print Top 20
print("\n--- Top 20 Most Important Features (Permutation Importance) ---")
print(fi_df.head(20))

# --- PLOT ---
plt.figure(figsize=(12, 8))
sns.barplot(x='Importance', y='Feature', data=fi_df.head(20), palette='viridis')
plt.title(f'Feature Importance (Permutation) - {CHOSEN_CROP.capitalize()} LSTM Model', fontsize=15)
plt.xlabel('Increase in RMSE when shuffled')
plt.ylabel('Feature')
plt.tight_layout()
plt.show()