#### Multi-class, top-k prediction task

Overview

Welcome to the 2025 Kaggle Playground Series! We plan to continue in the spirit of previous playgrounds, providing interesting and approachable datasets for our community to practice their machine learning skills, and anticipate a competition each month.

Your Goal: Your objective is to select the best fertilizer for different weather, soil conditions and crops.

Dataset Description

The dataset for this competition (both train and test) was generated from a deep learning model trained on the Fertilizer prediction dataset. Feature distributions are close to, but not exactly the same as, the original. Feel free to use the original dataset as part of this competition, both to explore differences and to see whether incorporating the original in training improves model performance.

Files

* train.csv - the training dataset; Fertilizer Name is the categorical target
* test.csv - the test dataset; your objective is to predict the Fertilizer Name for each row, up to three values, space-delimited.
* sample_submission.csv - a sample submission file in the correct format.

Submission File

For each id in the test set, you may predict up to 3 Fertilizer Name values, with the predictions space delimited. The file should contain a header and have the following format:

id,Fertilizer Name
750000,14-35-14 10-26-26 Urea
750001,14-35-14 10-26-26 Urea
...
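As a rough illustration of this format, the sketch below builds a two-row submission from hypothetical ranked predictions and writes it to an in-memory buffer. The ids and label lists are made-up placeholders, not competition results.

```python
import io
import pandas as pd

# Hypothetical ranked predictions (best guess first) for two test ids.
ids = [750000, 750001]
ranked = [["14-35-14", "10-26-26", "Urea"], ["DAP", "Urea", "28-28"]]

submission = pd.DataFrame({
    "id": ids,
    # keep at most three labels per row and join them with single spaces
    "Fertilizer Name": [" ".join(labels[:3]) for labels in ranked],
})

# Write to an in-memory buffer instead of disk so the CSV text can be inspected.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```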

In [6]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV
from sklearn.preprocessing import LabelEncoder, PolynomialFeatures, StandardScaler
from sklearn.cluster import KMeans

from xgboost import XGBClassifier

In [43]:
df = pd.read_csv("../Data/train.csv")
X_test = pd.read_csv("../Data/test.csv")
test_ids = X_test[["id"]]

In [44]:
# Feature engineering
def feat_eng(df):
    """Return a copy of df with ratio, aggregate, cluster and interaction features added."""
    df = df.copy()  # leave the caller's frame untouched
    poly_cols = ['Nitrogen', 'Potassium', 'Phosphorous']

    # nutrient ratios; a zero denominator is replaced by 0.1 to avoid division by zero
    df['N_to_P'] = df['Nitrogen'] / df['Phosphorous'].replace(0, 0.1)
    df['N_to_K'] = df['Nitrogen'] / df['Potassium'].replace(0, 0.1)
    df['P_to_K'] = df['Phosphorous'] / df['Potassium'].replace(0, 0.1)
    df['NPK_sum'] = df['Nitrogen'] + df['Phosphorous'] + df['Potassium']
    df['SoilCropCombo'] = df['Soil Type'] + '_' + df['Crop Type']
    # 'Temparature' is how the column is spelled in the data
    df['Climate_Index'] = (df['Temparature'] + df['Humidity']) / 2
    df['Water_Stress'] = df['Humidity'] - df['Moisture']

    # cluster N, P, K combinations
    # NOTE: KMeans is refit on every call, so cluster ids assigned to the
    # train and test frames are not guaranteed to refer to the same clusters
    kmeans = KMeans(n_clusters=5, random_state=42)
    df['NutrientCluster'] = kmeans.fit_predict(df[poly_cols])

    # create pairwise interaction terms (the output also contains the original N, K, P columns)
    poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=True)
    poly_feats = poly.fit_transform(df[poly_cols])
    poly_feats_names = poly.get_feature_names_out(poly_cols)
    # reuse df's index so the join below lines up row by row
    df_poly_feats = pd.DataFrame(poly_feats, columns=poly_feats_names, index=df.index)
    df = df.drop(columns=poly_cols).join(df_poly_feats)

    return df

df = feat_eng(df)
X_test = feat_eng(X_test)
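Because feat_eng fits a new KMeans each time it is called, the cluster ids computed for the training frame and for the test frame come from two separate fits and need not line up. One possible way around this, sketched here on made-up synthetic data rather than the competition files, is to fit the clustering on the training rows only and reuse the fitted object for the test rows. The names `NUTRIENTS`, `train_demo` and `test_demo` are placeholders introduced for this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

NUTRIENTS = ["Nitrogen", "Potassium", "Phosphorous"]

# Tiny synthetic stand-ins for the train and test frames (values are made up).
rng = np.random.default_rng(0)
train_demo = pd.DataFrame(rng.integers(0, 43, size=(200, 3)), columns=NUTRIENTS)
test_demo = pd.DataFrame(rng.integers(0, 43, size=(50, 3)), columns=NUTRIENTS)

# Fit the clustering on the training rows only...
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
train_demo["NutrientCluster"] = kmeans.fit_predict(train_demo[NUTRIENTS])

# ...then assign test rows to the centroids learned from the training rows.
test_demo["NutrientCluster"] = kmeans.predict(test_demo[NUTRIENTS])
```

With this arrangement a given cluster id refers to the same centroid in both frames.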

In [45]:
y = df["Fertilizer Name"]
X = df.drop(columns=["id", "Fertilizer Name", 'Soil Type', 'Crop Type'])
X_test = X_test.drop(columns=["id", 'Soil Type', 'Crop Type'])
num_cols = X.select_dtypes(include=['number']).columns.tolist()
X.head()

,Temparature,Humidity,Moisture,N_to_P,N_to_K,P_to_K,NPK_sum,SoilCropCombo,Climate_Index,Water_Stress,NutrientCluster,Nitrogen,Potassium,Phosphorous,Nitrogen Potassium,Nitrogen Phosphorous,Potassium Phosphorous
0,37,70,36,7.2,9.0,1.25,45,Clayey_Sugarcane,53.5,34,2,36.0,4.0,5.0,144.0,180.0,20.0
1,27,69,65,1.666667,5.0,3.0,54,Sandy_Millets,48.0,4,1,30.0,6.0,18.0,180.0,540.0,108.0
2,29,63,32,1.5,2.0,1.333333,52,Sandy_Millets,46.0,31,1,24.0,12.0,16.0,288.0,384.0,192.0
3,35,62,54,9.75,3.25,0.333333,55,Sandy_Barley,48.5,8,2,39.0,12.0,4.0,468.0,156.0,48.0
4,35,58,43,2.3125,18.5,8.0,55,Red_Paddy,46.5,15,2,37.0,2.0,16.0,74.0,592.0,32.0


In [46]:
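encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)
# one-hot encode the soil/crop combination
# NOTE: .astype(int) casts every column, so float features such as N_to_P are truncated (7.2 -> 7)
X_encoded = pd.get_dummies(X, columns=['SoilCropCombo'], prefix='', prefix_sep='').astype(int)
X_test = pd.get_dummies(X_test, columns=['SoilCropCombo'], prefix='', prefix_sep='').astype(int)
X_encoded.head()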

,Temparature,Humidity,Moisture,N_to_P,N_to_K,P_to_K,NPK_sum,Climate_Index,Water_Stress,NutrientCluster,...,Sandy_Cotton,Sandy_Ground Nuts,Sandy_Maize,Sandy_Millets,Sandy_Oil seeds,Sandy_Paddy,Sandy_Pulses,Sandy_Sugarcane,Sandy_Tobacco,Sandy_Wheat
0,37,70,36,7,9,1,45,53,34,2,...,0,0,0,0,0,0,0,0,0,0
1,27,69,65,1,5,3,54,48,4,1,...,0,0,0,1,0,0,0,0,0,0
2,29,63,32,1,2,1,52,46,31,1,...,0,0,0,1,0,0,0,0,0,0
3,35,62,54,9,3,0,55,48,8,2,...,0,0,0,0,0,0,0,0,0,0
4,35,58,43,2,18,8,55,46,15,2,...,0,0,0,0,0,0,0,0,0,0
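The preview above shows a side effect of casting the whole frame to int: ratio features are truncated (N_to_P goes from 7.2 to 7 in the first row). A minimal sketch of an alternative, on made-up miniature frames, passes `dtype=int` to `pd.get_dummies` so that only the indicator columns become integers. It also reindexes the test frame to the training columns, in case a category is absent from one of the two frames. The frame names here are placeholders for illustration.

```python
import pandas as pd

# Made-up miniature frames; the test frame lacks one category on purpose.
train_demo = pd.DataFrame({
    "N_to_P": [7.2, 1.666667, 1.5],
    "SoilCropCombo": ["Clayey_Sugarcane", "Sandy_Millets", "Sandy_Barley"],
})
test_demo = pd.DataFrame({
    "N_to_P": [9.75, 2.3125],
    "SoilCropCombo": ["Sandy_Millets", "Clayey_Sugarcane"],
})

# dtype=int applies only to the new indicator columns, so floats stay floats.
train_enc = pd.get_dummies(train_demo, columns=["SoilCropCombo"],
                           prefix="", prefix_sep="", dtype=int)
test_enc = pd.get_dummies(test_demo, columns=["SoilCropCombo"],
                          prefix="", prefix_sep="", dtype=int)

# Give the test frame exactly the training columns, filling absent dummies with 0.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_enc)
```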


In [47]:
X_train, X_val, y_train, y_val = train_test_split(X_encoded, y_encoded, test_size=0.4, random_state=42)

In [48]:
# standardisation: the scaler is fitted on the training split only and then applied to validation and test
# num_cols = ['Temparature', 'Humidity', 'Moisture', 'Nitrogen', 'Phosphorous', 'Potassium', 'N_to_P', 'N_to_K', 'P_to_K', 'NPK_sum']
scaler = StandardScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_val[num_cols] = scaler.transform(X_val[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])
X_train.head()

,Temparature,Humidity,Moisture,N_to_P,N_to_K,P_to_K,NPK_sum,Climate_Index,Water_Stress,NutrientCluster,...,Sandy_Cotton,Sandy_Ground Nuts,Sandy_Maize,Sandy_Millets,Sandy_Oil seeds,Sandy_Paddy,Sandy_Pulses,Sandy_Sugarcane,Sandy_Tobacco,Sandy_Wheat
670234,-0.871701,-0.00502,-0.864396,-0.188331,-0.275072,-0.22654,0.530668,-0.518195,0.751824,1.363635,...,0,0,0,0,0,0,0,0,0,0
702597,0.618721,1.349136,1.169983,-0.163054,-0.257609,-0.244988,0.473914,1.533363,-0.357705,-0.73379,...,0,0,0,0,0,0,0,0,0,0
566943,1.612335,-0.757329,-0.440567,-0.188331,-0.257609,-0.244988,1.211718,0.251139,0.012138,0.664494,...,0,0,0,0,0,0,0,0,0,0
256182,-0.623297,0.145442,-1.457756,-0.163054,-0.170293,-0.189645,0.360405,-0.26175,1.343572,-0.73379,...,1,0,0,0,0,0,0,0,0,0
511753,-0.871701,0.446365,-0.864396,-0.137776,-0.275072,-0.263436,-2.363794,-0.005306,0.97373,-1.432931,...,0,0,0,0,0,0,0,0,0,0


In [49]:
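# function for MAP@3
def mapk(actual, predicted, k=3):
    """Mean average precision at k when each row has exactly one true label.

    actual: iterable of true labels; predicted: iterable of ranked label lists.
    """
    score = 0.0
    for a, p in zip(actual, predicted):
        top_k = list(p[:k])
        if a in top_k:
            # reciprocal rank: 1 for first place, 1/2 for second, 1/3 for third
            score += 1.0 / (top_k.index(a) + 1)
    return score / len(actual)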
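To make the scoring concrete, here is a small worked example on made-up labels. With one true label per row, each row contributes 1, 1/2 or 1/3 depending on where the true label appears in the top three, and 0 if it is missing. The function is repeated so that the block runs on its own.

```python
def mapk(actual, predicted, k=3):
    """Reciprocal rank of the true label within the top k, averaged over rows."""
    score = 0.0
    for a, p in zip(actual, predicted):
        if a in p[:k]:
            score += 1.0 / (p.index(a) + 1)
    return score / len(actual)

# Made-up labels for three rows.
actual = ["Urea", "DAP", "28-28"]
predicted = [
    ["Urea", "DAP", "20-20"],   # hit at rank 1 -> 1.0
    ["20-20", "DAP", "Urea"],   # hit at rank 2 -> 0.5
    ["Urea", "DAP", "20-20"],   # miss          -> 0.0
]

demo_score = mapk(actual, predicted)
print(demo_score)  # (1.0 + 0.5 + 0.0) / 3 = 0.5
```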

In [50]:
params = {
    'max_depth': [3, 5, 7, 10],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 200, 300],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'gamma': [0, 0.1, 0.3],
}

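xgb = XGBClassifier(eval_metric='mlogloss')
# no random_state is passed, so each run may sample a different set of 20 candidates
# candidates are ranked by log loss, not by MAP@3
search = RandomizedSearchCV(xgb, params, scoring='neg_log_loss', cv=3, n_iter=20, verbose=4)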
search.fit(X_train, y_train)
xgb_model = search.best_estimator_

Fitting 3 folds for each of 20 candidates, totalling 60 fits
[CV 1/3] END colsample_bytree=1.0, gamma=0.1, learning_rate=0.01, max_depth=3, n_estimators=100, subsample=0.8;, score=-1.942 total time=   4.2s
[CV 2/3] END colsample_bytree=1.0, gamma=0.1, learning_rate=0.01, max_depth=3, n_estimators=100, subsample=0.8;, score=-1.942 total time=   4.3s
[CV 3/3] END colsample_bytree=1.0, gamma=0.1, learning_rate=0.01, max_depth=3, n_estimators=100, subsample=0.8;, score=-1.942 total time=   3.9s
[CV 1/3] END colsample_bytree=0.6, gamma=0.1, learning_rate=0.1, max_depth=10, n_estimators=100, subsample=1.0;, score=-1.933 total time=   7.1s
[CV 2/3] END colsample_bytree=0.6, gamma=0.1, learning_rate=0.1, max_depth=10, n_estimators=100, subsample=1.0;, score=-1.933 total time=   6.8s
[CV 3/3] END colsample_bytree=0.6, gamma=0.1, learning_rate=0.1, max_depth=10, n_estimators=100, subsample=1.0;, score=-1.934 total time=   6.9s
[CV 1/3] END colsample_bytree=0.6, gamma=0.1, learning_rate=0.1, max_
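The search above ranks candidates by log loss, while the validation cell reports MAP@3. If one wanted the search to optimise MAP@3 directly, scikit-learn accepts any callable with the signature `scorer(estimator, X, y)` as `scoring`. The sketch below is one possible way to write such a scorer. `map3_scorer` is a hypothetical helper, and a small logistic regression on synthetic data stands in for the XGBoost model so the example stays quick to run.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def map3_scorer(estimator, X, y):
    """MAP@3 for single-label targets: mean reciprocal rank within the top 3."""
    probs = estimator.predict_proba(X)
    # Column indices of the three most probable classes, best first.
    top3 = np.argsort(probs, axis=1)[:, ::-1][:, :3]
    top3_labels = estimator.classes_[top3]
    # Boolean matrix marking where the true label sits among the top three.
    hits = top3_labels == np.asarray(y)[:, None]
    # Reciprocal-rank weights for positions 1, 2 and 3.
    weights = 1.0 / np.arange(1, 4)
    return float((hits * weights).sum(axis=1).mean())

# Synthetic 5-class problem standing in for the fertilizer data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=5, random_state=0
)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=3, scoring=map3_scorer
)
print(scores)
```

The same callable could be passed as `scoring=map3_scorer` to `RandomizedSearchCV`; whether that improves the final score here is untested.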

In [51]:
xgb_probs = xgb_model.predict_proba(X_val)
xgb_top3 = np.argsort(xgb_probs, axis=1)[:, -3:][:, ::-1]
xgb_labels = [[encoder.classes_[i] for i in row] for row in xgb_top3]
actual_labels = [encoder.classes_[i] for i in y_val]
print("XGBoost MAP@3:", mapk(actual_labels, xgb_labels))

# run 1: XGBoost MAP@3: 0.31670999999987165
# run 2: XGBoost MAP@3: 0.3151749999998766

XGBoost MAP@3: 0.3188522222220906
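The top-3 selection relies on a compact argsort idiom. The sketch below walks through it on a made-up probability matrix: `np.argsort` sorts ascending, so the last three columns hold the three most probable classes, and reversing them puts the most probable first.

```python
import numpy as np

# Made-up probabilities for two rows over five classes.
probs = np.array([
    [0.10, 0.40, 0.05, 0.30, 0.15],
    [0.25, 0.05, 0.35, 0.20, 0.15],
])

# argsort is ascending, so the last three columns are the top three classes...
ascending_top3 = np.argsort(probs, axis=1)[:, -3:]
# ...and reversing them puts the most probable class first.
top3 = ascending_top3[:, ::-1]
print(top3)
```

For the first row the result is `[1, 3, 4]`, matching probabilities 0.40, 0.30 and 0.15.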


In [54]:
y_probs = xgb_model.predict_proba(X_test)
# indices of the three most probable classes per row, most probable first
top_3_preds = np.argsort(y_probs, axis=1)[:, -3:][:, ::-1]
top_3_labels = [[encoder.classes_[idx] for idx in row] for row in top_3_preds]
top_3_str = [' '.join(preds) for preds in top_3_labels]

# pair each test id with its space-delimited top-3 prediction
test_final = pd.DataFrame({"id": test_ids["id"].to_numpy(), "Fertilizer Name": top_3_str}).set_index("id")
test_final.to_csv("prediction_xgb_model_ext.csv")
test_final

id,Fertilizer Name
750000,28-28 DAP 10-26-26
750001,17-17-17 20-20 10-26-26
750002,10-26-26 28-28 20-20
750003,14-35-14 17-17-17 Urea
750004,20-20 10-26-26 17-17-17
...,...
999995,20-20 14-35-14 17-17-17
999996,14-35-14 10-26-26 20-20
999997,14-35-14 28-28 10-26-26
999998,DAP 14-35-14 17-17-17
