#### Multi-class, top-k prediction task

Overview

Welcome to the 2025 Kaggle Playground Series! We plan to continue in the spirit of previous playgrounds, providing interesting and approachable datasets for our community to practice their machine learning skills, and anticipate a competition each month.

Your Goal: Your objective is to select the best fertilizer for different weather, soil conditions and crops.

Dataset Description

The dataset for this competition (both train and test) was generated from a deep learning model trained on the Fertilizer prediction dataset. Feature distributions are close to, but not exactly the same as, the original. Feel free to use the original dataset as part of this competition, both to explore the differences and to see whether incorporating the original in training improves model performance.
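One way to try the suggestion above is to stack the original rows under the synthetic ones and keep a flag recording where each row came from. The sketch below is a minimal illustration, not part of the notebook: `combine_with_original` and the `is_original` column are hypothetical names, and the two small frames are toy stand-ins for the real files.

```python
import pandas as pd


def combine_with_original(synthetic: pd.DataFrame, original: pd.DataFrame) -> pd.DataFrame:
    """Stack synthetic and original rows, keeping only the columns they share.

    The added `is_original` flag (a hypothetical name) lets a model, or a
    later filter, tell the two sources apart.
    """
    shared = [c for c in synthetic.columns if c in original.columns]
    return pd.concat(
        [
            synthetic[shared].assign(is_original=0),
            original[shared].assign(is_original=1),
        ],
        ignore_index=True,
    )


# Toy stand-ins for train.csv and the original dataset (illustrative values only).
synthetic = pd.DataFrame(
    {"id": [0, 1], "Nitrogen": [36, 30], "Fertilizer Name": ["28-28", "DAP"]}
)
original = pd.DataFrame({"Nitrogen": [37], "Fertilizer Name": ["Urea"]})

combined = combine_with_original(synthetic, original)
print(combined)
```

The `id` column is dropped here because the toy original frame does not have one. Whether the extra rows help is an empirical question to check against a validation score.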

Files

* train.csv - the training dataset; Fertilizer Name is the categorical target
* test.csv - the test dataset; your objective is to predict the Fertilizer Name for each row, up to three values, space-delimited.
* sample_submission.csv - a sample submission file in the correct format.

Submission File

For each id in the test set, you may predict up to 3 Fertilizer Name values, with the predictions space delimited. The file should contain a header and have the following format:

id,Fertilizer Name

750000,14-35-14 10-26-26 Urea

...
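The format above can be produced from a matrix of class probabilities by taking the three most probable classes per row and joining their names with spaces. This is a minimal sketch under assumed inputs: `top_k_strings` is a hypothetical helper, and the probabilities and the four class names are made up for illustration.

```python
import numpy as np


def top_k_strings(probs, class_names, k=3):
    """Return one space-delimited string of the k most probable class names per row."""
    probs = np.asarray(probs)
    class_names = np.asarray(class_names)
    # Sorting the negated probabilities puts the most probable class first.
    top_idx = np.argsort(-probs, axis=1, kind="stable")[:, :k]
    return [" ".join(class_names[row]) for row in top_idx]


# Illustrative inputs: 2 rows, 4 classes.
classes = ["10-26-26", "14-35-14", "DAP", "Urea"]
probs = [
    [0.10, 0.50, 0.05, 0.35],
    [0.40, 0.10, 0.30, 0.20],
]

rows = top_k_strings(probs, classes)
print(rows)  # ['14-35-14 Urea 10-26-26', '10-26-26 DAP Urea']
```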

In [2]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

from xgboost import XGBClassifier

In [3]:
df = pd.read_csv("../Data/train.csv")
X_test = pd.read_csv("../Data/test.csv")
test_ids = X_test[["id"]]

In [4]:
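# Feature engineering: nutrient ratios (zero denominators are replaced by 0.1
# to avoid division by zero), the NPK total, and a combined soil/crop category.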
def feat_eng(df):
    df['N_to_P'] = df['Nitrogen'] / df['Phosphorous'].replace(0, 0.1)
    df['N_to_K'] = df['Nitrogen'] / df['Potassium'].replace(0, 0.1)
    df['P_to_K'] = df['Phosphorous'] / df['Potassium'].replace(0, 0.1)
    df['NPK_sum'] = df['Nitrogen'] + df['Phosphorous'] + df['Potassium']
    df['SoilCropCombo'] = df['Soil Type'] + '_' + df['Crop Type']
    return df

df = feat_eng(df)
X_test = feat_eng(X_test)
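To make the ratio features concrete, the sketch below recomputes them on two small rows. The first row uses the nutrient values shown for row 0 in the output further down, and the second is a made-up row with zero denominators. `add_npk_features` is a hypothetical, non-mutating variant of `feat_eng` that covers only the numeric features.

```python
import pandas as pd


def add_npk_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with nutrient ratio and sum features added."""
    out = df.copy()
    # Replace zero denominators with 0.1 so the division never hits zero.
    p = out["Phosphorous"].replace(0, 0.1)
    k = out["Potassium"].replace(0, 0.1)
    out["N_to_P"] = out["Nitrogen"] / p
    out["N_to_K"] = out["Nitrogen"] / k
    out["P_to_K"] = out["Phosphorous"] / k
    out["NPK_sum"] = out[["Nitrogen", "Phosphorous", "Potassium"]].sum(axis=1)
    return out


rows = pd.DataFrame(
    {
        "Nitrogen": [36, 10],
        "Potassium": [4, 0],    # second row: zero denominator
        "Phosphorous": [5, 0],  # second row: zero denominator
    }
)
features = add_npk_features(rows)
print(features)
```

For the first row this gives `N_to_P = 36 / 5 = 7.2`, `N_to_K = 9.0`, `P_to_K = 1.25` and `NPK_sum = 45`. In the second row the guarded denominators turn `10 / 0` into `10 / 0.1`, which is a large but finite value.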

In [5]:
y = df["Fertilizer Name"]
X = df.drop(columns=["id", "Fertilizer Name", 'Soil Type', 'Crop Type'])
X_test = X_test.drop(columns=["id", 'Soil Type', 'Crop Type'])
X.head()

,Temparature,Humidity,Moisture,Nitrogen,Potassium,Phosphorous,N_to_P,N_to_K,P_to_K,NPK_sum,SoilCropCombo
0,37,70,36,36,4,5,7.2,9.0,1.25,45,Clayey_Sugarcane
1,27,69,65,30,6,18,1.666667,5.0,3.0,54,Sandy_Millets
2,29,63,32,24,12,16,1.5,2.0,1.333333,52,Sandy_Millets
3,35,62,54,39,12,4,9.75,3.25,0.333333,55,Sandy_Barley
4,35,58,43,37,2,16,2.3125,18.5,8.0,55,Red_Paddy


In [6]:
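encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)
# NOTE: .astype(int) casts every column, so the ratio features (N_to_P, N_to_K, P_to_K) are truncated to integers too.
X_encoded = pd.get_dummies(X, columns=['SoilCropCombo'], prefix='', prefix_sep='').astype(int)
X_test = pd.get_dummies(X_test, columns=['SoilCropCombo'], prefix='', prefix_sep='').astype(int)
# Align test columns with train in case a soil/crop combination is missing from one of them.
X_test = X_test.reindex(columns=X_encoded.columns, fill_value=0)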
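Because `.astype(int)` is applied to the whole frame, the encoding step above also truncates the float ratio features (for example, 7.2 becomes 7). A possible alternative, sketched below on toy rows, is to pass `dtype=int` to `pd.get_dummies` so that only the indicator columns are cast, and to reindex the test frame to the training columns. `encode_aligned` is a hypothetical helper, and this variant would change the model's inputs, so the scores reported in this notebook would not carry over.

```python
import pandas as pd


def encode_aligned(train: pd.DataFrame, test: pd.DataFrame, column: str):
    """One-hot encode `column` in both frames and give test the same columns as train."""
    # dtype=int applies to the new indicator columns only; other columns keep their dtype.
    train_enc = pd.get_dummies(train, columns=[column], prefix="", prefix_sep="", dtype=int)
    test_enc = pd.get_dummies(test, columns=[column], prefix="", prefix_sep="", dtype=int)
    # Categories absent from test become all-zero columns; extras are dropped.
    test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
    return train_enc, test_enc


# Toy rows for illustration.
train = pd.DataFrame(
    {"N_to_P": [7.2, 1.5], "SoilCropCombo": ["Clayey_Sugarcane", "Sandy_Millets"]}
)
test = pd.DataFrame({"N_to_P": [2.3125], "SoilCropCombo": ["Sandy_Millets"]})

train_enc, test_enc = encode_aligned(train, test, "SoilCropCombo")
print(train_enc)
print(test_enc)
```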

In [7]:
X_encoded

,Temparature,Humidity,Moisture,Nitrogen,Potassium,Phosphorous,N_to_P,N_to_K,P_to_K,NPK_sum,...,Sandy_Cotton,Sandy_Ground Nuts,Sandy_Maize,Sandy_Millets,Sandy_Oil seeds,Sandy_Paddy,Sandy_Pulses,Sandy_Sugarcane,Sandy_Tobacco,Sandy_Wheat
0,37,70,36,36,4,5,7,9,1,45,...,0,0,0,0,0,0,0,0,0,0
1,27,69,65,30,6,18,1,5,3,54,...,0,0,0,1,0,0,0,0,0,0
2,29,63,32,24,12,16,1,2,1,52,...,0,0,0,1,0,0,0,0,0,0
3,35,62,54,39,12,4,9,3,0,55,...,0,0,0,0,0,0,0,0,0,0
4,35,58,43,37,2,16,2,18,8,55,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
749995,25,69,30,8,16,6,1,0,0,30,...,0,0,0,0,0,0,0,0,0,0
749996,37,64,58,38,8,20,1,4,2,66,...,0,0,0,0,0,0,0,0,0,0
749997,35,68,59,6,11,29,0,0,2,46,...,0,1,0,0,0,0,0,0,0,0
749998,31,68,29,9,11,12,0,0,1,32,...,0,0,0,0,0,0,0,0,0,0


In [8]:
X_train, X_val, y_train, y_val = train_test_split(X_encoded, y_encoded, test_size=0.4, random_state=42)

In [9]:
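# MAP@3 with a single true label per row:
# a hit at rank r scores 1/r, a miss scores 0, and the scores are averaged over rows.
def mapk(actual, predicted, k=3):
    score = 0.0
    for a, p in zip(actual, predicted):
        top_k = list(p[:k])  # also accepts NumPy arrays, which have no .index()
        if a in top_k:
            score += 1.0 / (top_k.index(a) + 1)
    return score / len(actual)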
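With one true label per row, MAP@3 reduces to the mean of 1/rank over rows whose true class is in the top three, with a score of 0 for the others. The sketch below computes the same quantity directly from a probability matrix with NumPy and works through a four-row example. `map_at_k` is a hypothetical name and the probabilities are made up.

```python
import numpy as np


def map_at_k(y_true, probs, k=3):
    """MAP@k for single-label rows, computed from integer labels and class probabilities."""
    y_true = np.asarray(y_true)
    probs = np.asarray(probs)
    # Column j of top_k holds the class ranked (j + 1)-th for each row.
    top_k = np.argsort(-probs, axis=1, kind="stable")[:, :k]
    hits = top_k == y_true[:, None]      # at most one True per row
    weights = 1.0 / np.arange(1, k + 1)  # 1, 1/2, 1/3, ...
    return float((hits * weights).sum(axis=1).mean())


y_true = [0, 2, 1, 3]
probs = [
    [0.70, 0.15, 0.10, 0.05],  # true class ranked 1st -> 1
    [0.50, 0.20, 0.30, 0.00],  # true class ranked 2nd -> 1/2
    [0.40, 0.10, 0.30, 0.20],  # true class ranked 4th -> 0
    [0.40, 0.30, 0.10, 0.20],  # true class ranked 3rd -> 1/3
]

score = map_at_k(y_true, probs)
print(score)  # (1 + 1/2 + 0 + 1/3) / 4, about 0.4583
```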

In [10]:
xgb_model = XGBClassifier(eval_metric='mlogloss')
xgb_model.fit(X_train, y_train)

xgb_probs = xgb_model.predict_proba(X_val)
xgb_top3 = np.argsort(xgb_probs, axis=1)[:, -3:][:, ::-1]
xgb_labels = [[encoder.classes_[i] for i in row] for row in xgb_top3]
actual_labels = [encoder.classes_[i] for i in y_val]

print("XGBoost MAP@3:", mapk(actual_labels, xgb_labels))

XGBoost MAP@3: 0.3172972222220905


In [11]:
y_probs = xgb_model.predict_proba(X_test)
# Indices of the three most probable classes per row, most probable first
top_3_preds = np.argsort(y_probs, axis=1)[:, -3:][:, ::-1]
top_3_labels = [[encoder.classes_[idx] for idx in row] for row in top_3_preds]
top_3_str = [' '.join(preds) for preds in top_3_labels]

# Pair each test id with its space-delimited top-3 prediction
test_final = pd.DataFrame({"id": test_ids["id"].to_numpy(), "Fertilizer Name": top_3_str}).set_index("id")
test_final.to_csv("prediction_xgb_model.csv")
test_final

id,Fertilizer Name
750000,28-28 DAP 10-26-26
750001,17-17-17 20-20 10-26-26
750002,10-26-26 20-20 17-17-17
750003,14-35-14 17-17-17 Urea
750004,20-20 10-26-26 17-17-17
...,...
999995,20-20 14-35-14 17-17-17
999996,14-35-14 10-26-26 20-20
999997,14-35-14 28-28 10-26-26
999998,10-26-26 14-35-14 17-17-17
