#### Multi-class, top-k prediction task

Overview

Welcome to the 2025 Kaggle Playground Series! We plan to continue in the spirit of previous playgrounds, providing interesting and approachable datasets for our community to practice their machine learning skills, and anticipate a competition each month.

Your Goal: Select the best fertilizer for different weather, soil conditions, and crops.

Dataset Description

The dataset for this competition (both train and test) was generated from a deep learning model trained on the Fertilizer prediction dataset. Feature distributions are close to, but not exactly the same as, those of the original. Feel free to use the original dataset as part of this competition, both to explore differences and to see whether incorporating the original in training improves model performance.
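As a rough illustration of incorporating the original data, the sketch below stacks a synthetic frame and an original frame and tags each row with its source. The tiny in-memory frames stand in for the real files, and it assumes (not confirmed here) that the original dataset uses the same column names as train.csv; the `source` column is a hypothetical helper for later comparison or weighting.

```python
import pandas as pd

# Tiny stand-in frames; in practice these would be loaded from train.csv and
# from the original Fertilizer prediction dataset (assumed to share column names).
synthetic = pd.DataFrame({
    "id": [0, 1],
    "Temparature": [37, 27],
    "Fertilizer Name": ["28-28", "17-17-17"],
})
original = pd.DataFrame({
    "Temparature": [26, 29],
    "Fertilizer Name": ["Urea", "DAP"],
})

# Tag each row with its origin so the two sources can be compared or weighted later.
synthetic = synthetic.drop(columns=["id"]).assign(source="synthetic")
original = original.assign(source="original")

# Stack both sources into one training frame with a fresh index.
combined = pd.concat([synthetic, original], ignore_index=True)
print(combined["source"].value_counts())
```

Whether the extra rows help is an empirical question; comparing validation MAP@3 with and without the `original` rows is one way to check.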

Files

* train.csv - the training dataset; Fertilizer Name is the categorical target
* test.csv - the test dataset; your objective is to predict the Fertilizer Name for each row, up to three values, space-delimited.
* sample_submission.csv - a sample submission file in the correct format.

Submission File

For each id in the test set, you may predict up to 3 Fertilizer Name values, with the predictions space delimited. The file should contain a header and have the following format:

id,Fertilizer Name

750000,14-35-14 10-26-26 Urea

750001,14-35-14 10-26-26 Urea

...
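The format rules above can be checked mechanically before uploading. The following is a minimal sketch of such a check; `check_submission` is a hypothetical helper, not part of any Kaggle tooling, and the "one row per id" rule is an assumption inferred from the wording "for each id in the test set". Splitting on whitespace works here because none of the fertilizer names contain spaces.

```python
import pandas as pd

def check_submission(sub: pd.DataFrame, max_labels: int = 3) -> list:
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    # The header must be exactly: id,Fertilizer Name
    if list(sub.columns) != ["id", "Fertilizer Name"]:
        problems.append(f"unexpected columns: {list(sub.columns)}")
        return problems
    # Assumption: each test id appears once.
    if sub["id"].duplicated().any():
        problems.append("duplicate ids")
    # Count the space-delimited labels in each row.
    counts = sub["Fertilizer Name"].str.split().str.len()
    if (counts > max_labels).any():
        problems.append(f"rows with more than {max_labels} labels")
    if (counts.isna() | (counts < 1)).any():
        problems.append("rows with no labels")
    return problems

good = pd.DataFrame({
    "id": [750000, 750001],
    "Fertilizer Name": ["14-35-14 10-26-26 Urea", "DAP"],
})
print(check_submission(good))
```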

In [13]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report

In [21]:
df = pd.read_csv("Data/train.csv")
df

Unnamed: 0,id,Temparature,Humidity,Moisture,Soil Type,Crop Type,Nitrogen,Potassium,Phosphorous,Fertilizer Name
0,0,37,70,36,Clayey,Sugarcane,36,4,5,28-28
1,1,27,69,65,Sandy,Millets,30,6,18,28-28
2,2,29,63,32,Sandy,Millets,24,12,16,17-17-17
3,3,35,62,54,Sandy,Barley,39,12,4,10-26-26
4,4,35,58,43,Red,Paddy,37,2,16,DAP
...,...,...,...,...,...,...,...,...,...,...
749995,749995,25,69,30,Clayey,Maize,8,16,6,28-28
749996,749996,37,64,58,Loamy,Sugarcane,38,8,20,17-17-17
749997,749997,35,68,59,Sandy,Ground Nuts,6,11,29,10-26-26
749998,749998,31,68,29,Red,Cotton,9,11,12,20-20


In [22]:
y = df["Fertilizer Name"]
X = df.drop(columns=["id", "Fertilizer Name"])
X.head()

Unnamed: 0,Temparature,Humidity,Moisture,Soil Type,Crop Type,Nitrogen,Potassium,Phosphorous
0,37,70,36,Clayey,Sugarcane,36,4,5
1,27,69,65,Sandy,Millets,30,6,18
2,29,63,32,Sandy,Millets,24,12,16
3,35,62,54,Sandy,Barley,39,12,4
4,35,58,43,Red,Paddy,37,2,16


In [23]:
print("Fertilizer Name:", list(y.unique()))
print("Soil Type:", list(X["Soil Type"].unique()))
print("Crop Type:", list(X["Crop Type"].unique()))

Fertilizer Name: ['28-28', '17-17-17', '10-26-26', 'DAP', '20-20', '14-35-14', 'Urea']
Soil Type: ['Clayey', 'Sandy', 'Red', 'Loamy', 'Black']
Crop Type: ['Sugarcane', 'Millets', 'Barley', 'Paddy', 'Pulses', 'Tobacco', 'Ground Nuts', 'Maize', 'Cotton', 'Wheat', 'Oil seeds']


In [24]:
X.describe()

Unnamed: 0,Temparature,Humidity,Moisture,Nitrogen,Potassium,Phosphorous
count,750000.0,750000.0,750000.0,750000.0,750000.0,750000.0
mean,31.503565,61.038912,45.184147,23.093808,9.478296,21.073227
std,4.025574,6.647695,11.794594,11.216125,5.765622,12.346831
min,25.0,50.0,25.0,4.0,0.0,0.0
25%,28.0,55.0,35.0,13.0,4.0,10.0
50%,32.0,61.0,45.0,23.0,9.0,21.0
75%,35.0,67.0,55.0,33.0,14.0,32.0
max,38.0,72.0,65.0,42.0,19.0,42.0


In [25]:
X.isna().sum()

Temparature    0
Humidity       0
Moisture       0
Soil Type      0
Crop Type      0
Nitrogen       0
Potassium      0
Phosphorous    0
dtype: int64

In [26]:
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)
X_encoded = pd.get_dummies(X, columns=['Soil Type', 'Crop Type'], prefix='', prefix_sep='').astype(int)

In [27]:
X_train, X_val, y_train, y_val = train_test_split(X_encoded, y_encoded, test_size=0.2, random_state=42)
X_train

Unnamed: 0,Temparature,Humidity,Moisture,Nitrogen,Potassium,Phosphorous,Black,Clayey,Loamy,Red,...,Cotton,Ground Nuts,Maize,Millets,Oil seeds,Paddy,Pulses,Sugarcane,Tobacco,Wheat
453635,28,51,47,20,17,24,0,0,1,0,...,0,1,0,0,0,0,0,0,0,0
11651,33,62,30,7,0,6,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
431999,38,59,41,24,11,42,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
529211,26,52,57,27,17,19,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1
110925,37,61,35,25,14,16,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
259178,33,72,61,38,15,0,0,1,0,0,...,0,0,0,0,0,0,1,0,0,0
365838,26,63,51,36,14,42,1,0,0,0,...,0,0,0,0,0,0,1,0,0,0
131932,30,50,51,4,7,15,0,0,1,0,...,0,0,1,0,0,0,0,0,0,0
671155,38,53,48,32,2,35,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0


In [28]:
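# Train a random forest with default hyperparameters; n_jobs=-1 only parallelizes fitting across CPU cores
clf = RandomForestClassifier(n_jobs=-1)
clf.fit(X_train, y_train)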

In [None]:
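# Class probabilities for the validation rows, one column per encoded class
y_probs = clf.predict_proba(X_val)
# Indices of the three most probable classes per row, most probable first
top_3_preds = np.argsort(y_probs, axis=1)[:, -3:][:, ::-1]
top_3_labels = [[encoder.classes_[idx] for idx in row] for row in top_3_preds]
y_true = [encoder.classes_[idx] for idx in y_val]

# MAP@3 for single-label targets: a hit at rank r (1-based) scores 1/r, a miss scores 0
def mapk(actual, predicted, k=3):
    score = 0.0
    for a, p in zip(actual, predicted):
        top_k = list(p[:k])
        if a in top_k:
            score += 1.0 / (top_k.index(a) + 1)
    return score / len(actual)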

map3 = mapk(y_true, top_3_labels)
print(f"MAP@3: {map3:.4f}")

X_val_result = X_val.copy().reset_index(drop=True)
X_val_result['y_pred'] = top_3_labels
X_val_result['y_true'] = y_true
X_val_result[["y_pred", "y_true"]] 

MAP@3: 0.2908


Unnamed: 0,y_pred,y_true
0,"[Urea, 17-17-17, 20-20]",17-17-17
1,"[17-17-17, Urea, 28-28]",14-35-14
2,"[14-35-14, 20-20, 28-28]",10-26-26
3,"[17-17-17, 20-20, 14-35-14]",14-35-14
4,"[17-17-17, DAP, 14-35-14]",10-26-26
...,...,...
149995,"[DAP, 10-26-26, 28-28]",20-20
149996,"[14-35-14, 17-17-17, DAP]",DAP
149997,"[10-26-26, 17-17-17, 14-35-14]",14-35-14
149998,"[10-26-26, 20-20, 28-28]",10-26-26
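The loop-based `mapk` above can also be written without Python-level iteration. The sketch below is one possible vectorized variant that works directly on encoded class indices and the probability matrix; `map_at_k` is a name introduced here for illustration, and its tie-breaking between equal probabilities may differ from the `argsort(...)[:, -3:][:, ::-1]` approach used in the notebook.

```python
import numpy as np

def map_at_k(y_true, y_probs, k=3):
    """MAP@k for single-label targets, computed on encoded class indices."""
    y_true = np.asarray(y_true)
    # Top-k class indices per row, most probable first.
    top_k = np.argsort(-y_probs, axis=1)[:, :k]
    # hits[i, j] is True when the true class of row i sits at rank j (0-based).
    hits = top_k == y_true[:, None]
    # A hit at rank j contributes 1 / (j + 1); rows with no hit contribute 0.
    weights = 1.0 / np.arange(1, k + 1)
    return float((hits * weights).sum(axis=1).mean())

probs = np.array([
    [0.1, 0.6, 0.3, 0.0],  # true class 1 is ranked first   -> 1
    [0.5, 0.2, 0.3, 0.0],  # true class 2 is ranked second  -> 1/2
    [0.4, 0.3, 0.2, 0.1],  # true class 3 is outside top 3  -> 0
])
score = map_at_k([1, 2, 3], probs)
print(f"MAP@3: {score:.4f}")  # (1 + 0.5 + 0) / 3 = 0.5000
```

With the notebook's variables, a call would look like `map_at_k(y_val, y_probs)`, since `y_val` already holds encoded labels.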


In [47]:
test = pd.read_csv("Data/test.csv")
test_ids = test[["id"]]
X_test = test.drop(columns=["id"])
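X_test = pd.get_dummies(X_test, columns=['Soil Type', 'Crop Type'], prefix='', prefix_sep='').astype(int)
# Match the training feature columns and their order; categories absent from the test set become all-zero columns
X_test = X_test.reindex(columns=X_encoded.columns, fill_value=0)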
X_test

Unnamed: 0,Temparature,Humidity,Moisture,Nitrogen,Potassium,Phosphorous,Black,Clayey,Loamy,Red,...,Cotton,Ground Nuts,Maize,Millets,Oil seeds,Paddy,Pulses,Sugarcane,Tobacco,Wheat
0,31,70,52,34,11,24,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,27,62,45,30,14,15,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
2,28,72,28,14,15,4,0,1,0,0,...,0,1,0,0,0,0,0,0,0,0
3,37,53,57,18,17,36,1,0,0,0,...,0,1,0,0,0,0,0,0,0,0
4,31,55,32,13,19,14,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
249995,26,66,30,14,7,18,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
249996,33,62,55,28,14,7,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0
249997,36,53,64,28,11,27,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
249998,36,67,26,33,0,10,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
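Encoding train and test separately with `pd.get_dummies` only yields matching columns when both contain the same categories. Here that happens to hold, but the sketch below shows, on made-up toy frames, how the columns can diverge and how `DataFrame.reindex` can realign them to the training layout. The frame names are illustrative only.

```python
import pandas as pd

# Toy frames: the test rows lack "Clayey" and "Red" and contain an unseen "Black".
train_part = pd.DataFrame({"Moisture": [36, 65, 32], "Soil Type": ["Clayey", "Sandy", "Red"]})
test_part = pd.DataFrame({"Moisture": [52, 45], "Soil Type": ["Sandy", "Black"]})

train_enc = pd.get_dummies(train_part, columns=["Soil Type"], prefix="", prefix_sep="").astype(int)
test_enc = pd.get_dummies(test_part, columns=["Soil Type"], prefix="", prefix_sep="").astype(int)

# Keep exactly the training columns, in training order: categories missing from
# the test rows are filled with 0, and categories unseen in training are dropped.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_enc.columns))      # differs from the training layout
print(list(test_aligned.columns))  # same layout as train_enc
```

Dropping an unseen category leaves that row with all-zero indicator columns, which is a modeling choice rather than the only option.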


In [57]:
y_probs = clf.predict_proba(X_test)
top_3_preds = np.argsort(y_probs, axis=1)[:, -3:][:, ::-1]
X_test_result = X_test.copy().reset_index(drop=True)
top_3_labels = [[encoder.classes_[idx] for idx in row] for row in top_3_preds]
top_3_str = [' '.join(preds) for preds in top_3_labels]
X_test_result['Fertilizer Name'] = top_3_str

In [59]:
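# Pair each test id with its space-delimited top-3 prediction (rows align on the default index) and write the submission file
test_final = pd.merge(test_ids, X_test_result[["Fertilizer Name"]], left_index=True, right_index=True).set_index("id")
test_final.to_csv("prediction.csv")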