<h1>🌱 NDVI Classification with LightGBM and Pseudo-Labeling</h1>

In this notebook, I'll build a high-accuracy classifier for an NDVI dataset using LightGBM and pseudo-labeling.

<h3>Importing Libraries</h3>

In [8]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from lightgbm import LGBMClassifier, early_stopping, log_evaluation
import warnings
warnings.filterwarnings('ignore', category=UserWarning)

<h3>Loading the Data</h3>

Loading the training and test datasets, and dropping any unnecessary columns.

In [9]:
train = pd.read_csv('datasets/hacktrain.csv')
test = pd.read_csv('datasets/hacktest.csv')
train.drop(columns=['Unnamed: 0'], inplace=True, errors='ignore')
test.drop(columns=['Unnamed: 0'], inplace=True, errors='ignore')
train.head()

,ID,class,20150720_N,20150602_N,20150517_N,20150501_N,20150415_N,20150330_N,20150314_N,20150226_N,...,20140610_N,20140525_N,20140509_N,20140423_N,20140407_N,20140322_N,20140218_N,20140202_N,20140117_N,20140101_N
0,1,water,637.595,658.668,-1882.03,-1924.36,997.904,-1739.99,630.087,,...,,-1043.16,-1942.49,267.138,,,211.328,-2203.02,-1180.19,433.906
1,2,water,634.24,593.705,-1625.79,-1672.32,914.198,-692.386,707.626,-1670.59,...,,-933.934,-625.385,120.059,364.858,476.972,220.878,-2250.0,-1360.56,524.075
2,4,water,58.0174,-1599.16,,-1052.63,,-1564.63,,729.79,...,-1025.88,368.622,,-1227.8,304.621,,369.214,-2202.12,,-1343.55
3,5,water,72.518,,380.436,-1256.93,515.805,-1413.18,-802.942,683.254,...,-1813.95,155.624,,-924.073,432.15,282.833,298.32,-2197.36,,-826.727
4,8,water,1136.44,,,1647.83,1935.8,,2158.98,,...,1535.0,1959.43,-279.317,-384.915,-113.406,1020.72,1660.65,-116.801,-568.05,-1357.14


<h3>Identifying NDVI Columns</h3>

NDVI stands for Normalized Difference Vegetation Index, a key indicator in remote sensing. The NDVI columns are the ones whose names contain _N, so I'll collect those.

In [10]:
ndvi_columns = [col for col in train.columns if '_N' in col]
print(f"NDVI Columns: {ndvi_columns[:4]} ... total: {len(ndvi_columns)} columns")

NDVI Columns: ['20150720_N', '20150602_N', '20150517_N', '20150501_N'] ... total: 27 columns
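
The column names look like acquisition dates in YYYYMMDD form followed by _N, and the table above lists them newest first. As a minimal sketch, assuming that naming convention holds for every column, the names could be parsed into dates and sorted chronologically. The helper parse_ndvi_date and the short sample list are hypothetical and not part of the original notebook.

```python
from datetime import datetime

# Hypothetical subset of the column names shown above.
sample_columns = ['20150720_N', '20150602_N', '20140117_N', '20140101_N']

def parse_ndvi_date(col):
    """Turn a name like '20150720_N' into a datetime.date (assumes a YYYYMMDD prefix)."""
    return datetime.strptime(col.split('_')[0], '%Y%m%d').date()

# Sort from the oldest to the newest acquisition date.
chronological = sorted(sample_columns, key=parse_ndvi_date)
print(chronological)
```

Knowing the true time order matters for any feature that depends on direction, such as a trend or a difference between neighbouring dates.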


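<h3>Filling Missing NDVI Values</h3>

Filling missing NDVI values by linear interpolation along each row (across dates), then falling back to the column mean for anything still missing. This gives every row a complete series for the features engineered below.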

In [11]:
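def fill_missing(df, cols):
    # Interpolate linearly along each row (across dates); edge gaps are filled in both directions
    df[cols] = df[cols].interpolate(axis=1, limit_direction='both')
    # Rows with no observed values at all fall back to the column mean
    df[cols] = df[cols].fillna(df[cols].mean())
    return df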

train = fill_missing(train, ndvi_columns)
test = fill_missing(test, ndvi_columns)

train[ndvi_columns].isnull().sum().sum()

np.int64(0)
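
To make the two steps concrete, here is a small sketch on a made-up frame. The values and the column names d1 to d4 are invented for illustration. The first row has an interior and a leading gap, the second has only one observed value, and the third has none.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; columns stand in for consecutive NDVI dates.
toy = pd.DataFrame(
    [[np.nan, 2.0, np.nan, 4.0],
     [1.0, np.nan, np.nan, np.nan],
     [np.nan, np.nan, np.nan, np.nan]],
    columns=['d1', 'd2', 'd3', 'd4'],
)

# Step 1: interpolate along each row; gaps at the edges are filled in both directions.
step1 = toy.interpolate(axis=1, limit_direction='both')

# Step 2: anything still missing (here the all-NaN row) falls back to the column mean.
step2 = step1.fillna(step1.mean())
print(step2)
```

The all-NaN row ends up as a copy of the column means, so it carries no information of its own. It may be worth counting such rows before relying on them.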

<h3>Feature Engineering</h3>

Creating new features from the NDVI columns: summary statistics, a linear trend, consecutive differences, and rolling statistics.

In [12]:
def create_features(df, cols):
    # Row-wise summary statistics over the whole series
    df['ndvi_mean'] = df[cols].mean(axis=1)
    df['ndvi_std'] = df[cols].std(axis=1)
    df['ndvi_min'] = df[cols].min(axis=1)
    df['ndvi_max'] = df[cols].max(axis=1)

    # Linear trend: least-squares slope over column position (columns run newest to oldest)
    df['ndvi_trend'] = df[cols].apply(lambda x: np.polyfit(range(len(x)), x, 1)[0], axis=1)
    # Differences between consecutive columns
    for i in range(1, len(cols)):
        df[f'ndvi_diff_{i}'] = df[cols[i]] - df[cols[i-1]]

    # Rolling statistics: only the last window (the final `window` columns) is kept
    for window in [3, 5]:
        rolling_mean = df[cols].T.rolling(window=window).mean().T
        rolling_std = df[cols].T.rolling(window=window).std().T
        df[f'ndvi_rolling_mean_{window}'] = rolling_mean.iloc[:, -1]
        df[f'ndvi_rolling_std_{window}'] = rolling_std.iloc[:, -1]

    return df

train = create_features(train, ndvi_columns)
test = create_features(test, ndvi_columns)

train[['ndvi_mean', 'ndvi_std', 'ndvi_trend', 'ndvi_rolling_mean_3', 'ndvi_rolling_std_3']].head()
        


,ndvi_mean,ndvi_std,ndvi_trend,ndvi_rolling_mean_3,ndvi_rolling_std_3
0,-269.881519,1017.144905,-1.812555,-983.101333,1329.46517
1,-248.218496,935.550461,0.410666,-1028.828333,1416.477127
2,-631.649152,1045.094565,8.399337,-1772.835,429.285
3,-319.884926,1024.098285,-16.228253,-1512.0435,685.3165
4,767.647759,1209.705801,-61.719074,-680.663667,627.791038
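
The trend feature is the least-squares slope of each row against the column position. Calling np.polyfit once per row works but can be slow on large frames. The sketch below shows an equivalent closed-form slope computed for all rows at once and checks it against np.polyfit on random data. The function name row_slopes is hypothetical. Since the columns appear to run newest to oldest, a positive slope here would mean NDVI decreasing over calendar time.

```python
import numpy as np

def row_slopes(values):
    """Least-squares slope of each row against x = 0, 1, ..., n-1."""
    values = np.asarray(values, dtype=float)
    x = np.arange(values.shape[1], dtype=float)
    x_centered = x - x.mean()
    # slope = sum((x - mean_x) * y) / sum((x - mean_x)^2);
    # the mean of y drops out because x_centered sums to zero.
    return values @ x_centered / (x_centered @ x_centered)

rng = np.random.default_rng(0)
demo = rng.normal(size=(5, 27))  # made-up data: 5 rows, 27 time steps
fast = row_slopes(demo)
slow = np.array([np.polyfit(range(demo.shape[1]), row, 1)[0] for row in demo])
print(np.allclose(fast, slow))
```

In the notebook this would be called as row_slopes(df[cols].to_numpy()), assuming the missing values have already been filled.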


<h3>Preparing Features and Target</h3>

Selecting all feature columns and encoding the target variable for classification.

In [13]:
feature_cols = [col for col in train.columns if col not in ['ID', 'class']]
le = LabelEncoder()
y = le.fit_transform(train['class'])
X = train[feature_cols].copy()
X_test = test[feature_cols].copy()
print(f'Classes: {le.classes_}')

Classes: ['farm' 'forest' 'grass' 'impervious' 'orchard' 'water']


<h3>Cross-Validation with LightGBM</h3>

I'll use 5-fold stratified cross-validation to train and validate the model. Each fold is scored on rows the model did not train on, which exposes overfitting and gives a more reliable accuracy estimate than a single split.

In [14]:
#out-of-fold-predictions
oof_preds = np.zeros((X.shape[0], len(le.classes_)))
test_preds = np.zeros((X_test.shape[0], len(le.classes_)))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, val_idx) in enumerate(skf.split(X,y)):
    print(f"\n===========Fold {fold+1}===========")
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]

    # define the LightGBM model for this fold
    clf = LGBMClassifier(
        n_estimators=1000,
        learning_rate=0.05,
        num_leaves=63,
        max_depth=-1,
        min_child_samples=20,
        reg_alpha=0.1,
        reg_lambda=0.1,
        colsample_bytree=0.8,
        subsample=0.8,
        objective='multiclass',
        class_weight='balanced',
        random_state=42,
        n_jobs=-1,
        verbose=100
    )

    #fit with early stopping
    clf.fit(X_train, y_train,
            eval_set=[(X_val,y_val)],
            eval_metric='multi_logloss',
            callbacks=[
                early_stopping(stopping_rounds=50),
                log_evaluation(period=100)
            ])
    
    # predict with the best iteration found by early stopping;
    # test predictions are averaged over the folds
    oof_preds[val_idx] = clf.predict_proba(X_val, num_iteration=clf.best_iteration_)
    test_preds += clf.predict_proba(X_test, num_iteration=clf.best_iteration_) / skf.n_splits

    #evaluating fold accuracy
    fold_acc = accuracy_score(y_val, np.argmax(oof_preds[val_idx], axis = 1))
    print(f"-->Fold {fold+1} Accuracy:{fold_acc:.4f}")

#computing and displaying OOF accuracy
oof_acc = accuracy_score(y, np.argmax(oof_preds, axis = 1))
print(f"\n OOF Accuracy: {oof_acc:.4f}")


[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.002394
[LightGBM] [Debug] init for col-wise cost 0.000008 seconds, init for row-wise cost 0.001903 seconds
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.004597 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 15810
[LightGBM] [Info] Number of data points in the train set: 6400, number of used features: 62
[LightGBM] [Info] Start training from score -1.791759
[LightGBM] [Info] Start training from score -1.791759
[LightGBM] [Info] Start training from score -1.791759
[LightGBM] [Info] Start training from score -1.791759
[LightGBM] [Info] Start training from score -1.791760
[LightGBM] [Info] Start training from score -1.791759
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 12
[LightGBM] [Debug] Trained a tree with leaves = 63 and depth = 12
[LightGBM] [Debug] Trained a tree with leaves = 42 and depth = 14
[LightGBM] 
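
Overall accuracy can hide weak classes, so it may help to break the out-of-fold predictions down per class. The sketch below is one way to do that with NumPy alone. The function per_class_accuracy and the toy arrays are hypothetical and not part of the original notebook.

```python
import numpy as np

def per_class_accuracy(y_true, proba, class_names):
    """Share of correctly predicted rows for each true class (i.e. recall)."""
    y_true = np.asarray(y_true)
    y_pred = np.argmax(proba, axis=1)
    result = {}
    for idx, name in enumerate(class_names):
        mask = y_true == idx
        # Classes without any rows are reported as NaN instead of dividing by zero.
        result[name] = float((y_pred[mask] == idx).mean()) if mask.any() else float('nan')
    return result

# Made-up labels and probabilities, only to show the call.
toy_y = np.array([0, 0, 1, 1, 2])
toy_proba = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.9, 0.0],
    [0.3, 0.6, 0.1],
    [0.2, 0.2, 0.6],
])
toy_result = per_class_accuracy(toy_y, toy_proba, ['a', 'b', 'c'])
print(toy_result)
```

With the variables from the cell above, the call would be per_class_accuracy(y, oof_preds, le.classes_).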

<h3>Pseudo-Labeling with a Confidence Threshold</h3>

I'll use test predictions with high confidence (≥ 0.99) as extra training data for the final model.

In [15]:
super_conf_idx = np.where(test_preds.max(axis=1) >= 0.99)[0]
X_pseudo = X_test.iloc[super_conf_idx]
y_pseudo = np.argmax(test_preds[super_conf_idx], axis=1)

print(f"\nPseudo-labeled samples added: {len(y_pseudo)}/{len(X_test)}")


Pseudo-labeled samples added: 1786/2845
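
The threshold controls a trade-off: a lower value adds more rows but also more wrong labels, and the kept rows can be skewed toward the classes the model finds easy. A rough sketch for inspecting this, assuming a probability matrix shaped like test_preds, is shown below. The function pseudo_label_summary and the toy probabilities are hypothetical.

```python
import numpy as np

def pseudo_label_summary(proba, thresholds=(0.90, 0.95, 0.99)):
    """For each threshold, count the rows kept and how they split across classes."""
    proba = np.asarray(proba)
    confidence = proba.max(axis=1)   # highest class probability per row
    labels = proba.argmax(axis=1)    # class that would become the pseudo-label
    summary = {}
    for t in thresholds:
        keep = confidence >= t
        counts = np.bincount(labels[keep], minlength=proba.shape[1])
        summary[t] = {'kept': int(keep.sum()), 'per_class': counts.tolist()}
    return summary

# Made-up probabilities for a two-class problem.
toy_proba = np.array([
    [0.995, 0.005],
    [0.96, 0.04],
    [0.08, 0.92],
    [0.55, 0.45],
])
toy_summary = pseudo_label_summary(toy_proba)
print(toy_summary)
```

Calling pseudo_label_summary(test_preds) in the notebook would show whether the 0.99 cut keeps rows from all six classes or mostly from a few.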


<h3>Combining Training Data with Pseudo-Labels</h3>

I'll combine the original training data with the pseudo-labeled samples for a stronger final model.

In [16]:
if len(y_pseudo) > 0:
    X_final = pd.concat([X, X_pseudo], axis=0)
    y_final = np.concatenate([y, y_pseudo])
else:
    X_final = X.copy()
    y_final = y.copy()
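
One detail worth knowing about pd.concat: by default it keeps the row labels of both inputs, so the combined frame can contain duplicate index values when the training and test frames both start at 0. This should not affect model fitting, which works on positions, but it can surprise later label-based lookups. A small illustration on made-up frames:

```python
import pandas as pd

train_part = pd.DataFrame({'f': [1.0, 2.0]})          # index 0, 1
pseudo_part = pd.DataFrame({'f': [9.0]}, index=[0])   # index 0, as if taken from the test set

stacked = pd.concat([train_part, pseudo_part], axis=0)
renumbered = pd.concat([train_part, pseudo_part], axis=0, ignore_index=True)

print(stacked.index.tolist(), renumbered.index.tolist())
```

Passing ignore_index=True renumbers the rows from 0, which keeps the index unique.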

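<h3>Training the Final Model on Combined Data</h3>

Training a final LightGBM model on all available data: the original training rows plus the pseudo-labeled test rows. Note that these hyperparameters differ from the ones used in cross-validation above, so the OOF accuracy is only a rough guide to this model's performance.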

In [17]:
final_clf = LGBMClassifier(
    n_estimators=1500,
    learning_rate=0.02,
    num_leaves=127,
    max_depth=-1,
    min_child_samples=15,
    random_state=42,
    verbose=-1
)

final_clf.fit(X_final, y_final)

,parameter,value
,boosting_type,'gbdt'
,num_leaves,127
,max_depth,-1
,learning_rate,0.02
,n_estimators,1500
,subsample_for_bin,200000
,objective,
,class_weight,
,min_split_gain,0.0
,min_child_weight,0.001


<h3>Evaluating Final Model and Feature Importance</h3>

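I'll check the final model's accuracy on the training rows and see which features are most important. Because the model was fitted on these rows, this accuracy does not estimate generalization; the OOF accuracy above is the better guide.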

In [18]:
final_preds_train = final_clf.predict(X)
final_acc = accuracy_score(y, final_preds_train)
print(f"\n Final Training Accuracy: {final_acc:.4f}")

#Feature importance
fi = pd.DataFrame({
    'feature': feature_cols,
    'importance': final_clf.feature_importances_
}).sort_values('importance', ascending=False)

print("\nTop 10 Features")
print(fi.head(10))


 Final Training Accuracy: 1.0000

Top 10 Features
         feature  importance
30      ndvi_max       14722
28      ndvi_std       13100
29      ndvi_min        9611
45  ndvi_diff_14        8705
19    20140509_N        8445
15    20140813_N        8234
46  ndvi_diff_15        7653
13    20141016_N        7241
14    20140930_N        6885
26    20140101_N        6554
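
LightGBM's scikit-learn wrapper reports split counts by default (importance_type='split'), so the raw numbers are easier to compare once they are turned into shares. The sketch below does this for the three largest values from the table above, used only as sample input. The shares are therefore relative to these three features, not to all 62.

```python
import pandas as pd

# The three largest values from the table above, used only as sample input.
fi_demo = pd.DataFrame({
    'feature': ['ndvi_max', 'ndvi_std', 'ndvi_min'],
    'importance': [14722, 13100, 9611],
})

# Share of the total split count within this small sample.
fi_demo['share'] = fi_demo['importance'] / fi_demo['importance'].sum()
print(fi_demo)
```

Applied to the full fi frame, the same division would give each feature's share across the whole model.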


<h3>Making Predictions and Saving Submission</h3>

I'll predict the classes for the test set and save the results for submission.

In [19]:
final_preds_test = final_clf.predict(X_test)
final_labels = le.inverse_transform(final_preds_test)

submission = pd.DataFrame({
    'ID': test['ID'],
    'class': final_labels
})
submission.to_csv('submission.csv', index=False)
print("\n submission.csv saved!")
submission.head()


 submission.csv saved!


,ID,class
0,1,farm
1,2,forest
2,3,forest
3,4,orchard
4,5,orchard
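
Before uploading, a few cheap checks can catch a malformed file. The sketch below assumes the submission needs exactly the columns ID and class, one row per test ID, and only known class names. Those requirements are assumptions about the competition format, and check_submission is a hypothetical helper.

```python
import pandas as pd

def check_submission(sub, expected_ids, allowed_classes):
    """Return a list of problems found; an empty list means all checks passed."""
    if list(sub.columns) != ['ID', 'class']:
        return ['unexpected columns']
    problems = []
    if sub['ID'].duplicated().any():
        problems.append('duplicate IDs')
    if set(sub['ID']) != set(expected_ids):
        problems.append('ID mismatch')
    if not set(sub['class']).issubset(set(allowed_classes)):
        problems.append('unknown class label')
    return problems

# Made-up rows, only to show the call.
good = pd.DataFrame({'ID': [1, 2, 3], 'class': ['farm', 'forest', 'forest']})
bad = pd.DataFrame({'ID': [1, 2, 2], 'class': ['farm', 'forest', 'lake']})
allowed = {'farm', 'forest', 'water'}

print(check_submission(good, [1, 2, 3], allowed))
print(check_submission(bad, [1, 2, 3], allowed))
```

In the notebook, the call would be check_submission(submission, test['ID'], le.classes_).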

