<b>The dataset is very imbalanced.</b> Some classes, such as C3S4, have only one sample. This makes oversampling methods hard to apply, because a method like SMOTE needs several same-class neighbors to interpolate between. <br>

To work around this, I merged the classes with very few samples into a new class called 'Other'.
This leaves six classes to classify: 'C2S1', 'C3S1', 'C3S2', 'C4S1', 'C4S2', 'Other'.
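
The merging idea can be sketched in a few lines. This is only an illustration, not the exact rule used in the notebook: the helper name `merge_rare_classes`, the `min_count` threshold and the toy labels are all assumptions made for the example.

```python
from collections import Counter

def merge_rare_classes(labels, min_count=5, other_label="Other"):
    """Replace every label seen fewer than min_count times with other_label."""
    counts = Counter(labels)
    return [lab if counts[lab] >= min_count else other_label for lab in labels]

# toy labels (made up for illustration): two frequent classes and two rare ones
labels = ["C3S1"] * 6 + ["C2S1"] * 5 + ["C3S4"] + ["C1S1"] * 2
merged = merge_rare_classes(labels, min_count=5)
print(Counter(merged))
```

A count-based threshold like this keeps the rule reproducible when new yearly files change which classes are rare.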


## Results

I tuned the model with Optuna. The results were mixed: the model reaches reasonable precision/recall scores on a few classes, but some classes score poorly (class 2, for example, has an f1-score of only 0.20).<br>

- (Test Set) R2 score : 82.851
- (Test Set) MAE : 0.207831

| class        	| precision 	| recall 	| f1-score 	| support 	|
|--------------	|-----------	|--------	|----------	|---------	|
| 0            	| 1.00      	| 0.97   	| 0.99     	| 76      	|
| 1            	| 0.97      	| 0.99   	| 0.98     	| 204     	|
| 2            	| 0.25      	| 0.17   	| 0.20     	| 6       	|
| 3            	| 0.88      	| 0.88   	| 0.88     	| 26      	|
| 4            	| 0.73      	| 0.67   	| 0.70     	| 12      	|
| 5            	| 0.62      	| 0.62   	| 0.62     	| 8       	|
| accuracy     	|           	|        	| 0.94     	| 332     	|
| macro avg    	| 0.74      	| 0.72   	| 0.73     	| 332     	|
| weighted avg 	| 0.94      	| 0.94   	| 0.94     	| 332     	|

In [None]:
!pip install catboost
!pip install scikit-learn
!pip install seaborn
!pip install numpy
!pip install pandas
!pip install mealpy
!pip install pyswarms
!pip install imbalanced-learn



In [None]:
# Importing dependencies

import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler
from sklearn.metrics import r2_score, mean_squared_error, classification_report
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import accuracy_score

from catboost import CatBoostClassifier, Pool

# <span style="color:#e74c3c;"> Reading </span> Data


In [None]:
!gdown 1SaxJ8KPMrV37ZsVTv5F5ava5aXtGRiSn
!gdown 1E6BZ-AEncUOWlsK96SFz5wBK8w7XjmVB
!gdown 195J88Onvr23J8HdFtO3D_Yezi97clcZ9
!gdown 1C4ERjxVqEnxTGxwQVz98u3gaxA_hnzrH
!gdown 10KFETqq39CXdFP2zTqPgRKOGqFuVlhL9

Downloading...
From: https://drive.google.com/uc?id=1SaxJ8KPMrV37ZsVTv5F5ava5aXtGRiSn
To: /content/ground_water_quality_2019_post.csv
100% 66.7k/66.7k [00:00<00:00, 72.1MB/s]
Downloading...
From: https://drive.google.com/uc?id=1E6BZ-AEncUOWlsK96SFz5wBK8w7XjmVB
To: /content/ground_water_quality_2020_post.csv
100% 68.1k/68.1k [00:00<00:00, 71.0MB/s]
Downloading...
From: https://drive.google.com/uc?id=195J88Onvr23J8HdFtO3D_Yezi97clcZ9
To: /content/ground_water_quality_2018_post.csv
100% 72.8k/72.8k [00:00<00:00, 74.2MB/s]
Downloading...
From: https://drive.google.com/uc?id=1C4ERjxVqEnxTGxwQVz98u3gaxA_hnzrH
To: /content/ground_water_quality_2021_post.csv
100% 145k/145k [00:00<00:00, 91.7MB/s]
Downloading...
From: https://drive.google.com/uc?id=10KFETqq39CXdFP2zTqPgRKOGqFuVlhL9
To: /content/ground_water_quality_2022_post.csv
100% 144k/144k [00:00<00:00, 91.1MB/s]


In [None]:
# Reading data and cleaning, renaming and other data cleaning applications

data1 = pd.read_csv('/content/ground_water_quality_2018_post.csv')
data2 = pd.read_csv('/content/ground_water_quality_2019_post.csv')
data3 = pd.read_csv('/content/ground_water_quality_2020_post.csv')
data4 = pd.read_csv('/content/ground_water_quality_2021_post.csv')
data5 = pd.read_csv('/content/ground_water_quality_2022_post.csv')



data2.rename( columns ={ 'EC' : 'E.C', 'CO_-2 ' : 'CO3', 'HCO_ - ' :'HCO3', 'Cl -' : 'Cl',
                        'F -' : 'F', 'NO3- ': 'NO3 ' , 'SO4-2':'SO4' , 'Na+':'Na', 'K+':'K',
                        'Ca+2' : 'Ca', 'Mg+2':'Mg'}, inplace = True)


# dropping redundant columns
data1.drop(['sno','season'], axis = 1, inplace = True)
data2.drop(['sno','season'], axis = 1, inplace = True)
data3.drop(['sno','Unnamed: 8', 'season'], axis = 1, inplace = True)
data4.drop(['sno','season'], axis = 1, inplace = True)
data5.drop(['sno','season'], axis = 1, inplace = True)


# creating new columns
data1['year'] = 2018
data2['year'] = 2019
data3['year'] = 2020
data4['year'] = 2021
data5['year'] = 2022



# fixing malformed entries: a typo in pH and inconsistent 'O.G' labels
# (whole-column replace avoids chained assignment through .iloc)
data3['pH'] = pd.to_numeric(data3['pH'].replace('8..05', '8.05'))

data3['Classification'] = data3['Classification'].replace('O.G', 'OG')
data4['Classification'] = data4['Classification'].replace('O.G', 'OG')
data5['Classification'] = data5['Classification'].replace('O.G', 'OG')
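
Typos such as `8..05` are easier to fix once they have been located. The sketch below shows one way to list the entries that cannot be parsed as numbers. The helper `find_bad_numbers` and the toy column are hypothetical; in pandas, `pd.to_numeric(column, errors='coerce')` serves a similar purpose by turning such entries into NaN.

```python
def find_bad_numbers(values):
    """Return (position, value) pairs for entries that float() cannot parse."""
    bad = []
    for pos, value in enumerate(values):
        try:
            float(value)
        except (TypeError, ValueError):
            bad.append((pos, value))
    return bad

# toy column (made up): one entry has the same kind of typo as '8..05'
ph_column = ["8.28", "7.69", "8..05", "8.09"]
print(find_bad_numbers(ph_column))
```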



In [None]:
# creating and applying the new_class function

def new_class(X):
    # merge rare classes into a neighbouring class or into 'Other'
    if X in ('C1S1', 'C2S2', 'OG'):
        return 'Other'
    elif X in ('C3S4', 'C3S3'):
        return 'C3S2'
    elif X in ('C4S4', 'C4S3'):
        return 'C4S2'
    else:
        return X

data1['Classification'] = data1['Classification'].apply(new_class)
data2['Classification'] = data2['Classification'].apply(new_class)
data3['Classification'] = data3['Classification'].apply(new_class)
data4['Classification'] = data4['Classification'].apply(new_class)
data5['Classification'] = data5['Classification'].apply(new_class)


In [None]:
data_full = pd.concat([data1, data2, data3, data4, data5], axis = 0)

In [None]:
data_full

Unnamed: 0,district,mandal,village,lat_gis,long_gis,gwl,pH,E.C,TDS,CO3,...,K,Ca,Mg,T.H,SAR,Classification,RSC meq / L,Classification.1,year,Unnamed: 8
0,ADILABAD,Adilabad,Adilabad,19.668300,78.524700,5.09,8.28,745,476.80,0.0,...,4.00,48.0,38.896,279.934211,1.273328,C2S1,-1.198684,P.S.,2018,
1,ADILABAD,Bazarhatnur,Bazarhatnur,19.458888,78.350833,5.10,8.29,921,589.44,0.0,...,5.00,56.0,63.206,399.893092,0.913166,C3S1,-3.397862,P.S.,2018,
2,ADILABAD,Gudihatnoor,Gudihatnoor,19.525555,78.512222,4.98,7.69,510,326.40,0.0,...,2.00,24.0,38.896,219.934211,1.319284,C2S1,-0.398684,P.S.,2018,
3,ADILABAD,Jainath,Jainath,19.730555,78.640000,5.75,8.09,422,270.08,0.0,...,1.00,32.0,19.448,159.967105,0.928155,C2S1,0.000658,P.S.,2018,
4,ADILABAD,Narnoor,Narnoor,19.495665,78.852654,2.15,8.21,2321,1485.44,0.0,...,5.00,56.0,92.378,519.843750,5.682664,C4S2,-4.396875,P.S.,2018,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
363,YADADRI,S.Narayanpur,S.Narayanpur,17.144719,78.860010,9.90,7.8,2324,1487.36,0.0,...,2.60,160.0,97.240,799.835526,2.602728,C4S1,-8.596711,P.S.,2022,
364,YADADRI,Thurkapally,Gandamalla,17.733101,78.853831,5.74,8.26,2109,1349.76,0.0,...,43.30,48.0,116.688,599.802632,3.751176,C3S1,-3.396053,P.S.,2022,
365,YADADRI,Valigonda,T. somaram,17.399953,78.952290,1.72,8.77,1115,713.60,20.0,...,3.04,80.0,53.482,419.909539,1.282386,C3S1,-4.398191,P.S.,2022,
366,YADADRI,Valigonda,Vemulakonda,17.347782,79.143433,1.65,7.76,5053,3233.92,0.0,...,3.30,400.0,92.378,1379.843750,5.444988,C4S1,-21.996875,P.S.,2022,


In [None]:
# cols2drop = ['district','mandal', 'village', 'lat_gis', 'long_gis', 'Classification.1', 'Unnamed: 8']
cols2drop = ['Classification.1', 'Unnamed: 8']

# drop the 'Other' class; .copy() avoids SettingWithCopyWarning on the assignments below
data_full = data_full[data_full['Classification'] != 'Other'].copy()
data_full['pH'] = pd.to_numeric(data_full['pH'].replace('8..05', '8.05'))

# errors='ignore' skips columns that are missing from some yearly files
data_full = data_full.drop(cols2drop, axis=1, errors='ignore')

# check class distribution
class_distribution = data_full['Classification'].value_counts()
print(class_distribution)

C3S1    1152
C2S1     412
C4S1     151
C4S2      78
C3S2      37
Name: Classification, dtype: int64


In [None]:
# total null elements

data_full.isnull().sum()[data_full.isnull().sum() > 0]

gwl     17
CO3    159
dtype: int64

In [None]:
# imputing null values (each column is imputed on its own)

imp_knn = KNNImputer(n_neighbors=3)

data_full['CO3'] = imp_knn.fit_transform(data_full[['CO3']]).ravel()
data_full['gwl'] = imp_knn.fit_transform(data_full[['gwl']]).ravel()
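
A note on what this imputation does in practice. Each column is passed to `KNNImputer` on its own, so a row with a missing value has no other feature from which a distance to its neighbors could be computed. As far as I understand the scikit-learn behavior, the imputer then falls back to the column average, so the result should be equivalent to mean imputation. The pure-Python sketch below illustrates that fallback; `impute_with_mean` and the toy values are made up for the example.

```python
def impute_with_mean(values):
    """Fill None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

# toy groundwater levels (made up); None marks a missing measurement
gwl = [5.0, None, 4.0, 6.0, None]
print(impute_with_mean(gwl))
```

To let the neighbors actually matter, several numeric columns would have to be passed to the imputer together.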

In [None]:
data_full.isnull().sum()[data_full.isnull().sum() > 0]

Series([], dtype: int64)

In [None]:
data_full.head()

Unnamed: 0,district,mandal,village,lat_gis,long_gis,gwl,pH,E.C,TDS,CO3,...,SO4,Na,K,Ca,Mg,T.H,SAR,Classification,RSC meq / L,year
0,ADILABAD,Adilabad,Adilabad,19.6683,78.5247,5.09,8.28,745,476.8,0.0,...,46.0,49.0,4.0,48.0,38.896,279.934211,1.273328,C2S1,-1.198684,2018
1,ADILABAD,Bazarhatnur,Bazarhatnur,19.458888,78.350833,5.1,8.29,921,589.44,0.0,...,68.0,42.0,5.0,56.0,63.206,399.893092,0.913166,C3S1,-3.397862,2018
2,ADILABAD,Gudihatnoor,Gudihatnoor,19.525555,78.512222,4.98,7.69,510,326.4,0.0,...,44.0,45.0,2.0,24.0,38.896,219.934211,1.319284,C2S1,-0.398684,2018
3,ADILABAD,Jainath,Jainath,19.730555,78.64,5.75,8.09,422,270.08,0.0,...,35.0,27.0,1.0,32.0,19.448,159.967105,0.928155,C2S1,0.000658,2018
4,ADILABAD,Narnoor,Narnoor,19.495665,78.852654,2.15,8.21,2321,1485.44,0.0,...,280.0,298.0,5.0,56.0,92.378,519.84375,5.682664,C4S2,-4.396875,2018


In [None]:
# creating train data and target

X = data_full.copy()
X.drop('Classification', axis= 1, inplace = True)

y = data_full['Classification']

In [None]:
# # balancing class
# from imblearn.over_sampling import SMOTE

# smote = SMOTE(k_neighbors=2)
# X, y = smote.fit_resample(X, data_full['Classification'].values)
# #SMOTE Training data
# X, y = smote.fit_resample(X, y)

In [None]:
y.value_counts()

C3S1    1152
C2S1     412
C4S1     151
C4S2      78
C3S2      37
Name: Classification, dtype: int64

In [None]:
LB = LabelEncoder()
y = LB.fit_transform(y)
LB.classes_

array(['C2S1', 'C3S1', 'C3S2', 'C4S1', 'C4S2'], dtype=object)
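
The integer codes produced here reappear as row labels (0 to 4) in the classification reports further down, so it helps to keep the mapping at hand. `LabelEncoder` sorts the labels and numbers them from 0; the sketch below reproduces that mapping in plain Python. The names `encode` and `decode` are only illustrative. On the fitted encoder itself, `LB.inverse_transform` does the decoding.

```python
classes = ["C2S1", "C3S1", "C3S2", "C4S1", "C4S2"]  # LB.classes_ from the output above

# sorted labels are numbered from 0, as LabelEncoder does
encode = {label: code for code, label in enumerate(sorted(classes))}
decode = {code: label for label, code in encode.items()}

print(encode["C3S2"])
print(decode[4])
```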

In [None]:
# categorical features

cat_feat_idx =  np.where(X.dtypes == 'object')[0]
cat_feat_idx

array([0, 1, 2])

In [None]:
# scaling numerical data

MX = MinMaxScaler()
X.iloc[:, 3:21] = MX.fit_transform(X.iloc[:, 3:21])

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3 , shuffle = True, stratify=y , random_state= 2)

print(X_train.shape)
print(X_test.shape)

(1281, 23)
(549, 23)


In [None]:
# creating class weights

unique_classes = np.unique(y)
weights = compute_class_weight(class_weight='balanced', classes=unique_classes, y=y_train)
class_weights = dict(zip(unique_classes, weights))
class_weights

{0: 0.8895833333333333,
 1: 0.31786600496277917,
 2: 9.853846153846154,
 3: 2.4169811320754717,
 4: 4.658181818181818}
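
These weights follow the `'balanced'` heuristic of scikit-learn, `n_samples / (n_classes * count_of_class)`. The sketch below recomputes them by hand. The per-class training counts are not printed anywhere in the notebook; they are back-solved from the weights above and should be read as an inference.

```python
# per-class counts in y_train, inferred from the printed weights (not notebook output)
train_counts = {0: 288, 1: 806, 2: 26, 3: 106, 4: 55}

n_samples = sum(train_counts.values())  # 1281, the number of rows in X_train
n_classes = len(train_counts)

# 'balanced' heuristic: rare classes get proportionally larger weights
balanced = {c: n_samples / (n_classes * n) for c, n in train_counts.items()}
for c, w in balanced.items():
    print(c, round(w, 4))
```

Class 2 (C3S2) has only 26 training samples, which is why its weight is roughly 31 times that of class 1.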

# <span style="color:#e74c3c;"> CatBoost </span> Classifier


In [None]:
# creating pools for training and testing

train_pool = Pool(X_train, y_train, cat_features = cat_feat_idx)
test_pool = Pool(X_test, y_test, cat_features = cat_feat_idx)

In [None]:
# hyperparameters tuned with Optuna

base_model = CatBoostClassifier(iterations= 5000, task_type="CPU", devices='0:1', learning_rate =0.0029536992550707585 , min_data_in_leaf = 27 , class_weights=class_weights)

base_model.fit(train_pool , verbose = 1000 )

0:	learn: 1.6019024	total: 43.6ms	remaining: 3m 37s
1000:	learn: 0.2282235	total: 1m 3s	remaining: 4m 13s
2000:	learn: 0.1249715	total: 1m 54s	remaining: 2m 50s
3000:	learn: 0.0785173	total: 2m 50s	remaining: 1m 53s
4000:	learn: 0.0530350	total: 3m 42s	remaining: 55.5s
4999:	learn: 0.0401662	total: 4m 34s	remaining: 0us


<catboost.core.CatBoostClassifier at 0x7e20181b0370>

## Swarm Intelligence

### PSO

In [None]:
import pyswarms as ps

In [None]:
pso_iteration = 30

In [None]:
# Hyperparameters to search. Only the number of entries is used below;
# the actual search limits are set in `bounds`.
hyperparameter_ranges = {
    'iterations': (1000, 5000),      # number of boosting iterations
    'depth': (3, 10),                # tree depth
    'learning_rate': (0.0001, 0.1),  # learning rate
    'min_data_in_leaf': (25, 35),    # minimum number of samples in a leaf
    'l2_leaf_reg': (1, 10),          # L2 regularization strength
}
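
PSO moves particles through a continuous space, so integer hyperparameters have to be recovered from floats before a model can be built. The notebook truncates with `int()`. The sketch below rounds to the nearest integer instead, which is an alternative choice, not the notebook's method. The helper `decode_position` and the dimension order are assumptions for illustration.

```python
def decode_position(position, int_dims=(0, 1, 3)):
    """Turn a continuous PSO position into typed hyperparameters.

    Assumed dimension order:
    iterations, depth, learning_rate, min_data_in_leaf, l2_leaf_reg.
    """
    names = ["iterations", "depth", "learning_rate", "min_data_in_leaf", "l2_leaf_reg"]
    params = {}
    for i, (name, value) in enumerate(zip(names, position)):
        # integer-valued dimensions are rounded, the rest stay floats
        params[name] = int(round(value)) if i in int_dims else float(value)
    return params

print(decode_position([2750.6, 6.4, 0.012, 29.4, 3.7]))
```

Rounding avoids the downward bias of truncation, where the upper bound of an integer range is almost never reached.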

In [None]:
def objective_function(x):
    # pyswarms passes the whole swarm, shape (n_particles, n_dimensions),
    # and expects one cost per particle, shape (n_particles,)
    costs = []
    for particle in np.atleast_2d(x):
        params = {
            'iterations': int(particle[0]),
            'depth': int(particle[1]),
            'learning_rate': float(particle[2]),
            'min_data_in_leaf': int(particle[3]),
            'l2_leaf_reg': int(particle[4]),
            'class_weights': class_weights,
            'task_type': 'CPU',
            'devices': '0:1',
            'cat_features': cat_feat_idx,
            'verbose': 0,
        }

        model = CatBoostClassifier(**params)
        model.fit(train_pool, verbose = 1000)
        y_pred = model.predict(X_test)
        # negative accuracy, because PSO minimizes the cost
        costs.append(-accuracy_score(y_test, y_pred))
    return np.array(costs)

In [None]:
n_particles = 25
n_dimensions = len(hyperparameter_ranges)
# bounds = (np.array([100, 3, 0.01, 1]), np.array([500, 8, 0.1, 5]))
bounds = (np.array([1000, 3, 0.0001, 20, 1]), np.array([5000, 12, 0.1, 35, 10]))



optimizer = ps.single.GlobalBestPSO(n_particles=n_particles,
                                     dimensions=n_dimensions,
                                     bounds=bounds,
                                     options={'c1': 1.8, 'c2': 1.4, 'w': 0.6})
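
For reference, the three options control the standard PSO velocity update: `w` is the inertia weight, `c1` scales the pull toward a particle's own best position, and `c2` scales the pull toward the swarm's best position. The sketch below applies that update to a single particle in plain Python. It is a simplified illustration with a hypothetical `pso_step` helper; it leaves out the bound handling that a full optimizer performs.

```python
import random

def pso_step(x, v, pbest, gbest, w=0.6, c1=1.8, c2=1.4, rng=random):
    """One velocity and position update for a single particle (lists of floats)."""
    new_x, new_v = [], []
    for xi, vi, pi, gi in zip(x, v, pbest, gbest):
        r1, r2 = rng.random(), rng.random()
        # inertia + cognitive pull (own best) + social pull (swarm best)
        vi = w * vi + c1 * r1 * (pi - xi) + c2 * r2 * (gi - xi)
        new_v.append(vi)
        new_x.append(xi + vi)
    return new_x, new_v

# a particle sitting exactly on both bests keeps only its inertia term: v_new = w * v
x, v = pso_step(x=[1.0, 2.0], v=[0.5, -1.0], pbest=[1.0, 2.0], gbest=[1.0, 2.0])
print(x, v)
```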


In [None]:
best_hyperparameters = optimizer.optimize(objective_function, iters=pso_iteration)


2023-11-05 12:25:24,131 - pyswarms.single.global_best - INFO - Optimize for 30 iters with {'c1': 1.8, 'c2': 1.4, 'w': 0.6}
pyswarms.single.global_best:   0%|          |0/30

0:	learn: 1.6829339	total: 1.69s	remaining: 1h 44m 30s


pyswarms.single.global_best:   0%|          |0/30


KeyboardInterrupt: ignored

In [None]:
len(hyperparameter_ranges)

In [None]:
len(best_hyperparameters)

In [None]:
# optimize() returns (best cost, best position)
best_accuracy, best_hyperparameter_values = best_hyperparameters

In [None]:
best_hyperparameter_values


In [None]:
# best_hyperparameter_values is an array holding the best position found by PSO
print("Best accuracy:", -best_accuracy)
print("Best hyperparameters:", best_hyperparameter_values)
print("iterations:", int(best_hyperparameter_values[0]))
print("depth:", int(best_hyperparameter_values[1]))
print("learning_rate:", float(best_hyperparameter_values[2]))
print("min_data_in_leaf:", int(best_hyperparameter_values[3]))
print("l2_leaf_reg:", int(best_hyperparameter_values[4]))

In [None]:
# Define the best hyperparameters as a dictionary
best_hyperparameters = {
    'iterations': int(best_hyperparameter_values[0]),
    'depth': int(best_hyperparameter_values[1]),
    'learning_rate': float(best_hyperparameter_values[2]),
    'min_data_in_leaf': int(best_hyperparameter_values[3]),
    'l2_leaf_reg': int(best_hyperparameter_values[4]),  # tuned in the objective, so keep it here too
    'class_weights': class_weights,  # computed from y_train above
    'task_type': 'GPU',
    'devices': '0:1',
    'cat_features': cat_feat_idx,
    'verbose': 0,
}

In [None]:
# Create a CatBoost classifier with the best hyperparameters
model = CatBoostClassifier(**best_hyperparameters)

# Train the model on your training data
model.fit(train_pool , verbose = 1000 )

### SSA

# <span style="color:#e74c3c;"> Results </span>


### Base Model

In [None]:
# predictions and scores

pred = base_model.predict(test_pool)

r2_sr = r2_score(y_test, pred)
mse = mean_squared_error(y_test, pred)

print('R2 Score :{0:.5f}'.format(r2_sr))
print('Mean Squared Error :{0:.5f}'.format(mse))

R2 Score :0.88999
Mean Squared Error :0.10383


In [None]:
# classification report

clf_report = classification_report(y_test, pred)  # signature is (y_true, y_pred)

print(clf_report)

              precision    recall  f1-score   support

           0       1.00      0.98      0.99       127
           1       0.96      1.00      0.98       334
           2       1.00      0.65      0.79        17
           3       0.93      1.00      0.97        42
           4       0.96      0.76      0.85        29

    accuracy                           0.97       549
   macro avg       0.97      0.88      0.91       549
weighted avg       0.97      0.97      0.97       549
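
The gap between the macro and the weighted averages comes from how the classes are weighted. The macro average treats every class equally, so the weaker minority classes 2 and 4 pull it down. The weighted average follows the large class 1. The short check below recomputes both figures from the recall and support columns of the report above.

```python
# recall and support columns copied from the report above (classes 0 to 4)
recall = [0.98, 1.00, 0.65, 1.00, 0.76]
support = [127, 334, 17, 42, 29]

# macro: plain mean over classes; weighted: mean weighted by support
macro_avg = sum(recall) / len(recall)
weighted_avg = sum(r * s for r, s in zip(recall, support)) / sum(support)

print(round(macro_avg, 2), round(weighted_avg, 2))
```

With classes this imbalanced, the macro average is the more informative summary of minority-class performance.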



### Base With SI

In [None]:
# predictions and scores

pred = model.predict(test_pool)

r2_sr = r2_score(y_test, pred)
mse = mean_squared_error(y_test, pred)

print('R2 Score :{0:.5f}'.format(r2_sr))
print('Mean Squared Error :{0:.5f}'.format(mse))

NameError: ignored

In [None]:
# classification report

clf_report = classification_report(y_test, pred)  # signature is (y_true, y_pred)

print(clf_report)

## Save Model

### Base Model

In [None]:
base_model.save_model('model_catboost.cbm',format="cbm")

### Base with SI

In [None]:
model.save_model(f'model_pso_catboost_{pso_iteration}iter.cbm', format="cbm")