# Advanced Training Algorithms

The main problem of this dataset, as also mentioned in the dataset analysis methodology, is the class imbalance.
More specifically, 54% samples are labeled as "*functional*", 38.5% are labeled as "*non functional*" and finally,
only 7.5% of the samples belong to "*functional need repair*" category. The class imbalance problem can be efficiently
dealt with the approaches below:
1. Re-Sampling
2. Re-Weighting
3. Probability Calibration
4. Semi-Supervised Learning


# Preparing Dataset

In [1]:
import pandas as pd

train_inputs = pd.read_csv('train_inputs.csv')
train_targets = pd.read_csv('targets.csv')
test_inputs = pd.read_csv('test_inputs.csv')

test_input_ids = test_inputs['id']
columns_to_drop = [
    'id', 'date_recorded', 'longitude', 'latitude', 'wpt_name', 'num_private', 'subvillage', 'region', 'ward', 'recorded_by',
    'scheme_name', 'construction_year', 'extraction_type', 'extraction_type_class', 'management_group', 'payment', 'quality_group',
    'quantity_group', 'source_type', 'source_class', 'waterpoint_type_group',
]
train_inputs = train_inputs.drop(columns=columns_to_drop)
test_inputs = test_inputs.drop(columns=columns_to_drop)

train_targets = train_targets['status_group'].replace({'functional': 0, 'functional needs repair': 1, 'non functional': 2}).astype(int)
train_inputs.shape, train_targets.shape, test_inputs.shape

((59400, 21), (59400,), (14850, 21))

# Synthetic Data & Re-Sampling

Re-sampling can be divided into **Over-Sampling the minority samples, or Under-Sampling the majority samples**.
The Over-Sampling involves duplicating or generating synthetic samples in the minority class with replacement,
while the Under-Sampling involves deleting samples of the majority class. On the other hand,
synthetic data involves generating Both approaches can be repeated until
the desired class distribution is achieved in the training dataset, such as an equal split across the classes.
The algorithms that will be tested are:
1. `SVM-Smote` (Over-Sampling)
2. `ADASYN` (Over-Sampling)
3. `NearMiss v3)` (Under-Sampling)
4. `Smote + TOMEK Links` (Combination)
5. `NearMiss v3 + SVM-Smote` (Combination)


In [8]:
from imblearn.over_sampling import SVMSMOTE, ADASYN
from imblearn.under_sampling import NearMiss
from imblearn.combine import SMOTETomek
from category_encoders.ordinal import OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report


x_train, x_test, y_train, y_test = train_test_split(
    train_inputs, train_targets, test_size=0.1
)
encoder = OrdinalEncoder()
x_train = encoder.fit_transform(x_train)
x_test = encoder.transform(x_test)

random_state = 0

resampling_methods = {
    'None': None,
    'SVM-Smote': SVMSMOTE(random_state=random_state, n_jobs=-1),
    'ADASYN': ADASYN(random_state=random_state, n_jobs=-1),
    'NearMiss': NearMiss(version=3, n_jobs=-1),
    'Smote-TOMEK': SMOTETomek(random_state=random_state, n_jobs=-1),
    'NearMiss + SVM-Smote': [NearMiss(version=3, n_jobs=-1), SMOTETomek(random_state=random_state, n_jobs=-1)]
}

for resampling_method, resampler in resampling_methods.items():
    print(f'\n----- Resampler: {resampling_method} -----')

    if resampling_method == 'None':
        x_train_new = x_train
        y_train_new = y_train
    elif resampling_method != 'NearMiss + SVM-Smote':
        x_train_new, y_train_new = resampler.fit_resample(x_train, y_train)
    else:
        for res in resampler:
            x_train_new, y_train_new = res.fit_resample(x_train, y_train)

    clf = RandomForestClassifier(random_state=random_state, n_jobs=-1)
    clf.fit(x_train_new, y_train_new)
    y_pred = clf.predict(x_test)
    print(classification_report(y_test, y_pred))



----- Resampler: None -----
              precision    recall  f1-score   support

           0       0.81      0.87      0.84      3275
           1       0.48      0.32      0.38       419
           2       0.82      0.78      0.80      2246

    accuracy                           0.80      5940
   macro avg       0.70      0.66      0.67      5940
weighted avg       0.79      0.80      0.79      5940


----- Resampler: SVM-Smote -----
              precision    recall  f1-score   support

           0       0.83      0.82      0.82      3275
           1       0.40      0.42      0.41       419
           2       0.80      0.80      0.80      2246

    accuracy                           0.79      5940
   macro avg       0.68      0.68      0.68      5940
weighted avg       0.79      0.79      0.79      5940


----- Resampler: ADASYN -----
              precision    recall  f1-score   support

           0       0.83      0.81      0.82      3275
           1       0.38      0.46  

Overall, the **SVM-Smote approach** increases the recall, as well as the `F1` scores of classes `1 & 2` and
all approaches increase the `F1` metric for the class `1`, except `NearMiss`.
**However, the overall accuracy drops as well**.

| Synthetic Data / Re-Sampling | Accuracy | F1 (0)   | F1 (1)   | F1 (2)   |
|------------------------------|----------|----------|----------|----------|
| None                         | **0.80** | **0.84** | 0.38     | 0.80     |
| SVM-Smote                    | 0.79     | 0.82     | **0.41** | **0.80** |
| ADASYN                       | 0.78     | 0.82     | 0.41     | 0.80     |
| NearMiss                     | 0.60     | 0.66     | 0.27     | 0.60     |
| Smote-TOMEK                  | 0.78     | 0.82     | 0.40     | 0.80     |
| NearMiss + SVM-Smote         | 0.78     | 0.82     | 0.40     | 0.80     |

# Re-Weighting

Although re-sampling strategies balance out the dataset, it does not directly tackle the issues caused by
Class Imbalance, rather it risks introducing new issues. Since Oversampling introduces duplicate samples,
it could easily slow down the training and also lead to over-fitting a model. On the other hand, under-sampling
removes certain number of samples. This could lead to the model missing out on learning certain important concepts
that it could have learnt from the samples that were removed as a result of.

One way to deal with the above issues is to directly modify the loss function. This method involves assigning different
weights to different classes (or even different samples). This method is adaptive and there are many variants of this
type of method. The simplest is to weight according to the portion of the samples of each category.

In [11]:
class_weights = [None, 'balanced', 'balanced_subsample']

for class_weight in class_weights:
    print(f'\n----- Class Weights: {class_weight} -----')

    clf = RandomForestClassifier(class_weight=class_weight, random_state=random_state, n_jobs=-1)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(classification_report(y_test, y_pred))


----- Class Weights: None -----
              precision    recall  f1-score   support

           0       0.81      0.87      0.84      3275
           1       0.48      0.32      0.38       419
           2       0.82      0.78      0.80      2246

    accuracy                           0.80      5940
   macro avg       0.70      0.66      0.67      5940
weighted avg       0.79      0.80      0.79      5940


----- Class Weights: balanced -----
              precision    recall  f1-score   support

           0       0.82      0.84      0.83      3275
           1       0.41      0.40      0.40       419
           2       0.82      0.79      0.80      2246

    accuracy                           0.79      5940
   macro avg       0.68      0.68      0.68      5940
weighted avg       0.79      0.79      0.79      5940


----- Class Weights: balanced_subsample -----
              precision    recall  f1-score   support

           0       0.82      0.85      0.83      3275
           1

The balanced method again increased the `F1` scores of the class: `1`.
There was nothing special about this method, however it always worth
checking if this method improves the classifier's performance, as it
does not add any additional overhead.

| Class Weights      | Accuracy | F1 (0)   | F1 (1)   | F1 (2)   |
|--------------------|----------|----------|----------|----------|
| None               | 0.80     | **0.84** | 0.38     | 0.80     |
| Balanced           | **0.80** | 0.83     | **0.40** | **0.80** |
| Balanced Subsample | 0.79     | 0.83     | 0.39     | 0.80     |

# Probability Calibration

Probability calibration is the process of calibrating a model to return the true likelihood of an event.
This is necessary when we need the probability of the event in question rather than its classification.
The probability can be used as a measure of uncertainty on those problems where a probabilistic prediction
is required. This is particularly the case in imbalanced classification, where crisp class labels are often
insufficient both in terms of evaluating and selecting a model. The calibration methods that will be used
are:
1. `Sigmoid`
2. `Isotonic`

In [14]:
from sklearn.calibration import CalibratedClassifierCV


calibration_methods = [None, 'sigmoid', 'isotonic']

for calibration_method in calibration_methods:
    print(f'\n----- Calibration Method: {calibration_method} -----')

    if calibration_method is not None:
        clf = CalibratedClassifierCV(
            RandomForestClassifier(random_state=random_state), method=calibration_method,  n_jobs=-1
        )
    else:
        clf = RandomForestClassifier(random_state=random_state,  n_jobs=-1)

    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(classification_report(y_test, y_pred))


----- Calibration Method: None -----
              precision    recall  f1-score   support

           0       0.81      0.87      0.84      3275
           1       0.48      0.32      0.38       419
           2       0.82      0.78      0.80      2246

    accuracy                           0.80      5940
   macro avg       0.70      0.66      0.67      5940
weighted avg       0.79      0.80      0.79      5940


----- Calibration Method: sigmoid -----
              precision    recall  f1-score   support

           0       0.81      0.89      0.85      3275
           1       0.54      0.27      0.36       419
           2       0.83      0.79      0.81      2246

    accuracy                           0.81      5940
   macro avg       0.73      0.65      0.67      5940
weighted avg       0.80      0.81      0.80      5940


----- Calibration Method: isotonic -----
              precision    recall  f1-score   support

           0       0.80      0.91      0.85      3275
        

| Calibration | Accuracy | F1 (0)   | F1 (1)   | F1 (2)   |
|-------------|----------|----------|----------|----------|
| None        | 0.80     | 0.84     | **0.38** | 0.80     |
| Sigmoid     | **0.81** | **0.85** | 0.36     | **0.81** |
| Isotonic    | 0.81     | 0.85     | 0.35     | 0.81     |

The calibrated classifier managed to outperform the original classifier overall.

# Semi-Supervised Learning

Semi-Supervised Learning is utilized when there are lots of unlabeled data. This happens very often
in real world situations, where finding a dataset is quite difficult. Hiring experts to label all the
available data might be extremely expensive and time-consuming. In such situations, Semi-Supervised
learning comes very handy. In Semi-Supervised Learning, a "teacher" classifier is trained on the
training set and labels the unseen data. Then, a student classifier is trained using the training set,
as well as the most confident labeled samples by the teacher classifier.

In this problem, the most confident labeled samples of the classes "*functional needs repair*" and "*non functional*"
can be picked, in order to balance the entire dataset. Additionally, 3 classifiers will be trained as "teachers" and
the labels will be picked using the "*voting*" method. Finally, the teacher classifiers will be calibrated, in order
to produce more confident predictions.

In [18]:
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

rf_clb = CalibratedClassifierCV(RandomForestClassifier(random_state=random_state, n_jobs=-1))
knn_clb = CalibratedClassifierCV(KNeighborsClassifier(weights='distance', n_jobs=-1))
xgboost_clb = CalibratedClassifierCV(XGBClassifier(random_state=random_state, n_jobs=-1))

rf_clb.fit(x_train, y_train)
rf_preds = rf_clb.predict(x_test)
rf_probs = rf_clb.predict_proba(x_test)

knn_clb.fit(x_train, y_train)
knn_preds = knn_clb.predict(x_test)
knn_probs = knn_clb.predict_proba(x_test)

xgboost_clb.fit(x_train, y_train)
xgboost_preds = xgboost_clb.predict(x_test)
xgboost_probs = xgboost_clb.predict_proba(x_test)

teachers_df = pd.DataFrame({
    'y_true': y_test,
    'rf_preds': rf_preds, 'rf-p0': rf_probs[:, 0], 'rf-p1': rf_probs[:, 1], 'rf-p2': rf_probs[:, 2],
    'knn-preds': knn_preds, 'knn-p0': knn_probs[:, 0], 'knn-p1': knn_probs[:, 1], 'knn-p2': knn_probs[:, 2],
    'xgboost-preds': xgboost_preds, 'xgboost-p0': xgboost_probs[:, 0], 'xgboost-p1': xgboost_probs[:, 1], 'xgboost-p2': xgboost_probs[:, 2]
})
teachers_df

Unnamed: 0,y_true,rf_preds,rf-p0,rf-p1,rf-p2,knn-preds,knn-p0,knn-p1,knn-p2,xgboost-preds,xgboost-p0,xgboost-p1,xgboost-p2
14918,0,2,0.122959,0.035956,0.841085,0,0.754766,0.044403,0.200831,2,0.042181,0.027059,0.930760
4314,0,0,0.716797,0.043115,0.240088,0,0.631735,0.075980,0.292285,0,0.771052,0.027452,0.201496
53553,0,0,0.910562,0.033324,0.056114,0,0.819007,0.044298,0.136694,0,0.877283,0.031460,0.091257
6630,2,2,0.106652,0.033385,0.859964,2,0.300715,0.043180,0.656105,2,0.068640,0.030444,0.900916
28310,0,0,0.913441,0.032946,0.053614,0,0.819007,0.044298,0.136694,0,0.890077,0.029158,0.080765
...,...,...,...,...,...,...,...,...,...,...,...,...,...
16034,0,0,0.871472,0.041846,0.086682,0,0.800681,0.044337,0.154983,0,0.854543,0.069175,0.076282
43140,0,0,0.670356,0.269081,0.060563,0,0.557974,0.123217,0.318809,0,0.906149,0.044213,0.049638
25161,2,2,0.067407,0.032326,0.900267,2,0.178310,0.042853,0.778836,2,0.047351,0.026956,0.925693
9321,0,0,0.909991,0.033340,0.056668,0,0.819007,0.044298,0.136694,0,0.903621,0.028921,0.067459


In [78]:
confidence_threshold = 0.6

labeled_samples = []
labels = []

i = 0
for _, row in teachers_df.iterrows():
    if row['rf_preds'] == row['knn-preds'] == row['xgboost-preds']:
        pred = int(row['rf_preds'])

        if row[f'rf-p{pred}'] > confidence_threshold and \
                row[f'knn-p{pred}'] > confidence_threshold and \
                row[f'xgboost-p{pred}'] > confidence_threshold:
            labeled_samples.append(x_test.iloc[i])
            labels.append(pred)
    i += 1

f'Selected {len(labeled_samples)} labeled samples'

'Selected 3293 labeled samples'

In [82]:
import numpy as np

x_train_new = np.concatenate([x_train, labeled_samples], axis=0)
y_train_new = np.concatenate([y_train, labels], axis=0)

x_train_new.shape, y_train_new.shape

((56753, 21), (56753,))

In [80]:
rf = RandomForestClassifier(random_state=random_state, n_jobs=-1)
rf.fit(x_train_new, y_train_new)
y_pred = rf.predict(x_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.81      0.87      0.84      3275
           1       0.49      0.33      0.39       419
           2       0.82      0.79      0.80      2246

    accuracy                           0.80      5940
   macro avg       0.71      0.66      0.68      5940
weighted avg       0.79      0.80      0.79      5940



