# Learning Style Classification V4 (V3 BASE + THRESHOLD OPTIMIZATION)

## 🎯 Objective
This notebook is a strict upgrade of the **V3 Pipeline**, incorporating **Custom Threshold Optimization**.
*   **Base**: Identical to V3 (Same 3 features, same models, same grid search).
*   **Upgrade**: Instead of using the default 0.5 threshold, we calculate the optimal decision threshold for *each label* from the training data. Each threshold maximizes that label's F1-score, which in turn targets the F1-Macro average.

## 🛡️ Methodology
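1.  **Split**: Nested CV with a 10-fold stratified outer loop (stratified on the label combination) and a 3-fold inner loop for hyperparameter search.
2.  **Training**: Run GridSearch on each outer training fold to find the best hyperparameters.
3.  **Threshold Tuning (NEW)**:
    -   Predict probabilities on the Training Data with `cross_val_predict`, so every probability comes from a model that did not see that sample (avoids leakage).
    -   For each label, find the threshold `p` (e.g., 0.35, 0.42) that maximizes that label's F1-Score, searching between 0.20 and 0.65 in steps of 0.01.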
4.  **Testing**: Apply these optimal thresholds to the Test Data predictions.
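
Step 3 depends on out-of-fold probabilities. The sketch below illustrates why, using a hypothetical 1-nearest-neighbour scorer on made-up one-dimensional data rather than the notebook's models: scored in-sample, every training point just returns its own label, so any threshold looks perfect and the tuned value carries no information. The helper names `knn_proba` and `kfold_indices` are assumptions introduced for this illustration.

```python
def knn_proba(train_x, train_y, query, k=1):
    """Toy scorer: share of positive labels among the k nearest training points."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k

def kfold_indices(n, k):
    """Yield (train_idx, test_idx) pairs; fold j holds indices j, j+k, j+2k, ..."""
    for j in range(k):
        test = list(range(j, n, k))
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        yield train, test

# Made-up data: one feature, one binary label (illustrative only)
x = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
y = [0, 1, 0, 0, 1, 0, 1, 1, 1]

# In-sample scoring: each point is its own nearest neighbour, which leaks the label
in_sample = [knn_proba(x, y, q) for q in x]

# Out-of-fold scoring: each point is scored by a model fitted without it
out_of_fold = [None] * len(x)
for train, test in kfold_indices(len(x), k=3):
    tx = [x[i] for i in train]
    ty = [y[i] for i in train]
    for i in test:
        out_of_fold[i] = knn_proba(tx, ty, x[i])

print("in-sample  :", in_sample)
print("out-of-fold:", out_of_fold)
```

In the notebook this role is played by `cross_val_predict(..., method='predict_proba')` on each outer training fold.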


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from copy import deepcopy

# Sklearn Core
from sklearn.model_selection import StratifiedKFold, KFold, GridSearchCV, cross_val_predict
from sklearn.preprocessing import StandardScaler, MultiLabelBinarizer
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer
from sklearn.metrics import f1_score, hamming_loss, accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.multioutput import MultiOutputClassifier
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Imbalanced-Learn
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN

# Models
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from sklearn.svm import SVC

# Utils
import joblib

warnings.filterwarnings('ignore')
np.random.seed(42)
print("✅ Libraries Loaded Successfully (V4 Threshold Optimized + Multi-Metric)")

✅ Libraries Loaded Successfully (V4 Threshold Optimized + Multi-Metric)


## 1. Data Loading (V3 Strict)
Using the original 3 features and cleaning logic from V3.


In [2]:
# Load Raw Data
df_styles = pd.read_csv('dataset/dfjadi-simplified - dfjadi-simplified.csv')
df_time = pd.read_csv('dataset/mhs_grouping_by_material_type.csv')

# 1. Standardize Keys
df_styles['NIM'] = df_styles['NIM'].astype(str).str.upper().str.strip()
df_time['NPM'] = df_time['NPM'].astype(str).str.upper().str.strip()

# 2. Merge
df_merged = pd.merge(df_styles, df_time, left_on='NIM', right_on='NPM', how='inner')

# 3. Features (V3 Original Set)
TIME_FEATURES = ['time_materials_video', 'time_materials_document', 'time_materials_article']

# 4. Targets
def parse_labels(row):
    labels = []
    pemrosesan = str(row['Pemrosesan'])
    if 'Aktif' in pemrosesan: labels.append('Aktif')
    elif 'Reflektif' in pemrosesan: labels.append('Reflektif')
    
    input_style = str(row['Input'])
    if 'Visual' in input_style: labels.append('Visual')
    elif 'Verbal' in input_style: labels.append('Verbal')
    return labels

df_merged['labels'] = df_merged.apply(parse_labels, axis=1)

# MultiLabel Encoding
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(df_merged['labels'])
X = df_merged[TIME_FEATURES]

print(f"\nFinal Dataset: {X.shape}")
print(f"Features: {X.columns.tolist()}")
print(f"Classes: {mlb.classes_}")



Final Dataset: (125, 3)
Features: ['time_materials_video', 'time_materials_document', 'time_materials_article']
Classes: ['Aktif' 'Reflektif' 'Verbal' 'Visual']
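
For reference, the four classes above correspond to a four-column indicator matrix. The hand-rolled encoder below is only a stand-in for `MultiLabelBinarizer` (the helper name `binarize` and the sample rows are hypothetical), but it shows the column order and one consequence of `parse_labels` using `if`/`elif`: a row never carries both labels of the same dimension.

```python
CLASSES = ['Aktif', 'Reflektif', 'Verbal', 'Visual']  # sorted order, as printed above

def binarize(label_lists, classes=CLASSES):
    """Turn lists of label names into 0/1 indicator rows (one column per class)."""
    return [[1 if c in labels else 0 for c in classes] for labels in label_lists]

# Hypothetical outputs of parse_labels for three students
rows = [['Aktif', 'Visual'], ['Reflektif', 'Verbal'], ['Reflektif', 'Visual']]
encoded = binarize(rows)
for labels, vec in zip(rows, encoded):
    print(labels, '->', vec)
```

Columns 0 and 1 (processing dimension) are mutually exclusive, as are columns 2 and 3 (input dimension), which matters later when each label gets its own independent threshold.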


## 2. Pipeline Components (V3 Base)
Standard Imputer, Sampler Wrapper, and Model Definitions.
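
The sampler wrapper in this section relies on a Label Powerset trick: each row of `y` is collapsed into one string so that a single-label oversampler can balance label combinations, and the strings are then mapped back to rows. A minimal standalone sketch of that idea follows. The helper `oversample_label_powerset` is hypothetical, and the standard-library `random` module stands in for `RandomOverSampler`, so this is an illustration rather than the notebook's implementation.

```python
import random
from collections import Counter

def oversample_label_powerset(X, y, seed=42):
    """Randomly duplicate rows until every label combination is equally frequent."""
    rng = random.Random(seed)
    # Label Powerset: [1, 0, 0, 1] -> '1001'
    patterns = [''.join(map(str, row)) for row in y]
    counts = Counter(patterns)
    target = max(counts.values())
    X_res, y_res = list(X), [list(row) for row in y]
    for pat, n in counts.items():
        members = [i for i, p in enumerate(patterns) if p == pat]
        for _ in range(target - n):
            i = rng.choice(members)  # resample with replacement
            X_res.append(X[i])
            y_res.append(list(y[i]))
    return X_res, y_res

# Toy data: three rows share one combination, one row has another
X = [[10.0], [12.0], [11.0], [50.0]]
y = [[1, 0, 0, 1], [1, 0, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]]
X_res, y_res = oversample_label_powerset(X, y)
print(Counter(''.join(map(str, row)) for row in y_res))
```

Balancing combinations rather than individual labels keeps each resampled row a valid label set, at the cost of duplicating rare combinations many times on a dataset of this size.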


In [3]:
# Custom Imputer
class SmartImputer(BaseEstimator, TransformerMixin):
    def __init__(self, strategy='median', fill_value=0):
        self.strategy = strategy
        self.fill_value = fill_value
        self.imputer = None
    def fit(self, X, y=None):
        if self.strategy == 'mice': self.imputer = IterativeImputer(max_iter=10, random_state=42)
        elif self.strategy == 'constant': self.imputer = SimpleImputer(strategy='constant', fill_value=self.fill_value)
        else: self.imputer = SimpleImputer(strategy=self.strategy)
        self.imputer.fit(X, y)
        return self
    def transform(self, X): return self.imputer.transform(X)

# Custom Multi-Label Sampler Wrapper (V3 Logic)
class MultiLabelSamplerWrapper(BaseEstimator):
    def __init__(self, sampler=None):
        self.sampler = sampler
    def fit_resample(self, X, y):
        if self.sampler is None: return X, y
        # Label Powerset (LP) transformation: encode each 2D label row as one string, e.g. [1, 0, 0, 1] -> '1001'
        y_str = [''.join(map(str, row)) for row in y]
        X_res, y_str_res = self.sampler.fit_resample(X, y_str)
        # Inverse transformation: map each string back to the first label row that produced it
        mapping = {}
        for i, pat in enumerate(y_str):
            if pat not in mapping: mapping[pat] = y[i]
        y_res = np.array([mapping[pat] for pat in y_str_res])
        return X_res, y_res

# RandomOverSampler remains the required default sampler, as requested.
# Safe sampler: SMOTE with a RandomOverSampler fallback.
class SafeSMOTE(SMOTE):
    def fit_resample(self, X, y):
        # y arrives from MultiLabelSamplerWrapper as a 1D list of label-combination strings.
        # SMOTE fails when a combination has too few samples for its k_neighbors setting,
        # so fall back to random oversampling in that case.
        try:
            return super().fit_resample(X, y)
        except Exception:
            return RandomOverSampler(random_state=42).fit_resample(X, y)

# Components from V3 (RBF, SelfTraining)
class RBFNetwork(BaseEstimator, TransformerMixin):
    def __init__(self, n_centers=10, spread_factor=1.0, random_state=42):
        self.n_centers = n_centers
        self.spread_factor = spread_factor
        self.random_state = random_state
    
    def _rbf_activation(self, X, centers, spread):
        n_samples = X.shape[0]
        n_centers = centers.shape[0]
        activations = np.zeros((n_samples, n_centers))
        for i, center in enumerate(centers):
            distances = np.linalg.norm(X - center, axis=1)
            activations[:, i] = np.exp(-(distances ** 2) / (2 * spread ** 2))
        return activations

    def fit(self, X, y):
        # 1. Select centers using K-Means
        kmeans = KMeans(n_clusters=self.n_centers, random_state=self.random_state, n_init=10).fit(X)
        self.centers_ = kmeans.cluster_centers_
        
        # 2. Calculate spread
        distances = np.linalg.norm(self.centers_[:, np.newaxis] - self.centers_, axis=2)
        np.fill_diagonal(distances, np.inf)
        self.spread_ = np.mean(np.min(distances, axis=1)) * self.spread_factor
        self.spread_ = max(self.spread_, 0.1)
        
        # Store classes (CRITICAL FIX for MultiOutputClassifier)
        self.classes_ = np.unique(y)
        
        # 3. Train Output Layer (Logistic Regression)
        H = self._rbf_activation(X, self.centers_, self.spread_)
        self.output_layer_ = LogisticRegression(max_iter=1000, random_state=self.random_state)
        # MultiOutputClassifier fits one copy of this estimator per label,
        # so y is a 1D binary vector here and can be passed straight through.
        self.output_layer_.fit(H, y)
        return self

    def predict(self, X):
        H = self._rbf_activation(X, self.centers_, self.spread_)
        return self.output_layer_.predict(H)
        
    def predict_proba(self, X):
        H = self._rbf_activation(X, self.centers_, self.spread_)
        return self.output_layer_.predict_proba(H)

class MultiLabelSelfTraining(BaseEstimator):
    def __init__(self, base_estimator=None, threshold=0.75, random_state=42):
        self.base_estimator = base_estimator
        self.threshold = threshold
        self.random_state = random_state
        
    def fit(self, X, y):
        # Stand-in for the Self-Training wrapper from V2/V3: no pseudo-labelling happens here.
        # One independent copy of the base estimator (default: Random Forest) is fitted per label;
        # `threshold` is stored but not used.
        self.models_ = []
        self.n_labels_ = y.shape[1]
        for i in range(self.n_labels_):
            if self.base_estimator is None:
                model = RandomForestClassifier(n_estimators=100, random_state=self.random_state)
            else:
                model = clone(self.base_estimator)
            model.fit(X, y[:, i])
            self.models_.append(model)
        return self
        
    def predict(self, X):
        preds = []
        for model in self.models_:
            preds.append(model.predict(X))
        return np.array(preds).T

    def predict_proba(self, X):
        # Return a list with one (n_samples, 2) probability array per label
        probs = []
        for model in self.models_:
            probs.append(model.predict_proba(X))
        return probs



## 3. Threshold Optimization Logic (The V4 Upgrade)
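We define two helpers: `optimize_thresholds` sweeps candidate thresholds for each label and keeps the one with the highest F1-Score, and `apply_thresholds` turns predicted probabilities into 0/1 predictions using those per-label thresholds.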
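
As a worked illustration of the sweep for a single label, the sketch below uses made-up probabilities and a coarse 0.05 grid; the helpers `f1_binary` and `best_threshold` are hypothetical pure-Python stand-ins for `f1_score` and the per-label search, not the notebook's code.

```python
def f1_binary(y_true, y_pred):
    """F1 for one binary label; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

def best_threshold(y_true, probs, candidates):
    """Return (threshold, f1) for the candidate with the highest F1; first wins on ties."""
    best_t, best_f1 = 0.5, 0.0
    for t in candidates:
        preds = [1 if p >= t else 0 for p in probs]
        score = f1_binary(y_true, preds)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1

# Made-up probabilities for one label (illustrative only)
y_true = [1, 1, 1, 0, 0, 1]
probs = [0.45, 0.38, 0.62, 0.30, 0.55, 0.41]

candidates = [round(0.20 + 0.05 * i, 2) for i in range(10)]  # 0.20, 0.25, ..., 0.65
f1_default = f1_binary(y_true, [1 if p >= 0.5 else 0 for p in probs])
t_star, f1_star = best_threshold(y_true, probs, candidates)
print(f"default 0.50 -> F1 {f1_default:.3f}")
print(f"tuned   {t_star:.2f} -> F1 {f1_star:.3f}")
```

Lowering the threshold trades precision for recall. On small folds the F1-optimal value can therefore drift toward predicting the label for almost every sample, which is worth keeping in mind when reading the results.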


In [6]:
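def optimize_thresholds(y_true, y_probs, step=0.01):
    """Return, for each label, the candidate threshold that maximizes that label's F1."""
    n_labels = y_true.shape[1]
    best_thresholds = [0.5] * n_labels
    for i in range(n_labels):
        # Default to 0.5 if no candidate achieves F1 > 0
        best_t, best_score = 0.5, 0.0
        # Search range: roughly 0.20 to 0.65 in increments of `step`
        for t in np.arange(0.2, 0.65, step):
            preds = (y_probs[:, i] >= t).astype(int)
            score = f1_score(y_true[:, i], preds, zero_division=0)
            if score > best_score:  # strict '>' keeps the lowest threshold on ties
                best_score = score
                best_t = t
        best_thresholds[i] = best_t
    return best_thresholds

def apply_thresholds(y_probs, thresholds):
    """Binarize an (n_samples, n_labels) probability matrix with one threshold per label."""
    # Integer output keeps y_pred in the same 0/1 indicator format as y_test
    preds = np.zeros(y_probs.shape, dtype=int)
    for i in range(y_probs.shape[1]):
        preds[:, i] = (y_probs[:, i] >= thresholds[i]).astype(int)
    return preds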


In [7]:
# 4. Nested CV with Threshold Optimization (Multi-Metric)
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
inner_cv = KFold(n_splits=3, shuffle=True, random_state=42)

# V3 Pipeline Definition
def get_pipeline(clf):
    return ImbPipeline([
        ('imputer', SmartImputer(strategy='mean')), # Default placeholder, will be tuned
        ('scaler', StandardScaler()),
        ('sampler', MultiLabelSamplerWrapper(RandomOverSampler(random_state=42))), # Default placeholder, will be tuned
        ('clf', clf)
    ])

# Common Preprocessing Grid
preprocessing_grid = {
    'imputer__strategy': ['mean', 'median'],
    'sampler__sampler': [
        RandomOverSampler(random_state=42),
        SafeSMOTE(random_state=42, k_neighbors=3) # SafeSMOTE to avoid crashes
    ]
}

# V3 Grids (FULL 5 ALGORITHMS) + Preprocessing Grid
grids_v3 = {
    'Random Forest': {
        'model': MultiOutputClassifier(RandomForestClassifier(random_state=42)),
        'param_grid': {
            **preprocessing_grid,
            'clf__estimator__n_estimators': [100, 200],
            'clf__estimator__max_depth': [10, None]
        }
    },
    'XGBoost': {
        # use_label_encoder is omitted: the recorded run shows XGBoost reporting it as an unused parameter
        'model': MultiOutputClassifier(xgb.XGBClassifier(eval_metric='logloss', random_state=42)),
        'param_grid': {
             **preprocessing_grid,
            'clf__estimator__n_estimators': [100, 200],
            'clf__estimator__learning_rate': [0.1]
        }
    },
    'SVM': {
        'model': MultiOutputClassifier(SVC(probability=True, random_state=42)),
        'param_grid': {
             **preprocessing_grid,
            'clf__estimator__C': [1, 10],
            'clf__estimator__kernel': ['rbf']
        }
    },
    'RBF Network': {
        'model': MultiOutputClassifier(RBFNetwork(random_state=42)),
        'param_grid': {
             **preprocessing_grid,
            'clf__estimator__n_centers': [10, 20],
            'clf__estimator__spread_factor': [1.0]
        }
    },
    'Self-Training': {
        'model': MultiLabelSelfTraining(random_state=42),
        'param_grid': {
             **preprocessing_grid,
            'clf__threshold': [0.75] # Dummy param to satisfy gridsearch
        }
    }
}

final_results = {}

print(f"🚀 Starting V4 (V3 Base + Threshold Metrics + Multi-Metric Eval)...")

for name, config in grids_v3.items():
    print(f"\n➡️ Analyzing {name}...")
    # Storage for detailed fold metrics
    metrics_per_fold = {
        'f1_macro': [],
        'f1_micro': [],
        'hamming': [],
        'subset_acc': []
    }
    
    y_str = [str(row) for row in y] 
    
    for fold, (train_idx, test_idx) in enumerate(outer_cv.split(X, y_str)):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]
        
        # 1. Standard Tuning
        clf = GridSearchCV(
            get_pipeline(config['model']), 
            config['param_grid'],
            cv=inner_cv, scoring='f1_macro', n_jobs=-1
        )
        clf.fit(X_train, y_train)
        best_model = clf.best_estimator_
        
        # 2. Threshold Optimization
        try:
            if hasattr(best_model, 'predict_proba'):
                # Out-of-fold probabilities on the training fold (avoids leakage)
                y_train_probs_list = cross_val_predict(best_model, X_train, y_train, cv=3, method='predict_proba')
                # Keep the positive-class column of each label's (n_samples, 2) array
                y_train_probs = np.array([prob[:, 1] for prob in y_train_probs_list]).T

                # Find Best Thresholds
                best_thresholds = optimize_thresholds(y_train, y_train_probs)

                # Apply to Test
                y_test_probs_list = best_model.predict_proba(X_test)
                y_test_probs = np.array([prob[:, 1] for prob in y_test_probs_list]).T
                y_pred = apply_thresholds(y_test_probs, best_thresholds)
            else:
                y_pred = best_model.predict(X_test)
        except Exception as e:
            # Report the failure instead of hiding it, then fall back to the default 0.5 rule
            print(f"   Threshold tuning failed on fold {fold}: {e}")
            y_pred = best_model.predict(X_test)

        # 3. Calculate All Metrics
        metrics_per_fold['f1_macro'].append(f1_score(y_test, y_pred, average='macro', zero_division=0))
        metrics_per_fold['f1_micro'].append(f1_score(y_test, y_pred, average='micro', zero_division=0))
        metrics_per_fold['hamming'].append(hamming_loss(y_test, y_pred))
        metrics_per_fold['subset_acc'].append(accuracy_score(y_test, y_pred))
        
    # Aggregate fold metrics
    final_results[name] = {
        'f1_macro_mean': np.mean(metrics_per_fold['f1_macro']),
        'f1_macro_std': np.std(metrics_per_fold['f1_macro']),
        'f1_micro_mean': np.mean(metrics_per_fold['f1_micro']),
        'hamming_mean': np.mean(metrics_per_fold['hamming']),
        'subset_acc_mean': np.mean(metrics_per_fold['subset_acc'])
    }
    
    print(f"   ✅ {name}: F1-Macro {final_results[name]['f1_macro_mean']:.4f}")

# Results Table
results_data = []
for model_name, res in final_results.items():
    results_data.append({
        'Algorithm': model_name,
        'F1 Macro': res['f1_macro_mean'],
        'F1 Micro': res['f1_micro_mean'],
        'Hamming Loss': res['hamming_mean'],
        'Subset Acc': res['subset_acc_mean'],
        'std': res['f1_macro_std']
    })

results_df = pd.DataFrame(results_data).sort_values('F1 Macro', ascending=False)
print("\n" + "="*80)
print(results_df.to_string(index=False, formatters={
    'F1 Macro': '{:.4f}'.format,
    'F1 Micro': '{:.4f}'.format,
    'Hamming Loss': '{:.4f}'.format,
    'Subset Acc': '{:.4f}'.format,
    'std': '{:.4f}'.format
}))


🚀 Starting V4 (V3 Base + Threshold Metrics + Multi-Metric Eval)...

➡️ Analyzing Random Forest...

   ✅ Random Forest: F1-Macro 0.5879

➡️ Analyzing XGBoost...


Parameters: { "use_label_encoder" } are not used.

   ✅ XGBoost: F1-Macro 0.6037

➡️ Analyzing SVM...

   ✅ SVM: F1-Macro 0.6108

➡️ Analyzing RBF Network...


  raw_prediction = X @ weights + intercept
  grad[:n_features] = X.T @ grad_pointwise + l2_reg_strength * weights

   ✅ RBF Network: F1-Macro 0.6234

➡️ Analyzing Self-Training...

   ✅ Self-Training: F1-Macro 0.4801

    Algorithm F1 Macro F1 Micro Hamming Loss Subset Acc    std
  RBF Network   0.6234   0.6712       0.4841     0.0000 0.0201
          SVM   0.6108   0.6659       0.4881     0.0000 0.0157
      XGBoost   0.6037   0.6697       0.4306     0.0705 0.0379
Random Forest   0.5879   0.6483       0.4540     0.0718 0.0479
Self-Training   0.4801   0.6269       0.3731     0.3788 0.1176
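
When reading the table, it may help to know what a predictor that switches on every label would score. The sketch below computes the four reported metrics by hand for such a predictor, under the assumption that each student carries exactly one processing label and one input label (two of four columns set). The helper `multilabel_metrics` and the label rows are hypothetical.

```python
def multilabel_metrics(y_true, y_pred):
    """Hand-computed macro/micro F1, Hamming loss and subset accuracy for 0/1 matrices."""
    n, m = len(y_true), len(y_true[0])

    def f1(tp, fp, fn):
        return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

    per_label, tot = [], [0, 0, 0]
    for j in range(m):
        tp = sum(1 for i in range(n) if y_true[i][j] == 1 and y_pred[i][j] == 1)
        fp = sum(1 for i in range(n) if y_true[i][j] == 0 and y_pred[i][j] == 1)
        fn = sum(1 for i in range(n) if y_true[i][j] == 1 and y_pred[i][j] == 0)
        per_label.append(f1(tp, fp, fn))
        tot = [tot[0] + tp, tot[1] + fp, tot[2] + fn]
    wrong_cells = sum(y_true[i][j] != y_pred[i][j] for i in range(n) for j in range(m))
    exact_rows = sum(y_true[i] == y_pred[i] for i in range(n))
    return {
        'f1_macro': sum(per_label) / m,
        'f1_micro': f1(*tot),
        'hamming': wrong_cells / (n * m),
        'subset_acc': exact_rows / n,
    }

# Hypothetical labels: columns are Aktif, Reflektif, Verbal, Visual
y_true = [[1, 0, 0, 1], [0, 1, 1, 0], [1, 0, 1, 0], [0, 1, 0, 1]]
y_all_positive = [[1, 1, 1, 1] for _ in y_true]

result = multilabel_metrics(y_true, y_all_positive)
print(result)
```

Under that assumption an all-positive predictor has a Hamming loss of 0.5, a subset accuracy of 0 and an F1-Micro of 2/3, while its F1-Macro depends on how often each label occurs. The RBF Network and SVM rows sit close to those values, which may mean the tuned thresholds push these models toward predicting nearly every label. This is a hypothesis to check against the per-fold thresholds, not a result established by this notebook.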
