# Building an ensemble learning model for software defect prediction (SDP)


The input data comes from the PROMISE, NASA, and Eclipse datasets. For the two datasets that ship multiple versions, the train/test split is:
train on version $n$ and test on version $n + 1$.
For NASA: use cross-validation.
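
The two evaluation schemes above can be sketched as follows. This is a minimal illustration, not the notebook's actual loading code: `consecutive_version_pairs` is a hypothetical helper, the version names are placeholders, and synthetic data with a logistic regression stands in for the real dataset and model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score


def consecutive_version_pairs(versions):
    """Return (train, test) pairs: train on version n, test on version n + 1."""
    return list(zip(versions[:-1], versions[1:]))


# Placeholder release names, assumed to be sorted from oldest to newest.
versions = ["v1.0", "v1.1", "v1.2"]
pairs = consecutive_version_pairs(versions)
print(pairs)

# Single-release data: stratified k-fold keeps the class ratio roughly
# constant in every fold. Synthetic data is used here for illustration.
X, y = make_classification(
    n_samples=300, n_features=10, weights=[0.8, 0.2], random_state=42
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="roc_auc"
)
print(f"ROC AUC per fold: {np.round(scores, 3)}")
```

Stratification matters here because the defect class is the minority; a plain k-fold split could leave some folds with very few buggy samples.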

<span style="color:red">Important: most ensemble models were tried, and CatBoost gave the best performance.
The data processing steps include feature selection and SMOTE. PCA is NOT used because it was not effective.</span>

1. Import libraries

In [20]:
!pip install catboost



In [21]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler, LabelEncoder

from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif, f_classif
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, confusion_matrix, classification_report
from sklearn.utils.class_weight import compute_class_weight
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.impute import SimpleImputer
import xgboost as xgb

from catboost import CatBoostClassifier


2. Set up the model evaluation metrics: F1 score, ROC AUC, G-mean, accuracy

In [22]:
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    roc_auc_score,
    classification_report,
    confusion_matrix
)
from imblearn.metrics import geometric_mean_score
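
For reference, the G-mean on binary labels is the geometric mean of the recall on each class (specificity and sensitivity). The sketch below computes it with scikit-learn only and bundles the four metrics into one helper. `g_mean_binary` and `evaluate` are hypothetical names, and the toy labels are invented; for binary labels the result should agree with `geometric_mean_score` under its default averaging.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score


def g_mean_binary(y_true, y_pred):
    """Geometric mean of recall on class 0 (specificity) and class 1 (sensitivity)."""
    specificity, sensitivity = recall_score(
        y_true, y_pred, average=None, labels=[0, 1]
    )
    return float(np.sqrt(specificity * sensitivity))


def evaluate(y_true, y_pred, y_proba):
    """Collect the four metrics used in this notebook into one dictionary."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1_weighted": f1_score(y_true, y_pred, average="weighted"),
        "roc_auc": roc_auc_score(y_true, y_proba),
        "g_mean": g_mean_binary(y_true, y_pred),
    }


# Toy labels and scores, invented for illustration only.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
y_proba = [0.1, 0.2, 0.3, 0.6, 0.9, 0.4]

metrics = evaluate(y_true, y_pred, y_proba)
for name, value in metrics.items():
    print(f"{name:12s}: {value:.4f}")
```

Here the specificity is 3/4 and the sensitivity is 1/2, so the G-mean is $\sqrt{0.75 \times 0.5}$.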

2. Preprocessing data

The data in ```data/``` is split manually by version.

In [23]:
# Train on the older release, test on the next one
df_train_raw = pd.read_csv('data/camel-1.2.csv')
df_test_raw = pd.read_csv('data/camel-1.4.csv')

# Binarize the label: any positive bug count becomes 1 (defective), otherwise 0
df_train_raw['bug'] = (df_train_raw['bug'] > 0).astype(int)
df_test_raw['bug'] = (df_test_raw['bug'] > 0).astype(int)

In [24]:

## Training set information
# info() prints its summary directly and returns None, so it is not wrapped in print()
df_train_raw.info()
print("\nDistribution of the 'bug' class in the training set:")
print(df_train_raw['bug'].value_counts(normalize=True))
original_train_bug_counts = df_train_raw['bug'].value_counts()

## Test set information
df_test_raw.info()
print("\nDistribution of the 'bug' class in the test set:")
print(df_test_raw['bug'].value_counts(normalize=True))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 608 entries, 0 to 607
Data columns (total 22 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   name    608 non-null    object 
 1   wmc     608 non-null    int64  
 2   dit     608 non-null    int64  
 3   noc     608 non-null    int64  
 4   cbo     608 non-null    int64  
 5   rfc     608 non-null    int64  
 6   lcom    608 non-null    int64  
 7   ca      608 non-null    int64  
 8   ce      608 non-null    int64  
 9   npm     608 non-null    int64  
 10  lcom3   608 non-null    float64
 11  loc     608 non-null    int64  
 12  dam     608 non-null    float64
 13  moa     608 non-null    int64  
 14  mfa     608 non-null    float64
 15  cam     608 non-null    float64
 16  ic      608 non-null    int64  
 17  cbm     608 non-null    int64  
 18  amc     608 non-null    float64
 19  max_cc  608 non-null    int64  
 20  avg_cc  608 non-null    float64
 21  bug     608 non-null    int64  
dtypes:

<span style="color:green">The data is heavily imbalanced. Without handling the class imbalance, training a model would be meaningless.</span>
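
To make the point concrete, the sketch below scores a degenerate classifier that always predicts the majority class. The labels are synthetic, with an illustrative 80/20 split rather than the real class counts. Accuracy looks acceptable while recall on the buggy class, and therefore the G-mean, collapses to zero.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 80 clean classes (0) and 20 buggy classes (1).
y_true = np.array([0] * 80 + [1] * 20)

# A degenerate "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
recall_clean = recall_score(y_true, y_pred, pos_label=0)
recall_bug = recall_score(y_true, y_pred, pos_label=1)
g_mean = float(np.sqrt(recall_clean * recall_bug))

print(f"Accuracy        : {accuracy:.2f}")
print(f"Recall (bug = 0): {recall_clean:.2f}")
print(f"Recall (bug = 1): {recall_bug:.2f}")
print(f"G-mean          : {g_mean:.2f}")
```

This is why the notebook reports G-mean and ROC AUC alongside accuracy.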

3. Train/test split

In [25]:
X_train = df_train_raw.drop(columns=['name', 'bug'])
y_train = df_train_raw['bug']
X_test = df_test_raw.drop(columns=['name', 'bug'])
y_test = df_test_raw['bug']


The data is highly imbalanced (~80/20).

Save the original feature column names.

In [26]:
feature_names = X_train.columns.tolist()

<span style="color:red">THE PIPELINE FOR THE WHOLE WORKFLOW CONSISTS OF THE FOLLOWING STEPS</span>
1. Preprocessing and feature selection (inside the pipeline)
2. The imputer, scaler, and feature selector are all defined in the pipeline
3. Handling class imbalance (SMOTE inside the pipeline)
4. Prediction and evaluation with CatBoost

In [27]:
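# Number of training samples in the minority class
minority_class_count_train = original_train_bug_counts.min()
# SMOTE needs fewer neighbors than minority samples: keep 5 when possible, otherwise shrink
smote_k_neighbors = 5 if minority_class_count_train > 5 else max(1, minority_class_count_train - 1)
# Number of features kept by SelectKBest
k_features = 5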

In [28]:
pipeline_cb = ImbPipeline([
    ('imputer', SimpleImputer(strategy='median')),  # Fill NaN values with the column median
    ('scaler', StandardScaler()),
    ('selector', SelectKBest(score_func=f_classif, k=k_features)),
    ('smote', SMOTE(random_state=42, k_neighbors=smote_k_neighbors)),  # Handle class imbalance
    ('classifier', CatBoostClassifier(iterations=1000, depth=4, learning_rate=0.1, verbose=0))
])
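
The saved `feature_names` list can be used to report which columns `SelectKBest` keeps once a pipeline has been fitted. The sketch below is self-contained and makes several assumptions: synthetic data, placeholder column names, and a logistic regression in place of CatBoost. In this notebook the equivalent call would be `pipeline_cb.named_steps['selector'].get_support()` after fitting.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the metric table; the column names are placeholders.
X_arr, y = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=42
)
feature_names = [f"metric_{i}" for i in range(X_arr.shape[1])]
X = pd.DataFrame(X_arr, columns=feature_names)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("selector", SelectKBest(score_func=f_classif, k=5)),
    ("classifier", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X, y)

# get_support() returns a boolean mask over the columns fed to the selector.
# The scaler keeps the column order, so the mask lines up with feature_names.
mask = pipeline.named_steps["selector"].get_support()
selected = [name for name, keep in zip(feature_names, mask) if keep]
print("Selected features:", selected)
```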

In [29]:

pipeline_cb.fit(X_train, y_train)

y_pred_cb = pipeline_cb.predict(X_test)
y_proba_cb = pipeline_cb.predict_proba(X_test)[:, 1]

print("\n--- Results with CatBoostClassifier ---")
print(f"Accuracy       : {accuracy_score(y_test, y_pred_cb):.4f}")
print(f"F1-score (weighted): {f1_score(y_test, y_pred_cb, average='weighted'):.4f}")
print(f"ROC AUC        : {roc_auc_score(y_test, y_proba_cb):.4f}")
print(f"G-Mean         : {geometric_mean_score(y_test, y_pred_cb, pos_label=1):.4f}")

# # Other metrics, if needed
# print("\nClassification Report:")
# print(classification_report(y_test, y_pred_cb, target_names=['bug=0', 'bug=1']))
# print("Confusion Matrix:")
# print(confusion_matrix(y_test, y_pred_cb))


--- Results with CatBoostClassifier ---
Accuracy       : 0.6743
F1-score (weighted): 0.7121
ROC AUC        : 0.6681
G-Mean         : 0.6306


<span style="color:green">Although the weighted F1 score is fairly good, the ROC AUC shows that the model is still somewhat inaccurate on class ```bug = 1```</span>
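
One way to check this remark is to compare the weighted F1 with the per-class F1. The weighted average uses class support as weights, so the majority class dominates it. The predictions below are invented on an illustrative 80/20 label split; they are not this notebook's results.

```python
import numpy as np
from sklearn.metrics import f1_score

# Invented predictions, for illustration only.
# Class 0: 75 correct, 5 false alarms. Class 1: 6 found, 14 missed.
y_true = np.array([0] * 80 + [1] * 20)
y_pred = np.array([0] * 75 + [1] * 5 + [1] * 6 + [0] * 14)

# average=None returns one F1 value per class, ordered by label: [class 0, class 1]
f1_per_class = f1_score(y_true, y_pred, average=None)
f1_weighted = f1_score(y_true, y_pred, average="weighted")

print(f"F1 (bug = 0)  : {f1_per_class[0]:.4f}")
print(f"F1 (bug = 1)  : {f1_per_class[1]:.4f}")
print(f"F1 (weighted) : {f1_weighted:.4f}")
```

Uncommenting the `classification_report` call in the cell above gives the same per-class breakdown for the real predictions.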