# Start!

## Problem:
We are dealing with a set of numerical data and a binary classification problem.

The feature descriptions suggest that the features are correlated to some degree, for example the width of the heart or lungs and their surface area.

### My initial intuition:
1. Run PCA on the features to reduce noise in the data
2. Apply kNN classification with cross-validation over the hyperparameter k to avoid under- or overfitting
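
The two steps above can be sketched as a single pipeline. This is a minimal illustration on synthetic data, not the code used later in this notebook: the data from `make_classification`, the 95% variance threshold, the 5-fold setting and the names `X_demo`, `y_demo`, `pipeline` and `search` are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption): 60 samples, 12 partly redundant features
X_demo, y_demo = make_classification(
    n_samples=60, n_features=12, n_informative=4, n_redundant=6, random_state=0
)

# Scaling and PCA live inside the pipeline, so they are refit on every training fold
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),  # keep enough components for 95% of the variance
    ("knn", KNeighborsClassifier()),
])

# Cross-validate over k only
search = GridSearchCV(
    pipeline,
    param_grid={"knn__n_neighbors": list(range(1, 11))},
    cv=5,
    scoring="roc_auc",
)
search.fit(X_demo, y_demo)

print("Best k:", search.best_params_["knn__n_neighbors"])
print("Cross-validated ROC AUC:", round(search.best_score_, 3))
```

Keeping PCA inside the pipeline means the components are learned only from the training folds, so the cross-validated score is not influenced by the held-out fold.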

In [25]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder


In [26]:
# Load DataFrame from CSV
df = pd.read_csv("task_data.csv", sep=",")

# Fix commas in floats (if necessary)
for col in df.columns:
    if df[col].dtype == object:
        df[col] = df[col].str.replace(",", ".")

# Convert columns except 'ID' and label to float
cols_to_float = [col for col in df.columns if col not in ['ID', 'Cardiomegaly']]
df[cols_to_float] = df[cols_to_float].astype(float)

# Encode target labels if necessary
le = LabelEncoder()
y = le.fit_transform(df["Cardiomegaly"])
X = df.drop(columns=["ID", "Cardiomegaly"]).values


The dataset is imbalanced, so I decided to use the SMOTE technique.

In addition, since this dataset consists of measurements, I decided to extend it with copies of the data perturbed by noise, which increased the dataset size and improved the models' scores.

In [27]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(sampling_strategy='auto')
X_resampled, y_resampled = smote.fit_resample(X, y)

# adding noisy data:
noise_level = 0.045

feature_stds = np.std(X_resampled, axis=0)

noise = np.random.normal(loc=0, scale=noise_level * feature_stds, size=X_resampled.shape)

X_noisy = X_resampled + noise

X_resampled = np.array([*X_resampled, *X_noisy])
y_resampled = np.array([*y_resampled, *y_resampled])

print(y_resampled.shape)


(112,)




### Standardizing the data before passing it to the models

In [28]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_standardized = scaler.fit_transform(X_resampled)

# PCA is not applied in this version; the standardized features are used as-is
X_reduced = X_standardized


### Splitting into training and test data

In [29]:
X_train, X_test, y_train, y_test = train_test_split(X_reduced, y_resampled)
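
In this notebook the resampling and the noisy copies are created before the split, so a test row can be a near-copy of a training row, which may inflate the scores reported below. A possible alternative, sketched here on synthetic data, is to split first and augment only the training part. The arrays `X_raw` and `y_raw`, the 25% test size and the fixed seed are assumptions for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the original, un-augmented measurements (assumption)
X_raw = rng.normal(size=(40, 5))
y_raw = np.array([0] * 28 + [1] * 12)

# 1. Split first, so the test rows never take part in augmentation
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.25, stratify=y_raw, random_state=0
)

# 2. Augment the training part only (SMOTE, if used, would also be applied here)
noise_level = 0.045
noise = rng.normal(loc=0.0, scale=noise_level * X_tr.std(axis=0), size=X_tr.shape)
X_tr_aug = np.vstack([X_tr, X_tr + noise])
y_tr_aug = np.concatenate([y_tr, y_tr])

print(X_tr_aug.shape, X_te.shape)
```

With this ordering the test split contains only original measurements, so the test metrics describe performance on data the models have not seen in any form.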

### Model selection
Models that suit a binary classification problem with a small number of numerical features include:
- Random Forest Classifier
- KNeighbors Classifier
- Logistic Regression

Here I search for the best hyperparameters for the RFC.

In [30]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200, 300, 500],
    'max_depth': [None, 10, 20, 50, 100, 200],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None],
}

rfClassifier = RandomForestClassifier()

grid_search = GridSearchCV(
    estimator=rfClassifier,
    param_grid=param_grid,
    cv=2,
    n_jobs=-1,
    scoring='roc_auc'
    )

grid_search.fit(X_train, y_train)

print(grid_search.predict([X_test[0]]))

print("Best Parameters:", grid_search.best_params_)
print("Best Estimator:", grid_search.best_estimator_)
print("ROC AUC:", grid_search.score(X_test, y_test))  # score() uses scoring='roc_auc'

rfClassifier = grid_search.best_estimator_


[1]
Best Parameters: {'max_depth': 10, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 100}
Best Estimator: RandomForestClassifier(max_depth=10, max_features='log2', min_samples_split=5)
ROC AUC: 1.0
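
Because the search is configured with `scoring='roc_auc'`, `grid_search.score` returns ROC AUC rather than accuracy. The sketch below, on synthetic data, shows the difference between the two numbers. The data, the `C` grid and all variable names are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_a, X_b, y_a, y_b = train_test_split(X_demo, y_demo, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=3,
    scoring="roc_auc",
)
search.fit(X_a, y_a)

# score() applies the scorer passed as `scoring`, here ROC AUC
auc_from_score = search.score(X_b, y_b)
auc_manual = roc_auc_score(y_b, search.decision_function(X_b))

# Accuracy has to be computed separately, from hard class predictions
accuracy = accuracy_score(y_b, search.predict(X_b))

print("ROC AUC via score():", auc_from_score)
print("ROC AUC computed manually:", auc_manual)
print("Accuracy:", accuracy)
```

The same applies to the KNC and Logistic Regression searches further down, which use the same `scoring` argument.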


Here I consider different hyperparameters for the KNC.

In [31]:

from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_neighbors": range(1, 11),
    "leaf_size": range(30, 200, 5),
    "weights": ["uniform", "distance"]
}

knClassifier = KNeighborsClassifier()

grid_search = GridSearchCV(
    estimator=knClassifier,
    param_grid=param_grid,
    cv=2,
    n_jobs=-1,
    scoring='roc_auc'
    )

grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print("Best Estimator:", grid_search.best_estimator_)
print("ROC AUC:", grid_search.score(X_test, y_test))  # score() uses scoring='roc_auc'

knClassifier = grid_search.best_estimator_



Best Parameters: {'leaf_size': 30, 'n_neighbors': 5, 'weights': 'distance'}
Best Estimator: KNeighborsClassifier(weights='distance')
ROC AUC: 1.0


Here I select a suitable solver for Logistic Regression.

In [32]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    "solver": ["newton-cg", "lbfgs", "liblinear"],
}

logisticRegression = LogisticRegression(max_iter=10000)

grid_search = GridSearchCV(
    estimator=logisticRegression,
    param_grid=param_grid,
    cv=2,
    n_jobs=-1,
    scoring='roc_auc'
    )

grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print("Best Estimator:", grid_search.best_estimator_)
print("ROC AUC:", grid_search.score(X_test, y_test))  # score() uses scoring='roc_auc'

logisticRegression = grid_search.best_estimator_



Best Parameters: {'solver': 'newton-cg'}
Best Estimator: LogisticRegression(max_iter=10000, solver='newton-cg')
ROC AUC: 0.875


### Stacking the models together

In [33]:
from sklearn.linear_model import LogisticRegression

pred1 = rfClassifier.predict_proba(X_train)[:, 1]
pred2 = knClassifier.predict_proba(X_train)[:, 1]
pred3 = logisticRegression.predict_proba(X_train)[:, 1]

meta_X = np.column_stack([pred1, pred2, pred3])

meta_model = LogisticRegression()
meta_model.fit(meta_X, y_train)

def predict(samples):
    pred1 = rfClassifier.predict_proba(samples)[:, 1]
    pred2 = knClassifier.predict_proba(samples)[:, 1]
    pred3 = logisticRegression.predict_proba(samples)[:, 1]
    meta_X = np.column_stack([pred1, pred2, pred3])
    return meta_model.predict(meta_X)
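
The meta-model above is trained on predictions that the base models make for the very rows they were fitted on, which tends to make those predictions look more reliable than they are. A common alternative is to train the meta-model on out-of-fold predictions. The sketch below illustrates that idea on synthetic data; the data, the fixed base-model settings, the 5-fold setting and the names `base_models`, `oof`, `meta` and `predict_stacked` are assumptions, not the configuration found by the grid searches above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_a, X_b, y_a, y_b = train_test_split(X_demo, y_demo, stratify=y_demo, random_state=0)

base_models = [
    RandomForestClassifier(n_estimators=100, random_state=0),
    KNeighborsClassifier(n_neighbors=5, weights="distance"),
    LogisticRegression(max_iter=10000),
]

# Out-of-fold probabilities: each training row is scored by a model that did not see it
oof = np.column_stack([
    cross_val_predict(m, X_a, y_a, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

meta = LogisticRegression()
meta.fit(oof, y_a)

# Refit the base models on the full training split for use at prediction time
for m in base_models:
    m.fit(X_a, y_a)

def predict_stacked(samples):
    """Stack base-model probabilities and let the meta-model decide."""
    features = np.column_stack([m.predict_proba(samples)[:, 1] for m in base_models])
    return meta.predict(features)

print("Held-out accuracy:", (predict_stacked(X_b) == y_b).mean())
```

scikit-learn also ships `sklearn.ensemble.StackingClassifier`, which handles the out-of-fold step internally and may be a simpler choice than a hand-written version.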


In [34]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Evaluate the stacked model on the held-out test split

predictions = predict(X_test)
actual = y_test


accuracy = accuracy_score(actual, predictions)
precision = precision_score(actual, predictions)
recall = recall_score(actual, predictions)
f1 = f1_score(actual, predictions)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)

Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1-score: 1.0
