# Start!

## Problem:
We are dealing with a set of numerical features and a binary classification problem.

The feature descriptions suggest that some features are related to one another, for example the width of the heart or lungs and their surface area.
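
A quick way to check this kind of relationship is a correlation matrix. The sketch below uses synthetic data, and the column names (`heart_width`, `heart_area`, `unrelated`) are hypothetical placeholders, not columns from `task_data.csv`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins: the area grows roughly with the square of the width.
heart_width = rng.normal(loc=12.0, scale=1.5, size=200)
heart_area = 0.8 * heart_width**2 + rng.normal(scale=5.0, size=200)
unrelated = rng.normal(size=200)

demo = pd.DataFrame(
    {"heart_width": heart_width, "heart_area": heart_area, "unrelated": unrelated}
)

# Pearson correlation between every pair of columns.
corr = demo.corr()
print(corr.round(2))
```

Strongly correlated pairs are the ones a dimensionality reduction step such as PCA would compress.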

### My initial intuition:
1. Run PCA on the features to reduce noise
2. Use k-NN classification with cross-validation over the hyperparameter k to avoid under- or overfitting
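
The two steps above can be sketched as a single scikit-learn pipeline. This is a minimal illustration on synthetic data from `make_classification`, not the notebook's dataset; the 95% explained-variance threshold and the range of `k` values are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with deliberately redundant (correlated) features.
X_demo, y_demo = make_classification(
    n_samples=120, n_features=12, n_informative=4, n_redundant=6, random_state=0
)

pipe = Pipeline([
    ("scale", StandardScaler()),      # PCA is sensitive to feature scale
    ("pca", PCA(n_components=0.95)),  # keep components explaining 95% of variance
    ("knn", KNeighborsClassifier()),
])

# Cross-validate over k; odd values avoid voting ties in binary problems.
search = GridSearchCV(
    pipe,
    param_grid={"knn__n_neighbors": list(range(1, 16, 2))},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print("best k:", search.best_params_["knn__n_neighbors"])
print("components kept:", search.best_estimator_.named_steps["pca"].n_components_)
print("mean CV accuracy:", round(search.best_score_, 3))
```

Keeping the scaler and PCA inside the pipeline means they are refit on the training part of every fold, so the validation folds never influence the preprocessing.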



In [224]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler


In [225]:
# Load DataFrame from CSV
df = pd.read_csv("task_data.csv", sep=",")

# Fix commas in floats (if necessary)
for col in df.columns:
    if df[col].dtype == object:
        df[col] = df[col].str.replace(",", ".")

# Convert columns except 'ID' and label to float
cols_to_float = [col for col in df.columns if col not in ['ID', 'Cardiomegaly']]
df[cols_to_float] = df[cols_to_float].astype(float)

# Encode target labels if necessary
le = LabelEncoder()
y = le.fit_transform(df["Cardiomegaly"])
X = df.drop(columns=["ID", "Cardiomegaly"]).values


In [226]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(sampling_strategy='auto')
X_resampled, y_resampled = smote.fit_resample(X, y)
print(y_resampled.shape)


(56,)




In [227]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_standardized = scaler.fit_transform(X_resampled)

# I decided not to use PCA
X_reduced = X_standardized


In [228]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200, 300, 500],
    'max_depth': [None, 10, 20, 50, 100, 200],
    'min_samples_split': [2, 5 ],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None],
}


X_train, X_test, y_train, y_test = train_test_split(X_reduced, y_resampled, test_size=0.2)
rfClassifier = RandomForestClassifier()

grid_search = GridSearchCV(
    estimator=rfClassifier,
    param_grid=param_grid,
    cv=2,
    n_jobs=-1,
    scoring='accuracy'
    )

grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print("Best Estimator:", grid_search.best_estimator_)
print("Accuracy:", grid_search.score(X_test, y_test))


Best Parameters: {'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
Best Estimator: RandomForestClassifier()
Accuracy: 0.8333333333333334
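
With only 56 samples, a 20% hold-out set contains about 12 points, so a single misclassification moves the accuracy by roughly 8 percentage points. A repeated stratified cross-validation gives a steadier estimate together with its spread. The sketch below uses synthetic data of the same size; the fold and repeat counts are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic data with the same number of rows as the resampled set.
X_demo, y_demo = make_classification(n_samples=56, n_features=10, random_state=0)

# 5 folds repeated 5 times with different shuffles gives 25 accuracy values.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X_demo,
    y_demo,
    cv=cv,
    scoring="accuracy",
)

print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f} over {len(scores)} folds")
```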


In [229]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {
    "n_neighbors": range(1, 11),
    "leaf_size": range(30, 200, 5),
    "weights": ["uniform", "distance"],
}


X_train, X_test, y_train, y_test = train_test_split(X_reduced, y_resampled)
knClassifier = KNeighborsClassifier()

grid_search = GridSearchCV(
    estimator=knClassifier,
    param_grid=param_grid,
    cv=2,
    n_jobs=-1,
    scoring='accuracy'
    )

grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print("Best Estimator:", grid_search.best_estimator_)
print("Accuracy:", grid_search.score(X_test, y_test))


Best Parameters: {'leaf_size': 30, 'n_neighbors': 4, 'weights': 'distance'}
Best Estimator: KNeighborsClassifier(n_neighbors=4, weights='distance')
Accuracy: 0.8571428571428571
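
A note on the grid above: `leaf_size` only controls how the KD-tree or ball tree is built, which affects speed and memory but not which neighbors are found. Searching over it multiplies the number of fits without changing the predictions. A small check on synthetic data (the dataset and the three `leaf_size` values are assumptions for the example):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = make_classification(n_samples=80, n_features=6, random_state=0)
X_fit, y_fit = X_demo[:60], y_demo[:60]
X_new = X_demo[60:]

predictions = []
for leaf_size in (5, 30, 195):
    # Force a tree-based search so that leaf_size is actually used.
    knn = KNeighborsClassifier(
        n_neighbors=4, weights="distance", algorithm="kd_tree", leaf_size=leaf_size
    )
    knn.fit(X_fit, y_fit)
    predictions.append(knn.predict(X_new))

same = all(np.array_equal(predictions[0], p) for p in predictions[1:])
print("identical predictions across leaf_size values:", same)
```

Dropping `leaf_size` from the grid would cut it from 680 candidates to 20.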


In [230]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    "penalty": ["l1", "l2"],
    "solver": ["newton-cg", "lbfgs", "liblinear"],
}


X_train, X_test, y_train, y_test = train_test_split(X_reduced, y_resampled)
logisticRegression = LogisticRegression(max_iter=10000)

grid_search = GridSearchCV(
    estimator=logisticRegression,
    param_grid=param_grid,
    cv=2,
    n_jobs=-1,
    scoring='accuracy'
    )

grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print("Best Estimator:", grid_search.best_estimator_)
print("Accuracy:", grid_search.score(X_test, y_test))


Best Parameters: {'penalty': 'l1', 'solver': 'liblinear'}
Best Estimator: LogisticRegression(max_iter=10000, penalty='l1', solver='liblinear')
Accuracy: 0.8571428571428571
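
The failed fits reported by this cell come from invalid combinations in the grid: the `newton-cg` and `lbfgs` solvers do not support the `l1` penalty, so 2 of the 6 candidates fail on both folds. Passing `param_grid` as a list of dictionaries restricts the search to valid pairs. A minimal sketch on synthetic data, with `error_score="raise"` so that any remaining failure would surface immediately:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_classification(n_samples=60, n_features=8, random_state=0)

# Each dictionary is expanded separately, so only valid pairs are tried.
param_grid = [
    {"penalty": ["l2"], "solver": ["newton-cg", "lbfgs", "liblinear"]},
    {"penalty": ["l1"], "solver": ["liblinear"]},
]

search = GridSearchCV(
    LogisticRegression(max_iter=10000),
    param_grid=param_grid,
    cv=2,
    scoring="accuracy",
    error_score="raise",
)
search.fit(X_demo, y_demo)

print(len(search.cv_results_["params"]), "candidates, no failed fits")
print("Best Parameters:", search.best_params_)
```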


4 fits failed out of a total of 12.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
2 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/grzegorzprywatny/Library/Python/3.9/lib/python/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/grzegorzprywatny/Library/Python/3.9/lib/python/site-packages/sklearn/base.py", line 1389, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/Users/grzegorzprywatny/Library/Python/3.9/lib/python/site-packages/sklearn/linear_model/_logistic.py", line 1193, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/Users/grzegorzprywatny/