# Start!

## Problem:
We are dealing with a set of numerical data and a binary classification problem.

The feature descriptions suggest that the features are correlated with one another to some degree, for example the width of the heart or lungs and their area.

### My initial intuition:
1. Run PCA on the features to reduce noise in the data
2. Apply k-NN classification with cross-validation over the hyperparameter k to avoid underfitting or overfitting
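The two steps above can be sketched as a single scikit-learn pipeline. This is only an illustration under stated assumptions: the data come from `make_classification` rather than from the real file, and the names `X_demo`, `y_demo`, `knn_pipeline`, `k_values`, and `knn_search`, as well as the choice of 5 components and 5 folds, are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption): 200 samples, 12 features,
# several of them redundant to mimic correlated measurements.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=12, n_informative=4, n_redundant=6, random_state=0
)

# Scaling and PCA live inside the pipeline, so they are refit on each
# training fold and never see the validation fold.
knn_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5)),
    ("knn", KNeighborsClassifier()),
])

# Candidate values of k to cross-validate over.
k_values = [1, 3, 5, 7, 9, 11, 15]
knn_search = GridSearchCV(
    knn_pipeline,
    param_grid={"knn__n_neighbors": k_values},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
)
knn_search.fit(X_demo, y_demo)

print("Best k:", knn_search.best_params_["knn__n_neighbors"])
print("Mean CV accuracy:", round(knn_search.best_score_, 3))
```

Keeping the scaler and PCA inside the pipeline is a design choice that avoids fitting them on data later used for validation.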



In [25]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler


In [26]:
# Load DataFrame from CSV
df = pd.read_csv("task_data.csv", sep=",")

# Fix commas in floats (if necessary)
for col in df.columns:
    if df[col].dtype == object:
        df[col] = df[col].str.replace(",", ".")

# Convert columns except 'ID' and label to float
cols_to_float = [col for col in df.columns if col not in ['ID', 'Cardiomegaly']]
df[cols_to_float] = df[cols_to_float].astype(float)

# Encode target labels if necessary
le = LabelEncoder()
y = le.fit_transform(df["Cardiomegaly"])
X = df.drop(columns=["ID", "Cardiomegaly"]).values
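If the decimal commas come from a locale-specific export, pandas may be able to parse them while reading instead of in a separate loop. The sketch below assumes a semicolon-delimited file, which the source does not state; the column names other than `ID` and `Cardiomegaly`, all values, and the names `raw_csv`, `demo_df`, `X_small`, and `y_small` are made up for illustration.

```python
import io

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature of the file (assumption): semicolon-delimited,
# decimal commas, made-up feature columns and values.
raw_csv = (
    "ID;Heart width;Lung width;Cardiomegaly\n"
    "1;12,5;30,1;1\n"
    "2;10,2;31,4;0\n"
    "3;13,0;29,8;1\n"
)

# decimal="," makes pandas read "12,5" as the float 12.5.
demo_df = pd.read_csv(io.StringIO(raw_csv), sep=";", decimal=",")

# Same split into features and label as in the cell above.
feature_cols = [c for c in demo_df.columns if c not in ("ID", "Cardiomegaly")]
X_small = demo_df[feature_cols].to_numpy(dtype=float)
y_small = LabelEncoder().fit_transform(demo_df["Cardiomegaly"])

print(demo_df.dtypes)
print(X_small)
print(y_small)
```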


In [27]:
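# Project the features onto the first 12 principal components.
# Note: the features are not standardized here, and X_reduced is not used
# by the grid search below, which is fit on X.
pca = PCA(n_components=12)
X_reduced = pca.fit_transform(X)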
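PCA is sensitive to feature scale, and the number of components can be chosen from the explained variance rather than fixed in advance. The following sketch uses synthetic data (a "width" column, an "area" column derived from it, and an unrelated large-scale column); all names and numbers in it are assumptions, not values from the dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in (assumption): "area" depends on "width", and "other"
# is unrelated but lives on a much larger scale.
width = rng.normal(10.0, 2.0, size=100)
area = width ** 2 + rng.normal(0.0, 5.0, size=100)
other = rng.normal(0.0, 1000.0, size=100)
X_toy = np.column_stack([width, area, other])

# Standardize first; otherwise the large-scale column dominates the components.
X_scaled = StandardScaler().fit_transform(X_toy)

# A float n_components keeps the fewest components whose cumulative
# explained variance exceeds that fraction.
pca_demo = PCA(n_components=0.95, svd_solver="full")
X_toy_reduced = pca_demo.fit_transform(X_scaled)

print("Components kept:", pca_demo.n_components_)
print("Explained variance ratio:", pca_demo.explained_variance_ratio_)
```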


In [28]:
from sklearn.model_selection import GridSearchCV

# 3 * 6 * 3 * 3 * 3 * 2 = 972 combinations; with cv=2 that is 1,944 fits.
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 50, 100, 200],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None],
    'bootstrap': [True, False]
}


# No random_state is set, so the split and the forest differ on every run.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = RandomForestClassifier()

grid_search = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    cv=2,
    n_jobs=-1,
    scoring='accuracy',
)

grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print("Best Estimator:", grid_search.best_estimator_)
print("Accuracy:", grid_search.score(X_test, y_test))


Best Parameters: {'bootstrap': True, 'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 200}
Best Estimator: RandomForestClassifier(min_samples_leaf=2, n_estimators=200)
Accuracy: 0.75
Best Parameters: {'bootstrap': False, 'max_depth': 200, 'max_features': 'log2', 'min_samples_leaf': 2, 'min_samples_split': 10, 'n_estimators': 100}
Best Estimator: RandomForestClassifier(bootstrap=False, max_depth=200, max_features='log2',
                       min_samples_leaf=2, min_samples_split=10)
Accuracy: 0.75
Best Parameters: {'bootstrap': True, 'max_depth': 10, 'max_features': None, 'min_samples_leaf': 4, 'min_samples_split': 5, 'n_estimators': 100}
Best Estimator: RandomForestClassifier(max_depth=10, max_features=None, min_samples_leaf=4,
                       min_samples_split=5)
Accuracy: 0.625
Best Parameters: {'bootstrap': True, 'max_depth': None, 'max_features': 'log2', 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 

KeyboardInterrupt: 
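
The grid in the cell above has 3 × 6 × 3 × 3 × 3 × 2 = 972 combinations, so with `cv=2` each run fits 1,944 forests before the final refit, which may be why the last run was interrupted. The unseeded split and model may also explain why the reported accuracy changes between runs. One possible alternative, sketched here on synthetic data, is a seeded, stratified split combined with `RandomizedSearchCV`; the reduced grid, `n_iter=10`, the 5 folds, and the names `X_syn`, `y_syn`, and `search` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (
    RandomizedSearchCV,
    StratifiedKFold,
    train_test_split,
)

# Synthetic stand-in for the real data (assumption).
X_syn, y_syn = make_classification(n_samples=120, n_features=12, random_state=0)

# Stratified, seeded split: class proportions are preserved and reruns match.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=0
)

param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2", None],
    "bootstrap": [True, False],
}

# Sample 10 of the 486 combinations instead of fitting all of them.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
    random_state=0,
)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Mean CV accuracy:", round(search.best_score_, 3))
print("Held-out accuracy:", search.score(X_te, y_te))
```

Random sampling trades exhaustive coverage for a fixed, predictable number of fits; whether that trade is acceptable depends on the size of the real dataset.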