# Start!

## Problem:
We are dealing with a set of numerical features and a binary classification problem.

The feature descriptions suggest that the features are correlated to some degree, for example the width of the heart or lungs with their surface area.
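
One quick way to check this kind of relationship is a correlation matrix. The sketch below uses synthetic data, and the column names `heart_width` and `heart_area` are hypothetical stand-ins rather than columns from the task file; it only illustrates how a width and an area derived from it end up strongly correlated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical measurements: a width, and an area that grows roughly
# with the square of that width plus some noise.
width = rng.normal(loc=12.0, scale=1.5, size=100)
area = 0.8 * width**2 + rng.normal(scale=5.0, size=100)

demo = pd.DataFrame({"heart_width": width, "heart_area": area})

# Pearson correlation between every pair of columns
corr = demo.corr()
print(corr.round(2))
```

On the real data, `df[cols_to_float].corr()` would give the same kind of table once the columns have been converted to floats.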

### My initial intuition:
1. Run PCA on the features to reduce noise in the data
2. Apply k-NN classification with cross-validation over the hyperparameter k to avoid underfitting or overfitting
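
The two steps above can be sketched as a single scikit-learn pipeline searched with `GridSearchCV`. This is a minimal sketch under assumptions: the data is synthetic (`make_classification`), the `StandardScaler` step is my addition and not part of the original plan, and the range of k values and `n_components=5` are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix (assumption, not the task data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=12, n_informative=4, n_redundant=6, random_state=42
)

# Scaling and PCA live inside the pipeline, so they are refit on the
# training part of every cross-validation fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5)),
    ("knn", KNeighborsClassifier()),
])

# Candidate values for k (illustrative range)
param_grid = {"knn__n_neighbors": list(range(1, 16))}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

best_k = search.best_params_["knn__n_neighbors"]
print("Best k:", best_k)
print("Mean CV accuracy:", round(search.best_score_, 3))
```

Keeping PCA inside the pipeline means the held-out fold never influences the projection, which is the main reason to prefer this over calling `fit_transform` on the full matrix first.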

In [316]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


In [317]:

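# Baseline: k-NN on the raw features (no PCA is applied in this cell).
# X and y come from the data-loading cell (In [318]), which must be run first.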

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = KNeighborsClassifier(n_neighbors=6)
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print("Test accuracy:", score)


Test accuracy: 0.9


In [318]:

# Load DataFrame from CSV
df = pd.read_csv("task_data.csv", sep=",")

# Fix commas in floats (if necessary)
for col in df.columns:
    if df[col].dtype == object:
        df[col] = df[col].str.replace(",", ".")

# Convert columns except 'ID' and label to float
cols_to_float = [col for col in df.columns if col not in ['ID', 'Cardiomegaly']]
df[cols_to_float] = df[cols_to_float].astype(float)

# Encode target labels if necessary
le = LabelEncoder()
y = le.fit_transform(df["Cardiomegaly"])
X = df.drop(columns=["ID", "Cardiomegaly"]).values


In [319]:
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)
print(X_reduced)

[[-6.41271862e+03 -9.41817531e+03 -4.71082910e+02  3.80690727e+01
  -3.53273424e+01]
 [ 1.28624633e+04 -4.98501129e+03  5.66605498e+02 -1.03543550e+02
  -5.56022513e+01]
 [-1.48134780e+04 -1.36513316e+02  1.01354777e+03 -1.03622656e+03
   1.14141325e+02]
 [ 1.39570895e+03  7.66910960e+03  6.98583785e+02 -4.56792884e+01
  -1.99019637e+02]
 [ 4.19118234e+03  8.78995668e+02  4.77458343e+02 -6.97457243e+02
   1.74172755e+01]
 [ 1.17516475e+04  1.18504448e+04  2.28669891e+03  2.09386779e+02
   4.68838118e+01]
 [ 3.50473095e+04 -2.41862945e+02 -1.25478083e+03  3.37594562e+02
  -1.32111064e+01]
 [ 1.58716446e+04  1.72606078e+03 -1.52773862e+03 -1.17499604e+02
   1.44993255e+02]
 [ 1.79150924e+04 -7.17329469e+03 -8.56464516e+02 -1.11519135e+02
  -8.48333073e+00]
 [ 2.40944159e+04  3.05238093e+03  4.06445976e+02 -7.06180438e+01
   2.59739540e+02]
 [ 1.60487994e+03 -1.03806325e+03  1.19178444e+02 -5.16121687e+02
  -3.22935162e+01]
 [ 3.52573830e+03 -2.94368941e+03 -3.30695915e+02 -2.25553299e+02
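
The first component above is on the order of 1e4 while the fifth is on the order of 1e1, which may mean that a few large-scale features dominate the projection. A hedged way to check this is to compare `explained_variance_ratio_` with and without standardization. The sketch below uses synthetic data with one deliberately inflated column; it is not the task data, and `StandardScaler` is an addition that the original cells do not use.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: 6 independent features, one on a much larger scale
# (for example an area in pixels next to ratios close to 1).
X_demo = rng.normal(size=(100, 6))
X_demo[:, 0] *= 10_000

# PCA on the raw matrix: the large-scale column dominates the variance
raw_ratio = PCA(n_components=5).fit(X_demo).explained_variance_ratio_

# PCA after standardizing every column to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X_demo)
scaled_ratio = PCA(n_components=5).fit(X_scaled).explained_variance_ratio_

print("Raw:   ", raw_ratio.round(3))
print("Scaled:", scaled_ratio.round(3))
```

For the notebook's own `pca` object, `pca.explained_variance_ratio_` shows how much variance each of the five components retains.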

In [320]:
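# Note: the split uses the raw feature matrix X, so the PCA projection
# X_reduced from the previous cell is not used by this classifier.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)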

clf = KNeighborsClassifier(n_neighbors=6)
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print("Test accuracy:", score)


Test accuracy: 0.9
