**1. Classifiers:** Models that assign input data to predefined categories or classes.
Output: a discrete class (e.g. spam/not spam, disease/no disease).

**2. Regressors:** Models that predict a continuous numeric value from the input data.
Output: a real number (e.g. house price, temperature).

**3. Random Forest:** An ensemble of multiple decision trees. Each tree makes its own prediction, and the final decision is made by voting (the most common class). It supports both classification (RandomForestClassifier) and regression (RandomForestRegressor), where the tree predictions are averaged instead of voted on.
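
The aggregation step of a forest can be sketched in plain Python. This is only an illustration of voting and averaging: the function names and the per-tree predictions are hypothetical, and scikit-learn's RandomForestClassifier averages the trees' predicted class probabilities rather than counting hard votes.

```python
from collections import Counter
from statistics import mean


def aggregate_votes(tree_predictions):
    """Return the class predicted by the most trees (classification)."""
    return Counter(tree_predictions).most_common(1)[0][0]


def aggregate_mean(tree_predictions):
    """Return the average of the per-tree predictions (regression)."""
    return mean(tree_predictions)


# Hypothetical outputs of five trees for a single sample.
print(aggregate_votes(["spam", "not spam", "spam", "spam", "not spam"]))
print(aggregate_mean([210.0, 190.0, 200.0, 205.0, 195.0]))
```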

### HEART DISEASE DATASET - KNeighborsClassifier

Task:
1. Loading a real dataset: Download the Heart Disease dataset in CSV format: 
            https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data.

2. Vectorizing categorical data: encode the target (the attribute that indicates the presence of disease) with LabelEncoder. Convert the categorical columns (cp, thal, slope) to dummy variables with pd.get_dummies.

3. Imputation: Missing values in the loaded data are marked with ?; replace them with NaN. Use the 'mean' strategy for numeric columns and 'most_frequent' for categorical columns.

4. Outlier removal: Apply the 3-standard-deviation rule to the chol (cholesterol) column.

5. Scaling: Apply StandardScaler or MinMaxScaler to the numeric columns.

6. Evaluation: Split the data into training and test sets in an 80%/20% ratio. Train a KNeighborsClassifier. Print the metrics accuracy, precision and recall, first for the raw data (no scaling or dummies) and then after full preprocessing.

In [11]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, recall_score, precision_score

In [2]:
# unlike the seaborn datasets, this file has no header row, so all of its column names must be listed explicitly
column_names = ['age','sex','cp','trestbps','chol','fbs','restecg',
                'thakach','exang','oldpeak','slope','ca','thal','num'] 

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
# load the dataset from the URL and assign the column names; na_values lists the marker to be read as NaN, skipinitialspace drops the space after each comma
df = pd.read_csv(url, names=column_names, na_values=' ?',skipinitialspace=True)
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thakach,exang,oldpeak,slope,ca,thal,num
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0


In [3]:
# first, the model is trained on the unprocessed data

# X holds every column except the target, which here is num (disease severity)
X_raw = df.drop('num', axis=1).copy()

# num takes the values 0,1,2,3,4 (0 = no disease, 1-4 = increasing severity);
# LabelEncoder maps them to integer codes and keeps all five classes, it does not merge 1-4 into a single class
Y_raw = LabelEncoder().fit_transform(df['num']) # contains only the target column, num

In [4]:
# split the data into training and test sets, 80/20
X_train_raw, X_test_raw, Y_train_raw, Y_test_raw = train_test_split(X_raw, Y_raw, test_size=0.2, random_state=42)

# separate the categorical and numeric columns to simplify later processing
cat_cols = X_train_raw.select_dtypes(include=['object']).columns
num_cols = X_train_raw.select_dtypes(include=['int64', 'float64']).columns

In [5]:
# the data is not fully preprocessed yet, but NaN values still have to be filled:
# numeric columns get the column mean (strategy='mean'),
# categorical columns get the placeholder 'Missing' and are then label-encoded


def prepare_data(df):
    df_copy = df.copy()
    num_imputer = SimpleImputer(strategy='mean')
    df_copy[num_cols] = num_imputer.fit_transform(df_copy[num_cols])

    for col in cat_cols:
        df_copy[col] = df_copy[col].fillna('Missing')
        df_copy[col] = LabelEncoder().fit_transform(df_copy[col])

    return df_copy

X_train_raw_prep = prepare_data(X_train_raw)
X_test_raw_prep = prepare_data(X_test_raw)
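
Because `prepare_data` fits a new imputer and new encoders on every call, the test split is filled with its own statistics rather than those of the training split. A common alternative is to learn the fill value on the training data and reuse it on the test data. The sketch below shows that idea in plain Python; the helper names and the cholesterol readings are hypothetical. With scikit-learn the same pattern is `fit_transform` on the training split followed by `transform` on the test split.

```python
from statistics import mean


def fit_mean(values):
    """Learn the fill value from the observed (non-missing) training values."""
    observed = [v for v in values if v is not None]
    return mean(observed)


def apply_fill(values, fill_value):
    """Replace missing entries with a previously learned fill value."""
    return [fill_value if v is None else v for v in values]


# Hypothetical cholesterol readings; None marks a missing value.
chol_train = [200.0, None, 240.0, 220.0]
chol_test = [None, 300.0]

train_mean = fit_mean(chol_train)                      # learned on the training split only
chol_train_filled = apply_fill(chol_train, train_mean)
chol_test_filled = apply_fill(chol_test, train_mean)   # reused, not re-fitted
print(train_mean, chol_test_filled)
```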

In [6]:
# train the classifier; each prediction uses the 5 nearest neighbours
# the model is always trained on X_train and Y_train
# predictions are made on X_test, and model quality is measured by comparing Y_pred with Y_test

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_raw_prep, Y_train_raw)

Y_pred_raw = knn.predict(X_test_raw_prep)

# average='macro' - without it the functions assume binary labels; macro averaging supports multiclass targets by averaging the per-class scores
# zero_division=0 - if the model never predicts some class, the precision formula divides by zero;
#                   this argument sets that score to 0 and suppresses the warning

print("Raw data:\n")
print(f'Accuracy Score: {accuracy_score(Y_test_raw, Y_pred_raw):.2f}')
print(f"Precision Score: {precision_score(Y_test_raw, Y_pred_raw, average='macro', zero_division=0):.2f}")
print(f"Recall Score: {recall_score(Y_test_raw, Y_pred_raw, average='macro'):.2f}")

Raw data:

Accuracy Score: 0.43
Precision Score: 0.11
Recall Score: 0.18
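
To make the `average='macro'` and `zero_division=0` arguments concrete, here is a minimal plain-Python sketch of macro-averaged precision and recall. It illustrates the idea only and is not scikit-learn's implementation; the function name and the labels are hypothetical.

```python
def macro_precision_recall(y_true, y_pred, zero_division=0.0):
    """Unweighted mean of the per-class precision and recall."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        predicted = sum(1 for p in y_pred if p == c)   # TP + FP
        actual = sum(1 for t in y_true if t == c)      # TP + FN
        precisions.append(tp / predicted if predicted else zero_division)
        recalls.append(tp / actual if actual else zero_division)
    return sum(precisions) / len(classes), sum(recalls) / len(classes)


# Hypothetical labels: class 2 is never predicted, so its precision would be 0/0.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 0, 1]
precision, recall = macro_precision_recall(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Every class counts equally in the macro average, so a class the model never predicts pulls both scores down regardless of how rare it is.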


In [7]:
# move on to preprocessing; work on a copy of the df dataset
df_full = df.copy()
df_full.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thakach,exang,oldpeak,slope,ca,thal,num
0,63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
1,67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2
2,67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1
3,37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0
4,41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0


In [8]:
# numeric columns are filled with the mean, categorical columns with the most frequent value
num_imputer = SimpleImputer(strategy='mean')
df_full[num_cols] = num_imputer.fit_transform(df_full[num_cols])

cat_imputer = SimpleImputer(strategy='most_frequent')
df_full[cat_cols] = cat_imputer.fit_transform(df_full[cat_cols])

In [None]:
# remove outliers based on the standard deviation of the 'chol' column

chol = df_full['chol']
mask = np.abs(chol - chol.mean()) < 3*chol.std() # True for values within 3 standard deviations of the mean
df_full = df_full[mask] # keep only the rows where mask is True; the rest (outliers) are dropped

# look up the categorical columns again, since after preprocessing they may differ from the original ones
cat_cols = df_full.select_dtypes(include=['object']).columns 
# one-hot encoding: each categorical column becomes several binary columns (OneHotEncoder is used for this later on)
df_full = pd.get_dummies(df_full, columns=cat_cols, drop_first=True)

# scale the numeric columns so that each has mean 0 and standard deviation 1
scaler = StandardScaler()
df_full[num_cols] = scaler.fit_transform(df_full[num_cols])
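
The 3-standard-deviation mask can be reproduced without pandas. The sketch below is a plain-Python illustration with hypothetical cholesterol values; `statistics.stdev` is the sample standard deviation, which matches the default of pandas `.std()`.

```python
from statistics import mean, stdev


def three_sigma_mask(values, k=3.0):
    """True for values closer than k sample standard deviations to the mean."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (ddof=1)
    return [abs(v - mu) < k * sigma for v in values]


# Hypothetical cholesterol values: twenty typical readings and one extreme one.
chol = [200.0 + i for i in range(20)] + [600.0]
mask = three_sigma_mask(chol)
kept = [v for v, keep in zip(chol, mask) if keep]
print(len(chol), len(kept))
```

The outlier itself inflates the mean and the standard deviation, so on very small samples a single extreme value may never cross the 3-sigma threshold.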

In [10]:
X = df_full.drop('num', axis=1).copy()
Y = LabelEncoder().fit_transform(df_full['num'])

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)
knn = knn.fit(X_train, Y_train)

Y_pred = knn.predict(X_test)

print("After processing data: ")
print(f'Accuracy Score: {accuracy_score(Y_test, Y_pred): .2f}')
print(f"Precision Score: {precision_score(Y_test, Y_pred, average='macro', zero_division=0): .2f}")
print(f"Recall Score: {recall_score(Y_test, Y_pred, average='macro'): .2f}")


After processing data: 
Accuracy Score:  0.53
Precision Score:  0.22
Recall Score:  0.23
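
KNeighborsClassifier decides by distance, so a feature measured in large units can dominate the neighbour search. The following plain-Python sketch, with hypothetical patients and hypothetical helper names, shows how standardizing the columns can change which point is nearest. It illustrates the effect of scaling only and does not explain the whole difference between the two runs above.

```python
from math import dist
from statistics import mean, pstdev


def standardize(column):
    """Rescale one column to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]


def nearest_to_first(points):
    """Index of the point closest (Euclidean distance) to points[0]."""
    return min(range(1, len(points)), key=lambda i: dist(points[0], points[i]))


# Hypothetical patients described by (chol, oldpeak).
chol = [200.0, 205.0, 230.0, 400.0]
oldpeak = [0.0, 3.0, 0.0, 1.0]

raw = list(zip(chol, oldpeak))
scaled = list(zip(standardize(chol), standardize(oldpeak)))

print(nearest_to_first(raw))     # chol differences dominate the raw distance
print(nearest_to_first(scaled))  # after scaling, oldpeak carries comparable weight
```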


### PENGUINS - Decision trees

1. Loading a real dataset: Load the penguins dataset using the seaborn library: https://github.com/allisonhorst/palmerpenguins

2. Vectorizing categorical data: Identify the categorical columns (e.g. island, sex) and apply OneHotEncoder or LabelEncoder.

3. Imputation: Detect NaN values and replace them with the mean for numeric columns and with the most frequent value for categorical columns.

4. Outliers: Detect and remove outliers in the body_mass_g column using a threshold of ±3 standard deviations.

5. Scaling: Normalize the numeric columns bill_length_mm, bill_depth_mm, flipper_length_mm and body_mass_g with StandardScaler.

6. Training: Split the data into training and test sets. Train a RandomForestClassifier to predict the penguin species (the species column).

7. Evaluation: Evaluate the accuracy, precision and confusion matrix of the model on the test set. Show the classification report (classification_report).

In [46]:
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, StandardScaler
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

In [8]:
df = sns.load_dataset('penguins')
df.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female


In [29]:
cat_cols = df.select_dtypes(include=['object']).columns
num_cols = df.select_dtypes(include=['float64', 'int64']).columns
df[num_cols].isnull().sum()

bill_length_mm       2
bill_depth_mm        2
flipper_length_mm    2
body_mass_g          2
dtype: int64

In [30]:
df_encoded = df.copy()
le_encoder = LabelEncoder()
for column in cat_cols:
    df_encoded[column] = le_encoder.fit_transform(df_encoded[column])

df_encoded.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,0,2,39.1,18.7,181.0,3750.0,1
1,0,2,39.5,17.4,186.0,3800.0,0
2,0,2,40.3,18.0,195.0,3250.0,0
3,0,2,,,,,2
4,0,2,36.7,19.3,193.0,3450.0,0


In [31]:
ohe_cat_cols = ['island', 'sex']
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded_data = ohe.fit_transform(df_encoded[ohe_cat_cols])
feature_names = ohe.get_feature_names_out(ohe_cat_cols)
df_ohe = pd.DataFrame(encoded_data, columns=feature_names, index = df.index)

df_encoded = df_encoded.drop(columns=ohe_cat_cols).join(df_ohe)
df_encoded.head()

Unnamed: 0,species,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,island_0,island_1,island_2,sex_0,sex_1,sex_2
0,0,39.1,18.7,181.0,3750.0,0.0,0.0,1.0,0.0,1.0,0.0
1,0,39.5,17.4,186.0,3800.0,0.0,0.0,1.0,1.0,0.0,0.0
2,0,40.3,18.0,195.0,3250.0,0.0,0.0,1.0,1.0,0.0,0.0
3,0,,,,,0.0,0.0,1.0,0.0,0.0,1.0
4,0,36.7,19.3,193.0,3450.0,0.0,0.0,1.0,1.0,0.0,0.0
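
What one-hot encoding produces can be shown with a few lines of plain Python. This is a hedged sketch of the idea, not the OneHotEncoder implementation; the helper name is hypothetical and the categories are sorted alphabetically, which is also the order in which LabelEncoder assigned the island codes above.

```python
def one_hot(values):
    """Map each distinct category to its own 0/1 indicator column."""
    categories = sorted(set(values))
    rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return categories, rows


# Island names as they appear in the penguins dataset.
categories, rows = one_hot(["Torgersen", "Biscoe", "Dream", "Biscoe"])
print(categories)
for row in rows:
    print(row)
```

Each row contains exactly one 1.0, in the column of its own category.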


In [32]:
corr = df_encoded.corr()
corr.style.background_gradient(cmap = 'coolwarm')

Unnamed: 0,species,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,island_0,island_1,island_2,sex_0,sex_1,sex_2
species,1.0,0.731369,-0.744076,0.854307,0.750491,0.61071,-0.311589,-0.434574,-0.01024,0.010916,-0.001938
bill_length_mm,0.731369,1.0,-0.235053,0.656181,0.59511,0.239319,0.034007,-0.381728,-0.32321,0.348378,-0.079067
bill_depth_mm,-0.744076,-0.235053,1.0,-0.583851,-0.471916,-0.632285,0.456357,0.271373,-0.355333,0.368696,-0.042246
flipper_length_mm,0.854307,0.656181,-0.583851,1.0,0.871202,0.611637,-0.421252,-0.289777,-0.244215,0.251283,-0.022424
body_mass_g,0.750491,0.59511,-0.471916,0.871202,1.0,0.627352,-0.460411,-0.258979,-0.409315,0.422023,-0.040279
island_0,0.61071,0.239319,-0.632285,0.611637,0.627352,1.0,-0.733496,-0.412295,-0.006768,0.011093,-0.012299
island_1,-0.311589,0.034007,0.456357,-0.421252,-0.460411,-0.733496,1.0,-0.316818,0.01846,0.017464,-0.102037
island_2,-0.434574,-0.381728,0.271373,-0.289777,-0.258979,-0.412295,-0.316818,1.0,-0.0153,-0.038889,0.153933
sex_0,-0.01024,-0.32321,-0.355333,-0.244215,-0.409315,-0.006768,0.01846,-0.0153,1.0,-0.938024,-0.174498
sex_1,0.010916,0.348378,0.368696,0.251283,0.422023,0.011093,0.017464,-0.038889,-0.938024,1.0,-0.177571


In [36]:
num_imputer = SimpleImputer(strategy='mean')
impute = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
df_encoded[impute] = num_imputer.fit_transform(df_encoded[impute])

In [39]:
body_mass = df_encoded['body_mass_g'].copy()
mask = np.abs(body_mass - body_mass.mean()) < 3*body_mass.std()
df_encoded = df_encoded[mask]
df_encoded.head()

Unnamed: 0,species,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,island_0,island_1,island_2,sex_0,sex_1,sex_2
0,0,39.1,18.7,181.0,3750.0,0.0,0.0,1.0,0.0,1.0,0.0
1,0,39.5,17.4,186.0,3800.0,0.0,0.0,1.0,1.0,0.0,0.0
2,0,40.3,18.0,195.0,3250.0,0.0,0.0,1.0,1.0,0.0,0.0
3,0,43.92193,17.15117,200.915205,4201.754386,0.0,0.0,1.0,0.0,0.0,1.0
4,0,36.7,19.3,193.0,3450.0,0.0,0.0,1.0,1.0,0.0,0.0


In [41]:
scaler = StandardScaler()
scale = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
df_encoded[scale] = scaler.fit_transform(df_encoded[scale])
df_encoded.head()

Unnamed: 0,species,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,island_0,island_1,island_2,sex_0,sex_1,sex_2
0,0,-0.8870812,0.7877425,-1.422488,-0.565789,0.0,0.0,1.0,0.0,1.0,0.0
1,0,-0.813494,0.1265563,-1.065352,-0.503168,0.0,0.0,1.0,1.0,0.0,0.0
2,0,-0.6663195,0.4317192,-0.422507,-1.192003,0.0,0.0,1.0,1.0,0.0,0.0
3,0,-1.307172e-15,1.806927e-15,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
4,0,-1.328605,1.092905,-0.565361,-0.941517,0.0,0.0,1.0,1.0,0.0,0.0


In [47]:
X = df_encoded.drop('species', axis=1).copy()
Y = df_encoded['species'].copy()

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_train, Y_train)

Y_pred = model.predict(X_test)
report = classification_report(Y_test, Y_pred)
print(report)

              precision    recall  f1-score   support

           0       0.97      1.00      0.98        32
           1       1.00      0.94      0.97        16
           2       1.00      1.00      1.00        21

    accuracy                           0.99        69
   macro avg       0.99      0.98      0.98        69
weighted avg       0.99      0.99      0.99        69
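
Step 7 of the task also asks for a confusion matrix, while the cell prints only the classification report. The sketch below shows how such a matrix is built, using hypothetical labels and a hypothetical helper name. With scikit-learn, `sklearn.metrics.confusion_matrix(Y_test, Y_pred)` returns the same layout, with true classes as rows and predicted classes as columns.

```python
def confusion_matrix_counts(y_true, y_pred, n_classes):
    """matrix[i][j] = number of samples with true class i predicted as class j."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Hypothetical labels for three species encoded as 0, 1, 2.
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 0, 1, 2, 2, 2]

matrix = confusion_matrix_counts(y_true, y_pred, n_classes=3)
accuracy = sum(matrix[i][i] for i in range(3)) / len(y_true)  # diagonal = correct predictions
for row in matrix:
    print(row)
print(accuracy)
```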



### AUTOMOBILE DATASET

1. Loading a real dataset: Load the mpg dataset using sns.load_dataset("mpg").

2. Vectorizing categorical data: Transform columns such as origin and name with OneHotEncoder.

3. Imputation: Fill the missing values in the horsepower column with the mean. Remove columns whose correlation is above 0.95.

4. Outliers: Drop the rows whose mpg value (the target variable) lies outside 3 standard deviations.

5. Scaling: Apply StandardScaler to the numeric columns.

6. Training: Split the data into training and test sets and train a DecisionTreeRegressor without a hyperparameter search. Target column: mpg.

7. Evaluation: Evaluate the model with MSE and R² on the test set.

In [98]:
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split,ParameterGrid
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

In [99]:
df = sns.load_dataset("mpg")
df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
0,18.0,8,307.0,130.0,3504,12.0,70,usa,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,usa,buick skylark 320
2,18.0,8,318.0,150.0,3436,11.0,70,usa,plymouth satellite
3,16.0,8,304.0,150.0,3433,12.0,70,usa,amc rebel sst
4,17.0,8,302.0,140.0,3449,10.5,70,usa,ford torino


In [100]:
# check which numeric columns contain NaN values
num_cols = ['mpg', 'cylinders','displacement','horsepower','weight','acceleration']
df[num_cols].isnull().sum()

mpg             0
cylinders       0
displacement    0
horsepower      6
weight          0
acceleration    0
dtype: int64

In [101]:
# impute only the horsepower column, since it is the only incomplete one
imputer = SimpleImputer(strategy='mean')
impute = ['horsepower']
df[impute] = imputer.fit_transform(df[impute])


In [102]:
# label-encode the origin and name columns: each origin gets an index into the sorted
# list of unique values of that column, and the same applies to name
le_encode = ['origin','name']
for col in le_encode:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
0,18.0,8,307.0,130.0,3504,12.0,70,2,49
1,15.0,8,350.0,165.0,3693,11.5,70,2,36
2,18.0,8,318.0,150.0,3436,11.0,70,2,231
3,16.0,8,304.0,150.0,3433,12.0,70,2,14
4,17.0,8,302.0,140.0,3449,10.5,70,2,161


In [103]:
# compute the correlation matrix
corr = df.corr()
corr.style.background_gradient(cmap='coolwarm')

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
mpg,1.0,-0.775396,-0.804203,-0.771437,-0.831741,0.420289,0.579267,-0.482619,0.273936
cylinders,-0.775396,1.0,0.950721,0.838939,0.896017,-0.505419,-0.348746,0.551378,-0.275754
displacement,-0.804203,0.950721,1.0,0.893646,0.932824,-0.543684,-0.370164,0.591137,-0.292064
horsepower,-0.771437,0.838939,0.893646,1.0,0.860574,-0.684259,-0.411651,0.442222,-0.233042
weight,-0.831741,0.896017,0.932824,0.860574,1.0,-0.417457,-0.306564,0.521088,-0.255247
acceleration,0.420289,-0.505419,-0.543684,-0.684259,-0.417457,1.0,0.288137,-0.257365,0.128285
model_year,0.579267,-0.348746,-0.370164,-0.411651,-0.306564,0.288137,1.0,-0.075409,0.074761
origin,-0.482619,0.551378,0.591137,0.442222,0.521088,-0.257365,-0.075409,1.0,-0.437807
name,0.273936,-0.275754,-0.292064,-0.233042,-0.255247,0.128285,0.074761,-0.437807,1.0


In [104]:
# drop one column from every pair whose correlation exceeds 0.95 (here cylinders/displacement at 0.950721)
df = df.drop('displacement', axis=1)
df.head()

Unnamed: 0,mpg,cylinders,horsepower,weight,acceleration,model_year,origin,name
0,18.0,8,130.0,3504,12.0,70,2,49
1,15.0,8,165.0,3693,11.5,70,2,36
2,18.0,8,150.0,3436,11.0,70,2,231
3,16.0,8,150.0,3433,12.0,70,2,14
4,17.0,8,140.0,3449,10.5,70,2,161
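
Instead of reading the pair off the correlation table by eye, the pairs above a threshold can be listed programmatically. The sketch below is a plain-Python illustration on hypothetical toy columns (not the real mpg data); `pearson` and `correlated_pairs` are hypothetical helper names.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.95):
    """Column-name pairs whose absolute correlation exceeds the threshold."""
    return [
        (a, b)
        for a, b in combinations(columns, 2)
        if abs(pearson(columns[a], columns[b])) > threshold
    ]


# Hypothetical toy columns, not the real mpg data.
columns = {
    "cylinders": [4, 4, 6, 8, 8],
    "displacement": [100, 110, 200, 300, 310],
    "acceleration": [15, 20, 14, 19, 16],
}
print(correlated_pairs(columns))
```

For each reported pair only one of the two columns needs to be dropped, as was done with displacement above.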


In [105]:
# now one-hot encode the categorical columns so that each unique value gets its own column
ohe_cols = ['origin', 'name']
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded_data = ohe.fit_transform(df[ohe_cols])
feature_names = ohe.get_feature_names_out(ohe_cols)
df_ohe = pd.DataFrame(encoded_data, columns=feature_names, index=df.index)
# drop the original encoded columns from the dataset and add their expanded indicator columns
df = df.drop(ohe_cols, axis=1).join(df_ohe)
df.head()

Unnamed: 0,mpg,cylinders,horsepower,weight,acceleration,model_year,origin_0,origin_1,origin_2,name_0,...,name_295,name_296,name_297,name_298,name_299,name_300,name_301,name_302,name_303,name_304
0,18.0,8,130.0,3504,12.0,70,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,15.0,8,165.0,3693,11.5,70,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,18.0,8,150.0,3436,11.0,70,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,16.0,8,150.0,3433,12.0,70,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,17.0,8,140.0,3449,10.5,70,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [106]:
# remove outliers: keep rows whose mpg lies within 3 standard deviations of the mean
mpg = df['mpg'].copy()
mask = np.abs(mpg-mpg.mean()) < 3*mpg.std()
df = df[mask]

In [107]:
# scale the numeric columns, all except the target column (mpg)
num_cols = ['cylinders', 'horsepower','weight','acceleration','model_year']
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])
df.head()

Unnamed: 0,mpg,cylinders,horsepower,weight,acceleration,model_year,origin_0,origin_1,origin_2,name_0,...,name_295,name_296,name_297,name_298,name_299,name_300,name_301,name_302,name_303,name_304
0,18.0,1.498191,0.669196,0.63087,-1.295498,-1.627426,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,15.0,1.498191,1.586599,0.854333,-1.477038,-1.627426,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,18.0,1.498191,1.193426,0.55047,-1.658577,-1.627426,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,16.0,1.498191,1.193426,0.546923,-1.295498,-1.627426,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,17.0,1.498191,0.931311,0.565841,-1.840117,-1.627426,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [108]:
# the data is split into Y (the target column only) and X (all remaining columns),
# then into train, validation and test sets; the model is an AdaBoostRegressor built on a DecisionTreeRegressor
# hyperparameters are chosen by evaluating several settings and keeping the one with the best validation score
# the model is then retrained with that combination and evaluated on the test set
X = df.drop('mpg', axis=1).copy()
Y = df['mpg'].copy()

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)
X_val, X_test, Y_val, Y_test = train_test_split(X_test, Y_test, test_size=0.5, random_state=42)

param_grid = {
    'max_depth': [2, 3, 5, 7, None],              # tree depth
    'min_samples_leaf': [1, 2, 5, 10],            # minimum number of samples per leaf
    'n_estimators': [50, 100, 200],               # number of weak learners
    'learning_rate': [0.01, 0.1, 0.5, 1.0]        # learning rate
}


grid = ParameterGrid(param_grid)

best_score = -np.inf
best_params = None

for params in grid:
    tree = DecisionTreeRegressor(max_depth=params['max_depth'],min_samples_leaf=params['min_samples_leaf'])
    
    model = AdaBoostRegressor(
        tree,
        n_estimators=params['n_estimators'],
        learning_rate=params['learning_rate'],
        random_state=42)

    model.fit(X_train, Y_train)
    Y_pred = model.predict(X_val)
    score = r2_score(Y_val, Y_pred)  # hyperparameters are compared on the validation set
    print(f"{params} → R²: {score:.4f}")
    if score > best_score:
        best_score = score
        best_params = params


tree = DecisionTreeRegressor(
        max_depth=best_params['max_depth'],
        min_samples_leaf=best_params['min_samples_leaf']
    )
    
model = AdaBoostRegressor(
        tree,
        n_estimators=best_params['n_estimators'],
        learning_rate=best_params['learning_rate'],
        random_state=42
    )
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)

{'learning_rate': 0.01, 'max_depth': 2, 'min_samples_leaf': 1, 'n_estimators': 50} → R²: 0.7743
{'learning_rate': 0.01, 'max_depth': 2, 'min_samples_leaf': 1, 'n_estimators': 100} → R²: 0.7936
{'learning_rate': 0.01, 'max_depth': 2, 'min_samples_leaf': 1, 'n_estimators': 200} → R²: 0.8062
{'learning_rate': 0.01, 'max_depth': 2, 'min_samples_leaf': 2, 'n_estimators': 50} → R²: 0.7743
{'learning_rate': 0.01, 'max_depth': 2, 'min_samples_leaf': 2, 'n_estimators': 100} → R²: 0.7936
{'learning_rate': 0.01, 'max_depth': 2, 'min_samples_leaf': 2, 'n_estimators': 200} → R²: 0.8062
{'learning_rate': 0.01, 'max_depth': 2, 'min_samples_leaf': 5, 'n_estimators': 50} → R²: 0.7743
{'learning_rate': 0.01, 'max_depth': 2, 'min_samples_leaf': 5, 'n_estimators': 100} → R²: 0.7936
{'learning_rate': 0.01, 'max_depth': 2, 'min_samples_leaf': 5, 'n_estimators': 200} → R²: 0.8065
{'learning_rate': 0.01, 'max_depth': 2, 'min_samples_leaf': 10, 'n_estimators': 50} → R²: 0.7710
{'learning_rate': 0.01, 'max_dept
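
The grid search above tries every combination of one value per hyperparameter. As a rough illustration of what that expansion amounts to, the same combinations can be generated with `itertools.product`. This is a sketch, not the ParameterGrid implementation, and the order of the combinations may differ from the order in which ParameterGrid yields them.

```python
from itertools import product

param_grid = {
    "max_depth": [2, 3, 5, 7, None],
    "min_samples_leaf": [1, 2, 5, 10],
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.01, 0.1, 0.5, 1.0],
}

# Every combination of one value per parameter, as a list of dicts.
names = list(param_grid)
combos = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(combos))  # 5 * 4 * 3 * 4 combinations
print(combos[0])
```

Each of these combinations means one full AdaBoost fit, so the size of the grid directly determines how long the search runs.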

In [109]:
print("MSE: ", mean_squared_error(Y_test,Y_pred))
print("R2S: ", r2_score(Y_test,Y_pred))

MSE:  4.963078188775509
R2S:  0.9244583984036182
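
For reference, both metrics can be computed by hand. The sketch below uses hypothetical mpg values and the usual definitions: MSE is the mean of the squared errors, and R² is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """1 - SS_res / SS_tot; 1.0 is a perfect fit, 0.0 matches predicting the mean."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical mpg targets and predictions.
y_true = [18.0, 15.0, 30.0, 25.0]
y_pred = [17.0, 16.0, 28.0, 26.0]
print(mse(y_true, y_pred), r2(y_true, y_pred))
```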


### NEURAL NETWORKS

In [110]:
import torch
from sklearn.datasets import  load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from torch.utils.data import TensorDataset, DataLoader
import torch.nn as nn
import torch.optim as optim
from sklearn.metrics import accuracy_score
import numpy as np

In [111]:
df = load_iris()

In [112]:
X, Y = df.data, df.target
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=42, stratify=Y)
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size=0.2, random_state=42, stratify=Y_train)

In [113]:
# fit_transform: the fit part computes the mean and standard deviation, the transform part standardizes the data with them
# fit must not be run on the test and validation sets, because their statistics must not leak into training;
# otherwise splitting the data into three sets would be pointless
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
X_val = scaler.transform(X_val)

In [114]:
# check the data types for the next step
# float64 -> float32
# int64 -> long, as PyTorch requires
print(X_train.dtype)
print(X_test.dtype)
print(X_val.dtype)

print(Y_train.dtype)
print(Y_test.dtype)
print(Y_val.dtype)

float64
float64
float64
int64
int64
int64


In [115]:
# convert the NumPy arrays to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
X_val_tensor = torch.tensor(X_val, dtype=torch.float32)

Y_train_tensor = torch.tensor(Y_train, dtype=torch.long)
Y_test_tensor = torch.tensor(Y_test, dtype=torch.long)
Y_val_tensor = torch.tensor(Y_val, dtype=torch.long)

In [116]:
# wrap each X_ / Y_ tensor pair in a single dataset object for later use
train_dataset = TensorDataset(X_train_tensor, Y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, Y_test_tensor)
val_dataset = TensorDataset(X_val_tensor, Y_val_tensor)

In [117]:
# split the datasets into mini-batches for training and evaluation,
# where batch_size is the number of samples in one batch
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16)
val_loader = DataLoader(val_dataset, batch_size=16)

In [118]:
# define the network for this dataset; every network subclasses nn.Module
# __init__ is the constructor: it defines 4 input features (the iris measurements),
# two hidden layers with a ReLU activation after each (dropout_p is accepted but not used),
# and three output neurons (one per iris class)
# forward defines how the input data passes through the network
class IrisNet(nn.Module):
    def __init__(self, hidden1, hidden2, dropout_p=0.5):
        super(IrisNet, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(4, hidden1),
            nn.ReLU(),
            nn.Linear(hidden1, hidden2),
            nn.ReLU(),
            nn.Linear(hidden2,3)
        )
    
    def forward(self, x):
        return self.net(x)

In [119]:
# create a model whose two hidden layers have 32 and 16 neurons
# define the loss function and an optimizer with a learning rate of 0.001
# depending on the hardware, the model is trained on the GPU or on the CPU
# the model is moved to the available device
model = IrisNet(hidden1=32, hidden2=16)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

IrisNet(
  (net): Sequential(
    (0): Linear(in_features=4, out_features=32, bias=True)
    (1): ReLU()
    (2): Linear(in_features=32, out_features=16, bias=True)
    (3): ReLU()
    (4): Linear(in_features=16, out_features=3, bias=True)
  )
)

In [120]:
def evaluate(model, loader, device):
    model.eval() # put the model in evaluation mode
    all_preds, all_labels = [], [] # lists of predictions and true labels
    with torch.inference_mode(): # no gradients are computed
        for X_batch, Y_batch in loader:
            X_batch, Y_batch = X_batch.to(device), Y_batch.to(device)
            outputs = model(X_batch) # raw, unnormalized scores (logits)
            probabilities = torch.softmax(outputs, 1) # convert logits to per-class probabilities
            _ , preds= torch.max(probabilities, 1) # index of the highest probability in each row, i.e. the predicted class
            all_preds.extend(preds.cpu().numpy())   # move the tensor to the CPU and convert it to NumPy
            all_labels.extend(Y_batch.cpu().numpy()) # same for the labels

    return accuracy_score(all_labels, all_preds)

In [121]:
evaluate(model, test_loader, device)

0.26666666666666666

In [122]:
best_val_loss = np.inf
best_val_acc = 0
epoch_no_improve = 0
n_epochs = 50
patience = 5

for epoch in range(n_epochs):
    model.train()
    train_loss = 0
    for X_batch, Y_batch in train_loader:
        X_batch, Y_batch = X_batch.to(device), Y_batch.to(device)
        optimizer.zero_grad() # reset the gradients
        outputs = model(X_batch) # model output
        loss = criterion(outputs, Y_batch) # compute the loss
        loss.backward() # backpropagation to compute the gradients
        optimizer.step() # update the weights with the optimizer
        train_loss += loss.item() * X_batch.size(0) # accumulate the batch loss weighted by batch size
    train_loss/= len(train_loader.dataset) # average training loss per sample

    model.eval()
    val_loss = 0.0
    with torch.inference_mode():
        for X_batch, Y_batch in val_loader:
            X_batch, Y_batch = X_batch.to(device), Y_batch.to(device)
            outputs = model(X_batch)
            val_loss += criterion(outputs, Y_batch).item() * X_batch.size(0)
        # average loss on the validation set
        val_loss /= len(val_loader.dataset)
        val_acc = evaluate(model, val_loader, device)

        # if the validation accuracy beats the best so far, update it, reset the counter and save the model
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            epoch_no_improve = 0
            torch.save(model.state_dict(), 'best_iris_model.pth')
        else:
            epoch_no_improve +=1 # otherwise the number of epochs without improvement grows

        # once the counter reaches patience, training stops,
        # since the model is no longer improving on the validation set
        if epoch_no_improve>=patience:
            print("Early stopping triggered.")
            break

Early stopping triggered.
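
The patience logic of the training loop can be isolated from PyTorch and tested on its own. The sketch below replays a hypothetical sequence of validation accuracies through the same counter; the function name and the scores are illustrative assumptions.

```python
def early_stopping_epoch(val_scores, patience):
    """Return (stop_epoch, best_epoch) for a sequence of validation scores."""
    best_score = float("-inf")
    best_epoch = None
    epochs_no_improve = 0
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            # strict improvement: remember it and reset the counter
            best_score, best_epoch = score, epoch
            epochs_no_improve = 0
        else:
            epochs_no_improve += 1
        if epochs_no_improve >= patience:
            return epoch, best_epoch
    return len(val_scores) - 1, best_epoch


# Hypothetical validation accuracies per epoch.
scores = [0.50, 0.70, 0.70, 0.69, 0.71, 0.71, 0.70, 0.71]
print(early_stopping_epoch(scores, patience=3))
```

Because the comparison is strict, an epoch that only ties the best accuracy counts as no improvement. On a small validation set, where accuracy moves in coarse steps, this can end training early.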


In [123]:
# load the best saved model and evaluate it on the test set
model.load_state_dict(torch.load('best_iris_model.pth'))
test_acc = evaluate(model, test_loader, device)
print(test_acc)

output = model(X_test_tensor.to(device))
probabilities = torch.softmax(output,1)
print(probabilities)

0.7333333333333333
tensor([[0.2914, 0.3281, 0.3805],
        [0.2247, 0.3283, 0.4471],
        [0.2351, 0.2856, 0.4793],
        [0.3091, 0.3635, 0.3274],
        [0.2307, 0.2556, 0.5137],
        [0.8494, 0.0753, 0.0752],
        [0.8343, 0.0845, 0.0812],
        [0.8470, 0.0736, 0.0794],
        [0.2003, 0.2784, 0.5213],
        [0.2955, 0.3204, 0.3841],
        [0.8866, 0.0456, 0.0677],
        [0.2440, 0.2903, 0.4657],
        [0.3083, 0.3079, 0.3838],
        [0.2600, 0.2925, 0.4475],
        [0.8471, 0.0731, 0.0798]], grad_fn=<SoftmaxBackward0>)
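
Each row of the tensor above is a softmax over the three logits of one sample. A minimal plain-Python sketch of that computation, using hypothetical logits, also shows that the predicted class can be read from the logits directly, since softmax preserves their order.

```python
from math import exp


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical network outputs for one flower
probs = softmax(logits)

# The argmax is the same before and after softmax.
class_from_logits = max(range(len(logits)), key=lambda i: logits[i])
class_from_probs = max(range(len(probs)), key=lambda i: probs[i])
print(class_from_logits, class_from_probs)
```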


### EXERCISE - third lab with hyperparameter search

1. **Loading a real dataset**: Load the tips dataset using the sns library: https://rdrr.io/cran/reshape2/man/tips.html
2. **Vectorizing categorical data**: Identify the categorical columns. Apply LabelEncoder and OneHotEncoder to the appropriate columns.
3. **Imputation**: Check whether there are NaN values and handle them with a suitable technique. Detect columns with a high degree of correlation and delete the appropriate ones.
4. **Outliers**: Define a threshold of 3 standard deviations for the numeric column total_bill and remove the outliers.
5. **Scaling**: Apply `StandardScaler` or `MinMaxScaler` to the total_bill column.
6. **Training**: Split the data into training/test sets and train a DecisionTree for regression that uses the AdaBoost algorithm for ensemble training. Use tip as the target column.
7. **Evaluation**: Evaluate the metrics on the test set.

In [124]:
import seaborn as sns

data = sns.load_dataset('tips')
data.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [125]:
data.isnull().sum()

total_bill    0
tip           0
sex           0
smoker        0
day           0
time          0
size          0
dtype: int64

In [126]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import pandas as pd
data_encoded = data.copy()

categorical_columns = ['sex', 'smoker', 'day', 'time']
label_encoders = {}

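# encode each categorical column as integers and keep the fitted encoder for later decoding
for column in categorical_columns:
    le = LabelEncoder()
    data_encoded[column] = le.fit_transform(data[column])  # fit_transform already fits the encoder
    label_encoders[column] = le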

data_encoded.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,0,0,2,0,2
1,10.34,1.66,1,0,2,0,3
2,21.01,3.5,1,0,2,0,3
3,23.68,3.31,1,0,2,0,2
4,24.59,3.61,0,0,2,0,4
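
The integer codes in the table come from the sorted order of the class labels, which is why `Sun` becomes 2. A minimal sketch on a hypothetical list of day labels shows how to read the mapping back from a fitted encoder through `classes_` and `inverse_transform`:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["Sun", "Sat", "Thur", "Fri", "Sun"])

# classes_ is sorted; the position of each label is its integer code
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform turns codes back into the original labels
decoded = [str(label) for label in le.inverse_transform([2, 0])]
print(decoded)
```

The same lookup works on the encoders stored in `label_encoders`, for example `label_encoders['day'].classes_`.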


In [127]:
corr = data_encoded.corr()
corr.style.background_gradient(cmap='coolwarm')

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
total_bill,1.0,0.675734,0.144877,0.085721,-0.04355,-0.183118,0.598315
tip,0.675734,1.0,0.088862,0.005929,-0.011548,-0.121629,0.489299
sex,0.144877,0.088862,1.0,0.002816,-0.078292,-0.205231,0.086195
smoker,0.085721,0.005929,0.002816,1.0,-0.282721,-0.054921,-0.133178
day,-0.04355,-0.011548,-0.078292,-0.282721,1.0,0.638019,0.06951
time,-0.183118,-0.121629,-0.205231,-0.054921,0.638019,1.0,-0.103411
size,0.598315,0.489299,0.086195,-0.133178,0.06951,-0.103411,1.0
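
Reading a correlation matrix by eye gets harder as the number of columns grows. The following is a minimal sketch of listing the column pairs above a chosen threshold; the helper name `high_corr_pairs`, the 0.8 threshold, and the toy data are assumptions made for illustration, not part of the exercise.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df, threshold=0.8):
    """Return column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # keep only the strict upper triangle so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    # NaN entries (masked cells) never pass the comparison
    return pairs[pairs > threshold].sort_values(ascending=False)


# toy data: b is almost a multiple of a, c is unrelated
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.5],
    "c": [5, 1, 4, 2, 3],
})
pairs = high_corr_pairs(demo, threshold=0.8)
print(pairs)
```

In the matrix above the largest off-diagonal values are 0.68 (total_bill and tip) and 0.64 (day and time), so a threshold of 0.8 would flag no pair on the tips data.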


In [128]:
ohe_cat_cols = ['day','time']
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded_data = ohe.fit_transform(data_encoded[ohe_cat_cols])
feature_names = ohe.get_feature_names_out(ohe_cat_cols)
data_ohe = pd.DataFrame(encoded_data, columns=feature_names, index=data.index)
data_encoded=data_encoded.drop(ohe_cat_cols, axis=1).join(data_ohe)
data_encoded.head()

Unnamed: 0,total_bill,tip,sex,smoker,size,day_0,day_1,day_2,day_3,time_0,time_1
0,16.99,1.01,0,0,2,0.0,0.0,1.0,0.0,1.0,0.0
1,10.34,1.66,1,0,3,0.0,0.0,1.0,0.0,1.0,0.0
2,21.01,3.5,1,0,3,0.0,0.0,1.0,0.0,1.0,0.0
3,23.68,3.31,1,0,2,0.0,0.0,1.0,0.0,1.0,0.0
4,24.59,3.61,0,0,4,0.0,0.0,1.0,0.0,1.0,0.0


In [129]:
import numpy as np
total_bill = data_encoded['total_bill'].copy()
mask = np.abs(total_bill - total_bill.mean()) < 3*total_bill.std()
data_encoded = data_encoded[mask]
data_encoded.head()

Unnamed: 0,total_bill,tip,sex,smoker,size,day_0,day_1,day_2,day_3,time_0,time_1
0,16.99,1.01,0,0,2,0.0,0.0,1.0,0.0,1.0,0.0
1,10.34,1.66,1,0,3,0.0,0.0,1.0,0.0,1.0,0.0
2,21.01,3.5,1,0,3,0.0,0.0,1.0,0.0,1.0,0.0
3,23.68,3.31,1,0,2,0.0,0.0,1.0,0.0,1.0,0.0
4,24.59,3.61,0,0,4,0.0,0.0,1.0,0.0,1.0,0.0
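
The first five rows look unchanged because none of them is an outlier, so it helps to compare row counts before and after filtering. A minimal sketch of the same 3-standard-deviation rule wrapped in a function; the name `remove_outliers_3sigma` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def remove_outliers_3sigma(df, column, n_std=3.0):
    """Keep rows whose value lies within n_std standard deviations of the mean."""
    values = df[column]
    mask = np.abs(values - values.mean()) < n_std * values.std()
    return df[mask]


# twenty ordinary bills and one extreme value
demo = pd.DataFrame({"total_bill": [10.0] * 20 + [1000.0]})
filtered = remove_outliers_3sigma(demo, "total_bill")
print(len(demo), len(filtered))
```

Note that the rule needs a reasonable number of rows to work: with very few samples a single extreme value inflates the standard deviation so much that it can never exceed three deviations.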


In [130]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data_encoded['total_bill'] = scaler.fit_transform(data_encoded[['total_bill']])
data_encoded.head()

Unnamed: 0,total_bill,tip,sex,smoker,size,day_0,day_1,day_2,day_3,time_0,time_1
0,-0.284729,1.01,0,0,2,0.0,0.0,1.0,0.0,1.0,0.0
1,-1.104123,1.66,1,0,3,0.0,0.0,1.0,0.0,1.0,0.0
2,0.210604,3.5,1,0,3,0.0,0.0,1.0,0.0,1.0,0.0
3,0.539593,3.31,1,0,2,0.0,0.0,1.0,0.0,1.0,0.0
4,0.651721,3.61,0,0,4,0.0,0.0,1.0,0.0,1.0,0.0
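
The cell above fits the scaler on the whole table before the train/test split. A common alternative is to compute the mean and standard deviation on the training rows only and reuse them on the test rows, so that no test-set statistics reach training. A minimal sketch, assuming synthetic values as a stand-in for `total_bill`:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
bills = rng.uniform(5, 50, size=(100, 1))  # synthetic stand-in for total_bill

bills_train, bills_test = train_test_split(bills, test_size=0.2, random_state=42)

scaler = StandardScaler()
bills_train_scaled = scaler.fit_transform(bills_train)  # statistics come from train only
bills_test_scaled = scaler.transform(bills_test)        # the same statistics are reused

print(bills_train_scaled.mean(), bills_train_scaled.std())
```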


In [131]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import train_test_split, ParameterGrid
from sklearn.metrics import mean_absolute_error

X = data_encoded.drop('tip',axis=1).to_numpy()
Y = data_encoded['tip'].to_numpy()

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size=0.2, random_state=42)

param_grid = {
    'criterion': ['squared_error','friedman_mse', 'poisson'],
    'max_depth': [None, 5, 10],
    'min_samples_leaf': [1, 2, 4, 8],
    'min_samples_split': [2, 4, 8]
}

grid = ParameterGrid(param_grid)

best_score = np.inf
best_params = None

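# manual grid search: fit on the training split, score on the validation split
for params in grid:
    model = DecisionTreeRegressor(**params)
    model.fit(X_train, Y_train)
    Y_pred = model.predict(X_val)
    score = mean_absolute_error(Y_val, Y_pred)  # signature is (y_true, y_pred)

    # lower MAE is better
    if score < best_score:
        best_score, best_params = score, params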

print("Best score:", best_score)
print("Best params: ", best_params)

Best score: 0.7823573463573463
Best params:  {'criterion': 'poisson', 'max_depth': None, 'min_samples_leaf': 8, 'min_samples_split': 2}


In [132]:
model = DecisionTreeRegressor(**best_params)
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)

print("MAE: ", mean_absolute_error(Y_test, Y_pred))

MAE:  0.7408532601657604


###  Adult Dataset

In [190]:
import pandas as pd

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'

column_names = [
    'age', 'workclass', 'fnlwgt', 'education', 'education_num',
    'marital_status', 'occupation', 'relationship', 'race', 'sex',
    'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income'
]
df = pd.read_csv(url, names = column_names, na_values='?', skipinitialspace=True)
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [191]:
df.isnull().sum()

age                  0
workclass         1836
fnlwgt               0
education            0
education_num        0
marital_status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital_gain         0
capital_loss         0
hours_per_week       0
native_country     583
income               0
dtype: int64

In [192]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='most_frequent')
impute = ['workclass', 'occupation', 'native_country']

df[impute] = imputer.fit_transform(df[impute])
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
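
The first five rows contain no missing values, so the effect of the imputer is not visible in the preview. A minimal sketch on a hypothetical four-row frame shows what `strategy='most_frequent'` does to a text column:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# toy frame with one missing value per column (values chosen for illustration)
demo = pd.DataFrame({
    "workclass": ["Private", np.nan, "Private", "State-gov"],
    "occupation": ["Sales", "Sales", np.nan, "Tech-support"],
})
cols = ["workclass", "occupation"]

imputer = SimpleImputer(strategy="most_frequent")
# each NaN is replaced by the most common value of its own column
demo[cols] = imputer.fit_transform(demo[cols])
print(demo)
```

Running `df.isnull().sum()` again after the imputation cell is a quick way to confirm that the three affected columns no longer contain NaN.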


In [193]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

label_cols = ['income', 'workclass', 'education', 'marital_status', 'occupation', 'relationship', 'race', 'native_country', 'sex']
label_encoders = {}
df_encoded = df.copy()
# encode each text column as integers and keep the fitted encoder for later decoding
for col in label_cols:
    le = LabelEncoder()
    df_encoded[col] = le.fit_transform(df_encoded[col])  # fit_transform already fits the encoder
    label_encoders[col] = le

In [194]:
corr = df_encoded.corr()
corr.style.background_gradient(cmap='coolwarm')

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
age,1.0,0.040504,-0.076646,-0.010508,0.036527,-0.266288,0.001739,-0.263698,0.028718,0.088832,0.077674,0.057775,0.068756,-0.00027,0.234037
workclass,0.040504,1.0,-0.024338,0.004874,0.003536,-0.020468,0.00711,-0.057947,0.04835,0.071584,0.031505,0.002644,0.042199,-0.001625,0.002693
fnlwgt,-0.076646,-0.024338,1.0,-0.028145,-0.043195,0.028153,0.000188,0.008931,-0.021291,0.026858,0.000432,-0.010252,-0.018768,-0.063286,-0.009463
education,-0.010508,0.004874,-0.028145,1.0,0.359153,-0.038407,-0.041279,-0.010876,0.014131,-0.027356,0.030046,0.016746,0.05551,0.07606,0.079317
education_num,0.036527,0.003536,-0.043195,0.359153,1.0,-0.069304,0.070954,-0.094153,0.031838,0.01228,0.12263,0.079923,0.148123,0.088894,0.335154
marital_status,-0.266288,-0.020468,0.028153,-0.038407,-0.069304,1.0,0.034962,0.185451,-0.068013,-0.129314,-0.043393,-0.034187,-0.190519,-0.021278,-0.199307
occupation,0.001739,0.00711,0.000188,-0.041279,0.070954,0.034962,1.0,-0.037451,-0.004839,0.047461,0.018021,0.00968,-0.012879,-0.002217,0.034625
relationship,-0.263698,-0.057947,0.008931,-0.010876,-0.094153,0.185451,-0.037451,1.0,-0.116055,-0.582454,-0.057919,-0.061062,-0.248974,-0.010712,-0.250918
race,0.028718,0.04835,-0.021291,0.014131,0.031838,-0.068013,-0.004839,-0.116055,1.0,0.087204,0.011145,0.018899,0.04191,0.116529,0.071846
sex,0.088832,0.071584,0.026858,-0.027356,0.01228,-0.129314,0.047461,-0.582454,0.087204,1.0,0.04848,0.045567,0.229309,0.002061,0.21598


In [195]:
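# relationship has the strongest correlation with another feature (sex, about -0.58), so it is dropped
drop = ['relationship']
df_encoded = df_encoded.drop(columns=drop)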

In [196]:
ohe_cols = ['workclass','education', 'marital_status', 'occupation', 'race', 'native_country', 'sex']
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded_data = ohe.fit_transform(df_encoded[ohe_cols])
feature_names = ohe.get_feature_names_out(ohe_cols)
df_ohe = pd.DataFrame(encoded_data, columns=feature_names, index = df.index)
df_encoded = df_encoded.drop(columns=ohe_cols).join(df_ohe)
df_encoded.head()

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week,income,workclass_0,workclass_1,workclass_2,...,native_country_33,native_country_34,native_country_35,native_country_36,native_country_37,native_country_38,native_country_39,native_country_40,sex_0,sex_1
0,39,77516,13,2174,0,40,0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1,50,83311,13,0,0,13,0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
2,38,215646,9,0,0,40,0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
3,53,234721,7,0,0,40,0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
4,28,338409,13,0,0,40,0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [197]:
from sklearn.preprocessing import StandardScaler

scale = ['age', 'fnlwgt','education_num', 'capital_gain', 'capital_loss', 'hours_per_week']
scaler = StandardScaler()
df_encoded[scale] = scaler.fit_transform(df_encoded[scale])
df_encoded.head()

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week,income,workclass_0,workclass_1,workclass_2,...,native_country_33,native_country_34,native_country_35,native_country_36,native_country_37,native_country_38,native_country_39,native_country_40,sex_0,sex_1
0,0.030671,-1.063611,1.134739,0.148453,-0.21666,-0.035429,0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1,0.837109,-1.008707,1.134739,-0.14592,-0.21666,-2.222153,0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
2,-0.042642,0.245079,-0.42006,-0.14592,-0.21666,-0.035429,0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
3,1.057047,0.425801,-1.197459,-0.14592,-0.21666,-0.035429,0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
4,-0.775768,1.408176,1.134739,-0.14592,-0.21666,-0.035429,0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [198]:
from sklearn.model_selection import train_test_split

X = df_encoded.drop('income', axis=1).to_numpy()
Y = df_encoded['income'].to_numpy()

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size=0.2)
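
# The split above sets no random_state, so the reported scores change from run to run.
# A minimal sketch of a reproducible, stratified split follows. The data is synthetic,
# and the 25% positive share is an assumption made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 4))
y_demo = np.array([1] * 250 + [0] * 750)  # hypothetical 25% positive class

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(y_tr.mean(), y_te.mean())
```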

In [200]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import ParameterGrid
from sklearn.metrics import mean_absolute_error
import numpy as np

param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}


grid = ParameterGrid(param_grid)
best_score = np.inf
best_params = None

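# manual grid search: fit on the training split, score on the validation split
for params in grid:
    model = DecisionTreeClassifier(**params)
    model.fit(X_train, Y_train)
    Y_pred = model.predict(X_val)
    # with 0/1 labels, MAE equals the misclassification rate (1 - accuracy)
    score = mean_absolute_error(Y_val, Y_pred)

    if score < best_score:
        best_score = score
        best_params = params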

print("Best score:", best_score)
print("Best params: ", best_params)

model = DecisionTreeClassifier(**best_params)
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)

# on the test set this MAE is again the share of misclassified rows
print("MAE: ", mean_absolute_error(Y_test, Y_pred))

Best score: 0.14107485604606526
Best params:  {'criterion': 'entropy', 'max_depth': 10, 'min_samples_leaf': 4, 'min_samples_split': 10}
MAE:  0.13910640257945647
