# AutoML

# Pycaret
https://jaylala.tistory.com/entry/머신러닝-with-파이썬-Pycaret이란-Pycaret을-활용한-머신러닝  
https://pycaret.gitbook.io/docs

!pip install pycaret

### Setting up a virtual environment with venv or uv is recommended
(Using Conda can cause environment conflicts)  

#### venv setup
python3.11 -m venv .venv  
source .venv/bin/activate  
pip install pycaret xgboost seaborn ipykernel

#### uv setup (recommended)
(run in PowerShell)  
Set-ExecutionPolicy RemoteSigned -Scope CurrentUser  
irm https://astral.sh/uv/install.ps1 | iex  

(run in cmd)  
set Path=C:\Users\User\.local\bin;%Path%  
uv init --python 3.11  
uv add --no-build-isolation pycaret xgboost seaborn ipykernel  
.venv\Scripts\activate.bat
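
After activating the environment, a quick check that the packages from the install commands above are visible to the interpreter can save a confusing import error later. The sketch below uses only the standard library; the helper name `installed_versions` is hypothetical and not part of any of these packages.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Package names taken from the install commands above.
for name, ver in installed_versions(["pycaret", "xgboost", "seaborn", "ipykernel"]).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

If any line reports `NOT INSTALLED`, the notebook kernel is probably not running inside the `.venv` that was just created.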

In [1]:
from pycaret.datasets import get_data
from pycaret.classification import *

In [2]:
# Load the built-in Iris dataset
data=get_data('iris')

,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [3]:
from sklearn.model_selection import train_test_split # datasplit

In [4]:
# Split the data into train and test sets
from sklearn.model_selection import train_test_split # datasplit
test_size = 0.2
data_train, data_test = train_test_split(data, test_size=test_size)

In [5]:
# setup() configures the data preprocessing and model-training settings used by PyCaret.
# Data preprocessing and model-training configuration
exp = setup(data_train, target='species')

,Description,Value
0,Session id,6087
1,Target,species
2,Target type,Multiclass
3,Target mapping,"Iris-setosa: 0, Iris-versicolor: 1, Iris-virginica: 2"
4,Original data shape,"(120, 5)"
5,Transformed data shape,"(120, 5)"
6,Transformed train set shape,"(84, 5)"
7,Transformed test set shape,"(36, 5)"
8,Numeric features,4
9,Preprocess,True


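Setup output explained

- Session id: the random seed for the experiment (equivalent to `random_state` in scikit-learn). If `session_id` is not passed to `setup`, a random value is generated, so pass a fixed value to make runs reproducible.

- Target: the target variable.

- Target type: the kind of classification task (Binary or Multiclass), detected automatically from the target.

- Target mapping: how the target classes are encoded.
  The classes here are Iris-setosa, Iris-versicolor, and Iris-virginica, which are encoded as 0 / 1 / 2 during the analysis.

- Original data shape: the shape of the input data, that is, the number of rows and columns.

- Transformed data shape: the shape of the data after preprocessing.

- Transformed train set shape: the shape of the training portion after preprocessing (84 rows here, 70% of the 120 input rows).

- Transformed test set shape: the shape of the test portion after preprocessing (36 rows here, the remaining 30%).

- Numeric features: the number of numeric independent variables in the dataset.

- Preprocess: whether `setup` applied preprocessing. Preprocessing covers 1) missing-value handling, 2) categorical encoding, 3) the train/test split, and 4) scaling (standardization or normalization). Scaling is applied only when `normalize=True` is passed to `setup`.

- Imputation type: how missing values are handled.

- Numeric imputation: how missing values in numeric columns are filled.

- Categorical imputation: how missing values in categorical columns are filled.

- Fold Generator: the strategy used to split the data into folds for k-fold CV.

- Fold Number: the number of folds.

- CPU Jobs: the number of CPU cores used for parallel processing.
 -1: use every CPU available on the system

- Use GPU: whether a GPU is used.

- Log Experiment: whether experiment results are logged and tracked.

- Experiment Name: the name of the experiment.

Source: https://jaylala.tistory.com/entry/머신러닝-with-파이썬-Pycaret이란-Pycaret을-활용한-머신러닝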
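
To make the Target mapping and the train/test shapes in the table concrete, here is a rough scikit-learn sketch of the two steps. It is not PyCaret's internal implementation. It assumes scikit-learn's bundled Iris data as a stand-in for `get_data('iris')`, a 70/30 split, and stratification on the target; the `random_state` values are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Assumption: scikit-learn's Iris stands in for PyCaret's get_data('iris').
iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={"target": "species"})
df["species"] = df["species"].map(dict(enumerate(iris.target_names)))

# Mirror the notebook: hold out 20% first, leaving 120 rows for "setup".
data_train, data_test = train_test_split(df, test_size=0.2, random_state=0)

# Target mapping: class names are encoded as integers in sorted order.
encoder = LabelEncoder()
y = encoder.fit_transform(data_train["species"])
target_mapping = {str(name): int(code) for code, name in enumerate(encoder.classes_)}

# Split inside "setup": 70% train / 30% test of the 120 rows (stratified by assumption).
X = data_train.drop(columns=["species"])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=0
)

print(target_mapping)
print(X_train.shape, X_test.shape)  # features only: 84 and 36 rows, 4 columns each
```

PyCaret reports 5 columns because its shapes include the target column, while `X_train` and `X_test` here hold only the 4 features.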

In [6]:
# Compare models
best_model = compare_models()

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.9778,0.0,0.9778,0.9833,0.9771,0.9667,0.9698,0.195
gbc,Gradient Boosting Classifier,0.9764,0.0,0.9764,0.9833,0.9761,0.9647,0.9683,0.017
lda,Linear Discriminant Analysis,0.9764,0.0,0.9764,0.9823,0.9749,0.9638,0.9675,0.003
lightgbm,Light Gradient Boosting Machine,0.9653,0.987,0.9653,0.975,0.9646,0.9481,0.9532,0.037
qda,Quadratic Discriminant Analysis,0.9639,0.0,0.9639,0.974,0.9624,0.9452,0.9509,0.004
nb,Naive Bayes,0.9542,0.9963,0.9542,0.9656,0.952,0.9305,0.9374,0.14
knn,K Neighbors Classifier,0.9528,0.9963,0.9528,0.9667,0.9521,0.9295,0.9365,0.146
et,Extra Trees Classifier,0.9528,0.9926,0.9528,0.9656,0.951,0.9285,0.9358,0.014
dt,Decision Tree Classifier,0.9403,0.9567,0.9403,0.9573,0.9385,0.9099,0.9191,0.108
rf,Random Forest Classifier,0.9403,1.0,0.9403,0.9573,0.9385,0.9099,0.9191,0.016
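
Conceptually, the leaderboard above comes from cross-validating each candidate estimator and sorting by the chosen metric. The sketch below reproduces that idea with plain scikit-learn for four of the models. It is an illustration under assumptions (scikit-learn's Iris data, 10 stratified folds, accuracy only), not what `compare_models` runs internally, and its scores will not match the table.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Keys reuse the PyCaret model IDs from the table for readability only.
candidates = {
    "lr": LogisticRegression(max_iter=1000),
    "lda": LinearDiscriminantAnalysis(),
    "knn": KNeighborsClassifier(),
    "dt": DecisionTreeClassifier(random_state=0),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Mean CV accuracy per model, sorted from best to worst.
leaderboard = sorted(
    (
        (cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean(), model_id)
        for model_id, model in candidates.items()
    ),
    reverse=True,
)

for mean_acc, model_id in leaderboard:
    print(f"{model_id:>4}  {mean_acc:.4f}")

best_id = leaderboard[0][1]  # analogous to the model returned by compare_models()
```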


In [7]:
# Tune the best model from compare_models() (Logistic Regression in this run) with 10-fold CV
# tune_model searches the hyperparameters automatically
tuned_model = tune_model(best_model)

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.8889,0.0,0.8889,0.9167,0.8857,0.8333,0.8492
1,0.8889,0.0,0.8889,0.9167,0.8857,0.8333,0.8492
2,1.0,0.0,1.0,1.0,1.0,1.0,1.0
3,1.0,0.0,1.0,1.0,1.0,1.0,1.0
4,1.0,0.0,1.0,1.0,1.0,1.0,1.0
5,1.0,0.0,1.0,1.0,1.0,1.0,1.0
6,1.0,0.0,1.0,1.0,1.0,1.0,1.0
7,1.0,0.0,1.0,1.0,1.0,1.0,1.0
8,1.0,0.0,1.0,1.0,1.0,1.0,1.0
9,1.0,0.0,1.0,1.0,1.0,1.0,1.0


Fitting 10 folds for each of 10 candidates, totalling 100 fits
Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).
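
The log line "Fitting 10 folds for each of 10 candidates, totalling 100 fits" is the message scikit-learn prints for a search over 10 candidates with 10 folds, and the note about the original model means the tuned candidate is kept only if it beats the untuned one. A minimal sketch of that "keep the better model" logic follows. It assumes a randomized search, and the search space for `C` is hypothetical rather than PyCaret's own grid.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Baseline: the untuned model scored with the same 10 folds.
base_model = LogisticRegression(max_iter=1000)
base_score = cross_val_score(base_model, X, y, cv=cv, scoring="accuracy").mean()

# 10 random candidates x 10 folds = 100 fits.
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e1)},  # hypothetical search space
    n_iter=10,
    cv=cv,
    scoring="accuracy",
    random_state=0,
)
search.fit(X, y)

# Return the tuned model only when it actually improves on the baseline.
if search.best_score_ > base_score:
    chosen, chosen_score = search.best_estimator_, search.best_score_
else:
    chosen, chosen_score = base_model.fit(X, y), base_score

print(f"baseline={base_score:.4f} tuned={search.best_score_:.4f} kept={chosen_score:.4f}")
```

This is also why the fold table shown by `tune_model` can look worse than the model that is finally returned: the table describes the tuned candidate, not the model that was kept.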


In [8]:
# Evaluate the model
evaluate_model(tuned_model)

interactive(children=(ToggleButtons(description='Plot Type:', icons=('',), options=(('Pipeline Plot', 'pipelin…

In [9]:
# Test data used for prediction
test_data = data_test


In [10]:
# Run prediction
pred = predict_model(tuned_model, data=test_data)


,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Logistic Regression,0.9667,0.9966,0.9667,0.97,0.9668,0.9497,0.9513


In [11]:
# To train a single chosen model, first list the available model IDs
models()

ID,Name,Reference,Turbo
lr,Logistic Regression,sklearn.linear_model._logistic.LogisticRegression,True
knn,K Neighbors Classifier,sklearn.neighbors._classification.KNeighborsCl...,True
nb,Naive Bayes,sklearn.naive_bayes.GaussianNB,True
dt,Decision Tree Classifier,sklearn.tree._classes.DecisionTreeClassifier,True
svm,SVM - Linear Kernel,sklearn.linear_model._stochastic_gradient.SGDC...,True
rbfsvm,SVM - Radial Kernel,sklearn.svm._classes.SVC,False
gpc,Gaussian Process Classifier,sklearn.gaussian_process._gpc.GaussianProcessC...,False
mlp,MLP Classifier,sklearn.neural_network._multilayer_perceptron....,False
ridge,Ridge Classifier,sklearn.linear_model._ridge.RidgeClassifier,True
rf,Random Forest Classifier,sklearn.ensemble._forest.RandomForestClassifier,True


In [12]:
select_model = create_model('rf')
tuned_model = tune_model(select_model)

Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.8889,1.0,0.8889,0.9167,0.8857,0.8333,0.8492
1,0.8889,1.0,0.8889,0.9167,0.8857,0.8333,0.8492
2,1.0,1.0,1.0,1.0,1.0,1.0,1.0
3,1.0,1.0,1.0,1.0,1.0,1.0,1.0
4,1.0,1.0,1.0,1.0,1.0,1.0,1.0
5,0.875,1.0,0.875,0.9167,0.875,0.814,0.8333
6,1.0,1.0,1.0,1.0,1.0,1.0,1.0
7,0.875,1.0,0.875,0.9062,0.8631,0.8049,0.826
8,1.0,1.0,1.0,1.0,1.0,1.0,1.0
9,0.875,1.0,0.875,0.9167,0.875,0.814,0.8333


Fold,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.8889,1.0,0.8889,0.9167,0.8857,0.8333,0.8492
1,1.0,1.0,1.0,1.0,1.0,1.0,1.0
2,1.0,1.0,1.0,1.0,1.0,1.0,1.0
3,1.0,1.0,1.0,1.0,1.0,1.0,1.0
4,1.0,1.0,1.0,1.0,1.0,1.0,1.0
5,1.0,1.0,1.0,1.0,1.0,1.0,1.0
6,1.0,1.0,1.0,1.0,1.0,1.0,1.0
7,1.0,1.0,1.0,1.0,1.0,1.0,1.0
8,1.0,1.0,1.0,1.0,1.0,1.0,1.0
9,0.875,1.0,0.875,0.9167,0.875,0.814,0.8333


Fitting 10 folds for each of 10 candidates, totalling 100 fits


In [13]:
evaluate_model(tuned_model)

interactive(children=(ToggleButtons(description='Plot Type:', icons=('',), options=(('Pipeline Plot', 'pipelin…

# OPTUNA

In [14]:
import optuna

def objective(trial):
    x = trial.suggest_float('x', -10, 10)
    return (x - 2) ** 2

study = optuna.create_study()
study.optimize(objective, n_trials=50)

study.best_params 

[I 2025-09-09 07:48:52,554] A new study created in memory with name: no-name-eab8a294-9ba7-401d-a3e9-415f21a7c798
[I 2025-09-09 07:48:52,555] Trial 0 finished with value: 43.15291930234082 and parameters: {'x': -4.569088163690667}. Best is trial 0 with value: 43.15291930234082.
[I 2025-09-09 07:48:52,555] Trial 1 finished with value: 120.41372989811316 and parameters: {'x': -8.973319001018478}. Best is trial 0 with value: 43.15291930234082.
[I 2025-09-09 07:48:52,556] Trial 2 finished with value: 109.98221709808814 and parameters: {'x': -8.487240680850618}. Best is trial 0 with value: 43.15291930234082.
[I 2025-09-09 07:48:52,556] Trial 3 finished with value: 6.671394015675579 and parameters: {'x': -0.5829041824418457}. Best is trial 3 with value: 6.671394015675579.
[I 2025-09-09 07:48:52,557] Trial 4 finished with value: 0.7197440721730562 and parameters: {'x': 1.1516226828980773}. Best is trial 4 with value: 0.7197440721730562.
[I 2025-09-09 07:48:52,557] Trial 5 finished with value:

{'x': 1.988655074406314}
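
`optuna.create_study()` minimizes by default, which is why the best `x` lands near 2, the minimum of `(x - 2) ** 2`. To show what the trial loop does in the simplest possible form, the sketch below swaps Optuna's default TPE sampler for plain uniform random sampling and uses only the standard library. The function name `random_search` and the seed are illustrative assumptions, not Optuna API.

```python
import random


def objective(x):
    # Same objective as the Optuna cell: minimum value 0 at x = 2.
    return (x - 2) ** 2


def random_search(n_trials, low=-10.0, high=10.0, seed=0):
    """Sample x uniformly n_trials times and keep the lowest objective value."""
    rng = random.Random(seed)
    best_x, best_value = None, float("inf")
    for _ in range(n_trials):
        x = rng.uniform(low, high)
        value = objective(x)
        if value < best_value:
            best_x, best_value = x, value
    return {"x": best_x}, best_value


best_params, best_value = random_search(n_trials=50)
print(best_params, best_value)
```

Optuna's sampler uses the history of earlier trials to propose later ones, so it typically gets closer to the optimum than this baseline in the same number of trials.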

# Model and hyperparameter search on data_train with Optuna
- Goal: find the model and parameters that maximize cross-validated accuracy on data_train
- Candidate models: LogisticRegression, RandomForest, SVC, KNN
- Evaluation: Stratified 5-Fold Accuracy
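
Stratified k-fold keeps the class proportions of the full set in every fold, which matters for small multiclass datasets like this one. A short sketch with hypothetical balanced labels (3 classes of 40 samples each) shows the effect:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical balanced labels: 3 classes x 40 samples.
y = np.repeat([0, 1, 2], 40)
X = np.zeros((len(y), 1))  # feature values do not affect how the folds are formed

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Count how many samples of each class land in each validation fold.
fold_counts = [
    np.bincount(y[val_idx], minlength=3).tolist() for _, val_idx in cv.split(X, y)
]
print(fold_counts)  # every fold holds 8 samples of each class
```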

In [15]:
# --- 1) Imports & helpers
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
import optuna

# --- 2) Prepare X, y from data_train

X = data_train.drop(columns=['species'])
y = data_train['species']

# train/validation split for final holdout test (optional)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# --- 3) Define search space and objective
MODEL_CHOICES = ['logreg', 'rf', 'svc', 'knn']

def make_pipeline(trial, model_key):
    scaler = StandardScaler()

    if model_key == 'logreg':
        C = trial.suggest_float('logreg_C', 1e-3, 1e3, log=True)
        solver = trial.suggest_categorical('logreg_solver', ['lbfgs', 'liblinear'])
        max_iter = trial.suggest_int('logreg_max_iter', 200, 1000)
        clf = LogisticRegression(C=C, solver=solver, max_iter=max_iter, n_jobs=None)
    elif model_key == 'rf':
        n_estimators = trial.suggest_int('rf_n_estimators', 50, 400)
        max_depth = trial.suggest_int('rf_max_depth', 2, 20)
        min_samples_split = trial.suggest_int('rf_min_samples_split', 2, 10)
        min_samples_leaf = trial.suggest_int('rf_min_samples_leaf', 1, 5)
        clf = RandomForestClassifier(
            n_estimators=n_estimators,
            max_depth=max_depth,
            min_samples_split=min_samples_split,
            min_samples_leaf=min_samples_leaf,
            random_state=42,
        )
    elif model_key == 'svc':
        C = trial.suggest_float('svc_C', 1e-3, 1e3, log=True)
        gamma = trial.suggest_float('svc_gamma', 1e-4, 1e0, log=True)
        kernel = trial.suggest_categorical('svc_kernel', ['rbf', 'poly'])
        degree = 3 if kernel != 'poly' else trial.suggest_int('svc_degree', 2, 5)
        clf = SVC(C=C, gamma=gamma, kernel=kernel, degree=degree)
    elif model_key == 'knn':
        n_neighbors = trial.suggest_int('knn_n_neighbors', 1, 30)
        weights = trial.suggest_categorical('knn_weights', ['uniform', 'distance'])
        p = trial.suggest_int('knn_p', 1, 2)  # 1=Manhattan, 2=Euclidean
        clf = KNeighborsClassifier(n_neighbors=n_neighbors, weights=weights, p=p)
    else:
        raise ValueError(f"Unknown model key: {model_key}")

    return Pipeline([
        ('scaler', scaler),
        ('clf', clf)
    ])


def objective(trial):
    model_key = trial.suggest_categorical('model', MODEL_CHOICES)
    pipe = make_pipeline(trial, model_key)

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(pipe, X_tr, y_tr, cv=cv, scoring='accuracy', n_jobs=None)
    return scores.mean()

# --- 4) Run study
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50, show_progress_bar=False)

print('Best value (CV accuracy):', study.best_value)
print('Best params:', study.best_params)

# --- 5) Refit best pipeline on the full training split, evaluate on validation
best_model_key = study.best_params['model']
best_pipe = make_pipeline(study.best_trial, best_model_key)
best_pipe.fit(X_tr, y_tr)

val_pred = best_pipe.predict(X_val)
val_acc = accuracy_score(y_val, val_pred)
print('\nHoldout validation accuracy:', val_acc)
print('\nClassification report:\n', classification_report(y_val, val_pred))

# best_pipe can later be refit on the full data_train and used as the final model.
# best_pipe.fit(X, y)

[I 2025-09-09 07:48:52,631] A new study created in memory with name: no-name-5f55c325-2b31-4df2-88c8-c613bd4653b2
[I 2025-09-09 07:48:52,647] Trial 0 finished with value: 0.9273684210526316 and parameters: {'model': 'knn', 'knn_n_neighbors': 6, 'knn_weights': 'uniform', 'knn_p': 2}. Best is trial 0 with value: 0.9273684210526316.
[I 2025-09-09 07:48:52,663] Trial 1 finished with value: 0.9063157894736842 and parameters: {'model': 'logreg', 'logreg_C': 0.2010603645582193, 'logreg_solver': 'lbfgs', 'logreg_max_iter': 905}. Best is trial 0 with value: 0.9273684210526316.

Best value (CV accuracy): 0.9794736842105263
Best params: {'model': 'logreg', 'logreg_C': 198.55588983646112, 'logreg_solver': 'lbfgs', 'logreg_max_iter': 625}

Holdout validation accuracy: 0.9166666666666666

Classification report:
                  precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00         8
Iris-versicolor       1.00      0.75      0.86         8
 Iris-virginica       0.80      1.00      0.89         8

       accuracy                           0.92        24
      macro avg       0.93      0.92      0.92        24
   weighted avg       0.93      0.92      0.92        24
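
The holdout accuracy can be checked by hand from the report. With only 24 validation samples, each misclassification moves accuracy by about 4.2 percentage points, so the gap between the CV score (0.979) and the holdout score (0.917) amounts to two errors. The sketch below redoes that arithmetic using only the numbers printed above.

```python
# Support and recall per class, copied from the classification report above.
support = {"Iris-setosa": 8, "Iris-versicolor": 8, "Iris-virginica": 8}
recall = {"Iris-setosa": 1.00, "Iris-versicolor": 0.75, "Iris-virginica": 1.00}

# Correct predictions per class = recall x support.
correct = {name: round(recall[name] * support[name]) for name in support}
total_correct = sum(correct.values())
total = sum(support.values())
accuracy = total_correct / total

# Setosa and versicolor both have precision 1.00, so the 2 missed versicolor
# samples must have been predicted as virginica: precision = 8 / (8 + 2).
missed_versicolor = support["Iris-versicolor"] - correct["Iris-versicolor"]
virginica_precision = correct["Iris-virginica"] / (
    correct["Iris-virginica"] + missed_versicolor
)

print(total_correct, total, round(accuracy, 4), virginica_precision)
```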

