## AutoML and Hyperparameter Tuning

### Contents
1. AutoML concepts and importance
    - Introduction to AutoML
        - What is AutoML?
        - Why AutoML is needed and what it offers

    - Major AutoML tools
        - H2O.ai
        - Google AutoML
        - AutoKeras
        - Auto-sklearn

2. Using AutoML tools
    - H2O.ai usage example

In [1]:
!pip install h2o

Collecting h2o
  Downloading h2o-3.46.0.6.tar.gz (265.8 MB)

In [None]:
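import h2o
from h2o.automl import H2OAutoML

# Start (or connect to) a local H2O server
h2o.init()

# Load the data directly into an H2O frame
data = h2o.import_file("./data/titanic.csv")

# Target column: "survived" is the label used in the exercise at the end of this notebook
target = "survived"
# Convert the target to a categorical column so AutoML treats the task as classification
data[target] = data[target].asfactor()

# Split into 80% training and 20% test data
train, test = data.split_frame(ratios=[.8], seed=1234)

# Train AutoML models (time budget: up to one hour)
aml = H2OAutoML(max_runtime_secs=3600, seed=1)
aml.train(y=target, training_frame=train)

# Print the model leaderboard
lb = aml.leaderboard
print(lb.head())

# Predict with the best model
best_model = aml.leader
predictions = best_model.predict(test)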

- AutoKeras usage example

In [4]:
!pip install autokeras

Collecting autokeras
  Downloading autokeras-2.0.0-py3-none-any.whl.metadata (5.8 kB)
Collecting keras-tuner>=1.4.0 (from autokeras)
  Downloading keras_tuner-1.4.7-py3-none-any.whl.metadata (5.4 kB)
Collecting keras-nlp>=0.8.0 (from autokeras)
  Downloading keras_nlp-0.18.1-py3-none-any.whl.metadata (1.2 kB)
Collecting keras-hub==0.18.1 (from keras-nlp>=0.8.0->autokeras)
  Downloading keras_hub-0.18.1-py3-none-any.whl.metadata (7.0 kB)
Collecting kagglehub (from keras-hub==0.18.1->keras-nlp>=0.8.0->autokeras)
  Downloading kagglehub-0.3.6-py3-none-any.whl.metadata (30 kB)
INFO: pip is looking at multiple versions of keras-hub to determine which version is compatible with other requirements. This could take a while.
Collecting keras-nlp>=0.8.0 (from autokeras)
  Downloading keras_nlp-0.18.0-py3-none-any.whl.metadata (1.2 kB)
Collecting keras-hub==0.18.0 (from keras-nlp>=0.8.0->autokeras)
  Downloading keras_hub-0.18.0-py3-none-any.whl.metadata (7.0 kB)
Collecting keras-nlp>=0.8.0 (from

In [5]:
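import autokeras as ak
from keras.datasets import mnist

# Load the data (AutoKeras does not ship its own datasets module)
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Define and train the image classification model
clf = ak.ImageClassifier(max_trials=3)
clf.fit(x_train, y_train, epochs=10)

# Evaluate the model; evaluate() returns the loss followed by the metric values
results = clf.evaluate(x_test, y_test)
print(f'Evaluation results (loss, metrics): {results}')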

ModuleNotFoundError: No module named 'tensorflow.keras.layers.experimental'

3. Hyperparameter tuning techniques
    - Grid search
        - Evaluates every combination of the listed hyperparameter values and keeps the best-scoring one

In [6]:
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Define the hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30]
}

# Create the model and the grid search object
model = RandomForestClassifier()
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3)

# Load the data (example dataset)
X, y = load_iris(return_X_y=True)

# Run the grid search
grid_search.fit(X, y)
print(f'Best parameters found: {grid_search.best_params_}')

Best parameters found: {'max_depth': 10, 'n_estimators': 50}
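What the grid search above does can be sketched by hand. The block below is an illustrative sketch, not how `GridSearchCV` is implemented internally: it uses scikit-learn's `ParameterGrid` to enumerate the same grid and scores each candidate with 3-fold cross-validation. The fixed `random_state=0` is an assumption added here so repeated runs give the same result.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, cross_val_score

X, y = load_iris(return_X_y=True)

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
}

# ParameterGrid expands the dictionary into every combination: 3 x 4 = 12 candidates
candidates = list(ParameterGrid(param_grid))
print(f'Number of candidates: {len(candidates)}')

best_score, best_params = -1.0, None
for params in candidates:
    # random_state is fixed only for reproducibility (an assumption, not in the cell above)
    model = RandomForestClassifier(random_state=0, **params)
    score = cross_val_score(model, X, y, cv=3).mean()
    if score > best_score:
        best_score, best_params = score, params

print(f'Best parameters: {best_params} (mean CV accuracy {best_score:.3f})')
```

With 12 candidates and 3 folds, the search trains 36 models, which is why the cost of grid search grows quickly as more values or parameters are added.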


- Random search
    - Evaluates a randomly sampled subset of hyperparameter combinations and keeps the best-scoring one

In [7]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

# Define the candidate values to sample from
param_dist = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30]
}

# Create the model and the random search object (10 of the 12 combinations are tried)
model = RandomForestClassifier()
random_search = RandomizedSearchCV(estimator=model, param_distributions=param_dist, n_iter=10, cv=3, random_state=42)

# Run the random search (X and y come from the grid search cell above)
random_search.fit(X, y)
print(f'Best parameters found: {random_search.best_params_}')

Best parameters found: {'n_estimators': 50, 'max_depth': 30}
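Random search becomes more useful when values are drawn from distributions rather than from a short list. The sketch below shows, under that assumption, how candidates could be sampled with scikit-learn's `ParameterSampler` and `scipy.stats.randint`. The range 50 to 200 is borrowed from the lists above, and the same dictionary could be passed to `RandomizedSearchCV` as `param_distributions`.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Mix a distribution with a plain list of choices
param_dist = {
    'n_estimators': randint(50, 201),   # integers from 50 to 200 inclusive
    'max_depth': [None, 10, 20, 30],    # list entries are sampled uniformly
}

# Draw 5 candidate combinations; random_state makes the draw repeatable
samples = list(ParameterSampler(param_dist, n_iter=5, random_state=42))
for params in samples:
    print(params)
```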


- Bayesian optimization
    - Uses the results of earlier trials to choose the next hyperparameter combination, improving the search step by step

In [8]:
!pip install ts-scikit-optimize

Collecting ts-scikit-optimize
  Downloading ts_scikit_optimize-0.9.2-py2.py3-none-any.whl.metadata (8.1 kB)
Collecting pyaml>=16.9 (from ts-scikit-optimize)
  Downloading pyaml-25.1.0-py3-none-any.whl.metadata (12 kB)
Downloading ts_scikit_optimize-0.9.2-py2.py3-none-any.whl (100 kB)
Downloading pyaml-25.1.0-py3-none-any.whl (26 kB)
Installing collected packages: pyaml, ts-scikit-optimize
Successfully installed pyaml-25.1.0 ts-scikit-optimize-0.9.2


In [9]:
from skopt import BayesSearchCV
from sklearn.ensemble import RandomForestClassifier

# Define the search space (each tuple is an inclusive integer range)
param_space = {
    'n_estimators': (50, 200),
    'max_depth': (10, 30)
}

# Create the model and the Bayesian search object
model = RandomForestClassifier()
bayes_search = BayesSearchCV(estimator=model, search_spaces=param_space, n_iter=32, cv=3, random_state=42)

# Run the Bayesian search (X and y come from the grid search cell above)
bayes_search.fit(X, y)
print(f'Best parameters found: {bayes_search.best_params_}')

Best parameters found: OrderedDict([('max_depth', 27), ('n_estimators', 182)])
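The idea of using earlier results to pick the next trial can be illustrated without `skopt`. The following is a minimal sketch under stated assumptions, not a description of what `BayesSearchCV` does internally: a Gaussian process is fitted to the scores seen so far, and the next point is the one with the highest expected improvement. The `objective` function is a hypothetical stand-in for a cross-validation score, with a single parameter `x` in the range 0 to 1.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern


def objective(x):
    """Toy score to maximize (hypothetical stand-in for a CV score); peak at x = 0.3."""
    return -(x - 0.3) ** 2


def expected_improvement(candidates, gp, best_score, xi=0.01):
    """Expected improvement of each candidate over the best score seen so far."""
    mean, std = gp.predict(candidates, return_std=True)
    std = np.maximum(std, 1e-9)  # avoid division by zero
    improvement = mean - best_score - xi
    z = improvement / std
    return improvement * norm.cdf(z) + std * norm.pdf(z)


rng = np.random.default_rng(0)
candidates = np.linspace(0.0, 1.0, 201).reshape(-1, 1)

# Start from a few random evaluations
X_seen = rng.uniform(0.0, 1.0, size=(3, 1))
y_seen = objective(X_seen).ravel()

for _ in range(10):
    # Surrogate model: fit a Gaussian process to all results so far
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6,
                                  normalize_y=True, random_state=0)
    gp.fit(X_seen, y_seen)

    # Choose the candidate that promises the largest expected improvement
    ei = expected_improvement(candidates, gp, y_seen.max())
    x_next = candidates[np.argmax(ei)].reshape(1, -1)

    # Evaluate it and add the result to the history
    X_seen = np.vstack([X_seen, x_next])
    y_seen = np.append(y_seen, objective(x_next).ravel())

best_x = X_seen[np.argmax(y_seen), 0]
print(f'Best x found: {best_x:.3f}, score: {y_seen.max():.5f}')
```

Unlike grid or random search, each new trial here depends on all previous ones, so the trials cannot be chosen independently in advance.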


4. Exercise
    - Try modifying the source below!

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from h2o.automl import H2OAutoML
import h2o

# Start the H2O server
h2o.init()

# Load the data
data = pd.read_csv('./data/titanic.csv')
# data = data.drop(['name', 'ticket', 'cabin'], axis=1)
data = pd.get_dummies(data, columns=['sex', 'embarked'], drop_first=True)
# data = data.fillna(data.mean())

# Split the data
X = data.drop('survived', axis=1)
y = data['survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

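# Scale the features, keeping the original column names
scaler = StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X.columns)

# Convert to H2O frames; the target keeps its name "survived"
train = h2o.H2OFrame(pd.concat([X_train, y_train.reset_index(drop=True)], axis=1))
test = h2o.H2OFrame(pd.concat([X_test, y_test.reset_index(drop=True)], axis=1))

# Mark the target as categorical so AutoML treats the task as classification
train["survived"] = train["survived"].asfactor()
test["survived"] = test["survived"].asfactor()

# Train AutoML models
aml = H2OAutoML(max_runtime_secs=3600, seed=1)
aml.train(y="survived", training_frame=train)

# Print the model leaderboard
lb = aml.leaderboard
print(lb.head())

# Predict with the best model
best_model = aml.leader
predictions = best_model.predict(test)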