# CatBoost


https://catboost.ai/


**CatBoost** is a gradient boosting algorithm developed by Yandex. It is a machine learning model **specialized for handling categorical data**.


**Origin of the name:**
**"CatBoost" = "Categorical + Boosting"**






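1. **Automatic handling of categorical data**


   * Encodes categorical features automatically, with no one-hot encoding required
   * Applies an encoding scheme that **prevents target leakage**: the statistic used to encode a row's category is computed only from rows that come before it in a random ordering, so no "future" information (including the row's own target) enters its encoding


2. **Reduced sensitivity to data order**


   * Uses a dedicated **permutation technique** (ordered boosting over random permutations of the training data) to minimize order dependence → improved model stability


3. **Fast training and prediction**


   * **GPU acceleration** is supported → fast even on large datasets
   * Also suitable for real-time prediction


4. **Overfitting protection**


   * **Suppresses overfitting** internally through its tree structure (by default, symmetric trees that apply the same split across an entire level)


5. **Easy hyperparameter tuning**


   * Often performs well with the default settings alone
   * Requires comparatively little tuning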
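
The target-leakage point in item 1 can be illustrated with a small sketch of an ordered target statistic. This is a simplified illustration, not CatBoost's actual implementation: it uses a single random permutation, and the function name `ordered_target_statistic`, the `prior` value, and the smoothing weight `a` are all assumptions chosen for the example.

```python
import random


def ordered_target_statistic(categories, targets, prior=0.5, a=1.0, seed=0):
    """Encode each row using only rows that precede it in a random permutation."""
    n = len(categories)
    order = list(range(n))
    random.Random(seed).shuffle(order)  # random "artificial time" ordering

    sums = {}    # running sum of targets seen so far, per category
    counts = {}  # running number of rows seen so far, per category
    encoded = [0.0] * n
    for idx in order:
        cat = categories[idx]
        s = sums.get(cat, 0.0)
        c = counts.get(cat, 0)
        # The row's own target is NOT included: only earlier rows contribute.
        encoded[idx] = (s + a * prior) / (c + a)
        sums[cat] = s + targets[idx]
        counts[cat] = c + 1
    return encoded


# Same 'region' and 'purchased' values as the toy DataFrame in this notebook
region = ['North', 'South', 'East', 'West', 'North']
purchased = [0, 1, 0, 0, 1]
print(ordered_target_statistic(region, purchased))
```

A category seen for the first time falls back to the prior, and changing a row's own label never changes that row's encoding, which is the property that keeps the target from leaking into the feature.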


In [1]:
from sklearn.metrics import accuracy_score
%pip install catboost

Collecting catboost
  Downloading catboost-1.2.8-cp312-cp312-win_amd64.whl.metadata (1.5 kB)
Collecting plotly (from catboost)
  Downloading plotly-6.5.0-py3-none-any.whl.metadata (8.5 kB)
Collecting narwhals>=1.15.1 (from plotly->catboost)
  Downloading narwhals-2.14.0-py3-none-any.whl.metadata (13 kB)
Downloading catboost-1.2.8-cp312-cp312-win_amd64.whl (102.4 MB)

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
data = {
    'gender': ['Male', 'Female', 'Female', 'Male', 'Female'],
    'region': ['North', 'South', 'East', 'West', 'North'],
    'membership_type': ['Basic', 'Premium', 'Basic', 'Basic', 'Premium'],
    'age': [23, 35, 45, 50, 27],
    'purchased': [0, 1, 0, 0, 1]
}


df = pd.DataFrame(data)
df

| | gender | region | membership_type | age | purchased |
|---|---|---|---|---|---|
| 0 | Male | North | Basic | 23 | 0 |
| 1 | Female | South | Premium | 35 | 1 |
| 2 | Female | East | Basic | 45 | 0 |
| 3 | Male | West | Basic | 50 | 0 |
| 4 | Female | North | Premium | 27 | 1 |


In [6]:
from sklearn.model_selection import train_test_split


X = df.drop(['purchased'], axis=1)
y = df['purchased']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
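
With only five rows, `test_size=0.2` leaves a single row in the test set, so any test accuracy on this toy frame can only be 0.0 or 1.0. The arithmetic can be sketched as follows, assuming the fractional test size is rounded up, as scikit-learn does; `split_sizes` is a hypothetical helper, not a library function.

```python
import math


def split_sizes(n_samples, test_size):
    """Sizes for a fractional test_size: test is rounded up, train gets the rest."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


n_train, n_test = split_sizes(5, 0.2)
print(f'train rows: {n_train}, test rows: {n_test}')
```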



In [11]:
# Pool objects
from catboost import Pool, CatBoostClassifier
from sklearn.metrics import accuracy_score

# Declare the categorical columns
cat_features = ['gender', 'region', 'membership_type']
train_pool = Pool(X_train, y_train, cat_features=cat_features)
test_pool = Pool(X_test, y_test, cat_features=cat_features)

# Train the model
cat_clf = CatBoostClassifier(  # tree-based; encodes the categorical columns internally
    iterations=100,
    depth=3,
    learning_rate=0.1
)

cat_clf.fit(train_pool)

# Evaluate
y_pred = cat_clf.predict(train_pool)
print(f'Train acc : {accuracy_score(y_train, y_pred)}')


y_pred = cat_clf.predict(test_pool)
print(f'Test acc : {accuracy_score(y_test, y_pred)}')

0:	learn: 0.6854845	total: 7.36ms	remaining: 729ms
1:	learn: 0.6817118	total: 14.8ms	remaining: 724ms
2:	learn: 0.6710383	total: 22.2ms	remaining: 719ms
3:	learn: 0.6637140	total: 31.1ms	remaining: 746ms
4:	learn: 0.6534529	total: 39.1ms	remaining: 743ms
5:	learn: 0.6463971	total: 47.9ms	remaining: 750ms
6:	learn: 0.6394490	total: 56ms	remaining: 744ms
7:	learn: 0.6326066	total: 65.4ms	remaining: 752ms
8:	learn: 0.6292634	total: 73.3ms	remaining: 742ms
9:	learn: 0.6197658	total: 82.1ms	remaining: 739ms
10:	learn: 0.6146748	total: 85.1ms	remaining: 688ms
11:	learn: 0.6088513	total: 93.6ms	remaining: 686ms
12:	learn: 0.6024723	total: 102ms	remaining: 682ms
13:	learn: 0.5935986	total: 110ms	remaining: 677ms
14:	learn: 0.5890893	total: 113ms	remaining: 642ms
15:	learn: 0.5760740	total: 119ms	remaining: 624ms
16:	learn: 0.5635144	total: 125ms	remaining: 610ms
17:	learn: 0.5513911	total: 131ms	remaining: 596ms
18:	learn: 0.5396853	total: 136ms	remaining: 581ms
19:	learn: 0.5283793	total: 142

## Adult Income

In [13]:
# Load the data
columns = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
           "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
           "hours-per-week", "native-country", "income"]

data_df = pd.read_csv('data/adult_income.csv', names=columns)  # the file has no header row, so supply the column names
data_df.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [15]:
# Feed the data in as-is, with no separate preprocessing
X = data_df.drop(['income'], axis=1)
y = data_df['income']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42,
                                                    stratify=y)
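
The `stratify=y` argument keeps the class ratio of `income` roughly the same in the training and test sets. A hand-rolled sketch of the idea is shown below. It is not scikit-learn's implementation, and the `stratified_split` helper and the 75/25 label mix are hypothetical choices for illustration only.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) so each class keeps roughly its share in both parts."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(test_size * len(idxs))  # per-class test quota
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx


# Hypothetical imbalanced labels: 75% '<=50K', 25% '>50K'
labels = ['<=50K'] * 750 + ['>50K'] * 250
train_idx, test_idx = stratified_split(labels)


def high_income_share(idxs):
    """Fraction of the given rows labelled '>50K'."""
    counts = Counter(labels[i] for i in idxs)
    return counts['>50K'] / len(idxs)


print(f'train share: {high_income_share(train_idx):.2f}, '
      f'test share: {high_income_share(test_idx):.2f}')
```

Because the test quota is taken separately from each class, the minority class cannot end up over- or under-represented in the test set by chance, which matters when accuracy is the reported metric.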

In [17]:
# Train the model


# CatBoost must be told which columns are categorical; these are the categorical (string) columns
categorical_features = ["workclass", "education", "marital-status", "occupation",
                        "relationship", "race", "sex", "native-country"]


train_pool = Pool(X_train, y_train, cat_features=categorical_features)
test_pool = Pool(X_test, y_test, cat_features=categorical_features)

cat_clf = CatBoostClassifier(
    iterations=100,
    learning_rate=0.1,
    depth=8,
    verbose=True
)

cat_clf.fit(train_pool)


0:	learn: 0.6418488	total: 40.4ms	remaining: 4s
1:	learn: 0.5980582	total: 87.8ms	remaining: 4.3s
2:	learn: 0.5619611	total: 227ms	remaining: 7.34s
3:	learn: 0.5303088	total: 311ms	remaining: 7.45s
4:	learn: 0.5035156	total: 361ms	remaining: 6.86s
5:	learn: 0.4816011	total: 405ms	remaining: 6.35s
6:	learn: 0.4623558	total: 452ms	remaining: 6s
7:	learn: 0.4450389	total: 501ms	remaining: 5.76s
8:	learn: 0.4301797	total: 553ms	remaining: 5.59s
9:	learn: 0.4167893	total: 602ms	remaining: 5.42s
10:	learn: 0.4048157	total: 647ms	remaining: 5.23s
11:	learn: 0.3951155	total: 689ms	remaining: 5.05s
12:	learn: 0.3858602	total: 730ms	remaining: 4.89s
13:	learn: 0.3779971	total: 776ms	remaining: 4.76s
14:	learn: 0.3701574	total: 821ms	remaining: 4.65s
15:	learn: 0.3633718	total: 871ms	remaining: 4.57s
16:	learn: 0.3568800	total: 912ms	remaining: 4.45s
17:	learn: 0.3515189	total: 952ms	remaining: 4.33s
18:	learn: 0.3465581	total: 1s	remaining: 4.27s
19:	learn: 0.3420442	total: 1.05s	remaining: 4.21

<catboost.core.CatBoostClassifier at 0x1878ed86fc0>

In [18]:
# Evaluate
print(f'Train accuracy : {cat_clf.score(X_train, y_train):.4f}')
print(f'Test accuracy : {cat_clf.score(X_test, y_test):.4f}')  # score the held-out test set

Train accuracy : 0.8730
