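## Q. [Marketing] Automobile Market Segmentation
- An automobile company has divided its market into four segments to build a new strategy.
- Using the existing customer classification data, predict which segment each new customer belongs to.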


- Target to predict (y): "Segmentation" (1, 2, 3, 4)
- Metric: Macro F1-score
- Data: train.csv, test.csv
- Submission format:
~~~
ID,Segmentation
458989,1
458994,2
459000,3
459003,4
~~~
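
Since submissions are scored with Macro F1, it may help to see how that number is built: an F1 score is computed for each of the four segments separately, then averaged with equal weight, so a rarely predicted segment counts as much as a common one. The following is a minimal sketch in plain Python; the function name `macro_f1` and the toy labels are illustrative assumptions, not part of the task files.

```python
def macro_f1(y_true, y_pred, labels=(1, 2, 3, 4)):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted samples contributes 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels (made up for illustration): segment 4 is never predicted.
y_true = [1, 2, 3, 4, 1, 2]
y_pred = [1, 2, 3, 3, 1, 1]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

In this toy case the missed segment 4 contributes an F1 of 0, which pulls the average down even though most individual predictions are correct.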

### Submission notes
- Use the code below, replacing the prediction variable and the exam ID in the file name with your own
- pd.DataFrame({'ID': test.ID, 'Segmentation': pred}).to_csv('003000000.csv', index=False)
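
Before writing the file, a quick sanity check on the predictions can catch a mismatched length or an out-of-range label. The sketch below assumes a hypothetical helper `make_submission` (not part of the exam template) and uses the four IDs from the submission format example above.

```python
import pandas as pd


def make_submission(ids, pred, valid_labels=(1, 2, 3, 4)):
    """Build the ID/Segmentation frame after basic sanity checks."""
    if len(ids) != len(pred):
        raise ValueError("ids and pred must have the same length")
    unexpected = set(pred) - set(valid_labels)
    if unexpected:
        raise ValueError(f"unexpected labels: {sorted(unexpected)}")
    return pd.DataFrame({"ID": list(ids), "Segmentation": list(pred)})


submit = make_submission([458989, 458994, 459000, 459003], [1, 2, 3, 4])
# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = submit.to_csv(index=False)
print(csv_text)
```

Passing a file name to `to_csv`, as in the one-liner above, writes the same content to disk.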

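### Notebook levels
- basic: use numeric features only -> train and predict on the test data
- intermediate: also use categorical features -> train and predict on the test data
- advanced: train with cross-validation (model evaluation) -> hyperparameter tuning -> predict on the test data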
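
The advanced level (cross-validation followed by hyperparameter tuning) could be sketched roughly as follows. This is an illustration under stated assumptions: the data is synthetic (generated with `make_classification` as a stand-in for the encoded training set), and the grid values and fold count are kept small for speed rather than taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic 4-class stand-in for the encoded training data (assumption).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5,
    n_classes=4, random_state=2022,
)
y = y + 1  # shift labels from 0..3 to 1..4 to mirror "Segmentation"

# Small illustrative grid; a real search would cover more values.
param_grid = {"max_depth": [4, 6], "n_estimators": [20, 50]}
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=2022)

search = GridSearchCV(
    RandomForestClassifier(random_state=2022),
    param_grid=param_grid,
    scoring="f1_macro",  # same metric as the task
    cv=cv,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 4))
```

Scoring each candidate on several stratified folds tends to give a steadier comparison than a single hold-out split, at the cost of extra training time.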

In [47]:
import pandas as pd
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
# print(train.shape, test.shape)
# (6665, 11) (2154, 10)
# print(train.head())
#        ID  Gender Ever_Married  Age Graduated  Profession  Work_Experience  \
# 0  462809    Male           No   22        No  Healthcare              1.0   
# 1  466315  Female          Yes   67       Yes    Engineer              1.0   
# 2  461735    Male          Yes   67       Yes      Lawyer              0.0   
# 3  461319    Male          Yes   56        No      Artist              0.0   
# 4  460156    Male           No   32       Yes  Healthcare              1.0   

#   Spending_Score  Family_Size  Var_1  Segmentation  
# 0            Low          4.0  Cat_4             4  
# 1            Low          1.0  Cat_6             2  
# 2           High          2.0  Cat_6             2  
# 3        Average          2.0  Cat_6             3  
# 4            Low          3.0  Cat_6             3  
# print(test.head())
#        ID  Gender Ever_Married  Age Graduated  Profession  Work_Experience  \
# 0  458989  Female          Yes   36       Yes    Engineer              0.0   
# 1  458994    Male          Yes   37       Yes  Healthcare              8.0   
# 2  459000    Male          Yes   59        No   Executive             11.0   
# 3  459003    Male          Yes   47       Yes      Doctor              0.0   
# 4  459005    Male          Yes   61       Yes      Doctor              5.0   

#   Spending_Score  Family_Size  Var_1  
# 0            Low          1.0  Cat_6  
# 1        Average          4.0  Cat_6  
# 2           High          2.0  Cat_6  
# 3           High          5.0  Cat_4  
# 4            Low          3.0  Cat_6  
# print(train.describe())
#                  ID          Age  Work_Experience  Family_Size  Segmentation
# count    6665.00000  6665.000000      6665.000000  6665.000000   6665.000000
# mean   463519.84096    43.536084         2.629107     2.841110      2.542836
# std      2566.43174    16.524054         3.405365     1.524743      1.122723
# min    458982.00000    18.000000         0.000000     1.000000      1.000000
# 25%    461349.00000    31.000000         0.000000     2.000000      2.000000
# 50%    463575.00000    41.000000         1.000000     2.000000      3.000000
# 75%    465741.00000    53.000000         4.000000     4.000000      4.000000
# max    467974.00000    89.000000        14.000000     9.000000      4.000000
# print(test.describe())
#                   ID          Age  Work_Experience  Family_Size
# count    2154.000000  2154.000000      2154.000000  2154.000000
# mean   463496.744661    43.461467         2.551532     2.837047
# std      2591.465156    16.761895         3.344917     1.566872
# min    458989.000000    18.000000         0.000000     1.000000
# 25%    461282.250000    30.000000         0.000000     2.000000
# 50%    463535.000000    41.000000         1.000000     2.000000
# 75%    465705.750000    52.000000         4.000000     4.000000
# max    467968.000000    89.000000        14.000000     9.000000
# print(train.describe(include="O"))
#        Gender Ever_Married Graduated Profession Spending_Score  Var_1
# count    6665         6665      6665       6665           6665   6665
# unique      2            2         2          9              3      7
# top      Male          Yes       Yes     Artist            Low  Cat_6
# freq     3677         3944      4249       2192           3999   4476
# print(test.describe(include="O"))
#        Gender Ever_Married Graduated Profession Spending_Score  Var_1
# count    2154         2154      2154       2154           2154   2154
# unique      2            2         2          9              3      7
# top      Male          Yes       Yes     Artist            Low  Cat_6
# freq     1184         1272      1345        696           1326   1421

# print(train.isnull().sum()) # no missing values
# print(test.isnull().sum()) # no missing values
# Separate the target from the features
target = train.pop('Segmentation')

# Drop the ID column (keep the test IDs for the submission file)
train.pop('ID')
test_id = test.pop('ID')
# 1. baseline f1:  0.38464899324816704
# n_cols = train.select_dtypes(exclude="O").columns

# 2. Label-encode the categorical columns, f1:  0.4919566836160607
n_cols = train.select_dtypes(exclude="O").columns
o_cols = train.select_dtypes(include="O").columns
from sklearn.preprocessing import LabelEncoder

for col in o_cols:
    le = LabelEncoder()
    train[col] = le.fit_transform(train[col])
    test[col] = le.transform(test[col])

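# 3. Standardize the numeric columns, f1:  0.4922622464055343
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
train[n_cols] = scaler.fit_transform(train[n_cols])
# Reuse the statistics fitted on train; refitting on test would scale it differently
test[n_cols] = scaler.transform(test[n_cols])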

from sklearn.model_selection import train_test_split
X_tr, X_val, y_tr, y_val = train_test_split(train, target, test_size=0.2, random_state=2022)

# 4. Hyperparameter tuning (validation f1 for each setting)
# max_depth=4, n_estimators=200, f1:  0.5254835400517295
# max_depth=4, n_estimators=300, f1:  0.5212493946728991
# max_depth=4, n_estimators=400, f1:  0.5210808561070501

# max_depth=5, n_estimators=200, f1:  0.5256360838785049
# max_depth=5, n_estimators=300, f1:  0.5261097131007827
# max_depth=5, n_estimators=400, f1:  0.524978296109693

# max_depth=6, n_estimators=200, f1:  0.5252449426030814
# max_depth=6, n_estimators=300, f1:  0.5240230046865701
# max_depth=6, n_estimators=400, f1:  0.5239436931351763

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
model = RandomForestClassifier(max_depth=6, n_estimators=300, random_state=2022)
model.fit(X_tr, y_tr)
pred = model.predict(X_val)
print('f1: ', f1_score(y_val, pred, average='macro'))
pred = model.predict(test)
submit = pd.DataFrame(
    {
        'ID': test_id,
        'Segmentation': pred
    }
)
submit.to_csv('작업형2제출.csv', index=False)

f1:  0.5292369591406685
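
The cell above fits the encoders and the scaler on the whole training frame before the validation split, and `LabelEncoder.transform` raises an error if the test data contains a category it has not seen. One possible alternative, shown here as a sketch and not as the notebook's method, is to wrap the preprocessing in a `Pipeline` so it is fitted only on the data passed to `fit`. It swaps in `OrdinalEncoder` for `LabelEncoder` and uses only the first five train and test rows printed above, restricted to a subset of columns, so the model itself is not meaningful.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

cat_cols = ["Gender", "Profession", "Spending_Score"]
num_cols = ["Age", "Family_Size"]

# First five rows of train/test as printed in the notebook (column subset).
train_df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Male", "Male"],
    "Profession": ["Healthcare", "Engineer", "Lawyer", "Artist", "Healthcare"],
    "Spending_Score": ["Low", "Low", "High", "Average", "Low"],
    "Age": [22, 67, 67, 56, 32],
    "Family_Size": [4.0, 1.0, 2.0, 2.0, 3.0],
})
y_train = [4, 2, 2, 3, 3]
test_df = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Male", "Male"],
    "Profession": ["Engineer", "Healthcare", "Executive", "Doctor", "Doctor"],
    "Spending_Score": ["Low", "Average", "High", "High", "Low"],
    "Age": [36, 37, 59, 47, 61],
    "Family_Size": [1.0, 4.0, 2.0, 5.0, 3.0],
})

preprocess = ColumnTransformer([
    # Categories unseen during fit (e.g. "Doctor" here) are encoded as -1.
    ("cat", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), cat_cols),
    ("num", StandardScaler(), num_cols),
])
pipe = Pipeline([
    ("prep", preprocess),
    ("model", RandomForestClassifier(n_estimators=50, random_state=2022)),
])

pipe.fit(train_df, y_train)      # encoder and scaler are fitted on train only
pred = pipe.predict(test_df)     # test rows are only transformed, never fitted
print(len(pred))
```

Because the whole pipeline is a single estimator, it can also be passed to cross-validation utilities so that each fold refits the preprocessing on its own training part.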
