### Solution video: https://youtu.be/diP0q1YzVFg

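## Q. [Marketing] Automobile Market Segmentation
- An automobile company has divided its market into four segments to plan a new strategy.
- Using the existing customer segmentation data, predict which segment each new customer belongs to.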


- Target (y): "Segmentation" (1, 2, 3, 4)
- Evaluation metric: Macro F1-score
- Data: train.csv, test.csv
- Submission format:
~~~
ID,Segmentation
458989,1
458994,2
459000,3
459003,4
~~~
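
Macro F1 is the unweighted mean of the per-class F1 scores, so each of the four segments counts equally regardless of how many customers it contains. The sketch below illustrates this with scikit-learn's `f1_score`; the labels are made up for illustration and are not taken from train.csv.

```python
from sklearn.metrics import f1_score

# Toy labels (made up for illustration; not taken from train.csv)
y_true = [1, 1, 2, 2, 3, 3, 4, 4]
y_pred = [1, 1, 2, 3, 3, 3, 4, 1]

# Per-class F1, in label order 1..4
per_class = f1_score(y_true, y_pred, average=None, labels=[1, 2, 3, 4])

# Macro F1: the plain (unweighted) mean of the per-class scores
macro = f1_score(y_true, y_pred, average="macro")

print(per_class)
print(macro)
```

Because every class has the same weight, a model that ignores one segment is penalized even when that segment is small.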

### Submission notes
- Adapt the code below by replacing the prediction variable and the exam ID with your own.
- pd.DataFrame({'ID': test.ID, 'Segmentation': pred}).to_csv('003000000.csv', index=False)

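### Notebook levels
- basic: use numeric features only -> train and predict on the test data
- intermediate: also use categorical features -> train and predict on the test data
- advanced: train with cross-validation (model evaluation) -> hyperparameter tuning -> predict on the test data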

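### Scoring for practice
- Save the final file as "submission.csv" instead of "<exam ID>.csv", then click the "submit" button at the bottom of the right-hand menu to see your score and rank on the leaderboard.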
- pd.DataFrame({'ID': test.ID, 'Segmentation': pred}).to_csv('submission.csv', index=False)


In [64]:
# Import libraries
import pandas as pd

In [94]:
# Load the data
train = pd.read_csv("../input/big-data-analytics-certification-kr-2022/train.csv")
test = pd.read_csv("../input/big-data-analytics-certification-kr-2022/test.csv")

# 🍭 basic level 🍭  
- Goal: submit something, even if it uses only the numeric features!!!👍

In [95]:
train.head(3)

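| | ID | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Var_1 | Segmentation |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 462809 | Male | No | 22 | No | Healthcare | 1.0 | Low | 4.0 | Cat_4 | 4 |
| 1 | 466315 | Female | Yes | 67 | Yes | Engineer | 1.0 | Low | 1.0 | Cat_6 | 2 |
| 2 | 461735 | Male | Yes | 67 | Yes | Lawyer | 0.0 | High | 2.0 | Cat_6 | 2 |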


In [96]:
test.head(3)

| | ID | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Var_1 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 458989 | Female | Yes | 36 | Yes | Engineer | 0.0 | Low | 1.0 | Cat_6 |
| 1 | 458994 | Male | Yes | 37 | Yes | Healthcare | 8.0 | Average | 4.0 | Cat_6 |
| 2 | 459000 | Male | Yes | 59 | No | Executive | 11.0 | High | 2.0 | Cat_6 |


In [97]:
test.isnull().sum()

ID                 0
Gender             0
Ever_Married       0
Age                0
Graduated          0
Profession         0
Work_Experience    0
Spending_Score     0
Family_Size        0
Var_1              0
dtype: int64

In [98]:
train.isnull().sum()

ID                 0
Gender             0
Ever_Married       0
Age                0
Graduated          0
Profession         0
Work_Experience    0
Spending_Score     0
Family_Size        0
Var_1              0
Segmentation       0
dtype: int64

In [99]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6665 entries, 0 to 6664
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ID               6665 non-null   int64  
 1   Gender           6665 non-null   object 
 2   Ever_Married     6665 non-null   object 
 3   Age              6665 non-null   int64  
 4   Graduated        6665 non-null   object 
 5   Profession       6665 non-null   object 
 6   Work_Experience  6665 non-null   float64
 7   Spending_Score   6665 non-null   object 
 8   Family_Size      6665 non-null   float64
 9   Var_1            6665 non-null   object 
 10  Segmentation     6665 non-null   int64  
dtypes: float64(2), int64(3), object(6)
memory usage: 572.9+ KB


In [100]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2154 entries, 0 to 2153
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ID               2154 non-null   int64  
 1   Gender           2154 non-null   object 
 2   Ever_Married     2154 non-null   object 
 3   Age              2154 non-null   int64  
 4   Graduated        2154 non-null   object 
 5   Profession       2154 non-null   object 
 6   Work_Experience  2154 non-null   float64
 7   Spending_Score   2154 non-null   object 
 8   Family_Size      2154 non-null   float64
 9   Var_1            2154 non-null   object 
dtypes: float64(2), int64(2), object(6)
memory usage: 168.4+ KB


In [101]:
train.describe()

| | ID | Age | Work_Experience | Family_Size | Segmentation |
|---|---|---|---|---|---|
| count | 6665.0 | 6665.0 | 6665.0 | 6665.0 | 6665.0 |
| mean | 463519.84096 | 43.536084 | 2.629107 | 2.84111 | 2.542836 |
| std | 2566.43174 | 16.524054 | 3.405365 | 1.524743 | 1.122723 |
| min | 458982.0 | 18.0 | 0.0 | 1.0 | 1.0 |
| 25% | 461349.0 | 31.0 | 0.0 | 2.0 | 2.0 |
| 50% | 463575.0 | 41.0 | 1.0 | 2.0 | 3.0 |
| 75% | 465741.0 | 53.0 | 4.0 | 4.0 | 4.0 |
| max | 467974.0 | 89.0 | 14.0 | 9.0 | 4.0 |


In [102]:
test.describe()

| | ID | Age | Work_Experience | Family_Size |
|---|---|---|---|---|
| count | 2154.0 | 2154.0 | 2154.0 | 2154.0 |
| mean | 463496.744661 | 43.461467 | 2.551532 | 2.837047 |
| std | 2591.465156 | 16.761895 | 3.344917 | 1.566872 |
| min | 458989.0 | 18.0 | 0.0 | 1.0 |
| 25% | 461282.25 | 30.0 | 0.0 | 2.0 |
| 50% | 463535.0 | 41.0 | 1.0 | 2.0 |
| 75% | 465705.75 | 52.0 | 4.0 | 4.0 |
| max | 467968.0 | 89.0 | 14.0 | 9.0 |


In [103]:
target = train.pop('Segmentation')

In [104]:
from sklearn.preprocessing import LabelEncoder

In [105]:
train_cols = train.select_dtypes(include = 'object').columns
test_cols = test.select_dtypes(include = 'object').columns

In [106]:
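for col in train_cols:
    # Fit on train only and reuse the same mapping for test, so that
    # each category gets the same code in both frames.
    # Note: transform raises an error if test has a category unseen in train.
    encoder = LabelEncoder()
    train[col] = encoder.fit_transform(train[col])
    test[col] = encoder.transform(test[col])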
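
Integer codes are only comparable between train and test when both frames share one category-to-code mapping. As a minimal sketch on made-up toy data (the frames `toy_train` and `toy_test` are hypothetical, not the competition data), an encoder can be fitted once on the training column and then reused on the test column. `OrdinalEncoder` is swapped in here instead of `LabelEncoder` because it can map categories that were not seen during fitting to a sentinel value.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frames (hypothetical values, not the competition data)
toy_train = pd.DataFrame({"Profession": ["Artist", "Doctor", "Engineer"]})
toy_test = pd.DataFrame({"Profession": ["Engineer", "Artist", "Lawyer"]})

# Categories unseen during fit are mapped to -1 instead of raising an error
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Fit on train only, then apply the same mapping to test
train_codes = enc.fit_transform(toy_train[["Profession"]]).ravel()
test_codes = enc.transform(toy_test[["Profession"]]).ravel()

print(train_codes)
print(test_codes)
```

"Engineer" receives the same code in both arrays, and "Lawyer", which never appears in the toy training column, becomes -1.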

In [107]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

In [108]:
train['Age'] = scaler.fit_transform(train[['Age']])
test['Age'] = scaler.transform(test[['Age']])
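
`MinMaxScaler` rescales each value with (x - min) / (max - min), using the minimum and maximum learned from the data it was fitted on. The sketch below checks that formula by hand for a few ages, using the Age range reported by `train.describe()` (min 18, max 89); the name `scaler_demo` is introduced only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Sample ages spanning the range reported by train.describe(): min 18, max 89
ages = np.array([[18], [22], [67], [89]])

scaler_demo = MinMaxScaler()
scaled = scaler_demo.fit_transform(ages).ravel()

# Manual formula: (x - min) / (max - min)
manual = (ages.ravel() - 18) / (89 - 18)

print(scaled)
print(manual)
```

An age of 22 maps to 4/71 and an age of 67 maps to 49/71, which agree with the scaled Age values shown by `train.head()` later in the notebook.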

In [109]:
from sklearn.model_selection import train_test_split

In [110]:
X_tr, X_val, y_tr, y_val = train_test_split(train, target, test_size = 0.15, random_state = 42)
X_tr.shape, X_val.shape, y_tr.shape, y_val.shape

((5665, 10), (1000, 10), (5665,), (1000,))

In [111]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

In [112]:
from sklearn.metrics import f1_score
#help(f1_score)
#f1_score(y_true, y_pred, average='macro')

In [113]:
train.head()

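| | ID | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Var_1 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 462809 | 1 | 0 | 0.056338 | 0 | 5 | 1.0 | 2 | 4.0 | 3 |
| 1 | 466315 | 0 | 1 | 0.690141 | 1 | 2 | 1.0 | 2 | 1.0 | 5 |
| 2 | 461735 | 1 | 1 | 0.690141 | 1 | 7 | 0.0 | 1 | 2.0 | 5 |
| 3 | 461319 | 1 | 1 | 0.535211 | 0 | 0 | 0.0 | 0 | 2.0 | 5 |
| 4 | 460156 | 1 | 0 | 0.197183 | 1 | 5 | 1.0 | 2 | 3.0 | 5 |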


In [114]:
model = LogisticRegression()
model.fit(X_tr, y_tr)
pred = model.predict(X_val)
a1 = f1_score(y_val, pred, average = 'macro')
a1

0.10876369327073553
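
The logistic regression score is much lower than the tree-based scores. One possible contributor, stated here as an assumption rather than a verified diagnosis, is that features on very different scales (for example the raw ID next to the scaled Age) make the default solver struggle. The sketch below shows a common remedy, standardizing all features inside a pipeline and raising `max_iter`, on synthetic data; `X_demo`, `y_demo`, and `lr_pipe` are hypothetical names, and the resulting score says nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 4-class data standing in for the real features (assumption)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5,
    n_classes=4, random_state=42,
)
Xd_tr, Xd_val, yd_tr, yd_val = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=42
)

# Standardize every feature, then fit logistic regression with more iterations
lr_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
lr_pipe.fit(Xd_tr, yd_tr)

lr_score = f1_score(yd_val, lr_pipe.predict(Xd_val), average="macro")
print(lr_score)
```

Keeping the scaler inside the pipeline means it is fitted only on the training split, so no validation statistics leak into training.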

In [115]:
model = RandomForestClassifier()
model.fit(X_tr, y_tr)
pred = model.predict(X_val)
a2 = f1_score(y_val, pred, average = 'macro')
a2

0.5087812458152186

In [116]:
model = DecisionTreeClassifier()
model.fit(X_tr, y_tr)
pred = model.predict(X_val)
a3 = f1_score(y_val, pred, average = 'macro')
a3

0.415837167118317
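
A single 15% hold-out split gives a fairly noisy comparison between models. As a hedged sketch of a steadier alternative, `cross_val_score` with `scoring="f1_macro"` averages the competition metric over several folds. The data below is synthetic and the names `X_cv`, `y_cv`, and `candidates` are hypothetical; in the notebook the same call could take `train` and `target` instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class data standing in for train/target (assumption)
X_cv, y_cv = make_classification(
    n_samples=400, n_features=8, n_informative=5,
    n_classes=4, random_state=0,
)

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

# Mean macro F1 over 5 folds for each candidate model
cv_means = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_cv, y_cv, cv=5, scoring="f1_macro")
    cv_means[name] = scores.mean()

print(cv_means)
```

Fixing `random_state` on each model also makes the comparison reproducible between runs.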

In [117]:
test

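| | ID | Gender | Ever_Married | Age | Graduated | Profession | Work_Experience | Spending_Score | Family_Size | Var_1 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 458989 | 0 | 1 | 0.253521 | 1 | 2 | 0.0 | 2 | 1.0 | 5 |
| 1 | 458994 | 1 | 1 | 0.267606 | 1 | 5 | 8.0 | 0 | 4.0 | 5 |
| 2 | 459000 | 1 | 1 | 0.577465 | 0 | 4 | 11.0 | 1 | 2.0 | 5 |
| 3 | 459003 | 1 | 1 | 0.408451 | 1 | 1 | 0.0 | 1 | 5.0 | 3 |
| 4 | 459005 | 1 | 1 | 0.605634 | 1 | 1 | 5.0 | 2 | 3.0 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2149 | 467950 | 0 | 0 | 0.239437 | 1 | 3 | 1.0 | 2 | 2.0 | 5 |
| 2150 | 467954 | 1 | 0 | 0.154930 | 0 | 5 | 9.0 | 2 | 4.0 | 5 |
| 2151 | 467958 | 0 | 0 | 0.239437 | 1 | 1 | 1.0 | 2 | 1.0 | 5 |
| 2152 | 467961 | 1 | 1 | 0.408451 | 1 | 4 | 1.0 | 1 | 5.0 | 3 |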


In [118]:
model = RandomForestClassifier()
model.fit(train, target)
pred = model.predict(test)
pred

array([2, 3, 3, ..., 2, 2, 4])

In [122]:
sub = pd.DataFrame({'ID' : test['ID'],
                   'Segmentation' : pred})
sub

| | ID | Segmentation |
|---|---|---|
| 0 | 458989 | 2 |
| 1 | 458994 | 3 |
| 2 | 459000 | 3 |
| 3 | 459003 | 3 |
| 4 | 459005 | 1 |
| ... | ... | ... |
| 2149 | 467950 | 4 |
| 2150 | 467954 | 4 |
| 2151 | 467958 | 2 |
| 2152 | 467961 | 2 |


In [125]:
# Save the submission file; replace 수험번호 ("exam ID") with your own exam ID
sub.to_csv(r'C:\Users\hanji\수험번호.csv', index = False)
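
Before submitting, it can help to read the saved file back and confirm that it matches the required format: exactly the columns `ID` and `Segmentation`, no index column, and labels in 1 to 4. The sketch below does this with a temporary file and made-up predictions; `sub_demo` and `check` are hypothetical names, and the IDs are copied from the sample submission format.

```python
import os
import tempfile

import pandas as pd

# Hypothetical predictions; IDs copied from the sample submission format
sub_demo = pd.DataFrame({
    "ID": [458989, 458994, 459000, 459003],
    "Segmentation": [1, 2, 3, 4],
})

# Write without the index, then read the file back to inspect it
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "submission.csv")
    sub_demo.to_csv(path, index=False)
    check = pd.read_csv(path)

# The file should contain only the two expected columns and valid labels
print(list(check.columns))
print(check["Segmentation"].isin([1, 2, 3, 4]).all())
```

If `index=False` is forgotten, the read-back frame gains an extra `Unnamed: 0` column, which this check makes visible.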