
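# Model the data with all three algorithms (SVM, XGBoost, RandomForest) and select the most suitable one. Explain why it was selected, the model's limitations, and what could be improved, and describe points to watch for in practice.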


##  ::::: Preprocessing :::::


1. Impute missing values
```
from sklearn.impute import KNNImputer
# Extract only the numeric columns (the ones that contain missing values)
KNN_data = df.drop(columns = ['school', 'sex', 'paid', 'activities'])
# Fit the imputer and fill the missing values
model = KNNImputer()
df_filled = model.fit_transform(KNN_data)
df_filled = pd.DataFrame(df_filled, columns = KNN_data.columns)
df[KNN_data.columns] = df_filled
```

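2. Convert discrete (categorical) variables to dummy variables, so that the category codes do not imply an order or magnitude relationship between categories.
```
import pandas as pd
df = pd.get_dummies(data=df, columns = ['day'], drop_first = True)
```

> - Discrete variable

>> No | day
>> ---|------
>> 1  | Mon
>> 2  | Tue
>> 3  | Wed


> - Dummy variables

>> No | Mon | Tue | Wed
>> ---|-----|-----|----
>> 1  | 1   | 0   | 0
>> 2  | 0   | 1   | 0
>> 3  | 0   | 0   | 1

> - With `drop_first = True`, the column for the first category (Mon here) is dropped; a row with 0 in every remaining column represents that category.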


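3. Separate the dependent variable ($y$) from the independent variables ($X$)

>> $y = f(X)$

> - Dependent variable: the outcome variable to be predicted, which is influenced by the independent variables. The output (y) consists of a single variable.
> - Independent variable: also called a causal or explanatory variable; it influences the dependent variable and is assumed not to be influenced by other variables. The input (X) consists of one or more variables.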

4. Split into training and test data
```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1004)
```

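5. Scale the independent variables (X)
```
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit the scaler on the training data only, then reuse it for the test data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```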
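
The scaling step can be illustrated on a tiny example. This is only a sketch: the arrays below are hypothetical toy values standing in for `X_train` and `X_test`, not the student data. It shows that the test rows are standardized with the mean and standard deviation learned from the training rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for X_train / X_test
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 40.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from the training rows
X_test_scaled = scaler.transform(X_test)        # reuses the training mean/std

print(scaler.mean_)      # per-column training means
print(X_test_scaled)     # test row expressed in training-set units
```

Calling `fit_transform` on the test data instead would re-estimate the statistics from the test rows, so the two sets would no longer share the same scale.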

## ::::: Importing the three model modules :::::
```
from sklearn.svm import SVC
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
```

In [32]:
import numpy as np
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/jhay20ng/UpData/main/student_data.csv', sep=',')
df.head()

Unnamed: 0,school,sex,paid,activities,famrel,freetime,goout,Dalc,Walc,health,absences,grade,G1,G2
0,GP,F,no,no,4.0,3.0,4.0,1.0,1.0,3.0,6.0,6,5,6
1,GP,F,no,no,5.0,3.0,3.0,1.0,1.0,3.0,4.0,5,5,5
2,GP,F,yes,no,4.0,3.0,2.0,2.0,3.0,3.0,10.0,8,7,8
3,GP,F,yes,yes,3.0,2.0,2.0,1.0,1.0,5.0,2.0,15,15,14
4,GP,F,yes,no,4.0,3.0,2.0,1.0,2.0,5.0,4.0,9,6,10


In [33]:
df.shape

(395, 14)

In [34]:
df.info()  # paid: paid classes, famrel: family relationships, goout: going out, Dalc: workday alcohol consumption, Walc: weekend alcohol consumption

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   school      395 non-null    object 
 1   sex         395 non-null    object 
 2   paid        395 non-null    object 
 3   activities  395 non-null    object 
 4   famrel      394 non-null    float64
 5   freetime    393 non-null    float64
 6   goout       392 non-null    float64
 7   Dalc        391 non-null    float64
 8   Walc        393 non-null    float64
 9   health      391 non-null    float64
 10  absences    392 non-null    float64
 11  grade       395 non-null    int64  
 12  G1          395 non-null    int64  
 13  G2          395 non-null    int64  
dtypes: float64(7), int64(3), object(4)
memory usage: 43.3+ KB


In [35]:
df.describe()

Unnamed: 0,famrel,freetime,goout,Dalc,Walc,health,absences,grade,G1,G2
count,394.0,393.0,392.0,391.0,393.0,391.0,392.0,395.0,395.0,395.0
mean,3.944162,3.239186,3.114796,1.470588,2.284987,3.56266,5.67602,10.660759,10.908861,10.713924
std,0.897794,0.994265,1.112397,0.873266,1.287778,1.386949,8.013393,3.71939,3.319195,3.761505
min,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,3.0,0.0
25%,4.0,3.0,2.0,1.0,1.0,3.0,0.0,8.0,8.0,9.0
50%,4.0,3.0,3.0,1.0,2.0,4.0,4.0,11.0,11.0,11.0
75%,5.0,4.0,4.0,2.0,3.0,5.0,8.0,13.0,13.0,13.0
max,5.0,5.0,5.0,5.0,5.0,5.0,75.0,19.0,19.0,19.0


In [37]:
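# Impute missing values
from sklearn.impute import KNNImputer

# Extract only the numeric columns (the ones that contain missing values)
KNN_data = df.drop(columns = ['school', 'sex', 'paid', 'activities'])

# Fit the imputer and fill the missing values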
model = KNNImputer()
df_filled = model.fit_transform(KNN_data)
df_filled = pd.DataFrame(df_filled, columns = KNN_data.columns)
df[KNN_data.columns] = df_filled
df

Unnamed: 0,school,sex,paid,activities,famrel,freetime,goout,Dalc,Walc,health,absences,grade,G1,G2
0,GP,F,no,no,4.0,3.0,4.0,1.0,1.0,3.0,6.0,6.0,5.0,6.0
1,GP,F,no,no,5.0,3.0,3.0,1.0,1.0,3.0,4.0,5.0,5.0,5.0
2,GP,F,yes,no,4.0,3.0,2.0,2.0,3.0,3.0,10.0,8.0,7.0,8.0
3,GP,F,yes,yes,3.0,2.0,2.0,1.0,1.0,5.0,2.0,15.0,15.0,14.0
4,GP,F,yes,no,4.0,3.0,2.0,1.0,2.0,5.0,4.0,9.0,6.0,10.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
390,MS,M,yes,no,5.0,5.0,4.0,4.0,5.0,4.0,11.0,9.0,9.0,9.0
391,MS,M,no,no,2.0,4.0,5.0,3.0,4.0,2.0,3.0,15.0,14.0,16.0
392,MS,M,no,no,5.0,5.0,3.0,3.0,3.0,3.0,3.0,8.0,10.0,8.0
393,MS,M,no,no,4.0,4.0,1.0,3.0,4.0,5.0,0.0,11.0,11.0,12.0
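
To make the imputation step concrete, here is a minimal sketch on a hypothetical four-row frame (the values are invented for illustration and only the column names echo the dataset). With `n_neighbors=1`, the missing `absences` value is copied from the row that is closest on the observed columns.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy frame; row 1 is missing its 'absences' value
toy = pd.DataFrame({
    "G1": [5.0, 6.0, 15.0, 14.0],
    "G2": [6.0, 7.0, 14.0, 15.0],
    "absences": [4.0, np.nan, 2.0, 0.0],
})

# n_neighbors=1 is chosen here only to make the result easy to trace by hand
imputer = KNNImputer(n_neighbors=1)
toy_filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

# Row 1 is nearest to row 0 on (G1, G2), so it receives row 0's 'absences'
print(toy_filled)
```

Note that `fit_transform` returns a float array, which is why `grade`, `G1`, and `G2` appear as floats after this step in the notebook.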


In [38]:
# Encode the discrete (categorical) variables as dummy variables
df = pd.get_dummies(data=df, columns = ['school', 'sex', 'paid', 'activities'], drop_first = True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   famrel          395 non-null    float64
 1   freetime        395 non-null    float64
 2   goout           395 non-null    float64
 3   Dalc            395 non-null    float64
 4   Walc            395 non-null    float64
 5   health          395 non-null    float64
 6   absences        395 non-null    float64
 7   grade           395 non-null    float64
 8   G1              395 non-null    float64
 9   G2              395 non-null    float64
 10  school_MS       395 non-null    uint8  
 11  sex_M           395 non-null    uint8  
 12  paid_yes        395 non-null    uint8  
 13  activities_yes  395 non-null    uint8  
dtypes: float64(10), uint8(4)
memory usage: 32.5 KB
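
The column names in the output above (`school_MS`, `sex_M`, ...) follow from `drop_first = True`. A small sketch on a hypothetical three-row frame shows the same behaviour: each two-level column collapses to a single indicator for its second level.

```python
import pandas as pd

# Hypothetical toy frame reusing two of the dataset's column names
toy = pd.DataFrame({"school": ["GP", "MS", "GP"], "sex": ["F", "M", "M"]})

# drop_first=True drops the first level of each column (GP and F here)
encoded = pd.get_dummies(data=toy, columns=["school", "sex"], drop_first=True)

print(list(encoded.columns))
print(encoded.astype(int))
```

The dropped level is implied when the remaining indicator is 0, which avoids carrying a redundant, perfectly collinear column.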


In [39]:
# y: dependent variable, X: independent variables, y = f(X)
X = df.drop('grade', axis = 1)
y = df['grade']
print(X)
print(y)

     famrel  freetime  goout  Dalc  Walc  health  absences    G1    G2  \
0       4.0       3.0    4.0   1.0   1.0     3.0       6.0   5.0   6.0   
1       5.0       3.0    3.0   1.0   1.0     3.0       4.0   5.0   5.0   
2       4.0       3.0    2.0   2.0   3.0     3.0      10.0   7.0   8.0   
3       3.0       2.0    2.0   1.0   1.0     5.0       2.0  15.0  14.0   
4       4.0       3.0    2.0   1.0   2.0     5.0       4.0   6.0  10.0   
..      ...       ...    ...   ...   ...     ...       ...   ...   ...   
390     5.0       5.0    4.0   4.0   5.0     4.0      11.0   9.0   9.0   
391     2.0       4.0    5.0   3.0   4.0     2.0       3.0  14.0  16.0   
392     5.0       5.0    3.0   3.0   3.0     3.0       3.0  10.0   8.0   
393     4.0       4.0    1.0   3.0   4.0     5.0       0.0  11.0  12.0   
394     3.0       2.0    3.0   3.0   3.0     5.0       5.0   8.0   9.0   

     school_MS  sex_M  paid_yes  activities_yes  
0            0      0         0               0  
1          

In [50]:
# SVM, XGBoost, RandomForest
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV

# Split into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 1004)
print('X_train, y_train shape:::::', X_train.shape, y_train.shape)
print('X_test,  y_test shape :::::', X_test.shape, y_test.shape)

# Scaling: fit on the training data only, then apply the same transform to the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

###########################################################################################
# Parameter grid (note: gamma has no effect when kernel = 'linear')
param_grid=[{'C': [0.1, 1, 10, 100], 'gamma' : [0.001, 0.01, 0.1, 1, 10], 'kernel': ['linear', 'rbf']}]
# param_grid=[{'C': [0.1, 1, 10, 100], 'gamma' : [0.001, 0.01, 0.1, 1, 10]}]

# Model setup: SVC, 3-fold CV
model = GridSearchCV(estimator=SVC(), param_grid = param_grid, cv = 3, error_score='raise')

# Fit the model
model.fit(X_train_scaled, y_train)

# Check the available result keys
print(sorted(model.cv_results_.keys()))

# Show the results, best score first
result = pd.DataFrame(model.cv_results_['params'])
result['mean_test_score'] = model.cv_results_['mean_test_score']
result.sort_values(by = 'mean_test_score', ascending = False)
############################################################################################

X_train, y_train shape::::: (276, 13) (276,)
X_test,  y_test shape ::::: (119, 13) (119,)




['mean_fit_time', 'mean_score_time', 'mean_test_score', 'param_C', 'param_gamma', 'param_kernel', 'params', 'rank_test_score', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'std_fit_time', 'std_score_time', 'std_test_score']


Unnamed: 0,C,gamma,kernel,mean_test_score
30,100.0,0.001,linear,0.416667
38,100.0,10.0,linear,0.416667
36,100.0,1.0,linear,0.416667
34,100.0,0.1,linear,0.416667
32,100.0,0.01,linear,0.416667
20,10.0,0.001,linear,0.398551
24,10.0,0.1,linear,0.398551
26,10.0,1.0,linear,0.398551
28,10.0,10.0,linear,0.398551
22,10.0,0.01,linear,0.398551
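
In the table above, every `linear` row with the same `C` has the same score regardless of `gamma`, because the linear kernel does not use `gamma`. One possible way to avoid fitting those redundant candidates, sketched here as an assumption rather than as part of the original notebook, is to pass a list of per-kernel dictionaries. `ParameterGrid` is used only to count the candidates.

```python
from sklearn.model_selection import ParameterGrid

# Grid as written in the notebook: gamma is crossed with both kernels
combined = [{'C': [0.1, 1, 10, 100],
             'gamma': [0.001, 0.01, 0.1, 1, 10],
             'kernel': ['linear', 'rbf']}]

# Alternative sketch: gamma is searched only for the rbf kernel
split = [
    {'kernel': ['linear'], 'C': [0.1, 1, 10, 100]},
    {'kernel': ['rbf'], 'C': [0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1, 10]},
]

n_combined = len(ParameterGrid(combined))
n_split = len(ParameterGrid(split))
print(n_combined, n_split)
```

The `split` list can be passed to `GridSearchCV` as `param_grid` in the same way as the original grid.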


In [51]:
### SVR
###########################################################################################
# Parameter grid
param_grid=[{'C': [0.1, 1, 10, 100], 'gamma' : [0.001, 0.01, 0.1, 1, 10]}]

# Model setup: SVR, 3-fold CV
model = GridSearchCV(estimator=SVR(), param_grid = param_grid, cv = 3, error_score='raise')

# Fit the model
model.fit(X_train_scaled, y_train)

# Check the available result keys
print(sorted(model.cv_results_.keys()))

# Show the results, best score first
result = pd.DataFrame(model.cv_results_['params'])
result['mean_test_score'] = model.cv_results_['mean_test_score']
result.sort_values(by = 'mean_test_score', ascending = False)
############################################################################################

['mean_fit_time', 'mean_score_time', 'mean_test_score', 'param_C', 'param_gamma', 'params', 'rank_test_score', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'std_fit_time', 'std_score_time', 'std_test_score']


Unnamed: 0,C,gamma,mean_test_score
15,100.0,0.001,0.95851
11,10.0,0.01,0.949617
16,100.0,0.01,0.940983
10,10.0,0.001,0.934865
6,1.0,0.01,0.906927
12,10.0,0.1,0.83232
17,100.0,0.1,0.822368
7,1.0,0.1,0.747424
5,1.0,0.001,0.255807
2,0.1,0.1,0.230155


In [52]:
### RandomForestRegressor
###########################################################################################
# Parameter grid
param_grid=[{'max_depth': [2,4,6,8,10], 'min_samples_split' : [2,4,6,8,10]}]

# Model setup: RandomForestRegressor, 3-fold CV
model = GridSearchCV(estimator=RandomForestRegressor(n_estimators = 100), param_grid = param_grid, cv = 3, error_score='raise')

# Fit the model
model.fit(X_train_scaled, y_train)

# Check the available result keys
print(sorted(model.cv_results_.keys()))

# Show the results, best score first
result = pd.DataFrame(model.cv_results_['params'])
result['mean_test_score'] = model.cv_results_['mean_test_score']
result.sort_values(by = 'mean_test_score', ascending = False)
############################################################################################

['mean_fit_time', 'mean_score_time', 'mean_test_score', 'param_max_depth', 'param_min_samples_split', 'params', 'rank_test_score', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'std_fit_time', 'std_score_time', 'std_test_score']


Unnamed: 0,max_depth,min_samples_split,mean_test_score
6,4,4,0.955981
22,10,6,0.955787
5,4,2,0.955757
21,10,4,0.95574
11,6,4,0.955672
16,8,4,0.955484
7,4,6,0.955387
24,10,10,0.955296
12,6,6,0.955058
8,4,8,0.955045


In [53]:
### XGBRegressor
###########################################################################################
# Parameter grid
param_grid=[{'max_depth': [2,4,6,8,10]}]

# Model setup: XGBRegressor, 3-fold CV
model = GridSearchCV(estimator=XGBRegressor(n_estimators = 1000), param_grid = param_grid, cv = 3, error_score='raise')

# Fit the model
model.fit(X_train_scaled, y_train)

# Check the available result keys
print(sorted(model.cv_results_.keys()))

# Show the results, best score first
result = pd.DataFrame(model.cv_results_['params'])
result['mean_test_score'] = model.cv_results_['mean_test_score']
result.sort_values(by = 'mean_test_score', ascending = False)
############################################################################################

['mean_fit_time', 'mean_score_time', 'mean_test_score', 'param_max_depth', 'params', 'rank_test_score', 'split0_test_score', 'split1_test_score', 'split2_test_score', 'std_fit_time', 'std_score_time', 'std_test_score']


Unnamed: 0,max_depth,mean_test_score
4,10,0.949691
2,6,0.949665
1,4,0.947478
3,8,0.947331
0,2,0.933971
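
The scores above are cross-validation scores on the training split; the held-out test split and the imported `mean_squared_error` are not used yet. A possible final comparison is sketched below under stated assumptions: it runs on synthetic data (`X_demo`, `y_demo`), uses only scikit-learn estimators so that it is self-contained, and wraps the scaler in a `Pipeline` so that scaling is re-fit inside each CV fold. An `XGBRegressor` entry could presumably be added to `candidates` in the same way.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the student data
rng = np.random.default_rng(1004)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] + X_demo[:, 1] + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1004)

# Each candidate: (estimator, grid); 'model__' prefixes target the pipeline step
candidates = {
    'SVR': (SVR(), {'model__C': [1, 10, 100],
                    'model__gamma': [0.001, 0.01, 0.1]}),
    'RandomForest': (RandomForestRegressor(n_estimators=100, random_state=0),
                     {'model__max_depth': [2, 4, 6]}),
}

test_rmse = {}
for name, (estimator, grid) in candidates.items():
    pipe = Pipeline([('scaler', StandardScaler()), ('model', estimator)])
    search = GridSearchCV(pipe, param_grid=grid, cv=3)
    search.fit(X_tr, y_tr)            # tuning uses the training split only
    pred = search.predict(X_te)       # best pipeline applied to unseen rows
    test_rmse[name] = float(np.sqrt(mean_squared_error(y_te, pred)))

print(test_rmse)
```

Comparing the models on the same untouched test split gives a fairer basis for the final choice than the cross-validation scores alone.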
