# **Motion Classification Based on Smartphone Sensor Data**
# Step 2: Basic Modeling


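## 0. Mission

* Data preprocessing
    * Perform the necessary preprocessing: dummy-variable encoding, data splitting, NaN checks and handling, scaling, and so on
* Build classification models with a variety of algorithms
    * Apply at least four algorithms
    * Compare performance
    * Build a separate data frame that stores each model's performance and use it for the comparison
* Optional: the following items are optional. Work on them as time permits.
    * Select the top N variables, then model and compare performance
        * Modeling does not always need every variable.
        * Select the top N variables by feature importance, build a model on them, and compare its performance with the other models.
        * To choose N, add variables one at a time, refit and validate the model at each step, and look for a suitable cutoff.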

## 1. Environment Setup

### (1) Import libraries

* Requirements
    - The code below already imports the basic libraries.
    - Add any other libraries you find necessary.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

# Add any other libraries you find necessary.
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


* Helper function

In [None]:
# Rank feature importances and optionally plot them
def plot_feature_importance(importance, names, result_only = False, topn = 'all'):
    feature_importance = np.array(importance)
    feature_name = np.array(names)

    data={'feature_name':feature_name,'feature_importance':feature_importance}
    fi_temp = pd.DataFrame(data)

    # Sort features by importance, highest first
    fi_temp.sort_values(by=['feature_importance'], ascending=False,inplace=True)
    fi_temp.reset_index(drop=True, inplace = True)

    # Keep every feature, or only the first topn rows
    if topn == 'all' :
        fi_df = fi_temp.copy()
    else :
        fi_df = fi_temp.iloc[:topn]

    # Draw the importances as a horizontal bar chart unless only the table is wanted
    if not result_only :
        plt.figure(figsize=(10,20))
        sns.barplot(x='feature_importance', y='feature_name', data = fi_df)

        plt.xlabel('importance')
        plt.ylabel('feature name')
        plt.grid()

    return fi_df
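
The helper above expects an array of importances and the matching feature names. A minimal sketch of the same ranking on synthetic data follows; `X_demo`, `y_demo`, `rf_demo`, and `fi_demo` are illustrative names that do not appear in the notebook, and the data is generated rather than taken from the sensor dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for x_train / y_train (assumption: not the sensor data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=42
)
X_demo = pd.DataFrame(X_demo, columns=[f"f{i}" for i in range(8)])

rf_demo = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)

# Same ranking logic as the helper with result_only=True and topn=3
fi_demo = (
    pd.DataFrame({
        "feature_name": X_demo.columns,
        "feature_importance": rf_demo.feature_importances_,
    })
    .sort_values("feature_importance", ascending=False)
    .reset_index(drop=True)
    .iloc[:3]
)
print(fi_demo)
```

In the notebook itself, the equivalent call would presumably be `plot_feature_importance(model.feature_importances_, list(x_train), result_only=True, topn=3)` once a tree-based model has been fitted.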

### (2) Load the data

* Provided dataset
    * data01_train.csv : for training and validation
* Requirements
    - Load the full dataset 'data01_train.csv' and store it as 'data'.
        - Drop the variable subject from data.
    - Check the basic information of the data frame (.head(), .shape, etc.).

#### 1) Data loading

In [None]:
import os
path = '/content/drive/MyDrive/2023.10.25_미니프로젝트5차_데이터 및 실습자료'

# Pass the directory and the file name as separate arguments to os.path.join
data = pd.read_csv(os.path.join(path, 'data01_train.csv'))
data.drop('subject', axis=1, inplace=True)
data.head()

Unnamed: 0,tBodyAcc-mean()-X,tBodyAcc-mean()-Y,tBodyAcc-mean()-Z,tBodyAcc-std()-X,tBodyAcc-std()-Y,tBodyAcc-std()-Z,tBodyAcc-mad()-X,tBodyAcc-mad()-Y,tBodyAcc-mad()-Z,tBodyAcc-max()-X,...,fBodyBodyGyroJerkMag-skewness(),fBodyBodyGyroJerkMag-kurtosis(),"angle(tBodyAccMean,gravity)","angle(tBodyAccJerkMean),gravityMean)","angle(tBodyGyroMean,gravityMean)","angle(tBodyGyroJerkMean,gravityMean)","angle(X,gravityMean)","angle(Y,gravityMean)","angle(Z,gravityMean)",Activity
0,0.288508,-0.009196,-0.103362,-0.988986,-0.962797,-0.967422,-0.989,-0.962596,-0.96565,-0.929747,...,-0.487737,-0.816696,-0.042494,-0.044218,0.307873,0.07279,-0.60112,0.331298,0.165163,STANDING
1,0.265757,-0.016576,-0.098163,-0.989551,-0.994636,-0.987435,-0.990189,-0.99387,-0.987558,-0.937337,...,-0.23782,-0.693515,-0.062899,0.388459,-0.765014,0.771524,0.345205,-0.769186,-0.147944,LAYING
2,0.278709,-0.014511,-0.108717,-0.99772,-0.981088,-0.994008,-0.997934,-0.982187,-0.995017,-0.942584,...,-0.535287,-0.829311,0.000265,-0.525022,-0.891875,0.021528,-0.833564,0.202434,-0.032755,STANDING
3,0.289795,-0.035536,-0.150354,-0.231727,-0.006412,-0.338117,-0.273557,0.014245,-0.347916,0.008288,...,-0.004012,-0.408956,-0.255125,0.612804,0.747381,-0.072944,-0.695819,0.287154,0.111388,WALKING
4,0.394807,0.034098,0.091229,0.088489,-0.106636,-0.388502,-0.010469,-0.10968,-0.346372,0.584131,...,-0.157832,-0.563437,-0.044344,-0.845268,-0.97465,-0.887846,-0.705029,0.264952,0.137758,WALKING_DOWNSTAIRS


#### 2) Basic information

In [None]:
data.shape

(5881, 562)

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5881 entries, 0 to 5880
Columns: 562 entries, tBodyAcc-mean()-X to Activity
dtypes: float64(561), object(1)
memory usage: 25.2+ MB


In [None]:
data.dtypes

tBodyAcc-mean()-X                       float64
tBodyAcc-mean()-Y                       float64
tBodyAcc-mean()-Z                       float64
tBodyAcc-std()-X                        float64
tBodyAcc-std()-Y                        float64
                                         ...   
angle(tBodyGyroJerkMean,gravityMean)    float64
angle(X,gravityMean)                    float64
angle(Y,gravityMean)                    float64
angle(Z,gravityMean)                    float64
Activity                                 object
Length: 562, dtype: object

In [None]:
data.isna().sum().sum()

0

In [None]:
data.describe()

Unnamed: 0,tBodyAcc-mean()-X,tBodyAcc-mean()-Y,tBodyAcc-mean()-Z,tBodyAcc-std()-X,tBodyAcc-std()-Y,tBodyAcc-std()-Z,tBodyAcc-mad()-X,tBodyAcc-mad()-Y,tBodyAcc-mad()-Z,tBodyAcc-max()-X,...,fBodyBodyGyroJerkMag-meanFreq(),fBodyBodyGyroJerkMag-skewness(),fBodyBodyGyroJerkMag-kurtosis(),"angle(tBodyAccMean,gravity)","angle(tBodyAccJerkMean),gravityMean)","angle(tBodyGyroMean,gravityMean)","angle(tBodyGyroJerkMean,gravityMean)","angle(X,gravityMean)","angle(Y,gravityMean)","angle(Z,gravityMean)"
count,5881.0,5881.0,5881.0,5881.0,5881.0,5881.0,5881.0,5881.0,5881.0,5881.0,...,5881.0,5881.0,5881.0,5881.0,5881.0,5881.0,5881.0,5881.0,5881.0,5881.0
mean,0.274811,-0.017799,-0.109396,-0.603138,-0.509815,-0.604058,-0.628151,-0.525944,-0.605374,-0.46549,...,0.126955,-0.305883,-0.623548,0.008524,-0.001185,0.00934,-0.007099,-0.491501,0.059299,-0.054594
std,0.067614,0.039422,0.058373,0.448807,0.501815,0.417319,0.424345,0.485115,0.413043,0.544995,...,0.249176,0.322808,0.310371,0.33973,0.447197,0.60819,0.476738,0.509069,0.29734,0.278479
min,-0.503823,-0.684893,-1.0,-1.0,-0.999844,-0.999667,-1.0,-0.999419,-1.0,-1.0,...,-0.965725,-0.979261,-0.999765,-0.97658,-1.0,-1.0,-1.0,-1.0,-1.0,-0.980143
25%,0.262919,-0.024877,-0.121051,-0.992774,-0.97768,-0.980127,-0.993602,-0.977865,-0.980112,-0.936067,...,-0.02161,-0.541969,-0.845985,-0.122361,-0.294369,-0.481718,-0.373345,-0.811397,-0.018203,-0.141555
50%,0.277154,-0.017221,-0.108781,-0.943933,-0.844575,-0.856352,-0.948501,-0.849266,-0.849896,-0.878729,...,0.133887,-0.342923,-0.712677,0.010278,0.005146,0.011448,-0.000847,-0.709441,0.182893,0.003951
75%,0.288526,-0.01092,-0.098163,-0.24213,-0.034499,-0.26269,-0.291138,-0.068857,-0.268539,-0.01369,...,0.288944,-0.127371,-0.501158,0.154985,0.28503,0.499857,0.356236,-0.51133,0.248435,0.111932
max,1.0,1.0,1.0,1.0,0.916238,1.0,1.0,0.967664,1.0,1.0,...,0.9467,0.989538,0.956845,1.0,1.0,0.998702,0.996078,0.977344,0.478157,1.0


In [None]:
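data.describe().loc['max'].sort_values()
# Only the last expression in a cell is displayed, so the output below shows the minima
data.describe().loc['min'].sort_values()
# Every feature already appears to lie within [-1, 1], so additional scaling seems unnecessary.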

fBodyAccMag-min()                    -1.000000
tBodyGyroJerkMag-std()               -1.000000
tBodyGyroJerkMag-mean()              -1.000000
fBodyAccJerk-bandsEnergy()-49,64.1   -1.000000
fBodyGyro-bandsEnergy()-49,64.2      -1.000000
                                        ...   
tGravityAcc-min()-Y                  -0.568157
tGravityAcc-arCoeff()-Z,4            -0.554000
tGravityAcc-mean()-Y                 -0.535222
tBodyAcc-mean()-X                    -0.503823
tGravityAcc-max()-Y                  -0.493874
Name: min, Length: 561, dtype: float64
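
The range check above can also be expressed as a single boolean test. The sketch below uses a small synthetic frame (`demo` is a hypothetical stand-in for `data`) and only illustrates the idea. Note that values bounded by [-1, 1] are not the same as every column spanning the full range: in the `describe()` output above, for instance, `angle(Y,gravityMean)` peaks at about 0.478.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `data` (assumption: four numeric features plus the label)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.uniform(-1, 1, size=(100, 4)), columns=list("abcd"))
demo["Activity"] = "WALKING"

# Look only at numeric columns, then test the global bounds
num = demo.select_dtypes("number")
col_min, col_max = num.min(), num.max()
in_range = bool((col_min >= -1).all() and (col_max <= 1).all())

print("smallest minimum:", col_min.min())
print("largest maximum :", col_max.max())
print("all features within [-1, 1]:", in_range)
```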

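## **2. Data Preprocessing**

* Perform the necessary preprocessing: dummy-variable encoding, data splitting, NaN checks and handling, scaling, and so on.


### (1) Data split 1: x, y

* Requirements
    - Split the data into x and y.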

In [None]:
target = 'Activity'

x = data.drop(target, axis=1)
y = data[target]

### (2) Scaling (if needed)


* Requirements
    - Run this step to support algorithms that require scaling.
    - Use either min-max or standard scaling.

In [None]:
# Skipped: the min/max check above suggests the features already lie in [-1, 1].
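
Scaling is skipped in this notebook, but for data that does need it, one common pattern is to fit the scaler on the training split only and reuse it for validation. This is a sketch on synthetic data, assuming the train/validation split has already been made; `X_demo`, `X_tr`, and `X_va` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic unscaled features (assumption: not the sensor data)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(200, 3))
X_tr, X_va = train_test_split(X_demo, test_size=0.2, random_state=42)

# Fit on the training split only, so validation data does not leak into the scaler
scaler = MinMaxScaler(feature_range=(-1, 1))
X_tr_s = scaler.fit_transform(X_tr)
X_va_s = scaler.transform(X_va)

print(X_tr_s.min(axis=0), X_tr_s.max(axis=0))
```

Validation values may fall slightly outside [-1, 1] with this approach, which is expected because the bounds come from the training split.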

### (3) Data split 2: train, validation

* Requirements
    - train : val = 8 : 2 or 7 : 3
    - Set the random_state option so that results are reproducible and comparable across models.

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.2, random_state=42)
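
The split above is random but not stratified. As an optional variation (not what the notebook does), `train_test_split` accepts a `stratify` argument that keeps class proportions the same in both splits. The toy labels below are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (assumption: illustrative only)
y_demo = pd.Series(["LAYING"] * 60 + ["SITTING"] * 30 + ["WALKING"] * 10)
X_demo = pd.DataFrame({"f0": range(100)})

# stratify=y_demo preserves the 60/30/10 ratio in train and validation
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(y_va.value_counts().to_dict())
```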

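## **3. Basic Modeling**



* Requirements
    - Apply at least four algorithms.
    - For each algorithm, model with all variables and with the top N variables, then compare performance.
    - (Optional) For one or two of the algorithms, select the top N variables by feature importance, build a model, and compare its performance with the other models.
        * To choose N, add variables one at a time, refit and validate the model at each step, and look for a suitable cutoff.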
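
The incremental top-N procedure described in the last bullet can be sketched as follows. This is one possible reading on synthetic data, not the notebook's own implementation; `ranked`, `topn_df`, and `best_n` are hypothetical names, and RandomForest is used only because it exposes `feature_importances_`.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the sensor features (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=12, n_informative=4, n_redundant=0, random_state=42
)
X_demo = pd.DataFrame(X_demo, columns=[f"f{i}" for i in range(12)])
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Rank features once with a model fitted on all of them
base = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)
ranked = X_tr.columns[np.argsort(base.feature_importances_)[::-1]]

# Grow the feature set one variable at a time and record validation accuracy
scores = []
for n in range(1, len(ranked) + 1):
    cols = list(ranked[:n])
    m = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr[cols], y_tr)
    scores.append({"n_features": n, "accuracy": accuracy_score(y_va, m.predict(X_va[cols]))})

topn_df = pd.DataFrame(scores)
best_n = int(topn_df.loc[topn_df["accuracy"].idxmax(), "n_features"])
print(topn_df)
print("best N on the validation split:", best_n)
```

With 561 features, stepping one variable at a time means several hundred refits, so a coarser step may be more practical. Choosing N on the same validation split that reports the final score also makes that score somewhat optimistic.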

### (1) Algorithm 1: DecisionTreeClassifier

In [None]:
from sklearn.metrics import accuracy_score, f1_score, classification_report

In [None]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(x_train, y_train)
y_pred_dt = model.predict(x_val)

In [None]:
# Evaluate
print("Accuracy : " , accuracy_score(y_val, y_pred_dt))
print('')
print("f1 score : " , f1_score(y_val, y_pred_dt, average='macro'))
print('')
print("classification report : " , classification_report(y_val, y_pred_dt))

Accuracy :  0.929481733220051

f1 score :  0.9278026995902572

classification report :                      precision    recall  f1-score   support

            LAYING       1.00      1.00      1.00       231
           SITTING       0.87      0.90      0.88       200
          STANDING       0.90      0.88      0.89       226
           WALKING       0.94      0.95      0.95       198
WALKING_DOWNSTAIRS       0.92      0.92      0.92       145
  WALKING_UPSTAIRS       0.92      0.92      0.92       177

          accuracy                           0.93      1177
         macro avg       0.93      0.93      0.93      1177
      weighted avg       0.93      0.93      0.93      1177
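
The same three evaluation calls are repeated for every model in this notebook. A small helper, sketched below under the assumption that accuracy and macro F1 are the metrics of interest, returns them as a dictionary that can later be collected into a comparison data frame. `evaluate` is a hypothetical name, not part of the original notebook.

```python
from sklearn.metrics import accuracy_score, f1_score

def evaluate(name, y_true, y_pred):
    """Return accuracy and macro F1 for one model as a dict row."""
    return {
        "model": name,
        "accuracy": accuracy_score(y_true, y_pred),
        "f1_macro": f1_score(y_true, y_pred, average="macro"),
    }

# Tiny hand-made example (assumption: illustrative labels only)
y_true_demo = ["LAYING", "SITTING", "SITTING", "WALKING"]
y_pred_demo = ["LAYING", "SITTING", "WALKING", "WALKING"]
row = evaluate("demo", y_true_demo, y_pred_demo)
print(row)
```

In the notebook this could be called as `evaluate('DecisionTree', y_val, y_pred_dt)`, appending each result to a list.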



### (2) Algorithm 2: RandomForest

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(x_train, y_train)
y_pred_rf = model.predict(x_val)

In [None]:
# Evaluate
print("Accuracy : " , accuracy_score(y_val, y_pred_rf))
print('')
print("f1 score : " , f1_score(y_val, y_pred_rf, average='macro'))
print('')
print("classification report : " , classification_report(y_val, y_pred_rf))

Accuracy :  0.9804587935429057

f1 score :  0.9799694569076202

classification report :                      precision    recall  f1-score   support

            LAYING       1.00      1.00      1.00       231
           SITTING       0.96      0.98      0.97       200
          STANDING       0.98      0.96      0.97       226
           WALKING       0.99      0.98      0.98       198
WALKING_DOWNSTAIRS       0.97      0.98      0.98       145
  WALKING_UPSTAIRS       0.98      0.98      0.98       177

          accuracy                           0.98      1177
         macro avg       0.98      0.98      0.98      1177
      weighted avg       0.98      0.98      0.98      1177



### (3) Algorithm 3: ExtraTreesClassifier

In [None]:
from sklearn.ensemble import ExtraTreesClassifier

model = ExtraTreesClassifier()
model.fit(x_train, y_train)
y_pred_ex = model.predict(x_val)

In [None]:
# Evaluate
print("Accuracy : " , accuracy_score(y_val, y_pred_ex))
print('')
print("f1 score : " , f1_score(y_val, y_pred_ex, average='macro'))
print('')
print("classification report : " , classification_report(y_val, y_pred_ex))

Accuracy :  0.9889549702633815

f1 score :  0.9888070614631612

classification report :                      precision    recall  f1-score   support

            LAYING       1.00      1.00      1.00       231
           SITTING       0.98      0.98      0.98       200
          STANDING       0.99      0.98      0.98       226
           WALKING       1.00      0.99      0.99       198
WALKING_DOWNSTAIRS       0.98      1.00      0.99       145
  WALKING_UPSTAIRS       0.99      0.98      0.99       177

          accuracy                           0.99      1177
         macro avg       0.99      0.99      0.99      1177
      weighted avg       0.99      0.99      0.99      1177



### (4) Algorithm 4: SGDClassifier (linear model)

In [None]:
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
model.fit(x_train, y_train)
y_pred_sgd = model.predict(x_val)

In [None]:
# Evaluate
print("Accuracy : " , accuracy_score(y_val, y_pred_sgd))
print('')
print("f1 score : " , f1_score(y_val, y_pred_sgd, average='macro'))
print('')
print("classification report : " , classification_report(y_val, y_pred_sgd))

Accuracy :  0.9583687340696686

f1 score :  0.9600816227654881

classification report :                      precision    recall  f1-score   support

            LAYING       1.00      1.00      1.00       231
           SITTING       0.83      0.99      0.90       200
          STANDING       0.99      0.82      0.90       226
           WALKING       1.00      0.99      1.00       198
WALKING_DOWNSTAIRS       0.96      1.00      0.98       145
  WALKING_UPSTAIRS       0.99      0.97      0.98       177

          accuracy                           0.96      1177
         macro avg       0.96      0.96      0.96      1177
      weighted avg       0.96      0.96      0.96      1177
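
`SGDClassifier` with its default `loss='hinge'` fits a linear SVM rather than a logistic regression. If logistic regression is the intended model, scikit-learn's `LogisticRegression` is one direct option. The sketch below runs on synthetic data; `logit` and the `max_iter=1000` setting are illustrative choices, not values from the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic three-class problem (assumption: stand-in for the six activities)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# A higher max_iter gives the solver room to converge
logit = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Unlike a hinge-loss model, logistic regression yields class probabilities
proba = logit.predict_proba(X_va)
print(np.round(proba[:3], 3))
print("validation accuracy:", logit.score(X_va, y_va))
```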



### (5) Algorithm 5: GradientBoostingClassifier

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier()
model.fit(x_train, y_train)
y_pred_gb = model.predict(x_val)

KeyboardInterrupt: ignored

In [None]:
# Evaluate (no output: training was interrupted above, so y_pred_gb was never created)
print("Accuracy : " , accuracy_score(y_val, y_pred_gb))
print('')
print("f1 score : " , f1_score(y_val, y_pred_gb, average='macro'))
print('')
print("classification report : " , classification_report(y_val, y_pred_gb))
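
The `GradientBoostingClassifier` run above was interrupted, most likely because fitting it on all 561 features takes a long time. One way to shorten training, sketched here on synthetic data, is to subsample rows and features and to stop early when an internal validation score stops improving. The specific parameter values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic three-class problem (assumption: not the sensor data)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=6, n_classes=3, random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

gb = GradientBoostingClassifier(
    n_estimators=200,         # upper bound on boosting rounds
    max_features="sqrt",      # consider a subset of features at each split
    subsample=0.8,            # fit each tree on 80% of the rows
    n_iter_no_change=5,       # stop if the held-out score stalls for 5 rounds
    validation_fraction=0.1,  # share of training data held out for that check
    random_state=42,
).fit(X_tr, y_tr)

y_pred_demo = gb.predict(X_va)
print("rounds actually fitted:", gb.n_estimators_)
print("validation accuracy   :", gb.score(X_va, y_va))
```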

### (6) Algorithm 6: HistGradientBoostingClassifier

In [None]:
# This experimental import is only required on scikit-learn versions before 1.0
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
model = HistGradientBoostingClassifier()
model.fit(x_train, y_train)
y_pred_hgb = model.predict(x_val)

In [None]:
# Evaluate
print("Accuracy : " , accuracy_score(y_val, y_pred_hgb))
print('')
print("f1 score : " , f1_score(y_val, y_pred_hgb, average='macro'))
print('')
print("classification report : " , classification_report(y_val, y_pred_hgb))

Accuracy :  0.9949022939677146

f1 score :  0.9949738040668343

classification report :                      precision    recall  f1-score   support

            LAYING       1.00      1.00      1.00       231
           SITTING       0.98      1.00      0.99       200
          STANDING       1.00      0.98      0.99       226
           WALKING       0.99      0.99      0.99       198
WALKING_DOWNSTAIRS       1.00      0.99      1.00       145
  WALKING_UPSTAIRS       0.99      1.00      1.00       177

          accuracy                           0.99      1177
         macro avg       0.99      1.00      0.99      1177
      weighted avg       0.99      0.99      0.99      1177



### (7) Algorithm 7: XGBClassifier

In [None]:
from xgboost import XGBClassifier
model = XGBClassifier(random_state=42)
# Map the activity names to integer labels 0..5 for XGBClassifier
label_map = {'STANDING' : 0, 'SITTING' : 1, 'LAYING' : 2, 'WALKING' : 3, 'WALKING_UPSTAIRS' : 4, 'WALKING_DOWNSTAIRS' : 5}
y_tr_map = y_train.map(label_map)
y_val_map = y_val.map(label_map)
model.fit(x_train, y_tr_map)
y_pred_xgb = model.predict(x_val)

In [None]:
# Evaluate
print("Accuracy : " , accuracy_score(y_val_map, y_pred_xgb))
print('')
print("f1 score : " , f1_score(y_val_map, y_pred_xgb, average='macro'))
print('')
print("classification report : " , classification_report(y_val_map, y_pred_xgb))

Accuracy :  0.9923534409515717

f1 score :  0.9922983968553284

classification report :                precision    recall  f1-score   support

           0       1.00      0.98      0.99       226
           1       0.98      0.99      0.99       200
           2       1.00      1.00      1.00       231
           3       0.99      0.98      0.99       198
           4       0.99      1.00      0.99       177
           5       0.99      0.99      0.99       145

    accuracy                           0.99      1177
   macro avg       0.99      0.99      0.99      1177
weighted avg       0.99      0.99      0.99      1177
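
The manual dictionary mapping used above works, but the report then shows integers instead of activity names. As an alternative sketch, `LabelEncoder` performs the same encoding and can translate predictions back. Its integer order is alphabetical, so it differs from the hand-written mapping in this notebook. The toy series is hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy labels (assumption: a subset of the six activities)
y_demo = pd.Series(["STANDING", "LAYING", "WALKING", "LAYING"])

le = LabelEncoder()
y_enc = le.fit_transform(y_demo)      # classes are sorted alphabetically
y_back = le.inverse_transform(y_enc)  # integers back to activity names

print(list(le.classes_))
print(list(y_enc))
print(list(y_back))
```

With the encoder fitted on `y_train`, `le.inverse_transform(y_pred_xgb)` would return activity names that can be compared directly with `y_val`.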



### (8) Deep learning model

In [None]:
np.array(y_val_one)

array([[1, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0],
       [0, 0, 1, 0, 0, 0],
       ...,
       [1, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0],
       [0, 0, 1, 0, 0, 0]], dtype=uint8)

In [None]:
np.argmax(np.array(y_val_one), axis=-1)

array([0, 3, 2, ..., 0, 3, 2])

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical

y_train_one = pd.get_dummies(y_train)
y_val_one = pd.get_dummies(y_val)

model_dl = Sequential()
model_dl.add(Dense(16, input_shape = (x_train.shape[1], ), activation = 'relu' ))
model_dl.add(Dense(8, activation = 'relu'))
model_dl.add(Dense(y_train_one.shape[1], activation = 'softmax'))

model_dl.compile(optimizer = 'Adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])
model_dl.fit(x_train, y_train_one, epochs = 30, validation_data = (x_val, y_val_one))

y_pred_dl = model_dl.predict(x_val)
y_pred_dl = np.argmax(y_pred_dl, axis=-1)
# Evaluate
print("Accuracy : " , accuracy_score(np.argmax(np.array(y_val_one), axis=-1), y_pred_dl))
print('')
# print("f1 score : " , f1_score(np.argmax(np.array(y_val_one), axis=-1), y_pred_dl, average='macro'))
print('')
print("classification report : " , classification_report(np.argmax(np.array(y_val_one), axis=-1), y_pred_dl))

Epoch 1/30
...
Epoch 30/30
Accuracy :  0.9821580288870009


classification report :                precision    recall  f1-score   support

           0       1.00      1.00      1.00       231
           1       0.97      0.94      0.96       200
           2       0.95      0.98      0.96       226
           3       1.00      0.99      0.99       198
           4       0.99      0.99      0.99       145
           5       0.98      1.00      0.99       177

    accuracy                           0.98      1177
   macro avg       0.98      0.98      0.98      1177
weighted avg       0.98      0.98      0.98      1177
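
The report above labels classes 0 to 5 because predictions come from `np.argmax`. Since `pd.get_dummies` orders its columns alphabetically, those column names can be used to turn indices back into activity names. The sketch below uses made-up probabilities, and `class_names` is a hypothetical variable.

```python
import numpy as np
import pandas as pd

# Toy labels and one-hot encoding, as in the notebook's get_dummies step
y_demo = pd.Series(["WALKING", "LAYING", "SITTING", "LAYING"])
y_one = pd.get_dummies(y_demo)
class_names = np.array(y_one.columns)  # alphabetical: LAYING, SITTING, WALKING

# Made-up softmax outputs standing in for model_dl.predict(x_val)
proba_demo = np.array([
    [0.1, 0.2, 0.7],
    [0.8, 0.1, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.6, 0.1],
])

# argmax gives the column index; indexing class_names gives the label
pred_names = class_names[np.argmax(proba_demo, axis=-1)]
print(list(pred_names))
```

Encoding `y_train` and `y_val` separately with `get_dummies` only lines up if both splits contain every class, which holds here according to the support column of the reports.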





In [None]:
accuracy = [accuracy_score(y_val, y_pred_dt), accuracy_score(y_val, y_pred_rf), accuracy_score(y_val, y_pred_ex), accuracy_score(y_val, y_pred_sgd), accuracy_score(y_val, y_pred_hgb), accuracy_score(y_val_map, y_pred_xgb)]
# Use a distinct list name so that it does not shadow sklearn's f1_score function
f1_scores = [f1_score(y_val, y_pred_dt, average='macro'), f1_score(y_val, y_pred_rf, average='macro'), f1_score(y_val, y_pred_ex, average='macro'), f1_score(y_val, y_pred_sgd, average='macro'), f1_score(y_val, y_pred_hgb, average='macro'), f1_score(y_val_map, y_pred_xgb, average='macro')]
# The fourth model was fitted with SGDClassifier, so it is labeled accordingly
models = ['DecisionTree', 'RandomForest', 'ExtraTreesClassifier', 'SGDClassifier' ,'HGB', 'XGB']

pd.DataFrame({'Model' : models, 'accuracy' : accuracy, 'f1_score' : f1_scores})

,Model,accuracy,f1_score
0,DecisionTree,0.929482,0.927803
1,RandomForest,0.980459,0.979969
2,ExtraTreesClassifier,0.988955,0.988807
3,SGDClassifier,0.958369,0.960082
4,HGB,0.994902,0.994974
5,XGB,0.992353,0.992298
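
Aggregate scores hide which classes get confused with each other. In the reports above, the weakest per-class figures are for SITTING and STANDING, and a confusion matrix is one way to inspect such pairs. The labels below are a hand-made toy example, not results from this notebook.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

labels = ["SITTING", "STANDING", "WALKING"]

# Toy ground truth and predictions (assumption: illustrative only)
y_true_demo = ["SITTING", "SITTING", "STANDING", "STANDING", "STANDING", "WALKING"]
y_pred_demo = ["SITTING", "SITTING", "SITTING", "STANDING", "STANDING", "WALKING"]

# Rows are true classes, columns are predicted classes
cm = pd.DataFrame(
    confusion_matrix(y_true_demo, y_pred_demo, labels=labels),
    index=labels,
    columns=labels,
)
print(cm)
```

In the notebook, `confusion_matrix(y_val, y_pred_sgd)` would give the corresponding table for the SGD model.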
