# **Motion Classification from Smartphone Sensor Data**
# Stage 2: Baseline Modeling


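## 0. Mission

* Data preprocessing
    * Perform whatever preprocessing the data needs: dummy encoding, data splitting, NaN checks and handling, scaling, and so on.
* Build classification models with a variety of algorithms
    * Apply at least four algorithms.
    * Compare their performance.
    * Store each model's performance in a separate DataFrame and compare.
* Optional: the following items are optional. Do them as time allows.
    * Select the top N variables, model with them, and compare performance
        * Modeling does not always need every variable.
        * Select the top N variables by importance, build a model, and compare its performance with the other models.
        * To choose N, add variables one at a time, refit and validate at each step, and pick the point where performance levels off.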

## 1. Environment Setup

### (1) Importing Libraries

* Detailed requirements
    - The code below already imports the libraries that are needed by default.
    - Add any other libraries you find necessary.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Add any other libraries you find necessary.
import os
import joblib
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, classification_report
from sklearn.inspection import permutation_importance

* Helper function

In [None]:
# Build a sorted feature-importance table and optionally plot it
def plot_feature_importance(importance, names, result_only = False, topn = 'all'):
    feature_importance = np.array(importance)
    feature_name = np.array(names)

    data={'feature_name':feature_name,'feature_importance':feature_importance}
    fi_temp = pd.DataFrame(data)

    # Sort by feature importance, highest first
    fi_temp.sort_values(by=['feature_importance'], ascending=False,inplace=True)
    fi_temp.reset_index(drop=True, inplace = True)

    if topn == 'all' :
        fi_df = fi_temp.copy()
    else :
        fi_df = fi_temp.iloc[:topn]

    # Plot the feature importances unless only the table is requested
    if not result_only :
        plt.figure(figsize=(10,20))
        sns.barplot(x='feature_importance', y='feature_name', data = fi_df)

        plt.xlabel('importance')
        plt.ylabel('feature name')
        plt.grid()

    return fi_df

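### (2) Loading the Data

* Provided dataset
    * data01_train.csv : for training and validation
* Detailed requirements
    - Load the full dataset 'data01_train.csv' and store it under the name 'data'.
        - Drop the variable subject from data.
    - Check basic information about the DataFrame (.head(), .shape, etc.).

#### 1) Data loading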

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
FOLDER_PATH = '/content/drive/MyDrive/에이블/미니프로젝트5차_3~5일차/Kaggle'

data = pd.read_csv(os.path.join(FOLDER_PATH, 'data01_train.csv'))
print(data.shape)

(5881, 563)


In [None]:
importance_data = joblib.load(os.path.join(FOLDER_PATH, 'importance_data.pkl'))

In [None]:
data.drop('subject', axis=1, inplace=True)

#### 2) Basic information

In [None]:
data.head()

Unnamed: 0,tBodyAcc-mean()-X,tBodyAcc-mean()-Y,tBodyAcc-mean()-Z,tBodyAcc-std()-X,tBodyAcc-std()-Y,tBodyAcc-std()-Z,tBodyAcc-mad()-X,tBodyAcc-mad()-Y,tBodyAcc-mad()-Z,tBodyAcc-max()-X,...,fBodyBodyGyroJerkMag-skewness(),fBodyBodyGyroJerkMag-kurtosis(),"angle(tBodyAccMean,gravity)","angle(tBodyAccJerkMean),gravityMean)","angle(tBodyGyroMean,gravityMean)","angle(tBodyGyroJerkMean,gravityMean)","angle(X,gravityMean)","angle(Y,gravityMean)","angle(Z,gravityMean)",Activity
0,0.288508,-0.009196,-0.103362,-0.988986,-0.962797,-0.967422,-0.989,-0.962596,-0.96565,-0.929747,...,-0.487737,-0.816696,-0.042494,-0.044218,0.307873,0.07279,-0.60112,0.331298,0.165163,STANDING
1,0.265757,-0.016576,-0.098163,-0.989551,-0.994636,-0.987435,-0.990189,-0.99387,-0.987558,-0.937337,...,-0.23782,-0.693515,-0.062899,0.388459,-0.765014,0.771524,0.345205,-0.769186,-0.147944,LAYING
2,0.278709,-0.014511,-0.108717,-0.99772,-0.981088,-0.994008,-0.997934,-0.982187,-0.995017,-0.942584,...,-0.535287,-0.829311,0.000265,-0.525022,-0.891875,0.021528,-0.833564,0.202434,-0.032755,STANDING
3,0.289795,-0.035536,-0.150354,-0.231727,-0.006412,-0.338117,-0.273557,0.014245,-0.347916,0.008288,...,-0.004012,-0.408956,-0.255125,0.612804,0.747381,-0.072944,-0.695819,0.287154,0.111388,WALKING
4,0.394807,0.034098,0.091229,0.088489,-0.106636,-0.388502,-0.010469,-0.10968,-0.346372,0.584131,...,-0.157832,-0.563437,-0.044344,-0.845268,-0.97465,-0.887846,-0.705029,0.264952,0.137758,WALKING_DOWNSTAIRS


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5881 entries, 0 to 5880
Columns: 562 entries, tBodyAcc-mean()-X to Activity
dtypes: float64(561), object(1)
memory usage: 25.2+ MB


In [None]:
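nan_counts = data.isna().sum()

# Positional indexes of the columns that contain at least one missing value
indexes_with_nulls = np.where(nan_counts > 0)[0]

print("Indexes of columns with missing values:", indexes_with_nulls)

Indexes of columns with missing values: []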
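
The check above prints positional indexes, which are hard to read with 562 columns. As a minimal sketch on a made-up toy frame (`toy` and its column names are assumptions for illustration), the same boolean mask can return column names and their NaN counts directly:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `data`; column "b" has one missing value.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [1.0, np.nan, 3.0],
    "Activity": ["WALKING", "LAYING", "SITTING"],
})

nan_counts = toy.isna().sum()

# Keep only the columns whose NaN count is above zero, together with the counts.
cols_with_nulls = nan_counts[nan_counts > 0]
print(cols_with_nulls.index.tolist())
```

An empty list here means the same thing as the empty index array printed by the original cell.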


In [None]:
importance_data.head(2)

Unnamed: 0,feature_name,total_feature_importance,dynamic_feature_importance,standing_feature_importance,sitting_feature_importance,laying_feature_importance,walking_feature_importance,walking_up_feature_importance,walking_down_feature_importance
0,tBodyAcc-mean()-X,0.000245,9e-06,0.000216,0.000329,0.000709,0.000169,0.000116,6.6e-05
1,tBodyAcc-mean()-Y,0.000217,1.7e-05,0.000492,0.000365,9e-05,0.000241,0.001091,0.00038


In [None]:
# errors='ignore' keeps this cell re-runnable when the columns are already gone
importance_data.drop(['sensor', 'agg', 'axis'], axis=1, inplace=True, errors='ignore')

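## **2. Data Preprocessing**

* Perform the preprocessing the data needs: dummy encoding, data splitting, NaN checks and handling, scaling, and so on.


### (1) Data split 1: x, y

* Detailed requirements
    - Split the data into x and y.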

In [None]:
target = 'Activity'

x = data.drop(columns=target)
y = data[target]

### (2) Data split 2: train, validation

* Detailed requirements
    - train : val = 8 : 2 or 7 : 3
    - Use the random_state option so that results are reproducible when comparing with other models.

In [None]:
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.2, random_state=42)

### (3) Scaling (if needed)


* Detailed requirements
    - Run this code for the algorithms that require scaled inputs.
    - Use either min-max scaling or standard scaling.
    - The scaler is fitted on x_train, so the train/validation split has to run first.

In [None]:
scaler = MinMaxScaler()
x_train_s = scaler.fit_transform(x_train)
x_val_s = scaler.transform(x_val)

In [None]:
# scaler = StandardScaler()
# x_train_s = scaler.fit_transform(x_train)
# x_val_s = scaler.transform(x_val)

* Label encoder

In [None]:
from sklearn.preprocessing import LabelEncoder

# Create the LabelEncoder
label_encoder = LabelEncoder()

# Encode the training and validation labels
y_train_l = label_encoder.fit_transform(y_train)
y_val_l = label_encoder.transform(y_val)
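
Scaling and label encoding both learn parameters from the data, so they are usually fitted on the training split only and then applied to the validation split. The following is a minimal sketch of that order on synthetic data; the `toy_*` names and the generated values are assumptions for illustration and do not come from `data01_train.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Synthetic stand-in for the sensor features and activity labels.
rng = np.random.default_rng(42)
toy_x = pd.DataFrame(rng.uniform(-1, 1, size=(100, 4)), columns=list("abcd"))
toy_y = pd.Series(rng.choice(["LAYING", "SITTING", "WALKING"], size=100))

# 1) Split first so the validation rows never influence fitted parameters.
toy_x_train, toy_x_val, toy_y_train, toy_y_val = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=42
)

# 2) Fit the scaler on the training rows only, then reuse it on validation.
toy_scaler = MinMaxScaler()
toy_x_train_s = toy_scaler.fit_transform(toy_x_train)
toy_x_val_s = toy_scaler.transform(toy_x_val)

# 3) Fit the label encoder on the training labels, then reuse it on validation.
toy_encoder = LabelEncoder()
toy_y_train_l = toy_encoder.fit_transform(toy_y_train)
toy_y_val_l = toy_encoder.transform(toy_y_val)

print(toy_x_train_s.shape, toy_x_val_s.shape)
print(list(toy_encoder.classes_))
```

Validation values can fall slightly outside [0, 1] after `transform`, because the minimum and maximum come from the training rows alone.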

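## **3. Baseline Modeling**



* Detailed requirements
    - Apply at least four algorithms.
    - For each algorithm, build one model with all variables and one with the top N (20) variables, then compare performance.
    - (Optional) For one or two of the algorithms, select the top N (20) variables by importance, build a model, and compare it with the other models.
        * To choose N, add variables one at a time, refit and validate at each step, and pick the point where performance levels off.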
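
The top-N procedure described above can be sketched as a loop that ranks the features once and then refits on growing prefixes of that ranking. This is a minimal sketch on synthetic data from `make_classification`; the helper name `sweep_top_n`, the toy column names, and the use of a random forest for both ranking and evaluation are assumptions for illustration, not the notebook's fixed method.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def sweep_top_n(x_tr, y_tr, x_va, y_va, ranked_cols, sizes):
    """Refit on the first n ranked columns for each n and record validation accuracy."""
    rows = []
    for n in sizes:
        cols = list(ranked_cols[:n])
        clf = RandomForestClassifier(random_state=42)
        clf.fit(x_tr[cols], y_tr)
        rows.append({"n_features": n, "val_accuracy": clf.score(x_va[cols], y_va)})
    return pd.DataFrame(rows)


# Synthetic stand-in for the sensor features (hypothetical data).
X, y = make_classification(n_samples=300, n_features=12, n_informative=4, random_state=42)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(12)])
x_tr, x_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=42)

# Rank the features once with impurity-based importances from a model on all columns.
base = RandomForestClassifier(random_state=42).fit(x_tr, y_tr)
ranked = (
    pd.Series(base.feature_importances_, index=x_tr.columns)
    .sort_values(ascending=False)
    .index
)

sweep = sweep_top_n(x_tr, y_tr, x_va, y_va, ranked, sizes=range(1, 13))
print(sweep)
```

Reading the resulting table from top to bottom shows where extra variables stop improving validation accuracy, which is one reasonable place to set N.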

In [None]:
# Table-only variant of the helper: returns the sorted importance table without plotting.
# The loops below read its 'feature_names' column.
def plot_feature_importance(importance, names, topn = 'all'):
    feature_importance = np.array(importance)
    feature_names = np.array(names)

    data={'feature_names':feature_names,'feature_importance':feature_importance}
    fi_temp = pd.DataFrame(data)

    # Sort by feature importance, highest first
    fi_temp.sort_values(by=['feature_importance'], ascending=False,inplace=True)
    fi_temp.reset_index(drop=True, inplace = True)

    if topn == 'all' :
        fi_df = fi_temp.copy()
    else :
        fi_df = fi_temp.iloc[:topn]

    return fi_df

### (1) Algorithm 1: KNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier()
model.fit(x_train_s, y_train)

In [None]:
y_pred = model.predict(x_val_s)
print(accuracy_score(y_val, y_pred))
print(classification_report(y_val, y_pred))

0.9634664401019541
                    precision    recall  f1-score   support

            LAYING       1.00      1.00      1.00       231
           SITTING       0.92      0.87      0.89       200
          STANDING       0.89      0.94      0.92       226
           WALKING       0.99      1.00      1.00       198
WALKING_DOWNSTAIRS       1.00      0.99      0.99       145
  WALKING_UPSTAIRS       0.99      1.00      1.00       177

          accuracy                           0.96      1177
         macro avg       0.97      0.96      0.97      1177
      weighted avg       0.96      0.96      0.96      1177



In [None]:
from sklearn.inspection import permutation_importance
pfi = permutation_importance(model, x_val_s, y_val, n_repeats=1, scoring = 'accuracy', random_state=42)

In [None]:
fi = plot_feature_importance(pfi.importances_mean, list(x))
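
Arrays returned by a scaler's `fit_transform` carry no column names, so permutation importances have to be paired with the column names of the original DataFrame. The following is a minimal sketch on synthetic data; the `toy_*` names, the `f0` to `f5` column names, and `n_repeats=5` are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the sensor data (hypothetical values).
toy_X, toy_y = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=42
)
toy_X = pd.DataFrame(toy_X, columns=[f"f{i}" for i in range(6)])
toy_x_tr, toy_x_va, toy_y_tr, toy_y_va = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)

toy_scaler = MinMaxScaler()
toy_x_tr_s = toy_scaler.fit_transform(toy_x_tr)  # NumPy array: column names are lost here
toy_x_va_s = toy_scaler.transform(toy_x_va)

toy_knn = KNeighborsClassifier().fit(toy_x_tr_s, toy_y_tr)

# Several repeats give a mean and a spread for each feature.
toy_pfi = permutation_importance(
    toy_knn, toy_x_va_s, toy_y_va, n_repeats=5, scoring="accuracy", random_state=42
)

# Take the names from the unscaled DataFrame; its column order matches the array.
fi_table = (
    pd.DataFrame({
        "feature_names": list(toy_x_tr.columns),
        "importance_mean": toy_pfi.importances_mean,
        "importance_std": toy_pfi.importances_std,
    })
    .sort_values("importance_mean", ascending=False)
    .reset_index(drop=True)
)
print(fi_table)
```

With a single repeat the standard deviation is zero for every feature, so a ranking built from `n_repeats=1` can be noisy.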

In [None]:
len(list(x_train))

561

In [None]:
if isinstance(pfi.importances_mean, np.ndarray):
    print("pfi.importances_mean is a NumPy array.")
else:
    print("pfi.importances_mean is not a NumPy array.")

pfi.importances_mean is a NumPy array.


In [None]:
pfi

{'importances_mean': array([ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00,  5.66411781e-04,
        -2.83205891e-04,  5.66411781e-04,  5.66411781e-04, -2.83205891e-04,
         5.66411781e-04,  1.13282356e-03, -2.83205891e-04, -2.83205891e-04,
         2.83205891e-04, -2.83205891e-04,  8.49617672e-04,  2.83205891e-04,
         5.66411781e-04,  0.00000000e+00, -2.83205891e-04,  8.49617672e-04,
         3.70074342e-17,  2.83205891e-04, -5.66411781e-04, -8.49617672e-04,
        -1.13282356e-03, -2.26564713e-03, -1.69923534e-03, -2.83205891e-04,
        -1.13282356e-03, -5.66411781e-04, -5.66411781e-04,  7.40148683e-17,
        -8.49617672e-04,  1.41602945e-03,  8.49617672e-04,  0.00000000e+00,
         2.83205891e-04,  5.66411781e-04,  1.41602945e-03, -1.41602945e-03,
         0.00000000e+00,  5.66411781e-04,  2.83205891e-04,  0.00000000e+00,
         0.00000000e+00, -8.49617672e-04,  0.00000000e+00,  0.00000000e+00,
        -5.66411781e-04,  0.00000000e+00,  1.13282356e-03,  3.700743

In [None]:
fi.head()

Unnamed: 0,feature_names,feature_importance
0,tBodyGyro-entropy()-Y,0.005098
1,tGravityAcc-entropy()-Y,0.003398
2,tGravityAcc-max()-Y,0.002549
3,tBodyAccJerk-entropy()-Y,0.002549
4,"tBodyGyroJerk-arCoeff()-Y,4",0.002549


In [None]:
# Model with the top-N variables only, for N = 20, 30, ..., 120

for i in range(20, 121, 10):
    cols = fi['feature_names'].values[:i]
    temp_x = x_train[cols]
    knn = KNeighborsClassifier()
    knn.fit(temp_x, y_train)
    print(f"{i} : {knn.score(x_val[cols], y_val)}")

20 : 0.8742565845369583
30 : 0.9048428207306712
40 : 0.9167374681393373
50 : 0.929481733220051
60 : 0.923534409515718
70 : 0.9405267629566695
80 : 0.9379779099405268
90 : 0.945624468988955
100 : 0.945624468988955
110 : 0.945624468988955
120 : 0.9464740866610025


### (2) Algorithm 2: Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(x_train_s, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [None]:
y_pred = model.predict(x_val_s)
print(accuracy_score(y_val, y_pred))
print(classification_report(y_val, y_pred))

0.9881053525913339
                    precision    recall  f1-score   support

            LAYING       1.00      1.00      1.00       231
           SITTING       0.95      0.99      0.97       200
          STANDING       0.99      0.96      0.97       226
           WALKING       1.00      0.99      1.00       198
WALKING_DOWNSTAIRS       1.00      0.99      1.00       145
  WALKING_UPSTAIRS       0.99      1.00      0.99       177

          accuracy                           0.99      1177
         macro avg       0.99      0.99      0.99      1177
      weighted avg       0.99      0.99      0.99      1177



In [None]:
pfi = permutation_importance(model, x_val_s, y_val, n_repeats=3, scoring = 'accuracy', random_state=42)
# x_train_s is a NumPy array without column names, so take the names from x_train
fi = plot_feature_importance(pfi.importances_mean, list(x_train))

### (3) Algorithm 3: XGBoost

In [None]:
from xgboost import XGBClassifier

model = XGBClassifier()
model.fit(x_train, y_train_l)

In [None]:
y_pred = model.predict(x_val)
print(accuracy_score(y_val_l, y_pred))
print(classification_report(y_val_l, y_pred))

0.9923534409515717
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       231
           1       0.98      0.99      0.99       200
           2       1.00      0.98      0.99       226
           3       0.99      0.98      0.99       198
           4       0.99      0.99      0.99       145
           5       0.99      1.00      0.99       177

    accuracy                           0.99      1177
   macro avg       0.99      0.99      0.99      1177
weighted avg       0.99      0.99      0.99      1177



In [None]:
fi = plot_feature_importance(model.feature_importances_, list(x_train))
for i in range(20, 121, 10):
    cols = fi['feature_names'].values[:i]
    temp_x = x_train[cols]
    # XGBClassifier expects integer class labels, so use the encoded targets
    xgb_top = XGBClassifier()
    xgb_top.fit(temp_x, y_train_l)
    print(f"{i} : {xgb_top.score(x_val[cols], y_val_l)}")

20 : 0.9600679694137638
30 : 0.9600679694137638
40 : 0.9617672047578589
50 : 0.9566694987255735
60 : 0.9583687340696686
70 : 0.9549702633814783
80 : 0.9549702633814783
90 : 0.9719626168224299
100 : 0.9668649107901445
110 : 0.9643160577740016
120 : 0.973661852166525


### (4) Algorithm 4: SVM

In [None]:
from sklearn.svm import SVC

model = SVC()
model.fit(x_train_s, y_train)

In [None]:
y_pred = model.predict(x_val_s)
print(accuracy_score(y_val, y_pred))
print(classification_report(y_val, y_pred))

0.9787595581988106
                    precision    recall  f1-score   support

            LAYING       1.00      1.00      1.00       231
           SITTING       0.92      0.97      0.94       200
          STANDING       0.98      0.92      0.95       226
           WALKING       1.00      0.99      1.00       198
WALKING_DOWNSTAIRS       1.00      0.99      1.00       145
  WALKING_UPSTAIRS       0.99      1.00      0.99       177

          accuracy                           0.98      1177
         macro avg       0.98      0.98      0.98      1177
      weighted avg       0.98      0.98      0.98      1177



In [None]:
pfi = permutation_importance(model, x_val_s, y_val, n_repeats=3, scoring = 'accuracy', random_state=42)
# x_train_s is a NumPy array without column names, so take the names from x_train
fi = plot_feature_importance(pfi.importances_mean, list(x_train))

### (5) Algorithm 5: Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(x_train, y_train)

In [None]:
y_pred = model.predict(x_val)
print(accuracy_score(y_val, y_pred))
print(classification_report(y_val, y_pred))

0.9804587935429057
                    precision    recall  f1-score   support

            LAYING       1.00      1.00      1.00       231
           SITTING       0.96      0.98      0.97       200
          STANDING       0.98      0.96      0.97       226
           WALKING       0.98      0.98      0.98       198
WALKING_DOWNSTAIRS       0.97      0.97      0.97       145
  WALKING_UPSTAIRS       0.98      0.98      0.98       177

          accuracy                           0.98      1177
         macro avg       0.98      0.98      0.98      1177
      weighted avg       0.98      0.98      0.98      1177



In [None]:
fi = plot_feature_importance(model.feature_importances_, list(x_train))
# Features are ranked by random forest importance; each top-N subset is scored with KNN
for i in range(20, 121, 10):
    cols = fi['feature_names'].values[:i]
    temp_x = x_train[cols]
    knn = KNeighborsClassifier()
    knn.fit(temp_x, y_train)
    print(f"{i} : {knn.score(x_val[cols], y_val)}")

20 : 0.9592183517417162
30 : 0.9575191163976211
40 : 0.9592183517417162
50 : 0.9575191163976211
60 : 0.9541206457094308
70 : 0.9507221750212405
80 : 0.9566694987255735
90 : 0.9626168224299065
100 : 0.9592183517417162
110 : 0.9566694987255735
120 : 0.9549702633814783


### (6) Deep Learning

In [None]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, Dropout, Input
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import EarlyStopping

In [None]:
model = Sequential([
    Input(shape=(x_train_s.shape[1],)),
    Dense(128, activation='relu'),
    BatchNormalization(),
    Dropout(0.2),
    Dense(128, activation='relu'),
    BatchNormalization(),
    Dropout(0.2),
    Dense(128, activation='relu'),
    BatchNormalization(),
    Dropout(0.2),
    Dense(6, activation='softmax'),
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_8 (Dense)             (None, 128)               71936     
                                                                 
 batch_normalization_6 (Bat  (None, 128)               512       
 chNormalization)                                                
                                                                 
 dropout_6 (Dropout)         (None, 128)               0         
                                                                 
 dense_9 (Dense)             (None, 128)               16512     
                                                                 
 batch_normalization_7 (Bat  (None, 128)               512       
 chNormalization)                                                
                                                                 
 dropout_7 (Dropout)         (None, 128)              

In [None]:
es = EarlyStopping(monitor='val_loss', patience=10, min_delta=0, verbose=1, restore_best_weights=True)

model.fit(x_train_s, y_train_l, verbose=1, epochs=30, callbacks=[es], validation_split=0.2)

Epoch 1/30
...
Epoch 30/30


<keras.src.callbacks.History at 0x7fe66d0f1a20>

In [None]:
model.evaluate(x_val_s, y_val_l)



[0.04314330592751503, 0.9847068786621094]
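
The mission asks for a separate DataFrame that stores each model's performance. One way to build it is to collect one row of metrics per model and assemble the table at the end. The sketch below uses made-up labels and predictions; the helper name `score_row` and the model names are assumptions for illustration.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score


def score_row(name, y_true, y_pred):
    """Return one table row with accuracy and macro F1 for the model called `name`."""
    return {
        "model": name,
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
    }


# Hypothetical labels and predictions standing in for y_val and model.predict(...).
toy_true = ["LAYING", "SITTING", "STANDING", "SITTING", "LAYING", "STANDING"]
toy_pred_a = ["LAYING", "SITTING", "STANDING", "STANDING", "LAYING", "STANDING"]
toy_pred_b = list(toy_true)

rows = [
    score_row("model_a", toy_true, toy_pred_a),
    score_row("model_b", toy_true, toy_pred_b),
]
results = pd.DataFrame(rows)
print(results.sort_values("accuracy", ascending=False))
```

In the notebook, each `score_row` call would take `y_val` (or `y_val_l` for the models trained on encoded labels) and the matching predictions.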

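## **4. Baseline Modeling (selected features)**



* Detailed requirements
    - Apply at least four algorithms.
    - For each algorithm, build one model with all variables and one with the top N variables, then compare performance.
    - (Optional) For one or two of the algorithms, select the top N variables by importance, build a model, and compare it with the other models.
        * To choose N, add variables one at a time, refit and validate at each step, and pick the point where performance levels off.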

In [None]:
# Extract 20 random features (the commented-out line is the 50-feature variant)
# data_sel = data.sample(50,replace=False, axis=1, random_state=42, ignore_index=True) # random_state=2023
data_sel = data.sample(20,replace=False, axis=1, random_state=42, ignore_index=True)
data_sel['Activity'] = data['Activity']

In [None]:
data_sel

Unnamed: 0,"fBodyAccJerk-bandsEnergy()-33,40.1",fBodyAcc-mad()-X,fBodyGyro-meanFreq()-Y,fBodyAcc-min()-X,"tBodyGyroJerk-arCoeff()-Y,1",tBodyGyroMag-arCoeff()2,"fBodyAcc-bandsEnergy()-1,8.1",tBodyGyroJerkMag-min(),"fBodyGyro-bandsEnergy()-57,64.2",fBodyAcc-energy()-Y,...,"tBodyGyroJerk-correlation()-X,Z",tBodyAcc-min()-Y,tGravityAcc-mean()-X,tBodyGyro-std()-Y,"tBodyAcc-arCoeff()-Y,2","tBodyAcc-arCoeff()-Z,1",tBodyGyroMag-std(),"fBodyAcc-bandsEnergy()-33,40",tBodyGyroMag-arCoeff()3,Activity
0,-0.999937,-0.988021,-0.535663,-0.983218,-0.428046,0.072201,-0.998868,-0.988517,-0.999698,-0.998941,...,0.003330,0.681264,0.875254,-0.968673,-0.011862,-0.049915,-0.970071,-0.999775,0.053967,STANDING
1,-0.999919,-0.988538,-0.446218,-0.995414,-0.351409,0.035071,-0.999959,-0.993606,-0.999989,-0.999932,...,0.229115,0.694376,-0.134711,-0.976701,-0.311399,0.005108,-0.974138,-0.999858,0.377881,LAYING
2,-0.999543,-0.997419,0.402094,-0.995747,0.334204,-0.491549,-0.999827,-0.990569,-0.999993,-0.999633,...,-0.237477,0.681985,0.965965,-0.996322,-0.173546,0.562884,-0.992006,-0.999986,0.585057,STANDING
3,-0.901223,-0.146359,-0.185642,-0.697850,-0.274961,-0.361825,-0.499990,-0.603738,-0.704933,-0.501328,...,-0.126100,0.067789,0.927343,-0.432211,0.412219,-0.262981,-0.453516,-0.824705,0.278173,WALKING
4,-0.960068,0.105888,-0.282879,0.358459,-0.380950,0.346127,-0.560876,-0.878159,-0.981072,-0.595957,...,0.043063,0.135817,0.901125,-0.574059,0.565149,-0.291032,-0.246889,-0.643314,-0.091558,WALKING_DOWNSTAIRS
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5876,-0.999638,-0.993514,-0.610779,-0.996199,-0.328224,0.192051,-0.996934,-0.972656,-0.999989,-0.997655,...,0.058509,0.679860,0.973223,-0.970289,-0.107601,0.016549,-0.974275,-0.999863,-0.186759,SITTING
5877,-0.822945,0.013892,-0.622211,-0.617405,-0.270950,0.011222,-0.713397,-0.746841,-0.836757,-0.609367,...,-0.656106,0.304995,0.910932,-0.122764,0.194360,-0.023946,-0.200862,-0.817577,0.158229,WALKING_UPSTAIRS
5878,-0.999893,-0.990401,0.503339,-0.997246,0.399783,-0.607265,-0.999953,-0.984686,-0.999958,-0.999848,...,0.021914,0.691052,-0.514220,-0.991982,-0.231762,0.322018,-0.991276,-0.999830,0.350757,LAYING
5879,-0.956153,-0.022763,-0.195882,-0.521987,-0.252746,-0.035405,-0.423097,-0.834767,-0.978726,-0.441166,...,-0.216026,0.003324,0.921553,-0.494265,0.536557,-0.035675,-0.378939,-0.880905,0.232269,WALKING_UPSTAIRS


In [None]:
target = 'Activity'

x = data_sel.drop(columns=target)
y = data_sel[target]

x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.2, random_state=42)

scaler = MinMaxScaler()
x_train_s = scaler.fit_transform(x_train)
x_val_s = scaler.transform(x_val)

# Create the LabelEncoder
label_encoder = LabelEncoder()
# Encode the training and validation labels
y_train_l = label_encoder.fit_transform(y_train)
y_val_l = label_encoder.transform(y_val)

### (1) Algorithm 1: Random Forest

In [None]:
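# Note: the recorded output below appears to come from a run on the full feature set,
# before x_train was rebuilt from data_sel; the 50-300 sweep needs at least 300 columns.
from sklearn.ensemble import RandomForestClassifier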

model = RandomForestClassifier()
model.fit(x_train, y_train)

y_pred = model.predict(x_val)
print(accuracy_score(y_val, y_pred))
print(classification_report(y_val, y_pred))

fi = plot_feature_importance(model.feature_importances_, list(x_train))
for i in range(50, 301, 50):
    cols = fi['feature_names'].values[:i]
    temp_x = x_train[cols]
    rf = RandomForestClassifier()
    rf.fit(temp_x, y_train)
    print(f"{i} : {rf.score(x_val[cols], y_val)}")

0.9770603228547153
                    precision    recall  f1-score   support

            LAYING       1.00      1.00      1.00       231
           SITTING       0.95      0.97      0.96       200
          STANDING       0.97      0.95      0.96       226
           WALKING       0.98      0.98      0.98       198
WALKING_DOWNSTAIRS       0.97      0.97      0.97       145
  WALKING_UPSTAIRS       0.98      0.98      0.98       177

          accuracy                           0.98      1177
         macro avg       0.98      0.98      0.98      1177
      weighted avg       0.98      0.98      0.98      1177

50 : 0.9762107051826678
100 : 0.9847068819031436
150 : 0.9889549702633815
200 : 0.9813084112149533
250 : 0.9821580288870009
300 : 0.9821580288870009


In [None]:
# Random selection of 50 variables
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(x_train, y_train)

y_pred = model.predict(x_val)
print(accuracy_score(y_val, y_pred))
print(classification_report(y_val, y_pred))

0.8232795242141037
                    precision    recall  f1-score   support

            LAYING       0.78      0.84      0.81       231
           SITTING       0.72      0.64      0.68       200
          STANDING       0.74      0.75      0.75       226
           WALKING       0.89      0.92      0.91       198
WALKING_DOWNSTAIRS       0.92      0.89      0.91       145
  WALKING_UPSTAIRS       0.94      0.94      0.94       177

          accuracy                           0.82      1177
         macro avg       0.83      0.83      0.83      1177
      weighted avg       0.82      0.82      0.82      1177



In [None]:
# Try to improve performance

### (2) Algorithm 2: XGBoost

In [None]:
# Note: the recorded output below matches the full-feature run in section 3,
# so it appears to predate the rebuild of x_train from data_sel.
from xgboost import XGBClassifier

model = XGBClassifier()
model.fit(x_train, y_train_l)

y_pred = model.predict(x_val)
print(accuracy_score(y_val_l, y_pred))
print(classification_report(y_val_l, y_pred))

fi = plot_feature_importance(model.feature_importances_, list(x_train))
for i in range(50, 301, 50):
    cols = fi['feature_names'].values[:i]
    temp_x = x_train[cols]
    xgb_top = XGBClassifier()
    xgb_top.fit(temp_x, y_train_l)
    print(f"{i} : {xgb_top.score(x_val[cols], y_val_l)}")

0.9923534409515717
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       231
           1       0.98      0.99      0.99       200
           2       1.00      0.98      0.99       226
           3       0.99      0.98      0.99       198
           4       0.99      0.99      0.99       145
           5       0.99      1.00      0.99       177

    accuracy                           0.99      1177
   macro avg       0.99      0.99      0.99      1177
weighted avg       0.99      0.99      0.99      1177

50 : 0.9821580288870009
100 : 0.9915038232795242
150 : 0.9932030586236194
200 : 0.9915038232795242
250 : 0.994052676295667
300 : 0.9923534409515717


In [None]:
# Random selection of 20 variables
from xgboost import XGBClassifier

model = XGBClassifier()
model.fit(x_train, y_train_l)

y_pred = model.predict(x_val)
print(accuracy_score(y_val_l, y_pred))
print(classification_report(y_val_l, y_pred))

0.8538657604078165
              precision    recall  f1-score   support

           0       0.83      0.86      0.84       231
           1       0.75      0.68      0.71       200
           2       0.75      0.78      0.77       226
           3       0.94      0.95      0.95       198
           4       0.94      0.94      0.94       145
           5       0.96      0.96      0.96       177

    accuracy                           0.85      1177
   macro avg       0.86      0.86      0.86      1177
weighted avg       0.85      0.85      0.85      1177



In [None]:
# Try to improve performance

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Parameter grid
params = {'max_depth':range(1, 31)}  # 30 candidates

# Base model
xgb_cfi = XGBClassifier(random_state=42)

# Grid search
  # cv=5
  # For RandomizedSearchCV, add n_iter (e.g. n_iter=20); the other arguments stay the same
model = GridSearchCV(xgb_cfi,     # base model
                    params,        # parameter grid
                    cv = 5,        # number of K-Fold splits
                    scoring='accuracy',   # evaluation metric
                    verbose=1
                    )

# Fit the search
model.fit(x_train, y_train_l)

# Inspect the results
print(model.cv_results_['mean_test_score'])  # mean CV score for each candidate
print('Best parameters:', model.best_params_)
print('Best score:', model.best_score_)

Fitting 5 folds for each of 30 candidates, totalling 150 fits
[0.73192823 0.78635087 0.8165399  0.81994099 0.8271696  0.82886721
 0.8297185  0.82652972 0.83269473 0.83141859 0.83460601 0.83035612
 0.83418342 0.8329084  0.83397088 0.83503358 0.8356721  0.83227011
 0.83248287 0.83205757 0.83503493 0.8335458  0.83630791 0.83035747
 0.83418206 0.8322692  0.83290682 0.83269428 0.83269428 0.83269428]
Best parameters: {'max_depth': 23}
Best score: 0.8363079114568024


In [None]:
from xgboost import XGBClassifier

xgb_clf = XGBClassifier()
xgb_clf.fit(x_train, y_train_l)

y_pred_clf = xgb_clf.predict(x_val)
print(accuracy_score(y_val_l, y_pred_clf))
print(classification_report(y_val_l, y_pred_clf))

Parameters: { "mean_child_weight" } are not used.



0.8181818181818182
              precision    recall  f1-score   support

           0       0.80      0.84      0.82       231
           1       0.71      0.62      0.66       200
           2       0.73      0.77      0.75       226
           3       0.88      0.89      0.89       198
           4       0.91      0.88      0.89       145
           5       0.92      0.94      0.93       177

    accuracy                           0.82      1177
   macro avg       0.83      0.82      0.82      1177
weighted avg       0.82      0.82      0.82      1177



### (3) Algorithm 3: Logistic Regression

In [None]:
# Note: the recorded output below matches the full-feature run in section 3,
# so it appears to predate the rebuild of x_train from data_sel.
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(x_train_s, y_train)

y_pred = model.predict(x_val_s)
print(accuracy_score(y_val, y_pred))
print(classification_report(y_val, y_pred))

pfi = permutation_importance(model, x_val_s, y_val, n_repeats=1, scoring = 'accuracy', random_state=42)
fi = plot_feature_importance(pfi.importances_mean, list(x_train))
for i in range(50, 301, 50):
    cols = fi['feature_names'].values[:i]
    temp_x = x_train[cols]
    lr_top = LogisticRegression(max_iter=1000)
    lr_top.fit(temp_x, y_train)
    print(f"{i} : {lr_top.score(x_val[cols], y_val)}")

0.9881053525913339
                    precision    recall  f1-score   support

            LAYING       1.00      1.00      1.00       231
           SITTING       0.95      0.99      0.97       200
          STANDING       0.99      0.96      0.97       226
           WALKING       1.00      0.99      1.00       198
WALKING_DOWNSTAIRS       1.00      0.99      1.00       145
  WALKING_UPSTAIRS       0.99      1.00      0.99       177

          accuracy                           0.99      1177
         macro avg       0.99      0.99      0.99      1177
      weighted avg       0.99      0.99      0.99      1177

50 : 0.9651656754460493
100 : 0.9864061172472387
150 : 0.9872557349192863
200 : 0.9872557349192863
250 : 0.9906542056074766
300 : 0.989804587935429


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [None]:
# Random selection of 50 variables
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(x_train_s, y_train)

y_pred = model.predict(x_val_s)
print(accuracy_score(y_val, y_pred))
print(classification_report(y_val, y_pred))

0.6389124893797791
                    precision    recall  f1-score   support

            LAYING       0.58      0.71      0.64       231
           SITTING       0.36      0.20      0.25       200
          STANDING       0.55      0.65      0.60       226
           WALKING       0.72      0.72      0.72       198
WALKING_DOWNSTAIRS       0.74      0.74      0.74       145
  WALKING_UPSTAIRS       0.84      0.86      0.85       177

          accuracy                           0.64      1177
         macro avg       0.63      0.65      0.63      1177
      weighted avg       0.62      0.64      0.62      1177



### (4) Algorithm 4: SVM

In [None]:
# Random selection of 50 variables
from sklearn.svm import SVC

model = SVC()
model.fit(x_train_s, y_train)

y_pred = model.predict(x_val_s)
print(accuracy_score(y_val, y_pred))
print(classification_report(y_val, y_pred))

0.7102803738317757
                    precision    recall  f1-score   support

            LAYING       0.62      0.80      0.70       231
           SITTING       0.59      0.23      0.33       200
          STANDING       0.60      0.74      0.66       226
           WALKING       0.78      0.86      0.82       198
WALKING_DOWNSTAIRS       0.88      0.79      0.83       145
  WALKING_UPSTAIRS       0.89      0.87      0.88       177

          accuracy                           0.71      1177
         macro avg       0.73      0.71      0.70      1177
      weighted avg       0.71      0.71      0.69      1177



### (5) Deep Learning

In [None]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, Dropout, Input
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import EarlyStopping

In [None]:
model = Sequential([
    Input(shape=(x_train_s.shape[1],)),
    Dense(128, activation='relu'),
    BatchNormalization(),
    Dropout(0.2),
    Dense(256, activation='relu'),
    BatchNormalization(),
    Dropout(0.2),
    Dense(512, activation='relu'),
    BatchNormalization(),
    Dropout(0.2),
    Dense(1024, activation='relu'),
    BatchNormalization(),
    Dropout(0.2),
    Dense(1024, activation='relu'),
    BatchNormalization(),
    Dropout(0.2),
    Dense(512, activation='relu'),
    BatchNormalization(),
    Dropout(0.2),
    Dense(256, activation='relu'),
    BatchNormalization(),
    Dropout(0.2),
    Dense(128, activation='relu'),
    BatchNormalization(),
    Dropout(0.2),
    Dense(6, activation='softmax'),
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

model.summary()

Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_22 (Dense)            (None, 128)               2688      
                                                                 
 batch_normalization_18 (Ba  (None, 128)               512       
 tchNormalization)                                               
                                                                 
 dropout_18 (Dropout)        (None, 128)               0         
                                                                 
 dense_23 (Dense)            (None, 256)               33024     
                                                                 
 batch_normalization_19 (Ba  (None, 256)               1024      
 tchNormalization)                                               
                                                                 
 dropout_19 (Dropout)        (None, 256)              

In [None]:
es = EarlyStopping(monitor='val_loss', patience=30, min_delta=0, verbose=1, restore_best_weights=True)

model.fit(x_train_s, y_train_l, verbose=1, epochs=200, callbacks=[es], validation_split=0.2)

Epoch 1/200
...
Epoch 78

<keras.src.callbacks.History at 0x7e9b7ec0ef80>

In [None]:
model.evaluate(x_val_s, y_val_l)



[0.5081743597984314, 0.7977910041809082]
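
Putting the validation accuracies reported in sections 3 and 4 side by side makes the cost of a random feature subset visible. The numbers below are copied from the outputs above and rounded to four decimals; the table layout, the column names, and the variable name `comparison` are assumptions for illustration.

```python
import pandas as pd

# Validation accuracies taken from the recorded outputs in this notebook.
comparison = pd.DataFrame({
    "model": ["Random Forest", "XGBoost", "Logistic Regression", "SVM", "Deep Learning"],
    "full_features": [0.9805, 0.9924, 0.9881, 0.9788, 0.9847],
    "random_subset": [0.8233, 0.8539, 0.6389, 0.7103, 0.7978],
})

# How much accuracy each model loses when it only sees the random subset.
comparison["accuracy_drop"] = (
    comparison["full_features"] - comparison["random_subset"]
).round(4)

print(comparison.sort_values("accuracy_drop").reset_index(drop=True))
```

In this run the tree-based models lose the least accuracy on the random subset, while logistic regression loses the most.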