#**스마트폰 센서 데이터 기반 모션 분류**
# 단계3 : 단계별 모델링


## 0.미션

단계별로 나눠서 모델링을 수행하고자 합니다.  

* 단계1 : 정적(0), 동적(1) 행동 분류 모델 생성
* 단계2 : 세부 동작에 대한 분류모델 생성
    * 단계1 모델에서 0으로 예측 -> 정적 행동 3가지 분류 모델링
    * 단계1 모델에서 1으로 예측 -> 동적 행동 3가지 분류 모델링
* 모델 통합
    * 두 단계 모델을 통합하고, 새로운 데이터에 대해서 최종 예측결과와 성능평가가 나오도록 함수로 만들기
* 성능 비교
    * 기본 모델링의 성능과 비교
    * 모든 모델링은 [다양한 알고리즘 + 성능 튜닝]을 수행해야 합니다.


## 1.환경설정

### (1) 라이브러리 불러오기

* 세부 요구사항
    - 기본적으로 필요한 라이브러리를 import 하도록 코드가 작성되어 있습니다.
    - 필요하다고 판단되는 라이브러리를 추가하세요.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# 필요하다고 판단되는 라이브러리를 추가하세요.





### (2) 데이터 불러오기

* 주어진 데이터셋
    * data01_train.csv : 학습 및 검증용

 <br/>  

* 세부 요구사항
    - data01_train.csv 를 불러와 'data' 이름으로 저장합니다.
        - data에서 변수 subject는 삭제합니다.
    - data01_test.csv 를 불러와 'new_data' 이름으로 저장합니다.


In [3]:
data = pd.read_csv('/content/drive/MyDrive/KTaivle/3차미니프로젝트/data01_train.csv')
new_data = pd.read_csv('/content/drive/MyDrive/KTaivle/3차미니프로젝트/data01_test.csv')

In [4]:
data.drop(columns=["subject"], inplace=True)

In [34]:
new_data.drop(columns=["subject"], inplace=True)

## 2.데이터 전처리

* 세부 요구사항
    - Label 추가 : data 에 Activity_dynamic 를 추가합니다. Activity_dynamic은 과제1에서 is_dynamic과 동일한 값입니다.
    - x와 y1, y2로 분할하시오.
        * y1 : Activity
        * y2 : Activity_dynamic
    - train : val = 8 : 2 혹은 7 : 3
    - random_state 옵션을 사용하여 다른 모델과 비교를 위해 성능이 재현되도록 합니다.

In [25]:
new_data['Activity_dynamic'] = new_data['Activity'].isin(['WALKING', 'WALKING_UPSTAIRS', 'WALKING_DOWNSTAIRS']).astype(int)

In [6]:
data['Activity_dynamic'] = data['Activity'].isin(['WALKING', 'WALKING_UPSTAIRS', 'WALKING_DOWNSTAIRS']).astype(int)

In [8]:
from sklearn.model_selection import train_test_split

x = data.drop(columns=['Activity', 'Activity_dynamic'])
y1 = data['Activity']
y2 = data['Activity_dynamic']

x_train, x_val, y1_train, y1_val = train_test_split(x, y1, test_size=0.2, random_state=42)
_, _, y2_train, y2_val = train_test_split(x, y2, test_size=0.2, random_state=42)

In [48]:
test_x = new_data.drop(columns=['Activity', 'Activity_dynamic'])
test_y1 = new_data['Activity']
test_y2 = new_data['Activity_dynamic']

## **3.단계별 모델링**

![](https://github.com/DA4BAM/image/blob/main/step%20by%20step.png?raw=true)

### (1) 단계1 : 정적/동적 행동 분류 모델

* 세부 요구사항
    * 정적 행동(Laying, Sitting, Standing)과 동적 행동(동적 : Walking, Walking-Up, Walking-Down)을 구분하는 모델 생성.
    * 몇가지 모델을 만들고 가장 성능이 좋은 모델을 선정하시오.

#### 1) 알고리즘1 :

In [49]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

model_rf = RandomForestClassifier(random_state=42)
model_rf.fit(x_train, y2_train)

pred_rf = model_rf.predict(x_val)

accuracy_rf = accuracy_score(y2_val, pred_rf)

print("RandomForest Activity Model Accuracy:", accuracy_rf)

RandomForest Activity Model Accuracy: 1.0


In [50]:
t_pred_rf = model_rf.predict(test_x)
test_accuracy_rf = accuracy_score(test_y2, t_pred_rf)
print("TEST RandomForest Activity Model Accuracy:", test_accuracy_rf)

TEST RandomForest Activity Model Accuracy: 1.0


#### 2) 알고리즘2 :

In [51]:
from sklearn.linear_model import LogisticRegression

model_lr = LogisticRegression(random_state=42)

model_lr.fit(x_train, y2_train)

pred_lr = model_lr.predict(x_val)

accuracy_lr = accuracy_score(y2_val, pred_lr)

print("Logistic Regression Activity Model Accuracy:", accuracy_lr)

Logistic Regression Activity Model Accuracy: 1.0


In [52]:
t_pred_lr = model_lr.predict(test_x)
test_accuracy_lr = accuracy_score(test_y2, t_pred_lr)
print("TEST Logistic Regression Activity Model Accuracy:", test_accuracy_lr)

TEST Logistic Regression Activity Model Accuracy: 1.0


#### 3) 알고리즘3 :

In [53]:
from sklearn.svm import SVC

model_svm = SVC(random_state=42)

model_svm.fit(x_train, y2_train)

pred_svm = model_svm.predict(x_val)

accuracy_svm = accuracy_score(y2_val, pred_svm)

print("SVM Activity Model Accuracy:", accuracy_svm)

SVM Activity Model Accuracy: 1.0


In [54]:
t_pred_svm = model_svm.predict(test_x)
test_accuracy_svm = accuracy_score(test_y2, t_pred_svm)
print("TEST SVM Activity Model Accuracy:", test_accuracy_svm)

TEST SVM Activity Model Accuracy: 1.0


### (2) 단계2-1 : 정적 동작 세부 분류

* 세부 요구사항
    * 정적 행동(Laying, Sitting, Standing)인 데이터 추출
    * Laying, Sitting, Standing 를 분류하는 모델을 생성
    * 몇가지 모델을 만들고 가장 성능이 좋은 모델을 선정하시오.

In [14]:
subset_data = data[data['Activity'].isin(['LAYING', 'SITTING', 'STANDING'])]

X_subset = subset_data.drop(columns=['Activity', 'Activity_dynamic'])
y_subset = subset_data['Activity']

X_train, X_val, y_train, y_val = train_test_split(X_subset, y_subset, test_size=0.2, random_state=42)

In [44]:
t_subset_data = new_data[data['Activity'].isin(['LAYING', 'SITTING', 'STANDING'])]

t_X_subset = t_subset_data.drop(columns=['Activity', 'Activity_dynamic'])
t_y_subset = t_subset_data['Activity']

  t_subset_data = new_data[data['Activity'].isin(['LAYING', 'SITTING', 'STANDING'])]


In [18]:
rf_classifier = RandomForestClassifier(random_state=42)

rf_classifier.fit(X_train, y_train)

y_pred = rf_classifier.predict(X_val)

accuracy = accuracy_score(y_val, y_pred)

print("RandomForest Model Accuracy:", accuracy)

RandomForest Model Accuracy: 0.9752704791344667


In [45]:
t_y_pred = rf_classifier.predict(t_X_subset)
t_accuracy = accuracy_score(t_y_subset, t_y_pred)

print("TEST RandomForest Model Accuracy:", t_accuracy)

TEST RandomForest Model Accuracy: 0.4348404255319149


In [19]:
lr_classifier = LogisticRegression(random_state=42)

lr_classifier.fit(X_train, y_train)

y_pred = lr_classifier.predict(X_val)

accuracy = accuracy_score(y_val, y_pred)

print("LogisticRegression Model Accuracy:", accuracy)

LogisticRegression Model Accuracy: 0.9752704791344667


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [46]:
t_y_pred = lr_classifier.predict(t_X_subset)
t_accuracy = accuracy_score(t_y_subset, t_y_pred)

print("TEST LogisticRegression Model Accuracy:", t_accuracy)

TEST LogisticRegression Model Accuracy: 0.44148936170212766


In [20]:
svm_classifier = SVC(random_state=42)

svm_classifier.fit(X_train, y_train)

y_pred = svm_classifier.predict(X_val)

accuracy = accuracy_score(y_val, y_pred)

print("SVC Model Accuracy:", accuracy)

SVC Model Accuracy: 0.9489953632148377


In [47]:
t_y_pred = svm_classifier.predict(t_X_subset)
t_accuracy = accuracy_score(t_y_subset, t_y_pred)

print("TEST LogisticRegression Model Accuracy:", t_accuracy)

TEST LogisticRegression Model Accuracy: 0.4401595744680851


### (3) 단계2-2 : 동적 동작 세부 분류

* 세부 요구사항
    * 동적 행동(Walking, Walking Upstairs, Walking Downstairs)인 데이터 추출
    * Walking, Walking Upstairs, Walking Downstairs 를 분류하는 모델을 생성
    * 몇가지 모델을 만들고 가장 성능이 좋은 모델을 선정하시오.

In [21]:
dynamic_data = data[data['Activity'].isin(['WALKING', 'WALKING_UPSTAIRS', 'WALKING_DOWNSTAIRS'])]

X_subset = dynamic_data.drop(columns=['Activity', 'Activity_dynamic'])
y_subset = dynamic_data['Activity']

X_train, X_val, y_train, y_val = train_test_split(X_subset, y_subset, test_size=0.2, random_state=42)

In [55]:
t_dynamic_data = new_data[data['Activity'].isin(['WALKING', 'WALKING_UPSTAIRS', 'WALKING_DOWNSTAIRS'])]

t_X_subset = dynamic_data.drop(columns=['Activity', 'Activity_dynamic'])
t_y_subset = dynamic_data['Activity']

  t_dynamic_data = new_data[data['Activity'].isin(['WALKING', 'WALKING_UPSTAIRS', 'WALKING_DOWNSTAIRS'])]


In [56]:
rf_classifier = RandomForestClassifier(random_state=42)

rf_classifier.fit(X_train, y_train)

y_pred = rf_classifier.predict(X_val)

accuracy = accuracy_score(y_val, y_pred)

print("RandomForest Model Accuracy:", accuracy)

RandomForest Model Accuracy: 0.9886792452830189


In [57]:
t_y_pred = rf_classifier.predict(t_X_subset)

t_accuracy = accuracy_score(t_y_subset, t_y_pred)

print("TEST RandomForest Model Accuracy:", t_accuracy)

TEST RandomForest Model Accuracy: 0.9977332829618436


In [58]:
lr_classifier = LogisticRegression(random_state=42)

lr_classifier.fit(X_train, y_train)

y_pred = lr_classifier.predict(X_val)

accuracy = accuracy_score(y_val, y_pred)

print("LogisticRegression Model Accuracy:", accuracy)

LogisticRegression Model Accuracy: 0.9924528301886792


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [59]:
t_y_pred = lr_classifier.predict(t_X_subset)

t_accuracy = accuracy_score(t_y_subset, t_y_pred)

print("TEST RandomForest Model Accuracy:", t_accuracy)

TEST RandomForest Model Accuracy: 0.9984888553078958


In [24]:
svm_classifier = SVC(random_state=42)

svm_classifier.fit(X_train, y_train)

y_pred = svm_classifier.predict(X_val)

accuracy = accuracy_score(y_val, y_pred)

print("SVC Model Accuracy:", accuracy)

SVC Model Accuracy: 0.9905660377358491


In [60]:
t_y_pred = svm_classifier.predict(t_X_subset)

t_accuracy = accuracy_score(t_y_subset, t_y_pred)

print("TEST RandomForest Model Accuracy:", t_accuracy)

TEST RandomForest Model Accuracy: 0.9965999244427654


### [선택사항] (4) 분류 모델 합치기


* 세부 요구사항
    * 두 단계 모델을 통합하고, 새로운 데이터(test)에 대해서 최종 예측결과와 성능평가가 나오도록 함수로 만들기
    * 데이터 파이프라인 구축 : test데이터가 로딩되어 전처리 과정을 거치고, 예측 및 성능 평가 수행

![](https://github.com/DA4BAM/image/blob/main/pipeline%20function.png?raw=true)

In [252]:
x = data.drop(columns=['Activity'])
y = data['Activity']

x_train, x_valid, y_train, y_valid = train_test_split(x, y, test_size=0.2, random_state=42)

rf_model = RandomForestClassifier(random_state=42)

rf_model.fit(x_train, y_train)

importance = rf_model.feature_importances_
names = x_train.columns

feature_importance_df = pd.DataFrame({'Feature': names, 'Importance': importance})

feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

#### 1) 함수 만들어서 분류 모델 합치기

In [101]:
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [253]:
data = pd.read_csv('/content/drive/MyDrive/KTaivle/3차미니프로젝트/data01_train.csv')
top_n_features = feature_importance_df['Feature'].head(100).tolist()
top_n_features.append("Activity")
top_n_features.append("Activity_dynamic")

data.drop(columns=["subject"], inplace=True)
data['Activity_dynamic'] = data['Activity'].isin(['WALKING', 'WALKING_UPSTAIRS', 'WALKING_DOWNSTAIRS']).astype(int)
columns_to_drop = [col for col in data.columns if col not in top_n_features]
data.drop(columns=columns_to_drop, inplace=True)

x = data.drop(columns=['Activity', 'Activity_dynamic'])

y1 = data['Activity']
y2 = data['Activity_dynamic']

x_train, x_val, y1_train, y1_val = train_test_split(x, y1, test_size=0.2, random_state=42)
_, _, y2_train, y2_val = train_test_split(x, y2, test_size=0.2, random_state=42)

In [254]:
static_data = data[data['Activity'].isin(['LAYING', 'SITTING', 'STANDING'])]

s_X_subset = static_data.drop(columns=['Activity', 'Activity_dynamic'])
s_y_subset = static_data['Activity']

x_train_static, X_val_static, y_train_static, y_val_static = train_test_split(s_X_subset, s_y_subset, test_size=0.2, random_state=42)

In [255]:
dynamic_data = new_data[data['Activity'].isin(['WALKING', 'WALKING_UPSTAIRS', 'WALKING_DOWNSTAIRS'])]

d_X_subset = dynamic_data.drop(columns=['Activity', 'Activity_dynamic'])
d_y_subset = dynamic_data['Activity']

x_train_dynamic, X_val_dynamic, y_train_dynamic, y_val_dynamic = train_test_split(d_X_subset, d_y_subset, test_size=0.2, random_state=42)

  dynamic_data = new_data[data['Activity'].isin(['WALKING', 'WALKING_UPSTAIRS', 'WALKING_DOWNSTAIRS'])]


In [256]:
from sklearn.model_selection import train_test_split

x = data.drop(columns=['Activity', 'Activity_dynamic'])
y1 = data['Activity']
y2 = data['Activity_dynamic']

x_train, x_val, y1_train, y1_val = train_test_split(x, y1, test_size=0.2, random_state=42)
_, _, y2_train, y2_val = train_test_split(x, y2, test_size=0.2, random_state=42)

In [257]:
numeric_features = [col for col in x.columns]

# 전처리 및 모델링을 위한 파이프라인 구성
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),  # 숫자형 변수 스케일링
    ])

# 파이프라인 구성
pipeline_isdynamic_classification = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))  # 랜덤 포레스트 분류기
])

In [258]:
# 파이프라인 훈련
pipeline_isdynamic_classification.fit(x_train, y2)

In [259]:
# 예측
y_isdynamic_pred = pipeline_isdynamic_classification.predict(x_val)

# 동적인 데이터를 분류하는 결과를 바탕으로 해당 데이터가 동적인 경우 동적 활동을 분류하는 모델과 정적 활동을 분류하는 모델을 선택
dynamic_indices = np.where(y_isdynamic_pred == 1)[0]
static_indices = np.where(y_isdynamic_pred == 0)[0]

In [260]:
x_val.reset_index(drop=True, inplace=True)

x_val_dynamic, x_val_static = x_val.iloc[dynamic_indices], x_val.iloc[static_indices]

In [261]:
# 동적 활동 분류 모델 구성
pipeline_dynamic_activity = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))  # 랜덤 포레스트 분류기
])

# 정적 활동 분류 모델 구성
pipeline_static_activity = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state=42))  # 로지스틱 회귀 분류기
])

# 동적 활동 분류 모델 훈련
pipeline_dynamic_activity.fit(x_train_dynamic, y_train_dynamic)

# 정적 활동 분류 모델 훈련
pipeline_static_activity.fit(x_train_static, y_train_static)

# 예측
y_dynamic_activity_pred = pipeline_dynamic_activity.predict(x_val_dynamic)
y_static_activity_pred = pipeline_static_activity.predict(x_val_static)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [262]:
final_prediction = []
d = 0
s = 0

for idx in range(len(y_isdynamic_pred)):
    if y_isdynamic_pred[idx] == 1:
        final_prediction.append(y_dynamic_activity_pred[d])
        d += 1
    else:
        final_prediction.append(y_static_activity_pred[s])
        s += 1

In [263]:
ac = accuracy_score(final_prediction, y1_val)

In [264]:
print(ac)

0.6652506372132541


### 테스트 데이터

In [265]:
new_data = pd.read_csv('/content/drive/MyDrive/KTaivle/3차미니프로젝트/data01_test.csv')

new_data.drop(columns=["subject"], inplace=True)
new_data['Activity_dynamic'] = new_data['Activity'].isin(['WALKING', 'WALKING_UPSTAIRS', 'WALKING_DOWNSTAIRS']).astype(int)
columns_to_drop = [col for col in new_data.columns if col not in top_n_features]
new_data.drop(columns=columns_to_drop, inplace=True)

test_x = new_data.drop(columns=['Activity', 'Activity_dynamic'])

test_y1 = new_data['Activity']
test_y2 = new_data['Activity_dynamic']

In [266]:
numeric_features = [col for col in test_x.columns]

In [267]:
test_isdynamic_pred = pipeline_isdynamic_classification.predict(test_x)

test_dynamic_indices = np.where(test_isdynamic_pred == 1)[0]
test_static_indices = np.where(test_isdynamic_pred == 0)[0]

In [268]:
test_x.reset_index(drop=True, inplace=True)

test_x_dynamic, test_x_static = test_x.iloc[test_dynamic_indices], test_x.iloc[test_static_indices]

In [269]:
test_dynamic_activity_pred = pipeline_dynamic_activity.predict(test_x_dynamic)
test_static_activity_pred = pipeline_static_activity.predict(test_x_static)

In [270]:
test_final_prediction = []
d = 0
s = 0

for idx in range(len(test_isdynamic_pred)):
    if test_isdynamic_pred[idx] == 1:
        test_final_prediction.append(test_dynamic_activity_pred[d])
        d += 1
    else:
        test_final_prediction.append(test_static_activity_pred[s])
        s += 1

In [271]:
test_ac = accuracy_score(test_final_prediction, test_y1)

In [272]:
print(test_ac)

0.4738273283480625
