## learning-AI : HARTH classification (DL)
### Classifying the Human Activity Recognition Trondheim (HARTH) dataset with a U-Net encoder-decoder

<br>

- **임규연 (lky473736)**
- Document written 2024.09.06 to 2024.09.08
- **dataset** : https://archive.ics.uci.edu/dataset/779/harth
- **data abstract** : The Human Activity Recognition Trondheim (HARTH) dataset is a professionally-annotated dataset containing 22 subjects wearing two 3-axial accelerometers for around 2 hours in a free-living setting. The sensors were attached to the right thigh and lower back. The professional recordings and annotations provide a promising benchmark dataset for researchers to develop innovative machine learning approaches for precise HAR in free living.

------



## <span id='dl'><mark>DL</mark></span>
    
HARTH is classified with deep learning. **The classification follows 'U-net(1D-CNN) with Keras' by kmat2019.**

- **Reference**
    - https://www.kaggle.com/code/kmat2019/u-net-1d-cnn-with-keras
    - https://github.com/lky473736/learning-AI/blob/main/report/HARTH/U-net_classification_HARTH.ipynb
    - https://github.com/lky473736/learning-AI/blob/main/insight/insight_4_U_net_on_ion_switching.ipynb

In [150]:
import os
import matplotlib.pyplot as plt
import glob
import numpy as np
import pandas as pd
import tensorflow as tf
from keras.layers import Dense, Dropout, Reshape, Conv1D, BatchNormalization, Activation, AveragePooling1D, GlobalAveragePooling1D, Lambda, Input, Concatenate, Add, UpSampling1D, Multiply
from keras.models import Model
# keras.objectives no longer works, so import from keras.losses instead
from keras.losses import mean_squared_error
from keras import backend as K
from keras.losses import binary_crossentropy, categorical_crossentropy
from keras.callbacks import ModelCheckpoint, EarlyStopping, TensorBoard, ReduceLROnPlateau,LearningRateScheduler
from keras.initializers import random_normal
from keras.optimizers import Adam, RMSprop, SGD
from keras.callbacks import Callback

from sklearn.metrics import cohen_kappa_score, f1_score
from sklearn.model_selection import KFold, train_test_split

In [151]:
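'''
    The original plan was to merge every data file in the harth directory into one dataset,
    but the number of records is so large that a single epoch takes about 10 minutes to train.
    Therefore S006.csv is used as the train set
    and S008.csv as the test set.
'''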

df_train = pd.read_csv("../../data/harth/S006.csv")
df_test = pd.read_csv("../../data/harth/S008.csv")

print (df_train.columns)
print (df_train.info())
print ()
print (df_test.columns)
print (df_test.info())

Index(['timestamp', 'back_x', 'back_y', 'back_z', 'thigh_x', 'thigh_y',
       'thigh_z', 'label'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 408709 entries, 0 to 408708
Data columns (total 8 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   timestamp  408709 non-null  object 
 1   back_x     408709 non-null  float64
 2   back_y     408709 non-null  float64
 3   back_z     408709 non-null  float64
 4   thigh_x    408709 non-null  float64
 5   thigh_y    408709 non-null  float64
 6   thigh_z    408709 non-null  float64
 7   label      408709 non-null  int64  
dtypes: float64(6), int64(1), object(1)
memory usage: 24.9+ MB
None

Index(['timestamp', 'back_x', 'back_y', 'back_z', 'thigh_x', 'thigh_y',
       'thigh_z', 'label'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418989 entries, 0 to 418988
Data columns (total 8 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     

In [152]:
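'''
    info() showed that the DataFrames contain no missing values, so missing-value handling is skipped
'''

# Check the classes and apply label encoding

print (df_train['label'].unique())
print (df_test['label'].unique())

# Use LabelEncoder to map each label to a 0-based, ordered index

from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
df_train['label'] = label_encoder.fit_transform(df_train['label'])
# NOTE: fit_transform refits the encoder on the test labels. The two files do not
# contain the same label set (8 appears only in train, 140 only in test), so the
# same raw code can receive a different index in train and in test.
df_test['label'] = label_encoder.fit_transform(df_test['label'])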

print (df_train['label'].unique())
print (df_test['label'].unique())

[  6   1   3   7   4   5   8 130  13  14]
[  6   1   3   7   4   5  13 130 140  14]
[4 0 1 5 2 3 6 9 7 8]
[4 0 1 5 2 3 6 8 9 7]
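
The printed label sets differ between the two files, so two independently fitted encoders give the same raw code different indices (13 becomes 7 in the train set but 6 in the test set). One way to keep the mapping consistent is to build it from the union of both label sets. The sketch below uses NumPy only and copies the printed values; `encode` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

# Raw label codes as printed above for S006 (train) and S008 (test).
train_codes = np.array([6, 1, 3, 7, 4, 5, 8, 130, 13, 14])
test_codes = np.array([6, 1, 3, 7, 4, 5, 13, 130, 140, 14])

# Sorted union of every code that appears in either file.
classes = np.union1d(train_codes, test_codes)

def encode(codes, classes):
    """Map raw codes to 0-based indices using one shared, sorted class list."""
    return np.searchsorted(classes, codes)

train_encoded = encode(train_codes, classes)
test_encoded = encode(test_codes, classes)

# Code 13 now receives the same index in both files.
print(encode(np.array([13]), classes))
```

With a shared mapping the union contains 11 classes, so a model trained this way would need an 11-wide output layer even though only 10 of the classes occur in the train file.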


In [153]:
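'''
    Oversampling is implemented as a function. The original plan was to use SMOTE.
    SMOTE (Synthetic Minority Over-sampling Technique) builds new synthetic records
    between the existing samples of a minority class.
    However, because it recombines positions at random, it is not suitable for
    time-series data, so the oversampling is implemented by hand.
'''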

# from imblearn.over_sampling import SMOTE

# smote = SMOTE()
# harth_input_resampled, harth_target_resampled = smote.fit_resample(harth_input, 
#                                                                    harth_target)
# print (harth_target_resampled.value_counts())
# print (harth_input_resampled.shape, harth_target_resampled.shape)

# Extract the rows of each class, then resample them
    
# Fewer than 15000 samples in a class -> sample with replace=True (duplicates rows)
# More than 15000 samples in a class -> keep only the first 15000 rows
    
def oversampling(df, target_col, max_size) :
    # List that collects the resampled DataFrame of each class
    dfs = []
    
    for label in df[target_col].unique() :
        class_df = df[df[target_col] == label]
        
        if len(class_df) < max_size :
            # Fewer rows than max_size: draw with replacement until there are max_size rows
            sampled_df = class_df.sample(max_size, replace=True, random_state=42)
        else :
            # max_size rows or more: keep the first max_size rows
            sampled_df = class_df.head(max_size)
        
        # Append to the list
        dfs.append(sampled_df)
    
    # Concatenate the per-class DataFrames
    df_resampled = pd.concat(dfs).reset_index(drop=True)
    
    return df_resampled

df_train_resampled = oversampling(df_train, 'label', max_size=15000)
df_test_resampled = oversampling(df_test, 'label', max_size=15000)
print (df_train_resampled['label'].value_counts())
print ()
print (df_test_resampled['label'].value_counts())

label
4    15000
0    15000
1    15000
5    15000
2    15000
3    15000
6    15000
9    15000
7    15000
8    15000
Name: count, dtype: int64

label
4    15000
0    15000
1    15000
5    15000
2    15000
3    15000
6    15000
9    15000
7    15000
8    15000
Name: count, dtype: int64


In [154]:
# Cut both to 400000 rows so that their shapes match

print (df_train_resampled, df_test_resampled)

# df_train = df_train.iloc[:400000]
# df_test = df_test.iloc[:400000]

# df_train.shape, df_test.shape 

'''
    Step used before oversampling was introduced. 
'''

                      timestamp    back_x    back_y    back_z   thigh_x  \
0       2019-01-12 00:00:00.000 -0.760242  0.299570  0.468570 -5.092732   
1       2019-01-12 00:00:00.010 -0.530138  0.281880  0.319987  0.900547   
2       2019-01-12 00:00:00.020 -1.170922  0.186353 -0.167010 -0.035442   
3       2019-01-12 00:00:00.030 -0.648772  0.016579 -0.054284 -1.554248   
4       2019-01-12 00:00:00.040 -0.355071 -0.051831 -0.113419 -0.547471   
...                         ...       ...       ...       ...       ...   
149995  2019-01-12 00:54:55.410 -0.651213  0.181842  0.266343  0.563873   
149996  2019-01-12 00:55:21.380 -0.533637  0.048316  0.446973 -0.991452   
149997  2019-01-12 00:55:26.360 -1.064464  0.250127  0.788185 -1.994850   
149998  2019-01-12 00:55:27.450 -0.335641 -0.199129  0.221672 -0.892103   
149999  2019-01-12 00:54:54.610 -0.706449  0.052887  0.540791 -0.852804   

         thigh_y   thigh_z  label  
0      -0.298644  0.709439      4  
1       0.286944  0.340309 

'\n    Step used before oversampling was introduced. \n'

In [155]:
# Reshape the feature columns into a 3D ndarray of (windows, timesteps, channels)

train_input = df_train_resampled[['back_x', 'back_y', 
                        'back_z', 'thigh_x', 
                        'thigh_y', 'thigh_z']].values.reshape(-1, 100, 6)
train_input_mean = train_input.mean()
train_input_sigma = train_input.std()
train_input = (train_input - train_input_mean) / train_input_sigma

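# Use the same window length (100) as the train set so the shape matches the model input
test_input = df_test_resampled[['back_x', 'back_y', 
                        'back_z', 'thigh_x', 
                        'thigh_y', 'thigh_z']].values.reshape(-1, 100, 6)
# Standardize with the train-set statistics
test_input = (test_input - train_input_mean) / train_input_sigma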

In [156]:
# Use label as the target and one-hot encode it

train_target = pd.get_dummies(df_train_resampled["label"]).values.reshape(-1, 100, len(df_train_resampled['label'].unique()))

In [157]:
# Build the validation set

idx = np.arange(train_input.shape[0])
train_idx, val_idx = train_test_split(idx, 
                                      random_state=111, 
                                      test_size=0.2)
# train_test_split shuffles the indices, which avoids an ordering bias

val_input = train_input[val_idx]
train_input = train_input[train_idx]
val_target = train_target[val_idx]
train_target = train_target[train_idx]

print("train_input:{}, val_input:{}, train_target:{}, val_target:{}".format(train_input.shape, val_input.shape, train_target.shape, val_target.shape))

train_input:(1200, 100, 6), val_input:(300, 100, 6), train_target:(1200, 100, 10), val_target:(300, 100, 10)
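
The reshape and standardization steps above can be illustrated on toy data. This is a minimal sketch with random numbers standing in for the six accelerometer columns. The array names and sizes are assumptions chosen for the example, not values from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the six accelerometer columns: rows x 6 channels.
train_rows = rng.normal(loc=0.5, scale=2.0, size=(1000, 6))
test_rows = rng.normal(loc=0.5, scale=2.0, size=(500, 6))

window = 100  # samples per window, matching the model input length

# reshape(-1, window, 6) cuts consecutive rows into non-overlapping windows;
# the row count must be a multiple of `window`.
train_windows = train_rows.reshape(-1, window, 6)
test_windows = test_rows.reshape(-1, window, 6)

# One global mean and standard deviation, computed on the training data only.
mean, sigma = train_windows.mean(), train_windows.std()
train_scaled = (train_windows - mean) / sigma
test_scaled = (test_windows - mean) / sigma  # reuse the training statistics

print(train_scaled.shape, test_scaled.shape)
```

Reusing the training mean and standard deviation on the test windows keeps test information out of the preprocessing.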


----

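Because of ```ValueError: A `Concatenate` layer requires inputs with matching shapes except for the concatenation axis. Received: input_shape=[(None, 5, 256), (None, 4, 192)]```, the original source code is modified only as far as needed, leaving its logic intact.

- Fixes the problem in the `Unet` function where the upsampled tensor no longer matches the skip connection in length, so `Concatenate` fails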
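
The length mismatch can be reproduced with plain arithmetic. With `padding="same"`, a strided `Conv1D` produces `ceil(length / stride)` timesteps, while `UpSampling1D(size=s)` multiplies the length by `s`. The sketch below assumes the original network used three stride-5 downsampling stages. That is an assumption, but it is consistent with the `(None, 5, 256)` versus `(None, 4, 192)` shapes in the error. Both helper functions are hypothetical and written for this illustration only.

```python
import math

def encoder_lengths(length, strides):
    """Sequence length after each strided Conv1D with 'same' padding."""
    lengths = [length]
    for s in strides:
        # 'same' padding yields ceil(input_length / stride) timesteps.
        lengths.append(math.ceil(lengths[-1] / s))
    return lengths

def skip_mismatches(length, strides):
    """Return (upsampled_length, skip_length) pairs that cannot be concatenated."""
    lengths = encoder_lengths(length, strides)
    bad = []
    # Walk back up the decoder: UpSampling1D(size=s) multiplies the length by s.
    for level in range(len(strides), 0, -1):
        upsampled = lengths[level] * strides[level - 1]
        skip = lengths[level - 1]
        if upsampled != skip:
            bad.append((upsampled, skip))
    return bad

print(encoder_lengths(100, [5, 5]))     # two stages, as in the Unet below
print(skip_mismatches(100, [5, 5]))     # no mismatch
print(skip_mismatches(100, [5, 5, 5]))  # assumed original depth
```

For a window of 100 samples, two stride-5 stages give lengths 100, 20 and 4, which upsample back cleanly. A third stage would reduce 4 to 1, and upsampling 1 by 5 gives 5 rather than 4.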

In [158]:
def cbr(x, out_layer, kernel, stride, dilation):
    x = Conv1D(out_layer, kernel_size=kernel, dilation_rate=dilation, strides=stride, padding="same")(x)
    x = BatchNormalization()(x)
    x = Activation("relu")(x)
    return x

In [159]:
def se_block(x_in, layer_n):
    x = GlobalAveragePooling1D()(x_in)
    x = Dense(layer_n // 8, activation="relu")(x)
    x = Dense(layer_n, activation="sigmoid")(x)
    x_out = Multiply()([x_in, x])
    return x_out
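
What `se_block` computes can be written out in NumPy. It averages each channel over time, passes the result through a small bottleneck, and rescales each channel by a sigmoid gate. This is only an illustrative sketch: the random weights stand in for the trained `Dense` kernels, and `se_block_numpy` is a hypothetical name.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_block_numpy(x_in, w1, b1, w2, b2):
    """Squeeze-and-excitation on a (batch, time, channels) array."""
    squeezed = x_in.mean(axis=1)                  # squeeze: (batch, channels)
    hidden = np.maximum(squeezed @ w1 + b1, 0.0)  # ReLU bottleneck: (batch, channels // 8)
    gate = sigmoid(hidden @ w2 + b2)              # per-channel gate in (0, 1)
    return x_in * gate[:, None, :]                # broadcast the gate over time

rng = np.random.default_rng(0)
channels = 64
x = rng.normal(size=(2, 100, channels))
w1 = rng.normal(size=(channels, channels // 8)) * 0.1
b1 = np.zeros(channels // 8)
w2 = rng.normal(size=(channels // 8, channels)) * 0.1
b2 = np.zeros(channels)

out = se_block_numpy(x, w1, b1, w2, b2)
print(out.shape)
```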

In [160]:
def resblock(x_in, layer_n, kernel, dilation, use_se=True):
    x = cbr(x_in, layer_n, kernel, 1, dilation)
    x = cbr(x, layer_n, kernel, 1, dilation)
    if use_se:
        x = se_block(x, layer_n)
    x = Add()([x_in, x])
    return x

In [161]:
def Unet(input_shape=(100, 6)):
    layer_n = 64
    kernel_size = 7
    depth = 2

    input_layer = Input(shape=input_shape)

    ########## encoder 
    x = cbr(input_layer, layer_n, kernel_size, 1, 1)  
    # shape: (None, 100, 64)
    for i in range(depth):
        x = resblock(x, layer_n, kernel_size, 1)
    out_0 = x

    x = cbr(x, layer_n*2, kernel_size, 5, 1)  
    # shape: (None, 20, 128)
    for i in range(depth):
        x = resblock(x, layer_n*2, kernel_size, 1)
    out_1 = x

    x = cbr(x, layer_n*3, kernel_size, 5, 1)  
    # shape: (None, 4, 192)
    for i in range(depth):
        x = resblock(x, layer_n*3, kernel_size, 1)
    out_2 = x

    ########## Decoder
    x = UpSampling1D(size=5)(x)  # upsample to (None, 20, 192)
    x = Concatenate()([x, out_1])  # concatenate with out_1 (None, 20, 128)
    x = Conv1D(layer_n*2, kernel_size=1, padding="same")(x)  
    x = cbr(x, layer_n*2, kernel_size, 1, 1) 
    # shape: (None, 20, 128)

    x = UpSampling1D(size=5)(x)  # upsample to (None, 100, 128)
    x = Concatenate()([x, out_0])  # concatenate with out_0 (None, 100, 64)
    x = Conv1D(layer_n, kernel_size=1, padding="same")(x) 
    x = cbr(x, layer_n, kernel_size, 1, 1) 
    # shape: (None, 100, 64)

    ######### Dense
    num_classes = len(df_train_resampled['label'].unique())
    x = Conv1D(num_classes, kernel_size=1, 
               strides=1, padding="same")(x)
    out = Activation("softmax")(x)

    model = Model(inputs=input_layer, outputs=out)

    return model


In [162]:
def augmentations(input_data, target_data):
    #flip
    if np.random.rand() < 0.5:    
        input_data = input_data[::-1]
        target_data = target_data[::-1]

    return input_data, target_data

In [163]:
def Datagen(input_dataset, target_dataset, batch_size, is_train=False):
    x = []
    y = []
    count = 0
    idx_1 = np.arange(len(input_dataset))
    np.random.shuffle(idx_1)

    while True:
        for i in range(len(input_dataset)):
            input_data = input_dataset[idx_1[i]]
            target_data = target_dataset[idx_1[i]]

            if is_train:
                input_data, target_data = augmentations(input_data, target_data)
                
            x.append(input_data)
            y.append(target_data)
            count += 1
            
            if count == batch_size:
                x=np.array(x, dtype=np.float32)
                y=np.array(y, dtype=np.float32)
                inputs = x
                targets = y       
                x = []
                y = []
                count=0
                yield inputs, targets

In [164]:
# class macroF1(Callback):
#     def __init__(self, model, inputs, targets):
#         self.model = model
#         self.inputs = inputs
#         self.targets = np.argmax(targets, axis=2).reshape(-1)

#     def on_epoch_end(self, epoch, logs):
#         # At the end of each epoch, compute and print the macro F1 score on the validation data
#         pred = np.argmax(self.model.predict(self.inputs), axis=2).reshape(-1)
#         f1_val = f1_score(self.targets, pred, average="macro")
#         print("val_f1_macro_score: ", f1_val)

'''

'''

from sklearn.metrics import f1_score
from tensorflow.keras.callbacks import Callback
import numpy as np

class macroF1(Callback):
    def __init__(self, inputs, targets):
        super(macroF1, self).__init__()
        self.inputs = inputs
        self.targets = np.argmax(targets, axis=2).reshape(-1)

    def on_epoch_end(self, epoch, logs=None):
        predictions = self.model.predict(self.inputs) 
        pred = np.argmax(predictions, axis=2).reshape(-1)

        # Calculate the macro F1 score
        f1_val = f1_score(self.targets, pred, average="macro")
        
        # Add the F1 score to logs (Keras passes a dict; guard against None)
        if logs is not None:
            logs['val_f1_macro_score'] = f1_val

        print(f"Epoch {epoch + 1} - val_f1_macro_score: {f1_val:.4f}")
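
The callback flattens the per-timestep predictions before scoring, so every timestep of every window counts as one sample. The sketch below shows that flattening and a hand-written macro F1 on toy arrays. `macro_f1` is a hypothetical NumPy helper used in place of `sklearn.metrics.f1_score`, and it averages over a fixed `num_classes`.

```python
import numpy as np

def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 over flattened integer labels."""
    scores = []
    for c in range(num_classes):
        tp = np.sum((y_true == c) & (y_pred == c))
        fp = np.sum((y_true != c) & (y_pred == c))
        fn = np.sum((y_true == c) & (y_pred != c))
        denom = 2 * tp + fp + fn
        # A class with no true and no predicted samples scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

# Toy one-hot targets and predictions: 2 windows, 3 timesteps, 3 classes.
targets = np.eye(3)[np.array([[0, 0, 1], [2, 2, 1]])]
probs = np.eye(3)[np.array([[0, 1, 1], [2, 2, 1]])]

# Same flattening as in the callback: argmax over classes, then one long vector.
y_true = np.argmax(targets, axis=2).reshape(-1)
y_pred = np.argmax(probs, axis=2).reshape(-1)
print(macro_f1(y_true, y_pred, num_classes=3))
```

Macro averaging weights every class equally, which makes the score sensitive to minority classes.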


In [165]:
# def model_fit(model, train_inputs, train_targets, val_inputs, val_targets, n_epoch, batch_size=32):
#     # Function that trains the model
#     hist = model.fit_generator(
#         Datagen(train_inputs, train_targets, batch_size, is_train=True),
#         steps_per_epoch=len(train_inputs) // batch_size,
#         epochs=n_epoch,
#         validation_data=Datagen(val_inputs, val_targets, batch_size),
#         validation_steps=len(val_inputs) // batch_size,
#         callbacks=[lr_schedule, macroF1(model, val_inputs, val_targets)],
#         shuffle=False,
#         verbose=1
#     )
#     return hist

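'''
    fit_generator did not work properly, so this part was modified to call model.fit on the arrays directly
    (as a result, Datagen and augmentations above are not used during training)
'''

def model_fit(model, train_inputs, train_targets, val_inputs, val_targets, n_epoch, batch_size=32):
    # Train the model and return the training history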
    hist = model.fit(
        train_inputs, train_targets,
        batch_size=batch_size,
        epochs=n_epoch,
        validation_data=(val_inputs, val_targets),
        callbacks=[lr_schedule, macroF1(val_inputs, val_targets)],
        shuffle=True, 
        verbose=1
    )
    return hist


In [166]:
def lrs(epoch):
    if epoch<35:
        lr = learning_rate
    elif epoch<50:
        lr = learning_rate/10
    else:
        lr = learning_rate/100
    return lr
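
The schedule is piecewise constant: the base rate for the first 35 epochs, one tenth of it until epoch 50, and one hundredth afterwards. The sketch below restates it with the base rate passed as a parameter instead of read from a global. `step_schedule` is a hypothetical name, and 0.0005 is the value set in the training cell.

```python
def step_schedule(epoch, base_lr=0.0005):
    """Piecewise-constant learning rate: base, base/10, base/100."""
    if epoch < 35:
        return base_lr
    if epoch < 50:
        return base_lr / 10
    return base_lr / 100

# Keras passes 0-based epoch indices to the scheduler function.
for epoch in (0, 34, 35, 49, 50, 59):
    print(epoch, step_schedule(epoch))
```

Because the index is 0-based, the first drop takes effect in the epoch that the training log displays as 36.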

In [167]:
K.clear_session()
model = Unet()
#print(model.summary())

learning_rate=0.0005
n_epoch=60
batch_size=32

lr_schedule = LearningRateScheduler(lrs)

#regressor
#model.compile(loss="mean_squared_error", 
#              optimizer=Adam(lr=learning_rate),
#              metrics=["mean_absolute_error"])

#classifier
model.compile(loss=categorical_crossentropy, 
              optimizer=Adam(learning_rate=learning_rate), 
              metrics=["accuracy"])

hist = model_fit(model, train_input, train_target, val_input, val_target, n_epoch, batch_size)


Epoch 1/60
10/10 ━━━━━━━━━━━━━━━━━━━━ 1s 83ms/step
Epoch 1 - val_f1_macro_score: 0.2898
38/38 ━━━━━━━━━━━━━━━━━━━━ 17s 152ms/step - accuracy: 0.5800 - loss: 1.3398 - val_accuracy: 0.3990 - val_loss: 1.7583 - learning_rate: 5.0000e-04 - val_f1_macro_score: 0.2898
Epoch 2/60
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 24ms/step
Epoch 2 - val_f1_macro_score: 0.3267
38/38 ━━━━━━━━━━━━━━━━━━━━ 5s 119ms/step - accuracy: 0.9437 - loss: 0.2949 - val_accuracy: 0.4021 - val_loss: 2.0934 - learning_rate: 5.0000e-04 - val_f1_macro_score: 0.3267
Epoch 3/60
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 27ms/step
Epoch 3 - val_f1_macro_score: 0.3322
38/38 ━━━━━━━━━━━━━━━━━━━━ 5s 122ms/step - accuracy: 0.9657 - loss: 0.1853 - val_accuracy: 0.3

Epoch 24/60
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 47ms/step
Epoch 24 - val_f1_macro_score: 0.9814
38/38 ━━━━━━━━━━━━━━━━━━━━ 6s 157ms/step - accuracy: 0.9893 - loss: 0.0449 - val_accuracy: 0.9821 - val_loss: 0.0483 - learning_rate: 5.0000e-04 - val_f1_macro_score: 0.9814
Epoch 25/60
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 41ms/step
Epoch 25 - val_f1_macro_score: 0.9656
38/38 ━━━━━━━━━━━━━━━━━━━━ 5s 141ms/step - accuracy: 0.9940 - loss: 0.0301 - val_accuracy: 0.9660 - val_loss: 0.1095 - learning_rate: 5.0000e-04 - val_f1_macro_score: 0.9656
Epoch 26/60
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 40ms/step
Epoch 26 - val_f1_macro_score: 0.8295
38/38 ━━━━━━━━━━━━━━━━━━━━ 5s 142ms/step - accuracy: 0.9844 - loss: 0.0540 - val_accuracy:

Epoch 47/60
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 31ms/step
Epoch 47 - val_f1_macro_score: 0.9955
38/38 ━━━━━━━━━━━━━━━━━━━━ 5s 143ms/step - accuracy: 0.9976 - loss: 0.0111 - val_accuracy: 0.9963 - val_loss: 0.0278 - learning_rate: 5.0000e-05 - val_f1_macro_score: 0.9955
Epoch 48/60
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 40ms/step
Epoch 48 - val_f1_macro_score: 0.9955
38/38 ━━━━━━━━━━━━━━━━━━━━ 5s 129ms/step - accuracy: 0.9993 - loss: 0.0064 - val_accuracy: 0.9963 - val_loss: 0.0279 - learning_rate: 5.0000e-05 - val_f1_macro_score: 0.9955
Epoch 49/60
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 36ms/step
Epoch 49 - val_f1_macro_score: 0.9955
38/38 ━━━━━━━━━━━━━━━━━━━━ 5s 139ms/step - accuracy: 0.9998 - loss: 0.0050 - val_accuracy

In [169]:
pred = np.argmax((model.predict(val_input)+model.predict(val_input[:,::-1,:])[:,::-1,:])/2, axis=2).reshape(-1)
gt = np.argmax(val_target, axis=2).reshape(-1)
print("SCORE_oldmetric: ", cohen_kappa_score(gt, pred, weights="quadratic"))
print("SCORE_newmetric: ", f1_score(gt, pred, average="macro"))

10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
10/10 ━━━━━━━━━━━━━━━━━━━━ 0s 39ms/step
SCORE_oldmetric:  0.9966725728498582
SCORE_newmetric:  0.9951357875189869
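
The prediction line above applies a simple test-time augmentation. It predicts on each window and on its time-reversed copy, reverses the second output back, and averages the two before taking the argmax. The sketch below isolates that step. `dummy_predict` is a hypothetical stand-in for `model.predict` so that the example runs without a trained network.

```python
import numpy as np

def predict_with_flip_tta(predict_fn, inputs):
    """Average predictions on the original and time-reversed windows."""
    forward = predict_fn(inputs)
    # Reverse the time axis, predict, then reverse the output back so that
    # each timestep lines up with the forward prediction.
    backward = predict_fn(inputs[:, ::-1, :])[:, ::-1, :]
    return (forward + backward) / 2

def dummy_predict(x):
    """Hypothetical stand-in for model.predict: softmax over 3 fake logits."""
    # The cumulative sum along time makes the output depend on direction.
    logits = np.cumsum(x[:, :, :3], axis=1)
    e = np.exp(logits - logits.max(axis=2, keepdims=True))
    return e / e.sum(axis=2, keepdims=True)

rng = np.random.default_rng(0)
windows = rng.normal(size=(4, 100, 6))

probs = predict_with_flip_tta(dummy_predict, windows)
labels = np.argmax(probs, axis=2).reshape(-1)
print(probs.shape, labels.shape)
```

Averaging two probability vectors keeps each timestep's distribution normalized, so the argmax can be taken directly on the result.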
