## learning-AI101 : HARTH classification (DL)
### Classifying the Human Activity Recognition Trondheim (HARTH) dataset with an Inception model (GoogLeNet)

<br>

- **임규연 (lky473736)**
- Written 2024.09.01. ~ 2024.09.04.
- **dataset** : https://archive.ics.uci.edu/dataset/779/harth
- **data abstract** : The Human Activity Recognition Trondheim (HARTH) dataset is a professionally-annotated dataset containing 22 subjects wearing two 3-axial accelerometers for around 2 hours in a free-living setting. The sensors were attached to the right thigh and lower back. The professional recordings and annotations provide a promising benchmark dataset for researchers to develop innovative machine learning approaches for precise HAR in free living.

------


## <span id='dl'><mark>DL</mark></span>
    
HARTH is classified here with deep learning, using GoogLeNet (the Inception model). **The model architecture and source code from "iSPLInception: Redefining the State-of-the-Art for Human Activity Recognition", registered in the IEEE Sensor journal, serve as the reference for classifying HARTH.**

- **Reference**
    - https://github.com/MyungKyuYi/HAR/blob/main/iSPLInception_WISDM_0730.ipynb
    - https://github.com/rmutegeki/iSPLInception
    - https://github.com/healthDataScience/deep-learning-HAR/blob/master/HAR-CNN-Inception.ipynb
    - https://ieeexplore.ieee.org/document/9425494

In [1]:
# Import libraries

import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf
import matplotlib.pyplot as plt
from tensorflow import keras
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder



In [2]:
'''
    The original plan was to train on every data file in the harth directory merged together,
    but with that many records a single epoch takes about 10 minutes of training time.
    Therefore only S006.csv and S008.csv are merged and used for training.
'''

import glob

directory_path = '../../data/harth'

# Paths of the two subject files that are actually loaded
file_name = [directory_path + '/' + compo for compo in ['S006.csv', 'S008.csv']]

# Collect the paths of all CSV files (only printed for reference)
all_files = glob.glob(directory_path + "/*.csv")
print ("File names in the harth directory : ", all_files)

# Empty list that will hold the DataFrames
df_list = []

# Iterate over the selected files and append each DataFrame
for filename in file_name :
    df = pd.read_csv(filename)
    df_list.append(df)

merged_df = pd.concat(df_list, ignore_index=True)  # merge and renumber the row index from 0
df = merged_df

print ()
print (df.shape)
print (df.head())

File names in the harth directory :  ['../../data/harth/S016.csv', '../../data/harth/S017.csv', '../../data/harth/S029.csv', '../../data/harth/S015.csv', '../../data/harth/S014.csv', '../../data/harth/S028.csv', '../../data/harth/S010.csv', '../../data/harth/S013.csv', '../../data/harth/S012.csv', '../../data/harth/S006.csv', '../../data/harth/S023.csv', '../../data/harth/S022.csv', '../../data/harth/S008.csv', '../../data/harth/S020.csv', '../../data/harth/S021.csv', '../../data/harth/S009.csv', '../../data/harth/S025.csv', '../../data/harth/S019.csv', '../../data/harth/S018.csv', '../../data/harth/S024.csv', '../../data/harth/S026.csv', '../../data/harth/S027.csv']

(827698, 8)
                 timestamp    back_x    back_y    back_z   thigh_x   thigh_y  \
0  2019-01-12 00:00:00.000 -0.760242  0.299570  0.468570 -5.092732 -0.298644   
1  2019-01-12 00:00:00.010 -0.530138  0.281880  0.319987  0.900547  0.286944   
2  2019-01-12 00:00:00.020 -1.170922  0.186353 -0.167010 -0.035442 -0.078423   
3  20

In [3]:
# Drop the timestamp column

df = df.drop('timestamp', axis=1)
df

| | back_x | back_y | back_z | thigh_x | thigh_y | thigh_z | label |
|---|---|---|---|---|---|---|---|
| 0 | -0.760242 | 0.299570 | 0.468570 | -5.092732 | -0.298644 | 0.709439 | 6 |
| 1 | -0.530138 | 0.281880 | 0.319987 | 0.900547 | 0.286944 | 0.340309 | 6 |
| 2 | -1.170922 | 0.186353 | -0.167010 | -0.035442 | -0.078423 | -0.515212 | 6 |
| 3 | -0.648772 | 0.016579 | -0.054284 | -1.554248 | -0.950978 | -0.221140 | 6 |
| 4 | -0.355071 | -0.051831 | -0.113419 | -0.547471 | 0.140903 | -0.653782 | 6 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 827693 | -0.988998 | -0.123216 | -0.207160 | -1.042779 | -0.060467 | -0.072847 | 3 |
| 827694 | -1.010435 | -0.119318 | -0.231946 | -0.927346 | -0.089707 | -0.047210 | 3 |
| 827695 | -0.988791 | -0.099315 | -0.193497 | -1.067241 | 0.001182 | -0.091760 | 3 |
| 827696 | -1.028768 | -0.132392 | -0.216164 | -0.876475 | -0.069528 | -0.038416 | 3 |


In [4]:
# Check the class frequencies of the target (as counts)

df['label'].value_counts()

label
7      545841
6      127904
1       46343
3       41520
13      32588
8       13036
130     11460
14       3878
5        2300
4        2078
140       750
Name: count, dtype: int64
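
The counts above can be turned into shares to make the imbalance explicit. A minimal sketch, assuming the printed counts are re-entered by hand (the `counts` Series here is a stand-in for `df['label'].value_counts()`, not the notebook's own variable):

```python
import pandas as pd

# Counts copied from the value_counts() output above (stand-in for the real Series).
counts = pd.Series(
    {7: 545841, 6: 127904, 1: 46343, 3: 41520, 13: 32588, 8: 13036,
     130: 11460, 14: 3878, 5: 2300, 4: 2078, 140: 750},
    name="count",
)

total = int(counts.sum())
shares = counts / total                         # fraction of rows per label
imbalance_ratio = counts.max() / counts.min()   # largest class vs. smallest class

print(total)
print(shares.round(4).head(3))
print(round(imbalance_ratio, 1))
```

The total matches the 827698 rows reported by `df.shape`, and label 7 alone accounts for roughly two thirds of all records.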

In [5]:
# Check for missing values and replace any with the mean of the corresponding column

print (df.isnull().sum())
df = df.fillna(df.mean())

print (df.isnull().sum())

back_x     0
back_y     0
back_z     0
thigh_x    0
thigh_y    0
thigh_z    0
label      0
dtype: int64
back_x     0
back_y     0
back_z     0
thigh_x    0
thigh_y    0
thigh_z    0
label      0
dtype: int64


In [6]:
# Use label encoding to map the labels to consecutive, 0-based integers in sorted order

from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
df['label'] = label_encoder.fit_transform(df['label'])

df['label'].value_counts()

label
5     545841
4     127904
0      46343
1      41520
7      32588
6      13036
9      11460
8       3878
3       2300
2       2078
10       750
Name: count, dtype: int64
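
Because `LabelEncoder` sorts the raw codes before numbering them, the encoded value of each class can be read off `classes_`. The following sketch rebuilds that mapping from the raw codes printed earlier; the `raw_labels` list and `mapping` dict are illustrative names, not part of the notebook:

```python
from sklearn.preprocessing import LabelEncoder

# Raw label codes seen in the value_counts() output before encoding.
raw_labels = [7, 6, 1, 3, 13, 8, 130, 14, 5, 4, 140]

encoder = LabelEncoder().fit(raw_labels)

# classes_ is sorted, and the position of each raw code is its encoded value.
mapping = {int(raw): idx for idx, raw in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the raw codes, e.g. for labelling a confusion matrix.
print(encoder.inverse_transform([5, 4, 10]))
```

This is consistent with the two outputs above: raw label 7 (545841 rows) becomes 5, and raw label 140 (750 rows) becomes 10.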

In [7]:
'''
    The records are currently concentrated far too heavily on label 5.
    Therefore the data is resampled so that every label ends up with 15000 records.
'''

# Extract the data class by class, then resample each class to a fixed size
    
# Fewer than 15000 samples -> draw with replace=True (rows are duplicated)
# 15000 samples or more    -> keep only the first 15000 rows (no duplication)
    
def oversampling(df, target_col, max_size) :
    # List that collects the per-class results
    dfs = []
    
    for label in df[target_col].unique() :
        class_df = df[df[target_col] == label]
        
        if len(class_df) < max_size :
            # Fewer rows than max_size: sample with replacement up to max_size
            sampled_df = class_df.sample(max_size, replace=True, random_state=42)
        else :
            # max_size rows or more: take the first max_size rows
            sampled_df = class_df.head(max_size)
        
        # Append to the list
        dfs.append(sampled_df)
    
    # Concatenate the per-class DataFrames
    df_resampled = pd.concat(dfs).reset_index(drop=True)
    
    return df_resampled

df_resampled = oversampling(df, 'label', max_size=15000)
print (df_resampled['label'].value_counts())

label
4     15000
0     15000
1     15000
5     15000
2     15000
3     15000
6     15000
9     15000
7     15000
8     15000
10    15000
Name: count, dtype: int64
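
The two branches of the resampling step behave differently: small classes gain duplicated rows, while large classes are simply truncated. A toy sketch of that behaviour, using a hypothetical compact helper `resample_per_class` and made-up data rather than the notebook's function and HARTH records:

```python
import pandas as pd

def resample_per_class(df, target_col, size, seed=42):
    """Hypothetical compact variant: every class ends up with `size` rows."""
    parts = []
    for _, class_df in df.groupby(target_col, sort=False):
        if len(class_df) < size:
            # Too few rows: draw with replacement, so rows get duplicated.
            parts.append(class_df.sample(size, replace=True, random_state=seed))
        else:
            # Enough rows: keep the first `size` rows in their original order.
            parts.append(class_df.head(size))
    return pd.concat(parts).reset_index(drop=True)

# Made-up data: class sizes 5, 2 and 8.
toy = pd.DataFrame({
    "value": range(15),
    "label": [0] * 5 + [1] * 2 + [2] * 8,
})
balanced = resample_per_class(toy, "label", size=4)
print(balanced["label"].value_counts().sort_index())

# Class 1 had only 2 distinct rows, so its 4 rows must contain duplicates.
print(balanced[balanced["label"] == 1]["value"].nunique())
```

For time-series data this distinction matters: rows drawn with `sample` lose their temporal order, whereas rows kept with `head` stay contiguous.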


In [8]:
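# Split the data into input features and target
# Note: this uses df (all 827698 rows), not df_resampled from the previous cell

harth_input = df.drop(columns=['label'])
harth_target = df['label']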

print (harth_input.head())
print ('\n')
print (harth_target.head())

     back_x    back_y    back_z   thigh_x   thigh_y   thigh_z
0 -0.760242  0.299570  0.468570 -5.092732 -0.298644  0.709439
1 -0.530138  0.281880  0.319987  0.900547  0.286944  0.340309
2 -1.170922  0.186353 -0.167010 -0.035442 -0.078423 -0.515212
3 -0.648772  0.016579 -0.054284 -1.554248 -0.950978 -0.221140
4 -0.355071 -0.051831 -0.113419 -0.547471  0.140903 -0.653782


0    4
1    4
2    4
3    4
4    4
Name: label, dtype: int64


In [9]:
# Apply z-score normalization

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
harth_input_scaled = scaler.fit_transform(harth_input)
harth_target_reshaped = harth_target.to_numpy().reshape(-1, 1)

# Combine the normalized inputs and the labels into one DataFrame
# (np.hstack upcasts the integer labels to float, hence values such as 4.0 below)
df = pd.DataFrame(
    np.hstack((harth_input_scaled, harth_target_reshaped)),
    columns=list(df.columns[:-1]) + ['label']
)

df

| | back_x | back_y | back_z | thigh_x | thigh_y | thigh_z | label |
|---|---|---|---|---|---|---|---|
| 0 | 0.507498 | 1.808309 | 2.043162 | -9.763588 | -1.787553 | 0.136069 | 4.0 |
| 1 | 1.654979 | 1.693919 | 1.648712 | 2.577876 | 0.862620 | -0.560230 | 4.0 |
| 2 | -1.540489 | 1.076214 | 0.355857 | 0.650470 | -0.790909 | -2.174018 | 4.0 |
| 3 | 1.063377 | -0.021595 | 0.655116 | -2.477082 | -4.739795 | -1.619304 | 4.0 |
| 4 | 2.528007 | -0.463951 | 0.498128 | -0.403908 | 0.201689 | -2.435406 | 4.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 827693 | -0.633268 | -0.925550 | 0.249269 | -1.423855 | -0.709645 | -1.339575 | 1.0 |
| 827694 | -0.740171 | -0.900343 | 0.183468 | -1.186153 | -0.841975 | -1.291216 | 1.0 |
| 827695 | -0.632235 | -0.770994 | 0.285539 | -1.474229 | -0.430641 | -1.375251 | 1.0 |
| 827696 | -0.831593 | -0.984881 | 0.225366 | -1.081400 | -0.750652 | -1.274627 | 1.0 |
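
Z-score normalization maps each column to $z = (x - \mu) / \sigma$. As a small check on synthetic data (the random matrix `X` is an assumption for illustration, not HARTH data), `StandardScaler` can be compared against the formula written out by hand:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for three accelerometer channels.
rng = np.random.default_rng(0)
X = rng.normal(loc=[-0.9, 0.0, 0.3], scale=[0.2, 0.1, 0.4], size=(1000, 3))

scaled = StandardScaler().fit_transform(X)

# z = (x - mean) / std, using the population standard deviation (ddof=0).
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, which is why the scaled values in the table no longer resemble the raw accelerations.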


In [10]:
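'''
    Split the time-series data into windows defined by a frame size and a hop size.
    This is broadly similar to a split_sequence helper.
    
    The function below splits the x, y, z columns into time frames, with hop_size acting as the stride.
    It finally returns the frames and their labels.
    
    It will be adapted to the HARTH dataset; see the next cell.
'''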

import scipy.stats as stats

def get_frames(df, frame_size, hop_size):

    N_FEATURES = 3

    frames = []
    labels = []
    for i in range(0, len(df) - frame_size, hop_size):
        x = df['x'].values[i: i + frame_size]
        y = df['y'].values[i: i + frame_size]
        z = df['z'].values[i: i + frame_size]
        
        # Retrieve the most often used label in this segment
        # (the [0][0] indexing assumes stats.mode returns array-valued results)
        label = stats.mode(df['label'][i: i + frame_size])[0][0]
        frames.append([x, y, z])
        labels.append(label)

    # Bring the segments into a better shape
    frames = np.asarray(frames).reshape(-1, frame_size, N_FEATURES)
    labels = np.asarray(labels)

    return frames, labels
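
The number of frames follows directly from the `range` call in the loop. A short sketch of that arithmetic, where `count_frames` is a hypothetical helper and the frame and hop sizes are illustrative values rather than the notebook's settings:

```python
def count_frames(n_rows, frame_size, hop_size, include_last=False):
    """Number of windows produced by range(0, stop, hop_size).

    include_last=False mirrors a stop value of `len(df) - frame_size`;
    include_last=True mirrors `len(df) - frame_size + 1`.
    """
    stop = n_rows - frame_size + (1 if include_last else 0)
    return len(range(0, stop, hop_size))

# 10 rows, windows of 4 rows, stride of 2 rows.
print(count_frames(10, 4, 2))                     # 3
print(count_frames(10, 4, 2, include_last=True))  # 4

# Row count of the merged S006 + S008 data with assumed sizes 200 / 100.
print(count_frames(827698, 200, 100))
```

With a stop value of `len(df) - frame_size`, the final full window is dropped whenever `len(df) - frame_size` is an exact multiple of `hop_size`; adding 1 to the stop value keeps it.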

In [11]:
# def get_frames(df, frame_size, hop_size):
#     N_FEATURES = 6  # 6 features (back_x, back_y, back_z, thigh_x, thigh_y, thigh_z)

#     frames = []
#     labels = []
#     for i in range(0, len(df) - frame_size, hop_size):
#         # Slice the data of each axis
#         back_x = df['back_x'].values[i: i + frame_size]
#         back_y = df['back_y'].values[i: i + frame_size]
#         back_z = df['back_z'].values[i: i + frame_size]
#         thigh_x = df['thigh_x'].values[i: i + frame_size]
#         thigh_y = df['thigh_y'].values[i: i + frame_size]
#         thigh_z = df['thigh_z'].values[i: i + frame_size]

#         # Select the label that occurs most often within this window
#         mode_result = stats.mode(df['label'][i: i + frame_size])
#         label = mode_result.mode[0]  # mode_result.mode is an array, so take the first value

#         # Bundle the data of each axis into one list and append it as a frame
#         frames.append([back_x, back_y, back_z, thigh_x, thigh_y, thigh_z])
#         labels.append(label)

#     # Convert the frames to a numpy array and reshape to (number_of_frames, frame_size, N_FEATURES)
#     frames = np.asarray(frames).reshape(-1, frame_size, N_FEATURES)
#     labels = np.asarray(labels)

#     return frames, labels

def get_frames(df, frame_size, hop_size) :
    N_FEATURES = 6  # 6 features (back_x, back_y, back_z, thigh_x, thigh_y, thigh_z)

    frames = []
    labels = []
    
    for i in range(0, len(df) - frame_size + 1, hop_size) :
        back_x = df['back_x'].values[i: i + frame_size]
        back_y = df['back_y'].values[i: i + frame_size]
        back_z = df['back_z'].values[i: i + frame_size]
        thigh_x = df['thigh_x'].values[i: i + frame_size]
        thigh_y = df['thigh_y'].values[i: i + frame_size]
        thigh_z = df['thigh_z'].values[i: i + frame_size]

        # Select the label that occurs most often within this window
        mode_result = stats.mode(df['label'][i: i + frame_size], keepdims=False)
        label = mode_result.mode
        
        # Bundle the six axes together and append them as one frame
        frames.append([back_x, back_y, back_z, thigh_x, thigh_y, thigh_z])
        labels.append(label)

    # np.asarray(frames) has shape (n_frames, N_FEATURES, frame_size); reshape only
    # reinterprets that buffer as (n_frames, frame_size, N_FEATURES), it does not transpose
    frames = np.asarray(frames).reshape(-1, frame_size, N_FEATURES)
    labels = np.asarray(labels)

    return frames, labels
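
One point worth checking in the framing step is the memory layout: stacking six per-axis slices yields `(n_frames, 6, frame_size)`, and a plain `reshape` to `(n_frames, frame_size, 6)` does not turn rows into time steps. The sketch below shows one possible alternative that slices all feature columns at once, so each row of a frame is a single time step. `get_frames_time_major` and the toy DataFrame are hypothetical, and this is not the notebook's or the reference implementation's method:

```python
import numpy as np
import pandas as pd

FEATURES = ["back_x", "back_y", "back_z", "thigh_x", "thigh_y", "thigh_z"]

def get_frames_time_major(df, frame_size, hop_size):
    """Hypothetical variant: each frame is (frame_size, n_features), row = time step."""
    values = df[FEATURES].to_numpy()
    label_values = df["label"].to_numpy()

    frames, labels = [], []
    for i in range(0, len(df) - frame_size + 1, hop_size):
        frames.append(values[i: i + frame_size])
        # Most frequent label in the window; ties go to the smallest label.
        uniq, counts = np.unique(label_values[i: i + frame_size], return_counts=True)
        labels.append(uniq[np.argmax(counts)])
    return np.stack(frames), np.asarray(labels)

# Toy data: channel k holds the values 100*k, 100*k + 1, ... so the layout is easy to read.
n = 10
toy = pd.DataFrame({name: np.arange(n) + 100 * k for k, name in enumerate(FEATURES)})
toy["label"] = [0] * 6 + [1] * 4

frames, labels = get_frames_time_major(toy, frame_size=4, hop_size=2)
print(frames.shape, labels)
print(frames[0, 0])  # all six channels at the first time step
```

In this layout `frames[0, 0]` holds the six channel values recorded at the first time step, which is the arrangement a 1D convolution over the time axis would normally expect.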