# Policy
* 이 노트북 파일은 중간 과제를 재현Reproduce하는 증거로 활용.
    * 따라서, 이 노트북에 기재된 코드를 실행했을 때, Kaggle에 제출한 결과와 일치해야함
    * 무작위성 등으로 인하여 결과가 매번 달라지는 경우에는 Random Seed 등을 설정해서 항상 같은 결과가 나오도록 할 것(예. scikit-learn의 **random_state** 등)
* 다음과 같은 정책을 어길 시 0점 처리되니 유의할 것
    * 노트북에 기재된 코드를 전체 실행했을 때 오류 등으로 실행이 되지 않는 경우
    * 노트북에 명시되지 않은 별도의 코드 및 라이브러리를 사용하는 경우
        * 과제 수행에 필요한 외부 라이브러리 설치 명령은 이 노트북 내에 명시할 것(Installing Libraries 참조)
        * 별도의 Python 파일은 사용하지 말 것. 필요하다면 이 노트북 내에서 구현.
    * 중간 과제에서 제공한 데이터 이외의 별도의 데이터를 사용하는 경우
    * 노트북 파일 내에서 훈련된 모델이 아닌 다른 모델을 사용하는 경우
    * Kaggle 제출 결과와 노트북 파일 실행으로 나온 결과가 크게 다를 경우

# Installing Libraries

아래에 중간 과제를 수행하는 데 필요한 라이브러리 들을 설치하는 명령어를 넣을 것
예를 들어,
```shell
%conda install sklearn
```
또는
```shell
%pip install -U sklearn
```
버전이 중요하다면, 버전도 명확하게 명시할 것
```shell
%conda install sklearn==1.4.2
```
또는
```shell
%pip install sklearn==1.4.2
```

In [1]:
# 설치하는 라이브러리 명시
!pip  install -q scikit-learn numpy pandas kaggle xgboost catboost

In [2]:
import pandas as pd
import json
import numpy as np
import random
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import balanced_accuracy_score

# Data Load
아래에 중간 과제에 활용할 데이터를 불러오는 부분을 넣을 것

In [3]:
np.random.seed(32)
random.seed(32)

In [4]:
USERNAME = "liebenholz" # username
USERKEY = "fa3eb41e36bd09c1d7cc6239e58576bb" # key
json.dump({'username': USERNAME, 'key': USERKEY}, open('kaggle.json', mode='w'))

In [5]:
!mkdir ~/.kaggle/
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle competitions download -c 2024-knu-ml-team-asmt
!unzip -o -qq 2024-knu-ml-team-asmt.zip

mkdir: /Users/gokyulueau/.kaggle/: File exists
2024-knu-ml-team-asmt.zip: Skipping, found more recently modified local copy (use --force to force download)


In [6]:
test_accel  = pd.read_csv('/Users/gokyulueau/Downloads/test_accel.csv',index_col='time',usecols=['x','y','z','time'])
test_gyro   = pd.read_csv('/Users/gokyulueau/Downloads/test_gyro.csv',index_col='time',usecols=['x','y','z','time'])
train_accel = pd.read_csv('/Users/gokyulueau/Downloads/train_accel.csv',index_col='time',usecols=['x','y','z','time'])
train_gyro  = pd.read_csv('/Users/gokyulueau/Downloads/train_gyro.csv',index_col='time',usecols=['x','y','z','time'])
train_label = pd.read_csv('/Users/gokyulueau/Downloads/train_label.csv',index_col='id')
submmision  = pd.read_csv('/Users/gokyulueau/Downloads/sample_submission.csv',index_col='id')

# Data Preprocessing, Feature Engineering, and Model Building

이곳부터는 데이터 전처리, 특성값 공학, 모델 훈련 등의 코드를 자유롭게 넣을 것

In [7]:
submmision = submmision.sort_index()
test_label = submmision.copy()

In [8]:
test_accel  .columns = ['acc_x','acc_y','acc_z']
test_gyro   .columns = ['gyro_x','gyro_y','gyro_z']
train_accel .columns = ['acc_x','acc_y','acc_z']
train_gyro  .columns = ['gyro_x','gyro_y','gyro_z']

In [9]:
test_accel  .index = pd.to_datetime(test_accel.index, unit='s')
test_gyro   .index = pd.to_datetime(test_gyro.index, unit='s')
train_accel .index = pd.to_datetime(train_accel.index, unit='s')
train_gyro  .index = pd.to_datetime(train_gyro.index, unit='s')
train_label .index = pd.to_datetime(train_label.index, unit='s')
test_label  .index = pd.to_datetime(test_label.index, unit='s')

In [53]:
test_accel  = test_accel .ffill().dropna().resample('20ms').nearest()
test_gyro   = test_gyro  .ffill().dropna().resample('20ms').nearest()
train_accel = train_accel.ffill().dropna()
train_accel = train_accel[~train_accel.index.duplicated()]
train_accel = train_accel.resample('20ms').nearest()
train_gyro  = train_gyro .ffill().dropna()
train_gyro  = train_gyro [~train_gyro.index.duplicated()]
train_gyro  = train_gyro .resample('20ms').nearest()
train_label = train_label.ffill().dropna()

In [54]:
merge_train=pd.concat([train_accel,train_gyro],axis=1)
merge_train = merge_train.ffill().bfill().dropna()
merge_train

Unnamed: 0_level_0,acc_x,acc_y,acc_z,gyro_x,gyro_y,gyro_z
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1970-01-01 04:00:00.000,9.561121,-1.004985,0.528814,-0.108123,0.051771,-0.010614
1970-01-01 04:00:00.020,9.561121,-1.004985,0.528814,-0.099266,0.017104,-0.013821
1970-01-01 04:00:00.040,10.068997,-0.942174,0.417548,-0.080481,0.025809,-0.062079
1970-01-01 04:00:00.060,10.068997,-0.942174,0.417548,-0.080481,0.025809,-0.062079
1970-01-01 04:00:00.080,10.007980,-0.473181,0.158524,-0.096517,0.043371,-0.089415
...,...,...,...,...,...,...
1970-01-03 13:05:47.320,9.645791,1.037942,-1.692231,-0.502796,0.148568,0.307702
1970-01-03 13:05:47.340,9.534423,1.260379,-1.583258,-0.489751,0.213795,0.263714
1970-01-03 13:05:47.360,9.534423,1.260379,-1.583258,-0.489751,0.213795,0.263714
1970-01-03 13:05:47.380,9.550290,0.978815,-1.506617,-0.594105,0.220150,0.281765


In [55]:
merge_test=pd.concat([test_accel,test_gyro],axis=1)
merge_test = merge_test.ffill().bfill().dropna()
merge_test

Unnamed: 0_level_0,acc_x,acc_y,acc_z,gyro_x,gyro_y,gyro_z
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1970-01-01 02:00:00.000,8.540640,-1.877546,3.954028,-0.303788,-0.013181,0.069500
1970-01-01 02:00:00.020,8.540640,-1.877546,3.954028,-0.303788,-0.013181,0.069500
1970-01-01 02:00:00.040,8.654553,-1.616788,3.658991,-0.236094,0.004658,0.033508
1970-01-01 02:00:00.060,8.654553,-1.616788,3.658991,-0.236094,0.004658,0.033508
1970-01-01 02:00:00.080,8.755443,-1.556763,3.500770,-0.162728,0.012883,0.036212
...,...,...,...,...,...,...
1970-01-03 17:04:52.880,-10.443508,0.572214,1.422156,0.157603,-0.519235,-0.052534
1970-01-03 17:04:52.900,-7.915231,0.141258,1.295263,-0.059865,-0.232129,0.023824
1970-01-03 17:04:52.920,-7.896077,1.410185,2.334347,-0.073304,-0.200975,0.180816
1970-01-03 17:04:52.940,-7.900866,1.541866,2.221819,0.078802,0.045815,0.208305


In [56]:
scalers = []
for i in range(6):
  # scalers.append(StandardScaler())
  scalers.append(MinMaxScaler())
  

for a,i in enumerate(merge_train.columns):
  merge_train[i] = scalers[a].fit_transform(np.array(merge_train[i]).reshape(len(merge_train[i]),1))
  merge_test[i] = scalers[a].transform(np.array(merge_test[str(i)]).reshape(len(merge_test[i]),1))

In [57]:
fft_train_sensor = []
N = 128
f_s = 200
for t in train_label.index:
    l = train_label.at[t, 'workout']
    fft=[0 for i in range(6)]
    mag=[0 for i in range(6)]
    max_freq=[0 for i in range(6)]
    avg =[0 for i in range(6)]
    sub = merge_train.loc[t - pd.Timedelta(milliseconds=1):t + pd.Timedelta(seconds=3)]
    for i in range(len(merge_train.columns)):
      fft[i] = np.fft.fft(sub[merge_train.columns[i]], n=N)[1:N//2] / N
    for i in range(len(merge_train.columns)):
      mag[i] = np.abs(fft[i])
    freq = np.fft.fftfreq(N, 1.0 / f_s)[1:N//2]
    for i in range(len(merge_train.columns)):
      max_freq[i] = freq[np.argmax(mag[i])]
    for i in range(len(merge_train.columns)):
      avg[i] =  np.dot(mag[i], freq)
    fft_train_sensor.append(
        (t, l, max_freq[0], max_freq[1], max_freq[2], max_freq[3], max_freq[4], max_freq[5]
         ,avg[0] ,avg[1] ,avg[2] ,avg[3] ,avg[4]  ,avg[5]
         )
    )
fft_train_sensor = pd.DataFrame(fft_train_sensor, columns=['timestamps', 'workout',
        'max_freq_AX', 'max_freq_AY', 'max_freq_AZ', 'max_freq_GA', 'max_freq_GY', 'max_freq_GZ',
        'avg_AX' ,'avg_AY' ,'avg_AZ' ,'avg_GA' ,'avg_GY'  ,'avg_GZ'
        ]).set_index(
    'timestamps'
)
fft_train_sensor

Unnamed: 0_level_0,workout,max_freq_AX,max_freq_AY,max_freq_AZ,max_freq_GA,max_freq_GY,max_freq_GZ,avg_AX,avg_AY,avg_AZ,avg_GA,avg_GY,avg_GZ
timestamps,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
1970-01-01 04:00:04,1,7.8125,26.5625,12.5000,15.6250,7.8125,7.8125,5.184664,4.144856,3.054538,3.324320,5.529029,2.256900
1970-01-01 04:00:05,1,7.8125,21.8750,31.2500,10.9375,7.8125,7.8125,5.321540,6.411497,3.405878,4.412412,5.887302,2.321497
1970-01-01 04:00:06,1,7.8125,37.5000,39.0625,12.5000,7.8125,7.8125,5.898119,5.516591,4.337312,5.250800,5.517696,2.031655
1970-01-01 04:00:07,1,7.8125,17.1875,7.8125,12.5000,7.8125,3.1250,10.333002,8.305918,6.446951,6.646465,7.830021,3.284695
1970-01-01 04:00:08,1,7.8125,17.1875,25.0000,12.5000,25.0000,4.6875,11.318059,7.560596,6.836760,6.757774,8.442984,3.374943
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1970-01-02 15:03:44,8,4.6875,18.7500,3.1250,4.6875,3.1250,7.8125,1.797295,3.474300,2.116689,4.185162,4.050510,4.335994
1970-01-02 15:03:45,8,7.8125,17.1875,1.5625,6.2500,3.1250,7.8125,2.634684,4.318740,2.483507,3.847095,4.006702,4.472265
1970-01-02 15:03:46,8,4.6875,15.6250,1.5625,6.2500,1.5625,17.1875,2.414050,4.182125,2.595004,3.191326,3.468043,2.397416
1970-01-02 15:03:47,8,6.2500,20.3125,3.1250,7.8125,3.1250,21.8750,2.308410,3.844137,2.689826,2.409665,2.820433,2.673678


In [59]:
fft_test_sensor = []
N = 128
f_s = 200
for t in test_label.index:
    fft=[0 for i in range(6)]
    mag=[0 for i in range(6)]
    max_freq=[0 for i in range(6)]
    avg =[0 for i in range(6)]
    sub = merge_test.loc[t - pd.Timedelta(milliseconds=1):t + pd.Timedelta(seconds=3)]
    for i in range(len(merge_test.columns)):
      fft[i] = np.fft.fft(sub[merge_test.columns[i]], n=N)[1:N//2] / N
    for i in range(len(merge_test.columns)):
      mag[i] = np.abs(fft[i])
    freq = np.fft.fftfreq(N, 1.0 / f_s)[1:N//2]
    for i in range(len(merge_test.columns)):
      max_freq[i] = freq[np.argmax(mag[i])]
    for i in range(len(merge_test.columns)):
      avg[i] =  np.dot(mag[i], freq)
    fft_test_sensor.append(
        (t, max_freq[0], max_freq[1], max_freq[2], max_freq[3], max_freq[4], max_freq[5]
         ,avg[0] ,avg[1] ,avg[2] ,avg[3] ,avg[4]  ,avg[5]
         )
    )
fft_test_sensor = pd.DataFrame(fft_test_sensor, columns=['timestamps',
                'max_freq_AX', 'max_freq_AY', 'max_freq_AZ', 'max_freq_GA', 'max_freq_GY', 'max_freq_GZ',
        'avg_AX' ,'avg_AY' ,'avg_AZ' ,'avg_GA' ,'avg_GY'  ,'avg_GZ'
        ]).set_index(
    'timestamps'
)
fft_test_sensor

Unnamed: 0_level_0,max_freq_AX,max_freq_AY,max_freq_AZ,max_freq_GA,max_freq_GY,max_freq_GZ,avg_AX,avg_AY,avg_AZ,avg_GA,avg_GY,avg_GZ
timestamps,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
1970-01-01 02:00:04,9.3750,29.6875,9.3750,25.0000,9.3750,25.0000,14.994970,9.585449,7.565840,7.943393,15.184337,4.419933
1970-01-01 02:00:05,9.3750,48.4375,9.3750,23.4375,9.3750,9.3750,12.470369,10.971493,7.469380,8.482853,15.231870,6.131094
1970-01-01 02:00:06,9.3750,29.6875,39.0625,23.4375,9.3750,9.3750,14.016197,11.994868,10.752376,7.412482,14.619546,6.088851
1970-01-01 02:00:07,9.3750,25.0000,39.0625,25.0000,9.3750,9.3750,13.121891,11.467434,9.718925,8.842068,12.296179,4.959966
1970-01-01 02:00:08,9.3750,25.0000,20.3125,25.0000,9.3750,9.3750,13.939956,13.405370,8.540023,8.095068,12.161953,5.152846
...,...,...,...,...,...,...,...,...,...,...,...,...
1970-01-03 17:04:44,9.3750,14.0625,1.5625,6.2500,3.1250,1.5625,3.345258,2.200697,3.226369,2.195892,3.649829,2.398668
1970-01-03 17:04:45,9.3750,15.6250,1.5625,6.2500,3.1250,3.1250,3.353007,2.041580,3.207139,2.059349,3.654017,2.345907
1970-01-03 17:04:46,9.3750,21.8750,1.5625,1.5625,3.1250,1.5625,2.310811,2.236031,3.583296,3.968586,4.295085,2.335544
1970-01-03 17:04:47,7.8125,3.1250,1.5625,1.5625,1.5625,1.5625,2.307091,2.381813,3.546243,4.252167,4.201052,2.456576


In [60]:
fft_train_sensor.corr()['workout']

workout        1.000000
max_freq_AX   -0.515208
max_freq_AY    0.217256
max_freq_AZ   -0.121247
max_freq_GA    0.146065
max_freq_GY   -0.182889
max_freq_GZ    0.054005
avg_AX         0.060905
avg_AY         0.150205
avg_AZ         0.276490
avg_GA         0.155401
avg_GY         0.241637
avg_GZ         0.126198
Name: workout, dtype: float64

In [61]:
x = fft_train_sensor.copy().drop(['workout'],axis=1)
y = fft_train_sensor.copy()[['workout']]

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)

In [62]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import VotingClassifier

model_lr = LogisticRegression(fit_intercept=False)
model_lr.fit(x_train,y_train)
y_pred = model_lr.predict(x_test)
print("Logistic Regression :",round(balanced_accuracy_score(y_test,y_pred)*100,6),"%")

model_dt = DecisionTreeClassifier(random_state=42)
model_dt.fit(x_train,y_train)
y_pred = model_dt.predict(x_test)
print("Decision Tree :",round(balanced_accuracy_score(y_test,y_pred)*100,6),"%")

model_svc = SVC(C=1.0, probability=True, kernel='sigmoid', coef0 = 1, gamma=0.2)
model_svc.fit(x_train,y_train)
y_pred = model_lr.predict(x_test)
print("SVM :",round(balanced_accuracy_score(y_test,y_pred)*100,6),"%")

model_rf = RandomForestClassifier(random_state = 42)
model_rf.fit(x_train,y_train)
y_pred=model_rf.predict(x_test)
print("Random Forest :",round(balanced_accuracy_score(y_test,y_pred)*100,6),"%")

model_ab = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=10, random_state=32),
    n_estimators=100,
    learning_rate=1.0,
    random_state=42
)
model_ab.fit(x_train,y_train)
y_pred=model_ab.predict(x_test)
print("Ada Boost :",round(balanced_accuracy_score(y_test,y_pred)*100,6),"%")

model_xgb = XGBClassifier(
    objective='multi:softmax',
    n_estimators=100,
    learning_rate=0.1,
    subsample=0.8,
    max_depth=10,
    random_state=42
)
model_xgb.fit(x_train,y_train)
y_pred=model_xgb.predict(x_test)
print("XG Boost :",round(balanced_accuracy_score(y_test,y_pred)*100,6),"%")

ens1 = VotingClassifier(estimators=[('model_rf', model_rf), ('model_ab', model_ab), ('model_xgb',model_xgb)], voting='soft')
ens1.fit(x_train,y_train)
y_pred = ens1.predict(x_test)
print("Soft Ensemble :",round(balanced_accuracy_score(y_test,y_pred)*100,6),"%")

ens2 = VotingClassifier(estimators=[('model_rf', model_rf), ('model_ab', model_ab), ('model_xgb',model_xgb)], voting='hard')
ens2.fit(x_train,y_train)
y_pred = ens2.predict(x_test)
print("Hard Ensemble :",round(balanced_accuracy_score(y_test,y_pred)*100,6),"%")


  y = column_or_1d(y, warn=True)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
  y = column_or_1d(y, warn=True)


Logistic Regression : 71.81278 %
Decision Tree : 81.312799 %
SVM : 71.81278 %


  return fit_method(estimator, *args, **kwargs)


Random Forest : 88.956764 %


  y = column_or_1d(y, warn=True)


Ada Boost : 88.137825 %
XG Boost : 88.974835 %


  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, dtype=self.classes_.dtype, warn=True)


Soft Ensemble : 89.047279 %


  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, dtype=self.classes_.dtype, warn=True)


Hard Ensemble : 89.377759 %


# Final Model Specification
아래에는 위 전 과정을 거쳐서 최종적으로 선정된 모델을 정의하고 훈련할 것

In [40]:
model_rf = RandomForestClassifier(random_state = 42)
model_rf.fit(x,y)

model_ab = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=10, random_state=42),
    n_estimators=100,
    learning_rate=1.0,
    random_state=2
)
model_ab.fit(x,y)

model_xgb = XGBClassifier(
    objective='multi:softmax',
    n_estimators=100,
    learning_rate=0.1,
    subsample=0.8,
    max_depth=10,
    random_state=42
)
model_xgb.fit(x,y)

# model = VotingClassifier(estimators=[('model_rf', model_rf), ('model_ab', model_ab), ('model_xgb',model_xgb)], voting='soft')
# model = RandomForestClassifier(random_state = 42)
model = XGBClassifier(
    objective='multi:softmax',
    n_estimators=100,
    learning_rate=0.1,
    subsample=0.8,
    max_depth=10,
    random_state=42
)
model_xgb.fit(x,y)
model.fit(x,y)
y_predict=model.predict(x)
print(balanced_accuracy_score(y, y_predict))
y_predict=model.predict(fft_test_sensor)

  return fit_method(estimator, *args, **kwargs)
  y = column_or_1d(y, warn=True)


1.0


# Generate Submission
아래에는 Kaggle에 제출한 결과를 PC에 저장하는 코드를 넣을 것.
노트북 실행 후 아래의 코드를 통해 생성된 결과가 Kaggle에 제출된 결과와 일치해야 함.

In [42]:
SUBMIT = pd.DataFrame({
    'id':submmision.sort_index().index ,
    'label':y_predict
})
SUBMIT.to_csv('./submission.csv', index=False)

In [43]:
!kaggle competitions submit -c 2024-knu-ml-team-asmt -f submission.csv -m "XG Boost minmax scalar 20ms resampling"

100%|█████████████████████████████████████████| 119k/119k [00:01<00:00, 117kB/s]
Successfully submitted to Team Assignment