<a href="https://colab.research.google.com/github/namwootree/Breakdown-in-Machine/blob/main/MFCC_Zero_Crossing_Rate_RMS_Spectral_Flatness_%EA%B8%B0%EB%B0%98_%ED%94%BC%EC%B2%98_%EC%B6%94%EC%B6%9C_%2B_RobustScaler_%2B_SparsePCA_%26_KernelPCA_%2B_LOF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Reference

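* Supplementary explanations in the dimension reduction part draw on [Hands-On Unsupervised Learning](https://github.com/francis-kang/handson-unsupervised-learning) (book and GitHub)

* Supplementary explanations in the LOF part draw on [Korea University School of Industrial and Management Engineering, 03-4: Anomaly Detection](https://www.youtube.com/watch?v=ODNAyt1h6Eg) (YouTube)

* My comments on how each machine learning engineering technique affected performance include a good deal of personal opinion (and may be wrong)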

# Setting

## Check CPU

* The machine learning engineering here runs on a Colab CPU

* The Sparse PCA step run later takes a long time to finish

In [1]:
!head /proc/cpuinfo

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 79
model name	: Intel(R) Xeon(R) CPU @ 2.20GHz
stepping	: 0
microcode	: 0xffffffff
cpu MHz		: 2199.998
cache size	: 56320 KB
physical id	: 0


## Library

In [2]:
import pandas as pd
import numpy as np

from scipy import stats

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

from sklearn.neighbors import LocalOutlierFactor
from sklearn.ensemble import IsolationForest
from sklearn.decomposition import PCA, KernelPCA, SparsePCA, TruncatedSVD, IncrementalPCA

import os
from tqdm.auto import tqdm
import random
import time
import datetime 

In [3]:
import librosa
import librosa.display
import IPython.display as ipd

In [4]:
import warnings
warnings.filterwarnings(action='ignore') 

## Fixed Random Seed

In [5]:
def seed_everything(seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)

seed_everything(42) # Fix the seed

# Load Data Set

## Google Drive Mount

In [6]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Unzip File

In [8]:
!unzip -qq '/content/drive/MyDrive/머신러닝 엔지니어링/데이콘/기계 고장 진단/data/기계_고장.zip'

## Load Train / Test Split

In [9]:
df_train = pd.read_csv('./train.csv') # All samples are normal
df_test = pd.read_csv('./test.csv')

In [10]:
print(df_train.shape)
df_train.head()

(1279, 4)


Unnamed: 0,SAMPLE_ID,SAMPLE_PATH,FAN_TYPE,LABEL
0,TRAIN_0000,./train/TRAIN_0000.wav,2,0
1,TRAIN_0001,./train/TRAIN_0001.wav,0,0
2,TRAIN_0002,./train/TRAIN_0002.wav,0,0
3,TRAIN_0003,./train/TRAIN_0003.wav,2,0
4,TRAIN_0004,./train/TRAIN_0004.wav,2,0


In [11]:
print(df_test.shape)
df_test.head()

(1514, 3)


Unnamed: 0,SAMPLE_ID,SAMPLE_PATH,FAN_TYPE
0,TEST_0000,./test/TEST_0000.wav,2
1,TEST_0001,./test/TEST_0001.wav,2
2,TEST_0002,./test/TEST_0002.wav,0
3,TEST_0003,./test/TEST_0003.wav,0
4,TEST_0004,./test/TEST_0004.wav,0


# Feature Engineering

## Scaling

**Why scale**

* Sparse PCA and Kernel PCA are applied later

* PCA is very sensitive to the relative ranges of the original features
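
To illustrate the sensitivity mentioned above, here is a minimal sketch on made-up data (the two features and their ranges are assumptions for illustration, not the audio features of this notebook). It shows the first principal component pointing almost entirely at the feature with the larger range, and the spreads becoming comparable after scaling.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(42)

# Two independent toy features with very different ranges (hypothetical data).
small = rng.normal(0.0, 1.0, size=500)       # spread of about 1
large = rng.normal(0.0, 1000.0, size=500)    # spread of about 1000
X = np.column_stack([small, large])

# Without scaling, the first component points almost entirely at the large feature.
raw_loadings = np.abs(PCA(n_components=1).fit(X).components_[0])

# After scaling, both features have a comparable spread,
# so neither dominates through its range alone.
X_scaled = RobustScaler().fit_transform(X)
scaled_ratio = X_scaled.std(axis=0).max() / X_scaled.std(axis=0).min()

print(raw_loadings)
print(scaled_ratio)
```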

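**Processing**

* Scale each FAN TYPE (0 & 2) separately

* Scaling is done separately because visualizing the extracted features in this [DACON code share](https://dacon.io/competitions/official/236036/codeshare/7134?page=1&dtype=recent) showed that the data distribution differs greatly by FAN TYPE

* Robust Scaler is used later

* Instead of the default quantile_range=(25.0, 75.0), quantile_range=(15.0, 85.0) was chosen to loosen the outlier criterion (note that the scaling cells below pass quantile_range=(10.0, 90.0))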
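
As a rough sketch of what changing `quantile_range` does, the example below fits `RobustScaler` twice on a made-up one-column feature (the data is an assumption for illustration only). The center stays at the median, while a wider quantile range gives a larger divisor, so the same values land closer to zero.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(42)
x = rng.normal(size=(1000, 1))  # hypothetical one-column feature

default_scaler = RobustScaler(quantile_range=(25.0, 75.0)).fit(x)
wide_scaler = RobustScaler(quantile_range=(10.0, 90.0)).fit(x)

# scale_ holds the width of the chosen quantile range for each column.
print(default_scaler.scale_, wide_scaler.scale_)

# A wider range gives a larger divisor, so the same values end up closer to zero.
x_default = default_scaler.transform(x)
x_wide = wide_scaler.transform(x)
print(np.abs(x_default).max(), np.abs(x_wide).max())
```

Whether a wider or narrower range generalizes better depends on the data, so this only shows the mechanics, not which setting to prefer.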

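**Personal thoughts**

* Setting quantile_range=(15.0, 85.0) instead of quantile_range=(25.0, 75.0) raised the Public Score

* At that point I assumed that accepting a wider range of data had exposed the patterns in the data more clearly

* However, the Private Score dropped sharply.

* I think the cause is that loosening the outlier criterion hurt generalization, which showed up as a low score on the 70% of the Test Set behind the Private Score

* Summary: I think the extra information admitted by relaxing the outlier criterion helped on the 30% of the Test Set behind the Public Score but hurt on the remaining 70%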

**Alternative**

* PUBLIC SCORE : 0.9800436636

* I had in fact run another pipeline that tightened the outlier criterion with quantile_range=(35.0, 65.0) and performed well (I should have submitted that one)

* Differences

  - quantile_range=(35.0, 65.0)

  - Preprocessing (scaling and dimension reduction) and modeling (LOF) are not split by FAN TYPE

  - Only Zero Crossing Rate (+ delta) / RMS (+ delta) / MFCC are used

* Code: [(GitHub)](https://github.com/namwootree/Breakdown-in-Machine/blob/main/MFCC_Zero_Crossing_Rate_RMS_%E1%84%80%E1%85%B5%E1%84%87%E1%85%A1%E1%86%AB_%E1%84%91%E1%85%B5%E1%84%8E%E1%85%A5_%E1%84%8E%E1%85%AE%E1%84%8E%E1%85%AE%E1%86%AF_%2B_Robust_Sclaer_%2B_Sparse_PCA_%2B_LOF.ipynb)

In [12]:
def scaled_df(df_train, df_train_fan_type, df_test, df_test_fan_type, scaler, fan_type=False):

  df_train_fan_type = df_train_fan_type[['FAN_TYPE']]

  df_train = pd.concat([
                        df_train.reset_index(drop=True),
                        df_train_fan_type.reset_index(drop=True)
                       ],
                       axis=1)
  
  df_test_fan_type = df_test_fan_type[['FAN_TYPE']]

  df_test = pd.concat([
                       df_test.reset_index(drop=True),
                       df_test_fan_type.reset_index(drop=True)
                      ],
                      axis=1)
  
  train_type_0 = df_train.loc[(df_train['FAN_TYPE']==0)]
  train_type_2 = df_train.loc[(df_train['FAN_TYPE']==2)]

  test_type_0 = df_test.loc[(df_test['FAN_TYPE']==0)]
  test_type_2 = df_test.loc[(df_test['FAN_TYPE']==2)]

  # Drop FAN_TYPE without inplace=True, since these frames are slices of df_train / df_test
  train_type_0 = train_type_0.drop(columns='FAN_TYPE')
  train_type_2 = train_type_2.drop(columns='FAN_TYPE')
  test_type_0 = test_type_0.drop(columns='FAN_TYPE')
  test_type_2 = test_type_2.drop(columns='FAN_TYPE')

  list_train_0_index = list(train_type_0.index)
  list_train_2_index = list(train_type_2.index)

  list_test_0_index = list(test_type_0.index)
  list_test_2_index = list(test_type_2.index)

  scaled_train_type_0 = scaler.fit_transform(train_type_0)
  scaled_test_type_0 = scaler.transform(test_type_0)

  scaled_train_type_2 = scaler.fit_transform(train_type_2)
  scaled_test_type_2 = scaler.transform(test_type_2)

  train_type_0 = pd.DataFrame(scaled_train_type_0)
  train_type_2 = pd.DataFrame(scaled_train_type_2)

  test_type_0 = pd.DataFrame(scaled_test_type_0)
  test_type_2 = pd.DataFrame(scaled_test_type_2)

  train_type_0.index = list_train_0_index
  train_type_2.index = list_train_2_index

  test_type_0.index = list_test_0_index
  test_type_2.index = list_test_2_index

  df_train = pd.concat([train_type_0, train_type_2], axis=0)
  df_test = pd.concat([test_type_0, test_type_2], axis=0)

  df_train.sort_index(inplace=True)
  df_test.sort_index(inplace=True)

  # Optionally put the FAN_TYPE column back in front of the scaled features
  if fan_type:

    df_train = pd.concat([df_train_fan_type, df_train], axis=1)
    df_test = pd.concat([df_test_fan_type, df_test], axis=1)

  return df_train, df_test

## Feature Extraction

For explanations of the feature extraction techniques below, please see this [DACON code share](https://dacon.io/competitions/official/236036/codeshare/7415?page=1&dtype=recent)

**delta (differencing)**

* Compute delta features: local estimate of the derivative of the input data along the selected axis.

* Delta features are computed using Savitzky-Golay filtering.

* [librosa](https://librosa.org/doc/main/generated/librosa.feature.delta.html#librosa.feature.delta)
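
To make the idea of a delta feature concrete, here is a small sketch that applies SciPy's Savitzky-Golay filter directly to a made-up frame-level feature. The 9-frame window and first-order polynomial are assumptions chosen for illustration, and the code is not meant to reproduce librosa's internals exactly. On a ramp that rises by 2 per frame, the estimated derivative is 2 in every frame.

```python
import numpy as np
from scipy.signal import savgol_filter

# Hypothetical frame-level feature: a straight ramp that rises by 2 per frame.
frames = np.arange(20, dtype=float)
feature = 2.0 * frames[np.newaxis, :]   # shape (1, n_frames), like a librosa feature

# First-order derivative estimate over a 9-frame window along the time axis.
delta = savgol_filter(feature, window_length=9, polyorder=1,
                      deriv=1, axis=-1, mode="interp")

print(delta[0, :5])
```

Because the notebook later averages each feature over time, the mean of a delta feature summarizes the average frame-to-frame trend rather than the level of the feature itself.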

### Zero Crossing Rate

In [13]:
def get_zero_crossing_feature(df, delta=False):
    features = []
    for path in tqdm(df['SAMPLE_PATH']):
        # Load the wav file with librosa
        y, sr = librosa.load(path, sr=16000)

        # Extract the zero crossing rate with librosa
        zero = librosa.feature.zero_crossing_rate(y=y)
                              
        if delta == True:

          zero = librosa.feature.delta(zero, order=1)

        y_feature = []
        # Use the arithmetic mean of the extracted zero crossing rates as the Feature
        for e in zero:

            e = np.mean(e)

            y_feature.append(e)

        features.append(y_feature)
    
    zero_df = pd.DataFrame(features,
                           columns=['Zero_Crossing_Rate'])
    
    if delta == True:
      
      zero_df = pd.DataFrame(features,
                             columns=['Zero_Crossing_Rate_delta'])

    print(zero_df.shape)

    return zero_df

In [14]:
zero_train = get_zero_crossing_feature(df_train, delta=False)
zero_test = get_zero_crossing_feature(df_test, delta=False)

  0%|          | 0/1279 [00:00<?, ?it/s]

(1279, 1)


  0%|          | 0/1514 [00:00<?, ?it/s]

(1514, 1)


In [15]:
zero_train.head()

Unnamed: 0,Zero_Crossing_Rate
0,0.133064
1,0.047472
2,0.057276
3,0.130589
4,0.142584


In [16]:
scaler = RobustScaler(quantile_range=(10.0, 90.0)) # Suspected cause of the problem
scaled_zero_train, scaled_zero_test= scaled_df(zero_train,
                                               df_train,
                                               zero_test,
                                               df_test,
                                               scaler,
                                               fan_type=True)

scaled_zero_train.columns = ['FAN_TYPE', 'Zero_Crossing_Rate']
scaled_zero_test.columns = ['FAN_TYPE', 'Zero_Crossing_Rate']

In [17]:
scaled_zero_train.head()

Unnamed: 0,FAN_TYPE,Zero_Crossing_Rate
0,2,0.017689
1,0,-0.185521
2,0,0.274004
3,2,-0.171235
4,2,0.744678


In [18]:
zero_delta_train = get_zero_crossing_feature(df_train, delta=True)
zero_delta_test = get_zero_crossing_feature(df_test, delta=True)

  0%|          | 0/1279 [00:00<?, ?it/s]

(1279, 1)


  0%|          | 0/1514 [00:00<?, ?it/s]

(1514, 1)


In [19]:
zero_delta_train.head()

Unnamed: 0,Zero_Crossing_Rate_delta
0,9.8e-05
1,7.4e-05
2,0.000116
3,0.000155
4,4.7e-05


In [20]:
scaler = RobustScaler(quantile_range=(10.0, 90.0))
scaled_zero_delta_train, scaled_zero_delta_test= scaled_df(zero_delta_train,
                                                           df_train,
                                                           zero_delta_test,
                                                           df_test,
                                                           scaler,
                                                           fan_type=True)

scaled_zero_delta_train.columns = ['FAN_TYPE', 'Zero_Crossing_Rate_delta']
scaled_zero_delta_test.columns = ['FAN_TYPE', 'Zero_Crossing_Rate_delta']

### RMS

In [21]:
def get_rms_feature(df, delta=False):
    features = []
    for path in tqdm(df['SAMPLE_PATH']):
        # Load the wav file with librosa
        y, sr = librosa.load(path, sr=16000)

        # Extract the RMS with librosa
        rms = librosa.feature.rms(y=y)

        if delta == True:

          rms = librosa.feature.delta(rms, order=1)

        y_feature = []
        # Use the arithmetic mean of the extracted RMS as the Feature
        for e in rms:

            e = np.mean(e)

            y_feature.append(e)

        features.append(y_feature)
    
    rms_df = pd.DataFrame(features,
                           columns=['RMS'])
    
    if delta == True:

      rms_df = pd.DataFrame(features,
                           columns=['RMS_delta'])

    print(rms_df.shape)

    return rms_df

In [22]:
rms_train = get_rms_feature(df_train, delta=False)
rms_test = get_rms_feature(df_test, delta=False)

  0%|          | 0/1279 [00:00<?, ?it/s]

(1279, 1)


  0%|          | 0/1514 [00:00<?, ?it/s]

(1514, 1)


In [23]:
rms_train.head()

Unnamed: 0,RMS
0,0.005121
1,0.004604
2,0.004401
3,0.005163
4,0.004931


In [24]:
scaler = RobustScaler(quantile_range=(10.0, 90.0))
scaled_rms_train, scaled_rms_test= scaled_df(rms_train,
                               df_train,
                               rms_test,
                               df_test,
                               scaler,
                               fan_type=True)

scaled_rms_train.columns = ['FAN_TYPE', 'RMS']
scaled_rms_test.columns = ['FAN_TYPE', 'RMS']

In [25]:
scaled_rms_train.head()

Unnamed: 0,FAN_TYPE,RMS
0,2,0.254136
1,0,-0.026736
2,0,-0.451585
3,2,0.405749
4,2,-0.434392


In [26]:
rms_delta_train = get_rms_feature(df_train, delta=True)
rms_delta_test = get_rms_feature(df_test, delta=True)

  0%|          | 0/1279 [00:00<?, ?it/s]

(1279, 1)


  0%|          | 0/1514 [00:00<?, ?it/s]

(1514, 1)


In [27]:
rms_delta_train.head()

Unnamed: 0,RMS_delta
0,-2e-06
1,-1e-06
2,-2e-06
3,-1e-06
4,-3e-06


In [28]:
scaler = RobustScaler(quantile_range=(10.0, 90.0))
scaled_rms_delta_train, scaled_rms_delta_test= scaled_df(rms_delta_train,
                                                         df_train,
                                                         rms_delta_test,
                                                         df_test,
                                                         scaler,
                                                         fan_type=True)

scaled_rms_delta_train.columns = ['FAN_TYPE', 'RMS_delta']
scaled_rms_delta_test.columns = ['FAN_TYPE', 'RMS_delta']

### Poly Feature

In [29]:
def get_poly_feature(df, delta=False):
    features = []
    for path in tqdm(df['SAMPLE_PATH']):
        # Load the wav file with librosa
        y, sr = librosa.load(path, sr=16000)

        # Extract the poly features with librosa
        poly = librosa.feature.poly_features(y=y,
                                             sr=sr,
                                             order=2)

        if delta == True:

          poly = librosa.feature.delta(poly, order=1)

        y_feature = []
        for e in poly:

            # Use the arithmetic mean of each extracted Poly coefficient as the Feature
            e = np.mean(e)

            y_feature.append(e)

        features.append(y_feature)

    columns = ['Poly'+str(i) for i in range(len(features[0]))]
    
    poly_df = pd.DataFrame(features,
                           columns=columns)

    print(poly_df.shape)

    return poly_df

In [30]:
poly_train = get_poly_feature(df_train, delta=False)
poly_test = get_poly_feature(df_test, delta=False)

  0%|          | 0/1279 [00:00<?, ?it/s]

(1279, 3)


  0%|          | 0/1514 [00:00<?, ?it/s]

(1514, 3)


In [31]:
poly_train.head()

Unnamed: 0,Poly0,Poly1,Poly2
0,9.069388e-09,-0.000104,0.299626
1,9.505948e-09,-9.8e-05,0.226299
2,8.967728e-09,-9.3e-05,0.219882
3,9.204813e-09,-0.000106,0.300868
4,7.493559e-09,-9e-05,0.275208


In [32]:
scaler = RobustScaler(quantile_range=(10.0, 90.0))
scaled_poly_train, scaled_poly_test= scaled_df(poly_train,
                                               df_train,
                                               poly_test,
                                               df_test,
                                               scaler,
                                               fan_type=True)

scaled_poly_train.columns = ['FAN_TYPE'] + ['Poly_'+str(i) for i in range(len(scaled_poly_train.columns)-1)]
scaled_poly_test.columns = ['FAN_TYPE'] + ['Poly_'+str(i) for i in range(len(scaled_poly_test.columns)-1)]

In [33]:
scaled_poly_train.head()

Unnamed: 0,FAN_TYPE,Poly_0,Poly_1,Poly_2
0,2,0.195171,-0.222502,0.278645
1,0,0.098038,-0.064467,-0.078285
2,0,-0.370527,0.382035,-0.38492
3,2,0.295246,-0.308595,0.325975
4,2,-0.969317,0.871553,-0.651589


### MFCC

In [34]:
def get_mfcc_feature(df, delta=False):
    features = []
    for path in tqdm(df['SAMPLE_PATH']):
        # Load the wav file with librosa
        y, sr = librosa.load(path, sr=16000)

        # Extract the MFCC with librosa
        mfcc = librosa.feature.mfcc(y=y,
                                    sr=sr,
                                    n_mfcc=128,
                                    dct_type=2)
        
        if delta == True:

          mfcc = librosa.feature.delta(mfcc, order=1)

        y_feature = []
        # Use the arithmetic mean of each extracted MFCC as the Feature
        for e in mfcc:

            e = np.mean(e)

            y_feature.append(e)

        features.append(y_feature)

    # Name the columns according to whether delta features were requested
    prefix = 'MFCC_delta_' if delta else 'MFCC_'
    columns = [prefix + str(i) for i in range(len(features[0]))]

    mfcc_df = pd.DataFrame(features,
                           columns=columns)

    print(mfcc_df.shape)

    return mfcc_df

In [35]:
mfcc_train = get_mfcc_feature(df_train)
mfcc_test = get_mfcc_feature(df_test)

  0%|          | 0/1279 [00:00<?, ?it/s]

(1279, 128)


  0%|          | 0/1514 [00:00<?, ?it/s]

(1514, 128)


In [36]:
mfcc_train.head()

Unnamed: 0,MFCC_0,MFCC_1,MFCC_2,MFCC_3,MFCC_4,MFCC_5,MFCC_6,MFCC_7,MFCC_8,MFCC_9,...,MFCC_118,MFCC_119,MFCC_120,MFCC_121,MFCC_122,MFCC_123,MFCC_124,MFCC_125,MFCC_126,MFCC_127
0,-332.689484,96.704391,-14.929521,21.968111,-8.563829,-2.02196,-11.857611,3.893353,-5.748076,3.539912,...,0.53368,0.660617,0.524346,-0.307885,-0.814918,-0.123952,0.535305,0.113357,-0.800878,-0.867296
1,-438.377899,142.276978,-2.118732,30.589058,0.734739,15.532813,-2.802753,4.227826,-1.891904,3.577837,...,0.179785,-0.031554,0.05012,0.377868,0.766223,0.740194,0.287944,0.007076,0.350023,0.168382
2,-419.17099,123.297798,10.11094,21.655056,-1.095648,11.256332,-3.402523,1.567492,3.890199,3.804655,...,0.472421,0.330321,0.200077,0.07306,0.516295,0.852534,0.380594,-0.057465,-0.105068,-0.298017
3,-333.733124,97.450333,-13.966936,22.235878,-9.349174,-2.870443,-11.308705,6.399221,-2.479952,3.890206,...,0.084635,0.459112,-0.024202,0.227796,-0.581687,-0.259305,-0.126211,0.116488,-0.928069,-0.161903
4,-333.012543,90.00338,-21.694469,14.749146,-18.316071,-9.914346,-16.342524,2.575432,-6.690783,-0.875636,...,0.058081,0.142688,-0.039779,0.551953,-0.547507,-0.372035,-0.214538,0.094469,-0.619701,-0.231777


In [38]:
scaler = RobustScaler(quantile_range=(10.0, 90.0))
scaled_mfcc_train, scaled_mfcc_test= scaled_df(mfcc_train,
                                               df_train,
                                               mfcc_test,
                                               df_test,
                                               scaler,
                                               fan_type=True)

scaled_mfcc_train.columns = ['FAN_TYPE'] + ['MFCC_'+str(i) for i in range(len(scaled_mfcc_train.columns)-1)]
scaled_mfcc_test.columns = ['FAN_TYPE'] + ['MFCC_'+str(i) for i in range(len(scaled_mfcc_test.columns)-1)]

### Spectral Flatness

In [39]:
def get_spectral_flatness_feature(df):
    features = []
    for path in tqdm(df['SAMPLE_PATH']):
        # Load the wav file with librosa
        y, sr = librosa.load(path, sr=16000)

        # Extract the Spectral Flatness with librosa
        flatness = librosa.feature.spectral_flatness(y=y)

        y_feature = []
        for e in flatness:

            # Use the arithmetic mean of the extracted Spectral Flatness as the Feature
            e = np.mean(e)

            y_feature.append(e)

        features.append(y_feature)

    columns = ['spectral_flatness_'+str(i) for i in range(len(features[0]))]
    
    flatness_df = pd.DataFrame(features,
                           columns=columns)

    print(flatness_df.shape)

    return flatness_df

In [40]:
spectral_flatness_train = get_spectral_flatness_feature(df_train)
spectral_flatness_test = get_spectral_flatness_feature(df_test)

  0%|          | 0/1279 [00:00<?, ?it/s]

(1279, 1)


  0%|          | 0/1514 [00:00<?, ?it/s]

(1514, 1)


In [41]:
scaler = RobustScaler(quantile_range=(10.0, 90.0))
scaled_flatness_train, scaled_flatness_test= scaled_df(spectral_flatness_train,
                                                       df_train,
                                                       spectral_flatness_test,
                                                       df_test,
                                                       scaler,
                                                       fan_type=True)

scaled_flatness_train.columns = ['FAN_TYPE'] + ['Flatness_'+str(i) for i in range(len(scaled_flatness_train.columns)-1)]
scaled_flatness_test.columns = ['FAN_TYPE'] + ['Flatness_'+str(i) for i in range(len(scaled_flatness_test.columns)-1)]

In [42]:
scaled_flatness_train.head()

Unnamed: 0,FAN_TYPE,Flatness_0
0,2,0.043698
1,0,-0.256685
2,0,0.291746
3,2,-0.170358
4,2,0.698839


## Dimension Reduction

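Dimension reduction is run separately for each FAN TYPE (0 & 2)

**Why dimension reduction helps**

* Dimension reduction projects high-dimensional data onto a lower-dimensional space, removing redundant information while keeping as much of the key information as possible

* Reducing the data to fewer dimensions removes much of the noise, so machine learning algorithms can identify interesting patterns more effectively and efficiently

**Using PCA-based dimension reduction**

* Finds a low-dimensional representation of the data while preserving as much variance (key information) as possible

* Works with the correlations between features

* When some features are highly correlated, PCA combines them and represents the data with a smaller number of linearly uncorrelated features

* For an introduction to dimension reduction techniques other than PCA, please see this [GitHub](https://github.com/namwootree/Basic_Skill/blob/main/Unsupervised%20Learning/%ED%95%B8%EC%A6%88%EC%98%A8%20%EB%B9%84%EC%A7%80%EB%8F%84%20%ED%95%99%EC%8A%B5/Ch_3_%EC%B0%A8%EC%9B%90_%EC%B6%95%EC%86%8C.ipynb) notebook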
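
The points above can be sketched on a small made-up data set (three features driven by one shared signal, an assumption for illustration). The example shows that almost all variance ends up in the first component, and that the resulting components are linearly uncorrelated. The cumulative explained variance computed here is the same quantity this notebook thresholds at 0.999 when it picks `N_COMPONETS`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Hypothetical data: three features driven by one shared signal plus small noise.
signal = rng.normal(size=300)
X = np.column_stack([
    signal + 0.05 * rng.normal(size=300),
    2.0 * signal + 0.05 * rng.normal(size=300),
    -signal + 0.05 * rng.normal(size=300),
])

pca = PCA(n_components=2).fit(X)
Z = pca.transform(X)

# Cumulative share of variance kept by the first k components.
cumsum = np.cumsum(pca.explained_variance_ratio_)

# Correlation between the two projected components.
corr = np.corrcoef(Z, rowvar=False)[0, 1]

print(cumsum)
print(corr)
```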

In [43]:
def dimension_reduction(train, test, method, fan_type=False):

  df_fan_type_train = train[['FAN_TYPE']]
  df_fan_type_test = test[['FAN_TYPE']]

  train_0 = train.loc[train['FAN_TYPE']==0]
  train_2 = train.loc[train['FAN_TYPE']==2]

  test_0 = test.loc[test['FAN_TYPE']==0]
  test_2 = test.loc[test['FAN_TYPE']==2]

  index_train_0 = list(train_0.index)
  index_train_2 = list(train_2.index)

  index_test_0 = list(test_0.index)
  index_test_2 = list(test_2.index)

  # Drop FAN_TYPE without inplace=True, since these frames are slices of train / test
  train_0 = train_0.drop(columns='FAN_TYPE')
  train_2 = train_2.drop(columns='FAN_TYPE')
  test_0 = test_0.drop(columns='FAN_TYPE')
  test_2 = test_2.drop(columns='FAN_TYPE')

  train_0 = method.fit_transform(train_0)
  test_0 = method.transform(test_0)

  train_2 = method.fit_transform(train_2)
  test_2 = method.transform(test_2)

  train_0 = pd.DataFrame(train_0)
  train_2 = pd.DataFrame(train_2)
  test_0 = pd.DataFrame(test_0)
  test_2 = pd.DataFrame(test_2)

  train_0.index = index_train_0
  train_2.index = index_train_2

  test_0.index = index_test_0
  test_2.index = index_test_2

  train = pd.concat([train_0, train_2], axis=0)
  test = pd.concat([test_0, test_2], axis=0)

  train.sort_index(inplace=True)
  test.sort_index(inplace=True)

  # Optionally put the FAN_TYPE column back in front of the reduced features
  if fan_type:

    train = pd.concat([df_fan_type_train, train], axis=1)
    test = pd.concat([df_fan_type_test, test], axis=1)

  return train, test

### Sparse PCA

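* Standard PCA searches for linear combinations over all input variables and reduces the original feature space as densely as possible

* The alpha hyperparameter controls how much sparsity is retained

* Sparse PCA searches for linear combinations over only some of the input variables, so it reduces the original feature space to some extent but not as densely as standard PCA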


```
method = SparsePCA(n_components=N_COMPONETS, alpha=0.001)
```
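
As a hedged illustration of the dense versus sparse contrast described above, the sketch below compares the loadings of `PCA` and `SparsePCA` on made-up data in which three features carry a shared signal and three are almost constant. The data and `alpha=1.0` are assumptions chosen so that the effect of the penalty is visible on this toy scale. How sparse the components become with the `alpha=0.001` used in this notebook depends on the scale of the real features.

```python
import numpy as np
from sklearn.decomposition import PCA, SparsePCA

rng = np.random.default_rng(42)
n = 200

# Hypothetical data: three features share one signal, three are near-constant noise.
signal = rng.normal(size=n)
strong = np.column_stack([signal + 0.1 * rng.normal(size=n) for _ in range(3)])
weak = 0.001 * rng.normal(size=(n, 3))
X = np.hstack([strong, weak])

dense = PCA(n_components=1).fit(X)
sparse = SparsePCA(n_components=1, alpha=1.0, random_state=42).fit(X)

# PCA gives every feature a (possibly tiny) non-zero weight;
# the L1 penalty in SparsePCA pushes the weak features to exactly zero.
print(dense.components_[0])
print(sparse.components_[0])
```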

#### Zero Crossing Rate (+ delta)

In [44]:
method = SparsePCA(n_components=1, alpha=0.001)

pca_train_zero, pca_test_zero = dimension_reduction(scaled_zero_train,
                                                    scaled_zero_test,
                                                    method)

In [45]:
pca_train_zero.head()

Unnamed: 0,0
0,-0.114878
1,-0.284218
2,0.170757
3,-0.301931
4,0.604914


In [46]:
method = SparsePCA(n_components=1, alpha=0.001)

pca_train_zero_delta, pca_test_zero_delta = dimension_reduction(scaled_zero_delta_train,
                                                                scaled_zero_delta_test,
                                                                method)

In [47]:
pca_train_zero_delta.head()

Unnamed: 0,0
0,-0.052933
1,-0.118084
2,0.183369
3,0.373997
4,-0.434196


#### RMS (+ delta)

In [48]:
method = SparsePCA(n_components=1, alpha=0.001)

pca_train_rms, pca_test_rms = dimension_reduction(scaled_rms_train,
                                                  scaled_rms_test,
                                                  method)

In [49]:
pca_train_rms.head()

Unnamed: 0,0
0,0.289721
1,-0.020601
2,-0.441244
3,0.439832
4,-0.39199


In [50]:
method = SparsePCA(n_components=1, alpha=0.001)

pca_train_rms_delta, pca_test_rms_delta = dimension_reduction(scaled_rms_delta_train,
                                                              scaled_rms_delta_test,
                                                              method)

In [51]:
pca_train_rms_delta.head()

Unnamed: 0,0
0,-0.39291
1,-0.071169
2,-0.138329
3,-0.207597
4,-0.529061


#### MFCC

In [52]:
pca = PCA()
pca.fit(scaled_mfcc_train.drop(columns='FAN_TYPE'))
cumsum = np.cumsum(pca.explained_variance_ratio_)
N_COMPONETS = np.argmax(cumsum>=0.999) + 1
print(N_COMPONETS)

124


In [53]:
# Takes a relatively long time
start = time.time()

method = SparsePCA(n_components=N_COMPONETS, alpha=0.001)

pca_train_mfcc, pca_test_mfcc = dimension_reduction(scaled_mfcc_train,
                                                    scaled_mfcc_test,
                                                    method)

end = time.time()

times = end - start
times = str(datetime.timedelta(seconds=times)).split('.')[0]
print('On Colab CPU -> ' + times)

On Colab CPU -> 0:19:51


In [54]:
pca_train_mfcc.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,114,115,116,117,118,119,120,121,122,123
0,-0.335912,2.439352,0.932931,2.310746,1.15335,-1.305539,0.213553,-2.294262,0.971062,-1.047072,...,-0.04804,0.003623,0.076608,0.030211,-0.014747,-0.038559,0.059401,-0.009189,0.065936,0.1169
1,-0.689584,-0.339695,-0.162768,0.645322,0.877578,-0.136696,-0.532275,-0.525996,-1.201179,-1.042814,...,-0.126424,-0.050807,0.04955,-0.119565,0.084956,0.036956,0.050754,-0.106587,-0.014341,0.104146
2,0.991469,-0.368927,1.542622,-1.1285,-0.206585,0.296731,-0.00407,0.222576,-0.205744,0.537634,...,-0.065547,0.144694,0.027092,-0.141671,-0.1171,0.001813,-0.107622,-0.000351,0.12537,0.148678
3,0.365062,-0.726897,-1.386354,1.382548,0.112595,-0.884162,0.198598,-0.521159,0.08396,-0.225098,...,0.024324,0.023928,-0.041305,0.03856,0.066319,-0.069043,1.5e-05,-0.06841,-0.001602,-0.038654
4,-1.224691,-0.986951,0.209991,-2.131641,-1.615312,0.141861,0.487224,-0.606741,-0.384913,-0.151322,...,0.0609,-0.062728,-0.082431,-0.1621,-0.008483,0.145432,-0.008663,-0.084737,0.065495,0.09974


#### Spectral Flatness

In [55]:
pca = PCA()
pca.fit(scaled_flatness_train.drop(columns='FAN_TYPE'))
cumsum = np.cumsum(pca.explained_variance_ratio_)
N_COMPONETS = np.argmax(cumsum>=0.999) + 1
print(N_COMPONETS)

1


In [56]:
method = SparsePCA(n_components=N_COMPONETS, alpha=0.001)

pca_train_flatness, pca_test_flatness = dimension_reduction(scaled_flatness_train,
                                                            scaled_flatness_test,
                                                            method)

In [57]:
pca_train_flatness.head()

Unnamed: 0,0
0,-0.096967
1,-0.401111
2,0.14189
3,-0.308904
4,0.551688


### Kernel PCA

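* Kernel PCA, a form of nonlinear PCA, reduces dimensions nonlinearly by running a similarity function over pairs of original data points

* By learning this similarity function, Kernel PCA maps the implicit feature space where most of the data points lie and represents it in far fewer dimensions than the original feature set

* This method is especially effective when the original feature set is not linearly separable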

```
method = KernelPCA(n_components=N_COMPONETS)
```
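
One detail that may be worth checking: the call shown above passes no `kernel` argument, and scikit-learn's `KernelPCA` defaults to `kernel='linear'`. The sketch below, on made-up data, suggests that with this default the projection matches plain `PCA` up to the sign of each component. The `kernel="rbf"` variant is shown only as an example of an explicitly nonlinear setting and was not part of the original pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA

rng = np.random.default_rng(42)
# Hypothetical data: three features with clearly different variances.
X = rng.normal(size=(100, 3)) * np.array([3.0, 2.0, 1.0])

default_kpca = KernelPCA(n_components=2)   # no kernel argument, as in this notebook
Z_kernel = default_kpca.fit_transform(X)
Z_pca = PCA(n_components=2).fit_transform(X)

print(default_kpca.kernel)                 # the kernel actually in use

# An explicitly non-linear variant for comparison (gamma left at its default).
Z_rbf = KernelPCA(n_components=2, kernel="rbf").fit_transform(X)
print(Z_rbf.shape)
```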

#### Poly Feature

In [58]:
pca = PCA()
pca.fit(scaled_poly_train.drop(columns='FAN_TYPE'))
cumsum = np.cumsum(pca.explained_variance_ratio_)
N_COMPONETS = np.argmax(cumsum>=0.999) + 1
print(N_COMPONETS)

2


In [59]:
method = KernelPCA(n_components=N_COMPONETS)

pca_train_poly, pca_test_poly = dimension_reduction(scaled_poly_train,
                                                    scaled_poly_test,
                                                    method)

In [60]:
pca_train_poly.head()

Unnamed: 0,0,1
0,-0.628788,0.050487
1,-0.181938,-0.19025
2,0.533092,-0.151375
3,-0.765767,0.020425
4,1.219275,0.103978


## Concat Data Set

In [61]:
preprocessed_train = pd.concat([
                                df_train[['FAN_TYPE']],
                                pca_train_zero,
                                pca_train_zero_delta,
                                pca_train_rms,
                                pca_train_rms_delta,
                                pca_train_mfcc,
                                pca_train_flatness,
                                pca_train_poly,
                               ], axis=1)

preprocessed_test = pd.concat([
                               df_test[['FAN_TYPE']],
                               pca_test_zero,
                               pca_test_zero_delta,
                               pca_test_rms,
                               pca_test_rms_delta,
                               pca_test_mfcc,
                               pca_test_flatness,
                               pca_test_poly,
                              ], axis=1)

In [62]:
preprocessed_train.columns = ['FAN_TYPE']+[i for i in range(len(preprocessed_train.columns)-1)]
preprocessed_test.columns = ['FAN_TYPE']+[i for i in range(len(preprocessed_test.columns)-1)]

In [63]:
print(preprocessed_train.shape)
preprocessed_train.head()

(1279, 132)


Unnamed: 0,FAN_TYPE,0,1,2,3,4,5,6,7,8,...,121,122,123,124,125,126,127,128,129,130
0,2,-0.114878,-0.052933,0.289721,-0.39291,-0.335912,2.439352,0.932931,2.310746,1.15335,...,0.030211,-0.014747,-0.038559,0.059401,-0.009189,0.065936,0.1169,-0.096967,-0.628788,0.050487
1,0,-0.284218,-0.118084,-0.020601,-0.071169,-0.689584,-0.339695,-0.162768,0.645322,0.877578,...,-0.119565,0.084956,0.036956,0.050754,-0.106587,-0.014341,0.104146,-0.401111,-0.181938,-0.19025
2,0,0.170757,0.183369,-0.441244,-0.138329,0.991469,-0.368927,1.542622,-1.1285,-0.206585,...,-0.141671,-0.1171,0.001813,-0.107622,-0.000351,0.12537,0.148678,0.14189,0.533092,-0.151375
3,2,-0.301931,0.373997,0.439832,-0.207597,0.365062,-0.726897,-1.386354,1.382548,0.112595,...,0.03856,0.066319,-0.069043,1.5e-05,-0.06841,-0.001602,-0.038654,-0.308904,-0.765767,0.020425
4,2,0.604914,-0.434196,-0.39199,-0.529061,-1.224691,-0.986951,0.209991,-2.131641,-1.615312,...,-0.1621,-0.008483,0.145432,-0.008663,-0.084737,0.065495,0.09974,0.551688,1.219275,0.103978


# Modeling

**LOF (Local Outlier Factor)**

* Computes a Novelty Score from the Local Density of each data point

* Because the Novelty Score is not normalized, applying the same model to a different Data Set may not work well

* So each FAN TYPE is assumed to be a different Data Set

* As a result, modeling is done separately for each FAN TYPE

* [[Official]](https://dacon.io/competitions/official/236036/talkboard/407363?page=1&dtype=recent) The organizers stated that only Fan Types 0 and 2 exist in Train / Test, so building a separate model per Fan Type and feeding each sample to the model that matches its Fan Type does not count as Data Leakage
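
The per-group idea can be sketched as follows on made-up data. The two Gaussian clouds, the group keys, and `n_neighbors=20` are assumptions for illustration and are not the settings used in this notebook. Each model is fitted only on the normal data of its own group, so the same point can be an inlier for one group and an outlier for the other.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)

# Hypothetical "normal only" training sets for two groups with different scales.
train_by_group = {
    0: rng.normal(0.0, 1.0, size=(200, 2)),
    2: rng.normal(50.0, 5.0, size=(200, 2)),
}

# One novelty-detection model per group, fitted only on that group's normal data.
models = {
    group: LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X)
    for group, X in train_by_group.items()
}

# The same point is judged differently depending on which group's model scores it.
point = np.array([[50.0, 50.0]])
pred_group_0 = models[0].predict(point)[0]   # far from group 0's training data
pred_group_2 = models[2].predict(point)[0]   # at the centre of group 2's training data
print(pred_group_0, pred_group_2)
```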

## Split the Data Set by FAN TYPE (0 & 2)

In [64]:
train_0 = preprocessed_train.loc[preprocessed_train['FAN_TYPE']==0]
train_2 = preprocessed_train.loc[preprocessed_train['FAN_TYPE']==2]

test_0 = preprocessed_test.loc[preprocessed_test['FAN_TYPE']==0]
test_2 = preprocessed_test.loc[preprocessed_test['FAN_TYPE']==2]

In [65]:
# Drop FAN_TYPE without inplace=True, since these frames are slices of the preprocessed data
train_0 = train_0.drop(columns='FAN_TYPE')
train_2 = train_2.drop(columns='FAN_TYPE')

test_0 = test_0.drop(columns='FAN_TYPE')
test_2 = test_2.drop(columns='FAN_TYPE')

In [66]:
index_0 = list(test_0.index)
index_2 = list(test_2.index)

## Model Inference per FAN TYPE

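**Hyperparameters**

* *n_neighbors = 1*

  -  Empirically, setting this to 1 gave the best performance

* *p = 2*

  - Euclidean distance gave the best performance (note that the cell below sets p = 1, which corresponds to Manhattan distance)

  - For cases where Manhattan distance (p = 1) works better than Euclidean distance, please see this [GitHub blog](https://seoyoungh.github.io/deep-learning/distance-metrics/)

* *contamination='auto'*

  - There is no way to know how much contamination the Test Set contains
  - This hyperparameter affects where the Threshold is set
  - Under the competition rules, setting the Threshold from Anomaly Scores counts as Data Leakage
  - To avoid any risk of breaking the rules, it is set to 'auto'

* *novelty=True*

  - Must be True to enable Novelty Detection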
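
To show how `contamination='auto'` and `novelty=True` interact, here is a minimal sketch on made-up data (the arrays and `n_neighbors=10` are assumptions for illustration). With `'auto'`, scikit-learn uses a fixed offset of -1.5 on the score instead of deriving a threshold from the data, and `predict` marks a sample as an outlier when its score falls below that offset.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
X_train = rng.normal(size=(200, 3))   # hypothetical normal-only data
X_new = np.vstack([rng.normal(size=(5, 3)), np.full((1, 3), 8.0)])

model = LocalOutlierFactor(n_neighbors=10, p=2, contamination="auto", novelty=True)
model.fit(X_train)

scores = model.score_samples(X_new)   # higher means more normal
pred = model.predict(X_new)           # 1: inlier, -1: outlier

# With contamination='auto' the cut-off is a fixed offset rather than a fitted quantile.
print(model.offset_)
manual = np.where(scores < model.offset_, -1, 1)

# 0: normal, 1: faulty, as in the submission format.
labels = np.where(pred == -1, 1, 0)
print(labels)
```

Without `novelty=True`, `predict` and `score_samples` are not available for unseen data, which is why the setting is required for scoring the Test Set.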

In [67]:
n_neighbors = 1
p = 1

model_0 = LocalOutlierFactor(n_neighbors=n_neighbors, 
                           p=p, # Minkowski distance -> 1: same as Manhattan distance / 2: same as Euclidean distance
                           algorithm='auto',
                           contamination='auto',
                           novelty=True)

model_2 = LocalOutlierFactor(n_neighbors=n_neighbors, 
                           p=p, # Minkowski distance -> 1: same as Manhattan distance / 2: same as Euclidean distance
                           algorithm='auto',
                           contamination='auto',
                           novelty=True)

In [68]:
model_0.fit(train_0)
model_2.fit(train_2)

LocalOutlierFactor(n_neighbors=1, novelty=True, p=1)

In [69]:
def get_pred_label(model_pred):
    # LocalOutlierFactor.predict returns (1: normal, -1: faulty), so convert the labels to (0: normal, 1: faulty)
    model_pred = np.where(model_pred == 1, 0, model_pred)
    model_pred = np.where(model_pred == -1, 1, model_pred)
    return model_pred

In [70]:
test_pred_0 = model_0.predict(test_0) 
test_pred_0 = get_pred_label(test_pred_0)

test_pred_2 = model_2.predict(test_2) 
test_pred_2 = get_pred_label(test_pred_2)

In [71]:
test_pred_0 = pd.DataFrame(test_pred_0, columns=['LABEL'])
test_pred_2 = pd.DataFrame(test_pred_2, columns=['LABEL'])

test_pred_0.index = index_0
test_pred_2.index = index_2

In [72]:
final = pd.concat([test_pred_0, test_pred_2], axis=0)
final.sort_index(inplace=True)

# Submission

In [73]:
submit = pd.read_csv('./sample_submission.csv')
submit['LABEL'] = final['LABEL']

submit.head()

Unnamed: 0,SAMPLE_ID,LABEL
0,TEST_0000,0
1,TEST_0001,0
2,TEST_0002,1
3,TEST_0003,1
4,TEST_0004,1


In [74]:
submit.to_csv('./submission.csv', index=False)