# Reference

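* The supplementary explanation in the dimension reduction part is based on [Hands-On Unsupervised Learning](https://github.com/francis-kang/handson-unsupervised-learning) (book and GitHub).

* The supplementary explanation in the LOF part is based on [Korea University School of Industrial and Management Engineering 03-4: Anomaly Detection](https://www.youtube.com/watch?v=ODNAyt1h6Eg) (YouTube).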

# Process

1. Load Data Set
2. Feature Engineering

  - Feature Extraction

    * Zero Crossing Rate (+ delta)
    * RMS (+ delta)
    * MFCC
  
  - Scaling (Robust Scaler)

  - Dimension Reduction (Sparse PCA)

3. Modeling (LOF)
4. Submission



# Setting

## Library

In [1]:
import pandas as pd
import numpy as np

from scipy import stats

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

from sklearn.neighbors import LocalOutlierFactor
from sklearn.ensemble import IsolationForest
from sklearn.decomposition import PCA, KernelPCA, SparsePCA, TruncatedSVD, IncrementalPCA

import os
from tqdm.auto import tqdm
import random
import time
import datetime 

In [2]:
import librosa
import librosa.display
import IPython.display as ipd

In [3]:
import warnings
warnings.filterwarnings(action='ignore') 

## Fixed Random Seed

In [4]:
def seed_everything(seed):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)

seed_everything(42) # fix the random seed

# Load Data Set

## Google Drive Mount

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Unzip File

In [6]:
!unzip -qq '/content/drive/MyDrive/머신러닝 엔지니어링/데이콘/기계 고장 진단/data/기계_고장.zip'

## Load Train / Test Set

In [7]:
df_train = pd.read_csv('./train.csv') # all samples are normal
df_test = pd.read_csv('./test.csv')

In [8]:
print(df_train.shape)
df_train.head()

(1279, 4)


Unnamed: 0,SAMPLE_ID,SAMPLE_PATH,FAN_TYPE,LABEL
0,TRAIN_0000,./train/TRAIN_0000.wav,2,0
1,TRAIN_0001,./train/TRAIN_0001.wav,0,0
2,TRAIN_0002,./train/TRAIN_0002.wav,0,0
3,TRAIN_0003,./train/TRAIN_0003.wav,2,0
4,TRAIN_0004,./train/TRAIN_0004.wav,2,0


In [9]:
print(df_test.shape)
df_test.head()

(1514, 3)


Unnamed: 0,SAMPLE_ID,SAMPLE_PATH,FAN_TYPE
0,TEST_0000,./test/TEST_0000.wav,2
1,TEST_0001,./test/TEST_0001.wav,2
2,TEST_0002,./test/TEST_0002.wav,0
3,TEST_0003,./test/TEST_0003.wav,0
4,TEST_0004,./test/TEST_0004.wav,0


# Feature Engineering

## Scaling

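Why scale

* Sparse PCA is applied later

* PCA is highly sensitive to the relative ranges of the original features

Processing

* Scale each FAN TYPE (0 and 2) separately

* The reason for scaling them separately is that the feature visualizations in this [DACON code share](https://dacon.io/competitions/official/236036/codeshare/7134?page=1&dtype=recent) show that the data distribution differs greatly between FAN TYPEs

* Robust Scaler is used

* Instead of the default quantile_range=(25.0, 75.0), quantile_range=(35.0, 65.0) is used -> the narrower quantile range gives a smaller divisor, so values away from the median stand out more (a stricter outlier criterion)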
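
The effect of the narrower quantile range can be checked on a toy column. This is only a sketch: the values 0..100 are made up for illustration and are not taken from the competition data.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy feature: the values 0..100 in a single column (illustrative only).
x = np.arange(101, dtype=float).reshape(-1, 1)

default_scaler = RobustScaler()  # quantile_range=(25.0, 75.0)
narrow_scaler = RobustScaler(quantile_range=(35.0, 65.0))

x_default = default_scaler.fit_transform(x)
x_narrow = narrow_scaler.fit_transform(x)

# scale_ holds the inter-quantile range that is used as the divisor
print(default_scaler.scale_, narrow_scaler.scale_)

# the same raw value ends up further from 0 with the narrower range
print(x_default[-1, 0], x_narrow[-1, 0])
```

Both scalers subtract the same median, so only the divisor changes between the two settings.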

In [10]:
def scaled_df(df_train, df_train_fan_type, df_test, df_test_fan_type, scaler):
  # Scale the features separately for each FAN_TYPE (0 and 2).
  # The scaler is fit on the train rows of one fan type only and is then
  # applied to the test rows of the same fan type.
  df_train = df_train.reset_index(drop=True)
  df_test = df_test.reset_index(drop=True)

  train_fan_type = df_train_fan_type['FAN_TYPE'].reset_index(drop=True)
  test_fan_type = df_test_fan_type['FAN_TYPE'].reset_index(drop=True)

  scaled_train, scaled_test = [], []

  for fan_type in (0, 2):
    train_part = df_train.loc[train_fan_type == fan_type]
    test_part = df_test.loc[test_fan_type == fan_type]

    # fit on train only, reuse the fitted statistics for test
    scaled_train.append(pd.DataFrame(scaler.fit_transform(train_part),
                                     index=train_part.index))
    scaled_test.append(pd.DataFrame(scaler.transform(test_part),
                                    index=test_part.index))

  # restore the original row order
  df_train = pd.concat(scaled_train, axis=0).sort_index()
  df_test = pd.concat(scaled_test, axis=0).sort_index()

  return df_train, df_test
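
The motivation for scaling per FAN TYPE can be illustrated with a small sketch. The toy data below is hypothetical (two groups drawn from Gaussians with different means) and `groupby(...).transform` is used instead of the function above, purely to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)

# Hypothetical toy data: two groups whose feature lives at different levels.
toy = pd.DataFrame({
    'FAN_TYPE': [0] * 50 + [2] * 50,
    'feature': np.concatenate([rng.normal(0.05, 0.01, 50),
                               rng.normal(0.13, 0.01, 50)]),
})

# One scaler for all rows: the offset between the groups survives scaling.
global_scaled = RobustScaler().fit_transform(toy[['feature']])[:, 0]

# One scaler per group: each group is centered on its own median.
group_scaled = toy.groupby('FAN_TYPE')['feature'].transform(
    lambda s: RobustScaler().fit_transform(s.to_frame())[:, 0]
)

global_medians = pd.Series(global_scaled).groupby(toy['FAN_TYPE']).median()
group_medians = group_scaled.groupby(toy['FAN_TYPE']).median()

print(global_medians)
print(group_medians)
```

With a single scaler the two groups stay separated after scaling, so a density-based model could end up reacting to the fan type rather than to abnormal behaviour within a fan type.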

## Feature Extraction

For explanations of the feature extraction techniques below, please see this [DACON code share](https://dacon.io/competitions/official/236036/codeshare/7415?page=1&dtype=recent).

### Zero Crossing Rate

In [11]:
def get_zero_crossing_feature(df, delta=False):
    features = []
    for path in tqdm(df['SAMPLE_PATH']):
        # load the wav file with librosa
        y, sr = librosa.load(path, sr=16000)

        # extract the frame-wise zero crossing rate with librosa
        zero = librosa.feature.zero_crossing_rate(y=y)

        if delta:
            # first-order delta (local slope estimate) along the time axis
            zero = librosa.feature.delta(zero, order=1)

        # use the arithmetic mean over frames as the feature
        features.append(np.mean(zero, axis=1))

    column = 'Zero_Crossing_Rate_delta' if delta else 'Zero_Crossing_Rate'
    zero_df = pd.DataFrame(features, columns=[column])

    print(zero_df.shape)

    return zero_df
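
As a rough intuition for what this feature measures, here is a NumPy-only sketch of a zero crossing rate on two synthetic tones. It is a simplified stand-in, not librosa's implementation: librosa works frame by frame with its own padding, while the hypothetical helper `zero_crossing_rate` below treats the whole signal as one frame.

```python
import numpy as np

def zero_crossing_rate(frame):
    # fraction of consecutive sample pairs whose sign differs
    signs = np.signbit(frame)
    return np.mean(signs[1:] != signs[:-1])

sr = 16000
t = np.arange(sr) / sr  # one second of samples

low = np.sin(2 * np.pi * 100 * t)    # 100 Hz tone
high = np.sin(2 * np.pi * 2000 * t)  # 2000 Hz tone

zcr_low = zero_crossing_rate(low)
zcr_high = zero_crossing_rate(high)

print(zcr_low, zcr_high)
```

A tone of frequency f crosses zero about 2f times per second, so the higher tone yields the larger rate.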

In [12]:
zero_train = get_zero_crossing_feature(df_train, delta=False)
zero_test = get_zero_crossing_feature(df_test, delta=False)

  0%|          | 0/1279 [00:00<?, ?it/s]

(1279, 1)


  0%|          | 0/1514 [00:00<?, ?it/s]

(1514, 1)


In [13]:
zero_train.head()

Unnamed: 0,Zero_Crossing_Rate
0,0.133064
1,0.047472
2,0.057276
3,0.130589
4,0.142584


In [14]:
scaler = RobustScaler(quantile_range=(35.0, 65.0))
zero_train, zero_test= scaled_df(zero_train,
                                 df_train,
                                 zero_test,
                                 df_test,
                                 scaler)

zero_train.columns = ['Zero_Crossing_Rate']
zero_test.columns = ['Zero_Crossing_Rate']

In [15]:
zero_train.head()

Unnamed: 0,Zero_Crossing_Rate
0,0.05651
1,-0.545779
2,0.806084
3,-0.547025
4,2.378941


Use the first-order delta of the Zero Crossing Rate

In [16]:
zero_delta_train = get_zero_crossing_feature(df_train, delta=True)
zero_delta_test = get_zero_crossing_feature(df_test, delta=True)

  0%|          | 0/1279 [00:00<?, ?it/s]

(1279, 1)


  0%|          | 0/1514 [00:00<?, ?it/s]

(1514, 1)


In [17]:
zero_delta_train.head()

Unnamed: 0,Zero_Crossing_Rate_delta
0,9.8e-05
1,7.4e-05
2,0.000116
3,0.000155
4,4.7e-05


In [18]:
scaler = RobustScaler(quantile_range=(35.0, 65.0))
zero_delta_train, zero_delta_test= scaled_df(zero_delta_train,
                                             df_train,
                                             zero_delta_test,
                                             df_test,
                                             scaler)

zero_delta_train.columns = ['Zero_Crossing_Rate_delta']
zero_delta_test.columns = ['Zero_Crossing_Rate_delta']

In [19]:
zero_delta_train.head()

Unnamed: 0,Zero_Crossing_Rate_delta
0,-0.187866
1,-0.360767
2,0.703202
3,1.280963
4,-1.499582
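
The `delta=True` option relies on `librosa.feature.delta`, which librosa documents as a local estimate of the derivative computed with Savitzky-Golay filtering. The sketch below calls `scipy.signal.savgol_filter` directly on a made-up ramp; the window length of 9 and polynomial order of 1 are assumptions chosen to mirror librosa's defaults for `order=1`.

```python
import numpy as np
from scipy.signal import savgol_filter

# Hypothetical frame-wise feature: a ramp that rises by 0.5 per frame.
frames = 0.5 * np.arange(50)

# Local slope estimate: fit a line in a 9-frame window, take its derivative.
slope = savgol_filter(frames, window_length=9, polyorder=1, deriv=1)

# A plain first difference, for comparison.
diff = np.diff(frames)

print(slope[:3], diff[:3])
print(slope.mean())
```

Because the notebook averages the delta over all frames, the resulting feature summarizes the overall trend of the series, which would explain why the raw delta values are so close to zero for steady machine sounds.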


### RMS

In [20]:
def get_rms_feature(df, delta=False):
    features = []
    for path in tqdm(df['SAMPLE_PATH']):
        # load the wav file with librosa
        y, sr = librosa.load(path, sr=16000)

        # extract the frame-wise RMS energy with librosa
        rms = librosa.feature.rms(y=y)

        if delta:
            # first-order delta (local slope estimate) along the time axis
            rms = librosa.feature.delta(rms, order=1)

        # use the arithmetic mean over frames as the feature
        features.append(np.mean(rms, axis=1))

    column = 'RMS_delta' if delta else 'RMS'
    rms_df = pd.DataFrame(features, columns=[column])

    print(rms_df.shape)

    return rms_df
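
For intuition, the frame-wise RMS can be sketched in plain NumPy. The helper `frame_rms` is hypothetical and skips the centering and padding that librosa applies by default; the frame length of 2048 and hop length of 512 match librosa's documented defaults, and the test tone is synthetic.

```python
import numpy as np

def frame_rms(y, frame_length=2048, hop_length=512):
    # root-mean-square energy of each frame (no padding, unlike librosa's default)
    n_frames = 1 + (len(y) - frame_length) // hop_length
    return np.array([
        np.sqrt(np.mean(y[i * hop_length:i * hop_length + frame_length] ** 2))
        for i in range(n_frames)
    ])

sr = 16000
t = np.arange(sr) / sr
tone = 0.01 * np.sin(2 * np.pi * 440 * t)  # sine with amplitude 0.01

rms = frame_rms(tone)

print(rms.shape, rms.mean())
```

For a pure sine the RMS is the amplitude divided by the square root of 2, so the RMS feature mainly tracks how loud a recording is.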

In [21]:
rms_train = get_rms_feature(df_train, delta=False)
rms_test = get_rms_feature(df_test, delta=False)

  0%|          | 0/1279 [00:00<?, ?it/s]

(1279, 1)


  0%|          | 0/1514 [00:00<?, ?it/s]

(1514, 1)


In [22]:
rms_train.head()

Unnamed: 0,RMS
0,0.005121
1,0.004604
2,0.004401
3,0.005163
4,0.004931


In [23]:
scaler = RobustScaler(quantile_range=(35.0, 65.0))
rms_train, rms_test= scaled_df(rms_train,
                               df_train,
                               rms_test,
                               df_test,
                               scaler)

rms_train.columns = ['RMS']
rms_test.columns = ['RMS']

In [24]:
rms_train.head()

Unnamed: 0,RMS
0,0.914106
1,-0.09239
2,-1.560508
3,1.459444
4,-1.562469


Use the first-order delta of the RMS

In [25]:
rms_delta_train = get_rms_feature(df_train, delta=True)
rms_delta_test = get_rms_feature(df_test, delta=True)

  0%|          | 0/1279 [00:00<?, ?it/s]

(1279, 1)


  0%|          | 0/1514 [00:00<?, ?it/s]

(1514, 1)


In [26]:
rms_delta_train.head()

Unnamed: 0,RMS_delta
0,-2e-06
1,-1e-06
2,-2e-06
3,-1e-06
4,-3e-06


In [27]:
scaler = RobustScaler(quantile_range=(35.0, 65.0))
rms_delta_train, rms_delta_test= scaled_df(rms_delta_train,
                                           df_train,
                                           rms_delta_test,
                                           df_test,
                                           scaler)

rms_delta_train.columns = ['RMS_delta']
rms_delta_test.columns = ['RMS_delta']

In [28]:
rms_delta_train.head()

Unnamed: 0,RMS_delta
0,-1.279658
1,-0.283601
2,-0.53252
3,-0.671733
4,-1.726304


### MFCC

In [29]:
def get_mfcc_feature(df):
    features = []
    for path in tqdm(df['SAMPLE_PATH']):
        # load the wav file with librosa
        y, sr = librosa.load(path, sr=16000)

        # extract the MFCCs with librosa
        mfcc = librosa.feature.mfcc(y=y,
                                    sr=sr,
                                    n_mfcc=128,
                                    dct_type=2)

        # use the arithmetic mean of each MFCC over frames as the feature
        features.append(np.mean(mfcc, axis=1))

    columns = ['MFCC_'+str(i) for i in range(len(features[0]))]

    mfcc_df = pd.DataFrame(features,
                           columns=columns)

    print(mfcc_df.shape)

    return mfcc_df
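
MFCCs are obtained by applying a DCT to a log-scaled mel spectrogram, and the function above then averages each coefficient over frames. The sketch below reproduces only that last part on a random stand-in matrix (not real audio, and not librosa's full pipeline) to show one consequence of the averaging.

```python
import numpy as np
from scipy.fft import dct

rng = np.random.default_rng(0)

# Stand-in for a log-mel spectrogram: 128 mel bands x 40 frames (random values).
log_mel = rng.normal(size=(128, 40))

# MFCC-style coefficients: DCT-II along the mel axis, one column per frame.
coeffs = dct(log_mel, type=2, axis=0, norm='ortho')

# Feature used in the notebook: the mean of each coefficient over frames.
mean_of_coeffs = coeffs.mean(axis=1)

# Because the DCT is linear, this equals the DCT of the time-averaged spectrum.
coeffs_of_mean = dct(log_mel.mean(axis=1), type=2, norm='ortho')

print(mean_of_coeffs.shape, np.allclose(mean_of_coeffs, coeffs_of_mean))
```

In other words, the averaged MFCC vector describes the average spectral shape of a recording and discards how that shape changes over time.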

In [30]:
mfcc_train = get_mfcc_feature(df_train)
mfcc_test = get_mfcc_feature(df_test)

  0%|          | 0/1279 [00:00<?, ?it/s]

(1279, 128)


  0%|          | 0/1514 [00:00<?, ?it/s]

(1514, 128)


In [31]:
mfcc_train.head()

Unnamed: 0,MFCC_0,MFCC_1,MFCC_2,MFCC_3,MFCC_4,MFCC_5,MFCC_6,MFCC_7,MFCC_8,MFCC_9,...,MFCC_118,MFCC_119,MFCC_120,MFCC_121,MFCC_122,MFCC_123,MFCC_124,MFCC_125,MFCC_126,MFCC_127
0,-332.689484,96.704391,-14.929521,21.968111,-8.563829,-2.02196,-11.857611,3.893353,-5.748076,3.539912,...,0.53368,0.660617,0.524346,-0.307885,-0.814918,-0.123952,0.535305,0.113357,-0.800878,-0.867296
1,-438.377899,142.276978,-2.118732,30.589058,0.734739,15.532813,-2.802753,4.227826,-1.891904,3.577837,...,0.179785,-0.031554,0.05012,0.377868,0.766223,0.740194,0.287944,0.007076,0.350023,0.168382
2,-419.17099,123.297798,10.11094,21.655056,-1.095648,11.256332,-3.402523,1.567492,3.890199,3.804655,...,0.472421,0.330321,0.200077,0.07306,0.516295,0.852534,0.380594,-0.057465,-0.105068,-0.298017
3,-333.733124,97.450333,-13.966936,22.235878,-9.349174,-2.870443,-11.308705,6.399221,-2.479952,3.890206,...,0.084635,0.459112,-0.024202,0.227796,-0.581687,-0.259305,-0.126211,0.116488,-0.928069,-0.161903
4,-333.012543,90.00338,-21.694469,14.749146,-18.316071,-9.914346,-16.342524,2.575432,-6.690783,-0.875636,...,0.058081,0.142688,-0.039779,0.551953,-0.547507,-0.372035,-0.214538,0.094469,-0.619701,-0.231777


In [32]:
scaler = RobustScaler(quantile_range=(35.0, 65.0))
mfcc_train, mfcc_test= scaled_df(mfcc_train,
                                 df_train,
                                 mfcc_test,
                                 df_test,
                                 scaler)

mfcc_train.columns = ['MFCC_'+str(i) for i in range(len(mfcc_train.columns))]
mfcc_test.columns = ['MFCC_'+str(i) for i in range(len(mfcc_test.columns))]

In [33]:
mfcc_train.head()

Unnamed: 0,MFCC_0,MFCC_1,MFCC_2,MFCC_3,MFCC_4,MFCC_5,MFCC_6,MFCC_7,MFCC_8,MFCC_9,...,MFCC_118,MFCC_119,MFCC_120,MFCC_121,MFCC_122,MFCC_123,MFCC_124,MFCC_125,MFCC_126,MFCC_127
0,1.206752,0.161699,0.223483,0.111338,0.728755,1.411419,1.541712,-1.156927,-1.768883,0.294013,...,2.24423,2.892967,2.158813,-1.804063,-1.603646,-0.318842,3.154804,-0.022835,-1.090258,-2.480426
1,-3.044144,0.919848,-0.330244,0.067806,0.155189,0.373561,-0.18251,-1.23984,0.687802,-1.241711,...,-0.477666,-1.65231,-1.36036,-0.066186,1.240867,0.920321,0.10128,-0.600844,0.41587,0.013912
2,0.6728,-1.114655,1.497597,-1.373886,-0.531722,-1.005042,-0.42574,-4.219006,5.945963,-0.994139,...,0.987076,-0.48303,-0.648311,-1.187675,0.300975,1.444555,0.448793,-0.867031,-1.600708,-1.925837
3,0.563289,0.443466,0.838457,0.185989,0.446582,0.960872,2.140261,2.079478,1.491859,0.601933,...,-0.264568,1.844388,-0.296166,0.284225,-0.717624,-1.053334,-0.315267,-0.005973,-1.728596,0.221867
4,1.007568,-2.369496,-4.098495,-1.901266,-2.775209,-2.779459,-3.348817,-2.859062,-2.70946,-3.587404,...,-0.412924,0.197805,-0.365878,1.547914,-0.587778,-1.665059,-0.778595,-0.124566,-0.18098,-0.045812


## Dimension Reduction

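Why dimension reduction helps

* Projecting high-dimensional data onto a lower-dimensional space removes redundant information while keeping as much of the core information as possible

* Reducing the data to fewer dimensions removes much of the noise, so machine learning algorithms can identify interesting patterns more effectively and efficiently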

### Sparse PCA

PCA

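* Finds a low-dimensional representation of the data while preserving as much variance (core information) as possible

* Works on the correlations between features

* When some features are highly correlated, PCA combines them and represents the data with a smaller number of linearly uncorrelated features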
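
How many components are needed to preserve a given share of the variance can be read off the cumulative explained variance ratio. The sketch below uses hypothetical data (10 correlated features driven by 3 latent factors) and the same 99.9% threshold as the cells further down.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical data: 10 correlated features driven by 3 latent factors plus small noise.
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(500, 10))

pca = PCA().fit(X)
cumsum = np.cumsum(pca.explained_variance_ratio_)

# smallest number of components whose cumulative ratio reaches 99.9%
n_components = int(np.argmax(cumsum >= 0.999) + 1)

# scikit-learn can make the same choice when n_components is a float in (0, 1)
auto = PCA(n_components=0.999).fit(X)

print(n_components, auto.n_components_)
```

Passing the threshold as a float to `PCA` is an alternative to the manual `np.argmax` step; both are shown here only for comparison.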

Sparse PCA

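* Standard PCA searches for linear combinations over all input variables and reduces the original feature space as densely as possible

* The alpha hyperparameter controls how much sparsity is retained

* Sparse PCA searches for linear combinations over only some of the input variables, so it reduces the original feature space to some extent but not as densely as standard PCA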
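
The difference between dense and sparse loadings can be inspected by counting the entries of `components_` that are exactly zero. This is a sketch on random toy data; `alpha=2.0` is an arbitrary value picked for illustration and is much larger than the 0.001 used in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA, SparsePCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
X[:, 1] += X[:, 0]  # make two features correlated

dense = PCA(n_components=3).fit(X)
sparse = SparsePCA(n_components=3, alpha=2.0, random_state=0).fit(X)

# count loadings that are exactly zero in each set of components
dense_zeros = int(np.sum(dense.components_ == 0))
sparse_zeros = int(np.sum(sparse.components_ == 0))

print(dense.components_.shape, sparse.components_.shape)
print(dense_zeros, sparse_zeros)
```

Higher values of `alpha` lead to sparser components, so with a very small `alpha` the result is expected to stay close to a dense decomposition.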

#### Zero Crossing Rate

In [34]:
pca = PCA()
pca.fit(zero_train)
cumsum = np.cumsum(pca.explained_variance_ratio_)
N_COMPONETS = np.argmax(cumsum>=0.999) + 1
print(N_COMPONETS)

1


In [35]:
pca = SparsePCA(n_components=N_COMPONETS,
                alpha=0.001)

pca_train_zero = pca.fit_transform(zero_train)
pca_test_zero = pca.transform(zero_test)

pca_train_zero = pd.DataFrame(pca_train_zero)
pca_test_zero = pd.DataFrame(pca_test_zero)

In [36]:
pca = PCA()
pca.fit(zero_delta_train)
cumsum = np.cumsum(pca.explained_variance_ratio_)
N_COMPONETS = np.argmax(cumsum>=0.999) + 1
print(N_COMPONETS)

1


In [37]:
pca = SparsePCA(n_components=N_COMPONETS,
                alpha=0.001)

pca_train_zero_delta = pca.fit_transform(zero_delta_train)
pca_test_zero_delta = pca.transform(zero_delta_test)

pca_train_zero_delta = pd.DataFrame(pca_train_zero_delta)
pca_test_zero_delta = pd.DataFrame(pca_test_zero_delta)

#### RMS

In [38]:
pca = PCA()
pca.fit(rms_train)
cumsum = np.cumsum(pca.explained_variance_ratio_)
N_COMPONETS = np.argmax(cumsum>=0.999) + 1
print(N_COMPONETS)

1


In [39]:
pca = SparsePCA(n_components=N_COMPONETS,
                alpha=0.001)

pca_train_rms = pca.fit_transform(rms_train)
pca_test_rms = pca.transform(rms_test)

pca_train_rms = pd.DataFrame(pca_train_rms)
pca_test_rms = pd.DataFrame(pca_test_rms)

In [40]:
pca = PCA()
pca.fit(rms_delta_train)
cumsum = np.cumsum(pca.explained_variance_ratio_)
N_COMPONETS = np.argmax(cumsum>=0.999) + 1
print(N_COMPONETS)

1


In [41]:
pca = SparsePCA(n_components=N_COMPONETS,
                alpha=0.001)

pca_train_rms_delta = pca.fit_transform(rms_delta_train)
pca_test_rms_delta = pca.transform(rms_delta_test)

pca_train_rms_delta = pd.DataFrame(pca_train_rms_delta)
pca_test_rms_delta = pd.DataFrame(pca_test_rms_delta)

#### MFCC

In [42]:
pca = PCA()
pca.fit(mfcc_train)
cumsum = np.cumsum(pca.explained_variance_ratio_)
N_COMPONETS = np.argmax(cumsum>=0.999) + 1
print(N_COMPONETS)

124


In [43]:
# the following code takes a relatively long time to run
start = time.time()

pca = SparsePCA(n_components=N_COMPONETS,
                alpha=0.001)

pca_train_mfcc = pca.fit_transform(mfcc_train)
pca_test_mfcc = pca.transform(mfcc_test)

pca_train_mfcc = pd.DataFrame(pca_train_mfcc)
pca_test_mfcc = pd.DataFrame(pca_test_mfcc)

end = time.time()

times = end - start
times = str(datetime.timedelta(seconds=times)).split('.')[0]
print(times)

0:10:38


## Concat Preprocessed Data Set

In [44]:
# concatenate the preprocessed data sets
preprocessed_train = pd.concat([
                                pca_train_zero,
                                pca_train_zero_delta,
                                pca_train_rms,
                                pca_train_rms_delta,
                                pca_train_mfcc
                               ], axis=1)

preprocessed_test = pd.concat([
                               pca_test_zero,
                               pca_test_zero_delta,
                               pca_test_rms,
                               pca_test_rms_delta,
                               pca_test_mfcc
                               ], axis=1)

In [45]:
preprocessed_train.columns = [i for i in range(len(preprocessed_train.columns))]
preprocessed_test.columns = [i for i in range(len(preprocessed_test.columns))]

In [46]:
print(preprocessed_train.shape)
preprocessed_train.head()

(1279, 128)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,118,119,120,121,122,123,124,125,126,127
0,-0.303446,-0.210859,0.983766,-1.261785,3.42242,0.726486,-2.462682,4.443269,-5.738622,-1.975347,...,0.097443,-0.201041,-0.062699,-0.091079,-0.549037,0.153809,-0.179582,0.076368,0.010871,-0.241233
1,-0.899772,-0.382049,-0.012764,-0.27559,2.502096,2.51914,-1.577035,-0.769271,3.478615,1.338044,...,0.273339,0.39786,-0.114123,0.213609,0.102917,0.163011,0.536513,-0.826751,0.172047,-0.250567
2,0.438706,0.671386,-1.466346,-0.522044,-4.175946,2.231185,1.994829,1.515495,-3.933909,3.124208,...,-0.083023,-0.417316,0.44364,0.247786,0.05856,-0.57413,0.047098,-0.028901,0.352297,0.002886
3,-0.901006,1.243426,1.523705,-0.659879,0.084066,1.280448,0.559491,-0.635935,1.115205,-1.207013,...,0.095865,-0.219614,-0.064824,-0.192117,-0.171057,0.02932,-0.077531,-0.599213,-0.102897,0.148886
4,1.995991,-1.509588,-1.468288,-1.704008,-1.570391,0.251557,0.460797,2.018924,3.932621,-2.159932,...,0.675701,-0.209619,0.582438,0.037699,-0.189837,0.629114,-0.088787,0.012352,0.752945,-0.023018


# Modeling

## LOF

* Computes a Novelty Score from the Local Density around each data point

* Because the Novelty Score is not normalized, applying this model to other data sets may not work well

Hyperparameters

* n_neighbors = 1
  - Empirically, setting it to 1 gave the best performance

* p = 2
  - Euclidean distance gave the best performance

  - For cases where Manhattan distance (p = 1) works better than Euclidean distance, see this [GitHub blog](https://seoyoungh.github.io/deep-learning/distance-metrics/)

* contamination='auto'
  - The amount of contamination in the Test Set is unknown
  - This hyperparameter affects the Threshold
  - Under the competition rules, choosing the Threshold from the Anomaly Scores counts as Data Leakage
  - Set to 'auto' out of concern about violating the rules

* novelty=True
  - Must be set to True to enable Novelty Detection
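
The interplay of `novelty=True` and `contamination='auto'` can be sketched on toy data. Everything below is illustrative: the training set is a hypothetical 2-D Gaussian cluster, and `n_neighbors=20` (scikit-learn's default) is used instead of the notebook's value of 1.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Hypothetical "normal" training data: a single 2-D Gaussian cluster.
X_train = rng.normal(size=(200, 2))

# n_neighbors=20 is scikit-learn's default, used here only for the toy data.
lof = LocalOutlierFactor(n_neighbors=20, p=2, contamination='auto', novelty=True)
lof.fit(X_train)

X_new = np.array([[0.0, 0.0],     # inside the cluster
                  [10.0, 10.0]])  # far away from it

pred = lof.predict(X_new)          # 1: inlier, -1: outlier
scores = lof.score_samples(X_new)  # lower means more abnormal

# same 0/1 convention as the submission file (0: normal, 1: anomaly)
labels = np.where(pred == -1, 1, 0)

print(pred, labels)
print(scores, lof.offset_)
```

With `contamination='auto'` the decision threshold is the fixed `offset_` of -1.5, so it is not derived from the scores of the data being predicted.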

### Model Creation

In [104]:
model = LocalOutlierFactor(n_neighbors=1, 
                           p=2, # Minkowski distance -> 1: same as Manhattan distance / 2: same as Euclidean distance
                           algorithm='auto',
                           contamination='auto',
                           novelty=True)

### Model Fitting and Inference

In [105]:
model.fit(preprocessed_train)

LocalOutlierFactor(n_neighbors=1, novelty=True)

In [106]:
def get_pred_label(model_pred):
    # LocalOutlierFactor.predict returns 1 (normal) / -1 (faulty); convert to 0 (normal) / 1 (faulty)
    return np.where(model_pred == -1, 1, 0)

In [107]:
test_pred = model.predict(preprocessed_test) 
test_pred = get_pred_label(test_pred)

# Submission

In [108]:
submit = pd.read_csv('./sample_submission.csv')

In [109]:
submit['LABEL'] = test_pred
submit.head()

Unnamed: 0,SAMPLE_ID,LABEL
0,TEST_0000,0
1,TEST_0001,0
2,TEST_0002,1
3,TEST_0003,1
4,TEST_0004,1


In [101]:
submit.to_csv('./submission.csv', index=False)