## CNN for Wafer Defect Identification on an Imbalanced Dataset
### Keywords
- Data preprocessing
    - Data Augmentation
    - Defect classes
        - Center
        - Donut
        - Local
        - Edge-Loc
        - Edge-Ring
        - Scratch
        - Random
        - Near-Full
        - None
- Model architecture
    - Batch Normalization
    - Spatial Dropout
    - Regularization

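### Data checks
- The wafer defect label counts match the paper.
- `trianTestLabel` (spelled this way in the dataset) has 54,355 training samples and 118,595 test samples, so augmentation is needed after checking the train:validation:test split ratio and counts used in the paper.
- After checking the `waferMap` sizes, the maps will likely need resizing to the 224x224 input size of the neural network to be developed.
- It would probably be best to design the preprocessing so that augmentation and resizing are handled in a single pass.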

In [1]:
import numpy as np
import pandas as pd

In [2]:
wm811k = pd.read_pickle('./LSWMD.pkl')

In [3]:
wm811k.head()

Unnamed: 0,waferMap,dieSize,lotName,waferIndex,trianTestLabel,failureType
0,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",1683.0,lot1,1.0,[[Training]],[[none]]
1,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",1683.0,lot1,2.0,[[Training]],[[none]]
2,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",1683.0,lot1,3.0,[[Training]],[[none]]
3,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",1683.0,lot1,4.0,[[Training]],[[none]]
4,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",1683.0,lot1,5.0,[[Training]],[[none]]


In [4]:
wm811k.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 811457 entries, 0 to 811456
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   waferMap        811457 non-null  object 
 1   dieSize         811457 non-null  float64
 2   lotName         811457 non-null  object 
 3   waferIndex      811457 non-null  float64
 4   trianTestLabel  811457 non-null  object 
 5   failureType     811457 non-null  object 
dtypes: float64(2), object(4)
memory usage: 37.1+ MB


In [5]:
# Drop the unneeded column
wm811k = wm811k.drop(columns=['waferIndex'])
wm811k.head()

Unnamed: 0,waferMap,dieSize,lotName,trianTestLabel,failureType
0,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",1683.0,lot1,[[Training]],[[none]]
1,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",1683.0,lot1,[[Training]],[[none]]
2,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",1683.0,lot1,[[Training]],[[none]]
3,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",1683.0,lot1,[[Training]],[[none]]
4,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",1683.0,lot1,[[Training]],[[none]]


In [6]:
# Check each wafer map's size and store it in a new column
def find_dim(x):
    """Return (rows, cols) of a 2-D wafer map."""
    dim0 = np.size(x, axis=0)
    dim1 = np.size(x, axis=1)
    return dim0, dim1

wm811k['waferMapDim'] = wm811k['waferMap'].apply(find_dim)
wm811k.head()

Unnamed: 0,waferMap,dieSize,lotName,trianTestLabel,failureType,waferMapDim
0,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",1683.0,lot1,[[Training]],[[none]],"(45, 48)"
1,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",1683.0,lot1,[[Training]],[[none]],"(45, 48)"
2,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",1683.0,lot1,[[Training]],[[none]],"(45, 48)"
3,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",1683.0,lot1,[[Training]],[[none]],"(45, 48)"
4,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",1683.0,lot1,[[Training]],[[none]],"(45, 48)"


In [7]:
# Encode the defect classes and the train/test split labels as integers
wm811k['failureNum'] = wm811k['failureType']
wm811k['trainTestNum'] = wm811k['trianTestLabel']
mapping_type = {'Center': 0, 'Donut': 1, 'Edge-Loc': 2, 'Edge-Ring': 3, 'Loc': 4,
                'Random': 5, 'Scratch': 6, 'Near-full': 7, 'none': 8}
mapping_traintest = {'Training': 0, 'Test': 1}
wm811k = wm811k.replace({'failureNum': mapping_type, 'trainTestNum': mapping_traintest})
wm811k.head()


Unnamed: 0,waferMap,dieSize,lotName,trianTestLabel,failureType,waferMapDim,failureNum,trainTestNum
0,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",1683.0,lot1,[[Training]],[[none]],"(45, 48)",8,0
1,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",1683.0,lot1,[[Training]],[[none]],"(45, 48)",8,0
2,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",1683.0,lot1,[[Training]],[[none]],"(45, 48)",8,0
3,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",1683.0,lot1,[[Training]],[[none]],"(45, 48)",8,0
4,"[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...",1683.0,lot1,[[Training]],[[none]],"(45, 48)",8,0
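
The encoding step can also be written by first flattening each nested label array to a plain string and then using `Series.map`. The sketch below is only an illustration on a toy frame: `flatten_label` and `toy` are hypothetical names, and modelling unlabeled wafers as empty arrays is an assumption that should be checked against the real `wm811k` data.

```python
import numpy as np
import pandas as pd

def flatten_label(x):
    """Return the string inside a nested label array, or None when it is empty."""
    arr = np.asarray(x)
    return str(arr.ravel()[0]) if arr.size > 0 else None

# Toy stand-in for wm811k; the empty arrays model unlabeled wafers (assumption).
toy = pd.DataFrame({
    'failureType': pd.Series(
        [np.array([['none']]), np.array([['Center']]), np.empty((0, 0))],
        dtype=object),
    'trianTestLabel': pd.Series(
        [np.array([['Training']]), np.array([['Test']]), np.empty((0, 0))],
        dtype=object),
})

mapping_type = {'Center': 0, 'Donut': 1, 'Edge-Loc': 2, 'Edge-Ring': 3, 'Loc': 4,
                'Random': 5, 'Scratch': 6, 'Near-full': 7, 'none': 8}
mapping_traintest = {'Training': 0, 'Test': 1}

# Flatten first, then map; labels missing from the mapping become NaN.
toy['failureNum'] = toy['failureType'].apply(flatten_label).map(mapping_type)
toy['trainTestNum'] = toy['trianTestLabel'].apply(flatten_label).map(mapping_traintest)

# Per-split counts, including unlabeled rows.
split_counts = toy['trainTestNum'].value_counts(dropna=False)
print(split_counts)
```

Because unmapped rows end up as NaN, the encoded columns come out as floats here; unlabeled rows would need to be filtered out before training.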


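### Changes needed in the reference code
- Data augmentation is performed with an autoencoder.
    - Shapes and similar properties become uniform after passing through the autoencoder.
    - Need to decide whether to apply the planned augmentation technique first and then match sizes, or match sizes first and then augment.
- The CNN architecture also appears to be simplified.
    - The layer configuration needs to be revised with reference to the paper.
    - It uses `KerasClassifier`; writing the model with the functional API, as usual, should be fine.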
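
On the ordering question, one simple option is geometric augmentation that works on a wafer map of any shape, so it can run before size matching. The sketch below is not the autoencoder approach from the reference code; it is a swapped-in flip and 90-degree rotation example. `random_flip_rotate` is a hypothetical helper, and it assumes these transforms do not change the defect class, which should be checked against the paper.

```python
import numpy as np

def random_flip_rotate(wafer_map, rng):
    """Apply a random multiple-of-90-degree rotation and optional flips."""
    out = np.rot90(np.asarray(wafer_map), k=int(rng.integers(0, 4)))
    if rng.random() < 0.5:
        out = np.fliplr(out)  # mirror left-right
    if rng.random() < 0.5:
        out = np.flipud(out)  # mirror top-bottom
    return out

rng = np.random.default_rng(0)
# Toy map with the 0/1/2 encoding and the (45, 48) shape seen in the first rows.
wafer = rng.integers(0, 3, size=(45, 48))
augmented = random_flip_rotate(wafer, rng)
print(wafer.shape, augmented.shape)
```

A 90-degree rotation swaps the two dimensions of a non-square map, so size matching still has to follow the augmentation in this ordering.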

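### Checking waferMap sizes
- Judging from the printed data, empty areas appear to be encoded as 0, normal pixels as 1, and defective pixels as 2.
    - When matching the input shape, filling with 0, as with zero padding, should work.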
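
A minimal sketch of the zero-fill idea, assuming the 0/1/2 encoding noted above: the map is centered on a square canvas filled with 0 and then scaled to 224x224 with nearest-neighbor sampling, so no values outside {0, 1, 2} are introduced. `pad_to_square` and `resize_nearest` are hypothetical helper names, and nearest-neighbor is only one possible choice of interpolation.

```python
import numpy as np

def pad_to_square(wafer_map, fill=0):
    """Center a 2-D wafer map on a square canvas filled with `fill`."""
    wafer_map = np.asarray(wafer_map)
    rows, cols = wafer_map.shape
    side = max(rows, cols)
    top = (side - rows) // 2
    left = (side - cols) // 2
    pad_width = ((top, side - rows - top), (left, side - cols - left))
    return np.pad(wafer_map, pad_width, mode='constant', constant_values=fill)

def resize_nearest(wafer_map, size=224):
    """Resize a 2-D map to (size, size) by nearest-neighbor index sampling."""
    wafer_map = np.asarray(wafer_map)
    rows, cols = wafer_map.shape
    row_idx = np.arange(size) * rows // size  # source row for each output row
    col_idx = np.arange(size) * cols // size  # source column for each output column
    return wafer_map[np.ix_(row_idx, col_idx)]

rng = np.random.default_rng(0)
wafer = rng.integers(0, 3, size=(45, 48))  # toy map, not real data
padded = pad_to_square(wafer)
resized = resize_nearest(padded, size=224)
print(padded.shape, resized.shape)
```

Padding to a square before resizing keeps the wafer's aspect ratio, which matters for strongly non-square maps.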

In [15]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [16]:
wm811k['waferMapDim'].value_counts()

(32, 29)      108687
(25, 27)       64083
(49, 39)       39323
(26, 26)       30078
(30, 34)       29513
(33, 33)       23886
(33, 29)       20276
(39, 37)       15327
(52, 59)       14812
(31, 31)       14569
(39, 31)       13562
(29, 26)       13247
(27, 25)       12655
(64, 71)       11692
(31, 28)       10788
(35, 40)       10676
(38, 38)        8895
(44, 44)        8601
(34, 31)        8155
(51, 59)        7890
(212, 84)       7561
(44, 41)        7131
(42, 44)        7035
(35, 31)        6629
(56, 41)        5599
(45, 43)        5598
(72, 72)        5533
(41, 33)        5432
(29, 27)        5417
(40, 40)        5224
(41, 38)        5062
(51, 30)        5000
(87, 74)        4845
(41, 42)        4639
(18, 19)        4420
(25, 26)        4417
(89, 76)        4398
(43, 44)        4289
(64, 72)        4273
(86, 89)        4038
(33, 37)        3923
(63, 62)        3913
(88, 81)        3863
(89, 112)       3856
(74, 76)        3776
(54, 71)        3726
(75, 61)        3723
(50, 43)     

In [38]:
for row in wm811k.iloc[0]['waferMap']:
    print(row)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 0 0 0
 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 1 2 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 2 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1
 1 1 1 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1
 1 1 2 1 0 0 0 0 0 0 0]
[0 0 0 0 0 2 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 0 0 0 0 0 0]
[0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 0 0 0 0 0]
[0 0 0 0 1