## Eigenvector Metal Etch Data for Fault Detection Evaluation
Jaganadh Gopinadhan (Jagan)

https://www.linkedin.com/in/jaganadhg/

### Introduction

This data set is one of the oldest openly available data sets from a microfabrication/semiconductor process. It was released by Eigenvector, a niche industry analytics company, and was used in a research paper published in 1999 [1]. The paper's first author, Barry M. Wise, is one of the company's founding members.

The data was collected from a LAM 9600 Metal Etcher [2]. Etching is used in microfabrication to chemically remove layers from the surface of a wafer during manufacturing [3]. The data was released in Matlab format, suitable for analysis with Eigenvector's PLS Toolbox. There are three .mat files: MACHINE_Data.mat, OES_DATA.mat, and RFM_DATA.mat. A detailed note on the data and its attributes is available in references [1] and [2]. Since the data is in Matlab format, we created a Python-based parser that converts it into pandas DataFrames. This parser should enable data miners and data scientists to explore the data with open-source tools such as Python or R.

### The Eigenvector Etch Data Parser

The Eigenvector Etch Data Parser reads the Matlab data files. It reads each file and combines the calibration data and the test data (both are sensor data) into a single DataFrame. The parser adds a field, 'fault_name', which helps the user distinguish normal/calibration wafers from test wafers (with defects). We tested the parser in Python 3 environments only; if you need Python 2 compatibility, please test it and open a bug report or pull request as applicable. The source code is released under the Apache 2.0 license and is available at https://github.com/jaganadhg/egvsemicon.
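To illustrate the first step such a parser has to take, here is a minimal sketch of reading a Matlab file and finding the key that holds the data. It assumes SciPy's `scipy.io.loadmat` is used for reading; the actual parser implementation may differ. `user_keys` and `load_user_keys` are hypothetical helper names introduced only for this example. `loadmat` returns a dictionary that also contains metadata entries such as `__header__`, which is why the log output in the Data Exploration section lists them next to `LAMDATA`.

```python
def user_keys(mat_dict):
    """Return the keys of a loadmat-style dict that hold actual data.

    loadmat adds metadata entries such as '__header__', '__version__'
    and '__globals__'; these are filtered out here.
    """
    return [key for key in mat_dict if not key.startswith("__")]


def load_user_keys(path):
    """Hypothetical helper: read a .mat file and list its data keys."""
    from scipy.io import loadmat  # imported lazily; requires SciPy

    return user_keys(loadmat(path))


# Stand-in for the dict that loadmat would return for MACHINE_Data.mat
example = {
    "__header__": b"",
    "__version__": "1.0",
    "__globals__": [],
    "LAMDATA": None,
}
print(user_keys(example))
```

With the real file, `load_user_keys("MACHINE_Data.mat")` would be expected to return the single data key that is then passed to the parser as `dkey`.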

### Etch Data

The Eigenvector Etch Data is provided in three Matlab files: MACHINE_Data.mat, OES_DATA.mat, and RFM_DATA.mat. The MACHINE_Data.mat file consists of the engineering variables, time, and the etch recipe steps. The variable/feature categories are pressure, gas flow rate, and power (if you wonder why gas flow is here, it is part of the semiconductor process chemistry, a larger topic beyond the scope of this note). The OES_DATA.mat file contains optical emission spectroscopy (OES) measurements of the plasma. A description of OES is available on Hitachi's reference page [4]. The RFM_DATA.mat file contains data from a radio-frequency monitoring (RFM) system that monitors the power and phase relationships of the plasma generator.

Unlike the Machine Data/Engineering Variable and RFM data, the OES data does not come with sensor names. Instead, it has a field, wave_axis, which holds the wavelengths (in nm) of the peaks. A special note from the data description: "Note that this data consists of integrated peak areas at for peaks at 43 wavelengths but looking across the plasma in 3 different locations perpendicular to the overall gas flow in the system."[2]

The RFM data includes the units of its variables (units: [ 71x6  char]  Units of the variables). This project provides the variable-to-unit mapping as a CSV file (https://github.com/jaganadhg/egvsemicon/blob/main/rfm_variable_unit_map.csv). There are 35 variables measured in VRMS [5], 20 in degrees, 15 in IRMS [5], and one in seconds. The pandas DataFrame produced by the parser does not carry this information. The sensor names are masked to protect the process and machine details. It is difficult to determine the actual sensors without explicit etch machine knowledge. However, an experienced process engineer may be able to infer some of them from the units.

### Data Exploration
Let's look at the data.

In [43]:
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np 
import scipy as sp 
import matplotlib.pyplot as plt

from egvparser import egienvec_parser

#### Engineering Variables Data

In [44]:
ev_data_path = "/home/jaganadhg/AI_RND/Semiconductor/eigenvector/MACHINE_Data.mat"
ev_key = "LAMDATA"

ev_data = egienvec_parser(ev_data_path,
                        dkey=ev_key)

2022-01-29 20:54:29,287 :: INFO :: Keys in the data are dict_keys(['__header__', '__version__', '__globals__', 'LAMDATA'])
2022-01-29 20:54:29,289 :: INFO :: The sensor names are ['Time          ', 'Step Number   ', 'BCl3 Flow     ', 'Cl2 Flow      ', 'RF Btm Pwr    ', 'RF Btm Rfl Pwr', 'Endpt A       ', 'He Press      ', 'Pressure      ', 'RF Tuner      ', 'RF Load       ', 'RF Phase Err  ', 'RF Pwr        ', 'RF Impedance  ', 'TCP Tuner     ', 'TCP Phase Err ', 'TCP Impedance ', 'TCP Top Pwr   ', 'TCP Rfl Pwr   ', 'TCP Load      ', 'Vat Valve     ']
2022-01-29 20:54:29,292 :: INFO :: Processing calibration data for LAMDATA
2022-01-29 20:54:29,388 :: INFO :: Processed calibration data for LAMDATA
2022-01-29 20:54:29,388 :: INFO :: Processing test data for LAMDATA
2022-01-29 20:54:29,406 :: INFO :: Processed test data for LAMDATA
2022-01-29 20:54:29,408 :: INFO :: Total sensor values in the data LAMDATA is 12829


In [45]:
ev_data.head()

Unnamed: 0,Time,Step Number,BCl3 Flow,Cl2 Flow,RF Btm Pwr,RF Btm Rfl Pwr,Endpt A,He Press,Pressure,RF Tuner,...,RF Impedance,TCP Tuner,TCP Phase Err,TCP Impedance,TCP Top Pwr,TCP Rfl Pwr,TCP Load,Vat Valve,wafer_names,fault_name
0,11.946,4.0,751.0,753.0,132.0,0.0,626.0,100.0,1227.0,9408.0,...,16599.0,20028.0,-296.0,16848.0,360.0,0.0,27594.0,49.0,l2901.txm,calibration
1,13.028,4.0,751.0,753.0,134.0,0.0,620.0,99.0,1229.0,9431.0,...,16568.0,20042.0,-676.0,16796.0,350.0,0.0,27440.0,49.0,l2901.txm,calibration
2,14.049,4.0,751.0,755.0,134.0,0.0,599.0,102.0,1221.0,9389.0,...,16442.0,20146.0,-291.0,16512.0,344.0,0.0,27276.0,49.0,l2901.txm,calibration
3,15.1329,4.0,751.0,753.0,133.0,0.0,586.0,100.0,1201.0,9445.0,...,16960.0,20148.0,-262.0,17020.0,352.0,0.0,27330.0,50.0,l2901.txm,calibration
4,16.139,4.0,751.0,754.0,132.0,0.0,587.0,102.0,1182.0,9456.0,...,16564.0,20226.0,-547.0,16440.0,346.0,0.0,27262.0,50.0,l2901.txm,calibration


In [46]:
ev_data.columns

Index(['Time', 'Step Number', 'BCl3 Flow', 'Cl2 Flow', 'RF Btm Pwr',
       'RF Btm Rfl Pwr', 'Endpt A', 'He Press', 'Pressure', 'RF Tuner',
       'RF Load', 'RF Phase Err', 'RF Pwr', 'RF Impedance', 'TCP Tuner',
       'TCP Phase Err', 'TCP Impedance', 'TCP Top Pwr', 'TCP Rfl Pwr',
       'TCP Load', 'Vat Valve', 'wafer_names', 'fault_name'],
      dtype='object')

The first two variables are Time (in seconds) and Step Number. The step number corresponds to a step in the machine's process recipe, which is absent from the data (for IP reasons, I believe). As mentioned earlier, the remaining sensor variables represent gas flow, pressure, and power. Of the last two variables, wafer_names holds a unique wafer name, and fault_name indicates whether a wafer is a calibration/normal wafer or a test wafer with a defect (an induced defect, as per the original paper [1]). If you are interested in further exploration, wafer_names will help you identify individual wafers. Each wafer has more than one record because the process takes a defined time range to complete a recipe. The data covers only steps 4 and 5 of the process recipe.

The 12829 records span 126 normal/calibration wafers and 20 test (defect-induced) wafers.
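Counts like these can be checked with a pandas group-by. The sketch below uses a small toy frame instead of the real `ev_data`; the wafer names other than `l2901.txm` and the label `hypothetical_fault` are made up for illustration, since the actual fault labels are not shown in these notes.

```python
import pandas as pd

# Toy stand-in for ev_data with only the columns needed here
toy = pd.DataFrame(
    {
        "Time": [11.9, 13.0, 14.0, 12.1, 13.2, 12.0],
        "wafer_names": [
            "l2901.txm", "l2901.txm", "l2901.txm",
            "l2902.txm", "l2902.txm",
            "l9999.txm",
        ],
        "fault_name": ["calibration"] * 5 + ["hypothetical_fault"],
    }
)

# Number of records (time samples) per wafer
records_per_wafer = toy.groupby("wafer_names").size()

# Number of distinct wafers per fault_name label
wafers_per_label = toy.groupby("fault_name")["wafer_names"].nunique()

print(records_per_wafer)
print(wafers_per_label)
```

Applying the same two group-by calls to `ev_data` should show how the records are distributed over wafers and how many wafers carry each label.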

**Note: The column wafer_names is helpful to join other data sets such as OES and RFM.**

#### OES Data

In [47]:
oes_data_path = "/home/jaganadhg/AI_RND/Semiconductor/eigenvector/OES_DATA.mat"
oes_key = "OESDATA"

oes_data = egienvec_parser(oes_data_path,
                        dkey=oes_key)

2022-01-29 20:54:37,142 :: INFO :: Keys in the data are dict_keys(['__header__', '__version__', '__globals__', 'OESDATA'])
2022-01-29 20:54:37,148 :: INFO :: The sensor names are [array([250.  , 261.8 , 266.6 , 272.2 , 278.3 , 284.6 , 288.25, 308.25,
       309.25, 324.8 , 327.5 , 336.98, 364.33, 388.  , 394.4 , 395.8 ,
       415.  , 532.6 , 544.2 , 556.7 , 580.  , 611.5 , 613.9 , 616.3 ,
       618.5 , 639.7 , 643.2 , 644.9 , 652.8 , 660.  , 669.5 , 670.6 ,
       674.  , 725.  , 740.8 , 748.5 , 753.7 , 770.6 , 773.2 , 781.  ,
       783.5 , 787.5 , 791.5 , 250.  , 261.8 , 266.6 , 272.2 , 278.3 ,
       284.6 , 288.25, 308.25, 309.25, 324.8 , 327.5 , 336.98, 364.33,
       388.  , 394.4 , 395.8 , 415.  , 532.6 , 544.2 , 556.7 , 580.  ,
       611.5 , 613.9 , 616.3 , 618.5 , 639.7 , 643.2 , 644.9 , 652.8 ,
       660.  , 669.5 , 670.6 , 674.  , 725.  , 740.8 , 748.5 , 753.7 ,
       770.6 , 773.2 , 781.  , 783.5 , 787.5 , 791.5 , 250.  , 261.8 ,
       266.6 , 272.2 , 278.3 , 284.6 , 

In [48]:
oes_data.head()

Unnamed: 0,250.0,261.8,266.6,272.2,278.3,284.6,288.25,308.25,309.25,324.8,...,748.5,753.7,770.6,773.2,781.0,783.5,787.5,791.5,wafer_names,fault_name
0,715.8,11926.3,1242.3,37720.0,3851.3,876.1,107.8,2732.2,2732.2,157.5,...,1791.8,8848.0,6260.1,6417.7,5981.9,5981.9,4692.4,8513.2,s2901.int,calibration
1,756.1,13309.8,1105.2,38144.7,4100.9,996.7,179.5,2532.3,2532.3,186.1,...,1784.5,9094.1,6273.9,4914.2,6054.9,6054.9,4947.9,8452.9,s2901.int,calibration
2,754.5,14061.6,1199.7,38646.1,4163.6,945.2,145.6,2923.5,2923.5,239.7,...,1686.7,9024.4,6409.7,6226.2,5709.4,5709.4,4783.3,8310.1,s2901.int,calibration
3,804.4,13489.3,1312.3,40037.2,4175.4,898.8,83.1,2731.0,2731.0,220.9,...,1539.5,9338.8,6330.3,4896.1,5925.9,5931.6,4770.0,8621.3,s2901.int,calibration
4,800.7,14719.5,1262.9,40158.1,4318.2,962.2,170.8,3247.8,3247.8,235.9,...,1857.0,9633.7,6506.2,4963.8,6205.1,6205.1,5009.1,8941.3,s2901.int,calibration


The first 129 variables in this data are peak measurements at 43 wavelengths from three different locations, so each consecutive block of 43 variables comes from a single location. The last two variables, wafer_names and fault_name, hold the wafer name and the calibration/test indication.
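Because the same 43 wavelengths repeat for each location, the wavelength labels alone are not unique. One possible way to make the location explicit is sketched below. It assumes the columns are ordered location by location, as the repeating wave_axis values in the log suggest; the `loc1_`, `loc2_`, `loc3_` prefixes and the function name `label_oes_columns` are hypothetical choices for this example, not part of the parser.

```python
def label_oes_columns(wavelengths, n_locations=3):
    """Build unique labels such as 'loc1_250.0' for repeated wavelengths.

    Assumes the columns are ordered location by location, i.e. the full
    wavelength list repeats once per location.
    """
    per_location = len(wavelengths) // n_locations
    labels = []
    for position, wavelength in enumerate(wavelengths):
        location = position // per_location + 1
        labels.append(f"loc{location}_{wavelength}")
    return labels


# Short stand-in for wave_axis: 2 wavelengths seen at 3 locations
wave_axis = [250.0, 261.8] * 3
labels = label_oes_columns(wave_axis)
print(labels)
```

With the full data, the input would be the 129 wavelength values, and the resulting labels could replace the first 129 column names of the OES DataFrame.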

#### Radio Frequency Monitoring (RFM) Data

In [49]:
rfm_data_path = "/home/jaganadhg/AI_RND/Semiconductor/eigenvector/RFM_DATA.mat"
rfm_key = "RFMDATA"

rfm_data = egienvec_parser(rfm_data_path,
                        dkey=rfm_key)

2022-01-29 20:54:37,679 :: INFO :: Keys in the data are dict_keys(['__header__', '__version__', '__globals__', 'RFMDATA'])
2022-01-29 20:54:37,682 :: INFO :: The sensor names are ['TIME  ', 'S1V1  ', 'S1V2  ', 'S1V3  ', 'S1V4  ', 'S1V5  ', 'S1I1  ', 'S1I2  ', 'S1I3  ', 'S1I4  ', 'S1I5  ', 'S1P1  ', 'S1P2  ', 'S1P3  ', 'S1P4  ', 'S1P5  ', 'S2V1  ', 'S2V2  ', 'S2V3  ', 'S2V4  ', 'S2V5  ', 'S2I1  ', 'S2I2  ', 'S2I3  ', 'S2I4  ', 'S2I5  ', 'S2P1  ', 'S2P2  ', 'S2P3  ', 'S2P4  ', 'S2P5  ', 'S3V1  ', 'S3V2  ', 'S3V3  ', 'S3V4  ', 'S3V5  ', 'S4V1  ', 'S4V2  ', 'S4V3  ', 'S4V4  ', 'S4V5  ', 'S34PV1', 'S34PV2', 'S34PV3', 'S34PV4', 'S34PV5', 'S3I1  ', 'S3I2  ', 'S3I3  ', 'S3I4  ', 'S3I5  ', 'S4I1  ', 'S4I2  ', 'S4I3  ', 'S4I4  ', 'S4I5  ', 'S34PI1', 'S34PI2', 'S34PI3', 'S34PI4', 'S34PI5', 'S34V1 ', 'S34V2 ', 'S34V3 ', 'S34V4 ', 'S34V5 ', 'S34I1 ', 'S34I2 ', 'S34I3 ', 'S34I4 ', 'S34I5 ']
2022-01-29 20:54:37,683 :: INFO :: Processing calibration data for RFMDATA
2022-01-29 20:54:37,792 :: INFO :: 

In [50]:
rfm_data.head()

Unnamed: 0,TIME,S1V1,S1V2,S1V3,S1V4,S1V5,S1I1,S1I2,S1I3,S1I4,...,S34V3,S34V4,S34V5,S34I1,S34I2,S34I3,S34I4,S34I5,wafer_names,fault_name
0,15.32813,146.2126,7.474548,5.444838,2.468795,1.458764,2.651642,0.003891,0.002745,0.000811,...,5.041583,2.695145,172.2877,14.99166,0.064546,0.018406,0.005021,0.004922,r2901.txt,calibration
1,18.45703,145.8764,7.500408,5.459276,2.465954,1.457085,2.657755,0.00394,0.002761,0.000817,...,5.020921,2.687807,172.2852,14.92634,0.064253,0.017902,0.004927,0.004872,r2901.txt,calibration
2,21.58984,146.8875,7.413699,5.43732,2.437727,1.447055,2.633389,0.004037,0.002748,0.000815,...,5.047616,2.691589,172.2836,14.92634,0.063929,0.017908,0.004853,0.004855,r2901.txt,calibration
3,24.71875,146.8875,7.372842,5.386851,2.43212,1.443727,2.636422,0.00399,0.002748,0.000817,...,5.037209,2.695823,172.2836,14.9161,0.064036,0.017935,0.004829,0.004861,r2901.txt,calibration
4,27.84766,146.5498,7.357578,5.393677,2.423734,1.443727,2.636422,0.003954,0.002739,0.000815,...,5.049484,2.697655,172.1836,14.92634,0.063893,0.017631,0.00486,0.00486,r2901.txt,calibration


If you are interested in reading about RFM, a useful resource is "RF Technology in Semiconductor Wafer Processing" [6]. The data set comes with the unit of each sensor value, which helps in understanding the data; we provide the sensor-to-unit mapping separately [7]. The actual sensor names are masked. The last two variables hold the wafer name and indicate whether the wafer is a test or calibration wafer.

#### Combining Data Set

The three DataFrames have different numbers of records, and there is a caveat in the identity column wafer_names. RFM has 3519 records, OES has 4786, and Engineering Variables has 12829. The values in wafer_names are prefixed with l, s, and r for the Engineering Variables, OES, and RFM data, respectively, and the sample rows above show that the file extensions differ as well (.txm, .int, and .txt). A common identity variable can be derived by replacing or removing the leading letter (and the extension) in wafer_names.
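The sample rows above show names such as l2901.txm, s2901.int, and r2901.txt for what appears to be the same wafer. A minimal sketch of deriving a shared key follows, assuming every wafer name follows this pattern of one leading letter, an identifier, and a file extension. The function name `wafer_id` and the resulting key format are hypothetical.

```python
import re


def wafer_id(wafer_name):
    """Strip the one-letter source prefix and the file extension.

    'l2901.txm', 's2901.int' and 'r2901.txt' all map to '2901'.
    """
    stem = wafer_name.strip().rsplit(".", 1)[0]  # drop the extension
    return re.sub(r"^[A-Za-z]", "", stem)        # drop the leading letter


for name in ["l2901.txm", "s2901.int", "r2901.txt"]:
    print(name, "->", wafer_id(name))
```

On the parsed frames this could be applied as `ev_data["wafer_names"].map(wafer_id)`, and likewise for the OES and RFM frames. Since the three sources have different numbers of records per wafer, a row-level join would still need some form of time alignment; the derived key only matches wafers, not individual samples.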

Representing the OES data as a DataFrame may not be the best idea. We are working towards a better representation.

#### Data Mining/Data Science and Next Steps

A detailed analytics solution is beyond the scope of these notes. The industry uses techniques ranging from simple univariate analysis to Deep Learning to solve such problems. From the data description, one can infer the kind of data preprocessing and feature engineering required. Domain understanding or active guidance from process engineers in the field may help you start an exciting project. A good starting point is the original paper [1].
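As one example of a simple preprocessing step, the per-sample records can be collapsed into one row of summary statistics per wafer. The sketch below is only an illustration on a toy frame, not a method from the original paper [1]; the wafer names `w1` and `w2`, the sensor values, and the label `hypothetical_fault` are made up.

```python
import pandas as pd

# Toy stand-in for a parsed frame; sensor values are made up
toy = pd.DataFrame(
    {
        "Pressure": [1227.0, 1229.0, 1221.0, 1300.0, 1310.0],
        "RF Pwr": [132.0, 134.0, 134.0, 150.0, 152.0],
        "wafer_names": ["w1", "w1", "w1", "w2", "w2"],
        "fault_name": ["calibration"] * 3 + ["hypothetical_fault"] * 2,
    }
)

sensor_cols = ["Pressure", "RF Pwr"]

# One row per wafer: mean and standard deviation of every sensor
features = toy.groupby(["wafer_names", "fault_name"])[sensor_cols].agg(["mean", "std"])

# Flatten the two-level column index into names like 'Pressure_mean'
features.columns = ["_".join(col) for col in features.columns]
features = features.reset_index()
print(features)
```

A table like this gives every wafer a fixed-length feature vector, which is a common input format for univariate checks or multivariate models.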

#### Competing Interests

This notebook is intended to introduce the Eigenvector Metal Etch Data Parser[8] and the data [2]. The authors declare that they have no competing interests. The authors declare that no proprietary information related to the authors, affiliated company, or its approach, methodologies, and IPR is discussed in these notes.

### References

[1] B.M. Wise, N.B. Gallagher, S.W. Butler, D.D. White, Jr. and G.G. Barna, “A Comparison of Principal Components Analysis, Multi-way Principal Components Analysis, Tri-linear Decomposition and Parallel Factor Analysis for Fault Detection in a Semiconductor Etch Process”, J. Chemometrics, 13, 379-396 (1999)

[2] https://eigenvector.com/resources/data-sets/

[3] https://en.wikipedia.org/wiki/Etching_(microfabrication) 

[4] https://hha.hitachi-hightech.com/en/blogs-events/blogs/2017/10/25/optical-emission-spectroscopy-(oes)/

[5] IRMS - The current in an alternating current circuit varies continuously in direction and magnitude. ... Diagrams denote this current as "IRMS," with the "RMS" in subscript. A constant level of the root means square current dissipates the same amount of heat through a resistor as the alternating current does.
VRMS - In electricity: Alternating-current circuits. The root-mean-square (RMS) voltage of a sinusoidal source of electromotive force (Vrms) is used to characterize the source. It is the square root of the time average of the voltage squared. - https://www.quora.com/What-are-VRMS-and-IRMS 

[6] https://www.microwavejournal.com/articles/21140-rf-technology-in-semiconductor-wafer-processing 

[7] https://github.com/jaganadhg/egvsemicon/blob/main/rfm_variable_unit_map.csv