# 2018 PHM Data Challenge 

## Introduction

This data set examines the fault behavior of an ion mill etch tool used in a wafer manufacturing process (see references at the end of this document). The process of ion mill etching typically consists of the following steps:

1. Insert a wafer into the mill.
2. Configure the wafer settings (rotation speed, angles, beam current / voltages, etc.).
3. Process the wafer for a set amount of time.
4. Repeat steps 2 and/or 3 for the different steps of the recipe.
5. Remove the wafer from the mill.

An ion source generates ions that are accelerated through an electric field using a series of grids set at specific voltages. This creates an ion beam that travels to, and eventually strikes, the wafer surface. Material is removed from the wafer when the ions hit its surface. The wafer is placed on a rotating fixture that can be tilted at different angles relative to the incoming ion beam. A shutter mechanism, shown in Figure 2, can shield the wafer from the ion beam until the milling operation is ready to begin. A Particle Beam Neutralizer (PBN) control system influences the ion beam shape / ion distribution as the beam travels to the wafer surface.

The wafer is cooled by a helium / water system called flowcool. The cooling system passes helium gas behind the wafer at a specified flow rate. The helium gas is indirectly cooled by a water system. The wafer and fixture o-ring separates the flowcool gas from the ion mill vacuum chamber.

Many different failure mechanisms can be present in this system including leaks between flowcool and ion mill chambers, electric grid wear, ion chamber wear, etc. It would be beneficial to predict where and when these failures occur and schedule downtime of these ion mills for maintenance operations.

The objective of this data challenge is to build a model, from time-series sensor data collected from various ion mill etching tools operating under various conditions and settings, that can:

* Diagnose failures (i.e. detect and identify them)
* Determine the time remaining until the next failure (i.e. predict remaining useful life)

Predictions of time-to-failure at a specific time should only use time-series data from current and past times. In other words, do not try to predict the point of failure first and then backtrack through time to determine time-to-failure predictions.
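
As a minimal sketch of what this constraint means for feature extraction, assuming pandas is used: a trailing rolling mean only looks at the current and earlier samples, so altering a later sample leaves the earlier feature values unchanged. The function name `causal_rolling_mean` and the toy signal are illustrative and are not part of the challenge data or of `rul_pm`.

```python
import numpy as np
import pandas as pd


def causal_rolling_mean(signal: pd.Series, window: int) -> pd.Series:
    """Mean of the current sample and up to `window - 1` previous samples."""
    # rolling() is trailing by default (center=False), so no future sample is used.
    return signal.rolling(window=window, min_periods=1).mean()


# Toy signal (hypothetical values).
signal = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
features = causal_rolling_mean(signal, window=3)

# Tamper with the last sample: features at earlier times must not change.
tampered = signal.copy()
tampered.iloc[4] = 100.0
tampered_features = causal_rolling_mean(tampered, window=3)

unchanged = np.allclose(features.iloc[:4], tampered_features.iloc[:4])
print(unchanged)
```

A centered window (`center=True`) would fail this check, because the feature at a given time would then depend on later samples.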


# Dataset

The function `rul_pm.datasets.PHMDataset2018.prepare_raw_dataset(path: Path)` downloads the data and decompresses all the files.
The function `rul_pm.datasets.PHMDataset2018.prepare_dataset(dataset_path: Path)` then splits the raw dataset into run-to-failure cycles.

The dataset has two folders:
* test
    * 01_M02_DC_test.csv
* train
    * train_faults/
        * 01_M01_train_fault_data.csv
        * 01_M02_train_fault_data.csv 
    * 01_M01_DC_train.csv
    * 01_M02_DC_train.csv
    * ....

Inside the train folder, the files named `%%_%%%_DC_train.csv` contain the raw sensor readings. Each file corresponds to one machine. The files inside train/train_faults, named `%%_%%%_train_fault_data.csv`, list the recorded failures for each machine. These files contain three columns:
* time: timestamp of the failure
* fault_name: failure type
* Tool: tool that suffered the failure (constant within a file, since each file corresponds to one machine)
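
One way to pair each machine's sensor file with its fault file, given the naming pattern above, is sketched here. The helper name `pair_train_files` is hypothetical and is not part of `rul_pm`; the sketch assumes the folder layout shown earlier and skips machines that have no fault file.

```python
from pathlib import Path
from typing import List, Tuple

SENSOR_SUFFIX = "_DC_train.csv"
FAULT_SUFFIX = "_train_fault_data.csv"


def pair_train_files(train_dir: Path) -> List[Tuple[Path, Path]]:
    """Return (sensor_file, fault_file) pairs that share a machine prefix."""
    pairs = []
    for data_file in sorted(train_dir.glob(f"*{SENSOR_SUFFIX}")):
        # "01_M01_DC_train.csv" -> "01_M01"
        prefix = data_file.name[: -len(SENSOR_SUFFIX)]
        fault_file = train_dir / "train_faults" / f"{prefix}{FAULT_SUFFIX}"
        if fault_file.exists():
            pairs.append((data_file, fault_file))
    return pairs
```

Calling `pair_train_files(Path("train"))` on the layout above would yield pairs such as `01_M01_DC_train.csv` with `train_faults/01_M01_train_fault_data.csv`.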

## Pre-processing

```python
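from pathlib import Path
from typing import Union

import numpy as np
import pandas as pd


def merge_data_with_faults(
    data_file: Union[str, Path], fault_data_file: Union[str, Path]
) -> pd.DataFrame:
    """Attach to each sensor reading the next fault recorded for that machine."""
    data = pd.read_csv(data_file).dropna().set_index("time")

    # Keep one fault per timestamp and number the faults in file order.
    fault_data = (
        pd.read_csv(fault_data_file).drop_duplicates(subset=["time"]).set_index("time")
    )
    fault_data["fault_number"] = range(fault_data.shape[0])
    # direction="forward" matches each reading with the first fault at or after it.
    return pd.merge_asof(data, fault_data, on="time", direction="forward").set_index(
        "time"
    )


# `file_list` (pairs of sensor and fault files) and `save_cycle` are not defined
# in this excerpt.
for data_file, fault_data_file in file_list:
    data = merge_data_with_faults(data_file, fault_data_file)
    # Each group is one run-to-failure cycle. Readings after the last fault have
    # no fault_number and are dropped by groupby.
    for life_index, life_data in data.groupby("fault_number"):
        life = life_data.copy()
        # RUL counts the samples remaining until the failure (0 at the last sample).
        life["RUL"] = np.arange(life.shape[0] - 1, -1, -1)
        failure_type = life_data["fault_name"]
        save_cycle(life_index, life, failure_type)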
```
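
To make the forward merge and the RUL labelling concrete, here is a small worked example on toy data. The sensor values and the fault names `fault_A` and `fault_B` are made up for illustration and do not come from the challenge files. Note that RUL computed this way is expressed in number of samples, not in time units.

```python
import numpy as np
import pandas as pd

# Toy sensor readings at times 0..5 (hypothetical values).
data = pd.DataFrame(
    {"time": [0, 1, 2, 3, 4, 5], "sensor": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]}
)
# Two toy failures, recorded at t=2 and t=4.
faults = pd.DataFrame({"time": [2, 4], "fault_name": ["fault_A", "fault_B"]})
faults["fault_number"] = range(len(faults))

# Each reading is matched with the first fault at or after its timestamp.
merged = pd.merge_asof(data, faults, on="time", direction="forward")


def add_rul(life: pd.DataFrame) -> pd.DataFrame:
    """Label a run-to-failure cycle with a countdown that ends at 0."""
    life = life.copy()
    life["RUL"] = np.arange(len(life) - 1, -1, -1)
    return life


# The reading at t=5 has no later fault, so its fault_number is NaN and
# groupby leaves it out.
lives = [add_rul(life) for _, life in merged.groupby("fault_number")]

for life in lives:
    print(life[["time", "fault_name", "RUL"]])
```

Readings at t=0, 1, 2 form the first cycle and those at t=3, 4 form the second; the trailing reading at t=5 belongs to no cycle because no failure follows it.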

## Import

```python
from rul_pm.datasets.PHMDataset2018 import PHMDataset2018
```



# Dataset

```python
dataset = PHMDataset2018()
```

```
Processing files:   0%|          | 0/20 [00:00<?, ?it/s]

ParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
```

# Raw features

# Feature transformer

## Feature Analysis