##Miryam Strautkalns##

CMPE 257

[Project] Final Results


## Problem Definition:
###Data Challenge Objective:
* copied from challenge page

This year’s data challenge addresses the problem of fault classification for a rock drill application under different individual configurations of the rock drill. The task is to develop a fault diagnosis/classification model using the provided pressure sensor data as input. The training data consists of data from various faults from five individual configurations, while the testing data for the online leaderboard is blind and is from one individual configuration of the rock drill.  A final validation data set for the final scoring for the competition will be from two individual configurations from the rock drill and the labels will be blind to the contest participants. For both the testing data for the online leaderboard and the final validation data set, a reference condition from a no-fault health condition will also be provided.

The training data set contains data from 11 different fault classification categories, in which 10 are different failure modes and one class is from the healthy/no fault condition. The task is to train a model to classify the fault conditions using the training data, and to test this model on the testing data, in which the one submission per day can be used for submitting results to the online leaderboard. Validation is done with a validation data set that will be released for a one-time assessment at the end of the data challenge. Scoring of performance is done through this web interface.

##Project Objectives:

The winning group of the 2022 PHM Society Data Challenge achieved an accuracy of 100%. Though to acheive it they used several machine learning techniques and I think that the same or competitive results can be accomplished through a less computationally expensive real world approach. I'm hoping to use an ensemble model to get an accuracy greater than 99.04%.

In [1]:
# Importing necessary libraries
import csv
import pandas as pd
import numpy as np
import missingno as msno
from pathlib import Path
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, KFold


## Sensor Sampling Description
* pin 50kHz Percussion pressure at inlet fitting.
* pdmp 50kHz Damper pressure inside the outer chamber.
* po 50kHz Pressure in the volume behind the piston.

each data point in each data set per sensor is a part of the X set of variables that will relate to the output, the fault value

a fault value represented by an int is the **y** value and the **X** value is a cycle of time series data for several sensors. Named **pin**, **pdmp**, and **po**. each y value then corresponds to three np arrays of data as its X.

#Data:

In [2]:
# methods for creating a dataframe for groups of sensors during a series of tests

def get_data_df(data_file):
    # creates a DataFrame with organized data for times series fault data
    df = pd.read_csv(data_file, header=None, names=range(1000))
    df = df.dropna(axis='columns')
    df = df.rename({0: 'fault'}, axis='columns')
    df = df.loc[:, 'fault':556]
    return df


def create_df(df_arr):
  # combines DataFrames vertically and removes and columns with NaN
    full_df = pd.concat(df_arr, axis=1).dropna(axis='columns')
    full_df = full_df.loc[:, ~full_df.columns.duplicated()]
    return full_df


In [4]:
# file paths
pin1_file = '/content/data/data_pin1.csv'
pdmp1_file = '/content/data/data_pdmp1.csv'
po1_file = '/content/data/data_po1.csv'

pin2_file = '/content/data/data_pin2.csv'
pdmp2_file = '/content/data/data_pdmp2.csv'
po2_file = '/content/data/data_po2.csv'

pin4_file = '/content/data/data_pin4.csv'
pdmp4_file = '/content/data/data_pdmp4.csv'
po4_file = '/content/data/data_po4.csv'

pin5_file = '/content/data/data_pin5.csv'
pdmp5_file = '/content/data/data_pdmp5.csv'
po5_file = '/content/data/data_po5.csv'

pin6_file = '/content/data/data_pin6.csv'
pdmp6_file = '/content/data/data_pdmp6.csv'
po6_file = '/content/data/data_po6.csv'

# Pin Sensor
pin1_data = get_data_df(pin1_file)
pin2_data = get_data_df(pin2_file)
pin4_data = get_data_df(pin4_file)
pin5_data = get_data_df(pin5_file)
pin6_data = get_data_df(pin6_file)
train_pin = create_df([pin1_data, pin2_data, pin4_data, pin5_data, pin6_data])

# PDMP Sensor
pdmp1_data = get_data_df(pdmp1_file)
pdmp2_data = get_data_df(pdmp2_file)
pdmp4_data = get_data_df(pdmp4_file)
pdmp5_data = get_data_df(pdmp5_file)
pdmp6_data = get_data_df(pdmp6_file)
train_pdmp = create_df([pdmp1_data, pdmp2_data, pdmp4_data, pdmp5_data, pdmp6_data])

# PO Sensor
po1_data = get_data_df(po1_file)
po2_data = get_data_df(po2_file)
po4_data = get_data_df(po4_file)
po5_data = get_data_df(po5_file)
po6_data = get_data_df(po6_file)
train_po = create_df([po1_data, po2_data, po4_data, po5_data, po6_data])

#Data Arrays

In [6]:
# Data set up for training models
#PIN
X_pin = train_pin.drop(['fault'], axis=1).values
y_pin = train_pin['fault'].values

#PDMP
X_pdmp = train_pdmp.drop(['fault'], axis=1).values
y_pdmp = train_pdmp['fault'].values

#PO
X_po = train_po.drop(['fault'], axis=1).values
y_po = train_po['fault'].values

## Cross Validation using K Fold
* 5 groups

In [34]:
# K Fold validation - splitting the data into 5 groups of training and testing. Testing makes up 1/5 of the data.

k_fold_dict = {'train': [], 'test': []}
kf = KFold(n_splits=5)
kf.get_n_splits(X_pin)
for i, (train_index, test_index) in enumerate(kf.split(X_pin)):
    k_fold_dict['train'].append(train_index)
    k_fold_dict['test'].append(test_index)

##Analysis:

I chose to use Random Forest because it struck a solid balance between time training and accuracy. I knew that by using the three models in an ensemble the accuracy would only increase.

#Random Forest

Scikit Learn Random Forest [reference](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

In [38]:
def rfc(X, y):
  # Creates Random Forest Model
  clf = RandomForestClassifier()
  clf = clf.fit(X, y)
  return clf

#Ensemble model

I decided to use the three models as a group to try and increase the accuracy. Similar to the idea of having three people vote and the majority vote decide the class choice.

## Version 1:
* In the first ensemble model I let each model predict the fault based on it's individual choice and let the majority select the fault for prediction.

##Version 1 Problem:

There are instances where fault data appears similar for two sensors, but is more distinct for the third, showing that the physical system is affected differently depending on the fault. One Fault will be less reliably predicted on a model of a sensor less impacted. This caused false predicts.


---

**Predictions:**
* 0.9981203007518797
* 0.9981203007518797
* 0.9987460815047022
* 0.9968652037617555
* 0.9974921630094044

Average Accuracy: 0.9979

In [55]:
def ensemble_predict(train_i, test_i):
  # Variables for model
  clf_pin = rfc(X_pin[train_i], y_pin[train_i])
  clf_pdmp = rfc(X_pdmp[train_i], y_pdmp[train_i])
  clf_po = rfc(X_po[train_i], y_po[train_i])
  predict_pin = rfc(X_pin[train_i], y_pin[train_i]).predict(X_pin[test_i])
  predict_pdmp = rfc(X_pdmp[train_i], y_pdmp[train_i]).predict(X_pdmp[test_i])
  predict_po = rfc(X_po[train_i], y_po[train_i]).predict(X_po[test_i])
  y_test = y_pin[test_i]
  index_int = 0
  error_int = 0
  total_test = len(test_i)

  # loop for checking accuracy
  for pin, pdmp, po, y in zip(predict_pin, predict_pdmp, predict_po, y_test):
    predict = np.bincount([pin, pdmp, po]).argmax()
    # if the prediction doesn't match the expected value
    if predict != y:
      # shows the data for the failed prediction
      error_int = error_int + 1

      # - ERROR TRACKING -
      # print('PIN prediction:', pin, '| PDMP prediction:', pdmp, '| PO prediction:', po, '| Expected Fault:', y)
      # print('Fault Probabilities: ')
      # print(clf_pin.predict_proba([X_pin[test_i][index_int]]))
      # print(clf_pdmp.predict_proba([X_pdmp[test_i][index_int]]))
      # print(clf_po.predict_proba([X_po[test_i][index_int]]))

    index_int = index_int + 1
  print('accuracy: ', (total_test - error_int)/total_test)

In [57]:
group_num = 1
for group in zip(k_fold_dict['train'], k_fold_dict['test']):
  print('\nValidation Group', group_num)
  ensemble_predict(group[0], group[1])
  group_num = group_num + 1


Validation Group 1
accuracy:  0.9993734335839599

Validation Group 2
accuracy:  0.9981203007518797

Validation Group 3
accuracy:  0.9981191222570532

Validation Group 4
accuracy:  0.9981191222570532

Validation Group 5
accuracy:  0.9981191222570532


## Discussion on Version 1:

I thought that I could acheive a greater accuracy by assessing the instances where the predictions were wrong. I decided to proceed with a 'Version 2', which switched to a system assessing probability because I observed that they were mostly situations where one model sensor was more reliable than the others, but this was over-ruled by the predictions of two sensors with low probability in choosing their prediction.

##Version 2:
* In this version rather than allowing each predict method to decide the fault similar to a majority vote between 3 people, I summed the three arrays of probabilities for each sensor model and chose the fault that had the highest sum of probabilities. This increased the accuracy to 100% during the preliminary tests. I've listed the accuracy from cross validation tests.

## Evaluation and Reflection:

* Occasionally there is one fault that is classified incorrectly. This fault is a unique scenario that cannot be solved through machine learning. The probability is equally split between two faults which means this one scenario requires domain knowledge. This still isn't something I consider a failing, the models in ensemble have been able to successfully classify every no fault and fault with 100% accuracy.


---

**Predictions:**
* 1.0
* 1.0
* 0.9993730407523511
* 0.9987460815047022
* 1.0

Average Accuracy: 0.9996

In [58]:
def ensemble_predict_v2(train_i, test_i):
  # Variables for model
  clf_pin = rfc(X_pin[train_i], y_pin[train_i])
  clf_pdmp = rfc(X_pdmp[train_i], y_pdmp[train_i])
  clf_po = rfc(X_po[train_i], y_po[train_i])
  predict_pin = clf_pin.predict(X_pin[test_i])
  predict_pdmp = clf_pdmp.predict(X_pdmp[test_i])
  predict_po = clf_po.predict(X_po[test_i])
  pin_predict_prob = clf_pin.predict_proba(X_pin[test_i])
  pdmp_predict_prob = clf_pdmp.predict_proba(X_pdmp[test_i])
  po_predict_prob = clf_po.predict_proba(X_po[test_i])
  predict_prob = pin_predict_prob + pdmp_predict_prob + po_predict_prob
  y_test = y_pin[test_i]
  index_int = 0
  error_int = 0
  total_test = len(test_i)

  # loop for checking accuracy
  for prob_x, pin, pdmp, po, y in zip(predict_prob, predict_pin, predict_pdmp, predict_po, y_test):
    fault = np.where(prob_x == prob_x.max())[0][0]+1
    # if the prediction doesn't match the expected value
    if fault != y:
      # shows the data for the failed prediction
      error_int = error_int + 1

      # - ERROR TRACKING -
      # print('Probability Prediction:', fault)
      # print('PIN prediction:', pin, '| PDMP prediction:', pdmp, '| PO prediction:', po, '| Expected Fault:', y)
      # print('Fault Probabilities: ')
      # print(clf_pin.predict_proba([X_pin[test_i][index_int]]))
      # print(clf_pdmp.predict_proba([X_pdmp[test_i][index_int]]))
      # print(clf_po.predict_proba([X_po[test_i][index_int]]))

    index_int = index_int + 1
  print('accuracy: ', (total_test - error_int)/total_test)

In [59]:
group_num = 1
for group in zip(k_fold_dict['train'], k_fold_dict['test']):
  print('\nValidation Group', group_num)
  ensemble_predict_v2(group[0], group[1])
  group_num = group_num + 1


Validation Group 1
accuracy:  1.0

Validation Group 2
accuracy:  1.0

Validation Group 3
accuracy:  0.9993730407523511

Validation Group 4
accuracy:  0.9987460815047022

Validation Group 5
accuracy:  1.0


#**Results:**

[2022 PHM Society Data Challenge](https://data.phmsociety.org/2022-phm-conference-data-challenge/)

Accuracy of winners:

1. **100.00%**
2. **99.77%**
3. **99.04%**

My goal was to score at least high enough to place in the top three for this data challenge competition. Based on my own testing I can confidently claim 2nd place with an average accuracy of **99.96%** with 5 cross validation tests and with a solution that can be used in the real world. Thus, this project was successful. I have submitted my validation results for their mystery validation data set and will get the results back soon.

