# Event-driven machine learning focusing on feature generation and selection

The **PdM (Predictive Maintenance)** approach proposed in [14] is evaluated on a use case
of automated teller machines (**ATMs**), but it is **general enough to be applied to any
industrial scenario** where error and failure logs are available. It follows a similar
rationale to [10], but implicitly assumes that the **recorded log types are more
commonly related to the targeted failure** (e.g., generated from software exceptions), and it puts more **emphasis on feature generation and selection**. We will refer
to this approach as **FSPdM**. Its main drawback is that it **does not scale with the
number of event types** present in the logs.

The authors propose a configurable approach for creating the training
and testing datasets and for formulating a binary classification problem. More
specifically, the dataset is divided into partitions (named **Observation Windows
(OW)**), and each OW is further divided into **daily segments**. Every OW is followed
by a **Prediction Window (PW)** (i.e., another partition made of daily segments), in
which a fault is predicted to take place. The range from the beginning of each
OW up to the end of the related PW defines a **training or testing instance**. The
**label of an instance (i.e., class: likely to fail, or not likely to fail)** depends on the
existence of a ticket report inside the PW: if there is a ticket in the PW, the
instance is considered positive (likely to fail); otherwise it is negative.
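
The window arithmetic described above can be sketched in a few lines of R. This is only an illustration under stated assumptions: `window_ranges` and `label_windows` are hypothetical helpers that do not appear in the notebook, and the parameters follow the notebook's own naming (`X` segments of `M` days form the OW, `Z` is an optional buffer, `Y` is the PW length and `N` is the moving step).

```r
# Hypothetical helper: enumerate the day ranges of every instance.
# X segments of M days form the OW, followed by Z buffer days and a PW of Y days;
# consecutive instances start N days apart.
window_ranges <- function(n_days, X, M, Y, Z, N) {
  ow_len <- X * M
  starts <- seq(1, n_days, by = N)
  # keep only the starts whose PW still fits inside the available days
  starts <- starts[starts + ow_len + Z + Y - 1 <= n_days]
  data.frame(
    ow_start = starts,
    ow_end   = starts + ow_len - 1,
    pw_start = starts + ow_len + Z,
    pw_end   = starts + ow_len + Z + Y - 1
  )
}

# Hypothetical helper: an instance is positive when at least one
# failure (ticket) day falls inside its PW.
label_windows <- function(ranges, failure_days) {
  mapply(function(s, e) any(failure_days >= s & failure_days <= e),
         ranges$pw_start, ranges$pw_end)
}

ranges <- window_ranges(n_days = 60, X = 4, M = 4, Y = 16, Z = 0, N = 8)
ranges$label <- label_windows(ranges, failure_days = c(40))
print(ranges)
```

With these toy values, four instances fit into the 60 days, and the single failure on day 40 makes the second and third instances positive.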




10. Korvesis, P., Besseau, S., Vazirgiannis, M.: Predictive maintenance in aviation:
Failure prediction from post flight reports. In: IEEE Int. Conf. on Data Engineering
(ICDE). pp. 1414–1422 (2018)

14. Wang, J., Li, C., Han, S., Sarkar, S., Zhou, X.: Predictive maintenance based on
event-log analysis: A case study. IBM Journal of Research and Development 61(1),
11–121 (2017)

### Setup

In [215]:
#Make the necessary imports.
#install.packages("RCurl", repos='http://cran.us.r-project.org')


suppressMessages(library(CORElearn))
suppressMessages(library(dplyr))
suppressMessages(library(plyr))
suppressMessages(library(data.table))
suppressMessages(library(randomForest))
suppressMessages(library(ggplot2))
suppressMessages(library(grid))
suppressMessages(library(argparser))
suppressMessages(library(arules))
suppressMessages(library(arulesSequences))
suppressMessages(library(xgboost))
suppressMessages(library(DiagrammeR))
library(keras)
library(kerasR)

### Init

In [216]:
#Make an argument parser named p and keep there the necessary variables.

p <- arg_parser("Implementation of the IBMs ATM Predictor")

# Add a positional argument
p <- add_argument(p, "train", help="training dataset")
p <- add_argument(p, "test", help="test dataset")

p <- add_argument(p, "fet", help="different types of the fault events",default=151) #51#11
p <- add_argument(p, "tet", help="type of the target fault events",default=151) #51#11

p <- add_argument(p, "--X", help="# of segments/sub-windows", default=5)#6#3#1
p <- add_argument(p, "--M", help="segment length (in days)", default=2)#5#2#1
p <- add_argument(p, "--Y", help="length of the prediction window (in days)", default=10)#30#2#3#2
p <- add_argument(p, "--Z", help="length of the buffer window (in days)", default=5)
p <- add_argument(p, "--N", help="moving step (in days)", default=10)#2#2#1
p <- add_argument(p, "--step", help="feature selection decrease step", default=40)#20#5#1
p <- add_argument(p, "--seed", help="seed for XGB", default=400)

p <- add_argument(p, "--sup", help="pattern features apriori support", default=0.8)
p <- add_argument(p, "--conf", help="pattern features confidence", default=0.6)
p <- add_argument(p, "--fs", help="apply feature selection", default=TRUE)
p <- add_argument(p, "--pf", help="use pattern features", default=FALSE)#TRUE#FALSE
p <- add_argument(p, "--sf", help="use similarity feature", default=TRUE)
p <- add_argument(p, "--plogic", help="TRUE for custom pattern detection logic FALSE for paper's logic", default=TRUE)
p <- add_argument(p, "--csv", help="output for csv", default=FALSE)
p <- add_argument(p, "--minwint", help="min # of days before failure to expect a warning for true positive decision", default=1)#2#3
p <- add_argument(p, "--maxwint", help="max # of days before failure to expect a warning for true positive decision", default=30)#5



### Define the necessary variables

In [217]:
#Define the necessary variables.

argv = data.frame() #make a data frame named argv
#if( length(commandArgs(trailingOnly = TRUE)) != 0){
if(FALSE){
  argv <- parse_args(p)
} else {
  #parse to argv the p's arguments as  argv <- parse_args(p,c("training dataset's path","test dataset's path",fet,tet))  
  #argv <- parse_args(p,c("C:/Users/Public/ptyxiakh/training_my_dataset2.csv","C:/Users/Public/ptyxiakh/testing_my_dataset2.csv",11,11))
  #argv <- parse_args(p,c("C:/Users/Public/ptyxiakh/training_my_dataset_5years.csv","C:/Users/Public/ptyxiakh/testing_my_dataset_5years.csv",51,51))  
  argv <- parse_args(p,c("C:/Users/petsi/Documents/ptyxiakh/training_my_dataset_150.csv",
                         "C:/Users/petsi/Documents/ptyxiakh/testing_my_dataset_150.csv",151,151))
}

#init the variables
train_path=argv$train
test_path=argv$test

b_length = argv$fet
target_event = argv$tet

X = argv$X
M = argv$M
Y = argv$Y
Z = argv$Z
N = argv$N
step = argv$step
seed = argv$seed

sup = argv$sup
conf = argv$conf
FEATURE_SELECTION = argv$fs
PATTERN_FEATURES = argv$pf
JACCARD_FEATURE = argv$sf
PATTERN_CUSTOM = argv$plogic
csv = argv$csv
max_warning_interval = argv$maxwint
min_warning_interval = argv$minwint

print("The data frame argv is:")
print(argv)

[1] "The data frame argv is:"
[[1]]
[1] FALSE

$help
[1] FALSE

$opts
[1] NA

$X
[1] 5

$M
[1] 2

$Y
[1] 10

$Z
[1] 5

$N
[1] 10

$step
[1] 40

$seed
[1] 400

$sup
[1] 0.8

$conf
[1] 0.6

$fs
[1] TRUE

$pf
[1] FALSE

$sf
[1] TRUE

$plogic
[1] TRUE

$csv
[1] FALSE

$minwint
[1] 1

$maxwint
[1] 30

$train
[1] "C:/Users/petsi/Documents/ptyxiakh/training_my_dataset_150.csv"

$test
[1] "C:/Users/petsi/Documents/ptyxiakh/testing_my_dataset_150.csv"

$fet
[1] 151

$tet
[1] 151



### Reading function
**function: read_dataset**

In [218]:
#Function for reading a csv file into a two-column table (Timestamps, Event_id).

read_dataset <- function(path){
  dataset = read.table(path, header = TRUE, sep = ",", dec = ".", comment.char = "#")
  dataset[, 2]  <- as.numeric(dataset[, 2])
  return(dataset)
}

### Read train and test set

The recorded log events are read from two csv files.
The file at **train_path** holds the **training_set** and the file at **test_path** holds the **test_set**.

In [219]:
#Reading train and test set.

training_set = read_dataset(train_path)
test_set =  read_dataset(test_path)

print("The first rows of test_set look like:")
print(head(test_set))

[1] "The first rows of test_set look like:"
  Timestamps Event_id
1 2016-12-31       36
2 2016-12-31       43
3 2016-12-31       58
4 2016-12-31      112
5 2016-12-31      120
6 2016-12-31      130


### Functions for computing frequencies of events
**1) function: create_episodes_list**

In [220]:
#Function for creating frequency vectors for each day.

create_episodes_list <- function(ds2years,b_length){

  #copy the two columns of the original dataset into a data frame of episodes
  #(a single vectorized call instead of appending one row at a time)
  episode_df <- data.frame(Timestamps=ds2years$Timestamps, Event_id=ds2years$Event_id)

  #group by day: each row keeps the events that happened on that day
  aggr_episode_df = aggregate(episode_df[ ,2], FUN=function(x){return(x)}, by=list(as.Date(episode_df$Timestamps, "%Y-%m-%d")))
  
  #count the occurrences of each event type per day(function: compute_frequency_vectors)
  frequency_day_vectors = compute_frequency_vectors(aggr_episode_df,b_length)

  return(frequency_day_vectors)
}

**2) function: compute_frequency_vectors**

In [221]:
#Convert the per-day event lists to frequency (count) vectors

compute_frequency_vectors <- function(aggr_episode_df,b_length){
    
  #data frame for binary frequency vectors  
  freq_aggr_episode_df <- data.frame(matrix(ncol = b_length+1, nrow = 0))
  
  #x keeps the names of the columns. |Timestamps||e_1||e_2|...|e_b_length|  
  x <- c(c("Timestamps"), c(paste("e_",c(1:b_length),sep = "")))

  #iterate over every line(day) of the aggr_episode_df
  for(i in 1:nrow(aggr_episode_df)) {
      
      #init a vector with b_length zeros
      freq_vector = as.vector(integer(b_length))
    
      #get the current row of aggr_episode_df(frequency vector-data frame of data set)
      seg <- aggr_episode_df[i,]
    
      #for every value(fault event) in the current line(that happened in the current day)
      for(value in seg$x[[1]]){
          #store at position "value" how many times this fault event happened in the current day
          freq_vector[[value]] = length(which(seg$x[[1]] == value))
      }
    
      #add a new line to freq_aggr_episode_df
      #the row is built through a matrix, so the date is stored as the number of days since 1970-01-01 and is converted back to a Date after the loop
      
      date = as.Date(seg$Group.1[[1]])
      new_df = data.frame(matrix(c(date, freq_vector),nrow=1,ncol=b_length+1))
      freq_aggr_episode_df <- rbind(freq_aggr_episode_df,new_df)
  }
  #set column's name as x defines
  colnames(freq_aggr_episode_df) <- x
  
  #set column "Timestamps" x to a Date: "Y-m-d" column  
  freq_aggr_episode_df$Timestamps <- as.Date(freq_aggr_episode_df$Timestamps , origin="1970-01-01")
    
  return(freq_aggr_episode_df)
}
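
As a cross-check of what the two functions above produce, the same per-day counts can be obtained on a toy log with base R's `table()`. This is a minimal sketch rather than the notebook's implementation: `toy_log` and `n_types` are made-up names, with `n_types` playing the role of `b_length`.

```r
# Toy log in the same two-column layout as the notebook's csv files.
toy_log <- data.frame(
  Timestamps = c("2017-01-01", "2017-01-01", "2017-01-01", "2017-01-03"),
  Event_id   = c(2, 2, 3, 1)
)
n_types <- 3  # hypothetical number of event types (role of b_length)

# Fixing the factor levels keeps a column for event types that never occur.
counts <- table(
  toy_log$Timestamps,
  factor(toy_log$Event_id, levels = seq_len(n_types))
)

# One row per day that has at least one event, one column per event type.
daily <- as.data.frame.matrix(counts)
colnames(daily) <- paste0("e_", seq_len(n_types))
daily <- cbind(Timestamps = as.Date(rownames(daily)), daily)
rownames(daily) <- NULL
print(daily)
```

Like the `aggregate`-based version, this yields rows only for days on which at least one event was logged.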

In [222]:
#For each of the training and testing sets, create a data frame that keeps the frequency of every fault event per day.

frequency_day_vectors = create_episodes_list(training_set,b_length)
test_frequency_day_vectors = create_episodes_list(test_set,b_length)

print("For training set head of frequency_day_vectors is:")
print(head(frequency_day_vectors))

[1] "For training set head of frequency_day_vectors is:"
  Timestamps e_1 e_2 e_3 e_4 e_5 e_6 e_7 e_8 e_9 e_10 e_11 e_12 e_13 e_14 e_15
1 2014-01-01   0   0   0   0   0   0   0   0   0    0    0    0    0    0    0
2 2014-01-02   0   0   0   0   0   0   0   0   0    0    0    0    0    0    0
3 2014-01-03   0   0   0   0   1   0   0   0   0    0    0    0    0    0    0
4 2014-01-04   0   0   0   0   0   0   0   0   0    0    0    0    0    0    0
5 2014-01-05   0   0   0   0   0   0   0   0   0    0    0    0    0    0    0
6 2014-01-06   0   0   0   0   1   0   0   0   0    0    0    0    0    0    0
  e_16 e_17 e_18 e_19 e_20 e_21 e_22 e_23 e_24 e_25 e_26 e_27 e_28 e_29 e_30
1    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
2    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
3    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
4    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
5    

### OW and PW
For the testing dataset (**test_frequency_day_vectors**) OW and PW are presented.

In [223]:
#X;M;Y;Z;N;MIN;MAX
#4;4;16;0;8;1;16

X=4
M=4
Y=16
Z=0
N=8
min_warning_interval=1
max_warning_interval=16

#for(i in  seq(1,nrow(test_frequency_day_vectors), by=N)){
for(i in  seq(1,1, by=N)){
      
    #if the PW would extend past the last available day then stop
    #if((i-1+X*M+Z+Y-1) > nrow(test_frequency_day_vectors)){
    if((i-1+X*M+Z+1+Y-1) > nrow(test_frequency_day_vectors)){ 
      break
    }
    
    #take a window of X*M days starting at day i (here 4*4=16 days)
    #subset by row, get the X*M days
    OW = test_frequency_day_vectors[i:((X*M)+(i-1)),] 
    #subset by row, get the Y days
    #PW = test_frequency_day_vectors[(i-1+X*M+Z):(i-1+X*M+Z+Y-1),]
    PW = test_frequency_day_vectors[(i-1+X*M+Z+1):(i-1+X*M+Z+1+Y-1),]
    
    
    print("For OW:")
    print(OW)
    print("")
    print("the PW is:")
    print(PW) 
    print("-----------------------------------------------------")
     
}     

[1] "For OW:"
   Timestamps e_1 e_2 e_3 e_4 e_5 e_6 e_7 e_8 e_9 e_10 e_11 e_12 e_13 e_14 e_15
1  2016-12-31   0   0   0   0   0   0   0   0   0    0    0    0    0    0    0
2  2017-01-01   0   0   0   0   0   0   0   0   0    0    0    0    0    0    0
3  2017-01-02   0   0   0   0   0   0   0   0   0    0    0    0    0    0    0
4  2017-01-03   0   0   0   0   0   0   0   0   0    0    0    0    0    0    0
5  2017-01-04   0   0   0   0   0   1   0   0   0    0    0    0    0    0    0
6  2017-01-05   0   0   0   0   0   0   0   0   0    0    0    1    0    0    0
7  2017-01-06   0   0   0   0   0   0   0   0   0    0    0    0    0    0    0
8  2017-01-07   0   0   0   0   0   0   0   0   0    0    0    0    0    0    0
9  2017-01-08   0   0   0   0   0   0   0   0   0    0    0    0    0    0    0
10 2017-01-09   0   0   0   0   0   0   0   0   0    0    0    1    0    1    0
11 2017-01-10   0   0   0   0   0   0   0   0   0    0    0    0    0    0    0
12 2017-01-11   0   0   0 

## Functions for creating new instances for the dataset
Each created instance comprises five feature categories: 

- **Basic Features**: A frequency vector for each error type inside an OW. 

- **Advanced Statistical Features**: For each error type, a vector of statistics: the minimum, maximum and mean distance of its occurrences inside the OW from the beginning of the corresponding PW, and the mean and standard deviation of the distance between consecutive occurrences of the same error type inside the OW. 

- **Pattern-based Features**: A binary vector of error type patterns, created by applying a confidence threshold to the relative frequency of each pattern over all the OWs. The initial set of patterns is the power set (excluding the empty set) of the error types inside each OW. 

- **Failure Similarity Features**: The Jaccard similarity of two consecutive failures (tickets) of the same type, computed from the error types of each corresponding OW. 

- **Profile-based Features**: Equipment-specific features, such as the model and the installation date of an ATM.
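
To make the failure similarity idea concrete, the sketch below computes the Jaccard similarity between the sets of error types that are active in two windows. It illustrates the textbook set definition only; `active_events` and `jaccard_similarity` are hypothetical helpers and are not the functions used later in this notebook.

```r
# Set of event types that occur at least once in a window of daily counts.
active_events <- function(window) {
  names(window)[colSums(window != 0) > 0]
}

# Jaccard similarity: size of the intersection divided by size of the union.
# Assumption: the similarity is defined as 0 when both sets are empty.
jaccard_similarity <- function(a, b) {
  u <- union(a, b)
  if (length(u) == 0) return(0)
  length(intersect(a, b)) / length(u)
}

# Two toy windows of two days each, with three event types.
ow_a <- data.frame(e_1 = c(0, 1), e_2 = c(1, 0), e_3 = c(0, 0))
ow_b <- data.frame(e_1 = c(0, 0), e_2 = c(2, 0), e_3 = c(0, 1))

sim <- jaccard_similarity(active_events(ow_a), active_events(ow_b))
print(sim)
```

Here the first window contains `e_1` and `e_2` and the second contains `e_2` and `e_3`, so one shared type out of three distinct types gives a similarity of 1/3.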






**1) function: create_instances**

In [224]:
#Function that creates new instances(for a dataset) from frequency vector.

create_instances <- function(frequency_day_vectors,target_event,X,M,Y,Z,N,PATTERN_FEATURES,SIMILARITY_FEATURE,test,conf,sup){
  
  #frequency_day_vectors without the Timestamps column
  frequency_day_vectors = frequency_day_vectors[ , !(names(frequency_day_vectors) %in% c("Timestamps"))]

  #init data frame instances_df(function: init)  
  instances_df = init(frequency_day_vectors,target_event,X,M,Y,Z,N,PATTERN_FEATURES,SIMILARITY_FEATURE,test,conf,sup);

  #for each OW
  for(i in  seq(1,nrow(frequency_day_vectors), by=N)){
      
    #if the PW would extend past the last available day then stop
    if((i-1+X*M+Z+1+Y-1) > nrow(frequency_day_vectors)){
      break
    }
    
    #take a window of X*M days starting at day i
    #subset by row, get the X*M days
    OW = frequency_day_vectors[i:((X*M)+(i-1)),]  

    #compute the new instances of each OW   
    instance = compute_pattern_features(PATTERN_FEATURES,OW)
    instance = c(instance,compute_similarity_feature(SIMILARITY_FEATURE,OW))
    instance = c(instance,compute_advanced_statistic_features(OW))
    instance = compute_basic_statistic_features(instance,OW)
      
    #save it as a new data frame row 
    instance_df = as.data.frame(instance)

    #check the label of the OW
    Label = label_OW(i,frequency_day_vectors,target_event,X,M,Y,Z,N,test)
    
    #if a buffer window is used and the target event happened in it
    if(is.null(Label)){
      if(!csv){
        print("moving to the next")
      }
      next
    }
    
    #set label of instance_df to Label    
    instance_df$Label = Label
    
    #set instance_df as a new row of instances_df data frame  
    instances_df = rbind(instances_df,instance_df) 
  }
    
  return(instances_df)
}

**2) function: init**

In [225]:
#Function that inits the instances data frame.

init <- function(frequency_day_vectors,target_event,X,M,Y,Z,N,PATTERN_FEATURES,SIMILARITY_FEATURE,test,conf,sup){
  
  #make a data frame named instances_df with ncol=(5+ #of sub-windows)*(# of frequency_day_vectors' columns)+1
  #                                     and  nrow=0
  instances_df = data.frame(matrix(ncol = (ncol(frequency_day_vectors)*5+ncol(frequency_day_vectors)*X+1), nrow = 0))
  
  #if we chose to get pattern features
  #---------------------------------------------------------------------------------------------------  
  if(PATTERN_FEATURES && SIMILARITY_FEATURE){
    if(!test){
      #get the frequent patterns(function: getFrequentPatterns)  
      freq_pattern_items <<- getFrequentPatterns(frequency_day_vectors,target_event,X,M,Y,Z,N,conf,sup)
    }
    #increase the instances_df column size by the length of the patterns + 1(for similarity feature) 
    instances_df = data.frame(matrix(ncol = (ncol(frequency_day_vectors)*5+ncol(frequency_day_vectors)*X+length(freq_pattern_items)+1+1), nrow = 0))
  } else if(PATTERN_FEATURES){
    if(!test){
      #get the frequent patterns(function: getFrequentPatterns)   
      freq_pattern_items <<- getFrequentPatterns(frequency_day_vectors,target_event,X,M,Y,Z,N,conf,sup)
    }
    #increase the instances_df column size by the length of the patterns  
    instances_df = data.frame(matrix(ncol = (ncol(frequency_day_vectors)*5+ncol(frequency_day_vectors)*X+length(freq_pattern_items)+1), nrow = 0))
  #--------------------------------------------------------------------------------------------------- 
  
  #if we did not choose to get pattern features, but we chose similarity feature
  #---------------------------------------------------------------------------------------------------     
  } else if(SIMILARITY_FEATURE){
    #increase the instances_df column size by 1(for similarity feature)
    instances_df = data.frame(matrix(ncol = (ncol(frequency_day_vectors)*5+ncol(frequency_day_vectors)*X+1+1), nrow = 0))
  }
  #---------------------------------------------------------------------------------------------------  
  
  #if it is not for a test dataset and we chose similarity feature    
  if(!test && SIMILARITY_FEATURE){
    #get the first positive Observed Window(function: get_first_positive_OW)  
    first_positive_OW <<- get_first_positive_OW(frequency_day_vectors,target_event,X,M,Y,Z,N,test)
    
    print("First_positive OW is:")
    print(first_positive_OW)  
    print("")    
  }
  
  if(PATTERN_FEATURES && length(freq_pattern_items) <= 0){
    if(!csv){
      print("WARNING: No patterns found.")
      print("")
    }
  }
  
  return(instances_df)
}

### 3) Functions for extracting frequent patterns
**3a) function: getFrequentPatterns**

**3b) function: getFrequentPatternsPaperConf**

**3c) function: getFrequentPatternsCustom**

In [226]:
#Functions used for extracting frequent patterns.

#For calling the appropriate function 
getFrequentPatterns <- function(frequency_day_vectors,target_event,X,M,Y,Z,N,conf_level,support){
  if(PATTERN_CUSTOM){
    return(getFrequentPatternsCustom(frequency_day_vectors,target_event,X,M,Y,Z,N,conf_level,support))
  } else {
    return(getFrequentPatternsPaperConf(frequency_day_vectors,target_event,X,M,Y,Z,N,conf_level,support))
  }
}

#Two variants of the same function
#----------------------------------------------------------------------------------------------------------
#Function that returns the patterns using the apriori algorithm and filters them with a confidence threshold
getFrequentPatternsPaperConf <- function(frequency_day_vectors,target_event,X,M,Y,Z,N,conf_level,support){
  test = FALSE
  events_list <- list()
  
  #for each OW   
  for(i in  seq(1,nrow(frequency_day_vectors), by=N)){
    #if the PW would extend past the last available day then stop
    if((i-1+X*M+Z+1+Y-1) > nrow(frequency_day_vectors)){
      break
    }
    OW = frequency_day_vectors[i:((X*M)+(i-1)),] #Observation Window = subset of X*M days
    freq = sapply(OW, sum) #for each OW sum the frequencies of every fault event
    events = names(freq[freq>0]) #keep the names of the events with freq>0
    events_list[length(events_list)+1] = list(events) #save the names to a list
  }
  names(events_list) <- paste("OW",c(1:length(events_list)), sep = "") #name each value of the list as "OW(i)" 1<=i<=(#of OWs)
  trans1 <- as(events_list, "transactions") #make list to transactions
    
  #https://www.rdocumentation.org/packages/arules/versions/1.6-4/topics/apriori  
  freq_patterns <- apriori(trans1, parameter = list(supp=support,target="frequent",maxtime=5,maxlen=10, minlen=1),control=list(verbose = FALSE))
  
  #if no frequency patterns found return   
  if(length(items(freq_patterns)) == 0){
    return(items(freq_patterns))
  }
  
  #init integer vectors with length(items(freq_patterns)) zeros   
  total_freq = integer(length(items(freq_patterns)))
  positive_freq = integer(length(items(freq_patterns)))
    
  #for each OW  
  for(i in  seq(1,nrow(frequency_day_vectors), by=N)){
    #if the end of PW is equal to the total days then stop
    if((i-1+X*M+Z+1+Y-1) > nrow(frequency_day_vectors)){   #I CHANGED IT
      break
    }

    #check the label of the OW(function: label_OW)
    Label = label_OW(i,frequency_day_vectors,target_event,X,M,Y,Z,N,test)
      
    if(is.null(Label)){
      if(!csv){
        print("Error in BW moving to the next OW")
      }
      next
    }
    
    OW = frequency_day_vectors[i:((X*M)+(i-1)),] #Observation Window = subset of X*M days
      
    logical_v = items(freq_patterns) %in% colnames(OW[, colSums(OW != 0) > 0]) #which patterns match the events present in this OW
      
    #use a separate index(k) so the outer loop variable i is not overwritten
    for(k in 1:length(logical_v)){
      if(logical_v[k]){
        #if the OW is labeled(which means that inside OW's PW the target event happened)   
        if(Label){
          positive_freq[k] = positive_freq[k] + 1
        } 
        total_freq[k] = total_freq[k] + 1
      }
    }
  }
  conf = positive_freq/total_freq #for each frequent pattern set conf=positive_freq/total_freq
  
  return(items(freq_patterns)[conf>=conf_level]) #return only the frequent patterns with minimum confidence=conf_level
}


getFrequentPatternsCustom <- function(frequency_day_vectors,target_event,X,M,Y,Z,N,conf_level,support){
  test = FALSE
  events_list <- list()
  
  for(i in  seq(1,nrow(frequency_day_vectors), by=N)){
    #if the end of PW is equal to the total days then stop
    if((i-1+X*M+Z+1+Y-1) > nrow(frequency_day_vectors)){
      break
    }
    #check the label of the OW
    Label = label_OW(i,frequency_day_vectors,target_event,X,M,Y,Z,N,test)
    if(is.null(Label)){
      if(!csv){
        print("Error in BW moving to the next OW")
      }
      next
    }
    
    #only for the labeled OWs
    if(Label){
      OW = frequency_day_vectors[i:((X*M)+(i-1)),] #subset by row, get the X*M days
      freq = sapply(OW, sum)
      events = names(freq[freq>0])
      events_list[length(events_list)+1] = list(events)
    }
  }
  names(events_list) <- paste("OW",c(1:length(events_list)), sep = "")
  trans1 <- as(events_list, "transactions")
  freq_patterns <- apriori(trans1, parameter = list(supp=support,target="frequent",maxtime=5,maxlen=10, minlen=1),control=list(verbose = FALSE))
  if(length(items(freq_patterns)) == 0){
    return(items(freq_patterns))
  }
  total_freq = integer(length(items(freq_patterns)))
  for(i in  seq(1,nrow(frequency_day_vectors), by=N)){
    #if the end of PW is equal to the total days then stop
    if((i-1+X*M+Z+1+Y-1) > nrow(frequency_day_vectors)){
      break
    }
    
    OW = frequency_day_vectors[i:((X*M)+(i-1)),] #subset by row, get the X*M days
    logical_v = items(freq_patterns) %in% colnames(OW[, colSums(OW != 0) > 0])
    #use a separate index(k) so the outer loop variable i is not overwritten
    for(k in 1:length(logical_v)){
      if(logical_v[k]){
        total_freq[k] = total_freq[k] + 1
      }
    }
  }
  conf = quality(freq_patterns)$count/total_freq
  
  

  return(items(freq_patterns)[conf>=conf_level])
}
#----------------------------------------------------------------------------------------------------------
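
The confidence filter applied in `getFrequentPatternsPaperConf` can be illustrated on toy data without `arules`. The sketch below is not the notebook's code: the candidate patterns are written by hand instead of being mined with apriori, and it assumes that a pattern matches a window when all of its items occur in that window, which may differ from how `%in%` behaves on `arules` itemsets.

```r
# Each observation window is represented by the set of event types seen in it,
# together with the label of its prediction window.
ow_events <- list(
  c("e_1", "e_2"),
  c("e_1", "e_2", "e_3"),
  c("e_2"),
  c("e_1", "e_2")
)
ow_labels <- c(TRUE, TRUE, FALSE, FALSE)

# Hand-written candidate patterns (in the notebook these come from apriori).
patterns <- list(c("e_1", "e_2"), c("e_2"), c("e_3"))

# Assumption: a pattern matches a window when all of its items occur in it.
# The result is a logical matrix with one row per window and one column per pattern.
matches <- sapply(patterns, function(p) {
  sapply(ow_events, function(ev) all(p %in% ev))
})

total_freq    <- colSums(matches)              # windows containing the pattern
positive_freq <- colSums(matches & ow_labels)  # ... that are also labeled positive
confidence    <- positive_freq / total_freq

conf_level <- 0.6
kept <- patterns[confidence >= conf_level]
print(confidence)
print(kept)
```

With a threshold of 0.6, the pattern `{e_2}` is dropped because it appears in every window but only half of those windows are positive.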

### 4) Functions for extracting information about Observation Windows
**4a) function: get_first_positive_OW**

**4b) function: label_OW**

In [227]:
#Functions used for extracting information about the Observation Windows.

#Function that returns the first positive OW, which means that inside its PW the target event happened
get_first_positive_OW <- function(frequency_day_vectors,target_event,X,M,Y,Z,N,test){
   
  #for every OW  
  for(i in  seq(1,nrow(frequency_day_vectors), by=N)){
    
    #if the end of PW is equal to the total days then stop
    if((i-1+X*M+Z+1+Y-1) > nrow(frequency_day_vectors)){
      break
    }

    #check the label of the OW(function: label_OW)
    Label = label_OW(i,frequency_day_vectors,target_event,X,M,Y,Z,N,test)

    if(is.null(Label)){
      if(!csv){
        print("moving to the next")
      }
      next
    }
    
    if(Label){
      #if you found a labeled OW, return it 
      return(frequency_day_vectors[i:((X*M)+(i-1)),]) #subset by row, get the X*M days
    }
  }
}

#Function that returns the label(boolean variable) of an OW
#                         true->the target event happened in OW's PW
#                         false->the target event did not happen in OW's PW
label_OW <- function(index,frequency_day_vectors,target_event,X,M,Y,Z,N,test){
  
  PW = data.frame() #init an empty data frame PW   
  
  if(test){ #if it is for a test dataset  
    PW = frequency_day_vectors[(index-1+X*M+Z+1):(index-1+X*M+Z+1+Y-1),] #Prediction Window with length=Y     
  } else { #if it is not for a test dataset     
    if(Z != 0){ #if we use a buffer window
      BW = frequency_day_vectors[(index-1+((X*M))+1):(index-1+((X*M))+1+Z-1),] #buffer window with length=Z
      if(sum(BW[paste("e_",target_event,sep="")]) > 0){
        return(NULL) #return null if inside the BW target event happened
      }
    }
    #Prediction Window with length=Y, starting right after the buffer window(if any)
    PW = frequency_day_vectors[(index-1+X*M+Z+1):(index-1+X*M+Z+1+Y-1),]
  }
  return(sum(PW[paste("e_",target_event,sep="")]) > 0) #return true if inside the PW target event happened, else false
}

### 5) Functions for computing the features of each new instance
**5a) function: compute_pattern_features**

**5b) function: compute_similarity_feature**

**5c) function: jaccard**

**5d) function: compute_advanced_statistic_features**

**5e) function: compute_basic_statistic_features**

In [228]:
#Functions used to compute the features of the new instances.

#Function to compute pattern features.
compute_pattern_features <- function(PATTERN_FEATURES,OW){
  P=list()
  if(PATTERN_FEATURES && length(freq_pattern_items) > 0){
    P = setNames(as.list(rep(0,length(freq_pattern_items))),as.list(paste("p_",labels(freq_pattern_items),sep="")))
    logical_v = freq_pattern_items %in% colnames(OW[, colSums(OW != 0) > 0])
    for(l in 1:length(logical_v)){
      if(logical_v[l]){
        P[l] = 1
      }
    }
  }
  return(P)
}


#Function to compute the similarity feature, based on Jaccard similarity.
compute_similarity_feature <- function(SIMILARITY_FEATURE,OW){
  J=list()
  if(SIMILARITY_FEATURE){
    J = setNames(list(0),list("jaccard"))
    J[1] = jaccard(data.frame(OW,first_positive_OW),2) #Jaccard distance between OW and first_positive_OW
  }
  return(J)
}

#Function that returns the Jaccard distance(1 - Jaccard similarity) along the given margin(1=rows, 2=columns)
jaccard <- function(df, margin) {
  if (margin == 1 || margin == 2) {
    M_00 <- apply(df, margin, sum) == 0
    M_11 <- apply(df, margin, sum) == 2
    if (margin == 1) {
      df <- df[!M_00, ]
      JSim <- sum(M_11) / nrow(df)
    } else {
      df <- df[, !M_00]
      JSim <- sum(M_11) / length(df)
    }
    JDist <- 1 - JSim
    return(JDist)
  } else {
    #"break" is only valid inside a loop, so signal the invalid argument explicitly
    stop("margin must be 1 or 2")
  }
}


#Function to compute advanced statistical features for each fault event e_X of an OW:

#e_X_minD: 0->at least one row of OW has zero value at column e_X
#          1->every row of OW has non zero value at column e_X (the last day is at distance 1)

#e_X_maxD: max distance between a non zero value in column e_X and the end of OW (0 if e_X never happens)

#e_X_meanD: (sum of all distances between non zero values in column e_X and the end of OW)/(# of the OW's rows)

#e_X_meanV:ex   e_X            (mean of all distances between two consecutive non zero values)
#                 1-|DIST1=1  
#                 1-|--------|DIST2=1    
#                 1-|--------|---------|DIST3=2               e_X_meanV = (DIST1+DIST2+DIST3+DIST4)/4 = 5/4 = 1.25
#                 0                    |
#                 1-|--------|---------|--------|DIST4=1
#                 1-|--------|---------|--------|


#e_X_stdV: sample standard deviation of the error interval; for the same example:
#           e_X_stdV=(((DIST1-e_X_meanV)^2+(DIST2-e_X_meanV)^2+(DIST3-e_X_meanV)^2+(DIST4-e_X_meanV)^2)/(4-1))^(1/2)

compute_advanced_statistic_features <- function(OW){
  #compute error interval
  V <- setNames(as.list(rep(0,(2*length(names(OW))))), c(paste(names(OW),"_meanV",sep=""), paste(names(OW),"_stdV",sep="")))
  for(e in 1:length(OW)){
    event = OW[e]
    event_indeces = which(event > 0)
    #distances between consecutive occurrences of the event
    #diff() returns an empty vector when the event happens fewer than two times, which the NA checks below turn into 0
    error_interval = diff(event_indeces)
    meanV = mean(error_interval)
    meanV = if(is.na(meanV)) 0 else meanV
    stdV = sd(error_interval)
    stdV = if(is.na(stdV)) 0 else stdV
    V[paste(names(event),"_meanV",sep="")] = meanV
    V[paste(names(event),"_stdV",sep="")] = stdV
  }
  
  #compute distance from Prediction Point
  D <- data.frame(matrix(ncol = b_length, nrow = 0))
  for(d in 1:nrow(OW)){
    error_instance_distance = nrow(OW)-d+1
    day = OW[d,]
    day[1:ncol(day)][day[1:ncol(day)] > 0] = error_instance_distance
    
    D <- rbind(D,day)
  }
  min = sapply(D,min)
  names(min) = paste(names(min),"_minD",sep="")
  
  max = sapply(D,max)
  names(max) = paste(names(max),"_maxD",sep="")
  
  mean = sapply(D,mean)
  names(mean) = paste(names(mean),"_meanD",sep="")

  return(c(min,max,mean,V))
}

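#Function to compute basic statistical features: the frequency of each fault event in every sub-window of the OW
compute_basic_statistic_features <- function(instance,OW){
  
  #split the OW into X consecutive sub-windows(X is taken from the global environment)
  OW = split(OW, factor(sort(rank(row.names(OW))%%X)))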

  for(j in 1:length(OW)){
    SW = OW[j]
    freq = sapply(SW[[1]], sum)     
    names(freq) = paste(names(freq),paste("_freq_",j,sep=""),sep="")  
    instance = c(instance,freq)
  }

  return(instance)
}
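
The interval and distance statistics documented in the comments above can be checked by hand on the six-day example column `1, 1, 1, 0, 1, 1`. The snippet below is a standalone sketch with made-up variable names; it mirrors the definitions for a single event type rather than calling the notebook's functions.

```r
# Daily counts of one event type over a six-day window.
counts <- c(1, 1, 1, 0, 1, 1)

# Gaps (in days) between consecutive days on which the event occurred.
event_days <- which(counts > 0)   # days 1, 2, 3, 5, 6
gaps       <- diff(event_days)    # gaps 1, 1, 2, 1
mean_gap   <- mean(gaps)          # corresponds to e_X_meanV
sd_gap     <- sd(gaps)            # corresponds to e_X_stdV (sample standard deviation)

# Distance of each event day from the end of the window (last day = 1);
# days without the event keep the value 0, as in the notebook.
dist     <- ifelse(counts > 0, length(counts) - seq_along(counts) + 1, 0)
min_dist  <- min(dist)            # corresponds to e_X_minD
max_dist  <- max(dist)            # corresponds to e_X_maxD
mean_dist <- mean(dist)           # corresponds to e_X_meanD

print(c(meanV = mean_gap, stdV = sd_gap,
        minD = min_dist, maxD = max_dist, meanD = mean_dist))
```

The mean gap is 1.25 and the sample standard deviation is 0.5. Because the fourth day has no event, the minimum distance is 0, while the maximum is 6 and the mean is 3.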

## Create instances
**For the training set**

In [229]:
#Create instances for training set.

instances_df = create_instances(frequency_day_vectors,target_event,X,M,Y,Z,N,PATTERN_FEATURES,JACCARD_FEATURE,FALSE,conf,sup)


#remove columns with all values equal to zero
#instances_df = instances_df[, colSums(instances_df != 0) > 0]


#what is factor -> https://www.stat.berkeley.edu/~s133/factors.html
label = instances_df$Label
instances_df$Label = as.factor(label)

print("The training instances_df (also used for fitting the prediction model) looks like:")
print(head(instances_df))


[1] "First_positive OW is:"
   e_1 e_2 e_3 e_4 e_5 e_6 e_7 e_8 e_9 e_10 e_11 e_12 e_13 e_14 e_15 e_16 e_17
17   0   0   0   0   0   0   0   0   0    0    0    0    0    1    0    0    0
18   0   0   0   0   0   0   0   0   0    0    0    0    0    0    0    0    0
19   0   0   0   0   0   1   0   0   0    0    0    1    0    0    0    0    0
20   0   0   0   0   0   0   0   0   0    0    0    0    0    0    0    0    0
21   0   0   0   0   0   0   0   0   0    0    0    0    0    0    0    0    0
22   0   0   0   0   0   0   0   0   0    0    0    0    0    0    0    0    0
23   0   0   0   0   0   0   0   0   0    0    0    0    0    0    0    0    1
24   0   0   0   0   0   0   0   0   0    0    0    0    0    0    0    0    0
25   0   0   0   0   0   0   0   0   0    1    0    0    0    0    0    0    0
26   0   0   0   0   0   0   0   0   0    0    1    0    1    0    0    0    0
27   0   0   0   0   0   0   0   0   0    0    0    0    0    0    0    0    0
28   0   0   0   0   0  

[1] "The training instances_df (also used for fitting the prediction model) looks like:"
    jaccard e_1_minD e_2_minD e_3_minD e_4_minD e_5_minD e_6_minD e_7_minD
1 0.8829787        0        0        0        0        0        0        0
2 0.9090909        0        0        0        0        0        0        0
3 0.9130435        0        0        0        0        0        0        0
4 0.9453552        0        0        0        0        0        0        0
5 0.9039548        0        0        0        0        0        0        0
6 0.9236641        0        0        0        0        0        0        0
  e_8_minD e_9_minD e_10_minD e_11_minD e_12_minD e_13_minD e_14_minD e_15_minD
1        0        0         0         0         0         0         0         0
2        0        0         0         0         0         0         0         0
3        0        0         0         0         0         0         0         0
4        0        0         0         0         0

**For the testing set**

In [230]:
#Create instances for testing set.

test_instances_df = create_instances(test_frequency_day_vectors,target_event,X,M,Y,Z,N,PATTERN_FEATURES,JACCARD_FEATURE,TRUE,conf,sup)

#Convert column Label into a factor column.
test_instances_df$Label = as.factor(test_instances_df$Label)

print("The test_instances_df (used for the prediction) looks like:")
print(test_instances_df)

[1] "The test_instances_df (used for the prediction) looks like:"
     jaccard e_1_minD e_2_minD e_3_minD e_4_minD e_5_minD e_6_minD e_7_minD
1  0.9027778        0        0        0        0        0        0        0
2  0.8896552        0        0        0        0        0        0        0
3  0.9150327        0        0        0        0        0        0        0
4  0.9225806        0        0        0        0        0        0        0
5  0.9032258        0        0        0        0        0        0        0
6  0.8947368        0        0        0        0        0        0        0
7  0.9007092        0        0        0        0        0        0        0
8  0.9463087        0        0        0        0        0        0        0
9  0.8812500        0        0        0        0        0        0        0
10 0.9096774        0        0        0        0        0        0        0
11 0.8881119        0        0        0        0        0        0

### Detect target event's days

In [231]:
#Find the rows (days) of test_frequency_day_vectors where the target event happened.
#NOTE: grepl(1, ...) matches every daily count whose digits contain a 1 (1, 10, 11, ...),
#so a daily count such as 2 would not be detected.

failure_incidents = which(matrix(grepl(1, test_frequency_day_vectors[,paste("e_",target_event,sep="")]),ncol=1),arr.ind=TRUE)[,1]

print("The failure_incidents of the testing_set are:")
print(failure_incidents)

[1] "The failure_incidents of the testing_set are:"
 [1]  30  52  86 114 153 179 221 254 298 321 362 402 435 468 506 551 585 612 643
[20] 680 719 730
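
The detection above relies on pattern matching against the digit 1. As a hedged illustration with made-up counts (`toy_counts` is hypothetical, not taken from the dataset), the sketch below contrasts that approach with a plain numeric comparison, which also catches days on which the target event occurs more than once.

```r
# Hypothetical daily counts of the target event over six days.
toy_counts <- c(0, 1, 0, 2, 10, 0)

# Pattern matching on the digit "1", as in the cell above.
days_grepl <- which(grepl(1, toy_counts))

# Numeric comparison: every day with at least one occurrence.
days_positive <- which(toy_counts > 0)

print(days_grepl)
print(days_positive)
```

If the target event can appear at most once per day, both variants return the same rows.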


## Functions for Feature Selection
**1) function: top_feature_selection**

**2) function: best_feature_selection**

**3) PCA functions:**

   -   **3a) remove_zero_columns**

   -   **3b) pca_function**

   -   **3c) pca_dimensionality_reduction**

**4) function: feature_reduction**

In [232]:
#Function for selecting the top features using ReliefF.
top_feature_selection <- function(instances_df,top_features,seed=500){
  #Feature selection using reliefF
  set.seed(seed)
  #attrEval function -> https://www.rdocumentation.org/packages/CORElearn/versions/1.53.1/topics/attrEval  
  #ReliefF -> https://medium.com/@yashdagli98/feature-selection-using-relief-algorithms-with-python-example-3c2006e18f83      
  estReliefF <- attrEval(Label ~ ., instances_df, estimator="ReliefFexpRank", ReliefIterations=50)
  
  #sort the indices of estReliefF by decreasing score
  sorted_indeces = order(estReliefF, decreasing = TRUE)
  
  #keep the top (top_features) most "useful" columns of the instances data frame, plus the last column (Label)
  instances_df = instances_df %>% select(sorted_indeces[1:top_features],ncol(instances_df))
  
  return(instances_df)
}
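
The ranking-and-selection step can be sketched without CORElearn. In the example below, `toy_df` and `toy_scores` are hypothetical stand-ins for the instances data frame and for the scores returned by `attrEval`; base R indexing is used instead of `dplyr::select`.

```r
# Hypothetical data frame: three features and a Label in the last column.
toy_df <- data.frame(f_a = c(1, 2, 3), f_b = c(0, 0, 1), f_c = c(5, 4, 6),
                     Label = factor(c(TRUE, FALSE, TRUE)))

# Hypothetical relevance scores, one per feature (stand-in for attrEval output).
toy_scores <- c(f_a = 0.10, f_b = 0.45, f_c = 0.30)

top_k <- 2
# Feature indices ordered from the most to the least relevant.
ranked <- order(toy_scores, decreasing = TRUE)

# Keep the top_k best features plus the Label column.
toy_selected <- toy_df[, c(ranked[1:top_k], ncol(toy_df))]
print(names(toy_selected))
```

This only works as intended when the Label is the last column, so that the score positions line up with the feature columns.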

In [233]:
#Function for selecting the best feature subset for the chosen algorithm (choice=1: XGBoost, choice=2: RF) using ReliefF.
best_feature_selection <- function(instances_df,test_instances_df,step,failure_incidents
                                   ,choice=0,thread=0.5,seed=500){
  
  run = TRUE
  i = length(instances_df)-1 #-1 for taking out the label
  max_F1=0 #variable for keeping the max_F1 score
  max_instances_df = data.frame() #empty data frame named max_instances_df
  
  while(run){
    #Feature selection using reliefF
    
    #attrEval function -> https://www.rdocumentation.org/packages/CORElearn/versions/1.53.1/topics/attrEval  
    #ReliefF -> https://medium.com/@yashdagli98/feature-selection-using-relief-algorithms-with-python-example-3c2006e18f83  
    estReliefF <- attrEval(Label ~ ., instances_df, estimator="ReliefFexpRank", ReliefIterations=50)
    
    #sort the indices of estReliefF by decreasing score
    sorted_indeces = order(estReliefF, decreasing = TRUE)
    #print(sorted_indeces)
    #keep the top i most "useful" columns of the instances data frame, plus the last column (Label)
    instances_df = instances_df %>% select(sorted_indeces[1:i],ncol(instances_df))
    
    #if choice==1: find F1 score using XGBoost(function: evalXGBoost)
    if(choice==1){
      F1 = evalXGBoost(instances_df,test_instances_df,failure_incidents,TRUE,FALSE,threshold=thread,seed=seed)
    }
    #if choice==2: find F1 score using RF(function: evalRF)  
    else if(choice==2){
      F1=evalRF(instances_df,test_instances_df,failure_incidents,TRUE,FALSE,seed=seed)
    }
    else{
      F1=0
    }
    
    
    #if the max F1 score is still 0 (e.g. first iteration)  
    if(max_F1 == 0){
      max_F1 = F1
      max_instances_df = instances_df
    } else if(F1>max_F1){ #if F1>max_F1 (fewer features give a strictly better F1 score)
      max_F1 = F1 #set as new max F1 score the current F1 score
      max_instances_df = instances_df #set new max_instances_df the current instances_df
    }
    i = i - step #-step for taking out the least "useful" columns
    if(i <= 0){
      run = FALSE
    }
  }
  return(max_instances_df)
}
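
The control flow of `best_feature_selection` can be illustrated in isolation. In this sketch the scoring function `toy_f1` and the values of `n_total` and `step` are invented for illustration; in the notebook the score comes from `evalXGBoost` or `evalRF`.

```r
# Hypothetical F1 score as a function of the number of kept features.
toy_f1 <- function(n_features) {
  if (n_features == 10) 0.5 else 0.7
}

n_total <- 10   # assumed number of features (Label excluded)
step <- 4       # number of features dropped per iteration
i <- n_total
best_f1 <- 0
best_n <- NA
tried <- c()

while (i > 0) {
  f1 <- toy_f1(i)
  tried <- c(tried, i)
  # Strict comparison, as in best_feature_selection:
  # on a tie the earlier (larger) feature set is kept.
  if (best_f1 == 0 || f1 > best_f1) {
    best_f1 <- f1
    best_n <- i
  }
  i <- i - step
}

print(tried)
print(best_n)
```

Because the comparison is strict, the smaller set with 2 features is not preferred over the set with 6 features even though both reach the same score in this toy setting.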

In [234]:
#PCA functions

#Function for removing the zero columns of an array or a data_frame.
remove_zero_columns <- function(data_array){
  data_array=data.frame(data_array)
  return((data_array[colSums(data_array) > 0]))
}

#Function that applies PCA to a given array and returns the eigenvectors (loadings).
pca_function <- function(flat_Xtrain_non_zero_cols){

  #for every element of the flat_Xtrain_non_zero_cols subtract the mean of its column.
  for(i in 1:length(flat_Xtrain_non_zero_cols[1,])){
    flat_Xtrain_non_zero_cols[,i]=flat_Xtrain_non_zero_cols[,i]-mean(flat_Xtrain_non_zero_cols[,i])
  }

  #Principal Components Analysis
  pca.out <- prcomp((flat_Xtrain_non_zero_cols),center = TRUE)
  
  #the matrix of variable loadings (i.e., a matrix whose columns contain the eigenvectors)  
  u=(pca.out$rotation*(+1))
    
  return(u)
}  

#Projects the given data onto the first `dimension` principal components.
#If data_made_from=NULL, the PCA was conducted on the given data;
#otherwise the PCA was conducted on data_made_from, whose column means are used for centering.
pca_dimensionality_reduction<- function(data,dimension,pca_X_train,data_made_from=NULL){
      
  if(dimension>(length(data[1,]))){
    print("ERROR: dimension must be an integer not larger than the number of columns of the data!")
    return(data)
  }
  
    
  if(is.null(data_made_from)){
    for(i in 1:length(data[1,])){
      data[,i]=data[,i]-mean(data[,i])
    }
  }else{
    data=data.frame(data)
    data=data[,names(data_made_from)]
    for(i in 1:length(data[1,])){
      data[,i]=data[,i]-mean(data_made_from[,i])
    }
  }
  
  #from the matrix whose columns contain the eigenvectors, keep the first `dimension` columns
  pca_X_train=array(pca_X_train[,1:dimension],dim =c(length(data[1,]),dimension))
  
  #multiply the data (after subtracting the mean value of every column) by pca_X_train
  red_data=data.matrix(data)%*%data.matrix(pca_X_train)
  
  return(red_data)
}
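
The projection performed by `pca_dimensionality_reduction` can be checked against `prcomp` itself. The sketch below uses random toy matrices (`toy_train`, `toy_test`, both hypothetical) and shows that centering with the training means and multiplying by the first k eigenvectors reproduces the scores that `prcomp` and `predict` report.

```r
set.seed(1)
# Hypothetical training and test matrices with three features.
toy_train <- matrix(rnorm(30), ncol = 3, dimnames = list(NULL, c("f1", "f2", "f3")))
toy_test  <- matrix(rnorm(12), ncol = 3, dimnames = list(NULL, c("f1", "f2", "f3")))

pca <- prcomp(toy_train, center = TRUE)
k <- 2  # number of principal components to keep

# Training data: subtract the column means, then multiply by the first k eigenvectors.
train_centered <- sweep(toy_train, 2, colMeans(toy_train))
train_reduced  <- train_centered %*% pca$rotation[, 1:k]

# Test data: centre with the TRAINING means before projecting.
test_centered <- sweep(toy_test, 2, colMeans(toy_train))
test_reduced  <- test_centered %*% pca$rotation[, 1:k]

# Compare with the scores computed by prcomp and predict.
print(all.equal(train_reduced, pca$x[, 1:k], check.attributes = FALSE))
print(all.equal(test_reduced, predict(pca, toy_test)[, 1:k], check.attributes = FALSE))
```

Centering the test data with the training means keeps both sets in the same coordinate system, which is what the `data_made_from` argument is for.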


In [235]:
#Function for doing the reduction-selection.
feature_reduction <- function(instances_df,test_instances_df,PCA_REDUCTION=FALSE
                              ,top_features=0,fi=NULL,ch=0,thr=0.5,Has_Label=TRUE,seed=500){
  
  #if PCA_REDUCTION do pca feature extraction-reduction.  
  if(PCA_REDUCTION){
      
    #if the label is included in instances_df
    if(Has_Label){
      #save the last column as the Label column   
      Label=instances_df[,length(instances_df[1,])]
        
      #keep all columns of instances_df except the last one (the Label) 
      instances_df=instances_df[,1:(length(instances_df[1,])-1)]  
    }
    
    #if it is already a data frame 
    if(is.data.frame(instances_df)){
      #if the label is included in instances_df  
      #if(Has_Label){
        #keep all columns of instances_df except the last one (the Label) 
        #instances_df=instances_df[,1:length(instances_df[1,])-1]
      #}  
      #keep only the columns that contain at least one non-zero value    
      instances_df = instances_df[, colSums(instances_df != 0) > 0]
    }else{
      #otherwise convert to a data frame and keep the columns with a positive sum, by using remove_zero_columns     
      instances_df=remove_zero_columns(instances_df)
    }
    
    #save the data on which the PCA is conducted 
    data_made_from=instances_df
    
    #eigenvectors  
    pca_X_train=pca_function(instances_df)
      
    #make the dimensionality reduction  
    instances_df=pca_dimensionality_reduction(instances_df,top_features,pca_X_train)
    
      
    if(Has_Label){
      instances_df=data.frame(instances_df,Label)
    }else{
      instances_df=data.frame(instances_df)
    }
    
    #same process for testing_set
    if(Has_Label){
      Label=test_instances_df[,length(test_instances_df[1,])]
      test_instances_df=test_instances_df[,1:(length(test_instances_df[1,])-1)]
    }
    
    test_instances_df=pca_dimensionality_reduction(test_instances_df,top_features,pca_X_train,data_made_from)
    

    if(Has_Label){
      test_instances_df=data.frame(test_instances_df,Label)
    }else{
      test_instances_df=data.frame(test_instances_df)
    }
    
  #if PCA_REDUCTION is FALSE and the Label is included, do ReliefF feature reduction.    
  }else if(Has_Label){
    
    #if top_features>0 call top_feature_selection
    if(top_features>0){
      #remove columns with all values equal to zero
      instances_df = instances_df[, colSums(instances_df != 0) > 0]
      
      #keep the top features
      instances_df = top_feature_selection(instances_df,top_features,seed=seed)
      
      #print("Merged_episodes after feature selection:")
      #print(head(instances_df))  
      
      #print("The names of the selected features are:")
      #print(names(instances_df))  
        
    #if top_features==0 call best_feature_selection    
    }else{
      #NOTE: step is not a parameter of feature_reduction; it is read from the global environment
      instances_df=best_feature_selection(instances_df,test_instances_df,step,failure_incidents=fi
                                          ,choice=ch,thread=thr,seed=seed)
      
    }
    
    #from test_instances_df keep only the columns that instances_df has  
    test_instances_df=(test_instances_df[,names(instances_df)])
    
  }
    
  #return the new instances_df and test_instances_df  
  MyList<- list("a"=instances_df, "b"=test_instances_df)
  return(MyList)
}
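
When the Label column is dropped with an index range, operator precedence matters: in R the `:` operator binds tighter than binary `-`. The toy example below (with a hypothetical `toy_df`) shows that `1:n-1` is evaluated as `(1:n)-1`, and that it still selects the intended columns only because R ignores a 0 in a positive index vector.

```r
n <- 4
print(1:n-1)      # parsed as (1:n) - 1, i.e. 0 1 2 3
print(1:(n-1))    # 1 2 3

# Hypothetical data frame with the Label in the last column.
toy_df <- data.frame(a = 1, b = 2, c = 3, Label = TRUE)

via_zero  <- toy_df[, 1:n-1]      # index vector 0:3, the 0 is ignored
via_paren <- toy_df[, 1:(n-1)]    # index vector 1:3

print(names(via_zero))
print(names(via_paren))
```

Both forms return the same columns here, but the parenthesised version states the intent directly.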

## Functions for XGBoost 
**1) function: evalXGBoost**

**2) function: eval_predictions**

**3) function: compute_last_time_point_of_OW**

In [236]:
#Functions used for predicting and evaluating.

#Function that uses XGBoost to predict the target event (i.e. fault) and evaluates the results.
evalXGBoost <- function(instances_df,test_instances_df,failure_incidents,fs=FALSE,plotbool=TRUE,threshold=0.5
                        ,return_preds=FALSE,seed=500){
    
  set.seed(seed) #fix the seed so that the random output stays the same  
  instances_df = instances_df[ , order(names(instances_df))] #order attributes by their names
 
  #Training with XGBoost model using instances_df(made from training set) 
  #XGBoost tutorial and parameters explanation -> https://www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/beginners-tutorial-on-xgboost-parameter-tuning-r/tutorial/  
  #XGBoost tree construction example step by step -> https://www.youtube.com/watch?v=3CC4N4z3GJc
  dtrain <- xgb.DMatrix(data = as.matrix(instances_df[ , !(names(instances_df) %in% c("Label"))]),label = as.integer(as.logical(instances_df$Label)))
  my.rf <- xgb.train(data = dtrain, nthread = 2, eta=0.6, nrounds = 10, objective = "binary:logistic",verbose = 2)
  
  #if plotbool, print the tree information  
  if(plotbool==TRUE)
  {
   print("------------------------------------------------------------------------------------------------------")
      tryCatch(
        expr = {
            print(xgb.plot.tree(model = my.rf))
        },
        error = function(e){ 
            print(xgb.dump(my.rf, with_stats = T))
        }
      )  
    
  } 
      
  test_instances_df = test_instances_df[ , (names(test_instances_df) %in% names(instances_df))] #keep the same attributes as instances_df
  test_instances_df = test_instances_df[ , order(names(test_instances_df))] #order the attributes by their name

  #Predict the score of each instance of test_instances_df (made from the testing set), using the trees and the logistic function
  dtest <- xgb.DMatrix(data = as.matrix(test_instances_df[ , !(names(test_instances_df) %in% c("Label"))]), label=as.integer(as.logical(test_instances_df$Label)))
  Prediction <- predict(my.rf, dtest)
  
  if(plotbool){  
    print("------------------------------------------------------------------------------------------------------")
    print("Prediction  of test_instances_df is:")
    print(Prediction)
  } 
    
  #Use the threshold: predictions above it are assigned to the "TRUE" class, all others to the "FALSE" class  
  Prediction <- as.logical(as.numeric(Prediction > threshold))  
  
  if(plotbool){  
    print("Class prediction of test_instances_df is:")
    print(Prediction)  
    print("------------------------------------------------------------------------------------------------------")  
  }   
      
  if(return_preds){
    return(Prediction)
  }
  #return F1 score of the prediction(function: eval_predictions)  
  return(eval_predictions(Prediction,failure_incidents,fs))
}


#Function that evaluates the predictions of XGBoost and returns the F1 score.
eval_predictions <- function(Prediction,failure_incidents,fs=FALSE,test_days=2*365,ret_new_F1_score=FALSE){
  predictions = list()
   
  #for every prediction  
  for(p in 1:length(Prediction)){
      
    #compute the last row of the OW (function: compute_last_time_point_of_OW)  
    d = compute_last_time_point_of_OW(p) #e.g. for X*M=3*2=6: d = 6, 8, 10, ...

    #if the prediction is "TRUE", add the last day of the OW to predictions  
    if(Prediction[p] == "TRUE"){
      predictions = c(predictions,d)
    }
     
  }
  
  #print("The predicted days to happen the failure are:")
  #print(str(predictions))  
  #print("------------------------------------------------------------------------------------------------------")   
    
  true_positives = 0
  false_positives = 0
  false_negatives = 0
   
  before_wd=-1000
  a <- array(0,dim=c(test_days,1)) #helper array for finding the real_false_positives
  real_false_positives=0  #the number of days covered by an OW whose prediction was TRUE (but should not have been)
    
  #for every row of test_frequency_day_vectors where the target event happened (THE NUMBER OF PREDICTIONS DEPENDS ON THE NUMBER OF FAILURES)
  for(i in 1:length(failure_incidents)){
      
    d = failure_incidents[i] #d = the current failure incident
      
    warnings = list() #empty list keeping the warnings
      
    if(i == 1){
      #set warnings as the predictions before the current failure incident
      warnings = predictions[predictions < d] 
    } else {
      #set warnings as the predictions up to the current failure incident but after the previous failure incident  
      warnings = predictions[predictions > failure_incidents[i-1] & predictions <= d] 
    }
    
    #if there is no warning
    if(length(warnings) == 0){
      false_negatives = false_negatives + 1 #increase false negatives by 1
    #if there is warning(s)    
    } else {
      #warnings that fall inside the valid window [d-max_warning_interval, d-min_warning_interval] before the failure (target event) 
      warning_tp=warnings[warnings >= (d-max_warning_interval) & warnings <= (d-min_warning_interval)]

      #find false_positives and real_false_positives
      #----------------------------------------------------------------------------------------- 
      if(length(warnings[warnings < d-max_warning_interval]) > 0){

        warnings_fp=warnings[warnings < d-max_warning_interval] 
        
          
        if(length(warnings_fp)>=1){  
            #k is used as the inner index so that the outer loop variable i is not overwritten
            for(k in 1:length(warnings_fp)){
                last_of_OW=as.numeric(warnings_fp[[k]])
                first_of_OW=as.numeric(warnings_fp[[k]])-(X*M)+1
                a[first_of_OW:last_of_OW]=a[first_of_OW:last_of_OW]+1
            }
          
            if(length(warning_tp)>=1){  
              for(k in 1:length(warning_tp)){
                  #last_of_OW=as.numeric(warning_tp[[k]])
                  first_of_OW=as.numeric(warning_tp[[k]])-(X*M)+1
                  #a[first_of_OW:last_of_OW]=-1000
                  a[first_of_OW:d]=-1000
              }
            }  
        }    
               
        false_positives = false_positives + length(warnings[warnings < d-max_warning_interval]) 
      }
      #-----------------------------------------------------------------------------------------  
        
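      #if there are warnings at or after the max interval and warnings at or before the min interval from the failure (target event)
      #NOTE: the two conditions can be satisfied by two different warnings; length(warning_tp) > 0 is the stricter test    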
      if(length(warnings[warnings >= (d-max_warning_interval)]) > 0 & length(warnings[warnings <= (d-min_warning_interval)]) > 0){
        true_positives = true_positives + 1 #increase true positives by 1
      #if there is no correct warning    
      } else {
        false_negatives = false_negatives + 1 #increase false negatives by 1
      }
    }
  }
  

  
  precision = true_positives/(true_positives+false_positives) #calculate the precision of the model
  
  if((true_positives+false_positives) == 0){
    precision = 0
  }
  
  recall = true_positives/length(failure_incidents) #calculate recall of the model
  
  F1 = 2*((precision*recall)/(precision+recall)) #calculate F1 score of the model
  if((precision+recall) == 0){
    F1 = 0
  }
  

  if(ret_new_F1_score){
      false_positives=(length(a[a>=1])) 

      precision = true_positives/(true_positives+false_positives) #calculate the precision of the model

      if((true_positives+false_positives) == 0){
        precision = 0
      }

      recall = true_positives/length(failure_incidents) #calculate recall of the model

      new_F1_score = 2*((precision*recall)/(precision+recall)) #calculate F1 score of the model
      if((precision+recall) == 0){
        new_F1_score = 0
      }

      #prints  
      if(!fs){
        if(TRUE){
          cat(paste("dataset:",argv$test,"\ntrue_positives:", true_positives,"\nfalse_positives:"
                    , false_positives,"\nfalse_negatives:", false_negatives,"\nprecision:"
                    , precision,"\nrecall:", recall,"\nF1:", new_F1_score,"\n"))
        } else{
          #cat(paste(argv$test,",", true_positives,",", false_positives,",", false_negatives,",", precision,",", recall,",", F2,",",argv$fet,",",argv$tet,",",argv$X,",",argv$M,",",argv$Y,",",argv$Z,",",argv$N,",",argv$step,",",argv$sup,",",argv$conf,",",argv$fs,",",argv$pf,",",argv$sf,",",argv$plogic,",",argv$minwint,",",argv$maxwint, "\n",sep=""))
        }
      }
      return(new_F1_score)
  }
    
    
  #prints  
  if(!fs){
    if(TRUE){
      cat(paste("dataset:",argv$test,"\ntrue_positives:", true_positives,"\nfalse_positives:"
                , false_positives,"\nfalse_negatives:", false_negatives,"\nprecision:"
                , precision,"\nrecall:", recall,"\nF1:", F1,"\n"))
    } else{
      #cat(paste(argv$test,",", true_positives,",", false_positives,",", false_negatives,",", precision,",", recall,",", F1,",",argv$fet,",",argv$tet,",",argv$X,",",argv$M,",",argv$Y,",",argv$Z,",",argv$N,",",argv$step,",",argv$sup,",",argv$conf,",",argv$fs,",",argv$pf,",",argv$sf,",",argv$plogic,",",argv$minwint,",",argv$maxwint, "\n",sep=""))
    }
  }  
  return(F1)
}


#Function that returns the last row (day) of the OW with the given index.  
compute_last_time_point_of_OW <- function(index){
  OW_length = X*M
  return(OW_length+((index-1)*N))
}
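
The pieces of the evaluation can be tried out on small numbers. The sketch below is an illustration under stated assumptions, not the notebook's evaluation routine: the values of `X`, `M`, `N`, `d` and the warning intervals are made up (N = 2 is chosen so that the sequence matches the 6, 8, 10 example in the comments), and `last_day_of_OW` and `f1_score` are hypothetical helper names.

```r
# Assumed window parameters for illustration.
X <- 3; M <- 2; N <- 2
last_day_of_OW <- function(index) X * M + (index - 1) * N
print(sapply(1:3, last_day_of_OW))

# Hypothetical failure day and warning intervals.
d <- 30
max_warning_interval <- 10
min_warning_interval <- 2

# A warning counts as correct only inside [d - max, d - min] = [20, 28].
warnings_toy <- c(6, 29)
in_window <- warnings_toy[warnings_toy >= d - max_warning_interval &
                          warnings_toy <= d - min_warning_interval]

# The two-part condition can hold even though no single warning
# lies inside the window: 29 satisfies the first part, 6 the second.
two_part <- any(warnings_toy >= d - max_warning_interval) &&
            any(warnings_toy <= d - min_warning_interval)
print(length(in_window))
print(two_part)

# F1 score with explicit guards against division by zero.
f1_score <- function(tp, fp, n_failures) {
  precision <- if (tp + fp == 0) 0 else tp / (tp + fp)
  recall <- tp / n_failures
  if (precision + recall == 0) 0 else 2 * precision * recall / (precision + recall)
}
print(f1_score(2, 2, 4))
```

Guarding the divisions before computing the ratios avoids intermediate NaN values, which the original code overwrites after the fact.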

## Run XGBoost 

In [237]:
#Run XGBoost algorithm and print the results.
instances_df_xgb=instances_df
test_instances_df_xgb=test_instances_df
#print(length(instances_df_xgb[1,]))
#print(length(instances_df_xgb[,1]))
threas_XGB=0.1

step=50

cat("If you use PCA top_features must be less than:",length(instances_df_xgb[,1]),"\n")

top_features=0            
FEATURE_SELECTION=TRUE
PCA_REDUCTION=FALSE

if(FEATURE_SELECTION){
  red_list=feature_reduction(instances_df_xgb,test_instances_df_xgb,PCA_REDUCTION,top_features,fi=failure_incidents
                             ,ch=1,thr=threas_XGB,seed=seed)
  
  instances_df_xgb=red_list$a
  test_instances_df_xgb=red_list$b
}

#print(names(test_instances_df_xgb))
#print(names(instances_df_xgb))
cat("-------------------------------------------------------------\n")
test_instances_df_xgb=test_instances_df_xgb[,names(instances_df_xgb)]

resultXGBOOST = evalXGBoost(instances_df_xgb,test_instances_df_xgb,failure_incidents,fs=FALSE
                            ,plotbool=FALSE,threshold=threas_XGB,seed=seed)

If you use PCA top_features must be less than: 133 

"Variable e_151_stdV has all values equal."


ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_1_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_2_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_3_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_4_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_5_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_6_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_7_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_8_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remov

"Variable e_137_stdV has all values equal."


ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_1_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_2_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_3_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_4_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_5_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_6_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_7_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_8_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remov











"Variable e_102_stdV has all values equal."


ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_1_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_2_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_3_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_4_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_5_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_6_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_7_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_8_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remov









"Variable e_54_stdV has all values equal."


ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_1_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_2_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_3_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_4_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_5_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_6_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_7_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_8_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remov







"Variable e_6_stdV has all values equal."


ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_1_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_2_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_3_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_4_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_5_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_6_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_7_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_8_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remov





"Variable e_44_meanV has all values equal."


ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_1_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_2_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_3_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_4_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_5_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_6_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_7_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_8_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remov





"Variable e_120_minD has all values equal."


ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_1_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_2_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_3_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_4_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_5_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_6_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_7_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_8_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remov



"Variable e_69_minD has all values equal."


ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_1_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_2_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_3_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_4_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_5_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_6_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_7_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_8_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remov

"Variable e_18_minD has all values equal."


ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_1_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_2_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_3_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_4_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_5_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_6_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_7_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remove from learning: e_8_minD

ERROR in CORElearn: All values of the attribute in a data split are equal, please remov

## Functions for Random Forest 
**1) function: evalRF**

**2) function: eval_predictions (the 2nd of XGBoost's functions)**

In [238]:
#Function that uses Random Forest to predict the target event (i.e. fault) and evaluates the results.

evalRF <- function(instances_df,test_instances_df,failure_incidents,fs=FALSE,plot=TRUE,return_preds=FALSE,seed=500){
  
  set.seed(seed) #fix the seed so that the random output stays the same across runs
  #Training with randomForest model using instances_df(made from training set)
  #Random Forest R documentation and parameters explanation -> https://www.rdocumentation.org/packages/randomForest/versions/4.6-14/topics/randomForest  
  #Random Forest idea -> https://www.youtube.com/watch?v=loNcrMjYh64
  my.rf <-randomForest(Label ~ .,data=instances_df,importance=TRUE,ntree=500) #(default ntree=500) 
   
 
  if((plot==TRUE)){ 
        result = tryCatch({
            varImpPlot(my.rf) 
        }, error = function(e) {
            print("Error at varImpPlot(my.rf)")
        }) 
         
  }
    
  
  #Predict the labels of test_instances_df (built from the testing set)
  Prediction <- predict(my.rf, test_instances_df[ , !(names(test_instances_df) %in% c("Label"))])
  #print(Prediction)
  #if plot is TRUE, print the structure of the first two trees
  if(plot)
  {
      print("------------------------------------------------------------------------------------------------------")
      print(getTree(my.rf, 1,labelVar=TRUE))
      print(getTree(my.rf, 2,labelVar=TRUE))  
      print("------------------------------------------------------------------------------------------------------")
      #print("Prediction in eval is:\n")
      #print(Prediction)
      #print("------------------------------------------------------------------------------------------------------")
  }
  
  #return the raw predictions if requested, otherwise the F1 score computed by eval_predictions
  if(return_preds){
    return(Prediction)
  }
  return(eval_predictions(Prediction,failure_incidents,fs))
}


## Run Random Forest

In [239]:
#Run Random Forest algorithm and print the results.

instances_df_rf=instances_df
test_instances_df_rf=test_instances_df
#print(length(instances_df_rf[1,]))
#print(length(instances_df_rf[,1]))

cat("If you use PCA top_features must be less than:",length(instances_df_rf[,1]),"\n")

top_features=5
FEATURE_SELECTION=TRUE
PCA_REDUCTION=FALSE

step=40

if(FEATURE_SELECTION){
  red_list=feature_reduction(instances_df_rf,test_instances_df_rf,PCA_REDUCTION,top_features
                             ,fi=failure_incidents,ch=2,seed=seed)
  
  instances_df_rf=red_list[1]$a
  test_instances_df_rf=red_list[2]$b
}

test_instances_df_rf=test_instances_df_rf[,names(instances_df_rf)]
cat("-------------------------------------------------------------\n")
resultRF = evalRF(instances_df_rf,test_instances_df_rf,failure_incidents,fs=FALSE,plot=FALSE)

If you use PCA top_features must be less than: 133 
-------------------------------------------------------------
dataset: C:/Users/petsi/Documents/ptyxiakh/testing_my_dataset_150.csv 
true_positives: 13 
false_positives: 7 
false_negatives: 9 
precision: 0.65 
recall: 0.590909090909091 
F1: 0.619047619047619 
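
The three scores printed above follow directly from the confusion counts. The sketch below reproduces that arithmetic for the Random Forest run; `prf_from_counts` is a hypothetical helper written for illustration, not the notebook's `eval_predictions`, whose internals are not shown in this section.

```r
# Hypothetical helper: precision, recall and F1 from raw confusion counts.
prf_from_counts <- function(tp, fp, fn) {
  precision <- tp / (tp + fp)   # share of predicted failures that were real
  recall    <- tp / (tp + fn)   # share of real failures that were predicted
  f1        <- 2 * precision * recall / (precision + recall)
  c(precision = precision, recall = recall, F1 = f1)
}

# Counts taken from the Random Forest output above.
rf_scores <- prf_from_counts(tp = 13, fp = 7, fn = 9)
print(rf_scores)
```

The same helper can be applied to the counts of any other classifier in this notebook to compare them on equal terms.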


## KNN

In [240]:
#Run the KNN algorithm and print the results.
#install.packages("Rfast")
library(Rfast)
instances_df_KNN=instances_df
test_instances_df_KNN=test_instances_df

cat("If you use PCA top_features must be less than:",length(instances_df_KNN[,1]),"\n")

top_features=10         
FEATURE_SELECTION=TRUE
PCA_REDUCTION=TRUE

if(FEATURE_SELECTION){
  red_list=feature_reduction(instances_df_KNN,test_instances_df_KNN,PCA_REDUCTION,top_features)
  
  instances_df_KNN=red_list[1]$a
  test_instances_df_KNN=red_list[2]$b
}


instances_df_KNN = instances_df_KNN[ , !(names(instances_df_KNN) %in% c("Timestamps"))] #delete column Timestamps
instances_df_KNN_train = instances_df_KNN[ , !(names(instances_df_KNN) %in% c("Label"))]
test_instances_df_KNN=test_instances_df_KNN[,names(instances_df_KNN_train)]

K=1
knn.fast<- knn(xnew=as.matrix(test_instances_df_KNN) ,x=as.matrix(instances_df_KNN_train) 
               , y=(instances_df_KNN$Label), k=K,dist.type = "euclidean", type = "C")

prediction_knn_fast=((knn.fast-1)==1)

#print(prediction_knn_fast)
cat("-------------------------------------------------------------\n")
resultKNN=eval_predictions(prediction_knn_fast,failure_incidents,fs=FALSE)

If you use PCA top_features must be less than: 133 
-------------------------------------------------------------
dataset: C:/Users/petsi/Documents/ptyxiakh/testing_my_dataset_150.csv 
true_positives: 3 
false_positives: 7 
false_negatives: 19 
precision: 0.3 
recall: 0.136363636363636 
F1: 0.1875 
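
With `K=1`, each test instance simply takes the label of its closest training instance. A minimal base-R sketch of that idea on toy data follows; `nn1_predict` and the toy matrices are assumptions made for illustration, and the sketch is not a replacement for `Rfast::knn`.

```r
# Hypothetical 1-nearest-neighbour classifier using Euclidean distance.
nn1_predict <- function(x_train, y_train, x_new) {
  apply(x_new, 1, function(row) {
    # squared Euclidean distance from this test row to every training row
    d <- rowSums(sweep(x_train, 2, row)^2)
    y_train[which.min(d)]
  })
}

x_train <- rbind(c(0, 0), c(0, 1), c(5, 5), c(6, 5))
y_train <- c(FALSE, FALSE, TRUE, TRUE)
x_new   <- rbind(c(0.2, 0.4), c(5.4, 4.8))

nn1_pred <- nn1_predict(x_train, y_train, x_new)
print(nn1_pred)
```

Because the distance is computed on raw feature values, features with a large numeric range dominate the neighbour search unless the data are rescaled first.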


# Neural Networks

   ### 1)LSTM

   ### 2)MLP

   ### 3)CNN

### Functions to prepare data for fitting the selected type of NN
**1) function: make_3D_data**

**2) function: find_input_shape**


In [241]:
#Function that returns OWs(3D) as training and testing sets.  
make_3D_data<- function(frequency_day_vectors,test_frequency_day_vectors,xdimtr,xdimte,b_length,X,M,Y,Z){
  
  Xtrain <- array(0,dim=c(xdimtr,X*M,b_length-1))
  Xtest <- array(0,dim=c(xdimte,X*M,b_length-1))
  
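  #NOTE: the step N is not a parameter of this function; it is read from the global environment
  for(i in  seq(1,nrow(frequency_day_vectors), by=N)){   
    #stop when the end of the PW would run past the available days or the instance index exceeds xdimtr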
    #if((i-1+X*M+Z+Y-1) > nrow(test_frequency_day_vectors)){
    if((i-1+X*M+Z+1+Y-1) > nrow(frequency_day_vectors)||i/N+1>xdimtr){ 
      break
    }  
    #take a window of X*M
    #subset by row, get the X*M days
    OW = frequency_day_vectors[i:((X*M)+(i-1)),2:b_length] 
    Xtrain[i/N+1,,]=as.matrix(OW, dim=c(X*M,b_length-1,1))
  }     
  
  for(i in  seq(1,nrow(test_frequency_day_vectors), by=N)){   
    #stop when the end of the PW would run past the available days or the instance index exceeds xdimte
    #if((i-1+X*M+Z+Y-1) > nrow(test_frequency_day_vectors)){
    if((i-1+X*M+Z+1+Y-1) > nrow(test_frequency_day_vectors)||i/N+1>xdimte){ 
      break
    }  
    #take a window of X*M
    #subset by row, get the X*M days
    OW = test_frequency_day_vectors[i:((X*M)+(i-1)),2:b_length] 
    Xtest[i/N+1,,]=as.matrix(OW, dim=c(X*M,b_length-1,1))
  }
  
  MyList<- list("Xtrain"=Xtrain, "Xtest"=Xtest)
  return(MyList)
}


#Function that returns the input shape of training_set to fit the chosen NN correctly.  
find_input_shape<-function(USE_3D_DATA=FALSE,LSTM_FOR_2D=FALSE,FEATURE_SELECTION=FALSE
                           ,top_features=0,features=0,xdimtr=0,xdimte=0,X=0,M=0,b_length=0
                           ,num_of_times=1){
  if(FEATURE_SELECTION){
    if(USE_3D_DATA){
      inputs_shape=c(X*M,top_features)     
    }else if(LSTM_FOR_2D){
      #train_set=array(train_set,c(xdimtr,num_of_times,top_features/num_of_times))
      #test_set=array(test_set,c(xdimte,num_of_times,top_features/num_of_times))
      inputs_shape=c(num_of_times,top_features/num_of_times)  
    }else{
      inputs_shape=c(top_features)
    }
  }else{
    if(USE_3D_DATA){
      inputs_shape=c(X*M,b_length-1)
    }else{
      if(LSTM_FOR_2D){
        #train_set=array(train_set,c(xdimtr,num_of_times,features/num_of_times))
        #test_set=array(test_set,c(xdimte,num_of_times,features/num_of_times))
        inputs_shape=c(num_of_times,features/num_of_times)  
      }else{
        inputs_shape=c(features)
      }
    }
  }
  
  return(inputs_shape)
}
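
The `LSTM_FOR_2D` branch assumes that each flat feature vector is reshaped into `num_of_times` time steps of `features/num_of_times` values, as in the commented-out `array(...)` calls. The sketch below, on a toy matrix, shows the resulting shape and how R's column-major filling assigns features to time steps; the toy sizes are assumptions.

```r
# Toy data: 2 instances, 6 features each.
flat <- matrix(1:12, nrow = 2)
num_of_times <- 2
features <- ncol(flat)

# Same reshape as the commented-out lines in find_input_shape.
cube <- array(flat, dim = c(nrow(flat), num_of_times, features / num_of_times))
print(dim(cube))        # instances x time steps x features per step

# array() fills column-major, so time step 1 receives features 1, 3, 5
# and time step 2 receives features 2, 4, 6 of each instance.
print(cube[1, 1, ])
print(flat[1, c(1, 3, 5)])
```

If consecutive features are meant to form one time step, the columns would need to be reordered before reshaping.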

### Functions for the implementation of the selected NN
**1) function: make_model**

**2) function: fit_model**

**3) function: eval_model**

In [242]:
#Function for creating the NN.
make_model<-function(inputs_shape,USE_LSTM=FALSE,USE_CNN=FALSE,USE_MLP=FALSE,lr=0.001){
  
  hidden_layer1=32
  hidden_layer2=64
  hidden_layer3=32
  
  dense_layer1=128
  
  if(USE_LSTM){
    modeling = keras_model_sequential() %>%   
      
      layer_lstm(units=hidden_layer1, input_shape=inputs_shape,return_sequences = TRUE) %>%
      layer_dropout(0.25)%>%
      layer_lstm(units=hidden_layer2,return_sequences = TRUE) %>%
      layer_dropout(0.25)%>%
      layer_lstm(units=hidden_layer3,return_sequences = FALSE) %>%
      #layer_dropout(0.25)%>%
      #layer_dense(units=dense_layer1, activation = "relu") %>%
      #layer_dropout(0.25)%>%
      layer_dense(units=1,activation='sigmoid')
    
  }
  else if(USE_CNN){
    modeling = keras_model_sequential() %>%   
      
      layer_conv_1d(filters=X*M,kernel_size=3,input_shape=inputs_shape,activation ="relu") %>%
      layer_conv_1d(filters=X*M,kernel_size=3,activation ="relu") %>%
      #layer_max_pooling_1d(pool_size=2) %>%
      #layer_conv_1d(filters=2*X*M,kernel_size=3,activation ="relu") %>%
      #layer_conv_1d(filters=2*X*M,kernel_size=3,activation ="relu") %>%
      layer_global_average_pooling_1d()%>%
      layer_dropout(0.2)%>%
      layer_dense(units=1,activation='sigmoid')
    
    
    
  }else if(USE_MLP){
    modeling = keras_model_sequential() %>%   
      
      layer_dense(units=dense_layer1, input_shape=inputs_shape, activation = "relu") %>%
      layer_dropout(0.25)%>%
      layer_dense(units=dense_layer1*2, activation = "relu") %>%
      layer_dropout(0.25)%>%
      layer_dense(units=dense_layer1, activation = "relu") %>%
      layer_dropout(0.25)%>%
      layer_dense(units=1,activation='sigmoid')
    
  }else{
      cat("Choose a NN type!!!\n")
      return(NULL)
  }
  
  
  
  
  #Adam optimizer with the given learning rate
  #(newer keras releases name this argument learning_rate instead of lr)
  adam = optimizer_adam(lr=lr)
  modeling %>% compile(loss = 'binary_crossentropy',
                       optimizer = adam,
                       metrics = list("mean_absolute_error")
  )
  
  modeling %>% summary()
  return(modeling)
  
}

#Function for fitting the NN with the training_set.
fit_model<-function(modeling,train_set,train_labels,epoch_size,batchs_size,classes_weights=NULL,shuffling = FALSE){
  if(is.null(classes_weights)){
    modeling %>% fit(train_set,train_labels, epochs=epoch_size,batch_size=batchs_size,shuffle = shuffling)#,class_weight=classes_weights
  }
  else{
    modeling %>% fit(train_set,train_labels, epochs=epoch_size,batch_size=batchs_size,shuffle = shuffling,class_weight=classes_weights)
  }
  return(modeling)
}

#Function for evaluating the NN with the testing_set.
eval_model<-function(modeling,test_set,threshold,return_preds=FALSE){
  
  Prediction = modeling %>% predict(test_set)
  #print(Prediction)
  #predictions above the threshold are classified as TRUE (likely to fail)
  Prediction2 <- as.logical(as.numeric(Prediction>threshold)) 
    
  if(return_preds){
    return(Prediction2)
  }  
  
  return(eval_predictions(Prediction2,failure_incidents,fs=FALSE))
  
}

### Setting the necessary variables

In [243]:
USE_3D_DATA=FALSE
LSTM_FOR_2D=FALSE
USE_LSTM=FALSE
USE_CNN=FALSE
USE_MLP=TRUE

### Prepare the data

In [244]:
#If you want, you can apply feature selection-reduction before splitting the data, as in the previous algorithms,
#and split the data afterwards.

In [245]:
#split data to Xtrain-Ytrain and Xtest-Ytest

xdimtr=length(instances_df[,1])         
xdimte=length(test_instances_df[,1])
features=length(instances_df[1,])-1

Xtrain=data.matrix(instances_df[,1:features], rownames.force = NA)
Xtest=data.matrix((test_instances_df[,names(instances_df)]) [,1:features], rownames.force = NA)

Ytrain=array(as.integer(as.logical(instances_df$Label))[1:xdimtr],dim=c(xdimtr,1))
Ytest=array(as.integer(as.logical(test_instances_df$Label))[1:xdimte],dim=c(xdimte,1))



if(USE_3D_DATA){
  my_3D_data=make_3D_data(frequency_day_vectors,test_frequency_day_vectors,xdimtr,xdimte,b_length,X,M,Y,Z)
  Xtrain=my_3D_data$Xtrain
  Xtest=my_3D_data$Xtest
}
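
When `USE_3D_DATA` is enabled, each training instance is a block of `X*M` consecutive daily vectors. The following sketch builds such blocks from a toy daily matrix; `window_len`, `step`, and `slice_windows` are hypothetical names standing in for `X*M`, `N`, and the windowing loop, and the sketch omits the PW-length check that `make_3D_data` performs.

```r
# Hypothetical helper: cut a days x event-types matrix into overlapping windows.
slice_windows <- function(days, window_len, step) {
  starts <- seq(1, nrow(days) - window_len + 1, by = step)
  out <- array(0, dim = c(length(starts), window_len, ncol(days)))
  for (k in seq_along(starts)) {
    out[k, , ] <- days[starts[k]:(starts[k] + window_len - 1), ]
  }
  out
}

days <- matrix(seq_len(30), nrow = 10, ncol = 3)   # 10 days, 3 event types
windows <- slice_windows(days, window_len = 4, step = 2)
print(dim(windows))   # windows x days per window x event types
```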

### Feature selection-reduction

In [246]:
#Feature reduction on the split data (PCA only)

top_features=20       
FEATURE_SELECTION=TRUE
PCA_REDUCTION=TRUE


if(FEATURE_SELECTION){
  
  if(USE_3D_DATA){
    flat_Xtrain=data.frame(array(Xtrain,dim =c(xdimtr*X*M,(b_length-1))))
    flat_Xtest=data.frame(array(Xtest,dim =c(xdimte*X*M,(b_length-1))))
  }else{
    flat_Xtrain=data.frame(array(Xtrain,dim =c(xdimtr,features)))
    flat_Xtest=data.frame(array(Xtest,dim =c(xdimte,features)))
  }
  
  cat("If you use PCA top_features must be less than:",length(flat_Xtrain[,1]),"\n")  

  red_list=feature_reduction(flat_Xtrain,flat_Xtest,PCA_REDUCTION,top_features,Has_Label=FALSE)
  
  train_set=red_list[1]$a
  test_set=red_list[2]$b
  
  train_set=data.matrix(train_set, rownames.force = NA)
  test_set=data.matrix(test_set, rownames.force = NA)
  

  if(USE_3D_DATA){
    train_set=array(train_set,dim=c(xdimtr,X*M,top_features))
    test_set=array(test_set,dim=c(xdimte,X*M,top_features))
  }else{
    train_set=array(train_set,dim=c(xdimtr,top_features))
    test_set=array(test_set,dim=c(xdimte,top_features))
  }  
  
}else{
  train_set=Xtrain
  test_set=Xtest
}

If you use PCA top_features must be less than: 133 
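
The body of `feature_reduction` is not part of this section, so the following is only a sketch of how a PCA reduction is commonly done with base R's `prcomp`: the rotation is fitted on the training matrix alone and then applied to the test matrix. The random toy data and the names `k`, `train_red`, and `test_red` are assumptions.

```r
set.seed(1)
train <- matrix(rnorm(30 * 8), nrow = 30, dimnames = list(NULL, paste0("f", 1:8)))
test  <- matrix(rnorm(10 * 8), nrow = 10, dimnames = list(NULL, paste0("f", 1:8)))
k <- 3   # number of principal components to keep

# Fit PCA on the training data only, so no test information leaks in.
pca <- prcomp(train, center = TRUE, scale. = FALSE)

train_red <- pca$x[, 1:k]                          # training scores
test_red  <- predict(pca, newdata = test)[, 1:k]   # test data projected with the same rotation

print(dim(train_red))
print(dim(test_red))
```

`prcomp` returns at most as many components as the smaller of the number of rows and columns, which is presumably the reason for the upper bound on `top_features` printed by the cell.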


### Make-Fit-Evaluate

In [247]:
inputs_shape=find_input_shape(USE_3D_DATA,LSTM_FOR_2D,FEATURE_SELECTION,top_features
                              ,features,xdimtr,xdimte,X,M,b_length)
train_labels=array(Ytrain,dim=c(xdimtr,1))

#Lab1=length(Ytrain[Ytrain==1])
#Lab0=length(Ytrain[Ytrain==0])
#weight_Lab0=Lab1/(Lab1+Lab0)
#weight_Lab1=Lab0/(Lab1+Lab0)



modeling=make_model(inputs_shape,USE_LSTM,USE_CNN,USE_MLP)
modeling=fit_model(modeling,train_set,train_labels,25,4)#,shuffling = TRUE,classes_weights=list("0"=weight_Lab0,"1"=weight_Lab1))

Model: "sequential_18"
________________________________________________________________________________
Layer (type)                        Output Shape                    Param #     
dense_72 (Dense)                    (None, 128)                     2688        
________________________________________________________________________________
dropout_54 (Dropout)                (None, 128)                     0           
________________________________________________________________________________
dense_73 (Dense)                    (None, 256)                     33024       
________________________________________________________________________________
dropout_55 (Dropout)                (None, 256)                     0           
________________________________________________________________________________
dense_74 (Dense)                    (None, 128)                     32896       
________________________________________________________________________________
dropo
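
The parameter counts in the summary can be checked by hand: a dense layer with `n_in` inputs and `n_out` units has `(n_in + 1) * n_out` weights, the `+ 1` being the bias. The sketch assumes 20 input features (the `top_features` value set earlier) and the layer widths from `make_model`; `dense_params` is a hypothetical helper.

```r
# Hypothetical helper: number of trainable parameters of a dense layer.
dense_params <- function(n_in, n_out) (n_in + 1) * n_out

widths <- c(20, 128, 256, 128, 1)   # input, three hidden dense layers, output
params <- dense_params(head(widths, -1), tail(widths, -1))
print(params)
```

The first three values match the summary. The summary output is truncated before the output layer, so the last value is derived from the formula rather than read from the output.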

In [248]:
threas_NN=0.1
eval_model(modeling,test_set,threas_NN)

dataset: C:/Users/petsi/Documents/ptyxiakh/testing_my_dataset_150.csv 
true_positives: 20 
false_positives: 28 
false_negatives: 2 
precision: 0.416666666666667 
recall: 0.909090909090909 
F1: 0.571428571428572 
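
The low cut-off of 0.1 favours recall over precision. A toy illustration of that trade-off follows; the probabilities, labels, and the helper `counts_at` are invented for illustration and are not the notebook's data.

```r
probs <- c(0.05, 0.20, 0.15, 0.70, 0.90, 0.40, 0.08, 0.60)
truth <- c(FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)

# Hypothetical helper: confusion counts for a given decision threshold.
counts_at <- function(threshold) {
  pred <- probs > threshold
  c(tp = sum(pred & truth), fp = sum(pred & !truth), fn = sum(!pred & truth))
}

low  <- counts_at(0.1)   # permissive threshold
high <- counts_at(0.5)   # stricter threshold
print(rbind(low, high))
```

In this toy case the permissive threshold misses no failure but raises more false alarms, while the stricter one trades a missed failure for fewer false alarms.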


## Over-Under sampling

**function: sampling**

In [249]:
#Function that uses over- or under-sampling to balance the labels of the training_set. In each of the repeats rounds it
#randomly draws (times * size of the balanced training set) instances with replacement, trains the selected classifier on
#them and predicts the test_set. It returns the average of the predictions over all rounds
#(with repeats=1 it performs sampling only, without averaging).

sampling<-function(train_set,test_set,inputs_shape=c(0),train_labels=NULL,times=0,repeats=0
                      ,USE_LSTM=FALSE,USE_CNN=FALSE,USE_MLP=FALSE
                      ,USE_OVERSAMPLING=FALSE,USE_UNDERSAMPLING=FALSE
                      ,LSTM_FOR_2D=FALSE,USE_3D_DATA=FALSE,epochs=25,batchsize=8){
  
  pp1=which(train_labels %in% c(1)) #positions of data that have Label=1
  pp0=which(train_labels %in% c(0)) #positions of data that have Label=0
  #print(length(pp1))
  #print(length(pp0))
  
  #If pp1 is larger than pp0, swap them, because pp0 must always hold the class with the most instances
  if(length(pp1)>length(pp0)){
    temp=pp1
    pp1=pp0
    pp0=temp
  }
  
  #save how many test data you have                        
  if(LSTM_FOR_2D||USE_3D_DATA){
    xdimtest=length(test_set[,1,1])
  }
  else{
    xdimtest=length(test_set[,1])
  }
  
  sum_of_preds=array(0,dim=c(xdimtest,1)) #an array to save the results
  
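  #repeat the sample-train-predict round repeats times  
  for(i in seq_len(repeats)){
    
    #Use over_sampling or under_sampling so that both classes have the same number of instances
    if(USE_OVERSAMPLING){
      pp01=pp0
      #resample the minority class into a new variable so that pp1 stays intact for the next round
      pp11=sample(pp1, length(pp0), replace=TRUE)
      pall=sample(c(pp01,pp11), times*length(c(pp01,pp11)), replace=TRUE)
    }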
    else if(USE_UNDERSAMPLING){
      pp01=sample(pp0, length(pp1), replace=FALSE)
      pall=sample(c(pp01,pp1), times*length(c(pp01,pp1)), replace=TRUE)
    }
    else{
      return(NULL)
    }
    
    #make the model you chose with the new sampled_data
    if(LSTM_FOR_2D||USE_3D_DATA){
      
      train_set_i=train_set[pall,,]
      
      train_set_i=array(train_set_i,dim=c(length(pall),length(train_set[1,,1]),length(train_set[1,1,])))
      
      train_labels_i=train_labels[pall,]
      modeling_sample=make_model(inputs_shape,USE_LSTM,USE_CNN,USE_MLP)
      
    }else{
      
      train_set_i=train_set[pall,]
      train_labels_i=train_labels[pall,]
      modeling_sample=make_model(inputs_shape,USE_LSTM,USE_CNN,USE_MLP)
    }
    
    #fit the model
    modeling_sample=fit_model(modeling_sample,train_set_i,train_labels_i,epochs,batchsize)#,shuffling = TRUE)#,list("0"=0.1,"1"=2))
    
    #make the predictions of the current model  
    Prediction = modeling_sample %>% predict(test_set)
    
    #add the predictions of this round to sum_of_preds  
    sum_of_preds=sum_of_preds+Prediction
  }
  
  #return the predictions averaged over all repeats
  return(sum_of_preds/repeats)
  
}
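
The index bookkeeping inside `sampling` can be tried in isolation. The sketch below balances a toy label vector by undersampling and by oversampling; `safe_sample` is a hypothetical wrapper that avoids base R's `sample()` treating a single number `n` as `1:n`, which would matter if one class had exactly one instance.

```r
set.seed(42)
labels <- c(rep(0, 10), rep(1, 3))
pp1 <- which(labels == 1)   # minority positions
pp0 <- which(labels == 0)   # majority positions

# Hypothetical wrapper: always sample from the elements of x, even when length(x) == 1.
safe_sample <- function(x, size, replace = FALSE) {
  x[sample.int(length(x), size, replace = replace)]
}

under_idx <- c(safe_sample(pp0, length(pp1)), pp1)                  # drop majority rows
over_idx  <- c(pp0, safe_sample(pp1, length(pp0), replace = TRUE))  # repeat minority rows

print(table(labels[under_idx]))
print(table(labels[over_idx]))
```

Undersampling discards majority instances, so repeating it with different random draws and averaging the predictions, as `sampling` does, lets more of the discarded data contribute.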

### Run the selected NN with over-under sampled training_set

In [250]:
sum_of_preds=sampling(train_set,test_set,inputs_shape,train_labels,times=15,repeats=5
                         ,USE_LSTM=USE_LSTM,USE_CNN=USE_CNN,USE_MLP=USE_MLP
                         ,USE_OVERSAMPLING=FALSE,USE_UNDERSAMPLING=TRUE
                         ,LSTM_FOR_2D=LSTM_FOR_2D,USE_3D_DATA=USE_3D_DATA)


Model: "sequential_19"
________________________________________________________________________________
Layer (type)                        Output Shape                    Param #     
dense_76 (Dense)                    (None, 128)                     2688        
________________________________________________________________________________
dropout_57 (Dropout)                (None, 128)                     0           
________________________________________________________________________________
dense_77 (Dense)                    (None, 256)                     33024       
________________________________________________________________________________
dropout_58 (Dropout)                (None, 256)                     0           
________________________________________________________________________________
dense_78 (Dense)                    (None, 128)                     32896       
________________________________________________________________________________
dropo

In [251]:
threas_SAMPLE=0.6
Prediction_sampling <- as.logical(as.numeric(sum_of_preds>threas_SAMPLE))  
eval_predictions(Prediction_sampling,failure_incidents,fs=FALSE)

dataset: C:/Users/petsi/Documents/ptyxiakh/testing_my_dataset_150.csv 
true_positives: 17 
false_positives: 24 
false_negatives: 5 
precision: 0.414634146341463 
recall: 0.772727272727273 
F1: 0.53968253968254 


## Combine algorithm predictions

**function: combine_algos**

In [252]:
#Function that combines the predictions of several algorithms (algos_predictions): if at least the given percent of them
#predict the "TRUE" class, the combined prediction is "TRUE", otherwise FALSE.
#By setting a weight for each classifier you get the weighted combined prediction; in that case percent is the
#threshold that the sum of the weights of the classifiers predicting TRUE must reach.
#If return_preds=FALSE the function prints the results, otherwise it returns the combined predictions.

combine_algos<-function(algos_predictions,percent=0.5,return_preds=FALSE,weights=list()){
  
  xdimte=length(algos_predictions[[1]])  
    
  #array for saving the sum of the results of all algos   
  sum_of_algos=array(FALSE,dim=c(xdimte,1))
   
  #for every prediction  
  for(i in 1:xdimte){
    counter=0
    count_alg=0
    #for every algorithm  
    for(algo in algos_predictions){
      count_alg=count_alg+1
      #if prediction of the algorithm is TRUE
      if(algo[i]==TRUE){  
          if(length(weights)==0){
                counter=counter+1

          }else{
                counter=counter+as.numeric(weights[[count_alg]])
                
          }  
      }     
          
              
    }
    
    #if at least the given percent of the algos predict TRUE then their combined prediction is TRUE, else FALSE. 
    if(length(weights)==0){  
        if(counter>=ceiling(length(algos_predictions)*percent)){                      
          sum_of_algos[i]=TRUE
        }else{
          sum_of_algos[i]=FALSE
        }
    }else{
        if(counter>=percent){                      
          sum_of_algos[i]=TRUE
        }else{
          sum_of_algos[i]=FALSE
        }
    }
  }
    
  if(return_preds){
    return(sum_of_algos)
  }
  
  return(eval_predictions(sum_of_algos,failure_incidents,fs=FALSE))
}
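
A vectorised sketch of the same voting rule on toy predictions follows; `vote` and the three toy vectors are hypothetical. As in `combine_algos`, `percent` is a fraction of the classifiers when no weights are given and a threshold on the summed weights otherwise.

```r
# Hypothetical vectorised equivalent of the voting rule.
vote <- function(preds, percent = 0.5, weights = NULL) {
  m <- sapply(preds, as.numeric)          # instances x classifiers, 1 = TRUE
  if (is.null(weights)) {
    rowSums(m) >= ceiling(ncol(m) * percent)
  } else {
    as.vector(m %*% unlist(weights)) >= percent
  }
}

preds <- list(a = c(TRUE, TRUE, FALSE, FALSE),
              b = c(TRUE, FALSE, FALSE, TRUE),
              c = c(TRUE, FALSE, TRUE, FALSE))

majority <- vote(preds)                                   # at least 2 of 3 votes
weighted <- vote(preds, weights = list(0.6, 0.2, 0.2))    # weighted sum >= 0.5
print(majority)
print(weighted)
```

In the weighted case the first classifier alone can decide the outcome because its weight exceeds the threshold, which is one way to favour the strongest model in the ensemble.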

### Collect the predictions of each algorithm and print the results

In [253]:
cat("XGB results:\n")
cat("-------------------------------------------------------------\n")
PredictionXGBOOST<-evalXGBoost(instances_df_xgb,test_instances_df_xgb,failure_incidents,fs=FALSE,plotbool=FALSE,threshold=threas_XGB,return_preds = TRUE)
eval_predictions(PredictionXGBOOST,failure_incidents,fs=FALSE)
cat("-------------------------------------------------------------\n")

cat("\nRF results:\n")
cat("-------------------------------------------------------------\n")
PredictionRF<-evalRF(instances_df_rf,test_instances_df_rf,failure_incidents,fs=FALSE,plot=FALSE,return_preds = TRUE)
eval_predictions(PredictionRF,failure_incidents,fs=FALSE)
cat("-------------------------------------------------------------\n")

cat("\nKNN results:\n")
cat("-------------------------------------------------------------\n")
eval_predictions(prediction_knn_fast,failure_incidents,fs=FALSE)
cat("-------------------------------------------------------------\n")

cat("\nNN results:\n")
cat("-------------------------------------------------------------\n")
PredictionNN<-eval_model(modeling,test_set,threas_NN,return_preds=TRUE)
eval_predictions(PredictionNN,failure_incidents,fs=FALSE)
cat("-------------------------------------------------------------\n")

cat("\nSAMPLE NN results:\n")
cat("-------------------------------------------------------------\n")
# Prediction_sampling is not recomputed in this cell; it is reused as already defined.
eval_predictions(Prediction_sampling,failure_incidents,fs=FALSE)
cat("-------------------------------------------------------------\n")

XGB results:
-------------------------------------------------------------
dataset: C:/Users/petsi/Documents/ptyxiakh/testing_my_dataset_150.csv 
true_positives: 21 
false_positives: 12 
false_negatives: 1 
precision: 0.636363636363636 
recall: 0.954545454545455 
F1: 0.763636363636364 


-------------------------------------------------------------

RF results:
-------------------------------------------------------------
dataset: C:/Users/petsi/Documents/ptyxiakh/testing_my_dataset_150.csv 
true_positives: 13 
false_positives: 7 
false_negatives: 9 
precision: 0.65 
recall: 0.590909090909091 
F1: 0.619047619047619 


-------------------------------------------------------------

KNN results:
-------------------------------------------------------------
dataset: C:/Users/petsi/Documents/ptyxiakh/testing_my_dataset_150.csv 
true_positives: 3 
false_positives: 7 
false_negatives: 19 
precision: 0.3 
recall: 0.136363636363636 
F1: 0.1875 


-------------------------------------------------------------

NN results:
-------------------------------------------------------------
dataset: C:/Users/petsi/Documents/ptyxiakh/testing_my_dataset_150.csv 
true_positives: 20 
false_positives: 28 
false_negatives: 2 
precision: 0.416666666666667 
recall: 0.909090909090909 
F1: 0.571428571428572 


-------------------------------------------------------------

SAMPLE NN results:
-------------------------------------------------------------
dataset: C:/Users/petsi/Documents/ptyxiakh/testing_my_dataset_150.csv 
true_positives: 17 
false_positives: 24 
false_negatives: 5 
precision: 0.414634146341463 
recall: 0.772727272727273 
F1: 0.53968253968254 


-------------------------------------------------------------
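
The precision, recall and F1 values printed above follow directly from the three counts in each block. As a quick check, the sketch below recomputes them; `prf_from_counts` is a hypothetical helper written for this illustration and is not the notebook's `eval_predictions`.

```r
# Hypothetical helper: precision, recall and F1 from confusion counts.
prf_from_counts <- function(tp, fp, fn) {
  precision <- tp / (tp + fp)   # share of raised alarms that were real failures
  recall <- tp / (tp + fn)      # share of real failures that were caught
  f1 <- 2 * precision * recall / (precision + recall)
  c(precision = precision, recall = recall, F1 = f1)
}

# Counts copied from the XGB and KNN outputs above.
xgb <- prf_from_counts(tp = 21, fp = 12, fn = 1)
knn <- prf_from_counts(tp = 3, fp = 7, fn = 19)
print(round(rbind(XGB = xgb, KNN = knn), 4))
```

The helper returns `NaN` when a model raises no alarms at all (`tp + fp == 0`), so a guard would be needed before using it on arbitrary counts.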


### Make a list of the algorithms' predictions you want to combine and print the result.

In [254]:
algos_predictions=list(PredictionRF,prediction_knn_fast,PredictionNN)

# Without weights, percent is the fraction of algorithms that must predict TRUE (here 2 of 3).
percent=2/3
cat("Combination RF-KNN-NN algos results:\n")
cat("-------------------------------------------------------------\n")
pred_RF_KNN_NN=combine_algos(algos_predictions,percent,return_preds=TRUE)
eval_predictions(pred_RF_KNN_NN,failure_incidents,fs=FALSE)
cat("-------------------------------------------------------------\n")

Combination RF-KNN-NN algos results:
-------------------------------------------------------------
dataset: C:/Users/petsi/Documents/ptyxiakh/testing_my_dataset_150.csv 
true_positives: 12 
false_positives: 11 
false_negatives: 10 
precision: 0.521739130434783 
recall: 0.545454545454545 
F1: 0.533333333333333 


-------------------------------------------------------------


In [255]:
algos_predictions=list(pred_RF_KNN_NN,Prediction_sampling)

# With two inputs and percent=1/2, one TRUE vote is enough (ceiling(2*1/2)=1), i.e. a logical OR.
percent=1/2
cat("RF-KNN-NN and SAMPLE NN combination results:\n")
cat("-------------------------------------------------------------\n")
pred_RF_KNN_NN_SAMPLE=combine_algos(algos_predictions,percent,return_preds=TRUE)
eval_predictions(pred_RF_KNN_NN_SAMPLE,failure_incidents,fs=FALSE)
cat("-------------------------------------------------------------\n")

RF-KNN-NN and SAMPLE NN combination results:
-------------------------------------------------------------
dataset: C:/Users/petsi/Documents/ptyxiakh/testing_my_dataset_150.csv 
true_positives: 18 
false_positives: 26 
false_negatives: 4 
precision: 0.409090909090909 
recall: 0.818181818181818 
F1: 0.545454545454546 


-------------------------------------------------------------


In [256]:
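algos_predictions=list(pred_RF_KNN_NN,Prediction_sampling)
# Weights follow the order of algos_predictions: 0.9 for the RF-KNN-NN vote, 0.1 for the sampled NN.
weight=list(0.9,0.1)

# With weights, percent is the threshold that the accumulated weight must reach.
percent=1/2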
cat("Weighted RF-KNN-NN and SAMPLE NN combination results:\n")
cat("-------------------------------------------------------------\n")
pred_RF_KNN_NN_SAMPLE=combine_algos(algos_predictions,percent,return_preds=TRUE,weights=weight)
eval_predictions(pred_RF_KNN_NN_SAMPLE,failure_incidents,fs=FALSE)
cat("-------------------------------------------------------------\n")

Weighted RF-KNN-NN and SAMPLE NN combination results:
-------------------------------------------------------------
dataset: C:/Users/petsi/Documents/ptyxiakh/testing_my_dataset_150.csv 
true_positives: 12 
false_positives: 11 
false_negatives: 10 
precision: 0.521739130434783 
recall: 0.545454545454545 
F1: 0.533333333333333 


-------------------------------------------------------------
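
The weighted run in cell 256 reports exactly the same counts as the RF-KNN-NN vote from cell 254 (12, 11, 10). The sketch below suggests why, assuming the weighted branch of `combine_algos` sums the weights of the algorithms that predict TRUE: with weights 0.9 and 0.1 and a threshold of 1/2, the second input can never change the outcome. The data frame and its column names are hypothetical and only enumerate the four possible vote combinations.

```r
# All four vote combinations for two predictors.
votes <- expand.grid(first = c(TRUE, FALSE), second = c(TRUE, FALSE))

# Cell 255: no weights, so the threshold is ceiling(2 * 1/2) = 1 vote.
votes$unweighted <- (votes$first + votes$second) >= ceiling(2 * (1/2))

# Cell 256: weights 0.9 and 0.1, threshold 1/2 on the weighted sum.
votes$weighted <- (0.9 * votes$first + 0.1 * votes$second) >= 1/2

print(votes)
```

In this enumeration the `unweighted` column matches a logical OR of the two inputs, while the `weighted` column matches `first` in every row. A weight vector in which no single weight reaches the threshold on its own would be needed for the second input to influence the result.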
