# Sequential pattern mining (SPM)
**Sequential pattern mining** (SPM) is a **data-driven** technique with a wide range of applications. SPM consists of **discovering** useful **patterns** in data, such as **frequent itemsets**, associations, sequential rules, or periodic patterns. In predictive maintenance (PdM), **SPM can provide useful information about associations between fault events, since a sequence of minor faults or other events can potentially lead to a major failure**. Traditionally, SPM does not take the time between the associated events into account [14]. However, works such as [7] allow time constraints to be specified for the identification of patterns, and works such as [3] and [12] extend SPM to the online processing of temporal data sequences. Combining such techniques with **complex event processing** (CEP) can help predict failures in a variety of complex systems, such as those encountered in industry.


3. Ao, X., Luo, P., Li, C., Zhuang, F., He, Q.: Online frequent episode mining. In: IEEE 31st Int. Conf. on Data Engineering (ICDE). pp. 891-902 (2015)

7. Hirate, Y., Yamana, H.: Generalized sequential pattern mining with item intervals. JCP 1(3), 51-60 (2006)

12. Li, H., Peng, S., Li, J., Li, J., Cui, J., Ma, J.: Once and once+: Counting the frequency of time-constrained serial episodes in a streaming sequence. arXiv preprint arXiv:1801.09639 (2018)

14. Wang, J., Li, C., Han, S., Sarkar, S., Zhou, X.: Predictive maintenance based on event-log analysis: A case study. IBM Journal of Research and Development 61(1), 11-121 (2017)

## Sequential pattern mining for PdM

In this work, we examine the **prediction efficiency** of a system that uses **SPM** with **time constraints between events**. An outline is presented in Algorithm 1. The main input parameters are the **constraints** on the pattern period (**--minwi**, **--maxwi**) and on the gap between events (**--minti**, **--maxti**); the parameter α, which sets the support threshold (**--conf**) relative to the number of faults in the training set; and ε, which retains only patterns that do not generate many false alarms.
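
To make the four time constraints concrete, the sketch below checks whether one occurrence of a pattern satisfies them. The helper `satisfies_time_constraints` and its arguments are hypothetical and are not part of SPMF. The sketch assumes the semantics given by the help strings of `--minti`, `--maxti`, `--minwi` and `--maxwi` later in this notebook: every gap between successive itemsets must lie in [minti, maxti], and the span between the first and last itemset must lie in [minwi, maxwi].

```r
# Hypothetical helper (not part of SPMF): checks a single occurrence of a pattern.
# `times` holds the time stamps (e.g. day indices) of the matched itemsets, in order.
satisfies_time_constraints <- function(times, minti, maxti, minwi, maxwi) {
  if (length(times) < 2) {
    return(TRUE)  # a single itemset has no gaps to constrain
  }
  gaps <- diff(times)                       # intervals between successive itemsets
  span <- times[length(times)] - times[1]   # interval between first and last itemset
  all(gaps >= minti & gaps <= maxti) && span >= minwi && span <= maxwi
}

# With the defaults used in this notebook (minti = 4, maxti = 5, minwi = 4, maxwi = 5)
ok  <- satisfies_time_constraints(c(0, 4), 4, 5, 4, 5)     # gap 4, span 4
bad <- satisfies_time_constraints(c(0, 2, 4), 4, 5, 4, 5)  # gaps of 2 are too short
print(c(ok = ok, bad = bad))
```

With these defaults only patterns whose itemsets are four or five time units apart can match, which is consistent with the mined patterns having the target event at offset `<4>`.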



Algorithm 1 Sequential pattern mining for PdM
-------------------------------------------------------------------
procedure **Pattern Extraction**
<br>
**--------------------Hirate Yamana Algorithm(.jar Java)--------------------**
<br>
    &nbsp;&nbsp;&nbsp;&nbsp;nof<-number of failures
<br>
    &nbsp;&nbsp;&nbsp;&nbsp;min support<- (α * nof), 0<α<=1
<br>
    &nbsp;&nbsp;&nbsp;&nbsp;constraints<-set constraints on the pattern period and the gap between events
<br>
    &nbsp;&nbsp;&nbsp;&nbsp;Extract frequent sequential patterns given min support and constraints 
<br>    
**----------------------------SPM rules(.py Python)-----------------------------** 
<br>
    &nbsp;&nbsp;&nbsp;&nbsp;Keep only the patterns ending in the target event E1E2E3...EnX
<br>
    &nbsp;&nbsp;&nbsp;&nbsp;Result<-{}
<br>
    &nbsp;&nbsp;&nbsp;&nbsp;**for each** subset S of E1E2E3...En **do**
<br>
        &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**if** support(S) <= (1 + ε)*support(E1E2E3...EnX), ε>0 **then**
<br>
            &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Result<-Result U S
<br>

procedure **Pattern Usage**
<br>
    &nbsp;&nbsp;&nbsp;&nbsp;Continuously check whether any pattern in Result applies
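
The two thresholds of Algorithm 1 can be illustrated with a few lines of R. This is only a sketch under stated assumptions: the support values are invented for illustration, and the variable names (`alpha`, `epsilon`, `pattern_support`, `subset_support`) are hypothetical and do not appear in the actual Java or Python implementations.

```r
# Minimal sketch of the two thresholds in Algorithm 1 (all names and numbers are illustrative).
alpha   <- 0.7   # fraction of failures a pattern must cover, 0 < alpha <= 1
epsilon <- 0.5   # tolerance for subsets that also occur without the failure, epsilon > 0
nof     <- 3     # number of failures in the training set

min_support <- alpha * nof

# Hypothetical support of one mined pattern E1 E2 ... En X (X is the target event)
pattern_support <- 3
is_frequent <- pattern_support >= min_support

# Hypothetical supports of some subsets S of E1 E2 ... En
subset_support <- c("4" = 3, "4 5" = 4, "7" = 6)

# Keep S only if it rarely occurs without the failure:
# support(S) <= (1 + epsilon) * support(E1 E2 ... En X)
keep   <- subset_support <= (1 + epsilon) * pattern_support
result <- names(subset_support)[keep]
print(result)
```

In this toy setting the subset `7` is dropped: it occurs six times overall but the pattern leads to the failure only three times, so using it as a rule would be expected to raise false alarms.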







### Setup

In [1]:
#Make the necessary imports.

suppressMessages(library(argparser))
suppressMessages(library(stringr))
suppressMessages(library(data.table))

"package 'data.table' was built under R version 3.6.2"

### Init variables

In [2]:
# Make an argument parser named p and keep there the necessary variables.

p <- arg_parser("Implementation of the SPM+CEP Predictor")

# Add a positional argument
p <- add_argument(p, "train", help="training dataset")
p <- add_argument(p, "test", help="test dataset")

p <- add_argument(p, "tet", help="type of the target fault events",default=11)

setwd("C:/Program Files (x86)")
p <- add_argument(p, "--java", help="the java path", default="./Java/jdk1.8.0_192/bin/java.exe")

setwd("C:/Users/PETROS PETSINIS")
p <- add_argument(p, "--python", help="the python path", default="./Anaconda/python.exe")

setwd("C:")

p <- add_argument(p, "--cep", help="complex event processing path", default="C:/Users/Public/ptyxiakh/my_spmrules.py")
p <- add_argument(p, "--spmf", help="the spmf path", default="C:/Users/Public/ptyxiakh/spmf.jar")

p <- add_argument(p, "--conf", help="minimum support (minsup)", default="70%")#50%

p <- add_argument(p, "--minti", help="minimum time interval allowed between two successive itemsets of a sequential pattern", default=4)#2
p <- add_argument(p, "--maxti", help="maximum time interval allowed between two successive itemsets of a sequential pattern", default=5)

p <- add_argument(p, "--minwi", help="minimum time interval allowed between the first itemset and the last itemset of a sequential pattern", default=4)#2
p <- add_argument(p, "--maxwi", help="maximum time interval allowed between the first itemset and the last itemset of a sequential pattern", default=5)

p <- add_argument(p, "--minwint", help="min # of days before failure to expect a warning for true positive decision", default=2)
p <- add_argument(p, "--maxwint", help="max # of days before failure to expect a warning for true positive decision", default=5)

p <- add_argument(p, "--csv", help="output for csv", default=TRUE)

### Define the necessary variables

In [3]:
#Define the necessary variables.

argv = data.frame() #make a data frame named argv
#if( length(commandArgs(trailingOnly = TRUE)) != 0){
if(FALSE){
  argv <- parse_args(p)
} else {
  #no command-line arguments are used here: pass the positional arguments explicitly as parse_args(p, c(<training dataset path>, <test dataset path>, <tet>))
  argv <- parse_args(p,c("C:/Users/Public/ptyxiakh/training_my_dataset2.csv","C:/Users/Public/ptyxiakh/testing_my_dataset2.csv",11))
}

#init the variables
train_path=argv$train
test_path=argv$test

#replace only the trailing ".csv" extension (an unescaped "." would match any character)
spm_train_path = sub("\\.csv$","_spm.csv",argv$train)
spm_test_path = sub("\\.csv$","_spm.csv",argv$test)
spm_results_path = sub("\\.csv$","_results2.csv",argv$train)

target_event = argv$tet

confidence = argv$conf

min_dist_seq = argv$minti
max_dist_seq = argv$maxti

min_dist_first_last = argv$minwi
max_dist_first_last = argv$maxwi

max_warning_interval = argv$maxwint
min_warning_interval = argv$minwint

java_path = argv$java
jspmf_path = argv$spmf

python_path = argv$python
cep_path = argv$cep

csv = argv$csv

print("The data frame argv is:")
print(argv)

[1] "The data frame argv is:"
[[1]]
[1] FALSE

$help
[1] FALSE

$opts
[1] NA

$java
[1] "./Java/jdk1.8.0_192/bin/java.exe"

$python
[1] "./Anaconda/python.exe"

$cep
[1] "C:/Users/Public/ptyxiakh/my_spmrules.py"

$spmf
[1] "C:/Users/Public/ptyxiakh/spmf.jar"

$conf
[1] "70%"

$minti
[1] 4

$maxti
[1] 5

$minwi
[1] 4

$maxwi
[1] 5

$minwint
[1] 2

$maxwint
[1] 5

$csv
[1] TRUE

$train
[1] "C:/Users/Public/ptyxiakh/training_my_dataset2.csv"

$test
[1] "C:/Users/Public/ptyxiakh/testing_my_dataset2.csv"

$tet
[1] 11



### Reading function
**function: read_dataset**

In [4]:
#Function that reads a csv file into a two-column table (Timestamps, Event_id).

read_dataset <- function(path){
  dataset = read.table(path, header = TRUE, sep = ",", dec = ".", comment.char = "#")
  dataset[, 2]  <- as.numeric(dataset[, 2])
  return(dataset)
}

### Read train and test set

The recorded log events are read from csv files.
One csv file (at **train_path**) holds the **training_set** and the other (at **test_path**) holds the **test_set**.

In [5]:
#Reading train and test set.

training_set = read_dataset(train_path)
test_set =  read_dataset(test_path)

print("The test_set and training_set looks like:")
print(head(test_set))

[1] "The test_set and training_set looks like:"
  Timestamps Event_id
1 2014-01-21        1
2 2014-01-21        3
3 2014-01-21        4
4 2014-01-21        5
5 2014-01-21        6
6 2014-01-21        5


## Function for creating and saving (as .csv file) SPM episodes

Split the event table into episodes.

**episode**: the events from the day after a target event (or from the start of the table) up to the next target event (or the end of the table).
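
The episode definition can be illustrated on a toy log with a vectorized split. This is a sketch, not the function used in this notebook: the data frame `toy` and the variables `is_target` and `episode_id` are made up for illustration. Unlike the loop-based function that follows, this sketch would also return a trailing episode that is not closed by a target event.

```r
# Toy event log; event 11 plays the role of the target (failure) event.
toy <- data.frame(
  Timestamps = as.Date(c("2014-01-01", "2014-01-01", "2014-01-02",
                         "2014-01-03", "2014-01-04", "2014-01-05")),
  Event_id   = c(1, 3, 11, 2, 4, 11)
)
target <- 11

is_target <- toy$Event_id == target
# The episode id increases by one on the row after every target event.
episode_id <- cumsum(c(0, head(is_target, -1)))

# Drop the target rows themselves and split the remaining events by episode.
episodes <- split(toy[!is_target, ], episode_id[!is_target])
print(episodes)
```

Here the first episode contains events 1 and 3 and the second contains events 2 and 4; each is terminated by an occurrence of event 11.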

In [6]:
#Function for creating episodes list of each day's events.

create_episodes_list_base_line <- function(target_event,ds,output){
  if(!csv){
    print("~~~~~~~CREATING FREQUENCY VECTORS~~~~~~~")
  }
  #divide into episodes
  target_event_spotted = FALSE
    
  #a list with data.frames for the episodes (each episode one data.frame)
  episodes_list = list()
    
  #data.frame for episodes
  episode_df <- data.frame(Timestamps=as.Date(character()),Event_id=integer())
    
  #iterate over every line of the original dataset
  for(i in 1:nrow(ds)) {
    #get the current row of the ds
    meas <- ds[i,]
      
    #If it is the target event enable the appropriate flag
    if((meas$Event_id == target_event) || i==1){
      target_event_spotted = TRUE
    }
      
    #fill the episode data.frame with the events that are between two target events
    if(meas$Event_id != target_event && target_event_spotted){
      episode_df <- rbind(episode_df,data.frame(Timestamps=meas$Timestamps, Event_id=meas$Event_id))  
    } else if(meas$Event_id == target_event && target_event_spotted && is.data.frame(episode_df) && nrow(episode_df) != 0){
      #a second occurrence of the target event is spotted, close the episode
      #target_event_spotted = FALSE
      #aggregate by day all the events to form the segments inside the episodes
      aggr_episode_df = aggregate(episode_df[ ,2], FUN=function(x){return(x)}, by=list(as.Date(episode_df$Timestamps, "%Y-%m-%d")))
      
      #add the episode to the episodes_list
      episodes_list[[length(episodes_list)+1]] = aggr_episode_df
        
      #reset episode_df to an empty data.frame
      episode_df <- data.frame(Timestamps=as.Date(character()),Event_id=integer())
    }
  }

  print("Episode list looks like:")
  print((episodes_list))
  print("-----------------------------------------------------")

  #if the file exists remove its content
  if (file.exists(output)) {
    invisible(file.remove(output))
  }
    
  #output for HirateYamana
  if(length(episodes_list)>0){
    for(ep_index in (1:length(episodes_list))){
      ep = episodes_list[[ep_index]]$x
      ep_list = list()
      for(i in (1:length(ep))){
        ep_list[i] = paste(ep[[i]],collapse=" ")
      }
      ep_list[length(ep_list)+1] = target_event
      episode = ""
      for(ep_lli in (1:length(ep_list))){
        index = paste0("<",ep_lli-1,">") #zero-based position of the segment in the episode (changed from ep_lli to ep_lli-1)
        if(episode == ""){
          episode = paste(index,ep_list[[ep_lli]],sep=" ")
        } else {
          episode = paste(episode,paste(index,ep_list[[ep_lli]],sep=" "),sep=" -1 ")
        }
      }
      write(paste(episode,"-1 -2"),file=output,append=TRUE)
    }
  }  
}

## Create episodes list

In [7]:
if(!csv){
  print("~~~~~~~SEQUENTIAL PATTERN MINING~~~~~~~")
}
print("For the training set...")
create_episodes_list_base_line(target_event,training_set,spm_train_path)
print("For the testing set...")
create_episodes_list_base_line(target_event,test_set,spm_test_path)

print("The .csv file of the test set looks like:")
testset <- read.csv(file = spm_test_path,header=FALSE)
head(testset)

[1] "For the training set..."
[1] "Episode list looks like:"
[[1]]
      Group.1                               x
1  2014-01-02                2, 4, 5, 7, 3, 6
2  2014-01-03   1, 2, 3, 4, 6, 7, 8, 9, 7, 10
3  2014-01-04                   3, 4, 5, 6, 8
4  2014-01-05          1, 2, 3, 4, 5, 6, 7, 6
5  2014-01-06             1, 2, 3, 4, 3, 9, 6
6  2014-01-07       2, 3, 5, 6, 7, 8, 9, 7, 3
7  2014-01-08   1, 2, 3, 4, 5, 6, 7, 3, 10, 4
8  2014-01-09                      1, 2, 4, 5
9  2014-01-10 1, 2, 3, 4, 5, 6, 7, 3, 9, 7, 7
10 2014-01-11                   1, 3, 4, 3, 5

[[2]]
     Group.1                         x
1 2014-01-12      2, 4, 5, 7, 4, 10, 5
2 2014-01-13    1, 3, 4, 5, 6, 7, 8, 7
3 2014-01-14 1, 2, 3, 4, 5, 6, 7, 8, 7
4 2014-01-15                7, 8, 5, 7
5 2014-01-16 1, 2, 3, 4, 5, 6, 7, 4, 7

[[3]]
     Group.1                         x
1 2014-01-17             2, 4, 5, 7, 3
2 2014-01-18       1, 3, 4, 5, 6, 7, 6
3 2014-01-19 1, 2, 3, 4, 5, 7, 3, 4, 7
4 2014-01-20    1, 2, 3

V1
<0> 1 3 4 5 6 5 6 7 -1 <1> 1 2 4 5 7 8 9 -1 <2> 1 2 3 5 6 8 4 10 -1 <3> 3 4 5 7 5 6 -1 <4> 1 2 3 4 5 6 7 8 4 7 -1 <5> 1 2 4 6 7 4 7 -1 <6> 11 -1 -2
<0> 2 3 4 5 6 7 3 -1 <1> 1 2 4 5 7 3 10 5 -1 <2> 3 4 7 9 7 -1 <3> 1 2 3 4 5 6 3 4 6 -1 <4> 11 -1 -2


## Run the HirateYamana algorithm

The **HirateYamana** algorithm is run through the SPMF jar file **spmf.jar** (jspmf_path) on **spm_train_path** (the .csv file of the training set).

Hirate Yamana: https://www.philippe-fournier-viger.com/spmf/hirateyamana.pdf

Hirate Yamana example: https://www.philippe-fournier-viger.com/spmf/HirateYamana.php
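
Judging from the result lines shown in this notebook, each mined pattern is written as one line such as `<0> 4 5 -1 <4> 11 -1 #SUP: 3`: every itemset starts with its time offset `<t>`, ends with `-1`, and the support follows `#SUP:`. The sketch below parses one such line in R. The function `parse_pattern_line` is a hypothetical helper for illustration; in this notebook the actual rule extraction is done by the Python script.

```r
# Hypothetical parser for one result line such as "<0> 4 5 -1 <4> 11 -1 #SUP: 3".
parse_pattern_line <- function(line) {
  parts   <- strsplit(line, " #SUP: ", fixed = TRUE)[[1]]
  support <- as.integer(parts[2])
  # Itemsets are terminated by " -1"; each one starts with its time offset "<t>".
  itemsets <- trimws(strsplit(parts[1], " -1", fixed = TRUE)[[1]])
  itemsets <- itemsets[itemsets != ""]
  offsets  <- as.integer(sub("^<([0-9]+)>.*$", "\\1", itemsets))
  items    <- lapply(strsplit(sub("^<[0-9]+> *", "", itemsets), " +"), as.integer)
  list(offsets = offsets, items = items, support = support)
}

p <- parse_pattern_line("<0> 4 5 -1 <4> 11 -1 #SUP: 3")
str(p)
```

For this line the parsed pattern reads: events 4 and 5 at offset 0, followed by event 11 at offset 4, with support 3.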

In [8]:
if (file.exists(spm_results_path)) {
  invisible(file.remove(spm_results_path))
}

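#quote the paths so that directories containing spaces are passed as single arguments
javaOutput <- system(paste(shQuote(java_path),"-jar",shQuote(jspmf_path),"run HirateYamana",shQuote(spm_train_path),shQuote(spm_results_path),confidence,min_dist_seq,max_dist_seq,min_dist_first_last,max_dist_first_last), intern = TRUE)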

print("The .csv file of the HirateYamanas' results looks like:")
hiryamres <- read.csv(file = spm_results_path,header=FALSE)
head(hiryamres)

[1] "The .csv file of the HirateYamanas' results looks like:"


V1
<0> 3 -1 <4> 11 -1 #SUP: 3
<0> 4 -1 <4> 11 -1 #SUP: 3
<0> 4 7 -1 <4> 11 -1 #SUP: 3
<0> 4 5 -1 <4> 11 -1 #SUP: 3
<0> 4 5 7 -1 <4> 11 -1 #SUP: 3
<0> 5 -1 <4> 11 -1 #SUP: 3
<0> 5 7 -1 <4> 11 -1 #SUP: 3
<0> 7 -1 <4> 11 -1 #SUP: 3


## Extract the Rules and make predictions for the test set

In [16]:
setwd("C:/Users/PETROS PETSINIS")
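#quote the paths so that directories containing spaces are passed as single arguments
pythonOutput <- system(paste(shQuote(python_path),shQuote(cep_path),shQuote(spm_results_path),shQuote(spm_test_path),target_event), intern = TRUE)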

print("The .csv file of the test set looks like:")
testset <- read.csv(file = spm_test_path,header=FALSE)
head(testset)


print("The python output is:")
print(pythonOutput)

[1] "The .csv file of the test set looks like:"


V1
<0> 1 3 4 5 6 5 6 7 -1 <1> 1 2 4 5 7 8 9 -1 <2> 1 2 3 5 6 8 4 10 -1 <3> 3 4 5 7 5 6 -1 <4> 1 2 3 4 5 6 7 8 4 7 -1 <5> 1 2 4 6 7 4 7 -1 <6> 11 -1 -2
<0> 2 3 4 5 6 7 3 -1 <1> 1 2 4 5 7 3 10 5 -1 <2> 3 4 7 9 7 -1 <3> 1 2 3 4 5 6 3 4 6 -1 <4> 11 -1 -2


[1] "The python output is:"
 [1] "The rule keys are:"                                          
 [2] "dict_keys(['3', '4', '47', '45', '457', '5', '57', '7'])"    
 [3] ""                                                            
 [4] "-------------------------"                                   
 [5] "For episode 1"                                               
 [6] "-------------------------"                                   
[33] "6> Failure"                                                  
[34] "-------------------------"                                   
[35] "For episode 2"                                               
[36] "-------------------------"                                   
[56] "4> Failure"                                                  


## Evaluate spm results

Calculate recall, precision, and F1 score for the predictions.
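
As a compact reference for the three scores, the sketch below computes them from the confusion counts with explicit guards against division by zero. The helper `prf_scores` is illustrative and is not part of the original notebook. It assumes that every failure is counted either as a true positive or as a false negative, so that the number of failures equals `tp + fn`.

```r
# Illustrative helper (not from the original notebook): precision, recall and F1
# with explicit guards against division by zero.
prf_scores <- function(tp, fp, fn) {
  precision <- if (tp + fp == 0) 0 else tp / (tp + fp)
  recall    <- if (tp + fn == 0) 0 else tp / (tp + fn)
  f1 <- if (precision + recall == 0) 0 else 2 * precision * recall / (precision + recall)
  c(precision = precision, recall = recall, F1 = f1)
}

scores <- prf_scores(tp = 2, fp = 0, fn = 0)
print(scores)
```

Returning 0 for undefined scores is a convention chosen here; reporting `NA` would be an equally defensible choice.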

In [28]:
true_positives = 0
false_positives = 0
false_negatives = 0
total_failures = 0

day = 0

warnings = list()

ep_count=1

#for every line of pythonOutput
for(w in pythonOutput){ 
  #if string "Warning" appears in the line  
  if(grepl("Warning ",w,fixed=TRUE)){
    day = as.integer(str_extract(w, "\\-*\\d+\\.*\\d*")) #day's serial number  
    warnings = c(warnings,day) #
    #print("Warning list is:")
    #print(warnings)  
  #if string "Failure" appears in the line     
  } else if(grepl("Failure",w,fixed=TRUE)){  
      
    print("-------------------")  
    print("For episode:")
    print(ep_count) 
    #print("-------------------")
    ep_count=ep_count+1  
      
    day = as.integer(str_extract(w, "\\-*\\d+\\.*\\d*")) #day's serial number 
      
    total_failures = total_failures + 1 #increase total failures by 1
    
    day=day-1
      
    print("The serial number of the day, when the failure(target event) happens is:")
    print(day)  #day or day-1?
    #print("-------------------")
      
      
    #if there is no warning  
    if(length(warnings) == 0){
      false_negatives = false_negatives + 1 #increase false negatives by 1
    #if there is warning(s)    
    } else {
      if(length(warnings[warnings < day-max_warning_interval]) > 0){
        #increase false positives by the number of these warnings
        false_positives = false_positives + length(warnings[warnings < day-max_warning_interval]) 
      }
        
      #if at least one single warning lies inside the window [day-max_warning_interval, day-min_warning_interval] before the failure(target event)
      if(length(warnings[warnings >= (day-max_warning_interval) & warnings <= (day-min_warning_interval)]) > 0){
        true_positives = true_positives + 1 #increase true positives by 1
      #if there is no correct warning    
      } else {
        false_negatives = false_negatives + 1 #increase false negatives by 1
      }
    }
    warnings = list() #empty the list
  }
}

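#calculate the precision of the model (0 when no warning was counted)
if((true_positives+false_positives) == 0){
  precision = 0
} else {
  precision = true_positives/(true_positives+false_positives)
}

#calculate the recall of the model (0 when the test set contains no failure)
if(total_failures == 0){
  recall = 0
} else {
  recall = true_positives/total_failures
}

#calculate the F1 score of the model (0 when precision and recall are both 0)
if((precision+recall) == 0){
  F1 = 0
} else {
  F1 = 2*((precision*recall)/(precision+recall))
}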

#prints
#if(!csv){
print("------------------------------------------------------------")
if(TRUE){    
  cat(paste("dataset:",argv$test,"\ntrue_positives:", true_positives,"\nfalse_positives:", false_positives,"\nfalse_negatives:", false_negatives,"\nprecision:", precision,"\nrecall:", recall,"\nF1:", F1, "\n"))
} else {
  cat(paste(argv$test,",", true_positives,",", false_positives,",", false_negatives,",", precision,",", recall,",", F1,",",argv$conf,",",argv$minti,",",argv$maxti,",",argv$minwi,",",argv$maxwi,",",argv$minwint,",",argv$maxwint, "\n",sep=""))
}

[1] "-------------------"
[1] "For episode:"
[1] 1
[1] "The serial number of the day, when the failure(target event) happens is:"
[1] 5
[1] "-------------------"
[1] "For episode:"
[1] 2
[1] "The serial number of the day, when the failure(target event) happens is:"
[1] 3
[1] "------------------------------------------------------------"
dataset: C:/Users/Public/ptyxiakh/testing_my_dataset2.csv 
true_positives: 2 
false_positives: 0 
false_negatives: 0 
precision: 1 
recall: 1 
F1: 1 
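
The role of `--minwint` and `--maxwint` in the evaluation can also be expressed per warning. The sketch below classifies each warning day by its lead time before the failure. The helper `classify_warnings` and its category names are hypothetical; the evaluation cell above counts only the too-early warnings as false positives and awards at most one true positive per failure.

```r
# Hypothetical helper: classify warning days relative to a failure day.
classify_warnings <- function(warning_days, failure_day, minwint, maxwint) {
  lead <- failure_day - warning_days   # days between each warning and the failure
  c(timely    = sum(lead >= minwint & lead <= maxwint),
    too_early = sum(lead > maxwint),
    too_late  = sum(lead < minwint))
}

# Warnings on days 0, 2 and 4 for a failure on day 6, with the notebook defaults
counts <- classify_warnings(c(0, 2, 4), failure_day = 6, minwint = 2, maxwint = 5)
print(counts)
```

In this example the warning on day 0 comes six days ahead and is too early, while the warnings on days 2 and 4 fall inside the window of two to five days before the failure.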
