***Steps***  
***Part 1: Log Collection/Cleaning***
1. Run `parser.py` in '/Parsing'. It parses the files in 'rawlogs' into 'cleanlogs' and creates log_structured.csv and log_templates.csv.
- Example code: `%run parser.py`
2. Use `clean_logs()` to clean log_structured.csv. It renames events to E1, E2, etc. and creates log_structured_clean.csv.
- Example code: `clean_logs("./Parsing/cleanlogs/Zookeeper_2k.log_structured.csv","./Parsing/cleanlogs/Zookeeper_2k.log_templates.csv","Zookeeper")`.  
2.1. For labeled data, use `label_log()` to label log_structured_clean.csv, producing log_labelled.csv. ***Use for HDFS only.***
- Example code: `label_log("./Parsing/cleanlogs/HDFS_structured_clean.csv","./Parsing/rawlogs/anomaly_label.csv")`.  
     
***Part 2: Log Parsing and Feature Extraction***  
3. Use `_session_window()`, `_fixed_window()`, or `_sliding_window()` to parse and extract event sequences from your log_structured_clean.csv or log_labelled.csv file. ***`_session_window()` is for HDFS only.***
- Example code (labelled): `x,y,df = _session_window(data_file="./Parsing/cleanlogs/HDFS_labelled.csv",labels=True)`.  
- Example code: `x,y,df = _fixed_window(data_file="./Parsing/cleanlogs/Zookeeper_structured_clean.csv",windowsize=2,log_type="Zookeeper")`.  
4. Use `_split_data()` to split `x` and `y` into training and testing sets. ***`split_type="uniform"` is only for labeled data.***  
- Example code (labeled): `(x_train,y_train),(x_test,y_test) = _split_data(x,y,train_ratio=0.5,split_type="uniform")`.  
- Example code (unlabeled): `(x_train, _),( x_test, _) = _split_data(x,y=None,train_ratio=0.5,split_type="sequential")`.  
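
The difference between fixed and sliding windows in step 3 can be sketched in plain Python. This is an illustration only, not the code in `clean_parse_extract`: `window_events`, its parameters, and the sample data are hypothetical, and timestamps are assumed to be in the same unit as the window size.

```python
def window_events(timestamps, events, window_size, slide=None):
    """Group events into time windows.

    Hypothetical helper: a fixed window is treated as a sliding window
    whose slide equals its size, so consecutive windows do not overlap.
    """
    if slide is None:
        slide = window_size  # fixed (non-overlapping) windows
    if window_size <= 0 or slide <= 0:
        raise ValueError("window_size and slide must be positive")
    if not timestamps:
        return []

    windows = []
    start = min(timestamps)
    last = max(timestamps)
    while start <= last:
        end = start + window_size
        # Keep every event whose timestamp falls in [start, end).
        windows.append([e for t, e in zip(timestamps, events) if start <= t < end])
        start += slide
    return windows


# Toy data: one timestamp and one event ID per log line.
ts = [0, 1, 2, 3, 4, 5]
ev = ["E1", "E2", "E1", "E3", "E2", "E1"]

fixed = window_events(ts, ev, window_size=2)             # 3 disjoint windows
sliding = window_events(ts, ev, window_size=3, slide=1)  # 6 overlapping windows
```

With a slide smaller than the window size, each event can land in several windows, which is why sliding windows produce more (and overlapping) sequences than fixed windows over the same log.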


In [4]:
#Parsing and Feature Extraction
import Parsing.clean_parse_extract as cpe

#For Modelling
from Models.loglizer.loglizer.models import PCA
from Models.loglizer.loglizer import preprocessing

In [5]:
#2
cpe.clean_logs("./Parsing/cleanlogs/HDFS_2k.log_structured.csv","./Parsing/cleanlogs/HDFS_2k.log_templates.csv","HDFS")
cpe.clean_logs("./Parsing/cleanlogs/Zookeeper_2k.log_structured.csv","./Parsing/cleanlogs/Zookeeper_2k.log_templates.csv","Zookeeper")

In [6]:
#2.1 *Run on HDFS to get unique blk_ids
cpe.label_log("./Parsing/cleanlogs/HDFS_structured_clean.csv","./Parsing/rawlogs/anomaly_label.csv")

In [7]:
#3
x_hl,y_hl,df_hl = cpe._session_window(data_file="./Parsing/cleanlogs/HDFS_labelled.csv",labels=True)
x_z,y_z,df_z = cpe._fixed_window(data_file="./Parsing/cleanlogs/Zookeeper_structured_clean.csv",windowsize=2,log_type="Zookeeper")
x_h,y_h,df_h = cpe._sliding_window(data_file="./Parsing/cleanlogs/HDFS_labelled.csv",windowsize=5,windowslide=2,log_type = "HDFS")

Number of unique blk_ids:  1994
Number of windows:  253
Number of window slides:  252
Number of windows:  15
Number of window slides:  14
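
Session windowing on HDFS groups events by block ID, which is why the first count above is the number of unique blk_ids. A minimal sketch of that idea follows. It assumes each row carries its message text and event ID, and that the label file maps a block ID to "Anomaly" or "Normal". `session_windows`, the regex, and the sample rows are illustrative and not taken from `clean_parse_extract`.

```python
import re

# Assumed block ID pattern: "blk_" followed by an optionally negative integer.
BLOCK_ID = re.compile(r"blk_-?\d+")


def session_windows(rows, labels=None):
    """Group event IDs by HDFS block ID.

    rows   -- iterable of (content, event_id) pairs in log order
    labels -- optional dict mapping block ID to "Anomaly" or "Normal"
    Returns a list of (block_id, event_sequence, label) tuples, where
    label is 1 for anomalous blocks, 0 otherwise, or None if unlabeled.
    """
    sessions = {}
    for content, event_id in rows:
        # A line may mention several blocks; count it once per block,
        # keeping the order in which the blocks appear.
        for block_id in dict.fromkeys(BLOCK_ID.findall(content)):
            sessions.setdefault(block_id, []).append(event_id)

    result = []
    for block_id, sequence in sessions.items():
        label = None
        if labels is not None:
            label = 1 if labels.get(block_id) == "Anomaly" else 0
        result.append((block_id, sequence, label))
    return result


# Toy rows and labels, made up for illustration.
rows = [
    ("Receiving block blk_1", "E5"),
    ("Receiving block blk_-2", "E5"),
    ("PacketResponder 1 for block blk_1 terminating", "E11"),
    ("Deleting block blk_-2", "E3"),
]
labels = {"blk_1": "Normal", "blk_-2": "Anomaly"}
sessions = session_windows(rows, labels)
```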


In [8]:
#4
(x_train_z,_), (x_test_z,_) = cpe._split_data(x_z,y_z,train_ratio = 0.9, split_type = 'sequential')
(x_train_hl,y_train_hl), (x_test_hl,y_test_hl) = cpe._split_data(x_hl,y_hl,train_ratio = 0.5, split_type = 'uniform')

Size of training set:  227
Size of testing set:  26
Size of total data set:  253
Size of training set:  997
Size of testing set:  997
Size of total data set:  1994
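
The two split types can be sketched as follows. This is a hypothetical stand-in, not the actual `_split_data()`: it assumes "sequential" means taking the first `train_ratio` share of rows in order, and "uniform" means applying the ratio to anomalous and normal rows separately so both sets keep a similar anomaly rate. The `seed` parameter is an addition for reproducibility.

```python
import random


def split_data(x, y=None, train_ratio=0.5, split_type="sequential", seed=0):
    """Return ((x_train, y_train), (x_test, y_test)) for the given split type."""
    if split_type == "uniform":
        if y is None:
            raise ValueError("uniform splitting needs labels")
        pos = [i for i, label in enumerate(y) if label == 1]
        neg = [i for i, label in enumerate(y) if label != 1]
        n_pos = int(train_ratio * len(pos))
        n_neg = int(train_ratio * len(neg))
        # Apply the ratio to each class separately, then shuffle.
        train_idx = pos[:n_pos] + neg[:n_neg]
        test_idx = pos[n_pos:] + neg[n_neg:]
        rng = random.Random(seed)
        rng.shuffle(train_idx)
        rng.shuffle(test_idx)
    elif split_type == "sequential":
        # Keep log order: the first rows train, the remaining rows test.
        n_train = int(train_ratio * len(x))
        train_idx = list(range(n_train))
        test_idx = list(range(n_train, len(x)))
    else:
        raise ValueError("split_type must be 'uniform' or 'sequential'")

    def pick(seq, idx):
        return None if seq is None else [seq[i] for i in idx]

    return (pick(x, train_idx), pick(y, train_idx)), (pick(x, test_idx), pick(y, test_idx))


# 253 unlabeled windows split 90/10 in order.
x = list(range(253))
(x_train, _), (x_test, _) = split_data(x, train_ratio=0.9, split_type="sequential")
```

Under these assumptions the sequential split of 253 windows gives 227 training and 26 testing rows, which agrees with the sizes printed above, although that alone does not confirm how `_split_data()` is implemented.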


In [9]:
df_hl[df_hl.Label == 1]

Unnamed: 0,BlockId,EventSequence,Label
38,blk_8181993091797661153,[E13],1
63,blk_8408125361497769001,[E3],1
89,blk_8787656642683881295,[E8],1
93,blk_-62891505109755100,[E11],1
139,blk_-7105305952901940477,[E6],1
...,...,...,...
1763,blk_-7606467001548719462,[E1],1
1856,blk_4516306414837452219,[E4],1
1904,blk_9173199815015538212,[E1],1
1942,blk_-8571819028995448536,[E10],1


In [None]:
feature_extractor = preprocessing.FeatureExtractor()
#fit_transform() fits the extractor on the TRAINING matrix and transforms it
#Transform parameters: term_weighting = 'tf-idf' or None
#Transform parameters: normalization = 'zero-mean' or None
x_train = feature_extractor.fit_transform(x_train_hl, term_weighting='tf-idf',
                                          normalization='zero-mean')
#transform() reuses the FITTED extractor to transform the TESTING matrix
x_test = feature_extractor.transform(x_test_hl)
#PCA model functions: model.fit(), model.predict(), model.evaluate()
model = PCA()
model.fit(x_train)
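
The fit-on-train, reuse-on-test pattern in the comments above can be illustrated in pure Python. This sketch assumes tf-idf means the event count multiplied by log(N / df), and zero-mean means subtracting the training column means. loglizer's `FeatureExtractor` may differ in details, and the `fit_transform` and `transform` functions below are standalone illustrations, not the library's methods.

```python
import math
from collections import Counter


def count_matrix(sequences, vocabulary):
    """One row per sequence, one column per event ID, cells are counts."""
    rows = []
    for seq in sequences:
        counts = Counter(seq)
        # Events outside the vocabulary are ignored.
        rows.append([float(counts[e]) for e in vocabulary])
    return rows


def fit_transform(sequences):
    """Learn vocabulary, idf weights, and column means from training data."""
    vocabulary = sorted({e for seq in sequences for e in seq})
    x = count_matrix(sequences, vocabulary)
    n = len(x)
    idf = []
    for j in range(len(vocabulary)):
        df = sum(1 for row in x if row[j] > 0)  # sequences containing event j
        idf.append(math.log(n / df))
    x = [[v * w for v, w in zip(row, idf)] for row in x]
    mean = [sum(col) / n for col in zip(*x)]
    x = [[v - m for v, m in zip(row, mean)] for row in x]
    return x, {"vocabulary": vocabulary, "idf": idf, "mean": mean}


def transform(sequences, fitted):
    """Apply the weights and means learned on the training data."""
    x = count_matrix(sequences, fitted["vocabulary"])
    x = [[v * w for v, w in zip(row, fitted["idf"])] for row in x]
    return [[v - m for v, m in zip(row, fitted["mean"])] for row in x]


# Toy sequences; E9 never appears in training, so it is dropped at test time.
train = [["E1", "E2"], ["E1"], ["E1", "E3", "E3"]]
x_tr, fitted = fit_transform(train)
x_te = transform([["E2", "E9"]], fitted)
```

Reusing the training statistics on the test matrix keeps both matrices in the same feature space, which is why the test set goes through `transform()` rather than a second `fit_transform()`.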