***Steps***  
***Part 1: Log Collection/Cleaning***
1. Run `parser.py` in '/Parsing'. It parses the files in 'rawlogs' into 'cleanlogs' and creates log_structured.csv and log_templates.csv.
- Example code: `%run parser.py`
2. Use `clean_logs()` to clean log_structured.csv. It renames events to E1, E2, etc. and creates log_structured_clean.csv.
- Example code: `clean_logs("./Parsing/cleanlogs/Zookeeper_2k.log_structured.csv","./Parsing/cleanlogs/Zookeeper_2k.log_templates.csv","Zookeeper")`.  
2.1: For labeled data, use `label_log()` to label log_structured_clean.csv, producing log_labelled.csv. ***Use for HDFS only.***
- Example code: `label_log("./Parsing/cleanlogs/HDFS_structured_clean.csv","./Parsing/rawlogs/anomaly_label.csv")`.  
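
The renaming in step 2 can be pictured as a lookup from each parser-assigned template ID to a short label. The following is a minimal sketch, not the actual `clean_logs()` implementation: it assumes the rows carry an `EventId` field, the helpers `build_event_map` and `rename_events` are hypothetical, and the IDs are made up for illustration.

```python
def build_event_map(template_ids):
    """Map each distinct template ID to E1, E2, ... in order of first appearance."""
    event_map = {}
    for template_id in template_ids:
        if template_id not in event_map:
            event_map[template_id] = f"E{len(event_map) + 1}"
    return event_map


def rename_events(structured_rows, event_map):
    """Return copies of the structured rows with EventId replaced by its short label."""
    return [{**row, "EventId": event_map[row["EventId"]]} for row in structured_rows]


# Made-up template IDs and rows, for illustration only.
template_ids = ["a1b2c3d4", "e5f6a7b8", "0c9d8e7f"]
rows = [
    {"LineId": 1, "EventId": "e5f6a7b8"},
    {"LineId": 2, "EventId": "a1b2c3d4"},
    {"LineId": 3, "EventId": "e5f6a7b8"},
]

event_map = build_event_map(template_ids)
renamed = rename_events(rows, event_map)
print([row["EventId"] for row in renamed])  # ['E2', 'E1', 'E2']
```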
     
***Part 2: Log Parsing and Feature Extraction***  
3. Use `_session_window()`, `_fixed_window()`, or `_sliding_window()` to parse and extract event sequences from your log_structured_clean.csv or log_labelled.csv file. ***`_session_window()` is for HDFS only.***
- Example code (labelled): `x,y,df = _session_window(data_file="./Parsing/cleanlogs/HDFS_labelled.csv",labels=True)`.  
- Example code: `x,y,df = _fixed_window(data_file="./Parsing/cleanlogs/Zookeeper_structured_clean.csv",windowsize=2,log_type="Zookeeper")`.  
4. Use `_split_data()` to split x,y into training and testing sets.  ***split_type = 'uniform' is only for Labeled Data***.  
- Example code (labeled): `(x_train,y_train),(x_test,y_test) = _split_data(x,y,train_ratio=0.5,split_type="uniform")`.  
- Example code (unlabeled): `(x_train, _),( x_test, _) = _split_data(x,y=None,train_ratio=0.5,split_type="sequential")`.  


In [83]:
#Parsing and Feature Extraction
import Parsing.clean_parse_extract as cpe

#For Modelling
from Models.loglizer.loglizer.models import PCA as pca2
from Models.loglizer.loglizer import preprocessing

In [20]:
#2
cpe.clean_logs("./Parsing/cleanlogs/HDFS_2k.log_structured.csv","./Parsing/cleanlogs/HDFS_2k.log_templates.csv","HDFS")
cpe.clean_logs("./Parsing/cleanlogs/Zookeeper_2k.log_structured.csv","./Parsing/cleanlogs/Zookeeper_2k.log_templates.csv","Zookeeper")

In [21]:
#2.1 *Run on HDFS to get unique blk_ids
cpe.label_log("./Parsing/cleanlogs/HDFS_structured_clean.csv","./Parsing/rawlogs/anomaly_label.csv")

In [22]:
#3
x_hl,y_hl,df_hl = cpe._session_window(data_file="./Parsing/cleanlogs/HDFS_labelled.csv",labels=True)
x_z,y_z,df_z = cpe._fixed_window(data_file="./Parsing/cleanlogs/Zookeeper_structured_clean.csv",windowsize=2,log_type="Zookeeper")
x_h,y_h,df_h = cpe._sliding_window(data_file="./Parsing/cleanlogs/HDFS_labelled.csv",windowsize=5,windowslide=2,log_type = "HDFS")

Number of unique blk_ids:  1994
Number of windows:  253
Number of window slides:  252
Number of windows:  15
Number of window slides:  14
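
To make the difference between fixed and sliding windows concrete, here is a small sketch over abstract timestamps. It is not the code in `clean_parse_extract`: the units of `windowsize` and `windowslide` used there are not stated in this notebook, so the function names, the integer timestamps, and the events below are assumptions for illustration. A fixed window is treated as a sliding window whose slide equals its size.

```python
def sliding_windows(events, window_size, slide):
    """Group (timestamp, event_id) pairs into windows of length window_size.

    Window start times advance by `slide`, so windows overlap when
    slide < window_size.
    """
    if not events:
        return []
    events = sorted(events)
    start, last = events[0][0], events[-1][0]
    windows = []
    while start <= last:
        end = start + window_size
        windows.append([eid for ts, eid in events if start <= ts < end])
        start += slide
    return windows


def fixed_windows(events, window_size):
    """Consecutive, non-overlapping windows: a slide equal to the window size."""
    return sliding_windows(events, window_size, window_size)


# Made-up (timestamp, event_id) pairs.
events = [(0, "E1"), (1, "E2"), (2, "E2"), (3, "E5"), (4, "E1"), (5, "E3")]

fixed = fixed_windows(events, window_size=2)
sliding = sliding_windows(events, window_size=4, slide=2)
print(fixed)    # [['E1', 'E2'], ['E2', 'E5'], ['E1', 'E3']]
print(sliding)  # [['E1', 'E2', 'E2', 'E5'], ['E2', 'E5', 'E1', 'E3'], ['E1', 'E3']]
```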


In [32]:
df_z.head(3)

| | window_id | EventSequence |
|---|---|---|
| 0 | window1 | [E8, E17, E31, E41, E41, E28, E28, E14, E9, E2... |
| 1 | window2 | [E21, E34, E15, E34, E22, E22, E21, E22, E9, E... |
| 2 | window3 | [E30, E15, E34, E22, E25] |


In [33]:
df_h.head(3)

| | window_id | EventSequence |
|---|---|---|
| 0 | window1 | [E2, E2, E13, E2, E2, E13, E13, E13, E2, E3, E... |
| 1 | window2 | [E3, E13, E3, E10, E10, E2, E10, E13, E13, E2,... |
| 2 | window3 | [E8, E11, E8, E8, E8, E8, E8, E8, E11, E8, E11... |


In [34]:
df_hl.head(3)

| | BlockId | EventSequence | Label |
|---|---|---|---|
| 0 | blk_38865049064139660 | [E2] | 0 |
| 1 | blk_-6952295868487656571 | [E2] | 0 |
| 2 | blk_7128370237687728475 | [E13] | 0 |
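
Session windows group events by the HDFS block they refer to rather than by time, which is why each row above is keyed by a `BlockId`. The sketch below shows one way to do that grouping with a regular expression. It is an assumption about the approach, not the body of `_session_window()`, and the log messages and block IDs are made up.

```python
import re

# HDFS block IDs look like blk_<digits> or blk_-<digits>.
BLOCK_ID = re.compile(r"blk_-?\d+")


def session_windows(log_lines):
    """Group event IDs by every block ID mentioned in each message.

    log_lines is an iterable of (message, event_id) pairs in log order.
    """
    sessions = {}
    for content, event_id in log_lines:
        # dict.fromkeys de-duplicates while keeping first-seen order.
        for block_id in dict.fromkeys(BLOCK_ID.findall(content)):
            sessions.setdefault(block_id, []).append(event_id)
    return sessions


# Made-up messages, for illustration only.
log_lines = [
    ("Receiving block blk_1 src: /10.0.0.1 dest: /10.0.0.2", "E2"),
    ("Receiving block blk_-2 src: /10.0.0.3 dest: /10.0.0.4", "E2"),
    ("PacketResponder 1 for block blk_1 terminating", "E13"),
]

sessions = session_windows(log_lines)
print(sessions)  # {'blk_1': ['E2', 'E13'], 'blk_-2': ['E2']}
```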


In [37]:
#4
(x_train_z,_), (x_test_z,_) = cpe._split_data(x_z,y_z,train_ratio = 0.9, split_type = 'sequential')
(x_train_hl,y_train_hl), (x_test_hl,y_test_hl) = cpe._split_data(x_hl,y_hl,train_ratio = 0.5, split_type = 'uniform')

Size of training set:  227
Size of testing set:  26
Size of total data set:  253
Size of training set:  997
Size of testing set:  997
Size of total data set:  1994
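
The two split types differ in whether labels are used. A sequential split takes the first `train_ratio` share of the windows in order; a uniform split divides the normal and anomalous samples separately so both sets keep a similar anomaly rate, which is why it needs labels. The sketch below illustrates that idea under those assumptions. It is not the body of `cpe._split_data()`, and the helper names and the `seed` parameter are hypothetical.

```python
import random


def split_sequential(x, train_ratio):
    """Keep the original order: the first share goes to training, the rest to testing."""
    n_train = int(train_ratio * len(x))
    return x[:n_train], x[n_train:]


def split_uniform(x, y, train_ratio, seed=0):
    """Split anomalous (1) and normal (0) samples separately, then shuffle each set."""
    pos = [(xi, yi) for xi, yi in zip(x, y) if yi == 1]
    neg = [(xi, yi) for xi, yi in zip(x, y) if yi == 0]
    n_pos = int(train_ratio * len(pos))
    n_neg = int(train_ratio * len(neg))
    train = pos[:n_pos] + neg[:n_neg]
    test = pos[n_pos:] + neg[n_neg:]
    rng = random.Random(seed)
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# 253 windows at train_ratio=0.9, the same sizes as the Zookeeper split above.
seq_train, seq_test = split_sequential(list(range(253)), 0.9)
print(len(seq_train), len(seq_test))  # 227 26

# Ten made-up labeled samples with two anomalies.
x = list(range(10))
y = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
uni_train, uni_test = split_uniform(x, y, 0.5)
print(sum(label for _, label in uni_train), sum(label for _, label in uni_test))  # 1 1
```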


In [38]:
df_hl[df_hl.Label == 1]

| | BlockId | EventSequence | Label |
|---|---|---|---|
| 38 | blk_8181993091797661153 | [E13] | 1 |
| 63 | blk_8408125361497769001 | [E3] | 1 |
| 89 | blk_8787656642683881295 | [E8] | 1 |
| 93 | blk_-62891505109755100 | [E11] | 1 |
| 139 | blk_-7105305952901940477 | [E6] | 1 |
| ... | ... | ... | ... |
| 1763 | blk_-7606467001548719462 | [E1] | 1 |
| 1856 | blk_4516306414837452219 | [E4] | 1 |
| 1904 | blk_9173199815015538212 | [E1] | 1 |
| 1942 | blk_-8571819028995448536 | [E10] | 1 |


In [101]:
from Models.loglizer.loglizer import dataloader, preprocessing

struct_log = './HDFS_100k.log_structured.csv' # The structured log file
label_file = './anomaly_label.csv' # The anomaly label file

(x_train, y_train), (x_test, y_test) = dataloader.load_HDFS(struct_log,
                     label_file=label_file,
                     window='session', 
                     train_ratio=0.5,
                     split_type='uniform')

Loading ./HDFS_100k.log_structured.csv
156 157
Total: 7940 instances, 313 anomaly, 7627 normal
Train: 3969 instances, 156 anomaly, 3813 normal
Test: 3971 instances, 157 anomaly, 3814 normal



In [102]:
feature_extractor = preprocessing.FeatureExtractor()
x_train = feature_extractor.fit_transform(x_train, term_weighting='tf-idf')
x_test = feature_extractor.transform(x_test)

from Models.loglizer.loglizer.models import LR  # logistic regression model

model = LR()
model.fit(x_train, y_train)

print('Train validation:')
precision, recall, f1 = model.evaluate(x_train, y_train)

print('Test validation:')
precision, recall, f1 = model.evaluate(x_test, y_test)

Train data shape: 3969-by-14

Test data shape: 3971-by-14

Train validation:
Precision: 1.000, recall: 0.365, F1-measure: 0.535

Test validation:
Precision: 0.986, recall: 0.433, F1-measure: 0.602



In [91]:
feature_extractor = preprocessing.FeatureExtractor()

# Transform training set data
x_train = feature_extractor.fit_transform(x_train_hl, term_weighting='tf-idf', 
                                          normalization='none')
# Transform test set data
x_test = feature_extractor.transform(x_test_hl)

Train data shape: 997-by-12

Test data shape: 997-by-12
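
The shapes above are samples by distinct events: each event sequence becomes a row of event counts, which is then reweighted. The sketch below shows a plain count matrix followed by a basic TF-IDF weighting, idf = log(N / df). It is a simplified illustration and an assumption about the general technique; loglizer's `FeatureExtractor` may differ in details such as smoothing or the handling of unseen events. The sequences are made up.

```python
import math
from collections import Counter


def event_count_matrix(sequences):
    """Return the sorted event vocabulary and one row of event counts per sequence."""
    vocab = sorted({event for seq in sequences for event in seq})
    rows = []
    for seq in sequences:
        counts = Counter(seq)
        rows.append([counts[event] for event in vocab])
    return vocab, rows


def tf_idf(rows):
    """Weight each count by log(N / df), where df is the number of rows containing the event."""
    n_rows, n_cols = len(rows), len(rows[0])
    doc_freq = [sum(1 for row in rows if row[j] > 0) for j in range(n_cols)]
    idf = [math.log(n_rows / df) if df else 0.0 for df in doc_freq]
    return [[row[j] * idf[j] for j in range(n_cols)] for row in rows]


# Made-up event sequences.
sequences = [["E2", "E2", "E13"], ["E2"], ["E13", "E8"]]
vocab, counts = event_count_matrix(sequences)
weighted = tf_idf(counts)

print(vocab)   # ['E13', 'E2', 'E8']
print(counts)  # [[1, 2, 0], [0, 1, 0], [1, 0, 1]]
print([round(value, 3) for value in weighted[0]])  # [0.405, 0.811, 0.0]
```

An event that appears in every sequence gets an idf of zero under this weighting, so it contributes nothing to the feature vector.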



In [92]:
# Evaluating various models
# First: PCA
model1 = pca2()
model1.fit(x_train)

print('Train validation:')
precision, recall, f1 = model1.evaluate(x_train, y_train_hl)

print('Test validation:')
precision, recall, f1 = model1.evaluate(x_test, y_test_hl)

n_components: 10
Project matrix shape: 12-by-12
SPE threshold: 0.7488353511487638

Train validation:
Precision: 1.000, recall: 0.059, F1-measure: 0.111

Test validation:
Precision: 1.000, recall: 0.059, F1-measure: 0.111
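
The SPE threshold printed above refers to the squared prediction error: the squared length of the part of a sample that the retained principal directions do not explain. A sample is flagged as anomalous when its SPE exceeds the threshold. The sketch below covers only that scoring step, assuming the directions are already computed and orthonormal and the data is already centered. The directions, samples, and threshold are made up, and this is not loglizer's implementation.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def spe(x, components):
    """Squared norm of x minus its projection onto the orthonormal components."""
    residual = list(x)
    for u in components:
        coeff = dot(x, u)
        residual = [r - coeff * ui for r, ui in zip(residual, u)]
    return dot(residual, residual)


# Hypothetical setup: two retained directions in a 3-dimensional feature space.
components = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
threshold = 0.5  # illustrative value, not the threshold reported above

samples = [(3.0, -1.0, 0.1), (0.5, 2.0, 1.5)]
scores = [spe(x, components) for x in samples]
flags = [score > threshold for score in scores]
print(flags)  # [False, True]
```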



In [93]:
y_test_pred = model1.predict(x_test)
y_test_hl[y_test_pred!=y_test_hl]

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [97]:
# LR
from Models.loglizer.loglizer.models import LR

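# NOTE: x_train and x_test are already the feature matrices produced by the
# extractor above, so these calls re-fit the extractor on its own output,
# which is likely why the shape below is 997-by-15 rather than 997-by-12.
# To extract features from the raw event sequences, pass x_train_hl and
# x_test_hl instead.
x_train = feature_extractor.fit_transform(x_train, term_weighting='tf-idf')
x_test = feature_extractor.transform(x_test)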

model = LR()
model.fit(x_train, y_train_hl)

print('Train validation:')
precision, recall, f1 = model.evaluate(x_train, y_train_hl)

print('Test validation:')
precision, recall, f1 = model.evaluate(x_test, y_test_hl)

Train data shape: 997-by-15

Test data shape: 997-by-15

Train validation:
Precision: 1.000, recall: 0.059, F1-measure: 0.111

Test validation:
Precision: 1.000, recall: 0.059, F1-measure: 0.111



### Analysis
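This is an imbalanced classification problem. Precision is 100% on both the training and test sets, which means no normal block is flagged as anomalous (no false positives). Recall, however, is only 5.9%: the models miss most of the anomalous blocks, and every misclassified test sample is a true anomaly (label 1).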
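
As a worked check of these metrics, the counts below are inferred rather than reported by the notebook. The 32 mismatches printed by In [93] all have true label 1, so with a precision of 1.000 they are taken to be false negatives, and TP = 2 is the value that reproduces the reported recall of 0.059. Treat them as an assumption consistent with the output, not as logged results.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Inferred counts for the test set (see the assumption above).
tp, fp, fn = 2, 0, 32
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"Precision: {precision:.3f}, recall: {recall:.3f}, F1-measure: {f1:.3f}")
# Precision: 1.000, recall: 0.059, F1-measure: 0.111
```

With 34 anomalies in the test set and only 2 detected, perfect precision says little about how useful the detector is; recall and F1 are the more informative numbers here.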



In [75]:
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

In [78]:
qda = QuadraticDiscriminantAnalysis(store_covariance=True)
y_pred = qda.fit(x_train, y_train_hl).predict(x_train)
qda.score(x_train,y_train_hl)



0.16950852557673018

In [14]:
feature_extractor = preprocessing.FeatureExtractor()
# fit_transform() fits the extractor on the TRAINING sequences and transforms them
# term_weighting: 'tf-idf' or None
# normalization: 'zero-mean' or None
x_train = feature_extractor.fit_transform(x_train_hl, term_weighting='tf-idf', 
                                          normalization='zero-mean')
# transform() applies the FITTED extractor to the TESTING sequences from step 4
x_test = feature_extractor.transform(x_test_hl)
# PCA model functions: model.fit(), model.predict(), model.evaluate()
model = pca2()  # PCA is imported as pca2 in the first cell
model.fit(x_train)

Train data shape: 997-by-12