***Steps***  
***Part 1: Log Collection/Cleaning***
1. Run `parser.py` in '/Parsing'. It parses the files in 'rawlogs' and writes log_structured.csv and log_templates.csv to 'cleanlogs'.
- Example code: `%run parser.py`
2. Use `clean_logs()` to clean log_structured.csv. It renames events to E1, E2, etc. and creates log_structured_clean.csv.
- Example code: `clean_logs("./Parsing/cleanlogs/Zookeeper_2k.log_structured.csv","./Parsing/cleanlogs/Zookeeper_2k.log_templates.csv","Zookeeper")`.  
2.1. For labeled data, use `label_log()` to label log_structured_clean.csv, producing log_labelled.csv. ***Use for HDFS only***
- Example code: `label_log("./Parsing/cleanlogs/HDFS_structured_clean.csv","./Parsing/rawlogs/anomaly_label.csv")`.  
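The labeling step can be pictured as a lookup from block ID to label. The following is a minimal sketch, not the code in `label_log()`: it assumes the label file maps each `BlockId` to `Normal` or `Anomaly`, and the helper name, the sample log lines, and the block IDs are made up for illustration.

```python
import re

# Matches HDFS block IDs such as blk_111 or blk_-222 (a leading minus is allowed).
BLOCK_ID = re.compile(r"blk_-?\d+")


def label_lines(lines, anomaly_labels):
    """Pair each log line's block ID with a 0/1 label (1 = anomaly).

    `anomaly_labels` maps BlockId -> "Normal" or "Anomaly" (assumed format).
    Lines without a block ID are skipped; unknown blocks default to normal,
    which is an assumption of this sketch.
    """
    labelled = []
    for line in lines:
        match = BLOCK_ID.search(line)
        if match is None:
            continue
        block_id = match.group(0)
        label = 1 if anomaly_labels.get(block_id) == "Anomaly" else 0
        labelled.append((block_id, label))
    return labelled


# Hypothetical log lines and labels, for illustration only.
lines = [
    "Receiving block blk_111 src: /10.0.0.1:50010",
    "PacketResponder 1 for block blk_-222 terminating",
    "Verification succeeded",  # no block ID, so this line is skipped
]
labels = {"blk_111": "Normal", "blk_-222": "Anomaly"}
labelled = label_lines(lines, labels)
```

Only logs that carry a per-line session key such as `blk_` can be labeled this way, which is in line with the HDFS-only restriction above.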
     
***Part 2: Log Parsing and Feature Extraction***  
3. Use `_session_window()`, `_fixed_window()`, or `_sliding_window()` to parse and extract event sequences from your log_structured_clean.csv or log_labelled.csv file. ***`_session_window()` is for HDFS only***
- Example code (labeled): `x,y,df = _session_window(data_file="./Parsing/cleanlogs/HDFS_labelled.csv",labels=True)`.  
- Example code: `x,y,df = _fixed_window(data_file="./Parsing/cleanlogs/Zookeeper_structured_clean.csv",windowsize=2,log_type="Zookeeper")`.  
4. Use `_split_data()` to split x,y into training and testing sets.  ***split_type = 'uniform' is only for Labeled Data***.  
- Example code (labeled): `(x_train,y_train),(x_test,y_test) = _split_data(x,y,train_ratio=0.5,split_type="uniform")`.  
- Example code (unlabeled): `(x_train, _),( x_test, _) = _split_data(x,y=None,train_ratio=0.5,split_type="sequential")`.  
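To make the difference between fixed and sliding windows concrete, here is a small self-contained sketch. It is not the implementation behind `_fixed_window()` or `_sliding_window()`: it assumes windows are defined over timestamps, the function name is hypothetical, and the units of `window_size` and `window_slide` are left abstract because the source does not state them.

```python
def sliding_windows(events, window_size, window_slide):
    """Group (timestamp, event_id) pairs into time windows.

    A window covers [start, start + window_size); consecutive windows start
    `window_slide` apart. Setting window_slide == window_size gives fixed,
    non-overlapping windows. Empty windows are dropped.
    """
    if not events:
        return []
    events = sorted(events)
    first, last = events[0][0], events[-1][0]
    windows = []
    start = first
    while start <= last:
        end = start + window_size
        sequence = [event for ts, event in events if start <= ts < end]
        if sequence:
            windows.append(sequence)
        start += window_slide
    return windows


# Toy data: timestamps in arbitrary units.
events = [(0, "E1"), (1, "E2"), (2, "E1"), (3, "E3"), (5, "E2")]
fixed = sliding_windows(events, window_size=2, window_slide=2)
sliding = sliding_windows(events, window_size=3, window_slide=1)
```

With a slide smaller than the window size, neighbouring windows overlap and the same event can appear in several sequences, so a sliding window produces more samples from the same log than a fixed one.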


In [1]:
#Parsing and Feature Extraction
import Parsing.clean_parse_extract as cpe

In [2]:
#2
cpe.clean_logs("./Parsing/cleanlogs/HDFS_2k.log_structured.csv","./Parsing/cleanlogs/HDFS_2k.log_templates.csv","HDFS")
cpe.clean_logs("./Parsing/cleanlogs/Zookeeper_2k.log_structured.csv","./Parsing/cleanlogs/Zookeeper_2k.log_templates.csv","Zookeeper")
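The renaming performed by the cleaning step can be sketched as below. This is an illustration only: it assumes the parser assigns opaque template IDs and that new names are handed out in first-seen order, neither of which is confirmed by the source, and `build_event_map` and the sample IDs are hypothetical.

```python
def build_event_map(template_ids):
    """Map parser-generated template IDs to E1, E2, ... in first-seen order."""
    mapping = {}
    for template_id in template_ids:
        if template_id not in mapping:
            mapping[template_id] = f"E{len(mapping) + 1}"
    return mapping


# Hypothetical IDs standing in for the event ID column of a templates file.
template_ids = ["a1f3", "9bc2", "a1f3", "77de"]
event_map = build_event_map(template_ids)
renamed = [event_map[t] for t in template_ids]
```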

In [3]:
#2.1 Run on HDFS to get unique blk_ids
cpe.label_log("./Parsing/cleanlogs/HDFS_structured_clean.csv","./Parsing/rawlogs/anomaly_label.csv")

In [4]:
#3
x_hl,y_hl,df_hl = cpe._session_window(data_file="./Parsing/cleanlogs/HDFS_labelled.csv",labels=True)
x_z,y_z,df_z = cpe._fixed_window(data_file="./Parsing/cleanlogs/Zookeeper_structured_clean.csv",windowsize=2,log_type="Zookeeper")
x_h,y_h,df_h = cpe._sliding_window(data_file="./Parsing/cleanlogs/HDFS_labelled.csv",windowsize=5,windowslide=2,log_type = "HDFS")

Number of unique blk_ids:  1994
Number of windows:  253
Number of window slides:  252
Number of windows:  15
Number of window slides:  14


In [5]:
df_z.head(3)

| Unnamed: 0 | window_id | EventSequence |
|---|---|---|
| 0 | window1 | [E8, E17, E31, E41, E41, E28, E28, E14, E9, E2... |
| 1 | window2 | [E21, E34, E15, E34, E22, E22, E21, E22, E9, E... |
| 2 | window3 | [E30, E15, E34, E22, E25] |


In [6]:
df_h.head(3)

| Unnamed: 0 | window_id | EventSequence |
|---|---|---|
| 0 | window1 | [E2, E2, E13, E2, E2, E13, E13, E13, E2, E3, E... |
| 1 | window2 | [E3, E13, E3, E10, E10, E2, E10, E13, E13, E2,... |
| 2 | window3 | [E8, E11, E8, E8, E8, E8, E8, E8, E11, E8, E11... |


In [7]:
df_hl.head(3)

| Unnamed: 0 | BlockId | EventSequence | Label |
|---|---|---|---|
| 0 | blk_38865049064139660 | [E2] | 0 |
| 1 | blk_-6952295868487656571 | [E2] | 0 |
| 2 | blk_7128370237687728475 | [E13] | 0 |
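The output above has one row per `BlockId`, which is what a session window produces: all events that share a block ID are collected into one sequence. A minimal sketch of that grouping follows. It is not the code in `_session_window()`; the function name, the record layout, and the rule that a session is anomalous if any of its lines is labeled anomalous are assumptions.

```python
def session_windows(records):
    """Group (block_id, event_id, label) records into one sequence per block."""
    sessions = {}
    for block_id, event_id, label in records:
        session = sessions.setdefault(block_id, {"events": [], "label": 0})
        session["events"].append(event_id)
        # Assumption: a session is anomalous if any of its lines is anomalous.
        session["label"] = max(session["label"], label)
    block_ids = list(sessions)  # dicts keep insertion order
    x = [sessions[b]["events"] for b in block_ids]
    y = [sessions[b]["label"] for b in block_ids]
    return block_ids, x, y


# Toy records with made-up block IDs.
records = [
    ("blk_1", "E2", 0),
    ("blk_2", "E13", 1),
    ("blk_1", "E5", 0),
    ("blk_2", "E7", 1),
]
block_ids, x, y = session_windows(records)
```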


In [8]:
#4
(x_train_z,_), (x_test_z,_) = cpe._split_data(x_z,y_z,train_ratio = 0.9, split_type = 'sequential')
(x_train_hl,y_train_hl), (x_test_hl,y_test_hl) = cpe._split_data(x_hl,y_hl,train_ratio = 0.5, split_type = 'uniform')

Size of training set:  227
Size of testing set:  26
Size of total data set:  253
Size of training set:  997
Size of testing set:  997
Size of total data set:  1994
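The two split types can be sketched as follows. This is an approximation of the idea rather than the body of `_split_data()`: a sequential split takes the first `train_ratio` share of the samples in order, while a uniform split divides normal and anomalous samples separately so both sets keep a similar anomaly ratio, which is why it needs labels. The function names are hypothetical, and the real function may also shuffle.

```python
def split_sequential(x, train_ratio):
    """Take the first train_ratio share of the samples for training."""
    n_train = int(train_ratio * len(x))
    return x[:n_train], x[n_train:]


def _pick(indices, seq):
    return [seq[i] for i in indices]


def split_uniform(x, y, train_ratio):
    """Split anomalous and normal samples separately, then recombine."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    n_pos = int(train_ratio * len(pos))
    n_neg = int(train_ratio * len(neg))
    train_idx = pos[:n_pos] + neg[:n_neg]
    test_idx = pos[n_pos:] + neg[n_neg:]
    return (_pick(train_idx, x), _pick(train_idx, y)), (
        _pick(test_idx, x),
        _pick(test_idx, y),
    )


# Sequential split of 253 windows at train_ratio=0.9.
train, test = split_sequential(list(range(253)), 0.9)

# Uniform split of a toy labeled set with two anomalies.
x = list(range(10))
y = [0, 0, 0, 1, 0, 0, 0, 1, 0, 0]
(x_train, y_train), (x_test, y_test) = split_uniform(x, y, 0.5)
```

Truncating 0.9 × 253 gives 227 training and 26 test samples, which is consistent with the sizes printed above.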


----
***Part 3: Feature Engineering***  

3. Use the `fit_transform()` method of the `FeatureExtractor` class to fit the extractor on the training event sequences and transform them into a feature matrix.

- Example code: `x_train = feature_extractor.fit_transform(x_train_hl, term_weighting='tf-idf', normalization='none')`.   

4. Next, use the `transform()` method of the `FeatureExtractor` class to transform the test event sequences into a feature matrix using the already fitted extractor.

- Example code: `x_test = feature_extractor.transform(x_test_hl)`
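The fit/transform split matters because the set of event columns is fixed by the training data. The sketch below shows the idea with a plain event count matrix. It is a simplified stand-in, not `FeatureExtractor` itself: the class name is hypothetical, term weighting is left out, and dropping events unseen during fitting is an assumption.

```python
from collections import Counter


class EventCountSketch:
    """Turn event sequences into rows of per-event counts."""

    def fit_transform(self, sequences):
        # The column order is learned once, from the training sequences only.
        self.events = sorted({e for seq in sequences for e in seq})
        return self.transform(sequences)

    def transform(self, sequences):
        rows = []
        for seq in sequences:
            counts = Counter(seq)
            # Events not seen during fitting have no column and are ignored.
            rows.append([counts[e] for e in self.events])
        return rows


extractor = EventCountSketch()
train_rows = extractor.fit_transform([["E2", "E2", "E13"], ["E5"]])
test_rows = extractor.transform([["E2", "E9"]])
```

Because the columns come from the training set, the training and test matrices have the same width, as in the 997-by-12 shapes reported by the notebook.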

In [23]:
#For Modelling
from Models.loglizer.loglizer import preprocessing

feature_extractor = preprocessing.FeatureExtractor()

# Transform training set data
x_train = feature_extractor.fit_transform(x_train_hl, term_weighting='tf-idf', 
                                          normalization='none')
# Transform test set data
x_test = feature_extractor.transform(x_test_hl)

Train data shape: 997-by-12

Test data shape: 997-by-12



In [24]:
x_train_hl.shape

(997,)

In [25]:
x_train.shape

(997, 12)

In [26]:
x_train_hl[0]

['E13']

In [27]:
x_train[0]

array([1.95599088, 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        ])
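The single non-zero entry in the row above can be checked by hand. The sketch assumes the tf-idf weight is the event count multiplied by ln(N / df), where N is the number of training sessions and df is the number of sessions containing the event. Both the formula and the value df = 141 are assumptions: 141 is inferred from the printed weight, not read from the data.

```python
import math


def tf_idf_weight(count, n_sessions, doc_freq):
    """Event count scaled by the inverse document frequency ln(N / df)."""
    return count * math.log(n_sessions / doc_freq)


# One occurrence of E13 in the session, 997 training sessions,
# and E13 assumed to occur in 141 of them (inferred, see above).
weight = tf_idf_weight(count=1, n_sessions=997, doc_freq=141)
```

Under these assumptions the weight agrees with the printed 1.95599088 to within rounding. The remaining eleven columns are zero because the session consists of the single event E13.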

----
***Part 4: Modeling***  

5. Use the `fit()` method of the `SVM` and `LR` class objects to train each model on the training data.

- Example code: `mdl_svm.fit(x_train, y_train_hl)`.   


### Supervised Model

In [38]:
# Support Vector Machines

from Models.loglizer.loglizer.models import SVM

mdl_svm = SVM()
mdl_svm.fit(x_train, y_train_hl)

print('Train validation:')
precision, recall, f1 = mdl_svm.evaluate(x_train, y_train_hl)

print('Test validation:')
precision, recall, f1 = mdl_svm.evaluate(x_test, y_test_hl)

Train validation:
Precision: 1.000, recall: 0.059, F1-measure: 0.111

Test validation:
Precision: 1.000, recall: 0.059, F1-measure: 0.111



In [39]:
y_test_pred = mdl_svm.predict(x_test)
y_test_hl[y_test_pred!=y_test_hl]

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [40]:
# Logistic Regression

from Models.loglizer.loglizer.models import LR

mdl_lr = LR()
mdl_lr.fit(x_train, y_train_hl)

print('Train validation:')
precision, recall, f1 = mdl_lr.evaluate(x_train, y_train_hl)

print('Test validation:')
precision, recall, f1 = mdl_lr.evaluate(x_test, y_test_hl)

Train validation:
Precision: 1.000, recall: 0.059, F1-measure: 0.111

Test validation:
Precision: 1.000, recall: 0.059, F1-measure: 0.111



In [41]:
y_test_pred = mdl_lr.predict(x_test)
y_test_hl[y_test_pred!=y_test_hl]

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

### Result
- This is an imbalanced classification problem, so accuracy is not a good metric for this data.
- Precision is 100% on both the training and test sets. Precision is the proportion of samples predicted TRUE that are actually TRUE, so every sample predicted as an anomaly was correct.
- Recall is 5.9%, which is quite low. Recall is the proportion of actual TRUE samples that were predicted TRUE.
- The F1 score is very low at 11%, which indicates that the model is not working well.
- All of the misclassified samples are anomalous events that the model predicted as normal.
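The three metrics can be reproduced from confusion counts. In the sketch below, the 32 false negatives are the misclassified anomalies listed in the test output; the 2 true positives and 0 false positives are not printed by the notebook and are inferred from the reported precision and recall, so treat them as assumptions.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# fn=32 comes from the misclassified test samples; tp and fp are inferred.
precision, recall, f1 = precision_recall_f1(tp=2, fp=0, fn=32)
print(f"Precision: {precision:.3f}, recall: {recall:.3f}, F1-measure: {f1:.3f}")
```

With these counts the model flags only 2 of 34 anomalies, which is why precision can be perfect while recall and F1 stay low.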

### Analysis
- This suggests that the data provided is not enough to train the model properly.
- The high class imbalance makes things worse: only about 3% (0.03) of the samples are anomalies, which affects the results.
- We can likely improve the scores by gathering more data, especially data containing anomalous events.

### Unsupervised Model

In [48]:
from Models.loglizer.loglizer.models import PCA

mdl_pca = PCA() 
mdl_pca.fit(x_train)
y_train = mdl_pca.predict(x_train) 

print('Test phase:')
y_test = mdl_pca.predict(x_test)

n_components: 10
Project matrix shape: 12-by-12
SPE threshold: 0.7488353511487638

Test phase:
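The SPE threshold printed above refers to the squared prediction error: the part of a sample that the retained principal components cannot explain. The sketch below illustrates that idea under stated assumptions and is not the `PCA` class used in the notebook: the components are taken as given and orthonormal, the input is assumed to be already centered, and the basis, threshold, and sample vectors are toy values.

```python
def squared_prediction_error(x, components):
    """Squared length of the residual after projecting x onto the components.

    `components` must be orthonormal vectors of the same length as x.
    """
    residual = list(x)
    for c in components:
        score = sum(xi * ci for xi, ci in zip(x, c))
        residual = [ri - score * ci for ri, ci in zip(residual, c)]
    return sum(r * r for r in residual)


def predict(x, components, threshold):
    """Return 1 (anomaly) when the SPE exceeds the threshold, else 0."""
    return int(squared_prediction_error(x, components) > threshold)


# Toy setup: two retained directions in a 3-dimensional feature space.
components = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
threshold = 0.5  # hypothetical; in practice it is derived from the training data
normal = [2.0, 1.0, 0.5]     # residual is 0.5 along the third axis
anomalous = [2.0, 1.0, 1.0]  # residual is 1.0 along the third axis
predictions = [predict(v, components, threshold) for v in (normal, anomalous)]
```

Because no labels are needed to fit this model, `predict()` can be applied to unlabeled logs; where labels exist, the predictions can be compared against them in the same way as for the supervised models.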


In [47]:
x_train.shape

(997, 12)

In [46]:
y_test.shape

(997,)