# Compare Naive Bayes and Sentiment Analysis to predict errors

Note: this notebook is also available in [Google Colaboratory](https://colab.research.google.com/github/paranal-sw/collaborations-UFRO-2024-2s/blob/main/2-NB-SA-comparison/task_description.ipynb)

Observations at Paranal can end in one of three states: STOP, ABORT by USER, or ERROR (aborted by the system). Naive Bayes models capture some of the events that lead to errors. The notebook at https://github.com/paranal-sw/parlogs-observations/blob/main/notebooks/04-naive-bayes.ipynb shows a simple Naive Bayes classifier that predicts the df_meta["ERROR"] column from a simple tokenization of the logs.

However, some messages are clear indications of errors on their own, for example:

```
issalignERR_EVT_TIMEOUT : TIMEOUT in sub-state <STRTLAG: Wait for IRIS Lab Guiding Event> while waiting for event.
```

Sentiment-analysis models could capture the emotional tone of words such as "err" or "timeout". We believe that labelling the emotional tone of event logs as positive or negative could have predictive value.
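
To make this intuition concrete, here is a minimal sketch that scores a log message by counting tone cues. The cue lists and the `polarity` function are illustrative assumptions, not part of the dataset or of any sentiment-analysis library; a real model learns such associations from data rather than from a hand-written list.

```python
# Hypothetical cue lists, chosen only for illustration.
NEGATIVE_CUES = ("err", "timeout", "fail", "abort")
POSITIVE_CUES = ("done", "completed", "finished", "success")


def polarity(message: str) -> int:
    """Return (positive cue hits) - (negative cue hits) for one log message."""
    text = message.lower()
    # str.count matches substrings, so "err" is also found inside "issalignERR"
    negative = sum(text.count(cue) for cue in NEGATIVE_CUES)
    positive = sum(text.count(cue) for cue in POSITIVE_CUES)
    return positive - negative


example = (
    "issalignERR_EVT_TIMEOUT : TIMEOUT in sub-state "
    "<STRTLAG: Wait for IRIS Lab Guiding Event> while waiting for event."
)
print(polarity(example))                # negative score: cues "err" and "timeout"
print(polarity("SETUP command done."))  # positive score: cue "done"
```

A score below zero would be read as a negative tone and a score above zero as a positive one. Substring matching is deliberately crude here and would also fire on unrelated words that happen to contain a cue.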



## Objectives
1. Dimensionality reduction: change the tokenization to remove all numbers using a regular expression.
2. Train a Naive Bayes model on all tokenized logs, not just those with logtype=ERR, and label the traces with a new "NB" column.
3. Search for and apply some Sentiment Analysis models, and label the traces with "SA_1", "SA_2"... ([hint](https://huggingface.co/blog/sentiment-analysis-python))
4. Look for correlations between NB and the different values in "SA_1", "SA_2"...
5. Report specific examples of (events, NB score, SA score) to understand what is happening behind the scenes.
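
The objectives above can be sketched end to end on toy data. This is only a rough outline under stated assumptions: the events and labels are invented stand-ins for the real traces, `toy_sentiment` is a placeholder for a real sentiment model, and the scikit-learn pipeline (`CountVectorizer` plus `MultinomialNB`) is one possible choice rather than the method used in the referenced notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the real data: one row per event, grouped by trace_id.
df_events = pd.DataFrame({
    "trace_id": [0, 0, 1, 1, 2, 2, 3, 3],
    "tokenized": [
        "LOG MATISSE SETUP command done.",
        "LOG MATISSE Template finished.",
        "ERR MATISSE TIMEOUT in sub-state while waiting for event.",
        "LOG MATISSE Template aborted.",
        "LOG MATISSE Motion execution.",
        "LOG MATISSE Motor offset done.",
        "ERR MATISSE Flux too low.",
        "ERR MATISSE TIMEOUT waiting for guiding event.",
    ],
})
df_meta = pd.DataFrame({"ERROR": [False, True, False, True]})

# One document per trace: all of its tokenized events joined together.
docs = df_events.groupby("trace_id")["tokenized"].agg(" ".join)

# Objective 2: bag-of-words Naive Bayes, scored as P(ERROR) per trace.
nb_model = make_pipeline(CountVectorizer(), MultinomialNB())
nb_model.fit(docs, df_meta["ERROR"])
df_meta["NB"] = nb_model.predict_proba(docs)[:, 1]


# Objective 3: placeholder scorer (an assumption); swap in a real model here.
def toy_sentiment(text: str) -> float:
    return -1.0 if "timeout" in text.lower() else 1.0


df_meta["SA_1"] = docs.map(toy_sentiment)

# Objective 4: Pearson correlation between the two labels.
print(df_meta)
print("corr(NB, SA_1) =", df_meta["NB"].corr(df_meta["SA_1"]))
```

With Hugging Face `transformers` installed, `toy_sentiment` could instead wrap `pipeline("sentiment-analysis")`, whose results carry a `label` and a `score`. Long trace documents may exceed a model's input limit, so scoring per event and aggregating is worth considering. The sketch scores the same traces it was trained on; a real evaluation should use held-out traces.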

# Code Examples

In [2]:
import os 
import re
import pandas as pd
from urllib.request import urlretrieve
REPO_URL='https://huggingface.co/datasets/Paranal/parlogs-observations/resolve/main/data'

if 'COLAB_RELEASE_TAG' in os.environ:
    PATH='data/' # Convenient name to be Colab compatible
else:
    PATH='../data/' # Local directory to your system
os.makedirs(PATH, exist_ok=True) # Make sure the download directory exists

def load_dataset(INSTRUMENT, RANGE):

    fname = f'{INSTRUMENT}-{RANGE}-traces.parket'
    if not os.path.exists(f'{PATH}/{fname}'):
        urlretrieve(f'{REPO_URL}/{fname}', f'{PATH}/{fname}')
    df_inst=pd.read_parquet(f'{PATH}/{fname}')

    fname = f'{INSTRUMENT}-{RANGE}-traces-SUBSYSTEMS.parket'
    if not os.path.exists(f'{PATH}/{fname}'):
        urlretrieve(f'{REPO_URL}/{fname}', f'{PATH}/{fname}')
    df_subs=pd.read_parquet(f'{PATH}/{fname}')

    fname = f'{INSTRUMENT}-{RANGE}-traces-TELESCOPES.parket'
    if not os.path.exists(f'{PATH}/{fname}'):
        urlretrieve(f'{REPO_URL}/{fname}', f'{PATH}/{fname}')
    df_tele=pd.read_parquet(f'{PATH}/{fname}')

    all_traces = [df_inst, df_subs, df_tele]

    df_all = pd.concat(all_traces)
    df_all.sort_values('@timestamp', inplace=True)
    df_all.reset_index(drop=True, inplace=True)

    return df_all

def load_trace(INSTRUMENT, RANGE, trace_id):
    df_all = load_dataset(INSTRUMENT, RANGE)
    df_all = df_all[ df_all['trace_id']==trace_id ]
    return df_all

def load_meta(INSTRUMENT, RANGE):
    fname = f'{INSTRUMENT}-{RANGE}-meta.parket'
    if not os.path.exists(f'{PATH}/{fname}'):
        urlretrieve(f'{REPO_URL}/{fname}', f'{PATH}/{fname}')
    df_meta=pd.read_parquet(f'{PATH}/{fname}')

    return df_meta

## Meta Data

In [4]:
df_meta = load_meta("MATISSE", "1d")
df_meta

Unnamed: 0,START,END,TIMEOUT,system,procname,TPL_ID,ERROR,Aborted,SECONDS,TEL
0,2019-04-10 00:16:03.269,2019-04-10 00:20:35.831,False,MATISSE,bob_ins,MATISSE_img_acq,False,False,272.0,AT
1,2019-04-10 00:20:35.848,2019-04-10 00:42:17.085,False,MATISSE,bob_ins,MATISSE_hyb_obs,False,False,1301.0,AT
2,2019-04-10 00:42:17.106,2019-04-10 01:11:09.300,False,MATISSE,bob_ins,MATISSE_hyb_obs,False,False,1732.0,AT
3,2019-04-10 01:11:09.325,2019-04-10 01:22:27.609,False,MATISSE,bob_ins,MATISSE_hyb_obs,False,False,678.0,AT
4,2019-04-10 01:22:40.579,2019-04-10 01:29:34.127,False,MATISSE,bob_ins,MATISSE_img_acq,False,False,413.0,AT
...,...,...,...,...,...,...,...,...,...,...
65,2019-04-10 23:36:34.218,2019-04-10 23:37:22.734,False,MATISSE,bob_163744,MATISSE_img_acq,False,False,48.0,AT
66,2019-04-10 23:37:45.605,2019-04-10 23:40:12.335,False,MATISSE,bob_163744,MATISSE_img_acq,False,False,146.0,AT
67,2019-04-10 23:40:15.899,2019-04-10 23:43:40.407,False,MATISSE,bob_163744,MATISSE_img_acq,False,False,204.0,AT
68,2019-04-10 23:45:18.254,2019-04-10 23:51:49.993,False,MATISSE,bob_163744,MATISSE_img_acq,False,False,391.0,AT


## Tokenized Trace Example

Below, a new "event" column is proposed as the content of each event, built by concatenating several log fields. Numbers are then removed from it. In other notebooks, you can choose your own tokenization method.

In [None]:
df_trace = load_trace("MATISSE", "1d", 1)
with pd.option_context('display.max_colwidth', None):
    display(df_trace[['@timestamp', 'logtype', 'system', 'keywname', 'keywvalue', 'logtext']][-30:])

In [15]:
df_trace

Unnamed: 0,@timestamp,system,hostname,loghost,logtype,envname,procname,procid,module,keywname,keywvalue,keywmask,logtext,trace_id
2238,2019-04-10 00:20:35.848,MATISSE,wmt,wmt,LOG,wmt,bob_ins,155.0,seq,,,,MATISSE_hyb_obs -- Celestial target observatio...,1
2239,2019-04-10 00:20:35.848,MATISSE,wmt,wmt,LOG,wmt,bob_ins,155.0,seq,,,,Started at 2019-04-10T00:20:35 (underlined),1
2240,2019-04-10 00:20:36.000,AT3,wat3tcs,wat3tcs,FEVT,wat3ics,logManager,0.0,,INS.OPTI1.MOVE,,wat3ics,Motion execution.,1
2241,2019-04-10 00:20:36.111,MATISSE,wmt,wmt,LOG,wmt,bob_ins,155.0,seq,,,,DET2 NCOHERENT VAL = '20.',1
2242,2019-04-10 00:20:36.111,MATISSE,wmt,wmt,LOG,wmt,bob_ins,155.0,seq,,,,DET2 APOY VAL = '512',1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14725,2019-04-10 00:42:17.082,MATISSE,wmt,wmt,LOG,wmt,mtoControl,130.0,boss,,,,SETUP command done.,1
14726,2019-04-10 00:42:17.083,MATISSE,wmt,wmt,LOG,wmt,bob_ins,155.0,seq,,,,Template MATISSE_hyb_obs finished.,1
14727,2019-04-10 00:42:17.083,MATISSE,wmt,wmt,LOG,wmt,bob_ins,155.0,seq,,,,-1 (SpringGreen4),1
14728,2019-04-10 00:42:17.085,MATISSE,wmt,wmt,LOG,wmt,bob_ins,155.0,seq,,,,Finished in 1302 seconds at 2019-04-10T00:42:1...,1


## Tokenized Events

In [4]:
# Create event column
df_trace["event"] = ""
for col in ['logtype', 'system', 'keywname', 'keywvalue', 'logtext']:
    df_trace["event"] += " " + df_trace[col]

# Tokenize
def tokenize(x):
    return re.sub(r'[0-9]', r'', x)

df_trace["tokenized"] = df_trace["event"].apply(tokenize)

# Restrict columns
df_event = df_trace[['@timestamp', 'event', 'tokenized']]

with pd.option_context('display.max_colwidth', None):
    display(df_event[-30:])

Unnamed: 0,@timestamp,event,tokenized
22544,2019-04-02 07:00:10.000,FEVT MATISSE INS.BSN3.CLOSE N BSN shutter closed.,FEVT MATISSE INS.BSN.CLOSE N BSN shutter closed.
22545,2019-04-02 07:00:11.000,FEVT AT3 INS.OPTI1.MOVE Motion execution.,FEVT AT INS.OPTI.MOVE Motion execution.
22546,2019-04-02 07:00:12.000,FEVT AT3 INS.OPTI1.MOVEDONE Motor offset done.,FEVT AT INS.OPTI.MOVEDONE Motor offset done.
22547,2019-04-02 07:00:12.000,FEVT AT3 INS.OPTI2.MOVE Motion execution.,FEVT AT INS.OPTI.MOVE Motion execution.
22548,2019-04-02 07:00:13.000,FEVT AT3 INS.OPTI2.MOVEDONE Motor offset done.,FEVT AT INS.OPTI.MOVEDONE Motor offset done.
22549,2019-04-02 07:00:13.613,"LOG AT4 Sent Command SETASM to lat4alt:alttrkServer with 10.815000,741.180000,14.500000","LOG AT Sent Command SETASM to latalt:alttrkServer with .,.,."
22550,2019-04-02 07:00:13.619,"LOG AT4 Sent Command SETASM to lat4az:aztrkServer with 10.815000,741.180000,14.500000","LOG AT Sent Command SETASM to lataz:aztrkServer with .,.,."
22551,2019-04-02 07:00:13.620,"LOG AT4 Sent Command SETASM to lat4fsm:probetrkServer with 10.815000,741.180000,14.500000","LOG AT Sent Command SETASM to latfsm:probetrkServer with .,.,."
22552,2019-04-02 07:00:13.621,"LOG AT4 Sent Command SETASM to lat4dcs:probetrkServer with 10.815000,741.180000,14.500000","LOG AT Sent Command SETASM to latdcs:probetrkServer with .,.,."
22553,2019-04-02 07:00:13.624,LOG AT4 Command SETASM completed by @lat4alt:alttrkServer,LOG AT Command SETASM completed by @latalt:alttrkServer
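
The tokenized column above still carries residue such as `.,.,.` where decimal numbers used to be, because only the digit characters are removed. As one possible alternative (a sketch, not the tokenization used in this notebook), whole numeric literals, including decimals and comma-separated lists, can be removed in one pass. `NUMBER_LIST` and `strip_numbers` are hypothetical names introduced for illustration.

```python
import re

# A number, optionally with decimals, optionally repeated in a comma-separated list.
NUMBER_LIST = re.compile(r'\d+(?:\.\d+)?(?:,\d+(?:\.\d+)?)*')


def strip_numbers(event: str) -> str:
    """Remove whole numeric literals, then collapse leftover whitespace."""
    without_numbers = NUMBER_LIST.sub('', event)
    return re.sub(r'\s+', ' ', without_numbers).strip()


print(strip_numbers("FEVT AT3 INS.OPTI1.MOVE Motion execution."))
print(strip_numbers("LOG AT4 Sent Command SETASM to lat4alt:alttrkServer "
                    "with 10.815000,741.180000,14.500000"))
```

Whether this reduces the number of distinct events further than plain digit removal depends on the data, so it is worth comparing both on the full trace.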


In [13]:
import nltk
import re
from collections import Counter
import pandas as pd
import math
#from gensim.models.word2vec import Word2Vec  # To train a new model

"""
I will need a few tools to try out different tokenization methods
and models.
"""


def tokenize(log: str, normalize: bool = False, method: str = 'RegExp') -> list:
    if method == 'RegExp':
        # Keep only purely alphabetic words; words containing digits or underscores are dropped
        tokenizer = nltk.tokenize.RegexpTokenizer(r'\b[a-zA-Z]+\b')
        tokens = tokenizer.tokenize(log)
        # Normalization
        if normalize:
            tokens = [token.lower() for token in tokens]
        return tokens
    else:
        raise ValueError('Unrecognized tokenization method')
    

def create_BoW(df: pd.DataFrame, target_column: str = 'event') -> dict:
    if target_column in ['logtext', 'event']:
        # Concatenate every log into one text and count its lowercased tokens
        text_accumulation = ' '.join(df[target_column])
        BoW = tokenize(log=text_accumulation, normalize=True)
        BoW = Counter(BoW)
        return BoW
    else:
        raise ValueError('Unrecognized column to tokenize')


def tf_vectorizer(BoW: dict) -> dict:
    magnitude = sum(BoW.values())
    tf = {term: count / magnitude for term, count in BoW.items() }
    return tf

def compute_idf(df: pd.DataFrame, target_column: str = 'event'):
    if target_column in ['logtext', 'event']:
        N = len(df[target_column])
        # Tokenize each log once; a set gives fast, exact token membership
        tokenized_logs = [set(tokenize(log, normalize=True)) for log in df[target_column]]
        total_terms = set().union(*tokenized_logs)
        idf = {}
        for term in total_terms:
            # Count logs containing the term as a token (not as a raw, case-sensitive substring)
            log_containing_term = sum(1 for tokens in tokenized_logs if term in tokens)
            idf[term] = math.log(N / (1 + log_containing_term))
        return idf
    else:
        raise ValueError('Unrecognized column to compute over')
    

def tf_idf_vectorizer(df: pd.DataFrame, target_column: str = 'event'):
    idf = compute_idf(df=df, target_column=target_column)
    BoW = create_BoW(df=df, target_column=target_column)
    tf = tf_vectorizer(BoW=BoW)
    tf_idf = {term: tf_value * idf[term] for term, tf_value in tf.items() }
    return tf_idf
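
For reference, the quantities computed by the helper functions above can be written as follows, where $c(t)$ is the number of occurrences of term $t$ in the whole corpus (all logs concatenated), $N$ is the number of logs, and $n_t$ is the number of logs that contain $t$. Note that the term frequency here is corpus-wide rather than per document, and that the $1 + n_t$ smoothing makes the IDF slightly negative for a term present in every log.

$$
\mathrm{tf}(t) = \frac{c(t)}{\sum_{t'} c(t')}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{1 + n_t}, \qquad
\text{tf-idf}(t) = \mathrm{tf}(t)\,\mathrm{idf}(t)
$$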

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import RegexpTokenizer

docs = df_event['event'].to_list()
Tokenizer = RegexpTokenizer(r'\b[a-zA-Z]+\b')
# Pass the RegexpTokenizer's bound method so the vectorizer always receives token lists
vectorizer = TfidfVectorizer(lowercase=True,
                             tokenizer=Tokenizer.tokenize)


In [25]:
Model = vectorizer.fit_transform(docs)  # Sparse TF-IDF matrix: one row per event, one column per term



In [67]:
df_trace["event"] = ""
for col in ['logtype', 'system', 'keywname', 'keywvalue', 'logtext']:
    df_trace["event"] += " " + df_trace[col]

df_trace["tokenized"] = df_trace["event"].apply(tokenize)

# Restrict columns
df_event = df_trace[['@timestamp', 'event', 'tokenized']]

BoW = create_BoW(df=df_event, target_column='event')
tf_idf = tf_idf_vectorizer(df=df_event)

In [6]:
# Create event column
df_trace["event"] = ""
for col in ['logtype', 'system', 'keywname', 'keywvalue', 'logtext']:
    df_trace["event"] += " " + df_trace[col]

# Tokenize
def tokenize(x):
    return re.sub(r'[0-9]', r'', x)

df_trace["tokenized"] = df_trace["event"].apply(tokenize)

# Restrict columns
df_event = df_trace[['@timestamp', 'event', 'tokenized']]

with pd.option_context('display.max_colwidth', None):
    display(df_event[-30:])

Unnamed: 0,@timestamp,event,tokenized
22544,2019-04-02 07:00:10.000,FEVT MATISSE INS.BSN3.CLOSE N BSN shutter closed.,"[FEVT, MATISSE, INS, CLOSE, N, BSN, shutter, closed]"
22545,2019-04-02 07:00:11.000,FEVT AT3 INS.OPTI1.MOVE Motion execution.,"[FEVT, INS, MOVE, Motion, execution]"
22546,2019-04-02 07:00:12.000,FEVT AT3 INS.OPTI1.MOVEDONE Motor offset done.,"[FEVT, INS, MOVEDONE, Motor, offset, done]"
22547,2019-04-02 07:00:12.000,FEVT AT3 INS.OPTI2.MOVE Motion execution.,"[FEVT, INS, MOVE, Motion, execution]"
22548,2019-04-02 07:00:13.000,FEVT AT3 INS.OPTI2.MOVEDONE Motor offset done.,"[FEVT, INS, MOVEDONE, Motor, offset, done]"
22549,2019-04-02 07:00:13.613,"LOG AT4 Sent Command SETASM to lat4alt:alttrkServer with 10.815000,741.180000,14.500000","[LOG, Sent, Command, SETASM, to, alttrkServer, with]"
22550,2019-04-02 07:00:13.619,"LOG AT4 Sent Command SETASM to lat4az:aztrkServer with 10.815000,741.180000,14.500000","[LOG, Sent, Command, SETASM, to, aztrkServer, with]"
22551,2019-04-02 07:00:13.620,"LOG AT4 Sent Command SETASM to lat4fsm:probetrkServer with 10.815000,741.180000,14.500000","[LOG, Sent, Command, SETASM, to, probetrkServer, with]"
22552,2019-04-02 07:00:13.621,"LOG AT4 Sent Command SETASM to lat4dcs:probetrkServer with 10.815000,741.180000,14.500000","[LOG, Sent, Command, SETASM, to, probetrkServer, with]"
22553,2019-04-02 07:00:13.624,LOG AT4 Command SETASM completed by @lat4alt:alttrkServer,"[LOG, Command, SETASM, completed, by, alttrkServer]"


In [5]:
# Note that the number of distinct events decreased from 1913 to 763
df_event.describe(include='all')

Unnamed: 0,@timestamp,event,tokenized
count,6818,6818,6818
unique,,1913,763
top,,LOG AT4 WARNING: new header block is added ...,LOG ARAL Quadrant . Flux too low (min: ).
freq,,182,878
mean,2019-04-02 06:47:11.242488064,,
min,2019-04-02 06:36:28.631000,,
25%,2019-04-02 06:39:32.123500032,,
50%,2019-04-02 06:47:35.285499904,,
75%,2019-04-02 06:53:24.405250048,,
max,2019-04-02 07:00:16.566000,,
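
The drop in distinct values reported by `describe` can also be read directly with `nunique`. The sketch below uses a few events from this notebook as toy data; with the full trace, the same two calls would be applied to `df_event`.

```python
import re
import pandas as pd

toy_events = pd.DataFrame({
    "event": [
        "FEVT AT3 INS.OPTI1.MOVE Motion execution.",
        "FEVT AT3 INS.OPTI2.MOVE Motion execution.",
        "FEVT AT3 INS.OPTI1.MOVEDONE Motor offset done.",
        "FEVT AT3 INS.OPTI2.MOVEDONE Motor offset done.",
    ]
})
# Same digit-stripping rule as the notebook's tokenize(x)
toy_events["tokenized"] = toy_events["event"].apply(lambda x: re.sub(r'[0-9]', '', x))

before = toy_events["event"].nunique()
after = toy_events["tokenized"].nunique()
print(f"distinct events: {before} -> {after}")
```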
