# Rules to select the most representative error message from WMArchive entries

<b>Problem</b>: A WMArchive entry can contain multiple steps, each with multiple error messages. Because of the memory constraints of the GPU, only the <b>most representative message</b> of each entry can be used for the machine learning.

In [1]:
import pandas as pd
import numpy as np

### Load the filtered WMArchive messages

In [2]:
logs = pd.read_hdf('filtered_wmarchive.h5', 'frame')

This pandas frame contains the error messages written out from WMArchive. Each row corresponds to one WMArchive entry. The key ('task_name', 'error', 'site') is unique and can be matched to the counts and labels from the console. The field 'error' is the error code reported by the console. The fields 'exit_codes', 'error_msg', 'error_type', 'steps_counter' and 'names' are lists of equal length that describe the message sequence of the WMArchive entry. The console 'error' code appears at least once in the 'exit_codes' of that sequence. For each sequence, only one message should represent the error.

In [3]:
logs.head()

|  | task_name | error | site | exit_codes | error_msg | error_type | steps_counter | names |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | /amaltaro_Run2018A-v1-DoubleMuon-17Sep2018_102... | 85 | T1_UK_RAL | [99996, 85, 8021, 99999, 85] | [Failed to find a step report for stageOut1!, ... | [ReportManipulatingError, CMSSWStepFailure, Fa... | [0, 2, 2, 2, 2] | [stageOut1, cmsRun1, cmsRun1, cmsRun1, cmsRun1] |
| 1 | /amaltaro_Run2018A-v1-DoubleMuon-17Sep2018_102... | 50664 | T2_DE_RWTH | [50664, 99996, 143, 50115, 99999, 143] | [Error in CMSSW step cmsRun1 Number of Cores: ... | [PerformanceKill, ReportManipulatingError, CMS... | [0, 1, 3, 3, 3, 3] | [PerformanceError, stageOut1, cmsRun1, cmsRun1... |
| 2 | /amaltaro_Run2018A-v1-DoubleMuon-17Sep2018_102... | 50664 | T2_DE_RWTH | [50664, 99996, 143, 50115, 99999, 143] | [Error in CMSSW step cmsRun1 Number of Cores: ... | [PerformanceKill, ReportManipulatingError, CMS... | [0, 1, 3, 3, 3, 3] | [PerformanceError, stageOut1, cmsRun1, cmsRun1... |
| 3 | /amaltaro_Run2018A-v1-DoubleMuon-17Sep2018_102... | 50664 | T2_DE_RWTH | [50664, 99996, 143, 50115, 99999, 143] | [Error in CMSSW step cmsRun1 Number of Cores: ... | [PerformanceKill, ReportManipulatingError, CMS... | [0, 1, 3, 3, 3, 3] | [PerformanceError, stageOut1, cmsRun1, cmsRun1... |
| 4 | /amaltaro_Run2018A-v1-DoubleMuon-17Sep2018_102... | 99400 | NoReportedSite | [99400] | [Could not find jobReport ========== condor.2... | [RemovedByGLIDEIN] | [0] | [RemovedByGLIDEIN] |
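
The statement that the console code always appears among the exit codes of the sequence can be checked row by row. The following is a minimal sketch on a toy frame built from three of the rows shown in the `logs.head()` output. The names `toy` and `console_code_in_sequence` are illustrative only, and the sketch assumes that 'error' and the entries of 'exit_codes' are stored with the same integer type, which may not hold for the real file.

```python
import pandas as pd

# Toy frame with three of the displayed rows (only the two relevant columns)
toy = pd.DataFrame({
    'error': [85, 50664, 99400],
    'exit_codes': [
        [99996, 85, 8021, 99999, 85],
        [50664, 99996, 143, 50115, 99999, 143],
        [99400],
    ],
})


def console_code_in_sequence(row):
    """Return True if the console error code occurs among the entry's exit codes."""
    return row['error'] in list(row['exit_codes'])


# One boolean per entry; any False would point to an inconsistent entry
has_code = toy.apply(console_code_in_sequence, axis=1)
print(has_code.all())
```

On the real frame the same check would be `logs.apply(console_code_in_sequence, axis=1)`, provided the column types match the assumption above.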


### Sequences per WMArchive entry

To get an overview of the types of sequences that appear, their occurrences are counted and some samples are inspected.

In [4]:
# Count the occurrences of the sequences
logs['error_type_string'] = [','.join(map(str, l)) for l in logs['error_type']]
sequences = logs['error_type_string'].value_counts().rename_axis('sequences').reset_index(name='counts')

In [5]:
pd.options.display.max_colwidth = 1000
sequences.head(50)

|  | sequences | counts |
| --- | --- | --- |
| 0 | ReportManipulatingError,CMSSWStepFailure,Fatal Exception,ErrorLoggingAddition,WMAgentStepExecutionError | 32804 |
| 1 | NoJobReport | 12543 |
| 2 | PerformanceKill | 11802 |
| 3 | ReportManipulatingError,CMSSWStepFailure,Fatal Exception,ErrorLoggingAddition | 10443 |
| 4 | ReportManipulatingError,Fatal Exception,Unknown,WMAgentStepExecutionError | 9554 |
| 5 | ReportManipulatingError,CMSSWStepFailure,BadFWJRXML,ErrorLoggingAddition,WMAgentStepExecutionError | 5839 |
| 6 | PerformanceKill,ReportManipulatingError,CMSSWStepFailure,BadFWJRXML,ErrorLoggingAddition,WMAgentStepExecutionError | 5306 |
| 7 | ReportManipulatingError,Fatal Exception,CMSException,WMAgentStepExecutionError | 3394 |
| 8 | PerformanceKill,ReportManipulatingError,BadFWJRXML,CMSSWStepFailure,ErrorLoggingAddition,WMAgentStepExecutionError | 2880 |
| 9 | PerformanceKill,ReportManipulatingError,CMSSWStepFailure,BadFWJRXML,ErrorLoggingAddition | 2338 |


In [6]:
# Print one example sequence: the console summary, then every message of the entry
def test_msg(logs, seq, i_key):
    # Pick the i_key-th entry whose comma-joined error types equal `seq`
    logs_seq = logs[logs['error_type_string'] == seq]
    test = logs_seq.iloc[i_key]

    exit_codes = test['exit_codes']
    steps = test['steps_counter']
    names = test['names']
    error_type = test['error_type']
    error_msg = test['error_msg']

    # Single-argument print() calls behave the same under Python 2 and Python 3
    print('')
    print('Error code from console: {} Site from console: {}'.format(test['error'], test['site']))
    print('')
    for i in range(len(exit_codes)):
        print('')
        print('Exit code: {} Step: {} Name: {} Type: {}'.format(
            exit_codes[i], steps[i], names[i], error_type[i]))
        print('')
        print(error_msg[i])
        print('')

### 1. Rule: Choose the first 'Fatal Exception'

Many message sequences contain a 'Fatal Exception' in the cmsRun1 step. In most cases the error code given by the console corresponds to the 'CMSSWStepFailure' message, which contains the last 25 lines of the CMSSW stdout. After checking several samples, the important information seems to be in the first 'Fatal Exception'. Frequently appearing snippets such as 'ReportManipulatingError', 'ErrorLoggingAddition' and 'WMAgentStepExecutionError' look very similar in most cases and carry little information about the error. The last 25 lines of the CMSSW stdout quite often contain the Fatal Exception as well, but I think it is cleaner to use only the first 'Fatal Exception' snippet to represent the error. Note that 'CMSSWStepFailure' normally reports the Unix exit code, while the Fatal Exception seems to report a CMSSW code. The exit codes are documented here: https://twiki.cern.ch/twiki/bin/view/CMSPublic/JobExitCodes
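
Rule 1 can be written as a small helper that looks up the first 'Fatal Exception' in the type list and returns the message at the same position. This is only a sketch: the function name `first_fatal_exception` is made up for illustration, the fatal message in the example is shortened by hand, and the code assumes that 'error_type' and 'error_msg' are sequences of equal length.

```python
FATAL = 'Fatal Exception'


def first_fatal_exception(error_type, error_msg):
    """Return the message of the first 'Fatal Exception' snippet, or None if there is none."""
    types = list(error_type)  # also accepts array-like input
    if FATAL not in types:
        return None
    # list.index returns the position of the first occurrence
    return error_msg[types.index(FATAL)]


# Example entry; the fatal message is a hand-shortened placeholder
error_type = ['ReportManipulatingError', 'CMSSWStepFailure', 'Fatal Exception',
              'ErrorLoggingAddition', 'WMAgentStepExecutionError']
error_msg = ['Failed to find a step report for stageOut1!', '',
             "An exception of category 'FileReadError' occurred while (shortened)",
             '', '']

selected = first_fatal_exception(error_type, error_msg)
print(selected)
```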

In [7]:
# Some examples
seq1 = 'ReportManipulatingError,CMSSWStepFailure,Fatal Exception,ErrorLoggingAddition,WMAgentStepExecutionError'
seq2 = 'ReportManipulatingError,CMSSWStepFailure,Fatal Exception,ErrorLoggingAddition'
seq3 = 'ReportManipulatingError,Fatal Exception,TypeError,WMAgentStepExecutionError'
seq4 = 'ReportManipulatingError,Fatal Exception,UnknownException'
seq5 = 'ReportManipulatingError,Fatal Exception,CMSException'
seq6 = 'ReportManipulatingError,Fatal Exception,UnknownException,WMAgentStepExecutionError'
seq7 = 'ReportManipulatingError,CMSSWStepFailure,Fatal Exception,Fatal Exception,ErrorLoggingAddition,WMAgentStepExecutionError'
seq8 = 'ReportManipulatingError,LogArchiveFailure,CMSSWStepFailure,Fatal Exception,ErrorLoggingAddition,WMAgentStepExecutionError'
seq9 = 'ReportManipulatingError,Fatal Exception,StdException,WMAgentStepExecutionError'
seq10 = 'PerformanceKill,ReportManipulatingError,CMSSWStepFailure,BadFWJRXML,ErrorLoggingAddition,WMAgentStepExecutionError,CMSSWStepFailure,Fatal Exception,WMAgentStepExecutionError'

test_msg(logs, seq1, 50)


Error code from console: 85 Site from console: T2_US_Nebraska


Exit code: 99996 Step: 0 Name: stageOut1 Type: ReportManipulatingError

Failed to find a step report for stageOut1!


Exit code: 85 Step: 2 Name: cmsRun1 Type: CMSSWStepFailure



Exit code: 8021 Step: 2 Name: cmsRun1 Type: Fatal Exception

An exception of category 'FileReadError' occurred while    [0] Constructing the EventProcessor    [1] Constructing input source of type PoolSource    [2] Reading branch EventAuxiliary    [3] Calling XrdFile::readv()    [4] XrdAdaptor::ClientRequest::HandleResponse() failure while running connection recovery    [5] Handling XrdAdaptor::RequestManager::requestFailure()    [6] In XrdAdaptor::RequestManager::OpenHandler::HandleResponseWithHosts() Exception Message: XrdCl::File::Open(name='root://cmsxrootd.fnal.gov//store/data/Run2018A/SingleMuon/RAW/v1/000/316/700/00000/D8F9E18F-9064-E811-A478-FA163E0BC49B.root', flags=0x10, permissions=0660) => error '[ERROR] Operation expired' (errno=0, 

### 2. Rule: PerformanceKill

If the error sequence contains a 'PerformanceKill', the other snippets do not seem to contain much additional information. Therefore only the short 'PerformanceKill' message is chosen.

In [8]:
# Some examples
seq1 = 'PerformanceKill,ReportManipulatingError,CMSSWStepFailure,BadFWJRXML,ErrorLoggingAddition,WMAgentStepExecutionError'
seq2 = 'PerformanceKill,ReportManipulatingError,CMSSWStepFailure,BadFWJRXML,ErrorLoggingAddition'
seq3 = 'PerformanceKill,ReportManipulatingError,BadFWJRXML,CMSSWStepFailure,ErrorLoggingAddition,WMAgentStepExecutionError'
seq4 = 'PerformanceKill,ReportManipulatingError,CMSSWStepFailure,BadFWJRXML,ErrorLoggingAddition'
seq5 = 'PerformanceKill,ReportManipulatingError,CMSSWStepFailure,Fatal Exception,ErrorLoggingAddition,WMAgentStepExecutionError'
test_msg(logs, seq4, 20)


Error code from console: 50664 Site from console: T1_US_FNAL_Disk


Exit code: 50664 Step: 0 Name: PerformanceError Type: PerformanceKill

Error in CMSSW step cmsRun1 Job has been running for more than: 159600 Job has been running for: 159639.017875 


Exit code: 99999 Step: 1 Name: stageOut1 Type: ReportManipulatingError

Could not find report file for step stageOut1!


Exit code: 143 Step: 3 Name: cmsRun1 Type: CMSSWStepFailure



Exit code: 50115 Step: 3 Name: cmsRun1 Type: BadFWJRXML

Error reading XML job report file, possibly corrupt XML File: Details: no element found: line 1, column 0


Exit code: 99999 Step: 3 Name: cmsRun1 Type: ErrorLoggingAddition




### 3. Rule: both PerformanceKill and Fatal Exception?

In some cases a sequence contains both a 'PerformanceKill' and a 'Fatal Exception'. This is ambiguous, and I propose to take the 'PerformanceKill' message.
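
The precedence proposed here can be expressed as an ordered list of types that is scanned from the most to the least preferred. The sketch below uses hypothetical names (`PRIORITY`, `select_by_type`) and a hand-shortened fatal message; it assumes the type and message lists have equal length.

```python
# Preferred types, most important first: PerformanceKill wins over Fatal Exception
PRIORITY = ['PerformanceKill', 'Fatal Exception']


def select_by_type(error_type, error_msg, priority=PRIORITY):
    """Return the first message of the highest-priority type present, or None."""
    types = list(error_type)
    for wanted in priority:
        if wanted in types:
            return error_msg[types.index(wanted)]
    return None


# Sequence containing both types; the fatal message is a shortened placeholder
error_type = ['PerformanceKill', 'ReportManipulatingError', 'CMSSWStepFailure',
              'Fatal Exception', 'ErrorLoggingAddition', 'WMAgentStepExecutionError']
error_msg = ['Error in CMSSW step cmsRun1 Number of Cores: 8 '
             'Job has exceeded maxRSS: 6500 Job has RSS: 6666',
             'Failed to find a step report for stageOut1!', '',
             "An exception of category 'FileReadError' occurred while (shortened)",
             '', '']

selected = select_by_type(error_type, error_msg)
print(selected)
```

Keeping the order in a list makes it easy to try the opposite precedence by swapping the two entries.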

In [14]:
# Some examples
seq1 = 'PerformanceKill,ReportManipulatingError,CMSSWStepFailure,Fatal Exception,ErrorLoggingAddition,WMAgentStepExecutionError'
test_msg(logs, seq1, 1)


Error code from console: 50660 Site from console: T1_UK_RAL


Exit code: 50660 Step: 0 Name: PerformanceError Type: PerformanceKill

Error in CMSSW step cmsRun1 Number of Cores: 8 Job has exceeded maxRSS: 6500 Job has RSS: 6666 


Exit code: 99996 Step: 1 Name: stageOut1 Type: ReportManipulatingError

Failed to find a step report for stageOut1!


Exit code: 85 Step: 3 Name: cmsRun1 Type: CMSSWStepFailure



Exit code: 8021 Step: 3 Name: cmsRun1 Type: Fatal Exception

An exception of category 'FileReadError' occurred while    [0] Rethrowing an exception that happened on a different thread.    [1] Reading branch recoCaloClusters_hgcalLayerClusters_sharing_RECO.    Additional Info:       [a] Fatal Root Error: @SUB=TBranchElement::GetBasket File: root://xrootd.echo.stfc.ac.uk//store/mc/PhaseIITDRFall17DR/TprimeBToTH_M-1500_Width-30p_LH_TuneCUETP8M2T4_14TeV-madgraph-pythia8/GEN-SIM-RECO/PU200_93X_upgrade2023_realistic_v2-v1/150000/EC4A0DC3-FCB7-E711-A38A-FA163ED5C256.root at byte:407936831

### 4. Rule: if there is no Fatal Exception and no PerformanceKill, choose the first message with exit code equal to the console exit code

If there is neither a 'PerformanceKill' nor a 'Fatal Exception', simply choose the first message whose exit code equals the error code reported by the console.
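
A possible implementation of this fallback scans the exit codes in order and stops at the first match. The helper name `first_matching_index` is an assumption for illustration; it returns the position rather than the message so that the matching type and step can be looked up as well.

```python
def first_matching_index(console_error, exit_codes):
    """Return the position of the first exit code equal to the console code, or None."""
    for i, code in enumerate(exit_codes):
        if code == console_error:
            return i
    return None


# Exit codes and types of a sequence without PerformanceKill or Fatal Exception
exit_codes = [99996, 139, 50115, 99999, 139]
error_type = ['ReportManipulatingError', 'CMSSWStepFailure', 'BadFWJRXML',
              'ErrorLoggingAddition', 'WMAgentStepExecutionError']

idx = first_matching_index(139, exit_codes)
print(idx, error_type[idx])
```

Since the console code is expected to appear in every sequence, a return value of `None` would indicate an inconsistent entry.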

In [10]:
seq3 = 'ReportManipulatingError,CMSSWStepFailure,BadFWJRXML,ErrorLoggingAddition,WMAgentStepExecutionError'
seq4 = 'ReportManipulatingError,BadFWJRXML,CMSSWStepFailure,ErrorLoggingAddition,WMAgentStepExecutionError'
test_msg(logs, seq3, 33)


Error code from console: 139 Site from console: T2_CH_CERN


Exit code: 99996 Step: 0 Name: stageOut1 Type: ReportManipulatingError

Failed to find a step report for stageOut1!


Exit code: 139 Step: 2 Name: cmsRun1 Type: CMSSWStepFailure



Exit code: 50115 Step: 2 Name: cmsRun1 Type: BadFWJRXML

Error reading XML job report file, possibly corrupt XML File: Details: no element found: line 17, column 0


Exit code: 99999 Step: 2 Name: cmsRun1 Type: ErrorLoggingAddition



Exit code: 139 Step: 2 Name: cmsRun1 Type: WMAgentStepExecutionError




### Fractions of the rules

In [11]:
fatal = 'Fatal Exception'
perf = 'PerformanceKill'

In [12]:
fatal_msg = []
perf_msg = []
fatal_perf_msg = []
different = []

# Assign the count of every distinct sequence to the rule that would handle it
for seq, counts in zip(sequences['sequences'], sequences['counts']):
    if perf in seq and fatal in seq:
        fatal_perf_msg.append(counts)
    elif perf in seq:
        perf_msg.append(counts)
    elif fatal in seq:
        fatal_msg.append(counts)
    else:
        different.append(counts)

In [13]:
# Single-argument print() calls behave the same under Python 2 and Python 3
total = float(sequences['counts'].sum())
print('Fraction of {} {}'.format(fatal, np.sum(fatal_msg) / total))
print('Fraction of {} {}'.format(perf, np.sum(perf_msg) / total))
print('Fraction of {} and {} {}'.format(fatal, perf, np.sum(fatal_perf_msg) / total))
print('Fraction of different types {}'.format(np.sum(different) / total))

Fraction of Fatal Exception 0.521466449112788
Fraction of PerformanceKill 0.20487456659188252
Fraction of Fatal Exception and PerformanceKill 0.009747433544088652
Fraction of different types 0.26391155075124073
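
The same split can also be computed without an explicit loop by using boolean masks on the sequence strings. The sketch below runs on a toy table with made-up counts, so its numbers are not the ones printed above. `sequences_toy` and the dictionary keys are illustrative names.

```python
import pandas as pd

# Toy stand-in for the `sequences` frame; the counts are invented for illustration
sequences_toy = pd.DataFrame({
    'sequences': [
        'ReportManipulatingError,CMSSWStepFailure,Fatal Exception',
        'PerformanceKill',
        'PerformanceKill,ReportManipulatingError,CMSSWStepFailure,Fatal Exception',
        'NoJobReport',
    ],
    'counts': [5, 2, 1, 2],
})

# Plain substring tests (regex=False), matching the `in` checks of the loop version
has_fatal = sequences_toy['sequences'].str.contains('Fatal Exception', regex=False)
has_perf = sequences_toy['sequences'].str.contains('PerformanceKill', regex=False)

total = float(sequences_toy['counts'].sum())
counts = sequences_toy['counts']
fractions = {
    'fatal_only': counts[has_fatal & ~has_perf].sum() / total,
    'perf_only': counts[has_perf & ~has_fatal].sum() / total,
    'fatal_and_perf': counts[has_fatal & has_perf].sum() / total,
    'different': counts[~has_fatal & ~has_perf].sum() / total,
}
print(fractions)
```

The four masks are mutually exclusive and cover every row, so the fractions add up to one, as do the four values printed above.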
