Skip to content

DM-53238: Detect recoverable errors in Task SQL execution#986

Merged
jgates108 merged 3 commits intotickets/DM-43715from
tickets/DM-53238
Feb 10, 2026
Merged

DM-53238: Detect recoverable errors in Task SQL execution#986
jgates108 merged 3 commits intotickets/DM-43715from
tickets/DM-53238

Conversation

@jgates108
Copy link
Copy Markdown
Contributor

@jgates108 jgates108 commented Jan 21, 2026

If the SQL statement for a Task fails during execution, the entire user query is cancelled. It may be beneficial to take different actions by taking different actions for different errors. SQL syntax errors are not recoverable, but missing tables could be retried on a different worker.

@fritzm fritzm force-pushed the tickets/DM-53238 branch 2 times, most recently from 3ef7fd3 to 6517559 Compare January 24, 2026 00:59
@fritzm fritzm force-pushed the tickets/DM-53238 branch 2 times, most recently from 8bafbb5 to ed6de05 Compare February 6, 2026 19:32

return out;
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes in util/Error and util/MultiError are meant to allow the errors from workers to be passed back and merged with other errors from the Executive. Then returned to the user with a reasonable number of lines describing the error (i.e. not have the real probalem burried under thousands of "cancel" messages.)

funcN << " Alternate worker=" << targetWorker->getWorkerId()
<< " found for chunk=" << chunkData->dump());
}
}
Copy link
Copy Markdown
Contributor Author

@jgates108 jgates108 Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code prevents using a Job to an UberJob to worker where it is known that the table is missing.

The map creation time is important as should make information about missing tables obsolete.

util::Error worker_err(util::Error::WORKER_SQL, e.errNo(), {_task->getChunkId()}, {_task->getJobId()},
errMsg);
_multiError.insert(worker_err);
erred = true;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Send the error numbers back to the czar so the czar can tell if this error was caused by a missing table.

job->avoidWorker(_wContactInfo, _familyMapTimestamp);
}
}
_unassignJobs();
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Examine the error from the worker. Errors caused by missing tables may be recoverable. _unassign the jobs so that buildAndSendUberjobs can try to find other workers to try.

thread t(registryUpdate, czarConfig);
t.detach();
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This just should not have been here. I'm not sure if it was a rebase thing or just wasn't deleted when it was moved.

@fritzm fritzm changed the title Tickets/dm 53238 DM-53238: Detect recoverable errors in Task SQL execution Feb 7, 2026
* This is value is used for encoding jobId and attemptCount.
* For readability values should be 10, 100, 1000, etc.
*/
const int MAX_JOB_ATTEMPTS = 100;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a limit for the job attempts in the new code? And if so then where this limit is set?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was related to an old merging scheme that isn't in use on the branch, and probably isn't even used in the main branch.

The number of job attempts is set from the configuration file and doesn't have an arbitrary limit.

src/czar/Czar.cc Outdated
uq->submit();
uq->join();
ccontrol::QueryState qState = uq->join();
bool completeSuccess = (qState == ccontrol::QueryState::SUCCESS);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there "incomplete" success?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There could be a partial success, where several jobs succeed and then there was a failure, such as result too large. I'm just trying to indicate that the user query is successful, not just a part of it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then perhaps you need something more than just the binary flag, like what's already defined in:

class QInfo {
public:
  
    /**
     *  Constants for query status.
     */
    enum QStatus {
        EXECUTING,  ///< Query is currently executing (or being prepared)
        COMPLETED,  ///< Query execution completed successfully
        FAILED,     ///< Query execution failed
        FAILED_LR,  ///< Query execution failed due to large result set
        ABORTED     ///< Query execution was intentionally aborted
    };

if (targetWorkerInfo != nullptr) {
avoidWorker = jqPtr->isWorkerInAvoidMap(targetWorkerInfo, czFamilyMap->getLastUpdateTime());
}
return avoidWorker;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation of the function is too complex. It can be rewritten and reduced to something that is easier to comprehend:

    auto const itr = wContactMap->find(targetWorker->getWorkerId());
    if (itr == wContactMap->end()) return false;
    auto const workerInfo = itr->second;
    return workerInfo == nullptr ||
           jqPtr->isWorkerInAvoidMap(workerInfo, czFamilyMap->getLastUpdateTime());


/// Set error code and string.
void _setError(int code, std::string const& msg, int errorState);
void _setError(int code, int subCode, std::string const& msg, int errorState);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the meaning and effect on the error reporting model of subCode? Could it be documented, please?

auto targetWorker = chunkData->getPrimaryScanWorker().lock();
if (targetWorker == nullptr || targetWorker->isDead()) {
bool avoidWorker = avoidThisWorker(targetWorker, wContactMap, jqPtr, czFamilyMap);
if (targetWorker == nullptr || targetWorker->isDead() || avoidWorker) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those two tests for

targetWorker == nullptr || targetWorker->isDead()

could be included in the implementation of avoidThisWorker.

std::vector<Error>::size_type size() const;

void push_back(const std::vector<Error>::value_type& val);
std::vector<Error> getVector() const;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand what this method does. However, should the result be sorted based on the severity of the errors?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These errors are presented to the user in the order they are in this map, so putting the most serious ones first is probably helpful.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a mistake to assume that std::maps has any "ordering". According to the official documentation on this collection: https://en.cppreference.com/w/cpp/container/map.html

std::map is a sorted associative container that contains key-value pairs with unique keys. Keys are sorted by using the comparison function Compare. Search, removal, and insertion operations have logarithmic complexity. Maps are usually implemented as Red–black trees.

However, when fetching elements from the map:

Iterators of std::map iterate in ascending order of keys, where ascending is defined by the comparison that was used for construction.

What the last statement means is that in the default instantiation (the default comparator std::less<Key>), elements of your vector will be sorted using the following operator:

bool std::pair<int, int>::operator<(...)

Note that this is related to my next comment on your PR regarding the meaning of those two ints:

The structure (the meaning of the integers) of the composite key needs to be explained (commented).

Therefore, to guarantee the "most serious" types first, you would need to implement the comparator and pass the instantiation of the map:

namespace pmr {
    template<
        class Key,
        class T,
        class Compare = std::less<Key>
    > using map = std::map<Key, T, Compare,
                           std::pmr::polymorphic_allocator(<std::pair<const Key, T>>>;
}

You need to define your own class MyCompare and declare the data member as shown below:

class MyCompare { 
    ...
};

std::map<std::pair<int, int>, Error, MyCompare> _errorMap;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to mention an alternative solution. In case some specific ordering of the vector's elements is neeed, you may sort the vector using the method by:

#include <algorithm>
...
std::vector<Error> result = ...;
std::sort(result.begin(), result.end(), [] (Error const& a, Error const& b) -> bool {
    return a.getCode() < b.getCode();
});
return result;

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::map<std::pair<int, int>, Error> _errorMap;
Where the pair of ints is code, subCode. Lower values for code are stated to be higher priority. pair and tuple do comparisons in order. Nothing interesting needs to be done to get high priority codes to the front of the list.


private:
std::vector<Error> _errorVector;
std::map<std::pair<int, int>, Error> _errorMap;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The structure (the meaning of the integers) of the composite key needs to be explained (commented).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the map be defined as:

std::map<std::pair<Error::ErrCode, int>, Error> _errorMap;

?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I see Error::ErrCode ec I really expect any value there to be defined in Error::ErrCode. I know C++ really defines that as int. I see setting ErrCode to something that isn't in ErrCode as a mistake.
int says any integer can be used. In this case, putting all possible error codes in ErrCode just isn't practical.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, yes, all possible codes would need to be mentione din the enum definition. In reality, C/C++ won't care if enum is define din this way (not as enum class : int). My point was that using the enum type in the relevant code would improve readability. As you know, there may be many "error codes" in Qserv.

The error handling in Qserv is the reall mess!

} catch (UnsupportedError const& e) {
LOGS(_log, LOG_LVL_ERROR, __func__ << " runQuery threw UnsupportedError " << e.what() << tIdStr);
errStr = e.what();
errStr += string(" threw:") + e.what();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

        errStr += string(" exception:") + e.what();

bool logLvl = (_logLvlET != LOG_LVL_TRACE);
util::Error err(_chunkId, string("UberJob run error ") + errStr, util::ErrorCode::NONE, logLvl);
multiErr.push_back(err);
string strMsg = string("worker run error chunk=" + to_string(_chunkId) + ":" + errStr);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excessive construction can make an extra copy. The following would be enough:

string strMsg = "worker run error chunk=" + to_string(_chunkId) + ":" + errStr;

util::Error err(_chunkId, string("UberJob run error ") + errStr, util::ErrorCode::NONE, logLvl);
multiErr.push_back(err);
string strMsg = string("worker run error chunk=" + to_string(_chunkId) + ":" + errStr);
strMsg += " worker=" + getUberJobData()->getWorkerId();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can do it all in one line:

string const strMsg = "worker run error, chunk=" + to_string(_chunkId) + ":" + errStr + ", worker=" + getUberJobData()->getWorkerId();

friend std::ostream& operator<<(std::ostream& out, Error const& error);
~Error() = default;

int getCode() const { return _code; }
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the return type of the method be enum type ErrCode?


private:
int _code = ErrorCode::NONE;
int _code = NONE;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the type be:

ErrCode _code = NONE;

?


private:
std::vector<Error> _errorVector;
std::map<std::pair<int, int>, Error> _errorMap;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the map be defined as:

std::map<std::pair<Error::ErrCode, int>, Error> _errorMap;

?

@iagaponenko
Copy link
Copy Markdown
Contributor

Looks okay to me.

@jgates108 jgates108 merged commit 106c77f into tickets/DM-43715 Feb 10, 2026
8 checks passed
@jgates108 jgates108 deleted the tickets/DM-53238 branch February 10, 2026 20:50
fritzm pushed a commit that referenced this pull request Feb 22, 2026
DM-53238: Detect recoverable errors in Task SQL execution
fritzm pushed a commit that referenced this pull request Feb 22, 2026
DM-53238: Detect recoverable errors in Task SQL execution
fritzm pushed a commit that referenced this pull request Feb 24, 2026
DM-53238: Detect recoverable errors in Task SQL execution
fritzm pushed a commit that referenced this pull request Feb 26, 2026
DM-53238: Detect recoverable errors in Task SQL execution
fritzm pushed a commit that referenced this pull request Feb 26, 2026
DM-53238: Detect recoverable errors in Task SQL execution
fritzm pushed a commit that referenced this pull request Mar 19, 2026
DM-53238: Detect recoverable errors in Task SQL execution
fritzm pushed a commit that referenced this pull request Mar 19, 2026
DM-53238: Detect recoverable errors in Task SQL execution
fritzm pushed a commit that referenced this pull request Mar 20, 2026
DM-53238: Detect recoverable errors in Task SQL execution
fritzm pushed a commit that referenced this pull request Mar 27, 2026
DM-53238: Detect recoverable errors in Task SQL execution
fritzm pushed a commit that referenced this pull request Mar 27, 2026
DM-53238: Detect recoverable errors in Task SQL execution
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants