DM-53238: Detect recoverable errors in Task SQL execution#986
DM-53238: Detect recoverable errors in Task SQL execution#986jgates108 merged 3 commits intotickets/DM-43715from
Conversation
d39b0e4 to
5cc4413
Compare
068fbaa to
84a271e
Compare
5cc4413 to
48b1a6c
Compare
84a271e to
72daed7
Compare
3ef7fd3 to
6517559
Compare
278c071 to
82c789b
Compare
6517559 to
d8c1979
Compare
82c789b to
22a44b5
Compare
d8c1979 to
ed159cb
Compare
22a44b5 to
7d3a0ea
Compare
ed159cb to
734601b
Compare
7d3a0ea to
4cb86b5
Compare
8bafbb5 to
ed6de05
Compare
|
|
||
| return out; | ||
| } | ||
|
|
There was a problem hiding this comment.
Changes in util/Error and util/MultiError are meant to allow the errors from workers to be passed back and merged with other errors from the Executive. Then returned to the user with a reasonable number of lines describing the error (i.e. not have the real probalem burried under thousands of "cancel" messages.)
| funcN << " Alternate worker=" << targetWorker->getWorkerId() | ||
| << " found for chunk=" << chunkData->dump()); | ||
| } | ||
| } |
There was a problem hiding this comment.
This code prevents using a Job to an UberJob to worker where it is known that the table is missing.
The map creation time is important as should make information about missing tables obsolete.
| util::Error worker_err(util::Error::WORKER_SQL, e.errNo(), {_task->getChunkId()}, {_task->getJobId()}, | ||
| errMsg); | ||
| _multiError.insert(worker_err); | ||
| erred = true; |
There was a problem hiding this comment.
Send the error numbers back to the czar so the czar can tell if this error was caused by a missing table.
| job->avoidWorker(_wContactInfo, _familyMapTimestamp); | ||
| } | ||
| } | ||
| _unassignJobs(); |
There was a problem hiding this comment.
Examine the error from the worker. Errors caused by missing tables may be recoverable. _unassign the jobs so that buildAndSendUberjobs can try to find other workers to try.
| thread t(registryUpdate, czarConfig); | ||
| t.detach(); | ||
| } | ||
|
|
There was a problem hiding this comment.
This just should not have been here. I'm not sure if it was a rebase thing or just wasn't deleted when it was moved.
| * This is value is used for encoding jobId and attemptCount. | ||
| * For readability values should be 10, 100, 1000, etc. | ||
| */ | ||
| const int MAX_JOB_ATTEMPTS = 100; |
There was a problem hiding this comment.
Is there a limit for the job attempts in the new code? And if so then where this limit is set?
There was a problem hiding this comment.
This was related to an old merging scheme that isn't in use on the branch, and probably isn't even used in the main branch.
The number of job attempts is set from the configuration file and doesn't have an arbitrary limit.
src/czar/Czar.cc
Outdated
| uq->submit(); | ||
| uq->join(); | ||
| ccontrol::QueryState qState = uq->join(); | ||
| bool completeSuccess = (qState == ccontrol::QueryState::SUCCESS); |
There was a problem hiding this comment.
Is there "incomplete" success?
There was a problem hiding this comment.
There could be a partial success, where several jobs succeed and then there was a failure, such as result too large. I'm just trying to indicate that the user query is successful, not just a part of it.
There was a problem hiding this comment.
Then perhaps you need something more than just the binary flag, like what's already defined in:
class QInfo {
public:
/**
* Constants for query status.
*/
enum QStatus {
EXECUTING, ///< Query is currently executing (or being prepared)
COMPLETED, ///< Query execution completed successfully
FAILED, ///< Query execution failed
FAILED_LR, ///< Query execution failed due to large result set
ABORTED ///< Query execution was intentionally aborted
};
src/ccontrol/UserQuerySelect.cc
Outdated
| if (targetWorkerInfo != nullptr) { | ||
| avoidWorker = jqPtr->isWorkerInAvoidMap(targetWorkerInfo, czFamilyMap->getLastUpdateTime()); | ||
| } | ||
| return avoidWorker; |
There was a problem hiding this comment.
The implementation of the function is too complex. It can be rewritten and reduced to something that is easier to comprehend:
auto const itr = wContactMap->find(targetWorker->getWorkerId());
if (itr == wContactMap->end()) return false;
auto const workerInfo = itr->second;
return workerInfo == nullptr ||
jqPtr->isWorkerInAvoidMap(workerInfo, czFamilyMap->getLastUpdateTime());
src/ccontrol/MergingHandler.h
Outdated
|
|
||
| /// Set error code and string. | ||
| void _setError(int code, std::string const& msg, int errorState); | ||
| void _setError(int code, int subCode, std::string const& msg, int errorState); |
There was a problem hiding this comment.
What is the meaning and effect on the error reporting model of subCode? Could it be documented, please?
| auto targetWorker = chunkData->getPrimaryScanWorker().lock(); | ||
| if (targetWorker == nullptr || targetWorker->isDead()) { | ||
| bool avoidWorker = avoidThisWorker(targetWorker, wContactMap, jqPtr, czFamilyMap); | ||
| if (targetWorker == nullptr || targetWorker->isDead() || avoidWorker) { |
There was a problem hiding this comment.
Those two tests for
targetWorker == nullptr || targetWorker->isDead()
could be included in the implementation of avoidThisWorker.
| std::vector<Error>::size_type size() const; | ||
|
|
||
| void push_back(const std::vector<Error>::value_type& val); | ||
| std::vector<Error> getVector() const; |
There was a problem hiding this comment.
I understand what this method does. However, should the result be sorted based on the severity of the errors?
There was a problem hiding this comment.
These errors are presented to the user in the order they are in this map, so putting the most serious ones first is probably helpful.
There was a problem hiding this comment.
It's a mistake to assume that std::maps has any "ordering". According to the official documentation on this collection: https://en.cppreference.com/w/cpp/container/map.html
std::map is a sorted associative container that contains key-value pairs with unique keys. Keys are sorted by using the comparison function Compare. Search, removal, and insertion operations have logarithmic complexity. Maps are usually implemented as Red–black trees.
However, when fetching elements from the map:
Iterators of std::map iterate in ascending order of keys, where ascending is defined by the comparison that was used for construction.
What the last statement means is that in the default instantiation (the default comparator std::less<Key>), elements of your vector will be sorted using the following operator:
bool std::pair<int, int>::operator<(...)
Note that this is related to my next comment on your PR regarding the meaning of those two ints:
The structure (the meaning of the integers) of the composite key needs to be explained (commented).
Therefore, to guarantee the "most serious" types first, you would need to implement the comparator and pass the instantiation of the map:
namespace pmr {
template<
class Key,
class T,
class Compare = std::less<Key>
> using map = std::map<Key, T, Compare,
std::pmr::polymorphic_allocator(<std::pair<const Key, T>>>;
}
You need to define your own class MyCompare and declare the data member as shown below:
class MyCompare {
...
};
std::map<std::pair<int, int>, Error, MyCompare> _errorMap;
There was a problem hiding this comment.
Forgot to mention an alternative solution. In case some specific ordering of the vector's elements is neeed, you may sort the vector using the method by:
#include <algorithm>
...
std::vector<Error> result = ...;
std::sort(result.begin(), result.end(), [] (Error const& a, Error const& b) -> bool {
return a.getCode() < b.getCode();
});
return result;
There was a problem hiding this comment.
std::map<std::pair<int, int>, Error> _errorMap;
Where the pair of ints is code, subCode. Lower values for code are stated to be higher priority. pair and tuple do comparisons in order. Nothing interesting needs to be done to get high priority codes to the front of the list.
|
|
||
| private: | ||
| std::vector<Error> _errorVector; | ||
| std::map<std::pair<int, int>, Error> _errorMap; |
There was a problem hiding this comment.
The structure (the meaning of the integers) of the composite key needs to be explained (commented).
There was a problem hiding this comment.
Should the map be defined as:
std::map<std::pair<Error::ErrCode, int>, Error> _errorMap;
?
There was a problem hiding this comment.
When I see Error::ErrCode ec I really expect any value there to be defined in Error::ErrCode. I know C++ really defines that as int. I see setting ErrCode to something that isn't in ErrCode as a mistake.
int says any integer can be used. In this case, putting all possible error codes in ErrCode just isn't practical.
There was a problem hiding this comment.
Ideally, yes, all possible codes would need to be mentione din the enum definition. In reality, C/C++ won't care if enum is define din this way (not as enum class : int). My point was that using the enum type in the relevant code would improve readability. As you know, there may be many "error codes" in Qserv.
The error handling in Qserv is the reall mess!
src/wbase/Task.cc
Outdated
| } catch (UnsupportedError const& e) { | ||
| LOGS(_log, LOG_LVL_ERROR, __func__ << " runQuery threw UnsupportedError " << e.what() << tIdStr); | ||
| errStr = e.what(); | ||
| errStr += string(" threw:") + e.what(); |
There was a problem hiding this comment.
errStr += string(" exception:") + e.what();
src/wbase/Task.cc
Outdated
| bool logLvl = (_logLvlET != LOG_LVL_TRACE); | ||
| util::Error err(_chunkId, string("UberJob run error ") + errStr, util::ErrorCode::NONE, logLvl); | ||
| multiErr.push_back(err); | ||
| string strMsg = string("worker run error chunk=" + to_string(_chunkId) + ":" + errStr); |
There was a problem hiding this comment.
Excessive construction can make an extra copy. The following would be enough:
string strMsg = "worker run error chunk=" + to_string(_chunkId) + ":" + errStr;
src/wbase/Task.cc
Outdated
| util::Error err(_chunkId, string("UberJob run error ") + errStr, util::ErrorCode::NONE, logLvl); | ||
| multiErr.push_back(err); | ||
| string strMsg = string("worker run error chunk=" + to_string(_chunkId) + ":" + errStr); | ||
| strMsg += " worker=" + getUberJobData()->getWorkerId(); |
There was a problem hiding this comment.
You can do it all in one line:
string const strMsg = "worker run error, chunk=" + to_string(_chunkId) + ":" + errStr + ", worker=" + getUberJobData()->getWorkerId();
| friend std::ostream& operator<<(std::ostream& out, Error const& error); | ||
| ~Error() = default; | ||
|
|
||
| int getCode() const { return _code; } |
There was a problem hiding this comment.
Should the return type of the method be enum type ErrCode?
|
|
||
| private: | ||
| int _code = ErrorCode::NONE; | ||
| int _code = NONE; |
There was a problem hiding this comment.
Should the type be:
ErrCode _code = NONE;
?
|
|
||
| private: | ||
| std::vector<Error> _errorVector; | ||
| std::map<std::pair<int, int>, Error> _errorMap; |
There was a problem hiding this comment.
Should the map be defined as:
std::map<std::pair<Error::ErrCode, int>, Error> _errorMap;
?
300a3e6 to
8269691
Compare
8acf1c7 to
5c07161
Compare
|
Looks okay to me. |
DM-53238: Detect recoverable errors in Task SQL execution
DM-53238: Detect recoverable errors in Task SQL execution
DM-53238: Detect recoverable errors in Task SQL execution
DM-53238: Detect recoverable errors in Task SQL execution
DM-53238: Detect recoverable errors in Task SQL execution
DM-53238: Detect recoverable errors in Task SQL execution
DM-53238: Detect recoverable errors in Task SQL execution
DM-53238: Detect recoverable errors in Task SQL execution
DM-53238: Detect recoverable errors in Task SQL execution
DM-53238: Detect recoverable errors in Task SQL execution
If the SQL statement for a Task fails during execution, the entire user query is cancelled. It may be beneficial to take different actions by taking different actions for different errors. SQL syntax errors are not recoverable, but missing tables could be retried on a different worker.