New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MM-45993: Return errors during sending websocket messages #20760
Conversation
During attaching an object to a websocket message, we would marshal it to json and attach the string output. But if the marshalling failed, we would just log a warning and move on. This would add an empty string to the message. But the client assumes that the object is correctly attached and would fail silently if it cannot find it. So we become more strict and return the error so that it reaches the caller. https://mattermost.atlassian.net/browse/MM-45993 ```release-note NONE ```
@@ -73,14 +72,14 @@ func setCollapsePreference(a *app.App, args *model.CommandArgs, isCollapse bool) | |||
} | |||
|
|||
if err := a.Srv().Store.Preference().Save(model.Preferences{pref}); err != nil { | |||
return &model.CommandResponse{Text: args.T("api.command_expand_collapse.fail.app_error"), ResponseType: model.CommandResponseTypeEphemeral} | |||
return &model.CommandResponse{Text: args.T("api.command_expand_collapse.fail.app_error") + err.Error(), ResponseType: model.CommandResponseTypeEphemeral} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cases like these are what I want to really fix. There is an error coming from the store, but we are not propagating that at all. Instead all the user knows is "An error occurred while expanding previews". But what exactly failed?. It wouldn't have been possible to know that until now :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to propagate errors but I have some concerns over some of them though:
if appErr := a.sendTeamEvent(oldTeam, model.WebsocketEventUpdateTeam); appErr != nil { | ||
return nil, appErr | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, so here we are :) I think returning the error is also come with other problems, because this method is UpdateTeam
and we successfully updated the team, this is an additional step, not necessarily a failure on the Update operation itself. So, logging the error would be a better approach IMO. Similar concern for similar occurrences in the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, and you brought it up during our brainstorming session as well.
So my main concern was not to send a broken websocket message since that anyways fails silently on the client, and leads to all sorts of confusions. But at the same time, if we return an error, then we also generate a partial state in the DB. And making the full operation atomic is beyond the scope.
What do you think of logging it at a higher level, but not sending an empty string in the websocket message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@isacikgoz - I thought about this some more, and it looks to me that there are already some cases where the full operation isn't atomic.
For example, take a look at this RemoveTeam
operation:
or _, channel := range channelList {
if !channel.IsGroupOrDirect() {
a.invalidateCacheForChannelMembers(channel.Id)
if nErr = a.Srv().Store.Channel().RemoveMember(channel.Id, user.Id); nErr != nil {
return model.NewAppError("LeaveTeam", "app.channel.remove_member.app_error", nil, nErr.Error(), http.StatusInternalServerError)
}
}
}
if *a.Config().ServiceSettings.ExperimentalEnableDefaultChannelLeaveJoinMessages {
channel, cErr := a.Srv().Store.Channel().GetByName(team.Id, model.DefaultChannelName, false)
if cErr != nil {
var nfErr *store.ErrNotFound
switch {
case errors.As(cErr, &nfErr):
return model.NewAppError("LeaveTeam", "app.channel.get_by_name.missing.app_error", nil, nfErr.Error(), http.StatusNotFound)
default:
return model.NewAppError("LeaveTeam", "app.channel.get_by_name.existing.app_error", nil, cErr.Error(), http.StatusInternalServerError)
}
}
RemoveMember
can pass, but if Channel().GetByName
fails, it returns. But actually the DB will be in a partial success state.
And even referring to websocket events, here is some code in CreateGroupWithUserIds
:
messageWs := model.NewWebSocketEvent(model.WebsocketEventReceivedGroup, "", "", "", nil)
count, err := a.Srv().Store.Group().GetMemberCount(newGroup.Id)
if err != nil {
return nil, model.NewAppError("CreateGroupWithUserIds", "app.group.id.app_error", nil, err.Error(), http.StatusBadRequest)
}
group.MemberCount = model.NewInt(int(count))
groupJSON, jsonErr := json.Marshal(newGroup)
if jsonErr != nil {
mlog.Warn("Failed to encode group to JSON", mlog.Err(jsonErr))
}
messageWs.Add("group", string(groupJSON))
a.Publish(messageWs)
So the group gets created (before this code), but if the GetMemberCount
fails, the whole operation fails.
What appears to me that atomicity is anyways not being preserved in a lot of cases. This change just improves the behavior slightly by not passing the corrupted message to the client. The server side behavior (as a whole) was already non-atomic, and you can say, this change is making it slightly more non-atomic.
I am thinking that we should perhaps we should try to tackle atomicity as a whole separately as an OKR sometime in future.
And realistically speaking, json.Marshal
should never fail in our codebase. So all of this is just following the best practices and has no material effect. Let me know your thoughts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To solve this, we'd either need to use transactions, which would require a serious rewrite of the server; or a custom retry/rollback functionality. With this change at least we learn about the error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That said, we need to keep a close look on the logs when trying it on the community server, I expect more errors than usual.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, longer transactions comes with new sets of problems. Let's discuss this more at Mattercon :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am thinking that we should perhaps we should try to tackle atomicity as a whole separately as an OKR sometime in future.
Agreed on this.
However, my main concern was even let's say we delete an entity, on failure we will return a deceptive return code such as InternalServerError
. Which on the client or caller side seem to be an unsuccessful operation. Maybe we need to think this with a some kind of different HTTP code, such as Accepted
which is described here. The goal here is that inform the caller that the main operation is successful but ancillary steps are not guaranteed at this point.
Just because we couldn't send a ws message shouldn't cause a rollback to me. Although, this can depend to every individual operation, I'm thinking more of the clarity of the state we transmit to the client/caller.
Aynway, this might be out of scope this PR as you mentioned this happens a lot in the code base already. I'll leave it to you @agnivade
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The goal here is that inform the caller that the main operation is successful but ancillary steps are not guaranteed at this point.
That is the goal, but a lot of times there is no clear distinction of "main operation" and "auxiliary operation". If we are just talking about changes in the DB, sometimes an API call makes changes to multiple tables in separate transactions. And still if the latter fails, we return an error.
I think we've had this discussion a long time back in the old server team, of having long running transactions that exist all throughout an API request and only gets committed at the end.
Aynway, this might be out of scope this PR as you mentioned this happens a lot in the code base already.
Yep, but this is certainly a valid issue though. Let's discuss more at Mattercon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love the changes, we just need to keep a close eye on the errors reported in the near future.
/e2e-test |
Successfully triggered e2e testing! |
/e2e-test |
Successfully triggered e2e testing! |
No difference compared to master: https://mattermost-cypress-report.s3.amazonaws.com/226195-09b0f55-onprem-ent-server-pr-20760-mattermost/mm-ee-test09b0f55/mochawesome.html. Merging. |
During attaching an object to a websocket message, we would
marshal it to json and attach the string output. But if the
marshalling failed, we would just log a warning and move on.
This would add an empty string to the message. But the client
assumes that the object is correctly attached and would
fail silently if it cannot find it.
So we become more strict and return the error so that
it reaches the caller.
https://mattermost.atlassian.net/browse/MM-45993