MM-45993: Return errors during sending websocket messages #20760

agnivade · 2022-08-03T09:26:52Z

When attaching an object to a websocket message, we would
marshal it to JSON and attach the string output. But if the
marshalling failed, we would just log a warning and move on.

This would add an empty string to the message. But the client
assumes that the object is correctly attached and would
fail silently if it cannot find it.

So we become more strict and return the error so that
it reaches the caller.
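
The behavior change described above can be sketched as follows. This is a minimal illustration, not the code from the PR: the `event` type and its `addJSON` method are hypothetical stand-ins for the real websocket event type, and only `encoding/json` and `fmt` from the standard library are used.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// event is a hypothetical stand-in for a websocket message; it is not
// the real model.WebSocketEvent type.
type event struct {
	data map[string]interface{}
}

// addJSON marshals v and attaches the result under key. On failure it
// returns the error and leaves the event untouched, instead of logging
// a warning and attaching an empty string.
func (e *event) addJSON(key string, v interface{}) error {
	b, err := json.Marshal(v)
	if err != nil {
		return fmt.Errorf("encoding %q for websocket event: %w", key, err)
	}
	e.data[key] = string(b)
	return nil
}

func main() {
	ev := &event{data: map[string]interface{}{}}
	if err := ev.addJSON("team", map[string]string{"id": "t1"}); err != nil {
		fmt.Println("unexpected:", err)
	}
	// A channel cannot be encoded as JSON, so this call returns an error
	// and nothing is attached under "broken".
	if err := ev.addJSON("broken", make(chan int)); err != nil {
		fmt.Println("not attached:", err)
	}
}
```

With this shape, the caller decides what to do with the error; the event never carries an empty payload.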

https://mattermost.atlassian.net/browse/MM-45993

```release-note
NONE
```

agnivade · 2022-08-03T09:28:49Z

app/slashcommands/command_expand_collapse.go

@@ -73,14 +72,14 @@ func setCollapsePreference(a *app.App, args *model.CommandArgs, isCollapse bool)
 	}

 	if err := a.Srv().Store.Preference().Save(model.Preferences{pref}); err != nil {
-		return &model.CommandResponse{Text: args.T("api.command_expand_collapse.fail.app_error"), ResponseType: model.CommandResponseTypeEphemeral}
+		return &model.CommandResponse{Text: args.T("api.command_expand_collapse.fail.app_error") + err.Error(), ResponseType: model.CommandResponseTypeEphemeral}


Cases like these are what I really want to fix. There is an error coming from the store, but we are not propagating it at all. Instead, all the user knows is "An error occurred while expanding previews". But what exactly failed? It wouldn't have been possible to know that until now :)

isacikgoz

It looks good to propagate errors, but I have concerns about some of them:

isacikgoz · 2022-08-03T13:14:03Z

app/team.go

+	if appErr := a.sendTeamEvent(oldTeam, model.WebsocketEventUpdateTeam); appErr != nil {
+		return nil, appErr
+	}


Okay, so here we are :) I think returning the error also comes with other problems: this method is UpdateTeam, and we have successfully updated the team. Sending the event is an additional step, so its failure is not necessarily a failure of the update operation itself. So, logging the error would be a better approach IMO. The same concern applies to similar occurrences in the PR.

Indeed, and you brought it up during our brainstorming session as well.

My main concern was to avoid sending a broken websocket message, since that fails silently on the client anyway and leads to all sorts of confusion. But at the same time, if we return an error, we also leave a partial state in the DB. And making the full operation atomic is beyond the scope of this PR.

What do you think of logging it at a higher level, but not sending an empty string in the websocket message?
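
A minimal sketch of that alternative, assuming a hypothetical `sendTeamEvent` helper and a `publisher` callback in place of the server's real publish path: on a marshalling failure the error is logged and no message is sent, so the caller's main operation is not failed and the client never receives an empty payload.

```go
package main

import (
	"encoding/json"
	"log"
)

// publisher is a hypothetical stand-in for the server's Publish method.
type publisher func(payload map[string]string)

// sendTeamEvent encodes team and publishes it. If encoding fails, it logs
// the error and publishes nothing. It reports whether a message was sent.
func sendTeamEvent(team interface{}, publish publisher) bool {
	b, err := json.Marshal(team)
	if err != nil {
		log.Printf("ERROR: dropping team websocket event: %v", err)
		return false
	}
	publish(map[string]string{"team": string(b)})
	return true
}

func main() {
	sent := 0
	publish := func(map[string]string) { sent++ }
	sendTeamEvent(map[string]string{"id": "t1"}, publish)
	sendTeamEvent(func() {}, publish) // functions cannot be JSON-encoded
	log.Printf("messages sent: %d", sent)
}
```

The trade-off is that the caller never learns the event was dropped; only the log records it.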

@isacikgoz - I thought about this some more, and it looks to me that there are already some cases where the full operation isn't atomic.

For example, take a look at this RemoveTeam operation:

    for _, channel := range channelList {
    	if !channel.IsGroupOrDirect() {
    		a.invalidateCacheForChannelMembers(channel.Id)
    		if nErr = a.Srv().Store.Channel().RemoveMember(channel.Id, user.Id); nErr != nil {
    			return model.NewAppError("LeaveTeam", "app.channel.remove_member.app_error", nil, nErr.Error(), http.StatusInternalServerError)
    		}
    	}
    }

    if *a.Config().ServiceSettings.ExperimentalEnableDefaultChannelLeaveJoinMessages {
    	channel, cErr := a.Srv().Store.Channel().GetByName(team.Id, model.DefaultChannelName, false)
    	if cErr != nil {
    		var nfErr *store.ErrNotFound
    		switch {
    		case errors.As(cErr, &nfErr):
    			return model.NewAppError("LeaveTeam", "app.channel.get_by_name.missing.app_error", nil, nfErr.Error(), http.StatusNotFound)
    		default:
    			return model.NewAppError("LeaveTeam", "app.channel.get_by_name.existing.app_error", nil, cErr.Error(), http.StatusInternalServerError)
    		}
    	}

RemoveMember can succeed, but if Channel().GetByName then fails, the function returns an error. The DB is actually left in a partial-success state.

And even referring to websocket events, here is some code in CreateGroupWithUserIds:

    messageWs := model.NewWebSocketEvent(model.WebsocketEventReceivedGroup, "", "", "", nil)
    count, err := a.Srv().Store.Group().GetMemberCount(newGroup.Id)
    if err != nil {
    	return nil, model.NewAppError("CreateGroupWithUserIds", "app.group.id.app_error", nil, err.Error(), http.StatusBadRequest)
    }
    group.MemberCount = model.NewInt(int(count))
    groupJSON, jsonErr := json.Marshal(newGroup)
    if jsonErr != nil {
    	mlog.Warn("Failed to encode group to JSON", mlog.Err(jsonErr))
    }
    messageWs.Add("group", string(groupJSON))
    a.Publish(messageWs)

So the group gets created (before this code), but if the GetMemberCount fails, the whole operation fails.

It appears to me that atomicity is not being preserved in a lot of cases anyway. This change just improves the behavior slightly by not passing the corrupted message to the client. The server-side behavior (as a whole) was already non-atomic, and you could say this change makes it slightly more non-atomic.

I am thinking that we should perhaps try to tackle atomicity as a whole separately, as an OKR sometime in the future.

And realistically speaking, json.Marshal should never fail in our codebase. So all of this is just following the best practices and has no material effect. Let me know your thoughts.

To solve this, we'd either need to use transactions, which would require a serious rewrite of the server; or a custom retry/rollback functionality. With this change at least we learn about the error.

That said, we need to keep a close eye on the logs when trying it on the community server; I expect more errors than usual.

Yes, longer transactions come with a new set of problems. Let's discuss this more at Mattercon :)

> I am thinking that we should perhaps try to tackle atomicity as a whole separately, as an OKR sometime in the future.

Agreed on this.

However, my main concern is this: say we delete an entity, and a later step fails. We will then return a deceptive status code such as InternalServerError, which makes the operation look unsuccessful to the client or caller. Maybe we need to consider a different HTTP status code, such as 202 Accepted. The goal here is to inform the caller that the main operation succeeded but ancillary steps are not guaranteed at this point.

To me, failing to send a ws message shouldn't cause a rollback. This can depend on the individual operation, but I'm thinking more of the clarity of the state we transmit to the client/caller.

Anyway, this might be out of scope for this PR, as you mentioned this happens a lot in the codebase already. I'll leave it to you @agnivade

> The goal here is to inform the caller that the main operation succeeded but ancillary steps are not guaranteed at this point.

That is the goal, but a lot of times there is no clear distinction between "main operation" and "auxiliary operation". If we are just talking about changes in the DB, sometimes an API call makes changes to multiple tables in separate transactions. And still, if the latter fails, we return an error.

I think we had this discussion a long time back in the old server team, about having long-running transactions that exist throughout an API request and only get committed at the end.

> Anyway, this might be out of scope for this PR, as you mentioned this happens a lot in the codebase already.

Yep, but this is certainly a valid issue though. Let's discuss more at Mattercon.

noxer

Love the changes, we just need to keep a close eye on the errors reported in the near future.

agnivade · 2022-08-04T14:03:54Z

/e2e-test

mattermod · 2022-08-04T14:03:57Z

Successfully triggered e2e testing!
https://git.internal.mattermost.com/qa/cypress-ui-automation/-/pipelines/226139

saturninoabril · 2022-08-04T16:03:57Z

/e2e-test

mattermod · 2022-08-04T16:04:00Z

Successfully triggered e2e testing!
https://git.internal.mattermost.com/qa/cypress-ui-automation/-/pipelines/226195

agnivade · 2022-08-05T03:18:54Z

No difference compared to master: https://mattermost-cypress-report.s3.amazonaws.com/226195-09b0f55-onprem-ent-server-pr-20760-mattermost/mm-ee-test09b0f55/mochawesome.html. Merging.

agnivade added the "2: Dev Review" label Aug 3, 2022

agnivade requested review from noxer and isacikgoz August 3, 2022 09:26

mm-cloud-bot added the "release-note-none" label Aug 3, 2022

isacikgoz reviewed Aug 3, 2022

noxer approved these changes Aug 3, 2022

agnivade requested a review from isacikgoz August 4, 2022 08:54

isacikgoz approved these changes Aug 4, 2022

agnivade merged commit 14246ab into master Aug 5, 2022

agnivade deleted the wsLogging branch August 5, 2022 03:19

amyblais added the "Changelog/Not Needed" and "Docs/Not Needed" labels Aug 9, 2022
