Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MM-45993: Return errors during sending websocket messages #20760

Merged
merged 1 commit into from Aug 5, 2022
Merged

Conversation

agnivade
Copy link
Member

@agnivade agnivade commented Aug 3, 2022

During attaching an object to a websocket message, we would
marshal it to json and attach the string output. But if the
marshalling failed, we would just log a warning and move on.

This would add an empty string to the message. But the client
assumes that the object is correctly attached and would
fail silently if it cannot find it.

So we become more strict and return the error so that
it reaches the caller.

https://mattermost.atlassian.net/browse/MM-45993

NONE

During attaching an object to a websocket message, we would
marshal it to json and attach the string output. But if the
marshalling failed, we would just log a warning and move on.

This would add an empty string to the message. But the client
assumes that the object is correctly attached and would
fail silently if it cannot find it.

So we become more strict and return the error so that
it reaches the caller.

https://mattermost.atlassian.net/browse/MM-45993

```release-note
NONE
```
@agnivade agnivade added the 2: Dev Review Requires review by a developer label Aug 3, 2022
@mm-cloud-bot mm-cloud-bot added the release-note-none Denotes a PR that doesn't merit a release note. label Aug 3, 2022
@@ -73,14 +72,14 @@ func setCollapsePreference(a *app.App, args *model.CommandArgs, isCollapse bool)
}

if err := a.Srv().Store.Preference().Save(model.Preferences{pref}); err != nil {
return &model.CommandResponse{Text: args.T("api.command_expand_collapse.fail.app_error"), ResponseType: model.CommandResponseTypeEphemeral}
return &model.CommandResponse{Text: args.T("api.command_expand_collapse.fail.app_error") + err.Error(), ResponseType: model.CommandResponseTypeEphemeral}
Copy link
Member Author

@agnivade agnivade Aug 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cases like these are what I want to really fix. There is an error coming from the store, but we are not propagating that at all. Instead all the user knows is "An error occurred while expanding previews". But what exactly failed?. It wouldn't have been possible to know that until now :)

Copy link
Member

@isacikgoz isacikgoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to propagate errors but I have some concerns over some of them though:

Comment on lines +234 to +236
if appErr := a.sendTeamEvent(oldTeam, model.WebsocketEventUpdateTeam); appErr != nil {
return nil, appErr
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so here we are :) I think returning the error is also come with other problems, because this method is UpdateTeam and we successfully updated the team, this is an additional step, not necessarily a failure on the Update operation itself. So, logging the error would be a better approach IMO. Similar concern for similar occurrences in the PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, and you brought it up during our brainstorming session as well.

So my main concern was not to send a broken websocket message since that anyways fails silently on the client, and leads to all sorts of confusions. But at the same time, if we return an error, then we also generate a partial state in the DB. And making the full operation atomic is beyond the scope.

What do you think of logging it at a higher level, but not sending an empty string in the websocket message?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@isacikgoz - I thought about this some more, and it looks to me that there are already some cases where the full operation isn't atomic.

For example, take a look at this RemoveTeam operation:

or _, channel := range channelList {
		if !channel.IsGroupOrDirect() {
			a.invalidateCacheForChannelMembers(channel.Id)
			if nErr = a.Srv().Store.Channel().RemoveMember(channel.Id, user.Id); nErr != nil {
				return model.NewAppError("LeaveTeam", "app.channel.remove_member.app_error", nil, nErr.Error(), http.StatusInternalServerError)
			}
		}
	}

	if *a.Config().ServiceSettings.ExperimentalEnableDefaultChannelLeaveJoinMessages {
		channel, cErr := a.Srv().Store.Channel().GetByName(team.Id, model.DefaultChannelName, false)
		if cErr != nil {
			var nfErr *store.ErrNotFound
			switch {
			case errors.As(cErr, &nfErr):
				return model.NewAppError("LeaveTeam", "app.channel.get_by_name.missing.app_error", nil, nfErr.Error(), http.StatusNotFound)
			default:
				return model.NewAppError("LeaveTeam", "app.channel.get_by_name.existing.app_error", nil, cErr.Error(), http.StatusInternalServerError)
			}
		}

RemoveMember can pass, but if Channel().GetByName fails, it returns. But actually the DB will be in a partial success state.

And even referring to websocket events, here is some code in CreateGroupWithUserIds:

messageWs := model.NewWebSocketEvent(model.WebsocketEventReceivedGroup, "", "", "", nil)
count, err := a.Srv().Store.Group().GetMemberCount(newGroup.Id)
if err != nil {
    return nil, model.NewAppError("CreateGroupWithUserIds", "app.group.id.app_error", nil, err.Error(), http.StatusBadRequest)
}
group.MemberCount = model.NewInt(int(count))
groupJSON, jsonErr := json.Marshal(newGroup)
if jsonErr != nil {
    mlog.Warn("Failed to encode group to JSON", mlog.Err(jsonErr))
}
messageWs.Add("group", string(groupJSON))
a.Publish(messageWs)

So the group gets created (before this code), but if the GetMemberCount fails, the whole operation fails.

What appears to me that atomicity is anyways not being preserved in a lot of cases. This change just improves the behavior slightly by not passing the corrupted message to the client. The server side behavior (as a whole) was already non-atomic, and you can say, this change is making it slightly more non-atomic.

I am thinking that we should perhaps we should try to tackle atomicity as a whole separately as an OKR sometime in future.

And realistically speaking, json.Marshal should never fail in our codebase. So all of this is just following the best practices and has no material effect. Let me know your thoughts.

Copy link
Contributor

@noxer noxer Aug 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To solve this, we'd either need to use transactions, which would require a serious rewrite of the server; or a custom retry/rollback functionality. With this change at least we learn about the error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said, we need to keep a close look on the logs when trying it on the community server, I expect more errors than usual.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, longer transactions comes with new sets of problems. Let's discuss this more at Mattercon :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking that we should perhaps we should try to tackle atomicity as a whole separately as an OKR sometime in future.

Agreed on this.

However, my main concern was even let's say we delete an entity, on failure we will return a deceptive return code such as InternalServerError. Which on the client or caller side seem to be an unsuccessful operation. Maybe we need to think this with a some kind of different HTTP code, such as Accepted which is described here. The goal here is that inform the caller that the main operation is successful but ancillary steps are not guaranteed at this point.

Just because we couldn't send a ws message shouldn't cause a rollback to me. Although, this can depend to every individual operation, I'm thinking more of the clarity of the state we transmit to the client/caller.

Aynway, this might be out of scope this PR as you mentioned this happens a lot in the code base already. I'll leave it to you @agnivade

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal here is that inform the caller that the main operation is successful but ancillary steps are not guaranteed at this point.

That is the goal, but a lot of times there is no clear distinction of "main operation" and "auxiliary operation". If we are just talking about changes in the DB, sometimes an API call makes changes to multiple tables in separate transactions. And still if the latter fails, we return an error.

I think we've had this discussion a long time back in the old server team, of having long running transactions that exist all throughout an API request and only gets committed at the end.

Aynway, this might be out of scope this PR as you mentioned this happens a lot in the code base already.

Yep, but this is certainly a valid issue though. Let's discuss more at Mattercon.

Copy link
Contributor

@noxer noxer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love the changes, we just need to keep a close eye on the errors reported in the near future.

@agnivade agnivade requested a review from isacikgoz August 4, 2022 08:54
@agnivade
Copy link
Member Author

agnivade commented Aug 4, 2022

/e2e-test

@mattermod
Copy link
Contributor

@saturninoabril
Copy link
Member

/e2e-test

@mattermod
Copy link
Contributor

@agnivade
Copy link
Member Author

agnivade commented Aug 5, 2022

@agnivade agnivade merged commit 14246ab into master Aug 5, 2022
@agnivade agnivade deleted the wsLogging branch August 5, 2022 03:19
@amyblais amyblais added Changelog/Not Needed Does not require a changelog entry Docs/Not Needed Does not require documentation labels Aug 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2: Dev Review Requires review by a developer Changelog/Not Needed Does not require a changelog entry Docs/Not Needed Does not require documentation release-note-none Denotes a PR that doesn't merit a release note.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants