"Channel type not available" summarization error #4414

Closed
vladsud opened this issue Nov 19, 2020 · 22 comments
Closed

"Channel type not available" summarization error #4414

vladsud opened this issue Nov 19, 2020 · 22 comments
Assignees
Labels
bug Something isn't working telemetry
Milestone

Comments


vladsud commented Nov 19, 2020

Top OCE issue (assuming the other two top issues are addressed - still awaiting full confirmation).

717 hits in the last 24 hours!

Office_Fluid_FluidRuntime_Error
| where Event_Time > ago(1d)
| where Data_error contains "Channel type not available"
| where Data_eventName contains ":Summar"
| summarize count() by Data_eventName, Data_message, Data_stack

Based on the data, the first hits are:

  • Oct 26
  • 0.26.9

Really huge spike on Nov 17-18.
Based on versions, a huge jump in 0.28.7.

Typical stack:

Error: Channel type not available
at D.loadChannel (https://officefluidprodversionedcdn.azureedge.net/container/hashed/officeContainer.2f220687466f2e8891d1.js:384:1600)
at async D.summarizeInternal (https://officefluidprodversionedcdn.azureedge.net/container/hashed/officeContainer.2f220687466f2e8891d1.js:384:1208)
at async $.summarize (https://officefluidprodversionedcdn.azureedge.net/container/hashed/officeContainer.2f220687466f2e8891d1.js:162:3095)
at async https://officefluidprodversionedcdn.azureedge.net/container/hashed/officeContainer.2f220687466f2e8891d1.js:392:7506
at async Promise.all (index 0)
at async Object.summarize (https://officefluidprodversionedcdn.azureedge.net/container/hashed/officeContainer.2f220687466f2e8891d1.js:392:7307)
at async Object.t.summarize (https://officefluidprodversionedcdn.azureedge.net/container/hashed/atMentionsComponent.3e08f53224494453f793.js:1:17770)
at async J.summarizeInternal (https://officefluidprodversionedcdn.azureedge.net/container/hashed/officeContainer.2f220687466f2e8891d1.js:162:13855)
at async $.summarize (https://officefluidprodversionedcdn.azureedge.net/container/hashed/officeContainer.2f220687466f2e8891d1.js:162:3095)
at async https://officefluidprodversionedcdn.azureedge.net/container/hashed/officeContainer.2f220687466f2e8891d1.js:231:17197,

[screenshots]

vladsud added the bug (Something isn't working) and msft: oce (Pertaining to Msft-internal on call tickets) labels Nov 19, 2020
vladsud added this to the November 2020 milestone Nov 19, 2020
ghost added the triage label Nov 19, 2020

vladsud commented Nov 19, 2020

Looking at the last 24 hours, here is how many events are in the Performance table per runtimeVersion (>= 0.28.7):
0.28.7 41641
0.28.9 44060
0.29.0 323
0.29.2 1533

And this is how many channel errors we have:
0.28.7 581
0.28.9 102
0.29.0 8
0.29.2 6

So either something got much better in 0.28.9 (but it might still be there), or it's not correlated with runtime version (i.e., it's something external).


vladsud commented Nov 19, 2020

Note - over the last two weeks it got much worse on Fridays, which (given UTC times) very likely points to a read-out file.
A lot of docs are affected, but these 5 have the top hits:

FVDhiOnGL%2BoJneT5IbJWlTBzL8eU8yBOjvN4pcPfrbk%3D 182
9bUoF0vpc%2FYI3VWQZA4riM1F1%2B9Ec9A6Kjuy4gRtZEk%3D 179
bw0amaEcHMTQZDbCPsPl0SR%2B70y1ArE%2FkRVBcIwjY7M%3D 109
VpDT9YnurYKX7aVHhFLtzhlwsOPLQUHKKzEyIS7JFQ8%3D 106
KvVv6a3dJjIAK6eCMkT5Y78%2FWtFkdB8ew3TvUXjvh24%3D 101

The top file also hit a lot of "Cannot read property 'call' of undefined", which may suggest it's related to a mismatch in code versions between the summarizer and the main client. That's actually true for all but the 4th file in this list, so it looks like there is some relationship.

@sumedhb1995

Looking at the git commits, the only change I am seeing specifically in 0.28.7 is 12898b1, and it doesn't look related to this.

0.28.9 contained the fix for the summarizer timeout issue that was causing documents to error out after they had reached the max number of ops prior to summary.

The other changes from 0.28.7 -> 0.28.9 don't look related either.

@sumedhb1995

Joining with the Bohemia errors, I'm seeing the highest number also for 0.28.7 package/loader versions, although across a couple of different app versions:
Office_Fluid_QoS_Error
| distinct Data_hostCorrelationId, Data_hostName, Release_AudienceGroup, Data_hostScenarioName
| join (Office_Fluid_FluidRuntime_Error)
on $left.Data_hostCorrelationId == $right.Data_hostCorrelationId
| where Data_error contains "Channel type not available"
| summarize count() by Data_runtimeVersion, Data_loaderVersion, App_Version

[screenshot]

@sumedhb1995

For the fourth document on the list of top erroring ones above, with ID VpDT9YnurYKX7aVHhFLtzhlwsOPLQUHKKzEyIS7JFQ8%3D, I was also seeing the "Error 409 from the server, msg = , type = cors" error message immediately prior to the "Channel type not available" error:

[screenshot]

union kind=outer database("Office Fluid").Office_Fluid_FluidRuntime_*
| where Data_docId == "VpDT9YnurYKX7aVHhFLtzhlwsOPLQUHKKzEyIS7JFQ8%3D"
| where Event_Time >= datetime(2020-11-03T16:01:31.273Z)
| where Event_Time <= datetime(2020-11-03T18:01:31.273Z)
| project Event_Time, Data_eventName, Data_error, Data_message

@sumedhb1995

For the first document, the first occurrence of "Channel type not available" is also immediately preceded by a 503 error from the service:
[screenshot]
union kind=outer database("Office Fluid").Office_Fluid_FluidRuntime_*
| where Data_eventName contains ":Summar"
| where Event_Time >= datetime(2020-10-26T19:48:14.098Z)
| where Event_Time <= datetime(2020-10-26T21:00:14.616Z)
| where Data_docId == "FVDhiOnGL%2BoJneT5IbJWlTBzL8eU8yBOjvN4pcPfrbk%3D"
| project Event_Time, Data_eventName, Data_error, Data_message

@sumedhb1995

On the document KvVv6a3dJjIAK6eCMkT5Y78%2FWtFkdB8ew3TvUXjvh24%3D as well, we see a 404 from the service prior to the channel error:
union kind=outer database("Office Fluid").Office_Fluid_FluidRuntime_*
| where Event_Time >= datetime(2020-11-19T12:21:52.288Z)
| where Event_Time <= datetime(2020-11-19T13:31:52.288Z)
| where Data_docId == "KvVv6a3dJjIAK6eCMkT5Y78%2FWtFkdB8ew3TvUXjvh24%3D"
| project Event_Time, Data_eventName, Data_error, Data_message
[screenshot]

@sumedhb1995

Found this document, Q50eQ%2FtQmJY8Re0Fc8udsPH621ini7LkRsI9ki12AJE%3D, where we see the error happen after a summary ack timeout instead of a server error:

[screenshot]

union kind=outer database("Office Fluid").Office_Fluid_FluidRuntime_*
| where Event_Time >= datetime(2020-11-18 17:11:34.7100000)
| where Event_Time <= datetime(2020-11-18 19:11:34.7100000)
| where Data_docId == "Q50eQ%2FtQmJY8Re0Fc8udsPH621ini7LkRsI9ki12AJE%3D"
| project Event_Time, Data_eventName, Data_error, Data_message


sumedhb1995 commented Nov 20, 2020

Current versions of the client never pass a this.attachMessageType to the RemoteChannelContext, and haven't for a while now.
So, if .attributes ever fails to be parsed out of the channel context's objectStorage tree, it will automatically hit this error.

I am also seeing this other comment in container.ts, where we parse attributes out of the snapshot tree:
[screenshot]
but is this something specific to the "protocol" tree?

The initial objectStorage tree begins with a baseSnapshot passed in from the dataStoreRuntime, and a channel is created for each path in this tree. So I'm wondering if this also iterates over the "protocol" path. And for that path, since the value is stored in tree.trees[".protocol"].blobs.attributes, it gets back undefined even though it isn't for an older document.
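
For context, a minimal sketch of the failure path described above. This is an assumed reconstruction, not the actual Fluid runtime code: the interfaces, the channelFactories registry, and the readBlob parameter are all hypothetical stand-ins.

// Hypothetical sketch only - names and tree shape are assumptions.
interface ISnapshotTree {
    blobs: Record<string, string>;         // blob name -> blob id
    trees: Record<string, ISnapshotTree>;  // sub-tree name -> sub-tree
}

interface IChannelAttributes {
    type: string; // the registered DDS type name
}

// Assumed registry of DDS factories keyed by channel type name.
const channelFactories = new Map<string, object>();

async function loadChannel(
    tree: ISnapshotTree,
    readBlob: (id: string) => Promise<string>,
): Promise<object> {
    // A sub-tree with no ".attributes" blob yields undefined attributes.
    const blobId: string | undefined = tree.blobs[".attributes"];
    const attributes: IChannelAttributes | undefined =
        blobId === undefined
            ? undefined
            : (JSON.parse(await readBlob(blobId)) as IChannelAttributes);

    const factory =
        attributes === undefined
            ? undefined
            : channelFactories.get(attributes.type);
    if (factory === undefined) {
        // The error surfaced in the telemetry above.
        throw new Error("Channel type not available");
    }
    return factory;
}

Any sub-tree whose attributes cannot be resolved to a registered factory ends up on the throw path, which matches the stacks above.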


vladsud commented Nov 24, 2020

A PR with more telemetry has been submitted.
Moving to December, as it's unlikely we will get enough data to make this actionable in the next 1-2 days.


jatgarg commented Dec 8, 2020

@vladsud I am getting 0 hits in the last 30 days for this query:
Office_Fluid_FluidRuntime_Error
| where Event_Time > ago(30d)
| where Data_error has "Channel type not available"
| where Data_eventName has "Summar"
| summarize count() by Data_eventName, Data_message, Data_stack


vladsud commented Dec 8, 2020

@jatgarg, I do not know what the "has" operator does, but "contains" is the right one, and it finds a ton of hits.


jatgarg commented Dec 8, 2020

Ohh, it's supposed to be the same but better performance-wise, imo. The Kusto docs always say to use "has" instead of "contains". Let me check.


jatgarg commented Dec 10, 2020

All of these errors are for the @ms/atmentions package with id atmentions. No other data store hits this issue.


vladsud commented Dec 13, 2020

Aha, I know what it is. It's the _search/01 blob that we output for search.
The data store runtime assumes that every sub-tree under a data store's tree is a DDS sub-tree. But that's not true here.
So it tries to load a DDS, but that tree does not have the same structure as a DDS, so it fails.

This will (eventually) be addressed by @arinwt's work to move DDSs (and data stores a level up) into their own "channels" sub-tree. Meanwhile we should assess whether these issues are affecting users in any way. If not, we should just wait for the right snapshot structure to arrive.
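
To illustrate the shape involved, here is a simplified, assumed layout of a data store's snapshot tree (reusing the hypothetical ISnapshotTree shape from the sketch earlier in this thread); the blob ids and sub-tree contents are made up:

// The runtime walks every entry under `trees` and loads each one as a
// DDS; "_search" is the odd one out and has no ".attributes" blob.
const dataStoreTree: ISnapshotTree = {
    blobs: {},
    trees: {
        // A real DDS sub-tree: carries ".attributes" naming its type.
        root: { blobs: { ".attributes": "blob1" }, trees: {} },
        // Search output written as _search/01: loading this as a
        // channel throws "Channel type not available".
        _search: { blobs: { "01": "blob2" }, trees: {} },
    },
};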

vladsud modified the milestones: December 2020 → January 2021, Dec 15, 2020

vladsud commented Dec 15, 2020

Keeping this item for tracking only (no actual additional work is required - the issue will go away once snapshots for DDSs are under their own path). Moving to January, as that will not happen this month.


arinwt commented Dec 15, 2020

I think we should do a hotfix to exclude "_search" in the meantime? The change we are waiting for requires multiple releases.

Then we can resolve this issue with the hotfix and leave the other issue to track that change.


vladsud commented Dec 16, 2020

What is not clear to me is why we are trying to load that "channel" at all. It definitely has no ops, as it is not a real channel. It is present (as a sub-tree) in the previous snapshot, and now the summarizer has decided to realize the data store and summarize it, but it should only go after channels that changed (had ops on top of the latest snapshot).

Is this because we switch (for some reason) to a full summary, and thus disregard the incremental process?
If so, the counts are rather high - are we not efficient (as an aggregate)?


vladsud commented Dec 19, 2020

Just hit this issue in the debugger when forcing full = true for summaries. Which makes sense - a full summary tries to rehydrate all DDSs, and it treats _search as a DDS.
This means we are blocking all full summaries, which is not good (though it's good that this does not seem to manifest as directly observed user badness).

@arinwt - can you please assess whether this deserves a patch? It's a simple code change; the only reason I hesitate is that I do not want to see the "_search" term in the Fluid repo, and we need to not forget to delete it once we have a better structure to isolate channels.
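
A sketch of the full-vs-incremental distinction being described (assumed logic with hypothetical names like isDirty; the real runtime's summarization code differs):

async function summarizeDataStore(
    channelContexts: Map<string, { isDirty(): boolean; summarize(): Promise<void> }>,
    fullTree: boolean,
): Promise<void> {
    for (const [, context] of channelContexts) {
        if (!fullTree && !context.isDirty()) {
            // Incremental path: no ops on top of the latest snapshot,
            // so the previous sub-tree is reused and "_search" is
            // never realized.
            continue;
        }
        // Full path: every sub-tree is realized as a channel,
        // including "_search", which is where loadChannel throws.
        await context.summarize();
    }
}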


vladsud commented Dec 19, 2020

This Chapter 1 file (which is very important for our planning) fails (with no tinkering in the debugger) with these errors.
This is not good. I am not sure why summaries are being generated as full.
Also, I see about 20 errors right after load, and then no more errors; not sure why I get 20 errors in a row.

vladsud added a commit to vladsud/FluidFramework that referenced this issue Dec 22, 2020
vladsud added a commit that referenced this issue Dec 22, 2020
vladsud added a commit that referenced this issue Dec 22, 2020

vladsud commented Dec 23, 2020

A fix has been submitted to ignore the _search sub-tree. This issue will now track the reversal of that fix once it becomes unnecessary with the work that Arin is doing.
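
For readers landing here, a hedged sketch of the kind of filter such a fix implies (the actual change lives in the commits referenced above and may differ; getChannelSubTrees is a hypothetical helper name):

const nonChannelSubTrees = new Set(["_search"]);

function getChannelSubTrees(
    tree: ISnapshotTree,
): Array<[string, ISnapshotTree]> {
    // Skip known non-DDS sub-trees so they are never loaded as channels.
    return Object.entries(tree.trees).filter(
        ([name]) => !nonChannelSubTrees.has(name),
    );
}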


vladsud commented Jan 28, 2021

Forked the only remaining work item here as #4963

vladsud closed this as completed Jan 28, 2021