Persist repository build logs for later access #1156

Open · betatim opened this issue Sep 30, 2020 · 9 comments

Comments

betatim (Member) commented Sep 30, 2020

Proposed change

When a repository is being built we stream the output of repo2docker to the page the user sees. It would be great if this log were also available after the build has failed. For example, builds can take a very long time (hours or more), during which users close the tab or otherwise leave. They have no way to recover the build log once the build has failed; the only option is to restart the build and wait this time. It is also hard to copy & paste text from the build log to share for debugging purposes. In both cases it would help to have a stable URL that shows the build log for some time.

[Screenshot: the build log as streamed to the user during a build, 2020-09-30]

This is an attempt to make an actionable issue out of #155.

Alternative options

For successful builds we could store the build log in the container image as a special file. This way users could access it from "within" their binder. This would not help users for whom the build fails, as they'd never get a launching binder.

Who would use this feature?

People who have long running builds that need debugging or otherwise want to share/look at the output of a build.

(Optional): Suggest a solution

A new endpoint in BinderHub that outputs the build log of the last build for a given "repo spec". The log would be overwritten the next time the same spec is built. It would be a reasonably stable URL that is easy to discover. An alternative would be to assign a "build number" to each build; this would be super stable but creates the challenge of how a user would discover the build number of their build.

Build logs can be large, so it is probably not feasible to store them in the process memory of BinderHub. Keeping them in memory or on disk in the pod also means that, on a cluster running several instances of the binderhub pod, you'd need a mechanism for routing requests for a log to the right pod. This points towards an additional service in which to store the logs.

The BinderHub process already sees a copy of the build log as it streams it to users. It could stream it to a log sink at the same time. Or the log could be archived independently of the BinderHub process, directly from the repo2docker pod.
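To make the log-sink idea concrete, here is a minimal sketch of fanning each build event out to both the user stream and a sink for later upload. `LogSink`, `stream_build`, and `emit_to_user` are illustrative names, not existing BinderHub APIs, and the real event format may differ.

import json

class LogSink:
    """Illustrative sink that collects build log lines for later upload."""
    def __init__(self):
        self.lines = []

    def append(self, text):
        self.lines.append(text)

    def dump(self):
        return "".join(self.lines)

async def stream_build(build_events, emit_to_user, sink):
    # `build_events` stands in for the JSON messages BinderHub already receives
    # from repo2docker; `emit_to_user` stands in for the existing user stream.
    async for event in build_events:
        message = json.loads(event).get("message", "")
        await emit_to_user(message)   # existing behaviour: show it to the user
        sink.append(message)          # new: keep a copy to archive afterwards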

manics (Member) commented Oct 1, 2020

Using an S3-compatible object store might be an option? It's easier to set up on public cloud than managing a K8s persistent volume, and it also means we don't need an endpoint to download the logs: if the bucket is public you can give people a direct HTTPS download link.

You can optionally specify a TTL for auto-deletion, and it should work out cheaper than a disk volume.
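For the TTL part, S3-compatible stores support lifecycle rules; a sketch with boto3, where the bucket name, prefix, and 30-day expiry are placeholders:

import boto3

s3 = boto3.client("s3")

# Placeholder bucket/prefix: expire archived build logs automatically after 30 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="binder-build-logs",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-build-logs",
                "Filter": {"Prefix": "buildlogs/"},
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            }
        ]
    },
)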

minrk (Member) commented Oct 2, 2020

@manics That sounds like a great solution. We would need to:

  1. collect / process logs in binderhub prior to deletion of the pod (logs are actually JSON; what folks want is the terminal-processed ANSI output of the log messages, see the sketch below this list)
  2. upload those to public blob store with TTL, etc.
  3. distribute stable URL for blob to user pods / other places folks might be interested. An API with deterministic blob resolution would work, too
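For point 1, a rough sketch of collapsing the JSON events back into the text users saw; the `message` field name is an assumption about the repo2docker/BinderHub event schema:

import json

def render_log(json_lines):
    """Join the `message` fields of JSON log events into plain text.

    Assumes each event is a JSON object with a `message` key; adjust to the
    real repo2docker/BinderHub event schema.
    """
    out = []
    for line in json_lines:
        try:
            event = json.loads(line)
        except ValueError:
            out.append(line)  # pass through anything that isn't JSON
            continue
        out.append(event.get("message", ""))
    return "".join(out)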

consideRatio (Member) commented:

I wonder how much can be accomplished by already existing software related to k8s logging, and what makes sense to develop ourselves. Here are some relevant links to consider this further.

betatim (Member, Author) commented Oct 3, 2020

I like the idea of storing the "end product" in an S3 bucket. Especially if what we store is "finished HTML", so that serving a "log page" can be done by nginx or even directly from the bucket.

I read the k8s logging docs and am not sure which of the scenarios they list would fit us best. An idea from the guide that I like is to add a sidecar to the repo2docker Pod that takes care of processing and streaming the logs to the bucket.

I have used filebeat to ship JSONL from a file in a container to an Elasticsearch instance. It worked well once set up, but I found the filebeat documentation confusing/hard to read, so it involved a lot of trial & error. A quick google suggests filebeat can't ship to S3. On the one hand, having an off-the-shelf tool do this for us would be nice (one less thing to maintain); on the other hand, it might be as much or more work to find and configure one as it is to write a small utility ourselves that does exactly what we want (produce ANSI-coloured HTML output).

For (3) from Min's suggestion:

Thinking about the API endpoint: we could have http://binder/v2/logs/<buildspec> which redirects you to the appropriate bucket (appropriate == the latest one for that spec?) with a permanent redirect. That way you'd get a stable URL to share as long as you visit http://binder/v2/logs/<buildspec> at approximately the right time. (One day we could maybe build an extension like http://binder/v2/logs/<buildspec>/<datetime> which tries to find the most likely bucket for that timestamp.)
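A rough sketch of what such an endpoint could look like as a Tornado handler; the handler, the `log_url_for_spec` helper, and the bucket URL layout are all hypothetical, not existing BinderHub code:

from tornado import web

async def log_url_for_spec(spec):
    """Hypothetical lookup: map a repo spec to the bucket URL of its latest log."""
    # e.g. derive the deterministic image name for `spec` and return the
    # matching object URL, or None if no log has been archived yet.
    return f"https://binder-build-logs.example.org/buildlogs/{spec}.log"

class BuildLogRedirectHandler(web.RequestHandler):
    """Serve /v2/logs/<buildspec> by redirecting to the archived log object."""

    async def get(self, spec):
        bucket_url = await log_url_for_spec(spec)
        if bucket_url is None:
            raise web.HTTPError(404, f"No build log found for {spec}")
        # Permanent redirect, so the resolved bucket URL is what people share.
        self.redirect(bucket_url, permanent=True)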

What is a nice way to make the URL of the log available inside the launched Binder, or do we skip that for now? I'm not sure how to do that nicely. Could BinderHub use that API endpoint to work out the bucket URL (or use the code behind that endpoint directly) and set it as an environment variable? Would we frequently be sending people to the previous build's logs because log processing hadn't completed or the new object wasn't ready yet when the Pod is launched after being built?

manics (Member) commented Oct 3, 2020

I think it'd be best to store the raw logs (plain text) instead of wrapping them in HTML. It's what Travis and GitHub Actions do, and it means that if you've got a large log you can easily download and search it. Initially I think giving users a direct link to the plain text is fine, since this is a new feature and you'll probably need some developer experience to understand it. A second version could add a simple HTML viewer to BinderHub.

I don't think you can easily stream logs to S3, so it'd be a case of writing them to a temporary file and uploading at the end of the build. (Edit: see https://stackoverflow.com/a/8881939)
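A sketch of the write-to-a-temporary-file-then-upload approach with boto3; the bucket name and content type are placeholders:

import tempfile

import boto3

s3 = boto3.client("s3")

def upload_build_log(log_text, key, bucket="binder-build-logs"):
    """Write the collected log to a temporary file and upload it in one go."""
    with tempfile.NamedTemporaryFile("w", suffix=".log") as f:
        f.write(log_text)
        f.flush()
        s3.upload_file(f.name, bucket, key,
                       ExtraArgs={"ContentType": "text/plain"})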

We already calculate a deterministic image name for each build, so if we name the log something like /bucket/buildlogs/<image-name>.log or /bucket/buildlogs/<image-name>/repo2docker.log we could inject it as an environment variable at run time?
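If the object key is derived from the deterministic image name, the run-time injection could be as simple as computing the URL and passing it to the pod; the function, bucket, and environment variable names below are assumptions:

def build_log_url(image_name, bucket="binder-build-logs"):
    """Hypothetical mapping from the deterministic image name to a stable log URL."""
    return f"https://{bucket}.s3.amazonaws.com/buildlogs/{image_name}.log"

# At launch time the URL could be passed into the user pod, for example as
# an environment variable (the variable name is an assumption):
# env["BINDER_BUILD_LOG_URL"] = build_log_url(image_name)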

manics (Member) commented Oct 3, 2020

It seemed like a fun mini-project to investigate on a rainy day: jupyterhub/repo2docker#967

The S3 upload bit should be relatively easy whether it's in repo2docker or BinderHub; as has been pointed out, the difficult bit is getting hold of all the logs we want.

I also found moby/buildkit#1472, which, if implemented, would allow us to access logs inside a container that failed to build.

manics (Member) commented Oct 4, 2020

Thinking ahead to run-time logs, using a centralised logging system could work but we'd still need some infrastructure on top of it to filter the relevant logs for users. I don't think it's safe to make all logs public to everyone since there may be private information in there, either related to launching Jupyter or because users have run something private in their container that has emitted some logs.

If this were limited to just the launch phase, and BinderHub knows the pod name, it could do something like upload the output of clientapi.read_namespaced_pod_log(namespace=NAMESPACE, name=PODNAME) to S3 for both successful and failed launches. If the bucket policy is set to prevent listing objects, a user would only have access to the log file if they were given the URL.
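A sketch of that idea using the official kubernetes Python client and boto3; the function, bucket, and key naming are placeholders:

import boto3
from kubernetes import client, config

def archive_launch_log(namespace, pod_name, key, bucket="binder-build-logs"):
    """Fetch the pod's log and store it under a non-listable key in the bucket."""
    config.load_incluster_config()  # assumes this runs inside the cluster
    core = client.CoreV1Api()
    log_text = core.read_namespaced_pod_log(namespace=namespace, name=pod_name)

    boto3.client("s3").put_object(
        Bucket=bucket,
        Key=key,
        Body=log_text.encode("utf-8"),
        ContentType="text/plain",
    )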

minrk (Member) commented Oct 5, 2020

I think we can handle run-time logs with a much simpler approach by using an entrypoint that tees the 'real' entrypoint output to a file folks can read within the container, rather than deployment-specific external storage.

Something like

#!/bin/sh
exec real-entrypoint "$@" 2>&1 | tee server-log.txt

It doesn't persist beyond the life of the container, but I think that's a good thing.

meeseeksmachine commented:

This issue has been mentioned on Jupyter Community Forum. There might be relevant details there:

https://discourse.jupyter.org/t/accessing-the-jupyter-notebook-logs/6263/2
