
2.6.0: opening storage failed: mkdir data/: read-only file system #5043

Closed
cmoad opened this Issue Dec 26, 2018 · 23 comments


cmoad commented Dec 26, 2018

Bug Report

What did you do?

Upgrading the Docker image from 2.5.0 to 2.6.0 introduces a fatal error: "opening storage failed: mkdir data/: read-only file system". Deleting everything and reverting to 2.5.0 with the exact same configuration does not exhibit this problem.

What did you expect to see?

A clean startup; I don't see anything documented indicating that changes are required for 2.6.0.

Environment
This is running on a vanilla minikube cluster with Kubernetes 1.11.6.

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  namespace: monitor
  name: prometheus
spec:
  replicas: 1
  revisionHistoryLimit: 2
  strategy:
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 0
  template:
    metadata:
      name: prometheus
      labels:
        service: prometheus
        apiVersion: v2
        app: prometheus
    spec:
      containers:
      - name: prometheus
        image: quay.io/prometheus/prometheus:v2.5.0
        imagePullPolicy: Always
        ports:
        - name: web
          containerPort: 9090
        livenessProbe:
          tcpSocket:
            port: 9090
          initialDelaySeconds: 30
          timeoutSeconds: 5
        readinessProbe:
          tcpSocket:
            port: 9090
        resources:
          requests:
            cpu: 10m
            memory: 32Mi
          limits:
            memory: 64Mi
        volumeMounts:
        - name: config
          mountPath: /etc/prometheus
        - name: data
          mountPath: /prometheus
      volumes:
      - name: config
        configMap:
          name: prometheus
      - name: data
        persistentVolumeClaim:
          claimName: prometheus

• System information:

  quay.io/prometheus/prometheus:v2.6.0

• Prometheus version:

  v2.6.0

• Logs:

level=info ts=2018-12-26T14:41:12.22879461Z caller=main.go:243 msg="Starting Prometheus" version="(version=2.6.0, branch=HEAD, revision=dbd1d58c894775c0788470944b818cc724f550fb)"
level=info ts=2018-12-26T14:41:12.228845931Z caller=main.go:244 build_context="(go=go1.11.3, user=root@bf5760470f13, date=20181217-15:14:46)"
level=info ts=2018-12-26T14:41:12.228864658Z caller=main.go:245 host_details="(Linux 4.15.0 #1 SMP Fri Dec 21 23:51:58 UTC 2018 x86_64 prometheus-85c56c84d8-chf77 (none))"
level=info ts=2018-12-26T14:41:12.228880573Z caller=main.go:246 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2018-12-26T14:41:12.228894018Z caller=main.go:247 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2018-12-26T14:41:12.229899898Z caller=main.go:561 msg="Starting TSDB ..."
level=info ts=2018-12-26T14:41:12.229984509Z caller=web.go:429 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2018-12-26T14:41:12.229996811Z caller=main.go:430 msg="Stopping scrape discovery manager..."
level=info ts=2018-12-26T14:41:12.230007845Z caller=main.go:444 msg="Stopping notify discovery manager..."
level=info ts=2018-12-26T14:41:12.23001292Z caller=main.go:466 msg="Stopping scrape manager..."
level=info ts=2018-12-26T14:41:12.230018683Z caller=main.go:440 msg="Notify discovery manager stopped"
level=info ts=2018-12-26T14:41:12.23008139Z caller=main.go:426 msg="Scrape discovery manager stopped"
level=info ts=2018-12-26T14:41:12.23009587Z caller=main.go:460 msg="Scrape manager stopped"
level=info ts=2018-12-26T14:41:12.230107221Z caller=manager.go:664 component="rule manager" msg="Stopping rule manager..."
level=info ts=2018-12-26T14:41:12.230125069Z caller=manager.go:670 component="rule manager" msg="Rule manager stopped"
level=info ts=2018-12-26T14:41:12.230134183Z caller=notifier.go:521 component=notifier msg="Stopping notification manager..."
level=info ts=2018-12-26T14:41:12.230144125Z caller=main.go:615 msg="Notifier manager stopped"
level=error ts=2018-12-26T14:41:12.230506795Z caller=main.go:624 err="opening storage failed: mkdir data/: read-only file system"

juliusv (Member) commented Dec 26, 2018

CC @SuperQ, you made some changes to the Dockerfile in #4976, could that be related?

juliusv (Member) commented Dec 26, 2018

I see that the new Docker image now makes /prometheus a symlink in the image itself; does that interfere with the ability to mount volumes directly onto that path?
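
For example, one quick way to inspect this (a sketch; it assumes the busybox-based image ships /bin/sh, which can be swapped in for the entrypoint):

docker run --rm --entrypoint /bin/sh quay.io/prometheus/prometheus:v2.6.0 -c 'ls -ld /prometheus'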

aixeshunter (Contributor) commented Dec 28, 2018

@cmoad Hi, is your PV a hostPath volume or some other remote storage? Did the storage work?

cmoad (Author) commented Dec 29, 2018

@aixeshunter Here is the volume claim definition. Minikube takes care of the rest.

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  namespace: monitor
  name: prometheus
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi

Dravere commented Jan 3, 2019

I can confirm that old data is not loaded in Prometheus Docker v2.6.0 when the data is located in a mounted volume.

My docker run configuration:

docker run -d \
    --name prometheus \
    --hostname prometheus \
    --restart always \
    --volume /opt/prometheus/conf:/etc/prometheus/conf:ro \
    --volume /opt/prometheus/data:/prometheus \
    --log-opt max-size=1m --log-opt max-file=5 \
    --env "TZ=Europe/Zurich" \
    --network monitoring \
    prom/prometheus:v2.6.0 --web.external-url=https://nowhere.example.org --config.file=/etc/prometheus/conf/prometheus.yml

chlunde commented Jan 3, 2019

We also had this issue. It works when changing the data path:

        volumeMounts:
        - name: data
          mountPath: /prometheus-data
        args:
        - '--config.file=/etc/prometheus/prometheus.yaml'
        - '--storage.tsdb.path=/prometheus-data'  # the added flag
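
For plain Docker users hitting the same error, the equivalent idea is to mount the data volume somewhere other than /prometheus and point --storage.tsdb.path at it. A sketch (the host paths and the /prometheus-data mount point are just examples):

docker run -d \
    --volume /opt/prometheus/conf:/etc/prometheus/conf:ro \
    --volume /opt/prometheus/data:/prometheus-data \
    prom/prometheus:v2.6.0 \
    --config.file=/etc/prometheus/conf/prometheus.yml \
    --storage.tsdb.path=/prometheus-data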

ezraroi commented Jan 3, 2019

Yes indeed, we also faced the same issue when we upgraded the Docker image from 2.5.0 to 2.6.0.

roman-vynar (Contributor) commented Jan 4, 2019

This breaking change was reported 17 days ago and is still not reflected in the release notes for 2.6.0. This is very sad.

lparet-yseop commented Jan 8, 2019

Same issue for me when upgrading Prometheus from 2.5.0 to 2.6.0:
err="opening storage failed: mkdir data/: permission denied"

Any news on this issue? I think it is a real breaking change for many people.

simonpasquier (Member) commented Jan 8, 2019

First of all, we're truly sorry for the hassle this change has introduced for some of you. When we reviewed PR #4976, we genuinely thought that we had a solution that would improve the container image without breaking things. Reverting the change might hurt deployments that have already been updated, so I'm hesitant to go down this path.

I've updated the changelog text for v2.6.0 on the Releases page to flag it as a change and submitted #5078 as well.

In the future, I'll do my best to be stricter and to appropriately flag changes that might be breaking for the community.

SuperQ (Member) commented Jan 8, 2019

Yes, sorry for not also replying here. I'm not a Docker expert, but I also thought this wasn't going to cause as many issues as it has.

I wanted to get our Dockerfile into a state where it's as clean as possible. The original layout was extremely messy, and I thought we could clean it up.

I'm also open to rolling back to the previous layout and designing a new layout for a future release.

Dravere commented Jan 8, 2019

What would be good is a recommended way to upgrade now. Should we use the approach chlunde wrote? Or should we map the mounted volume to /etc/prometheus/data? That would be a really strange location to store data.

epuertat added a commit to rhcs-dashboard/ceph-dev that referenced this issue Jan 11, 2019

Do not pull latest Prometheus image
As it has breaking changes (it expects the mounted location to be rw and store data in it) due to prometheus/prometheus#4976 (prometheus/prometheus#5043), let's stay in the previously working version.

ilude commented Jan 18, 2019

Please take this as constructive criticism and not some rant about the team owing me or anyone else anything! You're doing great work!

I'm at a loss to see how the symlink introduced in #4976 improves anything. I was trying to set up and test Prometheus and spent the better part of two days trying to figure out why I was getting this permission issue, as Google does not surface this issue. And even once I did find it, without looking at the changes in the commit it's still nearly impossible to understand what's going on and why anything mentioned in this thread fixes things. "opening storage failed: mkdir data/: permission denied" is a relative-path error and distinctly unhelpful in troubleshooting the issue.

Symlinking your data directory into /etc makes no sense to me and goes against several decades of Unix filesystem convention. Configs go in /etc; data should be in /opt or /var, or in Docker's case /prometheus, /data, or /prometheus/data would be acceptable.

On top of all this, you've outdated and broken every Prometheus Docker tutorial out on the web with this change. I would urge you to reconsider and possibly revert the symlink, or at least add a section to the main README that addresses this change, the permission-denied error it causes, the changes needed to fix it, and a well-thought-out explanation of why you needed to do this. I can only imagine that it's going to affect your adoption/usage rates for the next year, at least, as folks get frustrated following online Docker tutorials that no longer work.

In closing, please understand that I'm not mad; I understand that changes happen and are often needed. I don't have the time to dig into this change and understand why the symlink was needed, so I may in fact be the one who's in the wrong!

SuperQ (Member) commented Jan 18, 2019

@ilude The improvement we were trying to make was to get rid of all of the flag changes in the ENTRYPOINT. This was intended to fix issues introduced by the previous change, #4796.

The overall problem we were trying to solve is that if a user wants to use the container but add one additional flag, like --storage.tsdb.retention, they used to have to reproduce the entire CMD flag list for the container to work like the default.
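
For example, with the pre-change image (assuming the 2.5.0 Dockerfile's CMD defaults, quoted here from memory), adding a single flag meant restating every default:

# The default invocation works because CMD supplies the full flag set:
docker run prom/prometheus:v2.5.0

# Passing any argument replaces CMD entirely, so all the defaults had to be repeated:
docker run prom/prometheus:v2.5.0 \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/prometheus \
    --web.console.libraries=/usr/share/prometheus/console_libraries \
    --web.console.templates=/usr/share/prometheus/consoles \
    --storage.tsdb.retention=30d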

The main problem here is that we're stuck with how all of this works due to existing deployments, and the original configuration was really annoying. Had I done this in the first place, I would have started with WORKDIR /prometheus and just done everything in one location, rather than trying to conform to normal Unix filesystem layouts, because they really don't matter in a container like this.

I just didn't think about all the permutations of how people have been using this Dockerfile before making the symlink change.

roman-vynar (Contributor) commented Jan 18, 2019

IMHO, the best way to fix this issue at the moment is to revert the change, and later do something about the flags that takes into account what has happened.

ilude commented Jan 18, 2019

@SuperQ That makes it a bit clearer, but I'm still not sure you're going about it the right way. I would humbly recommend that you consider rolling back the change and spinning up a 2.0 branch that fixes the WORKDIR, which seems to be the real problem, instead of trying to mess with the existing version's image file layout. The way it is now, you are breaking existing deployments whose current volume setups cause these issues when they upgrade, and you are invalidating any tutorials that were written by the community up to this point.

I understand that the rollback has its own risks for folks who have already dealt with this. But that's what README and CHANGELOG files are for. Make it clear what you are doing and why you are doing it, and continue to update the docs/wiki with info on how to deal with the common issues that may come up from the revert.

https://semver.org/

Given a version number MAJOR.MINOR.PATCH, increment the:

MAJOR version when you make incompatible API changes,
MINOR version when you add functionality in a backwards-compatible manner, and
PATCH version when you make backwards-compatible bug fixes.

SuperQ (Member) commented Jan 18, 2019

Yes, I know, you don't need to mansplain semver. This is already our policy.

We didn't expect this to be a breaking change, hence it wasn't documented as one.

I have already considered a simple rollback to the previous Dockerfile and making changes in 3.0. But like @roman-vynar says, a rollback is now just as problematic.

juliusv (Member) commented Jan 18, 2019

To be fair, we do not know for sure whether it's more problematic to roll back or to keep it like it is now. It depends on how many people have already upgraded to this version. If I had to guess, I'd say a lot more people are still on older Prometheus versions and still have this upgrade pain ahead of them.

Thus I'm slightly in favor of rolling it back. Wondering whether it is still possible to do in 2.7.0 though (rc0 released today) /cc @gouthamve.

SuperQ (Member) commented Jan 18, 2019

Consider this my vote for a rollback.

Dravere commented Jan 18, 2019

If we were to follow Semver, even the rollback would be a major version change. I wonder if this actually shows another problem: the Docker packaging isn't really part of the Prometheus version itself. Maybe it should follow an independent, or at least only partially related, version?

juliusv (Member) commented Jan 18, 2019

@Dravere The Dockerfile wasn't explicitly mentioned in the stability guarantees in https://prometheus.io/blog/2016/07/18/prometheus-1-0-released/#what-does-1-0-mean-for-you, so we don't have an official rule for that. Most likely we would not have made a breaking change intentionally though.

While rolling back would be another breaking change, you could also see a rollback as a bugfix of a previous unintentional breakage that should never have happened. Especially since most users have yet to hit it.

Dravere commented Jan 18, 2019

Well, according to Semver, rolling back a breaking change is itself a breaking change. But I get your point. I'm not opposed to it; I rolled back to 2.5.0 and am currently waiting for a resolution of this issue.

Another possible way to fix this without breaking everything again might be to add the entrypoint via a bash/shell script or similar, something that I have seen in a lot of places. You could do some additional checks in there before deciding how you want to start up, and arguments could be passed through without a problem. There is even a mention of such a method in the Docker docs: https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#entrypoint (an example of how Postgres does it).
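
A minimal sketch of what such an entrypoint script could look like (hypothetical; modeled on the Postgres pattern, with the default flag values assumed from the pre-2.6 image):

#!/bin/sh
set -e
# If the first argument looks like a flag, the user only wants to append
# options, so prepend the prometheus binary together with its default flags.
if [ "${1#-}" != "$1" ]; then
    set -- /bin/prometheus \
        --config.file=/etc/prometheus/prometheus.yml \
        --storage.tsdb.path=/prometheus \
        "$@"
fi
exec "$@"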

simonpasquier (Member) commented Jan 21, 2019

After more thought, and although most of the attention on this issue is from people who have been hurt by the change, I would also be inclined to revert to the previous behavior. But @gouthamve is the release shepherd for v2.7, so he has the final word on this.

gouthamve added a commit to gouthamve/prometheus that referenced this issue Jan 21, 2019

Rollback Dockerfile to version @ 2.5.x
Fixes prometheus#5043

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>

gouthamve added a commit that referenced this issue Jan 21, 2019

Rollback Dockerfile to version @ 2.5.x (#5122)
Fixes #5043

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>

pgier added a commit to pgier/prometheus that referenced this issue Jan 23, 2019

revert Dockerfile to older settings
See prometheus#5043
Also, this brings Dockerfile more in sync with Dockerfile.rhel