
Ports values were swapped in the UI based on Mesos data #4781

Closed
marcomonaco opened this issue Dec 6, 2016 · 5 comments

marcomonaco commented Dec 6, 2016

copy of https://mesosphere.atlassian.net/browse/DCOS-9629
reported by @lloesche

I have a service that uses persistent volumes and two tcp ports.
When the service is restarted, it comes back on the same host using the same random host ports. However, the mapping of container to host port might be shuffled.
E.g.
container 9090 -> host 21926
container 9093 -> host 21927
RESTART
container 9090 -> host 21927
container 9093 -> host 21926

However, the DC/OS UI shows the host-to-container port mapping incorrectly every now and then.
It's correct after the initial start but sometimes wrong after the ports have been shuffled by a restart.
[screenshot: screen shot 2016-09-06 at 19 04 13]
[attachment: state.json.txt]

Example /service/marathon/v2/apps/prometheus/server output fetched when the screenshot above was taken.

{
  "app": {
    "id": "/prometheus/server",
    "cmd": null,
    "args": null,
    "user": null,
    "env": {
      "PAGERDUTY_KEY": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
      "NODE_EXPORTER_SRV": "_node-exporter.prometheus._tcp.marathon.mesos"
    },
    "instances": 1,
    "cpus": 0.5,
    "mem": 2048,
    "disk": 0,
    "gpus": 0,
    "executor": "",
    "constraints": [],
    "uris": [],
    "fetch": [],
    "storeUrls": [],
    "backoffSeconds": 1,
    "backoffFactor": 1.15,
    "maxLaunchDelaySeconds": 3600,
    "container": {
      "type": "DOCKER",
      "volumes": [
        {
          "containerPath": "prometheus",
          "mode": "RW",
          "persistent": {
            "size": 1024
          }
        }
      ],
      "docker": {
        "image": "lloesche/prometheus-dcos",
        "network": "BRIDGE",
        "portMappings": [
          {
            "containerPort": 9090,
            "hostPort": 0,
            "servicePort": 10001,
            "protocol": "tcp",
            "labels": {
              "VIP_0": "/prometheus/server:9090"
            }
          },
          {
            "containerPort": 9093,
            "hostPort": 0,
            "servicePort": 10002,
            "protocol": "tcp",
            "labels": {
              "VIP_1": "/prometheus/server:9093"
            }
          }
        ],
        "privileged": false,
        "parameters": [],
        "forcePullImage": true
      }
    },
    "healthChecks": [
      {
        "path": "/metrics",
        "protocol": "HTTP",
        "portIndex": 0,
        "gracePeriodSeconds": 300,
        "intervalSeconds": 60,
        "timeoutSeconds": 20,
        "maxConsecutiveFailures": 3,
        "ignoreHttp1xx": false
      }
    ],
    "readinessChecks": [],
    "dependencies": [],
    "upgradeStrategy": {
      "minimumHealthCapacity": 0.5,
      "maximumOverCapacity": 0
    },
    "labels": {},
    "acceptedResourceRoles": null,
    "ipAddress": null,
    "version": "2016-09-06T16:07:36.977Z",
    "residency": {
      "relaunchEscalationTimeoutSeconds": 3600,
      "taskLostBehavior": "WAIT_FOREVER"
    },
    "secrets": {},
    "taskKillGracePeriodSeconds": null,
    "ports": [
      10001,
      10002
    ],
    "portDefinitions": [
      {
        "port": 10001,
        "protocol": "tcp",
        "labels": {}
      },
      {
        "port": 10002,
        "protocol": "tcp",
        "labels": {}
      }
    ],
    "requirePorts": false,
    "versionInfo": {
      "lastScalingAt": "2016-09-06T16:07:36.977Z",
      "lastConfigChangeAt": "2016-09-06T13:33:54.912Z"
    },
    "tasksStaged": 0,
    "tasksRunning": 1,
    "tasksHealthy": 1,
    "tasksUnhealthy": 0,
    "deployments": [],
    "tasks": [
      {
        "id": "prometheus_server.9f5e60f6-7435-11e6-b7c8-70b3d5800001",
        "slaveId": "cdf14879-60f4-484e-8179-b8962c61322e-S5",
        "host": "167.114.254.10",
        "state": "TASK_RUNNING",
        "startedAt": "2016-09-06T16:07:51.070Z",
        "stagedAt": "2016-09-06T16:07:42.513Z",
        "ports": [
          21927,
          21926
        ],
        "version": "2016-09-06T16:07:36.977Z",
        "ipAddresses": [
          {
            "ipAddress": "172.17.0.2",
            "protocol": "IPv4"
          }
        ],
        "localVolumes": [
          {
            "containerPath": "prometheus",
            "persistenceId": "prometheus_server#prometheus#9f5debc5-7435-11e6-b7c8-70b3d5800001"
          }
        ],
        "appId": "/prometheus/server",
        "healthCheckResults": [
          {
            "alive": true,
            "consecutiveFailures": 0,
            "firstSuccess": "2016-09-06T16:08:37.051Z",
            "lastFailure": null,
            "lastSuccess": "2016-09-06T17:02:38.127Z",
            "lastFailureCause": null,
            "taskId": "prometheus_server.9f5e60f6-7435-11e6-b7c8-70b3d5800001"
          }
        ]
      }
    ],
    "lastTaskFailure": {
      "appId": "/prometheus/server",
      "host": "167.114.254.10",
      "message": "Container terminated",
      "state": "TASK_FAILED",
      "taskId": "prometheus_server.9f5e60f6-7435-11e6-b7c8-70b3d5800001",
      "timestamp": "2016-09-06T16:07:38.346Z",
      "version": "2016-09-06T16:07:36.977Z",
      "slaveId": "cdf14879-60f4-484e-8179-b8962c61322e-S5"
    }
  }
}

The Load Balancer UI shows the correct mapping, by the way; it's only the Service view that doesn't.
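
For anyone hitting this, a minimal way to cross-check which mapping is real is to compare Marathon's per-task ports with the port_mappings the Mesos master reports. A sketch in Python (the cluster URL, auth token, and the exact /mesos/master/state.json nesting are assumptions, not taken from this issue):

import json
import urllib.request

DCOS_URL = "https://dcos.example.com"   # hypothetical cluster URL
TOKEN = "..."                           # DC/OS auth token, if the cluster requires one
APP_ID = "/prometheus/server"

def get(path):
    # Fetch a JSON document through the admin router.
    req = urllib.request.Request(DCOS_URL + path,
                                 headers={"Authorization": "token=" + TOKEN})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Marathon's view: host ports per task, in portMappings order.
app = get("/service/marathon/v2/apps" + APP_ID)["app"]
for task in app["tasks"]:
    print("Marathon", task["id"], "ports:", task["ports"])

# Mesos master's view: the docker port_mappings recorded for each task.
state = get("/mesos/master/state.json")
for framework in state.get("frameworks", []):
    for task in framework.get("tasks", []):
        docker = task.get("container", {}).get("docker", {})
        for pm in docker.get("port_mappings", []):
            print("Mesos", task["id"], pm["container_port"], "->", pm["host_port"])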

@marcomonaco marcomonaco added the bug label Dec 6, 2016
@marcomonaco marcomonaco added this to the Marathon 1.4 milestone Dec 6, 2016
@jdef jdef added the gui label Dec 7, 2016

jdef commented Dec 7, 2016

It sounds like this is a DC/OS UI bug rather than a Marathon bug.


unterstein commented Dec 7, 2016

I think the data source for this display is the groups API (the UI always uses the groups endpoint), inside the nodes under $appId > container. I don't think the UI manipulates this data. @wavesoft @orlandohohmeier please confirm :)
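
For reference, a quick way to look at exactly what that endpoint returns is something like the sketch below (the Marathon base URL and the embed parameters are assumptions about a standard setup, not taken from this issue):

import json
import urllib.request

MARATHON = "http://localhost:8080"  # hypothetical Marathon base URL

# Embed apps and their tasks so both the configured portMappings and the
# per-task host ports show up in one response.
url = (MARATHON + "/v2/groups"
       "?embed=group.groups&embed=group.apps&embed=group.apps.tasks")
with urllib.request.urlopen(url) as resp:
    root = json.load(resp)

def walk(group):
    for app in group.get("apps", []):
        mappings = (app.get("container", {})
                       .get("docker", {})
                       .get("portMappings", []) or [])
        print(app["id"], "containerPorts:",
              [m["containerPort"] for m in mappings])
        for task in app.get("tasks", []):
            print("  task", task["id"], "host ports:", task.get("ports"))
    for sub in group.get("groups", []):
        walk(sub)

walk(root)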

unterstein commented

Talked to @lloesche; he said that this happens only for resident tasks and that it is a UI issue. The API worked as expected. I tried to reproduce this on a current testing/master cluster but it did not happen within 20 restarts.
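
The restart loop for that was roughly of the following shape (a sketch, assuming a Marathon reachable at http://localhost:8080; the app id and the sleep interval are placeholders):

import time
import urllib.request

MARATHON = "http://localhost:8080"   # hypothetical Marathon base URL
APP_ID = "/my-resident-app"          # hypothetical resident app id

for i in range(20):
    # POST /v2/apps/{appId}/restart kills the running tasks and relaunches them.
    req = urllib.request.Request(MARATHON + "/v2/apps" + APP_ID + "/restart",
                                 data=b"", method="POST")
    with urllib.request.urlopen(req) as resp:
        print("restart", i + 1, "->", resp.status)
    # Give the resident task time to come back before the next restart.
    time.sleep(60)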

@unterstein unterstein self-assigned this Dec 8, 2016
@unterstein unterstein removed the ready label Dec 8, 2016

unterstein commented Dec 8, 2016

With the following app definition this behavior is reproducible:

{
  "id": "/sleepy",
  "cmd": "sleep 1000",
  "cpus": 1,
  "mem": 128,
  "disk": 100,
  "instances": 1,
  "executor": null,
  "fetch": null,
  "constraints": null,
  "acceptedResourceRoles": null,
  "user": null,
  "container": {
    "docker": {
      "image": "ubuntu",
      "forcePullImage": false,
      "privileged": false,
      "portMappings": [
        {
          "containerPort": 80,
          "protocol": "tcp"
        },
        {
          "containerPort": 443,
          "protocol": "tcp"
        }
      ],
      "network": "BRIDGE"
    },
    "type": "DOCKER",
    "volumes": [
      {
        "containerPath": "data",
        "persistent": {
          "size": 100
        },
        "mode": "RW"
      }
    ]
  },
  "updateStrategy": {
    "maximumOverCapacity": 0,
    "minimumHealthCapacity": 0
  },
  "residency": {
    "relaunchEscalationTimeoutSeconds": 10,
    "taskLostBehavior": "WAIT_FOREVER"
  },
  "healthChecks": null,
  "env": null
}
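
That definition can be posted straight to Marathon; a minimal sketch (assuming the JSON above is saved as sleepy.json and Marathon answers at http://localhost:8080):

import json
import urllib.request

MARATHON = "http://localhost:8080"  # hypothetical Marathon base URL

# Read the app definition shown above and POST it to /v2/apps.
with open("sleepy.json", "rb") as f:
    body = f.read()

req = urllib.request.Request(MARATHON + "/v2/apps", data=body, method="POST",
                             headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    print("HTTP", resp.status, "created", json.load(resp).get("id"))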

UI:

{
    "type": "DOCKER",
    "docker": {
        "image": "ubuntu",
        "network": "BRIDGE",
        "port_mappings": [
            {
                "host_port": 24781,
                "container_port": 80,
                "protocol": "tcp"
            },
            {
                "host_port": 24782,
                "container_port": 443,
                "protocol": "tcp"
            }
        ],
        "privileged": false,
        "force_pull_image": false
    }
}

Docker daemon:

$ docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                                           NAMES
eb9d70971e7f        ubuntu              "/bin/sh -c 'sleep 10"   48 seconds ago      Up 48 seconds       0.0.0.0:24782->80/tcp, 0.0.0.0:24781->443/tcp   mesos-a6893e50-40a4-48cb-b4c2-b9e63bd2a08d-S0.2afc67d1-a6c9-44ae-92c5-784cdc56fcfa
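
The mismatch above (the data shown in the UI pairs host 24781 with container 80, while Docker pairs host 24782 with container 80) can also be confirmed straight from the Docker daemon, e.g. with a sketch like this (the container id is taken from the docker ps output above):

import json
import subprocess

CONTAINER = "eb9d70971e7f"  # container id from the docker ps output above

# NetworkSettings.Ports maps "containerPort/proto" to the bound host ports.
out = subprocess.check_output(
    ["docker", "inspect", "--format", "{{json .NetworkSettings.Ports}}", CONTAINER])
ports = json.loads(out)
for container_port, bindings in sorted(ports.items()):
    for binding in bindings or []:
        print(container_port, "->", binding["HostIp"] + ":" + binding["HostPort"])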

@unterstein unterstein reopened this Dec 8, 2016
unterstein commented

Ok, investigated again:

Marathon re-uses task IDs for resident tasks, which means that the Mesos state.json will return something like this:

[
  {
    "id": "m2.6d62fd63-be00-11e6-ba0c-d23a4f1e7619",
    ...
    "port_mappings": [
      {
        "host_port": 17294,
        "container_port": 80,
        "protocol": "tcp"
      },
      {
        "host_port": 17295,
        "container_port": 443,
        "protocol": "tcp"
      }
    ]
  },
  {
    "id": "m2.6d62fd63-be00-11e6-ba0c-d23a4f1e7619",
    ...
    "port_mappings": [
      {
        "host_port": 17295,
        "container_port": 80,
        "protocol": "tcp"
      },
      {
        "host_port": 17294,
        "container_port": 443,
        "protocol": "tcp"
      }
    ]
  }
]

and when the user clicks on a particular task in the UI, the UI tries to request the information for the task with the id m2.6d62fd63-be00-11e6-ba0c-d23a4f1e7619. But this id is present multiple times within this JSON array, and it is not possible to decide which entry to choose to display the data. Closing this in favor of this GH issue, which addresses the root cause: #4819
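
A minimal sketch of how that ambiguity shows up when grouping a dumped state.json by task id (the file name and the frameworks/tasks nesting follow the usual Mesos master state layout, assumed here):

import json
from collections import defaultdict

# state.json as dumped from the Mesos master (file name is an assumption).
with open("state.json") as f:
    state = json.load(f)

tasks_by_id = defaultdict(list)
for framework in state.get("frameworks", []):
    for task in framework.get("tasks", []) + framework.get("completed_tasks", []):
        tasks_by_id[task["id"]].append(task)

# Resident tasks re-use their id across launches, so a lookup by id alone
# can return several records, each with its own port_mappings.
for task_id, records in tasks_by_id.items():
    if len(records) > 1:
        print(task_id, "appears", len(records), "times in state.json")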

@mesosphere mesosphere locked and limited conversation to collaborators Mar 27, 2017