Troubleshooting guide for installation and service-mesh #3940

Closed
piotrmsc opened this issue May 7, 2019 · 9 comments
Labels: area/documentation, area/installation, area/service-mesh, Epic, kind/feature

Comments

@piotrmsc

piotrmsc commented May 7, 2019

Description

Recently we have seen many installation-related topics that are duplicates of one another but still valid, and we have no documentation covering them. Examples are problems with Tiller and the "transport is closing" error, and Istio CRDs not being visible. A troubleshooting guide covering the topics that come up most often in the community will help developers and will make it easier for us to give help on community channels.

AC:

A new documentation topic is displayed on kyma-project.io in the installation and service-mesh areas, covering the most common topics such as Tiller certificates, installer logs/retries, and Istio CRDs not being visible.

Reasons
Improve the developer experience by giving easy access to documentation that describes the most common states that may occur but are not bugs in themselves.

Attachments

@piotrmsc added the kind/feature and area/documentation labels May 7, 2019
@piotrmsc added this to the Backlog_Goat milestone May 7, 2019
@tomekpapiernik
Contributor

Ideas for Installation troubleshooting:

  1. Extract the existing 4 common cases from the local installation guide.
  2. (local) Console throws errors - Kyma cert not added to trusted -> SLACK LINK
  3. (local) Restarting minikube the right way
  4. (local) Console UI isn't accessible - Istio ingressgateway doesn't get the necessary config - Pod must be restarted to get the config - Console starts working -> related to a bug in Istio.

Can't think of any cluster-related issues that are common enough to be added to a troubleshooting doc.

@piotrmsc
Author

piotrmsc commented May 24, 2019

I would also add to the installation area the problems with analyzing the log output of the "is-installed" script. People often get confused by the Istio installation errors and think they are the root cause, although a different component has failed.

@piotrmsc
Author

For the service mesh, I would describe problems with connection refused errors caused by ports not being opened in the ingress gateway. Additionally, mention problems with mTLS, which are crucial for people who deploy apps without a sidecar.

@piotrmsc modified the milestones: Backlog_Goat, Sprint_Goat_13 May 27, 2019
@Demonsthere

Demonsthere commented May 27, 2019

Installation problem:

  • Using your own image and putting it in the wrong place in my-kyma.yaml causes this
  • For pre-1.0 versions of Kyma, not upgrading Tiller causes this

@Demonsthere self-assigned this May 28, 2019
@Demonsthere

Demonsthere commented May 28, 2019

Troubleshooting guide

Installation

Network error when accessing kyma console

Description:

When accessing the Kyma console, a Network Error is displayed. This happens on any cluster (Minikube or a cloud provider) that uses a self-signed TLS certificate, and it means that the certificate is not present in the system trust store.

Fix:

There are two methods of fixing this issue:

  • (Preferred) Adding the wildcard certificate to the OS trust store (MacOS, Linux/Debian based, Windows)
  • (Optional) Trusting the self-signed certificates in your browser (Chrome, Firefox). This needs to be done for each of the following sites: apiserver.foo.bar, console.foo.bar, dex.foo.bar, console-backend.foo.bar

Errors after restarting minikube

Description:

When Kyma on Minikube is restarted (by rebooting your machine, or by using run.sh when a cluster is already present), Minikube can enter a state in which it is unresponsive and may require a full reinstallation to work properly.

Fix:

To halt and restart Minikube properly, follow this document

Kyma-installer is stuck at ContainerCreating step, when using custom image

Description:
Since release 0.9.0 we have focused on security, which resulted in securing the connection between the Helm client and server. Because of that, the kyma-installer cannot start if a set of client-server certificates is not present on the system.

Fix:

A common reason for this is an error in the user-supplied my-kyma.yaml when users supply their own image. As of 0.9.0, the installer.yaml has 2 image fields:

image: eu.gcr.io/kyma-project/test-infra/alpine-kubectl:v20190325-ff66a3a
image: eu.gcr.io/kyma-project/develop/installer:0fdc80dd

The second image is the one that needs to be replaced, not the first.
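
A minimal sketch of swapping only the installer image, assuming the installer image path ends in "/installer:<tag>" as in the two lines above. The helper name replace_installer_image and the target image in the usage line are hypothetical.

```shell
# Sketch: replace only the installer image in a Kyma installer manifest.
# Assumption: the installer image path ends in "/installer:<tag>"; any other
# image line (for example alpine-kubectl) is passed through unchanged.
# The new image must not contain "|" or "&", which are special to sed here.
replace_installer_image() {
  new_image="$1"
  # Reads the manifest on stdin and writes the modified manifest to stdout.
  sed -E "s|^([[:space:]-]*image:[[:space:]]*)[^[:space:]]*/installer:[^[:space:]]*|\1${new_image}|"
}

# Usage (my-kyma.yaml is the file named in the text above):
#   replace_installer_image "example.local/installer:dev" < my-kyma.yaml > my-kyma.patched.yaml
```

Writing to a separate file keeps the original manifest available for comparison with diff.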

Kyma-installer fails with error: Unable create helm client

Description:

If the kyma-installer is failing with the following error:

Unable create helm client. Error: could not read x509 key pair (cert: "/etc/certs/tls.crt", key: "/etc/certs/tls.key"): can't load key pair from cert /etc/certs/tls.crt and key /etc/certs/tls.key: open /etc/certs/tls.crt: no such file or directory

It means that you are probably trying to upgrade a pre-0.9.0 release to a newer one. In 0.9.0 we introduced a security feature: a TLS connection between Tiller and Helm. Because of this, the kyma-installer expects a secret to be present in Kubernetes, which is generated by our Tiller deployment. If you didn't upgrade Tiller during the upgrade, this error may occur.

Fix:

Upgrade your tiller deployment by calling: kubectl apply -f https://raw.githubusercontent.com/kyma-project/kyma/{RELEASE_TAG}/installation/resources/tiller.yaml, where RELEASE_TAG is your new desired release.
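
A small sketch of filling in RELEASE_TAG without editing the URL by hand. The helper name tiller_manifest_url is hypothetical, and the tag in the usage line is only an example value.

```shell
# Sketch: build the tiller.yaml URL for a given Kyma release tag.
# The URL pattern is the one quoted in the text above.
tiller_manifest_url() {
  release_tag="$1"
  if [ -z "$release_tag" ]; then
    echo "usage: tiller_manifest_url RELEASE_TAG" >&2
    return 1
  fi
  echo "https://raw.githubusercontent.com/kyma-project/kyma/${release_tag}/installation/resources/tiller.yaml"
}

# Usage (1.0.0 is an example tag):
#   kubectl apply -f "$(tiller_manifest_url 1.0.0)"
```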

Can't access Tiller (transport is closing error on any helm command)

Description:

Since 0.9.0, communication between Helm and Tiller is secured by a TLS certificate. Because of this, the certificate needs to be present on the local system in order to access Tiller in the Kubernetes cluster.

Fix:

Please follow this document
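
Before following the document, a quick local check can show whether the client certificates are in place at all. This is a sketch that assumes the Helm 2 defaults: with --tls, the client reads ca.pem, cert.pem, and key.pem from $HELM_HOME (default ~/.helm) unless other paths are passed explicitly. The helper name missing_helm_tls_files is hypothetical.

```shell
# Sketch: list the Helm 2 client TLS files that are missing.
# Assumption: default file names ca.pem, cert.pem, key.pem under $HELM_HOME.
missing_helm_tls_files() {
  helm_home="${1:-${HELM_HOME:-$HOME/.helm}}"
  for f in ca.pem cert.pem key.pem; do
    [ -f "$helm_home/$f" ] || echo "$f"
  done
}

# Usage:
#   missing_helm_tls_files            # checks $HELM_HOME, or ~/.helm if unset
#   helm ls --tls                     # should work once nothing is reported missing
```

If the function prints nothing and helm still reports "transport is closing", the next thing to check is whether the --tls flag was passed.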

Istio error during installation

Description:

During the installation of Kyma, you may see the following errors in the kyma-installer error log:

 Step error:  Details: Helm install error: rpc error: code = Unknown desc = validation failed: [unable to recognize "": no matches for kind "DestinationRule" in version "networking.istio.io/v1alpha3", unable to recognize "": no matches for kind "DestinationRule" in version "networking.istio.io/v1alpha3", unable to recognize "": no matches for kind "attributemanifest" in version "config.istio.io/v1alpha2"

This behaviour is normal and expected. Istio is the first big component to be installed, and it may happen that not all of its CRDs are installed in time for the next component. In such a case, the error is displayed and the installation step is repeated, giving Istio more time to set up.

Fix:

There is nothing to fix, as the installation is repeated.
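
To see at a glance which kinds the installer was waiting for, the error text can be reduced to a unique list. This is a sketch that only parses log text in the format quoted above. The helper name missing_kinds and the log file name in the usage line are hypothetical.

```shell
# Sketch: extract unique "kind apiVersion" pairs from
# 'no matches for kind "X" in version "Y"' messages read on stdin.
missing_kinds() {
  grep -oE 'no matches for kind "[^"]+" in version "[^"]+"' |
    sed -E 's/no matches for kind "([^"]+)" in version "([^"]+)"/\1 \2/' |
    LC_ALL=C sort -u
}

# Usage (installer.log is a hypothetical file holding the installer output):
#   missing_kinds < installer.log
#   kubectl get crd | grep 'istio.io'   # shows which Istio CRDs are registered by now
```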

Job failed: DeadlineExceeded during installation

Description:

This error means that a job object in Kubernetes couldn't finish within the set time, and a timeout occurred. This error is often followed by another:

Helm install error: rpc error: code = Unknown desc = a release named core already exists

The second error only means that the installation couldn't get past a release whose installation has failed. It is not the root cause, but rather an indication of the component (release) in which the failing job can be found.

Fix:

As this is a timeout error, restarting the installation should be enough. However, such information might be vital for the further development of Kyma, so pinpointing and reporting the problem may be important for us.

To find the broken job, try the following commands:

# Get currently installed components (helm releases)
helm ls --tls
# A high number of revisions may suggest that a component has been reinstalled several times.
# A status other than DEPLOYED suggests that this component couldn't be installed

# Get details about a component
helm status ${RELEASE_NAME} --tls
# Pods with containers 1/2 in READY may be broken, and cause such an error

# Get deployed jobs
kubectl get jobs --all-namespaces
# Jobs that are not completed can cause such an error

With this information, please open a ticket in the kyma repository so that we can improve this.
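
On a cluster with many jobs, the job listing can be narrowed down to the ones that have not succeeded. This is a sketch that treats a job as incomplete when its status.succeeded field is empty or 0, which is a simplification for jobs that need more than one completion. The helper name incomplete_jobs is hypothetical.

```shell
# Sketch: print "namespace/name" for jobs without a successful completion.
# Input on stdin: one job per line as "namespace<TAB>name<TAB>succeeded".
incomplete_jobs() {
  awk -F '\t' '$3 == "" || $3 == 0 { print $1 "/" $2 }'
}

# Usage against a cluster:
#   kubectl get jobs --all-namespaces \
#     -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.succeeded}{"\n"}{end}' \
#     | incomplete_jobs
```

The resulting names can be attached to the ticket together with the output of helm status for the affected release.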

Service mesh

Can't access console, or other endpoints

Description:

All endpoints presented by Kyma are exposed by a gateway. Those endpoints can be listed by calling:

kubectl get virtualservice --all-namespaces

Receiving 503 errors when accessing any of those endpoints may mean an error in the Istio gateway exposing them.

Fix:
In the case of a broken gateway, it is enough to kill the current gateway pods, which are then recreated with their configuration renewed. To verify if this is the problem, run:

kubectl exec -t -n istio-system "$(kubectl get pod -l app=istio-ingressgateway -n istio-system -o jsonpath='{.items[0].metadata.name}')" -- netstat -lptnu

This should print all ports used by the gateway. If ports 80 and 443 are not in use, kill the pods:

kubectl delete pod -l app=istio-ingressgateway -n istio-system
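
The port check can also be scripted instead of read by eye. This is a sketch that parses netstat -lptnu output and assumes the usual column layout with the local address in the fourth column. The helper name missing_gateway_ports and the POD variable in the usage line are hypothetical.

```shell
# Sketch: read `netstat -lptnu` output on stdin and print whichever of
# ports 80 and 443 has no LISTEN socket. Prints nothing if both are open.
missing_gateway_ports() {
  awk '
    /LISTEN/ { n = split($4, a, ":"); seen[a[n]] = 1 }  # last ":" part is the port
    END {
      if (!(80 in seen)) print 80
      if (!(443 in seen)) print 443
    }
  '
}

# Usage (POD holds the name of one istio-ingressgateway pod):
#   kubectl exec -n istio-system "$POD" -- netstat -lptnu | missing_gateway_ports
```

Any output means the gateway is missing a listener and the pods are candidates for deletion as described above.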

Connection refused errors after upgrade to kyma 1.0.0

Description:

In Kyma 1.0.0 we have enabled mutual TLS (mTLS) inside the Istio Service Mesh. Because of this, every element of the mesh needs either an istio-proxy sidecar or a proper DestinationRule/AuthorizationPolicy that whitelists the deployment in the Service Mesh.

Fix:

  • As Istio sidecar injection is enabled by default, it should be enough to delete the existing pods so that they are recreated with a sidecar and join the Service Mesh.
  • Whitelisting a service requires creating a DestinationRule that disables TLS traffic to it. Example:
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: $YOUR_SERVICE
spec:
  host: $YOUR_SERVICE.$YOUR_NAMESPACE.svc.cluster.local
  trafficPolicy:
    tls:
      mode: DISABLE
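
To avoid editing the placeholders by hand, the same manifest can be rendered from two arguments. This is a sketch that reproduces the DestinationRule above unchanged. The helper name render_destination_rule and the service and namespace in the usage line are hypothetical.

```shell
# Sketch: render the DestinationRule above for a given service and namespace.
render_destination_rule() {
  service="$1"
  namespace="$2"
  cat <<EOF
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: ${service}
spec:
  host: ${service}.${namespace}.svc.cluster.local
  trafficPolicy:
    tls:
      mode: DISABLE
EOF
}

# Usage (http-db and stage are example values):
#   render_destination_rule http-db stage | kubectl apply -f -
```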

@tomekpapiernik self-assigned this May 28, 2019
@piotrmsc
Author

piotrmsc commented May 28, 2019

The "Kyma-installer is stuck at ContainerCreating step" case and its fix apply to a flow that uses a custom installer image (built from sources). If this happens with a release (without modifications), a fix for that should also be described.
In the installation section, I would also describe the "transport is closing" error from Helm, caused by a missing --tls flag or HELM_HOME not being set. The docs can point to https://kyma-project.io/docs/root/kyma/#installation-use-helm.

@Demonsthere

Updated content

@Demonsthere

Update 2

@piotrmsc added the Epic, area/service-mesh, and area/installation labels May 28, 2019
@Demonsthere

Update 3
