Openshift Origin Installation fails if cloud provider (AWS) used in inventory file #5692
Comments
Same error here since the end of August with RHEL 7.4 + OpenShift Enterprise on AWS (see also #5691). Nodes are registered in the cluster with the AWS DNS domain suffix (.compute.internal) instead of the public_dns_domain we provide. Forcing the hostname in the inventory file using 'openshift_hostname' doesn't help. Support request opened with Red Hat (case 01937377); still waiting for resolution.
Same for networks.
It seems to be a timing issue; I get similar error messages installing on AWS. When the installation failed, I was able to start the origin-node service on the machine after waiting several minutes. After that, running the installation a second time seems to work.
@j00p34 I don't think I have a timing issue in my setup. I tried manually restarting the node service on each machine after a while, and it threw the same error.
@j00p34 / @poonia0arun Same for me. Restarting the installation doesn't help.
@poonia0arun Sorry to hear that. It would have been an easy workaround otherwise. I should mention that I am using the 3.7 alpha version of OpenShift; I haven't tried it with 3.6 yet. Another big difference I see from your config is that you're specifying AWS keys. I am using IAM roles for my instances, so they have rights to the AWS API without specifying keys:
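For illustration, an IAM-role-based inventory fragment might look roughly like this (a sketch only: openshift_cloudprovider_kind is a standard openshift-ansible variable, but treat the exact variable set for your version as an assumption):

```ini
# Sketch: enable the AWS cloud provider without embedding credentials.
# With an IAM instance profile attached to each EC2 instance, the
# access/secret key variables can be left out entirely and credentials
# come from the instance role instead.
[OSEv3:vars]
openshift_cloudprovider_kind=aws
# openshift_cloudprovider_aws_access_key / _secret_key intentionally omitted
```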
@patlachance Your setup is quite different, I guess, as you are on the Enterprise version. I used Terraform to set up the machines and configure everything in AWS. I've got OpenShift running except for the registry; the registry can't start because it's trying to use a base image from Docker Hub that doesn't exist. I did find this: it seems to configure your complete environment, so it could be a better option to be sure everything is configured correctly. I think I'll look into that this week. This is also an interesting read: reference architecture 3.6
@j00p34 You're right, I'm trying to install the Enterprise version, following the instructions from the link you provided. The only difference is that I'm trying to deploy OpenShift in a private VPC behind custom proxy/reverse-proxy instances.
@poonia0arun There's one thing I remember from a previous installation: when I provided openshift_hostname, my cluster couldn't start either. I can't remember exactly what the problem was, but it had something to do with Kubernetes resolving the hostname while the node names are different. Maybe you should try it without the openshift_hostname variable.
@j00p34 If I run my Ansible playbook without an openshift_hostname value, the API on the master doesn't restart: it tries to resolve the hostname ip-10-30-1-248.bdteam.local, which is not a DNS record on my DNS server, so the API service on the master fails.
[root@osmaster01 centos]# systemctl status origin-master-api.service
● origin-master-api.service - Atomic OpenShift Master API
Loaded: loaded (/usr/lib/systemd/system/origin-master-api.service; enabled; vendor preset: disabled)
Active: activating (start) since Tue 2017-10-10 16:53:28 UTC; 23s ago
Docs: https://github.com/openshift/origin
Main PID: 31254 (openshift)
Memory: 25.4M
CGroup: /system.slice/origin-master-api.service
└─31254 /usr/bin/openshift start master api --config=/etc/origin/master/master-config.yaml --loglevel=2 --listen=https://0.0.0.0:8443 --master=https://ip-10-30-1-27.bdteam.local:8443
Oct 10 16:53:38 osmaster01.bdteam.local openshift[31254]: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp: lookup ip-10-30-1-248.bdteam.local: no such host"; Reconnecting to {ip-10-30-1-2...m.local:2379 <nil>}
Oct 10 16:53:42 osmaster01.bdteam.local openshift[31254]: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp: lookup ip-10-30-1-248.bdteam.local: no such host"; Reconnecting to {ip-10-30-1-2...m.local:2379 <nil>}
Oct 10 16:53:43 osmaster01.bdteam.local openshift[31254]: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp: lookup ip-10-30-1-248.bdteam.local: no such host"; Reconnecting to {ip-10-30-1-2...m.local:2379 <nil>}
Oct 10 16:53:44 osmaster01.bdteam.local openshift[31254]: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp: lookup ip-10-30-1-248.bdteam.local: no such host"; Reconnecting to {ip-10-30-1-2...m.local:2379 <nil>}
Oct 10 16:53:44 osmaster01.bdteam.local openshift[31254]: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp: lookup ip-10-30-1-248.bdteam.local: no such host"; Reconnecting to {ip-10-30-1-2...m.local:2379 <nil>}
Oct 10 16:53:44 osmaster01.bdteam.local openshift[31254]: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp: lookup ip-10-30-1-248.bdteam.local: no such host"; Reconnecting to {ip-10-30-1-2...m.local:2379 <nil>}
Oct 10 16:53:44 osmaster01.bdteam.local openshift[31254]: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp: lookup ip-10-30-1-248.bdteam.local: no such host"; Reconnecting to {ip-10-30-1-2...m.local:2379 <nil>}
Oct 10 16:53:44 osmaster01.bdteam.local openshift[31254]: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp: lookup ip-10-30-1-248.bdteam.local: no such host"; Reconnecting to {ip-10-30-1-2...m.local:2379 <nil>}
Oct 10 16:53:45 osmaster01.bdteam.local openshift[31254]: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp: lookup ip-10-30-1-248.bdteam.local: no such host"; Reconnecting to {ip-10-30-1-2...m.local:2379 <nil>}
Oct 10 16:53:51 osmaster01.bdteam.local openshift[31254]: grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp: lookup ip-10-30-1-248.bdteam.local: no such host"; Reconnecting to {ip-10-30-1-2...m.local:2379 <nil>}
Hint: Some lines were ellipsized, use -l to show in full.
[root@osmaster01 centos]#
I am using an ELB in front of my HA pair of masters.
I fixed my problem; it seems unrelated after all. I am running 3.7, and I noticed origin-master-controllers.service was crash-looping because in this version you need to set a ClusterID when on AWS. While the playbook was running I added the ClusterID setting to /etc/origin/cloudprovider/aws.conf, and after that the install proceeded without a problem. The reason it worked after a while earlier was probably because I was starting it at the right moment.
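For reference, a hedged sketch of what such an aws.conf might look like; the [Global] keys are the standard Kubernetes AWS cloud-config options, and the zone and cluster name below are placeholders, not values taken from this thread:

```ini
# /etc/origin/cloudprovider/aws.conf -- illustrative values only
[Global]
Zone = us-east-1a
# Cluster identification required by the controllers in 3.7+;
# replace "mycluster" with your own cluster ID / tag value
KubernetesClusterTag = mycluster
KubernetesClusterID = mycluster
```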
@j00p34 Oh, okay. I can't find the right solution for this; I am still waiting for one.
@sdodson do you happen to have any pointers on this issue?
Hello, the AWS cloud provider works fine only when you use the same hostname/domain in your Ansible inventory *.hosts file as displayed in the AWS instance's Private DNS field (in the EC2 instance description), e.g.:
To do so, you must have the VPC DHCP options configured with an empty domain-name, e.g.:
The hostname in CentOS Linux must be the same as above: ip-10-212-31-117.eu-west-1.compute.internal. The following commands must also return ip-10-212-31-117.eu-west-1.compute.internal:
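As an illustration, the checks could look like this (a sketch: the hostname shown is the example above, and the EC2 instance metadata endpoint is the standard one — it only answers from inside EC2):

```shell
# Each of these should print the EC2 Private DNS name, e.g.
# ip-10-212-31-117.eu-west-1.compute.internal
hostname
hostname -f
# What the cloud provider will register the node as, per EC2 metadata
# (times out harmlessly when run outside EC2):
curl -s --max-time 2 http://169.254.169.254/latest/meta-data/local-hostname || true
```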
A similar problem is also mentioned in kubernetes/kubernetes#11543. I'm looking forward to a fix or workaround to use custom domains and hostnames when using the AWS cloud provider. Regards,
One of my colleagues spent some time on this issue. He suggested creating an A record on Route 53 as ip-X-X-X-X.local.domain and assigning master and node IPs accordingly to each A record. In my setup, I am using an ELB in front of the masters, so I created a classic load balancer listening on port 8443 of each master. I made three changes to make it work on my current setup, even though I can't use a proper custom hostname:
Hosts file
Inventory File:
No error occurred:
Hopefully this will help someone who is still trying to make it work.
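To make the workaround concrete, here is a hedged sketch of the three pieces; all names, IPs, and the ELB hostname are placeholders patterned on hostnames mentioned earlier in the thread, not the poster's actual files:

```ini
# 1) /etc/hosts entry (or a Route 53 A record) resolvable from every machine,
#    one line per master/node:
#    10.30.1.27   ip-10-30-1-27.bdteam.local

# 2) Inventory: pin each host's name/IP and point the cluster at the ELB
[masters]
ip-10-30-1-27.bdteam.local openshift_hostname=ip-10-30-1-27.bdteam.local openshift_ip=10.30.1.27

[OSEv3:vars]
openshift_master_cluster_hostname=master-elb.bdteam.local
openshift_master_cluster_public_hostname=master-elb.bdteam.local

# 3) Re-run the playbook; once the node names resolve, the services start.
```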
@DanyC97 the kubeletPreferredAddressTypes argument goes in the master config under the API server arguments.
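In OpenShift 3.x terms that means master-config.yaml; a hedged sketch of the relevant section (the address-type ordering shown is a common choice, not one prescribed in this thread):

```yaml
# Fragment of /etc/origin/master/master-config.yaml
kubernetesMasterConfig:
  apiServerArguments:
    kubelet-preferred-address-types:
    - InternalDNS
    - InternalIP
    - Hostname
```

A master restart is needed for the change to take effect.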
Thanks a bunch @liggitt, I'll give it a try and report back. Initially I had done kubernetes/kubernetes#11543 (comment), but without much luck.
@liggitt something is not right. I've applied the change as suggested and I got:
Any ideas?
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close. /lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle rotten. If this issue is safe to close now please do so with /close. /lifecycle rotten
Rotten issues close after 30d of inactivity. Reopen the issue by commenting /reopen. /close
@openshift-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Description
On an OpenShift cluster, if I use AWS as the cloud provider, my installation fails while trying to start the node service on each node. If I don't use any cloud provider, it appears to be successful.
Version
I am using the RPM installation.
Steps To Reproduce
Expected Results
The node service should start successfully, and the output of oc get nodes should show the nodes as Ready rather than NotReady.
Observed Results
The node service is unable to start on any of the nodes or masters.
Kubectl describe node output
If I don't use any cloud provider in my Ansible config.yml file, my installation works fine, but I need to resolve this for AWS (or any cloud provider).
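For context, enabling the cloud provider in the inventory typically involves variables like these (a sketch with placeholder values; the exact variable set depends on the openshift-ansible version, and the keys shown are dummies):

```ini
[OSEv3:vars]
openshift_cloudprovider_kind=aws
# Either static credentials ...
openshift_cloudprovider_aws_access_key=AKIAEXAMPLE
openshift_cloudprovider_aws_secret_key=examplesecretkey
# ... or omit both and rely on IAM instance roles instead
```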
Systemctl output of node service on a particular node
Logs output from one of the node (/var/log/messages)
Additional Information