This repository was archived by the owner on Nov 22, 2024. It is now read-only.

Added EKS deployment support#65

Merged
skonto merged 6 commits into lightbend:master from eschizoid:feature/k8s-support
Jan 17, 2020

Conversation

@eschizoid
Contributor

@eschizoid eschizoid commented Dec 6, 2019

The purpose of this PR is to add support for EKS deployments.

@lightbend-cla-validator
Collaborator

Hi @eschizoid,

Thank you for your contribution! We really value the time you've taken to put this together.

Before we proceed with reviewing this pull request, please sign the Lightbend Contributors License Agreement:

http://www.lightbend.com/contribute/cla

@eschizoid eschizoid force-pushed the feature/k8s-support branch from 5649d35 to 390545b Compare December 6, 2019 20:01

@eschizoid
Contributor Author

I already signed the CLA.

@eschizoid eschizoid mentioned this pull request Dec 6, 2019
@eschizoid eschizoid force-pushed the feature/k8s-support branch 13 times, most recently from 1d0664e to 2d094a5 Compare December 9, 2019 04:06

tags:
  targetGke: true
  targetEks: false
Contributor Author

@eschizoid eschizoid Dec 6, 2019


I am not sure how important it is to parametrize this tag (or if it really matters).

@eschizoid eschizoid force-pushed the feature/k8s-support branch 11 times, most recently from 35f58ae to 8e4ac0d Compare December 9, 2019 07:16
@skonto
Contributor

skonto commented Jan 15, 2020

@eschizoid from the list produced by $ aws ec2 describe-security-groups, which one is the cluster security group id? I see that you have what I have there. It would be good if you could show a screenshot of the console with only the security group ids related to the cluster.
Regarding the IAM roles, I do have AdministratorAccess for all services, including EFS and EKS, so I suspect it is not related to that. Also, I would like to keep my user clean in order to review the steps required for this PR.
In order to proceed, could you please check what the EFS provisioner requires in terms of security so it can run, i.e. the minimum requirements regarding roles? Other users may face issues here. What assumptions do you make that my setup does not satisfy, so that the pod never runs, or what exactly does EFS require? We need to have that as part of the documentation in this PR.
Btw here is additional log output for the failing pod:

88s         Warning   FailedMount   pod/cloudflow-efs-efs-provisioner-575d8f8567-mggwf   (combined from similar events): MountVolume.SetUp failed for volume "pv-volume" : mount failed: exit status 32
Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/c5cda11c-3786-11ea-9270-0a4ae02a62dd/volumes/kubernetes.io~nfs/pv-volume --scope -- mount -t nfs .efs..amazonaws.com:/ /var/lib/kubelet/pods/c5cda11c-3786-11ea-9270-0a4ae02a62dd/volumes/kubernetes.io~nfs/pv-volume

pv-volume is part of installer/efs-provisioner/templates/deployment.yaml.
Something is missing there: the server resolves to .efs..amazonaws.com, so Values.efsProvisioner.awsRegion is not being set from what I see. Could you fix that?
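The malformed server name in the mount error comes from composing the DNS name with empty variables. A minimal standalone sketch of a guard (hypothetical helper, not part of the installer):

```shell
# Hypothetical helper: build the EFS DNS name only when both parts are
# non-empty, so a missing region fails fast instead of yielding the
# ".efs..amazonaws.com" seen in the FailedMount event above.
efs_dns_name() {
  fs_id="$1"
  region="$2"
  if [ -z "$fs_id" ] || [ -z "$region" ]; then
    echo "efs_dns_name: both file system id and region are required" >&2
    return 1
  fi
  echo "${fs_id}.efs.${region}.amazonaws.com"
}
```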

@skonto
Contributor

skonto commented Jan 15, 2020

@eschizoid I talked with my colleagues who have previous successful experience with EFS and Cloudflow.
You additionally need to follow the rules discussed here for security groups, to enable traffic between an EC2 instance of the Kubernetes cluster and a mount target (and thus the file system). The general docs for EFS are here: https://aws.amazon.com/premiumsupport/knowledge-center/eks-pods-efs/; please follow that and update the documentation in this PR accordingly.
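For reference, per the AWS docs linked above, letting worker nodes reach the mount target boils down to one NFS ingress rule on port 2049. A dry-run sketch that only prints the AWS CLI command instead of running it (the group ids are made up):

```shell
# Print (rather than execute) the ingress rule that lets worker nodes reach
# the EFS mount target over NFS. Swap `echo` for the real call once verified.
allow_nfs_from_nodes() {
  mount_target_sg="$1"   # security group attached to the EFS mount target
  node_sg="$2"           # security group of the EKS worker nodes
  echo aws ec2 authorize-security-group-ingress \
    --group-id "$mount_target_sg" \
    --protocol tcp --port 2049 \
    --source-group "$node_sg"
}
```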

@agemooij
Contributor

Thanks a lot @eschizoid for helping out with this!

@eschizoid
Contributor Author

eschizoid commented Jan 15, 2020

> @eschizoid from the list produced by $ aws ec2 describe-security-groups which is the cluster security group id? I see that you have what I have there. It would be good if you could show a screenshot of the console with the security group ids related to the cluster only.

Yeah I can do that later today

> Regarding the IAM roles I do have AdministratorAccess for all services including EFS, EKS. So I suspect it is not related to that. Also, I would like to keep my user clean in order to review the steps required for this PR.

Yeah, I agree 100%. Are we sure that AdministratorAccess for EKS and EFS is sufficient? I am asking because I see a couple of managed policies in my previous screenshot that might do the trick. For instance, the IAMFullAccess policy might be needed, and also some policies related to just the EKS worker nodes.

> In order to proceed could you please check what the EFS provisioner requires in terms of security so it can run. I mean minimum requirements regarding roles. Other user may face issues here. What are the assumptions you make which my setup does not satisfy and the pod never runs or what does EFS require? We need to have that as part of the documentation in this PR.

I am trying to remember what manual steps I did for my user, VPCs, or SGs, but I can't remember any particular setup other than attaching the IAM policies manually.

> Btw here is additional log output for the failing pod:
>
> 88s         Warning   FailedMount   pod/cloudflow-efs-efs-provisioner-575d8f8567-mggwf   (combined from similar events): MountVolume.SetUp failed for volume "pv-volume" : mount failed: exit status 32
> Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/c5cda11c-3786-11ea-9270-0a4ae02a62dd/volumes/kubernetes.io~nfs/pv-volume --scope -- mount -t nfs .efs..amazonaws.com:/ /var/lib/kubelet/pods/c5cda11c-3786-11ea-9270-0a4ae02a62dd/volumes/kubernetes.io~nfs/pv-volume
>
> pv-volume is part of the installer/efs-provisioner/templates/deployment.yaml.
> Something is missing there: .efs..amazonaws.com, Values.efsProvisioner.awsRegion is not being set from what I see. Could you fix that?

We are already doing that here:

EFS_SERVER_NAME=cloudflow-efs
EFS_CHART_NAME=efs-provisioner

install_efs_provisioner() {
  helm upgrade $EFS_SERVER_NAME stable/$EFS_CHART_NAME \
    --install \
    --namespace "$1" \
    --timeout 600 \
    --set efsProvisioner.dnsName="$2.efs.$3.amazonaws.com" \
    --set efsProvisioner.efsFileSystemId="$2" \
    --set efsProvisioner.awsRegion="$3"
}

Looks like when install.sh calls the install_efs_provisioner function, AWS_DEFAULT_REGION is not being passed, which is really weird because the installer exports that variable. Oh man, I hate bash because of all this variable leaking everywhere.
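That kind of silent leak is exactly what `set -u` catches. A standalone sketch of the pattern (the function is a stub and the values are made up, not the real installer): pass everything as explicit arguments and abort on unset variables instead of expanding them to empty strings.

```shell
set -eu

# Stub standing in for the real helm invocation; all inputs are explicit
# positional arguments instead of implicitly inherited environment variables.
install_efs_provisioner() {
  echo "install into $1: $2.efs.$3.amazonaws.com"
}

# Assume the region was set by the caller for this demo.
AWS_DEFAULT_REGION="us-east-1"

# ${VAR:?msg} aborts with a clear error if VAR is unset or empty, instead of
# letting an empty region produce a malformed DNS name downstream.
install_efs_provisioner cloudflow fs-12345 "${AWS_DEFAULT_REGION:?region must be set}"
```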

I might need to start from scratch with a new user until we figure out what special setup I did that I can no longer remember :(

@eschizoid
Contributor Author

eschizoid commented Jan 15, 2020

@agemooij / @skonto how do you guys feel about merging this and letting me take care of the 3 open items in a follow-up PR?

Right now the only critical part that is not working is the efs-provisioner. However, I feel this PR is in good shape (other than, again, fixing the merge conflicts) and adds value as it is.

@skonto
Contributor

skonto commented Jan 15, 2020

@eschizoid I put some extra logging:

     file_system_id="$(aws efs describe-file-systems --query "FileSystems[?Name=='$NAMESPACE'].FileSystemId" --output json | jq -r '.[]')"
+    echo "awsd: $AWS_DEFAULT_REGION"
+    echo "fsid: $file_system_id"
+    echo "ns: $NAMESPACE"

Unfortunately some variables are not set:

Installing EFS Provisioner
awsd: 
fsid: 
ns: cloudflow

This part does not work for me: "FileSystems[?Name=='$NAMESPACE'].FileSystemId"
I had to use the following:

$ aws efs describe-file-systems --query "FileSystems[?Name=='stavros12']" --output json | jq -r '.[]'
{
  "OwnerId": "405074236871",
  "CreationToken": "stavros12",
  "FileSystemId": "fs-e4542f2f",
  "CreationTime": 1579019334,
  "LifeCycleState": "available",
  "Name": "stavros12",
  "NumberOfMountTargets": 3,
  "SizeInBytes": {
    "Value": 6144,
    "Timestamp": 1579106728,
    "ValueInIA": 0,
    "ValueInStandard": 6144
  },
  "PerformanceMode": "generalPurpose",
  "Encrypted": false,
  "ThroughputMode": "bursting",
  "Tags": [
    {
      "Key": "Name",
      "Value": "stavros12"
    }
  ]
}

Also, there is an assumption that $AWS_DEFAULT_REGION will be available, but that var is only exported inside the create-cluster-eks.sh script, which will not preserve it for later runs.
I will try to check the right IAM roles and see if you are missing the steps I pointed out above about the NFS security group thing.
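As an aside, the `[?Name=='…']` lookup can be reproduced locally against a canned describe-file-systems payload, which makes it easy to see that filtering by the wrong name (e.g. the namespace instead of the cluster name) silently returns nothing. A sketch with fabricated data, doing the filter in jq instead of `--query`:

```shell
# Canned stand-in for `aws efs describe-file-systems --output json`.
payload='{"FileSystems":[{"Name":"stavros12","FileSystemId":"fs-e4542f2f"}]}'

# Same filter the installer applies: select the file system by Name and
# emit its FileSystemId; a non-matching name yields empty output.
lookup_fs_id() {
  printf '%s' "$payload" | jq -r --arg n "$1" \
    '.FileSystems[] | select(.Name == $n) | .FileSystemId'
}
```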

@eschizoid
Contributor Author

eschizoid commented Jan 15, 2020

> This part does not work for me: "FileSystems[?Name=='$NAMESPACE'].FileSystemId"
> I had to use the following:

Ahhh, I think that is where the problem might be. It seems that some variables are not being passed correctly :(

@skonto
Contributor

skonto commented Jan 15, 2020

@eschizoid I tested it with all vars set and the mount still fails: Warning FailedMount 24s kubelet, ip-192-168-67-151.eu-west-1.compute.internal MountVolume.SetUp failed for volume "pv-volume" : mount failed: exit status 32. I will follow the instructions here: https://docs.aws.amazon.com/efs/latest/ug/accessing-fs-create-security-groups.html and most likely we will solve this ;)
Please, when you have spare time, resolve the PR conflicts and fix the above minor issue with vars.
I will get back to you with what I got following the instructions and possibly adapting IAM roles.
Btw regarding the helm charts we can always use the stable ones later on via another PR, let's make this work and we can merge.
Thank you for your time!

@eschizoid
Contributor Author

eschizoid commented Jan 16, 2020

> @eschizoid I tested it with all vars set and mount still fails: Warning FailedMount 24s kubelet, ip-192-168-67-151.eu-west-1.compute.internal MountVolume.SetUp failed for volume "pv-volume" : mount failed: exit status 32, I will follow the instructions here: https://docs.aws.amazon.com/efs/latest/ug/accessing-fs-create-security-groups.html and most likely we will solve this ;)
> Please when you have spare time resolve the PR conflicts and fix the above minor issue with vars.
> I will get back to you with what I got following the instructions and possibly adapting IAM roles.
> Btw regarding the helm charts we can always use the stable ones later on via another PR, let's make this work and we can merge.
> Thank you for your time!

I fixed all the merge conflicts. However, all the Cloudflow stuff is no longer working for me; this was expected, since a lot of things changed since the last commit (all the Tiller stuff was repackaged). The good thing is that I now know where the problem with the missing env variable is (AWS_DEFAULT_REGION). I will work on that tomorrow and also try to figure out where the Cloudflow installation is breaking for me.

@skonto @agemooij Just out of curiosity, I don't see any tests around the installation process. How do we know that the last commit didn't break the whole installation, like just happened recently for the Flink and Cloudflow operators? I am asking this because last time we spent two days before we figured out that master was broken. Can somebody guarantee that the last merge to master actually works installation-wise?

@skonto
Contributor

skonto commented Jan 16, 2020

@eschizoid regarding the integration tests you are 100% right. That is something we have discussed extensively with our team, and it is one of the top priorities. We need to make sure at least something is working before merging a PR, but we also want nightly builds, to be on the safe side. Of course we need to develop the right tests and get enough coverage.

@skonto
Contributor

skonto commented Jan 16, 2020

@eschizoid good news, I got the EFS provisioner running by just setting the two security groups described in the link above:

$ kubectl get pods -n cloudflow
NAME                                                         READY   STATUS    RESTARTS   AGE
cloudflow-efs-efs-provisioner-66d757fbff-2kcsc               1/1     Running   0          4m29s
cloudflow-flink-flink-operator-76865f984-mnswc               1/1     Running   0          2m15s
cloudflow-operator-7c86ffbfbb-5rsfn                          1/1     Running   0          112s
cloudflow-sparkoperator-fdp-sparkoperator-6c9bbf5ccf-4gjqr   1/1     Running   0          119s
cloudflow-strimzi-entity-operator-5c5d6f948-vbgp4            2/2     Running   0          23s
cloudflow-strimzi-kafka-0                                    2/2     Running   0          61s
cloudflow-strimzi-kafka-1                                    2/2     Running   0          61s
cloudflow-strimzi-kafka-2                                    2/2     Running   0          61s
cloudflow-strimzi-zookeeper-0                                2/2     Running   0          110s
cloudflow-strimzi-zookeeper-1                                2/2     Running   0          110s
cloudflow-strimzi-zookeeper-2                                2/2     Running   0          110s
strimzi-cluster-operator-5d7d946c6d-2wgmf                    1/1     Running   0          2m6s

I haven't tried any IAM role settings, so we are good to go there. I also haven't updated my local branch with your new changes, because you mentioned the Cloudflow things don't work for you.
I can also help with figuring out where things break for you.

@skonto
Contributor

skonto commented Jan 16, 2020

@eschizoid I was also able to run the sensor-data-scala app using your PR before the rebase:

$ kubectl get pods -n sensor-data-scala
NAME                                                  READY   STATUS    RESTARTS   AGE
sensor-data-scala-file-ingress-78bf6d5d97-spjbn       1/1     Running   0          10m
sensor-data-scala-http-ingress-7c4964f6-fmbqk         1/1     Running   0          10m
sensor-data-scala-invalid-logger-85648fb4bb-5wqsb     1/1     Running   0          10m
sensor-data-scala-merge-7fcd4bf54b-txdkb              1/1     Running   0          10m
sensor-data-scala-metrics-79b6d87d46-q6n4l            1/1     Running   0          10m
sensor-data-scala-rotor-avg-logger-7d495d94db-dq9bm   1/1     Running   0          10m
sensor-data-scala-rotorizer-5f45966f8b-mchjm          1/1     Running   0          10m
sensor-data-scala-valid-logger-75f7f458c4-rvvz4       1/1     Running   0          10m
sensor-data-scala-validation-9489989b6-bpps7          1/1     Running   0          10m

$ kubectl logs sensor-data-scala-invalid-logger-85648fb4bb-5wqsb -n sensor-data-scala

[WARN] [01/16/2020 12:25:19.024] [akka_streamlet-akka.actor.default-dispatcher-4] [akka.actor.ActorSystemImpl(akka_streamlet)] Invalid metric detected! {"metric": {"deviceId": "c75cb448-df0e-4692-8e06-0321b7703992", "timestamp": 1495545346279, "name": "power", "value": -1.7}, "error": "All measurements must be positive numbers!"}

A few minor fixes required:
a) in templates/efs.yaml the storage class name should be aws-efs.
b) when running the deploy command, the user needs to pass the decrypted password, as described here:
https://docs.aws.amazon.com/cli/latest/reference/ecr/get-authorization-token.html
Right now you have: $ kubectl cloudflow deploy -u $(aws iam get-user | jq -r .User.UserName) --volume-mount file-ingress.source-data-mount=claim1 index.docker.io/<user>/sensor-data-scala:8-2a0f65d-dirty -p "<docker_hub_password>"
It should be:
-p "$(aws ecr get-authorization-token --output text --query 'authorizationData[].authorizationToken' | base64 -d | cut -d: -f2)" assuming ECR is used.
Regarding IAM roles, since I am admin I guess I faced no issues; other users may need to adjust them, so we need to leave a warning in the docs saying to adjust your roles for EFS, EKS etc ;)
I will now try your rebased code. Hopefully I can figure out what is wrong, now that I have a reference working deployment.
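The token returned by get-authorization-token is base64 of `AWS:<password>`, which is why the `base64 -d | cut -d: -f2` pipeline above recovers the password. Demonstrated here on a fabricated token rather than a live AWS call:

```shell
# Fabricate a token with the same AWS:<password> shape the ECR API returns.
fake_token="$(printf 'AWS:s3cr3t-ecr-password' | base64)"

# The exact pipeline from the deploy command: decode, then drop the "AWS:"
# username prefix, keeping only the password field.
password="$(printf '%s' "$fake_token" | base64 -d | cut -d: -f2)"
echo "$password"   # prints s3cr3t-ecr-password
```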

@eschizoid
Contributor Author

eschizoid commented Jan 16, 2020

@skonto this is great news :)

Anyways, I pushed all my changes (including the env variable fix). And this is the last error I am facing:

Error: failed to download "https://github.com/lightbend/flink-operator/releases/download/v0.7.1/flink-operator-0.7.1.tgz" (hint: running `helm repo update` may help)

> a) in templates/efs.yaml the storage class name should be aws-efs.

fixed

> b) when running the deploy command user needs to pass the decrypted password as described here:

I think we should definitely do ECR; Docker Hub should probably be paired with a minikube installation. Let me start adding the documentation for the ECR login and all that before I incorporate this change.

Feels like we are close. By the way, thanks for taking care of the investigation on the security groups side ;)

@skonto
Contributor

skonto commented Jan 16, 2020

Let me try the helm charts; the URL is working from what I see.

@eschizoid
Contributor Author

eschizoid commented Jan 17, 2020

@skonto alright, I have good news: after my last change I got all the Cloudflow operators up and running:

$ kubectl get pods  --watch -n cloudflow
NAME                                                         READY   STATUS    RESTARTS   AGE
cloudflow-efs-efs-provisioner-5759594969-vw5g5               1/1     Running   0          2m35s
cloudflow-flink-flink-operator-76865f984-89d52               1/1     Running   0          2m29s
cloudflow-operator-7c86ffbfbb-m9sx2                          1/1     Running   0          2m20s
cloudflow-sparkoperator-fdp-sparkoperator-6c9bbf5ccf-484mx   1/1     Running   0          2m23s
cloudflow-strimzi-entity-operator-5c5d6f948-gpjhl            2/2     Running   0          43s
cloudflow-strimzi-kafka-0                                    2/2     Running   0          93s
cloudflow-strimzi-kafka-1                                    2/2     Running   0          93s
cloudflow-strimzi-kafka-2                                    2/2     Running   0          93s
cloudflow-strimzi-zookeeper-0                                2/2     Running   0          2m16s
cloudflow-strimzi-zookeeper-1                                2/2     Running   0          2m16s
cloudflow-strimzi-zookeeper-2                                2/2     Running   0          2m16s
strimzi-cluster-operator-5d7d946c6d-f2q8m                    1/1     Running   0          2m26s

Feel free to try the PR, hopefully for the last time :)

I think we only have two remaining open items:

  • Document ECR login and docker push to ECR
  • Document security groups setup

For the second one, I am wondering if there is something we can add to the create-cluster script for EKS. I can take care of it over the weekend.

Cheers

@skonto
Contributor

skonto commented Jan 17, 2020

@eschizoid glad that you made everything run on your side! The pending things I noticed beyond docs are:

  • security-groups "$SECURITY_GROUP_IDS" ${CLUSTER_SECURITY_GROUP_ID//null/} I still need this fix to deal with the null cluster security group id.
  • file_system_id="$(aws efs describe-file-systems --query "FileSystems[?Name=='$NAMESPACE'].FileSystemId" --output json | jq -r '.[]')". $NAMESPACE needs to be replaced with the cluster name, I suspect you have the same name for both that is why it works for you. If I use NAMESPACE it returns nothing.
  • I noticed you changed the API of the install script. Now you only need to run:
    ./install.sh eks, but now I cannot target a specific cluster. Before your final update I could do: ./install.sh stavros12 eks. Am I missing something here? Could you keep the API as on master, where we can target a specific cluster name, like for GKE?

@eschizoid
Contributor Author

eschizoid commented Jan 17, 2020

  • security-groups "$SECURITY_GROUP_IDS" ${CLUSTER_SECURITY_GROUP_ID//null/} I still need this fix to deal with the null cluster security group id.

Hmm, I am still not clear on what type of fix we are looking for. Do we just want a guard in case CLUSTER_SECURITY_GROUP_ID is null? Maybe something like this?

...

CLUSTER_SECURITY_GROUP_ID="$(aws eks describe-cluster --name "$CLUSTER_NAME" | jq -r '.cluster.resourcesVpcConfig.clusterSecurityGroupId')"
CLUSTER_SECURITY_GROUP_ID="${CLUSTER_SECURITY_GROUP_ID:-}"

...

aws efs create-mount-target \
    --file-system-id "$FILE_SYSTEM_ID" \
    --subnet-id "$SUBNET_ID" \
    --security-groups "$SECURITY_GROUP_IDS" "$CLUSTER_SECURITY_GROUP_ID"

Let me know :)
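One thing to watch with a guard like `"${CLUSTER_SECURITY_GROUP_ID:-}"`: when the variable is empty, the quoted form still expands to one empty positional argument (which the AWS CLI would reject as an invalid security group id), while the unquoted form drops the argument entirely. A quick standalone check, separate from the installer:

```shell
# Count how many positional arguments actually reach a command.
count_args() { echo "$#"; }

CLUSTER_SECURITY_GROUP_ID=""

# Quoted: the empty expansion survives as a (bogus) second argument.
quoted="$(count_args "sg-12345" "${CLUSTER_SECURITY_GROUP_ID:-}")"

# Unquoted: the empty expansion is removed by word splitting, so only the
# real security group id is passed along.
unquoted="$(count_args "sg-12345" ${CLUSTER_SECURITY_GROUP_ID:-})"

echo "quoted=$quoted unquoted=$unquoted"   # prints quoted=2 unquoted=1
```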

  • file_system_id="$(aws efs describe-file-systems --query "FileSystems[?Name=='$NAMESPACE'].FileSystemId" --output json | jq -r '.[]')". $NAMESPACE needs to be replaced with the cluster name, I suspect you have the same name for both that is why it works for you. If I use NAMESPACE it returns nothing.

My bad; this is now fixed, and same for the docs.

  • I noticed you changed the API of the install script. Now you only need to run:
    ./install.sh eks but now I cannot target a specific cluster. Before your final update I could do: ./install.sh stavros12 eks. Am I missing something here? Could you keep the API as on master where we can target a specific clustername for gke?

Same

@skonto
Contributor

skonto commented Jan 17, 2020

@eschizoid yes, a protection against a null value is required. I will do another round on the PR now.

@skonto
Contributor

skonto commented Jan 17, 2020

@eschizoid I launched with the right arguments:

$./install.sh stavros15 eks
...
The tiller namespace is 'kube-system'
Tiller is correctly configured
Installing Cloudflow 1.2.1
 - cluster: stavros15
 - namespace: cloudflow
Not enough arguments supplied
Usage: install-cloudflow.sh [CLUSTER_NAME] [CLOUDFLOW_NAMESPACE] [CLUSTER_TYPE]

You need to set the expected number of args to 3 ;)

# Usage: install-cloudflow.sh [CLUSTER_NAME] [CLOUDFLOW_NAMESPACE] [CLUSTER_TYPE]
if [ $# -ne 2 ]; then
  echo "Not enough arguments supplied"
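A minimal sketch of the corrected guard (a standalone stub, not the actual install-cloudflow.sh): the usage line advertises three arguments, so the count check must compare against 3, not 2.

```shell
# Stub mirroring the advertised usage:
#   install-cloudflow.sh [CLUSTER_NAME] [CLOUDFLOW_NAMESPACE] [CLUSTER_TYPE]
check_args() {
  if [ "$#" -ne 3 ]; then
    echo "Not enough arguments supplied" >&2
    return 1
  fi
  echo "cluster=$1 namespace=$2 type=$3"
}
```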

@skonto
Contributor

skonto commented Jan 17, 2020

@eschizoid please fix the above two issues and I will merge; everything is running:

$ kubectl get pods -n cloudflow
NAME                                                         READY   STATUS    RESTARTS   AGE
cloudflow-efs-efs-provisioner-6f9d584b64-sfl7n               1/1     Running   0          5m55s
cloudflow-flink-flink-operator-76865f984-nwh9p               1/1     Running   0          5m44s
cloudflow-operator-7c86ffbfbb-pqzqp                          1/1     Running   0          5m10s
cloudflow-sparkoperator-fdp-sparkoperator-6c9bbf5ccf-ndx4r   1/1     Running   0          5m31s
cloudflow-strimzi-entity-operator-5c5d6f948-d289h            2/2     Running   0          3m19s
cloudflow-strimzi-kafka-0                                    2/2     Running   0          4m7s
cloudflow-strimzi-kafka-1                                    2/2     Running   0          4m7s
cloudflow-strimzi-kafka-2                                    2/2     Running   0          4m7s
cloudflow-strimzi-zookeeper-0                                2/2     Running   0          5m6s
cloudflow-strimzi-zookeeper-1                                2/2     Running   0          5m6s
cloudflow-strimzi-zookeeper-2                                2/2     Running   0          5m6s
strimzi-cluster-operator-5d7d946c6d-2ct4k                    1/1     Running   0          5m37s

We can update the docs in another PR, no need to make this take longer ;) Just add the security groups link to the docs and that should be enough.

@skonto
Contributor

skonto commented Jan 17, 2020

@eschizoid it still fails; just remove the quotes around "${CLUSTER_SECURITY_GROUP_ID:-}" ;)
Unquoted ${CLUSTER_SECURITY_GROUP_ID:-} does the trick. Right now I am getting:

>   --security-groups "sg-09e25829fc45c2721" "${CLUSTER_SECURITY_GROUP_ID:-}"

An error occurred (BadRequest) when calling the CreateMountTarget operation: invalid security group ID:

The rest looks good. Thank you!

@skonto
Contributor

skonto commented Jan 17, 2020

Merging the PR. @eschizoid, thanks a lot for your patience and your contribution! Let's do a follow-up PR using the stable helm charts.
