We recommend one of the following two methods for deploying Pachyderm on AWS:
- By manually deploying Kubernetes and Pachyderm.
  - This is appropriate if you (i) already have a Kubernetes deployment, (ii) would like to customize the types of instances, size of volumes, etc. in your cluster, (iii) are setting up a production cluster, or (iv) are processing a lot of data or have computationally expensive workloads.
- By executing a one-shot deploy script that will deploy both Kubernetes and Pachyderm.
  - This option is appropriate if you are just experimenting with Pachyderm. The one-shot script will get you up and running in no time!
In addition, we recommend setting up AWS CloudFront for any production deployment. AWS enforces S3 rate limits that can restrict the data throughput of your cluster, and CloudFront helps mitigate this issue. Follow these instructions to deploy with CloudFront.
Before you begin, make sure you have the following prerequisites:
- AWS CLI, installed and with your AWS credentials configured
- kubectl
- kops
- pachctl
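As a quick check before continuing, the following sketch reports which of the tools listed above are not on your PATH. The function name missing_tools is hypothetical, and the sketch assumes the AWS CLI is installed under its usual binary name, aws.

```shell
# Print the name of every tool passed as an argument that is not on PATH.
# Returns non-zero if at least one tool is missing.
missing_tools() {
  local rc=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      rc=1
    fi
  done
  return "$rc"
}
```

Calling missing_tools aws kubectl kops pachctl prints nothing when all four prerequisites are installed, and otherwise prints one missing tool per line.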
The easiest way to install Kubernetes on AWS (currently) is with kops. You can follow this step-by-step guide from Kubernetes for the deploy. Note that we recommend using r4.xlarge or larger instances in the cluster.
Once you have a Kubernetes cluster up and running in AWS, you should be able to see the following output from kubectl:
$ kubectl get all
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
svc/kubernetes ClusterIP 100.64.0.1 <none> 443/TCP 7m
To deploy Pachyderm on your k8s cluster you will need to:
- Install the pachctl CLI tool,
- Add some storage resources on AWS,
- Deploy Pachyderm on top of the storage resources.
To deploy and interact with Pachyderm, you will need pachctl, Pachyderm's command-line utility. To install pachctl, run one of the following:
# For OSX:
$ brew tap pachyderm/tap && brew install pachyderm/tap/pachctl@1.7
# For Linux (64 bit) or Windows 10+ on WSL:
$ curl -o /tmp/pachctl.deb -L https://github.com/pachyderm/pachyderm/releases/download/v1.7.1/pachctl_1.7.1_amd64.deb && sudo dpkg -i /tmp/pachctl.deb
You can try running pachctl version --client-only to verify that pachctl has been successfully installed:
$ pachctl version --client-only
1.7.0
Pachyderm needs an S3 bucket and a persistent disk (EBS in AWS) to function correctly.
Here are the environment variables you should set up to create and utilize these resources:
# BUCKET_NAME needs to be globally unique across all of AWS, not just within your region
$ BUCKET_NAME=<The name of the S3 bucket where your data will be stored>
# We recommend between 1 and 10 GB. This stores PFS metadata. For reference 1GB
# should work for 1000 commits on 1000 files.
$ STORAGE_SIZE=<the size of the EBS volume that you are going to create, in GBs. e.g. "10">
$ AWS_REGION=<the AWS region of your Kubernetes cluster. e.g. "us-west-2" (not us-west-2a)>
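Two of these values are easy to get wrong: the region must not be an availability zone, and the storage size is a bare number of GBs. A minimal sketch of a sanity check follows. The function names is_region and is_positive_int are hypothetical, and the region test is only a heuristic that relies on region names ending in a digit while availability zone names end in a letter.

```shell
# Heuristic: succeed if the value ends in a digit, like a region name
# (us-west-2), and fail if it ends in a letter, like a zone (us-west-2a).
is_region() {
  case "$1" in
    *[0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Succeed only for a whole number greater than zero, e.g. "10" but not "10GB".
is_positive_int() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
    *) [ "$1" -gt 0 ] ;;
  esac
}
```

For example, is_region "${AWS_REGION}" && is_positive_int "${STORAGE_SIZE}" succeeds for us-west-2 and 10, and fails for us-west-2a or 10GB.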
Then to actually create the backing S3 bucket, you can run one of the following:
# If AWS_REGION is us-east-1.
$ aws s3api create-bucket --bucket ${BUCKET_NAME} --region ${AWS_REGION}
# If AWS_REGION is outside of us-east-1.
$ aws s3api create-bucket --bucket ${BUCKET_NAME} --region ${AWS_REGION} --create-bucket-configuration LocationConstraint=${AWS_REGION}
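The choice between the two commands above can be scripted. The sketch below only builds the argument list and prints it, so you can inspect it before running anything. The function name create_bucket_args is hypothetical. The flags are the ones used in the commands above.

```shell
# Print the aws arguments needed to create a bucket in the given region.
# us-east-1 takes no LocationConstraint; every other region needs one.
create_bucket_args() {
  local bucket="$1" region="$2"
  local args="s3api create-bucket --bucket ${bucket} --region ${region}"
  if [ "$region" != "us-east-1" ]; then
    args="${args} --create-bucket-configuration LocationConstraint=${region}"
  fi
  printf '%s\n' "$args"
}
```

To actually create the bucket, you could run aws $(create_bucket_args "${BUCKET_NAME}" "${AWS_REGION}"). The unquoted substitution relies on word splitting, which is safe here because bucket and region names cannot contain spaces.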
As a sanity check, you should be able to see the bucket that you just created when you run the following:
$ aws s3api list-buckets --query 'Buckets[].Name'
You can deploy Pachyderm on AWS using either an IAM role (recommended) or static credentials.
To deploy with an IAM role, run the following command to deploy your Pachyderm cluster:
$ pachctl deploy amazon ${BUCKET_NAME} ${AWS_REGION} ${STORAGE_SIZE} --dynamic-etcd-nodes=1 --iam-role <your-iam-role>
Note that for this to work, the following need to be true:
- The nodes on which Pachyderm is deployed need to be assigned the IAM role you are using. If you created your cluster with kops, the nodes should have a dedicated IAM role. You can find this IAM role by going to the AWS console, clicking on one of the EC2 instances in the k8s cluster, and inspecting the "Description" of the instance.
- The IAM role needs to have access to the bucket you just created. To ensure that it has access, you can go to the Permissions tab of the IAM role and edit the policy to include the following segment (make sure to replace your-bucket with your actual bucket name):

    {
        "Effect": "Allow",
        "Action": [
            "s3:ListBucket"
        ],
        "Resource": [
            "arn:aws:s3:::your-bucket"
        ]
    },
    {
        "Effect": "Allow",
        "Action": [
            "s3:PutObject",
            "s3:GetObject",
            "s3:DeleteObject"
        ],
        "Resource": [
            "arn:aws:s3:::your-bucket/*"
        ]
    }

- The IAM role needs to have the proper "trust relationships" set up. You can verify this by navigating to the Trust relationships tab of your IAM role, clicking Edit trust relationship, and ensuring that you see a statement with sts:AssumeRole. For instance, this would be a valid trust relationship:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": "ec2.amazonaws.com"
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }
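To avoid hand-editing the bucket name in the policy segment, you could generate it. This is a minimal sketch with a hypothetical function name, bucket_policy_statements. It prints the same two statements shown in the bucket-access requirement above, with the bucket name filled in. The output is a fragment to paste into the Statement list of the role's policy, not a complete policy document.

```shell
# Print the two S3 policy statements for the given bucket name.
bucket_policy_statements() {
  local bucket="$1"
  cat <<EOF
{
    "Effect": "Allow",
    "Action": ["s3:ListBucket"],
    "Resource": ["arn:aws:s3:::${bucket}"]
},
{
    "Effect": "Allow",
    "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
    "Resource": ["arn:aws:s3:::${bucket}/*"]
}
EOF
}
```

For example, bucket_policy_statements "${BUCKET_NAME}" prints the segment for the bucket created earlier.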
Once you've run pachctl deploy ... and waited a few minutes, you should see the following running pods in Kubernetes:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
dash-6c9dc97d9c-89dv9 2/2 Running 0 1m
etcd-0 1/1 Running 0 4m
pachd-65fd68d6d4-8vjq7 1/1 Running 0 4m
Note: If you see a few restarts on the pachd pods, that's totally ok. It simply means that Kubernetes tried to bring up those containers before etcd was ready, so it restarted them.
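If you would rather script this check than read the table by eye, the sketch below reads kubectl get pods output on standard input and succeeds only when every listed pod is Running with all of its containers ready. The function name all_pods_ready is hypothetical, and the sketch assumes the column layout shown above (NAME, READY, STATUS, ...).

```shell
# Read `kubectl get pods` output on stdin. Succeed only if there is at least
# one pod and every pod has STATUS Running and READY of the form n/n.
all_pods_ready() {
  awk 'NR > 1 {
         seen = 1
         split($2, ready, "/")
         if ($3 != "Running" || ready[1] != ready[2]) bad = 1
       }
       END { exit (seen && !bad) ? 0 : 1 }'
}

# Example (needs a running cluster, so it is left commented out):
#   kubectl get pods | all_pods_ready && echo "Pachyderm pods are up"
```

Restart counts are deliberately ignored, since a few pachd restarts during startup are expected.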
If you see the above pods running, the last thing you need to do is forward a couple of ports so that pachctl can talk to the cluster:
# Forward the ports. We background this process because it blocks.
$ pachctl port-forward &
And you're done! You can verify that the cluster is working by executing pachctl version, which should return a version for both pachctl and pachd:
$ pachctl version
COMPONENT VERSION
pachctl 1.7.0
pachd 1.7.0
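The same verification can be scripted. The following sketch reads pachctl version output on standard input and succeeds only if it contains a version for both components. The function name has_both_versions is hypothetical, and the sketch assumes the two-column COMPONENT / VERSION layout shown above.

```shell
# Read `pachctl version` output on stdin. Succeed only if both the pachctl
# (client) row and the pachd (server) row carry a version string.
has_both_versions() {
  awk '$1 == "pachctl" && $2 != "" { client = 1 }
       $1 == "pachd"   && $2 != "" { server = 1 }
       END { exit (client && server) ? 0 : 1 }'
}

# Example (needs a running cluster, so it is left commented out):
#   pachctl version | has_both_versions && echo "cluster is reachable"
```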
When you installed kops, you should have created a dedicated IAM user (see here for details). You could deploy Pachyderm using the credentials of this IAM user directly, although that's not recommended:
$ AWS_ACCESS_KEY_ID=<access key ID>
$ AWS_SECRET_ACCESS_KEY=<secret access key>
Run the following command to deploy your Pachyderm cluster:
$ pachctl deploy amazon ${BUCKET_NAME} ${AWS_REGION} ${STORAGE_SIZE} --dynamic-etcd-nodes=1 --credentials "${AWS_ACCESS_KEY_ID},${AWS_SECRET_ACCESS_KEY},"
Note that the trailing comma (,) at the end of the --credentials flag value in the deploy command leaves room for an optional temporary AWS token. You might use this sort of temporary token if you are just experimenting with a deploy. However, such a token should NOT be used for a production deploy.
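To make the comma-separated format explicit, here is a small sketch that assembles the value passed to --credentials. The function name credentials_flag_value is hypothetical. The third field is the optional temporary token and is left empty when you do not supply one, which produces the trailing comma.

```shell
# Build "<access key ID>,<secret access key>,<optional token>".
# With no third argument the result ends in a comma.
credentials_flag_value() {
  local key_id="$1" secret="$2" token="${3:-}"
  printf '%s,%s,%s\n' "$key_id" "$secret" "$token"
}
```

For example, passing --credentials "$(credentials_flag_value "${AWS_ACCESS_KEY_ID}" "${AWS_SECRET_ACCESS_KEY}")" is equivalent to the flag value used in the deploy command above.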
It may take a few minutes for Pachyderm to start running on the cluster, but you should eventually see the following running pods:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
dash-6c9dc97d9c-89dv9 2/2 Running 0 1m
etcd-0 1/1 Running 0 4m
pachd-65fd68d6d4-8vjq7 1/1 Running 0 4m
If you see output similar to the above, the last thing you need to do is forward a couple of ports so that pachctl can talk to the cluster:
# Forward the ports. We background this process because it blocks.
$ pachctl port-forward &
And you're done! You can verify that the cluster is working by running pachctl version, which should return a version for both pachctl and pachd:
$ pachctl version
COMPONENT VERSION
pachctl 1.7.0
pachd 1.7.0
Once you have the prerequisites mentioned above, download and run our AWS deploy script by running:
$ curl -o aws.sh https://raw.githubusercontent.com/pachyderm/pachyderm/master/etc/deploy/aws.sh
$ chmod +x aws.sh
$ sudo -E ./aws.sh
This script will use kops to deploy Kubernetes and Pachyderm in AWS. The script will ask you for your AWS credentials, region preference, etc. If you would like to customize the number of nodes in the cluster, node types, etc., you can open up the deploy script and modify the respective fields.
The script will take a few minutes, and Pachyderm will take an additional couple of minutes to spin up. Once it is up, kubectl get pods should return something like:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
dash-6c9dc97d9c-89dv9 2/2 Running 0 1m
etcd-0 1/1 Running 0 4m
pachd-65fd68d6d4-8vjq7 1/1 Running 0 4m
You will then need to forward a couple of ports so that pachctl can talk to the cluster:
# Forward the ports. We background this process because it blocks.
$ pachctl port-forward &
And you're done! You can verify that the cluster is working by executing pachctl version, which should return a version for both pachctl and pachd:
$ pachctl version
COMPONENT VERSION
pachctl 1.7.0
pachd 1.7.0
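You can delete your Pachyderm cluster using kops (depending on your kops setup, you may also need to pass the cluster name and the --yes flag to confirm the deletion):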
$ kops delete cluster
In addition, there is an entry in /etc/hosts pointing to the cluster that will need to be manually removed.