Below, we show how to deploy Pachyderm on AWS in a couple of different ways:
- By manually deploying Kubernetes and Pachyderm.
- By executing a one-shot deploy script that deploys both Kubernetes and Pachyderm.
If you already have a Kubernetes deployment or would like to customize the types of instances, size of volumes, etc. in your Kubernetes cluster, you should follow option (1). If you just want a quick deploy to experiment with Pachyderm in AWS or would like to use our default configuration, you might want to try option (2).
Note - for production deployments we recommend setting up AWS CloudFront. AWS puts S3 rate limits in place that can limit the data throughput for your cluster, and CloudFront helps mitigate this issue.
Follow the instructions here to deploy a Pachyderm cluster with CloudFront.
- AWS CLI - have it installed and have your AWS credentials configured.
- kubectl
- kops
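A quick way to confirm that these tools are on your PATH before starting is sketched below. The helper name `missing_tools` is hypothetical, and the check only verifies that each command exists, not that your AWS credentials are configured.

```shell
# Hypothetical helper: print the name of every listed command that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Example: report which of the prerequisites still need to be installed.
missing=$(missing_tools aws kubectl kops)
if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so the names print on one line.
  echo "Missing prerequisites:" $missing
fi
```

If nothing is printed, all three commands were found.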
The easiest way to install Kubernetes on AWS is with kops. Kubernetes provides a step-by-step guide for the deploy. Please follow this guide to deploy Kubernetes on AWS.
Once you have a Kubernetes cluster up and running in AWS, you should see output like the following from kubectl:
$ kubectl get all
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
svc/kubernetes 10.0.0.1 <none> 443/TCP 22s
To deploy Pachyderm we will need to:
- Install the pachctl CLI tool,
- Add some storage resources on AWS,
- Deploy Pachyderm on top of the storage resources.
To deploy and interact with Pachyderm, you will need pachctl, a command-line utility used for Pachyderm. To install pachctl, run one of the following:
# For OSX:
$ brew tap pachyderm/tap && brew install pachyderm/tap/pachctl@1.5
# For Linux (64 bit):
$ curl -o /tmp/pachctl.deb -L https://github.com/pachyderm/pachyderm/releases/download/v1.5.2/pachctl_1.5.2_amd64.deb && sudo dpkg -i /tmp/pachctl.deb
You can try running pachctl version to check that this worked correctly, but Pachyderm itself isn't deployed yet, so you won't get a pachd version. The version numbers you see depend on the release you installed.
$ pachctl version
COMPONENT VERSION
pachctl 1.4.6
pachd (version unknown) : error connecting to pachd server at address (0.0.0.0:30650): context deadline exceeded.
Pachyderm needs an S3 bucket and a persistent disk (EBS) to function correctly.
Here are the environment variables you should set up to create these resources:
$ kubectl cluster-info
Kubernetes master is running at https://1.2.3.4
...
$ KUBECTLFLAGS="-s [The public IP of the Kubernetes master. e.g. 1.2.3.4]"
# BUCKET_NAME needs to be globally unique across all of AWS, not just within your region
$ BUCKET_NAME=[The name of the S3 bucket where your data will be stored]
# We recommend between 1 and 10 GB. This stores PFS metadata. For reference 1GB
# should work for 1000 commits on 1000 files.
$ STORAGE_SIZE=[the size of the EBS volume that you are going to create, in GBs. e.g. "10"]
$ AWS_REGION=[the AWS region of your Kubernetes cluster. e.g. "us-west-2" (not us-west-2a)]
$ AWS_AVAILABILITY_ZONE=[the AWS availability zone of your Kubernetes cluster. e.g. "us-west-2a"]
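The values above are easy to get subtly wrong, so a few sanity checks may help before creating anything. The sketch below uses hypothetical helper names and makes two assumptions: that the availability zone is a standard one named as the region plus a single letter, and that only the basic S3 naming rules (3 to 63 characters; lowercase letters, digits, dots, and hyphens; starting and ending with a letter or digit) need checking.

```shell
# Hypothetical helpers; the names are illustrative and not part of any tool.

# Derive the region from a standard availability zone name by dropping the
# trailing zone letter, e.g. "us-west-2a" -> "us-west-2".
region_from_az() {
  printf '%s\n' "${1%[a-z]}"
}

# Succeed if the name satisfies the basic S3 naming rules described above.
# (S3 has a few further restrictions that are not checked here.)
valid_bucket_name() {
  printf '%s\n' "$1" | grep -Eq '^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$'
}

# Succeed if the size is a positive whole number (of GB) with no unit suffix.
valid_storage_size() {
  case $1 in
    ''|*[!0-9]*|0*) return 1 ;;
    *) return 0 ;;
  esac
}

# Usage:
# valid_bucket_name "$BUCKET_NAME" || echo "bucket name looks invalid"
# [ "$(region_from_az "$AWS_AVAILABILITY_ZONE")" = "$AWS_REGION" ] || echo "zone is not in region"
```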
Then to actually create the resources, you can run:
$ aws s3api create-bucket --bucket ${BUCKET_NAME} --region ${AWS_REGION} --create-bucket-configuration LocationConstraint=${AWS_REGION}
$ aws ec2 create-volume --size ${STORAGE_SIZE} --region ${AWS_REGION} --availability-zone ${AWS_AVAILABILITY_ZONE} --volume-type gp2
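One caveat with the create-bucket command: to our knowledge, the us-east-1 region rejects an explicit LocationConstraint, so the `--create-bucket-configuration` option has to be left out there. The sketch below only prints the appropriate command rather than running it; `create_bucket_cmd` is a hypothetical helper name.

```shell
# Hypothetical helper: print (do not run) the create-bucket command for a region.
# Assumption: us-east-1 must omit the LocationConstraint; other regions need it.
create_bucket_cmd() {
  bucket=$1
  region=$2
  if [ "$region" = "us-east-1" ]; then
    echo "aws s3api create-bucket --bucket $bucket --region $region"
  else
    echo "aws s3api create-bucket --bucket $bucket --region $region --create-bucket-configuration LocationConstraint=$region"
  fi
}

# Usage:
# create_bucket_cmd "$BUCKET_NAME" "$AWS_REGION"
```

Printing the command first lets you review it before pasting it into your terminal.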
Record the volume ID that is output (the "VolumeId" field, e.g. "vol-8050b807") from the above create-volume command as shown below (you can also view it in the AWS console or with aws ec2 describe-volumes):
$ STORAGE_NAME=<volume id>
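If you would rather not copy the volume ID by hand, the AWS CLI's global `--query` and `--output text` options can extract it when the volume is created. The following is a minimal sketch with hypothetical helper names; the usage lines are commented out so that nothing is created by accident.

```shell
# Hypothetical helper: create the volume and print only its ID.
# Arguments: size in GB, region, availability zone.
create_volume_id() {
  aws ec2 create-volume \
    --size "$1" --region "$2" --availability-zone "$3" \
    --volume-type gp2 \
    --query VolumeId --output text
}

# Succeed if the argument looks like an EBS volume ID ("vol-" plus hex digits).
looks_like_volume_id() {
  case $1 in
    vol-*[!0-9a-f]*|vol-) return 1 ;;
    vol-*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
# STORAGE_NAME=$(create_volume_id "$STORAGE_SIZE" "$AWS_REGION" "$AWS_AVAILABILITY_ZONE")
# looks_like_volume_id "$STORAGE_NAME" || echo "unexpected volume ID: $STORAGE_NAME"
```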
Now, as a sanity check, you should be able to see the bucket and the EBS volume that were just created:
$ aws s3api list-buckets --query 'Buckets[].Name'
$ aws ec2 describe-volumes --query 'Volumes[].VolumeId'
When you installed kops, you should have created a dedicated IAM user (see here for details). To deploy Pachyderm, you will need to export these credentials to the following environment variables:
$ AWS_ACCESS_KEY_ID=[access key ID]
$ AWS_SECRET_ACCESS_KEY=[secret access key]
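Because the deploy step depends on every variable set so far, it can be worth confirming that none of them is empty first. The helper below is a hypothetical sketch in plain POSIX shell, not part of pachctl or the AWS CLI.

```shell
# Hypothetical helper: succeed only if every named variable is set and non-empty;
# print the names of any that are missing to stderr.
require_vars() {
  status=0
  for name in "$@"; do
    # Indirect lookup of the variable called $name (empty if unset).
    eval "value=\${$name-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      status=1
    fi
  done
  return $status
}

# Usage:
# require_vars BUCKET_NAME AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY \
#   AWS_REGION STORAGE_SIZE STORAGE_NAME || echo "set the variables listed above first"
```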
Run the following command to deploy your Pachyderm cluster:
$ pachctl deploy amazon ${BUCKET_NAME} ${AWS_ACCESS_KEY_ID} ${AWS_SECRET_ACCESS_KEY} " " ${AWS_REGION} ${STORAGE_SIZE} --static-etcd-volume=${STORAGE_NAME}
(Note: the " " in the deploy command is a placeholder for an optional temporary AWS token, which you might use if you are just experimenting with a deploy. Such a token should NOT be used for a production deploy.) It may take a few minutes for the pachd pods to reach the Running state because Kubernetes is pulling containers from DockerHub. You can see the cluster status by using:
$ kubectl get all
NAME READY STATUS RESTARTS AGE
po/etcd-4197107720-br61m 1/1 Running 0 8m
po/pachd-3548222380-s086m 1/1 Running 2 8m
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
svc/etcd 10.111.11.36 <nodes> 2379:32379/TCP 8m
svc/kubernetes 10.96.0.1 <none> 443/TCP 10m
svc/pachd 10.97.116.5 <nodes> 650:30650/TCP,651:30651/TCP 8m
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
deploy/etcd 1 1 1 1 8m
deploy/pachd 1 1 1 1 8m
NAME DESIRED CURRENT READY AGE
rs/etcd-4197107720 1 1 1 8m
rs/pachd-3548222380 1 1 1 8m
Note: If you see a few restarts on the pachd pods, that's totally ok. It simply means that Kubernetes tried to bring up those containers before etcd was ready, so it restarted them.
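Rather than reading the status table by eye, you could filter it. The sketch below assumes output shaped like the listing above, where pod rows start with `po/` (newer kubectl versions print `pod/`, which is also accepted) and the second and third columns are READY and STATUS. The function name is hypothetical.

```shell
# Hypothetical helper: read `kubectl get all`-style output on stdin and succeed
# only if at least one pod row is present and every pod row is fully ready
# (e.g. 1/1) and in the Running state.
all_pods_running() {
  awk '
    $1 ~ /^pod?\// {
      seen = 1
      split($2, ready, "/")
      if (ready[1] != ready[2] || $3 != "Running") bad = 1
    }
    END { exit ((seen && !bad) ? 0 : 1) }
  '
}

# Usage:
# kubectl get all | all_pods_running && echo "Pachyderm pods are up"
```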
Finally, we need to set up port forwarding so that pachctl can talk to the cluster.
# Forward the ports. We background this process because it blocks.
$ pachctl port-forward &
And you're done! You can test to make sure the cluster is working by trying pachctl version or even creating a new repo.
$ pachctl version
COMPONENT VERSION
pachctl 1.4.6
pachd 1.4.6
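Because the port forward is started in the background, the first pachctl version call may run before the connection is ready. A small retry loop is one way to handle that. This sketch uses a hypothetical `retry` helper and assumes that pachctl version exits with a non-zero status when it cannot reach pachd.

```shell
# Hypothetical helper: run a command up to ATTEMPTS times, sleeping between tries.
# usage: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
  return 0
}

# Usage (assumes a non-zero exit status while pachd is unreachable):
# retry 10 2 pachctl version
```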
This scripted deploy requires a couple of prerequisites in addition to the ones listed under Prerequisites:
Once you have the prerequisites mentioned above, download and run our AWS deploy script by running:
$ curl -o aws.sh https://raw.githubusercontent.com/pachyderm/pachyderm/master/etc/deploy/aws.sh
$ chmod +x aws.sh
$ sudo -E ./aws.sh
This script will use kops to deploy Kubernetes and Pachyderm in AWS. The script will ask you for your AWS credentials, region preference, etc. If you would like to customize the number of nodes in the cluster, node types, etc., you can open up the deploy script and modify the respective fields.
The script will take a few minutes, and Pachyderm will take an additional couple of minutes to spin up. Once it is up, kubectl get all should return something like:
NAME READY STATUS RESTARTS AGE
po/etcd-4197107720-br61m 1/1 Running 0 8m
po/pachd-3548222380-s086m 1/1 Running 2 8m
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
svc/etcd 10.111.11.36 <nodes> 2379:32379/TCP 8m
svc/kubernetes 10.96.0.1 <none> 443/TCP 10m
svc/pachd 10.97.116.5 <nodes> 650:30650/TCP,651:30651/TCP 8m
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
deploy/etcd 1 1 1 1 8m
deploy/pachd 1 1 1 1 8m
NAME DESIRED CURRENT READY AGE
rs/etcd-4197107720 1 1 1 8m
rs/pachd-3548222380 1 1 1 8m
Finally, we need to set up port forwarding so that pachctl can talk to the cluster.
# Forward the ports. We background this process because it blocks.
$ pachctl port-forward &
And you're done! You can test to make sure the cluster is working by trying pachctl version:
$ pachctl version
COMPONENT VERSION
pachctl 1.4.6
pachd 1.4.6
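To check this output in a script, you could compare the two version rows. The sketch below assumes the two-column layout shown above, with one row for pachctl and one for pachd; the function name is hypothetical.

```shell
# Hypothetical helper: read `pachctl version` output on stdin and succeed only
# if both a pachctl and a pachd row are present and report the same version.
versions_match() {
  awk '
    $1 == "pachctl" { client = $2 }
    $1 == "pachd"   { server = $2 }
    END { exit ((client != "" && client == server) ? 0 : 1) }
  '
}

# Usage:
# pachctl version | versions_match && echo "client and server versions agree"
```

When pachd is unreachable, its row carries an error message instead of a version, so the check fails.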