Pachyderm is built on Kubernetes, so it can run on any platform that supports Kubernetes. This guide covers the following commonly used platforms: Google Cloud Platform, Amazon Web Services, Microsoft Azure, and OpenShift.
Google Cloud Platform has excellent support for Kubernetes through Google Container Engine.
- Google Cloud SDK >= 124.0.0
If this is the first time you use the SDK, make sure to follow the quick start guide. This may update your ~/.bash_profile and point your $PATH at the location where you extracted google-cloud-sdk. We recommend extracting this to ~/bin.
If you do not already have kubectl installed, after the SDK is installed, run:
$ gcloud components install kubectl
This will download the kubectl binary to google-cloud-sdk/bin.
To create a new Kubernetes cluster in GKE, just run:
$ CLUSTER_NAME=[any unique name, e.g. pach-cluster]
$ GCP_ZONE=[a GCP availability zone. e.g. us-west1-a]
$ gcloud config set compute/zone ${GCP_ZONE}
$ gcloud config set container/cluster ${CLUSTER_NAME}
# By default this spins up a 3-node cluster. You can change the default with `--num-nodes VAL`
$ gcloud container clusters create ${CLUSTER_NAME} --scopes storage-rw
This may take a few minutes to start up. You can check the status on the GCP Console.
# Update your kubeconfig to point at your newly created cluster
$ gcloud container clusters get-credentials ${CLUSTER_NAME}
Check to see that your cluster is up and running:
$ kubectl get all
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
svc/kubernetes 10.0.0.1 <none> 443/TCP 22s
Pachyderm needs a GCS bucket and a persistent disk to function correctly.
Here are the parameters to create these resources:
# BUCKET_NAME needs to be globally unique across all of Google Cloud Storage, not just within one region
$ BUCKET_NAME=[The name of the GCS bucket where your data will be stored]
# Name this whatever you want, we chose pach-disk as a default
$ STORAGE_NAME=pach-disk
# For a demo you should only need 10 GB. This stores PFS metadata. For reference, 1GB
# should work for 1000 commits on 1000 files.
$ STORAGE_SIZE=[the size of the volume that you are going to create, in GBs. e.g. "10"]
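Because a rejected bucket name only surfaces when the bucket is created, it can help to sanity-check BUCKET_NAME first. The sketch below is an assumption-laden helper, not part of Pachyderm or the Google Cloud SDK: the function name valid_bucket_name is hypothetical, and it checks only a simplified subset of the naming rules (3 to 63 characters; lowercase letters, digits, dots, dashes, and underscores; starting and ending with a letter or digit). The provider's own documentation remains authoritative.

```shell
# Hypothetical helper: succeeds (exit 0) if the name passes a simplified
# subset of bucket naming rules, fails (exit 1) otherwise.
# The body runs in a subshell so the LC_ALL override does not leak out.
valid_bucket_name() (
  LC_ALL=C   # keep the a-z ranges below ASCII-only
  name=$1
  # Length must be between 3 and 63 characters.
  [ "${#name}" -ge 3 ] && [ "${#name}" -le 63 ] || exit 1
  case "$name" in
    *[!a-z0-9._-]*) exit 1 ;;          # contains a disallowed character
    [!a-z0-9]*|*[!a-z0-9]) exit 1 ;;   # must start and end with a letter or digit
  esac
  exit 0
)
```

For example, running valid_bucket_name "${BUCKET_NAME}" || echo "check BUCKET_NAME" before creating the bucket would flag uppercase letters or a leading dash early.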
And then run:
$ gsutil mb gs://${BUCKET_NAME}
$ gcloud compute disks create --size=${STORAGE_SIZE}GB ${STORAGE_NAME}
To check that everything has been set up correctly, try:
$ gcloud compute instances list
# should see a number of instances
$ gsutil ls
# should see a bucket
$ gcloud compute disks list
# should see a number of disks, including the one you specified
pachctl is a command-line utility used for interacting with a Pachyderm cluster.
# For OSX:
$ brew tap pachyderm/tap && brew install pachctl
# For Linux (64 bit):
$ curl -o /tmp/pachctl.deb -L https://pachyderm.io/pachctl.deb && sudo dpkg -i /tmp/pachctl.deb
You can try running pachctl version to check that this worked correctly, but Pachyderm itself isn't deployed yet so you won't get a pachd version.
$ pachctl version
COMPONENT VERSION
pachctl 1.3.2
pachd (version unknown) : error connecting to pachd server at address (0.0.0.0:30650): context deadline exceeded.
Now we're ready to boot up Pachyderm:
$ pachctl deploy google ${BUCKET_NAME} ${STORAGE_NAME} ${STORAGE_SIZE}
It may take a few minutes for the pachd nodes to come up, because Kubernetes first has to pull the container images from DockerHub. You can check the cluster status with:
$ kubectl get all
NAME READY STATUS RESTARTS AGE
po/etcd-xzc0d 1/1 Running 0 55s
po/pachd-6m6wm 1/1 Running 0 55s
po/rethink-388b3 1/1 Running 0 55s
NAME DESIRED CURRENT READY AGE
rc/etcd 1 1 1 55s
rc/pachd 1 1 1 55s
rc/rethink 1 1 1 55s
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
svc/etcd 10.0.0.92 <none> 2379/TCP,2380/TCP 55s
svc/kubernetes 10.0.0.1 <none> 443/TCP 9m
svc/pachd 10.0.0.61 <nodes> 650:30650/TCP,651:30651/TCP 55s
svc/rethink 10.0.0.87 <nodes> 8080:32080/TCP,28015:32081/TCP,29015:32085/TCP 55s
NAME DESIRED SUCCESSFUL AGE
jobs/pachd-init 1 1 55s
Note: If you see a few restarts on the pachd nodes, that's totally ok. That simply means that Kubernetes tried to bring up those containers before Rethink was ready so it restarted them.
Finally, we need to set up port forwarding so that pachctl can talk to the cluster.
# Forward the ports. We background this process because it blocks.
$ pachctl port-forward &
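A backgrounded port-forward keeps running after you are done unless something stops it. As a minimal sketch, the two helpers below start any long-running command in the background and stop it again later. The names start_forward, stop_forward, and FORWARD_PID are hypothetical and introduced only for illustration.

```shell
# Hypothetical helper: start a long-running command in the background
# and remember its process ID in FORWARD_PID.
start_forward() {
  "$@" &
  FORWARD_PID=$!
}

# Hypothetical helper: stop the command started by start_forward,
# if there is one, and wait for it to exit.
stop_forward() {
  [ -n "${FORWARD_PID:-}" ] || return 0
  kill "$FORWARD_PID" 2>/dev/null || true
  wait "$FORWARD_PID" 2>/dev/null || true
  FORWARD_PID=
}
```

In a script this might be used as start_forward pachctl port-forward followed by trap stop_forward EXIT, so the forwarder is cleaned up when the script ends.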
And you're done! You can test to make sure the cluster is working by trying pachctl version or even creating a new repo.
$ pachctl version
COMPONENT VERSION
pachctl 1.3.2
pachd 1.3.2
- Make sure you have the AWS CLI installed and have your AWS credentials configured.
The easiest way to deploy a Kubernetes cluster is to use the official Kubernetes guide. The script defaults to using one m3.medium instance and three t2.micros. These instances can run into network, CPU, and disk space problems, so we suggest using m3.large or larger for all of them. Before running kube-up.sh make sure to set:
export NODE_SIZE=m3.large
export MASTER_SIZE=m3.large
# You can also easily change the number of nodes
export NUM_NODES=2
# The kubernetes guide lists a bunch of other configurations that you can change
NOTE: If you've already got a Kubernetes cluster running, you may see the error An error occurred (InvalidIPAddress.InUse) when calling the RunInstances operation: Address 172.20.0.9 is in use. You can terminate the old cluster with kubernetes/cluster/kube-down.sh and then rerun the script.
NOTE: If you already had kubectl set up from the minikube demo, kubectl will now be talking to your AWS cluster. You can switch back to talking to minikube with:
$ kubectl config use-context minikube
# You can also view your current context
$ kubectl config current-context
aws-kubernetes
Now that we've got Kubernetes up and running, it's time to deploy Pachyderm!
Before we deploy Pachyderm, we need to add some storage resources to our cluster so that Pachyderm has a place to put data.
Pachyderm needs an S3 bucket, and a persistent disk (EBS) to function correctly.
Here are the parameters to set up these resources:
$ kubectl cluster-info
Kubernetes master is running at https://1.2.3.4
...
$ KUBECTLFLAGS="-s [The public IP of the Kubernetes master. e.g. 1.2.3.4]"
# BUCKET_NAME needs to be globally unique across all of S3, not just within one AWS region
$ BUCKET_NAME=[The name of the S3 bucket where your data will be stored]
# We recommend between 1 and 10 GB. This stores PFS metadata. For reference 1GB
# should work for 1000 commits on 1000 files.
$ STORAGE_SIZE=[the size of the EBS volume that you are going to create, in GBs. e.g. "10"]
$ AWS_REGION=[the AWS region of your Kubernetes cluster. e.g. "us-west-2" (not us-west-2a)]
$ AWS_AVAILABILITY_ZONE=[the AWS availability zone of your Kubernetes cluster. e.g. "us-west-2a"]
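The region and the availability zone are easy to mix up, since the zone is normally just the region plus a trailing letter. As a rough sketch, the region can be derived from the zone so the two variables cannot disagree. The function name region_from_zone is hypothetical, and the approach assumes standard zone names such as us-west-2a; zones that do not follow that pattern are rejected rather than guessed at.

```shell
# Hypothetical helper: derive the region by stripping the trailing zone
# letter from an availability zone name, e.g. "us-west-2a" -> "us-west-2".
region_from_zone() {
  zone=$1
  case "$zone" in
    # Expect the name to end in a digit followed by one lowercase letter.
    *[0-9][a-z]) printf '%s\n' "${zone%?}" ;;
    *) echo "unexpected availability zone format: $zone" >&2; return 1 ;;
  esac
}
```

With this in place, AWS_REGION=$(region_from_zone "${AWS_AVAILABILITY_ZONE}") sets the region from the zone instead of typing both by hand.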
And then run:
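# If AWS_REGION is us-east-1, omit --create-bucket-configuration; S3 rejects a LocationConstraint for that region
$ aws s3api create-bucket --bucket ${BUCKET_NAME} --region ${AWS_REGION} --create-bucket-configuration LocationConstraint=${AWS_REGION}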
$ aws ec2 create-volume --size ${STORAGE_SIZE} --region ${AWS_REGION} --availability-zone ${AWS_AVAILABILITY_ZONE} --volume-type gp2
Record the "volume-id" that is output (e.g. "vol-8050b807"). You can also view it in the AWS console or with aws ec2 describe-volumes. Store the volume-id in a variable:
$ STORAGE_NAME=[volume id]
Now you should be able to see the bucket and the EBS volume that were just created:
$ aws s3api list-buckets --query 'Buckets[].Name'
$ aws ec2 describe-volumes --query 'Volumes[].VolumeId'
pachctl is a command-line utility used for interacting with a Pachyderm cluster.
# For OSX:
$ brew tap pachyderm/tap && brew install pachctl
# For Linux (64 bit):
$ curl -o /tmp/pachctl.deb -L https://pachyderm.io/pachctl.deb && sudo dpkg -i /tmp/pachctl.deb
You can try running pachctl version to check that this worked correctly, but Pachyderm itself isn't deployed yet so you won't get a pachd version.
$ pachctl version
COMPONENT VERSION
pachctl 1.3.2
pachd (version unknown) : error connecting to pachd server at address (0.0.0.0:30650): context deadline exceeded.
First get a set of temporary AWS credentials by using this command:
$ aws sts get-session-token
Then set these variables:
$ AWS_ID=[access key ID]
$ AWS_KEY=[secret access key]
$ AWS_TOKEN=[session token]
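Copying three values out of the JSON response by hand is error-prone. One possible shortcut, sketched below, uses the AWS CLI's --query and --output text options to print the three fields on a single line and reads them straight into the variables. The function name load_session_credentials is hypothetical; it assumes the AWS CLI is installed and already configured.

```shell
# Hypothetical helper: request temporary credentials and store them in
# AWS_ID, AWS_KEY and AWS_TOKEN. With --output text the three selected
# fields are printed on one line, separated by whitespace.
load_session_credentials() {
  creds=$(aws sts get-session-token \
    --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
    --output text) || return 1
  read -r AWS_ID AWS_KEY AWS_TOKEN <<EOF
$creds
EOF
}
```

After calling load_session_credentials, the three variables can be passed to the deploy command unchanged.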
Run the following command to deploy your Pachyderm cluster:
$ pachctl deploy amazon ${BUCKET_NAME} ${AWS_ID} ${AWS_KEY} ${AWS_TOKEN} ${AWS_REGION} ${STORAGE_NAME} ${STORAGE_SIZE}
Note: For a permanent deployment, all you have to do is leave the token blank and make sure the user has the right permissions.
It may take a few minutes for the pachd nodes to come up, because Kubernetes first has to pull the container images from DockerHub. You can check the cluster status with:
$ kubectl get all
NAME READY STATUS RESTARTS AGE
po/etcd-xzc0d 1/1 Running 0 55s
po/pachd-6m6wm 1/1 Running 0 55s
po/rethink-388b3 1/1 Running 0 55s
NAME DESIRED CURRENT READY AGE
rc/etcd 1 1 1 55s
rc/pachd 1 1 1 55s
rc/rethink 1 1 1 55s
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
svc/etcd 10.0.0.92 <none> 2379/TCP,2380/TCP 55s
svc/kubernetes 10.0.0.1 <none> 443/TCP 9m
svc/pachd 10.0.0.61 <nodes> 650:30650/TCP,651:30651/TCP 55s
svc/rethink 10.0.0.87 <nodes> 8080:32080/TCP,28015:32081/TCP,29015:32085/TCP 55s
NAME DESIRED SUCCESSFUL AGE
jobs/pachd-init 1 1 55s
Note: If you see a few restarts on the pachd nodes, that's totally ok. That simply means that Kubernetes tried to bring up those containers before Rethink was ready so it restarted them.
Finally, we need to set up port forwarding so that pachctl can talk to the cluster.
# Forward the ports. We background this process because it blocks.
$ pachctl port-forward &
And you're done! You can test to make sure the cluster is working by trying pachctl version or even creating a new repo.
$ pachctl version
COMPONENT VERSION
pachctl 1.3.2
pachd 1.3.2
The easiest way to deploy a Kubernetes cluster is to use the official Kubernetes guide.
Pachyderm requires an object store (Azure Storage) and a data disk to function correctly.
Here are the parameters required to create these resources:
# Needs to be unique within your Azure subscription
$ AZURE_RESOURCE_GROUP=[The name of the resource group where the Azure resources will be organized]
$ AZURE_LOCATION=[The Azure region of your Kubernetes cluster. e.g. "West US 2"]
# Needs to be globally unique across all of Azure, not just within one location
$ AZURE_STORAGE_NAME=[The name of the storage account where your data will be stored]
$ CONTAINER_NAME=[The name of the Azure blob container where your data will be stored]
# Needs to end in a ".vhd" extension
$ STORAGE_NAME=pach-disk.vhd
# We recommend between 1 and 10 GB. This stores PFS metadata. For reference 1GB
# should work for 1000 commits on 1000 files.
$ STORAGE_SIZE=[the size of the data disk volume that you are going to create, in GBs. e.g. "10"]
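Two of these parameters have format constraints that are cheap to check before any resources are created: the disk name must end in ".vhd", and Azure storage account names are limited to 3 to 24 lowercase letters and digits. The helpers below are a minimal sketch with hypothetical names (has_vhd_extension, valid_storage_account_name), not part of Pachyderm or the Azure CLI.

```shell
# Hypothetical helper: succeeds if the disk name ends in ".vhd" and has
# at least one character before the extension.
has_vhd_extension() {
  case "$1" in
    ?*.vhd) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: succeeds if the storage account name is 3 to 24
# characters long and made up only of lowercase letters and digits.
# The body runs in a subshell so the LC_ALL override does not leak out.
valid_storage_account_name() (
  LC_ALL=C   # keep the a-z range below ASCII-only
  name=$1
  [ "${#name}" -ge 3 ] && [ "${#name}" -le 24 ] || exit 1
  case "$name" in
    *[!a-z0-9]*) exit 1 ;;
  esac
  exit 0
)
```

For instance, has_vhd_extension "${STORAGE_NAME}" && valid_storage_account_name "${AZURE_STORAGE_NAME}" could guard the commands that follow.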
And then run:
$ azure group create --name ${AZURE_RESOURCE_GROUP} --location ${AZURE_LOCATION}
$ azure storage account create ${AZURE_STORAGE_NAME} --location ${AZURE_LOCATION} --resource-group ${AZURE_RESOURCE_GROUP} --sku-name LRS --kind Storage
# Retrieve the Azure Storage Account Key
$ AZURE_STORAGE_KEY=`azure storage account keys list ${AZURE_STORAGE_NAME} --resource-group ${AZURE_RESOURCE_GROUP} --json | jq -r '.[0].value'`
# Build the microsoft_vhd container.
$ make docker-build-microsoft-vhd
# Create an empty data disk in the "disks" container
$ STORAGE_VOLUME_URI=`docker run -it microsoft_vhd ${AZURE_STORAGE_NAME} ${AZURE_STORAGE_KEY} "disks" ${STORAGE_NAME} ${STORAGE_SIZE}G`
To check that everything has been set up correctly, try:
$ azure storage account list
# should see a number of storage accounts, including the one specified with ${AZURE_STORAGE_NAME}
$ azure storage blob list --account-name ${AZURE_STORAGE_NAME} --account-key ${AZURE_STORAGE_KEY}
# should see a disk with the name ${STORAGE_NAME}
pachctl is a command-line utility used for interacting with a Pachyderm cluster.
# For OSX:
$ brew tap pachyderm/tap && brew install pachctl
# For Linux (64 bit):
$ curl -o /tmp/pachctl.deb -L https://pachyderm.io/pachctl.deb && sudo dpkg -i /tmp/pachctl.deb
You can try running pachctl version to check that this worked correctly, but Pachyderm itself isn't deployed yet so you won't get a pachd version.
$ pachctl version
COMPONENT VERSION
pachctl 1.3.2
pachd (version unknown) : error connecting to pachd server at address (0.0.0.0:30650): context deadline exceeded.
Now we're ready to boot up Pachyderm:
$ pachctl deploy microsoft ${CONTAINER_NAME} ${AZURE_STORAGE_NAME} ${AZURE_STORAGE_KEY} ${STORAGE_VOLUME_URI} ${STORAGE_SIZE}
It may take a few minutes for the pachd nodes to come up, because Kubernetes first has to pull the container images from DockerHub. You can check the cluster status with:
$ kubectl get all
NAME READY STATUS RESTARTS AGE
po/etcd-xzc0d 1/1 Running 0 55s
po/pachd-6m6wm 1/1 Running 0 55s
po/rethink-388b3 1/1 Running 0 55s
NAME DESIRED CURRENT READY AGE
rc/etcd 1 1 1 55s
rc/pachd 1 1 1 55s
rc/rethink 1 1 1 55s
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
svc/etcd 10.0.0.92 <none> 2379/TCP,2380/TCP 55s
svc/kubernetes 10.0.0.1 <none> 443/TCP 9m
svc/pachd 10.0.0.61 <nodes> 650:30650/TCP,651:30651/TCP 55s
svc/rethink 10.0.0.87 <nodes> 8080:32080/TCP,28015:32081/TCP,29015:32085/TCP 55s
NAME DESIRED SUCCESSFUL AGE
jobs/pachd-init 1 1 55s
Note: If you see a few restarts on the pachd nodes, that's totally ok. That simply means that Kubernetes tried to bring up those containers before Rethink was ready so it restarted them.
Finally, we need to set up port forwarding so that pachctl can talk to the cluster.
# Forward the ports. We background this process because it blocks.
$ pachctl port-forward &
And you're done! You can test to make sure the cluster is working by trying pachctl version or even creating a new repo.
$ pachctl version
COMPONENT VERSION
pachctl 1.3.2
pachd 1.3.2
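Since pachd can take a few minutes to come up, a scripted deployment may want to poll rather than check once. The generic retry loop below is a sketch: the function name retry_until is hypothetical, and it assumes the polled command exits non-zero until the cluster is ready.

```shell
# Hypothetical helper: retry_until MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds, sleeping DELAY_SECONDS between attempts,
# and gives up with a non-zero status after MAX_ATTEMPTS failures.
retry_until() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}
```

Whether pachctl version itself exits non-zero while pachd is unreachable is not confirmed here, so one option is to match on the output shown above instead, for example retry_until 30 10 sh -c 'pachctl version | grep -q "^pachd  *[0-9]"'.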
OpenShift is a popular enterprise Kubernetes distribution. Pachyderm can run on OpenShift with two additional steps:
- Make sure that privileged containers are allowed (they are not allowed by default): run oc edit scc and set allowPrivilegedContainer: true everywhere.
- Remove hostPath everywhere from your cluster manifest (e.g. etc/kube/pachyderm-versioned.json if you are deploying locally).
Problems related to OpenShift deployment are tracked in this issue: #336
Pachyderm automatically reports anonymized usage metrics. These metrics help us understand how people are using Pachyderm and make it better. They can be disabled by setting the environment variable METRICS to false in the pachd container.