Description
Setup
- What version of Dgraph are you using?
  v1.0.10
- Have you tried reproducing the issue with latest release?
  Sure have.
- What is the hardware spec (RAM, OS)?
  Two r4.2xlarge ec2 instances, each with 61GiB RAM, Debian GNU/Linux 8.10
- Steps to reproduce the issue (command/config used to run Dgraph).
  Run `kubectl apply` with this config:
apiVersion: v1
kind: Service
metadata:
  name: dgraph-server-public
  labels:
    app: dgraph-server
spec:
  type: NodePort
  ports:
  - port: 8090
    targetPort: 8090
    nodePort: 30089
    name: server-http
  - port: 9090
    nodePort: 30099
    targetPort: 9090
    name: server-grpc
  selector:
    app: dgraph-server
---
apiVersion: v1
kind: Service
metadata:
  name: dgraph-zero-public
  labels:
    app: dgraph-zero
spec:
  type: NodePort
  ports:
  - port: 6080
    targetPort: 6080
    nodePort: 30068
    name: zero-http
  selector:
    app: dgraph-zero
---
# This is a headless service which is necessary for discovery for a dgraph-server StatefulSet.
apiVersion: v1
kind: Service
metadata:
  name: dgraph-server
  labels:
    app: dgraph-server
spec:
  ports:
  - port: 7090
    targetPort: 7090
    name: server-grpc
  clusterIP: None
  selector:
    app: dgraph-server
---
# This is a headless service which is necessary for discovery for a dgraph-zero StatefulSet.
apiVersion: v1
kind: Service
metadata:
  name: dgraph-zero
  labels:
    app: dgraph-zero
spec:
  ports:
  - port: 5080
    targetPort: 5080
    name: grpc
  clusterIP: None
  selector:
    app: dgraph-zero
---
# This StatefulSet runs 3 replicas of Zero.
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: dgraph-zero
spec:
  serviceName: "dgraph-zero"
  replicas: 3
  template:
    metadata:
      labels:
        app: dgraph-zero
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - dgraph-zero
              topologyKey: kubernetes.io/hostname
      containers:
      - name: zero
        image: dgraph/dgraph:v1.0.10
        imagePullPolicy: Always
        ports:
        - containerPort: 5080
          name: intra-node
        volumeMounts:
        - name: datadir
          mountPath: /dgraph
        command:
          - bash
          - "-c"
          - |
            set -ex
            [[ `hostname` =~ -([0-9]+)$ ]] || exit 1
            ordinal=${BASH_REMATCH[1]}
            idx=$(($ordinal + 1))
            if [[ $ordinal -eq 0 ]]; then
              dgraph zero --my=$(hostname -f):5080 --idx $idx --replicas 1 --telemetry=false
            else
              dgraph zero --my=$(hostname -f):5080 --peer dgraph-zero-0.dgraph-zero.default.svc.cluster.local:5080 --idx $idx --replicas 1 --telemetry=false
            fi
      terminationGracePeriodSeconds: 60
      volumes:
      - name: datadir
        persistentVolumeClaim:
          claimName: datadir
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - metadata:
      name: datadir
    spec:
      accessModes:
      - "ReadWriteOnce"
      resources:
        requests:
          storage: 4Gi
      storageClassName: standard-ssd
---
# This StatefulSet runs 6 replicas of Server.
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: dgraph-server
spec:
  serviceName: "dgraph-server"
  replicas: 6
  template:
    metadata:
      labels:
        app: dgraph-server
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - dgraph-server
              topologyKey: kubernetes.io/hostname
      containers:
      - name: server
        image: dgraph/dgraph:v1.0.10
        resources:
          limits:
            memory: 16Gi
          requests:
            memory: 4Gi
        imagePullPolicy: Always
        volumeMounts:
        - name: datadir
          mountPath: /dgraph
        command:
          - bash
          - "-c"
          - |
            set -ex
            dgraph alpha --my=$(hostname -f):7090 \
              --zero dgraph-zero-0.dgraph-zero.default.svc.cluster.local:5080 \
              --port_offset 10 \
              --query_edge_limit 18446744073709551615 \
              --lru_mb 4000 \
              --expand_edge="false"
      terminationGracePeriodSeconds: 60
      volumes:
      - name: datadir
        persistentVolumeClaim:
          claimName: datadir
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - metadata:
      name: datadir
    spec:
      accessModes:
      - "ReadWriteOnce"
      resources:
        requests:
          storage: 200Gi
      storageClassName: standard-ssd
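A quick sanity check after kubectl apply, using only standard kubectl (the StatefulSet names and app labels are the ones defined in the manifest above):
# Wait for each StatefulSet to report all replicas ready
$ kubectl rollout status statefulset/dgraph-zero
$ kubectl rollout status statefulset/dgraph-server
# Then see how the pods are spread across the nodes (the anti-affinity is only preferred)
$ kubectl get pods -l 'app in (dgraph-zero, dgraph-server)' -o wide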
Expected
A working cluster.
Result
A working cluster, for a while. However, after fairly intensive usage we got connection errors on most alpha and zero pods, causing queries to fail.
On the client side:
$ curl <myDgraphServer>/query -d 'schema {}'
error while fetching schema error: pb.error: No connection exists
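When the queries start failing, it also helps to ask Zero which alphas it still considers cluster members; its HTTP /state endpoint is reachable through the dgraph-zero-public NodePort (30068) from the config. <myZeroNode> is a placeholder for any cluster node's address, same as <myDgraphServer> above:
$ curl <myZeroNode>:30068/state
# Should return the membership JSON: each group's members and the addresses
# Zero has registered for them, which shows whether an alpha has dropped out
# or registered a stale address.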
Alpha logs
One alpha pod was spamming this error:
E1210 03:50:53.282921 1 groups.go:109] Error while connecting with group zero: rpc error: code = Unknown desc = Invalid address
alpha pods 0 and 2-5 were spamming errors like:
E1210 03:44:30.708661 1 pool.go:178] Echo error from dgraph-server-5.dgraph-server.default.svc.cluster.local:7090. Err: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp: lookup dgraph-server-5.dgraph-server.default.svc.cluster.local: no such host"
I1210 03:44:30.708699 1 pool.go:118] CONNECTED to dgraph-server-5.dgraph-server.default.svc.cluster.local:7090
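Since that error is a DNS lookup failure for the headless-service record, one thing worth checking while it is happening is whether the record resolves from inside another pod. A sketch, assuming getent (or some other resolver tool) is available in the dgraph/dgraph image:
$ kubectl exec dgraph-server-0 -- getent hosts dgraph-server-5.dgraph-server.default.svc.cluster.local
# Empty output / non-zero exit means the per-pod DNS record really is
# unresolvable inside the cluster at that moment.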
zero pod 0 appeared to connect correctly, as it repeated these sorts of logs:
is starting a new election at term 5
became pre-candidate at term 5
received MsgPreVoteResp from 1 at term 5
[logterm: 5, index: 14044] sent MsgPreVote request to 2 at term 5
[logterm: 5, index: 14044] sent MsgPreVote request to 3 at term 5
[1] Read index context timed out
Connected: id:11619 group_id:8 addr:"dgraph-server-1.dgraph-server.default.svc.cluster.local:7090"
Got connection request: addr:"dgraph-server-2.dgraph-server.default.svc.cluster.local:7090"
Trying to add 2 to cluster. Addr: dgraph-zero-1.dgraph-zero.default.svc.cluster.local:5080
Current confstate at 1: nodes:1 nodes:2 nodes:3
Zero logs
zero pod 1 repeated:
E1210 00:58:07.972793 11 oracle.go:479] Got error: Assigning IDs is only allowed on leader. while leasing timestamps: val:1
zero pod 2 repeated:
E1210 00:58:07.972793 11 oracle.go:479] Got error: Assigning IDs is only allowed on leader. while leasing timestamps: val:1
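Those errors mean zero pods 1 and 2 do not consider themselves the Raft leader when asked to lease timestamps, so it can help to compare each zero's own view of the cluster and see which one (if any) claims leadership. A sketch using kubectl port-forward (repeat for dgraph-zero-0 and dgraph-zero-2):
$ kubectl port-forward dgraph-zero-1 6080:6080 &
$ curl -s localhost:6080/state
# In the "zeros" section of the output, the member this pod believes is the
# Raft leader should be flagged with "leader": true.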
Debugging attempts
I tried resetting the whole cluster multiple times with:
kubectl delete statefulset dgraph-server dgraph-zero
kubectl delete pvc datadir-dgraph-server-0 datadir-dgraph-server-1 datadir-dgraph-server-2 datadir-dgraph-server-3 datadir-dgraph-server-4 datadir-dgraph-server-5 datadir-dgraph-zero-0 datadir-dgraph-zero-1 datadir-dgraph-zero-2
kubectl apply -f dgraph_nodes.yaml
Each time I ended up with the same sorts of connection issues, although the error types seemed to be distributed randomly across the pods, and sometimes more alpha pods failed.
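One variation that may be worth trying is waiting for the old pods to be fully gone before reapplying, so fresh pods cannot briefly coexist with terminating ones; a sketch using kubectl wait (assumes a kubectl recent enough to support --for=delete, and the app labels from the config above):
kubectl delete statefulset dgraph-server dgraph-zero
kubectl wait --for=delete pod -l app=dgraph-server --timeout=120s
kubectl wait --for=delete pod -l app=dgraph-zero --timeout=120s
kubectl delete pvc datadir-dgraph-server-0 datadir-dgraph-server-1 datadir-dgraph-server-2 datadir-dgraph-server-3 datadir-dgraph-server-4 datadir-dgraph-server-5 datadir-dgraph-zero-0 datadir-dgraph-zero-1 datadir-dgraph-zero-2
kubectl apply -f dgraph_nodes.yaml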