
inception server not working #621

Closed
lqj679ssn opened this issue Apr 9, 2018 · 12 comments

@lqj679ssn (Contributor) commented Apr 9, 2018

I followed the steps of "Serve a model using TensorFlow Serving" in the user guide.

kubectl get svc -n kubeflow inception

NAME        TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)             AGE
inception   ClusterIP   10.97.23.123   <none>        9000/TCP,8000/TCP   13m

kubectl get po -n kubeflow

NAME                                READY     STATUS    RESTARTS   AGE
ambassador-849fb9c8c5-288fp         2/2       Running   0          3d
ambassador-849fb9c8c5-6ltq5         2/2       Running   0          3d
ambassador-849fb9c8c5-hzc4k         2/2       Running   0          3d
inception-7bc4df4546-6b4gj          1/1       Running   0          14m
jupyter-ciscoai                     1/1       Running   0          3d

It seems to be running!

I also ran kubectl describe pod/inception-7bc4df4546-6b4gj -n kubeflow, and it shows:

Events:
  Type    Reason                 Age   From               Message
  ----    ------                 ----  ----               -------
  Normal  Scheduled              55s   default-scheduler  Successfully assigned inception-7bc4df4546-6b4gj to ciscoai
  Normal  SuccessfulMountVolume  55s   kubelet, ciscoai   MountVolume.SetUp succeeded for volume "default-token-qwmpv"
  Normal  Pulled                 55s   kubelet, ciscoai   Container image "gcr.io/kubeflow-images-staging/tf-model-server-cpu:v20180327-995786ec" already present on machine
  Normal  Created                55s   kubelet, ciscoai   Created container
  Normal  Started                54s   kubelet, ciscoai   Started container

Then I used kubectl get pod/inception-7bc4df4546-6b4gj -n kubeflow -o yaml to find that the podIP is 192.168.175.202 (this approach worked for Jupyter), so I think 192.168.175.202:9000 is correct.
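As an aside, there is a shorter way to extract just the pod IP than scanning the full YAML (a sketch using the pod name from above; the jsonpath expression assumes the standard pod status layout):

```shell
# Print only the pod IP instead of the whole pod spec
kubectl get pod inception-7bc4df4546-6b4gj -n kubeflow \
  -o jsonpath='{.status.podIP}'
```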

Then I deployed the inception client and ran python label.py -s 192.168.175.202 -p 9000 images/sleeping-pepper.jpg, but the result is not good:

Traceback (most recent call last):
  File "label.py", line 82, in <module>
    main(args.images, args.server, args.port)
  File "label.py", line 56, in main
    result = stub.Predict(request, 10.0)  # 10 secs timeout
  File "/home/ciscoai/kubeflow/components/k8s-model-server/inception-client/client/local/lib/python2.7/site-packages/grpc/beta/_client_adaptations.py", line 310, in __call__
    self._request_serializer, self._response_deserializer)
  File "/home/ciscoai/kubeflow/components/k8s-model-server/inception-client/client/local/lib/python2.7/site-packages/grpc/beta/_client_adaptations.py", line 196, in _blocking_unary_unary
    raise _abortion_error(rpc_error_call)
grpc.framework.interfaces.face.face.AbortionError: AbortionError(code=StatusCode.UNAVAILABLE, details="Connect Failed")

Can anyone help me?

--

ks version

ksonnet version: 0.9.2
jsonnet version: v0.9.5
client-go version: 1.8

--

kubectl version

Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.0", GitCommit:"fc32d2f3698e36b93322a3465f63a14e9f0eaead", GitTreeState:"clean", BuildDate:"2018-03-26T16:55:54Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.0", GitCommit:"fc32d2f3698e36b93322a3465f63a14e9f0eaead", GitTreeState:"clean", BuildDate:"2018-03-26T16:44:10Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
@jlewi (Contributor) commented Apr 9, 2018

192.168.175.202 is an internal address of the cluster.

What is your networking configuration? In general you won't be able to reach the internal pod IPs of your cluster from your local machine.

Are you running python label.py from within your cluster or from your local machine?
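If the client runs outside the cluster, one common way to sidestep unreachable pod IPs (a sketch, reusing the service name and ports shown earlier in this issue; the label.py invocation is the one the reporter used) is to port-forward the service to localhost:

```shell
# Forward the inception service's gRPC port to the local machine,
# then point the client at 127.0.0.1 instead of the internal pod IP.
kubectl -n kubeflow port-forward svc/inception 9000:9000 &

python label.py -s 127.0.0.1 -p 9000 images/sleeping-pepper.jpg
```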

@lqj679ssn (Contributor, Author) commented Apr 9, 2018

@jlewi thanks for replying!
I set up Kubernetes on an Ubuntu server (call it 'X') using kubeadm, and then set up Kubeflow. I think all the pods are running locally on 'X'; I can see them in docker ps.

Actually, for JupyterHub the podIP is 192.168.175.197 and I can reach it locally on 'X'.

For this one, 192.168.175.202, I can ping it successfully.

The following is my network configuration:

cali59d29607332 Link encap:Ethernet  HWaddr ee:ee:ee:ee:ee:ee  
          inet6 addr: fe80::ecee:eeff:feee:eeee/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1041 errors:0 dropped:0 overruns:0 frame:0
          TX packets:980 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:7910197 (7.9 MB)  TX bytes:149106 (149.1 KB)

cali74b8dbd2216 Link encap:Ethernet  HWaddr ee:ee:ee:ee:ee:ee  
          inet6 addr: fe80::ecee:eeff:feee:eeee/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:352622041 errors:0 dropped:2 overruns:0 frame:0
          TX packets:352067802 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:49325697208 (49.3 GB)  TX bytes:35664971035 (35.6 GB)

cali79f2538cdf8 Link encap:Ethernet  HWaddr ee:ee:ee:ee:ee:ee  
          inet6 addr: fe80::ecee:eeff:feee:eeee/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:10 errors:0 dropped:2 overruns:0 frame:0
          TX packets:10 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:744 (744.0 B)  TX bytes:764 (764.0 B)

cali7bd0b61c758 Link encap:Ethernet  HWaddr ee:ee:ee:ee:ee:ee  
          inet6 addr: fe80::ecee:eeff:feee:eeee/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:82795348 errors:0 dropped:2 overruns:0 frame:0
          TX packets:82816770 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:7032993430 (7.0 GB)  TX bytes:13723628625 (13.7 GB)

cali8152074cdfe Link encap:Ethernet  HWaddr ee:ee:ee:ee:ee:ee  
          inet6 addr: fe80::ecee:eeff:feee:eeee/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:4641 errors:0 dropped:2 overruns:0 frame:0
          TX packets:4626 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:1744663 (1.7 MB)  TX bytes:8503738 (8.5 MB)

cali86945c262dc Link encap:Ethernet  HWaddr ee:ee:ee:ee:ee:ee  
          inet6 addr: fe80::ecee:eeff:feee:eeee/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:645770 errors:0 dropped:2 overruns:0 frame:0
          TX packets:645244 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:107833513 (107.8 MB)  TX bytes:163462441 (163.4 MB)

cali9a64e76b7c6 Link encap:Ethernet  HWaddr ee:ee:ee:ee:ee:ee  
          inet6 addr: fe80::ecee:eeff:feee:eeee/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3519 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2295 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:280466 (280.4 KB)  TX bytes:432291 (432.2 KB)

calic4348e8e710 Link encap:Ethernet  HWaddr ee:ee:ee:ee:ee:ee  
          inet6 addr: fe80::ecee:eeff:feee:eeee/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:82796390 errors:0 dropped:2 overruns:0 frame:0
          TX packets:82816841 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:7032970455 (7.0 GB)  TX bytes:13723368944 (13.7 GB)

calid11e750ccc1 Link encap:Ethernet  HWaddr ee:ee:ee:ee:ee:ee  
          inet6 addr: fe80::ecee:eeff:feee:eeee/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:82795134 errors:0 dropped:2 overruns:0 frame:0
          TX packets:82814179 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:7032937532 (7.0 GB)  TX bytes:13723234296 (13.7 GB)

calie5620efe275 Link encap:Ethernet  HWaddr ee:ee:ee:ee:ee:ee  
          inet6 addr: fe80::ecee:eeff:feee:eeee/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:8 errors:0 dropped:2 overruns:0 frame:0
          TX packets:9 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:648 (648.0 B)  TX bytes:726 (726.0 B)

docker0   Link encap:Ethernet  HWaddr 02:42:a4:a1:a0:e7  
          inet addr:172.17.0.1  Bcast:0.0.0.0  Mask:255.255.0.0
          inet6 addr: fe80::42:a4ff:fea1:a0e7/64 Scope:Link
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:4041 errors:0 dropped:0 overruns:0 frame:0
          TX packets:5269 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:236541 (236.5 KB)  TX bytes:79834894 (79.8 MB)

eno1      Link encap:Ethernet  HWaddr 00:1f:c6:9b:c1:d1  
          inet addr:10.157.159.75  Bcast:10.157.159.255  Mask:255.255.254.0
          inet6 addr: fe80::21f:c6ff:fe9b:c1d1/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:109280895 errors:0 dropped:0 overruns:0 frame:0
          TX packets:104478734 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:23221986701 (23.2 GB)  TX bytes:8656224359 (8.6 GB)
          Interrupt:16 Memory:dc400000-dc420000 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:37635388 errors:0 dropped:0 overruns:0 frame:0
          TX packets:37635388 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1 
          RX bytes:8753065826 (8.7 GB)  TX bytes:8753065826 (8.7 GB)

tunl0     Link encap:IPIP Tunnel  HWaddr   
          inet addr:192.168.175.192  Mask:255.255.255.255
          UP RUNNING NOARP  MTU:1440  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
@jlewi (Contributor) commented Apr 9, 2018

The error Unavailable "Connect Failed" looks to me like a networking issue; i.e. the client can't even connect to the server.

Did you look at the logs of the TF Serving server? Do they indicate any errors?

@lqj679ssn (Contributor, Author) commented Apr 9, 2018

@jlewi yes, it shows errors!

2018-04-10 00:38:12.549847: W external/org_tensorflow/tensorflow/core/platform/cloud/google_auth_provider.cc:160] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "Not found: Could not locate the credentials file.". Retrieving token from GCE failed with "Aborted: All 10 retry attempts failed. The last failure: Unavailable: Error executing an HTTP request (HTTP response code 0, error code 42, error message 'Callback aborted')".
2018-04-10 00:39:13.811566: E external/org_tensorflow/tensorflow/core/platform/cloud/curl_http_request.cc:515] The transmission has been stuck at 0 bytes for 61 seconds and will be aborted.
2018-04-10 00:39:13.811780: I external/org_tensorflow/tensorflow/core/platform/cloud/retrying_utils.cc:77] The operation failed and will be automatically retried in 0.532596 seconds (attempt 1 out of 10), caused by: Unavailable: Error executing an HTTP request (HTTP response code 0, error code 42, error message 'Callback aborted')
2018-04-10 00:40:15.605247: E external/org_tensorflow/tensorflow/core/platform/cloud/curl_http_request.cc:515] The transmission has been stuck at 0 bytes for 61 seconds and will be aborted.
2018-04-10 00:40:15.605370: I external/org_tensorflow/tensorflow/core/platform/cloud/retrying_utils.cc:77] The operation failed and will be automatically retried in 1.0482 seconds (attempt 2 out of 10), caused by: Unavailable: Error executing an HTTP request (HTTP response code 0, error code 42, error message 'Callback aborted')
2018-04-10 00:41:17.915692: E external/org_tensorflow/tensorflow/core/platform/cloud/curl_http_request.cc:515] The transmission has been stuck at 0 bytes for 61 seconds and will be aborted.
2018-04-10 00:41:17.915907: I external/org_tensorflow/tensorflow/core/platform/cloud/retrying_utils.cc:77] The operation failed and will be automatically retried in 2.8029 seconds (attempt 3 out of 10), caused by: Unavailable: Error executing an HTTP request (HTTP response code 0, error code 42, error message 'Callback aborted')
2018-04-10 00:42:21.974283: E external/org_tensorflow/tensorflow/core/platform/cloud/curl_http_request.cc:515] The transmission has been stuck at 0 bytes for 61 seconds and will be aborted.
2018-04-10 00:42:21.974448: I external/org_tensorflow/tensorflow/core/platform/cloud/retrying_utils.cc:77] The operation failed and will be automatically retried in 4.73015 seconds (attempt 4 out of 10), caused by: Unavailable: Error executing an HTTP request (HTTP response code 0, error code 42, error message 'Callback aborted')
2018-04-10 00:43:27.964608: E external/org_tensorflow/tensorflow/core/platform/cloud/curl_http_request.cc:515] The transmission has been stuck at 0 bytes for 61 seconds and will be aborted.
2018-04-10 00:43:27.964829: I external/org_tensorflow/tensorflow/core/platform/cloud/retrying_utils.cc:77] The operation failed and will be automatically retried in 8.56516 seconds (attempt 5 out of 10), caused by: Unavailable: Error executing an HTTP request (HTTP response code 0, error code 42, error message 'Callback aborted')
@jlewi (Contributor) commented Apr 10, 2018

Are you running on GCP?
Are you trying to load the model from GCS?

The log messages indicate it's trying to get a GCP credential and failing. The first error message indicates there is no private key file, and the HTTP errors are most likely from trying to contact the metadata server to get a token.

@lqj679ssn (Contributor, Author) commented Apr 10, 2018

@jlewi yes!

I used the model path from the user guide: gs://kubeflow-models/inception. So could you tell me what I should do?

Thanks a lot!

@lqj679ssn (Contributor, Author) commented Apr 10, 2018

I'm not running on GCP, but I used a gs:// link.

@jlewi (Contributor) commented Apr 10, 2018

I think that model is public, but it looks like the GCS libraries in TF Serving may not be smart enough to detect that.

My suggestion would be to copy that model to a directory that you can serve from.

You can use the gsutil CLI.
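Concretely, the suggestion above could look like this (a sketch; the local path and the exact model server flags are assumptions, not commands from this thread):

```shell
# Copy the public model out of GCS into a local directory...
gsutil cp -r gs://kubeflow-models/inception /tmp/inception

# ...then serve it from the local path instead of the gs:// URL,
# so TF Serving never hits the GCS auth code path at all.
tensorflow_model_server --port=9000 --model_name=inception \
  --model_base_path=/tmp/inception
```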

@tremblerz commented Apr 10, 2018

As @jlewi said, the issue is with the Debian package, which is not able to download the model (although it is publicly available on Google Cloud Storage and should work with Google's default auth). I have verified the issue by downloading and running the Debian package directly on an Ubuntu machine (without Kubeflow), and it threw the same error. One possible fix is to download the model explicitly in the Dockerfile and then run the binary with model_base_path set to the downloaded model directory's absolute path. The other possible solution is to get the Debian package fixed somehow.

@lqj679ssn (Contributor, Author) commented Apr 10, 2018

Thanks @jlewi @tremblerz,
now I've found the problem...

The model is public, but if you are logged out of Google you cannot download it.
So now I'm going to download it to my Mac and transfer the model into the container.

After that, I'll see if inception runs well.

@jlewi (Contributor) commented Apr 11, 2018

@lluunn We should document this in the user guide.
